Image processing techniques have classically operated at the pixel level, i.e., on pixel values or intensities. However, operating on low-level pixel data is not practical for applications such as altering the high-level visual appearance of an image. For such tasks, feature-based approaches are more effective. Feature-based techniques involve first defining a set of specific features (e.g., edges, patches, SURF, SIFT, etc.) and then defining mathematical models on those features that can be used to analyze and manipulate image content.
Machine learning techniques may be employed to learn features and/or mathematical model parameters based on pertinent, application-dependent cost functions. However, such artificial intelligence techniques require an exhaustive training data set that spans the space of all features for a particular application and that is labeled with relevant ground truth data. It has typically been prohibitive to gather exhaustive training data sets for most useful applications. Furthermore, it has been difficult to label images with more complex or sophisticated ground truth data. Thus, until recently, the use of machine learning techniques has been limited to basic object recognition or classification applications.
An image processing architecture that overcomes such limitations and effectively leverages machine learning techniques is thus needed. Such an architecture, as well as novel image-based applications resulting therefrom, is disclosed herein.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
An image processing architecture and image-based applications resulting therefrom are disclosed herein. Generally, images may comprise scenes including a single object, a plurality of objects, or rich (virtual) environments and furthermore may comprise stills or frames of animation or video sequences. Moreover, images may comprise (high quality) photographs or (photorealistic) renderings.
Artificial intelligence techniques, including machine learning techniques, are data driven and therefore require exhaustive training data sets to be effective for most practical applications. Thus, using machine learning techniques for image-based applications relies on access to comprehensive training data sets that have been labeled or tagged with metadata relevant to the desired applications.
In this description, machine learning techniques are generally described and in various embodiments may comprise the use of any combination of one or more machine learning architectures appropriate for a given application, such as deep neural networks and convolutional neural networks. Moreover, the terms “labels”, “tags”, and “metadata” are interchangeably used in this description to refer to properties or attributes associated and persisted with a unit of data or content, such as an image.
A fundamental aspect of the disclosed image processing architecture comprises a content generation platform for generating comprehensive image data sets labeled with relevant ground truth data. Such data sets may be populated, for example, with images rendered from three-dimensional (polygon mesh) models and/or captured from imaging devices such as cameras or scanning devices such as 3D scanners. Image data sets encompassing exhaustive permutations of individual and/or combinations of objects and arrangements, camera perspectives or poses, lighting types and locations, materials and textures, etc., may be generated, for example, during offline processes. Image assets may moreover be obtained from external sources. Metadata may be generated and associated with (e.g., used to label or tag) images at the time the images are generated and/or may be added or modified afterwards; furthermore, metadata may be automatically generated and/or may at least in part be determined or defined manually.
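By way of illustration only, the following is a minimal sketch of such an offline dataset generation loop. The render_scene function is a hypothetical stand-in for whatever renderer is actually used, and the specific objects, camera poses, lights, and materials are assumptions made for the example rather than values taken from the description above.

```python
import itertools
import json

def render_scene(objects, camera, light, material):
    """Hypothetical stand-in for a ray tracer driven by 3D polygon mesh models;
    here it only fabricates a file name so the sketch is self-contained."""
    return f"render_{'_'.join(objects)}_{camera['azimuth']}_{material}.png"

OBJECT_SETS = [["sofa"], ["sofa", "lamp"], ["table", "chair"]]
CAMERA_POSES = [{"azimuth": a, "elevation": 15} for a in (0, 45, 90)]
LIGHTS = [{"type": "point", "position": [0, 2, 1]},
          {"type": "area", "position": [1, 3, 0]}]
MATERIALS = ["leather", "fabric", "wood"]

dataset = []
for objs, cam, light, mat in itertools.product(OBJECT_SETS, CAMERA_POSES, LIGHTS, MATERIALS):
    image_path = render_scene(objs, cam, light, mat)
    # Ground truth is known exactly at render time, so labels are attached when
    # the image is generated rather than annotated after the fact.
    dataset.append({"image": image_path,
                    "labels": {"objects": objs, "camera": cam,
                               "lighting": light, "material": mat}})

with open("dataset_manifest.json", "w") as f:
    json.dump(dataset, f, indent=2)
```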
Images comprising datasets 104 are tagged with comprehensive sets of labels or metadata. A set of labels defined and/or selected for images of a prescribed dataset may be at least in part application dependent. The set of labels may include one or more hierarchical high level labels that provide classification of the object(s) and/or scene comprising an image. The set of labels may furthermore include lower level labels comprising ground truth data associated with rendering an image or parts thereof from underlying three-dimensional model(s) or capturing the image using a physical device such as a camera or a scanner. Examples of such labels include labels associated with the (three-dimensional) geometry of the scene comprising the image such as object types and locations, material properties of objects comprising the scene, surface normal vectors, lighting types and positions (e.g., direct sources as well as indirect sources such as reflective surfaces that contribute substantial higher order bounces), camera characteristics (e.g., perspective or pose, orientation, rotation, depth information, focal length, aperture, zoom level), etc. The labels may include absolute and/or relative location or position, orientation, and depth information amongst various scene objects, light sources, and the (virtual) camera that captures the scene. Other labels that are not scene-based but instead image-based, such as noise statistics (e.g., the number of ray samples used when rendering via ray tracing), may also be associated with an image. Furthermore, more complex or sophisticated labels may be defined for various applications by combining a plurality of other labels. Examples include defining labels for specific object poses relative to light sources, the presence of special materials (e.g., leather) with specific light sources, etc.
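For illustration, the following sketch shows one possible way such a per-image ground truth label record could be organized in code. The field names and groupings are assumptions made for the example and are not a schema prescribed by the description above.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Tuple

@dataclass
class CameraLabel:
    position: Tuple[float, float, float]
    orientation: Tuple[float, float, float]
    focal_length_mm: float
    aperture_f_stop: float
    zoom_level: float

@dataclass
class LightLabel:
    kind: str                       # e.g. "point", "area", "reflective_bounce"
    position: Tuple[float, float, float]
    direct: bool                    # False for indirect/bounce contributions

@dataclass
class ImageLabels:
    scene_class: List[str]                       # hierarchical high level labels
    object_types: List[str]
    object_positions: List[Tuple[float, float, float]]
    materials: List[str]
    camera: CameraLabel
    lights: List[LightLabel]
    ray_samples: int                             # image-based noise statistic
    derived: dict = field(default_factory=dict)  # composite labels built from other labels

labels = ImageLabels(
    scene_class=["interior", "living_room"],
    object_types=["sofa", "lamp"],
    object_positions=[(0.0, 0.0, 0.0), (1.2, 0.0, 0.4)],
    materials=["leather", "brass"],
    camera=CameraLabel((0, 1.5, 3), (0, 0, 0), 35.0, 2.8, 1.0),
    lights=[LightLabel("point", (0, 2.5, 1), True)],
    ray_samples=10,
    derived={"leather_under_bright_light": True},
)
print(asdict(labels)["scene_class"])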
As described, knowledge of the ground truth data associated with generating the images comprising a dataset 104 facilitates associating detailed as well as custom (e.g., application specific) labels with the images comprising the dataset 104, including many types and classes of labels that otherwise could not be manually identified and associated with images, such as lighting type and location. Extensive, labeled datasets 104 are well suited for artificial intelligence based learning. Training 106 on a dataset 104, for example, using any combination of one or more appropriate machine learning techniques such as deep neural networks and convolutional neural networks, results in a set of one or more low level properties or attributes 110 associated with the dataset 104 being learned. Such attributes may be derived or inferred from labels of the dataset 104. Examples of attributes that may be learned include attributes associated with object/scene types and geometries, materials and textures, camera characteristics, lighting characteristics, noise statistics, contrast (e.g., global and/or local image contrast defined by prescribed metrics, which may be based on, for instance, maximum and minimum pixel intensity values), etc. Other, more intangible attributes that are (e.g., some unknown nonlinear) combinations or functions of a plurality of low level attributes may also be learned. Examples of such intangible attributes include attributes associated with style, aesthetic, noise signatures, etc. In various embodiments, different training models may be used to learn different attributes. Furthermore, image processing framework 100 may be trained with respect to a plurality of different training datasets. After training on large sets of data to learn various attributes, image processing framework 100 may subsequently be employed to detect similar attributes or combinations thereof in other images for which such attributes are unknown, as further described below.
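As a hedged illustration of such training, the following is a minimal sketch of multi-attribute learning using a small convolutional network with one prediction head per attribute. The architecture, the chosen attribute heads, and the dummy batch are assumptions made for the example, not details specified above.

```python
import torch
import torch.nn as nn

class AttributeNet(nn.Module):
    """Small convolutional backbone with one head per learned attribute."""
    def __init__(self, n_materials=8, n_light_types=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.material_head = nn.Linear(64, n_materials)   # material class
        self.light_head = nn.Linear(64, n_light_types)    # lighting type
        self.contrast_head = nn.Linear(64, 1)             # scalar contrast metric

    def forward(self, x):
        h = self.backbone(x)
        return self.material_head(h), self.light_head(h), self.contrast_head(h)

model = AttributeNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

# Dummy batch standing in for labeled images drawn from a dataset 104.
images = torch.rand(4, 3, 128, 128)
material_gt = torch.randint(0, 8, (4,))
light_gt = torch.randint(0, 4, (4,))
contrast_gt = torch.rand(4, 1)

mat_pred, light_pred, con_pred = model(images)
loss = ce(mat_pred, material_gt) + ce(light_pred, light_gt) + mse(con_pred, contrast_gt)
opt.zero_grad()
loss.backward()
opt.step()
```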
Many applications include the creation of collections or portfolios of curated images that share one or more attributes or properties that are difficult to quantify or define but that collectively contribute to a distinct signature style or visual appearance. For example, images comprising product catalogs published by high end merchants or retailers typically have a specific branded style or aesthetic, and similarly frames of scenes of animation or video sequences are all often constrained to prescribed visual characteristics. In such applications, captured photographs or generated renderings are often subjected to manual post-processing (e.g., retouching and remastering) by artists to create publishable imagery having visual characteristics that conform to a desired or sanctioned style or aesthetic or theme. Many of the properties that contribute to style or aesthetic or theme are the result of artist manipulation beyond anything that can be achieved from photography/rendering or global post-processing. Thus, an artist imparted aesthetic on an image is an intangible quality that has conventionally been difficult to isolate, model, and replicate. As a result, many existing applications still require artists to manually post-process each image or frame to obtain a desired curated look.
In some cases, machine learning based framework 304 is trained on large labeled image datasets comprising a substantial subset of, if not all, possible permutations of objects from a constrained set of possible objects that may appear in a prescribed scene type, in order to learn associated attributes and combinations thereof. The trained framework may subsequently be employed to detect or identify such attributes in other images, such as a corresponding set of curated or artist processed images 302 that include objects from the same constrained set of possible objects. Examples of attributes 306 that may be detected for the set of images 302 include object/scene types and geometries, materials and textures, camera characteristics, lighting characteristics, noise statistics, contrast, etc. In some embodiments, a high level aesthetic attribute that results in images 302 having a shared look is defined by a (e.g., unknown nonlinear) function or combination of a plurality of identified low level attributes 306. The described machine learning based framework 304 facilitates identifying shared properties of images belonging to a set 302 and identifying and isolating a high level shared style or aesthetic attribute that includes artist imparted characteristics. Attributes 306 identified for the set of images 302 may be employed to automatically label or tag images 302 that do not already have such labels or tags.
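As an illustration of this detection and auto-tagging step, the sketch below assumes a hypothetical detect_attributes inference call standing in for the trained framework 304; the image names, attribute names, and values are fabricated for the example.

```python
from collections import Counter

def detect_attributes(image_path):
    """Hypothetical inference call; returns a dict of detected attributes.
    A fixed stand-in result keeps the example self-contained."""
    return {"material": "leather", "light_type": "soft_area", "contrast": "low"}

curated_images = ["cat_001.png", "cat_002.png", "cat_003.png"]
detections = {p: detect_attributes(p) for p in curated_images}

# Attributes shared across the entire curated set are treated as part of its
# signature look; images lacking labels inherit them as tags.
shared = {}
for key in ("material", "light_type", "contrast"):
    value, count = Counter(d[key] for d in detections.values()).most_common(1)[0]
    if count == len(curated_images):
        shared[key] = value

tags = {p: {**shared, **d} for p, d in detections.items()}
print(shared)
```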
As previously described, many applications require artists to post-process images to impart prescribed styles or aesthetics. Thus, in such applications, curated imagery suitable for publication is often limited to a small number of viable shots that conform to a desired style or aesthetic. It would be useful to automatically generate more extensive sets of curated images, for example, that have prescribed styles or aesthetics but without requiring artists to impart the styles or aesthetics via post-processing.
In some embodiments, three-dimensional (polygon mesh) models exist for most, if not all, individual objects comprising a constrained set of possible objects that may be included in a prescribed scene type. Such models may be employed to automatically render any number of additional curated shots of the scene type without artist input but having the same aesthetic or style as a relatively small set of artist created base imagery from which the aesthetic or style attributes are identified, for example, using framework 300 of FIG. 3.
In one example, machine learning framework 501 identifies attributes that collectively define an aesthetic 504 of a small set of curated catalog images 502 that have been post-processed by artists to have the aesthetic. In some cases, the isolated aesthetic (i.e., corresponding attributes) 504 may be applied to available three-dimensional models to generate a super catalog 506 of additional curated catalog images that all have that aesthetic but without requiring artist post-processing like the original set 502. Super catalog 506 may be used to further train and build machine learning framework 501. Thus, aesthetics or styles or themes may be identified using machine learning framework 501 and then applied to available three-dimensional models to generate additional datasets on which machine learning framework 501 may be further trained.
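The super catalog loop can be sketched as follows, assuming a hypothetical render_with_attributes function that bakes the isolated aesthetic attributes into each shot; neither the function nor the attribute values are defined above.

```python
# Aesthetic attributes (504) isolated from the curated set (502); placeholder values.
aesthetic_504 = {"light_type": "soft_area", "contrast": "low", "tone": "warm"}
available_models = ["sofa_a.obj", "sofa_b.obj", "armchair_c.obj"]

def render_with_attributes(model_path, attributes):
    """Hypothetical renderer that applies the target attributes to the shot."""
    return {"image": model_path.replace(".obj", ".png"), "labels": dict(attributes)}

# Every available 3D model is rendered with the shared aesthetic, yielding the
# super catalog (506) without artist post-processing; the fully labeled results
# can be folded back into the training data for framework (501).
super_catalog_506 = [render_with_attributes(m, aesthetic_504) for m in available_models]
print(len(super_catalog_506), "curated shots generated without artist post-processing")
```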
Machine learning framework 501 generally facilitates a variety of image processing applications 508 for modifying input images 510 or parts thereof to generate corresponding output images 512 having the desired modifications. Both high and low level attributes associated with images are detectable, and high level attributes are separable into constituent lower level attributes. Thus, independent decisions can be made on different attributes or combinations of attributes in various applications. Some example image processing applications 508 include restyling (e.g., changing aesthetic), object replacement, relighting (e.g., changing light source types and/or locations), etc., a few of which are further described next.
Image processing applications 508 may furthermore comprise more complex enterprise applications such as aggregating objects from datasets having different aesthetic attributes into the same scene and styling to have a prescribed aesthetic. For example, home furnishings objects from one or more brands may be included in an image of a room but may be all styled to have the aesthetic of a prescribed brand. In such cases, the resulting image would have the curated look or aesthetic of a catalog image of the prescribed brand.
Generally, image processing applications 508 rely on attribute detection using machine learning framework 501. That is, the actual attributes used to generate input images 510 are detected using machine learning framework 501 and then modified to generate output images 512 having the modified attributes. Thus, image modification based on attribute detection, manipulation, and/or modification as described herein is notably different from conventional image editing applications, which operate at the pixel level on pixel values and have no information about actual image content (e.g., objects) or about the underlying attributes associated with the physics of capturing or rendering that content, such as geometry, camera, lighting, materials, textures, placements, etc., on which the disclosed techniques are based. The disclosed attribute detection and manipulation techniques are especially useful for photorealistic applications because conventional pixel manipulations are not sufficiently constrained to generate images that look real and consistent.
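The following sketch contrasts this attribute-level workflow with pixel-level editing for a simple relighting case. Both detect_attributes and regenerate are hypothetical stand-ins for detection with framework 501 and re-synthesis of output image 512, and all attribute values are fabricated for the example.

```python
def detect_attributes(image_path):
    """Hypothetical attribute detection via the trained framework (501)."""
    return {"objects": ["sofa"], "material": "leather",
            "light": {"type": "point", "position": (0.0, 2.5, 1.0)},
            "camera": {"azimuth": 45, "elevation": 15}}

def regenerate(attributes):
    """Hypothetical re-render/synthesis from a full attribute description."""
    return {"image": "output.png", "labels": attributes}

attrs = detect_attributes("input.png")
attrs["light"]["position"] = (1.5, 2.5, -0.5)   # relighting: move the light source
output_512 = regenerate(attrs)                  # output reflects the modified attribute
```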
Another useful application 508 based on machine learning framework 501 comprises image denoising. One or more learned spatial filters may be applied to various parts of a noisy input image 510 (e.g., that is generated using a low number of samples of rays during ray tracing) to remove noise so that output image 512 has a noise profile or quality that is comparable to that achievable using a large number of samples of rays. That is, various parts of a sparsely sampled image are filtered using a set of filters identified by machine learning framework 501 to generate an output image equivalent to ray tracing with a much larger number of samples (e.g., a number of samples needed for complete convergence). As one example, a ten sample ray traced image can quickly be transformed into the equivalent of a corresponding thousand sample ray traced image by filtering the ten sample ray traced image with appropriate filters identified by machine learning framework 501. Thus, image render time can substantially be reduced by only ray tracing with a small number of samples and then using filters that predict pixel values that would result from larger numbers of samples. This technique effectively eliminates the need to ray trace with large numbers of samples while still producing images having qualities or noise profiles that ray tracing with large numbers of samples provides.
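A minimal sketch of this filter-based denoising follows, using synthetic images and a single placeholder kernel; in the described system the filter weights would come from machine learning framework 501 rather than being hard coded.

```python
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0.2, 0.8, 64), (64, 1))            # stand-in "converged" image
noisy_10_samples = clean + rng.normal(0.0, 0.05, clean.shape)   # stand-in 10-sample render

# Placeholder for a learned spatial filter; a real system would supply weights
# predicted by the trained framework for this image's attributes.
learned_kernel = np.ones((5, 5)) / 25.0
denoised = convolve(noisy_10_samples, learned_kernel, mode="nearest")

err_before = np.abs(noisy_10_samples - clean).mean()
err_after = np.abs(denoised - clean).mean()
print(f"mean error before {err_before:.4f}, after {err_after:.4f}")
```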
In some embodiments, training datasets for such a denoising application comprise ray traced snapshots of images at different sampling intervals, with each snapshot labeled with an attribute specifying the number of samples for that snapshot in addition to being labeled with other image attributes. Machine learning framework 501 trains on such datasets to learn spatial filters or parameters thereof for different numbers of samples. For example, filters may be learned for transforming from a low number (x) of samples to a high number (y) of samples for many different values and combinations of x and y, where x<<y. Noise signatures, however, are not only based on numbers of samples but also on one or more other image attributes that affect noise (e.g., during ray tracing) such as materials and lighting. Thus, different filter parameters may be learned for attribute combinations that result in different noise signatures, and a machine learning framework 501 that identifies filters for an input image may identify different filters or parameters for different portions of the image. For example, a different set of filter parameters may be identified for a portion of an image that has attribute combination “ten samples on leather with bright light” than for a portion of the image that has attribute combination “ten samples on fabric in dim light”.
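To illustrate attribute-dependent filter selection, the sketch below keys placeholder filter parameters on (sample count, material, lighting) combinations and filters each image region with the parameters for its combination. The lookup table, the Gaussian parameterization, and the region split are assumptions made for the example, not values given above.

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_kernel(size, sigma):
    """Build a normalized 2D Gaussian kernel of the given size and sigma."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

# Hypothetical learned parameters keyed by (samples, material, lighting).
FILTER_PARAMS = {
    (10, "leather", "bright"): {"size": 3, "sigma": 0.8},
    (10, "fabric", "dim"):     {"size": 7, "sigma": 2.0},
}

rng = np.random.default_rng(1)
image = rng.random((64, 64))                     # stand-in noisy 10-sample render
regions = {                                      # attribute combination per region
    (slice(0, 64), slice(0, 32)):  (10, "leather", "bright"),
    (slice(0, 64), slice(32, 64)): (10, "fabric", "dim"),
}

output = np.empty_like(image)
for region, attrs in regions.items():
    # Filter each region with the parameters learned for its noise signature.
    p = FILTER_PARAMS[attrs]
    output[region] = convolve(image[region],
                              gaussian_kernel(p["size"], p["sigma"]),
                              mode="nearest")
```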
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application is a continuation of U.S. patent application Ser. No. 16/056,136, now U.S. Pat. No. 10,762,605, entitled MACHINE LEARNING BASED IMAGE PROCESSING TECHNIQUES filed Aug. 6, 2018, which claims priority to U.S. Provisional Patent Application No. 62/541,603 entitled DIRECT AND DERIVED 3D DATA FOR MACHINE LEARNING IN IMAGE BASED APPLICATIONS filed Aug. 4, 2017, both of which are incorporated herein by reference for all purposes.