This application claims the benefit under 35 USC § 119(a) of Korean Patent Application No. 10-2019-0142886, filed on Nov. 8, 2019, in the Korean Intellectual Property Office, the entire disclosure of which is incorporated herein by reference for all purposes.
The following description relates to an image processing technology of generating a depth image.
Use of three-dimensional (3D) information may be important for recognizing an image or understanding a scene. By adding depth information to two-dimensional (2D) spatial information, a spatial distribution of objects may be effectively predicted. Generally, depth information is obtained only when a depth image is acquired using a depth camera, and a quality of a depth image that may be acquired from the depth camera varies depending on a performance of the depth camera. For example, a noise level or a resolution of the acquired depth image may vary depending on the performance of the depth camera. Since an accuracy of depth information has a great influence on a quality of a result based on the depth information, it is important to acquire a depth image with a high quality.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
In one general aspect, a method with depth image generation includes: receiving an input image; generating a first low-resolution image having a resolution lower than a resolution of the input image; acquiring a first depth residual image corresponding to the input image by using a first generation model based on a first neural network; generating a first low-resolution depth image corresponding to the first low-resolution image by using a second generation model based on a second neural network; and generating a target depth image corresponding to the input image, based on the first depth residual image and the first low-resolution depth image.
The generating of the target depth image may include: upsampling the first low-resolution depth image to a resolution of the input image; and generating the target depth image by combining depth information of the upsampled first low-resolution depth image and depth information of the first depth residual image.
The generating of the first low-resolution depth image may include: acquiring a second depth residual image corresponding to the first low-resolution image using the second generation model; generating a second low-resolution image having a resolution lower than the resolution of the first low-resolution image; acquiring a second low-resolution depth image corresponding to the second low-resolution image using a third generation model based on a third neural network; and generating the first low-resolution depth image based on the second depth residual image and the second low-resolution depth image.
The generating of the second low-resolution image may include downsampling the first low-resolution image to generate the second low-resolution image.
The generating of the first low-resolution depth image may include: upsampling the second low-resolution depth image to a resolution of the second depth residual image; and generating the first low-resolution depth image by combining depth information of the upsampled second low-resolution depth image and depth information of the second depth residual image.
A resolution of the second low-resolution depth image may be lower than a resolution of the first low-resolution depth image.
The second depth residual image may include depth information of a high-frequency component in comparison to the second low-resolution depth image.
The first low-resolution depth image may include depth information of a low-frequency component in comparison to the first depth residual image.
The generating of the first low-resolution image may include downsampling the input image to generate the first low-resolution image.
The input image may include a color image or an infrared image.
The input image may include a color image and an input depth image. In the acquiring of the first depth residual image, the first generation model may use a pixel value of the color image and a pixel value of the input depth image as inputs, and output a pixel value of the first depth residual image.
The input image may include an infrared image and an input depth image. In the acquiring of the first depth residual image, the first generation model may use a pixel value of the infrared image and a pixel value of the input depth image as inputs, and output a pixel value of the first depth residual image.
In another general aspect, a non-transitory computer-readable storage medium stores instructions that, when executed by a processor, cause the processor to perform the method described above.
In another general aspect, a method with depth image generation includes: receiving an input image; acquiring a first depth residual image and a first low-resolution depth image by using a generation model that is based on a neural network that uses the input image as an input; and generating a target depth image corresponding to the input image, based on the first depth residual image and the first low-resolution depth image.
The acquiring of the first depth residual image and the first low-resolution depth image may include: acquiring a second depth residual image and a second low-resolution depth image using the generation model; and generating the first low-resolution depth image based on the second depth residual image and the second low-resolution depth image.
The generation model may use the input image as an input and output the first depth residual image, the second depth residual image, and the second low-resolution depth image.
The generation model may include a single neural network model.
In another general aspect, a method with depth image generation includes: receiving an input image; acquiring intermediate depth images having a same size using a generation model that is based on a neural network that uses the input image as an input; and generating a target depth image by combining the acquired intermediate depth images, wherein the intermediate depth images include depth information of different degrees of precision.
In another general aspect, an apparatus with depth image generation includes a processor configured to: receive an input image; generate a first low-resolution image having a resolution lower than a resolution of the input image; acquire a first depth residual image corresponding to the input image, by using a first generation model based on a first neural network; generate a first low-resolution depth image corresponding to the first low-resolution image, by using a second generation model based on a second neural network; and generate a target depth image corresponding to the input image, based on the first depth residual image and the first low-resolution depth image.
The processor may be further configured to: upsample the first low-resolution depth image to a resolution of the input image; and generate the target depth image by combining depth information of the upsampled first low-resolution depth image and depth information of the first depth residual image.
The combining of the depth information of the upsampled first low-resolution depth image and the depth information of the first depth residual image may include calculating a weighted sum or a summation of depth values of pixel positions corresponding to each other in the first depth residual image and the upsampled first low-resolution depth image.
The processor may be further configured to: acquire a second depth residual image corresponding to the first low-resolution image using the second generation model; generate a second low-resolution image having a resolution lower than a resolution of the first low-resolution image; acquire a second low-resolution depth image corresponding to the second low-resolution image using a third generation model based on a third neural network; and generate the first low-resolution depth image based on the second depth residual image and the second low-resolution depth image.
The processor may be further configured to: upsample the second low-resolution depth image to a resolution of the second depth residual image; and generate the first low-resolution depth image by combining depth information of the upsampled second low-resolution depth image and depth information of the second depth residual image.
The combining of the depth information of the upsampled second low-resolution depth image and the depth information of the second depth residual image may include calculating a weighted sum or a summation of depth values of pixel positions corresponding to each other in the second depth residual image and the upsampled second low-resolution depth image.
A resolution of the first low-resolution depth image may be higher than a resolution of the second low-resolution depth image. The second depth residual image may include depth information of a high-frequency component in comparison to the second low-resolution depth image.
The processor may be further configured to downsample the input image to generate the first low-resolution image.
The input image may include a color image and an input depth image. In the acquiring of the first depth residual image, the first generation model may use a pixel value of the color image and a pixel value of the input depth image as inputs, and output a pixel value of the first depth residual image.
The input image may include an infrared image and an input depth image. In the acquiring of the first depth residual image, the first generation model may use a pixel value of the infrared image and a pixel value of the input depth image as inputs, and output a pixel value of the first depth residual image.
The apparatus may further include: a sensor configured to acquire the input image, wherein the input image includes either one or both of a color image and an infrared image.
In another general aspect, an apparatus with depth image generation includes a processor configured to: receive an input image; acquire a first depth residual image and a first low-resolution depth image by using a generation model that is based on a neural network that uses the input image as an input; and generate a target depth image corresponding to the input image, based on the first depth residual image and the first low-resolution depth image.
The processor may be further configured to: acquire a second depth residual image and a second low-resolution depth image using the generation model; and generate the first low-resolution depth image based on the second depth residual image and the second low-resolution depth image.
The first low-resolution depth image may have a resolution lower than a resolution of the input image. The second low-resolution depth image may have a resolution lower than the resolution of the first low-resolution depth image.
In another general aspect, an apparatus with depth image generation includes a processor configured to: receive an input image; acquire intermediate depth images having a same size by using a generation model that is based on a neural network that uses the input image as an input; and generate a target depth image by combining the acquired intermediate depth images, wherein the intermediate depth images include depth information of different degrees of precision.
The combining of the acquired intermediate depth images may include calculating a weighted sum or a summation of depth values of pixel positions corresponding to each other in the acquired intermediate depth images.
Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.
Throughout the drawings and the detailed description, the same reference numerals refer to the same elements, features, and structures. The drawings may not be to scale, and the relative size, proportions, and depiction of elements in the drawings may be exaggerated for clarity, illustration, and convenience.
The following detailed description is provided to assist the reader in gaining a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatuses, and/or systems described herein will be apparent after an understanding of the disclosure of this application. For example, the sequences of operations described herein are merely examples, and are not limited to those set forth herein, but may be changed as will be apparent after an understanding of the disclosure of this application, with the exception of operations necessarily occurring in a certain order. Also, descriptions of features that are known in the art may be omitted for increased clarity and conciseness.
The features described herein may be embodied in different forms, and are not to be construed as being limited to the examples described herein. Rather, the examples described herein have been provided merely to illustrate some of the many possible ways of implementing the methods, apparatuses, and/or systems described herein that will be apparent after an understanding of the disclosure of this application.
Herein, it is noted that use of the term “may” with respect to an example or embodiment, e.g., as to what an example or embodiment may include or implement, means that at least one example or embodiment exists in which such a feature is included or implemented while all examples and embodiments are not limited thereto.
Throughout the specification, when an element, such as a layer, region, or substrate, is described as being “on,” “connected to,” or “coupled to” another element, it may be directly “on,” “connected to,” or “coupled to” the other element, or there may be one or more other elements intervening therebetween. In contrast, when an element is described as being “directly on,” “directly connected to,” or “directly coupled to” another element, there can be no other elements intervening therebetween. As used herein, the term “and/or” includes any one and any combination of any two or more of the associated listed items.
Although terms such as “first,” “second,” and “third” may be used herein to describe various members, components, regions, layers, or sections, these members, components, regions, layers, or sections are not to be limited by these terms. Rather, these terms are only used to distinguish one member, component, region, layer, or section from another member, component, region, layer, or section. Thus, a first member, component, region, layer, or section referred to in examples described herein may also be referred to as a second member, component, region, layer, or section without departing from the teachings of the examples.
The terminology used herein is for describing various examples only and is not to be used to limit the disclosure. The articles “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms “comprises,” “includes,” and “has” specify the presence of stated features, numbers, operations, members, elements, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, numbers, operations, members, elements, and/or combinations thereof.
The features of the examples described herein may be combined in various ways as will be apparent after an understanding of the disclosure of this application. Further, although the examples described herein have a variety of configurations, other configurations are possible as will be apparent after an understanding of the disclosure of this application.
Referring to
In an example, the depth image generation apparatus 100 may generate a depth image based on a color image sensed by an image sensor 110 or an infrared image sensed by an infrared sensor 120. In another example, the depth image generation apparatus 100 may generate a depth image having a resolution higher than a resolution of a depth image sensed by a depth sensor 130, based on a color image sensed by the image sensor 110 and the depth image sensed by the depth sensor 130. In another example, the depth image generation apparatus 100 may generate a depth image having a resolution higher than a resolution of a depth image sensed by the depth sensor 130, based on an infrared image sensed by the infrared sensor 120 and the depth image sensed by the depth sensor 130. In the foregoing examples, the color image, the infrared image and the depth image may be images that represent the same scene and that correspond to each other.
The image sensor 110 is, for example, a sensor configured to acquire a color image representing color information of an object, and includes, for example, a complementary metal-oxide-semiconductor (CMOS) image sensor, a charge-coupled device (CCD) image sensor, or a stereo camera. The infrared sensor 120 is a sensor configured to sense infrared light emitted from an object or infrared light reflected by the object and to generate an infrared image. The depth sensor 130 is a device configured to acquire a depth image representing depth information of an object, and may include, for example, a Kinect, a time-of-flight (TOF) depth camera, or a three-dimensional (3D) scanner. In an example in which the image sensor 110 is a stereo camera, a stereo image including a left image and a right image may be acquired from the stereo camera, and a depth image may be derived from the stereo image using a known stereo matching scheme.
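For illustration only, deriving a depth image from such a stereo pair may be sketched as follows using a standard block-matching scheme; the library calls, file names, focal length, and baseline values are assumptions for the example and are not specified by this description.

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)     # hypothetical left image of the stereo pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)   # hypothetical right image of the stereo pair

# Block matching produces a disparity map in sixteenths of a pixel.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

focal_length_px = 700.0   # placeholder focal length in pixels
baseline_m = 0.10         # placeholder stereo baseline in meters

# depth = focal_length * baseline / disparity, for pixels with a valid disparity
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = focal_length_px * baseline_m / disparity[valid]
```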
The depth image is an image representing depth information, which is information about a depth or a distance from a capturing position to an object. The depth image may be used for object recognition, such as 3D face recognition, or to apply a photographic effect, such as an out-of-focus effect. For example, a depth image may be used to understand a scene including an object. The depth image may be used to determine a geometric relationship between objects or to provide 3D geometric information, which helps to enhance the performance of visual object recognition.
When a physical sensor, for example, the depth sensor 130, is used to acquire a depth image, costs may increase, a depth measurement distance may be limited, a measurement error may occur, and the measurement may be vulnerable to external light. To address these limitations, the depth image generation apparatus 100 may generate a depth image from a color image or an infrared image using a deep learning-based generation model, such that a depth image is acquired even without using the depth sensor 130. For example, based on the depth image generated by the depth image generation apparatus 100, a distribution of objects in a 3D space may be predicted using a single color image or a single infrared image, an accuracy of object recognition may be increased, and a scene may be understood robustly even when an occlusion is present.
To increase the utility of a depth image, it is important to use a depth image with a high resolution or a high quality. To obtain a desirable result based on a depth image, it is important to acquire a depth image that accurately represents a depth feature, for example, a depth feature of an edge of an object. The depth image generation apparatus 100 may generate a depth image with a high resolution and a high quality using a multi-scale-based depth image generation method that will be described below. That is, the depth image generation apparatus 100 may estimate depth information more precisely and accurately using a multi-scale-based depth image generation method that distinguishes global information from local information in a depth image and estimates each of them.
Also, the depth image generation apparatus 100 may generate a depth image with a high quality by processing a depth image acquired by, for example, the depth sensor 130. An operation by which the depth image generation apparatus 100 generates a depth image with a higher quality by processing a depth image provided as an input may correspond to a calibration of a depth image. For example, the depth image generation apparatus 100 generates a depth image with depth information that is finer than depth information of a depth image provided as an input, based on information included in a color image or an infrared image. In this example, the depth image generation apparatus 100 may generate a depth image with a high quality using the multi-scale-based depth image generation method.
Hereinafter, a method of generating a depth image by the depth image generation apparatus 100 will be further described with reference to the drawings.
Referring to
In operation 220, the depth image generation apparatus acquires a first depth residual image corresponding to the input image using a first generation model that is based on a first neural network. A pixel value of the input image is input to the first generation model, and the first depth residual image corresponding to a scale of the input image is output from the first generation model. The first generation model is a model trained, through a training process, to output a depth residual image based on input information. The first depth residual image includes, for example, depth information of a high-frequency component, and is an image that may relatively accurately represent an edge component of an object. In the disclosure herein, the terms “scale” and “resolution” may be used interchangeably.
In operation 230, the depth image generation apparatus generates a first low-resolution image having a resolution lower than a resolution of the input image. In an example, the depth image generation apparatus may downsample the input image to generate the first low-resolution image. For example, the depth image generation apparatus may generate a first low-resolution image corresponding to a scale that is half of the scale of the input image. In an example in which the input image is a color image, the first low-resolution image may be a color image with a reduced resolution. In an example in which the input image is an infrared image, the first low-resolution image may be an infrared image with a reduced resolution.
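As a non-limiting illustration, such half-scale downsampling may be performed with a Gaussian-pyramid step, as in the following sketch; the OpenCV call and the file name are assumptions for the example.

```python
import cv2

input_image = cv2.imread("input_color.png")   # hypothetical input color image, H x W x 3

# Gaussian smoothing followed by decimation; each call roughly halves the scale.
first_low_resolution_image = cv2.pyrDown(input_image)                    # ~H/2 x W/2
second_low_resolution_image = cv2.pyrDown(first_low_resolution_image)    # ~H/4 x W/4, used in the three-scale example below
```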
In operation 240, the depth image generation apparatus generates a first low-resolution depth image corresponding to the first low-resolution image using a second generation model that is based on a second neural network. The first low-resolution depth image includes, for example, depth information of a low-frequency component, in comparison to the first depth residual image generated in operation 220. The second generation model is also a model trained through the training process.
In an example, the depth image generation apparatus estimates depth information of a high-frequency component and depth information of a low-frequency component, and combines the estimated depth information of the high-frequency component and the estimated depth information of the low-frequency component to generate a depth image. In this example, a pixel value of the first low-resolution image is input to the second generation model, and a first low-resolution depth image corresponding to a scale or a resolution of the first low-resolution image is output from the second generation model. The second generation model is a model trained to output the first low-resolution depth image based on input information. The first depth residual image includes depth information of a high-frequency component, and the first low-resolution depth image includes depth information of a low-frequency component.
In another example, the depth image generation apparatus estimates depth information of a high-frequency component, depth information of an intermediate-frequency component and depth information of a low-frequency component, and combines the estimated depth information of the high-frequency component, the estimated depth information of the intermediate-frequency component and the estimated depth information of the low-frequency component, to generate a depth image. In this example, a third generation model that is based on a third neural network may be used together with the second generation model. This example will be further described with reference to
Referring to
In operation 320, the depth image generation apparatus generates a second low-resolution image having a resolution lower than the resolution of the first low-resolution image. For example, the depth image generation apparatus may downsample the first low-resolution image to generate the second low-resolution image, such as a second low-resolution image corresponding to a scale that is half of a scale of the first low-resolution image.
In operation 330, the depth image generation apparatus acquires a second low-resolution depth image corresponding to the second low-resolution image using the third generation model. The second low-resolution depth image has a resolution lower than the resolution of the first low-resolution depth image, and includes, for example, depth information of a low-frequency component. The third generation model is a model trained to output the second low-resolution depth image based on input information.
In operation 340, the depth image generation apparatus generates the first low-resolution depth image based on the second depth residual image and the second low-resolution depth image. The second depth residual image may include depth information of a high-frequency component, in comparison to the second low-resolution depth image. In an example, the depth image generation apparatus upsamples the second low-resolution depth image to a resolution of the second depth residual image, and combines depth information of the upsampled second low-resolution depth image and depth information of the second depth residual image to generate the first low-resolution depth image.
Referring back to
As described above, the depth image generation apparatus may generate a depth image based on a structure in which depth information is increasingly refined by being subdivided by scale. The depth image generation apparatus may configure input images with various scales, may input an input image with each scale to each generation model, and may acquire images including depth information with different frequency components from each generation model. The depth image generation apparatus may generate a final target depth image by combining the acquired images including depth information with different frequency components. Thus, the depth image generation apparatus may derive a depth image with a high quality from a color image or an infrared image, instead of using a separate depth sensor or an initial depth image.
In an example, the depth image generation apparatus may generate a depth image with a higher quality by calibrating an input depth image provided as an input. In this example, the depth image generation method of
In an example in which the input image includes a color image and an input depth image, the depth image generation apparatus may acquire the first depth residual image using the first generation model that uses a pixel value of the color image and a pixel value of the input depth image as inputs, and that outputs a pixel value of the first depth residual image in operation 220. In operation 230, the depth image generation apparatus may generate a first low-resolution input depth image having a resolution lower than a resolution of the input depth image, together with the first low-resolution image having a resolution lower than a resolution of the color image. In operation 240, the depth image generation apparatus may acquire the first low-resolution depth image using the second generation model that uses a pixel value of the first low-resolution image and a pixel value of the first low-resolution input depth image as inputs, and that outputs a pixel value of the first low-resolution depth image.
In another example, the depth image generation apparatus may also acquire the first low-resolution depth image using a process similar to the example of
Similarly to the above description, in operation 250, the depth image generation apparatus generates the target depth image corresponding to the input image based on the first depth residual image and the first low-resolution depth image. Unlike the above example, when the input image includes an infrared image and an input depth image, the depth image generation apparatus may generate the target depth image by replacing the color image with the infrared image in the above-described process. The depth image generation apparatus may generate a depth image with a higher quality based on a multi-scale depth image generation structure as described above, even when the depth information is not fine or a depth image with a low quality (e.g., with a large amount of noise) is provided as the input depth image.
As described above, a depth image generation apparatus may generate a depth image based on an input image through a multi-scale-based depth estimation structure. Even when depth information is not provided as an input, the depth image generation apparatus may estimate depth information from a color image or an infrared image using the multi-scale-based depth estimation structure. The multi-scale-based depth estimation structure is a structure of decomposing an input image into frequency components and estimating and processing depth information corresponding to each of the frequency components. For example, the multi-scale-based depth estimation structure of
Referring to
The depth image generation apparatus acquires a first depth residual image 430 corresponding to the input image 410 using a first generation model 420 that is based on a first neural network. A pixel value of the input image 410 is input to the first generation model 420, and the first generation model 420 outputs a pixel value of the first depth residual image 430. The first depth residual image 430 has a resolution or scale corresponding to a resolution or a scale of the input image 410, and includes depth information of a high-frequency component including an edge detail component of an object.
The depth image generation apparatus generates a first low-resolution image 440 by downscaling the input image 410. For example, the depth image generation apparatus may downsample the input image 410, may perform a blurring process, for example, Gaussian smoothing, and may generate the first low-resolution image 440. The first low-resolution image 440 includes color information of a low-frequency component in comparison to the input image 410.
The depth image generation apparatus generates a first low-resolution depth image 460 corresponding to the first low-resolution image 440 using a second generation model 450 that is based on a second neural network. A pixel value of the first low-resolution image 440 is input to the second generation model 450, and the second generation model 450 outputs a pixel value of the first low-resolution depth image 460. The first low-resolution depth image 460 has a resolution or scale corresponding to a resolution or a scale of the first low-resolution image 440, and includes depth information of a low-frequency component in comparison to the first depth residual image 430.
The first generation model 420 and the second generation model 450 are models trained to output the first depth residual image 430 and the first low-resolution depth image 460, respectively, based on input information. An image-to-image translation scheme using generative adversarial networks (GANs), for example, Pix2Pix, CycleGAN, or DiscoGAN, may be used to implement the first generation model 420 and the second generation model 450.
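The following is a minimal sketch of one possible image-to-image generator, assuming a small convolutional encoder-decoder in PyTorch; the layer configuration, channel counts, and whether a GAN discriminator is used during training are assumptions for illustration and are not fixed by this description.

```python
import torch
import torch.nn as nn

class SimpleGenerator(nn.Module):
    """Illustrative fully convolutional generator mapping an image to a one-channel map."""
    def __init__(self, in_channels=3, out_channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),     # encoder: halve the scale
            nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),    # decoder: restore the scale
            nn.Conv2d(64, 32, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, 3, stride=1, padding=1),
        )

    def forward(self, x):
        return self.net(x)

# e.g., a generator mapping a color image to a depth residual image of the same scale
first_generation_model = SimpleGenerator(in_channels=3, out_channels=1)
residual = first_generation_model(torch.randn(1, 3, 128, 128))   # -> (1, 1, 128, 128)
```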
The depth image generation apparatus upscales the first low-resolution depth image 460 and generates an upscaled first low-resolution depth image 470. For example, the depth image generation apparatus upsamples the first low-resolution depth image 460 to generate the first low-resolution depth image 470 having a scale corresponding to a scale of the first depth residual image 430. The depth image generation apparatus combines the first depth residual image 430 and the upscaled first low-resolution depth image 470 in operation 480 to generate a target depth image 490 corresponding to the input image 410. Operation 480 corresponds to, for example, a weighted sum or summation of depth values of pixel positions corresponding to each other in the first depth residual image 430 and the upscaled first low-resolution depth image 470. In an example, the first depth residual image 430 includes depth information of a residual component obtained by removing depth information of the upscaled first low-resolution depth image 470 from the target depth image 490.
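For illustration, operation 480 may be sketched as follows, assuming the two generator outputs are available as NumPy arrays; the array names, sizes, and the equal weights of the weighted sum are placeholders for the example.

```python
import cv2
import numpy as np

def combine(depth_residual, low_resolution_depth, w_residual=1.0, w_low=1.0):
    h, w = depth_residual.shape[:2]
    # Upsample the low-resolution depth image to the resolution of the residual image.
    upscaled = cv2.resize(low_resolution_depth, (w, h), interpolation=cv2.INTER_LINEAR)
    # Weighted sum of depth values at corresponding pixel positions.
    return w_residual * depth_residual + w_low * upscaled

first_depth_residual = np.random.rand(128, 128).astype(np.float32)   # placeholder for image 430
first_low_res_depth = np.random.rand(64, 64).astype(np.float32)      # placeholder for image 460
target_depth = combine(first_depth_residual, first_low_res_depth)    # corresponds to image 490
```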
As described above, when generating depth information by combining global information and local information, the depth image generation apparatus may guide depth information of different frequency components to be estimated in each of a plurality of steps of a multi-scale structure, to generate a depth image with a higher resolution. The depth image generation apparatus may guide a residual component corresponding to depth information that is not estimated in one step to be processed in another step, and accordingly depth information of a corresponding frequency component may be separated and independently estimated in each of the steps. The depth image generation apparatus may generate a sophisticated depth image from a color image or an infrared image even though a separate depth sensor is not used, and may also generate a plurality of depth images from a single input image.
In an example, the depth image generation apparatus generates a depth image with a quality higher than a quality of an input depth image by calibrating depth information of the input depth image using the multi-scale-based depth estimation structure of
Referring to
When a training image 510 is provided, the training apparatus generates a first depth residual image 520 corresponding to the training image 510 using the first generation model 515, which is based on a first neural network. The training image 510 may include, for example, a color image, an infrared image, or an image obtained by concatenating the color image and the infrared image. The first depth residual image 520 may include depth information of a high-frequency component.
The training apparatus downscales the training image 510 to generate a first low-resolution image 530. The training apparatus generates a first low-resolution depth image 540 corresponding to the first low-resolution image 530 using the second generation model 535, which is based on a second neural network. The first low-resolution depth image 540 includes depth information of a low-frequency component.
The training apparatus upscales the first low-resolution depth image 540 to generate an upscaled first low-resolution depth image 550 having a same scale as a scale of the first depth residual image 520, and combines the upscaled first low-resolution depth image 550 and the first depth residual image 520 in operation 560 to generate a resulting depth image 570.
The above process by which the training apparatus generates the resulting depth image 570 corresponds to a process of generating the target depth image 490 based on the input image 410 in the example of
The training apparatus calculates a difference between the resulting depth image 570 and a depth image 580 corresponding to a ground truth of depth information of a high-frequency component by comparing the resulting depth image 570 to the depth image 580. The training apparatus adjusts values of parameters, for example, parameters of the first neural network of the first generation model 515 to reduce the difference between the resulting depth image 570 and the depth image 580. For example, the training apparatus may find an optimal parameter value to minimize a value of a loss function that defines the difference between the resulting depth image 570 and the depth image 580. In this example, the loss function may be defined in various forms based on a classification scheme or a regression scheme. A scheme of adjusting a parameter value or a process of calibrating depth information for generation of the depth image 580 may be changed based on how the loss function is defined. Also, the training apparatus calculates a difference between the first low-resolution depth image 540 and a depth image 590 corresponding to a ground truth of depth information of a low-frequency component by comparing the first low-resolution depth image 540 to the depth image 590. The training apparatus adjusts parameters of the second generation model 535 to reduce the difference between the first low-resolution depth image 540 and the depth image 590. The training apparatus may find optimal values of parameters of each of the first generation model 515 and the second generation model 535 by repeatedly performing the above process on a large number of training images.
As a result, through the training process, the first generation model 515 is trained to output a first depth residual image including a residual component obtained by subtracting, from the depth image 580, a depth image generated by upscaling the depth image 590 to the scale of the depth image 580, and the second generation model 535 is trained to output a depth image corresponding to the downscaled depth image 590.
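As an illustrative sketch of how such per-scale training targets may be derived from a single ground-truth depth image, assuming a Gaussian-pyramid downscaling and bilinear upscaling (both are assumed choices, not requirements of this description):

```python
import cv2
import numpy as np

gt_depth = np.random.rand(128, 128).astype(np.float32)   # placeholder for the ground-truth depth image 580

# Low-frequency target (corresponds to depth image 590): the downscaled ground truth.
gt_low = cv2.pyrDown(gt_depth)

# Residual target for the first generation model: ground truth minus its upscaled low-frequency part.
gt_low_upscaled = cv2.resize(gt_low, (gt_depth.shape[1], gt_depth.shape[0]), interpolation=cv2.INTER_LINEAR)
gt_residual = gt_depth - gt_low_upscaled
```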
The training apparatus separately trains the first generation model 515 and the second generation model 535, which estimate depth information of respective frequency components of a depth image, such that the depth information is effectively estimated. In the multi-scale-based depth estimation structure, the training apparatus uses a depth estimation result of a previous operation as a guide for the next training operation.
In
Referring to
The depth image generation apparatus acquires a first depth residual image 620 corresponding to the input image 610 using a first generation model 615 that is based on a first neural network. A pixel value of the input image 610 is input to the first generation model 615, and the first generation model 615 outputs a pixel value of the first depth residual image 620. The first depth residual image 620 may have a resolution corresponding to a resolution or a scale of the input image 610, and may include depth information of a high-frequency component.
The depth image generation apparatus downscales the input image 610 to generate a first low-resolution image 625. For example, the depth image generation apparatus may downsample the input image 610, may perform Gaussian smoothing and may generate the first low-resolution image 625. The first low-resolution image 625 may include color information of a low-frequency component in comparison to the input image 610.
The depth image generation apparatus acquires a second depth residual image 640 corresponding to the first low-resolution image 625 using a second generation model 630 that is based on a second neural network. A pixel value of the first low-resolution image 625 is input to the second generation model 630, and the second generation model 630 outputs a pixel value of the second depth residual image 640. The second depth residual image 640 may include depth information of an intermediate-frequency component, and may include depth information of a low-frequency component in comparison to the first depth residual image 620.
The depth image generation apparatus downscales the first low-resolution image 625 to generate a second low-resolution image 645. For example, the depth image generation apparatus may downsample the first low-resolution image 625, may perform Gaussian smoothing and may generate the second low-resolution image 645. The second low-resolution image 645 may include color information of a low-frequency component in comparison to the first low-resolution image 625.
The depth image generation apparatus acquires a second low-resolution depth image 655 corresponding to the second low-resolution image 645 using a third generation model 650 that is based on a third neural network. A pixel value of the second low-resolution image 645 is input to the third generation model 650, and the third generation model 650 outputs a pixel value of the second low-resolution depth image 655. The second low-resolution depth image 655 may include depth information of a low-frequency component.
The first generation model 615, the second generation model 630, and the third generation model 650 are models trained to output the first depth residual image 620, the second depth residual image 640, and the second low-resolution depth image 655, respectively, based on input information. An image-to-image translation scheme using GANs, for example, Pix2Pix, CycleGAN, or DiscoGAN, may be used to implement the first generation model 615, the second generation model 630, and the third generation model 650.
The depth image generation apparatus upscales the second low-resolution depth image 655 to generate an upscaled second low-resolution depth image 660. For example, the depth image generation apparatus may upsample the second low-resolution depth image 655 to generate the second low-resolution depth image 660 having a scale corresponding to a scale of the second depth residual image 640. The depth image generation apparatus combines the second depth residual image 640 and the upscaled second low-resolution depth image 660 in operation 665, to generate a first low-resolution depth image 670. Operation 665 may correspond to, for example, a weighted sum or summation of depth values of pixel positions corresponding to each other in the second depth residual image 640 and the upscaled second low-resolution depth image 660. In an example, the second depth residual image 640 includes depth information of a residual component obtained by removing depth information of the upscaled second low-resolution depth image 660 from the first low-resolution depth image 670.
The depth image generation apparatus upscales the first low-resolution depth image 670 to generate an upscaled first low-resolution depth image 675. For example, the depth image generation apparatus may upsample the first low-resolution depth image 670 and generate the first low-resolution depth image 675 having a scale corresponding to a scale of the first depth residual image 620. The depth image generation apparatus combines the first depth residual image 620 and the upscaled first low-resolution depth image 675 in operation 680 to generate a target depth image 685 corresponding to the input image 610. Operation 680 corresponds to, for example, a weighted sum or summation of depth values of pixel positions corresponding to each other in the first depth residual image 620 and the upscaled first low-resolution depth image 675. In an example, the first depth residual image 620 includes depth information of a residual component obtained by removing depth information of the upscaled first low-resolution depth image 675 from the target depth image 685.
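The overall three-scale composition may be sketched as follows, assuming three trained generator callables g1, g2, and g3 (their internals are not specified here) and using a plain summation for operations 665 and 680; the placeholder generators at the end exist only to make the sketch executable.

```python
import cv2
import numpy as np

def upsample_to(depth, reference):
    h, w = reference.shape[:2]
    return cv2.resize(depth, (w, h), interpolation=cv2.INTER_LINEAR)

def generate_depth(input_image, g1, g2, g3):
    first_low_res = cv2.pyrDown(input_image)          # first low-resolution image (625)
    second_low_res = cv2.pyrDown(first_low_res)       # second low-resolution image (645)

    first_residual = g1(input_image)                  # first depth residual image (620)
    second_residual = g2(first_low_res)               # second depth residual image (640)
    second_low_res_depth = g3(second_low_res)         # second low-resolution depth image (655)

    # Operation 665: combine into the first low-resolution depth image (670).
    first_low_res_depth = second_residual + upsample_to(second_low_res_depth, second_residual)
    # Operation 680: combine into the target depth image (685).
    return first_residual + upsample_to(first_low_res_depth, first_residual)

# Usage with trivial placeholder generators (each returns a single-channel map).
fake_generator = lambda img: cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
dummy_input = np.random.randint(0, 256, size=(128, 128, 3), dtype=np.uint8)
target_depth = generate_depth(dummy_input, fake_generator, fake_generator, fake_generator)
```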
As described above, the depth image generation apparatus combines global information and local information for depth information through multiple steps of the multi-scale-based depth estimation structure. The depth image generation apparatus extracts global depth information from a color image with a smallest scale, extracts local depth information from color images with the other scales, and adds the extracted local depth information to the extracted global depth information, to gradually refine depth information.
The multi-scale-based depth estimation structure used to generate a depth image may have four or more layers, as well as two layers as described in the example of
In an example, the depth image generation apparatus generates a depth image with a quality higher than a quality of an input depth image by calibrating the input depth image using the multi-scale-based depth estimation structure of
Referring to
The training apparatus downscales the depth image 790 to generate the depth image 792 having a reduced scale, and downscales the depth image 792 to generate the depth image 794 having a further reduced scale. The depth image 790 may include depth information of a high-frequency component, the depth image 792 may include depth information of an intermediate-frequency component, and the depth image 794 may include depth information of a low-frequency component. Each of the depth images 790, 792 and 794 is used as a reference image to calculate an error value of an output of each of the first generation model 715, the second generation model 730 and the third generation model 750.
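As an illustrative sketch, the three reference depth images may be obtained by repeated downscaling of a single ground-truth depth image; the Gaussian-pyramid downscaling below is an assumed choice.

```python
import cv2
import numpy as np

gt_depth_790 = np.random.rand(128, 128).astype(np.float32)   # placeholder ground-truth depth image 790
gt_depth_792 = cv2.pyrDown(gt_depth_790)                      # reduced scale, intermediate-frequency reference 792
gt_depth_794 = cv2.pyrDown(gt_depth_792)                      # further reduced scale, low-frequency reference 794
```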
When a training image 710 is provided, the training apparatus generates a first depth residual image 720 corresponding to the training image 710 using the first generation model 715 that is based on a first neural network. The training image 710 may include, for example, a color image, an infrared image, or an image obtained by concatenating the color image and the infrared image. The first depth residual image 720 may include depth information of a high-frequency component.
The training apparatus downscales the training image 710 to generate a first low-resolution image 725. The training apparatus generates a second depth residual image 740 corresponding to the first low-resolution image 725 using the second generation model 730 that is based on a second neural network. The second depth residual image 740 may include depth information of an intermediate-frequency component.
The training apparatus downscales the first low-resolution image 725 to generate a second low-resolution image 745. The training apparatus generates a second low-resolution depth image 755 corresponding to the second low-resolution image 745 using the third generation model 750 that is based on a third neural network. The second low-resolution depth image 755 may include depth information of a low-frequency component.
The training apparatus upscales the second low-resolution depth image 755 by a scale of the second depth residual image 740 to generate an upscaled second low-resolution depth image 760. The training apparatus combines the second depth residual image 740 and the upscaled second low-resolution depth image 760 in operation 765 to generate a first low-resolution depth image 770. The training apparatus upscales the first low-resolution depth image 770 to generate an upscaled first low-resolution depth image 775 having the same scale as a scale of the first depth residual image 720, and combines the upscaled first low-resolution depth image 775 and the first depth residual image 720 in operation 780, to generate a resulting depth image 785.
The above process by which the training apparatus generates the resulting depth image 785 corresponds to a process of generating a target depth image based on the input image 610 in the example of
The training apparatus calculates a difference between the resulting depth image 785 and the depth image 790 corresponding to a ground truth of depth information of a high-frequency component, and adjusts values of parameters of the first generation model 715 to reduce the difference between the resulting depth image 785 and the depth image 790. The training apparatus calculates a difference between the first low-resolution depth image 770 and the depth image 792 corresponding to a ground truth of depth information of an intermediate-frequency component, and adjusts values of parameters of the second generation model 730 to reduce the difference between the first low-resolution depth image 770 and the depth image 792. Also, the training apparatus calculates a difference between the second low-resolution depth image 755 and the depth image 794 corresponding to a ground truth of depth information of a low-frequency component, and adjusts values of parameters of the third generation model 750 to reduce the difference between the second low-resolution depth image 755 and the depth image 794. The training apparatus may find optimal values of parameters of each of the first generation model 715, the second generation model 730 and the third generation model 750 by repeatedly performing the above process on a large number of training images.
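For illustration, the three per-scale differences may be expressed as in the following PyTorch-style sketch; the L1 criterion and the placeholder tensors are assumptions, since the loss function may be defined in various forms as noted above.

```python
import torch
import torch.nn.functional as F

# Placeholder tensors standing in for the images produced during training.
resulting_depth  = torch.rand(1, 1, 128, 128, requires_grad=True)   # stands in for image 785
first_low_depth  = torch.rand(1, 1, 64, 64, requires_grad=True)     # stands in for image 770
second_low_depth = torch.rand(1, 1, 32, 32, requires_grad=True)     # stands in for image 755

gt_high = torch.rand(1, 1, 128, 128)   # depth image 790 (high-frequency ground truth)
gt_mid  = torch.rand(1, 1, 64, 64)     # depth image 792 (intermediate-frequency ground truth)
gt_low  = torch.rand(1, 1, 32, 32)     # depth image 794 (low-frequency ground truth)

loss_g1 = F.l1_loss(resulting_depth, gt_high)    # difference reduced for the first generation model 715
loss_g2 = F.l1_loss(first_low_depth, gt_mid)     # difference reduced for the second generation model 730
loss_g3 = F.l1_loss(second_low_depth, gt_low)    # difference reduced for the third generation model 750

# One possible realization: back-propagate the summed differences and step a
# per-model optimizer for each generation model's parameters (not shown here).
(loss_g1 + loss_g2 + loss_g3).backward()
```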
As a result, through the training process, the first generation model 715 is trained to output a first depth residual image including a residual component obtained by subtracting, from the depth image 790, a depth image generated by upscaling the depth image 792 to the scale of the depth image 790. The second generation model 730 is trained to output a second depth residual image including a residual component obtained by subtracting, from the depth image 792, a depth image generated by upscaling the depth image 794 to the scale of the depth image 792. Also, the third generation model 750 is trained to output a depth image corresponding to the downscaled depth image 794.
As described above, the training apparatus decomposes the depth image 790 into frequency components, and trains the first generation model 715, the second generation model 730, and the third generation model 750 to estimate depth information of each frequency component. In operations other than the operation of using the third generation model 750, the training apparatus allows only a depth residual component relative to the previous operation to be learned, so that the characteristics of the depth information estimated in each operation are separated and learned. Depth information estimated in a previous operation is used to generate an image for training in a next operation and to guide the next operation. The training apparatus guides a residual component that is not estimated in each operation to be processed in a next operation, such that each of the first generation model 715, the second generation model 730, and the third generation model 750 efficiently estimates depth information of the frequency component corresponding to that generation model.
Referring to
Similarly to the process of
Referring to
Similarly to the process of
The depth image generation apparatus upscales the first low-resolution depth image 970 to generate an upscaled first low-resolution depth image 975, and combines the first depth residual image 930 and the upscaled first low-resolution depth image 975 in operation 980, to generate a target depth image 990 corresponding to the input image 910. Operation 980 corresponds to, for example, a weighted sum or summation of depth values of pixel positions corresponding to each other in the first depth residual image 930 and the upscaled first low-resolution depth image 975.
The multi-scale-based depth estimation structure used to generate a depth image may have four or more layers, as well as two layers as described in the example of
Referring to
The depth image generation apparatus combines the intermediate depth images 1030, 1040 and 1050 in operation 1060, to generate a target depth image 1070 corresponding to the input image 1010. Operation 1060 corresponds to, for example, a weighted sum or summation of depth values of pixel positions corresponding to each other in the intermediate depth images 1030, 1040 and 1050. Through the above process, the depth image generation apparatus generates a depth image with a high quality based on a color image or an infrared image.
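For illustration, operation 1060 may be sketched as a per-pixel weighted sum of the equal-size intermediate depth images; the placeholder arrays and the equal weights below are assumptions, and a plain summation corresponds to all weights being equal to one.

```python
import numpy as np

# Placeholder intermediate depth images 1030, 1040, and 1050, all the same size.
intermediate_depth_images = [np.random.rand(128, 128).astype(np.float32) for _ in range(3)]
weights = [1.0, 1.0, 1.0]   # equal weights give a plain summation

target_depth = np.zeros_like(intermediate_depth_images[0])
for weight, depth in zip(weights, intermediate_depth_images):
    target_depth += weight * depth   # weighted sum of corresponding pixel positions
```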
Referring to
The sensor 1110 may include any one or any combination of an image sensor configured to acquire a color image, an infrared sensor configured to acquire an infrared image, and a depth sensor configured to acquire a depth image. For example, the sensor 1110 acquires an input image including either one or both of a color image and an infrared image. The sensor 1110 transfers the acquired input image to either one or both of the processor 1120 and the memory 1130.
The processor 1120 controls the depth image generation apparatus and processes at least one operation associated with the above-described depth image generation method. In an example, the processor 1120 receives an input image including either one or both of a color image and an infrared image, and generates a first low-resolution image having a resolution lower than a resolution of the input image. The processor 1120 downsamples the input image to generate the first low-resolution image. The processor 1120 acquires a first depth residual image corresponding to the input image using a first generation model that is based on a first neural network, and generates a first low-resolution depth image corresponding to the first low-resolution image using a second generation model that is based on a second neural network. The processor 1120 generates a target depth image corresponding to the input image based on the first depth residual image and the first low-resolution depth image. The first depth residual image includes, for example, depth information of a high-frequency component in comparison to the first low-resolution depth image. The processor 1120 upsamples the first low-resolution depth image to a resolution of the input image, and combines depth information of the upsampled first low-resolution depth image and depth information of the first depth residual image to generate the target depth image.
In an example, to generate the target depth image in a multi-scale structure with three layers, the processor 1120 may use a third generation model that is based on a third neural network, in addition to the first generation model and the second generation model. In this example, the processor 1120 acquires a second depth residual image corresponding to the first low-resolution image using the second generation model. The processor 1120 generates a second low-resolution image having a resolution lower than that of the first low-resolution image, and acquires a second low-resolution depth image corresponding to the second low-resolution image using the third generation model. The processor 1120 upsamples the second low-resolution depth image to a resolution of the second depth residual image, and combines depth information of the upsampled second low-resolution depth image and depth information of the second depth residual image to generate a first low-resolution depth image. The second depth residual image includes depth information of a high-frequency component in comparison to the second low-resolution depth image. The processor 1120 combines the generated first low-resolution depth image and the first depth residual image to generate the target depth image.
In another example, the processor 1120 performs a process of generating a depth image with a high quality by calibrating an input depth image acquired by a depth sensor based on a color image or an infrared image. This example has been described above with reference to
In still another example, the processor 1120 receives an input image, and acquires a first depth residual image and a first low-resolution depth image using a generation model that is based on a neural network that uses the input image as an input. The processor 1120 generates a target depth image corresponding to the input image based on the first depth residual image and the first low-resolution depth image. To acquire the first low-resolution depth image, the processor 1120 acquires a second depth residual image and a second low-resolution depth image using the generation model, and generates the first low-resolution depth image based on the second depth residual image and the second low-resolution depth image. This example has been described above with reference to
In still another example, the processor 1120 receives an input image, and acquires intermediate depth images with the same size using a generation model that is based on a neural network that uses the input image as an input. The intermediate depth images include depth information of different degrees of precision. The processor 1120 combines the acquired intermediate depth images to generate a target depth image. This example has been described above with reference to
Also, the processor 1120 may perform at least one of the operations described above with reference to
The memory 1130 stores information used in the above-described process of generating a depth image and result information. Also, the memory 1130 stores computer-readable instructions. When the instructions stored in the memory 1130 are executed by the processor 1120, the processor 1120 processes at least one of the above-described operations.
The computing apparatus 1200 is an apparatus configured to perform a function of generating a depth image, and performs operations of the depth image generation apparatus of
The processor 1210 performs functions and executes instructions in the computing apparatus 1200. For example, the processor 1210 may process instructions stored in the memory 1220 or the storage device 1240. The processor 1210 performs at least one of the operations described above with reference to
The memory 1220 stores data and/or information. The memory 1220 includes a non-transitory computer-readable storage medium or a computer-readable storage device. The memory 1220 may include, for example, a random access memory (RAM), a dynamic RAM (DRAM), a static RAM (SRAM), or other types of memories known in the art. The memory 1220 stores instructions to be executed by the processor 1210, and information associated with execution of software or an application while the software or the application is being executed by the computing apparatus 1200.
The first camera 1230 may acquire either one or both of a still image and a video image as a color image. The first camera 1230 corresponds to, for example, an image sensor described herein. The second camera 1235 may acquire an infrared image. The second camera 1235 may capture an infrared ray emitted from an object or an infrared ray reflected from an object. The second camera 1235 corresponds to, for example, an infrared sensor described herein. In an example, the computing apparatus 1200 may include either one or both of the first camera 1230 and the second camera 1235. In another example, the computing apparatus 1200 may further include a third camera (not shown) configured to acquire a depth image. In this example, the third camera may correspond to a depth sensor described herein.
The storage device 1240 includes a non-transitory computer-readable storage medium or a computer-readable storage device. The storage device 1240 may store a larger amount of information than the memory 1220 and may store information for a relatively long period of time. The storage device 1240 may include, for example, a magnetic hard disk, an optical disk, a flash memory, an electrically erasable programmable read-only memory (EEPROM), or other types of non-volatile memories known in the art.
The input device 1250 receives an input from a user through a tactile input, a video input, an audio input, or a touch input. For example, the input device 1250 may detect an input from a keyboard, a mouse, a touchscreen, or a microphone, and may include any other device configured to transfer the detected input to the computing apparatus 1200.
The output device 1260 provides a user with an output of the computing apparatus 1200 using a visual scheme, an auditory scheme, or a tactile scheme. For example, the output device 1260 may include a liquid crystal display (LCD), a light emitting diode (LED) display, a touchscreen, a speaker, a vibration generator, or other devices configured to provide the user with the output.
The communication device 1270 communicates with an external device via a wired or wireless network. For example, the communication device 1270 may communicate with the external device using a wired communication scheme or a wireless communication scheme, for example, a Bluetooth communication, a wireless fidelity (Wi-Fi) communication, a third generation (3G) communication, or a long term evolution (LTE) communication.
The first generation models 420, 515, 615, and 715, the second generation models 450, 535, 630, and 730, the third generation models 650 and 750, the processors 1120 and 1210, the memories 1130 and 1220, the communication buses 1140 and 1280, the storage device 1240, the input device 1250, the output device 1260, the communication device 1270, the processors, the memories, and other components and devices in
The methods illustrated in
Instructions or software to control computing hardware, for example, one or more processors or computers, to implement the hardware components and perform the methods as described above may be written as computer programs, code segments, instructions or any combination thereof, for individually or collectively instructing or configuring the one or more processors or computers to operate as a machine or special-purpose computer to perform the operations that are performed by the hardware components and the methods as described above. In one example, the instructions or software include machine code that is directly executed by the one or more processors or computers, such as machine code produced by a compiler. In another example, the instructions or software include higher-level code that is executed by the one or more processors or computers using an interpreter. The instructions or software may be written using any programming language based on the block diagrams and the flow charts illustrated in the drawings and the corresponding descriptions in the specification, which disclose algorithms for performing the operations that are performed by the hardware components and the methods as described above.
The instructions or software to control computing hardware, for example, one or more processors or computers, to implement the hardware components and perform the methods as described above, and any associated data, data files, and data structures, may be recorded, stored, or fixed in or on one or more non-transitory computer-readable storage media. Examples of a non-transitory computer-readable storage medium include read-only memory (ROM), random-access memory (RAM), flash memory, CD-ROMs, CD-Rs, CD+Rs, CD-RWs, CD+RWs, DVD-ROMs, DVD-Rs, DVD+Rs, DVD-RWs, DVD+RWs, DVD-RAMs, BD-ROMs, BD-Rs, BD-R LTHs, BD-REs, magnetic tapes, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any other device that is configured to store the instructions or software and any associated data, data files, and data structures in a non-transitory manner and provide the instructions or software and any associated data, data files, and data structures to one or more processors or computers so that the one or more processors or computers can execute the instructions. In one example, the instructions or software and any associated data, data files, and data structures are distributed over network-coupled computer systems so that the instructions and software and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by the one or more processors or computers.
While this disclosure includes specific examples, it will be apparent after an understanding of the disclosure of this application that various changes in form and details may be made in these examples without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered in a descriptive sense only, and not for purposes of limitation. Descriptions of features or aspects in each example are to be considered as being applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order, and/or if components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents. Therefore, the scope of the disclosure is defined not by the detailed description, but by the claims and their equivalents, and all variations within the scope of the claims and their equivalents are to be construed as being included in the disclosure.