The present application relates to a technical field of image processing, and more particularly to a method for identifying depths of images and related devices.
In current methods for depth identification of vehicle images, depth information with measurement units cannot be identified through an image recognition model. As a result, it is difficult to determine an accurate distance between a vehicle and the objects or obstacles in its surrounding environment, which affects driving safety.
The accompanying drawings combined with the detailed description illustrate the embodiments of the present disclosure hereinafter. It is noted that embodiments of the present disclosure and features of the embodiments can be combined, when there is no conflict.
Various details are described in the following descriptions for a better understanding of the present disclosure; however, the present disclosure may also be implemented in ways other than those described herein. The scope of the present disclosure is not limited by the specific embodiments disclosed below. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present disclosure belongs. The terms used herein in the present disclosure are only for the purpose of describing specific embodiments and are not intended to limit the present disclosure.
The computer device 1 may include hardware such as, but not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), and an embedded device, for example.
The computer device 1 may be any electronic device that can interact with a user, such as a personal computer, a tablet computer, a smart phone, a personal digital assistant (PDA), a game console, an interactive network television (Internet Protocol Television, IPTV), or a smart wearable device, for example.
The computer device 1 may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud including a large number of hosts or network servers based on a cloud computing technology.
A network can include, but is not limited to, the Internet, a wide area network (WAN), a metropolitan area network, a local area network, and a virtual private network (VPN), for example.
In block S101, the computer device obtains point clouds of a road scene and a spatial coordinate value of each point in the point clouds, and the computer device obtains a first image and a second image of the road scene captured by a camera device.
In one embodiment, the point clouds and the spatial coordinate value of each point in the point clouds can be obtained by scanning the road scene with a lidar. Measurement unit information of the lidar may include, but is not limited to, meters or centimeters. The road scene represents a scene that shows multiple objects, such as vehicles, the ground, pedestrians, the sky, trees, and so on. The camera device can be a monocular camera, and the first image and the second image are Red Green Blue (RGB) images of adjacent frames. The generation time of the second image is later than the generation time of the first image. For example, a timestamp of the second image is later than a timestamp of the first image.
In one embodiment, the computer device obtains the first image by controlling the camera device to capture the road scene, and then the camera device captures the road scene again after a preset time interval, so that the computer device obtains the second image. The preset time interval can be set to be short; for example, the preset time interval can be 10 ms. Because the preset time interval is short, objects in the road scene move only slightly within the preset time interval. Therefore, the second image and the first image include more of the same objects.
In block S102, the computer device inputs the first image into a preset depth identification network, and obtains an initial depth image.
In one embodiment, the preset depth identification network can be one or more depth identification network frameworks. For example, the preset depth identification network can be a Fully Convolutional Residual Network (FCRN) framework, a Fully Convolutional Network (FCN) framework, a U-net framework, and so on. The preset depth identification network includes convolution layers, batch normalization layers, pooling layers, and activation function layers, for example. The generation process of the initial depth image is basically the same as the generation process of a target depth image described below.
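For illustration only, the following is a minimal sketch of an encoder-decoder depth identification network built from convolution layers, batch normalization layers, pooling layers, and activation function layers; the layer sizes and topology are assumptions, not the claimed configuration.

```python
# Minimal illustrative sketch of a preset depth identification network;
# layer counts and channel widths are assumptions, not the claimed configuration.
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: convolution + batch normalization + activation + pooling layers.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Decoder: upsample back to the input resolution and predict one depth channel.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, 3, padding=1), nn.Softplus(),  # positive initial depth values
        )

    def forward(self, rgb):                       # rgb: (N, 3, H, W) image batch
        return self.decoder(self.encoder(rgb))    # (N, 1, H, W) initial depth image
```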
In block S103, the computer device converts the spatial coordinate value according to a pose matrix generated from the first image and the second image and an internal reference matrix of the camera device, and the computer device obtains a projected depth value and a projected coordinate value of each point in the point clouds based on the converted spatial coordinate value.
In one embodiment, the pose matrix refers to a conversion relationship between a camera coordinate system corresponding to the camera device and a world coordinate system. The world coordinate system refers to a coordinate system established according to any object in the real world, and the world coordinate system can reflect a real position of that object in the real world. For example, the world coordinate system may be a coordinate system corresponding to the lidar. The spatial coordinate value includes a horizontal space coordinate value, a vertical space coordinate value, and a longitudinal space coordinate value. The projected depth value refers to a value obtained by converting the longitudinal space coordinate value based on the pose matrix and the internal reference matrix. The projected coordinate value refers to a coordinate value obtained by converting the horizontal space coordinate value and the vertical space coordinate value based on the pose matrix and the internal reference matrix.
In one embodiment, the computer device obtains a first homogeneous coordinate matrix of each pixel in the first image, and obtains a second homogeneous coordinate matrix of each pixel in the second image. The computer device obtains an inverse matrix of the internal reference matrix. The computer device calculates a first camera coordinate of each pixel in the first image according to the first homogeneous coordinate matrix and the inverse matrix, and calculates a second camera coordinate of each pixel in the second image according to the second homogeneous coordinate matrix and the inverse matrix. The computer device calculates the first camera coordinate and the second camera coordinate based on a preset relational expression of an epipolar constraint, and obtains a rotation matrix and a translation matrix. The computer device obtains the pose matrix by splicing the rotation matrix and the translation matrix. The first homogeneous coordinate matrix refers to a matrix with one more dimension than the pixel coordinate matrix, and the element value of the extra dimension is 1. The pixel coordinate matrix refers to a matrix generated according to a first pixel coordinate of each pixel in the first image, and the first pixel coordinate refers to a coordinate of each pixel in the first image in a pixel coordinate system. For example, if the first pixel coordinate of any pixel in the first image in the pixel coordinate system is represented as (u, v), the pixel coordinate matrix of that pixel is represented as [u, v]ᵀ, and the homogeneous coordinate matrix of that pixel is represented as [u, v, 1]ᵀ.
The first camera coordinate refers to a camera coordinate of each pixel of the first image in the camera coordinate system corresponding to the camera device.
In one embodiment, the computer device multiplies the first homogeneous coordinate matrix by the inverse matrix, and obtains the first camera coordinate. The computer device multiplies the second homogeneous coordinate matrix by the inverse matrix, and obtains the second camera coordinate. A generation method of the second homogeneous coordinate matrix is basically the same as the generation method of the first homogeneous coordinate matrix, so it is not repeated here. The pose matrix can be expressed as: pose = [R | t], in which pose represents the pose matrix, R represents the rotation matrix, and t represents the translation matrix. A calculation formula of the rotation matrix and the translation matrix is represented as: (K⁻¹ρ₁)ᵀ (t × R) (K⁻¹ρ₂) = 0, in which K⁻¹ρ₁ represents the first camera coordinate, K⁻¹ρ₂ represents the second camera coordinate, ρ₁ represents the first homogeneous coordinate matrix, ρ₂ represents the second homogeneous coordinate matrix, K⁻¹ represents the inverse matrix, and t × denotes the skew-symmetric matrix of the translation vector.
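As a hedged illustration, the sketch below recovers R and t from matched pixel coordinates of the first and second images using OpenCV's essential-matrix solver as a stand-in for the relational expression of the epipolar constraint; the matched points pts1 and pts2, the internal reference matrix K, and the 4×4 homogeneous layout of the pose matrix are assumptions for illustration.

```python
# Illustrative sketch: estimating the pose matrix from two adjacent frames.
# pts1, pts2 are assumed (N, 2) arrays of matched pixel coordinates; K is the
# 3x3 internal reference matrix.
import numpy as np
import cv2

def estimate_pose(pts1: np.ndarray, pts2: np.ndarray, K: np.ndarray) -> np.ndarray:
    # The essential matrix encodes the epipolar constraint between the two views.
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)   # rotation matrix and translation matrix
    pose = np.eye(4)                                 # splice R and t into one pose matrix
    pose[:3, :3] = R
    pose[:3, 3] = t.ravel()
    return pose
```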
In one embodiment, the computer device obtains a camera coordinate matrix by multiplying a spatial homogeneous matrix corresponding to the spatial coordinate value by the pose matrix. The computer device determines a vertical coordinate value of the camera coordinate matrix as the projected depth value, and obtains a camera pose matrix by multiplying the camera coordinate matrix by the internal reference matrix. The computer device performs a division operation on each element value of the camera pose matrix by the projected depth value, and obtains the projected coordinate value. The spatial homogeneous matrix refers to a matrix with one more dimension than the spatial coordinate matrix, and the element value of the extra dimension is 1; the spatial coordinate matrix refers to a matrix generated according to the spatial coordinate value. For example, if the spatial coordinate value is (x, y, z), the spatial coordinate matrix is represented as [x, y, z]ᵀ, and the spatial homogeneous matrix is represented as [x, y, z, 1]ᵀ.
By directly establishing pixel coordinate systems in the first image and the second image respectively, the pose matrix can be generated according to the coordinates of each pixel of the first image and the second image in the corresponding pixel coordinate system, and the spatial coordinate value can be quickly converted.
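A minimal numeric sketch of the projection step described above is given below, assuming a 4×4 homogeneous pose matrix and treating the third camera-coordinate component as the projected depth value; the function and variable names are illustrative assumptions.

```python
# Illustrative sketch of block S103: projecting lidar points into the image plane.
import numpy as np

def project_points(points_xyz: np.ndarray, pose: np.ndarray, K: np.ndarray):
    """points_xyz: (N, 3) spatial coordinate values; pose: 4x4; K: 3x3 intrinsics."""
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])   # spatial homogeneous matrices (N, 4)
    cam = (pose @ homo.T).T[:, :3]                    # camera coordinate matrix
    projected_depth = cam[:, 2]                       # depth component (optical axis assumed)
    pix = (K @ cam.T).T                               # multiply by the internal reference matrix
    projected_coord = pix[:, :2] / projected_depth[:, None]  # divide by the projected depth
    return projected_depth, projected_coord
```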
In block S104, the computer device calculates a scaling factor for each point in the point clouds according to the projected depth value, a number of points in the point clouds, and an initial depth value of an initial pixel point corresponding to the projected coordinate value in the initial depth image.
In one embodiment, the scaling factor refers to an average value of ratios between a plurality of projected depth values and a plurality of corresponding initial depth values. In one embodiment, a calculation formula of the scaling factor is represented as: C_scale = (1/N_r) × Σ_{i=1}^{N_r} (d_i^r / d_i^p), in which C_scale represents the scaling factor, N_r represents the number of points in the point clouds, d_i^r represents the projected depth value of the i-th point in the point clouds, and d_i^p represents the initial depth value of the initial pixel point corresponding to the i-th point. By dividing the projected depth value of each point in the point clouds by the initial depth value of the initial pixel point corresponding to that point, a plurality of ratios are obtained. By selecting the average value of the plurality of ratios as the scaling factor, the rationality of the scaling factor can be improved. Since the projected depth value includes measurement unit information, the scaling factor also includes the measurement unit information.
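A short sketch of this computation follows, assuming projected_depth and initial_depth are aligned arrays holding d_i^r and d_i^p for the N_r points; the names are illustrative.

```python
# Illustrative sketch of block S104: C_scale as the mean ratio of projected
# depth values to the initial depth values sampled at the projected coordinates.
import numpy as np

def scaling_factor(projected_depth: np.ndarray, initial_depth: np.ndarray) -> float:
    ratios = projected_depth / initial_depth   # d_i^r / d_i^p for each point
    return float(np.mean(ratios))              # C_scale, carrying the lidar's measurement unit

# Block S105 then scales the initial depth values to obtain target depth values:
# target_depth = scaling_factor(projected_depth, initial_depth) * initial_depth
```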
In block S105, the computer device calculates a target depth value for each point in the point clouds according to the scaling factor and the initial depth value.
In one embodiment, a calculation formula of the target depth value is represented as: D_t = C_scale × d_i^p, in which D_t represents the target depth value, C_scale represents the scaling factor, and d_i^p represents the initial depth value. The initial depth value of the initial pixel point corresponding to each point in the point clouds is multiplied by the scaling factor; since the initial pixel points corresponding to all points in the point clouds are involved in this calculation, the target depth value of each point in the point clouds carries the same measurement unit information. In addition, since the number of points in the point clouds may be smaller than the number of pixels in the first image or the second image, the initial depth value of the initial pixel point corresponding to the projected coordinate value can be accurately selected in the initial depth image, so that the projected depth value and the corresponding initial depth value can be accurately calculated.
In block S106, the computer device generates an initial projection image based on the pose matrix, the internal reference matrix, the target depth value, the second image, and a pixel coordinate value of a target pixel point corresponding to the projected coordinate value in the second image.
In one embodiment, the initial projection image refers to a projection image generated by remapping the second image back to the first image.
In one embodiment, the computer device constructs a homogeneous coordinate matrix according to the pixel coordinate value of the target pixel point, and obtains an inverse matrix of the internal reference matrix. The computer device calculates a target coordinate value of the target pixel point according to the pose matrix, the inverse matrix, the internal reference matrix, the homogeneous coordinate matrix, and the target depth value. The computer device obtains the initial projection image by adjusting the pixel coordinate value of the target pixel point to be the corresponding target coordinate value in the second image. The pixel coordinate value of the target pixel point refers to a coordinate value in the pixel coordinate system corresponding to the second image, and the pixel coordinate value of the target pixel point includes an abscissa value and an ordinate value. A calculation formula of the target coordinate value is represented as: P = K × pose × Z × K⁻¹ × H, in which P represents the target coordinate value, K represents the internal reference matrix, pose represents the pose matrix, K⁻¹ represents the inverse matrix, H represents the homogeneous coordinate matrix, and Z represents the target depth value. Since the target depth value of each point in the point clouds has the same measurement unit information, it can be ensured that the pixel values in the initial projection image generated according to the target depth values include the measurement unit information.
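The following sketch applies the formula above to a single target pixel; the helper name and the 4×4 pose layout are assumptions for illustration rather than the claimed implementation.

```python
# Illustrative sketch of block S106: computing the target coordinate value
# P = K * pose * Z * K^-1 * H for one target pixel (u, v) of the second image.
import numpy as np

def warp_pixel(u: float, v: float, Z: float, K: np.ndarray, pose: np.ndarray):
    H = np.array([u, v, 1.0])                 # homogeneous coordinate matrix of the target pixel
    cam = Z * (np.linalg.inv(K) @ H)          # back-project using the target depth value Z
    moved = (pose @ np.append(cam, 1.0))[:3]  # apply the pose matrix
    p = K @ moved                             # project with the internal reference matrix
    return p[:2] / p[2]                       # target coordinate value of the pixel
```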
In block S107, the computer device calculates a loss value of the preset depth identification network according to the first image, the initial projection image and the second image, and obtains the pre-trained image identification model by adjusting the preset depth identification network based on the loss value.
In one embodiment, the pre-trained image identification model refers to a model generated after adjusting the preset depth identification network. The preset depth identification network may be a deep neural network, and the preset depth identification network may be obtained from a database on the Internet.
The computer device calculates a first pixel difference value between the pixel value of each pixel point in the first image and the pixel value of the corresponding pixel point in the initial projection image, and obtains a first difference image by adjusting the pixel value of each pixel point in the first image to be the corresponding first pixel difference value. The computer device calculates a second pixel difference value between the pixel value of each pixel point in the first image and the pixel value of the corresponding pixel point in the second image, and generates a second difference image corresponding to the first image according to the second pixel difference values. The computer device obtains a target image by adjusting the second pixel difference values of the second difference image according to a comparison result of each second pixel difference value with the corresponding first pixel difference value and a preset value, and the computer device calculates the loss value according to the pixel value of each pixel point in the target image and the corresponding first pixel difference value of the corresponding pixel point in the first difference image. A method for generating the second difference image is basically the same as the method for generating the first difference image, so it is not repeated here.
Specifically, the computer device compares the second pixel difference value with the corresponding first pixel difference value. When the second pixel difference value is smaller than the corresponding first pixel difference value, the computer device determines the pixel point corresponding to the second pixel difference value in the second difference image as a feature pixel point. The computer device obtains the target image by adjusting the plurality of second pixel difference values corresponding to the plurality of feature pixel points in the second difference image to be the preset value. The preset value can be set or updated, which is not limited in this application. For example, the preset value may be zero. The computer device multiplies the pixel value of each pixel point in the target image by the corresponding first pixel difference value of the corresponding pixel point in the first difference image, and obtains the loss value.
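For illustration, a compact sketch of this loss is shown below, assuming img1, img2, and proj (the initial projection image) are float tensors of the same shape and taking the preset value to be zero as in the example; the final reduction to a single scalar by averaging is an assumption.

```python
# Illustrative sketch of the loss in block S107 with moving-object masking.
import torch

def masked_reprojection_loss(img1, img2, proj, preset_value=0.0):
    diff1 = (img1 - proj).abs()    # first difference image (first image vs. initial projection)
    diff2 = (img1 - img2).abs()    # second difference image (first image vs. second image)
    # Feature pixel points (likely moving objects) are where diff2 < diff1;
    # their second pixel difference values are set to the preset value.
    target = torch.where(diff2 < diff1, torch.full_like(diff2, preset_value), diff2)
    # Multiply the target image by the first difference image and reduce to a scalar.
    return (target * diff1).mean()
```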
In one embodiment, the computer device adjusts parameters of the preset depth identification network based on the loss value until the loss value drops to a configured value, and obtains the pre-trained image identification model. The parameters of the preset depth identification network include a learning rate, a batch size for each training, and so on.
In other embodiments, the computer device adjusts parameters of the preset depth identification network based on the loss value until the loss value satisfies a preset convergence condition, and obtains the pre-trained image identification model. The convergence condition can be set or updated, which is not limited in this application. For example, the convergence condition may be that the loss value is less than or equal to a preset threshold.
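A minimal training-loop sketch under these assumptions (an Adam optimizer, a simple loss threshold as the convergence condition, and hypothetical model, loader, and compute_loss objects) is given below.

```python
# Illustrative sketch: adjusting the preset depth identification network based on
# the loss value until the loss satisfies a preset convergence condition.
import torch

def train(model, loader, compute_loss, threshold=1e-3, lr=1e-4, max_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice is an assumption
    for _ in range(max_epochs):
        for batch in loader:
            loss = compute_loss(model, batch)   # loss value from block S107
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() <= threshold:        # preset convergence condition
                return model                    # pre-trained image identification model
    return model
```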
In the above embodiments, since the pixel values corresponding to a moving object will cause the calculated loss value to be inaccurate, by comparing the second pixel difference value with the corresponding first pixel difference value, it can be determined whether there is a moving object in the second difference image. If the second pixel difference value is smaller than the corresponding first pixel difference value, it indicates that there is a moving object in the second difference image. By determining the pixel point corresponding to such a second pixel difference value as a feature pixel point and adjusting the second pixel difference values of the feature pixel points in the second difference image to be the preset value (for example, zero), it can be ensured that the pixel values corresponding to the moving object are not used to calculate the loss value, thereby ensuring the accuracy of the loss value.
In block S108, the computer device obtains a plurality of images to be identified.
In one embodiment, the plurality of images to be identified refers to images whose depth information needs to be identified. The computer device obtains the plurality of images to be identified from a preset database. The preset database may be the KITTI database, the Cityscapes database, the vKITTI database, and so on.
In block S109, the computer device obtains a plurality of target depth images by inputting the plurality of images to be identified into the pre-trained image identification model, and determines depth information of the plurality of images to be identified based on the plurality of target depth images.
In one embodiment, the plurality of target depth images refer to images that include the depth information of each pixel in the plurality of images to be identified and measurement unit information, and the depth information of each pixel in the plurality of images to be identified refers to a distance between the object to be identified corresponding to that pixel and the camera device that captures the plurality of images to be identified. The measurement unit information includes meters, centimeters, and so on. For example, the depth information of a pixel in the plurality of images to be identified may be 10, and the measurement unit corresponding to the depth information 10 is meters. A method for generating each of the plurality of target depth images is basically the same as the method for generating the initial depth image, so it is not repeated here. The computer device obtains the pixel value of each pixel in the plurality of target depth images as the depth information of the corresponding pixel in the plurality of images to be identified, and determines the measurement unit information of the pixel value as the measurement unit information of the depth information.
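As a usage sketch (assuming a trained model and a batch tensor images, both hypothetical names), reading the depth information of a pixel amounts to reading the corresponding pixel value of the target depth image:

```python
# Illustrative usage for blocks S108-S109: identifying depth information with the
# pre-trained image identification model; `model` and `images` (N, 3, H, W) are assumed.
import torch

with torch.no_grad():
    target_depth_images = model(images)                  # (N, 1, H, W) target depth images
# Each pixel value is the depth information of the corresponding pixel; its
# measurement unit follows the lidar used in training (e.g., 10.0 read as 10 meters).
depth_at_origin = target_depth_images[0, 0, 0, 0].item()
```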
In the above embodiments, the computer device obtains the point clouds of the road scene and the spatial coordinate value of each point in the point clouds; since the point clouds and the spatial coordinate values are obtained by the lidar, the spatial coordinate value includes the measurement unit information of the lidar. The computer device calculates the scaling factor for each point in the point clouds according to the projected depth value, the number of points in the point clouds, and the initial depth value of the initial pixel point corresponding to the projected coordinate value in the initial depth image; since the scaling factor is the average value of the ratios between the plurality of projected depth values and the plurality of corresponding initial depth values, the scaling factor can better represent the overall ratio relation between the plurality of projected depth values and the plurality of corresponding initial depth values. The computer device calculates the target depth value for each point in the point clouds according to the scaling factor and the initial depth value. Since the target depth value is generated by scaling the initial depth value according to the scaling factor, the accuracy of the initial projection image generated based on the target depth value can be ensured. In addition, since the spatial coordinate value includes the measurement unit information of the lidar, the generated loss value also includes the measurement unit information. Moreover, the preset depth identification network is adjusted based on the loss value and the pre-trained image identification model is obtained, so as to ensure that the pre-trained image identification model can accurately acquire the measurement unit information of the lidar. The pre-trained image identification model can generate the plurality of target depth images that include the measurement unit information, so that a real distance between a vehicle and various objects or obstacles in the surrounding environment can be determined.
The computer device 1 may include a storage device 12, and at least one processor 13. Computer-readable instructions are stored in the storage device 12 and executable by the at least one processor 13.
The at least one processor 13 can be a central processing unit (CPU), or can be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate, a transistor logic device, or a discrete hardware component, etc. The processor 13 can be a microprocessor or any conventional processor. The processor 13 is a control center of the computer device 1 and connects the various parts of the entire computer device 1 by using various interfaces and lines.
The processor 13 executes the computer-readable instructions to implement a training method of the pre-trained image identification model, such as in block S101-S107 shown in
For example, the computer-readable instructions can be divided into one or more modules/units, and the one or more modules/units are stored in the storage device 12 and executed by the at least one processor 13. The one or more modules/units can be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe execution processes of the computer-readable instructions in the computer device 1.
The storage device 12 stores the computer-readable instructions and/or modules/units. The processor 13 may run or execute the computer-readable instructions and/or modules/units stored in the storage device 12 and may call up data stored in the storage device 12 to implement various functions of the computer device 1. The storage device 12 mainly includes a program storage area and a data storage area. The program storage area may store an operating system and an application program required for at least one function (such as a sound playback function or an image playback function), for example. The data storage area may store data (such as audio data and phone book data) created during the use of the computer device 1. In addition, the storage device 12 may include a high-speed random access memory, and may also include a non-transitory storage medium, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or another non-transitory solid-state storage device.
The storage device 12 may be an external memory and/or an internal memory of the computer device 1. The storage device 12 may be a memory in a physical form, such as a memory stick, or a Trans-flash Card (TF card), for example.
When the modules/units integrated into the computer device 1 are implemented in the form of independent software functional units, they can be stored in a non-transitory readable storage medium. Based on this understanding, all or some of the processes in the methods of the above embodiments implemented by the present disclosure can also be completed by related hardware instructed by computer-readable instructions. The computer-readable instructions can be stored in a non-transitory readable storage medium. The computer-readable instructions, when executed by the processor, may implement the steps of the foregoing method embodiments. The computer-readable instructions include computer-readable instruction codes, and the computer-readable instruction codes can be in a source code form, an object code form, an executable file, or some intermediate form. The non-transitory readable storage medium can include any entity or device capable of carrying the computer-readable instruction code, such as a recording medium, a U disk, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
With reference to
With reference to
The computer-readable instructions are executed by the processor 13 to realize the functions of each module/unit in the above-mentioned device embodiments, which will not be repeated here.
In the several embodiments provided in the present disclosure, the disclosed computer device and method can be implemented in other ways. For example, the embodiments of the devices described above are merely illustrative. For example, the division of the modules is based on logical function only, and there can be other manners of division in actual implementation.
In addition, each functional module in each embodiment of the present disclosure can be integrated into one processing module, or can be physically present separately in each unit, or two or more modules can be integrated into one module. The above modules can be implemented in a form of hardware or in a form of a software functional unit.
Therefore, the present embodiments are considered as illustrative and not restrictive, and the scope of the present disclosure is defined by the appended claims. All changes and variations in the meaning and scope of equivalent elements are included in the present disclosure. Any reference sign in the claims should not be construed as limiting the claim.
Moreover, the word “comprising” does not exclude other units nor does the singular exclude the plural. A plurality of units or devices stated in the system claims may also be implemented by one unit or device through software or hardware. Words such as “first” and “second” are used to indicate names, not a particular order.
Finally, the above embodiments are only used to illustrate technical solutions of the present disclosure and are not to be taken as restrictions on the technical solutions. Although the present disclosure has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solutions described in one embodiment can be modified, or some of the technical features can be equivalently substituted, and that these modifications or substitutions are not to detract from the essence of the technical solutions or from the scope of the technical solutions of the embodiments of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202211737780.9 | Dec 2022 | CN | national |