The present invention relates to systems and methods for applying labels to image data using artificial neural networks and for training an artificial neural network to apply labels to the image data.
With technological breakthroughs in virtual and augmented reality, both the demand for and the amount of immersive content have been growing rapidly. One source of immersive content is 360-degree images and video. A 360 image, as its name suggests, captures omnidirectional visual information of the surrounding environment. Understanding and extracting semantic information captured in 360 images has large potential, for example, in various business areas including augmented & virtual reality, building construction & maintenance, and robotics. One technique for representing 360 images is the “equal-rectangular panorama” (ERP).
In some embodiments, ERP is used as input to a deep neural network that is trained to produce as output a room layout estimation, object detection, and/or object classification based on the ERP image data. Compared to conventional color images generated from perspective camera projection, ERP images are less sensitive to occlusion because the ERP images include 360-degree global information of the surrounding environment (e.g., a room). However, one downside of using ERP images is the lack of a sufficiently large amount of labelled data, which leads to limited performance of layout estimation. In some implementations, this limitation is addressed by utilizing multi-view consistency regularization, which leverages the rotation-invariance of layout in ERP images to reduce the need for large amounts of training data.
In various embodiments, the systems and methods described herein provide a novel regularization term to improve performance of deep neural networks for semantic interpretation of equal-rectangular panorama (ERP) images. Consistencies between different views of panorama images are utilized to reduce the amount of labelled ground truth data used for training of the deep neural network. This multi-view consistency regularization approach can be applied to various business areas including, for example, building construction & maintenance and augmented & virtual reality systems.
In one embodiment, the invention provides a method of training an artificial neural network to produce spatial labelling for a three-dimensional environment based on image data. A two-dimensional image representation is produced of omni-directional image data of the three-dimensional environment captured by one or more cameras. The artificial neural network is applied using the two-dimensional image representation as input to produce a first predicted label as output. A rotated two-dimensional image is generated by shifting image pixels of the two-dimensional image representation in a horizontal direction. The artificial neural network is then applied again using the rotated two-dimensional image as input to produce a second predicted label as its output. The artificial neural network is retrained based at least in part on a difference between the first predicted label and the second predicted label.
In another embodiment, the invention provides a system for producing spatial labelling for a three-dimensional environment based on image data using an artificial neural network. The system includes a camera system configured to capture omni-directional image data of the three-dimensional environment and a controller. The controller is configured to receive the omni-directional image data from the camera system and to produce a two-dimensional image representation of the omni-directional image data. The controller then applies the artificial neural network using the two-dimensional image representation as input to produce a first predicted label as output. A rotated two-dimensional image is generated by shifting image pixels of the two-dimensional image representation in a horizontal direction. The artificial neural network is then applied again using the rotated two-dimensional image as input to produce a second predicted label as its output. The artificial neural network is retrained based at least in part on a difference between the first predicted label and the second predicted label.
In yet another embodiment, the invention provides a method of training an artificial neural network to produce a spatial labelling of layout boundaries for a three-dimensional environment based on image data. Spherical image data of the three-dimensional environment surrounding a camera system is captured by the camera system and a two-dimensional representation of the spherical image data is produced using equal-rectangular projection (ERP). The artificial neural network is applied using the two-dimensional image representation as input to produce a first predicted label as output. The artificial neural network is configured to produce as its output a predicted label defining layout boundaries for the three-dimensional environment based on equal-rectangular projection (ERP) image data received as the input. A multi-view consistency regularization loss term is determined by generating a rotated two-dimensional image (by moving a defined number of pixel columns from one horizontal end of the two-dimensional image representation to the other horizontal end) and applying the artificial neural network using the rotated two-dimensional image as input to produce a second predicted label as output. The multi-view consistency regularization loss term is determined based on a comparison of the first predicted label and the second predicted label. A task-specific loss term is determined based on a difference between the first predicted label and a ground truth label for the two-dimensional image representation, and the artificial neural network is retrained based on both the task-specific loss term and the multi-view consistency regularization loss term.
Other aspects of the invention will become apparent by consideration of the detailed description and accompanying drawings.
Before any embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the following drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways.
The controller 101 is configured to receive image data from one or more cameras 107 that are communicatively coupled to the controller 101. In some implementations, the one or more cameras 107 are configured to capture omni-directional image data including, for example, 360 images. The image data captured by the one or more cameras 107 is processed by the controller 101 in order to define labels for the surrounding environment. In some implementations, the controller 101 is also communicatively coupled to a display 109 and is configured to cause the display 109 to display all or part of the captured image data and/or visual representations of the determined labels. In some implementations, the controller 101 is configured to show on the display 109 an “equal-rectangular panorama” (ERP) representation of the captured image data overlaid with the visual representation of the determined labels. In some implementations, the display 109 may also be configured to provide a graphical user interface for the system of
In some implementations, the controller 101 is also communicatively coupled to one or more actuators 111. The controller 101 is configured to provide control signals to operate the one or more actuators 111 based on the captured image data and/or the determined labels. For example, in some implementations, the actuators 111 may include electric motors for controlling the movement and operation of a robotic system. In some such implementations, the controller 101 may be configured to transmit control signals to the actuators 111 to maneuver the robot through a room based on a layout as determined based on the captured image data. Similarly, in some implementations where the controller 101 is configured to detect and classify objects in the surrounding environment based on the image data, the controller 101 is further configured to transmit control signals to the actuators 111 to cause the robot to interact with one or more detected objects.
ERP images contain the 360-degree by 180-degree full visual information of an environment. Therefore, some ERP images may have a size of 2N×N, where N is the height of the image, so that each pixel can be mapped to the spherical space of (−180 degrees to 180 degrees)×(−90 degrees to 90 degrees). ERP images are created by projecting spherical space to 2D flat surfaces with equal-rectangular projection. The process of projecting spherical image data into a 2D rectangular space introduces “stretching” distortion in the horizontal direction that varies at different locations in the vertical direction. This “stretching” distortion is illustrated in
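As a minimal, non-limiting sketch of the pixel-to-sphere mapping described above, the following Python function illustrates how a pixel coordinate of a 2N×N ERP image may be converted to a longitude/latitude pair; the function name and the exact sign conventions are illustrative assumptions rather than a required implementation.

```python
def erp_pixel_to_sphere(u, v, height):
    """Map a pixel coordinate (u, v) of a 2N x N ERP image to (longitude, latitude)
    in degrees, where u is the column in [0, 2N) and v is the row in [0, N).

    Longitude spans -180 to 180 degrees; latitude spans -90 to 90 degrees.
    """
    width = 2 * height
    longitude = (u + 0.5) / width * 360.0 - 180.0   # horizontal position -> longitude
    latitude = 90.0 - (v + 0.5) / height * 180.0    # vertical position -> latitude
    return longitude, latitude


# Example: the exact center of a 512 x 256 ERP image maps to (0.0, 0.0).
print(erp_pixel_to_sphere(255.5, 127.5, 256))
```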
Although the degree to which the image data in the ERP image is stretched in the horizontal direction varies based on the position of the image data in the vertical direction, the ERP image data does not exhibit similar distortions or “stretching” in the vertical direction. Accordingly, any rotation of the sphere in the horizontal direction simply results in a shifting of the image data to the left or right. For example, a 45-degree horizontal rotation of the ERP image data can be generated by cutting ⅛ of the ERP image data from the left of the ERP image and appending it to the right of the ERP image. This rotational characteristic applies not only to the image data in the ERP image, but also to the ground-truth semantic labels applied to the image data.
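A minimal sketch of this horizontal “rotation” as a circular shift of pixel columns is shown below, assuming the ERP image is stored as a NumPy array of shape H×W×C with W = 2H; the function and variable names are illustrative only.

```python
import numpy as np

def rotate_erp(erp_image, rotation_deg):
    """Simulate a horizontal rotation of the capture sphere by circularly shifting
    the pixel columns of an ERP image (shape H x W x C, with W = 2H)."""
    width = erp_image.shape[1]
    shift_px = int(round(rotation_deg / 360.0 * width))
    # Columns cut from one horizontal end are appended to the other end.
    return np.roll(erp_image, shift=-shift_px, axis=1)


# A 45-degree rotation moves 1/8 of the columns from the left side to the right side.
erp = np.random.rand(256, 512, 3)
rotated = rotate_erp(erp, 45.0)
assert rotated.shape == erp.shape
```

Because a ground-truth label expressed as a per-pixel or per-column map can be shifted in exactly the same way, the same kind of circular shift can be applied to the labels as well.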
The same image data in the ERP image 401 and the same portion of the label 403 that are displayed in the horizontal center of the ERP image 401 in
Machine learning mechanisms such as artificial neural networks are “trained” based on a “training set” or “training data.” In some implementations, an artificial neural network is configured to produce an “output” in response to a received “input.” The artificial neural network is trained to minimize differences between the output produced by the artificial neural network and the “ground truth” output. The difference between the output of the artificial neural network and the “ground truth” output is called “loss.” Known algorithms can be used to train an artificial neural network by defining one or more “loss functions” expressing this “loss.”
Differences between the first predicted label 505 and the ground truth label 507 are referred to as “task specific loss” (i.e., the difference between the actual output of the DNN 503 and an ideal “correct” output). This task specific loss can then be used to define the loss function that will be used to train the DNN 503. However, to improve the training of the DNN 503, the mechanism illustrated in
The original ERP image 501 is “rotated” by removing a portion of the image data from one side of the ERP image 501 and appending it to the other side of the ERP image 501 to create a rotated ERP image 509. The rotated ERP image 509 is then provided as input to the DNN 503 and a second predicted label 511 (i.e., a predicted label of the rotated view of the ERP image) is produced as the output of the DNN 503. As discussed above, both the ERP image itself and the “label” can be rotated by moving image data from one side of the 2D image to the other. Accordingly, in an ideally trained DNN 503, the difference between the first predicted label 505 and the second predicted label 511 should be a shift of the label by a known degree (corresponding to the shift of pixels in the ERP image data). Any differences between the first predicted label 505 and the second predicted label 511 other than this expected shift in the horizontal direction (i.e., “consistency regularization loss”) are then used to define an additional loss function that can also be used to train the DNN 503.
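A compact sketch of this consistency comparison is given below. It assumes, as illustrative choices not mandated by the description above, that the DNN outputs a dense label map with the same width as the input ERP image and that an L1 distance measures the residual difference once the second prediction is shifted back to the original view; the names consistency_loss, model, and shift_px are hypothetical.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, erp_batch, shift_px):
    """Multi-view consistency regularization term (illustrative sketch):
    the label predicted for a rotated view, shifted back to the original view,
    should match the label predicted for the original view."""
    pred_original = model(erp_batch)                           # first predicted label
    rotated = torch.roll(erp_batch, shifts=-shift_px, dims=-1)
    pred_rotated = model(rotated)                              # second predicted label
    # Undo the rotation so both predictions share the same reference frame;
    # any remaining difference is the consistency regularization loss.
    pred_aligned = torch.roll(pred_rotated, shifts=shift_px, dims=-1)
    return F.l1_loss(pred_original, pred_aligned)
```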
In addition to providing an additional loss function that can be used to train the DNN 503, the number of loss terms that can be determined from a single ERP image is significantly increased by using this simulated rotation of the ERP image data. The number of different rotationally “shifted” images that can be produced from a single ERP image is limited only by the horizontal resolution of the ERP image. Therefore, a relatively large number of “consistency regularization loss” terms can be determined from a single ERP image (i.e., at least one for each shift in the horizontal direction). Additionally, because the ground truth label 507 can also be shifted to the same degree as the rotated ERP image 509, in some implementations, the second predicted label 511 is then compared to a correspondingly shifted ground truth label 507 to produce additional task specific loss terms.
The original ERP image is then shifted based on a defined rotation angle ω (step 611). In some implementations, the defined rotation angle ω is determined based on the number of different views N to be processed for multi-view consistency regularization loss such that the angle ω can be sampled uniformly by dividing 360 degrees by N. In other implementations, the system may be configured to select one or more rotation angles ω randomly between −180 degrees and 180 degrees.
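The two angle-selection strategies described above might be sketched as follows; the helper name is purely illustrative.

```python
import random

def sample_rotation_angles(num_views, mode="uniform"):
    """Return N rotation angles in degrees: either by dividing 360 degrees
    uniformly by N, or by drawing uniformly at random from (-180, 180)."""
    if mode == "uniform":
        # k = 0 corresponds to the original, unrotated view.
        return [k * 360.0 / num_views for k in range(num_views)]
    return [random.uniform(-180.0, 180.0) for _ in range(num_views)]
```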
The rotated ERP image is then provided as input to the DNN 503 (step 613) and an additional predicted label Lω for the rotated ERP image is produced as the output of the DNN 503 (step 615). This new predicted label Lω is then rotated back to the perspective of the original ERP image (step 617). This reverse rotated additional predicted label is then compared to the predicted label from the original ERP image L (step 619) to produce the additional training data (i.e., a multi-view consistency regularization loss term). This shifting of the ERP image data (step 611), reverse shifting of the predicted label (step 617), and comparison of the predicted labels (step 619) is repeated until the Nth iteration (step 621). After the Nth iteration (step 621), the DNN 503 is retrained based on the task specific training data and the additional training data (step 623). By adding an additional multi-view consistency regularization loss term/function during training, the system is able to train the DNN 503 to produce consistent results regardless of the physical rotational position of the camera and thereby prevents the DNN 503 from overfitting to certain camera views.
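Putting the pieces together, one possible training step combining the task-specific loss with the multi-view consistency regularization loss terms might look like the following sketch. It assumes a PyTorch-style model producing a dense label map of the same width as the input, an L1 loss for both terms, and a weighting factor lam, none of which are mandated by the description above.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, erp_batch, ground_truth, angles, lam=0.1):
    """One illustrative training step: task-specific loss on the original view
    plus a multi-view consistency regularization term for each rotated view."""
    width = erp_batch.shape[-1]
    pred_original = model(erp_batch)
    loss = F.l1_loss(pred_original, ground_truth)              # task-specific loss

    for omega in angles:                                       # N rotated views
        shift_px = int(round(omega / 360.0 * width))
        rotated = torch.roll(erp_batch, shifts=-shift_px, dims=-1)
        pred_rotated = model(rotated)
        # Rotate the new prediction back to the original view before comparing.
        pred_aligned = torch.roll(pred_rotated, shifts=shift_px, dims=-1)
        loss = loss + lam * F.l1_loss(pred_original, pred_aligned)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```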
Thus, the invention provides, among other things, systems and methods for training an artificial neural network to define labels for a three-dimensional environment based on omni-directional image data mapped in an equal-rectangular panorama by using multi-view consistency regularization as a loss function for training the artificial neural network. Additional features and aspects of this invention are set forth in the following claims.