The goal of unsupervised domain adaptive object detection is to learn a robust detector under domain shift, where the training (source) domain is label-rich, with bounding box annotations, while the testing (target) domain is unlabeled, and the feature distributions of the training and testing domains are dissimilar or even totally different.
In real-world scenarios, generic object detection always faces severe challenges from variations in viewpoint, background, object appearance, illumination, occlusion conditions, scene change, etc. These unavoidable factors make object detection under domain shift challenging. Moreover, domain change is a widely-recognized, intractable problem that urgently needs to be addressed for real-world detection tasks such as video surveillance and autonomous driving.
Common approaches for addressing domain-shift object detection include: (1) training a supervised model on the source domain and then fine-tuning it on the target domain; or (2) unsupervised cross-domain representation learning. The former requires additional instance-level annotations on target data, which is fairly laborious, expensive and time-consuming. As such, most approaches focus on the latter, but challenges still remain. The first challenge is that the representations of source and target domain data should be embedded into a common space for matching the object, such as the hidden feature space, the input space or both. The second is that a feature alignment/matching operation or mechanism for the source/target domains must be further defined, such as subspace alignment, H-divergence and adversarial learning, MRL, strong-weak alignment, universal alignment, etc.
The disclosed invention targets these two challenges with a learning-based, end-to-end framework for alignment across domains. Disclosed herein is an effective detach strategy that selectively prevents the flow of gradients from the context sub-network through the detection backbone path in order to obtain a more discriminative context. This path carries information with diversity and, hence, suppressing gradients along it achieves the desired effect. A sub-network is defined to generate the context information from early layers of the detection backbone. Because instance and context focus on perceptually different parts of an image, the representations from either of them should also be discrepant. However, if the conventional process is used for training, the companion sub-network will be updated jointly with the detection backbone, which may lead to indistinguishable behavior between these two parts. To this end, the disclosed invention suppresses gradients during backpropagation and forces the representation of the context sub-network to be dissimilar to that of the detection network.
By way of example, a specific exemplary embodiment of the disclosed system and method will now be described, with reference to the accompanying drawings, in which:
Disclosed herein is a design specific to convolutional neural network optimization that improves training on tasks that adapt across discrepant domains using a novel approach referred to herein as stacked complementary losses (SCL), an effective approach for domain-shift object detection. Previous approaches that conduct domain alignment on high-level layers only cannot fully adapt shallow-layer parameters to both the source and target domains, which restricts the ability of the model to learn. Further, gradient detach is a critical part of learning with complementary losses.
Following the common formulation of domain adaptive object detection, a source domain is defined in which annotated bounding boxes are available, and a target domain in which only the images, without any labels, can be used in the training process. Examples of a source domain and a target domain are shown in
Multi-Complement Objective Learning—
where:
The complementary loss of domain classifier k is formulated as follows:
where:
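In its cross-entropy instantiation (detailed below), this complementary loss takes the familiar adversarial domain-classification form; the following is a reconstruction under that assumption, with $F_k$ denoting the backbone up to layer $k$ and $D_k$ the $k$-th domain classifier:

```latex
\mathcal{L}_{k} = -\sum_{i}\Big[\, d_i \log D_k\big(F_k(x_i)\big)
  + (1-d_i)\log\big(1 - D_k(F_k(x_i))\big) \Big],
\qquad
\mathcal{L}_{SCL} = \sum_{k=1}^{K}\mathcal{L}_{k},
```

where $d_i$ is the domain label of the $i$-th training image (1 for source, 0 for target) and $K$ is the number of complementary domain-classifier branches.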
The framework also adopts a gradient reverse layer (GRL) to enable adversarial training: a GRL is placed between the domain classifier for each layer k and the detection backbone network. During backpropagation, the GRLs reverse the gradients that pass from the domain classifiers to the detection network.
For the instance-context alignment loss ILoss, the instance-level representations and the context vector are taken as inputs. The instance-level vectors come from the RoI layer, and each vector focuses on the representation of a local object only. The context vector comes from the sub-network that combines hierarchical global features. Each instance feature vector is concatenated with the same context vector generated by the sub-network. Context information is fairly different from object information; therefore, whereas jointly training the detection and context networks would mix the critical information from each part, the invention provides a better solution that uses a detach strategy to update the gradients, explained in more detail below. Aligning instance and context representations simultaneously helps to alleviate variances in object appearance, part deformation, object size, etc. in the instance vectors, and in illumination, scene, etc. in the context vector. di is defined as the domain label of the ith training image, where di=1 for the source and di=0 for the target, so the instance-context alignment loss can be further formulated as:
where:
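The GRL adopted above can be sketched in PyTorch as a minimal custom autograd function (an illustrative sketch of the standard gradient-reversal construction, not the invention's actual code):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and optionally scale) the gradient flowing back to the backbone.
        return grad_output.neg() * ctx.lambd, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# The domain classifier sees the features unchanged, but the backbone
# receives the negated gradient, which yields adversarial training.
features = torch.ones(4, requires_grad=True)
grad_reverse(features, lambd=1.0).sum().backward()
print(features.grad)  # each entry is -1: the gradient of sum() (=1), reversed
```

Inserting this layer lets a single minimization loop realize the min-max game between the detector and the domain classifiers.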
Gradient Detach Updating—The detach strategy, which prevents the flow of gradients from the context sub-network through the detection backbone path, will now be disclosed. This feature of the invention helps to obtain a more discriminative context. Further, this path carries information with diversity and, hence, suppressing gradients from this path is desirable.
As previously mentioned, a sub-network is defined to generate the context information from early layers of the detection backbone. Intuitively, instance and context focus on perceptually different parts of an image, so the representations from either of them should also be discrepant. However, if the conventional process is used for training, the companion sub-network will be updated jointly with the detection backbone, which may lead to indistinguishable behavior between these two parts. To this end, gradients are suppressed during backpropagation and the representation of the context sub-network is forced to be dissimilar to that of the detection network. This is implemented by the algorithm shown in meta-language in
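The detach strategy can be sketched in PyTorch, where `Tensor.detach()` cuts the backward path from the context sub-network into the backbone. Module names and sizes here are illustrative assumptions, not the invention's actual architecture:

```python
import torch
import torch.nn as nn

backbone_early = nn.Linear(8, 16)    # stands in for early layers of the detection backbone
context_subnet = nn.Linear(16, 4)    # companion sub-network producing the context vector

x = torch.randn(2, 8)
feat = backbone_early(x)

# Gradient detach: the context sub-network reads the backbone features,
# but gradients from the context path cannot flow back into the backbone.
context = context_subnet(feat.detach()).mean(dim=0)   # one context vector

# Each RoI instance vector is concatenated with the same context vector.
instances = torch.randn(3, 12)                        # stand-in RoI instance vectors
fused = torch.cat([instances, context.expand(3, -1)], dim=1)  # shape (3, 16)

context_loss = fused.pow(2).mean()
context_loss.backward()

print(backbone_early.weight.grad)          # None: the detached path leaves the backbone untouched
print(context_subnet.weight.grad is None)  # False: the context sub-network still updates
```

Because the backbone never receives gradients from this branch, the context representation is free to diverge from the detection features, which is exactly the discrepancy the detach strategy is designed to encourage.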
Overall Framework—In one embodiment, the detection part is based on Faster R-CNN, including the Region Proposal Network (RPN) and other modules. This is a conventional practice in many adaptive detection works. The objective of the detection loss is summarized as:
where:
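For a Faster R-CNN detector, this objective conventionally comprises classification and bounding-box regression terms for both the RPN and the RoI head; the standard form, assumed here, is:

```latex
\mathcal{L}_{det} = \mathcal{L}_{rpn\text{-}cls} + \mathcal{L}_{rpn\text{-}reg}
  + \mathcal{L}_{cls} + \mathcal{L}_{reg}
```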
To train the whole model, the overall objective function is given as:
where:
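A reconstruction under the assumption that the detection loss is combined with the stacked complementary losses and the instance-context alignment loss through a trade-off coefficient $\lambda$ (the invention's exact weighting may differ):

```latex
\mathcal{L} = \mathcal{L}_{det} + \lambda\big(\mathcal{L}_{SCL} + \mathcal{L}_{ILoss}\big)
```

This objective is minimized over the detector parameters, with the adversarial maximization over the domain classifiers realized implicitly by the gradient reverse layers.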
Choosing Complementary Losses—The framework adopts three known types of losses as the complementary losses: a cross-entropy loss, a weighted least-squares loss and a focal loss.
Cross-entropy (CE) loss measures the performance of a classification model whose output is a probability value. It increases as the predicted probability diverges from the actual label:
where:
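For a binary domain label $d$ and predicted source probability $p$, the standard binary cross-entropy form is:

```latex
\mathcal{L}_{CE} = -\big[\, d \log p + (1-d)\log(1-p) \,\big]
```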
Weighted least-squares (LS) loss stabilizes the training of the domain classifier for aligning low-level features. The loss is designed to align each receptive field of features with the other domain. The least-squares loss is formulated as:
where:
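A reconstruction consistent with per-receptive-field alignment, in which each of the $W \times H$ locations of the domain classifier's output map is aligned with the other domain, following the familiar least-squares adversarial form (the invention's exact weighting may differ):

```latex
\mathcal{L}_{LS} = \frac{1}{WH}\sum_{w=1}^{W}\sum_{h=1}^{H} D\big(F(x^{s})\big)_{wh}^{2}
  \;+\; \frac{1}{WH}\sum_{w=1}^{W}\sum_{h=1}^{H} \Big(1 - D\big(F(x^{t})\big)_{wh}\Big)^{2}
```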
Focal Loss (FL) is adopted to ignore easy-to-classify examples and focus on hard-to-classify ones during training:
where:
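The focal-loss form is standard, $FL(p_t) = -(1-p_t)^{\gamma}\log(p_t)$, where $p_t$ is the probability of the true class and $\gamma$ controls how strongly easy examples are down-weighted. A minimal sketch for the binary (domain) case:

```python
import math

def focal_loss(p, d, gamma=2.0):
    """Binary focal loss: p = predicted probability of the source domain,
    d = domain label (1 = source, 0 = target)."""
    p_t = p if d == 1 else 1.0 - p          # probability of the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Easy, well-classified example: heavily down-weighted.
print(round(focal_loss(0.9, 1), 4))   # 0.0011
# Hard example: nearly the full cross-entropy penalty.
print(round(focal_loss(0.1, 1), 4))   # 1.8651
# With gamma = 0, focal loss reduces to plain cross-entropy.
print(abs(focal_loss(0.9, 1, gamma=0.0) - (-math.log(0.9))) < 1e-12)  # True
```

The `(1 - p_t)**gamma` modulating factor is what suppresses the contribution of confidently classified examples, so training effort concentrates on the hard ones.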
Unsupervised domain adaptive object detection has been addressed through stacked complementary losses. One novel aspect of the invention is the use of gradient detach training, enabled by suppressing gradients flowing back to the detection backbone. In addition, multiple complementary losses are used for better optimization.
As would be realized by one of skill in the art, the disclosed method described herein can be implemented by a system comprising a processor and memory, storing software that, when executed by the processor, performs the functions comprising the method.
As would further be realized by one of skill in the art, many variations on the implementations discussed herein which fall within the scope of the invention are possible. Moreover, it is to be understood that the features of the various embodiments described herein are not mutually exclusive and can exist in various combinations and permutations, even if such combinations or permutations are not made express herein, without departing from the spirit and scope of the invention. Accordingly, the method and apparatus disclosed herein are not to be taken as limitations on the invention but as an illustration thereof. The scope of the invention is defined by the claims which follow.
This application claims the benefit of U.S. Provisional Patent Application No. 63/147,934, filed Feb. 10, 2021, the contents of which are incorporated herein in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/014485 | 1/31/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63147934 | Feb 2021 | US |