Crowd counting has recently been a popular task in computer vision. Despite many advances, this area still relies largely on manual labor, often requiring extensive annotation of thousands of images.
In an embodiment, the present disclosure pertains to a method of training a network via a source domain of labeled image samples and a target domain of unlabeled image samples. The method includes determining, for each domain of the source domain and the target domain, an entropy loss related to the domain, determining an adversarial loss for a domain discriminator configured to predict whether a given input belongs to the source domain or the target domain, and executing domain adaptation training of the network using the entropy loss for the source domain, the entropy loss for the target domain, and the adversarial loss.
In another embodiment, the present disclosure pertains to a system having a processor and memory. The processor and memory in combination are operable to perform a method of training a network via a source domain of labeled image samples and a target domain of unlabeled image samples. The method includes determining, for each domain of the source domain and the target domain, an entropy loss related to the domain, determining an adversarial loss for a domain discriminator configured to predict whether a given input belongs to the source domain or the target domain, and executing domain adaptation training of the network using the entropy loss for the source domain, the entropy loss for the target domain, and the adversarial loss.
In an additional embodiment, the present disclosure pertains to a computer-program product having a non-transitory computer-usable medium having computer-readable program code embodied therein, the computer-readable program code adapted to be executed to implement a method for training a network via a source domain of labeled image samples and a target domain of unlabeled image samples. The method includes, for each domain of the source domain and the target domain, extracting a feature map related to one or more image samples of the domain, estimating an offset map and a classification map given the feature map, and determining an entropy loss related to the domain using information related to the classification map. The method further includes, determining an adversarial loss for a domain discriminator configured to predict whether a given input belongs to the source domain or the target domain and executing domain adaptation training of the network using the entropy loss for the source domain, the entropy loss for the target domain, and the adversarial loss. The domain discriminator is trained to produce fault predictions.
A better understanding of the present invention can be obtained when the following detailed description is considered in conjunction with the accompanying drawings.
It is to be understood that both the foregoing general description and the following detailed description are illustrative and explanatory, and are not restrictive of the subject matter, as claimed. In this application, the use of the singular includes the plural, the word “a” or “an” means “at least one”, and the use of “or” means “and/or”, unless specifically stated otherwise. Furthermore, the use of the term “including”, as well as other forms, such as “includes” and “included”, is not limiting. Also, terms such as “element” or “component” encompass both elements or components comprising one unit and elements or components that include more than one unit unless specifically stated otherwise.
The section headings used herein are for organizational purposes and are not to be construed as limiting the subject matter described. All documents, or portions of documents, cited in this application, including, but not limited to, patents, patent applications, articles, books, and treatises, are hereby expressly incorporated herein by reference in their entirety for any purpose. In the event that one or more of the incorporated literature and similar materials defines a term in a manner that contradicts the definition of that term in this application, this application controls.
Crowd counting presents numerous challenges. For example, manually observing individual chickens in commercial broiler production housing systems has become a tedious and demanding job. From laying hens to broiler chickens, there are numerous problems when housing chickens in large groups and at high density, including respiratory symptoms, bacterial infections, lameness in broilers, and feather-pecking where a chicken will peck at the feathers of another chicken and cause injuries, and the like. Early detection of abnormal behaviors of the chickens aids in the optimization of growth performance.
Vision-based crowd counting, tracking, and behavior analysis systems using advanced computer vision and machine learning methods have gained popularity over the years to automatically locate subjects and analyze location-based data for behavioral research. Following this trend, automated tracking and behavior understanding systems have become useful in quickly identifying individual abnormal chickens in large groups for populational and behavioral analysis. Tracking results of the behavior analysis can be integrated with environmental and production monitoring.
Crowd counting has recently been one of the popular tasks in computer vision. Recently developed methods and datasets have been introduced to tackle counting tasks with thousands of targets. However, in real-world scenarios, fully supervised deep learning methods usually learn to predict through a training process that requires extensive annotation of densely populated subjects in thousands of images. Directly applying deep learning models that are trained on existing datasets to a new dataset suffers from a significant performance decrease due to the domain gap, for example, gaps in the context, characteristics, and constraints of the datasets. Despite the many advances in poultry science, this is one area still largely reliant on manual labor, requiring extensive annotation of densely populated chickens in thousands of images.
Therefore, in addition to semantic scene understanding and video temporal modeling, some self-training methods utilize existing datasets with labels (i.e., a source domain) and perform counting on more open-set scenarios (i.e., a target domain) through transfer learning and domain adaptation techniques. While general self-learning methods improve generalization capability by attempting to estimate pseudo ground-truths or by distillation learning from a teacher network, few approaches investigate the alternative direction of narrowing the domain shift using entropy feedback from the target domain, a direction explored mainly in the semantic segmentation task.
In various embodiments, the artificial intelligence systems described herein directly solve the above problems and have the ability to continually learn to count notwithstanding new farm scenes, new species, and new population-level variation. Therefore, aspects of the present disclosure relate to a new training approach to the crowd counting task toward a domain adaptation setting where the deep learning algorithm utilizes entropy minimization and adversarial learning to alleviate the distributional discrepancy between the source domain and the target domain.
The present disclosure describes examples of a novel framework for self-supervised domain adaptation by entropy minimization and adversarial learning. Inspired by anchor-based and offset-based detection approaches, aspects of the present disclosure reformulate the crowd counting problem from the normal practice of estimating density maps to directly predicting target points in images. To maximize prediction certainty, the methods disclosed herein utilize the Shannon entropy formula as a loss objective function. In addition, the present disclosure relates to an adversarial learning scheme that motivates the deep learning network to produce similar distributional predictions over the source domain and the target domain. In the cross-domain setting, the method disclosed herein demonstrates substantial generalization compared to previous crowd counting methods, and further performs estimation on a new chicken counting dataset.
In contrast to prior approaches that normally learn to predict a density map, the present disclosure, in some embodiments, illustrates a network that estimates location offset maps and classification maps directly. With the source domain samples, since labels are available, a supervised Lebesgue-2 (L2) distance loss and a cross-entropy loss can be effortlessly calculated and can be used to guide the network. On the other hand, since samples in the target domain do not have labels, some recent approaches utilize output from a teacher model as a pseudo-label with lower confidence to guide the learning process. The present disclosure, in certain embodiments, adopts the Shannon entropy formulation as a loss function in order to encourage the deep network to produce a higher confidence score. To further narrow the domain gap, a discriminator, which is a fully convolutional neural network classifier, may be utilized to motivate the network to extract similar distribution outputs over both domains. This discriminator tries to determine which domain the input belongs to by learning domain classification, while the main network tries to make the discriminator produce fault predictions. The Shannon entropy loss and adversarial loss functions are added simultaneously to the main, fully supervised training process in order to realize the domain adaptation learning scheme.
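As a purely illustrative sketch (not the disclosure's actual implementation), the following PyTorch-style function shows how such a combined objective could be assembled. The names `point_net`, `discriminator`, the ground-truth dictionary keys, the tensor shapes noted in the comments, and the loss weights are hypothetical, and matching of predictions to ground-truth points is assumed to have been performed beforehand.

```python
import torch
import torch.nn.functional as F

def domain_adaptation_loss(point_net, discriminator, x_src, gt_src, x_tgt,
                           lambda_ent=1e-3, lambda_adv=1e-3):
    """Illustrative combined objective: supervised (source) + entropy (target) + adversarial."""
    # Source domain: supervised losses on the matched offset and classification maps.
    # point_net is assumed to return offsets of shape (N, 2K, H, W) and
    # per-point class logits of shape (N, 2, K, H, W).
    offs_s, cls_s = point_net(x_src)
    loss_loc = F.mse_loss(offs_s, gt_src["offsets"])        # L2 distance on point offsets
    loss_cls = F.cross_entropy(cls_s, gt_src["labels"])     # foreground/background labels (N, K, H, W)

    # Target domain: no labels; minimize the Shannon entropy of the softmax scores.
    # (An analogous entropy term may also be computed on the source domain.)
    offs_t, cls_t = point_net(x_tgt)
    p_t = F.softmax(cls_t, dim=1)
    loss_ent = -(p_t * torch.log(p_t + 1e-8)).sum(dim=1).mean()

    # Adversarial term: label target predictions as "source" (0) to fool the discriminator.
    d_out = discriminator(torch.cat([offs_t, p_t.flatten(1, 2)], dim=1))
    loss_adv = F.binary_cross_entropy_with_logits(d_out, torch.zeros_like(d_out))

    return loss_loc + loss_cls + lambda_ent * loss_ent + lambda_adv * loss_adv
```

The discriminator itself would be updated separately with a standard domain-classification loss, as sketched later in the Example section.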
Various methods may be utilized to train a network for optimized learning. For example, in certain embodiments, the training may include self-supervised domain adaptation. In some embodiments, the self-supervised domain adaptation may use entropy minimization and/or adversarial learning. As an illustrative example, such a method may proceed through the following steps.
At step 10, an entropy loss relating to a domain is determined. In certain embodiments, the domain may include, for example, a source domain and/or a target domain. In some embodiments, the source domain and/or the target domain may include various types of data. For example, in some embodiments, each domain may include one or more image samples. In certain embodiments, entropy loss may be determined by, for example, extracting a feature map related to the one or more image samples of the domain.
In certain embodiments, each cell of the feature map corresponds to a window size on the original input (e.g., the source domain). In some embodiments, given a processed feature map, as discussed in detail below, two network branches may be used to predict a point coordinate and a background-foreground classification. In some embodiments, the entropy loss may be determined by estimating an offset map and a classification map. In certain embodiments, the entropy loss may be determined from at least a portion of the classification map. In some embodiments, the offset map and/or the classification map may be based at least in part on the feature map.
In some embodiments, the offset map and/or the classification map may be estimated by predicting a point coordinate and a background-foreground classification as discussed above. In certain embodiments, the predicted background-foreground classification may include a predicted score of the point coordinate belonging to an object. In some embodiments, the offset map and the classification map may be estimated for each domain (e.g., the source domain and the target domain). In certain embodiments, the entropy loss for each domain may be based at least in part on the predicted background-foreground classification.
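As one way such prediction branches could be realized (a minimal sketch; the module name, channel counts, and kernel sizes are assumptions rather than the disclosure's exact architecture), two lightweight convolutional heads can be placed on top of the backbone feature map, compatible with the loss sketch given earlier:

```python
import torch
import torch.nn as nn

class PointProposalHeads(nn.Module):
    """Hypothetical regression and classification branches over a backbone feature map."""
    def __init__(self, in_channels: int, points_per_cell: int):
        super().__init__()
        self.k = points_per_cell
        # Regression branch: 2 offset values (dx, dy) for each of the K candidate points per cell.
        self.offset_head = nn.Conv2d(in_channels, 2 * points_per_cell, kernel_size=3, padding=1)
        # Classification branch: 2 scores (foreground, background) for each candidate point.
        self.cls_head = nn.Conv2d(in_channels, 2 * points_per_cell, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor):
        n, _, h, w = feat.shape
        offsets = self.offset_head(feat)                        # (N, 2K, H, W)
        logits = self.cls_head(feat).view(n, 2, self.k, h, w)   # (N, 2, K, H, W)
        return offsets, logits
```

For example, `PointProposalHeads(in_channels=256, points_per_cell=4)` would sit on top of a backbone whose feature map has 256 channels.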
At step 12, an adversarial loss for a domain discriminator is determined. In certain embodiments, the domain discriminator may be configured to predict whether a given input belongs to a specific domain, such as the source domain and/or the target domain. In some embodiments, the adversarial loss may be based on the offset map and/or the classification map for a specific domain (e.g., the target domain). In certain embodiments, the domain discriminator is trained to produce fault predictions.
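As one way such a domain discriminator could be realized (a generic fully convolutional patch classifier; the layer widths and strides below are assumptions, not the disclosure's exact network), the discriminator consumes the concatenated offset and classification maps and outputs a per-patch domain logit:

```python
import torch.nn as nn

class DomainDiscriminator(nn.Module):
    """Hypothetical fully convolutional domain classifier: source (0) vs. target (1)."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, kernel_size=4, stride=2, padding=1),  # per-patch domain logit
        )

    def forward(self, x):
        return self.net(x)
```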
At step 14, domain adaptation training of the network is executed. In certain embodiments, the domain adaptation training may use one or more of the entropy loss for the source domain, the entropy loss for the target domain, or the adversarial loss. Additionally, in some embodiments, the method may include calculating a supervised training loss for a domain, such as the source domain. In some embodiments, the supervised training loss may be calculated using, for example, the offset map for the source domain and the classification map for the source domain. In such embodiments, the method may include executing supervised training of the network using the supervised training loss.
In some embodiments, a distance loss may be determined using the predicted point coordinate for the source domain. In certain embodiments, a cross-entropy loss may be determined using the predicted background-foreground classification for the source domain. In some embodiments, the supervised training loss may be based at least in part on the distance loss and the cross-entropy loss.
The computing devices of the present disclosure can have various architectures. For instance, embodiments of the present disclosure as discussed herein may be implemented using a computing device 30 illustrated in the accompanying drawings.
Computing device 30 has a processor 31 connected to various other components by system bus 32. An operating system 33 runs on processor 31 and provides control and coordinates the functions of the various components of computing device 30.
Referring again to
Computing device 30 may further include a communications adapter 39 connected to bus 32. Communications adapter 39 interconnects bus 32 with an outside network (e.g., wide area network) to communicate with other devices.
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computing devices according to embodiments of the disclosure. It will be understood that computer-readable program instructions can implement each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams.
These computer-readable program instructions may be provided to a processor of a computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein includes an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks. The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computing devices according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which includes one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be accomplished as one step, executed concurrently, substantially concurrently, in a partially or wholly temporally overlapping manner, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference will now be made to more specific embodiments of the present disclosure and experimental results that provide support for such embodiments. However, Applicant notes that the disclosure below is for illustrative purposes only and is not intended to limit the scope of the claimed subject matter in any way.
Self-training crowd counting has not been extensively explored even though it is one of the important challenges in computer vision. In practice, fully supervised methods usually require intensive manual annotation. To address this challenge, this example introduces a new approach, named domain adaptation in crowd counting, that utilizes existing datasets with ground truth to produce more robust predictions on unlabeled datasets. While the network is trained with labeled data, samples without labels from the target domain are also added to the training process. In this process, the entropy map is computed and minimized in addition to an adversarial training process designed in parallel. Experiments on the Shanghaitech, UCF_CC_50, and UCF-QNRF datasets demonstrate a more generalized improvement of this method over other state-of-the-art methods in the cross-domain setting.
Crowd counting has recently been one of the popular tasks in computer vision. Recently developed methods and datasets have been introduced to tackle the counting task with thousands of targets. However, in real-world scenarios, these supervised methods usually learn to count through a training process that requires extensive annotation of densely populated points in thousands of images. Directly applying models that are trained on existing datasets to a new dataset suffers from a significant performance decrease due to the domain gap.
Therefore, in addition to semantic scene understanding and video temporal modeling, some self-training methods utilize existing datasets with labels (i.e., the source domain) and perform counting on more open-set scenarios (i.e., the target domain) by transfer learning and domain adaptation techniques. Knowledge distillation between regression-based and detection-based models has been enabled by formulating the mutual transformation of their outputs. Generalization over density variance has been enhanced by categorizing image patches into several density levels. While general self-learning methods improve generalization capability by attempting to estimate pseudo ground-truths or by distillation learning from a teacher network, few approaches investigate the alternative direction of narrowing the domain shift using entropy feedback from the target domain, a direction explored mainly in the semantic segmentation task.
In this example, Applicant introduces a new training approach to the crowd counting task toward a domain adaptation setting, where the crowd counter utilizes entropy minimization and adversarial learning to alleviate the distributional discrepancy between the source domain and the target domain. Particularly, the contributions can be summarized as follows: (1) Reformulate the crowd counting problem from normally estimating density maps to directly predicting target points in images, inspired by anchor-based and offset-based approaches; (2) Utilize the Shannon entropy formula as a loss objective function to maximize prediction certainty; (3) Design an adversarial learning scheme to motivate the network to produce similar distributional predictions over the source domain and the target domain; and (4) Evaluate the proposed method in cross-domain settings to demonstrate its substantial generalization compared against previous crowd counting methods, and further perform estimation on a new chicken counting dataset.
Point Proposal Network: In contrast to prior approaches that normally learn to predict a density map, this example designs a network to estimate head points directly. Given an RGB image $x \in \mathcal{X}_s$, the training source domain, the deep feature extracted from the backbone network $\mathcal{F}$ can be denoted as $\mathcal{F}(x)$, and its output size is $W \times H \times D$. $\mathcal{F}(x)$ involves a hyper-parameter $s$, which is the backbone's downscale stride. Each cell on the feature map $\mathcal{F}(x)$ corresponds to a window of size $s \times s$ on the original input $x$. The maximum number of points that can exist in the window is $D$ (a point's index is denoted as $k$, $k \in [0, D-1]$). Then, given the processed feature map $\mathcal{F}(x)$, two network branches are adopted to predict the point coordinate (denoted as $\hat{p}^{loc}$) and the background-foreground classification (denoted as $\hat{p}^{cls}$). From the location $(i, j)$ where the pixel is located in the feature map $\mathcal{F}(x)$, the regression branch learns to estimate $2 \times k$ offset values $(\delta_{i_k}, \delta_{j_k})$ relative to that cell. In the classification task, two predicted scores belong to the positive class $pos_k$ (an object's point) and the negative class $neg_k$ (background). The softmax function is employed to normalize the two confidence scores $\hat{p}^{cls}_{i,j,k}$ so that they follow a probability distribution whose total sums to one.
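As a concrete illustration (the raw score symbols $\hat{s}^{pos}_k$, $\hat{s}^{neg}_k$ and the exact offset parameterization are assumptions, not taken verbatim from the disclosure), the normalization and one possible recovery of the absolute point coordinate may be written as:

$$
\hat{p}^{cls}_{i,j,k} = \mathrm{softmax}\big(\hat{s}^{pos}_{k}, \hat{s}^{neg}_{k}\big)
= \left( \frac{e^{\hat{s}^{pos}_{k}}}{e^{\hat{s}^{pos}_{k}} + e^{\hat{s}^{neg}_{k}}},\ \frac{e^{\hat{s}^{neg}_{k}}}{e^{\hat{s}^{pos}_{k}} + e^{\hat{s}^{neg}_{k}}} \right),
\qquad
\hat{p}^{loc}_{i,j,k} = \big( (i + \delta_{i_k})\, s,\ (j + \delta_{j_k})\, s \big).
$$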
Supervised Training Losses: On the source domain where labels are provided, the supervised training losses on both branches are formulated as the standard ones. The $L_2$ distance and cross-entropy losses are adopted for the regression branch and the classification branch, respectively. Denoting $p^{loc}_i$, $cls^{pos}_i$, $cls^{neg}_i$ as the corresponding ground-truth values of $\hat{p}^{loc}_i$, $\hat{p}^{pos}_i$, $\hat{p}^{neg}_i$, those loss functions are defined over the matched predictions, where $\mathcal{N}$ is the set of ground-truth points and $\mathcal{M}$ is the set of proposals containing both negative and positive pixel points. $\mathcal{M}$ can be obtained from a one-to-one matching strategy (i.e., the Hungarian algorithm). Finally, the fully supervised training loss combines the two terms, where $\mathcal{L}^{\mathcal{X}_s}$ denotes a particular loss calculated on all samples from the source domain $\mathcal{X}_s$.
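For illustration, plausible forms of these standard losses, consistent with the definitions above (the normalization over $|\mathcal{N}|$ and $|\mathcal{M}|$ and the weighting by $\lambda_{loc}$ and $\lambda_{cls}$ are assumptions rather than the disclosure's exact numbered equations), are:

$$
\mathcal{L}_{loc} = \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} \big\lVert \hat{p}^{loc}_i - p^{loc}_i \big\rVert_2^2,
\qquad
\mathcal{L}_{cls} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \big( cls^{pos}_i \log \hat{p}^{pos}_i + cls^{neg}_i \log \hat{p}^{neg}_i \big),
$$

$$
\mathcal{L}^{\mathcal{X}_s}_{sup} = \lambda_{loc}\, \mathcal{L}^{\mathcal{X}_s}_{loc} + \lambda_{cls}\, \mathcal{L}^{\mathcal{X}_s}_{cls}.
$$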
On the target domain $\mathcal{X}_t$, where labels are not available, while some approaches utilize output from a teacher model as a pseudo-label with lower confidence to guide the learning process, entropy minimization is a preferable principle in self-training semantic segmentation, as demonstrated through a number of research works. By formulating the point's head classification similarly to the semantic segmentation problem, the Shannon entropy formulation can be adopted as a loss function in order to encourage the deep network to produce a higher confidence score. Given an RGB image $y \in \mathcal{X}_t$ on the target domain, the classification entropy is computed per pixel of the classification map, and the self-training entropy loss is defined as its aggregate over the map.
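For illustration, the Shannon entropy of the two-class prediction at pixel $(i, j)$ and point index $k$, and its aggregation into a loss, may be written as follows (the normalization constant is an assumption):

$$
E_{i,j,k}(y) = -\,\hat{p}^{pos}_{i,j,k} \log \hat{p}^{pos}_{i,j,k} - \hat{p}^{neg}_{i,j,k} \log \hat{p}^{neg}_{i,j,k},
\qquad
\mathcal{L}^{\mathcal{X}_t}_{ent} = \frac{1}{W H D} \sum_{i,j,k} E_{i,j,k}(y).
$$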
To further narrow the domain gap, a discriminator $\mathcal{D}$, which is a fully convolutional neural network classifier, was utilized to motivate the network to extract similar distribution outputs over both domains. This discriminator tries to determine which domain the input belongs to by learning domain classification ($x \in \mathcal{X}_s$ versus $y \in \mathcal{X}_t$), while the main network tries to make the discriminator produce fault predictions. Given the concatenation of the offset and category maps from the network ($\hat{p}^{loc} \parallel \hat{p}^{cls}$), the loss function of the discriminator is a binary domain-classification objective in which $z = 0$ if $\hat{p}$ is predicted from $x$ and $z = 1$ if $\hat{p}$ is predicted from $y$, where $x \in \mathcal{X}_s$, $y \in \mathcal{X}_t$, and $(\cdot \parallel \cdot)$ is the tensor concatenation operation.
Additionally, to narrow the produced distributions of the source domain and the target domain, an adversarial loss was added to the main network's training process. More specifically, the adversarial loss is designed to maximize the probability of the discriminator predicting the source domain class given target domain samples $y \in \mathcal{X}_t$.
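For illustration, with $\mathcal{D}(\hat{p})$ denoting the discriminator's predicted probability that its input comes from the source domain (a notational assumption), the domain-classification and adversarial losses may take the standard binary cross-entropy forms:

$$
\mathcal{L}_{\mathcal{D}} = -\big[(1 - z)\,\log \mathcal{D}(\hat{p}) + z\,\log\big(1 - \mathcal{D}(\hat{p})\big)\big],
\qquad
\mathcal{L}^{\mathcal{X}_t}_{adv} = -\log \mathcal{D}\big(\hat{p}^{loc}(y) \parallel \hat{p}^{cls}(y)\big),\ \ y \in \mathcal{X}_t.
$$

Minimizing $\mathcal{L}^{\mathcal{X}_t}_{adv}$ pushes target-domain predictions toward the distribution that the discriminator associates with the source domain.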
To summarize, the learning process of the main point proposal network combines the loss functions of Eqns. 3, 4, 7, and 9, where $\lambda_{loc}$, $\lambda_{cls}$, $\lambda_{ent}$, and $\lambda_{adv}$ are weighting parameters that balance the corresponding objective functions, and $\mathcal{L}^{\mathcal{X}_s}$ and $\mathcal{L}^{\mathcal{X}_t}$ denote particular losses calculated on all samples from the domains $\mathcal{X}_s$ and $\mathcal{X}_t$, respectively. In parallel, the discriminator $\mathcal{D}$ learns with the guidance of Eqn. 8.
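For illustration, one plausible composition of the overall objective, consistent with the description above (the entropy term may additionally be computed on the source domain, as contemplated elsewhere in this disclosure), is:

$$
\mathcal{L}_{total} = \lambda_{loc}\,\mathcal{L}^{\mathcal{X}_s}_{loc} + \lambda_{cls}\,\mathcal{L}^{\mathcal{X}_s}_{cls} + \lambda_{ent}\,\mathcal{L}^{\mathcal{X}_t}_{ent} + \lambda_{adv}\,\mathcal{L}^{\mathcal{X}_t}_{adv},
$$

while, in parallel, the discriminator minimizes $\mathcal{L}_{\mathcal{D}}$ over predictions from both domains.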
The entire training procedure is depicted in the accompanying drawing.
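As a minimal sketch of the parallel discriminator update (PyTorch-style; the function name, the optimizer argument, and the label convention of source = 0 / target = 1 are assumptions), the step might look like:

```python
import torch
import torch.nn.functional as F

def discriminator_step(discriminator, opt_disc, preds_src, preds_tgt):
    """Hypothetical discriminator update: source predictions labeled 0, target predictions 1.

    preds_src / preds_tgt are the concatenated offset and classification maps produced by the
    point proposal network; they are detached so that only the discriminator is updated here.
    """
    d_src = discriminator(preds_src.detach())
    d_tgt = discriminator(preds_tgt.detach())
    loss = (F.binary_cross_entropy_with_logits(d_src, torch.zeros_like(d_src)) +
            F.binary_cross_entropy_with_logits(d_tgt, torch.ones_like(d_tgt)))
    opt_disc.zero_grad()
    loss.backward()
    opt_disc.step()
    return loss.detach()
```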
Table 1 illustrates a comparison of error rates among loss components. Numbers in italic indicate error rates on the source domain, while underlined numbers are results on the adapted domain.
Table 2 illustrates a comparison of error rates between the approach of the disclosure and other domain adaptation (DA) and supervised methods. Numbers in italic indicate error rates on the source domain, while underlined numbers are results on the adapted domain.
To illustrate the effectiveness of each proposed objective loss in the method of this disclosure, Applicant conducted the ablative experiments shown in Tab. 1. Applicant incrementally added and removed the proposed training strategies on top of the original supervised approach. The experimental results show that the proposed losses achieve a significant improvement.
The Shanghaitech Dataset is composed of two parts, Part-A and Part-B, and contains a total of 1,198 images of 330,165 people. Applicant used these two parts to take turns as source and target domains, as shown in Tab. 2. For each method, the first row uses SHTechA as the source domain and SHTechB as the target domain, and the second row is trained in the reverse order. The results show that, with domain adaptation learning, the method of the disclosure can be aware of the target's distribution and yields better quantitative results on its samples (69.21/95.36 vs 112.24/218.18 of RDBT on SHTechA), while the performance on the source domain is not hurt (57.67/93.71 vs 53.02/88.48 on SHTechA and 8.72/12.53 vs 6.55/9.50 on SHTechB of P2P).
The UCF_CC_50 dataset and the UCF-QNRF dataset have a large variation in head counts. While the former contains only 50 images, the number of head points varies from 94 to 4,543; the latter consists of 1,535 images with 1,251,642 head points in total. Applicant used Shanghaitech Part-A as the source domain to adapt to these two datasets. The results also show that the method disclosed herein, with domain adaptation, achieves superior quantitative results on the target domain, as shown in Tab. 3 (305.57/400.62 vs 332.4/425.0 of SPN+L2SM on UCF_CC_50, and 154.73/237.84 vs 227.2/405.2 of SPN+L2SM on UCF-QNRF).
Table 3 illustrates a comparison of error rates between the approach of this disclosure and other domain adaptation (DA) and supervised methods.
Disclosed domain adaptation method: error rates of 305.57 / 400.62 on UCF_CC_50 and 154.73 / 237.84 on UCF-QNRF (adapted from Shanghaitech Part-A as the source domain).
Applicant also evaluated the proposed training method on the chicken dataset collected in farm scenes, which has not been annotated, as shown in the accompanying drawings.
The first row is the training process with entropy minimization on the target domain. Since the network is mainly guided to learn the localization and classification tasks from the human dataset, the network finds it difficult to recognize chickens as the positive class, and the result mostly returns false negatives. The second row is the training process with the adversarial loss. While the distribution gap is narrower, resulting in more densely populated predictions, the network produces more false positives by trying to match the dense distribution of the source domain. The final training process balances those loss functions with weighted parameters and yields more refined results. However, it still does not produce optimal predictions, and there are some missing counts caused by different lighting conditions (e.g., darker and brighter areas in the top-left and bottom-left corners).
In this example, Applicant has proposed a domain adaptation training scheme for the crowd counting task. The method of this disclosure is designed to minimize the domain gap between the source domain and the target domain through the entropy loss and the adversarial loss. The entropy minimization is computed on both domains while the adversarial objective minimizes the distribution discrepancy on target samples. As a result, the method shows better results on the target domain than recent self-training learning methods, while maintaining nearly the same error rates on the source domain. Furthermore, Applicant shows qualitative estimation on the chicken dataset which is used as the target domain. However, there are still some false negative counts on chickens, due to the lighting condition problem which is not fully addressed in this work.
Without further elaboration, it is believed that one skilled in the art can, using the description herein, utilize the present disclosure to its fullest extent. The embodiments described herein are to be construed as illustrative and not as constraining the remainder of the disclosure in any way whatsoever. While the embodiments have been shown and described, many variations and modifications thereof can be made by one skilled in the art without departing from the spirit and teachings of the invention. Accordingly, the scope of protection is not limited by the description set out above, but is only limited by the claims, including all equivalents of the subject matter of the claims. The disclosures of all patents, patent applications and publications cited herein are hereby incorporated herein by reference, to the extent that they provide procedural or other details consistent with and supplementary to those set forth herein.
This application claims priority to U.S. Provisional Patent Application No. 63/444,890, filed on Feb. 10, 2023. The entirety of the aforementioned application is incorporated herein by reference.
This invention was made with government support under 1946391 awarded by the National Science Foundation. The government has certain rights in the invention.