Aerial and satellite photographs of the Earth are used to determine which parts of the Earth are covered by water and which parts are covered by land. Because the photographs are collected at high altitudes, the difference between land and water is not always apparent, and as a result both people and computers struggle to correctly classify each pixel in each photograph. In particular, the operation of computers during such classification is inadequate and needs to be improved.
A system includes an aerial image database containing sensor data representing an aerial image of the Earth's surface, the sensor data comprising a feature vector for each pixel in the aerial image. A processor applies a plurality of classifiers to each feature vector to produce a plurality of classifier scores for each feature vector. The processor then determines a plurality of cluster probabilities for each feature vector, each cluster probability for a feature vector indicating a probability of the feature vector given a respective cluster of feature vectors. The processor uses the cluster probabilities for the feature vectors to form a respective weight for each of the plurality of classifiers. The processor combines the weights and the classifier scores to form an ensemble score for each pixel, the ensemble score indicating which of two possible land cover types is present on a portion of the Earth's surface represented by the pixel.
In accordance with a further embodiment, a method includes retrieving, from memory, features for a set of pixels, each pixel representing an image of a geographic area. Each pixel's features are classified using a plurality of different classifiers to generate a plurality of classifier scores for each pixel's features. A weight is determined for each classifier score for each pixel based on similarities between the pixel's features and the features used to train the respective classifier that generated the classifier score. Each weight is applied to the weight's respective classifier score to form a weighted score, and the weighted scores are combined to determine an ensemble score for each pixel. The ensemble score for each pixel is then used to designate the geographic area represented by the pixel as being one of two land cover types.
In a further embodiment, a computer-readable storage device has stored thereon computer-executable instructions that, when executed by a processor, cause the processor to perform steps. The steps include, for each pixel in an image of a geographic area, determining a plurality of classifier scores, each classifier score indicative of whether the pixel represents a first land cover type or a second land cover type. Each classifier score is weighted based on a relevance score of the classifier that generated the classifier score, the relevance score indicating the likelihood that the pixel would be part of the clusters of pixels that the classifier was trained to discriminate between. The weighted classifier scores are used to produce an ensemble score that is indicative of whether the pixel represents the first land cover type or the second land cover type.
The embodiments described below improve the operation of a computer during the task of classifying a pixel in an image as either representing land or water.
A number of binary classification problems commonly experience heterogeneity within the two classes, which is characterized by the presence of multiple modes of each of the two classes in the feature space. For example, in order to classify locations on the Earth as water or land (binary classes) using remote sensing data (explanatory features), there is a need to account for the variety of water categories (e.g. shallow water, water near swamps, etc.) and land categories (e.g. forests, shrublands, sandy soil, etc.) that exist at a global scale, resulting in a multi-modal distribution of both water and land classes.
We consider binary classification problems where the classification has to be performed over different test scenarios, and every test scenario involves only a subset of all the positive and negative modes in the data. As an illustrative example, in the context of classifying locations on the Earth as water or land, a test scenario would comprise instances observed in the vicinity of the same water body and at the same time-step. In such a setting, different pairs of positive and negative modes may emerge or disappear in different test scenarios, and even though some modes may participate in class confusion, the subset of modes appearing in a given test scenario can be considered locally separable from one another. This suggests that information about the context of a test scenario can be used to overcome class confusion.
To illustrate the importance of using the local context of a test scenario in the learning of a classifier, consider the toy dataset shown in
However, if we consider a test scenario S1,2 involving instances from P1 and N2, we notice that even though P1 and N2 are easily separable in the feature space, the presence of class confusion between P2 and N2 would hamper the classification performance on N2, since instances belonging to N2 can easily be misclassified as belonging to P2. To overcome this challenge, consider the following simplistic approach: let us assign a relevance score to every pair-wise classifier, Ci,j, in accordance with its likelihood of being used in the context of a test scenario. In particular, classifiers that discriminate between modes having a higher likelihood of being observed, given the distribution of instances in a test scenario, would receive higher relevance scores. Using this approach, we can assign a relevance score to every pair-wise classifier for both test scenarios, S1,1 and S1,2, and classify each classifier as either “Relevant” or “Not Relevant”, as summarized in Table I. For S1,1, the only relevant classifier would then be C1,1, which would correctly label all test instances in S1,1. For S1,2, both C1,2 and C2,2 would be considered relevant, as the test instances in S1,2 would show high likelihood for all three modes, P1, P2, and N2. However, C2,2 would show poor cross-validation accuracy on the training set, since it discriminates between a pair of confusing modes, P2 and N2. C2,2 could thus be discarded from the set of relevant classifiers, leaving C1,2 as the only relevant classifier for S1,2. C1,2 would then be able to correctly label all test instances in S1,2, and thus avoid class confusion in this particular situation. Note that the ability of this simplistic scheme to overcome class confusion arises from the fact that the distribution of test instances belonging to a test scenario contains reasonable information about its local context. We use this property as a guiding principle for motivating our proposed approach.
We propose the Adaptive Heterogeneous Ensemble Learning (AHEL) algorithm that takes into account the context of test instances belonging to a test scenario for overcoming class confusion in certain scenarios. We demonstrate the effectiveness of our approach in comparison with baseline approaches on a synthetic dataset and a real-world application involving global water monitoring.
Notations:
Let D = {(xi, yi)}, i = 1, . . . , n, denote the training dataset with n labeled instances, where xi ∈ ℝ^d is a d-dimensional feature vector and yi ∈ {−1, +1} is its binary response label. Let us assume that this training dataset comprises n+ positively labeled instances, denoted by X+ = {x1, . . . , xn+}, and n− negatively labeled instances, denoted by X− = {x1, . . . , xn−}. Given this training dataset, our objective is to estimate the binary response, y ∈ {−1, +1}, for every test instance, x, belonging to a test scenario, XS = {x1, . . . , xs}, of s test instances.
We present the Adaptive Heterogeneous Ensemble Learning (AHEL) algorithm, which comprises the following steps:
We assume that our training dataset, D, contains a variety of instances from all possible positive and negative modes in the data, but explicit information about the multi-modal structure of the two classes is not known and needs to be inferred. To achieve this, we consider clustering the training instances belonging to each of the two classes separately. This results in the decomposition of the positive class, X+, into m+ clusters or modes and the negative class, X−, into m− clusters or modes, respectively. The choice of the clustering algorithm and the number of clusters, m+ and m−, used for representing the multi-modality within the classes depends on the characteristics of the data. For every cluster label c, let Xc denote the set of training instances with cluster label c, where c can either be one of the positive cluster labels, P1 to Pm+, or the negative cluster labels, N1 to Nm−.
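As a rough illustration of this step (not part of the described embodiments), the following Python sketch clusters each class separately using scikit-learn's k-means; the function name cluster_classes, the choice of k-means, and the fixed random seed are our own assumptions, since the choice of clustering algorithm and of m+ and m− is left open.

```python
# Illustrative sketch: cluster the positive and negative classes separately so
# that each class's modes get their own cluster labels (P1..Pm+, N1..Nm-).
import numpy as np
from sklearn.cluster import KMeans

def cluster_classes(X_pos, X_neg, m_pos, m_neg, seed=0):
    """Return {cluster label: training instances X_c} for both classes."""
    pos_ids = KMeans(n_clusters=m_pos, n_init=10, random_state=seed).fit_predict(X_pos)
    neg_ids = KMeans(n_clusters=m_neg, n_init=10, random_state=seed).fit_predict(X_neg)
    clusters = {f"P{i + 1}": X_pos[pos_ids == i] for i in range(m_pos)}
    clusters.update({f"N{j + 1}": X_neg[neg_ids == j] for j in range(m_neg)})
    return clusters
```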
We further consider every cluster label c to have an associated conditional probability distribution, P(x|c), for every instance x ∈ ℝ^d. This can either be available as a by-product of the clustering algorithm or can be inferred from the distribution of instances in Xc. As an example, we consider P(x|c) to follow a normal distribution in the feature space, with the sample mean and sample covariance estimated from the instances in Xc.
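A minimal sketch of this normal model is given below; it fits one Gaussian per cluster from the sample mean and covariance of Xc, and the small ridge added to the covariance is our own numerical safeguard rather than part of the described approach.

```python
# Illustrative sketch: represent P(x|c) as a Gaussian fitted to each cluster X_c.
import numpy as np
from scipy.stats import multivariate_normal

def fit_cluster_densities(clusters):
    """Map each cluster label c to a frozen Gaussian whose pdf gives P(x|c)."""
    densities = {}
    for label, X_c in clusters.items():
        mean = X_c.mean(axis=0)
        cov = np.cov(X_c, rowvar=False) + 1e-6 * np.eye(X_c.shape[1])  # ridge for stability
        densities[label] = multivariate_normal(mean=mean, cov=cov)
    return densities

# Usage: densities["P1"].pdf(x) evaluates P(x | P1) for a feature vector x.
```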
We construct an ensemble of classifiers to discriminate between every pair of positive and negative cluster labels in D, similar in essence to a Bipartite One-vs-One (BOVO) ensemble construction strategy. This ensures adequate representation of every mode in the ensemble construction process, along with maintaining sufficient diversity among the classifiers. This can be contrasted with traditional ensemble learning approaches for binary classification, e.g. bagging, boosting, and random forests, which make use of random partitions of the training data as opposed to using a stratified sampling of the training instances in accordance with the multi-modal structure of the two classes.
For every pair of positive and negative cluster labels, (Pi, Nj), we learn a classifier, fl, to discriminate between XPi and XNj, using an appropriate choice of the base classifier. This results in the learning of an ensemble of classifiers {f1, . . . , fm*}, where m* = m+ × m−. We further compute the cross-validation accuracy of every classifier, fl, using 5-fold cross-validation on XPi and XNj, and use it as a measure of the accuracy of fl, denoted by Acc(fl).
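The pairwise construction can be sketched as follows; it uses an RBF-kernel SVM as the base classifier (as in the experiments described later), and the list-of-dictionaries representation of the ensemble is an illustrative convenience, not a prescribed data structure.

```python
# Illustrative sketch: train one classifier per (positive cluster, negative cluster)
# pair and record its 5-fold cross-validation accuracy Acc(f_l).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def train_bovo_ensemble(clusters, m_pos, m_neg):
    ensemble = []
    for i in range(1, m_pos + 1):
        for j in range(1, m_neg + 1):
            X_p, X_n = clusters[f"P{i}"], clusters[f"N{j}"]
            X = np.vstack([X_p, X_n])
            y = np.concatenate([np.ones(len(X_p)), -np.ones(len(X_n))])
            clf = SVC(kernel="rbf")
            acc = cross_val_score(clf, X, y, cv=5).mean()  # Acc(f_l)
            clf.fit(X, y)
            ensemble.append({"pos": f"P{i}", "neg": f"N{j}", "clf": clf, "acc": acc})
    return ensemble
```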
For every classifier, fl, we assign it a weight, w(fl,XS), representing its importance of being used for classification in the context of a test scenario, XS. In particular, we want to assign higher weights to classifiers that discriminate between pairs of modes that have a higher likelihood of being observed, given the distribution of instances in a test scenario, XS. Such a weighting scheme is achieved as follows.
For every test instance x belonging to XS, we compute its probability of being generated from a mode c as P(x|c). We can then assign a relevance score to every mode c, denoted by R(c, XS), which indicates its likelihood of being observed given the distribution of instances in XS, defined as:
For a classifier, fl, that discriminates between Pi and Nj, the relevance of using fl in the context of XS, denoted by R(fl, XS), depends on the relevance of observing modes Pi and Nj in XS, and can be estimated as:
R(fl, XS) = R(Pi, XS) × R(Nj, XS)   (2)
R(fl, XS) ensures that classifiers receive high weights only if both of the modes involved in learning fl have a high likelihood of being observed in XS. Each classifier fl is further assigned a score α(fl), denoting its ability to differentiate between its pair of participating modes. α(fl) can be computed from the cross-validation accuracy, Acc(fl), of fl.
The weight of a classifier fl in the context of test scenario XS is then estimated as:
w(fl, XS) = α(fl) × R(fl, XS)   (3)
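A hedged sketch of this weighting scheme follows. Because Equation (1) is not reproduced above, the sketch assumes that the relevance R(c, XS) of a mode is the average likelihood P(x|c) over the instances in XS, and it takes α(fl) to be the cross-validation accuracy Acc(fl); both are our reading of the description rather than verbatim formulas.

```python
# Hedged sketch of Equations (2)-(3). Assumptions: R(c, X_S) is the mean of
# P(x|c) over the test scenario, and alpha(f_l) is the CV accuracy Acc(f_l).
import numpy as np

def mode_relevance(densities, X_S):
    """R(c, X_S): assumed here to be the average likelihood of X_S under mode c."""
    return {c: float(np.mean(d.pdf(X_S))) for c, d in densities.items()}

def classifier_weights(ensemble, densities, X_S):
    """w(f_l, X_S) = alpha(f_l) * R(P_i, X_S) * R(N_j, X_S)."""
    rel = mode_relevance(densities, X_S)
    weights = []
    for member in ensemble:
        r_f = rel[member["pos"]] * rel[member["neg"]]  # Eq. (2)
        weights.append(member["acc"] * r_f)            # Eq. (3), with alpha = Acc(f_l)
    return np.asarray(weights)
```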
To illustrate the usefulness of w(fl, XS) in choosing the appropriate set of classifiers, especially in the presence of class confusion, consider a test scenario XS that involves instances from Pc and Nnc, such that Pc shows class confusion with some other mode Nc not present in XS. In such a situation, Pc, Nc, and Nnc would receive the highest relevance scores in the context of XS. By taking the products of the relevance scores, the two classifiers that would receive the highest relevance scores would then be the ones that separate (Pc, Nc) and (Pc, Nnc). On the other hand, none of the pair-wise classifiers separating Pc, Nc, and Nnc from some other mode, O, will have a high relevance score, due to the low relevance score of O. The classifier separating (Pc, Nc) will eventually receive a low weight owing to its poor cross-validation accuracy and will be discarded. Thus, the classifier separating (Pc, Nnc) will be appropriately selected with the highest weight, resulting in adequate classification performance even in the presence of class confusion.
Note that our proposed weighting scheme inherently assumes that every test scenario involves a subset of positive and negative modes that are separable from one another but may show class confusion with other modes observed globally that are not present in the current test scenario. It is also assumed that a test scenario involving a confusing mode has instances from both classes, thus requiring the use of a classifier in the first place. Furthermore, the ability of the above weighting scheme to avoid class confusion hinges on the presence of at least one non-confusing mode in the test scenario, which can dominate the assignment of relevance scores to classifiers.
We apply the ensemble of classifiers on a test instance, x ∈ XS, to obtain a vector of ensemble responses, f(x) = [f1(x), . . . , fm*(x)]. For each ensemble response, fl(x), we compute its loss w.r.t. a cluster label, c, as follows:
where Pi and Nj are the positive and negative cluster labels used for learning fl, and L(z) is an appropriate loss function, e.g. the hinge loss function, L(z) = max(1 − z, 0), commonly used with support vector machines (SVMs) as base classifiers. The combined loss of all ensemble responses w.r.t. a cluster label c is then defined as:
We choose ĉ as the cluster label that provides the minimum loss, ĉ = argmin_c Loss(c, f(x)). The test instance x is then classified as positive if ĉ is a positive cluster label; otherwise it is classified as negative.
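Since the per-response and combined loss formulas are not reproduced above, the following sketch encodes one plausible reading: each classifier contributes a hinge loss against the two cluster labels it was trained on, scaled by its weight w(fl, XS), and the cluster label with minimum combined loss determines the predicted class.

```python
# Hedged prediction sketch. Assumption: a classifier f_l trained on (P_i, N_j)
# incurs hinge loss L(f_l(x)) w.r.t. c = P_i, hinge loss L(-f_l(x)) w.r.t.
# c = N_j, and no loss for other clusters; losses are weighted by w(f_l, X_S).
import numpy as np

def hinge(z):
    return np.maximum(1.0 - z, 0.0)

def predict(ensemble, weights, densities, x):
    scores = [m["clf"].decision_function(x.reshape(1, -1))[0] for m in ensemble]
    losses = {}
    for c in densities:  # candidate cluster labels
        total = 0.0
        for m, w, s in zip(ensemble, weights, scores):
            if c == m["pos"]:
                total += w * hinge(s)    # c plays the positive role for f_l
            elif c == m["neg"]:
                total += w * hinge(-s)   # c plays the negative role for f_l
        losses[c] = total
    c_hat = min(losses, key=losses.get)  # cluster label with minimum combined loss
    return +1 if c_hat.startswith("P") else -1
```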
We compared the performance of AHEL with the baseline approach of learning a single non-linear classifier, termed the GLOBAL approach. We also compared our results with the Bipartite One-vs-One (BOVO) ensemble learning approach, which is able to handle heterogeneity within the classes but is unable to adapt its learning using the local context of a test scenario. To compare our performance with local learning algorithms, we considered the k-nearest neighbor (KNN) algorithm with k=5 as a baseline approach. Furthermore, to emphasize the importance of using the distribution of an entire group of instances belonging to a test scenario as opposed to an individual test instance, we considered a variant of our algorithm that uses instance-specific information for assigning weights to ensemble classifiers, termed the Instance-specific Heterogeneous Ensemble Learning (IHEL) algorithm. Specifically, IHEL considers the relevance of using a classifier fl on a test instance x as R(fl, x) = max(P(x|Pi), P(x|Nj)), where fl discriminates between Pi and Nj. IHEL thus follows the same formulation as AHEL, except that it uses R(fl, x) in place of R(fl, XS).
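For completeness, the instance-specific relevance used by IHEL can be sketched as a one-line variant of the earlier illustrative helpers (reusing their densities and ensemble-member dictionaries):

```python
# Illustrative sketch: R(f_l, x) = max(P(x | P_i), P(x | N_j)) for the IHEL baseline.
def instance_relevance(member, densities, x):
    return max(densities[member["pos"]].pdf(x), densities[member["neg"]].pdf(x))
```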
We used support vector machines (SVMs) with a radial basis function (RBF) kernel as the base classifier for the GLOBAL approach and all ensemble learning methods used in this paper. The optimal hyper-parameters of the SVM were chosen using 5-fold cross-validation on the training set in every experiment. The number of positive and negative clusters was kept equal in all experiments (m+ = m− = m). The classification error rate was used as the evaluation metric for comparing the performance of the classification algorithms in every experiment.
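A sketch of this base-classifier configuration is shown below; the specific grid values are assumptions, since the exact hyper-parameter search ranges are not stated.

```python
# Illustrative sketch: RBF-kernel SVM with hyper-parameters chosen by 5-fold CV.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def tuned_svm(X_train, y_train):
    grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]}  # assumed ranges
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_
```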
We considered the synthetic dataset shown in
We consider a real-world application of AHEL for monitoring water bodies at a global scale using remote sensing variables. Monitoring water bodies is important for effective water management and for understanding the impact of human actions and climate change on water bodies. To this end, remote sensing variables capture a variety of information about the Earth's surface that can be used for labeling every location on the Earth at a given time as water or land (binary classes). However, the presence of a rich variety of land and water categories that exist at a global scale makes it challenging to perform global water monitoring. There is an opportunity to overcome this challenge by using the local context of a test scenario, involving test instances observed in the vicinity of the same water body at the same time-step.
We used the seven reflectance bands collected by the MODerate-resolution Imaging Spectroradiometer (MODIS) instruments onboard NASA's satellites as the set of features for classification, which are available at 500 m spatial resolution every 8 days. Ground truth information was obtained via the Shuttle Radar Topography Mission's (SRTM) Water Body Dataset (SWBD), which provides a mapping of all water bodies for a large fraction of the Earth (60° S to 60° N), but for a single date: Feb. 18, 2000. We considered a diverse set of 99 lakes collected from different regions of the world for the purpose of evaluation. For each lake, we created a buffer region of 20 pixels at 500 m resolution around the periphery of the water body, and used the buffer region as well as the interior of the water body to construct the evaluation dataset. After removing instances at the immediate boundaries of the water bodies and ignoring instances with missing values, this evaluation dataset comprised ≈1.3 million data instances, where every instance had an associated binary label of water (positive) or land (negative). We randomly sampled 2000 instances each from both classes to construct the global training dataset. The remainder of the evaluation dataset was considered for testing. Since different pairs of water and land categories appear together in different regions of the world and at different times, we needed to consider test scenarios involving different pairs of water and land categories for the purpose of evaluation. To achieve this, we first clustered the water and land classes in the test set into m=15 clusters each using the Bisecting K-means clustering algorithm. Every pair of water and land clusters, (Wi, Lj), was then considered as a different test scenario, Si,j. We repeated the sampling procedure for obtaining the training and test sets 10 times.
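The construction of test scenarios can be sketched as follows; it uses scikit-learn's BisectingKMeans (available in scikit-learn 1.1 and later) as a stand-in for the bisecting k-means implementation actually used, and the cluster naming and pairing loop are illustrative only.

```python
# Illustrative sketch: cluster water and land test instances into m clusters each
# and treat every (water cluster, land cluster) pair as a test scenario S_{i,j}.
from itertools import product
from sklearn.cluster import BisectingKMeans

def build_test_scenarios(X_water, X_land, m=15, seed=0):
    w_ids = BisectingKMeans(n_clusters=m, random_state=seed).fit_predict(X_water)
    l_ids = BisectingKMeans(n_clusters=m, random_state=seed).fit_predict(X_land)
    return {
        (f"W{i + 1}", f"L{j + 1}"): (X_water[w_ids == i], X_land[l_ids == j])
        for i, j in product(range(m), range(m))
    }
```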
We next analyze the differences in the performance of AHEL and baseline approaches over two illustrative test scenarios, S5,1 and S10,1.
A processor in a computing device 804 executes instructions to implement a feature extractor 808 that retrieves image data 803 from the memory of data servers 806 and identifies features from the image data 803 to produce feature data 810 for each pixel in each image. Feature extractor 808 can form the feature data 810 by using the image data 803 directly or by applying one or more digital processes to image data 803 to alter the color balance, contrast, and brightness and to remove some noise from image data 803. Other digital image processes may also be applied when forming feature data 810. In addition, feature extractor 808 can extract features such that the resulting feature space enhances the ability to identify land cover types.
Experts review a portion of feature data 810 and label it to form labeled data 812. Labeled data 812 includes a feature vector for a pixel and the land cover class that the pixel belongs to. In accordance with one embodiment, binary class assignments are used such that each pixel is labeled as either water or land. Labeled data 812 is provided to a data clustering algorithm 814 as described above. Data clustering algorithm 814 first divides labeled data 812 based on the labels applied to the feature vectors. For each label (e.g. water or land), data clustering algorithm 814 groups the feature vectors into clusters based on the similarities between the feature vectors. Thus, the feature vectors labeled as water are clustered separately from the feature vectors labeled as land. Data clustering algorithm 814 also produces cluster probability distribution 816, which can be used to determine the probability of any feature vector being part of each cluster as described above.
The data clusters formed by data clustering algorithm 814 are provided to a classifier trainer 818, which trains a plurality of classifiers 820 from the data clusters. In particular, a separate classifier is trained for each possible pairing of a cluster with a water label and a cluster with a land label. For example, if there were five water clusters and six land clusters, thirty classifiers would be trained. When training a classifier for a pairing of a water cluster and a land cluster, the classifier is trained to discriminate the feature vectors of the water cluster from the feature vectors of the land cluster. Classifier trainer 818 also determines the cross-validation accuracy 822 of each classifier 820.
A test data sample set 826 is selected from feature data 810, and each feature vector in the set is applied to each of the classifiers 820 to generate a respective classifier score 828 that is indicative of which class, water or land, the classifier identifies as being more likely for that feature vector. Each feature vector of the test data sample set 826 is also provided to a classifier weight identifier 830, which also receives cross-validation accuracy 822 and cluster probability distribution 816. Classifier weight identifier 830 uses the equations described above to determine a weight 832 for each classifier. Each classifier weight 832 is based on the entire test data sample set 826 as discussed above. Ensemble scorer 834 receives the classifier weights 832 and the classifier scores 828 and combines the classifier scores and the classifier weights to form class labels 836 for each of the pixels in test data sample set 826 as discussed above.
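To show how these components line up with the algorithmic sketches given earlier, the following illustrative driver labels one test data sample set end to end; it reuses the hypothetical helpers cluster_classes, fit_cluster_densities, train_bovo_ensemble, classifier_weights, and predict defined above and is a sketch of the data flow, not the embodiment itself.

```python
# Illustrative data-flow sketch for one test data sample set (reference numerals
# in comments refer to the components described in the text).
def label_sample_set(X_water_train, X_land_train, X_test, m_pos=5, m_neg=5):
    clusters = cluster_classes(X_water_train, X_land_train, m_pos, m_neg)  # 814
    densities = fit_cluster_densities(clusters)                            # 816
    ensemble = train_bovo_ensemble(clusters, m_pos, m_neg)                 # 818/820/822
    weights = classifier_weights(ensemble, densities, X_test)              # 830/832
    return [predict(ensemble, weights, densities, x) for x in X_test]      # 834/836
```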
Class labels 836 can be used by a user interface generator 840 implemented by a processor to generate a user interface on a display 842. In accordance with one embodiment, the user interface produced by user interface generator 840 comprises a color-coded image indicating the land cover state of each pixel. Using the color coding, the land cover state of each pixel in an image can be quickly conveyed to the user through the user interface on display 842. Alternatively, user interface generator 840 may generate statistics indicating the number or percentage of each land cover state in each image or across multiple image areas. These statistics can be displayed to the user through a user interface on display 842.
An example of a computing device that can be used as computing device 804, data server 806, and receiving station 802 in the various embodiments is shown in the block diagram of
Embodiments of the present invention can be applied in the context of computer systems other than computing device 10. Other appropriate computer systems include handheld devices, multi-processor systems, various consumer electronic devices, mainframe computers, and the like. Those skilled in the art will also appreciate that embodiments can also be applied within computer systems wherein tasks are performed by remote processing devices that are linked through a communications network (e.g., communication utilizing Internet or web-based software systems). For example, program modules may be located in either local or remote memory storage devices or simultaneously in both local and remote memory storage devices. Similarly, any storage of data associated with embodiments of the present invention may be accomplished utilizing either local or remote storage devices, or simultaneously utilizing both local and remote storage devices.
Computing device 10 further includes a hard disc drive 24, an external memory device 28, and an optical disc drive 30. External memory device 28 can include an external disc drive or solid state memory that may be attached to computing device 10 through an interface such as Universal Serial Bus interface 34, which is connected to system bus 16. Optical disc drive 30 can illustratively be utilized for reading data from (or writing data to) optical media, such as a CD-ROM disc 32. Hard disc drive 24 and optical disc drive 30 are connected to the system bus 16 by a hard disc drive interface 32 and an optical disc drive interface 36, respectively. The drives and external memory devices and their associated computer-readable storage media provide nonvolatile storage media for the computing device 10 on which computer-executable instructions and computer-readable data structures may be stored. Other types of media that are readable by a computer may also be used in the exemplary operation environment.
A number of program modules may be stored in the drives and RAM 20, including an operating system 38, one or more application programs 40, other program modules 42 and program data 44. In particular, application programs 40 can include programs for executing the methods described above including feature extraction, data clustering, classifier training, classifier execution, classifier weight identification, ensemble scoring and user interface generation. Program data 44 may include image data, feature data, class labels, cluster probability functions, classifier accuracy, classifier weights, labeled data, classifier scores and class labels.
Input devices, including a keyboard 63 and a mouse 65, are connected to the system bus 16 through an Input/Output (I/O) interface 46. Monitor 48 is connected to the system bus 16 through a video adapter 50 and provides graphical images to users. Other peripheral output devices (e.g., speakers or printers) could also be included but have not been illustrated. In accordance with some embodiments, monitor 48 comprises a touch screen that both displays output to the user and provides input by detecting the locations on the screen where the user is contacting the screen.
The computing device 10 may operate in a network environment utilizing connections to one or more remote computers, such as a remote computer 52. The remote computer 52 may be a server, a router, a peer device, or other common network node. Remote computer 52 may include many or all of the features and elements described in relation to computing device 10, although only a memory storage device 54 has been illustrated in
The computing device 10 is connected to the LAN 56 through a network interface 60. The computing device 10 is also connected to WAN 58 and includes a modem 62 for establishing communications over the WAN 58. The modem 62, which may be internal or external, is connected to the system bus 16 via the I/O interface 46.
In a networked environment, program modules depicted relative to the computing device 10, or portions thereof, may be stored in the remote memory storage device 54. For example, application programs may be stored utilizing memory storage device 54. In addition, data associated with an application program, such as data stored in the databases or lists described above, may illustratively be stored within memory storage device 54. It will be appreciated that the network connections shown in
We consider binary classification problems where both classes show a multi-modal distribution in the feature space and the classification has to be performed over different test scenarios, where every test scenario involves only a subset of all the positive and negative modes in the data. We propose the Adaptive Heterogeneous Ensemble Learning (AHEL) algorithm that constructs an ensemble of classifiers to discriminate between every pair of positive and negative modes, and uses the local context of test scenarios for adaptively weighting the ensemble of classifiers. We demonstrate the effectiveness of AHEL in comparison with baseline approaches on a synthetic dataset and a real-world application involving global water monitoring.
Although the present invention has been described with reference to preferred embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.
The present application is based on and claims the benefit of U.S. provisional patent application Ser. No. 62/278,182, filed Jan. 13, 2016, the content of which is hereby incorporated by reference in its entirety.
This invention was made with government support under 1029711 and 0905581 awarded by the National Science Foundation (NSF) and NNX12AP37G awarded by the National Aeronautics and Space Administration (NASA). The government has certain rights in the invention.
Number | Date | Country
--- | --- | ---
62/278,182 | Jan. 13, 2016 | US