The present disclosure relates generally to fairness in visual clustering, and more particularly, but not by way of limitation, to a novel transformer clustering approach for visual clustering.
In an embodiment, the present disclosure pertains to a method for evaluating demographic bias of images in a model having multiple clusters of images. In some embodiments, the method may include the steps of: determining demographic bias of the images in each of the multiple clusters of the model via a cluster purity evaluation module, encouraging a demographic fairness consistency for each of the multiple clusters via a loss function module to maintain fairness of the model, identifying, via a cross-attention module, correlations between each of the multiple clusters, and strengthening, via the cross-attention module, samples to have a stronger relationship with a centroid of each of the multiple clusters.
In a further embodiment, the present disclosure pertains to a system having a processor coupled to a memory. In some embodiments, the processor is operable to implement a method including the steps of: determining demographic bias of the images in each of the multiple clusters of the model via a cluster purity evaluation module, encouraging a demographic fairness consistency for each of the multiple clusters via a loss function module to maintain fairness of the model, identifying, via a cross-attention module, correlations between each of the multiple clusters, and strengthening, via the cross-attention module, samples to have a stronger relationship with a centroid of each of the multiple clusters.
In an additional embodiment, the present disclosure pertains to a computer-program product having a non-transitory computer-usable medium having computer-readable program code embodied therein. In some embodiments, the computer-readable program code is adapted to be executed to implement a method including the steps of: determining demographic bias of the images in each of the multiple clusters of the model via a cluster purity evaluation module, encouraging a demographic fairness consistency for each of the multiple clusters via a loss function module to maintain fairness of the model, identifying, via a cross-attention module, correlations between each of the multiple clusters, and strengthening, via the cross-attention module, samples to have a stronger relationship with a centroid of each of the multiple clusters.
A better understanding of the present invention can be obtained when the following detailed description is considered in conjunction with the accompanying drawings.
It is to be understood that both the foregoing general description and the following detailed description are illustrative and explanatory, and are not restrictive of the subject matter, as claimed. In this application, the use of the singular includes the plural, the word “a” or “an” means “at least one”, and the use of “or” means “and/or”, unless specifically stated otherwise. Furthermore, the use of the term “including”, as well as other forms, such as “includes” and “included”, is not limiting. Also, terms such as “element” or “component” encompass both elements or components comprising one unit and elements or components that include more than one unit unless specifically stated otherwise.
The section headings used herein are for organizational purposes and are not to be construed as limiting the subject matter described. All documents, or portions of documents, cited in this application, including, but not limited to, patents, patent applications, articles, books, and treatises, are hereby expressly incorporated herein by reference in their entirety for any purpose.
In the event that one or more of the incorporated literature and similar materials defines a term in a manner that contradicts the definition of that term in this application, this application controls.
Unsupervised learning for automated object and human understanding has recently become one of the most popular research topics. This is because of the nature of the extensive collection of available raw data without labels and the demand for consistent visual recognition algorithms across various challenging conditions. Standard visual recognition (e.g., face recognition or visual landmark recognition) has recently achieved high-performance accuracy in numerous practical applications where probe and gallery visual photos are collected in different environments. Together with accuracy, algorithmic fairness has recently received broad attention from research communities as it may bring an enormous impact on applications deployed in practice. For example, a face recognition system giving very impressive accuracy on white faces while having a very high false positive rate on non-white faces could result in unfair treatment of individuals across different demographic groups. Therefore, by defining a sensitive (protected) attribute (such as gender, ethnicity, or age), a fair and accurate recognition system is particularly desirable.
In most recent studies of deep clustering, clustering functions are often Graph Convolution Neural Networks (GCN) or Transformer-based Networks. Both approaches share the same goal: determine which samples in a cluster are different from the centroid and remove them from the cluster. In GCN-based methods, each cluster is treated as a graph where each vertex represents a sample, and an edge connecting two vertices illustrates how strong the connection between them is. The GCN method aims to explore the correlation of a vertex with its neighbors. Recent studies presented a method to estimate the confidence (GCN-V) of a vertex being a true positive sample or an outlier of a given cluster. In addition, GCN-E is also introduced to predict whether two vertices belong to the same class (or cluster).
However, in minor clusters where purity is low and connections between vertices are weak, the network becomes ineffective. Other GCN-based approaches also face the same issue. Transformers have been applied to many tasks in Natural Language Processing (NLP) and computer vision. A recent study introduced a new Transformer-based architecture, Clusformer, for visual clustering. This method treats a cluster as a sequence starting with the cluster's centroid and followed by its neighbors in decreasing order of similarity. Then, Clusformer predicts which vertices are of the same class as the centroid.
Another study leveraged a transformer-based approach as the feature encoder for face clustering. Fairness has garnered much attention in recent computer vision and deep learning research. The most common objective is to improve fairness by lowering the model's accuracy disparity between images from various demographic sub-groups. Facial recognition is one of the most common topics, attracting numerous researchers. There are notable datasets for fair face recognition such as the Diversity in Faces (DiF) dataset, Racial Faces in-the-Wild (RFW), BUPT-GlobalFace, and BUPT-BalancedFace.
Fairness in the visual clustering problem has also been addressed. A recent study proposed a deep fair clustering method to hide sensitive attributes. Another study on fairness in unsupervised outlier detection proposed a Deep Clustering based Fair Outlier Detection framework (DCFOD) to simultaneously achieve detection validity and group fairness. Experiments were conducted on the relatively small MNIST database. However, there has been no prior large-scale face clustering work with fairness criteria.
Thus, a need exists for a method to make facial grouping in computer vision fairer. As such, the disclosure addresses concerns regarding unfair treatment that arises in different demographic groups in facial clustering algorithms used in computer vision systems.
Unlike the few concurrent studies on the fairness of computer vision, this disclosure studies how to address the unfair facial clustering problem. The disclosure introduces cluster purity as an indicator of demographic bias. Secondly, this disclosure proposes a novel loss to enforce the consistency of purity between clusters of different groups; thus, fairness can be achieved. Finally, a novel framework for visual clustering is presented to strengthen the hard clusters, which usually come from the minor/biased group. This framework contributes not only to fairness but also to overall clustering performance.
In some embodiments, the present disclosure pertains to a method for evaluating demographic bias of images in a model.
As illustrated in the accompanying figures, the method may include the steps described below.
In some embodiments, the determining may include calculating the ratio of positive samples within each of the multiple clusters to their correlation degree. In some embodiments, the method further includes using the calculation as an indication of the demographic bias in the model. In some embodiments, the determined demographic bias may include, without limitation, biases based, at least in part, on gender, ethnicity, race, age, or combinations of the same and like.
In some embodiments, the images include facial images of a plurality of subjects. In some embodiments, the images may include unlabeled images. In some embodiments, the images may include unlabeled photographic images. In some embodiments, the model may include, for example, a deep clustering model having multiple clusters of images. In some embodiments, the model may include, for example, a k-means clustering model, a Gaussian mixture model, a hierarchical clustering model, a density-based spatial clustering model, a spectral clustering model, self-organizing maps, t-distributed stochastic neighbor embedding models, and combinations of the same and like.
In some embodiments, the loss function module may include, without limitation, a Fowlkes-Mallows-based index. In some embodiments, the step of encouraging demographic fairness may include, for example, improving purity, via the loss function module, of the multiple clusters. In some embodiments, the purity may include an indicator of demographic bias between clusters of different groups.
While the present disclosure focuses on a Fowlkes-Mallows index to evaluate the quality of clustering, alternative evaluation metrics and/or loss function techniques may include, without limitation, silhouette scores, Davies-Bouldin indexes, adjusted Rand indexes, mutual information techniques, normalized mutual information techniques, V-measure techniques, cluster purity methods, confusion matrix-based metrics, Calinski-Harabasz indexes, Dunn indexes, and combinations of the same and like.
Additional embodiments of the present disclosure pertain to a system for evaluating demographic bias of images in a model. In some embodiments, the model may be a deep clustering model. In some embodiments, the model includes multiple clusters of images (e.g., photographic images). In some embodiments, the system includes a processor coupled to a memory. In some embodiments, the processor is operable to implement a method as outlined in detail above.
In some embodiments, the system includes a computing device. In some embodiments, the computing device includes one or more computer readable storage mediums having at least one program code embodied therewith and may be operable to implement the methods as described herein. In some embodiments, the program code of the system includes programming instructions for determining demographic bias of the images in each of the multiple clusters of the model via a cluster purity evaluation module. In some embodiments, the program code of the system includes programming instructions for encouraging a demographic fairness consistency for each of the multiple clusters via a loss function module to maintain fairness of the model.
In some embodiments, the program code of the system includes programming instructions for identifying, via a cross-attention module, correlations between each of the multiple clusters and strengthening, via the cross-attention module, samples to have a stronger relationship with a centroid of each of the multiple clusters.
In some embodiments, the system of the present disclosure also includes the cluster purity evaluation module. In some embodiments, the system of the present disclosure also includes the loss function module. In some embodiments, the system of the present disclosure also includes the cross-attention module. In some embodiments, each module may be separate or combined into a single module.
Additionally and/or alternatively, in some embodiments, the present disclosure relates to a computer-program product having a non-transitory computer-usable medium having computer-readable program code embodied therein. In some embodiments, the computer-readable program code is adapted to be executed to implement the methods as disclosed above.
In some embodiments, the present disclosure can provide systems and methods for improved fairness. In some embodiments, by addressing demographic biases in facial clustering, the systems and methods promote fairness and equal treatment of different groups in computer vision systems. In some embodiments, the systems and methods provide enhanced clustering performance. For example, the novel framework not only ensures fairness but also contributes to better overall performance in clustering tasks.
In some embodiments, the systems and methods of the disclosure provide better decision-making using fair and unbiased computer vision systems that lead to more accurate and informed decisions in various applications, such as security, marketing, and hiring. Additionally, in some embodiments, the systems and methods provided herein allow for ethical artificial intelligence development by addressing fairness concerns in aspects of artificial intelligence technology.
The systems and methods of the present disclosure may be utilized for surveillance and security to enhance the fairness of facial recognition systems used in public spaces, airports, and other security-sensitive locations to reduce biased outcomes. Additionally, the systems and methods may be utilized in social media to improve the grouping and tagging of users' images and videos by ensuring fairness in clustering algorithms, leading to better content organization and recommendations. Alternatively, the systems and methods may be used for hiring and recruitment to ensure unbiased analysis of job applicants' photos and videos during automated screening processes, thus promoting equal opportunities for candidates from different backgrounds. Other applications are readily envisioned in, for example: (1) marketing and advertising, for analyzing customer demographics fairly for targeted marketing campaigns, ensuring better representation of various groups in ads and promotional materials; (2) healthcare and medical imaging, to improve the accuracy and fairness of computer vision algorithms used to analyze medical images by reducing potential biases in identifying patterns across diverse patient populations; (3) law enforcement, for enhancing the fairness of facial recognition systems used by police and other law enforcement agencies, helping to prevent wrongful identifications or unjust targeting of specific demographic groups; (4) entertainment and media, to ensure fair representation of diverse actors and models in casting processes by using unbiased computer vision systems to analyze auditions and portfolios; and (5) smart city applications, for creating unbiased artificial intelligence-driven services for city dwellers by using fair facial clustering algorithms in traffic management, public transportation, and other smart city initiatives.
Reference will now be made to more specific embodiments of the present disclosure and experimental results that provide support for such embodiments. However, Applicant notes that the disclosure below is for illustrative purposes only and is not intended to limit the scope of the claimed subject matter in any way.
Promoting fairness for deep clustering models in unsupervised clustering settings to reduce demographic bias is a challenging goal. This is because of the limitation of large-scale balanced data with well-annotated labels for sensitive or protected attributes. In this Example, Applicants first evaluate demographic bias in deep clustering models from the perspective of cluster purity, which is measured by the ratio of positive samples within a cluster to their correlation degree. This measurement is adopted as an indication of demographic bias. Then, a novel loss function is introduced to encourage a purity consistency for all clusters to maintain the fairness aspect of the learned clustering model. Moreover, Applicants present a novel attention mechanism, Cross-attention, to measure correlations between multiple clusters, strengthening faraway positive samples and improving the purity of clusters during the learning process. Experimental results on a large-scale dataset with numerous attribute settings have demonstrated the effectiveness of the proposed approach on both clustering accuracy and fairness enhancement on several sensitive attributes.
Unsupervised learning for automated object and human understanding has recently become one of the most popular research topics. This is because of the nature of the extensive collection of available raw data without labels and the demand for consistent visual recognition algorithms across various challenging conditions. Standard visual recognition, e.g., Face Recognition or Visual Landmark Recognition, has recently achieved high-performance accuracy in numerous practical applications where probe and gallery visual photos are collected in different environments.
Together with accuracy, algorithmic fairness has recently received broad attention from research communities as it may bring an enormous impact on applications deployed in practice. For example, a face recognition system giving very impressive accuracy on white faces while having a very high false positive rate on non-white faces could result in unfair treatment of individuals across different demographic groups. Therefore, by defining a sensitive (protected) attribute (such as gender, ethnicity, or age), a fair and accurate recognition system is particularly desirable.
To alleviate demographic bias for better fairness, many efforts have been directed toward deep learning approaches that extract less biased representations. A straightforward approach is to collect balanced datasets with respect to the sensitive attributes and use them for the training process. However, this approach requires extreme effort to collect relatively balanced samples as well as annotations for sensitive attributes. Otherwise, an insufficient number of training samples may also reduce the accuracy of the learned models. Interestingly, approaches relying on balanced training data still suffer from demographic bias to some degree. Another approach is to focus on algorithmic designs such as a Fairness-Adversarial Loss, maximizing the conditional mutual information (CMI) between inputs and cluster assignments given sensitive attributes, Domain Adaptation, or a deep information maximization adaptation network that transfers recognition knowledge from one demographic group to other groups. Although these methods have shown their advantages in reducing demographic bias without requirements of balanced data, they still lack generalization capability (i.e., they learn for a particular sensitive attribute). Moreover, they also rely on accurate annotations for that sensitive attribute during learning.
In this Example, one of the goals focuses on promoting fairness for deep clustering models in unsupervised clustering settings. Motivated by the fact that clusters of major demographic groups with many samples consist of only a few negative samples and that their correlations are very strong, while those of minor demographic groups usually contain a large number of noisy or negative samples with weak correlations among positive ones, the cluster purity, i.e., the ratio of the number of positive samples within a cluster to their correlations, plays an important role in measuring clustering bias across demographic groups. Therefore, Applicants first evaluate the bias in deep clustering models from the perspective of cluster purity. Then, by encouraging consistency of the purity aspect for all clusters, demographic bias can be effectively mitigated in the learned clustering model. In the scope of this Example, Applicants further assume that the annotations for sensitive attributes are inaccessible due to the expensive effort of collecting them and/or privacy concerns in annotating them.
Contributions. In summary, the contributions are four-fold. (1) The cluster purity and the correlation between positive samples are first analyzed and adopted as an indication of demographic bias for a visual clustering approach. (2) By promoting purity similarity across clusters, visual clustering fairness is effectively achieved without requiring annotations of sensitive attributes. This property is formed into a novel loss function to improve fairness with respect to various kinds of sensitive attributes. (3) In terms of deep network design, Applicants present a novel attention mechanism, termed Cross-attention, to measure correlations between multiple clusters and help faraway samples have a stronger relationship with the centroid. (4) Finally, by comprehensive experiments, the proposed approach consistently achieves State-of-the-Art (SOTA) results compared to recent clustering methods on a standard large-scale visual benchmark across various demographic attributes, i.e., ethnicity, age, gender, and race, as shown in the accompanying figures.
Deep Clustering. In most recent studies of deep clustering, clustering functions are often Graph Convolution Neural Networks (GCN) or Transformer-based Networks. Both approaches share the same goal: determine which samples in a cluster are different from the centroid and remove them from the cluster.
In GCN-based methods, each cluster is treated as a graph where each vertex represents a sample, and an edge connecting two vertices illustrates how strong the connection between them is. The GCN method aims to explore the correlation of a vertex with its neighbors. A method to estimate the confidence (GCN-V) of a vertex being a true positive sample or an outlier of a given cluster has been previously presented. In addition, GCN-E is also introduced to predict whether two vertices belong to the same class (or cluster). However, in minor clusters where purity is low and connections between vertices are weak, the network becomes ineffective. Other GCN-based approaches also face the same issue. Transformers have been applied to many tasks in Natural Language Processing (NLP) and computer vision. Others introduced a new Transformer-based architecture, Clusformer, for visual clustering. This method treats a cluster as a sequence starting with the cluster's centroid and followed by its neighbors in decreasing order of similarity. Then, Clusformer predicts which vertices are of the same class as the centroid. Some have leveraged transformer-based approaches as the feature encoder for face clustering.
Fairness in Computer Vision. Fairness has garnered much attention in recent computer vision and deep learning research. The most common objective is to improve fairness by lowering the model's accuracy disparity between images from various demographic sub-groups. Facial recognition is one of the most common topics, attracting numerous researchers. There are notable datasets for fair face recognition such as the Diversity in Faces (DiF) dataset, Racial Faces in-the-Wild (RFW), BUPT-GlobalFace, and BUPT-BalancedFace. Fairness in the visual clustering problem is also addressed. Researchers have proposed a deep fair clustering method to hide sensitive attributes. Others focused on fairness in unsupervised outlier detection and proposed a Deep Clustering based Fair Outlier Detection framework (DCFOD) to simultaneously achieve detection validity and group fairness. Experiments were conducted on the relatively small MNIST database.
These prior works were not designed to be robust and scalable. Training these models is expensive as they require demographic attributes that are costly to annotate. In addition, these clustering approaches are not practical when dealing with unseen/unknown subjects, as these methods require defining the number of clusters beforehand.
This Example is the first work dealing with large-scale clustering with fairness criteria without involving demographic attributes during training.
Let D be a set of N data points to be clustered. Applicants define H:D→X⊂RN×d as a function that embeds these data points into a latent space X of d dimensions. Let Ŷ be the set of ground-truth cluster IDs assigned to the corresponding data points. Since Ŷ has at most N different cluster IDs, one can represent Ŷ as a vector in [0, N−1]N. A deep clustering algorithm Φ:X→Y∈[0, N−1]N is defined as a mapping that maximizes a clustering metric σ(Y, Ŷ). As Ŷ is not accessible during the training stage, the number of clusters and the number of samples per cluster are not given beforehand. Therefore, a divide-and-conquer philosophy can be adopted to relax the problem. In particular, the deep clustering algorithm Φ can be divided into three subfunctions: a pre-processing function K, an in-processing function M, and a post-processing function P. Formally, Φ can be presented as in Eqn. (1).
where K denotes an unsupervised clustering algorithm on deep features xi of the i-th sample. k-NN is usually a common choice for K. In addition, for the post-processing P function, it is usually a rule-based algorithm to merge two or more clusters into one when they share high concavities. Since P does not depend on the input data points nor is a learnable function, the parameters θM of Φ can be optimized via this objective function:
where k is the number of the nearest neighbors of input xi, qt is the objective of M to be optimized, and L denotes a suitable clustering or classification loss.
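By way of illustration only, a minimal sketch of the pre-processing function K is provided below, assuming a k-nearest-neighbor search over unit-normalized features with a cosine metric. The function name, the scikit-learn calls, and the neighbor counts are illustrative assumptions and not limitations of the present disclosure.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_preprocess(features: np.ndarray, n_neighbors: int = 256) -> np.ndarray:
    """Return an (N, n_neighbors) array of indices; row i is the initial cluster N(xi),
    ordered by decreasing similarity and starting with xi itself (the centroid)."""
    features = features / np.linalg.norm(features, axis=1, keepdims=True)  # unit-normalize
    nn = NearestNeighbors(n_neighbors=n_neighbors, metric="cosine").fit(features)
    _, indices = nn.kneighbors(features)  # indices[i, 0] == i, i.e., the centroid itself
    return indices

# Toy usage with random 512-dimensional embeddings
X = np.random.randn(1000, 512).astype(np.float32)
clusters = knn_preprocess(X, n_neighbors=16)
print(clusters.shape)  # (1000, 16)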
Additionally, let G={g1, g2, . . . , gp} be a set of sensitive attributes (i.e., race, gender, age). Applicants further seek predictions of Φ that are both accurate and fair with respect to G. In other words, the fairness criteria require that Y not be biased in favor of one attribute or another, i.e., P(Y|G=g1)=P(Y|G=g2)= . . . =P(Y|G=gp). To achieve this goal, Applicants define μi=Ex∼p(gi)[Lfair(M∘K(x, n), qt)], i.e., the expected loss over samples drawn from the distribution p(gi) of the i-th sensitive attribute.
In order to produce a fair prediction across sensitive attributes, ΔDP (Φ) should be minimized during the learning process. Therefore, from Eqn. (2) and Eqn. (3), the objective function with fairness factor can be reformulated as:
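By way of illustration only, the following sketch shows one way such a group discrepancy could be estimated empirically: compute the expected loss μi for each sensitive group and take the largest pairwise gap. The max-gap form, the function name, and the toy values are illustrative assumptions; the exact form of ΔDP(Φ) in Eqn. (3) may differ.

import numpy as np

def group_discrepancy(losses: np.ndarray, groups: np.ndarray) -> float:
    """losses: per-sample loss values; groups: per-sample sensitive-attribute IDs."""
    mus = [losses[groups == g].mean() for g in np.unique(groups)]  # mu_i for each group
    return float(max(mus) - min(mus))  # zero when every group has the same expected loss

# Toy example: two groups with unequal average losses
loss = np.array([0.2, 0.3, 0.9, 1.1])
grp = np.array([0, 0, 1, 1])
print(group_discrepancy(loss, grp))  # 0.75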
Theoretical Analysis. From the definition of μi, since Lfair(M∘K(xi, n), qt) is a loss function (i.e., binary cross-entropy), the expectation of this function is greater than 0.
Applying Lemma 1 to Eqn. (4), a new fairness objective function can be derived as,
In practice, each sensitive distribution p(gi) is unknown, expensive to measure, or even shifts over time. Moreover, some sensitive attributes have large and dominant sample counts, while the minor groups have few samples. Thus, optimizing the second objective (O2) while maintaining fairness in clustering remains challenging. The biases in clustering and their relationship to cluster purity are discussed further below.
Bias in Clustering and Its Effects. Since the training data points D for a clustering approach are collected under limited environments and constraints, the distributions of sensitive attributes are usually imbalanced. Such data often lead to subsequent unfairness at every learning stage and in the overall system. Let Dmajor and Dminor represent the samples of the major and minor attribute groups, respectively. Because |Dmajor|>>|Dminor|, a feature extractor H trained on D tends to generate more discriminative features for Dmajor than for Dminor.
Bias in Cluster Purity. A domino effect from D to H also leads to another bias in the predicted clusters, namely cluster purity. Particularly, latent features xi∈X are fed to K to construct predicted clusters denoted as N(xi)=M∘K(xi, n). As latent features of samples with sensitive attributes belonging to Dmajor are discriminative, their correlations are quite strong. Therefore, their constructed clusters have many positive samples, making their purity very high. Meanwhile, weak connections between samples of Dminor may result in many noisy samples within a cluster, causing very low-quality clusters. Let the cluster purity γ(xi) be the ratio of the number of true positive samples within a predicted cluster:
where N+(xi)={xj|xj∈N(xi)∧yj=yi} denotes the true positive set in the predicted cluster N(xi), and n− is the number of predicted negative samples. Here, yi, yj∈Ŷ are the corresponding cluster IDs. Note that the cluster purity γ(xi) not only provides the rate of correctly predicted samples but also indicates the rate of positive samples that cannot be identified by Φ. Due to the bias, the purity of clusters constructed from Dmajor tends to be much higher than the purity of clusters constructed from Dminor.
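By way of illustration only, a minimal sketch of a cluster-purity computation is given below, assuming purity is taken as the fraction of samples in N(xi) that share the centroid's ground-truth ID. The exact form in Eqn. (5), which also accounts for the correlation degree, may differ; the function name and toy labels are illustrative assumptions.

import numpy as np

def cluster_purity(neighbor_ids: np.ndarray, labels: np.ndarray) -> float:
    """neighbor_ids: indices of the samples in N(xi), with neighbor_ids[0] the centroid.
    labels: ground-truth cluster IDs of all samples."""
    centroid_label = labels[neighbor_ids[0]]
    positives = np.sum(labels[neighbor_ids] == centroid_label)  # |N+(xi)|
    return positives / len(neighbor_ids)

labels = np.array([5, 5, 5, 7, 5, 9])
print(cluster_purity(np.array([0, 1, 3, 4]), labels))  # 0.75: one negative sample in the cluster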
In general, directly mitigating this kind of bias via a “perfect balanced training dataset” is infeasible due to (1) the considerable effort to collect millions of images and their annotations for sensitive attributes; and (2) the numerous enumerations of various sensitive attributes such as race, age, and gender. Therefore, rather than focusing on constructing a balanced training dataset for different sensitive attributes, Applicants propose a penalty loss to promote the purity consistency of all clusters across demographic groups. In this way, the rate of correctly predicted positive samples, as well as the rate of missing positive samples, can be maintained to be similar among all clusters in both Dminor and Dmajor, thereby enhancing the fairness of the model's predictions. In addition, Applicants further propose an Intraformer architecture to encourage more robust connections between positive samples that are far away from the cluster's centroid. It can effectively enhance the purity of hard clusters belonging to minority groups.
Clustering Accuracy Penalty. In previous methods, clustering performance is optimized via a Binary Cross Entropy (BCE) or Softmax function. However, none of these loss functions reflect clustering metrics such as Fowlkes-Mallows. To achieve better performance, Applicants introduce a novel loss function named the Fowlkes-Mallows Loss. No prior study has adopted a supervised loss of this kind for the unsupervised clustering problem, so this is the first time a Fowlkes-Mallows-based loss has been introduced in deep learning. Given an input cluster N(xi), the corresponding output of the network is qi∈[0, 1], estimating the probability that the i-th sample has the same cluster ID as the centroid or not. The Fowlkes-Mallows Index (FMI) is measured as,
Since 0≤FMI≤1, in order to maximize FMI, Applicants minimize 1−f(qi, qt).
Theoretical Analysis. Since TP, FN, and FP≥0, following the Cauchy-Schwarz inequality:
The upper bound of 1−f(qi, q̂t) is derived as follows:
From this observation, Applicants can maximize the FMI by minimizing the function LFMI. In practice, the implementation of LFMI is presented in Algorithm 1. The converging point is when FP=FN=0 and all TP samples are estimated correctly. Consequently, Applicants adopt LFMI for L in Eqn. (6) to maximize the clustering performance.
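By way of illustration only, a minimal sketch of a differentiable Fowlkes-Mallows-style loss is given below, using soft TP/FP/FN counts computed from per-sample probabilities. This is a sketch of the idea behind LFMI under those assumptions, not the exact Algorithm 1 of the disclosure.

import torch

def soft_fmi_loss(q: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """q: predicted probabilities in [0, 1] that each sample shares the centroid's cluster ID.
    target: 1 for positive samples, 0 for negatives."""
    tp = (q * target).sum()            # soft true positives
    fp = (q * (1.0 - target)).sum()    # soft false positives
    fn = ((1.0 - q) * target).sum()    # soft false negatives
    fmi = tp / torch.sqrt((tp + fp) * (tp + fn) + eps)
    return 1.0 - fmi                   # minimized when FP = FN = 0

q = torch.tensor([0.9, 0.8, 0.2, 0.1], requires_grad=True)
t = torch.tensor([1.0, 1.0, 0.0, 1.0])
loss = soft_fmi_loss(q, t)
loss.backward()                        # gradients flow back to the predictions
print(float(loss))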
Solution to Minimize the Group Discrepancy ΔDP(Φ) of Eqn. (3). Formally, (O2) can be reformulated as:
where Lfair(·) penalizes the discrepancy between the purity γi of the i-th cluster and a reference γf within a batch, as in Eqn. (12).
Here, B is the batch size, and γf is the fairness point to which Applicants want all clusters' purity to converge.
Selection of γf. If γf is too small, Lfair is easy to optimize, but a lower value of γf will decrease the overall performance. If γf is too large, it will be hard to find the converging points of Lfair. In order to solve this problem, Applicants define γf as a flexible value that adapts to the current performance of the model during the learning stage. Applicants select γf as the average value of γi within a mini-batch. Notice that other selections (such as the median or a percentile) can still be applicable for Lfair.
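By way of illustration only, a minimal sketch of such a fairness penalty is given below, assuming a squared discrepancy between each cluster's purity γi and a fairness point γf taken as the batch mean. The squared-error form and the decision to treat γf as a constant are illustrative assumptions.

import torch

def fairness_loss(gamma: torch.Tensor) -> torch.Tensor:
    """gamma: differentiable per-cluster purity estimates within a mini-batch of size B."""
    gamma_f = gamma.mean().detach()          # fairness point: the batch average (median or percentile also possible)
    return ((gamma - gamma_f) ** 2).mean()   # small when all clusters have similar purity

gamma = torch.tensor([0.95, 0.90, 0.55], requires_grad=True)  # one hard minority cluster
print(float(fairness_loss(gamma)))

Consistent with the optimization schedule described below, such a penalty could be combined with the accuracy term as LFMI+λ·Lfair, with λ=0 during warm-up and gradually increased thereafter.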
Optimizing with Lfair. Initially, the network is warmed up with only the LFMI loss (i.e., λ=0). Then, as the network achieves considerable accuracy after warming up, λ is gradually increased to enforce clustering fairness and enhance consistency across clusters.
The choice of γi. With the form of Lfair, besides the cluster purity γi, other differentiable metrics can be flexibly selected. In the following sections, Applicants propose a novel architecture that enhances the performance of hard clusters belonging to minority groups. By doing so, Lfair will converge faster.
As clusters N(xi) of the minor group contain a large number of noisy/negative samples (especially when k is large), M easily fails to recognize them. Therefore, rather than learning directly from all samples of the large cluster at once, Applicants propose to first decompose N(xi) into k sub-clusters. Each sub-cluster Cim⊂N(xi) has s=n/k samples, where N(xi)=Ci0∪Ci1∪ . . . ∪Cik−1 and Ci0∩Ci1∩ . . . ∩Cik−1=∅, with 0≤m≤k−1. Because N(xi) is an unstructured set, two constraints are defined for the sub-clusters to guarantee order consistency, as in Eqn. (13).
where x0, xj, xj+1, and xs−1∈Cik are in the same sub-cluster, while x0+∈Cik+1 denotes a sample in the next sub-cluster. Correlations between samples within a sub-cluster and across sub-clusters are then exploited. Moreover, as the size of the sub-clusters is kept equal, the balance between positive (hard) and noisy samples is effectively maintained.
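By way of illustration only, a minimal sketch of the decomposition is given below, assuming the initial cluster is a sequence already sorted by decreasing similarity to the centroid and that n is divisible by k. The function name and toy values are illustrative assumptions.

import numpy as np

def decompose_cluster(neighbor_ids: np.ndarray, k: int = 4) -> list:
    """neighbor_ids: length-n sequence with neighbor_ids[0] the centroid; split into k
    equal, non-overlapping, order-preserving sub-clusters Ci0 ... Cik-1."""
    n = len(neighbor_ids)
    s = n // k                                                   # samples per sub-cluster
    return [neighbor_ids[m * s:(m + 1) * s] for m in range(k)]   # Ci0 contains the centroid

cluster = np.arange(16)                                          # a toy cluster of n = 16 samples
subs = decompose_cluster(cluster, k=4)
print([c.tolist() for c in subs])                                # 4 contiguous sub-clusters of s = 4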
Intraformer Architecture. Following the constraints in Eqn. (13), the centroid xi belongs to Ci0. Applicants define the feature of Cim as the concatenated features Fm∈Rs×d of all samples in Cim, i.e., Fm=concat(x0, x1, . . . , xs−1) where x0, x1, . . . , xs−1∈Cim. Taking {Fm}m=0k−1 as input, the proposed Intraformer architecture is presented in the accompanying figure.
Decomposing Clusters Benefits the Self-Attention Mechanism. Given the ith and jth samples of a cluster N(x), their attention score aij is measured as
where Q and K denote the query and key in a transformer block, respectively. This score also indicates the importance of their correlation relative to the correlations between the ith sample and the other samples in N(x). Let N′(x) be a smaller sub-cluster of N(x), where i, j<n′=|N′(x)|<n. In addition, as
aij can be rewritten as follows:
where aij′ denotes the attention score of the ith and jth samples in N′(x). Therefore, the number of samples in a cluster has a big impact on the attention scores. The more samples there are in a cluster, the weaker the correlations that can be extracted between samples. Thus, decomposing N(x) into multiple smaller sub-clusters benefits the attention mechanism and helps M focus on the hard samples.
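By way of illustration only, the following numerical sketch reflects this observation: for a fixed pair of samples, a softmax taken over fewer candidates yields a larger attention score, so smaller sub-clusters produce stronger pairwise attention. The random logits are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=256)                    # similarity logits of sample i to a cluster of n = 256

def attention_to_j(logits: np.ndarray, j: int = 1) -> float:
    w = np.exp(logits - logits.max())            # softmax numerators (shift-invariant)
    return float(w[j] / w.sum())                 # attention score a_ij

print(attention_to_j(logits))                    # over the full cluster (n = 256)
print(attention_to_j(logits[:64]))               # over a sub-cluster (n' = 64): strictly larger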
The accompanying figure briefly demonstrates the cluster decomposition and attention benefits.
Cross Sub-cluster Attention Mechanism. As the centroid is only assigned to the first sub-cluster Ci0 of N(xi), dividing into non-overlapping sub-clusters leads to an issue of ignoring the interactions between samples in the remaining k−1 sub-clusters (Ci1, . . . , Cik−1) and the centroid. Given two samples xi and xj of two different sub-clusters, since xi and xj are fixed, two learnable matrices Wi∈Rd×d′ and Wj∈Rd×d′ are introduced to transform these features into a new d′-dimensional hyperspace where their correlations can be computed as in Eqn. (15)
where W=WiWj⊤.
where {xj′} denotes all samples of the k-th sub-cluster Cik, and s is the number of samples in Cik.
Given this new attention score, the attention-based feature of an out-of-cluster sample looking at a cluster Cik can be computed as a weighted sum of features as in Eqn. (17).
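By way of illustration only, a minimal sketch of such a cross sub-cluster attention operation is given below, with two learnable projections Wi and Wj and a softmax-normalized weighted sum over the sub-cluster features. The class name, the dimensions, and the softmax normalization are illustrative assumptions.

import torch
import torch.nn as nn

class CrossSubclusterAttention(nn.Module):
    def __init__(self, d: int = 512, d_prime: int = 64):
        super().__init__()
        self.w_i = nn.Linear(d, d_prime, bias=False)   # projects the sub-cluster samples
        self.w_j = nn.Linear(d, d_prime, bias=False)   # projects the out-of-cluster sample

    def forward(self, x_out: torch.Tensor, subcluster: torch.Tensor) -> torch.Tensor:
        """x_out: (d,) out-of-cluster sample; subcluster: (s, d) features of a sub-cluster."""
        scores = self.w_i(subcluster) @ self.w_j(x_out)    # (s,) correlations in the d'-dim space
        attn = torch.softmax(scores, dim=0)                # normalize over the s samples
        return attn @ subcluster                           # (d,) attended (weighted-sum) feature

attention = CrossSubclusterAttention()
out = attention(torch.randn(512), torch.randn(64, 512))
print(out.shape)                                           # torch.Size([512])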
Datasets. The visual clustering problem is well defined. The most common datasets are MS-Celeb-1M, MNIST-Fashion, and Google Landmark. To evaluate the fairness of clustering methods, demographic attributes are needed. Among these datasets, only BUPT-BalancedFace (i.e., different splits of MS-Celeb-1M) provides this information. For this reason, a subset of MS-Celeb-1M named BUPT-BalancedFace (i.e., 1.3M images of 28 K celebrities with a race balance of 7 K identities per demographic group) is adopted. Applicants also denote BUPT-In-the-wild (3.4M) as the remainder of MS-Celeb-1M after excluding BUPT-BalancedFace. This dataset contains 3.4M images of 70 K identities and is highly biased in terms of the number of identities per race as well as images per identity. LTC, GCN-V, GCN-E, STAR-FC, and Transformer-based methods, i.e., Clusformer and OMHC, are used as baselines. Note that LTC, GCN-V, and GCN-V/E generate a large affinity graph without any mechanism to separate this graph across multiple GPUs for training. Thus, BUPT-In-the-wild cannot be used for their training process. Applicants propose to decompose BUPT-In-the-wild into smaller parts as follows:
(C1) BUPT-In-the-wild (485 K). BUPT-In-the-wild (3.4M) is randomly split into 7 parts, each part consisting of 485 K samples of 7 K identities. The race distributions are similar to BUPT-In-the-wild (3.4M) (see the accompanying figure).
(C2) BUPT-GlobalFace (2.2M). This subset contains 2M images from 38K celebrities in total. Its racial distribution is approximately the same as the real distribution of the world's population. Compared to the BUPT-In-the-wild (485 K), this subset is more balanced in racial distribution. Only STAR-FC, Clusformer, OMHC, and Intraformer are implemented to run on this subset since there are out-of-memory issues with LTC, GCN-V, and GCN-E.
(C3) BUPT-In-the-wild (3.4M). This configuration is used to evaluate fairness with respect to racial attributes on large-scale data of 3.4M images with a highly biased distribution. Similar to BUPT-GlobalFace (2.2M), only STAR-FC, Clusformer, OMHC, and Intraformer are included.
Metrics. Pairwise F-score (FP), BCubed F-score (FB), and NMI are adopted for evaluation as in previous benchmarking protocols. To measure the fairness of a method on a demographic, Applicants measure the clustering performance on all attributes of this demographic and estimate the standard deviation (std) among them, following previous works. The lower the std value, the fairer the method.
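By way of illustration only, the following sketch computes a pairwise F-score, NMI, and the fairness std across attribute values on toy labels; the scikit-learn utilities and the toy data are illustrative assumptions.

import numpy as np
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import pair_confusion_matrix

def pairwise_f_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    (tn, fp), (fn, tp) = pair_confusion_matrix(y_true, y_pred)   # pair-level confusion counts
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def fairness_std(y_true: np.ndarray, y_pred: np.ndarray, groups: np.ndarray) -> float:
    scores = [pairwise_f_score(y_true[groups == g], y_pred[groups == g])
              for g in np.unique(groups)]
    return float(np.std(scores))                                 # lower std indicates a fairer method

y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])                      # ground-truth IDs
y_pred = np.array([0, 0, 1, 1, 2, 3, 3, 3])                      # predicted cluster IDs
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1])                      # a sensitive attribute
print(pairwise_f_score(y_true, y_pred), normalized_mutual_info_score(y_true, y_pred))
print(fairness_std(y_true, y_pred, groups))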
Implementation Details. Applicants obtain the feature extractor H by training a ResNet-34 with ArcFace loss on BUPT-In-the-wild (3.4M) for 20 epochs. The embedding feature dimension is d=512. This model is then adopted to extract latent features, followed by a k-NN K to construct initial clusters. The architecture of the Intraformer contains two main blocks, i.e., Transformer and Cross Transformer. In each block, transformers are stacked Nblock times, and in each transformer block, self-attention is divided into Nhead heads. Applicants use k=4, Nblock=6, Nhead=4, and n=256. The network is trained for 10 epochs with the Adam optimizer. The learning rate starts at 0.0001 and decays following cosine annealing. The loss weight of (O1) is equal to that of (O2) in all experiments.
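By way of illustration only, a minimal sketch of the optimizer and learning-rate schedule described above is given below; the stand-in model and the function name are illustrative assumptions.

import torch

def build_optimizer(model: torch.nn.Module, epochs: int = 10, lr: float = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler

model = torch.nn.Linear(512, 2)                  # stand-in for the Intraformer
optimizer, scheduler = build_optimizer(model)
for epoch in range(10):
    # ... forward/backward passes and optimizer.step() over the training data would go here ...
    scheduler.step()                             # cosine-annealed decay of the learning rate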
Fairness on Ethnicity.
Overall, Applicants' method outperforms the others in their respective categories. In configuration (C1), Intraformer achieved an FB of 88.41%, which is higher than the SOTA of GCN-based STAR-FC and Transformer-based OMHC by 4% and 3%, respectively. In addition, the results demonstrate that Applicants' method is the fairest, scoring 5.41% on std, lower than the second-best OMHC by 1.1%. In configuration (C2), where the number of training samples is 4.5 times larger than (C1) and the distribution of demographic attributes is more balanced, improved performance is expected. All methods get higher scores than in (C1) in all categories, and Intraformer still holds the best performance in terms of accuracy (89.49%) and fairness (4.40%) among Clusformer, STAR-FC, and OMHC. In configuration (C3), where the number of samples is 3.4M images but the distribution of demographic attributes is highly biased, it is interesting to note that the performance is higher than in (C2). Once again, the best results are achieved by Intraformer when trained on the largest dataset, with an FP of 93.28% and an std of 2.47%. In practice, obtaining (C2) is difficult since demographic annotations are required. Instead, (C3) is preferred as it is easy to collect. Configuration (C3) illustrates that Applicants' method efficiently handles highly biased training databases and achieves both better fairness and better clustering performance. Applicants observe similar results on the FB and NMI metrics as well.
Fairness on Different Demographics. Applicants study the fairness of their method on different demographics such as age, gender, and race. In particular, a model pre-trained on FairFace is employed to predict these demographics for each subject in the BUPT-BalancedFace database. A threshold of 0.9 is applied to filter out uncertain samples and keep only the most confident images for evaluation. Applicants conduct experiments and measure fairness on 9 ranges of age (0-2, 3-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, and 70+), 2 groups of gender (Male and Female), and 7 racial groups (East Asian, Southeast Asian, Latino Hispanic, Black, Indian, Middle Eastern, White). These results are reported in the accompanying figures.
In addition, Applicants also show the detailed experimental results of clustering performance on different demographics in table form. Table 1, Table 2, Table 3, Tables 4, 5, 6, and Tables 7, 8, 9 show the full results (FP, FB, and NMI) of the methods on BUPT-BalancedFace with respect to different demographics. It is noted that Table 1 and Table 3 share the same demographic, i.e., ethnicity, but the labels for ethnicity in Table 3 are inferred by a trained FairFace model. Both tables demonstrate superior performance compared to previous works.
Ablation Studies. Here, Applicants analyze the contributions of C-ATT, LFMI, and Lfair (see Table 11) and evaluate Intraformer with different decomposition settings (see Table 10). Applicants train the model on BUPT-In-the-wild (3.4M) and evaluate on the ethnicity aspect of BUPT-BalancedFace.
The roles of C-ATT, LFMI, and Lfair. Applicants configure a baseline Intraformer with 4 sub-clusters using neither C-ATT, LFMI, nor Lfair. It is not surprising that Applicants' baseline achieves an FP of 91.21% and an std of 3.89%, which is much lower than previous methods, i.e., Clusformer, STAR-FC, and OMHC. With the LFMI loss, an improvement of 0.65% in FP and 0.3% in std is achieved. In the third experiment, Lfair is employed to train the model concurrently, and both the FP and std values are improved, by 0.8% and 0.5% respectively. When C-ATT is enabled to further explore correlations between the centroid and all samples and to encourage hard clusters toward the fairness point, the performance surpasses the state-of-the-art method.
Stabilization of Intraformer. There are several ways to decompose a cluster in the dataset into smaller sub-clusters. Table 10 shows the performance of Intraformer in different settings. It is important to note that the performance of Intraformer does not fluctuate much across different settings. The average FP score is 93.28%±2.47%. These results emphasize the stability and robustness of Applicants' approach.
Unlike the few concurrent studies on the fairness of computer vision, Applicants study how to address the unfair facial clustering problem. Applicants introduce cluster purity as an indicator of demographic bias. Secondly, Applicants propose a novel loss to enforce the consistency of purity between clusters of different groups; thus, fairness can be achieved. Finally, a novel framework for visual clustering was presented to strengthen the hard clusters, which usually come from the minor/biased group. This framework contributes not only to fairness but also to overall clustering performance.
Without further elaboration, it is believed that one skilled in the art can, using the description herein, utilize the present disclosure to its fullest extent. The embodiments described herein are to be construed as illustrative and not as constraining the remainder of the disclosure in any way whatsoever. While the embodiments have been shown and described, many variations and modifications thereof can be made by one skilled in the art without departing from the spirit and teachings of the invention. Accordingly, the scope of protection is not limited by the description set out above, but is only limited by the claims, including all equivalents of the subject matter of the claims. The disclosures of all patents, patent applications and publications cited herein are hereby incorporated herein by reference, to the extent that they provide procedural or other details consistent with and supplementary to those set forth herein.
This application claims priority to U.S. Provisional Patent Application No. 63/533,187, filed on Aug. 17, 2023. The entirety of the aforementioned application is incorporated herein by reference.