The disclosure generally relates to a method and system for language-independent gender classification on an Online Social Network such as Twitter.
Online Social Networks (OSNs) have spread at stunning speed over the past decade. They are now a part of the lives of tens of millions of people. The onset of OSNs has stretched the traditional notion of “community” to include groups of people who have never met in person but communicate with each other through OSNs to share knowledge, opinions, interests and activities.
Online Social Networks (OSNs) generate a huge volume of user-originated texts. OSNs allow users to share knowledge, opinions, interests, activities, relationships and friendships with each other. Gender classification can serve multiple purposes in these settings. Commercial organizations may use gender classification for advertising. Law enforcement may use gender classification as part of legal investigations. Others may use gender information for social reasons.
Methods for gender classification of users of OSNs are typically language dependent, not scalable, inefficient, and performed offline using high-dimensional spaces. For example, most existing approaches to gender classification on Twitter depend heavily on an analysis of the text in posted messages, called tweets, and are therefore language dependent. Those methods use high-dimensional spaces consisting of unique words extracted from such text fields as postings, user names, and profile descriptions. They rely on word-based n-grams, resulting in a huge feature space consisting of unique words and word combinations extracted from tweets. The size of the resulting feature sets is often on the order of millions of features.
There is a need in the field for gender identification of users of OSNs with an emphasis on accuracy, computational efficiency and scalability of gender predictions. There is especially a need in the field for language-independent methods for determining gender information of users of OSNs.
In an embodiment, there is provided a computer-implemented method for predicting gender classification of users of an OSN such as Twitter. In an embodiment, the computer-implemented method may predict gender using five color-based features extracted from Twitter profiles such as the background color in a user's profile page. This is in contrast with most existing methods for gender prediction that are language dependent. Those methods use high-dimensional spaces consisting of unique words extracted from such text fields as postings, user names, and profile descriptions. The present method is independent of the user's language, efficient, scalable, and computationally tractable, while attaining a good level of accuracy.
In an embodiment, there is provided a computer-implemented method comprising: receiving a color data set of a given user of an online social network, said online social network allowing said given user to select a set of colors within its profile; and comparing said color data set of said given user to predetermined color data sets for determining a gender of said given user.
In another embodiment, there is provided a computer-implemented method comprising: receiving color data sets of a plurality of users of an online social network, said online social network allowing each user to select a set of colors within their profile; quantizing said color data sets into predetermined color data sets; assigning a gender to each of said predetermined color data sets; receiving a color data set of a given user of said online social network; and comparing said color data set of said given user to said predetermined color data sets for determining a gender of said given user.
In another embodiment, there is provided a system comprising: a memory; and one or more processors coupled to the memory, wherein the memory comprises program instructions to: receive color data sets of a plurality of users of an online social network, said online social network allowing each user to select a set of colors within their profile; quantize said color data sets into predetermined color data sets; assign a gender to each of said predetermined color data sets; receive a color data set of a given user of said online social network; and compare said color data set of said given user to said predetermined color data sets for determining a gender of said given user.
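By way of illustration only, the receive, quantize, assign and compare steps of the above embodiments may be sketched as follows. The sketch makes several assumptions that are not part of the disclosure: quantization is approximated by simple bit truncation, a gender is assigned to each quantized color data set by majority vote, and the comparison is an exact-match lookup. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def quantize(rgb, bits=3):
    """Keep the top `bits` bits of each 8-bit channel (assumed stand-in for quantization)."""
    shift = 8 - bits
    return tuple(channel >> shift for channel in rgb)


def build_table(labeled_profiles, bits=3):
    """Assign a gender to each quantized color data set by majority vote (assumption).

    labeled_profiles: iterable of (colors, gender), where colors is a tuple of
    five (R, G, B) triples.
    """
    votes = defaultdict(Counter)
    for colors, gender in labeled_profiles:
        key = tuple(quantize(c, bits) for c in colors)
        votes[key][gender] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def predict(table, colors, bits=3, default=None):
    """Compare a given user's quantized colors to the predetermined color data sets."""
    key = tuple(quantize(c, bits) for c in colors)
    return table.get(key, default)
```

In this sketch a color data set never seen during table construction yields the `default` value; a trained classifier, as used in the experiments described below, would generalize to such inputs instead.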
Referring to
Referring to
Dataset Collection: Applicants chose Twitter profiles as the starting point of their data collection for several reasons. First, Twitter is one of the most popular social networks to date, with a huge user community cutting across a great many languages, cultures and age groups. In early 2013, Twitter reached 555 million registered users. As of today, Twitter states that there are more than 200 million active users producing around 400 million tweets per day. Second, Twitter has all the color attributes that were needed to set up the experiment. These attributes are generally public, meaning that they can be accessed and viewed by anyone who requests them. Lastly, Twitter provides a rich Application Programming Interface (API), which supports automatic collection of large data sets.
For Applicants' experiments, they chose Twitter profiles as the starting point of their data collection. In Twitter's terminology, the followers of a given user U are users interested in reading U's tweets. These users will be notified when U posts a new tweet. Also, the friends of a user V are the users whose tweets V follows. In general, users can register themselves as followers of any other user; no permission is required unless the user protects his/her profile using Twitter's protection features. A new Twitter user must first fill in a profile form consisting of about 30 fields containing biographical and other personal information, such as personal interests and hobbies. However, many fields in the form are optional, and indeed a substantial portion of Twitter users leave many or all of those optional fields blank. In addition, Twitter's profile form does not include a specific “gender” field, which complicates gender identification for Twitter users. One can choose additional fields that are not mentioned above for gender classification, such as posted tweets; however, Applicants decided to perform gender classification using only profile colors.
Among many other fields in a Twitter profile, Applicants were interested in the five fields that allow users to choose different colors for the following items: Background color; Text color; Link color; Sidebar fill color; and Sidebar border color.
Users choose their own preferences by selecting colors from a color wheel while editing their profiles. Unlike other OSNs, such as Facebook, Twitter allows users to redesign and change their profiles. In some cases, users chose both a background color and a background picture (from a picture file) for their profiles. In these cases, the background picture overrides the background color, which is not shown. However, Applicants' empirical setup takes into account the background color chosen by a user even if that color is overridden by that user.
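As a minimal sketch, the five color fields of a harvested profile may be converted to RGB triples as follows. It is assumed here that each color arrives as a six-digit hexadecimal string (for example "C0DEED"); the dictionary keys are hypothetical names chosen for illustration and are not taken from the disclosure.

```python
# Hypothetical keys for the five color fields; the actual field names may differ.
COLOR_FIELDS = (
    "background",
    "text",
    "link",
    "sidebar_fill",
    "sidebar_border",
)


def parse_hex(color):
    """Convert a six-digit hex string such as 'C0DEED' or '#C0DEED' to (R, G, B)."""
    digits = color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected six hex digits, got {color!r}")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def profile_colors(profile):
    """Return the five profile colors as a tuple of (R, G, B) triples.

    The background color is read even when a background picture hides it.
    """
    return tuple(parse_hex(profile[field]) for field in COLOR_FIELDS)
```

For example, `parse_hex("C0DEED")` gives the triple (192, 222, 237).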
Applicants ran their crawler between August and December 2013, subject to Twitter's limitation of fewer than 150 requests per hour. Applicants started their crawler with a set of random profiles and continuously added any profile that the crawler encountered (e.g., profiles of users whose names were mentioned in tweets harvested). Subsequently, Applicants retained only the profiles with valid URLs. The URL is a profile field that lets a Twitter user create a link to a profile hosted by another OSN, such as Facebook. This field is important because profiles hosted by other OSNs often contain an explicit gender field, which Twitter profiles do not include.
In all, the dataset Applicants used at the time of their study consisted of 169,449 profiles, of which 94,251 were classified as male and 75,198 were classified as female. Applicants considered only profiles for which they obtained gender information independently of Twitter content (i.e., by following links to other profiles). For each profile in the dataset, Applicants collected the five profile colors listed above. Applicants also stratified the data by randomly sampling 150,000 profiles, of which about 75,000 are classified as male and about 75,000 are classified as female. In this manner, one obtains an even baseline containing 50% male and 50% female profiles. Twitter offers 19 predefined designs, including a default design, to each new user joining the social network. Each design defines colors for all five fields. Users can select those designs easily. As of this writing, the color (R=192, G=222, B=237), a light shade of blue, is the default background color for any new Twitter user.
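The stratification step may be illustrated by the following sketch, which draws the same number of profiles at random from each gender class to obtain an even baseline. The record layout (a dictionary with a "gender" key), the function name and the fixed seed are assumptions made for illustration.

```python
import random


def balanced_sample(profiles, per_class, seed=0):
    """Randomly draw `per_class` profiles from each gender class.

    profiles: list of dicts with a 'gender' key (assumed record layout).
    """
    rng = random.Random(seed)
    by_gender = {}
    for profile in profiles:
        by_gender.setdefault(profile["gender"], []).append(profile)

    sample = []
    for gender, group in by_gender.items():
        if len(group) < per_class:
            raise ValueError(f"only {len(group)} profiles labeled {gender!r}")
        sample.extend(rng.sample(group, per_class))

    rng.shuffle(sample)  # avoid leaving the classes in contiguous blocks
    return sample
```

With 94,251 male and 75,198 female profiles, calling `balanced_sample(profiles, 75000)` would return 150,000 profiles split evenly between the two classes.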
In order to account for the existence of predefined designs in the Twitter user setup, Applicants have considered different subsets of their overall dataset, and studied each subset independently of other subsets. In addition, Applicants stratified each subset by randomly sampling the profiles, from which they obtained even baselines containing 50% male and 50% female profiles. Applicants specifically considered the following subsets:
[T1.] This is the entire dataset, {A}, consisting of 150,000 profiles with a 50% male and 50% female breakdown.
[T2.] This is dataset {A}-{D}, which is the subset containing all collected profiles except for those using the default design with the RGB values of (192, 222, 237) as the background color, denoted by {D}. {D} represents 11.4% of dataset {A} while {T2} represents 88.6%. The base condition is a 50% male and 50% female breakdown.
[T3.] This is dataset {A}-{C}, which is the subset obtained by excluding from {A} the subset {C} of all profiles that use any of the 19 predefined designs, including the default design. {C} represents around 57% of {A} while {T3} represents 43%. The base condition is a 50% male and 50% female breakdown. Here Applicants report detailed empirical results about {T3}, since it includes only profiles with custom color choices, and summarize results for the other datasets.
[T4.] This is dataset {A}-{B}, obtained by excluding from the entire dataset, {A}, all profiles, {B}, that use any of the 19 predefined designs or that use black or white as the background color. {B} represents 71.8% of {A}, while {T4} represents 28.2%. The base condition is still a 50% male and 50% female breakdown.
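The construction of the four subsets may be sketched as follows. The sketch assumes that each profile is a tuple of five (R, G, B) triples with the background color first, that the default design can be recognized by its background color alone, and that the palettes of the 19 predefined designs are supplied by the caller (they are not listed in this description). The re-stratification of each subset to a 50/50 baseline is omitted for brevity.

```python
DEFAULT_BACKGROUND = (192, 222, 237)
BLACK = (0, 0, 0)
WHITE = (255, 255, 255)


def build_subsets(profiles, predefined_designs):
    """Split profiles into the subsets T1 to T4.

    profiles: list of 5-tuples of (R, G, B) triples, background color first.
    predefined_designs: set of 5-tuples, one per predefined design (caller-supplied).
    """
    t1 = list(profiles)                                    # {A}: everything
    t2 = [p for p in t1 if p[0] != DEFAULT_BACKGROUND]     # {A} - {D}
    t3 = [p for p in t1 if p not in predefined_designs]    # {A} - {C}
    t4 = [p for p in t3 if p[0] not in (BLACK, WHITE)]     # {A} - {B}
    return {"T1": t1, "T2": t2, "T3": t3, "T4": t4}
```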
Referring to
Dataset Collection Validation: The main threat to the validity of this research is Applicants' reliance on self-declared gender information entered by Twitter users on external web sites for validation of their predictions. Applicants believe that deceptive people sometimes do make mistakes by entering conflicting information in different OSNs. In this study, Applicants rely on gender information from external links posted by profile owners. Applicants use this gender information as their ground truth. Evidently, a complete evaluation of 169,449 Twitter users would be impractical. However, Applicants manually spot-checked about 10,000 of the profiles in their dataset, which is about 7% of the dataset. In the cases that Applicants checked by hand, they are confident that the gender information they collected automatically was indeed correct over 90% of the time. In the majority of the remaining cases, Applicants could not determine the accuracy of their ground truth.
Proposed Approach: An algorithm for preprocessing colors before feeding the colors to a classifier is shown in
Colors harvested from Twitter user profiles are typically specified as a combination of RGB values, each ranging between 0 and 255. This gives a total of 256^3 color combinations. Because of the large number of combinations, Applicants used quantization, a compression procedure that substantially reduces the number of colors. Each of the red, green and blue values is shrunk from 8 bits to either 4 bits or 3 bits. This technique reduces the total number of color combinations from 256^3 ≈ 16.8 × 10^6 to just 16^3 = 4096 colors or 8^3 = 512 colors, respectively. Each of the original colors harvested is converted to the compressed color having the least Euclidean distance from the original color. Next, according to the algorithm in
Applicants observed empirically that quantization and sorting are beneficial to the accuracy of gender predictions. In general, accuracy has improved by up to 15% because of these procedures.
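The quantization step may be illustrated by the following minimal sketch. The exact set of compressed colors is not specified in this description; the sketch assumes 2^bits evenly spaced levels per channel across the range 0 to 255. Under that assumption the compressed palette is a regular grid, so the compressed color with the least Euclidean distance is found by rounding each channel independently to its nearest level. The function names are hypothetical, and the sorting procedure is not shown.

```python
def quantize_channel(value, bits):
    """Return the index of the nearest of 2**bits evenly spaced levels in 0..255."""
    if not 0 <= value <= 255:
        raise ValueError("channel value must be in 0..255")
    top = (1 << bits) - 1  # highest level index, e.g. 7 for 3 bits
    # Integer form of round(value * top / 255). Exact ties cannot occur because
    # 2 * value * top is even while 255 is odd.
    return (2 * value * top + 255) // 510


def quantize_color(rgb, bits):
    """Quantize an (R, G, B) triple channel by channel."""
    return tuple(quantize_channel(c, bits) for c in rgb)


def color_code(rgb, bits):
    """Pack the quantized triple into one integer in 0 .. 2**(3*bits) - 1."""
    levels = 1 << bits
    r, g, b = quantize_color(rgb, bits)
    return (r * levels + g) * levels + b
```

Under these assumptions, the default background color (192, 222, 237) quantizes to level indices (5, 6, 7) with 3 bits per channel and to (11, 13, 14) with 4 bits per channel, and every color maps to one of 512 or 4096 codes, respectively.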
Experimental Results: Applicants performed four sets of experiments, one for each of the four subsets of their dataset. In each experiment set, Applicants tried many classifiers; different classifiers produced different results. Next, Applicants selected the top classifiers. Here Applicants consider the following four different classifiers: Probabilistic Neural Network (PNN), Decision Tree (DT), Naive Bayes (NB) and Naive Bayes/Decision-Tree Hybrid (NB-Tree). Applicants performed a 10-fold cross validation on their data subsets for each classifier. In each set of experiments, Applicants trained their classifiers with all five color-based features.
An advantage of the present approach is that it uses only five colors, making it language independent. An additional advantage is that it has a low-dimensional space, resulting in a low computational complexity of the classifiers. In contrast with the present method, most existing approaches are language dependent and use high-dimensional spaces, with millions of features, generated from unique words extracted from text (i.e., tweets, names, and profile descriptions).
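The evaluation protocol may be sketched as follows using one of the classifier families named above, Naive Bayes, together with k-fold cross validation. This is an illustrative stand-in written with the Python standard library only; it is not the implementation used in the experiments. It assumes each profile has already been reduced to five quantized color codes, and the Laplace smoothing, fold assignment and function names are choices made for this sketch.

```python
import math
import random
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    vocab = defaultdict(set)             # feature index -> values seen in training
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict_nb(model, row):
    """Return the label with the highest smoothed log-posterior."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        for i, value in enumerate(row):
            # Laplace smoothing; the extra +1 reserves mass for unseen values.
            seen = value_counts[(label, i)][value]
            score += math.log((seen + 1) / (n + len(vocab[i]) + 1))
        if score > best_score:
            best, best_score = label, score
    return best


def cross_validate(rows, labels, k=10, seed=0):
    """Return the overall accuracy of k-fold cross validation."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        model = train_nb([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_nb(model, rows[i]) == labels[i] for i in fold)
    return correct / len(rows)
```

Because each profile contributes only five categorical features, both training and prediction in this sketch run in time linear in the number of profiles, which reflects the low-dimensional feature space of the approach.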
Conclusion: Applicants have automatically predicted the gender of users based on their color preferences. Unlike text-based approaches, Applicants used a novel method for predicting gender using five color-based features. Preliminary results with the collected data set are quite encouraging. Although only five color-based features were considered, it was possible to predict gender with an accuracy of 74.2%, a gain of about 24 percentage points over the 50% baseline. A key to this success is the preprocessing of the color features using the quantization technique discussed above. An advantage of the present method is its broad applicability to Twitter users regardless of their language, as one uses only color-based features to identify gender. In addition, the color-based analysis shows promising results in terms of computational complexity compared to other gender-guessing methods, which use a much larger feature set. The present approach may utilize only five color-based features. The results show that colors alone may provide reasonably accurate gender predictions, even though a substantial number of the users analyzed do not change the default colors provided by Twitter in their Twitter profiles or in other web sites hosting their profiles (e.g., Twitter App). One may conclude that colors are a good gender indicator for users who do change the default colors in their profiles. In these cases, one is able to use colors alone as part of gender classification methods.
In this description, Applicants detailed their experimental study of gender classification on Twitter. Applicants presented a novel approach for predicting gender utilizing only five color-based features extracted from the profile layout colors. Unlike existing works that use millions of features, Applicants used only five color-based features. Despite the challenge that such a small feature set poses for gender classification, a color-based model for gender classification has been proposed. A color quantization procedure was applied to the color-based features, compressing each color from 24 bits to 9 bits and producing a discrete set of 512 colors. Applicants empirically validated their approach by examining different classifiers over a large collection of Twitter data. The present approach uses an agent with advanced color preferences to search all profiles and predict gender. The empirical studies show that the present method is reasonably accurate and highly efficient in terms of computational complexity.
Although the present embodiments have been described with reference to specific example embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the various embodiments. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.