Not applicable.
Online advertisers prefer to target ads at a specific audience. The target audience can be selected using demographic information such as age, gender, income, city of residence, etc. However, many online users may not be registered, and therefore have not provided their demographic information voluntarily. Additionally, registered users may give incomplete or even incorrect demographic information.
Incomplete and non-existent user profiles of demographic attributes can limit the usage of demography-based ads targeting. Therefore, it may be desirable to provide an approach in which user demographic attributes can be predicted even if a user is a non-registered user or a registered user with an incomplete profile.
A system and method are provided for predicting user demographic attributes for non-registered users and users with incomplete user profiles. A method provided includes receiving a search query, extracting at least one feature associated with the search query, correlating each extracted feature with one or more attributes, and determining a demographic profile based on the correlated attributes. Another method provides identifying a document, extracting at least one feature associated with the identified document, correlating the at least one feature with one or more attributes, and determining a first demographic profile based on the one or more attributes.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
In various embodiments, the invention provides a system and method for predicting user demographic attributes. The invention uses a search log of user search history and a user profile database of registered user demographic attributes to create a first database. The first database includes features of search results associated with submitted search queries, and these features are associated with corresponding user demographic attributes. The invention also creates a second database that includes features from web pages that have been browsed by the registered users, likewise associated with corresponding user demographic attributes. The first and second databases are used to create a query-demographic predictor and a page-demographic predictor respectively. By using information such as the searching history and demographic attributes of registered users, the query and page-demographic predictors can help predict the demographic attributes of non-registered users and users with incomplete profiles who have searching and web browsing habits similar to those of the registered users.
Query-demographic predictor 104 and page-demographic predictor 106 may be or can include a server including, for instance, a workstation running the Microsoft Windows®, MacOS™, Unix, Linux, Xenix, IBM AIX™, Hewlett-Packard UX™, Novell Netware™, Sun Microsystems Solaris™, OS/2™, BeOS™, Mach, Apache, OpenStep™ or other operating system or platform. In an embodiment, client 102 may also be a server.
Client 102 can include a communication interface. The communication interface can be an interface that allows the client to be directly connected to any other client or device or that allows the client 102 to be connected to a client, server, or device over network 110. Network 110 can include, for example, a local area network (LAN), a wide area network (WAN), or the Internet. In an embodiment, the client 102 can be connected to another client, server, or device via a wireless interface.
Query-demographic predictor 104 can include a search engine 202, a feature extractor 204, a query-demographic classifier 206, a search log 208, and a user profile database 210. Feature extractor 204 can be any conventional feature extractor such as, but not limited to, a Document Frequency (DF) feature extractor, an Information Gain (IG) feature extractor, a Mutual Information (MI) feature extractor, a χ² Statistic (CHI) feature extractor, or a Term Strength (TS) feature extractor. Query-demographic classifier 206 can be any conventional database for classifying information. A query-demographic classifier can be, but is not limited to, a Support Vector Machines (SVM) classifier, a k-nearest neighbor (kNN) classifier, a Linear Least Squares Fit (LLSF) classifier, a Neural Network (NNet) classifier, or a Naive Bayes (NB) classifier. The search log 208 contains user search history information including search queries inputted by users and web pages browsed by users. User profile database 210 stores any type of user demographic attributes for all registered users.
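As an illustration of the simplest extractor named above, the following is a minimal sketch of Document Frequency (DF) feature selection. It assumes whitespace tokenization and a hypothetical threshold parameter `min_df`; the function name and the sample snippets are invented for this example and are not part of the described system.

```python
from collections import Counter


def document_frequency_features(documents, min_df=2):
    """Return the terms that occur in at least `min_df` documents.

    `min_df` is a hypothetical threshold chosen for illustration.
    """
    df = Counter()
    for text in documents:
        # Count each term once per document, not once per occurrence.
        df.update(set(text.lower().split()))
    return sorted(term for term, count in df.items() if count >= min_df)


# Made-up snippets standing in for search-result descriptions.
snippets = [
    "cheap mp3 player deals",
    "best mp3 player reviews",
    "video games for sale",
]
selected = document_frequency_features(snippets, min_df=2)
print(selected)
```

In this sketch, terms that appear in only one snippet are discarded, which is the usual motivation for DF thresholding: rare terms carry little evidence about any demographic attribute.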
The query-demographic predictor can be configured to obtain search results for corresponding search queries from the search engine 202 and extract features from the search results using the feature extractor 204. In an embodiment, a feature is a term or phrase that can be extracted from a broader contextual description and is used to identify a type of demographic attribute. For example, a feature can be extracted from a textual description of a search result wherein the feature would be associated with a type of demographic attribute related to the textual description. The query-demographic predictor can use the search log 208 to determine which users have been inputting certain search queries and obtain the users' corresponding demographic attributes from the user profile database 210. The query-demographic predictor can then associate and store those extracted features along with the corresponding user demographic attributes within the query-demographic classifier 206.
A page-demographic predictor is used to predict a confidence level for a particular demographic attribute given a certain web page. For example, the page-demographic predictor could predict the likelihood that a particular web page was browsed by a specific gender. In another example, the page-demographic predictor could predict the likelihood that a particular web page was browsed from someone at a specific location. The page-demographic predictor can predict any type of demographic attribute given a web page and should not be limited to just gender and location.
Page-demographic predictor 106 can include a feature extractor 212, a page-demographic classifier 214, a search log 216, and a user profile database 218. Feature extractor 212 can be any conventional feature extractor such as, but not limited to, a Document Frequency (DF) feature extractor, an Information Gain (IG) feature extractor, a Mutual Information (MI) feature extractor, a χ² Statistic (CHI) feature extractor, or a Term Strength (TS) feature extractor. Page-demographic classifier 214 can be any conventional database for classifying information. A page-demographic classifier can be, but is not limited to, a Support Vector Machines (SVM) classifier, a k-nearest neighbor (kNN) classifier, a Linear Least Squares Fit (LLSF) classifier, a Neural Network (NNet) classifier, or a Naive Bayes (NB) classifier. The search log 216 contains user search history information including search queries inputted by users and web pages browsed by users. User profile database 218 stores any type of user demographic attributes for all registered users.
The page-demographic predictor can be configured to identify and obtain web pages browsed by users from search log 216 and to extract features from the web pages using the feature extractor 212. The page-demographic predictor can also use the search log 216 to determine which users have been browsing certain web pages and can obtain the users' corresponding demographic attributes from the user profile database 218. The page-demographic predictor can then associate and store those extracted features along with the corresponding user demographic attributes within the page-demographic classifier 214.
After receiving the training queries, the search engine will then output the top search results for each training query. The query-demographic predictor can be configured to accept N search results, wherein N is the number of search results per search query. At operation 304, the query-demographic predictor can receive a snippet for each search result. In an embodiment, the snippets are textual descriptions of the search results. For example, conventional search engines provide a brief description for each search result as opposed to the entire web page in order to maximize the number of results that can be viewed on a single page. The brief description of the search result can be considered to be a snippet. The predictor uses the snippet to describe the corresponding search results of each search query as the queries themselves are sometimes too short to be understood by a feature extractor. The snippets, therefore, are used to extend the meaning of the search query.
At operation 306, features are extracted from the N snippets corresponding to the search results of each search query. The query-demographic predictor can retrieve from the search log the user IDs of the users who inputted the corresponding search queries and can then retrieve the user demographic attributes from the user profile database that are related to the user IDs. At operation 308, the extracted features and the corresponding user demographic attributes are stored together in the query-demographic classifier.
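Operations 306 and 308 can be sketched as a join between the search log and the user profile database. The code below is a minimal illustration under stated assumptions: the data layouts, the function name `build_training_records`, the toy `extract` callable, and all sample values are hypothetical and chosen only to make the flow concrete.

```python
def build_training_records(search_log, user_profiles, snippets_by_query, extract):
    """Pair the features of each logged query with the querying user's attributes.

    search_log:        list of (user_id, query) tuples (assumed layout)
    user_profiles:     dict mapping user_id to a dict of demographic attributes
    snippets_by_query: dict mapping query to its N result snippets
    extract:           callable returning the features of one snippet
    """
    records = []
    for user_id, query in search_log:
        profile = user_profiles.get(user_id)
        if profile is None:
            # Non-registered users have no attributes to learn from.
            continue
        features = set()
        for snippet in snippets_by_query.get(query, []):
            features.update(extract(snippet))
        records.append((sorted(features), profile))
    return records


# Made-up data for illustration only.
search_log = [("u1", "mp3 player"), ("u2", "video games"), ("u3", "mp3 player")]
user_profiles = {"u1": {"gender": "male"}, "u2": {"gender": "female"}}
snippets_by_query = {
    "mp3 player": ["Cheap MP3 player deals", "MP3 player reviews"],
    "video games": ["New video games"],
}

records = build_training_records(
    search_log, user_profiles, snippets_by_query,
    extract=lambda snippet: snippet.lower().split(),
)
for features, profile in records:
    print(features, profile)
```

The resulting (features, attributes) pairs correspond to what the text describes as being stored together in the query-demographic classifier; a real classifier would consume them as training examples.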
Based on the comparison, at operation 410, the query-demographic predictor can predict the demographic attributes of the user inputting the search query. For example, if the extracted features resemble any stored features in the classifier, the query-demographic predictor can take the demographic attributes that correspond to the stored features and, through use of the various algorithms of the classifier, use them to predict the demographic attributes of the user who inputted the search query.
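One possible reading of "resembles" is set overlap between feature lists. The sketch below uses Jaccard similarity with a single nearest neighbor, which is a simplified stand-in for the kNN classifier listed among the options, not the method prescribed by the text. The function names and the stored records are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of features."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_attributes(query_features, stored_records):
    """Return the attributes of the most similar stored record, or None.

    stored_records: list of (features, attributes) pairs (assumed layout).
    """
    best = max(
        stored_records,
        key=lambda record: jaccard(query_features, record[0]),
        default=None,
    )
    if best is None or jaccard(query_features, best[0]) == 0:
        # Nothing in the classifier resembles the new query.
        return None
    return best[1]


# Made-up stored records for illustration only.
stored = [
    (["mp3", "player", "deals"], {"gender": "male"}),
    (["video", "games"], {"gender": "female"}),
]
print(predict_attributes(["mp3", "player", "reviews"], stored))
print(predict_attributes(["garden"], stored))
```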
The query-demographic predictor can additionally predict a confidence level for each demographic attribute that it predicts. A confidence level is a representation of how sure the query-demographic predictor is that the predicted demographic attribute is true. The confidence level can be represented by a confidence identifier. The confidence identifier is any identifier that can identify the level of confidence the predictor has that the demographic attribute is true. The confidence identifier can be any numerical or textual description within an ascending/descending range of confidence. For example, the confidence identifier can be a percentage of confidence from 0%-100%. In another example, the confidence identifier can be textual descriptions such as “not confident,” “somewhat confident,” “confident,” and “very confident.” The query-demographic predictor can have any type of algorithm for determining the confidence level of a predicted demographic attribute. For example, in determining the gender of a user who inputs a particular search query, the query-demographic predictor can identify the number of male users within the classifier who inputted a search query that resembles the particular search query and divide by the total number of users who entered such a query. The result would be a percentage that identifies the confidence level that the user is male. However, as mentioned previously, the query-demographic predictor can be configured to incorporate any other type of algorithm for determining a confidence level.
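The gender example above amounts to a simple ratio. The following minimal sketch computes it, assuming for simplicity that "resembles" means an exact match on the query string; the function name, record layout, and sample records are hypothetical.

```python
def attribute_confidence(records, query, attribute, value):
    """Fraction of users who entered `query` whose `attribute` equals `value`.

    records: list of (query, profile) pairs (assumed layout).
    Returns None when no labeled user entered the query.
    """
    matching = [
        profile for logged_query, profile in records
        if logged_query == query and attribute in profile
    ]
    if not matching:
        return None
    hits = sum(1 for profile in matching if profile[attribute] == value)
    return hits / len(matching)


# Made-up records: three male users and one female user entered the query.
records = [
    ("mp3 player", {"gender": "male"}),
    ("mp3 player", {"gender": "male"}),
    ("mp3 player", {"gender": "female"}),
    ("mp3 player", {"gender": "male"}),
    ("video games", {"gender": "female"}),
]
confidence = attribute_confidence(records, "mp3 player", "gender", "male")
print(f"{confidence:.0%}")
```

Returning `None` for an unseen query keeps "no evidence" distinct from "0% confidence", which matters if the value later feeds a voting or pricing step.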
The page-demographic predictor can also provide a corresponding confidence identifier, as explained above, for each demographic attribute that it predicts. For example, on a department store's web page, a plurality of features may be extracted such as “MP3 player” and “video games.” The page-demographic predictor may determine that 85% of men and 65% of people ages 31-45 are likely to be associated with the “MP3 player” feature. The page-demographic predictor may also determine that 55% of men and 95% of people ages 18-30 are associated with the “video games” feature. The predictor can then take the averages of the respective features to determine that the web page has a confidence level of 70% that its visitors are men, the average of 85% and 55%. It can also be determined that the web page has a confidence level of 65% that people ages 18-30 are likely to browse the web page (assuming that 18-30 and 31-45 are the only two possible age categories): the 65% figure for ages 31-45 on the “MP3 player” feature implies 35% for ages 18-30, and the average of 35% and 95% is 65%. But again, any type of algorithm can be used to determine a confidence level for a particular demographic attribute and the invention should not be limited to the example given above.
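The averaging in the department store example can be reproduced with a few lines of arithmetic. This is a minimal sketch using the percentages from the text; the function name and dictionary layout are hypothetical, and the 35% entry is derived from the stated 65% for ages 31-45 under the two-category assumption.

```python
def page_confidence(feature_confidences, attribute_value):
    """Average the per-feature confidence (in percent) for one attribute value."""
    values = [conf[attribute_value] for conf in feature_confidences.values()]
    return sum(values) / len(values)


# Percentages taken from the example in the text.
features = {
    "MP3 player": {"male": 85, "age 18-30": 100 - 65},  # 65% are ages 31-45
    "video games": {"male": 55, "age 18-30": 95},
}

print(page_confidence(features, "male"))       # (85 + 55) / 2
print(page_confidence(features, "age 18-30"))  # (35 + 95) / 2
```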
In an embodiment, the user-demographic predictor can vote for the demographic attribute that has the higher corresponding confidence identifier. For example, when evaluating gender, if the query-demographic predictor is 85% confident that the user is female and the page-demographic predictor is 50% confident that the user is male, then the user-demographic predictor will vote that the user is female since that prediction has the higher confidence level. In another embodiment, the user-demographic predictor can vote for demographic attributes by taking the average of the confidence identifiers from the query and page-demographic predictors. For example, if the query-demographic predictor is 75% confident that the user is female and the page-demographic predictor is 15% confident that the user is female, then the average of the two is a 45% confidence level that the user is female. The user-demographic predictor will therefore vote that the user is male, since male has the higher confidence level of 55%. However, any voting mechanism/algorithm can be used, and the invention should not be limited to the two described above.
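The two voting embodiments can be sketched as follows. The function names and argument layouts are hypothetical, confidences are expressed in percent, and the averaging variant assumes a two-valued attribute such as the gender example in the text.

```python
def vote_by_max(predictions):
    """Pick the (value, confidence) pair with the highest confidence.

    predictions: list of (value, confidence_percent) pairs, one per predictor.
    """
    return max(predictions, key=lambda prediction: prediction[1])


def vote_by_average(confidences, value, other_value):
    """Average the confidences for `value` and pick the likelier of two values.

    Assumes the attribute has exactly two values, so the confidence for
    `other_value` is 100 minus the averaged confidence for `value`.
    """
    average = sum(confidences) / len(confidences)
    if average >= 100 - average:
        return value, average
    return other_value, 100 - average


# Numbers taken from the two examples in the text.
print(vote_by_max([("female", 85), ("male", 50)]))
print(vote_by_average([75, 15], "female", "male"))
```

The two mechanisms can disagree on the same inputs: with 75% and 15% for female, the averaging rule votes male, whereas a maximum rule applied to the equivalent predictions (female at 75%, male at 85%) would also vote male but for a different reason. Which rule fits better depends on how much each predictor is trusted.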
At operation 712, if the user is a registered user, the predicted and voted demographic attributes can be audited against the demographic information that has been stored in the user profile database. For example, the predicted and voted demographic attributes can be compared to the user's demographic attributes the user previously submitted in his/her profile to see if there are any similarities or differences. Such similarities and differences can be evaluated by an administrator, advertiser, or any other authorized user for any desired purpose.
In an embodiment, the predicted demographic attributes can be utilized by an advertiser to determine which search queries, web pages, or users he/she desires to bid on. In such an embodiment, at operation 714, a pricing mechanism can be used to create a bidding price for a corresponding search query, web page, or user based on the confidence identifier predicted for a given demographic attribute. For example, the query-demographic predictor can be used to inform advertisers which search queries fit their targeted demographic attribute values. The pricing mechanism can be configured to include any type of algorithm desired by a developer of the pricing mechanism. For example, if the query-demographic predictor is 75% confident that a particular search query is a female-oriented search query and the advertiser is interested in marketing to females, then the pricing mechanism could be configured to charge the advertiser 75% of the original advertisement price, wherein the original advertisement price can be any predetermined price.
The page-demographic predictor can also be used to inform advertisers which web pages fit their targeted demographic attribute values. The pricing mechanism can be configured to include any type of algorithm desired by a developer of the pricing mechanism. For example, if the page-demographic predictor is 85% confident that a particular web page is a male-oriented web page and the advertiser is interested in marketing to males, then the pricing mechanism could be configured to charge 85% of the original advertisement price, wherein the original advertisement price can be any predetermined price.
The user-demographic predictor can also be used to inform advertisers which users fit their targeted demographic attribute values. The pricing mechanism can be configured to include any type of algorithm desired by a developer of the pricing mechanism. For example, if the user-demographic predictor is 65% confident that a particular user is a male who lives in Virginia and the advertiser is interested in marketing to males who live in Virginia, then the pricing mechanism could be configured to charge 65% of the original advertisement price, wherein the original advertisement price can be any predetermined price.
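The three pricing examples above share one rule: the charge is the original advertisement price scaled by the confidence level. A minimal sketch of that rule follows; the function name is hypothetical and the base price of 200 is an arbitrary value chosen for illustration, not a figure from the text.

```python
def bid_price(original_price, confidence_percent):
    """Scale the original advertisement price by a confidence level in percent."""
    if not 0 <= confidence_percent <= 100:
        raise ValueError("confidence_percent must be between 0 and 100")
    return original_price * confidence_percent / 100


ORIGINAL_PRICE = 200  # hypothetical predetermined price

# Confidence levels from the query, page, and user examples in the text.
for label, confidence in [("query", 75), ("page", 85), ("user", 65)]:
    print(label, bid_price(ORIGINAL_PRICE, confidence))
```

As the text notes, a developer could substitute any other pricing algorithm; a linear scale is only the simplest choice.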
While particular embodiments of the invention have been illustrated and described in detail herein, it should be understood that various changes and modifications might be made to the invention without departing from the scope and intent of the invention. The embodiments described herein are intended in all respects to be illustrative rather than restrictive. Alternate embodiments will become apparent to those skilled in the art to which the present invention pertains without departing from its scope.
From the foregoing it will be seen that this invention is one well adapted to attain all the ends and objects set forth above, together with other advantages, which are obvious and inherent to the system and method. It will be understood that certain features and sub-combinations are of utility and may be employed without reference to other features and sub-combinations. This is contemplated and within the scope of the appended claims.