1. Field of the Invention
This invention relates generally to a message optimization system and, more particularly, to a system and method for identifying the optimal message content to send to users based on the users' language characteristics.
2. Description of the Background Art
Commercial advertising has undergone a significant shift in the past decade. Traditional media advertising, taking the form of newspapers, magazines, television commercials, radio advertising, outdoor advertising, and direct mail, etc., has been decreasing as the primary method of reaching an audience, especially as related to certain target demographics or types of products. New media advertising, in the form of Popup, Flash, banner, Popunder, advergaming, email advertising, mobile advertising, etc., has been increasing in prominence.
One characteristic of new media advertising is the need to capture the attention of an audience (viewers, readers, or listeners) with limited text. For example, with a banner or text message, the sponsor of the advertising message may only have a finite number of characters to persuade its audience to act by clicking on a link, texting back a message, etc. As a result, companies are increasingly interested in how to optimize their message, and the components in the message, to increase the message's response rate. International Publication Number WO 2011/076318 A1 discloses a system and method for optimizing a message and is incorporated by reference herein in its entirety. In this system, the message is divided into components and multiple values are tested for each component to determine the best response rates.
Different segments of the population may respond differently to messages. For example, male recipients may respond differently than female recipients, and urban recipients may respond differently than rural recipients. The potency of marketing messages may be increased by grouping message recipients into various segments and identifying the message that works best for each segment. Therefore, it is desirable to find new ways to segment users and target messages to them.
The present invention is directed to a system, method, and computer program for identifying message content to send to users based on the users' language characteristics. User-generated content (i.e., written or voice data) for a plurality of users is obtained, where each content item is associated with a user identification (ID) that uniquely identifies the user that generated the content item. The language characteristics are extracted from the user-generated content and one or more language characteristic scores are assigned to each of the users. Language characteristics may include vocabulary used or morphology characteristics. The users are clustered into groups using the language characteristic scores.
The responsiveness of each group to different message content is tested by sending a plurality of test messages with different message content to at least a subset of users within each group. For example, the message content may be a certain product or the use of certain vocabulary in a product offer. Each group may be tested by sending a plurality of test messages to at least a subset of users within each group with offers of different products or using certain vocabulary to offer the products. The response rates to the test messages are then measured. For each group, a message content to which the group is most responsive is identified (e.g., a best product and/or vocabulary) and the message content is associated with the group.
The foregoing steps essentially create a dataset that maps language characteristics to message content. Once the best message content is identified for each group, this information is used to determine message content to send to new users (i.e., users that were not a part of the clustering process). User-generated content for a new user is then obtained. The language characteristics from the new user's user-generated content are extracted. One or more language characteristic scores are assigned to the new user. The group to which the new user belongs is identified using the new user's language characteristic scores. A message is sent to the new user with the message content previously associated with the user's group. In one embodiment, the steps pertaining to the new user are performed on a client device of the new user by a client application that executes rules that map each group to a select message.
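By way of illustration only, the new-user steps may be sketched as a nearest-centroid lookup followed by a rule table. The sketch below is not the claimed implementation: the centroid values, the group names, the pairing of messages with groups, and the functions `assign_group` and `message_for_new_user` are all hypothetical, and Euclidean distance is an assumed measure of closeness.

```python
import math

# Hypothetical group centroids in language-score space
# (assumed features: slang rate, spelling-error rate, formality).
CENTROIDS = {
    "casual": [0.8, 0.6, 0.1],
    "formal": [0.1, 0.1, 0.9],
}

# Hypothetical rules mapping each group to the message content
# previously associated with it.
GROUP_MESSAGE = {
    "casual": "Great deal!",
    "formal": "Limited time offer!",
}


def assign_group(scores, centroids):
    """Return the name of the group whose centroid is closest to the scores."""
    return min(centroids, key=lambda name: math.dist(scores, centroids[name]))


def message_for_new_user(scores):
    """Pick the message content associated with the new user's group."""
    return GROUP_MESSAGE[assign_group(scores, CENTROIDS)]
```

Because only the centroids and the rule table are needed at lookup time, a table of this kind could be shipped to a client application so that the new user's content need not leave the device.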
FIGS. 5a-5b are a flowchart that illustrates a semi-private method for identifying message content to send to users based on the users' language characteristics according to one embodiment of the invention.
The present invention provides a system, method, and computer program for identifying the optimal message content to send to users based on the users' language characteristics. In the preferred embodiment as seen in
Language characteristics are extracted from the user-generated content (step 120). One or more language characteristic scores are assigned to each of the users (step 130). The language characteristics on which the users are scored may include the vocabulary characteristics or the morphology characteristics of the language. Examples of vocabulary characteristics include the presence or absence of specific words, the frequency (absolute or relative to others) of certain words, etc. Morphology characteristics are the structure or form of the language, such as punctuation, capitalization, spelling errors, grammar errors, etc. For example, users may be scored based on spelling errors (a general count and a count for typical words that are misspelled), grammar, punctuation, acronyms, slang, user-created words, emoticons, level of formality, foreign language words, specialty words (e.g., rare, fancy, archaic, or domain-specific words), etc.
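As a non-limiting sketch of steps 120 and 130, the following Python function computes a handful of vocabulary and morphology scores for one item of written content. The word lists, the regular expressions, the score names, and the function `language_scores` are assumptions made for illustration; a practical system would likely rely on much larger lexicons and a proper spelling and grammar checker.

```python
import re

# Hypothetical word lists; real lexicons would be far larger.
SLANG = {"lol", "omg", "gonna", "wanna"}
COMMON_MISSPELLINGS = {"teh", "recieve", "definately"}
EMOTICON = re.compile(r"[:;]-?[)(DP]")
WORD = re.compile(r"[A-Za-z']+")


def language_scores(text):
    """Return a dict of simple language characteristic scores for one text."""
    words = WORD.findall(text)
    lowered = [w.lower() for w in words]
    n = max(len(words), 1)  # avoid division by zero on empty input
    return {
        "slang_rate": sum(w in SLANG for w in lowered) / n,
        "misspelling_rate": sum(w in COMMON_MISSPELLINGS for w in lowered) / n,
        "emoticon_count": len(EMOTICON.findall(text)),
        "all_caps_rate": sum(w.isupper() and len(w) > 1 for w in words) / n,
        "exclamation_count": text.count("!"),
    }
```

Scores for a user may then be obtained by averaging or summing the per-item scores over all content items carrying that user's ID.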
Below are example language characteristic scores assigned to a user:
The users are clustered into groups using the numerical value of the features (e.g., language characteristic scores) (step 140). Various types of algorithms may be used for clustering, including, for example, k-means, GMM, EM, various hierarchical methods, and possibly also spectral methods for dimensionality reduction, as would be known to a person skilled in the art. Another option would be to use bi-clustering (co-clustering) to cluster both the groups of users and the groups of terms prevalent for those groups of users.
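For illustration, step 140 may be sketched with a plain k-means routine over the users' score vectors. This is only one of the algorithms named above, and the details here are assumptions: the function name `kmeans`, the fixed iteration count, and the deterministic farthest-point initialization are chosen for clarity rather than taken from the disclosure.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster score vectors with plain k-means; returns (labels, centroids)."""
    # Assumed initialization: start from the first point, then repeatedly
    # add the point farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each user to the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

In practice the score vectors would typically be scaled to comparable ranges first, since k-means is sensitive to features measured on different scales.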
In one embodiment, using a co-clustering algorithm, a matrix of size N by K is defined, where N is the number of users (e.g., 10 million) and K is the number of features (i.e., language characteristics such as discussed above) (e.g., 5000). Each element in the matrix describes the relationship between the user and a unique term (hence, user-term) (e.g., the presence of specific words, grammatical errors, spelling mistakes, or unique grammatical structures, etc.) and is normalized (i.e., proportional to the probability (in the “author attribution” sense) of identifying a text with a user given that the term was used in the text). This relationship (user-term) can be defined in many ways, such as, for example, the number of times the user has used a term divided by the number of times any individual has used the term. In certain embodiments, the probability of the user given the term can be refined (e.g., by Laplace smoothing). The resulting matrix may then be co-clustered (e.g., by sparse singular value decomposition) to yield both clusters of users and clusters of terms.
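The user-term relationship described above may be sketched as follows, by way of example only. Each entry is the Laplace-smoothed estimate of the probability of the user given the term, i.e., (count of the term for the user + alpha) divided by (count of the term for all users + alpha times N). The function name `user_term_matrix`, the dense list-of-lists layout, and the default `alpha` of 1.0 are assumptions; at the scale mentioned above a sparse representation would be needed, and the subsequent co-clustering step (e.g., sparse singular value decomposition) is not shown.

```python
def user_term_matrix(counts, alpha=1.0):
    """Return smoothed P(user | term) from raw usage counts.

    counts[u][t] is the number of times user u used term t.
    """
    n_users = len(counts)
    n_terms = len(counts[0])
    # Column totals: how often any user used each term.
    totals = [sum(row[t] for row in counts) for t in range(n_terms)]
    return [
        [(row[t] + alpha) / (totals[t] + alpha * n_users) for t in range(n_terms)]
        for row in counts
    ]
```

With smoothing, a term that no user has used yields a uniform value of 1/N for every user rather than a division by zero, and each column of the matrix sums to one.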
In addition to clustering users based on language characteristic scores, users may be clustered based on one or more non-language characteristics. For example, users may be clustered by geography, age, education level, and/or gender, as well as language characteristics.
Test messages are sent with different message content to at least a subset of users in each group (step 150). For each group, the response rates to the test messages are measured to identify the message content to which the group is most responsive (step 160). An exemplary description of sending test messages and measuring response rates may be found in U.S. application Ser. No. 13/517,032, filed on Jun. 18, 2012 and U.S. application Ser. No. 13/290,051 filed on Nov. 4, 2011, both of which are incorporated by reference as if fully disclosed herein. In certain embodiments, the responses to test messages are analyzed to determine if delimitations between clusters are valid. For example, if two clusters have the same optimal message content, then those clusters are combined. For each group, the identified message content is associated with the group (step 170).
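Steps 160 and 170, together with the optional combining of clusters, may be sketched as shown below. The tuple layout of `results` and the function names `best_content_per_group` and `merge_groups` are hypothetical, and the sketch compares raw response rates only; it makes no attempt at the statistical testing that a production system would presumably apply before declaring a winner.

```python
from collections import defaultdict


def best_content_per_group(results):
    """Return {group: content with the highest response rate}.

    results is an iterable of (group, content, sent, responses) tuples.
    """
    best = {}
    for group, content, sent, responses in results:
        rate = responses / sent if sent else 0.0
        if group not in best or rate > best[group][1]:
            best[group] = (content, rate)
    return {group: content for group, (content, _) in best.items()}


def merge_groups(best):
    """Combine groups that share the same best content."""
    merged = defaultdict(list)
    for group, content in best.items():
        merged[content].append(group)
    return {content: sorted(groups) for content, groups in merged.items()}
```

For example, if groups g1 and g2 both respond best to content B while g3 responds best to content A, `merge_groups` returns one combined group for B and a separate group for A.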
In certain embodiments, test messages are sent with offers of different products to at least a subset of users in each group. For each group, the response rates to the test messages are measured to identify the product to which the group is most responsive. For each group, the identified product is associated with the group. In certain embodiments, the best product is determined while keeping the vocabularies between the test messages constant. Once the best product has been determined, test messages are sent with offers of the best product for the group and certain different vocabularies to at least a subset of users in each group. For each group, the response rates to the test messages are measured to identify the vocabularies to which the group is most responsive. For example, keeping the offer for a hamburger constant, certain users may receive a test message having the phrase “Great deal!” while other users may receive a test message having the phrase “Limited time offer!” For each group, the best product or product offer and certain vocabularies may be the message content that is associated with the group. In certain embodiments, instead of certain vocabularies being associated with the group (e.g., certain word choices), certain vocabulary rules may be associated with the group (e.g., do not use slang).
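The two-stage testing described above, in which the product is selected first and the vocabulary second, may be sketched for a single group as follows. The dictionary layouts and the function names `best_by_rate` and `two_stage_content` are assumptions for illustration, and every option is assumed to have been sent at least once.

```python
def best_by_rate(stats):
    """Return the option with the highest response rate.

    stats maps each option to a (sent, responses) pair with sent > 0.
    """
    return max(stats, key=lambda option: stats[option][1] / stats[option][0])


def two_stage_content(product_stats, wording_stats_by_product):
    """Stage 1 picks a product at fixed wording; stage 2 picks its wording."""
    product = best_by_rate(product_stats)
    wording = best_by_rate(wording_stats_by_product[product])
    return product, wording
```

Only the wording results for the winning product are consulted in the second stage, which mirrors holding the offer constant while the vocabulary varies.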
The foregoing steps essentially create a dataset that maps language characteristics to message content. Once the best message content is identified for each group, this information is used to determine message content to send to new users (i.e., users that were not a part of the clustering process). As seen in
The methods described with respect to
The message optimization system 400 includes a connectivity engine 410, a harvester 420, a feature extractor 430, a clustering engine 440, a serving platform 450, an analysis engine 460, a response aggregator 470, a message-sending interface 480, and one or more database interfaces 490. The connectivity engine 410 connects the message optimization system 400 with the Internet (e.g., FACEBOOK) or with software on a client device (e.g., SKYPE) to collect user-generated content. As seen in
The harvester 420 receives user-generated content and puts it in one or more databases 495. The feature extractor 430 takes the user-generated content, extracts the language characteristics, and calculates one or more language characteristic scores. These scores may be based on vocabulary vectors and/or the presence or absence of certain language characteristics. User scores are also stored in the one or more databases 495. Examples of parsing or extraction tools include Python's “nltk” package or Perl's “Lingua.” The clustering engine 440 retrieves user-generated content, runs it through the feature extractor 430 to obtain language characteristic scores, and clusters users into groups using the language characteristic scores.
The serving platform 450 uses the clustering engine 440 and the results saved by the analysis engine 460 to assign a new user to a group and to determine the message content to send to the user. The analysis engine 460 analyzes the effects of different message content on each group to identify the best message content for each group. It then stores the results in the one or more databases 495. In addition, the analysis engine 460 verifies the validity of the cluster delimitations. The response aggregator 470 receives and aggregates responses to test messages, which is used by the analysis engine 460. The message-sending interface 480 sends messages to the users. The one or more database interfaces 490 interface with the one or more databases 495. The components illustrated in
FIGS. 5a-5b illustrate another method performed by a system for identifying message content to send to users based on the users' language characteristics. In contrast to
As seen in
As seen in
As will be understood by those familiar with the art, the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Accordingly, the above disclosure of the present invention is intended to be illustrative and not limiting of the invention.
This application claims the benefit of U.S. Provisional Application No. 61/577,662, filed on Dec. 20, 2011, and titled “System and Method for Deriving Distinguishing Information about a User from the Characteristics of a User's Own Language,” and is also a continuation-in-part of U.S. application Ser. No. 13/517,032 filed on Jun. 18, 2012 and titled “Message Optimization.” The contents of both applications are incorporated by reference as if fully disclosed herein.
Number | Date | Country |
---|---|---|
61577662 | Dec 2011 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 13517032 | Jun 2012 | US |
Child | 13714692 | | US |