1. Field of the Invention
The present invention relates to electronic mailbox measurement. More particularly, the present invention relates to redaction of identification data in electronic mailbox measurement.
2. Background of the Related Art
Email campaigns are widely used by established companies with legitimate purposes and responsible email practices to advertise, market, promote, or provide existing customers with information related to one or more products, services, events, etc. Such email campaigns may be used for commercial or non-commercial purposes. They can be targeted to a specific set of recipients, and to a particular goal, such as increasing sales volume or increasing donations.
It is a desire of email campaign managers, and others who initiate email campaigns, for sent messages to be ultimately delivered to the intended message recipients. U.S. patent application Ser. No. 13/449,153, which is incorporated herein by reference in its entirety, describes a system and method for monitoring the deliverability of email messages (i.e., whether or not sent messages are ultimately delivered to intended message recipients).
It is a further desire of campaign managers to design campaigns that incite a maximum level of engagement by recipients of the email messages associated with each campaign. For example, campaign managers endeavor to increase the amount of campaign related messages that are read by recipients, the amount of messages that are forwarded by recipients, the amount of links within messages that are followed by recipients, and the amount of recipients that prioritize messages associated with various campaigns. To maximize engagement, campaign managers rely on practices such as carefully composing the subjects and contents of campaign-related messages, carefully selecting the time at which messages are sent, choosing the frequency at which messages are sent, and targeting campaigns to select groups of recipients.
To assist campaign managers in maximizing the effectiveness of email campaigns, there exists a need to provide campaign managers with a system and method to evaluate the effectiveness of campaigns, based on the recipients' level of engagement with each campaign. In particular, there exists a need to provide campaign managers with a system and method to compare the performances of multiple email campaigns with one another, so that the campaign managers may tailor the practices they use to increase recipient engagement with a particular campaign, based on that campaign's performance relative to other campaigns. Commonly owned U.S. application Ser. No. 13/538,518, filed Jun. 29, 2012, which is incorporated herein by reference in its entirety, provides a system and method for collecting data related to recipients' level of engagement with email campaigns.
There exists a need to provide a system and method to redact certain information, such as personal and/or private information, when evaluating and reporting the effectiveness of email campaigns.
Accordingly, it is an object of the invention to provide a system and method for redacting information from email messages. It is a further object of the invention to remove personal recipient information from email messages that are provided to a third party, such as for marketing and evaluation purposes. It is yet another object of the invention to provide a system and method for redacting personal identification information from email messages of an email campaign that are analyzed for message processing data.
A system and method redacts information from messages, and especially messages of an email campaign. The system receives a plurality of campaign reports, each campaign report including campaign data associated with the email campaign. The system redacts information from the campaign data, such as personal information of one or more recipients of the email campaign.
These and other objects of the invention, as well as many of the intended advantages thereof, will become more readily apparent when reference is made to the following description, taken in conjunction with the accompanying drawings.
FIGS. 4(a), 4(c) are graphic displays of a user interface with an unredacted message body in an exemplary embodiment for processing by the present invention;
FIGS. 4(b), 4(d) are graphic displays of a user interface with a redacted message body in accordance with an exemplary embodiment of the invention.
In describing a preferred embodiment of the invention illustrated in the drawings, specific terminology will be resorted to for the sake of clarity. However, the invention is not intended to be limited to the specific terms so selected, and it is to be understood that each specific term includes all technical equivalents that operate in similar manner to accomplish a similar purpose. Several preferred embodiments of the invention are described for illustrative purposes, it being understood that the invention may be embodied in other forms not specifically shown in the drawings.
The system and method of the present invention is implemented by computer software that permits the accessing of data from an electronic information source. The software and the information in accordance with the invention may be within a single, free-standing computer or it may be in a central computer networked to a group of other computers or other electronic devices. The information may be stored on a computer hard drive, on a CD ROM disk or on any other appropriate data storage device.
Turning to the drawings,
Each of the components of the system 10 (including the sending servers 101, client computers 102, data collectors 103, FTP server 104, analytics cluster 105, database server 106, web server 107, and devices used by the campaign manager 108) may be implemented by a computer or computing device having one or more processors to perform various functions and operations in accordance with the invention. The computer or computing device may be, for example, a mobile device (such as a smart phone), personal computer (PC), server, or mainframe computer. In addition to the processor, the computer hardware may include one or more of a wide variety of components or subsystems including, for example, a co-processor, input devices (such as a keyboard, touchscreen, and/or mouse), display device (such as a monitor or screen), and a memory or storage device such as a database. All or parts of the system 10 and processes can be implemented at the processor by software or other machine executable instructions which may be stored on or read from computer-readable media for performing the processes described. Unless indicated otherwise, the process is preferably implemented automatically by the processor in real time without delay. Computer readable media may include, for example, hard disks, floppy disks, memory sticks, DVDs, CDs, downloadable files, read-only memory (ROM), or random-access memory (RAM).
As illustrated in
Although in
An exemplary non-limiting illustrative embodiment of the system 10 operates in accordance with the flow diagram 200 shown in
At step 202, recipient mail clients receive the email message associated with the email campaign. If the message successfully reaches a recipient, the recipient may view the message on a client computer 102 via, for example, a webmail, desktop, or mobile email client. The set of all recipients includes a subset of panel recipients, wherein the usage activity of the panel recipients is considered representative of the usage activity of all recipients. Each panel recipient's mail client is equipped with one of several third party add-ons to the email client. Such add-ons allow for anonymous recording of the recipient's usage activity regarding mailbox placement and interaction with messages. Recipients interact with the received campaign email messages as they normally would. Such interactions may include, for example, opening messages, reading messages, deleting messages either before or after reading them, adding the sender of a message to the recipient's personal address book, forwarding messages, and clicking on links within messages.
At step 203, the data collectors 103, which may be operated by the providers of the third party add-ons, collect metrics associated with the recipient interactions. The collection of such metrics may be facilitated by the add-ons, which record recipient usage activity at the client computers 102 and transmit the recorded information to the data collectors 103 via the network. Preferably, each data collector 103 is an independent entity. Each data collector 103 aggregates the collected metrics by campaign to produce a campaign report, which includes campaign data, for each specific campaign. Campaign data may include message receive date, message receive time, subject line, sender domain name, sender user name, originating IP addresses, campaign ID header, and all of the associated mailbox placement and interaction metrics. The campaign reports produced by the data collectors may take on any appropriate format, provided the campaign reports are capable of being read by the measurement center 100. For example, the campaign reports may be tab delimited files, multiple SQL dump files, XML files, etc. When multiple data collectors 103 produce campaign reports having differing formats, the measurement center 100 may employ panel data and campaign rollup logic.
At step 204, each of the data collectors 103 transmits one or more individual campaign reports to a secure server 104 via sFTP or some other similar secure protocol. At step 205, the individual campaign reports are transferred from the secure server 104 to an analytics cluster 105 where the following process occurs. Utilizing the unique combination of campaign data (e.g., message receive date, message receive time, subject line, sender domain name, sender user name, originating IP addresses, and campaign ID (which is included in the campaign ID header)) from each of the multiple individual campaign reports received from the data collectors 103, the analytics cluster 105 identifies which campaign data from each campaign report pertains to each of one or more campaigns. For example, the analytics cluster 105 may determine that certain campaign data received from different data collectors 103 pertains to the same campaign, because the campaign data is associated with the same campaign ID. Thus, one report can contain data attributed to one or more campaigns, and data for one campaign may be obtained from one or more reports.
The analytics cluster 105 aggregates the like interaction metrics from each of the individual campaign reports for each of the campaigns. For example in a system 10 with two data collectors 103, a first data collector 103 may report that twenty recipients read an email message having a particular campaign ID, and a second data collector 103 may report that ten recipients read an email message having the same campaign ID. Thus, the analytics cluster 105 would aggregate the interaction metrics from the individual reports to determine that a total of thirty recipients read the email message. Data from each of the campaigns is included in a single report generated by the analytics cluster 105, the single report providing campaign performance statistics for all of the email campaigns having messages received by the recipients reporting to the data collectors 103.
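The aggregation described above can be sketched as follows. This is a minimal illustration rather than the actual implementation of the analytics cluster 105: the report layout (a list of rows, each with a `campaign_id` and a `metrics` dictionary) and the function name `aggregate_reports` are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_reports(reports):
    """Sum like interaction metrics across reports, keyed by campaign ID.

    `reports` is a list of reports; each report is a list of rows of the
    (assumed) form {"campaign_id": str, "metrics": {metric_name: count}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for report in reports:
        for row in report:
            campaign_totals = totals[row["campaign_id"]]
            for metric, count in row["metrics"].items():
                campaign_totals[metric] += count
    # Convert nested defaultdicts to plain dicts for the combined report.
    return {cid: dict(metrics) for cid, metrics in totals.items()}


# Two data collectors report on the same campaign ID.
report_a = [{"campaign_id": "V-2012-08-11-1A", "metrics": {"read": 20}}]
report_b = [{"campaign_id": "V-2012-08-11-1A", "metrics": {"read": 10}}]
combined = aggregate_reports([report_a, report_b])
print(combined["V-2012-08-11-1A"]["read"])  # 30
```

One report may carry rows for several campaigns, and one campaign may draw rows from several reports, which is why the totals are keyed by campaign ID rather than by report.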
In one non-limiting illustrative embodiment, a benchmarking process is run utilizing a statistical model for testing similarity that generates an engagement score based on recipients' engagement with each of the campaigns observed by the data collectors 103. In an exemplary embodiment of the invention, the model assigns weighted rankings to the following variables to benchmark engagement: amount of messages placed in inbox, amount of messages placed in spam folder by ISP, amount of messages placed in spam folder by recipient, amount of messages rescued from spam folder by recipient, amount of messages placed in a priority inbox or similar folders for ISPs that have them (e.g., Gmail priority inbox), amount of messages for which the sender is added to a personal address book, amount of messages opened, amount of messages read, amount of messages deleted without being read, amount of messages forwarded, amount of messages replied to, and the amount of messages for which recipients do not interact with the message at all.
The analytics cluster 105 uses the weighted ranking of each of the interaction metrics for each individual campaign to generate an engagement score for the campaign. Some interaction metrics, such as the amount of messages read, may be weighted more heavily than other interaction metrics. Furthermore, the relative weights of the interaction metrics may be modified, as appropriate, in accordance with the invention. Preferably, all interaction metrics reported by the data collectors 103 are considered by the analytics cluster 105. In addition, the interaction metrics that may be considered are not limited to the exemplary interaction metrics discussed herein.
An exemplary embodiment of the invention determines and assigns an engagement score and an engagement ranking to each individual campaign. The engagement score provides an indication of the recipients' engagement with the campaign. The engagement ranking provides an indication of the recipients' engagement with the particular campaign as compared to the recipients' overall engagement with all campaign email messages received. The engagement score may be, for example, a numerical value between 0 and 1, and the engagement ranking may be an integer value from 1 to 5. Each campaign is assigned an engagement benchmark based on the engagement ranking. For example, a campaign with an engagement ranking of 1 may be assigned an engagement benchmark of “poor,” and a campaign with an engagement ranking of 5 may be assigned an engagement benchmark of “excellent.”
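A weighted engagement score and a 1 to 5 ranking can be illustrated with the sketch below. The weights, the choice of metrics, and the equal-width mapping from score to ranking are all hypothetical; the description states only that some metrics (such as messages read) may be weighted more heavily, that the score may lie between 0 and 1, and that the ranking may be an integer from 1 to 5.

```python
# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"read": 0.5, "forwarded": 0.3, "inbox": 0.2}


def engagement_score(metrics, total_messages):
    """Weighted average of per-metric rates, yielding a value in [0, 1]."""
    if total_messages <= 0:
        raise ValueError("total_messages must be positive")
    weighted = sum(
        weight * metrics.get(name, 0) / total_messages
        for name, weight in WEIGHTS.items()
    )
    return weighted / sum(WEIGHTS.values())


def engagement_ranking(score):
    """Map a score in [0, 1] to an integer ranking from 1 to 5.

    Equal-width buckets are an assumption; a ranking relative to all
    observed campaigns (e.g., by percentile) would also fit the text.
    """
    return min(5, int(score * 5) + 1)


score = engagement_score({"read": 40, "forwarded": 10, "inbox": 90}, 100)
rank = engagement_ranking(score)
print(round(score, 2), rank)
```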
In step 301, candidate email messages of a particular email campaign are received from different user accounts by the data collectors 103. This can occur, for instance, at step 203 of
For example, a collection whitelist may contain “info@vanguard.com” (sender) and “V-2012-08-11-1A” (campaign ID) or “Your transaction confirmation is ready” (subject line), in which case all email messages are collected that match those criteria in step 301, as in the candidate messages shown in
A minimum number of email messages per campaign must be collected from step 301 for the process to continue. In the preferred embodiment, a minimum of 3 messages per campaign is needed since at least 3 different messages are needed to note the differences between them. If only 1 or 2 messages are collected, the differences between them could be incidental rather than instructive for redaction (i.e. the differences might not actually be personal identification information).
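The collection step and the minimum-count check can be sketched as below. The message and whitelist layouts (dictionaries with `sender`, `campaign_id`, and `subject` keys) and the function names are assumptions for illustration; only the matching rule (sender plus campaign ID or subject line) and the minimum of 3 messages come from the description.

```python
MIN_MESSAGES_PER_CAMPAIGN = 3  # minimum stated for the preferred embodiment


def matches_whitelist(message, entry):
    """True if the sender matches and either the campaign ID or subject matches."""
    same_sender = message.get("sender") == entry["sender"]
    same_campaign = (
        entry.get("campaign_id") is not None
        and message.get("campaign_id") == entry["campaign_id"]
    )
    same_subject = (
        entry.get("subject") is not None
        and message.get("subject") == entry["subject"]
    )
    return same_sender and (same_campaign or same_subject)


def collect_candidates(messages, entry):
    """Return matching messages, or an empty list if too few were collected."""
    candidates = [m for m in messages if matches_whitelist(m, entry)]
    if len(candidates) < MIN_MESSAGES_PER_CAMPAIGN:
        return []
    return candidates


entry = {"sender": "info@vanguard.com", "campaign_id": "V-2012-08-11-1A"}
inbox = [
    {"sender": "info@vanguard.com", "campaign_id": "V-2012-08-11-1A"},
    {"sender": "info@vanguard.com", "campaign_id": "V-2012-08-11-1A"},
    {"sender": "info@vanguard.com", "campaign_id": "V-2012-08-11-1A"},
    {"sender": "other@example.com", "campaign_id": "V-2012-08-11-1A"},
]
candidates = collect_candidates(inbox, entry)
print(len(candidates))  # 3
```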
In step 303, the email messages are organized into one or more clusters based on message structure, message size, and/or message similarity. According to one embodiment, emails can be hierarchically clustered first based on message structure, then based on message size, and then based on message similarity. Message similarity can be determined based on longest common sub-strings. Clustering of candidate messages is conducted to separate different message content across the candidate list of messages, which have the same subject line or campaign ID, but different content. For example, the sender (which can be a social website such as LinkedIn) may send 500 emails with subject line “Reconnect with Your Business Contacts” with email content suggesting 3 business contacts to recipients. The sender may then also send 500 different emails with the same subject line but with email content suggesting 5 business contacts. The message clustering would separate these two groups into two candidate sets for redaction.
Message structure can be determined based on one or more of the presence of headers, the presence and/or number of attachments, and/or the message body. In cases as the LinkedIn example above, where two sets of messages share a sender and subject line, but differ in content, clustering groups those messages into sets sharing the most common attributes, including the email headers, presence and/or number of attachments and similarity of the message bodies. These sets of messages are separated only in the computer memory (whether at the data collector 103 or the separate processors) and each set is prepared separately for its own redaction process in step 305.
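One way to picture the hierarchical clustering (structure, then size, then similarity) is the sketch below. It is an illustration under stated assumptions: the message layout, the size bucket width, and the similarity cutoff are hypothetical, and the longest common substring is found with Python's `difflib.SequenceMatcher.find_longest_match`, which is a convenient stand-in rather than the method the description prescribes.

```python
from difflib import SequenceMatcher

SIZE_BUCKET = 200      # hypothetical bucket width, in characters
MIN_SIMILARITY = 0.5   # hypothetical similarity cutoff


def longest_common_substring_ratio(a, b):
    """Length of the longest common substring relative to the shorter text."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))


def cluster_messages(messages):
    """Group by structure and size first, then split each group by similarity."""
    coarse = {}
    for msg in messages:
        key = (
            tuple(sorted(msg["headers"])),       # which headers are present
            msg["attachments"],                  # number of attachments
            len(msg["body"]) // SIZE_BUCKET,     # coarse message size
        )
        coarse.setdefault(key, []).append(msg)

    clusters = []
    for group in coarse.values():
        subclusters = []
        for msg in group:
            for sub in subclusters:
                # Compare against the first member of each existing subcluster.
                if longest_common_substring_ratio(sub[0]["body"], msg["body"]) >= MIN_SIMILARITY:
                    sub.append(msg)
                    break
            else:
                subclusters.append([msg])
        clusters.extend(subclusters)
    return clusters


def make(body):
    return {"headers": ["From", "Subject"], "attachments": 0, "body": body}


messages = [
    make("Hi Ann, reconnect with these 3 business contacts today."),
    make("Dear Cat: five people in your network changed jobs this week."),
    make("Hi Bob, reconnect with these 3 business contacts today."),
    make("Dear Dan: five people in your network changed jobs this week."),
]
clusters = cluster_messages(messages)
print(len(clusters))  # 2
```

Each resulting list exists only in memory and would be handed separately to the redaction step, consistent with the description that no cluster ID is needed.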
In step 305, within each cluster, each email is compared to the first email in the set, and common text is detected and identified using a suitable common subsequence algorithm, such as the Hunt-McIlroy longest common subsequence algorithm (http://en.wikipedia.org/wiki/Hunt%E2%80%93McIlroy_algorithm, the content of which is herein incorporated by reference). Every email in the list is compared to the first one, one pair at a time, in succession. Because this algorithm uses a character-by-character comparison of two strings of text, “common text” is only that text which is exactly the same in both message bodies.
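The pairwise comparison, together with the replacement of non-common text by redaction characters, can be sketched as follows. This example uses the textbook dynamic-programming longest common subsequence rather than the Hunt-McIlroy algorithm named above; the function names and the `*` mask are assumptions for illustration. Note that a purely character-level comparison can keep an incidental character that two different names happen to share, so the sample values are chosen to avoid that.

```python
def lcs_keep_mask(reference, other):
    """Mark which characters of `other` lie on a longest common subsequence
    shared with `reference` (character-by-character comparison)."""
    n, m = len(reference), len(other)
    # lengths[i][j] holds the LCS length of reference[i:] and other[j:].
    lengths = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if reference[i] == other[j]:
                lengths[i][j] = lengths[i + 1][j + 1] + 1
            else:
                lengths[i][j] = max(lengths[i + 1][j], lengths[i][j + 1])

    # Walk the table from the start to recover one LCS alignment.
    keep = [False] * m
    i = j = 0
    while i < n and j < m:
        if reference[i] == other[j]:
            keep[j] = True
            i += 1
            j += 1
        elif lengths[i + 1][j] >= lengths[i][j + 1]:
            i += 1
        else:
            j += 1
    return keep


def redact_body(reference, other, mask_char="*"):
    """Replace every character of `other` that is not common text."""
    keep = lcs_keep_mask(reference, other)
    return "".join(ch if kept else mask_char for ch, kept in zip(other, keep))


first = "Hi Ann, your order 1234 shipped."
second = "Hi Bob, your order 5678 shipped."
redacted = redact_body(first, second)
print(redacted)  # Hi ***, your order **** shipped.
```

The table costs time and memory proportional to the product of the two body lengths, which is acceptable for a sketch; Hunt-McIlroy style algorithms exist precisely to do better on long, mostly similar inputs.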
In step 307, once the strings of common text between two emails are identified from step 305, the remainder of the text (the uncommon parts) is replaced with redaction characters (“*” or a block of black background, as seen in
Clustering is an optimization based on real-world client behavior. Some clients may send multiple different sets of content under the same campaign ID or subject line. This means that when a list of messages is collected “in a campaign” it may, in reality, be several content-driven campaigns masquerading under the same campaign identifier. Thus, clustering the messages sorts these different content sets out from one another, such that each candidate set of messages is then truly only those that share all content structure except personal identification information that will be redacted.
The list of messages (bits in memory) is passed through a clustering algorithm, which splits that list into new lists of content-grouped messages (several different sets of bits in memory). There's no need for a cluster ID, because this all happens within the same process and the data simply lives in computer memory while it is needed.
FIGS. 4(b) and 4(d) are graphic displays of a user interface with a redacted message body in accordance with non-limiting exemplary embodiments of the invention.
FIGS. 4(c) and 4(d) show another campaign email example regarding a professional networking website. In
In step 501, the process accepts a number of similar subject lines from a previously determined set of messages in a campaign, again grouped by both sender and either subject line or campaign ID. Due to the comparatively small amount of content in a subject line, at least 10 messages from at least 5 distinct user email accounts are required to continue the redaction process. This is needed since a mathematical frequency is utilized for the threshold. For instance, say our threshold is 0.2 and we only have 3 messages. If a word that happens to be personal identification information appears in the subject of only 1 of those messages, it will have a frequency of 0.33, which is greater than our threshold and thus it wouldn't be redacted. Having at least 10 messages from at least 5 distinct user email accounts avoids that issue. Message sets that don't have enough messages can be removed from the analysis altogether.
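The sample-size requirement and the arithmetic behind it can be sketched as below. The message layout (a dictionary with an `account` key) and the function name are assumptions; the minimums of 10 messages and 5 accounts, the threshold of 0.2, and the 1-in-3 example come from the paragraph above.

```python
MIN_MESSAGES = 10  # minimum number of messages stated above
MIN_ACCOUNTS = 5   # minimum number of distinct user email accounts


def has_enough_samples(messages):
    """True if the set is large and diverse enough for frequency analysis."""
    accounts = {m["account"] for m in messages}
    return len(messages) >= MIN_MESSAGES and len(accounts) >= MIN_ACCOUNTS


# Why small sets are a problem: with 3 messages, a name seen once has a
# frequency of 1/3, which exceeds a threshold of 0.2 and so escapes redaction.
threshold = 0.2
frequency_in_small_set = 1 / 3
escapes_redaction = frequency_in_small_set >= threshold
print(round(frequency_in_small_set, 2), escapes_redaction)

# With 10 messages, the same single occurrence falls below the threshold.
frequency_in_larger_set = 1 / 10
print(frequency_in_larger_set < threshold)
```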
In step 503, each subject line in a candidate set (i.e., the set of all messages that matched the whitelist and are being used for redaction) is broken into individual words in order to allow comparison of the frequency of each word in the full set. In step 505, a measure of occurrence is determined for each word within the corpus of subject lines. According to one embodiment, the measure of occurrence is the normalized number of times a word appears within the corpus of subject lines; in other words, the number of times that a single word appears, divided by the total number of subject lines in the set.
In step 507, the words with a measure of occurrence below a pre-determined threshold are removed from each subject line and/or replaced with a pre-determined character. The threshold is needed because the proportion of email messages that contain a given word in the subject line indicates whether or not that word is personal identification information that should be redacted. Personal identification information is, by its nature, a rare occurrence in the context of an entire campaign, thus making this frequency analysis an appropriate fit for its redaction. For example, if a sender sends a campaign of emails to its customers with a subject line like “Hey Joe, 50% off All Electronics”, the frequency of every word except “Joe” will be 100% across the entire set of messages in the campaign, whereas the frequency of the word “Joe” will be less than 100% and, because that frequency is less than the pre-determined threshold, “Joe” will be redacted.
According to one embodiment, the pre-determined threshold is determined based on prior experimentation. These experiments involve running this subject line redaction process on several campaigns of email messages and having a human inspect the redacted results until the point at which all identification information is removed from all sets of subject lines. According to one embodiment, the threshold for all campaigns can be 0.1 (10%), but this could range anywhere from 0.001 to 0.3, depending on the data and usage.
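Steps 503 through 507 can be sketched as a short word-frequency pass. This is a simplified illustration: splitting on whitespace (so that punctuation stays attached to a word), the function name, and the sample names are assumptions; the threshold of 0.1 and the `_` placeholder follow the values mentioned in the description.

```python
from collections import Counter


def redact_subjects(subjects, threshold=0.1, placeholder="_"):
    """Replace words whose normalized frequency is below the threshold."""
    tokenized = [subject.split() for subject in subjects]
    # Number of times each word appears across the corpus of subject lines.
    counts = Counter(word for words in tokenized for word in words)
    total = len(subjects)
    return [
        " ".join(
            word if counts[word] / total >= threshold else placeholder
            for word in words
        )
        for words in tokenized
    ]


names = ["Joe", "Ann", "Bob", "Cat", "Dan", "Eve",
         "Fay", "Gus", "Hal", "Ivy", "Jim", "Kim"]
subjects = [f"Hey {name}, 50% off All Electronics" for name in names]
redacted_subjects = redact_subjects(subjects)
print(redacted_subjects[0])  # Hey _ 50% off All Electronics
```

Each name appears in 1 of 12 subject lines (about 0.083), which is below 0.1, while every other word appears in all 12. Because the split is on whitespace, the comma attached to the name is redacted along with it.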
It is noted that message body redaction is performed by comparing messages to each other, whereas subject line redaction is performed by determining the frequency of words in the subject. This is due to the differences between the data, which make message body redaction a much more difficult problem that needs to be solved in a different way. Though it may not be optimal, message body redaction can use a word frequency analysis, and subject line redaction can use a comparison technique.
Next, an example subject line redaction process is described with reference to
After receiving the emails, each subject line is split into “word atoms” (step 503 of
Thus, the message recipient's name in each of the entries 608a, b is at position 0 in the subject line. In the present example, the first word after the recipient's name is “Save”. As shown in entry 608e, the word “Save” has a position of 6. Each subject line has its own set of words with their position. So if there are 15962 subject lines (as in the example shown), there will be 15962 copies of “Save” and its corresponding position in each of those subject lines. However, the system recognizes that those 15962 copies are for the same term “Save” and consolidates those to a single entry for “Save”. The position “6” is shown even though the 15962 copies could have a range of positions. The position indicates that the term “Save” is the next term to be displayed after the name. And, the position “11” for “50%” indicates that the term “50%” is the next term to be displayed after the term “Save”.
Thereafter at step 610, any words that have a frequency less than a predefined threshold are redacted (step 507 of
Finally at step 614, a redacted subject line 616 is reassembled from the redacted word atoms by replacing the redacted word with a character such as “_”. An example of the resulting redacted subject line is “_ Save 50% on All Ebooks & Videos” as shown in
The string is reassembled in position order, including redactions. For instance, if the subject was “Save 50%, Brad”, the following words would be split out with example counts: (0, “Save”, 15962); (5, “50%”, 15962); (9, “Brad”, 123). So, “Brad” would be redacted because its frequency (123/15962) is less than the threshold (0.1), which leaves this result: (0, “Save”, 15962); (5, “50%”, 15962); (9, “_”, 123). Then the words are reassembled in order by position: “Save”+“50%”+“_”. If the redaction had taken place in the middle of the subject, the redaction character would simply take the place of the redacted word, e.g. “Hey”+“_”+“Check”+“Out”+“Our”+“Deals”. Thus, the words are sorted by their starting position and reassembled after the redaction analysis.
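The reassembly by position can be sketched as below, using the (position, word, count) triples from the example above. The function name and the choice to join words with single spaces are assumptions for illustration.

```python
def reassemble(atoms, total_subjects, threshold=0.1, placeholder="_"):
    """Redact rare words, then rebuild the subject in position order.

    `atoms` is an iterable of (position, word, count) triples.
    """
    redacted = [
        (position, word if count / total_subjects >= threshold else placeholder)
        for position, word, count in atoms
    ]
    # Sort by starting position so the original word order is restored.
    redacted.sort(key=lambda atom: atom[0])
    return " ".join(word for _, word in redacted)


# Triples taken from the example, deliberately given out of order.
atoms = [(9, "Brad", 123), (0, "Save", 15962), (5, "50%", 15962)]
subject = reassemble(atoms, total_subjects=15962)
print(subject)  # Save 50% _
```

Here 123/15962 is roughly 0.0077, well below the 0.1 threshold, so only “Brad” is replaced.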
As further shown in
It should be noted, however, that any set of email messages with similar templated content, differing only in their use of private identifiable information, could be put through these same redaction processes. Email campaigns are just one such class of possible sets of emails that can be redacted in this manner. In addition, according to one embodiment, any of the processes described herein may additionally include removing information within an email header. Unless otherwise stated, the steps performed herein are all performed automatically in real-time by the processor, without manual interaction.
The foregoing description and drawings should be considered as illustrative only of the principles of the invention. The invention may be configured in a variety of shapes and sizes and is not intended to be limited by the preferred embodiment. Numerous applications of the invention will readily occur to those skilled in the art. Therefore, it is not desired to limit the invention to the specific examples disclosed or the exact construction and operation shown and described. Rather, all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.
This application includes subject matter related to commonly owned U.S. application Ser. No. 13/538,518, filed Jun. 29, 2012 and assigned to the present Assignee, the entire contents of which are incorporated herein by reference.