This invention relates to software e-mail filters, especially those filters that employ adaptive rules to determine whether e-mail messages are wanted by the recipient.
The proliferation of junk e-mail, or “spam,” is a major annoyance to e-mail users who are bombarded by unsolicited messages that clog their mailboxes. While some e-mail solicitors do provide a link which allows the user to request not to receive further e-mail messages from them, many e-mail solicitors, or “spammers,” provide false addresses, so that requests to opt out of receiving further e-mails have no effect: the requests are directed to addresses that either do not exist or belong to individuals or entities who have no connection to the spammer.
It is possible to filter e-mail messages using software that is associated with a user's e-mail program. In addition to message text, e-mail messages contain a header having routing information (including IP addresses), a sender's address, recipient's address, and a subject line, among other things. The information in the message header may be used to filter messages. One approach is to filter e-mails based on words that appear in the subject line of the message. For instance, an e-mail user could specify that all e-mail messages containing the word “mortgage” be deleted or posted to a file. An e-mail user can also request that all messages from a certain domain be deleted or placed in a separate folder, or that only messages from specified senders be sent to the user's mailbox. These approaches have limited success since spammers frequently use subject lines that do not indicate the subject matter of the message (subject lines such as “Hi” or “Your request for information” are common). In addition, spammers are capable of forging addresses, so limiting e-mails based solely on domains or e-mail addresses might not result in a decrease of junk mail and might filter out e-mails of actual interest to the user.
“Spam traps,” fabricated e-mail addresses that are placed on public websites, are another tool used to identify spammers. Many spammers “harvest” e-mail addresses by searching public websites for e-mail addresses, then send spam to these addresses. The senders of these messages are identified as spammers and messages from these senders are processed accordingly. More sophisticated filtering options are also available. For instance, Mailshell™ SpamCatcher works with a user's e-mail program, such as Microsoft Outlook™, to filter e-mails by applying rules that identify and “blacklist” (i.e., mark certain senders, content, etc., as spam) messages by computing a spam probability score. The Mailshell™ SpamCatcher Network creates a digital fingerprint of each received e-mail and compares the fingerprint to other fingerprints of e-mails received throughout the network to determine whether the received e-mail is spam. Each user's rating of a particular e-mail or sender may be provided to the network, where the user's ratings will be combined with ratings from other network members to identify spam.
Mailfrontier™ Matador™ offers a plug-in that can be used with Microsoft Outlook™ to filter e-mail messages. Matador™ uses whitelists (which identify certain senders or content as being acceptable to the user), blacklists, scoring, community filters, and a challenge system (where an unrecognized sender of an e-mail message must reply to a message from the filtering software before the e-mail message is passed on to the recipient) to filter e-mails.
Cloudmark distributes SpamNet™, a software product that seeks to block spam. When a message is received, a hash or fingerprint of the content of the message is created and sent to a server. The server then checks other fingerprints of messages identified as spam and sent to the server to determine whether this message is spam. The user is then sent a confidence level indicating the server's “opinion” about whether the message is spam. If the fingerprint of the message exactly matches the fingerprint of another message in the server, then the message is spam and is removed from the user's inbox. Other users of SpamNet™ may report spam messages to the server. These users are rated for their trustworthiness, the reported messages are fingerprinted, and, if the users are considered trustworthy, the reported messages are blocked for other users in the SpamNet™ community.
SpamAssassin™ is another e-mail filter which uses a wide range of heuristic tests on mail headers and body text to try to block unsolicited e-mail. Unsolicited messages are detected based on scores of these tests.
A Bayesian filter may also be used, either on its own or in connection with one of the solutions discussed above. However, Bayesian filters require extensive training by each individual user before they can successfully detect and eliminate spam. In addition, Bayesian filters often focus on words alone, which may limit the filter's effectiveness since many words that are used in spam messages are also used in legitimate messages. Bayesian filters may also be dilutive, in that not all words or terms in messages scanned by the filter are used in determining the probability that the message is spam. For instance, one Bayesian filter (“Better Bayesian Filtering”, www.paulgraham.com/better.html, January 2003) proposed by Paul Graham uses only the fifteen most interesting “tokens” (text appearing in a message) to determine a probability the message is spam.
U.S. Pat. No. 6,161,130 to Horvitz et al. teaches an e-mail classifier which analyzes incoming messages' content to determine whether a message is “junk”. The classifier is trained on prior content classifications, i.e., features that are characteristic of junk or spam messages. Messages are probabilistically classified as legitimate or spam (though weighted probabilities are not used). The classifier may be retrained based on user input.
While current anti-spam solutions can be somewhat effective in eliminating spam, unsolicited messages often go undetected by these solutions. Part of the problem is that the rules that current anti-spam solutions employ are static, so spammers can devise ways to get past the rules. Another problem is that most systems only give a rule significance if the rule is satisfied (for example, ten points are subtracted from a message's score if the rule is satisfied). However, rules can have significance both when they are satisfied and when they are not satisfied (example: subtract 10 if satisfied, add 5 if not satisfied), and a system that takes advantage of this could be quite powerful. Yet another drawback to some of these solutions is that they require substantial user input before they can effectively detect spam. An additional problem is that these solutions' message scores are often based on a trial-and-error approach rather than an accurate weighting system. Therefore, there is a need for an e-mail filter that employs dynamic scoring, gives rules significance whether the rule is satisfied or not satisfied, does not require user input to be effective, and can precisely compute weights to give individual rules when assessing whether a received e-mail message is wanted or unsolicited.
The need has been met by an e-mail filter employing an adaptive ruleset which is applied to e-mail messages to determine whether the messages are wanted. Statistics are tracked for each of the rules of the adaptive ruleset and are used to determine weighted probabilities, or scores, indicating the likelihood that received messages are wanted or unsolicited. A rule has significance when it is satisfied and when it is not satisfied. The statistics for each rule are updated each time a message is rated, so the weights and probabilities calculated for each rule are fine-tuned without user input. This e-mail filter may be particularly effective when combined with another rule or algorithm where a very accurate initial rating of the message is obtained.
In one embodiment, when an e-mail message is received, it is first given an initial rating by an initial rule or filter which is fairly accurate. (In other embodiments, no initial rating is obtained.) The adaptive ruleset is then applied to the e-mail message. (In some embodiments, the adaptive ruleset is only applied to messages which meet certain criteria (for instance, those messages which cannot accurately be classified by the initial rule).) A final probability the message is wanted is obtained (for instance, by averaging the weighted probabilities obtained using the adaptive ruleset with the initial rating or simply using the results obtained using the adaptive ruleset). The message is then processed accordingly (sent to the recipient's Inbox, sent to a spam folder, deleted, etc.).
a is a block diagram showing a network configuration of one embodiment of the invention.
b is a block diagram showing a network configuration of another embodiment of the invention.
Referring to
In
In all embodiments of the invention, the filtering software may run on its own or may be used with other software filtering packages.
With reference to
Once the initial rating is obtained (block 32), each rule of the adaptive ruleset is applied to the e-mail message (block 34). Sample rules may include: 1) whether there are two consecutive spaces in the subject line and 2) whether there are more than four “non-English” words in the body. These rules are included for exemplary purposes; other embodiments may employ different rules for detecting wanted or unwanted messages in the ruleset. Rules may be added or deleted from the ruleset by the user or system administrator either on an individual basis or through software updates. If the rule is satisfied (block 36), a weighted probability, or score, that the message is wanted is obtained (block 40). If the rule is not satisfied (block 36), another weighted probability is obtained (block 38) since the rule may have different weights and probabilities depending on whether the rule is satisfied.
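The two sample rules above can be sketched as simple predicates applied to a message. This is an illustrative sketch only: the message representation, function names, and the toy lexicon used to approximate “non-English” words are assumptions, not part of the invention.

```python
import re

# Toy lexicon standing in for an English-word list (an assumption for illustration).
ENGLISH_WORDS = {"the", "a", "is", "free", "offer", "money", "now",
                 "click", "your", "request", "for", "information"}

def rule_double_space(msg):
    """Sample rule 1: two consecutive spaces appear in the subject line."""
    return "  " in msg["subject"]

def rule_non_english(msg, limit=4):
    """Sample rule 2: more than `limit` words in the body fall outside the lexicon."""
    words = re.findall(r"[a-z']+", msg["body"].lower())
    return sum(1 for w in words if w not in ENGLISH_WORDS) > limit

ADAPTIVE_RULESET = [rule_double_space, rule_non_english]

msg = {"subject": "Your  request for information",
       "body": "xqzt blorp frmp wxyz vbnm offer"}
satisfied = [rule(msg) for rule in ADAPTIVE_RULESET]  # [True, True]
```

Each rule returns a boolean; the weighted probability then obtained (blocks 38 and 40) depends on which branch was taken.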
The weights and probabilities for each rule are based on statistics collected (at a database) for each rule of the adaptive ruleset as well as the initial rule. Statistics may be collected for individual recipients or for all recipients in a network employing the adaptive ruleset. Statistics are collected for each rule in light of the initial rating. For instance, for each rule the following statistics may be calculated:
p1=no. of good messages [as rated by the initial rule] which satisfy the current rule/total number of messages that satisfy the current rule
p2=no. of good messages [as rated by the initial rule] which don't satisfy the current rule/total number of messages that do not satisfy the current rule
p3=no. of good messages [as rated by the initial rule]/total number of messages rated by the initial rule.
If the message satisfies a rule, the weighted probability or score is |p1−p3|*p1. The weight of the rule is |p1−p3| and the probability of the rule is p1. If the message does not satisfy the rule, the weighted probability is |p2−p3|*p2. Here, the weight of the rule is |p2−p3| and the probability of the rule is p2.
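The per-rule scoring just described can be expressed directly in code. This minimal sketch takes p1, p2, and p3 as already computed; the function name and the example values are illustrative.

```python
def weighted_probability(satisfied, p1, p2, p3):
    """Per-rule score as defined above: |p1 - p3| * p1 when the rule is
    satisfied, |p2 - p3| * p2 when it is not."""
    if satisfied:
        return abs(p1 - p3) * p1
    return abs(p2 - p3) * p2

# A rule satisfied mostly by good messages (illustrative values):
score_sat = weighted_probability(True, p1=0.9, p2=0.1, p3=0.5)    # 0.4 * 0.9 = 0.36
score_unsat = weighted_probability(False, p1=0.9, p2=0.1, p3=0.5) # 0.4 * 0.1 = 0.04
```

Note that the rule contributes a different score on each branch, reflecting the point above that a rule has significance whether or not it is satisfied.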
In an alternative embodiment, other weights for each rule may be used. For instance, the weight of p1 could be (p1−p3)². The greater the difference between p3 and p1, the greater p1 should be weighted since the difference between p1 and p3 indicates the discriminatory power of the rule, i.e., whether p1 can differentiate the message as wanted or unwanted better than p3. (This method of weighting should also be consistently employed for the difference between p2 and p3.)
If a rule is not helpful in differentiating wanted messages from unwanted messages, it will have a weight of zero or close to zero. For instance, suppose a rule is “message contains an odd number of characters.” Statistically, half of the messages received should satisfy the rule. Further suppose that 80% of received messages are unwanted. If 100 messages have been rated, p1=10/50, p2=10/50, and p3=20/100. Therefore, the weight of p1 would be |10/50−20/100| or 0 and the weight of p2 would be |10/50−20/100|, also 0. Since the rule does not differentiate between wanted messages and spam, the rule receives a weight of 0.
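The odd-character-count example can be reproduced from its raw counts; the variable names here are illustrative assumptions.

```python
# Counts from the example above: 100 rated messages, 20 good, rule satisfied by half.
good_sat, total_sat = 10, 50      # good messages satisfying the rule / all that satisfy it
good_unsat, total_unsat = 10, 50  # good messages not satisfying it / all that do not
good_all, total_all = 20, 100     # good messages overall / all rated messages

p1 = good_sat / total_sat      # 0.2
p2 = good_unsat / total_unsat  # 0.2
p3 = good_all / total_all      # 0.2

weight_if_satisfied = abs(p1 - p3)      # 0.0 -- the rule carries no information
weight_if_not_satisfied = abs(p2 - p3)  # 0.0
```

Because p1 and p2 equal the base rate p3, the rule's contribution to the final score vanishes on both branches.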
Returning again to
The statistics for each rule are updated each time a message is rated (for instance, by adjusting counters of messages that are rated, the number of good messages satisfying the current rule, etc.) (block 46). Results of each rating of a message are sent to the database, where the statistics for each rule (example p1, p2, and p3) are updated. Due to this updating activity, the weights for each rule adapt to the incoming datastream without any user input.
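The counter adjustments described above might be held per rule at the database along the following lines. This is a hypothetical sketch: the class, method names, and the zero-denominator fallbacks are assumptions.

```python
class RuleStats:
    """Hypothetical per-rule counters from which p1, p2, and p3 are derived."""
    def __init__(self):
        self.good_sat = 0     # good messages that satisfied the rule
        self.total_sat = 0    # all messages that satisfied the rule
        self.good_unsat = 0   # good messages that did not satisfy the rule
        self.total_unsat = 0  # all messages that did not satisfy the rule

    def update(self, satisfied, is_good):
        """Adjust counters after a message has been rated."""
        if satisfied:
            self.total_sat += 1
            self.good_sat += int(is_good)
        else:
            self.total_unsat += 1
            self.good_unsat += int(is_good)

    def p1(self):
        return self.good_sat / self.total_sat if self.total_sat else 0.0

    def p2(self):
        return self.good_unsat / self.total_unsat if self.total_unsat else 0.0

    def p3(self):
        total = self.total_sat + self.total_unsat
        return (self.good_sat + self.good_unsat) / total if total else 0.0
```

Calling `update` once per rated message is what lets the weights adapt to the incoming datastream without user input.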
In one embodiment of the invention, the adaptive ruleset may be used to rate the message without first obtaining an initial rating. In this embodiment, the adaptive ruleset could initially be given a set of starting values, for instance, values from another user who has been running the filter for a month or more. In this case, for each rule the values for p1, p2, and p3 could be as follows:
p1=no. good messages [as rated by the ruleset] that satisfy the rule/total no. of messages that satisfy the rule
p2=no. good messages [as rated by the ruleset] that don't satisfy the rule/total no. of messages that don't satisfy the rule
p3=no. good messages rated by the ruleset/total no. messages rated by the ruleset.
For each rule, the values p1, p2, and p3 are adjusted over time and the filter becomes better over time even though the user may never rate a single message.
In another embodiment, the adaptive ruleset may be applied only to those messages which cannot be classified as good or bad by the initial rule. In other words, the ruleset only rates a portion of the messages sent to the recipient. For instance, if the initial rule can accurately rate 95% of messages received, the adaptive ruleset is applied to the remaining 5% of messages received. In
When the message cannot be classified by the initial rule (block 52), each rule of the adaptive ruleset is applied (block 56). The values for p1, p2, and p3 for each rule may be calculated as follows:
p1=no. good messages [as rated by the ruleset] that satisfy the rule/total no. of messages that satisfy the rule
p2=no. good messages [as rated by the ruleset] that don't satisfy the rule/total no. of messages that don't satisfy the rule
p3=no. good messages rated by the ruleset/total no. messages rated by the ruleset.
Weights and probabilities may be determined as discussed in
Once a rule has been applied, a check is made to determine whether all rules have been applied (block 64). Once all the rules have been applied (block 64), the final probability that the message is wanted is obtained (block 66), for instance by summing the weighted probabilities obtained for each rule and dividing by the sum of the weights. The statistics for each rule of the adaptive ruleset are then updated as indicated above based on the final assessment of whether the message is wanted (block 68).
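The aggregation in block 66, summing the weighted probabilities and dividing by the sum of the weights, can be sketched as follows; the handling of an all-zero weight total is an assumption, since the source does not specify it.

```python
def final_probability(rule_scores):
    """rule_scores: (weight, probability) pairs, one per rule, where the
    weight is |p1 - p3| or |p2 - p3| and the probability is p1 or p2,
    depending on whether the rule was satisfied."""
    total_weight = sum(w for w, _ in rule_scores)
    if total_weight == 0:
        return None  # no rule discriminated; caller falls back (assumption)
    return sum(w * p for w, p in rule_scores) / total_weight

prob = final_probability([(0.4, 0.9), (0.1, 0.2)])  # (0.36 + 0.02) / 0.5 = 0.76
```

Dividing by the total weight means rules with little discriminatory power barely move the final probability, while highly discriminating rules dominate it.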
This embodiment is particularly useful for two reasons. First, since the adaptive ruleset is applied only to a portion of messages received, time and perhaps bandwidth (depending on whether the entire body of the message needs to be examined to classify it) are saved. Second, these initially unclassified messages may have completely different characteristics from those messages that can be classified by the initial rule. Therefore, the statistics for the rules in the adaptive ruleset are specifically related to that portion of the datastream that cannot be rated by the initial rule, as opposed to all messages sent to the recipient, and the adaptive ruleset will be extremely accurate when rating these messages.
In each of the embodiments, statistics for rules may be determined in different ways. In some embodiments, statistics are obtained based only on the application of the adaptive ruleset. In other embodiments, statistics may be obtained based on a combination of other rating algorithms (such as the initial rule(s)) which are employed with the adaptive ruleset to obtain a final probability the message is wanted.
In other embodiments, a moving average of statistics is maintained and used. More recently obtained statistics are weighted more than older statistics. For instance, when determining the moving average, the old value may be multiplied by a factor less than 1 and the new value is then added to the old value. Other embodiments may only use statistics collected and averaged over a certain time period, for example the last three months. These preferences may be set by a user or system administrator.
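The moving-average update described above, multiplying the old value by a factor less than 1 before adding the new observation, can be sketched minimally. The decay factor 0.99 is an assumption for illustration; the source leaves the factor unspecified.

```python
DECAY = 0.99  # factor less than 1 (assumed value); recent messages dominate older ones

def decayed_add(counter, observation):
    """Scale the old value down, then add the new observation."""
    return counter * DECAY + observation

count = 0.0
for _ in range(10):
    count = decayed_add(count, 1.0)
# count is just under 10, whereas a plain counter would read exactly 10
```

Applied to the counters behind p1, p2, and p3, this makes the statistics track the recent datastream rather than the filter's entire history.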
In each of these embodiments, thresholds may be set by a user or system administrator to determine a “good” or “bad” message depending on the final probability the message is wanted. For instance, a message may be considered “good” if the final probability the message is wanted is at least 0.90 or 90%. Those messages which are found to be good are passed on to the recipient (for instance, sent to the recipient's Inbox) while those messages that are bad are either sent to a spam folder or deleted, depending on the user's preferences. In each of the embodiments, the user can reverse the e-mail filter's rating by indicating that a message rated as good is actually unwanted and vice versa. If a rating decision is reversed, statistics are updated accordingly at the database.
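The threshold disposition step might be sketched as follows; the 0.90 cutoff comes from the example above, while the folder names and the delete flag are assumptions standing in for the user's preferences.

```python
def disposition(final_prob, threshold=0.90, delete_bad=False):
    """Route a rated message based on the final probability it is wanted."""
    if final_prob >= threshold:
        return "inbox"        # "good": passed on to the recipient
    return "deleted" if delete_bad else "spam_folder"  # "bad": per user preference
```

A user's reversal of a rating would then feed back into the statistics update described earlier, rather than into this routing step.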