This invention relates generally to electronic mail filtering and, more particularly, to a method and apparatus for detecting and filtering “phishing” attempts solicited by electronic mail.
Electronic mail (“email”) services are well known, whereby users equipped with devices including, for example, personal computers, laptop computers, mobile telephones, Personal Digital Assistants (PDAs) or the like, can exchange email transmissions with other such devices or network devices. A major problem associated with email service is the practice of “phishing,” a form of unsolicited email, or spam, where a spammer sends an email that directs a user to a fraudulent website with the intent of obtaining personal information of the user for illicit purposes. For example, a phishing email is typically constructed so as to appear to originate from a legitimate service entity (e.g., banks, credit card issuers, e-commerce enterprises) and a link in the email directs the user to what appears to be a legitimate website of the service entity, but in reality the website is a bogus site maintained by an untrusted third party. Once directed to the fraudulent site, an unwitting user can be tricked into divulging personal information including, for example, passwords, user names, personal identification numbers, bank and brokerage account numbers and the like, thereby putting the user at risk of identity theft and financial loss. Many service entities have suffered substantial financial losses as a result of their clients being victimized by the practice of phishing. Thus, there is a continuing need to develop strategies and mechanisms to guard against the practice of phishing.
Since phishing is generally viewed as a subset of spam, one manner of attacking the phishing problem is through use of spam filters implementing various spam detection strategies. Generally, however, spam filters known in the art are not well-suited to detecting phishing emails. Some prior art spam filtering strategies and their problems are as follows:
Bayesian filtering. A Bayesian filter uses a mathematical algorithm (i.e., Bayes' Theorem) to derive a probability that a given email is spam, given the presence of certain words in the email. However, a Bayesian filter does not know the probabilities in advance and must be “trained” to effectively recognize what constitutes spam. Consequently, the filter does not perform well in the face of “zero-day attacks” (i.e., new attacks that it has not been trained on). Further, a spammer can degrade the effectiveness of a Bayesian filter by sending out emails with large amounts of legitimate text. Still further, a Bayesian filter is very resource intensive and requires substantial processing power.
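For background, the single-word form of Bayes' Theorem underlying such filters can be sketched as follows. This is a textbook illustration, not part of the invention; the probability estimates are the quantities a Bayesian filter must learn from training data, which is precisely the dependency noted above.

```python
def spam_probability(p_word_given_spam, p_word_given_ham, p_spam=0.5):
    """Bayes' Theorem for a single word:
    P(spam | word) = P(word | spam) * P(spam) /
                     (P(word | spam) * P(spam) + P(word | ham) * P(ham))
    The conditional probabilities must be estimated from training data."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    return numerator / (numerator + p_word_given_ham * p_ham)
```

A word seen far more often in spam than in legitimate mail (e.g., 0.9 vs. 0.1) yields a high spam probability, but a word never seen in training contributes nothing, which is why zero-day attacks evade the filter.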
Black and/or white lists. Some spam filters use network information (e.g., IP and email addresses) in the email header to classify an incoming e-mail into black and/or white lists in order to deny or to allow the email. A black list comprises a list of senders that are deemed untrustworthy whereas a white list comprises a list of senders that are deemed trustworthy. The disadvantages of black and white lists are many and include, inter alia: an “introduction problem” whereby an incoming legitimate email will not penetrate a white-list based filter if it is from a sender that has not yet conversed with the recipient (and hence, the sender does not appear on the white list); in the case of black lists, the filter can introduce false positives and will not perform well in the face of zero-day attacks (e.g., a spammer can circumvent the filter by using IP addresses that do not appear on the black list); and in the case of both black and white lists, there is a management problem of maintaining and periodically adjusting the lists to add or remove certain senders.
Keyword analysis. Some spam filters analyze keywords in the email header or body to detect indicia of spam. However, a spammer can degrade the effectiveness of a keyword filter by obfuscating keywords or composing the email with images (e.g., Graphics Interchange Format (GIF) images). Further, there is a management problem of maintaining and periodically adjusting a dictionary of keywords that are indicative of spam.
Accordingly, in view of the problems associated with existing spam detection strategies in detecting phishing attacks, there is a need to develop alternative, or at least supplemental, strategies and mechanisms to guard against the practice of phishing. Advantageously, such new strategies would not require training filters, maintaining black or white lists or performing keyword analysis. The present invention is directed to addressing this need.
This need is addressed and a technical advance is achieved in the art by a phishing filter that employs a set of heuristics or rules (e.g., 12 rules) to detect and filter phishing attempts solicited by electronic mail. The phishing filter does not need to be trained, does not rely on black or white lists and does not perform keyword analysis. The filter has been demonstrated to outperform existing filters when the entire set of 12 rules is used in combination; however, the filter may be implemented, and beneficial results achieved, with selected individual rules or selected subsets of the 12 rules. The filter may be implemented as an alternative or supplement to prior art spam detection filters.
In one embodiment, there is provided a phishing filter adapted to execute one or more heuristics to detect phishing attempts solicited by email. The phishing filter comprises (a) a login URL analysis element operable to identify and analyze a login URL of an email under review for indicia of phishing; (b) an email header analysis element operable to analyze a chain of SMTP headers in the email under review for indicia of phishing; (c) an other URL analysis element operable to analyze URLs other than the login URL in the email under review for indicia of phishing; (d) a website accessibility determination element operable to determine if the login URL of the email under review is accessible; and (e) means for producing an output metric responsive to elements (a), (b), (c) and (d) that characterizes the likelihood of the email under review comprising a phishing attempt.
In another embodiment, there is provided a method for evaluating an email for indicia of phishing, applicable to an email having a login URL and a display string comprising a URL. The method comprises determining whether the URL shown in the display string indicates use of Transport Layer Security (TLS); determining whether the login URL indicates use of TLS; producing a metric indicative of a valid email if TLS is indicated in both the URL shown in the display string and the login URL; and producing a metric indicative of a phishing email if TLS is indicated in the URL shown in the display string but not in the login URL.
In yet another embodiment, there is provided a method for evaluating an email for indicia of phishing, applicable to an email having a login URL including a path component and a host component, the host component having a domain portion. The method comprises determining if a business name appears in the path component; producing a metric indicative of a phishing email if a business name appears in the path component; if a business name does not appear in the path component, determining if a business name appears in the host component; producing a metric indicative of a valid email if a business name does not appear in the host component or if a business name appears in the domain portion of the host component; and producing a metric indicative of a phishing email if a business name appears in the host component but not in the domain portion of the host component.
In yet another embodiment, there is provided a method for evaluating an email for indicia of phishing, applicable to an email having one or more other URLs in addition to a login URL, the other URLs and the login URL each having a DNS domain. The method comprises performing a case-insensitive, byte-wise comparison of the domain of each of the other URLs to the domain of the login URL; producing a metric indicative of a valid email if the domain of each of the other URLs matches the domain of the login URL, otherwise producing a metric indicative of a phishing email.
In still another embodiment, there is provided a method for evaluating an email for indicia of phishing, applicable to an email having one or more other URLs in addition to a login URL, the other URLs and the login URL each having a DNS registrant. The method comprises comparing the DNS registrant associated with each of the other URLs to the DNS registrant associated with the login URL; producing a metric indicative of a valid email if the DNS registrant of each of the other URLs matches the DNS registrant of the login URL, otherwise producing a metric indicative of a phishing email.
The foregoing and other advantages of the invention will become apparent upon reading the following detailed description and upon reference to the drawings.
The rules are executed by functional elements including: a login URL analysis element 106 operable to identify and analyze the login URL; an email header analysis element 108 operable to analyze the chain of SMTP headers in the email 102; an “other” URL analysis element 110 operable to analyze URLs other than the login URL; and a website accessibility determination element 112 operable to determine if the login URL is accessible. The rules will be described in detail in relation to
In one embodiment, responsive to executing the plurality of rules on a target email, the phishing filter produces an output metric (“score”) 114 indicative of the probability that the email is a phishing attempt. Thereafter, depending on the output score, the email can be redirected or treated accordingly. For example and without limitation, if the output score is characteristic of a phishing email, the email can be blocked from the user's email inbox and redirected to a junk email folder, the links in the email may be disabled, or a warning message may be introduced to warn the user that the email is suspected to be a phishing email.
In one embodiment, the output score 114 is produced by assigning to each rule: a configurable weight, Wi; an indicator, Pi, ranging from 0.0 to 1.0, where a value of 1 indicates a positive result (i.e., indicative of a phishing email) and a value of 0 indicates a negative result (i.e., indicative of a valid email); and an applicability factor, Xi, where Xi=1 if the rule is applicable and Xi=0 if the rule is not applicable. A final score S is based on a weighted sum of the points assigned by the rules divided by a weighted sum of the number of rules applied:
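The equation itself is not reproduced in this excerpt; one reading consistent with the description (a weighted sum of rule points divided by a weighted sum of the applicable rules) can be sketched as:

```python
def phishing_score(rules):
    """Compute S = sum(W_i * P_i * X_i) / sum(W_i * X_i), where each rule is a
    triple (W_i, P_i, X_i): configurable weight, indicator in [0.0, 1.0], and
    applicability factor (1 if applicable, else 0). This is a sketch of one
    reading of the text, not the specification's exact formula."""
    numerator = sum(w * p * x for w, p, x in rules)
    denominator = sum(w * x for w, p, x in rules)
    return numerator / denominator if denominator else 0.0
```

With equal weights, a positive result from one of two applicable rules gives S = 0.5; rules marked not applicable (Xi = 0) drop out of both sums.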
S indicates the probability that the email is a phishing attempt. The higher the score, the more likely the email is a phishing email. As will be appreciated, the output score may be computed using alternative algorithms, different values, etc. and may be constructed such that a lower, rather than higher, score represents a greater likelihood of phishing.
Referring to
At step 404, the extracted terms are used as search terms using a search engine (e.g., Google, Yahoo or the like) and a list of search results is obtained to determine the legitimate URL of the business. The results can be cached to avoid repeated queries for emails containing the same business. The correct URL, especially for major businesses, is typically within the top search results. For example, the legitimate URL may be determined to correspond to the first n search results, where n is configurable (n=5 is a value used by applicants with effective results). It is noted, the possibility exists that the top search results may include an illegitimate URL associated with a phishing site, for example, if a spammer practices what has been referred to as “Google bombing” or “link bombing” to insert a phishing site into the top search results. This is a valid concern but it can be mitigated by conducting a search across two or more search sites and comparing the results using statistical analysis techniques to derive a list of prospective valid URLs.
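Steps 404-406 can be sketched as below. The `search` function is a caller-supplied stand-in for a real search-engine query (no particular search API is implied by the specification), and the simple hostname comparison is illustrative; a production implementation would compare registrable domains.

```python
from urllib.parse import urlparse

_cache = {}  # query -> list of result URLs, to avoid repeated searches (step 404)

def rule1_is_suspect(actual_url, business_terms, search, n=5):
    """Positive (True, a phishing indicator) if the email's actual domain is
    absent from the top-n search results for the extracted business terms.
    `search` is a caller-supplied function mapping a query string to a list
    of result URLs; n=5 is the configurable cutoff mentioned in the text."""
    query = " ".join(business_terms)
    if query not in _cache:
        _cache[query] = search(query)
    top_domains = {urlparse(u).hostname for u in _cache[query][:n]}
    return urlparse(actual_url).hostname not in top_domains
```

The cache mirrors the note above that results can be reused for emails naming the same business.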
At step 406, the domain found in the host component of the actual URL from the email is compared to the domain of the top search results and it is determined whether a match occurs (i.e., is the actual URL from the email in the top search results). If a match occurs, Rule 1 yields a negative result and a value indicative of a potential valid email is assigned at step 408. If a match does not occur, Rule 1 yields a positive result and a value indicative of a potential phishing email is assigned at step 408.
If a determination is made at step 502 that the URL shown in the display string 206 uses TLS, it is determined at step 506 whether the actual URL 208 uses TLS as well. For example, in one embodiment, a positive determination will result at step 506 if the actual URL 208 uses an https:// scheme and a negative determination will result if the actual URL 208 does not use an https:// scheme.
If it is determined at step 506 that the actual URL 208 does not use TLS, Rule 2 yields a positive result and a value indicative of a potential phishing email is assigned at step 508.
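The scheme comparison of steps 502-508 can be sketched as follows; the return value for a display string that never claims TLS is an assumption, since that branch is not detailed in this excerpt.

```python
from urllib.parse import urlparse

def rule2_scheme_check(display_url, actual_url):
    """Steps 502-508 (sketch): 'positive' if the display string advertises
    an https:// scheme but the actual URL does not; 'negative' (assumed)
    if the display string never claimed TLS; None means both use TLS and
    the certificate comparison of steps 510-516 must decide."""
    if urlparse(display_url).scheme != "https":
        return "negative"   # display string does not claim TLS (assumed branch)
    if urlparse(actual_url).scheme != "https":
        return "positive"   # claims TLS but the actual URL lacks it (step 508)
    return None             # both use TLS; fall through to certificate check
```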
If it is determined at step 506 that the actual URL 208 uses TLS, further analysis is performed to determine if the email is likely to be a phishing email or a valid email. In one embodiment, this analysis involves a comparison of the digital certificate (e.g., the X.509 certificate) retrieved and cached (saved) on a previous visit to a site to the certificate obtained on subsequent visits. For example, a cached X.509 certificate retrieved from a legitimate site (e.g., obtained on a first visit to the site) can be compared to the X.509 certificate on subsequent visits to detect instances where the site has been compromised to redirect users to an illegitimate site having a fraudulent X.509 certificate.
In one embodiment, following a determination that the actual URL uses TLS, it is initially determined at step 510 whether a certificate for the site already exists in a certificate “keyring” (i.e., is there already a cached X.509 certificate associated with a previous visit to the site). If a cached certificate does not already exist (which may occur, for example, upon a user's first visit to the site), the certificate associated with the site is retrieved, validated and saved at step 512 and a value indicative of a potential valid email is assigned at step 516.
If a cached certificate does exist, a certificate is obtained from the site and compared to the cached certificate at step 514. If the cached certificate and the certificate associated with the present site are the same, Rule 2 yields a negative result and a value indicative of a potential valid email is assigned at step 516. If they differ, Rule 2 yields a positive result and a value indicative of a potential phishing email is assigned at step 508.
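The keyring logic of steps 510-516 can be sketched as a pure comparison; in practice the live certificate would be retrieved over the network (e.g., via `ssl.get_server_certificate`), but it is injected here so the logic stands alone. The first-visit handling follows step 512.

```python
def rule2_cert_check(host, current_cert_pem, keyring):
    """Steps 510-516 (sketch): compare the certificate presented now with
    the one cached on a prior visit. `keyring` maps host -> cached PEM text.
    'negative' indicates a potential valid email; 'positive' a potential
    phishing email."""
    cached = keyring.get(host)
    if cached is None:
        # First visit: retrieve, validate and save the certificate (step 512)
        keyring[host] = current_cert_pem
        return "negative"
    # Subsequent visit: byte-for-byte comparison with the cached copy (step 514)
    return "negative" if cached == current_cert_pem else "positive"
```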
In one embodiment, the country or region is obtained by determining the IP address associated with the actual URL and then searching a database that maps IP addresses to country codes. The country information is saved at step 606. In one embodiment, the country or region information is used for information purposes but does not contribute to the overall score of the phishing filter. Alternatively, of course, other embodiments may utilize the country information to contribute to the overall score or to influence in some manner a final determination of the presence or absence of phishing.
In Rule 4 (no flowchart shown), it is determined whether the actual URL is referenced using a “raw” IP address (i.e., a number specifying a computer network address) instead of a domain name. It is presumed that a login page of an illegitimate site may use a raw IP address and an authentic login page is less likely to use a raw IP address. Accordingly, if the actual URL uses a raw IP address, Rule 4 indicates a positive result (i.e., indicative of a phishing email). Conversely, Rule 4 indicates a negative result if the actual URL does not use a raw IP address.
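Rule 4 reduces to testing whether the URL's host parses as an IP literal, which can be sketched with the standard library:

```python
import ipaddress
from urllib.parse import urlparse

def rule4_uses_raw_ip(url):
    """Rule 4 (sketch): True (positive, a phishing indicator) if the URL
    references its host by a raw IP address rather than a domain name."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # accepts IPv4 and IPv6 literals
        return True
    except ValueError:
        return False                # host is a domain name
```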
At step 702, a determination is made whether the business name appears in the path component of the actual URL. If the business name does appear in the path component (as it does in the exemplary URL of
If the business name appears in the host component but not the path component, Rule 5 may indicate a positive or negative result depending on which portion of the host component the business name appears. In one embodiment, it is presumed that a business name appearing in the “domain” portion of the host component is likely to indicate a valid email. The domain portion is the portion (in
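The branching of Rule 5 can be sketched as below. The "domain portion" is approximated here as the last two DNS labels of the host; a production implementation would consult a public-suffix list to find the registrable domain.

```python
from urllib.parse import urlparse

def rule5(url, business):
    """Rule 5 (sketch): 'positive' indicates a potential phishing email.
    Checks the path component first, then the host component, then whether
    the business name falls within the domain portion of the host."""
    parts = urlparse(url)
    name = business.lower()
    if name in (parts.path or "").lower():
        return "positive"            # business name appears in the path component
    host = (parts.hostname or "").lower()
    if name not in host:
        return "negative"            # name absent from the host component
    domain = ".".join(host.split(".")[-2:])  # crude "domain portion" approximation
    return "negative" if name in domain else "positive"
```

So `chase` in `www.chase.com` is negative (name in the domain portion), while `chase` in `chase.evil.example` is positive (name only in a subdomain label).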
In Rule 6 (no flowchart shown), if the display string 206 of the URL is composed of a URL, it is compared to the actual URL 208. If the domains do not match, Rule 6 indicates a positive result, otherwise if the domains match, Rule 6 indicates a negative result.
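Rule 6's comparison, and the case-insensitive, byte-wise domain matching described for the other-URL rules, can be sketched as:

```python
from urllib.parse import urlparse

def rule6(display_string_url, actual_url):
    """Rule 6 (sketch): case-insensitive comparison of the DNS domain in the
    display-string URL against the domain of the actual URL. 'positive'
    indicates a potential phishing email."""
    shown = (urlparse(display_string_url).hostname or "").lower()
    actual = (urlparse(actual_url).hostname or "").lower()
    return "negative" if shown == actual else "positive"
```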
Rule 7 (no flowchart shown) is a rule executed by the email header analysis element 108 of the phishing filter in one embodiment of the invention. In Rule 7, the chain of “Received” Simple Mail Transfer Protocol (SMTP) headers is checked to determine if the path included a server (based on DNS domain) or a mail user agent in the same DNS domain as the business. Under normal circumstances, the mail user agent originating the email or at the very least, a SMTP relay handling the email will be in the same DNS domain as that of the business. Rule 7 indicates a negative result if such a Received header is present, otherwise Rule 7 indicates a positive result.
For example, an email with the From header and message body indicating it is for Chase bank but without any “Received” lines containing an SMTP relay or a mail user agent in the chase.com DNS domain would be marked positive. While headers inserted by mail user agents such as To, From, and Subject are easy to spoof, it is more difficult, though not impossible, to forge headers such as “Received” that are added by intermediaries. In the event that the “Received” header is forged, Rule 7 may return a negative result (0 points), but that result must compete with the remaining rules in order to contribute to the final score. That is, even though Rule 7 may return a negative result in the given example, the final score after application of multiple rules may nevertheless indicate a phishing email.
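Rule 7 can be sketched with the standard library email parser. The substring match on each "Received" header is a simplification; a production implementation would extract the relay host names and compare their DNS domains properly.

```python
from email import message_from_string

def rule7(raw_email, business_domain):
    """Rule 7 (sketch): 'negative' if some 'Received' header mentions a host
    in the business's DNS domain, else 'positive' (a phishing indicator)."""
    msg = message_from_string(raw_email)
    for header in msg.get_all("Received") or []:
        if business_domain.lower() in header.lower():
            return "negative"   # an SMTP relay or user agent in the business domain
    return "positive"           # no Received line references the business domain
```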
Referring to
Three advantageous aspects of Rule 9 are noted herein for example and without limitation. First, this rule allows the phishing filter to be impervious to mergers and acquisitions, a common occurrence in the banking industry. For example, consider the acquisition of Bank One by Chase: under this rule, whois (“bankone.com”) and whois (“chase.com”) both yield JPMorgan Chase & Co. as the registrant, yielding a negative result (i.e., indicating a valid email). Second, this rule helps in content hosting where a business accesses its contents from another domain owned by it. For example, ebay.com stores static content on (and accesses it from) the domain ebaystatic.com. Third, this rule aids in cases where the business uses a URL not containing their domain name but which is registered to the business nonetheless. Emails from such businesses may display a URL to the recipient that includes the business name while the actual URL does not contain the business name. A well-known example is the “accountonline.com” domain: this domain is registered to Citibank, NA, but it is hard to reach that conclusion by just examining the domain name.
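The registrant comparison of Rule 9 can be sketched as below. The whois lookup is supplied by the caller (no particular whois client is implied), and the test data mirrors the Bank One / Chase example above.

```python
def rule9(other_domains, login_domain, registrant_of):
    """Rule 9 (sketch): 'negative' (valid) if every other domain shares the
    login URL's DNS registrant; otherwise 'positive'. `registrant_of` is a
    caller-supplied whois-style lookup: domain -> registrant name."""
    target = registrant_of(login_domain)
    if all(registrant_of(d) == target for d in other_domains):
        return "negative"
    return "positive"
```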
Rule 12 (no flowchart shown) is a rule executed by the website accessibility determination element 112 of the phishing filter in one embodiment of the invention. In Rule 12, a final check determines if the login URL is accessible (i.e., whether the resource represented by the URL can be accessed). The rule presumes that if the web page is inaccessible, it is likely to be a phishing site that has been disabled. In one embodiment, the rule produces a positive result if the web page is inaccessible; otherwise the rule is considered not applicable, in order to avoid lowering the score for an active phishing site.
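Rule 12 can be sketched as below. The status check is injected as a callable so the logic stands alone; in practice it would issue an HTTP request (e.g., with `urllib.request`) and treat network errors as inaccessibility. Returning "not applicable" for a reachable page mirrors the text's note that Xi = 0 avoids lowering the score for an active phishing site.

```python
def rule12(login_url, fetch_status):
    """Rule 12 (sketch): 'positive' if the login URL is inaccessible,
    otherwise 'not applicable' (X_i = 0). `fetch_status` is a caller-supplied
    function returning the HTTP status code or raising OSError on failure."""
    try:
        accessible = 200 <= fetch_status(login_url) < 400
    except OSError:
        accessible = False          # network failure counts as inaccessible
    return "not applicable" if accessible else "positive"
```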
The present disclosure has therefore identified a phishing filter operable to exercise 12 rules to detect and filter phishing attempts solicited by electronic mail. The phishing filter may be implemented with all or a subset of the rules, and may be implemented as an alternative or supplement to prior art spam detection filters. It should also be understood that the steps of the methods set forth herein are not necessarily required to be performed in the order described, additional steps may be included in such methods, and certain steps may be omitted or combined in methods consistent with various embodiments of the present invention.
The present invention can be embodied in the form of methods and apparatuses for practicing those methods. The present invention can also be embodied in the form of program code embodied in tangible media, such as USB flash drives, CD-ROMs, hard drives or any other machine-readable storage medium, wherein, when the program code is loaded into and executed by a machine, such as a computer or processor, the machine becomes an apparatus for practicing the invention. The present invention can also be embodied in the form of program code, for example, whether stored in a storage medium, loaded into and/or executed by a machine or transmitted over some transmission medium or carrier, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.
While this invention has been described with reference to illustrative embodiments, the invention is not limited to the described embodiments but may be embodied in other specific forms without departing from its spirit or essential characteristics. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.