Embodiments are related to the detection of phishing Uniform Resource Locators (URLs) delivered through electronic messages such as email. Phishing detection refers to the detection of URLs in, for example, emails that purport to be from a legitimate and trustworthy source but that, in fact, are not. Such phishing URLs are often used in attempts to collect personal and financial information from the unsuspecting recipient, often for unauthorized purposes.
The goal of the phisher is most often to capture critical data such as credit card numbers or login/password credentials. For this purpose, the phisher sends the victim an email that contains a URL leading to a forged website, where the victim is induced to enter the sought-after personal and financial information.
The user experience is specific to each brand. In order to maximize the capture of critical data in a forged website, the user experience occasioned by viewing and interacting with the phishing email and with the forged website should be as close as possible to the genuine user experience with a legitimate email and website. For example, a phishing email received by the victim often contains text and graphics—typically, a known and familiar brand logo—to convince the victim to click on a URL link to the forged website and enter his or her credentials therein. Toward that end, the forged website URL often contains keywords that are close to the genuine website URL, and the forged website often contains text, style sheets, graphics and a user experience that resemble those of the genuine website.
One embodiment is a method of determining whether a URL is a phishing URL through real-time exploration and analysis that carry out a number of determinations which, in the aggregate, establish the likelihood that a received URL is a phishing URL, as is URL 108 in
The exploration of the URL, as shown at B206, may comprise comparing the URL or a portion or portions thereof with a database (the same or a different database than referred to above) of phishing signatures. Such phishing signatures may comprise, for example, a list of regular expressions that are most often associated with phishing attempts. Such comparison may comprise comparing the content of the webpage pointed to by the URL under consideration with database records of known phishing webpage signatures. A match from such a comparison may result, according to one embodiment, in a determination that the URL is a phishing URL. If no match is found, the method may proceed to block B207. It is to be noted, however, that blocks B201-B206 may be carried out in an order that is different than that shown in
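For illustration, a minimal sketch of such a signature comparison, assuming a hypothetical list of regular-expression signatures (the actual signature database is not reproduced here), might take the following form:

```python
import re

# Hypothetical regular-expression signatures of the kind often associated with
# phishing attempts; the actual signature database is not reproduced here.
PHISHING_URL_SIGNATURES = [
    re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}/", re.IGNORECASE),            # raw IPv4 host
    re.compile(r"(?:login|verify|secure|account)[^/]*\.(?:ru|tk)/", re.IGNORECASE),
]

def matches_known_signature(url: str, page_html: str = "") -> bool:
    """Return True if the URL, or optionally the fetched page content,
    matches any known phishing signature (block B206)."""
    for signature in PHISHING_URL_SIGNATURES:
        if signature.search(url) or (page_html and signature.search(page_html)):
            return True
    return False

# A match would classify the URL as phishing; otherwise the method proceeds to B207.
print(matches_known_signature("http://203.0.113.9/login/verify.html"))  # True
```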
At B207, the URL (which thus far has resisted attempts to classify it as a phishing URL or as a non-phishing URL in previous determinations) may be submitted to a phishing probability engine, the output of which may be interpreted as a probability that the submitted URL under consideration is, in fact, a phishing URL. The probability may be expressed numerically, or may be expressed as a more user-friendly phishing probability rating. For example, the output of the phishing probability engine may comprise ratings such as “Most Likely Not a Phishing URL”, “Somewhat Probable Phishing URL” or “Most Likely a Phishing URL” or functionally equivalent ratings with a lesser or greater degree of granularity. According to one embodiment, the phishing probability engine may comprise supervised learning models and associated algorithms to analyze data and recognize patterns. One embodiment utilizes a Support Vector Machine (SVM) classifier on the URL itself and the webpage content.
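As an illustration, a numeric probability produced by the phishing probability engine might be mapped onto the ratings named above as sketched below; the 0.3 and 0.7 cut-offs are assumptions chosen for the example, not values prescribed by the embodiment:

```python
def phishing_rating(probability: float) -> str:
    """Map a numeric phishing probability onto a user-friendly rating.
    The 0.3 / 0.7 cut-offs are illustrative assumptions only."""
    if probability < 0.3:
        return "Most Likely Not a Phishing URL"
    if probability < 0.7:
        return "Somewhat Probable Phishing URL"
    return "Most Likely a Phishing URL"

print(phishing_rating(0.82))  # -> "Most Likely a Phishing URL"
```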
There are a great many well-known brands, and each of these brands has its own characteristics, color and font schemes, and look and feel. Examples of such brands include Microsoft, PayPal, Apple and Bank of America. Well-known brands with which users interact frequently are prime candidates for phishing attacks. Rather than extracting features that are common to all brands, one embodiment comprises and accesses a knowledge database of brands configured to enable the present system to extract therefrom items that may be characteristic of or specific to each brand.
Brand Elements
According to one embodiment, a brand is identified by a unique name such as Apple, PayPal, Bank of America, Chase or Yahoo. A brand contains a list of elements that defines the knowledge base relative to this brand. According to one embodiment, a knowledge database of brands configured to enable extraction therefrom of items that are characteristic or specific to each brand may include one or more of the following elements:
According to one embodiment, a brand may be defined as a logical construct that includes several elements. Such a logical construct, according to one embodiment, may be implemented as a document type definition (DTD). Other logical constructs may be devised. A DTD is a set of markup declarations that define a document type for an SGML-family markup language (SGML, XML, HTML) and defines the legal building blocks of an XML document. A DTD defines the document structure with a list of legal elements and attributes. A DTD that encapsulates a brand, according to one embodiment, may be implemented as an XML file having the following form:
The following is an exemplary brand description for the Chase bank brand:
The following is an exemplary brand description for the Apple brand:
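Purely as an illustration, and assuming hypothetical element names (keyword, domain, logo) rather than the actual DTD or the exemplary brand descriptions referenced above, a brand description of this general kind might be represented and loaded as follows:

```python
import xml.etree.ElementTree as ET

# Hypothetical brand description; the element names used here are
# illustrative assumptions, not the actual brand DTD.
BRAND_XML = """
<brand name="Chase">
    <keyword>chase</keyword>
    <keyword>chaseonline</keyword>
    <domain>chase.com</domain>
    <logo>chase_logo.png</logo>
</brand>
"""

def load_brand(xml_text: str) -> dict:
    """Parse a brand description into a simple dictionary of its elements."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.get("name"),
        "keywords": [e.text for e in root.findall("keyword")],
        "domains": [e.text for e in root.findall("domain")],
        "logos": [e.text for e in root.findall("logo")],
    }

print(load_brand(BRAND_XML))
```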
Vector Definition
In order to classify a URL as being a legitimate or a suspected phishing URL, one embodiment computes a vector that is suitable to be input to the phishing probability detection engine. One embodiment computes a multi-dimensional vector of binary values, either 0 or 1. One implementation computes a 14-dimensional vector of binary values. Such a vector may be represented by, for example, a 14-bit array. Each dimension (represented by one bit) represents a feature: the bit is set to 1 if the feature condition is met, otherwise the bit is set to 0.
The features of one implementation are shown below, according to one embodiment.
As shown in the table below, some of these features are brand-dependent and rely on a brand selection process that will be described further. In the table below, those features having an “X” in the Brand Dependent column are brand-dependent.
As shown in
URL_HOSTNAME_IPV4
URL_MANY_SUBDOMAINS
URL_WORDPRESS_PATH_COMPONENT_OR_TILDE
URL_ACTION_KEYWORD_SUSPECT
URL_SUBDOMAIN_SUSPECT
URL_PATH_SUSPECT
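Under assumed interpretations of the feature names listed above (the exact conditions used by the embodiment are not reproduced here), a minimal sketch of how these URL-derived features might be evaluated is shown below:

```python
import re
from urllib.parse import urlparse

# Illustrative keyword list; the actual list of action keywords is an assumption.
ACTION_KEYWORDS = ("login", "verify", "confirm", "secure", "update")

def url_features(url: str, brand_keywords=()) -> dict:
    """Compute illustrative binary URL features (1 = condition met, 0 otherwise)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".")
    path = parsed.path.lower()
    return {
        "URL_HOSTNAME_IPV4": int(bool(re.fullmatch(r"\d{1,3}(?:\.\d{1,3}){3}", host))),
        "URL_MANY_SUBDOMAINS": int(len(labels) > 4),                       # assumed threshold
        "URL_WORDPRESS_PATH_COMPONENT_OR_TILDE": int("/wp-" in path or "~" in path),
        "URL_ACTION_KEYWORD_SUSPECT": int(any(k in path for k in ACTION_KEYWORDS)),
        "URL_SUBDOMAIN_SUSPECT": int(any(k in ".".join(labels[:-2]) for k in brand_keywords)),
        "URL_PATH_SUSPECT": int(any(k in path for k in brand_keywords)),
    }

print(url_features("http://itunes.menaiswimclub.org.au/images/confirm", ("itunes", "apple")))
```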
At B503, it may be determined whether the determination of the selected vector features above is sufficient to enable an identification of the brand that is the subject of the phishing attempt (if such phishing attempt exists). According to one embodiment, the identification of the brand may be carried out according to the method shown and described relative to
DOCUMENT_TITLE_OR_METADESCRIPTION_SUSPECT
DOCUMENT_PHISHING_TITLE
DOCUMENT_ICON_OR_CSS_OR_JS_SUSPECT
DOCUMENT_HIGH_DOMAIN_RATE
DOCUMENT_DATA_SUSPECT
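The precise conditions behind these brand-dependent document features are not spelled out here; purely as an assumed illustration, features of this kind might compare the page title and meta description against the identified brand's knowledge-base elements, for example:

```python
from urllib.parse import urlparse

def document_brand_features(page_title: str, meta_description: str,
                            page_url: str, brand_name: str,
                            brand_domains: list) -> dict:
    """Illustrative brand-dependent document features: the page mentions the
    brand while being hosted outside the brand's legitimate domains. This is an
    assumed interpretation, not the actual feature definitions."""
    host = (urlparse(page_url).hostname or "").lower()
    off_brand_host = not any(host == d or host.endswith("." + d) for d in brand_domains)
    mentions_brand = brand_name.lower() in (page_title + " " + meta_description).lower()
    return {
        "DOCUMENT_TITLE_OR_METADESCRIPTION_SUSPECT": int(mentions_brand and off_brand_host),
        "DOCUMENT_PHISHING_TITLE": int(mentions_brand and off_brand_host
                                       and "sign in" in page_title.lower()),
    }

print(document_brand_features("Chase Online - Sign In", "Log in to your account",
                              "http://tula-tur.ru/chase/chase_auth.html",
                              "Chase", ["chase.com"]))
```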
After the determination of the value of the brand-specific phishing features, or after it is determined in B504 that the specific brand cannot be identified from the examined features, block B505 may be carried out to determine the value of the remaining, non-brand-specific phishing features such as, for example:
DOCUMENT_FORM_SUSPECT
DOCUMENT_CREDENTIAL_FIELD
DOCUMENT_PHISHING_PROCESS
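Collecting the fourteen feature names listed above into the binary vector described under Vector Definition, a minimal sketch of the vector assembly (assuming each condition has already been evaluated) might look like:

```python
# The fourteen feature names listed above, in a fixed order; each position of
# the vector holds 1 if the corresponding condition is met, else 0.
FEATURE_NAMES = [
    "URL_HOSTNAME_IPV4",
    "URL_MANY_SUBDOMAINS",
    "URL_WORDPRESS_PATH_COMPONENT_OR_TILDE",
    "URL_ACTION_KEYWORD_SUSPECT",
    "URL_SUBDOMAIN_SUSPECT",
    "URL_PATH_SUSPECT",
    "DOCUMENT_TITLE_OR_METADESCRIPTION_SUSPECT",
    "DOCUMENT_PHISHING_TITLE",
    "DOCUMENT_ICON_OR_CSS_OR_JS_SUSPECT",
    "DOCUMENT_HIGH_DOMAIN_RATE",
    "DOCUMENT_DATA_SUSPECT",
    "DOCUMENT_FORM_SUSPECT",
    "DOCUMENT_CREDENTIAL_FIELD",
    "DOCUMENT_PHISHING_PROCESS",
]

def build_feature_vector(conditions: dict) -> list:
    """Build the 14-dimensional binary vector from a dict of evaluated
    feature conditions (missing features default to 0)."""
    return [int(bool(conditions.get(name, False))) for name in FEATURE_NAMES]

vector = build_feature_vector({"URL_ACTION_KEYWORD_SUSPECT": True,
                               "DOCUMENT_CREDENTIAL_FIELD": True})
print(vector)  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
```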
The resultant feature vector may now be input to the phishing probability engine, as shown at B506.
A brand identification algorithm according to one embodiment is shown in
The following phishing URL example uses the Chase brand name, for exemplary purposes only.
http://tula-tur.ru/chase/chase_auth.html
Examination of this phishing URL, according to one embodiment, would lead to a brand identification of Chase, as "chase" is a keyword element matching a URL path element at B602 in
http://itunes.menaiswimclub.org.au/images/confirm
This phishing link leads to a brand identification of Apple, as "itunes" is a keyword element matching a URL subdomain element at B601 in
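A minimal sketch of such keyword-based brand identification, assuming each brand record carries a list of keyword elements as in the hypothetical brand description sketched earlier (the exact matching rules of B601 and B602 are not reproduced here):

```python
from urllib.parse import urlparse

# Hypothetical brand knowledge-base entries; the keyword lists are illustrative.
BRANDS = {
    "Chase": ["chase"],
    "Apple": ["apple", "itunes", "icloud"],
}

def identify_brand(url: str):
    """Return the brand whose keyword matches the URL subdomain (B601)
    or the URL path (B602), or None if no keyword matches."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Naive split; a real implementation would consult a public-suffix list.
    subdomain = ".".join(host.split(".")[:-2])
    path = parsed.path.lower()
    for brand, keywords in BRANDS.items():
        if any(k in subdomain for k in keywords):    # B601: subdomain match
            return brand
        if any(k in path for k in keywords):         # B602: path match
            return brand
    return None

print(identify_brand("http://tula-tur.ru/chase/chase_auth.html"))            # Chase
print(identify_brand("http://itunes.menaiswimclub.org.au/images/confirm"))   # Apple
```

In both examples the match is made against brand keyword elements, mirroring the Chase and Apple identifications described above.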
Compute Phishing Probability with SVM Classifier
The computed input vector may now be input to the phishing probability engine. According to one embodiment, the phishing probability engine may comprise a Support Vector Machine (SVM) classifier. One embodiment of the phishing probability engine uses a binary SVM classifier, in which the two classes N and P are
Herein, an element is a pair of files. According to one embodiment, the first file of the pair is a URL file containing the URL under investigation. The second file of the pair is an HTML file containing the webpage pointed to by the URL. According to one implementation, the filename of the first file is a hash of, for example, a quantity such as the current timestamp and the URL under investigation. The extension of the first file may be, for example, “.url”. Similarly, the filename of the second file may be a hash of, for example, a quantity such as the current timestamp and the content of the webpage pointed to by the link (e.g., URL) in the email. The extension of the second file may be, for example, “.html”. According to one embodiment, the hash may be computed with a message digest algorithm such as MD5, although other hashes may be utilized as well. For example, the two files may be named as follows:
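A minimal sketch of this element naming scheme, assuming the MD5 variant described above; the filenames it produces are illustrative rather than the original example names:

```python
import hashlib
import time

def element_filenames(url: str, page_html: str) -> tuple:
    """Derive the .url and .html filenames of an element as MD5 hashes of the
    current timestamp combined with the URL and with the page content."""
    timestamp = str(time.time())
    url_name = hashlib.md5((timestamp + url).encode("utf-8")).hexdigest() + ".url"
    html_name = hashlib.md5((timestamp + page_html).encode("utf-8")).hexdigest() + ".html"
    return url_name, html_name

url_file, html_file = element_filenames("http://tula-tur.ru/chase/chase_auth.html",
                                        "<html>...</html>")
print(url_file, html_file)
```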
To train the SVM classifier, it may be provided with a corpus of P (phishing) and N (non-phishing) elements. This corpus may be updated periodically as new phishing attempts are discovered, to follow the phishing trend. The training and testing of the SVM classifier produces an SVM model that may be used by the phishing probability engine.
According to one embodiment, for an input vector V (e.g., the 14-dimensional input vector discussed herein), the SVM classifier of the phishing probability engine produces a probability: the probability that input vector V belongs to the P class, the class of phishing elements. This probability may then be used to decide whether the URL under investigation is likely a phishing URL. Subsequently, actions such as deleting, quarantining or placing an email in a “Junk” folder may be carried out, based upon the computed probability.
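A minimal end-to-end sketch using scikit-learn's SVC as one possible binary SVM implementation; the synthetic training corpus, the 0.5 decision threshold and the “Junk”-folder action are illustrative assumptions rather than the trained model or policy of the embodiment:

```python
import random
from sklearn.svm import SVC

random.seed(0)

# Placeholder corpus of 14-dimensional binary feature vectors: label 1 for the
# P (phishing) class, 0 for the N (non-phishing) class. A real corpus would be
# built from collected phishing and legitimate elements and updated periodically.
def synthetic_element(is_phishing: bool) -> list:
    bias = 0.7 if is_phishing else 0.1   # phishing elements tend to set more feature bits
    return [int(random.random() < bias) for _ in range(14)]

X_train = [synthetic_element(True) for _ in range(50)] + \
          [synthetic_element(False) for _ in range(50)]
y_train = [1] * 50 + [0] * 50

# Train the binary SVM classifier; probability=True enables probability estimates.
svm_model = SVC(kernel="linear", probability=True)
svm_model.fit(X_train, y_train)

# Column of predict_proba corresponding to the P (phishing) class.
p_index = list(svm_model.classes_).index(1)

def phishing_probability(vector_v: list) -> float:
    """Probability that the 14-dimensional input vector V belongs to the P class."""
    return float(svm_model.predict_proba([vector_v])[0][p_index])

def act_on_email(vector_v: list, threshold: float = 0.5) -> str:
    """Illustrative policy only: place the email in a 'Junk' folder when the
    phishing probability exceeds an assumed threshold."""
    return "move to Junk folder" if phishing_probability(vector_v) > threshold else "deliver normally"

suspect_vector = [1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]   # example input vector V
print(phishing_probability(suspect_vector), act_on_email(suspect_vector))
```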
Embodiments of the present invention are related to the use of computing device 712, 708 to detect and compute a probability that received email contains a phishing URL. According to one embodiment, the methods and systems described herein may be provided by one or more computing devices 712, 708 in response to processor(s) 802 executing sequences of instructions contained in memory 804. Such instructions may be read into memory 804 from another computer-readable medium, such as data storage device 807. Execution of the sequences of instructions contained in memory 804 causes processor(s) 802 to perform the steps and have the functionality described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the described embodiments. Thus, embodiments are not limited to any specific combination of hardware circuitry and software. Indeed, it should be understood by those skilled in the art that any suitable computer system may implement the functionality described herein. The computing devices may include one or a plurality of microprocessors working to perform the desired functions. In one embodiment, the instructions executed by the microprocessor or microprocessors are operable to cause the microprocessor(s) to perform the steps described herein. The instructions may be stored in any computer-readable medium. In one embodiment, they may be stored on a non-volatile semiconductor memory external to the microprocessor, or integrated with the microprocessor. In another embodiment, the instructions may be stored on a disk and read into a volatile semiconductor memory before execution by the microprocessor.
While certain embodiments of the disclosure have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the disclosure. Indeed, the novel methods, devices and systems described herein may be embodied in a variety of other forms. Furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the disclosure. For example, those skilled in the art will appreciate that in various embodiments, the actual physical and logical structures may differ from those shown in the figures. Depending on the embodiment, certain steps described in the example above may be removed, others may be added. Also, the features and attributes of the specific embodiments disclosed above may be combined in different ways to form additional embodiments, all of which fall within the scope of the present disclosure. Although the present disclosure provides certain preferred embodiments and applications, other embodiments that are apparent to those of ordinary skill in the art, including embodiments which do not provide all of the features and advantages set forth herein, are also within the scope of this disclosure. Accordingly, the scope of the present disclosure is intended to be defined only by reference to the appended claims.
This application is a CONTINUATION of U.S. patent application Ser. No. 14/542,939 filed on Nov. 17, 2014, entitled “METHODS AND SYSTEMS FOR PHISHING DETECTION”, the disclosure of which is incorporated by reference herein in its entirety.
Number | Name | Date | Kind |
---|---|---|---|
5890171 | Blumer et al. | Mar 1999 | A |
7412539 | Gmuender et al. | Aug 2008 | B2 |
7424616 | Brandenburg et al. | Sep 2008 | B1 |
7562387 | Nguyen et al. | Jul 2009 | B2 |
7752336 | Gmuender et al. | Jul 2010 | B2 |
7873707 | Subramanian et al. | Jan 2011 | B1 |
7958555 | Chen et al. | Jun 2011 | B1 |
7987237 | Matsuura | Jul 2011 | B2 |
8073829 | Lopez et al. | Dec 2011 | B2 |
8079087 | Spies | Dec 2011 | B1 |
8095967 | Loesh et al. | Jan 2012 | B2 |
8135790 | Castelli | Mar 2012 | B1 |
8307431 | Krishnamurthy et al. | Nov 2012 | B2 |
8336092 | Nagoya et al. | Dec 2012 | B2 |
8381292 | Warner | Feb 2013 | B1 |
8429301 | Gmuender et al. | Apr 2013 | B2 |
8438642 | Feng et al. | May 2013 | B2 |
8448245 | Banerjee et al. | May 2013 | B2 |
8468597 | Warner | Jun 2013 | B1 |
8495735 | Warner | Jul 2013 | B1 |
8521667 | Zhu et al. | Aug 2013 | B2 |
8528079 | Wang | Sep 2013 | B2 |
8621614 | Vaithilingam et al. | Dec 2013 | B2 |
8646067 | Agarwal et al. | Feb 2014 | B2 |
8667146 | Agarwal et al. | Mar 2014 | B2 |
8701185 | Krishnamurthy et al. | Apr 2014 | B2 |
8776224 | Krishnamurthy et al. | Jul 2014 | B2 |
8799515 | Wu | Aug 2014 | B1 |
8838973 | Yung et al. | Sep 2014 | B1 |
8874658 | Khalsa et al. | Oct 2014 | B1 |
9009813 | Agarwal et al. | Apr 2015 | B2 |
9058487 | Feng et al. | Jun 2015 | B2 |
9083733 | Georgiev | Jul 2015 | B2 |
9094365 | Gmuender et al. | Jul 2015 | B2 |
9210189 | Dong et al. | Dec 2015 | B2 |
9276956 | Geng et al. | Mar 2016 | B2 |
20050228899 | Wendkos et al. | Oct 2005 | A1 |
20060117307 | Averbuch | Jun 2006 | A1 |
20060168066 | Helsper | Jul 2006 | A1 |
20070078936 | Quilan | Apr 2007 | A1 |
20070192855 | Hulten | Aug 2007 | A1 |
20080141342 | Curnyn | Jun 2008 | A1 |
20100251380 | Zhang | Sep 2010 | A1 |
20120023566 | Waterson | Jan 2012 | A1 |
20120143799 | Wilson | Jun 2012 | A1 |
20120158626 | Zhu | Jun 2012 | A1 |
20120259933 | Bardsley | Oct 2012 | A1 |
20130086677 | Ma | Apr 2013 | A1 |
20130238721 | Patel | Sep 2013 | A1 |
20140033307 | Schmidtler | Jan 2014 | A1 |
20140082521 | Carolan et al. | Mar 2014 | A1 |
20140298460 | Xue | Oct 2014 | A1 |
20150200962 | Xu | Jul 2015 | A1 |
Entry |
---|
RFC 2616—https://tools.ietf.org/html/rfc2616, downloaded Mar. 15, 2016. |
RFC 3986—https://tools.ietf.org/html/rfc3986, downloaded Mar. 15, 2016. |
Wikipedia—https://en.wikipedia.org/wiki/Regular_expression, downloaded Mar. 15, 2016. |
International Search Report and Written Opinion of the International Searching Authority dated Mar. 11, 2016 in PCT/US2016/012285. |
USPTO Office Action dated Apr. 1, 2016 in U.S. Appl. No. 14/542,939. |
Marco Cova, Christopher Kruegel, and Giovanni Vigna—There is No Free Phish: An Analysis of “Free” and Live Phishing Kits—Department of Computer Science, University of California, Santa Barbara, 2008, downloaded from https://www.usenix.org/legacy/event/woot08/tech/full_papers/cova/cova_html/ on Jun. 24, 2016. |
Heather McCalley, Brad Wardman and Gary Warner—Chapter 12, Analysis of Back-Doored Phishing Kits; G. Peterson and S. Shenoi (Eds.): Advances in Digital Forensics VII, IFIP AICT 361, pp. 155-168, 2011. c IFIP International Federation for Information Processing 2011. |
Tyler Moore and Richard Clayton—Discovering Phishing Dropboxes Using Email Metadata, Pre-publication copy, Nov. 2012. To appear in the proceedings of the 7th APWG eCrime Researchers Summit (eCrime). |
Number | Date | Country | |
---|---|---|---|
20160352777 A1 | Dec 2016 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 14542939 | Nov 2014 | US |
Child | 15165503 | US |