 
                 Patent Application
 Patent Application
                     20060168066
 20060168066
                    1. Field of the Invention
The present invention relates to techniques for detecting email messages used for defrauding an individual (such as so-called “phishing” emails). The present invention provides a method, system and computer program product (hereinafter “method” or “methods”) for accepting an email message and determining whether the email message is a phishing email message. The present invention also includes methods for evaluating a requested URL to determine if the destination URL is a “trusted” host and is geographically located where expected, as well as methods for communicating the determined level of trust to a user. The present invention further includes methods for mining Trusted Hosts, which associates one or more Internet Protocol (IP) addresses of a Trusted Server with a Trusted URL. Methods are also provided for processing links in documents to determine on-site links to documents which request confidential information from a user.
2. Description of the Related Art
Phishing is a scam where a perpetrator sends out legitimate looking emails appearing to come from some of the World Wide Web's biggest and most reliable web sites for example—eBay, PayPal, MSN, Yahoo, CitiBank, and America Online—in an effort to “phish” for personal and financial information from an email recipient. Once the perpetrator obtains such information from the unsuspecting email recipient, the perpetrator subsequently uses the information for personal gain.
There are a large number of vendors today providing anti-phishing solutions. These solutions do not help to manage phishing emails proactively. Instead, they rely on providing early warnings based on known phishing emails, black lists, stolen brands, etc.
 Currently, anti-phishing solutions fall into three major categories: 
Current anti-phishing solutions fail to address phishing attacks in real time. Businesses using a link checking system must rely on a black list being constantly updated for protection against phishing attacks. Unfortunately, because the link checking system is not a proactive solution and must rely on a black list update, there is a likelihood that several customers will be phished for personal and financial information before an IP address associated with the phishing attack is added to the black list. Early warning systems attempt to trap prospective criminals and shut down phishing attacks before they happen; however, they often fail to accomplish these goals because their techniques fail to address phishing attacks that do not utilize scanning. Authentication and certification systems are required to use a variety of identification techniques; for example, shared images between a customer and a service provider which are secret between the two, digital signatures, and code specific to a particular customer being stored on the customer's computer. Such techniques are intrusive in that software must be maintained on the customer's computer and periodically updated by the customer.
Accordingly, there is a need and desire for an anti-phishing solution that proactively stops phishing attacks at a point of attack and that is minimally intrusive.
There is also a need for a solution that can proactively verify that a destination host is trusted without the use of black lists or white lists.
A method is also needed to determine a phishing email based on at least the level of trust associated with a URL extracted from the email.
A further need exists to associate one or more IP addresses of a Trusted Server with a Trusted URL, and a need to communicate to a user the level of trust associated with the host of a URL.
Finally, a method for processing links in documents to determine on-site links to documents which request confidential information is also needed in the art.
The present invention provides methods for determining whether an email message is being used in a phishing attack in real time. In one embodiment, before an end user receives an email message, the email message is analyzed by a server to determine if the email message is a phishing email. The server parses the email message to obtain information which is used in an algorithm to create a phishing score. If the phishing score exceeds a predetermined threshold, the email is determined to be a phishing email message. In a further embodiment, an email can be determined a phishing email based on a comparison between descriptive content extracted from the email and stored descriptive content.
Methods are also provided in the present invention for associating one or more IP addresses of a Trusted Server with a Trusted URL. Further methods are provided for processing links in a document to determine on-site links which reference documents requesting confidential information.
The present invention also provides methods for determining if a requested URL destination is a Trusted Host. In one embodiment, when a user chooses to visit a URL with a browser, the content of the destination page is scanned for indications that the page contains information that should only come from a Trusted Host. If the page contains information that should only be returned from a Trusted Host, the destination host is then checked to verify that the host is a Trusted Host contained in a Trusted Host database (DB). If it is not, the user is alerted that the content should not be trusted.
The foregoing and other advantages and features of the invention will become more apparent from the detailed description of the embodiments of the invention given below with reference to the accompanying drawings.
  
  
  
  
  
  
  
  
  
  
  
  
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration of specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized, and that structural, logical and programming changes may be made without departing from the spirit and scope of the present invention.
Before the present methods, systems, and computer program products are disclosed and described, it is to be understood that this invention is not limited to specific methods, specific components, or to particular compositions, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.
Unless otherwise expressly stated, it is in no way intended that any method or embodiment set forth herein be construed as requiring that its steps be performed in a specific order. Accordingly, where a method claim does not specifically state in the claims or descriptions that the steps are to be limited to a specific order, it is no way intended that an order be inferred, in any respect. This holds for any possible non-express basis for interpretation, including matters of logic with respect to arrangement of steps or operational flow, plain meaning derived from grammatical organization or punctuation, or the number or type of embodiments described in the specification. Furthermore, while various embodiments provided in the current application refer to the statutory classes of methods, systems, or computer program products, it should be noted that the present invention may be carried out, embodied, or claimed in any statutory class.
The term “EScam Score” refers to a combination of values that include a Header Score and a Uniform Resource Locator (URL) Score. The EScam Score represents how suspicious a particular email message may be.
The term “Header Score” refers to a combination of values associated with an internet protocol (IP) address found in an email message being analyzed.
The term “URL Score” refers to a combination of values associated with a URL found in an email message being analyzed.
The term “Non-Trusted Country” refers to a country that is designated by an EScam Server as a country not to be trusted, but is not a high-risk country or an Office of Foreign Assets Control (OFAC) country (defined below).
The term “High Risk Country” refers to a country that is designated by the EScam Server as a country that has higher than normal crime activity, but is not an OFAC country.
The term “Trusted Country” refers to a country that is designated by the EScam Server as a country to be trusted.
The term “OFAC Country” refers to a country having sanctions imposed upon it by the United States or another country.
The term “EScam Message” refers to a text field provided by the EScam Server describing the results of the EScam Server's analysis of an email message.
The term “EScam Data” refers to a portion of an EScam Server report detailing all IP addresses in the email Header and all URLs within the body of the email message.
The operation of a NetAcuity Server 240 which may be used in the present invention is discussed in U.S. Pat. No. 6,855,551, which is commonly assigned to the assignee of the present application, and which is herein incorporated by reference in its entirety.
EScam Server
  
Once the IP address has been classified at step 108, the EScam Server 202 transfers the IP address to a NetAcuity Server 240 to determine a geographic location of the IP address associated with the email header, at step 110. The NetAcuity Server 240 may also determine if the IP address is associated with an anonymous proxy server. Next at step 112, the IP address is checked against a block list to determine if the IP address is an open relay server or a dynamic server. The determination in step 112 occurs by transferring the IP address to, for example, a third party for comparisons with a stored block list (step 114). In addition, at step 112, the EScam Server 202 calculates a Header Score.
 Subsequent to step 114, all obtained information is sent to EScam Server 202. Next, at step 116, EScam Server 202 determines if any URLs are present in the email message. If no URLs are present in the email message, the EScam Server 202 proceeds to step 128. If a URL is present, the EScam Server 202 processes the URL at step 118 using an EScam API 250 to extract host names from the body of the email message. Next at step 120, the EScam Server 202 determines how the IP address associated with the URL should be classified for subsequent scoring by examining Hypertext Markup Language (HTML) tag information associated with the IP address. For example, classifications and scoring for the IP address associated with the URL could be the following:  
Once the IP address has been classified, at step 120, the EScam Server 202 transfers the IP address to the NetAcuity Server 240 to determine a geographic location of the IP address associated with the URL (step 122). Next, at step 124, the EScam Server 202 calculates a score for each IP address associated with the email message and generates a combined URL score and a reason code for each IP address. The reason code relates to a reason why a particular IP address received its score. For example, the EScam Server 202 may return a reason code indicating that an email is determined to be suspect because the IP address of the email message originated from an OFAC country and the body of the email message contains a link that has a hard coded IP address.
At step 126, EScam Server 202 compares a country code from an email server associated with the email message header and a country code from an email client to ensure that the two codes match. The EScam Server 202 obtains country code information concerning the email server and email client using the NetAcuity Server 240, which determines the location of the email server and client server and returns a code associated with a particular country for the email server and email client. If there is a mismatch between the country code of the email server and the country code of the email client, the email message is flagged and the calculated scored is adjusted accordingly. For example, upon a mismatch between country codes, the calculated score may be increased by 1 point.
In addition, an EScam Score is calculated. The EScam Score is a combination of the Header Score and URL Score. The EScam Score is determined by adding the score for each IP address in the email message and aggregating them based on whether the IP address was from the email header or a URL in the body of the email. The calculation provides a greater level of granularity when determining whether an email is fraudulent.
The EScam Score may be compared with a predetermined threshold level to determine if the email message is a phishing email. For example, if the final EScam Score exceeds the threshold level, the email message is determined to be a phishing email. In one embodiment, determinations by the EScam Server 202 may only use the URL score to calculate the EScam Score. If, however, the URL score is over a certain threshold, the Header Score can also be factored into the EScam Score calculation.
 Lastly, at step 128, the EScam Server 202 outputs an EScam Score, an EScam Message and EScam Data to an email recipient including detailed forensic information concerning each IP address associated with the email message. The detailed forensic information may be used to track down the origin of the suspicious email message and allow law enforcement to take action. For example, forensic information gleaned by the EScam server 202 during an analysis of an email message could be the following:  
Depending on a system configuration, email messages that have been determined to be phishing emails may also be for example, deleted, quarantined, or simply flagged for review.
EScam Server 202 may utilize domain name server (DNS) lookups to resolve host names in URLs to IP addresses. In addition, when parsing the headers of an email message at step 106, the EScam Server 202 may identify the IP address that represents a final email server (email message origination server) in a chain, and the IP address of the sending email client of the email message, if available. The EScam Server 202 uses the NetAcuity Server 240 (step 110) for the IP address identification. The EScam Server 202 may also identify a sending email client.
  
The EScam API 250 provides an interface between the EScam Server 202 and third party applications, such as a Microsoft Outlook email client 262 via various function calls from the EScam Server 202 and third party applications. The EScam API 250 provides an authentication mechanism and a communications conduit between the EScam Server 202 and third party applications using, for example, a TCP/IP protocol. The EScam API 250 performs parsing of the email message body to extract any host names as well as any IP addresses residing within the body of the email message. The EScam API 250 also performs some parsing of the email header to remove information determined to be private, such as a sending or receiving email address.
 The EScam API 250 may perform the following interface functions when an email client (260, 262 and 264) attempts to send an email message to EScam Server 202: 
An additional support component may be included in system 200 which allows a particular email client, for example, email client 260, to send incoming email messages to the EScam Server 202 prior to being placed in an email recipient's Inbox (not shown). The component may use the EScam API 250 to communicate with the EScam Server 202 using the communications conduit. Based on the EScam Score returned by the EScam Server 202, the component may, for example, leave the email message in the email recipient's Inbox or move the email message into a quarantine folder. If the email message is moved into the quarantine folder, the email message may have the EScam Score and message appended to the subject of the email message and the EScam Data added to the email message as an attachment.
Accordingly, the present invention couples IP Intelligence with various attributes in an email message. For example, IP address attributes of the header and URLs in the body are used by the present invention to apply rules for calculating an EScam Score which may be used in determining whether the email message is being used in a phishing ploy. Each individual element is scored based on a number of criteria, such as an HTML tag or whether or not an embedded URL has a hard coded IP address. The present invention may be integrated into a desktop (not shown) or on a backend mail server.
In a backend mail server implementation for system 200, the EScam API 250 may be integrated into the email client, for example, email client 260. As the email client 260 receives an email message, the email client 260 will pass the email message to the EScam Server 202 for analysis via the EScam API 250 and a Communications Interface 210. Based on the return code, the EScam Server 202 determines whether to forward the email message to an email recipient's Inbox or perhaps discard it.
If a desktop integration is utilized, email clients and anti-virus vendors may use an EScam Server 202 having a Windows based EScam API 250. A desktop client may subsequently request the EScam Server 202 to analyze an incoming email message. Upon completion of the analysis by the EScam Server 202, an end user may determine how the email message should be treated based on the return code from the EScam Server 202; for example, updating the subject of the email message to indicate the analyzed email message is determined to be part of a phishing ploy. The email message may also be moved to a quarantine folder if the score is above a certain threshold.
 The methods of the present invention can be carried out using a processor programmed to carry out the various embodiments. 
The method can be operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the method include, but are not limited to, personal computers, server computers, laptop devices, and multiprocessor systems. Additional examples include set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The methods may be described in the general context of computer instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The methods may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The methods disclosed herein can be implemented via a general-purpose computing device in the form of a computer 301. The components of the computer 301 can include, but are not limited to, one or more processors or processing units 303, a system memory 312, and a system bus 313 that couples various system components including the processor 303 to the system memory 312.
 The processor 303 in 
The system bus 313 represents one or more of several possible types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures can include an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnects (PCI) bus also known as a Mezzanine bus. This bus, and all buses specified in this description can also be implemented over a wired or wireless network connection. The bus 313, and all buses specified in this description can also be implemented over a wired or wireless network connection and each of the subsystems, including the processor 303, a mass storage device 304, an operating system 305, application software 306, data 307, a network adapter 308, system memory 312, an Input/Output Interface 310, a display adapter 309, a display device 311, and a human machine interface 302, can be contained within one or more remote computing devices 314a,b,c at physically separate locations, connected through buses of this form, in effect implementing a fully distributed system.
 The operating system 305 in 
The computer 301 typically includes a variety of computer readable media. Such media can be any available media that is accessible by the computer 301 and includes both volatile and non-volatile media, removable and non-removable media. The system memory 312 includes computer readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read only memory (ROM). The system memory 312 typically contains data such as data 307 and and/or program modules such as operating system 305 and application software 306 that are immediately accessible to and/or are presently operated on by the processing unit 303.
 The computer 301 may also include other removable/non-removable, volatile/non-volatile computer storage media. By way of example, 
Any number of program modules can be stored on the mass storage device 304, including by way of example, an operating system 305 and application software 306. Each of the operating system 305 and application software 306 (or some combination thereof) may include elements of the programming and the application software 306. Data 307 can also be stored on the mass storage device 304. Data 304 can be stored in any of one or more databases known in the art. Examples of such databases include, DB2®, Microsoft® Access, Microsoft® SQL Server, Oracle®, mySQL, PostgreSQL, and the like. The databases can be centralized or distributed across multiple systems.
A user can enter commands and information into the computer 301 via an input device (not shown). Examples of such input devices include, but are not limited to, a keyboard, pointing device (e.g., a “mouse”), a microphone, a joystick, a serial port, a scanner, touch screen mechanisms, and the like. These and other input devices can be connected to the processing unit 303 via a human machine interface 302 that is coupled to the system bus 313, but may be connected by other interface and bus structures, such as a parallel port, serial port, game port, or a universal serial bus (USB).
A display device 311 can also be connected to the system bus 313 via an interface, such as a display adapter 309. For example, a display device can be a cathode ray tube (CRT) monitor or an Liquid Crystal Display (LCD). In addition to the display device 311, other output peripheral devices can include components such as speakers (not shown) and a printer (not shown) which can be connected to the computer 301 via Input/Output Interface 310.
The computer 301 can operate in a networked environment using logical connections to one or more remote computing devices 314a,b,c. By way of example, a remote computing device can be a personal computer, portable computer, a server, a router, a network computer, a peer device or other common network node, and so on. Logical connections between the computer 301 and a remote computing device 314a,b,c can be made via a network such as a local area network (LAN), a general wide area network (WAN), or the Internet. Such network connections can be through a network adapter 308.
For purposes of illustration, application programs and other executable program components such as the operating system 305 are illustrated herein as discrete blocks, although it is recognized that such programs and components reside at various times in different storage components of the computing device 301, and are executed by the data processor(s) of the computer. An implementation of application software 306 may be stored on or transmitted across some form of computer readable media. An implementation of the disclosed methods may also be stored on or transmitted across some form of computer readable media. Computer readable media can be any available media that can be accessed by a computer. By way of example, and not limitation, computer readable media may comprise “computer storage media” and “communications media.” “Computer storage media” include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
Phishing Email Determiner
 In one embodiment of the present invention, a Phishing Email Determiner (PED) is provided for determining a phishing email using one or more factors, with at least one factor being the level of trust associated with a URL extracted from the email. The embodiment of 
 First, in the embodiment of 
 In an embodiment based on the embodiment of 
 A database of the kind which may be operative on the computer system of 
 In a further embodiment of the Phishing Email Determiner extending the embodiment of 
 In yet another embodiment of the PED based on the embodiment of 
 The Phishing Email Determiner of the present invention can also determine phishing emails based on descriptive content associated with an entity, such as a company, and which is extracted from an email message, as illustrated, for example, in the embodiment of 
 The PED of 
 In various embodiments of 
 In another embodiment based on that of 
Trusted Host Miner
 The Trusted Host Miner (THM) of the present invention is capable of discovering the IP addresses of all servers that serve a particular Trusted URL, and is illustrated in the embodiment of 
In one embodiment, the THM loads the list of Trusted URLs that it is responsible for discovering and maintaining from the Trusted URL database 703. The THM then performs a DNS query for each URL 704. The DNS query also returns a time-to-live (TTL) value for each address it returns. Then, at step 705, it is determined if the server address is in the database. If the server address is in the database, then the Last Seen date for the address is updated in the Trusted Server Database 706. The THM then waits for the DNS supplied Time-To-Live (TTL) for the address to expire 707, and then repeats the DNS server query at step 704.
If it was determined at step 705 that the server address was not in the database, then the address of the server is added to the Trusted Server database 708. The THM can then wait for the TTL for the address to expire, and repeat the THM method starting at step 704.
If a particular Trusted Server has not been seen for a configured amount of time, the THM can prune the server by removing 709 it from the Trusted Server database 711. This action ensures that the Trusted Server database 711 is always current and doesn't contain expired entries.
The Trusted Server database can also be preloaded with sets of Trusted Servers that are provided by the owners of those servers 710. For example, a financial institution could provide a list of its servers that are trusted. These would be placed in the Trusted Server database 711 and not mined by the THM.
 The THM of another embodiment is illustrated in 
 In an embodiment of the THM extending the embodiment of 
Trusted Host Browser
The present invention provides a Trusted Host Browser (THB) method for communicating a level of trust to a user. In one embodiment, the THB uses the Trusted Server database 711, and Trusted Host Browser is implemented as a web browser plug-in which can be useable via a toolbar. The plug-in may be loaded into a web browser and used to provide feedback to the end user regarding the security of the web site they are visiting. For example, if the end user clicks on a link in an email message they received in the belief that the link is to their bank's website, the plug-in can indicate visually whether they can trust the content delivered into the web browser from the website.
 In one embodiment of the THB as illustrated in 
If the EScam Server 202 determines that the server is not trusted, it then checks the geographic location of the server 905. If the geographic location is potentially suspicious 906, such as an OFAC country or a pre-determined suspect country, the EScam Server 202 can indicate this to the plug-in. If the geographic location is not suspicious, the plug-in may then display an icon in the browser indicating “Non-Suspicious Website” 908. If the server location is suspicious, then the plug-in will display an icon indicating “Suspicious Website” 907. The end user can then use the information concerning the validity of the website to determine whether to proceed with interaction with this site, such as providing confidential information including the user's login, password, or financial information.
 Another embodiment of the THB useful for communicating the level of trust to a user is illustrated in 
 In an embodiment of the THB based on 
Page Spider
One embodiment of the present invention provides a Page Spider method which is useful for processing links in documents to determine on-site URLs which may require the communication of confidential or sensitive information such as user credentials, login, password, financial information, social security number, or any type of personal identification information. The URLs which refer to on-site web pages requesting confidential information may also be treated as Trusted URLs, added to the Trusted URL database 711, and processed by the THM.
 The Page Spider method is illustrated in one embodiment depicted in 
In another embodiment of the Page Spider, the documents are HTML compatible documents, and the links are URLs. In further embodiments of the Page Spider, the documents are XML documents and the links are URLs. It will also be apparent to one of skill in the art that the Page Spider can be used with any type of document which contains one or more links or references to other documents.
In yet another embodiment, the first document may be parsed to determine an HTML anchor tag <A> which contains a link to the second document. The second document may also be inspected to determine if it request confidential information by determining if it contains one or more predetermined HTML tags such as the <FORM> or <INPUT> tag. In various embodiments, the confidential information may be requested by a secure login form.
 One or more embodiments of the present invention may be combined to provide enhanced functionality, such as the embodiment shown in 
 In the embodiment illustrated in 
In the current embodiment, a Page Spider User Interface (UI) 1201 is provided, which allows a user to input Jump-Off URLs, input Don't Follow URLs, and validate Didn't Follow URLs and place them in the Jump-Off URL database 1202. The Page Spider UI 1201 may also be used to validate All Inclusive database 1206 entries, validate Secure Login URL database 1207 entries, and to manually enter All Inclusive/Secure URLs, bypassing Page Spider processing.
 In the embodiment of 
 In another embodiment, a Trusted Server DB Builder 1210 polls the Trusted Server DB 1209, and when there are sufficient changes made, publishes URLs to the All Inclusive Trusted Server DB 1211 and the Secure Login Trusted Server DB 1212. In a further embodiment, a DB Distributor 1213 also sends URLs to the All Inclusive Trusted Server DB 1211 and the Secure Login Trusted Server DB 1212. Finally, a user uses an Institution UI 1215 to administer the Institution Info DB 1214, which contains descriptive content such as domain names and keywords that can be used to identify content related to the institution. The descriptive content may also be supplied to a PED coupled with the embodiment of 
While the invention has been described in detail in connection with various embodiments, it should be understood that the invention is not limited to the above-disclosed embodiments. Rather, the invention can be modified to incorporate any number of variations, alternations, substitutions, or equivalent arrangements not heretofore described, but which are commensurate with the spirit and scope of the invention. Accordingly, the invention is not limited by the foregoing description or drawings, but is only limited by the scope of the appended claims.
This application is a continuation-in-part of U.S. Utility patent application Ser. No. 10/985,664, filed Nov. 10, 2004 which is incorporated herein in its entirety by reference.
| Number | Date | Country | |
|---|---|---|---|
| Parent | 10985664 | Nov 2004 | US | 
| Child | 11298370 | Dec 2005 | US |