The present invention relates generally to electronic messages. More specifically, a message processing technique is disclosed.
Automated message filtering systems have become popular as the number of unwanted electronic messages (also known as “spam”) increases. Some basic spam filtering products identify spam messages by searching for certain terms that are commonly present in spam messages, such as names of drugs and product descriptions. The senders of spam messages (also referred to as “spammers”) have responded by substituting the typical spam indicator words with words that look similar to the average reader. For example, Viagra® is a drug often advertised in spam messages. The spammers may substitute the letter ‘a’ with an ‘@’ sign, or use a backslash and a forward slash to form the character string ‘\/’ to represent the letter ‘V.’ Other commonly employed methods include keeping the first and last letters of the keyword correct but scrambling the letters in between, and using special characters to delimit phrases instead of spaces. For example, ‘Viagra®’ may be represented as ‘\/1agra’ and ‘Buy Viagra® Here’ may be spelled as ‘*Buy*\/Igrae*here*.’ While a human reader can easily guess the meaning despite the misspelling and obfuscation, it is more difficult for an automated message filtering system to detect these random variations. It would be desirable if mutated spam messages could be detected. It would also be useful if the detection technique could be implemented without significantly increasing the requirements for computing resources such as memory and processing time.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process, an apparatus, a system, a composition of matter, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
A technique of determining whether a guarded term is represented in a message is disclosed. In some embodiments, a portion of the message is associated with the guarded term and a cost of the association is evaluated. Techniques such as dynamic programming and genetic programming are employed in some embodiments to detect mutated guarded terms. The cost information may be used to further assist the processing of the message, including message classification, content filtering, etc.
In this example, a guarded term is selected (100), and a portion of the message is associated with the guarded term (102). In some embodiments, associating a portion of the message with the guarded term includes mutating the message portion and comparing the mutated message portion with the guarded term. In some embodiments, the guarded term may be mutated and compared with the message portion. For example, the letter “l” may be mutated as the number “1”, the letter “v” may be mutated as a backslash and a forward slash “\/”, the letter “a” may be mutated as the “@” sign, etc. Sometimes spammers take advantage of the fact that mutated words are more easily recognized by readers if the first and last letters of the word remain unchanged. Thus, in some embodiments, the association includes matching the first and the last letters of the guarded term with the first and last letters of the portion of the message.
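By way of illustration only, the following sketch shows one possible way to associate a message portion with a guarded term by undoing common single-character mutations and checking the first and last letters. The equivalence table, helper names, and constants are illustrative assumptions and are not part of the disclosed embodiments.

```python
# A minimal sketch (not the claimed implementation) of associating a message
# portion with a guarded term: undo common single-character mutations, then
# check whether the first and last letters of the guarded term are preserved.
CHAR_EQUIVALENTS = {
    "1": "l",    # the digit 1 often stands in for the letter l
    "@": "a",    # the @ sign often stands in for the letter a
    "\\/": "v",  # a backslash plus a forward slash often stands in for the letter v
}

def normalize(portion: str) -> str:
    """Replace known equivalent tokens with their standard characters."""
    for equivalent, standard in CHAR_EQUIVALENTS.items():
        portion = portion.replace(equivalent, standard)
    return portion.lower()

def first_last_match(portion: str, guarded_term: str) -> bool:
    """Check whether the message portion keeps the guarded term's first and last letters."""
    normalized = normalize(portion)
    term = guarded_term.lower()
    return (len(normalized) > 1
            and normalized[0] == term[0]
            and normalized[-1] == term[-1])

print(first_last_match("\\/1agra", "Viagra"))   # True: first 'v' and last 'a' are preserved
print(first_last_match("Virginia", "Viagra"))   # True: same first/last letters (handled by the safe-string test later)
```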
Sometimes the guarded term may be represented graphically, such as “ASCII art” where groups of characters are specially arranged to form graphical representations of individual characters. Optical character recognition techniques may be used to associate this type of message with guarded terms. Other appropriate association techniques are also applicable and are discussed in more detail below.
The cost of the association is then evaluated (104). In some embodiments, the cost indicates how likely it is that the guarded term is represented in the message. Process 150 may be repeated to associate other guarded terms with the message, and the cost may be cumulative. It is preferable for the system to include a limited number of guarded terms so that matching can be performed efficiently.
The string is then examined to determine whether it includes any suspicious substring that may be a mutated guarded term (202). There are a number of techniques useful for finding such a suspicious substring. For example, a suspicious substring may be found by locating a substring whose first and last letters match the first and the last letters of a guarded term. Further details of how to locate the suspicious substring are discussed below.
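One simple way such candidate substrings might be located is sketched below, assuming that words are delimited by white space; the helper name and the punctuation-stripping rule are illustrative assumptions.

```python
import re

def find_suspicious_substrings(text: str, guarded_term: str) -> list[str]:
    """Return words whose first and last characters match those of the guarded
    term; such words are candidates for being mutated guarded terms."""
    first, last = guarded_term[0].lower(), guarded_term[-1].lower()
    candidates = []
    for word in re.split(r"\s+", text):
        core = re.sub(r"^\W+|\W+$", "", word)   # strip surrounding punctuation
        if len(core) >= 2 and core[0].lower() == first and core[-1].lower() == last:
            candidates.append(core)
    return candidates

print(find_suspicious_substrings("Buy V1AGRA here, or visit Virginia", "Viagra"))
# ['V1AGRA', 'Virginia'] -- both need further checks before being scored
```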
If a suspicious substring is found, it is determined whether the suspicious substring is a safe string (204). A safe string is a word, phrase, or expression that may be present in the message for legitimate reasons. For example, although the word ‘Virginia’ has the same first and last letters as ‘Viagra®,’ ‘Virginia’ is a correctly spelled word and may be present in the context of the message for legitimate reasons. In some embodiments, a string is determined to be safe if it can be found in a dictionary or database of acceptable words. If the suspicious string is not a safe string, it is extracted (206). The substring is then evaluated against the guarded term (208). Details of the evaluation are discussed below. The process may be repeated for multiple guarded terms and a cumulative score may be computed.
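A minimal sketch of the safe-string test follows, with an in-memory word set standing in for the dictionary or database of acceptable words; the set contents and helper name are illustrative assumptions.

```python
# A sketch of the safe-string test: a candidate is skipped if it is an
# acceptable, correctly spelled word that may appear for legitimate reasons.
SAFE_WORDS = {"virginia", "vanilla", "visa", "vista"}

def is_safe_string(candidate: str) -> bool:
    return candidate.lower() in SAFE_WORDS

candidates = ["V1AGRA", "Virginia"]
suspicious = [c for c in candidates if not is_safe_string(c)]
print(suspicious)   # ['V1AGRA'] -- only the unsafe candidate is evaluated further
```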
In some embodiments, the evaluation yields a score that indicates whether the substring and the guarded term approximately match. In some embodiments, an approximate match is found if the score reaches a certain preset threshold value. In some embodiments, multiple guarded terms are compared with the message. If the message includes substrings that approximately match one or more guarded terms, this information is provided to further assist the processing of the message. For example, the message may be further processed by a spam filter. The fact that a guarded term is represented in the message in a mutated form may indicate that the message is likely to be spam. The spam filter may assess a penalty on the message based on this knowledge, and optionally apply other filtering techniques such as white listing, thumb printing, Bayesian analysis, etc. to classify the message.
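The following sketch suggests how approximate-match information might feed a spam filter's penalty score. It reuses find_suspicious_substrings() and is_safe_string() from the sketches above and guarded_term_cost() from the dynamic programming sketch given after the matrix description below; the threshold and penalty constants are arbitrary illustrative values.

```python
def guarded_term_penalty(message: str, guarded_terms: list[str],
                         max_cost: float = 1.0, penalty: float = 2.5) -> float:
    """Accumulate a spam-filter penalty for every guarded term that
    approximately matches a suspicious, non-safe substring of the message."""
    total = 0.0
    for term in guarded_terms:
        for candidate in find_suspicious_substrings(message, term):
            if is_safe_string(candidate):
                continue
            if guarded_term_cost(term, candidate) <= max_cost:
                total += penalty
                break   # one approximate match per guarded term is enough
    return total
```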
The string between the potential start and end position is then extracted (304). In some embodiments, if a character, a symbol or other standard token is obfuscated by using an equivalent token, the equivalent token may be identified before the string is further processed. In some embodiments, the equivalent token is replaced by the standard token before further processing. For example, “\/” (a backslash and a forward slash) may be replaced by “v”, and “|-|” (a vertical bar, a dash and another vertical bar) may be replaced by “H”. A score that indicates the similarity between the suspicious string and the guarded term is computed (306). In some embodiments, the score measures the amount of mutation required for transforming the guarded term into the suspicious string (also known as the edit distance) by inserting, deleting, changing, and/or otherwise mutating characters. In some embodiments, the score measures the probability that the guarded term is represented in the suspicious string. The score may be generated using a variety of techniques, such as applying a dynamic programming algorithm (DPA), a genetic programming algorithm or any other appropriate method to the guarded term and the suspicious string. For the purpose of illustration, computing the score using DPA is discussed in further detail, although other algorithms may also be applicable.
In some embodiments, a Dynamic Programming Algorithm (DPA) is used for computing the similarity score. In one example, the DPA estimates the similarity of two strings in terms of edit distance by setting up a dynamic programming matrix. The matrix has as many rows as the number of tokens in the guarded term, and as many columns as the length of the suspicious string. Entry (I, J) in this matrix reflects the similarity score of the first I tokens in the guarded term against the first J tokens of the suspicious string. Each entry in the matrix is iteratively evaluated by taking the minimum of: Matrix(I-1, J-1)+TokenSimilarity(GuardedTerm[I], SuspiciousString[J]); Matrix(I, J-1)+CostInsertion(SuspiciousString[J]); and Matrix(I-1, J)+CostDeletion(GuardedTerm[I]).
The similarity of the guarded term and the suspicious string is the matrix value at entry (length(GuardedTerm), length(SuspiciousString)). In this example, the TokenSimilarity function returns a low value (close to 0) if the tokens are similar, and a high value if the tokens are dissimilar. The CostInsertion function returns a high cost for inserting an unexpected token and a low cost for inserting an expected token. The CostDeletion function returns a high cost for deleting an unexpected token and a low cost for deleting an expected token.
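A compact sketch of this matrix computation is given below. The concrete values returned by the token_similarity, cost_insertion and cost_deletion helpers are illustrative assumptions chosen only to make the example run; the matrix recurrence, not the constants, is the point.

```python
def token_similarity(a: str, b: str) -> float:
    """Return a low value (near 0) for similar tokens and a high value for
    dissimilar tokens.  The equivalence pairs and constants are illustrative."""
    equivalents = {("l", "1"), ("i", "1"), ("a", "@"), ("o", "0"), ("s", "$")}
    a, b = a.lower(), b.lower()
    if a == b:
        return 0.0
    if (a, b) in equivalents or (b, a) in equivalents:
        return 0.2
    return 1.0

def cost_insertion(token: str) -> float:
    """Cheap to insert non-alphanumeric filler (an 'expected' insertion in
    obfuscated text), more expensive to insert letters or digits."""
    return 0.3 if not token.isalnum() else 1.0

def cost_deletion(token: str) -> float:
    """Constant deletion cost in this sketch; a fuller implementation could
    make expected tokens cheaper to delete than unexpected ones."""
    return 1.0

def guarded_term_cost(guarded_term: str, suspicious: str) -> float:
    """Dynamic programming matrix: entry (i, j) holds the lowest cost of
    matching the first i tokens of the guarded term against the first j
    tokens of the suspicious string.  A low final value means a close match."""
    m, n = len(guarded_term), len(suspicious)
    matrix = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        matrix[i][0] = matrix[i - 1][0] + cost_deletion(guarded_term[i - 1])
    for j in range(1, n + 1):
        matrix[0][j] = matrix[0][j - 1] + cost_insertion(suspicious[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            matrix[i][j] = min(
                matrix[i - 1][j - 1]
                + token_similarity(guarded_term[i - 1], suspicious[j - 1]),
                matrix[i][j - 1] + cost_insertion(suspicious[j - 1]),
                matrix[i - 1][j] + cost_deletion(guarded_term[i - 1]),
            )
    # The value at entry (length(GuardedTerm), length(SuspiciousString)).
    return matrix[m][n]

print(guarded_term_cost("viagra", "v1agra"))    # small cost: close approximate match
print(guarded_term_cost("viagra", "virginia"))  # larger cost: poor match
```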
In some embodiments, prior probabilities of tokens, which affect similarity measurements and expectations, are factored into one or all of the above functions. In some embodiments, the TokenSimilarity, CostInsertion and CostDeletion functions are adjustable. For example, in some embodiments, the prior probabilities of the tokens correspond to the frequencies with which characters occur in natural language or in a cryptographic letter frequency table. In some embodiments, the prior probabilities of the tokens in the guarded term correspond to the actual frequencies of the letters in the guarded term, and the prior probabilities of the tokens in the message correspond to the common frequencies of letters in natural language. In some embodiments, the prior probabilities of the tokens in the guarded term correspond to the actual frequencies of the tokens in the guarded term, and the prior probabilities of the tokens in the message correspond to the common frequencies of such tokens in sample messages, such as sample spam messages collected from the Internet.
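As an illustration of how such prior probabilities might enter the cost functions, the sketch below weights insertion cost by approximate English letter frequencies; the frequency values, the default frequency, and the scaling factor are illustrative assumptions.

```python
import math

# Approximate English letter frequencies used as prior probabilities; rarer
# tokens are more "surprising" and therefore more expensive to insert.
LETTER_FREQ = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070,
               "n": 0.067, "s": 0.063, "h": 0.061, "r": 0.060}

def cost_insertion_with_prior(token: str, default_freq: float = 0.01) -> float:
    """Insertion cost grows with the surprise (-log prior probability) of the token."""
    freq = LETTER_FREQ.get(token.lower(), default_freq)
    return -math.log(freq) / 5.0   # scaled so that common letters stay cheap

print(round(cost_insertion_with_prior("e"), 2))   # ~0.41: common letter, low cost
print(round(cost_insertion_with_prior("$"), 2))   # ~0.92: rare token, higher cost
```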
In some embodiments, the context of the mutation may be taken into account during the computation. For example, a mutation due to substitution of regular characters may be a typographical error, and is penalized to a lesser degree than a substitution of special characters. Thus, “Vlagre” may be penalized to a lesser degree than “\/i@gra”.
Sometimes the tokens immediately preceding and immediately following the string may be special characters such as white space or punctuation marks. In some embodiments, this provides a further indication that an approximate match, if found, is likely to be correct, and the dynamic programming score is adjusted accordingly.
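Both adjustments described in the preceding two paragraphs, penalizing special-character substitutions more heavily than ordinary typographical ones and favoring matches bounded by white space or punctuation, can be layered onto the basic score, as in the following sketch with illustrative constants.

```python
def substitution_penalty(original: str, replacement: str) -> float:
    """A letter-for-letter substitution (a likely typo, as in 'Vlagre') costs
    less than a special-character substitution (as in '\\/i@gra')."""
    if original.isalpha() and replacement.isalpha():
        return 0.3   # plausible typographical error
    return 0.8       # obfuscation using special characters or digits

def adjust_for_boundaries(score: float, text: str, start: int, end: int) -> float:
    """Improve (lower) the score slightly when the matched span text[start:end]
    is bounded by white space or punctuation, suggesting a real word boundary."""
    before = text[start - 1] if start > 0 else " "
    after = text[end] if end < len(text) else " "
    if not before.isalnum() and not after.isalnum():
        return score * 0.8
    return score
```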
In some embodiments, the capabilities for associating guarded terms with messages are built into a matching engine, which may be implemented as software or firmware, embedded in a processor, an integrated circuit, or any other appropriate device, or combinations thereof.
In the examples shown above, the guarded terms include special terms of interest. In some embodiments, the guarded terms also include variations of these special terms.
A technique for detecting whether a guarded term is represented in a message has been disclosed. Besides spam filtering and content filtering, the technique is also applicable to HTTP traffic filtering, virus detection, and any other appropriate applications where guarded terms may be mutated and included in the data stream.
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application is a continuation and claims the priority benefit of U.S. patent application Ser. No. 10/869,507 filed Jun. 15, 2004, now abandoned, and entitled “Approximate Matching of Strings for Message Filtering” which in turn claims the priority benefit of U.S. Provisional Patent Application No. 60/543,300 filed Feb. 9, 2004 and entitled “Approximate Matching of Strings for Message Filtering,” the disclosure of which is incorporated herein by reference for all purposes.