The following relates generally to methods, and apparatus therefor, for filtering and routing unsolicited electronic message content.
Given the availability and prevalence of various technologies for transmitting electronic message content, consumers and businesses are receiving a flood of unsolicited electronic messages. These messages may be in the form of email, SMS, instant messaging, voice mail, and facsimiles. As the cost of electronic transmission is nominal, and email addresses and facsimile numbers are relatively easy to accumulate (for example, by randomly attempting or identifying published email addresses or phone numbers), consumers and businesses become the target of unsolicited broadcast advertising by, for example, direct marketers promoting products or services. Such unsolicited electronic transmissions, sent without regard to the knowledge or interest of the recipient, are known as “spam”.
There exist different methods for detecting whether an electronic message such as an email or a facsimile is spam. For example, the following U.S. Patent Nos. describe systems that may be used for filtering facsimile messages: U.S. Pat. Nos. 5,168,376; 5,220,599; 5,274,467; 5,293,253; 5,307,178; 5,349,447; 4,386,303; 5,508,819; 4,963,340; and 6,239,881. In addition, the following U.S. Patent Nos. describe systems that may be used for filtering email messages: U.S. Pat. Nos. 6,161,130; 6,701,347; 6,654,787; 6,421,709; 6,330,590; and 6,324,569.
Generally, these existing systems rely on either feature-based methods or content-based methods. Feature-based methods filter based on one or more characteristics of the incoming email or facsimile. These characteristics are either obtained from the transmission protocol or extracted from the message itself. Once the characteristics are obtained, the incoming message may be filtered on the basis of a whitelist (i.e., an acceptable-sender list or non-spammer list), a blacklist (i.e., an unacceptable-sender list or spammer list), or a combination of both. Content-based methods may use pattern-matching techniques or, alternatively, may involve categorization of the message content. In addition, these methods may require some user intervention, such as letting the user make the final decision on whether a message is spam.
However, notwithstanding these different existing methods, the receipt and administration of spam continues to impose economic costs on the individuals, consumers, government agencies, and businesses that receive it. These economic costs include loss of productivity (e.g., the wasted attention and time of individuals), loss of consumables (such as paper when facsimile messages are printed), and loss of computational resources (such as lost bandwidth and storage). Accordingly, it is desirable to provide an improved method, apparatus, and article of manufacture for detecting and routing spam messages based on their content.
In accordance with the various embodiments described herein, there is described a system, and a method and article of manufacture therefor, for filtering electronic content to identify spam in message data. The system includes: a content extractor for identifying and selecting message content in the message data; a content analyzer having a plurality of information type gatherers for assimilating and outputting different message attributes relating to the message content associated with an information type; a categorizer having a plurality of decision makers for receiving as input the message attributes and prior history information and providing as output a message class for classifying the message data; a categorizer coalescer for assessing the message classes output by the plurality of decision makers, together with optional user input, to produce a class decision identifying whether the message data is spam; and a history processor receiving as input (i) the class decision, (ii) the message class from each of the plurality of decision makers, (iii) the message attributes of the plurality of information types, and (iv) prior history information, for (a) recording the message attributes and the class decision as part of the prior history information and/or (b) modifying the prior history information to reflect changes to fixed data or probability data.
These and other aspects of the disclosure will become apparent from the following description read in conjunction with the accompanying drawings wherein the same reference numerals have been applied to like parts and in which:
The table that follows sets forth definitions of the terminology used throughout the specification, including the claims.
A. System Operation
The system 100 includes a content extractor 104 and a content analyzer 106. The content extractor 104 extracts different message content in the message data received from the input sources 102 for input to the content analyzer 106. In one embodiment, a content identifier, OCR (and OCR correction), and a converter form part of the content extractor 104. In another embodiment, only the content identifier and/or content converter form part of the content extractor 104. The form of the message data received by the different components of the content extractor 104 from the input source 102 may be one that can be input directly to the content analyzer 106, or it may be in a form that requires pre-processing by the content extractor 104.
For example, in the event the message data is or contains image data (i.e., a sequence of images), the message data is first OCRed (possibly with OCR correction, for example, to correct spelling using a language model and/or improve the word recognition rate) to identify textual content therein (e.g., facsimile message data, or images embedded in emails or in HTTP content (e.g., from web browsers), which may be in one or more formats such as GIF, TIFF, or JPEG). This enables the detection of textual spam hidden in image content. Alternatively, the message data may require converting to text, depending on the format of the message data and/or the documents to which the message data may be linked. Converters to text exist for many file formats (e.g., PDF, PostScript, MS Office formats (.doc, .rtf, .ppt, .xls), HTML, and compressed (zipped) versions of these files). In addition, in the event the message data is voice data, it may require conversion using known audio-to-text converters (e.g., for audio data that may be embedded in, attached to, or linked to, email message data or HTTP advertisements).
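By way of illustration only, such pre-processing may be organized as a dispatch from a detected format to a registered converter. The following Python sketch assumes a registry structure and function names that are not part of the described system; an OCR or audio-to-text converter would be registered in the same manner as the plain-text converter shown:

```python
from typing import Callable, Dict

# Registry mapping a detected message-data format to a text-producing
# converter. The converters themselves (OCR engine, PDF-to-text,
# speech-to-text) are deployment-specific; only a plain-text converter
# is registered here as a placeholder.
CONVERTERS: Dict[str, Callable[[bytes], str]] = {}

def converter(fmt: str):
    """Decorator registering a converter for the given format."""
    def register(fn: Callable[[bytes], str]) -> Callable[[bytes], str]:
        CONVERTERS[fmt] = fn
        return fn
    return register

@converter("text")
def plain_text(payload: bytes) -> str:
    return payload.decode("utf-8", errors="replace")

def extract_text(payload: bytes, fmt: str) -> str:
    """Pre-process message data of a known format into analyzable text."""
    if fmt not in CONVERTERS:
        raise ValueError(f"no converter registered for format {fmt!r}")
    return CONVERTERS[fmt](payload)
```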
The system 100 also includes a content analyzer 106 that is made up of a plurality of information type gatherers for assimilating and outputting different message attributes that relate to the message content associated with the information type assigned by the content extractor 104. The message content output by the content extractor 104 may be directed to one or more information-type (i.e., “info-type”) gatherers of the content analyzer 106. In one embodiment, one info-type gatherer identifies sender attributes in the message data, and a second info-type gatherer transforms message data to a vector of terms identifying, for example, a term's frequency of use in the message data and/or other terms used in context (i.e., neighboring terms). Once each info-type gatherer finishes processing the message content, its output in the form of message attributes is input to categorizer 108.
In this or alternate embodiments, additional combinations of info-type gatherers are adapted to process different attributes or features of text and/or image content depending on the input source 102. For example, in one embodiment an info-type gatherer is adapted to transform OCRed facsimile message data to a vector of terms, with one attribute per feature, by: (i) tokenizing (and optionally normalizing) words in the OCRed facsimile message data; (ii) optionally, performing morphological analysis on the surface form of a word (i.e., as it appears in the OCRed facsimile message) and returning its lemma (i.e., the normalized form of a word that can be found in a dictionary), together with a list of one or more morphological features (e.g., gender, number, tense, mood, person, etc.) and part-of-speech (POS); (iii) counting words or lemmas; (iv) associating each word or lemma with a feature; and (v) optionally, weighting feature counts using, for example, inverse document frequency.
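As an illustration of steps (i) and (iii)-(v), the following minimal Python sketch transforms message text into an idf-weighted vector of terms. The tokenization rule and smoothing constants are illustrative assumptions, and the optional morphological analysis of step (ii) is omitted because it requires a lexical resource:

```python
import math
import re
from collections import Counter
from typing import Dict, List

def term_vector(text: str, doc_freq: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """Transform (OCRed) message text into an idf-weighted vector of terms.

    doc_freq and n_docs supply document-frequency statistics gathered
    from prior messages, used for the optional idf weighting of step (v).
    """
    # (i) tokenize and normalize: lowercase words of two or more letters
    tokens: List[str] = re.findall(r"[a-z]{2,}", text.lower())
    # (iii)-(iv) count tokens, associating one feature per distinct word
    counts = Counter(tokens)
    # (v) weight each feature count by (smoothed) inverse document frequency
    return {
        w: tf * math.log((n_docs + 1) / (doc_freq.get(w, 0) + 1))
        for w, tf in counts.items()
    }
```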
Further, in this or other embodiments, combinations of info-type gatherers are adapted to gather sender attributes by extracting different features from the message content and the transmission protocol. In addition to all the words recognized through OCR, a number of features may be extracted from the transmission protocol of a message, such as: sender information (e.g., email address, FaxID or Calling Station Identifier, CallerID, IP or HTTP address, and/or fax number), and the date and time of transmission and reception.
The categorizer 108 has a set of decision makers that receive as input the message attributes from the content analyzer 106 and prior history information from the history processor 112. Generally, each decision maker may work on a different data type and/or rely on different decision-making principles (e.g., rule-based or statistically based). Each decision maker of the categorizer 108 provides as output a message class, for classifying the message data, that is input to the categorizer coalescer 110. Further, each decision maker operates independently to categorize the message attributes output by the content analyzer 106 using one or more message attributes and, possibly, prior history information. For example, one decision maker (or categorizer) may take as input sender attributes and make use of a whitelist and/or blacklist forming part of the history data 114 to evaluate those attributes and assess whether message data from that sender is spam (a sketch of such a rule-based decision maker follows). Another example of a decision maker takes as input a vector of terms and bases its categorization decision on statistical analysis of that vector.
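The following minimal Python sketch illustrates the first example; the class name, abstention behavior, and data structures are illustrative assumptions rather than part of the described system:

```python
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class SenderListDecisionMaker:
    """Rule-based decision maker classifying on sender attributes alone.

    Returns True (spam) if the sender is blacklisted, False (legitimate)
    if whitelisted, or None to abstain and defer to the content-based
    decision makers when the sender is unknown.
    """
    whitelist: Set[str] = field(default_factory=set)
    blacklist: Set[str] = field(default_factory=set)

    def classify(self, sender: str) -> Optional[bool]:
        if sender in self.blacklist:
            return True
        if sender in self.whitelist:
            return False
        return None  # abstain: sender appears on neither list
```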
Various embodiments for statistically categorizing the message attributes are described in more detail below. Advantageously, these statistical approaches to message data categorization may be adapted to rely on rules, such as a rule that accounts for differences between a CallerID and a number sent during the fax protocol (usually displayed on the top line of each fax page), or a rule that accounts for receiving a fax at unusual hours of the day (i.e., outside the normal working day).
More generally, each decision maker is a class decision maker, where the “class” of the decision maker may vary depending on: (a) the output it uses from an info-type gatherer of the content analyzer 106; (b) the history information 114 it uses, received from the history processor 112; and/or (c) the classification principles on which it bases its decision (i.e., a decision function that may be adaptive, e.g., rule-based or statistically based classification principles, or a combination thereof). An example of a rule-based classification principle is a classifier that bases its decision on a whitelist and/or a blacklist, whereas a Naïve Bayes categorizer is an example of a statistically based classifier.
The message classes output by the set of decision makers forming part of the categorizer 108 are assessed by the categorizer coalescer 110, together with user input 116 (which may be optional), to produce an overall class decision determining whether the message data is spam using, for example, one or a combination of: a voting scheme; a weighted averaging scheme (e.g., based on each decision maker's confidence); or boosting (i.e., one or more categorizers receive the output of other categorizer(s) as input to define a more accurate classification rule by combining one or more weaker classification rules). In addition, the categorizer coalescer 110 offers routing functions, which may vary depending on the overall class decision and, possibly, the certainty of that decision. For example, message data determined to be spam with a high degree of certainty may be automatically deleted, while message data with less than a high degree of certainty may be placed in temporary storage for user review.
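As a hedged illustration of one such combination scheme, the following Python sketch implements confidence-weighted averaging with certainty-dependent routing; the thresholds and routing labels are illustrative assumptions rather than values prescribed by the system:

```python
from typing import List, Tuple

def coalesce(decisions: List[Tuple[bool, float]],
             delete_threshold: float = 0.9) -> str:
    """Combine (is_spam, confidence) outputs from the decision makers.

    Uses confidence-weighted averaging, one of the schemes named above;
    the routing policy based on decision certainty is illustrative.
    """
    votes = [(1.0 if is_spam else -1.0) * conf for is_spam, conf in decisions]
    score = sum(votes) / max(len(votes), 1)  # overall class decision in [-1, 1]
    if score > 0:
        # spam: route according to the certainty of the decision
        return "delete" if score >= delete_threshold else "quarantine"
    return "inbox"
```

For instance, coalesce([(True, 0.95), (True, 0.9)]) yields "delete", while a split or low-confidence vote routes the message to quarantine or the inbox.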
Further, the system 100 includes a history processor 112 which stores, modifies, and accesses history data 114 stored in memory of system 100. The history processor 112 evaluates the independently produced message class output by each decision maker in the categorizer 108. That is, the history processor 112 allows the system 100 to adapt its decision function using the history of message data originating from the same sender. This means that a message received from a sender that has previously sent several borderline messages may eventually be flagged as spam by one of the adaptive decision functions described below.
More specifically, the history processor 112 receives as input (i) the overall class decision from the categorizer coalescer 110, (ii) the message class for each of the plurality of decision makers of the categorizer 108, (iii) the message attributes for the plurality of information types output by the content analyzer 106 and (iv) the history information 114. With the inputs (i)-(iv), the history processor (a) records the message attributes and the class decision(s) as part of the prior history information 114 and/or (b) modifies the prior history information 114 to reflect changes to fixed data or probability data.
Depending on the certainty of each categorizer's decision, the history processor 112 assesses the totality of the different message classification results and, based on those results, modifies the history data to reflect changed circumstances (e.g., moving a sender from a whitelist to a blacklist). For example, if a majority of the decision makers of the categorizer 108 indicate that the message content is not spam while the sender information indicates the message data is spam because the sender is on the blacklist, the history processor 112 adaptively manages the content of the whitelist and blacklist by updating the history data to remove the sender from the blacklist and, possibly in addition, add the sender to the whitelist.
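A minimal sketch of this cross-referencing policy, assuming the history object exposes whitelist and blacklist sets as in the earlier sketch (the majority rule is an illustrative policy choice):

```python
from typing import List

def reconcile(sender: str, content_votes: List[bool], history) -> None:
    """Cross-reference content-based decisions against the sender lists.

    If a clear majority of content-based decision makers judge the
    message legitimate while the sender sits on the blacklist, the
    sender is removed from the blacklist (and, optionally, whitelisted).
    """
    not_spam = sum(1 for is_spam in content_votes if not is_spam)
    if sender in history.blacklist and not_spam > len(content_votes) / 2:
        history.blacklist.discard(sender)
        history.whitelist.add(sender)  # optional: also whitelist the sender
```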
The table below illustrates an example of history information 114 recorded in one embodiment of the system 100 shown in
The message content extracted (at 206) is analyzed (at 208) by, for example, gathering sender and message attributes and/or by developing one or more vectors of terms. The incoming message is categorized (at 210) using one or more of the results of the content analysis (at 208) together with history information 114. If the user specifies that the results are to be validated (at 212), then user input is sought (at 214). Subsequently, the incoming message is routed (at 216) according to how the incoming message is categorized (at 210) and validated (if performed, at 214), and the categorization results (computed at 210) are evaluated (at 218) in view of the existing history data.
Depending on the results of the evaluation (at 218), history information 114 is updated (at 220) by either modifying existing history information or adding new history information. Advantageously, future incoming messages categorized (at 210) make use of prior history data that adapts over time as the content of incoming messages changes. For example, the use of history information 114 enables dynamic management of whitelists and blacklists through adaptive unsupervised learning, by cross-referencing the results of different decision makers in the categorizer 108 (e.g., by adding a sender to, removing a sender from, or moving a sender between, a whitelist and a blacklist based on content analysis).
B. Embodiments Of Statistical Categorizers
Embodiments of statistical categorization performed by one or more decision makers forming the categorizer 108 are described in this section. In these embodiments, statistical categorization methods are used in the following context: from a training set of annotated documents (i.e., messages) {(d1,z1), (d2,z2), . . . , (dN,zN)} such that, for all i, document di has label zi (where, e.g., zi ∈ {0,1}, with 1 signifying spam and 0 signifying legitimate messages), a discriminant function f(d) is learned such that f(d) > 0 if and only if d is spam. This decision rule may be implemented using at least the three statistical categorization models described below. These models differ in the parameters they use, the estimation procedure for those parameters, and the manner in which the decision function is implemented.
B.1 Categorization Using Naïve Bayes
In one embodiment, categorization decisions are performed by a decision maker of the categorizer 108 using a Naïve Bayes formulation, as disclosed for example by Sahami et al. in a publication entitled “A Bayesian Approach to Filtering Junk E-mail”, published in Learning for Text Categorization: Papers from the 1998 AAAI Workshop, which is incorporated herein by reference. In this statistical categorization method, the parameters of the model are the conditional probabilities of features w given the class c, P(w|c), and the class priors P(c). Both probabilities are estimated using the empirical frequencies measured on a training set. Assuming the words occur independently given the class, the probability of a document d containing the sequence of words (w1, w2, . . . , wL) is then

P(d|c) = P(w1|c) P(w2|c) . . . P(wL|c),

and the assignment probability is P(c|d) ∝ P(d|c)P(c). The decision rule combines these probabilities as f(d) = log P(c=1|d) − log P(c=0|d).
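A minimal, self-contained Python sketch of this formulation follows; the Laplace smoothing is an illustrative assumption, and the cited work additionally applies feature selection not shown here:

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple

def train_naive_bayes(docs: List[Tuple[List[str], int]]):
    """Learn f(d) = log P(c=1|d) - log P(c=0|d) from (tokens, label) pairs."""
    priors = Counter(z for _, z in docs)          # class counts for P(c)
    word_counts = {0: Counter(), 1: Counter()}    # feature counts for P(w|c)
    for words, z in docs:
        word_counts[z].update(words)
    vocab = set(word_counts[0]) | set(word_counts[1])
    totals = {c: sum(word_counts[c].values()) for c in (0, 1)}

    def log_p(w: str, c: int) -> float:
        # Laplace-smoothed empirical estimate of log P(w|c)
        return math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))

    def f(words: Iterable[str]) -> float:
        # d is classified as spam if and only if f(d) > 0
        score = math.log(priors[1]) - math.log(priors[0])
        for w in words:
            if w in vocab:
                score += log_p(w, 1) - log_p(w, 0)
        return score

    return f
```

For instance, f = train_naive_bayes([(["free", "meds"], 1), (["meeting", "agenda"], 0)]) returns a discriminant function for which f(["free", "meds"]) > 0.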
B.2 Categorization Using Probabilistic Latent Analysis
In another embodiment, categorization decisions are performed by a decision maker of the categorizer 108 using probabilistic latent analysis, as disclosed for example by Gaussier et al. in a publication entitled “A Hierarchical Model For Clustering And Categorizing Documents”, published in F. Crestani, M. Girolami and C.J. van Rijsbergen (eds), Advances in Information Retrieval—Proceedings of the 24th BCS-IRSG European Colloquium on IR Research, Lecture Notes in Computer Science 2291, Springer, pp. 229-247, 2002, which is incorporated herein by reference. The parameters of the model are the same as for Naïve Bayes, plus the conditional probabilities of documents given the class, P(d|c), and they are estimated using the iterative Expectation Maximization (EM) procedure. At categorization time, the conditional probability of a new document P(dnew|c) is again estimated using EM, and the remaining part of the process (posterior and decision rule) is the same as for Naïve Bayes described above.
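As a hedged illustration of the folding-in step, the following simplified (flat, non-hierarchical) Python sketch estimates the class mixture of a new document by EM with P(w|c) held fixed; the smoothing constant and iteration count are illustrative assumptions, and the cited hierarchical model is richer than this sketch:

```python
from collections import Counter
from typing import Dict, List, Tuple

def fold_in(words: List[str],
            p_w_given_c: Dict[int, Dict[str, float]],
            classes: Tuple[int, ...] = (0, 1),
            iters: int = 20) -> Dict[int, float]:
    """Estimate the class mixture of a new document by EM 'folding-in',
    holding the trained conditional probabilities P(w|c) fixed."""
    counts = Counter(words)
    p_c = {c: 1.0 / len(classes) for c in classes}   # uniform initialization
    for _ in range(iters):
        new = {c: 0.0 for c in classes}
        for w, n in counts.items():
            # E-step: responsibility of each class for word w in this document
            denom = sum(p_c[c] * p_w_given_c[c].get(w, 1e-12) for c in classes)
            for c in classes:
                new[c] += n * p_c[c] * p_w_given_c[c].get(w, 1e-12) / denom
        # M-step: renormalize to obtain the updated class mixture
        total = sum(new.values()) or 1.0
        p_c = {c: new[c] / total for c in classes}
    return p_c
```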
B.3 Categorization Using Support Vector Machines
In another embodiment, categorization decisions are performed by a decision maker of the categorizer 108 using Support Vector Machines (SVMs). It will be appreciated by those skilled in the art that probabilistic models are well suited to multi-class problems (e.g., general message routing) but do not allow very flexible feature weighting schemes, whereas SVMs allow any weighting scheme but are restricted to binary classification in their basic implementation.
More specifically, SVMs implement a binary classification rule expressed as a linear combination of similarity measures between a new document (i.e., message data) dnew and a number of reference examples called “support vectors”. The parameters are the similarity measure (i.e., kernel) K(di,dj), the set of support vectors, and their respective weights ai (an example of the use of SVMs is disclosed by Drucker et al. in a publication entitled “Support Vector Machines for Spam Categorization”, IEEE Trans. on Neural Networks, 10:5(1048-1054), 1999, which is incorporated herein by reference). The weights ai are obtained by solving a constrained quadratic programming problem, and the similarity measure is selected using cross-validation from a fixed set including polynomial and RBF (Radial Basis Function) kernels. The decision rule is given by

f(dnew) = Σi ai zi K(di, dnew) + b,

with ai ≠ 0 for support vectors only (the labels zi being mapped to ±1, and b being a learned offset).
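A brief sketch using scikit-learn follows; the library choice, toy data, and parameter grid are illustrative assumptions and not part of the described embodiment:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

texts = ["cheap meds online now", "win a free prize today",        # spam
         "minutes of the board meeting", "quarterly report draft"]  # legitimate
labels = [1, 1, 0, 0]

# Select the kernel (polynomial vs. RBF) and C by cross-validation,
# as described above; the grid values are illustrative.
search = GridSearchCV(
    make_pipeline(TfidfVectorizer(), SVC()),
    param_grid={"svc__kernel": ["poly", "rbf"], "svc__C": [0.1, 1.0, 10.0]},
    cv=2,
)
search.fit(texts, labels)
# decision_function returns f(d_new); positive values indicate spam
print(search.decision_function(["free meds prize now"]))
```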
C. Soft Whitelists/Blacklists
Generally, rule-based decision making using fixed whitelists and blacklists is not sufficient on its own, as it yields binary (i.e., categorical) decisions based on the rigid assumption that a sender is either legitimate or not, independent of the content of a message. That is, whitelists tend to be too closed (i.e., they tend to identify too many messages as spam) while blacklists tend to be too open (i.e., they tend to identify too few messages as spam). Further, both whitelists and blacklists tend to be too categorical (e.g., messages from a blacklisted sender will be rejected as spam, regardless of their content). Various embodiments set forth in this section advantageously provide operating embodiments for the history processor 112 shown in
C.1 Adaptation Using User Feedback
In a first embodiment, whitelists and/or blacklists stored in the history information 114 are updated using user feedback 116. In this embodiment, when a message is determined by the categorizer coalescer 110, and acknowledged through user feedback 116, to be spam, the sender information associated with that message (e.g., a phone number determined by CallerID or the facsimile header, or an email, IP, or HTTP address) is added to the blacklist (and removed from the corresponding whitelist), thereby minimizing future spam received from that sender. This may be implemented either automatically (e.g., implicitly, if the status of a message identified as spam is not changed after some period of time), or only after receiving user feedback confirming that the filtered message is spam. This embodiment provides a dynamic method for filtering senders of spam who regularly change their identifying information (e.g., phone number or email, IP, or HTTP address) to avoid being blacklisted.
The same adaptive process is possible for updating a whitelist. Once the categorizer coalescer 110 has flagged an incoming message as legitimate, the associated sender information (e.g., phone number or email or IP or HTTP address) may be automatically inserted in the whitelist and/or removed from a corresponding blacklist by the history processor 112. Such changes to the whitelist and blacklist forming part of the history information 114 may also be conditioned on explicit or implicit user feedback 116, as for the blacklist (e.g., the user could explicitly confirm the legitimate status, or implicitly by not changing the determined status of a message after a period of time).
C.2 Adaptation Using History Information
In a second embodiment, the history processor 112 adapts the whitelist and blacklist (or simply the blacklist, or simply the whitelist) stored in the history information 114 by leveraging history information concerning the various message attributes (e.g., sender information, content information, etc.) received from the content analyzer 106 and the one or more decisions received from the categorizer 108 (and possibly the overall decision received from the categorizer coalescer 110, if there is more than one decision maker). That is, the history processor 112 keeps track of sender information in order to combine the evidence obtained from the incoming message with the available sender history. Using this history, the system 100 is adapted to leverage sender statistics to take into account a favorable (or unfavorable) bias if the sender has already sent several messages that were judged (i.e., by their class decisions) legitimate (or not legitimate) with high confidence, or an opposite bias if the sender has previously sent messages that were only borderline legitimate.
More specifically in this second embodiment, the history processor 112 dynamically manages a probabilistic (or “soft”) whitelist/blacklist in the history information 114, rather than a binary (or “categorical”) whitelist/blacklist. That is, instead of a clear-cut evaluation that a sender x is or is not included in a blacklist (i.e., either x ∈ blacklist or x ∉ blacklist), each sender x is evaluated using a probability P(blacklist|x) (i.e., the probability that the sender x is on the blacklist) or, equivalently, an original belief P(spam|x) (i.e., the original belief or knowledge that the sender x transmits spam).
For example,
Further as shown in
An alternate embodiment for using and updating a soft blacklist may be represented as follows:
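A minimal Python sketch of one such representation, assuming the soft blacklist is kept as a smoothed running estimate of P(spam|x) per sender, updated with each class decision (the class structure, neutral prior of 0.5, and smoothing weight are illustrative assumptions):

```python
from collections import defaultdict

class SoftBlacklist:
    """Probabilistic blacklist: maintain P(spam|x) for each sender x.

    P(spam|x) is a smoothed running average of the class decisions on
    the sender's past messages; unseen senders start at a neutral 0.5.
    """
    def __init__(self, prior: float = 0.5, prior_weight: float = 2.0):
        self.prior, self.prior_weight = prior, prior_weight
        self.spam = defaultdict(float)    # spam decisions per sender
        self.total = defaultdict(float)   # all decisions per sender

    def p_spam(self, sender: str) -> float:
        return ((self.spam[sender] + self.prior * self.prior_weight) /
                (self.total[sender] + self.prior_weight))

    def update(self, sender: str, is_spam: bool) -> None:
        self.spam[sender] += 1.0 if is_spam else 0.0
        self.total[sender] += 1.0
```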
C.3 Combining History Information and User Feedback
In a third embodiment, the history processor 112 includes a hybrid whitelist/blacklist mechanism that combines history information and user feedback. That is, supplemental to the prior two embodiments, when a user is able to provide feedback, the profile P(content|spam) of the user may change. This occurs when a decision about a borderline spam message is misjudged (for example, judged not to be spam), which may result because new vocabulary was introduced in the message. If the user of the system 100 provides user feedback that overrides an automated decision by ruling that a message is actually spam (when the system determined otherwise), then the profile P(content|spam) of the user is updated or adapted to take into account the vocabulary from the message.
More specifically, this embodiment combines the first two embodiments, directed at utilizing user feedback and sender history information, into a third embodiment which allows the system 100 to adapt over time as one or both of user feedback and sender history information prove and disprove “evidence” of spam. In accordance with one aspect of this embodiment, system decisions may be accepted as “feedback” after a trial period (unless rejected within some predetermined period of time) and enforced by adapting the history information accessed by the class decision makers as if the user had confirmed the classification decisions computed by the categorizer coalescer 110. This allows the history for a sender (i.e., the a priori favorable/unfavorable bias for a sender) and/or the model parameters or profiles of the categorizer(s) to automatically “drift” or adapt (i) to changing circumstances over time and/or (ii) to retroactive changes to categorization decisions already taken, in order to account for that drift.
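As a hedged sketch of the trial-period mechanism, the following Python illustrates decisions that become implicit feedback when not overridden within a predetermined period; the storage layout, period length, and method names are illustrative assumptions:

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class TrialPeriodFeedback:
    """Treat unchallenged system decisions as implicit user feedback.

    A decision becomes confirmed (and the history information is then
    adapted as if the user had validated it) once the trial period
    elapses without the user overriding it.
    """
    trial_seconds: float = 7 * 24 * 3600  # illustrative one-week trial
    pending: Dict[str, Tuple[bool, float]] = field(default_factory=dict)

    def record(self, msg_id: str, is_spam: bool) -> None:
        self.pending[msg_id] = (is_spam, time.time())

    def override(self, msg_id: str, is_spam: bool) -> None:
        self.pending[msg_id] = (is_spam, 0.0)  # user verdict confirms at once

    def confirmed(self):
        """Yield (msg_id, is_spam) for decisions whose trial has elapsed."""
        now = time.time()
        for msg_id, (is_spam, t) in list(self.pending.items()):
            if now - t >= self.trial_seconds:
                del self.pending[msg_id]
                yield msg_id, is_spam  # feed into the history processor
```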
Continuing with the flow diagram shown in
More generally, the flow diagram in
D. Alternate Embodiments
This section describes alternate embodiments of the system 100 shown in
A second alternate embodiment, shown in
In a third alternate embodiment, the system 100 shown in
In a fourth alternate embodiment, the system 100 shown in
E. Miscellaneous
Those skilled in the art will recognize that a general purpose computer may be used for implementing the systems described herein such as the system 100 shown in
Further, those skilled in the art will recognize that the foregoing embodiments may be implemented as a machine (or system), process (or method), or article of manufacture by using standard programming and/or engineering techniques to produce programming software, firmware, hardware, or any combination thereof. It will be appreciated by those skilled in the art that the flow diagrams described in the specification are meant to provide an understanding of different possible embodiments. As such, alternative orderings of the steps, performing one or more steps in parallel, and/or performing additional or fewer steps may be done in alternative embodiments.
Any resulting program(s), having computer-readable program code, may be embodied within one or more computer-usable media such as memory devices or transmitting devices, thereby making a computer program product or article of manufacture according to the embodiment described herein. As such, the terms “article of manufacture” and “computer program product” as used herein are intended to encompass a computer program existent (permanently, temporarily, or transitorily) on any computer-usable medium such as on any memory device or in any transmitting device.
Executing program code directly from one medium, storing program code onto a medium, copying the code from one medium to another medium, transmitting the code using a transmitting device, or other equivalent acts may involve the use of a memory or transmitting device which only embodies program code transitorily as a preliminary or final step in making, using, or selling the embodiments as set forth in the claims.
Memory devices include, but are not limited to, fixed (hard) disk drives, floppy disks (or diskettes), optical disks, magnetic tape, and semiconductor memories such as RAM, ROM, and PROMs. Transmitting devices include, but are not limited to, the Internet, intranets, electronic bulletin board and message/note exchanges, telephone/modem-based network communication, hard-wired/cabled communication networks, cellular communication, radio wave communication, satellite communication, and other stationary or mobile network systems/communication links.
A machine embodying the embodiments may involve one or more processing systems including, but not limited to, CPU, memory/storage devices, communication links, communication/transmitting devices, servers, I/O devices, or any subcomponents or individual parts of one or more processing systems, including software, firmware, hardware, or any combination or subcombination thereof, which embody the disclosure as set forth in the claims.
While particular embodiments have been described, alternatives, modifications, variations, improvements, and substantial equivalents that are or may be presently unforeseen may arise to applicants or others skilled in the art. Accordingly, the appended claims as filed, and as they may be amended, are intended to embrace all such alternatives, modifications, variations, improvements, and substantial equivalents.