The invention will be better understood thanks to the attached Figures.
The antispam system, which filters the incoming e-mails for the users having their accounts on the same e-mail server, is placed between that e-mail server and its connection to the Internet.
The antispam system dedicated to one e-mail server and its users can be an application added to the e-mail server machine, or it can be a computer appliance running such an application. Several such antispam systems can collaborate with each other, and each of them is also interfaced to the accounts it protects on the e-mail server it protects. The collaboration with other antispam systems can be trusted, as in the case of a few antispam systems administered by the same authority, or evaluated by the antispam system and adapted accordingly, as would probably be the case in a self-organized collaboration of antispam systems with no inherent mutual trust.
The antispam system decides, for each incoming email, whether it is spam or not. If enough evidence is collected that an e-mail is spam, it is either blocked, or marked as spam and sent to the e-mail server for easy sorting into an appropriate folder. Otherwise, upon a maximum delay allowed by the antispam system, or upon a periodic or user-triggered send/receive request from the user's email client to the email server (the latter can be considered an option with virtually zero delay), the email is passed unchanged to the e-mail server.
The first-type inputs into the antispam system are incoming e-mail messages, before they are passed to the e-mail server.
The second-type inputs to an antispam system come from the antispam system's access to the users' accounts it protects. The antispam system observes the following email-account information and events for each protected email account: the text of the e-mails that the user sends; the text of the e-mails that the user receives and acts upon; the actions on the e-mails processed by the antispam system and received by the user, i.e. not filtered as spam, including deleting a message, deleting a message as spam, and moving a message to a folder; the actions on the e-mails processed by the antispam system and filtered as spam, which could happen very rarely or never, depending on the user's behavior and the performance of the antispam system; the send/receive requests from the email client of the user to the e-mail server; and the email addresses from the user's contacts. We assume that some of the users protected by the antispam system have “delete” and “delete-as-spam” options available from their e-mail clients for deleting messages and use them as they wish, but this assumption could be relaxed, and other feedback could be incorporated from the user's actions on their e-mails, such as moving e-mails to a good folder or simply deleting them. Here “delete” means move to the “deleted messages” folder, and “delete-as-spam” means move to the “spam messages” folder. We also assume that all the e-mails that the user has not yet permanently deleted are preferably kept on the e-mail server, so the antispam system can observe the actions taken on them. Here “permanently delete” means remove from the e-mail account. The messages could all be moved to and manipulated only on the e-mail client, but then the client would have to make all the actions on the e-mails observable by the antispam system.
The third-type inputs to the antispam system are messages coming from collaborating antispam systems. The messages contain useful information derived from the strings sampled from some of the e-mails that have been either deleted-as-spam by the users having accounts on the collaborating antispam systems or found by local processing to be suspected of belonging to a spam email. The third-type inputs to the antispam system are especially useful if the number of accounts protected by the system is small. One of the factors that determine the performance of an antispam system is the total number of active accounts protected by the antispam system and its collaborating systems.
The main outputs from the antispam system are based on the decisions, for the incoming emails, whether they are spam or not. If enough evidence is collected that an e-mail is spam, it is either blocked, or marked as spam and sent to the e-mail server for easy sorting into an appropriate folder. Otherwise, upon a maximum delay allowed by the antispam system, or upon a periodic or user-triggered send/receive request from the user's email client to the email server (the latter can be considered an option with virtually zero delay), it is passed unchanged to the e-mail server.
Other outputs of the antispam system are the collaboration messages sent to other antispam systems, which contain useful information derived from the strings sampled from some of the e-mails that have been deleted-as-spam by the users having accounts on the antispam system. If the collaboration is self-organized and based on evaluated and proportional information exchange, the antispam system has to create these outgoing collaboration messages in order to get similar input from other antispam systems.
In order to detect spam, the system produces and uses so-called detectors: binary strings that are able to match incoming spam without hurting normal emails. Omitting the details, the use of the detectors is illustrated in the corresponding Figure.
The detectors are produced as illustrated in the corresponding Figure.
The internal architecture and processing steps of the antispam system are shown in the corresponding Figure.
Incoming emails are put into a pending state by the antispam system until the detection process decides whether they are spam or not, or until they are forced to the Inbox by a pending timeout, by a periodic request from the mail client, or by a request from the user. The innate processing block might declare an email as non-spam and protect it from further processing by the system. If an email is found to be spam, it is quarantined by the antispam system, or it is marked as spam and forwarded to the email server for easy classification. Otherwise it is forwarded to the email server and goes directly to the Inbox. The user has access to the quarantined emails and can force some of them to be forwarded to the Inbox.
A pending email that is not protected by the innate part is processed in the following way. First, text strings are sampled from the mail text using our randomized algorithm, explained in detail later. Then, each sampled text string is converted into a binary-string representation called a proportional signature. Each proportional signature is passed to the negative selection block. Another input to the negative selection block is the so-called self signatures: signatures obtained in the same way as the proportional signatures of the considered incoming email, but with the important difference that they are sampled from the e-mails that the user implicitly declared as non-spam (by not explicitly deleting them as spam, or by sorting them into a non-spam folder, for example). In the negative selection block, the proportional signatures of the considered incoming email that are within a predefined negative-selection-specific similarity threshold of any self signature are deleted, and those that survive become so-called suspicious signatures.
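As a sketch, assuming that signatures are fixed-size bit strings and that similarity is measured by Hamming distance (the threshold value used here is purely a placeholder), the negative selection step might look like:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two equal-length bit-string signatures."""
    return bin(a ^ b).count("1")

def negative_selection(email_signatures, self_signatures, threshold):
    """Delete the proportional signatures that are within `threshold`
    of any self signature; the survivors become suspicious signatures."""
    return [
        sig for sig in email_signatures
        if all(hamming(sig, s) > threshold for s in self_signatures)
    ]
```

A signature similar to legitimate ("self") content is thus discarded before it can ever contribute to detection, which is the core of the negative-selection idea.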
Each suspicious signature is duplicated. One copy is passed to the maturation block, and another to the detection block. Each suspicious signature passed to the detection block is stored there as a pending signature and compared against the already existing memory and active detectors, and against any new active and memory detectors made during the email's pending time. If a suspicious signature is matched (found to be within a predefined detection-specific similarity threshold) by an active or memory detector, the corresponding email is declared spam. Optionally, a single match does not cause detection; instead, the detection block further processes the matching results between the detectors and the suspicious signatures, and if it finds enough evidence it declares the corresponding email as spam. Pending signatures contain a pointer to the originating message, and vice versa, and they are kept as long as the message is pending.
The active detectors used in the detection process are produced by the maturation process (block). The inputs to this process are the above-mentioned suspicious signatures, local danger signatures, and remote danger signatures. The local danger signatures are created in the same way as the suspicious signatures, but from the emails deleted as spam by users protected by the antispam system. The remote danger signatures are obtained from collaborating antispam systems, if any, as explained later. Except upon system start, when it is empty, the maturation block contains so-called inactive and active detectors. When a new suspicious signature is passed to the maturation block, it is compared, using a first maturation similarity threshold, against the signatures of the existing inactive detectors in the maturation block. If it does not match any of the existing inactive detectors' signatures, it is added as a new inactive detector to the maturation block. If it matches an existing inactive detector, the status of that detector (the first that matched) is updated by incrementing its counter C1, refreshing its time-field value T1, and adding the id of that user.
The same happens when a local danger signature is passed to the maturation block; the only difference is that, upon a match, C2 and T2 are affected instead of C1 and T1, and the DS bit is set to 1. Upon refreshing, T2 is typically set to a much later expiration time than is the case with T1. The same happens when a remote danger signature is received from a collaborating system, with the difference that the id and DS are not added, and the affected fields are only C3, C4, T3, and T4. Local suspicious and danger signatures are passed to the maturation block accompanied by an id value; remote danger signatures do not have an id value but have their own C3 and C4 fields set to binary or real-number values, so the local C3 and C4 counters may be incremented by one or by values dependent on these incoming remote signature counters.
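A minimal sketch of the detector state and the three update rules above, under the assumptions that the counters are plain integers, the T fields are absolute expiration timestamps, and the concrete TTL values are hypothetical:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Detector:
    signature: int
    C1: int = 0; C2: int = 0; C3: int = 0; C4: int = 0          # match counters
    T1: float = 0.0; T2: float = 0.0; T3: float = 0.0; T4: float = 0.0  # expirations
    DS: bool = False                       # danger-signal bit
    ids: set = field(default_factory=set)  # ids of affected users
    active: bool = False

def on_suspicious(det: Detector, user_id: str, ttl: float = 3600.0) -> None:
    """A matching local suspicious signature updates C1, T1 and the id set."""
    det.C1 += 1
    det.T1 = time.time() + ttl
    det.ids.add(user_id)

def on_local_danger(det: Detector, user_id: str, ttl: float = 86400.0) -> None:
    """A matching local danger signature updates C2, T2, sets DS, adds the id.
    T2 is refreshed to a much later expiration time than T1."""
    det.C2 += 1
    det.T2 = time.time() + ttl
    det.DS = True
    det.ids.add(user_id)

def on_remote_danger(det: Detector, c3: int = 1, c4: int = 1) -> None:
    """A matching remote danger signature affects only C3, C4 (and T3, T4);
    no id is added and DS is left unchanged."""
    det.C3 += c3
    det.C4 += c4
    det.T3 = det.T4 = time.time() + 86400.0
```

The remote counters c3 and c4 may carry the collaborating system's own counter values, so the local increment need not be by one.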
Possible efficient inactive/active detector syntax is shown on the
Whenever an inactive detector is updated, a function that takes the counters of this detector as input is called to decide about a possible activation of the detector. If the detector is activated, it is used for checking the pending signatures of all the local users' detection blocks (one per user). We call this recurrent detection. Optionally, only those detection blocks are checked whose id is added to the detector. Optionally, the pending-message identifiers are added along with the id to the detector whenever the detector is updated, in order to make the detection process faster at the price of a small amount of additional state.
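The activation decision could, for example, be a thresholded weighted sum over the counters; the weights and threshold below are purely illustrative and not part of the specification:

```python
def should_activate(C1: int, C2: int, C3: int,
                    w2: float = 3.0, w3: float = 2.0,
                    threshold: float = 5.0) -> bool:
    """Illustrative activation rule: danger-signal counters (C2, C3)
    weigh more than plain suspicious-signature matches (C1)."""
    return C1 + w2 * C2 + w3 * C3 >= threshold
```

Repeated suspicious matches alone can activate a detector, but a few danger signals (explicit delete-as-spam evidence) get it there much faster.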
Upon the activation of a detector, its signature is copied to the memory-detector databases of those users that had their id added to the detector with the appropriate DS bit set to 1. Memory detectors are also assigned a lifetime, and this time is longer than for the activated detectors.
Whenever a new detector is added, or an existing one is updated by a local suspicious or danger signature, a function is called that takes C1 and C2 as inputs and decides whether a signature should be sent to the collaborating system(s).
Both the inactive and active detectors live until all the lifetimes (T1-T4) have expired.
The old proportional signatures and detectors in the different blocks are eventually deleted, either because of an expired lifetime or because of the need to make space for newly created ones.
It should be understood that the following are possible implementations of some processing steps, and proposed improvements to the system explained in the claims and in the previous part of the document; they are not the only way to implement the system and thus do not decrease the generality achieved in the claims and the description in the previous part of the document.
Sampling the text strings from an email received by the antispam system is the first step in representing the email content in a form used for its further processing by the antispam system. The following items explain a possible sampling in detail:
The reason to sample from these two email parts is that the message that the sender passes to the recipient is fully contained in them. Here, the body of the email includes both the main text and the attachments. We emphasize that the sampled strings are processed by the adaptive part of the antispam system, and that the adaptive part looks at the “similarity” of the message strings to the strings from other messages that have been declared as spam or not spam. The header fields other than the subject line have special, determined meanings; they are not used for sampling and processing by the adaptive part of the antispam system, but they are processed by a set of rules that can be understood as the innate part of the antispam system.
If the email contains a large amount of text in the email body, sampling all the text would cause a high processing load on the antispam system and could be exploited by a spammer for a denial-of-service attack. To avoid this problem, the antispam system uses a preprocessing method to select only the part of the incoming email body that is important to process: the part that is most likely to be presented to the reader by his email reading program in the first opened window. Usually, based on this information, the reader determines whether the email is useful for him or whether it is spam. The antispam system samples and processes the same relevant information. Apart from preventing denial-of-service attacks, this saves the resources of the antispam system while processing normal emails, and also makes the system more resistant against added text aimed at fooling the antispam system by masking the main message, which might be spam, with guessed “good text”. The exception is outgoing emails, which are sampled either over the whole body or over a limited part of it, but the limit here is bigger than what is likely to be presented in a reading window. These are assumed to be normal emails, unless they are outgoing forwarded emails. The outgoing forwarded emails might often not be good examples of normal email, and are not sampled at all if they are detected by the “Fwd” or “Fw” string in the subject line or by a similar rule.
Any method that estimates the part of the email body that will fit in one window shown to the reader upon opening the message can be used to determine the part of the email that will be sampled. For example, a simple method would be one that counts the number of text characters and also takes into account special formatting characters such as “new line” and “tab”. If the email is in hypertext format, the method should take into account the size of the letters and the size of the figures attached within the text. In a special case with many large figures in the beginning and with little or no text, more space might be included for sampling, in order to capture some text that might follow the figures.
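For a plain-text body, such a character-counting estimator can be sketched as follows; the window budget of 1500 display positions, the line width of 80, and the tab width of 8 are illustrative assumptions:

```python
def first_window_part(text: str, window_chars: int = 1500,
                      line_chars: int = 80) -> str:
    """Return the prefix of `text` estimated to fit in the first
    reading window, counting newlines and tabs as extra space."""
    used = 0   # estimated display positions consumed so far
    taken = 0  # number of source characters included
    for ch in text:
        if ch == "\n":
            used += line_chars - (used % line_chars)  # jump to next line
        elif ch == "\t":
            used += 8
        else:
            used += 1
        taken += 1
        if used >= window_chars:
            break
    return text[:taken]
```

A hypertext body would need a richer model (letter sizes, inline figures), but the interface stays the same: map the body to the prefix a reader would actually see first.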
A unique feature of our sampling method is that it is designed with the goal of capturing the information from the message similarly to how it is perceived by a human reader. The idea behind this is that the antispam system should, with high probability, intercept and process any textual message easily spotted by the human reader on the displayed email, even if the message is obfuscated by the spammer and hidden from simple sequential text parsing. Additionally, the sampling should be feasible in terms of resource consumption, and adaptive. The sampling should also process the attached figures that might be mixed with text in different ways.
For example, this can be achieved through: 1) robust main sampling by sequential parsing of the text on the level of expressions and phrases, and 2) additional sampling triggered by the innate rules when the hypertext is found to have a special structure in which included figures, colors, font, capitalization, or two-dimensional relative positions of the letters could cause the email to be perceived differently by the user than in the case of simple left-to-right, character-by-character reading.
In the case of a plain-text message, the reader's brain identifies the words grouped into short expressions or into phrases or sentences in order to grab meaningful information from the text. Using a deterministic algorithm to find the borders between the words and between the sentences or phrases, in order to decide the sampling units, would be easily tricked by a spammer knowing the algorithm. To avoid vulnerability to such spammers' tricks, the antispam system uses a probabilistic approach: it samples the text at pseudo-random positions, using two possible sample sizes. One sample size is designed to have a good chance of overlapping well with short expressions; the other is designed to overlap well with phrases. Fixed sample sizes are important, as they enable the antispam system to efficiently compute statistically significant similarity among the samples from different messages, which, when accompanied by appropriate artificial-immune-system algorithms, enables a very robust identification of the patterns in the email that are related to the patterns in other emails sent to and/or experienced by the user or by the users of collaborating antispam systems.
Let L1 be the size of the long samples, the samples that are designed to capture the phrases. Let L2 be the size of the short samples, the samples that are designed to capture the expressions. These parameters must be equal within one group of the collaborating antispam systems, but might differ among the different groups.
The sampling is done in the following way. The subject line and the email textual part that is determined to be sampled are first concatenated and considered as one text block. Let pf(i) be the index within the text block of the first character of the i-th sample, pl(i) the index within the text block of the last character of the i-th sample, L the size of the text block, Fs the positive fixed advancing step from pf(i) to pf(i+1) for the samples of size Ls, As the average additional advancing step from pf(i) to pf(i+1) for the samples of size Ls, and RandU(k,l) a random integer sampled uniformly on the segment [k,l]. The algorithm for sampling the Ls-sized samples is:
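A minimal reconstruction of the algorithm from the definitions above can be sketched as follows; the handling of the first, possibly shortened sample and the choice of RandU(0, 2*As) as the additional step (whose mean is As) are our assumptions:

```python
import random

def sample_strings(text_block: str, Ls: int, Fs: int, As: int):
    """Sample Ls-sized strings at pseudo-random positions.
    pf(i+1) = pf(i) + Fs + RandU(0, 2*As), so the average total
    advancing step is Fs + As."""
    L = len(text_block)
    samples = []
    # First sample: assumed to start at index 0 with a random length
    # of at most Ls, which is why it might be shorter than Ls.
    pl = random.randint(0, Ls - 1)
    samples.append(text_block[0:pl + 1])
    pf = 0
    while True:
        pf += Fs + random.randint(0, 2 * As)  # advance the start index
        if pf >= L:
            break
        samples.append(text_block[pf:pf + Ls])
    return samples
```

With Fs smaller than Ls, consecutive samples overlap, so a spammy expression is covered by several samples even though the positions are unpredictable.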
Note that the first sample might be shorter than Ls. Reasonable values that we expect to work well are, for short strings: Ls=12-16, Fs=⅔*Ls, As=⅙*Ls; and for long strings: Ls=40-60, Fs=½*Ls, As=¼*Fs. Note that in this way the included figures are processed only via the corresponding hyperlink text, which is a weakness that could be exploited by spammers' tricks such as: giving different names to the same figure in different spam copies; adding possibly long text to the hyperlink that will not be displayed to the human reader but can be used for a denial-of-service or mis-training attack; moving the figure to a different position within the text in different spam copies; replacing different groups of letters by figures containing the same letters; or putting the complete spam message into the figure.
A more sophisticated method for processing the figures would be to replace the corresponding hyperlinks, which instruct the email reading program to display the figures together with the text, by textual or binary strings that extract the features from the figure in a way that preserves the similarity of figures as similarity of the corresponding strings, and is resistant to obfuscations by a spammer aiming to hide this similarity between the different spam copies.
The simplest and cheapest possible solution would be to replace each figure with a single character, preferably one different from letters, numbers, and other often-used symbols, and then process the obtained text block as if there were no figures. This would represent only the fact that there is a figure at the given position within the text, but would be more efficient and more resistant to spammers' tricks than keeping and processing the hyperlink text. Still, this would not capture the content of the figures.
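Assuming HTML-formatted bodies, this simplest variant can be sketched as a regular-expression substitution; the choice of placeholder character is arbitrary:

```python
import re

# A placeholder unlikely to collide with letters, digits, or common symbols.
FIGURE_MARK = "\x01"

def replace_figures(html_body: str) -> str:
    """Replace every inline image tag with a single marker character,
    so only the presence and position of a figure is represented."""
    return re.sub(r"<img\b[^>]*>", FIGURE_MARK, html_body,
                  flags=re.IGNORECASE)
```

The hyperlink text (file names, URLs) is discarded entirely, so renaming the same figure across spam copies no longer changes the sampled text.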
One way to sample the content of the figures and capture similarity among the different obfuscated copies of the same figure would be to process the figure using a modification of a standard text-recognition technique, replace the figure with the recognized text, and consider this text as part of the text block used in the main sampling procedure. As the antispam system applies post-processing of the sampled strings and is resistant to text obfuscations, it would also be resistant to mistakes in text recognition. Though we expect this method to be useful for figures of any size, it seems especially useful in the case of text obfuscated by the spammer by replacing groups of characters with small figures containing the same characters.
Another way to sample the figures would be to divide the figure into a number of parts, depending on its size in pixels, to analyze the features of each part, and to encode the results as text or binary strings. The concatenation of such strings would replace the figure in the text block used in the main sampling process.
Any picture pre-processing method, or combination of methods, is appropriate that transforms the picture into text and preserves the similarity among the pictures in the resulting text, as the rest of the antispam system is designed to be simple and efficient on such textual input.
One mail-specific field contains a random number generated for the email and added to all samples taken from this email. It enables checking, with high probability, whether two binary patterns corresponding to samples, to danger signals, or to detectors originate from the same email or not.
Another email-specific field is a unique identifier of the email, assigned to all samples taken from this email. It can be implemented as a pointer to the email and is used to easily find the email related to detected proportional signatures.
The sampling-position-specific field is equal to the sample number, assigned in the order in which the samples of a given size are taken from the email. This field could be useful for combining incoming danger signals corresponding to the short samples.
The main reason to have both main sampling, which is applied to all incoming emails, and triggered additional sampling, which is turned on only in some cases, is to manage the resources of the antispam system as optimally as possible. If an email is written in plain text only, without using any formatting tricks, the main sampling is enough to efficiently represent any possible message that this text brings. This will be the case with many normal emails. But if any common variation from normal writing is found that suggests possible use of spammers' tricks, the message is worth additional processing. For example, if a letter is repeated to fill the space between the pieces of a phrase, that is a sign of obfuscation. Such a repeated letter will easily be filtered out from the text by the reader, but could cause the filter not to capture the spammy phrase efficiently. As this concrete obfuscation will result in the binary representation of some samples having fewer bits set than is statistically normal, it can easily be detected by a rule that simply checks the number of bits set in the binary representation of each sample. Detection by a rule can trigger rule-specific additional sampling or general additional sampling, or both. A specific additional sampling in the example above would be repeated standard sampling on the text block, but with this letter removed wherever it is found to be repeated. A general additional sampling would be repeated standard sampling with a higher overlap for the short samples aimed at capturing the expressions.
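The repeated-filler rule and the corresponding preparation of the text for rule-specific resampling can be sketched as follows; the minimum run length of three is an arbitrary illustrative choice:

```python
import re

def has_repeated_filler(text: str, min_run: int = 3) -> bool:
    """Innate rule: detect runs of one repeated character used as filler
    between the pieces of a phrase."""
    return re.search(r"(.)\1{%d,}" % (min_run - 1), text) is not None

def strip_repeated_filler(text: str, min_run: int = 3) -> str:
    """Remove such runs entirely before repeating the standard sampling
    on the cleaned text block."""
    return re.sub(r"(.)\1{%d,}" % (min_run - 1), "", text)
```

The rule itself never looks at the message meaning; it only flags a statistical anomaly and hands the cleaned text back to the adaptive sampling.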
A set of such triggering rules effectively represents the innate part of the antispam system. It applies message-content-nonspecific rules and results in activation of the adaptive part for additional sampling and processing. The most general innate part of our antispam system could be any other rules-based filter, or even a complete Bayesian filter, for example, though the latter can be viewed as an adaptive filter itself.
Other examples of rules that could be part of the innate system are: many hyperlinks to web pages; many hyperlinks to pictures to be included in the text; some letters being colored or capitalized, suggesting a possible message obtained by reading only these letters; many spaces and tabs present in the text, suggesting a special meaning of the positions of the letters and a possible message obtained by diagonal reading, and suggesting additional specific diagonal sampling that would take the tabs and spaces into account more precisely.
There are several reasons and goals for transforming the sampled text strings into a binary representation. First, in order to preserve privacy, it is important to hide the original text when exchanging information among the antispam systems. To achieve this, we use one-way hash functions when transforming a text string into its binary equivalent.
Second, it is important that the similarity of the strings, as it would be perceived by the reader, is kept as similarity of the corresponding binary patterns that is easy to compute and statistically confident. Similarity might mean a small Hamming distance, for example. Statistically confident means that the samples from unrelated emails should, with very high probability, have a similarity smaller than a given threshold, while the corresponding samples from different obfuscations of the same spam email, or from similar spam emails, should with high probability have a similarity above the threshold. “Corresponding” means that they cover similar spammy patterns (expressions or phrases) that exist in both emails.
Third, the binary representation should be efficient, i.e. it should compress the information contained in the text string and keep only what is relevant for comparing the similarity.
Last, but not least important, the binary representation should provide the possibility of generating random detectors that are difficult for spammers to anticipate and trick, even if the source code of the system is known to them.
To achieve the above-listed goals, we design the representation based on so-called similarity hashing. We use a method very similar to the one used by DCC.
The method is illustrated in the corresponding Figure.
It should be noticed that each trigram generated from the complete string, by using the sliding window and generating the predefined trigrams from that window, actually consists of three characters from the complete string taken at predefined positions. Any predefined set of trigrams can be used, but preferably the trigram characters are close to each other in the complete string, and the trigrams are taken uniformly from the complete string.
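A sketch of such a Bloom-like similarity hash follows; the window width of five, the particular trigram position sets, the 256-bit signature size, and the keying of the hash with the parameter p are all illustrative assumptions rather than the specified values:

```python
import hashlib

def similarity_signature(sample: str, num_bits: int = 256, p: int = 0) -> int:
    """Bloom-like signature: each trigram sets one bit; the bit mapping
    is keyed by the secret representation parameter p."""
    sig = 0
    W = 5                                        # sliding-window width
    POSITIONS = ((0, 1, 2), (0, 2, 4), (1, 3, 4))  # predefined trigrams
    for i in range(max(len(sample) - W + 1, 0)):
        window = sample[i:i + W]
        for pos in POSITIONS:
            trigram = "".join(window[j] for j in pos)
            digest = hashlib.sha256(f"{p}:{trigram}".encode()).digest()
            sig |= 1 << (int.from_bytes(digest[:4], "big") % num_bits)
    return sig

def common_bits(a: int, b: int) -> int:
    """Number of bits set in both signatures (a simple similarity score)."""
    return bin(a & b).count("1")
```

Because a trigram can only set a bit, never clear one, added text cannot erase the bits contributed by a spammy phrase.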
It should also be noticed that the use of the Bloom-like structure and the setting of the bits prevent text additions from deleting some of the bits of the spammy pattern. In contrast, with a method like DCC, which counts the number of hashes that point to each signature bit and then converts the highest scores to one and the lower ones to zero, it is possible to add text that will outweigh the spammy-phrase hashes and prevent them from showing up in the signature.
It should also be noticed that if the size of the signature is designed for low contention when setting the bits to one in the used Bloom structure, the loss of information is small and the similarity is better preserved, while good compression is still possible that prevents recreation of the original string and also uses the bits efficiently. Small information loss enables the conversion of the signatures from one hash-mapping to another hash-mapping that still keeps good similarity properties, which may be useful for exchanging information among different antispam systems that do not want to reveal their hash-mapping, i.e. their parameter p.
The binary representation enables two modes of collaboration among the different antispam systems and different levels of randomness of the detectors. One mode assumes that all the collaborating antispam systems have the same parameter p, which is simple and computationally cheap. Such a solution is more vulnerable to the parameter p becoming known to the spammer, but could safely be used if the collaborating antispam systems are controlled by the same people, for example the antispam service provider's people maintaining the antispam appliances for multiple organizations.
If the antispam system collaborates with other antispam systems that might get compromised by the spammer, the preferred mode is to have a different p value at each antispam system. As M is designed so that the number of bits that experience contention during the creation of a signature is small, a mapping exists from the signature produced using one value of p to a signature that is similar to the one produced using another value of p. Exchange of signatures with a collaborating antispam system, without revealing its own representation parameter p, is possible through a Diffie-Hellman-like algorithm that generates a third p value to be used for the exchange of the signatures.
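A Diffie-Hellman-style derivation of such a third p value could be sketched as follows; the small group parameters and the hash-based derivation are illustrative only, and a real deployment would use a standardized large prime group:

```python
import hashlib
import secrets

Q = 2**64 - 59  # a small prime modulus, for illustration only
G = 5           # generator

def dh_keypair():
    """Generate a private exponent and the public value to send."""
    a = secrets.randbelow(Q - 2) + 1
    return a, pow(G, a, Q)

def shared_exchange_p(own_secret: int, peer_public: int) -> int:
    """Both sides derive the same third p value, used only for the
    exchange of signatures, without revealing their own p."""
    s = pow(peer_public, own_secret, Q)
    return int.from_bytes(hashlib.sha256(s.to_bytes(8, "big")).digest()[:4],
                          "big")
```

Each system then converts its outgoing signatures from its private p-mapping to the agreed exchange mapping, so neither private p ever leaves its system.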
So each system may have and use its own parameter p, randomly generated upon startup of the system or regenerated later, which introduces a desirable randomness into the detectors at the Internet level.