This invention relates to a method of filtering sections of a data stream, in particular where the data stream comprises end point identifier data which has already been extracted from a more general data stream.
Specific examples are SPAM filtering based on email addresses and extraction of web data based on URLs; however, the method of filtering is applicable to any set of data defined by an end point identifier.
In the example of email, lookup against a dictionary of target addresses is important in SPAM filtering for rejecting mail from known SPAM agents. However, the process of email address lookup against a target database can prove to be a significant performance bottleneck. The principal reason for this performance bottleneck is the processing overhead associated with checking whether all email addresses extracted from a data sample are in a database of target email addresses. In reality the probability of obtaining a hit on the database with an arbitrary email address is <1%. Consequently, 99% of the lookup effort is spent rejecting potential items.
In accordance with the present invention, a method of filtering sections of a data stream comprises determining a set of characters of interest; testing each section of the data stream for the presence of one or more of the set of characters of interest; and extracting sections in which at least one of the characters is present.
The present invention reduces the number of occasions when a look up must be carried out by excluding from the look up stage those sections of the data stream which do not satisfy a minimum coincidence with characters in an end point identifier.
Preferably, the method further comprises determining a further set of characters of interest; testing for at least one character from the further set of characters in the portion of the data stream; and extracting sections in which at least one of the characters from the further set of characters is also present in the section.
The method can be continued through several iterations by setting additional character sets.
Preferably, the method comprises testing for the one or more of the set of characters in a predetermined order.
By requiring the characters to appear in a particular order, fewer incorrect extractions are made, but at a cost of an increased memory requirement whilst the extraction processing is undertaken.
Preferably, a skip function is applied, so that only predetermined characters in each section are tested against the set of characters of interest.
This allows testing of a specific character, such as the first or last character, without first testing all the characters leading up to that one.
Preferably, the first and last characters of a section are compared with the first and last characters of the set of characters of interest and the section extracted if there is a match.
This reduces the likelihood of an incorrect match, using well defined test characters.
Preferably, the method comprises determining additional sets of characters of interest and testing for one or more of the set of characters in more than one set.
By testing for different character sets in parallel, throughput is increased despite some less valid sections being extracted as a result.
Preferably, the section comprises an end point identifier, such as a domain name; an email address; a uniform resource locator; a telephone number; or a date and time.
The end point identifiers are not limited to these types, although they tend to be the most commonly searched ones, and the invention is equally applicable to filtering other types of end point identifier.
Preferably, the extracted sections are stored in a store.
Preferably, the extracted sections are input to a look up table and compared with specific stored end user identifiers; wherein sections which match the specific end user identifiers are stored and those which do not match are discarded.
An example of a method of filtering sections of a data stream will now be described with reference to the accompanying drawings in which:
a illustrates a compressed domain name state machine;
b illustrates a rolled out monogram domain name state machine for the email address big@bird.com;
Thus, the present invention reduces this effort by reducing the number of items that are presented to the database for the look up comparison by filtering the extracted end point identifiers based on the contents of a target dictionary of the database. Another option is to combine the extraction and filtering stages (2, 5), then store (3) the extracted, filtered data, or pass it directly to the comparison stage (6), so that look up is only performed when the likelihood of success is increased.
A first example of a method according to the present invention describes filtering a data stream for an email address from a specific service provider before the comparison stage where a check for individual email addresses can be made. Although the example relates specifically to email addresses the techniques employed can be used to identify many other structured forms of data, such as URI/URL identification, domain names, telephone numbers, dates and times. Other examples are Session Initiation Protocol (SIP) URI identification; E.164 telephone number detection; Tag detection in other data formats; IP addresses, port range, protocol and session identifier detection; xml data structures, xml objects; HTML structures and objects; and detection of content types and identification of content from packet payloads.
Conventionally, all email addresses in a sample were looked up one by one. This has the drawback that around 99% of the addresses presented to the lookup phase are not in the dictionary. Thus, the lookup algorithm spends most of its time rejecting potential matches. Email address lookup is a two stage process involving first extraction and then comparison of the extracted email against a database containing target email addresses. Extraction of email addresses can be carried out using any conventional method, which typically uses the character set defined by the standards for identifying an email address. The present invention effectively decreases the number of comparisons that must be made against the database by performing a filtering stage as part of the email address extraction phase, or by filtering data which has been previously extracted before the look up stage is carried out.
However, for the comparison, or lookup, phase only specific email addresses are required. The set of characters used to form a set of targeted email addresses is highly unlikely to cover the complete range of allowable characters defined by the standards. Consequently, the method of the present invention requires only that some or all of the character set defined by the target dictionary containing the target email addresses is recognised, rather than that defined by the standards. In this scenario the absence of the full spectrum of valid characters in the user and domain name parts means that either fewer email addresses are identified by the extraction algorithm, if the filtering is incorporated into this, or fewer are passed from the extracted email addresses to the look-up stage. However, supporting the addresses defined in the dictionary also ensures that those addresses of interest are successfully identified so they can be passed on to the lookup stage. This reduces the number of items that are passed on to the full lookup phase and thus speeds up the overall pipeline of extraction and lookup.
In general the present invention restricts the set of characters that can appear in an email address to those that appear in the target dictionary of the look up stage. Any detection algorithm based on this restricted set then provides enhanced performance as the probability of finding an email address composed of the restricted set is less than the probability of finding an email address composed of the full set. Thus, for an arbitrary sample fewer instances are passed on to the full lookup stage which enhances the overall performance. This methodology is particularly useful when only looking for a small number of domains e.g. roke.co.uk, where there may be an increase in throughput in the extraction phase of about 20%.
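The restricted character set filter can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function names, the sample dictionary and the sample candidates are all hypothetical, and the sketch assumes that a candidate passes only if every one of its characters appears somewhere in the target dictionary.

```python
def build_charset(dictionary):
    """Collect every character that appears in the target dictionary."""
    return set("".join(dictionary))


def passes_charset_filter(candidate, charset):
    """True only if every character of the candidate is in the restricted set."""
    return all(ch in charset for ch in candidate)


# Hypothetical target dictionary and hypothetical extracted candidates.
targets = {"big@bird.com", "bag@lard.com"}
charset = build_charset(targets)
candidates = ["big@bird.com", "alice@example.org", "bob@bird.com", "dig@bird.com"]

# Only candidates built entirely from dictionary characters reach the lookup.
to_lookup = [c for c in candidates if passes_charset_filter(c, charset)]

# The full lookup is then performed on the reduced list only.
hits = [c for c in to_lookup if c in targets]
```

In this sketch alice@example.org is rejected before the lookup because it contains characters absent from the dictionary, while bob@bird.com and dig@bird.com still pass the filter and are rejected only by the lookup itself. The filter reduces the lookup load but does not replace the lookup.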
State machines may be generated that recognise the domain and user name sections, so that the structure of the dictionary entries is incorporated into the identification and extraction phase, in a similar manner to that described in our co-pending application Reference 2006P23116. In this instance the range of characters in the set Chd is defined by the range of characters found in the target dictionary, rather than the complete range of valid domain name characters. As all emails include a domain name, the filtering of emails is done in accordance with the domain name example given below. An illustration of the state machines for domain name identification is shown in
In
@b ir d. co mX<< @any.any.any etc
The increased number of states means that fewer candidate email addresses will be passed on to the look up stage which enhances the throughput of the extraction/lookup pipeline.
In its simplest form multiple addresses are represented by adding additional outgoing paths to each node in the state machine as illustrated in
In effect the resulting state machine says find email addresses whose domain name is prefixed with the sequences in the set {bi, ba, li, la} and where the remaining suffix contains the characters in the set {rd.com}, i.e. the end of the email address is compressed into a reduced number of states and the prefix is expanded to give enhanced filtering. This approach effectively limits the amount of branching that can occur and simplifies the implementation of the approach. It is possible to incorporate the character based filtering described above and enhance it with a filter based on the structure (the sequence of the characters in the dictionary email addresses) of the target addresses contained in the dictionary.
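A minimal sketch of this prefix and suffix test is given below. It assumes a two-character prefix taken from the set {bi, ba, li, la} and a remainder drawn from the characters of rd.com; the function name domain_passes and the use of a simple string split in place of an explicit state machine are illustrative assumptions.

```python
PREFIXES = {"bi", "ba", "li", "la"}   # expanded two-character domain prefixes
SUFFIX_CHARS = set("rd.com")          # compressed remainder: any of these characters


def domain_passes(address):
    """Apply the prefix/suffix filter to the domain part of an address."""
    user, sep, domain = address.partition("@")
    if not sep or len(domain) <= 2:
        return False                  # no '@' or nothing after the prefix
    prefix, rest = domain[:2], domain[2:]
    return prefix in PREFIXES and all(ch in SUFFIX_CHARS for ch in rest)
```

With these sets, big@bird.com and big@lard.com pass, big@bond.com fails on the prefix, and big@birdseed.com fails on the suffix characters. An address such as big@lird.com also passes, because the prefix and suffix are tested independently of each other.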
Another option is to replace a lookup over the first four characters, as described above, with a hash over the first 4 or more characters of the email address as shown in
The state machine based pre-filtering method described previously does not make full use of the underlying structure contained by the email addresses in the target dictionary. In particular the mapping of several edges to a single vertex allows emails such as: big@lird.com to defeat the filter. This deficiency can be addressed by adding additional vertices to the compressed or full form of the state machine. This is illustrated for the domain name state machine in
The known method of ‘path compression’ can also be applied to the above approach. Although this would still require a follow on look up stage the advantage of this method is that it would greatly reduce the amount of memory required to represent the set of dictionary email addresses. A path compressed version of the state machine shown in
This modification saves memory by removing the internal nodes and minimises the number of characters that need to be looked at to determine if the email address is worth looking up. This modification greatly reduces the number of comparisons that need to be made in the pre-filter and also significantly improves the pre-filtering for large dictionaries. Use of this method is expected to reject a large number of the candidate email addresses before they are looked up.
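The idea of looking at only a few characters can be illustrated with a skip-style test that checks just the first and last characters of a domain name. This is a simplified stand-in for the skip idea, not the path-compressed state machine itself; the function names and the sample domains are hypothetical.

```python
def build_skip_keys(domains):
    """Record the (first character, last character) pair of each target domain."""
    return {(d[0], d[-1]) for d in domains if d}


def skip_filter(domain, keys):
    """Test only two predetermined positions; interior characters are skipped."""
    return bool(domain) and (domain[0], domain[-1]) in keys


# Hypothetical target domains.
keys = build_skip_keys({"bird.com", "roke.co.uk"})
```

Here skip_filter("bird.com", keys) is true and skip_filter("example.org", keys) is false after only two character comparisons. A domain such as bard.com also passes, so a follow on look up stage is still required.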
The present invention introduces a filtering stage into the email address extraction phase in order to avoid unnecessary comparisons against the target email address database, or as a subsequent step before the comparison stage, so increasing the overall performance of the extraction lookup pipeline by reducing the number of items that are passed on to the lookup stage. As the lookup stage is the slowest point in the pipeline reducing the number of times it is called upon effectively increases the overall throughput.
The filtering can be achieved in a number of ways including filtering based on a restricted character set, which increases performance by utilising the reduced character set defined by the dictionary entries when compared to the set defined in the standards. The reduced set means that hitting an email address is less likely. Alternatively, filtering based on a restricted character set is combined with using the structure of the items in the target dictionary. The addition of structure to the filter means that hitting an email address is less likely still. The structured filtering may use a state machine, hashing, or a tree structure. The tree structure can also be applied to the combined lookup and extraction algorithm, so doing away with the lookup stage altogether and performing the extraction and lookup simultaneously. Filtering based on trees with skip vertices reduces the memory overhead of implementing the tree based approach, whilst providing similar if not better statistics for email hit rate as the character and structure based approaches.
Another example of filtering data according to the present invention, illustrated in
<title> roke </title>
the pair of labels are:
<title> and </title>
The labels are separated by a sequence of characters from the set [roke]. Starting at 70 the sequence <title> 71 takes the search to point 72. At point 72 the characters [roke] (73) loop the search back to point 72. At point 72 the symbols in the set !([roke])!(</title>) 76 take the search to point 77, i.e. the search fails. At point 72 the sequence </title> 74 takes the search to point 75 and ends the search. The identification of the pair of sequences <title> </title> identifies a page title between them. In this instance only titles that contain the characters [roke] in any combination will be extracted.
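The search between points 70 and 77 can be sketched as a small loop. This is an illustration under stated assumptions: the function name extract_title is hypothetical, the input is assumed to begin at the opening label, and the spaces shown around roke in the example are treated as typographical, so only the characters r, o, k and e are accepted between the labels.

```python
OPEN, CLOSE = "<title>", "</title>"
ALLOWED = set("roke")


def extract_title(text):
    """Return the title if it consists only of allowed characters, else None."""
    if not text.startswith(OPEN):          # 70 -> 72 requires the opening label
        return None
    start = pos = len(OPEN)
    while pos < len(text):
        if text.startswith(CLOSE, pos):    # 72 -> 75: closing label ends the search
            return text[start:pos]
        if text[pos] not in ALLOWED:       # 72 -> 77: the search fails
            return None
        pos += 1                           # 72 -> 72: loop on an allowed character
    return None                            # input ended before the closing label
```

For example, extract_title("<title>roke</title>") returns the string roke, whereas a title containing any other character, such as rock, causes the search to fail.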
href=“http://www.roke.co.uk”
and in this case the pair of labels are:
href=“http:// and ”
The labels are separated by a sequence of characters from the valid set of characters [rokecuw.]; these characters are defined by the set ChURL. Starting at point 78, the sequence href=“http:// 79 takes the search to point 80. From point 80 a symbol from the set ChURL (the set of valid URL characters) 82 takes the search to point 85. From point 80 a symbol that is not in the set ChURL (!ChURL) 81 takes the search to point 83 and the search fails. From point 85 a valid URL character 86 loops the search back to point 85. From point 85 an invalid URL character 84 (i.e. not in the set [rokecuw.]) results in failure 83. From point 85 the quote character 87 takes the search to point 88, at which point a valid hyperlink has been found and can be extracted. This instance will only identify hyperlinks that have URLs based on the character set [rokecuw.] and no others.
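The hyperlink search can be sketched in the same way. The function name extract_link is hypothetical, the input is assumed to begin at the opening label, and plain double quotes are used in place of the typographic quotes printed in the example.

```python
LABEL = 'href="http://'
CH_URL = set("rokecuw.")               # restricted set of URL characters


def extract_link(text):
    """Return the URL body if it uses only characters from CH_URL, else None."""
    if not text.startswith(LABEL):     # 78 -> 80 requires the opening label
        return None
    start = pos = len(LABEL)
    # 80 -> 85 needs one valid character; 85 -> 85 loops on further ones.
    while pos < len(text) and text[pos] in CH_URL:
        pos += 1
    if pos == start:                   # 80 -> 83: first symbol not in the set
        return None
    if pos < len(text) and text[pos] == '"':
        return text[start:pos]         # 85 -> 88: closing quote found
    return None                        # 85 -> 83: invalid character or no quote
```

For example, extract_link('href="http://www.roke.co.uk"') returns www.roke.co.uk, while a link to www.example.com is rejected at the first character outside the set.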
Jan 01 2008 SPACE 10:20:22
In this case a bridge character is needed to link the date and time parts. A suitable bridge is the SPACE character after the year. Starting at point 89, the month Jan 90 moves the search to point 91. From point 91 any character 92 takes the search to point 93. At point 93 any character loops the search back to point 93. At point 93 the SPACE character 95 takes the search to point 96. At point 96 any character 97 takes the search to point 98 and at point 98 any character 99 loops the search back to point 98. At point 98 the sequence :22 100 completes the search 101. This instance identifies and filters out any date starting January that appears with a time ending in 22 seconds.
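For a section that has already been delimited, the date and time search is equivalent to a short regular expression. In the sketch below a regular expression stands in for the state machine; the pattern, the function name matches_date_time and the assumption that the whole section must match are illustrative choices.

```python
import re

# "Jan", one or more characters, the SPACE bridge, one or more characters, ":22".
DATE_TIME = re.compile(r"Jan.+ .+:22")


def matches_date_time(section):
    """True if the whole section starts with Jan and ends with :22."""
    return DATE_TIME.fullmatch(section) is not None
```

With this pattern, Jan 01 2008 10:20:22 is accepted, while Feb 01 2008 10:20:22 and Jan 01 2008 10:20:23 are rejected. Because the loops accept any character, the filter is deliberately loose and a later comparison stage is still assumed.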
The foregoing disclosure has been set forth merely to illustrate the invention and is not intended to be limiting. Since modifications of the disclosed embodiments incorporating the spirit and substance of the invention may occur to persons skilled in the art, the invention should be construed to include everything within the scope of the appended claims and equivalents thereof.
Number | Date | Country | Kind |
---|---|---|---|
0700926.9 | Jan 2007 | GB | national |
0700928.5 | Jan 2007 | GB | national |
This application is a continuation of PCT International Application No. PCT/GB2008/000172, filed Jan. 18, 2008, which claims priority under 35 U.S.C. § 119 to Great Britain Patent Application No. 0700926.9, filed Jan. 18, 2007, and Great Britain Patent Application No. 0700928.5, filed Jan. 18, 2007, the entire disclosures of the aforementioned applications are herein expressly incorporated by reference. The present application is also related to U.S. patent application Ser. No. ______, entitled “A Method of Extracting Sections of a Data Stream” and filed on even date herewith, which is a continuation of PCT International Application No. PCT/GB2008/000184, filed Jan. 18, 2008, which claims priority under 35 U.S.C. § 119 to Great Britain Patent Application No. 0700926.9, filed Jan. 18, 2007, and Great Britain Patent Application No. 0700928.5, filed Jan. 18, 2007, the entire disclosures of the aforementioned applications are herein expressly incorporated by reference.
 | Number | Date | Country |
---|---|---|---|
Parent | PCT/GB2008/000172 | Jan 2008 | US |
Child | 12505179 | | US |