The invention relates to electronic communications and, in particular, to classification of electronic messages into categories.
Electronic messages, such as email, instant messages, and web pages, are increasingly used to deliver information. Electronic messages that are predominantly text are relatively easy to categorize using simple pattern matching or Bayesian analysis. This categorization is very important in the detection of unwanted inbound messages (e.g. spam) and is increasingly important in the detection of unwanted or unauthorized transmission of confidential, proprietary, or inappropriate information in outbound messages.
It is possible to hide information from casual analysis, such as by typical spam filters, by placing it within images, such as in the form of digitized text.
This technique is increasingly used by purveyors of spam to cause their unwanted messages to defeat spam filters and reach their targets. An existing, straightforward approach for automatic categorization of messages containing digitized text in images is to convert the images into text using optical character recognition techniques and then to apply a text recognition or categorization technique, such as, for example, pattern matching or Bayesian analysis, to the resulting text. This approach typically does not work well because the error rate in character recognition is unacceptably high. What has been needed, therefore, is a system for analyzing images containing text that allows the messages containing the images to be accurately categorized without the need to extract the exact content of the text.
In a method and system for categorizing electronic messages based on an analysis of the images within them, a robust message categorization occurs even when the text in the images cannot be reliably extracted. In one aspect, the present invention extracts information about potential text areas in the image. This information is then represented as a series of bounding polygons that circumscribe the regions of the image that contain text. Descriptive information and statistics are extracted from the set of bounding polygons and a set of textual representations suitable for pattern matching or Bayesian analysis is produced. The derived categorization may then be used to drive spam detection and/or security/policy engines.
Given a set of preclassified messages and their accompanying images, a suitable text representation may be computed to drive the training of a probabilistic classifier. Scores and/or rules that are produced using other message analysis techniques may also be utilized in the present invention, either as an alternative to values obtained using the tokenization method or in combination with them.
In one aspect, the present invention is a method for classifying electronic messages containing images. The method includes the steps of determining at least one bounding polygon for a region that is likely to contain text in an image in an electronic message, extracting at least one item of descriptive information from the bounding polygon, producing at least one textual representation of the region that is likely to contain text, and classifying at least one message utilizing the textual representation. In another aspect, the present invention is an electronic message classifier, the classifier deriving at least one piece of information from at least one textual token for use in making a probabilistic classification of the electronic message, the textual token being derived from at least one description of at least one derivable property of an image accompanying the electronic message.
The present invention is a method and system for categorizing messages based on an analysis of the images within them. The present invention uses preliminary means to extract information about potential text areas in the image. This information is then represented as a series of bounding polygons that circumscribe the regions of the image that contain text. The present invention therefore allows a robust message categorization to occur, even when the text in the images cannot be reliably extracted. The derived categorization can then be used to drive, for example, but not limited to, a spam detection engine (for inbound messages) and/or a security/policy engine (for outbound messages).
The first step in the method of the present invention is to analyze an image and determine bounding polygons for regions that probably contain text.
In one embodiment of the method of the present invention, a bounding polygon for the text in the image is found using technical means.
In this embodiment, bounding polygon 200 and coordinate information 210, 220, 230, 240 are then used to derive descriptions that can be either pattern matched or subjected to Bayesian analysis, support vector analysis, neural network analysis, or any other means of discrimination known in the art that is based on automatic learning from sets of example data. To start, polygon 200, and any other polygons found in the image, are described in a straightforward text format. Table 1 depicts the text representation of bounding polygon 200 for the example image of
The description of Table 1 may then be subjected to one or more analysis methods.
In another embodiment of the present invention, the text regions within an image may be identified using an analysis program. As an example,
Other methods of representing the results of the text region analysis are also suitable for use in the present invention, and any other systematic form of representation known in the art would also be suitable.
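By way of illustration only, one such systematic representation might be produced as in the following minimal Python sketch. The line format, the function names `describe_polygon` and `describe_polygons`, and the example coordinates are assumptions made for this sketch; they are not the format of Table 1 or the output of any particular analysis program.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def describe_polygon(index: int, vertices: List[Point]) -> str:
    """Render one bounding polygon as a single line of text.

    The layout (index, vertex count, then coordinate pairs) is a
    hypothetical format chosen for illustration.
    """
    coords = " ".join(f"({x},{y})" for x, y in vertices)
    return f"polygon {index} vertices {len(vertices)} {coords}"


def describe_polygons(polygons: List[List[Point]]) -> List[str]:
    """Describe every polygon found in an image, one line per polygon."""
    return [describe_polygon(i, p) for i, p in enumerate(polygons)]


if __name__ == "__main__":
    # Hypothetical rectangle around one text region, in pixel coordinates.
    example = [[(10, 20), (110, 20), (110, 50), (10, 50)]]
    for line in describe_polygons(example):
        print(line)
```

A line-oriented description of this kind can be handed directly to a pattern matcher, or split on whitespace to feed a learning-based classifier.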
In one embodiment of the present invention, the next step is to extract descriptive information and statistics from the previously derived set of bounding polygons. From the bounding polygons, it is then straightforward to compute a set of numerical features, such as:
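By way of illustration only, numerical features of this kind might be computed as in the following minimal Python sketch. The particular features chosen (polygon count, total text area, largest polygon area, and fraction of the image covered) and all names are assumptions made for this sketch. Polygon area is computed with the standard shoelace formula, which assumes simple (non-self-intersecting) polygons; overlapping polygons would be counted twice in the total.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def polygon_area(vertices: List[Point]) -> float:
    """Area of a simple polygon, computed with the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_features(
    polygons: List[List[Point]], image_width: int, image_height: int
) -> Dict[str, float]:
    """Compute a few hypothetical statistics from a set of bounding polygons."""
    areas = [polygon_area(p) for p in polygons]
    image_area = float(image_width * image_height)
    text_area = sum(areas)
    return {
        "polygon_count": float(len(polygons)),
        "text_area": text_area,
        "max_polygon_area": max(areas) if areas else 0.0,
        # Guard against a zero-sized image rather than dividing by zero.
        "text_area_fraction": text_area / image_area if image_area else 0.0,
    }
```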
In a preferred embodiment of the present invention, the next step is to produce a set of textual representations suitable for pattern matching or Bayesian analysis. As shown in the sample code provided in Table 4, in this step, the image statistics calculated in the previous step are converted, using simple text formatting, into text tokens that can be used in a conventional pattern matching or tokenization engine. Any formatting method that preserves the nature of the feature being described and the numerical value as part of a single token is suitable for use in the present invention. The log2 and log10 conversions of the quantities derived are particularly appropriate because they reduce the number of distinct tokens generated and capture the sense that differences between small numbers are more significant than the same absolute differences between large numbers.
In the example shown in Table 3, which is derived from the image of
Other methods of representing the tokenization are also possible and suitable for use in the present invention, and any other systematic form of representation known in the art would also be suitable.
Given a set of preclassified messages and their accompanying images, it is straightforward to compute a suitable text representation to drive the training of a probabilistic classifier. Such computation can be performed in any ordinary programming language, although the currently preferred embodiment is in Python. Additional programming languages that would be highly suitable include Perl, Java, C++, Lisp, Visual Basic, and C#, but any other such language known in the art could also be employed. An example script for computing a training set of tokens from precategorized messages is shown in Table 4, which is a Python script that produces a set of textual descriptions suitable for Bayesian analysis from a set of bounding polygons in images.
The tokens generated by this process can be treated in the same way that any text is treated. In a preferred embodiment, the tokens are used as input to a Bayesian classification engine in order to provide for discrimination between spam and non-spam messages and/or to provide for detection of, and discrimination between, confidential, proprietary, or other messages that may be restricted by organizational, legal, or personal policy.
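As a hedged illustration of how such tokens might drive a Bayesian engine, the following Python sketch implements a small multinomial naive Bayes classifier with add-one smoothing. It is a generic textbook technique substituted here for illustration, not the classification engine of the preferred embodiment; the class name `TokenBayes`, the labels, and the example tokens are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set


class TokenBayes:
    """Tiny multinomial naive Bayes over string tokens (add-one smoothing)."""

    def __init__(self) -> None:
        self.token_counts: Dict[str, Counter] = {}  # label -> token counts
        self.message_counts: Counter = Counter()    # label -> messages seen
        self.vocabulary: Set[str] = set()

    def train(self, label: str, tokens: Iterable[str]) -> None:
        """Record the tokens of one preclassified message."""
        counts = self.token_counts.setdefault(label, Counter())
        self.message_counts[label] += 1
        for token in tokens:
            counts[token] += 1
            self.vocabulary.add(token)

    def log_score(self, label: str, tokens: List[str]) -> float:
        """Log prior plus summed log likelihoods of the tokens under a label."""
        counts = self.token_counts[label]
        total = sum(counts.values())
        vocab = len(self.vocabulary)
        prior = self.message_counts[label] / sum(self.message_counts.values())
        score = math.log(prior)
        for token in tokens:
            score += math.log((counts[token] + 1) / (total + vocab))
        return score

    def classify(self, tokens: List[str]) -> str:
        """Return the label with the highest log score."""
        return max(self.token_counts, key=lambda lb: self.log_score(lb, tokens))


if __name__ == "__main__":
    clf = TokenBayes()
    clf.train("spam", ["img_text_area_log2_11", "img_polygon_count_log2_0"])
    clf.train("ham", ["img_text_area_log2_5", "img_polygon_count_log2_3"])
    print(clf.classify(["img_text_area_log2_11"]))
```

Because the image-derived tokens are ordinary strings, they can be mixed freely with tokens drawn from the message body or headers before training.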
As shown in
Alternatively, or in addition, the invention may employ an estimation of the information entropy of the message, obtained using a compression or other algorithm, such as by calculating the ratio of the compressed and uncompressed sizes of a file. The classifier of the present invention may also, or alternatively, employ values derived from measurement of the header information for the image and/or from properties of inaccurate information found in the header information. In particular, the detection of a file whose content does not match that indicated by its mime type and/or extension could signal either a mistake or an intention to deceive a classifier.
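Both measurements can be sketched briefly in Python. This is a minimal illustration under stated assumptions: zlib is used as the compression algorithm, the table of file signatures covers only PNG, JPEG, and GIF, and the function names are hypothetical. A ratio near 1 suggests data that is already compressed or otherwise high in entropy, while a ratio near 0 suggests highly redundant data.

```python
import zlib
from typing import Optional


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by uncompressed size, as a rough entropy proxy."""
    if not data:
        return 0.0
    return len(zlib.compress(data, 9)) / len(data)


# Leading bytes ("magic numbers") of a few common image formats.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Guess a MIME type from the leading bytes, or None if unrecognized."""
    for magic, mime in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return mime
    return None


def mime_mismatch(data: bytes, declared_mime: str) -> bool:
    """True when the content is recognized and disagrees with the declared type."""
    sniffed = sniff_mime_type(data)
    # Drop any parameters such as '; name="x.png"' before comparing.
    declared = declared_mime.split(";")[0].strip().lower()
    return sniffed is not None and sniffed != declared
```

Either result can be turned into a text token (for example, a bucketed ratio or a mismatch flag) and passed to the classifier alongside the image-derived tokens.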
Information related to other aspects of the message may also be advantageously employed by the classifier of the present invention. This includes, but is not limited to: metadata, such as author, copyright, format, extension, filename, file size, creation date/age, modification date/age, encryption (y/n, scheme), and opacity (foreign language, rot13); information from or associated with the message header, such as the header content, packaging (amount (number and length) of information contained in header fields), routing (number and depth of nested messages), and shipping (number of addresses and/or domains); URLs within the message text (existence, type, content); the length, frequencies, and sampling rates of audio files; the language and length of source code files; the length of video files; the complexity of markup files; and various parameters derivable from computer files, such as program files and data files.
Sys module 930 comprises the services and libraries necessary to support the chosen programming language. In the preferred embodiments, these are provided by the standard Python runtime library, but could be easily replaced in Python or replicated for other languages by a practitioner versed in the ordinary state of the art. OS module 940 comprises the core operating services and libraries necessary to allow application software to run on the chosen computational platform. Examples of commonly available and suitable platforms include Windows 98, ME, NT, XP, Server 2003, and other Microsoft operating systems; Linux, Unix, and other POSIX compatible operating systems; embedded operating systems such as Symbian, Savaje, or VxWorks; and other systems suitable to support Sys module 930. While a preferred software embodiment is disclosed, many other implementations will occur to one of ordinary skill in the art and are all within the scope of the invention.
The present invention therefore provides a system for analyzing images containing text that allows the messages containing the images to be accurately categorized without the need to extract the exact content of the text. Each of the various embodiments described above may be combined with other described embodiments in order to provide multiple features. Furthermore, while the foregoing describes a number of separate embodiments of the apparatus and method of the present invention, what has been described herein is merely illustrative of the application of the principles of the present invention. Other arrangements, methods, modifications, and substitutions by one of ordinary skill in the art are therefore also considered to be within the scope of the present invention, which is not to be limited except by the claims that follow.
This application claims the benefit of U.S. Provisional Application Ser. No. 60/652,947, filed Feb. 14, 2005, the entire disclosure of which is herein incorporated by reference.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US2006/005255 | 2/14/2006 | WO | 00 | 8/14/2007 |
Number | Date | Country | |
---|---|---|---|
60652947 | Feb 2005 | US |