Documents may be classified as being members of one or more groups or classes using a number of probabilistic techniques based on the textual content and semantics of each document. These types of classifications are often made based on the presence of specific words that are observed in documents belonging to the class. For example, if an email message contains the words “Nigeria” and “million,” these facts may contribute to the probability that the message is junk mail.
Such classifications may not work as well with email messages and other documents of more nuanced classes, such as “marketing.” While marketing emails may be identifiable via the presence of words such as “coupon,” “promotion,” or “newsletter,” many emails that should be classified as “marketing” often contain nothing other than hyperlinked images, thus defying text-based semantic classification. Other types of text-bearing marketing emails generated by retailers, particularly those operating online, may include a grid of products, where the individual products change each time the email is sent out. If the name of a given product is only observed in a single email, for example, a traditional classifier would not have prior context with which to identify the email with the “marketing” classification.
It is with respect to these and other considerations that the disclosure made herein is presented.
The following detailed description is directed to technologies for classifying structured documents based on the structure of the document. Utilizing the technologies described herein, documents may be classified or categorized based on their structure, such as an HTML node hierarchy, rather than textual content and/or semantics. Classifying documents based on their structure allows documents of a similar type to be identified, irrespective of content. This may be useful for classifying instances of documents produced from a particular template or using a common toolset in which the content changes between each observed instance, such as marketing newsletters featuring a grid of N products, for example. This may also be useful for classifying documents that would otherwise be difficult to analyze due to lack of semantic text in the document, such as a marketing email containing only hyperlinked images. Further, this technique may be used to identify documents that bear similarities despite widely varied textual content, like promotional documents generated from similar templates but in different languages.
According to embodiments, a structured document is received, and the structural elements are parsed from the document to generate a text string representing the structure of the document instead of the semantic textual content of the document. The text string may be broken into N-grams utilizing a sliding window, and a classifier trained from similar structured documents labeled as belonging to one of a number of document classes is utilized to determine a probability that the document belongs to each of the document classes based on the N-grams.
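The parsing step described above can be sketched as follows. This is a minimal illustration only, assuming an HTML document and using Python's standard `html.parser` module; the class name `StructureExtractor`, the function `structure_string`, and the choice to write closing tags as `/tag` tokens are assumptions made for this example and are not taken from the disclosure.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Collects tag names only, ignoring text content and attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as href or src are content-specific, so they are dropped.
        self.tokens.append(tag)

    def handle_endtag(self, tag):
        # Assumed convention: a closing tag becomes the "word" "/tag".
        self.tokens.append("/" + tag)


def structure_string(html):
    """Return a space-separated string of opening and closing tag names."""
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.tokens)


# Two messages with different content but the same layout.
first = "<table><tr><td><a href='/p/1'><img src='shoe.png'></a></td></tr></table>"
second = "<table><tr><td><a href='/p/9'><img src='lamp.png'></a></td></tr></table>"

print(structure_string(first))
print(structure_string(first) == structure_string(second))
```

Because only tag names are kept, the two example messages produce the same text string even though their links and images differ.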
It should be appreciated that the subject matter presented herein may be implemented as a computer process, a computer-controlled apparatus, a computing system, or an article of manufacture, such as a computer-readable storage medium. These and various other features will become apparent from a reading of the following disclosure and a review of the associated drawings.
While the subject matter described herein is presented in the general context of program modules that execute on one or more computing devices, those skilled in the art will recognize that other implementations may be performed in combination with other types of program modules. Generally, program modules include routines, programs, components, data structures, and other types of structures that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the subject matter described herein may be practiced on or in conjunction with other computer system configurations beyond those described below, including multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, handheld computers, personal digital assistants, electronic book readers, wireless telephone devices, special-purposed hardware devices, network appliances, or the like. The embodiments described herein may also be practiced in distributed computing environments, where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
In the following detailed description, references are made to the accompanying drawings that form a part hereof, and that show, by way of illustration, specific embodiments or examples. The drawings herein are not drawn to scale. Like numerals represent like elements throughout the several figures.
A document classification module 104 executes in the computer system 102. According to embodiments, the document classification module 104 classifies or categorizes documents, such as document 106, based on the structure of the document, as will be described in more detail below. The document classification module 104 may execute on a single virtual machine, server, or other computing device in the computer system 102, or the document classification module 104 may execute in parallel across multiple virtual machines, servers, or other computing devices. In addition, the document classification module 104 may comprise a number of subcomponents executing on different virtual machines, servers, or other computing devices in the computer system 102. The document classification module 104 may be implemented as software, hardware, or any combination of the two.
The document 106 may represent any structured document to be classified, such as an HTML or XML-based email message, a Web page, an XML file, a Portable Document Format (“PDF”) file, an application document, and the like. In some embodiments, the document 106 represents an HTML-based email message containing a periodic promotional newsletter or advertisement, an order confirmation, a shipping or delivery confirmation for an order, or the like. It will be appreciated that documents containing marketing information, such as promotional newsletters, or order-related information, such as order and shipping confirmations, are likely generated by the organization creating the document using a template that defines the overall document structure, with specific product or order information injected into specific, defined locations in the template.
According to embodiments, the document classification module 104 utilizes a machine learning technique known as statistical classification to determine a class or category of documents to which the document 106 belongs based on observed features in the document and features observed in documents known to belong to defined classes of documents. The document classification module 104 utilizes a classifier 108, such as a naïve Bayes classifier, to calculate the probability of membership in a class for the document 106 given the likelihood of observing structural features observed for that class. Bayesian classifiers may be used to make binary decisions about a particular document 106, such as “is this document junk mail,” “is this document related to financial fraud,” “is the tone of this document angry,” and the like. The document classification module 104 may maintain classifier data 110 that supports the algorithm of the classifier 108. The classifier data 110 may be stored in a file system of the computer system 102 or in a data storage system, such as a database, accessible to the computer system, for example.
In additional embodiments, the document classification module 104 may further train the classifier 108 utilizing a number of training documents 112 that have already been labeled with a document class or category. In some embodiments, the document classification module 104 may initially group the training documents into possible classifications using content-related techniques. For example, the document classification module 104 may utilize regular expressions to locate order IDs in training documents 112 containing order confirmation email messages or tracking IDs in training documents containing shipment or delivery confirmations. Similarly, the document classification module 104 may search for known content in the training documents 112 in order to initially classify the training documents, such as links to specific websites or webpages. Alternatively or additionally, the training documents 112 may be manually labeled by administrators of the computer system 102 or through crowd-sourcing techniques, for example, before being used by the document classification module 104 to train the classifier 108.
As described above, the document 106A may be generated from a template, with specific product or marketing information injected into the defined areas 202, 204, and 208 of the template. For example,
In order to classify the document 106 or to train the classifier 108, a text string may be generated from the document representing the overall structure of the document. For example, the document classification module 104 may extract a text string from the markup of the HTML-based email message shown in
The document classification module 104 may then utilize the classifier 108 to classify the document 106 from the text string. Because the “vocabulary” of HTML tags that indicate document structure is small, however, independent observations regarding the presence or absence of a given HTML tag may not be sufficient to identify membership of the document 106 in a particular class or category of documents. For example, there may be no HTML tags which occur only in marketing documents. Accordingly, the document classification module 104 may apply additional techniques to the text classification to improve the accuracy of the classifier 108, according to some embodiments. For example, the document classification module 104 may break up the text string into tuples of some number N of words, referred to as “N-grams,” using a sliding window, and apply the classifier 108 to the extracted N-grams. Utilizing N-grams in the classification may provide context to the extracted structural elements. For example, “Western Union” occurring as a 2-word tuple, or “bigram,” carries a much higher probability of being associated with fraud than the bigrams “Western European” or “European Union.” Combining the words in the order that they occur in the text string provides the semantic context.
Using N-grams comprising sequences of HTML tags extracted from the text string representing the document structure provides for more accurate comparison/classification of documents. In addition, utilizing N-grams may provide better performance than a straight comparison of each individual word or HTML tag in the document 106, since documents generated from the same template may bear only slight modifications from instance to instance. For example, one marketing message may contain a 4×3 product grid. Another marketing message from the same author may contain a 4×7 product grid, where 4 more rows of products have been added. In addition, documents of a specific type may contain similar features even when produced by different authors or from different templates. If there are similar characteristics employed by the designers of a given type of document template in general, performing comparisons across document fragments comprising N-grams may allow the system to correctly identify similar documents from new authors.
It will be appreciated that an optimal value for N representing the number of words in the tuples extracted from the text string may depend on the corpus of documents being classified, and may be determined from experimentation. In addition, N-grams of two or more different lengths, such as 3 words and 5 words, may be extracted from the same text string and utilized independently or in conjunction by the classifier 108. It will be further appreciated that other methods of grouping of the structural elements may be utilized by the document classification module 104 beyond the N-grams described herein.
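The sliding-window extraction of N-grams, including extraction at more than one length from the same text string, might look like the sketch below. The function names `ngrams` and `multi_ngrams` and the default lengths of 3 and 5 words are illustrative assumptions; a suitable N would be found by experimentation, as noted above.

```python
def ngrams(text, n):
    """Slide a window of n words across the text and return each tuple."""
    words = text.split()
    # When the text has fewer than n words the range is empty, so no tuples result.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def multi_ngrams(text, sizes=(3, 5)):
    """Extract N-grams of several lengths from the same text string."""
    return {n: ngrams(text, n) for n in sizes}


structure = "table tr td /td td /td /tr /table"
for gram in ngrams(structure, 3):
    print(gram)
```

Each tuple keeps the tags in document order, which is what gives an otherwise small tag vocabulary its context.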
In addition to the overall document structure, additional metadata or other information may be derived from the document 106 and added to the text string to increase the efficacy of the classifier 108, according to further embodiments. For example, an identifier unique to each document author, such as a sender ID or customer ID, could be added as a word or words to the text string that represents the document structure in order to relate documents 106 to authors. The ID representing the author of a document 106 may provide an additional feature upon which to train the classifier 108 and/or classify the document. In practice, this could be represented as a textual token such as “AUTHOR:192383”, where “192383” represents the identifier unique to the document author. This textual token would be prepended or appended to the text string representing the document structure, as presented in Table 1. However, because the document classification module 104 compares N-grams representing the complete structure of the document, the author feature need not be present in a candidate document 106 in order to produce a match.
In further embodiments, the overall complexity of a document 106 may provide an additional data point for the classification of documents. The complexity of a document 106 may be determined from the number of HTML tags in the document, the number of different tags in the document, the maximum depth of nested HTML tags, and/or the like. According to one embodiment, a real number in the range of 0.0 to L, where L is normalized, may be calculated to represent the overall complexity of a given document 106. Documents 106 having a small number of HTML tags will be given a very low number while documents with many tags, many different tags, and/or a deeply-nested structure will be given a higher number. This complexity value may be used as a coefficient for the probability score for a particular class determined by the classifier 108, for example. This would allow strongly similar documents 106 that are very simple to be weakly correlated, and would likewise increase the correlation for moderately similar but very complex documents.
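One way such a complexity value could be computed is sketched below. The disclosure names the three inputs (tag count, distinct tag count, maximum nesting depth) but not how they are combined, so the caps of 500, 50, and 20, the equal weighting, the upper bound of 1.0, and the list of void tags are all assumptions chosen for illustration. The input is a list of tag tokens in which closing tags are written as `/tag`, which is also an assumed convention.

```python
# Assumed subset of HTML elements that never have a closing tag.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


def complexity_components(tokens):
    """Return (tag count, distinct tag count, maximum nesting depth)."""
    opening = [t for t in tokens if not t.startswith("/")]
    depth = max_depth = 0
    for t in tokens:
        if t.startswith("/"):
            depth = max(depth - 1, 0)
        else:
            # A tag sits one level below its parent, whether or not it nests further.
            max_depth = max(max_depth, depth + 1)
            if t not in VOID_TAGS:
                depth += 1
    return len(opening), len(set(opening)), max_depth


def complexity_score(tokens, caps=(500, 50, 20)):
    """Average the three components after capping each one at 1.0 (assumed scheme)."""
    parts = complexity_components(tokens)
    return sum(min(value / cap, 1.0) for value, cap in zip(parts, caps)) / len(caps)


tokens = "html body table tr td a img /a /td /tr /table /body /html".split()
print(complexity_components(tokens))
print(round(complexity_score(tokens), 3))
```

Used as a coefficient, the score would simply multiply the class probability returned by the classifier, so that a match between two very simple documents counts for less than a match between two complex ones.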
Alternatively or additionally, the calculated complexity value for a document 106 may be added as a word to the text string representing the structure of the document and used in the classification. For example, the “word” “COMPLEXITY:0.7” may be prepended or appended to the text string. Similarly, words representing the various complexity components of the document 106, such as “NUMBER_OF_ELEMENTS:250,” “NUMBER_OF_DISTINCT_ELEMENTS:45,” and/or “MAX_NODE_DEPTH:5,” may be added to the text string to associate each document with its overall complexity.
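Appending the author and complexity "words" to the structural text string can be illustrated as follows. The function name `add_metadata`, its parameters, and the rounding of the complexity value to one decimal place are assumptions for this sketch; the token formats follow the examples given in the text.

```python
def add_metadata(structure, author_id=None, complexity=None, components=None):
    """Append metadata tokens to the structural text string as extra words."""
    tokens = [structure]
    if author_id is not None:
        tokens.append(f"AUTHOR:{author_id}")
    if complexity is not None:
        # Rounding is an assumption: coarse values recur across documents,
        # which lets the token behave like a shared word.
        tokens.append(f"COMPLEXITY:{complexity:.1f}")
    for name, value in (components or {}).items():
        tokens.append(f"{name}:{value}")
    return " ".join(tokens)


print(add_metadata("table tr td /td /tr /table", author_id=192383, complexity=0.7))
print(add_metadata("table tr td /td /tr /table",
                   components={"NUMBER_OF_ELEMENTS": 250, "MAX_NODE_DEPTH": 5}))
```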
Turning now to
The routine 400 begins at operation 402, where the document classification module 104 receives a structured document 106 that has been labeled as belonging to a particular class or category. The document 106 may be labeled by a combination of textual content analysis performed by the document classification module 104 or other module or process and/or manual analysis of the document's contents, as described above. In some embodiments, the document classification module 104 receives a number of training documents 112 labeled as to class or category from which to train the classifier 108. The routine 400 proceeds from operation 402 to operation 404, where the document classification module 104 parses the structural elements from the received structured document 106. For example, as described above in regard to
From operation 404, the routine 400 proceeds to operation 406, where the document classification module 104 generates a text string representing the structure of the received document 106 from the parsed structural elements. For example, as further described above in regard to
The routine 400 proceeds from operation 406 to operation 408, where the document classification module 104 may add additional metadata to the text string representing the structure of the document to increase the efficacy of the classifier 108, according to further embodiments. For example, the document classification module 104 may prepend or append a “word” or words to the text string representing the author of the document, the complexity of the document, and the like, as described above in regard to
From operation 408, the routine proceeds to operation 410, where the document classification module 104 updates the classifier 108 for the specified class or category of document from the text string representing the structure of the received document 106. The method used to update the classifier 108 may depend on the type of classifier implemented by the document classification module 104. For example, the document classification module 104 may implement an N-gramming Bayesian classifier 108 that breaks up the text string into a sliding window of N-grams while calculating the probabilities that a second N-gram occurs after a first N-gram in a set of training documents 112 labeled as a particular class or category. The classifier 108 may maintain these probabilities in a probability matrix for the various N-grams identified in the training documents 112. It will be appreciated that the document classification module 104 may employ other methods of grouping structural elements from the document 106 beyond the N-grams described herein.
The document classification module 104 may utilize the algorithm of the classifier 108 to update the probability matrix from the text string representing the structure of the received document 106, and store the updated probability matrix and other data defining the classifier 108 in the classifier data 110, according to some embodiments. It will be appreciated that other classifiers 108 beyond the naïve Bayes classifier described herein may be utilized by the document classification module 104 to classify documents 106 based on their structure, and that the corresponding classifier algorithm may encompass a mechanism for updating the classifier from the text string representing the structure of a received document labeled as belonging to a particular class or category. From operation 410 the routine 400 ends.
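The training update can be sketched as a per-class table of how often one N-gram follows another. This is only one possible reading of the "probability matrix" described above; the class name `TransitionModel`, its methods, and the use of raw counts converted to relative frequencies are assumptions for illustration, not the disclosed implementation.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Slide a window of n words across the text and return each tuple."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


class TransitionModel:
    """Per-class counts of a second N-gram occurring after a first N-gram."""

    def __init__(self, n=2):
        self.n = n
        # counts[label][first][second] is the number of observed transitions.
        self.counts = defaultdict(lambda: defaultdict(Counter))

    def update(self, label, structure):
        """Add one labeled training document's structure string to the model."""
        grams = ngrams(structure, self.n)
        for first, second in zip(grams, grams[1:]):
            self.counts[label][first][second] += 1

    def probability(self, label, first, second):
        """Relative frequency of `second` following `first` within a class."""
        row = self.counts.get(label, {}).get(first)
        if not row:
            return 0.0
        return row[second] / sum(row.values())


model = TransitionModel(n=2)
model.update("marketing", "table tr td /td td /td /tr /table")
print(model.probability("marketing", ("td", "/td"), ("/td", "td")))
```

Persisting `model.counts` would correspond to storing the updated matrix in the classifier data 110.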
The routine 500 begins at operation 502, where the document classification module 104 receives a structured document 106 to be classified. For example, the document classification module 104 may receive an HTML-based email message as described above in regard to
From operation 508, the routine 500 proceeds to operation 510, where the document classification module 104 presents the text string representing the structure of the received document 106 to the classifier 108 for a first class or category of documents. As described herein, the classifier 108 may be trained from training documents 112 labeled as belonging to a number of different classes or categories. The document classification module 104 may select a first of the known document classes and present the text string representing the structure of the document 106 to the classifier 108 for the selected class. According to embodiments, the classifier 108 utilizes the classifier algorithm to calculate a probability that the received document 106 belongs to the selected document class based on observances in the training documents 112 labeled as belonging to that class.
For example, the N-gramming Bayesian classifier 108 described above in regard to operation 410 may break up the text string representing the structure of the document into a sliding window of N-grams and identify instances of a second N-gram occurring after a first N-gram in the text string. The classifier 108 may then accumulate the probabilities of these occurrences from the probability matrix for the various N-grams identified in the training documents 112 in operation 410 and maintained in the classifier data 110, for example. The accumulated probabilities represent the probability that the received document belongs to the selected class, according to some embodiments. It will be appreciated that other methods of calculating the probability that the received document 106 belongs to the selected document class may be utilized, based on the type of classifier 108 utilized by the document classification module 104.
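The scoring step can be sketched as follows. The text does not say how the probabilities are accumulated, so this example averages the per-transition probabilities to obtain a value between 0 and 1; that choice, along with the function names `train` and `score` and the two toy training documents, is an assumption. A product of probabilities or a sum of log-probabilities with smoothing would be equally plausible alternatives.

```python
from collections import Counter, defaultdict


def transitions(text, n=2):
    """Return (first N-gram, second N-gram) pairs in document order."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return list(zip(grams, grams[1:]))


def train(labeled_docs, n=2):
    """Build a matrix keyed by (label, first N-gram) holding counts of the second."""
    matrix = defaultdict(Counter)
    for label, text in labeled_docs:
        for first, second in transitions(text, n):
            matrix[(label, first)][second] += 1
    return matrix


def score(matrix, label, text, n=2):
    """Average the trained transition probabilities over the candidate's transitions."""
    pairs = transitions(text, n)
    if not pairs:
        return 0.0
    total = 0.0
    for first, second in pairs:
        row = matrix.get((label, first))
        if row:
            total += row[second] / sum(row.values())
    return total / len(pairs)


matrix = train([
    ("marketing", "table tr td /td td /td /tr /table"),
    ("order", "div p /p p /p /div"),
])

# A candidate from the same layout with one more cell in the row.
candidate = "table tr td /td td /td td /td /tr /table"
print(score(matrix, "marketing", candidate))
print(score(matrix, "order", candidate))
```

The candidate scores highly for the class whose training document shares its layout, even though its grid has a different number of cells.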
The routine 500 proceeds from operation 510 to operation 512, where the document classification module 104 determines if the calculated probability of the received document 106 belonging to the selected document class exceeds some threshold value. The threshold value may be set high enough to ensure that the classification of the received document 106 is performed with a high degree of certainty, such as 80%. If the calculated probability of the received document 106 belonging to the selected document class does not exceed the threshold value, then the routine 500 proceeds from operation 512 to operation 514, where the document classification module 104 presents the text string representing the structure of the received document 106 to the classifier 108 for a next class or category of documents, using the same methodology as described above in regard to operation 510.
If the calculated probability of the received document 106 belonging to the selected document class does exceed the threshold value, then the routine 500 proceeds from operation 512 to operation 516, where the document classification module 104 classifies the received document 106 as belonging to the selected document class. Alternatively, the document classification module 104 may present the text string representing the structure of the received document 106 to the classifier 108 for each of the known document classes, and select the document class having the highest calculated probability as determined by the classification for the received document. From operation 516, the routine 500 ends.
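The two selection strategies described in operations 510 through 516 can be compared in a short sketch. The function names, the `score_fn` callback, and the example probabilities are hypothetical; the 0.8 default mirrors the 80% threshold mentioned above.

```python
def classify_first_over_threshold(structure, classes, score_fn, threshold=0.8):
    """Try each class in turn and stop at the first one exceeding the threshold."""
    for label in classes:
        if score_fn(label, structure) > threshold:
            return label
    return None  # No class was confident enough.


def classify_best(structure, classes, score_fn):
    """Score every class and return the one with the highest probability."""
    if not classes:
        return None
    return max(classes, key=lambda label: score_fn(label, structure))


# Stand-in for a trained classifier: fixed, made-up probabilities per class.
example_scores = {"order": 0.35, "marketing": 0.86, "newsletter": 0.91}


def score_fn(label, structure):
    return example_scores[label]


classes = ["order", "marketing", "newsletter"]
print(classify_first_over_threshold("table tr td /td /tr /table", classes, score_fn))
print(classify_best("table tr td /td /tr /table", classes, score_fn))
```

With these made-up numbers the two strategies disagree: the threshold loop stops at the first class above 0.8, while the exhaustive comparison picks the highest-scoring class. The first approach may save work when there are many classes; the second avoids depending on the order in which classes are tried.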
The computing device 12 includes a baseboard, or “motherboard,” which is a printed circuit board to which a multitude of components or devices may be connected by way of a system bus or other electrical communication paths. In one illustrative embodiment, one or more central processing units (“CPUs”) 14 operate in conjunction with a chipset 16. The CPUs 14 are standard programmable processors that perform arithmetic and logical operations necessary for the operation of the computing device 12.
The CPUs 14 perform the necessary operations by transitioning from one discrete, physical state to the next through the manipulation of switching elements that differentiate between and change these states. Switching elements may generally include electronic circuits that maintain one of two binary states, such as flip-flops, and electronic circuits that provide an output state based on the logical combination of the states of one or more other switching elements, such as logic gates. These basic switching elements may be combined to create more complex logic circuits, including registers, adders-subtractors, arithmetic logic units, floating-point units, or the like.
The chipset 16 provides an interface between the CPUs 14 and the remainder of the components and devices on the baseboard. The chipset 16 may provide an interface to a random access memory (“RAM”) 18, used as the main memory in the computing device 12. The chipset 16 may further provide an interface to a computer-readable storage medium such as a read-only memory (“ROM”) 20 or non-volatile RAM (“NVRAM”) for storing basic routines that help to start up the computing device 12 and to transfer information between the various components and devices. The ROM 20 or NVRAM may also store other software components necessary for the operation of the computing device 12 in accordance with the embodiments described herein.
According to various embodiments, the computing device 12 may operate in a networked environment using logical connections to remote computing devices and computer systems through one or more networks 34, such as local-area networks (“LANs”), wide-area networks (“WANs”), the Internet, or any other networking topology known in the art that connects the computing device 12 to the remote computing devices and computer systems. The chipset 16 includes functionality for providing network connectivity through a network interface controller (“NIC”) 22, such as a gigabit Ethernet adapter. It should be appreciated that any number of NICs 22 may be present in the computing device 12, connecting the computer to different types of networks and remote computer systems.
The computing device 12 may be connected to a mass storage device 28 that provides non-volatile storage for the computer. The mass storage device 28 may store system programs, application programs, other program modules, and data, which are described in greater detail herein. The mass storage device 28 may be connected to the computing device 12 through a storage controller 24 connected to the chipset 16. The mass storage device 28 may consist of one or more physical storage units. The storage controller 24 may interface with the physical storage units through a serial attached SCSI (“SAS”) interface, a serial advanced technology attachment (“SATA”) interface, a fiber channel (“FC”) interface, or other standard interface for physically connecting and transferring data between computers and physical storage devices.
The computing device 12 may store data on the mass storage device 28 by transforming the physical state of the physical storage units to reflect the information being stored. The specific transformation of physical state may depend on various factors, in different implementations of this description. Examples of such factors may include, but are not limited to, the technology used to implement the physical storage units, whether the mass storage device 28 is characterized as primary or secondary storage, or the like. For example, the computing device 12 may store information to the mass storage device 28 by issuing instructions through the storage controller 24 to alter the magnetic characteristics of a particular location within a magnetic disk drive unit, the reflective or refractive characteristics of a particular location in an optical storage unit, or the electrical characteristics of a particular capacitor, transistor, or other discrete component in a solid-state storage unit. Other transformations of physical media are possible without departing from the scope and spirit of the present description, with the foregoing examples provided only to facilitate this description. The computing device 12 may further read information from the mass storage device 28 by detecting the physical states or characteristics of one or more particular locations within the physical storage units.
In addition to the mass storage device 28 described above, the computing device 12 may have access to other computer-readable media to store and retrieve information, such as program modules, data structures, or other data. It should be appreciated by those skilled in the art that computer-readable media can be any available media that may be accessed by the computing device 12, including computer-readable storage media and communications media. Communications media includes transitory signals. Computer-readable storage media includes volatile and non-volatile, removable and non-removable storage media implemented in any method or technology. For example, computer-readable storage media includes, but is not limited to, RAM, ROM, erasable programmable ROM (“EPROM”), electrically-erasable programmable ROM (“EEPROM”), flash memory or other solid-state memory technology, compact disc ROM (“CD-ROM”), digital versatile disk (“DVD”), high definition DVD (“HD-DVD”), BLU-RAY, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information.
The mass storage device 28 may store an operating system 30 utilized to control the operation of the computing device 12. According to one embodiment, the operating system comprises the LINUX operating system. According to another embodiment, the operating system comprises the WINDOWS® SERVER operating system from MICROSOFT Corporation of Redmond, Wash. According to further embodiments, the operating system may comprise the UNIX or SOLARIS operating systems. It should be appreciated that other operating systems may also be utilized.
The mass storage device 28 may store other system or application programs and data utilized by the computing device 12, such as the document classification module 104, which was described above in regard to
The computing device 12 may also include an input/output controller 32 for receiving and processing input from a number of input devices, such as a keyboard, a mouse, a touchpad, a touch screen, an electronic stylus, or other type of input device. Similarly, the input/output controller 32 may provide output to a display device, such as a computer monitor, a flat-panel display, a digital projector, a printer, a plotter, or other type of output device. It will be appreciated that the computing device 12 may not include all of the components shown in
Based on the foregoing, it should be appreciated that technologies for classifying structured documents based on the structure of the document are presented herein. Although the subject matter presented herein has been described in language specific to computer structural features, methodological acts, and computer readable media, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features, acts, or media described herein. Rather, the specific features, acts, and media are disclosed as example forms of implementing the claims.
The subject matter described above is provided by way of illustration only and should not be construed as limiting. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure. Various modifications and changes may be made to the subject matter described herein without following the example embodiments and applications illustrated and described, and without departing from the true spirit and scope of the present invention, which is set forth in the following claims.