This invention relates to a computer-implemented system and method for classifying the content of documents.
On-line sources of content often contain marginal or inapplicable content. Even where an on-line source of content, such as a web or HTML page, has applicable content, such as a useful or relevant article, there is often a great deal of inapplicable content on the same page. For example, a web page may contain information displayed across various parts of the page. The applicable content, such as an article of interest, may be located on just a portion of the page. Other parts of the page, such as the header, footer, or side portions, might contain lists of links or banner advertisements that are not of interest and contain inapplicable content. The page may include other documents that are not of interest and contain inapplicable content, such as system warnings, contact information and the like. When a user visits, accesses or downloads a document returned by a search engine in response to a keyword search, he or she may be frustrated because the document contains inapplicable content. Further, when a search returns an HTML page, time may be wasted distinguishing useful articles from non-articles located on the page.
Users also face the challenging problem of information overload as the amount of online data increases by leaps and bounds, including in non-commercial domains such as research paper searching.
Search engines tend to return many documents or pages in response to a query; a generic query may return thousands of possible pages. Moreover, many pages identified by a search or recommendation engine, or in a list of documents or a catalog, are irrelevant or only marginally relevant to the person carrying out the search. As such, the use of search and recommendation engines is often an inefficient use of time, produces poor results, or is frustrating. In addition, a search engine may identify a search term in a non-article portion of a page even when the article on that page is unrelated to the search term. This can also cause poor, unreliable or inefficient search results.
Such irrelevant or only marginally relevant web pages or documents can also reduce the performance of text classification, search or recommendation systems and methods when input into such systems and methods.
A person could label a document as “article” or “non-article” after the person has reviewed, at least in part, the article or content. There are some significant disadvantages to this approach. First, human labeling can be very expensive and time consuming. Using people to manually label content has the further disadvantage that it does not scale up well to handle large numbers of documents. This approach suffers the further disadvantage that it is not well-suited to handle a continuous stream of requests to label documents as “articles” or “non-articles”.
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is not intended to identify key/critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.
The present invention is directed to a computer-implemented system and method of document classification that can distinguish between articles and other web pages which contain non-article (i.e. irrelevant or marginal) content.
In one embodiment, the invention provides a computer-implemented method for labelling web documents as articles or non-articles comprising the steps of receiving a training set comprising documents, receiving a set of human generated labels for each document in the training set, generating a machine learning model based on the content of the document and the corresponding human generated label to generate a predicted label for the document, receiving a new document, applying the machine learning model to the new document to produce a label of article or non-article, and, associating the produced label with the new document.
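The claimed flow (receive a labelled training set, fit a model, apply it to a new document, associate the produced label) may be sketched in Python as follows. The tag vocabulary, the feature extraction and the nearest-neighbour decision rule are assumptions made purely for this sketch; the invention is not limited to any particular machine learning model.

```python
import re
from collections import Counter

TAGS = ["p", "a", "div", "h1"]  # illustrative tag vocabulary (assumption)

def extract_features(html):
    """Count occurrences of each tag of interest in the document.

    The regex matches opening tags only; closing tags ("</p>") do not match
    because "/" is not alphanumeric.
    """
    counts = Counter(re.findall(r"<\s*([a-zA-Z0-9]+)", html))
    return [counts.get(t, 0) for t in TAGS]

def train(documents, labels):
    """The 'model' here is simply the stored labelled feature vectors
    (a 1-nearest-neighbour sketch, standing in for the learned model)."""
    return [(extract_features(d), lab) for d, lab in zip(documents, labels)]

def predict(model, new_document):
    """Label the new document with the label of the nearest training vector."""
    f = extract_features(new_document)
    dist = lambda v: sum((a - b) ** 2 for a, b in zip(f, v))
    return min(model, key=lambda m: dist(m[0]))[1]
```

The produced label can then be stored in association with the new document, as the method recites.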
In a further embodiment, the invention teaches an apparatus for article/non-article text classification comprising: means for receiving a new document, means for parsing the document according to tags, means for applying a machine learning model to each tag of the document to determine if the tag or the document contains text, and, means for labelling the document as an article if the means for applying a machine learning model has determined that the tag or the document contains text.
In a further embodiment, the invention discloses an apparatus for document classification comprising: an input processor, for receiving a new document; memory, for storing the new document and a machine learning model; and, a processor, for determining tags or other metrics in the new document and for applying the machine learning model to the tags or other metrics to produce a label of article or non-article.
Online learning provides an attractive approach to classification of documents as articles or non-articles. Online learning can make use of even a small amount of knowledge, and thus can start when few training data are available. Furthermore, online learning can incrementally adapt and improve its performance as it acquires more and more data.
Online learning is especially useful in classifying documents as articles or non-articles. Although web page content can be stable for long periods of time, changes such as improvements and refinements to hypertext mark-up language (HTML) may occur from time to time. Online learning is capable of not only making predictions in real time but also tracking and incrementally evaluating web page content.
As used in this application, the terms “approach”, “module”, “component”, “classifier”, “model”, “system”, and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a module may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a server and the server can be a module. One or more modules may reside within a process and/or thread of execution and a module may be localized on one computer and/or distributed between two or more computers. Also, these modules can execute from various computer readable media having various data structures stored thereon. The modules may communicate via local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one module interacting with another module in a local system, distributed system, and/or across a network such as the internet with other systems via the signal).
The system and method for text classification is suited to any computing environment. It may run in the background of a general purpose computer. In one aspect, it has a command line interface (CLI); however, it could also be implemented with a graphical user interface (GUI), or could run as a background component or as middleware.
An HTML page consists of many predefined HTML tags, which are compliant with W3C guidelines. The following is an HTML source code snippet:
The general outline of the invention comprises the following steps or components (which will be described in greater detail below):
As a further step of the invention, prior to the storing of a selection of documents (or the contents of web pages) into a database, this selection being known as the Training Set, an initial filtering could be carried out to filter out pages with suffixes such as “.mp3”, “.mov” or other suffixes indicating non-text documents, so as to filter out documents having a lower probability of being an article.
The steps described above are now described in greater detail.
Store a Selection of Documents (or the Contents of Web-Pages) into a Database, this Selection being known as the Training Set
In a first step 210 of the invention, a training set 110 shown in
In one embodiment, the open source crawler JOBO has been used to find documents and store them in database 120. In the preferred embodiment, JOBO has been made multi-threaded. In order to carry out multi-threaded activity, the URL of each document to be downloaded is stored on a task list. Two or more instances of JOBO are instantiated. Each instance takes a document from the task list, downloads the HTML code and text for the document and stores them in database 120. When this task is complete, the URL is deleted from the task list. To improve the accuracy and effectiveness of the invention, before downloading a document its suffix is examined, and documents with suffixes such as “.mp3” are excluded from the training set.
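The multi-threaded processing of the task list may be sketched as follows. This Python sketch is merely an analogue of the multi-threaded JOBO arrangement described above (JOBO itself is a Java crawler); the `fetch` and `store` parameters stand in for the crawler's download routine and for database 120, and are assumptions of the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

EXCLUDED_SUFFIXES = (".mp3", ".mov")  # illustrative non-text suffixes

def crawl(task_list, fetch, store, workers=4):
    """Download each URL on the task list in parallel, skipping URLs whose
    suffix indicates a non-text document.

    `fetch(url)` returns the document content; `store(url, content)` saves
    it (in the embodiment, to database 120).
    """
    urls = [u for u in task_list if not u.lower().endswith(EXCLUDED_SUFFIXES)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves order, so each result is paired with its URL
        for url, content in zip(urls, pool.map(fetch, urls)):
            store(url, content)
    # all tasks removed once processed (the embodiment deletes each URL
    # from the task list as its download completes)
    task_list.clear()
```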
In the second step 220 of
In the third step 230 of
In an alternate embodiment of the present invention, the document may optionally be pre-processed in step 235. The data pre-processing 235 may comprise stop-word deletion, stemming, and title and link extraction, which transforms or presents each article as a document vector in a bag-of-words data structure. With stop-word deletion, selected “stop” words (i.e. words such as “an”, “the”, “they” that are very frequent and do not have discriminating power) are excluded. The list of stop-words can be customized. Stemming converts words to their root form, so that words sharing the same root are represented by a single term, and consequently reduces dimensionality. Such words may be stemmed by using Porter's Stemming Algorithm, but other stemming algorithms could also be used. Text in links and titles from web pages can also be extracted and included in a document vector.
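The pre-processing step may be sketched as follows. The stop-word list is illustrative (the embodiment states it can be customized), and the suffix-stripping stemmer below is a greatly simplified stand-in for Porter's Stemming Algorithm, which the sketch does not reproduce.

```python
STOP_WORDS = {"an", "the", "they", "a", "of"}  # illustrative, customizable list

def simple_stem(word):
    """Strip a few common suffixes; a simplified stand-in for Porter's
    Stemming Algorithm (any stemming algorithm could be used)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, delete stop-words, stem, and return a bag of words
    mapping each stemmed term to its frequency."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    bag = {}
    for w in (simple_stem(w) for w in words):
        bag[w] = bag.get(w, 0) + 1
    return bag
```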
For each document, in step 240 of the invention a vector is created, setting out the frequency of occurrence of each of the X most frequently found tags. In other words, for each of d1 . . . dn a vector is created {F1, F2 . . . FX}, where F1 represents the frequency in the document of the most frequently found tag, T1; F2 represents the frequency in the document of the second most frequently found tag, T2; and so on. As is illustrated in
Entropy=−Σ(probability of a word occurring in the document)*log(probability of a word occurring in the document). The summation occurs over all the distinct words in the document. (The leading minus sign makes the entropy non-negative, since each log term is negative.)
Other numeric metrics could also be used as a component of the vector such as the word count of text in the document.
The vector is stored in association with the human generated label of the document as article or non-article.
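The construction of the document vector, including the entropy metric and the word count, may be sketched as follows. The word-splitting regex and the use of the natural logarithm are assumptions of the sketch; the base of the logarithm is not material to the comparison of entropies.

```python
import math
import re
from collections import Counter

def entropy(text):
    """Entropy = -sum over distinct words w of p(w) * log p(w),
    where p(w) is the probability of word w occurring in the document."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    n = len(words)
    probs = [c / n for c in Counter(words).values()]
    return -sum(p * math.log(p) for p in probs)

def document_vector(html, text, top_tags):
    """Vector {F1 .. FX} of frequencies of the X most frequent tags,
    with entropy and word count appended as further numeric metrics."""
    counts = Counter(re.findall(r"<\s*([a-zA-Z0-9]+)", html))
    vec = [counts.get(t, 0) for t in top_tags]
    vec.append(entropy(text))
    vec.append(len(re.findall(r"[a-zA-Z']+", text)))
    return vec
```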
Generate Further Training Sets, by Randomly Selecting Documents from the Training Set
In a preferred embodiment, further training sets in step 250 are created by randomly selecting a pre-determined number of documents from documents d1 . . . dn, permitting any document to be selected zero, one or more times. These further training sets are stored in database 120.
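This sampling-with-replacement (bootstrap) step may be sketched as follows; the fixed seed is an assumption of the sketch, used only to make the example reproducible.

```python
import random

def bootstrap_sets(training_set, num_sets, set_size, seed=0):
    """Create further training sets by drawing documents from the training
    set with replacement, so any document may be selected zero, one or
    more times."""
    rng = random.Random(seed)
    return [[rng.choice(training_set) for _ in range(set_size)]
            for _ in range(num_sets)]
```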
Calculate the Information Gain for the Tags in each Instance of the Further Training Sets
In step 260 of
The formula for calculating the Information Gain is given as follows:

IG(T)=Entropy(S)−Σ(|Sv|/|S|)*Entropy(Sv)

(where the summation is taken over the values v that the tag T may take, for example present and not present; S is the set of documents being split; Sv is the subset of S in which the tag T takes the value v; and Entropy(S)=−Σ p(class)*log(p(class)), summed over the classes article and non-article)
IG(T1)=(−70/100)*((4/7)*log(4/7)+(3/7)*log(3/7))−(30/100)*((2/3)*log(2/3)+(1/3)*log(1/3))
In a preferred embodiment, for simplicity of calculation, if a particular tag, for example, T1, occurs more than once in a document, it is deemed to have occurred only once. In other words, for the purpose of calculating the Information Gain, any particular tag is either present or not present.
In an alternate embodiment, the information gain can be calculated according to each different frequency of the tag occurring within the training set. For example, as is shown in step 265 of
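Under the presence/absence simplification of the preferred embodiment, the Information Gain of a tag may be computed as in the following sketch. The use of base-2 logarithms is an assumption; the document does not fix the base.

```python
import math

def _entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log(p, 2) for p in probs)

def information_gain(has_tag, labels):
    """IG(T) = Entropy(S) - sum over v in {present, absent} of
    (|Sv|/|S|) * Entropy(Sv), with the tag deemed either present or not."""
    n = len(labels)
    with_tag = [l for h, l in zip(has_tag, labels) if h]
    without = [l for h, l in zip(has_tag, labels) if not h]
    return (_entropy(labels)
            - len(with_tag) / n * _entropy(with_tag)
            - len(without) / n * _entropy(without))
```

A perfectly discriminating tag yields an Information Gain equal to the full entropy of the set, while a tag independent of the labels yields a gain of zero.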
Generate a Decision Tree Model for each Instance of the Further Training Set
In Step 270 of
When the aggregate number of articles and non-articles in a particular leaf is below a threshold (in a preferred embodiment the threshold is twenty (20); see, for example, the leaf T4 Not Present, T62 Not Present in the above table), there may be a problem with that leaf not having adequate statistical significance. In other words, the prediction or discrimination provided by that leaf may not be adequately reliable.
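The generation of a decision tree with such a leaf threshold may be sketched as follows. This is a greedy, ID3-style sketch offered only as an illustration: each document is represented as the set of tags present in it, the tag with the highest Information Gain is chosen at each node, and splitting stops when a node is smaller than the threshold, is pure, or no tag improves purity.

```python
import math

def _entropy(labels):
    """Entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log(p, 2) for p in probs)

def _gain(docs, labels, tag):
    """Information Gain of splitting on presence/absence of `tag`."""
    n = len(labels)
    with_t = [l for d, l in zip(docs, labels) if tag in d]
    without = [l for d, l in zip(docs, labels) if tag not in d]
    if not with_t or not without:
        return 0.0
    return (_entropy(labels)
            - len(with_t) / n * _entropy(with_t)
            - len(without) / n * _entropy(without))

def build_tree(docs, labels, tags, min_leaf=20):
    """Recursively split on the highest-gain tag; a leaf records the
    article / non-article counts later used for smoothing."""
    counts = {"article": labels.count("article"),
              "non-article": labels.count("non-article")}
    if len(labels) < min_leaf or len(set(labels)) < 2 or not tags:
        return {"leaf": counts}
    best = max(tags, key=lambda t: _gain(docs, labels, t))
    if _gain(docs, labels, best) <= 0.0:
        return {"leaf": counts}
    present = [(d, l) for d, l in zip(docs, labels) if best in d]
    absent = [(d, l) for d, l in zip(docs, labels) if best not in d]
    rest = [t for t in tags if t != best]
    return {"tag": best,
            "present": build_tree([d for d, _ in present],
                                  [l for _, l in present], rest, min_leaf),
            "absent": build_tree([d for d, _ in absent],
                                 [l for _, l in absent], rest, min_leaf)}
```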
The invention provides a variety of approaches to address this problem of a leaf not having adequate statistical significance:
P(article|T4, T13)=10/11=0.91
P(non-article|T4, T13)=1/11=0.09
In alternate embodiments other approaches could be used to create the machine learning model, including random forest or boosting, or statistical methods such as naïve Bayes.
In the next step of the invention, the decision trees calculated from each instance of the Further Training Sets are aggregated. This is shown as step 280 of
The aggregation of the decision trees calculated from each instance of the Further Training Sets is carried out by employing Laplace smoothing. The purpose of the Laplace smoothing is to give greater weight to those probabilities calculated from leaves having greater numbers of documents. In order to carry out Laplace smoothing, in one embodiment, the following formulae are employed:
P(article|T)=(na+(1/c)*L)/(n+L)

P(non-article|T)=(nn+(1/c)*L)/(n+L)

(where na and nn are respectively the numbers of articles and non-articles in the leaf, n=na+nn is the total number of documents in the leaf, c is the number of classes (here, two), and L is the smoothing parameter)
P(article|T)=(10+½*1)/(11+1)=10.5/12=0.875
P(non-article|T)=0.125
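The Laplace smoothing of a leaf's class probability may be sketched as follows; it reproduces the worked example above, in which a leaf with 10 articles out of 11 documents, c=2 classes and L=1 gives a smoothed probability of 0.875.

```python
def laplace_smooth(n_class, n_total, num_classes=2, L=1.0):
    """P(class | leaf) = (n_class + (1/c)*L) / (n + L), where n_class is
    the number of documents of the class in the leaf, n the total number
    of documents in the leaf, c the number of classes and L the
    smoothing parameter."""
    return (n_class + (1.0 / num_classes) * L) / (n_total + L)
```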
In this step (step 285 of
The following two amounts are calculated in step 290 of
P(article)=the average of P(article|T) over the Laplace-smoothed leaves into which the new document falls, one leaf from each decision tree arising from the Further Training Sets

P(non-article)=the average of P(non-article|T) over the same Laplace-smoothed leaves.
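The combination of the per-tree probabilities may be sketched as follows. The specification does not pin down the combination operator; averaging, the usual choice for bagged decision trees, is assumed here.

```python
def aggregate(per_tree_probs):
    """Average the smoothed (P(article|T), P(non-article|T)) pairs, one
    pair per decision tree, and label the new document with whichever
    averaged probability is larger."""
    n = len(per_tree_probs)
    p_article = sum(p for p, _ in per_tree_probs) / n
    p_non = sum(q for _, q in per_tree_probs) / n
    if p_article >= p_non:
        return ("article", p_article)
    return ("non-article", p_non)
```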
As will be apparent to those skilled in the art, alternative approaches, which are included within the scope of this invention, may be used to create the decision tree model, for example, random forest approaches.
Associate the Article/Non-Article Label with the New Document and Store such an Association
In the last step of the method (step 300 of
In a further embodiment of the present invention, the generated label may be used to facilitate the operation of a search or recommendation engine. For example, the search or recommendation engine could be configured not to return, in response to a query, documents which had been labelled as “non-articles”.
Once a machine learning model has been developed in accordance with the present invention it can be stored or downloaded into a variety of devices. Using such devices, it may be desirable to label a document as an article or non-article in accordance with the following steps as are illustrated in
In an embodiment of the present invention, once a document has been labelled as a non-article, it would not be presented in response to a query given to a search engine, or would not be presented by a recommendation engine. Alternatively, in a further embodiment of the present invention, documents labelled as non-articles would not be assessed or interrogated or considered by a search or recommendation system, so that words they contain would not be a possible source of inaccurate results.
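The exclusion of labelled non-articles from a result list may be sketched as follows; the document-identifier mapping is an assumption of the sketch.

```python
def filter_results(results, labels):
    """Drop documents labelled 'non-article' before they are returned by a
    search or recommendation engine. `labels` maps document id -> label;
    documents with no stored label are retained."""
    return [doc for doc in results if labels.get(doc) != "non-article"]
```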
A recommender system carries out the following steps as are known to those skilled in the art:
Each of the above steps is carried out with methods known to those skilled in the art.
In accordance with an embodiment of the present invention, the said second documents do not include documents labelled as non-articles in accordance with the method set out at
A search engine is a method or system designed to search for information on the World Wide Web, or a sub-set of it, or on a web-site, database or some sub-set of these. Known search engines include Google, All the Web, Info.com, Ask.com, Wikiseek, Powerset, Viewz, Cuil, Boogami, Leapfish, and Inktomi.
In general search engines work according to the following method:
Each of the above steps of the general operation of a search engine is carried out in accordance with methods known to those skilled in the art. Typically, steps (a)-(c) in the previous paragraph are carried out by a web crawler. If a database of stored results were available, then these steps would not be essential to the method of search engine operation.
In accordance with an embodiment of the present invention, the search engine method also includes the following steps:
In a further embodiment of the present invention, the device is capable of receiving an update to the machine learning model.
Such a device would have an input processor, for receiving the new document; memory, for storing the new document and the machine learning model; and a processor, for determining the tags or other metrics in the new document and for applying the machine learning model to the new document to produce a label of article or non-article. When the label was generated, it would be stored in the memory in association with the new document. Alternatively, the new document and label may not be stored (other than transiently) if the label was to be used immediately by a search or recommendation engine.
As an embodiment of the present invention, computer media, such as Fixed Drive (2.5), could have statements and instructions for execution by a computer stored on it to carry out the method set out above, which is described schematically in
During operation of the system shown in
What has been described above includes examples of the present invention. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the present invention, but one of ordinary skill in the art may recognize that many further combinations and permutations of the present invention are possible. Accordingly, the present invention is intended to embrace all such alterations, modifications and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.