Classification systems are used to classify content of data objects such as documents, email messages and web pages and also to support processing of sets of data objects.
The accompanying drawings illustrate various examples and are a part of the specification. The illustrated examples do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
One difficulty in organizations or enterprises is that increasingly high volumes of data objects are being received, created and stored. As the volume increases, finding relevant data objects within those stored becomes increasingly difficult. Advances in computer technology have provided users with numerous options for creating data objects such as electronic files and documents. For example, many common software applications executable on a typical personal computer enable users to generate various types of useful data objects. Data objects can also be obtained from remote networks, from image acquisition devices such as scanners or digital cameras, or they can be read into memory from a data storage device (e.g., in the form of a file). Modern computer systems enable users to electronically obtain or create vast numbers of data objects varying in size, subject matter, and format. Such data objects may be located, for example, on personal computers, on file servers, network attached storage or storage area networks, or on other storage media.
In general, content classification involves assigning a data object such as a document or file to one or more sets or classes of documents with which it has commonality—usually as a consequence of shared topics, concepts, ideas and subject areas.
In certain systems, content classification may be offered to provide a class assignment for a data object such as a document, email message, web page or other data object. In certain systems, content classification may be offered to enable processing of data objects based on their respective content. One difficulty with content classification is that the classes assigned may be too general: the classes used are not sufficient to differentiate the data object from other data objects. For example, a classification of “Education” is not sufficient to differentiate between pre-school books, university textbooks or literature advertising night-school courses, all of which could validly be described as being on the subject of education.
In certain systems, content classification may be performed manually. A typical problem with manual classification is that it is a lengthy activity and requires knowledge of the domain of the content for accurate classification. Due to constraints on resources, manual classification is often only used to assign very high, abstract levels of classification. A further problem with manual classification is that two people will often classify a data object differently, reducing the usefulness of the classification because common classification terms cannot be relied upon for searching and similar activities.
In certain systems, content classification may be performed automatically by a computer system. A typical problem with automatic classification is that the system may be misled into selecting inappropriate or meaningless classifications. One problem is that an author of content may use the same term in many data objects even though they may be about different subjects. This can result in that author's data objects being given a different classification to others in the same field or domain. As a result, classification may be driven by author rather than by the content of the data object.
Accordingly, various examples described herein were developed to provide a system that enables determination of sub-topics from content of data objects having an existing class. In an example of the disclosure, a system comprises a data repository, a data object analyzer including at least one processor to execute computer program code to determine terms from content of one or more data objects of each of a plurality of classes and collate said terms in said data repository, and a pattern analyzer including at least one processor to execute computer program code to determine, from the terms in the data repository, a sub-topic for a selected one of said plurality of classes, the sub-topic comprising a set of terms, the set of terms being common to the content of at least a subset of said data objects of the selected class and substantially absent from data objects outside of said selected class.
Advantages of the examples described herein include that existing classifications of data objects are used to guide selection of meaningful, finer-granularity sub-classifications.
An advantage is that each sub-topic is preferably selected so as to be a sparse (small) set of terms, such as words, that tend to appear together in data objects, such as documents, that belong to the class, and not in data objects outside the class. An advantage is that the use of the discrimination that exists in the data between the different broad classes enables a meaningful set of fine-grained sub-topics to be found. An advantage is that the specificity of the sub-topics is controlled in part by the sparsity (having a small number of discriminating terms in every sub-topic). An advantage is that the combination of existing classes and sub-topics enables a greater scope of classification at both broad and granular levels. A few terms alone cannot discriminate the broad class, but can capture a distinct sub-topic and, together with other such sub-topics, cover all or most of the data objects in the broad class.
An advantage is that the processing to identify sub-topics can be designed to be computationally efficient. Another advantage is that the sub-topics, in the form of small groups of terms, are easily understood and provide contextual insight into the individual classes, to the level that they automatically identify sub-topics in tagged classes.
An advantage is that sub-classification of data objects such as documents enables users to more easily locate related documents. Another advantage is that sub-classification enables relationships between data objects to be identified. Another advantage is that sub-classification enables differences in topics of data objects to be identified.
Another advantage is that the accuracy of data object processing tasks such as indexing, summarization, and clustering is improved, or can be increased on demand when categorisation is found to be insufficiently granular, by applying sub-classification to the classes requiring further granularity.
Another advantage is that many sources or types of existing classes can be utilized and different existing class types or class assignment mechanisms can be leveraged to provide different advantages.
As used herein, a “data object” or “document” refers to any electronically readable content whether stored in a memory, data repository, file, computer readable medium, as a transient signal or another medium and including, but not limited to, text documents, email messages, data communications, web pages, unstructured data, and electronic books. A data object may include non-textual content that can be translated into a set representation. For example, a data object may include sets of events, sets of logs, image or sound data with extractable features and/or its metadata which can be represented by terms describing the respective content.
In one example, the computing device 20 is one of a desktop computer, an all-in-one computing device, a notebook computer, a server computer, a handheld computing device, a smartphone, a tablet computer, a print server, a printer, a self-service print kiosk, or a subcomponent of a system, machine or device. In one example, the computing device 20 includes a processor 21, a memory 22, and an input/output port 23. In one example, the processor 21 is a central processing unit (CPU) that executes commands stored in the memory. In another example, the processor 21 is a semiconductor-based microprocessor that executes commands stored in the memory. In one example, the memory 22 includes any one of or a combination of volatile memory elements (e.g., RAM modules) and non-volatile memory elements (e.g., hard disk, ROM modules, etc.). In one example, the input/output port 23 is a logical data connection to a remote input/output port or queue, such as a virtual port, a shared network queue or a networked print device.
In one example, the processor 21 executes computer program code from the memory 22 to execute a data object analyser 50 to determine terms from content of one or more data objects of each of a plurality of classes and collate the terms in the data repository 30.
In one example, terms are determined by the data object analyser by performing text processing operations on the content, including stemming and removal of short words and/or predetermined stop words (such as “the”, “a”, etc.), to obtain terms that include individual words and/or word stems from the content. In one example, where content is not plain text but is graphical, audio or some mixture of content types, processing to interpret the content may be performed—for example, generating sets of distinct features that describe a graphical data object, such as a set of shapes, colors and/or properties such as persons and locations; applying recognition techniques to extract terms from the graphical or audio data; stripping formatting and/or navigation from documents, emails, websites etc.; stripping formatting markup in the data object; or extracting anomalies in signals.
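To make the step above concrete, a minimal term-extraction sketch is given below; the stop-word list, the minimum word length and the crude suffix-stripping stand-in for a real stemmer are illustrative assumptions, not part of the described system:

```python
import re

# Illustrative stop-word list and length threshold (assumptions).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "with"}
MIN_LENGTH = 3

def stem(word):
    # Crude suffix stripping, standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_LENGTH:
            return word[:-len(suffix)]
    return word

def extract_terms(content):
    """Lower-case and tokenize, drop stop words and short words, then stem."""
    words = re.findall(r"[a-z]+", content.lower())
    return {stem(w) for w in words if w not in STOP_WORDS and len(w) >= MIN_LENGTH}

terms = extract_terms("The scanner scanned grayscale images with noise.")
```

On this input the resulting term set mixes whole words and stems: it contains "noise" and "scanner", and the stem "scann" from "scanned", but not the stop word "the".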
In one example, the processor 21 executes computer program code from the memory 22 to execute a pattern analyser 60 to determine, from the terms in the data repository 30, a sub-topic for a selected one of the plurality of classes, the sub-topic comprising a set of terms, the set of terms being common to the content of at least a subset of said data objects of the selected class and substantially absent from data objects outside of said selected class.
In one example, the pattern analyser determines a plurality of sub-topics for the selected one of the plurality of classes. Each sub-topic comprises a respective set of terms, each set of terms being common to the content of at least a subset of said data objects (and subsets may overlap so a data object may be a member of more than one subset) of the selected class and substantially absent from data objects outside of said selected class. In one example, a term appearing predominantly in the class and not predominantly in data objects outside of the class is substantially absent from data objects outside of the class. In one example, a term is assessed according to a metric or a weighted metric to determine if it is substantially absent from data objects outside of the class. In one example, a term having a predetermined magnitude of occurrences in a class relative to occurrences outside the class is substantially absent from data objects outside of the class. In one example, class membership is absolute, a term of a set of terms of a sub-topic of the class being absent from data objects outside of the selected class.
In one example, the pattern analyser is subject to optimisation criteria when determining the one or more sub-topics.
In one example, the optimisation criteria include selecting a sub-topic in which the number of data objects in the class with content common to the set of terms is maximised.
In one example, the optimisation criteria include minimising the number of terms in the set.
In one example, the optimisation criteria include minimising the number of occurrences of terms of the set in content of data objects outside of the class.
In one example, the one or more data objects are stored in the data repository 30. In another example, the one or more data objects are stored in one or more remote data repositories and accessed, for example over the data communications network 45.
In one example, the data object analyser 50 determines the plurality of classes for the data objects from data such as a tag in, or associated with, the data object. In another example, the data object analyser 50 assigns each of the data objects to one of a plurality of classes.
In one example, the data object analyser 50 and pattern analyser 60 are executed on separate computing devices. In one example, the data object analyser 50 and pattern analyser 60 are executed on a common computing device. In one example, the data object analyser 50 and pattern analyser 60 are sub-routines of a system executed by a computing device.
In one example, the existing class is assigned by a remote and/or external system or source. In one example, the existing class is assigned manually or automatically according to a broad classification. For example, a broad classification may include classes of “Education”, “Politics”, “Fiction” and “Science”.
In one example, the existing class is inferred or determined from content, such as the presence of a particular keyword in the content, or from origin, such as the person, organisation or application that authored the data object.
In one example, the existing class is inferred or determined from mechanism of transmission or receipt of the data object such as locally created data object, email data object, email attachment data object, web page data object.
In one example, the existing class is inferred or determined from the author, metadata or other attribute of the data object. In one example, the existing class is the area of expertise of the author of the data object.
In one example, the existing class is inferred from, or specified by, user inputs.
A sub-topic for a data object is a set of terms from the content 110 that are common to the content of the data object and other data objects of the class for which the sub-topic is selected as a discriminator.
In one example, as shown in
In one example, the system 10 determines one or more sub-topics for each class. In another example, the system 10 determines one or more sub-topics for a designated one of the classes. For the purposes of illustration, determining sub-topics for the first class 200 is discussed, although the process is the same for further classes.
The system 10 determines, from the data objects 100a-100e of the class 200, two sub-topics 210, 210a, each comprising a set of terms common to the content of the data objects 100a-100e of the first class 200 and substantially not present in the content of data objects of the second 201 and third 202 classes. In the illustrated example, data objects 100a, 100b and 100c are determined to form a first sub-topic 210 and data objects 100c and 100d a second sub-topic 210a. Data object 100c is a member of both sub-topics, reflecting that in one example sub-topics are not necessarily separate. Data object 100e is not selected as a member of either sub-topic, reflecting that in one example sub-topics may not fully cover the whole class—data object 100e is part of the class but is not selected for either sub-topic. In one example, the number of data objects in a class or a sub-topic is variable. The number of data objects shown in
scan; scanner; rgb; contrast; grayscal; noise
blurri; blur; motion; sharp; de-blur; convolut
In one example, the system 10 determines the composition of the set iteratively.
At step 300, the system 10 determines multiple initial seeds of candidate sub-topics using different combinations of terms from one of the data objects 100a-100e of the class under consideration. In one example, multiple ones of the data objects of the class under consideration may be used as the source for different seeds.
Continuing at step 310, each candidate sub-topic is then scored in dependence on a metric, the metric including a measure of applicability of the set of terms of the candidate sub-topics to data objects of the class and to data objects not of the class.
Continuing at step 320, the candidate sub-topic having the best score (or optionally the top-N candidates) is retained and the others are discarded.
At step 330, the retained candidate sub-topics are grown by adding a new, different term from the content of the source data object to each respective set such that the maximum metric score is achieved for the candidate sub-topic. The processing iterates until candidate sub-topics reach a predetermined number of terms.
At step 340, the candidate sub-topic having highest metric score is selected.
At step 350, the terms for the candidate sub-topic are individually scored against the metric and the top K terms are selected to form a sub-topic for the class 200.
At step 360, a decision is made whether further sub-topics are to be determined and, if so, data on terms used for the sub-topic is removed from consideration on documents in the sub-topic and operation loops back to step 300.
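The seed-and-grow loop of steps 300 to 360 can be sketched as a greedy beam search; the function name `grow_subtopic`, the parameter values, and the pluggable `score` callable (standing in for the metric of step 310) are illustrative assumptions:

```python
from itertools import combinations

def grow_subtopic(doc_terms, score, seed_size=2, max_size=5, top_n=3):
    """Greedy beam search over term sets (steps 300-350, simplified)."""
    vocab = sorted(doc_terms)
    # Step 300: initial seeds from combinations of terms of a source data object.
    candidates = [frozenset(c) for c in combinations(vocab, seed_size)]
    while candidates and len(next(iter(candidates))) < max_size:
        # Steps 310-320: score each candidate and retain only the top-N.
        candidates.sort(key=score, reverse=True)
        candidates = candidates[:top_n]
        # Step 330: grow each retained candidate by the term that maximizes the score.
        candidates = [max((cand | {t} for t in vocab if t not in cand), key=score)
                      for cand in candidates]
    # Steps 340-350: the highest-scoring candidate becomes the sub-topic.
    return max(candidates, key=score)

# Toy score favouring a known group of co-occurring terms, minus a size penalty.
TARGET = {"blur", "motion", "sharp"}
score = lambda s: len(s & TARGET) - 0.1 * len(s)
best = grow_subtopic({"blur", "motion", "sharp", "noise", "scan"}, score,
                     seed_size=2, max_size=3, top_n=2)
```

With this toy score the beam converges on the three co-occurring target terms, mirroring how the metric-driven growth favours term sets that appear together.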
In one example, data on the class and sub-topic(s) are written to a database 280 or other data repository with a link or other association to the respective data objects of the class that have content common to the terms of the sub-topic.
In one example, the database 280 is used as an index for a search, clustering or data summarization system 290 with the class and sub-topic acting as the index and the link to the data object acting as the indexed item.
In one example, as shown in
In one example, the system 10 receives, via an input/output interface 12, a user input designating one or more of the classes and a user input designating an analysis operation.
In one example, the analysis operation designated is a “zoom” operation that causes the system 10 to return a predetermined number of sub-topics and links to representative documents (data objects). If the zoom analysis operation is repeatedly performed, the predetermined number of sub-topics returned is increased on each repetition (which, while dependent on the content of the data objects, will generally have the effect of increasing the number of terms in each sub-topic in order for multiple distinct sub-topics to be determined and therefore increases the perceived zoom level).
In one example, the analysis operation designated is a “diff” operation that takes as parameters, via the user interface 11 and input/output interface 12, a designation of two or more classes (or a designation of a subset of data objects from the classes) and causes the system 10 to return sub-topics that are unique to the first of the two or more classes (or subset of data objects of the class).
Starting at step 400, a binary data object-term matrix A is generated to represent the terms of the data objects of the classes under consideration.
A ∈ {0,1}^(n×m)
where Aij = 1 if and only if the ith data object contains the jth term in the set of terms representing the data object.
Each row of matrix A represents terms from a respective data object.
The matrix A is dependent on the data objects under consideration but is typically very sparse, and the number of unique terms is usually very large. Each document has an associated class. In the following discussion, it is assumed that there are t classes C = {c1, . . . , ct}, and each document is associated with only one class (single tagging). However, in another example the described approach is applied to multi-tagging, where all the data objects tagged to the class under consideration are treated as belonging to the class and all others as outside the class.
The notation Ac refers to the rows of the matrix A representing data objects in class c, while Ac̄ refers to the rows representing data objects outside class c.
A binary sparse pattern vector is used as the basis for analysis of patterns of terms:
X ∈ {0,1}^(m×1)
where Xi = 1 if the ith word participates in the pattern.
The notation |X| represents the number of words that belong to the pattern vector X. Note that the multiplication AX=Y yields a counter vector that holds in the jth entry the number of words that belong to X and appear in the jth data object.
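A small numeric example may help fix the notation; the matrix values below are illustrative:

```python
import numpy as np

# Binary data object-term matrix A in {0,1}^(n x m): n=4 data objects, m=5 terms.
A = np.array([
    [1, 1, 0, 0, 1],  # data object 0
    [1, 0, 1, 0, 0],  # data object 1
    [0, 1, 1, 1, 0],  # data object 2
    [0, 0, 0, 1, 1],  # data object 3
])

# Pattern vector X in {0,1}^(m x 1): a pattern over terms 0 and 1.
X = np.array([1, 1, 0, 0, 0])

# Y = AX counts, per data object, how many pattern terms it contains;
# |X| is the number of terms participating in the pattern.
Y = A @ X
size = int(X.sum())
```

Here Y is [2, 1, 1, 0]: data object 0 contains both pattern terms, data objects 1 and 2 contain one each, and data object 3 contains none.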
A weights vector is used to guide operation to find relatively rare sub-topics that appear in a relatively small subset of data objects of a class while at the same time finding enough sub-topics to cover most or all of the data objects in the class:
W ∈ ℝ^(n×1), where Σj=1..n Wj = 1 and Wj ≥ 0 for all j.
The weights vector Wc denotes the weights vector for Ac, and Wc̄ denotes the weights vector for Ac̄.
A pattern weight (PW), a weighted Lp-norm of Y, is calculated as:
PW(X, A, W) = (Σj Wj·(Yj)^p)^(1/p)
where Y = AX and p ≥ 1 is a system parameter (discussed below).
A pattern gain (PG), a measure of the difference between the pattern weight inside the class and the pattern weight outside the class, is calculated as:
PG(X, Ac, Ac̄) = PW(X, Ac, Wc) − λ·PW(X, Ac̄, Wc̄)
where λ ≥ 1 is a parameter.
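One runnable reading of the pattern weight and pattern gain is sketched below; the exact form of the weighted Lp-norm, the uniform weights, the value of λ and the toy matrices are assumptions consistent with the surrounding text, not a definitive implementation:

```python
import numpy as np

def pattern_weight(X, A, W, p):
    """Assumed weighted Lp-norm of Y = AX: (sum_j Wj * Yj^p)^(1/p)."""
    Y = (A @ X).astype(float)
    return float(W @ (Y ** p)) ** (1.0 / p)

def pattern_gain(X, A_c, A_nc, W_c, W_nc, p, lam=1.0):
    """PG: pattern weight inside the class minus lambda times the weight outside."""
    return pattern_weight(X, A_c, W_c, p) - lam * pattern_weight(X, A_nc, W_nc, p)

# Toy class of two data objects versus two outside data objects, three terms.
A_c = np.array([[1, 1, 0], [1, 1, 1]])
A_nc = np.array([[0, 0, 1], [0, 1, 0]])
W_c = np.full(2, 0.5)    # uniform weights summing to 1
W_nc = np.full(2, 0.5)
X = np.array([1, 1, 0])  # pattern made of terms 0 and 1

gain = pattern_gain(X, A_c, A_nc, W_c, W_nc, p=1)
```

With p = 1 the pattern weight inside the class is 2.0 and outside is 0.5, giving a gain of 1.5: a pattern concentrated in the class and rare outside it scores high, making it a good sub-topic candidate.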
A pattern that has a high pattern gain measured for a specific class is a good discriminative pattern and possible candidate as a sub-topic.
In one example, the weights vectors Wc and Wc̄ are initialized uniformly, so that each data object initially receives equal weight.
System parameters are initialized as:
phigh = 2 and plow = 1
λ = 1
Ts (seed size) = 5
Tp (pattern maximal size) = 20
Ns (number of seeds grown in parallel) = 10
Continuing at step 410, a group of initial seeds is selected. In one example, the parameter p in this stage is set to be high (typically close to 2).
An initial seed has a small number of terms and is selected as follows, with:
p = phigh = 2
The pattern gain PG(Ii, Ac, Ac̄) is evaluated for each candidate single-term pattern Ii (the indicator vector of the ith term), and the Ns terms with the highest pattern gain are taken as the initial seeds:
Xis = Ii
At step 420, the group of seeds is iteratively grown Ts times.
j = argmaxj′ PG(Xis ∪ Ij′, Ac, Ac̄)
Xis = Xis ∪ Ij
At step 430, the single seed maximizing pattern gain is selected as output of the seed estimation stage:
i = argmaxi′ PG(Xi′s), Xs = Xis
Pattern estimation is then performed, with the parameter p set to be low (typically close to 1). At step 440, the seed maximizing pattern gain, selected as output of the seed estimation stage in step 430, is used to calculate a new weights vector for Ac.
At step 450, the newly calculated weights vector is used to find the pattern of terms that maximizes pattern gain. Since p is set to plow (typically close to or equal to 1), the pattern gain is linear and the contribution of each term i to the pattern gain can be computed independently as follows:
PGi(Ii, Ac, Ac̄)
In step 460, terms are sorted according to their individual contribution:
idxterms = sort(PGi(Ii, Ac, Ac̄))
In step 470, the K terms determined from the sort to have the highest contribution are selected to yield a K term pattern. In one example, K is selected to be larger than seed size Ts and smaller than the pattern maximal size Tp. In one example, pattern size is selected in dependence on magnitude of individual contributions of terms. In one example, a pattern size is selected to include terms up to a maximal decrease in individual contribution in the sorted terms.
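Because the pattern gain is linear in X when p = plow = 1, the per-term contributions of steps 450 to 470 reduce to one weighted column sum per term; the following sketch and its toy values are illustrative assumptions:

```python
import numpy as np

def top_k_terms(A_c, A_nc, W_c, W_nc, lam, K):
    """With p = 1, each term i contributes independently:
    PGi = W_c . A_c[:, i] - lam * (W_nc . A_nc[:, i])."""
    contrib = W_c @ A_c - lam * (W_nc @ A_nc)  # one score per term
    order = np.argsort(contrib)[::-1]          # step 460: sort by contribution
    return order[:K]                           # step 470: keep the top K terms

# Toy class versus out-of-class matrices over four terms.
A_c = np.array([[1, 1, 0, 1], [1, 1, 1, 0]])
A_nc = np.array([[0, 0, 1, 1], [0, 1, 1, 1]])
W = np.full(2, 0.5)

pattern_terms = top_k_terms(A_c, A_nc, W, W, lam=1.0, K=2)
```

Here terms 0 and 1 dominate inside the class and are rare outside it, so they are selected to form the K-term pattern.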
In step 480 the K term pattern is stored in a memory as a sub-topic.
In step 490, a check is performed to decide if further sub-topics should be identified. In one example, the check is dependent on the analysis operation being performed. In one example, the check is dependent on whether all data objects of the class under consideration fall within at least one determined sub-topic. In one example, the check is dependent on the number of sub-topics determined. If further sub-topics are to be identified, Ac is updated to remove the entries for the K terms in data objects matching the K term pattern and Wc is updated to assign more weight to data objects not yet matched to a sub-topic in step 495. Operation then loops to step 410.
The algorithm is iterative: on each iteration one pattern is extracted and removed from the data. The parameter p steers operation of the algorithm. A high p drives selection of combinations of terms that appear together, even if they appear in just a few data objects, leading to a focus on rare terms and very granular sub-topics; a low p drives selection of more common terms that appear in many data objects, even if not always together, resulting in less granular sub-topics that cover more data objects. In one example, p is controlled by use of the categorization.
The functions and operations described with respect to, for example, the data object analyser and/or pattern analyser may be implemented as a computer-readable storage medium containing instructions executed by a processor and stored in a memory. The processor may represent generally any instruction execution system, such as a computer/processor-based system, an ASIC (Application Specific Integrated Circuit), a Field Programmable Gate Array (FPGA), a computer, or other system that can fetch or obtain instructions or logic stored in memory and execute the instructions or logic contained therein. The memory represents generally any memory configured to store program instructions and other data.
Various modifications may be made to the disclosed examples and implementations without departing from their scope. Therefore, the illustrations and examples herein should be construed in an illustrative, and not a restrictive, sense.
Filing Document | Filing Date | Country | Kind
PCT/US2013/039055 | 5/1/2013 | WO | 00