The current invention relates to the fields of mining documents and identifying taxonomies that can be used for organizing the content and for searching the content. The current invention specifically addresses the issue of uniquely generating enterprise specific taxonomies rather than internet-scale or more general taxonomies.
A general approach to organizing content is to mine keywords and named entities and to perform a lookup. For example, large repositories such as Wikipedia® can be categorized using a multi-domain ontology. Various sophisticated queries can be broken down based on the searched categories. This ontology can be developed over time and often has Internet-wide usage and derives from Internet-scale content. A taxonomy, which is part of an ontology, can be thought of as a hierarchical grouping of things, often as a tree structure.
One approach can be to extract key words or phrases out of unstructured content in a document and use an existing system to obtain the taxonomy for the document. For example, given a document, its keywords can be extracted. The document can then be classified and/or organized in a hierarchy, such as “science and technology”→“biology”, etc. or “sports”→“BasketBall”, etc. These types of taxonomies can be derived after reviewing a large set of documents and may be useful for internet scale searches or services or for organizing news articles, etc.
One problem with a generic taxonomy can be seen in the following example. Consider an automobile manufacturing enterprise. Classifying its documents as “automobile” may be of no use as most documents are related to automobiles in one way or another. Even classifying by the kind of “automobile”, such as an “SUV”, etc., may not be useful because users of that enterprise use specific terms, such as their model of the SUV, to search the content. Hence, any taxonomy built on this enterprise corpus should reflect the specific word they use for SUV. This principle contradicts how generic taxonomies are created.
Hence, for an enterprise this kind of generic taxonomy may not be useful, as most enterprises, other than news organizations, etc., are specific to a certain field or to a few fields. Categorizing the content and/or interpreting user queries using this generalized taxonomy is of limited use to most enterprises. An effective organization of enterprise content can be based on the terminology commonly used within the enterprise when content is created and the terminology used when the content is consumed or searched. In particular, it can use the specific term most commonly used in a particular enterprise to represent a generic term that is commonly used across a wider corpus. Current solutions that are based on generic taxonomies do not solve these problems. Accordingly, methods and systems of determining enterprise content specific taxonomies can improve upon the prior art.
Hidden or surrogate tags are often not identified in prior taxonomy based systems. That is, if users of an enterprise use the terms “thermal”, “control”, “systems”, and a particular document uses only the terms “thermal” and “systems”, then the hidden tag “control” is considered a surrogate tag. Though the example lists three terms, it could be applicable to two or more terms. We refer to these as “surrogate tags”. These surrogate tags do not occur in the document but are closely associated with terms in the documents of a particular enterprise.
These surrogate tags are often useful when new employees join a large organization and start authoring documents. Often, they are not adept at using the new organization's terminology. Even in that case, these documents need to be classified, with the surrogate tags identified and an appropriate enterprise taxonomy structure assigned. Such classification and taxonomy structure will help in a) search and b) navigation. For example, if other users search using the common terminology of the enterprise, then the results might not retrieve this new document at all, or even if it is retrieved, it could be ranked low because of the missing surrogate tags. Similarly, for navigation of the taxonomy, if the surrogate tag is the missing link in the hierarchy and it is not found, then (other) enterprise users might not be able to navigate to this new document by the new employee. We need a new system to a) determine these surrogate tags and b) associate them with taxonomies.
Current systems do not generate an enterprise specific taxonomy and are still dependent on words occurring in a document. Hence, they do not capture related missing bridging words that are common in an enterprise and that could be used to classify a document appropriately in an enterprise specific taxonomy.
In one aspect, a method of information retrieval from at least one computer database includes the step of providing a set of digital documents of an enterprise. The method includes the step of providing a tag hierarchy, wherein the tag hierarchy comprises a specified hierarchy of keyword tags. The method includes the step of extracting a set of keywords from the set of digital documents. The method includes the step of clustering the set of keywords into a keyword cluster using an is-a relationship and synsets method. For each keyword in the set of keywords the following steps are performed: selecting a keyword from the keyword cluster, determining that the keyword is in the tag hierarchy, labeling a document in the set of digital documents that includes the keyword with a tag from the keyword, and adding the keyword tag to a document tag list. The method includes the step of rendering the document tag list in a searchable format.
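The per-keyword tagging loop described above can be illustrated with the following minimal sketch. It is an illustration only and not the claimed implementation: the function name `tag_documents`, the dictionary shapes, and the sample data are all assumptions made for this example, and the clustering step is assumed to have already produced a keyword-to-cluster mapping.

```python
from collections import defaultdict

def tag_documents(doc_keywords, keyword_cluster, tag_hierarchy):
    """Return {doc_id: sorted list of tags} for keywords found in the hierarchy.

    doc_keywords:    {doc_id: iterable of extracted keywords}    (assumed shape)
    keyword_cluster: {keyword: cluster label}, e.g. from synsets  (assumed shape)
    tag_hierarchy:   {tag: parent tag or None}                    (assumed shape)
    """
    doc_tags = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for kw in keywords:
            # Fall back to the keyword itself when it has no cluster label.
            label = keyword_cluster.get(kw, kw)
            if label in tag_hierarchy:       # keyword is in the tag hierarchy
                doc_tags[doc_id].add(label)  # label the document with the tag
    return {doc_id: sorted(tags) for doc_id, tags in doc_tags.items()}

# Hypothetical sample data, invented for illustration.
hierarchy = {"vehicle": None, "suv": "vehicle", "engine": "vehicle"}
clusters = {"sport utility vehicle": "suv", "suv": "suv", "motor": "engine"}
docs = {"d1": ["sport utility vehicle", "motor", "warranty"], "d2": ["invoice"]}

tag_list = tag_documents(docs, clusters, hierarchy)
print(tag_list)
```

In this sketch a document with no keyword in the hierarchy simply receives no entry in the document tag list; other embodiments could record an empty list instead.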
Optionally, the method includes the step of providing a provisional tag hierarchy. The method includes the step of determining that the keyword is in the provisional tag hierarchy. The method includes the step of increasing a weight of a link in the provisional tag hierarchy. The method includes the step of determining that the weight of the link achieves a specified value. The method includes the step of adding the tag to the document tag list.
The Figures described above are a representative set, and are not exhaustive with respect to embodying the invention.
Disclosed are a system, method, and article of determining enterprise content specific taxonomies. The following description is presented to enable a person of ordinary skill in the art to make and use the various embodiments. Descriptions of specific devices, techniques, and applications are provided only as examples. Various modifications to the examples described herein can be readily apparent to those of ordinary skill in the art, and the general principles defined herein may be applied to other examples and applications without departing from the spirit and scope of the various embodiments.
Reference throughout this specification to “one embodiment,” “an embodiment,” “one example,” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment,” “in an embodiment,” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
Furthermore, the described features, structures, or characteristics of the invention may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided, such as examples of programming, software modules, user selections, network transactions, database queries, database structures, hardware modules, hardware circuits, hardware chips, etc., to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art can recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, materials, and so forth. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
The schematic flow chart diagrams included herein are generally set forth as logical flow chart diagrams. As such, the depicted order and labeled steps are indicative of one embodiment of the presented method. Other steps and methods may be conceived that are equivalent in function, logic, or effect to one or more steps, or portions thereof, of the illustrated method. Additionally, the format and symbols employed are provided to explain the logical steps of the method and are understood not to limit the scope of the method. Although various arrow types and line types may be employed in the flow chart diagrams, they are understood not to limit the scope of the corresponding method. Indeed, some arrows or other connectors may be used to indicate only the logical flow of the method. For instance, an arrow may indicate a waiting or monitoring period of unspecified duration between enumerated steps of the depicted method. Additionally, the order in which a particular method occurs may or may not strictly adhere to the order of the corresponding steps shown.
Definitions
Cluster can be a grouping in a statistical population.
Cluster analysis (e.g. clustering) can be the task of grouping a set of objects in such a way that objects in the same cluster are more similar to each other than to those in other clusters.
Bigram can be a sequence of two adjacent elements in a string of tokens, which are typically letters, syllables, or words (e.g. n-grams for n=2).
DBpedia can be a project aiming to extract structured content from the information created as part of the Wikipedia project.
‘is-a’ relationship can be a subsumption relationship (e.g. a hyponym-hypernym relationship, etc.) between abstractions (e.g. types, classes, etc.), where one class A is a subclass of another class B (and so B is a superclass of A).
Synset (e.g. synonym ring) can be a group of data elements that are considered semantically equivalent for the purposes of information retrieval.
Tag can represent keywords or phrases that are generic and/or used in the documents of an enterprise.
Taxonomy can include the practice and science of classification of things or concepts, including the principles that underlie such classification.
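As a small illustration of the bigram definition above, word bigrams can be produced by pairing each token with its successor. The function name `bigrams` and the sample phrase are assumptions made for this sketch.

```python
def bigrams(tokens):
    """Return the list of adjacent token pairs (n-grams with n = 2)."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical sample phrase, split on whitespace into word tokens.
pairs = bigrams("thermal control systems".split())
print(pairs)
```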
If cluster ‘t’ is represented in tag hierarchy 104, then process 100 can label the document with the tag from the keyword and continue to the next keyword/bigram. Process 100 can add the corresponding tag to a document tag list. In step 110, process 100 can associate the document list with appropriate tags in the updated tag hierarchy.
If cluster ‘t’ is in provisional tag hierarchy 108, then process 100 can increase the weight of the link in provisional tag hierarchy 108 so that the link is strengthened. Process 100 can then check provisional tag hierarchy 108 to determine if the link strength is higher than a threshold and, if so, insert the link into tag hierarchy 104 and add the corresponding tag to the document tag list. The document tag list can represent the tag clusters that the document belongs to. The document tag list can be used in indexing and/or navigation operations. Keyword correlations can be graphed in step 112.
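The strengthening and promotion of a provisional link can be sketched as follows. This is an illustrative example under stated assumptions, not the implementation of process 100: the function name `strengthen_link`, the fixed increment, the threshold value, and the dictionary shapes are all hypothetical.

```python
def strengthen_link(provisional, hierarchy, doc_tags, parent, child,
                    increment=1.0, threshold=3.0):
    """Increase a provisional link's weight; promote it once above threshold.

    provisional: {(parent, child): weight}  provisional tag hierarchy (assumed shape)
    hierarchy:   {child: parent}            accepted tag hierarchy (assumed shape)
    doc_tags:    set of tags for the current document
    Returns True if the link was inserted into the hierarchy on this call.
    """
    key = (parent, child)
    provisional[key] = provisional.get(key, 0.0) + increment
    if provisional[key] > threshold:  # "higher than a threshold"
        hierarchy[child] = parent     # insert the link into the tag hierarchy
        doc_tags.add(child)           # add the tag to the document tag list
        return True
    return False

# Hypothetical usage: the same link is observed in three documents.
provisional, hierarchy, doc_tags = {}, {"vehicle": None}, set()
results = [
    strengthen_link(provisional, hierarchy, doc_tags, "vehicle", "crossover",
                    threshold=2.0)
    for _ in range(3)
]
print(results, hierarchy, doc_tags)
```

A constant increment is the simplest choice; an embodiment could instead weight each observation, for example by where in the document the keyword occurs.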
If there is a path from node cluster ‘t’ to any other node in the graph, such that all the links are strong with respect to a threshold, then the nodes in the path that are not in the document but are in the tag tree can be selected. These nodes can be the surrogate tags for the document. Surrogate tags can be used to enable other users to retrieve documents that omit a term that is used in other documents, especially documents by new employees who may not yet use the appropriate enterprise terminology. These tags can be derived from a taxonomy tree that is specific to an enterprise and can often be used to search for documents with terms that are not associated with, or do not occur in, the document. Finally, in step 114, process 100 can associate the document list with surrogate tags.
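One way to picture the path-based selection of surrogate tags is a breadth-first traversal that only follows strong links. The sketch below is an assumption-laden illustration: the function name `surrogate_tags`, the adjacency-dictionary graph, the use of "greater than or equal to" for a strong link, and the sample weights are all hypothetical.

```python
from collections import deque

def surrogate_tags(graph, start, doc_terms, tag_tree, threshold):
    """Collect tag-tree nodes reachable from `start` over strong links only.

    graph:     {node: {neighbor: weight}}  keyword correlation graph (assumed shape)
    doc_terms: set of terms that occur in the document
    tag_tree:  set of terms present in the tag tree
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, weight in graph.get(node, {}).items():
            # Assumption: a link is "strong" when weight >= threshold.
            if weight >= threshold and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # Keep nodes that are missing from the document but present in the tag tree.
    return {n for n in seen if n not in doc_terms and n in tag_tree}

# Hypothetical graph built around the "thermal control systems" example.
graph = {
    "thermal": {"control": 0.9, "paint": 0.2},
    "control": {"thermal": 0.9, "systems": 0.8},
    "systems": {"control": 0.8},
    "paint": {"thermal": 0.2},
}
found = surrogate_tags(graph, "thermal", {"thermal", "systems"},
                       {"thermal", "control", "systems", "paint"}, threshold=0.5)
print(found)
```

Because every traversed edge is strong, each reachable node lies on a path whose links all satisfy the threshold, which matches the condition stated above.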
Furthermore, process 200 can provide a provisional tag hierarchy. Process 200 can determine that the keyword is in the provisional tag hierarchy. Process 200 can increase a weight of a link in the provisional tag hierarchy. Process 200 can determine that the weight of the link achieves a specified value. Process 200 can add the tag to the document tag list.
Process 200 can graph the keyword correlations. Nodes in the graph correspond to (key)words occurring in the document corpus. The edges and the edge weights indicate the co-occurrence and the weight of the co-occurrence across all documents. The weight can be a function of where the co-occurrence appears (e.g. in the title or body), whether the co-occurring words are keywords, the rarity of the co-occurrence across documents, etc. Hence, the edge weight represents the strength of the co-occurrence within an enterprise.
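A keyword co-occurrence graph of this kind can be sketched as a dictionary of weighted edges. The example below is a simplified illustration, not the weighting used by process 200: the function name `build_cooccurrence_graph`, the document shape, and the title and body weights are assumptions, and the rarity factor mentioned above is omitted for brevity.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(documents, title_weight=2.0, body_weight=1.0):
    """Return {(word_a, word_b): weight} with word_a < word_b alphabetically.

    documents: list of {"title": [keywords], "body": [keywords]} (assumed shape).
    Each pair of distinct keywords sharing a field adds that field's weight.
    """
    edges = defaultdict(float)
    for doc in documents:
        for field, weight in (("title", title_weight), ("body", body_weight)):
            words = sorted(set(doc.get(field, [])))
            for a, b in combinations(words, 2):
                edges[(a, b)] += weight
    return dict(edges)

# Hypothetical two-document corpus.
corpus = [
    {"title": ["thermal", "control"], "body": ["thermal", "control", "systems"]},
    {"title": [], "body": ["control", "systems"]},
]
edge_weights = build_cooccurrence_graph(corpus)
print(edge_weights)
```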
If, in a document, an extracted keyword phrase corresponds to a path where most of the words match, then the missing nodes in the path can be picked as the surrogate tags. These surrogate tags are associated with documents and are also used to strengthen the tag hierarchy (taxonomy) that is specific to this enterprise. A surrogate tag is used to enable another user to retrieve at least one document that omitted the surrogate term even though the term is used in other documents. Process 200 can associate the document tag list, including the surrogate tags, with the document.
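The "most of the words match" test can be illustrated by comparing a document's keywords against known enterprise paths. In this sketch the function name `surrogates_from_paths`, the representation of a path as a tuple of words, and the interpretation of "most" as more than half are assumptions.

```python
def surrogates_from_paths(doc_keywords, enterprise_paths, min_match=0.5):
    """Return path words missing from the document when most path words match.

    doc_keywords:     iterable of keywords extracted from the document
    enterprise_paths: iterable of word tuples, each a path in the graph (assumed shape)
    min_match:        fraction of a path that must be exceeded (assumed: 0.5)
    """
    doc_keywords = set(doc_keywords)
    surrogates = set()
    for path in enterprise_paths:
        matched = [w for w in path if w in doc_keywords]
        # `matched` is empty for an empty path, which avoids dividing by zero.
        if matched and len(matched) / len(path) > min_match:
            surrogates.update(w for w in path if w not in doc_keywords)
    return surrogates

# Hypothetical paths; the first mirrors the "thermal control systems" example.
paths = [("thermal", "control", "systems"), ("paint", "shop", "robot")]
missing = surrogates_from_paths({"thermal", "systems", "robot"}, paths)
print(missing)
```

Here two of the three words in the first path match, so "control" is picked as a surrogate tag, while the second path matches only one word in three and contributes nothing.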
Example Systems and Computer Architectures
Cluster analysis module 304 can cluster various units of a document (e.g. words, n-grams, phrases, etc.). In one example, cluster analysis module 304 can cluster keywords and/or bigram keywords using an “is-a” relationship and/or “synsets” methods. Cluster analysis module 304 can implement various clustering models. Example clustering models can include, inter alia: connectivity models, centroid models, distribution models, subspace models, graph-based models, etc. It is noted that both hard and fuzzy clustering methods can be utilized.
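As one hedged illustration of clustering with synsets and “is-a” relations, the sketch below merges keywords that share a synonym ring or that are linked as subclass and superclass. It uses a union-find structure, which is a choice made for this example rather than a technique named in the specification, and it performs hard clustering only. The function name `cluster_keywords` and the sample inputs are hypothetical.

```python
def cluster_keywords(keywords, synsets, is_a):
    """Group keywords that share a synset or are linked by an is-a relation.

    synsets: list of sets of equivalent terms  (assumed input)
    is_a:    {child term: parent term}         (assumed input)
    Returns a sorted list of sorted clusters for stable output.
    """
    parent = {k: k for k in keywords}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        # Terms outside the keyword set are ignored.
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    for ring in synsets:
        members = [t for t in ring if t in parent]
        for other in members[1:]:
            union(members[0], other)
    for child, superclass in is_a.items():
        union(child, superclass)

    groups = {}
    for k in keywords:
        groups.setdefault(find(k), []).append(k)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical keywords, synonym ring, and is-a relation.
clusters_out = cluster_keywords(
    ["suv", "sport utility vehicle", "crossover", "invoice"],
    synsets=[{"suv", "sport utility vehicle"}],
    is_a={"crossover": "suv"},
)
print(clusters_out)
```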
Tag hierarchy module 306 can implement tag hierarchy-related operations such as those provided in processes 100 and 200 supra. For example, tag hierarchy module 306 can determine if a cluster is represented in a particular tag hierarchy. Indexing and navigation module 308 can utilize keyword tags generated by taxonomy system 300 for various indexing and navigation operations.
In some embodiments, system 500 can include and/or be utilized by the various systems and/or methods described herein to implement any of the processes and/or examples provided supra. Client 502 can be an application (such as a web browser, augmented reality application, text messaging application, email application, instant messaging application, etc.) operating on a computer such as a personal computer, laptop computer, mobile device (e.g. a smart phone) and/or a tablet computer. In some embodiments, computing environment 500 can be implemented with the server(s) 504 and/or data store(s) 508 implemented in a cloud computing environment.
Additional Processes
It is noted that, in some embodiments, the term taxonomy can be used to loosely classify a tag hierarchy as a tree (or forest) structure with generic terms as nodes and the “is-a” relationship capturing the parent-child relationship. The process provided herein can automatically identify tags, build a taxonomy, and assign tags/taxonomy to individual documents in a corpus based on the enterprise specific usage of the terms.
Process 600 can create nodes in the taxonomy tree, provisional or otherwise, by identifying nodes in the graph that have high connectivity and obtaining the relationship of the correlation. For example, a pair of highly connected nodes in the graph, such as “output” and “fidelity”, can be resolved to its most generic relationship to create a node in the taxonomy such as “speaker” or “sound system”, based on a user defined taxonomy. That is, given a set of matched generic “is-a” relations (e.g. from DBpedia and/or standard ontology based systems), process 600 can filter the relations using the taxonomy term from the user defined taxonomy of that enterprise and/or an equivalent term that is often used in the enterprise. Accordingly, process 600 can build a taxonomy that is closer to the terminology used in an enterprise rather than depending on external, overly generic ontology systems.
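The filtering of generic “is-a” relations against an enterprise's own terms can be sketched as follows. This is an illustration under stated assumptions and not the implementation of process 600: the function name `resolve_taxonomy_node` is hypothetical, the `generic_is_a` dictionary is a stand-in for an external lookup (no real DBpedia API is called), and the sample relations are invented.

```python
def resolve_taxonomy_node(pair, generic_is_a, user_taxonomy, enterprise_synonyms=None):
    """Pick a taxonomy node for a pair of strongly connected keywords.

    generic_is_a:        {keyword: set of generic parent terms}; a hypothetical
                         stand-in for an external ontology lookup
    user_taxonomy:       set of terms from the user defined taxonomy
    enterprise_synonyms: {generic term: equivalent term used in the enterprise}
    Returns the chosen term, or None if no shared parent survives the filter.
    """
    enterprise_synonyms = enterprise_synonyms or {}
    a, b = pair
    shared = generic_is_a.get(a, set()) & generic_is_a.get(b, set())
    for term in sorted(shared):  # sorted for deterministic output
        if term in user_taxonomy:
            return term
        if term in enterprise_synonyms:
            return enterprise_synonyms[term]
    return None

# Hypothetical relations for the "output" and "fidelity" example.
relations = {
    "output": {"audio device", "signal"},
    "fidelity": {"audio device", "quality"},
}
node = resolve_taxonomy_node(
    ("output", "fidelity"),
    relations,
    user_taxonomy={"speaker", "sound system"},
    enterprise_synonyms={"audio device": "speaker"},
)
print(node)
```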
The provisional tree of 610 can capture relations that are significant but not quite strong enough to be pushed into the actual taxonomy structure. As more documents are added to the system, with more evidence, the provisional tree can push nodes to the actual taxonomy tree if the edge strength crosses a certain threshold. It is noted that these nodes can be pushed at an appropriate level in the actual taxonomy tree. Elements 616 and 618 of process 600 illustrate assigning auto tags/taxonomy structure to a document. For each old or new document, the steps described supra can be followed, building upon the original, provisional, and/or user-defined taxonomy trees. Once complete, the subgraph that matches this particular document can be projected onto the final taxonomy tree using the canonical form. The projected subtree can then be used as the taxonomy for this document D for organizing, search, and/or navigation.
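The projection of a document's matched tags onto the final taxonomy tree can be pictured as keeping each tag together with its chain of ancestors. The sketch below is a simplified illustration: the function name `project_onto_taxonomy` and the child-to-parent dictionary are assumptions, and the canonical form mentioned above is not modeled.

```python
def project_onto_taxonomy(doc_tags, taxonomy_parent):
    """Return {node: parent} for the document's tags and all their ancestors.

    doc_tags:        iterable of tags matched for the document
    taxonomy_parent: {node: parent or None}  final taxonomy tree (assumed shape)
    Tags that are not in the taxonomy are skipped.
    """
    subtree = {}
    for tag in doc_tags:
        node = tag
        # Walk up to the root, stopping early at nodes already collected.
        while node is not None and node in taxonomy_parent and node not in subtree:
            subtree[node] = taxonomy_parent[node]
            node = taxonomy_parent[node]
    return subtree

# Hypothetical taxonomy tree and document tags.
taxonomy = {
    "vehicle": None, "suv": "vehicle", "engine": "vehicle",
    "turbo": "engine", "finance": None,
}
projected = project_onto_taxonomy(["turbo", "suv", "unknown"], taxonomy)
print(projected)
```

The resulting subtree contains only the branches relevant to the document, which is the part of the taxonomy a user would navigate to reach it.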
Conclusion
Although the present embodiments have been described with reference to specific example embodiments, various modifications and changes can be made to these embodiments without departing from the broader spirit and scope of the various embodiments. For example, the various devices, modules, etc. described herein can be enabled and operated using hardware circuitry, firmware, software or any combination of hardware, firmware, and software (e.g. embodied in a machine-readable medium).
In addition, it can be appreciated that the various operations, processes, and methods disclosed herein can be embodied in a machine-readable medium and/or a machine accessible medium compatible with a data processing system (e.g. a computer system), and can be performed in any order (e.g. including using means for achieving the various operations). Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. In some embodiments, the machine-readable medium can be a non-transitory form of machine-readable medium.
This application hereby incorporates by reference the following applications in their entirety: U.S. Provisional Patent Application No. 61/663,169, titled Cloud Based Content Management and filed on 22 Jun. 2012, and U.S. patent application Ser. No. 13/915,327, titled Method And System Of Cloud-computing Based Content Management And Collaboration Platform With Content Blocks, and filed on 26 Jun. 2013.