With the growing number of documents on the internet and on intranets, a method of organizing and searching these vast quantities of information is needed. Users frequently look for documents based on search terms and boolean operators. However, existing content search mechanisms do not provide users with insight into the actual content sources, dependencies, context, or other relationships of potentially relevant documents or content modules. In existing content search mechanisms, it is not possible to pre-select the most suitable document or content module from a list of candidates by applying and comparing various criteria. Instead, a user searching for information has to retrieve and review several documents before he is able to decide which document is actually the most appropriate for his purpose.
Most relations between documents are not stored. If the relations do exist, they are typically hidden from the user or held in a format that is not readily accessible to or understandable by a user (e.g. a format intended for the internal use of a search algorithm). A method and system are therefore needed to quantify the value of information and to determine dependencies between documents in a form that is easily viewed and understood by a user searching for documents. Moreover, in the process of standardizing complex material lists of comprehensive sets of documents and content across documents, there is no easy way to create transparency about existing dependencies of content across documents, or about how these dependencies could be optimized. A method and system are also needed to take the calculated values and criteria and display this complex set of dependencies to a user.
FIGS. 3a to 3c depict three example templates that may be used in a content analytics system.
FIGS. 4a to 4g depict various example displays of a content matrix utilizing content analytics and corresponding metadata to create values corresponding to various assets. This information may be presented to a user. Corresponding keys are also displayed to a user to explain the color/value scheme utilized in the content matrix.
Content, meaning any type of meaningful content (e.g., a benefit statement, a customer pain point, etc.), is continuously used to populate repositories in intranets or over the world-wide web. This content is found in assets, for example, documents, videos, images, or any other type of file that can hold content. In intranets in particular, assets are typically related, meaning that there are content dependencies. The assets are typically text documents, but they can also be graphical or audio/visual; thus, documents and assets are referred to interchangeably.
Content dependencies can be a useful attribute of a document to quantify because users, when searching for content, typically collect pieces of information that are related to each other. Furthermore, in creating new documents, information is derived from other related documents. For example, an engineer may create a document discussing the technical specification of a product, for example an automobile. This technical information may then be reused and integrated into marketing documents that describe advertising aspects of the automobile. The same technical information may also be used by a sales team to create a document analyzing the geographic and demographic target customers for the automobile. A manufacturing team may create another document, using the same technical specification, to determine a budget and profit margins based on the design. Each of these downstream documents, i.e. documents that are derivative of information contained in an original or source document, would have dependencies with the parent document, i.e. the document from which a document is directly derived. Documents may have various dependencies. For example, the manufacturing team may use both a technical specification and a document on estimated sales created by the sales team to create its document regarding budget.
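As a non-limiting illustration, the parent and downstream relationships described above could be held as a simple mapping from each document to the documents it was derived from. The sketch below is only one possible representation; the document names, the `PARENTS` mapping, and the `downstream_of` function are hypothetical and are not part of the described system.

```python
from collections import defaultdict

# Hypothetical parent links: each document maps to the documents
# from which it was directly derived (names are illustrative only).
PARENTS = {
    "Tech Spec": [],
    "Marketing Doc": ["Tech Spec"],
    "Sales Estimate": ["Tech Spec"],
    "Budget": ["Tech Spec", "Sales Estimate"],
}


def downstream_of(source, parents):
    """Return every document that directly or indirectly derives from `source`."""
    # Invert the parent links so each parent lists its direct children.
    children = defaultdict(list)
    for doc, doc_parents in parents.items():
        for parent in doc_parents:
            children[parent].append(doc)

    # Walk the inverted links depth-first, collecting each document once.
    found, stack = set(), [source]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return sorted(found)
```

Under these assumptions, `downstream_of("Sales Estimate", PARENTS)` would identify the budget document as the only document depending on the sales estimate.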
In particularly large organizations, content is typically distributed in a standard format. Information from these documents is reused and amended in various forms in other assets. Often, organizations will have templates for documents that are frequently disseminated as official documents. Other assets can also be added to an intranet database as needed. Content analytics uses the fact that document templates and assets are related. Users are therefore able to retrieve information faster and also find documents that are most up-to-date. By using a content analysis view (e.g. a content matrix, graphical display, etc.), that is, a display that shows values, attributes, or dependencies referring to content as represented by metadata (or by analysis thereof), users can determine not only the types of assets that have the content they are searching for, but also the types of content that have not yet been populated in a particular database.
There may be times when a user 103 will author asset or document types 200 that are not of a pre-defined format. For example, during the “Authoring” phase 208, a user 103 may submit an asset or document instance 201 as is and let the system determine ad hoc the value and relations of that document.
Alternatively, a user 103 may create an asset or document type 200 first, so that the content analytics system has a new document type established in the system. The user would then author a document using the template he created. Having a pre-defined document allows the content analytics system to later readily determine relationships 206, an important aspect in creating a “Document/Content Ontology” 211. An ontology, as used in knowledge management systems, is the hierarchical structuring of knowledge by subcategorizing information according to its essential quantitative values.
In another example, a user may create a document and an embodiment of the invention may use various natural language parsing techniques to determine context and relationships 206 to give values to the document. Thus, an embodiment would be able to integrate both pre-defined and non-pre-defined document types into a content analytics system.
During the “Authoring” phase 208, a plurality of document instances 201 are created by a plurality of users 103 and 104. These documents are stored in document storage 202, typically a server or database. In the “Management” phase 209, each document that was created is assigned metadata 205 (e.g. values or attributes), which is stored as part of a “Registry” along with a reference to the document's location in the document storage. Metadata is “data about data,” that is, data about the information found in the documents. This metadata may be general attributes about the data, such as file type, author, history of edits, etc. Metadata may also be calculated values, such as quantity of words, number of pages, amount of reuse of content, etc. Metadata may also be other data values specified by specific systems or by the user. These values or metadata will later be used to return the best search results during the “Information Retrieval” stage 203. Some of the data may be displayed to a user in a content analysis view and other data may be used to help create the content analysis view. The “Authoring” phase 208 also encompasses updates of document instances 201 that already exist in document storage 202. During the update of source document instances 201, an embodiment of the content analytics system may dynamically calculate the impact of these changes as they relate to dependent documents and bring that impact to the attention of the authors of these dependent documents.
During the “Information Retrieval” stage 210, a user 103 or 104 would gain access to content assets by using a navigational approach, a search approach, or a combination of both 203. In the navigational approach, a content analysis view is displayed to the user in order to illustrate the relationships and other dimensions along which to navigate. In a search approach, the system may use the metadata that is otherwise used by the content analysis view to calculate an order of relevance in the background, without the knowledge of the user, and to deliver the most relevant content assets to the top of a search result list. An embodiment of the content analytics system may utilize the relationships 206 of the Document Ontology 211 and the Metadata 205 of the Registry 212 and, through OLAP 204 (Online Analytical Processing), may provide a user 103 or 104 with a content analysis view created ad hoc.
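By way of example only, a registry entry of the kind described above might combine general attributes, a calculated value, and a reference to the document's storage location. The following sketch assumes a dictionary keyed by storage location; the `RegistryEntry` class, the `register` function, and all field names are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    """Hypothetical registry record: metadata plus a pointer into document storage."""
    location: str      # reference to the document's location in storage
    author: str        # general attribute
    file_type: str     # general attribute
    word_count: int    # calculated value
    extra: dict = field(default_factory=dict)  # system- or user-specified values


def register(registry, location, author, file_type, text, **extra):
    """Calculate simple metadata for `text` and store it under its location."""
    entry = RegistryEntry(location, author, file_type, len(text.split()), dict(extra))
    registry[location] = entry
    return entry
```

In this sketch the only calculated value is a word count; other calculated values mentioned above, such as number of pages or amount of reuse, would be added in the same manner.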
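As one illustrative possibility for the search approach, an order of relevance could be calculated as a weighted sum of metadata values. The weighting scheme, the `rank_assets` function, and the metadata keys below are assumptions made for this sketch and are not prescribed by the embodiments described herein.

```python
def rank_assets(assets, weights):
    """Order assets by a weighted sum of their metadata values, highest first.

    `assets` is a list of dicts with a "name" and a "metadata" dict of numeric
    values; `weights` maps metadata keys to their (hypothetical) importance.
    """
    def score(asset):
        # Metadata keys without a weight contribute nothing to the score.
        return sum(weights.get(key, 0.0) * value
                   for key, value in asset["metadata"].items())

    return sorted(assets, key=score, reverse=True)
```

A user would see only the resulting order of assets; the scores themselves would remain in the background.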
FIGS. 3a to 3b provide example document templates that may be used with an embodiment of the content analytics system. The example that will be used throughout this application is a search for automobile data within an automobile manufacturing company. Within this example automobile company, all assets are contained within an example embodiment of a content analytics system.
In
b is another document template, and as shown by the document type 300, it is a “Marketing Document.” Once again there are various major headings 301, sub-headings 302, and indicators 303.
FIGS. 4a and 4b depict a content matrix, one example of a type of content analysis view, that has been retrieved by a user 103 or 104 in a search regarding an example automobile model. At the top of the content matrix is the title 400 providing a general description of the types of documents that have been returned. In this instance, “Standard Assets” populate the content matrix. An asset would be any content that would be returned by a search, such as a document. An embodiment of the present invention may divide document types further into multiple categories of documents. For example, a newly created document may by default be a “standard asset”; however, the Document Ontology of the embodiment may also have separate categories, such as “external documents,” where all the assets would be documents provided by vendors or customers of the company. Having this separate category of document types provides a further breakdown of document types and provides more metadata to help a user later narrow his search of documents or assets.
In
An advantage of using different colors/number values to quantify the content of documents is that users are able not only to determine the level of content within each asset, but also to compare assets against each other. For example, a user looking at the “Auto Advertising” document may see that cell 404 indicates that, in the “Other” major content element, it has the value “3”. A user would be able to compare this to the “Auto Tech Document”, which has a value of “1” 411 for the corresponding “Other” major content element. A user could then determine that the “Auto Advertising” asset has more content for the “Other” major content element than the “Auto Tech Document” asset. Thus, an advantage of an embodiment of the invention is the ability not only to evaluate the content within a document, but also to compare that content with the content of other documents.
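The comparison described above can be illustrated with a small sketch in which the content matrix is held as a nested mapping from asset to content element to value. The two values shown are those of the example above; the data structure and the `richest_asset` function are hypothetical.

```python
# Cell values for the "Other" major content element, taken from the example above.
content_matrix = {
    "Auto Advertising": {"Other": 3},
    "Auto Tech Document": {"Other": 1},
}


def richest_asset(matrix, element):
    """Return the asset with the highest value for the given content element."""
    # Assets that lack the element are treated as having a value of 0.
    return max(matrix, key=lambda asset: matrix[asset].get(element, 0))
```

Here `richest_asset(content_matrix, "Other")` selects the “Auto Advertising” asset, matching the conclusion a user would draw from the displayed cells.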
b is an expanded view of
However, content matrices can also have many different types of views depending on the needs of a user. In the case of a content matrix, a view is a display of certain types of metadata depending on the metric used to populate values in the cells. For example, one view of the content matrix may show values based on a metric of quantity of content. Another view of the content matrix may show values based on a metric of quality of the content. Users can use different views to gather data about the assets, and also to compare and sort the information found in the different views. Thus, in an alternative view of the content matrix, e.g. a quality and relevance-based metric, as is shown in
In
c represents an example key that may help a user understand the value and color scheme of a particular view of a content matrix. In the particular example key, the colors and values correspond to the metric of quantity of content. In alternative embodiments, the colors and values in a key may correspond to quality and relevance of particular types of content. In yet other embodiments, the values may correspond to a mixture of quantity and quality. The content matrix view (like all content analysis views) is adjustable such that various embodiments may display assets and metadata values based on any number of metrics. The advantage of using colors associated with values is that the colors help a reader easily visualize the metadata corresponding to content and allow for easy determination of missing content and of the relevance and usefulness of the assets.
d depicts the same content matrix listed in
e depicts another display type or view for a content matrix (only a portion of the screen is shown for convenience). In this example, a metric for calculating cell values is based on percent of content reuse. Content reuse may be evaluated based on quantity, quality, relevance, date and time of edits, or any other available metric.
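Purely as an illustration of a quantity-based reuse metric, percent of content reuse could be approximated as the share of an asset's sentences that also appear in a source asset. Sentence-level exact matching is an assumption of this sketch, as is the `reuse_percent` function; an embodiment may evaluate reuse in any other manner.

```python
def reuse_percent(asset_sentences, source_sentences):
    """Return the percentage of an asset's sentences also found in a source asset.

    Matching is exact after trimming whitespace and lowercasing, which is a
    deliberately simple, hypothetical stand-in for a real reuse metric.
    """
    if not asset_sentences:
        return 0.0
    source = {sentence.strip().lower() for sentence in source_sentences}
    reused = sum(1 for sentence in asset_sentences
                 if sentence.strip().lower() in source)
    return 100.0 * reused / len(asset_sentences)
```

The resulting percentage could then populate a cell of the content matrix in a reuse-based view.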
Like the view based on the metric of quality and relevance shown in
f represents a key, similar to the one shown in
g represents an example asset profile, in this case a document profile. The document profile has various attribute elements 419 that provide additional information about the document. Furthermore, the content elements 402 and corresponding information are reflected on the bottom half of the profile. In this case, values are listed for each content element, as in the content matrix, and in addition detailed information regarding each content element is provided. The major content elements 402 may also be expanded down (although this is not shown in the Figure). Furthermore, the percentage of reuse is shown, and the particular source or parent assets from which content is reused are also provided. Alternative asset profiles may also list information regarding different views of the content matrix based on different metrics of evaluating assets.
f displays an additional example view that is not changed due to a different metric, but rather another OLAP principle, in this case changing dimensions. In the figure, the title 420 indicates to the user that the view is that of “Comparing Assets.” The Menu Header on the vertical axis 421 indicates that the target elements 422 are listed. The individual cells 423 would indicate the relevance of a particular asset 424 to a particular target audience 422. For example, only “50%” of the content of the “Auto Tech Document” is relevant to the target audience of “customers.” In this figure, the same key and shading scheme is used as in the previous examples. Further OLAP principles may be applied with the content analytics system. For example, when doing a drill up or drill down, a user can change the level of aggregation that is displayed from high level to a lower level. The highest level of aggregation could be used to visualize a Chief Operating Officer's view of content that comes from different Lines of Business (Engineering, Marketing, Sales, etc.) by aggregating all content values into “Master” document types and comparing these against each other. On the next lower level, this master asset type would be broken down into the actual asset types. In the example provided, it could answer questions such as “What asset types do we have to describe our various cars and how are these related?” or “How can I optimize those dependencies and define a better asset set?”. The next lower level asset instance view would be relevant for an employee who works in a specific segment who needs to know exactly which documents exist in that segment (e.g. “what assets do we really have at the moment for a particular series?”). Other OLAP principles would be utilizable with a content analytics system.
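A drill up of the kind described above can be sketched, under stated assumptions, as an aggregation of asset-type values into their “Master” document types. Summation is only one possible aggregation, and the `drill_up` function, the mapping of asset types to Lines of Business, and the example values are hypothetical.

```python
from collections import defaultdict


def drill_up(values, master_of):
    """Aggregate per-asset-type values into totals per master document type.

    `values` maps an asset type to a content value; `master_of` maps each
    asset type to its (hypothetical) master type, e.g. a Line of Business.
    """
    totals = defaultdict(int)
    for asset_type, value in values.items():
        totals[master_of[asset_type]] += value
    return dict(totals)
```

A drill down would reverse this step by displaying the individual asset types, and then the asset instances, that contribute to each aggregated total.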
In populating the database with a plurality of assets, a user first receives an asset or document template 600, or alternatively creates one if the asset type is not recognized by the system, and then populates the asset with content 601. The user then submits the asset to a database server 107 or document storage 202. The asset is then evaluated, using various metrics, to determine values to populate as metadata in a database registry 212. The metadata is also automatically created and used to create an asset profile. The user then has the option of editing the asset profile that was created 604 or simply accepting the asset profile created. The asset's metadata is then placed in the registry 605.
During information retrieval, a user will request assets 606 that fit a general criterion. An embodiment of the content analytics system, using OLAP 204, the metadata 205, and the document/content ontology 211, gathers the appropriate data and determines relations between the documents 607 and the values needed to populate a content analysis view. The complete list of assets and their corresponding data would be displayed to the user in a content analysis view 608. The content elements in a content analysis view may be represented as different types of views, for example a content matrix, meaning that the values may represent various metrics, for example, quantity of content, quality of content, date of creation of content, etc. The various types of views may in turn be manipulated not only by using different metrics but also by applying different principles as found in OLAP environments. The user would be able to change the specific type of view between various types of metrics or between different views within a type of view. The content analysis view would also enable access to assets or content elements 609, for example in a drill through.
Several embodiments of the present invention are specifically illustrated and described herein. However, it will be appreciated that modifications and variations of the present invention are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the invention.