1. Field of the Invention
This invention relates generally to software that manages document retrieval and ranking, and more particularly to providing a method, article, and system for utilizing the explicit metadata of retrieved documents, the extracted intrinsic metadata inside the content of retrieved documents, and knowledge of user-document relationships as important parameters in calculating relevance or ranking score for retrieved documents.
2. Description of the Related Art
There are many different document-ranking methods for query results. A large number of them are optimized in terms of performance, recall, and precision ratios for searching relevant documents on the Web. Some advanced ranking methods retrieve and utilize the user's searching preferences, selection types, and histories. Ranking of retrieved scientific or technical documents usually uses keywords in titles, abstracts, contents, and metadata. Ranking of other retrieved documents often includes keywords in contents, metadata, Okapi formulae, semantics, correlation factors, and others. Some advanced ranking methods also monitor and record, for example, the sites from which the user frequently selects documents in a query list of retrieved documents, and the user's preferred document types. The information used in these ranking methods is typically linked to cookies stored on the client side by search engines. Some search engines may store only a user key on the client side as a single cookie and use it to retrieve detailed information stored on their servers. This information is used for calculating the relevance scores of matching documents returned from the query or search on the search engine's database. The retrieved documents are then ranked and sorted according to their relevance scores before being sent to the client and displayed to the user.
Additional advanced ranking methods also utilize information about the relevant documents retrieved from a query or search. These ranking methods can calculate relevance scores for the retrieved documents based on their popularity, where they originated and who created them, and whether their document types match the user's preferences and selection histories. In the case of scientific or technical documents, which contain a unique title, abstract, authors, keywords, subject, and outline, methods for calculating relevance scores from the documents' contents are also well defined.
However, in the enterprise and business world, hundreds of electronic documents, in particular business-related documents, are created and stored each day. These electronic business documents can be procurements, purchase orders, invoices, agreements, contracts, and other types of business-related documents. In the case of business and contract documents, there is some explicit metadata associated with the document, such as creation date, modification and accessed dates, title, subject, author, manager and company, category, keywords, comments, and so on, which the user can add in the document properties in a word processor such as Microsoft Word. For a Portable Document Format (PDF) document, the user can add title, subject, keywords, created and modified dates, Uniform Resource Locators (URLs), and a search index as document properties. However, there may be no unique title for each type of business and contract document, as many business or contract documents will have the same title if they are created from the same business or contract template. In addition, business documents often share the same set of keywords, have little metadata, have varying levels of security access control, and may require parsing and text extraction from documents in various formats (e.g., PDF, TIFF). Thus, calculating document relevance scores or sorting the retrieved business or contract documents based solely on their explicit metadata is not sufficient to guarantee high precision and reliable recall ratios.
For business-related documents (including forms), there is a need to look inside the contents of the retrieved business or contract documents to reveal their relevance with respect to a user's query. As a result, it is necessary to calculate their relevance scores based not just on their explicit metadata, but more importantly on their extracted implicit metadata, such as company name and contract numbers, ordered or purchased items, customer name and address, and other parameters. Moreover, the user may not be authorized to access all the retrieved business or contract documents. Some users may be able to access only those contracts that they created. Furthermore, most users would prefer to see retrieved documents that belong to their own departments at the top of the list, ahead of retrieved documents that belong to other departments. In general, users prefer to see active contract documents above expired contract documents in the retrieval list. A user may also want contract documents with high monetary or unit values ranked higher than contract documents with low values. However, none of the document ranking methods in use today has the ability to utilize the extracted implicit metadata of retrieved documents, or the relationship between the user and the retrieved documents constructed from the explicit metadata and the extracted implicit metadata.
The present invention is directed to addressing, or at least reducing, the effects of one or more of the problems set forth above, by utilizing not just the explicit metadata of a retrieved document, but also the extracted intrinsic metadata inside the content of the retrieved document, as well as knowledge of the user-document relationship obtained by relating the document's explicit metadata and extracted implicit metadata to the user's and document's information on the system's database, as important parameters in calculating a relevance or ranking score for retrieved documents.
A method for managing document retrieval and ranking from a system, wherein the method includes: determining explicit metadata of the retrieved document; extracting intrinsic metadata from inside the content of the retrieved document; wherein the explicit metadata and the intrinsic metadata comprise document information; establishing knowledge of the user-document relationship by relating the document information to a user's information on a document system or search engine database (server) or retrieved from the user's system (client); calculating a relevance or ranking score for each of the retrieved documents based on the explicit metadata, the intrinsic metadata, and the knowledge of the user-document relationship, as well as the static and dynamic ranking rules constructed from the user's information or inputted directly by the user or an administrator of a group of users; and wherein the method further comprises: entering a query by a user into the system with a client user module; constructing a system query by the system based on said entering; retrieving information about the user by the system; reconstructing the system query with the user information by the system; sending the reconstructed system query from the client user module to an application server by the system; retrieving the document in response to the reconstructed system query by the application server; constructing static or dynamic ranking rules from the user's information or input from the user or administrator; and ranking the retrieved document by the application server.
An article including one or more machine-readable storage media containing instructions that when executed enable a processor to access a document retrieval and ranking program in a system that comprises computer servers, mainframe computers, desktop computers, and mobile computing devices; and wherein the document retrieval and ranking program facilitates document searches; and wherein the document retrieval and ranking program provides for managing document retrieval and ranking from the system by utilizing not just explicit metadata of a retrieved document, but also extracted intrinsic metadata inside content of the retrieved document, and static and dynamic ranking rules constructed from the user's information or inputs from the user or administrator (responsible for a group of users), knowledge on user and retrieved documents dynamically built from the retrieved user and document information from the user's system (client side), the systems and database of the retrieved document and search engine (server side), and the dynamically constructed user-document relationships based on the relationship rules and the dynamic knowledge of the user and retrieved document, as important parameters in calculating relevance or ranking score for the retrieved documents.
A system for managing document retrieval and ranking by utilizing not just explicit metadata of a retrieved document, but also extracted intrinsic metadata inside the content of the retrieved document, and knowledge and ranking rules dynamically built on a user and the retrieved document based on the extracted data, and forming a dynamically constructed user-document relationship based on the static relationship rules retrieved from the system or dynamic relationship rules inputted by the administrator, and the knowledge of the user and the retrieved document obtained by relating document implicit metadata to a user's information on the systems or databases of the user, retrieved documents, and search engines, as important parameters in calculating a relevance or ranking score for the retrieved documents; wherein the system includes computing devices and at least one network; and wherein the computing devices implement the document database; and wherein the computing devices further include: computer servers; mainframe computers; desktop computers; and mobile computing devices; and wherein the computing devices execute electronic software that manages the document retrieval and ranking; and wherein the electronic software is resident on a storage medium; and wherein the computing devices have the ability to be coupled to the network; and wherein the network further includes: a local area network (LAN); a wide area network (WAN); a global network; the Internet; an intranet; wireless networks; and cellular networks.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with advantages and features, refer to the description and to the drawings.
The subject matter that is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
Embodiments of the present invention provide a method and system for knowledge-based ranking of retrieved business documents among enterprises, their partners, and customers in a standalone or Web-based application. Knowledge is based on the profiles and preferences of an individual user; explicit metadata and dynamically extracted implicit metadata from business document properties; dynamically built user and document knowledge; dynamically constructed specific user-document relationship parameters based on relationship rules inputted statically or altered dynamically by a user or an administrator; and static and dynamic ranking rules either built from the retrieved user's information or taken from the user's input. The invention defines and builds specific user-document relationship parameters between an individual user and each retrieved document. User input or default values for these specific relationship parameters and their weighting factors are used in calculating ranking scores of retrieved documents.
Examples of user-document relationship parameters employed by preferred embodiments of the present invention include, but are not limited to, the following:
The user module 104 has GUIs that provide a means for inputting the query and communicating with the application server 106, allowing the user to customize and store personal and business-related information related to the document on the client side over a network interface such as the Internet. Users are required to input their personal and business-related document information at least once; however, they can update this information as often as they wish. Within the client user module 104, the user first selects the query type 110, such as terms, key words, content search, quotation search, or semantic search. Second, the user enters the query terms 112. Third, the system constructs the query 114 based on the user's query type and terms. Fourth, the system retrieves the user's information 118, such as the user reference number, from the client cookie. Fifth, the system reconstructs the query 120 with the user information and sends the query 122 to the application server 106. Other query parameters can also be entered by the user.
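The client-side steps above (selecting the query type, entering terms, constructing the query, reading the user reference from a cookie, and reconstructing the query) might be sketched as follows. All function names, field names, and values here are illustrative assumptions, not taken from the specification.

```python
def construct_query(query_type: str, terms: str) -> dict:
    """Step three: build the base query from the query type and terms."""
    return {"type": query_type, "terms": terms}


def reconstruct_query(query: dict, cookies: dict) -> dict:
    """Steps four and five: attach the user reference number retrieved
    from the client cookie before the query is sent to the server."""
    query = dict(query)  # copy so the caller's query is not mutated
    query["user_ref"] = cookies.get("user_ref")
    return query


# Hypothetical cookie holding the user reference number.
cookies = {"user_ref": "U-1029"}
q = construct_query("keyword", "supply contract 2024")
q = reconstruct_query(q, cookies)
# q now carries both the search terms and the user reference number
```

The reconstructed query is what the client user module would send to the application server 106.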
The search module 138 within the application server 106 first receives the query with user information from the user module. Second, it parses 136 and executes the query 134. Third, it retrieves query documents with relevance scores 132 from any generic search engine (not shown). Fourth, the system retrieves explicit metadata from document properties 130. Fifth, it retrieves implicit metadata using any generic parsing and extraction tools, such as a PDF parser, to parse and extract implicit metadata. Sixth, the system 100 retrieves the document information from the system document database 128, such as the owner, department, status, and access control of the document. Seventh, the system parses the user information sent from the user module 140. Eighth, the system retrieves user information 142, such as which department the user belongs to and the access level of the user. Ninth, the system builds the knowledge of the relations between the document and the user 146, such as comparing their departments, the relationship between the document owner and the user, and which document access level the user's access level matches 144. Tenth, the system 100 filters the retrieved documents down to those the user can see or access, according to the knowledge obtained from the user-document relationship. A partial score can be calculated 148 according to access control levels.
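Step ten, filtering the retrieved documents by the user-document access relationship, could look like the following sketch. The access rule and all field names are illustrative assumptions rather than the specification's own data model.

```python
def accessible(user: dict, doc: dict) -> bool:
    """A user may access a document if the user's access level is
    among the document's permitted access levels (illustrative rule)."""
    return user["access_level"] in doc["access_levels"]


# Hypothetical retrieved documents with their permitted access levels.
docs = [
    {"id": "C-001", "access_levels": {1, 2, 3}},
    {"id": "C-002", "access_levels": {3}},
]
user = {"access_level": 2}

# Keep only the documents the user can see or access.
visible = [d for d in docs if accessible(user, d)]
# visible contains only C-001
```

The surviving documents would then proceed to the partial-score calculations described next.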
The numeric expression of access level of the user to the document is as follows:
score(a) = w_a × (1 − (a_d − a_u)/a_n)   equation (1)

where w_a is the weighting factor for access level, and a_d, a_u, and a_n are the document's access level, the user's access level, and the highest possible access level, respectively.
If the user's access level does not belong to any of the document access levels, score(a)=0.
Eleventh, the system 100 calculates the partial score 148 based on the relationship between the departments the user belongs to and the document as follows:
score(d) = w_d × (1 − (d_d − d_u)/d_n × (g_d − g_u)/g_n)   equation (2)
Twelfth, the system 100 calculates the partial score 148 based on the user's ownership level of the document as follows:
score(e) = w_e × (e_u/e_n)   equation (3)
Thirteenth, the system 100 calculates the partial score based on the document's status level as follows:
score(s) = w_s × (s_d/s_n)   equation (4)
Finally, the final relevance score for ranking retrieved documents is given by the total score 124 in equation (5) as follows:
total score = score(a) + score(d) + score(e) + score(s)   equation (5)
Similarly, a partial score contributed by other relationships between the user and the document can be calculated in the same way as equation (1) or (2). A partial score from other explicit and implicit user parameters can be accounted for in the same way as equation (3). A partial score based on explicit and implicit document parameters can be derived from an equation similar to equation (4).
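As an illustration only, the partial-score equations (1) through (5) can be sketched in a few lines of code. All weighting factors, level values, and example inputs below are hypothetical choices for demonstration, not values prescribed by the specification.

```python
def score_access(w_a, a_d, a_u, a_n, doc_levels):
    """Equation (1): score(a) = w_a * (1 - (a_d - a_u)/a_n).
    Returns 0 if the user's access level is not among the
    document's permitted access levels."""
    if a_u not in doc_levels:
        return 0.0
    return w_a * (1.0 - (a_d - a_u) / a_n)


def score_department(w_d, d_d, d_u, d_n, g_d, g_u, g_n):
    """Equation (2): department/group relationship score."""
    return w_d * (1.0 - (d_d - d_u) / d_n * (g_d - g_u) / g_n)


def score_ownership(w_e, e_u, e_n):
    """Equation (3): score based on the user's ownership level."""
    return w_e * (e_u / e_n)


def score_status(w_s, s_d, s_n):
    """Equation (4): score based on the document's status level
    (e.g. active versus expired)."""
    return w_s * (s_d / s_n)


# Equation (5): the final relevance score is the sum of the partial
# scores. Example values: weights 0.4/0.3/0.2/0.1, user access level 2
# within the document's levels {1, 2, 3}, and so on.
total_score = (score_access(0.4, 3, 2, 4, {1, 2, 3})
               + score_department(0.3, 2, 2, 5, 1, 1, 3)
               + score_ownership(0.2, 1, 2)
               + score_status(0.1, 1, 1))
```

With these example inputs the partial scores are 0.3, 0.3, 0.1, and 0.1, giving a total of 0.8.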
The algorithm of the Ranking Module relies on building specific user-document relationship parameters 216 based on user information 210, document information 212, and default or user dynamically defined user-document parameters (208, 222). The equation to calculate an individual score 218 of each specific user-document relationship parameter is as follows:
p(i) = 1.0 − [u(i) − d(i)]/n(i), normalized to 1;
where u(i) and d(i) are the relative ranks of a particular parameter i, such as the department rank, for the user and the document respectively, and n(i) is the highest possible rank. For example, if the user department rank is 80, the document department rank is 60, and the highest possible rank is 100, then their difference is 20 and the normalized score p(i) is 0.8.
The ranking score 220 is calculated by adding up scores of all user-document relationship parameters with their weighting factors using:
total score = sum_i [w(i) × p(i)] / sum_i [w(i)], normalized to 1;
where w(i) is the weighting factor for parameter i, and sum_i denotes the summation of all the scores over i.
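The generalized parameter score and weighted total above can be sketched as follows. Function names and the example ranks are illustrative only; the worked department-rank example from the text is reproduced assuming a highest possible rank of 100.

```python
def parameter_score(u_i, d_i, n_i):
    """p(i) = 1.0 - [u(i) - d(i)] / n(i), where u(i) and d(i) are the
    user's and document's relative ranks for parameter i and n(i) is
    the highest possible rank."""
    return 1.0 - (u_i - d_i) / n_i


def ranking_score(params):
    """Weighted average of all parameter scores, normalized to 1.
    `params` is a sequence of (w_i, u_i, d_i, n_i) tuples."""
    total_weight = sum(w for w, _, _, _ in params)
    weighted = sum(w * parameter_score(u, d, n) for w, u, d, n in params)
    return weighted / total_weight


# Worked example from the text: user department rank 80, document
# department rank 60, highest possible rank 100 -> p(i) = 0.8.
p = parameter_score(80, 60, 100)
```

A second equally weighted parameter with a perfect score of 1.0 would, for instance, lift the combined ranking score to 0.9.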
The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
While the preferred embodiment of the invention has been described, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first described.