This invention relates generally to computer-implemented knowledge management systems and more specifically to computer systems that recommend documents to users of document corpora.
Current computing systems make available vast quantities of digital documents, such as articles, technical talks, Wiki pages, slide shows, and the like. The sheer quantity of available data can make it difficult for users to locate the documents that are most pertinent to their particular interests. Recommendation systems address this problem by presenting the users with a selected set of documents chosen based on some prior knowledge of the user's interests.
However, conventional recommendation systems have a number of shortcomings. For example, many conventional systems rely on domain-specific knowledge, such as customer habits regarding the purchase of movies. This places a great burden on the creator of the system to discover such knowledge and to design a custom recommendation system based on that knowledge, and does not permit an administrator to define corpora (i.e., distinct sets of documents) in a straightforward manner. Other conventional systems, such as many of those oriented towards retail sales, use social networking techniques (e.g., collaborative filtering), which rely on data about the interactions of other users with the various documents to infer the documents in which a particular user would be interested. However, the effectiveness of this technique is a function of the amount of the data on the interactions of other users, and thus systems with a small corpus or few users may not be able to beneficially employ social networking techniques.
Disclosed is a system and method for providing recommendations of documents to a user of a document corpus—i.e., a particular collection of documents, such as those relating to technical talks, books on science, and the like. In some organizational environments, there can be a number of distinct corpora, and each is administrable by a corpus administrator. In one embodiment, the corpora are further grouped according to a domain to which they belong. The present invention is of particular applicability where the number of documents and users of a given corpus is sufficiently small to be managed by a corpus administrator, or where there are a number of distinct corpora with which users of a single organization interact differently. These are scenarios in which conventional recommendation systems have low utility.
In one embodiment, document features are extracted and assigned weights, and a profile is likewise created for each of the various users. Then, the documents are scored with respect to a given user based at least in part on the document features and the user's profile. The document scores are adjusted based on organization-specific information to reflect organizational goals, such as promoting recommendation of newer documents. Based on the scores, recommendations are determined for a given user by identifying the top scores for that user, and the recommendations are presented to the user. In one embodiment, recommendations are provided within a web-based user interface; in another they are provided via email; in another they are provided as an RSS feed; in still another they are provided as gadgets or frames embedded within other applications. Interactions of the users with recommendations are monitored and the recommendations updated accordingly.
In one embodiment, a computer-implemented method presents to a user selected portions of an organization's corpora, the corpora comprising documents, the method being carried out by a processor configured to determine a set of weighted terms for each of a plurality of the documents, to construct a user profile including user interest areas, to calculate a score for each of the plurality of the documents based on correlation between the weighted terms and the user profile, to adjust the calculated scores based at least in part on rules specified by the organization, and to present the adjusted and scored items to the user.
The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter.
These and other features of the invention's embodiments are more fully described below. Reference is made throughout the description to the accompanying drawings, in which:
The recommendation system 110 comprises a corpus definitions repository 111, which defines each corpus in the system. In one embodiment, a corpus has a name, a set of associated documents, and (optionally) a set of associated users. As used herein, a “document” is a digital representation of information. A word processing file is a common example, but documents include many other things as well, such as digital representations of calendared events (e.g., a talk scheduled for a particular place at a particular time). The associated documents need not be stored on the recommendation system 110 itself; rather, in one embodiment only identifiers (e.g., URLs, path and file names) of the documents need be stored. The data for the documents can be stored on the recommendation system 110, on systems available on a network (e.g., 150) that is local to the recommendation system 110, or on a remote system. In one embodiment, the associated users are represented by identifiers, such as operating system user IDs, of users interested in documents pertaining to that particular corpus. The documents for a given corpus need not all be of the same operating system file type, e.g., a text file or a presentation file for a particular presentation application, but rather can represent the conceptual category of the corpus. For example, in an exemplary embodiment one corpus is named “Technical Presentations,” has a set of 20 associated technical presentations in formats such as ADOBE PDF, Microsoft PowerPoint, word processing formats, event announcements and the like, and has two hundred associated users.
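The corpus record described above can be pictured with a minimal sketch. The class name `CorpusDefinition`, its field names, and the example identifiers are hypothetical and chosen only for illustration; the specification does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class CorpusDefinition:
    """Hypothetical record for one corpus (illustrative names only)."""
    name: str
    # Only identifiers (URLs, paths) are kept; document data may live elsewhere.
    document_ids: Set[str] = field(default_factory=set)
    # Optional identifiers of users interested in this corpus.
    user_ids: Set[str] = field(default_factory=set)
    # Optional domain grouping; None when corpora are not grouped into domains.
    domain: Optional[str] = None


# Example usage with made-up identifiers.
talks = CorpusDefinition(name="Technical Presentations")
talks.document_ids.update({
    "https://example.org/talks/caching.pdf",
    "/shared/talks/storage-overview.ppt",
})
talks.user_ids.add("jdoe")
```

Storing identifiers rather than document contents mirrors the point made above that the documents themselves may reside on the recommendation system, on a local network, or on a remote system.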
In one embodiment, the corpora are further grouped into domains. For example, an organization administering the recommendation system 110 creates individual domains for each organization that wishes to obtain its own personalized access to the recommendation system 110. The implementation for this embodiment is similar to that described above, with the addition of an association between a corpus and a domain.
A document features repository 112 stores a set of features for each document of the various corpora defined by corpus definitions repository 111. Features of a document represent its concepts, and in one embodiment consist of words and multi-word phrases (“n-grams”). In one example, a document on fishing has associated with it in the features repository 112 the set of terms “salmon”, “fly”, “reel”, “rod”, and “fishing vessel”, with each having an associated value (also referred to as a “weight”) quantifying how relevant the term is to the document. Features of a document could be present in the document itself, could be derived from a user-specified label, or could represent a category to which the document was assigned (e.g. “technical presentations”), for example. In one embodiment, the terms are chosen from a discrete set of possible terms, such as a set of 50,000 terms known to be useful in characterizing a document for search and recommendation purposes. As with other data storage repositories described below, the document features repository 112 is implemented in a conventional manner, such as a table of a conventional relational database management system, a text file, or a specialized binary file. Other manners of implementing repository 112 will be known to one of skill in the art.
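As a rough illustration of the feature records just described, the following sketch keeps weighted terms per document and discards terms outside a fixed vocabulary. The function name `store_features`, the tiny vocabulary, and the weights are assumptions made for the example; an in-memory dictionary stands in for whichever storage mechanism is actually used.

```python
# Hypothetical stand-in for the discrete set of allowed terms.
VOCABULARY = {"salmon", "fly", "reel", "rod", "fishing vessel", "tennis"}


def store_features(repository, doc_id, weighted_terms):
    """Keep only vocabulary terms and file them under the document identifier."""
    kept = {term: weight for term, weight in weighted_terms.items()
            if term in VOCABULARY}
    repository[doc_id] = kept
    return kept


# Example usage: weights are illustrative, not produced by any real algorithm.
features_repository = {}
store_features(
    features_repository,
    "doc-fishing",
    {"salmon": 0.9, "fly": 0.6, "fishing vessel": 0.4, "the": 0.01},
)
```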
A profile features repository 113 stores features, such as terms, associated with users. In one embodiment, each user has an associated profile, the profile storing terms chosen from the same set of possible terms as for the document features repository. The terms represent the interest areas of the user, each having an associated weight quantifying the relevance of the term to the user. As described further below, the terms and their weightings are derived from sources such as documents associated with the user, areas of interest explicitly entered by the user, and user interactions with recommended documents.
A document scores repository 114 stores scores for the various documents identified in the corpus definitions repository 111, each score quantifying the relevance of a given document to a given user. In one embodiment, the score is calculated based at least in part on a function of a profile for the user and the document features for the document.
Recommendation logic processor 115, as described further below, is a subsystem that determines which documents are most relevant to a given user, and provides a list of the recommended documents to the user.
A corpora management interface 116 provides a user interface allowing administration of corpora. A root user interface allows a root administrator responsible for administration of the recommendation system 110 as a whole to perform tasks such as adding new domains, e.g. by specifying a new document type. A corpus administrator interface allows a corpus administrator to perform tasks such as adding new corpora (e.g., by specifying a new document type), specifying which documents should be included within the corpus, specifying when document features and scores should be calculated or recalculated, and the like. Such features are illustrated in more detail with respect to
A corpus recommendation user interface module 117 generates the user interface displaying and allowing interaction with the recommendations for a particular corpus. In one embodiment, the user interface is constructed using a browser-based scripting language such as JavaScript, which can be rendered within a conventional web browser, e.g. as a particular module added by a user to a web page.
Referring again back to
In one embodiment, the features and weights extracted automatically by the weighting algorithm are supplemented by additional features and weights associated with the document, such as any tags that the user has associated with the document. The features and weights are then stored in the document features repository 112 in association with an identifier of the document from which they were extracted.
A profile construction module 220 populates the profile features repository 113. In one embodiment, the profile construction module 220 creates an initial profile for a given user based on available data sources. One data source is directory information available within the organization having the domain or corpus of which the user is a member, such as Lightweight Directory Access Protocol (LDAP) information stored on the organization's directory servers, e.g. personnel data available within a company tracking attributes such as age, sex, department, and the like. Another data source is a set of particular non-directory documents associated with the user and stored within the organization, such as a resume of the user or other document indicative of the user's interest areas. Terms are extracted from the document and weighted using the algorithms described above.
Use of these data sources allows the organization to leverage existing information that it stores about the user to produce higher-quality profiles than are created for systems which lack such pre-existing data about the user. In another embodiment, the user explicitly indicates terms of interest, such as by specifying a set of keywords (e.g. “tennis”, “Victorian literature”, etc.). Such explicitly-indicated terms in one embodiment are then assigned a weight higher than the weight of any non-explicitly-indicated term, representing the high degree of utility of explicit interests. In one embodiment, the profile construction module 220 additionally updates a user's profile, e.g. based on interactions of the user with documents, such as viewing initially, viewing for some period of time, printing, saving, emailing, explicitly marking the document as favored or disfavored using a user interface, and the like. For example, if a user is provided with a set of recommended documents and views a document having given terms, the value of those terms within the user's profile within the profile features repository 113 can be increased by an appropriate amount. In one embodiment, the effect of an interaction on the value of terms within the profile may decrease over time as the interaction ages. In one embodiment, different interactions lead to different profile update actions. In an example, viewing the document leads to a lesser increase in the value than printing the document, an action that presumably indicates more serious interest on the part of the user than does viewing. As another example, marking a document as disfavored leads not merely to reducing the values in the user profile for the terms within the document, but also to removing that document, possibly permanently, from any recommendations later provided to that user.
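The interaction-driven profile updates described above might be sketched as follows. The boost values, the penalty, the 30-day half-life, and the use of exponential decay for aging are all assumptions introduced for illustration; the text only requires that printing count for more than viewing, that older interactions may count for less, and that disfavoring both lowers term values and excludes the document.

```python
# Assumed per-interaction boosts; only their ordering (print > view) follows the text.
INTERACTION_BOOST = {"view": 0.1, "save": 0.2, "email": 0.2, "print": 0.3}
DISFAVOR_PENALTY = 0.3   # assumed value
HALF_LIFE_DAYS = 30.0    # assumed aging rate


def apply_interaction(profile, blocked, doc_id, doc_terms, kind, age_days=0.0):
    """Update profile term values in place for one interaction with one document."""
    # Older interactions contribute less (exponential decay is an assumption).
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
    if kind == "disfavor":
        # Exclude the document from later recommendations and lower its terms.
        blocked.add(doc_id)
        for term, weight in doc_terms.items():
            lowered = profile.get(term, 0.0) - DISFAVOR_PENALTY * weight * decay
            profile[term] = max(0.0, lowered)
        return
    boost = INTERACTION_BOOST[kind]
    for term, weight in doc_terms.items():
        profile[term] = profile.get(term, 0.0) + boost * weight * decay


# Example usage with made-up data.
profile, blocked = {}, set()
apply_interaction(profile, blocked, "doc-1", {"salmon": 1.0}, "view")
apply_interaction(profile, blocked, "doc-1", {"salmon": 1.0}, "print")
```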
A document score calculator 230 calculates a score for a given document with respect to a given user based on a correlation between the feature weights generated by the feature extraction module 210 and the profiles generated by the profile construction module 220. In one embodiment, the correlation algorithm is a conventional cosine similarity algorithm, which calculates the cosine of e.g. tf-idf vectors of terms for the user's profile and for the document being scored. In another embodiment, the document scores are not calculated independently of each other, but rather influence each other. In one example, a document scoring algorithm is designed to spread knowledge throughout the organization by recommending every document of the corpus to at least one user of the corpus. Such an approach is useful for avoiding institutional knowledge gaps that can come to exist for reasons such as employee attrition. This algorithm addresses an optimization problem in which the goal is to maximize the standard correlation measure matches between users and documents and to minimize the overlap (or maximize the completeness) of the coverage of all the documents in the corpus by the employees. The scores are calculated with respect to all of the users of the corpus at once through conjugate gradient, Monte Carlo, or other optimization techniques. In some embodiments, a number of algorithms are available, and the choice of which particular algorithm to use for a given corpus is made by the corpus administrator via the corpora management interface 116. The document scores are then stored in the document scores repository 114 in association with an identifier of the user and the document to which they correspond.
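For the cosine similarity embodiment mentioned above, a minimal sketch over sparse term-weight dictionaries could look like the following. The function name `cosine_score` and the sample weights are illustrative; the weights would typically be tf-idf values, but any non-negative weighting fits this code.

```python
import math


def cosine_score(profile, doc_features):
    """Cosine of the angle between two sparse term-weight vectors."""
    shared_terms = set(profile) & set(doc_features)
    dot = sum(profile[t] * doc_features[t] for t in shared_terms)
    norm_profile = math.sqrt(sum(w * w for w in profile.values()))
    norm_doc = math.sqrt(sum(w * w for w in doc_features.values()))
    if norm_profile == 0.0 or norm_doc == 0.0:
        return 0.0  # an empty vector has no direction; treat as unrelated
    return dot / (norm_profile * norm_doc)


# Example usage with made-up weights.
user_profile = {"salmon": 3.0, "rod": 4.0}
fishing_doc = {"salmon": 1.0}
tennis_doc = {"tennis": 1.0}
fishing_score = cosine_score(user_profile, fishing_doc)
tennis_score = cosine_score(user_profile, tennis_doc)
```

Because the score depends only on the direction of the vectors, a long document and a short one with the same term proportions receive the same score for a given profile.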
At step 320, which may be performed before, in parallel with, or after step 310, an initial profile is created for a user, as described above with respect to the profile construction module of
At step 330, documents from corpora 130 are scored by the document score calculator 230 as described above with respect to
At step 340, the document scores are adjusted as desired based on the current context and the document features. A number of different adjustment rules may be used, and in one embodiment are specified by the corpus administrator via the corpora management interface 116. For example, one adjustment rule biases the score in favor of more recent documents, e.g. by calculating an amount of time between a date of the document (e.g., a creation or modification date) and a set date, increasing the score as a function of the calculated amount of time if the document date is after the set date, and decreasing it otherwise. Adjustment of scores based on document recency can also be accomplished via exponential decay according to a specified document half-life. Another rule biases the score based on the document type or the document itself, e.g. specifying a multiplier value for the score of documents of type “tech talk”, or for a specified “tech talk” document deemed (e.g., by the corpus administrator) to be of particular interest. Still another rule increases the weight of documents that are specific to a user's organization (e.g., company) and increases the weight yet further for documents that are specific to the department or unit of the organization in which the user works. Such rules can also be used to limit the number of results, e.g. through a specified maximum number of results or through a specified minimum score (i.e., a threshold).
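Some of the adjustment rules above can be combined in a short sketch: half-life decay for recency, a per-type multiplier, a minimum-score threshold, and a cap on the number of results. The function names, the dictionary layout of each scored document, and all numeric values are hypothetical, and clamping future-dated documents to an age of zero is an assumption.

```python
from datetime import date, timedelta


def recency_factor(doc_date, today, half_life_days):
    """Exponential decay: the factor halves every half_life_days."""
    age_days = max(0, (today - doc_date).days)  # future dates clamp to 0 (assumption)
    return 0.5 ** (age_days / half_life_days)


def adjust_and_limit(scored_docs, today, half_life_days=180.0,
                     type_multiplier=None, min_score=0.0, max_results=10):
    """Apply recency and type rules, then the threshold and the result cap."""
    type_multiplier = type_multiplier or {}
    adjusted = []
    for doc in scored_docs:
        score = doc["score"] * recency_factor(doc["date"], today, half_life_days)
        score *= type_multiplier.get(doc["type"], 1.0)
        if score >= min_score:
            adjusted.append((doc["id"], score))
    adjusted.sort(key=lambda pair: pair[1], reverse=True)
    return adjusted[:max_results]


# Example usage with made-up documents.
today = date(2024, 6, 29)
docs = [
    {"id": "a", "score": 0.8, "date": today - timedelta(days=180), "type": "article"},
    {"id": "b", "score": 0.5, "date": today, "type": "tech talk"},
    {"id": "c", "score": 0.1, "date": today, "type": "article"},
]
ranked = adjust_and_limit(docs, today, type_multiplier={"tech talk": 2.0},
                          min_score=0.2)
```

Here document "a" is exactly one half-life old, so its score is halved, while the "tech talk" multiplier doubles the score of document "b".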
At step 350, recommendations for a particular user are determined by the recommendation provider module 240, as described above. They are then provided to the user. In one example, the results are displayed within the user interface provided by the corpus recommendation UI, such as the corpus recommendation user interfaces discussed with respect to
At step 360, the user's interactions with documents are monitored. As previously described, different interactions with a document could indicate an interest level of the user in the document, such as viewing, printing, emailing, saving, explicitly marking as favored or disfavored, and the like.
At step 370, if the user interactions monitored at step 360 result in a modification of the user's profile, then the recommendations for that user are likewise updated.
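The update at step 370 can be sketched as a refresh that runs only when the profile has actually changed. The names `refresh_recommendations` and `dot_score` are hypothetical, and a plain dot product stands in for whichever scoring algorithm the corpus uses.

```python
def dot_score(profile, features):
    """Simple stand-in scorer: dot product of two sparse term-weight vectors."""
    return sum(weight * features.get(term, 0.0) for term, weight in profile.items())


def refresh_recommendations(old_profile, new_profile, doc_features,
                            blocked=frozenset(), top_n=5):
    """Return a new ranked list of document ids, or None if nothing changed."""
    if new_profile == old_profile:
        return None  # profile unchanged, keep the existing recommendations
    candidates = [doc_id for doc_id in doc_features if doc_id not in blocked]
    candidates.sort(key=lambda d: dot_score(new_profile, doc_features[d]),
                    reverse=True)
    return candidates[:top_n]


# Example usage with made-up data.
doc_features = {"d1": {"salmon": 1.0}, "d2": {"tennis": 1.0}}
before = {"salmon": 1.0}
after = {"salmon": 1.0, "tennis": 2.0}
updated = refresh_recommendations(before, after, doc_features)
```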
A user interface such as that of
It is appreciated that methods carrying out the above-described steps need not include the exact steps, formulas, or algorithms disclosed above, nor need they be in the same precise order. Rather, variations on the scope and functionality of the individual steps, and on the order thereof, are possible while still accomplishing the aims of the present invention.
As used herein any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
In addition, the words “a” or “an” are employed to describe elements and components of the invention. This is done merely for convenience and to give a general sense of the invention. This description should be read to include one or at least one and the singular also includes the plural unless it is obvious that it is meant otherwise.
Certain aspects of the present invention include process steps and instructions described herein in the form of a method. It should be noted that the process steps and instructions of the present invention could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.
The present invention also relates to a system for performing the operations herein. This system may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored on a computer readable medium that can be accessed by the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions, each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
Upon reading this disclosure, those of skill in the art will appreciate that still additional alternative structural and functional designs are possible. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the present invention is not limited to the precise construction and components disclosed herein and that various modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus of the present invention disclosed herein without departing from the spirit and scope of the invention as defined in the appended claims.