The present invention is related to systems and methods for data and/or information analysis, in particular for knowledge management and/or user modeling.
Analysis of data compilations, including statistical analysis of relationships in the data and future trend analysis, is an area of wide application. For example, in a typical enterprise setting, information regarding entities such as employees is usually updated manually, which often results in data of poor quality. Individuals may provide incomplete profiles, may not invest the necessary effort in creating a rich and accurate profile of themselves, or may not keep the data up to date as their interests, responsibilities, and expertise change. At best, individuals often provide only a few keywords on their expertise, making it difficult to identify the better experts among many people with similar expertise. Yet for a manager who is in charge of multiple groups with different responsibilities and capabilities, it is desirable to have this information in hand. For service personnel facing customer problems, it is desirable to be able to draw on the problem-solving expertise of all of the individuals within the organization. An internal expertise mining system would be an advantageous tool for understanding and managing the expertise and potential of individuals within an enterprise, who usually are the most valuable assets of the enterprise.
The field of “knowledge management” is receiving recognition as the gains to be realized from a systematic effort to store and export the vast knowledge resources held by employees of an organization become apparent. The sharing of knowledge broadly within an organization offers numerous potential benefits to the organization through the awareness and reuse of existing knowledge, and the avoidance of duplicate efforts. In order to maximize the exploitation of knowledge resources within an organization, a knowledge management system may be presented with two primary challenges, namely (1) the identification of knowledge resources within the organization and (2) the distribution and accessing of information regarding such knowledge resources within the organization. In contrast to systems where individuals manually input their expertise information, it has been proposed to build such expertise profiles passively, i.e., by analyzing e-mail messages and other content sources in order to build a representative profile of a person or entity. Traditional information retrieval techniques have been applied to address the problems of expertise matching and mining. See P. Liu, J. Curson, P. M. Dew, “Exploring RDF for Expertise Matching Within an Organizational Memory,” Conference on Advanced Information Systems Engineering, pp. 100-116 (2002); A. Mockus, J. D. Herbsleb, “Expertise Browser: A Quantitative Approach to Identifying Expertise,” Proceedings of the 24th International Conference on Software Engineering, pp. 503-512 (May 2002). However, prior art approaches have usually described expertise as a vector, which can fail to provide a rich and accurate description of an entity's expertise. Often there is no explicit description of the relationship among the different categories of expertise, nor of the evolution of the expertise.
Systems and methods for data and/or information analysis are disclosed herein which may be directed to knowledge management and/or user modeling and may utilize relational representations and/or evolutionary representations of information, for example, expertise information. In contrast to prior art vector-based approaches, expertise profiles may be represented as, for example, graphs. Evolutionary social network models and exponential random graph models may be incorporated into a user model analysis. For example, the personalized social network for an individual, which includes how other individuals evaluate her and how she evaluates herself as well as other individuals, may be used to construct the expertise profile. The context semantics may be assumed to evolve due to interaction of the entity with different multi-modal information sources, such as text and citation links. Classification and clustering techniques may be used to address detection of concepts and the structural and semantic units comprising the context model. Classification accuracy may be boosted by utilizing the citation linkages between texts in the classification methodology. The knowledge management system, accordingly, may utilize an expertise representation that explicitly provides relational and/or evolutional information for user modeling. Since the relationship information and the temporal evolution of the expertise are explicitly modeled, a richer and more accurate description of an entity's expertise may be provided, which may be useful for mining, retrieval, and visualization. The knowledge management system may also provide innovative mechanisms for analyzing and indexing multiple disparate modalities in order to extract relationships/correlations across the heterogeneous information sources related to an entity.
The present invention may introduce social network concepts into user modeling. A user-centric modeling approach is disclosed which may be used to dynamically describe and update an expertise profile. The present invention may enhance collaboration and productivity in an enterprise environment, e.g., by quickly finding entities with complementary expertise, or entities with a specified expertise. These and other advantages of the invention will be apparent to those of ordinary skill in the art by reference to the following detailed description and the accompanying drawings.
The present invention is directed to systems and methods for data and/or information analysis. The systems and methods may be directed to knowledge management and/or user modeling. In various embodiments, the systems and methods may utilize relational representations and/or evolutionary representations of information, for example, expertise information and/or evolutional information related to expertise information. The systems and methods may be at least in part included in, for example, a computer system, a computer network, the Internet and/or a computer readable medium. Various exemplary embodiments are provided herein to illustrate at least some of the possible applications for the present invention, but the invention is not limited thereto. For example,
At 110 in
$P(X = x) = m_x / M$
where X is a variable which indicates the citation information for each publication, x represents one of the categories, $m_x$ is the number of citations belonging to the category x, and M is the number of references. The intuition behind incorporating these features is that a paper from one category tends to cite the papers in the same area. This relational structure is useful for classification. It can be shown that incorporating this feature can boost the classification accuracy significantly.
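By way of illustration only, the citation feature defined above may be computed as in the following minimal sketch. The function name `citation_category_distribution` and the category labels are hypothetical and are not part of the disclosure; the sketch assumes that each reference of a publication has already been assigned a category.

```python
from collections import Counter


def citation_category_distribution(cited_categories, categories):
    """Return P(X = x) = m_x / M for every category x.

    cited_categories: category label of each reference in one publication
    categories: all category labels under consideration (illustrative)
    """
    total = len(cited_categories)       # M, the number of references
    counts = Counter(cited_categories)  # m_x for each category x
    if total == 0:
        # A publication without references carries no citation evidence.
        return {x: 0.0 for x in categories}
    return {x: counts.get(x, 0) / total for x in categories}


# Hypothetical publication with four references.
features = citation_category_distribution(
    ["ML", "ML", "IR", "ML"], ["ML", "IR", "NLP"]
)
```

The resulting dictionary may be appended to the textual features of the publication before classification.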
After analyzing the textual data and the linkages, the knowledge management system will have a representation of the expertise information that categorizes the different publications and linkages. For example,
As illustrated in
At least one embodiment of the data analysis and/or knowledge management system 180 according to the present invention may be as shown in
In at least one embodiment, the data extractor 183 may receive input information from the dataset 182. The dataset may be co-located with or remote from the engine 181. The data extractor 183 may analyze the input data for the presence or absence of one or more characteristics or features deemed to be of interest to the user. In at least one embodiment, the data extractor 183 may compile the extracted information of interest that is associated with a particular person or group into a profile for that person or group. The data extractor 183 may utilize a variety of extraction techniques such as, for example, pattern recognition and/or image analysis techniques.
The data analysis module 184 may receive the information from the data extractor 183 and may generate a ranking for the person or group associated with a desired characteristic or classification. In an embodiment, the data analyzer 184 may determine a strength of relationship or evolution based on, for example, the quantity and quality of the characteristics present for the various entities of interest. The data analyzer 184 may base the analysis on a comparison of each characteristic found to a search query that specifies desired characteristic(s). The data classification module 185 may classify the data into various categories according to a user query. The data may also be classified according to temporal information. In at least one embodiment, the relationship and/or evolutionary representation generator 186 may generate a representation of relationship and/or evolutionary aspects of the data and characteristics that result from a user query. This information may be in various forms, for example, a table or list, and may include weighting of various characteristics and interrelationships. In at least one embodiment, the graph generator 187 may generate a relational and/or evolutionary network representation, for example, an ExpertiseNet, to display the results of a user query. This may include, for example, the relationships for a person or group in a given timeframe or as may be observed over time. In at least one embodiment, the data finding and matching module 188 may provide recommendations and/or predictions regarding the various user queries. It should be recognized that the present system may be included in a computer system or network such as a PC, an intranet, or the Internet. Further, software to operate the system may be included on a computer readable medium and may be written using, for example, the C++ programming language, etc. Operation of the system components will be described in more detail below.
The relational representation 130, as may be derived using various processes such as those shown in
The relational representation 130, for example, can be formulated as G(N, E), where N represents the node set and E represents the edge set. Two nodes $n_i$ and $n_j$ are adjacent if edge $e_{ij} = (n_i, n_j)$ or $e_{ji} = (n_j, n_i)$ is in the set of edges E. In the relational ExpertiseNet, each node may represent an expertise. The size of a node may be proportional to the strength of the node, which is defined as:
$s_i = p_i$
where $s_i$ represents the strength of the expertise i for the person, and $p_i$ is the number of publications of the person in category i resulting from classification, as described above. The edges may represent the relationship between the expertise nodes. Two types of relational ExpertiseNets may be defined: Directed ExpertiseNet and Undirected ExpertiseNet. When the database contains citation linkage information, Directed ExpertiseNets may be built in which the edges have directions to indicate the directions of influences between the expertises. When the database does not contain citation linkage information, Undirected ExpertiseNets may be built in which the edges do not have directions.
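As a non-limiting sketch, the graph G(N, E) and the node strengths described above may be held in a small data structure such as the following. The class name `ExpertiseNet`, its fields, and the helper `build_nodes` are illustrative assumptions only; the edge strength passed to `add_edge` is an arbitrary placeholder value.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExpertiseNet:
    """Hypothetical container for a relational ExpertiseNet G(N, E)."""
    nodes: dict = field(default_factory=dict)  # expertise -> strength s_i
    edges: dict = field(default_factory=dict)  # (n_i, n_j) -> edge strength
    directed: bool = True

    def add_edge(self, a, b, strength):
        # Undirected nets store each pair once, in sorted order.
        key = (a, b) if self.directed else tuple(sorted((a, b)))
        self.edges[key] = strength

    def adjacent(self, a, b):
        # Two nodes are adjacent if e_ij or e_ji is in the edge set E.
        if self.directed:
            return (a, b) in self.edges or (b, a) in self.edges
        return tuple(sorted((a, b))) in self.edges


def build_nodes(paper_categories, directed=True):
    """Set s_i = p_i, the number of the person's publications in category i."""
    net = ExpertiseNet(directed=directed)
    net.nodes = dict(Counter(paper_categories))
    return net


# Hypothetical classification result for five publications of one person.
net = build_nodes(["ML", "ML", "IR", "NLP", "ML"])
net.add_edge("ML", "IR", 0.4)  # placeholder strength, for illustration only
```

Passing `directed=False` yields the undirected variant that may be used when citation linkage information is unavailable.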
It is advantageous to use the correlations among different categories to decide the edges of the representation. The text and citation linkages provide possibilities to build the edges. Citation linkages can provide solid evidence of correlations among different expertise. For example, if a paper in category “A” cites many papers in category “B”, this implies a close relationship between category “A” and category “B” for this paper. As discussed above, the dataset contains linkages which include both the information of how a paper cites other papers (out-direction) and how other papers cite this paper (in-direction). This information regarding the types of linkages can be advantageously utilized. For example, authority typically comes from in-edges, while being a good “hub” comes from out-edges. From the publications of a person A, it is reasonable to infer that her expertise “X” is influenced by “Y” if her papers in category “X” cite many papers from the category “Y”, while her expertise “X” influences “Y” if her papers in category “X” are cited by papers from category “Y”. For example, in
where $e_{A \to B}$ represents the “strength” of the edge from expertise “A” to expertise “B”, K is the total number of publications for the person, $n_i^{AB}$ represents, for paper i, the number of papers in category A cited by the papers in category B, and $N_i$ represents the number of citations in paper i. From this, an ExpertiseNet 330 may be constructed for person A 335. Person A 335 has expertise in ML 336, IR 337, and NLP 338 that may be interrelated as shown in
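The following sketch illustrates one plausible reading of the edge strength built from the quantities defined above. It assumes, and this is only an assumption, that the strength is the average over the K publications of the fraction $n_i^{AB} / N_i$; the function name `edge_strength` and the record layout are hypothetical.

```python
def edge_strength(papers, src, dst):
    """Assumed form: e_{src->dst} = (1/K) * sum_i (n_i / N_i).

    n_i counts the references of paper i (a paper in category dst) that
    belong to category src; N_i is the number of references in paper i.
    """
    k = len(papers)  # K, total number of publications for the person
    if k == 0:
        return 0.0
    total = 0.0
    for paper in papers:
        cited = paper["cited_categories"]
        if paper["category"] != dst or not cited:
            continue  # only papers in dst that cite something contribute
        total += cited.count(src) / len(cited)
    return total / k


# Hypothetical publication records for one person.
papers = [
    {"category": "IR", "cited_categories": ["ML", "ML", "IR", "IR"]},
    {"category": "IR", "cited_categories": ["ML", "IR"]},
    {"category": "ML", "cited_categories": ["ML"]},
    {"category": "NLP", "cited_categories": []},
]
ml_to_ir = edge_strength(papers, "ML", "IR")
```

Under this reading, an edge from “ML” to “IR” is strong when the person's “IR” papers draw heavily on “ML” references, which corresponds to “ML” influencing “IR”.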
In at least one embodiment, alternative or additional methods for determining the relationships between nodes may be used. For example, the correlation between nodes may be explored by text similarity analysis. This may be particularly useful when the citation linkages are not available. For example, Latent Semantic Analysis (LSA) 350 and “covariance graph” models may be applied on, for example, the term-by-document matrix (the columns of the matrix are the indices of the documents, and the rows of the matrix contain the frequency of occurrence of the terms in the documents), to build the undirected ExpertiseNet 380 as shown in
$A \approx U S V^{T}$
where $A \in \mathbb{R}^{N \times M}$, $U \in \mathbb{R}^{N \times K}$, $S \in \mathbb{R}^{K \times K}$, and $V \in \mathbb{R}^{M \times K}$, M is the number of documents, and N is the number of terms. Here, since the process is, for example, to compare the person's expertises in different categories, in the term-by-document matrix, all the words from all publications 355 in one category for the person may be treated as one document 360. The “covariance graph” model may be applied to build relevance networks, such as the gene relevance networks where interactions between any two genes are defined through Pearson's correlation coefficients 370. This “covariance graph” model may be applied on the reconstructed matrix. First, the correlation matrix may be calculated to determine the strengths of the edges between two nodes in the undirected relational ExpertiseNet 380, then, if the magnitude of the value of the correlation is smaller than a threshold (we set, for example, 0.05), we eliminate that edge from the graph. An example of the resultant undirected ExpertiseNet 380 is shown in
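The “covariance graph” step described above may be sketched as follows. The sketch assumes that the LSA reconstruction has already been carried out elsewhere and that each category is represented by one column of term weights; the names `pearson`, `covariance_graph`, and the sample vectors are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / (sd_x * sd_y)


def covariance_graph(term_vectors, threshold=0.05):
    """Keep an undirected edge only if |correlation| reaches the threshold."""
    edges = {}
    for a, b in combinations(sorted(term_vectors), 2):
        r = pearson(term_vectors[a], term_vectors[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


# Hypothetical reconstructed term weights, one vector per expertise category.
vectors = {
    "ML": [4.0, 3.0, 0.0, 1.0],
    "IR": [2.0, 1.5, 0.0, 0.5],
    "NLP": [1.0, 1.0, 1.0, 1.0],
}
edges = covariance_graph(vectors)
```

The default threshold of 0.05 follows the example value given above; other values may be chosen for other datasets.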
It may be advantageous to incorporate exponential random graph models into the above user model analysis. The above analysis can be used to obtain an observation of the user expertise profile. Then, an exponential random graph model (otherwise known as a “p* model”) can be used to estimate an underlying distribution to describe the relational representation 130 of the expertise information. One advantage of this statistical model is that it can be used to represent structural tendencies, such as transitivity (defined by the number of transitive patterns) that define complicated dependence patterns not easily modeled by deterministic models. Given a set of n nodes, let Y denote a random graph on these nodes and y denotes a particular graph on those nodes. Then
$P_{\theta}(Y = y) = \exp\left(\theta^{T} s(y)\right) / c(\theta)$
where θ is an unknown vector of parameters, s(y) is a known vector of graph statistics on y (density (defined by the out-degrees), reciprocity (defined by the number of reciprocated relations), transitive triads (defined by the number of sets of edges {(i→j), (j→k), (i→k)}), and the attributes of the nodes are considered herein), and c(θ) is a normalization term. This probabilistic expression has advantages in describing the insights of the network and, thus, can also help to describe the evolution of the expertise representation.
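For illustration, the graph statistics s(y) named above may be counted as in the following sketch. It assumes the standard exponential-family form, in which the probability of a graph is proportional to exp(θ·s(y)); the normalization term c(θ), which sums over all possible graphs, is deliberately not computed. The function names are hypothetical, and node attributes are omitted for brevity.

```python
from itertools import permutations
from math import exp


def graph_statistics(nodes, edges):
    """Return s(y) = [density, reciprocity, transitive triads]."""
    edge_set = set(edges)
    # Density: sum of out-degrees, i.e. the number of directed edges.
    density = len(edge_set)
    # Reciprocity: each mutually connected pair is counted once.
    reciprocity = sum(
        1 for (i, j) in edge_set if i < j and (j, i) in edge_set
    )
    # Transitive triads: ordered triples with i->j, j->k and i->k.
    transitive = sum(
        1
        for i, j, k in permutations(nodes, 3)
        if (i, j) in edge_set and (j, k) in edge_set and (i, k) in edge_set
    )
    return [density, reciprocity, transitive]


def unnormalized_weight(theta, stats):
    """exp(theta . s(y)); dividing by c(theta) would give P(Y = y)."""
    return exp(sum(t * s for t, s in zip(theta, stats)))


# Hypothetical directed ExpertiseNet with three expertise nodes.
nodes = ["ML", "IR", "NLP"]
edges = [("ML", "IR"), ("IR", "ML"), ("IR", "NLP"), ("ML", "NLP")]
stats = graph_statistics(nodes, edges)
```

In practice the parameters θ would be estimated with a dedicated statistical package; the sketch only shows how the sufficient statistics enter the model.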
In the evolutionary representation 140, the dynamics and/or the evolution of expertises may be explored and considered. In evolutionary representation, two basic tasks are performed: (1) “evolution segmentation,” where changes are detected between expertise cohesive sections and/or (2) “expertise tracking,” where one keeps track of expertise similar to a set of previous expertise. The strength of the nodes as well as the structure of the network may be considered in evolution segmentation, and temporal sliding windows may be applied. The development of one expertise may, in fact, depend on or influence the development of others. For example, it has been determined that when a research area increases its citations from other areas, it can predict the development of this area for a period of time into the future. A possible reason for this phenomenon is that when a new branch in a traditional research area is being developed, at the beginning stage, it usually borrows ideas from other areas. When the branch of research comes to a mature period, the researchers will tend to cite the papers in its own area. Thus, it is reasonable to assume that there are correlations between the development of the expertise areas and the linkage changes.
where $V_{t,i}$ indicates the “strength” of the expertise i at time t, L indicates the number of expertises for each person, and th is a threshold; the goal is to find all t that satisfy the equation. It has been found advantageous to set the threshold th to, for example, 0.2. The evolution segments may be obtained from these change points.
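A change-point search of the kind described above may be sketched as follows. The criterion used in the sketch is a hypothetical stand-in and is not asserted to be the criterion of the disclosure: it flags a time t when the mean absolute change of the normalized strengths across the L expertises exceeds the threshold th. The function name `change_points` and the sample series are likewise illustrative.

```python
def change_points(strengths, th=0.2):
    """strengths[t][i] holds V_{t,i}; return every t flagged as a change.

    Assumed criterion (illustrative only): the mean absolute difference
    between the normalized strength vectors at t-1 and t exceeds th.
    """
    points = []
    for t in range(1, len(strengths)):
        prev, cur = strengths[t - 1], strengths[t]
        num_expertises = len(cur)        # L
        prev_total = sum(prev) or 1.0    # guard against empty windows
        cur_total = sum(cur) or 1.0
        change = sum(
            abs(c / cur_total - p / prev_total) for p, c in zip(prev, cur)
        ) / num_expertises
        if change > th:
            points.append(t)
    return points


# Hypothetical strengths of two expertises over four time windows.
series = [[4, 0], [4, 0], [1, 3], [1, 3]]
points = change_points(series)
```

Consecutive change points delimit the evolution segments; a sliding window over publication dates may supply the rows of `series`.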
As discussed above, it has been determined that the link changes are often highly correlated with the evolution of the expertise. Accordingly, at 430 in
where the variables have the same meaning as above, except that only the papers in a particular time segment t are considered.
In at least one embodiment, an exponential random graph model can be estimated from the data in each window of time or time period, where temporal sliding windows may be applied. A series of parameters which indicate the network configurations can be obtained. Then, the change points of the evolutionary representation are determined by:
where $\theta_{t,k}$ indicates the parameters of the exponential random graph model at time t, M represents the number of parameters, and th is a threshold. The goal may be to find all t that satisfy the equation based on a particular th. The threshold, th, may be, for example, from 0.1 to 0.3 for satisfactory evolutionary representation 400 results. Regardless of which approach(es) for evolutionary representation 600 may be used, at 440 an Evolutionary ExpertiseNet may be developed.
After obtaining the relational representations and evolutionary representations of a variety of entities, one can then perform expertise mining and matching. In accordance with another aspect of the invention, a variety of mining and matching operations may be conducted. One exemplary mining technique, for example, is to conduct a search for entities who not only have the expertise of interest but who also have expertises that satisfy certain relational patterns between the relevant expertises. This approach may be referred to as “expertise relationship mining.” Another approach is to find entities who have certain evolutionary expertise patterns. This approach may be referred to as “evolutionary expertise mining.” The search results may be ranked, for example, by the strength of the linkage in the relational or evolutionary expertise patterns, which is calculated by the methods mentioned earlier in building the ExpertiseNet.
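By way of example only, “expertise relationship mining” over a collection of relational representations may be sketched as below. The profile layout, the entity names, and the function `relationship_mining` are assumptions made for illustration; the edge strengths are placeholder values.

```python
def relationship_mining(profiles, source, target, min_strength=0.0):
    """Rank entities having both expertises and a source->target linkage."""
    hits = []
    for name, profile in profiles.items():
        nodes, edges = profile["nodes"], profile["edges"]
        strength = edges.get((source, target), 0.0)
        # Require both expertise nodes and the relational pattern itself.
        if source in nodes and target in nodes and strength > min_strength:
            hits.append((name, strength))
    # Strongest linkage first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical directed ExpertiseNets for three entities.
profiles = {
    "alice": {"nodes": {"ML": 5, "CV": 3}, "edges": {("ML", "CV"): 0.6}},
    "bob": {
        "nodes": {"ML": 4, "CV": 2, "NLP": 6},
        "edges": {("ML", "NLP"): 0.7},
    },
    "carol": {"nodes": {"ML": 2, "CV": 7}, "edges": {("ML", "CV"): 0.9}},
}
ranked = relationship_mining(profiles, "ML", "CV")
```

In this example the entity “bob” holds both expertises but lacks the requested linkage and is therefore excluded, which a keyword-only search would not achieve.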
In
One can also conduct novel forms of expertise matching, in which a search is made to find entities/persons with similar expertise. Instead of using traditional vector-based matching, the present invention may provide various ways to search for entities with similar evolutional and/or relational information of the expertises. For example, and without limitation, the expertise profiles may be compared based on a generalized Hamming distance function, which takes both the weighted linkages and the weighted nodes into account, in order to differentiate different entities. For example, this distance can be expressed as follows:
where G, V, E indicate graph, node, and edge respectively, w indicates the number of nodes in the graph, and β0 is a weight to determine the trade-off of the importance of the nodes or structure. The similarity between expertise profiles may be based on various kinds of indices which are extracted from the expertise representations and the semantic labels of the nodes, e.g., based on degree-based, betweenness-based, closeness-based, flow-based centrality and prestige indices, structural balance, clusterability, and transitivity indices, and/or the cohesiveness of subgroups.
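One possible distance of this general kind is sketched below. The specific combination used here is an assumption for illustration and is not asserted to be the expression of the disclosure: node-strength differences and edge-strength differences are each summed, divided by the number of nodes w, and blended by the weight β0. The function name `profile_distance` and the profile layout are hypothetical.

```python
def profile_distance(g1, g2, beta0=0.5):
    """Illustrative weighted distance between two expertise profiles.

    beta0 trades off the importance of node strengths against structure.
    """
    nodes = sorted(set(g1["nodes"]) | set(g2["nodes"]))
    w = len(nodes)  # number of nodes considered (union of both graphs)
    if w == 0:
        return 0.0
    node_term = sum(
        abs(g1["nodes"].get(n, 0.0) - g2["nodes"].get(n, 0.0)) for n in nodes
    ) / w
    edge_keys = set(g1["edges"]) | set(g2["edges"])
    edge_term = sum(
        abs(g1["edges"].get(e, 0.0) - g2["edges"].get(e, 0.0))
        for e in edge_keys
    ) / w
    return beta0 * node_term + (1.0 - beta0) * edge_term


# Hypothetical profiles with normalized node and edge strengths.
g1 = {"nodes": {"ML": 0.5, "IR": 0.5}, "edges": {("ML", "IR"): 0.5}}
g2 = {"nodes": {"ML": 0.5, "NLP": 0.5}, "edges": {("ML", "NLP"): 0.5}}
distance = profile_distance(g1, g2)
```

Setting `beta0` close to 1 emphasizes which expertises an entity has, while a value close to 0 emphasizes how those expertises are related.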
In a variation, where an exponential random graph model is utilized, a “distance” function may be used to compare different relational representations in order to differentiate entities, for example, as defined as follows:
where G indicates the graphs, θ indicates the parameters of the exponential random graph models, and M represents the number of parameters to describe one graph. For evolutionary representations of the expertise information, the distance function may be formulated as:
where β are the statistical parameters in the actor-oriented model.
Compared to a mining/matching process based on traditional expertise profiles, which yields a long list of persons with similar expertise as the result, the present invention may be able to generate a much smaller list with more accurate matching for the requirement(s), thereby saving time in a mining process. User models described by the above-mentioned relational representations and evolutionary representations provide a rich and accurate representation for expertise profiles, and may be used in different applications such as mining, retrieval, and visualization of the information. The expertise may be built up in a hierarchical way. The relationships and correlations across heterogeneous information sources related to an entity can be readily extracted. Consider, for example, a manager who has a project which needs to use machine learning to solve problems in computer vision. Using a traditional system, the manager would type in the keywords “machine learning” and “computer vision” and get a list of many people with similar expertise. How does the manager differentiate them? With a relationship representation, the manager will be able to identify an entity who is, for example, using “machine learning” for “NLP” while another entity is using “computer vision” with “machine learning” and doing “NLP” independently. With a representation generated for a whole community, one can readily do a search for related research areas and obtain key references automatically and/or search for individuals with similar expertise profiles and, thereby, obtain useful suggestions for potential future projects. As a result, it may be possible to classify expertise areas and predict trends in the expertise areas.
While exemplary drawings and specific embodiments of the present invention have been described and illustrated herein, it is to be understood that the scope of the present invention is not to be limited to the particular embodiments discussed. Thus, the embodiments shall be regarded as illustrative rather than restrictive, and it should be understood that variations may be made in those embodiments by workers skilled in the arts without departing from the scope of the present invention as set forth in the claims that follow and their structural and functional equivalents.
This application claims the benefit of U.S. Provisional Application No. 60/630,050, filed Nov. 22, 2004, the entire disclosure of which is hereby incorporated by reference as if set forth fully herein. This application is related to recently filed patent application having attorney docket number 04023 (not yet assigned a serial number), the entire disclosure of which is hereby incorporated by reference as if set forth fully herein. This disclosure contains information subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure or the patent as it appears in the U.S. Patent and Trademark Office files or records, but otherwise reserves all copyright rights whatsoever.
Number | Date | Country
---|---|---
60630050 | Nov 2004 | US