The disclosure generally relates to systems and platforms for data analysis using interactive recommendations of data sets by matching characteristic patterns of one data set with one or more characteristic patterns of a candidate data set.
Data analysis platforms are applications used by data analysts and data scientists. Data analysts and data scientists need to deliver timely studies (i.e., data analyses) to answer numerous business questions from their business customers. The problem can be summarized as follows: too many potentially relevant datasets are available, while on the user end there is little support for finding the actually relevant datasets, and on the system end there is little or no information about the user's intent in the analysis.
More specifically, these users are not adequately supported because, in current applications, finding data is slow. Data analysts and data scientists spend more time finding and preparing the data than performing actual analysis. In addition, useful data, when available, is not easily visible to users: they find it hard to identify what data is suitable for the current study, either as raw data to be prepared or as data already prepared and fit for purpose. There also tends to be a lack of reuse of data among analysts. They cannot easily reuse the analyses already done by others, i.e., the datasets already prepared by others or prepared by the same analyst in the past.
Further issues are caused by inconsistencies among analysts. Since data analysts and data scientists work in isolation, inconsistencies arise across organizations due to different business rules applied by different users. Another problem data analysts face is that the number of recommendations produced is often too high for the user to benefit from when there is no accounting for the goal of the user.
From the standpoint of users with IT/governance roles, the problem illustrated above also leads to undesirable data duplication issues. An example of the problem occurs when these professionals need access to relevant lookup tables. Foreign key definitions help identify the appropriate table to perform lookups, but these definitions are often missing in relational databases and non-existent in other types of data stores. Analysts typically have to manually reconstruct one set of data types (e.g., time zone information) from other data types (e.g., geographic information), leading to errors and incorrect results.
In the context of data transformation or preparation applications, where each application is a collaborative environment for data analysts, data scientists, and ETL developers to discover, explore, relate, and acquire any type of data from data sources inside or outside the enterprise, the above problems are solved by a system that provides relevant dataset suggestions to a user based on the context of a prior dataset selection and an inferred goal. Specific improvements that are achieved by the systems and methods herein include reducing the average time to find data by reducing the manual steps to find the data, increasing the visibility of useful data assets by bringing them to the user, who selects and chooses, increasing reuse of analyses (over time), reducing inconsistencies as data users are exposed to the business rules of others (over time), and reducing duplication from the standpoint of IT/governance roles.
For example, as the user finds and includes in his current project a dataset with a “country code” column (but without the “country name”), the method and system described herein automatically recommend the lookup table with “country name” information, which has already been used in combination with the current dataset. In other words, a supplementary dataset. Thus, the analyst can also include the lookup table, which he will then leverage at preparation time, and will not need to do the manual work to reconstruct the “country name” information.
Another common example of the problem is the need of data professionals to find whether the dataset currently included in the project has already been extended via joins or unions with other relevant datasets. In this case, the disclosed system automatically recommends the datasets that resulted from these previous joins or unions, allows the user to preview them, and, if one is ultimately chosen, spares the user from repeating these manual join or union operations. In other words, an alternate dataset.
A second domain for applying the invention is applications for ETL developers. This class of users would also benefit from join recommendations as they develop new mappings: currently they need to manually select sources and targets when building an ETL mapping (see, e.g., the Informatica Developer Tool). The limitations of these applications are analogous to those described above.
In one embodiment, a computer-executed method of recommending datasets for data analysis is provided. A recommendation system receives a user selection of a first dataset, for example, resulting from a search for datasets based on keywords or attributes. The system determines a context for the selection. Given the user-selected dataset and context, for each of a plurality of dataset relationship types, a set of recommended datasets is identified. These recommended datasets are generated by first determining at least one second dataset related to the first dataset based on the relationship type, scoring each second dataset using a relevance ranking algorithm specific to the relationship type to score the relevance of the second dataset to the first dataset, and then ranking the datasets to determine the highest ranked datasets. From the ranked datasets, a plurality of ranked datasets is selected as the recommended datasets, which are then presented in a graphical user interface.
The types of relationships that may be used to identify the recommended datasets include: a lineage relationship based on ancestor or descendant relationships between datasets; a content relationship based on semantically similar datasets; a structure relationship based on structurally compatible datasets; a usage-based relationship based on datasets previously used by relevant classes of users in association with the previously chosen datasets; a classification-based relationship based on datasets that share one or more classifications with one or more datasets previously chosen by the user; and an organizational or social relationship based on social or organizational relationships between users of the datasets.
After the recommended datasets are presented, a user selection of one or more of the recommended datasets is received. For the selected dataset, the relationship type to the first dataset is determined, and a plurality of datasets related to the first dataset by that relationship type are further identified and scored for relevance. These further datasets are presented in the graphical user interface according to their subtypes for the relationship type.
In addition, a user interface provides a dataset selection control for receiving a user selection of a first dataset, and a recommendation bar for presenting a set of recommended datasets based on the user selection of the first dataset and a determined context for the selection, where the recommended datasets are grouped within the recommendation bar by relationship type to the first dataset. The user interface also includes a “goal” confirmation control for receiving a selection of one or more of the recommended datasets.
The features and advantages described in the specification are not all inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter.
The disclosed embodiments have other advantages and features which will be more readily apparent from the detailed description and the accompanying figures. A brief introduction of the figures is below.
FIG. 4A1 illustrates a user interface showing a recommender bar with first recommendations based on a lineage relationship according to one embodiment.
FIG. 4A2 illustrates the user interface of FIG. 4A1 showing a recommender bar with a menu control for selecting a goal for directing recommendations according to one embodiment.
The figures and the following description relate to particular embodiments by way of illustration only. It should be noted that from the following discussion, alternative embodiments of the structures and methods disclosed herein will be readily recognized as viable alternatives that may be employed without departing from the principles of what is claimed.
Reference will now be made in detail to several embodiments, examples of which are illustrated in the accompanying figures. It is noted that wherever practicable similar or like reference numbers may be used in the figures and may indicate similar or like functionality. The figures depict embodiments of the disclosed system (or method) for purposes of illustration only. Alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.
The entities of the system 100 include user client 110, client data store 105, network 115, and recommender system 120.
Although single instances of user client 110, client data store 105, network 115, and recommender system 120 are illustrated, multiple instances may be present. For example, multiple user clients 110 may interact with recommender system 120. The functionalities of the entities may be distributed among multiple instances. For example, recommender system 120 may be provided by a cloud computing service according to one embodiment, with multiple servers at geographically dispersed locations implementing recommender system 120.
A user client 110 refers to a computing device that accesses recommender system 120 through the network 115. Some example user clients 110 include a desktop computer or a laptop computer. In some embodiments, user clients 110 include web browsers and third-party applications integrating client data store 105. User client 110 may include a display device (e.g., a screen, a projector) and an input device (e.g., a touchscreen, a mouse, a keyboard, a touchpad). In some embodiments, user clients 110 have one or more local client data stores 105, which are databases or database management systems that, e.g., provide access to source data via the network 115.
Network 115 enables communications between user client 110 and the data flow design system 100. In one embodiment, the network 115 uses standard communications technologies and/or protocols. The data exchanged over the network 115 can be represented using technologies and/or formats including the hypertext markup language (HTML), the extensible markup language (XML), etc. In addition, all or some data can be encrypted using conventional encryption technologies such as secure sockets layer (SSL), transport layer security (TLS), virtual private networks (VPNs), Internet Protocol security (IPsec), etc. In another embodiment, the entities can use custom and/or dedicated data communications technologies instead of, or in addition to, the ones described above.
Recommender system 120 implements the method as described in conjunction with
Recommender system 120 includes a user interface module 135 that receives a selection of a dataset from a user. Context module 140 determines the context for the dataset selection, using data from knowledge base 130. Based on the selected dataset and the context for the selection, recommendation module 145 determines the applicable recommenders and calls them.
Recommenders 150 then each determine datasets to recommend based on the corresponding relationship type for each recommender 150, using data from knowledge base 130. Recommendation module 145 then aggregates, scores, ranks, and selects a subset of the datasets provided by the recommenders 150 for presenting to the user, and user interface module 135 presents the selected datasets via a user interface. Each of the components 130-150 of recommender system 120 is discussed in further detail below.
Knowledge base 130 includes an inventory of datasets, profiles of users, and data definitions that are used to define the semantics of datasets and data elements. Knowledge base 130 also includes data domain information, where data domains are used to define the types of data values. Knowledge base 130 includes classification schemes that can be used to classify the datasets and data elements. Knowledge base 130 also includes lists of projects that are used to group user actions performed on datasets to achieve some goal. Knowledge base 130 includes a map of relationships that encodes different types of relationships, including lineage relationships, content relationships, structure relationships, usage-based relationships, classification-based relationships, and organizational or social relationships between users. This map of relationships feeds into the various recommenders 150.
For each user, knowledge base 130 is loaded with existing intent knowledge, history of in-project actions, and individual preferences among the different relationship types derived from prior interaction history (e.g., user profiles). For example, classes used by context module 140 are stored by knowledge base 130, as shown in Table 1 below, which lists the classes of user actions, and the user goal inferred from each action.
The three classes are as follows. Class 1 includes actions outside the context of a project, such as search history. Class 1 actions are used by recommender system 120 to initialize the recommendation engine. Class 2 includes actions within the context of a project (excluding recommendations). Class 3 includes actions taken in the context of a list of recommendations provided to the user. Class 2 and Class 3 actions are used by recommender system 120 to revise the recommended datasets, e.g., using a stored decision tree as discussed below, which ultimately are displayed to the user, e.g., in recommender bar 410 of FIG. 4A1.
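The three action classes above can be sketched as a simple classifier. The following Python sketch is illustrative only; the dictionary keys (`in_project`, `on_recommendation`) are hypothetical names, not part of the disclosure.

```python
from enum import Enum

class ActionClass(Enum):
    """The three classes of user actions (cf. Table 1)."""
    OUTSIDE_PROJECT = 1    # e.g., search history; initializes the engine
    IN_PROJECT = 2         # actions within a project (excluding recommendations)
    ON_RECOMMENDATION = 3  # actions taken on a presented recommendation list

def classify_action(action: dict) -> ActionClass:
    """Assign an action to a class based on hypothetical context flags."""
    if action.get("on_recommendation"):
        return ActionClass.ON_RECOMMENDATION
    if action.get("in_project"):
        return ActionClass.IN_PROJECT
    return ActionClass.OUTSIDE_PROJECT
```

An action on a recommendation list is necessarily also in a project, so the checks run from most to least specific context.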
Knowledge base 130 includes data used by context module 140 for determining the context for the dataset selection, and data, such as the decision trees discussed below, used by each recommender 150 to determine datasets to recommend to the user based on the corresponding relationship type for each recommender 150. The information maintained by knowledge base 130 for each of the relationship types is further described below.
For the lineage relationships, knowledge base 130 maintains information about how the data has moved between different systems and transformed along the way. Knowledge base 130 also maintains a decision tree for lineage relationships, as shown in
This decision tree of
On the other hand, the user may be interested in k-derivations of A (plus other parents), where C is derived from A and at least one other dataset; by selecting a recommendation corresponding to the right side of the tree, the user indicates interest in k-derivations of A. This is illustrated by the common use case of a marketing analyst who needs to join the “customer” dataset with the “orders” and “customer demographics” datasets in order to answer questions about who to target for a new marketing campaign (e.g., find the list of customers that have purchased product x and have demographics most relevant to the new product y). This type of use case requires combining information (e.g., attributes or dimensions) in complementary datasets. It happens frequently when the database schema is organized following dimensional modeling principles, i.e., the database schema stores one dimension per table, where that dimension can be connected with the dimensions in other related tables, e.g., via joins or union operations. An example in which a user selects a lineage relation, then k-derivations, then union operations, is discussed further below in conjunction with the user interface of FIGS. 4A1, 5, and 6.
As an example, assume data is extracted from Table A in an ERP (Enterprise Resource Planning) system, transformed, and then loaded into a staging database table, Table B. It is then transformed again and loaded into a data warehouse table, Table C. On Table C, a Business Intelligence report is built as Report 1. A lineage relationship now exists from Report 1 to Table C to Table B to Table A. Lineage relationships can be represented at the table level as well as at the column level. A diagram shown in
For the content relationships, knowledge base 130 maintains the relationships between datasets and data definitions that depict the semantics of the dataset, e.g., when datasets can be mapped onto a glossary of business terms. Knowledge base 130 also maintains a decision tree for content relationships, shown in
The decision tree of
As an example, two particular datasets that represent the same business term “customer” are semantically similar at the dataset level. Knowledge base 130 also maintains relationships between data elements and data definitions which represent the semantics of the data element, e.g., where two particular datasets both contain the same specific type of data, or a column with the same set (or overlapping sets) of values (i.e., all the values can be checked against a common reference table). For instance, they both contain a “social security number” column and thus they are semantically similar at the data element level. In another example, they both contain the same set of stored ISO country codes and thus they are semantically similar at the data element value level.
The content relationship data in knowledge base 130 is used by content-based recommender 150b.
For the structure relationships, knowledge base 130 maintains the relationships between datasets and classifiers that classify datasets to be in the same group based on structural relationship such as PK-FK. Knowledge base 130 also maintains a decision tree for structural relationships, shown in
The decision tree of
For example, a “customer” and an “order” dataset from the same organization and time period have in common the column “customer ID” as PK-FK, which allows performing structural operations such as Join and Lookup between the two datasets. Knowledge base 130 maintains the relationships between datasets and classifiers that classify datasets to be in the same group based on structural relationships such as highly overlapped dataset structures between the datasets (i.e., a set-subset relationship between the attributes of two tables). In another example, two “order” datasets from two subsequent years have in common the same set of columns (or one may have a superset of the columns in the other), which allows performing structural operations such as Union. The structure relationship data in knowledge base 130 is used by structure-based recommender 150c.
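The structural checks described above (set-subset column overlap enabling Union, and shared PK-FK columns enabling Join or Lookup) can be sketched as follows. This is an illustrative Python sketch; the function names are hypothetical.

```python
def union_compatible(cols_a, cols_b):
    """Union is possible when one dataset's column set contains the other's,
    i.e., a set-subset relationship between the attributes of two tables."""
    a, b = set(cols_a), set(cols_b)
    return a <= b or b <= a

def join_keys(cols_a, cols_b, known_key_columns):
    """Columns shared by both datasets that are known PK-FK candidates,
    enabling structural operations such as Join and Lookup."""
    return set(cols_a) & set(cols_b) & set(known_key_columns)
```

For instance, two “order” datasets from subsequent years would pass `union_compatible`, while the “customer” and “order” example above would yield `{"customer ID"}` from `join_keys`.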
For the usage-based relationships, knowledge base 130 maintains the relationships between datasets and users about who created which dataset, who used which dataset, and who rated which dataset and what the rating was (rating, in this case, represents usefulness of this dataset for that particular user). Knowledge base 130 also maintains a decision tree for usage-based relationships, shown in
The decision tree of
For the classification-based relationships, knowledge base 130 maintains the relationships between datasets and classifiers that classify datasets to be in the same group based on some classification scheme, e.g., a dataset may belong to a finance subject area, or a dataset may contain data for country USA. Knowledge base 130 also maintains the relationships between data elements and classifiers that classify data elements in the same group based on some classification scheme, e.g., a column containing sensitive information. Knowledge base 130 also maintains a decision tree for classification-based relationships, shown in
The decision tree of
For the organizational or social relationships between users, knowledge base 130 maintains the relationships between users based on the user profiles, where information such as follower/followees and organizational chart attributes are specified. Knowledge base 130 also maintains a decision tree for organizational or social relationships, shown in
The decision tree of
Recommender system 120 also includes context module 140. Context module 140 infers goals, including goals based on user actions in the current session context. This context informs the dataset selection, using context data, such as Table 1, from knowledge base 130.
Context module 140 first determines context information for a selected dataset, which is then stored in knowledge base 130. Various contexts have corresponding classes assigned to them, which determine what goal is inferred from the user's selection of the dataset within that context. Three different classes correspond to actions taken in specific contexts, as shown in Table 1, which is stored in knowledge base 130. Using this information, the datasets next suggested to the user are based on the goal inferred from the context information.
Then, when a next action is taken by the user, context module 140 determines a (possibly different) context for the next action, which action either confirms the inferred goal or not. Context module 140 revises the inferred goal, if necessary, which then again informs the next datasets presented to the user, and so on. In this way, context module 140 iteratively determines the context in which specific actions, e.g., dataset selections, are made by the user to infer a user goal for the action, and the inferred goal in turn informs selection of the next datasets to suggest to the user.
Recommender system 120 also includes recommendation module 145. Given a user-selected dataset and context, recommendation module 145 provides recommended datasets for presenting to the user. Based on the selected dataset and the context for the selection, recommendation module 145 determines the applicable recommenders and calls them. Recommendation module 145 then aggregates, scores, ranks, and selects a subset of the datasets provided by the recommenders 150 for presenting to the user.
Recommendation module 145 determines which recommenders 150 should be called in view of a selected dataset and context, calls the recommenders 150, aggregates and scores the recommended datasets produced by each recommender 150, and selects the highest ranking datasets for presentation to the user by UI module 135, e.g., in recommender bar 410 in a graphical user interface such as is shown in FIG. 4A1.
For example, assume the system has n relationships in set R. The recommendation service has a weight array W of size n where W[i] is the weight of the recommendation produced by using the relationship R[i]. Each recommender produces local recommended datasets ranked by a relevance score based on some relationship in R, using a relevance ranking algorithm specific to the recommender and relationship type.
In one embodiment, the recommendation service starts with a default weight for each of the relationships and adjusts the weights according to the actions the user performs. The default weights can be equal across all recommenders, or configured per the user's profile. The scores of the recommended datasets from each of the recommendation lists are weighted by the current weight of the relationship in the recommendation service, then aggregated and presented in order of decreasing rank.
As the user selects datasets for inclusion or previewing, the corresponding weight for the relationship type/recommender is incremented, and the remaining weights for the other relationship types/recommenders are adjusted.
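The weighting-and-aggregation step described above can be sketched as follows. This is an illustrative Python sketch under the stated assumptions: each recommender's local list is keyed by its relationship type, and the weights are keyed by the same types. The function name is hypothetical.

```python
def aggregate(local_lists, weights):
    """Weight each recommender's local relevance scores by the current weight
    of its relationship type, then merge into one globally ranked list."""
    global_scores = {}
    for rel_type, recommendations in local_lists.items():
        w = weights[rel_type]
        for dataset, score in recommendations:
            global_scores[dataset] = global_scores.get(dataset, 0.0) + w * score
    # Present by decreasing aggregated rank.
    return sorted(global_scores.items(), key=lambda kv: kv[1], reverse=True)
```

A dataset recommended by several recommenders accumulates weighted score from each, so datasets supported by multiple relationship types naturally rise in the global ranking.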
Below is a pseudo-algorithm, with explanations, for the recommendation module 145.
Recommendation module 145 maintains a map of weights applied to various recommenders 150 within the context of various goals, e.g., at the project level, user level, or the session level:
The set of recommenders 150 is registered with recommendation module 145 as:
The strategy decides how the weights applied to various recommenders 150 are adjusted
GoalInferenceStrategy goalInferenceStrategy
This method will be called by user client 110 to get recommendations:
Recommendation module 145 gets the recommender weights applicable in the current goal context:
Recommendation module 145 invokes the recommenders 150:
Recommendation module 145 aggregates the scores of all recommenders 150:
This method is invoked by recommendation module 145 when a user accepts a recommendation. The recommendation module 145 uses that information to adjust the recommender 150 weights:
Below represents an interface for adjusting weights:
Below shows one exemplary way of adjusting weights:
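The original listing is not reproduced here; a minimal Python sketch of the interface and one exemplary additive policy is given below. The `adjust` method signature and the `AdditiveStrategy` class are assumptions for illustration, modeled on the `GoalInferenceStrategy` mentioned above.

```python
from abc import ABC, abstractmethod

class GoalInferenceStrategy(ABC):
    """Decides how the per-relationship weights change after user feedback."""
    @abstractmethod
    def adjust(self, weights: dict, accepted_rel_type: str) -> dict: ...

class AdditiveStrategy(GoalInferenceStrategy):
    """One exemplary policy: add a fixed boost to the accepted relationship's
    weight, then renormalize so all weights again sum to 1 (which implicitly
    decrements the remaining weights)."""
    def __init__(self, boost: float = 0.1):
        self.boost = boost

    def adjust(self, weights, accepted_rel_type):
        adjusted = dict(weights)
        adjusted[accepted_rel_type] += self.boost
        total = sum(adjusted.values())
        return {k: v / total for k, v in adjusted.items()}
```

Renormalizing keeps the weights comparable across sessions while still rewarding the relationship type the user acted on.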
In another embodiment, a hybrid recommender may be configured, using a combination of different relationship types (and their corresponding decision trees) and a combination of underlying relevance ranking algorithms for the different relationship types. In this embodiment, the recommendation module 145 invokes the applicable recommenders 150 based on a user action, prioritizes relationships based on inferred goals, aggregates the responses from the recommenders 150, and displays the results in the recommender bar, e.g., 410 of FIG. 4A1.
As mentioned above, recommender system 120 maintains a map encoding all the different types of relationships among all the datasets, e.g., in knowledge base 130. Based on this map, when the recommendation module 145 is given one or more datasets previously chosen by a user Ux, it can compute a set of recommendations for that user for each of the relationship types: lineage relationships, for example, allow recommendations of ancestor or descendant datasets, using the various recommenders 150 discussed below.
Recommenders 150a-150f each use a current context, e.g., as determined by context module 140, which has the following components: (1) datasets in the project (as the user-selected datasets A) and (2) the user (for the user's role, organizational department, and follower/followee relationships).
Each recommender 150a-150f includes program code that implements a relevance ranking algorithm that is specific to the relationship type of the recommender 150. Each relevance ranking algorithm computes a relevance score for another dataset within the relationship type, measuring the relevance of the other dataset to the given, user selected dataset.
Each recommender 150a-150f is normalized and trained. Recommender system 120 is loaded with relationships and decision trees, as discussed above in conjunction with knowledge base 130. For each user, the system generates a Finite State Automaton (i.e., a directed graph) that represents all the r possible states of a recommender bar, {s1, . . . , sr}, based on this information. The states are based on the taxonomy of project types defined a priori by the system administrator before initializing the system (stored in Projects and Goals in knowledge base 130). Then, at initialization time, the taxonomy and the corresponding states for each project type are customized to each known user profile.
Recommenders 150 are trained based on two list types: local lists and a global list. Local lists pertain to relevance scores: for each dataset A in the system, each of the individual recommenders 150 computes a distinct relevance score for each of the relationship types. A local list defines the relevance based on each relationship between the recommended datasets Cj and A, where 1<j<N. The global list is computed by the recommendation module 145 to produce a globally ranked list of related datasets {C1, . . . , CM} as a consolidation of the above-mentioned local lists provided by the recommenders 150.
When the applicable recommenders are called by recommendation module 145, each recommender 150 determines datasets to recommend based on the corresponding relationship type, using data from knowledge base 130.
The local lists are presented to the users upon demand based on the datasets included in the project and the state of the recommender bar. The recommendations may also have a temporal component, such that the recommender 150 provides periodic updates to the recommender lists (e.g., every year or quarter), or recommender system 120 uses the logs of user interactions taken on recommendations from a fixed period (e.g., a full year) to train a predictive model for each of the r states and update the underlying taxonomy of project types. Then the Beta values in the trained model can be used as weights. The predictive model may or may not also factor in the user role (e.g., data analyst, data scientist, chief data officer). The recommender 150 training discussed above is then repeated.
Each type of relationship corresponds to a particular decision tree logic and relevance ranking algorithm, for a specific recommender 150, as discussed below. Example algorithms for each recommender 150 are also discussed.
Lineage-based recommender 150a recommends datasets that are descendants of one or more datasets in the current project. The lineage recommender 150a uses the system's knowledge of transformations of datasets and decision trees, as stored in knowledge base 130, to come up with alternate dataset recommendations.
Assume the system has knowledge of n datasets represented by the set D and of m transformations represented by the set T, with a context that has k datasets represented by the set C. The lineage recommender provides two types of recommendations, 1-derived and k-derived.
In the 1-derived example, recommender system 120 produces the set of j transformations O where j<m and each transformation O[j] in O contains exactly one source S where S belongs to C. Each O[j] in O is assigned a relevance score equal to the count of maps which map a data element of S divided by the count of data elements in S. A transformation that maps all the data elements of a source gets a score of 1, a transformation that does not map all the data elements in S gets a score less than 1, and a transformation that maps the data elements of a source to more than one output in the target gets a score higher than 1. The system produces the list of recommendations, which includes the targets of each of the transformations in O, ranked by their relevance score.
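A minimal Python sketch of the 1-derived scoring follows; representing a transformation's maps as (source element, target element) pairs is an assumption for illustration.

```python
def one_derived_score(maps, source_elements):
    """Score a 1-derivation: count of maps that map a data element of the
    source S, divided by the count of data elements in S. Mapping every
    element once scores 1; mapping fewer elements scores below 1; mapping
    an element to more than one output scores above 1."""
    hits = sum(1 for src, _target in maps if src in source_elements)
    return hits / len(source_elements)
```
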
In the k-derived example, recommender system 120 produces the set of j transformations O where j<m and each transformation O[j] in O contains at least one source S such that S belongs to C and each O[j] has more than one source. For each O[j] in O, let SI be the set of sources that belong to C and let SO be the set of sources that do not belong to C. Let A be the set of all sources. For each SI[i] in SI, compute a relevance score equal to the count of maps which map a data element of SI[i] to the target divided by the number of data elements in SI[i]. This is the positive participation factor. For each SO[o] in SO, compute a relevance score equal to the count of maps which map a data element of SO[o] to the target divided by the number of data elements in SO[o]. This is the negative participation factor. For each A[n] in A, compute a relevance score equal to the count of maps which map a data element of A[n] to the target divided by the number of data elements in the target. This is the contribution factor of each source. Compute the score of the transformation as the sum of positive participation factor times the contribution factor for each SI[i] in SI minus the sum of negative participation factor times the contribution factor for each SO[o] in SO. Return the set of targets of the transformations ordered by descending relevance score.
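The k-derived scoring above can be sketched in Python as follows. Representing the maps as (source name, source element, target element) triples, and the sources as a name-to-size dictionary, are assumptions for illustration.

```python
def k_derived_score(maps, source_sizes, context, target_size):
    """Score a k-derivation per the participation/contribution factors above.

    maps:         (source name, source element, target element) triples
    source_sizes: source name -> number of data elements in that source
    context:      names of sources that belong to the context C
    target_size:  number of data elements in the target
    """
    def mapped(name):
        return sum(1 for src, _e, _t in maps if src == name)

    score = 0.0
    for name, size in source_sizes.items():
        participation = mapped(name) / size        # positive or negative factor
        contribution = mapped(name) / target_size  # share of the target it feeds
        term = participation * contribution
        # In-context sources add their term; out-of-context sources subtract.
        score += term if name in context else -term
    return score
```

A transformation whose sources are all in the context thus scores highest, while heavy reliance on out-of-context sources drags the score down.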
Content-based recommender 150b recommends datasets that are similar to the datasets in the project where the similarity between datasets is established by analyzing the data and metadata of the datasets. The content recommender 150b uses the similarity between datasets, computed using dataset names, column names, row counts, column values, data domains, business terms, and classifications, as a measure of the relevance between datasets. The content recommender 150b uses the decision tree for content relationships stored in knowledge base 130.
Let S be a two-dimensional matrix where each S[m,n] is the similarity score (equivalently, relevance score) between datasets D[m] and D[n]. A characteristic of this matrix is that it is symmetric. A score of 0 means that the datasets are completely dissimilar, while a score of 1 means that the datasets are identical. Most scores will be very close to 0, with a few close to 1. The dataset similarities are computed in the background. The system uses similarity computed on the basis of dataset names, column names, domains, and classifications to establish candidate lists for computing similarities based on values. The similarity between the datasets is computed using a variety of techniques, including: n-gram cosine similarity for column names; TF-IDF cosine similarity, Bray-Curtis coefficient, or Jaccard coefficient for column values; a comparison of data domains; and a comparison of classifications. Using any of the foregoing, a threshold of similarity is used for making recommendations. Assume the context has k datasets represented by the set C. For each C[k] in C, the system consults the similarity matrix and suggests datasets that have a similarity score greater than the similarity threshold, in order of decreasing similarity score.
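A sketch of the threshold-based lookup against the similarity matrix; S is represented here as a plain 2-D list and the threshold value is illustrative:

```python
# Hypothetical sketch: for each dataset in the context, consult the
# (symmetric) similarity matrix S and suggest datasets whose similarity
# exceeds the threshold, in decreasing order of similarity.

def recommend_by_similarity(S, context, threshold=0.8):
    """S[c][d] is the similarity between datasets c and d (0..1)."""
    suggestions = {}
    for c in context:
        candidates = [(d, S[c][d]) for d in range(len(S[c]))
                      if d != c and d not in context and S[c][d] > threshold]
        suggestions[c] = sorted(candidates, key=lambda x: x[1], reverse=True)
    return suggestions
```

In practice the matrix would be sparse (most scores are near zero), so a real implementation would likely store only above-threshold pairs rather than a dense matrix.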
Structural recommender 150c recommends datasets that have documented or inferred structural relationships (PK-FK, join, lookup, union) to datasets in the current project. The structural recommender 150c uses structural PK-FK or join/lookup relationships to make recommendations of related result datasets to use. The structural recommender 150c uses the decision tree for structural relationships stored in knowledge base 130.
Assume recommender system 120 has knowledge of n data sets represented by the set D. Assume also that the system has knowledge of a matrix R where R[i,j]=1 when there is a relationship between D[i] and D[j], with D[i] being the master dataset and D[j] being the detail dataset. Assume further that the system has knowledge of joins/lookups represented by a matrix JL, where JL[i,j] is equal to the frequency of joins or lookups in the set of known transformations T between datasets D[i] and D[j], with D[i] being the master/lookup dataset and D[j] being the detail dataset.
Using R and JL, recommender 150c constructs a graph G where each node in the graph is a dataset and each edge in the graph is an element of R and/or JL, with the weight of the edge being the frequency of use. Assume that the context has a set of k data sets represented by the set N. Then, for each dataset N[k] in N, the recommender 150c finds the immediate neighbors in G not already in N. For each pair of datasets (N[i], N[j]) in N, the recommender finds the shortest path between the two datasets in the graph and adds the nodes in the path to the result, aggregating their weights into a net relevance score. The recommender produces the list of datasets ordered by decreasing relevance score.
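One way the graph construction and shortest-path aggregation might be sketched; the edge representation, undirected traversal, and the unit weight credited to path nodes are assumptions, since the disclosure does not fix how path weights are aggregated:

```python
from collections import deque
from itertools import combinations

def recommend_structural(edges, context):
    """edges: dict mapping (i, j) -> weight (frequency of use), treated as
    undirected for traversal. Returns candidate datasets ranked by the
    aggregated weight of the edges/paths that reach them."""
    graph = {}
    for (i, j), w in edges.items():
        graph.setdefault(i, {})[j] = w
        graph.setdefault(j, {})[i] = w

    scores = {}
    # Immediate neighbors of context datasets that are not in the context.
    for n in context:
        for nbr, w in graph.get(n, {}).items():
            if nbr not in context:
                scores[nbr] = scores.get(nbr, 0) + w
    # Nodes on the shortest path between each pair of context datasets.
    for a, b in combinations(context, 2):
        for node in _shortest_path(graph, a, b):
            if node not in context:
                scores[node] = scores.get(node, 0) + 1  # assumed unit credit
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

def _shortest_path(graph, start, goal):
    """Breadth-first shortest path (by hop count) from start to goal."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        for nbr in graph.get(path[-1], {}):
            if nbr == goal:
                return path[1:] + [nbr]
            if nbr not in seen:
                seen.add(nbr)
                queue.append(path + [nbr])
    return []
```

A production version would likely use a weighted shortest-path algorithm (e.g., Dijkstra) rather than hop-count BFS; BFS is used here to keep the sketch dependency-free.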
Usage-based recommender 150d recommends datasets used together with one or more datasets in the current project by users proximate to the current system user. Usage-based recommender 150d uses the decision tree for usage-based relationships stored in knowledge base 130.
There are two embodiments for a usage-based recommender: usage-based 1 (source-related usage) and usage-based 2 (target-related usage). In usage-based 1, the usage recommender uses proximity between users to recommend the datasets most used by users proximal to the context user, in order to identify alternative source datasets.
Assume the system has knowledge of n data sets represented by the set D and the identities of m users represented by the set U. Consider P to be a three-dimensional matrix where each P[i,j,k] is the proximity between users U[i] and U[j] along dimension k, where k=0 is department, k=1 is role, and k=2 is follower relationship. P[i,j,0]=1 if users U[i] and U[j] are in the same department, else it is 0; by definition, P[i,j,0]=P[j,i,0], and similarly for role. P[i,j,2]=1 if user U[i] follows user U[j]; P[i,j,2] need not be equal to P[j,i,2]. Other dimensions of proximity may be computed based on shared interests, shared project participation, etc.
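A sketch of how the proximity matrix P might be built from user profiles; the profile fields `dept`, `role`, and `follows` are illustrative names, not the system's actual schema:

```python
# Hypothetical construction of the proximity matrix P described above.
# P[i][j][0]: same department (symmetric); P[i][j][1]: same role (symmetric);
# P[i][j][2]: i follows j (need not be symmetric).

def build_proximity(users):
    """users: list of dicts with 'dept', 'role', and 'follows' (a set of
    user indices). Returns P as a nested list P[i][j][k]."""
    m = len(users)
    P = [[[0, 0, 0] for _ in range(m)] for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            P[i][j][0] = 1 if users[i]["dept"] == users[j]["dept"] else 0
            P[i][j][1] = 1 if users[i]["role"] == users[j]["role"] else 0
            P[i][j][2] = 1 if j in users[i]["follows"] else 0  # asymmetric
    return P
```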
Let G be a three-dimensional matrix where G[i,j,k] is the frequency of use of dataset D[i] to produce dataset D[j] by user U[k], where D[i] is a candidate alternate source dataset (e.g., as shown in
The usage-based 2 recommender uses proximity between users to recommend the datasets most used by users proximal to the context user, in order to identify alternative target datasets. Assume the system has knowledge of n data sets represented by the set D and of m users represented by the set U. Consider P to be a three-dimensional matrix where each P[i,j,k] is the proximity between users U[i] and U[j] along dimension k, where k=0 is department, k=1 is role, and k=2 is follower relationship. P[i,j,0]=1 if users U[i] and U[j] are in the same department, else it is 0; by definition, P[i,j,0]=P[j,i,0], and similarly for role. P[i,j,2]=1 if user U[i] follows user U[j]; P[i,j,2] need not be equal to P[j,i,2]. Other dimensions of proximity may be computed based on shared interests, shared project participation, etc.
Let G be a three-dimensional matrix where G[i,j,k] is the frequency of use of dataset D[i] with dataset D[j] to produce some other result by user U[k], where D[j] is a candidate alternate target dataset (i.e. a related result dataset; see
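The disclosure does not fix how P and G are combined into a single relevance score. One plausible combination, summing co-use frequency weighted by the proximity of each user to the context user, might look like this (the equal weighting of proximity dimensions, and the maximal self-weight for the context user, are assumptions):

```python
# Hypothetical usage-based scoring: rank candidate target datasets j by the
# frequency with which proximal users used them with the context datasets i.

def score_usage_targets(G, P, context_user, context_datasets, dims=(0, 1, 2)):
    """G[i][j][k]: frequency of use of dataset i with dataset j by user k.
    P[u][v][d]: proximity of user u to user v along dimension d."""
    n_datasets = len(G)
    n_users = len(G[0][0])
    scores = {}
    for j in range(n_datasets):
        if j in context_datasets:
            continue
        total = 0.0
        for i in context_datasets:
            for k in range(n_users):
                if k == context_user:
                    weight = len(dims)  # assumed: own usage gets max weight
                else:
                    weight = sum(P[context_user][k][d] for d in dims)
                total += weight * G[i][j][k]
        if total > 0:
            scores[j] = total
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```

The same skeleton serves usage-based 1 by reading G[i][j][k] as "D[i] used to produce D[j]" and scoring candidate sources i instead of targets j.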
Classification-based recommender 150e recommends datasets that have been similarly classified (manually or using ML techniques) to one or more datasets in the project, e.g., by the finance business function. Classification-based recommender 150e uses common classifiers to recommend related result datasets. The classification-based recommender 150e uses the decision tree for classification-based relationships stored in knowledge base 130.
Assume the system has knowledge of n data sets represented by the set D and of m classifiers represented by the set C. Consider DC to be a two-dimensional matrix where DC[i,j]=1 if dataset D[i] is classified by classifier C[j] and DC[i,j]=0 if it is not. For each data element in dataset D[i] that is classified by classifier C[j], add 1 to DC[i,j] to compute a relevance score.
Assume that the context has k datasets represented by the set N. For each dataset, the recommender 150e collects from matrix DC all datasets that have been classified by the same classifier, aggregating their relevance scores by classification scheme. The recommender 150e returns the list of datasets ranked by relevance score per classification scheme.
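A sketch of the classification-based aggregation, assuming DC holds the per-dataset element counts per classifier described above:

```python
# Hypothetical sketch: for each classifier shared with a context dataset,
# collect other datasets under that classifier and aggregate their scores.

def recommend_by_classification(DC, classifiers, context):
    """DC[i][j]: number of data elements of dataset i classified by
    classifier j (0 = not classified). Returns, per classification scheme,
    non-context datasets ranked by aggregated relevance score."""
    n = len(DC)
    results = {}  # classifier index -> {dataset index: aggregated score}
    for c in context:
        for j in range(len(classifiers)):
            if DC[c][j] == 0:
                continue  # context dataset not under this classifier
            bucket = results.setdefault(j, {})
            for i in range(n):
                if i not in context and DC[i][j] > 0:
                    bucket[i] = bucket.get(i, 0) + DC[i][j]
    # Rank datasets by relevance score within each classification scheme.
    return {j: sorted(b.items(), key=lambda x: x[1], reverse=True)
            for j, b in results.items()}
```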
Organizational and social recommender 150f recommends datasets based on the organizational or social ties between the author or editor of the datasets already included in the project and other authors associated with them via such ties (follower-followed tie, same-department tie, etc.). Social networking techniques are used as part of this recommender. Organizational and social recommender 150f uses the decision tree for organizational or social relationships stored in knowledge base 130.
For the organizational or social relationships between users, the recommender 150f maintains the relationships between users based on the user profiles, where information such as followers/followees and org chart attributes is specified.
Recommender system 120 further includes user interface module 135. User interface module 135 receives selection of datasets from a user and presents the selected datasets via a user interface. User interface module 135 also provides user client 110 with access to the system, can optionally show the inferred user goal (e.g., as shown in FIG. 4A2), and allows the user to accept or replace it with a different data analysis goal, such as “find a cleaner dataset,” “enrich the dataset,” or “integrate datasets.”
User interface module 135 enables two dedicated visualization components. First, a recommender viewer shows each of the datasets in the ranked list (recommendations) ‘in relation to’ the dataset A selected by the user. The user interface visually shows the type of content relation (superset/subset of the rows/columns in A) and the diff statistics in terms of profiling information between A and the proposed C (type of added columns; change in metadata such as number of rows, columns, or quality metrics), e.g., as shown in C1-C6 of FIG. 4A1. Second, a preview function can be called as the user selects one of the datasets in the recommender bar, to be displayed as a preview, e.g., as discussed in conjunction with
User interface module 135 implements all of the user interfaces shown in FIGS. 4A1-13.
Referring to
The method begins with receiving 305 a user selection of a first dataset. When a user takes action in a project, recommender system 120 infers user intent based on three classes of actions taken by a user, as discussed above in conjunction with Table 1.
Referring also to FIGS. 4A1-4C, there are shown examples of a user interface provided to a client device by recommender system, according to various embodiments. FIG. 4A1 illustrates a user interface 400 showing a recommender bar 410 with first recommendations based on a lineage relationship according to one embodiment. In the example shown in FIG. 4A1, the user has selected 305 and added the dataset “Inactive Customers” 405 to the user's project (“Customer Analysis”), as illustrated. The user selection 305 of the (first) dataset may occur when recommender system 120 receives a user query, e.g., for the keywords “inactive customer data.” Recommender system 120 processes these keywords and searches them against the various datasets (e.g., the database tables and associated metadata stored in knowledge base 130) for matching datasets. The results of the search include the dataset “Inactive Customers,” the selection 305 of which results in the user interface 400 shown in FIG. 4A1.
After receiving the selection of dataset 405 (“Inactive Customers”), recommender system 120 processes this action according to the user actions in Table 1, in which the user action of adding a dataset to an empty project results in a recommendation of alternative source datasets or related result datasets. In so doing, the method determines 310 a context corresponding to the user selection of the first dataset or, if a prior context existed, determines the updated context.
Based on the first dataset and the determined context, the next step in the method is determining 315 one or more dataset recommenders, each of the one or more recommenders corresponding to a relationship type between datasets. Recommender system 120 transfers the user context to each recommender 250 or, if a prior context existed, transfers the updated context.
Based on the relationship types, the method then determines 320 a plurality of second datasets related to the first dataset. Each recommender 250 consults the context and knowledge base 130 and computes its list of recommended datasets.
Each of the plurality of second datasets is then scored 325 using a relevance ranking algorithm specific to the corresponding relationship type to score the relevance of the second dataset to the first dataset, and ranked 330 based on the scoring.
The method then selects 335 a subset of the ranked datasets as the recommended datasets. In one embodiment, recommender system 120 aggregates the recommendation lists from the different recommenders 250 and selects the highest ranking datasets from the different recommenders 250. User interface module 135 presents 340 the recommended datasets in a graphical user interface, e.g., 400 of FIG. 4A1, wherein the recommended datasets are grouped by relationship type to the first dataset. The recommended datasets 415, 420 are presented to the user in the recommendation bar, e.g., 410 of FIG. 4A1.
In this example, specific data sets Cx are recommended for a given dataset A based on each type of relationship. Thus, the user interface 400 displays the first set of several recommendations 420, 425 in the recommender bar 410, categorized in two groups by lineage relationship (shown by tab 415a): k-derived datasets 420 (datasets C1-C3: join; C4-C5: lookup) and 1-derived datasets 425 (C6: columns added; C7: columns removed). Datasets C1-C3 are represented by a join icon 430 indicating a join operation, meaning that each of these datasets resulted from a join operation of the Inactive Customer dataset with another dataset. Datasets C4 and C5 are represented by a lookup icon 435 indicating a lookup operation, meaning that each of these datasets resulted from a lookup operation on the Inactive Customer dataset. Dataset C6 is represented by a column add icon 440 indicating a column add operation, meaning that this dataset resulted from the addition of one or more columns to the Inactive Customer dataset. Dataset C7 is represented by a column remove icon 445 indicating a column remove operation, meaning that this dataset resulted from the removal of one or more columns from the Inactive Customer dataset. Each dataset Cx also shows information indicating whether the data was validated, included extra data, or had missing data (“Extra,” “Missing,” and “Validated” labels). Other tabs 415 are available for recommendations based on content relationships and usage relationships.
FIG. 4A2 illustrates a user interface 400′ similar to FIG. 4A1, but showing a recommender bar 410 with a menu control 455 for selecting a goal for directing recommendations according to one embodiment. In this example, the user can use drop-down menu 455 to select a goal to help refine the dataset selections provided.
Returning to
Returning to FIG. 4A1, recommended datasets 420, 425 are presented, according to step 340 of the above method, in a recommendation bar 410 of a graphical user interface 400 as described above, grouped by relationship type of the recommended datasets 420, 425 to the first dataset 405. When the user selects (step 345), e.g., group 420 of FIG. 4A1 via icon 450, this action is processed by recommender system 120 according to the user actions in Table 1, specifically Class 1 (the user rating a dataset), using the decision tree for the k-derived relationship stored in knowledge base 130, since the datasets here, C1-C5, were k-derived (having more than one parent dataset, i.e., the dataset Inactive Customers and at least one other dataset). Recommender system 120 then generates a further set of recommendations within the k-derived relationship type, but now categorized by types of k-derived relation. The result is presented (340) in user interface 500 of
Continuing with
In another example, recommender system 120 receives a user selection of the content relation tab 415b. The result is shown in the user interface 700 of
When the user selects (step 345), e.g., group 720 of
Continuing with
In yet another example, recommender system 120 receives a user selection of the social relation tab 415c. The result is shown in the user interface 1000 of
When the user selects (step 345), e.g., group 1020 of
Continuing with
Referring again to
In the foregoing discussion, the examples provided regarding data sets pertaining to customers, sales transactions, and the like are merely one example usage domain for the recommender system 120; the recommender system 120 may be used in many other domains, including scientific (e.g., datasets of experimental outcomes), medical (e.g., datasets of treatments and patient outcomes), industrial and engineering (datasets of engineering requirements, materials, performance data), and so forth.
The methods and systems described herein provide measurable improvements in database access technology. Multiple types of metrics can measure the improvement that the method and system provide to the technology underlying current applications for data transformation or preparation by data professionals (e.g., data analysts, data scientists, and ETL developers), as follows.
The first two types of metrics can be computed at the level of individual users or individual user's tasks. The first type of metric is the time taken by a data professional to find the relevant datasets and thus complete the analysis. This includes global user performance metrics such as “average time to complete the analysis” or more specific user performance metrics such as “average time to find a 2nd dataset as soon as a 1st dataset has been found.” The second type of metric is the average quality of the datasets found. This can be measured objectively through per-dataset relevance metrics (see relationships algorithms in this method) applied to all the datasets used when the analysts relied vs. did not rely on the proposed method and system. Alternatively, it can be measured subjectively via ratings by the users on the dataset used (e.g., prompted user feedback).
In addition, other improved metrics can be computed at the level of organizations or communities of users over a period of time. One metric in this category is the rate of reuse of datasets across the members of the community (expected to increase with the proposed method and system). This can be computed as one measure of central tendency (percentage over all datasets, mean, mode, or median) or as the detailed distribution of values (see skewness of distribution). Another metric in this category is the rate of duplication of datasets across the members of the community (expected to decrease with the proposed method and system). This can be computed as one measure of central tendency (percentage over all datasets, mean, mode, or median) or as the detailed distribution of values (see skewness of distribution). Yet another metric in this category is the number of requests that the IT department of the organization received from data professionals for datasets even when the dataset requested was available to the data professionals, but there was no recommendation system deployed.
Finally, an added-value metric shows the number of new analyses produced over a period of time due to the ready availability of high-quality recommendations. This last metric is a corollary of the already existing metrics and assumes baseline measures of analyses produced over a period of time in the absence of the proposed method and system. This final metric captures the “outcome” of the innovation on the overall quality and quantity of the work.
Some portions of the above description describe the embodiments in terms of algorithmic processes or operations. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs comprising instructions for execution by a processor or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times to refer to these arrangements of functional operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
As used herein, the term “module” refers to computer program logic utilized to provide the specified functionality. Thus, a module can be implemented in hardware, firmware, and/or software. In one embodiment, program modules are stored on a storage device, loaded into memory, and executed by a processor. Embodiments of the physical components described herein can include other and/or different modules than the ones described here. In addition, the functionality attributed to the modules can be performed by other or different modules in other embodiments. Moreover, this description occasionally omits the term “module” for purposes of clarity and convenience.
The present invention also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored on a computer readable medium that can be accessed by the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of computer-readable storage medium suitable for storing electronic instructions, and each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
As used herein any reference to “one embodiment” or “an embodiment” means that a particular element, feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, unless expressly stated to the contrary, “or” refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
In addition, use of “a” or “an” is employed to describe elements and components of the embodiments herein. This is done merely for convenience and to give a general sense of the disclosure. This description should be read to include one or at least one, and the singular also includes the plural unless it is obvious that it is meant otherwise.
Upon reading this disclosure, those of skill in the art will appreciate still additional alternative structural and functional designs for a system and a process for determining similarity of entities across identifier spaces. Thus, while particular embodiments and applications have been illustrated and described, it is to be understood that the present invention is not limited to the precise construction and components disclosed herein and that various modifications, changes and variations which will be apparent to those skilled in the art may be made in the arrangement, operation and details of the method and apparatus disclosed herein without departing from the spirit and scope as defined in the appended claims.
This application claims priority to U.S. Provisional Application No. 62/159,178, filed May 8, 2015, which is incorporated by reference in its entirety.