Data is commonly presented in structured or semi-structured fashion. For instance, there may be a number of data entries making up the data. Each data entry may have values for a number of different features, or attributes. For some features, the values of each data entry may be restricted to a set of permissible or possible values. This type of data is structured data. For other features, the values of each data entry may not be so restricted. This type of data is semi-structured.
As noted in the background, data can be semi-structured or structured. For example, consider survey data that may be generated by asking customers or other users a series of questions through an Internet web site. The questions may correspond to features, and the answers to the questions from the users may correspond to the data entries. Some questions may be answered by users selecting from a limited number of choices, such as a rating from 1-5, and so on. Other questions may be answered by users typing in freeform text, such as to provide comments, and so on.
Collecting such survey data is useful to determine, for instance, how satisfied customers are with the products or services of a company. However, gleaning insights from survey data can be difficult to achieve, particularly when different personnel of the company may be interested in gleaning different types of insights. This issue is exacerbated when the survey data is semi-structured.
Techniques disclosed herein ameliorate these difficulties. An innovative approach by which structured data can be ranked, and reranked pursuant to user interaction, is provided, so that insights into the data can be gleaned. Another innovative approach is provided by which semi-structured data can be transformed into structured data, so that it can be ranked along with the originally structured data, for instance. A third innovative approach is provided to display an interactive graphical user interface (GUI) to present such rankings and permit the user to select data of interest to view rerankings based on the selected data.
As one example, data entries may correspond to different users answering a survey. The survey is made up of questions, which correspond to features. For a given question, a user may be permitted to select from a limited number of choices, such as a rating between 1 and 5, and so on. The limited number of choices are thus the permissible values for the feature in question. Each data entry, in other words, has a value for the feature, but the value has to be one of the permissible values for the feature.
In one implementation, a numerical feature can be processed in part 102 so that the feature has such a set of permissible values. For instance, the data entries may have a large number of different numerical values, and indeed each may be unique. To limit the number of permissible values for the feature, the numerical values of the entries for this feature may be quantized and transformed to categorical data. For example, if the data entries have numerical values between one and one hundred for a feature, rather than having up to one hundred permissible values for the feature, the numerical values may be quantized and transformed to a more limited number of ten permissible values, corresponding to ranges such as 1-10, 11-20, 21-30, and so on, through 91-100. Different quantization and transformation approaches can be employed if desired.
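As an illustration only, the quantization of the one-to-one-hundred example can be sketched in Python as follows. The function name `quantize` and its parameters are hypothetical, and equal-width bins are just one of the quantization approaches that could be used.

```python
def quantize(value: int, low: int = 1, high: int = 100, bins: int = 10) -> str:
    """Map an integer in [low, high] to a categorical range label such as '11-20'.

    Hypothetical helper: equal-width binning is an assumption of this sketch.
    """
    if not low <= value <= high:
        raise ValueError(f"value {value} is outside [{low}, {high}]")
    width = (high - low + 1) // bins            # 10 values per bin in the example
    index = min((value - low) // width, bins - 1)
    start = low + index * width
    # The last bin absorbs any remainder so that it always ends at `high`.
    end = high if index == bins - 1 else start + width - 1
    return f"{start}-{end}"


labels = [quantize(v) for v in (1, 10, 11, 57, 100)]
```

Each numerical value is thereby replaced by one of ten categorical labels, which then serve as the permissible values of the feature.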
The method 100 includes constructing a graph (104). The graph has nodes representing unique combinations of features and their permissible values. For example, if feature A has permissible values aa, ab, and ac, and feature B has permissible values ba, bb, bc, and bd, there are unique feature-permissible value combinations Aaa, Aab, Aac, Bba, Bbb, Bbc, and Bbd. Therefore, in this simplified example, there are a total of seven nodes within the graph.
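The node enumeration of this simplified example can be sketched in Python as below. Representing a node as a `(feature, value)` tuple and the name `build_nodes` are assumptions made for illustration only.

```python
def build_nodes(permissible_values):
    """Return one (feature, value) node per feature-permissible value combination."""
    return [
        (feature, value)
        for feature, values in permissible_values.items()
        for value in values
    ]


# The two features of the simplified example and their permissible values.
permissible = {
    "A": ["aa", "ab", "ac"],
    "B": ["ba", "bb", "bc", "bd"],
}
nodes = build_nodes(permissible)
```

With three permissible values for feature A and four for feature B, `nodes` contains seven entries.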
The graph further has edges. Each edge connects two nodes. Each edge has a weight that measures the statistical dependency between the two nodes to which it connects, as reflected within the data (i.e., within the data entries). The statistical dependency of an edge can be defined as denoting how dependent the two nodes that the edge connects are on one another, in a statistical manner. This statistical dependency can be more particularly defined in one implementation as the normalized pairwise mutual information (NPMI) between the two (permissible) values of the two connected nodes. The NPMI between every unique pair of nodes in the graph is determined, but edges are created within the graph just for those unique node pairs that have NPMIs above a predetermined threshold.
For example, consider the two nodes Aaa and Bba. The NPMI(Aaa,Bba) is defined as:
In this equation, H(X) measures the entropy of a feature X having values x within the data entries, and can be expressed as:
In each of these two equations, p(x) is the frequency of permissible value x of feature X within the data entries, and p(x,y) is the frequency of the pair of permissible values x, y of features X, Y, respectively, within the entries.
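A minimal Python sketch of the edge construction is given below for illustration. It assumes the commonly used normalization, in which the pointwise mutual information log(p(x,y)/(p(x)p(y))) is divided by -log p(x,y), and the standard Shannon entropy; the normalization actually used in a given implementation may differ. The function names, the dictionary-based data entries, and the default threshold are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(entries, feature):
    """Shannon entropy of one feature, using value frequencies as p(x)."""
    counts = Counter(entry[feature] for entry in entries)
    n = len(entries)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def npmi_edges(entries, threshold=0.0):
    """Return {(node_a, node_b): weight} for node pairs whose NPMI exceeds threshold.

    Nodes are (feature, value) tuples. Pairs that never co-occur are skipped;
    their NPMI is -1 under this normalization, below any sensible threshold.
    """
    n = len(entries)
    single, pair = Counter(), Counter()
    for entry in entries:
        items = sorted(entry.items())          # features are unique per entry
        single.update(items)
        pair.update(combinations(items, 2))
    edges = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        if p_ab == 1.0:
            continue                           # -log p(x,y) is zero; undefined
        p_a, p_b = single[a] / n, single[b] / n
        score = math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)
        if score > threshold:
            edges[(a, b)] = score
    return edges


entries = [
    {"food": "hamburger", "drink": "milk"},
    {"food": "hamburger", "drink": "milk"},
    {"food": "pizza", "drink": "soda"},
    {"food": "pizza", "drink": "milk"},
]
edges = npmi_edges(entries)
```

In this toy data, "pizza" and "soda" co-occur more often than chance would predict and receive an edge, whereas "pizza" and "milk" co-occur less often than chance and do not.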
The method 100 includes ranking the features, the permissible values of the features, and links, based on the graph (106). A link is defined as follows. For a given permissible value of a given feature, the links include the unique combinations of other features and the permissible values of these other features. For example, if feature A has permissible values aa, ab, and ac; feature B has permissible values ba, bb, bc, and bd; and feature C has permissible values ca and cb, then the links for the permissible value ba of feature B are Aaa, Aab, Aac, Cca, and Ccb.
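The definition of a link can be illustrated with the short Python sketch below, which lists the links for a permissible value of feature B in the three-feature example. The function name `links_for` is hypothetical.

```python
def links_for(feature, permissible_values):
    """List the (other_feature, value) combinations linked from any value of `feature`."""
    return [
        (other, value)
        for other, values in permissible_values.items()
        if other != feature
        for value in values
    ]


permissible = {
    "A": ["aa", "ab", "ac"],
    "B": ["ba", "bb", "bc", "bd"],
    "C": ["ca", "cb"],
}
links_of_ba = links_for("B", permissible)   # links for value ba of feature B
```

The result contains the five combinations Aaa, Aab, Aac, Cca, and Ccb, matching the example above.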
In one example implementation, the features, permissible values thereof, and links of the permissible values are ranked as follows. A centrality measure of each node in the graph is determined (108). The centrality measure of a given node is based on the edges that extend from the node that have the highest K weights. For example, K may be three, such that the centrality measure of each node is based on the edges extending therefrom that have the highest three weights. If a particular node has fewer than K edges extending therefrom, then each edge is selected. The centrality measure of node i having edges j can be expressed as:
In this equation, $C_i$ is the centrality measure of node $i$, $W_{i,j}$ is the weight of edge $j$ extending from node $i$, and the summation is over the edges $j$ having the highest weights.
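As a sketch under stated assumptions, the centrality measure can be computed in Python as the plain sum of the K highest weights of the edges at each node. Whether any further normalization is applied is not assumed here, and the function name and the edge dictionary layout are hypothetical.

```python
import heapq


def centrality(edge_weights, k=3):
    """Sum the k highest edge weights at each node of an undirected graph.

    `edge_weights` maps (node_i, node_j) pairs to weights. A node with fewer
    than k edges simply uses all of its edges. Nodes without any edge do not
    appear in the result.
    """
    incident = {}
    for (i, j), weight in edge_weights.items():
        incident.setdefault(i, []).append(weight)
        incident.setdefault(j, []).append(weight)
    return {node: sum(heapq.nlargest(k, ws)) for node, ws in incident.items()}


weights = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.5,
    ("a", "d"): 0.4,
    ("a", "e"): 0.1,
    ("b", "c"): 0.2,
}
scores = centrality(weights, k=3)
```

Node "a" has four edges, so only its three highest weights contribute; node "e" has a single edge, which is used on its own.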
A rank of each feature is determined based on at least the centrality measures of the nodes representing unique combinations that include the feature (viz., the nodes that include the feature) (110). In one example implementation, the ranking of a feature depends on the graph-based centrality measure of the feature, and an intrinsic measure that depends on the feature's entropy and the cluster size the feature represents. For example, the ranking of a feature can be expressed as:
In this equation, $\mathrm{rank}_{F_l}$ is the ranking of feature $F_l$. The first term, $(\cdot)^{\alpha}$, relates to the intrinsic measure of this feature, and the latter term, $(\cdot)^{\beta}$, relates to the graph-based centrality measure.
Furthermore, $H(X)$ measures the entropy of feature $X$ as noted above, and $\mathrm{clusSize}(F_l)$ is the size of the cluster that includes this feature. For instance, a feature may represent a column that is similar to other columns of the data that were removed. As an example, one column may represent customer name, which is similar to another column that represents customer code. A feature that corresponds to the customer name can represent a cluster that includes the customer name and the customer code. The size of this cluster is thus taken into account in the ranking.
Furthermore, $C_i$ is the centrality measure of node $i$ as noted above, and $P_i$ is the frequency of the value represented by node $i$ within the data entries. The constants $\alpha$ and $\beta$ are selected to balance between the intrinsic measure of a feature, and the graph-based centrality measure of the feature, as desired. For an equal balancing, for instance, both may be equal to one.
A rank of each permissible value of each feature is determined based on the centrality measure of the node representing the unique combination of the permissible value in question and the feature in question, and based on the frequency of this permissible value of this feature within the data entries (112). For instance, the rank of a permissible value of a feature can be expressed as:
$\mathrm{rank}_{V_i} = P_i \, C_i$.
In this equation, $\mathrm{rank}_{V_i}$ is the rank of the value of node $i$ for the feature of node $i$, $C_i$ is the centrality measure of node $i$ as noted above, and $P_i$ is the frequency of the value represented by node $i$ within the data entries as noted above.
A rank of each link is determined based on the weight of the edge corresponding to the link and based on the rank of the destination feature of the link (114), such as by multiplying the edge's weight by the destination feature's rank. That is, as noted above, for a given permissible value of a given feature, the links include the unique combinations of other features and the permissible values of these other features. Thus, a given such link includes a unique combination of a feature and a permissible value, and its rank is determined based on the weight of the edge leading to the node representing this unique combination from the node representing the unique combination of the given permissible value and the given feature, and based on the rank of the feature of the link. Stated another way, a node Aaa representing feature A and value aa has a link to node Bba representing feature B and value ba. The rank of this link is determined based on the weight of the edge from node Aaa to node Bba and based on the rank of feature B. Feature B is the feature of this link, and node Aaa is the node having the link.
For example, the rank of a link can be expressed as:
$\mathrm{rank}_{l_{ij}} = w_{ij}^{\gamma} \cdot \mathrm{rank}_{F_l}, \quad j \in l$.
In this equation, $\mathrm{rank}_{l_{ij}}$ is the rank of link $l$ from node $i$ to node $j$, where node $i$ is the node having the link, and node $j$ is the node representing the unique combination of a value and a feature of the link. Furthermore, $w_{ij}$ is the weight of the edge between these two nodes, $\gamma$ is a constant that is selected to balance the weight of the edge and the rank of the destination feature, as desired, and $\mathrm{rank}_{F_l}$ is the ranking of feature $F_l$, where node $j$ belongs to feature $F_l$ (i.e., $j \in l$).
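The value ranking and link ranking can be sketched in Python as follows. The sketch assumes that γ acts as an exponent on the edge weight and takes the feature ranks as given inputs rather than computing them. All function names and the data layout, with nodes as `(feature, value)` tuples, are hypothetical.

```python
def value_rank(frequency, centrality):
    """Rank of a permissible value: its frequency times its node centrality."""
    return frequency * centrality


def link_rank(edge_weight, destination_feature_rank, gamma=1.0):
    """Rank of a link: edge weight raised to gamma, times the destination feature rank."""
    return (edge_weight ** gamma) * destination_feature_rank


def ranked_links(node, edge_weights, feature_ranks, gamma=1.0):
    """Return (destination_node, rank) pairs for `node`, highest rank first."""
    scored = []
    for (i, j), weight in edge_weights.items():
        if node == i:
            destination = j
        elif node == j:
            destination = i
        else:
            continue
        rank = link_rank(weight, feature_ranks[destination[0]], gamma)
        scored.append((destination, rank))
    return sorted(scored, key=lambda item: item[1], reverse=True)


edge_weights = {
    (("A", "aa"), ("B", "ba")): 0.5,
    (("A", "aa"), ("C", "ca")): 0.4,
}
feature_ranks = {"B": 1.0, "C": 2.0}     # assumed to have been computed already
links = ranked_links(("A", "aa"), edge_weights, feature_ranks)
```

Here the link to Cca outranks the link to Bba despite its lower edge weight, because feature C is ranked higher than feature B.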
The initial ranking of the features, the permissible values, and the links can be subsequently modified responsive to a selection of a particular unique combination of a feature and a permissible value thereof (116). For example, a user may select a link in accordance with an interactive GUI, as is described in detail later in the detailed description. The features, the permissible values of each feature, and the links for each permissible value of each feature are then reranked after construction of a sub-graph per the arrow 118. Stated another way, the reranking is performed based on a propagation of the graph from the node corresponding to the selected combination.
More specifically, graph propagation begins from the node corresponding to the selected combination to determine which features are most relevant, such as the most relevant K features. Data entries that include the selected combination, and which further include the most relevant features, are extracted, and a subgraph representing this subset of data entries is constructed. The subgraph and the originally constructed graph are employed to perform the reranking. The features and the permissible values are ranked by scores assigned from the propagation. The links are ranked so that high ranks are assigned to links that received higher ranks in the subgraph as compared to in the original graph.
Mathematically, a specific value $U$ is selected that maps to node $(F_l, U)$ with condition $F_l = U$. The centrality measure $C_i$ of each node $i$ is determined by a graph propagation from the node $U$, and the ranks of the features and the permissible values thereof are updated by determining them as described above, but using these propagated centrality measures for the nodes. The data entries that satisfy the condition are extracted, and new weights $w_{ij}^{u}$ are determined. The ranks for the links are then determined as:
In this equation, $w_{ij}$ is the weight of the edge in question in the original graph, whereas $w_{ij}^{u}$ is the weight in the new subgraph. Furthermore, const is a constant that is selected to suppress noise resulting from large ratios that may occur for very low values.
The data entries may be the same as those in relation to which the method 100 has been described, but that also include freeform features in addition to the features that have sets of permissible values. Unlike the latter features, freeform features do not have sets of permissible values from which the textual data is selected. An example of a feature that has a set of permissible values is a state feature, which may have as its set of permissible values the fifty United States, the District of Columbia, and various US territories. An example of a freeform feature, by comparison, is a comments feature, where in the context of a survey respondents may enter text to answer a question such as “what are the things preventing you from recommending a given product.”
The method 200 includes extracting information items from the textual data (204). Information items are types of different text, such as terms, named entities like companies and people, and topics. The information items are thus an abstraction of various words or phrases within the textual data of the entries. For example, various data entries may include the names of cities for a freeform feature, like Detroit, Chicago, Los Angeles, Seattle, New York City, and so on. The information item corresponding to or encompassing this textual data is cities, which is the type of this text or an abstraction of this text.
Existing techniques and tools can be employed to extract information items from the textual data of the entries. Such techniques and tools in general perform textual analysis to identify words and phrases within textual data, like that of the data entries, and identify commonalities among these words and phrases, such as information items.
The method 200 includes creating new features for the information items (206). For example, the original free-text feature for which the data entries have textual data may be called “comments.” Two information items, “companies” and “cities,” may have been extracted from the textual data. Therefore, two new features are created, “comments:companies” and “comments:cities.”
The data entries have values for these new features, corresponding to the textual data thereof that is encompassed by the corresponding information items. For example, if a data entry has the term “General Motors” for the freeform feature “comments,” then the data entry has the value “General Motors” for the new feature “comments:companies.” Each new feature thus has a set of unique values, where each unique value is present in at least one data entry. That is, each unique value of each new feature is present in the textual data for a freeform feature in at least one data entry. In some implementations, though, there can be thresholds so that rare words—i.e., words that appear in a relatively small number of data entries—are removed and not considered.
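The creation of new features from a freeform feature can be sketched in Python as below. A simple lookup table of known names stands in for the extraction of information items, which in practice would be performed by existing text-analysis techniques and tools. The table, the function name, and the list-valued new features are assumptions of this sketch.

```python
# Hypothetical lookup table standing in for an information-item extractor.
KNOWN_ITEMS = {
    "companies": ["General Motors"],
    "cities": ["Detroit", "Chicago", "Seattle"],
}


def add_item_features(entry, freeform_feature, known_items):
    """Return a copy of `entry` with one new feature per information item found.

    New features are named '<freeform feature>:<information item>', and each
    holds the list of matching values found in the entry's textual data.
    """
    text = entry.get(freeform_feature, "").lower()
    enriched = dict(entry)
    for item, names in known_items.items():
        found = [name for name in names if name.lower() in text]
        if found:
            enriched[f"{freeform_feature}:{item}"] = found
    return enriched


entry = {"rating": 4, "comments": "I drove my General Motors car to Detroit."}
enriched = add_item_features(entry, "comments", KNOWN_ITEMS)
```

The entry keeps its original features and gains the new features "comments:companies" and "comments:cities", whose values can then be treated like permissible values.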
The method 200 includes adding new nodes to the graph that was constructed in part 104 of the method 100 (208). Each new node represents a unique combination of a new feature and a unique value thereof. Similarly, the method 200 includes adding new edges to the graph (210). Each new edge connects a new node to an existing node of the graph as constructed in part 104 of the method 100. As in part 104 of the method 100, the new edges have weights that measure the statistical dependencies between the nodes as reflected in the data entries, as has been described above.
In some situations, parts 208 and 210 may result in a large number of new nodes and new edges being added to the graph. Therefore, the least relevant new nodes, and the new edges that connect to them, may be subsequently removed from the graph to make analysis more tractable. In this respect, the method 200 can include ranking the unique values of each new feature (212), as in parts 108, 110, and/or 112 of the method 100, where the unique values of a new feature correspond to the permissible values thereof. The method 200 then, for each new feature, removes from the graph the nodes (and their edges) that do not include one of the highest ranked unique values (214).
For example, a new feature may have a large number of unique values, numbering in the tens, hundreds, or even more. An equal number of new nodes are added for this new feature in part 208, with likely an even greater number of new edges added in part 210. In part 212, the unique values of the new feature are ranked. In part 214, just the highest ranked new nodes for the new feature and the edges connecting to these new nodes are retained. The other new nodes, and their connecting edges, are removed. For instance, just the new nodes corresponding to the highest K=3 unique values for the new feature, and their edges, may be retained.
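The pruning of parts 212 and 214 can be sketched in Python as follows, under the assumption that ranks for the unique values of the new features have already been computed. The function name and the data layout are hypothetical.

```python
def prune_new_nodes(new_value_ranks, edge_weights, k=3):
    """Keep the k highest-ranked nodes of each new feature and drop the rest.

    `new_value_ranks` maps (new_feature, value) nodes to ranks, and
    `edge_weights` maps (node_i, node_j) pairs to weights. Returns the set of
    retained new nodes and the edges that do not touch a removed node.
    """
    by_feature = {}
    for (feature, value), rank in new_value_ranks.items():
        by_feature.setdefault(feature, []).append((rank, value))
    kept = set()
    for feature, ranked in by_feature.items():
        for _, value in sorted(ranked, reverse=True)[:k]:
            kept.add((feature, value))
    removed = set(new_value_ranks) - kept
    kept_edges = {
        pair: weight
        for pair, weight in edge_weights.items()
        if pair[0] not in removed and pair[1] not in removed
    }
    return kept, kept_edges


ranks = {
    ("comments:city", "Los Angeles"): 0.9,
    ("comments:city", "San Diego"): 0.7,
    ("comments:city", "Palm Springs"): 0.2,
}
edges = {
    (("food", "hamburger"), ("comments:city", "Los Angeles")): 0.6,
    (("food", "pizza"), ("comments:city", "Palm Springs")): 0.3,
}
kept, kept_edges = prune_new_nodes(ranks, edges, k=2)
```

With K=2, the lowest-ranked city value is removed together with its edge, while the two highest-ranked values and their edges are retained.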
When the method 200 is finished, the remainder of the method 100 can continue, beginning at part 106. As described in relation to the method 100, structured features have sets of permissible values. As described in relation to the method 200, new features created from freeform features have sets of unique values. The unique values of the new features may be considered as the permissible values of the new features. Stated another way, the permissible values of the structured features are the possible values of the structured features, and likewise the unique values of the new features are the possible values of the new features.
As a concrete if rudimentary example of the performance of the graph construction in particular of the methods 100 and 200, reference is now made to
Data entries 306 include values for each structured feature 302A and 302B, and textual data for the freeform feature 304. The value for each structured feature 302A and 302B of each data entry 306 is selected from the set of permissible values of that feature. For example, one data entry may have the values “hamburger” and “milk” for the features 302A and 302B, respectively, whereas another data entry may have the values “pizza” and “milk.” The textual data for the freeform feature 304 of each data entry 306 is not so limited by comparison, and can include any type of text.
In the example of
There are thus new nodes 602A and 602B, referred to as the new nodes 602. The node 602A corresponds to the unique combination of the value “Los Angeles” and the new feature “comments:city,” and the node 602B corresponds to the unique combination of the value “San Diego” and this same new feature. Note that there is no node within the graph 400′ corresponding to the unique combination of the value “Palm Springs” and the new feature “comments:city.” This may be because the node for this unique feature-value combination was removed in part 214; that is, part 214 may have considered just the two highest ranked unique values for the new feature “comments:city.”
There are also new nodes 604A and 604B, referred to as the new nodes 604. The node 604A corresponds to the unique combination of the value “Fast Burger” and the new feature “comments:restaurant,” and the node 604B corresponds to the unique combination of the value “Artisan Pizza” and this same new feature. The graph 400′ further includes edges 406′, which are the edges 406 of the graph 400 of
Graphical elements corresponding to the features, including the structured features and the new features that have been described, are displayed, in an order corresponding to the ranking of the features (702). Within each graphical element, a graphical representation of the frequencies of the corresponding feature's permissible values within the data entries is displayed (704), and the permissible values are also displayed in an order corresponding to the ranking of the permissible values (706). Furthermore, within each graphical element, for each permissible value of the corresponding feature, the links for the permissible value are displayed according to the ranking of the links (708).
Dynamic interaction with the data display can be achieved in at least two different ways. First, the method 700 can include receiving passive selection of a permissible value of a feature, responsive to which detailed information regarding the permissible value is displayed (709). For example, the detailed information can include detailed information regarding the presence of the passively selected value within the data entries. Passive selection may be achieved, for instance, by a user navigating a pointer to a desired permissible value and hovering the pointer thereover within a GUI in relation to which the graphical elements have been displayed, which is known as “mouseover.”
Second, the method 700 can include receiving an active selection of one of the links that have been displayed (710), which corresponds to the receiving of a selection of a feature-permissible value combination in part 116 of the method 100. For example, within a GUI in relation to which the graphical elements have been displayed, a user may navigate a pointer to a desired link and select this link, using an input device. As a result of the selected link, a reranking is performed (712), corresponding to arrow 118 of the method 100, and the new reranked data is then displayed, per the arrow 714. In this way, the display and redisplay of data is achieved in an interactive manner. A user is able to focus in on the data of interest as desired, to glean insights into the data that may differ for different users.
The graphical elements 802A, 802B, and 802C include graphical representations 804A, 804B, and 804C, respectively, which are collectively referred to as the graphical representations 804. The graphical representations 804 depict the frequencies, within the data entries, of the permissible values of the corresponding features.
The graphical representation 804A is a “word cloud” graphical representation of the unique values of the new feature of the graphical element 802A that may have been created according to the method 200. Each word of the graphical representation 804A is one of the unique values of this feature. Each word has a size within the graphical representation 804A corresponding to its frequency within the data entries (i.e., the number of data entries in which the word is present).
The graphical representations 804B and 804C are pie chart graphical representations of the permissible values of the structured features of the graphical elements 802B and 802C, respectively. Each slice corresponds to a permissible value of a structured feature. The size of each slice corresponds to its permissible value's frequency within the data entries (i.e., the number of data entries having the permissible value for the feature in question).
To glean more specific information regarding the permissible values displayed within the graphical representations 804, a user may passively select a word in the representation 804A or a slice in the representation 804B or 804C. A small text box may then be displayed near the passively selected permissible value that provides detailed information regarding the presence of this permissible value within the data entries, such as the percentage of the data entries that include this value for the feature in question. As an example, a text box 806 is displayed in
For the remainder of the description of
The permissible value 808A has the highest ranking, and the permissible value 808C has the lowest ranking of these three permissible values 808, such that the value 808A is displayed left-most within the graphical element 802A, and the value 808C is displayed right-most. Even lower-ranked permissible values 808 for even lower-ranked features may be displayed by a user performing a scroll-right GUI action within the data display. The permissible values 808 may be color-coded in correspondence with their colors within the graphical representation 804A, which is particularly useful where a graphical representation is a pie chart, for instance.
For the remainder of the description of
The storage devices 904 store program code 909. The processor 902 executes the code 909 to perform the methods 100, 200, and 700 that have been described. It is noted, however, that the methods 100, 200, and 700 can instead be implemented just in hardware, such as via a field-programmable gate array (FPGA) device, an application-specific integrated circuit (ASIC) device, and so on. The storage devices 904 store the data entries 910, structured features 912, and permissible values 914 of the features 912, in relation to which the methods 100 and 700 are performed. The storage devices 904 further store new features 916 and unique values 918 of the new features 916 that may be generated as a result of performance of the method 200.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2014/063218 | 10/30/2014 | WO | 00 |