Embodiments generally relate to presentation of data clusters on a computer generated user interface and more particularly to methods and systems to graphically present detailed information of the data clusters.
Classification of data records or objects into different groups, known as data clustering, is helpful in exploratory statistical data analysis. Examples of exploratory statistical data analysis include pattern analysis, decision making, document retrieval and image segmentation. Once the data records have been clustered, the result is more easily understood with the help of graphical visualization, since analyzing data clusters manually is challenging and the human brain has difficulty visualizing them from raw output. Several methods of displaying a visualization of data clusters, such as a three-dimensional map using spatial relationships among the data clusters, are known in the art. However, analyzing the data clusters and differentiating them visually may still be complex, since detailed information on the clusters and on how the records are grouped into the clusters is lacking.
Various embodiments of systems and methods to visualize data clusters on a visualization panel are described herein. In one aspect, a plurality of data records is received. Further, the received plurality of data records is classified into one or more data clusters based on parameters associated with the plurality of data records. Furthermore, a visualization panel is presented on a computer generated graphical user interface for graphically indicating the number of data records in a data cluster of the one or more data clusters, the density of the data records in the data cluster and the proximity between the one or more data clusters. Also, the visualization panel graphically displays the parameters associated with the one or more data clusters and the distribution of data in the data cluster of the one or more data clusters.
These and other benefits and features of embodiments of the invention will be apparent upon consideration of the following detailed description of preferred embodiments thereof, presented in connection with the following drawings.
The claims set forth the embodiments of the invention with particularity.
The invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. The embodiments of the invention, together with their advantages, may be best understood from the following detailed description taken in conjunction with the accompanying drawings.
Embodiments of techniques to visualize data clusters are described herein. Grouping a set of data records or objects into one or more groups or data clusters is known as data clustering. A data cluster is made up of a number of data records whose parameters or traits are similar to one another when compared to those of other data clusters. The data records may be statistical or numeric data. In one exemplary embodiment, the data records are grouped into data clusters using a data mining algorithm. The data mining algorithm analyzes the set of data records using a set of rules that describe how the data records are grouped together. Further, the data clusters are presented on a computer generated user interface for analyzing the data clusters. The computer may be a desktop computer, workstation, laptop computer, handheld computer, smart phone, console device or the like.
According to one embodiment, the computer generated user interface includes a visualization panel to display detailed information of the data clusters. The visualization panel may include a canvas divided into one or more portions depicting how the data records are grouped into data clusters. The single visualization panel displays, in the one or more portions, the number of data records in the data clusters, the density of the data clusters, the proximity between the data clusters and the parameters of the data clusters. Since detailed information on how the data clusters are formed is displayed on the single visualization panel, analyzing the data clusters by evaluating their parameters may be easier.
In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment”, “this embodiment” and similar phrases, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of these phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Table 1 includes data on the percentage of water, protein, fat, lactose and ash in the milk of different mammals, collectively referred to as the data records of a plurality of mammals' milk.
At step 120, the plurality of data records is classified into one or more data clusters based on parameters associated with the plurality of data records. For example, the parameters may be the percentage of water, protein, fat, lactose and ash. In one exemplary embodiment, the classification is performed by executing a data mining algorithm such as, but not limited to, a ‘K-Means’ algorithm or a ‘CURE’ (Clustering Using REpresentatives) algorithm. The ‘K-Means’ algorithm is a method of data cluster analysis which aims to partition ‘n’ data records into ‘k’ data clusters logically (e.g., an option is provided for a user to input a value for ‘k’) in which each data record belongs to the data cluster with the nearest mean. The CURE algorithm is a method of data cluster analysis for large databases that is robust to outliers and identifies data clusters having non-spherical shapes and wide variances in size.
For example, when the ‘K-Means’ algorithm is executed for the data records depicted in Table 1 with ‘k’ as 3, the data records are classified into three data clusters as depicted in Table 2.
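The following is a minimal sketch of this classification step, assuming scikit-learn's KMeans implementation; the handful of records and the numeric values shown are illustrative placeholders standing in for Table 1, which is not reproduced here.

    # Minimal sketch of the classification step, assuming scikit-learn's KMeans.
    # The records below are illustrative placeholders standing in for Table 1
    # (percentage of water, protein, fat, lactose and ash for each mammal).
    import numpy as np
    from sklearn.cluster import KMeans

    records = {
        "horse": [90.1, 2.6, 1.0, 6.9, 0.35],   # placeholder values
        "deer":  [65.9, 10.4, 19.7, 2.6, 1.40], # placeholder values
        "dog":   [76.3, 9.3, 9.5, 3.0, 1.20],   # placeholder values
        # ... remaining mammals from Table 1 would be listed here ...
    }
    names = list(records)
    X = np.array([records[name] for name in names])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # 'k' = 3
    for name, label in zip(names, kmeans.labels_):
        print(name, "-> data cluster", label + 1)
    print("center values per data cluster:")
    print(kmeans.cluster_centers_)

With only the three placeholder records above the grouping is trivial; run against the full Table 1 with ‘k’ set to 3, the resulting labels would be expected to correspond to the groupings depicted in Table 2 and the cluster centers to the center values of Table 3.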
The output of the ‘K-Means’ algorithm as depicted in Table 2 includes grouping of ‘horse’, ‘orangutan’, ‘monkey’, ‘donkey’, ‘hippo’, ‘camel’, ‘bison’, ‘llama’, ‘mule’ and ‘zebra’ into data cluster 1, grouping of ‘deer’, ‘reindeer’, ‘whale’, ‘seal’ and ‘dolphin’ into data cluster 2, and grouping of ‘buffalo’, ‘guinea pig’, ‘cat’, ‘fox’, ‘pig’, ‘sheep’, ‘dog’, ‘elephant’, ‘rabbit’ and ‘rat’ into data cluster 3. In one embodiment, the data clusters are grouped based on the center values of the parameters (e.g., percentage of water, percentage of protein, percentage of fat, percentage of lactose and percentage of ash) as depicted in Table 3.
The ‘K-Means’ algorithm classifies the data records of Table 1 into three data clusters based on the center values of the percentage of water, percentage of protein, percentage of fat, percentage of lactose and percentage of ash. The center values may be an aggregation of the parameters associated with the data records in the data cluster. The aggregation may be an average (e.g., mean, mode, median), a total, or another function (e.g., max). Thereby, the data records having parameters closer to 88.5% of water, 2.57% of protein, 2.8% of fat, 5.68% of lactose and 0.485% of ash are grouped as data cluster 1. The data records having parameters closer to 57.36% of water, 10.5% of protein, 27.62% of fat, 1.52% of lactose and 1.176% of ash are grouped as data cluster 2. Further, the data records having parameters closer to 78.28% of water, 7.71% of protein, 9.16% of fat, 3.89% of lactose and 1.085% of ash are grouped as data cluster 3. In one embodiment, a sum of squares can be used to determine the closeness of a record's parameters to the center values, for example as sketched below.
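A minimal sketch of that assignment step follows, using the squared Euclidean distance to each set of center values; the helper name nearest_cluster and the example record are hypothetical.

    import numpy as np

    # Center values of the three data clusters listed above
    # (water %, protein %, fat %, lactose %, ash %).
    centers = np.array([
        [88.50,  2.57,  2.80, 5.68, 0.485],  # data cluster 1
        [57.36, 10.50, 27.62, 1.52, 1.176],  # data cluster 2
        [78.28,  7.71,  9.16, 3.89, 1.085],  # data cluster 3
    ])

    def nearest_cluster(record):
        """Return the 1-based index of the data cluster whose center values
        are closest to the record, measured by squared Euclidean distance."""
        squared_distances = ((centers - np.asarray(record)) ** 2).sum(axis=1)
        return int(np.argmin(squared_distances)) + 1

    # A hypothetical record with parameters near the data cluster 1 center.
    print(nearest_cluster([89.0, 2.4, 2.5, 6.0, 0.5]))  # prints 1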
A sum of squares is calculated by the ‘K-Means’ algorithm. The sum of squares is used to estimate the closeness of the data records within each data cluster. In other words, the sum of squares is used to estimate the density of the data cluster. The density of a data cluster can be defined as the sum of the squares of the distances from the center value of the data cluster to each data record in the data cluster. For example, data cluster 1 includes 10 data records. In other words, these 10 data records have parameters closer to the center values of data cluster 1 as depicted in Table 3. With the sum of squares, the proximity of these 10 data records to one another is identified. The greater the value of the sum of squares, the higher the density of data records in the data cluster, and vice versa.
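A minimal sketch of this per-cluster sum of squares, assuming the records, labels and center values produced by a K-Means run (the example arrays are hypothetical):

    import numpy as np

    def cluster_sum_of_squares(X, labels, centers):
        """Sum of squared distances from each data record to its cluster's
        center values -- the quantity used above as the cluster's density."""
        sums = {}
        for cluster_id in np.unique(labels):
            members = X[labels == cluster_id]
            diffs = members - centers[cluster_id]
            sums[int(cluster_id) + 1] = float((diffs ** 2).sum())
        return sums

    # Hypothetical records, labels and center values for illustration.
    X = np.array([[88.0, 2.6, 2.9, 5.7, 0.5],
                  [89.0, 2.5, 2.7, 5.6, 0.5],
                  [57.0, 10.6, 27.5, 1.5, 1.2]])
    labels = np.array([0, 0, 1])
    centers = np.array([[88.5, 2.55, 2.8, 5.65, 0.5],
                        [57.0, 10.6, 27.5, 1.5, 1.2]])
    print(cluster_sum_of_squares(X, labels, centers))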
At step 130, a visualization panel is presented on a computer generated graphical user interface for graphically displaying the output of the ‘K-Means’ algorithm. In other words, the number of data records in each data cluster, the density of the data records in the data cluster and the proximity between the data clusters are graphically presented on the visualization panel. Further, the visualization panel graphically displays the parameters associated with the data clusters and the distribution of data in the data cluster. Thus, the output of the ‘K-Means’ algorithm as depicted in Table 3 is represented graphically in a way that indicates how the data records are grouped into data clusters. The visualization panel is explained in greater detail in
In one embodiment, a second portion 215 graphically displays the density of the data clusters and the proximity between the data clusters. In one exemplary embodiment, the data clusters are represented as nodes. Further, the size of a node depends on the number of data records in the corresponding data cluster. Connecting lines between the nodes are used to present the proximity between the data clusters. For example, the greater the thickness of a node connecting line, the higher the proximity. Furthermore, the density of the data clusters is presented using shades. For example, the denser the shade, the higher the density. The second portion 215 is described with an example in
In one embodiment, a third portion 220 graphically displays the parameters associated with the one or more data clusters, which is useful for comparing the corresponding parameters of each data cluster. With regard to the example depicted in Table 3, the third portion 220 graphically displays the percentage of water in data cluster 1 compared to the percentage of water in all the data clusters. The third portion 220 is described in greater detail in
In one embodiment, a fourth portion 225 graphically displays a data chart to represent distribution of parameters in the data cluster. The center values of the parameters as depicted in Table 3 are graphically displayed in the fourth portion 225. The fourth portion 225 is described in greater detail in
In one exemplary embodiment, a drop down menu 330 is provided to a user to select a type of a chart to present the number of data records in the data clusters. For example, the type of chart can be a bar chart, a cylinder chart, a cone chart, a pyramid chart, or a pie chart. The bar chart is selected to present the number of data records in the data clusters as shown in the first portion 305A of
For the example illustrated in Table 1, the number of data records of the data clusters is depicted in Table 4:
Further, using Equation 1:
Ratio % of three data clusters=Data Cluster 1:Data Cluster 2:Data Cluster 3=40%:20%:40%
Thereby, the size of the nodes (e.g., 405A to 405C) is displayed accordingly in the second portion 400. Hence, the number of data records in the data clusters can be visualized and compared through the size of the nodes (e.g., 405A to 405C).
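Equation 1 itself is not reproduced in this text; based on the 40%:20%:40% result, a plausible reading is each data cluster's record count expressed as a percentage of the total record count. The following is a minimal sketch under that assumption, together with an illustrative mapping from ratio to node size (the base radius is not taken from the text):

    # Node-size ratio, assuming Equation 1 is each cluster's record count as a
    # percentage of the total count (consistent with the 40%:20%:40% result).
    record_counts = {"data cluster 1": 10, "data cluster 2": 5, "data cluster 3": 10}
    total = sum(record_counts.values())
    ratios = {name: 100.0 * count / total for name, count in record_counts.items()}
    print(ratios)  # {'data cluster 1': 40.0, 'data cluster 2': 20.0, 'data cluster 3': 40.0}

    # A node's displayed radius could then scale with its ratio; the base
    # radius below is an illustrative choice, not taken from the text.
    base_radius = 30  # pixels
    node_radius = {name: base_radius * ratio / 100.0 for name, ratio in ratios.items()}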
In one exemplary embodiment, the density of the data clusters is graphically displayed using shades or a color scale depicting density from a lower value to a higher value. The sum of squares as depicted in Table 3 is used to represent the density of the data clusters.
Using Equation 2,
Sum of squares ratio % of three data clusters=4.2%:63.5%:32.14%
To represent the density of the data clusters graphically, the nodes of the data clusters are shaded darker to represent higher density and vice versa. In other words, a ratio near 0% is rendered in a lighter shade, indicating lower density, and a ratio near 100% in a darker shade, indicating higher density. Therefore, the node 405A representing data cluster 1 has a lighter shade when compared to the node 405B and the node 405C. Similarly, the node 405B has the darkest shade. Hence, the density of each data cluster may be compared with the other data clusters graphically on the visualization panel.
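Equation 2 is likewise not reproduced in this text; by analogy with Equation 1, a plausible reading is each data cluster's sum of squares expressed as a percentage of the total sum of squares, which is then mapped onto the light-to-dark scale. The sketch below makes that assumption, and the sum-of-squares values are placeholders chosen only to be roughly consistent with the 4.2%:63.5%:32.14% ratio, since the Table 3 figures are not reproduced here:

    # Density shading, assuming Equation 2 is each cluster's sum of squares as a
    # percentage of the total sum of squares.
    sum_of_squares = {"data cluster 1": 42.0,    # placeholder values; the actual
                      "data cluster 2": 635.0,   # Table 3 figures are not
                      "data cluster 3": 321.4}   # reproduced in this text
    total = sum(sum_of_squares.values())
    density_ratio = {name: 100.0 * s / total for name, s in sum_of_squares.items()}

    def shade(ratio_percent):
        """Map 0%..100% to a grey level: 0% -> lighter shade, 100% -> darker shade."""
        grey = int(round(255 * (1.0 - ratio_percent / 100.0)))
        return "#{0:02x}{0:02x}{0:02x}".format(grey)

    for name, ratio in density_ratio.items():
        print(name, round(ratio, 2), shade(ratio))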
In one exemplary embodiment, a density index 410 is provided in the second portion 400 to graphically compare the densities of the different data clusters. The density index 410 includes a color scale from low to high. Accordingly, the density of the data clusters is represented as shown in 410. Hence, graphical visualization of the data clusters' density using the density index 410 helps to compare the densities of the data clusters more effectively.
In one embodiment, proximity between the data clusters is graphically represented on the second portion 400 of the visualization panel. For example, node connecting lines (e.g., 415, 420 and 425) are used to graphically represent the proximity between the data clusters.
In one exemplary embodiment, the thickness of the node connecting lines (e.g., 415, 420 and 425) illustrates the proximity between the nodes (e.g., 405A to 405C). The thickness of the node connecting lines (e.g., 415, 420 and 425) is determined by the distance between the center values of the nodes (e.g., 405A to 405C) using the standard Euclidean distance, defined as:
$\sqrt{\sum_{i=1}^{n}(q_i - p_i)^2}$.
Consider p=(p1, p2, . . . , pn) and q=(q1, q2, . . . , qn), where p and q are the co-ordinates of the data cluster centers.
For each pair (i, j) of data clusters, with i=1 to NumberOfDataClusters and j=i+1 to NumberOfDataClusters
  Distance(i, j) = Euclidean distance between the center values of data cluster i and data cluster j
End-For
Therefore, by executing the above steps, the distances between the nodes (e.g., 405A to 405C) are calculated. For example, the distance between the node 405A and the node 405B is calculated as 110.17, the distance between the node 405B and the node 405C as 29.34, and the distance between the node 405A and the node 405C as 128.71. The distances between the nodes (e.g., 405A to 405C) are graphically represented by the thickness of the node connecting lines (e.g., 415, 420 and 425). As shown in the second portion 400, the node connecting line 420 is thinner compared to the other two node connecting lines (e.g., 415 and 425), indicating that the parameters of data cluster 1 and data cluster 2 are not close. Similarly, the node connecting line 425 is thicker compared to the other two node connecting lines (e.g., 415 and 420), indicating that the parameters of data cluster 2 and data cluster 3 are closer. Therefore, the thicker the node connecting line (e.g., 415, 420 or 425), the closer the data clusters, thus providing information regarding how close the data clusters are. Using such information, the data clusters are analyzed. When the data clusters are very close, the user may consider merging the data clusters (e.g., by decreasing the value of ‘k’ in the ‘K-Means’ algorithm); otherwise, the user may add another data cluster to the existing data clusters (e.g., by increasing the value of ‘k’ in the ‘K-Means’ algorithm).
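A minimal sketch in code of the pairwise distance computation described above follows, together with an illustrative inverse mapping from distance to line thickness; the thickness formula is an assumption, not taken from the text, and the printed distances may differ from the values quoted above depending on the precision of the center values used.

    import itertools
    import numpy as np

    # Center values per data cluster, as listed earlier
    # (water %, protein %, fat %, lactose %, ash %).
    centers = {
        1: np.array([88.50,  2.57,  2.80, 5.68, 0.485]),
        2: np.array([57.36, 10.50, 27.62, 1.52, 1.176]),
        3: np.array([78.28,  7.71,  9.16, 3.89, 1.085]),
    }

    # Pairwise Euclidean distances between the data cluster centers.
    distances = {
        (a, b): float(np.linalg.norm(centers[a] - centers[b]))
        for a, b in itertools.combinations(centers, 2)
    }

    # Thicker line = closer clusters, so thickness varies inversely with distance.
    # The mapping below is illustrative only.
    max_extra_thickness = 9.0  # pixels
    farthest = max(distances.values())
    thickness = {pair: 1.0 + max_extra_thickness * (1.0 - d / farthest)
                 for pair, d in distances.items()}

    for pair, d in distances.items():
        print(pair, round(d, 2), round(thickness[pair], 1))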
In one exemplary embodiment, a drop down menu 535 is provided for the user to choose a desired parameter. For example, in 505 of
Similarly, in 520 of
Similarly, the center values of the parameters associated with the data cluster 2 and the data cluster 3 are graphically displayed in 615 of
The data cluster visualization described above graphically represents various characteristics of the data clusters on the visualization panel. The visualization panel graphically represents the density of the data clusters, the number of data records in the data clusters, the proximity of the data clusters and the distribution of parameters in the data clusters. Since detailed information on the data clusters is graphically displayed on the single visualization panel, it is easier to analyze the data clusters and their characteristics and to understand how the data records are grouped into data clusters. Even though the data cluster visualization is explained using the ‘K-Means’ algorithm, it is also applicable to other centroid-based clustering techniques.
Some embodiments of the invention may include the above-described methods being written as one or more software components. These components, and the functionality associated with each, may be used by client, server, distributed, or peer computer systems. These components may be written in a computer language corresponding to one or more programming languages such as functional, declarative, procedural, object-oriented, lower-level languages and the like. They may be linked to other components via various application programming interfaces and then compiled into one complete application for a server or a client. Alternatively, the components may be implemented in server and client applications. Further, these components may be linked together via various distributed programming protocols. Some example embodiments of the invention may include remote procedure calls being used to implement one or more of these components across a distributed programming environment. For example, a logic level may reside on a first computer system that is remotely located from a second computer system containing an interface level (e.g., a graphical user interface). These first and second computer systems can be configured in a server-client, peer-to-peer, or some other configuration. The clients can vary in complexity from mobile and handheld devices, to thin clients and on to thick clients or even other servers.
The above-illustrated software components are tangibly stored on a computer readable storage medium as instructions. The term “computer readable storage medium” should be taken to include a single medium or multiple media that stores one or more sets of instructions. The term “computer readable storage medium” should be taken to include any physical article that is capable of undergoing a set of physical changes to physically store, encode, or otherwise carry a set of instructions for execution by a computer system which causes the computer system to perform any of the methods or process steps described, represented, or illustrated herein. Examples of computer readable storage media include, but are not limited to: magnetic media, such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer readable instructions include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using Java, C++, or other object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hard-wired circuitry in place of, or in combination with machine readable software instructions.
A data source is an information resource. Data sources include sources of data that enable data storage and retrieval. Data sources may include databases, such as, relational, transactional, hierarchical, multi-dimensional (e.g., OLAP), object oriented databases, and the like. Further data sources include tabular data (e.g., spreadsheets, delimited text files), data tagged with a markup language (e.g., XML data), transactional data, unstructured data (e.g., text files, screen scrapings), hierarchical data (e.g., data in a file system, XML data), files, a plurality of reports, and any other data source accessible through an established protocol, such as, Open DataBase Connectivity (ODBC), produced by an underlying software system (e.g., ERP system), and the like. Data sources may also include a data source where the data is not tangibly stored or otherwise ephemeral such as data streams, broadcast data, and the like. These data sources can include associated data foundations, semantic layers, management systems, security systems and so on.
In the above description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, techniques, etc. In other instances, well-known operations or structures are not shown or described in detail to avoid obscuring aspects of the invention.
Although the processes illustrated and described herein include a series of steps, it will be appreciated that the different embodiments of the present invention are not limited by the illustrated ordering of steps, as some steps may occur in different orders and some concurrently with other steps, apart from that shown and described herein. In addition, not all illustrated steps may be required to implement a methodology in accordance with the present invention. Moreover, it will be appreciated that the processes may be implemented in association with the apparatus and systems illustrated and described herein as well as in association with other systems not illustrated.
The above descriptions and illustrations of embodiments of the invention, including what is described in the Abstract, are not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize. These modifications can be made to the invention in light of the above detailed description. Rather, the scope of the invention is to be determined by the following claims, which are to be interpreted in accordance with established doctrines of claim construction.