The disclosed implementations relate generally to data mining, and in particular, to systems and methods for dynamically partitioning a dataset into multiple groups and visualizing the groups on a display.
Data visualization is an important aspect of data mining. Over the years, people have developed many software tools for generating different views of a dataset so that a data analyst can gain more insight into the dataset. But many of these views visualize only a particular aspect (e.g., a subset) of the dataset, and it can be difficult for the data analyst to partition the subset into multiple groups and correlate the data samples from different groups on an individual or aggregated basis.
In accordance with some implementations described below, a computer-implemented method of visualizing a dataset is implemented on a computer having memory, one or more processors, and a display. The method includes: rendering a plurality of marks on the display, each mark corresponding to a respective data sample in the dataset; in response to detecting a first user instruction, visually highlighting a subset of the plurality of marks in accordance with the first user instruction and generating a first data structure including the data samples associated with the highlighted marks; and in response to detecting a second user instruction, replacing the plurality of marks with two marks on the display, wherein a first mark corresponds to an aggregation result of the data samples associated with the highlighted marks and a second mark corresponds to an aggregation result of data samples associated with the non-highlighted marks. Note that each data sample may include multiple data values, each data value corresponding to a respective field of the dataset, or a single data value corresponding to a field of the dataset.
In response to detecting a third user instruction, the computer replaces the first mark with a group of marks on the display, wherein each mark in the group corresponds to a respective data sample in the first data structure.
The aggregation operation applied to the data samples is one selected from the group consisting of sum, average, median, count, standard deviation, variance, maximum, and minimum.
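By way of illustration only, the two-mark aggregation described above can be sketched in Python as follows. The sketch assumes that each data sample is a dictionary with hypothetical `id` and `value` keys; the names `AGGREGATIONS` and `aggregate_in_out` are illustrative and do not appear in the disclosure. The use of the sample (n-1) forms of standard deviation and variance is likewise an assumption.

```python
import statistics

# Aggregation operations named in the text, mapped to standard-library functions.
# Assumption: sample (n-1) standard deviation and variance are intended.
AGGREGATIONS = {
    "sum": sum,
    "average": statistics.mean,
    "median": statistics.median,
    "count": len,
    "standard deviation": statistics.stdev,
    "variance": statistics.variance,
    "maximum": max,
    "minimum": min,
}


def aggregate_in_out(samples, highlighted_ids, operation="sum", field="value"):
    """Return one aggregate for the highlighted samples and one for the rest."""
    agg = AGGREGATIONS[operation]
    inside = [s[field] for s in samples if s["id"] in highlighted_ids]
    outside = [s[field] for s in samples if s["id"] not in highlighted_ids]
    return {"IN": agg(inside), "OUT": agg(outside)}


# Hypothetical samples; ids "a" and "b" stand for the highlighted marks.
samples = [
    {"id": "a", "value": 10},
    {"id": "b", "value": 20},
    {"id": "c", "value": 30},
    {"id": "d", "value": 40},
]
marks = aggregate_in_out(samples, {"a", "b"}, "sum")
```

In this sketch the two returned entries play the role of the first mark and the second mark. Some operations (for example, standard deviation) require at least two values per group, so a fuller implementation would need to handle small or empty groups.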
In response to detecting the first user instruction, the computer displays a table of entries in a pop-up window, each table entry corresponding to a respective data sample associated with one of the highlighted marks.
In response to detecting a fourth user instruction, the computer removes a table entry from the pop-up window and a data sample corresponding to the removed table entry from the first data structure and de-highlights a mark associated with the data sample.
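As a minimal sketch of the removal step, assume the first data structure is a dictionary keyed by a hypothetical sample identifier and that the highlighted marks are tracked as a set of the same identifiers. The function name `remove_from_set` is illustrative only.

```python
def remove_from_set(first_set, highlighted, sample_id):
    """Remove a sample from the first data structure and de-highlight its mark."""
    # Drop the data sample that backs the removed table entry, if present.
    first_set.pop(sample_id, None)
    # De-highlight the corresponding mark in the data view.
    highlighted.discard(sample_id)


# Hypothetical state after the first user instruction selected two marks.
first_set = {"a": {"value": 10}, "b": {"value": 20}}
highlighted = {"a", "b"}

remove_from_set(first_set, highlighted, "b")
```

Keeping the table, the data structure, and the highlight state keyed by the same identifier is one way to keep the three in step; other arrangements are possible.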
In response to detecting a fifth user instruction, the computer visually highlights a second subset of the plurality of marks in accordance with the fifth user instruction and generates a second data structure including the data samples associated with the second subset of highlighted marks.
In response to detecting a sixth user instruction, the computer generates a third data structure by applying a predefined operation to the first data structure and the second data structure, and generates a data view for visualizing the third data structure. For example, the predefined operation is one selected from the group consisting of union, intersection, complement, and Cartesian product.
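The predefined operations listed above can be illustrated with Python's built-in set type. This is a sketch under stated assumptions: the data structures are modeled as sets of hypothetical sample identifiers, the function name `combine` is illustrative, and "complement" is interpreted here as the relative complement (members of the first set that are not in the second), which the text does not specify.

```python
from itertools import product


def combine(first, second, operation):
    """Build a third set from two sets using one of the named operations."""
    if operation == "union":
        return first | second
    if operation == "intersection":
        return first & second
    if operation == "complement":
        # Assumption: relative complement, i.e. members of `first` not in `second`.
        return first - second
    if operation == "cartesian product":
        # Every ordered pair (member of first, member of second).
        return set(product(first, second))
    raise ValueError(f"unsupported operation: {operation}")


first = {"a", "b"}
second = {"b", "c"}
third = combine(first, second, "intersection")
```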
In accordance with some implementations described below, a computer system for visualizing a dataset includes one or more processors; a display; and memory storing one or more programs. The one or more programs are configured to, when executed by the one or more processors, cause the one or more processors to: render a plurality of marks on the display, each mark corresponding to a respective data sample in the dataset; in response to detecting a first user instruction, visually highlight a subset of the plurality of marks in accordance with the first user instruction and generate a first data structure including the data samples associated with the highlighted marks; and in response to detecting a second user instruction, replace the plurality of marks with two marks on the display, wherein a first mark corresponds to an aggregation result of the data samples associated with the highlighted marks and a second mark corresponds to an aggregation result of data samples associated with the non-highlighted marks.
In accordance with some implementations described below, a non-transitory computer readable storage medium stores one or more programs configured for execution by a computer system that includes one or more processors, a display, and memory storing one or more programs. The one or more programs include instructions for: rendering a plurality of marks on the display, each mark corresponding to a respective data sample in the dataset; in response to detecting a first user instruction, visually highlighting a subset of the plurality of marks in accordance with the first user instruction and generating a first data structure including the data samples associated with the highlighted marks; and in response to detecting a second user instruction, replacing the plurality of marks with two marks on the display, wherein a first mark corresponds to an aggregation result of the data samples associated with the highlighted marks and a second mark corresponds to an aggregation result of data samples associated with the non-highlighted marks.
The aforementioned implementation of the invention as well as additional implementations will be more clearly understood as a result of the following detailed description of the various aspects of the invention when taken in conjunction with the drawings. Like reference numerals refer to corresponding parts throughout the several views of the drawings.
The present invention provides methods, computer program products, and computer systems for visualizing a dataset or a subset thereof. In a typical implementation, the present invention builds and displays a view of the dataset based on a user specification of the view. A more detailed description of the data visualization process can be found in U.S. Pat. No. 7,089,266, which is incorporated by reference in its entirety. As one skilled in the art will realize, the dataset can be a relational database, a multi-dimensional database, a semantic abstraction of a relational database, or an aggregated or unaggregated subset of a relational database, multi-dimensional database, or semantic abstraction. Fields are categorizations of data in a dataset. A tuple (also known as a data sample) is an entry of data (such as a record) in the dataset, specified by properties from fields in the dataset. A search query across the dataset returns one or more tuples.
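The terminology above (fields, tuples, and queries) can be made concrete with a small Python sketch. The field names, the values, and the `query` helper are hypothetical and serve only to illustrate the vocabulary.

```python
# Hypothetical fields: each one categorizes the data in the dataset.
FIELDS = ("region", "product", "sales")

# Each tuple (data sample) holds one value per field.
dataset = [
    ("East", "Chairs", 120),
    ("West", "Chairs", 80),
    ("East", "Desks", 200),
]


def query(dataset, field, value):
    """Return every tuple whose value for `field` equals `value`."""
    index = FIELDS.index(field)
    return [row for row in dataset if row[index] == value]


east = query(dataset, "region", "East")
```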
A view is a visual representation of a dataset or a transformation of that dataset. Text tables, bar charts, line graphs, map views, and scatter plots are all examples of types of views. Views contain marks that represent one or more tuples of a dataset. In other words, marks are visual representations of tuples in a view. A mark is typically associated with a type of graphical display. Some examples of views and their associated marks are as follows:
In some implementations, the memory 102 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices. In some implementations, the memory 102 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. In some implementations, the memory 102 includes one or more storage devices remotely located from the computer system 100. Memory 102, or alternately the non-volatile memory device(s) within the memory 102, comprises a non-transitory computer readable storage medium. In some implementations, memory 102 or the computer readable storage medium of memory 102 stores the following elements, or a subset of these elements, and may also include additional elements:
After partitioning the data samples into two sets, a data analyst may issue a second user instruction to the computer for visualizing the aggregation results associated with the two sets. In response to detecting (207) the second user instruction, the computer replaces the plurality of marks with two marks on the display such that a first mark corresponds to an aggregation result of the data samples associated with the highlighted marks and a second mark corresponds to an aggregation result of data samples associated with the non-highlighted marks. Note that there may or may not be a data structure for the data samples associated with the non-highlighted marks because, given that there is a data structure or an expression for the data samples associated with the plurality of marks on the display, a virtual data structure or expression is sufficient for defining the data samples associated with the marks not highlighted on the display.
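One possible reading of the "virtual data structure" mentioned above is a view that derives the non-highlighted samples from the displayed samples and the highlighted samples, without storing a separate list. The following Python sketch illustrates that reading; the class name `VirtualComplement` and the use of sample identifiers are assumptions.

```python
class VirtualComplement:
    """Samples that are on display but not in the highlighted set.

    Nothing is copied: membership is derived on demand from the two
    existing collections, so later changes to either are reflected.
    """

    def __init__(self, displayed_ids, highlighted_ids):
        self._displayed = displayed_ids
        self._highlighted = highlighted_ids

    def __contains__(self, sample_id):
        return sample_id in self._displayed and sample_id not in self._highlighted

    def __iter__(self):
        return (i for i in self._displayed if i not in self._highlighted)


displayed = {"a", "b", "c", "d"}
highlighted = {"a", "b"}
not_highlighted = VirtualComplement(displayed, highlighted)
```

Because the view holds references rather than copies, highlighting an additional mark immediately removes the corresponding sample from the complement.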
As shown in
In some implementations, the user can remove an entry from the table field 413 by issuing a fourth user instruction to the computer. In response to detecting (303) the fourth instruction, the computer removes (305) a table entry from the pop-up window as well as a data sample corresponding to the removed table entry from the first data structure. Sometimes, the computer also updates the data view by de-highlighting a mark associated with the removed data sample. As shown in
In some implementations, the data view shown in
In other words, a set defined in the present application is associated with a special operator called “IN/OUT( ).” When the set is dropped into one of the shelves shown in
In some implementations, a user may need to expand the aggregated data view of a set into visualization of individual members in the set.
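A minimal sketch of such an expansion follows, assuming that marks are dictionaries with hypothetical `label` and `value` keys and that the set's members are available as a list of samples. The function name `expand_mark` is illustrative.

```python
def expand_mark(view_marks, set_label, members):
    """Replace the aggregated mark for `set_label` with one mark per member."""
    expanded = []
    for mark in view_marks:
        if mark["label"] == set_label:
            # Swap the single aggregated mark for the individual members.
            expanded.extend({"label": m["id"], "value": m["value"]} for m in members)
        else:
            # Leave every other mark (e.g. the "OUT" aggregate) untouched.
            expanded.append(mark)
    return expanded


# Hypothetical aggregated view and the members behind the "IN" mark.
aggregated_view = [{"label": "IN", "value": 30}, {"label": "OUT", "value": 70}]
members = [{"id": "a", "value": 10}, {"id": "b", "value": 20}]

detailed_view = expand_mark(aggregated_view, "IN", members)
```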
Besides the IN/OUT( ) operation associated with a particular set such as the “More $ to Romney” set, a user may apply other types of operations to multiple sets, including union, intersection, complement, and Cartesian product.
To do so, a user first selects the two sets in the “Set” region 404 shown in
In this example, the “swing” states are those with shared members in both sets 439. Therefore, the user can select the corresponding toggle icon and then click the “OK” button to generate a third set called “Swing States” for those states that voted for Obama in 2008 but made more contributions to Romney's campaign in 2012.
In some implementations, the members in a set are fixed. For example, the states that voted for President Obama in 2008 are known and the “Voted Obama '08” set is therefore referred to as a “static set.” In some other implementations, the members in a set are not fixed and such a set is referred to as a “dynamic set.”
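The distinction can be sketched in Python as follows. The static set is stored as a fixed collection of hypothetical identifiers, while the dynamic set is modeled as a rule that is re-evaluated against the current data. The "top N by value" rule and the name `top_n_members` are assumptions chosen for illustration.

```python
# Static set: membership is fixed when the set is created.
static_set = frozenset({"a", "b"})


def top_n_members(samples, n, field="value"):
    """Dynamic set: membership is recomputed from the current data."""
    ranked = sorted(samples, key=lambda s: s[field], reverse=True)
    return {s["id"] for s in ranked[:n]}


samples = [
    {"id": "a", "value": 10},
    {"id": "b", "value": 20},
    {"id": "c", "value": 30},
]
before = top_n_members(samples, 2)

# When the underlying data changes, the dynamic set's members change too,
# while the static set stays as it was.
samples[0]["value"] = 50
after = top_n_members(samples, 2)
```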
In other words, if a state is a member of the “Top N States” set, its campaign contribution is kept as a separate value of the “Top N or Other” customized field without being merged with the campaign contributions from other states. If not, the state's campaign contribution is merged with the campaign contributions from other states not in the “Top N States” set. By doing so, the computer effectively generates a new set that has one more member than the “Top N States” set, i.e., “Other,” and the aggregation only occurs to the states associated with the “Other” value but not to the top N campaign donation states.
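The merging behavior described above can be sketched as follows, assuming one contribution amount per state. The state names and amounts are placeholders rather than real data, and the function name `top_n_or_other` is illustrative.

```python
def top_n_or_other(contributions, top_n_members):
    """Keep members of the top-N set separate; merge all others into "Other"."""
    result = {}
    other_total = 0
    for name, amount in contributions.items():
        if name in top_n_members:
            # Member of the set: keep its value as a separate entry.
            result[name] = amount
        else:
            # Non-member: fold its value into the single "Other" entry.
            other_total += amount
    result["Other"] = other_total
    return result


# Placeholder names and amounts for illustration only.
contributions = {"State A": 50, "State B": 40, "State C": 10, "State D": 5}
top_states = {"State A", "State B"}

field_values = top_n_or_other(contributions, top_states)
```

The result has exactly one more entry than the top-N set, and aggregation is applied only to the states that fall under "Other".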
While particular implementations are described above, it will be understood that it is not intended to limit the invention to these particular implementations. On the contrary, the invention includes alternatives, modifications and equivalents that are within the spirit and scope of the appended claims. Numerous specific details are set forth in order to provide a thorough understanding of the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that the subject matter may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the implementations.
Although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, first ranking criteria could be termed second ranking criteria, and, similarly, second ranking criteria could be termed first ranking criteria, without departing from the scope of the present invention. First ranking criteria and second ranking criteria are both ranking criteria, but they are not the same ranking criteria.
The terminology used in the description of the invention herein is for the purpose of describing particular implementations only and is not intended to be limiting of the invention. As used in the description of the invention and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, operations, elements, components, and/or groups thereof.
As used herein, the term “if” may be construed to mean “when” or “upon” or “in response to determining” or “in accordance with a determination” or “in response to detecting,” that a stated condition precedent is true, depending on the context. Similarly, the phrase “if it is determined [that a stated condition precedent is true]” or “if [a stated condition precedent is true]” or “when [a stated condition precedent is true]” may be construed to mean “upon determining” or “in response to determining” or “in accordance with a determination” or “upon detecting” or “in response to detecting” that the stated condition precedent is true, depending on the context.
Although some of the various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art and so do not present an exhaustive list of alternatives. Moreover, it should be recognized that the stages could be implemented in hardware, firmware, software or any combination thereof.
The foregoing description, for purpose of explanation, has been described with reference to specific implementations. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The implementations were chosen and described in order to best explain principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various implementations with various modifications as are suited to the particular use contemplated. Implementations include alternatives, modifications and equivalents that are within the spirit and scope of the appended claims.