This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2019-047659, filed on Mar. 14, 2019; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a data processing apparatus, a display control system, a data processing method, and a computer program product.
A technology for transferring data collected by, for example, a sensor to a storage device, and visualizing many pieces of data stored in the storage device on a display device (display) is known. With the increasing variety of sensors and the evolution of storage technology, data is easily collected. However, as the amount of data to be plotted increases, the processing loads of data transfer and plotting increase, which becomes a bottleneck of an interactive visualization system.
The data to be collected can be interpreted as table data where a dimension (corresponding to an item, a type, or the like) of data is set as a column, and a data sample of each dimension is set as a row. The increase of the data amount includes both an increase in the number of dimensions and an increase in the number of data samples.
According to one embodiment, a data processing apparatus includes one or more processors. The processors generate data for displaying, in parallel coordinates, data of M dimensions (M is a natural number less than N) specified from N dimensions (N is a natural number) by a user's interactive operation. When the specification of the M dimensions is changed, the processors generate data for displaying data of the changed M dimensions in parallel coordinates. A preferred embodiment of a data processing apparatus according to the invention is described in detail hereinafter with reference to the accompanying drawings.
For example, visualizing one- or two-dimensional data in a two-dimensional display space (the X-axis and the Y-axis) is a relatively easy data visualization technique. Three-dimensional data is visualized by adding the Z-axis to the X-axis and the Y-axis and displaying the visualization result on a two-dimensional display plane while changing the eyepoint. Such a method is used to, for example, visualize a physical entity analysis, and visualize scientific simulation data.
For example, parallel coordinates and a scatter matrix are used as techniques for visualizing multidimensional data of three or more dimensions. Parallel coordinates presents data samples as polylines across a plurality of parallel axes, generally equally spaced, that correspond to a plurality of dimensions. Parallel coordinates can represent multidimensional data in a readily understandable manner.
Visualization in parallel coordinates has problems mainly in degradation of readability and a reduction in plotting performance due to an increase in the number of samples and an increase in the number of dimensions. Parallel coordinates is generally much lower in plotting performance than a line graph, a scatter diagram, a bar chart, and the like. For example, if a graph of parallel coordinates is plotted in SVG (Scalable Vector Graphics) format on a browser of a general-purpose computer or the like, the delay of an operation response is significant even for multidimensional data having 20 to 30 dimensions and several hundred samples.
As a coping method for a case where the number of dimensions is large, there is a method that reduces the number of dimensions of data. However, when the number of dimensions is reduced, a dimension that gives important hints may not be visualized. Especially when data having large numbers of both samples and dimensions is visualized in parallel coordinates, a method that reduces the amount of data by limiting the number of samples to or below a fixed number through resampling or clustering has the problems that the amount of calculation is large and that the original data may not be faithfully visualized.
Moreover, there is also a method that finds important dimensions by a technique such as machine learning. However, it is difficult to incorporate a user's domain knowledge into such an automated method, and the method may not allow the user to analyze data efficiently in the way the user expects. In terms of visualization for the purpose of gaining insight and understanding trends in data, it is important to visualize data as it is to the extent possible rather than to transform the data and then visualize it.
The data processing apparatus according to the embodiment generates data for displaying, in parallel coordinates, data of M dimensions (M is a natural number less than N) specified from N dimensions (N is a natural number) and, when the specification of the M dimensions is changed, updates visualization in parallel coordinates. Consequently, data with a large number of dimensions, or data with large numbers of both dimensions and samples, can be visualized in interactive parallel coordinates.
Multidimensional data is handled below as table data where dimensions correspond to columns and samples of data of the dimensions correspond to rows. The data of N dimensions from which the data of the M dimensions is extracted may be referred to as high-dimensional data, and the data of the M dimensions extracted from the high-dimensional data may be referred to as low-dimensional data. As in machine learning algorithms using multidimensional data, the dimensions may be handled as variables. In parallel coordinates, each axis corresponds to one dimension, and data of a corresponding dimension is displayed along its corresponding axis.
Variables that can be visualized in parallel coordinates include continuous variables and ordinal variables. A nominal variable can be visualized in parallel coordinates as an ordinal variable by determining some order (for example, the order of data indices) of its values. With a general-purpose personal computer, data with tens of dimensions and hundreds of samples can be visualized in interactive parallel coordinates with acceptable delay. More dimensions or samples may cause the screen to freeze or make operation stressful. According to the embodiment, parallel coordinates can be visualized for multidimensional data whose numbers of dimensions and samples exceed these limits.
The display processing apparatus 100 is a client apparatus such as a personal computer. The data processing apparatus 200 is, for example, a server apparatus. The display processing apparatus 100 and the data processing apparatus 200 may be each physically configured by one apparatus, or may be physically configured by a plurality of apparatuses. For example, the data processing apparatus 200 may be constructed as a server apparatus on a cloud environment.
Moreover, the configuration of the display control system illustrated in
The display processing apparatus 100 includes a display unit 111, the storage 121, an accepting unit 101, a communication control unit 102, and a display control unit 103.
The display unit 111 is a display device such as a liquid crystal display that displays data. The display unit 111 follows control of the display control unit 103, and displays, for example, multidimensional data received from the data processing apparatus 200 in parallel coordinates.
The storage 121 stores various pieces of data used for various processes by the display processing apparatus 100. For example, the storage 121 stores data received from the data processing apparatus 200, the data being displayed on the display unit 111. The storage 121 can be configured by storage media of every kind generally used, such as flash memory, a memory card, Random Access Memory (RAM), a Hard Disk Drive (HDD), and an optical disc.
The accepting unit 101 accepts input of various pieces of data from a user or the like. For example, the accepting unit 101 accepts user operations such as the specification of the M dimensions displayed among the N dimensions and the specification of filtering conditions of each dimension. Any method can be used for inputting data by the user or the like; for example, a method that inputs data with an input device such as a mouse, a keyboard, or a touchscreen can be used. The display unit 111 may be configured to include a touchscreen for inputting data.
The communication control unit 102 controls communication with an external apparatus such as the data processing apparatus 200. For example, the communication control unit 102 transmits information (such as the specified M dimensions and filtering conditions) accepted by the accepting unit 101 to the data processing apparatus 200. Moreover, the communication control unit 102 receives data for display in parallel coordinates, from the data processing apparatus 200.
The display control unit 103 controls display of data on the display unit 111. For example, the display control unit 103 causes the display unit 111 to display data generated by the data processing apparatus 200 for display in parallel coordinates.
Each of the above units (the accepting unit 101, the communication control unit 102, and the display control unit 103) is realized by, for example, one or more processors. For example, each of the above units may be realized by causing a processor such as a Central Processing Unit (CPU) to execute a program, that is, software. Each of the above units may be realized by a processor such as a dedicated Integrated Circuit (IC), that is, hardware. Each of the above units may be realized by a combined use of software and hardware. If a plurality of processors is used, each processor may realize one of the units or two or more of the units.
Next, an example of the configuration of the data processing apparatus 200 is described. As illustrated in
The storage 221 stores various pieces of data used for various processes by the data processing apparatus 200. For example, the storage 221 stores multidimensional data targeted for display, and data (such as the specified M dimensions and filtering conditions) transmitted from the display processing apparatus 100. The storage 221 can be configured by storage media of every kind generally used, such as flash memory, a memory card, RAM, an HDD, and an optical disc.
The communication control unit 201 controls communication with an external apparatus such as the display processing apparatus 100. For example, the communication control unit 201 receives information (such as the specified M dimensions and filtering conditions) transmitted from the display processing apparatus 100. Moreover, the communication control unit 201 transmits data for display in parallel coordinates, to the display processing apparatus 100.
The calculation unit 202 calculates the number of focus dimensions, M, representing the number of dimensions that are extracted for display from the data of the N dimensions (focus dimensions). For example, the calculation unit 202 calculates the number of dimensions that allows data to be easily read when being displayed in a display area, as the number M of focus dimensions, on the basis of the width of the display area in parallel coordinates.
The sorting unit 203 generates a dimension list where dimensions of the N-dimensional data are arranged in an order in accordance with a predetermined sorting rule. The predetermined rule can be any rule that suits the user's needs. For example, a rule based on the variance or the missing values of the data, a rule based on the correlation between dimensions, and a rule using importance scores obtained by machine learning can be applied.
The generation unit 204 generates data for displaying the data of the specified M dimensions in parallel coordinates. Moreover, when the specification of the M dimensions is changed, the generation unit 204 generates new data for displaying data of the changed M dimensions in parallel coordinates. Furthermore, the generation unit 204 may additionally perform resampling that converts a plurality of pieces of data of each dimension of the M dimensions into fewer pieces of converted data. If the resampling is performed, the generation unit 204 generates new data for displaying the converted data in parallel coordinates.
Each of the above units (the communication control unit 201, the calculation unit 202, the sorting unit 203, and the generation unit 204) is realized by, for example, one or more processors. For example, each of the above units may be realized by causing a processor such as a CPU to execute a program, that is, software. Each of the above units may be realized by a processor such as a dedicated IC, that is, hardware. Each of the above units may be realized by a combined use of software and hardware. If a plurality of processors is used, each processor may realize one of the units or two or more of the units.
Next, a data visualization process by the display control system according to the embodiment configured in this manner is described in more detail.
The calculation unit 202 of the data processing apparatus 200 calculates the number M of focus dimensions from the width of the display area in parallel coordinates and the interval between axes (step S101). The interval between two axes may be a fixed value, or may be specifiable by the user. Alternatively, the number M of focus dimensions may be specified, and the interval between axes may then be computed accordingly.
Hence, the calculation unit 202 calculates the number of axes to be displayed simultaneously, that is, the number M of focus dimensions, from the width of the display area and the specified interval between axes. The calculation unit 202 calculates, for example, a value obtained by dividing the width of the display area by the interval between axes (the distance between axes) as the number M of focus dimensions (if the width is not evenly divisible, a value rounded to a natural number may be used). For example, if the width of the display area in parallel coordinates is 1000 pixels, and the interval between axes is 50 pixels, the calculation unit 202 calculates M as 20 (=1000/50). For example, if the interval between axes changes as a result of a user operation that changes the width or height of the display area, the processing returns to step S101, and the calculation unit 202 recalculates the value of M.
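The calculation in step S101 can be sketched as follows. This is only an illustrative sketch: the function names `focus_dimension_count` and `axis_interval` are hypothetical, and rounding down is an assumption, since the description only states that a value rounded to a natural number may be used.

```python
def focus_dimension_count(display_width_px: float, axis_interval_px: float) -> int:
    """Return the number M of axes that fit in the display area (at least 1)."""
    if display_width_px <= 0 or axis_interval_px <= 0:
        raise ValueError("width and interval must be positive")
    # Assumption: round down when the width is not evenly divisible.
    return max(1, int(display_width_px // axis_interval_px))


def axis_interval(display_width_px: float, focus_count: int) -> float:
    """Inverse case: return the axis interval when M is specified by the user."""
    if focus_count < 1:
        raise ValueError("focus_count must be a natural number")
    return display_width_px / focus_count


if __name__ == "__main__":
    print(focus_dimension_count(1000, 50))  # example from the description
    print(axis_interval(1000, 20))
```

The sketch does not clamp M against N; an actual implementation would presumably also keep M below the total number of dimensions.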
Return to
In
The width of the knob 411 may be changed to allow the user to specify the number M of focus dimensions. In this case, the calculation unit 202 may calculate the value of the interval between axes that allows displaying the specified number M of focus dimensions. For example, the calculation unit 202 may calculate a value obtained by dividing the width of the display area by the specified number M of focus dimensions as the value of the interval between axes.
The order of dimensions is generally the descending order of importance of dimensions. In other words, the sorting unit 203 arranges the dimensions in accordance with a rule that makes an arrangement in the order of importance of dimensions. Consequently, the user can check dimensions from higher importance to lower importance while sliding with a slider bar to select dimensions.
Various methods can be used to calculate the importance of each dimension. As a simple method, for example, a method that uses the variance of the data values of each dimension as the importance can be applied, as can a method that takes the count of missing values into account when computing the importance. The implementation details of the importance computation should be customizable by the user according to the use case.
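As one possible illustration of the simple method, the following sketch scores each dimension by the variance of its values and sorts the dimensions in descending order of the score. The table layout (a dictionary of columns with `None` for missing values), the function names, and the way missing values reduce the score (multiplying by the fraction of present values) are all assumptions made for the example; the description does not fix these details.

```python
from statistics import pvariance


def variance_importance(table, penalize_missing=True):
    """Score each dimension; table maps dimension name -> list of values (None = missing)."""
    scores = {}
    for name, column in table.items():
        present = [v for v in column if v is not None]
        score = pvariance(present) if present else 0.0
        if penalize_missing and column:
            # Assumption: scale the variance by the fraction of non-missing values.
            score *= len(present) / len(column)
        scores[name] = score
    return scores


def sort_dimensions(table, penalize_missing=True):
    """Return dimension names in descending order of importance."""
    scores = variance_importance(table, penalize_missing)
    return sorted(scores, key=lambda name: scores[name], reverse=True)


if __name__ == "__main__":
    table = {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [5.0, 5.0, 5.0, 5.0],
        "c": [0.0, 10.0, None, None],
    }
    print(sort_dimensions(table))
```

Because raw variance depends on the scale of each dimension, a practical rule would likely normalize the values first; that step is omitted here for brevity.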
As a more sophisticated method, a method can be applied that calculates, as the importance, the degree of correlation or sensitivity of a dimension x, treated as an explanatory variable, with respect to another dimension y (an example of one or more first dimensions) specified as a dependent variable. For example, the degree of correlation between the dimension y being the dependent variable and each of a plurality of the dimensions x being the explanatory variables may be calculated as the importance of that dimension x.
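A correlation-based importance of this kind could be sketched as follows. The choice of the absolute Pearson correlation coefficient is an assumption for illustration (the description speaks only of a degree of correlation or sensitivity), and the function names and the dictionary-of-columns layout are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("columns must have the same length of at least 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlation_importance(table, target):
    """Importance of each explanatory dimension x with respect to the dependent dimension y."""
    y = table[target]
    return {name: abs(pearson(col, y)) for name, col in table.items() if name != target}


def rank_by_correlation(table, target):
    """Explanatory dimensions in descending order of |correlation| with the target."""
    scores = correlation_importance(table, target)
    return sorted(scores, key=lambda name: scores[name], reverse=True)


if __name__ == "__main__":
    table = {"y": [1, 2, 3, 4], "x1": [2, 4, 6, 8], "x2": [1, 1, 1, 1], "x3": [4, 1, 3, 2]}
    print(rank_by_correlation(table, "y"))
```

Missing values are not handled in this sketch; rows with missing entries would need to be dropped or imputed beforehand.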
Moreover, the sorting unit 203 may use a machine learning model such as association rule (Association Rule) data mining, a decision tree, or a random forest. The machine learning model is constructed by learning in such a manner as to calculate the importance of each of the plurality of the dimensions x for the specified dimension y. The sorting unit 203 uses such a machine learning model to calculate the importance of each dimension x for the specified dimension y.
The user may be able to change the arrangement rule used by the sorting unit 203. If the rule is changed, the processing returns to step S102, and the sorting unit 203 updates the dimension list in accordance with the changed rule. Consequently, importance that reflects the user's intention can be calculated instantly and interactively.
Return to
A coloring dimension is a dimension used for coloring in parallel coordinates. The data area (a minimum value to a maximum value) of the coloring dimension is mapped in a specified continuous color area. In other words, in terms of each sample, a value of the coloring dimension is associated with a certain color value. The color of a polyline corresponding to each sample is determined according to the setting of the coloring dimension.
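The mapping from the data range of the coloring dimension to a continuous color range can be illustrated with a simple linear interpolation. The two end colors (blue for the minimum, red for the maximum), the RGB representation, and the function names are assumptions chosen for the example only.

```python
def color_for_value(value, vmin, vmax, low_rgb=(0, 0, 255), high_rgb=(255, 0, 0)):
    """Map a value in [vmin, vmax] linearly onto a color between low_rgb and high_rgb."""
    if vmax == vmin:
        t = 0.0  # degenerate range: every sample gets the low color
    else:
        t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values outside the data range
    return tuple(round(lo + t * (hi - lo)) for lo, hi in zip(low_rgb, high_rgb))


def polyline_colors(coloring_values):
    """Return one RGB color per sample, based on the sample's coloring-dimension value."""
    vmin, vmax = min(coloring_values), max(coloring_values)
    return [color_for_value(v, vmin, vmax) for v in coloring_values]


if __name__ == "__main__":
    print(polyline_colors([0.0, 5.0, 10.0]))
```

Each returned color would then be used for the polyline of the corresponding sample.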
The display mode of a polyline may be determined by an attribute other than color. For example, the generation unit 204 may determine display modes such as the thickness, shape, and presence or absence of blinking of a polyline depending on the value of data.
A filtering dimension is a dimension to which a filtering condition is applied. A filtering condition is a condition specified to filter the data of a certain dimension. Various methods can be used to specify the filtering conditions; for example, a method that specifies the range of data along an axis can be applied. In
A default display range of each dimension in parallel coordinates is, for example, the full range of the dimension. In other words, the entire data range along the dimension is targeted for visualization before the filtering conditions of the dimension are set. For a filtering dimension, a partial range of the dimension is selected according to the filtering conditions, and the selected data is targeted for visualization. A plurality of filtering conditions may be specified for each dimension. A plurality of filtering conditions of the same dimension is applied by taking a logical OR thereof. Filtering conditions of different dimensions are applied by taking a logical AND thereof.
If the filtering conditions are changed, the processing returns to step S102, and the sorting unit 203 updates the dimension list in accordance with the changed filtering conditions.
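The combination rule (OR within a dimension, AND across dimensions) can be sketched as follows. Representing each filtering condition as an inclusive (low, high) range, and each sample as a dictionary, are assumptions made for this illustration; the function names are hypothetical.

```python
def passes_filters(sample, filters):
    """Check one sample against the filtering conditions.

    sample:  dict mapping dimension name -> value
    filters: dict mapping dimension name -> list of inclusive (low, high) ranges
    """
    for dim, ranges in filters.items():
        if not ranges:
            continue  # no condition set: the full range of the dimension is shown
        value = sample[dim]
        # Conditions on the same dimension are combined with a logical OR.
        if not any(low <= value <= high for low, high in ranges):
            return False  # conditions on different dimensions are combined with AND
    return True


def apply_filters(samples, filters):
    """Return the samples that are targeted for visualization."""
    return [s for s in samples if passes_filters(s, filters)]


if __name__ == "__main__":
    samples = [
        {"t": 10, "p": 1.0},
        {"t": 25, "p": 2.0},
        {"t": 40, "p": 3.0},
        {"t": 55, "p": 0.5},
    ]
    filters = {"t": [(0, 15), (35, 60)], "p": [(0.8, 3.5)]}
    print(apply_filters(samples, filters))
```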
Return to
Another example of a user interface that specifies the focus position is described in
In the example of
The display screen 500 illustrated in
Moreover, the user may also be allowed to manually customize the order of some of the dimensions to be displayed, in addition to the above-mentioned importance order of focus dimensions and filtering dimensions. The user manually specifies, by a drag-and-drop operation on, for example, the display screen of parallel coordinates, that a certain dimension A is placed before or after a dimension B, irrespective of the order of importance of the dimensions A and B. For example, the storage 121 stores the specified pair of the dimensions A and B (the paired dimensions). If one of the paired dimensions is included in the visible dimensions, the generation unit 204 adds the other dimension to the list of visible dimensions. In parallel coordinates, the relative order specified for the paired dimensions is preferentially maintained irrespective of the order of dimensions (the order of importance) on the dimension list.
Return to
In the processes up to this point (steps S101 to S105), the low-dimensional data is extracted from the high-dimensional data. However, even in a case of the low-dimensional data, as the number of samples (the number of rows of table data) increases, the transfer and plotting costs increase, and visual clutter may occur in parallel coordinates.
Hence, the generation unit 204 may resample data to reduce the data amount (step S106). Data resampling is a process of extracting or calculating representative samples from all samples (records). The generation unit 204 detects overlapping polylines and displays only representative polylines to efficiently visualize parallel coordinates. A method for precisely calculating resamples increases the calculation amount and accordingly is not suitable for interactive visualization of parallel coordinates.
One resampling method groups multidimensional data according to the distance between samples and obtains a representative value (such as a maximum value, a minimum value, an average value, a mean value, or a randomly extracted value) of each group for each dimension.
The number of computed resamples increases roughly exponentially with the increasing number of dimensions. Therefore, as compared to resampling of the N-dimensional data, resampling of data of M′ visible dimensions in this embodiment reduces the number of resampled samples significantly.
For example, a data cube having the same number of cells as the number of pixels indicating the height of the display area in parallel coordinates for each dimension is used. In
If the height of the display area in the parallel coordinates is changed by a user operation, the generation unit 204 recomputes the resamples.
Whether resampling is executed is determined on the basis of whether or not the number of samples exceeds a threshold specified by the user. For example, in a case where data with only a few samples is targeted, it is not necessary to execute resampling.
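A data-cube style resampling with a sample-count threshold might look like the following sketch. It assumes that each visible dimension is divided into as many cells as the display height in pixels, that samples falling into the same cell on every dimension form one group, and that the per-dimension average is used as the representative value. The function name `resample` and the row layout (tuples of numbers without missing values) are hypothetical.

```python
def resample(rows, height_px, threshold):
    """Reduce rows (tuples, one value per visible dimension) to representative samples."""
    if len(rows) <= threshold:
        return list(rows)  # few samples: resampling is not necessary
    n_dims = len(rows[0])
    mins = [min(r[d] for r in rows) for d in range(n_dims)]
    maxs = [max(r[d] for r in rows) for d in range(n_dims)]

    def cell(value, d):
        """Index of the pixel-sized cell that the value falls into on dimension d."""
        span = maxs[d] - mins[d]
        if span == 0:
            return 0
        index = int((value - mins[d]) / span * height_px)
        return min(index, height_px - 1)  # the maximum value belongs to the last cell

    groups = {}
    for r in rows:
        key = tuple(cell(r[d], d) for d in range(n_dims))
        groups.setdefault(key, []).append(r)

    # Assumption: the representative of each group is the per-dimension average.
    return [
        tuple(sum(r[d] for r in group) / len(group) for d in range(n_dims))
        for group in groups.values()
    ]


if __name__ == "__main__":
    rows = [(0.0, 0.0), (0.1, 0.1), (10.0, 10.0), (9.9, 9.9)]
    print(resample(rows, height_px=4, threshold=2))
```

Samples that share the same cells would be drawn as nearly identical polylines at the given display height, which is why collapsing them loses little visible detail.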
Return to
The communication control unit 102 of the display processing apparatus 100 receives the data from the data processing apparatus 200. The display control unit 103 renders the received data into parallel coordinates (step S108). The display control unit 103 displays the data visualized in parallel coordinates as in, for example, the above-mentioned display screen of
The display processing apparatus 100 includes an input device (such as a mouse, a keyboard, or a touchscreen) that can be operated by a user. The user can perform operations such as setting of filtering conditions for each dimension, and switching the coloring dimension on the display of the parallel coordinates, using the input device. Consequently, parallel coordinates can be interactively visualized.
In this manner, in the embodiment, the data processing apparatus 200 with compute-intensive hardware and large storage capacity executes a process on data with large numbers of dimensions and samples, and transmits the process result to the display processing apparatus 100 being a terminal apparatus that can be easily accessed and operated by the user. The display processing apparatus 100 executes a plotting process. In this manner, data processing and visualization (the display process) are distributed and executed. Accordingly, it is feasible to reduce the delay of a response to a user operation and to improve the performance of the display control system as a whole.
The method for distributed processing is not limited to this. For example, the data processing apparatus 200 may execute up to the plotting process of parallel coordinates, and transmit data indicating the plotting process result (screen data) to the display processing apparatus 100. In this case, the display control unit 103 of the display processing apparatus 100 is simply required to execute only the process of displaying the received screen data on the display unit 111. Moreover, it may be configured in such a manner that each of the data processing apparatus 200 and the display processing apparatus 100 executes part of the plotting process, and the display processing apparatus 100 displays the final plotting process result (screen data).
Steps S101 to S108 are the main flow of the proposed parallel coordinates visualization. Steps S109 to S113 correspond to a process of feeding back a user operation to the corresponding step in the main flow.
For example, once accepting a user operation through the input device, the accepting unit 101 of the display processing apparatus 100 transmits operation information indicating the accepted operation to the data processing apparatus 200 via the communication control unit 102 (step S109).
For example, the generation unit 204 of the data processing apparatus 200 determines whether or not the user operation indicates a change in the height of the display area, on the basis of the operation information (step S110). If the user operation indicates a change in the height of the display area (step S110: Yes), the generation unit 204 updates the resamples on the basis of the changed height (step S106), and repeats the subsequent processing in the main flow.
If the user operation does not indicate a change in the height of the display area (step S110: No), the generation unit 204 determines whether or not the user operation indicates a change in focus position (step S111). If the user operation indicates a change in focus position (step S111: Yes), the generation unit 204 determines M focus dimensions on the basis of the changed focus position, further determines visible dimensions in such a manner as to include the focus dimensions (step S104), and repeats the subsequent processing.
If the user operation does not indicate a change in focus position (step S111: No), the generation unit 204 determines whether or not the user operation indicates an addition or change of filtering conditions, the coloring dimension, or the arrangement rule (step S112). If the user operation indicates a change of filtering conditions, the coloring dimension, or the arrangement rule (Step S112: Yes), the processing is repeated from the arrangement process (step S102) by the sorting unit 203.
If the user operation does not indicate a change in filtering conditions, the coloring dimension, or the arrangement rule (step S112: No), the generation unit 204 determines whether or not the user operation indicates a change in the width of the display area (step S113). If the user operation indicates a change in the width of the display area (step S113: Yes), the processing is repeated from the process of calculating the number M of focus dimensions by the calculation unit 202 (step S101).
If the user operation does not indicate a change in the width of the display area (step S113: No), return to step S110 and wait until the next user operation is triggered. In this manner, a user operation is fed back and the display of parallel coordinates is interactively updated in accordance with the user operation.
The order of the determination processes of steps S110 to S113 is not limited to this, and any order of steps S110 to S113 is acceptable. Moreover, the user can specify several or all of the operations corresponding to steps S110 to S113 at a time. In this case, the display processing apparatus 100 may transmit operation information indicating a plurality of operations to the data processing apparatus 200. The data processing apparatus 200 may execute part or all of the determination processes of steps S110 to S113 at a time on the basis of, for example, the operation information. If several conditions are satisfied in the plurality of determination processes, the processing returns to the earliest of the corresponding steps among steps S101 to S106 along the main flow and is repeated from there.
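The feedback rule of steps S110 to S113 can be summarized in a small lookup, sketched below. The operation names (such as `"height_changed"`) and the function `restart_step` are hypothetical labels introduced for the example; the step numbers follow the flow described above.

```python
# Step of the main flow from which processing is repeated for each kind of operation.
RESTART_STEP = {
    "height_changed": 106,    # S110: update the resamples
    "focus_changed": 104,     # S111: determine focus and visible dimensions again
    "filter_changed": 102,    # S112: rearrange the dimension list
    "coloring_changed": 102,  # S112
    "rule_changed": 102,      # S112
    "width_changed": 101,     # S113: recalculate the number M of focus dimensions
}


def restart_step(operations):
    """Return the earliest main-flow step to repeat from, or None if nothing applies."""
    steps = [RESTART_STEP[op] for op in operations if op in RESTART_STEP]
    return min(steps) if steps else None


if __name__ == "__main__":
    print(restart_step(["height_changed", "focus_changed"]))
```

Taking the minimum covers the case where several operations arrive together, because repeating from the earliest step also re-executes every later step of the main flow.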
For example, when a user selects the dimension 701 as the coloring dimension, the dimension 701 is set as the dependent variable y, and the importance of the other dimensions being the explanatory dimensions x is recalculated by a machine learning model or the like. The other dimensions are arranged in the order of importance. A dimension corresponding to the focus position specified by the user (focus dimensions) in the slider bar is extracted from the arranged dimensions.
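The extraction of the dimensions to display can be sketched as a sliding window over the sorted dimension list, with the operation dimensions (the coloring dimension and the filtering dimensions) always included. The function name, the clamping of the focus position to the end of the list, and the placement of the operation dimensions in front of the focus dimensions are assumptions made for illustration.

```python
def visible_dimensions(sorted_dims, focus_start, m, operation_dims=()):
    """Return the dimensions to display for a given focus position.

    sorted_dims:    dimension names arranged in the order of importance
    focus_start:    index of the first focus dimension (from the slider position)
    m:              number M of focus dimensions
    operation_dims: coloring/filtering dimensions that are always displayed
    """
    if m < 1:
        raise ValueError("m must be a natural number")
    # Assumption: clamp the window so that it never runs past either end of the list.
    last_start = max(0, len(sorted_dims) - m)
    start = max(0, min(focus_start, last_start))
    focus = sorted_dims[start:start + m]
    # Assumption: operation dimensions outside the window are placed in front.
    pinned = [d for d in operation_dims if d not in focus]
    return pinned + focus


if __name__ == "__main__":
    dims = ["d1", "d2", "d3", "d4", "d5", "d6"]
    print(visible_dimensions(dims, focus_start=2, m=3, operation_dims=["c", "d3"]))
```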
In the embodiment, even if a knob 704 of a scroll/slider bar changes the focus position, the dimension 701 (the coloring dimension) and the dimension 702 (the filtering dimension), which are the operation dimensions, are always displayed. For example, if the knob 704 is moved to the right from the state of
As described above, the data processing apparatus according to the embodiment extracts low-dimensional data and displays the data in parallel coordinates without displaying all dimensions of high-dimensional data (N-dimensional data) in parallel coordinates. Moreover, the display of parallel coordinates is updated with data of different dimensions sorted in the order of importance in accordance with a user operation. Consequently, the processing loads of data transfer and plotting are reduced even for data with a large number of dimensions or data with large numbers of both dimensions and samples, and high-speed response performance can be realized in interactive visualization.
Next, a hardware configuration of each apparatus (the display processing apparatus and the data processing apparatus) according to the embodiment is described using
The apparatus according to the embodiment includes a control device such as a CPU 51, storage devices such as a Read Only Memory (ROM) 52 and a RAM 53, a communication I/F 54 that connects to a network and performs communication, and a bus 61 connecting each unit.
A program that is executed by the apparatus according to the embodiment is provided by being incorporated in advance in, for example, the ROM 52.
The program that is executed by the apparatus according to the embodiment may be configured in such a manner as to be recorded in an installable or executable format file in a computer readable recording medium such as a Compact Disk Read Only Memory (CD-ROM), a flexible disk (FD), a Compact Disk Recordable (CD-R), or a Digital Versatile Disk (DVD) and be provided as a computer program product.
Furthermore, the program that is executed by the apparatus according to the embodiment may be configured in such a manner as to be stored on a computer connected to a network such as the Internet, downloaded via the network, and provided. Moreover, the program that is executed by the apparatus according to the embodiment may be configured in such a manner as to be provided or distributed via a network such as the Internet.
The program that is executed by the apparatus according to the embodiment can cause a computer to function as each unit of the above-mentioned apparatus. The computer can cause the CPU 51 to read the program from a computer readable storage medium onto a main storage device and execute the program.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
Number | Date | Country | Kind |
---|---|---|---|
2019-047659 | Mar 2019 | JP | national |