The present disclosure generally relates to computer-implemented systems and methods for data reduction in distributed data environments.
Distance-based data mining analyses are attractive for addressing many problems in class identification and data segmentation. However, when handling large data sets, the computational cost of some clustering algorithms may become impractical or prohibitively expensive.
In accordance with the teachings provided herein, systems and methods for data reduction in distributed data environments are provided.
For example, a computer-program product tangibly embodied in a non-transitory machine-readable storage medium is provided that includes instructions that can cause a data processing apparatus to group data points in a data set into a plurality of data point bubbles. These data point bubbles are represented by a plurality of representative points, where an individual data point bubble of the plurality of data point bubbles includes one or more data points from the data set and a representative point of the plurality of representative points. The computer-program product further includes instructions that can cause the data processing apparatus to calculate a cluster assignment for the representative point by executing a clustering algorithm using the plurality of representative points.
In another example, a computer-implemented method is provided that includes grouping data points in a data set into a plurality of data point bubbles. These data point bubbles are represented by a plurality of representative points, where an individual data point bubble of the plurality of data point bubbles includes one or more data points from the data set and a representative point of the plurality of representative points. The method further includes calculating a cluster assignment for the representative point by executing a clustering algorithm using the plurality of representative points.
In another example, a system is provided that includes a processor and a non-transitory computer-readable storage medium containing instructions that, when executed on the processor, cause the processor to perform operations. The operations include grouping data points in a data set into a plurality of data point bubbles. These data point bubbles are represented by a plurality of representative points, where an individual data point bubble of the plurality of data point bubbles includes one or more data points from the data set and a representative point of the plurality of representative points. The operations further include calculating a cluster assignment for the representative point by executing a clustering algorithm using the plurality of representative points.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawings will be provided by the Office upon request and payment of the necessary fee.
Like reference numbers and designations in the various drawings indicate like elements.
Aspects of the disclosed subject matter relate to techniques for using a data reduction algorithm with any clustering algorithm such as, for example, density-based spatial clustering of applications with noise (DBSCAN) or k-nearest neighbor algorithms. The data reduction algorithm groups data points in a data set into a number of data point bubbles. A representative point can be selected to represent each bubble. The representative points can be used to conduct clustering data mining analyses on the data set. The use of representative points reduces the computational cost of the clustering data mining algorithm because fewer calculations are needed to analyze the data set using the representative points than are needed using the full data set.
Clustering data mining analyses can be useful for solving many problems in class identification and data segmentation. For example, by clustering pixels in an image based on inter-pixel distances, different objects in the image can be identified. When handling large data sets, however, the computational cost of clustering algorithms may become prohibitively expensive because the number of calculations increases quadratically with the number of data elements. Large data sets are often stored in a distributed environment. Because clustering data mining analyses often require a global view of all of the data elements, distributed data further hinders the performance of these analyses. Though clustering algorithms are primarily used as examples, one skilled in the art will appreciate that the data reduction algorithm disclosed herein may be similarly beneficial for other machine learning methods, including, but not limited to, supervised methods (e.g., regression and classification algorithms) and semi-supervised methods.
For example, data points in a data set can be described by two-dimensional feature points (x, y). Data points within the data set can be distributed across two or more computing nodes. Data points stored on computing node 2 can be transmitted through a network to computing node 1 before computing node 1 can calculate the distance between a data point on computing node 1 and a data point on computing node 2. This type of data movement can be expensive because network bandwidth is more limited than local memory bandwidth. In addition, when handling large data sets in distributed format, it is infeasible to gather all remote data onto a single computing node due to limitations on local memory.
In a distributed computing environment, aspects and features of this disclosure can reduce the amount of data communication among computing nodes and reduce the computational cost of clustering calculations.
In one example, an original data set can be reduced by several orders of magnitude without losing the characteristics of the original data in clustering data mining analysis. A number of “bubbles” can be created from a data set. A “bubble” is a group of one or more data points and one or more representative points, the group being aggregated by selecting at least one representative point and using a distance threshold to determine the one or more data points that fall within the distance threshold. A data reduction algorithm can be executed to create bubbles from a data set in a distributed environment. The data reduction algorithm can be embedded in a distance-based algorithm such that a call to the distance-based algorithm first executes the data reduction algorithm to group the data set into multiple bubbles, extracts the representative points, and then executes the distance-based algorithm on the representative points rather than the full data set.
In one example, a distance threshold (Dmax) may be received. The distance threshold can be passed as a parameter into the data reduction algorithm and may be received from a user. Each local data point can be assigned to a specific bubble based on the distance threshold. Generally, larger distance thresholds correspond to a coarser resolution generated by the data reduction method than the resolution generated using smaller distance thresholds. Representative points can be selected and used to perform clustering data mining analyses. For clustering data mining analysis, each representative point is assigned a cluster ID. The analysis results on each representative point can be propagated back to the original data points. For example, the data points in the same bubble as a representative point can be assigned the same cluster ID as the representative point.
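As a minimal illustration of this bubble-creation step, the following Python sketch greedily assigns each local data point to the first bubble whose representative point lies within the distance threshold, and otherwise seeds a new bubble with that point as its representative. The names create_bubbles and dmax are illustrative assumptions, not names used in the disclosure.

```python
import math

def create_bubbles(points, dmax):
    """Greedily group feature tuples into bubbles using threshold dmax.

    Returns a list of (representative_point, member_points) pairs; the
    first point assigned to a bubble serves as its representative point.
    """
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    bubbles = []  # list of (representative, members) pairs
    for p in points:
        for rep, members in bubbles:
            if euclidean(p, rep) <= dmax:  # p lies within an existing bubble
                members.append(p)
                break
        else:                              # no bubble within dmax: p seeds one
            bubbles.append((p, [p]))
    return bubbles
```

For example, create_bubbles([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)], dmax=1.0) would produce two bubbles, one represented by (0.0, 0.0) and one by (5.0, 5.0), consistent with the resolution behavior noted above.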
Though the above examples utilize a distributed environment, a non-distributed computing environment in which a single computing node has a view of the entire data set can also benefit from the data reduction algorithm described herein by gathering the representative points and propagating the clustering results back to the original data points (e.g., assigning the cluster ID of the representative point to the data points in the same bubble as the representative point).
In one example, the environment 100 may include a stand-alone computer architecture where a processing system 110 (e.g., one or more computer processors) executes the system 104. The processing system 110 has access to a computer-readable memory 112.
In one example, the environment 100 may include a client-server architecture. Users 102 may utilize a PC to access servers 106 running a system 104 on a processing system 110 via networks 108. The servers 106 may access a computer-readable memory 112.
A disk controller 210 can interface one or more optional disk drives to the bus 202. These disk drives may be external or internal floppy disk drives such as storage drive 212, external or internal CD-ROM, CD-R, CD-RW, or DVD drives 214, or external or internal hard drive 216. As indicated previously, these various disk drives and disk controllers are optional devices.
A display interface 218 may permit information from the bus 202 to be displayed on a display 220 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 222. In addition to the standard computer-type components, the hardware may also include data input devices, such as a keyboard 224, or other input/output devices 226, such as a microphone, remote control, touchpad, keypad, stylus, motion or gesture sensor, location sensor, still or video camera, pointer, mouse, or joystick, which can obtain information from bus 202 via interface 228.
At block 304, data reduction engine 209 determines at least one representative point for each data point bubble. In one example, each data point in the data set is assigned to a bubble. A representative point can be randomly selected from the assigned data points to represent the bubble. Alternatively, representative points can be selected in other ways, for example, by selecting a point closest to the center of the bubble.
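In Python, the two selection strategies described here might look like the following sketch; the helper name pick_representative is a hypothetical illustration, not a name from the disclosure.

```python
import random

def pick_representative(members, how="random"):
    """Select a representative point for a bubble's member points (tuples)."""
    if how == "random":
        return random.choice(members)  # randomly selected member
    # "center": the member closest to the coordinate-wise mean of the bubble
    center = tuple(sum(c) / len(members) for c in zip(*members))
    return min(members, key=lambda p: sum((x - m) ** 2 for x, m in zip(p, center)))
```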
At block 306, data reduction engine 209 performs the distance-based data mining algorithm using the representative points of the data point bubbles rather than the entire data set. “Distance” refers to a value in a metric space between a selected data point and the representative point. For example, distance may refer to a Euclidean distance, a Manhattan distance, or a Hamming distance. Data reduction engine 209 may alternatively pass the set of representative points to another component or engine to calculate cluster identification numbers for the representative points.
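For illustration, these three distance metrics can be written directly in Python; the following is a minimal sketch (the Hamming distance assumes equal-length discrete sequences).

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_distance(a, b):
    # number of positions at which corresponding elements differ
    return sum(1 for x, y in zip(a, b) if x != y)
```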
At block 308, data reduction engine 209 assigns each representative point a cluster identification number based on performing the distance-based data mining algorithm. Alternatively, data reduction engine 209, having passed the set of representative points to another component or engine to calculate cluster identification numbers, may receive clustering results for the set of representative points.
At block 310, data reduction engine 209 assigns each data point in each bubble the same cluster identification number as the representative point of the bubble to which the data point belongs.
Pseudo-code 400 includes a function call to CreateBubblesOnEachComputingNode in order to determine bubbles of a local data set. Once bubbles are determined, a data point in the bubble is selected as the representative point for the bubble. Though one representative point is used in this example, a bubble may include one or more representative points. One skilled in the art will appreciate that selecting more than one representative point can be beneficial in stabilizing the bubbling algorithm. The one or more representative points may be selected randomly or in other ways as would be apparent to one skilled in the art.
Once representative points are determined, the representative points are collected into data set Rj. Data set Rj may be used to perform DBSCAN. The pseudo-code for DBSCAN is illustrated in
The algorithm concludes by assigning a cluster identification number to each data point in the original data set D. This is accomplished by selecting a data point, determining the representative point of the bubble to which the data point belongs, determining the cluster identification number of the representative point, and assigning the cluster identification number to the data point. The DBSCANWithBubbles function then returns information of the cluster assignments.
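A compact Python sketch of this overall flow is shown below, assuming a create_bubbles helper like the sketch above and a dbscan routine like the one sketched after the ExpandCluster discussion below; these names are stand-ins for the CreateBubblesOnEachComputingNode and DBSCAN routines of pseudo-code 400, not the disclosed pseudo-code itself.

```python
def dbscan_with_bubbles(points, dmax, eps, min_pts):
    """Cluster points via representative points, then propagate labels."""
    bubbles = create_bubbles(points, dmax)    # [(representative, members), ...]
    reps = [rep for rep, _ in bubbles]
    rep_labels = dbscan(reps, eps, min_pts)   # {representative: cluster ID}
    assignments = {}
    for rep, members in bubbles:
        for p in members:                     # propagate the representative's
            assignments[p] = rep_labels[rep]  # cluster ID to each member
    return assignments
```

Because the clustering runs only on the representative points, the quadratic cost noted earlier applies to the reduced set rather than the full data set.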
In one example, consider that the data set of
EpsNeighborhoodQuery returns a set of points that are in the neighborhood of point p, that is, a set of points that are less than eps distance away from point p. The algorithm makes a determination as to whether the returned set of points contains at least MinPts points. If the set of points contains a number of points less than MinPts, then point p is marked as NOISE. If the set of points contains a number of points that is greater than or equal to MinPts, then the function ExpandCluster is called.
The function ExpandCluster is passed five parameters: p, NeighborPts, Cluster_ID, eps, and MinPts, where p is the current point, NeighborPts is the set of neighborhood points returned from EpsNeighborhoodQuery, Cluster_ID is the current cluster identification number, eps is a distance threshold, and MinPts is a minimum number of data points in an eps-neighborhood of a point. ExpandCluster adds point p to the cluster by assigning p the current value of Cluster_ID. For each point q in the set of NeighborPts, if point q has not been visited, point q is marked as visited and neighborhood points of point q are determined by calling function EpsNeighborhoodQuery in a similar manner as described above. If the size of the set of neighborhood points of point q is greater than, or equal to, MinPts, then the set of neighborhood points of point q is joined with NeighborPts (the set of neighborhood points of point p). If point q is not in any cluster, then point q is assigned to the cluster represented by Cluster_ID. The DBSCAN algorithm is repeated for each point p in data set Ti.
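For concreteness, the EpsNeighborhoodQuery and ExpandCluster logic just described can be rendered in Python roughly as follows; this is a sketch of the textbook DBSCAN algorithm rather than the disclosure's pseudo-code. It returns a mapping from each point to a cluster identification number, with noise marked as -1.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster ID; noise points receive -1.

    Points are assumed to be hashable feature tuples.
    """
    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def region_query(p):
        # EpsNeighborhoodQuery: all points less than eps away from p
        return [q for q in points if euclidean(p, q) < eps]

    labels, visited, cluster_id = {}, set(), 0
    for p in points:
        if p in visited:
            continue
        visited.add(p)
        neighbors = region_query(p)
        if len(neighbors) < min_pts:
            labels[p] = -1                       # mark p as NOISE
            continue
        cluster_id += 1                          # p is a core point: new cluster
        labels[p] = cluster_id
        i = 0
        while i < len(neighbors):                # ExpandCluster
            q = neighbors[i]
            i += 1
            if q not in visited:
                visited.add(q)
                q_neighbors = region_query(q)
                if len(q_neighbors) >= min_pts:  # q is also a core point
                    neighbors.extend(q_neighbors)
            if q not in labels or labels[q] == -1:
                labels[q] = cluster_id           # add q to the current cluster
    return labels
```

In the bubbling flow sketched earlier, points would be the set of representative points rather than the full data set.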
Though DBSCAN is used as an example distance-based algorithm, other data clustering algorithms may be utilized with the data reduction algorithm described herein.
Overall, the data reduction algorithm can efficiently reduce the size of the data regardless of how large the data set is or how it is distributed across computing nodes. Using the DBSCAN clustering algorithm as one example of a bubbling application, it can be demonstrated that the clustering results are not altered significantly when the sole parameter Dmax is tuned. Tuning Dmax for DBSCAN is straightforward, and one can use Eps for Dmax as well. It is worth emphasizing that the data reduction algorithm is universal and can be readily ported to other data mining algorithms such as k-nearest neighbor analysis or k-means clustering.
Systems and methods according to some examples may include data transmissions conveyed via networks (e.g., local area network, wide area network, Internet, or combinations thereof, etc.), fiber optic medium, carrier waves, wireless networks, etc. for communication with one or more data processing devices. The data transmissions can carry any or all of the data disclosed herein that is provided to, or from, a device.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, removable memory, flat files, temporary memory, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures may describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows and figures described and shown in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
Generally, a computer can also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data (e.g., magnetic, magneto-optical, or optical disks). However, a computer need not have such devices. Moreover, a computer can be embedded in another device (e.g., a mobile telephone, a personal digital assistant (PDA), a tablet, a mobile viewing device, a mobile audio player, a Global Positioning System (GPS) receiver), to name just a few. Computer-readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media, and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes, but is not limited to, a unit of code that performs a software operation, and can be implemented, for example, as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
The computer may include a programmable machine that performs high-speed processing of numbers, as well as of text, graphics, symbols, and sound. The computer can process, generate, or transform data. The computer includes a central processing unit that interprets and executes instructions; input devices, such as a keyboard, keypad, or a mouse, through which data and commands enter the computer; memory that enables the computer to store programs and data; and output devices, such as printers and display screens, that show the results after the computer has processed, generated, or transformed data.
Implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer program products (i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus). The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question (e.g., code that constitutes processor firmware, a protocol stack, a graphical system, a database management system, an operating system, or a combination of one or more of them).
While this disclosure may contain many specifics, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features specific to particular implementations. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software or hardware product or packaged into multiple software or hardware products.
Some systems may use Hadoop®, an open-source framework for storing and analyzing big data in a distributed computing environment. Some systems may use cloud computing, which can enable ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Some grid systems may be implemented as a multi-node Hadoop® cluster, as understood by a person of skill in the art. Apache™ Hadoop® is an open-source software framework for distributed computing. Some systems may use the SAS® LASR™ Analytic Server in order to deliver statistical modeling and machine learning capabilities in a highly interactive programming environment, which may enable multiple users to concurrently manage data, transform variables, perform exploratory analysis, and build, compare, and score models with virtually no regard to the size of the data stored in Hadoop®. Some systems may use SAS In-Memory Statistics for Hadoop® to read big data once and analyze it several times by persisting it in-memory for the entire session.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive or” may be used to indicate a situation where only the disjunctive meaning may apply.
The present disclosure claims priority to U.S. Provisional Application No. 61/819,532, filed May 4, 2013 and titled “Methods and Systems for Data Reduction in Cluster Analysis in Distributed Data Environments,” the entirety of which is incorporated herein by reference.