The present disclosure generally relates to computer-implemented systems and methods for generating decision trees.
Decision trees are used as predictive models in statistical analysis, data mining, and machine learning. Current techniques for generating decision trees, however, utilize exhaustive enumeration approaches for determining branching schemes, and these approaches can be computationally expensive.
In accordance with the teachings provided herein, systems and methods for generating decision trees using clustering techniques are provided.
For example, a computer-program product tangibly embodied in a non-transitory machine-readable storage medium is provided that includes instructions that can cause a data processing apparatus to receive input data related to a decision tree to be generated from a data set. The input data identifies a target attribute of the data set and a set of candidate attributes of the data set to be used as nodes in the decision tree. The computer-program product further includes instructions that can cause the data processing apparatus to determine, using a clustering algorithm and the set of candidate attributes, a plurality of potential splitting schemes to be used to split a node in the decision tree. The computer-program product further includes instructions that can cause the data processing apparatus to calculate a splitting measurement for each of the plurality of potential splitting schemes. The computer-program product further includes instructions that can cause the data processing apparatus to select an optimal splitting scheme from the plurality of potential splitting schemes for each node in the decision tree based on the splitting measurement.
In another example, a computer-implemented method is provided that includes receiving, by a computer device, input data related to a decision tree to be generated from a data set. The input data identifies a target attribute of the data set and a set of candidate attributes of the data set to be used as nodes in the decision tree. The method further includes determining, by the computer device, using a clustering algorithm and the set of candidate attributes, a plurality of potential splitting schemes to be used to split a node in the decision tree. The method further includes calculating, by the computer device, a splitting measurement for each of the plurality of potential splitting schemes. The method further includes selecting an optimal splitting scheme from the plurality of potential splitting schemes for each node in the decision tree based on the splitting measurement.
In another example, a system is provided that includes a processor and a non-transitory computer-readable storage medium containing instructions that, when executed on the processor, cause the processor to perform operations. The operations include receiving, by a computer device, input data related to a decision tree to be generated from a data set. The input data identifies a target attribute of the data set and a set of candidate attributes of the data set to be used as nodes in the decision tree. The operations further include determining, by the computer device, using a clustering algorithm and the set of candidate attributes, a plurality of potential splitting schemes to be used to split a node in the decision tree. The operations further include calculating, by the computer device, a splitting measurement for each of the plurality of potential splitting schemes. The operations further include selecting an optimal splitting scheme from the plurality of potential splitting schemes for each node in the decision tree based on the splitting measurement.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawings will be provided by the Office upon request and payment of the necessary fee.
Like reference numbers and designations in the various drawings indicate like elements.
Aspects of the disclosed subject matter relate to techniques for using a clustering algorithm, for example, a k-means algorithm, to determine a splitting scheme for a node in a classification decision tree, hereinafter "decision tree." A "splitting scheme" for a node can identify the number of branches to split from the node and the arrangement of the branches to be split from the node. The clustering algorithm can be used to analyze each candidate attribute in a data set to determine an optimal splitting scheme among the candidate attributes. The disclosed subject matter can be used and implemented in various computer systems, such as in visual analytics, visual statistics, and high-performance computing systems, tools, products, and solutions.
Clustering analyses from data mining can be useful for solving many problems experienced while generating decision trees. Existing splitting methods for generating decision trees, however, either exhaustively enumerate all of the splitting options or split on sorted input variable values. The computational cost of exhaustive enumeration approaches may become prohibitive as the number of distinct variable values grows and when the data set is distributed. On the other hand, sorting large amounts of data, especially distributed data, can also be costly, and in some cases sorting may be required at every split in a decision tree. Beyond the high computational cost, the induced ordering can produce less flexible groupings that can reduce decision tree accuracy.
In the branching algorithm described herein, a clustering algorithm may be used to determine the optimal splitting of a candidate attribute. The ordering of the values of the attribute may not be required for the split, and the computational cost of exhaustive enumeration approaches can be reduced. Though a k-means clustering algorithm is used in the examples, the branching algorithm can be used with any decision tree algorithm, as well as with any clustering technique (e.g., hierarchical clustering).
In one example, an optimal splitting scheme can be determined for a node in a decision tree using a clustering algorithm. A target attribute may be determined or received. As used herein, a "target attribute" is a set of values that are used as leaf nodes in a decision tree. A target attribute can be designated as the output of a decision tree. The data set may contain one or more candidate attributes. As used herein, a set of "candidate attributes" is a set of values to be used as decision nodes in the decision tree. A target attribute and a set of candidate attributes may be received from user input. Alternatively, the target attribute and the set of candidate attributes can be obtained from an alternate source. A clustering algorithm may be used to analyze each candidate attribute to determine a potential splitting scheme if the candidate attribute is to be used for a particular decision tree node. Each potential splitting scheme can be scored according to a standard statistical measurement. Standard statistical measurements include, but are not limited to, entropy functions, Gini indexes, information gains, and information gain ratios. The highest-scored splitting scheme can be selected as the optimal splitting scheme and can be used to split the decision node.
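For illustration only, the following sketch shows how two such standard measurements, and the information gain of a proposed split, might be computed. This is hypothetical Python; the function names and data layout are assumptions rather than part of the disclosure.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (in bits) of a list of target-attribute values."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        """Gini impurity of a list of target-attribute values."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def information_gain(parent_labels, branches):
        """Entropy reduction achieved by splitting parent_labels into
        branches, where branches is a list of label lists, one per branch."""
        n = len(parent_labels)
        child = sum(len(b) / n * entropy(b) for b in branches)
        return entropy(parent_labels) - child

    # Ten observations of a binary target, split into two branches.
    parent = ["yes"] * 6 + ["no"] * 4
    branches = [["yes"] * 5 + ["no"], ["yes"] + ["no"] * 3]
    print(information_gain(parent, branches))  # approximately 0.256

Under this sketch, a higher score indicates a more informative split, which is how potential splitting schemes can be compared.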
Though the above examples utilize a distributed environment, a non-distributed computing environment in which a single computing node has a view of the entire data set can also benefit from the splitting algorithm described herein.
In one example, the environment 100 may include a stand-alone computer architecture where a processing system 110 (e.g., one or more computer processors) includes the system 104 being executed on it. The processing system 110 has access to a computer-readable memory 112.
In one example, the environment 100 may include a client-server architecture. Users 102 may utilize a PC to access servers 106 running a system 104 on a processing system 110 via networks 108. The servers 106 may access a computer-readable memory 112.
A disk controller 210 can interface one or more optional disk drives to the bus 202. These disk drives may be external or internal floppy disk drives such as storage drive 212, external or internal CD-ROM, CD-R, CD-RW, or DVD drives 214, or external or internal hard drive 216. As indicated previously, these various disk drives and disk controllers are optional devices.
A display interface 218 may permit information from the bus 202 to be displayed on a display 220 in audio, graphic, or alphanumeric format. Communication with external devices may optionally occur using various communication ports 222. In addition to the standard computer-type components, the hardware may also include data input devices, such as a keyboard 224, or other input/output devices 226, such as a microphone, remote control, touchpad, keypad, stylus, motion or gesture sensor, location sensor, still or video camera, pointer, mouse, or joystick, which can obtain information from bus 202 via interface 228.
At block 304, node-splitting engine 209 determines a plurality of potential splitting schemes. In one example, each of the candidate attributes in the set of candidate attributes is analyzed using a clustering algorithm. The clustering algorithm may be a k-means clustering algorithm or any other algorithm used to cluster data points in a data set. The maximum number of branches to be allowed from each node may be identified in the input data or, alternatively, determined from a default value. The node-splitting engine 209 may use the clustering algorithm to determine multiple splitting schemes for an individual candidate attribute. For example, suppose the maximum number of branches allowed is four. The node-splitting engine 209 may use the clustering algorithm and the maximum number of branches to determine a clustering solution containing two clusters, a clustering solution containing three clusters, and a clustering solution containing four clusters. The clusters resulting from the clustering analysis of the data set can correspond to the number and arrangement of the branches split from the node, in other words, a splitting scheme. A cluster analysis that results in three clusters will produce three branches split from the node. The maximum number of branches can be used to limit the clustering algorithm to a maximum number of clusters.

At block 306, node-splitting engine 209 calculates a splitting measurement for each of the potential splitting schemes. A "splitting measurement" refers to a criterion used to measure the amount of information gain achieved if a particular splitting scheme is applied to a node in a decision tree. A splitting measurement may include, but is not limited to, an entropy function, a Gini impurity index, an information gain, or an information gain ratio. Any suitable splitting measurement may be used, as would be apparent to one skilled in the art.
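As a non-limiting sketch of the clustering step at block 304, the following hypothetical Python fragment assumes scikit-learn's KMeans is available and represents each level of a candidate attribute as a numeric point (a representation discussed below); it produces one potential splitting scheme per cluster count, from two up to the maximum number of branches.

    import numpy as np
    from sklearn.cluster import KMeans  # assumed available; any clusterer works

    def candidate_splitting_schemes(level_points, max_branches=4):
        """Run k-means for each cluster count from 2 up to max_branches and
        return one cluster assignment (a potential splitting scheme) per
        count. Row i of level_points represents level i of the attribute."""
        schemes = []
        for k in range(2, max_branches + 1):
            if k > len(level_points):  # cannot form more clusters than levels
                break
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(level_points)
            schemes.append(km.labels_)  # labels_[i] = branch index for level i
        return schemes

    # Four attribute levels embedded as points (see the f-matrix discussion below).
    points = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
    for scheme in candidate_splitting_schemes(points):
        print(scheme)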
Assume that a candidate attribute v is considered to split an internal node L of the decision tree. For each level i of v and each level j of the target attribute t, define $f_{ij}^{v}$ as the percentage of observations at node L for which the target attribute t takes on level j and the candidate attribute v takes on level i. It can be computed as:

$$f_{ij}^{v} = \frac{N_{\mathrm{obs}}(v = i,\, t = j)}{N_{\mathrm{obs}}}$$

where $N_{\mathrm{obs}}$ is the total number of observations at the node L and $N_{\mathrm{obs}}(v = i,\, t = j)$ is the number of observations at the same node whose target attribute t equals j and whose candidate attribute v equals i. Assume that the target attribute t has m levels. The set $f_{i}^{v} \equiv \{ f_{ij}^{v} \ \text{for all}\ j \}$ then constitutes a point in the space $\mathbb{R}^{m}$. These points, one for each level of v, can be clustered into the desired number of clusters using any standard clustering algorithm. The resulting clusters may be used to define the splitting scheme of node L: the levels of v clustered into the same cluster fall in the same branch of the node L.
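A minimal sketch of this computation follows, in hypothetical Python; the function and variable names are illustrative assumptions and not part of the disclosure.

    import numpy as np

    def f_matrix(v_obs, t_obs, v_levels, t_levels):
        """Compute f[i][j] = N_obs(v=i, t=j) / N_obs for one node's
        observations. Row i is the point f_i^v in R^m for level i of v."""
        f = np.zeros((len(v_levels), len(t_levels)))
        for v, t in zip(v_obs, t_obs):
            f[v_levels.index(v), t_levels.index(t)] += 1
        return f / len(v_obs)

    # Six observations at node L: candidate attribute v and target attribute t.
    v_obs = ["red", "red", "blue", "blue", "green", "green"]
    t_obs = ["yes", "no", "no", "no", "yes", "yes"]
    f = f_matrix(v_obs, t_obs, ["red", "blue", "green"], ["yes", "no"])
    print(f)  # each row is a point in R^2 that can then be clustered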
At block 308, node-splitting engine 209 selects an optimal splitting scheme from the potential splitting schemes determined at block 304. The selected optimal splitting scheme may correspond to the splitting scheme having the highest splitting measurement as calculated in block 306.
At block 406, a candidate attribute may be selected. The candidate attribute belongs to the set of candidate attributes. The set of candidate attributes may be included in a vector, in which case selection of a first candidate attribute may correspond to selecting the first candidate in the vector. Alternatively, an array or any other appropriate data container can be used.
At decision block 408, node-splitting engine 209 determines whether the selected candidate attribute has more than a minimum number of observations (e.g., a minimum leaf size). An "observation," as used herein, means a value of a particular candidate attribute. For example, the node-splitting engine 209 may require that a candidate attribute have 15 or more observations to be considered as a potential candidate for a node. One skilled in the art will appreciate that 15 observations is used for illustration only and that any number of observations may be used as a threshold. If the candidate attribute does not have the required minimum number of observations, a determination may be made at decision block 414 as to whether more candidate attributes are available. If there are, then a new candidate attribute may be selected at block 406. If there are not, then the flow may continue at block 416. If the candidate attribute has the minimum number of observations needed (e.g., at least 15), then the flow may proceed to block 410.
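This guard can be expressed compactly; the following is a minimal, hypothetical Python sketch in which the observations for the candidate attribute are assumed to be held in a list.

    MIN_LEAF_SIZE = 15  # illustrative; the disclosure notes any threshold may be used

    def has_enough_observations(observations, min_leaf_size=MIN_LEAF_SIZE):
        """Decision block 408: a candidate attribute is only considered
        when it has at least the minimum number of observations."""
        return len(observations) >= min_leaf_size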
At block 410, node-splitting engine 209 determines one or more splitting schemes for the candidate attribute using a clustering algorithm. For example, suppose a maximum number of branches is included in the input data and that the maximum number of branches is equal to four. Alternatively, the maximum number of branches may default to any value (e.g., four). At block 410, the node-splitting engine 209 may use a clustering algorithm configured to determine an optimal branching of the candidate attribute using two clusters. The node-splitting engine 209 may also determine, using the clustering algorithm, an optimal branching of the candidate attribute for three clusters and for four clusters. In general, the node-splitting engine 209 may cause a clustering analysis to be conducted on the candidate attribute with a number of clusters designated up to, and including, the value of the maximum number of branches.
At block 412, node-splitting engine 209 calculates a splitting measurement for each of the one or more splitting schemes determined at block 410. The splitting measurement may indicate a level of statistical dispersion. The splitting scheme with the highest splitting measurement may be selected as the optimal splitting scheme for the candidate attribute.
At decision block 414, node-splitting engine 209 determines whether there are further candidate attributes. If there are more candidate attributes, the flow may repeat blocks 406 through 414 for each candidate attribute. If no more candidate attributes exist, then the flow may continue to block 416.
At block 416, node-splitting engine 209 selects the optimal splitting scheme for the node. For example, assume that five splitting schemes were determined by node-splitting engine 209, each splitting scheme corresponding to a particular candidate attribute, and each splitting scheme having two or more branches. Node-splitting engine 209 may select an optimal splitting scheme for the node by comparing the splitting measurement of each splitting scheme and selecting the splitting scheme with the highest splitting measurement. The candidate attribute corresponding to the selected splitting scheme can be selected for the node, and the selected splitting scheme can be used to split the node.
At decision block 418, node-splitting engine 209 may determine whether there are more nodes in the decision tree to split. If there are more nodes to split, blocks 404 through 418 may be repeated for each node. If there are no further nodes to split, the flow may continue at block 420, where node-splitting engine 209 can generate the decision tree according to the selected optimal splitting scheme for each node determined at block 416. Alternatively, the node-splitting engine 209 can generate a portion of the decision tree at each iteration of block 416.
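The overall flow of blocks 404 through 420 can be summarized as a recursive procedure. The sketch below is hypothetical Python; choose_split and partition are caller-supplied stand-ins for the per-node selection (blocks 406 through 416) and the branch partitioning, and are not functions defined by the disclosure.

    from collections import Counter

    def majority_target(observations):
        """Most common target value among (value, target) observation pairs."""
        return Counter(t for _, t in observations).most_common(1)[0][0]

    def build_tree(observations, max_branches, choose_split, partition):
        """Recursive sketch of blocks 404 through 420: choose an optimal
        splitting scheme for the node, split, and recurse until no split
        is viable."""
        scheme = choose_split(observations, max_branches)  # blocks 406-416
        if scheme is None:                                 # leaf: nothing to split
            return {"leaf": majority_target(observations)}
        return {"scheme": scheme,                          # block 418: recurse
                "branches": [build_tree(b, max_branches, choose_split, partition)
                             for b in partition(observations, scheme)]}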
Pseudo-code 500 includes an initialization of the variable maxMeasure. Subsequently, ChooseOptimalSplitOnOneTreeNode causes a number of clustering assignments to be determined. For example, if maxBranches=4, multiple cluster assignments will be determined for each candidate attribute in {v}, using a number of clusters starting at two and ending at four. The cluster assignments may be determined by using a clustering algorithm. In this example, ClusterLevelsOfOneAttributeIntoNumCluster may cause a clustering algorithm to be executed using the candidate attribute and maxBranches. Thus, for candidate attribute v1 of {v}, multiple cluster assignments will be determined, including a cluster assignment using two clusters, a cluster assignment using three clusters, and a cluster assignment using four clusters. The cluster assignments in this example correspond to the splitting schemes discussed above.
Pseudo-code 500 illustrates that each splitting scheme for each candidate attribute will be assigned a splitting measurement determined by a call to the function ComputeMeasureOfSplit. The splitting measurement, as discussed above, refers to a criterion used to measure the amount of information gain achieved by a particular split. The splitting scheme with the highest splitting measurement will be saved as the optimal split for the candidate attribute.
Once a splitting scheme is determined for candidate attribute v1, node-splitting engine 209 will continue to determine splitting schemes for each candidate attribute in {v} (e.g., via ClusterLevelsOfOneAttributeIntoNumCluster( )). Each splitting scheme will be used to calculate a splitting measurement, and each candidate attribute will have an optimal splitting scheme. Once each candidate attribute has an optimal splitting scheme, the ChooseOptimalSplitOnOneTreeNode algorithm may compare each of the optimal splitting schemes for the candidate attributes and select the highest-scoring optimal splitting scheme to be used as the splitting scheme for the node. Alternatively, the ChooseOptimalSplitOnOneTreeNode algorithm may reuse splitting measurements already calculated when determining the optimal splitting scheme for the node.
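One hypothetical Python rendering of the logic attributed to pseudo-code 500 is sketched below; cluster_fn and measure_fn stand in for ClusterLevelsOfOneAttributeIntoNumCluster and ComputeMeasureOfSplit and are supplied by the caller, so this is a sketch of the structure rather than the disclosed pseudo-code itself.

    def choose_optimal_split_on_one_tree_node(candidates, max_branches,
                                              cluster_fn, measure_fn):
        """For each candidate attribute, try every cluster count from 2 up
        to max_branches, score the resulting splitting scheme, and return
        the best (attribute, scheme) pair along with its measurement."""
        max_measure = float("-inf")  # initialization of maxMeasure
        best = None
        for attr in candidates:      # each candidate attribute v in {v}
            for num_clusters in range(2, max_branches + 1):
                scheme = cluster_fn(attr, num_clusters)
                measure = measure_fn(attr, scheme)
                if measure > max_measure:
                    max_measure, best = measure, (attr, scheme)
        return best, max_measure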
Systems and methods according to some examples may include data transmissions conveyed via networks (e.g., local area network, wide area network, Internet, or combinations thereof, etc.), fiber optic medium, carrier waves, wireless networks, etc. for communication with one or more data processing devices. The data transmissions can carry any or all of the data disclosed herein that is provided to, or from, a device.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, removable memory, flat files, temporary memory, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures may describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. The processes and logic flows and figures described and shown in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
Generally, a computer can also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data (e.g., magnetic disks, magneto-optical disks, or optical disks). However, a computer need not have such devices. Moreover, a computer can be embedded in another device (e.g., a mobile telephone, a personal digital assistant (PDA), a tablet, a mobile viewing device, a mobile audio player, a Global Positioning System (GPS) receiver), to name just a few. Computer-readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media, and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes, but is not limited to, a unit of code that performs a software operation, and can be implemented, for example, as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
The computer may include a programmable machine that performs high-speed processing of numbers, as well as of text, graphics, symbols, and sound. The computer can process, generate, or transform data. The computer includes a central processing unit that interprets and executes instructions; input devices, such as a keyboard, keypad, or a mouse, through which data and commands enter the computer; memory that enables the computer to store programs and data; and output devices, such as printers and display screens, that show the results after the computer has processed, generated, or transformed data.
Implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer program products (i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus). The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing apparatus" encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question (e.g., code that constitutes processor firmware, a protocol stack, a graphical system, a database management system, an operating system, or a combination of one or more of them).
While this disclosure may contain many specifics, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features specific to particular implementations. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be utilized. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software or hardware product or packaged into multiple software or hardware products.
Some systems may use Apache™ Hadoop®, an open-source software framework for storing and analyzing big data in a distributed computing environment. Some systems may use cloud computing, which can enable ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Some grid systems may be implemented as a multi-node Hadoop® cluster, as understood by a person of skill in the art. Some systems may use the SAS® LASR™ Analytic Server to deliver statistical modeling and machine learning capabilities in a highly interactive programming environment, which may enable multiple users to concurrently manage data, transform variables, perform exploratory analysis, and build and compare models and score data. Some systems may use SAS In-Memory Statistics for Hadoop® to read big data once and analyze it several times by persisting it in-memory for the entire session.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive or” may be used to indicate situations where only the disjunctive meaning may apply.
The present disclosure claims priority to U.S. Provisional Application No. 61/825,575, filed May 21, 2013 and titled “Methods and Systems for Clustering in Splitting Tree Nodes in Decision Trees,” the entirety of which is incorporated herein by reference.