The present invention relates to a storage and execution model for use in mining data.
Many common data analysis and data mining tasks involve the execution of a number of data operations before an analyst reaches a useful result. These operations are typically a subset of the following: data import, data aggregation, data preparation for data mining, evaluation of numerous statistical modeling methods to determine those that best represent the underlying correlation structure of the data, and building the resulting models, which are then used to score, rank, or prioritize data records. As database systems have become necessary pieces of the IT infrastructure of companies and organizations, it has become necessary to execute data analysis and data mining operations on a regular basis so that the most up-to-date analyses and data mining predictions are available to support optimal business decision-making and/or optimized business processes.
In the prior art, to perform these operations, analysts typically needed to use a myriad of tools for specific purposes (e.g. one tool for data import, a relational database for data aggregation, another set of tools to build statistical data mining models over the data, etc.). Additionally, it was difficult to automate the sequential execution of a number of these operations so that the process, or portions of the process, could be regularly repeated.
One benefit of the exemplary system is that it allows a data analyst user to use a single system to create sets of sequential data analysis and data mining operations that can be re-executed numerous times, either at a regular frequency or whenever needed. The system makes use of various tools for data import, utilizes commercial relational databases for data aggregation and data preparation for data mining modeling, and makes use of commercial and non-commercial statistical data mining algorithms or processes to model the data.
The exemplary system automates operations by interfacing with the components that make up the invention via code-level application programming interfaces (APIs) or by executing the components via command-line calls. The specific instructions and configurations to execute these components are defined as XML objects, and the sequences of data analysis and data mining operations are also defined as XML objects. The invention consists of a storage scheme for these XML objects; an execution engine which processes sequences of data analysis and data mining operations; and a user interface allowing the analyst to define XML objects to interface with specific components and to define the sequence of operations needed to solve specific data analysis and data mining projects.
The invention consists of three primary components used to automate general data analysis and modeling operations: i) a storage and access scheme for objects describing data sources, data manipulation operations and data mining modeling operations (metadata storage); ii) an execution engine that operates on the descriptions (i.e. operates on the metadata storage mechanism); and iii) a user interface for viewing and editing the descriptions.
The execution engine operates by processing pipelines that solve and automate various data execution operations. These operations include importing source data into relational databases, aggregating source data for analysis or reporting, computing reports, and building and evaluating data mining models. A user interface allows an end-user of the system to configure specific data preparation and analysis steps for a particular application (e.g. predicting the likelihood that a product will sell, given historical transactional sales data). The execution process automates analysis operations and can be set to run repeatedly (e.g. whenever new source data is available or on a scheduled basis).
These and other objects, advantages, and features of the invention will become better understood through review of the drawings in conjunction with a detailed description of an exemplary embodiment.
The system implements a metadata-driven system 110 for data analysis and data mining that is executed on a computer system 100 (see
The system supports a notion of a ‘Project’. Typically, a project corresponds to a given analysis project, solution or task that needs to be developed and executed. Pragmatically, a project is an umbrella, under which metadata objects are associated. Note that metadata objects cannot have the same name within a given project, but can have the same name if they belong to different projects.
The Project notion allows an analyst to associate certain properties with a project. Project properties are a convenient way to access common information or parameters used in a specific analysis solution. For example, one project may utilize data from a specific database. So, the name of the database server and the name of the database itself may be defined as a property of the project. Project properties are usually key-value pairs, so in this case, an analyst may define a property with Key=“Server Name” and Value=“My Server”; and then define another property with Key=“Database Name” and Value=“My Database”. Then, metadata objects which describe data or functionality associated with this server and database can make use of keys in their description (i.e. “Server Name” and “Database Name”). Then, if the values of these keys change at some future point in time, as long as the project properties are updated, the metadata objects and processing instructions will utilize the updated values.
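As a purely hypothetical illustration (the element names and the bracketed placeholder syntax below are assumed for exposition, not mandated by the system), a metadata object might reference the project property keys rather than embed literal values:

    <!-- Hypothetical sketch: placeholders refer to project property keys. -->
    <DataTable>
      <Server>[Server Name]</Server>
      <Database>[Database Name]</Database>
      <TableName>SalesHistory</TableName>
    </DataTable>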
The system 110 stores information about data sources and information on how to perform various data analysis operations and computations as metadata. Metadata objects are used to describe existing data items (e.g. data tables) or to describe operations that are to be applied to existing data items (e.g. pipelines). Metadata definitions for objects are stored as XML in a relational database made up of multiple tables 122-126 that has a schema 120 shown in
System Metadata Storage is implemented as a relational database in Microsoft SQL Server 2005 with the schema shown in
As the execution engine 116 (described below in more detail) processes pipelines, it interfaces with the following tables:
The execution engine component 116 has access to C# classes which describe the members and functionality associated with the particular metadata object. To instantiate a given metadata object, the execution engine performs the following steps:
This generic approach allows the loading and saving of metadata values to the schema listed above in
A metadata object equates to a C# class that stores the class member values and may also include functionality associated with operations on those values. Metadata objects developed to describe source data information and analytic computation are described in detail below.
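A minimal C# sketch of this pattern is given below; the class and member names here are hypothetical stand-ins, since the actual metadata classes are described in the sections that follow.

    using System.IO;
    using System.Xml.Serialization;

    // Hypothetical sketch of the metadata-object pattern: a class that
    // stores member values and can round-trip them through XML.
    public abstract class MetadataObject
    {
        public string Name { get; set; }

        // Serialize the member values of the concrete class to XML.
        public string SaveToXml()
        {
            var serializer = new XmlSerializer(GetType());
            using (var writer = new StringWriter())
            {
                serializer.Serialize(writer, this);
                return writer.ToString();
            }
        }

        // Reconstruct a metadata object of a given type from stored XML.
        public static T LoadFromXml<T>(string xml) where T : MetadataObject
        {
            var serializer = new XmlSerializer(typeof(T));
            using (var reader = new StringReader(xml))
            {
                return (T)serializer.Deserialize(reader);
            }
        }
    }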
Note also that all metadata objects can be saved in the table [Definitions] outlined in
A Pipeline metadata object 130 describes a series of operations to be performed during a given execution run.
The Pipeline class consists of a single member: a list of Action classes.
An example XML representation of the pipeline object is:
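(A minimal hypothetical sketch is shown here; the actual element names may differ.)

    <!-- Hypothetical sketch: a pipeline is a list of Action entries. -->
    <Pipeline>
      <Actions>
        <Action>ImportSourceData</Action>
        <Action>ComputeAggregation</Action>
        <Action>BuildPredictiveModel</Action>
      </Actions>
    </Pipeline>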
The Action metadata object specifies a single data analysis operation to be performed and also stores and manages the parameters that are required to perform the given operation.
The Action class consists of the following members:
The Action class also exposes the following methods:
Example XML for an action is listed below:
The Parameter object consists of a (name, value) pair.
The Parameter object has the following members:
Additionally, there are methods for determining and managing the type of the value:
Example XML for a Parameter object:
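(A minimal hypothetical sketch; the actual element names may differ.)

    <!-- Hypothetical sketch of a (name, value) pair. -->
    <Parameter>
      <Name>ServerName</Name>
      <Value>MyServer</Value>
    </Parameter>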
The DataTable metadata object describes a data table, typically materialized as a relational database table. The DataTable object stores the name of the table as well as the column names and the column types associated with the table.
The DataTable object consists of the following members:
The DataTable object exposes the following functionality:
Example XML for a DataTable object:
The DataField object describes information about a column (field) typically associated with a DataTable object.
The DataField object has the following members:
The DataField object also exposes the following functionality:
Example XML for the DataField object:
The CaseDataTable object represents how a given table's columns relate to produce the concept of a case (entity of analysis) for modeling. For example, if each row of the corresponding data table represents the attributes of a case, the table is typically specified as the ParentTable. If the underlying table has multiple columns that relate to a given case (i.e. it is “dimensional” or a “nested table”), then the CaseDataTable object specifies how it joins to the ParentTable (case-table).
The CaseDataTable object has the following members:
Example XML for the CaseDataTable:
The CaseDataSet object defines the logical relationship between source or derived data fields to bring together all data items related to a case for analysis and modeling. Note that a CaseDataSet has a ‘root’ table, which is the root node of the tree-like logical relationship that can be defined in a general star schema. Note that the key in the root table is referred to as the ‘case key’ for the CaseDataSet.
The CaseDataSet object consists of a single member:
The CaseDataSet object supports the following methods:
Example XML for a CaseDataSet is:
The CaseProperty object simply stores the column-name associated with a given table.
The CaseProperty object contains the following 3 members:
Example XML for a CaseProperty object:
The CaseConstraint object specifies a logical rule (constraint) to be applied to a case set to limit the cases that are used for given analysis operations, such as aggregation, etc.
The CaseConstraint object consists of the following members:
Example XML for a CaseConstraint object:
The CaseRule object represents a logical rule, which is defined as the conjunction (“and”) of a number of constraints. The CaseRule object is used to specify logic on the cases that are returned or used for an aggregation or a result-set.
The CaseRule object consists of the following members:
Example XML for a CaseRule object:
The CaseDataQuery object specifies a list of data columns that are to be returned from a query after a set of filters (rules) are applied.
The CaseDataQuery object consists of the following members:
Example XML for a CaseDataQuery object:
The CaseAggregation object defines an aggregate query over a CaseDataSet. The CaseAggregation requires the specification of the following items:
The CaseAggregation object contains the following members:
Example of a CaseAggregation XML object:
The DataFieldTransform object simply contains the information that describes a transformation to a given source data field.
The DataFieldTransform object consists of the following members:
Example XML of a DataFieldTransform object:
Similar to the DataFieldTransform, the DerivedDataField specifies a derived field for a data set.
The DerivedDataField object consists of the following members:
Example XML for a DerivedDataField object:
The DataFormat object describes the columns, transforms and derived fields that exist or may be computed from source data tables.
The DataFormat class consists of the following members:
Example XML for a DataFormat object:
The CaseAttribute metadata object is used to characterize an attribute of a case which may be dimensional or not.
The CaseAttribute object consists of the following members:
The CaseAttribute object exposes the following methods:
Example XML for a CaseAttribute object is:
The DistributionReportSpec object is used to specify the information needed to generate a distribution report which characterizes a population of cases.
The DistributionReportSpec object consists of the following members:
Example XML for a DistributionReportSpec object is:
The ChartDataTable object describes a dataset that has been generated and aggregated for the purposes of charting the results.
The ChartDataTable object has the following members:
An example of the ChartDataTable XML is:
The DistributionReport object provides a container for a number of charts, along with a title for similar charts generated over the same dataset (CaseDataset).
The DistributionReport object consists of the following members:
The DistributionReport object also exposes the following methods:
Example XML for the DistributionReport object is:
The DataMiningTable object describes a case table object that stores source data for data mining.
The DataMiningTable object consists of the following members:
Example XML for a DataMiningTable object:
The DataMiningView object specifies the logical set of case attributes to use when applying data mining predictive or clustering processes to a case data set.
The DataMiningView object has the following members:
Example of a DataMiningView object XML:
The DMColumn class derives from DataField and appends the following information onto a DataField:
The DMCaseTable object describes the case table for modeling. Note that ‘case’ table corresponds to the same notion from SQL Server 2005 Analysis Services.
The DMCaseTable object contains the following members:
Example XML for a DMCaseTable object:
The DMNestedTable object describes a nested table for modeling. Note that ‘nested’ table corresponds to the same notion from SQL Server 2005 Analysis Services.
The DMNestedTable object is very similar to the DMCaseTable object, except that it contains a specification of the foreign-key relationship between the nested table and the case table, hence there is no assumption that the case-IDs in the case table and the nested table have the same column name.
The members of the DMNestedTable object are:
The DMDataset object describes the physical layout of a dataset that is to be used for statistical modeling. Note that ‘case’ and ‘nested’ table correspond to the same notions when modeling using SQL Server 2005 Analysis Services.
The DMDataset object consists of the following members:
Example XML for a DMDataset object:
The DMEnvironment object simply specifies the SQL Server Analysis Server and SQL Server 2005 Analysis database that should be used for modeling.
The DMEnvironment object has 2 members:
Example XML for a DMEnvironment object is:
The Algorithm object specifies which statistical/machine learning algorithm to apply when modeling a given dataset, and the specific algorithm parameters that are to be used when modeling the dataset.
The Algorithm object contains the following members:
XML example of an Algorithm object is:
The Model object defines a statistical/machine learning model that has been built as a result of applying a given algorithm to a specific dataset. The Model object stores this information along with location information for the model (i.e. the SQL Server 2005 Analysis Services server, database, and associated Analysis Services objects that represent the model).
The Model object consists of the following members:
Example XML for a Model object:
The DiscreteModelEvaluation object stores the results of testing (evaluating) a modeling configuration over a holdout set (or holdout sets). The DiscreteModelEvaluation object stores these test results in the case that the variable being predicted is discrete (i.e. has values that come from a small, finite, typically unordered set).
The DiscreteModelEvaluation object has the following members:
DMROCNumPointsToPlot (integer): If the discrete prediction problem is Boolean (2-class), the value for this member is the number of ROC curve points that are available.
Example XML for a DiscreteModelEvaluation object is:
Similar to the DiscreteModelEvaluation object, the ContinuousModelEvaluation object holds results when evaluating the performance of a predictive model that is estimating the value of a continuous column (i.e. a regression model).
The ContinuousModelEvaluation object has the following members:
The Dimension class is used to store the name and type associated with a dimension for charting purposes.
The Dimension object consists of the following two members:
Example XML for the Dimension object is:
The ReportChart object describes a given reporting chart that is used in the EvaluationReport object.
The ReportChart object has the following members:
Example XML for the ReportChart object is:
The EvaluationReport object is used to represent the results of either a discrete model evaluation computation or a continuous model evaluation computation.
The EvaluationReport object contains the following members:
The EvaluationReport object exposes the following methods:
Example XML for the EvaluationReport object is:
The primary purpose of the Execution Engine is to execute the tasks defined in pipeline objects and store information on errors that may be encountered, the time it takes to execute various tasks, etc.
The execution engine is implemented as a command-line application. When it is run, it requires an XML file (whose location is specified as a command-line parameter) known as the “config.xml” file. This file contains the following information:
“config.xml” has the following structure:
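(The sketch below is hypothetical; it assumes only that the file carries the connection information for the metadata database, and the element names are assumed.)

    <!-- Hypothetical sketch of a config.xml carrying the metadata-store connection. -->
    <Config>
      <MetadataServer>MyServer</MetadataServer>
      <MetadataDatabase>MyMetadataDB</MetadataDatabase>
    </Config>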
The execution engine has access to C# classes corresponding to the metadata classes described previously. Since each of these objects can save its state to XML and load its state from XML stored in the [Definitions] table in the metadata relational database (see the schema described above), the execution engine can instantiate and persist any metadata object needed during pipeline processing.
This general metadata-driven system was constructed to automate as much of the data analysis and modeling process as possible. To accomplish this, the execution engine, via specific tasks, will call functionality that is provided by third-party components that can be automated at a code level. Third-party components utilized by the execution engine to perform various actions include SQL Server 2005 functionality provided by Microsoft Corp.
The command-line “driver.exe” program (which is generally referred to as the “execution engine”) supports the following functionality (which is described in more detail in the following sections):
When the execution engine is called with the /create-project switch, a process 150 of
When the execution engine is called with the /drop-project switch, a process 160 of
When driver.exe is called with the /export-project switch, a process 170 of
When driver.exe is called with the /import-project switch, a process 180 of
When driver.exe is called with the /execute-pipeline switch, the process 190 of
When driver.exe is called with the /execute-pipeline switch and is passed the “config.xml” file, along with the project name, the following processes are executed:
When driver.exe is called with the /emulate-server switch and is passed the “config.xml” file, along with the project name and the number of seconds to wait, the following processes are executed:
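For concreteness, hypothetical invocations of two of these switches might look as follows (the exact argument order and syntax are assumed, not prescribed):

    driver.exe /execute-pipeline config.xml MyProject
    driver.exe /emulate-server config.xml MyProject 60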
Actions that have been designed and implemented and interfaced with the pipeline architecture of the system perform the specific tasks needed to successfully address various analysis and data mining problems. Actions will operate on various metadata objects (or the source objects such as tables or files that the metadata objects describe) and will often generate new metadata and source objects that can be consumed by further actions downstream in the pipeline.
No action requires knowledge of previous actions or subsequent actions since all “communication” between actions takes place via metadata in the metadata store.
This section describes a set of pipeline actions that have been implemented to assist in analysis projects.
One task that can be put into a Pipeline object is the ability to execute another Pipeline object.
The Execute Pipeline task requires the following parameters:
The Execute Pipeline task will load 210 the metadata associated with the specified PipelineName and execute it (see
The Execute Command task will execute a command-line argument with given parameters. This task is useful when automating command-line data manipulations.
The Execute Command task requires the specification of the following parameters:
This task is implemented utilizing the .NET class System.Diagnostics.Process.
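A minimal C# sketch of this mechanism (the command and its arguments are hypothetical):

    using System.Diagnostics;

    class ExecuteCommandSketch
    {
        static void Main()
        {
            // Launch the specified command line with the given parameters
            // and wait for it to finish, as the Execute Command task does.
            var info = new ProcessStartInfo
            {
                FileName = "myTool.exe",     // hypothetical command
                Arguments = "/in data.csv",  // hypothetical parameters
                UseShellExecute = false
            };
            using (var process = Process.Start(info))
            {
                process.WaitForExit();
            }
        }
    }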
The Execute SQL task allows the automation of a specific SQL query to be executed over a specified server and database.
The Execute SQL task requires the specification of the following parameters:
The task executes by making an OLE DB connection to the specified Server and Database; the Statement is then executed using the OleDbCommand object (contained in the .NET namespace System.Data.OleDb).
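A minimal C# sketch of this mechanism (the connection string and statement are hypothetical):

    using System.Data.OleDb;

    class ExecuteSqlSketch
    {
        static void Main()
        {
            // Connect to the specified Server and Database via OLE DB.
            string connStr = "Provider=SQLOLEDB;Data Source=MyServer;" +
                             "Initial Catalog=MyDatabase;Integrated Security=SSPI;";
            using (var connection = new OleDbConnection(connStr))
            {
                connection.Open();
                // Execute the Statement parameter of the task.
                var command = new OleDbCommand(
                    "UPDATE dbo.Sales SET Processed = 1", connection);
                command.ExecuteNonQuery();
            }
        }
    }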
The Execute SQL Script task will execute the SQL statements in a file (typically suffixed with sql) over a specified SQL Server and database.
The Execute SQL Script task requires the specification of the following parameters:
The Execute SQL Script task is implemented by making a command-line call to the command-line executable “sqlcmd”, specifying the Server (via the -S flag), the database (via the -d flag) and the script (via the -i flag).
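A hypothetical example of the generated command line (server, database, and script names assumed):

    sqlcmd -S MyServer -d MyDatabase -i MyScript.sql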
The Create Data Store task is used to create a relational database to hold source and aggregated data. The Data Store database is a separate repository from the Metadata database (which contains the storage schema for metadata objects) described in
The Data Store typically contains source data for a project, aggregations executed over this source data, datasets prepared for modeling, predictions from data mining algorithms, etc.
The Create Data Store task requires the specification of the following parameters:
The Create Data Store task is implemented by making an OLE DB connection to the given Server and executing a “create database . . . ” statement to generate the database with the given name. Then helper stored procedures are defined in the data store database.
The Backup Data Store task will backup a given database to a specified backup file location. This task is useful so that regular database backups can be automated.
The Backup Data Store task requires the specification of the following parameters:
The Backup Data Store task is implemented by making an OLE DB connection to the given SQL Server and executing a “backup database . . . ” statement for the specified database, specifying the backup location Filepath.
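A hypothetical example of the generated statement (the database name and backup path are assumed):

    BACKUP DATABASE [MyDatabase]
    TO DISK = 'C:\Backups\MyDatabase.bak';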
The Compute Aggregation task executes the aggregation defined in the CaseAggregation metadata object (see Section CaseAggregation for details), over a given SQL Server and database, storing the result in the table specified.
The Compute Aggregation task requires the specification of the following parameters:
After the Compute Aggregation task is executed, it generates a DataTable object describing the table that contains the aggregation result that can be used by other data analysis processes. See Section DataTable for more information on the DataTable metadata object.
The Compute Aggregation task is implemented by constructing a SQL query from the information in the CaseAggregation metadata object, making an OLE DB connection to the specified SQL Server/database and executing the query. The resultset is then stored in a table in the same server/database and a DataTable metadata object is created representing the resultset table.
The Create Distribution Report task takes a DistributionReportSpec metadata object, along with other required parameters and computes the corresponding distribution report. The result of executing the Create Distribution Report task is that a DistributionReport metadata object is saved in the metadata store for the given project.
The Create Distribution Report requires the specification of the following parameters:
After the Create Distribution Report task is executed, a DistributionReport object is generated and saved in metadata. See Section DistributionReport for details on this metadata object.
The Drop Distribution Report task is used to remove a given DistributionReport object and the associated data tables needed to generate its values, etc.
The Drop Distribution Report task requires the specification of the following parameters:
The Drop Distribution Report task loads the DistributionReport object with the given <DistributionReport> name. For each ChartDataTable contained within the DistributionReport object, the corresponding <TableName> table is dropped from the relational database (<Server>, <Database>). Then the DistributionReport metadata object is deleted.
Similar to the Drop Distribution Report task, the Drop DataTable task drops the underlying relational database table summarized by the DataTable metadata object, then also deletes this object.
The Drop DataTable task requires the specification of the following parameters:
The Drop DataTable task loads the DataTable metadata object with the given <DataTable> name by querying the [Definitions] table (
The Create Affinity Report task is useful for determining pairwise correlation relationships between various attributes in a CaseDataSet. The pairwise correlation information is returned as a DistributionReport.
The Create Affinity Report task requires the specification of the following parameters:
When the Create Affinity Report task completes, it generates a DistributionReport object in the project metadata. See Section DistributionReport for more information about this metadata object.
The Create Affinity Report task utilizes cosine-similarity between attribute values to determine their correlation with one another. After this is completed, the report is generated.
The Normalize Attributes task takes a case data set and determines buckets for the continuous-valued attributes, generates a report summarizing the discretization, and creates a new table containing discretized (normalized) versions of the attributes.
The Normalize Attributes task requires the specification of the following parameters:
After the Normalize Attributes task has completed successfully, it generates a DistributionReport object and a DataTable in the project metadata. See Section DistributionReport for more information about this metadata object. See Section DataTable for more information on this metadata object. Note that the DataTable can be utilized then by further downstream pipeline tasks, etc.
The Make DataFormat From File task scans a specified data file (e.g. comma-delimited data file) and extracts the DataFormat metadata object information. This is then used when importing the file into a relational database.
The Make DataFormat From File task requires the specification of the following parameters:
Note that when the Make DataFormat From File task has finished, it generates a DataFormat metadata object. See Section DataFormat for more information.
The task is implemented by iterating over the file and deriving the DataFormat metadata object values.
The Import Data From File task utilizes the DataFormat information to create a table in a relational database containing the values from the data file.
The Import Data From File task requires the specification of the following parameters:
After the Import Data From File task has executed, a DataTable metadata object is created describing the data that has just been imported and is available for use by other pipeline processes. See Section DataTable for a description of this metadata object.
The Import Data From File task makes use of the BCP command to import data into a relational database table. The task automates the generation and execution of the specific BCP command.
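A hypothetical example of such a generated BCP command (table, file, and server names assumed; -c selects character mode, -t sets the field terminator, and -T requests a trusted connection):

    bcp MyDatabase.dbo.ImportedData in C:\data\source.csv -S MyServer -c -t, -T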
Similar to Make DataFormat From File task, the Make DataFormat From Table task generates a DataFormat object by analyzing the column structure in a specified database table.
The Make DataFormat From Table task requires the specification of the following parameters:
Note that when the Make DataFormat From Table task has finished, it generates a DataFormat metadata object. See Section DataFormat for more information.
The task is implemented by making an OLE DB connection to the database and querying the specified table to populate the DataFormat metadata object, then saving that to the metadata store.
The Import Data From Table task utilizes the DataFormat information to create a table in a relational database containing the data from the source table.
The Import Data From Table task requires the specification of the following parameters:
After the Import Data From Table task has executed, a DataTable metadata object is created describing the data that has just been imported and is available for use by other pipeline processes. See Section DataTable for a description of this metadata object.
The task is implemented by BCP-ing the data out to a temporary file and then BCP-ing it into the target database, generating the appropriate DataTable metadata object and saving it.
The Dump Query action allows an analyst to automate the execution of a SQL query against a specific database and export the result to a file.
The Dump Query task requires the specification of the following parameters:
The Dump Query task is implemented by connecting to the database of interest via OLE DB, executing the query via an OleDbCommand object, then writing the results to the specified file.
The Make DataFormat From Access task scans a specified table within a Microsoft Access database and extracts the DataFormat metadata object information. This is then used when importing the contents of the Access table into a relational database.
The Make DataFormat From Access task requires the specification of the following parameters:
Note that when the Make DataFormat From Access task has finished, it generates a DataFormat metadata object. See Section DataFormat for more information.
The task is implemented by making an OLE DB connection to the Access database and scanning the specified table to populate the DataFormat metadata object values.
The Import Data From Access task utilizes the DataFormat information to create a table in a relational database containing the values from the corresponding Access table.
The Import Data From Access task requires the specification of the following parameters:
After the Import Data From Access task has executed, a DataTable metadata object is created describing the data that has just been imported and is available for use by other pipeline processes. See Section DataTable for a description of this metadata object.
The task is implemented by making an OLE DB connection to the Access database and making an OLE DB connection to the target SQL Server database, then moving the data from Access to the resulting SQL table in a row-wise fashion.
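A minimal C# sketch of this row-wise copy (the connection strings, table names, and two-column layout are hypothetical):

    using System.Data.OleDb;

    class AccessImportSketch
    {
        static void Main()
        {
            string accessConn = "Provider=Microsoft.Jet.OLEDB.4.0;" +
                                "Data Source=C:\\data\\source.mdb;";
            string sqlConn = "Provider=SQLOLEDB;Data Source=MyServer;" +
                             "Initial Catalog=MyDatabase;Integrated Security=SSPI;";
            using (var source = new OleDbConnection(accessConn))
            using (var target = new OleDbConnection(sqlConn))
            {
                source.Open();
                target.Open();
                var reader = new OleDbCommand(
                    "SELECT [Id], [Name] FROM [MyAccessTable]", source).ExecuteReader();
                // Move the data row by row from Access into the SQL table.
                while (reader.Read())
                {
                    var insert = new OleDbCommand(
                        "INSERT INTO dbo.MyTable ([Id], [Name]) VALUES (?, ?)", target);
                    insert.Parameters.AddWithValue("@p1", reader.GetValue(0));
                    insert.Parameters.AddWithValue("@p2", reader.GetValue(1));
                    insert.ExecuteNonQuery();
                }
            }
        }
    }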
The Make DataFormat From Excel task scans a specified tab within a Microsoft Excel file and extracts the DataFormat metadata object information. This is then used when importing the contents of the Excel tab into a relational database.
The Make DataFormat From Excel task requires the specification of the following parameters:
Note that when the Make DataFormat From Excel task has finished, it generates a DataFormat metadata object. See Section DataFormat for more information.
The task is implemented by making an OLE DB connection to the Excel file and scanning the specified table to populate the DataFormat metadata object values.
The Import Data From Excel task utilizes the DataFormat information to create a table in a relational database containing the values from the corresponding Excel sheet.
The Import Data From Excel task requires the specification of the following parameters:
After the Import Data From Excel task has executed, a DataTable metadata object is created describing the data that has just been imported and is available for use by other pipeline processes. See Section DataTable for a description of this metadata object.
The task is implemented by making an OLE DB connection to the Excel file and making an OLE DB connection to the target SQL Server database, then moving the data from Excel to the resulting SQL table in a row-wise fashion.
The Import Existing Table task generates a DataTable object from an existing relational database table. The task saves this DataTable object in the metadata database.
The Import Existing Table task requires the specification of the following parameters:
After the Import Existing Table task has executed, a DataTable metadata object is created describing the data contained in the specified SQL table and is available for use by other pipeline processes. See Section DataTable for a description of this metadata object.
The task is implemented by making an OLE DB connection to the specified SQL Server and database, then iterating over the table to collect the information needed to populate the DataTable metadata object values.
The Export Data To File task allows an analyst to export the data contained in a table to a text file with specified delimiters, etc.
The Export Data To File task requires the specification of the following parameters:
The task executes by connecting to the specified SQL Server and database; the table contents are then exported to the specified file.
The Export Distribution Report task exports information described in the ChartDataTable metadata objects associated with a given DistributionReport object to a series of text files.
The Export Distribution Report task requires the specification of the following parameters:
The task is executed by making an OLE DB connection to the specified SQL Server database and exporting the data contained in the ChartDataTable objects to text files. The text files have the same name as the ChartDataTable. See Section ChartDataTable for more information on this metadata object.
The Build Predictive Model task is used to construct a predictive model by applying a statistical/machine learning algorithm to a given dataset. Depending upon the algorithm that is selected for model building, the Build Predictive Model task may utilize SQL Server 2005 Analysis Services to build the predictive model.
Note that the Build Predictive Model task requires that there be a predictable or output variable specified in the training dataset (e.g. a DMColumn with DMIsPredictable set to True, see Section DMColumn for details).
The Build Predictive Model task requires the specification of the following parameters:
After the Build Predictive Model task completes successfully, it generates a Model metadata object summarizing the data mining model that has been constructed. See Section Model for more details related to this metadata object.
The Build Predictive Model task constructs the given model by applying the algorithm (with given parameter settings) specified in the Algorithm object to the dataset described by the DMDataset object.
If the algorithm is one of the SQL Server 2005 Analysis Services data mining algorithms, then the model is built on the given Analysis Server/Analysis Database specified in the DMEnvironment parameter. In this case, the model is built by interfacing with SQL Server 2005 Analysis Services using the ADOMD APIs.
The Get Predictions task is used to apply a given model to a dataset and obtain predicted values (or scores) from the model. This task allows the analyst to automate the process of regularly scoring new data, etc. with a given data mining model.
The Get Predictions task requires the specification of the following parameters:
When the Get Predictions task has successfully completed, it generates a DataTable object describing the table containing the predictions. This DataTable object is saved in the metadata store.
The task is implemented by obtaining predictions using the given model for each case in the DMDataset object. These predictions are then stored in the DMPredictTable by making an OLE DB connection to the specific database, creating the predict table and populating it.
Note that if the model was built using Analysis Services 2005, the predictions are obtained by connecting to the appropriate Analysis Server/Analysis Database via an OLE DB connection and executing the appropriate DMX prediction join. See http://msdn2.microsoft.com/en-us/library/ms132031.aspx for more information on the DMX prediction join.
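A hypothetical example of such a DMX prediction join (the model, column, and source names are assumed):

    SELECT t.[CaseID], Predict([MyModel].[Outcome])
    FROM [MyModel]
    PREDICTION JOIN
      OPENQUERY([MyDataSource],
        'SELECT [CaseID], [Age], [Income] FROM [dbo].[NewCases]') AS t
    ON [MyModel].[Age] = t.[Age] AND
       [MyModel].[Income] = t.[Income]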
The Build Cluster Model task is similar to the Build Predictive Model except that it requires that the statistical algorithm used to model the data be a clustering algorithm (e.g. MICROSOFT_CLUSTERING). Also, the dataset used for modeling is not required to have a predictable or output column.
Cluster models are typically applied to datasets to determine “natural” or data-driven groupings in the dataset, facilitating a high-level understanding of the source data.
The Build Cluster Model task requires the specification of the following parameters:
After the Build Cluster Model task completes successfully, it generates a Model metadata object summarizing the data mining model that has been constructed. See Section Model for more details related to this metadata object.
The Build Cluster Model task constructs the given model by applying the algorithm (with given parameter settings) specified in the Algorithm object to the dataset described by the DMDataset object.
If the algorithm is one of the SQL Server 2005 Analysis Services data mining algorithms, then the model is built on the given Analysis Server/Analysis Database specified in the DMEnvironment parameter. In this case, the model is built by interfacing with SQL Server 2005 Analysis Services using the ADOMD APIs.
The Get Cluster Labels task is used to apply a given cluster model to a dataset to assign each case in the dataset to the cluster in which it most likely belongs. This task allows the analyst to automate the process of assigning new cases to clusters.
The Get Cluster Labels task requires the specification of the following parameters:
When the Get Cluster Labels task has successfully completed, it generates a DataTable object describing the table containing the labels. This DataTable object is saved in the metadata store.
The task is implemented by obtaining cluster label assignments using the given model for each case in the DMDataset object. These cluster labels are then stored in the DMClusterTable by making an OLE DB connection to the specific database, creating the cluster table and populating it.
Note that if the model was built using Analysis Services 2005, the cluster labels are obtained by connecting to the appropriate Analysis Server/Analysis Database via an OLE DB connection and executing the appropriate DMX prediction join.
The Evaluate Model Cross-Validation task is designed to estimate the predictive performance of a model built using a given statistical algorithm (with given parameter settings) that is applied to a specified dataset. The approach is based upon the methods described in:
M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111-147, 1974.
In this approach, the analyst specifies a number of folds to be executed. For each fold, 1/(total number of folds) proportion of the dataset is set aside as a test set. The remaining dataset cases are used to estimate the predictive model by applying the given algorithm and parameters to the given training set. Then the resulting model is applied to the test set. Accuracy and other performance metrics (typically aggregates between the difference of the predicted values and actual values) are estimated.
These metrics are then averaged over each fold. These average performance metrics are an estimate of how well a model built with the given algorithm and parameters would perform when applied to similar, unseen data.
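A minimal, self-contained C# sketch of this fold-and-average logic is given below; a trivial mean predictor stands in for the actual statistical algorithm, and all names are hypothetical:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class CrossValidationSketch
    {
        // Train a trivial mean predictor on the last column of the training
        // rows and report RMSE on the test fold (stand-in for a real model).
        static double TrainAndScore(List<double[]> train, List<double[]> test)
        {
            double mean = train.Average(r => r[r.Length - 1]);
            double mse = test.Average(r => Math.Pow(r[r.Length - 1] - mean, 2));
            return Math.Sqrt(mse);
        }

        static void Main()
        {
            var data = Enumerable.Range(0, 100)
                .Select(i => new[] { (double)i, i * 2.0 + 1 })
                .ToList();
            int folds = 5;
            double total = 0;
            for (int f = 0; f < folds; f++)
            {
                // Hold out every folds-th row as the test set for this fold.
                var test = data.Where((r, i) => i % folds == f).ToList();
                var train = data.Where((r, i) => i % folds != f).ToList();
                total += TrainAndScore(train, test);
            }
            // The averaged metric estimates performance on unseen data.
            Console.WriteLine($"Average RMSE over {folds} folds: {total / folds}");
        }
    }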
The Evaluate Model Cross-Validation task requires the specification of the following parameters:
When the Evaluate Model Cross-Validation task has terminated, a metadata object is saved that summarizes the performance as calculated during the evaluation:
For each fold of cross-validation, the task implements the sampling needed to create the training and testing sets (sampling over the case table, in the SQL Server Analysis Services case-table sense), and internally two DMDataset objects are created: one for the training set and one for the testing set.
Then, a model is built over the training set (with algorithm and parameters specified by the Algorithm object) (see Section Build Predictive Model for details). Then, predictions are generated using the information in the testing DMDataset object to obtain predicted and actual values over the testing set. The performance metrics in the DiscreteModelEvaluation or ContinuousModelEvaluation object are then computed. Performance metrics are also computed in the same way over the training DMDataset to determine training effectiveness.
Note that if the algorithm used for evaluation is one from SQL Server 2005 Analysis Services, then model building is done using the ADOMD interface to these objects and predictions are obtained by connecting to the appropriate Analysis Server via an OLE DB connection and executing the appropriate DMX prediction join.
The Evaluate Model Single Training/Testing Sets task is similar to the Evaluate Model Cross-Validation task, except that instead of sampling multiple training and testing sets from a given dataset, the analyst specifies one dataset for training and one for testing. All performance metrics are then estimated over the single testing set, after the model has been built over the training set.
The Evaluate Model Single Training/Testing Sets task requires the specification of the following parameters:
When the Evaluate Model Single Training/Testing Sets has terminated, a metadata object is saved that summarizes the performance as calculated during the evaluation:
A model is built over the training set (with algorithm and parameters specified by the Algorithm object) (see Section Build Predictive Model for details). Then, predictions are generated using the information in the testing DMDataset object to obtain predicted and actual values over the testing set. The performance metrics in the DiscreteModelEvaluation or ContinuousModelEvaluation object are then computed. Performance metrics are also computed in the same way over the training DMDataset to determine training effectiveness.
Note that if the algorithm used for evaluation is one from SQL Server 2005 Analysis Services, then model building is done using the ADOMD interface to these objects and predictions are obtained by connecting to the appropriate Analysis Server via an OLE DB connection and executing the appropriate DMX prediction join.
The Import Model Content task allows the analyst to export SQL Server 2005 Mining Model content from a given Analysis Server/Analysis database and store it in a relational database table for querying. The ability to query this content via SQL is very useful to determine the patterns and trends that are extracted.
The Import Model Content task requires the specification of the following parameters:
This task is implemented by making an OLE DB connection to the given Analysis Server/Analysis database containing the mining model of interest. The DMX query is then executed against the Analysis Server: “select flattened * from [<DMModelName>].Content”. Another OLE DB connection is made to the target relational SQL Server and database and the results are populated into the table <ModelContentTableName>.
Similar to the Import Model Content task, the Execute DMX Query task allows the analyst to execute an arbitrary DMX query against a specified SQL Server 2005 Analysis Server, with the results then stored in a specified relational database table. The ability to further query these results via SQL is beneficial to the analyst in a number of instances.
The Execute DMX Query task requires the specification of the following parameters:
This task is implemented by making an OLE DB connection to the given Analysis Server/Analysis database containing the mining model of interest. The DMX query is then executed against the Analysis Server. Another OLE DB connection is made to the target relational SQL Server and database and the results are populated into the table <TargetTableName>.
The Analyst User Interface allows the analyst end-user to interact with the metadata datastore (see Section System Metadata Storage). By defining pipelines and setting their ExecutionStatus to Pending, the end-user can cause the pipelines to be executed by the Execution Engine (driver.exe). Depending upon the tasks executed by pipelines, the Analyst User Interface allows the end-user to inspect the metadata objects that are created by a task.
Also, the Analyst User Interface allows the end-user to determine pipeline processing information by interfacing with the metadata tables [PipelineInfo] and [ExecutionLog] (see
This section provides an overview 300 of the system Analyst User Interface.
These forms are described in the sub-sections below.
Note that when the Analyst User Interface is executed, it is passed the same “config.xml” file that is utilized by the Execution Engine (see Section Config.xml for details on the contents of this file). “config.xml” allows the Analyst User Interface code to connect to the metadata datastore so that metadata items can be accessed, created, and manipulated by the Analyst UI.
When the Analyst UI is executed, the first form shown to the end-user is the “Project Manager” 310 (see
This form allows the end-user analyst to:
By clicking on the “Metadata” button 314 in the Project Manager form, the end-user is presented with a Metadata Choose form.
After making a selection of the metadata type of interest in a “Type:” drop-down box 332, the Metadata Choose form displays the names of the metadata definitions of the selected type in a “Definitions:” text-box 334.
The analyst can then:
Values available in the “Type:” dropdown include:
For some metadata types, specific “Editor” forms have been developed; for the remaining types, a “Generic Metadata Editor” form is used. The following sub-sections describe these forms in more detail.
The Pipeline Editor 350 allows the analyst to define, add, and edit the Actions that make up a selected pipeline. See
The Pipeline Editor Form allows the end-user analyst to do the following:
The Action Editor 365 allows the end-user to define a specific action and the parameters required to execute the Action.
When the Action Editor is launched to create a new Action, the user is first required to choose the Action type that they wish to create (see
Action types are logically grouped into a tree-view 382:
After choosing the Action to be created from the tree view 382, the user is returned to the Action Editor allowing the user to provide a description along with the required parameters that need to be specified. See
The user can type a description for the action in the “Description:” text-box 384.
The user then selects one of the parameters and can pick a value (useful when the parameter value is the name of another metadata object or a project property) by clicking a “Pick Value” button 386.
If the parameter value references a metadata object, the end-user is shown a window 390 that lists appropriate metadata objects that could be used as the parameter value. An example of choosing the DMDataset parameter is shown in
If the parameter value does not reference a metadata object, the end user can pick a value by clicking the “Pick Value” button 386 in
The Algorithm Editor allows the end-user to create or edit Algorithm metadata objects. When defining an Algorithm object, the end-user first chooses the algorithm type from a drop-down list 410 of a window 412 shown in
After a selection is made, the end-user can click on an “Info” button 414 of
A CaseAggregation Editor 430 allows the end user to define a CaseAggregation metadata object (see Section CaseAggregation for more details on this metadata object). See
Clicking the “Add” or “Edit” button next to “CaseDataQueries” in the CaseAggregation Editor launches the Case Data Query Editor.
The Case Data Query Editor allows the end-user to specify the name of the query and to construct the list of CaseProperties and to also edit any filters associated with the query that may limit the cases included in the overall aggregation.
The list of CaseProperties is managed by clicking on the “Add”, “Delete” or “Edit” buttons 442, 444, 446 underneath the “CaseProperties” text-box in
The filter is constructed or managed by clicking the “Edit Filter” button 448 in
By clicking the “Add” 442 or “Edit” 446 buttons underneath the CaseProperties textbox in
Clicking the “Choose . . . ” button in
By clicking the “Edit Filter” button 448 in
The Filter Editor allows the end-user to create and manage the rule-list and to change the order in which the rules are applied by using the buttons “Add”, “Delete”, “Edit”, “Move Up”, and “Move Down” 462-466 in
By clicking “Add” 462 or by highlighting a rule and clicking “Edit” 464, the Case Rule Editor is launched (see Section Case Rule Editor below and
Each Rule is made up of the conjunction (“and”) of a number of Constraints (see
Clicking either the “Add” 470 or “Edit” 472 buttons launches the Case Constraint Editor (see Section Case Constraint Editor below and
The Case Rule Editor (
The Case Constraint Editor 480 (see
By selecting the “Add” or “Edit” button 433, 435 under the “Conditions:” text-box in
The end-user can then provide a:
By selecting the “Add” or “Edit” button 437, 439 under the “Measures:” text-box in
The end-user can then provide a:
The Case Data Set Editor allows the end-user to specify the logical relationship between various CaseDataTable metadata objects that brings together the data fields of a “case” for analysis.
Clicking the “View” 522 or “New” 523 buttons launches the Case Data Table Editor 530 (see
The Evaluation Report Viewer 540 provides a graphical interface to interpret the results of model evaluation objects (either DiscreteModelEvaluation metadata objects (see section DiscreteModelEvaluation) or ContinuousModelEvaluation metadata objects (see section ContinuousModelEvaluation)).
The Evaluation Report Viewer has 3 tabs 542, 544, 546:
An example of the Test Details tab is shown in
If the analyst end-user selects a row in the grid-view and clicks on an “Info” button 548, the corresponding Info value window 550 is displayed (see
An example of the Metrics tab is shown in
If the analyst end-user selects a row in the grid-view and clicks on “Info”, the corresponding Info value is displayed as an updated notice window 562 (see
The “Charts” tab in the Evaluation Report Viewer lists any charts 570 that have been defined and allows the analyst to view them via a charting control (see
A chart is viewed via a charting control by selecting the chart and clicking the “View” button in
For other metadata objects, a Generic Metadata Editor 580 has been developed, which aids the analyst in populating the XML values of the corresponding metadata object. See
This UI allows the end-user to manually edit the metadata values and save them to the metadata database.
By clicking on the “Properties” button on the “Project Manager” form, the end-user is presented with a project properties form.
This form allows the end-user to edit existing project properties, create new ones, or delete existing ones.
Clicking the “New” button 592 on
Highlighting one of the existing properties in
By clicking the “Execution” button in the Project Manager form (see
By highlighting a given pipeline and clicking the “View Details” button 611 in
On the left-side of
By highlighting a given pipeline and clicking the “View Logfile” button 612 in
The log-file contents can be saved to a file by clicking the “Save To” button 632 in
The invention has been described with a degree of particularity, but it is the intent that the invention include all embodiments falling within the spirit or scope of the appended claims.
The present application claims priority from U.S. Provisional application Ser. No. 61/023,987, filed Jan. 28, 2008 which is incorporated herein by reference.