This disclosure relates generally to computer-based mathematical modeling techniques and, more particularly, to mathematical modeling methods and systems for identifying a desired variable subset.
Mathematical modeling techniques are often used to build relationships among variables by using data records collected through experimentation, simulation, or physical measurement or other techniques. To create a mathematical model, potential variables may need to be identified after data records are obtained. The data records may then be analyzed to build relationships among identified variables. In certain situations, the number of data records may be limited by the number of systems that can be used to generate the data records. In these situations, the number of variables may be greater than the number of available data records, which creates so-called sparse data scenarios.
Conventional solutions, such as design of experiment (DOE) techniques, have been developed to identify variables and their interactions. The design of experiment technique may also use the concept of Mahalanobis distance, as described in Genichi et al., “The Mahalanobis Taguchi Strategy, A Pattern Technology System” (John Wiley & Sons, Inc., 2002). Genichi et al. illustrates a Mahalanobis-Taguchi strategy with methods for developing multidimensional measurement scales using measures and procedures that are data analytic and not dependent upon the distribution of the characteristics of systems under measurement. Such conventional solutions, however, often do not effectively address problems associated with sparse data scenarios.
Methods and systems consistent with certain features of the disclosed systems are directed to solving one or more of the problems set forth above.
One aspect of the present disclosure includes a computer-implemented method to provide a desired variable subset. The method may include obtaining a set of data records corresponding to a plurality of variables and defining the data records as normal data or abnormal data based on predetermined criteria. The method may also include initializing a genetic algorithm with a subset of variables from the plurality of variables and calculating Mahalanobis distances of the normal data and the abnormal data based on the subset of variables. Further, the method may include identifying a desired subset of the plurality of variables by performing the genetic algorithm based on the Mahalanobis distances.
Another aspect of the present disclosure includes a computer-implemented method for defining normal data and abnormal data from a data set. The method may include obtaining two or more clusters by applying a clustering algorithm to the data set, determining a first cluster and a second cluster that have a largest difference in normalized means, and defining the first cluster as normal data and the second cluster as abnormal data.
Another aspect of the present disclosure includes a computer system. The computer system may include a console and at least one input device. The computer system may also include a central processing unit (CPU). The CPU may be configured to obtain a set of data records corresponding to a plurality of variables, wherein a total number of the data records may be less than a total number of the plurality of variables. The CPU may be configured to define the data records as normal data or abnormal data based on predetermined criteria. The CPU may be further configured to initialize a genetic algorithm with a subset of variables from the plurality of variables, calculate Mahalanobis distances of the normal data and the abnormal data based on the subset of variables, and identify a desired subset of the plurality of variables by performing the genetic algorithm based on the Mahalanobis distances.
Another aspect of the present disclosure includes a computer-readable medium for use on a computer system configured to perform a variable reducing procedure. The computer-readable medium may include computer-executable instructions for performing a method. The method may include obtaining a set of data records corresponding to a plurality of variables. The total number of the data records may be less than the total number of the plurality of variables. The method may also include defining the data records as normal data or abnormal data based on predetermined criteria and initializing a genetic algorithm with a subset of variables from the plurality of variables. The method may further include calculating Mahalanobis distances of the normal data and the abnormal data based on the subset of variables and identifying a desired subset of the plurality of variables by performing the genetic algorithm based on the Mahalanobis distances.
Reference will now be made in detail to exemplary embodiments, which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
$$MD_i = (X_i - \mu_x)\,\Sigma^{-1}\,(X_i - \mu_x)' \qquad (1)$$
where μx is the mean of X and Σ−1 is an inverse variance-covariance matrix of X. MDi weights the distance of a data point Xi from its mean μx such that observations that are on the same multivariate normal density contour will have the same distance. Such observations may be used to identify and select correlated variables from separate data groups having different variances.
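For illustration only, equation (1) may be sketched in Python with NumPy as follows. The function name `mahalanobis`, the optional `reference` argument, and the use of a pseudo-inverse (so that the sketch still runs when the variance-covariance matrix is singular) are assumptions of this sketch and are not part of the disclosure. As in equation (1), no square root is taken.

```python
import numpy as np

def mahalanobis(X, reference=None):
    """Return MD_i = (X_i - mu) Sigma^-1 (X_i - mu)' for every row X_i of X.

    mu and Sigma are estimated from `reference`, which defaults to X itself.
    """
    X = np.asarray(X, dtype=float)
    ref = X if reference is None else np.asarray(reference, dtype=float)
    mu = ref.mean(axis=0)                                   # mean of each variable
    sigma = np.atleast_2d(np.cov(ref, rowvar=False))        # variance-covariance matrix
    sigma_inv = np.linalg.pinv(sigma)                       # assumption: pseudo-inverse
    diff = X - mu
    # Row-wise quadratic form diff_i * Sigma^-1 * diff_i'
    return np.einsum("ij,jk,ik->i", diff, sigma_inv, diff)
```

Passing the normal data as `reference` and the abnormal data as `X` would score abnormal records against the normal group.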
As shown in
The pre-processed data may be provided to certain algorithms, such as a Mahalanobis distance genetic algorithm (MDGA), to reduce a large number of potential variables to a desired subset of variables (process 106). The reduced subset of variables may then be used to create accurate data models. The subset of variables may further be outputted to a data storage for later retrieval (process 108). The subset of variables may also be directly outputted to other application software programs to further analyze and/or model the data set (process 110). Application software programs may include any appropriate type of data processing software program. The processes explained above may be performed by one or more computer systems.
CPU 202 may execute sequences of computer program instructions to perform various processes as explained above. The computer program instructions may be loaded from a read-only memory (ROM) into RAM 204 for execution by CPU 202. Storage 216 may be any appropriate type of mass storage provided to store any type of information that CPU 202 may need to perform the processes. For example, storage 216 may include one or more hard disk devices, optical disk devices, or other storage devices to provide storage space.
Console 208 may provide a graphic user interface (GUI) to display information to users of computer system 200. Console 208 may be any appropriate type of computer display device or computer monitor. Input devices 210 may be provided for users to input information into computer system 200. Input devices 210 may include a keyboard, a mouse, or other optical or wireless computer input devices. Further, network interfaces 212 may provide communication connections such that computer system 200 may be accessed remotely through computer networks.
Databases 214-1 and 214-2 may contain model data and any information related to data records under analysis, such as training and testing data. Databases 214-1 and 214-2 may also include analysis tools for analyzing the information in the databases. CPU 202 may use databases 214-1 and 214-2 to determine correlation between variables.
As explained above, computer system 200 may perform process 106 to select data set features and reduce variables. In certain embodiments, computer system 200 may use MDGA to perform process 106.
As shown in
Normal data and abnormal data may be separated by Mahalanobis distances. An exemplary relationship between the normal data, abnormal data, and corresponding Mahalanobis distances is shown in
Returning to
Initially, several such parameter lists or chromosomes may be generated to create a population. A population may be a collection of a certain number of chromosomes. The chromosomes in the population may be evaluated based on a fitness function or a goal function, and a value of goodness or fitness may be returned by the fitness function or the goal function. The population may then be sorted, with those having better fitness ranked at the top.
The genetic algorithm may generate a second population from the sorted initial population by using any or all of the genetic operators, such as selection, crossover (or reproduction), and mutation. During selection, chromosomes in the population with fitness values below a predetermined threshold may be deleted. Selection methods, such as roulette wheel selection and/or tournament selection, may also be used. After selection, a reproduction operation may be performed upon the selected chromosomes. Two selected chromosomes may be crossed over at a randomly selected crossover point. Two new child chromosomes may then be created and added to the population. The reproduction operation may be continued until the population size is restored. Once the population size is restored, mutation may be selectively performed on the population. Mutation may be performed on a randomly selected chromosome by, for example, randomly altering bits in the chromosome data structure.
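The selection, crossover, and mutation operators described above may be sketched as follows. This is a minimal illustration under stated assumptions: each chromosome is a list of 0/1 bits, selection keeps the fitter half of the population rather than applying a fixed fitness threshold, and the best chromosome is carried over unchanged. The function name `next_generation` and the parameter `mutation_rate` are hypothetical.

```python
import random

def next_generation(population, fitness, rng, mutation_rate=0.05):
    """Produce one new generation from a list of bit-list chromosomes."""
    size = len(population)
    # Sort so that chromosomes with better fitness are ranked at the top.
    ranked = sorted(population, key=fitness, reverse=True)
    # Selection (assumption: keep the fitter half instead of a fixed threshold).
    survivors = ranked[: max(2, size // 2)]
    # Reproduction: one-point crossover until the population size is restored.
    children = []
    while len(survivors) + len(children) < size:
        a, b = rng.sample(survivors, 2)
        point = rng.randrange(1, len(a))          # random crossover point
        children.append(a[:point] + b[point:])
        children.append(b[:point] + a[point:])
    restored = (survivors + children)[:size]
    # Mutation: randomly flip bits; the best chromosome is kept intact
    # (an assumption of this sketch, not stated in the text).
    result = [list(restored[0])]
    for chrom in restored[1:]:
        result.append([bit ^ 1 if rng.random() < mutation_rate else bit
                       for bit in chrom])
    return result

# Usage with a toy fitness (count of set bits), for demonstration only.
rng = random.Random(0)
pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(6)]
pop = next_generation(pop, sum, rng)
```

Roulette wheel or tournament selection could be substituted for the truncation step without changing the rest of the loop.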
Selection, reproduction, and mutation may result in a second generation population having chromosomes that are different from the initial generation. The average degree of fitness may be increased by this procedure for the second generation, since better fitted chromosomes from the first generation may be selected. This entire process may be repeated for any appropriate number of generations until the genetic algorithm converges. Convergence may be determined if the result of the genetic algorithm is improved during each generation and the rate of improvement falls below a predetermined rate. The rate may be chosen depending on a particular application. For example, the rate may be set at approximately 1% for general applications and may be set at approximately 0.1% for more complex applications.
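One possible reading of the convergence test is sketched below. The function name `has_converged`, the use of the best fitness per generation as "the result", and the definition of the improvement rate as a relative change between consecutive generations are assumptions of this sketch.

```python
def has_converged(best_fitness_history, min_rate=0.01):
    """Return True when the latest improvement is non-negative but below min_rate.

    best_fitness_history holds one best-fitness value per generation.
    min_rate=0.01 corresponds to the approximately 1% example; 0.001 to 0.1%.
    """
    if len(best_fitness_history) < 2:
        return False                      # need two generations to measure a rate
    prev, curr = best_fitness_history[-2], best_fitness_history[-1]
    if prev == 0:
        return curr == 0                  # relative rate undefined; treat no change as converged
    rate = (curr - prev) / abs(prev)      # relative improvement between generations
    return 0 <= rate < min_rate
```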
When CPU 202 sets up the genetic algorithm (step 306), CPU 202 may identify a maximum number of variables of a desired subset. As explained above, the data set may be a sparse data set, which may include more potential variables than total data records in the data set. In one embodiment, the maximum number may be less than or equal to the number of total data records in the data set. CPU 202 may set the maximum number as a constraint to chromosome encodings of the genetic algorithm.
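The constraint on chromosome encodings may be illustrated as follows. In this sketch, a chromosome is assumed to be a bit mask with one bit per potential variable, and the cap on selected variables is taken as the number of data records. The function names and the `seed` parameter (included so that runs can be repeated) are hypothetical.

```python
import random

def random_chromosome(n_variables, max_selected, rng):
    """Return a bit mask with between 1 and max_selected variables selected."""
    k = rng.randint(1, max_selected)
    chosen = set(rng.sample(range(n_variables), k))
    return [1 if i in chosen else 0 for i in range(n_variables)]

def initial_population(n_variables, n_records, pop_size, seed=None):
    """Create pop_size chromosomes that respect the maximum-variable constraint."""
    rng = random.Random(seed)
    # Cap described in the text: no more selected variables than data records.
    max_selected = min(n_variables, n_records)
    return [random_chromosome(n_variables, max_selected, rng)
            for _ in range(pop_size)]
```

Offspring produced later by crossover or mutation could exceed the cap, so an implementation along these lines would likely also need a repair or rejection step.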
CPU 202 may also set a goal function for the genetic algorithm to evaluate goodness or fitness of chromosomes. In certain embodiments, the goal function may include maximizing Mahalanobis distances between normal data set 402 and abnormal data set 404. The maximum deviation of Mahalanobis distance may be determined based on MDx̄, MDmin, or both, as described above. In operation, if the Mahalanobis distance deviation between normal data set 402 and abnormal data set 404 is above a predetermined threshold, the goal function may be satisfied. One or more values of the Mahalanobis distance deviation may also be returned by the goal function for further evaluations, such as convergence determination.
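A goal function of this kind may be sketched as follows. The exact definitions of the mean-based and minimum-based deviations are not reproduced in this section, so the sketch assumes the following: the mean gap is the mean Mahalanobis distance of the abnormal data minus that of the normal data, and the minimum gap is the smallest abnormal distance minus the largest normal distance, with both groups scored against the normal data. All function and parameter names are hypothetical.

```python
import numpy as np

def _md(points, reference):
    """Mahalanobis distances of points relative to the reference group."""
    mu = reference.mean(axis=0)
    sigma_inv = np.linalg.pinv(np.atleast_2d(np.cov(reference, rowvar=False)))
    diff = points - mu
    return np.sum((diff @ sigma_inv) * diff, axis=1)

def md_deviation(normal, abnormal, mask):
    """Return (mean_gap, min_gap) for the variables selected by the bit mask."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        raise ValueError("chromosome selects no variables")
    normal = np.asarray(normal, dtype=float)[:, cols]
    abnormal = np.asarray(abnormal, dtype=float)[:, cols]
    md_normal = _md(normal, normal)
    md_abnormal = _md(abnormal, normal)
    mean_gap = md_abnormal.mean() - md_normal.mean()   # assumed mean-based deviation
    min_gap = md_abnormal.min() - md_normal.max()      # assumed minimum-based deviation
    return mean_gap, min_gap

def goal_satisfied(normal, abnormal, mask, threshold):
    """True when the assumed mean-based deviation exceeds the threshold."""
    return md_deviation(normal, abnormal, mask)[0] > threshold
```

Either returned value could serve as the fitness passed to the genetic operators, and the same values could feed a convergence check.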
After setting up the genetic algorithm (step 306), CPU 202 may start the genetic algorithm (step 308). CPU 202 may choose an initial subset or subsets of variables or parameter lists for the genetic algorithm. CPU 202 may choose the initial subsets based on user inputs. Alternatively, CPU 202 may choose the initial subsets based on correlations among potential variables and correlations between variables and results of applications 110. The correlation may depend on a particular application, such as a manufacturing, service, financial, and/or research application. For example, in a financial application including a unit variable, a price variable, and a weather variable, the unit variable and the price variable are likely to be correlated. Only one of the unit variable and the price variable may be chosen to avoid redundancy, while the weather variable, which is less likely to be correlated with the other two, may also be selected. However, if both the unit variable and the price variable correlate to a result of the financial application, for example, a total cost, both the unit variable and the price variable may be selected.
Further, alternatively, CPU 202 may cause the genetic algorithm to randomly select a subset or subsets of variables as initial chromosomes. A random seed used to randomly select the subset may be set by a user or by the genetic algorithm based on a predetermined configuration. CPU 202 may then calculate Mahalanobis distances for both normal and abnormal data based on the selected variable subset (step 310). The calculation may be performed by CPU 202 according to a series of steps related to equation 1. For example, CPU 202 may calculate descriptive statistics, calculate Z values, build a correlation matrix, invert the correlation matrix, calculate Z transpose, and calculate Mahalanobis distances.
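The series of steps listed above may be sketched as follows. The function name `md_via_z_values` is hypothetical, the pseudo-inverse is an assumption that keeps the sketch usable when the correlation matrix is singular, and the optional division by the number of variables reflects the scaling conventionally used in the Mahalanobis-Taguchi approach rather than anything stated in this section.

```python
import numpy as np

def md_via_z_values(normal, samples, scale_by_k=False):
    """Score `samples` against the `normal` group using standardized Z values."""
    normal = np.asarray(normal, dtype=float)
    samples = np.asarray(samples, dtype=float)
    # 1. Descriptive statistics of the normal group.
    mean = normal.mean(axis=0)
    std = normal.std(axis=0, ddof=1)
    # 2. Z values (standardized with the normal group's statistics).
    z_normal = (normal - mean) / std
    z = (samples - mean) / std
    # 3. Correlation matrix of the normal group.
    corr = np.atleast_2d(np.corrcoef(z_normal, rowvar=False))
    # 4. Invert the correlation matrix (assumption: pseudo-inverse).
    corr_inv = np.linalg.pinv(corr)
    # 5-6. Z * C^-1 * Z' evaluated row by row.
    md = np.sum((z @ corr_inv) * z, axis=1)
    # Optional scaling by the number of variables k (assumption).
    return md / normal.shape[1] if scale_by_k else md
```

Without the optional scaling, this yields the same values as equation (1) evaluated with the sample variance-covariance matrix, because standardizing the variables converts that matrix into the correlation matrix.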
After Mahalanobis distances (e.g., MDnormal, MDabnormal, MDx̄, and/or MDmin) have been calculated, the goal function may be evaluated. CPU 202 may further determine whether the genetic algorithm converges on the selected subset of variables (step 312). Depending on the types of applications, predetermined criteria may be used. For example, an improvement rate of approximately 0.1% may be used to determine whether the genetic algorithm converges. If the genetic algorithm does not converge on a particular subset (step 312; no), the genetic algorithm may proceed to create a next generation of chromosomes, as explained above. The variable reducing process may then return to step 310 to recalculate Mahalanobis distances based on the newly created subset of variables or chromosomes. On the other hand, if the genetic algorithm converges with a particular subset (step 312; yes), CPU 202 may determine that a desired or optimized variable subset has been found.
CPU 202 may further save the optimized subset of variables with which the genetic algorithm converges as a result of the variable reducing process (step 314). CPU 202 may also save the subset in storage 216 for later retrieval or, alternatively, in database 214-1 and/or database 214-2. CPU 202 may also output the subset of variables to other application software programs for further processing or analysis (step 316).
In certain embodiments, CPU 202 may also use a clustering algorithm to define the normal data set and abnormal data set, as described regarding step 304. The clustering algorithm may include any appropriate type of clustering algorithm, such as k-means, fuzzy k-means, nearest neighbor, Kohonen networks, and/or adaptive resonance theory networks. In one embodiment, a k-means clustering algorithm with a “v-fold” cross-validation scheme may be used. At the beginning of defining the normal and abnormal data sets, CPU 202 may identify inherent data clusters (e.g., similar data or correlated data) of the data set. If only two clusters are identified, CPU 202 may use one cluster as the normal data set and use the other cluster as the abnormal data set. In certain situations, there may be more than two clusters identified. For example, CPU 202 may determine three, four, or even more clusters of the data set.
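When more than two clusters are identified, the summary above describes choosing the two clusters that have the largest difference in normalized means. A minimal sketch of that choice is given below. It assumes that cluster labels have already been produced by any clustering algorithm, that "difference" is measured as the Euclidean distance between cluster means after each variable is standardized, and that the larger of the two chosen clusters is treated as normal. These choices and the function name `pick_normal_abnormal` are assumptions of this sketch.

```python
import numpy as np
from itertools import combinations

def pick_normal_abnormal(X, labels):
    """Return (normal_cluster_id, abnormal_cluster_id) from existing cluster labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Normalize every variable so that cluster means are comparable.
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant variables
    Z = (X - X.mean(axis=0)) / std
    ids = np.unique(labels)
    means = {c: Z[labels == c].mean(axis=0) for c in ids}
    # Pair of clusters whose normalized means differ the most.
    a, b = max(combinations(ids, 2),
               key=lambda pair: np.linalg.norm(means[pair[0]] - means[pair[1]]))
    # Assumption: the larger cluster is labeled normal, the other abnormal.
    size_a, size_b = np.sum(labels == a), np.sum(labels == b)
    return (a, b) if size_a >= size_b else (b, a)
```

Members of any remaining cluster would still need to be assigned to one of the two groups, for example by comparing each member with the two chosen clusters.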
As shown in
Alternatively, CPU 202 may determine differences between each member of cluster 506 and cluster 502 and cluster 504. CPU 202 may then decide whether a particular member of cluster 506 should be defined as normal data or abnormal data based on the differences. Although three clusters are shown in
Further, relationships among variables may also be identified during clustering algorithm operation, especially when more than two clusters are determined and individual members are assigned to one of the data sets. Such relationships may be further provided by CPU 202 to the genetic algorithm to determine the initial selection of a subset of variables. For example, if some variables contribute significantly to the determination of the clusters, these variables are likely to be included in the desired subset of variables and, thus, may be provided to seed the genetic algorithm population.
The disclosed Mahalanobis distance genetic algorithm (MDGA) methods and systems may provide a desired solution for effectively reducing variables in sparse data scenarios, which may be difficult or impractical to achieve with other conventional methods and systems. The disclosed methods and systems may be used to identify a desired subset of variables that can be used to create more accurate models. Performance of other statistical or artificial intelligence modeling tools may be significantly improved when incorporating the disclosed methods and systems.
The disclosed methods and systems may also be used to effectively reduce the dimensionality of a data set in which the number of dimensions or variables is larger than the possible number of actions that each variable may support. The disclosed methods and systems may reduce the dimensionality of a data set under various scenarios, such as sparse data scenarios, or scenarios in which the data is inverted, etc.
The disclosed methods and systems may also provide an option of using a clustering algorithm to define data characteristics. The disclosed clustering algorithm may effectively find desired data records to classify normal and abnormal data sets without prior knowledge about the number of clusters. The combined clustered MDGA may provide additional functionality, such as the ability to search a candidate subset of variables for the most parsimonious solution that can quantitatively discriminate between different data records. Such data characteristics may be further provided to knowledge base modeling tools to increase operation speed of the modeling tools.
Other embodiments, features, aspects, and principles of the disclosed exemplary systems will be apparent to those skilled in the art and may be implemented in various environments not limited to work site environments.