The present invention is related to methods for optimizing clustering and modularity of problems, including within the framework of dependency structure matrices.
Many real-world problems, systems, organizations and structures can be described in terms of interrelated modules. For example, a combustion engine could be described in very simple terms as elements of one or more combustion chambers, one or more pistons, a transmission, an ignition source, and a fuel supply. Some of these components are linked to others. The pistons, for instance, are linked to the combustion chambers and the drive train, and the fuel supply and ignition source are linked to the combustion chamber. Linked elements may be thought of as forming “modules.” The pistons, combustion chamber, and ignition source, for example, may be described as a single module. In the analysis of many real-world problems, assembling elements into modules can be beneficial for purposes such as simplification of analysis.
Models are known that represent real-world systems and structures in terms of interrelated modules. A directed graph is one example of a model that does so. Another example is dependency structure matrix (“DSM”) models. A DSM is a matrix representation of a directed graph that can be used to represent systems and structures, including a complex system.
FIGS. 2(a), 2(b) and 2(c) are useful to further illustrate DSM's and their relations to other models, as well as their relation to real-world structures and organizations.
Once a DSM has been constructed, it can be analyzed to identify modules, referred to within the DSM as clusters. This process is referred to as “clustering.” The goal of DSM clustering is to find subsets of DSM elements (i.e., the “clusters”) that are mutually exclusive or minimally interacting. That is, clusters contain most, if not all, of the interactions (i.e., “X” marks) internally, and the interactions or links between separate clusters are eliminated or minimized to transform the system into independent, loosely coupled, or nearly independent system modules. One significance of a cluster is that all or most of its elements interact mainly with other elements in the cluster and are unlikely to interact with elements outside of it. Clustering is therefore an important part of the usefulness of DSM's since it “transforms” the initial DSM element population into a simpler “modular” model.
As a simple example of clustering, consider the identical DSM's of
While the DSM's of
Several methods are known for clustering real-world problems and structures. Several methods for creating modules of variables are discussed, for example, in “Notes on the Synthesis of Form,” by Alexander, C., 1964, Harvard Press, Boston, Mass. Methods for partitioning DSM's are discussed in “The Design Structure System: A Method for Managing the Design of Complex Systems,” by Steward, D. V., IEEE Transactions on Engineering Management 28 (1981) 71-74. Known methods for organizing modules and for clustering DSM's, however, leave many problems unresolved. Many fail to accurately predict the formation of “good” clustering arrangements for complex systems. Many known clustering methods, when applied to complex DSM's, have difficulty extracting relevant information from the data and then conveying the information to a user. Some methods suffer from an oversimplification of the objective function utilized. Others are susceptible to getting trapped in locally optimal solutions. Many methods have difficulty in accurately representing busses and three-dimensional structures.
Exemplary embodiments of the present invention are directed to methods and program products for optimizing clustering of a design structure matrix. An embodiment of the present invention includes the steps of using a genetic operator to achieve an optimal clustering of a design structure matrix. Other exemplary embodiments of the invention leverage this optimal clustering by applying a genetic operator on a module-specific basis.
FIGS. 2(a) and 2(b) are exemplary schematics of the prior art showing a bus relation, while
FIGS. 3(a) and 3(b) are exemplary DSM's of the prior art;
FIGS. 4(a) and 4(b) are exemplary schematics of the prior art showing a three dimensional tetrahedron relation, while
FIGS. 9(a)-9(d) illustrate DSM's before and after operation of a method of the invention;
Before discussing exemplary embodiments of the invention in detail, it will be appreciated that embodiments of the present invention lend themselves well to practice in the form of computer program products. Accordingly, it will be appreciated that embodiments of the invention may comprise computer program products comprising computer executable instructions stored on a computer readable medium that when executed cause a computer to undertake certain steps. It will further be appreciated that the steps undertaken may comprise method embodiment steps, and in this sense that description made herein in regard to method embodiments likewise applies to steps undertaken by a computer as a result of execution of a computer program product embodiment of the invention.
Exemplary embodiments of the present invention are directed to methods and program products for optimizing clustering of a design structure matrix model. An embodiment of the present invention includes the sequential steps of applying at least one genetic operator to a parent population of design structure matrix clusterings to produce an offspring population of design structure matrix clusterings. In a next step, a scoring metric is used to score each of the offspring population of design structure matrix clusterings. The method is then terminated if a termination condition has been satisfied. If not, selection is performed to create a new parent population of clusterings. Selection may be performed in a probabilistic or deterministic manner, for example. After selection, the steps of generating offspring and scoring are repeated until the termination condition is satisfied. Other exemplary embodiments of the invention include additional steps of leveraging the optimal clustering that has been determined. Exemplary steps include using the optimized clusterings to create modules of variables from a parent population of variables.
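By way of illustration only, the sequence of steps just described (apply operators, score, check termination, select, repeat) can be sketched as a short Python loop. The function names, the toy bit-tuple chromosomes, and the toy score (number of 1's, lower being better) are hypothetical stand-ins chosen for this sketch and are not the encoding or scoring metric of the invention.

```python
import random

def evolve(parents, make_offspring, score, select, is_done, max_generations=200):
    """Operators -> scoring -> termination check -> selection, repeated."""
    for generation in range(max_generations):
        offspring = make_offspring(parents)       # apply genetic operator(s)
        best = min(offspring, key=score)          # score every offspring
        if is_done(best, generation):             # termination condition
            return best
        parents = select(parents, offspring)      # new parent population
    return min(parents, key=score)

# Toy stand-ins (hypothetical): a chromosome is a tuple of bits and a lower
# score is better, mimicking a description length that is to be minimized.
rng = random.Random(0)

def toy_score(chromosome):
    return sum(chromosome)

def toy_offspring(parents):
    # Flip each bit with probability 0.1.
    return [tuple(1 - g if rng.random() < 0.1 else g for g in p) for p in parents]

def toy_select(parents, offspring):
    # Deterministic selection over the combined parent and offspring pool.
    return sorted(parents + offspring, key=toy_score)[:len(parents)]

initial = [tuple(rng.randint(0, 1) for _ in range(12)) for _ in range(8)]
best = evolve(initial, toy_offspring, toy_score, toy_select,
              is_done=lambda c, gen: toy_score(c) == 0)
```

Passing the operators, scoring metric, selection scheme, and termination test as arguments keeps the loop itself independent of any particular choice among them.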
It has been discovered that methods of applying a genetic operator to a parent DSM clustering population to generate an offspring clustering population will provide desirably optimized cluster structures for DSM's for many real-world problems. Through practice of the present invention, optimal clustering of DSM's can be efficiently achieved even when confronted with complex real-world problems that involve busses and/or three-dimensional structures.
A parent population of clusterings or “chromosomes” for the DSM is then created (step 502). In the present embodiment of the invention, each chromosome may be thought of as one clustering. Each clustering or chromosome signifies the presence or absence of a variable in a particular cluster as shown generally in
At least one genetic operator is then applied to the parent population of clusterings to create an offspring population of clusterings. In the exemplary method of
Next, the genes of the offspring chromosomes are mutated (step 506). Mutation occurs according to some defined probability pm (typically low) and serves to introduce some variability into the gene pool. For a binary encoding chromosome, mutation inverts the value of genes (from 0 to 1, or from 1 to 0) with the mutation probability pm. Without mutation, offspring chromosomes would be limited to only the genes available within the initial population. In addition to or as an alternative to crossover and mutation, other genetic operators may be applied in other invention embodiments. Other operators may be probabilistic, or may function through estimation of distribution, stochastic search, or the like.
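As a minimal sketch of the bit-inversion mutation described above for a binary-encoded chromosome, each gene can be inverted independently with probability pm. The function name `mutate` and the optional `rng` argument are assumptions made for this example.

```python
import random

def mutate(chromosome, pm, rng=random):
    """Invert each binary gene (0 -> 1, 1 -> 0) with probability pm."""
    return [1 - gene if rng.random() < pm else gene for gene in chromosome]

# With pm = 0 nothing changes; with pm = 1 every gene is inverted.
unchanged = mutate([0, 1, 1, 0], 0.0)
inverted = mutate([0, 1, 1, 0], 1.0)
```

A low value of pm leaves most offspring genes intact while still allowing gene values that are absent from the initial population to appear.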
In the exemplary method of the invention, the quality of the offspring population is evaluated. Specifically, each chromosome is evaluated using a scoring metric, sometimes also referred to as a fitness function, to determine the quality of the solution. In the exemplary embodiment of
The encoding of step 508 is preferably capable of representing overlapping clusters and three-dimensional structures. In one exemplary step 508, the cluster encodings make up a binary string of (c·nn) bits, where c is a predefined maximal number of clusters, and nn is the number of nodes. The (x+nn·y)-th bit represents that node (x+1) belongs to cluster (y+1). The last cluster is treated as a bus. For example, in the example DSM clustering of
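The bit layout of the exemplary step 508 can be sketched as follows, assuming zero-based bit positions so that bit (x+nn·y) marks node (x+1) as a member of cluster (y+1). The helper names `encode` and `decode` and the small four-node example are hypothetical and do not correspond to any figure.

```python
def encode(clusters, nn, c):
    """Build the c*nn bit string; clusters is a list of sets of 1-based node labels.

    The last cluster in the list plays the role of the bus.
    """
    bits = [0] * (c * nn)
    for y, members in enumerate(clusters):
        for node in members:
            bits[(node - 1) + nn * y] = 1
    return bits

def decode(bits, nn, c):
    """Recover the list of clusters (sets of 1-based node labels) from the bits."""
    return [{x + 1 for x in range(nn) if bits[x + nn * y]} for y in range(c)]

# Node 2 belongs to two clusters at once (overlap); node 4 sits in the bus.
example_clusters = [{1, 2}, {2, 3}, {4}]
example_bits = encode(example_clusters, nn=4, c=3)
```

Because every cluster has its own block of nn bits, a node may be flagged in more than one block, which is how this layout accommodates overlapping clusters.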
Once encoded, the clusterings are evaluated through a scoring metric (step 510). Those knowledgeable in the art will appreciate that there are a number of scoring metrics suitable for practice with the invention. Preferably, the scoring metric conveys at least two general categories of information. The first category describes the complexity of clusters. For example, an exemplary first category describes the size of the data structure needed to represent the clustering. The second category describes the accuracy of the clusters. An exemplary second category describes the size of the data structure required to represent the inaccuracy of the clustering. The inaccuracy of the data may be represented by the mismatched data. As used herein, the term “mismatched data” is intended to broadly refer to the difference between the model clustering and the real-world DSM. For example, for a particular DSM, mismatched data is signified by the unequal matrix entries between the real-world DSM and the DSM generated by a particular clustering model. Generally, low complexity and high accuracy clusterings are favored. In practice, these two desired qualities are often competing: very low complexity results tend to have relatively low accuracy while very high accuracy solutions tend to be relatively complex. An acceptable balance must be achieved between the two. In some exemplary methods of the invention, the two categories of information may be weighted as desired.
One preferred method of the invention includes steps of using the minimum description length (“MDL”) scoring metric. The MDL can be interpreted as follows: among all possible models, choose the model that uses the minimal length for describing a given data set (that is, model description length plus mismatched data description length). For example, the encoding of a complicated DSM model should be longer than that of a simple model.
In the exemplary method embodiment, the MDL encoding of each cluster starts with a number that is sequentially assigned to each cluster, and then this is followed by a sequence of nodes in the cluster. By way of example,
where nc is the number of clusters, nn is the number of nodes, cli is the number of nodes in the ith cluster, and the logarithm base is 2. In the example of
In order to achieve the second category of the MDL description that describes the mismatched data, an exemplary method step includes constructing a second DSM, referred to for convenience as DSM′. In the new second DSM′, each entry d′ij is “1” if and only if:
Next, d′ij is compared with the given dij. For every mismatched entry, where d′ij≠dij, the description should indicate where the mismatch occurred (i and j), with one additional bit indicating whether the mismatch is zero-to-one or one-to-zero. Two mismatch sets can be defined: S1={(i,j)|dij=0,d′ij=1} and S2={(i,j)|dij=1,d′ij=0}. A mismatch that contributes to S1 may be referred to as a type 1 mismatch, and a mismatch that contributes to S2 as a type 2 mismatch. In the exemplary method of the invention, the mismatched data description length is given by:
The first log nn in the bracket indicates i, the second one indicates j, and the additional one bit indicates the type of mismatch.
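A minimal sketch of the mismatch bookkeeping follows, assuming that each mismatched entry costs (log nn + log nn + 1) bits as just described and that the model DSM′ is supplied by the caller. The function names and the small 4×4 matrices are illustrative assumptions only.

```python
from math import log2

def mismatch_sets(d, d_model):
    """Return (S1, S2): entries that are 0 in d but 1 in the model, and vice versa."""
    n = len(d)
    s1 = {(i, j) for i in range(n) for j in range(n)
          if d[i][j] == 0 and d_model[i][j] == 1}
    s2 = {(i, j) for i in range(n) for j in range(n)
          if d[i][j] == 1 and d_model[i][j] == 0}
    return s1, s2

def mismatch_description_length(s1, s2, nn):
    """Bits to list every mismatch: row index, column index, and one type bit."""
    per_entry = log2(nn) + log2(nn) + 1
    return (len(s1) + len(s2)) * per_entry

d = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
d_model = [[0, 1, 1, 0],
           [1, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]
s1, s2 = mismatch_sets(d, d_model)
length = mismatch_description_length(s1, s2, nn=4)
```

In this toy case there is one type 1 mismatch and two type 2 mismatches, so with nn = 4 the description needs 3 × (2 + 2 + 1) = 15 bits.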
The MDL clustering metric is given by the weighted summation of the MDL model description length according to EQTN. 1 (e.g.,
where α and β are weights between 0 and 1. In EQTN. 3, the first term represents complexity, and the second two terms taken together represent accuracy. The weights α and β may be used in some embodiments of the invention to adjust weighting of the two categories of information. For example, in some real-world problems achieving a model with minimal clusters may be far more important than minimizing mismatched data. In such a case, the category of cluster complexity could be given a weighting of 0.9 and the category of cluster accuracy a weighting of 0.05. A naïve setting is α=β=⅓. Other settings are of course useful and may be selected as appropriate for reasons related to the particular real-world application at hand, or like reasons. For example, α and β may be set to mimic the behavior of a manual clustering arrangement.
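The weighted combination can be sketched as below. Two points are assumptions of this sketch rather than statements of the equations themselves: the complexity term is given the weight (1−α−β), which is consistent with the naïve setting α=β=⅓ and with the 0.9 and 0.05 example above, and the model description length charges log nn bits for each cluster number and for each listed node, following the cluster-number-then-node-sequence encoding described earlier.

```python
from math import log2

def model_description_length(cluster_sizes, nn):
    # Assumed form: one cluster number plus cl_i node labels, log2(nn) bits each.
    return sum(log2(nn) + size * log2(nn) for size in cluster_sizes)

def mdl_score(cluster_sizes, n_type1, n_type2, nn, alpha=1/3, beta=1/3):
    """Weighted sum of model complexity and the two kinds of mismatch (lower is better)."""
    per_mismatch = 2 * log2(nn) + 1          # row index + column index + type bit
    complexity = model_description_length(cluster_sizes, nn)
    return ((1 - alpha - beta) * complexity  # assumed weight of the complexity term
            + alpha * n_type1 * per_mismatch
            + beta * n_type2 * per_mismatch)

# Two clusters of two nodes each, one type 1 and two type 2 mismatches, nn = 4.
naive = mdl_score([2, 2], 1, 2, nn=4)
complexity_heavy = mdl_score([2, 2], 1, 2, nn=4, alpha=0.05, beta=0.05)
```

Raising α relative to β penalizes type 1 mismatches more heavily than type 2 mismatches, and the reverse, which is one way such a metric could be tuned toward a desired clustering preference.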
Referring once again to the flowchart of
It will be appreciated that as used herein, the term “optimal clustering” is intended to be broadly interpreted as meaning optimized to a desired degree. It will be understood that “optimal” clustering does not require the absolute best achievable clustering, but instead only a clustering that satisfies whatever termination condition was applied. For example, optimal clustering may be defined when the clustering meets some defined level of fitness. Optimal clustering may be defined, for example, when the offspring population converges sufficiently close to a single clustering. Or, optimal clustering may be defined by selecting the highest scoring offspring clustering after a desired number of generations have been created.
If the termination condition has not been met, another generation of clusterings will be created. Selection is first performed to select chromosomes that will have their information passed on to the next generation (step 518). Preferably, selection is performed on the combined parent and offspring population. Those skilled in the art will appreciate that many different forms of selection may be practiced. In an exemplary method embodiment, (λ+μ) selection is performed. In total, (λ+μ) chromosomes are evaluated. (λ+μ) selection chooses the λ “best” chromosomes from the (λ+μ) chromosomes and passes them to the next generation. Elitism is embedded in (λ+μ) selection. In some circumstances, it may be useful to replace the entire parent population with that of the offspring. Each new iteration of steps 506-514 may be referred to as a generation. When the termination condition is satisfied, then optimal clustering has been achieved.
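Selection over the combined parent and offspring pool can be sketched in a few lines. The sketch assumes that a lower score is better, as with a description length, and the name `plus_selection` and the toy score are hypothetical.

```python
def plus_selection(parents, offspring, score, keep):
    """Keep the `keep` best chromosomes from the combined parent + offspring pool."""
    combined = list(parents) + list(offspring)
    return sorted(combined, key=score)[:keep]

# Toy example: the score is the number of 1's, so fewer 1's is better.
parents = [(1, 1, 0), (0, 0, 0)]
offspring = [(1, 1, 1), (0, 1, 0)]
survivors = plus_selection(parents, offspring, score=sum, keep=2)
```

Because parents compete alongside their offspring, the best chromosome found so far can only be displaced by something at least as good, which is the elitism noted above.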
In order to further illustrate embodiments of the present invention and their benefits, a clustering was optimized on a number of input DSM's using a method of the invention consistent with that illustrated in
The left column of
In order to further illustrate a method of the invention, it was applied to a real-world DSM problem. A DSM for a generic 10 MWe gas turbine driven electrical generator set was constructed by decomposing it into 31 sub-systems. The sub-systems initially were listed randomly in the DSM and then tick marks denoting material relationship from one sub-system to another were inserted. Intuitive manual clustering of such a DSM can yield different results depending on the extent to which a single group of system-wide relationships is emphasized over “good” clusters. One alternative arrangement is shown in the manually clustered DSM of the prior art shown in
The prior art manually clustered DSM of
To illustrate the method of the invention, it will be applied to the DSM of
Also, the novel mismatch data set weighting used in the scoring metric of the present invention provides a more beneficial balance in the two types of mismatches than does manual DSM clustering of the prior art. In the manual version of the DSM of
The mismatch data sets and their weighting therefore provide the present invention with valuable flexibility. These elements of the invention allow the scoring metric practiced to be “tuned” to mimic a desired priority of selection, such as a human expert's preference. One step for tuning the scoring metric is to tune the weightings α and β. Other steps are also contemplated, such as tuning the weight of the model description.
The results of
With reference now drawn to
Referring now to the flowchart of
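In one exemplary step 1302, the dependency of gene i and gene j can be detected in a manner similar to Linkage Identification by Non-linearity Checking, or “LINC,” where a “gene” can be a variable, a collection of variables, or a sub-component of a variable. Let $f_{a_i=x,\,a_j=y}$ denote the fitness value of the schema whose i-th gene is x and whose j-th gene is y. The interaction between gene i and gene j is then measured by:

$$\Delta_{ij} = \left| f_{a_i=0,\,a_j=1} - f_{a_i=0,\,a_j=0} - f_{a_i=1,\,a_j=1} + f_{a_i=1,\,a_j=0} \right|$$

However, the fitness value of schemata cannot be computed unless every possible combination is visited. Now the task is to approximate Δij with sij that is computed based on the individuals seen so far. First, define the sampled fitness of a schema in the population of the t-th generation as:

$$f^{t}_{a_i=x,\,a_j=y} = \frac{1}{n_a} \sum_{a \in P_t,\; a_i=x,\; a_j=y} f(a)$$

where t is the generation, Pt is the population of the t-th generation, f is the fitness function, a is an individual where its i-th gene is x and j-th gene is y, and na is the number of such a in the population. The sampled fitness $f^{t}_{a_i=x,\,a_j=y}$ is undefined if na is zero. The information of interactions gathered from the population is:

$$s^{t}_{ij} = \left| f^{t}_{a_i=0,\,a_j=1} - f^{t}_{a_i=0,\,a_j=0} - f^{t}_{a_i=1,\,a_j=1} + f^{t}_{a_i=1,\,a_j=0} \right|$$

sijt is undefined if any of the sampled fitness values $f^{t}_{a_i=x,\,a_j=y}$ is undefined. To utilize all individuals of all populations seen so far, sijt is then averaged over generations. Define a set D={sijt|sijt is defined}, then

$$s_{ij} = \frac{1}{|D|} \sum_{t \le T,\; s^{t}_{ij} \in D} s^{t}_{ij}$$

where T is the current generation. |D| equals T if every sijt is defined.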
With a threshold θ, sij is then mapped into the 0-1 domain, and a DSM is constructed:
dij=0 if sij≤θ, and dij=1 if sij>θ
In this exemplary step 1302, the threshold θ is calculated by a two-mean algorithm (a special case of the k-mean algorithm, where k=2). The threshold can also be set according to some prior knowledge such as non-linearity, or to another value as may be desirable. The two-mean algorithm can be briefly described as follows. With an initial guessed threshold (e.g., the mean of the given data), separate the given data into two clusters. Next, the threshold is updated by the average of the two means of the two clusters. Then separate the given data once again into two clusters by the updated threshold. This process is repeated until the clustering does not change anymore, and the desired threshold is finally obtained.
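The two-mean procedure can be sketched as follows. The function names, the handling of values exactly equal to the threshold (placed in the lower cluster), and the choice to leave the diagonal of the resulting DSM at zero are assumptions made for this example.

```python
def two_mean_threshold(values, max_iter=1000):
    """Iterate: split at theta, then move theta to the midpoint of the two cluster means."""
    theta = sum(values) / len(values)          # initial guess: mean of the data
    previous_split = None
    for _ in range(max_iter):
        low = [v for v in values if v <= theta]
        high = [v for v in values if v > theta]
        if not low or not high or (low, high) == previous_split:
            break                              # clustering no longer changes
        previous_split = (low, high)
        theta = (sum(low) / len(low) + sum(high) / len(high)) / 2
    return theta

def threshold_dsm(s, theta):
    """Map interaction values to 0/1 entries; the diagonal is left at 0 (assumption)."""
    n = len(s)
    return [[1 if i != j and s[i][j] > theta else 0 for j in range(n)]
            for i in range(n)]

s = [[0.0, 0.9, 0.1],
     [0.95, 0.0, 0.2],
     [0.15, 0.1, 0.0]]
off_diagonal = [s[i][j] for i in range(3) for j in range(3) if i != j]
theta = two_mean_threshold(off_diagonal)
dsm = threshold_dsm(s, theta)
```

In this toy data only the two strong interactions between the first and second genes exceed the threshold, so only those two entries of the DSM are set.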
Note that instead of computing the fitness values of schemata directly from all individuals seen in all previous generations, the method of the present invention uses two-fold averaging. First, fitness values of schemata are computed via averaging for each generation. Second, the interaction information (sij) is computed via averaging from sijt over the generations. There are several advantages for doing so. For example, sij is less biased, and sij becomes stable as the method moves towards convergence. Also, as the solutions converge, the fitness values get higher and populations lose diversity. Accordingly, a bias may occur if the two-fold averaging scheme is not used. Also, two-fold averaging adds to stability.
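The two-fold averaging can be sketched as follows for binary genes. The per-generation interaction is computed here as the absolute value of a LINC-style non-linearity check, |f01 − f00 − f11 + f10|, over the sampled schema fitness values; that specific form, like the function names, is an assumption of this sketch.

```python
def sampled_schema_fitness(population, fitnesses, i, j):
    """First averaging: mean fitness of individuals sharing each (gene i, gene j) pair."""
    totals, counts = {}, {}
    for individual, fit in zip(population, fitnesses):
        key = (individual[i], individual[j])
        totals[key] = totals.get(key, 0.0) + fit
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}

def generation_interaction(population, fitnesses, i, j):
    """Interaction for one generation, or None if any of the four schemata is unseen."""
    f = sampled_schema_fitness(population, fitnesses, i, j)
    if any(key not in f for key in [(0, 1), (0, 0), (1, 1), (1, 0)]):
        return None
    return abs(f[(0, 1)] - f[(0, 0)] - f[(1, 1)] + f[(1, 0)])

def averaged_interaction(history, i, j):
    """Second averaging: mean of the defined per-generation interactions."""
    defined = []
    for population, fitnesses in history:
        s_t = generation_interaction(population, fitnesses, i, j)
        if s_t is not None:
            defined.append(s_t)
    return sum(defined) / len(defined) if defined else None

pop = [(0, 0), (0, 1), (1, 0), (1, 1)]
nonlinear = [(pop, [a * b for a, b in pop]),      # genes interact
             ([(0, 0), (1, 1)], [0, 1])]          # incomplete generation, skipped
linear = [(pop, [a + b for a, b in pop])]         # genes contribute independently
s_nonlinear = averaged_interaction(nonlinear, 0, 1)
s_linear = averaged_interaction(linear, 0, 1)
```

An additive fitness gives an interaction of zero, while a fitness that rewards the two genes only jointly gives a positive value, which is the distinction the thresholding step relies on.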
The DSM is then passed to the auxiliary method, for performance of the steps 502-518 of
As used herein, the term “module-specific” is intended to be broadly interpreted as applying to the modules. Module-specific crossover is like traditional crossover except that entire modules are crossed over instead of individual genes. For example, if a cluster and its corresponding module includes genes 1, 3 and 5, then module-specific crossover would entail collectively crossing over all of genes 1, 3 and 5 as opposed to considering the genes individually. Further, it will be appreciated that since the modules may correspond to the clusters, the term “cluster-specific” is intended to be broadly interpreted in a like manner as “module-specific.” Cluster-specific crossover, for example, is intended to broadly refer to crossing over of all of the variables within a crossed over cluster.
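A minimal sketch of module-specific crossover is given below. Modules are lists of zero-based gene positions (so genes 1, 3 and 5 become positions 0, 2 and 4), and the caller states which modules to exchange; the function name and the explicit `swap` flags are assumptions of the example, and in practice the flags might be drawn at random.

```python
def module_crossover(parent_a, parent_b, modules, swap):
    """Exchange whole modules between two parents.

    modules: list of lists of gene positions; swap: one boolean per module.
    """
    child_a, child_b = list(parent_a), list(parent_b)
    for module, do_swap in zip(modules, swap):
        if do_swap:
            for g in module:                  # every gene of the module moves together
                child_a[g], child_b[g] = parent_b[g], parent_a[g]
    return child_a, child_b

# Module 1 holds genes 1, 3 and 5 (positions 0, 2, 4); module 2 holds genes 2 and 4.
modules = [[0, 2, 4], [1, 3]]
child_a, child_b = module_crossover([0, 0, 0, 0, 0], [1, 1, 1, 1, 1],
                                    modules, swap=[True, False])
```

Only the first module is exchanged here, so its three genes change parents as a unit while the genes of the second module stay where they were.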
The module-specific crossover of step 1308 may be further illustrated by another example. Assume a DSM includes variables 1-10. Assume at the conclusion of the method of
Before module-specific Crossover:
After module-specific Crossover:
Accordingly, module-specific crossover has resulted in the genes in only the 1st and 4th clusters being crossed over.
Mutation is next performed (step 1310) to introduce some variability into the gene pool. It may be introduced at some probability and desired level—for example 50% of the population may be mutated by altering of one gene. The method of
If the termination condition has not been met, the method of
The two paths represented by 1322 and 1324 can generally be described as a more exacting and computationally expensive solution (path 1324), and a less exacting but computationally less expensive option (path 1322). The decision at step 1320 that determines which path to follow can therefore be made on the basis of balancing computational expense versus accuracy of solution. In many cases, the computational expense advantages to be gained through the path 1322 are believed to outweigh its accuracy of solution disadvantages. Other decision criteria at step 1320 can also be used.
In order to further illustrate the embodiment of
where u is the number of 1's. Three linkage cases were tested: tight linkage, loose linkage, and random linkage. Define U(x) as a counting function that counts the number of 1's in x. In the tight linkage test, genes are arranged as:
In the loose linkage test case, alleles are arranged as
Given the failure rate to be 1/10, the population size is set as 182 by the gambler's ruin model. In the primary set of steps of
The simple GA (“SGA”) converged only for the tight linkage case, and took 40 generations to do so. For the loose and random linkage cases, the SGA did not achieve useful results because of modular building block disruption. The method of
Accordingly, methods of the invention including the exemplary method illustrated in
Those knowledgeable in the art will also appreciate that the present invention is well suited for practice in the form of a computer program product, and accordingly that the present invention may comprise computer program product embodiments. Indeed, it will be appreciated that the relatively intense calculational nature and manipulation of data that steps of invention embodiments comprise suggest that practice in the form of a computer program product will be advantageous. These program product embodiments may comprise computer executable instructions embedded in a computer readable medium that when executed by a computer cause the computer to carry out various steps. The executable instructions may comprise computer program language instructions that have been compiled into a machine-readable format. The computer readable medium may comprise, by way of example, a magnetic, optical, or circuitry medium useful for storing data. Also, it will be appreciated that the term “computer” as used herein is intended to broadly refer to any machine capable of reading and executing recorded instructions.
The steps performed by the computer upon execution of the instructions may generally be considered to be steps of method embodiments of the invention. That is, as discussed herein it will be understood that method embodiment steps may likewise comprise program product steps. With reference to the flowcharts of
It is intended that the specific embodiments and configurations herein disclosed are illustrative of the preferred and best modes for practicing the invention, and should not be interpreted as limitations on the scope of the invention as defined by the appended claims.
This invention was made with Government assistance under United States Air Force Office of Scientific Research, Air Force Material Command, grant No. F49620-00-0163, AFOSR grant no. F49620-03-0129 and the National Science Foundation Grant No DMI-99-08252. The Government has certain rights in the invention.
Number | Date | Country | |
---|---|---|---|
20050177351 A1 | Aug 2005 | US |