Clustering fuzzy expected value system

Information

Patent Grant
5414797

Patent Number
5,414,797
Date Filed
Tuesday, July 28, 1992
Date Issued
Tuesday, May 9, 1995

Inventors
Original Assignees
- International Business Machines Corporation
Examiners
- Downs; Robert W.
Agents
- Shkurko; Eugene I.
- Augspurger; Lynn A.

CPC
US Classifications
- 395
Field of Search
- US
- 395 600
- 395 934
- 395 900
- 395 51
- 395 61
International Classifications
- G06F1540
- G06F944

Abstract

A system provides a tool for computing the most typical fuzzy expected value of a membership function in a fuzzy set. The clustering fuzzy expected value (CFEV) system is used in a question answering system. The CFEV computed by the tool is based on grouping individual responses that meet certain criteria into clusters. Each cluster is considered a "super response" and contributes to the result in proportion to its relative size and its difference in opinion from the mean of the entire sample. In so doing, the CFEV represents the opinion of the majority of the population while also respecting the opinion of the minority. A comparison is made with existing tools such as the FEV and the WFEV, and the advantages of the CFEV are demonstrated by examples of cases where the other methods fail to perform.

Description

Claims

1. A computer system for processing a database having language text records to evaluate their relevance to a predetermined subject of interest, the system comprising:
processors for processing instructions and the text records;
database storage means for storing the text records in the database;
a unique word generator for generating and storing a list of unique words contained in the database;
a relevant word generator for generating and storing a list of relevant words including a system user's selections from the stored list of unique words;
a modified database generator for generating and storing a modified database, the modified database including the list of relevant words and synonyms, if any, associated with each word in the list of relevant words, whereinafter the unique word generator generates a list of unique words of the modified database;
a relevant word table including the list of unique words of the modified database; and
means for calculating and storing confidence values associated with each word in the relevant word table, each of the confidence values based on input values from a plurality of personnel other than the system user that reflects their perceptions of a relevance of said each word in the relevant word table to the subject of interest;
said means for calculating and storing confidence values including a clustering fuzzy expected value system for determining a membership grade for said each word in the relevant word table;
said clustering fuzzy expected value system determining said membership grade for a word in the relevant word table including grouping the confidence values for said word in the relevant word table into a plurality of clusters according to a predetermined formula, determining a mean of all the confidence values for said word in the relevant word table, and determining a plurality of mean confidence values each associated with one of said plurality of clusters of said confidence values.
2. The system according to claim 1, wherein the clustering fuzzy expected value system determines the mean of all the confidence values for said word in the relevant word table according to ##EQU19## determines the plurality of mean confidence values according to ##EQU20## and determines the membership grade according to ##EQU21## where i designates a cluster, a_ij is the number of confidence values in the cluster having value x_ij, m is the number of clusters, x is a parameter based on the field of use of the CFEV, N_i is the number of confidence values in cluster i, and N is the number of all the confidence values.
3. The system according to claim 1, further comprising:
a clustering fuzzy evaluator for calculating and storing a clustered fuzzy expected value for each record in the database, wherein the membership grades are grouped into clusters according to the predetermined formula, and wherein the clustered fuzzy expected value for each record in the database is calculated based on a mean of the membership grades for all words in the list of relevant words and on each mean of the membership grades in each of said clusters.
4. The system according to claim 3, wherein the clustering fuzzy evaluator calculates the clustered fuzzy expected value for a record containing zero words to be equal to zero.
5. The system according to claim 3, wherein the clustering fuzzy evaluator calculates the clustered fuzzy expected value for a record containing one word to be equal to the membership grade for the one word.
6. The system according to claim 3, wherein the clustering fuzzy evaluator calculates the clustered fuzzy expected value for a record containing two words according to the formula ##EQU22## where w_i and w_j are membership grades for the two words; k is a constant greater than zero; l is a value greater than zero and less than one; and c is equal to (l-m)/m where m is a mean of w_i and w_j.
7. The system according to claim 3, wherein the clustering fuzzy evaluator calculates the clustered fuzzy expected value for a record containing three or more words according to ##EQU23## where W_A is the mean of membership grades for all words in the record containing three or more words and is calculated according to ##EQU24## where W_Ai is the mean membership grade associated with one of the plurality of clusters of said membership grades and is calculated according to ##EQU25## and where N equals the number of words in the record containing three or more words, i designates a cluster, a_ij is the number of words in the cluster having membership grade x_ij, m is the number of clusters, n is the number of words in a cluster, x is a parameter based on the field of use of the CFEV, N_i is the number of words in cluster i.
8. The system according to claim 3, wherein the system further comprises means for retrieving a record having a fuzzy expected value greater than a predetermined value.
9. A method for evaluating a relevance of a database having text records to a predetermined subject of interest, the method comprising computer executed steps of:
compiling and storing a list of unique words contained in the database having text records;
extracting and storing from the list of unique words a list of relevant words selected by a user of the method;
storing confidence values furnished by personnel other than the user who are familiar with the subject of interest that reflect their perceptions of a relevance of each word in the list of relevant words to the subject of interest; and
calculating and storing a membership grade for each word in the list of relevant words, including grouping the confidence values for one word of the list of relevant words into a plurality of clusters according to a predetermined formula, determining a mean of the confidence values for said one word, and determining for each of said clusters a mean of the confidence values in each of said clusters.
10. The method according to claim 9, wherein the step of determining a mean of the confidence values for said one word is according to ##EQU26## the step of determining a mean of the confidence values in each of said clusters is according to ##EQU27## and the step of calculating a membership grade is according to ##EQU28## where i designates a cluster, a_ij is the number of confidence values in the cluster having value x_ij, m is the number of clusters, x is a parameter based on a field of use of the method, N_i is the number of confidence values in cluster i, and N is the number of all the confidence values for said one word.
11. The method according to claim 9, further comprising a computer executed step of calculating a clustered fuzzy expected value of each text record of the database based on fuzzy logic that incorporates the membership grade of each word in the text record.
12. The method according to claim 11, further comprising a computer executed step of retrieving a text record of the database having a clustered fuzzy expected value within a predetermined range.
13. The method according to claim 11, wherein the step of calculating a clustered fuzzy expected value of each text record is based on the formula ##EQU29## for records containing two words, where w_i and w_j are membership grades of the two words; k is a constant greater than zero; l is a value greater than zero and less than one; and c is equal to (l-m)/m where m is a mean of w_i and w_j.
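The word-list and record-level rules in the claims can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names, the regular-expression tokenizer, the sample records, and the membership grades are all hypothetical. Only the zero-word rule (claim 4), the one-word rule (claim 5), and threshold retrieval (claim 8) are taken from the text. The formulas for two or more words (##EQU22##, ##EQU23##) are not reproduced here, so the caller must supply a `combine` function; the example passes the arithmetic mean purely as a stand-in.

```python
import re
from statistics import mean

WORD = re.compile(r"[a-z']+")  # hypothetical tokenizer, not from the patent


def unique_words(records):
    """Return the sorted list of distinct lowercase words in all records."""
    words = set()
    for text in records:
        words.update(WORD.findall(text.lower()))
    return sorted(words)


def record_value(text, grades, combine):
    """Score one record from the membership grades of its relevant words.

    Zero relevant words -> 0 (claim 4); one relevant word -> that word's
    grade (claim 5). For two or more words the patent's formulas are not
    available in this text, so `combine` is a caller-supplied stand-in.
    """
    # dict.fromkeys keeps each relevant word once, in order of appearance
    present = dict.fromkeys(w for w in WORD.findall(text.lower()) if w in grades)
    found = [grades[w] for w in present]
    if not found:
        return 0.0
    if len(found) == 1:
        return found[0]
    return combine(found)


def retrieve(records, grades, combine, threshold):
    """Return the records whose value exceeds the threshold (claim 8)."""
    return [r for r in records if record_value(r, grades, combine) > threshold]


# Illustrative data only: records and grades are invented for this sketch.
records = [
    "Fixed logic error in adder",
    "Updated release notes",
    "Minor bug in docs",
    "Timing error and logic bug in cache",
]
grades = {"error": 0.9, "logic": 0.7, "bug": 0.8}

values = [record_value(r, grades, mean) for r in records]
selected = retrieve(records, grades, mean, threshold=0.5)
```

Passing `combine` as a parameter keeps the sketch honest about what the text does and does not specify: swapping in the real two-word and multi-word formulas would not change the surrounding logic.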

RELATED APPLICATION

This application claims the priority of and is a continuation in part of the following application: U.S. patent application Ser. No. 07/701,558, filed May 16, 1991, entitled "A FUZZY REASONING DATA BASE QUESTION ANSWERING SYSTEM", of S. Vassiliadis et al., now abandoned. The related application and this application are assigned to International Business Machines Corporation of Armonk, New York. The related application is incorporated by reference.

The present invention relates to a tool for computing the most typical fuzzy expected value of a membership function in a fuzzy set. The Clustering Fuzzy Expected Value (CFEV) System deals with the most typical value of a set of numbers, called membership values, which represent the membership of an element in a fuzzy set. Given a set of membership values, the CFEV system produces a single membership value that is most typical of the set, in the sense that it represents the majority of the membership values but also considers the minority of the membership values in the set.

In making available this tool we include for the purpose of background discussion the following publications that relate to similar tools such as the Fuzzy Expected Value (FEV) and the Weighted Fuzzy Expected Value (WFEV).

A. Kandel introduced a new method of computing the most representative value of a fuzzy set, called the Fuzzy Expected Value (FEV). The FEV is described as "a new quantity that more than the mean value or the median, would indicate the most typical grade of membership of a given fuzzy set". Consider a fuzzy relation defined over the universe X which yields a fuzzy set A with a membership function X_A that satisfies

Consider a fuzzy set which consists of four observations as summarized in the following table:

For the example above, δ becomes:

M. Friedman, M. Schneider and A. Kandel observed that the FEV "may occasionally generate improper results" and introduced the Weighted Fuzzy Expected Value (WFEV), defined as follows: Let ω(x) be a non-negative monotonically decreasing function defined over the interval [0,1] and λ a real number greater than 1. The solution s of ##EQU3## is the WFEV and is denoted WFEV(ω,λ). Here ω(x) = e^(-βx) with β > 0, λ is a real number greater than 1, n is the number of values in a set, and s is the weighted fuzzy expected value of order λ. The following example illustrates the use of WFEV which we have invented. Consider the following example, taken from Friedman's description:

i	μ_i	n_i	p_i
1	0.125	7	0.07
2	0.375	19	0.19
3	0.625	31	0.31
4	0.875	43	0.43

For this example, μ(x) = {0.125, 0.375, 0.625, 0.875} and p(x) = {0.07, 0.19, 0.31, 0.43}. To compute the WFEV, values for the parameters λ and β must be chosen. Choosing λ=1, β=2 as suggested by Friedman, Equation (2) converges after 4-5 iterations to s=0.745.

Since 74% of the population answered either 0.625 or 0.875, the most typical value is expected to fall somewhere between 0.625 and 0.875. Indeed, all methods give an answer in this interval: FEV=0.625, mean=0.65, median=0.625, WFEV=0.745. What is, however, the best value? M. Friedman, M. Schneider and A. Kandel argue that "WFEV is clearly the best choice" because it falls in the interval (0.625, 0.875) and considers the `few` elements that fall outside that interval. But the `few` elements outside the interval (0.625, 0.875) are 26% of all observations. Since 26% of the population is outside the interval (0.625, 0.875), the answer should be between the mean of (0.625, 0.875) and 0.625 to reflect the fact that 26% of the data disagree with the biggest groups.
The mean respects this observation and by no means can it be discarded as non-typical.

However, while we have invented the above, there are situations in which the prior development fails to produce a completely satisfactory answer. The following example illustrates a case where WFEV clearly fails to produce a satisfactory answer.

WFEV is insensitive to big changes to small groups

i	μ_i	n_i	p_i
1	0.60	10	0.20
2	0.95	40	0.80

If we assume that example #2 is our knowledge base and that WFEV=0.745 is a good answer, then we need to choose λ, β such that the desired result for example 1 is obtained. If the WFEV algorithm is applied to example 1, the following is obtained:

Both sets of values for λ, β yield almost the same results for the example under consideration, and there may be other sets of values that yield the same answer as well. Let us now consider another example, which is shown below, and apply the two sets of values found above to calculate WFEV.

A consequence of the above discussion is that it is of interest to develop a tool that produces a result close to the majority of the data, but also considers the other membership grades of a fuzzy set. The tool should be insensitive to small variations in the data, and it should recognize groups of data whose opinions are close enough to each other to be represented by one group.

In the following sections, we provide a brief description of the Clustering Fuzzy Expected Value (CFEV) algorithm. The clustering methods are then discussed in detail. Following that, the computation of CFEV is presented and its behavior is analyzed. Finally, the performance of CFEV is compared with other methods such as mean, median, FEV and WFEV, and the superiority of the CFEV is shown by means of examples.
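The figures quoted for the first example (the four-row table with counts 7, 19, 31 and 43) can be checked with a short Python sketch. The mean, median and 74% share follow directly from the table. The WFEV part rests on assumptions: the defining equation (##EQU3##) is not reproduced in this text, so the sketch assumes the fixed-point form s = Σ ω(|μ_i − s|) p_i^λ μ_i / Σ ω(|μ_i − s|) p_i^λ with ω(x) = e^(-βx), and the function names are hypothetical. The parameters as printed (λ=1, β=2) do not satisfy the stated requirement λ > 1, so the sketch uses λ=2, β=1, under which this assumed form settles near the quoted 0.745.

```python
import math

# (membership grade, count) pairs from the four-row example table
observations = [(0.125, 7), (0.375, 19), (0.625, 31), (0.875, 43)]


def weighted_mean(obs):
    """Mean of the grades, each weighted by its count."""
    total = sum(n for _, n in obs)
    return sum(mu * n for mu, n in obs) / total


def expanded_median(obs):
    """Median of the sample obtained by repeating each grade n times."""
    values = sorted(mu for mu, n in obs for _ in range(n))
    mid = len(values) // 2
    if len(values) % 2:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2


def wfev(obs, lam, beta, tol=1e-6, max_iter=100):
    """Fixed-point iteration for an ASSUMED form of the WFEV equation.

    The patent's equation is elided in this text; the update below is an
    illustrative assumption, not a confirmed transcription.
    """
    total = sum(n for _, n in obs)
    s = weighted_mean(obs)  # starting guess
    for _ in range(max_iter):
        weights = [
            math.exp(-beta * abs(mu - s)) * (n / total) ** lam for mu, n in obs
        ]
        s_new = sum(w * mu for w, (mu, _) in zip(weights, obs)) / sum(weights)
        if abs(s_new - s) < tol:
            return s_new
        s = s_new
    return s


mean_value = weighted_mean(observations)
median_value = expanded_median(observations)
majority_share = (31 + 43) / sum(n for _, n in observations)
wfev_value = wfev(observations, lam=2, beta=1)
```

Starting the iteration from the mean is a convenience of this sketch, not something the text prescribes.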
CFEV is particularly important since we have found it improves even the performance of our FUZZY REASONING DATABASE QUESTION ANSWERING SYSTEM, which was first described in U.S. patent application Ser. No. 07/701,558, filed May 16, 1991, entitled "A FUZZY REASONING DATA BASE QUESTION ANSWERING SYSTEM", of S. Vassiliadis et al., and it may be used in our described system to provide improved results, as we will describe.

As background for question answering systems, we have explained that a question answering system is a system which deals with natural language statements and/or questions. The question answering system of the present invention uses fuzzy logic concepts. Fuzzy logic is described for example in Michael Smithson's work entitled Fuzzy Set Analysis for Behavioral and Social Sciences, published by Springer-Verlag, New York Inc., 1987. The FUZZY REASONING DATA BASE QUESTION ANSWERING SYSTEM relates in particular to database question answering problems that arise during the design of computer systems, and it is described in depth below.

During the different design, test, and release phases of the development of a computer system, a number of databases, in the form of libraries, are developed and maintained for a variety of purposes such as error tracking and bookkeeping. To understand and improve the development process, previously developed databases may be used at a later date as representative of the entire process or of part of it. We have analyzed these with the intent to develop algorithms and tools for future use. If a database has been developed with a particular purpose in mind, then it can be used for future studies in its entirety because the objective of the database was specified a priori.
For example, if a library has been developed to report the functional errors discovered during the hardware design of a system, then such a library can be used in its entirety at the end of the development for a study concerning functional errors in hardware design. However, on a number of occasions, it may be the case that a developed database needs to be used for a different purpose than originally anticipated. Such a necessity may arise for a variety of reasons including unanticipated studies. For example, a library that is created at or with the beginning of the development cycle for bookkeeping purposes contains information regarding the history of the logic design, and possibly it may contain information related to functional testing studies. If it is assumed that the database is accessed for routine bookkeeping functions, then when a functional error is discovered it would be desirable to be able to correct the error. Additionally, assuming its applicability is granted, such a database may be considered as more representative for functional error studies than an error tracking library, if the latter has been developed at the integration phase of the system development. When a database is suspected of containing pertinent information to a process, it is possible to presume that the entire data base is pertinent to the intended application. An example of this can be found in An Application of Cyclomatic Complexity Metrics to Microcode by Timothy McNamara, TR 01.A517, IBM Corporation, Systems Product Division, Endicott, N.Y. However, this presumption may not be a good choice in most circumstances because such a database was not developed to accommodate the application of interest, and it may result in erroneous conclusions regarding the development process. In essence, it is advisable to investigate the suspicion regarding the relevancy of the database to an intended application. 
In order to assess whether a database contains information pertinent to a subject of interest, it would be desirable to have a tool that provides the capability to assess the validity of the decision, and when it is found that the library is pertinent to the subject of interest, to exclude irrelevant library entries. We know of no tool like the one we developed which provides the appropriate functions.

Since databases which emerge during the development of a computer system generally contain "comments", the previous database issues for such systems can be addressed by the examination of such comments. A validation methodology could be developed using probability theory; however, this approach may not be the most appropriate. This is because the validation of the database must be carried out from the comments of the database, which are written in a natural language, through the use of some form of common reasoning. A consequence of the previous statements is that a probability approach may not be the most appropriate for this type of application because, as indicated by G. Klir in his paper "Is There More to Uncertainty Than Some Probability Theorists Might Have Us Believe?", published in the International Journal of General Systems, Vol. 15, No. 4, April 1989:

A different approach to the validation of a commented database might be to develop a natural language question answering system. While it may be entirely possible to develop such a system, possibly with use of fuzzy relations, such a solution may not be the most advantageous for a number of reasons, including the following:

As a consequence of the examination we made of possibilities as just outlined, we believe there is an apparent unfulfilled need for a new tool that will expedite and facilitate the evaluation of a commented database with minimal effort. Such a tool should be able to be applied in a variety of circumstances with minimum additional development effort in its use.
Such a tool will most certainly allow for more time to be exerted on the analysis of the database rather than the assessment of its applicability. Moreover, as we conceive of it, if it is assessed that the system is not accurate enough to guarantee a reasonable exclusion of comments, the system could then be used as an indicator of compliance of a database to a pre-specified application, and its capabilities may be extended with the utilization of a natural language question answering system to further investigate the relevancy of the database. If such a tool had existed, one having the capabilities of the tool we have created, then a natural language question answering system need be implemented only in the case when a more accurate analysis is required. Consequently, the development of such a natural language question answering system will take place only when it is needed rather than a priori, as it would be without our tool. The solution which we provide is a question answering system based on fuzzy logic. The tool provides a quick assessment of the applicability of a database to a specified universe of discussion and the exclusion of the irrelevant comments of the database. In making available this new tool, we have employed for the purpose of background discussion in our detailed description which follows a few publications, including those mentioned at the beginning of this application and the following: In the following sections, we provide a brief description and an intuitive justification and reasoning for the development of the fuzzy question answering system. The concept of degree of confidence, as it relates to words and comments within a database, is then formally defined and formulated. In the subsequent discussion, the fuzzy evaluator algorithm is described, and its capabilities are discussed. 
A tool for computing a Clustering Fuzzy Expected Value (CFEV) is provided for generating a fuzzy expected value which is close to the majority of the data examined in a set of data, but also considers the membership grades of the small populations in a fuzzy set. The new tool may be used to substitute the FEV or WFEV when a more representative result is required, or in cases where the first two fail to produce a typical result, as discussed in the previous section of this invention. In brief, the system can be described as a three-phase process: First, the elements of a fuzzy set are clustered into separate groups, such that the membership grades in the same cluster are closely associated. The desired degree of association is given by the user in two parameters, d, s. Then the system computes the mean of the entire set and the mean of each cluster. In the third phase the system computes the CFEV value of the set. We have provided a new tool for answering questions dealing with databases written in a spoken language which can use the CFEV. We call the tool a Fuzzy Reasoning Database Question Answering System based upon fuzzy logic. The new tool may be used as a substitute for a natural language question answering system for the evaluation of the relevancy of a database with respect to some subject of interest. The proposed system provides the capability to assess whether a database contains information pertinent to a subject of interest, by evaluating each comment in the database via a fuzzy evaluator which attributes a fuzzy membership value indicating its relationship to a particular subject. An assessment is provided for the database as a whole regarding its pertinence to the subject of interest, and consequently comments that are considered as irrelevant to the subject may be discarded. In brief, the tool may be described as a three-phase process. 
In the first phase, the database preprocess, the comment fields of the database are examined to find the words that may, in general, be used to describe a particular subject in either an affirmative or a negative way. These words are then, in the second phase, assigned a confidence value reflecting how they are most likely used in the database by personnel participating in the creation of the database or persons familiar with the subject. In the third phase, each comment of the database is evaluated using the words from phase one and their corresponding confidence values from phase two. The result of this process is a vector of numbers, one for each record, indicating the membership grade of the records to a particular set, the subject of interest. Based on this confidence vector, a decision may be made as to the pertinence of the database to a subject of interest. If the confidence vector indicates that a sufficient percentage of the records possess a strong association with the subject of interest, then the tool may extract these records for further use in the study of a particular subject. Our new system is described in the following detailed description and specifically with reference to the appended drawings which are described below.
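The first two CFEV phases described above (clustering the membership grades, then computing the mean of the whole set and of each cluster) can be sketched in Python. The clustering rule here is a hypothetical stand-in: the text says the degree of association is set by two user parameters, d and s, but does not reproduce the predetermined formula, so the sketch simply starts a new cluster wherever the gap between consecutive sorted grades exceeds an assumed `max_gap`. The third phase is omitted because the CFEV formula (##EQU21##) is not available in this text. The sketch reports only the two quantities the abstract says each cluster contributes through: its relative size and its difference from the overall mean. The data is the two-row example (10 responses at 0.60, 40 at 0.95).

```python
from statistics import mean


def cluster_grades(grades, max_gap):
    """Group sorted grades; a gap larger than max_gap starts a new cluster.

    This gap rule is an assumption for illustration, not the patent's
    clustering formula (which depends on the parameters d and s).
    """
    ordered = sorted(grades)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for grade in ordered[1:]:
        if grade - clusters[-1][-1] > max_gap:
            clusters.append([grade])
        else:
            clusters[-1].append(grade)
    return clusters


def cluster_summary(grades, max_gap):
    """Return the overall mean and, per cluster, its mean, share and offset."""
    overall = mean(grades)
    total = len(grades)
    summary = []
    for cluster in cluster_grades(grades, max_gap):
        cluster_mean = mean(cluster)
        summary.append(
            {
                "mean": cluster_mean,                # cluster mean
                "share": len(cluster) / total,       # relative size
                "offset": cluster_mean - overall,    # difference from overall mean
            }
        )
    return overall, summary


# Two-row example: 10 responses at 0.60 and 40 responses at 0.95
grades = [0.60] * 10 + [0.95] * 40
overall, summary = cluster_summary(grades, max_gap=0.1)
```

With this data the two groups fall into separate clusters, so the small group's offset from the overall mean stays visible instead of being averaged away.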

US Referenced Citations (3)

Number	Name	Date
5020019	Ogawa	May 1991
5140692	Morita	Aug 1992
5168565	Morita	Dec 1992

Non-Patent Literature Citations (3)

Entry
Sachs, W. M., "An Approach to Associative Retrieval through the Theory of Fuzzy Sets," J. American Soc. for Info. Science, 1976, 85-87.
Salton et al., Intro-to Modern Info. Retrieval, McGraw-Hill Book Co., 1983, 59-75, 421-426.
Kamel et al., "Fuzzy Query Processing Using Clustering Techniques," Info. Processing and Management, 1990, 279-293.

Continuation in Parts (1)

	Number	Date	Country
Parent	701558	May 1991
