DESCRIPTION (provided by applicant): We propose to provide clustering software for very large databases and for categorical data. The implemented methods would support research on biological applications. Clinical databases are particularly interesting since they contain a variety of heterogeneous information, including images, medical history, symptoms, and test results. Clustering, or unsupervised classification, has been used in genetics research [BDSY99, ESBB98, GLDZ00, HSMLK00, MCA+98], protein classification [SF92, SM94], psychiatric research [Mez78], analysis of biomedical signals [Aka00], segmentation of medical images [CHG+94], etc. In many such problems there is little prior knowledge available about the data, and the data analyst can make only a few assumptions about it. In such circumstances cluster analysis allows the analyst to explore relationships among the data points and to assess their structure. In Phase I we will focus on the analysis of user and software requirements and on the implementation of one method for clustering large datasets and two methods for clustering categorical data. We will also prototype novel visualization tools for exploring the results of clustering. We will evaluate the software using data from the biomedical domain. In Phase II we will implement additional scalable clustering algorithms and integrate the methods implemented in Phase I with the IMiner software. The resulting software will be flexible and easy to use, which should enable the analysis and understanding of data from a wide range of applications. The software will be part of an integrated environment for data analysis, and it will permit customization of the clustering process.