Accession Number : ADA191446

Title :   Tuning a Major Part of a Clustering Algorithm.

Descriptive Note : Technical rept.,

Corporate Author : PRINCETON UNIV NJ DEPT OF STATISTICS

Personal Author(s) : Hansen, Katherine M ; Tukey, John W

PDF Url : ADA191446

Report Date : Feb 1988

Pagination or Media Count : 64

Abstract : Most proposals for clustering algorithms have been based on introspection Few proposed algorithms have had their performance studied. This approach involves (a) striving to avoid comparing distances on remote parts of the data (because metrics deserve only minimum trust), and (b) using a stochastically defined test bed to measure, and where possible understand, the performance of an evolving algorithm, with the intent of using our understanding to modify it in such a way as to improve its performance. This test bed involves 3 circular Gaussian samples, of size 50 each, centered at the vertices of an equilateral triangle of side t(sigma). Its use assumes that a 3-group answer is being sought. Thus the only concerned is with a part of the clustering process. Our early algorithms begin to misbehave in the range 5 or = t or = 7. Our successive steps of improvement work at smaller and smaller t. The last version we have tried still performs usefully (median misclassification about 16%) at t = 2.7, where knowledge of three populations would only let us hold misclassification to a median of 13.3%. Comparison with a Gaussian maximum likelihood algorithm on the same set of triple samples shows only slightly better performance than for our algorithm.

Descriptors :   *ALGORITHMS, *CLUSTERING, MAXIMUM LIKELIHOOD ESTIMATION, TEST BEDS, TUNING, POPULATION(MATHEMATICS)

Subject Categories : Statistics and Probability

Distribution Statement : APPROVED FOR PUBLIC RELEASE