Accession Number : ADA309488

Title : Evaluation of Sampling for Data Mining of Association Rules.

Descriptive Note : Technical rept.,

Corporate Author : ROCHESTER UNIV NY DEPT OF COMPUTER SCIENCE

Personal Author(s) : Zaki, Mohammed J. ; Parthasarathy, Srinivasan ; Li, Wei ; Ogihara, Mitsunori

PDF Url : ADA309488

Report Date : MAY 1996

Pagination or Media Count : 17

Abstract : Data mining is an emerging research area whose goal is to extract significant patterns or interesting rules from large databases. High-level inference from large volumes of routine business data can provide valuable information to businesses, such as customer buying patterns, shelving criteria in supermarkets, and stock trends. However, many algorithms proposed for data mining of association rules make repeated passes over the database to determine the commonly occurring itemsets (sets of items). For large databases, the I/O overhead of scanning the database can be extremely high. In this paper we show that random sampling of transactions in the database is an effective method for finding association rules. Sampling can speed up the mining process by more than an order of magnitude by reducing I/O costs and drastically shrinking the number of transactions to be considered. It may also make it possible to keep the sampled database resident in main memory. Furthermore, we show that sampling can accurately represent the data patterns in the database with high confidence. We experimentally evaluate the effectiveness of sampling on three databases.
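
Illustrative note : The idea described in the abstract, estimating itemset frequencies from a random sample of transactions rather than from the full database, can be sketched roughly as follows. This is a minimal sketch and not the report's algorithm: the function names (itemset_support, frequent_itemsets, sample_transactions), the choice of sampling without replacement, the fixed seed, and the limit to itemsets of at most two items are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def itemset_support(transactions, max_size=2):
    """Return {itemset: fraction of transactions containing it}.

    Itemsets are sorted tuples of up to max_size items (an assumed
    limit that keeps this brute-force count small).
    """
    n = len(transactions)
    if n == 0:
        return {}
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # ignore duplicate items within a transaction
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {itemset: c / n for itemset, c in counts.items()}


def frequent_itemsets(transactions, min_support, max_size=2):
    """Keep only itemsets whose support is at least min_support."""
    support = itemset_support(transactions, max_size)
    return {s: v for s, v in support.items() if v >= min_support}


def sample_transactions(transactions, fraction, seed=0):
    """Draw a random sample without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    k = max(1, int(len(transactions) * fraction))
    return rng.sample(transactions, k)


if __name__ == "__main__":
    # Tiny hand-made database, used only to show the calls.
    db = [["a", "b"], ["a"], ["b", "c"], ["a", "b", "c"]]
    full = frequent_itemsets(db, min_support=0.5)
    sampled = frequent_itemsets(sample_transactions(db, 0.5), min_support=0.5)
    print(sorted(full))
    print(sorted(sampled))
```

Comparing the itemsets found in the sample with those found in the full database gives a simple empirical check of how well a given sample size preserves the frequent patterns; the cost of the counting pass scales with the number of sampled transactions rather than with the size of the database.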

Descriptors : *DATA BASES, *STATISTICAL SAMPLES, ALGORITHMS, OPTIMIZATION, DATA MANAGEMENT, PROBABILITY DISTRIBUTION FUNCTIONS, RANDOM VARIABLES, STATISTICAL INFERENCE, STATISTICAL DATA, ACCURACY, LEARNING MACHINES, INPUT OUTPUT PROCESSING, RULE BASED SYSTEMS, PATTERN RECOGNITION, SYSTEMS ANALYSIS, BERNOULLI DISTRIBUTION.

Subject Categories : Statistics and Probability
      Computer Systems

Distribution Statement : APPROVED FOR PUBLIC RELEASE