Accession Number : ADA307731

Title : A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization.

Descriptive Note : Research rept.

Corporate Author : CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER SCIENCE

Personal Author(s) : Joachims, Thorsten

Report Date : MAR 1996

Pagination or Media Count : 26

Abstract : A probabilistic analysis of the Rocchio relevance feedback algorithm, one of the most popular learning methods from information retrieval, is presented in a text categorization framework. The analysis results in a probabilistic version of the Rocchio classifier and offers an explanation for the TFIDF word weighting heuristic. The Rocchio classifier, its probabilistic variant and a standard naive Bayes classifier are compared on three text categorization tasks. The results suggest that the probabilistic algorithms are preferable to the heuristic Rocchio classifier.
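
Illustrative note : The abstract names the heuristic Rocchio classifier with TFIDF weighting but does not spell it out. The following is a minimal sketch of one common textbook formulation, not the report's own implementation. Everything in it is an assumption made for illustration: the weighting tf * log(N / df), the parameter defaults alpha=1.0 and beta=0.25, the cosine-similarity decision rule, the function names, and the toy documents.

```python
import math
from collections import Counter, defaultdict


def normalize(vec):
    """Scale a sparse vector (dict word -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {w: x / norm for w, x in vec.items()} if norm > 0 else dict(vec)


def tfidf_vectors(docs):
    """Turn tokenized documents into unit-length TFIDF vectors.

    Assumed weighting: term frequency times log(N / document frequency).
    Returns the vectors and the idf table needed to weight new documents.
    """
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = [normalize({w: tf * idf[w] for w, tf in Counter(d).items()})
            for d in docs]
    return vecs, idf


def train_rocchio(vecs, labels, alpha=1.0, beta=0.25):
    """Build one prototype per class: alpha * mean(positive examples)
    minus beta * mean(negative examples), with negative weights dropped."""
    protos = {}
    for c in set(labels):
        pos = [v for v, y in zip(vecs, labels) if y == c]
        neg = [v for v, y in zip(vecs, labels) if y != c]
        proto = defaultdict(float)
        for v in pos:
            for w, x in v.items():
                proto[w] += alpha * x / len(pos)
        for v in neg:
            for w, x in v.items():
                proto[w] -= beta * x / len(neg)
        protos[c] = normalize({w: x for w, x in proto.items() if x > 0})
    return protos


def classify(tokens, protos, idf):
    """Assign the class whose prototype has the highest cosine similarity.
    Words unseen in training are ignored."""
    vec = normalize({w: tf * idf[w]
                     for w, tf in Counter(tokens).items() if w in idf})

    def cosine(a, b):
        # Both vectors are unit length, so the dot product is the cosine.
        return sum(x * b.get(w, 0.0) for w, x in a.items())

    return max(protos, key=lambda c: cosine(vec, protos[c]))


# Toy data, invented for this sketch only.
docs = [
    ["ball", "goal", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "price"],
    ["price", "bank", "market"],
]
labels = ["sports", "sports", "finance", "finance"]

vecs, idf = tfidf_vectors(docs)
protos = train_rocchio(vecs, labels)
print(classify(["goal", "team", "team"], protos, idf))
print(classify(["market", "bank"], protos, idf))
```

The sketch covers only the heuristic baseline. The probabilistic variant and the naive Bayes classifier that the report compares against it are not reproduced here, since the abstract gives no details of them.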

Descriptors : *ALGORITHMS, *HEURISTIC METHODS, *TEXT PROCESSING, DATA BASES, MATHEMATICAL MODELS, AUTOMATION, PERFORMANCE(ENGINEERING), MAXIMUM LIKELIHOOD ESTIMATION, PROBABILITY DISTRIBUTION FUNCTIONS, RANDOM VARIABLES, ACCURACY, LEARNING MACHINES, RULE BASED SYSTEMS, WEIGHTING FUNCTIONS, FEEDBACK, INFORMATION RETRIEVAL, CLASSIFICATION, PATTERN RECOGNITION, SYSTEMS ANALYSIS, BAYES THEOREM, WORD RECOGNITION.

Subject Categories : Operations Research
      Cybernetics

Distribution Statement : APPROVED FOR PUBLIC RELEASE