Accession Number : AD0824238

Title :   TEXT COMPRESSION OPTIMIZATION.

Descriptive Note : Final rept. Jan 66-Oct 67

Corporate Author : BOLT BERANEK AND NEWMAN INC CAMBRIDGE MA

Personal Author(s) : Brignetti, Mario C. ; Kahn, Robert E. ; Bjorkgren, David G. ; Bobrow, Daniel G.

Report Date : NOV 1967

Pagination or Media Count : 142

Abstract : The report describes research on optimization techniques for the compression of text. The effort concentrated on four areas: (a) evaluation of alternative text segmentation procedures on the basis of the compression efficiency they provide; (b) the variability in efficiency that occurs when codes designed to suit a particular sample of text are applied to other samples of text; (c) the automatic reduction of the size of an encoding set, and the prediction of the effects of such size reductions; (d) the applicability of text compression techniques to document descriptor files. The pertinent conclusions are: (a) the text segmentation procedure adopted in the earlier research appears to be very close to optimal; (b) performance degrades significantly when encoding texts other than the ones used to obtain the code; (c) the effects of size reduction on compression efficiency can be predicted quantitatively, independently of the way the reduction is made; (d) document descriptor files are compressible using the techniques described. In addition, research was conducted on the statistics of language and on rate distortion theory. Motivated by the results obtained on the problem of efficiency variability, we developed generative models for the statistics of taxonomies, which are shown to be consistent with the available data. The aim of the research on rate distortion theory is, roughly, to predict how much is lost by overcompressing the text. The results presented pertain to the basic theory, which is just beginning to be developed.
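
Illustrative note : Conclusion (b) of the abstract, the loss of efficiency when a code built from one text sample is applied to another, can be sketched as follows. This is a minimal illustration and not the report's method: it assumes a character-level Huffman code (the report works with text segments), the two sample strings are invented, and the 16-bit charge for symbols missing from the code (escape_bits) is an arbitrary assumption.

```python
import heapq
from collections import Counter


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    # The tie-breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def bits_per_symbol(text, lengths, escape_bits=16):
    """Average code length when text is encoded with the given code lengths.

    Symbols absent from the code are charged escape_bits, an assumed penalty.
    """
    return sum(lengths.get(ch, escape_bits) for ch in text) / len(text)


# Hypothetical samples: the code is designed on `train` and reused on `other`.
train = "the quick brown fox jumps over the lazy dog " * 20
other = "0123456789 zzzz qqqq xxxx jjjj " * 20

lengths = huffman_code_lengths(Counter(train))
matched = bits_per_symbol(train, lengths)
mismatched = bits_per_symbol(other, lengths)
print(f"matched: {matched:.2f} bits/char, mismatched: {mismatched:.2f} bits/char")
```

The mismatched figure is larger because the second sample is dominated by symbols that were rare or absent in the sample used to design the code, so they receive long codewords or the escape charge.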

Descriptors :   (*INFORMATION THEORY, OPTIMIZATION), SYMBOLS, INFORMATION RETRIEVAL, ALGORITHMS, CODING, MATHEMATICAL MODELS, LINGUISTICS, DATA PROCESSING.

Subject Categories : Information Science
      Linguistics
      Cybernetics

Distribution Statement : APPROVED FOR PUBLIC RELEASE