Attribute Vector Coding
Attribute vector coding is the first vector transform method for compressing multidimensional database tables. It is tuple-oriented and semantic. It breaks new ground by capturing and exploiting the innumerable interrelationships, functional dependencies, and statistical correlations in data without having to solve the intractable problem of explicitly identifying and defining them. It achieves unequaled compression because it systematically models data at the highest levels of abstraction, across dimensions and data types. That makes it far less subject than conventional methods to compressibility limits imposed by information theory.
Attribute vector coding recovers an empirical understanding of the processes that created the data and then strives to emulate those processes through functions embodied in software and codified in data. Those functions, together with block-decomposable degenerated canonical Huffman codes that encode them into symbols, decorrelate across dimensions and data types simultaneously.
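The encoding step above builds on canonical Huffman coding. As a point of reference, a minimal sketch of plain canonical code assignment (illustrative only, not the degenerated, block-decomposable variant described here) might look like:

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codewords given per-symbol code lengths.

    Symbols of equal length receive numerically consecutive codes, and
    shorter codes numerically precede longer ones; this convention lets a
    decoder reconstruct the whole code from the lengths alone.
    """
    # Canonical convention: order symbols by (code length, symbol value).
    order = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code, prev_len = {}, 0, 0
    for sym in order:
        code <<= lengths[sym] - prev_len   # extend the code to the new length
        codes[sym] = format(code, f"0{lengths[sym]}b")
        code += 1
        prev_len = lengths[sym]
    return codes
```

For example, lengths {a:1, b:2, c:3, d:3} yield the codes 0, 10, 110, 111; only the lengths need to be stored or transmitted.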
Beyond entropy is a compound lexical method founded on two new design principles. It can compress length-constrained text fields below the entropy of the original data model. The technique synthesizes a data model that is representationally equivalent to the original but operates at lower entropy. It builds lexicons from the input symbols, warms one of them by exploiting latent symbol substructure, and assigns the symbols entropy codes using the revised statistics.
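As background for the final step, an entropy coder derives its code lengths from symbol statistics. A minimal sketch of computing ideal code lengths from an empirical model (illustrative background only, not the beyond-entropy construction itself):

```python
import math
from collections import Counter

def entropy_code_lengths(symbols):
    """Ideal entropy-code length per symbol, l(s) = -log2 p(s),
    computed from the empirical statistics of `symbols`.
    An entropy coder assigns codewords close to these lengths."""
    counts = Counter(symbols)
    n = len(symbols)
    return {s: -math.log2(c / n) for s, c in counts.items()}

# Revising the statistics (for example, after exposing latent symbol
# substructure) changes these lengths, and with them the coded size.
```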
Beyond entropy reveals the presence of a technology gap between the art of data modeling and the science of information theory, the value of new principles that bridge that gap, and the benefit of exploiting latent substructure in complex data symbols. It also provides new insight into how symbols can represent objects, and into how entropy differentials can exist between the processes that generate real-world data and the generative data models that emulate them.
Repopulation is a structural method for compressing the common configuration of access data structures consisting of separate hash and collision tables, and handling collisions (sequences of database record numbers) through external chaining. It populates hash table locations that would otherwise be empty with parts of collision strings that would otherwise occupy memory.
Unlike almost every other compression method, repopulation is not a replacement scheme. Instead, repopulation is permutational and mechanistic; it works like a chess-playing automaton. It draws on no information-theoretic concepts. Repopulation simultaneously achieves the access speed of a low hash table load factor and the table compactness of a high one, thus avoiding that historical compromise.
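For reference, the baseline configuration that repopulation operates on, separate hash and collision tables with external chaining, can be sketched as follows. This is a simplified illustration; the `ChainedIndex` name and layout are assumptions, and the repopulation permutation itself is omitted:

```python
class ChainedIndex:
    """Separate hash and collision tables with external chaining.

    The hash table holds, per slot, the head of a collision chain; the
    collision table holds (record_number, next) links.  Repopulation
    compacts this baseline by moving chain entries into hash-table slots
    that would otherwise sit empty (that step is not shown here).
    """

    def __init__(self, size):
        self.hash_table = [None] * size   # slot -> index into collisions
        self.collisions = []              # entries: (record_number, next)

    def insert(self, key, record_number):
        slot = hash(key) % len(self.hash_table)
        # Prepend the record to the slot's collision chain.
        self.collisions.append((record_number, self.hash_table[slot]))
        self.hash_table[slot] = len(self.collisions) - 1

    def lookup(self, key):
        """Return the collision string (record numbers) for key's slot."""
        slot = hash(key) % len(self.hash_table)
        i, records = self.hash_table[slot], []
        while i is not None:
            record, i = self.collisions[i]
            records.append(record)
        return records
```

At a low load factor most hash-table slots are empty yet still occupy memory; repopulation reclaims that space without lengthening the chains.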
Superpopulation is a variable-to-variable-length algorithm that compresses index tables, lists, arrays, zerotrees, and similar data. It systematically accommodates wide local variations in data statistics. Superpopulation may be used by itself or in conjunction with repopulation.
Superpopulation recognizes that distributions of values in access data structures are often far from random. They can be highly correlated due to the nature of the processes that generated the data, and often have areas of high and low correlation. In a typical index table application, superpopulation decomposes each collision string into a sequence of adjacent substrings, each classified as one of two distinct target types or as an incompressible substring. Each target is then compressed using one of two target type-specific encoding methods.
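The decomposition step can be sketched as follows, using two illustrative target types, constant runs and consecutive ascending runs. The target types, the `min_run` threshold, and the tuple layout are assumptions for illustration, not the method's actual classes:

```python
def decompose(values, min_run=3):
    """Split a collision string into maximal substrings of two illustrative
    target types (constant runs and consecutive ascending runs), leaving
    everything else as incompressible literals."""
    out, i, n = [], 0, len(values)
    while i < n:
        # Target type 1: a run of identical values.
        j = i + 1
        while j < n and values[j] == values[i]:
            j += 1
        if j - i >= min_run:
            out.append(("constant", values[i], j - i))
            i = j
            continue
        # Target type 2: a run of consecutive ascending values.
        j = i + 1
        while j < n and values[j] == values[j - 1] + 1:
            j += 1
        if j - i >= min_run:
            out.append(("ascending", values[i], j - i))
            i = j
            continue
        # Incompressible: emit the value as a literal.
        out.append(("literal", values[i], 1))
        i += 1
    return out
```

Each (type, start, length) target would then be handed to a type-specific encoder, so highly correlated areas compress well while incompressible areas pass through.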
Wordencoding is a 0-order (context-independent) variable-to-variable-length algorithm for compressing text strings, hereinafter called words, in database table record fields. It achieves compression close to the 0-order source entropy without sacrificing speed. It does that by providing an efficient way to maximize effective combined data locality over three areas: the compressed record fields, the lexicons holding the words, and their access data structures. Wordencoding deals explicitly with the structure and statistics of the data by recognizing that redundancy in text strings exists in different forms at different levels of granularity.
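A minimal sketch of the lexicon side of such a scheme, in which record fields store indices into a shared lexicon of words (the `WordEncoder` name and layout are hypothetical, and the access data structure and entropy coding of the indices are omitted):

```python
class WordEncoder:
    """Map words to indices into a shared lexicon and back.

    Repeated words share one lexicon entry, so compressed record fields
    hold short indices while each distinct word is stored only once.
    """

    def __init__(self):
        self.lexicon = []   # index -> word
        self.index = {}     # word -> index

    def encode(self, word):
        if word not in self.index:
            self.index[word] = len(self.lexicon)
            self.lexicon.append(word)
        return self.index[word]

    def decode(self, i):
        return self.lexicon[i]
```

In a full implementation the indices would in turn be entropy coded, and the lexicon laid out for locality alongside its access data structures.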