Compression by extensible motifs

Traditionally, the design of codebooks used in compression proceeds from specifications that are either statistical or syntactic. The quintessential statistical approach is represented by Huffman codes, in which symbols are ranked according to their frequencies and then assigned, in order of decreasing probability, to longer and longer codewords. In a syntactic approach, the codebook is built out of patterns that display certain features, e.g., robustness in the face of noise, loss of synchronization, etc. The focal point in these developments is the structure of the codewords. For instance, a codeword is a pattern $w$ of length $m$ such that any other codeword must be at distance at least $d$ from $w$, the distance being measured in terms of errors of a certain type: substitutions only in the Hamming variant; substitutions, insertions, and deletions in the Levenshtein variant; and so on. Of course, the two aspects blend in the final code. With Huffman codes, for instance, once the characters are statistically ranked, a code with certain syntactic characteristics, notably the prefix property, is built. Likewise, once the codebook of an error-correcting code is designed, the statistics of the source are taken into account for encoding. However, these two stages are, as a rule, carried out somewhat independently.

The notion of a motif that we adopt tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. This supports a notion of saturation that finds natural use in the dual contexts of structural inference and compression. As said, this saturation condition mandates that motifs that could be made more specific without altering their set of occurrences bear no interest and may be discarded.

In this Section, we present lossy off-line data compression techniques by textual substitution in which the patterns used in compression are chosen among the extensible motifs that are found to recur in the textstring with a minimum pre-specified frequency. As mentioned, motif discovery and motif-driven parses of various kinds have been previously introduced and used in [5]; however, the motifs considered in those studies are "rigid". The transition from rigid to extensible motifs requires a complete restructuring of the combinatorial and computational tools for their extraction and implementation. Specifically, one needs:

• An algorithm for the extraction of extensible motifs.
• A criterion for choosing and encoding the motifs to be used in compression.
• A new suite of software programs implementing the whole.

The orchestration of these ingredients is briefly described next. We regard the motif discovery process as distributed over two stages, where the first stage unearths motifs endowed with a certain set of properties and the second implements them in the compression. The first part was dealt with in the preceding section. As with the rigid motifs in [5], the extensible ones presented here may be restored at the receiver using information about gap filling, to be transmitted separately. In images, for instance, a tremendous amount of compression is attained, albeit at a large loss, say 40% or so; yet simple predictors in the form of linear interpolation restore more than 95% of the original.
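By way of illustration, the following is a minimal sketch of such interpolation-based restoration, assuming a row of numeric samples (e.g., pixel intensities) in which the positions lost to don't cares are marked None; the function name and data layout are hypothetical and not part of the scheme described above.

```python
# Sketch: restoring don't-care positions left behind by a lossy motif-based
# parse, using linear interpolation between the nearest known neighbors.
# Assumption: gaps are marked None in a sequence of numeric samples.

def restore_gaps(samples):
    """Fill runs of None by linear interpolation between known neighbors."""
    out = list(samples)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:   # find the end of the gap run
                j += 1
            left = out[i - 1] if i > 0 else (out[j] if j < n else 0)
            right = out[j] if j < n else left
            span = j - i + 1
            for k in range(i, j):             # interpolate across the run
                t = (k - i + 1) / span
                out[k] = left + t * (right - left)
            i = j
        else:
            i += 1
    return out

# Example: a row of pixels with a two-sample gap introduced by don't cares.
print(restore_gaps([10, None, None, 40]))  # -> [10, 20.0, 30.0, 40]
```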
The methods presented here belong to a class of off-line textual-substitution methods that try to reap, through greedy approximation, the benefits of otherwise intractable optimal macro schemes [9]. The specific heuristic followed here is based on a greedy iterative selection (see, e.g., [10]), which consists of identifying and using, at each iteration, a substring $w$ of the text $x$ such that encoding all instances of $w$ in $x$ yields the highest possible contraction of $x$. This process may also be interpreted as learning a "straight-line" grammar of minimum description length for the sourcestring, for which we refer to [5,11,12] and references therein.

Off-line methods are not always practical and can be computationally imposing even in approximate variants. They do find use, however, in contexts and applications such as mass production of CD-ROMs, backup archiving, etc. (see, e.g., [13]). Paradigms of steepest descent approximation have delivered good performances in practice and also appear to be the best candidates in terms of the approximation achieved to optimum descriptor sizes [14].

Our steepest descent paradigm performs a number of phases, each consisting of the selection of the pattern to be used for compression followed by the actual substitution and encoding. The process stops when no further compression is achieved. The resulting sequence representation is finally pipelined into some of the popular encoders, and the best among the overall scores thus achieved is retained.

Clearly, at any stage it is impossible to choose the motif on the basis of the actual compression eventually conveyed by that motif. The decision must be based on an estimate that takes into account the mechanics of the encoding. In practice, we estimate at $\log i$ the number of bits needed to encode the integer $i$ (we refer to, e.g., [4] for reasons that legitimate this choice).

In one scheme [10], one eliminates all occurrences of $m$ and records in succession $m$, its length, and the total number of its occurrences, followed by the actual list of such occurrences. Let $|m|$ denote the length of $m$, $D_m$ the number of extensible characters in $m$, $f_m$ the number of occurrences of $m$ in the textstring, $s_m$ the number of characters covered by the motif $m$ in all of its occurrences on $s$, $|\Sigma|$ the cardinality of the alphabet, and $n$ the size of the input string; here $D$ denotes the maximum length allowed for a gap, so that each extensible character in an occurrence can be specified with $\log D$ bits. The compression brought about by $m$ is then estimated by subtracting from the $s_m \log|\Sigma|$ bits originally encumbered by this motif on $s$ the expression $|m|\log|\Sigma| + \log|m| + f_m D_m \log D + f_m \log n + \log f_m$ charged by the encoding, thereby obtaining:

$$G(m) = (s_m - |m|)\log|\Sigma| - \log|m| - f_m(D_m \log D + \log n) - \log f_m.$$

This is accompanied by a loss $L(m)$, represented by the total number of don't cares introduced by the motif, expressed as a fraction of the characters it covers. If $d_m$ is the total number of such gaps introduced across all of its occurrences, this is

$$L(m) = d_m / s_m.$$

Other encodings are possible (see, e.g., [10]). In one scheme, for example, every occurrence of the chosen pattern $m$ is substituted by a pointer to a common dictionary copy, and one bit must be added to distinguish original characters from pointers. The original encumbrance posed by $m$ on the text is in this case $(\log|\Sigma| + 1)\,s_m$, from which we subtract $|m|\log|\Sigma| + f_m D_m \log D + \log|m| + f_m(\log r + 1)$, where $r$ is the size of the dictionary, in itself a parameter to be either fixed a priori or estimated.
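To make the estimate concrete, here is a small sketch of the gain and loss computations and of the greedy phase structure described above; the motif objects carrying a precomputed gain attribute, the extract_motifs and encode callbacks, and the use of base-2 logarithms are assumptions made for illustration, not part of the original scheme.

```python
import math

def gain(s_m, m_len, f_m, D_m, D, sigma, n):
    """Estimated bits saved by encoding all occurrences of a motif m:
    G(m) = (s_m - |m|) log|Sigma| - log|m| - f_m (D_m log D + log n) - log f_m."""
    return ((s_m - m_len) * math.log2(sigma)
            - math.log2(m_len)
            - f_m * (D_m * math.log2(D) + math.log2(n))
            - math.log2(f_m))

def loss(d_m, s_m):
    """Don't cares introduced by the motif, as a fraction of characters covered."""
    return d_m / s_m

def compress(text, extract_motifs, encode):
    """Greedy steepest-descent loop: at each phase pick the candidate motif of
    highest estimated gain, substitute all its occurrences, and repeat until
    no candidate yields further compression."""
    while True:
        candidates = extract_motifs(text)                 # stage 1: discovery
        best = max(candidates, key=lambda m: m.gain, default=None)
        if best is None or best.gain <= 0:
            break                                         # no further contraction
        text = encode(text, best)                         # stage 2: substitution
    return text
```

Stopping as soon as the best estimated gain is nonpositive mirrors the halting rule above: the loop ends when no pattern is expected to contract the text any further.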