Burrows‐Wheeler post‐transformation with effective clustering and interpolative coding
Arto Niemi; Jukka Teuhola
https://urn.fi/URN:NBN:fi-fe2021042612504
Tiivistelmä
Lossless compression methods based on the Burrows‐Wheeler transform
(BWT) are regarded as an excellent compromise between speed and
compression efficiency: they provide compression rates close to the PPM
algorithms, with the speed of dictionary‐based methods. Instead of the
laborious statistics‐gathering process used in PPM, the BWT reversibly
sorts the input symbols, using as the sort key as many following
characters as necessary to make the sort unique. Characters occurring in
similar contexts are sorted close together, resulting in a clustered
symbol sequence. Run‐length encoding and Move‐to‐Front (MTF) recoding,
combined with a statistical Huffman or arithmetic coder, is then
typically used to exploit the clustering. A drawback of the MTF recoding
is that knowledge of the character that produced the MTF number is
lost. In this paper, we present a new, competitive Burrows‐Wheeler
posttransform stage that takes advantage of interpolative coding—a fast
binary encoding method for integer sequences, being able to exploit
clusters without requiring explicit statistics. We introduce a fast and
simple way to retain knowledge of the run characters during the MTF
recoding and use this to improve the clustering of MTF numbers and
run‐lengths by applying reversible, stable sorting, with the run
characters as sort keys, achieving significant improvement in the
compression rate, as shown here by experiments on common text corpora.
Kokoelmat
- Rinnakkaistallenteet [19207]