Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved.
Permission is granted to make verbatim copies of this document for private
use only.
Table of contents
Introduction
Bwt
Recovering the string
The result of Bwt
Compressing it
Speed worst case
Closing words
Contacting the author
Introduction
Hi, here we learn a new compression-related algorithm. This one does not
compress the data directly; instead it transforms the data so that it can be
better compressed with other techniques. This is neither the first nor the
last article that deals with bwt, and maybe it isn't the best, but I have
always thought it is better to have three papers on a subject than one: in one
of them the explanation is good, another has more theory, and the third is
more practical. You know what I mean. ;-) I have tried to keep this article
easy to understand, with enough examples; however, if you feel something more
is needed, let me know. Bwt is rather new and it is a very good compression
scheme: its ratio is very good, but speed is its weak point, and the worst
case is very bad; of course there are tricks to speed it up. If you are not
sure whether to learn it, I'll tell you just one thing: its ratio is better
than pkzip's. Before starting with the article, I should mention that I have
written articles about the algorithms you need to successfully implement bwt:
Qsort,
mtf,
rle,
and static huffman. All of them are
available at my home page.
Bwt
What we'll do is a transformation of the data, over the whole data,
so first load the whole file; you can of course do this process
in little blocks of data (or big ones if you want). The first thing to
do is make N strings from it, where N is the length of the file or block
to transform; each string is rotated one position further than the previous
one. For example, let's take the string "bacba" and make its N rotated
strings:
String | Position
bacba  | 1
acbab  | 2
cbaba  | 3
babac  | 4
abacb  | 5
You see? Note that we do not actually make N rotated strings; we just make N pointers into the string, and that gives us the N rotated strings. But how do we deal with the end of the string? Quite easy: you just take the bytes from the start again:

bacbabacba
   babac   <- string 4

Now we sort those strings:
String | Position
abacb  | 5
acbab  | 2
babac  | 4
bacba  | 1
cbaba  | 3
Note that we keep track of the original position of each string.
For sorting you should use qsort, and note that
there's no need to treat the bytes as alphabet values, because ASCII is
already 'sorted': you just treat the bytes as bytes. :-D Only their value
matters.
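To make the pointer idea concrete, here is a small sketch (mine, not the article's code) in Python: instead of building N copies of the string, it sorts rotation start offsets, doubling the buffer to handle the wrap-around.

```python
s = b"bacba"
n = len(s)
doubled = s + s  # wrap-around: rotation i is doubled[i:i+n]

# sort the rotation *indices*; bytes compare correctly as plain values,
# so no alphabet mapping is needed
order = sorted(range(n), key=lambda i: doubled[i:i+n])

print(order)  # start offsets of the rotations in sorted order: [4, 1, 3, 0, 2]
```

With 0-based offsets this is exactly the sorted table above: abacb (4), acbab (1), babac (3), bacba (0), cbaba (2).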
Now notice that two columns are special, the first
and the last:
First | String | Last | Position
a     | abacb  | b    | 5
a     | acbab  | b    | 2
b     | babac  | c    | 4
b     | bacba  | a    | 1
c     | cbaba  | a    | 3
As you can see, both the first and the last column contain all the bytes of the original string. We must also know the position of our first string, the primary index: our first string was "acbab", so the primary index is 2 (or 1 when programming, counting from 0). Note that it usually doesn't stay in the same row. Also note that we don't save those strings to memory; there's no need, because we can get them through the pointers to the strings.

Now we can start outputting the file. To recover the file we just need the first column, the last column, and the position where the original string was. But, hey, wait one moment: if both columns, last and first, contain all the bytes, and the first one is sorted, we can sort the last column and get the first! So we only output the last column (the first column is for free :-) and the position of the first string. In our case this is:

Last column: bbcaa
Primary index: 2 (1 when programming)
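Putting the whole forward transform together, here is a sketch (an illustration of mine, not the article's implementation; the function name `bwt` is my own). The primary index follows the convention of the decoding walkthrough below: it is the 0-based row at which the decoding loop starts.

```python
def bwt(s: bytes):
    """Forward transform: return (last column, primary index).

    Sketch only: sorts all rotations the naive way, which is the
    slow part discussed later in the article.
    """
    n = len(s)
    # rows: start offsets of the rotations, in sorted order
    rows = sorted(range(n), key=lambda i: s[i:] + s[:i])
    # the last byte of rotation i is the byte just before offset i
    last = bytes(s[(i - 1) % n] for i in rows)
    # primary index: the row holding the rotation that starts at offset 1,
    # matching the convention used by the decoding loop below
    primary = rows.index(1 % n)
    return last, primary

print(bwt(b"bacba"))  # (b'bbcaa', 1)
```

On our example this reproduces the table above: last column "bbcaa", primary index 1 (0-based).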
Recovering the string
This may seem impossible, but it isn't. Having the first column, the
last column and the primary index, we can reconstruct the file. We have to
build a transformation vector that tells us in which position every byte was,
and that will let us reconstruct the file. We do the following: start a loop;
get a byte from last (last[counter]) and search for it in the first column;
keep its position in 'position'; then update the transformation vector at
position 'position' with 'counter'. Got the idea? Here it is, step by step:
Last | First | Vector |
[b] | a | ... |
b | a | ... |
c | [b] | 0 |
a | b | ... |
a | c | ... |
Last | First | Vector |
b | a | ... |
[b] | a | ... |
c | b | 0 |
a | [b] | 1 |
a | c | ... |
Last | First | Vector |
b | a | ... |
b | a | ... |
[c] | b | 0 |
a | b | 1 |
a | [c] | 2 |
Last | First | Vector |
b | [a] | 3 |
b | a | ... |
c | b | 0 |
[a] | b | 1 |
a | c | 2 |
Last | First | Vector |
b | a | 3 |
b | [a] | 4 |
c | b | 0 |
a | b | 1 |
[a] | c | 2 |
In case there are two or more equal bytes in the first column, you have to take the first occurrence the first time, the second occurrence the second time, and so on. To do so we keep a table of 256 entries, initialized to 0; every time a byte is read we increment its entry, and when we read the same byte again we add the number in the table to the position.

Also note that there's a fast way of doing this search; there's no need for a brute-force search. It's based on the fact that in the first column all the bytes are sorted, so bytes of the same kind start at the same offset. So what we do is keep a table of 256 entries with the starting position of every byte. An example of both tables together: the table of positions is pos[a]=0, pos[b]=2, pos[c]=4. When we get the first 'b' we take its position as position = pos[b] + rep[b] (2+0=2) and then ++rep[b]. So next time it will be 2+1=3, which is the position of the second 'b'.

Once we have the transformation vector we do the following:
Last | Vector | Output | Index |
b | 3 | b | 1 |
[b] | [4] | ... | |
c | 0 | ... | |
a | 1 | ... | |
a | 2 | ... |
Last | Vector | Output | Index |
b | 3 | b | 4 |
b | 4 | a | |
c | 0 | ... | |
a | 1 | ... | |
[a] | [2] | ... |
Last | Vector | Output | Index |
b | 3 | b | 2 |
b | 4 | a | |
[c] | [0] | c | |
a | 1 | ... | |
a | 2 | ... |
Last | Vector | Output | Index |
[b] | [3] | b | 0 |
b | 4 | a | |
c | 0 | c | |
a | 1 | b | |
a | 2 | ... |
Last | Vector | Output | Index |
b | 3 | b | 3 |
b | 4 | a | |
c | 0 | c | |
[a] | [1] | b | |
a | 2 | a |
Do you see? It works! :-) Remember that the index starts initialized to the primary index, which is saved along with the last column.
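The whole decoding walkthrough can be sketched like this (my own illustration of the scheme above; `unbwt` is a name I made up). The pos[]/rep[] dictionaries play the role of the two 256-entry tables from the text.

```python
def unbwt(last: bytes, primary: int) -> bytes:
    """Decode via the transformation vector, as in the tables above."""
    n = len(last)
    first = sorted(last)            # sorting last gives the first column
    pos = {}                        # first index of each byte in `first`
    for i in range(n - 1, -1, -1):
        pos[first[i]] = i
    rep = dict.fromkeys(pos, 0)     # occurrences consumed so far
    vector = [0] * n
    for counter, byte in enumerate(last):
        # the counter-th byte of last matches the next unused
        # occurrence of that byte in first
        vector[pos[byte] + rep[byte]] = counter
        rep[byte] += 1
    out = bytearray()
    index = primary                 # start at the primary index
    for _ in range(n):
        out.append(last[index])
        index = vector[index]
    return bytes(out)

print(unbwt(b"bbcaa", 1))  # b'bacba'
```

On our example it builds the vector [3, 4, 0, 1, 2] from the tables above and recovers "bacba".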
We can make this even faster. Look: the first column is sorted, therefore
a given character has only one starting point. The First array of our example
looks like: a,a,b,b,c. Thus the starting point for 'a' is 0, for 'b' it is
2 and for 'c' it is 4. So we don't need a brute-force search in the First
array; we only have to save the starting position of every character. The
array start[] will have 256 entries, and obviously it is indexed with the
byte being processed. But we can make yet another use of this start array:
we know we have to keep track of the number of times we have found a byte
in the first column, and it's easy to see that we can do this with the start
array too; it's only a matter of incrementing the position once we've read
it: ++start[byte]. You should also note that you don't need to sort last to
get first, because you can read directly from last.
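A sketch of this optimization (mine, under the conventions above): counting the bytes of last gives the starting offset of every byte in the first column without ever building or sorting it, and the same start[] array doubles as the occurrence counter by being incremented after each use.

```python
def build_vector(last: bytes):
    # counting pass: how many times each byte value appears in last
    counts = [0] * 256
    for b in last:
        counts[b] += 1
    # start[c] = offset where byte c begins in the (implicit) first column
    start = [0] * 256
    total = 0
    for c in range(256):
        start[c] = total
        total += counts[c]
    vector = [0] * len(last)
    for counter, b in enumerate(last):
        vector[start[b]] = counter
        start[b] += 1               # replaces the separate rep[] table
    return vector

print(build_vector(b"bbcaa"))  # [3, 4, 0, 1, 2]
```

This gives the same transformation vector as before, with no search at all.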
The result of Bwt
Are you experienced in compression? If you are, then you probably know
the meaning of the term 'context'. The context of a byte is the byte or group
of bytes that appears after or before it. I.e., in "1234", "123" is a context
of "4". Now look at our example again, sorted by context:

First | Context | Output
a     | bac     | b
a     | cba     | b
b     | aba     | c
b     | acb     | a
c     | bab     | a
Want more examples?
Context | Output |
he term | t |
hen you | t |
here is | t |
his is | t |
Do you see it? (The output is the last column; 'h' is the first column.) For
a better example I've included a paragraph of this text transformed with
my bwt implementation (test.txt -> test.bwt). (Better msdos-view it, because
windows doesn't like such things ;-) Its output is a dword (unsigned
long) with the primary index, and then the last column. I've also included
the debugging output of Mark Nelson's implementation on another text; both
examples are in the zip.
Compressing it
As you could see, the output of the Bwt transformation consists of long runs
of the same byte, then quickly changing to another one: it always changes
locally. Burrows and Wheeler took profit of this output with an
Mtf stage and a statistical compressor (an entropy coder: Huffman
or arithmetic coding). An Rle pass can also be applied
before the statistical compressor. Mtf and rle are explained in other
articles at my h-page, so read them; even static huffman is explained. :-)
(Don't worry, it's easy.)
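As a taste of why Mtf fits so well here, a minimal sketch (the full details are in my mtf article): after the transform, runs of equal bytes turn into runs of zeros, which the entropy coder then squeezes very well.

```python
def mtf_encode(data: bytes):
    # move-to-front: a byte's code is its position in a recency list,
    # so a run of equal bytes becomes 0,0,0,... after the first one
    table = list(range(256))
    out = []
    for b in data:
        i = table.index(b)
        out.append(i)
        table.pop(i)
        table.insert(0, b)      # move the byte to the front
    return out

print(mtf_encode(b"bbcaa"))  # [98, 0, 99, 99, 0]
```

Even on our tiny "bbcaa" you can see the repeats collapse to zeros; on a real Bwt output the effect is much stronger.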
Speed worst case
There are times when the speed of Bwt is painful, but what makes
it so slow? The block-sorting part (when you qsort the strings).
This happens when similar bytes appear together, i.e.: "aaaaaaa", "abcabcabc".
Sorting these takes a long time, because qsort has to read bytes until
it finds a difference. For the first case an Rle pass before the Bwt
transformation has proved to be good: it improves speed and doesn't hurt
compression. For the second case it was proposed to use a lz77 parser,
but the results proved bad: in the best of cases the improvement
was little, and in the worst it hurt compression. So Wheeler continued
working on Bwt, but focusing on speed. I've not seen all his work, so
I can't explain it fully, but what he did was a first pass with radix
sort, based on the first two bytes, and then a qsort. There are also some
very fast algorithms for sorting suffixes.
Closing words
Well, we've learned a new scheme; now it's up to you what you do with
it. :-) You can find the original paper on bwt,
some papers about it by Peter
Fenwick, and another introduction to bwt at Mark
Nelson's page.
Contacting the author
You can reach me via email at: arturo@arturocampos.com
Also don't forget to visit my home page http://www.arturocampos.com
where you can find more and better info. See you in the next article!