"Bwt, a transformation algorithm"
by
Arturo San Emeterio Campos





Disclaimer
Copyright (c) Arturo San Emeterio Campos 1999. All rights reserved. Permission is granted to make verbatim copies of this document for private use only.
 

Download
Download the whole article zipped.
 

Table of contents

  • Introduction
  • Bwt
  • Recovering the string
  • The result of Bwt
  • Compressing it
  • Speed worst case
  • Closing words
  • Contacting the author


    Introduction
    Hi, here we are learning a new compression-related algorithm. This one doesn't directly compress the data; it transforms the data in a way that lets it be compressed better with other techniques. This is not the first nor the last article that deals with bwt, and maybe it isn't the best, but I've always thought it's better to have three papers about a subject than one: in one of them the explanation is good, another has more theory, and the third is more practical. You know what I mean. ;-) I've tried to keep my article easy to understand and with enough examples; however, if you feel that something more is required, let me know. Bwt is rather new. It's a very good scheme for compression: its ratio is very good, speed is its bad point, and the worst case is very bad; of course there are tricks to speed it up. If you don't know whether to learn it, I'll only tell you one thing: its ratio is better than pkzip's. Before starting with the article I have to tell you that I have written some articles about algorithms you'll need if you want to successfully implement bwt: Qsort, mtf, rle, and static huffman. All of them are available at my home page.
     

    Bwt
    What we'll do is a transformation of the data, over the whole data, so before anything else load the whole file; you can of course do this process in little blocks of data (or big ones if you want). The first thing to do is make N strings from it, where N is the length of the file or block to transform, each string rotated one position from the previous one. For example, let's take the string "bacba" and make the N rotated strings of it:
     
    String   Position
    bacba    1
    acbab    2
    cbaba    3
    babac    4
    abacb    5

    You see? Note that we do not actually make N rotated strings; we just make N pointers into the string, and through them we have the N rotated strings. But how do we deal with the end of the string? Quite easy: you just take the bytes from the start:

  • acba  <- end of the string reached, so get 'b' from the start

    This was string 2. Or with string 4:

  • ba  <- end of the string reached, so get 'bac' from the start

    How do we actually do that? In fact you have two solutions:
    1. Every time you increment the pointer, check whether it's at the end; if it is, put it back at the start. (This is the method used above.)
    2. Load the file (or block) twice: bacbabacba. So when you are reading string 4, you just read straight ahead:

       bacbabacba
          babac <- string 4

    I prefer method 2: it spends more memory but it's faster, and it's the one I used in my implementation (you'll see it in the sketch at the end of this section). I "invented" it because I didn't know how else to do it, and I say "invented" because maybe someone has done this before ;-) Once you have the rotated strings (remember that they are pointers) we just sort them lexicographically (first a, then b, c, d... :-):
     
    String   Position
    abacb    5
    acbab    2
    babac    4
    bacba    1
    cbaba    3

    Note that we keep track of the original position of every string. For sorting you should use qsort; note that there's no need to treat the bytes as alphabet values, because ASCII is already 'sorted': you just treat the bytes as bytes, only their value matters. :-D Now we must notice that two columns are special, the first and the last:
     
    First   String   Last   Position
    a       bac      b      5
    a       cba      b      2
    b       aba      c      4
    b       acb      a      1
    c       bab      a      3

    As you can see, both the first and the last column contain all the bytes of the original string. We must also save the primary index, the row that lets us start decoding later: it's the sorted position of "acbab", the rotation that starts at the second byte of the original string, so here it's 2 (or 1 when programming, counting from zero). Note that usually it doesn't land in the same row. Also note that we don't save those strings to memory; there's no need, because we can always reach them through the pointers to the strings. Now we can start outputting the file. For recovering the file we just need the first column, the last column, and the primary index. But hey, wait one moment: if both columns, last and first, contain all the bytes, and the first is sorted, we can sort the last column and get the first! So we output as the transformed file only the last column (the first string is for free :-) and the primary index. In our case this is:

  • "bbcaa", 1

  • Now we just need to recover the original string. Note that we haven't done any compression yet.
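    To make all this concrete, here is a minimal sketch in C of the transform as described so far. It uses method 2 (the block loaded twice). The names bwt_encode and cmp_rot are mine, just for illustration, and a real implementation would want a faster sort (see the speed section):

        #include <stdlib.h>
        #include <string.h>

        static const unsigned char *buf;   /* the block, loaded twice */
        static size_t len;                 /* block length N */

        /* Compare two rotations, each identified by its starting offset. */
        static int cmp_rot(const void *a, const void *b)
        {
            return memcmp(buf + *(const size_t *)a,
                          buf + *(const size_t *)b, len);
        }

        /* Write the last column into out[] and return the primary index. */
        size_t bwt_encode(const unsigned char *block, size_t n,
                          unsigned char *out)
        {
            unsigned char *dbl = malloc(2 * n);
            size_t *rot = malloc(n * sizeof *rot);
            size_t i, primary = 0;

            memcpy(dbl, block, n);              /* method 2: load it twice */
            memcpy(dbl + n, block, n);
            buf = dbl; len = n;

            for (i = 0; i < n; i++) rot[i] = i;   /* N pointers = N rotated strings */
            qsort(rot, n, sizeof *rot, cmp_rot);  /* sort them lexicographically */

            for (i = 0; i < n; i++) {
                out[i] = dbl[rot[i] + n - 1];     /* the last column */
                if (rot[i] == 1) primary = i;     /* row of the rotation that
                                                     starts at the second byte */
            }
            free(dbl); free(rot);
            return primary;
        }

    With block = "bacba" this outputs "bbcaa" and primary index 1, exactly as above.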
     

    Recovering the string
    This may seem impossible, but it isn't. Having the first column, the last column and the primary index, we can reconstruct the original. We have to build a transformation vector that will let us reconstruct the file by telling us in which position every byte was. We do the following: we start a loop, get a byte from last (last[counter]), search for it in the first column, keep its position in 'position', and then update the transformation vector at entry 'position' with 'counter'. Got the idea?
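    Here it is as a minimal sketch in C (build_vector is a name I made up; the rep[] table, which handles repeated bytes, is explained right after the example):

        #include <stddef.h>

        /* Build the transformation vector: match every byte of the last
           column with its occurrence in the first column. */
        void build_vector(const unsigned char *last, const unsigned char *first,
                          size_t n, size_t *vector)
        {
            size_t rep[256] = {0};   /* times each byte value was seen so far */
            size_t counter, position, skip;

            for (counter = 0; counter < n; counter++) {
                skip = rep[last[counter]]++;   /* the k-th 'b' of last matches
                                                  the k-th 'b' of first */
                for (position = 0; position < n; position++)
                    if (first[position] == last[counter] && skip-- == 0)
                        break;
                vector[position] = counter;
            }
        }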

    Example:

    Step 1:
    Last   First   Vector
    [b]    a       ...
    b      a       ...
    c      [b]     0
    a      b       ...
    a      c       ...

    Step 2:
    Last   First   Vector
    b      a       ...
    [b]    a       ...
    c      b       0
    a      [b]     1
    a      c       ...

    Step 3:
    Last   First   Vector
    b      a       ...
    b      a       ...
    [c]    b       0
    a      b       1
    a      [c]     2

    Step 4:
    Last   First   Vector
    b      [a]     3
    b      a       ...
    c      b       0
    [a]    b       1
    a      c       2

    Step 5:
    Last   First   Vector
    b      a       3
    b      [a]     4
    c      b       0
    a      b       1
    [a]    c       2

    If the same byte appears two or more times in the first column, you have to take the first occurrence the first time, the second occurrence the second time, and so on. To do so we keep a table of 256 entries, initialized to 0; every time a byte is read we increment its entry, so when we read the same byte again we add the number in the table to the position. Also note that there's a fast way of doing this search; there's no need for a brute-force search. It's based on the fact that in the first column all the bytes are sorted, so all the bytes of the same value start at the same offset. So what we do is keep a table of 256 entries with the starting position of every byte value. An example of both tables together: the table of positions is pos[a]=0, pos[b]=2, pos[c]=4. When we get the first 'b' we take its position as pos[b]+rep[b] (2+0=2) and then do ++rep[b], so the next time it will be 2+1=3, which is the position of the second 'b'. (You can see this idea at work in the decoding sketch further below.) Once we have the transformation vector we do the following:

    Example:

    Step 1:
    Last   Vector   Output   Index
    b      3        b        1
    [b]    [4]      ...
    c      0        ...
    a      1        ...
    a      2        ...

    Step 2:
    Last   Vector   Output   Index
    b      3        b        4
    b      4        a
    c      0        ...
    a      1        ...
    [a]    [2]      ...

    Step 3:
    Last   Vector   Output   Index
    b      3        b        2
    b      4        a
    [c]    [0]      c
    a      1        ...
    a      2        ...

    Step 4:
    Last   Vector   Output   Index
    [b]    [3]      b        0
    b      4        a
    c      0        c
    a      1        b
    a      2        ...

    Step 5:
    Last   Vector   Output   Index
    b      3        b        3
    b      4        a
    c      0        c
    [a]    [1]      b
    a      2        a

    Do you see? It works! :-) Remember that the index starts initialized to the primary index, which is saved along with the last column.

    We can even make this faster. Look: the array first is sorted, therefore a given character has only one starting point. The First array of our example looks like a,a,b,b,c; thus the starting point for 'a' is 0, for 'b' it's 2, and for 'c' it's 4. So we don't need to do a brute-force search in the First array: we only have to save the starting position of every character. The array start[] will have 256 entries, and obviously it is indexed with the byte being processed. But we can make yet another use of this start array: we know that we have to keep track of the number of times we have found a given byte in the first column, and it's easy to see that we can do this with the start array too; it's only a matter of incrementing the position once we've read it, ++start[byte]. You should also note that you don't need to sort last to get first, because you can build start[] by simply counting the bytes of last.
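    Putting the last two ideas together, here is a minimal decoding sketch in C (bwt_decode and the array names are mine, for illustration): it builds start[] from a simple count of the bytes of last, fills the transformation vector in one pass by incrementing start[] as it goes, and then walks the vector:

        #include <stdlib.h>

        /* Rebuild the original block from the last column and the primary index. */
        void bwt_decode(const unsigned char *last, size_t n,
                        size_t primary, unsigned char *out)
        {
            size_t count[256] = {0}, start[256];
            size_t *vector = malloc(n * sizeof *vector);
            size_t i, sum, index;

            for (i = 0; i < n; i++)            /* counting last gives us first */
                count[last[i]]++;
            for (sum = 0, i = 0; i < 256; i++) {
                start[i] = sum;                /* starting point of each byte value */
                sum += count[i];
            }
            for (i = 0; i < n; i++)            /* the transformation vector...     */
                vector[start[last[i]]++] = i;  /* ...++start[] also counts repeats */

            index = primary;
            for (i = 0; i < n; i++) {          /* walk the vector */
                out[i] = last[index];
                index = vector[index];
            }
            free(vector);
        }

    With last = "bbcaa" and primary index 1 this writes "bacba", our original string.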
     

    The result of Bwt
    Are you experienced in compression? If you are, then you probably know the meaning of the term 'context'. The context of a byte or group of bytes is the byte or group of bytes that come before or after it. I.e.: "1234"

  • '1' is the context of 2, 3, 4.
  • '4' is the context of 1, 2, 3.
  • '12' is the context of '34'.
  • '34' is the context of '12'.

    Ok? What the bwt transformation does is keep bytes with the same context together; bwt groups them when it sorts the input string. In our case both 'b' had the context 'a', so they ended up together when sorted:
     
    Sorted strings
    Context   Output
    abac      b
    acba      b
    baba      c
    bacb      a
    cbab      a

    Want more examples?
     
    Context  Output 
    he term t
    hen you t
    here is t
    his is t

    Do you see it? (The output is the last column; 'h' is in the first column.) For a better example I've included a paragraph of this text transformed with my bwt implementation (test.txt -> test.bwt). (Better msdos-view it, because windows doesn't like such things ;-) Its output is a dword (unsigned long) with the primary index, and then the last column. I've also included the debugging output of another text with the implementation of Mark Nelson; both examples are in the zip.
     

    Compressing it
    As you can see, the output of the Bwt transformation consists of long runs of the same byte that then quickly change to another one, always changing locally. Burrows and Wheeler took advantage of this output with an Mtf pass and a statistical compressor (an entropy coder: Huffman or Arithmetic coding). An Rle pass can also be applied before the statistical compressor. Mtf and rle are explained in other articles at my h-page, so read them; even static huffman is explained. :-) (Don't worry, it's easy.)
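    Just to give the flavour of the stage that usually follows bwt, here is a tiny move-to-front sketch in C; mtf_encode is my name for it, and this is only a sketch, the full explanation lives in the mtf article:

        #include <stddef.h>
        #include <string.h>

        /* Encode a buffer with move-to-front: recently seen bytes get
           small ranks, so the long local runs of bwt become runs of zeros. */
        void mtf_encode(const unsigned char *in, unsigned char *out, size_t n)
        {
            unsigned char table[256];
            size_t i;
            int j;

            for (j = 0; j < 256; j++) table[j] = (unsigned char)j;
            for (i = 0; i < n; i++) {
                for (j = 0; table[j] != in[i]; j++)
                    ;                              /* find the byte's current rank */
                out[i] = (unsigned char)j;         /* output the rank */
                memmove(table + 1, table, (size_t)j);  /* move it to the front */
                table[0] = in[i];
            }
        }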
     

    Speed worst case
    There are times when the speed of Bwt is painful, but what makes it so slow? The block-sorting part (when you qsort the strings). This happens when similar bytes appear together, i.e. "aaaaaaa" or "abcabcabc": sorting these takes a lot of time, because qsort has to read bytes until it finds a difference. For the first case an Rle pass before the Bwt transformation has proved to be good: it improves speed and doesn't hurt compression. For the second case it was proposed to use an lz77 parser, but the results proved it to be bad: in the best of cases the improvement was little, and in the worst it hurt compression. So Wheeler continued working on Bwt, but focusing on speed. I haven't seen all his work, so I can't explain it well, but what he did was a first pass with radix sort, based on the first two bytes, and then a qsort. There are also some algorithms for sorting suffixes which are very fast.
     

    Closing words
    Well, we've learned a new scheme; now it's up to you what you do with it. :-) You can find the original paper of bwt, some papers about it by Peter Fenwick, and another introduction to bwt at Mark Nelson's page.

    Contacting the author
    You can reach me via email at arturo@arturocampos.com. Also don't forget to visit my home page http://www.arturocampos.com where you can find more and better info. See you in the next article!
     

    Arturo San Emeterio Campos, Barcelona 22-Jul-1999


    This article comes from Arturo Campos' home page at http://www.arturocampos.com. Visit again soon for new and updated compression articles and software.