 RU.COMPRESS 
 From : Professor Nimnull                   2:5020/1710.69  03 Dec 98  21:25:52
 To   : Bulat Ziganshin
 Subj : LZFG

Hello Bulat Ziganshin!
Thu Dec 03 1998 10:36, Bulat Ziganshin -> Professor Nimnull:
 PN>> Experience shows that more sophisticated statistical models achieve
 PN>> better compression, although LZFG has comparable performance. The
 PN>> worst performance comes from the simplest schemes - digrams and
 PN>> Huffman coding.
 BZ> LZB in that table is LZSS.
In my table or in yours? I assume mine, since yours lists implementations
rather than algorithms... And why did you decide that?
 BZ> Even pkzip 1 already reached the level of LZFG.
Could you give a reference... Because in the whole comp.compression FAQ all I found was:
--8<--------
  Recently an algorithm was developed which combines the ideas behind LZ77 and
LZ78 to produce a hybrid called LZFG.  LZFG uses the standard sliding window,
but stores the data in a modified trie data structure and produces as output
the position of the text in the trie.  Since LZFG only inserts complete
*phrases* into the dictionary, it should run faster than other LZ77-based
compressors.
--8<--------
 BZ> Here is the same table, in its modern form:
[skipped table from comp.compression FAQ part#1
URL: http://www.geocities.com/SiliconValley/Park/4264/act.html]
Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: AKA nimnull@glasnet.ru (2:5020/1710.69)


 RU.COMPRESS 
 From : Paul Melnikov                       2:5020/400      03 Dec 98  22:51:24
 To   : All
 Subj : Re: Lossy compression (graphics data)

From: "Paul Melnikov" <paul_mel@mtu-net.ru>
Hello, Andrey!
>03 Dec 98 5:08 you wrote:
 > MS>> Speed-wise they are roughly the same as the Fourier-like
 > MS>> transforms, aren't they?
 > SZ> Faster. Fourier is O(NlogN); wavelets, in a "lifting scheme"
 > SZ> implementation, are O(N). With Haar I process a 256x512 image in
 > SZ> less than a second.
 > DCT/iDCT is Fourier-like, isn't it?
 > An AAN implementation will give you 50 such images per second on an iP200.
On my P200MMX, wavelets compress a full-color picture of some 5-7 megabytes
in a second.
But all of that depends on the coder, the wavelet types it uses,
how crooked the implementation is, and so on.
> In general it would be very interesting to learn about the advantages of
> the algorithms you describe over the traditional ones.
Go to www.cengines.com and grab the free little program there (about a
megabyte) that lets you see in practice what wavelet compression really is.
I think the results will impress you.
The only caveat: to get trustworthy results, the pictures you compress must
be originals, i.e. not previously compressed with JPEG or anything similar.
Best of all, take a photograph, scan it in full color, and then compress it
with wavelets. It compresses 50-70 times with barely noticeable loss of
quality. I wonder which codec exactly they use in there?
> And, if it's not too much trouble, give me a link where I can read about
> the Haar algorithm.
If your English is not too shaky, grab the article already mentioned here,
"Building your own wavelets at home", say, from
http://cm.bell-labs.com/who/wim/papers/athome.pdf.
Otherwise, try the Computerra issue #8 (March) of this year; the whole cover
story of that issue is devoted to wavelets.
Best regards, Paul.
--- ifmail v.2.14dev2
 * Origin: MTU-Inform ISP (2:5020/400@fidonet)


 RU.COMPRESS 
 From : Professor Nimnull                   2:5020/1710.69  04 Dec 98  06:12:04
 To   : Leonid Broukhis
 Subj : LZP Re: LZFG

Hello Leonid Broukhis!
Thu Dec 03 1998 20:07, Leonid Broukhis -> Professor Nimnull:
 LB> LZP is described and implemented by Charles Bloom, who is easy to find
 LB> (www.utexas.edu/~cbloom if I'm not mistaken).
;)
That is, unless you are mistaken ;))
This page is currently mirrored at:
http://www.its.caltech.edu/~bloom/index.html
main site
http://wwwvms.utexas.edu/~cbloom/index.html
old main site, no future updates as of September 1997
http://www.geocities.com/CapeCanaveral/Lab/2347/index.html
geocities mirror site. incomplete
Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: AKA nimnull@glasnet.ru (2:5020/1710.69)


 RU.COMPRESS 
 From : Bulat Ziganshin                     2:5093/26       04 Dec 98  09:57:26
 To   : Professor Nimnull
 Subj : LZFG

* Crossposted in RU.COMPRESS
Hello Professor!
Thursday December 03 1998, Professor Nimnull writes to Bulat Ziganshin:
 PN>>> Don't send me to the Internet. The best you can find there on LZFG is:
 BZ>> We've been looking for it ourselves for years now :)  The only thing
 BZ>> that consoles me is that LZFG, afaik, was outperformed by LZH algorithms.
 PN> I'd like to know where exactly that information comes from.
  From the table :)  The files are the same ones, and even pkzip 1 comes out
at the 2.90 level
Bulat, mailto:bulatz@fort.tatarstan.ru, ICQ: work 11849833, home 15872722
--- GoldED/386 2.50+
 * Origin: After a deep sleep, a long rest is required (2:5093/26)


 RU.COMPRESS 
 From : Bulat Ziganshin                     2:5093/26       04 Dec 98  09:58:43
 To   : Professor Nimnull
 Subj : LZFG

* Crossposted in RU.COMPRESS
Hello Professor!
Thursday December 03 1998, Professor Nimnull writes to Bulat Ziganshin:
 BZ>> LZB in that table is LZSS.
 PN> In my table or in yours? I assume mine, since yours lists not
 PN> algorithms but implementations... And why did you decide that?
  My mistake :)
 BZ>> Here is the same table, in its modern form:
 PN> [skipped table from comp.compression FAQ part#1
 PN> URL: http://www.geocities.com/SiliconValley/Park/4264/act.html]
  The point is precisely that the table uses the same files. So ACT has
nothing to do with it :)
Bulat, mailto:bulatz@fort.tatarstan.ru, ICQ: work 11849833, home 15872722
--- GoldED/386 2.50+
 * Origin: After a deep sleep, a long rest is required (2:5093/26)


 RU.COMPRESS 
 From : mbb@sochi.ru                        2:5020/400      04 Dec 98  11:06:33
 To   : All
 Subj : Re: LZFG

From: mbb@sochi.ru
Bulat Ziganshin <Bulat.Ziganshin@f61.n5093.z2.fidonet.org> writes:
>   Is there a description of the algorithm somewhere?
I am getting the impression that my messages sent through FIDO do not reach
you, so I apologize for a possible repeat. If you see similar messages sent
from the FIDO side, please mail me their kludge lines.


 RU.COMPRESS 
 From : Maxime Zakharov                     2:5065/10.12    Wed 02 Dec 98 21:15
 To   : Professor Nimnull
 Subj : LZFG

 PN> Where can one get a description and the algorithm of %Subj%.
    The idea behind the LZFG family of algorithms is as follows: the compressed
file contains codewords of two types: literal x - the next x characters are
characters of uncompressed text, and copy x, -y - step back y characters and
copy x characters. For example, the phrase IT WAS THE BEST OF TIMES, IT WAS
THE WORST OF TIMES is encoded like this:
(literal 26)IT WAS THE BEST OF TIMES, (copy 11 -26)(literal 3)WOR(copy 11 -27)
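A minimal decoding sketch for this codeword scheme (the codewords are modelled
here as Python tuples; no particular bit layout is assumed):
--8<--------
def decode(codewords):
    # Codewords are ('literal', text) or ('copy', length, -offset) tuples.
    out = []
    for cw in codewords:
        if cw[0] == "literal":
            out.extend(cw[1])
        else:
            _, length, neg_off = cw
            start = len(out) + neg_off       # step back 'offset' characters
            for i in range(length):          # byte by byte, so copies may overlap
                out.append(out[start + i])
    return "".join(out)

print(decode([("literal", "IT WAS THE BEST OF TIMES, "),
              ("copy", 11, -26),
              ("literal", "WOR"),
              ("copy", 11, -27)]))
--8<--------
This reproduces the 51-character phrase above from the 4 codewords.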


 RU.COMPRESS 
 From : Maxime Zakharov                     2:5065/10.12    Thu 03 Dec 98 00:34
 To   : Professor Nimnull
 Subj : Universal codes

 PN> Universal codes
 PN> ^^^^^^^^^^^^^^^
Some of these codes can be found here:
http://tnet.sochi.ru/cgi-bin/ht2/ht2-cgi.cgi?=cinfo&ELIAS_CODES
and also on my home page (URL in the origin line), in the "Data compression"
section, the file prefix.ps.gz
 PN> Can anybody explain what these codes are:
 PN> 4) Golomb
The code of an integer n >= 1 depends on the choice of a parameter b. In the
first stage two values are computed:
    q = floor( (n-1)/b )
    r = n - qb - 1
The code is then built from two parts: the first part is the value q+1 encoded
in unary, the second is the binary representation of r, written in
floor(log_2 b) bits for the smaller remainders and
ceil(log_2 b) bits for the larger ones. For example, with b=3 there are
three possible remainders: 0, 1, 2. They are coded as 0, 10 and 11 respectively.
If the input stream is such that the probability of the integer n appearing is
P(n)=p(1-p)^(n-1) for some 0<=p<1, then the Golomb code is optimal for this
data if b is chosen such that
(1-p)^b + (1-p)^(b+1) <= 1 < (1-p)^(b-1) + (1-p)^b
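A small sketch of this construction (the unary part is written here as q ones
followed by a terminating zero, which is one common convention; the truncated
binary part follows the floor/ceil rule above):
--8<--------
def golomb_encode(n, b):
    # Positive integer n, parameter b, following the convention above:
    # q = floor((n-1)/b), r = n - q*b - 1.
    q = (n - 1) // b
    r = n - q * b - 1
    out = "1" * q + "0"              # q+1 in unary
    k = b.bit_length() - 1           # floor(log2 b)
    short = (1 << (k + 1)) - b       # number of remainders that get k bits
    if r < short:
        out += format(r, "0%db" % k) if k > 0 else ""
    else:
        out += format(r + short, "0%db" % (k + 1))
    return out

# b = 3: the remainders 0, 1, 2 come out as 0, 10, 11, as in the example.
for n in range(1, 7):
    print(n, golomb_encode(n, 3))
--8<--------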
--
Maxim Zakharov                http://www.tnet.sochi.ru/~maxime/
Sochi, Russia
--- ifmail v.2.14dev2
 * Origin: Demos online service (2:5020/400@fidonet)


 RU.COMPRESS 
 From : Professor Nimnull                   2:5020/1710.69  04 Dec 98  12:49:44
 To   : All
 Subj : So which LZ is actually the _theoretically_ best?

Hello All!
Weeeell...
I downloaded LZP... and spent a long time spitting...
So far I have only tested LZP1, and I was _extremely_ disappointed %(
Text                 aPack n0Pack Diet  WWpack HA   Pkzip
file      LZSS  LZP1 LZB   LZB    ???   ???    HSC  Deflate
2262      874   1160 769   782    859   776    679  816
To get the real compression for (HA & Pkzip), you have to subtract the size of
the headers from the archive size...
And now, gentlemen experts, the question: %subj% among the one-pass methods.
To cut down the size of the flame I'll add -- for compressing executable files.
Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: AKA nimnull@glasnet.ru (2:5020/1710.69)


 RU.COMPRESS 
 From : Disturbo                            2:5020/400      04 Dec 98  17:31:24
 To   : Bulat Ziganshin
 Subj : Re: LZFG

From: Disturbo <pprgd@aha.ru>
Wow! So many good people searching for and unable to find a write-up of
LZFG... Got to help! I dig through the dusty cupboard for the "data
compression" folder, and there it is (see below).
... Dedicated to the third anniversary of the treacherous cancellation of my
GPNTB library card...
[Communications of the ACM               April 1989 Volume 32 Number 4]
          *** Data Compression with Finite Windows ***
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Algorithms, and
Data Structures
Daniel Sleator
Editor
EDWARD R. FIALA and DANIEL H. GREENE
^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^
ABSTRACT: Several methods are presented for adaptive,
invertible data compression in the style of Lempel's and
Ziv's first textual substitution proposal. For the first two
methods, the article describes modifications of McCreight's
suffix tree data structure that support cyclic maintenance of
a window on the most recent source characters. A
percolating update is used to keep node positions within the
window, and the updating process is shown to have constant
amortized cost. Other methods explore the tradeoffs between
compression time, expansion time, data structure size, and
amount of compression achieved. The article includes a
graph-theoretic analysis of the compression penalty incurred
by our codeword selection policy in comparison with an
optimal policy, and it includes empirical studies of the
performance of various adaptive compressors from the
literature.
So, do you need it or not?? The scan will be about ~1800 lines in total;
the first three figures I have redone in ASCII, the rest I really can't be
bothered with...
All the best!
--- ifmail v.2.14dev2
 * Origin: Shoot the instigators! (2:5020/400@fidonet)


 RU.COMPRESS 
 From : Vadim Yoockin                       2:5020/1042.50  04 Dec 98  20:30:42
 To   : Professor Nimnull
 Subj : Universal codes

Greetings, Professor!
01 Dec 98, Professor Nimnull wrote to All:
PN> Can anybody explain what these codes are:
PN> 1) Elias Gamma codes (a.k.a. Levenshtein)
By example:
Suppose we need to write the number 23 (00010111).
Reverse the bits, dropping the leading zeros: 11101
Write them out so that every bit of the number except the last one
   is preceded by a zero flag bit: 010101001
Since the last bit is always '1', it signals
   the end of the flag bits.
Actually, a second variant is more common - write n-1 zeros to indicate
that the number that follows has n significant bits (i.e. starts
with a '1').
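A one-liner sketch of that second, more common variant:
--8<--------
def elias_gamma(n):
    # n >= 1: write (k-1) zeros, then the k significant bits of n.
    bits = format(n, "b")            # binary without leading zeros
    return "0" * (len(bits) - 1) + bits

print(elias_gamma(23))   # 23 = 10111 (5 bits) -> 000010111
--8<--------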
PN> 2) Elias Delta
Elias delta codes were recently described in this echo by Maxime Zakharov.
Most likely he is the one who will answer you. While he is at it, maybe he
will also describe the very similar Even-Rodeh codes (for a less steep
probability distribution curve). If he doesn't want to repeat himself, say so.
PN> 3) Fibonacci
PN> 4) Golomb
A parameter m is chosen (usually a power of two, 2**n),
and a number r is encoded in two parts:
1. A unary code for r/m (a run of '0's terminated by a '1',
   or the other way around).
2. r mod m encoded as an ordinary number of log2(m) bits.
PN> 5) Rice
Rice(n) is the same thing as Golomb(2**n).
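Since m is a power of two there, the remainder needs no truncation, so the
whole thing fits in a few lines (a sketch, using '0's terminated by a '1' for
the unary part, as in variant 1 above):
--8<--------
def rice_encode(r, n):
    # Rice(n) == Golomb(2**n): quotient in unary, then an n-bit remainder.
    q, rem = r >> n, r & ((1 << n) - 1)
    return "0" * q + "1" + (format(rem, "0%db" % n) if n else "")

print(rice_encode(23, 3))   # q = 2, rem = 7 -> "001111"
--8<--------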
  All the best. Vadim Yoockin
P.S. There are other methods too. Each is good in its own way. What prompted
the interest, by the way?
... 2.000.000 Lemmings can't be wrong.
--- Стаpый Дед стоимостью 3.00.Alpha4 доплата в СКВ UNREG
 * Origin: yoockinv@mtu-net.ru (2:5020/1042.50)


 RU.COMPRESS 
 From : Vadim Yoockin                       2:5020/1042.50  04 Dec 98  20:34:13
 To   : Professor Nimnull
 Subj : LZFG

Greetings, Professor!
03 Dec 98, Professor Nimnull wrote to Leonid Broukhis:
 >>> PN> Don't send me to the Internet. The best you can find there on LZFG is:
 >>> We've been looking for it ourselves for years now :)  The only thing that
 >>> consoles me is that LZFG, afaik, was outperformed by LZH algorithms.
 LB>> And in principle, LZFG should be no better than LZP.
 PN> No, let's get this straight... ;)
 PN> What is LZFG, and what is LZP... I'd like the algorithms and a detailed
 PN> description... Or at least a URL... Just don't send me to Yahoo...
 PN> Here is the data known to me...
[skip]
 PN> Experience shows that more sophisticated statistical models
 PN> achieve better compression, although LZFG has comparable
 PN> performance. The worst performance comes from the simplest schemes -
 PN> digrams and Huffman coding.
The book is from 1989, though. A lot of water has flowed under the bridge
since then.
The PPMC in that table uses order-3 contexts. Nowadays PPMZ uses
order-8 contexts, and PPMdet order-12.
LZP did not exist at all back then (to say nothing of BWT, which at the time
supposedly existed only in Burrows's head).
As for LZFG, I know of no cases of it being used,
and I have never come across a place where it lives
(LZP is used in BOA and RKIVE).
By the way, LZP (like PPMZ) - http://wwwvms.utexas.edu/~cbloom
  All the best. Vadim Yoockin
... 2.000.000 Lemmings can't be wrong.
--- Стаpый Дед стоимостью 3.00.Alpha4 доплата в СКВ UNREG
 * Origin: yoockinv@mtu-net.ru (2:5020/1042.50)


 RU.COMPRESS 
 From : Leonid Broukhis                     2:5020/400      05 Dec 98  09:09:52
 To   : Bulat Ziganshin
 Subj : Re: LZFG

From: leob@asylum.mailcom.com (Leonid Broukhis)
Bulat Ziganshin wrote:
> LB> and it turned out to be not good enough to be worth paying them for a
> LB> license. Well, LZFG saves a little on the reference to the previous
> LB> occurrence of the repeated segment - hardly something to brag about.
>
>  Is there a description of the algorithm somewhere?
What is there to explain? There is a tree representing a context as long as
the buffer, all nodes of the tree are numbered, and instead of (offset, length)
you simply give the node number, which is obviously shorter, since the tree
has fewer nodes than buffer_length * max_segment_length.
  Leo
--- ifmail v.2.14dev2
 * Origin: Demos online service (2:5020/400@fidonet)


 RU.COMPRESS 
 From : Leonid Broukhis                     2:5020/400      05 Dec 98  09:09:53
 To   : Professor Nimnull
 Subj : Re: So which LZ is actually the _theoretically_ best?

From: leob@asylum.mailcom.com (Leonid Broukhis)
Professor Nimnull wrote:
>I downloaded LZP... and spent a long time spitting...
>So far I have only tested LZP1, and I was _extremely_ disappointed %(
You might as well have tested LZRW1. You would have been disappointed too.
Found something to be surprised about.
  Leo
--- ifmail v.2.14dev2
 * Origin: Demos online service (2:5020/400@fidonet)


 RU.COMPRESS 
 From : Bulat Ziganshin                     2:5093/26       05 Dec 98  13:00:33
 To   : Leonid Broukhis
 Subj : LZFG

* Crossposted in RU.COMPRESS
Hello Leonid!
Saturday December 05 1998, Leonid Broukhis writes to Bulat Ziganshin:
 LB> What is there to explain? There is a tree representing a context as long
 LB> as the buffer, all nodes of the tree are numbered, and instead of
 LB> (offset, length) you simply give the node number, which is obviously
 LB> shorter, since the tree has fewer nodes than buffer_length *
 LB> max_segment_length.
  What I don't understand is how that tree is organized. What you said applies
to lzw as well.
Bulat, mailto:bulatz@fort.tatarstan.ru, ICQ: work 11849833, home 15872722
--- GoldED/386 2.50+
 * Origin: After a deep sleep, a long rest is required (2:5093/26)


 RU.COMPRESS 
 From : Professor Nimnull                   2:5020/1710.69  05 Dec 98  14:00:44
 To   : Disturbo
 Subj : LZFG

Hello Disturbo!
Fri Dec 04 1998 17:31, Disturbo -> Bulat Ziganshin:
 D> [Communications of the ACM               April 1989 Volume 32 Number 4]
 D> EDWARD R. FIALA and DANIEL H. GREENE
 D> ^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^
 D> So, do you need it or not??
Yes, yes, yes!!! Of course I need it!
 D> The scan will be about ~1800 lines in total; the first three figures
 D> I have redone in ASCII, the rest I really can't be bothered with...
Maybe it's worth uploading it somewhere?
Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: AKA nimnull@glasnet.ru (2:5020/1710.69)


 RU.COMPRESS 
 From : Professor Nimnull                   2:5020/1710.69  05 Dec 98  14:42:42
 To   : mbb@sochi.ru
 Subj : LZFG

Hello mbb@sochi.ru!
Fri Dec 04 1998 11:06, mbb@sochi.ru -> All:
 m> I am getting the impression that my messages sent through FIDO
 m> do not reach you, so I apologize for a possible repeat.
The messages from Fido never arrived :(
 m>     The idea behind the LZFG family of algorithms is as follows: the
 m> compressed file contains codewords of two types: literal x - the next x
 m> characters are characters of uncompressed text, and copy x, -y - step back
 m> y characters and copy x characters. For example, the phrase IT WAS THE
 m> BEST OF TIMES, IT WAS THE WORST OF TIMES is encoded like this: (literal
 m> 26)IT WAS THE BEST OF TIMES, (copy 11 -26)(literal 3)WOR(copy 11 -27)
???
Then how is it different from, say, LZSS?
I must be dense. I don't see the difference... I would describe LZSS in
exactly the same words...
No, wait... In LZSS the uncompressed characters are flagged byte by byte,
while here they go in one batch...
So in what way is LZFG better? What is its advantage? Why is it praised so
much? IMHO flagging uncompressed characters in a batch is worse (I checked),
especially with a frequently changing stream (e.g. an EXE file)...
 m> and also on my home page (URL in the origin line)
I visited the page -- the impression is stunning. It feels like walking into
a pastry shop where everything is there and everything is free... Except that
all the goods are on shelves, and the shelves are almost up at the ceiling...
And where can all of this (for example the "Sochi Computer Encyclopedia.
Data Compression") be downloaded in plain TXT || DOC format?
 m> in the "Data compression" section, the file prefix.ps.gz
Downloaded it... But what do I view it with?
Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: AKA nimnull@glasnet.ru (2:5020/1710.69)


 RU.COMPRESS 
 From : Professor Nimnull                   2:5020/1710.69  05 Dec 98  16:54:04
 To   : Bulat Ziganshin
 Subj : So which LZ is actually the _theoretically_ best?

Hello Bulat Ziganshin!
Sat Dec 05 1998 02:36, Bulat Ziganshin -> Professor Nimnull:
 PN>> And now, gentlemen experts, the question: %subj% among the one-pass
 PN>> methods. To cut down the size of the flame I'll add -- for compressing
 PN>> executable files.
 BZ>   Judging by its awful compression, that variant is of purely
 BZ> theoretical interest. Actually, rkive is one of the best archivers, and
 BZ> it uses precisely lzp.
URL?
 BZ> And one more small remark - if you are interested in compressing exe,
 BZ> why test on txt?
That's just how it went ;) It's easier for me to decompress text files in my
head (with the help of a calculator)... ;))
 BZ>   Now - what exactly are you after - fast compression /
 BZ> decompression / low memory use / originality of the algorithm :) ?
Well, all right... cards on the table, as they say...
I happen to need a packer for executable files that can compress hidden
overlays. No such thing exists in the wild :(( So I have to write my own...
Given the task, the compression method has to satisfy certain conditions:
1) Small resources for decompression:
 a small decompressor,
 no additional buffers in memory...
2) Compression of a frequently changing sequence
3) Compression of an unpredictable sample
4) And, of course, maximum compression...
So I am _not_ interested in packing methods whose decompression needs medium,
large or very large resources... Of course it would be fun to use arithmetic
compression with a couple of recompression passes ;)))
Oh well...
At the moment ExePackers use LZ?? + extra tricks.
The very best is aPack. It uses lazy LZB with ~55Kb lookback.
And here I sit wondering: should I use LZB in my packer, or look for something
better...
 BZ> The big expert on exe-packers is Dima Bortok; alas, he doesn't read the
 BZ> echo these days.
:(
How could I get in touch with him?
And what has he written?
 BZ> Afaik, you should take rar and invent some little extras on top of it.
What's so good about it? IMHO no better than Zip... IMsHO LZ77+dHuff...
 BZ> I don't know any other realistic ways, though of course ace, cabarc
 BZ> and 777 compress better.
The second one I don't know... And the first and the third, like Jar, get
their improvement by enlarging the lookback buffer...
 BZ> As for one-pass operation - a block-by-block algorithm will suit you
 BZ> too, I hope? rar is exactly that.
Where can I read about that? And what does that algorithm look like?
Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: Lempel lived, Lempel Ziv, Lempel will live!!! (2:5020/1710.69)


 RU.COMPRESS 
 From : Disturbo                            2:5020/400      05 Dec 98  17:36:25
 To   : Bulat Ziganshin
 Subj : LZFG (CACM publ.) 1 of 4

From: Disturbo <pprgd@aha.ru>
Bulat Ziganshin wrote:
>   Send it over. Maybe something can even be understood from that English lingo :)
>
EDWARD R. FIALA and DANIEL H. GREENE
^^^^^^^^^^^^^^^     ^^^^^^^^^^^^^^^^
ABSTRACT: Several methods are presented for adaptive,
invertible data compression in the style of Lempel's and
Ziv's first textual substitution proposal. For the first two
methods, the article describes modifications of McCreight's
suffix tree data structure that support cyclic maintenance of
a window on the most recent source characters. A
percolating update is used to keep node positions within the
window, and the updating process is shown to have constant
amortized cost. Other methods explore the tradeoffs between
compression time, expansion time, data structure size, and
amount of compression achieved. The article includes a
graph-theoretic analysis of the compression penalty incurred
by our codeword selection policy in comparison with an
optimal policy, and it includes empirical studies of the
performance of various adaptive compressors from the
literature.
1. INTRODUCTION
Compression is the coding of data to minimize its repre-
sentation. In this article, we are concerned with fast,
one-pass, adaptive, invertible (or lossless) methods of
digital compression which have reasonable memory
requirements. Such methods can be used, for example,
to reduce the storage requirements for files, to increase
the communication rate over a channel, or to reduce
redundancy prior to encryption for greater security.
By "adaptive" we mean that a compression method
should be widely applicable to different kinds of source
data. Ideally, it should adapt rapidly to the source to
achieve significant compression on small files, and it
should adapt to any subsequent internal changes in the
nature of the source. In addition, it should achieve very
high compression asymptotically on large regions with
stationary statistics.
All the compression methods developed in this article
are substitutional. Typically, a substitutional compressor
functions by replacing large blocks of text with shorter
references to earlier occurrences of identical text [3, 5,
29, 34, 36, 39, 41-43]. (This is often called Ziv-Lempel
compression, in recognition of their pioneering ideas.
Ziv and Lempel, in fact, proposed two methods. The
unqualified use of the phrase "Ziv-Lempel compres-
sion" usually refers to their second proposal [43]. In this
article, we will be primarily concerned with their first
proposal [42].) A popular alternative to a substitutional
compressor is a statistical compressor. A symbolwise
statistical compressor functions by accurately predict-
ing the probability of individual symbols, and then en-
coding these symbols with space close to -log2 of the
predicted probabilities. The encoding is accomplished
with either Huffman compression [17] which has re-
cently been made one-pass and adaptive [11, 22, 37],
or with arithmetic coding, as described in [1, 14, 20, 25,
26, 31-33]. The major challenge of a statistical com-
pressor is to predict the symbol probabilities. Simple
strategies, such as keeping zero-order (single symbol) or
first-order (symbol pair) statistics of the input, do not
compress English text very well. Several authors have
had success gathering higher-order statistics, but this
necessarily involves higher memory costs and addi-
tional mechanisms for dealing with situations where
higher-order statistics are not available [6, 7, 26].
It is hard to give a rigorous foundation to the substi-
tutional vs. statistical distinction described above. Sev-
eral authors have observed that statistical methods can
be used to simulate textual substitution, suggesting that
the statistical category includes the substitutional cate-
gory [4, 24]. However, this takes no account of the sim-
plicity of mechanism: the virtue of textual substitution
is that it recognizes and removes coherence on a large
scale, oftentimes ignoring the smaller scale statistics. As
a result, most textual substitution compressors pro-
cess their compressed representation in larger blocks
than their statistical counterparts, thereby gaining a sig-
nificant speed advantage. It was previously believed
that the speed gained by textual substitution would
necessarily cost something in compression achieved.
We were surprised to discover that with careful atten-
tion to coding, textual substitution compressors can
match the compression performance of the best statisti-
cal methods.
Consider the following scheme, which we will im-
prove later in the article. Compressed files contain two
types of codewords:
literal x    pass the next x characters directly into the
             uncompressed output
copy x, -y   go back y characters in the output and copy
             x characters forward to the current position.
So, for example, the following piece of literature:
IT WAS THE BEST OF TIMES,
IT WAS THE WORST OF TIMES
would compress to
(literal 26)IT WAS THE BEST OF TIMES,
(copy 11 -26)(literal 3)WOR(copy 11 -27)
The compression achieved depends on the space re-
quired for the copy and literal codewords. Our simplest
scheme, hereafter denoted A1, uses 8 bits for a literal
codeword and 16 for a copy codeword. If the first 4 bits
are 0, then the codeword is a literal; the next 4 bits
encode a length x in the range [1 . . 16] and the follow-
ing x characters are literal (one byte per character).
Otherwise, the codeword is a copy; the first 4 bits
encode a length x in the range [2 . . 16] and the next
12 bits are a displacement y in the range [1 . . 4096]. At
each step, the policy by which the compressor chooses
between a literal and a copy is as follows: If the com-
pressor is idle (just finished a copy, or terminated a
literal because of the 16-character limit), then the long-
est copy of length 2 or more is issued; otherwise, if the
longest copy is less than 2 long, a literal is started. Once
started, a literal is extended across subsequent characters
 until a copy of length 3 or more can be issued or
until the length limit is reached.
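A decoder for this layout fits in a few lines; the sketch below assumes the
obvious biased storage (length nibble = x-1, 12-bit field = y-1), which the
text above does not spell out:
--8<--------
def a1_decode(data):
    # 'data' is a bytes object of A1 codewords as described above.
    out, i = bytearray(), 0
    while i < len(data):
        first = data[i]
        if first >> 4 == 0:                          # literal codeword (high nibble 0)
            x = (first & 0x0F) + 1                   # literal length 1..16
            out += data[i + 1:i + 1 + x]
            i += 1 + x
        else:                                        # copy codeword, 16 bits total
            x = (first >> 4) + 1                     # copy length 2..16
            y = ((first & 0x0F) << 8 | data[i + 1]) + 1   # displacement 1..4096
            for _ in range(x):                       # byte-wise, so overlaps work
                out.append(out[-y])
            i += 2
    return bytes(out)
--8<--------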
A1 would break the first literal in the above example
into two literals and compress the source from 51 bytes
down to 36. A1 is close to Ziv and Lempel's first textual
substitution proposal [42]. One difference is that A1
uses a separate literal codeword, while Ziv and Lempel
combine each copy codeword with a single literal char-
acter. We have found it useful to have longer literals
during the startup transient; after the startup, it is bet-
ter to have no literals consuming space in the copy
codewords.
Our empirical studies showed that, for source code
and English text, the field size choices for A1 are good;
reducing the size of the literal length field by 1 bit
increases compression slightly but gives up the byte-
alignment property of the A1 codewords. In short, if
one desires a simple method based upon the copy and
literal idea, A1 is a good choice.
A1 was designed for 8-bit per character text or pro-
gram sources, but, as we will see shortly, it achieves
good compression on other kinds of source data, such as
compiled code and images, where the word model does
not match the source data particularly well, or where
no model of the source is easily perceived. A1 is, in
fact, an excellent approach to general purpose data
compression. In the remainder of this article, we will
study A1 and several more powerful variations.
2. OVERVIEW OF THE DATA STRUCTURE
The fixed window suffix tree of this article is a modifi-
cation of McCreight's suffix tree [23] (see also [21, 34,
38]), which is itself a modification of Morrison's PATRI-
CIA tree [30], and Morrison's tree is ultimately based on
a Trie data structure [22, page 481]. We will review
each of these data structures briefly.
A Trie is a tree structure where the branching occurs
according to "digits" of the keys, rather than according
to comparisons of the keys. In English, for example, the
most natural "digits" are individual letters, with the lth
level of the tree branching according to the lth letter of
the words in the tree.
In Figure 1, many internal nodes are superfluous,
having only one descendant. If we are building an
index for a file, we can save space by eliminating the
superfluous nodes and putting pointers to the file into
the nodes rather than including characters in the data
structure. In Figure 2, the characters in parentheses are
not actually represented in the data structure, but they
can be recovered from the (position, level) pairs in the
nodes. Figure 2 also shows a suffix pointer (as a dark
right arrow) that will be explained later.
Figure 2 represents some, but not all, of the innova-
tions in Morrison's PATRICIA trees. He builds the trees
with binary "digits" rather than full characters, and this
allows him to save more space by folding the leaves
into the internal nodes. Our "digits" are bytes, so the
branching factor can be as large as 256. Since there are
rarely 256 descendants of a node, we do not reserve
that much space in each node, but instead hash the
arcs. There is also a question about when the strings in
parentheses are checked in the searching process. In
what follows, we usually check characters immediately
when we cross an arc. Morrison's scheme can avoid file
access by skipping the characters on the arcs, and doing
only one file access and comparison at the end of the
search. However, our files will be in main memory, so
this consideration is unimportant. We will use the sim-
plified tree depicted in Figure 2.
For A1, we wish to find the longest (up to 16 charac-
ter) match to the current string beginning anywhere in
the preceding 4096 positions. If all preceding 4096
strings were stored in a PATRICIA tree with depth
d = 16, then finding this match would be straightfor-
ward. Unfortunately, the cost of inserting these strings
can be prohibitive, for if we have just descended d
levels in the tree to insert the string starting at position
i then we will descend at least d - 1 levels inserting the
string at i+1. In the worst case this can lead to O(nd)
             **
           A /\ S
         S /    \ T
       T /        \ R
     R /        A / \ I
   A / \ I   STRAY   STRIDE
ASTRAY     ASTRIDE
FIGURE 1. A Trie
insertion time for a file of size n. Since later encodings
will use much larger values for d than 16, it is impor-
tant to eliminate d from the running time.
To insert the strings in O(n) time, McCreight added
additional suffix pointers to the tree. Each internal
node, representing the string aX on the path from the
root to the internal node, has a pointer to the node
representing X, the string obtained by stripping a single
letter from the beginning of aX. If a string starting at i
has just been inserted at level d we do not need to
return to the root to insert the string at i + 1; instead,
a nearby suffix pointer will lead us to the relevant
branch of the tree.
Figure 3 shows how suffix links are created and used.
On the previous iteration, we have matched the string
aXY, where a is a single character, X and Y are strings,
and b is the first unmatched character after Y. Figure 3
shows a complicated case where a new internal node,
<alpha>, has been added to the tree, and the suffix link of <alpha>
must be computed. We insert the next string XYb by
going up the tree to node <beta>, representing the string aX,
and crossing its suffix link to <gamma>, representing X. Once
we have crossed the suffix link, we descend again in
the tree, first by "rescanning" the string Y, and then by
"scanning" from <delta> until the new string is inserted. The
first part is called "rescanning" because it covers a por-
tion of the string that was covered by the previous
insert, and so it does not require checking the internal
strings on the arcs. (In fact, avoiding these checks is
essential to the linear time functioning of the algo-
rithm.) The rescan either ends at an existing node <delta>, or
<delta> is created to insert the new string XYb; either way we
have the destination for the suffix link of <alpha>. We have
restored the invariant that every internal node, except
possibly the one just created, has a suffix link.
For the A1 compressor, with a 4096-byte fixed win-
dow, we need a way to delete and reclaim the storage
for portions of the suffix tree representing strings fur-
ther back than 4096 in the file. Several things must be
added to the suffix tree data structure. The leaves of
the tree are placed in a circular buffer, so that the
oldest leaf can be identified and reclaimed, and the
internal nodes are given "son count" fields. When an
             ( 1, 0 )
          A    /   \    S
     (STR)   /       \   (TR)
       ( 1, 4 )     ( 2, 3 )
    A    /    |     A  |  \  I
 (Y)   /      |    (Y) |    \  (DE)
   ( 9, ~ ) ( 1, ~ ) ( 10, ~) ( 2, ~ )
FIGURE 2. A PATRICIA Tree with a Suffix Pointer
                  *
                /   \
        [aX]  /       \  [X]
            /           \
        [beta]        [gamma]      rescan
 [gamma] /                 \  [Y]
       /                     \
   [alpha]                 [delta]
    /    .  .............>      \  scan
  /   [b]  .                      \
[k]        []                     []
FIGURE 3. Building a Suffix Tree
internal "son count" falls to one, the node is deleted
and two consecutive arcs are combined. In Section 3, it
is shown that this approach will never leave a "dan-
gling" suffix link pointing to deleted nodes. Unfortu-
nately, this is not the only problem in maintaining a
valid suffix tree. The modifications that avoided a re-
turn to the root for each new insertion create havoc for
deletions. Since we have not always returned to the
root, we may have consistently entered a branch of the
tree sideways. The pointers (to strings in the 4096-byte
window) in the higher levels of such a branch can be-
come out-of-date. However, traversing the branch and
updating the pointers would destroy any advantage
gained by using the suffix links.
We can keep valid pointers and avoid extensive up-
dating by partially updating according to a percolating
update. Each internal node has a single "update" bit. If
the update bit is true when we are updating a node,
then we set the bit false and propagate the update re-
cursively to the node's parent. Otherwise, we set the bit
true and stop the propagation. In the worst case, a long
string of true updates can cause the update to propagate
to the root. However, when amortized over all new
leaves, the cost of updating is constant, and the effect of
updating is to keep all internal pointers on positions
within the last 4096 positions of the file. These facts
will be shown in Section 3.
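The update-bit rule itself is tiny; here is a sketch with a hypothetical
parent-linked Node holding just the fields described above:
--8<--------
class Node:
    def __init__(self, parent=None):
        self.parent = parent     # None at the root
        self.pos = 0             # window position stored in the node
        self.update = False      # the single "update" bit

def percolate(node, current_pos):
    # Write the current position into the node; if its update bit was
    # already true, clear it and keep going toward the root, otherwise
    # set the bit and stop.
    while node is not None:
        node.pos = current_pos
        if node.update:
            node.update = False
            node = node.parent
        else:
            node.update = True
            break
--8<--------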
We can now summarize the operation of the inner
loop, using Figure 3 again. If we have just created node
<alpha>, then we use <alpha>'s parent's suffix link to find <gamma>. From
<gamma> we move down in the tree, first rescanning, and then
scanning. At the end of the scan, we percolate an up-
date from the leaf, moving toward the root, setting the
position fields equal to the current position, and setting
the update bits false, until we find a node with an
update bit that is already false, whereupon we set that
node's update bit true and stop the percolation. Finally,
we go to the circular buffer of leaves and replace the
oldest leaf with the new leaf. If the oldest leaf's parent
has only one remaining son, then it must also be de-
leted; in this case, the remaining son is attached to its
grandparent, and the deleted node's position is perco-
lated upward as before, only at each step the position
being percolated and the position already in the node
must be compared and the more recent of these sent
upward in the tree.
3. THEORETICAL CONSIDERATIONS
The correctness and linearity of suffix tree construction
follows from McCreight's original paper [28]. Here we
will concern ourselves with the correctness and the
linearity of suffix tree destruction-questions raised in
Section 2.
THEOREM 1. Deleting leaves in FIFO order and deleting
internal nodes with single sons will never leave dangling
suffix pointers.
PROOF. Assume the contrary. We have a node <gamma>
with a suffix pointer to a node <delta> that has just been
deleted. The existence of <gamma> means that there are at
least two strings that agree for l positions and then
differ at l + 1. Assuming that these two strings start at
positions i and j, where both i and j are within the
window of recently scanned strings and are not equal
to the current position, then there are two even
younger strings at i + 1 and j + 1 that differ first at l.
This contradicts the assumption that <delta> has one son. (If
either i or j are equal to the current position, then <gamma> is
a new node, and can temporarily be without a suffix
pointer.)
There are two issues related to the percolating update:
its cost and its effectiveness.
THEOREM 2. Each percolated update has constant amor-
tized cost.
PROOF. We assume that the data structure contains
a "credit" on each internal node where the "update"
flag is true. A new leaf can be added with two "credits."
One is spent immediately to update the parent, and the
other is combined with any credits remaining at the
parent to either: 1) obtain one credit to leave at the
parent and terminate the algorithm or 2) obtain two
credits to apply the algorithm recursively at the parent.
This gives an amortized cost of two updates for each
new leaf.
For the next theorem, define the "span" of a suffix tree
to be equal to the size of its fixed window. So far we
have used examples with "span" equal to 4096, but the
value is flexible.
THEOREM 3. Using the percolating update, every inter-
nal node will be updated at least once during every period of
length "span".
PROOF. It is useful to prove the slightly stronger re-
sult that every internal node (that remains for an entire
period) will be updated twice during a period, and thus
propagate at least one update to its parent. To show a
contradiction, we find the earliest period and the node
<beta> farthest from the root that does not propagate an
update to its parent. If <beta> has at least two children that
have remained for the entire period, then <beta> must have
received updates from these nodes: they are farther
from the root. If <beta> has only one remaining child, then
it must have a new child, and so it will still get
two updates. (Every newly created arc causes a son
to update a parent, percolating if necessary.) Similarly,
two new children also cause two updates. By every
accounting, <beta> will receive two updates during the
period, and thus propagate an update-contradicting
our assumption of <beta>'s failure to update its parent.
There is some flexibility on how updating is handled.
We could propagate the current position upward before
rescanning, and then write the current position into
those nodes passed during the rescan and scan; in this
case, the proof of Theorem 3 is conservative. Alterna-
tively, a similar, symmetric proof can be used to show
that updating can be omitted when new arcs are added
so long as we propagate an update after every arc is
deleted. The choice is primarily a matter of implemen-
tation convenience, although the method used above is
slightly faster.
The last major theoretical consideration is the effec-
tiveness of the A1 policy in choosing between literal
and copy codewords. We have chosen the following
one-pass policy for A1: When the encoder is idle, issue
a copy if it is possible to copy two or more characters:
otherwise, start a literal. If the encoder has previously
started a literal, then terminate the literal and issue a
copy only if the copy is of length three or greater.
Notice that this policy can sometimes go astray. For
example, suppose that the compressor is idle at position
i and has the following copy lengths available at subse-
quent positions:
i  i+1  i+2  i+3  i+4  i+5
1   3    16   15   14   13
Under the policy, the compressor encodes position i
with a literal codeword, then takes the copy of length 3,
and finally takes a copy of length 14 at position i + 4. It
uses 6 bytes in the encoding:
(literal 1)X(copy 3 - y)(copy 14 - y)
If the compressor had foresight it could avoid the
copy of length 3, compressing the same material into
5 bytes:
(literal 2)XX(copy 16 - y)
The optimal solution can be computed by dynamic
programming [36]. One forward pass records the length
of the longest possible copy at each position (as in equa-
tion 1) and the displacement for the copy (not shown in
equation 1). A second backward pass computes the
optimal way to finish compressing the file from each
position by recording the best codeword to use and the
length to the end-of-file. Finally, another forward pass
reads off the solution and outputs the compressed file.
However, one would probably never want to use dy-
namic programming since the one-pass heuristic is a lot
faster, and we estimated for several typical files that
the heuristically compressed output was only about
1 percent larger than the optimum. Furthermore, we
will show in the remainder of this section that the size
of the compressed file is never worse than 5/4 the size of
the optimal solution for the specific A1 encoding.
This will require developing some analytic tools, so the
non-mathematical reader should feel free to skip to
Section 4.
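The backward pass is a small dynamic program; a toy sketch under the A1 byte
costs (1-byte literal codeword plus one byte per literal character, 2-byte
copy codeword), with F[i] the longest copy available at position i and copies
simply clipped at the end of the toy array:
--8<--------
def optimal_parse_cost(F, max_literal=16):
    # B[i] = cheapest number of bytes needed to encode positions i..end.
    n = len(F)
    B = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        # a literal codeword covering x characters, x <= 16
        best = min(1 + x + B[i + x]
                   for x in range(1, min(max_literal, n - i) + 1))
        # a longest-possible copy, if a copy of length >= 2 starts here
        if F[i] >= 2:
            best = min(best, 2 + B[i + min(F[i], n - i)])
        B[i] = best
    return B

print(optimal_parse_cost([1, 3, 16, 15, 14, 13]))
--8<--------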
The following definitions are useful:
Definition. F(i) is the longest feasible copy at position i
in the file.
Sample F(i)'s were given above in equation 1. They are
dependent on the encoding used. For now, we are as-
suming that they are limited in magnitude to 16, and
must correspond to copy sources within the last 4096
characters.
Definition. B(i) is the size of the best way to compress
the remainder of the file, starting at position i.
B(i)'s would be computed in the reverse pass of the
optimal algorithm outlined above.
The following Theorems are given without proof:
THEOREM. F(i + 1) >= F(i) - 1.
THEOREM. There exists an optimal solution where copies
are longest possible (i.e., only copies corresponding to F(i)'s
are used).
THEOREM. B(i) is monotone decreasing.
THEOREM. Any solution can be modified, without affect-
ing length, so that (literal x1) followed immediately by
(literal x2) implies that x1 is maximum (in this case 16).
We could continue to reason in this vein, but there is
an abstract way of looking at the problem that is both
clearer and more general. Suppose we have a non-
deterministic finite automaton where each transition is
given a cost. A simple example is shown in Figure 4.
The machine accepts (a + b)*, with costs as shown in
parentheses.
FIGURE 4. A Nondeterministic Automaton with Transition Costs
The total cost of accepting a string is the sum of the
transition costs for each character. (While it is not
important to our problem, the optimal solution can be
computed by forming a transition matrix for each let-
ter, using the costs shown in parentheses, and then
multiplying the matrices for a given string, treating the
coefficients as elements of the closed semiring with op-
erations of addition and minimization.) We can obtain a
solution that approximates the minimum by deleting
transitions in the original machine until it becomes a
deterministic machine. This corresponds to choosing a
policy in our original data compression problem. A pol-
icy for the machine in Figure 4 is shown in Figure 5.
We now wish to compare, in the worst case. the dif-
ference between optimally accepting a string with the
non-deterministic machine, and deterministically ac-
cepting the same string with the "policy" machine. This
is done by taking a cross product of the two machines,
as shown in Figure 6.
In Figure 6 there are now two weights on each transi-
tion; the first is the cost in the non-deterministic graph,
and the second is the cost in the policy graph. Asymp-
totically, the relationship of the optimal solution to the
policy solution is dominated by the smallest ratio on a
cycle in this graph. In the case of Figure 6, there is a
cycle from 1,1' to 1,2' and back that has cost in the
non-deterministic graph of 2 + 1 = 3, and cost in the
policy graph of 3 + 3 = 6, giving a ratio of 1/2. That is,
the policy solution can be twice as bad as the optimum
on the string ababababab. . . .
FIGURE 5. A Deterministic "Policy" Automaton for Figure 4
In general, we can find the cycle with the smallest
ratio mechanically, using well known techniques [8,
27]. The idea is to conjecture a ratio r and then reduce
the pairs of weights (x, y) on the arcs to single weights
x - ry. Under this reduction, a cycle with zero weight
has ratio exactly r. If a cycle has negative weight, then r
is too large. The ratio on the negative cycle is used as a
new conjecture, and the process is iterated. (Negative
cycles are detected by running a shortest path algo-
rithm and checking for convergence.) Once we have
found the minimum ratio cycle, we can create a worst
case string in the original automata problem by finding
a path from the start state to the cycle and then repeat-
ing the cycle indefinitely. The ratio of the costs of ac-
cepting the string non-deterministically and determin-
istically will converge to the ratio of the cycle. (The
path taken in the cross product graph will not necessar-
ily bring us to the same cycle, due to the initial path
fragment; we will, nevertheless, do at least as well.)
FIGURE 6. The Cross Product
Conversely, if we have a sufficiently long string with
non-deterministic to deterministic ratio r, then the
string will eventually loop in the cross product graph. If
we remove loops with ratio greater than r we only im-
prove the ratio of the string, so we must eventually find
a loop with ratio at least as small as r.
The above discussion gives us an algorithmic way of
analyzing our original data compression problem. The
possible values of F(i) are encoded in a 17 character
alphabet p0 . . . p16, representing the length of copy
available at each position. The compression algorithm
is described by a non-deterministic machine that ac-
cepts strings of p_i; this machine has costs equal to the
lengths of the codewords used by the algorithm. There
are two parameterized states in this machine: l_x means
that there is a literal codeword under construction with
x spaces still available; c_y means that a copy is in prog-
ress with y characters remaining to copy. The idle state
is l_0 = c_0. In the non-deterministic machine, the possi-
ble transitions are:
start a literal
-> li-i   continue a literal {x > 1)
, r.m
I, -> c,_i  start a copy
r.(o)
Cy -> c"-i continue a copy
P.12)
P.W
(2)
(An asterisk is used as a wild card to denote any state.)
Based on the theorems above we have already elimi-
nated some transitions to simplify what follows. For
example,
    c_y -> l_15   start a literal from inside a copy (y >= 1)   (3)
is unnecessary. The deterministic machine is given
the following periodic pattern of copy lengths (6):
 1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
p10 p10 p9  p8  p7  p6  p5  p4  p3  p2  p1  p2  p10 p10 p9
(There is nothing special about 10; it was chosen to
illustrate a long copy and to match the example in
Appendix A.) The deterministic algorithm takes a copy
of length 10 in the first position, and then switches to a
literal for positions 11 and 12. Five bytes are used in
each repetition of the pattern. The optimal solution is
one position out of phase. It takes a copy of length 10 in
the second position, and then finds a copy of length 2 at
position 12, for a total of four bytes on each iteration.
We have abstracted the problem so that the possible
copy operations are described by a string of p_i, and we
have shown a pathological pattern of p_i that results in
5/4 of the optimal encoding. There might still be some
doubt that such a string exists, since the condition that
our third machine (5) guarantees, F(i + 1) >= F(i) - 1, is
a necessary but not sufficient condition. Nevertheless,
the details of an actual pathological string can be found
in Appendix A.
--- ifmail v.2.14dev2
 * Origin: [not found] (2:5020/400@fidonet)


 RU.COMPRESS 
 From : Disturbo                            2:5020/400      05 Dec 98  17:36:26
 To   : Bulat Ziganshin
 Subj : LZFG (CACM publ.) 2 of 4

From: Disturbo <pprgd@aha.ru>
4. A SIMPLER DATA STRUCTURE
Although the quantity of code associated with A1 is not
enormous, it is complicated, and the data structures are
fairly large. In this section, we present simpler methods
for finding the suffix and for propagating the window
position.
The alternative to a percolating update is to update
the positions in all nodes back to the root whenever a
new leaf is inserted. Then no updates are needed when
nodes are deleted. The update flags can be eliminated.
The alternative to suffix pointers is more compli-
cated. The cost of movement in a tree is not uniform:
moving deeper requires a hash table lookup, which is
more expensive than following a parent pointer. So we
can determine the suffix by starting at the suffix leaf
and following parent pointers back toward the root un-
til the suffix node is reached. The suffix leaf is known
because the string at i matched the string at some ear-
lier window position j; the suffix leaf j + 1 is the next
entry in the leaf array. With this change, the suffix
pointers can be eliminated.
From a theoretical perspective, these modifications,
which have O(nd) worst case performance for a file of
size n and cut-off depth d, are inferior to the O(n) per-
formance of the suffix tree. For A1, with a cutoff of 16,
these modifications improve average performance, but
the A2 method discussed in the next section has such a
deep cut-off that suffix pointers and percolated updates
are preferable.
5. A MORE POWERFUL ENCODING
The 4,096-byte window of A1 is roughly optimal for
fixed size copy and literal codewords. Longer copies
would, on average, be found in a larger window, but a
larger displacement field would be required to encode
them. To exploit a larger window, we must use a
variable-width encoding statistically sensitive to the
fact that recent window positions are more likely to be
used by copy codewords than those positions further
back. Similarly, it is advantageous to use variable-
width encodings for copy and literal lengths.
There are several approaches we might use for vari-
able-length encoding. We could use fixed or adaptive
Huffman coding, arithmetic encoding, a variable-length
encoding of the integers, or a manageable set of hand-
designed codewords. We eliminated from consideration
adaptive Huffman and arithmetic coding because they
are slow. Moreover, we felt they would provide (at best)
a secondary adaptive advantage since the "front end"
textual substitution is itself adapting to the input. We
experimented with a fixed Huffman encoding, a hand-
designed family of codewords, and a variable-length
encoding of the integers, so we will compare these
options briefly:
Hand-Designed Codewords. This is a direct generaliza-
tion of A1, with short copies that use fewer bits but
cannot address the full window, and longer copies that
can address larger blocks further back in the window,
With a few codewords, this is fast and relatively easy to
implement. However, some care must be taken in the
choice of codewords to maximize compression.
Variable-Length Integers. The simplest method we
tried uses a unary code to specify field width, followed
by the field itself. Copy length and displacement fields
are coded independently via this technique, so any cor-
relations are ignored. There are more elaborate codings
of the integers (such as [9], [10], or [13]) that have been
used by [15] and [34] in their implementations of
Lempel-Ziv compression. These encodings have nice
asymptotic properties for very large integers, but the
unary code is best for our purposes since, as we will see
shortly, it can be tuned easily to the statistics of the
application. The unary code has the additional advan-
tage of a simple hardware implementation. We will
return to the unary code in more detail shortly.
Fixed Huffman. Ideally, a fixed Huffman encoder
should be applied to source consisting of the copy
length and displacement concatenated together (to cap-
ture the correlation of these two fields). However, since
we wish to expand window size to 16384 and maxi-
mum copy length to 2000, the realities of gathering
statistics and constructing an implementation dictate
that we restrict the input of the fixed Huffman com-
pressor to a size much smaller than 2000 x 16384 by
grouping together codes with nearly equal copy lengths
and displacements. To improve speed we use tables to
encode and decode a byte at a time. Nevertheless, the
fixed Huffman approach is the most complex and slow-
est of the three options compared here.
To decide how much compression could be increased
with a Fixed Huffman approach, we experimented with
several groupings of nearly equal copy lengths and dis-
placements, using a finer granularity for small values,
so that the input to the Fixed Huffman compressor had
only about 30,000 states, and we computed the entropy
to give a theoretical bound on the compression. The
smallest entropy we obtained was only 4 percent more
compact than the actual compression achieved with the
unary encoding described below; and any real imple-
mentation would do worse than an entropy bound.
Consequently, because the Fixed Huffman approach
did not achieve significantly higher compression, we
favor the simpler unary code, though this is not an
overwhelmingly clear choice.
Define a (start, step, stop) unary code of the integers
as follows: The nth codeword has n ones followed by a
zero followed by a field of size start + n*step. If the
field size is equal to stop then the preceding zero can
be omitted. The integers are laid out sequentially
through these codewords. For example, (3, 2, 9) would
look like:
   Range        Codeword
     0-7        0xxx
     8-39       10xxxxx
    40-167      110xxxxxxx
   168-679      111xxxxxxxxx
Appendix B contains a simple procedure that gener-
ates unary codes.
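A sketch in the same spirit (this is not the article's Appendix B code; it
just walks the (start, step, stop) progression described above):
--8<--------
def unary_code(n, start, step, stop):
    # k ones, a zero (omitted when the field reaches 'stop' bits),
    # then a field of start + k*step bits.
    k, base, width = 0, 0, start
    while n - base >= (1 << width):      # find the codeword group n falls into
        base += 1 << width
        k += 1
        width += step
        if width > stop:
            raise ValueError("value out of range for this code")
    prefix = "1" * k + ("" if width == stop else "0")
    return prefix + format(n - base, "0{}b".format(width))

# Reproduce the (3, 2, 9) example: ranges 0-7, 8-39, 40-167, 168-679.
for v in (0, 8, 40, 168, 679):
    print(v, unary_code(v, 3, 2, 9))
--8<--------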
The A2 textual substitution method encodes copy
length with a (2, 1, 10) code, leading to a maximum
copy length of 2044. A copy length of zero signals a
literal, for which literal length is then encoded with a
(0, 1, 5) code, leading to a maximum literal length of 63
bytes. If copy length is non-zero, then copy displace-
ment is encoded with a (10, 2, 14) code. The exact
maximum copy and literal lengths are chosen to avoid
wasted states in the unary progressions; a maximum
copy length of 2044 is sufficient for the kinds of data
studied in Section 8. The A1 policy for choosing be-
tween copy and literal codewords is used.
Three refinements are used to increase A2's
compression by approximately 1 to 2 percent. First,
since neither another literal nor a copy of length 2 can
immediately follow a literal of less than maximum lit-
eral length, in this situation, we shift copy length codes
down by 2. In other words, in the (2, 1, 10) code for
copy length, 0 usually means literal, 1 means copy
length 2, etc.; but after a literal of less than maximum
literal length, 0 means copy length 3, 1 means copy
length 4, etc.
Secondly, we phase-in the copy displacement encod-
ing for small files, using a (10-x, 2, 14-x) code,
where x starts at 10 and descends to 0 as the number of
window positions grows: for example, x = 10 allows
2^0 + 2^2 + 2^4 = 21 values to be coded, so when the
number of window positions exceeds 21, x is reduced to
9; and so forth.
Finally, to eliminate wasted states in the copy dis-
placement encoding, the largest field in the (10 - x,
2, 14-x) progression is shrunk until it is just large
enough to hold all values that must be represented; that
is, if v values remain to be encoded in the largest field
then smaller values are encoded with ⌊log2 v⌋ bits and
larger values with ⌈log2 v⌉ bits rather than 14 - x bits.
This trick increases compression during startup, and, if
the window size is chosen smaller than the number of
values in the displacement progression, it continues to
be useful thereafter. For example, the compression
studies in Section 8 use an A2 window size of 16,384
characters, so the (10, 2, 14) code would waste 5,120
states in the 14-bit field without this trick.
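This wasted-states trick is the familiar truncated (phased-in) binary code; a sketch of the field encoder, under our own naming:

def truncated_binary(value, v):
    """Encode value in [0, v): the first few values get floor(log2 v) bits,
    the rest get ceil(log2 v) bits, so no states are wasted.
    Returns the field as a '0'/'1' string."""
    assert 0 <= value < v
    k = v.bit_length() - 1            # floor(log2 v)
    short = (1 << (k + 1)) - v        # how many values fit in k bits
    if value < short:
        return format(value, "0{}b".format(k)) if k else ""
    return format(value + short, "0{}b".format(k + 1))

With the 16,384-character window of Section 8, v = 16,384 - 1,024 - 4,096 = 11,264 values remain for the largest field, so 5,120 of them are written with 13 bits instead of 14, which is the saving quoted above.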
Percolating update seems preferable for the imple-
mentation of A2 because of the large maximum copy
length; with update-to-root, pathological input could
slow the compressor by a factor of 20. Unfortunately,
the percolating update does not guarantee that the suf-
fix tree will report the nearest position for a match, so
longer codewords than necessary may sometimes be
used. This problem is not serious because the tree is
often shallow, and nodes near the root usually have
many sons, so updates propagate much more rapidly
than assumed in the analysis of Section 3. On typical
files, compression with percolated update is 0.4 percent
less than with update-to-root.
6. A FASTER COMPRESSOR
A2 has very fast expansion with a small storage re-
quirement, but, even though compression has constant
amortized time, it is 5 times slower than expansion. A1
and A2 are most appropriate in applications where
compression speed is not critical and where the per-
formance of the expander needs to be optimized, such
as the mass release of software on floppy disks. How-
ever, in applications such as file archiving, faster
compression is needed. For this reason, we have devel-
oped the B1 and B2 methods described here, which use
the same encodings as A1 and A2, respectively, but
compute window displacement differently. Copy code-
words are restricted to start at the beginning of the yth
previous codeword or literal character emitted: they
can no longer address every earlier character, but only
those where literal characters occurred or copy code-
words started; we refer to displacements computed this
way as "compressed displacements" throughout. Copy
length is still measured in characters, like A1. By in-
serting this level of indirection during window access,
compression speed typically triples, though expansion
and the rate of adaptation are somewhat slower.
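One way to picture this level of indirection (a sketch with our own naming, not the paper's data structure): keep a growing list of the character positions at which previous codewords and literal characters started, so a copy displacement becomes an index counted backwards through that list rather than a raw character offset.

class CompressedDisplacements:
    """Translate between character positions and displacements measured
    in previously emitted codewords / literal characters."""
    def __init__(self):
        self.starts = []              # character position of each emitted item

    def note_item(self, char_pos):
        """Record that a codeword or literal character began at char_pos."""
        self.starts.append(char_pos)

    def to_compressed(self, char_pos):
        """Displacement, in items, back to the item starting at char_pos."""
        return len(self.starts) - 1 - self.starts.index(char_pos)

    def to_char_pos(self, displacement):
        """Character position addressed by a compressed displacement."""
        return self.starts[len(self.starts) - 1 - displacement]

The linear search in to_compressed keeps the sketch short; a real implementation would index these positions directly.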
With "compressed displacements," suffix pointers
and update propagation are unnecessary and a simpler
PATRICIA tree can be used for the dictionary. Entries are
made in the tree only on codeword boundaries, and
this can be done in linear time by starting at the root
on each iteration. It is useful to create an array of per-
manent nodes for all characters at depth 1. Since copy
codewords of length 1 are never issued, it doesn't mat-
ter that some permanent nodes don't correspond to any
window character. Each iteration begins by indexing
into this node array with the next character. Then hash
table lookups and arc character comparisons are used
to descend deeper, as in A1. The new window position
is written into nodes passed on the way down, so up-
date propagation is unnecessary.
In short, the complications necessary to achieve con-
stant average time per source character with A2 are
eliminated. However, one new complication is intro-
duced. In the worst case, the 16,384 window positions
of B2 could require millions of characters, so we impose
a limit of 12 x 16,384 characters; if the full window
exceeds this limit, leaves for the oldest window posi-
tions are purged from the tree.
Because of slower adaptation, B2 usually compresses
slightly less than A2 on small files. But on text and
program source files, it surpasses A2 by 6 to 8 percent
asymptotically; the crossover from lower compression
to higher occurs after about 70,000 characters! A2 code-
words find all the near-term context, while B2 is re-
stricted to start on previous codeword boundaries but
can consequently reach further back in the file. This
gives B2 an advantage on files with a natural word
structure, such as text, and a disadvantage on files
where nearby context is especially important, such as
scanned images.
We also tried variations where the tree is updated
more frequently than on every codeword boundary and
literal character. All variations up to and including A2
can be implemented within the general framework of
this method, if speed is not an issue. For example, we
found that about 1 percent higher compression can be
achieved by inserting another compressed position be-
tween the two characters represented by each length 2
copy codeword and another 0.5 percent by also insert-
ing compressed positions after each character repre-
sented by length 3 copy codewords. However, because
these changes slow compression and expansion we
haven't used them.
7. IMPROVING THE COMPRESSION RATIO
In Section 6 we considered ways to speed up compres-
sion at the cost of slower adaptation and expansion. In
this section we will explore the other direction: im-
proving the compression ratio with a slight cost to the
running time of the algorithm.
When a string occurs frequently in a file, all the
methods we have considered so far waste space in their
encoding: when they are encoding the repeating string,
they are capable of specifying the copy displacement to
multiple previous occurrences of the string, yet only
one string needs to be copied. By contrast, the data
structures we have used do not waste space. The re-
peating strings share a common path near the root. If
we base the copy codewords directly on the data struc-
ture of the dictionary, we can improve the compression
ratio significantly. (This brings us closer to the second
style of Ziv and Lempel's textual substitution work [19,
29, 43], where a dictionary is maintained by both the
compressor and expander. However, since we still use a
window and an explicit copy length coding, it is natural
to view this as a modification of our earlier compres-
sors, in the style of Ziv and Lempel's first textual sub-
stitution work.)
The C2 method uses the same PATRICIA tree data
structures as B2 to store its dictionary. Thus it takes
two pieces of information to specify a word in the dic-
tionary: a node, and a location along the arc between
the node and its parent (since PATRICIA tree arcs may
correspond to strings with more than one character).
We will distinguish two cases for a copy: if the arc is at
a leaf of the tree, then we will use a LeafCopy code-
word, while if the arc is internal to the tree we will use
a NodeCopy codeword. Essentially, those strings appear-
ing two or more times in the window are coded with
NodeCopies, avoiding the redundancy of A2 or B2 in
these cases.
The C2 encoding begins with a single prefix bit that
is 0 for a NodeCopy, 1 for a LeafCopy or Literal.
For NodeCopy codewords, the prefix is followed by a
node number in [0 . . maxNodeNo], where maxNodeNo is
the largest node number used since initialization: for
most files tested, maxNodeNo is about 50 percent of the
number of leaves. Following the node number, a dis-
placement along the arc from the node to its parent is
encoded; for most NodeCopy codewords the incoming
arc is of length 1, so no length field is required. If a
length field is required, 0 denotes a match exactly at
the node, 1 a displacement 1 down the arc from the
parent node, etc. Rarely is the length field longer than
one or two bits because the arc lengths are usually
short, so all possible displacements can be enumerated
with only a few bits. For both the node number and the
incoming arc displacement, the trick described in Sec-
tion 5 is used to eliminate wasted states in the field;
that is, if v values must be encoded, then the smaller
values are encoded with ⌊log2 v⌋ bits and larger values
with ⌈log2 v⌉ bits.
LeafCopies are coded with unary progressions like
those of A2 or B2. A (1, 1, 11) progression is used to
specify the distance of the longest match down the leaf
arc from its parent node, with 0 denoting a literal; this
progression leads to a maximum copy length of 4094
bytes. Since another literal never occurs immediately
after a literal of less than maximum literal length, the
LeafCopy arc distance progression is shifted down by 1
when the preceding codeword was a literal (i.e., arc
displacement 1 is coded as 0, 2 as 1, etc.). On a cross
section of files from the data sets discussed later, dis-
tance down the leaf arc was highly skewed, with about
half the arc displacements occurring one character
down the leaf arc. Because of this probability spike at 1
and the rapid drop off at larger distances, the average
length field is small. Following the length field, the
window position is coded by gradually phasing in
a (10, 2, 14) unary progression exactly like B2's.
Literals are coded by first coding a LeafCopy arc dis-
placement of 0 and then using a (0, 1, 5) unary progres-
sion for the literal length exactly like B2.
Unlike A2 and B2, the expander for C2 must main-
tain a dictionary tree exactly like the compressor's tree
to permit decoding. Notice that this is not as onerous as
it might seem. During compression, the algorithm must
search the tree downward (root toward leaves) to find
the longest match, and this requires a hash table access
at each node. By contrast, the expander is told which
node was matched, and can recover the length and
window position of the match from the node. No hash
table is required, but the encoding is restricted: a copy
codeword must always represent the longest match
found in the tree; in particular, the superior heuristic
used by B2 to choose between Literal and Copy code-
words must be discarded; instead, when the longest
match is of length 2 or more, a copy codeword must
always be produced. With this restriction, the expander
can reconstruct the tree during decoding simply by
hanging each new leaf from the node or arc indicated
by the NodeCopy or LeafCopy codeword, or in the case of
literals, by hanging the leaf from the permanent depth
1 node for each literal character.
--- ifmail v.2.14dev2
 * Origin: [not found] (2:5020/400@fidonet)


 RU.COMPRESS 
 From : Disturbo                            2:5020/400      05 Dec 98  17:36:27
 To   : Bulat Ziganshin
 Subj : LZFG (CACM publ.) 3 of 4

From: Disturbo <pprgd@aha.ru>
8. EMPIRICAL STUDIES
In this section, we compare the five compression meth-
ods we have developed with other one-pass, adaptive
methods. For most other methods, we do not have well-
tuned implementations and report only compression
results. For implementations we have tuned for effi-
ciency, speed is also estimated (for our 3 MIP, 16-bit
word size, 8 megabyte workstations). The execution
times used to determine speed include the time to
open, read, and write files on the local disk (which has
a relatively slow maximum transfer rate of 5 megabits
per second); the speed is computed by dividing the un-
compressed file size by the execution time for a large
file.
We tested file types important in our working envi-
ronment. Each number in Table 1 is the sum of the
compressed file sizes for all files in the group divided
by the sum of the original file sizes. Charts 1-3 show
the dependency of compression on file size for all of the
compression methods tested on the source code (SC)
data set.
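In other words, each Table 1 entry is the size-weighted ratio (a trivial sketch with our own function name):

def group_compression_ratio(files):
    """Compression figure reported per data set in Table 1:
    total compressed size divided by total original size,
    for files given as (original_size, compressed_size) pairs."""
    return sum(c for _, c in files) / sum(o for o, _ in files)

Note that this weights large files more heavily than averaging the per-file ratios would.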
Data Sets
SC Source Code. All 8-bit Ascii source files from
which the boot file for our programming environment
is built. Files include some English comments, and a
densely-coded collection of formatting information at
the end of each file reduces compressibility. The files
themselves are written in the Cedar language. (1185
files, average size 11 Kbytes, total size 13.4 Mbytes)
TM Technical Memoranda. All files from a directory
where computer science technical memoranda and
reports are filed, excluding those containing images.
These files are 8-bit Ascii text with densely-coded for-
matting information at the end (like the source code).
(134 files, average size 22 Kbytes, total size 2.9 Mbytes)
NS News Service. One file for each work day of a
week from a major wire service: these files are 8-bit
Ascii with no formatting information. Using textual
substitution methods, these do not compress as well as
the technical memoranda of the previous study group,
even though they are much larger and should be less
impacted by startup transient: inspection suggests that
the larger vocabulary and extensive use of proper
names might be responsible for this. (5 files, average
size 459 Kbytes, total size 2.3 Mbytes)
CC Compiled Code. The compiled-code files produced
from the SC data set. Each file contains several differ-
ent regions: symbol names, pointers to the symbols,
statement boundaries and source positions for the de-
bugger, and executable code. Because each region is
small and the regions have different characteristics,
these files severely test an adaptive compressor. (1220
files, average size 13 Kbytes, total size 16.5 Mbytes)
BF Boot File. The boot file for our programming envi-
ronment, basically a core image and memory map.
(1 file, size 525 Kbytes)
SF Spline Fonts. Spline-described character fonts used
to generate the bitmaps for character sets at a variety of
resolutions. (94 files, average size 39 Kbytes, total size
3.6 Mbytes)
RCF Run-coded Fonts. High-resolution character
fonts, where the original bitmaps have been replaced
by a run-coded representation. (68 files, average size 47
Kbytes, total size 3.2 Mbytes)
SNI Synthetic Images. All 8 bit/pixel synthetic image
files from the directory of an imaging researcher. The
44 files are the red, green, and blue color separations
for 12 color images, 2 of which also have an extra file to
encode background transparency; in addition, there are
6 other grey scale images. (44 files, average size 328
Kbytes, total size 14.4 Mbytes)
SCI Scanned Images. The red separations for all 8
bit/pixel scanned-in color images from the directory of
an imaging researcher. The low-order one or two bits of
each pixel are probably noise, reducing compressibility.
(12 files, average size 683 Kbytes, total size 8.2 Mbytes)
BI Binary Images. CCITT standard images used to
evaluate binary facsimile compression methods. Each
file consists of a 148-byte header followed by a binary
scan of 1 page (1728 pixels/scan line x 2376 scan lines/
page). Some images have blocks of zeros more than
30,000 bytes long. Because these files are composed of
1-bit rather than 8-bit items, the general-purpose com-
pressors do worse than they otherwise might. (8 files,
average size 513 Kbytes, total size 4.1 Mbytes)
The special-purpose CCITT 1D and 2D compression
methods reported in [18] achieve, respectively, 0.112
and 0.064 compression ratios on these standard images
when the extraneous end-of-line codes required by the
facsimile standard are removed and when the extra-
neous 148-byte header is removed. The special-purpose
CCITT 2D result is significantly more compact than any
general purpose method we tested, and only CW and
C2 surpassed the 1D result.
Measurements and Compression Methods
H0 and H1. These are entropy calculations made on a
per file basis according to:

   H0 = - SUMi P(x = ci) log2 P(x = ci),                        (7)

   H1 = - SUMi,j P(x = ci, y = cj) log2 P(y = cj | x = ci),     (8)

where x is a random symbol of the source, xy is a
randomly chosen pair of adjacent source characters,
and ci ranges over all possible symbols. Because of the
small file size, the curves in Charts 1 to 3 drop off to
the left. In theory, this small sampling problem can be
corrected according to [2], but we have found it diffi-
cult to estimate the total character set size in order to
apply these corrections. Nevertheless, Chart 1 shows
that H0 is a good estimator for how well a memoryless
(zero-order) compressor can do when file size is a large
multiple of 256 bytes, and H1 bounds the compression
for a first-order Markov method. (None of our files were
large enough for H1 to be an accurate estimator.)
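A per-file computation of H0 and H1 as defined in equations (7) and (8) might look like the following sketch (ours; empirical frequencies stand in for the probabilities, and dividing by 8 converts bits per 8-bit character into the fraction-of-original-size units used in Table 1):

from collections import Counter
from math import log2

def h0_h1(data: bytes):
    """Empirical zero-order and first-order entropies in bits per character."""
    n = len(data)
    singles = Counter(data)
    h0 = -sum((c / n) * log2(c / n) for c in singles.values())

    pairs = Counter(zip(data, data[1:]))      # adjacent character pairs
    firsts = Counter(data[:-1])
    m = n - 1
    h1 = -sum((c / m) * log2(c / firsts[x])   # P(x=ci, y=cj) * log2 P(y=cj | x=ci)
              for (x, y), c in pairs.items())
    return h0, h1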
KG and V. These adaptive methods maintain a Huff-
man tree based on the frequency of characters seen
so far in a file. The compressor and expander have
roughly equal performance. The theory behind the
KG approach appears in [11] and [23]. The similar V
method, discussed in [37], should get better compression
during the startup transient at the expense of being
about 18 percent slower. It is also possible to bound the
performance of Vitter's scheme closely to that of a fixed
non-adaptive compressor. Except on the highly com-
pressible CCITT images, these methods achieve
compression slightly worse than H0, as expected. But
because of bit quantization, the compression of the
CCITT images is poor; arithmetic coding would com-
press close to H0 even on these highly compressible
sources.
CW. Based on [6], this method gathers higher-order
statistics than KG or V above (which we ran only on
zeroth-order statistics). The method that Cleary and
Witten describe keeps statistics to some order o and
encodes each new character based on the context of the
o preceding characters. (We've used o = 3, because any
higher order exhausts storage on most of our data sets.)
If the new character has never before appeared in the
same context, then an escape mechanism is used to
back down to smaller contexts to encode the character
using those statistics. (We've used their escape mecha-
nism A with exclusion of counts from higher-order con-
texts.) Because of high event probabilities in some
higher-ordered contexts and the possibility of multiple
escapes before a character is encoded, the fractional bit
loss of Huffman encoding is a concern, so [6] uses arith-
metic encoding. We have used the arithmetic encoder
in [40].
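To make the escape mechanism concrete, here is a rough sketch of the modelling side only (our simplification: no exclusion, no storage limit, and ideal code lengths -log2 p are summed instead of driving the arithmetic coder of [40]):

from collections import defaultdict
from math import log2

def cw_ideal_bits(data: bytes, order: int = 3):
    """Ideal code length, in bits, of coding `data` with adaptive contexts
    of up to `order` preceding characters and escape mechanism A."""
    counts = defaultdict(lambda: defaultdict(int))   # context -> symbol -> count
    bits = 0.0
    for i, sym in enumerate(data):
        # try the longest available context first, escaping to shorter ones
        for k in range(min(order, i), -1, -1):
            seen = counts[data[i - k:i]]
            total = sum(seen.values())
            if seen[sym] > 0:
                bits += -log2(seen[sym] / (total + 1))
                break
            bits += -log2(1 / (total + 1))           # escape (method A)
        else:
            bits += 8                                # never seen before: send it raw
        for k in range(min(order, i) + 1):           # update every context order
            counts[data[i - k:i]][sym] += 1
    return bits

Dividing the result by 8 times the file length gives a rough analogue of the CW column of Table 1, ignoring coding overhead and the exclusion refinement.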
TABLE 1. Comparison of Compression Methods

                 Text                Binary        Fonts          Images
Method      SC     TM     NS     CC     BF     SF     RCF    SNI    SCI    BI
H0         .732   .612   .590   .780   .752   .626   .756   .397   .845   .148
H1         .401   .424   .467   .540   .573   .380   .597   .181   .510   .101
KG         .751   .625   .595   .804   .756   .637   .767   .415   .850   .205
V          .749   .624   .595   .802   .756   .637   .766   .414   .850   .205
CW         .369   .358   .326   .768   .544   .516   .649   .233   .608   .106
MW1        .508   .470   .487   .770   .626   .558   .705   .259   .728   .117
MW2        .458   .449   .458   .784   .594   .526   .692   .270   .774   .117
UW         .521   .476   .442   .796   .638   .561   .728   .255   .697   .118
BSTW       .426   .434   .465    -     .684    -     .581    -      -      -
A1         .430   .461   .520   .741   .608   .502   .657   .351   .766   .215
A2         .366   .395   .436   .676   .538   .460   .588   .259   .709   .123
B1         .449   .458   .501   .753   .616   .505   .676   .349   .777   .213
B2         .372   .403   .410   .681   .547   .459   .603   .255   .714   .117
C2         .360   .376   .375   .668   .527   .445   .578   .233   .662   .105
As Table I shows, CW achieves excellent compres-
sion. Its chief drawbacks are its space and time per-
formance. Its space requirement can grow in proportion
to file size; for example, statistics for o = 3 on random
input could require a tree with 256^4 leaves, though
English text requires much less. The space (and conse-
quently time) performance of CW degrades dramati-
cally on "more random" data sets like SNI and SCI. A
practical implementation would have to limit storage
somehow. Even on English, Bell, Cleary, and Witten
estimate that Moffat's tuned implementation of CW
is 3 times slower compressing and 5 times slower ex-
panding than C2 [4].
MW1. This method, described in [29], is related to the
second style of Lempel-Ziv compression, alluded to in
the introduction. It uses a trie data structure and 12-bit
codes. Initially (and always) the dictionary contains 256
one-character strings. New material is encoded by find-
ing the longest match in the dictionary, outputting the
associated code, and then inserting a new dictionary
entry that is the longest match plus one more charac-
ter. After the dictionary has filled, each iteration re-
claims an old code from among dictionary leaves, fol-
lowing a LRU discipline, and reuses that code for the
new dictionary entry. The expander works the same
way. MW1 is simple to implement and is balanced in
performance, with good speed both compressing and
expanding (250,000 bits/sec and 310,000 bits/sec respec-
tively). The original method used 12-bit codes through-
out for simplicity and efficiency. However, our imple-
mentation starts by using 9-bit codewords, increasing to
10, 11, and finally to 12 bits as the dictionary grows to
its maximum size: this saves up to 352 bytes in the
compressed file size. On text and source code, Miller
and Wegman determined that the 12-bit codeword size
is close to optimal for this method.
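The dictionary-building loop itself is easy to sketch (our code; the LRU reclamation of leaf codes and the 9-to-12-bit phase-in described above are omitted for brevity):

def mw1_codes(data: bytes, max_codes: int = 4096):
    """Greedy longest-match parse with 'longest match plus one character'
    dictionary growth, in the style of MW1."""
    dictionary = {bytes([b]): b for b in range(256)}
    next_code = 256
    out = []
    i = 0
    while i < len(data):
        j = i + 1
        while j < len(data) and data[i:j + 1] in dictionary:
            j += 1
        out.append(dictionary[data[i:j]])          # longest match in the dictionary
        if next_code < max_codes and j < len(data):
            dictionary[data[i:j + 1]] = next_code  # longest match plus one character
            next_code += 1
        i = j
    return out

Each code would then be written with 12 bits (or 9 to 12 bits with the phase-in), and the expander rebuilds the same dictionary as it decodes.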
MW2. One drawback of MW1 is the slow rate of
buildup of dictionary entries. If, for example, the word
abcdefghi appears frequently in a document, then ab
will be in the dictionary after the first occurrence, abc
after the second, and so on, with the full word present
only after 8 occurrences (assuming no help from similar
words in the document). A1 below, for example, would
be able to copy the whole word abcdefghi after the first
occurrence, but it pays a penalty for the quick response
by having a length field in its copy codeword. The idea
of MW2 is to build dictionary entries faster by combin-
ing adjacent codewords of the MW1 scheme. Longer
words like abcdefghi are built up at an exponential
rather than linear rate. The chief disadvantage of MW2
is its increased complexity and slow execution. Our
implementation follows the description in [29] and uses
an upper limit of 4096 dictionary entries (or 12-bit
codewords). We did not implement the 9-12 bit phase-
in that was used in MW1, so the size-dependent charts
underestimate MW2's potential performance on small
files.
UW. This is the Compress utility found in the Berke-
ley 4.3 Unix, which modifies a method described in a
paper by Welch [39]: the authors of this method are
S. Thomas, J. McKie, S. Davies, K. Turkowski, J. Woods,
and J. Orost. It builds its dictionary like MW1, gradu-
ally expanding the codeword size from 9 bits initially
up to 16 bits. The dictionary is frozen after 65,536 en-
tries, but if the compression ratio drops significantly,
the dictionary is discarded and rebuilt from scratch. We
used this compressor remotely on a Vax-785, so it is
difficult to compare its running time and implementa-
tion difficulties with the other methods we imple-
mented. Nevertheless, because it does not use the LRU
collection of codes, it should be faster than MW1. How-
ever, it has a larger total storage requirement and gets
worse compression than MW1 on most data sets
studied.
BSTW. This method first partitions the input into
alphanumeric and non-alphanumeric "words," so it is
specialized for text, though we were able to run it on
some other kinds of data as well. The core of the com-
pressor is a move-to-front heuristic. Within each class,
the most recently seen words are kept on a list (we
have used list size 256). If the next input word is al-
ready in the word list, then the compressor simply
encodes the position of the word in the list and then
moves the word to the front of the list. The move-to-
front heuristic means that frequently used words will
be near the front of the list, so they can be encoded
with fewer bits. If the next word in the input stream is
not on the word list, then the new word is added to the
front of the list, while another word is removed from
the end of the list, and the new word must be com-
pressed character-by-character.
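The move-to-front core can be sketched as follows (our code and naming; the actual position and new-word encodings are discussed next):

def move_to_front(words, list_size=256):
    """Yield ('pos', index) for words already on the recency list, or
    ('new', word) for words that must be spelled out character by character."""
    recent = []
    for w in words:
        if w in recent:
            idx = recent.index(w)
            yield ("pos", idx)
            recent.pop(idx)
            recent.insert(0, w)
        else:
            yield ("new", w)
            recent.insert(0, w)
            if len(recent) > list_size:
                recent.pop()          # drop the word at the end of the list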
Since the empirical results in [5] do not actually give
an encoding for the positions of words in the list or for
the characters in new words that are output, we have
taken the liberty of using the V compressor as a subrou-
tine to generate these encodings adaptively. (There are
actually four copies of Vitter's algorithm running, one
to encode positions and one to encode characters in
each of two partitions.) Using an adaptive Huffman is
slow; a fixed encoding would run faster, but we expect
that a fixed encoding would slightly reduce compres-
sion on larger files while slightly improving compres-
sion on small files. We could not run BSTW for all of
the data sets, since the parsing mechanism assumes
human-readable text and long "words" appear in the
other data sets. When the unreadable input parsed
well, as in the case of run-coded fonts, the compression
was very good.
A1. This is our basic method described earlier. It has
a fast and simple expander (560,000 bits/sec) with a
small storage requirement (10,000 bytes). However, the
compressor is much slower and larger (73,000 bits/sec,
145,000 bytes using scan-from-leaf and update-to-root).
The encoding has a maximum compression to 1/8 = 12.5
percent of the original file size because the best it can
do is copy 16 characters with a 16-bit codeword.
Caveat: As we mentioned above, the running times
reported include the file system overhead for a rela-
tively slow disk. To provide a baseline, we timed a file
copy without compression and obtained a rate of
760,000 bits per second. Thus, some of the faster expan-
sion rates we report are severely limited by the disk.
For example, we estimate that without disk overhead
the A1 expander would be about twice as fast. On the
other hand, removing disk overhead would hardly af-
fect the compression speed of A1.
A2. This method, discussed in Section 5, enlarges the
window to 16,384 characters and uses variable-width
unary-coded copy and literal codewords to significantly
increase compression. The running time and storage
requirements are 410,000 bits/sec and 21,000 bytes for
expansion and 60,000 bits/sec and 630,000 bytes for
compression (using suffix pointers and percolated
update).
B1. This method, discussed in Section 6, uses the A1
encoding but triples compression speed by updating the
tree only at codeword boundaries and literal charac-
ters. The running time and storage requirements are
470,000 bits/sec and 45,000 bytes for expansion and
230,000 bits/sec and 187,000 bytes for compression.
B2. This method, discussed in Section 6, uses the
same encoding as A2 but triples compression speed by
updating the tree only at codeword boundaries and lit-
eral characters. The compressor and expander run at
170,000 and 380,000 bits/sec, respectively, and have
storage requirements of 792,000 and 262,000 bytes.
C2. This method, discussed in Section 7, uses the
same data structures as B2 but a more powerful encod-
ing based directly upon the structure of the dictionary
tree. Compression is about the same and expansion
about 25 percent slower than B2; the compressor uses
about the same storage as B2, but the expander uses
more (about 529,000 bytes).
Table 1 highlights some differences between textual
substitution methods like C2 and statistical methods
like CW. (Time and space performance differences have
been discussed earlier.) There are several data sets
where these methods differ dramatically. On NS, CW is
significantly better than C2. We believe that this is be-
cause NS shows great diversity in vocabulary: a prop-
erty that is troublesome for textual substitution, since it
cannot copy new words easily from elsewhere in the
document, but is benign for CW, since new words are
likely to follow the existing English statistics. On CC,
for example, C2 is significantly better than CW. We
believe that this is because CC contains several radi-
cally different parts, e.g. symbol tables, and compiled
code. C2 is able to adjust to dramatic shifts within a
file, due to literal codewords and copy addressing that
favors nearby context, while CW has no easy way to
rapidly diminish the effect of older statistics.
For all of our methods, A2, B2, and C2, window size
is a significant consideration because it determines
storage requirements and affects compression ratios.
Chart 4 shows compression as a function of window
size for the NS data set (concatenated into a single file
to avoid start-up effects), and for the BF boot file. These
two data sets were typical of the bimodal behavior we
observed in our other data sets: large human-readable
files benefit greatly from increasing window size, while
other test groups show little improvement beyond a
window size of 4K.
--- ifmail v.2.14dev2
 * Origin: [not found] (2:5020/400@fidonet)


 RU.COMPRESS 
 From : Disturbo                            2:5020/400      05 Dec 98  17:36:29
 To   : Bulat Ziganshin
 Subj : LZFG (CACM publ.) 4 of 4

From: Disturbo <pprgd@aha.ru>
9. CONCLUSIONS
We have described several practical methods for loss-
less data compression and developed data structures to
support them. These methods are strongly adaptive in
the sense that they adapt not only during startup but
also to context changes occurring later. They are suit-
able for most high speed applications because they
make only one pass over source data, use only a con-
stant amount of storage, and have constant amortized
execution time per character.
Our empirical studies point to several broad generali-
zations. First, based on the H0 and H1 theoretical lim-
its, textual substitution via A2, B2, or C2 surpasses
memoryless or first-order Markov methods applied on a
character-by-character basis on half the data sets. On
the other half, even the CW third-order method can't
achieve the H1 bound. This suggests that, to surpass
textual substitution for general purpose compression,
any Markov method must be at least second-order, and
to date, all such methods have poor space and time
performance.
Secondly, the methods we've developed adapt rap-
idly during startup and at transitions in the middle of
files. One reason for rapid adaptation is the use of
smaller representations for displacements to recent po-
sitions in the window. Another reason is the inclusion
of multi-character literal codewords. Together the liter-
als and short displacements allow our methods to per-
form well on short files, files with major internal shifts
of vocabulary or statistical properties, and files with
bursts of poorly compressing material-all properties of
a significant number of files in our environment.
Thirdly, it appears that the displacement-and-length
approach to textual substitution is especially effective
on small files. On 11,000-byte program source files, for
example, A2 and B2 were over 20 percent more com-
pact than textual substitution methods which did not
use a length field (UW, MW1, and MW2). This is not
surprising because the particular advantage of the
length field in copy codewords is rapid adaptation on
small files. However, even on the largest files tested, A2
and B2 usually achieved significantly higher compres-
sion. Only on images did other methods compete with
them; our most powerful method, C2, achieved higher
compression than any other textual substitution
method we tested on all data sets. The effect of a length
field is to greatly expand dictionary size with little or
no increase in storage or processing time; our results
suggest that textual substitution methods that use a
length field will work better than those which do not.
Fourthly, studies of A2, B2, and C2 using different
window sizes showed that, for human-readable input
(e.g., English, source code), each doubling of window
size improves the compression ratio by roughly 6 per-
cent (for details see Chart 4). Furthermore, the data
structures supporting these methods scale well: running
time is independent of window size, and memory usage
grows linearly with window size. Thus increasing win-
dow size is an easy way to improve the compression
ratio for large files of human-readable input. For other
types of input the window size can be reduced to 4096
without significantly impacting compression.
Going beyond these empirical results, an important
practical consideration is the tradeoff among speed,
storage, and degree of compression; speed and storage
have to be considered for both compression and expan-
sion. Of our own methods, A2 has very fast expansion
with a minimal storage requirement; its weakness is
slow compression; even though the suffix tree data
structure with amortized update has constant amor-
tized time per character, compression is still seven
times slower than expansion. However, in applications
which can afford relatively slow compression, A2 is
excellent; for example, A2 would be good for mass dis-
tribution of software on floppy disks or for overnight
compression of files on a file server. Furthermore, if the
parallel matching in the compression side of A2 were
supported with VLSI, the result would be a fast, power-
ful method requiring minimal storage both compressing
and expanding.
B2 provides nearly three times faster compression
than A2 but has somewhat slower expansion and adap-
tation. Thus B2 is well suited for communication and
archiving applications.
A1 and B1 do not compress as well as A2 and B2,
respectively, but because of their two-codeword, byte-
aligned encodings they are better choices for applica-
tions where simplicity or speed is critical. (For exam-
ple, J. Gasbarro has designed and implemented an
expansion method like A1 to improve the bandwidth of
a VLSI circuit tester [12].)
C2 achieves significantly higher compression than
B2, but its expander is somewhat slower and has a
larger storage requirement. In the compression study
reported in Section 8, C2 achieved the highest compres-
sion of all methods tested on 6 of the 10 data sets.
We believe that our implementations and empirical
results demonstrate the value of window-based textual
substitution. Together the A, B and C methods offer
good options that can be chosen according to resource
requirements.
APPENDIX A
A Pathological Example
We now show a string that has the F pattern of
equation (6) of Section 3:

    1    2    3   4   5   6   7   8   9  10  11  12   13   14  15
   p10  p10  p9  p8  p7  p6  p5  p4  p3  p2  p1  p2  p10  p10  p9 ...    (6)

Hereafter we will stop abstracting the string by its
copy lengths; capital letters are strings, small letters
are single characters, and i, j, r, p, b are integers. The
pathological string follows the pattern:

   M0 M1 ... Mr-1 M0 M1 ... Mr-1 M0 M1 ...                                (9)

where the parameter r is chosen large enough so that
one iteration exceeds the finite window (this prevents
direct copying from the beginning of one M0 to a
subsequent M0). Within each Mi we have groups,

   Mi = Gi0 Gi1 Gi2 ... Gi(n/p-1),                                       (10)

and each group is:

   Gij = Sjp B0 S(j+1)p Ci Sjp+1 B1 S(j+1)p+1 Ci Sjp+2
         B2 S(j+1)p+2 Ci Sjp+3 ... Bp-1 S(j+1)p+b-1 Ci                   (11)

We have introduced two more parameters: p is the
number of minor blocks Bi, and n is the number of s
characters. All of the s subscripts in the above formula
are computed mod n. The groups skew so that, for
example, the beginning of G10 will not match entirely
with the beginning of G00. It will, however, match in
two parts: the prefix (an s followed by B1) appears in
both strings, and the suffix of G10 = ... B1 Sp+1 ... will
match with the suffix of G01. If, for example, B1 has
9 characters, this gives two consecutive locations where
a copy of size 10 is possible, in the pattern of
equation (6).

It remains to create the match of length 2 at position 12
in equation (6). For this purpose, each of the Ci above
are either e's or o's. They will always precede
respectively even and odd numbered s characters, and
match in pairs with their following s's. For example,
the e0 in G00 = ... S1 B1 S1 e0 S2 B2 S2 ... will match with
S2. The e0 S2 match is hidden in a minor block
segregated from the odd numbered s characters:

   B0   = x e0 S0 e0 S2 e0 S4 ... e0 Sb-2
   B1   = x e0 Sb e0 Sb+2 e0 Sb+4 ... e0 S2b-2
   ...                                                                   (12)
   Bp/2 = x o0 S1 o0 S3 o0 S5 ... o0 Sb-1
   ...

This causes p and n to be related by

   pb = 2n.

In the case of our running example, where the finite
window is size 4096 and the maximum copy length
is 16 characters, an appropriate setting of the above
parameters is:

   r = 2, b = 8, p = 100, n = 200                                        (13)

We need to take some care that the heuristic does
not find the optimal solution. It turns out that if we
just start as in equation (9), then the first M0 will not
compress well, but the heuristic will start the behav-
ior we are seeking in M1. Asymptotically we achieve
the worst case ratio between the optimal algorithm
and the policy heuristic.
APPENDIX B
Computing a Unary-Based Variable Length
Encoding of the Integers
In Section 5 we defined a (start, step, stop) unary
code of the integers as a string of n ones followed by
a zero followed by a field of f bits, where f is in the
arithmetic progression defined by (start, step, stop).
This can be defined precisely by the following
encoder:

EncodeVar: PROC [out: CARDINAL, start, step, last: CARDINAL] ~ {
  UNTIL out < Power2[start] DO
    PutBits[1, 1];
    out <- out - Power2[start];
    start <- start + step;
    ENDLOOP;
  IF start < last THEN PutBits[out, start + 1]  -- 0 followed by field of size "start"
  ELSE IF start > last THEN ERROR
  ELSE PutBits[out, start];                     -- save a bit
  };

PutBits: PROC [out: CARDINAL, bits: INTEGER] ~
  Output the binary encoding of "out" in a field of size
  "bits."

Notice that the encoder is able to save one bit in the
last field size of the arithmetic progression.
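Appendix B gives only the encoder; a matching decoder (our sketch in Python rather than the paper's Cedar, reading from a '0'/'1' string) reverses the steps and relies on the same bit-saving rule for the last field size:

def decode_var(bits, start, step, last):
    """Decode one (start, step, last) codeword from the front of `bits`;
    returns (value, number_of_bits_consumed)."""
    pos = 0
    value = 0
    while start < last and bits[pos] == "1":   # ones select later codeword ranges
        value += 1 << start
        start += step
        pos += 1
    if start > last:
        raise ValueError("malformed codeword")
    if start < last:
        pos += 1                               # consume the terminating zero
    field = bits[pos:pos + start]
    return value + (int(field, 2) if field else 0), pos + start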
Acknowledgments. We would like to thank Jim
Gasbarro, John Larson, Bill Marsh, Dick Sweet, Ian
Witten, and the anonymous referees for helpful com-
ments and assistance.
REFERENCES
1. Abramson, N. Information Theory and Coding. McGraw-Hill, 1963, 61.
2. Basharin, G.P. On a statistical estimate for the entropy of a sequence of independent random variables. Theory Prob. Appl. 4 (1959), 333-336.
3. Bell, T.C. Better OPM/L text compression. IEEE Trans. Commun. COM-34, 12 (1986), 1176-1182.
4. Bell, T.C., Cleary, J.G., and Witten, I.H. Text Compression. In press with Prentice-Hall.
5. Bentley, J.L., Sleator, D.D., Tarjan, R.E., and Wei, V.K. A locally adaptive data compression scheme. Commun. ACM 29, 4 (1986), 320-330.
6. Cleary, J.G., and Witten, I.H. Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. COM-32, 4 (1984), 396-402.
7. Cormack, G.V., and Horspool, R.N.S. Data compression using dynamic Markov modelling. The Computer Journal 30, 6 (1987), 541-550.
8. Dantzig, G.B., Blattner, W.O., and Rao, M.R. Finding a cycle in a graph with minimum cost to time ratio with application to a ship routing problem. In Theory of Graphs, P. Rosenstiehl, ed., Gordon and Breach, 1966.
9. Elias, P. Universal codeword sets and representations of the integers. IEEE Trans. Info. Theory IT-21, 2 (1975), 194-203.
10. Even, S., and Rodeh, M. Economical encoding of commas between strings. Commun. ACM 21 (1978), 315-317.
11. Gallager, R.G. Variations on a theme by Huffman. IEEE Trans. Info. Theory IT-24, 6 (1978), 668-674.
12. Gasbarro, J. An Architecture for High-Performance VLSI Testers. Ph.D. dissertation, Department of Electrical Engineering, Stanford University, 1988.
13. Golomb, S.W. Run-length encodings. IEEE Trans. Info. Theory IT-12 (1966), 399-401.
14. Guazzo, M. A general minimum redundancy source-coding algorithm. IEEE Trans. Info. Theory IT-26, 1 (1980), 15-25.
15. Guoan, G., and Hobby, J. Using string matching to compress Chinese characters. Stanford Tech. Rep. STAN-CS-82-914, 1982.
16. Horspool, R.N., and Cormack, G.V. Dynamic Markov modelling - a prediction technique. In Proceedings of the 19th Annual Hawaii International Conference on System Sciences (1986), 700-707.
17. Huffman, D.A. A method for the construction of minimum-redundancy codes. In Proceedings of the I.R.E. 40 (1952), 1098-1101.
18. Hunter, R., and Robinson, A.H. International digital facsimile coding standards. In Proceedings of the IEEE 68, 7 (1980), 854-867.
19. Jakobsson, M. Compression of character strings by an adaptive dictionary. BIT 25 (1985), 593-603.
20. Jones, C.B. An efficient coding system for long source sequences. IEEE Trans. Info. Theory IT-27, 3 (1981), 280-291.
21. Kempf, M., Bayer, R., and Güntzer, U. Time optimal left to right construction of position trees. Acta Informatica 24 (1987), 461-474.
22. Knuth, D.E. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, second printing, 1975.
23. Knuth, D.E. Dynamic Huffman coding. J. Algorithms 6 (1985), 163-180.
24. Langdon, G.G., Jr. A note on the Ziv-Lempel model for compressing individual sequences. IEEE Trans. Info. Theory IT-29, 2 (1983), 284-287.
25. Langdon, G.G., Jr., and Rissanen, J. Compression of black-white images with arithmetic coding. IEEE Trans. Commun. COM-29, 6 (1981), 858-867.
26. Langdon, G.G., Jr., and Rissanen, J. A double adaptive file compression algorithm. IEEE Trans. Commun. COM-31, 11 (1983), 1253-1255.
27. Lawler, E.L. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.
28. McCreight, E.M. A space-economical suffix tree construction algorithm. J. ACM 23, 2 (1976), 262-272.
29. Miller, V.S., and Wegman, M.N. Variations on a theme by Ziv and Lempel. IBM Res. Rep. RC 10630 (#47798), 1984. Also in Combinatorial Algorithms on Words, NATO ASI Series F, 12 (1985), 131-140.
30. Morrison, D.R. PATRICIA - Practical Algorithm To Retrieve Information Coded in Alphanumeric. J. ACM 15, 4 (1968), 514-534.
31. Pasco, R.C. Source Coding Algorithms for Fast Data Compression. Ph.D. dissertation, Department of Electrical Engineering, Stanford University, 1976.
32. Rissanen, J., and Langdon, G.G., Jr. Arithmetic coding. IBM Journal of Research and Development 23, 2 (1979), 149-162.
33. Rissanen, J., and Langdon, G.G., Jr. Universal modeling and coding. IEEE Trans. Info. Theory IT-27, 1 (1981), 12-23.
34. Rodeh, M., Pratt, V.R., and Even, S. Linear algorithm for data compression via string matching. J. ACM 28, 1 (1981), 16-24.
35. Shannon, C.E. A mathematical theory of communication. The Bell System Technical Journal 27, 3 (1948), 379-423 and 27, 4 (1948), 623-656.
36. Storer, J.A., and Szymanski, T.G. Data compression via textual substitution. J. ACM 29, 4 (1982), 928-951.
37. Vitter, J.S. Design and analysis of dynamic Huffman coding. Brown University Technical Report No. CS-85-13, 1985.
38. Weiner, P. Linear pattern matching algorithms. In Fourteenth Annual Symposium on Switching and Automata Theory (1973), 1-11.
39. Welch, T.A. A technique for high performance data compression. IEEE Computer 17, 6 (1984), 8-19.
40. Witten, I.H., Neal, R.M., and Cleary, J.G. Arithmetic coding for data compression. Commun. ACM 30, 6 (1987), 520-540.
41. Ziv, J. Coding theorems for individual sequences. IEEE Trans. Info. Theory IT-24, 4 (1978), 405-412.
42. Ziv, J., and Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Info. Theory IT-23, 3 (1977), 337-343.
43. Ziv, J., and Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Info. Theory IT-24, 5 (1978), 530-536.
CR Categories and Subject Descriptors: E.4 [Data]: Coding and Infor-
mation Theory - data compaction and compression; F.2.2 [Analysis of Al-
gorithms and Problem Complexity]: Nonnumerical Algorithms and
Problems - computations on discrete structures, pattern matching
General Terms: Algorithms, Design, Experimentation, Theory
Additional Key Words and Phrases: Amortized efficiency, automata
theory, minimum cost to time ratio cycle, suffix trees, textual substitution
ABOUT THE AUTHORS:
EDWARD R. FIALA is a member of the research staff in
the Computer Sciences Laboratory at the Xerox Palo Alto
Research Center. He has general system interests including
data compression and digital video. Author's Present Address:
Xerox PARC, 3333 Coyote Hill Rd., Palo Alto, CA 94304; ARPA
Internet: Fiala.pa@Xerox.com
DANIEL H. GREENE is presently a member of the research
staff at Xerox's Palo Alto Research Center (PARC). He received
his Ph.D. in 1983 from Stanford University. His current re-
search interests are in the design and analysis of algorithms for
data compression, computational geometry, and distributed
computing. Author's Present Address: Xerox PARC, 3333 Coy-
ote Hill Rd., Palo Alto, CA 94304; Greene.pa@Xerox.com
*************************************************************
OCRed and published for the benefits of All Internet Community
by Disturbo as a part of his Millenium Data Disclosure Program.
December, the 2nd. 1998 AD.
*** The END ***
--- ifmail v.2.14dev2
 * Origin: [not found] (2:5020/400@fidonet)
 Предыдущий блок Следующий блок Вернуться в индекс