— RU.COMPRESS
From : Professor Nimnull 2:5020/1710.69   03 Dec 98 21:25:52
To   : Bulat Ziganshin
Subj : LZFG

Hello Bulat Ziganshin!

Thu Dec 03 1998 10:36, Bulat Ziganshin -> Professor Nimnull:

 PN>> Experience shows that the more sophisticated statistical models
 PN>> achieve better compression, although LZFG has a comparable
 PN>> performance. The worst performance comes from the simplest
 PN>> schemes - digram coding and Huffman coding.
 BZ> LZB in that table is LZSS.

In my table or in yours? I take it you mean mine, since yours lists implementations rather than algorithms... And what made you decide that?

 BZ> pkzip 1 already reached the level of LZFG.

Could I have a reference... Because in the whole CCFAQ all I found was:
--8<--------
Recently an algorithm was developed which combines the ideas behind LZ77 and LZ78 to produce a hybrid called LZFG. LZFG uses the standard sliding window, but stores the data in a modified trie data structure and produces as output the position of the text in the trie. Since LZFG only inserts complete *phrases* into the dictionary, it should run faster than other LZ77-based compressors.
--8<--------

 BZ> Here is the same table in its present-day form:
 [skipped table from comp.compression FAQ part#1
  URL: http://www.geocities.com/SiliconValley/Park/4264/act.html]

Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: AKA nimnull@glasnet.ru (2:5020/1710.69)

— RU.COMPRESS
From : Paul Melnikov 2:5020/400   03 Dec 98 22:51:24
To   : All
Subj : Re: Lossy compression (image data)

From: "Paul Melnikov" <paul_mel@mtu-net.ru>

Hello, Andrey!

>03 Dec 98 5:08 you wrote:
> MS>> Are they roughly comparable to the Fourier-like transforms in
> MS>> processing speed?
> SZ> Faster. Fourier is O(N log N); wavelets, implemented via the "lifting
> SZ> scheme", are O(N). With Haar a 256x512 image takes me less than a
> SZ> second to process.
> DCT/iDCT is Fourier-like, isn't it?
> An AAN implementation will give you 50 such images per second on an iP200.

On my P200MMX, wavelets compress a 5-7 MB true-color picture in about a second. But all of that depends on the coder, on the types of wavelets it uses, on how crooked the implementation is, and so on.

> In general it would be very interesting to learn about the advantages of
> the algorithms you describe over the traditional ones.

Go to www.cengines.com and pull down the free program there (about a megabyte) that lets you see in practice what wavelet compression is. I think the results will impress you. The only caveat: to get reliable results you have to use original pictures, i.e. ones that have not previously been compressed with JPEG or something similar. Best of all, take a photograph, scan it in true color, and then compress it with wavelets. It compresses 50-70 times with barely noticeable quality loss. I wonder which codec exactly they use in there?

> And, if it's not too much trouble, give me a link where I can read about
> the Haar algorithm.

If your English is not too much of a problem, grab the article "Building your own wavelets at home" that has already been mentioned here, e.g. from http://cm.bell-labs.com/who/wim/papers/athome.pdf. Otherwise try Computerra No. 8 (March) of this year; the entire cover story of that issue is devoted to wavelets.

Best regards, Paul.
--- ifmail v.2.14dev2
 * Origin: MTU-Inform ISP (2:5020/400@fidonet)
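As a small illustration of the Haar/lifting discussion quoted above, here is a minimal sketch of one level of the Haar transform written as a lifting step (predict each odd sample from its even neighbour, then update the evens so that they carry the pair averages). It is my own toy code, assuming nothing about the codec from www.cengines.com or the benchmarks quoted; the function names are made up.

--8<--------
def haar_lift(signal):
    """One forward Haar lifting level; input length must be even."""
    evens = signal[0::2]
    odds = signal[1::2]
    details = [o - e for e, o in zip(evens, odds)]          # predict step
    averages = [e + d / 2 for e, d in zip(evens, details)]  # update step
    return averages, details

def haar_unlift(averages, details):
    """Exact inverse: undo the update, undo the predict, interleave."""
    evens = [a - d / 2 for a, d in zip(averages, details)]
    odds = [e + d for e, d in zip(evens, details)]
    out = []
    for e, o in zip(evens, odds):
        out.extend([e, o])
    return out

signal = [9, 7, 3, 5, 6, 10, 2, 6]
avg, det = haar_lift(signal)
print(avg)    # [8.0, 4.0, 8.0, 4.0]  (pair averages)
print(det)    # [-2, 2, 4, 4]         (pair differences)
assert haar_unlift(avg, det) == [float(x) for x in signal]
--8<--------

Each level touches every sample a constant number of times, which is where the O(N) cost of the lifting scheme mentioned in the quote comes from.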
— RU.COMPRESS
From : Professor Nimnull 2:5020/1710.69   04 Dec 98 06:12:04
To   : Leonid Broukhis
Subj : LZP Re: LZFG

Hello Leonid Broukhis!

Thu Dec 03 1998 20:07, Leonid Broukhis -> Professor Nimnull:

 LB> LZP is described and implemented by Charles Bloom, who is easy to find
 LB> (www.utexas.edu/~cbloom if I'm not mistaken). ;)

Unless you are mistaken, that is ;))

This page is currently mirrored at:
http://www.its.caltech.edu/~bloom/index.html  main site
http://wwwvms.utexas.edu/~cbloom/index.html  old main site, no future updates as of September 1997
http://www.geocities.com/CapeCanaveral/Lab/2347/index.html  geocities mirror site, incomplete

Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: AKA nimnull@glasnet.ru (2:5020/1710.69)

— RU.COMPRESS
From : Bulat Ziganshin 2:5093/26   04 Dec 98 09:57:26
To   : Professor Nimnull
Subj : LZFG

* Crossposted in RU.COMPRESS
Hello Professor!

Thursday December 03 1998, Professor Nimnull writes to Bulat Ziganshin:

 PN>>> Don't send me to the Internet. At best, all you can find there on LZFG is:
 BZ>> We ourselves have been looking for it for years now :) The only thing
 BZ>> that consoles me is that LZFG, afaik, was outperformed by LZH algorithms.
 PN> I would like to know where exactly this information comes from.

From the table :) The files are the same ones, and pkzip 1 already comes out at the 2.90 level.

Bulat, mailto:bulatz@fort.tatarstan.ru, ICQ: work 11849833, home 15872722
--- GoldED/386 2.50+
 * Origin: После глубокого сна требуется продолжительный отдых (2:5093/26)

— RU.COMPRESS
From : Bulat Ziganshin 2:5093/26   04 Dec 98 09:58:43
To   : Professor Nimnull
Subj : LZFG

* Crossposted in RU.COMPRESS
Hello Professor!

Thursday December 03 1998, Professor Nimnull writes to Bulat Ziganshin:

 BZ>> LZB in that table is LZSS.
 PN> In my table or in yours? I take it you mean mine, since yours lists
 PN> implementations rather than algorithms... And what made you decide that?

My mistake :)

 BZ>> Here is the same table in its present-day form:
 PN> [skipped table from comp.compression FAQ part#1
 PN>  URL: http://www.geocities.com/SiliconValley/Park/4264/act.html]

The whole point is that it is a table over the same files. So ACT has nothing to do with it :)

Bulat, mailto:bulatz@fort.tatarstan.ru, ICQ: work 11849833, home 15872722
--- GoldED/386 2.50+
 * Origin: После глубокого сна требуется продолжительный отдых (2:5093/26)

— RU.COMPRESS
From : mbb@sochi.ru 2:5020/400   04 Dec 98 11:06:33
To   : All
Subj : Re: LZFG

From: mbb@sochi.ru

Bulat Ziganshin <Bulat.Ziganshin@f61.n5093.z2.fidonet.org> writes:
> Is there a description of the algorithm somewhere?

I am getting the impression that my messages sent through FIDO do not reach you, so I apologize for the possible repetition. If you see similar messages sent from FIDO, please e-mail me the kludge lines.

— RU.COMPRESS
From : Maxime Zakharov 2:5065/10.12   Wed 02 Dec 98 21:15
To   : Professor Nimnull
Subj : LZFG

 PN> Where can I get a description and the algorithm of %Subj%?

The idea behind the LZFG family of algorithms is as follows: the compressed file contains code words of two kinds: literal x - the next x symbols are symbols of the uncompressed text, and copy x, -y - step back y symbols and copy x symbols.
For example, the phrase IT WAS THE BEST OF TIMES, IT WAS THE WORST OF TIMES is coded like this:

  (literal 26)IT WAS THE BEST OF TIMES, (copy 11 -26)(literal 3)WOR(copy 11 -27)

— RU.COMPRESS
From : Maxime Zakharov 2:5065/10.12   Thu 03 Dec 98 00:34
To   : Professor Nimnull
Subj : Universal codes

 PN> Universal codes
 PN> ^^^^^^^^^^^^^^^

Some of these codes can be found here:
http://tnet.sochi.ru/cgi-bin/ht2/ht2-cgi.cgi?=cinfo&ELIAS_CODES
and also on my page (URL in the origin), in the "Data compression" section, file prefix.ps.gz.

 PN> Can anybody explain what the following codes are:
 PN> 4) Golomb

The code of a non-negative integer n depends on the choice of a parameter b. First, two values are computed:

  q = floor((n-1)/b)
  r = n - q*b - 1

The code is then built from two parts: the first part is the value q+1 in unary code, the second is the binary representation of r, written in floor(log2 b) bits for the smaller remainders and in ceil(log2 b) bits for the larger ones. For example, with b=3 three remainders are possible: 0, 1 and 2. They are coded as 0, 10 and 11 respectively.

If the input stream is such that the probability of an integer n appearing is P(n) = p(1-p)^(n-1) for some 0 <= p < 1, then the Golomb code is optimal for this data when b is chosen so that

  (1-p)^b + (1-p)^(b+1) <= 1 < (1-p)^(b-1) + (1-p)^b

--
Maxim Zakharov         http://www.tnet.sochi.ru/~maxime/
Sochi, Russia
--- ifmail v.2.14dev2
 * Origin: Demos online service (2:5020/400@fidonet)

— RU.COMPRESS
From : Professor Nimnull 2:5020/1710.69   04 Dec 98 12:49:44
To   : All
Subj : So which LZ is actually the best _in theory_?

Hello All!

Hmmm... I downloaded LZP... and spent a long time spitting... So far I have only tried LZP1, and I was _extremely_ disappointed %(

  Text file  ---    ---    aPack  n0Pack  Diet  WWpack  HA    Pkzip
             LZSS   LZP1   LZB    LZB     ???   ???     HSC   Deflate
  2262       874    1160   769    782     859   776     679   816

To get the real compression for HA and Pkzip, you have to subtract the size of the archive headers from the archive size...

And now, experts, the question: %subj%, among the one-pass methods. To cut the size of the flame down I will add: for compressing executable files.

Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: AKA nimnull@glasnet.ru (2:5020/1710.69)

— RU.COMPRESS
From : Disturbo 2:5020/400   04 Dec 98 17:31:24
To   : Bulat Ziganshin
Subj : Re: LZFG

From: Disturbo <pprgd@aha.ru>

Wow! So many good people looking for a description of LZFG and unable to find one... Got to help! I dig through the dusty cupboard for the "data compression" folder, and there it is (see below). ... Dedicated to the third anniversary of the treacherous cancellation of my GPNTB library card...

[Communications of the ACM, April 1989, Volume 32, Number 4]

    Data Compression with Finite Windows
    Algorithms and Data Structures, Daniel Sleator, Editor

    EDWARD R. FIALA and DANIEL H. GREENE

ABSTRACT: Several methods are presented for adaptive, invertible data compression in the style of Lempel's and Ziv's first textual substitution proposal. For the first two methods, the article describes modifications of McCreight's suffix tree data structure that support cyclic maintenance of a window on the most recent source characters. A percolating update is used to keep node positions within the window, and the updating process is shown to have constant amortized cost. Other methods explore the tradeoffs between compression time, expansion time, data structure size, and amount of compression achieved.
The article includes a graph-theoretic analysis of the compression penalty incurred by our codeword selection policy in comparison with an optimal policy, and it includes empirical studies of the performance of various adaptive compressors from the literature.

So, do you need it or not?? The scan will run to about 1800 lines in all; the first three figures I have redrawn in ASCII, the rest I really can't be bothered to do...

All the best!
--- ifmail v.2.14dev2
 * Origin: Застрелить застрельщиков! (2:5020/400@fidonet)

— RU.COMPRESS
From : Vadim Yoockin 2:5020/1042.50   04 Dec 98 20:30:42
To   : Professor Nimnull
Subj : Universal codes

Greetings, Professor!

01 Dec 98, Professor Nimnull wrote to All:

 PN> Can anybody explain what the following codes are:
 PN> 1) Elias Gamma codes (a.k.a. Levenshtein)

By example. Suppose we need to write the number 23 (00010111). Reverse the bits, dropping the leading zeros: 11101. Now write it out bit by bit so that every bit of the number except the last is preceded by a zero flag bit: 010101001. Since the last bit is always '1', it signals the end of the flag bits.

Actually, a second variant is more common: write n-1 zeros to announce the n significant bits (i.e. the bits starting from the leading '1') of the number that follows.

 PN> 2) Elias Delta

Elias delta codes were described in this echo recently by Maxime Zakharov. Most likely he will be the one to answer you. Perhaps he will also describe the very similar Even-Rodeh codes (meant for a less steep probability distribution curve). If he does not want to repeat himself, say so.

 PN> 3) Fibonacci
 PN> 4) Golomb

A parameter m is chosen (usually a power of two, 2**n) and a number r is coded in two parts:
1. A unary code for r/m (a run of '0's terminated by a '1', or the other way round).
2. r mod m, coded as an ordinary number of log2(m) binary digits.

 PN> 5) Rice

Rice(n) is the same thing as Golomb(2**n).

All the best. Vadim Yoockin

P.S. There are other methods as well. Each is good in its own way. What prompted the interest, by the way?

... 2.000.000 Lemmings can't be wrong.
--- Стаpый Дед стоимостью 3.00.Alpha4 доплата в СКВ UNREG
 * Origin: yoockinv@mtu-net.ru (2:5020/1042.50)

— RU.COMPRESS
From : Vadim Yoockin 2:5020/1042.50   04 Dec 98 20:34:13
To   : Professor Nimnull
Subj : LZFG

Greetings, Professor!

03 Dec 98, Professor Nimnull wrote to Leonid Broukhis:

 >>> PN> Don't send me to the Internet. At best, all you can find there on LZFG is:
 >>> We ourselves have been looking for it for years now :) The only thing
 >>> that consoles me is that LZFG, afaik, was outperformed by LZH algorithms.
 LB>> And in principle, LZFG should be no better than LZP.
 PN> No, let's get this straight... ;)
 PN> What is LZFG, and what is LZP... I would like the algorithms and a
 PN> detailed description... Or at least a URL... Just don't send me to Yahoo...
 PN> Here is the data known to me... [skip]
 PN> Experience shows that the more sophisticated statistical models achieve
 PN> better compression, although LZFG has a comparable performance. The
 PN> worst performance comes from the simplest schemes - digram coding and
 PN> Huffman coding.

That book is from 1989. A lot of water has flowed under the bridge since then. The PPMC in that table is order-3. These days PPMZ uses order-8 contexts, and PPMdet order-12. LZP did not exist at all back then (not to mention BWT, which at the time supposedly existed only in Burrows's head). As for LZFG, I know of no cases where it is used, and I have never come across a place where it lives (LZP is used in BOA and RKIVE). By the way, LZP (like PPMZ) - http://wwwvms.utexas.edu/~cbloom

All the best. Vadim Yoockin

... 2.000.000 Lemmings can't be wrong.
--- Стаpый Дед стоимостью 3.00.Alpha4 доплата в СКВ UNREG
 * Origin: yoockinv@mtu-net.ru (2:5020/1042.50)
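To make the code descriptions above concrete, here is a hedged sketch of Elias gamma (in the "n-1 zeros, then the significant bits" form Vadim mentions) and of Golomb/Rice coding following Maxime's formulas q = floor((n-1)/b), r = n - q*b - 1, with Rice(k) being Golomb(2**k). The function names are mine and the codes are returned as bit strings for readability, not as a packed stream.

--8<--------
def elias_gamma(n):                      # n >= 1
    bits = bin(n)[2:]                    # significant bits, starting with '1'
    return "0" * (len(bits) - 1) + bits

def golomb(n, b):                        # n >= 1, parameter b >= 1
    q = (n - 1) // b
    r = n - q * b - 1                    # remainder in [0, b)
    k = (b - 1).bit_length()             # ceil(log2 b) for b >= 2, 0 for b = 1
    cutoff = (1 << k) - b                # this many remainders fit in k-1 bits
    code = "1" * q + "0"                 # q+1 in unary: q ones then a zero
    if r < cutoff:                       # only happens when b is not a power of two
        return code + format(r, f"0{k - 1}b")
    return code + format(r + cutoff, f"0{k}b") if k else code

def rice(n, k):
    return golomb(n, 1 << k)

print(elias_gamma(23))                       # -> 000010111
print([golomb(n, 3) for n in range(1, 7)])   # -> ['00', '010', '011', '100', '1010', '1011']
print(rice(5, 2))                            # -> 1000
--8<--------

With b=3 the three remainders come out as 0, 10 and 11, exactly as in Maxime's example; with b a power of two the remainder is always a fixed-width field, which is the Rice case.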
— RU.COMPRESS
From : Leonid Broukhis 2:5020/400   05 Dec 98 09:09:52
To   : Bulat Ziganshin
Subj : Re: LZFG

From: leob@asylum.mailcom.com (Leonid Broukhis)

Bulat Ziganshin wrote:
> LB> but it turned out not to be good enough to make paying them for a
> LB> license worthwhile. Well, LZFG saves a little on the reference to the
> LB> previous occurrence of the repeated segment - hardly something to
> LB> brag about.
>
> Is there a description of the algorithm somewhere?

What is there to explain? There is a tree representing a context as long as the buffer, all the nodes of the tree are numbered, and instead of (offset, length) you simply give the number of a node, which is obviously shorter, since there are fewer nodes in the tree than buffer_length * max_segment_length.

Leo
--- ifmail v.2.14dev2
 * Origin: Demos online service (2:5020/400@fidonet)

— RU.COMPRESS
From : Leonid Broukhis 2:5020/400   05 Dec 98 09:09:53
To   : Professor Nimnull
Subj : Re: So which LZ is actually the best _in theory_?

From: leob@asylum.mailcom.com (Leonid Broukhis)

Professor Nimnull wrote:
>I downloaded LZP... and spent a long time spitting...
>So far I have only tried LZP1, and I was _extremely_ disappointed %(

You should have tried LZRW1 as well. You would have been disappointed too. Found something to be surprised at.

Leo
--- ifmail v.2.14dev2
 * Origin: Demos online service (2:5020/400@fidonet)

— RU.COMPRESS
From : Bulat Ziganshin 2:5093/26   05 Dec 98 13:00:33
To   : Leonid Broukhis
Subj : LZFG

* Crossposted in RU.COMPRESS
Hello Leonid!

Saturday December 05 1998, Leonid Broukhis writes to Bulat Ziganshin:

 LB> What is there to explain? There is a tree representing a context as
 LB> long as the buffer, all the nodes of the tree are numbered, and instead
 LB> of (offset, length) you simply give the number of a node, which is
 LB> obviously shorter, since there are fewer nodes in the tree than
 LB> buffer_length * max_segment_length.

It is the organization of this tree that I do not understand. What you said applies to LZW as well.

Bulat, mailto:bulatz@fort.tatarstan.ru, ICQ: work 11849833, home 15872722
--- GoldED/386 2.50+
 * Origin: После глубокого сна требуется продолжительный отдых (2:5093/26)

— RU.COMPRESS
From : Professor Nimnull 2:5020/1710.69   05 Dec 98 14:00:44
To   : Disturbo
Subj : LZFG

Hello Disturbo!

Fri Dec 04 1998 17:31, Disturbo -> Bulat Ziganshin:

 D> [Communications of the ACM, April 1989, Volume 32, Number 4]
 D> EDWARD R. FIALA and DANIEL H. GREENE
 D> So, do you need it or not??

Yes, yes, yes!!! Of course we need it!

 D> The scan will run to about 1800 lines in all; the first three figures I
 D> have redrawn in ASCII, the rest I really can't be bothered to do...

Maybe it should be uploaded somewhere?

Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: AKA nimnull@glasnet.ru (2:5020/1710.69)

— RU.COMPRESS
From : Professor Nimnull 2:5020/1710.69   05 Dec 98 14:42:42
To   : mbb@sochi.ru
Subj : LZFG

Hello mbb@sochi.ru!

Fri Dec 04 1998 11:06, mbb@sochi.ru -> All:

 m> I am getting the impression that my messages sent through FIDO do not
 m> reach you, so I apologize for the possible repetition.

The messages from FIDO never made it :(

 m> The idea behind the LZFG family of algorithms is as follows: the
 m> compressed file contains code words of two kinds: literal x - the next
 m> x symbols are symbols of the uncompressed text, and copy x, -y - step
 m> back y symbols and copy x symbols. For example, the phrase IT WAS THE
 m> BEST OF TIMES, IT WAS THE WORST OF TIMES is coded like this:
 m> (literal 26)IT WAS THE BEST OF TIMES, (copy 11 -26)(literal 3)WOR(copy 11 -27)

??? Then how does it differ from, say, LZSS? I must be dumb - I don't see the difference...
I would describe LZSS in exactly the same way... No, wait... In LZSS the uncompressed symbols are flagged one by one, while here they go in one batch... So in what way is LZFG better? What is its advantage? Why is it praised so much? IMHO flagging uncompressed symbols in batches is worse (I have checked), especially on a frequently changing stream (an EXE file, for instance)...

 m> and also on my page (URL in the origin)

I visited the page - the impression is overwhelming. It feels like walking into a pastry shop where everything is on offer and everything is free... Except that all the goods are laid out on shelves, and the shelves are almost at the ceiling... Where can all of this (for example "The Sochi computer encyclopedia. Data compression") be downloaded in ordinary TXT or DOC format?

 m> in the "Data compression" section, the file prefix.ps.gz

Downloaded it... But what do I view it with?

Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: AKA nimnull@glasnet.ru (2:5020/1710.69)

— RU.COMPRESS
From : Professor Nimnull 2:5020/1710.69   05 Dec 98 16:54:04
To   : Bulat Ziganshin
Subj : So which LZ is actually the best _in theory_?

Hello Bulat Ziganshin!

Sat Dec 05 1998 02:36, Bulat Ziganshin -> Professor Nimnull:

 PN>> And now, experts, the question: %subj%, among the one-pass methods.
 PN>> To cut the size of the flame down I will add: for compressing
 PN>> executable files.
 BZ> Judging by the awful compression, this variant is of purely theoretical
 BZ> interest. As it happens, rkive is one of the best archivers, and it
 BZ> uses precisely lzp.

URL?

 BZ> And a minor remark - if what interests you is compressing exe, why did
 BZ> you test on txt?

That's just how it has always been ;) It is easier for me to decompress text files in my head (with the help of a calculator)... ;))

 BZ> Now then - what exactly do you care about: fast compression /
 BZ> decompression / memory economy / originality of the algorithm :) ?

All right... cards on the table, as they say... I happen to need a packer for executable files that can also compress hidden overlays. No such thing exists in nature :(( So I have to write my own... Given the task, the compression method has to satisfy certain conditions:
1) Small resources for decompression: a small decompressor, no use of additional buffers in memory...
2) Compression of a frequently changing sequence
3) Compression of an unpredictable sample
4) And, of course, maximum compression...

So I am _not_ interested in packing methods that require medium, large or very large resources for decompression... Of course it would be fun to use arithmetic coding with a couple of extra squeezing passes on top ;)))

Oh well... At the moment exe packers use LZ?? + assorted tricks. The very best of them is aPack. It uses lazy LZB with ~55Kb lookback. And here I sit and wonder: should I use LZB in my packer, or look for something better...

 BZ> The big specialist on exe-packers is Dima Bortok; alas, he does not
 BZ> read the echo at the moment. :(

And how could I get in touch with him? What has he written?

 BZ> Afaik, you should take rar and invent a few little extras for it.

And what is so good about it? IMHO no better than Zip... IMsHO LZ77+dHuff...

 BZ> I know of no other realistic ways, although of course ace, cabarc and
 BZ> 777 compress better.

The second one I do not know... And the first and the third, just like Jar, get their improvement by enlarging the lookback buffer...

 BZ> As for being one-pass - a block-by-block algorithm will suit you too, I
 BZ> hope? rar is exactly that.

Where can I read about this? And what does this algorithm look like?

Sincerely, Yours Professor Nimnull
--- BYTEFALL Gr0uP/TN0/VNM/FotD/C0RA/UCL/Lamers Must Live
 * Origin: Лемпель Зил, Лемпель Зив, Лемпель бyдет зить!!! (2:5020/1710.69)
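The per-symbol-flags versus grouped-literals question raised in the two messages above can be put into rough numbers. The sketch below compares only the literal-side overhead under two assumed cost models: one flag bit per uncompressed byte (LZSS-style) versus one 8-bit literal header per run of at most 16 bytes (A1/LZFG-style). The run lengths and helper names are made up for illustration; real coders also pay for match flags and headers, which this toy comparison ignores.

--8<--------
import math

def lzss_literal_overhead(runs):
    """LZSS-style: one flag bit per uncompressed byte (match-side flags ignored)."""
    return sum(runs)                                  # bits

def grouped_literal_overhead(runs, header_bits=8, max_run=16):
    """A1/LZFG-style: one literal header per run of at most max_run bytes."""
    return sum(header_bits * math.ceil(r / max_run) for r in runs)

long_runs = [26, 3]          # the IT-WAS-THE-BEST-OF-TIMES example: few, long runs
choppy_runs = [1] * 29       # same number of literal bytes, scattered one by one

for name, runs in (("long runs", long_runs), ("choppy runs", choppy_runs)):
    print(name, lzss_literal_overhead(runs), grouped_literal_overhead(runs))
# long runs:   29 flag bits vs 24 header bits  -> grouping wins
# choppy runs: 29 flag bits vs 232 header bits -> per-byte flags win
--8<--------

Which scheme is cheaper therefore depends on how fragmented the literal runs are, which is exactly the point made about frequently changing streams such as EXE files.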
— RU.COMPRESS
From : Disturbo 2:5020/400   05 Dec 98 17:36:25
To   : Bulat Ziganshin
Subj : LZFG (CACM publ.) 1 of 4

From: Disturbo <pprgd@aha.ru>

Bulat Ziganshin wrote:
> Post it. Maybe we will even manage to understand something in that English lingo :)

EDWARD R. FIALA and DANIEL H. GREENE

ABSTRACT: Several methods are presented for adaptive, invertible data compression in the style of Lempel's and Ziv's first textual substitution proposal. For the first two methods, the article describes modifications of McCreight's suffix tree data structure that support cyclic maintenance of a window on the most recent source characters. A percolating update is used to keep node positions within the window, and the updating process is shown to have constant amortized cost. Other methods explore the tradeoffs between compression time, expansion time, data structure size, and amount of compression achieved. The article includes a graph-theoretic analysis of the compression penalty incurred by our codeword selection policy in comparison with an optimal policy, and it includes empirical studies of the performance of various adaptive compressors from the literature.

1. INTRODUCTION

Compression is the coding of data to minimize its representation. In this article, we are concerned with fast, one-pass, adaptive, invertible (or lossless) methods of digital compression which have reasonable memory requirements. Such methods can be used, for example, to reduce the storage requirements for files, to increase the communication rate over a channel, or to reduce redundancy prior to encryption for greater security.

By "adaptive" we mean that a compression method should be widely applicable to different kinds of source data. Ideally, it should adapt rapidly to the source to achieve significant compression on small files, and it should adapt to any subsequent internal changes in the nature of the source. In addition, it should achieve very high compression asymptotically on large regions with stationary statistics.

All the compression methods developed in this article are substitutional. Typically, a substitutional compressor functions by replacing large blocks of text with shorter references to earlier occurrences of identical text [3, 5, 29, 34, 36, 39, 41-43]. (This is often called Ziv-Lempel compression, in recognition of their pioneering ideas. Ziv and Lempel, in fact, proposed two methods. The unqualified use of the phrase "Ziv-Lempel compression" usually refers to their second proposal [43]. In this article, we will be primarily concerned with their first proposal [42].) A popular alternative to a substitutional compressor is a statistical compressor. A symbolwise statistical compressor functions by accurately predicting the probability of individual symbols, and then encoding these symbols with space close to -log2 of the predicted probabilities. The encoding is accomplished with either Huffman compression [17], which has recently been made one-pass and adaptive [11, 22, 37], or with arithmetic coding, as described in [1, 14, 20, 25, 26, 31-33]. The major challenge of a statistical compressor is to predict the symbol probabilities. Simple strategies, such as keeping zero-order (single symbol) or first-order (symbol pair) statistics of the input, do not compress English text very well.
Several authors have had success gathering higher-order statistics, but this necessarily involves higher memory costs and additional mechanisms for dealing with situations where higher-order statistics are not available [6, 7, 26].

It is hard to give a rigorous foundation to the substitutional vs. statistical distinction described above. Several authors have observed that statistical methods can be used to simulate textual substitution, suggesting that the statistical category includes the substitutional category [4, 24]. However, this takes no account of the simplicity of mechanism: the virtue of textual substitution is that it recognizes and removes coherence on a large scale, oftentimes ignoring the smaller scale statistics. As a result, most textual substitution compressors process their compressed representation in larger blocks than their statistical counterparts, thereby gaining a significant speed advantage. It was previously believed that the speed gained by textual substitution would necessarily cost something in compression achieved. We were surprised to discover that with careful attention to coding, textual substitution compressors can match the compression performance of the best statistical methods.

Consider the following scheme, which we will improve later in the article. Compressed files contain two types of codewords:

  literal x    pass the next x characters directly into the uncompressed output
  copy x, -y   go back y characters in the output and copy x characters forward
               to the current position.

So, for example, the following piece of literature:

  IT WAS THE BEST OF TIMES, IT WAS THE WORST OF TIMES

would compress to

  (literal 26)IT WAS THE BEST OF TIMES, (copy 11 -26)(literal 3)WOR(copy 11 -27)

The compression achieved depends on the space required for the copy and literal codewords. Our simplest scheme, hereafter denoted A1, uses 8 bits for a literal codeword and 16 for a copy codeword. If the first 4 bits are 0, then the codeword is a literal; the next 4 bits encode a length x in the range [1..16] and the following x characters are literal (one byte per character). Otherwise, the codeword is a copy; the first 4 bits encode a length x in the range [2..16] and the next 12 bits are a displacement y in the range [1..4096]. At each step, the policy by which the compressor chooses between a literal and a copy is as follows: If the compressor is idle (just finished a copy, or terminated a literal because of the 16-character limit), then the longest copy of length 2 or more is issued; otherwise, if the longest copy is less than 2 long, a literal is started. Once started, a literal is extended across subsequent characters until a copy of length 3 or more can be issued or until the length limit is reached. A1 would break the first literal in the above example into two literals and compress the source from 51 bytes down to 36.

A1 is close to Ziv and Lempel's first textual substitution proposal [42]. One difference is that A1 uses a separate literal codeword, while Ziv and Lempel combine each copy codeword with a single literal character. We have found it useful to have longer literals during the startup transient; after the startup, it is better to have no literals consuming space in the copy codewords. Our empirical studies showed that, for source code and English text,
the field size choices for A1 are good; reducing the size of the literal length field by 1 bit increases compression slightly but gives up the byte-alignment property of the A1 codewords. In short, if one desires a simple method based upon the copy and literal idea, A1 is a good choice.

A1 was designed for 8-bit per character text or program sources, but, as we will see shortly, it achieves good compression on other kinds of source data, such as compiled code and images, where the word model does not match the source data particularly well, or where no model of the source is easily perceived. A1 is, in fact, an excellent approach to general purpose data compression. In the remainder of this article, we will study A1 and several more powerful variations.

2. OVERVIEW OF THE DATA STRUCTURE

The fixed window suffix tree of this article is a modification of McCreight's suffix tree [23] (see also [21, 34, 38]), which is itself a modification of Morrison's PATRICIA tree [30], and Morrison's tree is ultimately based on a Trie data structure [22, page 481]. We will review each of these data structures briefly.

A Trie is a tree structure where the branching occurs according to "digits" of the keys, rather than according to comparisons of the keys. In English, for example, the most natural "digits" are individual letters, with the lth level of the tree branching according to the lth letter of the words in the tree. In Figure 1, many internal nodes are superfluous, having only one descendant. If we are building an index for a file, we can save space by eliminating the superfluous nodes and putting pointers to the file into the nodes rather than including characters in the data structure. In Figure 2, the characters in parentheses are not actually represented in the data structure, but they can be recovered from the (position, level) pairs in the nodes. Figure 2 also shows a suffix pointer (as a dark right arrow) that will be explained later.

[Figure 1. A Trie (storing STRAY, STRIDE, ASTRAY, ASTRIDE)]

Figure 2 represents some, but not all, of the innovations in Morrison's PATRICIA trees. He builds the trees with binary "digits" rather than full characters, and this allows him to save more space by folding the leaves into the internal nodes. Our "digits" are bytes, so the branching factor can be as large as 256. Since there are rarely 256 descendants of a node, we do not reserve that much space in each node, but instead hash the arcs. There is also a question about when the strings in parentheses are checked in the searching process. In what follows, we usually check characters immediately when we cross an arc. Morrison's scheme can avoid file access by skipping the characters on the arcs, and doing only one file access and comparison at the end of the search. However, our files will be in main memory, so this consideration is unimportant. We will use the simplified tree depicted in Figure 2.

For A1, we wish to find the longest (up to 16 character) match to the current string beginning anywhere in the preceding 4096 positions. If all preceding 4096 strings were stored in a PATRICIA tree with depth d = 16, then finding this match would be straightforward. Unfortunately, the cost of inserting these strings can be prohibitive, for if we have just descended d levels in the tree to insert the string starting at position i then we will descend at least d - 1 levels inserting the string at i + 1. In the worst case this can lead to O(nd)
insertion time for a file of size n. Since later encodings will use much larger values for d than 16, it is important to eliminate d from the running time.

To insert the strings in O(n) time, McCreight added additional suffix pointers to the tree. Each internal node, representing the string aX on the path from the root to the internal node, has a pointer to the node representing X, the string obtained by stripping a single letter from the beginning of aX. If a string starting at i has just been inserted at level d we do not need to return to the root to insert the string at i + 1; instead, a nearby suffix pointer will lead us to the relevant branch of the tree.

Figure 3 shows how suffix links are created and used. On the previous iteration, we have matched the string aXY, where a is a single character, X and Y are strings, and b is the first unmatched character after Y. Figure 3 shows a complicated case where a new internal node, α, has been added to the tree, and the suffix link of α must be computed. We insert the next string XYb by going up the tree to node β, representing the string aX, and crossing its suffix link to γ, representing X. Once we have crossed the suffix link, we descend again in the tree, first by "rescanning" the string Y, and then by "scanning" from δ until the new string is inserted. The first part is called "rescanning" because it covers a portion of the string that was covered by the previous insert, and so it does not require checking the internal strings on the arcs. (In fact, avoiding these checks is essential to the linear time functioning of the algorithm.) The rescan either ends at an existing node δ, or δ is created to insert the new string XYb; either way we have the destination for the suffix link of α. We have restored the invariant that every internal node, except possibly the one just created, has a suffix link.

[Figure 2. A PATRICIA Tree with a Suffix Pointer]
[Figure 3. Building a Suffix Tree]

For the A1 compressor, with a 4096-byte fixed window, we need a way to delete and reclaim the storage for portions of the suffix tree representing strings further back than 4096 in the file. Several things must be added to the suffix tree data structure. The leaves of the tree are placed in a circular buffer, so that the oldest leaf can be identified and reclaimed, and the internal nodes are given "son count" fields. When an internal "son count" falls to one, the node is deleted and two consecutive arcs are combined. In Section 3, it is shown that this approach will never leave a "dangling" suffix link pointing to deleted nodes. Unfortunately, this is not the only problem in maintaining a valid suffix tree. The modifications that avoided a return to the root for each new insertion create havoc for deletions. Since we have not always returned to the root, we may have consistently entered a branch of the tree sideways. The pointers (to strings in the 4096-byte window) in the higher levels of such a branch can become out-of-date. However, traversing the branch and updating the pointers would destroy any advantage gained by using the suffix links.

We can keep valid pointers and avoid extensive updating by partially updating according to a percolating update.
Each internal node has a single "update" bit. If the update bit is true when we are updating a node, then we set the bit false and propagate the update recursively to the node's parent. Otherwise, we set the bit true and stop the propagation. In the worst case, a long string of true updates can cause the update to propagate to the root. However, when amortized over all new leaves, the cost of updating is constant, and the effect of updating is to keep all internal pointers on positions within the last 4096 positions of the file. These facts will be shown in Section 3.

We can now summarize the operation of the inner loop, using Figure 3 again. If we have just created node α, then we use α's parent's suffix link to find γ. From γ we move down in the tree, first rescanning, and then scanning. At the end of the scan, we percolate an update from the leaf, moving toward the root, setting the position fields equal to the current position, and setting the update bits false, until we find a node with an update bit that is already false, whereupon we set that node's update bit true and stop the percolation. Finally, we go to the circular buffer of leaves and replace the oldest leaf with the new leaf. If the oldest leaf's parent has only one remaining son, then it must also be deleted; in this case, the remaining son is attached to its grandparent, and the deleted node's position is percolated upward as before, only at each step the position being percolated and the position already in the node must be compared and the more recent of these sent upward in the tree.

3. THEORETICAL CONSIDERATIONS

The correctness and linearity of suffix tree construction follows from McCreight's original paper [28]. Here we will concern ourselves with the correctness and the linearity of suffix tree destruction, questions raised in Section 2.

THEOREM 1. Deleting leaves in FIFO order and deleting internal nodes with single sons will never leave dangling suffix pointers.

PROOF. Assume the contrary. We have a node γ with a suffix pointer to a node δ that has just been deleted. The existence of γ means that there are at least two strings that agree for l positions and then differ at l + 1. Assuming that these two strings start at positions i and j, where both i and j are within the window of recently scanned strings and are not equal to the current position, then there are two even younger strings at i + 1 and j + 1 that differ first at l. This contradicts the assumption that δ has one son. (If either i or j are equal to the current position, then α is a new node, and can temporarily be without a suffix pointer.)

There are two issues related to the percolating update: its cost and its effectiveness.

THEOREM 2. Each percolated update has constant amortized cost.

PROOF. We assume that the data structure contains a "credit" on each internal node where the "update" flag is true. A new leaf can be added with two "credits." One is spent immediately to update the parent, and the other is combined with any credits remaining at the parent to either: 1) obtain one credit to leave at the parent and terminate the algorithm, or 2) obtain two credits to apply the algorithm recursively at the parent. This gives an amortized cost of two updates for each new leaf.

For the next theorem, define the "span" of a suffix tree to be equal to the size of its fixed window. So far we have used examples with "span" equal to 4096, but the value is flexible.

THEOREM 3.
Using the percolating update, every internal node will be updated at least once during every period of length "span".

PROOF. It is useful to prove the slightly stronger result that every internal node (that remains for an entire period) will be updated twice during a period, and thus propagate at least one update to its parent. To show a contradiction, we find the earliest period and the node β farthest from the root that does not propagate an update to its parent. If β has at least two children that have remained for the entire period, then β must have received updates from these nodes: they are farther from the root. If β has only one remaining child, then it must have a new child, and so it will still get two updates. (Every newly created arc causes a son to update a parent, percolating if necessary.) Similarly, two new children also cause two updates. By every accounting, β will receive two updates during the period, and thus propagate an update, contradicting our assumption of β's failure to update its parent.

There is some flexibility on how updating is handled. We could propagate the current position upward before rescanning, and then write the current position into those nodes passed during the rescan and scan; in this case, the proof of Theorem 3 is conservative. Alternatively, a similar, symmetric proof can be used to show that updating can be omitted when new arcs are added so long as we propagate an update after every arc is deleted. The choice is primarily a matter of implementation convenience, although the method used above is slightly faster.

The last major theoretical consideration is the effectiveness of the A1 policy in choosing between literal and copy codewords. We have chosen the following one-pass policy for A1: When the encoder is idle, issue a copy if it is possible to copy two or more characters; otherwise, start a literal. If the encoder has previously started a literal, then terminate the literal and issue a copy only if the copy is of length three or greater. Notice that this policy can sometimes go astray. For example, suppose that the compressor is idle at position i and has the following copy lengths available at subsequent positions:

  position:     i   i+1  i+2  i+3  i+4  i+5
  copy length:  1    3   16   15   14   13        (1)

Under the policy, the compressor encodes position i with a literal codeword, then takes the copy of length 3, and finally takes a copy of length 14 at position i + 4. It uses 6 bytes in the encoding:

  (literal 1)X(copy 3 -y)(copy 14 -y)

If the compressor had foresight it could avoid the copy of length 3, compressing the same material into 5 bytes:

  (literal 2)XX(copy 16 -y)

The optimal solution can be computed by dynamic programming [36]. One forward pass records the length of the longest possible copy at each position (as in equation 1) and the displacement for the copy (not shown in equation 1). A second backward pass computes the optimal way to finish compressing the file from each position by recording the best codeword to use and the length to the end-of-file. Finally, another forward pass reads off the solution and outputs the compressed file. However, one would probably never want to use dynamic programming since the one-pass heuristic is a lot faster, and we estimated for several typical files that the heuristically compressed output was only about 1 percent larger than the optimum.
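The policy example above is easy to check mechanically. Below is a minimal sketch, assuming the A1 cost model of Section 1 (one header byte plus one byte per character for a literal of at most 16 characters, two bytes for a copy of 2 to 16 characters) and the theorem that an optimal parse need only use longest copies; greedy_cost follows the stated one-pass policy and optimal_cost is the backward dynamic-programming pass. The array f extends the pattern of equation 1 to a complete 18-position file; the function names are mine, not the article's.

--8<--------
# f[i] is the longest feasible copy length at position i (the F(i) of the text).
# The DP's restriction to longest copies relies on the theorem F(i+1) >= F(i) - 1.

def greedy_cost(f):
    n = len(f)
    cost, i, lit = 0, 0, 0          # lit = length of the literal in progress
    while i < n:
        c = min(f[i], 16, n - i)    # longest usable copy here
        if lit == 0:                # encoder is idle
            if c >= 2:
                cost += 2           # copy codeword
                i += c
            else:
                cost += 2           # literal header + first character
                lit, i = 1, i + 1
        else:                       # a literal is under construction
            if c >= 3:
                cost += 2           # terminate literal, issue copy
                lit, i = 0, i + c
            else:
                cost += 1           # extend the literal by one character
                lit, i = lit + 1, i + 1
                if lit == 16:       # length limit reached; back to idle
                    lit = 0
    return cost

def optimal_cost(f):
    n = len(f)
    best = [0] * (n + 1)            # best[i] = B(i), cheapest way to finish
    for i in range(n - 1, -1, -1):
        c = min(f[i], 16, n - i)
        cands = []
        if c >= 2:
            cands.append(2 + best[i + c])            # take the longest copy
        for x in range(1, min(16, n - i) + 1):       # or a literal of x characters
            cands.append(1 + x + best[i + x])
        best[i] = min(cands)
    return best[0]

f = [1, 3, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(greedy_cost(f), optimal_cost(f))   # -> 6 5
--8<--------

On this input the one-pass policy spends 6 bytes against an optimum of 5, matching the worked example in the text.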
Furthermore, we will show in the remainder of this section that the size of the compressed file is never worse than 5/4 the size of the optimal solution for the specific A1 encoding. This will require developing some analytic tools, so the non-mathematical reader should feel free to skip to Section 4. The following definitions are useful:

Definition. F(i) is the longest feasible copy at position i in the file.

Sample F(i)'s were given above in equation 1. They are dependent on the encoding used. For now, we are assuming that they are limited in magnitude to 16, and must correspond to copy sources within the last 4096 characters.

Definition. B(i) is the size of the best way to compress the remainder of the file, starting at position i.

B(i)'s would be computed in the reverse pass of the optimal algorithm outlined above. The following theorems are given without proof:

THEOREM. F(i + 1) >= F(i) - 1.

THEOREM. There exists an optimal solution where copies are longest possible (i.e., only copies corresponding to F(i)'s are used).

THEOREM. B(i) is monotone decreasing.

THEOREM. Any solution can be modified, without affecting length, so that (literal x1) followed immediately by (literal x2) implies that x1 is maximum (in this case 16).

We could continue to reason in this vein, but there is an abstract way of looking at the problem that is both clearer and more general. Suppose we have a non-deterministic finite automaton where each transition is given a cost. A simple example is shown in Figure 4. The machine accepts (a + b)*, with costs as shown in parentheses. The total cost of accepting a string is the sum of the transition costs for each character. (While it is not important to our problem, the optimal solution can be computed by forming a transition matrix for each letter, using the costs shown in parentheses, and then multiplying the matrices for a given string, treating the coefficients as elements of the closed semiring with operations of addition and minimization.)

[Figure 4. A Nondeterministic Automaton with Transition Costs]

We can obtain a solution that approximates the minimum by deleting transitions in the original machine until it becomes a deterministic machine. This corresponds to choosing a policy in our original data compression problem. A policy for the machine in Figure 4 is shown in Figure 5. We now wish to compare, in the worst case, the difference between optimally accepting a string with the non-deterministic machine, and deterministically accepting the same string with the "policy" machine. This is done by taking a cross product of the two machines, as shown in Figure 6.

In Figure 6 there are now two weights on each transition; the first is the cost in the non-deterministic graph, and the second is the cost in the policy graph. Asymptotically, the relationship of the optimal solution to the policy solution is dominated by the smallest ratio on a cycle in this graph. In the case of Figure 6, there is a cycle from (1, 1') to (1, 2') and back that has cost in the non-deterministic graph of 2 + 1 = 3, and cost in the policy graph of 3 + 3 = 6, giving a ratio of 1/2. That is, the policy solution can be twice as bad as the optimum on the string ababababab...

[Figure 5. A Deterministic "Policy" Automaton for Figure 4]

In general, we can find the cycle with the smallest ratio mechanically, using well known techniques [8, 27]. The idea is to conjecture a ratio r and then reduce the pairs of weights (x, y) on the arcs to single weights x - ry.
Under this reduction, a cycle with zero weight has ratio exactly r. If a cycle has negative weight, then r is too large. The ratio on the negative cycle is used as a new conjecture, and the process is iterated. (Negative cycles are detected by running a shortest path algorithm and checking for convergence.) Once we have found the minimum ratio cycle, we can create a worst case string in the original automata problem by finding a path from the start state to the cycle and then repeating the cycle indefinitely. The ratio of the costs of accepting the string non-deterministically and deterministically will converge to the ratio of the cycle. (The path taken in the cross product graph will not necessarily bring us to the same cycle, due to the initial path fragment; we will, nevertheless, do at least as well.) Conversely, if we have a sufficiently long string with non-deterministic to deterministic ratio r, then the string will eventually loop in the cross product graph. If we remove loops with ratio greater than r we only improve the ratio of the string, so we must eventually find a loop with ratio at least as small as r.

[Figure 6. The Cross Product]

The above discussion gives us an algorithmic way of analyzing our original data compression problem. The possible values of F(i) are encoded in a 17 character alphabet p0 ... p16, representing the length of copy available at each position. The compression algorithm is described by a non-deterministic machine that accepts strings of p_j; this machine has costs equal to the lengths of the codewords used by the algorithm. There are two parameterized states in this machine: l_x means that there is a literal codeword under construction with x spaces still available; c_y means that a copy is in progress with y characters remaining to copy. The idle state is l_0 = c_0. In the non-deterministic machine, the possible transitions are:

  *   -> l_15        start a literal
  l_x -> l_(x-1)     continue a literal (x >= 1)
  *   -> c_(j-1)     start a copy (on reading p_j)
  c_y -> c_(y-1)     continue a copy (y >= 1)                      (2)

each transition being charged the number of output bytes it contributes. (An asterisk is used as a wild card to denote any state.) Based on the theorems above we have already eliminated some transitions to simplify what follows. For example,

  c_y -> l_15        start a literal from inside a copy (y >= 1)   (3)

is unnecessary. The deterministic machine is obtained by restricting (2) according to the A1 policy. Consider, for example, the following pattern of copy lengths:

  position:  1    2    3   4   5   6   7   8   9   10  11  12  13   14   15
  symbol:    p10  p10  p9  p8  p7  p6  p5  p4  p3  p2  p1  p2  p10  p10  p9   (6)

(There is nothing special about 10; it was chosen to illustrate a long copy and to match the example in Appendix A.) The deterministic algorithm takes a copy of length 10 in the first position, and then switches to a literal for positions 11 and 12. Five bytes are used in each repetition of the pattern. The optimal solution is one position out of phase. It takes a copy of length 10 in the second position, and then finds a copy of length 2 at position 12, for a total of four bytes on each iteration.

We have abstracted the problem so that the possible copy operations are described by a string of p_j, and we have shown a pathological pattern of p_j that results in 5/4 of the optimal encoding. There might still be some doubt that such a string exists, since the condition that our third machine (5) guarantees, F(i + 1) >= F(i) - 1, is a necessary but not sufficient condition. Nevertheless, the details of an actual pathological string can be found in Appendix A.

--- ifmail v.2.14dev2
 * Origin: [not found] (2:5020/400@fidonet)
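The minimum-ratio-cycle computation just described is easy to prototype. The sketch below uses a tiny made-up cross-product graph (not Figure 6, whose arcs are not reproduced in the scan) with an (optimal cost, policy cost) pair on each arc, and it simply enumerates simple cycles by brute force rather than using the conjecture-and-reweight iteration cited from [8, 27]; the node labels and function name are mine.

--8<--------
from itertools import permutations

# edges[u] = list of (v, optimal_cost, policy_cost) -- a made-up 3-state example
edges = {
    0: [(1, 2, 3)],
    1: [(2, 1, 3), (0, 2, 2)],
    2: [(0, 1, 2)],
}

def find_min_ratio_cycle(edges):
    nodes = list(edges)
    best = None                           # (ratio, cycle as a list of nodes)
    for k in range(1, len(nodes) + 1):
        for order in permutations(nodes, k):
            x = y = 0
            ok = True
            for i, u in enumerate(order):
                v = order[(i + 1) % len(order)]
                step = next(((ox, oy) for w, ox, oy in edges[u] if w == v), None)
                if step is None:          # this node sequence is not a cycle
                    ok = False
                    break
                x += step[0]              # accumulate optimal-machine cost
                y += step[1]              # accumulate policy-machine cost
            if ok and (best is None or x / y < best[0]):
                best = (x / y, list(order))
    return best

print(find_min_ratio_cycle(edges))        # -> (0.5, [0, 1, 2])
--8<--------

The smallest cycle ratio bounds how far the policy can fall behind the optimum asymptotically, which is how the 5/4 bound for A1 is obtained in the text.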
— RU.COMPRESS
From : Disturbo 2:5020/400   05 Dec 98 17:36:26
To   : Bulat Ziganshin
Subj : LZFG (CACM publ.) 2 of 4

From: Disturbo <pprgd@aha.ru>

4. A SIMPLER DATA STRUCTURE

Although the quantity of code associated with A1 is not enormous, it is complicated, and the data structures are fairly large. In this section, we present simpler methods for finding the suffix and for propagating the window position.

The alternative to a percolating update is to update the positions in all nodes back to the root whenever a new leaf is inserted. Then no updates are needed when nodes are deleted. The update flags can be eliminated.

The alternative to suffix pointers is more complicated. The cost of movement in a tree is not uniform: moving deeper requires a hash table lookup, which is more expensive than following a parent pointer. So we can determine the suffix by starting at the suffix leaf and following parent pointers back toward the root until the suffix node is reached. The suffix leaf is known because the string at i matched the string at some earlier window position j; the suffix leaf j + 1 is the next entry in the leaf array. With this change, the suffix pointers can be eliminated.

From a theoretical perspective, these modifications, which have O(nd) worst case performance for a file of size n and cut-off depth d, are inferior to the O(n) performance of the suffix tree. For A1, with a cutoff of 16, these modifications improve average performance, but the A2 method discussed in the next section has such a deep cut-off that suffix pointers and percolated updates are preferable.

5. A MORE POWERFUL ENCODING

The 4,096-byte window of A1 is roughly optimal for fixed size copy and literal codewords. Longer copies would, on average, be found in a larger window, but a larger displacement field would be required to encode them. To exploit a larger window, we must use a variable-width encoding statistically sensitive to the fact that recent window positions are more likely to be used by copy codewords than those positions further back. Similarly, it is advantageous to use variable-width encodings for copy and literal lengths.

There are several approaches we might use for variable-length encoding. We could use fixed or adaptive Huffman coding, arithmetic encoding, a variable-length encoding of the integers, or a manageable set of hand-designed codewords. We eliminated from consideration adaptive Huffman and arithmetic coding because they are slow. Moreover, we felt they would provide (at best) a secondary adaptive advantage since the "front end" textual substitution is itself adapting to the input. We experimented with a fixed Huffman encoding, a hand-designed family of codewords, and a variable-length encoding of the integers, so we will compare these options briefly:

Hand-Designed Codewords. This is a direct generalization of A1, with short copies that use fewer bits but cannot address the full window, and longer copies that can address larger blocks further back in the window. With a few codewords, this is fast and relatively easy to implement. However, some care must be taken in the choice of codewords to maximize compression.

Variable-Length Integers. The simplest method we tried uses a unary code to specify field width, followed by the field itself. Copy length and displacement fields are coded independently via this technique, so any correlations are ignored. There are more elaborate codings of the integers (such as [9], [10], or [13]) that have been used by [15] and [34] in their implementations of Lempel-Ziv compression.
These encodings have nice asymptotic properties for very large integers, but the unary code is best for our purposes since, as we will see shortly, it can be tuned easily to the statistics of the application. The unary code has the additional advantage of a simple hardware implementation. We will return to the unary code in more detail shortly.

Fixed Huffman. Ideally, a fixed Huffman encoder should be applied to source consisting of the copy length and displacement concatenated together (to capture the correlation of these two fields). However, since we wish to expand window size to 16384 and maximum copy length to 2000, the realities of gathering statistics and constructing an implementation dictate that we restrict the input of the fixed Huffman compressor to a size much smaller than 2000 x 16384 by grouping together codes with nearly equal copy lengths and displacements. To improve speed we use tables to encode and decode a byte at a time. Nevertheless, the fixed Huffman approach is the most complex and slowest of the three options compared here.

To decide how much compression could be increased with a fixed Huffman approach, we experimented with several groupings of nearly equal copy lengths and displacements, using a finer granularity for small values, so that the input to the fixed Huffman compressor had only about 30,000 states, and we computed the entropy to give a theoretical bound on the compression. The smallest entropy we obtained was only 4 percent more compact than the actual compression achieved with the unary encoding described below; and any real implementation would do worse than an entropy bound. Consequently, because the fixed Huffman approach did not achieve significantly higher compression, we favor the simpler unary code, though this is not an overwhelmingly clear choice.

Define a (start, step, stop) unary code of the integers as follows: The nth codeword has n ones followed by a zero followed by a field of size start + n*step. If the field size is equal to stop then the preceding zero can be omitted. The integers are laid out sequentially through these codewords. For example, (3, 2, 9) would look like:

  Range      Codeword
  0-7        0xxx
  8-39       10xxxxx
  40-167     110xxxxxxx
  168-679    111xxxxxxxxx

Appendix B contains a simple procedure that generates unary codes. The A2 textual substitution method encodes copy length with a (2, 1, 10) code, leading to a maximum copy length of 2044. A copy length of zero signals a literal, for which literal length is then encoded with a (0, 1, 5) code, leading to a maximum literal length of 63 bytes. If copy length is non-zero, then copy displacement is encoded with a (10, 2, 14) code. The exact maximum copy and literal lengths are chosen to avoid wasted states in the unary progressions; a maximum copy length of 2044 is sufficient for the kinds of data studied in Section 8. The A1 policy for choosing between copy and literal codewords is used.

Three refinements are used to increase A2's compression by approximately 1 to 2 percent. First, since neither another literal nor a copy of length 2 can immediately follow a literal of less than maximum literal length, in this situation we shift copy length codes down by 2. In other words, in the (2, 1, 10) code for copy length, 0 usually means literal, 1 means copy length 2, etc.; but after a literal of less than maximum literal length, 0 means copy length 3, 1 means copy length 4, etc.
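Here is a minimal sketch of the (start, step, stop) unary code just defined. These are my own helper functions, not the Appendix B procedure (which is not included in the scan); the final loop reproduces the (3, 2, 9) ranges 0-7, 8-39, 40-167 and 168-679 from the table above.

--8<--------
def unary_encode(value, start, step, stop):
    n, base, field = 0, 0, start
    while value - base >= 1 << field:          # find the field that holds value
        base += 1 << field
        n += 1
        field += step
        if field > stop:
            raise ValueError("value out of range for this progression")
    bits = "1" * n + ("0" if field < stop else "")   # zero omitted at the last field
    return bits + format(value - base, f"0{field}b") if field else bits

def unary_decode(bits, start, step, stop):
    """Decode one codeword from the front of bits; return (value, bits consumed)."""
    n, field, i = 0, start, 0
    while field < stop and bits[i] == "1":     # count the unary prefix
        n += 1
        field += step
        i += 1
    if field < stop:
        i += 1                                 # skip the terminating zero
    base = sum(1 << (start + k * step) for k in range(n))
    return base + int(bits[i:i + field] or "0", 2), i + field

for v in (0, 7, 8, 39, 40, 167, 168, 679):     # the (3, 2, 9) range boundaries
    code = unary_encode(v, 3, 2, 9)
    assert unary_decode(code, 3, 2, 9)[0] == v
    print(v, code)
--8<--------

The same two functions cover the (2, 1, 10) copy length, (0, 1, 5) literal length and (10, 2, 14) displacement progressions used by A2, since only the three parameters change.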
Secondly, we phase in the copy displacement encoding for small files, using a (10-x, 2, 14-x) code, where x starts at 10 and descends to 0 as the number of window positions grows: for example, x = 10 allows 2^0 + 2^2 + 2^4 = 21 values to be coded, so when the number of window positions exceeds 21, x is reduced to 9; and so forth.

Finally, to eliminate wasted states in the copy displacement encoding, the largest field in the (10-x, 2, 14-x) progression is shrunk until it is just large enough to hold all values that must be represented; that is, if v values remain to be encoded in the largest field then smaller values are encoded with floor(log2 v) bits and larger values with ceil(log2 v) bits rather than 14-x bits. This trick increases compression during startup, and, if the window size is chosen smaller than the number of values in the displacement progression, it continues to be useful thereafter. For example, the compression studies in Section 8 use an A2 window size of 16,384 characters, so the (10, 2, 14) code would waste 5,120 states in the 14-bit field without this trick.

Percolating update seems preferable for the implementation of A2 because of the large maximum copy length; with update-to-root, pathological input could slow the compressor by a factor of 20. Unfortunately, the percolating update does not guarantee that the suffix tree will report the nearest position for a match, so longer codewords than necessary may sometimes be used. This problem is not serious because the tree is often shallow, and nodes near the root usually have many sons, so updates propagate much more rapidly than assumed in the analysis of Section 3. On typical files, compression with percolated update is 0.4 percent less than with update-to-root.

6. A FASTER COMPRESSOR

A2 has very fast expansion with a small storage requirement, but, even though compression has constant amortized time, it is 5 times slower than expansion. A1 and A2 are most appropriate in applications where compression speed is not critical and where the performance of the expander needs to be optimized, such as the mass release of software on floppy disks. However, in applications such as file archiving, faster compression is needed. For this reason, we have developed the B1 and B2 methods described here, which use the same encodings as A1 and A2, respectively, but compute window displacement differently. Copy codewords are restricted to start at the beginning of the yth previous codeword or literal character emitted; they can no longer address every earlier character, but only those where literal characters occurred or copy codewords started; we refer to displacements computed this way as "compressed displacements" throughout. Copy length is still measured in characters, like A1. By inserting this level of indirection during window access, compression speed typically triples, though expansion and the rate of adaptation are somewhat slower.

With "compressed displacements," suffix pointers and update propagation are unnecessary and a simpler PATRICIA tree can be used for the dictionary. Entries are made in the tree only on codeword boundaries, and this can be done in linear time by starting at the root on each iteration. It is useful to create an array of permanent nodes for all characters at depth 1. Since copy codewords of length 1 are never issued, it doesn't matter that some permanent nodes don't correspond to any window character. Each iteration begins by indexing into this node array with the next character.
Then hash table lookups and arc character comparisons are used to descend deeper, as in A1. The new window position is written into nodes passed on the way down, so update propagation is unnecessary.

In short, the complications necessary to achieve constant average time per source character with A2 are eliminated. However, one new complication is introduced. In the worst case, the 16,384 window positions of B2 could require millions of characters, so we impose a limit of 12 x 16,384 characters; if the full window exceeds this limit, leaves for the oldest window positions are purged from the tree.

Because of slower adaptation, B2 usually compresses slightly less than A2 on small files. But on text and program source files, it surpasses A2 by 6 to 8 percent asymptotically; the crossover from lower compression to higher occurs after about 70,000 characters! A2 codewords find all the near-term context, while B2 is restricted to start on previous codeword boundaries but can consequently reach further back in the file. This gives B2 an advantage on files with a natural word structure, such as text, and a disadvantage on files where nearby context is especially important, such as scanned images.

We also tried variations where the tree is updated more frequently than on every codeword boundary and literal character. All variations up to and including A2 can be implemented within the general framework of this method, if speed is not an issue. For example, we found that about 1 percent higher compression can be achieved by inserting another compressed position between the two characters represented by each length 2 copy codeword, and another 0.5 percent by also inserting compressed positions after each character represented by length 3 copy codewords. However, because these changes slow compression and expansion we haven't used them.

7. IMPROVING THE COMPRESSION RATIO

In Section 6 we considered ways to speed up compression at the cost of slower adaptation and expansion. In this section we will explore the other direction: improving the compression ratio with a slight cost to the running time of the algorithm.

When a string occurs frequently in a file, all the methods we have considered so far waste space in their encoding: when they are encoding the repeating string, they are capable of specifying the copy displacement to multiple previous occurrences of the string, yet only one string needs to be copied. By contrast, the data structures we have used do not waste space. The repeating strings share a common path near the root. If we base the copy codewords directly on the data structure of the dictionary, we can improve the compression ratio significantly. (This brings us closer to the second style of Ziv and Lempel's textual substitution work [19, 29, 43], where a dictionary is maintained by both the compressor and expander. However, since we still use a window and an explicit copy length coding, it is natural to view this as a modification of our earlier compressors, in the style of Ziv and Lempel's first textual substitution work.)

The C2 method uses the same PATRICIA tree data structures as B2 to store its dictionary. Thus it takes two pieces of information to specify a word in the dictionary: a node, and a location along the arc between the node and its parent (since PATRICIA tree arcs may correspond to strings with more than one character).
We will distinguish two cases for a copy: if the arc is at a leaf of the tree, then we will use a LeafCopy codeword, while if the arc is internal to the tree we will use a NodeCopy codeword. Essentially, those strings appearing two or more times in the window are coded with NodeCopies, avoiding the redundancy of A2 or B2 in these cases.

The C2 encoding begins with a single prefix bit that is 0 for a NodeCopy, 1 for a LeafCopy or Literal. For NodeCopy codewords, the prefix is followed by a node number in [0 .. maxNodeNo], where maxNodeNo is the largest node number used since initialization; for most files tested, maxNodeNo is about 50 percent of the number of leaves. Following the node number, a displacement along the arc from the node to its parent is encoded; for most NodeCopy codewords the incoming arc is of length 1, so no length field is required. If a length field is required, 0 denotes a match exactly at the node, 1 a displacement 1 down the arc from the parent node, etc. Rarely is the length field longer than one or two bits because the arc lengths are usually short, so all possible displacements can be enumerated with only a few bits. For both the node number and the incoming arc displacement, the trick described in Section 5 is used to eliminate wasted states in the field; that is, if v values must be encoded, then the smaller values are encoded with floor(log2 v) bits and larger values with ceil(log2 v) bits.

LeafCopies are coded with unary progressions like those of A2 or B2. A (1, 1, 11) progression is used to specify the distance of the longest match down the leaf arc from its parent node, with 0 denoting a literal; this progression leads to a maximum copy length of 4094 bytes. Since another literal never occurs immediately after a literal of less than maximum literal length, the LeafCopy arc distance progression is shifted down by 1 when the preceding codeword was a literal (i.e., arc displacement 1 is coded as 0, 2 as 1, etc.). On a cross section of files from the data sets discussed later, distance down the leaf arc was highly skewed, with about half the arc displacements occurring one character down the leaf arc. Because of this probability spike at 1 and the rapid drop off at larger distances, the average length field is small. Following the length field, the window position is coded by gradually phasing in a (10, 2, 14) unary progression exactly like B2's.

Literals are coded by first coding a LeafCopy arc displacement of 0 and then using a (0, 1, 5) unary progression for the literal length, exactly like B2.

Unlike A2 and B2, the expander for C2 must maintain a dictionary tree exactly like the compressor's tree to permit decoding. Notice that this is not as onerous as it might seem. During compression, the algorithm must search the tree downward (root toward leaves) to find the longest match, and this requires a hash table access at each node. By contrast, the expander is told which node was matched, and can recover the length and window position of the match from the node. No hash table is required, but the encoding is restricted: a copy codeword must always represent the longest match found in the tree; in particular, the superior heuristic used by B2 to choose between Literal and Copy codewords must be discarded; instead, when the longest match is of length 2 or more, a copy codeword must always be produced. With this restriction, the expander can reconstruct the tree during decoding simply by hanging each new leaf from the node or arc indicated by the NodeCopy or LeafCopy codeword, or, in the case of literals, by hanging the leaf from the permanent depth 1 node for each literal character.
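Both the displacement phase-in of Section 5 and the NodeCopy fields above rely on the same wasted-state trick: when only v values are possible, the smaller ones get floor(log2 v) bits and the larger ones ceil(log2 v) bits. Below is a minimal sketch of that code (sometimes called a phased-in or minimal binary code), assuming nothing beyond what the text states; the function names are mine, not the paper's. With v = 11264 (the values left for the 14-bit field of the (10, 2, 14) code once a 16,384-position window is full), 5,120 of the values take only 13 bits, matching the "5,120 wasted states" figure quoted in Section 5.

--8<--------
def minimal_binary_encode(value, v):
    """Encode value in [0, v) using floor(log2 v) or ceil(log2 v) bits.

    With k = ceil(log2 v), the first 2**k - v values use k-1 bits and the
    rest use k bits, so no bit patterns are wasted when v is not a power
    of two.  Returns a '0'/'1' string.
    """
    assert 0 <= value < v
    k = max(1, (v - 1).bit_length())   # ceil(log2 v) for v >= 2
    short = (1 << k) - v               # how many values fit in k-1 bits
    if value < short:
        return format(value, '0{}b'.format(k - 1)) if k > 1 else ''
    return format(value + short, '0{}b'.format(k))

def minimal_binary_decode(bits, v):
    """Inverse of minimal_binary_encode; `bits` is an iterator of '0'/'1'."""
    k = max(1, (v - 1).bit_length())
    short = (1 << k) - v
    x = 0
    for _ in range(k - 1):
        x = (x << 1) | int(next(bits))
    if x >= short:                     # long codeword: read one more bit
        x = ((x << 1) | int(next(bits))) - short
    return x

if __name__ == '__main__':
    v = 16384 - 1024 - 4096            # values left for the largest field: 11264
    for value in (0, 5119, 5120, 11263):
        code = minimal_binary_encode(value, v)
        assert minimal_binary_decode(iter(code), v) == value
        print(value, len(code), 'bits')
--8<--------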
--- ifmail v.2.14dev2
 * Origin: [not found] (2:5020/400@fidonet)

— RU.COMPRESS From : Disturbo 2:5020/400 05 Dec 98 17:36:27 To : Bulat Ziganshin Subj : LZFG (CACM publ.) 3 of 4
From: Disturbo <pprgd@aha.ru>

8. EMPIRICAL STUDIES

In this section, we compare the five compression methods we have developed with other one-pass, adaptive methods. For most other methods, we do not have well-tuned implementations and report only compression results. For implementations we have tuned for efficiency, speed is also estimated (for our 3 MIP, 16-bit word size, 8 megabyte workstations). The execution times used to determine speed include the time to open, read, and write files on the local disk (which has a relatively slow maximum transfer rate of 5 megabits per second); the speed is computed by dividing the uncompressed file size by the execution time for a large file.

We tested file types important in our working environment. Each number in Table 1 is the sum of the compressed file sizes for all files in the group divided by the sum of the original file sizes. Charts 1-3 show the dependency of compression on file size for all of the compression methods tested on the source code (SC) data set.

Data Sets

SC Source Code. All 8-bit Ascii source files from which the boot file for our programming environment is built. Files include some English comments, and a densely-coded collection of formatting information at the end of each file reduces compressibility. The files themselves are written in the Cedar language. (1185 files, average size 11 Kbytes, total size 13.4 Mbytes)

TM Technical Memoranda. All files from a directory where computer science technical memoranda and reports are filed, excluding those containing images. These files are 8-bit Ascii text with densely-coded formatting information at the end (like the source code). (134 files, average size 22 Kbytes, total size 2.9 Mbytes)

NS News Service. One file for each work day of a week from a major wire service; these files are 8-bit Ascii with no formatting information. Using textual substitution methods, these do not compress as well as the technical memoranda of the previous study group, even though they are much larger and should be less impacted by startup transient; inspection suggests that the larger vocabulary and extensive use of proper names might be responsible for this. (5 files, average size 459 Kbytes, total size 2.3 Mbytes)

CC Compiled Code. The compiled-code files produced from the SC data set. Each file contains several different regions: symbol names, pointers to the symbols, statement boundaries and source positions for the debugger, and executable code. Because each region is small and the regions have different characteristics, these files severely test an adaptive compressor. (1220 files, average size 13 Kbytes, total size 16.5 Mbytes)

BF Boot File. The boot file for our programming environment, basically a core image and memory map. (1 file, size 525 Kbytes)

SF Spline Fonts. Spline-described character fonts used to generate the bitmaps for character sets at a variety of resolutions. (94 files, average size 39 Kbytes, total size 3.6 Mbytes)
RCF Run-coded Fonts. High-resolution character fonts, where the original bitmaps have been replaced by a run-coded representation. (68 files, average size 47 Kbytes, total size 3.2 Mbytes)

SNI Synthetic Images. All 8 bit/pixel synthetic image files from the directory of an imaging researcher. The 44 files are the red, green, and blue color separations for 12 color images, 2 of which also have an extra file to encode background transparency; in addition, there are 6 other grey scale images. (44 files, average size 328 Kbytes, total size 14.4 Mbytes)

SCI Scanned Images. The red separations for all 8 bit/pixel scanned-in color images from the directory of an imaging researcher. The low-order one or two bits of each pixel are probably noise, reducing compressibility. (12 files, average size 683 Kbytes, total size 8.2 Mbytes)

BI Binary Images. CCITT standard images used to evaluate binary facsimile compression methods. Each file consists of a 148-byte header followed by a binary scan of 1 page (1728 pixels/scan line x 2376 scan lines/page). Some images have blocks of zeros more than 30,000 bytes long. Because these files are composed of 1-bit rather than 8-bit items, the general-purpose compressors do worse than they otherwise might. (8 files, average size 513 Kbytes, total size 4.1 Mbytes)

The special-purpose CCITT 1D and 2D compression methods reported in [18] achieve, respectively, 0.112 and 0.064 compression ratios on these standard images when the extraneous end-of-line codes required by the facsimile standard are removed and when the extraneous 148-byte header is removed. The special-purpose CCITT 2D result is significantly more compact than any general purpose method we tested, and only CW and C2 surpassed the 1D result.

Measurements and Compression Methods

H0 and H1. These are entropy calculations made on a per file basis according to:

  H0 = - SUM_i P(x = c_i) log2 P(x = c_i)                                        (7)

  H1 = - SUM_{i,j} P(x = c_i) P(y = c_j | x = c_i) log2 P(y = c_j | x = c_i)     (8)

where x is a random symbol of the source, xy is a randomly chosen pair of adjacent source characters, and c_i ranges over all possible symbols. Because of the small file size, the curves in Charts 1 to 3 drop off to the left. In theory, this small sampling problem can be corrected according to [2], but we have found it difficult to estimate the total character set size in order to apply these corrections. Nevertheless, Chart 1 shows that H0 is a good estimator for how well a memoryless (zero-order) compressor can do when file size is a large multiple of 256 bytes, and H1 bounds the compression for a first-order Markov method. (None of our files were large enough for H1 to be an accurate estimator.)

KG and V. These adaptive methods maintain a Huffman tree based on the frequency of characters seen so far in a file. The compressor and expander have roughly equal performance. The theory behind the KG approach appears in [11] and [23]. The similar V method, discussed in [37], should get better compression during the startup transient at the expense of being about 18 percent slower. It is also possible to bound the performance of Vitter's scheme closely to that of a fixed non-adaptive compressor. Except on the highly compressible CCITT images, these methods achieve compression slightly worse than H0, as expected. But because of bit quantization, the compression of the CCITT images is poor; arithmetic coding would compress close to H0 even on these highly compressible sources.
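For reference, here is a minimal sketch (not from the paper) of per-file H0 and H1 estimates computed from raw byte counts; these are the maximum-likelihood estimates that suffer from the small-sample effect mentioned above. Dividing the entropies by 8 bits per character gives numbers comparable to the H0 and H1 rows of Table 1, which appear to be expressed as compression ratios.

--8<--------
import math
from collections import Counter

def h0_bits_per_char(data: bytes) -> float:
    """Zeroth-order (memoryless) entropy, in bits per character."""
    n = len(data)
    counts = Counter(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def h1_bits_per_char(data: bytes) -> float:
    """First-order entropy: entropy of a character given its predecessor."""
    if len(data) < 2:
        return 0.0
    pairs = Counter(zip(data, data[1:]))   # counts of adjacent pairs (x, y)
    singles = Counter(data[:-1])           # counts of predecessors x
    total = len(data) - 1
    h = 0.0
    for (a, b), c in pairs.items():
        p_pair = c / total                 # P(x = a, y = b)
        p_cond = c / singles[a]            # P(y = b | x = a)
        h -= p_pair * math.log2(p_cond)
    return h

if __name__ == '__main__':
    sample = b'abracadabra' * 100
    # Compression ratios comparable to Table 1: entropy / 8 bits per byte.
    print(round(h0_bits_per_char(sample) / 8, 3),
          round(h1_bits_per_char(sample) / 8, 3))
--8<--------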
CW. Based on [6], this method gathers higher-order statistics than KG or V above (which we ran only on zeroth-order statistics). The method that Cleary and Witten describe keeps statistics to some order o and encodes each new character based on the context of the o preceding characters. (We've used o = 3, because any higher order exhausts storage on most of our data sets.) If the new character has never before appeared in the same context, then an escape mechanism is used to back down to smaller contexts to encode the character using those statistics. (We've used their escape mechanism A with exclusion of counts from higher-order contexts.) Because of high event probabilities in some higher-ordered contexts and the possibility of multiple escapes before a character is encoded, the fractional bit loss of Huffman encoding is a concern, so [6] uses arithmetic encoding. We have used the arithmetic encoder in [40].

TABLE 1. Comparison of Compression Methods

            Text                Binary        Fonts         Images
  Method    SC    TM    NS      CC    BF      SF    RCF     SNI   SCI   BI
  H0       .732  .612  .590    .780  .752    .626  .756    .397  .845  .148
  H1       .401  .424  .467    .540  .573    .380  .597    .181  .510  .101
  KG       .751  .625  .595    .804  .756    .637  .767    .415  .850  .205
  V        .749  .624  .595    .802  .756    .637  .766    .414  .850  .205
  CW       .369  .358  .326    .768  .544    .516  .649    .233  .608  .106
  MW1      .508  .470  .487    .770  .626    .558  .705    .259  .728  .117
  MW2      .458  .449  .458    .784  .594    .526  .692    .270  .774  .117
  UW       .521  .476  .442    .796  .638    .561  .728    .255  .697  .118
  BSTW     .426  .434  .465     -    .684     -    .581     -     -     -
  A1       .430  .461  .520    .741  .608    .502  .657    .351  .766  .215
  A2       .366  .395  .436    .676  .538    .460  .588    .259  .709  .123
  B1       .449  .458  .501    .753  .616    .505  .676    .349  .777  .213
  B2       .372  .403  .410    .681  .547    .459  .603    .255  .714  .117
  C2       .360  .376  .375    .668  .527    .445  .578    .233  .662  .105

As Table 1 shows, CW achieves excellent compression. Its chief drawbacks are its space and time performance. Its space requirement can grow in proportion to file size; for example, statistics for o = 3 on random input could require a tree with 256^4 leaves, though English text requires much less. The space (and consequently time) performance of CW degrades dramatically on "more random" data sets like SNI and SCI. A practical implementation would have to limit storage somehow. Even on English, Bell, Cleary, and Witten estimate that Moffat's tuned implementation of CW is 3 times slower compressing and 5 times slower expanding than C2 [4].

MW1. This method, described in [29], is related to the second style of Lempel-Ziv compression alluded to in the introduction. It uses a trie data structure and 12-bit codes. Initially (and always) the dictionary contains 256 one-character strings. New material is encoded by finding the longest match in the dictionary, outputting the associated code, and then inserting a new dictionary entry that is the longest match plus one more character. After the dictionary has filled, each iteration reclaims an old code from among dictionary leaves, following an LRU discipline, and reuses that code for the new dictionary entry. The expander works the same way. MW1 is simple to implement and is balanced in performance, with good speed both compressing and expanding (250,000 bits/sec and 310,000 bits/sec respectively). The original method used 12-bit codes throughout for simplicity and efficiency. However, our implementation starts by using 9-bit codewords, increasing to 10, 11, and finally to 12 bits as the dictionary grows to its maximum size; this saves up to 352 bytes in the compressed file size. On text and source code, Miller and Wegman determined that the 12-bit codeword size is close to optimal for this method.
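A minimal sketch of the dictionary discipline just described for MW1 (greedy longest match, then insert match plus the next character); it is illustrative only and omits the 12-bit code limit, the LRU reclamation of leaf codes, and the 9-to-12-bit phase-in used in the authors' implementation.

--8<--------
def mw1_like_compress(data: bytes):
    """Greedy dictionary compression in the style of MW1 (simplified).

    The dictionary starts with all 256 one-byte strings.  Each step finds
    the longest dictionary match at the current position, emits its code,
    and inserts the match plus the following character as a new entry.
    Returns a list of integer codes (no bit-level packing is attempted).
    """
    dictionary = {bytes([i]): i for i in range(256)}
    codes = []
    i = 0
    while i < len(data):
        match = data[i:i + 1]
        # Extend the match while the longer string is still in the dictionary.
        while (i + len(match) < len(data)
               and data[i:i + len(match) + 1] in dictionary):
            match = data[i:i + len(match) + 1]
        codes.append(dictionary[match])
        # New entry: longest match plus one more character, if any remains.
        if i + len(match) < len(data):
            dictionary[data[i:i + len(match) + 1]] = len(dictionary)
        i += len(match)
    return codes

if __name__ == '__main__':
    out = mw1_like_compress(b'abcdefghi ' * 8)
    print(len(out), 'codes for 80 input bytes')
--8<--------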
MW2. One drawback of MW1 is the slow rate of buildup of dictionary entries. If, for example, the word abcdefghi appears frequently in a document, then ab will be in the dictionary after the first occurrence, abc after the second, and so on, with the full word present only after 8 occurrences (assuming no help from similar words in the document). A1 below, for example, would be able to copy the whole word abcdefghi after the first occurrence, but it pays a penalty for the quick response by having a length field in its copy codeword. The idea of MW2 is to build dictionary entries faster by combining adjacent codewords of the MW1 scheme. Longer words like abcdefghi are built up at an exponential rather than linear rate. The chief disadvantage of MW2 is its increased complexity and slow execution. Our implementation follows the description in [29] and uses an upper limit of 4096 dictionary entries (or 12-bit codewords). We did not implement the 9-to-12-bit phase-in that was used in MW1, so the size-dependent charts underestimate MW2's potential performance on small files.

UW. This is the Compress utility found in Berkeley 4.3 Unix, which modifies a method described in a paper by Welch [39]; the authors of this method are S. Thomas, J. McKie, S. Davies, K. Turkowski, J. Woods, and J. Orost. It builds its dictionary like MW1, gradually expanding the codeword size from 9 bits initially up to 16 bits. The dictionary is frozen after 65,536 entries, but if the compression ratio drops significantly, the dictionary is discarded and rebuilt from scratch. We used this compressor remotely on a Vax-785, so it is difficult to compare its running time and implementation difficulties with the other methods we implemented. Nevertheless, because it does not use the LRU collection of codes, it should be faster than MW1. However, it has a larger total storage requirement and gets worse compression than MW1 on most data sets studied.

BSTW. This method first partitions the input into alphanumeric and non-alphanumeric "words," so it is specialized for text, though we were able to run it on some other kinds of data as well. The core of the compressor is a move-to-front heuristic. Within each class, the most recently seen words are kept on a list (we have used list size 256). If the next input word is already in the word list, then the compressor simply encodes the position of the word in the list and then moves the word to the front of the list. The move-to-front heuristic means that frequently used words will be near the front of the list, so they can be encoded with fewer bits. If the next word in the input stream is not on the word list, then the new word is added to the front of the list, while another word is removed from the end of the list, and the new word must be compressed character-by-character.

Since the empirical results in [5] do not actually give an encoding for the positions of words in the list or for the characters in new words that are output, we have taken the liberty of using the V compressor as a subroutine to generate these encodings adaptively. (There are actually four copies of Vitter's algorithm running, one to encode positions and one to encode characters in each of two partitions.)
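A minimal sketch of the move-to-front heuristic at the core of BSTW, as described above; the word parsing into two classes and the entropy coding of positions and new words (Vitter's algorithm in the authors' experiments) are left out, and the names are illustrative only.

--8<--------
def mtf_positions(words, list_size=256):
    """Move-to-front coding of a word stream.

    Yields (position, None) when the word was on the recency list, or
    (None, word) when it was not and would have to be sent literally;
    either way the word is then moved or added to the front, and the
    list is truncated to `list_size` entries.
    """
    recent = []
    for w in words:
        if w in recent:
            pos = recent.index(w)
            recent.pop(pos)
            yield pos, None            # frequent words get small positions
        else:
            yield None, w              # new word: would be coded char by char
        recent.insert(0, w)
        del recent[list_size:]

if __name__ == '__main__':
    text = 'the cat sat on the mat and the cat ran'.split()
    for pos, literal in mtf_positions(text):
        print(pos if literal is None else repr(literal), end=' ')
    print()
--8<--------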
Using an adaptive Huffman is slow; a fixed encoding would run faster, but we expect that a fixed encoding would slightly reduce compression on larger files while slightly improving compression on small files. We could not run BSTW for all of the data sets, since the parsing mechanism assumes human-readable text and long "words" appear in the other data sets. When the unreadable input parsed well, as in the case of run-coded fonts, the compression was very good.

A1. This is our basic method described earlier. It has a fast and simple expander (560,000 bits/sec) with a small storage requirement (10,000 bytes). However, the compressor is much slower and larger (73,000 bits/sec, 145,000 bytes using scan-from-leaf and update-to-root). The encoding has a maximum compression to 1/8 = 12.5 percent of the original file size because the best it can do is copy 16 characters with a 16-bit codeword.

Caveat: As we mentioned above, the running times reported include the file system overhead for a relatively slow disk. To provide a baseline, we timed a file copy without compression and obtained a rate of 760,000 bits per second. Thus, some of the faster expansion rates we report are severely limited by the disk. For example, we estimate that without disk overhead the A1 expander would be about twice as fast. On the other hand, removing disk overhead would hardly affect the compression speed of A1.

A2. This method, discussed in Section 5, enlarges the window to 16,384 characters and uses variable-width unary-coded copy and literal codewords to significantly increase compression. The running time and storage requirements are 410,000 bits/sec and 21,000 bytes for expansion and 60,000 bits/sec and 630,000 bytes for compression (using suffix pointers and percolated update).

B1. This method, discussed in Section 6, uses the A1 encoding but triples compression speed by updating the tree only at codeword boundaries and literal characters. The running time and storage requirements are 470,000 bits/sec and 45,000 bytes for expansion and 230,000 bits/sec and 187,000 bytes for compression.

B2. This method, discussed in Section 6, uses the same encoding as A2 but triples compression speed by updating the tree only at codeword boundaries and literal characters. The compressor and expander run at 170,000 and 380,000 bits/sec, respectively, and have storage requirements of 792,000 and 262,000 bytes.

C2. This method, discussed in Section 7, uses the same data structures as B2 but a more powerful encoding based directly upon the structure of the dictionary tree. Compression speed is about the same and expansion about 25 percent slower than B2; the compressor uses about the same storage as B2, but the expander uses more (about 529,000 bytes).

Table 1 highlights some differences between textual substitution methods like C2 and statistical methods like CW. (Time and space performance differences have been discussed earlier.) There are several data sets where these methods differ dramatically. On NS, CW is significantly better than C2. We believe that this is because NS shows great diversity in vocabulary, a property that is troublesome for textual substitution, since it cannot copy new words easily from elsewhere in the document, but is benign for CW, since new words are likely to follow the existing English statistics. On CC, on the other hand, C2 is significantly better than CW. We believe that this is because CC contains several radically different parts, e.g., symbol tables and compiled code.
C2 is able to adjust to dramatic shifts within a file, due to literal codewords and copy addressing that favors nearby context, while CW has no easy way to rapidly diminish the effect of older statistics.

For all of our methods, A2, B2, and C2, window size is a significant consideration because it determines storage requirements and affects compression ratios. Chart 4 shows compression as a function of window size for the NS data set (concatenated into a single file to avoid start-up effects), and for the BF boot file. These two data sets were typical of the bimodal behavior we observed in our other data sets: large human-readable files benefit greatly from increasing window size, while other test groups show little improvement beyond a window size of 4K.

--- ifmail v.2.14dev2
 * Origin: [not found] (2:5020/400@fidonet)

— RU.COMPRESS From : Disturbo 2:5020/400 05 Dec 98 17:36:29 To : Bulat Ziganshin Subj : LZFG (CACM publ.) 4 of 4
From: Disturbo <pprgd@aha.ru>

9. CONCLUSIONS

We have described several practical methods for lossless data compression and developed data structures to support them. These methods are strongly adaptive in the sense that they adapt not only during startup but also to context changes occurring later. They are suitable for most high speed applications because they make only one pass over source data, use only a constant amount of storage, and have constant amortized execution time per character.

Our empirical studies point to several broad generalizations. First, based on the H0 and H1 theoretical limits, textual substitution via A2, B2, or C2 surpasses memoryless or first-order Markov methods applied on a character-by-character basis on half the data sets. On the other half, even the CW third-order method can't achieve the H1 bound. This suggests that, to surpass textual substitution for general purpose compression, any Markov method must be at least second-order, and to date, all such methods have poor space and time performance.

Secondly, the methods we've developed adapt rapidly during startup and at transitions in the middle of files. One reason for rapid adaptation is the use of smaller representations for displacements to recent positions in the window. Another reason is the inclusion of multi-character literal codewords. Together the literals and short displacements allow our methods to perform well on short files, files with major internal shifts of vocabulary or statistical properties, and files with bursts of poorly compressing material, all properties of a significant number of files in our environment.

Thirdly, it appears that the displacement-and-length approach to textual substitution is especially effective on small files. On 11,000-byte program source files, for example, A2 and B2 were over 20 percent more compact than textual substitution methods which did not use a length field (UW, MW1, and MW2). This is not surprising because the particular advantage of the length field in copy codewords is rapid adaptation on small files. However, even on the largest files tested, A2 and B2 usually achieved significantly higher compression. Only on images did other methods compete with them; our most powerful method, C2, achieved higher compression than any other textual substitution method we tested on all data sets.
The effect of a length field is to greatly expand dictionary size with little or no increase in storage or processing time; our results suggest that textual substitution methods that use a length field will work better than those which do not.

Fourthly, studies of A2, B2, and C2 using different window sizes showed that, for human-readable input (e.g., English, source code), each doubling of window size improves the compression ratio by roughly 6 percent (for details see Chart 4). Furthermore, the data structures supporting these methods scale well: running time is independent of window size, and memory usage grows linearly with window size. Thus increasing window size is an easy way to improve the compression ratio for large files of human-readable input. For other types of input the window size can be reduced to 4096 without significantly impacting compression.

Going beyond these empirical results, an important practical consideration is the tradeoff among speed, storage, and degree of compression; speed and storage have to be considered for both compression and expansion. Of our own methods, A2 has very fast expansion with a minimal storage requirement; its weakness is slow compression; even though the suffix tree data structure with amortized update has constant amortized time per character, compression is still seven times slower than expansion. However, in applications which can afford relatively slow compression, A2 is excellent; for example, A2 would be good for mass distribution of software on floppy disks or for overnight compression of files on a file server. Furthermore, if the parallel matching in the compression side of A2 were supported with VLSI, the result would be a fast, powerful method requiring minimal storage both compressing and expanding.

B2 provides nearly three times faster compression than A2 but has somewhat slower expansion and adaptation. Thus B2 is well suited for communication and archiving applications.

A1 and B1 do not compress as well as A2 and B2, respectively, but because of their two-codeword, byte-aligned encodings they are better choices for applications where simplicity or speed is critical. (For example, J. Gasbarro has designed and implemented an expansion method like A1 to improve the bandwidth of a VLSI circuit tester [12].)

C2 achieves significantly higher compression than B2, but its expander is somewhat slower and has a larger storage requirement. In the compression study reported in Section 8, C2 achieved the highest compression of all methods tested on 6 of the 10 data sets.

We believe that our implementations and empirical results demonstrate the value of window-based textual substitution. Together the A, B and C methods offer good options that can be chosen according to resource requirements.

APPENDIX A
A Pathological Example

We now show a string that has the F pattern of equation (6) of Section 3:

   1    2    3    4    5    6    7
  p10  p10  p9   p8   p7   p6   p5                            (6)
   8    9   10   11   12   13   14   15
  p4   p3   p2   p1   p2   p10  p10  p9  ...
Hereafter we will stop abstracting the string by its copy lengths; capital letters are strings, small letters are single characters, and i, j, r, p, b are integers. The pathological string follows the pattern:

  M0 M1 ... Mr-1 M0 M1 ... Mr-1 M0 M1 ...                        (9)

where the parameter r is chosen large enough so that one iteration exceeds the finite window (this prevents direct copying from the beginning of one M0 to a subsequent M0). Within each Mi we have n/p groups,

  Mi = Gi0 Gi1 Gi2 ... Gi(n/p-1),                                (10)

and each group is:

  Gji = SjiiBoS{j-U)pCiSjfn.-iBlS(j+i]p^-iCiSip+2 B2S(j+i)p+2CiSfp+] . . . BpiS(;+i)p+^-iCi,   (11)

This causes p and n to be related by:

  pb = 1m

We have introduced two more parameters: p is the number of minor blocks Bi, and n is the number of s characters. All of the s subscripts in the above formula are computed mod n. The groups skew so that, for example, the beginning of G10 = s1 B1 sp+1 ... will not match entirely with the beginning of G00 = s1 B1 s1 ...; it will, however, match in two parts: the prefix s1 B1 appears in both strings, and the suffix of G10 = ... B1 sp+1 ... will match with the suffix of G01 = ... B1 sp+1. If, for example, B1 has 9 characters, this gives two consecutive locations where a copy of size 10 is possible, in the pattern of equation 6.

It remains to create the match of length 2 at position 12 in equation (6). For this purpose, each of the ci above are either ei or oi. They will always precede respectively even and odd numbered sj, and match in pairs with their following sj's. For example, the c0 in G00 = s1 B1 s1 c0 s2 B2 s2 will match with s2. The e0 s2 match is hidden in a minor block segregated from the odd numbered si:

  B0 = xeoSoeoS2?oS4 - e(>Sb-2
  B1 = xeoSi,eoSi+2CoS(,+4 . . . ?0521-2
  B"/> = ^15001525154 . . CiSi.-2                                (12)
  Bp/2 = ^005100530055 . . . 0051-1

In the case of our running example, where the finite window is size 4096 and the maximum copy length is 16 characters, an appropriate setting of the above parameters is:

  r = 2, b = S, p = 100, n = 200                                 (13)

We need to take some care that the heuristic does not find the optimal solution. It turns out that if we just start as in equation (9), then the first M0 will not compress well, but the heuristic will start the behavior we are seeking in M1. Asymptotically we achieve a worst case ratio of Vs between the optimal algorithm and the policy heuristic.

APPENDIX B
Computing a Unary-Based Variable Length Encoding of the Integers

In Section 5 we defined a (start, step, stop) unary code of the integers as a string of n ones followed by a zero followed by a field of f bits, where f is in the arithmetic progression defined by (start, step, stop). This can be defined precisely by the following encoder:

  EncodeVar: PROC [out: CARDINAL, start, step, last: CARDINAL] ~ {
    UNTIL out < Power2[start] DO
      PutBits[1, 1];
      out <- out - Power2[start];
      start <- start + step;
      ENDLOOP;
    IF start < last THEN PutBits[out, start + 1]  -- 0 followed by field of size "start"
    ELSE IF start > last THEN ERROR
    ELSE PutBits[out, start];                     -- save a bit
    };

  PutBits: PROC [out: CARDINAL, bits: INTEGER] ~ ...
    -- Output the binary encoding of "out" in a field of size "bits."

Notice that the encoder is able to save one bit in the last field size of the arithmetic progression.

Acknowledgments. We would like to thank Jim Gasbarro, John Larson, Bill Marsh, Dick Sweet, Ian Witten, and the anonymous referees for helpful comments and assistance.

REFERENCES
1. Abramson, N. Information Theory and Coding. McGraw-Hill, 1963, 61.
2. Basharin, G.P. On a statistical estimate for the entropy of a sequence of independent random variables. Theory Prob. Appl. 4 (1959), 333-336.
3. Bell, T.C. Better OPM/L text compression. IEEE Trans. Commun. COM-34, 12 (1986), 1176-1182.
4. Bell, T.C., Cleary, J.G., and Witten, I.H. Text Compression. Prentice-Hall, in press.
5. Bentley, J.L., Sleator, D.D., Tarjan, R.E., and Wei, V.K. A locally adaptive data compression scheme. Commun. ACM 29, 4 (1986), 320-330.
6. Cleary, J.G., and Witten, I.H. Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. COM-32, 4 (1984), 396-402.
7. Cormack, G.V., and Horspool, R.N.S. Data compression using dynamic Markov modelling. The Computer Journal 30, 6 (1987), 541-550.
8. Dantzig, G.B., Blattner, W.O., and Rao, M.R. Finding a Cycle in a Graph with Minimum Cost to Time Ratio with Application to a Ship Routing Problem. In Theory of Graphs, P. Rosenstiehl, ed., Gordon and Breach, 1966.
9. Elias, P. Universal codeword sets and representations of the integers. IEEE Trans. Info. Theory IT-21, 2 (1975), 194-203.
10. Even, S., and Rodeh, M. Economical Encoding of Commas Between Strings. Commun. ACM 21 (1978), 315-317.
11. Gallager, R.G. Variations on a theme by Huffman. IEEE Trans. Info. Theory IT-24, 6 (1978), 668-674.
12. Gasbarro, J. An Architecture for High-Performance VLSI Testers. Ph.D. dissertation, Department of Electrical Engineering, Stanford University, 1988.
13. Golomb, S.W. Run-Length Encodings. IEEE Trans. Info. Theory IT-12 (1966), 399-401.
14. Guazzo, M. A General Minimum-Redundancy Source-Coding Algorithm. IEEE Trans. Info. Theory IT-26, 1 (1980), 15-25.
15. Guoan, G., and Hobby, J. Using string matching to compress Chinese characters. Stanford Tech. Rep. STAN-CS-82-914, 1982.
16. Horspool, R.N., and Cormack, G.V. Dynamic Markov Modelling - A Prediction Technique. In Proceedings of the 19th Annual Hawaii International Conference on System Sciences (1986), 700-707.
17. Huffman, D.A. A method for the construction of minimum-redundancy codes. In Proceedings of the I.R.E. 40 (1952), 1098-1101.
18. Hunter, R., and Robinson, A.H. International digital facsimile coding standards. In Proceedings of the IEEE 68, 7 (1980), 854-867.
19. Jakobsson, M. Compression of character strings by an adaptive dictionary. BIT 25 (1985), 593-603.
20. Jones, C.B. An efficient coding system for long source sequences. IEEE Trans. Info. Theory IT-27, 3 (1981), 280-291.
21. Kempf, M., Bayer, R., and Güntzer, U. Time Optimal Left to Right Construction of Position Trees. Acta Informatica 24 (1987), 461-474.
22. Knuth, D.E. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, second printing, 1975.
23. Knuth, D.E. Dynamic Huffman coding. J. Algorithms 6 (1985), 163-180.
24. Langdon, G.G., Jr. A note on the Ziv-Lempel model for compressing individual sequences. IEEE Trans. Info. Theory IT-29, 2 (1983), 284-287.
25. Langdon, G.G., Jr., and Rissanen, J. Compression of Black-White Images with Arithmetic Coding. IEEE Trans. Commun. COM-29, 6 (1981), 858-867.
26. Langdon, G.G., Jr., and Rissanen, J. A Double Adaptive File Compression Algorithm. IEEE Trans. Commun. COM-31, 11 (1983), 1253-1255.
27. Lawler, E.L. Combinatorial Optimization: Networks and Matroids. Holt, Rinehart and Winston, 1976.
28. McCreight, E.M. A space-economical suffix tree construction algorithm. J. ACM 23, 2 (1976), 262-272.
29. Miller, V.S., and Wegman, M.N. Variations on a theme by Ziv and Lempel. IBM Res. Rep. RC 10630 (#47798), 1984. Also in Combinatorial Algorithms on Words, NATO ASI Series F, 12 (1985), 131-140.
30. Morrison, D.R. PATRICIA - Practical Algorithm To Retrieve Information Coded in Alphanumeric. J. ACM 15, 4 (1968), 514-534.
31. Pasco, R.C. Source Coding Algorithms for Fast Data Compression. Ph.D. dissertation, Department of Electrical Engineering, Stanford University, 1976.
32. Rissanen, J., and Langdon, G.G., Jr. Arithmetic Coding. IBM Journal of Research and Development 23, 2 (1979), 149-162.
33. Rissanen, J., and Langdon, G.G., Jr. Universal Modeling and Coding. IEEE Trans. Info. Theory IT-27, 1 (1981), 12-23.
34. Rodeh, M., Pratt, V.R., and Even, S. Linear algorithm for data compression via string matching. J. ACM 28, 1 (1981), 16-24.
35. Shannon, C.E. A mathematical theory of communication. The Bell System Technical Journal 27, 3 (1948), 379-423 and 27, 4 (1948), 623-656.
36. Storer, J.A., and Szymanski, T.G. Data compression via textual substitution. J. ACM 29, 4 (1982), 928-951.
37. Vitter, J.S. Design and analysis of dynamic Huffman coding. Brown University Technical Report No. CS-85-13, 1985.
38. Weiner, P. Linear Pattern Matching Algorithms. In Proceedings of the Fourteenth Annual Symposium on Switching and Automata Theory (1973), 1-11.
39. Welch, T.A. A technique for high performance data compression. IEEE Computer 17, 6 (1984), 8-19.
40. Witten, I.H., Neal, R.M., and Cleary, J.G. Arithmetic coding for data compression. Commun. ACM 30, 6 (1987), 520-540.
41. Ziv, J. Coding theorems for individual sequences. IEEE Trans. Info. Theory IT-24, 4 (1978), 405-412.
42. Ziv, J., and Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Info. Theory IT-23, 3 (1977), 337-343.
43. Ziv, J., and Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Info. Theory IT-24, 5 (1978), 530-536.

CR Categories and Subject Descriptors: E.4 [Data]: Coding and Information Theory - data compaction and compression; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems - computations on discrete structures, pattern matching
General Terms: Algorithms, Design, Experimentation, Theory
Additional Key Words and Phrases: Amortized efficiency, automata theory, minimum cost to time ratio cycle, suffix trees, textual substitution

ABOUT THE AUTHORS:

EDWARD R. FIALA is a member of the research staff in the Computer Sciences Laboratory at the Xerox Palo Alto Research Center. He has general system interests including data compression and digital video. Author's Present Address: Xerox PARC, 3333 Coyote Hill Rd., Palo Alto, CA 94304; ARPA Internet: Fiala.pa@Xerox.com

DANIEL H. GREENE is presently a member of the research staff at Xerox's Palo Alto Research Center (PARC). He received his Ph.D. in 1983 from Stanford University. His current research interests are in the design and analysis of algorithms for data compression, computational geometry, and distributed computing. Author's Present Address: Xerox PARC, 3333 Coyote Hill Rd., Palo Alto, CA 94304; Greene.pa@Xerox.com

*************************************************************
OCRed and published for the benefits of All Internet Community by Disturbo
as a part of his Millenium Data Disclosure Program. December, the 2nd, 1998 AD.

*** The END ***

--- ifmail v.2.14dev2
 * Origin: [not found] (2:5020/400@fidonet)