Probability estimation for PPM

Abstract

The state of the art in lossless text compression is the PPM data compression scheme. Two approaches to the problem of selecting the context models used in the scheme are described. One uses an a priori upper bound on the lengths of the contexts, while the other method is unbounded. Several techniques that improve the probability estimation are described, including four new methods: partial update exclusions for the unbounded approach, deterministic scaling, recency scaling and multiple probability estimators. Each of these methods improves the performance for both the bounded and unbounded approaches. In addition, further savings are possible by combining the two approaches.

1. Introduction

The state of the art in lossless text compression is the PPM data compression scheme [1, 4]. PPM, or prediction by partial matching, is an adaptive statistical modeling technique based on blending together different-length context models to predict the next character in the input sequence. The scheme achieves greater compression than Ziv-Lempel (LZ) dictionary-based methods, which are more widely used because of their simplicity and faster execution speeds.

PPM models are based upon context models of varying length of prior characters. The models record the frequency of characters that have followed each of the contexts. For example, a particular context may be the letters "thei". All the letters that have followed this context are counted. The next time the letters "thei" occur in the text, the counts are used to estimate the probability for the upcoming character. The PPM technique blends together the context models for the varying lengths (such as "hei" and "ei") to arrive at an overall probability distribution. Arithmetic coding is then used to optimally encode the character which actually does occur with respect to this distribution.
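To make the counting concrete, here is a minimal Python sketch (not the implementation used in the paper) that accumulates, for every context up to an assumed maximum length of 4, the frequencies of the characters that followed it; the sample string and the helper name are purely illustrative.

```python
from collections import defaultdict, Counter

MAX_ORDER = 4  # illustrative bound; the paper's bounded variant uses order 5

def build_models(text, max_order=MAX_ORDER):
    """Count, for every context of length 0..max_order, the characters
    that have followed it in `text`."""
    models = defaultdict(Counter)
    for i, ch in enumerate(text):
        for order in range(max_order + 1):
            if i - order < 0:
                break
            context = text[i - order:i]
            models[context][ch] += 1
    return models

models = build_models("their theory is their own")
print(models["thei"])  # Counter({'r': 2}): characters seen after "thei"
print(models["ei"])    # shorter context consulted when blending
```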

In the next section, some of the problems associated with estimating the probability distribution are examined. The first is the problem of deciding how long the contexts should be. In light of this problem, two different variants of the algorithm are examined. One fixes an a priori upper bound on the length of the context models; the other is unbounded. The second problem, called local order estimation, concerns the selection of a particular context model from among all the current context models. Following that, we describe different techniques for arriving at a probability distribution given the chosen context model. Four new methods that improve the probability estimation for both the bounded and unbounded approaches are also described. In the final section we give results that apply these techniques and compare their performance, leading to an 8% improvement in compression for both approaches over the standard PPMC implementation [1]. A further 1% improvement is possible by combining the two approaches.

2. Probability estimation for PPM

PPM splits compression into two steps. The first step accumulates a statistical model of the characters seen so far in the input text. As each character is encoded, this model is used to generate a probability distribution over the characters which can occur next.

PPM's statistical model is built up from contexts of prior characters of varying lengths. A problem arises in deciding how long the contexts should be, whether there should be an upper bound to the length of the contexts or whether they should be unbounded. Another problem is deciding how much weight should be given to each context. The approach taken by PPM is a "back off" method of selecting a particular context and then backing off or escaping to a shorter context if the chosen context does not predict the upcoming character.
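A small sketch of the back-off order may help; the helper name and the maximum order of 4 are illustrative assumptions only.

```python
def contexts_to_try(history, max_order=4):
    """Order in which the models are consulted for the next character:
    the longest available context first, escaping to successively shorter
    contexts, and finally the default order -1 model (represented here by
    None), in which every character is equiprobable."""
    longest = history[-max_order:] if max_order else ""
    chain = [longest[i:] for i in range(len(longest) + 1)]
    chain.append(None)  # order -1
    return chain

print(contexts_to_try("their"))  # ['heir', 'eir', 'ir', 'r', '', None]
```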

The problems of local order estimation, and estimating the escape probability, are the key problems associated with probability estimation for PPM. These two problems are discussed below. Following that, other techniques effective at improving the probability estimation are discussed, including four new methods: partial update exclusions for unbounded length contexts, deterministic scaling, recency scaling and multiple probability estimators.

Bounded vs. unbounded length contexts

There are two approaches to the problem of deciding how long the contexts should be for PPM. The first approach, adopted in the original implementation [4] and later in [6], uses an upper bound on the length of the contexts, with the context model of longest length chosen first to estimate the probability. This strategy is surprisingly effective at modelling text: a maximum order of 4 or 5 is well suited to a wide range of text files.

A problem with the PPM algorithm is the rapid growth in the size of the context models as the maximum order increases. Until recently, it was thought that there was little gain in extending the length of the contexts, as there is a drop off in compression as the maximum order increases beyond length 5. However, a recent approach called PPM* [3] uses a variant of the PPM algorithm that exploits unbounded length contexts. It does this by storing the contexts in a data structure called a context trie.

Each context of varying length is stored in a single trie, along with pointers back into the input string wherever a context becomes unique. A separate context list is maintained with pointers back into the trie for the currently active contexts.

Experiments (described in the next section) with adapting this data structure to the bounded approach, with the trie limited to a certain depth, have also shown compression superior to previously used approaches. Substantial reductions in the trie memory size for the unbounded approach are also possible by combining non-branching nodes of the trie into a single node. The resulting compact data structure, which is similar to a patricia trie [7], is linear in the size of the input string.
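The saving from collapsing non-branching chains can be illustrated with a small sketch. This is not the paper's data structure, merely a count of how many nodes a plain suffix trie needs versus how many a path-compressed (patricia-like) trie would keep for an illustrative string; all names here are invented for the example.

```python
class Node:
    __slots__ = ("children",)
    def __init__(self):
        self.children = {}

def build_suffix_trie(text):
    """Plain (uncompressed) trie of all suffixes of `text`."""
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())

def count_kept_nodes(node, is_root=True):
    """Nodes a path-compressed trie would keep: the root, branching nodes and
    leaves; chains of non-branching nodes collapse into single edges labelled
    by substrings (or by pointers into the input)."""
    kept = 1 if (is_root or len(node.children) != 1) else 0
    return kept + sum(count_kept_nodes(c, False) for c in node.children.values())

root = build_suffix_trie("abracadabra")
print(count_nodes(root), count_kept_nodes(root))  # the compressed form keeps far fewer nodes
```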

Local order estimation

Local order estimation using bounded length contexts simply involves selecting the longest context model. However, for unbounded contexts, experiments have shown that such a strategy is poor at selecting the model which predicts the upcoming character best. An entropy-based method of choosing the context which has the best average compression to date is also surprisingly poor at prediction. The most effective strategy found so far for unbounded contexts is as follows. A context is defined to be "deterministic" when it gives only one prediction.

Experiments conducted by [2] have shown that for such contexts the observed frequency of the deterministic character is much higher than expected based on a uniform prior distribution. This can be exploited by using such contexts for prediction. The deterministic strategy involves choosing the shortest deterministic context.
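A sketch of this selection rule, using a hypothetical snapshot of the models, might look as follows; the fallback behaviour when no matching context is deterministic is an assumption for the example.

```python
from collections import Counter

# Hypothetical snapshot of context models: context -> counts of following characters.
models = {
    "":     Counter({"e": 5, "i": 3, "r": 4, "t": 4, "h": 4}),
    "i":    Counter({"r": 2, "s": 1}),
    "ei":   Counter({"r": 2}),
    "hei":  Counter({"r": 2}),
    "thei": Counter({"r": 2}),
}

def shortest_deterministic_context(history, models):
    """Local order estimation for unbounded contexts: among the currently
    matching contexts, pick the shortest one that is deterministic
    (i.e. predicts a single character)."""
    for order in range(len(history) + 1):            # shortest context first
        ctx = history[-order:] if order else ""
        counts = models.get(ctx)
        if counts is not None and len(counts) == 1:  # deterministic
            return ctx
    return None  # assumed fallback: no deterministic context is available

print(shortest_deterministic_context("thei", models))  # -> "ei"
```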

Estimating the escape probability

A second problem with probability estimation is how to encode a novel character, one which has not been seen before in the current context. This is essentially the zero frequency problem described in [8]. PPM uses an escape probability to "escape" to another context model, usually of length one shorter than the current context. For novel characters which have never been seen before in a model of any length, the algorithm escapes down to a default "order -1" context model in which all possible characters are equiprobable.

Various methods have been proposed for estimating the escape probability. Two ad hoc methods, methods A and B, are described in [1, 8]. Methods P, X and XC, described in [8], are based upon a Poisson process model for the appearance of novel characters. For practical reasons, the method most widely used is method C, used by Moffat [6] in his implementation of the algorithm, PPMC. Whenever a novel character is encountered, an escape count is incremented and the new character's count is set to one. The escape probability is computed as q/(n + q), where q is the number of unique characters and n is the total number of characters seen so far.

Method C has been found to be superior to methods A and B in practice. Howard [5] proposed a small modification to method C, called method D. Instead of adding 1 to both the escape count and the new character's count, each count is incremented by 1/2, hence the total weight is incremented by 1 just as for the other, non-novel, characters. The escape probability for method D is computed as q/(2n).
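The two estimators can be written out explicitly. The sketch below, with hypothetical counts for a single context, is only meant to show how the escape weight differs between methods C and D under the formulas above.

```python
from collections import Counter
from fractions import Fraction

def method_c(counts):
    """Escape method C: the escape count equals the number of distinct
    characters q seen in the context, so the total weight is n + q."""
    n, q = sum(counts.values()), len(counts)
    probs = {ch: Fraction(c, n + q) for ch, c in counts.items()}
    probs["<esc>"] = Fraction(q, n + q)
    return probs

def method_d(counts):
    """Escape method D (Howard): each novel character adds 1/2 to the escape
    count and 1/2 to its own count, so the total weight is simply n."""
    n, q = sum(counts.values()), len(counts)
    probs = {ch: Fraction(2 * c - 1, 2 * n) for ch, c in counts.items()}
    probs["<esc>"] = Fraction(q, 2 * n)
    return probs

counts = Counter({"r": 3, "s": 1})   # hypothetical counts for one context (n = 4, q = 2)
print(method_c(counts))  # P(r) = 3/6, P(s) = 1/6, P(esc) = 2/6
print(method_d(counts))  # P(r) = 5/8, P(s) = 1/8, P(esc) = 2/8
```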

Experiments reported in the next section have found that method C is better suited to the unbounded approach, whereas method D is more effective with the bounded approach. One explanation for this is that method D has the effect of increasing the weight given to the character’s count for deterministic contexts. This is already achieved by the local order estimation strategy used for the unbounded approach. Scaling of deterministic contexts, described below, also has the same effect, although not as marked when method D is applied.

Scaling the counts

The frequency counts associated with each context must be scaled downward periodically to prevent overflow in the arithmetic coder. The usual method is to halve the counts when a certain threshold has been exceeded. This also has the side-effect of improving the compression because of its locally adaptive effect. However, experimental results with context tries for both the bounded and unbounded approaches have shown that count scaling is ineffective, with the maximum scaling threshold dictated by the requirements of the arithmetic encoder providing the best performance.
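A sketch of the usual halving step is given below; the threshold value is an assumption, since in practice it is fixed by the precision of the arithmetic coder.

```python
def scale_counts(counts, threshold=8192):
    """Halve the counts when their total exceeds `threshold`, keeping each
    count at least 1 so that no previously seen character is forgotten."""
    if sum(counts.values()) > threshold:
        for ch in counts:
            counts[ch] = max(1, counts[ch] // 2)
    return counts

print(scale_counts({"a": 10, "b": 3, "c": 1}, threshold=10))  # {'a': 5, 'b': 1, 'c': 1}
```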

Exclusions

PPM makes use of exclusions by excluding predictions found at a higher order when calculating the probabilities for lower-order contexts. Exclusions have been found to be effective for both the bounded and unbounded approaches.

Update exclusions do not increment the count for a predicted character if it is already predicted by a higher-order context. The effect of update exclusions is to use the uniqueness of a character in a particular context for prediction instead of its frequency.

Update exclusions have been found to improve the performance of PPMC and other variants that apply the bounded approach by 2%. However, full update exclusions are less effective for the unbounded approach. A new technique, called partial update exclusions, does however provide an improvement in performance.

Partial update exclusions increment the count for the predicted character in every deterministic context, and otherwise increment the count only in the longest non-deterministic one.
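One possible rendering of this rule is sketched below; the treatment of brand-new contexts (counted here as deterministic, since after the update they predict a single character) is an implementation assumption not spelled out in the text.

```python
from collections import Counter

def partial_update(models, history, ch):
    """Partial update exclusions: increment the count for `ch` in every
    matching context that is deterministic, and otherwise only in the
    longest matching non-deterministic context."""
    updated_longest_nondet = False
    for order in range(len(history), -1, -1):   # longest context first
        ctx = history[-order:] if order else ""
        counts = models.setdefault(ctx, Counter())
        deterministic = len(counts) <= 1        # assumption: new contexts count as deterministic
        if deterministic:
            counts[ch] += 1
        elif not updated_longest_nondet:
            counts[ch] += 1
            updated_longest_nondet = True

models = {"":  Counter({"r": 2, "s": 1}),   # non-deterministic
          "i": Counter({"r": 2})}           # deterministic
partial_update(models, "thei", "r")
print(models["i"], models[""])   # Counter({'r': 3}) Counter({'r': 3, 's': 1})
print(models["ei"])              # newly created context also receives a count
```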

Deterministic scaling

Another new technique that improves the probability estimation is called deterministic scaling. The aforementioned experiments in [2] have shown that for deterministic contexts the deterministic character is far more likely to occur than predicted by a uniform prior distribution. The local order estimation method used by the unbounded approach exploits this to some extent. However, further improvement can be achieved for both bounded and unbounded approaches by scaling the count for the deterministic character upwards by a constant factor.
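For a deterministic context the scaling amounts to multiplying the single count before the escape weight is added. The sketch below uses escape method C and the factor of 3.0 reported later for the bounded variant; the function name is an invention for illustration.

```python
from fractions import Fraction

def deterministic_probs(counts, scale=3):
    """Distribution for a deterministic context: the single character's count
    is scaled up by a constant factor, then escape method C adds an escape
    weight equal to the number of distinct characters (here 1)."""
    assert len(counts) == 1, "deterministic contexts predict exactly one character"
    ch, c = next(iter(counts.items()))
    weight = c * scale
    return {ch: Fraction(weight, weight + 1), "<esc>": Fraction(1, weight + 1)}

print(deterministic_probs({"u": 2}))  # u: 6/7, escape: 1/7; unscaled it would be 2/3 vs 1/3
```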

A curious side-effect with deterministic scaling is that greater improvement in prediction is achieved when the scaling is applied with partial update exclusions than without them. The best scaling factor increases when partial update exclusions are applied, which was found to be the case for all files in the corpus.

Recency scaling

Increasing the count for the most recent character in each context by a scaling factor also improves the prediction. Runs of the same character often occur in many of the contexts, and this can be exploited quite effectively by applying a recency scaling factor.
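A sketch of recency scaling applied to a single context follows; the factor of 1.1 is the value reported in the experiments, and escape handling is left out to keep the example focused on the scaling itself.

```python
def recency_scaled_probs(counts, most_recent, factor=1.1):
    """Weight the most recently seen character in this context a little more
    heavily than its raw frequency before normalising."""
    weights = {ch: c * (factor if ch == most_recent else 1.0)
               for ch, c in counts.items()}
    total = sum(weights.values())
    return {ch: w / total for ch, w in weights.items()}

print(recency_scaled_probs({"r": 3, "s": 1}, most_recent="s"))
# {'r': 0.731..., 's': 0.268...}: 's' gets slightly more than its raw 1/4 share
```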

Multiple probability estimators

In many cases, the first context PPM selects for encoding is non-optimal, with another context model providing better prediction. This is especially true when the input string is a random sequence of characters. The ideal would be to switch to an order -1 model immediately. PPM's performance on random strings is about 9 bpc, the extra 1 bpc being incurred because of the need to escape down to the lower orders.

One technique that overcomes this problem to some extent is to have multiple probability estimation methods, one of them specifically designed to cope with random strings. Each of the probability estimation methods can be made to compete for the prediction given a particular context, with an entropy-based performance measure (such as the average codelength required to encode the characters) being used to choose between the methods.

The addition of two extra probability estimators, one random and the other semi-random, along with the standard PPM estimator, has been found to be effective for the bounded approach. For each context, the codelengths for encoding the characters using each of the three methods are accumulated. The method with the minimum average codelength to date is then chosen to encode the upcoming character. The random estimator predicts using an order -1 context model. The semi-random estimator predicts using the character counts for the current context, with the counts for all other characters set to 1. PPM's performance on random strings improves from 9 bpc down to 8.4 bpc when these three estimators are used.
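The competition between estimators can be sketched as follows. The two toy estimators stand in for the standard and random (order -1) estimators (the paper uses three, including a semi-random one), and the probabilities they return are invented for illustration.

```python
import math

def codelength(p):
    """Ideal code length, in bits, for a character assigned probability p."""
    return -math.log2(p)

class EstimatorChooser:
    """Accumulates, for one context, the total code length each candidate
    estimator would have produced so far, and encodes each character with the
    estimator whose running total (hence average) is currently smallest."""
    def __init__(self, estimators):
        self.estimators = estimators
        self.totals = [0.0] * len(estimators)

    def encode(self, ch):
        best = min(range(len(self.estimators)), key=self.totals.__getitem__)
        bits = codelength(self.estimators[best](ch))
        for i, est in enumerate(self.estimators):   # all estimators keep competing
            self.totals[i] += codelength(est(ch))
        return best, bits

ALPHABET = 256
standard = lambda ch: 0.9 if ch == "r" else 0.1 / (ALPHABET - 1)  # invented context model
order_minus_1 = lambda ch: 1.0 / ALPHABET                          # random: all equiprobable
chooser = EstimatorChooser([standard, order_minus_1])
for ch in "rrxr":
    print(chooser.encode(ch))   # (index of chosen estimator, bits used)
```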

Adding further estimation methods may improve the compression even more. For example, an estimation method that has higher or lower deterministic and recency scaling factors may be appropriate for particular contexts. However, this multiple probability estimation method has substantial overheads, both in terms of execution speed required to calculate the codelengths, and in terms of storage, with a floating point number needed to store the codelength for each probability estimation method for all nodes in the context trie.

3. Experimental Results

Experimental results of running variants of the PPM algorithm on the Calgary corpus are shown in Tables 1, 2 and 3. The compression ratios for each file shown in the tables are given in bits per character (bpc). The best ratio for each file is printed in boldface.

file     PPMC   PPM1   PPM2   PPM3   PPM2 (varying order)
         (bpc)  (bpc)  (bpc)  (bpc)  Order  (bpc)
bib      2.11   1.90   1.86   1.86   5      1.86
book1    2.48   2.31   2.30   2.30   4      2.30
book2    2.26   1.99   1.97   1.96   5      1.97
geo      4.78   4.74   4.71   4.63   2      4.53
news     2.65   2.39   2.36   2.35   5      2.36
obj1     3.76   3.75   3.73   3.71   4      3.73
obj2     2.69   2.44   2.40   2.38   8      2.35
paper1   2.48   2.35   2.33   2.33   5      2.33
paper2   2.45   2.33   2.32   2.32   4      2.32
pic      1.09   0.81   0.80   0.80   5      0.80
progc    2.49   2.39   2.37   2.36   5      2.37
progl    1.90   1.71   1.68   1.68   8      1.64
progp    1.84   1.73   1.70   1.70   6      1.70
trans    1.77   1.52   1.47   1.47   7      1.43
average  2.48   2.31   2.29   2.28          2.26

PPMC   Standard implementation using escape method C described in [1, 6]
PPM1   Maximum order = 5, update exclusions, escape method C
PPM2   Maximum order = 5, update exclusions, escape method D,
       deterministic scaling factor = 3.0, recency scaling factor = 1.1
PPM3   Three probability estimators: random, semi-random and variant PPM2
PPM2 (varying order)   Same as variant PPM2 except the maximum order varies per file

Table 1: Compression ratios for the Calgary corpus using bounded length contexts

Results for both the bounded and unbounded approaches yield the same figure of 2.28 bpc for the corpus overall. This represents an improvement of 8% over the standard PPMC implementation.

Table 1 lists results for PPM using bounded length contexts. The standard PPMC implementation is shown in column 2. The implementations labelled PPM1 to PPM3 and PPM2 (varying order) in the adjacent columns use context tries with no limit on memory size. The main features of each of the variants are summarized below the table.

PPM1 uses update exclusions with count scaling using a maximum threshold, combined with a maximum fixed order of 5 and method C for escaping. PPM2 uses method D instead and also applies deterministic scaling (with the best factor found to be 3.0) and recency scaling (best factor 1.1). The main contributor to the improvement over PPM1 was the switch to escape method D. However, both deterministic scaling and recency scaling were found to improve the compression slightly over most of the files (although not by as much as for the unbounded contexts described below).

PPM3 uses three probability estimators, with PPM2 being one of them, and a random and a semi-random estimator being the other two. The estimator with the minimum average codelength is selected for each context. Although this leads to only a small improvement over PPM2, the performance on the file geo, which contains random data, improves from 4.71 bpc down to 4.63 bpc, and the improvement is consistent across the other files.

file     PPM*   PPM*1  PPM*2  PPM*3  PPM*4
         (bpc)  (bpc)  (bpc)  (bpc)  (bpc)
bib      1.91   1.90   1.90   1.87   1.86
book1    2.40   2.40   2.41   2.41   2.41
book2    2.02   2.02   2.02   2.01   2.00
geo      4.83   4.82   4.77   4.76   4.78
news     2.42   2.41   2.40   2.38   2.37
obj1     4.00   4.00   3.84   3.82   3.83
obj2     2.43   2.42   2.39   2.35   2.31
paper1   2.37   2.36   2.36   2.33   2.33
paper2   2.36   2.36   2.36   2.35   2.34
pic      0.85   0.83   0.87   0.84   0.84
progc    2.40   2.39   2.38   2.35   2.34
progl    1.67   1.66   1.65   1.62   1.61
progp    1.62   1.60   1.61   1.57   1.55
trans    1.45   1.41   1.46   1.39   1.39
average  2.34   2.33   2.32   2.29   2.28

PPM*    Unbounded order, exclusions, escape method C
PPM*1   Unbounded order, exclusions, escape method C,
        deterministic scaling factor = 2.0
PPM*2   Unbounded order, partial update exclusions, escape method C
PPM*3   Unbounded order, partial update exclusions, escape method C,
        deterministic scaling factor = 3.0
PPM*4   Same as variant PPM*3 plus recency scaling factor = 1.1

Table 2: Compression ratios for the Calgary corpus using unbounded length contexts
A maximum fixed order of 5 produces the best results averaged over the whole corpus. However, this is not the case when each file is compressed separately. The best order and compression ratio found using the PPM2 variant on each file are shown in the final two columns of Table 1 (under the heading PPM2, varying order). For example, compressing the file geo with order 5 requires 4.71 bpc, and this improves to 4.53 bpc if order 2 is chosen instead. Averaged over the whole corpus, these results yield about a 1% improvement compared with choosing order 5 for all files (2.26 bpc compared with 2.29 bpc). However, these improvements are only obtainable by a brute-force method of repeatedly compressing the same file to determine which order is best.

Although this could be done in parallel, discarding all output files except the smallest when completed, such a resource-intensive scheme is not warranted for such a small improvement. Adaptive methods for determining which order is best are currently being investigated, but the results so far are inferior to the brute-force approach.

Table 2 lists results using unbounded length contexts. The standard PPM* algorithm, which uses escape method C and count scaling with a maximum threshold combined with exclusions, is shown in column 2. The implementations labelled PPM*1 to PPM*4 in the next four columns use the standard PPM* approach with a deterministic strategy for selecting the context order as their basis. Escape method D was found to be inferior to escape method C in all variants. Below the table is a summary of the main features of each of the variants.

PPM*1 applies deterministic scaling using a factor of 2.0 which was found to be best over the range of files in the Calgary corpus. PPM*2 uses partial update exclusions.

PPM*3 combines both partial update exclusions with deterministic scaling, with the best factor increasing to 3.0.

PPM*4 improves on PPM*3 by adding a recency scaling factor of 1.1.

file     PPM3   PPM*4  Combined
         (bpc)  (bpc)  (bpc)
bib      1.86   1.86   1.86
book1    2.30   2.41   2.30
book2    1.96   2.00   1.96
geo      4.63   4.78   4.63
news     2.35   2.37   2.35
obj1     3.71   3.83   3.71
obj2     2.38   2.31   2.31
paper1   2.33   2.33   2.33
paper2   2.32   2.34   2.32
pic      0.80   0.84   0.80
progc    2.36   2.34   2.34
progl    1.68   1.61   1.61
progp    1.70   1.55   1.55
trans    1.47   1.39   1.39
average  2.28   2.28   2.25

Table 3: Combined results

Combined results for the best bounded and unbounded approaches (PPM3 and PPM*4) are shown in Table 3. Results for PPM3 are taken from Table 1 and results for PPM*4 from Table 2. The column labelled "Combined" uses whichever of the two is smaller, and leads to a final compression ratio of 2.25 bpc for the whole corpus, a further improvement of 1% over each of the two methods taken separately. Interestingly, neither of the two approaches performs consistently better than the other; each outperforms the other on about half of the files (PPM*4 performing noticeably better on program source code files such as progl and progp, for example). A straightforward implementation of the combined approach would be to compress each file using both methods in parallel, and then discard the output file of whichever method was inferior. An adaptive implementation of such a scheme, which selects the better method for each context as the file is being compressed, exploiting the fact that the same context trie can be used to implement both methods, has yet to be investigated.
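A minimal sketch of the straightforward per-file combination, using the Table 3 figures for two of the files:

```python
def combine(bounded_bpc, unbounded_bpc):
    """Keep, for each file, whichever of the two variants compressed it better."""
    return {name: min(bounded_bpc[name], unbounded_bpc[name]) for name in bounded_bpc}

ppm3      = {"geo": 4.63, "progp": 1.70}   # bounded results from Table 3
ppm_star4 = {"geo": 4.78, "progp": 1.55}   # unbounded results from Table 3
print(combine(ppm3, ppm_star4))            # {'geo': 4.63, 'progp': 1.55}
```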

4. Conclusions

Four new methods that improve the probability estimation for the PPM data compression scheme have been described. As well, two different approaches that use bounded and unbounded length context models have been compared. Each of the new methods is effective at improving the performance of the prediction for both of these approaches, leading to an 8% improvement for both approaches over the standard implementation PPMC. The two approaches combined lead to a further 1% improvement in performance.

While these gains are not large, we are in an area where diminishing returns are to be expected. Indeed, if one is interested in prediction rather than compression, such as for speech recognition and predictive typing, then even slightly improved prediction may well be worthwhile.

A number of conclusions can be drawn from this research. Compression may be improved by carefully fine-tuning the techniques even further, but we are in danger of over-fitting the algorithm to the test data. Important questions have also been raised by this research that require further investigation. For example, why are bounded context models just as effective as unbounded models? Why is an entropy-based method effective for selecting between multiple probability estimators, but less effective when applied to the problem of local order estimation? Are there better methods of blending the context models than the PPM method of estimating an escape probability? A greater understanding of these questions is necessary before substantial improvement in probability estimation is possible.

Acknowledgement

I would like to thank my supervisor Dr. John Cleary for his guidance in preparing this paper.

5. References

[1] T.C. Bell, J.G. Cleary, and I.H. Witten. Text compression. Prentice Hall, NJ, 1990.

[2] J.G. Cleary and W.J. Teahan. Some experiments on the zero frequency problem. In J.A. Storer and M. Cohn, editors, Proceedings DCC’95. IEEE Computer Society Press, 1995.

[3] J.G. Cleary, W.J. Teahan, and I.H. Witten. Unbounded length contexts for PPM. In J.A. Storer and M. Cohn, editors, Proceedings DCC’95. IEEE Computer Society Press, 1995.

[4] J.G. Cleary and I.H. Witten. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396-402, 1984.

[5] P.G. Howard. The design and analysis of efficient lossless data compression systems. Technical Report CS-93-28, Brown University, Providence, Rhode Island, 1993.

[6] A. Moffat. Implementing the PPM data compression scheme. IEEE Transactions on Communications, 38(11):1917-1921, 1990.

[7] R. Sedgewick. Algorithms. Addison-Wesley, Reading, Massachusetts, 1988.

[8] I.H. Witten and T.C. Bell. The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085-1094, 1991.

W.J.Teahan
Department of Computer Science,
University of Waikato, Hamilton