Submission for DCC 1999
Lexical Attraction for Text Compression
Joscha Bach, Department of Computer Science, Humboldt University of Berlin, Berlin, Germany. Email: bach@informatik.hu-berlin.de
Ian H. Witten, Department of Computer Science, University of Waikato, Hamilton, New Zealand. Email: ihw@cs.waikato.ac.nz
Abstract
New methods of acquiring structural information in text documents may support better compression by identifying an appropriate prediction context for each symbol. The method of "lexical attraction" infers syntactic dependency structures from statistical analysis of large corpora. We describe the generation of a lexical attraction model, discuss its application to text compression, and explore its potential to outperform fixed-context models such as word-level PPM. Perhaps the most exciting aspect of this work is the prospect of using compression as a metric for structure discovery in text.
1 Introduction
The most successful methods of text compression work by conditioning each symbol's probability on its predecessors. Sequences of symbols establish a context that determines a probability distribution for the next symbol, and the actual next symbol is encoded with respect to this distribution. For example, PPM uses the previous k symbols as context, for some small fixed order k (Cleary and Witten, 1984). As is well known, the size of the model grows exponentially with order. While fifth-order models are widely used for character-based PPM, even second-order models are usually impractical for word-level PPM because their alphabet includes many more symbols.

For textual documents written in natural language, it is obvious that there are dependencies not only between adjacent symbols, but between more distant words in a sentence. For example, verbs may depend on nouns, pronouns on names, closing brackets on opening ones, question marks on "wh"-words. It is the purpose of language grammars to characterize these dependency structures.
Grammars have some important limitations, apart from the practical difficulty of finding natural language parsers that work robustly. First, they only describe syntactically correct structures, and a grammar-based coder might fail to predict many frequently-occurring phrases--even though, in most writing, grammatically correct sentences are generally more likely than incorrect ones. Second, although grammars capture syntactic structure, they do not necessarily reflect typical language usage. Third, it is very difficult to derive grammatical dependency structures for documents written in an unknown language.
The method of "lexical attraction," developed by Deniz Yuret (1998), establishes dependency structures in textual documents based on the co-occurrence of terms in a certain order in the same sentence, regardless of the distance that separates them. This scheme is able to recognize linguistic structure in text by organizing it in the form of a low-entropy model. Once identified, the structure supports a more informed notion of context than simply using the previous few characters or words. This context can be used for prediction in much the same way as PPM.
Section 2 describes the idea of lexical attraction, particularly the measure of mutual information on which it is founded, and explains the algorithm used to acquire the dependency structure. Section 3 discusses how to encode textual documents using the context provided by the lexical attraction model. Section 4 gives experimental results, and Section 5 draws some conclusions and suggests directions for further improvement.
2 Lexical Attraction
This work is based on "link grammars," a new formalism for natural language syntax in which sentences are parsed by connecting their words together with links, or arcs (Sleator and Temperley, 1991; Lafferty et al., 1992). Links may be labeled by grammatical structure relations (subject-verb and so on); we use an unsupervised learning approach and therefore do not use (or generate) these labels.
Figure 1: The lexical attraction model for an example sentence from the Jefferson corpus
Lexical attraction was developed in a recent Ph.D. thesis by Deniz Yuret (1998) as a particular, concrete methodology for discovering these linguistic relations. Yuret suggests an unsupervised learning method for inferring dependency structures from a large corpus of natural language text. While his work is aimed at natural language understanding, link grammars may have other applications--such as compression modeling or topic extraction.
The lexical attraction of an ordered pair of words is the likelihood that they will appear (in that order) within a sentence. This can be estimated by counting the co-occurrences of word pairs. The resulting dependency structure defines an undirected graph with words as vertices and links between them as edges. The graph is chosen to maximize the lexical attraction between linked items. Graphs are constructed to be planar--links do not cross--and we capitalize upon this when creating them. Moreover, they are acyclic, which is a necessary condition for the structure to be acquired.
2.1 Mutual information
When a pair of words is encoded together rather than independently of each other, the coding gain is called mutual information and measures the lexical attraction between the words. The following example is taken from Yuret (1998).

In a certain corpus, the probability of the word Ireland is 0.0039%, and it may therefore be encoded in 14.65 bits. The probability of the word Northern is 0.016%, or 12.60 bits. However, in 35.8% of all cases the word Northern precedes Ireland. This preceding Northern, given knowledge of the occurrence of the word Ireland, can therefore be coded in as little as 1.48 bits, for a mutual information gain of 12.60 - 1.48 = 11.12 bits. Conversely, the probability of the word Ireland given a preceding occurrence of Northern is 8.5%, leading to an encoding in 3.53 bits and a gain of 11.12 bits. Note that the mutual information is the same irrespective of the direction of the prediction. (It does, of course, depend on the word order: the probability of Northern appearing after the word Ireland is entirely different.)
Given the co-occurrence counts, the mutual information gained by encoding words s1 and s2 together can be estimated as MI(s1,s2) = log2 [ N n(s1,s2) / (n(s1) n(s2)) ], where n(s1,s2) counts the occurrences of the ordered pair (s1,s2) within a sentence, n(s1) and n(s2) are the individual occurrence counts, and N is the total number of words processed.
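As a concrete illustration, the following is a minimal sketch of this estimate computed from raw counts. The function name and the normalization by the total word count N are our assumptions; any consistent normalization of the probabilities gives the same gain.

```python
import math

def mutual_information(n_pair, n_s1, n_s2, N):
    """Estimated mutual information (in bits) gained by coding the ordered
    pair (s1, s2) together rather than independently.

    n_pair     -- co-occurrence count of the ordered pair (s1, s2)
    n_s1, n_s2 -- individual occurrence counts of s1 and s2
    N          -- total number of words processed (assumed normalizer)
    """
    # log2( P(s1,s2) / (P(s1) P(s2)) ) with every probability taken as count/N
    return math.log2((n_pair * N) / (n_s1 * n_s2))

# The Northern/Ireland example above, stated directly in probabilities:
# -log2(0.000039) is about 14.65 bits and -log2(0.085) is about 3.56 bits,
# so the gain is roughly 11.1 bits, as quoted in the text.
gain = math.log2(0.085) - math.log2(0.000039)
```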
2.2 Linking sentences
The most likely linkage for a sentence can be found by generating all possible acyclic, planar graphs over it, and choosing the one with the greatest accumulated lexical attraction--that is, the total mutual information gain for all linked pairs of words. Unfortunately, the computational complexity of this method is O(n^5) for an n-word sentence. Yuret (1998) proposes an approximate algorithm with a complexity of O(n^2):
- for each word, consider linking it to each previous word in the sentence in turn, starting with the preceding word;
- resolve conflicts created by cycles and crossing links by deleting the weakest link in question.
For example, when processing Figure 1 word by word, would is first linked to this. Serve is then linked to this and would; however, this creates a cycle, so the weakest link (which happens to be this-would) is deleted. As is linked to serve, and then a link to would is attempted. Because this would create a crossing link, and because the existing link between this and serve is stronger, the existing link is retained instead. Linking as to this creates a cycle, and because all other links in the cycle are at least as strong, the new link cannot be established. And so the algorithm proceeds down the sentence. The numbers shown on each link give the mutual information of the word pairs for this example sentence, calculated as described below. This algorithm may leave some words disconnected when crossing links are broken; as a final pass, every unlinked word is connected to its direct predecessor.
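The sketch below illustrates this procedure, assuming an mi(w1, w2) function such as the estimate in Section 2.1. It follows the conflict rules of the worked example (a crossing link that is at least as strong defeats the candidate; a cycle is broken by removing its weakest link, unless the candidate itself would be the weakest), but it is only an approximation of Yuret's implementation, not a reproduction of it.

```python
from collections import deque

def link_sentence(words, mi):
    """Approximate O(n^2) linking of one sentence (a simplified sketch).
    Returns a set of frozenset index pairs representing undirected links."""
    links = set()

    def weight(link):
        return mi(words[min(link)], words[max(link)])

    def crossing(i, j):
        # existing links that cross the candidate (i, j): exactly one of their
        # endpoints lies strictly between i and j (links sharing an endpoint
        # with the candidate never cross it)
        out = []
        for link in links:
            if i in link or j in link:
                continue
            a, b = sorted(link)
            if (i < a < j) != (i < b < j):
                out.append(link)
        return out

    def path_links(i, j):
        # links on the existing path from i to j, if the two words are already
        # connected (the graph is a forest, so there is at most one such path)
        adj = {}
        for link in links:
            a, b = sorted(link)
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
        parent, queue = {i: None}, deque([i])
        while queue:
            v = queue.popleft()
            if v == j:
                path = []
                while parent[v] is not None:
                    path.append(frozenset({v, parent[v]}))
                    v = parent[v]
                return path
            for w in adj.get(v, []):
                if w not in parent:
                    parent[w] = v
                    queue.append(w)
        return []

    for j in range(1, len(words)):
        for i in range(j - 1, -1, -1):           # previous words, nearest first
            strength = mi(words[i], words[j])
            if strength <= 0:                    # only positively attracted pairs
                continue
            crossers = crossing(i, j)
            cycle = path_links(i, j)
            # a crossing link that is at least as strong defeats the candidate
            if any(weight(l) >= strength for l in crossers):
                continue
            # a cycle is broken by deleting its weakest link; if the candidate
            # itself would be the weakest, it is not added
            if cycle and min(weight(l) for l in cycle) >= strength:
                continue
            for l in crossers:
                links.discard(l)
            if cycle:
                links.discard(min(cycle, key=weight))
            links.add(frozenset({i, j}))

    # final pass: attach any still-isolated word to its direct predecessor
    linked = {x for link in links for x in link}
    for j in range(1, len(words)):
        if j not in linked:
            links.add(frozenset({j - 1, j}))
            linked.update({j - 1, j})
    return links
```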
2.3 Adaptive acquisition of mutual information
The mutual information matrix for a corpus of text can be obtained by simply counting the co-occurrences of all pairs of words in each sentence, n(s1,s2), and substituting into the formula above. Although this will produce good results in the (very) long run, for corpora of moderate size it gives noisy estimates of the probabilities. More reliable estimates can be obtained by making use of previously identified links, leading to an adaptive scheme which takes advantage of existing links to help create new ones (Yuret, 1998). As each sentence of the corpus is processed, its dependency structure is derived from the current set of probabilities. Then, counts are incremented only for certain word pairs in the sentence, namely (a) adjacent pairs of words, (b) linked pairs of words, and (c) the word pairs (A,Y) and (X,B) that occur in a sequence XA ... BY where A and B have already been connected. For example, in Figure 1 the counts associated with this-would, this-serve, and this-as will be incremented for reasons (a), (b) and (c) respectively.

An additional advantage of this counting scheme is that it generates much smaller models than the brute force version, because fewer different pairs are brought to the attention of the processor.
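A sketch of this counting scheme is given below, assuming links as produced by the linker above and plain dictionaries for the counts; the data structures and the choice to count each qualifying index pair once per sentence are our own simplifications.

```python
def update_counts(words, links, pair_counts, word_counts):
    """Increment the statistics after one sentence has been linked.
    Only pairs of kinds (a), (b) and (c) are counted."""
    n = len(words)
    pairs = set()
    for i, w in enumerate(words):
        word_counts[w] = word_counts.get(w, 0) + 1
        if i + 1 < n:
            pairs.add((i, i + 1))              # (a) adjacent words
    for link in links:
        a, b = sorted(link)
        pairs.add((a, b))                      # (b) linked words
        if a > 0:
            pairs.add((a - 1, b))              # (c) X A ... B yields (X, B)
        if b + 1 < n:
            pairs.add((a, b + 1))              # (c) A ... B Y yields (A, Y)
    for i, j in pairs:
        key = (words[i], words[j])
        pair_counts[key] = pair_counts.get(key, 0) + 1
```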
2.4 Results
Yuret trained his algorithm on a reasonably homogeneous corpus of 100 million words of Associated Press material. He hand-parsed an independent test set of 200 sentences, and applied the lexical attraction algorithm to them. Of the content-word pairs that were linked by hand-parsing, 87.5% showed positive lexical attraction, which gives an upper bound on what can be expected from this method. He also measured the quality of the dependency structure in terms of recall and precision, again for links between content-word pairs only. He found that 50% of content-word links in the human-parsed test set were made by the algorithm (recall), and 60% of content-word links established by the algorithm were present in the human-parsed version (precision). Furthermore, his results suggest average mutual information gains of up to 50% per link. It is to be expected that a smaller corpus, or a less homogeneous one, will yield weaker links, reducing both precision and recall.

Yuret mentioned examples of the number of bits saved by encoding words based on their links, and discussed an upper bound to the number of bits required to transmit the linkage structure. However, he did not build a compression scheme based on the lexical attraction idea, nor evaluate the linkage in terms of the amount of compression that can be achieved.
3 Encoding based on lexical attraction
We have developed an encoding scheme that takes a series of linked sentences like that of Figure 1 and transmits them in the same manner as first-order word-level PPM. Of course, the decoder needs additional information to restore the graph structure, and thus establish the contexts for decompression. Unlike PPM, which performs only forward prediction, a lexical attraction coder also predicts backwards. (The situation becomes even more complicated with higher-order models, and we have not considered them.) Indeed, word-level PPM is a special case where links are made only between adjacent words.

3.1 Linking
Links across sentence boundaries connect the last term of one sentence to the first term of the next. This makes sense (at least for western languages) because the former is a punctuation mark that terminates the sentence, while the latter is generally an upper-case word that is attracted to the beginning of sentences. Considering more than one sentence at a time might lead to some improvement, because more links would be established between nouns and related pronouns, opening and closing quotation marks, and so on.

In order to prime the lexical attraction linker, the whole document is processed in advance to acquire the co-occurrence counts, and again in a second pass to re-link the sentences. We have experimented with the use of multiple passes, so that counts in one pass are based on sentences linked using the statistics acquired in the previous one, but this does not yield any improvement over the one-pass adaptive method described earlier.
Because pairs that occur no more than twice are not considered to have significant lexical attraction, all terms that occur only once or twice are excluded from the statistics. This significantly reduces the size of the model, because in most corpora about half the terms are hapax legomena (words that appear only once).
The final output of the linking stage contains a triple for each word or item of punctuation in the text: the index of the term, the index of the previous term (the one that prediction is based on), and a flag indicating the relative order of these terms (that is, whether it is a forward or a backward prediction). Also, the graph structure of the document is recorded as described below.
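In code, one entry of this output might look like the following small record; the type and field names are illustrative, not taken from the original implementation.

```python
from dataclasses import dataclass

@dataclass
class LinkedTerm:
    term: int        # index of this term in the vocabulary
    predictor: int   # index of the term from which this one is predicted
    forward: bool    # True if the predictor precedes this term in the text
```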
3.2 Adaptive encoding
To avoid the overhead that would be introduced by transmitting the complete model, the encoding stage uses an adaptive PPM-style method. For each term acting as a predictor, two probability distributions are maintained--one for forward and the other for backward predictions--based on previously encoded pairs. For each pair (p,q) of predicting and actual terms, the coder performs the following steps (a code sketch follows the list).
- When p acts as a predictor for the first time, q is encoded according to its zero-order probability (the number of times it appears in the document divided by the total word count).
- If p does not include q in its predictions, an escape symbol is sent, followed by the zero-order probability of q. The probability of the escape symbol is given by Pr[esc]=r/(n+r) where n words have been seen altogether in the context of the term p, and r of them are distinct.
- Otherwise, the probability of q is estimated as Pr[q]=f(q)/n * (1-Pr[esc]), where f(q) is the number of times that q has appeared in the context of p.
- Finally, the models are updated. If the current encoding is based on a forward prediction, the counts of f(q) in the forward context of p and f(p) in the backward context of q are incremented. Furthermore, the counts for the relevant r and n are incremented in both models. In the case of a backward prediction, the backward and forward roles are reversed.
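The sketch below shows how these probabilities might be computed for one predictor term in one direction. The class and method names are ours, the arithmetic coder itself is omitted, and the escape step is folded into a single probability for brevity (the real coder transmits an escape symbol and then codes q with its zero-order probability).

```python
class Context:
    """Adaptive statistics kept for one predictor term in one direction
    (forward or backward)."""

    def __init__(self):
        self.freq = {}   # f(q): how often each term q has been seen in this context
        self.n = 0       # total number of terms seen in this context

    def probability(self, q, zero_order):
        """Probability with which q is coded in this context; zero_order(q)
        is the zero-order probability of q in the document."""
        r = len(self.freq)                       # distinct terms seen so far
        if self.n == 0:
            return zero_order(q)                 # predictor used for the first time
        p_escape = r / (self.n + r)              # Pr[esc] = r / (n + r)
        if q not in self.freq:
            return p_escape * zero_order(q)      # escape, then zero-order coding
        return (self.freq[q] / self.n) * (1.0 - p_escape)

    def update(self, q):
        self.freq[q] = self.freq.get(q, 0) + 1
        self.n += 1
```

In the full coder each term owns two such contexts, and after a pair (p,q) is coded the forward context of p and the backward context of q are both updated (with the roles reversed for a backward prediction), as described in the final step above.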
3.3 Encoding the links
The baseline method of encoding the graph is to enumerate all planar dependency structures on a sentence of n words, and transmit the index of the actual graph. By counting the number of planar graphs, an upper bound of 2.75 bits per term can be derived for the information necessary to encode the graph (Yuret, 1998).

We reduce this figure by adopting the following strategy. Each sentence is viewed as a tree, the root being the first word encoded. (More precisely, the whole document is one large tree, because the last word of each sentence is linked to the first word of the next.) To represent the structure of this tree, it is sufficient to record the number of forward and backward links for each word (ignoring the link to the previous word). Once the link counts have been communicated, the words can be transmitted recursively, from left to right.
Figure 2a: Encoding the link structure: successor and predecessor counts
Figure 2b: Transmission order
In order to transmit the link counts, each different pair of numbers for forward and backward links is treated as a single symbol, and the resulting sequence is compressed using PPMD. For the texts we have worked with, transmitting the graph structure in this way imposes an overhead of around 2.3 bits/term.
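A sketch of how the count-pair sequence might be extracted from a linked sentence before being handed to the PPM coder; the parent array, which records the word from which each term is predicted, and the interpretation of "ignoring the link to the previous word" as skipping that parent link are assumptions of this sketch.

```python
def link_count_symbols(n_words, links, parent):
    """Per-word (backward, forward) link counts, excluding for each word
    the link over which it is itself predicted (parent[root] is None)."""
    backward = [0] * n_words
    forward = [0] * n_words
    for link in links:
        a, b = sorted(link)
        if parent[b] != a:        # b is not reached over this link, so it
            backward[b] += 1      # counts as one of b's own backward links
        if parent[a] != b:        # a is not reached over this link, so it
            forward[a] += 1       # counts as one of a's own forward links
    return list(zip(backward, forward))
```

Each (backward, forward) pair is then treated as a single symbol, and the symbol sequence for the whole document is compressed with PPMD as described above.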
3.4 Trading graph uniformity for mutual information
The number of bits required by a lexical attraction compressor to encode text is the information necessary to transmit the dependency graph, plus the actual information content of the terms. Unfortunately, encoding the graph imposes significant overhead, and overall performance may be even worse than zero-order encoding--particularly on small corpora. The overhead can be artificially reduced by generating a link structure with greater uniformity. Of course, the price paid for this is a reduction in effective mutual information.

The most frequent link combination consists of no backward links and one forward link--that is, a forward link from one term to the next, recording the standard reading order. The graph can be simplified to favor adjacent links by deleting links between non-adjacent terms if their mutual information gain is insignificant. We do this only in the second pass through the document, when the sentences are linked in final form, not in the first pass which accumulates the probability information.
In the extreme, when only links to adjacent terms are accepted, links reflect the natural word order and the coder's output is the same as first-order PPM. In this case, the overhead of transmitting the graph structure is negligible because only one count pair occurs, (0, 1). Then first-order PPM becomes a special case of (first-order) lexical attraction encoding.
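A sketch of this simplification, applied in the second pass; the function and the threshold rule are our reading of the description above, with the penalty expressed in bits.

```python
def simplify_links(links, words, mi, penalty):
    """Keep adjacent links unconditionally; keep a non-adjacent link only if
    its mutual information gain exceeds the penalty. Words left disconnected
    are afterwards re-attached to their direct predecessor, as in the
    linker's final pass."""
    return {link for link in links
            if max(link) - min(link) == 1
            or mi(words[min(link)], words[max(link)]) > penalty}
```

With an unbounded penalty only adjacent links survive, which is the first-order PPM case described above.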
4 Experimental results
Since most words in English have low frequency counts, the acquisition of mutual information depends heavily on corpus size. Best results for a lexical attraction compressor are obtained using a homogeneous corpus of very large size. The largest such corpus we could obtain is Jefferson the Virginian (Malone, 1948), which we have available as a 6.3 Mbyte ASCII file; most of our experiments were undertaken on it. It contains a total of 1,238,973 words, which we segmented into 50,729 sentences using straightforward methods. Each punctuation mark is considered to be a term, and all terms are separated by a single space.

Encouraging results have been achieved with this corpus. However, further work with larger corpora is necessary to explore the limits of compression that are achievable using lexical attraction.
4.1 Acquiring mutual information
The acquisition of mutual information depends on two factors: the size of the corpus and its composition in terms of homogeneity, vocabulary, and sentence length. We investigate these separately.

4.1.1 Size of corpus
We begin by examining how the first-order entropy depends on text size, when the entropy is calculated based on the appropriate word in the lexical dependency graph, and when it is calculated based on the preceding word. Figure 3 shows the text size in words along the horizontal axis, for the first 2^11, 2^12, etc., words of Jefferson the Virginian; the horizontal scale is logarithmic, so the size of the text doubles at each step. The vertical axis expresses the entropy savings as a percentage of the entropy of a zero-order model of the same text. In the upper line the first-order entropy is calculated based on the appropriate word of the dependency graph; note that this corresponds to the mutual information of the text, except that it is calculated adaptively rather than statically. The lower line uses the preceding word as context, as in ordinary first-order PPM.
Figure 3: First-order entropy expressed as the percentage saved over a zero-order model
However, this is not the whole story: these figures omit the overhead of transmitting the graph structure. We will see below that this significantly reduces the overall compression achieved.
4.1.2 Composition of corpus
In order to further investigate the effect of the size and composition of the corpus, we performed subsidiary experiments on three other bodies of text. One is derived from the Gutenberg project, which is making public-domain English literature available in machine-readable form. This corpus contains a wide variety of very different documents, and we used a subset containing 137 books, or 27 million words, of English prose. The second body of text, Thomas Hardy's Far from the Madding Crowd, is small but far more homogeneous than Gutenberg. Finally we used a corpus of children's language consisting of extremely short and simple sentences.

Table 1 shows the size of these corpora, the number of different words they contain, and the zero-order entropy. The inhomogeneity of the Gutenberg corpus can be seen from its disproportionately large vocabulary, although this does not result in a significantly increased zero-order entropy. Far from the Madding Crowd is both much smaller and more homogeneous. The children's language corpus exhibits a tiny vocabulary and a correspondingly low zero-order entropy.
corpus | size (words) | vocabulary (words) | zero-order (bits/word) | lexical attraction (bits/word) | savings over zero-order model |
---|---|---|---|---|---|
Gutenberg (extracts) | 26,636,000 | 264,000 | 9.97 | 8.15 | 18.2% |
Far from the Madding Crowd | 168,000 | 14,000 | 9.22 | 7.51 | 18.5% |
Jefferson the Virginian | 1,239,000 | 29,000 | 9.47 | 7.24 | 23.6% |
Children's language | 184,000 | 4,700 | 7.87 | 5.84 | 25.8% |
The last two columns of Table 1 give the results of our experiments on these corpora, in terms of the first-order entropy of the lexical attraction model and the percentage savings it yields over the zero-order entropy. Far from the Madding Crowd gives almost the same percentage savings as Gutenberg, despite being a tiny fraction of the size. Clearly quality--homogeneity--makes up for quantity, and the savings are about the same as Figure 3 shows for a similarly-sized extract from Jefferson the Virginian. The third row illustrates how better linkage structures are discovered as the text grows, while the result for children's language demonstrates that comparable savings can be achieved from a far smaller corpus if the linguistic structure is simple enough.
4.2 Encoding the graph
We now turn to the information content of the dependency graph. For a naive enumerative encoding, this depends heavily on the average sentence length. However, our coding method (described above, using third-order PPMD) greatly reduces this effect. Table 2 shows the results, which range from 2.38 bits/term for the Gutenberg corpus to 1.61 bits/term for the children's language corpus. The per-term entropy of the graph is reasonably independent of the size of the document and its vocabulary--although it will be lower if the document is too small to establish a complex dependency structure. Table 2 also shows the total entropy per term, taking into account both the encoding of the dependency graph and the first-order entropy based on its predictions.
corpus | average sentence length | graph encoding (bits/term) | lexical attraction (bits/term) |
---|---|---|---|
Gutenberg (extracts) | 15.3 | 2.38 | 10.53 |
Far from the Madding Crowd | 13.8 | 2.31 | 9.82 |
Jefferson the Virginian | 24.4 | 2.35 | 9.59 |
Children's language | 6.2 | 1.61 | 7.45 |
4.3 Results of lexical attraction encoding
Next we look at how overall compression for the lexical attraction method depends on text size, and compare it with zero- and first-order coders. Figure 4 plots the compression rate for these three methods against text size for Jefferson the Virginian, along with (at the bottom) the rate for lexical attraction without taking the graph structure into account. Because of the overhead of transmitting the dependency graph, lexical attraction gives worse compression than a zero-order model right up to texts of a million or so words. At that point it overtakes zero-order modeling. Unfortunately for lexical attraction, it is beaten hands down by first-order compression up to a million words--although a generous eye can discern a more favorable trend for larger text sizes.
Figure 4: Compression by lexical attraction, compared with zero- and first-order coding
4.4 Simplifying the graph
Results for lexical attraction can be improved by penalizing links between non-adjacent words as mentioned above. Varying the penalty yields a smooth transition from lexical attraction to a first-order model, illustrated in Figure 5. If the encoder is prevented from linking non-adjacent words, the gain from mutual information drops, but the number of bits necessary to encode the graph structure decreases too. When only adjacent pairs are linked, the graph can be encoded using no bits at all, but the mutual information is the same as for first-order PPM. Another way of reducing the entropy of the graph is to artificially limit the number of links per word.
Figure 5: The effect on compression of penalizing non-adjacent links
5 Conclusions
The best predictors for individual words of text are not necessarily their immediate predecessors. Measuring the lexical attraction between terms is a way of identifying dependency structures that define a low-entropy representation of the text. This paper has investigated this phenomenon and shown how lexical attraction can be applied for the purposes of text compression. Our experimental results are somewhat disappointing, for they show that first-order lexical attraction is out-performed by first-order PPM even on fairly large texts, rendering it useless for most practical purposes. However, it can be extended to subsume PPM, and thereby perform at the same level with an appropriate parameter setting.

The disappointing performance can be traced to the need to encode the dependency graph and transmit it to the decoder. In the current implementation, little attempt has been made to refine the encoding of the graph structure, but there are many opportunities for improvement. It is likely, for example, that many words have significant correlation with their position and link structure. The graph could be broken at points where the lexical attraction between two words is negative, and a zero-order encoding of the term in question used instead. More links across sentence boundaries could be introduced, improving the gains that can be achieved by linking between words.
Work with larger corpora will be necessary to explore the limits to the gains achievable by lexical attraction, a completely open question at present. Performance enhancements could be made simply by priming the compressor with standard corpora, and by introducing ways to counter inhomogeneity, such as priming sub-contexts for structurally different passages of the text--for instance poems, formulas or entirely alien documents. Finally, a higher-order compression scheme could be developed to capture relations between groups of words.
We feel that lexical attraction is a promising new method for text compression, and expect that ultimately, lexical attraction coders will consistently out-perform standard prior-context models. Perhaps the most exciting aspect is the prospect of uncovering more and more structure in text as it is compressed, and the use of compression to measure the success of structure discovery.
Acknowledgments
We gratefully acknowledge stimulating discussions with Tony Smith on lexical attraction. Thanks to Deniz Yuret for writing the thesis that started the whole thing off, and helping us to understand some of the details; to Bill Teahan for making available the Jefferson text; and to Stuart Inglis for technical help.
References
[1] Cleary, J.G. and Witten, I.H. (1984) "Data compression using adaptive coding and partial string matching." IEEE Trans Communications, Vol. COM-32, No. 4, pp. 396-402.
[2] Howard, P.G. and Vitter, J.S. (1992) "Practical implementations of arithmetic coding." In Image and Text Compression, edited by J.A. Storer. Kluwer Academic Publishers, pp. 113-144.
[3] Lafferty, J.D., Sleator, D. and Temperley, D. (1992) "Grammatical trigrams: A probabilistic model of link grammar." Proc AAAI Fall Symposium on Probabilistic Approaches to Natural Language, Cambridge, MA, pp. 89-97.
[4] Malone, D. (1948) Jefferson the Virginian. Little Brown and Co., Boston.
[5] Moffat, A. (1990) "Implementing the PPM data compression scheme." IEEE Transactions on Communications, Vol. 38, No. 11, pp. 1917-1921.
[6] Sleator, D. and Temperley, D. (1991) "Parsing English with a link grammar." Technical Report CMU-CS-91-196, Department of Computer Science, Carnegie Mellon University.
[7] Yuret, D. (1998) "Discovery of linguistic relations using lexical attraction." PhD Thesis, Department of Computer Science and Electrical Engineering, MIT; May.