A short version of this article
was published in spring 2001
in the popular Russian
"Hard'n'Soft" magazine.
Now that the "Data Compression Methods"
book has been published, this text
is of historical interest only.
Compression of Multimedia Information
version 1.0
Multimedia information (MMI) is understood, as a rule, as sound (audio stream),
two-dimensional pictures, video (stream of 2D pictures) and three-dimensional
images. Within this century other types, such as streams of 3D images or
streams of smells, will likely become more common.
What's the difference between MMI and other types of information?
1) The SOURCE of MMI is a device measuring parameters of the physical world.
With some stretch, any information received by a digital device from
the physical world via measuring (quantizing) devices can be called MMI
(for example, current strength, speed of movement, bearing of an apparent wind).
This process is called analog-to-digital conversion (ADC) or quantization
(the inverse process is digital-to-analog conversion, DAC).
The result of the quantization process is multimedia data (MMD), a form of MMI.
Measurement errors are inevitable, so rounding off in any chosen way,
including one that leads to better compression, is legitimate.
If it's assumed that the RECEIVER of a block of MMD is the sense organs
of a human (eyes and ears, first of all), it's possible to round off taking
their properties into account.
The features of the communication channel that will be used for transmission
(copper wires, optical fiber, air, water) can also be taken into consideration.
2) Dimensionality.
From 4 bits (16 values) up to 32 bits (2 to the 32nd power is more than 4
billion) are used for storing the result of every measurement, but as a rule
16, 24 or 32. That is, the dimensionality of bytes is R=16, 24 or 32. Not 8,
as in the majority of other cases: executable code for x86 processors,
eight-bit ASCII text, databases (16-bit UNICODE text and executable code for
other processors are less common at present).
Besides, a block of MMD consists exclusively of R-bit bytes, in contrast
to databases, x86 code, etc.
3) Linearity.
Measurements are taken at equal time intervals (stream of sound - audio,
stream of pictures - video, etc.) and at equal spatial distances
(two-dimensional pictures, three-dimensional images...). Bytes, the results
of measurements, are placed into the MMD block sequentially.
Sometimes when recording MMD they are periodically aligned, so that the
distance from the beginning of the block to the beginning of an aligned
fragment satisfies some criterion. For instance, in .BMP pictures every line
of the image should start at an address divisible by four, and in .RAS files
at an address divisible by two.
But the bytes lying between fragments don't carry any useful information about
the MMD block. If they do, it's a block of processed MMD (PMMD), and the
algorithm compressing it should take that processing into account, otherwise
compression will be much worse than when the processing is accounted for.
Except for palettized pictures and other blocks of PMMD, every byte value
depends _linearly_ and _only_ on the values of bytes neighbouring in
space-time. In the case of texts it also depends on the language (Russian,
English, Chinese...) and the type of file (DOS text, Unix text, text in C,
in HTML); in the case of executable code, on the type of processor and the
compiler. So compression/decompression algorithms should either know the
features of the processed blocks in advance, or
encode-transmit-receive-decode-use these features during operation.
With MMD such complications are absent: the properties of the physical
world are always the same.
4) Channels.
The three previous items describe a block of one-channel MMD.
As a rule, an MMD block contains information about several channels of MMD.
For example, in stereo sound we have two channels - left and right.
In the case of a picture, it's possible to regard either each line or each
column of the digital image as a channel. Or not to treat lines or columns
as different channels at all, but to consider the neighbourhood of every byte
as two-dimensional rather than one-dimensional. Because, first, if a
two-dimensional picture describes the surface of a three-dimensional object
(a sphere, say), the lengths of the channels differ, and choosing them is
problematic. And second, every pixel of an image often consists of several
components: for instance,
RGB - red, green, blue;
HSB - hue, saturation and brightness.
These components can also be regarded as different channels, and as a rule
this is more convenient: there are usually 2, 3 or 4 components, as opposed
to the number of lines or columns (up to a few thousand).
If a block of MMD describes N channels, it consists either of N fragments
of X bytes, or of X fragments of N bytes. That is, the N channels are put
either sequentially or in parallel. This will serve as the definition of a
channel. The contents (set of bytes) of each channel depend either on all
other channels (RGB, HSB, etc.) or only on adjacent ones (lines, columns).
That's actually all. We'll take the first item as the definition of MMD,
and treat deviations from properties 2, 3 or 4 as special cases of MMD.
How can the four main features of MMD be used for better compression?
A few words about compression in general. I hope the reader is familiar
with the main concepts and ideas of information compression (description
eliminating redundancy), stated in [1].
There are two approaches to compression - lossless and lossy. The second
is applied to MMD more often, because of the features of the Source and
the Receiver.
Theoretically, any lossy algorithm can be made lossless, but such a
reoriented algorithm is usually less effective than methods pursuing the same
goal by other paths and oriented towards lossless compression from the start
(that's because "lossy" transformations increase the dimensionality R of
bytes).
On the other hand, blocks/streams of data of any type .ABZ can be
compressed with "losses" (possibly even improving the contents from the point
of view of the Receiver) by an algorithm specially designed for data of that
type .ABZ (for example, DocumentPress by Igor Pavlov[22], or optimization of
tables in headers of EXE files).
And the three basic strategies of compression:
1) "Window - dictionary" (Transformation of Stream).
Description of incoming data using data already available. For instance,
LZ methods for streams/blocks of words, i.e. when combinations of incoming
bytes are predictable from already processed combinations. Table conversions,
RLE, LPC, DC, MTF, SEM, Subband Coding - for bytes, i.e. when it's meaningless
to consider combinations of two or more bytes (or to "remember" such
combinations, in the case of Linear Prediction Coding). No probabilities are
calculated, in contrast to the second case. As a result of the transformation,
several streams are formed, as a rule (except for LZ78, LZW, etc.). Even if
the total size of the streams increases, their structure is improved, and
further compression goes better and faster.
2) Statistics.
a) Adaptive (stream).
Calculation of probabilities for incoming data using statistics on data
already processed ("modelling"), then coding with these calculated
probabilities. For example, the family of PPM methods[2] for words; adaptive
Huffman, Shannon-Fano and Arithmetic Coding - for bytes. In contrast to the
first case, old statistics has the same weight as recent statistics, unless
the algorithm specifically counteracts this. Besides, all combinations,
including those that never appeared before and, most likely, will never
appear in the future, are considered possible.
b) Block. Statistics is encoded separately and added to the compressed block.
Static Huffman/Shannon-Fano and Arithmetic Coding for bytes; static
(pre-conditioned) PPM for words[9].
3) Transformation of Block.
Incoming data is split into blocks which are then transformed as a whole (and
in the case of homogeneous data, it's often better to take the whole block
that is to be compressed). BlockSorting methods, including BWT[3], the Fourier
Transform, the Discrete Cosine Transform, the Discrete Wavelet Transform,
Fractal transforms, Enumerative Coding.
As in the first case, several blocks instead of one can be formed. And even if
their total length does not decrease, their structure is considerably
improved, and further compression goes easier, faster and better.
In one sentence: a compression algorithm can be either statistical or
transforming, and can handle incoming data either as a stream or as blocks;
the more homogeneous and bigger the data and memory, the more effective
block algorithms are;
the less homogeneous and smaller the data and memory, the better stream
methods are;
the more complex the Source, the more the optimal transformation improves
compression;
the simpler the Source, the more effective a straightforward statistical
solution is (Bernoulli and Markov sources).
More advanced compression algorithms will synthesize all strategies, but no
such algorithms exist yet, and there are three basic strategies.
Nowadays all existing compressors apply to blocks of words either a method
from the LZ set, or from the BlockSorting set, or from the PPM family, but
not a synthesis of all three, or at least two, strategies (see [4] for
example, and also [5] and [6]).
However, some synthesizing algorithms have existed for a long time.
First, the net favorites - MP3, MPEG and JPEG - successfully combine stream
and block strategies. Thousands of scientific and popular articles have been
written about them and about the Discrete Cosine Transform, the main
algorithm they use; therefore they will not be discussed hereafter: see [7]
and [8].
Second, a proper compressor sequentially applies three or four algorithms
(preprocessing/filtering, compression of words, then bytes, and finally bits),
and the last one is statistical, as a rule, though the first two or three
can be transforming.
Because of the distinctive features of MMD - linearity, dimensionality,
channels - transformations practically always improve its compression.
Besides, MMD comes in big enough and homogeneous blocks, as a rule, and more
and more memory is available to process it. That's why the more effective MMD
compressors use one of the methods of the third strategy (Transformation of
Block) as the primary method, applied to true MMD and defining the efficiency
of the whole process (secondary algorithms work with PMMD).
What transformations exist?
1.
RLE, Run Length Encoding.
The most trivial compression method, using the least memory and time. The
compressor finds repetitions (runs) of identical elements Z and substitutes
them with the set {element Z, a repetition flag, number of elements N in the
repetition}. All other elements are simply copied from the input stream to
the output. If more than a few bytes of memory are available, it's possible
to store distances to the next repetitions instead of repetition flags.
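A minimal sketch of such a coder, assuming 8-bit elements and the simplest
possible output format: runs of three or more bytes are replaced with a
{flag, element, length} triple. A real coder would also have to escape
literal flag bytes; that is omitted here.

  /* RLE encoder sketch: runs of >= 3 equal bytes become (FF,Z,N). */
  #include <stdio.h>

  int main(void) {
      unsigned char in[] = { 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 4 };
      int n = sizeof in / sizeof in[0];

      for (int i = 0; i < n; ) {
          int run = 1;
          while (i + run < n && in[i + run] == in[i]) run++;
          if (run >= 3)                       /* worth encoding as a run */
              printf("(FF,%d,%d) ", in[i], run);
          else                                /* copy short fragments    */
              for (int j = 0; j < run; j++) printf("%d ", in[i]);
          i += run;
      }
      printf("\n");   /* prints: (FF,1,4) 2 (FF,3,5) 4 */
      return 0;
  }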
As any stream method, RLE is also applicable to blocks of data. In general,
block methods differ only in that they can't start working before the length
of the buffer with data is given. The rest are properties, not defining signs:
- the smaller the block, the less effective the block method (this property
is inherent in stream methods to a lesser degree);
- every element of a block is processed in the same way, regardless of its
distance to the beginning and the end of the block (this is also peculiar
to many stream methods, SEM and Delta Coding, for instance);
- almost all block methods are two-pass (LZ77 etc. are much more effective
with two passes, "search for words" and "analysis and conversion", for
example flexible parsing [18],[19]).
If further compression of the RLE output is expected, it's possible to create
four output streams/blocks:
lengths of repetitions N[i]; elements of repetitions Z[i]; lengths of the
other fragments, where repetitions were not found, R[i]; the other fragments
themselves, OF[i].
It can turn out that compression is better if the first elements of the other
fragments (the elements that "break off" repetitions) are put into a fifth
stream/block, and perhaps the last elements of OF into a sixth.
The main principles of data compression are clear from this most trivial
example.
It is necessary to mention that RLE is a special case of the "sliding window"
dictionary methods like LZ*, with the size of the window-dictionary equal to
one: the offset can be only one element back, while the length is not limited.
RLE is more effective on MMD with small dimensionality R, from 1 to 8 bits.
The bigger the dimensionality R of MMD bytes, the less probable it is that
applying RLE to that MMD will improve compression.
2.
SEM, Separate Exponents and Mantissas.
The big set of methods usually called Integer Coding[10], including the
structured model of Peter Fenwick, Rice codes[11], Fibonacci codes, Golomb
codes[12], Elias Gamma and Delta codes[13] and many other, more complex
variants, for example DAKX[14].
In the most basic case two streams are formed: the higher N bits and the
lower (R-N) bits. It's obvious that the first stream - the Most Significant
Bits - compresses better if N is properly chosen, especially in the case of
MMD.
But separating exponents (positions of the highest non-zero bits B) and
mantissas (the bits lower than B) is even more effective.
The first stream is byte-wide, the second is bit-wide.
It's possible to further separate the resulting streams, especially the
first one.
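A minimal sketch of this exponent/mantissa split, assuming 32-bit input
values (the packing of the bit-wide mantissa stream is left out):

  /* SEM sketch: exponent = position of highest set bit, mantissa = rest. */
  #include <stdio.h>

  /* Position of the highest set bit: 0 for v==0, else 1..32. */
  static int exponent_of(unsigned v) {
      int e = 0;
      while (v) { e++; v >>= 1; }
      return e;
  }

  int main(void) {
      unsigned input[] = { 5, 1, 0, 200, 13 };
      int n = sizeof input / sizeof input[0];
      for (int i = 0; i < n; i++) {
          int e = exponent_of(input[i]);
          /* mantissa: the (e-1) bits below the highest set bit */
          unsigned m = e > 1 ? input[i] & ((1u << (e - 1)) - 1) : 0;
          printf("value %3u -> exponent %2d, mantissa %u (%d bits)\n",
                 input[i], e, m, e > 1 ? e - 1 : 0);
      }
      return 0;
  }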
SEM is often applied to MMD compression, in combination with other methods,
and it's always present in lossy MMD compression algorithms.
3.
MTF, Move To Front, and DC, Distance Coding, are well described in
Vadim Yoockin's BWT-FAQ[3].
Let me add what I haven't yet seen in descriptions of these methods.
MTF can in general use a sliding window of N elements, instead of one element
as in the commonly considered variant. Frequencies of elements in this window
are calculated, and a conversion table of elements from the input to the
output stream is built.
Elements leaving the window are processed exactly as in the usual variant.
Weights of elements in the window can decrease, for instance, linearly: from
N for the element that has just entered the window, to 1 for the one that
leaves the window on the next step. Or exponentially, or not at all: the same
weight for all elements, regardless of their distance to the top of the
window.
In contrast to Huffman/Shannon-Fano methods, a "straight" tree is built, not
a "branchy" one: it's created so that there's only one element on every level.
Such MTF is closer to 1st-order PPM, but, unlike PPM, it outputs bytes
instead of probabilities in the range from 0 to 1.
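A minimal sketch of the commonly considered one-element variant for 8-bit
symbols (the sliding-window generalization described above is omitted; the
table starts as the identity ranking):

  /* Classic MTF: output each symbol's rank, then move it to the front. */
  #include <stdio.h>

  int main(void) {
      unsigned char table[256];       /* current symbol ranking */
      for (int i = 0; i < 256; i++) table[i] = (unsigned char)i;

      unsigned char input[] = "abcabc";
      for (int i = 0; input[i]; i++) {
          int rank = 0;
          while (table[rank] != input[i]) rank++;   /* find symbol's rank */
          printf("%d ", rank);                      /* output the rank    */
          /* move the symbol to the front of the table */
          for (int j = rank; j > 0; j--) table[j] = table[j - 1];
          table[0] = input[i];
      }
      printf("\n");  /* "abcabc" -> 97 98 99 2 2 2 */
      return 0;
  }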
4.
DC, unlike the three methods described above - RLE, SEM and MTF - can be
applied to N-dimensional data, not just one-dimensional. In the
two-dimensional case it's possible to encode the distance to the nearest
equal element either as a pair of distances (X, Y) in the Cartesian
coordinate system, or as a pair (angle, distance), or as a "distance along
a spiral":
28
23 18 11 19 24
17 6 2 7 20
27 10 1 0 3 12 25
16 5 4 8 13
22 15 9 14 21
26
Similarly for the case when only the elements above, and those at the same
height and to the left, are known:
17 14 18
11 8 6 9 12
16 7 3 2 4 10 15
13 5 1 0
Apparently, nobody has tried to apply two-dimensional DC to compression of
MMD such as graphics, but such experiments could lead to good results on a
wide class of data.
Theoretically, "plain" one-dimensional DC, like RLE and MTF, is more
effective on MMD with small byte dimensionality R, but practical results are
missing. Even examples of applying DC to one-dimensional MMD are unknown to
me, and most likely they are still absent, as DC has only recently drawn the
attention of researchers, and by now it's studied notably less than other
methods.
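For reference, a minimal sketch of the one-dimensional case in its simplest
formulation, assuming 8-bit elements: for each element, output the distance
back to its previous occurrence, or 0 if there was none (practical DC
variants, e.g. as a BWT postprocessor, add further refinements):

  /* 1-D Distance Coding sketch. */
  #include <stdio.h>

  int main(void) {
      unsigned char input[] = { 7, 3, 7, 7, 3, 5 };
      int n = sizeof input / sizeof input[0];
      int last[256];                      /* last position of each value */
      for (int i = 0; i < 256; i++) last[i] = -1;

      for (int i = 0; i < n; i++) {
          int d = last[input[i]] < 0 ? 0 : i - last[input[i]];
          printf("%d ", d);               /* prints: 0 0 2 1 3 0 */
          last[input[i]] = i;
      }
      printf("\n");
      return 0;
  }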
5.
LPC, Linear Prediction Coding,
including the very complex but very effective variants CELP (Code-Excited
Linear Prediction)[15,16] and MELP (Mixed Excitation Linear Prediction)[17],
developed for strong lossy compression of speech.
Much simpler and more widespread is the elementary variant called Delta
Coding, for example ADPCM, Adaptive Delta PCM (Pulse Code Modulation). The
approximation B[i]=B[i-1] is used, and the stream {B[i]} is transformed into
{B[i] - B[i-1]}.
Because of the third property of MMD ("Linearity") the difference between
neighbouring elements is small, as a rule. Therefore Delta Coding is
applicable to all "typical" MMD, regardless of their origin, dimensionality
and all the rest.
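A minimal sketch of this first-order variant on 16-bit samples (note that
in the general case the differences of R-bit elements need R+1 bits):

  /* Delta Coding sketch: {B[i]} -> {B[i] - B[i-1]}. */
  #include <stdio.h>

  int main(void) {
      short b[] = { 1000, 1004, 1009, 1013, 1015 };  /* slowly varying MMD */
      int n = sizeof b / sizeof b[0];

      int prev = 0;
      for (int i = 0; i < n; i++) {
          printf("%d ", b[i] - prev);   /* prints: 1000 4 5 4 2 */
          prev = b[i];
      }
      printf("\n");
      return 0;
  }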
Moreover, the third property of MMD is what is usually used to detect MMD
among unknown data, since Delta Coding consumes the least resources - memory
and time - of all methods.
Through the second simple variant - approximation B[i-1]=(B[i]+B[i-2])/2,
i.e. conversion to B[i] - 2*B[i-1] + B[i-2] - we come to the general form:
B[i] = sum( K[j]*B[j] ), 0 < j < i.
The dependence of B[i] on preceding elements usually decreases exponentially,
and it's practically useless to consider more than ten previous B[j].
In the two-dimensional case the second variant is B[i,k]=(B[i-1,k]+B[i,k-1])/2:
having B[i,k], we create {B[i,k] - (B[i-1,k]+B[i,k-1])/2}.
The third: B[i,k] = (B[i-1,k] + B[i,k-1] + B[i-1,k-1])/3.
The fourth: B[i,k] - B[i-1,k] = B[i,k-1] - B[i-1,k-1].
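A minimal sketch of the third two-dimensional variant on a small invented
image, with border pixels predicted from their single known neighbour for
brevity:

  /* 2-D delta sketch: predict each pixel from left/upper/upper-left. */
  #include <stdio.h>

  #define W 4
  #define H 3

  int main(void) {
      int img[H][W] = {
          { 10, 11, 12, 12 },
          { 11, 12, 13, 14 },
          { 12, 12, 14, 15 },
      };

      for (int i = 0; i < H; i++) {
          for (int k = 0; k < W; k++) {
              int pred = 0;
              if (i > 0 && k > 0)
                  pred = (img[i][k-1] + img[i-1][k] + img[i-1][k-1]) / 3;
              else if (k > 0) pred = img[i][k-1];   /* first row: left  */
              else if (i > 0) pred = img[i-1][k];   /* first col: upper */
              printf("%3d", img[i][k] - pred);      /* small residuals  */
          }
          printf("\n");
      }
      return 0;
  }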
Further complications are not so effective in the case of MMD: taking into
account two or three neighbouring elements gives a qualitative improvement
of tens of percent, while all further elements add one or two percent.
Nevertheless, it's useful to consider them, but for another purpose: a
procedure accounting for noise significantly improves MMD compression with
Delta Coding. Every MMD element B[i] deviates from its predicted value not
only because of strong conditioned impacts, but also because of weak
background oscillations, that is, noise: B[i] = S[i] + N[i].
It's clear that, first, the smaller the contribution of the noise N[i], the
more effective LPC is, and on fragments of "silence", containing only noise,
LPC gives no advantage (since differences of R-bit elements will be
(R+1)-bit, not R-bit).
And second, the more elements are considered, the easier it is to separate
the main tendency from the background noise.
6.
Studying Delta Coding, it's easy to invent one more fundamental method -
Subband Coding. Indeed, if we have two "parallel" blocks (of the same length)
or streams (synchronous) B[i] and C[i], and know that they are
interdependent, how can we use this? If B and C are the higher and lower bits
after SEM, it's difficult to contrive anything better than PBS (described
below, in the next item).
But if B and C are parallel channels of MMD (readings of closely placed
measuring devices), more often they will be rather similar: the values of the
acting forces will be of the same order (though the contributions of noise
can differ). For instance, the left and right channels of stereo sound.
Let's try to take this into account with the same approximation as in the
LPC case: B[i]=C[i]. If it turns out that it's really better to encode
D[i]=B[i]-C[i] than B[i] or C[i], what function can be applied to the other
channel?
Exactly: it's possible to store the sum A[i]=B[i]+C[i] instead of the second
channel, and this also gives a gain in compression ratio on the majority of
MMD.
The main disadvantage of such a decomposition into average and difference
(delta) was already mentioned above: if R bits are used for the elements of
the streams B[i] and C[i], their sums and differences will be (R+1)-bit.
If B[i] and C[i] are from 0 to 255, the sums are from 0 to 510, the
differences from -255 to 255. What can be done about this?
First, notice that A[i] and D[i] are either both even or both odd. That is,
the low bit of A[i] coincides with the low bit of D[i]. Therefore it's
possible to create
either D[i]= B[i]-C[i] and A[i]=(B[i] +C[i])/2,
or D[i]=(B[i]-C[i])/2 and A[i]= B[i] +C[i].
Getting rid of the second "superfluous" bit is possible only at the cost of
some loss of compression ratio:
V[i]= (B[i]+C[i])mod(2^R) and E[i]= (B[i]-V[i]/2)mod(2^R),
or V[i]= (B[i]+C[i])mod(2^R) and E[i]= (C[i]-V[i]/2)mod(2^R),
or E[i]= (B[i]-C[i])mod(2^R) and V[i]= (B[i]-E[i]/2)mod(2^R),
or E[i]= (B[i]-C[i])mod(2^R) and V[i]= (C[i]+E[i]/2)mod(2^R)
(division is integer, (A)mod(B) is the remainder of A divided by B).
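A minimal sketch of the first pair above (D[i]=B[i]-C[i], A[i]=(B[i]+C[i])/2)
for R=8 channels, together with the reconstruction: since B+C and B-C have
the same parity, the low bit of the sum dropped by the halving is recovered
from the low bit of D (two's complement arithmetic assumed):

  /* Average+Delta pair transform and its inverse. */
  #include <stdio.h>

  int main(void) {
      int b = 200, c = 100;             /* two correlated channel samples */

      int d = b - c;                    /* difference, range -255..255    */
      int a = (b + c) / 2;              /* halved average, range 0..255   */

      /* Reconstruction: low bit of (b+c) equals low bit of d. */
      int sum = 2 * a + (d & 1);
      int b2 = (sum + d) / 2;
      int c2 = (sum - d) / 2;

      printf("D=%d A=%d -> B=%d C=%d\n", d, a, b2, c2);  /* B=200 C=100 */
      return 0;
  }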
Prove that B[i] and C[i] can be restored from any of these pairs of arrays,
but cannot be restored from these:
A[i]=(B[i]+C[i])/2 and E[i]=(B[i]-A[i])mod(2^R),
A[i]=(B[i]+C[i])/2 and E[i]=(C[i]-A[i])mod(2^R),
D[i]=(B[i]-C[i])/2 and V[i]=(B[i]-D[i])mod(2^R),
D[i]=(B[i]-C[i])/2 and V[i]=(C[i]+D[i])mod(2^R).
If there are not two channels but more, say nine, the search for an optimal
transform becomes a very non-trivial task. Let's only remark that it's
possible to divide the channels into pairs and apply the Average+Delta
conversion to these pairs, and then to the results. Or to sequentially take
the third, the fourth, etc. channels, and on every N-th step save the
difference between the N-th channel and the average of all N-1 previous ones.
However, all this is only the foreword to Subband Coding. Let's return to
the case when only one channel is present. Is Average+Delta applicable? Yes.
We divide all elements B[i] into pairs (B[2*j], B[2*j+1]) and transform them
into two streams - Averages and Differences. In the case of MMD the physical
sense of the Averages is the lower frequencies, and of the Differences, the
higher frequencies.
Traditional Subband Coding consists in multiple application of Average+Delta,
each time to the first of the two created blocks, i.e. to the lower
frequencies.
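A minimal sketch of such a cascade - a Haar-like Average+Delta split applied
repeatedly to the low-frequency half - assuming the block length is a power
of two:

  /* One-channel Subband Coding sketch. */
  #include <stdio.h>

  static void split(int *x, int n) {
      int lo[64], hi[64];                    /* scratch, n/2 <= 64 assumed */
      for (int j = 0; j < n / 2; j++) {
          lo[j] = (x[2*j] + x[2*j+1]) / 2;   /* average -> low band  */
          hi[j] = x[2*j] - x[2*j+1];         /* delta   -> high band */
      }
      for (int j = 0; j < n / 2; j++) x[j] = lo[j];
      for (int j = 0; j < n / 2; j++) x[n/2 + j] = hi[j];
  }

  int main(void) {
      int x[8] = { 10, 12, 14, 13, 11, 10, 12, 15 };

      /* Each pass halves the low band: 8 -> 4 -> 2 -> 1 elements. */
      for (int n = 8; n >= 2; n /= 2)
          split(x, n);

      for (int i = 0; i < 8; i++) printf("%d ", x[i]);
      printf("\n");
      return 0;
  }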
7.
Parallel Blocks Sorting. A description of this method is published here for
the first time; leading experts confirm this. PBS is used for lossless
compression of multimedia files in the ERI32 compressor, which sets the
world record on the most popular set of 24-bit pictures, Kodak True Color
Images [21].
Consider two parallel blocks: "Basis" and "Sorted" (the one being sorted).
First, pass through the first block, counting the elements in it
(a loop over the block size, calculating the "lengths of boxes" N[i]).
Second, using the created array N[i], build the array P[i]=P[i-1]+N[i-1]
(a loop over the 2^R values, calculating the "beginnings of boxes" P[i],
an array of pointers).
And finally, the main loop: go synchronously through "Basis" and "Sorted",
take element B from "Basis" and element S from "Sorted", write S to address
P[B], and increment P[B] (a loop over the block size, "putting elements into
the corresponding boxes").
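A minimal sketch of these three loops for 8-bit elements - in effect, a
counting sort of "Sorted" keyed by the parallel "Basis" (the data values
here are invented for illustration):

  /* PBS sketch: sort "Sorted" by the parallel "Basis". */
  #include <stdio.h>

  int main(void) {
      unsigned char basis[]  = { 2, 0, 2, 1, 0, 2 };        /* keys    */
      unsigned char sorted[] = { 'a','b','c','d','e','f' }; /* payload */
      int n = 6;
      unsigned char out[6];

      int N[256] = { 0 }, P[256];
      for (int i = 0; i < n; i++) N[basis[i]]++;    /* lengths of boxes */

      P[0] = 0;                                  /* beginnings of boxes */
      for (int i = 1; i < 256; i++) P[i] = P[i-1] + N[i-1];

      for (int i = 0; i < n; i++)                /* put elements into   */
          out[P[basis[i]]++] = sorted[i];        /* their boxes         */

      for (int i = 0; i < n; i++) printf("%c", out[i]);  /* "bedacf" */
      printf("\n");
      return 0;
  }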
Notice that two or even more "Bases" can be used; then in the first and
third loops the "total basis" is calculated as the sum
B1 + B2*2^R + B3*2^(2*R) + ...
The order of the bases - which is first, second, third, etc. - is very
important. For example, having divided (by SEM) a block of 32-bit bytes into
"high 8", "middle 8" and "low 16" bits - HI, MD and LO - compression is
better if the total basis is MD+HI*256, not HI+MD*256.
It's obvious that "Basis" can coincide with "Sorted" cyclically rotated by
C bytes. The most interesting case is C=1, that is, sorting every element by
its predecessor. And even in this "rotated" case there can be many bases.
If it is a two-dimensional picture, the second basis can be "Sorted" rotated
by D bytes, so that the (i-D)-th element is the second of two known
neighbours, say, the left and the upper one. This is possible because a
rectangular block of two-dimensional MMD can be stored in one-dimensional
memory sequentially, line after line. Then D will simply be the length of a
line of the rectangular block, the (i-1)-th element a pixel to the left of
the i-th, and the (i-D)-th a pixel above the i-th.
Unfortunately, there's no space here to examine even more complex
algorithms. Even a very brief description of each of them would occupy from
one to two hundred lines.
8. VQ, Vector Quantization[23]
9. Wavelet Transform[24]
10. Fractal Transform[25]
11. Enumerative Coding[26]
12. ERI93
Further details and articles are at http://go.to/artest and
http://dogma.net/DataCompression/
Summing up, let's return to the definition of MMD and consider an
alternative variant of it.
I.
MMI - information coming from the physical world; MMD - the result of its
measurement by quantization devices. Who uses it and how is a secondary
question.
Computer-synthesized MMD (SMMD) also exist, but their total amount is
smaller, and they are better regarded as databases: their artificiality is
practically always easily seen, even by humans, and algorithms for detecting
it are not more complex than the algorithms synthesizing such "hand-made"
MMD. In particular, such pictures compress losslessly by more than 10:1,
while those created by quantization compress approximately two or three to
one. The optimal compression of artificial MMD is the algorithm of its
creation.
But an alternative, "reverse" definition (thanks to E.Shelvin) is also
possible. It considers the Receiver more important than the Source.
II.
MMI - information intended for perception by the sense organs of a human.
MMD - data of arbitrary structure/format representing some amount of MMI.
What the source is, is a secondary question.
But the alternative definition seems to have more disadvantages than the
first. Here are some of them:
(1) While a recorded block of MMD, say, a sound, has not yet been heard by
anyone, it's not MMD, but as soon as it has been heard, it is?
Of course, it's possible to ask a similar "reverse" question about the first
definition:
if a block of MMD, say "white noise", is created by quantization of silence
or of a white sheet of paper, it's MMD, but if it results from a computer
algorithm, it's not MMD?
Exactly so. Because, among other things, a real object (the physical world)
is always more complex than its (computer) model. Even elementary objects
(silence, white noise) are, and the more complex the object, the more it
differs from the model.
(2) A significant part of the data having the three distinctive features of
MMD (dimensionality, linearity, channels), which stem from the peculiarities
of quantization and storage of MMI, will not be MMD, merely by definition.
(3) The main characteristic of MMI in the alternative case is "quality",
that is, the ratio of useful information to the noise component in it. And
the main property of MMD is the presence of noise in it. Nevertheless, MMD
without noise exist (moreover, in the ideal case MMD contain no noise and
are compressed losslessly).
Example: tune a microphone in silence, and a scanner on a white sheet of
paper, so that noise appears only in the thirtieth bits. After that, when
recording sound and scanning images, save only 16 bits of every sample and
pixel. If the contribution of noise is less than 0.001%, we can neglect it
in most cases.
Many thanks to Vadim Yoockin and Eugene Shelvin for comments and questions
on the first versions of this article.
References
==========
[1] http://geocities.com/eri32/int.htm
[2] http://sochi.net.ru/~maxime/doc/ppm.txt
[3] http://sochi.net.ru/~maxime/doc/bwt.txt
[4] http://members.nbci.com/vycct/
[5] http://act.by.net/
[6] http://geocities.com/eri32/
[7] http://www.mp3-tech.org/
[8] http://www.jpeg.org/
[9] http://www.cbloom.com/src/ppmz.html
[10] http://wannabe.guru.org/alg/node167.html
[11] http://www.jucs.org/jucs_3_2/symbol_ranking_text_compression/html/paper.html
[12] http://www.cs.tut.fi/~albert/Dev/pucrunch/packing.html
[13] http://www.ics.uci.edu/~dan/pubs/DC-Sec3.html
[14] http://www.dakx.com/theory/index.html
[15] http://www.data-compression.com/speech.html
[16] http://rice.ecs.soton.ac.uk/jason/speech_codecs/standards/index.html
[17] http://www.plh.af.mil/ddvpc/melp.htm
[18] http://www.arturocampos.com/ac_flexible_parsing.html
[19] http://www.dcs.warwick.ac.uk/~nasir/work/fp/
[20] http://www.computerra.ru/online/influence/kulibin/7261/
[21] http://geocities.com/eri32/kodak.html
[22] http://www.7-zip.com/docpress.html
[23] http://www.data-compression.com/vq.html
[24] http://dogma.net/DataCompression/Wavelets.shtml
[25] http://dogma.net/DataCompression/Fractal.shtml
[26] ftp://oz.ee.umn.edu/users/kieffer/seminar/5585supp3.pdf
This text, unlike [1], was originally written in Russian;
the Russian ver.1.0 is at http://geocities.com/eri32/mmi.txt
the English ver.1.0 can be found at http://geocities.com/eri32/mmd.htm
With best regards,
and apologies for misprints,
A.Ratushnyak,
http://go.to/artest