Новинки:

Сайт подключен к Orphus. Если вы заметили опечатку, выделите слово и нажмите Ctrl+Enter. Спасибо!

Sorting the matrix of two-sided contexts

Prologue ======== Were any wavelet-like ideas used for _lossless_ compression ? For PPM or BlockSorting techniques ? Either "practically no" or "not at all", but why ? While compressing (=modeling =predicting) array of elements X[n], if we are saving Values_Of_Function_Of_Not_Only_Previous_But_Also_Future_Elements Z[i], we can reconstruct (while decompressing) every next element X[j] using not only previous elements X[i], but also these Z[i]. In ordinary case we save Values_Of_Function_Of_Previous_Elements Y[i], and can't use them while decompressing every next element X[j], because they all can be reconstructed with elements we already have: Y[i]=Y(X[i],X[i-1],X[i-2],...,X[1]). I can't agree with people thinking "elegant algos don't work via backside". Not only the past affects the present, but also the future does. New(?) way ========== Sorting the matrix of two-sided contexts ======================================== The original array is X[n]. The context of N-th element of X is the following sequence K: K[0]=X[N] (the "prefix", or "parent" element) K[1]=X[N-1] K[2]=X[N+1] K[3]=X[N-2] K[4]=X[N+2] K[5]=X[N-3] K[6]=X[N+3] ... and so on; array X is a ring: after the last element - again the first: X[Length]=X[0], X[Length+1]=X[1]... Let's take, for a short example, X="MABROKADABROF", and form the matrix of all contexts of X: M A M B A M R B A M O R B A M K O R B A M A K O R B AM D A K O RMBA A D A KMOARB B A DMAAKBOR R BMAADBARKO Matrix of all contexts Left-Right matrix: CM, but first OMRABBARDOAK of original array X: column is moved to the end: MFAOBRRBOAKDA MFAOBRRBOAKDA FAOBRRBOAKDAM AMBFROORKBAAD AMBFROORKBAAD MBFROORKBAADA BARMOFKOARDBA BARMOFKOARDBA ARMOFKOARDBAB RBOAKMAFDOARB RBOAKMAFDOARB BOAKMAFDOARBR ORKBAADMAFBOR ORKBAADMAFBOR RKBAADMAFBORO KOARDBAABMRFO KOARDBAABMRFO OARDBAABMRFOK AKDOARBBRAOMF => AKDOARBBRAOMF => KDOARBBRAOMFA DAAKBORROBFAM DAAKBORROBFAM AAKBORROBFAMD ADBARKOOFRMBA ADBARKOOFRMBA DBARKOOFRMBAA BARDOAFKMOARB BARDOAFKMOARB ARDOAFKMOARBB RBOAFDMAAKBOR RBOAFDMAAKBOR BOAFDMAAKBORR ORFBMAADBARKO ORFBMAADBARKO RFBMAADBARKOO FOMRABBARDOAK FOMRABBARDOAK OMRABBARDOAKF MFAOBRRBOAKDA A BFROORKBAAD CM LRM B R OFKOARDBA (Context Matrix) (Left-Right Matrix) R O K AFDOARB O K A D AFBOR K A D A B RFO A D A B R O F D A B R O F A B R O F B R O F R O F O F F Now, we sort the lines of LRM-matrix: smaller to the top, bigger to the bottom: AAKBORROBFAMD ARDOAFKMOARBB ARMOFKOARDBAB BOAFDMAAKBORR BOAKMAFDOARBR DBARKOOFRMBAA FAOBRRBOAKDAM KDOARBBRAOMFA MBFROORKBAADA OARDBAABMRFOK OMRABBARDOAKF RFBMAADBARKOO RKBAADMAFBORO (sLRM, sorted LRM-matrix) Thus, the contexts are arranged so that all similar are grouped together, and the last column contains the "parent" elements. IMPORTANT: (1) Having any line of sLRM, plus position of its last element in X, we can reconstruct original array X. The same with any line of LRM and CM. (2) Having the last column of sLRM and few extra bits, we can rebuild the whole sLRM, and, getting one more index, the original X. (3) The last column of sLRM is more redundant than in case of BWT. (4) Does it look like we need to keep the whole matrix in memory ? Fortunately, no. One pointer to an element of X defines a line of sLRM. But now linear comparison is impossible: only element-by-element. Practice will show how big is the gain. RECONSTRUCTION: We have the last column only. Step 1 (having only 1 column): ====== We know that in the first column of sLRM - are sorted elements of the last column, and easily get the first column: 1) A...........D 2) A...........B 3) A...........B 4) B...........R 5) B...........R 6) D...........A 7) F...........M 8) K...........A 9) M...........A 10) O...........K 11) O...........F 12) R...........O 13) R...........O Step 2a (having only 2 columns, first and last, reconstruct column-2): ======= column-1: 2: column-Z: 1st 1st the parent left right element 1) a+ A d A ..... D 2) a+ A b R ..... B 3) a+ A b R ..... B 4) b+ B r O ..... R 5) b+ B r O ..... R 6) d+ D a (D?B?B?) ..... A 7) f+ F m A ..... M 8) k+ K a (D?B?B?) ..... A 9) m+ M a (D?B?B?) ..... A 10) o+ O k A ..... K 11) o+ O f M ..... F 12) r+ R o (K?F?) ..... O 13) r+ R o (K?F?) ..... O Legend of symbols: The sentence "about a+" is read so: Since after A (see column-1) only D,B,B are possible (see column-Z) => fill 2nd column, lines with "a", with this values, as in the table. about b+ : Since after B only R is possible (see columns 1 and Z) => fill 2nd column, lines with "b", with "R", as in the table. d+ : Since after D only A is possible (...) => ... f+ : Since after F only M is possible => ... k+ : Since after K only A is possible => ... m+ : Since after M only A is possible => ... o+ : Since after O only K,F are possible => ... r+ : Since after R only O is possible =>...see column-2, lines with "r". Now, take into account (1) that lines of matrix are sorted. Therefore in the 2nd column, in 12th line - F, in 13th - K. Also, remember (2) that every line and every column of sLRM contains all elements of the original array X, but they are rearranged. We know the set of all elements of X from the only Z-th column received. Therefore in the 2nd column, in 6th line - not D, but B, because D is already present in the 6th line (in 1st column). Thus, already reconstructed: column-1: 2: column-Z: 1st 1st the parent left right element 1) a+ A d A ..... D 2) a+ A b R ..... B 3) a+ A b R ..... B 4) b+ B r O ..... R 5) b+ B r O ..... R 6) d+ D a B ..... A 7) f+ F m A ..... M 8) k+ K a (D?B?) ..... A 9) m+ M a (D?B?) ..... A 10) o+ O k A ..... K 11) o+ O f M ..... F 12) r+ R o F ..... O 13) r+ R o K ..... O Step 2b (using only 2 columns, first and last, reconstruct column-3): ======= column-1: 2: 3: column-Z: 1st 1st 2nd the parent left right left element 1) d- A A a (D?K?M?) ..... D 2) b- A R a (D?K?M?) ..... B 3) b- A R a (D?K?M?) ..... B 4) r- B O b A ..... R 5) r- B O b A ..... R 6) a- D B d A ..... A 7) m- F A f O ..... M 8) a- K (D?B?) k O ..... A 9) a- M (D?B?) m F ..... A 10) k- O A o R ..... K 11) f- O M o R ..... F 12) o- R F r B ..... O 13) o- R K r B ..... O The sentence "about d-" sounds so: Since before D only A is possible (see column-Z and column-1) => fill 3rd column, lines with "d", as in the table, with this value, A. ... a-: Since before A only D,K,M are possible (see columns Z and 1) =>... fill 3rd column, lines with "a", as in the table, with this values. And so on, all steps b-, f- etc. are filling column-3 using same logic. Now, see that element in 3rd column, in 1st line, can't be D, as D is already present in 1st line (in column Z). Also, D can be in the 3rd line of 3rd column, because lines 2 and 3 are sorted, and both K and M are bigger than D. Thus, D can be only in the 2nd line. Now, column-1: 2: 3: column-Z: 1st 1st 2nd the parent left right left element 1) A A (K?M?) ..... D 2) A R D ..... B 3) A R (K?M?) ..... B 4) B O A ..... R 5) B O A ..... R 6) D B A ..... A 7) F A O ..... M 8) K (D?B?) O ..... A 9) M (D?B?) F ..... A 10) O A R ..... K 11) O M R ..... F 12) R F B ..... O 13) R K B ..... O Using two columns, we (partially) reconstructed four columns. Only pair (column-1, column-Z) was viewed, the distance between them is 1, i.e. these are neighbor columns. Further: ======== Step 4.1a (using columns Z and 2, rebuild column-4 "2nd right element") Step 4.1b (using columns 1 and 3, rebuild column-5 "3rd left element") Why 4.1a and 4.1b: Using four columns; distance=1; a - step to the right, b - to the left. After every step, get more precise values, using that (1) the lines of sLRM are sorted; (2) every line and every column of sLRM - all elements of X rearranged. Step 4.2a (using columns 1 and 2, rebuild column-6 "3rd right element") Step 4.2b (using columns Z and 3, rebuild column-7 "4th left element") Step 4.3a (using columns 2 and 3, rebuild column-8 "4th right element") Step 4.3b (using columns 2 and 3, rebuild column-9 "5th left element") Using four columns, partially reconstructed ten. Further: steps 10.1a, 10.1b, 10.2a, 10.2b, 10.3a, 10.3b, 10.4a, 10.4b, 10.5a, 10.5b and so on, until all the columns are reconstructed. When new values are written to an element of matrix, on top of some old, for example, there were (F?B?), and new (D?M?F?) are added, remain only those present in both old and new sets (F, in this example). Indeed, it's rather inconvenient: to rebuild the whole matrix each time, in such element-by-element way. There certainly are appallingly effective optimizations. Of course, not everything is now seen exceptionally precisely and clear. Epilogue ======== Practice will show how big is the gain, comparing to other (PPM or BlockSorting) techniques. Does anybody know any analogues of suggested "new way"? ------------------------------------------------------------- Correct me where I am wrong, expand me where I am incomplete. ------------------------------------------------------------- This article can be found at http://geocities.com/eri32/slrm.htm http://artest1.tripod.com/slrm.htm With best kind regards, A.Ratushnyak, RAO Inc. http://go.to/artest go Back to main ARTest page