Lossless compression via the memoizer

Via Andrew Gelman comes a link to deplump, a new compression tool. It runs the data through a predictive model (like most lossless compressors), but:

Deplump compression technology is built on a probabilistic discrete sequence predictor called the sequence memoizer. The sequence memoizer has been demonstrated to be a very good predictor for discrete sequences. The advantage deplump demonstrates in comparison to other general purpose lossless compressors is largely attributable to the better guesses made by the sequence memoizer.

The paper on the sequence memoizer (by Wood et al.) appeared at ICML 2009, with follow-ups at DCC and ICML 2010. It uses as its probabilistic model a version of the Pitman-Yor process, which is a generalization of the “Chinese restaurant”/“stick-breaking” process. Philosophically, the idea seems to be this: since we don’t know the order of the Markov process which best models the data, we let the model order be “infinite” via the Pitman-Yor process and just infer the right parameters, hopefully avoiding overfitting while staying efficient. The key challenge is that since the process can have infinite memory, the encoding gets hairy, which is why “memoization” becomes important. It seems that the particular parameterization of the PY process is important for keeping the number of parameters down, but I didn’t have time to look at the paper in that much detail. Besides, I’m not as much of a source coding guy!
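To get a feel for what “better guesses” buys you, here is a minimal sketch (mine, not the authors’ code) of Pitman-Yor-style prediction on a byte stream. It uses a single order-0 restaurant with the usual one-table-per-symbol simplification, not the full hierarchical sequence memoizer, and the discount/concentration values are made up for illustration. Summing -log2 of the predictive probabilities gives the ideal arithmetic-coding length under the model, which is the sense in which better prediction means better compression.

    import math
    from collections import Counter

    def py_predictive(counts, total, symbol, alphabet_size,
                      discount=0.5, concentration=1.0):
        # Predictive probability of `symbol` under one Pitman-Yor
        # restaurant with a uniform base measure over the alphabet.
        base = 1.0 / alphabet_size
        if total == 0:
            return base
        num_types = len(counts)            # one table per symbol type
        seen = max(counts[symbol] - discount, 0.0)
        new = concentration + discount * num_types
        return (seen + new * base) / (total + concentration)

    def ideal_code_length(data, alphabet_size=256):
        # Bits an arithmetic coder would need if each byte is coded
        # with the model's predictive probability given its prefix.
        counts, total, bits = Counter(), 0, 0.0
        for b in data:
            bits -= math.log2(py_predictive(counts, total, b, alphabet_size))
            counts[b] += 1
            total += 1
        return bits

    text = b"the quick brown fox jumps over the lazy dog " * 50
    print("raw bits:  ", 8 * len(text))
    print("ideal bits:", round(ideal_code_length(text)))

The memoizer goes further by tying together restaurants for contexts of every length, so the prediction for the next byte is smoothed across all suffixes of the history rather than being order-0 as above.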

I tried it out on Leo Breiman’s paper Statistical Modeling: The Two Cultures. Measured in bytes:

307458  Breiman01StatModel.pdf      original
271279  Breiman01StatModel.pdf.bz2  bzip2 (Burrows-Wheeler transform)
269646  Breiman01StatModel.pdf.gz   gzip
269943  Breiman01StatModel.pdf.zip  zip
266310  Breiman01StatModel.pdf.dpl  deplump

As promised, it is better than the alternatives (though not by much for this example).
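For what it’s worth, the gzip/bzip2/zip numbers above are easy to regenerate with Python’s standard library (the byte counts will differ slightly from the command-line tools because of compression levels and container overhead; deplump itself was run through the web service, so it isn’t reproduced here):

    import bz2, gzip, os, zipfile

    SRC = "Breiman01StatModel.pdf"
    with open(SRC, "rb") as f:
        data = f.read()

    print(len(data), SRC, "original")
    print(len(bz2.compress(data, 9)), SRC + ".bz2", "bzip2")
    print(len(gzip.compress(data, 9)), SRC + ".gz", "gzip")
    with zipfile.ZipFile(SRC + ".zip", "w", zipfile.ZIP_DEFLATED) as z:
        z.write(SRC)                       # deflate, default level
    print(os.path.getsize(SRC + ".zip"), SRC + ".zip", "zip")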

What is interesting is that they don’t seem to cite much from the information theory literature. I’m not sure whether this is a case of two communities working on related problems without being aware of the connections, whether the problems are secretly not related, or whether information theorists mostly “gave up” on this problem (I doubt that last one, but like I said, I’m not a source coding guy…).

4 thoughts on “Lossless compression via the memoizer”

    • Lossy seems even harder than lossless, actually… I think the distortion metric would not play well with the Bayesian approach, since the interaction between metric and parameterization of the model could get icky. I don’t have much intuition though…

  1. I am not a source coding guy either, nor have I even glanced at the paper, but… a pdf is not a good test file. By default, pdf files are compressed (using zip-style deflate, with the compression level between 0 (no compression) and 9 (best compression), IIRC). Perhaps using a big text file would be better. I believe there is a standard test suite for lossless data compression algorithms.

    • Oh I agree, but deplump had a 2MB limit and I was lazy so I picked the file that was on my desktop. Maybe if I get bored between sessions at ITW I’ll do a better test, but they have some more exhaustive results in their papers.
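(A quick way to check the commenter’s point that the PDF is already compressed internally: most PDF content streams are deflate-coded and flagged with the /FlateDecode filter, so a general-purpose compressor is largely re-compressing already-compressed data. A rough count, assuming the same file as above:)

    with open("Breiman01StatModel.pdf", "rb") as f:
        pdf = f.read()
    # /FlateDecode marks deflate-compressed streams inside the PDF
    print(pdf.count(b"/FlateDecode"), "FlateDecode streams")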
