andrew makes things

a blog about programming, machine learning, and whatever else I'm working on

Compressing Code

| Comments

What can we learn about a code base or a language based on its compressibility? My pet theory is that less compressible code will be, on average, better code, because less compressible code implies more factoring, more reuse, and fewer repetitions.

Larger numbers (longer lines) indicate more compressibility.

Below are the compressibility results for some popular libraries and languages. To generate this data, I downloaded each package or library, extracted the source files, removed comments, and gzipped the files with maximum compression. The numbers represent the ratio of uncompressed source file size to compressed source file size. Smaller ratios imply less compressible code.

Unsurprisingly, Java was the most compressible language- all that boilerplate! Python was the least compressible language, with Java, on average, being about twice as compressible. I was somewhat surprised that JavaScript was the second best language in terms of incompressibility.

It’s probably disingenuous to draw strong conclusions from these results, but I still find them intriguing. What, if anything, do you think they mean?

Aside: Compression itself is a fascinating subject. Being able to compress something is fundamental to being able to understanding it. If you can rewrite a 2 page document into 2 paragraphs, while still expressing its core ideas, you’ve deeply understood the material. Hence the existence of the Hutter Prize, a standing challenge in Artificial Intelligence to further compress a corpus of English text. Hence, also, the existence of specialized image compression algorithms that compress human faces better than anything else because they understand, in software, what a human face generally looks like.

If I can compress an image of your face, I can probably also recognize it. Imagine that I have thousands of photos of faces. Using some linear algebra and creative encodings, I can figure out the commonalities and differences among these faces. Basically, I can derive a set of common noses, a set of common eyes, a set of common brows… and, given a new face photo, I can compute a mixture of these common attributes for the new face. Perhaps it, roughly speaking, has 60% of Common Nose 6 and 40% of Common Nose 12. Well, then I can represent the picture of this new nose as roughly two numbers, the “amounts” of Nose 6 and of Nose 12, and suddenly I’ve compressed a collection of hundreds or thousands of pixels- a photo of a nose- into just two numbers. We could also go in reverse, taking a photo, calculating the percentages of different common features, and then looking up in a database to see who we know whose face expresses those same feature percentages. We can compress, and, thus, we can recognize. (Interested in this? See Eigenfaces as well as PCA and ICA for face analysis.)