By Martin / The Lab, Python / Friday May 7th, 2010 / 2 comments
Byte pair encoding is a simple but (sometimes) effective algorithm for compressing data (especially textual data). Best of all, it is very easy to understand! Therefore, in this experiment I have created a simple Python script that implements byte pair encoding on files.
Basically, byte pair encoding works by finding the most common pair of bytes in a file, and replacing them with a single unused byte, at the same time storing that association in a dictionary. It then does this over and over again until there are either no pairs that occur more than two times or no unused characters left.
The script can be found in the Lab at http://lab.aspektas.com/bpe_compression.py. It doesn't implement all the concepts described in the original paper on the encoding method (Gage 1994), but should still provide a nice example of how it works.
Note that while this method of encoding may sound cool, it has its limitations. Think about binary files (e.g. images, programs, and pretty much all other files besides plain-text ones) - how much room would there be for additional compression through byte pairs? Try it!
∎
I hope you enjoyed reading this article. If you wish, you may view some of the other recent or popular things I have written, or subscribe to receive RSS updates. You can also add a comment, or share this article on Twitter or Facebook, below.
Website copyright © Aspektas 2009 - 2010 Valid XHTML 1.0 Strict and CSS 2.1