Mike Kohn!

Kunzip

Posted: May 2005
Updated: May 28, 2019

Introduction

Kunzip is a free library under the LGPL license for decompressing ZIP archives. Kunzip started out as a test program I wrote for decompressing .zip files from the command line. I eventually added hooks to make it a .so or DLL that can be used from Unix or Windows programs. My plan for the kunzip library was to make it as small (currently under 40k) and simple to use as possible so it can even be used from Visual Basic or such. It should be fully compatible with zip files created by InfoZip, PkZip, and WinZip. It can be downloaded from the link below, where you can also view the prototypes for the different languages that can call the kunzip library.

I've also recently been working on a small .gz compression library called libkohnz (in a separate repo below) that uses little memory and can speed up creating .gz files in situations where the redundancies are known at file creation time. There is also a utility in this repository called parse_gz which can take a .gz file and print out all the Huffman tables (if it's dynamic Huffman) along with all the length / distance codes in the file. It is useful for debugging deflate compression.

Related Projects @mikekohn.net

Graphics:

SSE Image Processing, GIF, TIFF, BMP/RLE, JPEG, AVI, kunzip, gif2avi, Ringtone Tools, yuv2rgb, RTSP

Downloads

kunzip-2015-02-09.tar.gz (Source Code)
kunzip-2015-02-09.zip (Windows binary DLL, kunzip.dll)

C/C++ prototypes


git clone https://github.com/mikeakohn/kunzip
git clone https://github.com/mikeakohn/libkohnz

Some Boring History About Kunzip

I originally started this code to be able to decompress (and maybe even compress) PNG graphic files. Since PNG uses the same compression algorithm as gzip and zip (WinZip, InfoZip, etc.), I decided to make it work with .zip files instead. Unfortunately, all the specs I had for this compression were for PNG, and Zip is slightly different, so I had some problems getting it working.

The first version of kunzip (which built the Huffman tables using arrays instead of trees and did a lot of fseeking) ran painfully slowly, but the latest version runs quite well. One interesting thing I found: when I did static Huffman decompression for TIFF, I used lookup tables instead of trees. That ran very fast, although I'm not sure what the speed difference would have been with trees. On this project, I started out doing all 3 sets of Huffman codes using lookup tables. After I saw how slowly it ran, I switched the length Huffman codes to a tree and the decompression ran at almost double the speed. When I changed the distance Huffman codes to trees, the decompression slowed down by a few seconds. So I left the length / distance Huffman codes as trees and kept the first Huffman codes as lookup tables. I may try to improve the speed even more later.
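To make the two approaches concrete, here is a minimal sketch of each. It is an illustration only, not the code kunzip uses: the 4-symbol code (A=0, B=10, C=110, D=111), the struct layouts, and the function names are all made up for the example, and the input is given as one bit per array element to keep the bit handling out of the way.

```c
#include <assert.h>
#include <stddef.h>

/* One node of a binary Huffman tree; symbol is -1 for internal nodes. */
struct huff_node
{
  int child[2];
  int symbol;
};

/* Hypothetical code for illustration: A=0, B=10, C=110, D=111. */
static const struct huff_node demo_tree[] =
{
  { { 1, 2 }, -1 },   /* 0: root */
  { { 0, 0 }, 'A' },  /* 1: leaf, code 0 */
  { { 3, 4 }, -1 },   /* 2: prefix 1 */
  { { 0, 0 }, 'B' },  /* 3: leaf, code 10 */
  { { 5, 6 }, -1 },   /* 4: prefix 11 */
  { { 0, 0 }, 'C' },  /* 5: leaf, code 110 */
  { { 0, 0 }, 'D' },  /* 6: leaf, code 111 */
};

/* Tree decoding: follow one branch per input bit until a leaf is hit.
   Returns the number of symbols written, or -1 on error. */
int decode_with_tree(const unsigned char *bits, size_t bit_count,
                     char *out, size_t out_size)
{
  size_t written = 0, i;
  int node = 0;

  assert(bits != NULL && out != NULL);

  for (i = 0; i < bit_count; i++)
  {
    node = demo_tree[node].child[bits[i] & 1];
    if (demo_tree[node].symbol >= 0)
    {
      if (written >= out_size) { return -1; }
      out[written++] = (char)demo_tree[node].symbol;
      node = 0;  /* Back to the root for the next code. */
    }
  }

  /* Ending anywhere but the root means the last code was cut short. */
  return node == 0 ? (int)written : -1;
}

/* Table entry: the decoded symbol and how many bits its code used. */
struct huff_entry
{
  char symbol;
  unsigned char length;
};

/* Indexed by the next 3 bits (the longest code), first bit highest. */
static const struct huff_entry demo_table[8] =
{
  { 'A', 1 }, { 'A', 1 }, { 'A', 1 }, { 'A', 1 },  /* 0xx */
  { 'B', 2 }, { 'B', 2 },                          /* 10x */
  { 'C', 3 },                                      /* 110 */
  { 'D', 3 },                                      /* 111 */
};

/* Table decoding: peek 3 bits, look up the symbol, consume its length.
   Returns the number of symbols written, or -1 on error. */
int decode_with_table(const unsigned char *bits, size_t bit_count,
                      char *out, size_t out_size)
{
  size_t written = 0, pos = 0;

  assert(bits != NULL && out != NULL);

  while (pos < bit_count)
  {
    unsigned index = 0;
    int k;

    /* Bits past the end of the input are treated as zeros. */
    for (k = 0; k < 3; k++)
    {
      index <<= 1;
      if (pos + (size_t)k < bit_count) { index |= bits[pos + k] & 1u; }
    }

    if ((size_t)demo_table[index].length > bit_count - pos) { return -1; }
    if (written >= out_size) { return -1; }

    out[written++] = demo_table[index].symbol;
    pos += demo_table[index].length;
  }

  return (int)written;
}
```

Feeding either function the ten bits 0 10 110 111 0 produces the same five symbols, ABCDA. The tree version does one step per bit, while the table version does one lookup per symbol but needs a table with an entry for every bit pattern up to the longest code length.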

This project was written to be compiled on Unix (Linux, FreeBSD, etc.) systems. To build a DLL for Windows, download the source code and type: make dll. The MinGW C compiler will need to be installed.

How DEFLATE works

DEFLATE compresses data by first running an LZ77 compression scheme on a chunk of binary data and then compressing the resulting LZ77 codes with Huffman coding. LZ77 compresses data in the following way: if, while writing a file, the compressor sees that a sequence of characters has already been written to the file, it will instead write out a distance code (the number of bytes to go backwards in the uncompressed data) and a length code (the number of bytes to copy from that point). So for example, take a string of text that looks like this:

i am not a number, i am a free man

LZ77 would compress that to: i am not a number, (19,5)a free man

Here 19 is the distance backwards and 5 is the number of matching characters to copy (the five characters "i am ", including the trailing space). In DEFLATE, a series of codes (values from 0 to 285) tells the decompressor whether to output a single byte (0 to 255), to stop decompressing (256), or to copy a run of bytes, where the code gives the length and a distance code follows (257 to 285). Some of the length codes are followed by extra bits that refine the length. These codes are Huffman encoded to make the compressed data more compact.
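The three kinds of codes can be sketched as a small decode loop. This is a simplified illustration under some assumptions, not kunzip's actual code: the function name expand_symbols is made up, the symbols are assumed to be already Huffman decoded, each length symbol is followed in the array by the raw distance in bytes rather than a real DEFLATE distance code, and only length symbols 257 to 264 (lengths 3 to 10, which take no extra bits) are handled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of kunzip): expand already
 * Huffman-decoded symbols into out[]. Assumption of this sketch: a
 * length symbol is followed in the array by the raw distance in bytes.
 * Returns the number of bytes written, or -1 on error.
 */
int expand_symbols(const int *symbols, size_t count,
                   unsigned char *out, size_t out_size)
{
  size_t pos = 0;
  size_t i = 0;

  assert(symbols != NULL && out != NULL);

  while (i < count)
  {
    int symbol = symbols[i++];

    if (symbol >= 0 && symbol <= 255)
    {
      /* Literal: output the byte unchanged. */
      if (pos >= out_size) { return -1; }
      out[pos++] = (unsigned char)symbol;
    }
    else if (symbol == 256)
    {
      /* End of block: stop decompressing. */
      return (int)pos;
    }
    else if (symbol >= 257 && symbol <= 264)
    {
      /* Symbols 257..264 mean lengths 3..10 with no extra bits. */
      size_t length = (size_t)(symbol - 254);
      size_t distance;

      if (i >= count || symbols[i] <= 0) { return -1; }
      distance = (size_t)symbols[i++];

      if (distance > pos || length > out_size - pos) { return -1; }

      /* Copy one byte at a time so that overlapping copies
         (distance smaller than length) repeat the data correctly. */
      while (length-- > 0)
      {
        out[pos] = out[pos - distance];
        pos++;
      }
    }
    else
    {
      /* 265..285 carry extra bits, which this sketch leaves out. */
      return -1;
    }
  }

  return -1;  /* Ran out of symbols without seeing 256. */
}
```

For the sentence above, the input would be the 19 literal bytes of "i am not a number, ", then 259 (length 5) followed by 19 (the distance), then the 10 literal bytes of "a free man", then 256. The function writes the full 34-byte sentence to out and returns 34.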