LZW

From Sega Retro

LZW[1] (Lempel-Ziv-Welch) is a lossless data compression algorithm. It was developed by Terry Welch in 1984 as an improved version of the LZ77 and LZ78 dictionary coding algorithms developed by Abraham Lempel and Jacob Ziv.

Description of the algorithm

The key insight of the method is that it is possible to automatically build a dictionary of previously seen strings in the text being compressed. The dictionary does not have to be transmitted with the compressed text, since the decompressor can build it the same way the compressor does, and if coded correctly, will have exactly the same strings that the compressor dictionary had at the same point in the text.

The dictionary starts off with 256 entries, one for each possible character (single byte string). Every time a string not already in the dictionary is seen, a longer string consisting of that string appended with the single character following it in the text, is stored in the dictionary.

The output consists of integer indices into the dictionary. These initially are 9 bits each, and as the dictionary grows, can increase to up to 16 bits. A special symbol is reserved for "flush the dictionary" which takes the dictionary back to the original 256 entries, and 9 bit indices. This is useful if compressing a text which has variable characteristics, since a dictionary of early material is not of much use later in the text.

This use of variably increasing index sizes is one of Welch's contributions. Another was to specify an efficient data structure to store the dictionary.

Simple example of dictionary-based compression algorithm

In general, dictionary-based compression replaces phrases with tokens. If the number of bits in the token is less than the number of bits in the phrase, compression will occur. Uncompressed text:

"I am dumb and because I am dumb, I can't even tell you that I am dumb."

Compressed text:

"$1 and because $1, I can't even tell you that $1. $1=[I am dumb]"

This is very far from efficient, but it illustrates the compression through phrase replacement.

Uses

The method became moderately widely used in the program "compress" which became a more or less standard utility in Unix systems circa 1986. (It has since disappeared from many for both legal and technical reasons.) Several other popular compression utilities also used the method, or closely related ones.

It became very widely used after it became part of the GIF image format in 1987. It may also (optionally) be used in TIFF files.

A variant of this algorithm is widely used in the *.zip save format which is used in OpenOffice.org save formates, PDF files, and NTFS drives.

LZW compression provided a better compression ratio, in most applications, than any well-known method available up to that time. It became the first widely used general purpose data compression method on computers. On large English texts, it typically compressed to about half the original size. Other kinds of data were also quite usefully compressed in many cases.

Patent issues

Various patents have been issued in the USA and other countries for LZW and similar algorithms. LZ77 was covered by US Patent No. 4,464,650 by Lempel, Ziv, Cohn, and Eastman, assigned to Sperry Corporation, later Unisys Corporation, filed on August 1, 1981, and presumably now expired. Two US patents were issued for the LZW algorithm: US patent No. 4,814,746 by Victor S. Miller and Mark N. Wegman and assigned to IBM, originally filed on June 1, 1983, and US Patent No. 4,558,302 by Welch, assigned to Sperry Corporation, later Unisys Corporation, filed on June 20, 1983.

US Patent 4,558,302 is the one that has caused the most controversy. Unisys at one time granted royalty-free patent licenses to developers of free software and freely available proprietary software; the company terminated those licenses in August 1999. Many legal experts have concluded that the patent does not cover devices that can only uncompress LZW data and cannot compress it; for this reason, the popular Gzip program can read .Z files but cannot write them.

It was reported in Debian Weekly News based on a comp.compression thread, that the Unisys patent in the USA expired on December 20, 2002—17 years and 10 days after it was granted. Most other sources claim the patent expired in June 2003, 20 years after it was filed, because 35 USC §154(c)(1) specifies that patents subsisting as of six months after the enactment of the Uruguay Round Agreements Act last for the greater of 17 years after grant and 20 years after filing.

According to a statement on Unisys' web site, counterpart patents on LZW in the United Kingdom, France, Germany, Italy, and Japan have expired in June 2004, and the Canadian patent expired on July 7, 2004.

IBM's US patent expired on August 11, 2006.

Lempel-Ziv-Welch vs. Ziv-Lempel-Welch

Although the LZW acronym obviously refers to the inventors as Lempel, Ziv and Welch, some people claim the intellectual property rights go to Ziv first, so the method must be called the Ziv-Lempel-Welch algorithm, and not the Lempel-Ziv-Welch algorithm.

External links

References