Algorithms and Trees
The gzip data compression tool was written in the early 1990s, and it’s still found in every Linux distribution. There are other compression tools available, but no matter which Linux computer you find yourself needing to work on, you’ll find gzip on it. So if you know how to use gzip, you’re good to go without the need to install anything.
gzip is an implementation of the DEFLATE algorithm which was invented—and patented—by Phil Katz of PKZIP fame. The DEFLATE algorithm improved on earlier compression algorithms which all operated on variations of a theme. The data to be compressed is scanned, and unique strings are identified and added to a binary tree.
The unique strings are allocated a unique ID token by virtue of their position in the tree. The tokens are used to replace the strings in the data and, because the tokens are smaller than the data they replaced, the file is compressed. Substituting the tokens for the original strings re-inflates the data back to its uncompressed state.
The DEFLATE algorithm combined ideas from two earlier compression methods, LZ77 compression and Huffman coding. LZ77 replaces a repeated string with a short reference to an earlier occurrence of the same string. Huffman coding allocates the shortest codes to the most frequently encountered symbols and longer codes to the least frequently encountered ones.
At the time of writing, the DEFLATE algorithm is nearly three decades old. Three decades ago data storage costs were high and transmission speeds were slow. Data compression was vitally important.
Data storage is much cheaper today, and transmission speeds are orders of magnitude faster. But we have so much more data to store, and the world over people are accessing cloud storage and streaming services. Data compression is still vitally important, even if all you’re doing is shrinking something that you need to upload or transmit, or you’re trying to claw back some space on a local hard drive.
The gzip Command
The bigger a file is, the better the compression can be. There are two reasons for this. The first is that a large file is likely to contain many repeated, identical sequences of bytes. The second is that the list of strings and tokens needs to be stored in the compressed file so that decompression can take place. With a very small file, that overhead can wipe out the benefits of the compression. But even with a fairly small file, there’s likely to be some reduction in size.
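The effect of that overhead can be seen with a quick experiment. This is a minimal sketch that assumes a Linux shell with gzip installed. The file names tiny.txt and big.txt and their contents are made up for illustration, and everything happens in a scratch directory.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# A tiny file: just two bytes of data.
printf 'hi' > tiny.txt
# A larger, very repetitive file: the same line 10,000 times.
yes 'the same line, over and over' | head -n 10000 > big.txt

# -c writes the compressed data to standard output and leaves the file alone,
# so wc -c can count the compressed bytes.
tiny_before=$(wc -c < tiny.txt)
tiny_after=$(gzip -c tiny.txt | wc -c)
big_before=$(wc -c < big.txt)
big_after=$(gzip -c big.txt | wc -c)

echo "tiny.txt: $tiny_before bytes -> $tiny_after bytes"
echo "big.txt:  $big_before bytes -> $big_after bytes"
```

The tiny file comes out larger than it went in, because the fixed gzip header and trailer outweigh its two bytes of data, while the repetitive file shrinks dramatically.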
Compressing a File
To compress (or zip) a file, all you need to do is pass the name of the file to the gzip command. We’ll check the original size of the file, compress it, and then check the size of the compressed file.
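A minimal sketch of that sequence, run in a scratch directory. The file created here is a generated stand-in for the spreadsheet, not a real ODS file, so the sizes that ls reports will differ from the figures quoted in this section.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Stand-in for the spreadsheet: a file of numbered lines (an assumption for this demo).
seq 1 5000 > calc-sheet.ods

# Check the original size.
ls -lh calc-sheet.ods

# Compress the file. gzip replaces it with calc-sheet.ods.gz.
gzip calc-sheet.ods

# List everything that starts with "calc-" and check the compressed size.
ls -lh calc-*
```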
The original file, a spreadsheet called “calc-sheet.ods”, is 11 KB, and the compressed file (also known as an archive file) is 9.3 KB. Note that the name of the archive file is the name of the original file with “.gz” appended to it.
The first use of the ls command targets a specific file, the spreadsheet. The second use of ls looks for all files beginning with “calc-” but it only finds the compressed file. That’s because, by default, gzip creates the archive file and deletes the original file.
That’s not an issue. If you need the original file you can retrieve it from the archive file. But if you prefer to retain the original file, you can use the -k (keep) option.
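As a sketch, again using a generated stand-in file in a scratch directory (the file contents are an assumption for the demo):

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Stand-in for the spreadsheet.
seq 1 5000 > calc-sheet.ods

# -k keeps the original file alongside the new archive.
gzip -k calc-sheet.ods

# Both calc-sheet.ods and calc-sheet.ods.gz should be listed.
ls -lh calc-*
```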
This time the original ODS file is retained.
Decompressing a File
To decompress (or unzip) a GZ archive file, use the -d (decompress) option. This will extract the compressed file from the archive and decompress it so that it is indistinguishable from the original file.
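A short sketch of a round trip, assuming a generated stand-in file in a scratch directory. The file reference-copy.ods is a hypothetical extra copy kept only so the result can be compared with the original.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Stand-in for the spreadsheet, plus a copy set aside for comparison.
seq 1 5000 > calc-sheet.ods
cp calc-sheet.ods reference-copy.ods

# Compress, then decompress with -d.
gzip calc-sheet.ods
gzip -d calc-sheet.ods.gz

# Only calc-sheet.ods should be listed; the archive is gone.
ls -lh calc-*

# cmp is silent and succeeds when the two files are byte-for-byte identical.
cmp calc-sheet.ods reference-copy.ods && echo "identical to the original"
```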
This time, we can see that gzip has deleted the archive file after extracting the original file. To retain the archive file, we need to use the -k (keep) option again, as well as the -d (decompress) option.
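Combining the two options might look like this. It is a sketch with a generated stand-in file in a scratch directory.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Stand-in for the spreadsheet, compressed so only the archive exists.
seq 1 5000 > calc-sheet.ods
gzip calc-sheet.ods

# -d decompresses, -k keeps the archive file in place.
gzip -d -k calc-sheet.ods.gz

# Both the extracted file and the archive should be listed.
ls -lh calc-*
```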
This time, gzip doesn’t delete the archive file.
Decompressing and Overwriting
If you try to extract a file in a directory where the original file (or a different file with the same name) exists, gzip will prompt you to choose whether to abandon the extraction or overwrite the existing file.
If you know in advance that you’re happy to have the file in the directory overwritten by the file from the archive, use the -f (force) option.
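A sketch of a forced overwrite. The setup is contrived for the demo: the stand-in file is compressed with -k, then the remaining original is replaced with different content so there is something to overwrite.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Stand-in for the spreadsheet, compressed while keeping the original.
seq 1 5000 > calc-sheet.ods
gzip -k calc-sheet.ods

# Replace the original with different content, so a file with the same name exists.
echo "changed" > calc-sheet.ods

# -f overwrites the existing file without prompting.
gzip -d -f calc-sheet.ods.gz

# The first lines show the content that came from the archive.
head -n 3 calc-sheet.ods
```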
The file is overwritten and you’re silently returned to the command line.
Compressing Directory Trees
The -r (recursive) option causes gzip to compress the files in an entire directory tree. But the result might not be what you expect.
Here’s the directory tree we’re going to use in this example. The directories each contain a text file.
Let’s use gzip on the directory tree and see what happens.
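A sketch of that experiment in a scratch directory. The top-level name level1 matches the archive name used later in this section. The nested directory names and the text file names are assumptions made for the demo.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Build a small directory tree with one text file in each directory.
mkdir -p level1/level2/level3
seq 1 2000 > level1/file1.txt
seq 1 2000 > level1/level2/file2.txt
seq 1 2000 > level1/level2/level3/file3.txt

# Recursively compress every file in the tree.
gzip -r level1

# List the files that now exist in the tree.
find level1 -type f | sort
```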
The result is that gzip has created an archive file for each text file in the directory structure. It didn’t create an archive of the entire directory tree. In fact, gzip can only put a single file in an archive.
We can create an archive file that contains a directory tree and all of its files, but we need to bring another command into play. The tar program is used to create archives of many files, but it doesn’t have its own compression routines. By using the appropriate options, though, we can have tar push the archive file through gzip. That way we get a single archive that holds multiple files and directories and is also compressed.
The tar options are:
c: Create an archive.
z: Push the files through gzip.
v: Verbose mode. Print in the terminal window what tar is up to.
f level1.tar.gz: Filename to use for the archive file.
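Put together, the options above give a command along these lines. This is a sketch in a scratch directory, and the nested directory and file names inside level1 are assumptions made for the demo.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Build a small directory tree with one text file in each directory.
mkdir -p level1/level2/level3
seq 1 2000 > level1/file1.txt
seq 1 2000 > level1/level2/file2.txt
seq 1 2000 > level1/level2/level3/file3.txt

# c: create, z: compress with gzip, v: verbose, f: archive file name.
tar -czvf level1.tar.gz level1

# One compressed archive holds the whole tree.
ls -lh level1.tar.gz

# t lists the contents of the archive without extracting it.
tar -tzf level1.tar.gz
```

Unlike gzip, tar leaves the original files and directories in place.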
This archives the directory tree structure and all files within the directory tree.
Getting Information About Archives
The -l (list) option provides some information about an archive file. It shows you the compressed and uncompressed sizes of the file in the archive, the compression ratio, and the name of the file.
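For example, with a generated stand-in file in a scratch directory (the file contents are an assumption, so the sizes and ratio will not match any figures elsewhere in this article):

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Stand-in for the spreadsheet, compressed into an archive.
seq 1 5000 > calc-sheet.ods
gzip calc-sheet.ods

# Show compressed size, uncompressed size, ratio, and the uncompressed name.
gzip -l calc-sheet.ods.gz
```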
You can check the integrity of an archive file with the -t (test) option.
If all is well, you’re silently returned to the command line. No news is good news.
If the archive is corrupt or not an archive you’re told about it.
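Both outcomes can be sketched as follows. The file not-an-archive.gz is a hypothetical plain text file given a “.gz” name purely to provoke the error, and the exit status is captured so a script can act on the result.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# A genuine archive made from a stand-in file.
seq 1 5000 > calc-sheet.ods
gzip calc-sheet.ods

# A plain text file that only pretends to be an archive.
echo "just plain text" > not-an-archive.gz

# A good archive: no output, exit status 0.
good_status=0
gzip -t calc-sheet.ods.gz || good_status=$?

# Not an archive: gzip prints an error and returns a non-zero exit status.
bad_status=0
gzip -t not-an-archive.gz || bad_status=$?

echo "good archive exit status: $good_status"
echo "bad archive exit status:  $bad_status"
```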
Speed Versus Compression
You can choose to prioritize the speed of creation of the archive or the degree of compression. You do this by providing a number as an option, from -1 through to -9. The -1 option gives the fastest speed at the expense of compression, and -9 gives the highest compression at the expense of speed.
Unless you provide one of these options, gzip uses -6.
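One way to compare the levels is to compress the same file at each setting and count the bytes. This sketch uses a generated file called sample.txt, which is an assumption for the demo, so the sizes it prints are specific to that file.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# A generated sample file to compress.
seq 1 20000 > sample.txt

# Compress to standard output at three levels and count the resulting bytes.
for level in 1 6 9; do
    size=$(gzip -c "-$level" sample.txt | wc -c)
    echo "level $level: $size bytes"
done

# With no level given, gzip behaves as if -6 had been used.
gzip -c sample.txt > default.gz
gzip -c -6 sample.txt > level6.gz
cmp default.gz level6.gz && echo "default output matches -6"
```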
With a file as small as this, we didn’t see any significant difference in speed of execution, but there was a small difference in compression.
Interestingly, there is no difference between using level 9 compression and level 6 compression. You can only wring so much compression out of any given file and, in this case, that limit was reached with level 6 compression. Cranking it up to 9 brought no further reduction in file size. With bigger files, the difference between level 6 and level 9 is likely to be more pronounced.
Compressed, Not Protected
Don’t mistake compression for encryption or any form of protection. Compressing a file doesn’t give it any security or enhanced privacy. Anyone with access to your file can use gzip to decompress it.
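A quick sketch makes the point. The file notes.txt and its contents are made up for the demo.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# A file with something you might consider private.
echo "account number: 12345" > notes.txt
gzip notes.txt

# No password or key is needed: -d decompresses and -c prints the result.
gzip -dc notes.txt.gz
```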