To Compress Or Not, Part III

Read the rest of this series:

Back in Part II, I ran some timing tests to compare reading and writing compressed and uncompressed files. One factor I did not account for was that the number of files in each test could affect the results.

Here's what led me down this road, from Part II:

I think the Read test here is interesting. At first glance it seems like using compression causes less CPU load in some cases. I was unsure what to think about this result, then I considered something that I didn't mention before, which is the number of files in each test. The AVI file is only 1 file, while Program Files has 348 files and Documents contains 1892 files. It seems to correlate that more files in the test has an effect on the CPU load when combined with compression. This is something that I will look into in the future.

The future is now.

For every file that's copied, there is work to do beyond moving the file data itself. Meta information about the file (filename, creation date, etc.) needs to be queried, the file needs to be opened and closed, and the data must be located on the disk (seek time). This meta information is stored on different parts of the disk than the data, so with many files the disk can spend a lot of time seeking, which adds to the test time.

I performed this test using the same method described in Part II, except instead of copying a complete directory tree, I first used the 'tar' command to collect the entire tree into one file ('tar' does not compress the files), then used the .tar files for the test. The Documents folder originally contained 1892 files, while the Program Files folder contained 348 files.
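For readers who want to reproduce the packing step without the 'tar' command, here is a minimal sketch using Python's standard tarfile module. It is not what I used for the tests; the function name pack_tree and its arguments are hypothetical, and it assumes the .tar file is written somewhere outside the folder being packed.

```python
import os
import tarfile

def pack_tree(src_dir, tar_path):
    """Collect src_dir into one uncompressed .tar file.

    Returns the number of regular files stored in the archive.
    tar_path is assumed to live outside src_dir.
    """
    # Mode "w" writes a plain tar with no compression
    # (unlike "w:gz" or "w:bz2").
    top_name = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(tar_path, "w") as tar:
        tar.add(src_dir, arcname=top_name)

    # Re-open the archive and count regular files only (not directories).
    with tarfile.open(tar_path, "r") as tar:
        return sum(1 for member in tar.getmembers() if member.isfile())
```

The returned count is a quick sanity check that the archive holds the same number of files as the original folder.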

The following charts show that this does indeed make a big difference. The purple bars measure the times for the single .tar file, compared to the original numbers from the first test.

The large difference in time between them really surprised me, and highlights how much time the system spends searching for files, reading their meta information, and opening and closing them. Because the .tar file only requires that to happen once, the system can spend the rest of its time doing the actual work.
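The comparison can be sketched roughly as follows. This is only an illustration, not the method from Part II: the helper names time_tree_copy and time_file_copy are made up, it measures wall-clock time rather than CPU time, and the operating system's file cache will skew the numbers unless each run starts cold.

```python
import os
import shutil
import time

def time_tree_copy(src_dir, dst_dir):
    """Copy a whole directory tree.

    Returns (elapsed seconds, number of files copied).
    dst_dir must not exist yet.
    """
    start = time.perf_counter()
    shutil.copytree(src_dir, dst_dir)
    elapsed = time.perf_counter() - start
    # Count the files after the clock has stopped.
    count = sum(len(files) for _, _, files in os.walk(dst_dir))
    return elapsed, count

def time_file_copy(src_file, dst_file):
    """Copy one file (for example a .tar); return elapsed seconds."""
    start = time.perf_counter()
    shutil.copy2(src_file, dst_file)
    return time.perf_counter() - start
```

Timing the same data once as a tree and once as a single .tar gives the two numbers to compare. Several runs are needed before reading anything into the difference.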

The time spent on the CPU has also been affected significantly:

In all cases, copying to/from a compressed drive took more time than copying to/from a non-compressed one, which upholds the results found in Part II. These tests aren't so much about compression as they are about the other kinds of overhead involved in moving around a large batch of files.

Does this resolve the discrepancy seen in the CPU time read test? I think the jury is still out. The read test with the .tar files is consistent with what one would expect for a single file, but it still does not explain the lower CPU usage seen when reading Program Files from the compressed disk (although now it's close enough to be within the margin of error, with a difference of 0.05).

It appears, then, after reading all three parts of the article, that for most general purposes, it is better not to compress your NTFS drives.

That seems to be the conclusion. However, I've wanted to redo the tests, as I found some inconsistencies in my test methods. One of the large files I used was encrypted and thus would not compress at all. I didn't notice it at the time.

P.S. Your account name has been changed to "someguy".

The problem with your benchmark on read speed is that it also involves a disk write. In practical usage, data read from a compressed drive is stored in memory for use by applications, which gives you a performance increase.