SharpZipLib Tar Header Checksum is Invalid

I recently had to write some code that opens a .tar.bz2 file, modifies the contents, and then repackages everything back into a .tar.bz2. The fun and games ended when I discovered this obscure showstopper: ICSharpCode.SharpZipLib.Tar.TarException: Header checksum is invalid.

The following is an account of how I got here and my thought process while solving the problem.

Stepping back a few days…

Surely, the kind of utility code I needed was available somewhere as a library, I thought to myself. Where permissible, it’s (almost) always more cost-efficient to go with an existing solution that’s been battle-tested rather than writing your own code from scratch.

A quick internet search led me to SharpZipLib. Nice. It’s open source and has an agreeable license.

The library provides a few ways to read from files. I wanted fine-grained control over each file in the tar. The documentation was sparse… but one can’t really complain about free software. After consulting some examples on the web, I wrote the following:

private TarEntry[] extractTar(String filename, bool compressed)
{
    Func<Stream, TarInputStream> makeStream = source =>
        new TarInputStream(compressed ? new BZip2InputStream(source) : source);

    var manifest = new Queue<TarEntry>();

    // open the tar file
    var sourceFile = File.Open(filename, FileMode.Open, FileAccess.Read, FileShare.Delete | FileShare.Read);

    using (var tarStream = makeStream(sourceFile))
    {
        // Extract the files from the tar data
        TarEntry tarEntry;
        while ((tarEntry = tarStream.GetNextEntry()) != null)
        {
            // Write out the current entry into a file.
            // (PipeInto is a helper extension method, not part of .NET or SharpZipLib.)
            var tempFileName = Path.Combine(workPath, tarEntry.Name);
            using (var tempFileWriter = File.Open(tempFileName, FileMode.Create, FileAccess.Write, FileShare.Delete | FileShare.Read))
            {
                tarStream.PipeInto(tempFileWriter, tarEntry.Size);
                tempFileWriter.Flush();

                // Add the entry into the manifest
                manifest.Enqueue(tarEntry);
            }
        }
    }

    return manifest.ToArray();
}

Worked like a charm.

Now that the read side is working, it’s time to implement the write side.

private void createTar(String filename, TarEntry[] manifest, bool compressed)
{
    Func<Stream, TarOutputStream> makeStream = sink =>
        new TarOutputStream(compressed ? new BZip2OutputStream(sink) : sink);

    // open the tar file
    using (var destFile = File.Open(filename, FileMode.Create, FileAccess.ReadWrite, FileShare.Delete | FileShare.Read))
    {
        using (var tarStream = makeStream(destFile))
        {
            // Write all the manifest files into the tar
            manifest.forEach(tarEntry =>
            {
                var tempFileName = Path.Combine(workPath, tarEntry.Name);

                // Read in the current entry from disk
                using (var tempFileReader = File.Open(tempFileName, FileMode.Open, FileAccess.Read, FileShare.Delete | FileShare.Read))
                {
                    // Update the size because it may have changed
                    tarEntry.Size = tempFileReader.Length;
                    tarStream.PutNextEntry(tarEntry);

                    // Write the file contents into the tar
                    tempFileReader.PipeInto(tarStream);
                }
            });
            tarStream.Flush();
        }
    }
}

Wrong!

There is a bug in the code above. I could read input files without problems. But the extract method threw the Header checksum is invalid exception every time I tried to read a file produced by my write method.

Back to Google. Somebody must have encountered this problem before.

Ok, a few hits in Google. Most are unrelated; one looks promising at first but is several years old and turns out to have no resolution.

I started testing variations on the code… at one point, while single-stepping, I noticed the tar header had a checksum of 0. What the hell? So I spent some time forcing the checksum to be calculated.

Still no luck. I’m not getting anywhere… Time to dig deeper.

I pull all the SharpZipLib source code into my project so I can single-step through it. Then I see that the header checksum is calculated automatically when the TarEntry is written to the stream. Ah, so I don’t need to calculate it.
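
For context, the checksum in question is simple. In the ustar header layout it is the sum of all 512 header bytes, with the 8-byte checksum field at offset 148 counted as ASCII spaces. Here is a minimal sketch of that rule; the class and method names are made up for illustration, and this is not SharpZipLib's implementation.

```csharp
using System;

// Illustrative only: these names are hypothetical, not part of SharpZipLib.
public static class TarChecksumSketch
{
    const int HeaderSize = 512;      // a tar header occupies one 512-byte block
    const int ChecksumOffset = 148;  // start of the checksum field (ustar layout)
    const int ChecksumLength = 8;    // width of the checksum field

    // Sums every header byte as an unsigned value, counting the checksum
    // field itself as eight ASCII spaces.
    public static int Compute(byte[] header)
    {
        if (header == null || header.Length < HeaderSize)
            throw new ArgumentException("A tar header is 512 bytes.", nameof(header));

        int sum = 0;
        for (int i = 0; i < HeaderSize; i++)
        {
            bool inChecksumField = i >= ChecksumOffset && i < ChecksumOffset + ChecksumLength;
            sum += inChecksumField ? (byte)' ' : header[i];
        }
        return sum;
    }
}
```

By this rule an all-zero header sums to 256 (eight spaces at 32 each), which is why a stored checksum of 0 never validates.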

The Breakthrough

I had noticed earlier that, upon extraction, one of the files — a small file of only 35 bytes — was corrupt. But not corrupt with random garbage. It contained the filename of the following file. Curious.

I had also read on Wikipedia that the tar format uses blocks of 512 bytes. Ok… so perhaps this little file needed to be padded out to a full block or something.
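
The arithmetic behind that padding is easy to sketch. The helper below is hypothetical (it is not part of SharpZipLib) and only shows how a data length rounds up to whole 512-byte blocks.

```csharp
using System;

// Illustrative only: these names are hypothetical, not part of SharpZipLib.
public static class TarBlockSketch
{
    public const int BlockSize = 512;

    // Number of bytes an entry's data occupies in the archive once it has
    // been rounded up to a whole number of blocks.
    public static long PaddedSize(long dataLength)
    {
        if (dataLength < 0)
            throw new ArgumentOutOfRangeException(nameof(dataLength));

        long blocks = (dataLength + BlockSize - 1) / BlockSize;
        return blocks * BlockSize;
    }

    // Number of zero bytes that follow the data to fill the last block.
    public static long PaddingBytes(long dataLength)
    {
        return PaddedSize(dataLength) - dataLength;
    }
}
```

For my 35-byte file, `TarBlockSketch.PaddedSize(35)` is 512 and `TarBlockSketch.PaddingBytes(35)` is 477, so the next header should only start after 477 bytes of zero padding.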

The next step is to set conditional breakpoints in PutNextEntry for when the file size is 35. I want to see what exactly is going on for this file.

Well, well. There appears to be some special-case code for when the buffer is smaller than the block size. So my 35-byte file is getting written into a special ‘assembly’ buffer. I keep stepping through the code so I can see when the special buffer gets written to the stream. The next file happens to be a 3-byte file. While stepping, I witness these three bytes being appended to the assembly buffer.

Wait… what?

The data for these two files is adjacent in the assembly buffer. But there should be a header record separating this data!

I start looking through the rest of TarOutputStream.cs to get a larger picture of what’s going on… when I stumble across this function (with a big warning):

/// <summary>
/// Close an entry. This method MUST be called for all file
/// entries that contain data. The reason is that we must
/// buffer data written to the stream in order to satisfy
/// the buffer's block based writes. Thus, there may be
/// data fragments still being assembled that must be written
/// to the output stream before this entry is closed and the
/// next entry written.
/// </summary>
public void CloseEntry()
{
	if (assemblyBufferLength > 0) {
		Array.Clear(assemblyBuffer, assemblyBufferLength, assemblyBuffer.Length - assemblyBufferLength);
		
		buffer.WriteBlock(assemblyBuffer);
		
		currBytes += assemblyBufferLength;
		assemblyBufferLength = 0;
	}
	
	if (currBytes < currSize) {
		string errorText = string.Format(
			"Entry closed at '{0}' before the '{1}' bytes specified in the header were written",
			currBytes, currSize);
		throw new TarException(errorText);
	}
}

TL;DR

Well, shit. All this trouble because I wasn’t calling CloseEntry. Why is there even a CloseEntry method? TarOutputStream should know that it needs to close an entry when either the next header is written or when the archive is closed.

I feel like an idiot. Anyhow, I added a CloseEntry call and the archives are now reading and writing with no problem. Solved.
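
For anyone hitting the same exception, here is a minimal sketch of a write loop with the call in place. It is not the exact code from my project: it assumes the SharpZipLib package is referenced, `TarWriterSketch`, `WriteTar`, and the parameter names are placeholders, it leaves out the BZip2 wrapping, and it uses the standard `Stream.CopyTo` instead of my `PipeInto` helper.

```csharp
using System.IO;
using ICSharpCode.SharpZipLib.Tar;

// Illustrative only: the class, method, and parameter names are placeholders.
public static class TarWriterSketch
{
    // Writes each named file found under workPath into a tar archive
    // on the destination stream.
    public static void WriteTar(Stream destination, string workPath, string[] entryNames)
    {
        // Disposing the TarOutputStream finishes the archive.
        using (var tarStream = new TarOutputStream(destination))
        {
            foreach (var name in entryNames)
            {
                using (var reader = File.OpenRead(Path.Combine(workPath, name)))
                {
                    // The header must carry the real size of the data that follows.
                    var entry = TarEntry.CreateTarEntry(name);
                    entry.Size = reader.Length;

                    tarStream.PutNextEntry(entry);
                    reader.CopyTo(tarStream);

                    // The call I was missing: pads out and writes any partial
                    // block still sitting in the assembly buffer before the
                    // next header goes out.
                    tarStream.CloseEntry();
                }
            }
        }
    }
}
```

The important part is the ordering inside the loop: PutNextEntry, then the data, then CloseEntry, once for every entry that carries data.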

Maybe I should have read the documentation thoroughly… how often do you read it all?