It’s “weird bugs” day here on Tiny Toons.
And the “weird bug” I was encountering came from using Apache Commons Compress’s BZip2CompressorInputStream class to decompress an OpenStreetMap Planet file on the fly while parsing it. I kept getting an unbalanced XML exception.
To make a long story short, the bzip2 file is a multi-stream file: so that the compression can run in parallel, the original file is broken into pieces, each piece is compressed on its own, and the output is assembled by concatenating the compressed pieces, with each piece basically being a complete BZip2 file. A decompressor that stops at the end of the first piece hands the XML parser a truncated document, which is where the unbalanced XML exception comes from.
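To make that layout concrete, here is a minimal sketch using only the JDK. The JDK ships no bzip2 codec, so it swaps in zlib streams from java.util.zip as a stand-in; the class and method names (ConcatenatedStreamDemo, compress, concat, decompressAll) are made up for illustration. It builds a "file" out of independently compressed pieces and shows that a decoder has to start over on the leftover bytes each time it reaches an end-of-stream marker.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class: zlib stands in for bzip2 here.
class ConcatenatedStreamDemo {

    /** Compress one piece into a complete, standalone zlib stream. */
    static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Glue two compressed pieces together, as a parallel compressor would. */
    static byte[] concat(byte[] a, byte[] b) {
        byte[] joined = new byte[a.length + b.length];
        System.arraycopy(a, 0, joined, 0, a.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return joined;
    }

    /** Decode every stream in the input, restarting after each end-of-stream. */
    static byte[] decompressAll(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        Inflater inflater = new Inflater();
        int offset = 0;
        try {
            while (offset < input.length) {
                // Restart the decoder on whatever input is still unread.
                inflater.reset();
                inflater.setInput(input, offset, input.length - offset);
                while (!inflater.finished()) {
                    int n = inflater.inflate(buf);
                    out.write(buf, 0, n);
                    if (n == 0 && !inflater.finished()) {
                        throw new DataFormatException("truncated or unsupported stream");
                    }
                }
                // getRemaining() counts the input bytes this stream did not consume.
                offset = input.length - inflater.getRemaining();
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt input", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] first = compress("<osm><node/>".getBytes(StandardCharsets.UTF_8));
        byte[] second = compress("</osm>".getBytes(StandardCharsets.UTF_8));
        byte[] all = decompressAll(concat(first, second));
        System.out.println(new String(all, StandardCharsets.UTF_8));
    }
}
```

Drop the outer loop and the decoder stops after the first piece, which is the same symptom as the truncated XML described above.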
The best solution, of course, is to add multi-stream support to the BZip2CompressorInputStream class itself. But after spending an hour hacking at the class, I came up with a simpler solution: a wrapper input stream that restarts the decompressor whenever the BZip2 decompressor returns EOF while there is still data left in the compressed input stream.
Here’s that class:
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.commons.compress.compressors.CompressorInputStream;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    /**
     * Handle multistream BZip2 files.
     */
    public class MultiStreamBZip2InputStream extends CompressorInputStream {
        private final InputStream fInputStream;
        private BZip2CompressorInputStream fBZip2;

        public MultiStreamBZip2InputStream(InputStream in) throws IOException {
            fInputStream = in;
            fBZip2 = new BZip2CompressorInputStream(in);
        }

        @Override
        public int read() throws IOException {
            int ch = fBZip2.read();
            /*
             * If this is a multistream file, there will be more data that
             * follows that is a valid compressor input stream. Restart the
             * decompressor engine on the new segment of the data. Looping
             * (rather than a single check) also skips over an empty segment.
             * Note that available() is only an estimate; it behaves well for
             * a FileInputStream, which is what we wrap here.
             */
            while (ch == -1 && fInputStream.available() > 0) {
                // Make use of the fact that if we hit EOF, the data for
                // the old compressor was deleted already, so we don't need
                // to close.
                fBZip2 = new BZip2CompressorInputStream(fInputStream);
                ch = fBZip2.read();
            }
            return ch;
        }

        /**
         * Read the data from read(). This makes sure we funnel through read so
         * we can do our multistream magic.
         */
        @Override
        public int read(byte[] dest, int off, int len) throws IOException {
            if ((off < 0) || (len < 0) || (len > dest.length - off)) {
                throw new IndexOutOfBoundsException();
            }
            // A zero-length request must not consume (and lose) a byte.
            if (len == 0) return 0;

            int c = read();
            if (c == -1) return -1;
            dest[off++] = (byte) c;

            int i = 1;
            while (i < len) {
                c = read();
                if (c == -1) break;
                dest[off++] = (byte) c;
                ++i;
            }
            return i;
        }

        @Override
        public void close() throws IOException {
            fBZip2.close();
            fInputStream.close();
        }
    }
Wrapping our FileInputStream object in one of these, and feeding this to the SAX XML parser, seemed to do the trick.
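For what that wiring might look like, here is a rough sketch rather than the actual parsing code: a tiny SAX handler that just counts elements from whatever InputStream it is given. ElementCounter and countElements are hypothetical names, and the file name in the comment is a placeholder. The demo feeds it an in-memory document so that it runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: counts start tags in an XML stream using SAX.
class ElementCounter {

    static int countElements(InputStream in) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                count[0]++;
            }
        };
        try {
            // SAX pulls bytes from the stream as it parses, so the
            // decompression happens on the fly.
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        // With the real data, the stream would be built along these lines
        // (placeholder file name):
        //   new MultiStreamBZip2InputStream(new FileInputStream("planet.osm.bz2"))
        InputStream in = new ByteArrayInputStream(
                "<osm><node/><node/></osm>".getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in));
    }
}
```

The parser only ever sees an InputStream, so it neither knows nor cares that the decompressor underneath is being restarted between segments.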
Today I didn’t actually get any work done. Instead, it was spent looking for weird and unusual bugs and banging my head against them.
Meh.