I’ve been playing with the new .PBF file format from OpenStreetMap, and thus far I’m fairly impressed. The new file format is documented here, and uses Google Protocol Buffers as the binary representation of the objects within the file. The overall file is essentially a sequence of objects written to a single data stream, with each element of the stream encoded using Protocol Buffers.
Here’s what I had to do to get a basic Java program up and running.
(1) Download the Google Protocol Buffers library and decompress.
(2) You will now need to build the Google Protocol Buffers compiler (protoc) in order to compile the .proto files for the OSM file format. To do this, cd into the directory where the protocol buffers library was decompressed, and build:
./configure
make
make install
Note that this will install Google’s libraries into your /usr/local directory. If you don’t want that, do what I did:
mkdir /Users/woody/protobuf
./configure --prefix=/Users/woody/protobuf
make
make install
(Full disclosure: I’m using MacOS X Lion.)
(3) Download the protocol buffer definitions for OSM.
(4) Compile them.
(Full disclosure: I downloaded the above files into ~/protobuf, created in step 2 above.) With the files there, compiling them took:
bin/protoc --java_out=. fileformat.proto
bin/protoc --java_out=. osmformat.proto
(5) Compile the descriptor.proto file, found at src/google/protobuf/descriptor.proto in the protobuf-2.4.1 directory created in step 1.
(Full disclosure: I copied this file from its location in the protobuf source kit into ~/protobuf, created in step 2.) I then compiled it with:
bin/protoc --java_out=. descriptor.proto
(6) Create a new Eclipse project. Copy the following into that project’s source folder:
(a) protobuf-2.4.1/java/src/main/java/*
(b) The product files created in steps (4) and (5) (~/protobuf/crosby…, ~/protobuf/com…)
(7) Test application
According to the description of the OpenStreetMap PBF file format, the file is encoded as a 4-byte length giving the size of the BlobHeader record, then the BlobHeader record itself (which contains the length of the Blob that follows), and then a Blob, which contains a stream that decodes into either a HeaderBlock or a PrimitiveBlock. The map data is contained in the PrimitiveBlocks, and there are multiple PrimitiveBlocks in a single file. So the file sort of looks like a sequence of:
| Length (4 bytes) |
| BlobHeader (encoded using Protocol Buffers) |
| Blob (encoded using Protocol Buffers) |
And the Blob object contains a block of data which is either stored raw or compressed as a zlib deflated stream, which can be inflated using the Java InflaterInputStream class.
And there are N of these things.
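To make the framing concrete before bringing Protocol Buffers into it, here is a minimal sketch of just the length prefix. It assumes the 4-byte length is big-endian (network byte order), which is what DataInputStream.readInt() reads. The class FramingSketch and its methods are hypothetical names of my own, and the payload is a stand-in string rather than a real BlobHeader. In a real file only the BlobHeader is prefixed this way; the size of the Blob comes from the BlobHeader itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any OSM library.
final class FramingSketch {

    // Writes one record: a 4-byte big-endian length followed by the payload.
    static byte[] frame(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        dos.writeInt(payload.length); // writeInt always writes big-endian
        dos.write(payload);
        dos.flush();
        return bos.toByteArray();
    }

    // Reads one record back: the length first, then exactly that many bytes.
    static byte[] readRecord(DataInputStream dis) throws IOException {
        int len = dis.readInt();
        byte[] buf = new byte[len];
        // readFully keeps reading until the buffer is full;
        // a plain read() may return fewer bytes than requested.
        dis.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "stand-in for BlobHeader bytes".getBytes(StandardCharsets.UTF_8);
        byte[] framed = frame(payload);
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(framed));
        System.out.println(new String(readRecord(dis), StandardCharsets.UTF_8));
    }
}
```

The same readRecord shape applies to the Blob, except that its length is taken from the BlobHeader rather than from a prefix in the stream.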
Given this, here is some sample code which I used to successfully deserialize the data from the stored file us-pacific.osm.pbf:
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

import crosby.binary.Fileformat.Blob;
import crosby.binary.Fileformat.BlobHeader;
import crosby.binary.Osmformat.HeaderBlock;
import crosby.binary.Osmformat.PrimitiveBlock;

public class Main
{
    /**
     * Reads us-pacific.osm.pbf from the working directory and prints
     * one line per block found in the file.
     */
    public static void main(String[] args)
    {
        try {
            FileInputStream fis = new FileInputStream("us-pacific.osm.pbf");
            DataInputStream dis = new DataInputStream(fis);
            for (;;) {
                // Simple end-of-file check; fine for a local file
                if (dis.available() <= 0) break;

                // 4-byte length of the BlobHeader, then the BlobHeader itself
                int len = dis.readInt();
                byte[] blobHeader = new byte[len];
                // readFully fills the whole buffer; read() may stop short
                dis.readFully(blobHeader);
                BlobHeader h = BlobHeader.parseFrom(blobHeader);

                // The BlobHeader gives the size of the Blob that follows
                byte[] blob = new byte[h.getDatasize()];
                dis.readFully(blob);
                Blob b = Blob.parseFrom(blob);

                // The Blob payload is either zlib-compressed or raw
                InputStream blobData;
                if (b.hasZlibData()) {
                    blobData = new InflaterInputStream(b.getZlibData().newInput());
                } else {
                    blobData = b.getRaw().newInput();
                }

                System.out.println("> " + h.getType());
                if (h.getType().equals("OSMHeader")) {
                    HeaderBlock hb = HeaderBlock.parseFrom(blobData);
                    System.out.println("hb: " + hb.getSource());
                } else if (h.getType().equals("OSMData")) {
                    PrimitiveBlock pb = PrimitiveBlock.parseFrom(blobData);
                    System.out.println("pb: " + pb.getGranularity());
                }
            }
            dis.close();
        }
        catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Note that we successfully parse the OSMHeader block and the PrimitiveBlock objects. (Each OSM file contains a header block and N self-contained primitive blocks.)
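The zlib branch above leans entirely on InflaterInputStream, so here is a small standalone sketch of what that class does, with no OSM data involved. It compresses a byte array with DeflaterOutputStream (which, with its default settings, writes the zlib format) and inflates it again. The class name InflateSketch and the sample string are mine, purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration; not part of any OSM library.
final class InflateSketch {

    // Compresses raw bytes into a zlib stream.
    static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DeflaterOutputStream out = new DeflaterOutputStream(bos);
        out.write(raw);
        out.close(); // closing finishes the compressed stream
        return bos.toByteArray();
    }

    // Inflates a zlib stream back into the original bytes.
    static byte[] inflate(byte[] compressed) throws IOException {
        InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed));
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        in.close();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "node node node way way relation".getBytes(StandardCharsets.UTF_8);
        byte[] packed = deflate(raw);
        byte[] unpacked = inflate(packed);
        System.out.println(new String(unpacked, StandardCharsets.UTF_8));
    }
}
```

In the parsing code there is no need to collect the inflated bytes like this, since parseFrom accepts the InputStream directly.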
I’m still sorting out how to handle the contents of a PrimitiveBlock; my goal is to eventually dump this data into my own database with my own database schema for further processing. But for now this gets one in the door to reading .pbf files.
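One small piece of that puzzle is the granularity value printed above. As I read the PBF format description, node coordinates are stored as integers, and a stored latitude converts to degrees as 0.000000001 * (lat_offset + granularity * lat), with longitude handled the same way using lon_offset. Here is a rough sketch of that arithmetic using plain parameters instead of the generated classes. GranularitySketch and toDegrees are hypothetical names of mine, and the sample values are made up.

```java
// Hypothetical helper for illustration; not part of any OSM library.
final class GranularitySketch {

    // Converts a stored integer coordinate to degrees.
    // granularity is in nanodegrees per unit; offset is in nanodegrees.
    static double toDegrees(long stored, int granularity, long offset) {
        return 1e-9 * (offset + (long) granularity * stored);
    }

    public static void main(String[] args) {
        // Made-up stored value, with granularity 100 and no offset
        double lat = toDegrees(377749000L, 100, 0L);
        System.out.println("lat: " + lat);
    }
}
```

In real code the granularity and the offsets would come from the PrimitiveBlock, and the stored values from the nodes inside it.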
I hope this helps someone out there…
As an aside I know there are more efficient ways to parse the file. This is just something to get off the ground with, with the proviso that the code is short and simple, and hopefully rather clear.