Parsing the new OpenStreetMap PBF file format.

I’ve been playing with the new .pbf file format that OpenStreetMap uses for encoding its data files, and thus far I’m fairly impressed. The new file format is documented here, and uses Google Protocol Buffers as the binary representation of the objects within the file. The overall file is essentially a sequence of objects written to a single data stream, with each element of the stream encoded using the Protocol Buffers wire format.

Here’s what I had to do to get a basic Java program up and running.

(1) Download the Google Protocol Buffers library and decompress.

(2) You will now need to build the protocol buffer compiler (protoc) in order to compile the .proto files for the OSM format. To do this, cd into the directory where the protocol buffers library was decompressed, and build it:

./configure
make
make install

Note that this will install Google’s libraries into your /usr/local directory. If you don’t want that, do what I did:

mkdir /Users/woody/protobuf
./configure --prefix=/Users/woody/protobuf
make
make install

(Full disclosure: I’m using MacOS X Lion.)

(3) Download the protocol buffer definitions for OSM.

(4) Compile them.

(Full disclosure: I downloaded the above files into ~/protobuf, created in step 2 above.) With the files there, compiling them took:

bin/protoc --java_out=. fileformat.proto
bin/protoc --java_out=. osmformat.proto

(5) Compile the descriptor.proto file, found at src/google/protobuf/descriptor.proto in the protobuf-2.4.1 directory created in step 1.

(Full disclosure: I copied this file from its location in the protobuf source kit into ~/protobuf, created in step 2.) I then compiled it with:

bin/protoc --java_out=. descriptor.proto

(6) Create a new Eclipse project. Into that project’s source folder, copy the following:

(a) protobuf-2.4.1/java/src/main/java/*
(b) The generated files created in steps (4) and (5) (~/protobuf/crosby…, ~/protobuf/com…)

(7) Test application

According to the description of the OpenStreetMap PBF file format, the file is encoded as a 4-byte length (in network byte order) giving the size of the BlobHeader record, then the BlobHeader record itself (which contains the size of the Blob that follows), and then a Blob containing a stream which decodes into a HeaderBlock or a PrimitiveBlock. The map data is contained in the PrimitiveBlock, and there are multiple PrimitiveBlocks in a single file. So the file sort of looks like a sequence of:

Length (4 bytes)
BlobHeader (encoded using Protocol Buffers)
Blob (encoded using Protocol Buffers)
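
To make that framing concrete, here is a minimal sketch of just the splitting step, with no Protocol Buffers decoding at all. The class PbfFraming, its Frame holder and the blobSize callback are names I made up for illustration; in the real program the callback’s job would be done by parsing the BlobHeader and asking it for its datasize. The toy stream in main is not real PBF data: it uses a one-byte "header" whose value is the blob size, purely so the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

final class PbfFraming {

    /** One raw frame: the undecoded BlobHeader bytes and the undecoded Blob bytes. */
    static final class Frame {
        final byte[] header;
        final byte[] blob;

        Frame(byte[] header, byte[] blob) {
            this.header = header;
            this.blob = blob;
        }
    }

    /**
     * Splits a stream into (header, blob) frames.
     * blobSize is a hypothetical stand-in for decoding the BlobHeader
     * and reading its datasize field.
     */
    static List<Frame> readFrames(InputStream in, ToIntFunction<byte[]> blobSize)
            throws IOException {
        DataInputStream dis = new DataInputStream(in);
        List<Frame> frames = new ArrayList<>();
        while (true) {
            int headerLen;
            try {
                // readInt reads 4 bytes, most significant byte first.
                headerLen = dis.readInt();
            } catch (EOFException endOfStream) {
                break; // no further length prefix: we are done
            }
            byte[] header = new byte[headerLen];
            dis.readFully(header); // blocks until the whole array is filled
            byte[] blob = new byte[blobSize.applyAsInt(header)];
            dis.readFully(blob);
            frames.add(new Frame(header, blob));
        }
        return frames;
    }

    public static void main(String[] args) throws IOException {
        // Toy stream, NOT real PBF: each 1-byte "header" holds the blob size.
        byte[] data = {0, 0, 0, 1, 3, 10, 11, 12, 0, 0, 0, 1, 0};
        List<Frame> frames = readFrames(new ByteArrayInputStream(data), h -> h[0]);
        for (Frame f : frames) {
            System.out.println("header=" + f.header.length
                    + " bytes, blob=" + f.blob.length + " bytes");
        }
    }
}
```

The sketch uses readFully rather than read because read is allowed to return before the buffer is full, which would silently corrupt the next frame.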

The Blob object contains a block of data that is stored either as a zlib-deflated stream, which can be inflated using Java’s InflaterInputStream class, or as raw data.

And there are N of these length/BlobHeader/Blob triples in a file.
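
The zlib case can be tried in isolation, without any of the generated classes. The following is only a sketch: BlobPayload and its two methods are hypothetical names, and in the real program the two byte arrays would come from the Blob’s raw and zlib data fields. The deflate helper exists purely to manufacture compressed input for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class BlobPayload {

    /** Returns the uncompressed payload; pass zlibData == null for a raw blob. */
    static byte[] payload(byte[] raw, byte[] zlibData) throws IOException {
        if (zlibData == null) {
            return raw;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(zlibData))) {
            byte[] buf = new byte[8192];
            int n;
            // Copy until the inflater reports the end of the compressed stream.
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    /** Demo helper only: compresses data into zlib format. */
    static byte[] deflate(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "pretend this is a serialized PrimitiveBlock"
                .getBytes(StandardCharsets.UTF_8);
        byte[] inflated = payload(null, deflate(original));
        System.out.println(new String(inflated, StandardCharsets.UTF_8));
    }
}
```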

Given this, here is some sample code which I used to successfully deserialize the data from the stored file us-pacific.osm.pbf:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

import crosby.binary.Fileformat.Blob;
import crosby.binary.Fileformat.BlobHeader;
import crosby.binary.Osmformat.HeaderBlock;
import crosby.binary.Osmformat.PrimitiveBlock;

public class Main
{

	/**
	 * Reads us-pacific.osm.pbf and prints one line per blob.
	 *
	 * @param args unused
	 */
	public static void main(String[] args)
	{
		// try-with-resources closes the file even if parsing fails part-way.
		try (DataInputStream dis = new DataInputStream(
				new FileInputStream("us-pacific.osm.pbf"))) {

			for (;;) {
				// 4-byte big-endian length of the BlobHeader. Hitting EOF
				// here means the last blob has been consumed.
				int len;
				try {
					len = dis.readInt();
				} catch (EOFException eof) {
					break;
				}

				// readFully blocks until the whole buffer is filled;
				// plain read() may return fewer bytes than requested.
				byte[] blobHeader = new byte[len];
				dis.readFully(blobHeader);
				BlobHeader h = BlobHeader.parseFrom(blobHeader);

				// The header's datasize is the size of the Blob that follows.
				byte[] blob = new byte[h.getDatasize()];
				dis.readFully(blob);
				Blob b = Blob.parseFrom(blob);

				// The payload is either zlib-compressed or stored raw.
				InputStream blobData;
				if (b.hasZlibData()) {
					blobData = new InflaterInputStream(b.getZlibData().newInput());
				} else {
					blobData = b.getRaw().newInput();
				}
				System.out.println("> " + h.getType());
				if (h.getType().equals("OSMHeader")) {
					HeaderBlock hb = HeaderBlock.parseFrom(blobData);
					System.out.println("hb: " + hb.getSource());
				} else if (h.getType().equals("OSMData")) {
					PrimitiveBlock pb = PrimitiveBlock.parseFrom(blobData);
					System.out.println("pb: " + pb.getGranularity());
				}
			}
		}
		catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

Note that we successfully parse the OSMHeader block and the PrimitiveBlock objects. (Each OSM file contains a header block and N self-contained primitive blocks.)

I’m still sorting out how to handle the contents of a PrimitiveBlock; my goal is to eventually dump this data into my own database with my own database schema for further processing. But for now this gets one in the door to reading .pbf files.
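
As a starting point for that next step: as far as I can tell from the format description, nodes may be stored in a DenseNodes message rather than as individual Node messages. There the id, lat and lon arrays are delta coded, and a stored coordinate is converted to degrees as 1e-9 * (offset + granularity * value). Below is a small sketch of that arithmetic on plain arrays. DenseNodeDecoder and its method names are mine, the sample numbers are made up, and I am assuming the inputs would come from the DenseNodes id/lat/lon fields and from the PrimitiveBlock’s granularity, lat_offset and lon_offset.

```java
final class DenseNodeDecoder {

    /** Undoes delta coding: each stored value is the difference from the previous one. */
    static long[] undelta(long[] deltas) {
        long[] out = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            out[i] = running;
        }
        return out;
    }

    /** Converts a stored coordinate to degrees: 1e-9 * (offset + granularity * value). */
    static double toDegrees(long value, long offset, int granularity) {
        return 1e-9 * (offset + (long) granularity * value);
    }

    public static void main(String[] args) {
        // Made-up sample values, not taken from a real file.
        long[] ids = undelta(new long[] {1000, 1, 1});
        long[] lats = undelta(new long[] {515000000L, 1000, -500});
        long[] lons = undelta(new long[] {-1000000L, 200, 300});
        int granularity = 100; // the default stated in the format description
        for (int i = 0; i < ids.length; i++) {
            System.out.println("node " + ids[i]
                    + " lat=" + toDegrees(lats[i], 0, granularity)
                    + " lon=" + toDegrees(lons[i], 0, granularity));
        }
    }
}
```

If this reading is right, a PrimitiveGroup can report zero plain nodes while its dense section still holds plenty of them, so the length of the dense id array is the number to look at.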

I hope this helps someone out there…

As an aside, I know there are more efficient ways to parse the file. This is just something to get off the ground with; the code is short and simple, and hopefully rather clear.

5 thoughts on “Parsing the new OpenStreetMap PBF file format.”

  1. Hi,

    I followed your tutorial.
    First off, I have to say thanks, but there is a problem getting node information with your solution (I’m still not sure whether the problem is my code, the .proto files, or Google’s code).
    Take a look at my code:

    public static void main(String[] a)
    {
        try
        {
            FileInputStream fis = new FileInputStream("c:\\south_yorkshire.osm.pbf");
            DataInputStream dis = new DataInputStream(fis);

            for (;;)
            {
                if (dis.available() <= 0) break;

                // Read one BlobHeader/Blob pair, as in the article's loop.
                int len = dis.readInt();
                byte[] blobHeader = new byte[len];
                dis.readFully(blobHeader);
                BlobHeader h = BlobHeader.parseFrom(blobHeader);
                byte[] blob = new byte[h.getDatasize()];
                dis.readFully(blob);
                Blob b = Blob.parseFrom(blob);

                InputStream blobData;
                if (b.hasZlibData()) {
                    blobData = new InflaterInputStream(b.getZlibData().newInput());
                } else {
                    blobData = b.getRaw().newInput();
                }
                System.out.println("> " + h.getType());

                if (h.getType().equals("OSMHeader"))
                {
                    HeaderBlock hb = HeaderBlock.parseFrom(blobData);

                    System.out.println("hb: " + hb.getSource());
                } else if (h.getType().equals("OSMData"))
                {
                    PrimitiveBlock pb = PrimitiveBlock.parseFrom(blobData);
                    System.out.println("pb: " + pb.getGranularity());

                    //System.out.println(pb.getPrimitivegroupCount());
                    List<Osmformat.PrimitiveGroup> pgs = pb.getPrimitivegroupList();
                    System.out.println("primitive group");
                    for (int i = 0; i < pgs.size(); i++)
                    {
                        System.out.println("ways :" + pgs.get(i).getWaysCount());
                        System.out.println("nodes :" + pgs.get(i).getNodesCount());
                        System.out.println("changes :" + pgs.get(i).getChangesetsCount());
                        System.out.println("relations :" + pgs.get(i).getRelationsCount());
                        Osmformat.DenseNodes dns = pgs.get(i).getDense();
                        System.out.println("denses :" + dns.getKeysValsCount());

                        //System.out.println(pgs.get(i).getWaysCount());
                        List<Osmformat.Node> nodes = pgs.get(i).getNodesList();

                        System.out.println(nodes.size());
                    }
                }
            }
            fis.close();
        }
        catch (Exception ex)
        {
            ex.printStackTrace();
        }
    }

    My node count is always zero, but if I convert the PBF to XML with osmconvert I can see loads of nodes!

    By the way, when you copy and paste the Google Java source into your project, you don’t need to compile descriptor.proto again; it’s already available in the Google source code.

  2. It’s probably not Google’s code. 🙂

    It is worth verifying that the Fileformat.java and Osmformat.java files were generated correctly. And it is worth compiling the descriptor.proto file; I don’t exactly remember why (it’s been several months) but I remember it fixed some problems I was having.

  3. How can I verify Fileformat.java and Osmformat.java? They are very large, and I don’t have any other reference to double-check the output against; all I can do is rely on Google’s output,

    unless you give me your output!

  4. I downloaded the osmosis.jar referenced from the OpenStreetMap Wiki, and before I parse PBF files my application transforms them into plain old *.osm files, which are pure XML. Parsing is much easier that way; however, this approach requires a lot more hard disk space.
