I had what I thought was a pretty straightforward use case for OpenStreetMap data:
I want all bridges in the US that are mapped in OpenStreetMap in a PostGIS database.
There are about 125,000 of them, for now loosely defined as ‘ways that have the ‘bridge’ tag’. So on the scale of OpenStreetMap data it’s a really small subset. In terms of the tools and processes needed, the task seems easy enough, and as long as you are satisfied with a one-off solution, it really is. You would need only four things:
- A planet file
- A boundary polygon for the United States
- A PostGIS database loaded with the osmosis snapshot schema and the linestring extension
- osmosis, the OpenStreetMap ETL swiss army tool.
That, and a single well-placed osmosis command:
bzcat planet-latest.osm.bz2 | \
  osmosis --rx - \
  --bp us.poly \
  --tf accept-ways bridge=* \
  --tf reject-relations \
  --used-node \
  --wp database=bridges user=osm password=osm
This will decompress the planet file and pipe the output to osmosis. Osmosis’s --read-xml task consumes the XML stream and passes it to a --bounding-polygon task that clips the data using the US bounding polygon, then to a couple of --tag-filter tasks that throw out all relations and all ways except those tagged ‘bridge=*’ (there’s a negligible number of ways tagged bridge=no, but catching all the different ways of tagging a ‘true’ value here is more work than it’s worth, if you ask me), then to a --used-node task that throws out all the nodes except those used by the ways we are keeping, and finally to a --write-pgsql task that writes all the objects to the PostGIS database. (Osmosis can be overwhelming at first with its plethora of tasks and arguments, but if you break it down it’s really quite straightforward. There’s also a graphical wrapper around osmosis called OSMembrane that may help to make this tool easier to understand and master.)
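For readers new to osmosis, the same pipeline can be spelled out with the long task names. This is only a sketch: the extract_bridges wrapper and its DRY_RUN switch are hypothetical helpers added for illustration, not part of osmosis, and the file names and credentials are the placeholders used above. By default it only prints the command it would run.

```shell
#!/bin/sh
# Sketch: long-form version of the one-off extract.
# extract_bridges and DRY_RUN are hypothetical helpers for illustration only.

extract_bridges() {
    planet=$1   # e.g. planet-latest.osm.bz2
    poly=$2     # e.g. us.poly

    # Build the osmosis argument list. bridge=* is quoted so the shell
    # never tries to expand it as a glob pattern.
    set -- \
        --read-xml file=- \
        --bounding-polygon file="$poly" \
        --tag-filter accept-ways 'bridge=*' \
        --tag-filter reject-relations \
        --used-node \
        --write-pgsql database=bridges user=osm password=osm

    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Default: only print the command that would be run.
        echo "bzcat $planet | osmosis $*"
    else
        bzcat "$planet" | osmosis "$@"
    fi
}

# Prints the command; set DRY_RUN=0 to actually execute it.
extract_bridges planet-latest.osm.bz2 us.poly
```

Printing first makes it easy to check that each task lines up with the description above before committing to a multi-hour run.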
But for me, it didn’t end there.
OpenStreetMap data is continuously updated by more than half a million contributors around the world. People are adding, removing and changing features in OpenStreetMap around the clock. And those changes go straight into the live database. There’s no release schedule, no quality assurance. Every time one of those half a million people clicks ‘save’ in one of the OpenStreetMap editors there is, for all intents and purposes, a new OpenStreetMap version. That means the bridges database I just built is obsolete even before the import is complete. For my yet-to-be-disclosed purpose, that would not be acceptable. So let me specify my goal a little more precisely:
I want all bridges in the US that are mapped in OpenStreetMap in a PostGIS database that stays as up-to-date as possible, reflecting all the latest changes.
It turned out there is no single ready-made solution for this, so let me describe how I ended up doing it. It may not be the most common OpenStreetMap data processing use case out there, but it should be useful for thematic overlay maps, if nothing else, even though the final step of importing into a geospatial database may need some tweaking.
After some less successful attempts I settled on the following workflow:
This workflow uses a handful of specialized tools:
- osmosis, that we’re already familiar with
- osmconvert – a fast, comprehensive OpenStreetMap data file patching, converting and processing tool
- osmfilter – a tool to filter OpenStreetMap data by tags, tag values or feature type
- osmupdate – a tool to automate patching local OpenStreetMap data files, including downloading the change files from the server.
The trio osmconvert / osmfilter / osmupdate can do most of the things osmosis can do, but does it a heck of a lot faster, and is more flexible in a few key aspects that we will see soon.
Let’s go through the numbered steps in the diagram one by one, explaining how each step is executed and how it works.
1. Planet file – The complete, raw OpenStreetMap data. A fresh one is made available every week on the main OpenStreetMap server, but your area of interest may be covered at one of the many mirrors, which can save you some download time and bandwidth. There is no planet mirror for the entire US, so I started with the global planet file. If you have a planet file that matches your area of interest, you can skip step 3 (but not the next step).
2. Bounding polygon – regardless of whether you find an initial planet file that matches your area of interest nicely, you will need a bounding polygon in OSM POLY format for the incremental updates. You’ll find ready-made POLY files in several places, including the OpenStreetMap SVN tree and GeoFabrik (read the README though), or you can create them yourself from any file that OGR can read using ogr2poly.
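If you have never seen a POLY file, the format is simple enough to write by hand. The sketch below creates a minimal one; the file name, the area name and the coordinates are made up for illustration (a very rough box, not a usable US boundary). Each ring lists longitude first, then latitude, and closes on its starting point.

```shell
#!/bin/sh
# Sketch: write a minimal bounding polygon in OSM POLY format.
# example.poly and its coordinates are hypothetical: a crude rectangle,
# NOT a real US boundary.
cat > example.poly <<'EOF'
example_area
1
   -125.0   24.0
   -66.0    24.0
   -66.0    50.0
   -125.0   50.0
   -125.0   24.0
END
END
EOF

# Sanity check: a single-ring file has exactly two END lines,
# one closing the ring and one closing the file.
grep -c '^END$' example.poly
```

The first line is the polygon name, the second names the ring, and each ring and the file as a whole are terminated by END.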
3. Filter area of interest – to save on disk space, memory usage and processing time, we’re going to work only with the data that is inside our area of interest. There are quite a few ways to create geographical extracts from a planet file, but we’re going to use osmconvert for two reasons: a) it’s fast! (osmosis takes about 4 hours and 45 minutes to do this, osmconvert takes 2 hours. This is on an AMD Phenom II X4 965 machine with 16GB RAM) b) it outputs the o5m format for which the next tool in the chain, osmfilter, is optimized.
bzcat planet-latest.osm.bz2 | ./osmconvert - -B=us.poly -o=us.o5m
4. Filter features of interest – The next step is creating a file that holds only the features that we are interested in. We could have done this together with the previous step in one go, but as the diagram shows we will need the output file of step 3 (the US planet file) for the incremental update process. Here, osmfilter comes into play:
osmfilter us.o5m --keep= --keep-ways="bridge=" --out-o5m > us-bridges.o5m
osmfilter works in much the same way as the osmosis --tag-filter task. It accepts arguments to drop specific feature types, or to keep features that have specific tags. In this case, we want to drop everything (--keep=) but the ways that have the key ‘bridge’ (--keep-ways="bridge="). We have osmfilter output the result in the efficient o5m format. (o5m lies in between the OSM XML and pbf formats in terms of file size, and was designed as a compromise between the two. One of the design goals for the o5m format was the ability to merge two files really fast, something we will be relying on in this process.)
5. Convert to pbf – The trio osmconvert / osmfilter / osmupdate is designed to handle file-based data and has no interface for PostGIS, so we need to fall back on osmosis for this step. As osmosis cannot read o5m files, we need to convert to pbf first:
osmconvert us-bridges.o5m -b=-180,-90,180,90 --drop-broken-refs -o=us-bridges.osm.pbf
Wait a minute. A lot more happened there than just a format conversion. Let’s take a step back. Because we’re working with a geographical extract of the planet file, we need to be concerned about referential integrity. Because the way objects in OpenStreetMap don’t have an inherent geometry attached to them, any process looking to filter ways based on a bounding box or polygon needs to go back to the nodes referenced and see if they are within the bounds. It then needs to decide what to do with ways that are partly within the bounds: either cut them at the bounds, dropping all nodes that lie outside the bounds (‘hard cut’), include the entire way (‘soft cut’) or drop the entire way. As the --drop-broken-refs argument name suggests, we are doing the last of these here. This means that data is potentially lost near the bounds, which is not what we actually want. We need to do it this way though, because the planet update (step 7) cannot take referential integrity into account without resorting to additional (expensive) API calls. (Consider this case: a way is entirely outside the bounds at t0. Between updates, one of the nodes is moved inside the bounds, so the way would be included in the extract now. But the old file does not contain the rest of the nodes comprising that way, nor are they in the delta files that are used in the update process, so the full geometry of the new way cannot be known.)
One way to compensate for the data loss is by buffering the bounding polygon. That would yield false positives, but that may be acceptable. It is how I solved this. What’s best for your case depends on your scenario.
The -b=-180,-90,180,90 option defining a global bounding box seems superfluous, but is actually necessary to circumvent a bug in the --drop-broken-refs option that would leave only nodes in the data.
6. Initial database import – This is a straightforward step that can be done with a simple osmosis command:
osmosis --rb us-bridges.osm.pbf --wp database=bridges user=osm password=osm
This reads the pbf file we just created (--rb) and writes it directly to the database ‘bridges’ using the credentials provided (--wp). If you want way geometries, be sure to load the linestring schema extension in addition to the snapshot schema when creating the database:
psql -U osm -d bridges -f /path/to/osmosis/script/pgsnapshot_schema_0.6.sql
psql -U osm -d bridges -f /path/to/osmosis/script/pgsnapshot_schema_0.6_linestring.sql
osmosis will detect this on import, so there is no need to tell it to create the line geometries.
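A quick way to confirm that the geometries were actually built is to count the ways that have one. The sketch below assumes the snapshot schema’s ways table with the linestring column added by the extension, and the database name and user from the commands above; count_bridges and the PSQL override are hypothetical helpers for illustration.

```shell
#!/bin/sh
# Sketch: count imported ways that carry a linestring geometry.
# count_bridges is a hypothetical helper; PSQL can be overridden
# (for example with a stub) when psql is not available.
count_bridges() {
    # -A: unaligned output, -t: tuples only, so just the number is printed.
    "${PSQL:-psql}" -U osm -d bridges -At \
        -c "SELECT count(*) FROM ways WHERE linestring IS NOT NULL;"
}
```

Calling count_bridges after the import should print a single number. If it prints 0 while the ways table is not empty, the linestring extension was probably not loaded before the import.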
Note that we are using the direct write task (--wp), which is fine for smaller datasets. If your dataset is much larger, you’re going to see real performance benefits from using the dump task (--wpd) and loading the dumps into the database with the load script provided with osmosis.
Now that we have the initial import done, we can start the incremental updates. This is where the real fun is!
7. Updating the planet file – This is where osmupdate really excels in flexibility over osmosis. I had not used this tool before and was amazed by how it Just Works. What osmupdate does is look at the input file for the timestamp, intelligently grab all the daily, hourly and minutely diff files from the OpenStreetMap server, and apply them to generate an up-to-date output file. It relies on the osmconvert program that we used before to do the actual patching of the data files, so osmconvert needs to be in your path for it to function. You can pass osmconvert options in, which allows us to apply the bounding polygon in one go:
osmupdate us.o5m us-new.o5m -B=us.poly
8. Filter features of interest for the updated planet file – This is a repetition of step 4, but applied to the updated planet file:
osmfilter us-new.o5m --keep= --keep-ways="bridge=" --out-o5m > us-bridges-new.o5m
We also drop the broken references from this new data file:
osmconvert us-bridges-new.o5m -b=-180,-90,180,90 --drop-broken-refs -o=us-bridges-new-nbr.o5m
9. Derive a diff file – We now have our original bridges data file, derived from the planet we downloaded, and the new bridges file derived from the updated planet file. What we need next is a diff file we can apply to our database. This file should be in the OSM Change file format, the same format that is used to publish the planet diffs that osmupdate downloaded and applied in step 7. This is another task at which osmconvert excels: it can derive a change file from two o5m input files really fast:
osmconvert us-bridges.o5m us-bridges-new-nbr.o5m --diff --fake-lonlat -o=diff-bridges.osc
Again, there’s a little more going on than just deriving a diff file, isn’t there? What is that --fake-lonlat argument? As it turns out, osmconvert creates osc files that don’t have coordinate attributes for nodes that are to be deleted. Including them would be unnecessary: you really only need a node ID to know which node to delete, so there is no need to repeat the other attributes of the node. But some processing software, including osmosis, requires these attributes to be present, even if the node is in a <delete> block.
10. Update the database – With the osc file defining all the changes since the initial import available, we can instruct osmosis to update the database:
osmosis --rxc diff-bridges.osc --wpc database=bridges user=osm password=osm
...And we’re done. Almost. To keep the database up-to-date, we need to automate steps 7 through 10, and add some logic to move and delete a few files to create a consistent initial state for the next update run. I ended up creating a shell script for this and adding a crontab entry to have it run every three hours. This interval seemed like a good trade-off between server load and data freshness. The incremental update script takes about 11 minutes to complete: about 6 minutes for updating the US planet file, 4 minutes for filtering the bridges, and less than a minute to derive the changes, patch the database and clean up. Here’s some log output from the script, which by the way I’d be happy to share with anyone interested in using or improving it:
Tue Mar 6 03:00:01 MST 2012: update us bridges script 20120304v5 starting...
Tue Mar 6 03:00:01 MST 2012: updating US planet...
Tue Mar 6 03:06:28 MST 2012: filtering US planet...
Tue Mar 6 03:10:11 MST 2012: dropping broken references...
Tue Mar 6 03:10:12 MST 2012: deriving changes...
Tue Mar 6 03:10:13 MST 2012: updating database...
Tue Mar 6 03:10:16 MST 2012: cleaning up...
Tue Mar 6 03:10:30 MST 2012: finished successfully in 629 seconds!
Tue Mar 6 03:10:30 MST 2012: 215744 bridges in the database
Tue Mar 6 06:00:01 MST 2012: update us bridges script 20120304v5 starting...
Tue Mar 6 06:00:01 MST 2012: updating US planet...
Tue Mar 6 06:06:10 MST 2012: filtering US planet...
Tue Mar 6 06:10:38 MST 2012: dropping broken references...
Tue Mar 6 06:10:40 MST 2012: deriving changes...
Tue Mar 6 06:10:40 MST 2012: updating database...
Tue Mar 6 06:10:43 MST 2012: cleaning up...
Tue Mar 6 06:10:53 MST 2012: finished successfully in 652 seconds!
Tue Mar 6 06:10:53 MST 2012: 215748 bridges in the database
Tue Mar 6 09:00:02 MST 2012: update us bridges script 20120304v5 starting...
Tue Mar 6 09:00:02 MST 2012: updating US planet...
Tue Mar 6 09:06:47 MST 2012: filtering US planet...
Tue Mar 6 09:11:23 MST 2012: dropping broken references...
Tue Mar 6 09:11:24 MST 2012: deriving changes...
Tue Mar 6 09:11:26 MST 2012: updating database...
Tue Mar 6 09:11:29 MST 2012: cleaning up...
Tue Mar 6 09:11:44 MST 2012: finished successfully in 702 seconds!
Tue Mar 6 09:11:44 MST 2012: 215749 bridges in the database
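For anyone who wants to roll their own, one update run (steps 7 through 10 plus the file shuffling) could be sketched roughly as follows. This is not the actual script that produced the log above: the helper functions, the DRY_RUN switch and the exact clean-up order are assumptions made for illustration, error handling is minimal, and only the tool invocations come from the steps described earlier. By default it prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of one incremental update run (steps 7 through 10).
# NOT the original script: log, run, filter_bridges, update_bridges and
# DRY_RUN are hypothetical names; file names follow the steps above.

DRY_RUN=${DRY_RUN:-1}   # 1 = only print commands, 0 = execute them

log() { echo "$(date): $*"; }

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Step 8 writes to stdout, so the redirect needs its own wrapper.
filter_bridges() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ osmfilter $1 --keep= --keep-ways=bridge= --out-o5m > $2"
    else
        osmfilter "$1" --keep= --keep-ways="bridge=" --out-o5m > "$2"
    fi
}

update_bridges() {
    log "updating US planet..."
    run osmupdate us.o5m us-new.o5m -B=us.poly || return 1
    log "filtering US planet..."
    filter_bridges us-new.o5m us-bridges-new.o5m || return 1
    log "dropping broken references..."
    run osmconvert us-bridges-new.o5m -b=-180,-90,180,90 \
        --drop-broken-refs -o=us-bridges-new-nbr.o5m || return 1
    log "deriving changes..."
    run osmconvert us-bridges.o5m us-bridges-new-nbr.o5m \
        --diff --fake-lonlat -o=diff-bridges.osc || return 1
    log "updating database..."
    # --rxc reads the change file, --wpc applies it to the database.
    run osmosis --rxc diff-bridges.osc \
        --wpc database=bridges user=osm password=osm || return 1
    log "cleaning up..."
    # The new files become the baseline for the next run.
    run mv us-new.o5m us.o5m || return 1
    run mv us-bridges-new-nbr.o5m us-bridges.o5m || return 1
    run rm -f us-bridges-new.o5m diff-bridges.osc
}

update_bridges
```

Saved under a name of your choosing, it could be scheduled every three hours with a crontab line such as `0 */3 * * * DRY_RUN=0 /path/to/update-bridges.sh >> /path/to/update-bridges.log 2>&1` (both paths are placeholders). The files are only rotated after the database update succeeds, so a failed run leaves the previous baseline in place for the next attempt.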
I’ll spend another blog post on my purpose of having this self-updating bridges database sometime soon. It has something to do with comparing and conflating bridges between OpenStreetMap and the National Bridge Inventory. The truth is I am not quite sure how that should be done just yet. I already did some preliminary work on conflation queries in PostGIS and that looks quite promising, but not promising enough (by far) to automate the process of importing NBI data into OSM. Given that NBI is a point database, and bridges in OSM are typically linear features, this would be hard to do anyway.
I’d like to thank Markus Weber, the principal author of osmupdate / osmconvert / osmfilter, for his kind and patient help with refining the process, and for creating a great tool set!