A Look At Stale OpenStreetMap Data

Lazy people go straight here and here. But you’re not that person, are you?

The Wikimania conference is around the corner, and it’s close to home this year – in Washington, DC. DC already has a lot of resident geo geeks and mappers. With all the open, collaborative knowledge minded people in town, there is huge momentum for an OpenStreetMap Mapping Party, and I am excited to help run it! The party is taking place on Sunday, July 15 – the unconference day of Wikimania 2012. (There will  also be lots of other open mapping things going on, do check the program. The entry for the mapping party is kind of sparse, but hey – it’s a mapping party. What more is there to say?)

The question of where to direct the eager mappers quickly arose. In the beginning, that would have been an easy one as the map was without form and void. Nowadays, with the level of maturity of the map and the community OpenStreetMap has reached, it can be a lot harder. DC, with all its past mapping parties, well curated data imports and active mapping community, looks to be handsomely mapped. To pick a good destination for a mapping party requires a look under the hood.

A good indicator for areas that may need some mapping love is data staleness, defined loosely as the amount of time that has passed since someone last touched the data. A neighborhood with lots of stale data may have had one or more active mappers in the past, but they may have moved away or on to other things. While staleness is not a measure of completeness, it can point us at weak areas and neighborhoods in that way.

I did a staleness analysis for a selection of DC nodes and ways. I filtered out the nodes that have tags associated with them, and the ways that are not building outlines. (DC has seen a significant import of building outlines, which would mess up my analysis and the visualization.) And because today was procratination day, I went the extra mile and made the visualization into a web map and put the thing on GitHub. I documented the (pretty straightforward) process step by step on the project wiki, for those who want to roll their own maps, and those interested in doing something useful with OpenStreetMap data other than just making a standard map.

Below are two screenshots – one for DC and another for Amsterdam, another city for which I did the analysis. (A brief explanation of what you see is below the images.) It takes all of 15 minutes from downloading the data to publishing the HTML page, so I could easily do more. But procrastination day is over. Buy me a beer or an Aperol spritz in DC and I’ll see what I can do.

About these screenshots: The top one shows the Mall and surroundings in DC, where we see that the area around the Capitol has not been touched much in the last few years, hence the dark purple color of a lot of the linear features there. The area around the White House on the other hand has received some mapping love lately, with quite a few ways bright green, meaning they have been touched in the last 90 days.

Similar differences in the Amsterdam screenshot below the DC one. The Vondelpark area was updated very recently, while the (arguably much nicer) Rembrandtpark is pale purple – last updates between 1 and 2 years ago.

Note that the individual tagged nodes are not visible in these screenshots. They would clutter up the visualization too much at this scale. In the interactive maps, you could zoom in to see those.

As always, I love to talk about this with you, so share your thoughts, ideas for improvements, and any ol’ comment.

The State Of The OpenStreetMap Road Network In The US

Looks can be deceiving – we all know that. Did you know it also applies to maps? To OpenStreetMap? Let me give you an example.

Head over to osm.org and zoom in to an area outside the major metros in the United States. What you’re likely to see is an OK looking map. It may not be the most beautiful thing you’ve ever seen, but the basics are there: place names, the roads, railroads, lakes, rivers and streams, maybe some land use. Pretty good for a crowdsourced map!

What you’re actually likely looking at is a bunch of data that is imported – not crowdsourced – from a variety of sources ranging from the National Hydrography Dataset to TIGER. This data is at best a few years old and, in the case of TIGER, a topological mess with sometimes very little bearing on the actual ground truth.

TIGER alignment example

The horrible alignment of TIGER ways, shown on top of an aerial imagery base layer. Click on the image for an animation of how this particular case was fixed in OSM. Image from the OSM Wiki.

For most users of OpenStreetMap (not the contributors), the only thing they will ever see is the rendered map. Even for those who are going to use the raw data, the first thing they’ll refer to to get a sense of the quality is the rendered map on osm.org. The only thing that the rendered map really tells you about the data quality, however, is that it has good national coverage for the road network, hydrography and a handful of other feature classes.

To get a better idea of the data quality that underlies the rendered map, we have to look at the data itself. I have done this before in some detail for selected metropolitan areas, but not yet on a national level. This post marks the beginning of that endeavour.

I purposefully kept the first iteration of analyses simple, focusing on the quality of the road network, using the TIGER import as a baseline. I did opt for a fine geographical granularity, choosing counties (and equivalent) as the geographical unit. I designed the following analysis metrics:

  • Number of users involved in editing OSM ways – this metric tells us something about the amount of peer validation. If more people are involved in the local road network, there is a better chance that contributors are checking each other’s work. Note that this metric covers all linear features found, not only actual roads.
  • Average version increase over the TIGER imported roads – this metric provides insight into the amount of work done on improving TIGER roads. A value close to zero means that very little TIGER improvements were done for the study area, which means that all the alignment and topology problems are likely mostly still there.
  • Percentage of TIGER roads – this says something about contributor activity entering new roads (and paths). A lower value means more new roads added after the TIGER import. This is a sign that more committed mappers have been active in the area — entering new roads arguably requires more effort and knowledge than editing existing TIGER roads. A lower value here does not necessarily mean that the TIGER-imported road network has been supplemented with things like bike and footpaths – it can also be caused by mappers replacing TIGER roads with new features, for example as part of a remapping effort. That will typically not be a significant proportion, though.
  • Percentage of untouched TIGER roads – together with the average version increase, this metric shows us the effort that has gone into improving the TIGER import. A high percentage here means lots of untouched, original TIGER roads, which is almost always a bad thing.

Analysis Results

Below are map visualizations of the analysis results for these four metrics, on both the US State and County levels. I used the State and County (and equivalent) borders from the TIGER 2010 dataset for defining the study areas. These files contain 52 state features and 3221 county (and equivalent) features. Hawaii is not on the map, but the analysis was run on all 52 areas (the 50 states plus DC and Puerto Rico – although the planet file I used did not contain Puerto Rico data, so technically there’s valid results for 51 study areas on the state level).

I will let the maps mostly speak for themselves. Below the results visualisations, I will discuss ideas for further work building on this, as well as some technical background.

Map showing the number of contributors to ways, by state

Map showing the average version increase over TIGER imported ways, by state

Map showing the percentage of TIGER ways, by state

Map showing the percentage of untouched TIGER ways, by state

Map showing the number of users involved in ways, by county

Map showing the average version increase over TIGER imported ways, by county

Map showing the percentage of TIGER ways

Map showing the percentage untouched TIGER roads by county

Further work

This initial stats run for the US motivates me to do more with the technical framework I built for it. With that in place, other metrics are relatively straightforward to add to the mix. I would love to hear your ideas, here are some of my own.

Breakdown by road type – It would be interesting to break the analysis down by way type: highways / interstates, primary roads, other roads. The latter category accounts for the majority of the road features, but does not necessarily see the most intensive maintenance by mappers. A breakdown of the analysis will shed some light on this.

Full history – For this analysis, I used a snapshot Planet file from February 2, 2012. A snapshot planet does not contain any historical information about the features – only the current feature version is represented. In a next iteration of this analysis, I would like to use the full history planets that have been available for a while now. Using full history enables me to see how many users have been involved in creating and maintaining ways through time, and how many of them have been active in the last month / year. It also offers an opportunity to identify periods in time when the local community has been particularly active.

Relate users to population / land area – The absolute number of users who contributed to OSM in an area is only mildly instructive. It’d be more interesting if that number were related to the population of that area, or to the land area. Or a combination. We might just find out how many mappers it takes to ‘cover’ an area (i.e. get and keep the other metrics above certain thresholds).

Routing specific metrics – One of the most promising applications of OSM data, and one of the most interesting commercially, is routing. Analyzing the quality of the road network is an essential part of assessing the ‘cost’ of using OpenStreetMap in lieu of other road network data that costs real money. A shallow analysis like I’ve done here is not going to cut it for that purpose though. We will need to know about topological consistency, correct and complete mapping of turn restrictions, grade separations, lanes, traffic lights, and other salient features. There is only so much of that we can do without resorting to comparative analysis, but we can at least devise some quantitative metrics for some.

Technical Background

  • I used the State and County (and equivalent) borders from the TIGER 2010 dataset to determine the study areas.
  • I used osm-history-splitter (by Peter Körner) to do the actual splitting. For this, I needed to convert the TIGER shapefiles to OSM POLY files, for which I used ogr2poly, written by Josh Doe.
  • I used Jochen Topf‘s osmium, more specifically osmjs, for the data processing. The script I ran on all the study areas lives in github.
  • I collated all the results using some python and bash hacking. I used the PostgreSQL COPY function to import the results into a PostgreSQL table.
  • Using a PostgreSQL view, I combined the analysis result data with the geometry tables (which I previously imported into Postgis using shp2pgsql).
  • I exported the views as shapefiles using ogr2ogr, which also offers the option of simplifying the geometries in one step. Useful because the non-generalized counties shapefile is 230MB and takes a long time to load in a GIS).
  • I created the visualizations in Quantum GIS, using its excellent styling tool. I mostly used a quantiles distribution (for equal-sized bins) for the classes, which I tweaked to get prettier class breaks.

I’m planning to do an informal session on this process (focusing on the osmjs / osmium bit) at the upcoming OpenStreetMap hack weekend in DC. I hope to see you there!