Looks can be deceiving – we all know that. Did you know it also applies to maps? To OpenStreetMap? Let me give you an example.
Head over to osm.org and zoom in to an area outside the major metros in the United States. What you’re likely to see is an OK looking map. It may not be the most beautiful thing you’ve ever seen, but the basics are there: place names, the roads, railroads, lakes, rivers and streams, maybe some land use. Pretty good for a crowdsourced map!
What you’re most likely looking at is a collection of data that was imported, not crowdsourced, from a variety of sources ranging from the National Hydrography Dataset to TIGER. This data is at best a few years old and, in the case of TIGER, a topological mess that sometimes bears very little resemblance to the actual ground truth.
For most users of OpenStreetMap (not the contributors), the only thing they will ever see is the rendered map. Even for those who are going to use the raw data, the rendered map on osm.org is the first place they’ll look to get a sense of the quality. The only thing that the rendered map really tells you about the data quality, however, is that it has good national coverage for the road network, hydrography and a handful of other feature classes.
To get a better idea of the data quality that underlies the rendered map, we have to look at the data itself. I have done this before in some detail for selected metropolitan areas, but not yet on a national level. This post marks the beginning of that endeavour.
I purposefully kept the first iteration of analyses simple, focusing on the quality of the road network, using the TIGER import as a baseline. I did opt for a fine geographical granularity, choosing counties (and equivalent) as the geographical unit. I designed the following analysis metrics:
- Number of users involved in editing OSM ways – this metric tells us something about the amount of peer validation. If more people are involved in the local road network, there is a better chance that contributors are checking each other’s work. Note that this metric covers all linear features found, not only actual roads.
- Average version increase over the TIGER imported roads – this metric provides insight into the amount of work done on improving TIGER roads. A value close to zero means that very few TIGER improvements were made in the study area, so the alignment and topology problems are most likely still there.
- Percentage of TIGER roads – this says something about contributor activity entering new roads (and paths). A lower value means more new roads added after the TIGER import. This is a sign that more committed mappers have been active in the area — entering new roads arguably requires more effort and knowledge than editing existing TIGER roads. A lower value here does not necessarily mean that the TIGER-imported road network has been supplemented with things like bike and footpaths – it can also be caused by mappers replacing TIGER roads with new features, for example as part of a remapping effort. That will typically not be a significant proportion, though.
- Percentage of untouched TIGER roads – together with the average version increase, this metric shows us the effort that has gone into improving the TIGER import. A high percentage here means lots of untouched, original TIGER roads, which is almost always a bad thing.
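To make the four metrics a little more concrete, here is a minimal Python sketch of how they could be computed for one study area. This is not the osmjs script used for the analysis. The `Way` record, its `is_tiger` flag, the choice to treat version 1 as "untouched", and the choice of denominators are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Way:
    """Hypothetical record for one OSM way in a study area."""
    user: str       # editor of the current version (all a snapshot exposes)
    version: int    # current version number (1 = never edited)
    is_tiger: bool  # assumed flag: way originates from the TIGER import


def county_metrics(ways):
    """Compute the four metrics for one study area (illustrative only)."""
    tiger = [w for w in ways if w.is_tiger]
    untouched = [w for w in tiger if w.version == 1]
    return {
        # Distinct users seen on the current versions of all ways
        "users": len({w.user for w in ways}),
        # Mean number of edits made to TIGER ways since the import
        "avg_version_increase": (
            sum(w.version - 1 for w in tiger) / len(tiger) if tiger else 0.0
        ),
        # Share of all ways that come from the TIGER import
        "pct_tiger": 100.0 * len(tiger) / len(ways) if ways else 0.0,
        # Share of TIGER ways still at their imported version
        "pct_untouched_tiger": (
            100.0 * len(untouched) / len(tiger) if tiger else 0.0
        ),
    }
```

How a way gets flagged as TIGER-derived is left open here on purpose. Whatever rule is used, it should be applied the same way across all study areas so the numbers stay comparable.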
Below are map visualizations of the analysis results for these four metrics, on both the US State and County levels. I used the State and County (and equivalent) borders from the TIGER 2010 dataset for defining the study areas. These files contain 52 state features and 3221 county (and equivalent) features. Hawaii is not on the map, but the analysis was run on all 52 areas (the 50 states plus DC and Puerto Rico – although the planet file I used did not contain Puerto Rico data, so technically there are valid results for 51 study areas on the state level).
I will let the maps mostly speak for themselves. Below the results visualisations, I will discuss ideas for further work building on this, as well as some technical background.
This initial stats run for the US motivates me to do more with the technical framework I built for it. With that in place, other metrics are relatively straightforward to add to the mix. I would love to hear your ideas. Here are some of my own.
Breakdown by road type – It would be interesting to break the analysis down by way type: highways / interstates, primary roads, other roads. The latter category accounts for the majority of the road features, but does not necessarily see the most intensive maintenance by mappers. A breakdown of the analysis will shed some light on this.
Full history – For this analysis, I used a snapshot Planet file from February 2, 2012. A snapshot planet does not contain any historical information about the features – only the current feature version is represented. In a next iteration of this analysis, I would like to use the full history planets that have been available for a while now. Using full history enables me to see how many users have been involved in creating and maintaining ways through time, and how many of them have been active in the last month / year. It also offers an opportunity to identify periods in time when the local community has been particularly active.
Relate users to population / land area – The absolute number of users who contributed to OSM in an area is only mildly instructive. It’d be more interesting if that number were related to the population of that area, or to the land area. Or a combination. We might just find out how many mappers it takes to ‘cover’ an area (i.e. get and keep the other metrics above certain thresholds).
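As a rough idea of what such a normalization could look like, here is a small Python sketch. The function name, the scaling factors (per 10,000 residents, per 1,000 km²) and the sample numbers are made up for illustration. They do not come from the analysis.

```python
def normalized_user_counts(users, population, land_area_km2):
    """Return (users per 10,000 residents, users per 1,000 km2).

    A value is None when the corresponding denominator is zero or missing.
    """
    per_10k_residents = 10_000 * users / population if population else None
    per_1000_km2 = 1_000 * users / land_area_km2 if land_area_km2 else None
    return per_10k_residents, per_1000_km2


# Hypothetical county: 12 contributing users, 48,000 residents, 2,400 km2
print(normalized_user_counts(12, 48_000, 2_400))
```

Keeping the two ratios separate, rather than folding them into one score, makes it easier to see whether a county is thinly mapped because few people live there or because it is simply large.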
Routing specific metrics – One of the most promising applications of OSM data, and one of the most interesting commercially, is routing. Analyzing the quality of the road network is an essential part of assessing the ‘cost’ of using OpenStreetMap in lieu of other road network data that costs real money. A shallow analysis like I’ve done here is not going to cut it for that purpose though. We will need to know about topological consistency, correct and complete mapping of turn restrictions, grade separations, lanes, traffic lights, and other salient features. There is only so much of that we can do without resorting to comparative analysis, but we can at least devise some quantitative metrics for some.
- I used the State and County (and equivalent) borders from the TIGER 2010 dataset to determine the study areas.
- I used osm-history-splitter (by Peter Körner) to do the actual splitting. For this, I needed to convert the TIGER shapefiles to OSM POLY files, for which I used ogr2poly, written by Josh Doe.
- I used Jochen Topf’s osmium, more specifically osmjs, for the data processing. The script I ran on all the study areas lives on GitHub.
- I collated all the results using some Python and bash hacking. I used the PostgreSQL COPY command to import the results into a PostgreSQL table.
- Using a PostgreSQL view, I combined the analysis result data with the geometry tables (which I previously imported into PostGIS using shp2pgsql).
- I exported the views as shapefiles using ogr2ogr, which also offers the option of simplifying the geometries in one step. This is useful because the non-generalized counties shapefile is 230MB and takes a long time to load in a GIS.
- I created the visualizations in Quantum GIS, using its excellent styling tool. I mostly used a quantile classification (bins with roughly equal feature counts) for the classes, which I tweaked to get prettier class breaks.
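For readers who want to reproduce the quantile binning outside a GIS, here is a minimal Python sketch using only the standard library. It shows the general technique, not what Quantum GIS does internally. The function names and the default of five classes are assumptions for the example.

```python
import statistics


def quantile_breaks(values, n_classes=5):
    """Return the n_classes - 1 cut points for roughly equal-count bins."""
    return statistics.quantiles(values, n=n_classes, method="inclusive")


def classify(value, breaks):
    """Return the 0-based class index of value, given ascending breaks."""
    return sum(value > b for b in breaks)


# Hypothetical metric values for a handful of counties
sample = [0.0, 0.1, 0.1, 0.3, 0.4, 0.8, 1.2, 1.5, 2.9, 6.0]
breaks = quantile_breaks(sample, n_classes=5)
classes = [classify(v, breaks) for v in sample]
```

The raw cut points are rarely round numbers, which is why hand-tweaking them to get prettier class breaks is a reasonable last step.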
I’m planning to do an informal session on this process (focusing on the osmjs / osmium bit) at the upcoming OpenStreetMap hack weekend in DC. I hope to see you there!