A Look At Stale OpenStreetMap Data


Lazy people go straight here and here. But you’re not that person, are you?

The Wikimania conference is around the corner, and it’s close to home this year – in Washington, DC. DC already has a lot of resident geo geeks and mappers. With all the open, collaborative-knowledge-minded people in town, there is huge momentum for an OpenStreetMap Mapping Party, and I am excited to help run it! The party is taking place on Sunday, July 15 – the unconference day of Wikimania 2012. (There will also be lots of other open mapping things going on, so do check the program. The entry for the mapping party is kind of sparse, but hey – it’s a mapping party. What more is there to say?)

The question of where to direct the eager mappers quickly arose. In the beginning, that would have been an easy one as the map was without form and void. Nowadays, with the level of maturity of the map and the community OpenStreetMap has reached, it can be a lot harder. DC, with all its past mapping parties, well curated data imports and active mapping community, looks to be handsomely mapped. To pick a good destination for a mapping party requires a look under the hood.

A good indicator for areas that may need some mapping love is data staleness, defined loosely as the amount of time that has passed since someone last touched the data. A neighborhood with lots of stale data may have had one or more active mappers in the past, but they may have moved away or on to other things. While staleness is not a measure of completeness, it can still point us toward weak areas and neighborhoods.

I did a staleness analysis for a selection of DC nodes and ways. I filtered the data down to the nodes that have tags associated with them, and the ways that are not building outlines. (DC has seen a significant import of building outlines, which would mess up my analysis and the visualization.) And because today was procrastination day, I went the extra mile and made the visualization into a web map and put the thing on GitHub. I documented the (pretty straightforward) process step by step on the project wiki, for those who want to roll their own maps, and those interested in doing something useful with OpenStreetMap data other than just making a standard map.
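
To give an idea of what such a staleness query can look like, here is a minimal sketch, assuming the extract has been loaded into a PostGIS database with the osmosis pgsnapshot schema and hstore tags. This is not necessarily the exact query behind the web map:

-- Staleness (days since last edit) for tagged nodes and non-building ways,
-- assuming the osmosis pgsnapshot schema with hstore tags.
SELECT id, 'node' AS feature_type,
       extract(day FROM now() - tstamp) AS days_since_last_edit
FROM nodes
WHERE tags <> ''::hstore              -- keep only nodes that carry tags
UNION ALL
SELECT id, 'way' AS feature_type,
       extract(day FROM now() - tstamp) AS days_since_last_edit
FROM ways
WHERE NOT tags ? 'building';          -- leave out the imported building outlines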

Below are two screenshots – one for DC and another for Amsterdam, another city for which I did the analysis. (A brief explanation of what you see is below the images.) It takes all of 15 minutes from downloading the data to publishing the HTML page, so I could easily do more. But procrastination day is over. Buy me a beer or an Aperol spritz in DC and I’ll see what I can do.

About these screenshots: The top one shows the Mall and surroundings in DC, where we see that the area around the Capitol has not been touched much in the last few years, hence the dark purple color of a lot of the linear features there. The area around the White House on the other hand has received some mapping love lately, with quite a few ways bright green, meaning they have been touched in the last 90 days.

Similar differences in the Amsterdam screenshot below the DC one. The Vondelpark area was updated very recently, while the (arguably much nicer) Rembrandtpark is pale purple – last updates between 1 and 2 years ago.

Note that the individual tagged nodes are not visible in these screenshots. They would clutter up the visualization too much at this scale. In the interactive maps, you could zoom in to see those.

As always, I love to talk about this with you, so share your thoughts, ideas for improvements, and any ol’ comment.

Detecting Highway Trouble in OpenStreetMap


For the impatient: do you want to get to work solving highway trouble in OpenStreetMap right away? Download the Trouble File here!

Making pretty and useful maps with freely available OpenStreetMap data has never been so easy and so much fun to do. The website Switch2OSM is an excellent starting point, and with great tools like MapBox’s TileMill at your disposal, experimenting with digital cartography is almost effortless. Design bureau Stamen shows us some beautiful examples of digital cartography based on OpenStreetMap data. Google starting to charge for using their maps API provides a compelling push factor for some to start down this road, and the likes of Foursquare and Apple lead the way.

With all the eyes on OpenStreetMap as a source of pretty maps now, you would almost forget that the usefulness of freely available OpenStreetMap data extends way beyond that. One of the more compelling uses of OpenStreetMap data is routing and navigation, and things have been moving in that space. Skobbler has succeeded in making a tangible dent in the turn-by-turn navigation market for mobile devices in some countries, offering functionality similar to TomTom’s but at a much, much lower price point, using freely available OpenStreetMap data. MapQuest and CloudMade offer routing APIs based on OpenStreetMap. New open source routing software projects OSRM and MoNav show promise with very fast route calculation and a full feature set, and both are built from the ground up to work with OpenStreetMap data.

Routing puts very different, much stricter requirements on the source data than map rendering does. For a pretty map, it does not matter much if roads in the source data do not always connect, or lack usage or turn restriction information. For routing, this makes all the difference. Topological errors and missing usage restriction metadata make for incorrect routes: they will direct you to turn left onto a one-way street, or to get off the highway for no apparent reason, even where there is no exit. That may seem funny if you read about it in a British tabloid, but it’s annoying when you’re on a road trip, and totally unacceptable if you depend on routing software for your business. So unless the data is pretty much flawless, we won’t see major providers of routing and navigation products make the switch to OpenStreetMap that some have so eagerly made for their base maps.

It turns out the data is not flawless. A study done at the University of Heidelberg shows that even for Germany, the country with the most prolific OpenStreetMap community by a distance, the data is not on par with commercial road network data when compared on key characteristics for routing. (Even though the study predicts that in a few months, it will be).

Turning to the US, the situation is bound to be much worse. With a much smaller community that is spread pretty thin geographically (and in some regions, almost nonexistent), and the TIGER import as a very challenging starting point, there is no way that any routing based on OpenStreetMap data in the US is going to be anywhere near perfect. Sure, the most obvious routing related problems with the TIGER data were identified and weeded out in an early effort (led by aforementioned CloudMade) shortly after the import, but many challenges still remain.

In an effort to make OpenStreetMap data more useful for routing in the US, I started to identify some of those challenges. Routing is most severely affected by problems with the primary road network, so I decided to start from there. Using some modest PostGIS magic, I isolated a set of Highway Trouble Points. The Trouble breaks down into four main classes:

Bridge Trouble

This is the case where a road crossing over or under a highway is not tagged as a bridge, and even worse, shares vertices with the highway, as illustrated below. This tricks routing software into thinking there is a turn opportunity there when there is not. This is bad enough if there actually is an exit, like in the example, but it gets really disastrous when there is not.

These cases take some practice to repair. It involves either deleting or ungluing the shared nodes, splitting off the stretch of road that should be a bridge, and tagging it with bridge=yes, layer=1.

Imaginary Exit Trouble

Sometimes, a local road or track will be connected to a highway, tricking routing software into possibly taking a shortcut. Repairing these is simple: unglue the shared node and move the end of the local road to where it actually ends, looking at the aerial imagery.

Service Road Trouble

The separate roadways of a highway are sometimes connected to allow emergency vehicles to make a U-turn. Regular traffic is not allowed to use these connector service ways, but during the TIGER import they were usually tagged as public access roads, again potentially tricking routing software into taking a shortcut. I repair these by tagging them as highway=service and access=official, access=no, emergency=yes.

Rest Area Trouble

This is of secondary importance, as rest areas are usually not connected to the road network except for their on- and off-ramps. Finding these Trouble points was an unexpected by-product of the query I ran on the data. What we have here are rest areas that are not tagged as such, instead just existing as a group of ‘residential’ roads connecting to the highway features, without a motorway_link. While we’re at it, we can clean these up nicely by adding motorway_link ways at the on- and off-ramps, tagging the other road features as highway=service, adding the necessary oneway=yes, and identifying a node as highway=rest_area. It’s usually obvious from the aerial image whether toilets=yes applies, too.

I have done test runs of the query on OSM data for Vermont and Missouri. The query is performed on a PostGIS database with the osmosis snapshot schema, optionally with the linestring extension, and goes like this:

DROP TABLE IF EXISTS candidates;
CREATE TABLE candidates AS
    WITH agg_intersections AS
    (
        WITH intersection_nodes_wayrefs AS
        (
            WITH intersection_nodes AS
            (
                SELECT
                    a.id AS node_id,
                    b.way_id,
                    a.geom
                FROM
                    nodes a,
                    way_nodes b
                WHERE
                    a.id = b.node_id AND
                    a.id IN
                    (
                        SELECT 
                            DISTINCT node_id
                        FROM 
                            way_nodes
                        GROUP BY 
                            node_id
                        HAVING 
                            COUNT(1) = 2
                    )
            )
            SELECT
                DISTINCT a.node_id AS node_id,
                b.id AS way_id,
                b.tags->'highway' AS osm_highway,
                a.geom AS geom,
                b.tags->'ref' AS osm_ref
            FROM
                intersection_nodes a,
                ways b
            WHERE
                a.way_id = b.id
        )
        SELECT
            node_id,
            array_agg(way_id) AS way_ids,
            array_agg(osm_highway) AS osm_highways,
            array_agg(osm_ref) AS osm_refs
        FROM 
            intersection_nodes_wayrefs
        GROUP BY 
            node_id
    )
    SELECT
        a.* ,
        b.geom AS node_geom,
        -- COMMENT NEXT LINE OUT IF YOU DON'T HAVE
        -- OR WANT WAY GEOMETRIES
        c.linestring AS way_geom
    FROM 
        agg_intersections a, 
        nodes b,
        ways c
    WHERE
        (
            'motorway' = ANY(osm_highways)
            AND NOT
            (
                'motorway_link' = ANY(osm_highways)
                OR
                'service' = ANY(osm_highways)
                OR 
                'motorway' = ALL(osm_highways)
                OR 
                'construction' = ANY(osm_highways)
            )
        )    
    AND
        a.node_id = b.id
    AND
        c.id = ANY(a.way_ids);

The query took about a minute to run for Vermont and about 5 minutes for Missouri. For Vermont, it yielded 77 points and for Missouri 193 points. You can download these files here, but note that I have already done much of the cleanup work in these states since, as part of my thinking on how to improve the query. It still yields some false positives, notably points where a highway=motorway turns into a highway=trunk or highway=primary; see below.

UPDATE: The query below filters out those false positives. It uses the ST_StartPoint and ST_EndPoint PostGIS functions to determine whether two line features ‘meet’:

DROP TABLE IF EXISTS candidates_noendpoints;
CREATE TABLE candidates_noendpoints AS

SELECT 
    DISTINCT c.node_id,
    c.node_geom
FROM
    ways a,
    ways b,
    candidates c
WHERE
    ST_Intersects(c.node_geom, a.linestring)
AND
    ST_Intersects(c.node_geom, b.linestring)    
AND NOT
(
    ST_Intersects(c.node_geom, ST_Union(ST_StartPoint(a.linestring), ST_EndPoint(a.linestring)))
    AND
    ST_Intersects(c.node_geom, ST_Union(ST_StartPoint(b.linestring), ST_EndPoint(b.linestring)))
)
;

This query requires the availability of line geometries for the ways, obviously.

UPDATE 2: The query as-is made the PostgreSQL server croak because it ran out of memory, so I had to redesign the query to rely much less on in-memory tables. I will provide the updated query to anyone interested. I’m going to leave the original SQL up there, it was meant to convey the approach and it still does. The whole US trouble file is available as an OSM XML file from here.
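
For the curious, the general direction of that redesign is to write expensive intermediate results to real, indexed tables instead of stacking them as nested CTEs. A minimal sketch of the idea, not the actual redesigned query:

-- Sketch only: materialize the node-degree filter as an indexed table,
-- so the candidates query can join against it instead of recomputing
-- the aggregate in memory on every run.
DROP TABLE IF EXISTS shared_nodes;
CREATE TABLE shared_nodes AS
    SELECT node_id
    FROM way_nodes
    GROUP BY node_id
    HAVING COUNT(1) = 2;    -- nodes that are part of exactly two ways

CREATE INDEX shared_nodes_idx ON shared_nodes (node_id);
ANALYZE shared_nodes;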

I plan to make the Highway Trouble files available on a regular basis for all 50 states if there’s an interest for them. And as always I’m very interested to hear your opinion: any Trouble I am missing? Ways to improve the query? Let me know.

A self-updating OpenStreetMap database of US bridges – a step-by-step guide.


I had what I thought was a pretty straightforward use case for OpenStreetMap data:

I want all bridges in the US that are mapped in OpenStreetMap in a PostGIS database.

There are about 125,000 of them – for now loosely defined as ‘ways that have the bridge tag’. So on the scale of OpenStreetMap data it’s a really small subset. In terms of the tools and processes needed, the task seems easy enough, and as long as you are satisfied with a one-off solution, it really is. You would need only four things:

  1. A planet file
  2. A boundary polygon for the United States
  3. A PostGIS database loaded with the osmosis snapshot schema and the linestring extension
  4. osmosis, the OpenStreetMap ETL swiss army tool.

That, and a single well-placed osmosis command:

bzcat planet-latest.osm.bz2 | \
osmosis --rx - \
--bp us.poly \
--tf accept-ways bridge=* \
--tf reject-relations \
--used-node \
--wp database=bridges user=osm password=osm

This will decompress the planet file and pipe the output to osmosis. Osmosis’s --read-xml task consumes the XML stream and passes it to a --bounding-polygon task to clip the data using the US bounding polygon, then to a couple of --tag-filter tasks that throw out all relations and all ways except those tagged bridge=* (there’s a negligible number of ways tagged bridge=no, but catching all the different ways of tagging a ‘true’ value here is more work than it’s worth, if you ask me), then to a --used-node task that throws out all the nodes except those that are used by the ways we are keeping, and finally to a --write-pgsql task that writes all the objects to the PostGIS database. (Osmosis can be overwhelming at first with its plethora of tasks and arguments, but if you break it down it’s really quite straightforward. There’s also a graphical wrapper around osmosis called OSMembrane that may help to make this tool easier to understand and master.)

But for me, it didn’t end there.

OpenStreetMap data is continuously updated by more than half a million contributors around the world. People are adding, removing and changing features in OpenStreetMap around the clock. And those changes go straight into the live database. There’s no release schedule, no quality assurance. Every time one of those half a million people clicks ‘save’ in one of the OpenStreetMap editors there is, for all intents and purposes, a new OpenStreetMap version. That means the bridges database I just built is already obsolete even before the import is complete. For my yet-to-be-disclosed purpose, that would not be acceptable. So let me specify my goal a little more precisely:

I want all bridges in the US that are mapped in OpenStreetMap in a PostGIS database that stays as up-to-date as possible, reflecting all the latest changes.

There is no single ready-made solution for this, it turned out, so let me describe how I ended up doing it. It may not be the most common OpenStreetMap data processing use case out there, but it should be useful for thematic overlay maps, for example, if nothing else – even though the final step of importing into a geospatial database may need some tweaking.

After some less successful attempts I settled on the following workflow:

Strategy for keeping an up-to-date geographical and functional OpenStreetMap extract

This workflow uses a handful of specialized tools:

  1. osmosis, that we’re already familiar with
  2. osmconvert – a fast, comprehensive OpenStreetMap data file patching, converting and processing tool
  3. osmfilter – a tool to filter OpenStreetMap data by tags, tag values or feature type
  4. osmupdate – a tool to automate patching local OpenStreetMap data files, including downloading the change files from the server.

Together, the osmconvert / osmfilter / osmupdate trio can do most of the things osmosis can do, but a heck of a lot faster, and it is more flexible in a few key aspects that we will see soon.

Let’s go through the numbered steps in the diagram one by one, explaining how each step is executed and how it works.

1. Planet file – The complete, raw OpenStreetMap data. A fresh one is made available every week on the main OpenStreetMap server, but your area of interest may be covered at one of the many mirrors, which can save you some download time and bandwidth. There is no planet mirror for the entire US, so I started with the global planet file. If you have a planet file that matches your area of interest, you can skip step 3 (but not the next step).

2. Bounding polygon – regardless of whether you find an initial planet file that matches your area of interest nicely, you will need a bounding polygon in OSM POLY format for the incremental updates. You’ll find ready-made POLY files in several places, including the OpenStreetMap SVN tree and GeoFabrik (read the README though), or you can create them yourself from any file that OGR can read using ogr2poly.
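
The POLY format itself is almost trivially simple: a name on the first line, one or more rings of longitude/latitude pairs, each ring closed with END, and a final END for the whole file. A made-up, heavily simplified example (a crude rectangle, not an actual US boundary):

us_rough
1
    -125.0    24.0
    -66.0     24.0
    -66.0     49.5
    -125.0    49.5
    -125.0    24.0
END
END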

3. Filter area of interest – to save on disk space, memory usage and processing time, we’re going to work only with the data that is inside our area of interest. There are quite a few ways to create geographical extracts from a planet file, but we’re going to use osmconvert for two reasons: a) it’s fast! (osmosis takes about 4 hours and 45 minutes to do this; osmconvert takes 2 hours, on an AMD Phenom II X4 965 machine with 16GB RAM) b) it outputs the o5m format, for which the next tool in the chain, osmfilter, is optimized.

bzcat planet-latest.osm.bz2 | ./osmconvert - -B=us.poly -o=us.o5m

4. Filter features of interest – The second filtering step is creating a file that holds only the features that we are interested in. We could have done this together with the previous step in one go, but as the diagram shows we will need the output file of step 3 (the US planet file) for the incremental update process. Here, osmfilter comes into play:

osmfilter us.o5m --keep= --keep-ways="bridge=" --out-o5m > us-bridges.o5m

osmfilter works in much the same way as the osmosis --tag-filter task. It accepts arguments to drop specific feature types, or to keep features that have specific tags. In this case, we want to drop everything (--keep=) except the ways that have the key ‘bridge’ (--keep-ways="bridge="). We have osmfilter output the result in the efficient o5m format. (o5m lies in between the OSM XML and PBF formats in terms of file size, and was designed as a compromise between the two. One of the design goals for the o5m format was the ability to merge two files really fast, something we will be relying on in this process.)

5. Convert to pbf – The trio osmconvert / osmfilter / osmupdate is designed to handle file-based data and has no interface for PostGIS, so we need to fall back on osmosis for this step. As osmosis cannot read o5m files, we need to convert to pbf first:

osmconvert us-bridges.o5m -b=-180,-90,180,90 --drop-broken-refs -o=us-bridges.osm.pbf

Wait a minute. A lot more happened there than just a format conversion. Let’s take a step back. Because we’re working with a geographical extract of the planet file, we need to be concerned about referential integrity. Because way objects in OpenStreetMap don’t have an inherent geometry attached to them, any process looking to filter ways based on a bounding box or polygon needs to go back to the referenced nodes and see if they are within the bounds. It then needs to decide what to do with ways that are only partly within the bounds: either cut them at the bounds, dropping all nodes that lie outside the bounds (‘hard cut’), include the entire way (‘soft cut’), or drop the entire way. As the --drop-broken-refs argument name suggests, we are doing the latter here. This means that data is potentially lost near the bounds, which is not what we actually want. We need to do it this way though, because the planet update (step 7) cannot take referential integrity into account without resorting to additional (expensive) API calls. (Consider this case: a way is entirely outside the bounds at t0. Between updates, one of its nodes is moved inside the bounds, so the way would now be included in the extract. But the old file does not contain the rest of the nodes comprising that way, nor are they in the delta files that are used in the update process – so the full geometry of the new way cannot be known.)

One way to compensate for the data loss is to buffer the bounding polygon. That yields some false positives, but that may be acceptable; it is how I solved this. What’s best depends on your scenario.

The -b=-180,-90,180,90 option defining a global bounding box seems superfluous, but is actually necessary to circumvent a bug in the --drop-broken-refs task that would leave only nodes in the data.

6. Initial database import – This is a straightforward step that can be done with a simple osmosis command:

osmosis --rb us-bridges.osm.pbf --wp database=bridges user=osm password=osm

This reads the pbf file we just created (--rb) and writes it directly to the database ‘bridges’ using the credentials provided (--wp). If you want way geometries, be sure to load the linestring schema extension in addition to the snapshot schema when creating the database:

psql -U osm -d bridges -f /path/to/osmosis/script/pgsnapshot_schema_0.6.sql
psql -U osm -d bridges -f /path/to/osmosis/script/pgsnapshot_schema_0.6_linestring.sql

osmosis will detect this on import; there is no need to tell it to create the line geometries.

Note that we are using the direct write task (--wp), which is fine for smaller datasets. If your dataset is much larger, you’re going to see real performance benefits from using the dump task (--wpd) and loading the dumps into the database using the load script provided with osmosis.

Now that we have the initial import done, we can start the incremental updates. This is where the real fun is!

7. Updating the planet file – This is where osmupdate really excels in flexibility over osmosis. I had not used this tool before and was amazed by how it Just Works. What osmupdate does is look at the timestamp of the input file, intelligently grab all the daily, hourly and minutely diff files from the OpenStreetMap server, and apply them to generate an up-to-date output file. It relies on the osmconvert program that we used before to do the actual patching of the data files, so osmconvert needs to be in your path for it to function. You can pass osmconvert options in, which allows us to apply the bounding polygon in one go:

osmupdate us.o5m us-new.o5m -B=us.poly

8. Filter features of interest for the updated planet file – This is a repetition of step 4, but applied to the updated planet file:

osmfilter us-new.o5m --keep= --keep-ways="bridge=" --out-o5m > us-bridges-new.o5m

We also drop the broken references from this new data file:

osmconvert us-bridges-new.o5m -b=-180,-90,180,90 --drop-broken-refs -o=us-bridges-new-nbr.o5m

9. Derive a diff file – We now have our original bridges data file, derived from the planet we downloaded, and the new bridges file derived from the updated planet file. What we need next is a diff file we can apply to our database. This file should be in the OSM Change file format, the same format used to publish the planet diffs that we applied in step 7. This is another task at which osmconvert excels: it can derive a change file from two o5m input files really fast:

osmconvert us-bridges.o5m us-bridges-new-nbr.o5m --diff --fake-lonlat -o=diff-bridges.osc

Again, there’s a little more going on than just deriving a diff file, isn’t there? What is that --fake-lonlat argument? As it turns out, osmconvert creates osc files that don’t have coordinate attributes for nodes that are to be deleted. Including them would be unnecessary: you really only need a node ID to know which node to delete, and there is no need to repeat the other attributes of the node. But some processing software, including osmosis, requires these attributes to be present, even if the node is in a <delete> block.

10. Update the database – With the osc file defining all the changes since the initial import available, we can instruct osmosis to update the database:

osmosis --wxc diff-bridges.osc --wpc database=bridges user=osm password=osm

...And we’re done. Almost. To keep the database up-to-date, we need to automate steps 7 through 10, and add some logic to move and delete a few files to create a consistent initial state for the replication process. I ended up creating a shell script for this and adding a crontab entry to have it run every three hours. This interval seemed like a good trade-off between server load and data freshness. The incremental update script takes about 11 minutes to complete: about 6 minutes for updating the US planet file, 4 minutes for filtering the bridges, and less than a minute to derive the changes, patch the database and clean up. Here’s some log output from the script (which, by the way, I’d be happy to share with anyone interested in using or improving it):

Tue Mar  6 03:00:01 MST 2012: update us bridges script 20120304v5 starting...
Tue Mar  6 03:00:01 MST 2012: updating US planet...
Tue Mar  6 03:06:28 MST 2012: filtering US planet...
Tue Mar  6 03:10:11 MST 2012: dropping broken references...
Tue Mar  6 03:10:12 MST 2012: deriving changes...
Tue Mar  6 03:10:13 MST 2012: updating database...
Tue Mar  6 03:10:16 MST 2012: cleaning up...
Tue Mar  6 03:10:30 MST 2012: finished successfully in 629 seconds!
Tue Mar  6 03:10:30 MST 2012:  215744 bridges in the database
Tue Mar  6 06:00:01 MST 2012: update us bridges script 20120304v5 starting...
Tue Mar  6 06:00:01 MST 2012: updating US planet...
Tue Mar  6 06:06:10 MST 2012: filtering US planet...
Tue Mar  6 06:10:38 MST 2012: dropping broken references...
Tue Mar  6 06:10:40 MST 2012: deriving changes...
Tue Mar  6 06:10:40 MST 2012: updating database...
Tue Mar  6 06:10:43 MST 2012: cleaning up...
Tue Mar  6 06:10:53 MST 2012: finished successfully in 652 seconds!
Tue Mar  6 06:10:53 MST 2012:  215748 bridges in the database
Tue Mar  6 09:00:02 MST 2012: update us bridges script 20120304v5 starting...
Tue Mar  6 09:00:02 MST 2012: updating US planet...
Tue Mar  6 09:06:47 MST 2012: filtering US planet...
Tue Mar  6 09:11:23 MST 2012: dropping broken references...
Tue Mar  6 09:11:24 MST 2012: deriving changes...
Tue Mar  6 09:11:26 MST 2012: updating database...
Tue Mar  6 09:11:29 MST 2012: cleaning up...
Tue Mar  6 09:11:44 MST 2012: finished successfully in 702 seconds!
Tue Mar  6 09:11:44 MST 2012:  215749 bridges in the database
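
To give an idea of the overall shape of that wrapper script, here is a minimal sketch using the file names from the steps above. This is not the actual script, just the skeleton of it:

#!/bin/bash
set -e
log() { echo "$(date): $1"; }

log "updating US planet..."
osmupdate us.o5m us-new.o5m -B=us.poly

log "filtering US planet..."
osmfilter us-new.o5m --keep= --keep-ways="bridge=" --out-o5m > us-bridges-new.o5m

log "dropping broken references..."
osmconvert us-bridges-new.o5m -b=-180,-90,180,90 --drop-broken-refs -o=us-bridges-new-nbr.o5m

log "deriving changes..."
osmconvert us-bridges.o5m us-bridges-new-nbr.o5m --diff --fake-lonlat -o=diff-bridges.osc

log "updating database..."
osmosis --wxc diff-bridges.osc --wpc database=bridges user=osm password=osm

log "cleaning up..."
# keep the updated planet for the next run
mv us-new.o5m us.o5m
# the next diff should be taken against the state that is now in the database
mv us-bridges-new-nbr.o5m us-bridges.o5m
rm -f us-bridges-new.o5m diff-bridges.osc

log "finished"

A crontab entry along the lines of 0 */3 * * * /path/to/update-bridges.sh (path and file name made up here) then runs it every three hours.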

Wrapping up

I’ll spend another blog post on my purpose of having this self-updating bridges database sometime soon. It has something to do with comparing and conflating bridges between OpenStreetMap and the National Bridge Inventory. The truth is I am not quite sure how that should be done just yet. I already did some preliminary work on conflation queries in PostGIS and that looks quite promising, but not promising enough (by far) to automate the process of importing NBI data into OSM. Given that NBI is a point database, and bridges in OSM are typically linear features, this would be hard to do anyway.

I’d like to thank Markus Weber, the principal author of osmupdate / osmconvert / osmfilter, for his kind and patient help with refining the process, and for creating a great tool set!

The State Of The OpenStreetMap Road Network In The US


Looks can be deceiving – we all know that. Did you know it also applies to maps? To OpenStreetMap? Let me give you an example.

Head over to osm.org and zoom in to an area outside the major metros in the United States. What you’re likely to see is an OK looking map. It may not be the most beautiful thing you’ve ever seen, but the basics are there: place names, the roads, railroads, lakes, rivers and streams, maybe some land use. Pretty good for a crowdsourced map!

What you’re actually likely looking at is a bunch of data that is imported – not crowdsourced – from a variety of sources ranging from the National Hydrography Dataset to TIGER. This data is at best a few years old and, in the case of TIGER, a topological mess with sometimes very little bearing on the actual ground truth.

TIGER alignment example

The horrible alignment of TIGER ways, shown on top of an aerial imagery base layer. Click on the image for an animation of how this particular case was fixed in OSM. Image from the OSM Wiki.

For most users of OpenStreetMap (not the contributors), the only thing they will ever see is the rendered map. Even for those who are going to use the raw data, the first thing they’ll consult to get a sense of the quality is the rendered map on osm.org. The only thing that the rendered map really tells you about the data quality, however, is that it has good national coverage for the road network, hydrography and a handful of other feature classes.

To get a better idea of the data quality that underlies the rendered map, we have to look at the data itself. I have done this before in some detail for selected metropolitan areas, but not yet on a national level. This post marks the beginning of that endeavour.

I purposefully kept the first iteration of analyses simple, focusing on the quality of the road network, using the TIGER import as a baseline. I did opt for a fine geographical granularity, choosing counties (and equivalent) as the geographical unit. I designed the following analysis metrics (a rough SQL sketch of how they could be approximated follows the list):

  • Number of users involved in editing OSM ways – this metric tells us something about the amount of peer validation. If more people are involved in the local road network, there is a better chance that contributors are checking each other’s work. Note that this metric covers all linear features found, not only actual roads.
  • Average version increase over the TIGER imported roads – this metric provides insight into the amount of work done on improving TIGER roads. A value close to zero means that very little TIGER improvements were done for the study area, which means that all the alignment and topology problems are likely mostly still there.
  • Percentage of TIGER roads – this says something about contributor activity entering new roads (and paths). A lower value means more new roads added after the TIGER import. This is a sign that more committed mappers have been active in the area — entering new roads arguably requires more effort and knowledge than editing existing TIGER roads. A lower value here does not necessarily mean that the TIGER-imported road network has been supplemented with things like bike and footpaths – it can also be caused by mappers replacing TIGER roads with new features, for example as part of a remapping effort. That will typically not be a significant proportion, though.
  • Percentage of untouched TIGER roads – together with the average version increase, this metric shows us the effort that has gone into improving the TIGER import. A high percentage here means lots of untouched, original TIGER roads, which is almost always a bad thing.
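
If you would rather poke at these numbers in a database than run the osmjs script described under Technical Background below, the metrics can be approximated with a query along these lines. This is a sketch, assuming a pgsnapshot import of the study area with hstore tags, and treating any way carrying a tiger:cfcc tag as TIGER-imported, which is a simplification:

-- Rough per-study-area approximation of the four metrics above.
SELECT
    count(DISTINCT user_id)                                  AS users_involved,
    avg(CASE WHEN tags ? 'tiger:cfcc'
             THEN version - 1 END)                           AS avg_tiger_version_increase,
    100.0 * sum(CASE WHEN tags ? 'tiger:cfcc'
                     THEN 1 ELSE 0 END) / count(*)           AS pct_tiger_ways,
    100.0 * sum(CASE WHEN tags ? 'tiger:cfcc' AND version = 1
                     THEN 1 ELSE 0 END)
          / nullif(sum(CASE WHEN tags ? 'tiger:cfcc'
                            THEN 1 ELSE 0 END), 0)           AS pct_untouched_tiger
FROM ways
WHERE tags ? 'highway';   -- restricted to roads here; the first metric above counts all linear features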

Analysis Results

Below are map visualizations of the analysis results for these four metrics, on both the US State and County levels. I used the State and County (and equivalent) borders from the TIGER 2010 dataset for defining the study areas. These files contain 52 state features and 3221 county (and equivalent) features. Hawaii is not on the map, but the analysis was run on all 52 areas (the 50 states plus DC and Puerto Rico – although the planet file I used did not contain Puerto Rico data, so technically there are valid results for 51 study areas on the state level).

I will let the maps mostly speak for themselves. Below the results visualisations, I will discuss ideas for further work building on this, as well as some technical background.

Map showing the number of contributors to ways, by state

Map showing the average version increase over TIGER imported ways, by state

Map showing the percentage of TIGER ways, by state

Map showing the percentage of untouched TIGER ways, by state

Map showing the number of users involved in ways, by county

Map showing the average version increase over TIGER imported ways, by county

Map showing the percentage of TIGER ways

Map showing the percentage untouched TIGER roads by county

Further work

This initial stats run for the US motivates me to do more with the technical framework I built for it. With that in place, other metrics are relatively straightforward to add to the mix. I would love to hear your ideas, here are some of my own.

Breakdown by road type – It would be interesting to break the analysis down by way type: highways / interstates, primary roads, other roads. The latter category accounts for the majority of the road features, but does not necessarily see the most intensive maintenance by mappers. A breakdown of the analysis will shed some light on this.

Full history – For this analysis, I used a snapshot Planet file from February 2, 2012. A snapshot planet does not contain any historical information about the features – only the current feature version is represented. In a next iteration of this analysis, I would like to use the full history planets that have been available for a while now. Using full history enables me to see how many users have been involved in creating and maintaining ways through time, and how many of them have been active in the last month / year. It also offers an opportunity to identify periods in time when the local community has been particularly active.

Relate users to population / land area – The absolute number of users who contributed to OSM in an area is only mildly instructive. It’d be more interesting if that number were related to the population of that area, or to the land area. Or a combination. We might just find out how many mappers it takes to ‘cover’ an area (i.e. get and keep the other metrics above certain thresholds).

Routing specific metrics – One of the most promising applications of OSM data, and one of the most interesting commercially, is routing. Analyzing the quality of the road network is an essential part of assessing the ‘cost’ of using OpenStreetMap in lieu of other road network data that costs real money. A shallow analysis like I’ve done here is not going to cut it for that purpose though. We will need to know about topological consistency, correct and complete mapping of turn restrictions, grade separations, lanes, traffic lights, and other salient features. There is only so much of that we can do without resorting to comparative analysis, but we can at least devise quantitative metrics for some of it.

Technical Background

  • I used the State and County (and equivalent) borders from the TIGER 2010 dataset to determine the study areas.
  • I used osm-history-splitter (by Peter Körner) to do the actual splitting. For this, I needed to convert the TIGER shapefiles to OSM POLY files, for which I used ogr2poly, written by Josh Doe.
  • I used Jochen Topf‘s osmium, more specifically osmjs, for the data processing. The script I ran on all the study areas lives in github.
  • I collated all the results using some python and bash hacking. I used the PostgreSQL COPY function to import the results into a PostgreSQL table.
  • Using a PostgreSQL view, I combined the analysis result data with the geometry tables (which I previously imported into Postgis using shp2pgsql).
  • I exported the views as shapefiles using ogr2ogr, which also offers the option of simplifying the geometries in one step (useful because the non-generalized counties shapefile is 230MB and takes a long time to load in a GIS).
  • I created the visualizations in Quantum GIS, using its excellent styling tool. I mostly used a quantiles distribution (for equal-sized bins) for the classes, which I tweaked to get prettier class breaks.

I’m planning to do an informal session on this process (focusing on the osmjs / osmium bit) at the upcoming OpenStreetMap hack weekend in DC. I hope to see you there!

OpenStreetMap Data Temperature – How It Was Done (And Do It Yourself!)


In my talk at State Of The Map 2011, I introduced a few new concepts relating to the local quality of OpenStreetMap data and its contributor community:

  • The Community Scorecard is a concise way to summarize the activity of a local OpenStreetMap community
  • Data Temperature attempts to capture the ‘warmth’ of a local community in one easily interpretable figure
  • The Temperature Map captures two of the most relevant metrics for way features – the age of the last update and the number of versions – in an abstracted image of the study area.

To generate the metrics and the geodata needed, I used Jochen Topf’s osmium framework. Because I am not a skilled C++ developer, I employed the osmjs Javascript interface to the framework to parse and analyze the city-sized OpenStreetMap files. After some delay, I have published the osmjs script together with some supporting files. Feel free to use it to do your own analyses, and please improve and build on it and share back. I’d love to see this script evolve into a more general purpose quality analysis tool.

If you missed my session at SOTM11, here‘s its page on the OpenStreetMap wiki, with links to all relevant resources.

Tutorial: Creating buffered country POLYs for OpenStreetMap data processing


OpenStreetMap represents a lot of data. If you want to import the entire planet into a PostGIS database using osmosis, you need at least 300GB of hard disk space and, depending on how much you spent on fast processors and (more importantly) memory, a lot of patience. Chances are that you are interested in only a tiny part of the world, either to generate a map or do some data analysis. There are several ways to get bite-sized chunks of the planet – take a look at the various planet mirrors or the cool new Extract-o-tron tool – but sometimes you may want something custom. For the data temperature analysis I did for State of the Map, I wanted city-sized extracts using a small buffer around the city border. If you want to do something similar – or are just interested in how to do basic geoprocessing on a vector file – this tutorial may be of interest to you. Instead of city borders, which I created myself from the excellent Zillow neighborhood boundary dataset, I will show you how to create a suitably generalized OSM POLY file (the de facto standard for describing polygon extracts, used by various OSM tools) that is appropriate for extracting a country from the OSM planet with a nice buffer around it.

Let’s get to work.

Preparing Quantum GIS

We will need to add a plugin that allows us to export any polygon from your QGIS desktop as an OSM POLY file. We can get that OSM POLY export plugin for Quantum GIS here.

Unzip the downloaded file and copy the resulting folder into the Python plugins folder. On Windows, if you used the OSGeo installer, that might be

C:\OSGeo4W\apps\qgis\python\plugins

See here for hints on where it may be for you.

The plugin should now appear in the Quantum GIS plugin manager (Plugins > Manage plugins…).
If it is not selected, do that now and exit the plugin manager.

Getting Country Borders

Easy. Download world borders from http://thematicmapping.org/downloads/world_borders.php

Unzip the downloaded file and open it in QGIS:

Geoprocessing step 1: Query

Open the Layer Query dialog by either right-clicking on the layer name or selecting Query… from the Layer menu with the TM_WORLD_BORDERS-0.3 layer selected (active).

Type "ISO2" = 'US' in the SQL where clause field and run the query by clicking OK.

Geoprocessing step 2: Buffering

The next step is to create a new polygon representing a buffer around an existing polygon. Because we already queried for the polygon(s) we want to buffer, there’s no need to select anything in the map view. Just make sure the TM_WORLD_BORDERS-0.3 layer is active and select Vector > Geoprocessing Tools > Buffer(s):

Make sure the input vector layer is TM_WORLD_BORDERS-0.3. Only the query will be affected, so we’re operating on a single country and not the entire world.

For Buffer distance, type 1. This is in map units. Because our source borders file is in EPSG:4326, this corresponds to 1 degree, which is about 69 miles (along the longitudinal axis, that measurement is only valid at the equator and decreases towards the poles). This is a nice-sized buffer for a country; you may want something larger or smaller depending on the size of the country and what you want to accomplish, so play around with the figure and compare results. Of course, if your map projection is not EPSG:4326, your map units may not be degrees and you should probably be entering much bigger values.

Select a path and filename for the output shapefile. Do not select ‘Dissolve buffer results’. The rest can be left at the default values. Push OK to run the buffer calculation. This can take a little while and the progress bar won’t move. Then you see:

Click Yes. Now we have a buffer polygon based on the US national border:

Geoprocessing step 3: Generalizing

We’re almost done, but the buffer we generated contains a lot of points, which will make the process of cutting a planet file slow. So we’re going to simplify the polygon some. This is also a QGIS built-in function.

Select Vector > Geometry tools > Simplify geometries:

Make sure your buffer layer is selected as the input. Set 0.1 (again, this is in map units) as the simplify tolerance. This defines how much the input features will be simplified: the higher this number, the more simplification.

Select a destination for the simplified buffer to be saved. Also select Add result to canvas. Click OK:

This dialog may not seem very promising, but it has worked. Also, I have sometimes gotten an error message after this process completes. Ignore these if you get them.

Geoprocessing step 4: resolving multipolygons

Now, if your simplified country border consists of multiple polygons (as is the case with the US), we have a slight problem. The POLY export plugin does not support multipolygons, so we need to break the multipolygon into single polygons. And even then, we will need to do some manual work if we want OSM .poly files for all the polygons. This is because the plugin relies on unique string attribute values to create different POLY files, and we do not have those because the polygons we are using are all split from the same multipolygon. So we need to either create a new attribute field and manually enter unique string values in it, or select and export the parts to POLY files one by one and rename the files before they get overwritten.

Finale: Export as POLY

I am going to be lazy here and assume I will only need the contiguous US, so I select the corresponding polygon. After that I invoke the plugin by selecting Plugins > Export OSM Poly > Export to OSM Poly(s):

The plugin will show a list of all the fields that have string values. Select ISO2 and click Yes. Next you will need to select a destination folder for your exported POLY files. Pick or create one and push OK.

This is it! Your POLY files are finished and ready to be used in Osmosis, osmchange and other tools that use it for data processing.

By the way: you can’t load POLY files into JOSM directly, but there’s a perl script to convert POLY files to OSM files that I used in order to visualize the result.

Taking the Temperature of local OpenStreetMap Communities


In my recent talk at the State Of The Map 2011 conference, I introduced the idea of Data Temperature for local OpenStreetMap data. In this blog post, I want to follow up on that, because my talk was only twenty minutes, covered a broader theme and thus did not allow me to elaborate on it much.

Let me first show a slide from the talk to give you an idea – or a reminder, if you were in the audience – of what we are talking about. (The complete talk is available as a streaming video, and the slides are available as well. I linked to both in my previous blog post).

Community activity visualized

Let’s break this image down before I explain how I arrived at the temperature of 77 degrees for San Francisco. The blob of red, orange, yellow, green and gray is an abstracted map of the city, displaying only the linear OpenStreetMap features for that city. The features are styled so that two salient characteristics of community activity become the defining visual elements: number of versions and date of last update. The number of versions, indicating how many updates a feature has received since its creation, is visualized using line thickness, a thicker line style indicating a feature that has seen more community updates. The time passed since a feature last saw an update from the community is visualized using a gray-red-green color gradient, where gray represents features that have not seen an update in two years or more, while green represents linear features that were ‘touched’ in the last month.

The result is an unusually abstracted view of a city – but one that I hope helps to convey a sense of local community activity. In my talk I argued that communities sharing local knowledge are what sets OpenStreetMap apart from other geodata resources. It is what sets Warm Geography apart from Cold Geography.

Data Temperature

For my talk, I wanted to take the Warm vs. Cold analogy one step further and devise a simple method to calculate the Data Temperature for a city or region. To do this, I decided I needed a little more information than just the version count and the time passed since the last update. Thinking about the US situation, I gathered that the TIGER import could provide me with some salient community statistics; the TIGER data is a reference point from which we can measure how much effort the local communities – or individual mappers – have put into fixing the imported data, enriching it and bringing it up to date.

From the current planet file, which I used because of resource constraints, you can only derive a limited understanding of the historical development of the OpenStreetMap data. For example, it is not possible to see how many unique users have been involved in contributing to the data for an area, only which users are represented in the current versions of the data. For individual features, it is not possible to determine their age. Features that have been deleted are not accessible. So all in all, by operating on the current planet file, we have a pretty limited field of view. When resources allow, I want to work more with the full history planet that has been available for a while now, and for which a tool set is starting to emerge thanks to the great work of Peter Körner, Jochen Topf and others.

These limitations being what they are, we can still derive useful metrics from the planet data. I devised a ‘Community Score Card’ for local OpenStreetMap data that incorporates the following metrics:

  • The percentage of users responsible for 95% of the current data. This metric tells us a lot about the skew in contributions, a phenomenon that has received considerable attention in the OSS domain [1] and is apparent in OpenStreetMap as well. The less skewed the contribution graph is, the healthier I consider the local community. Less skew means that there are more people putting in significant effort mapping their neighborhood. For the cities I looked at, this figure ranged from 5% to 26%. I have to add that this metric loses much of its expressiveness when the absolute number of contributors is low, something I did not take into account in this initial iteration. (A rough SQL sketch for this metric follows the list.)
  • The percentage of untouched TIGER roads. This metric provides some insight into how involved the local mappers are overall – TIGER data needs cleanup, so untouched TIGER data is always a sign of neglect. Also, it gives an idea of how well the local mappers community covers the area geographically. For the cities I looked into for the talk, this figure ranged from 4% in Salt Lake City (yay!) to a whopping 80% in Louisville, Kentucky.
  • The average version increase over TIGER. This simple metric overlaps somewhat with the previous one, but also provides additional insight into the amount of effort that has gone into local improvements of the imported TIGER road network.
  • The percentage of features that has been edited in the last three months and in the last year. This is the only temporal metric that is part of the Community Score Card. For a more in-depth temporal and historical analysis of OpenStreetMap data, we need to look at the full history planet file, which for this first iteration I did not do. Even so, these two metrics provide an idea of the current activity of the local community. It does not tell us anything about the historical arguments that might be able to explain that activity or lack thereof, however. For example, the local community may have been really active up to a year ago, leaving the map fairly complete, which might explain a diminished activity since. For our purpose though, these simple metrics do a pretty good job quantifying community activity.
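
To make that first metric a bit more concrete, here is a rough sketch of how it could be computed in SQL against a pgsnapshot import of a city extract, counting current way versions per user. This is just an illustration, not the exact method I used:

WITH per_user AS (
    -- current way versions per contributor
    SELECT user_id, count(*) AS n
    FROM ways
    GROUP BY user_id
),
ranked AS (
    SELECT user_id, n,
           sum(n) OVER (ORDER BY n DESC, user_id) AS running_total,
           sum(n) OVER ()                         AS grand_total
    FROM per_user
)
-- count a user if the users ahead of them do not yet cover 95% of the data
SELECT 100.0 * count(*) / (SELECT count(*) FROM per_user) AS pct_users_for_95
FROM ranked
WHERE running_total - n < 0.95 * grand_total;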

I applied a simple weighting to these metrics to arrive at the figure for the data temperature, and there’s really not much wisdom that went into it. My goal was to arrive at a temperature that would be conducive to conveying a sense of community activity, and would show a good range for the cities I analyzed for the talk. In a next iteration, I will attempt to arrive at a somewhat more scientifically sound approach.

The weighting factors are as follows:

  • Percentage of users responsible for 95% of the current data: 30
  • Percentage untouched TIGER roads: -30
  • Average version increase over TIGER road: 5
  • Percentage features edited in the last 3 months: 50
  • Percentage features edited in the last year: 40

I rounded the results to the nearest integer and added them to a base temperature of 32 degrees (freezing point of water on the Fahrenheit scale) to arrive at the final figure for the Data Temperature.
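
Spelled out as a formula – this is just one way to write down the weighting above, with the percentages going in as fractions between 0 and 1, the rounding applied to the weighted sum, and the variable names simply shorthand for the five metrics listed:

data temperature = 32 + round( 30 * users95 - 30 * untouchedTiger + 5 * versionIncrease + 50 * edited3months + 40 * editedLastYear )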

Visualization Is Hard

Looking at the Community Score Cards for the various cities I analyzed for the talk, and comparing them to the abstract maps representing the way versions and time since last edit, you will notice that the maps seem to paint a different picture than the Score Cards. Take a look at the San Francisco map and Score Card above, and compare that to the Baltimore one below.

We see that while Baltimore’s data is much ‘cooler’ at 59 degrees than San Francisco’s at 77, the Baltimore map looks quite promising for community activity. I can give a few explanations for this. (We are really getting into the visualization aspect of this, but I believe that is a very important dimension of conveying a concept as fluid as Data Temperature.) Firstly, the color defines this map visualization in a more fundamental way than the line thickness. The ‘green-ness’ of the Baltimore map leads us to believe that all is well, even though it is just one element of the Community Score Card. Moreover, not all elements of the Score Card are even represented in the visualization: untouched TIGER roads are pushed to the background by the thicker lines representing roads that did receive community attention. Lastly, scale plays a role in obfuscating differences. To fit the different cities in the same slide template, I had to vary the scale of the maps considerably. Because the line thickness is defined in absolute values and not dependent on map scale, the result can be somewhat deceiving.

Conclusion

I believe that this first attempt at a Data Temperature for local OpenStreetMap data, and its accompanying map visualizations, served its purpose well. The talk was well received and inspired interesting discussions. It set the stage for a broader discussion I want to have within the OpenStreetMap community about leveraging sentiments of recognition and achievement within local communities in order to help those communities grow and thrive.

There is a whole range of improvements to this initial iteration of the Data Temperature concept that I want to pursue, though. Most importantly, I want to use the full history of contributions instead of the current planet file. This will allow me to incorporate historical development of the map data as a whole and of individual contributor profiles. Also, I want to improve the Score Card with more characteristics, looking into the quality of the contributions as well as the quantity. Lastly, I want to improve the map visualizations to more accurately represent the Data Temperature.

I will post more Data Temperature visualizations and Score Cards when I think of an efficient way to do so. I collected data for some 150 US cities based on OpenStreetMap data from August 2011. If you would like a particular city posted first, let me know. Also, if you would like to know more about the tools and methods involved in preparing and processing data, I am willing to do a blog post expanding on those topics a bit.

[1] J. Lerner and J. Tirole, “Some simple economics of open source,” Journal of Industrial Economics (2002): 197–234.

Insert Coin To Play – my talk at State Of The Map 2011


UPDATE: I have re-uploaded the slides to Slideshare with the correct Community Score Cards — see the comments.

Phew! State Of The Map is over and it was so very good. In spite of the growth of OpenStreetMap and the conference, the grassroots spirit is still very much there, and that is what OpenStreetMap is about if you ask me.

Below are the slides from my talk. It was also recorded by Toby Murray, thanks Toby for all your effort recording and live-streaming several talks from your mobile phone! I am writing some more background about the data temperature concept that I introduced. I hope to publish that here before FOSS4G starts. I will be recycling that talk there, with a different title and some small tweaks — so if you have already seen it, you may want to plan your personal FOSS4G itinerary around me ;)

OpenStreetMap and Warm vs. Cold Geography


Denver, The Mile High City in Colorado, USA, will be the stage for two of the most prominent conferences in the domain of open geospatial software and data, FOSS4G and SOTM. I will be speaking at both conferences. The community is gearing up for these important and fun events, preparing their talks and discussing the hot topics of the conferences.

For me, the direction in which OpenStreetMap will be heading over the next few years is one of those main topics. I will address this topic myself from the angle of contributor motivation. Specifically, I will address the challenge posed by the extremely high churn rate that OpenStreetMap is coping with — less than one tenth of everyone who ever created an OpenStreetMap account goes on to become an active contributor. I will investigate which tactics from the gaming domain OpenStreetMap could use — tactics that have made Foursquare and StackOverflow so successful.

OpenStreetMap needs those flesh and blood contributors, because it is ‘Warm Geography’ at its core: real people mapping what is important to them — as opposed to the ‘Cold Geography’ of the thematic geodata churned out by the national mapping agencies and commercial street data providers; data that is governed by volumes of specifications and elaborate QA rules.

Don’t get me wrong — I am not denouncing these authoritative geodata sources. They have their mandate, and an increasing amount of authoritative, high quality geodata is now freely (beer and speech) available to the public — and I like to think that OpenStreetMap’s success played some part in this.

However, OpenStreetMap occupies its very own special niche in the domain of geodata with its Warm Geography — real people contributing their local knowledge of the world around them.

Will OpenStreetMap retain that unique position?

OpenStreetMap has grown very rapidly over the last three years, both in number of contributors and in data volume. For some countries and regions, OpenStreetMap is considered more complete and / or of higher quality than any other available data source. However much or little truth there may be in such a statement, the fact of the matter is that the relation between OpenStreetMap and authoritative or commercial geodata sources is being reconsidered.

On the one hand, VGI (Volunteered Geographic Information, the widely accepted misnomer that covers a wide range of geospatial information resources consisting of contributions by non-professionals) techniques are being introduced into the data collection processes of traditional producers. The degree of voluntarism varies: on one end of the spectrum, we see PND giant TomTom using data collected from their users to trigger better targeted road data updates; on the other end, the VGI experiments conducted by the USGS to help improve the National Map. In either case, we see the intrusion of crowdsourcing techniques into the traditionally closed domains of authoritative and commercial geospatial information.

On the other hand, we see authoritative and other Cold data being integrated into OpenStreetMap, by way of imports on various scales — some local, others covering entire nations.

I have a strong ambivalence towards data imports into OpenStreetMap. I have seen how they can spark and nurture the OpenStreetMap community in the regions affected — but I also envisage how detrimental they could be to OpenStreetMap as a whole. This touches on the very nature of OpenStreetMap, and I reiterate: OpenStreetMap is Warm Geography — real people contributing their local knowledge of the world around them. Will OpenStreetMap claim that particular space in the geospatial information landscape? Does it even want to? Or will it manifest itself as an open data repository in which Cold and Warm Geography are mixed together, resulting in something Lukewarm?

Warm and Cold Geography in OpenStreetMap side by side. Green lines, prevalent in Des Moines, mark imported TIGER data that has been reviewed by human OpenStreetMap contributors. Red data, overwhelmingly present in Omaha, a two hour drive to the west, marks dead TIGER data that has been sitting in OSM, untouched, since 2007.

One could argue that the large scale imports OpenStreetMap has already seen have probably helped spark the local efforts to improve the data and the community. That may be true (although it has never really been researched, and you can easily make a counterargument with Germany as your case), but I believe we need to look at the relation between imports / authoritative data and the community, and that OpenStreetMap should do so before any individual import is done or even considered. Let me give two examples as food for thought to wrap this up.

The TIGER line data import in the US is one I deem useful and productive. It is a low-quality data source in terms of geometry, but pretty good (as far as I can tell) in terms of metadata. Also, it represents features that are easily recognizable in the field and that map to our real-world experience with ease: the streets that we drive, walk, bike, address our mail to, and enter into our satnav devices. Importing that data provided initial OpenStreetMap content with two key properties that are essential for an import to ‘work’, in the sense that it serves the interests of the data provider, the OpenStreetMap community and OpenStreetMap as an entity.

First, it provides initial map content that serves to make OpenStreetMap less scary for both aspiring volunteer mappers and potential professional users. An empty map is not attractive for casual mappers — a category that will become more important in the future, even if right now the majority of mapping is done by a small number of mappers in most regions (read my observations on churn rate in OpenStreetMap in a related post).

Second, it provides content that constitutes a low barrier for improvement. The coarse, quirky geometry of the TIGER line segments can be easily fixed, especially now that nationwide high resolution imagery from Bing is available as a backdrop for the OpenStreetMap editors. As a novice to OpenStreetMap, you have an easy way in: just start by fixing some streets in Potlatch2, couldn’t be easier. The instant gratification of seeing your contributions appear on the main map almost immediately helps motivate the novice to continue contributing. This process, I believe, would be further aided if OpenStreetMap had more elements from gaming — competition, scoring, achievements, awards — but that will be the topic of my talk at SOTM ;).

In contrast, let me address my concerns with the land use import done recently in the Netherlands — see the current border of that import here to do a before/after comparison. This is authoritative data sourced from the national mapping agency via a loophole (the actual source dataset, Top10Vector, is not open data, but a derived product was deemed to have a license compatible with OpenStreetMap’s). Its addition to the OpenStreetMap data body surely makes for a map that is visually very appealing, but it does not meet both of the requirements for it to be warm, living data in OpenStreetMap. Land use data, in contrast to street data, is much more abstract and also much more difficult to survey, especially if you’re an OpenStreetMap novice. On top of that, it’s easy to break things. I can easily see how an OpenStreetMap contributor, even if he’s not a novice, would be daunted by an editor view that looked like this (see inset).

The OpenStreetMap web-based editor Potlatch centered on an area with almost exclusively imported data north of Amsterdam, The Netherlands.

Mind you, this is almost 100% imported data.

So to conclude: I am not against mixing and matching authoritative data and OpenStreetMap, but I firmly believe the distinction between mixing and matching should be heeded more carefully than it has been in some cases. Don’t mix when it is not going to be mutually beneficial, keeping in mind the requirements I laid out in the two cases above. This is necessary for OpenStreetMap to retain its unique position in the geodata domain as a Warm Resource, and that is the only way it will survive in the longer run.

I have been invited to discuss the topic of OpenStreetMap versus Authoritative Data in a panel organized by Eric Wolf of the US Geological Survey at the upcoming State Of The Map conference in Denver. If you’re interested in this topic and would like a broader view on it, I invite you to attend. It takes place on Sunday, Sep. 11th around noon. See you in Denver!

OpenStreetMap Usability – Converting More Sign-Ups To Active Contributors


Yet Another WhereCampEU 2011 / Berlin Report

From Data To Stories

Chris Osborne did a great job making the second European WhereCamp happen. He drew the curtain on the two-day unconference with an inspiring talk on big data, which his company ITO World is in the business of crunching and visualizing. Showing some of their impressive visualizations, ranging from personal location history to the effects of the London congestion charge, Chris drove an important point home very convincingly:

We don’t need data, we need stories.

Data Usability

In a world that is increasingly overflowing with big data – be it from the government open data movement or from crowdsourcing initiatives such as OpenStreetMap – we need to be able to tell the story that is hidden in that swamp of raw data. Stories are a language that everyone understands. Speaking that language means that almost everyone can engage with big data in a meaningful way. I like to look at what Chris and his colleagues do as data usability – creating a meaningful interface between people and data.

Two of my contributions to WhereCampEU also dealt with this topic – but in a very different way. Data usability has always been a challenge for OpenStreetMap, in more than one way. On the consumption side of things, OpenStreetMap data is not very accessible for those who want to do something other than making a map. I addressed a part of this challenge in a talk about dealing with historical OpenStreetMap data. More on that a little later.

Churn Rate

Shifting the focus to contributing to OpenStreetMap, adding your own knowledge to the map, the numbers speak volumes. Almost 70% of those who sign up for an OpenStreetMap account never proceed to make any contribution to the map. As there really is no other compelling reason to sign up for an OpenStreetMap account, something is missing in the input interface between those 270,000 (and change) people and the OpenStreetMap data. That interface consists mainly of the OpenStreetMap editors – there’s a whole range of them carrying an eclectic array of names: JOSM, Potlatch, Merkaartor, MapZen, Vespucci, to name a few. Some are web-based, others desktop applications or apps for smartphones. They are elaborate pieces of work and I use some of them every single day. What they all fail to do, however, is compel those 70% of new sign-ups to actually start making contributions. They do not trigger the initial motivation to become an active OpenStreetMap contributor. It is hard to say exactly why they fail to do that, but WhereCampEU offered a number of suggestions.

Address Hunter

Skobbler – the German-Romanian company that builds innovative things using OpenStreetMap data – gave us a sneak peek of a spare-time project of theirs. It is called AddressHunter and it is a multi-player, location-based, fast-paced action game. The goal? Hunt down as many unmapped addresses as you can before your opponents do. When you find an address, you shoot a picture to prove that you were there, confirm and move on. When all the addresses are hunted down, the game ends. The group or individual with the most hunted addresses wins the game and gets bonus points that go towards their overall rank.

We got to try AddressHunter out in teams of 3-5 WhereCampers, each team under the supervision of a Skobbler team member. We played using iPhones that they provided – but the game is all in HTML[1] and thus by nature multi-platform. Even though the game was still in beta, with a planned release later this summer, it looked awesome and already felt very polished and well thought out. There were a few glitches in gameplay but nothing serious. The version we played had a 17th century Great Explorer kind of theme; Skobbler plans to offer different versions of the game appealing to specific target groups. And they have some more cool plans for it.

What makes this game not just great but awesome? All hunted-down addresses are added to OpenStreetMap directly! AddressHunter picks up the tedious and arguably boring task of adding addresses to OpenStreetMap and transforms it into something that a) anyone can do and b) is fun. With these ingredients, AddressHunter gives OpenStreetMap a part of that missing interface between people and data, and may very well help bring down that enormous churn rate of 70%.

But what about data quality? Will the address contributions through AddressHunter be as good as those from real OpenStreetMappers using real editors like JOSM?

No. But is that a bad thing?

Yes, but it’s not the end of the world. Let’s take a closer look at the contribution metrics. OpenStreetMap now has just over 400,000 contributors. If the churn rate figure of 68% is still valid – it was based on data from early 2010 – only 128,000 of those will have made any contributions to the map. Of those, a further breakdown shows, only 19% are considered active, that is, have edited in the last three months. That is only about 24,000 people. While that seems like a lot, it is not. Going with the very rough estimate that 10% of Earth’s 150 million square kilometers of land mass is populated, that leaves an area about the size of Madrid for each active contributor to map. What I’m trying to convey here is that OpenStreetMap could do with some more active contributors, and that the challenge seems not to be in attracting them, but rather in retaining them – converting sign-ups to contributors. That is the road that needs paving, and cool initiatives like these provide some of the paving stones.
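
For the skeptics, here is a quick back-of-the-envelope check of those numbers (the figure of roughly 600 square kilometers for Madrid is my own rounded assumption):

signups = 400_000        # total OpenStreetMap accounts at the time of writing
churn = 0.68             # share of sign-ups who never make an edit (early 2010 figure)
active_share = 0.19      # share of actual contributors who edited in the last three months
contributors = signups * (1 - churn)      # roughly 128,000 people who ever edited
active = contributors * active_share      # roughly 24,000 currently active mappers
land_km2 = 150_000_000   # Earth's land mass in square kilometers
populated = 0.10         # very rough estimate: 10% of the land mass is populated
km2_per_active_mapper = land_km2 * populated / active
print(round(contributors), round(active), round(km2_per_active_mapper))
# prints roughly: 128000 24320 617, i.e. about the size of Madrid (~600 km2)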

Tweet your node

I put forward another idea that could help improve this conversion rate. It had been sitting in the back of my head for a while as I was working on a project to get more CCTV cameras into OpenStreetMap. To allow casual contributions, I discussed a few ideas for low-barrier, single-POI contributions to OpenStreetMap. We came up with the idea of using Twitter as an interface to OpenStreetMap. On the U-Bahn on my way from Kreuzberg to the WhereCampEU venue, I decided to run a session about the idea to get some feedback.

The basic idea is very simple: make sure your phone adds a geographical coordinate to your tweets – a standard feature in most mobile Twitter clients nowadays – and tweet something like

amenity:pub name:Bellman Bar #osmadd

This would then be picked up by a Twitter scraper that would parse the content into OpenStreetMap tags and add the POI. This system could be used directly through a Twitter client, but also by third-party applications.
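
To make that concrete, here is a minimal, purely illustrative sketch of how such a scraper might parse the tweet text into tags; only the key:value convention and the #osmadd hashtag come from the proposal above, the rest is made up:

import re

def parse_osmadd_tweet(text):
    # Turn a tweet like 'amenity:pub name:Bellman Bar #osmadd' into a dict of OSM tags.
    # Returns None when the #osmadd hashtag is missing.
    if "#osmadd" not in text.lower():
        return None
    body = re.sub(r"#osmadd", "", text, flags=re.IGNORECASE).strip()
    # Every 'key:' starts a tag; its value runs until the next key begins.
    keys = list(re.finditer(r"(\w+):", body))
    tags = {}
    for i, match in enumerate(keys):
        value_end = keys[i + 1].start() if i + 1 < len(keys) else len(body)
        tags[match.group(1)] = body[match.end():value_end].strip()
    return tags

print(parse_osmadd_tweet("amenity:pub name:Bellman Bar #osmadd"))
# prints: {'amenity': 'pub', 'name': 'Bellman Bar'}

From there, the scraper would attach the tweet’s coordinate and create the node through the OpenStreetMap API; that step is exactly where the attribution question below comes in.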

The Return Of Anonymous Edits

The main drawback of this is that the contributions could never be traced back to an individual, which is Bad Behaviour in OpenStreetMap. Anonymous edits were abandoned in 2007 for that reason. Most of the discussion after my short introduction of the idea was around this issue. I have thought about it some more since and came up with an approach that would also stimulate conversion. It is too big to fit in the margin though.

Tweet Finesse

I also got valuable feedback on the format of the tweet. Firstly, encoding tags like this hardly makes for a casual contribution, because some intimate knowledge of the OpenStreetMap tag definitions is required. This could be addressed by having different hashtags and allowing only a name as the content of the tweet, like

Bellman Bar #osm-pub

, or venturing into the dangerous lands of natural language processing.

Secondly, using a hashtag pollutes the timeline of the Twitter user with tweets that are meaningless, at least to human followers. This can be avoided by using an actual Twitter @username as the magic word, like

@osm-pub Bellman Bar

. This way, the tweet will not show up in human followers’ timelines if they do not follow @osm-pub as well, reducing timeline pollution.
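
For completeness, a sketch of how both lighter-weight formats could be resolved against a small set of presets (the preset tags and handle spellings are placeholders I made up):

# Hypothetical presets: one bundle of OpenStreetMap tags per hashtag or @-handle.
PRESETS = {
    "#osm-pub": {"amenity": "pub"},
    "@osm-pub": {"amenity": "pub"},
    "#osm-cafe": {"amenity": "cafe"},
}

def parse_preset_tweet(text):
    # 'Bellman Bar #osm-pub' or '@osm-pub Bellman Bar' both become amenity=pub plus a name.
    for token, tags in PRESETS.items():
        if token in text:
            return dict(tags, name=text.replace(token, "").strip())
    return None

print(parse_preset_tweet("Bellman Bar #osm-pub"))   # {'amenity': 'pub', 'name': 'Bellman Bar'}
print(parse_preset_tweet("@osm-pub Bellman Bar"))   # {'amenity': 'pub', 'name': 'Bellman Bar'}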

I have started implementing this, incorporating the feedback gathered at the WhereCampEU session – so more on this soon! I wanted to include it in this post because I believe it can be another example of paving the first mile that leads from sign-up to active contribution to OpenStreetMap.

OpenStreetMap Usability, Status Quo

The last WhereCampEU session I want to address in this context of OpenStreetMap usability is Patrick Weber’s. A few weeks before WhereCampEU, I noticed his blog post describing initial results of the extensive OpenStreetMap usability research he and his colleagues at UCL are carrying out. The study also got picked up by the Strategic Working Group and (re-)sparked a debate on the layout of the home page. As it turned out, people unfamiliar with the OpenStreetMap web site took more than 6 seconds to find the geographical name search box. For a web site that is all about places and maps, that is a Bad Thing.

In his session, he talked in more depth about the still ongoing usability study. Using eye tracking technology and more traditional usability study methods, Weber and his colleagues observed new OpenStreetMap contributors finding their way through the sign-up process and towards their first meaningful contribution. It was interesting and somewhat unsettling to hear about all the snags that new contributors can – and will – hit on their way. A simple task like zooming in on the map in the Potlatch editor proved daunting enough for some to either give up or screw up their first editing effort.

Patrick’s study is ongoing, and he will be presenting more results at the upcoming State of the Map in Vienna. This is an important effort that I believe will expose many challenges on the road towards a healthier conversion rate from sign-ups to contributors.

WhereCampEU produced a few good ideas for paving that first mile, and what I am writing about here is only a tiny fraction of everything that was going on. I cannot wait for the next edition. See you somewhere in Europe!

[1] The app is wrapped in the PhoneGap framework to create a native platform app, necessary for continuous GPS and camera access.