Life After Redaction: Detecting Remapped Ways


There are some pretty awesome tools out there to help with the remapping effort after the redaction bot made its sweep across the OpenStreetMap database. (Does this sound like Latin to you? Read up on the license change and the redaction process here.) Geofabrik’s OSM Inspector shows all the objects affected by the redaction. It is likely the most comprehensive view of the result of the license change redaction. Numerous other tools are listed on the Remapping wiki page. Most of these tools will show you, in some shape or form, the effects of the redaction process: which nodes, ways and relations have been deleted or reverted to a previous, ‘ODbL Clean’ version of the object.

I want to see if we can take it a step further and determine whether an object has already been remapped. This is useful for monitoring remapping progress, as well as for deciding where to focus when you want to contribute to the remapping effort.

For now, I am going to stick with ways. I think maintaining, or reinstating, a good quality routable road network is an important objective for OSM anyway, and especially at this point in time, when many roads are broken due to redaction.

Let’s start by locating a deleted way here in the US using my own Redaction Affected / Deleted Ways Map. That’s easy enough around severely affected Austin, TX:

 

I am going to use three comparison parameters to determine whether this way is likely to already have been remapped:

  1. The Hausdorff distance between the deleted geometry and any new geometries in that area
  2. The highway type of the deleted and any new geometries in that area
  3. The length of the deleted and any new geometries in that area

For this to work, I will need a table with all the ways deleted by the redaction bot. This is easy enough to compile by looking at the changesets created by the redaction account, but Frederik Ramm was kind enough to send the list of OSM IDs to me, so all I had to do was extract the deleted ways by ID from a pre-redaction database. The comparison can then be run on that table and a ways table from a current planet:

 

Looking at the top candidate, with object ID 172171755, it is immediately clear that this way has very likely already been remapped: it has a very small Hausdorff distance to the deleted way 30760760, it is tagged with the same highway= type, and the lengths are almost identical.
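
For reference, here is a minimal sketch of what such a comparison query could look like in PostGIS. The table and column names (deleted_ways, current_ways, highway, geom) are assumptions rather than my actual schema, the 100 m candidate search radius is an arbitrary choice, and both geometry columns are assumed to be in a projected, metre-based spatial reference system:

    -- deleted_ways: ways removed by the redaction bot (from the pre-redaction database)
    -- current_ways: ways from a current planet import
    SELECT d.osm_id                              AS deleted_id,
           c.osm_id                              AS candidate_id,
           ST_HausdorffDistance(d.geom, c.geom)  AS hausdorff_m,
           d.highway                             AS deleted_highway,
           c.highway                             AS candidate_highway,
           ST_Length(d.geom)                     AS deleted_length_m,
           ST_Length(c.geom)                     AS candidate_length_m
    FROM   deleted_ways d
    JOIN   current_ways c ON ST_DWithin(d.geom, c.geom, 100)  -- only consider nearby candidates
    WHERE  d.osm_id = 30760760                                 -- the deleted way from the example
    ORDER  BY ST_HausdorffDistance(d.geom, c.geom)
    LIMIT  5;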

Sure enough, when I fire up JOSM and load this area, it is clear that it has been remapped:

 

(Selected objects are version 1 and created after July 18, 2012).

I need to do some more testing and tweaking on the query, but I will soon integrate this in the Redaction Affected / Deleted Ways Map.

 

The State Of The OpenStreetMap Road Network In The US


Looks can be deceiving – we all know that. Did you know it also applies to maps? To OpenStreetMap? Let me give you an example.

Head over to osm.org and zoom in to an area outside the major metros in the United States. What you’re likely to see is an OK looking map. It may not be the most beautiful thing you’ve ever seen, but the basics are there: place names, the roads, railroads, lakes, rivers and streams, maybe some land use. Pretty good for a crowdsourced map!

What you’re actually likely looking at is a bunch of data that was imported – not crowdsourced – from a variety of sources ranging from the National Hydrography Dataset to TIGER. This data is at best a few years old and, in the case of TIGER, a topological mess that sometimes bears very little resemblance to the actual situation on the ground.

TIGER alignment example

The horrible alignment of TIGER ways, shown on top of an aerial imagery base layer. Click on the image for an animation of how this particular case was fixed in OSM. Image from the OSM Wiki.

For most users of OpenStreetMap (not the contributors), the only thing they will ever see is the rendered map. Even for those who are going to use the raw data, the first thing they’ll look at to get a sense of the quality is the rendered map on osm.org. The only thing that the rendered map really tells you about the data quality, however, is that it has good national coverage for the road network, hydrography and a handful of other feature classes.

To get a better idea of the data quality that underlies the rendered map, we have to look at the data itself. I have done this before in some detail for selected metropolitan areas, but not yet on a national level. This post marks the beginning of that endeavour.

I purposefully kept the first iteration of analyses simple, focusing on the quality of the road network, using the TIGER import as a baseline. I did opt for a fine geographical granularity, choosing counties (and equivalent) as the geographical unit. I designed the following analysis metrics; a rough sketch of how they could be computed follows after the list:

  • Number of users involved in editing OSM ways – this metric tells us something about the amount of peer validation. If more people are involved in the local road network, there is a better chance that contributors are checking each other’s work. Note that this metric covers all linear features found, not only actual roads.
  • Average version increase over the TIGER imported roads – this metric provides insight into the amount of work done on improving TIGER roads. A value close to zero means that very little TIGER improvements were done for the study area, which means that all the alignment and topology problems are likely mostly still there.
  • Percentage of TIGER roads – this says something about contributor activity entering new roads (and paths). A lower value means more new roads added after the TIGER import. This is a sign that more committed mappers have been active in the area — entering new roads arguably requires more effort and knowledge than editing existing TIGER roads. A lower value here does not necessarily mean that the TIGER-imported road network has been supplemented with things like bike and footpaths – it can also be caused by mappers replacing TIGER roads with new features, for example as part of a remapping effort. That will typically not be a significant proportion, though.
  • Percentage of untouched TIGER roads – together with the average version increase, this metric shows us the effort that has gone into improving the TIGER import. A high percentage here means lots of untouched, original TIGER roads, which is almost always a bad thing.
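
To make the metrics concrete, here is a rough sketch of how they could be computed per county from a PostGIS database. The schema (a ways table with user_id, version and an hstore tags column, plus a counties table) and the use of a tiger:* tag as the marker for TIGER-imported ways are assumptions; my actual runs used osmjs rather than SQL.

    -- ways(osm_id, user_id, version, tags hstore, geom)
    -- counties(fips, name, geom)
    SELECT ct.fips,
           COUNT(DISTINCT w.user_id)                        AS n_users,
           -- TIGER ways came in at version 1, so the increase is version - 1
           AVG(w.version - 1) FILTER (WHERE w.tags ? 'tiger:county')
                                                            AS avg_tiger_version_increase,
           100.0 * COUNT(*) FILTER (WHERE w.tags ? 'tiger:county') / COUNT(*)
                                                            AS pct_tiger_ways,
           100.0 * COUNT(*) FILTER (WHERE w.tags ? 'tiger:county' AND w.version = 1)
                 / NULLIF(COUNT(*) FILTER (WHERE w.tags ? 'tiger:county'), 0)
                                                            AS pct_untouched_tiger
    FROM   counties ct
    JOIN   ways w ON ST_Intersects(w.geom, ct.geom)
    GROUP  BY ct.fips;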

Analysis Results

Below are map visualizations of the analysis results for these four metrics, on both the US State and County levels. I used the State and County (and equivalent) borders from the TIGER 2010 dataset for defining the study areas. These files contain 52 state features and 3221 county (and equivalent) features. Hawaii is not on the map, but the analysis was run on all 52 areas (the 50 states plus DC and Puerto Rico – although the planet file I used did not contain Puerto Rico data, so technically there are valid results for only 51 study areas on the state level).

I will let the maps mostly speak for themselves. Below the results visualisations, I will discuss ideas for further work building on this, as well as some technical background.

Map showing the number of contributors to ways, by state

Map showing the average version increase over TIGER imported ways, by state

Map showing the percentage of TIGER ways, by state

Map showing the percentage of untouched TIGER ways, by state

Map showing the number of users involved in ways, by county

Map showing the average version increase over TIGER imported ways, by county

Map showing the percentage of TIGER ways, by county

Map showing the percentage of untouched TIGER ways, by county

Further work

This initial stats run for the US motivates me to do more with the technical framework I built for it. With that in place, other metrics are relatively straightforward to add to the mix. I would love to hear your ideas; here are some of my own.

Breakdown by road type – It would be interesting to break the analysis down by way type: highways / interstates, primary roads, other roads. The latter category accounts for the majority of the road features, but does not necessarily see the most intensive maintenance by mappers. A breakdown of the analysis will shed some light on this.

Full history – For this analysis, I used a snapshot Planet file from February 2, 2012. A snapshot planet does not contain any historical information about the features – only the current feature version is represented. In a next iteration of this analysis, I would like to use the full history planets that have been available for a while now. Using full history enables me to see how many users have been involved in creating and maintaining ways through time, and how many of them have been active in the last month / year. It also offers an opportunity to identify periods in time when the local community has been particularly active.

Relate users to population / land area – The absolute number of users who contributed to OSM in an area is only mildly instructive. It’d be more interesting if that number were related to the population of that area, or to the land area. Or a combination. We might just find out how many mappers it takes to ‘cover’ an area (i.e. get and keep the other metrics above certain thresholds).

Routing-specific metrics – One of the most promising applications of OSM data, and one of the most interesting commercially, is routing. Analyzing the quality of the road network is an essential part of assessing the ‘cost’ of using OpenStreetMap in lieu of other road network data that costs real money. A shallow analysis like the one I’ve done here is not going to cut it for that purpose, though. We will need to know about topological consistency, correct and complete mapping of turn restrictions, grade separations, lanes, traffic lights, and other salient features. There is only so much of that we can do without resorting to comparative analysis, but we can at least devise quantitative metrics for some of these aspects.

Technical Background

  • I used the State and County (and equivalent) borders from the TIGER 2010 dataset to determine the study areas.
  • I used osm-history-splitter (by Peter Körner) to split the planet file into the individual study areas. For this, I needed to convert the TIGER shapefiles to OSM POLY files, for which I used ogr2poly, written by Josh Doe.
  • I used Jochen Topf’s osmium, more specifically osmjs, for the data processing. The script I ran on all the study areas lives on GitHub.
  • I collated all the results with some Python and bash hacking, and used the PostgreSQL COPY command to import them into a PostgreSQL table.
  • Using a PostgreSQL view, I combined the analysis result data with the geometry tables (which I had previously imported into PostGIS using shp2pgsql); this step is sketched below the list.
  • I exported the views as shapefiles using ogr2ogr, which also offers the option of simplifying the geometries in one step (useful because the non-generalized counties shapefile is 230 MB and takes a long time to load in a GIS).
  • I created the visualizations in Quantum GIS, using its excellent styling tool. I mostly used a quantiles classification for the classes (which puts roughly the same number of areas in each bin), and then tweaked the result to get prettier class breaks.
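
As a rough illustration of the COPY and view steps above – all table and column names here are made up for illustration, not my actual schema:

    -- Load the collated per-county metrics produced by the osmjs runs.
    -- results(fips, n_users, avg_version_increase, pct_tiger, pct_untouched_tiger)
    COPY results FROM '/tmp/county_results.csv' WITH (FORMAT csv, HEADER true);

    -- Join the metrics to the county geometries imported with shp2pgsql;
    -- this view is what gets exported with ogr2ogr and styled in Quantum GIS.
    CREATE VIEW county_results_geom AS
    SELECT c.geoid, c.name, c.geom,
           r.n_users, r.avg_version_increase, r.pct_tiger, r.pct_untouched_tiger
    FROM   county_borders c
    JOIN   results r ON r.fips = c.geoid;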

I’m planning to do an informal session on this process (focusing on the osmjs / osmium bit) at the upcoming OpenStreetMap hack weekend in DC. I hope to see you there!

Visualizing geospatial data quality


In the coming months, I will be working on how to measure the quality of geospatial information, and visualizing the results of quality analysis. The actual indicators for quality are still to be defined, but will be along the lines of

  • spatial density – how many features of a certain type does dataset A have, and how many does dataset B have?
  • temporal quality – what is the age of the data? How much time has passed since survey, publishing?
  • crowd quality – what I call the ‘5th dimension of spatial data quality’. This is more complex; a separate post on it will follow.

OpenStreetMap 'cheat sheet' mug showing the most used tags.

‘Crowd Quality’ has many dimensions. It is about peer review strength: how many surveyors have ‘touched’ a feature? How many surveyors are responsible for area X? It has several consistency components as well. One is internal attribute consistency: to what extent does the data conform to a set of core attributes? Another is spatial and temporal quality consistency: considering a larger region, does the data show consistent measurements for the spatial and temporal quality indicators described above?

Quality analysis is an important issue for Volunteered Geographic Information projects like OpenStreetMap, because their data is under constant, strong scrutiny: it’s open, so it’s easily accessible, and it’s very easy to take cheap shots at extensive voids in the map. Because of this openness, professional users have strong reservations about the quality of the data: there is almost no barrier to entry into the OpenStreetMap community – provide a username and an email address and you’re good to go – and free to delete all the data for Amsterdam, for example.

In a community of 200,000, map vandalism of such magnitude will be swiftly detected and reverted, and as such should not even be the biggest concern of potential users of VGI data. Smaller acts of map vandalism, however, might go undetected for a longer period of time, if they are detected at all. Moreover, with OpenStreetMap picking up momentum as it is currently doing, there are a lot of new aspiring surveyors joining every day. Even when they all sign up and start adding data with the best intentions, ‘newbies’ are bound to get it wrong at first, inadvertently adding a stretch of freeway in their residential neighborhood, or unintentionally moving features around when all they want to do is add their local pub. Even if the community tends to react to map errors – inadvertent or not – swiftly and pro-actively, the concerns potential users have about the quality of the data are legitimate. VGI is anarchy, and where there is anarchy, there are no guarantees.

The need for quality analysis also arises from within the VGI communities themselves. As a VGI project matures, contributors are likely to shift their attention to details. This can certainly be said for OpenStreetMap, where some regions are nearing or have reached completion of the basic geospatial features. A quick glance at the current map will no longer be enough to decide how and where to direct your surveying and mapping effort. Data and quality analysis tools are needed to aid contributors in their efforts. These can be really simple tabular comparisons; in many German cities, for example, OpenStreetMap contributors have acquired complete and up-to-date street name lists from the local council, which they compare to the named streets that exist in the OpenStreetMap database. This effort (Essen, Germany here) yields a simple list of missing street names, which can then be targeted for mapping.
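
A comparison like the Essen one boils down to very little SQL. As a minimal sketch, assuming the official list has been loaded into a council_streets(name) table and a local OSM extract into an osm_ways(osm_id, name) table (both hypothetical names):

    -- Official street names that do not yet appear as a named way in OSM
    SELECT cs.name
    FROM   council_streets cs
    LEFT   JOIN osm_ways w ON lower(w.name) = lower(cs.name)
    WHERE  w.osm_id IS NULL
    ORDER  BY cs.name;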

More complex and versatile data quality analysis tools are being developed as well. Let me conclude this article with a few examples that give some idea of how the results of my quality analysis research could be visualized.

OpenStreetBugs

Not an automated data analysis tool, this web site allows for simple map bug reporting. It was designed to provide a no-barrier way to report errors on the map: you do not even need to be registered as an OpenStreetMap user to use it. It provides some indication of data quality, and it can be used by OpenStreetMap contributors to fix reported errors quickly; the web site automatically provides a link to the web-based OpenStreetMap editor, Potlatch, for every reported error.

Visual comparison: Map Compare and FLOSM

An often-asked question pertaining to the data quality of OpenStreetMap is: how does OpenStreetMap compare to TeleAtlas or NAVTEQ, the two major commercial vendors of street data? While comparing the spatial quality is in itself not a complicated task, you need access to both data sets in order to actually do it.

Map Compare

FLOSM

TeleAtlas and NAVTEQ data is expensive, so not many are in a position to actually make this comparison. In the course of my research, I will certainly perform a number of these analyses, as I am in the fortunate position to have easy access to commercial spatial data.

A simple but effective way to visually compare two spatial data sets is to overlay them in GIS software, or in a web mapping application. Making such overlay web applications available is generally discouraged in VGI communities, as it is thought to encourage ‘tracing’ data from proprietary sources. This is a violation of the licenses for almost all commercial spatial data, and could thus mean legal trouble for VGI projects.

Nevertheless, some visual comparison tools do exist. Map Compare presents a side-by-side view of OpenStreetMap and Google Maps, allowing for easy and intuitive exploratory comparison of the two. FLOSM takes it a step further with a full-on overlay of TeleAtlas data on top of OpenStreetMap data.

Automated analysis: KeepRight and OSM Inspector

OSM Inspector

KeepRight

The tools we’ve seen so far do not provide analysis intelligence themselves; they simply display the factual data and leave it to the user to draw conclusions. Another category of quality assurance tools takes the idea a step further and performs different spatial data quality analyses and displays the results in a map view.

German geo-IT company Geofabrik, also responsible for the Map Compare tool mentioned earlier, publishes the widely used OSM Inspector tool, which can be used to perform a range of data quality analyses on OpenStreetMap data. It can effectively visualize topology issues and common tagging errors. Input for the tool’s functionality and for extending its range of visualizations comes from the community. A recent addition requested by the Dutch community is a visualization that shows the Dutch street data that has not been ‘touched’ since 2007, when AND donated their street data for the Netherlands, effectively completing the Dutch road network in OpenStreetMap. This particular visualization helps Dutch OpenStreetMap contributors establish which features have not yet been checked since they were imported. A similar tool was put in place when TIGER data from the US Census Bureau was imported into OpenStreetMap in 2008.

KeepRight takes a similar approach to OSM Inspector, analysing OpenStreetMap data for common errors and inconsistencies and displaying them in a web map application.

While these tools are extremely useful for OpenStreetMap contributors looking to improve the data and correct mistakes, they are not particularly useful for visualizing quantitative data quality research outcomes, as those outcomes will be aggregated, generalized data.

For many of the ‘Crowd Quality’ indicators, I am probably going to take a grid approach: establishing quantifiable indicators for Crowd Quality and calculating them for each cell in the grid. What that grid will look like is actually also a matter of debate – it would depend on the quality indicator measured, and on the characteristics of the real world situation referenced by that grid cell.
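
As a first idea of what such a grid calculation might look like, here is a sketch in PostGIS that builds a regular grid and counts distinct contributors per cell. The bounding box, the cell size and the ways(osm_id, user_id, geom) schema are placeholder assumptions, not settled choices:

    -- Build a 60 x 60 grid of 0.05-degree cells and count contributors per cell.
    WITH grid AS (
      SELECT i, j,
             ST_MakeEnvelope(4.0 + i * 0.05, 51.0 + j * 0.05,
                             4.0 + (i + 1) * 0.05, 51.0 + (j + 1) * 0.05, 4326) AS cell
      FROM   generate_series(0, 59) AS i,
             generate_series(0, 59) AS j
    )
    SELECT g.i, g.j,
           COUNT(DISTINCT w.user_id) AS n_contributors   -- peer review strength per cell
    FROM   grid g
    LEFT   JOIN ways w ON ST_Intersects(w.geom, g.cell)
    GROUP  BY g.i, g.j;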

To get an idea of what a grid visualization pertaining to quality could look like, it’s interesting to look at the visualization for the Aerial Imagery Tracing project run by the German OpenStreetMap community. A set of high resolution aerial photos was made available to OpenStreetMap and integrated into map editing software for the purpose of tracing features. Some tools were developed to assist in completing this effort; amongst them, a grid overlay visualizing the progress for each grid cell. No automated analysis is performed; rather, contributors are asked to scrutinize the grid cells themselves and rate completeness on several indicators. Although the pilot project was completed some time ago, the visualization is still online.

[Edit] This blog post goes into the technicalities of setting up a grid in PostGIS.

Priceless?


Volunteered Geographic Information

Free, Priceless Or Somewhere In Between?

This is the title that has been popping into my head since last summer. I am writing it down because it encompasses, in a very general sense, the themes that I want to cover in my dissertation, and thus serves as a guide while I try to elaborate on them.

I have actually already written some paragraphs elaborating on the themes and ideas that follow, but I want to force myself to touch upon them concisely here.

Volunteered Geographic Information (VGI) is a concept that has not been around for a very long time. Geographic Information has, however: it is what maps are made out of, and what your car navigation device relies on to guide you. Traditionally, Geographic Information is collected, processed and used by professionals, but this no longer holds true: Geographic Information has undergone a process of democratization, both in the usage dimension and in the collection and processing dimension. People are now used to dealing with Geographic Information in different contexts, and have started to pool resources to collectively build repositories of Geographic Information, to facilitate the democratization of the entire ecosystem of Geographic Information.

OpenStreetMap is the most prominent of these efforts, and one in which I have been actively involved since early 2007. Since its inception in 2004, it has grown into a worldwide collaborative effort involving more than 100,000 contributors. In some regions, the maps available from OpenStreetMap are so rich and complete that they are used instead of commercially available map data.

I realize that I need to come up with some examples here, and some numbers that give an indication of how OpenStreetMap has grown, but I am on a train, blissfully disconnected from the internet, so you will just have to bear with me for now. But believe me, it’s getting big fast – at a rate that makes me worried about the validity of any quantitative research results that I might present in the context of this dissertation. But this will have to be dealt with in some future note.

Let us assume for now that OpenStreetMap – there are other VGI efforts around, and they will need to be touched upon as well – is indeed starting to occupy a significant share of the commercial market for Geographic Information. That means that OpenStreetMap data represents a commodity and, as such, economic value. As OpenStreetMap data is available at no cost, however, this value is not quantified in the marketplace. This poses intriguing questions:

What is this freely available OpenStreetMap data actually worth?

How do you even begin to measure the value of something that is not subject to the usual economic market mechanisms?

When dealing with value, I believe I cannot omit the concept of quality, especially in this context. Any VGI effort relies on volunteers collecting data in their spare time. While some regions have very active communities, getting together to discuss progress and plan improvements to the map, checking and correcting each other’s contributions, other regions rely on single, isolated individuals contributing to the map – or worse: no-one contributing at all. The resulting picture is one of spotty coverage: very densely mapped regions exist side by side with sparsely covered regions. More questions arise!

Is it possible to define the quality of volunteered geographic information in any satisfactory way?

How?

More generally: how do quality and value relate when dealing with geographic information?

I think I cannot proceed from here without looking at real world situations. Economic value is defined in the marketplace where supply and demand meet, and thus cannot be studied without some understanding of how and where this demand arises.

There clearly is a demand for VGI, but where does it originate?

Why would people want to use information that comes with no guarantees of completeness or even factual correctness, and that does not have a consistent quality?

I will need to get to the bottom of this. Apparently it is ‘good enough’ for some! If I’m not careful I will be entering into the domain of psychology. I think I need to stop soon, or I will have covered all domains of modern science and will have defined ample questions to last me three dissertations. But let me just finish this train of thought, and by then I will have arrived in Berlin – one of the best covered cities in OpenStreetMap, by the way; you can even get a detailed map of the zoo!

What drives the decision on the demand side to use volunteered geographic information instead of commercial offerings that do come with a quality label?

I can think of a number of reasons. Firstly, there is a growing number of application domains that do not require extensive, nationwide coverage. Location-based services, for example, are often only relevant in metropolitan areas; consider pedestrian and bicycle routing, social networking applications, tourist guide services or restaurant / bar recommendation applications. Even many applications in professional domains operate only within a designated metropolitan area: local police, fire brigades and other public safety professionals work only within their metro area.

Interestingly, supply and demand sync up really nicely here: in areas where there is likely to be a great demand for high quality – whatever that may mean – geographic information, there is also likely to be a large number of contributors to volunteered geographic information repositories. (This reminds me of my master’s thesis that dealt with the quality of public transportation in rural areas. There was a similar process at play: because of the limited and geographically thinly spread demand, the costs of maintaining a reasonable quality of service had become so high that cuts in service quality had become unavoidable, lowering the demand even further. Both Dutch and German regional governments were struggling to counter this downward spiral, and I did a comparative study on the results of those efforts.)

Secondly, because there are very few restrictions and limitations on how and where you can use the data. Commercial data usage licenses are more often than not restricted to a certain type of application or device, or to a limited number of users or devices, and the data can only be used as-is. OpenStreetMap data can be used in almost every context imaginable, and you are free to modify and adapt the data to suit your needs.

Lastly, of course, because it’s free.

I have mixed feelings about this post. It feels unfocused, but I guess that is to be expected. More importantly, I don’t feel comfortable in the domain of economics. Sure, I did my two years of high school accounting and economics, but it did not quite take. It does not particularly interest me, but I feel I need to deal with it anyway. Intuitively, I am drawn to the question of defining and measuring quality. I want to think about how to do that, and write tools to analyze OSM data – that part I am really passionate about. It seems like a good moment to talk to Henk and maybe some other people I know who could help and advise me at this juncture.

So this is it!


So this is it. This is going to be my dissertation diary. I’m not going to make any commitments as to how often I will write in it; I just read that I should be spending at least 15 minutes every day on my dissertation. Every day for the next four, five, six years! Intriguing, to say the least.

I’m at the very beginning of the process, and my thoughts are really unfocused at this point. In this first entry, I will not go into the theme itself, there will be ample opportunity for that. I would just, for a moment, like to ponder over the implications. At least four years of my life will be dedicated, at least to some degree, to researching and writing about this theme that has yet to unfold.

As I am writing this, I feel that I want to write, I like to explore my thoughts by putting them in writing, although writing in English makes it even harder for my fingers to keep up with my ever-wandering mind.

The first question that springs to mind as I embark on this diary is: should I publish it? Not the dissertation I mean, but these notes? It seems, on the one hand, pointless and vain. Who would want to read about the nitty-gritty details of my struggle towards acquiring a doctorate? Not many, probably, but there might be a reason or two to do it anyway.

Publishing my thoughts might help me overcome a feeling of awkwardness that I frequently have about this project: who am I to think I can do original, creative research? These isolated thoughts, rough outlines of a theme that I might want to pursue, seem so superficial and gratuitous! If I just go ahead and publish my thoughts, ideas and processes, that would seem to lend them some validity. An irrational thought maybe, but it works for me.

Publishing these notes may also invoke some sense of urgency. I know I have a tendency to keep thoughts and ideas to myself for too long, thinking they need to mature before they are ready to be shared with the world. This is an inhibition that will seriously slow me down and that I must learn to set aside. It has already happened and I have not even begun to formalize a proposal!

More than a year ago now, Henk Scholten invited me to come to the Vrije Universiteit to discuss the possibilities of him supervising my dissertation. We had a really nice and productive discussion and I felt both flattered and motivated, and told him I would write down some of the ideas I had for him to digest. We would have a follow-up meeting soon.

I explored the idea for a while, discussed implications with a couple of colleagues and friends, thought about interesting themes. I think I even wrote some things down, but I did not feel any of them were good or mature enough to even put forward to Henk.

Although the thought of doing a dissertation was on my mind now and then over the months that followed, I found myself glad to have other things to occupy my mind and time. And so time passed, and here we are. I feel that I want to do this more strongly now, for reasons I will explain in a future post. So I am going to write. And explore. It will be beautiful. I can be that naive.