Taking the Temperature of local OpenStreetMap Communities


In my recent talk at the State Of The Map 2011 conference, I introduced the idea of Data Temperature for local OpenStreetMap data. In this blog post, I want to follow up on that, because my talk was only twenty minutes long and covered a broader theme, so I could not elaborate on the idea much.

Let me first show a slide from the talk to give you an idea – or a reminder, if you were in the audience – of what we are talking about. (The complete talk is available as a streaming video, and the slides are available as well. I linked to both in my previous blog post).

Community activity visualized

Let’s break this image down before I explain how I arrived at the temperature of 77 degrees for San Francisco. The blob of red, orange, yellow, green and gray is an abstracted map of the city, displaying only the linear OpenStreetMap features for that city. The features are styled so that two salient characteristics of community activity become the defining visual elements: number of versions and date of last update. The number of versions, indicating how many updates a feature has received since its creation, is visualized using line thickness, a thicker line style indicating a feature that has seen more community updates. The time passed since a feature last saw an update from the community is visualized using a gray-red-green color gradient, where gray represents features that have not seen an update in two years or more, while green represents linear features that were ‘touched’ in the last month.
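
To make that encoding concrete, here is a minimal sketch of how such a style could be derived for a single feature. The thresholds, colors, and function names are my own illustration, not the actual styling rules used to produce the maps; the real maps use a smooth color gradient rather than the discrete buckets shown here.

```python
from datetime import datetime, timedelta

# Illustrative styling rules for one linear feature, following the
# encoding described above: version count drives line width, time
# since the last community edit drives the color. Thresholds and
# colors are guesses for illustration only.

def line_width(version: int) -> float:
    # More community updates -> thicker line, capped so a handful of
    # heavily edited features do not dominate the map.
    return min(0.5 + 0.5 * (version - 1), 4.0)

def line_color(last_edit: datetime, now: datetime) -> str:
    age = now - last_edit
    if age >= timedelta(days=730):   # no update in two years or more
        return "#888888"             # gray
    if age <= timedelta(days=30):    # 'touched' in the last month
        return "#2ca02c"             # green
    return "#d62728"                 # red for everything in between

now = datetime(2011, 8, 1)           # roughly the planet file date
print(line_width(7), line_color(datetime(2011, 7, 20), now))
```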

The result is an unusually abstracted view of a city – but one that I hope helps to convey a sense of local community activity. In my talk I argued that communities sharing local knowledge are what sets OpenStreetMap apart from other geodata resources. It is what sets Warm Geography apart from Cold Geography.

Data Temperature

For my talk, I wanted to take the Warm vs. Cold analogy one step further and devise a simple method to calculate the Data Temperature for a city or region. To do this, I decided I needed a little more information than just the version count and the time passed since the last update. Thinking about the US situation, I gathered that the TIGER import could provide some salient community statistics; the TIGER data is a reference point from which we can measure how much effort the local communities – or individual mappers – have put into fixing the imported data, enriching it and bringing it up to date.

From the current planet file, which I used because of resource constraints, you can only derive a limited understanding of the historical development of the OpenStreetMap data. For example, it is not possible to see how many unique users have been involved in contributing to the data for an area, only which users are responsible for the current versions of the data. For individual features, it is not possible to determine their age. Features that have been deleted are not accessible at all. So all in all, by operating on the current planet file, we have a pretty limited field of view. When resources allow, I want to work more with the full history planet that has been available for a while now, and for which a tool set is starting to emerge thanks to the great work of Peter Körner, Jochen Topf and others.

These limitations being what they are, we can still derive useful metrics from the planet data. I devised a ‘Community Score Card’ for local OpenStreetMap data that incorporates the following metrics:

  • The percentage of users responsible for 95% of the current data. This metric tells us a lot about the skew in contributions, a phenomenon that has received considerable attention in the OSS domain¹ and is apparent in OpenStreetMap as well. The less skewed the contribution graph is, the healthier I consider the local community: less skew means that more people are putting in significant effort mapping their neighborhood. For the cities I looked at, this figure ranged from 5% to 26% (a sketch of this computation follows after the list). I have to add that this metric loses much of its expressiveness when the absolute number of contributors is low, something I did not take into account in this initial iteration.
  • The percentage of untouched TIGER roads. This metric provides some insight into how involved the local mappers are overall – TIGER data needs cleanup, so untouched TIGER data is always a sign of neglect. It also gives an idea of how well the local mapping community covers the area geographically. For the cities I looked into for the talk, this figure ranged from 4% in Salt Lake City (yay!) to a whopping 80% in Louisville, Kentucky.
  • The average version increase over TIGER. This simple metric overlaps somewhat with the previous one, but also provides additional insight into the amount of effort that has gone into local improvements of the imported TIGER road network.
  • The percentage of features that have been edited in the last three months and in the last year. These are the only temporal metrics in the Community Score Card. For a more in-depth temporal and historical analysis of OpenStreetMap data, we need to look at the full history planet file, which for this first iteration I did not do. Even so, these two metrics provide an idea of the current activity of the local community. They do not tell us anything about the historical circumstances that might explain that activity or lack thereof, however. For example, the local community may have been really active up to a year ago, leaving the map fairly complete, which might explain a diminished activity since. For our purpose, though, these simple metrics do a pretty good job quantifying community activity.
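
As promised above, here is a minimal sketch of how the contribution skew in the first metric could be computed. It assumes we already have a per-user count of current features (for instance, the user recorded on the current version of each feature, taken from a planet extract); the names and numbers are made up.

```python
# Sketch of the contribution skew metric: the smallest share of users
# whose combined edits account for 95% of the current data.

def pct_users_for_95(user_counts, share=0.95):
    counts = sorted(user_counts.values(), reverse=True)
    total = sum(counts)
    cumulative, users = 0, 0
    for c in counts:
        cumulative += c
        users += 1
        if cumulative >= share * total:
            break
    return 100.0 * users / len(counts)

# Prints 50.0: two of the four users account for 95% of the features.
print(pct_users_for_95({"alice": 9000, "bob": 700, "carol": 200, "dave": 100}))
```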

I applied a simple weighting to these metrics to arrive at the figure for the Data Temperature, and there is really not much wisdom behind it. My goal was to arrive at a temperature that would be conducive to conveying a sense of community activity, and that would show a good range for the cities I analyzed for the talk. In a next iteration, I will attempt to arrive at a somewhat more scientifically sound approach.

The weighting factors are as follows:

  • Percentage of users responsible for 95% of the current data: 30
  • Percentage untouched TIGER roads: -30
  • Average version increase over TIGER roads: 5
  • Percentage features edited in the last 3 months: 50
  • Percentage features edited in the last year: 40

I summed the weighted metrics, rounded the result to the nearest integer, and added it to a base temperature of 32 degrees (the freezing point of water on the Fahrenheit scale) to arrive at the final figure for the Data Temperature.
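
Put together, the calculation fits in a few lines. This is a sketch under one assumption of mine: that the percentage metrics enter the formula as fractions between 0 and 1 (I have not spelled out the scaling above, and it was tuned by hand anyway). The Score Card values in the example are made up.

```python
# Sketch of the Data Temperature calculation. The weights and the
# base of 32 degrees come straight from the post; the assumption that
# the percentage metrics enter as fractions (0.0-1.0) is mine.

WEIGHTS = {
    "pct_users_for_95": 30,
    "pct_untouched_tiger": -30,
    "avg_version_increase": 5,
    "pct_edited_3_months": 50,
    "pct_edited_year": 40,
}

def data_temperature(scorecard: dict) -> int:
    score = sum(weight * scorecard[key] for key, weight in WEIGHTS.items())
    return 32 + round(score)

# Made-up Score Card values, for illustration only.
print(data_temperature({
    "pct_users_for_95": 0.15,
    "pct_untouched_tiger": 0.20,
    "avg_version_increase": 1.8,
    "pct_edited_3_months": 0.25,
    "pct_edited_year": 0.45,
}))  # -> 70
```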

Visualization Is Hard

Looking at the Community Score Cards for the various cities I analyzed for the talk, and comparing them to the abstract maps visualizing versions and time since last edit, you will notice that the maps seem to paint a different picture than the Score Cards. Take a look at the San Francisco map and Score Card above, and compare them to the Baltimore ones below.

We see that while Baltimore’s data is much ‘cooler’ at 59 degrees than San Francisco’s at 77, the Baltimore map looks quite promising for community activity. I can give a few explanations for this. (We are really getting into the visualization aspect of this, but I believe that is a very important dimension of conveying a concept as fluid as Data Temperature.) Firstly, color defines this map visualization in a more fundamental way than line thickness does. The ‘green-ness’ of the Baltimore map leads us to believe that all is well, even though recency is just one element of the Community Score Card. Moreover, not all elements of the Score Card are even represented in the visualization: untouched TIGER roads are pushed to the background by the thicker lines representing roads that did receive community attention. Lastly, scale plays a role in obfuscating differences. To fit the different cities in the same slide template, I had to vary the scale of the maps considerably. Because the line thickness is defined in absolute values and does not depend on map scale, the result can be somewhat deceiving.

Conclusion

I believe that this first attempt at a Data Temperature for local OpenStreetMap data, and its accompanying map visualizations, served its purpose well. The talk was well received and inspired interesting discussions. It set the stage for a broader discussion I want to have within the OpenStreetMap community about leveraging sentiments of recognition and achievement within local communities in order to help those communities grow and thrive.

There is a whole range of improvements to this initial iteration of the Data Temperature concept that I want to pursue, though. Most importantly, I want to use the full history of contributions instead of the current planet file. This will allow me to incorporate historical development of the map data as a whole and of individual contributor profiles. Also, I want to improve the Score Card with more characteristics, looking into the quality of the contributions as well as the quantity. Lastly, I want to improve the map visualizations to more accurately represent the Data Temperature.

I will post more Data Temperature visualizations and Score Cards when I think of an efficient way to do so. I collected data for some 150 US cities based on OpenStreetMap data from August 2011. If you would like a particular city posted first, let me know. Also, if you would like to know more about the tools and methods involved in preparing and processing data, I am willing to do a blog post expanding on those topics a bit.

1. J. Lerner and J. Tirole, “Some simple economics of open source,” Journal of Industrial Economics (2002): 197–234.

Scoping Out An Open Location Platform


I noted earlier today that it would be nice to have a kind of Open Location Platform in order to get the most out of the Social Location apps that have gained such interest and popularity as of late. The problem with all the existing applications and platforms is that they do not speak each other’s language, and thus cannot benefit from all the information that is being crowdsourced into their respective location repositories. All these applications have a different take on Social Location, attracting a specific, unique crowd. The challenge of an Open Location Platform would be not to lose the uniqueness of each of the existing platforms in the process of unifying the way in which places are searched for and posted. Ideally, the applications would not even have to do very much at all to benefit from an Open Location Platform.

The way I see an Open Location Platform is as a kind of API interfacing between the existing Social Location apps, a central repository of Places – in fact, I prefer the name Open Places Platform over Open Location Platform, for a Place is More Than A Location™ – and the proprietary Places Repositories that the apps maintain now.

A unique Place Identifier must be used to link the Places in the Central Repository – which I believe should be OpenStreetMap, because it has the infrastructure set up and would benefit from the crowdsourcing from all these unique Social Location platforms – to the Places in the proprietary, existing repositories.

The initial assignment of unique Place Identifiers is not really straightforward: there are bound to be a lot of duplicates, even within the existing repositories, and removing duplicates inserted by humans is always a challenge. All unique Places should be collected in the OpenStreetMap database, and the Identifiers should be assigned there. Currently, every new node inserted is assigned an ID automatically, so that ID could easily be used.

Assuming this model is in place, we can start scoping out the basic operations: Searching for Places, Request Details for a Place, and Post a New Place.

Searching for Places

In a Social Location environment, it is safe to assume that search operations for known Places will be geographical, i.e. within a certain distance from the user’s current location, or within a certain area defined by a bounding box. A response to such a Places search would have to contain a minimal amount of information about each place found: geographical location, name, and something like a genre. That last bit might be troublesome, because it should be platform-agnostic. Either the Social Location platforms need to arrive at a shared taxonomy of ‘Place Genres’, or it should be left open, in the spirit of OpenStreetMap. I would opt for the former.

An actual search operation would thus not involve the proprietary Places Repos at all, because everything is stored in the Central Places Repo, OpenStreetMap – with those elementary nuggets of metadata on each Place to avoid unnecessary client-server round trips. Oh, as an added bonus, the location + name + Place Genre gives OpenStreetMap enough info to render it on their map. A Social Location application that uses OpenStreetMap as a base map layer would have the added benefit of up-to-date maps crowdsourced by their own users.
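
To make this a little more tangible, here is a minimal sketch of what such a search call could look like from a client’s perspective. The endpoint, parameters, and response fields are all hypothetical – no such API exists yet – but they show how little data a search response would need to carry.

```python
import requests

# Hypothetical sketch of a Places search against an Open Places
# Platform; the base URL, parameters, and response fields are all
# assumptions for illustration.
BASE = "https://api.openplaces.example/v1"

def search_places(lat, lon, radius_m=500):
    resp = requests.get(f"{BASE}/places", params={
        "lat": lat, "lon": lon, "radius": radius_m,
    })
    resp.raise_for_status()
    # Each result carries only the minimal metadata discussed above:
    # unique Place Identifier, location, name, and Place Genre.
    return resp.json()["places"]

# Usage:
# for place in search_places(52.37, 4.89):
#     print(place["id"], place["name"], place["genre"])
```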

Request Details for a Place

The next step for a typical Social Location app would be to request more detailed information about one particular Place. Applications could still do this the way they are used to: requesting details directly from their own Places Repo using their own interface or API. This way, however, the application would not be able to reap the real benefits of the Open Places Platform: firing one Place Details request at the Open Places Platform and getting results from all registered Places Repos, instead of just their own. This might not make much sense for the current Social Location apps, but a newer generation of apps could be crafted to make use of the Place information coming from different communities. In order to be able to process all this heterogeneous Place information, some harmonization of Place metadata might be desirable.

To make a Place Details request more efficient, and maybe also to provide some indication of the expected information richness, an intermediate request might be implemented, in which the Application requests a list of Repos that have a record for the Place it intends to query. This could be a really fast request.
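
Continuing the hypothetical client sketched above – again with made-up endpoints and fields – the details step could look like this: first the cheap ‘which Repos know this Place’ request, then the fan-out for details.

```python
import requests

# Hypothetical endpoints, continuing the earlier sketch; the response
# fields below are assumptions for illustration, not an existing API.
BASE = "https://api.openplaces.example/v1"

def repos_for_place(place_id):
    # The cheap intermediate request: which registered Repos have a
    # record for this Place?
    resp = requests.get(f"{BASE}/places/{place_id}/repos")
    resp.raise_for_status()
    return resp.json()["repos"]          # e.g. ["foursquare", "gowalla"]

def place_details(place_id, repos=None):
    # One Place Details request, fanned out to all (or selected) Repos.
    params = {"repos": ",".join(repos)} if repos else {}
    resp = requests.get(f"{BASE}/places/{place_id}/details", params=params)
    resp.raise_for_status()
    return resp.json()                   # harmonized Place metadata per Repo

# Usage: first see who knows the Place, then query only those Repos.
# details = place_details("12345", repos_for_place("12345"))
```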

In a less-than-ideal-but-still-entirely-feasible scenario, one Social Location platform might not want the clients of another platform benefiting from the information crowdsourced into their Repo. This could be dealt with in a security and access provision layer.

Post A New Place

The posting of a new Place would by design be a two-step process, because the Place would have to be registered in the Central Repository, OpenStreetMap, first, and subsequently inserted into the proprietary Places Repo using the new unique Place Identifier returned by the Central Repo in step 1.
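
A sketch of this flow, with the same hypothetical endpoints as before; ‘myplatform’ stands in for a registered Repo. The unique Place Identifier returned by the Central Repo in step 1 is what ties the proprietary record to the shared Place.

```python
import requests

# Sketch of the two-step "Post A New Place" flow; all endpoints are
# hypothetical assumptions for illustration.
BASE = "https://api.openplaces.example/v1"

def post_new_place(name, lat, lon, genre, platform_record):
    # Step 1: register the Place in the Central Repository
    # (OpenStreetMap in this scheme) and obtain its unique Identifier.
    resp = requests.post(f"{BASE}/places", json={
        "name": name, "lat": lat, "lon": lon, "genre": genre,
    })
    resp.raise_for_status()
    place_id = resp.json()["id"]

    # Step 2: insert the record into the proprietary Places Repo,
    # keyed by the Identifier returned in step 1.
    resp = requests.post(f"{BASE}/repos/myplatform/places/{place_id}",
                         json=platform_record)
    resp.raise_for_status()
    return place_id
```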

Conclusion

I think this could work. Really, it could.

Update – It’s About Business.

I spoke about this concept at WhereCampEU, which was held in London on March 12–13, 2010. Slides are here (though almost all of that is contained in this post). Wiki here.

The discussion that arose after the talk was very interesting and explored the business aspects of a consolidated check-in more thoroughly. I had thought about the incentive needed for the current social location platforms to want to engage in an effort to consolidate the check-in. The concept I laid out is meant to leave the existing platforms and their databases alone as much as possible. It is not meant to replace current apps either. It is designed to act as a thin intermediary layer, enabling users to gain optimal access to Places, while allowing the individual application platforms to retain the added value they have accumulated. The only information about Places that is physically consolidated is the basic factual information: location, name, and Place type. That is not where the value is – the value is in the user-generated content, which the platforms retain and can choose to share, but are, in the Open Places Platform scheme, in no way obliged to.

The discussion raised the interesting point of conservatism inspired by venture capital. Steven Feldman, in another session at WhereCampEU, calculated that the venture capital backing the social location business already added up to around US$200 million. As I am typing this, I wonder if I got that right. It is a lot of money, and the suppliers of that money will want to see some return on their investment – which they are not going to get if the platforms start ‘opening up’ and ‘jeopardizing’ their assets: Places and Users. Because that is how short-sighted I think most venture capitalists are. So my new conclusion is: We Need A Different Name For This.

Update 5/20: And the story continues…without a happy ending!

Visualizing geospatial data quality


In the coming months, I will be working on how to measure the quality of geospatial information, and on visualizing the results of quality analysis. The actual indicators for quality are still to be defined, but will be along the lines of:

  • spatial density – how many features of a certain type does dataset A have, and how many does dataset B have?
  • temporal quality – what is the age of the data? How much time has passed since survey and publication?
  • crowd quality – what I call the ‘5th dimension’ of spatial data quality. This one is more complex; a separate post will follow.

OpenStreetMap 'cheat sheet' mug showing the most used tags.

‘Crowd Quality’ has many dimensions. It is about peer review strength: how many surveyors have ‘touched’ a feature? How many surveyors are responsible for area X? It has several consistency components as well. One is internal attribute consistency: to what extent does the data conform to a set of core attributes? Another is spatial and temporal quality consistency: considering a larger region, does the data show consistent measurements for the spatial and temporal quality indicators described above?

Quality analysis is an important issue for Volunteered Geographic Information projects like OpenStreetMap, because their data is under constant, strong scrutiny: it’s open, so it’s easily accessible, and it’s very easy to take cheap shots at extensive voids in the map. Because of that openness, professional users have strong reservations about the quality of the data. There is almost no barrier to entry into the OpenStreetMap community: provide a username and an email address and you’re good to go – and free to delete all the data for Amsterdam, for example.

In a community of 200,000, map vandalism of that magnitude will be swiftly detected and reverted, and as such should not even be the biggest concern of potential users of VGI data. Smaller acts of map vandalism, however, might go undetected for a longer period of time, if they are detected at all. Moreover, with OpenStreetMap picking up momentum as it is currently doing, there are a lot of new aspiring surveyors joining every day. Even when they all subscribe and start adding data with the best intentions, ‘newbies’ are bound to get it wrong at first, inadvertently adding a stretch of freeway in their residential neighborhood, or unintentionally moving features around when all they want to do is add their local pub. Even if the community tends to react to map errors – inadvertent or not – swiftly and pro-actively, the concerns potential users have about the quality of the data are legitimate. VGI is anarchy, and where there is anarchy, there are no guarantees.

The need for quality analysis also arises from within the VGI communities themselves. As a VGI project matures, contributors are likely to shift their attention to details. This can certainly be said for OpenStreetMap, where some regions are nearing or reaching completion of the basic geospatial features. A quick glance at the current map will no longer be enough to decide how and where to direct your surveying and mapping effort. Data and quality analysis tools are needed to aid the contributors in their efforts. These can be really simple tabular comparisons; in many German cities, for example, OpenStreetMap contributors have acquired complete and up-to-date street name lists from the local council, which they compare to the named streets that exist in the OpenStreetMap database. This effort (Essen, Germany here) yields a simple list of missing street names which can then be targeted for mapping efforts.
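
The core of such a comparison is a simple set difference. Here is a minimal sketch, assuming a council list in a CSV file with a `name` column and a list of street names pulled from an OpenStreetMap extract; the file name and column name are my own invention.

```python
import csv

# Sketch of the street-name comparison: a council street list (CSV
# with an assumed `name` column) versus street names present in an
# OpenStreetMap extract.

def missing_streets(council_csv, osm_names):
    with open(council_csv, newline="", encoding="utf-8") as f:
        council = {row["name"].strip().lower() for row in csv.DictReader(f)}
    osm = {name.strip().lower() for name in osm_names}
    return sorted(council - osm)

# `osm_names` would come from the `name` tags of highway features in
# the extract, e.g. via an Overpass or database query:
# print(missing_streets("essen_streets.csv", osm_names))
```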

More complex and versatile data quality analysis tools are being developed as well. Let me conclude this article with a few examples that give some idea of how the results of my quality analysis research could be visualized.

OpenStreetBugs

Not an automated data analysis tool, this web site allows for simple map bug reporting. It was designed to provide a no-barrier way to report errors on the map: you do not even need to be registered as an OpenStreetMap user to use it. It provides some indication of data quality, and it can be used by OpenStreetMap contributors to fix reported errors quickly; the web site automatically provides a link to the web-based OpenStreetMap editor, Potlatch, for every reported error.

Visual comparison: Map Compare and FLOSM

An often asked question pertaining to the data quality of OpenStreetMap is: how does OpenStreetMap compare to TeleAtlas or NAVTEQ, the two major commercial vendors of street data? While comparing the spatial quality is in itself not a complicated task, you need to have access to both data sets in order to actually do it. TeleAtlas and NAVTEQ data is expensive, so not many are in a position to actually do this comparison. In the course of my research, I will certainly perform a number of these analyses, as I am in the fortunate position to have easy access to commercial spatial data.

Map Compare

FLOSM

A simple but effective way to visually compare two spatial data sets is to overlay them in GIS software, or in a web mapping application. Making such overlay web applications available is generally discouraged in VGI communities, as it is thought to encourage ‘tracing’ data from proprietary sources. This is a violation of the licenses for almost all commercial spatial data, and could thus mean legal trouble for VGI projects.

Nevertheless, some visual comparison tools do exist. Map Compare presents a side-by-side view of OpenStreetMap and Google Maps, allowing for easy and intuitive exploratory comparison of the two. FLOSM takes it a step further with a full-on overlay of TeleAtlas data on top of OpenStreetMap data.

Automated analysis: KeepRight and OSM Inspector

OSM Inspector

KeepRight

The tools we’ve seen so far do not provide analysis intelligence themselves; they simply display the factual data and leave it to the user to draw conclusions. Another category of quality assurance tools takes the idea a step further and performs different spatial data quality analyses and displays the results in a map view.

German geo-IT company Geofabrik, also responsible for the Map Compare tool mentioned earlier, publishes the widely used OSM Inspector tool, which can be used to perform a range of data quality analyses on OpenStreetMap data. It can effectively visualize topology issues and common tagging errors. Input for the tool’s functionality and for extending its range of visualizations comes from the community. A recent addition requested by the Dutch community is a visualization that shows the Dutch street data that has not been ‘touched’ since 2007, when AND donated their street data for the Netherlands, effectively completing the Dutch road network in OpenStreetMap. This particular visualization helps Dutch OpenStreetMap contributors establish which features have not yet been checked since the import. A similar tool was put in place when TIGER data from the US Census Bureau was imported into OpenStreetMap in 2008.

KeepRight takes a similar approach to OSM Inspector, analysing OpenStreetMap data for common errors and inconsistencies and displaying them in a web map application.

While these tools are extremely useful for OpenStreetMap contributors looking to improve the data and correct mistakes, they are not particularly useful for visualizing quantitative data quality research outcomes, as those outcomes will be aggregated, generalized data.

For many of the ‘Crowd Quality’ indicators, I am probably going to take a grid approach: establishing quantifiable indicators for Crowd Quality and calculating them for each cell in the grid. What that grid should look like is actually also a matter of debate – it would depend on the quality indicator measured, and on the characteristics of the real-world situation referenced by each grid cell.
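
As a first approximation, such a grid computation could look like the sketch below. The cell size and the indicator – the number of distinct contributors per cell, one possible measure of peer review strength – are illustrative choices of mine, not settled methodology.

```python
from collections import defaultdict

# Sketch of a per-cell Crowd Quality indicator: the number of
# distinct contributors in each grid cell. The 0.01-degree cell size
# and the indicator itself are illustrative choices.

CELL_DEG = 0.01  # roughly 1 km east-west at mid latitudes

def cell_of(lat, lon):
    # Index a coordinate into a regular lat/lon grid.
    return (int(lat // CELL_DEG), int(lon // CELL_DEG))

def contributors_per_cell(features):
    # `features` is an iterable of (lat, lon, user) tuples taken from
    # the current versions of the data.
    cells = defaultdict(set)
    for lat, lon, user in features:
        cells[cell_of(lat, lon)].add(user)
    return {cell: len(users) for cell, users in cells.items()}

print(contributors_per_cell([(52.370, 4.890, "alice"),
                             (52.371, 4.891, "bob"),
                             (52.500, 4.900, "alice")]))
```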

To get an idea of what a grid visualization pertaining to quality could look like, it is interesting to look at the visualization for the Aerial Imagery Tracing project running in the German OpenStreetMap community. A set of high resolution aerial photos was made available to OpenStreetMap and integrated into map editing software for purposes of tracing features. Some tools were developed to assist in completing this effort; among those, a grid overlay visualizing the progress for each grid cell. No automated analysis is performed; rather, contributors are asked to scrutinize the grid cells themselves and rate completeness on several indicators. Although the pilot project was completed some time ago, the visualization is still online.

[Edit] This blog post goes into the technicalities of setting up a grid in PostGIS.