Taking the Temperature of Local OpenStreetMap Communities


In my recent talk at the State Of The Map 2011 conference, I introduced the idea of Data Temperature for local OpenStreetMap data. In this blog post, I want to follow up on that idea: my talk was only twenty minutes long and covered a broader theme, so it did not allow me to elaborate much.

Let me first show a slide from the talk to give you an idea – or a reminder, if you were in the audience – of what we are talking about. (The complete talk is available as streaming video, and so are the slides; I linked to both in my previous blog post.)

Community activity visualized

Let’s break this image down before I explain how I arrived at the temperature of 77 degrees for San Francisco. The blob of red, orange, yellow, green and gray is an abstracted map of the city, displaying only the linear OpenStreetMap features for that city. The features are styled so that two salient characteristics of community activity become the defining visual elements: number of versions and date of last update. The number of versions, indicating how many updates a feature has received since its creation, is visualized using line thickness, a thicker line style indicating a feature that has seen more community updates. The time passed since a feature last saw an update from the community is visualized using a gray-red-green color gradient, where gray represents features that have not seen an update in two years or more, while green represents linear features that were ‘touched’ in the last month.
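
To make the styling rules concrete, here is a minimal Python sketch of how a renderer might map these two characteristics to stroke width and color. Only the gradient endpoints come from the description above (gray at two years or more, green within the last month); the red midpoint at one year, the interpolation and the width scaling are illustrative choices, not the exact values behind the maps.

```python
def lerp(a, b, t):
    """Linearly interpolate between two RGB triples (t from 0.0 to 1.0)."""
    return tuple(round(a[i] + (b[i] - a[i]) * t) for i in range(3))

GRAY = (128, 128, 128)   # no update in two years or more
RED = (220, 40, 40)      # assumed gradient midpoint, at one year
GREEN = (40, 160, 40)    # 'touched' within the last month

def line_color(days_since_edit):
    """Gray-red-green gradient over time since the last community edit."""
    if days_since_edit <= 30:
        return GREEN
    if days_since_edit >= 730:
        return GRAY
    if days_since_edit <= 365:
        # fade from green to red between one month and one year
        return lerp(GREEN, RED, (days_since_edit - 30) / (365 - 30))
    # fade from red to gray between one year and two years
    return lerp(RED, GRAY, (days_since_edit - 365) / (730 - 365))

def line_width(version):
    """Thicker strokes for features with more versions, capped for legibility."""
    return 0.5 + 0.4 * min(version, 12)
```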

The resulting map is an unusually abstracted view of a city – but one that I hope helps to convey a sense of local community activity. In my talk I argued that communities sharing local knowledge are what sets OpenStreetMap apart from other geodata resources. It is what sets Warm Geography apart from Cold Geography.

Data Temperature

For my talk, I wanted to take the Warm vs. Cold analogy one step further and devise a simple method to calculate the Data Temperature for a city or region. To do this, I decided I needed a little more information than just the version count and the time passed since the last update. Thinking about the US situation, I gathered that the TIGER import could provide me with some salient community statistics; the TIGER data is a reference point from which we can measure how much effort the local communities – or individual mappers – have put into fixing the imported data, enriching it and bringing it up to date.

From the current planet file, which I used because of resource constraints, you can derive only a limited understanding of the historical development of the OpenStreetMap data. For example, it is not possible to see how many unique users have been involved in contributing to the data for an area, only which users are responsible for the current versions of the data. For individual features, it is not possible to determine their age. Features that have been deleted are not accessible at all. So by operating on the current planet file, we have a pretty limited field of view. When resources allow, I want to work more with the full history planet file that has been available for a while now, and for which a tool set is starting to emerge thanks to the great work of Peter Körner, Jochen Topf and others.

These limitations being what they are, we can still derive useful metrics from the planet data. I devised a ‘Community Score Card’ for local OpenStreetMap data that incorporates the following metrics:

  • The percentage of users responsible for 95% of the current data. This metric tells us a lot about the skew in contributions, a phenomenon that has received considerable attention in the OSS domain [1] and is apparent in OpenStreetMap as well. The less skewed the contribution graph is, the healthier I consider the local community. Less skew means that there are more people putting in significant effort mapping their neighborhood. For the cities I looked at, this figure ranged from 5% to 26%. I have to add that this metric loses much of its expressiveness when the absolute number of contributors is low, something I did not take into account in this initial iteration.
  • The percentage of untouched TIGER roads. This metric provides some insight into how involved the local mappers are overall – TIGER data needs cleanup, so untouched TIGER data is always a sign of neglect. It also gives an idea of how well the local mapper community covers the area geographically. For the cities I looked into for the talk, this figure ranged from 4% in Salt Lake City (yay!) to a whopping 80% in Louisville, Kentucky.
  • The average version increase over TIGER. This simple metric overlaps somewhat with the previous one, but also provides additional insight into the amount of effort that has gone into local improvements of the imported TIGER road network.
  • The percentage of features that have been edited in the last three months and in the last year. These are the only temporal metrics in the Community Score Card. For a more in-depth temporal and historical analysis of OpenStreetMap data, we need to look at the full history planet file, which for this first iteration I did not do. Even so, these two metrics provide an idea of the current activity of the local community. They do not tell us anything about the history that might explain that activity or the lack thereof, however. For example, the local community may have been really active up to a year ago, leaving the map fairly complete, which might explain diminished activity since. For our purpose, though, these simple metrics do a pretty good job quantifying community activity. (A sketch of how all four metrics can be computed follows this list.)
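
To make the Score Card concrete, here is a minimal Python sketch of how these metrics could be computed from the ways in a planet extract. The data layout is an assumption for illustration – each way as a dict with 'user', 'version', 'timestamp' and 'tiger' fields – and is not the representation my actual tooling uses.

```python
from collections import Counter
from datetime import timedelta

def score_card(ways, now):
    """Community Score Card metrics for a list of ways.

    Assumed fields per way: 'user' (last editor), 'version' (int),
    'timestamp' (datetime of last edit), 'tiger' (bool, TIGER import).
    """
    n = len(ways)

    # Percentage of users responsible for 95% of the current data:
    # rank users by how many current features they last touched and
    # walk down the ranking until 95% of features are covered.
    counts = Counter(w["user"] for w in ways)
    covered, core_users = 0, 0
    for _, c in counts.most_common():
        covered += c
        core_users += 1
        if covered >= 0.95 * n:
            break

    tiger = [w for w in ways if w["tiger"]]
    return {
        "pct_core_users": core_users / len(counts),
        # untouched TIGER roads are still at version 1
        "pct_untouched_tiger": sum(w["version"] == 1 for w in tiger) / len(tiger),
        # average number of edits a TIGER road received after import
        "avg_version_increase": sum(w["version"] - 1 for w in tiger) / len(tiger),
        # share of features edited recently
        "pct_edited_3mo": sum(w["timestamp"] >= now - timedelta(days=91) for w in ways) / n,
        "pct_edited_year": sum(w["timestamp"] >= now - timedelta(days=365) for w in ways) / n,
    }
```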

I applied a simple weighting to these metrics to arrive at the figure for the Data Temperature, and there is really not much wisdom that went into it. My goal was to arrive at a temperature that would convey a sense of community activity and show a good range across the cities I analyzed for the talk. In a next iteration, I will attempt a somewhat more scientifically sound approach.

The weighting factors are as follows:

  • Percentage of users responsible for 95% of the current data: 30
  • Percentage untouched TIGER roads: -30
  • Average version increase over TIGER roads: 5
  • Percentage features edited in the last 3 months: 50
  • Percentage features edited in the last year: 40

I summed the weighted metrics, rounded the result to the nearest integer, and added it to a base temperature of 32 degrees (the freezing point of water on the Fahrenheit scale) to arrive at the final figure for the Data Temperature.
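
In code, the whole calculation amounts to a weighted sum on top of the base temperature. A minimal sketch, assuming the percentage metrics enter as fractions between 0 and 1 (my reading of the weights above) and using purely made-up example inputs, not any real city's figures:

```python
WEIGHTS = {
    "pct_core_users": 30,        # broader contributor base warms the score
    "pct_untouched_tiger": -30,  # neglected TIGER data cools it down
    "avg_version_increase": 5,
    "pct_edited_3mo": 50,
    "pct_edited_year": 40,
}

def data_temperature(card):
    """Weighted Score Card sum on a 32 degree Fahrenheit base."""
    return 32 + round(sum(WEIGHTS[k] * card[k] for k in WEIGHTS))

# Purely illustrative inputs:
card = {
    "pct_core_users": 0.16,
    "pct_untouched_tiger": 0.30,
    "avg_version_increase": 1.2,
    "pct_edited_3mo": 0.10,
    "pct_edited_year": 0.40,
}
print(data_temperature(card))  # 55 degrees
```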

Visualization Is Hard

Looking at the Community Score Cards for the various cities I analyzed for the talk, and comparing them to the abstract maps visualizing version counts and time since the last edit, you will notice that the maps seem to paint a different picture than the Score Cards. Take a look at the San Francisco map and Score Card above, and compare them to the Baltimore ones below.

We see that while Baltimore’s data, at 59 degrees, is much ‘cooler’ than San Francisco’s at 77, the Baltimore map looks quite promising for community activity. I can give a few explanations for this. (We are really getting into the visualization aspect here, but I believe that is a very important dimension of conveying a concept as fluid as Data Temperature.) Firstly, color defines this map visualization in a more fundamental way than line thickness does. The ‘green-ness’ of the Baltimore map leads us to believe that all is well, even though recency is just one element of the Community Score Card. Moreover, not all elements of the Score Card are even represented in the visualization: untouched TIGER roads are pushed to the background by the thicker lines representing roads that did receive community attention. Lastly, scale plays a role in obfuscating differences. To fit the different cities into the same slide template, I had to vary the scale of the maps considerably. Because line thickness is defined in absolute pixel values and does not depend on map scale, the result can be somewhat deceiving.
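
One straightforward remedy for the scale problem would be to define stroke width in ground units instead of fixed pixel values, so that maps rendered at different scales stay visually comparable. A small sketch of that idea; the 20-meters-per-edit base width is an arbitrary illustrative value, not one used in the renderings above:

```python
def scaled_line_width(version, meters_per_pixel):
    """Stroke width tied to ground distance rather than a fixed pixel
    value, so line thickness stays comparable across map scales."""
    ground_meters = 20.0 * min(version, 12)            # width grows with edit count
    return max(0.5, ground_meters / meters_per_pixel)  # pixels, with a hairline floor
```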

Conclusion

I believe that this first attempt at a Data Temperature for local OpenStreetMap data, and its accompanying map visualizations, served its purpose well. The talk was well received and inspired interesting discussions. It set the stage for a broader discussion I want to have within the OpenStreetMap community about leveraging sentiments of recognition and achievement within local communities in order to help those communities grow and thrive.

There is a whole range of improvements to this initial iteration of the Data Temperature concept that I want to pursue, though. Most importantly, I want to use the full history of contributions instead of the current planet file. This will allow me to incorporate historical development of the map data as a whole and of individual contributor profiles. Also, I want to improve the Score Card with more characteristics, looking into the quality of the contributions as well as the quantity. Lastly, I want to improve the map visualizations to more accurately represent the Data Temperature.

I will post more Data Temperature visualizations and Score Cards when I think of an efficient way to do so. I collected data for some 150 US cities based on OpenStreetMap data from August 2011. If you would like a particular city posted first, let me know. Also, if you would like to know more about the tools and methods involved in preparing and processing data, I am willing to do a blog post expanding on those topics a bit.

1. J. Lerner and J. Tirole, “Some simple economics of open source,” Journal of Industrial Economics (2002): 197–234.