After my last post, on film and videogames, we’re back on familiar ground to look at the ace mashup/data processing tool ScraperWiki. ScraperWiki is a great little platform with a very specific focus on extracting information from webpages, files, you name it. It has a database that you can use to store your data, and it generates URIs that you can point people to so they can retrieve it. One of the really cool things you can do is take the data you’ve “scraped” and do “things” with it through the views functionality. There is a choice of three programming languages (Python, Ruby and PHP), depending on your knowledge, and a number of libraries available for each one.
Playing around with this last month, I used it to rebuild some broken RSS feeds on the City of Lincoln Council website that we had previously been running through Yahoo! Pipes. Once I’d learned the foibles of ScraperWiki by following through a few examples, the process was really painless and helped me gain a better understanding of writing in Python, which seems to be the best way of getting the data off the page and into the ScraperWiki data stores.
After the RSS feed I wanted a new challenge and had an idea. I’d been playing with the Google Maps API on my local development machine for a while, working with a few functions that would take a UK postcode and convert it to latitude/longitude for displaying on a map. While this was quite fun, I didn’t really have an application for it, nor a place I could host such an application. So I thought I could use the views in ScraperWiki. I had a number of datasets for the Lincoln Decent Homes Scheme (you can find out more about that here) which I thought would be great to visualise on the website as a map, allowing people to find their area and expand it to see what work has been done to council properties. Previously this had all been published as tables in PDF files.
Getting the data was fairly straightforward; I simply uploaded the eight CSV files to our server to serve as the raw data. As I would need to combine each property line under one postcode, I set about writing a scraper for these CSVs. You can find the code here (written in Python; there are some comments so I’m not going to go into detail!) and the output here. What this is basically doing is reading in each line of the eight CSV files and, where it finds a matching postcode, combining the informative bits (properties, work done) into an HTML table, which is used in the eventual map. It takes a good few seconds to run as it has a ton of data to process, but it copes with all of it, which goes to show how robust ScraperWiki is.
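For anyone who wants a feel for the grouping step without reading the full scraper, here is a minimal sketch in plain Python. It is not the code linked above: the column names (postcode, address, work) and the function names are my own assumptions for illustration, and it leaves out fetching the files and saving to the ScraperWiki data store.

```python
import csv
import io
from collections import defaultdict
from html import escape


def group_rows_by_postcode(csv_texts):
    """Collect (address, work) pairs from every CSV under their postcode.

    Assumes each CSV has a header row with the hypothetical columns
    "postcode", "address" and "work".
    """
    grouped = defaultdict(list)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            # Normalise case and spacing so "ln1 1aa " and "LN1 1AA" match.
            postcode = " ".join(row["postcode"].upper().split())
            grouped[postcode].append((row["address"], row["work"]))
    return dict(grouped)


def rows_to_html_table(rows):
    """Render the rows for one postcode as a small HTML table."""
    body = "".join(
        "<tr><td>{}</td><td>{}</td></tr>".format(escape(address), escape(work))
        for address, work in rows
    )
    header = "<tr><th>Property</th><th>Work done</th></tr>"
    return "<table>{}{}</table>".format(header, body)


def build_postcode_tables(csv_texts):
    """Map each postcode to the HTML table shown when it is picked on the map."""
    grouped = group_rows_by_postcode(csv_texts)
    return {postcode: rows_to_html_table(rows) for postcode, rows in grouped.items()}
```

Feed build_postcode_tables a list containing the text of each CSV file and you get back one HTML table per postcode, ready to drop into a map marker. Keying everything on a normalised postcode is what lets rows from different files end up in the same table.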
All in all, this took me about two days to put together, and a lot of that was learning and troubleshooting. What we now have, however, is a very easy way to keep this data up to date. Instead of having to compile new PDF files and alter links on the website, all we do is update the master CSV files and everything else falls into place. This is open data in its purest form: simple source files feeding into automated systems that maintain themselves.
If anyone wants to go into more detail about how I did this, simply hit up my contact form and drop me a line!