OpenAustralia/ScraperWiki hackfest: my first ruby code!
This weekend, I've been hanging out at my old office, taking part in the OpenAustralia/ScraperWiki "What are you up to next weekend?" hackfest. I've been to quite a few OA hackfests before, but always as a host - this is the first time I've been to one with the intent to code.
I've been meaning to learn Ruby for a while, and this seemed like a good opportunity, so I decided to write a scraper to get some more data into PlanningAlerts.
PlanningAlerts is a project of the OpenAustralia Foundation that aims to provide you with email alerts of development applications near you. Development applications are scraped from council websites; alerts are sent (via RSS or email) to people who have requested notifications about applications in that area; and the site gives you a simple way to send your feedback back to the council.
Henare from OpenAustralia has written a guide to writing scrapers using the excellent ScraperWiki. Utilising that, cadging from some of his existing scrapers, and asking a few noob questions along the way, I created a scraper that pulls in information about development applications from the Redfern/Waterloo Authority site.
The good parts of the code I've scraped together come from the doc or from other samples; the ugly parts are my own invention.
When I started working, the provided sample code looked like this:
```ruby
if ScraperWiki.select("* from swdata where `council_reference`='#{record['council_reference']}'").empty?
  ScraperWiki.save_sqlite(['council_reference'], record)
else
  puts "Skipping already saved record " + record['council_reference']
end
```
This breaks on a couple of corner cases: if the `swdata` table doesn't already exist, this will die. If you want to trample on your existing data, you have to manually comment out four lines of code. It also results in one select query per record - fine in small cases, but potentially a time-sink for larger ones.
While I was working on the code, the first problem was fixed by changing the first line to:

```ruby
if (ScraperWiki.select("* from swdata where `council_reference`='#{record['council_reference']}'").empty? rescue true)
```

If the `swdata` table doesn't exist yet, the select throws an exception, and the `rescue` makes the condition true, so the record gets saved anyway.
I expanded on that (and along the way taught myself a little bit about Ruby classes):
```ruby
class Saver
  def initialize
    # If you want to trample on existing data, set this to true
    @trample_data = false
    # One select up front: every council_reference we've already saved,
    # or nil if swdata doesn't exist yet (the select throws in that case)
    @references = (ScraperWiki.select("council_reference from swdata").map { |row| row['council_reference'] } rescue nil)
  end

  def save(record)
    if record
      if @trample_data || @references.nil? || !@references.include?(record['council_reference'])
        ScraperWiki.save_sqlite(['council_reference'], record)
      else
        puts "Skipping already saved record " + record['council_reference']
      end
    end
  end
end
```
This will only do one lookup, and can then do in-memory comparisons to decide if the database needs to be updated for each record. This handles the case where `swdata` doesn't exist yet; and if you want to trample on the data, just one word needs to be changed.
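For context, here's roughly how the class gets used in the scraper (a minimal sketch; the record values here are invented for illustration):

```ruby
saver = Saver.new

# The parsing code builds up record hashes shaped like this:
record = {
  'council_reference' => 'RWA-2011-001',                 # invented example value
  'address'           => '123 Example St, Redfern NSW',  # invented example value
  'description'       => 'Demolition of existing structures',
}

saver.save(record)   # saved on the first run, skipped on later runs
```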
There's some real ugliness in other parts of the code though.
* The entire page uses a tables-based layout, so to find the data I want I have to use `page.search('table table table table table table table table tr')`.
* Both DAs on the site right now have the same data items in the same order; but rather than assume this is consistent, I have my parser iterating over the rows and using a nasty big `case` to interpret the contents of the second cell based on the value of the first cell in the same row (see the first sketch after this list).
* Each DA is on public exhibition from a specific date to another specific date. The two dates are expressed in compact form: if the month/year values are the same for both dates, they'll only be expressed once, on the second date. There's another nasty `case` block to handle the different possible values here and extract useful dates (see the second sketch after this list).
* Every time the code encounters the start of a new record, it tries to save the old record. This leads to an attempt to save an empty record at the start of the parsing (hence the `if record` test in `Saver.save`), and a need to manually do One Last Save at the bottom of the code. The first sketch below shows this pattern too.
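To make the row-by-row `case` and the save-on-new-record pattern concrete, here's a simplified sketch of the loop's shape. The labels (`'Application Number'` and so on) and field names are stand-ins for illustration, not the real site's values:

```ruby
# page is the fetched page object; saver is the Saver instance from above
record = nil

page.search('table table table table table table table table tr').each do |row|
  cells = row.search('td').map { |cell| cell.inner_text.strip }
  next if cells.length < 2

  case cells[0]
  when 'Application Number'   # the start of a new DA (label is illustrative)
    saver.save(record)        # nil before the first DA; Saver.save ignores it
    record = { 'council_reference' => cells[1] }
  when 'Location'
    record['address'] = cells[1] if record
  when 'Proposal'
    record['description'] = cells[1] if record
  end
end

saver.save(record)            # One Last Save for the final record on the page
```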
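And a sketch of the date handling, under an assumption about the formats: suppose the site writes ranges like "1 - 14 March 2011" (shared month and year) or "28 February - 14 March 2011" (shared year only). The real scraper's `case` covers the variants it actually encountered; this is just the shape of the idea:

```ruby
require 'date'

# Expand a compact exhibition range like "1 - 14 March 2011" into [from, to]
def parse_exhibition_dates(text)
  from_text, to_text = text.split(' - ').map(&:strip)
  to_date = Date.parse(to_text)
  from_date = case from_text.split.length
              when 1 then Date.new(to_date.year, to_date.month, from_text.to_i) # day only
              when 2 then Date.parse("#{from_text} #{to_date.year}")            # day and month
              else        Date.parse(from_text)                                 # a complete date
              end
  [from_date, to_date]
end

parse_exhibition_dates('1 - 14 March 2011')
# => [2011-03-01, 2011-03-14]
```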
The complete code is available on ScraperWiki, and the data is already available on the PlanningAlerts site.