Tuesday, October 28, 2008

Data Scraping Wikipedia

Tony Hirst of the Ouseful.Info blog has written an excellent article explaining how you can use the importHTML function in Google Spreadsheets to retrieve data from a table on a publicly accessible web page.
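
To give a feel for what a table import like this does behind the scenes, here is a minimal Python sketch using only the standard library. It is not how Google implements importHTML; the class TableExtractor, the function import_html_table and the sample page are all made up for illustration, and the 1-based table index is an assumption chosen to mirror the spreadsheet function. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <table> in a page, row by row."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return the index-th table on the page (1-based, by assumption)."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# Made-up sample page; a real page would be fetched from its URL first.
SAMPLE = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Exampleton</td><td>1000</td></tr>
  <tr><td>Sampleville</td><td>250</td></tr>
</table>
"""

rows = import_html_table(SAMPLE, 1)
for row in rows:
    print(row)
```

Each row comes back as a list of strings, which is roughly the shape the spreadsheet fills into its cells.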



Tony uses the function to scrape data from Wikipedia tables and then uses Yahoo Pipes to geocode the data and create a Google Map mash-up (here is a Google Map showing UK city populations as scraped from Wikipedia).

I've been playing with the importHTML function for a few days now (since reading Tony's article). Instead of Yahoo Pipes, I've been using Batchgeocode to retrieve the latitudes and longitudes, and then the Google Spreadsheet Map Wizard to create a map from the data.

The Google Maps API Tricks blog also has a post on how you can use Google Maps' own geocoder. The Google geocoder can export its results as CSV data, which can then be imported directly into a Google Spreadsheet.
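
If you would rather handle the geocoded CSV yourself before it goes into a spreadsheet or map, something like the following Python sketch would work. The column names (name, population, latitude, longitude) and the sample rows are hypothetical; check the header row of whatever your geocoder actually exports.

```python
import csv
import io

# Hypothetical geocoder output; column names and values are made up.
GEOCODED_CSV = """name,population,latitude,longitude
Exampleton,1000,51.5,-0.1
Sampleville,250,53.4,-2.2
"""


def load_points(csv_text):
    """Parse geocoded CSV text into a list of dicts with numeric fields."""
    points = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        points.append({
            "name": record["name"],
            "population": int(record["population"]),
            "lat": float(record["latitude"]),
            "lng": float(record["longitude"]),
        })
    return points


points = load_points(GEOCODED_CSV)
for point in points:
    print(point["name"], point["lat"], point["lng"])
```

Converting the coordinates to numbers up front means a malformed row fails loudly here instead of silently producing a misplaced marker later.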

One of the awesome things about the importHTML function is that it is dynamic and automatically refreshes. This means you could use Tony's tutorial with weather or other data presented in table format on the web and create dynamic Google Maps that automatically refresh when the data is updated.

1 comment:

Kyle Bromski said...

great example of data extraction in action!

kyle
www.mozenda.com