One of the trends that keeps coming up in discussions of the evolution of the web is the availability of raw data. This isn't necessarily new: one of the defining pieces of Web 2.0 was websites exposing functionality and machine-readable data through web services (also known as APIs), a practice that began around 2000 and has grown steadily since. ProgrammableWeb currently lists over 1,800 APIs, and just about every internet startup today makes an API available as a matter of course.
A variation on this theme is the emergence of sites that make tabular data available as a service, in contrast to the multitude of sites that make all sorts of datasets available for download (such as infochimps, data.gov or the NYC Data Mine). The sites I'm describing here make large collections of data accessible through APIs, ready to be integrated into any application or mashup anytime, anywhere: essentially "always on" data. The following are the ones I find useful and, just as importantly, won't cost you a dime to use.
Yahoo! Data Tables (Introduced late 2008) – This is a Yahoo! platform that lets you query multiple data sources with the Yahoo! Query Language (YQL), which closely resembles SQL. The very useful console lets you inspect metadata for each data source and run test queries; you can then take your entire query, formatted as a URL, straight from the console and plug it in anywhere. This query, for example, pulls the previous day's closing stock prices for every company in an industry:
select Name, Symbol, PreviousClose from yahoo.finance.quotes where symbol in (select company.symbol from yahoo.finance.industry where id = "112")
Many of the sources are not tables, but actually web services (from sites like Flickr, Last.fm and many others) abstracted so that they can be treated as tables. Another great feature of this platform is the ability to perform table joins within the system.
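As a rough sketch of how a query like this gets consumed, YQL accepted the query as a URL parameter on a public endpoint and returned JSON or XML. The endpoint path below is the one documented at the time; verify it against the current docs before relying on it:

```python
import urllib.parse

# Public YQL endpoint (path as documented at the time of writing -- an assumption
# worth double-checking against Yahoo!'s current developer docs).
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"

def build_yql_url(query: str, response_format: str = "json") -> str:
    """Encode a YQL query into a ready-to-use request URL."""
    params = urllib.parse.urlencode({"q": query, "format": response_format})
    return f"{YQL_ENDPOINT}?{params}"

# The stock-price query from above, as a single string.
query = (
    "select Name, Symbol, PreviousClose from yahoo.finance.quotes "
    "where symbol in (select company.symbol from yahoo.finance.industry "
    'where id = "112")'
)
url = build_yql_url(query)
print(url)
```

This is the same URL the console hands you, so anything that can issue an HTTP GET (a mashup, a cron job, a spreadsheet plugin) can pull the data.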
- Google Fusion Tables (Introduced mid 2009) – This is Google's application for sharing tabular data on the web. Individuals can share tables (the current selection of public tables is rather small) or keep them private. The API allows users to retrieve, insert and update data using SQL-like syntax.
- Factual – This is a relatively new startup that I learned about at a recent event of the New York Semantic Web Meetup. Factual's objective is to provide quality structured data to the community on an open platform. Its website hosts tables for which a history of provenance, edits and discussion is maintained for every value, building accountability into the data. The collection of tables on the site is substantial and quite eclectic (examples include video games and cheats, hiking trails, and endocrinologists in the U.S.). A fairly standard API provides the ability to read, write and create tables, but does not support table joins.
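To give a feel for the Fusion Tables API's SQL-like syntax, here is a sketch of the three statement types it supports. The table id and column names are placeholders, and the query endpoint is the one from the early API, so treat both as assumptions to check against Google's docs:

```python
import urllib.parse

# Placeholder numeric table id and made-up columns -- substitute your own table.
TABLE_ID = "123456"

# The three SQL-like statement types the Fusion Tables API accepts.
select_stmt = f"SELECT name, population FROM {TABLE_ID} WHERE population > 1000000"
insert_stmt = f"INSERT INTO {TABLE_ID} (name, population) VALUES ('Oslo', 590000)"
update_stmt = f"UPDATE {TABLE_ID} SET population = 600000 WHERE ROWID = '42'"

# Reads can be issued as an HTTP GET with the statement in the 'sql' parameter
# (endpoint path from the early API documentation -- an assumption).
query_url = ("https://www.google.com/fusiontables/api/query?"
             + urllib.parse.urlencode({"sql": select_stmt}))
print(query_url)
```

Note that, unlike standard SQL, the FROM clause names a numeric table id rather than a table name, and rows are addressed by a system-assigned ROWID.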
The following are also worthy of mention:
- Google Squared – This Google Labs project generates tabular data about entities by extracting information from the web. There is no API, but once you've generated a table you can export it to a Google spreadsheet. I've obtained mixed results with this tool: the vaguer the topic, the lower the quality of the results.
- Amazon Web Services Public Datasets – A relatively small collection of datasets available for use with Amazon’s Elastic Compute Cloud (EC2) cloud computing services, free when used within Amazon’s platform.
- Freebase Gridworks – This is a Freebase project still in beta. According to this blog post, the application will let users curate private tabular data. The features that I think set this tool apart from Excel or Access are that it can align values in your data with Freebase, and that it can map your data to a graph format and upload it to Freebase.
Is there a race to be the "ultimate" repository and platform for all open/public data? Perhaps. But any site that attempts this will have to deal with the fundamental issues of data completeness, quality, trust and freshness. Such a site will also constantly face what in business school we call a "death spiral": low-quality data results in low usage, leading to insufficient maintenance (whether by the community or by the host, which cannot afford to maintain the data at low profitability), which in turn lowers quality further, reducing usage, and so on.