Freebase is a structured-data counterpart to Wikipedia – rather than writing articles or debating the finer points of a topic, contributors enter individual facts about things (Freebase currently holds facts on over 12 million things). It’s considered part of the “Linked Open Data Cloud” and thus part of the emerging open data ecosystem on the web. The company that runs Freebase, Metaweb, makes Freebase data freely available through a web interface and an API. One of the most powerful features of Freebase is that you can query it like a database, which gives it huge potential for creating all kinds of visualizations, applications and mashups. But for some applications, the usefulness of the platform rests on the accuracy and completeness of the data in it.
Given my background in business, I wanted to see how good a source Freebase would be for basic financial data such as revenues and profits of public companies. A financial analyst typically likes to see 3-5 years of past financial statement data in order to do a basic projection, so I decided to look at the past five years. I queried Freebase with the Metaweb Query Language for companies that had a NASDAQ, NYSE or (former) AMEX ticker symbol, then visualized the availability of revenue figures for 2005-2009. The result showed very sparse data and some conflicts. (The visualization shows a square for each of the years 2009-2005, top to bottom – yellow indicates the absence of a value, black indicates that a single value was found, and red indicates that multiple values were found for a given year.) The companies were segmented into top-level Industry Classification Benchmark categories using approximate string matching.
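The segmentation step mentioned above can be illustrated with a minimal Python sketch. It is not the code used for the charts (those were built in C#), and the category list, the function name `match_icb_industry` and the 0.6 cutoff are all assumptions made for illustration. It uses the standard library's `difflib` to map a free-text industry label to the closest top-level category:

```python
import difflib

# Assumed list of top-level ICB industries; check it against the official
# ICB documentation before relying on it.
ICB_INDUSTRIES = [
    "Oil & Gas", "Basic Materials", "Industrials", "Consumer Goods",
    "Health Care", "Consumer Services", "Telecommunications",
    "Utilities", "Financials", "Technology",
]

def match_icb_industry(label, cutoff=0.6):
    """Return the closest ICB industry for a free-text label, or None.

    Comparison is case-insensitive; `cutoff` is the minimum similarity
    ratio (0..1) that difflib accepts as a match.
    """
    lowered = {name.lower(): name for name in ICB_INDUSTRIES}
    hits = difflib.get_close_matches(label.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None
```

A label such as "Financial" or a misspelling such as "Utilites" resolves to the nearest category, while a label with no sufficiently similar category returns `None` and would need manual review.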
Almost 6,500 companies were found, close to the actual number of companies traded on the NYSE and NASDAQ. Complications I ignored: for many companies the financial year is not the same as the calendar year, and companies have been listed or delisted over the period in question, so the data is inherently incomplete. Doing the same for net income gives even sparser results. (The visualizations were created using C#.NET and WPF.)
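The per-year coding behind both charts (no value, a single value, multiple values) can be sketched in a few lines of Python. This is an illustration only: the input shape of `(ticker, year, value)` tuples and the names `classify_availability` and `completeness` are hypothetical, and it assumes that repeated identical values for a year count as one value rather than as a conflict.

```python
from collections import defaultdict

YEARS = (2009, 2008, 2007, 2006, 2005)  # top to bottom, as in the charts

def classify_availability(records, tickers=None, years=YEARS):
    """Map each ticker to {year: 'missing' | 'single' | 'conflict'}.

    records: iterable of (ticker, year, value) tuples (hypothetical shape).
    tickers: optional full list of tickers, so that companies with no
             records at all still appear with every year 'missing'.
    """
    values = defaultdict(lambda: defaultdict(set))
    for ticker, year, value in records:
        values[ticker][year].add(value)  # a set collapses identical values

    result = {}
    for ticker in (tickers if tickers is not None else list(values)):
        by_year = values.get(ticker, {})
        result[ticker] = {}
        for year in years:
            count = len(by_year.get(year, ()))
            if count == 0:
                result[ticker][year] = "missing"
            elif count == 1:
                result[ticker][year] = "single"
            else:
                result[ticker][year] = "conflict"
    return result

def completeness(result):
    """Share of (company, year) cells that have at least one value."""
    cells = [status for by_year in result.values() for status in by_year.values()]
    return sum(status != "missing" for status in cells) / len(cells) if cells else 0.0
```

The same function works for revenue or net income, since it only looks at how many distinct values exist per company and year. It does not attempt to handle the financial-year and delisting complications noted above.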
In both cases completeness was pretty low (and correctness is of course another issue). I believe that Freebase is a great concept – making structured data available freely – but completeness and accuracy can sometimes be critical. Given that the data comes from the community, this could be an ongoing challenge.