One of the features the Semantic Web offers is the ability to link data across databases and web sites through common identifiers and statements of logic. For example, say that a database contains the statement “A knows B”, and another database contains the statement “B knows C”. With Semantic Web technologies I can query both databases asking “which people does A know through another person?” and obtain C as an answer. In another scenario, B is also known as J, and the second database states that J knows C. If either database states that “J is the same as B”, we can again deduce that A knows C through an intermediate person.
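To make the example concrete, here is a minimal Python sketch of that deduction. It is only an illustration: the two "databases" are tiny in-memory sets of triples, and the predicate names `knows` and `sameAs` as well as the helper functions are made up for this post rather than taken from any real vocabulary or library.

```python
# Toy data: each "database" is a set of (subject, predicate, object) triples.
# The names and predicates are illustrative assumptions, not real identifiers.
db1 = {("A", "knows", "B")}
db2 = {("J", "knows", "C"), ("J", "sameAs", "B")}


def canonical_names(triples):
    """Return a function mapping each name to one representative of its sameAs group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x

    # Merge the groups of every pair declared to be the same entity.
    for s, p, o in triples:
        if p == "sameAs":
            parent[find(s)] = find(o)
    return find


def known_through_another(person, triples):
    """People that `person` knows via exactly one intermediate person."""
    find = canonical_names(triples)
    knows = {(find(s), find(o)) for s, p, o in triples if p == "knows"}
    start = find(person)
    direct = {o for s, o in knows if s == start}
    return {o2 for s2, o2 in knows if s2 in direct and o2 != start}


merged = db1 | db2  # "querying both databases" is just a set union here
result = known_through_another("A", merged)
print(result)
```

Without the `sameAs` statement the two facts would not connect, which is why shared identifiers (or explicit equivalences) matter so much when linking data sources.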
I wanted to see how well this would work with a “real world” question in biotechnology, a field I’ve worked in. Recently a large amount of biological data has been made available on the Semantic Web (look at the lower right area of this Linked Data Cloud map), making it possible to ask a variety of interesting questions by following links across data sources. Indeed, many cutting-edge pharmaceutical companies and researchers have been using Semantic Web technologies to make sense of large volumes of research data (here and here). The W3C also considers healthcare/life sciences an important area for the Semantic Web.
The question I wanted to try to answer using the Semantic Web is: which companies are competing over which genes, proteins or diseases, or rather, which genes/proteins/diseases are the most commercially interesting? This question is somewhat open ended, partly because I don’t yet know what data is actually available to answer it. After some browsing of the Linked Open Data sources I selected three that appear to contain relevant data: Diseasome, DailyMed and DrugBank. All three make data available in RDF, and offer both SPARQL endpoints and downloads of complete database images.
Answering the query will require writing a federated SPARQL query over the three data sources. This query might look something like this (in pseudocode):
X is a Company .
Y is a Gene .
X makes Drug .
Drug targets Protein .
Protein coded by Y .
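As a rough illustration of what such a traversal does, the same chain of patterns can be followed over a handful of in-memory triples in Python. Everything here is hypothetical: the entity names (`AcmePharma`, `DrugX`, ...) and the predicates (`type`, `makes`, `targets`, `codedBy`) simply mirror the pseudocode and are not the actual vocabulary of Diseasome, DailyMed or DrugBank.

```python
# Hypothetical toy triples mirroring the pseudocode patterns.
triples = {
    ("AcmePharma", "type", "Company"),
    ("DrugX", "type", "Drug"),
    ("ProteinP", "type", "Protein"),
    ("GeneG", "type", "Gene"),
    ("AcmePharma", "makes", "DrugX"),
    ("DrugX", "targets", "ProteinP"),
    ("ProteinP", "codedBy", "GeneG"),
}


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the data."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def instances(triples, type_name):
    """All subjects declared to be of the given type."""
    return {s for s, p, o in triples if p == "type" and o == type_name}


def companies_and_genes(triples):
    """Follow Company -> Drug -> Protein -> Gene and collect (company, gene) pairs."""
    genes = instances(triples, "Gene")
    pairs = set()
    for company in instances(triples, "Company"):
        for drug in objects(triples, company, "makes"):
            for protein in objects(triples, drug, "targets"):
                for gene in objects(triples, protein, "codedBy"):
                    if gene in genes:
                        pairs.add((company, gene))
    return pairs


pairs = companies_and_genes(triples)
print(pairs)
```

A real federated SPARQL query would express the same join declaratively and let the endpoints do the matching; the nested loops are only meant to show the shape of the path.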
This query basically traverses relationships among entities in different classes. To actually construct the query I would have to browse the data sources and figure out what information and relationships are present. The challenges with this are:
- If you have multiple databases (in this case three), figuring out which connections exist can be time-consuming and requires going back and forth between the databases. There may in fact be multiple paths to the same answer, and it’s hard to figure out which is optimal.
- Just because a relationship exists doesn’t mean it’s fully populated. For example, a database might contain facts of the form “drug is made by company” but the individual who aggregated the data might not have included such facts for some drugs. This sort of incompleteness might occur in multiple parts of the dataset, resulting in incomplete query results.
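The second challenge can be made measurable. The sketch below, again in Python over toy data, computes what fraction of the entities of a given type actually have a given predicate populated. The function name `coverage` and the predicate `madeBy` are assumptions made for illustration only.

```python
def coverage(triples, type_name, predicate):
    """Fraction of entities of `type_name` that appear as the subject of `predicate`."""
    members = {s for s, p, o in triples if p == "type" and o == type_name}
    if not members:
        return 0.0
    populated = {s for s, p, o in triples if p == predicate and s in members}
    return len(populated) / len(members)


# Hypothetical data: four drugs, but only one has a "madeBy" fact recorded.
sample = {
    ("D1", "type", "Drug"),
    ("D2", "type", "Drug"),
    ("D3", "type", "Drug"),
    ("D4", "type", "Drug"),
    ("D1", "madeBy", "AcmePharma"),
}

drug_coverage = coverage(sample, "Drug", "madeBy")
print(drug_coverage)
```

A low value for a predicate on the query path is a warning that any result built on it will be missing rows, even though the query itself is correct.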
I believe these are fundamental challenges for end users when querying Semantic Web data. A solution I wanted to try was a visualization in the form of a matrix, where rows and columns correspond to entity types (company, drug, gene, etc.) and colored circles in the table cells represent the relationships between those types through a particular predicate. The area of each circle is proportional to the number of relationships (if circles overlap, “notches” are cut out to reveal the color of the circle directly underneath). To create this visualization, database images were downloaded from each source and loaded into Franz’s AllegroGraph (a total of almost 1 million RDF triples). From there a C# program was used to generate the graphic:
With this visualization I can easily tell that:
- Entities of type dailyMed:organization are connected to entities of type dailyMed:drugs
- Entities of type dailyMed:drugs are connected to entities of type diseasome:diseases
- Entities of type diseasome:diseases are connected to entities of type diseasome:genes
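For readers who want to try something similar, the counts behind such a matrix can be approximated in a few lines of Python. This is not the C# program mentioned above, just a sketch of the idea under simplifying assumptions: triples are held in memory, types are given by a `type` predicate, and the predicate names in the sample data are invented. The `radius` helper shows how to size a circle so that its area, not its radius, is proportional to the count.

```python
import math
from collections import Counter


def type_matrix(triples):
    """Count relationships per (subject type, predicate, object type)."""
    types = {}
    for s, p, o in triples:
        if p == "type":
            types.setdefault(s, set()).add(o)

    counts = Counter()
    for s, p, o in triples:
        if p == "type":
            continue
        # An entity may have several types; count the link once per type pair.
        for subject_type in types.get(s, ()):
            for object_type in types.get(o, ()):
                counts[(subject_type, p, object_type)] += 1
    return counts


def radius(count, scale=1.0):
    """Radius of a circle whose area equals scale**2 * count."""
    return scale * math.sqrt(count / math.pi)


# Hypothetical sample data; the predicate names are made up for illustration.
sample_triples = [
    ("Org1", "type", "Organization"),
    ("Drug1", "type", "Drug"),
    ("Drug2", "type", "Drug"),
    ("Disease1", "type", "Disease"),
    ("Gene1", "type", "Gene"),
    ("Org1", "producesDrug", "Drug1"),
    ("Org1", "producesDrug", "Drug2"),
    ("Drug1", "possibleDiseaseTarget", "Disease1"),
    ("Disease1", "associatedGene", "Gene1"),
]

matrix = type_matrix(sample_triples)
for (subject_type, predicate, object_type), n in sorted(matrix.items()):
    print(subject_type, predicate, object_type, n, round(radius(n), 3))
```

Each key in `matrix` corresponds to one colored circle: the subject type picks the row, the object type the column, and the predicate the color.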
In a future post I will use this visualization as a map to formulate the query that answers my question.