Co-engineering the Workplace

For the record, I don’t have an iPhone or an iPad, and I’m still not entirely sold on the idea of always being tethered to one device or another. But I will credit both devices (and the countless lower-priced clones following them) with increasing our comfort with “apps”. An app can be described as a simple, user-friendly encapsulation of some functionality (though an app can certainly be highly functional and complex). Apps are easy to develop, deploy and learn, in contrast to traditional desktop applications, which are typically complex affairs with multiple menus, windows and manuals.

The proliferation of apps, together with the ubiquity of mobile computing and social media and the spread of informatics and programming skills among professionals (the New Literacy), is a critical confluence of trends. It has already changed the way we do many things, such as socializing, shopping and traveling. I wanted to write about another arena where these trends will impact us – how we work. More specifically, I think they will create new opportunities for us to actively shape our workplaces – something I will call “co-engineering the workplace”.

The best way to describe this idea is with an illustration. Say that Jim and Erica are coworkers and Jim provides some service to Erica (and other people like her within the company). Jim thus has to deal with multiple internal clients who are constantly calling and emailing him to learn about the progress of their specific requests. Having to manage many relationships becomes cumbersome for Jim and detracts from the work he does. Currently, there are several options for dealing with this:

  • Stay with the status quo.
  • Implement a project management mega-system to track and manage projects and work items.
  • Make use of small scale project management software, very likely as SaaS (Software as a Service).

However, I think that the three trends I described above lend themselves to another option: Jim, Erica and their employer co-engineer a solution (think of Jim and Erica as typical employees, not IT personnel). It goes something like this:

  1. Jim tracks his work items on a spreadsheet. He decides to make a part of his spreadsheet available internally through an API.
  2. Jim contacts the corporate IT department. An advisor looks over his concept, checking for any critical issues with privacy, security or scalability.
  3. Jim creates the API and registers it with the corporate intranet or the internal corporate cloud.
  4. Jim emails all potential users about his API. Those who are interested receive API keys from Jim.
  5. Erica creates a small app that polls the API for the status of her work items and displays it as a widget on her desktop or homepage (a minimal sketch of such an app follows this list). Mike, another coworker, creates an even better app that does some simple analytics to predict when work items will be completed.
  6. Both share their apps on an internal company “app store” after receiving approval.
  7. Corporate IT monitors and tracks usage of apps and flags any issues.
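
To make step 5 a bit more concrete, here is a minimal sketch (in C#, which I tend to use for these experiments) of what Erica’s polling app might look like. Everything in it is hypothetical – the intranet URL, the “X-Api-Key” header and the JSON fields are stand-ins for whatever Jim’s spreadsheet API actually exposes.

// A minimal sketch of Erica's polling app (step 5). Everything here is hypothetical:
// the intranet URL, the "X-Api-Key" header and the JSON fields stand in for whatever
// Jim's spreadsheet API actually exposes.
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class WorkItemWidget
{
    static async Task Main()
    {
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("X-Api-Key", "erica-key-123"); // key issued by Jim in step 4

        // Hypothetical intranet endpoint exposing part of Jim's tracking spreadsheet (steps 1-3)
        var json = await client.GetStringAsync(
            "http://intranet.example.com/api/work-items?requestor=erica");

        using var doc = JsonDocument.Parse(json);
        foreach (var item in doc.RootElement.EnumerateArray())
        {
            // Show each work item's status; a real widget would render this on a desktop or homepage
            Console.WriteLine($"{item.GetProperty("title").GetString()}: " +
                              $"{item.GetProperty("status").GetString()}");
        }
    }
}

A real version would poll on a timer rather than run once, but the shape of the interaction – a key, a request, a small bit of display logic – is the whole point.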

In my opinion, this approach is much more incremental and organic, primarily because it allows individuals to create the functionality they need as they need it. The company benefits because it harnesses individual creativity to obtain software at costs that are probably going to be much lower than if the functionality had been implemented by consultants or vendors. More importantly, companies can use internal crowdsourcing to manage their portfolio of apps – by monitoring usage, corporate IT can delete or archive applications that receive no use or provide no benefit.

This approach is not going to replace ERP systems. But I think it can be an important niche in the corporate IT portfolio that complements the mega-systems typically used by companies to manage internal information and processes. I would be very interested in hearing from individuals who have blogged about this or worked in a company that has tried this model.

Update: Maybe I should have called this “Pull in the Workplace”. “Pull” is author David Siegel’s vision of a world where the Semantic Web gives us unprecedented control over information – instead of having it “pushed” to us (which is largely the current paradigm), we will be in control and “pull” it as we see fit. This “overhaul of our information infrastructure” could rewrite the rules in just about every aspect of our lives where digital information plays a role. He has written a book and blogs as well; I recommend both for some eye-opening thoughts and great business ideas.

From Company to Gene on the Semantic Web

One of the features the Semantic Web offers is the ability to link data across databases and web sites through common identifiers and statements of logic. For example, say that one database contains the statement “A knows B” and another database contains the statement “B knows C”. With Semantic Web technologies I can query both databases asking “whom does A know through another person?” and obtain C as an answer. In another scenario, B is also known as J, and the second database states that J knows C. If either database states that “J is the same as B”, we can again deduce that A knows C through an intermediate person.
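
To make the mechanics concrete, here is a toy C# sketch of that reasoning, using plain in-memory collections instead of real RDF stores and reasoners: the two “knows” datasets are merged after resolving the alias J = B, and A’s indirect acquaintances fall out of a simple join.

// A toy illustration of the inference described above, using in-memory data
// instead of actual RDF stores. "knows" facts from two databases are combined
// after resolving the alias J = B, and A's friends-of-friends are derived by a join.
using System;
using System.Collections.Generic;
using System.Linq;

class KnowsExample
{
    static void Main()
    {
        var db1 = new List<(string, string)> { ("A", "B") };        // database 1: A knows B
        var db2 = new List<(string, string)> { ("J", "C") };        // database 2: J knows C
        var sameAs = new Dictionary<string, string> { ["J"] = "B" }; // J is the same as B

        // Canonicalize names using the sameAs mapping, then merge both datasets
        string Canon(string x) => sameAs.TryGetValue(x, out var y) ? y : x;
        var knows = db1.Concat(db2)
                       .Select(p => (Canon(p.Item1), Canon(p.Item2)))
                       .ToList();

        // Join: whom does A know through exactly one intermediate person?
        var indirect = from p in knows
                       where p.Item1 == "A"
                       from q in knows
                       where q.Item1 == p.Item2
                       select q.Item2;

        Console.WriteLine(string.Join(", ", indirect)); // prints "C"
    }
}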

I wanted to see how well this would work with a “real world” question in biotechnology, a field I’ve worked in. Recently, a large amount of biological data has been made available on the Semantic Web (look at the lower right area of this Linked Data Cloud map), making it possible to ask a variety of interesting questions by following links across data sources. Indeed, many cutting-edge pharmaceutical companies and researchers have been using Semantic Web technologies to make sense of large volumes of research data (here and here). The W3C also considers healthcare/life sciences an important area for the Semantic Web.

The question I wanted to try to answer using the Semantic Web is: which companies are competing over which genes, proteins or diseases – or rather, which genes/proteins/diseases are the most commercially interesting? This question is somewhat open ended, partly because I don’t yet know what data is actually available to answer it. After some browsing of the Linked Open Data sources I selected three that appear to contain relevant data: Diseasome, DailyMed and DrugBank. All three make data available in RDF, and offer both SPARQL endpoints and downloads of complete database images.

Answering the question will require writing a federated SPARQL query over the three data sources. It might look something like this (in SPARQL-like pseudocode, with placeholder predicates):

SELECT ?company ?gene
WHERE {
  ?company a        :Company .    # e.g. an organization in DailyMed
  ?gene    a        :Gene .       # e.g. a gene in Diseasome
  ?company :makes   ?drug .       # the real predicate URIs have to be
  ?drug    :targets ?protein .    # discovered by browsing the data sources
  ?protein :codedBy ?gene .
}

This query basically traverses relationships among entities in different classes. To actually construct the query I would have to browse the data sources and figure out what information and relationships are present. The challenges with this are:

  • If you have multiple databases (in this case three), figuring out which connections exist can be time-consuming and requires going back and forth between the databases. There may in fact be multiple paths to the same answer, and it’s hard to figure out which is optimal.
  • Just because a relationship exists doesn’t mean it’s fully populated. For example, a database might contain facts of the form “drug is made by company” but the individual who aggregated the data might not have included such facts for some drugs. This sort of incompleteness might occur in multiple parts of the dataset, resulting in incomplete query results.

I believe these are fundamental challenges for end users querying Semantic Web data. One solution I wanted to try was a visualization in the form of a matrix, where rows and columns correspond to entity types (company, drug, gene, etc.) and colored circles in the cells represent the number of relationships through a particular predicate. The area of each circle is proportional to the number of relationships (where circles overlap, “notches” are cut out to reveal the color of the circle directly underneath). To create this visualization, database images were downloaded from each source and loaded into Franz’s AllegroGraph (a total of almost 1 million RDF triples); from there, a C# program was used to generate the graphic.
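
The aggregation behind the matrix is straightforward; the sketch below (in C#) shows the kind of counting involved, run over an in-memory list of triples rather than the actual AllegroGraph store. The triples and predicate names in it are made up for illustration.

// Sketch of the aggregation behind the matrix visualization: for every
// (subject type, predicate, object type) combination, count how many triples
// connect an instance of the first type to an instance of the second.
// The triples below are made-up stand-ins for the data loaded into the store.
using System;
using System.Collections.Generic;
using System.Linq;

class MatrixCounts
{
    record Triple(string S, string P, string O);

    static void Main()
    {
        var triples = new List<Triple>
        {
            new("org1",  "rdf:type", "dailyMed:organization"),
            new("drug1", "rdf:type", "dailyMed:drugs"),
            new("dis1",  "rdf:type", "diseasome:diseases"),
            new("org1",  "dailyMed:producesDrug", "drug1"),            // hypothetical predicate
            new("drug1", "dailyMed:possibleDiseaseTarget", "dis1"),    // hypothetical predicate
        };

        // Map each entity to its type(s)
        var typeOf = triples.Where(t => t.P == "rdf:type")
                            .ToLookup(t => t.S, t => t.O);

        // Count relationships per (subject type, predicate, object type)
        var counts = triples.Where(t => t.P != "rdf:type")
                            .SelectMany(t => from st in typeOf[t.S]
                                             from ot in typeOf[t.O]
                                             select (st, t.P, ot))
                            .GroupBy(x => x)
                            .Select(g => new { Cell = g.Key, Count = g.Count() });

        foreach (var c in counts)
            Console.WriteLine($"{c.Cell.st} --{c.Cell.P}--> {c.Cell.ot}: {c.Count}");
        // Circle areas in the matrix are then drawn proportional to these counts.
    }
}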

With this visualization I can easily tell that:

  • Entities of type dailyMed:organization are connected to entities of type dailyMed:drugs
  • Entities of type dailyMed:drugs are connected to entities of type diseasome:diseases
  • Entities of type diseasome:diseases are connected to entities of type diseasome:genes

In a future post I will use this as a map to formulate the query to answer my question.

Table Service

One of the trends that keeps coming up in the context of the evolution of the web is the availability of raw data. This isn’t necessarily new – one of the main pieces of Web 2.0 was websites exposing functionality and machine-readable data through web services (also known as APIs), a practice that probably started around 2000 and has grown steadily since. Currently, ProgrammableWeb lists over 1,800 APIs, and just about every internet startup today makes an API available as a matter of course.

A variation on this theme is the emergence of sites that make tabular data available as a service, in contrast to the multitude of sites that make all sorts of datasets available for download (such as infochimps, data.gov or the NYC Data Mine). The sites I’m describing here make large collections of data accessible through APIs, available to be integrated into any application or mashup anytime, anywhere – essentially “always on” data. The following are the ones I think are useful and, more importantly, won’t cost you a dime to use.

  • Yahoo Data Tables (Introduced late 2008) – This is a Yahoo! platform that enables you to query multiple data sources with the Yahoo! Query Language (YQL), which is very similar to SQL. The very useful console enables you to obtain metadata for information sources and run test queries (you can then take your entire query, formatted as a URL, from the console and plug it in anywhere). This query, for example, pulls all of the previous day’s closing stock prices for companies in an industry:

    select Name, Symbol, PreviousClose from yahoo.finance.quotes where symbol in (select company.symbol from yahoo.finance.industry where id = "112")

    Many of the sources are not actually tables but web services (from sites like Flickr, Last.fm and many others) abstracted so that they can be treated as tables. Another great feature of this platform is the ability to perform table joins within the system. (A sketch of calling YQL from code appears after this list.)

  • Google Fusion Tables (Introduced mid 2009) – This is Google’s application for sharing tabular data on the web. Individuals can share tables (the current selection of public tables is rather short) or keep them private. The API allows users to retrieve, insert and update data using SQL-like syntax.
  • Factual – This is a relatively new startup I learned about at a recent event of the New York Semantic Web Meetup. Factual’s objective is to provide quality structured data to the community on an open platform. Its website hosts tables for which a history of provenance, editing and discussion is maintained for every value, building accountability into the data. The collection of tables on the site is substantial and quite eclectic (examples include video games and cheats, hiking trails, and endocrinologists in the U.S.). A fairly standard API is provided, which includes the ability to read, write and create tables but does not include table joins.
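
As promised above, here is a rough C# sketch of calling YQL from code: the query is URL-encoded and sent to the public YQL REST endpoint, and the raw JSON comes back. The endpoint URL and the env parameter (needed for community tables such as yahoo.finance.*) are written from memory and should be treated as placeholders rather than a reference.

// Rough sketch of calling YQL from code: URL-encode the query, hit the public
// endpoint and dump the JSON response. The endpoint URL and the "env" parameter
// (needed for community tables such as yahoo.finance.*) are written from memory
// and may have changed; treat them as placeholders.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class YqlExample
{
    static async Task Main()
    {
        var yql = "select Name, Symbol, PreviousClose from yahoo.finance.quotes " +
                  "where symbol in (select company.symbol from yahoo.finance.industry where id = \"112\")";

        var url = "http://query.yahooapis.com/v1/public/yql" +
                  "?q=" + Uri.EscapeDataString(yql) +
                  "&format=json" +
                  "&env=" + Uri.EscapeDataString("store://datatables.org/alltableswithkeys");

        using var client = new HttpClient();
        var json = await client.GetStringAsync(url);  // raw JSON; parse as needed
        Console.WriteLine(json);
    }
}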

The following are also worthy of mention:

  • Google Squared – This Google Labs project generates tabular data about entities by extracting information from the web. There is no API, but once you’ve got a table you can send it to a Google spreadsheet. I’ve obtained mixed results with this tool, with quality generally declining as the topic gets vaguer.
  • Amazon Web Services Public Datasets – A relatively small collection of datasets available for use with Amazon’s Elastic Compute Cloud (EC2) cloud computing services, free when used within Amazon’s platform.
  • Freebase Gridworks – This is a Freebase project still in the beta stage. According to this blog post, the application will provide users with functionality for curating private tabular data. The feature that I think sets this tool apart from Excel or Access is that it can align values in your data with Freebase, as well as map your data to a graph format and upload it to Freebase.

Is there a race to be the “ultimate” repository and platform for all open/public data? Perhaps. But any site that attempts to do this will have to deal with the fundamental issues of data completeness, quality, trust and freshness. Such a site will also constantly face what in business school we call a “death spiral” – low quality data results in low usage, leading to insufficient maintenance (either by the community or by the host, which cannot afford to maintain the data due to low profitability), which in turn leads to lower quality, leading to lower usage, and so on.

My Simple Framework for Creating Mashups

Just about every other management consulting firm out there claims to have some “proprietary” framework or methodology that somehow makes them better. Having worked as a management consultant and gotten an MBA, I’ve been exposed to my share of frameworks, and despite the criticism they often get I’ll admit they have their place. So I thought I could put forward a simple framework I often use (subconsciously) to get things done (if you got into Web 2.0 early on, this might seem pretty basic):

(1) Identify and learn your data sources – this could include getting the OWL ontology for an RDF data source, the XML schema for data returned from an API, or the table fields for CSV files or online tables like Yahoo Data Tables. You should do exploratory queries to ensure you can retrieve the data you want.
(2) Identify common values – often, you will be using one value to obtain data from multiple sources about something (for example, you use the name of the company to get the location of its HQ from one API and the names of its subsidiaries from another), or you will be feeding values obtained from one data source to get information from another source (you get the address of the HQ from one API, then feed it into another to get its coordinates). Identify those values and test.
(3) Figure out how you will process the data – you may want to do something simple like aggregation or sorting, or something more sophisticated like feeding it through algorithms. Again perform some tests, preferably with local copies of sample data.
(4) Map data values (or derived data) to elements of your visualization or interface – this is the part where you get to be creative and is usually the hardest part.

The time spent on each step can vary widely. I will try to put more meat around each of these steps in further posts.
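
To illustrate how the steps hang together, here is a bare-bones C# sketch of a hypothetical mashup: one made-up API returns a company’s headquarters address, a second made-up API geocodes it, the results are sorted (step 3) and printed as a crude interface (step 4). The URLs and JSON fields are placeholders; the point is the flow of common values between sources.

// Bare-bones illustration of the four-step mashup flow. Both endpoints and all
// JSON field names are hypothetical placeholders standing in for real APIs.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class MashupSketch
{
    static async Task Main()
    {
        using var client = new HttpClient();
        var companies = new[] { "Acme Corp", "Globex", "Initech" };
        var results = new List<(string Company, string Hq, double Lat, double Lon)>();

        foreach (var name in companies)
        {
            // Steps 1/2: use the company name (the common value) to get its HQ address
            var hqJson = await client.GetStringAsync(
                $"http://api.example.com/companies?name={Uri.EscapeDataString(name)}");
            var hq = JsonDocument.Parse(hqJson).RootElement.GetProperty("hqAddress").GetString() ?? "";

            // Step 2: feed the HQ address into a second (hypothetical) geocoding API
            var geoJson = await client.GetStringAsync(
                $"http://geo.example.com/geocode?address={Uri.EscapeDataString(hq)}");
            var geo = JsonDocument.Parse(geoJson).RootElement;
            results.Add((name, hq, geo.GetProperty("lat").GetDouble(), geo.GetProperty("lon").GetDouble()));
        }

        // Step 3: simple processing - sort companies north to south by latitude
        // Step 4: map the derived data to an "interface" - here just console output
        foreach (var r in results.OrderByDescending(r => r.Lat))
            Console.WriteLine($"{r.Company} ({r.Hq}): {r.Lat}, {r.Lon}");
    }
}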

The New Literacy

I think I’m getting a handle on why I started this blog, and it’s called the “New Literacy”. Some bloggers have already hit on it – Jon Udell sums it up perfectly here:

“Fluency with digital tools and techniques shouldn’t be a badge of membership in a separate tribe. In conversations with Jeannette Wing and Joan Peckham I’ve explored the idea that what they and others call computational thinking is a form of literacy that needs to become a fourth ‘R’ along with Reading, Writing, and Arithmetic.”

I think I can operationalize this a bit – as I see it, weaving together data from multiple sources, APIs and algorithms into interactive interfaces, as well as applying statistical and scientific thinking, should be basic skills, like cooking a plate of spaghetti. I foresee these skills crossing over from the domain of the geeks to almost everyone – or rather, everyone who is willing to learn them. Udell further suggests that such skills should be taught to students at an early age (an idea also blogged about here):

“In my own recent writing and speaking, I’ve suggested that feed syndication and lightweight service composition are aspects of computational thinking that we ought to formulate as basic principles and teach in middle school or even grade school.”

Given the volume of data and information around us, and the emergence of the web as a ubiquitous computational platform, not being able to do these things could put an individual at a significant academic and professional disadvantage. It’s not about being an expert in any particular area of computation or technology, but about being able to deploy technology in everyday professional and personal settings to get things done better and faster. It’s hard to say exactly whom this trend will affect, but I expect it to affect younger folks the most. I can also see younger workers bringing these skills into workplace settings where older workers simply don’t possess them – resulting in friction or misunderstandings.

Could I be overreaching here? Possibly. Many professions have human interaction as a significant component, and many require working with concepts and knowledge that are unstructured and not easily computable. But I do think this topic is something we should think about – and I will try to explore the “curriculum” of this New Literacy in future posts in this blog.

Update:

This article from two years ago sums up these ideas perfectly – I should have included it in my main post.

Freebase for Financial Data?

Freebase is a structured data version of Wikipedia – rather than writing articles or debating the finer points of a topic, contributors enter individual facts about things (Freebase currently contains facts about over 12 million things). It’s considered part of the “Linked Open Data Cloud” and thus a part of the emerging open data ecosystem on the web. The company that runs Freebase, Metaweb, makes Freebase data freely available through a web interface and an API. One of the most powerful features of Freebase is that you can query it like a database, giving it huge potential for creating all kinds of visualizations, applications and mashups. But in some applications, the usefulness of the platform rests on the accuracy and completeness of the data in it.

Given my background in business, I wanted to see how good a source of basic financial data – revenues, profits and the like for public companies – Freebase would be. A financial analyst typically likes to see three to five years of past financial statement data in order to do a basic projection, so I decided to look at data for the past five years. Querying Freebase with the Metaweb Query Language for companies with NASDAQ/NYSE/(former) AMEX ticker symbols, and visualizing the availability of revenue figures for 2005-2009, showed very sparse data and some conflicts. (The visualization shows a square for each of the years 2009-2005, top to bottom – yellow indicates the absence of a value, black indicates that a single value was found, and red indicates multiple values found in a given year.) The companies were segmented into top-level Industry Classification Benchmark categories using approximate string matching.
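
For the curious, an MQL query is just a JSON template in which nulls mark the fields to be filled in, posted to Freebase’s mqlread service. The sketch below shows the general shape in C#; the service URL, type id and property names are approximate reconstructions from memory, so treat every identifier in it as a placeholder to be checked against the Freebase schema.

// General shape of an MQL query against Freebase's mqlread service: the query is
// a JSON template in which null values mark the fields to be filled in. The URL,
// type id and property names below are placeholders - the real identifiers have
// to be looked up in the Freebase schema browser.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class MqlSketch
{
    static async Task Main()
    {
        // Placeholder MQL template: "companies with a ticker symbol, with their dated revenue figures"
        var mql = @"[{
            ""type"":          ""/business/company"",
            ""name"":          null,
            ""ticker_symbol"": null,
            ""revenue"":       [{ ""amount"": null, ""valid_date"": null }]
        }]";

        var url = "https://api.freebase.com/api/service/mqlread?query=" +
                  Uri.EscapeDataString("{\"query\":" + mql + "}");

        using var client = new HttpClient();
        Console.WriteLine(await client.GetStringAsync(url));  // raw JSON envelope with the results
    }
}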

Almost 6,500 companies were found, close to the actual number of companies traded on the NYSE and NASDAQ. Complications I ignored: for many companies the financial year is not the same as the calendar year, and companies have been added or delisted over the time period in question, resulting in inherently incomplete data. Doing the same for net income data gives even sparser results (the visualizations were created using C#.NET and WPF).

In both cases completeness was pretty low (and correctness is of course another issue). I believe that Freebase is a great concept – making structured data available freely – but completeness and accuracy can sometimes be critical. Given that the data comes from the community, this could be an ongoing challenge.

Textbooks in the Fabric

With the iPad already available for pre-order, we are probably all wondering how the battle between Amazon and Apple will shape up in the eBook segment. eTextbooks are an important part of this market due to a captive audience and estimated annual textbook sales of $9B. Last year Amazon put a large-screen version of the Kindle through an eTextbook pilot program at several schools, while more recently ScrollMotion entered into a deal with major textbook publishers to develop eTextbooks for the iPad. It seems like we are all pretty much aware that the concept of textbooks as shrink-wrapped, neatly packaged compendiums of knowledge printed on bundles of paper is on its way out.

I personally don’t have a dog in the fight, but I do think that we are due for major changes in the textbook arena. There are many reasons to complain about traditional textbooks (cost, need for a physical supply chain, etc.), but what has always bothered me about them is that the content inside textbooks is physically isolated. We can’t directly connect the words and concepts in a paper textbook to the ecosystem of related information on the web and on our computer. A traditional textbook remains static after it’s been printed, in complete ignorance of all of the opinions and ideas of the multitude of readers and the swirling, ever-changing vortex of knowledge and information around it.

Fortunately, a significant amount of work has been ongoing in the arena of biomedical publishing that could inspire the evolution of textbooks. This field is notorious for information overload, with large numbers of research papers published every year. Large volumes of data from high throughput experiments, frequent term ambiguity (many proteins have been given multiple names), and multiple formats for results (images, 3D protein structures, chemical structures, systems models and more) only complicate the picture. This blog post by Abhishek Tiwari describes some of the halting progress in scientific publishing that seeks to address this issue. It also led me to this paper in PLoS Computational Biology which describes a prototype scientific paper of the future, and Elsevier’s Article 2.0 Contest. I was inspired to summarize some of the concepts I read that could address my chief complaint with traditional textbooks:

  • Connections to the data – We want raw data so that we can do new things with it – analyze it, visualize it or mash it up with other data (echoing Tim Berners-Lee’s call for Raw Data Now). The next generation of eTextbooks could make the data behind charts and graphs available so that we can feed it into a platform like ManyEyes, Tableau Public or Swivel, or do our own thing with it (the PLoS paper includes a nice demo of this, where author-provided data was fed into the Google Maps API to create a geographic visualization). We also want to be able to get different or more recent versions of the data and visualize them alongside the author’s analysis. The issue of authors providing raw data in scholarly publications is currently a hot topic (discussed here, here and here) – perhaps a new generation of students who expect data to be made available as the norm will create the demand that spurs change.
  • Collaborative features – A project to make textbooks editable has actually been around since 2003 – the Wikibooks project – but a quick perusal shows a rather limited selection of textbooks and uneven quality. So it is safe to say that at this point community-written textbooks are not going to play a big role in higher education. But the community can still contribute. Let us tag, rate, vote and comment on chapters, paragraphs and even sentences. Readers can choose to view user contributed content or ignore it, and authors could use user contributed content to improve their textbooks in real time. The open source sBook project includes these features, though at this point it appears a bit clunky and doesn’t seem tailored for handheld devices.
  • Embedded structured content – This could be anything from semantic markup (such as identifiers that link back to community databases) to XML (like MathML) that could be loaded directly into specific programs. This feature can make searching, getting auxiliary information and summarizing content much easier.
  • Connections to the sources – Images, quotes, statements and data pulled from other sources should link directly back to those sources. Referenced content can be shown in a preview to avoid breaking up the reading experience. A winning entry in Elsevier’s Grand Challenge applied NLP to identify the most relevant passage in a reference.

I hope that textbooks made available as assemblies of independent, re-purposable units of information will emerge in the near future, and that individuals applying their own creativity to them will transform the process of learning.

Opening Thoughts

Welcome to my blog. I decided to start writing out of a desire to share various projects and ideas of mine, all of which have the goals of understanding and managing data, information and knowledge and facilitating general problem solving.

An overarching interest of mine is understanding how learning, knowledge and work are going to evolve in the coming years. With the advances in technology and the availability of information on the web, it seems to me that we’re going to see a redefinition of what those things mean. We can envision scenarios where “knowing” something won’t necessarily mean having read it somewhere and retrieving it to perform a task, but rather being able to discover it “on the fly”. Another challenge we will have to think about is how we will manage our personal “knowledge bases” (which consist of things both in our minds and on electronic media) as the growth and availability of information makes what we know obsolete and incomplete faster and faster.

Some of the discussions will involve programming. I am not a professional programmer (my background includes an MBA and work experience as a biochemist and management consultant), but I do see programming as a tool. My experimentation with programming probably goes back to high school where I learned Pascal and Hypercard, but really started about 11 years ago with learning C and writing simple programs using the Windows API (anyone who’s tried that knows what I’m talking about). Through the years I picked up C++, VBA, C# and Java and familiarized myself with technologies such as relational databases, web services and the Semantic Web. This blog will thus hopefully be a forum to tie together ideas, tools and techniques from a variety of disciplines – information retrieval, data mining, statistics, information visualization and semantic technologies – all in practical, do-it-yourself type scenarios.

I hope that you will be able to connect to some of what I’m going to post here. In the meantime, check out some of the blogs on my blog roll and enjoy.