The Perils of Automating Big Data, or Any Data

February 23, 2015

Big data can yield big benefits but, when things go wrong, can create big problems that will forever be beyond the ability of people to anticipate.

The continual flap over Big Data prompts a few words of caution. First of all, Big Data implies automation. It's not possible to process huge piles of data manually; the data is either processed automatically or filed away unprocessed.

I've made a few points in the past that are worth repeating:

• If data lacks context, the meaning of the data is indeterminate. It could mean anything. If it can mean anything, it means nothing. Raw (context-free) data is not information.

• Data associated with an inappropriate (wrong) context is worse than nothing because it may appear to have meaning that it does not have.

• Today's computer systems rarely make the developer's assumptions about context transparent. Those assumptions largely remain unknown to the people who use the data.

• A well-written narrative provides contextual information in a variety of ways: the context is described explicitly; allusions and references are made to knowledge and culture that are generally familiar to contemporaneous readers (but not necessarily those in the future); and internal references to other sections of the narrative are included that establish context.

• The meaning of narrative is highly accessible to human readers but is relatively opaque to software applications. Typical programming techniques (the ones used to build EHRs) lack the ability to draw inferences. Even specialized Artificial Intelligence applications require domain-specific preparation before they can draw even rudimentary inferences.

Now, two examples from the news.

First, automated "interpretation" of text. The example comes from Seeking Alpha. In an article commenting on Scottish politics, a reference was made to the Scottish National Party. Seeking Alpha, being a financial site, has apparently programmed its content management system to search for references to publicly traded firms and insert links to additional information. I have no idea what algorithm they use, but I infer that it is something like "three capitalized words in a row equals a firm name." So the text in the article becomes "the two main leftist parties, the Scottish National Party (NYSE:SNP) and the Labor party." If you follow the link, you will discover that SNP is the NYSE ticker symbol for China Petroleum & Chemical Corporation.
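To make the failure concrete, here is a minimal sketch of the kind of rule I am guessing at. It is not Seeking Alpha's actual code. The TICKERS table, the regular expression, and the link_tickers function are all my own hypothetical stand-ins; only the SNP entry comes from the example above.

```python
import re

# Hypothetical ticker table; only the SNP entry comes from the article.
TICKERS = {"SNP": "NYSE"}

# Three or more capitalized words in a row, e.g. "Scottish National Party".
CAPITALIZED_RUN = re.compile(r"\b(?:[A-Z][a-z]+ ){2,}[A-Z][a-z]+\b")


def link_tickers(text):
    """Append (EXCHANGE:TICKER) to any capitalized run whose initials match a ticker."""
    def annotate(match):
        phrase = match.group(0)
        initials = "".join(word[0] for word in phrase.split())
        exchange = TICKERS.get(initials)
        if exchange is None:
            return phrase  # initials are not a known ticker; leave the text alone
        return f"{phrase} ({exchange}:{initials})"

    return CAPITALIZED_RUN.sub(annotate, text)


sentence = "the two main leftist parties, the Scottish National Party and the Labor party"
linked = link_tickers(sentence)
print(linked)
# → the two main leftist parties, the Scottish National Party (NYSE:SNP) and the Labor party
```

The rule does exactly what it was told to do. Nothing in it knows that the sentence is about politics rather than oil companies, which is the whole problem.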

If this doesn't convince you of the futility of using text search to accurately find and select specific phrases, try the following experiment: Choose a large text in electronic form and search for some short sequence of letters that you imagine "ought" to be unique in the given context. I searched my e-mail for ooma (a VoIP provider) and got hits on Bloomberg and oncogenic osteomalacia (OOM) before I was done typing. See how many unexpected matches turn up where your "unique" phrase is merely a constituent of some longer word.
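The experiment is easy to reproduce in a few lines. The sketch below uses a made-up list of subject lines standing in for a mailbox, and the two search functions are illustrative assumptions rather than how any particular mail client works.

```python
import re

# Made-up sample subject lines standing in for a mailbox.
subjects = [
    "Your Ooma bill is ready",
    "Bloomberg morning briefing",
    "Case report: oncogenic osteomalacia (OOM)",
    "Zoom meeting rescheduled",
]


def substring_hits(query, lines):
    """Case-insensitive substring search, as an incremental search box might do."""
    q = query.lower()
    return [line for line in lines if q in line.lower()]


def whole_word_hits(query, lines):
    """Match only when the query stands alone as a complete word."""
    pattern = re.compile(rf"\b{re.escape(query)}\b", re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]


partial = substring_hits("oom", subjects)   # what matches after three letters are typed
exact = whole_word_hits("ooma", subjects)   # the finished query, whole words only
print(len(partial), len(exact))
# → 4 1
```

Whole-word matching trims the noise here, but it is no cure. The SNP link in the first example was an exact, whole-token match, and it was still wrong, because matching characters is not the same as establishing context.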

Second, the risk of automating procedures that update massive data sets. This example is from a programmer's blog and relates to Apple and U2's joint release of the "Songs Of Innocence" album. iTunes has 800 million accounts and, apparently, "Apple set everyone’s account to have 'purchased' this album, which auto-downloaded it to all of their devices, possibly filling up the stingy base-level storage that Apple still hasn’t raised..."

When you have 800 million records in a database, or billions of data points from sensors, no action applied to the data set is trivial. At best, it can consume hours of machine time. At worst, you can make 800 million people mad. And, if someone hacks your system, data on 800 million people can be compromised in one fell swoop.
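A back-of-the-envelope calculation shows why "hours of machine time" is plausible. The throughput figure below is purely an assumption chosen for illustration, not a measurement of any real system; only the record count comes from the example above.

```python
RECORDS = 800_000_000          # account count cited in the text
ASSUMED_RATE_PER_SEC = 50_000  # hypothetical throughput, not a measured figure


def hours_to_process(records, rate_per_sec):
    """Wall-clock hours for one sequential pass over the data set."""
    return records / rate_per_sec / 3600


hours = hours_to_process(RECORDS, ASSUMED_RATE_PER_SEC)
print(f"{hours:.1f} hours")
# → 4.4 hours
```

Change the assumed rate and the answer scales accordingly, but at this size even a generous rate leaves little room to notice a mistake and stop before it has touched everyone.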

Big data can yield big benefits but, when things go wrong, can create big problems that will forever be beyond the ability of people to anticipate. Caution is strongly advised.