Big Data: The Next Big Thing for EHR?

When an article on EHR or data makes it into a medical journal, it is most often a speculative picture of the future; basically, it's science fiction.

Jules Verne's science fiction was undoubtedly inspired by the burst of scientific and technological innovations that surrounded him during the latter part of the 19th century. He imagined space flight and electrically powered submarines much as the creators of "Star Trek" imagined transporter beams and travel at speeds greater than the speed of light.

The technologically charged atmosphere gave him the confidence to believe that, if progress continued apace, his dreams could become reality.

Verne's novel "From the Earth to the Moon" was published in 1865; the first moon landing occurred 104 years later, in 1969. The details of space flight were, of course, nothing like what Verne imagined. The real thing is immensely more dangerous, costly and difficult. In fact, outside the Apollo Project, the feat has yet to be repeated.

Verne did not hold himself out to be a scientist or an expert. His goal was to tell a good story. He succeeded admirably.

We are in the midst of another explosion of technology centered on computers, the Internet, and lately "big data." Today, as in 1865, stories are being written about the future potential of this technology to transform lives and improve health. The difference is that many of the storytellers are presenting themselves as scientists, yet they do not act like scientists. They have not clearly defined their terms, assumptions, and axioms (see "Fundamentals of Concept Formation in Empirical Science" by Carl G. Hempel). The major theoretical breakthrough seems to be: The "best practices" used to build other kinds of computer systems are good enough for medicine. They build first and experiment later, using physicians and patients as study subjects, without their knowledge or consent. If the studies fail to confirm their beliefs, the results are considered suspect, not the assumptions. These experts are nevertheless so confident in their favorite technology that they seek to evangelize, win converts, and influence public policy.

When an article on EHR or data makes it into a medical journal, it is most often a speculative picture of the future or a story of unrequited love of technology, not a scientific report that presents findings or explains concepts. Science fiction is another term for a speculative vision of the future. Take my word for it - or continue reading to learn more.

The latest focus of the experts (who do not want to miss out on the latest fad) is big data. Do you know what big data is? A recent article published in JAMA by Kohane's group entitled "Finding the Missing Link for Big Biomedical Data" assumes that you do; they don't bother to define it. The editors of JAMA also seem to assume that you know, because they did not insist that the authors define it. My own negative experience may corroborate this view. When I submit articles to JAMA that are intended to educate physicians about data-science terms and concepts, they are rejected almost instantly, presumably because the editors believe you already know that stuff as well.

A recent article in the Communications of the Association for Computing Machinery ["Bringing Arbitrary Compute to Authoritative Data" by Mark Cavage and David Pacheco] does not make that assumption. It begins: "While the term big data is vague enough to have lost much of its meaning ..." and states the authors' definition: "In this article, big data refers to a corpus of data large enough to benefit significantly from parallel computation across a fleet of systems ..." They then go on to explain that "big data" is basically data warehousing on a large scale.

This is helpful because data warehousing has been around for a long time. There are established principles and definitions. A warehouse is not (as the JAMA article seems to imply) a giant vacuum cleaner that sucks up "information" (undefined in this context), to be stored for later (unspecified) use. Instead, a warehouse is filled with data that has been Extracted from source systems, Transformed to answer specific questions and then Loaded on a regular schedule (rarely more than once a day). No questions? No warehouse. Without a question, how would you know what to extract, how to transform it or how often to repeat the ETL cycle? Design, including the questions to be answered, comes first.
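To make the "question first" point concrete, here is a minimal sketch of one ETL cycle in Python. Everything in it is an illustrative assumption rather than anything taken from the articles discussed: the sample rows, the field names, the chosen question ("What is the mean HbA1c at each clinic?") and the function names are all hypothetical. Notice that the question decides which rows are extracted and how they are transformed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical source-system rows; a real source would be a lab or EHR database.
SOURCE_ROWS = [
    {"patient": "p1", "clinic": "north", "test": "HbA1c", "value": 7.0},
    {"patient": "p2", "clinic": "north", "test": "HbA1c", "value": 8.0},
    {"patient": "p3", "clinic": "south", "test": "HbA1c", "value": 6.0},
    {"patient": "p1", "clinic": "north", "test": "LDL", "value": 130.0},
]


def extract(rows, test_name):
    """Extract: pull only the rows the question needs."""
    return [row for row in rows if row["test"] == test_name]


def transform(rows):
    """Transform: reshape the rows so they answer the question directly."""
    values_by_clinic = defaultdict(list)
    for row in rows:
        values_by_clinic[row["clinic"]].append(row["value"])
    return {clinic: mean(values) for clinic, values in values_by_clinic.items()}


def load(warehouse, table_name, summary):
    """Load: store the answer-ready result (a dict stands in for the warehouse)."""
    warehouse[table_name] = dict(summary)


def run_etl_cycle(rows, warehouse):
    """One scheduled cycle, for example a nightly run."""
    load(warehouse, "mean_hba1c_by_clinic", transform(extract(rows, "HbA1c")))


warehouse = {}
run_etl_cycle(SOURCE_ROWS, warehouse)
print(warehouse["mean_hba1c_by_clinic"])
```

The LDL row never reaches the warehouse because no question asked for it. Take away the question and none of the three functions can be written.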

According to the JAMA article, we should bring the power of big data to bear on medicine by combining material from a myriad of sources (you really should look at the full-page diagram) - everything from medical records and lab results to Facebook and Twitter posts. Magically, aggregating all of this stuff will enable us to discover meaningful information about patients and populations. To make this possible, all we need are a few more standards (which take years to develop), a universal patient identifier (perhaps a tattoo on the forearm or an implanted RFID chip like your dog's), and the patient's willingness to have his privacy invaded to serve the purposes of this grand vision.

Assuming these trivial obstacles are overcome, could it work as described? Another article, this time from IEEE Computer ["Rethinking Context: Leveraging Human and Machine Computation in Disaster Response" by Vieweg and Hodges], suggests that it can't. Based on research (not opinion), they conclude that making meaningful use of most data sources requires "an understanding of [the data's] context ..." including "implied knowledge, assumptions, and cultural scripts that give meaning to a situation. Whereas machines excel at enumerating explicit [data] they are no match for humans at processing implicit contextual information."

"Pragmatics," they explain, "is the study of contextual meaning, or the way context contributes to meaning..." Their conclusion is that most data is devoid of sufficient contextual metadata and, as a result, computers are currently unable to reliably derive pragmatic meaning. That is a task best left to humans.

As for the supposed benefits of "Big Biomedical Data," the science tells us that you can't get there by simply (well, not so simply) aggregating a big bunch of disparate material. In other words, the expert pronouncements are little more than mediocre science fiction.

Time to go ... Beam me up, Scotty.