On Saturday I attended the Electronic Data Methods (EDM) Forum Symposium in Orlando. The focus of the workshop was how to build infrastructure for sharing clinical data for improving patient care. This comes in two flavors: quality improvement (QI), which refers to learning from clinical data much like a feedback loop, and patient-centered outcomes research (PCOR) or comparative effectiveness research (CER), which looks at how patient outcomes vary across different treatments. There’s a lot of hope that moving to electronic health records (EHRs) can facilitate these kinds of studies, but the upshot of the workshop was that there are a lot of practical impediments.
One big issue that came up was essentially how EHRs are used, and how the data in them is hard to get out in a consistent and quantifiable way. Physicians record results in idiosyncratic ways, and in order to get practicing physicians to buy in, the data format of EHRs is rather flexible, resulting in huge headaches for people trying to extract a data table out of a database of EHRs. Much of the data is in running text; NLP approaches are improving, but extraction is far from automated.
Once the data is extracted, it turns out it’s quite noisy and poorly validated. Sometimes it’s a case of garbage-in: the data was not recorded properly in the first place. Other times, it’s due to miscalibration. There were a number of talks (which I missed) dedicated to this. Then there are questions of whether the data you have collected is representative. If you are trying to draw inferences across multiple sites, how do you appropriately account for confounding factors such as demographic differences? This is the kind of thing that can plague even a single-site observational study, but it becomes particularly acute for multi-site investigations.
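To make the confounding worry concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the two sites, the age groups, the counts, and the helper names are my assumptions, not anything presented at the workshop. Site A sees mostly younger patients and treats most of them, while site B sees mostly older patients and treats few. Pooling the rows naively makes the treatment look good, even though within each age group the untreated patients do better. Direct standardization (weighting each age group's rate by its share of the pooled population) is one standard textbook adjustment, and it undoes the reversal here.

```python
from collections import defaultdict

def make_rows(site, age_group, treated, n_good, n_total):
    """Build hypothetical patient rows: (site, age_group, treated, good_outcome)."""
    return [(site, age_group, treated, i < n_good) for i in range(n_total)]

# Made-up counts, chosen so that the demographic mix differs sharply by site.
records = (
    make_rows("A", "young", True, 80, 100)    # 80% good outcomes
    + make_rows("A", "young", False, 9, 10)   # 90% good outcomes
    + make_rows("B", "old", True, 2, 10)      # 20% good outcomes
    + make_rows("B", "old", False, 30, 100)   # 30% good outcomes
)

def crude_rate(rows, treated):
    """Outcome rate that ignores age group entirely."""
    arm = [r for r in rows if r[2] == treated]
    return sum(r[3] for r in arm) / len(arm)

def stratum_rates(rows):
    """Outcome rate for each (age_group, treated) cell."""
    counts = defaultdict(lambda: [0, 0])  # [good outcomes, total]
    for _site, age_group, treated, good in rows:
        counts[(age_group, treated)][0] += int(good)
        counts[(age_group, treated)][1] += 1
    return {cell: good / total for cell, (good, total) in counts.items()}

def standardized_rate(rows, treated):
    """Direct standardization: weight each age group's rate by its pooled share.

    Assumes every age group contains both treated and untreated patients.
    """
    rates = stratum_rates(rows)
    sizes = defaultdict(int)
    for _site, age_group, _treated, _good in rows:
        sizes[age_group] += 1
    total = sum(sizes.values())
    return sum(rates[(group, treated)] * n / total for group, n in sizes.items())

crude_treated = crude_rate(records, True)
crude_untreated = crude_rate(records, False)
adjusted_treated = standardized_rate(records, True)
adjusted_untreated = standardized_rate(records, False)
```

In this toy data the crude comparison favors treatment while the age-adjusted comparison favors no treatment. Real multi-site data has many more confounders than one age variable, most of them unmeasured, so this only illustrates the shape of the problem.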
Finally, even if each site can extract a more-or-less clean data set, you have the problem of sharing this data. This raises headaches from a policy perspective as well as a technological perspective. On the policy side, each site has its own IRB and its own review, and many institutions are hesitant to cede authority to third-party or federated IRBs. For a small number of sites, a policy and technology framework can be worked out, but scaling these systems up and providing oversight is going to raise new challenges that we probably cannot anticipate. Even if two sites want to share data, they have to implement privacy protections, and depending on the kind of data being shared, technologies may not even exist to mask patient identities. Biological samples are inherently problematic in this regard, but even sharing a data table is non-trivial. Apart from the privacy concerns, creating a common schema for the data to be shared sounds like an obvious thing to do, but if the two sites are using different EHR software… well, let’s say it’s not as easy as sharing PowerPoint from Mac to PC.
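As a rough illustration of why the common schema is more work than it sounds, here is a minimal Python sketch. The field names, date formats, and units for the two sites are invented for the example and do not come from any real EHR product. Even for a three-field record, each site needs its own adapter to reconcile identifiers, date formats, and units before the rows can sit in one table.

```python
from dataclasses import dataclass
from datetime import date

LB_PER_KG = 2.20462  # pounds per kilogram

@dataclass(frozen=True)
class CommonRecord:
    """Hypothetical shared schema that both sites agree to export."""
    patient_key: str
    visit_date: date
    weight_kg: float

def from_site_a(row: dict) -> CommonRecord:
    # Assumed site A export: MM/DD/YYYY dates, weight in pounds, key "mrn".
    month, day, year = (int(part) for part in row["visit"].split("/"))
    return CommonRecord(
        patient_key=f"A-{row['mrn']}",
        visit_date=date(year, month, day),
        weight_kg=round(row["wt_lb"] / LB_PER_KG, 1),
    )

def from_site_b(row: dict) -> CommonRecord:
    # Assumed site B export: ISO dates, weight in kilograms stored as text.
    return CommonRecord(
        patient_key=f"B-{row['patient_id']}",
        visit_date=date.fromisoformat(row["encounter_date"]),
        weight_kg=round(float(row["weight_kg"]), 1),
    )

record_a = from_site_a({"mrn": "001", "visit": "06/23/2012", "wt_lb": 154.0})
record_b = from_site_b(
    {"patient_id": "77", "encounter_date": "2012-06-23", "weight_kg": "69.9"}
)
```

Note that this sketch does nothing about privacy: the keys are built straight from local identifiers, which is exactly the kind of thing a real sharing arrangement would have to mask. It also skips the hard part, which is agreeing on what the fields mean when half the information lives in free text.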
All in all, I came away feeling like the state of the art is both depressing and invigorating — there’s a lot to do, and I just hope that the short time frame that people go on about doesn’t result in half-baked partial solutions becoming the standard. There are a lot of questions from basic statistics through distributed system design here, so maybe after chewing on it a while I’ll get some new problem ideas.
2 thoughts on “EDM Forum 2012”
This situation exactly mirrors the situation with threat detection in cyber security today. First, data is collected in an idiosyncratic (i.e. non-normalized) manner. In cases where the system forces normalized sensing reports (in other words, drops all the free-form text on the floor), it is clear that the systems are seriously lacking, and that there is something essential but undefinable in the free-text fields that would be lost if we insisted on eliminating them. The data is extremely noisy; indeed, it is well understood that the security “signal” is drowning in noise (aka non-security-relevant data that is extremely interesting to non-security folks). There have been several attempts and initiatives at sharing threat-relevant data across organizations, but no one knows how to collate such information. Finally, no one wants to share data for fear of privacy violations and social embarrassment. Actually, to be very precise, what people wish for is the ability to retract/redact shared data retroactively after “sensitive” information is uncovered in the data.
Sounds a bit like having your cake and eating it too, but I can’t tell if the retraction is eating or having…