In the 1990s, the Massachusetts Group Insurance Commission released anonymized individual data on all state employees, including every hospital visit from that decade. In 1997, while still a computer science student at the Massachusetts Institute of Technology, Latanya Sweeney, PhD, requested the data set and was able to re-identify the data, sending the governor of Massachusetts’ health records to his office. She later went on to show that 87% of people in the U.S. can be identified by only three unique pieces of information — their five-digit ZIP code, birthdate and gender.
More than 20 years later, the appropriate use and privacy of patient data remain as much of a concern as ever. Today, we’re not likely to see broad disclosures of individual data by the government, but large hospital systems use big data (including clinical, imaging, genomic and demographic data) to drive healthcare innovations. To develop AI algorithms that have widespread applicability, organizations must share anonymized patient data. Radiology practices are sometimes reluctant to share their data, in part because de-identification in imaging is difficult. Here’s a look at methods for helping radiologists protect patient privacy and what the ACR Data Science Institute® (DSI) is doing to advance solutions that enable data sharing for AI development.
The Regulatory Environment
In the U.S., healthcare data is protected under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). HIPAA covers protected health information (PHI), which is defined as any individually identifiable health information that a covered entity holds or transmits in any form or medium. The HIPAA Privacy Rule also describes the circumstances under which this information can be shared with third parties once it has been de-identified.
HIPAA outlines two methods for de-identification:
• The expert determination method, in which a person with appropriate knowledge of and experience with accepted statistical and scientific principles for rendering information not individually identifiable applies those principles, determines that the risk of re-identification using available information is very small, and documents the methods and results that justify this determination.
• The safe harbor method, which requires the removal of 18 specific identifiers.1
This regulatory environment informs how radiologists, representing the interests of their practices, their patients and their research subjects, approach issues of privacy, consent, data ownership and the concerns of vulnerable populations when they embark on their own AI journey together with third parties.
Disclosures of research and innovation data often hinge on de-identifying images and related data, usually by the safe harbor method, but de-identification in imaging is notoriously difficult. De-identifying medical images requires addressing the metadata stored in DICOM files. While several tools are available, few achieve complete de-identification, especially when dealing with large, heterogeneous data sets. Even when the DICOM metadata is de-identified, there is a concern that identifying information might be “burned in” to the image pixels by modalities, in scanned reports or by associated processing software.
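To make the metadata step concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a compliant de-identification tool: a plain dictionary stands in for a parsed DICOM header, the set of keywords is a small hypothetical subset rather than the full list of 18 safe harbor identifiers, and the function name `scrub_header` and the `subject_code` parameter are invented for this example. It does nothing about text burned into the pixel data.

```python
# Hypothetical subset of DICOM attribute keywords that directly identify a person.
# This is NOT the complete safe harbor list or a full DICOM confidentiality profile.
IDENTIFYING_KEYWORDS = {
    "PatientName",
    "PatientBirthDate",
    "PatientAddress",
    "ReferringPhysicianName",
    "InstitutionName",
    "AccessionNumber",
}


def scrub_header(header: dict, subject_code: str) -> dict:
    """Return a de-identified copy of `header`; the input is left unchanged."""
    cleaned = {}
    for keyword, value in header.items():
        if keyword in IDENTIFYING_KEYWORDS:
            continue  # drop the attribute entirely
        if keyword == "PatientID":
            # Replace the real ID with a study code assigned by the data holder.
            cleaned[keyword] = subject_code
        elif keyword == "StudyDate":
            # DICOM dates are YYYYMMDD strings; keep only the year.
            cleaned[keyword] = str(value)[:4] + "0101"
        else:
            cleaned[keyword] = value
    return cleaned


# Made-up header values for illustration only.
example_header = {
    "PatientName": "DOE^JANE",
    "PatientID": "MRN-0042",
    "PatientBirthDate": "19700315",
    "StudyDate": "20190822",
    "Modality": "MR",
    "InstitutionName": "Example Hospital",
}

scrubbed = scrub_header(example_header, subject_code="SUBJ-0001")
print(scrubbed)
```

A real pipeline would read and write the files with a DICOM library and would also need to handle private tags, nested sequences and free-text fields, which is where heterogeneous data sets tend to defeat simple rules like these.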
Any imaging of the face raises further concerns. Several open-source de-facing software applications are available; however, a review of six available de-face applications for brain MRIs found that the most successful application had only an 89% success rate. De-identification of radiology reports is also a challenge because they sometimes include PHI within their text.2
With the limitations of de-identification in medical imaging, there is a need for other methods of protecting data privacy. Differential privacy and federated learning are two methods being explored:
• Differential privacy is a mathematical definition of privacy that grew out of cryptography research. A differentially private analysis publishes patterns from a data set with carefully calibrated random noise added, so that the results reveal almost nothing about whether any one individual’s data was included. It works best on large data sets, where the noise is small relative to the signal. Because it answers queries approximately, it is useful for general statistics and pattern recognition but has limited utility in answering specific questions.
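• Federated learning trains a copy of a model independently at each participating site on that site’s own patient population and then reports the locally trained models, rather than the underlying data, back to a central server, which combines them into a single shared model.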
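The idea of answering queries approximately can be illustrated with the Laplace mechanism, one standard way of achieving differential privacy for a counting query. The sketch below is a toy example using only the Python standard library; the function names, the sample ages and the choice of epsilon are assumptions made for illustration, and production systems should rely on a vetted differential privacy library.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Made-up patient ages; the query asks how many patients are 60 or older.
ages = [34, 71, 58, 66, 45, 80, 62]
rng = random.Random()
noisy = private_count(ages, lambda age: age >= 60, epsilon=1.0, rng=rng)
print(f"noisy count of patients aged 60+: {noisy:.2f}")
```

A smaller epsilon means more noise and stronger privacy. On a data set of seven patients the noise can swamp the true answer, which is why the approach suits large data sets and aggregate questions better than specific ones.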
Both approaches are promising but still face practical challenges, such as dealing with heterogeneity in distributed systems and maintaining performance despite the increased computational overhead.
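The federated approach can be sketched in a few lines with federated averaging, one common aggregation scheme. This is a toy illustration, not a description of any particular system: the model is a one-variable linear fit, the two "sites" hold made-up noiseless data drawn from y = 2x + 1, and all function names and hyperparameters are assumptions chosen for the example.

```python
def local_train(weights, data, lr=0.05, epochs=20):
    """Run gradient descent on one site's data; only the weights leave the site."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b), n


def federated_average(site_results):
    """Combine site models, weighting each by its number of training examples."""
    total = sum(n for _, n in site_results)
    w = sum(weights[0] * n for weights, n in site_results) / total
    b = sum(weights[1] * n for weights, n in site_results) / total
    return (w, b)


def federated_training(sites, rounds=50):
    """Alternate local training at every site with central averaging."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        results = [local_train(global_weights, data) for data in sites]
        global_weights = federated_average(results)
    return global_weights


# Synthetic data following y = 2x + 1, split across two hypothetical sites.
site_a = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
site_b = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

w, b = federated_training([site_a, site_b])
print(f"shared model: y = {w:.3f}x + {b:.3f}")  # close to y = 2x + 1
```

The central loop only ever sees model weights and example counts, never the raw records. Real deployments add considerable machinery on top of this, and when sites have very different patient populations the simple average can behave less well than it does on this matched toy data.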
How the ACR DSI Is Helping
The ACR DSI has been at the forefront of dealing with these challenges. Besides defining use cases and a dataset directory for AI development, the ACR DSI provides practical tools in a data science toolkit that radiologists can use to develop their models through the ACR AI-LAB™. The ACR DSI is also spearheading a collaborative, multi-institutional federated learning experiment using a combination of central ACR servers and localized institutional datasets that are never shared with other partners.
The ACR has been addressing some of the stickiest issues associated with working with data. In 2019, the College created a data-sharing workgroup that identified five key elements within data sharing: informed consent, data standardization, contracts, valuation and privacy.3 The workgroup proposed that a governance board might be necessary for developing a system for informed consent in data-sharing agreements, creating a uniform consent process and determining whether scenarios exist where sharing of patient data poses a low enough risk to the patient that informed consent would not be required.
Ongoing Challenges and Opportunities
Despite progress, challenges to data sharing remain. HIPAA and related regulations in the U.S. were codified well before the current AI environment took hold. They are criticized both for not providing adequate privacy protection and for being overly restrictive at a time when the benefits of AI in healthcare are limited.
While federated learning shows promise for model training and validation, it is in its early stages. Researchers and ethicists are still grappling with the best way to deal with the potential bias that using unrepresentative datasets introduces into AI models. These challenges define the opportunities for improvement and innovation in the responsible development of data-driven technologies and partnerships.