April 23, 2021

Current State of Data Sharing in Practice

By Morgan P. McBee, MD

One essential component of an AI model is data. Algorithms produced at a single institution may not translate well to others, so it is better to develop algorithms with data pooled across different institutions. Due to privacy concerns and regulations, it’s not feasible to broadly share patients’ protected health information (PHI); however, there are several ways to develop models across institutions.

Data Pooling With Anonymization

One method is to pool anonymized data, which greatly reduces privacy concerns. However, concerns remain, such as the ability to reconstruct a 3D rendering of a patient’s face from de-identified CT or MR images of the head. When sharing imaging data across institutions, business associate agreements and data sharing agreements are necessary to ensure that all parties take the necessary steps to protect the data. All PHI must be stripped from the DICOM tags, and special care must be taken with “burned in” data on ultrasounds and radiographs. Some institutions find it easier not to share any images with “burned in” data than to try to de-identify them.
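The tag-stripping step and the conservative policy on “burned in” data described above can be sketched as follows. This is a minimal illustration only: headers are modeled as plain Python dictionaries keyed by DICOM attribute keyword rather than read with a DICOM library, the PHI_KEYWORDS list is a small hypothetical subset and not a complete de-identification profile, and the function names are invented for this example. The sketch relies on the standard Burned In Annotation attribute and assumes it has been filled in correctly, which should be verified in practice.

```python
# Hypothetical, deliberately incomplete list of PHI-bearing attribute keywords.
# A real de-identification profile covers many more attributes.
PHI_KEYWORDS = {
    "PatientName",
    "PatientID",
    "PatientBirthDate",
    "PatientAddress",
    "ReferringPhysicianName",
    "AccessionNumber",
}


def deidentify(header):
    """Return a copy of a header dict with the listed PHI attributes removed."""
    return {k: v for k, v in header.items() if k not in PHI_KEYWORDS}


def is_shareable(header):
    """Conservative rule: share only images explicitly marked as having no
    burned-in text. A missing flag is treated the same as "YES"."""
    return header.get("BurnedInAnnotation") == "NO"


def prepare_for_pooling(headers):
    """Drop images that may carry burned-in text, then strip PHI from the rest."""
    return [deidentify(h) for h in headers if is_shareable(h)]


# Made-up example headers, for illustration only.
studies = [
    {"PatientName": "DOE^JANE", "PatientID": "123",
     "Modality": "CT", "BurnedInAnnotation": "NO"},
    {"PatientName": "ROE^RICHARD", "PatientID": "456",
     "Modality": "US", "BurnedInAnnotation": "YES"},
]

pooled = prepare_for_pooling(studies)
print(pooled)
```

Excluding flagged images outright mirrors the approach of institutions that prefer not to share such images at all; it trades some data volume for not having to scrub pixel data.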

A recurrent theme of data sharing is ownership of the data. While the law does support the sharing of data, great care must be taken. Another consideration is where patients fit into the ownership of data. In surveys, most patients expressed a willingness to share their data, but their willingness differed depending on who would receive it. For example, patients are more willing to share their data for nonprofit and research purposes and less willing to share it for commercial purposes. It is essential for healthcare institutions to be transparent with patients about what they are doing with patients’ data, and data use agreements could be developed to give patients a greater say.

Federated Learning

Another method of data sharing is federated learning. Under this framework, the models themselves are shared instead of any imaging data, which essentially eliminates the risk of a data breach. Another advantage is that this method requires transferring a vastly smaller volume of data. A model can be developed at one institution with its own data, and then that model can be shared with another institution, where it can be further refined with a different dataset. This process can be repeated across many different institutions, exposing the model to a vast array of data.
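The hand-off described above, in which only the model travels from one institution to the next, can be sketched in a few lines. This is a toy illustration under stated assumptions: the “model” is a two-parameter line fit with plain stochastic gradient descent, each institution's private dataset is replaced by synthetic points, and the names local_update and make_site_data are invented for this example. It follows the sequential refinement described in the text; federated learning is also commonly implemented by averaging updates from several sites, which this sketch does not show.

```python
import random


def local_update(weights, data, lr=0.05, epochs=20):
    """Refine the parameters (w, b) of y = w*x + b on one site's data with SGD.

    Only the returned parameters leave the site; the data never does.
    """
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return (w, b)


def make_site_data(rng, n, x_lo, x_hi):
    """Synthetic stand-in for a private dataset; the true relation is y = 2x + 1."""
    data = []
    for _ in range(n):
        x = rng.uniform(x_lo, x_hi)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0, 0.01)))
    return data


rng = random.Random(0)

# Three hypothetical institutions whose data cover different input ranges.
sites = [
    make_site_data(rng, 50, 0.0, 1.0),
    make_site_data(rng, 50, 0.5, 1.5),
    make_site_data(rng, 50, -1.0, 0.0),
]

model = (0.0, 0.0)
for _round in range(5):
    for site_data in sites:
        # The current model is sent to the site, refined there, and passed on.
        model = local_update(model, site_data)

print(model)
```

What crosses institutional boundaries here is a pair of numbers rather than any dataset, which is the source of the reduced transfer volume noted above.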

In conclusion, there are different approaches to sharing data across institutions, each with its own risks and benefits.