January 02, 2020

Automated Deep Learning Design for Medical Image Classification by Health Care Professionals With No Coding Experience

By Sirus Saeedipour, DO, MBA, radiology resident at the University of Kansas School of Medicine, Kansas City.


Nearly a month after RSNA 2019, many returned from Chicago overwhelmed by the plethora of new technologies available to augment workflows and assist in diagnostic decision-making in radiology. Once again, we were reminded of the rapid progress of deep learning, made possible by advances in GPU development and cloud computing. To many clinicians, however, AI remains an enigmatic "black box": how exactly are these models trained, and how can we apply them to our domain to streamline processes and improve patient outcomes?

This past RFS AI Journal Club, "Usage of Automated Deep Learning Tools," was dedicated to discussing the feasibility and usefulness of automated deep learning technology for medical imaging classification tasks performed by physicians with no coding experience. Panelists included Pearse A. Keane, MD, MSc, FRCOphth, MRCSI, Livia Faes, MD, and Siegfried Wagner, MD. Dr. Keane is an ophthalmologist with no computer science or engineering background who first became excited about deep learning when he learned of the ImageNet breakthroughs of 2012. In 2017, he came across an article in The New York Times entitled "Building A.I. That Can Build A.I.," which sparked his research team's interest in the possibility of democratizing the development of AI models.

The journal club opened with Dr. Wagner, who explained the three prerequisites for actually conducting deep learning:

  1. A large amount of data in a computationally tractable form,
  2. Sufficient processing capacity in the form of GPUs, and
  3. Significant training — not of the model, but of the practitioner — in the fields of machine learning and computer science.

In their study, the authors utilized pre-existing, open-source medical imaging datasets from MESSIDOR, Guangzhou Medical University and the Shiley Eye Institute, the National Institutes of Health, and Human Against Machine 10000 (HAM10000). Further, they had to learn basic shell scripting to write a script that uploaded this data, en masse, to Google Cloud Storage buckets. From there, they were able to import the data into Google Cloud AutoML Vision to label images and to train and evaluate their models through an easy-to-use graphical interface, all from the comfort of their favorite web browser. "What's helpful about this platform is that you have the ability to see all your true positives and false positives, and in general, all your misclassifications," said Dr. Wagner.
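The batch upload described above amounts to a few lines of shell scripting around Google's `gsutil` tool. As a minimal sketch — the bucket name and folder layout here are hypothetical, not from the study — a helper can assemble the command that copies a folder of labeled images into a bucket:

```python
# Hypothetical sketch: build the gsutil command used to batch-upload a local
# folder of images (organized into per-label subfolders) to a Google Cloud
# Storage bucket. The bucket name and directory are illustrative only.

def gsutil_upload_cmd(local_dir: str, bucket: str) -> str:
    """Return a gsutil command that copies local_dir into gs://bucket/.

    -m runs the copy in parallel (useful for thousands of images);
    -r recurses into subfolders (e.g. fundus_images/referable/ ...).
    """
    return f"gsutil -m cp -r {local_dir} gs://{bucket}/"

print(gsutil_upload_cmd("fundus_images", "example-automl-bucket"))
# gsutil -m cp -r fundus_images gs://example-automl-bucket/
```

Once the images are in the bucket, AutoML Vision can import them through its web interface, so no further scripting is required.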

With Google’s AutoML, they developed binary classification, multiclass classification, and multilabel classification models. When tested on held-out data from the original datasets, these models performed very well. In contrast, testing the models on external datasets proved more troublesome. For example, there was a sharp decline in specificity when one of their models was used to identify skin lesions, particularly nevi, in an external dataset. The most likely explanation is the difference in the prevalence of nevi between the datasets (75% in the original training dataset vs. 39% in the external testing dataset), which highlights the problem of overfitting and the importance of AI models generalizing beyond the data on which they were trained.
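Why a prevalence shift matters is easiest to see in a classifier's predictive values: even when sensitivity and specificity are held fixed, the positive predictive value moves with disease prevalence. The numbers below are illustrative only (not from the study), using the same 75% vs. 39% prevalence contrast:

```python
# Illustrative only: with sensitivity and specificity held constant, the
# positive predictive value (PPV) of a classifier still swings with disease
# prevalence -- one reason external-dataset performance can look very
# different from performance on the training distribution.

def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same hypothetical model (90% sensitivity, 90% specificity), two prevalences
# mirroring the 75% vs. 39% nevus prevalence in the two datasets:
print(round(ppv(0.90, 0.90, 0.75), 3))  # 0.964
print(round(ppv(0.90, 0.90, 0.39), 3))  # 0.852
```

The same arithmetic is why external validation on data with a realistic case mix is essential before clinical deployment.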

At the time of this blog post, many companies aside from Google offer automated machine learning solutions. Even the ACR Data Science Institute’s own ACR AI-LAB™ aims to offer radiologists "tools designed to help them learn the basics of AI and participate directly in the creation, validation and use of health care AI." With the emergence of utilities that enable clinicians without computer science backgrounds to develop potentially impactful AI models, it remains paramount to understand the limitations of such models and to test them thoroughly before incorporating them into processes that influence patient care. Dr. Wagner highlighted many limitations of automated machine learning solutions, including:

  • The inability to know exactly which model architecture these automated platforms use,
  • Missing metrics such as sensitivity and specificity, which are important metrics to clinicians, but not common language in the computer science domain,
  • Datasets with unclear inclusion criteria and labeling, and
  • Limited ability to perform systematic external validation of the trained models.
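The second limitation — missing sensitivity and specificity — is at least partially recoverable, since these metrics can be computed by hand from the confusion-matrix counts that such platforms typically do expose. A brief sketch (the counts below are hypothetical):

```python
# Sketch: recover the clinically familiar metrics from raw confusion-matrix
# counts (true/false positives and negatives). Example counts are made up.

def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: of all actual positives, how many were caught."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: of all actual negatives, how many were cleared."""
    return tn / (tn + fp)

print(sensitivity(tp=90, fn=10))  # 0.9
print(specificity(tn=80, fp=20))  # 0.8
```

The other limitations — opaque architectures, unclear dataset curation, and limited external validation — cannot be worked around so easily, which is precisely the panelists' point.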

The discussion ended with closing remarks from Dr. Keane, who reemphasized the role clinicians have in advocating for change and insisting on clinically meaningful metrics to assess the validity of algorithms. "I think that people have jumped to the conclusion that we would suggest that somehow doctors with no coding experience should train a deep learning system and then start to use this on patients," he said. "Nobody is talking about that anytime soon … It would be many years before you could actually go through the appropriate validation and approvals before you could use this in real life."