Clinical utility, not ‘prettiness,’ best metric for evaluating AI improvements to medical imaging

Jha lab evaluates AI techniques for cleaning medical images based on performance in clinical tasks

Shawn Ballard 
In these representative reconstructed images, the column on the left represents ground truth and the column on the right shows a scanned image after AI denoising. In these examples, AI-based denoising methods reduced clinical usefulness by removing defects (top two rows, defects marked by yellow arrows) or introducing false defects (bottom two rows, defects marked by red arrows). (Image courtesy of Jha lab)

Medical imaging plays an essential role in diagnosis and treatment for an array of conditions. From X-rays that reveal a broken bone or a tooth cavity to SPECT scans that spot heart defects, doctors use medical imaging to look inside the body, find disease and treat it appropriately. But what happens when those images aren’t clear?

Recent advances in artificial intelligence have opened the door to using AI-based methods for denoising, or cleaning up, medical images. However, before these tools can be used in clinical settings for real patient care, they need to be rigorously evaluated, said Abhinav Jha, assistant professor of biomedical engineering in the McKelvey School of Engineering and of radiology at Mallinckrodt Institute of Radiology (MIR) in the School of Medicine, both at Washington University in St. Louis.

In a study published in Medical Physics, Jha and collaborators at MIR evaluated a commonly used AI-based approach to denoise cardiac SPECT images. The team assessed the approach in two ways: how visually similar the denoised images were to normal images, and how well the denoised images performed in the clinically relevant task of detecting heart defects.

“Rather alarmingly, while the visual-similarity-based metrics suggested that the AI-based denoising technique improved performance, it was actually having no significant impact, and in some cases, it was even degrading performance on clinical tasks,” Jha said. “This emphasizes the important need for performing evaluation of AI algorithms on clinical tasks and not just relying on visual similarity as a measure of performance.”

In the study, first author Zitong Yu, a doctoral student in Jha’s lab, found that the AI denoising technique tended to smooth out cardiac SPECT images. That smoothing reduced noise as intended, but it also reduced the contrast of the heart defects that doctors need to see to make accurate diagnoses. “This is precisely what we want to prevent from happening in actual medical practice,” Yu said.
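The gap between the two kinds of evaluation can be illustrated with a toy numerical sketch. It is not the study's method: the one-dimensional "image," the noise level, the box-filter stand-in for a denoiser and the simple region-averaging reader are all assumptions chosen for illustration. The sketch reports a visual-similarity metric (root-mean-square error against ground truth) alongside a task-based metric (area under the ROC curve for detecting a defect at a known location).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

# All values below are illustrative assumptions, not taken from the study.
N_PIXELS, N_TRIALS = 64, 2000
DEFECT = slice(30, 34)   # hypothetical defect location, 4 pixels wide
NOISE_SD = 0.2           # standard deviation of additive Gaussian noise
KERNEL = 9               # width of the smoothing filter

healthy = np.ones(N_PIXELS)      # uniform uptake, no defect
diseased = healthy.copy()
diseased[DEFECT] -= 0.3          # reduced uptake inside the defect


def simulate(truth):
    """Return N_TRIALS noisy copies of a ground-truth profile."""
    return truth + rng.normal(0.0, NOISE_SD, size=(N_TRIALS, N_PIXELS))


def box_smooth(images, width):
    """Moving-average filter along each row, a stand-in for a smoothing denoiser."""
    pad = width // 2
    padded = np.pad(images, ((0, 0), (pad, pad)), mode="edge")
    return sliding_window_view(padded, width, axis=-1).mean(axis=-1)


def rmse(images, truth):
    """Visual-similarity metric: root-mean-square error against ground truth."""
    return float(np.sqrt(np.mean((images - truth) ** 2)))


def auc(images_diseased, images_healthy):
    """Task-based metric: AUC of a reader that averages the defect region.

    Lower uptake suggests a defect, so the test statistic is the negated mean.
    """
    pos = -images_diseased[:, DEFECT].mean(axis=1)
    neg = -images_healthy[:, DEFECT].mean(axis=1)
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)


noisy_diseased, noisy_healthy = simulate(diseased), simulate(healthy)
smooth_diseased = box_smooth(noisy_diseased, KERNEL)
smooth_healthy = box_smooth(noisy_healthy, KERNEL)

rmse_noisy = rmse(noisy_diseased, diseased)
rmse_smooth = rmse(smooth_diseased, diseased)
auc_noisy = auc(noisy_diseased, noisy_healthy)
auc_smooth = auc(smooth_diseased, smooth_healthy)

print(f"RMSE: noisy={rmse_noisy:.3f}  smoothed={rmse_smooth:.3f}")
print(f"AUC:  noisy={auc_noisy:.3f}  smoothed={auc_smooth:.3f}")
```

Under these assumptions, smoothing lowers the error against ground truth while also lowering the detection score, because the filter averages the defect into its surroundings. A real evaluation would use clinical or realistically simulated images and a validated observer rather than this simplified reader.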

The study advocates for task-based evaluation of AI-based denoising methods to assess the usefulness of AI-processed images. “Ensuring AI-based denoising works well for real clinical tasks – not just aesthetically – would mean big benefits for patients by producing high-quality images in less time or with reduced radiation doses,” said collaborator Robert J. Gropler, professor of radiology and senior vice chair and division director of radiological sciences at MIR.

Jha and his team have been developing a new denoising technique along these lines, and their presentation on the topic received an honorable mention at the SPIE Medical Imaging meeting. Jha also led a multi-institutional, multi-agency team tasked with developing a framework for evaluating AI-based medical imaging methods. Their guidelines, Recommendations for Evaluation of AI for Nuclear Medicine (RELAINCE), were released in 2022 and informed this latest research.

 


Yu Z, Rahman MA, Laforest R, Schindler TH, Gropler RJ, Wahl RL, Siegel BA, Jha AK. Need for objective task-based evaluation of deep learning-based denoising methods: A study in the context of myocardial perfusion SPECT. Medical Physics, April 3, 2023. DOI: https://doi.org/10.1002/mp.16407

This work was supported in part by the National Institute of Biomedical Imaging and Bioengineering of the National Institutes of Health (R21-EB024647, R01-EB031051, R01-EB031051-02S1 and R56-EB028287).

 


The McKelvey School of Engineering at Washington University in St. Louis promotes independent inquiry and education with an emphasis on scientific excellence, innovation and collaboration without boundaries. McKelvey Engineering has top-ranked research and graduate programs across departments, particularly in biomedical engineering, environmental engineering and computing, and has one of the most selective undergraduate programs in the country. With 165 full-time faculty, 1,420 undergraduate students, 1,614 graduate students and 21,000 living alumni, we are working to solve some of society’s greatest challenges; to prepare students to become leaders and innovate throughout their careers; and to be a catalyst of economic development for the St. Louis region and beyond.
