A research study led by Dr Ihsan Ayyub Qazi, Professor of Computer Science at the Syed Babar Ali School of Science and Engineering at the Lahore University of Management Sciences, has been published in Nature Health, one of the world’s leading medical and health sciences research journals. The study was co-authored by Dr Ayesha Ali, Associate Professor of Economics at the Mushtaq Ahmad Gurmani School of Humanities and Social Sciences, and Dr Muhammad Hamad Alizai, Associate Professor of Computer Science at the Syed Babar Ali School of Science and Engineering. It examined whether artificial intelligence tools could meaningfully improve diagnostic accuracy among physicians in Pakistan, a country whose healthcare system faces both a severe shortage of medical specialists and high patient loads, factors that contribute systematically to diagnostic errors.
The research was structured as a randomised controlled trial involving 58 licensed physicians, divided into two groups: one using GPT-4o as a diagnostic aid and the other relying on conventional online resources. The results were striking. Physicians who used GPT-4o achieved a mean diagnostic reasoning score of 71 percent, compared with just 43 percent for those using conventional resources, a gap that points to the significant potential of large language model-based tools in under-resourced clinical environments. All participating physicians completed 20 hours of structured training in using artificial intelligence tools effectively, including how to identify and respond to incorrect or incomplete outputs, a design choice the researchers considered essential to the study’s validity and safety profile.
A secondary analysis within the trial produced a more nuanced finding: artificial intelligence used alone, without physician involvement, outscored physicians using it as an aid. In 31 percent of cases, however, physicians outperformed the artificial intelligence. Dr Qazi noted that these cases involved red flags and contextual factors the model appeared to have missed, underscoring that human clinical judgment retains meaningful value precisely in the situations where errors are most consequential. The study also raised a concern that has become a recurring theme in medical artificial intelligence research: over-reliance on artificial intelligence outputs without sufficient critical evaluation could lead physicians to accept flawed results unquestioningly. The researchers argue that this risk makes training in the critical evaluation of artificial intelligence not a supplementary consideration but a prerequisite for safe adoption. Dr Qazi described the findings as opening new avenues towards safer and more effective integration of artificial intelligence into healthcare, and noted that while the results are expected to apply in other countries facing similar constraints, replication with other artificial intelligence models remains necessary.