Assessing Artificial Intelligence Tools for Basal Cell and Squamous Cell Carcinoma Detection
The integration of artificial intelligence (AI) into health care continues to expand rapidly, with growing interest in its diagnostic potential. A recent study assessed the diagnostic accuracy of OpenAI’s ChatGPT-4 Omni (ChatGPT-4o) in differentiating common skin lesions, including squamous cell carcinoma (SCC), basal cell carcinoma (BCC), actinic keratosis (AK), benign keratosis (BK), and melanocytic nevi. Using 950 dermatoscopic images from the HAM10000 dataset, the researchers aimed to determine whether ChatGPT-4o could provide accurate, reliable image-based classifications comparable to those made by clinicians.
ChatGPT-4o’s performance varied considerably by lesion type and prompt design. Under the first prompt, modeled after a standardized exam format, the model was most accurate in classifying nevi (79.3%) and least accurate with SCC (66.1%). Its overall accuracy for BCC (77.8%) appeared comparable on the surface, but that figure masked a stark imbalance: specificity was high (0.959) while sensitivity was extremely low (0.081), meaning the model missed most true BCC cases. SCC classifications were similarly inconsistent, with SCC frequently mislabeled as BCC. When the researchers applied a second, more conversational prompt, SCC accuracy improved modestly to 72.8%, but sensitivity declined to 0.245, suggesting that small linguistic changes in prompts can significantly influence AI diagnostic performance.
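To put the BCC figures in concrete terms, consider a brief worked example using the standard definitions (the per-100 counts below are illustrative, not case counts reported by the study): with sensitivity defined as $TP/(TP+FN)$ and specificity as $TN/(TN+FP)$, a sensitivity of 0.081 means that for every 100 true BCC lesions, the model would correctly flag only about 8 and miss roughly 92, even as its 0.959 specificity means it would correctly clear about 96 of every 100 non-BCC lesions. This is why overall accuracy alone can look reassuring while the model fails at the clinically critical task of catching true cancers.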
These results highlight ChatGPT-4o’s difficulty in distinguishing between closely related malignancies that share overlapping dermatoscopic features. While the model performed reasonably well with benign lesions, it consistently faltered when tasked with identifying cancerous ones, a finding that aligns with prior studies showing AI’s relative success in binary classification but reduced reliability in multiclass differentiation. The model’s tendency to misclassify SCC as BCC points to a potential patient-safety concern if such tools were used without clinical oversight.
Importantly, the study emphasized that prompts resembling exam-style questions elicited more accurate responses than open-ended, patient-style phrasing. This sensitivity to prompt wording highlights the unpredictability of conversational AI models and the importance of standardizing input protocols before clinical application. The researchers cautioned that reliance on a single dataset and the absence of diverse imaging conditions limit generalizability, and they reaffirmed the need for rigorous validation of AI tools in dermatology.
This study underscores both the promise and peril of generative AI in clinical decision support. While AI tools like ChatGPT-4o may eventually streamline diagnostic workflows, improve triage efficiency, and expand access to dermatologic expertise, their current limitations demand strong governance frameworks, clinician training, and systematic validation. As payer organizations increasingly evaluate digital health tools for coverage and integration, understanding these limitations is critical to balancing innovation with patient safety and clinical accuracy.
Reference
Chetla N, Chen M, Chang J, et al. Assessing the diagnostic accuracy of ChatGPT-4 in identifying diverse skin lesions against squamous and basal cell carcinoma. JMIR Dermatol. 2025;8:e67299. doi:10.2196/67299