
Letters to the Editor

Correspondence on “Performance of ChatGPT on the Plastic Surgery In-Training Examination”

November 2025
ISSN 1937-5719
2025;25:e43

© 2025 HMP Global. All Rights Reserved.

Any views and opinions expressed are those of the author(s) and/or participants and do not necessarily reflect the views, policy, or position of ePlasty or HMP Global, their employees, and affiliates.


 

Dear Editor,

We read with interest the publication titled “Performance of ChatGPT on the Plastic Surgery In-Training Examination.”1 The study included examination questions from 2015 to 2023, a long and comprehensive time range; however, excluding questions that feature photographs, charts, or graphs from the analysis yields an incomplete picture of the actual examination. Images are used frequently for diagnosis and decision-making in plastic surgery, and the absence of this critical component may limit the ability to evaluate the genuine clinical reasoning capability of ChatGPT (OpenAI) in a real-world setting. Furthermore, using only GPT-3.5 without comparison to GPT-4 or more recent versions is a major limitation, since subsequent versions may have substantially different capabilities.

The chi-square test and logistic regression are useful tools for identifying associations and predictors, but the unbalanced distribution of questions across categories makes it difficult to assess the validity of these statistics. In addition, some P values, such as P = .070 for the breast/cosmetic theme, should not be interpreted as statistically significant. To provide a more complete picture, the analysis could report effect sizes or confidence intervals for the proportion of correct answers in each topic.
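As an illustrative sketch only (assuming the per-category question counts, denoted here n1 and n2, are available from the study data), one such effect size is the difference between the observed proportions of correct answers in two categories, p1 and p2, with an approximate 95% Wald-type confidence interval:

(p1 − p2) ± 1.96 × √[ p1(1 − p1)/n1 + p2(1 − p2)/n2 ]

Reporting intervals of this form alongside the P values would convey both the magnitude and the precision of any between-category differences.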

The fact that ChatGPT scored in the fourth percentile relative to plastic surgery residents may not be due solely to a limitation of the model; rather, it may reflect the complexity of specialized knowledge and an assessment that prioritizes experience and judgment. The question is whether these examinations assess “transferable” knowledge or abilities that require context and real-world experience. Furthermore, unlike a human, ChatGPT cannot ask clarifying questions or employ dynamic clinical reasoning in some instances. Based on the reported statistics, it is also worth examining the individual topic groups in which ChatGPT performed best, such as breast/cosmetic, to determine whether this tendency is attributable to the larger volume of publicly available data on those topics compared with more specialized themes such as hand/lower extremity.

Furthermore, ChatGPT’s performance should be compared with that of resident physicians at each training level to determine whether the model’s knowledge corresponds to a particular degree of clinical ability. Alternatively, research could generate simulated clinical vignettes incorporating complex imagery and data to evaluate the AI more comprehensively. The higher proportion of correct answers on knowledge-level questions (55.1%) than on analysis-level questions (41.4%) suggests that the model cannot yet replace human clinical reasoning in highly complex circumstances.

Acknowledgments

Authors: Hinpetch Daungsupawong, PhD1; Viroj Wiwanitkit, MD2

Affiliations: 1Private Academic Consultant, Phonhong, Lao People's Democratic Republic; 2Saveetha Medical College and Hospital, Saveetha Institute of Medical and Technical Sciences, Chennai, India.

Correspondence: Hinpetch Daungsupawong, PhD, Private Academic Consultant, Phonhong, Lao People's Democratic Republic. Email: hinpetchdaung@gmail.com

Artificial intelligence statement: The authors used a language-editing computational tool in the preparation of this article.

Disclosures: The authors disclose no relevant financial or nonfinancial interests.

References

  1. Raine BE, Kozlowski KA, Fowler CC, Frey JD. Performance of ChatGPT on the plastic surgery in-training examination. Eplasty. 2024;24:e68.