To the Editor:

The Chat Generative Pre-trained Transformer (ChatGPT) is a natural language processing tool driven by artificial intelligence and trained on massive amounts of data (1, 2). ChatGPT has gained significant attention since its release in November 2022 owing to its extraordinary ability to generate human-like answers to text input in a conversational context. ChatGPT was initially built on the GPT-3.5 large language model; in March 2023, an upgraded version, GPT-4, was released with the promise of improved accuracy. Although there is widespread interest in ChatGPT, its application in medicine is controversial (3, 4). ChatGPT was not extensively trained on biomedical data, nor were its responses thoroughly assessed by medical experts. Nevertheless, it is likely that patients and clinicians will turn to ChatGPT for assistance in interpreting laboratory data or understanding how to use clinical laboratory services. Here we evaluated the ability of GPT-4 to answer a representative set of questions frequently encountered in laboratory medicine, ranging from basic knowledge to complex interpretation of laboratory data in a clinical context.

ChatGPT was prompted with questions on topics including clinical chemistry, toxicology, transfusion medicine, microbiology, molecular pathology, hematology, coagulation, and laboratory management. Questions were drawn from textbooks, clinical pathology board questions, clinical consultation questions, and real-life case scenarios. Each question was classified into one of the following categories: (a) basic medical or technical knowledge, (b) laboratory test interpretation in clinical context, or (c) laboratory operations or regulations. ChatGPT's output was scored as fully correct, incomplete/partially correct, incorrect/misleading, or irrelevant by 3 faculty members (MD or PhD) from a pathology and laboratory medicine department. Evaluators reached strong consensus on the quality of the output for all prompts (Table 1).

Table 1. Quality of answers offered by ChatGPT to questions related to clinical laboratory medicine. Queries were submitted to the GPT-4 version on March 24, 2023.

| Quality of answers | Basic medical/technical knowledge | Laboratory test interpretation in clinical context | Laboratory operations/regulations | Total |
| --- | --- | --- | --- | --- |
| Correct, n (%) | 13 (59.1) | 14 (46.7) | 6 (46) | 33 (50.7) |
| Incomplete/partially correct, n (%) | 7 (31.8) | 7 (23.3) | 1 (8) | 15 (23.1) |
| Incorrect/misleading, n (%) | 2 (9.1) | 5 (16.7) | 4 (31) | 11 (16.9) |
| Irrelevant, n (%) | 0 (0) | 4 (13.3) | 2 (15) | 6 (9.3) |
| Total, N | 22 | 30 | 13 | 65 |

A total of 65 questions were submitted to ChatGPT. Overall, it answered 50.7% of the questions correctly, gave incomplete or partially correct answers to 23.1%, incorrect or misleading answers to 16.9%, and irrelevant answers to 9.3%. Correct answers were most frequent for questions involving basic medical or technical knowledge (59.1%), whereas incorrect answers were most common for questions about laboratory operations or regulations (31%). GPT-4 demonstrated a notable improvement in accuracy over its predecessor, GPT-3.5, which correctly answered only 26% of these questions. However, ChatGPT still has limitations. For instance, it could not explain the hook effect in a case with a falsely low prostate-specific antigen result. It was unable to diagnose alcoholic hepatitis when given a panel of liver enzyme results. It also mistakenly diagnosed B-cell acute lymphoblastic leukemia when presented with a case of chronic myeloid leukemia with basophilic blast crisis. In another example, it stated that a few FDA-approved high-sensitivity troponin point-of-care devices are on the U.S. market, which is not true.
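For readers who wish to run a similar audit, the category-level tallies and percentages in Table 1 can be generated from a flat list of scored responses. The following Python sketch is ours, not code from the study; the category and grade labels mirror the rubric described above, and the sample records are placeholders.

```python
from collections import Counter

# Rubric from the study: three question types and four quality grades.
CATEGORIES = [
    "basic knowledge",          # basic medical or technical knowledge
    "test interpretation",      # laboratory test interpretation in clinical context
    "operations/regulations",   # laboratory operations or regulations
]
GRADES = ["correct", "incomplete/partial", "incorrect/misleading", "irrelevant"]

# Each scored response is a (category, consensus grade) pair; placeholder data.
scored = [
    ("basic knowledge", "correct"),
    ("test interpretation", "incorrect/misleading"),
    ("operations/regulations", "irrelevant"),
    # ... one entry per submitted question
]

counts = Counter(scored)
for cat in CATEGORIES:
    total = sum(counts[(cat, g)] for g in GRADES)
    print(f"\n{cat} (n={total})")
    for g in GRADES:
        n = counts[(cat, g)]
        pct = 100 * n / total if total else 0.0
        print(f"  {g:22s} {n:3d} ({pct:.1f}%)")
```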

ChatGPT can generate responses that appear convincing because they use appropriate terminology, but many contain inaccuracies. Often, its response is too broad, and not sufficiently tailored to the case or clinical question presented, to be useful for clinical consultation. As a result, a deep understanding of laboratory medicine is required to distinguish the correct from the incorrect elements of its responses. ChatGPT may also be unable to provide a credible reference to support an answer and in some cases may even provide a fake reference (2).

In addition to the accuracy and relevance of its responses, there are several other considerations for the use of ChatGPT in laboratory medicine. The quality of ChatGPT's response depends on the quality of the prompt. ChatGPT is also nondeterministic: the same prompt may generate different responses for different users, or even for the same user, and rephrasing the query or asking a more specific question can lead to a different answer. This variability is undesirable in a medical reference, where consistency is important. In addition, users should be aware that both GPT-3.5 and GPT-4 were trained on data through September 2021, so their answers may not be up to date (5). For example, the data used to train ChatGPT do not include the most recent CDC blood lead reference value of 3.5 µg/dL for children, as this value was updated by the CDC in October 2021.
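This run-to-run variability is easy to demonstrate: submitting an identical prompt twice to the OpenAI chat completions API will often return two different answers, because output tokens are sampled (the temperature parameter controls how strongly). The sketch below is illustrative only; the prompt is a placeholder of our own, and an OPENAI_API_KEY environment variable is assumed.

```python
import os
import requests

# Submit the same laboratory medicine question twice. With sampling enabled
# (temperature > 0), the two replies will frequently differ in content.
URL = "https://api.openai.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
prompt = "What does a low haptoglobin with an elevated LDH suggest?"  # placeholder query

for trial in (1, 2):
    resp = requests.post(
        URL,
        headers=HEADERS,
        json={
            "model": "gpt-4",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 1.0,  # lowering toward 0 reduces, but does not eliminate, variability
        },
        timeout=60,
    )
    resp.raise_for_status()
    print(f"--- Trial {trial} ---")
    print(resp.json()["choices"][0]["message"]["content"])
```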

In conclusion, while ChatGPT has the potential to improve medical education and provide faster responses to routine clinical laboratory questions, it currently should not be relied on without expert human supervision. Laboratory medicine professionals should be aware of its limitations and remain vigilant about the risks involved while exploring its potential benefits.

Author Contributions

The corresponding author takes full responsibility that all authors on this publication have met the following required criteria of eligibility for authorship: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; (c) final approval of the published article; and (d) agreement to be accountable for all aspects of the article thus ensuring that questions related to the accuracy or integrity of any part of the article are appropriately investigated and resolved. Nobody who qualifies for authorship has been omitted from the list.

Carlos Munoz-Zuluaga (Conceptualization-Supporting, Data curation-Lead, Formal analysis-Lead, Investigation-Supporting, Methodology-Supporting, Writing—original draft-Supporting, Writing—review & editing-Supporting), Zhen Zhao (Conceptualization-Supporting, Data curation-Supporting, Investigation-Supporting, Methodology-Supporting, Writing—review & editing-Supporting), Fei Wang (Investigation-Supporting, Writing—review & editing-Supporting), Matthew Greenblatt (Conceptualization-Supporting, Data curation-Supporting, Writing—review & editing-Supporting), and He Yang (Conceptualization-Lead, Data curation-Supporting, Formal analysis-Lead, Investigation-Lead, Methodology-Lead, Project administration-Lead, Supervision-Lead, Writing—original draft-Lead, Writing—review & editing-Lead).

Authors’ Disclosures or Potential Conflicts of Interest

No authors declared any potential conflicts of interest.

Acknowledgments

We want to thank Dr. Sabrina Racine-Brzostek and Dr. Amy Chadburn for their assistance in reviewing and verifying the accuracy of ChatGPT answers.

Data Availability

Data are publicly available in a public repository on GitHub: https://github.com/SarinaYang9012/GPT-4-evaluation

References

1. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature 2023;614:224–6.

2. The Lancet Digital Health. ChatGPT: friend or foe? Lancet Digit Health 2023;5:e102.

3. Zhavoronkov A. Caution with AI-generated content in biomedicine. Nat Med 2023;29:532.

4. Nature Medicine. Will ChatGPT transform healthcare? Nat Med 2023;29:505–6.

5. Castro H. Revolutionizing medicine: how ChatGPT is changing the way we think about health care. January 12, 2023. https://www.kevinmd.com/2023/01/revolutionizing-medicine-how-chatgpt-is-changing-the-way-we-think-about-health-care.html (Accessed February 2023).

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/pages/standard-publication-reuse-rights)