Liesbet Van Bulck, Philip Moons, What if your patient switches from Dr. Google to Dr. ChatGPT? A vignette-based survey of the trustworthiness, value, and danger of ChatGPT-generated responses to health questions, European Journal of Cardiovascular Nursing, Volume 23, Issue 1, January 2024, Pages 95–98, https://doi.org/10.1093/eurjcn/zvad038
Abstract
ChatGPT is a new artificial intelligence system that revolutionizes the way information can be sought and obtained. In this study, the trustworthiness, value, and danger of ChatGPT-generated responses to four vignettes representing virtual patient questions were evaluated by 20 experts in the domains of congenital heart disease, atrial fibrillation, heart failure, and cholesterol. Experts generally considered the ChatGPT-generated responses trustworthy and valuable, with few considering them dangerous. Forty percent of the experts found the ChatGPT responses more valuable than Google. Experts appreciated the sophistication and nuance of the responses but also recognized that the responses were often incomplete and sometimes misleading.
In line with the Journal's conflict of interest policy, this paper was handled by Jeroen Hendriks.
This survey provides first evidence on the perceived trustworthiness, value, and danger of responses to virtual patient questions generated by ChatGPT, a new large language model.
Experts in cardiac care considered ChatGPT-generated responses to be rather trustworthy, valuable for patients, and not dangerous.
The ChatGPT-generated responses were found to be as valuable as, or more valuable than, the information provided by Google.
Introduction
The use of the internet to seek health information is a common phenomenon. Patients often look for information on the internet because their healthcare provider did not tell them, because they forgot, or because they did not understand the information provided to them. Patients frequently look up information and bring it to their healthcare provider to confirm or question management choices. This trend led to the coining of the term ‘Dr. Google’ in 2010, which refers to the use of the internet by patients to seek health information.1
In November 2022, ChatGPT, a language model based on the third generation of the Generative Pre-trained Transformer (GPT-3) architecture, was launched. Within weeks, this new artificial intelligence (AI) system had attracted 100 million users and was featured in the lay press almost every single day. ChatGPT is a powerful language model that is capable of interpreting and generating text. It can provide a vast amount of information on a wide range of topics, making it useful for various applications such as language translation, summarization, and text completion. However, an often-heard criticism of ChatGPT is that the texts it generates are not always accurate and that it may sometimes ‘hallucinate’. In AI, a hallucination is a confident response by an AI system that does not seem to be justified by its training data.2 As such, the language model fabricates a text that is not founded on reality or empirical data.
In healthcare, both professionals and consumers are likely to use ChatGPT.3 Indeed, large language models, such as ChatGPT, will drastically change the way patients seek information on their health condition.4 A recent review investigated the utility of ChatGPT in healthcare education, research, and practice and highlighted potential limitations.5 One of these limitations was the risk of inaccurate content, including hallucination.5 The review concludes that more studies are needed to evaluate the content of language models, including their potential to advance academia and science, with a particular focus on healthcare settings. Hence, it could be questioned what the consequences would be if patients switch from ‘Dr. Google’ to ‘Dr. ChatGPT’, given the possible inaccuracy of the generated responses. Therefore, we briefly evaluated the trustworthiness, value, and danger of ChatGPT-generated responses to virtual patient questions.
Methods
We conducted a vignette-based survey. Four virtual patient questions in the following cardiovascular domains were posed to ChatGPT: congenital heart disease (CHD); atrial fibrillation (AF); heart failure (HF); and cholesterol (Chol). The questions and the ChatGPT-generated responses for these four vignettes are presented in Supplementary material online, Figures S1–S4.
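Purely as an illustrative sketch of how such a prompt could be posed programmatically (the authors posed the questions to ChatGPT directly), a comparable query could be sent via the OpenAI Python client as shown below; the prompt wording and model name are assumptions for illustration and do not reproduce the authors' procedure or the actual vignette wording in Supplementary material online, Figures S1–S4.

```python
# Illustrative sketch only: the study posed the vignettes to ChatGPT itself, not to the API.
# The prompt text and model name are assumptions, not the study's actual wording.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

prompt = (
    "I was born with a congenital heart defect. "
    "What should I take into account if I want to become pregnant?"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumption: a GPT-3-era model comparable to ChatGPT at the time
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```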
Each vignette was given to five experts in the respective domain. Hence, 20 experts (19 nurses; 1 dietician) from the authors’ professional network participated in this survey. Experts were known to be actively involved in clinical practice and/or clinical research in the respective cardiovascular domains. Using an online survey questionnaire devised for this study (see Supplementary material online, Box S1), they rated the vignette on its trustworthiness, value, and danger on a scale from 1 to 10, with higher scores representing greater trustworthiness, value, or danger. Further, they were asked to what extent the information provided by ChatGPT was more or less valuable than the information from Google. Moreover, using open questions, the respondents were asked to indicate what they appreciated in the ChatGPT-generated response and what was incorrect or misleading.
Descriptive statistics are reported as medians and quartiles. Data are graphically expressed using a violin plot and a dot matrix chart.
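As an illustration of this descriptive approach, the following sketch summarizes a set of hypothetical expert ratings as medians and quartiles and draws a violin plot; the scores and the use of pandas and matplotlib are assumptions for illustration, not the study data or the authors' analysis code.

```python
# Minimal sketch of the descriptive analysis: medians, quartiles, and a violin plot.
# The scores below are hypothetical placeholders, not the study data.
import pandas as pd
import matplotlib.pyplot as plt

# One row per expert: cardiovascular domain and ratings (1-10) on the three criteria.
ratings = pd.DataFrame({
    "domain":          ["CHD", "AF", "HF", "Chol"] * 5,
    "trustworthiness": [7, 8, 6, 7, 8, 7, 3, 9, 7, 6, 8, 7, 7, 8, 6, 7, 9, 7, 8, 6],
    "value":           [8, 7, 9, 7, 8, 3, 8, 7, 9, 8, 7, 8, 6, 9, 7, 8, 7, 8, 7, 8],
    "danger":          [3, 2, 4, 7, 3, 2, 3, 8, 2, 3, 4, 2, 3, 7, 2, 3, 2, 4, 3, 2],
})

# Medians and quartiles (Q1, Q3) per criterion.
summary = ratings[["trustworthiness", "value", "danger"]].quantile([0.25, 0.5, 0.75])
print(summary)

# Violin plot of the three criteria across all 20 experts.
fig, ax = plt.subplots()
ax.violinplot([ratings[c] for c in ["trustworthiness", "value", "danger"]],
              showmedians=True)
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["Trustworthiness", "Value", "Danger"])
ax.set_ylabel("Score (1-10)")
plt.show()
```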
Results
The ChatGPT-generated responses were generally considered trustworthy (median: 7) and valuable (median: 7.5) by the experts (Central illustration). Most experts did not think that the use of the information provided by ChatGPT on the prompts was dangerous (median: 3). In the evaluation of the disease-specific vignettes, one HF expert scored the trustworthiness low and one AF expert rated the value for patients low (Figure 1). One HF expert and two cholesterol experts thought it would be dangerous if patients used the responses in the (self-)management of their condition. Compared with the information provided by Google, 40% of the experts deemed the information from ChatGPT more valuable, 45% considered it equally valuable, and 15% rated it as less valuable.

Figure 1. Dot matrix plots of trustworthiness, value, and danger of using ChatGPT-generated responses for different cardiovascular domains, evaluated by experts (n = 20). Dots represent the scores of individual experts. Scores range from 1 to 10, with a high score representing very trustworthy, very valuable, or very dangerous. CHD, congenital heart disease; AF, atrial fibrillation; HF, heart failure; Chol, cholesterol. Light colours represent favourable evaluation; dark colours represent less favourable evaluation.

Central illustration. Violin plots of trustworthiness, value for the patient, and danger of using ChatGPT-generated responses by patients, evaluated by experts (n = 20). Scores represent medians, with quartiles between brackets. Light colours represent favourable evaluation; dark colours represent less favourable evaluation.
The experts acknowledged the level of sophistication and nuance in the responses generated by ChatGPT. The following expert quotes illustrate what they appreciated in the responses.
This was both a nuanced and comprehensive response to the question prompt. The summary statements in particular are impressive because they outline several factors that are considered when deciding on a treatment strategy (type/severity of symptoms, AF characteristics, comorbidities). The treatment options are all correct, as are the descriptions. (Expert AF)
I appreciate that it mentions that managing fluid intake is very important for patients with heart failure. It also mentions multiple times to consult healthcare providers for direction and more information and also follow-up regarding symptoms. It also mentions avoiding alcohol and caffeine, which is good. (Expert HF)
However, the experts also recognized that the responses were often incomplete and sometimes misleading. The following quotes are illustrative of this; all responses to the open questions are presented in Supplementary material online, Table S1.
- No information about the importance of pre-conceptional counselling.
- Not outlining the risk for mother and the child.
- A prepregnant conversation should also include a conversation with the couple to discuss possible risks for mother and child.
- ‘Other tests’ is a bit vague.
- The sentence: ‘Medications need to be adjusted, and some people may require additional procedures such as heart valve repair or replacement’ should be written in the paragraph ‘before becoming pregnant’. Now this makes it sound as if this is done during pregnancy, which should be avoided.
- No sentence about the fact that in certain situations/medical conditions the medical team can advise against pregnancy.
- No sentence about the heredity risk. (Expert CHD)
Point 4 is very general. Such information may cause the patient to start, for example, performing exercises typical of abdominal muscle expansion. However, fat tissue from the waist area will not be reduced by such exercises alone. Point 5. The information is correct, however, it misses the point that statins should be taken after consultation with a doctor. (Expert Cholesterol)
Discussion
We evaluated the trustworthiness, value, and danger of the information provided by ChatGPT in response to virtual patient prompts. Overall, the experts indicated that the ChatGPT-generated responses could be considered trustworthy, valuable for patients, and not dangerous. Forty percent of the experts found the ChatGPT responses more valuable than Google.
The value and trustworthiness of the responses were generally rated positively. In their explanations, the experts mentioned that they appreciated the nuanced and comprehensive responses and the fact that ChatGPT advised consulting healthcare providers for further direction. The possible value of ChatGPT for patients has been acknowledged in previous studies, which mentioned that the tool has the potential to improve the health literacy of the general public by providing easily accessible and understandable health information.5 Moreover, the overall positive evaluation of trustworthiness is in line with previous studies, reported in preprints, that investigated the accuracy of ChatGPT-generated responses for healthcare. In a study by Antaki et al., the accuracy of ChatGPT in ophthalmology was investigated using multiple-choice questions from the Ophthalmic Knowledge Assessment Program.6 ChatGPT achieved 55.8% and 42.7% on the exams, comparable to the results of an average first-year resident, which was considered noteworthy and promising.6 Duong and colleagues investigated how well ChatGPT performed in answering questions on genetics, compared with human respondents.7 They concluded that ChatGPT provided rapid and accurate responses to a wide range of genetics-related questions and that it can thereby help non-experts easily access information.7 Hence, that study also acknowledged the value of the responses for patients. Another study evaluated the accuracy and reproducibility of responses to questions on knowledge, management, and emotional support for cirrhosis and hepatocellular carcinoma.8 Overall, that study concluded that practical and multifaceted advice was provided, but it also mentioned several limitations, such as ChatGPT's lack of knowledge of regional variation in screening guidelines.8
However, a few experts in our study were also negative about the value, trustworthiness, and danger of the responses. Indeed, one expert scored 3 on trustworthiness, one expert scored 3 on value, and three experts scored 7 or more on danger. The most common negative feedback was that certain information was missing, too vague, somewhat misleading, or not written in a patient-centred way. Prior reports have described limitations of ChatGPT that can negatively influence the accuracy and value of the responses.5 First, ChatGPT is based on GPT-3, which was trained on data up to 2021; hence, the latest data are not taken into account.5 When GPT-3 or its successors are built into search engines (such as the new Bing), the accuracy may be higher and the information more up to date. Second, interpretability, reproducibility, and transparency are limited, owing to the absence of references in ChatGPT.5 Again, new search engines, such as the new Bing, are much more transparent about their sources. Third, the tool often hallucinates, as mentioned earlier.5 Researchers and developers are currently working on this issue. Fourth, when the wording or tone of the question is slightly adjusted, a different answer is provided. This can be very misleading for patients looking for health-related information. Hence, although the responses provided seem trustworthy and valuable, they must be interpreted in light of these limitations.
To the best of our knowledge, our study is the first to examine the trustworthiness, value, and danger of ChatGPT-generated responses to virtual patient questions. However, some limitations of this study should be mentioned as well. Only four vignettes and 20 experts were selected, and no medical doctors or patient representatives were included. This small number of experts is sufficient for a first brief evaluation, but it prevents firm and definitive conclusions from being drawn. A larger study with more vignettes and more experts would be a logical next step to further investigate the accuracy and value of ChatGPT for healthcare. Additionally, this study only focused on patient questions; hence, the value of ChatGPT for healthcare professionals remains to be investigated. Moreover, the choice of vignettes was arbitrary, although the question on lowering cholesterol is one of the top 10 health-related questions asked on Google.9 Furthermore, each question was formulated using a single wording only. It would have been interesting to pose the same question several times with slightly adjusted wording and to compare the trustworthiness of the different responses.
More research on the accuracy of ChatGPT and other large language models is needed. The growing interest in ChatGPT highlights the need for researchers and developers to address the accuracy and reliability of the information generated by these language models. Accuracy and reliability of information are crucial in healthcare, and patients need to be able to trust the information they receive.
The technology is here to stay. Indeed, AI applications will play an increasing role in nursing practice,10 and patients will progressively use contemporary AI applications to seek information online. However, healthcare workers and patients need to be able to trust the information provided. Therefore, we evaluated the trustworthiness, value, and danger of information provided by ChatGPT in response to virtual patient prompts. In conclusion, we found that ChatGPT-generated responses were considered trustworthy, valuable for patients, and not dangerous. However, while ChatGPT is undoubtedly a powerful tool, many limitations must still be acknowledged and, hence, the responses should be interpreted with caution. If language models, such as ChatGPT, can be further investigated and improved, these tools have the potential to become partners in care. For example, if their reliability is guaranteed, health professionals could refer patients to these tools when they are looking for specific, easy-to-understand information.
Author contributions
Liesbet Van Bulck (Conceptualization, methodology, investigation, resources, data curation, writing), Philip Moons (Conceptualization, methodology, formal analysis, writing, visualization, supervision)
Supplementary material
Supplementary material is available at European Journal of Cardiovascular Nursing online.
Acknowledgements
The authors thank the experts for providing their perspective on the prompts generated by ChatGPT.
Funding
This work is supported by the Research Foundation Flanders (grant number 1159522N to LVB).
Data availability
The data underlying this article can be shared on reasonable request to the corresponding author.
References
Author notes
Conflict of interest: None declared.