Trading off accuracy and explainability in AI decision-making: findings from 2 citizens’ juries

Abstract

Objective: To investigate how the general public trades off explainability versus accuracy of artificial intelligence (AI) systems, and whether this differs between healthcare and non-healthcare scenarios.

Materials and Methods: Citizens' juries are a form of deliberative democracy eliciting informed judgment from a representative sample of the general public around policy questions. We organized two 5-day citizens' juries in the UK with 18 jurors each. Jurors considered 3 AI systems with different levels of accuracy and explainability in 2 healthcare and 2 non-healthcare scenarios. For each scenario, jurors voted for their preferred system; votes were analyzed descriptively. Qualitative data on the considerations behind their preferences included transcribed audio-recordings of plenary sessions, observational field notes, outputs from small group work, and free-text comments accompanying jurors' votes; qualitative data were analyzed thematically by scenario, both per AI system and across systems.

Results: In healthcare scenarios, jurors favored accuracy over explainability, whereas in non-healthcare contexts they valued explainability either equally to, or more than, accuracy. Jurors' considerations in favor of accuracy concerned the impact of decisions on individuals and society, and the potential to increase the efficiency of services. Reasons for emphasizing explainability included increased opportunities for individuals and society to learn and improve future prospects, and an enhanced ability for humans to identify and resolve system biases.

Conclusion: Citizens may value explainability of AI systems in healthcare less than in non-healthcare domains, and less than is often assumed by professionals, especially when weighed against system accuracy. The public should therefore be actively consulted when developing policy on AI explainability.


Stroke
There are more than 100,000 strokes in the UK each year; that is around one stroke every five minutes. About 11% of patients die immediately or within a few weeks as a result of the stroke, making stroke the fourth biggest killer in the UK. Almost two thirds of stroke survivors leave hospital with a disability.
System A - Expert system: 75% (non-expert level, e.g. A&E doctor) accuracy, full explanation. This system uses an algorithm that was developed with help from experienced neurologists and neuroradiologists, and aims to follow the same reasoning as they would. In practice it does not reach the same level of accuracy as they do, but the algorithm is completely transparent in the way it reaches its conclusions: for each individual case it can provide the specific rules that were applied to reach a conclusion. It has an overall accuracy rate of 75%, which is comparable to what most emergency medicine doctors would achieve. This means that in 25% of cases, someone might be classified as having a stroke when they did not (or vice versa), or the type, location, and severity of the stroke might be misjudged.

System B - Random Forest system: 85% (human expert level) accuracy, partial explanation. This system uses an algorithm that was established through machine learning from a large set of patient data collected at English hospitals. This algorithm reaches (human) expert-level performance, but it is not very transparent in the way it reaches its conclusions: it can only tell us which features, in general, are important and which are not. It has an overall accuracy rate of 85%. This means that in 15% of cases, someone might be classified as having a stroke when they did not (or vice versa), or the type, location, and severity of the stroke might be misjudged.

System C - Deep Learning system: 95% (beyond human level) accuracy, no explanation. This system uses advanced AI derived from the same set of patient data as System B. However, it has "taught itself" from the data which features were best able to distinguish strokes from non-strokes, and best able to distinguish different types of stroke, their location, and their severity. This algorithm is not transparent in the way it reaches its conclusions: it is unable to provide any explanation that is understandable by humans. However, it has an overall accuracy rate of 95%, which is better than human experts achieve. This means that in 5% of cases, someone might be classified as having a stroke when they did not (or vice versa), or the type, location, and severity of the stroke might be misjudged.
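The practical difference between these accuracy levels can be made concrete as expected misjudged cases per 1,000 assessments. The following sketch (not part of the original jury materials) uses only the illustrative accuracy figures quoted above:

```python
# Expected misjudged cases per 1,000 assessments, using the accuracy
# figures quoted for the three hypothetical systems in the stroke scenario.
systems = {
    "A (expert system, full explanation)": 0.75,
    "B (random forest, partial explanation)": 0.85,
    "C (deep learning, no explanation)": 0.95,
}

for name, accuracy in systems.items():
    # Error rate is simply 1 - accuracy; scale to a cohort of 1,000 cases.
    errors = round((1 - accuracy) * 1000)
    print(f"System {name}: ~{errors} misjudged cases per 1,000")
```

On these figures, moving from System A to System C would cut expected misjudged cases from 250 to 50 per 1,000 assessments: this is the gain jurors were asked to weigh against losing the explanation.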

Kidney transplantation
It is hoped that with such a system, a larger number of transplanted kidneys will survive longer. There are three automated decision systems to choose from: System A, System B, and System C.
System A - Expert system: 75% (below human expert level) accuracy, full explanation. This system uses an algorithm that was developed with help from experienced kidney doctors, and aims to follow the same reasoning as they would. In practice it does not reach the same level of accuracy as they do, but the algorithm is completely transparent in the way it reaches its conclusions: for each individual case it can provide the specific rules that were applied to reach a conclusion. It has an overall accuracy rate of 75%, which is a little lower than what is currently achieved in practice across the NHS (and lower than that achieved by the top specialists). This means that 25% of the time its predictions were incorrect (e.g. predicting that the kidney would last at least 5 years for the selected patient when in reality it didn't).

System B - Random Forest system: 85% (human expert level) accuracy, partial explanation. This system uses an algorithm that was established through machine learning from a large set of patient data collected at English hospitals. This algorithm achieves (human) expert-level performance, but it is not very transparent in the way it reaches its conclusions: it can only tell us which features, in general, are important and which are not. It has an overall accuracy rate of 85%. This means that 15% of the time its predictions were incorrect (e.g. predicting that the kidney would last at least 5 years for the selected patient when in reality it didn't).

System C - Deep Learning system: 95% (beyond human level) accuracy, no explanation. This system uses advanced AI, derived from the same set of patient data as System B. However, it has "taught itself" from the data which features were best able to distinguish successful matches from unsuccessful matches. This algorithm is not transparent in the way it reaches its conclusions: it is unable to provide any explanation that is understandable by humans. However, it has an overall accuracy rate of 95%, which is better than human experts achieve. This means that 5% of the time its predictions were incorrect (e.g. predicting that the kidney would last at least 5 years for the selected patient when in reality it didn't).

Recruitment
System A - Expert system: 75% (non-expert level, e.g. recruitment officer) accuracy, full explanation. This system uses an algorithm that was developed with help from experienced recruitment officers, and aims to follow the same reasoning as they would. In practice it does not reach the same level of accuracy as they do, but the algorithm is completely transparent in the way it reaches its conclusions: for each individual case it can provide the specific rules that were applied to reach a conclusion. When tested on existing recruitment data, this system was shown to have an overall accuracy rate of 75%. This means that 25% of the time its predictions were incorrect (e.g. predicting that an applicant would be unlikely to become a high-performing employee when in reality they did, or vice versa). The accuracy of this system is comparable to that of a typical recruitment officer.

System B - Random Forest system: 85% (human expert level) accuracy, partial explanation. This system uses an algorithm that was established through machine learning from a large set of recruitment data collected by the organisation. This algorithm achieves (human) expert-level performance, but it is not very transparent in the way it reaches its conclusions: it can only tell us which features, in general, are important and which are not. When tested on existing recruitment data, this system was shown to have an overall accuracy rate of 85%. This means that 15% of the time its predictions were incorrect (e.g. predicting that an applicant would be unlikely to become a high-performing employee when in reality they did, or vice versa). The accuracy of this system is comparable to that of a very experienced recruitment officer.

System C - Deep Learning system: 95% (beyond human level) accuracy, no explanation. This system uses advanced AI, derived from the same set of data as System B. However, it has "taught itself" from the data. This algorithm is not transparent in the way it reaches its conclusions: it is unable to provide any explanation that is understandable by humans. However, it has an overall accuracy rate of 95%, which is better than human experts achieve. This means that 5% of the time its predictions were incorrect (e.g. predicting that an applicant would be unlikely to become a high-performing employee when in reality they did, or vice versa).

Criminal justice
There are 3 automated decision systems for the Police to choose from:

System A - Expert system: 75% (non-expert level, e.g. custody officer) accuracy, full explanation. This system uses an algorithm that was developed with help from very experienced Police Custody Officers, and aims to follow the same reasoning as they would. In practice it does not reach the same level of accuracy as they do, but the algorithm is completely transparent in the way it reaches its conclusions: for each individual case it can provide the specific rules that were applied to reach a conclusion. When tested on existing data about reoffending, this system was shown to have an overall accuracy rate of 75%. This means that 25% of the time its predictions were incorrect (e.g. predicting that an individual would commit a serious offence when in reality they didn't, or vice versa). The accuracy of this system is comparable to that of an average Police Custody Officer.

System B - Random Forest system: 85% (human expert level) accuracy, partial explanation. This system uses an algorithm that was established through machine learning from a large set of criminal offence data collected by the police and local agencies. This algorithm achieves (human) expert-level performance, but it is not very transparent in the way it reaches its conclusions: it can only tell us which features, in general, are important and which are not. When tested on existing data about reoffending, this system was shown to have an overall accuracy rate of 85%. This means that 15% of the time its predictions were incorrect (e.g. predicting that an individual would commit a serious offence when in reality they didn't, or vice versa). The accuracy of this system is comparable to that of a very experienced Police Custody Officer.

System C - Deep Learning system: 95% (beyond human level) accuracy, no explanation. This system uses advanced AI, derived from the same set of data as System B. However, it has "taught itself" from the data. This algorithm is not transparent in the way it reaches its conclusions: it is unable to provide any explanation that is understandable by humans. However, it has an overall accuracy rate of 95%, which is better than human experts achieve. This means that 5% of the time its predictions were incorrect (e.g. predicting that an individual would commit a serious offence when in reality they didn't, or vice versa).

Supplement Part 3: Worked example of the qualitative analysis steps
The process by which the qualitative research team went from initial coding to theme generation is illustrated by the following quote from the stroke scenario: "Who is responsible if the diagnosis is wrong with fatal consequences?". This excerpt was initially attributed to a theme called 'Accuracy/error', but was subsequently re-coded to 'Accountability for decisions' following the first coding meeting. The concept of AI making decisions but not being liable in the same way as a person was pivotal within the third theme, labelled 'Trust'. This theme was initially termed 'Human element', as we felt the AI systems were being conferred with the human traits expected of doctors in the medical scenarios. It was subsequently amended to 'Trust in automated systems', alongside the other subthemes 'Fairness' and 'Delivery of the decision'.

Supplement Part 4: Juror questions captured per scenario, tagged with themes from the qualitative analysis

Kidney transplantation

If younger patients are prioritised, is that data included in the system? Blood pressure and type are objective; age seems more subjective, so ethically different. (Bias)

If a man has to stay within 2 hours of the transplant hospital, could the system tell him his % chance of a match? (Explanation)

If a person is 55-60 years old, will they have less chance of a kidney transplant than a 20 year old? (Bias)

The machine system is based on available data, so there isn't any data from people not given a transplant previously. (Learning)

Could they start including that data if it became available? Could they start to collect it now? (Learning)

With AI systems already in use where a team, including consultant surgeons, previously made decisions, does it make the process more time efficient? Can they use that time to practise their skills and make transplantation better? (Resources)

If there's a really large pool of people all needing transplants but only a certain number of kidneys available, could System C filter down to a list of low-level risk, but will the system then build in who is the most important? (Learning)

When ranking a very good/adequate or poor match, is age the only factor or does it use these categories too? (Learning)

Recruitment
Race was mentioned; could AI be trained to discriminate and be 'racist'? (Bias)

How do you measure success/failure? Not as easy as in the medical scenarios. (Accuracy)

Will the NHS use, or are they currently using, AI to recruit? (Reason)

The use of AI in this context is entirely context dependent; certain organisations will want hundreds of a certain type of employee, whilst others will want to sit at a table and hand-pick them. (Benefit)

Do we answer these questions as the recruiter or from the service user aspect? My views are very different depending on this. (Benefit)

We depend on feedback to improve. If we use System C (deep learning) and receive no feedback, and people are given no feedback, won't society have less feedback and therefore less chance to improve?

Criminal justice
Say the scenario was about a first-time offender who wasn't on any database: if they went to court and were found guilty, the court might rule for them to have rehab. If the person was sent straight to rehab this route, would they not have a record, and do they have any choice in this? (Bias; human interaction)

This person is innocent until proven guilty. I'm not happy sending a person on rehab if they have not been found guilty of any offence. (Bias; learning)

If someone has a criminal record it is more likely they will reoffend. By the system sending a person to trial, they are then more likely to have a record. (Reason)

Could System C be used in offender profiling or to predict future behaviour? (Learning)

Supplement Part 5: Electronic voting results per scenario per jury location
In the tables below, each row shows how jurors of a particular jury voted for a particular scenario. Individual jurors could vote only once per scenario. Values refer to the number of jurors (out of 18 in total) voting for that option.

Criminal justice
Jury        System A    System B    System C
Coventry    8           7           3
Manchester  7           6           5

a) System A, expert system (below human expert-level accuracy, full explanation); System B, random forest system (human expert-level accuracy, partial explanation); System C, deep learning system (beyond human expert-level accuracy, no explanation)
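As a quick consistency check (not part of the paper), the criminal justice rows can be verified against the stated jury size of 18, since each juror voted exactly once per scenario:

```python
# Criminal justice votes per jury for Systems A, B and C,
# taken from the voting results above.
votes = {
    "Coventry": (8, 7, 3),
    "Manchester": (7, 6, 5),
}

for jury, (a, b, c) in votes.items():
    total = a + b + c
    # Each of the 18 jurors voted exactly once per scenario.
    assert total == 18, f"{jury}: expected 18 votes, got {total}"
    print(f"{jury}: A={a}, B={b}, C={c}")
```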

Supplement Part 6: Statements captured during small group discussions on
(dis)advantages of the system receiving the most votes for a particular scenario.
Statements are organised per (sub)theme from our qualitative analysis. The N provided for a statement refers to the number of participants in that jury who selected that statement as being the most important; if no participants selected a statement as the most important, no N is provided.

"Has no ability to learn from data" [Manchester, N=4]

"Data might be corrupt and thus a