The statistics can be confusing - and can even be misleading, as high values on some measures can be falsely reassuring.
The AI algorithm providers will normally provide the sensitivity, specificity and accuracy of their findings - as that is usually part of their CE mark submission.
From a clinician/reporter perspective, what you really want to know is: how likely is the AI to miss something, and how likely is it to overcall findings?
A sensitivity of 90% means the AI will pick up the abnormality 90% of the time, and will therefore miss 10% of cases. That gives users a sense of how likely it is to miss something, which most people find helpful to know.
Users also find it helpful to know, if the AI flags something, how likely it is that the patient really has the abnormality (and how likely it is that the AI has overcalled the finding). That is the positive predictive value (PPV). It may be better described as: if the AI says it’s positive, what percentage of the time is it right?
Unfortunately, to calculate PPV you also need to know the prevalence of the finding in your population. The AI providers don’t normally publish that, as they don’t know your population; you will need to run the tool on a local sample to work it out.
As an example: if the sensitivity for detecting pneumothorax on a chest X-ray is 90% and the specificity is 95%, and you know from running the AI on your chest X-rays that the prevalence of pneumothorax in your population is 3% (so 3% of chest X-rays have a pneumothorax), then you can work out the PPV.
The calculation is: PPV = (sensitivity × prevalence) / [(sensitivity × prevalence) + ((1 − specificity) × (1 − prevalence))]
It’s easier to use an online calculator, e.g. a Sensitivity and Specificity Calculator.
In the example above, the PPV works out at about 36%. This means that if the AI says there is a pneumothorax it will be right roughly a third of the time - and wrong roughly two thirds (about 64%) of the time!
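If you’d rather check the arithmetic yourself than rely on an online calculator, here is a minimal Python sketch of the same calculation. The figures are the worked example above (90% sensitivity, 95% specificity, 3% prevalence) - substitute your own local values:

    def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
        """Positive predictive value from published sensitivity/specificity and local prevalence."""
        true_positives = sensitivity * prevalence                # flagged cases that really have the finding
        false_positives = (1 - specificity) * (1 - prevalence)   # flagged cases that don't
        return true_positives / (true_positives + false_positives)

    # Worked example from the text: pneumothorax on chest X-ray
    print(f"PPV = {ppv(0.90, 0.95, 0.03):.1%}")  # -> PPV = 35.8%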
That raises the question: why do a test that is wrong more often than it’s right when it flags something? How can that be safe? An A&E doctor could potentially think the AI must be right (the computer says YES) and try to insert a chest drain into someone who doesn’t have a pneumothorax.
The question then is: how can you mitigate that risk through training, and still get a positive safety benefit from the AI?
From a training perspective you might then say to clinicians:
“AI does not flag pneumothorax very often (about 3% of cases). When it does flag a ‘possible’ pneumothorax, have a second look to check whether you may have missed one. Bear in mind, though, that the AI will only be right in about a third (1 in 3) of cases and will overcall in two thirds (2 in 3) - so if you disagree, that’s fine: go with your own judgement. The AI is just there to say ‘have a second look’ - just in case you missed one.
Also, the AI will miss 1 in 10 pneumothoraces - so don’t be falsely reassured if it doesn’t flag one and you think there might be a pneumothorax. Have a second look and ask for advice if you are unsure.”
NB: calculating PPV in this way is an approximation based on the published sensitivity and specificity. In general terms, the less common (less prevalent) an abnormality is, the more likely the AI is to overcall a finding - the short calculation below illustrates this.
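To see the effect, you can run the same PPV calculation across a range of prevalences while keeping sensitivity at 90% and specificity at 95% (a sketch - the prevalence values are illustrative, not from any particular service):

    # How PPV falls as a finding becomes rarer (sensitivity 90%, specificity 95%)
    sens, spec = 0.90, 0.95
    for prev in (0.20, 0.10, 0.03, 0.01, 0.005):
        tp = sens * prev                  # true positives per 100 studies
        fp = (1 - spec) * (1 - prev)      # false positives per 100 studies
        print(f"prevalence {prev:6.1%} -> PPV {tp / (tp + fp):5.1%}")

At 20% prevalence the PPV is about 82%; at 3% it is about 36%; at 0.5% it falls to about 8% - even though the sensitivity and specificity haven’t changed.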
Negative predictive value (NPV) also depends on prevalence - and high values can be misleading. For example, lung cancers are typically diagnosed in only ~0.5% (1 in 200) of GP-referral chest X-rays. If you tossed a coin for every chest X-ray, the negative predictive value of ‘tails’ (or heads) for lung cancer would be 99.5%. If anyone wants to buy a coin off me that has a negative predictive value of 99.5% for lung cancer - let me know!
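The coin arithmetic checks out: a coin toss is a ‘test’ with 50% sensitivity and 50% specificity, and the high NPV comes entirely from the rarity of the disease. A minimal sketch mirroring the PPV calculation above:

    def npv(sensitivity: float, specificity: float, prevalence: float) -> float:
        """Negative predictive value from sensitivity, specificity and prevalence."""
        true_negatives = specificity * (1 - prevalence)      # negative results in disease-free patients
        false_negatives = (1 - sensitivity) * prevalence     # negative results in patients with disease
        return true_negatives / (true_negatives + false_negatives)

    # A coin toss 'test' for lung cancer: 50% sensitivity, 50% specificity, 0.5% prevalence
    print(f"NPV = {npv(0.50, 0.50, 0.005):.1%}")  # -> NPV = 99.5%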