The Deceptive Nature of Confidence Scores and the Need for Immediate Reform
Why Traditional Methods Fail in Risk Understanding, Leading to Critical Decision Pitfalls. And How to Fix It :)
Hey, Nacho here. Welcome to the Product Direction’s newsletter, republished on Medium.
When doing discovery activities, we aim to reduce the risk of failure of the opportunities or ideas we are trying to build.
Thus, we must start by understanding the level of risk to:
- Prioritize: Don’t start building ideas with a high chance of failure. Factor the amount of evidence into your priority analysis.
- Plan discovery: high-cost & high-risk options should go into your discovery activities first.
- Align with stakeholders: transparently discuss the evidence (or lack thereof) that justifies the decided priorities.
These are some reasons why many teams use a confidence scale, like the 0 to 10 points grading system we find in ICE or RICE frameworks.
While all methods simplify scoring to have a manageable system, there are some serious drawbacks that you need to be aware of and manage accordingly.
Types of Risk and Discovery Steps
There are 4 prominent types of risks when developing new products or features:
- Value / Desire
- Usability / Appeal
- Viability
- Feasibility
Note: if you need to revisit them, they have been well explained by Marty Cagan or Teresa Torres, among others.
To engage in discovery steps (the name I use for any evidence-gathering activity to reduce risks), we need to identify the assumptions that lack evidence and could make our initiative fail (known as riskiest assumptions).
We employ techniques like assumption mapping. As explained by David Bland, author of Testing Business Ideas, we can categorize assumptions according to the risk they are associated with. For example:
- Desirability / Value: users care enough about this problem to use our new solution.
- Viability: users will be willing to pay the $19 monthly subscription we projected.
- Feasibility: we will be able to use AI to generate meaningful insights for the user (considering the constraints of our team and organization).
Confidence Scales “Hide” Risk Types
The goal behind a confidence scale is to visualize and factor in how certain you are about the other estimations, like impact and effort.
The scale can be very subjective. Usually, numbers are poorly defined, so “very confident” can mean a 6 or a 9, depending on who you ask.
That’s why I like using Itamar Gilad’s confidence meter, which makes it more objective by matching different pieces of evidence with a particular score.
But here is the problem: the evidence you collect helps mitigate one risk, not the others.
Discovery activities are tied to an assumption. While you are increasing confidence in that assumption, you are not necessarily increasing overall confidence.
Let’s continue our previous example with some extra context. Let’s say we are an analytics product with a typical SaaS model, trying to add a layer of AI-automated insights for our users.
To test desirability, we may engage in a series of interviews, prototypes, and even concierge experiments to understand better what type of insights users need and what the best way to deliver them is.
Of course, that is a big (if not the biggest) risk. But even if you successfully gain confidence about value, your idea will fail if you are technically unable to create those AI-powered insights (feasibility). The same is true if you don’t validate that enough users are willing to pay for a new subscription (viability).
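To make this concrete, here is a minimal sketch of the example above. The 0 to 10 scores, the `record_evidence` helper, and the variable names are assumptions made up for illustration; they are not part of any framework. The point is only that a discovery activity updates the confidence of the risk it tests and leaves the others where they were.

```python
# Hypothetical starting confidence per risk type on a 0-10 scale (illustrative only).
before = {"desirability": 2, "viability": 2, "feasibility": 2}


def record_evidence(scores, risk, new_score):
    """Return a copy of the scores where only the tested risk is updated."""
    if risk not in scores:
        raise KeyError(f"unknown risk type: {risk}")
    updated = dict(scores)
    updated[risk] = new_score
    return updated


# Interviews, prototypes, and concierge experiments test a desirability assumption only.
after = record_evidence(before, "desirability", 7)

# Risks whose confidence did not move still need their own discovery steps.
still_untested = [risk for risk in after if after[risk] == before[risk]]

print(after)
print(still_untested)
```

Here, viability and feasibility stay at their starting level, no matter how many interviews you run.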
How to fix it
The simple answer is that overall confidence is not an average.
Overall confidence is the lowest confidence you have among all your risks and assumptions.
We need to open up the discussion and say how confident we are about the user value, the business impact, and the feasibility.
In practical terms, this could mean splitting Confidence into at least 3 values (one per risk type in David Bland’s model). Your overall confidence would be the lowest of the 3 values.
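Following the above example, if desirability confidence is low, overall confidence will still be low even when viability and feasibility are high. The reverse also holds: high desirability confidence cannot compensate for an untested viability or feasibility assumption.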
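A small sketch shows the difference between averaging and taking the lowest value. The scores below are hypothetical 0 to 10 values chosen for illustration, and `overall_confidence` and `weakest_risk` are names I made up for this example, not part of ICE, RICE, or any other framework.

```python
from statistics import mean

# Hypothetical per-risk confidence on a 0-10 scale (illustrative values only).
confidence = {
    "desirability": 2.0,
    "viability": 8.0,
    "feasibility": 8.0,
}


def overall_confidence(scores):
    """Overall confidence is the lowest confidence among all risk types."""
    return min(scores.values())


def weakest_risk(scores):
    """Return the risk type with the lowest confidence."""
    return min(scores, key=scores.get)


average = mean(confidence.values())       # what a single blended score would suggest
overall = overall_confidence(confidence)  # what the weakest assumption allows
focus = weakest_risk(confidence)          # where the next discovery step should go

print(f"average={average}, overall={overall}, focus={focus}")
```

With these made-up numbers, the average looks like a comfortable 6, while the lowest value is a 2 and points straight at desirability.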
The good part is that it will give you a clear signal for what discovery activities are needed: do we need a technical spike (feasibility) or a prototype test to get user feedback on the new experience?
The ugly part is that:
- You need to think about different scales. Evidence about technical feasibility is rather different from evidence of user desire.
- This extra complexity won’t make your model perfect.
At the end of the day, we aim to create more transparency and better discussions with our teams and stakeholders to make better decisions. Most times, when we disagree about confidence levels, it is because we are considering different risks and assumptions.