Health Scores That Predict Anything: A Design Framework
Most health scores measure activity, not outcomes. They go green right up until the moment an account churns. Here is what a useful one actually looks like.
If you have ever sat in a churn post-mortem and watched someone point at a health score that was green the week the customer cancelled, you have seen the central failure of CS health scoring in practice. The dashboard said the account was healthy. The account churned. Nobody is sure what to do with this information, so the next quarter the team tweaks the weights, the dashboard comes back greener, and the cycle continues.
The problem is not the weights. The problem is what is being weighted. Most health scores are built on data that is easy to collect — logins, feature events, ticket counts — and that data correlates with healthy accounts because healthy accounts use the product. But correlation with the past is not the same as prediction of the future. A health score built on activity data will catch a slowly declining account; it will miss a politically dead account that is still logging in, and it will miss a competitive evaluation that has not yet shown up in usage.
What follows is a design framework for a health score that predicts renewal outcomes, not just describes current engagement. It is built around four inputs, deliberately chosen. It is not magic. It will not eliminate surprise churn. It will, when implemented well, give you 30 to 60 days of additional warning on most at-risk accounts — which is the difference between intervening and observing.
Why most health scores fail
Three failure modes show up over and over:
Failure mode 1: Activity ≠ adoption ≠ outcome
Health scores commonly conflate three different things. Activity is "did they log in." Adoption is "are they using the parts of the product that produce value." Outcome is "are they getting the business result they bought you for." Customers can have high activity and zero outcome — and they will churn. The score has to weight outcome above activity.
Failure mode 2: Lagging indicators only
A score built on usage data is a score that looks backward. Customers who are politically about to churn — because their champion left, because procurement is consolidating vendors, because a new VP arrived who wants to make their mark by cutting tools — show no usage decline until the decision is already made. The score must include leading indicators that capture the political reality of the account.
Failure mode 3: Optimized for dashboarding, not action
Most health scores produce a number from 0 to 100 with three color bands. This looks good in a board deck and is operationally useless. A number does not tell a CSM what to do. A score that produces an action — "this account needs an exec escalation," "this account needs a champion rebuild," "this account is fine, leave it alone" — is worth a hundred dashboard widgets.
The four-input model
The model uses four inputs, each captured separately, each producing a score and an action. They are deliberately diverse — two leading, two lagging — so that a problem in any dimension surfaces fast.
Input 1: Adoption depth (lagging, quantitative)
Not logins. Not feature events. Adoption depth: the percentage of the customer's contracted use case that is actually being executed in the product. This is harder to compute than login counts and that is precisely why it is more valuable.
Concretely: if a customer bought your platform to manage three use cases (say, ticket triage, knowledge base, and customer self-service), adoption depth is "how many of the three are running, and at what maturity." A customer running one of three at full maturity is at 33% adoption depth — even if their login activity is high, because one team logs in every day.
The scoring scale:
- 4 — Full depth. All contracted use cases live, mature, in production.
- 3 — Strong depth. Most use cases live, one not yet mature.
- 2 — Partial depth. Half or fewer of contracted use cases live.
- 1 — Shallow. One use case live, often the original POC use case.
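One way to turn this rubric into something computable is sketched below. It is a minimal illustration, not a reference implementation: the `UseCase` record, the `mature` flag, and both function names are hypothetical, and the tie-breaks (at most one live use case scores 1, more than one but half or fewer scores 2) are assumptions the rubric itself leaves open.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    live: bool    # running in production at all
    mature: bool  # hypothetical flag; "mature" is whatever your team defines


def adoption_depth_pct(use_cases):
    """Percent of contracted use cases that are both live and mature."""
    if not use_cases:
        raise ValueError("account has no contracted use cases")
    mature = sum(1 for u in use_cases if u.live and u.mature)
    return round(100 * mature / len(use_cases))


def adoption_depth_score(use_cases):
    """Map use-case status onto the 1-4 rubric (tie-breaks are assumptions)."""
    if not use_cases:
        raise ValueError("account has no contracted use cases")
    total = len(use_cases)
    live = sum(1 for u in use_cases if u.live)
    mature = sum(1 for u in use_cases if u.live and u.mature)
    if mature == total:
        return 4  # every contracted use case is live and mature
    if live <= 1:
        return 1  # shallow: at most one use case live
    if live * 2 <= total:
        return 2  # partial: more than one live, but half or fewer
    return 3      # most live, at least one not yet mature


# The example from the text: three contracted use cases, one running.
account = [
    UseCase("ticket triage", live=True, mature=True),
    UseCase("knowledge base", live=False, mature=False),
    UseCase("customer self-service", live=False, mature=False),
]
print(adoption_depth_pct(account), adoption_depth_score(account))
```

Login volume never enters the calculation, which is the point: a busy single team cannot lift the score.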
Input 2: Outcome evidence (lagging, qualitative)
Whether the customer can articulate the business outcome they are getting from your product. Not "are they happy." Not "do they like us." Can they say, in their own words, what they got? Hours saved, tickets deflected, faster cycle times, dollars recovered, errors reduced — something concrete.
The collection mechanism is the QBR, the renewal conversation, and the case study consent process. If the customer cannot answer the question "what's the business impact of [product] for you this year?" in one sentence, the score is low — regardless of how much they say they like working with you.
- 4 — Quantified outcome. Customer can state a number. ("We saved 340 hours / closed $1.8M faster / deflected 22% of tickets.")
- 3 — Qualified outcome. Customer describes the outcome in business terms but cannot yet quantify it.
- 2 — Vague outcome. "It's been really helpful." "The team likes it." No specifics.
- 1 — No outcome articulable. Customer struggles to name what they got.
Input 3: Stakeholder strength (leading, structural)
This is the multithreading map, scored. Pulled directly from the multithreading framework: how many of the four roles (economic buyer, champion, daily user, executive sponsor) are filled, and how strongly.
- 4 — All four roles filled, all strong.
- 3 — All four filled but not all strong, or three filled and all three strong.
- 2 — Two or three filled. Champion present but weak.
- 1 — Single-threaded, or champion has left / is leaving.
This input is leading because stakeholder changes happen weeks or months before usage shifts. A VP departure is a yellow flag the day it happens, even if the team is still logging in normally.
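As a rough sketch, the stakeholder rubric can be scored from a role map. Everything here is illustrative: the `role_map` shape, the `"strong"` / `"weak"` labels, and the `champion_departing` flag are hypothetical, and the rule that a weak or missing champion caps the score at 2 is one reading of the rubric, not the only one.

```python
ROLES = ("economic_buyer", "champion", "daily_user", "executive_sponsor")


def stakeholder_strength_score(role_map, champion_departing=False):
    """Score the multithreading map 1-4.

    role_map maps a role name to "strong", "weak", or None (unfilled).
    Roles missing from the map are treated as unfilled.
    """
    for role, status in role_map.items():
        if role not in ROLES or status not in ("strong", "weak", None):
            raise ValueError(f"unexpected entry: {role}={status!r}")
    filled = [r for r in ROLES if role_map.get(r) is not None]
    strong = [r for r in ROLES if role_map.get(r) == "strong"]
    if champion_departing or len(filled) <= 1:
        return 1  # single-threaded, or champion has left / is leaving
    if len(strong) == 4:
        return 4  # all four roles filled, all strong
    # Assumption: a 3 requires a strong champion.
    if role_map.get("champion") == "strong" and (len(filled) == 4 or len(strong) >= 3):
        return 3
    return 2


all_strong = dict.fromkeys(ROLES, "strong")
print(stakeholder_strength_score(all_strong))                           # healthy map
print(stakeholder_strength_score(all_strong, champion_departing=True))  # champion exit
```

The second call shows why this input leads: the same role map drops to the lowest score the day the champion's departure is known, before any usage data moves.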
Input 4: Sentiment events (leading, behavioral)
A rolling 90-day count of sentiment events: escalations, NPS detractor responses, churn-adjacent statements from the customer, executive complaints, missed SLAs, and unrenewed adjacent products at the same company. This is the input most teams either skip entirely or capture so passively that it is useless.
The trick is the discipline of logging the event when it happens — not at the end of the quarter. The CSM, the support team, and the AE all need to know how to flag a sentiment event into the system, and the threshold for "what counts" needs to be low. Better to over-log and filter than to miss the signal.
- 4 — Zero sentiment events in last 90 days. Positive interactions documented.
- 3 — One minor event, addressed and closed.
- 2 — Two or more events, or one unresolved material event.
- 1 — Active escalation, or pattern of unresolved issues, or executive complaint within 30 days.
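The rolling window and the rubric above could be sketched as follows. The `SentimentEvent` fields and the `kind` labels are hypothetical, "pattern of unresolved issues" is assumed to mean two or more unresolved events in the window, and any single event that is unresolved or material is conservatively scored 2. The "positive interactions documented" part of a 4 is not modeled.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SentimentEvent:
    occurred_on: date
    kind: str               # e.g. "escalation", "nps_detractor", "exec_complaint"
    material: bool = False
    resolved: bool = False


def sentiment_score(events, today, window_days=90):
    """Score the last `window_days` of logged sentiment events on the 1-4 rubric."""
    cutoff = today - timedelta(days=window_days)
    recent = [e for e in events if cutoff <= e.occurred_on <= today]
    unresolved = [e for e in recent if not e.resolved]
    exec_cutoff = today - timedelta(days=30)
    recent_exec_complaint = any(
        e.kind == "exec_complaint" and e.occurred_on >= exec_cutoff for e in recent
    )
    active_escalation = any(e.kind == "escalation" for e in unresolved)
    if recent_exec_complaint or active_escalation or len(unresolved) >= 2:
        return 1
    if not recent:
        return 4
    if len(recent) == 1 and recent[0].resolved and not recent[0].material:
        return 3  # one minor event, addressed and closed
    return 2      # two or more events, or one unresolved / material event


today = date(2024, 6, 30)
log = [
    SentimentEvent(date(2024, 1, 10), "escalation"),               # outside the window
    SentimentEvent(date(2024, 5, 1), "missed_sla", resolved=True),  # minor, closed
]
print(sentiment_score(log, today))
```

Because the score is recomputed from the raw log, over-logging is cheap: an event that turns out not to matter can be marked resolved or filtered later, while an event that was never logged cannot be recovered.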
From four scores to one action
A composite score (4 to 16) is fine for portfolio rollups. But the operationally useful output is not the composite; it is the lowest single input. A high composite can hide a 1 on stakeholder strength (three 4s and a 1 still sum to 13), and that 1 is the thing that will kill the account.
The decision rule:
| Pattern | Action |
|---|---|
| All four inputs at 3 or 4 | Healthy. Maintain cadence. Use cycles for expansion work. |
| Any single input at 2 | Targeted intervention on the weak dimension. CSM-led. |
| Any single input at 1 | Escalate to manager. Build a 30-day save plan. The other inputs do not matter. |
| Two or more inputs at 2 or below | Material churn risk. Escalate to leadership. Build a 60-day intervention plan and assign roles. |
This is the framing that turns a score into an operating signal. The score does not just tell you which accounts are at risk; it tells you what is wrong with each at-risk account, which determines what play to run.
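The decision rule can be expressed in a few lines. This is a sketch under one stated assumption: the table's rows overlap (an account with a 1 and a 2 matches two rows), so the code checks the "two or more inputs at 2 or below" row first because it escalates furthest. The input keys and pattern labels are hypothetical names.

```python
INPUTS = ("adoption_depth", "outcome_evidence", "stakeholder_strength", "sentiment_events")

ACTIONS = {
    "healthy": "Maintain cadence. Use cycles for expansion work.",
    "single_2": "Targeted intervention on the weak dimension. CSM-led.",
    "single_1": "Escalate to manager. Build a 30-day save plan.",
    "multiple_low": "Escalate to leadership. Build a 60-day intervention plan and assign roles.",
}


def decide(scores):
    """Return the composite, the weakest input, and the action for one account."""
    for name in INPUTS:
        if scores.get(name) not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be scored 1-4")
    weakest = min(INPUTS, key=lambda name: scores[name])
    low = [name for name in INPUTS if scores[name] <= 2]
    if len(low) >= 2:
        pattern = "multiple_low"   # assumption: most severe row wins
    elif scores[weakest] == 1:
        pattern = "single_1"
    elif scores[weakest] == 2:
        pattern = "single_2"
    else:
        pattern = "healthy"
    return {
        "composite": sum(scores[name] for name in INPUTS),
        "weakest_input": weakest,
        "pattern": pattern,
        "action": ACTIONS[pattern],
    }


result = decide({
    "adoption_depth": 4,
    "outcome_evidence": 4,
    "stakeholder_strength": 1,
    "sentiment_events": 4,
})
print(result["composite"], result["weakest_input"], result["action"])
```

The example account has a composite of 13, which would sit comfortably in a green band on most dashboards, yet the rule routes it straight to a manager escalation on stakeholder strength.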
How to validate the model against your data
A health score is a hypothesis until you validate it against your actual churn outcomes. Most teams skip this step and the score stays a hypothesis forever. The validation is not hard.
Take your last 12 months of churned accounts. Score each one retroactively as of 90 days before its churn date. You will need to reconstruct the inputs from notes, CRM history, and usage data. It is tedious. It is also the single most valuable afternoon a CS Ops lead can spend.
What you are looking for:
- Did the model flag them? A churned account that scored 14 or higher on the composite 90 days before churn is a model failure. If you have several of these, the model has a blind spot; usually it means your stakeholder strength scoring is too generous.
- Which input flagged first? Across your churned accounts, which input had the earliest decline? If stakeholder strength led most of them, that input deserves more weight or more frequent measurement.
- How early was the warning? If the model would have flagged most churned accounts only 30 days before the event, the score is descriptive, not predictive. Tighten the rubric thresholds so that 3s drop to 2s more readily, and repeat until the warning window expands.
Do the same exercise with renewed accounts to check for false positives. A model that flags every account as at-risk is also broken, just in the other direction.
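Once the retroactive scores are in a spreadsheet or list, the checks above reduce to a short tally. The sketch below assumes each account is a dict with hypothetical keys `name`, `churned`, `scores` (the four inputs as of 90 days before churn or renewal), and an optional `first_declined` naming the input that dropped first. It also assumes "flagged" means any input at 2 or below. The sample data is invented purely to show the shape.

```python
from collections import Counter


def backtest(accounts, miss_threshold=14):
    """Summarize how the model would have performed on past accounts."""
    churned = [a for a in accounts if a["churned"]]
    renewed = [a for a in accounts if not a["churned"]]

    def flagged(account):
        # Assumption: any single input at 2 or below counts as a flag.
        return min(account["scores"].values()) <= 2

    def rate(hits, population):
        return len(hits) / len(population) if population else None

    caught = [a["name"] for a in churned if flagged(a)]
    missed = [a["name"] for a in churned if sum(a["scores"].values()) >= miss_threshold]
    false_positives = [a["name"] for a in renewed if flagged(a)]
    first_declined = Counter(
        a["first_declined"] for a in churned if a.get("first_declined")
    )
    return {
        "churn_flag_rate": rate(caught, churned),
        "missed_churners": missed,            # composite >= threshold, yet churned
        "false_positives": false_positives,   # flagged, yet renewed
        "false_positive_rate": rate(false_positives, renewed),
        "first_declined": first_declined,
    }


def s(ad, oe, ss, se):
    return {"adoption_depth": ad, "outcome_evidence": oe,
            "stakeholder_strength": ss, "sentiment_events": se}


# Invented sample data, for illustration only.
history = [
    {"name": "A", "churned": True, "scores": s(2, 2, 1, 3), "first_declined": "stakeholder_strength"},
    {"name": "B", "churned": True, "scores": s(4, 4, 4, 3)},
    {"name": "C", "churned": False, "scores": s(4, 3, 4, 4)},
    {"name": "D", "churned": False, "scores": s(3, 2, 3, 3)},
]
report = backtest(history)
print(report["missed_churners"], report["false_positives"])
```

Measuring the warning window itself would need scores at several points in time per account (say 30, 60, and 90 days out); this sketch only covers the single 90-day snapshot.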
Implementation: what to actually build
If your team uses Gainsight, Totango, ChurnZero, or any modern CS platform, all four inputs can be configured as scorecards or measures. The platforms are perfectly capable of capturing this model — the failure mode is teams using whatever score the platform shipped with rather than designing one for their business.
The rough implementation checklist:
- Define the four inputs in your platform as scorecard measures.
- For adoption depth and sentiment events, define a clear data source for each — product telemetry feed for adoption, manual or integrated event log for sentiment.
- For outcome evidence and stakeholder strength, define a quarterly refresh cadence and a single owner (the CSM, with peer review).
- Set the dashboard to display the four inputs and the lowest input, not just the composite.
- Tie the lowest-input value to a Call to Action / playbook in the platform, so a score of 1 on any input auto-generates a task for the CSM.
- Run the validation exercise on the prior 12 months of churn within the first 90 days of launch. Adjust thresholds. Run it again in 90 more days.
The implementation work is one week of configuration, two weeks of stabilization, and a quarter of tuning. That is well inside the budget for replacing a dashboard that goes green right before accounts churn.
What this model does not do
- It does not predict expansion. A separate signal model handles that, and it weights different inputs (champion ambition, adjacent team interest, product roadmap fit).
- It does not eliminate surprise churn. Some accounts churn for reasons no model could capture — acquisition, sudden budget freeze, a new CIO with a hostile vendor philosophy. The goal is to reduce surprise churn, not eliminate it.
- It does not replace judgment. A CSM looking at a score of 16 who knows the customer just got acquired should still escalate. The model informs judgment; it does not replace it.
The health score model is available as a Notion template on the Tools page, with rubrics, scoring guidance, and validation worksheets included.