From a stalled clinical-documentation pilot to a governed, monitored, SLA-backed AI stack — drift, accuracy, cost, and 24×5 on-call owned by one accountable partner.
This is a representative, composite engagement based on patterns Maverin sees across regulated multi-site health systems. It is not a real named client. Industry figures are cited to public sources; engagement outcomes are modeled and labeled illustrative.
A multi-site health system had a working clinical-documentation assistant in pilot and no way to run it in production. Maverin took over the AI surface under a Managed AI Stack retainer: drift and accuracy monitoring, output evaluation with human-in-the-loop verification, cost observability, 24×5 on-call, and a monthly cadence of new governed workflows, all under an SLA.
A regional health system — eight hospital and clinic sites, one shared EHR, PHI under PHIPA — had spent two quarters building a clinical-documentation assistant. It drafted visit summaries and routed referrals. In a demo, it was good. Physicians on the pilot floor liked it.
Then it sat. The internal team that built it was a data-science group, not an operations group. Nobody owned the pager. Nobody owned the question "is it still accurate this month?" When inference costs drifted up, finance noticed before engineering did. When a model update changed the assistant's tone on discharge instructions, a charge nurse caught it, not a monitor.
This is the normal shape of the problem. 66% of U.S. physicians used AI in 2024, up from 38% the year before — a 78% one-year jump (AMA Augmented Intelligence Survey, 2025). The demand side is real and arriving fast. The operational side is landing on health systems that were never staffed to run software like a vendor runs it.
Maverin came in not to rebuild the assistant — it worked — but to own the surface it ran on.
A model in a demo is finished work. A model in production is a depreciating asset. The gap between the two is where most health-AI initiatives die.
The evidence is blunt. 91% of machine-learning models degrade over time in a study of 32 datasets and 2.56 million experiments across healthcare, finance, weather, and traffic (Vela, Sharp, Zhang et al., Scientific Reports, 2022). Accuracy decays silently — nobody gets an alert that the model is now wrong more often. Gartner predicted at least 30% of generative-AI projects would be abandoned after proof-of-concept by end of 2025, citing poor data quality, weak risk controls, escalating costs, and unclear value (Gartner, 2024). This pilot had all four risks live and unowned.
Three specifics made it acute for a health system:
The internal team could build. It could not run drift detection, output evaluation, access controls, on-call, and a feedback loop — as a standing capability, every day, under an SLA. That is a different job.
We did not start with a roadmap. We started with a two-week Discovery Assessment to answer one question: what is actually running, and what is the worst thing that can go wrong while nobody is watching?
The output was a transfer-of-ownership plan, not a rebuild plan. The assistant stayed. We took the surface.
The retainer terms, in plain language:
We mapped the program to NIST's AI Risk Management Framework Generative AI Profile (NIST AI 600-1), which names confabulation (hallucination) as one of 12 generative-AI risk categories (NIST, 2024). That gave procurement, audit, and the privacy office a recognized control framework to map against — not a Maverin invention they had to take on faith.
The reframe we sold the COO: inference is no longer the cost. Querying a GPT-3.5-equivalent model fell from USD 20.00 to USD 0.07 per million tokens between Nov 2022 and Oct 2024 — a >280× drop (Stanford HAI 2025 AI Index). The durable spend has moved to operations: observability, evaluation, retraining, governance. A retainer prices that reality honestly instead of pretending the model is done.
The stack has four layers, each with an owner and a metric.
## 1. Access & isolation
PHI never leaves the client's tenancy. Per-role access controls on the model endpoint, full prompt-and-response logging, and a retention policy the privacy office signed. This directly closes the gap behind the IBM finding that 97% of AI-incident victims lacked proper access controls: controls were the first thing in, not the last.
## 2. Output evaluation (the guardrail layer)
Every clinical-text output passes a layered evaluation before a clinician sees it: retrieval grounding against the source record, a hallucination check tuned to the MedHallu failure mode, and a confidence gate. Below threshold, the output is flagged for mandatory human review rather than presented as a clean draft. Human-in-the-loop verification is non-negotiable: the 0.625 MedHallu F1 is why the assistant proposes and a clinician disposes, always.
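A minimal sketch of how such a layered gate could route a draft, assuming each check produces a score between 0 and 1. The names (`DraftScores`, `route_draft`) and every threshold value are hypothetical illustrations, not the engagement's actual configuration.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration only; real values would be
# tuned against the clinician-labeled evaluation set.
GROUNDING_MIN = 0.90
HALLUCINATION_MAX = 0.10
CONFIDENCE_MIN = 0.80


@dataclass(frozen=True)
class DraftScores:
    grounding: float           # share of claims supported by the source record
    hallucination_risk: float  # score from the hallucination check
    confidence: float          # confidence assigned to the draft


def route_draft(scores: DraftScores) -> str:
    """Return 'mandatory_review' if any layer fails, else 'clinician_draft'."""
    failed = (
        scores.grounding < GROUNDING_MIN
        or scores.hallucination_risk > HALLUCINATION_MAX
        or scores.confidence < CONFIDENCE_MIN
    )
    return "mandatory_review" if failed else "clinician_draft"


if __name__ == "__main__":
    print(route_draft(DraftScores(0.95, 0.02, 0.90)))
    print(route_draft(DraftScores(0.95, 0.02, 0.50)))
```

In this sketch both routes end with a clinician; the gate only decides whether the output arrives as a draft or as a flagged item that requires review.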
## 3. Drift & accuracy monitoring
A weekly evaluation set (held-out, clinician-labeled) runs against the live model. Accuracy, grounding rate, and flag rate are tracked over time. A statistically significant drop pages the on-call engineer and triggers the retraining/prompt-revision pipeline. This is the layer the internal team never had: the answer to "is it still accurate this month?" is now a dashboard, not a nurse's hunch.
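One way to make "statistically significant drop" concrete is a one-sided two-proportion z-test comparing this week's accuracy with a baseline. The sketch below assumes that test; the function names, the `alpha` of 0.01, and the sample sizes in the example are illustrative assumptions, not the monitor Maverin runs.

```python
from math import sqrt
from statistics import NormalDist


def accuracy_drop_p_value(base_correct: int, base_total: int,
                          week_correct: int, week_total: int) -> float:
    """One-sided p-value for 'this week's accuracy is below baseline'."""
    p_base = base_correct / base_total
    p_week = week_correct / week_total
    pooled = (base_correct + week_correct) / (base_total + week_total)
    se = sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / week_total))
    if se == 0:
        return 1.0  # every label identical in both samples: no evidence of a drop
    z = (p_base - p_week) / se
    return 1 - NormalDist().cdf(z)


def should_page(base_correct: int, base_total: int,
                week_correct: int, week_total: int,
                alpha: float = 0.01) -> bool:
    """Page the on-call engineer when the drop is significant at level alpha."""
    p = accuracy_drop_p_value(base_correct, base_total, week_correct, week_total)
    return p < alpha


if __name__ == "__main__":
    # Hypothetical 1,000-item evaluation sets.
    print(should_page(940, 1000, 938, 1000))  # small wobble
    print(should_page(940, 1000, 832, 1000))  # large decline
```

A one-sided test fits here because only a decline should wake someone up; an improvement is logged, not paged.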
## 4. Cost & uptime observability
Per-workflow token spend, latency, and availability on one pane. Cost anomalies (a prompt change that doubles token use) alert before the invoice does. The FDA now expects Predetermined Change Control Plans for AI/ML medical devices, so ongoing monitoring is a regulatory expectation, not optional (FDA AI-Enabled Medical Device List, 2025; 1,250+ devices authorized as of July 2025). Our change-control discipline maps to that expectation even for the non-device assistant, because the privacy office wanted the same rigor.
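A cost-anomaly alert of this kind can be sketched as a comparison of today's token use per workflow against a trailing median. The function name `cost_anomalies`, the workflow names, and the 2× threshold below are assumptions chosen to mirror the "doubles token use" example, not a description of the deployed alerting.

```python
from statistics import median


def cost_anomalies(history: dict[str, list[int]],
                   today: dict[str, int],
                   ratio_threshold: float = 2.0) -> dict[str, float]:
    """Return workflows whose token use today is >= ratio_threshold x their trailing median."""
    alerts: dict[str, float] = {}
    for workflow, tokens in today.items():
        past = history.get(workflow, [])
        if not past:
            continue  # no baseline yet, so nothing to compare against
        baseline = median(past)
        if baseline > 0 and tokens / baseline >= ratio_threshold:
            alerts[workflow] = tokens / baseline
    return alerts


if __name__ == "__main__":
    # Hypothetical daily token counts per workflow.
    history = {"visit_summary": [100_000, 110_000, 90_000],
               "referral_routing": [50_000, 50_000, 50_000]}
    today = {"visit_summary": 220_000, "referral_routing": 55_000}
    print(cost_anomalies(history, today))
```

The median is used as the baseline so a single earlier spike does not raise the bar for the next alert.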
The feedback channel physicians asked for (88% top requirement) is wired into layer 2: a one-click "this draft was wrong" that lands in the evaluation set and shapes the next week's retraining. Their feedback is not a suggestion box — it is training signal.
In the first 90 days:
Then the monthly cadence began. Using the governed new-workflow slot, the next three capabilities shipped one per month — each through the same evaluation gate, none re-litigating the operating model:
The sequencing is the point. We did not ship the most exciting workflow first. We shipped the one that earned the next one.
| Time point | Modeled accuracy with managed monitoring (%) |
|---|---|
| Month 0 | 94 |
| Month 2 | 93.8 |
| Month 4 | 94.1 |
| Month 6 | 93.9 |
| Month 8 | 94.2 |
| Month 10 | 94 |
| Month 12 | 94.3 |
Modeled clinician-labeled accuracy across 12 months. The lower path illustrates the silent decay implied by the cited finding that 91% of ML models degrade over time (Vela et al., 2022); the held band is the managed evaluation-and-retraining loop. Illustrative — not a named-client measurement.
These outcomes are illustrative and modeled for an engagement of this size and shape — they are not measured results from a named client. Where a number is an industry benchmark, it is cited.
The honest version: the retainer did not make the model better than its demo. It kept the demo's quality from quietly evaporating — and it gave a CISO and COO one accountable phone number for the AI surface.
| Time point | Modeled accuracy without monitoring (%) |
|---|---|
| Month 0 | 94 |
| Month 2 | 92.6 |
| Month 4 | 91 |
| Month 6 | 89.1 |
| Month 8 | 87.4 |
| Month 10 | 85.5 |
| Month 12 | 83.2 |
The counterfactual: the same model with no monitoring or retraining, drifting down quarter over quarter — the path 91% of deployed models follow (Vela et al., 2022). Illustrative.
This was a representative engagement. The shape — stalled pilot, no owner, decaying-asset risk — is one we see across regulated health systems. The fix is not a better model. It is an accountable operating layer with an SLA, no lock-in, and a phone that gets answered.
| Metric | Modeled value |
|---|---|
| Uptime vs SLA (%) | 99.7 |
| On-call coverage (hrs/week) | 120 |
| New governed workflows / quarter | 3 |
| Clinical-impact incidents | 0 |
Modeled operating posture for an engagement of this size: 24×5 on-call, drift and cost monitoring, monthly governed delivery. Illustrative figures, not a named-client result.
Have a pilot that works but nobody owns? Let's talk about who carries the pager. Book a Discovery Assessment.