Healthcare — multi-site health system · Managed AI Stack

Who owns the model at 2 a.m.? A multi-site health system hands its AI surface to a managed retainer

From a stalled clinical-documentation pilot to a governed, monitored, SLA-backed AI stack — drift, accuracy, cost, and 24×5 on-call owned by one accountable partner.

March 24, 2026 · 10 min read · Use case

This is a representative, composite engagement based on patterns Maverin sees across regulated multi-site health systems. It is not a real named client. Industry figures are cited to public sources; engagement outcomes are modeled and labeled illustrative.

TL;DR

A multi-site health system had a working clinical-documentation assistant in pilot and no way to run it in production. Maverin took over the AI surface under a Managed AI Stack retainer: drift and accuracy monitoring, output evaluation with human-in-the-loop verification, cost observability, 24×5 on-call, and a monthly cadence of new governed workflows — all under an SLA. This is a representative engagement; the outcomes are modeled and labeled, the industry figures are cited.

01 Context

A regional health system — eight hospital and clinic sites, one shared EHR, PHI under PHIPA — had spent two quarters building a clinical-documentation assistant. It drafted visit summaries and routed referrals. In a demo, it was good. Physicians on the pilot floor liked it.

Then it sat. The internal team that built it was a data-science group, not an operations group. Nobody owned the pager. Nobody owned the question "is it still accurate this month?" When inference costs drifted up, finance noticed before engineering did. When a model update changed the assistant's tone on discharge instructions, a charge nurse caught it, not a monitor.

This is the normal shape of the problem. 66% of U.S. physicians used AI in 2024, up from 38% the year before — a 78% one-year jump (AMA Augmented Intelligence Survey, 2025). The demand side is real and arriving fast. The operational side is landing on health systems that were never staffed to run software like a vendor runs it.

Maverin came in not to rebuild the assistant — it worked — but to own the surface it ran on.

02 The problem

A model in a demo is finished work. A model in production is a depreciating asset. The gap between the two is where most health-AI initiatives die.

The evidence is blunt. 91% of machine-learning models degrade over time in a study of 32 datasets and 2.56 million experiments across healthcare, finance, weather, and traffic (Vela, Sharp, Zhang et al., Scientific Reports, 2022). Accuracy decays silently — nobody gets an alert that the model is now wrong more often. Gartner predicted at least 30% of generative-AI projects would be abandoned after proof-of-concept by end of 2025, citing poor data quality, weak risk controls, escalating costs, and unclear value (Gartner, 2024). This pilot had all four risks live and unowned.

Three specifics made it acute for a health system:

Hallucination is not a tail risk in clinical text. On the MedHallu hard-hallucination benchmark the best model scored only 0.625 F1 even as the top model hit 96.0% on MedQA licensing-exam questions (Stanford HAI 2025 AI Index; MedHallu, arXiv 2502.14302). High exam scores mask confident, hard-to-detect errors. Without output evaluation and a human in the loop, a wrong discharge instruction ships looking exactly like a right one.
Ungoverned AI is a measurable breach line item. Healthcare is the most expensive sector for breaches at USD 7.42M average, with the longest containment at 279 days (IBM Cost of a Data Breach 2025). And 97% of firms that reported an AI-related security incident lacked proper AI access controls, with shadow AI adding ~USD 670K to breach cost (IBM, 2025).
Physicians told us the price of their trust. 87% named data-privacy assurances and 88% named a designated feedback channel as top requirements for trusting health AI; 47% named increased oversight as the #1 regulatory need (AMA, 2025). Those are not features — they are the operating model of a managed service.

The internal team could build. It could not run drift detection, output evaluation, access controls, on-call, and a feedback loop — as a standing capability, every day, under an SLA. That is a different job.

03 The approach

We did not start with a roadmap. We started with a two-week Discovery Assessment to answer one question: what is actually running, and what is the worst thing that can go wrong while nobody is watching?

The output was a transfer-of-ownership plan, not a rebuild plan. The assistant stayed. We took the surface.

The retainer terms, in plain language:

Maverin owns drift, accuracy, cost, observability, and uptime on the deployed AI surface — measured against an SLA, reported monthly.
24×5 on-call, with a defined severity ladder and response targets. Clinical-impacting incidents page a human; cost and drift anomalies open a ticket.
One monthly new-workflow delivery slot — a governed pipeline to add the next assistant capability without re-opening the build-vs-run question each time.
No platform lock-in. Models, vendors, and infrastructure stay the client's. The retainer is the operating capability layered on top, portable if they ever bring it in-house.
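The severity split in those terms can be sketched as a small routing rule. This is a minimal illustration, not Maverin's production tooling: the `Kind`, `Action`, `Incident`, and `route` names are hypothetical, and a real ladder would carry more levels plus the response targets themselves.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    CLINICAL_IMPACT = "clinical_impact"
    COST_ANOMALY = "cost_anomaly"
    DRIFT_ANOMALY = "drift_anomaly"


class Action(Enum):
    PAGE_ON_CALL = "page_on_call"
    OPEN_TICKET = "open_ticket"


@dataclass(frozen=True)
class Incident:
    kind: Kind
    summary: str


def route(incident: Incident) -> Action:
    """Clinical-impacting incidents page a human; cost and drift anomalies open a ticket."""
    if incident.kind is Kind.CLINICAL_IMPACT:
        return Action.PAGE_ON_CALL
    return Action.OPEN_TICKET


if __name__ == "__main__":
    print(route(Incident(Kind.CLINICAL_IMPACT, "wrong dosage in discharge draft")).value)
    print(route(Incident(Kind.COST_ANOMALY, "token spend doubled overnight")).value)
```

Keeping the rule this small is deliberate: the point of a published ladder is that anyone can predict, before an incident, who gets woken up.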

We mapped the program to NIST's AI Risk Management Framework Generative AI Profile (NIST AI 600-1), which names confabulation (hallucination) as one of 12 generative-AI risk categories (NIST, 2024). That gave procurement, audit, and the privacy office a recognized control framework to map against — not a Maverin invention they had to take on faith.

The reframe we sold the COO: inference is no longer the cost. Querying a GPT-3.5-equivalent model fell from USD 20.00 to USD 0.07 per million tokens between Nov 2022 and Oct 2024 — a >280× drop (Stanford HAI 2025 AI Index). The durable spend has moved to operations: observability, evaluation, retraining, governance. A retainer prices that reality honestly instead of pretending the model is done.
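The arithmetic behind that reframe is short enough to check by hand. The sketch below uses the two cited prices; the monthly token volume is a hypothetical figure chosen only to make the scale visible, not a number from the engagement.

```python
# Cited prices, USD per million tokens (Stanford HAI 2025 AI Index).
PRICE_NOV_2022 = 20.00
PRICE_OCT_2024 = 0.07


def fold_drop(old_price: float, new_price: float) -> float:
    """How many times cheaper the new price is than the old one."""
    return old_price / new_price


def monthly_inference_cost(million_tokens: float, price_per_million: float) -> float:
    """Inference spend for a month at a flat per-million-token price."""
    return million_tokens * price_per_million


if __name__ == "__main__":
    print(f"fold drop: {fold_drop(PRICE_NOV_2022, PRICE_OCT_2024):.1f}x")

    # Hypothetical volume: 500 million tokens per month across all workflows.
    volume = 500
    print(f"at 2022 price: USD {monthly_inference_cost(volume, PRICE_NOV_2022):,.2f}")
    print(f"at 2024 price: USD {monthly_inference_cost(volume, PRICE_OCT_2024):,.2f}")
```

At that assumed volume the inference bill moves from five figures to two, which is why the operating layer, not the model call, becomes the line item worth negotiating.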

04 Architecture & controls

The stack has four layers, each with an owner and a metric.

## 1. Access & isolation

PHI never leaves the client's tenancy. Per-role access controls on the model endpoint, full prompt-and-response logging, and a retention policy the privacy office signed. This directly closes the gap behind the IBM finding that 97% of AI-incident victims lacked proper access controls: controls were the first thing in, not the last.

## 2. Output evaluation (the guardrail layer)

Every clinical-text output passes a layered evaluation before a clinician sees it: retrieval grounding against the source record, a hallucination check tuned to the MedHallu failure mode, and a confidence gate. Below threshold, the output is flagged for mandatory human review rather than presented as a clean draft. Human-in-the-loop verification is non-negotiable: the 0.625 MedHallu F1 is why the assistant proposes and a clinician disposes, always.
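A layered gate of this kind can be sketched as three independent checks whose failures are collected rather than short-circuited, so the reviewer sees every reason a draft was flagged. Everything here is an assumption for illustration: the score fields, the threshold values, and the function names are hypothetical, and how each score is actually produced is outside the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftScores:
    grounding: float           # share of statements supported by the source record, 0..1
    hallucination_risk: float  # hallucination-checker output, 0..1, higher is worse
    confidence: float          # evaluator confidence in the draft, 0..1


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values; real thresholds would be set per workflow with clinicians.
    min_grounding: float = 0.95
    max_hallucination_risk: float = 0.10
    min_confidence: float = 0.80


def failed_checks(scores: DraftScores, t: Thresholds = Thresholds()) -> list[str]:
    """Return the name of every layer the draft failed (empty list means all passed)."""
    failures = []
    if scores.grounding < t.min_grounding:
        failures.append("grounding")
    if scores.hallucination_risk > t.max_hallucination_risk:
        failures.append("hallucination")
    if scores.confidence < t.min_confidence:
        failures.append("confidence")
    return failures


def disposition(scores: DraftScores, t: Thresholds = Thresholds()) -> str:
    """Both outcomes reach a clinician; only the framing of the draft differs."""
    return "mandatory_review" if failed_checks(scores, t) else "draft_for_clinician"


if __name__ == "__main__":
    clean = DraftScores(grounding=0.99, hallucination_risk=0.02, confidence=0.93)
    shaky = DraftScores(grounding=0.81, hallucination_risk=0.02, confidence=0.93)
    print(disposition(clean), failed_checks(clean))
    print(disposition(shaky), failed_checks(shaky))
```

Note that neither branch bypasses the clinician. The gate decides whether a draft arrives as a clean proposal or with a mandatory-review flag attached, nothing more.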

## 3. Drift & accuracy monitoring

A weekly evaluation set (held-out, clinician-labeled) runs against the live model. Accuracy, grounding rate, and flag rate are tracked over time. A statistically significant drop pages the on-call engineer and triggers the retraining/prompt-revision pipeline. This is the layer the internal team never had: the answer to "is it still accurate this month?" is now a dashboard, not a nurse's hunch.
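What "statistically significant drop" means in practice depends on the test chosen, and the text does not name one. As one plausible reading, the sketch below applies a one-sided two-proportion z-test to a baseline week and the current week. The test choice, the `alpha` level, and the function names are all assumptions, and the counts in the example are made up.

```python
import math


def accuracy_drop_p_value(base_correct: int, base_total: int,
                          cur_correct: int, cur_total: int) -> float:
    """One-sided p-value for 'current accuracy is lower than baseline accuracy'."""
    p_base = base_correct / base_total
    p_cur = cur_correct / cur_total
    pooled = (base_correct + cur_correct) / (base_total + cur_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cur_total))
    if se == 0:
        return 1.0  # both weeks all-correct or all-wrong: no evidence of a drop
    z = (p_base - p_cur) / se
    # Upper-tail probability of the standard normal, P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2))


def should_page(base_correct: int, base_total: int,
                cur_correct: int, cur_total: int, alpha: float = 0.01) -> bool:
    """Page on-call when the drop is significant at the (hypothetical) alpha level."""
    return accuracy_drop_p_value(base_correct, base_total, cur_correct, cur_total) < alpha


if __name__ == "__main__":
    # Made-up counts: 94% baseline vs. 89% this week on 500 labeled cases each.
    print(should_page(470, 500, 445, 500))
    # Ordinary week-to-week noise: 94.0% vs. 93.6%.
    print(should_page(470, 500, 468, 500))
```

A test like this is what separates a dashboard from a hunch: the week-to-week wobble in a 500-case set stays quiet, while a five-point drop does not.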

## 4. Cost & uptime observability

Per-workflow token spend, latency, and availability on one pane. Cost anomalies (a prompt change that doubles token use) alert before the invoice does. The FDA now expects Predetermined Change Control Plans for AI/ML medical devices; ongoing monitoring is a regulatory expectation, not optional (FDA AI-Enabled Medical Device List, 2025; 1,250+ devices authorized as of July 2025). Our change-control discipline maps to that expectation even for the non-device assistant, because the privacy office wanted the same rigor.
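The "prompt change that doubles token use" case can be caught with a very plain rule: compare today's per-workflow token count against the median of recent days. The sketch below assumes daily totals are already collected per workflow; the 2.0 ratio mirrors the doubling example, and the function name and sample numbers are hypothetical.

```python
from statistics import median


def token_spend_anomaly(history: list[float], today: float, ratio: float = 2.0) -> bool:
    """True when today's token count is at least `ratio` times the recent median.

    `history` holds daily token totals for one workflow over a trailing window.
    The median is used as the baseline so one earlier spike does not mask a new one.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history)
    return baseline > 0 and today >= ratio * baseline


if __name__ == "__main__":
    last_week = [101_000, 98_500, 103_200, 99_800, 100_400, 97_900, 102_100]
    print(token_spend_anomaly(last_week, today=104_000))  # normal day
    print(token_spend_anomaly(last_week, today=212_000))  # prompt change doubled usage
```

Running the check daily per workflow is what lets the alert arrive before the invoice does.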

The feedback channel physicians asked for (88% top requirement) is wired into layer 2: a one-click "this draft was wrong" that lands in the evaluation set and shapes the next week's retraining. Their feedback is not a suggestion box — it is training signal.
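One way to picture that wiring: each click becomes a small record keyed by draft ID, queued into the evaluation set exactly once. This is an illustrative data shape only. The `FeedbackFlag` and `EvaluationSet` names are hypothetical, and the sketch stores identifiers rather than draft text on the assumption that PHI stays in the client's systems.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class FeedbackFlag:
    draft_id: str      # identifier only; the draft text stays in the client tenancy
    workflow: str      # e.g. "discharge_instructions"
    flagged_on: date
    note: str = ""     # optional free-text reason from the clinician


@dataclass
class EvaluationSet:
    cases: dict[str, FeedbackFlag] = field(default_factory=dict)

    def add_flag(self, flag: FeedbackFlag) -> bool:
        """Queue a flagged draft once; return False if it was already queued."""
        if flag.draft_id in self.cases:
            return False
        self.cases[flag.draft_id] = flag
        return True

    def pending_for(self, workflow: str) -> list[str]:
        """Draft IDs awaiting clinician labeling for one workflow, in stable order."""
        return sorted(d for d, f in self.cases.items() if f.workflow == workflow)


if __name__ == "__main__":
    eval_set = EvaluationSet()
    flag = FeedbackFlag("draft-0042", "discharge_instructions", date(2026, 3, 2), "wrong follow-up interval")
    print(eval_set.add_flag(flag))   # first click is queued
    print(eval_set.add_flag(flag))   # repeat click on the same draft is ignored
    print(eval_set.pending_for("discharge_instructions"))
```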

05 What shipped

In the first 90 days:

Ownership transfer complete. The pager, the dashboards, and the SLA moved to Maverin. The internal data-science team went back to building, freed from running.
Four-layer stack live in production across all eight sites: access/isolation, output evaluation, drift monitoring, cost/uptime observability.
Weekly clinician-labeled evaluation loop stood up, feeding both the drift monitor and the retraining pipeline.
24×5 on-call with a published severity ladder and two clinical-severity drills run before go-live.

Then the monthly cadence began. Using the governed new-workflow slot, the next three capabilities shipped one per month — each through the same evaluation gate, none re-litigating the operating model:

Month 1: referral-routing accuracy improvements (the original pain).
Month 2: discharge-instruction drafting with a stricter grounding threshold.
Month 3: prior-authorization letter drafting, the highest-volume, lowest-clinical-risk workflow — chosen deliberately to build trust before touching anything diagnostic.

The sequencing is the point. We did not ship the most exciting workflow first. We shipped the one that earned the next one.

Exhibit 1

Accuracy over time: silent decay vs. managed loop (illustrative)

Month	Clinician-labeled accuracy (%, modeled)
Month 0	94.0
Month 2	93.8
Month 4	94.1
Month 6	93.9
Month 8	94.2
Month 10	94.0
Month 12	94.3

Modeled clinician-labeled accuracy across 12 months. The values shown are the held band of the managed evaluation-and-retraining loop; the silent-decay path implied by the cited finding that 91% of ML models degrade over time (Vela et al., 2022) is tabulated in Exhibit 2. Illustrative, not a named-client measurement.

Illustrative (decay baseline: Vela et al., Scientific Reports 2022)

06 Outcomes

These outcomes are illustrative and modeled for an engagement of this size and shape — they are not measured results from a named client. Where a number is an industry benchmark, it is cited.

Accuracy held flat instead of decaying. Against the baseline that 91% of ML models degrade over time (Vela et al., Scientific Reports, 2022), the managed evaluation-and-retraining loop kept the assistant's clinician-labeled accuracy inside a tight band over the year rather than drifting down. That is the explicit value of the retainer, illustrated in Exhibit 1.
The project did not join the abandonment statistic. Gartner put POC abandonment at ≥30% (2024); the surface that was stalling moved to stable production and a monthly delivery cadence instead.
On-call met SLA. In an illustrative steady state, clinical-severity incidents were acknowledged inside the response target and resolved without a clinical-impact event — the kind of operational floor a health system cannot self-staff overnight given a 279-day average healthcare breach-containment time (IBM, 2025) when controls are absent.
Hallucination caught upstream of clinicians. The output-evaluation layer flagged low-confidence drafts for mandatory review rather than presenting them clean — directly addressing the 0.625 MedHallu F1 failure mode (Stanford HAI / MedHallu, 2025).

The honest version: the retainer did not make the model better than its demo. It kept the demo's quality from quietly evaporating — and it gave a CISO and COO one accountable phone number for the AI surface.

Exhibit 2

Unmanaged baseline: silent accuracy decay (illustrative)

Month	Clinician-labeled accuracy (%, modeled)
Month 0	94.0
Month 2	92.6
Month 4	91.0
Month 6	89.1
Month 8	87.4
Month 10	85.5
Month 12	83.2

The counterfactual: the same model with no monitoring or retraining, drifting down quarter over quarter — the path 91% of deployed models follow (Vela et al., 2022). Illustrative.

Illustrative (decay baseline: Vela et al., Scientific Reports 2022)
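Reading Exhibits 1 and 2 side by side, the modeled gap can be summarized in three numbers. The sketch below simply recomputes them from the exhibit values; like the exhibits themselves, the results are illustrative, and the variable names are ours.

```python
# Modeled accuracy (%) by month, copied from Exhibit 1 (managed) and Exhibit 2 (unmanaged).
managed = {0: 94.0, 2: 93.8, 4: 94.1, 6: 93.9, 8: 94.2, 10: 94.0, 12: 94.3}
unmanaged = {0: 94.0, 2: 92.6, 4: 91.0, 6: 89.1, 8: 87.4, 10: 85.5, 12: 83.2}

# Width of the band the managed loop stays inside over the year.
managed_band = round(max(managed.values()) - min(managed.values()), 1)

# Average decay per month on the unmanaged path.
unmanaged_decay_per_month = round((unmanaged[0] - unmanaged[12]) / 12, 1)

# Accuracy gap between the two paths at month 12.
gap_at_month_12 = round(managed[12] - unmanaged[12], 1)

if __name__ == "__main__":
    print(f"managed band width: {managed_band} points")
    print(f"unmanaged decay: {unmanaged_decay_per_month} points/month")
    print(f"gap at month 12: {gap_at_month_12} points")
```

On these modeled values the managed path moves within half a point all year, while the unmanaged path loses a little under a point per month and ends roughly eleven points lower.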

07 What we'd tell the next buyer

Decide who owns the pager before you ship the pilot. The pilot that works and the pilot that dies look identical on demo day. The difference is whether anyone owns drift, cost, and the 2 a.m. incident. If your data-science team built it, they are usually the wrong team to run it — and asking them to do both stalls everything.
Price the operations, not the inference. Inference fell >280× in under two years, from Nov 2022 to Oct 2024 (Stanford HAI, 2025). If your AI budget is mostly model spend, you have mis-modeled the cost. The durable line item is monitoring, evaluation, retraining, and governance.
Buy a control framework, not a promise. Map to NIST AI 600-1 or ISO/IEC 42001 from day one. It is what gets your privacy office and procurement to yes, and it survives a regulator's question. The framework is also your portability guarantee — it is not vendor-specific.
Make clinician feedback a wire, not a form. Physicians ranked a feedback channel (88%) and privacy assurances (87%) as their top trust requirements (AMA, 2025). Wire feedback into the retraining loop so it becomes training signal, and the people using the tool become the people improving it.
Sequence workflows by trust earned, not by excitement. Ship the high-volume, low-clinical-risk workflow first. Let it earn the diagnostic one. A managed retainer makes this disciplined cadence the default instead of a heroic exception.

This was a representative engagement. The shape — stalled pilot, no owner, decaying-asset risk — is one we see across regulated health systems. The fix is not a better model. It is an accountable operating layer with an SLA, no lock-in, and a phone that gets answered.

Exhibit 3

Managed AI Stack — surface SLA snapshot (illustrative steady state)

Metric	Value
Uptime vs SLA (%)	99.7
On-call coverage (hrs/week)	120
New governed workflows / quarter	3
Clinical-impact incidents	0

Modeled operating posture for an engagement of this size: 24×5 on-call, drift and cost monitoring, monthly governed delivery. Illustrative figures, not a named-client result.

Illustrative
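Two of the snapshot figures translate directly into numbers a buyer can sanity-check. The sketch below derives the weekly on-call hours from the 24×5 schedule and converts an uptime percentage into a downtime budget. Treating 99.7% as a target over a 30-day month is our assumption for the example; the exhibit does not state the measurement window.

```python
HOURS_PER_DAY = 24
ON_CALL_DAYS_PER_WEEK = 5  # the 24x5 schedule described in the retainer terms


def on_call_hours_per_week() -> int:
    """Weekly hours covered by a 24x5 rotation."""
    return HOURS_PER_DAY * ON_CALL_DAYS_PER_WEEK


def downtime_budget_hours(uptime_pct: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at a given uptime percentage."""
    return (1 - uptime_pct / 100) * period_hours


if __name__ == "__main__":
    covered = on_call_hours_per_week()
    print(f"on-call hours/week: {covered}")
    print(f"share of the week covered: {covered / (HOURS_PER_DAY * 7):.1%}")

    # Assumption: a 30-day month (720 hours) as the measurement window.
    print(f"downtime budget at 99.7%: {downtime_budget_hours(99.7, 720):.2f} hours/month")
```

The second number is the useful one in a contract review: at 99.7% over a 30-day window, the budget is a little over two hours a month, which is the figure the severity ladder and response targets have to fit inside.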


By the numbers

Figure	What it measures	Why it matters	Source
91%	ML models that degrade over time	Why a deployed model needs monitoring + retraining	Vela, Sharp, Zhang et al., Scientific Reports
0.625 F1	Best model on MedHallu hard-hallucination set	High exam scores hide confident clinical errors	Stanford HAI 2025 AI Index / MedHallu
>280×	Inference price drop, GPT-3.5-equivalent (Nov 2022 → Oct 2024)	Durable spend has shifted to operations	Stanford HAI 2025 AI Index
USD 7.42M	Avg. healthcare data-breach cost (most expensive sector)	279-day containment: why AI access controls matter	IBM / Ponemon Cost of a Data Breach 2025

Industry: Healthcare — multi-site health system
Service line: Managed AI Stack

