The design problem most AI companies aren’t solving

There’s a design principle underneath every high-stakes AI product: AI is the decision support. The human is the decision maker. Those are different jobs. The AI surfaces information, surfaces risk, surfaces patterns a person couldn’t find alone. The human takes that and decides what to do. That is the contract: the AI provides the evidence, but the human owns the decision.
Most AI products in legal, healthcare, and criminal justice contexts are violating that contract by design. Not maliciously. They just weren’t built around it. They were built around making AI output look good and feel fast, and then a human approval step was added. The result is a product that looks like decision support but functions like liability transfer. The AI gets the credit. The human gets the exposure.
The gap between those two things is where people get hurt, sanctioned, and sued.
What the Contract Actually Requires
When I say AI is decision support and the human is the decision maker, I mean something specific.
Decision support means the AI’s job is to make the human’s judgment better: surfacing what they couldn’t see alone, flagging what they might miss, organizing what would otherwise take days into something they can actually work with. The human’s job is to take all of that, add the context and judgment and experience the AI doesn’t have, and make a call they can stand behind.
For that to work, the human has to be in a real position to evaluate what they’re looking at. Not technically present. Not nominally responsible. Actually equipped to engage with the evidence, form a view, and own the outcome.
That’s not what most products are designed to produce. I’ve been watching how AI-assisted decision workflows get built, and the pattern is consistent: the product team builds a good model, the UX team builds a clean interface for the output, and somewhere in the spec there’s a review step, usually a button, sometimes a confirmation modal. Legal signs off because a human is technically approving each action. The product ships.
What doesn’t get designed: what the human actually needs to evaluate what they’re looking at.

How the Contract Gets Broken
The research on what happens next is not ambiguous.
When an authoritative-looking recommendation is on the screen, people defer to it, especially under time pressure and mental strain. Kate Goddard and colleagues identified this in clinical decision support in 2012 and the finding has replicated across domains ever since. The phenomenon has a name: automation bias. It’s the tendency to accept a system’s output rather than interrogate it, not out of carelessness but because the design routes around judgment rather than engaging it.
The 2023 JAMA study put this directly to the test. Randomized, 457 clinicians across 13 states. Standard AI predictions improved diagnostic accuracy by 4.4 percentage points. Systematically biased AI predictions reduced it. And the explanations, image-based saliency maps (visual heatmaps) showing why the AI flagged what it flagged, didn’t help. When the model was wrong, showing clinicians the reasoning behind that wrong answer didn’t protect them from agreeing with it anyway.
The explanation failed because the human was no longer engaged in evaluating the case. They were evaluating whether to trust the system. Those are different cognitive tasks, and the design had already answered the second one by the time the explanation appeared.
A 2023 study titled “Putting a Human in the Loop: Increasing Uptake, but Decreasing Accuracy” found something worse: introducing a human reviewer actually increased how often people followed the AI’s recommendation, because the presence of a human made participants feel the decision had already been vetted. The human wasn’t catching errors. The human was providing cover.
Zana Buçinca and Krzysztof Gajos at Harvard tested whether forcing users to form their own view before seeing the AI’s answer would change this. It did. Requiring a prior commitment reduced how often people followed the AI even when it was wrong. The participants hated those designs. They gave them worse ratings even as they made better decisions with them. People do not enjoy being made to think, particularly when the screen is offering them a shortcut. Frictionless feels like good UX. In high-stakes decisions, frictionless is often the failure mode.
Four Cases Where the Contract Broke
The broken contract isn’t theoretical. Each of the following cases involves a domain where someone was designated the decision maker. Each shows what it costs when the product wasn’t built around what that role actually required.
Legal. In 2023, a New York attorney named Steven Schwartz submitted a brief citing six cases that ChatGPT had fabricated. “Varghese v. China Southern Airlines, 925 F.3d 1339 (11th Cir. 2019)” does not exist. Schwartz later said he had been operating under the belief that the tool “could not possibly be fabricating cases on its own.” The court imposed $5,000 in sanctions and required him to mail copies of the order to every real judge whose name appeared on a fake opinion. That was the beginning. By late 2025, researcher Damien Charlotin had tracked 1,356 documented incidents of AI hallucinations in legal filings, with sanctions in individual cases reaching $30,000. In California, at least one court has begun suggesting that opposing counsel may have a duty to detect the other side’s AI-generated fakes. The lawyer was the decision maker. The design gave them no way to verify the input they were deciding on.
Healthcare. IBM’s Watson for Oncology spent three years and more than $62 million at MD Anderson before an internal audit revealed the product couldn’t sync with Epic, was running on outdated drug protocols, and was producing treatment recommendations that weren’t based on current evidence. A physician at Jupiter Hospital described the product to IBM leadership in terms that don’t belong in a published article. Watson Health was sold to private equity in 2022. Separately, Epic’s sepsis prediction tool was deployed across hundreds of US hospitals on vendor-claimed performance numbers that looked credible. When Michigan Medicine researchers ran their own external validation, they found the model missed 67% of sepsis cases at the recommended threshold, generated alerts on 18% of all hospitalized patients, and correctly flagged only 7% of the cases clinicians had missed. Hundreds of clinicians were nominally the decision makers on sepsis while relying on a tool whose real-world performance they had no way to interrogate.
Autonomous systems. In August 2025, a Florida jury found Tesla one-third liable for a fatal crash involving a Model S in Autopilot mode. The driver admitted fault. Tesla argued no system in existence could have prevented this crash. The jury disagreed, finding that the way Autopilot had been marketed and designed shaped how drivers actually used it, and that designing a system in a way that trains users to trust it more than the technology warrants is itself a contributing factor. For years, the standard answer to “who is responsible when Autopilot is engaged?” was “the driver; the system requires constant supervision.” The jury pushed back. When a product’s design trains the human to hand over judgment they were supposed to keep, the product shares the outcome.
Criminal justice. Eric Loomis was sentenced in Wisconsin in 2013 partly on the basis of a high-risk score from COMPAS, a proprietary recidivism prediction tool. The algorithm’s inputs and logic were not disclosed. Loomis couldn’t examine what the model had used to produce the score or challenge it on those grounds. The Wisconsin Supreme Court ruled in State v. Loomis that judges could continue using COMPAS so long as it wasn’t the “sole basis” for sentencing. The problem with that standard is a well-documented bias called anchoring: once an authoritative-looking number is in the room, it shapes decisions even when the decision maker believes they’re reasoning independently of it. ProPublica’s 2016 investigation found COMPAS was nearly twice as likely to falsely flag Black defendants as future offenders compared to white defendants at the same risk level. The judge was the decision maker. The design put an unexaminable number on the page and then assumed the human could reason around it.
The domains are different. The pattern is the same: someone was designated the decision maker, and the product wasn’t designed around what that role actually required.
Why the Contract Keeps Breaking
Automation bias explains the immediate failure. There’s a slower one that compounds it.
In aviation, the FAA spent years documenting what happened to pilots who relied heavily on autopilot: their ability to fly manually deteriorated. The 2013 Asiana 214 crash at San Francisco International, where investigators found the crew overly reliant on automation and lacking proficiency in manual flight at low speed, led the FAA to issue guidance directing pilots to hand-fly more often during low-workload phases. The goal was to preserve the skills they would need precisely when the automation failed.
Medicine is running the same experiment now. A multicenter randomized trial of colonoscopy with AI polyp detection found that when endoscopists returned to non-AI procedures after sustained AI use, their adenoma detection rate dropped from 28.4% to 22.4%. The tool was making them better at colonoscopy while they used it and measurably worse when they didn’t. In radiology, giving radiologists incorrect AI suggestions increased their false-positive recalls by up to 12%, even when they were explicitly told the AI might be wrong.
A product that removes the hard parts of a job, the reading, the independent reasoning, the forming of a view before consulting any reference, is also removing the experience that builds the judgment the decision maker is supposed to bring. The contract assumes a human with the expertise to evaluate AI support. Over time, a badly designed product dulls the expertise it depends on. That’s not a side effect. It’s a design choice, made by default.
What the Regulation Is Encoding
Regulators have started writing the contract into law.
The EU AI Act’s Article 14 on human oversight requires that high-risk AI systems be designed so that overseeing humans can “remain aware of the possible tendency of automatically relying or over-relying on the output produced by a high-risk AI system.” The regulation uses the term “automation bias” explicitly. That’s a legislative acknowledgment that putting a human in front of an approval button is not the same as designing a product that enables a human to actually make a decision. High-risk system obligations under Article 14 take effect August 2, 2026, meaning both providers and deployers must demonstrate that oversight is substantive: that the human assigned to approve is actually in a position to decide, not just named on the audit trail.
The Department of Defense’s Directive 3000.09 on autonomy in weapon systems made a related move. The directive doesn’t require a human at every engagement decision. What it requires is “appropriate levels of human judgment,” phrasing chosen specifically to distinguish substantive decision-making from a person whose presence is technically logged.
In financial services, audit trail requirements from NIST (the federal standards body), the SEC, and banking regulators are moving toward the same standard: every AI-influenced decision must be reconstructable. Not just “a human approved this,” but what data the model used, what was missing, what the human considered, and what their stated rationale was. The record has to answer the question a regulator will ask in an enforcement action two years from now.
Most products aren’t built to produce that record. They’re built to produce a timestamp and a user ID. That’s not a decision trail. That’s evidence that someone was present.
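To make that contrast concrete, here is a minimal sketch, in TypeScript, of the two kinds of record side by side. The field names are illustrative, not drawn from any regulator’s schema; the point is what the second shape forces the product to capture that the first never asks for.

```typescript
// What most products log: evidence that someone was present.
interface ApprovalStamp {
  decisionId: string;
  approvedBy: string; // user ID
  approvedAt: string; // ISO timestamp
}

// What a reconstructable decision trail has to capture.
// Illustrative field names, not a regulatory standard.
interface DecisionRecord {
  decisionId: string;
  decidedBy: string;
  decidedAt: string;

  // What the model saw, and what it didn't.
  modelVersion: string;
  inputsUsed: string[]; // data sources the model actually consumed
  knownGaps: string[];  // data that was missing or stale at decision time

  // What the model said, with its calibrated uncertainty.
  modelOutput: { recommendation: string; uncertainty: number };

  // What the human actually did with it.
  evidenceReviewed: string[]; // source documents the reviewer opened
  humanRationale: string;     // required free text: why this call, in their own words
  outcome: "approved" | "rejected" | "escalated";
}
```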
What Honoring the Contract Actually Requires
Here’s the question that follows from the design principle. If AI is decision support and the human is the decision maker, what does the human actually need to do that job?
It’s not a confidence score. It’s not a saliency map. It’s not a summary paragraph the AI generated about the same analysis it just ran.
When I founded and built an AI-powered fintech platform for early-stage investment due diligence, this was the central design question. The product surfaced financials, risk signals, and thesis fit for investors making real capital allocation decisions. We had the technical capability to let the AI recommend whether to invest. We chose not to. The AI surfaced insights, showed investors what the numbers actually meant, flagged the risks, and gave them everything they needed to form a view. It never said “this is a good investment.” That was deliberate. In high-stakes financial decisions, if the investor doesn’t understand why they’re making a call, the tool has failed even if the answer is correct. I delayed certain AI features specifically to ensure outputs were explainable and trust-building before shipping, not because we couldn’t build them faster, but because the human’s judgment had to stay in the room.
That principle generalizes. Here’s what it actually looks like in product decisions.
The human needs the underlying evidence, not just the conclusion. The contract clause they can click into, the source document passage they can read, the comparable transaction they can verify. Harvey AI’s Vault, Kira Systems’ source-anchored smart fields, Spellbook’s redline review where every AI suggestion requires an explicit accept or reject: these products put the source one click from the finding. The summary is faster. The source is what lets the human actually stand behind the call.
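As a data shape, the pattern is simple: a finding the AI surfaces cannot exist without a pointer back to the passage it came from, and the review screen refuses to render one without the other. A minimal sketch with hypothetical names; none of the products above publish their internal schemas, so this illustrates the principle, not their implementations.

```typescript
// A finding that cannot be shown without its source.
interface SourceAnchor {
  documentId: string;
  page: number;
  charStart: number; // span of the passage the claim rests on
  charEnd: number;
}

interface Finding {
  summary: string;         // the AI's one-line conclusion
  anchors: SourceAnchor[]; // the evidence it rests on
}

// Enforce the one-click rule: no anchor, no finding on the review screen.
function acceptForReview(finding: Finding): Finding {
  if (finding.anchors.length === 0) {
    throw new Error(
      `Finding "${finding.summary}" has no source anchor and cannot be surfaced for review.`
    );
  }
  return finding;
}
```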
The human needs to form a view before the AI shows its answer, at least in cases that warrant it. Committing to a position first is what preserves independent reasoning. That’s an uncomfortable design choice. It makes the workflow slower and the experience harder. It also produces better decisions, and for products in high-stakes domains, that tradeoff needs to be named explicitly rather than defaulted away because usability testing rewards frictionless.
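In workflow terms, that is a gate: the AI’s recommendation stays hidden until the reviewer has committed to a view of their own. A minimal sketch of what such a gate might look like, with hypothetical names, loosely modeled on the cognitive forcing functions Buçinca and Gajos tested rather than on any shipping product.

```typescript
// A review flow that withholds the AI's answer until the human has
// recorded an independent assessment first.
interface Review {
  caseId: string;
  humanAssessment?: string;  // the reviewer's own view, captured before the reveal
  aiRecommendation?: string; // populated only after the commitment
}

function commitAssessment(review: Review, assessment: string): Review {
  return { ...review, humanAssessment: assessment };
}

function revealRecommendation(review: Review, recommendation: string): Review {
  if (!review.humanAssessment) {
    // The shortcut simply isn't available: no prior view, no AI answer.
    throw new Error("Record your own assessment before viewing the AI recommendation.");
  }
  return { ...review, aiRecommendation: recommendation };
}
```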
The human needs uncertainty expressed in a form they can reason with. Most products display abstract confidence scores like “12% confidence” or “high reliability,” which the brain often processes as a static grade to be ignored. However, research by Cao, Liu, and Huang found that calibrated uncertainty only improves reliance behavior when expressed as a frequency. Telling a clinician that “in 100 patients like this one, 12 would have this condition” forces a mental simulation of real-world outcomes. It transforms a percentage into a scenario that actually demands human judgment.
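Much of that weight is carried by a small formatting choice. A sketch of the translation, assuming the model’s probability is already calibrated for the reference population it describes; the function name is hypothetical.

```typescript
// Turn a calibrated probability into a frequency statement the decision
// maker can mentally simulate, instead of an abstract confidence score.
function asFrequency(probability: number, outcome: string, cohort = 100): string {
  if (probability < 0 || probability > 1) {
    throw new Error("Probability must be between 0 and 1.");
  }
  const count = Math.round(probability * cohort);
  return `In ${cohort} patients like this one, about ${count} would ${outcome}.`;
}

// asFrequency(0.12, "have this condition")
// -> "In 100 patients like this one, about 12 would have this condition."
```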
The human needs their reasoning on the record, not just their approval. “Approved” is not a decision trail. “Approved because the risk clause in section 4.2 is standard for this counterparty type, flagged for legal review” is. In domains where the decision will be reviewed later, the product needs to capture what the human actually thought at the moment of decision, as part of the UX itself, not as a compliance mechanism bolted on afterward.
The human needs to keep the underlying skill. If the workflow removes the work that builds judgment, the reading, the independent assessment, the forming of a view from evidence, it also removes what makes the decision maker’s role meaningful. The contract requires an expert. A product that deskills its users is quietly canceling the contract it depends on.
The fastest test for whether a product is built this way is simple. Find the approval step. Ask what the user would need to know if they had to explain that decision in a deposition a year from now. If the answer isn’t already on the screen, it’s not decision support. It’s a paper trail.
While courtroom sanctions and medical audits leave a visible trail of a broken contract, the highest stakes show up where there is no trail at all. Consider an offline navigation app used by hunters or search and rescue teams. You are deep in remote terrain with no signal, limited battery, and changing conditions, following a route the system suggested twenty minutes ago that you cannot verify. At that point you are not clicking approve; you are betting your safety on it. There is no source document to inspect, no second system to cross-check, no fallback. The product is not supporting the decision; it is the only evidence you have. That is where the contract is most fragile: if the system is wrong, the user does not just lose confidence, they lose time, options, and in some cases their safety. Trust here is not a UX layer. It is a survival mechanism, and the design either acts as a lifeline or quietly becomes the trap.
The AI-Native Reframe
AI-augmented product design asks: how do we display the model’s output clearly?
AI-Native design asks: what does this specific human need, in this specific moment, to make this decision and defend it later?
Those aren’t the same question. The first produces good output display. The second produces something that actually honors the contract.
The products that have gotten this right share a few characteristics. They’re slower in ways that feel purposeful rather than broken. They put the source one click from every claim. They require the human to engage with the evidence before routing around them with a summary. They record what the human actually decided and why, not just whether they clicked. They were designed around the moment someone has to answer for a decision, not just the moment they make one.
The broken-contract pattern holds until it doesn’t. It breaks when a sanctions order names the lawyer who approved something they couldn’t verify. It breaks when a malpractice case establishes that a clinician who processed 200 AI alerts in an hour wasn’t meaningfully deciding anything. It breaks when a board inquiry asks for the decision trail and finds timestamps with no rationale attached.
The teams building in high-stakes domains right now are operating as if the appearance of oversight is enough, or as if the law will move slowly enough to allow a course correction later. The empirical record, the regulatory calendar, and the case law are all moving in the same direction. What they’re moving toward is a standard that asks a simple question: was the human actually in a position to decide?
Most current products aren’t designed to answer yes.
Leslie Sultani is a design leader and player-coach writing about the intersection of AI, design practice, and organizational change. Former CPO, UX engineer, and founder of a FinTech AI platform. Read the full AI-Native Design Series at LinkedIn, Substack or Medium.
Further Reading
- “To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-Assisted Decision-Making” — Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z. Gajos, ACM CHI 2021. The foundational experiment on cognitive forcing functions and why people give the lowest ratings to the designs that produce their best decisions.
- “Measuring the Impact of AI in the Diagnosis of Hospitalized Patients” — Sarah Jabbour et al., JAMA 2023. Randomized trial showing that AI explanations don’t protect clinicians from systematically biased models. The accompanying editorial by Khera, Simon, and Ross is worth reading alongside it.
- Mata v. Avianca, Inc., S.D.N.Y. 2023 — The sanctions order in the ChatGPT hallucination case that started a wave of legal AI scrutiny.
- “Article 14: Human Oversight” — EU Artificial Intelligence Act. The regulatory text that names automation bias explicitly and requires products to be designed against it.
- “Machine Bias” — Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner, ProPublica 2016. The COMPAS investigation. The question it raised about who a system is actually serving hasn’t been resolved.
- “Automation Bias: A Systematic Review of Frequency, Effect Mediators, and Mitigators” — Kate Goddard, Abdul Roudsari, and Jeremy Wyatt, JAMIA 2012. The foundational systematic review. Still the clearest account of how the phenomenon works and what design moves actually mitigate it.
- AI Hallucination Cases Database — Damien Charlotin. A live, continuously updated tracker of documented incidents where courts have found parties relied on AI-hallucinated content in legal filings. The most comprehensive public record of how the legal case count is growing.
- “External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients” — Andrew Wong, Karandeep Singh et al., JAMA Internal Medicine 2021. The Michigan Medicine external validation of Epic’s sepsis model, and the source for the figures cited in this article: the 67% miss rate, the 18% alert burden, and the 7% of clinician-missed cases flagged.
