Most of Your AI Rules Don't Mean What You Think They Mean

Last Thursday I rewrote one rule in my AI's operating manual. "Prefer to verify" became "MUST verify." A failure that had repeated three times in one session stopped. Same AI. Same context pressure. Different grammar. Meanwhile, Deloitte's State of Generative AI in the Enterprise found only 21% of organizations rate their AI governance maturity above "developing," meaning roughly four out of five teams are shipping AI on policy that hasn't matured yet. The fix isn't more rules. It's the grammar of the rules they've already got.

For three weeks I'd been adding rules to that operating manual. Each one earned by a specific failure. By the time I sat down and did the math (last week's edition has the full curve), I had twenty-five rules and a compliance rate of roughly seven percent under load. The rewrite was the first thing that actually moved the number.

Edition #6 named the math problem. Stack twenty-five soft-language rules in a sequence, run them under context pressure, and you compound your way to a seven-percent compliance floor. This week is the language problem underneath it: which rules need to stay soft, and which ones need to be promoted?
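The compounding is easy to check. Here is a minimal sketch, assuming every rule is followed independently and at the same rate. The 90% per-rule figure is an illustrative assumption, not a number from Edition #6; it is used here because it lands near the seven-percent floor.

```python
# Illustrative assumption: each rule is followed independently with the
# same probability. The 0.90 rate is hypothetical, not a measured value.
def all_rules_followed(per_rule_rate: float, rule_count: int) -> float:
    """Probability that every rule in the stack is followed at once."""
    return per_rule_rate ** rule_count


for count in (1, 5, 10, 25):
    print(f"{count:>2} rules -> {all_rules_followed(0.90, count):.1%}")
```

Under that assumption, twenty-five rules come out at about 7.2%. Real rules are unlikely to be independent or equally sticky, so treat this as the shape of the curve rather than a measurement.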

Let's walk through three real rules from my own systems.

Rule one. "Verify by reading the file before any edit or write." This is a rule where failure is a disaster. The AI skips verification, writes confidently from a stale memory of the file, and silently corrupts the work. The cost of getting it wrong is wrong data shipped under my name. There is no version of "well, it's uglier" here. The model has to do this every single time. This is a MUST rule.

Rule two. "Review the brand style guide before drafting any client-facing content." I had this rule in a content pipeline that turns podcast episodes into newsletters. Across multiple sessions, the AI quietly stopped reading the style guide and started generating from training data. Drafts drifted. The voice got softer, more generic, more press-release. When I traced the failure, the word "review" was doing the work the word "MUST" should have been doing. Rewritten: "MUST read these three files before generating any content. No drafting from memory." The drift stopped the same week. Same shape as Rule one, different domain. The cost of getting it wrong is brand voice corrupted at scale. MUST.

Rule three. "Include the source URL for every external claim." Most brand and communications teams have something like this on the books. In ordinary editorial contexts, this is SHOULD: a strong default with judgment-permitted exceptions for common-knowledge claims or when attribution is given inline. In regulated industries (finance, healthcare, legal), the same rule gets promoted to MUST because the cost of an unsourced claim stops being "the work is uglier" and becomes "the work is wrong." The rule's tier depends on the content context. In the ordinary case, what makes it SHOULD rather than MUST is that it has documented exceptions, but you owe yourself the audit on when those exceptions actually apply.

One more rule worth a brief mention. "May use bullet points if the content warrants them." This is the MAY tier: pure permission, no direction. No failure mode depends on the rule firing every time. Every AI policy has a few rules like this. They tend to clutter the document. The discipline is to label them as permissions and stop pretending they're requirements.

Four rules, three tiers, three different things the language is doing.

Now here's the part that surprised me. Internet engineers solved this in 1997. The document is RFC 2119, and it locked the words I just used (MUST, SHOULD, MAY) as the canonical tier vocabulary for protocol specifications. In any IETF spec that cites it, the uppercase forms of those words mean exactly what they look like they mean: required, strongly recommended with judgment exceptions, permitted.

A caveat that matters. RFC 2119 was written for human-readable protocol specs, not probabilistic runtime systems like LLMs. The analogy isn't 1:1. But the discipline transfers. Ambiguity at the language layer creates failure at the runtime layer, whether the runtime is a human implementer reading a spec or a model interpreting a rule under context load. The fix in both cases is grammar.

This forces a shift in framing, and it's the one I've been chasing through three weeks of governance work.

The wrong question is "is this rule important?" Every rule in your AI policy feels important. That's why it's in the policy. Importance doesn't tell you what tier the rule belongs on.

The right question is what I'm going to call the generous-reading test. Ask: what happens if the model reads this rule as generously as it can?

If the answer is "the work is wrong" (wrong number, fabricated source, leaked PII, regulatory line crossed), the rule needs to be MUST. The model can't be permitted to read it as a suggestion, ever.

If the answer is "the work is uglier" (off-tone, inconsistent format, weird convention), SHOULD is the right tier. Strong default, documented exceptions, judgment allowed when context warrants.

If the answer is "the work is different but fine" (taste call, formatting preference, optional latitude), MAY is the right tier. This is permission, not direction.
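The test reduces to a three-way lookup, which can be sketched as below. The outcome labels ("wrong", "uglier", "different") and the function name are hypothetical shorthand for the three answers above. The judgment about which outcome applies still has to come from a person; the code only records it.

```python
from enum import Enum


class Tier(Enum):
    MUST = "MUST"
    SHOULD = "SHOULD"
    MAY = "MAY"


# Hypothetical labels for the answer to: "what happens if the model
# reads this rule as generously as it can?"
OUTCOME_TO_TIER = {
    "wrong": Tier.MUST,       # wrong number, fabricated source, leaked PII
    "uglier": Tier.SHOULD,    # off-tone, inconsistent format
    "different": Tier.MAY,    # taste call, formatting preference
}


def generous_reading_tier(worst_outcome: str) -> Tier:
    """Map the worst outcome of a generous reading to a rule tier."""
    return OUTCOME_TO_TIER[worst_outcome]


# Outcomes assigned by hand; rule three is scored for ordinary editorial use.
rules = {
    "Verify by reading the file before any edit or write.": "wrong",
    "Include the source URL for every external claim.": "uglier",
    "May use bullet points if the content warrants them.": "different",
}
for text, outcome in rules.items():
    print(f"{generous_reading_tier(outcome).value:<6} {text}")
```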

The reason most AI policy docs aren't working is not that the rules are wrong. It's that the rules are all written at the same volume. Critical rules and cosmetic rules get the same hedged language ("prefer," "consider," "where appropriate"), and the model treats them all as suggestions because the language doesn't distinguish them. The fix is grammatical. It's also kind of insulting how cheap it is.

What This Means in Practice

The generous-reading test is a tool you can run on your AI policy this week. Three steps.

Step 1. Find every rule that uses soft language. Open the policy doc. Search for "prefer," "consider," "try to," "should generally," "where possible," "as appropriate." Every hit is a candidate. This isn't how a formal audit works; a formal audit starts from a risk register and traces controls. This is what you do before you call anyone in. It's a self-diagnostic, not an audit.
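If the policy lives in a text file, the search can be scripted. This is a minimal sketch, assuming one rule per line; the function name and the sample policy are hypothetical, and the phrase list is the one from Step 1.

```python
import re

# Soft-language phrases from Step 1.
SOFT_PHRASES = [
    "prefer", "consider", "try to",
    "should generally", "where possible", "as appropriate",
]
SOFT_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in SOFT_PHRASES) + r")\b",
    re.IGNORECASE,
)


def find_soft_rules(policy_text: str) -> list[tuple[int, str, str]]:
    """Return (line number, first soft phrase found, line) for each hit."""
    hits = []
    for number, line in enumerate(policy_text.splitlines(), start=1):
        match = SOFT_PATTERN.search(line)
        if match:
            hits.append((number, match.group(1).lower(), line.strip()))
    return hits


# Hypothetical three-line policy for illustration.
sample_policy = """Prefer to verify file contents before editing.
MUST read the style guide before drafting.
Try to cite sources where possible."""

for number, phrase, line in find_soft_rules(sample_policy):
    print(f"line {number}: '{phrase}' in: {line}")
```

The match is on whole words, so variants such as "preferred" or "considering" are not caught. Extend the list if your policy uses them.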

Step 2. Run the generous-reading test on each candidate. For every soft-language rule, ask: what happens if the model reads this as generously as it can?

If the answer is wrong data, fabricated source, leaked PII, or a compliance line crossed: promote it to MUST. Soft language was undercosting the rule.

If the answer is uglier output, off-tone, inconsistent format: leave it at SHOULD. The hedge was honest.

If the answer is taste or preference: demote it to MAY. The rule was direction-by-decoration; making it explicit permission is more honest.

Step 3. Rewrite the MUST-tier rules with the actual word. "Prefer to verify" becomes "MUST verify." "Review the style guide" becomes "MUST read the style guide." "Try to cite sources" becomes "MUST cite sources, with URLs, on every claim." The change is grammatical. In my case, it was the first change that actually moved the compliance number.
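Once each rule carries a tier label, a small check can flag any MUST-tier rule that still lacks the word. This is a sketch under the assumption that rules are stored as (tier, text) pairs; the function name and the sample data are hypothetical.

```python
import re

MUST_WORD = re.compile(r"\bMUST\b")  # uppercase only, as a whole word


def undercosted_rules(rules: list[tuple[str, str]]) -> list[str]:
    """Return MUST-tier rules whose text never says MUST."""
    return [
        text for tier, text in rules
        if tier == "MUST" and not MUST_WORD.search(text)
    ]


# Hypothetical labeled policy: tiers assigned by hand in Step 2.
labeled = [
    ("MUST", "Prefer to verify before any edit or write."),
    ("MUST", "MUST read the style guide before drafting."),
    ("SHOULD", "Include the source URL for every external claim."),
]
for text in undercosted_rules(labeled):
    print("rewrite with MUST:", text)
```

The check only finds the mismatch. The rewrite itself is still a human edit, because the new wording has to name the exact action the model is required to take.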

Most teams will discover that two or three rules in their policy were dramatically undercosted, and the rest were honestly tiered. The two or three that were undercosted are the ones generating most of the noise.

One Thing to Do This Week

Pick one rule. The highest-stakes one. The one where a model failure means the worst consequence for your team: the wrong number sent to leadership, a fabricated source in a regulated industry, customer data exposed.

Walk that one rule through the generous-reading test by Friday. Rewrite it in MUST language if it needs it.

This is the wedge, not the whole job. The whole job is the policy-wide audit, and it doesn't happen until you've proven the mechanic on the rule that matters most. Compliance debt compounds; so does the payback when you start unwinding it.

This is Part 2 of a four-edition series on Compliance Decay. Part 1 (May 12) covered the math: why stacking soft-language rules produces compounding compliance loss. Part 3 (May 26) covers structural gates that don't require behavioral rules at all. Part 4 (June 2) covers values documents that travel with the work.

The Implementation Lane is a weekly newsletter about making AI work inside real organizations. Written by Amanda Crawford, an AI Implementation Specialist who builds systems in the gap between configuration and engineering. If someone forwarded this to you, subscribe here.

Sources: Deloitte State of Generative AI in the Enterprise, Anthropic Prompt Engineering Guidance, RFC 2119 (IETF, March 1997, S. Bradner)
