R Refactor & Reflect The Journal · est. 2008
May 5, 2026 · 32 min read · #misc

27 Quality Gates That Block Bad Code from Production

A complete guide to the 27-gate production readiness system — with definitions, a colorful gate map, pass/fail output, and the real cost of skipping each check.


Quick Review

  • Thesis & memory hook
  • AI tools write code that compiles. They don’t write the 90-day rollback rehearsal into your runbook or scope your cache keys by tenant. Quality gates do. The 27-point contract turns “is it done?” from a negotiation into a binary answer.
  • Memory hook: “the 27-point contract” — 27 mechanically verified, hard-fail checks that define production-ready.
  • Who this is for / who it isn’t
  • Target reader: any engineer — junior to senior — preparing to ship a production application and wanting a structured quality checklist that goes beyond “the tests pass.”
  • Out-of-scope: DevOps engineers looking for infrastructure-as-code checklists; non-technical stakeholders.
  • Why it matters now
  • AI coding assistants generate coherent, compiling code at speed. They don’t carry the six months of production hardening your senior engineers have internalized. The gate system externalizes that hardening — mechanically, with hard verdicts.
  • Key facts & numbers
  • 27 gates — one failure blocks deployment regardless of overall score [1]
  • ~100 mechanical checks total across the five areas [1]
  • 5 areas: 2 Database gates, 3 Backend gates, 2 Frontend gates, 12 Cross-Cutting gates, 8 Operational gates [1]
  • XC6 maps 13 AI-safety sub-checks to the OWASP Top 10 for LLM Applications 2025 [2]
  • OP3: rollback runbook must be tested within 90 days — hard gate [1]
  • OP7: SLO performance evidence must be dated within 30 days — hard gate [1]
  • The instrument
  • A five-area gate catalog: Database (DB2–DB3), Backend (BE1–BE3), Frontend (FE1–FE2), Cross-Cutting (XC3–XC12, XC15–XC16), Operational (OP1–OP8).
  • Each gate: what it checks, how a failure can be exploited or degrades application health, and how to resolve it.
  • Decision rules — do / don’t
  • Do: treat all 27 as hard-fail — one gate failing is a BLOCK, not a negotiation
  • Do: fix Database gates before writing Backend code — schema errors compound
  • Do: codify SLOs before your first production deploy; the OP7 gate requires measured evidence
  • Don’t: treat 26/27 as “close enough” — one hard-fail gate is a BLOCK regardless of overall score
  • Don’t: skip Operational gates because “ops will handle it” — OP3, OP4, and OP7 require decisions developers own
  • Commands & code at a glance
  • grep -rn "\.Result\|\.Wait()\|\.GetAwaiter().GetResult()" src/ --include="*.cs" — BE2 async hygiene scan
  • dotnet list package --vulnerable --include-transitive — XC8 dependency security scan
  • curl -fsS --max-time 30 http://localhost:{port}/health | jq -e '.status == "Healthy"' — OP1 health check
  • grep -rn "AllowAnyOrigin" --include="*.cs" — wildcard CORS scan (Critical finding)
  • Skim map — where to land if you’re hunting for…
  • …the terminology (hard gate, warn) → §How the Gate System Works
  • …all 27 gates at a glance → §All 27 Gates at a Glance
  • …database schema, init scripts, and migration safety → §Database Gates
  • …async hygiene, project layout, API contract → §Backend Gates
  • …security, AI safety, secrets, multi-tenancy → §Cross-Cutting Gates
  • …health endpoints, rollback, SLOs, performance → §Operational Gates
  • References at a glance
  • [1] Arsalan Shahid — Quality Gates — 27-gate production readiness specification (internal)
  • [2] OWASP Foundation — Top 10 for LLM Applications (2025)
  • [3] OWASP Foundation — Top 10 Web Application Security Risks (2021)

Claude scaffolded the microservice in eleven minutes. Route handlers, JWT middleware, EF Core repositories — all there, all compiling. Then I ran the gate suite: AllowAnyOrigin() was wired for production CORS, three controller endpoints were missing [Authorize], and the JWT signing key was sitting in appsettings.Production.json as a plain string. The code compiled perfectly. It was also a breach waiting to happen.

That incident is why the 27-point contract exists. Not to slow engineers down — to externalize the vigilance that experience builds and AI tools don’t carry. When a gate blocks your PR, it is doing the cheapest debugging you will ever get.


Why Working Code Isn’t Production-Ready

Why doesn’t passing tests mean I’m ready to ship?

Passing tests prove the code does what you told it to do. They don't prove you wired rate limiting, remembered to scope cache keys by tenant, rotated secrets out of config files, or tested your rollback runbook in the last 90 days. The gate system checks those things — mechanically, repeatedly, with hard verdicts.

Unit tests pass against mocks. Mocks don’t enforce TenantId filters. You’ll find out that your multi-tenant query returns all tenants’ data when your first real production user calls the support line, not when your test suite runs. XC9 (multi-tenant isolation) exists precisely because the bug that passes every unit test is the most dangerous kind.

The same gap applies to operational readiness. A rollback procedure that lives in a wiki page and has never been executed is not a rollback procedure — it is documentation. OP3 stamps the date of the last rehearsal because the incident will always happen at 2am when the engineer on call is not the one who wrote the runbook.


How the Gate System Works: Key Terms and Definitions

Before walking the 27 gates, it helps to understand the vocabulary.

Term — Definition
Hard gate — A check where a single failure sets the overall audit to BLOCK, regardless of how many other gates pass. All 27 gates are hard-fail.
DB — Database area: schema design, init scripts, migration safety. Gates: DB2–DB3.
BE — Backend area: .NET code structure, async hygiene, API contract integrity. Gates: BE1–BE3.
FE — Frontend area: React component library, Next.js admin, UI/UX readiness. Gates: FE1–FE2.
XC — Cross-Cutting area: security, AI safety, logging, secrets, dependencies, docs, multi-tenancy, web security, threat model. Gates: XC3–XC12, XC15–XC16.
OP — Operational area: health endpoints, rate limiting, rollback rehearsal, SLOs, cost budgets, audit retention, performance evidence, test & ship readiness. Gates: OP1–OP8.
G-number — A stable secondary ID for each gate. The primary area ID (e.g. DB2) tells you the area and sequence; the G-number is permanent and never renumbers, even if area IDs change.
~100 checks — The full set of checks across all gates. The 27 gate verdicts are the non-negotiable hard-fail subset; the remaining checks produce scored findings that affect area percentages but don't individually block the build.

All 27 Gates at a Glance

The five areas, their colors, and their gates:

🗄️ Database

DB2 · Init scripts
DB3 · Migration safety

⚙️ Backend

BE1 · Role-based src layout
BE2 · Async hygiene
BE3 · API contract integrity

🖥️ Frontend

FE1 · Next.js admin + component lib
FE2 · UI/UX readiness

🔀 Cross-Cutting

XC3 · TDD naming
XC4 · Folder layout
XC5 · Logging & exceptions
XC6 · AI-operation safety
XC7 · Secret hygiene
XC8 · Dependency security
XC9 · Multi-tenant isolation
XC10 · Capabilities doc current
XC11 · File naming canonical
XC12 · Section schemas conform
XC15 · Web security audit
XC16 · Threat model & architecture

🚀 Operational

OP1 · Health endpoints
OP2 · Rate limiting & DoS
OP3 · Rollback runbook tested
OP4 · SLO/SLI codified
OP5 · Cost-budget alert
OP6 · Audit log retention
OP7 · Performance evidence
OP8 · Test & ship readiness

GATE LIFECYCLE — what happens on every run

📝 PR / commit → 🔍 27 gates run (~100 checks) → ⚖️ any gate fail?
  No → ✅ PASS — merge allowed
  Yes → ❌ BLOCK — fix & rerun

Sample gate output — what a real run looks like:

Gate · Name · Status · Area
DB2 · Init scripts · ✅ PASS · Database
DB3 · Migration safety · ✅ PASS · Database
BE1 · Role-based src layout · ✅ PASS · Backend
BE2 · Async hygiene · ❌ FAIL · Backend
BE3 · API contract integrity · ✅ PASS · Backend
FE1 · Next.js admin + component lib · ⚠️ WARN · Frontend
FE2 · UI/UX readiness · ✅ PASS · Frontend
XC3 · TDD naming · ✅ PASS · Cross-Cutting
XC4 · Folder layout · ✅ PASS · Cross-Cutting
XC5 · Logging & exceptions · ✅ PASS · Cross-Cutting
XC6 · AI-operation safety · ✅ PASS · Cross-Cutting
XC7 · Secret hygiene · ✅ PASS · Cross-Cutting
XC8 · Dependency security · ✅ PASS · Cross-Cutting
XC9 · Multi-tenant isolation · ✅ PASS · Cross-Cutting
XC10 · Capabilities doc current · ✅ PASS · Cross-Cutting
XC11 · File naming canonical · ✅ PASS · Cross-Cutting
XC12 · Section schemas conform · ✅ PASS · Cross-Cutting
XC15 · Web security audit · ✅ PASS · Cross-Cutting
XC16 · Threat model & architecture · ✅ PASS · Cross-Cutting
OP1 · Health endpoints · ✅ PASS · Operational
OP2 · Rate limiting & DoS · ✅ PASS · Operational
OP3 · Rollback runbook tested · ✅ PASS · Operational
OP4 · SLO/SLI codified · ✅ PASS · Operational
OP5 · Cost-budget alert · ✅ PASS · Operational
OP6 · Audit log retention · ✅ PASS · Operational
OP7 · Performance evidence · ✅ PASS · Operational
OP8 · Test & ship readiness · ✅ PASS · Operational

Overall verdict: BLOCK — BE2 (async hygiene) is a hard-fail gate. Fix .Result / .Wait() calls, then re-run.


The Gate Walkthrough, Area by Area

Each gate below lists what it checks (as bullets, not paragraphs) and why it exists — the production incident that taught us this check needed to exist.

Database Gates (DB2–DB3)

🗄️ Database area — schema correctness, init scripts, migration safety

DB2 (G3) — Database-first with init scripts

  • An init.sql for your provider exists under scripts/, honors :schema + :prefix parameters, and is idempotent on re-run
  • EF migrations are not the source of truth — init scripts are
# Verify idempotency: run twice, check for errors
sqlcmd -S $SQLSERVER -d $DATABASE -i scripts/init.sql
sqlcmd -S $SQLSERVER -d $DATABASE -i scripts/init.sql
  • Fails if: init script missing; hardcoded schema without parameter substitution; only EF migrations present

How it affects application health: A non-idempotent init script fails silently on the second run in CI. You discover it when a developer runs the script against an already-initialized dev database and gets foreign-key conflicts. If you’re weighing init scripts against EF Core migrations, Database-first with EF Core in 2026 walks through the trade-offs in detail.

How to resolve:
– Add IF NOT EXISTS guards to every CREATE TABLE, CREATE INDEX, and CREATE VIEW
– Wrap stored procedure creation in DROP PROCEDURE IF EXISTS / CREATE PROCEDURE
– Run the script twice in CI against a fresh database — the second run must return zero errors
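The guard rule can also be pre-checked before any database is involved. A minimal sketch — the init.sql content below is invented for illustration, and exact guard syntax varies by provider:

```shell
# Sketch: flag unguarded CREATE statements in an init script.
# The sample init.sql is hypothetical; guard syntax differs per provider.
mkdir -p /tmp/scripts
cat > /tmp/scripts/init.sql <<'SQL'
CREATE TABLE IF NOT EXISTS Orders (Id INT PRIMARY KEY);
CREATE TABLE Users (Id INT PRIMARY KEY);
SQL
# Any CREATE TABLE without an existence guard is a re-run failure waiting to happen
grep -n "CREATE TABLE" /tmp/scripts/init.sql | grep -v "IF NOT EXISTS" \
  && echo "DB2 FAIL: unguarded CREATE — script is not idempotent"
```

Running this in CI alongside the double-execution test catches the guard omission even when the second run happens to succeed.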


DB3 (G14) — Migration safety

  • Schema changes are additive-only on production schemas
  • No DROP COLUMN, narrowing ALTER COLUMN, or renames without a documented deprecation cycle
  • Every migration file is idempotent and backward-compatible with the previous app version
  • Fails if: any migration contains a destructive verb without a deprecation marker; any migration is non-idempotent on re-run

How it affects application health: A DROP COLUMN on a live 50M-row table takes an exclusive lock for the duration of the operation. Users are locked out for 40 minutes. There is no hotfix that makes the column come back. Destructive migrations look identical to additive ones in a PR diff unless you’re explicitly scanning for the verb.

How to resolve:
– Search migrations for DROP COLUMN, ALTER COLUMN, and renames: grep -rn "DROP COLUMN\|ALTER COLUMN" Migrations/
– For each destructive change: keep the old column for ≥2 releases, mark it with a deprecation comment, document the removal timeline in CHANGELOG.md
– For large tables: use pt-online-schema-change (MySQL) or pg_repack / zero-downtime migration libraries (Postgres) to avoid locking
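The destructive-verb search above can be wired as a blocking CI step. A minimal sketch — the migration file and the `deprecated:` marker convention are invented for illustration:

```shell
# Sketch: block merges when a migration contains a destructive verb
# without a deprecation marker. The sample migration is hypothetical.
mkdir -p /tmp/Migrations
cat > /tmp/Migrations/20260501_Cleanup.sql <<'SQL'
ALTER TABLE Users ADD LastLogin DATETIME NULL;
ALTER TABLE Users DROP COLUMN LegacyFlag;
SQL
hits=$(grep -rn "DROP COLUMN\|ALTER COLUMN" /tmp/Migrations/ | grep -v "deprecated:")
if [ -n "$hits" ]; then
  echo "$hits"
  echo "DB3 FAIL: destructive migration without deprecation marker"
  # in CI: exit 1
fi
```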


Backend Gates (BE1–BE3)

⚙️ Backend area — project structure, async correctness, API contract integrity

BE1 (G8) — Role-based src subfolder layout
– .NET projects in role-based subfolders: src/core/, src/client/, src/messaging/, src/web/
– Frontend under src/frontend/ — never at repo root as frontend/
– Fails if: any .csproj at flat src/{Name}.* path; frontend/ at repo root

How it affects application health: Flat project layouts accumulate coupling. A core/ project that can see web/ controllers imports controllers. Role-based isolation enforces the dependency direction via the project reference graph, not via code reviews.

How to resolve:
– Move each .csproj to src/{role}/{name}/core/, client/, messaging/, web/
– Move frontend/ to src/frontend/
– Update the .sln and all <ProjectReference> paths; verify with dotnet build
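The flat-path check itself is mechanical. A sketch of one way to detect it — the sample tree and project names are hypothetical, and the depth heuristic assumes role folders sit one level below src/:

```shell
# Sketch: flag .csproj files at a flat src/{Name}/ path (BE1).
# Sample repo layout is fabricated for illustration.
mkdir -p /tmp/repo/src/Acme.Api /tmp/repo/src/web/Acme.Web
touch /tmp/repo/src/Acme.Api/Acme.Api.csproj /tmp/repo/src/web/Acme.Web/Acme.Web.csproj
# Anything exactly two levels deep under src/ bypassed the role folders
find /tmp/repo/src -mindepth 2 -maxdepth 2 -name "*.csproj" \
  && echo "BE1 FAIL: flat project layout detected"
```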


BE2 (G11) — Async hygiene
– All I/O is async Task<T> with CancellationToken
– No sync-over-async patterns anywhere in the codebase
– EF queries do not produce N+1 over navigation properties

# Scan for sync-over-async (must return zero matches)
grep -rn "\.Result\|\.Wait()\|\.GetAwaiter().GetResult()" src/ --include="*.cs" | grep -v Test
  • Fails if: any .Result, .Wait(), or .GetAwaiter().GetResult() outside test code; Task.Run inside an ASP.NET request path; loop accessing a navigation property without .Include() upstream

How it affects application health: Sync-over-async blocks a ThreadPool thread for the full I/O duration. Under sustained load the pool saturates — requests queue, the service appears to hang, and the symptom is indistinguishable from a database outage until you examine the thread dump. No hotfix resolves it; you redeploy with the .Result calls removed.

How to resolve:
– Replace every .Result with await; every .Wait() with await; every .GetAwaiter().GetResult() with await
– For N+1 queries: add .Include(x => x.Navigation) upstream of the loop that accesses the navigation property
– Add the async scan to your CI pre-merge gate so it never regresses
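The pre-merge gate can be the scan from the checklist wrapped in a failing branch. A sketch, using a fabricated source file to show the failure path:

```shell
# Sketch: fail CI when sync-over-async appears outside test code.
# The sample UserService.cs is invented for illustration.
mkdir -p /tmp/be2/src
cat > /tmp/be2/src/UserService.cs <<'CS'
var user = _repo.GetUserAsync(id).Result; // blocks a ThreadPool thread
CS
if grep -rn "\.Result\|\.Wait()\|\.GetAwaiter().GetResult()" /tmp/be2/src --include="*.cs" | grep -v Test; then
  echo "BE2 FAIL: sync-over-async detected — replace with await"
  # in CI: exit 1
fi
```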


BE3 (G16) — API contract integrity
– OpenAPI spec is generated from the running app and committed to the repo
– docs/2.4-API_REFERENCE.md matches the generated spec (no drift)
– Breaking changes between releases have an explicit ## Breaking Changes entry in CHANGELOG.md
– Fails if: generated spec missing or stale by >7 days; doc spec out of sync; breaking change shipped without changelog entry

How it affects application health: A client SDK built against an undocumented breaking change fails silently until the client ships. The developer gets a support call two weeks later. Committed specs make breaking changes visible at PR review time, not at client integration time.

How to resolve:
– Run dotnet swagger tofile --output openapi.json (or equivalent) and commit the output
– Add a CI step that regenerates the spec and fails if the committed file differs: git diff --exit-code openapi.json
– For breaking changes: add a ## Breaking Changes entry to CHANGELOG.md and bump the major version
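The drift check reduces to a file comparison between what's committed and what the running app generates. A sketch with plain diff standing in for git diff --exit-code — both JSON bodies are invented for illustration:

```shell
# Sketch of the spec-drift check: committed spec vs freshly regenerated spec.
# File names and contents are hypothetical.
printf '{"openapi":"3.0.1","paths":{}}\n'            > /tmp/openapi.committed.json
printf '{"openapi":"3.0.1","paths":{"/users":{}}}\n' > /tmp/openapi.regenerated.json
diff -q /tmp/openapi.committed.json /tmp/openapi.regenerated.json \
  || echo "BE3 FAIL: committed spec is stale — regenerate and commit"
```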


Frontend Gates (FE1–FE2)

🖥️ Frontend area — admin coverage, UI correctness, accessibility

FE1 (G7) — Next.js admin + React component library
– Next.js admin at admin-nextjs/app/ or src/clients/nextjs/ with route coverage for every domain entity
– React component library Webpack-built with dual ESM + CJS output
– Fails if: either app missing; library not Webpack-built without a deviation note; missing admin page for any entity

How it affects application health: An admin UI without a page for a new entity means support teams work directly in the database. Direct database access means no audit trail, no validation, and no rollback.

How to resolve:
– Add a Next.js route under app/ for every domain entity that lacks one
– Verify the component library builds: npm run build:lib must exit 0 with both ESM and CJS outputs present


FE2 (G23) — UI/UX readiness
– UI audit score ≥ 80 across 12 dimensions:
– Empty states
– Error states
– Loading states
– Mobile responsive layout
– Keyboard navigation
– ARIA attributes (axe-core when --with-axe flag used)
– WCAG 4.5:1 contrast ratio
– AI-generated filler string detection
– Reduced-motion support
– Skip-to-content link present
– Focus management on route changes
– Form field label associations
– Skipped for ui_stack ∈ {None (Library), None (API-only)}
– Fails if: UI score < 80; any Critical UI finding

How it affects application health: An empty state that renders “No items found.” and nothing else is not a UI — it is a bug that made it past code review. Accessibility gaps are legal exposure in regulated industries. The audit catches both before they reach a user.

How to resolve:
– For empty states: replace bare “No items found.” with an illustration, a description of why it’s empty, and a primary action button
– For accessibility: run npx axe-cli https://localhost:3000 and fix all Critical and Serious findings; verify WCAG 4.5:1 contrast ratio with the browser’s contrast checker
– For reduced-motion: wrap animations in @media (prefers-reduced-motion: reduce) { * { animation: none; transition: none; } }
– For AI-generated filler: search for “Lorem ipsum”, “Placeholder”, “TODO” strings in rendered output
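The filler search is greppable against the built output directory. A sketch — the dist/ path and sample HTML are fabricated for illustration:

```shell
# Sketch: scan rendered output for AI-generated filler strings (FE2).
# The sample dist/index.html is hypothetical.
mkdir -p /tmp/dist
cat > /tmp/dist/index.html <<'HTML'
<p>Lorem ipsum dolor sit amet</p>
HTML
grep -rn "Lorem ipsum\|Placeholder\|TODO" /tmp/dist/ \
  && echo "FE2 FAIL: filler copy shipped to rendered output"
```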

Cross-Cutting Gates (XC3–XC12, XC15–XC16)

🔀 Cross-Cutting area — security, AI safety, observability, documentation, multi-tenancy, web security, threat model

XC3 (G5) — Tests in TDD format
– Test method names follow MethodName_StateUnderTest_ExpectedBehavior
– dotnet test and npm run test both green
– Fails if: ≥30% of test names violate the convention; any test suite red

How it affects application health: Tests named Test1, VerifyStuff, or HappyPath give no signal about what regressed when they fail. A red test suite means the build is already broken before the gate runs — every subsequent finding is noise until the suite is green.

How to resolve: Rename failing tests to MethodName_StateUnderTest_ExpectedBehavior. Fix broken tests before any other gate work — a red suite makes all other findings unreliable.
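The ≥30% threshold implies a counting step. A sketch of the naming scan — the three-part underscore regex is an assumption about how strictly the convention is enforced, and the sample names are invented:

```shell
# Sketch: count test names violating the three-part
# MethodName_StateUnderTest_ExpectedBehavior convention.
# Sample names are hypothetical.
names="GetUser_UserMissing_ReturnsNotFound
Test1
CreateOrder_EmptyCart_Throws"
bad=$(echo "$names" | grep -cvE '^[A-Za-z0-9]+_[A-Za-z0-9]+_[A-Za-z0-9]+$')
total=$(echo "$names" | wc -l | tr -d ' ')
echo "violations: $bad of $total"
```

In a real run, the name list would come from dotnet test --list-tests or a grep over test source files.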


XC4 (G6) — Area-organized folder layout
– docs/ uses numbered area prefixes (1.x–6.x)
– No loose .cs files at project root
– Fails if: any docs/*.md not prefixed with an area number

How it affects application health: Unorganized documentation is effectively undiscoverable. Engineers create duplicate docs, miss existing decisions, and produce conflicting guidance. Loose .cs files at project root indicate code that bypassed the architecture and belongs to no project.

How to resolve: Rename each docs/*.md to include its area prefix (1.x-, 2.x-, etc.). Move loose .cs files into the appropriate project under src/. Update any cross-references.


XC5 (G9) — Logging and exception handling
Nine sub-checks, all required:
– IApplicationLogger registered as a Singleton
– LoggingBehavior registered in the MediatR pipeline
– GlobalExceptionHandler injects IApplicationLogger
– UseSerilogRequestLogging wired in Program.cs
– CorrelationId middleware present
– No ILogger<T> in controllers (controllers use IApplicationLogger)
– EmailEnabled defaults to false in config
– GlobalExceptionHandler is the sole Fatal() caller (prevents double-email)
– RFC 7807 application/problem+json content-type on error responses

How it affects application health: An exception handler that swallows the exception logs nothing and returns 500 — when the 3am alert fires, there’s nothing in Kibana. The logging gate ensures every unhandled exception produces a structured log entry with a correlation ID before the response leaves the server.

How to resolve:
– Register IApplicationLogger as a Singleton and inject it wherever ILogger<T> currently appears in controllers
– Add LoggingBehavior to the MediatR pipeline: services.AddTransient(typeof(IPipelineBehavior<,>), typeof(LoggingBehavior<,>))
– Add correlation ID middleware before UseRouting() in Program.cs
– Set all error responses to Content-Type: application/problem+json via the global exception handler
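Middleware ordering is one of the mechanically checkable pieces here: the correlation ID must be assigned before routing so every log line carries it. A sketch of an order check — the sample Program.cs and middleware name are hypothetical:

```shell
# Sketch: assert correlation-ID middleware is wired before UseRouting.
# The sample Program.cs content is invented for illustration.
cat > /tmp/Program.cs <<'CS'
app.UseMiddleware<CorrelationIdMiddleware>();
app.UseSerilogRequestLogging();
app.UseRouting();
CS
corr=$(grep -n "CorrelationIdMiddleware" /tmp/Program.cs | cut -d: -f1)
route=$(grep -n "UseRouting" /tmp/Program.cs | cut -d: -f1)
if [ "$corr" -lt "$route" ]; then
  echo "XC5 OK: correlation ID assigned before routing"
fi
```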


XC6 (G10) — AI-operation safety

🚨 Critical — hard-fail in all modes

XC6 has 13 sub-checks mapped 1:1 to the OWASP Top 10 for LLM Applications (2025). Every sub-check must pass. A failing XC6 is an AI safety gap, not a code style issue.

Thirteen sub-checks, each mapped to an OWASP LLM category [2]:
G10-1 AI database connection isolated (LLM06 — Excessive Agency)
G10-2 AI account least-privilege grants — SELECT/INSERT/UPDATE only, no DDL (LLM06)
G10-3 AI writes inside explicit transactions (LLM05 — Improper Output Handling)
G10-4 Destructive-action denial in code — no DELETE/DROP reachable from AI surface (LLM06)
G10-5 Prompt-injection input separation — user content never concatenated raw into system prompt (LLM01 — Prompt Injection)
G10-6 Sensitive-data redaction before prompt — IPiiRedactor called on any input to LLM (LLM02 — Sensitive Information Disclosure)
G10-7 Supply-chain attestation for model providers — every LLM provider listed with SOC 2 / ISO 27001 date (LLM03 — Supply Chain)
G10-8 Training-data poisoning controls for RAG — source provenance validated (LLM04 — Data Poisoning)
G10-9 Output validation before downstream use — IOutputValidator on every LLM response (LLM05)
G10-10 System-prompt secrets isolation — no credentials or role-grants in system prompts (LLM07 — System Prompt Leakage)
G10-11 Vector store access control — tenant-scoped queries (LLM08 — Vector Weaknesses)
G10-12 Misinformation guardrails on high-stakes outputs (LLM09 — Misinformation)
G10-13 Unbounded consumption controls — per-tenant rate limits and token caps on LLM endpoints (LLM10 — Unbounded Consumption)

LLM TRUST BOUNDARY — what XC6 enforces at each layer

👤 User input → 🛡️ PII redactor (G10-6) → 📋 Prompt separator (G10-5) → 🤖 LLM API, isolated account (G10-1/2) → ✔️ Output validator (G10-9) → ⚙️ Application

How it affects application health: An AI coding assistant generates coherent SQL. With LLM06 unchecked, that AI runs under the same database account as the application — which may include DELETE, DROP TABLE, and TRUNCATE. A DROP TABLE issued by an AI with excessive agency is not recoverable with a hotfix.

How to resolve:
– Create a dedicated database user for AI operations: GRANT SELECT, INSERT, UPDATE ON SCHEMA::dbo TO ai_user (no DDL, no DELETE)
– Wrap all AI-initiated writes in explicit transactions with a short timeout
– Add a prompt-injection filter that separates user-supplied content from the system prompt before the LLM call
– Call your PII redactor on all inputs before sending to the LLM endpoint


XC7 (G12) — Secret hygiene
– No connection strings, API keys, JWT signing secrets, or PEM blocks in committed config files
– All sensitive values reference Key Vault, user-secrets, or environment variables
– Fails if: any secret-shaped value in appsettings*.json; any PEM block or bearer token in git history

How it affects application health: A JWT signing key committed to appsettings.Production.json is in git history forever — rotating it requires a deploy. A secret leaked to a public repository is typically scraped and used within hours.

How to resolve:
– Move all secrets to Key Vault, user-secrets (dotnet user-secrets set), or environment variables; remove them from appsettings*.json
– For secrets already committed: rotate them immediately, then scrub the history with git filter-repo --path appsettings.Production.json --invert-paths
– Run gitleaks detect --source . --no-git against the full tree to find secrets in non-config files (scripts, test fixtures, docs)
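A crude shape-based scan catches the most common config-file leaks even before gitleaks runs. A sketch — the key-name list and sample appsettings file are assumptions, not the gate's actual pattern set:

```shell
# Sketch: flag secret-shaped values in config files (XC7).
# Key-name list is illustrative; the sample appsettings is fabricated.
mkdir -p /tmp/cfg
cat > /tmp/cfg/appsettings.Production.json <<'JSON'
{ "Jwt": { "SigningKey": "super-secret-value-123" } }
JSON
grep -rniE '"(password|secret|signingkey|apikey|connectionstring)"[[:space:]]*:[[:space:]]*"[^"]+"' /tmp/cfg/ \
  && echo "XC7 FAIL: secret-shaped value committed to config"
```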


XC8 (G13) — Dependency security
– dotnet list package --vulnerable --include-transitive returns zero
– npm audit --omit=dev returns zero high-or-critical findings
– License whitelist enforced (no AGPL-3.0 in commercial-distribution projects)
– Dependabot configured

How it affects application health: A dependency with a known CVE is a documented attack path. Exploits are typically published within days of the advisory — attackers automate scans for vulnerable package versions and the tooling is freely available. A transitive dependency carries the same risk as a direct one; you may not have written the import, but it runs in your process. Ignoring a Dependabot PR is an active choice to accept a known vulnerability.

How to resolve:
– Run dotnet list package --vulnerable --include-transitive; update each flagged package. For breaking changes, pin to the last safe minor version while you assess the upgrade effort.
– Run npm audit fix for auto-fixable findings; for manual fixes, npm audit shows the recommended upgrade path.
– For AGPL-licensed packages in a commercial project: replace with MIT or Apache-2.0 equivalents before distribution.
– Enable Dependabot: add .github/dependabot.yml with open-pull-requests-limit: 10 and set auto-merge for patch-level security updates.
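A starting-point config for that last step might look like the following. The ecosystems, directories, and daily schedule are assumptions to adapt to your repo layout:

```shell
# Sketch: write a minimal Dependabot config (XC8).
# Ecosystems/paths below are illustrative, not prescriptive.
mkdir -p /tmp/.github
cat > /tmp/.github/dependabot.yml <<'YML'
version: 2
updates:
  - package-ecosystem: "nuget"
    directory: "/"
    schedule:
      interval: "daily"
    open-pull-requests-limit: 10
  - package-ecosystem: "npm"
    directory: "/src/frontend"
    schedule:
      interval: "daily"
YML
echo "dependabot config written"
```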


XC9 (G15) — Multi-tenant isolation
– Every query against a tenant-scoped table includes a TenantId filter
– Cache keys include tenant ID
– Background job dequeue verifies tenant context before acting
– Verified via Roslyn analyzer + grep on cache patterns
– Fails if: any DbSet<T> access on a tenant-tagged entity without a Where(e => e.TenantId == ...) clause

❌ Broken — no TenantId filter

SELECT * FROM Users
-- TenantId filter missing
-- Returns ALL tenants' data

Tenant A query → sees Tenant B’s records

✅ Correct — scoped filter

SELECT * FROM Users
WHERE TenantId = @tenantId
-- Returns only caller's data

Tenant A query → sees only Tenant A’s records

How it affects application health: Cross-tenant data exposure passes every unit test — the mock doesn’t enforce the filter. You find out when a real user in one tenant can see another tenant’s records, which in a regulated industry is a breach notification. For a deeper look at the three isolation architectures and their trade-offs, see EF Core 10 multi-tenancy: schema-per-tenant, discriminator, and database-per-tenant.

How to resolve:
– Add a Roslyn analyzer rule that flags any DbSet<T> access on a tenant-tagged entity without a .Where(e => e.TenantId == _tenantContext.TenantId) clause.
– For cache keys: replace bare $"user:{id}" keys with $"tenant:{tenantId}:user:{id}".
– For background jobs: add a tenant-context assertion at the top of the job’s Execute() method — fail the job immediately if tenant context is unset rather than silently processing in the wrong scope.
– Write an integration test that runs a query under Tenant A’s credentials and asserts it returns zero rows from Tenant B’s data.
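The cache-key half of the gate is greppable. A sketch of the scan — the interpolated-string convention and the sample file are assumptions about how the codebase builds keys:

```shell
# Sketch: flag cache keys missing a tenant prefix (XC9).
# Assumes keys are built with C# interpolated strings; sample file is fabricated.
mkdir -p /tmp/xc9
cat > /tmp/xc9/CacheService.cs <<'CS'
var ok  = $"tenant:{tenantId}:user:{id}";
var bad = $"user:{id}";
CS
grep -rn '\$"user:' /tmp/xc9 --include="*.cs" \
  && echo "XC9 FAIL: cache key not scoped by tenant"
```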


XC10 (G25) — Capabilities documentation current
– docs/6.1-CAPABILITIES.md exists, conforms to v3.2, and reflects current task state
– Delivered/Partial/Planned counts match implementation plan within ±10% tolerance
– File mtime is no more than 7 days behind the latest implementation plan change

How it affects application health: A capabilities document that reflects last quarter’s feature set actively misleads — new developers plan against capabilities that no longer exist, and support engineers misdiagnose incidents against a system that doesn’t match the code. When a customer escalation arrives, the capabilities doc is the first artifact operations opens.

How to resolve:
– Open docs/6.1-CAPABILITIES.md and align every Delivered/Partial/Planned entry against the current implementation plan.
– Add a <!-- updated: YYYY-MM-DD --> comment at the top and update it whenever the plan changes significantly.
– Add a CI freshness check: compare the mtime of CAPABILITIES.md against the mtime of the latest PLAN.md; fail if the gap exceeds 7 days.
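That freshness check can be a few lines of shell. A sketch assuming GNU coreutils (touch -d, stat -c); the timestamps below are fabricated to show the failure path:

```shell
# Sketch: fail when CAPABILITIES.md lags the latest plan by more than 7 days.
# Timestamps are fabricated with touch; assumes GNU coreutils.
mkdir -p /tmp/docs
touch -d "20 days ago" /tmp/docs/6.1-CAPABILITIES.md
touch -d "2 days ago"  /tmp/docs/PLAN.md
cap=$(stat -c %Y /tmp/docs/6.1-CAPABILITIES.md)
plan=$(stat -c %Y /tmp/docs/PLAN.md)
gap_days=$(( (plan - cap) / 86400 ))
[ "$gap_days" -gt 7 ] && echo "XC10 FAIL: capabilities doc is $gap_days days behind the plan"
```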


XC11 (G26) — File naming canonical
– Every docs/*.md matches the regex ^[1-6]\.[0-9]-[A-Z][A-Z0-9_]*\.md$
– Uses the canonical filename for its type (e.g. 1.0-RESEARCH.md, not RESEARCH.md)

How it affects application health: Non-canonical filenames break every tool that indexes the docs directory by convention — search, cross-reference generators, and automated gate scripts. A file named research.md instead of 1.0-RESEARCH.md is invisible to any tooling that expects the numbered area prefix and becomes undiscoverable when the team scales.

How to resolve:
– Rename each non-conforming file: git mv docs/research.md docs/1.0-RESEARCH.md.
– Update all cross-references: grep -rn "research.md" docs/ --include="*.md".
– Add a CI check: find docs/ -name "*.md" | grep -cvE "^docs/[1-6]\.[0-9]-[A-Z]" — must return 0. (Note: find docs/ emits paths without a leading ./, so the pattern must not anchor on \./.)


XC12 (G27) — Section schemas conform
– Every docs/*.md matching a canonical filename has all required sections from its type schema
– Position constraints honored (e.g. References is last in narrative types)

How it affects application health: A PLAN.md without a ## Requirements section means the plan is disconnected from the requirements it claims to satisfy. Reviewers cannot verify coverage without manually cross-referencing. A DEPLOYMENT_GUIDE.md without a ## Rollback section fails OP3 as a side effect and guarantees a gap in incident response.

How to resolve:
– For each failing doc, open its canonical type schema and add the missing sections — a _pending_ placeholder is acceptable; absent is not.
– Validate with the section conformance checker after adding sections.
– Add the schema check to your pre-commit hook so violations are caught before commit, not at gate time.
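A minimal pre-commit version of the section check can be a loop over required headings. The section list and sample document below are hypothetical — the real schema per doc type lives in the gate spec:

```shell
# Sketch: verify a doc carries its schema's required sections (XC12).
# Required-section list and sample file are illustrative only.
cat > /tmp/2.0-DEPLOYMENT_GUIDE.md <<'MD'
## Overview
## Rollback
MD
for section in "## Overview" "## Rollback" "## References"; do
  grep -q "^$section" /tmp/2.0-DEPLOYMENT_GUIDE.md \
    || echo "XC12 FAIL: missing section '$section'"
done
```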


XC15 — Web security audit

🚨 Critical — hard-fail in all modes

XC15 covers the OWASP Top 10 web vulnerability classes [3] plus CSP and deep secret detection across the entire working tree. A single failing sub-check blocks the build.

What it checks:
A01 Broken Access Control — all non-public endpoints carry [Authorize]; resource-level ownership checks present; no IDOR patterns
A02 Cryptographic Failures — HTTPS enforced; TLS 1.2+ minimum; no weak ciphers (RC4, DES, MD5); secrets use AES-256 or RSA-2048+
A03 Injection — all database access uses parameterized queries or ORM; no raw string concatenation in SQL/LDAP/XPath paths
A04 Insecure Design — rate limiting and input size caps on all public-facing entry points
A05 Security Misconfiguration — debug mode disabled; default credentials removed; Content-Security-Policy header present and policy-compliant; CORS locked to explicit origins
A06 Vulnerable Components — no known CVEs in .NET or npm dependency tree (complements XC8 with a narrative audit of licensing and transitive risk)
A07 Identification and Authentication Failures — account lockout after N failed attempts; session tokens invalidated on logout; no credentials in URLs
A08 Software and Data Integrity Failures — signed artifacts; CI pipeline steps verify checksums before deploy
A09 Security Logging and Monitoring Failures — auth failures, privilege escalations, and input validation rejections all produce structured log events
A10 Server-Side Request Forgery (SSRF) — outbound HTTP calls validate target against an explicit allowlist; no user-controlled URLs passed to HttpClient
Deep secret scan — gitleaks or trufflehog run across the full working tree and git history (not just config files — scripts, docs, test fixtures included)

How it can be exploited: A missing [Authorize] attribute exposes admin endpoints to unauthenticated callers. An SSRF vector lets an attacker reach internal metadata endpoints (e.g., Azure IMDS, AWS IMDSv1) and exfiltrate cloud credentials. A plaintext secret in a test fixture gets committed, scraped by a bot within hours of a public push, and used to authenticate to your production database.

How to resolve:
– Run grep -rn "AllowAnyOrigin\|AllowAny" --include="*.cs" — replace with explicit origin lists
– Run gitleaks detect --source . — rotate any exposed credentials immediately, then scrub with git filter-repo
– Add Content-Security-Policy via middleware: app.Use((ctx, next) => { ctx.Response.Headers["Content-Security-Policy"] = "default-src 'self'"; return next(); })
– For SSRF: wrap all outbound HttpClient calls in a helper that validates the host against an allowlist before dispatching


XC16 — Threat model & architecture audit
What it checks:
STRIDE threat model — a documented threat model in docs/ covers each STRIDE category: Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege; each threat has a listed mitigation
Architectural compliance — no SOLID violations detected: no god classes (>500 lines handling multiple concerns), no layering breaches (e.g., domain layer importing infrastructure), no circular project references
Coupling hotspots — no single class referenced by >15 other classes without an interface boundary; no monolithic manager/service classes handling unrelated domain concepts
Exception-handling depth — no silent exception swallows (catch (Exception) { } without logging); predictable business failures use Result<T> not throw; all catch blocks include meaningful context in the log message
Documentation formatting conformance — docs conform to the project’s visual formatting rules (consistent heading hierarchy, callout style, code block tagging)

How it can be exploited: An undocumented threat model means mitigations are ad hoc and inconsistent — the same threat class gets blocked in one feature and missed in the next. A god class that handles authentication, billing, and notification in one 900-line file is a single point of compromise: one vulnerability affects all three domains. A silent catch means an attacker’s injection attempt fails silently — no alert fires, no audit trail is written.

How to resolve:
– Create a STRIDE worksheet in docs/ — one row per threat, columns: threat, affected component, mitigation, verification test
– For god classes: apply SRP — extract single-purpose services and inject via constructor
– For silent swallows: catch (Exception ex) { _logger.LogError(ex, "Context: {operation}", opName); throw; } or wrap in Result<T>.Failure(ex)
– For layering breaches: use dotnet-architecture-tests or ArchUnit to enforce project reference direction in CI
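A cheap first pass at the god-class check can be scripted before reaching for a full architecture-test library. The sketch below flags oversized `.cs` files by line count, a crude proxy for the gate's >500-line rule (a real check would parse class boundaries); the 500-line threshold comes from the gate definition above:

```python
from pathlib import Path

GOD_CLASS_LINE_LIMIT = 500  # threshold from the XC16 gate definition

def find_god_classes(root: str, limit: int = GOD_CLASS_LINE_LIMIT):
    """Flag .cs files whose line count exceeds the limit, sorted largest
    first. A proxy check only: one file can hold several small classes."""
    offenders = []
    for path in Path(root).rglob("*.cs"):
        n = sum(1 for _ in path.open(encoding="utf-8", errors="ignore"))
        if n > limit:
            offenders.append((str(path), n))
    return sorted(offenders, key=lambda t: -t[1])
```

Running this in CI and failing on a non-empty result gives you a stop-gap until proper architecture tests are wired in.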


Operational Gates (OP1–OP8)

🚀 Operational area — the checks that protect you at 3am, not during development

OP1 (G17) — Health endpoints respond
– /health returns 200 within 30 seconds of container start
– /health/ready returns 200 within 60 seconds
– Both responses are valid JSON containing status: "Healthy"
– Fails if: container fails to start; either endpoint non-200 within timeout; body lacks status field

How it affects application health: A container that responds 200 on / but 503 on /health/ready is not ready. Without a readiness probe, Kubernetes sends traffic to it anyway. The first user request fails — and the failure is invisible because the container appears healthy from the outside.

How to resolve:
– Add ASP.NET Core health checks: builder.Services.AddHealthChecks().AddDbContextCheck<AppDbContext>() and map both /health and /health/ready.
– Run curl -fsS --max-time 30 http://localhost:{port}/health | jq -e '.status == "Healthy"' in CI immediately after container start to verify the gate condition locally.
– Configure your Kubernetes readinessProbe to poll /health/ready with initialDelaySeconds: 10 and failureThreshold: 3.
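The gate condition itself reduces to three checks on a single probe result: status code, elapsed time, and body shape. A minimal evaluator, sketched in Python so it can be dropped into any CI harness (the thresholds mirror the gate definition above):

```python
import json

def health_gate(status_code: int, body: str, elapsed_s: float,
                timeout_s: float = 30.0) -> bool:
    """Evaluate the OP1 gate condition for one probe result:
    HTTP 200 within the timeout, and a JSON body whose 'status' is 'Healthy'."""
    if status_code != 200 or elapsed_s > timeout_s:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return payload.get("status") == "Healthy"
```

Wire it to whatever HTTP client your pipeline already uses; the point is that "healthy" is a structured assertion on the body, not just a 200.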


OP2 (G18) — Rate limiting and DoS protection
– AddRateLimiter() registered in Program.cs with at least one named non-default policy
– At least one controller or endpoint group has [EnableRateLimiting("...")]
– Reverse-proxy or gateway config declares connection-rate caps
– Fails if: no rate limiter registered; rate limiter registered but no endpoint binds it

How it affects application health: An API with no rate limiting is a free DoS attack surface. A single misconfigured client can flood the service until thread pool saturation makes it unresponsive to all callers. For AI-backed endpoints, unbounded requests also mean unbounded token spend — one runaway integration can exhaust a monthly LLM budget in minutes with no alert firing.

How to resolve:
– Register a named policy: builder.Services.AddRateLimiter(o => o.AddFixedWindowLimiter("api", opts => { opts.Window = TimeSpan.FromMinutes(1); opts.PermitLimit = 100; opts.QueueLimit = 0; }));
– Apply the limiter: app.UseRateLimiter(); app.MapControllers().RequireRateLimiting("api");
– For AI endpoints: add a per-tenant token budget check before every LLM call and return 429 Too Many Requests when the budget is exhausted.
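To make the fixed-window policy concrete, here is a minimal sketch of the same semantics in Python: a permit limit per window per key (tenant, API key, or client IP — the key choice is yours). This mirrors the named policy registered above, not ASP.NET Core's actual implementation:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Minimal fixed-window limiter: permit_limit requests per window
    per key. A denied call should be translated into HTTP 429."""

    def __init__(self, permit_limit: int = 100, window_s: float = 60.0):
        self.permit_limit = permit_limit
        self.window_s = window_s
        self._counts = defaultdict(int)  # (key, window index) -> count

    def allow(self, key: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        window = (key, int(now // self.window_s))
        if self._counts[window] >= self.permit_limit:
            return False
        self._counts[window] += 1
        return True
```

Fixed windows are simple but allow up to 2× the limit at a window boundary; if that matters for your DoS posture, switch to a sliding-window or token-bucket policy.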


OP3 (G19) — Rollback runbook tested

ℹ️ The rehearsal rule

A rollback procedure that has never been executed is documentation, not a capability. The gate stamps the date of the last rehearsal drill. If it’s more than 90 days old, the gate fails.

  • docs/4.1-DEPLOYMENT_GUIDE.md contains a ## Rollback section with step-by-step procedure
  • state.last_rollback_drill_at is set to an ISO8601 date within the last 90 days
  • state.last_rollback_drill_evidence links to a drill execution log
  • Fails if: rollback section absent; drill stamp absent; stamp older than 90 days

How it affects application health: A rollback procedure executed for the first time during a live incident fails at a step that the author assumed was obvious — the step that involves a credential rotation, a manual DB script, or a Kubernetes command that’s subtly different from the local equivalent. The rehearsal drill surfaces those gaps in a low-stakes environment. The 90-day window forces a re-test whenever the deployment architecture changes significantly.

How to resolve:
– Schedule a quarterly rollback drill: choose a low-traffic window, execute the runbook exactly as written, and record every step that required improvisation.
– For each improvised step: update the runbook before the drill is marked complete.
– Update state.last_rollback_drill_at with the ISO8601 date and link state.last_rollback_drill_evidence to the drill log (a GitHub issue, a Confluence page, or a plain text file in the repo all qualify).
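The freshness half of this gate is mechanical and easy to automate. A sketch of the stamp check, assuming the `state.last_rollback_drill_at` value is read into a string before the call:

```python
from datetime import datetime, timedelta, timezone

MAX_DRILL_AGE_DAYS = 90  # the OP3 rehearsal window

def drill_is_fresh(last_drill_iso, now=None) -> bool:
    """Hard-fail check for OP3: the last rollback drill stamp must exist,
    parse as ISO8601, and be no older than 90 days."""
    if not last_drill_iso:
        return False
    try:
        stamp = datetime.fromisoformat(last_drill_iso)
    except ValueError:
        return False
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)  # assume UTC if unzoned
    now = now or datetime.now(timezone.utc)
    return (now - stamp) <= timedelta(days=MAX_DRILL_AGE_DAYS)
```

Note that an unparseable or absent stamp fails closed, matching the gate's hard-fail semantics.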


OP4 (G20) — SLO/SLI codified
An SLO (Service Level Objective) is a quantitative reliability target — for example, “99.9% of requests complete within 500ms.” An SLI (Service Level Indicator) is the measured metric that determines whether you’re meeting the SLO — the actual p95 latency measured over a rolling 30-day window. The gap between your SLO and 100% uptime is your error budget: the amount of “bad” behavior the service is allowed before the SLO is breached. Error budgets give teams an objective answer to “can we ship?” — budget intact means continue; budget exhausted means freeze features and fix reliability.

SLO / SLI / ERROR BUDGET — the three concepts OP4 requires you to codify

  • SLO — the target: 99.9% of requests < 500ms p95 — what you commit to
  • SLI — the measurement: actual p95 over a rolling 30 days — what you observe
  • Error Budget — the allowance: 100% − 99.9% = 0.1% bad requests — budget gone → feature freeze

  • docs/internal/slo.yaml exists and parses as valid YAML
  • Declares ≥3 SLOs covering availability, latency (p95), and error rate
  • Each SLO has name, target, window, error_budget_remaining_threshold
  • Fails if: fewer than 3 SLOs; required fields missing; implausible targets (availability > 99.99% or latency p95 < 1ms)

How it affects application health: Without codified SLOs, every production incident becomes a negotiation (“is this within acceptable range?”) rather than a mechanical check. When SLOs are measured and error budgets tracked, the team has an objective answer. The OP7 gate’s 30-day freshness requirement ensures the SLOs are not aspirational — they are verified against actual traffic data.

How to resolve:
– Create docs/internal/slo.yaml with at least three SLOs: availability: target: 99.9%, latency_p95: target: 500ms, error_rate: target: 0.1%.
– For each SLO, specify window: 30d and error_budget_remaining_threshold: 20%.
– Wire the SLOs to your monitoring platform: create Azure Monitor, Datadog, or Prometheus alert rules that fire when the error budget drops below the threshold.
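The gate's structural checks on slo.yaml can be expressed as a short validator. The sketch below assumes the YAML has already been loaded into a list of dicts (e.g., via a YAML parser in your CI step); field names and the plausibility bounds come from the gate definition above:

```python
REQUIRED_FIELDS = {"name", "target", "window", "error_budget_remaining_threshold"}

def validate_slos(slos) -> list:
    """OP4's mechanical checks on an already-parsed slo.yaml document:
    at least three SLOs, all required fields present, plausible targets.
    Returns a list of failure messages; an empty list means the gate passes."""
    failures = []
    if len(slos) < 3:
        failures.append(f"need >=3 SLOs, found {len(slos)}")
    for slo in slos:
        missing = REQUIRED_FIELDS - slo.keys()
        if missing:
            failures.append(f"{slo.get('name', '?')}: missing {sorted(missing)}")
            continue
        if slo["name"] == "availability" and slo["target"] > 99.99:
            failures.append("availability target > 99.99% is implausible")
        if slo["name"] == "latency_p95" and slo["target"] < 1:
            failures.append("p95 latency target < 1ms is implausible")
    return failures
```

Failing the build on a non-empty return value gives the hard-fail behavior the gate requires.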


OP5 (G21) — Cost-budget alert configured
– At least one cost budget declared in infrastructure config (Bicep, Terraform, or GitHub Actions)
– Monthly cap with alert threshold ≤80% and notification target
– Fails if: no budget file; budget without alert threshold; alert threshold > 100%

How it affects application health: A cloud service with no cost alert is a billing surprise waiting to happen. A misconfigured auto-scaler, a background job polling an API in a tight loop, or an AI endpoint without a token cap can push monthly spend 10× over budget before anyone sees an invoice. The alert must fire at 80% so the team has time to investigate before the cap is breached — an alert at 100% fires after the damage is done.

How to resolve:
– In Bicep: declare a Microsoft.Consumption/budgets resource with amount, timeGrain: 'Monthly', and a notification entry at threshold: 80 targeting your ops email or Slack webhook.
– In Terraform: use azurerm_consumption_budget_subscription with the same threshold and notification block.
– Verify the alert fires: temporarily lower the budget below current spend in a staging environment and confirm the notification arrives within one billing cycle check interval.
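The 80%-not-100% rule reduces to one comparison, which is worth stating precisely because it is the part teams get backwards. A sketch of the alert condition the infrastructure config should encode:

```python
def budget_alert_fires(spend: float, monthly_cap: float,
                       threshold_pct: float = 80.0) -> bool:
    """OP5's intent in one line: the alert fires once spend crosses
    threshold_pct of the monthly cap, before the cap itself is breached."""
    return spend >= monthly_cap * (threshold_pct / 100.0)
```

An alert configured at 100% of the cap satisfies the letter of "has an alert" but fails this gate, because it leaves no time to investigate before the damage is done.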


OP6 (G22) — Audit log retention configured
– Retention ≥ compliance floor (default 90 days; SOC2/HIPAA projects raise it to 1 year or 6 years)
– Audit logs tagged for long-retention sink separately from normal app logs
– Tamper-evident sink configured (append-only blob with immutability policy, or signed log stream)
– Fails if: retention < compliance floor; audit logs commingled with application logs

How it affects application health: Audit logs stored alongside application logs are at risk of being purged by the shorter application-log retention policy — typically 30 days for app logs versus 90 days or more required by compliance frameworks. In a SOC 2 audit or a breach investigation, a gap in log coverage for a specific date is the difference between a finding you can respond to and a finding you cannot. Logs that are mutable or deletable fail the tamper-evidence requirement even if the retention window is correct.

How to resolve:
– Tag all security events with a log_type: audit field and route them to a dedicated sink (separate Azure Blob container, separate Elasticsearch index, or a WORM-compliant storage tier).
– Set an immutability policy on the audit sink: in Azure Blob, enable time-based retention with allowProtectedAppendWrites: true and lock the policy.
– Verify the retention floor: 90 days for default projects; 1 year for SOC 2; 6 years for HIPAA. Set the sink’s lifecycle rule to match, not just the application-level configuration.
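The retention-floor comparison is a one-liner once the compliance framework is known; the floors below are the ones the gate names (90 days default, 1 year SOC 2, 6 years HIPAA):

```python
COMPLIANCE_FLOORS_DAYS = {
    "default": 90,
    "soc2": 365,
    "hipaa": 6 * 365,
}

def retention_meets_floor(configured_days: int, framework: str = "default") -> bool:
    """OP6's retention check: the audit sink's lifecycle rule must meet or
    exceed the compliance floor for the project's framework."""
    floor = COMPLIANCE_FLOORS_DAYS.get(framework.lower())
    if floor is None:
        raise ValueError(f"unknown compliance framework: {framework}")
    return configured_days >= floor
```

The important subtlety, per the gate text, is that `configured_days` must be read from the sink's lifecycle rule, not from the application's logging configuration.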


OP7 (G24) — Performance evidence current
– state.last_perf_run_at is within the last 30 days
– Referenced performance run artifact parses correctly
– Every SLO in slo.yaml has a corresponding entry in the run artifact with met: true
– Fails if: stamp absent; stamp older than 30 days; any SLO unmet

How it affects application health: SLOs declared without measurement are aspirational marketing, not operational commitments. A team that last ran a performance test three months ago is committing to a 99.9% availability SLO based on traffic patterns that may no longer apply. The 30-day freshness requirement means the performance evidence must post-date the last significant architecture change — a new cache layer, a schema migration, or a dependency upgrade can all shift the p95 latency enough to breach the SLO.

How to resolve:
– Run your load test suite (k6, Gatling, Azure Load Testing) against the staging environment and commit the artifact to the repo or link it in state.last_perf_run_evidence.
– Update state.last_perf_run_at with the ISO8601 timestamp of the run completion.
– For each SLO in slo.yaml, verify the run artifact contains a matching entry with met: true; if any SLO is unmet, treat it as a blocking finding before the next release.
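Both halves of the gate — stamp freshness and per-SLO coverage — fit in one function. A sketch, assuming the run artifact has been reduced to a mapping of SLO name to its `met` flag before the call:

```python
from datetime import datetime, timedelta, timezone

def perf_evidence_passes(last_run_iso, slo_names, results,
                         now=None, max_age_days: int = 30) -> bool:
    """OP7 in miniature: the run stamp must be within 30 days, and every
    declared SLO must appear in the run artifact with met == True."""
    try:
        stamp = datetime.fromisoformat(last_run_iso)
    except (TypeError, ValueError):
        return False  # absent or unparseable stamp fails closed
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    if now - stamp > timedelta(days=max_age_days):
        return False
    # Every SLO in slo.yaml needs a matching artifact entry with met: true.
    return all(results.get(name) is True for name in slo_names)
```

A missing artifact entry fails the gate the same way an unmet SLO does, so a run that silently skipped one SLO cannot pass.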


OP8 — Test & ship readiness
What it checks:
Test coverage — dotnet test --collect:"XPlat Code Coverage" generates a coverage report; overall line coverage meets the configured threshold (default ≥80%); dotnet test and npm run test are both green
Assembly metadata — all .csproj files declare AssemblyVersion, FileVersion, Company, and Copyright; version follows semver and has been bumped from the previous release
Packaging config — NuGet packages have PackageId, Authors, License, PackageReadmeFile; npm packages have name, version, license, repository
CI/CD pipeline validity — pipeline YAML passes yamllint and references the current branch/tag strategy; no hardcoded secrets in pipeline definitions
Deployment-target reachability — staging and production base URLs respond to a lightweight HEAD request from the CI runner (read-only connectivity check — no deploy triggered)

How it can be exploited: Shipping with a broken CI pipeline means the next developer to merge triggers a deployment to the wrong environment or skips security scanning entirely. An assembly without a version number makes incident forensics impossible — you can’t tell from a crash dump which build was running. A coverage gap in payment or authentication code is often where the exploitable bug hides.

How to resolve:
– Add coverage collection: dotnet test --collect:"XPlat Code Coverage" -- DataCollectionRunSettings.DataCollectors.DataCollector.Configuration.Format=cobertura
– Add a <PropertyGroup> to each .csproj with <Version>, <Company>, <Copyright>
– Validate pipeline YAML in CI: yamllint .github/workflows/*.yml
– Add a deployment-target ping step: curl -fsS --max-time 10 -o /dev/null -w "%{http_code}" $STAGING_URL
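The coverage threshold check can be enforced by parsing the Cobertura report the collector emits. A sketch that reads the root element's `line-rate` attribute (Cobertura's top-level `<coverage>` element) and compares it to the gate's default 80% floor:

```python
import xml.etree.ElementTree as ET

def coverage_gate(cobertura_xml: str, threshold: float = 0.80) -> bool:
    """Parse a Cobertura report's root line-rate (a fraction between 0 and 1)
    and compare it to the gate threshold: default >=80% line coverage."""
    root = ET.fromstring(cobertura_xml)
    line_rate = float(root.attrib["line-rate"])
    return line_rate >= threshold
```

In CI this reads the generated coverage.cobertura.xml and fails the build on False; per-module or per-namespace floors can be added by walking the `<package>` children the same way.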


Conclusion

The 27-point contract is not bureaucracy. It is the answer to a specific failure mode: AI tools that write production-quality code without production-quality judgment. The gap between “it compiles” and “it ships safely” has always existed. AI-assisted development made the gap wider and faster to fall into.

Every gate in this system traces back to a production incident or a near-miss. DB3 is a 40-minute table lock. XC9 is a cross-tenant data exposure. XC15 is a wildcard CORS policy that makes every authenticated endpoint reachable from any origin. OP3 is an engineer executing a rollback procedure for the first time during an outage. The gate that blocks your build is doing the cheapest incident response you will ever get.

Run this gate suite against your project. See which gates pass and which fail. Treat every failure as your sprint backlog, not a code review comment.


Key Takeaways

  • 27 gates, all hard-fail — one failure blocks deployment regardless of overall score
  • 5 areas — Database (2), Backend (3), Frontend (2), Cross-Cutting (12), Operational (8)
  • Every security, AI safety, web security, and operational gate is non-negotiable — no bypass, no exceptions
  • The 27-point contract — when all gates pass, “is it done?” is a binary answer, not a negotiation

References

[1] Arsalan Shahid. Quality Gates — 27-gate production readiness specification (internal). 2026-05-05.

[2] OWASP Foundation. OWASP Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/. 2025.

[3] OWASP Foundation. OWASP Top 10 Web Application Security Risks. https://owasp.org/Top10/. 2021.

Prior versions superseded by this post:
Ship Production-Ready Code with 5 Automated Quality Gates — quality-gates-production-ready-code.md
29 Quality Gates That Stand Between Your Code and Production — 29-quality-gates-production-readiness-reference.md
