The AI Maturity Framework
Levi Garner
Founder & CTO, InteliG
Every engineering team is using AI.
The variance isn’t whether. It’s how well.
I’ve spent the last two months scoring teams across five pillars. Same patterns keep repeating, in companies of every size, in every vertical I’ve seen. None of it is incompetence. It’s the same reason most organizations couldn’t explain their cloud spend in 2014: the tooling outpaced the discipline.
So I’m publishing the framework I use. Free, CC-BY 4.0, copy it, republish it, score yourself with it. Tell me where it’s wrong.
Why a framework
A senior engineer who thinks she’s using Cursor well but is actually re-typing prompts daily that should be in a shared CLAUDE.md. A team that turned on Copilot 18 months ago and has never measured whether it changed anything. A startup that built three internal “agents” before standardizing on any agent loop or memory strategy, and now has three half-broken systems no one trusts. A CTO who bought a $50k/year AI dev-tools contract because the demo was great and can’t tell you if a single engineer actually adopted it.
This framework is the discipline.
Five pillars of AI engineering maturity, each scored 0–4. A team scoring 2 across all five is healthy. A team stuck at 0–1 on pillars 3 and 4 is hemorrhaging leverage every day and doesn’t know it. A team at 4 across the board is compounding — every week of AI use makes the next week better, automatically.
You can use this in an hour. You can set hiring criteria from it. You can compare across a portfolio of companies. And you can use it to tell, finally, whether your AI spend is producing leverage.
The Five Pillars
- Augmented Engineers — individual amplification
- Agentic Loops — autonomous execution
- Standards as Code — what stops AI from drifting
- Memory & Context — what compounds
- MCP & Tool Use — what makes agents real
Let’s walk through each one. For each pillar: what it is · what good looks like · what we see in the wild · the failure modes · the 0–4 ladder · the quick wins.
Pillar 1 — Augmented Engineers
What it is. How effectively your individual engineers are amplified by AI tools. Less about which tool and more about how it’s used.
What good looks like. A good engineer using AI today is a small team. They write code at a different cadence than they did two years ago — not 2x faster on the same code, but 5–10x more code attempted, more spikes explored, more cleanups done, more docs written. They review AI output critically because they’ve internalized its failure patterns. They’ve built up a personal library of prompts and templates they reuse across projects.
What we see in the wild. Wildly uneven. On the same team:
- One engineer treating Cursor like autocomplete-on-steroids — accepting Tab completions, no real prompting discipline. 1.2x leverage at best.
- Another running tight Claude Code loops with custom slash commands, AGENTS.md, and chained workflows that ship a whole feature in 30 minutes. 6–10x leverage on the right kinds of work.
- A third who says they use AI but actually uses it as a glorified search engine for Stack Overflow.
Most teams have never measured this gap. The 10x engineer existed in 1972; she now exists across every team with AI access, but management has no way to see her.
Failure modes.
- Tool-of-the-month syndrome. Last quarter Cursor, this quarter Claude Code, next quarter Codex CLI. Each switch resets the team’s muscle memory.
- Slop tolerance. AI generates plausible-looking code; reviewers approve it without reading; bugs ship.
- Prompt secrecy. The 10x engineer doesn’t share her prompts because they feel personal. The team doesn’t compound.
- Demo-to-daily gap. The team adopted a tool because the demo was great. Nobody uses it Tuesday.
The 0–4 ladder.
- 0 — None. Engineers either don’t use AI tools or don’t admit to it. No tooling budget. No consistent IDE.
- 1 — Ad-hoc. Tools available; usage is individual and unmeasured. Big gap between top and bottom engineer.
- 2 — Consistent. Team agrees on a tool stack. Most engineers use it daily. Informal prompt-sharing happens.
- 3 — Instrumented. Team measures AI-assisted commit ratios, time-to-ship, and review hygiene. Prompt libraries shared in the repo. New hires onboard onto the AI stack day one.
- 4 — Compounding. AI is woven into the engineering culture. Prompt-engineering is a respected craft. The team’s velocity curve continues to bend upward as the tools improve.
Quick wins.
- One shared prompts/ directory in the main repo, populated by every engineer who finds a good prompt.
- Weekly 30-min “AI workflow” demo session — different engineer each week shows their setup.
- Track AI-assisted commits (one boolean in git trailer or commit metadata). You’ll be shocked what the distribution looks like.
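A minimal sketch of the commit-tracking quick win, assuming a hypothetical `AI-Assisted: true` trailer (the trailer name is a placeholder, not an established convention) and the `git log` format string shown in the comment. The function name is illustrative.

```python
from collections import Counter

# Hypothetical trailer name; the text above only asks for "one boolean in git trailer".
TRAILER = "AI-Assisted"

# Assumed way to produce the input: one record per commit, author and trailer
# value separated by NUL, records separated by the ASCII record separator 0x1e:
#   git log --format='%an%x00%(trailers:key=AI-Assisted,valueonly)%x1e'


def ai_share_by_author(log_text: str) -> dict[str, float]:
    """Return the fraction of each author's commits flagged as AI-assisted."""
    total = Counter()
    assisted = Counter()
    for record in log_text.split("\x1e"):
        record = record.strip()
        if not record:
            continue
        author, _, trailer_value = record.partition("\x00")
        total[author] += 1
        if trailer_value.strip().lower() == "true":
            assisted[author] += 1
    return {author: assisted[author] / total[author] for author in total}
```

Sorting the result by value gives the distribution across the team. Commits without the trailer count as not AI-assisted, so the numbers are only as good as the team's habit of adding it.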
Pillar 2 — Agentic Loops
What it is. Whether and how the team uses autonomous execution — agents that don’t just suggest code, but run loops: read files, call tools, write code, run tests, iterate. The “YOLO mode” question.
What good looks like. The team has thought explicitly about when to gate and when to let an agent run. They’ve drawn a line — reversible/in-boundary actions auto-execute, destructive/external actions require explicit approval. They have a checkpoint pattern (test runs, branch isolation, sandboxed environments) so loops can fail cheaply. They distinguish between assistance loops (engineer-in-the-driver’s-seat) and autonomous loops (agent executes a multi-step task while engineer reviews the result).
What we see in the wild. Most teams haven’t thought about this distinction. Either they don’t run autonomous loops at all (too scary, too unproven) and stay stuck at Pillar 1’s ceiling, or they run them ad hoc with no gating model, someone’s agent deletes a critical file once, and the team retreats to autocomplete forever.
The teams getting real leverage have a gating axis — usually reversibility + boundary. Reversible in-repo file edits? Auto. Sending an external email? Confirm. Mass-deleting? Confirm. That axis is rarely written down, but it shows up clearly in the agents’ tool definitions.
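The reversibility + boundary axis can be written down as code. This is a sketch under stated assumptions: the `Action` fields and the `gate` function are illustrative names, and real policies usually need more than two booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO = "auto"
    CONFIRM = "confirm"


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool   # can the effect be undone cheaply (e.g. a git revert)?
    in_boundary: bool  # does the effect stay inside the repo or sandbox?


def gate(action: Action) -> Decision:
    """Auto-execute only when the action is both reversible and in-boundary."""
    if action.reversible and action.in_boundary:
        return Decision.AUTO
    return Decision.CONFIRM
```

With this rule, an in-repo file edit (reversible, in-boundary) runs automatically, while an external email or a mass delete falls through to confirmation.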
Failure modes.
- No gating policy. Every loop is treated identically. Paralysis or disaster.
- Over-gating. Every action requires confirmation. Agents slower than typing. No adoption.
- Tool sprawl inside loops. Agent has 50 tools; accuracy drops past the 30–50 tool cliff most LLMs hit.
- No checkpoints. Loops run for hours, fail, nobody can replay the failure.
- Agent identity confusion. No clear notion of which agent did what. Audit trail destroyed.
The 0–4 ladder.
- 0 — None. No autonomous loops. AI is a chat window.
- 1 — Experimental. One engineer running Claude Code or Codex CLI as a one-off, occasionally.
- 2 — Selective. Defined use cases where loops are allowed (e.g. test generation, refactor). Loop runs reviewed.
- 3 — Policy-driven. Explicit gating axis. Auto-approve safe actions; explicit approval for risky ones. Documented in AGENTS.md.
- 4 — Industrial. Multi-agent orchestration, checkpoint patterns, failure replay, sandboxed branches, audit trail. Agents work overnight; engineers review in the morning.
Quick wins.
- Write down your gating policy. One paragraph. Even if it’s wrong, write it down.
- Pick one loop use case (test generation is usually safest) and make it routine.
- Run loops in disposable git worktrees, not main checkouts.
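One way the disposable-worktree quick win might look, as a sketch. It shells out to the standard `git worktree add` and `git worktree remove` commands; the helper names and the `agent-` temp-directory prefix are assumptions for illustration.

```python
import subprocess
import tempfile
from contextlib import contextmanager
from pathlib import Path


def worktree_commands(path: str, branch: str) -> tuple[list[str], list[str]]:
    """Return the (create, remove) git commands for a throwaway worktree."""
    create = ["git", "worktree", "add", "-b", branch, path]
    remove = ["git", "worktree", "remove", "--force", path]
    return create, remove


@contextmanager
def disposable_worktree(repo: str, branch: str):
    """Create a worktree on a new branch, yield its path, then remove it."""
    # The target directory must not exist yet, so use a fresh subdirectory.
    path = str(Path(tempfile.mkdtemp(prefix="agent-")) / "wt")
    create, remove = worktree_commands(path, branch)
    subprocess.run(create, cwd=repo, check=True)
    try:
        yield path  # point the agent loop at this directory
    finally:
        subprocess.run(remove, cwd=repo, check=True)
```

Removing the worktree discards its files but leaves the branch in place, so anything the agent committed can still be reviewed. Uncommitted changes are lost, which is the point of making the checkout disposable.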
Pillar 3 — Standards as Code
What it is. The written, enforced rules that prevent AI from drifting your codebase into chaos. AGENTS.md, CLAUDE.md, architectural invariants, ArchUnit-style guards, linting rules the AI is held to.
This is the highest-leverage pillar most teams ignore.
What good looks like. The team has a single source of truth that tells the AI: here’s how we name things, here’s the bounded contexts, here’s what’s tested and how, here’s the architecture you must conform to. The AI reads this every session. Invariants that can be enforced by code (ArchUnit, linters, tests) are enforced by code — so the AI can’t violate them even if the prompt drifts.
A canonical example: my own InteliG Agent Playbook. The AGENTS.md at the workspace root, per-repo agent files, STANDARDS_DIGEST.md cited in every commit, ArchUnit invariants that fail the build if violated. The agent can’t drift because the standards are executable.
What we see in the wild. Most teams have no AGENTS.md. The ones that do treat it as a polite suggestion. The agent ignores it after the first few turns because there’s no consequence. Result: the codebase becomes a museum of every fashion the AI thinks is current. Naming conventions drift. Architectural patterns mix. The team blames “the AI” when really they enforced nothing.
Failure modes.
- No standards file. Worst case.
- Standards file the agent doesn’t read. Wishful thinking.
- Standards as paragraphs of prose. “Try to follow good practices.” Useless.
- Standards that contradict each other. Worse than no standards.
- Standards the team itself doesn’t follow. Worst case dressed in a suit.
The 0–4 ladder.
- 0 — None. No AGENTS.md, no CLAUDE.md, no invariants. Agent invents conventions.
- 1 — Aspirational. A file exists; nobody reviews it; the team doesn’t follow it either.
- 2 — Maintained. Standards file is current; agent reads it; team mostly follows it. Naming and architecture coherent.
- 3 — Enforced. Critical invariants enforced by linters / tests / ArchUnit. Agent’s PRs that violate them fail the build.
- 4 — Compounding. Standards versioned, indexed, traceable (rule IDs cited in commits and PRs). New rules added via the same PR flow as code. The standards corpus is itself a competitive moat.
Quick wins.
- If you have no AGENTS.md or CLAUDE.md, write a one-page draft tomorrow. Even a bad one beats none.
- Pick the one invariant you most fear the AI violating and enforce it in CI.
- Reference the standards file in every PR template.
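To make “enforce one invariant in CI” concrete, here is a sketch of an ArchUnit-style guard using only the standard library `ast` module. The rule itself (domain code must not import an `infrastructure` package) and the function names are hypothetical; swap in whichever invariant you fear most.

```python
import ast

# Hypothetical rule: files under domain/ must not import the "infrastructure" package.
FORBIDDEN_PREFIX = "infrastructure"


def forbidden_imports(source: str) -> list[str]:
    """Return the names of forbidden modules imported by the given source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        found += [n for n in names
                  if n == FORBIDDEN_PREFIX or n.startswith(FORBIDDEN_PREFIX + ".")]
    return found


def violations(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each offending file path to the forbidden modules it imports."""
    report = {path: forbidden_imports(text) for path, text in sources.items()}
    return {path: names for path, names in report.items() if names}
```

A CI step could read every file under the protected directory, call `violations`, and exit non-zero when the result is non-empty. The agent's PR then fails the build no matter what the prompt said.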
Pillar 4 — Memory & Context
What it is. How the team persists what AI learned — across sessions, projects, and people. The difference between “Claude knows my codebase this session” and “the org has compounding institutional memory accessible to every agent.”
What good looks like. When a new engineer (or a new agent) lands in your codebase, they ramp up on context that the previous engineer or agent left behind. There’s a persistent memory layer (sometimes a MEMORY.md, sometimes a structured knowledge graph, sometimes a vector store of past decisions) that captures why we chose this architecture, what we tried that didn’t work, what the team learned about this customer last quarter, and what the agent learned about this codebase yesterday.
The team treats memory the way good teams treat tests: a first-class artifact that’s expected to grow over time.
What we see in the wild. Almost nobody does this well. Most teams’ “memory” is whatever’s in the engineer’s head, plus whatever Claude/Cursor can reconstruct from the open files in the current session. Restart the laptop → start over. New engineer joins → re-explain the architecture in three meetings. Agent crashes → loses an hour of context.
The teams that do invest in memory are getting compounding returns the rest of the market can’t see. Their agents start tomorrow’s session with everything they learned today.
Failure modes.
- No memory layer. Pure stateless. Every session starts cold.
- Memory bloat. Every agent dumps everything; nothing is indexed; retrieval is useless.
- Memory rot. Notes get stale; team stops trusting them; nobody updates them; spiral.
- Wrong scope. Memory at the wrong granularity (one giant file for everything; or a thousand tiny files with no index).
The 0–4 ladder.
- 0 — Ephemeral. No persistent memory. Every session starts cold.
- 1 — Personal. Individual engineers keep their own notes; not shared.
- 2 — Shared notes. A NOTES.md or wiki the team updates and the agent can read.
- 3 — Indexed. A structured memory layer with retrieval: MEMORY.md index + topic files, vector store with metadata, or knowledge graph. Agents read selectively.
- 4 — Compounding. Memory is curated, versioned, deduplicated, decayed. Every project’s learnings flow back into the corpus. The org gets measurably faster at problems it has seen before.
Quick wins.
- Start a MEMORY.md in the repo. Don’t overthink it. One section per fact, one line per pointer.
- Reference past decisions in your AGENTS.md (e.g., “we tried X in 2024, see memory/why-we-dropped-x.md”).
- Pick one project. Write its post-mortem into memory. Watch what happens to the next similar project.
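A sketch of “index + topic files, agents read selectively”. It assumes, purely for illustration, that each pointer line in MEMORY.md is a bullet containing a Markdown link; the framework doesn't prescribe a format. Relevance here is naive word overlap, standing in for whatever retrieval you actually use.

```python
import re

# Assumed pointer format: "- [Title](path/to/topic.md)"
LINK = re.compile(r"^\s*-\s*\[(?P<title>[^\]]+)\]\((?P<path>[^)]+)\)")
WORD = re.compile(r"[a-z0-9]+")


def parse_index(index_text: str) -> list[tuple[str, str]]:
    """Extract (title, path) pointers from bullet lines containing a Markdown link."""
    pointers = []
    for line in index_text.splitlines():
        match = LINK.match(line)
        if match:
            pointers.append((match["title"], match["path"]))
    return pointers


def relevant_files(index_text: str, task: str) -> list[str]:
    """Return paths whose title shares at least one word with the task description."""
    task_words = set(WORD.findall(task.lower()))
    return [path for title, path in parse_index(index_text)
            if task_words & set(WORD.findall(title.lower()))]
```

The agent loads only the files `relevant_files` returns instead of the whole corpus. There is no stop-word filtering, so common words in titles will over-match; a real setup would want something better.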
Pillar 5 — MCP & Tool Use
What it is. Composable agent tooling. The Model Context Protocol (MCP) plus your team’s internal tool surface — APIs the agent can call, MCP servers you’ve built or installed, the ecosystem of capabilities your agents have access to.
Why it matters. This is where agents stop being chat windows and start being real coworkers. An agent that can read your Gmail, query your database, deploy your code, post in Slack, and update your CRM — under defined gating — is a fundamentally different kind of teammate than one that can only generate text.
It’s also where the most failure modes hide.
What good looks like. The team has a small, well-curated set of tools each agent can use. Tools are designed with clear input contracts, clear outputs, idempotency where it matters, and explicit gating for destructive actions. Internal services have MCP wrappers (or APIs) that agents call. The team has gone through the discipline of cutting tools as the catalog grew, because more than ~30–50 tools hits an accuracy cliff with most current LLMs.
When the agent calls a tool, there’s an audit trail. When the agent’s tool call fails, there’s a clear error the agent can recover from. The team treats tool design as a first-class engineering activity.
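A sketch of the audit-trail and recoverable-error pattern described above. The `audited` decorator, the result shape (`ok`, `data`, `error`, `message`), and the `get_ticket` tool are all hypothetical. This is plain Python, not the MCP SDK.

```python
import time
from typing import Any, Callable


def audited(tool_name: str, audit_log: list[dict[str, Any]]):
    """Wrap a tool so every call is logged and failures return a structured error."""
    def decorator(fn: Callable[..., Any]):
        def wrapper(**kwargs: Any) -> dict[str, Any]:
            entry = {"tool": tool_name, "args": kwargs, "ts": time.time()}
            try:
                result = {"ok": True, "data": fn(**kwargs)}
            except Exception as exc:  # report the failure instead of crashing the loop
                result = {"ok": False, "error": type(exc).__name__, "message": str(exc)}
            entry["ok"] = result["ok"]
            audit_log.append(entry)
            return result
        return wrapper
    return decorator


audit_log: list[dict[str, Any]] = []


@audited("get_ticket", audit_log)
def get_ticket(ticket_id: int) -> dict[str, Any]:
    """Hypothetical read-only tool with an explicit input contract."""
    if ticket_id <= 0:
        raise ValueError("ticket_id must be positive")
    return {"id": ticket_id, "status": "open"}
```

A failed call returns a structured error the agent can read and retry against, and both the success and the failure land in `audit_log`. In production the log would go somewhere durable rather than an in-memory list.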
What we see in the wild. Two failure patterns dominate:
- Tool starvation. Agents have nothing real to call. They can talk about the codebase but they can’t do anything outside it. Cool demos, no business impact.
- Tool sprawl. Agents have 80+ tools, half of them experimental, none well-designed. Accuracy collapses. Engineers don’t trust the agent. Adoption stalls.
Failure modes.
- No tools. Agents are chat windows only.
- Sprawl past the accuracy cliff. Too many tools; agent gets confused.
- Bad tool contracts. Tools that return half-structured text. Agent can’t use the output. Loops fail.
- No gating. Destructive tool calls happen without confirmation. Trust evaporates after first incident.
- No audit trail. Tools execute; nobody knows what happened; debugging impossible.
- No tool search / discovery. The catalog is past the ~30–50 tool cliff and agents have no search-tool pattern to narrow it.
The 0–4 ladder.
- 0 — Chat only. No tool calls. Agents read files at best.
- 1 — Built-in only. Whatever the IDE ships with — file read/write, terminal — but no internal services exposed.
- 2 — Some MCP / API surface. Team has built or installed a handful of MCP servers or API wrappers; agents call internal services.
- 3 — Curated catalog. Maintained, documented set of tools. Tool design is owned by someone. Gating explicit. Audit trail exists.
- 4 — Tool engineering as discipline. Tools are versioned, deprecated, tested, documented. Tool-search pattern handles the accuracy cliff. New tools require review like new APIs. The agent is genuinely a coworker.
Quick wins.
- Inventory your tools. Yes, write them down. You’ll be surprised.
- For each tool: is it tested? Is it idempotent where it should be? Does it have clear gating?
- Past 30 tools? Add a tool-search pattern. This is documented widely and is now a known requirement.
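A sketch of the tool-search idea: instead of handing the agent the whole catalog, expose one search function and load only the matches. Ranking here is plain word overlap, a stand-in for embeddings or whatever your stack provides. The catalog shape (name mapped to description) and the function names are assumptions.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_tools(catalog: dict[str, str], query: str, limit: int = 5) -> list[str]:
    """Rank tools by word overlap between the query and each tool's name + description."""
    query_words = _words(query)
    scored = []
    for name, description in catalog.items():
        score = len(query_words & _words(name + " " + description))
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for stable results.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]
```

The agent's context then holds `search_tools` plus the handful of tools it returns, which keeps the active tool count well under the cliff even when the catalog is large.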
How to use this framework
As an individual CTO. Score your team honestly on each pillar. Don’t grade yourself softly. Pick the 1–2 pillars where you’re scoring lowest and where movement would matter most. Most teams find Pillars 3 (Standards) and 4 (Memory) are the under-invested ones.
As a partner / value-creation team in a PE firm. Score each portfolio company on each pillar. Lay them out as a heatmap. The variance tells you where to invest fund-level capability (a shared standards toolkit; a shared MCP server library; a memory pattern someone in the portfolio has gotten right).
As an engineering leader hiring. This is your interview rubric. Ask candidates about their AI workflow. The 0–4 they score themselves on each pillar tells you how they’ll show up.
As a vendor pitch lens. If a tool vendor’s pitch doesn’t help you move up any of these pillars, the pitch is a feature list, not a strategy.
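Scoring is simple enough to keep in a script. A sketch, assuming scores are stored as a dict keyed by pillar name; the function names and the text layout of the heatmap are illustrative.

```python
PILLARS = (
    "Augmented Engineers",
    "Agentic Loops",
    "Standards as Code",
    "Memory & Context",
    "MCP & Tool Use",
)


def validate(scores: dict[str, int]) -> None:
    """Raise ValueError unless every pillar has an integer score from 0 to 4."""
    for pillar in PILLARS:
        score = scores.get(pillar)
        if not isinstance(score, int) or not 0 <= score <= 4:
            raise ValueError(f"{pillar}: expected an integer 0-4, got {score!r}")


def lowest_pillars(scores: dict[str, int], count: int = 2) -> list[str]:
    """Return the `count` lowest-scoring pillars, ties broken by pillar order."""
    validate(scores)
    return sorted(PILLARS, key=lambda pillar: scores[pillar])[:count]


def heatmap(portfolio: dict[str, dict[str, int]]) -> str:
    """Render one row per company with its five scores in pillar order."""
    rows = []
    for company, scores in portfolio.items():
        validate(scores)
        rows.append(f"{company:<12}" + " ".join(str(scores[p]) for p in PILLARS))
    return "\n".join(rows)
```

`lowest_pillars` covers the individual-CTO case (where to focus first), and `heatmap` gives the portfolio view as plain text you can paste into a doc. Which of the low pillars “would matter most” is still a judgment call the script can't make.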
What this framework is not
It is not a checklist of tools. It is not “is your team using Cursor?” — that’s the trailing indicator, not the leading one. A team using Cursor at maturity-1 is much less productive than a team using free Claude Code at maturity-3.
It is not a static rubric. The 0–4 levels at the top of each pillar will shift as the underlying capabilities shift. The framework will be republished every quarter.
It is not a substitute for judgment. The scores are inputs. The judgment is “given where this team is, what one move would compound the most.” That’s the job.
A note on InteliG
Most of what I described in Pillars 3 (Standards), 4 (Memory), and 5 (MCP/Tools) is what InteliG does continuously, automatically, across an engineering org. The platform reads git, meetings, and spend; runs the same lens you’d get from an audit, but every day; and surfaces the compounding insights that an annual rubric can’t.
I built InteliG because I couldn’t find a tool that actually did this for an engineering organization.
Reuse: this framework is licensed CC-BY 4.0 — attribute to Levi Garner / InteliG and link back. Methodology open; platform is the product. Tell me where the framework is wrong: levi@intelig.ai.