The Intelligence Agent Test: A 20-Point Rubric to Tell Real Agents from Marketing
Levi Garner
Founder & CTO, InteliG
Referenced by Episode 1 of The Agent Series. This page is the canonical framework. It’s updated as scores change and new products enter the field. Use it to score the AI tool you spent the most money on this quarter.
The word “agent” is the most expensive marketing victory of the last 18 months.
In the time it took to ship one Anthropic release, the term went from meaning “an autonomous reasoning system that observes and acts on a domain” to meaning “literally any function that calls an LLM with a tool.”
Today, “AI agent” can refer to:
- A ChatGPT plugin that calls your CRM
- A coding assistant that loops until the tests pass
- A workflow tool that fires three API calls in sequence
- A chatbot with a personality
These are not the same thing. They’re not even close to the same thing. And calling all of them “agents” has made the entire category meaningless to the people writing the checks for them.
So I built a test.
The four paradigms
Before the rubric, the taxonomy. There are four kinds of “agents” in the market, and they fail at very different things.
Task agents. LLM plus a tool call plus a return value. ChatGPT plugins. Claude with MCP. Anything that fires a single function and returns text. The 99%.
Workflow agents. Multi-step plan, act, observe, adjust. AutoGPT, CrewAI, AutoGen, OpenAI Assistants. Stateless across sessions. Forgets you the moment you close the tab.
Coding agents. Write, run, and debug code autonomously until tests pass. Devin, Cursor agent mode, Replit Agent, Claude Code, GitHub Copilot Workspace. Operates on a codebase, not on understanding.
Intelligence agents. Continuously observe a domain. Build internal models. Surface what matters without being asked. Pre-compute understanding so the answer is ready before the question. Almost nothing in production qualifies. Palantir Foundry is the closest mainstream comparison.
The category called “agents” in tech press is mostly the first three. The category that actually matters, the one that changes how you operate a business, is the fourth. And no one is shipping it yet.
The Intelligence Agent Test
Four traits. Each scored 0–5. Total of 20 points. A score of 11 or higher puts you in intelligence-agent territory. Below 6, you’re a task agent dressed up.
Trait 1: Continuous Observation
Does the system observe the domain without being asked?
| Score | Behavior |
|---|---|
| 0 | Pure prompt-driven. Does nothing without a user. |
| 1 | Polls a data source on schedule (basic ingestion). |
| 2 | One background process autonomously extracts findings. |
| 3 | Multiple background agents on different signals/cadences. |
| 4 | Event-driven. Reacts to source events, not just polling. |
| 5 | Continuous, event-driven, and adaptive (decides what to observe next). |
What this unlocks: “What broke overnight that I haven’t been told about yet?” A task agent sat idle while you slept. An intelligence agent watched, noticed, and is ready to brief you the moment you open the app.
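The gap between a score of 1 (scheduled polling) and a score of 4 (event-driven) is easier to see in code. This is a minimal Python sketch, not any vendor's implementation: `EventDrivenObserver`, `poll_once`, and the event fields (`id`, `type`, `service`) are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Set

Event = Dict[str, str]


class EventDrivenObserver:
    """Score-4 style: handlers run when the source emits an event."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[Event], None]]] = {}
        self.findings: List[str] = []

    def on(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def emit(self, event: Event) -> None:
        # In a real deployment a webhook receiver would call this.
        for handler in self._handlers.get(event["type"], []):
            handler(event)


def poll_once(fetch: Callable[[], List[Event]], seen: Set[str]) -> List[Event]:
    """Score-1 style: ask the source for everything, keep only unseen events."""
    fresh = [e for e in fetch() if e["id"] not in seen]
    seen.update(e["id"] for e in fresh)
    return fresh


observer = EventDrivenObserver()
observer.on(
    "test_failed",
    lambda e: observer.findings.append(f"{e['service']} tests failing"),
)
observer.emit({"id": "1", "type": "test_failed", "service": "fraud-engine"})
observer.emit({"id": "2", "type": "push", "service": "payments"})
print(observer.findings)  # ['fraud-engine tests failing']
```

The polling version only learns about a change on its next scheduled run. The event-driven version records the finding as soon as the source reports it, with no user in the loop.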
Trait 2: Persistent Compounding Memory
Does what the system learned yesterday shape what it knows today?
| Score | Behavior |
|---|---|
| 0 | Stateless. Every conversation starts from zero. |
| 1 | Conversation history within a single thread. |
| 2 | User-scoped persistent profile or preferences. |
| 3 | Cross-session findings store. Persistent. Queryable. |
| 4 | Compounding. Past findings inform new reasoning automatically. |
| 5 | Multi-layer memory (org, user, ephemeral) with explicit consolidation. |
What this unlocks: “Last month you flagged Ryan as a single point of failure on payments. Has that gotten better or worse?” A task agent doesn’t remember Ryan, doesn’t remember the flag, doesn’t remember last month. An intelligence agent treats every prior finding as a thread it can pull on.
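A score of 3 on this trait means a findings store that is cross-session, persistent, and queryable. Here is one way that could look, sketched with Python's built-in `sqlite3` module. The table layout, the function names, and the dates are assumptions made for the example; only the Ryan scenario comes from this article.

```python
import sqlite3
from typing import List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a findings table. Pass a file path to persist across runs."""
    conn = sqlite3.connect(path)
    # observed_on holds ISO dates, so sorting the text sorts by date.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS findings (
               id INTEGER PRIMARY KEY,
               subject TEXT NOT NULL,
               kind TEXT NOT NULL,
               detail TEXT NOT NULL,
               observed_on TEXT NOT NULL
           )"""
    )
    return conn


def record(conn: sqlite3.Connection, subject: str, kind: str,
           detail: str, observed_on: str) -> None:
    conn.execute(
        "INSERT INTO findings (subject, kind, detail, observed_on) "
        "VALUES (?, ?, ?, ?)",
        (subject, kind, detail, observed_on),
    )
    conn.commit()


def history(conn: sqlite3.Connection, subject: str, kind: str) -> List[Tuple[str, str]]:
    """Return (date, detail) pairs for one subject, oldest first."""
    rows = conn.execute(
        "SELECT observed_on, detail FROM findings "
        "WHERE subject = ? AND kind = ? ORDER BY observed_on",
        (subject, kind),
    )
    return rows.fetchall()


store = open_store()
# Placeholder dates; inserted out of order to show the query sorts them.
record(store, "payments", "bus_factor", "Ryan at 71% ownership", "2026-02-05")
record(store, "payments", "bus_factor", "Ryan at 82% ownership", "2026-01-05")
for observed_on, detail in history(store, "payments", "bus_factor"):
    print(observed_on, detail)
```

Because the earlier finding is still in the store, a later pass can answer "better or worse?" by reading two rows instead of re-deriving last month from raw data.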
Trait 3: Pre-Computed Understanding
When you ask, is the answer already known, or is the system reasoning from scratch?
| Score | Behavior |
|---|---|
| 0 | Every answer reasoned from raw data on demand. RAG-only. |
| 1 | Cached responses for common queries. |
| 2 | Pre-indexed evidence (vector store, retrieval, not reasoning). |
| 3 | Pre-computed findings stored, retrieved at query time. |
| 4 | Pre-computed findings plus on-demand synthesis. |
| 5 | Anticipatory. System pre-builds what you’ll likely need next. |
What this unlocks: “Tell me everything that needs my attention right now.” A task agent thinks for 30 seconds and produces something generic. An intelligence agent answers in one second because the findings already exist. It’s retrieving understanding it built last night.
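The one-second answer comes from moving the slow reasoning off the request path. A minimal sketch of that split, assuming a background job calls `refresh()` and the user-facing query only calls `attention_now()`; the class, the `Finding` type, and the severity scale are hypothetical:

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Finding:
    summary: str
    severity: int  # higher means more urgent; the scale is an assumption


@dataclass
class PrecomputedBriefing:
    """Holds findings produced ahead of time by a background pass."""

    analyze: Callable[[], List[Finding]]  # the slow reasoning step
    _ranked: List[Finding] = field(default_factory=list)

    def refresh(self) -> None:
        # Run on a schedule or on events, never on the user's request path.
        self._ranked = sorted(
            self.analyze(), key=lambda f: f.severity, reverse=True
        )

    def attention_now(self, limit: int = 5) -> List[Finding]:
        # Query time is a lookup: no analysis happens here.
        return self._ranked[:limit]


def slow_analysis() -> List[Finding]:
    # Stand-in for an expensive reasoning pass over raw signals.
    return [
        Finding("Two sprint action items overdue", 2),
        Finding("fraud-engine tests failing", 3),
    ]


briefing = PrecomputedBriefing(analyze=slow_analysis)
briefing.refresh()
print([f.summary for f in briefing.attention_now()])
```

Calling `attention_now()` repeatedly never re-runs `slow_analysis`, which is the property that separates a score of 3 from a score of 0 on this trait.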
Trait 4: Pattern Recognition Over Time
Does the system see trends, drift, and anomalies, or just snapshots?
| Score | Behavior |
|---|---|
| 0 | Snapshot answers only. No temporal awareness. |
| 1 | Trend display (charts of metrics over time). |
| 2 | Single-domain pattern detection. |
| 3 | Cross-domain anomaly detection. |
| 4 | Cross-finding consolidation. Meta-patterns surface from base findings. |
| 5 | Predictive. Surfaces what’s likely to happen, not just what is. |
What this unlocks: “How did engineering perform this quarter versus last, and what changed in our team that explains it?” A dashboard shows two charts side-by-side and asks you to do the comparison. An intelligence agent does the comparison itself, identifies the structural changes (a contributor went silent, a team’s review participation dropped, a critical service lost ownership continuity), and tells you what’s actually driving the delta.
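At the low end of this scale, single-domain pattern detection (a score of 2) can be as simple as comparing each new data point with its own recent history. The sketch below flags points that sit far from the trailing mean. The window size, the threshold, and the sample numbers are illustrative assumptions, and real detectors would likely be more careful about seasonality and small samples.

```python
from statistics import mean, stdev
from typing import List, Sequence


def anomalies(series: Sequence[float], window: int = 4,
              threshold: float = 3.0) -> List[int]:
    """Return indexes whose value is more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all is a departure from it.
            if series[i] != mu:
                flagged.append(i)
        elif abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical weekly review counts for one team; the last week collapses.
weekly_reviews = [10, 11, 10, 11, 10, 11, 3]
print(anomalies(weekly_reviews))  # [6]
```

This notices that something changed. Explaining why it changed, across domains, is what the higher scores on this trait ask for.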
Classification
| Total | Class | What it actually is |
|---|---|---|
| 0–5 | Task Agent | LLM with tools. Marketing called it an agent. |
| 6–10 | Workflow Agent | Multi-step. Stateless. On-demand. |
| 11–15 | Intelligence Agent (Emerging) | Real memory and observation. Gaps in synthesis. |
| 16–20 | Intelligence Agent (Mature) | Continuous, compounding, anticipatory. |
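The scoring arithmetic is simple enough to write down. A minimal Python sketch, assuming each trait score is an integer from 0 to 5; `TraitScores` and `classify` are illustrative names, not part of any published tool:

```python
from dataclasses import dataclass

# Band boundaries copied from the classification table above.
BANDS = [
    (0, 5, "Task Agent"),
    (6, 10, "Workflow Agent"),
    (11, 15, "Intelligence Agent (Emerging)"),
    (16, 20, "Intelligence Agent (Mature)"),
]


@dataclass(frozen=True)
class TraitScores:
    """Hypothetical container for the four IAT trait scores (each 0-5)."""

    observation: int
    memory: int
    precomputed: int
    patterns: int

    def total(self) -> int:
        scores = (self.observation, self.memory, self.precomputed, self.patterns)
        for s in scores:
            if not 0 <= s <= 5:
                raise ValueError(f"trait score out of range: {s}")
        return sum(scores)


def classify(scores: TraitScores) -> str:
    """Map a 0-20 total onto the class named in the table."""
    total = scores.total()
    for low, high, label in BANDS:
        if low <= total <= high:
            return label
    raise ValueError(f"total out of range: {total}")


# Two rows from the results table later in this article.
print(classify(TraitScores(1, 2, 2, 0)))  # Glean, total 5: Task Agent
print(classify(TraitScores(2, 4, 3, 2)))  # Cognis (today), total 11: Intelligence Agent (Emerging)
```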
The brutal results
I scored the entire field. Including my own product. Honestly.
| Product | Continuous | Memory | Pre-Computed | Patterns | Total | Class |
|---|---|---|---|---|---|---|
| ChatGPT (default) | 0 | 1 | 0 | 0 | 1 | Task |
| Claude + MCP tools | 0 | 1 | 0 | 0 | 1 | Task |
| AutoGPT | 1 | 1 | 0 | 0 | 2 | Task |
| CrewAI / AutoGen | 1 | 1 | 0 | 0 | 2 | Task |
| Cursor agent mode | 0 | 2 | 0 | 0 | 2 | Task |
| Replit Agent | 0 | 2 | 0 | 0 | 2 | Task |
| Devin | 1 | 2 | 0 | 0 | 3 | Task |
| GitHub Copilot Workspace | 1 | 2 | 0 | 0 | 3 | Task |
| LinearB | 1 | 1 | 1 | 1 | 4 | Task |
| Jellyfish | 1 | 1 | 1 | 1 | 4 | Task |
| Cortex.io | 1 | 2 | 1 | 0 | 4 | Task |
| Glean | 1 | 2 | 2 | 0 | 5 | Task (top of band) |
| Cognis (today) | 2 | 4 | 3 | 2 | 11 | Intelligence (Emerging) |
| Palantir Foundry | 4 | 5 | 4 | 4 | 17 | Intelligence (Mature) |
| Cognis (Q4 2026 target) | 4 | 5 | 4 | 4 | 17 | Intelligence (Mature) |
A few things jump off this table.
The entire coding-agent category (the hottest, best-funded segment in AI right now) scores between 2 and 3. Devin, Cursor, Replit, Copilot Workspace. Billions of dollars in valuation, building task agents.
The entire engineering analytics category (LinearB, Jellyfish, Cortex) scores 4. These aren’t even agents. They’re metrics dashboards with charts. They poll Git and Jira on a schedule, run aggregations, and put the result on a screen. There’s no autonomous observation because there’s no observer. There’s no memory because there’s no reasoning to remember. Aggregation is not understanding. Time-series storage is not memory. A chart is not pattern recognition. You are the pattern recognition, looking at the chart.
Glean tops the task-agent band at 5. It’s a sophisticated retrieval system, but RAG is not intelligence. Search, no matter how good, is not reasoning.
There is one third-party product on this list that scores in intelligence-agent territory: Palantir Foundry. It’s not for engineering. It costs millions of dollars per year. Most companies will never deploy it.
That leaves a category of one in the middle of the table. And it’s not where most people think to look.
The questions only an intelligence agent can answer
The rubric is abstract until you ground it in what a CTO actually wants to ask. Here are four questions, one per trait, and what each tier of agent produces when you ask them.
“What broke overnight that I should know about?”
- Task agent (ChatGPT, Cursor, Devin): “I don’t have access to your engineering systems. Can you describe what happened?” No autonomous observation. The system only knows what’s in its current prompt window.
- Workflow agent (CrewAI, AutoGPT): Will spin up a multi-step process to scan your repos if you ask it to. Returns 8 minutes later with a summary. Reactive, not proactive. Can’t tell you what changed because it has no baseline.
- Engineering analytics (LinearB, Jellyfish): “Here’s a chart of overnight commits and PR throughput.” Shows you the data. You’re the one looking for what broke.
- Intelligence agent (Cognis): “Three things happened. The payments service got two PRs from a contributor who hasn’t touched it in three months. Review them carefully. Sprint 47 closed five action items but two are now overdue. The fraud-engine tests started failing at 2am. First failure in 60 days.” The system observed while you slept. It already knows what changed and what matters.
“Has the bus factor on payments improved since you flagged it last month?”
- Task agent: “What’s the bus factor? What did I flag?” No memory of the prior conversation or the prior finding.
- Workflow agent: Will re-run the analysis from scratch. Gives you a current snapshot. Has no concept of “since last month” because it didn’t exist last month from the system’s perspective.
- Engineering analytics: “Here’s contributor distribution on payments for the last 30 days.” Shows you data. Doesn’t remember what your concern was.
- Intelligence agent (Cognis): “Last month’s finding flagged Ryan O’Brien at 82% ownership. Today he’s at 71%. Three other contributors made meaningful commits in the interim. Maria added the new payout flow. Devon owned the chargeback refactor. The risk is reduced but not resolved. Recommend continuing to redistribute review load.” Knows the prior finding, knows what’s changed, can compare across time.
“Tell me everything that needs my attention right now.”
- Task agent: 30 seconds of “thinking” then generic productivity advice. Reasons from scratch every time. No prior context to draw on.
- Workflow agent: 4-step plan: “I will 1) check repos, 2) check meetings, 3) check sprints, 4) summarize.” Returns 6 minutes later with a verbose dump. Slow. Surfaces everything because it can’t distinguish what matters from what doesn’t.
- Engineering analytics: “Here’s your dashboard. Look at the red metrics.” Shows you charts. You decide what needs attention.
- Intelligence agent (Cognis): One second. “Five things. Ranked by severity. Two are blocking, three are watch-list. Click any for evidence.” Pre-computed understanding. The work was done at 6am; you’re just retrieving it.
“How did we perform this quarter versus last, and what explains the delta?”
- Task agent: Cannot answer. Has no historical context or data access.
- Workflow agent: Will fetch metrics for both quarters and produce a comparison table. Numerical comparison, no causal reasoning.
- Engineering analytics: Two charts side-by-side. Velocity, throughput, cycle time. You’re doing the pattern recognition. Tool just shows the numbers.
- Intelligence agent (Cognis): “Output volume is up 14% but effort quality declined 9%. The delta is driven by three structural changes: Sarah moved to a different team in week 3, the platform group absorbed two on-call rotations, and AI-assisted code rose from 22% to 38% of merged commits. That correlates with the cycle-time improvement. The drop in effort quality is concentrated in two repos that lost their primary owner. Want to drill in?” Recognizes the patterns, identifies the causes, surfaces the structural explanation.
This is the actual difference. Not “more features.” Not “better UI.” A category change.
Cognis: scored honestly, with code
I’m not going to write a blog post that hands my own product a perfect score. That’s the marketing playbook the rest of the industry runs and it’s exactly what makes “agent” mean nothing.
Here’s what Cognis actually is, scored against the rubric I just published, with code citations:
Continuous Observation: 2. One background scheduler runs every six hours per org, extracting findings without user input (KnowledgeRefreshScheduler.java). The Cognis Operator spec (multi-agent autonomous observation) is in design, not built. The main reasoning loop is still synchronous and request-driven.
Persistent Compounding Memory: 4. The Knowledge Store is real. KnowledgeEntry is a persistent table. FindingExtractor writes findings after every reasoning pass. Cross-session, queryable, TTL-managed. The multi-layer hierarchy (org, user, ephemeral, with explicit consolidation) is not yet built.
Pre-Computed Understanding: 3. The background scheduler extracts findings into the Knowledge Store. The main QueryExecutor still synthesizes on-demand at query time. Dynamic Greeting (proactive surfacing of pre-computed findings) is spec’d, not shipped.
Pattern Recognition Over Time: 2. Per-domain detectors are live: bus factor, at-risk contributor, dormant repository, stalled initiative. Cross-finding consolidation, drift detection, and anomaly inference are vision-only.
Total: 11/20. Intelligence Agent (Emerging).
Six-month target: 17/20. We know exactly which features close the gap and they’re on the active roadmap.
I’m publishing the score. I’m publishing the gaps. If you’re reading this in six months and we’re still at 11, hold me to it.
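To make "per-domain detector" concrete, here is how a bus factor check could be sketched in Python. This is not the Cognis implementation, which the article does not show. It assumes commit count is an acceptable proxy for ownership, and both the function names and the 60% threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def top_owner(commit_authors: Iterable[str]) -> Optional[Tuple[str, float]]:
    """Return (author, share of commits) for the most active author, or None."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    if total == 0:
        return None
    author, n = counts.most_common(1)[0]
    return author, n / total


def bus_factor_risk(commit_authors: Iterable[str],
                    threshold: float = 0.6) -> Optional[str]:
    """Describe the risk if one author holds at least `threshold` of commits."""
    owner = top_owner(commit_authors)
    if owner is None or owner[1] < threshold:
        return None
    author, share = owner
    return f"{author} authored {share:.0%} of commits"


# Made-up commit log for a single service: one author per commit.
payments_commits = ["ryan"] * 82 + ["maria"] * 10 + ["devon"] * 8
print(bus_factor_risk(payments_commits))  # ryan authored 82% of commits
```

A detector like this produces one finding per run. Storing each result is what lets a later pass compare it with the previous one.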
Why this rubric works (and why memory is the entire game)
There’s a single thread running through every trait in the IAT, and it’s worth pulling on.
You can do this experiment yourself. Take Claude (the best general-purpose LLM in the world) and ask it about your engineering org. It will hallucinate, hedge, or give you a generic answer. Now give Claude a MEMORY.md file with persistent context about your team, your codebase, your priorities. Run the same conversation. Suddenly Claude is reliable. It remembers what you told it. It builds on prior reasoning. It cites specifics.
Same model. Same prompt. The only thing that changed was persistent memory.
This is the entire game. The reason every product on the table above, apart from Palantir Foundry and Cognis, scores under 6 is the same reason a stateless LLM is unreliable for any specific domain: it has no memory, so it has no understanding to compound. Every conversation is the first conversation. Every answer is reasoned from scratch. Every insight is forgotten the moment the session ends.
What we built in Cognis is the same pattern at architectural scale. The Knowledge Store is the engineering-execution equivalent of MEMORY.md. Every reasoning pass writes findings. Every future pass reads them. Two CTOs asking the same question a week apart get different answers because the system learned something in between.
This is why intelligence agents are rare. Building memory isn’t hard because LLMs are hard. It’s hard because you have to design the data model for what’s worth remembering. You have to extract findings from raw signals. You have to decide what compounds and what expires. You have to write the schema, build the extractors, ship the consolidators. None of that is glamorous. None of that gets you on the AI Twitter timeline.
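Deciding what compounds and what expires eventually becomes a retention policy in code. A minimal sketch, assuming each kind of finding gets its own time-to-live; the kinds, the TTL values, and the 30-day default are invented for illustration:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional

# Hypothetical retention policy: None means the finding never expires.
TTL_DAYS: Dict[str, Optional[int]] = {
    "bus_factor": None,   # structural risk: keep so later passes can compare
    "failing_test": 14,   # transient signal: stale after two weeks
}
DEFAULT_TTL_DAYS = 30     # assumed fallback for kinds not listed above


@dataclass(frozen=True)
class StoredFinding:
    kind: str
    detail: str
    observed_on: date


def live_findings(findings: List[StoredFinding], today: date) -> List[StoredFinding]:
    """Keep findings whose time-to-live has not elapsed as of `today`."""
    kept = []
    for f in findings:
        ttl = TTL_DAYS.get(f.kind, DEFAULT_TTL_DAYS)
        if ttl is None or today - f.observed_on <= timedelta(days=ttl):
            kept.append(f)
    return kept


stored = [
    StoredFinding("bus_factor", "payments ownership concentrated", date(2025, 1, 1)),
    StoredFinding("failing_test", "fraud-engine suite red", date(2026, 2, 1)),
    StoredFinding("failing_test", "payout flow suite red", date(2026, 2, 20)),
]
for f in live_findings(stored, today=date(2026, 3, 1)):
    print(f.kind, "-", f.detail)
```

In this example the year-old structural finding survives while the month-old test failure is dropped. That is a design decision about the domain, not something the model decides on its own.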
But without it, your agent forgets you the moment the conversation ends. And a system that forgets you isn’t intelligent.
Score your own
The IAT is a rubric. Use it.
Pick the AI tool you spent the most money on this quarter. Score it against the four traits. Be honest. If it scores under 6, you’re paying agent prices for task-agent work.
If you’re a CTO, score the engineering analytics platform you’re already paying for. If it’s 4, you’re not buying intelligence. You’re buying charts.
If you’re building an AI product, score yourself. The rubric is falsifiable. The features that move you from 6 to 11 to 17 are concrete and shippable. There’s no ambiguity about what “real intelligence” requires anymore.
Engineering execution is the hardest, most expensive, least-visible function in most companies. It deserves an intelligence agent. The category is wide open.
If you want to see what an 11 looks like running today (and what 17 looks like by Q4), book a Cognis demo at intelig.ai. I’ll score your current stack on the call. Honestly.
The dashboard era is over. The task-agent-with-better-marketing era should be too.
Time to demand the real thing.