InvokePlane: Observability Built for Agents, Not Chatbots
The tools the industry reached for when it first wanted to observe LLMs were shaped by chatbots. One prompt, one response, one trace. Log the prompt, log the completion, bill against a per-token ledger, move on. For the first wave of production GPT-3 and GPT-4 use cases — summarisation, classification, simple Q&A — this was enough.
Agents are not chatbots. An agent is a sequence of model calls wrapped in decisions: which tool to use, whether to retry, when to ask the user, when to give up. A single user-facing request can produce thirty model calls, seven tool calls, two retries, and an interrupt that splits the session into two branches. Flattening that into a linear trace loses the structure that operators most want to see.
InvokePlane is an observability and control plane built from the opposite premise. Traces preserve the tree. Sessions are first-class. Evaluations gate deployment. Keys belong to the tenant, not to the platform.
Agent observability versus chat observability
The distinction is worth drawing cleanly because most teams, when evaluating platforms in this space, start by asking the wrong question.
| Axis | Chat observability | Agent observability |
|---|---|---|
| Primary unit | Prompt → completion pair | Session (many calls, one user intent) |
| Trace shape | Linear | Tree with retries, tool calls, branches |
| Session concept | Implicit (conversation history) | Explicit, scoped, addressable |
| Tool calls | Often flattened into prompt text | First-class spans |
| Retries | Invisible unless logged by app | Visible as sibling spans |
| Eval dataset role | Optional, for tuning | Gate on deployment |
| Multi-tenant keys | Rare | Table stakes |
A chat observability tool can be used to watch an agent. The result is that the agent’s decision structure becomes a blob in the prompt field, and the operator has to reconstruct what happened by reading the prompt text as prose. This is workable for five sessions a day. It is not workable for five thousand.
The four primitives
InvokePlane’s shape comes from four design decisions that compound.
1. Streaming sessions
Every interaction is a session — a durable, addressable unit with a lifecycle. A session contains the tree of model calls, tool calls, retries, and branches that made up the interaction. The observability surface shows you the session as a tree, not a log.
Streaming means the tree is live during the session, not only after it completes. When a production agent is mid-session and an operator wants to look at what it is doing right now, the UI is already showing it. For debugging long-running agents (minutes, sometimes longer), this is the difference between “check back in ten minutes” and “watch it happen.”
2. Bring-your-own-keys per tenant
Every agent operator’s keys belong to the tenant, not to the platform. A B2B copilot vendor with 400 customers has 400 sets of keys. Each customer’s model calls go against their own OpenAI, Anthropic, Google, or self-hosted account, with their own data-processing agreement, their own usage limits, and their own bills.
The platform never pools keys. A compromised key affects one tenant. A rate-limited key affects one tenant. Compliance questions about which provider is processing whose data have an obvious answer.
This is the constraint a single-key observability tool cannot retrofit. Platforms built around a central key pool either break multi-tenancy or leak data across tenant boundaries. InvokePlane is designed around the multi-tenant constraint from the ground up.
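The invariant is easy to state in code. The sketch below is hypothetical (`TenantKeyStore` is not a real InvokePlane class); what it illustrates is that key resolution is a pure lookup scoped to one tenant, with no shared fallback pool to fail open into:

```python
class TenantKeyStore:
    """Illustrative per-tenant key resolution: every model call is scoped to
    one tenant's own provider credentials; there is no platform key pool."""

    def __init__(self) -> None:
        # (tenant_id, provider) -> api key; each tenant brings their own.
        self._keys: dict[tuple[str, str], str] = {}

    def set_key(self, tenant_id: str, provider: str, api_key: str) -> None:
        self._keys[(tenant_id, provider)] = api_key

    def resolve(self, tenant_id: str, provider: str) -> str:
        try:
            return self._keys[(tenant_id, provider)]
        except KeyError:
            # Fail closed: no silent fallback to a shared platform key.
            raise LookupError(
                f"no {provider} key configured for tenant {tenant_id}"
            )
```

Revoking one tenant's key, or having a provider rate-limit it, changes nothing for any other tenant — the blast-radius claim falls directly out of the data structure.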
3. Multi-tenant workspace isolation
Beyond the key boundary, every tenant’s data, traces, evals, and agent versions live in a workspace. The boundary is enforced at the infrastructure level — a workspace’s traces never share a storage partition with another workspace, a workspace’s eval datasets never leak into another’s model fine-tuning path.
Workspace isolation is the contract that lets a platform operator tell an enterprise customer, honestly, that their data is not in the training set of the next model the vendor ships. Most AI platforms gesture at this claim; InvokePlane architects around it.
4. Eval-gated publishing
A new version of an agent does not reach users until its eval dataset passes configured thresholds. The gate is built into the publish flow, not reconstructed by every team.
The structure mirrors CI for code:
- An agent version is a candidate.
- The candidate runs against a versioned eval dataset.
- Each eval produces a score (exact-match, LLM-judge, custom rubric, regression against a golden set).
- Thresholds per eval determine pass/fail.
- A failing candidate does not publish; a passing candidate does.
The eval-gated publish turns agent work from “hope it holds” into “gated like any other software release.” It is the single most consequential difference between an agent that ships monthly with confidence and one that ships weekly with anxiety.
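The gate itself is simple enough to sketch. This is an illustrative reduction, not InvokePlane's implementation; the one design choice worth noting is that a missing score fails closed, so an eval that did not run counts the same as an eval that failed:

```python
def gate(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """A candidate publishes only if every configured eval meets its threshold.

    scores: per-eval results for the candidate, e.g. {"exact_match": 0.93}.
    thresholds: per-eval minimums configured on the publish flow.
    """
    return all(
        # An eval with no score fails closed rather than passing by omission.
        scores.get(name, float("-inf")) >= min_score
        for name, min_score in thresholds.items()
    )
```

A candidate that scores 0.93 exact-match against a 0.90 threshold passes that eval; drop one eval from the run entirely and the whole candidate is blocked, which is exactly the green-CI-before-merge behaviour the analogy promises.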
A practical workflow
Here is what a typical production-agent team’s workflow looks like on InvokePlane.
Author. An engineer builds a new version of the agent locally. Instrumentation comes from the SDK; no need to hand-write tracing into every tool call.
Dev run. Running the agent locally produces full sessions in the dev workspace immediately. The engineer can see the tree, click into spans, and inspect every model call.
Eval. Before proposing a release, the engineer runs the candidate against the team’s eval dataset. The results land in the platform with per-eval pass/fail, a diff against the last green candidate, and aggregate scores.
Review. A peer reviews the eval results rather than reading the candidate’s diff. This is the shift in discipline — reviewers look at outputs, not only code.
Publish. The candidate passes the gate and publishes to a staging environment. The eval dataset runs one more time against the deployed candidate to confirm parity.
Rollout. The candidate rolls out to a percentage of production traffic. Sessions stream in live. An anomaly detector flags unusual rates of retries, tool failures, or user-visible interrupts.
Ratchet. Production traffic on the new candidate moves up. The old candidate is retired on a configured schedule.
The shape is familiar — it’s release engineering applied to agents. The gates are agent-aware; the rest is engineering discipline you already had.
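The rollout-and-ratchet steps can be sketched as a single decision function. The schedule values and the all-or-nothing rollback are assumptions for illustration — a real rollout policy would be configurable — but the shape is the familiar canary ramp:

```python
def next_rollout_step(
    current_pct: int,
    anomaly: bool,
    schedule: tuple[int, ...] = (1, 5, 25, 50, 100),
) -> int:
    """Hypothetical ratchet: advance along the schedule while sessions look
    healthy; any anomaly (elevated retries, tool failures, user-visible
    interrupts) rolls the candidate back instead of advancing."""
    if anomaly:
        return 0  # route all traffic back to the old candidate
    for step in schedule:
        if step > current_pct:
            return step
    return 100  # fully ratcheted; the old candidate can be retired
```

Each tick of the schedule, the anomaly detector's verdict decides whether the new candidate takes more traffic or none at all.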
When InvokePlane is the right choice
Three conditions, any of which is sufficient:
- Production agents with real users. Internal prototypes and research experiments can run on simpler tooling. Once the agent is answering customer email, authoring customer-facing text, or running tools against customer data, the observability ceiling of chat-era tools gets uncomfortable fast.
- Multi-tenant platforms. If your customers bring their own keys, your observability stack needs to respect that boundary. Single-key platforms cannot retrofit this.
- Eval-gated release is a team requirement. If you have decided that agent releases should be gated by evals — and the answer is almost always yes, once the agent has revenue attached — build it once, with the evals co-located with the platform that ships the agent. Reconstructing the gate per team is expensive and fragile.
When to look elsewhere
A few cases where InvokePlane is not the right answer:
- Self-hosted is a hard requirement. InvokePlane is a hosted product. If your team needs to self-host the control plane for compliance or cost reasons, alternatives like Langfuse or Helicone are worth evaluating. A separate entry on multi-tenant-aware alternatives covers the comparison.
- You are in the research-organisation scale tier. If your team is running agent research at the scale of a frontier lab, you likely already have custom observability and custom eval infrastructure. InvokePlane is built for teams that should not have to build this themselves.
- The agent is a prototype. The overhead of evaluation and session replay is not worth it while the agent's behaviour is still changing weekly. Ship the prototype, learn from it, and move to InvokePlane when the agent is stable enough that a regression matters.
The underlying claim
The shape of tooling reflects the shape of the work. Chat observability tools are shaped like chatbots because when they were built, that is what needed to be observed. Agents work differently, break differently, and recover differently. The tools that serve them need to be shaped like the sessions they observe.
InvokePlane is the plane your agents live on in production. Free to start; paid when your agents graduate from the notebook.
Related reading:
- LangSmith Alternatives for Multi-Tenant AI Platforms — comparison matrix across four platforms.
- Shipping Software You Can Bet a Career On — the thesis behind the portfolio.
Frequently asked
What is agent observability, as distinct from LLM observability?
LLM observability traces a prompt-to-completion pair, usually linearly. Agent observability traces a session — a sequence of model calls, tool calls, retries, branches, and interrupts — as one unit, and preserves the structure of the decisions inside it. A linear trace flattens an agent's reasoning into a timeline; agent observability keeps the tree. The distinction matters the moment an agent has to explain itself in production.
How is InvokePlane different from LangSmith?
Three ways. First, InvokePlane is built around session-scoped traces with native tool-call, retry, and interrupt primitives, rather than linear prompt traces. Second, it supports bring-your-own-keys per tenant — each customer's model calls bill against their own LLM provider account, isolated from every other tenant's. Third, it enforces eval-gated publishing as a first-class constraint: a new agent version does not reach production users until its eval dataset passes configured thresholds.
What does 'eval-gated publishing' mean?
A deployment discipline where a new version of an agent cannot reach users until it has passed a defined evaluation suite. The evals run on a dataset of representative inputs, score the agent's outputs against expected behaviours, and gate the deploy on passing rates. It is the agent-era equivalent of requiring a green CI pipeline before merging to main. InvokePlane builds this gate into the publish flow rather than leaving it to be reconstructed per team.
Why do multi-tenant AI platforms need bring-your-own-keys?
Three reasons: compliance (each tenant's model calls are attributable to their own API keys, their own usage logs, their own data-processing agreements with the model provider), cost ownership (the tenant pays their own model-provider bill rather than being repackaged into the platform's margin), and blast radius (a compromised or throttled key only affects one tenant, not the whole platform). InvokePlane's key isolation sits at the workspace boundary, so a tenant's keys never pass through shared infrastructure.
Is InvokePlane open source?
No. InvokePlane is a hosted product; the control plane is not open source. The observability SDKs that teams integrate into their agents are open source and free. If a fully self-hostable agent observability platform is a hard requirement, InvokePlane is not currently a fit; alternatives include Langfuse and Helicone, compared in a separate entry.
What kind of agents is InvokePlane built for?
Production agents that matter to your customers — customer-facing copilots, B2B workflow agents, autonomous support assistants, code-gen agents with real tool access. It is overkill for internal prototypes and under-kill for large research-org-scale agent fleets that already have custom observability. The sweet spot is production agents at a team that needs CI-style gates but does not want to build them from scratch.