The Problem

Every Enterprise is a Snowflake

Same agent. Identical code. Three customers, three different stacks:
• Customer A: Datadog + PagerDuty + Grafana + Confluence
• Customer B: Splunk + Azure + ServiceNow
• Customer C: New Relic + PagerDuty + Jira + Notion + Wiki
You ship it. It works beautifully — for your test environment.
Then every customer connects completely different tools.
Tool categories: Monitoring, Ticketing, Knowledge, Comms, Code/CI
• Ops: Datadog, PagerDuty, Confluence, Slack
• Support: Zendesk, Notion, Intercom
• Eng: Splunk, Jira, Confluence, Slack, GitHub
• Sales: Salesforce, Notion, Outlook
Every row is different. Every column is different.
5–15 SaaS tools per team, wired differently at every company.
Hardcoded agent: API deprecated, auth expired.
N tools × M tenants × P auth = explosion
Every new vendor = months of integration work
How do you build ONE agent that works across ALL tenants?
• Discovered, not declared: tools found at runtime per tenant.
• Delegated, not embedded: auth handled by the platform.
• Composable, not monolithic: domain logic adapts to tools.
Production Agent for Enterprise Tenants
• Skills: reusable domain logic that orchestrates tools without vendor coupling.
• Native Tools: platform capabilities every tenant gets for free.
• MCP: a protocol that lets the agent discover and invoke any tool at runtime.
Three architectural ideas. Let's build it up.
Why is This Actually Hard?

The Four Constraints Nobody Warns You About

[Figure: on demo day, one agent with one API, one DB, and one LLM. In production, 10,000 tenants, all different.]
At a hackathon: one user, one set of tools, one API key in an .env file.
That agent does not work for 10,000 paying customers.
CONSTRAINT 1: Tenant Isolation
Same agent code, separate tenants, with an isolation wall between them:
• Tenant A: its own data, credentials, tool connections, and user permissions; scoped API responses.
• Tenant B: different data, different credentials, different permissions; scoped API responses.
Customer A's agent must never see Customer B's data.
One leaked API response across tenants = game over.
CONSTRAINT 2: Tools Unknown Until Runtime
• Tenant A (full shelf): Datadog, PagerDuty, Slack, Jira, Confluence
• Tenant B (sparse): Splunk, OpsGenie
• Tenant C (empty): nothing connected yet
tools = ["datadog", "pagerduty", "splunk"]
Hardcoded = broken on Day 1.
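To make the "broken on Day 1" point concrete, here is a minimal sketch that compares the hardcoded list against what each tenant has connected. `TenantRegistry`, `missing_for`, and the tenant IDs are hypothetical names used only for illustration; a real platform would look connections up in its own store.

```python
from dataclasses import dataclass, field


@dataclass
class TenantRegistry:
    """Hypothetical in-memory stand-in for the platform's per-tenant connection store."""
    connections: dict = field(default_factory=dict)

    def discover(self, tenant_id: str) -> list:
        # Unknown or brand-new tenants simply have nothing connected yet.
        return list(self.connections.get(tenant_id, []))


registry = TenantRegistry({
    "tenant-a": ["datadog", "pagerduty", "slack", "jira", "confluence"],
    "tenant-b": ["splunk", "opsgenie"],
    "tenant-c": [],
})

# The hardcoded list from the slide.
HARDCODED = ["datadog", "pagerduty", "splunk"]


def missing_for(tenant_id: str) -> list:
    """Tools the hardcoded list assumes but this tenant never connected."""
    connected = set(registry.discover(tenant_id))
    return [tool for tool in HARDCODED if tool not in connected]


for tenant in ("tenant-a", "tenant-b", "tenant-c"):
    print(tenant, "is missing", missing_for(tenant))
```

Every tenant is missing at least one tool the hardcoded list assumes, which is why the agent has to ask what is connected rather than assume it.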
CONSTRAINT 3: Delegated Authentication
Agent (turn 4): tool_call() → 🔒 401 Unauthorized
Graceful degradation:
✓ Note the gap
✓ Tell the user
✓ Continue with what's available
✕ Don't crash, hallucinate, or pretend
OAuth tokens expire mid-conversation. Some tools need re-consent.
The agent doesn't own credentials. The user does.
CONSTRAINT 4: Partial Failure is the Norm
Five parallel tool calls:
✓ Alerts fetched: 200 OK, 12 results
✓ Logs fetched: 200 OK, 48 results
⚠ Partial data: 200 OK, empty body
✕ Timeout: 30s, no response
✕ 500 Error: internal server error
The goal is the best available answer, not "Error: please try again".
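One way to get a best available answer from a batch of parallel calls is to classify each outcome instead of letting one failure abort the batch. The sketch below uses `asyncio.gather` with `return_exceptions=True` and a per-call timeout; `call_tool` is a simulated stand-in for a real gateway call, and the tool names and the short timeout are assumptions chosen so the example runs quickly.

```python
import asyncio


async def call_tool(delay: float, result=None, fail: bool = False):
    """Simulated tool call; a real one would go through the gateway."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError("500 Internal server error")
    return result


async def gather_evidence(calls: dict, timeout: float = 0.05) -> dict:
    """Run all calls in parallel and classify every outcome."""
    async def guarded(coro):
        return await asyncio.wait_for(coro, timeout)

    names = list(calls)
    outcomes = await asyncio.gather(
        *(guarded(calls[name]) for name in names), return_exceptions=True
    )
    report = {"ok": {}, "empty": [], "failed": {}}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, asyncio.TimeoutError):
            report["failed"][name] = "timeout"
        elif isinstance(outcome, Exception):
            report["failed"][name] = str(outcome)
        elif not outcome:
            report["empty"].append(name)  # 200 OK with an empty body
        else:
            report["ok"][name] = outcome
    return report


def demo() -> dict:
    calls = {
        "alerts": call_tool(0.0, result=["alert"] * 12),
        "logs": call_tool(0.0, result=["log line"] * 48),
        "traces": call_tool(0.0, result=[]),
        "metrics": call_tool(1.0, result=["too slow"]),  # exceeds the timeout
        "deploys": call_tool(0.0, fail=True),
    }
    return asyncio.run(gather_evidence(calls))


print(demo())
```

The report keeps successes, empty results, and failures apart, so the synthesis step can answer from what arrived and name what did not.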
Your agent sits inside four walls: tenant isolation, heterogeneous tools, delegated auth, partial failure.
You must design within these walls — not pretend they don't exist.
Now that we know the constraints —
let's look at the architecture that satisfies all four.
Three layers. Three ideas. One protocol that ties them together.
Orchestrator
Skills
Tools
The Architecture

AI-Powered Ops Investigation Assistant

INTERACTION SURFACE
• Enterprise User
• Chat / Product UI
CONVERSATION RUNTIME
• Intent + Planning: parse and plan actions
• Agent Orchestrator: coordinate agents
• Response Synthesis: format and deliver
• Capability Registry: tool metadata, skill definitions, schemas, routing, guardrails
SKILL EXECUTION ENGINE
1. Skill Selection: route to the best skill
2. Execution Loop: execute and iterate
• Evidence Accumulation: gather and synthesize findings
TOOL GATEWAY
• Native Platform Tools (platform-owned, zero-auth): Alerts, Incidents, Service Dependencies, Knowledge Retrieval, Content Reads
• Tenant-Connected MCP Apps (3P, runtime discovery): Observability App A, Observability App B, Other Apps. Scoped per tenant + user, discovered at runtime.
ENTERPRISE TENANT CONTEXT
• Historical Incident + Alert: past patterns and resolutions
• Product / Service Data: topology, configs, catalogs
• Connected App Data: integrated 3P service data
Feedback: evidence returns to the user.
Two Types of Tools, One Skill

The Skill Doesn't Care Where the Data Comes From

"Why is checkout-service throwing 5xx errors?"
SKILL EXECUTION LOOP
1. Get alert context: What fired? When? What service? (Native)
2. Get service dependencies: What's upstream/downstream? (Native)
3. Get metrics: Is latency/error rate spiking? (MCP App)
4. Get logs: What errors are appearing? (MCP App)
5. Get recent changes: Any deploys in the window? (Native)
6. Synthesize: What does the evidence say? (Output)
Legend: Native = native tool (always available); MCP App = MCP app (varies per tenant). The skill treats them identically.
The skill has a plan — an evidence-gathering loop. Some steps hit native tools. Some hit MCP apps.
The skill doesn't distinguish.
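A minimal sketch of that indifference: the skill asks for capabilities by name, and a gateway object resolves each one to whatever callable is registered, native or MCP. The `Gateway` class, the capability names, and the two provider functions are hypothetical and exist only to illustrate the idea.

```python
from typing import Callable, Optional

Tool = Callable[[str], dict]


def native_alert_context(service: str) -> dict:
    """Stand-in for a platform-owned native tool."""
    return {"source": "native", "capability": "alerts", "service": service}


def mcp_golden_signals(service: str) -> dict:
    """Stand-in for a tenant-connected MCP app."""
    return {"source": "mcp:new-relic", "capability": "metrics", "service": service}


class Gateway:
    """Maps a capability name to whichever tool this tenant has available."""

    def __init__(self) -> None:
        self._tools: dict = {}

    def register(self, capability: str, tool: Tool) -> None:
        self._tools[capability] = tool

    def resolve(self, capability: str) -> Optional[Tool]:
        return self._tools.get(capability)


def run_skill(gateway: Gateway, service: str, plan: list) -> tuple:
    """Walk the plan; the skill never checks whether a tool is native or MCP."""
    evidence, gaps = [], []
    for capability in plan:
        tool = gateway.resolve(capability)
        if tool is None:
            gaps.append(capability)  # nothing connected for this capability
        else:
            evidence.append(tool(service))
    return evidence, gaps


gateway = Gateway()
gateway.register("alerts", native_alert_context)
gateway.register("metrics", mcp_golden_signals)

evidence, gaps = run_skill(gateway, "checkout-service", ["alerts", "metrics", "logs"])
print([item["source"] for item in evidence], "gaps:", gaps)
```

The plan is the same for every tenant; only the gateway's registrations differ, and an unresolved capability becomes a recorded gap rather than a crash.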
NATIVE TOOLS: ALWAYS AVAILABLE
Platform-owned. Always authenticated. Your guaranteed baseline.
• Project Tracker: get-issue-details, find-linked-issues, get-change-history
• Service Management: get-alert-context, find-similar-incidents, get-service-topology
• Knowledge Base: search-runbooks, get-resolution-guides, find-past-postmortems
Even with zero 3P tools connected, the agent can still reason.
MCP APPS: VARIES PER TENANT
Third-party. Discovered at runtime. Scoped per tenant.
• New Relic: query-nrql, get-golden-signals, get-error-groups
• Datadog: search-metrics, get-log-patterns, list-deployment-events
• Splunk: run-search-query, get-detector-incidents
• Sentry, Dynatrace, ...: varies
The skill says "I need metrics"; the gateway resolves to whatever's connected.
NATIVE EVIDENCE
✓ Alert: checkout-service 5xx spike at 14:32
✓ Topology: depends on payments-db, auth-svc
✓ Change: deploy v2.4.1 at 14:28 by @eng-team
✓ Runbook: "5xx on checkout" → check DB pool
MCP EVIDENCE
✓ Metrics: p99 latency 4200ms (was 180ms)
✓ Logs: "connection pool exhausted" ×847
✓ Traces: 92% of slow spans in payments-db
SYNTHESIZED ANSWER
Root cause: DB connection pool exhaustion
Deploy v2.4.1 introduced a connection leak in the payments-db adapter. p99 latency spiked 23× within 4 minutes of deploy. 847 "pool exhausted" errors. 92% of slow spans in payments-db.
Recommendation: Roll back v2.4.1 or increase pool_max_size
Runbook: "5xx on checkout" → matches historical pattern from Q2
Confidence: 94%
Neither source alone was enough. Together, a complete picture.
Native tools gave context (what changed, what's connected). MCP apps gave signals (metrics, logs, traces).
The skill fused them into a grounded diagnosis.
MCP Deep Dive

How Does the Agent Talk to Tools It's Never Seen?

WITHOUT MCP
Agent →
• custom API client
• custom auth flow
• custom response parsing
• custom error handling
Per vendor. Per version. Per tenant. N integrations = N × complexity.
WITH MCP
Agent → Gateway (MCP) → New Relic, Datadog, Splunk, any...
• Agent: "What tools do you have?" → tools/list
• Agent: "Run this with these args" → tools/call
• Provider: "Here's the result" → structured response
Two operations. Every provider speaks the same language.
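On the wire, MCP frames both operations as JSON-RPC 2.0 messages. The sketch below builds a `tools/list` request and a `tools/call` request and reads the text out of a sample result. The tool name comes from the slides; the `service` argument and the sample response text are assumptions for illustration, and transport details are left out.

```python
import json
from typing import Optional


def rpc_request(request_id: int, method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return message


# Operation 1: "What tools do you have?"
list_req = rpc_request(1, "tools/list")

# Operation 2: "Run this with these args" (the argument name is hypothetical).
call_req = rpc_request(2, "tools/call", {
    "name": "get-golden-signals",
    "arguments": {"service": "checkout-service"},
})

# Illustrative shape of a structured tools/call response.
sample_response = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
        "content": [{"type": "text", "text": "p99 latency 4200ms"}],
        "isError": False,
    },
}


def text_of(response: dict) -> str:
    """Join the text parts of a tools/call result; raise if the tool reported an error."""
    result = response["result"]
    if result.get("isError"):
        raise RuntimeError("tool reported an error")
    return "\n".join(
        part["text"] for part in result["content"] if part["type"] == "text"
    )


print(json.dumps(call_req, indent=2))
print(text_of(sample_response))
```

Because every provider accepts the same two request shapes, the agent side needs one client rather than one per vendor.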
Three Hops: Agent → Gateway → Provider
AGENT RUNTIME
"I need tools for this tenant"
Carries:
• tenant ID
• user token
• provider identifier
HOP 1: Streamable HTTP
INTEGRATION GATEWAY
Resolves: Tenant X has New Relic → route to New Relic's MCP server
Handles:
• Auth exchange (user's OAuth)
• Connection lifecycle
• Pooling, caching, expiry
The heavy lifter. One endpoint, multiplexed across all providers.
HOP 2: MCP Protocol
PROVIDER MCP SERVERS
New Relic, Datadog, Splunk (Sentry greyed out), each → Provider API
HOP 3: Each implements the MCP spec. Returns structured observations.
The agent never talks directly to Splunk or Datadog. It talks to the gateway, which multiplexes across all connected providers.
Inside the Gateway: Four Hard Problems Solved
1. Discovery: "What's available for this tenant?"
   listTools(tenantId, userId) → ["query-nrql", "get-golden-signals", ...]
   Different tenant? Different list. Empty? Skill handles the gap.
2. Routing: "Which server handles this tool?"
   tenant X + provider Y → server Z
   One endpoint, multiplexed by tenant-scoped resource identifier.
3. Auth Delegation: "Whose credentials?"
   Forwards the user's auth context, not platform credentials.
   Expired? → structured auth error → "Re-authenticate New Relic"
4. Tool Wrapping: "Make MCP tools LLM-invocable"
   JSON Schema → LLM function. Read-only tools → auto-execute. Write tools → user confirmation. Names normalized for the LLM.
Four problems. Four modules. This is the checklist for building multi-tenant tool access.
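The tool-wrapping module could look roughly like the sketch below: take an MCP tool definition, normalize its name, pass its JSON Schema through as the function's parameters, and decide whether a call needs user confirmation. It assumes the provider sets the MCP `readOnlyHint` annotation and treats a missing hint as a write tool; the output dictionary layout and the `wrap_tool` name are hypothetical, not a specific LLM vendor's format.

```python
import re


def wrap_tool(provider: str, tool: dict) -> dict:
    """Turn one MCP tool definition into an LLM-invocable function spec."""
    raw_name = f"{provider}_{tool['name']}"
    # Normalize to lowercase letters, digits, and underscores.
    name = re.sub(r"[^a-zA-Z0-9_]", "_", raw_name).lower()
    # Assumption: no hint means we cannot prove the tool is read-only.
    read_only = tool.get("annotations", {}).get("readOnlyHint", False)
    return {
        "name": name,
        "description": tool.get("description", ""),
        "parameters": tool.get("inputSchema", {"type": "object", "properties": {}}),
        "requires_confirmation": not read_only,
        # Kept so the gateway can route the call back to the original tool.
        "route": {"provider": provider, "tool": tool["name"]},
    }


query_tool = wrap_tool("new-relic", {
    "name": "query-nrql",
    "description": "Run an NRQL query",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
    "annotations": {"readOnlyHint": True},
})

unknown_tool = wrap_tool("example-provider", {"name": "update-record"})

print(query_tool["name"], query_tool["requires_confirmation"])
print(unknown_tool["name"], unknown_tool["requires_confirmation"])
```

Defaulting to confirmation when the hint is absent errs on the safe side: a mislabeled read costs the user one click, while a mislabeled write could change tenant data without consent.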
When Things Break: Graceful Degradation in Practice
Gateway routing 4 providers:
✓ New Relic: 200 OK
✓ Datadog: 200 OK
✕ Splunk: Timeout 30s
✕ PagerDuty: 401 Auth
Agent output (with gaps noted):
✓ Metrics from New Relic: p99 latency 4200ms, error rate 23%
✓ Logs from Datadog: "connection pool exhausted" ×847
⚠ Splunk logs unavailable (provider timeout). Analysis based on available sources.
⚠ PagerDuty: Re-authenticate to access on-call data.
Confidence: Medium (2/4 sources). Was High; dropped due to missing sources.
The system doesn't crash. The output notes the gaps. Confidence adjusts. An answer is still produced.
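A rough sketch of how gap notes and an adjusted confidence label might be derived from per-provider outcomes. The thresholds (75% and 50% of sources) are illustrative assumptions; the slides only show that 2 of 4 sources yields Medium. The status strings and function names are likewise hypothetical.

```python
def confidence_label(succeeded: int, attempted: int) -> str:
    """Map source coverage to a label. Thresholds are assumptions, not from the talk."""
    if attempted == 0:
        return "Low"
    ratio = succeeded / attempted
    if ratio >= 0.75:
        return "High"
    if ratio >= 0.5:
        return "Medium"
    return "Low"


def gap_note(provider: str, status: str) -> str:
    """Turn a failure status into a user-facing note."""
    if status == "timeout":
        return f"{provider} unavailable (provider timeout). Analysis based on available sources."
    if status == "auth":
        return f"{provider}: re-authenticate to restore access."
    return f"{provider} unavailable ({status})."


def summarize(statuses: dict) -> dict:
    """Collect gap notes and a confidence line from per-provider statuses."""
    failed = {p: s for p, s in statuses.items() if s != "ok"}
    ok_count = len(statuses) - len(failed)
    label = confidence_label(ok_count, len(statuses))
    return {
        "notes": [gap_note(p, s) for p, s in failed.items()],
        "confidence": f"{label} ({ok_count}/{len(statuses)} sources)",
    }


summary = summarize({
    "New Relic": "ok",
    "Datadog": "ok",
    "Splunk": "timeout",
    "PagerDuty": "auth",
})
print(summary["confidence"])
for note in summary["notes"]:
    print("⚠", note)
```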
Skills

The Strategy Layer Between Prompts and Tools

Where Skills Fit
• PROMPT → too vague: "You are a helpful assistant"
• SKILL → just right: "Investigate an incident end-to-end"
• TOOL → too atomic: "get-metrics(service, window)"
A skill defines:
• What evidence to gather (not which API to call)
• Order: sequential phases, or parallel fan-out
• Gap handling: what to do when a tool returns nothing
• Stop condition: confidence threshold, or all phases complete
• Synthesis: combine observations into a grounded answer
SKILL FLOWCHART (no vendor names): alerts → topology → metrics → logs → changes → history → synthesize
The skill is tool-agnostic. It says "I need metrics" — not "call Datadog". The gateway resolves the how. The skill owns the what and why.
Skills Compose: Like Functions in Code
INVESTIGATION SKILL (orchestrator)
• Phase 1: alert-enrichment skill. Resolve alert context.
• Phase 2: observability-analysis skill. Gather evidence: metrics (primary), logs (primary), traces (best-effort).
• Phase 3: dependency-analysis + change-correlation skill.
• Phase 4: Synthesize → Hypothesis.
Reuse across contexts:
• observability-analysis → used in investigations, health checks, proactive scans
• dependency-analysis → used in incidents and capacity planning
Small, composable, single-responsibility. Like functions in code.
The Execution Loop + Context Ledger
EXECUTION LOOP
• OBSERVE: What do I know? What's missing?
• ACT: Call a tool or sub-skill
• RECORD: Store in ledger + note gaps
• DONE? no → loop; yes ↓
• SYNTHESIZE
CONTEXT LEDGER
Persistent state; grows with each iteration.
iter 1: alert = checkout-service 5xx @ 14:32 UTC
iter 2: topology = depends on payments-db, auth-svc
iter 3: metrics = p99 4200ms (baseline 180ms)
iter 4: logs = "connection pool exhausted" ×847
iter 5: deploy v2.4.1 by @eng-team @ 14:28
gap: traces unavailable (Splunk timeout)
Synthesize reads ONLY from the ledger.
✓ Carries state across tool calls
✓ Survives sub-skill delegation
✓ Enforces evidence-only reasoning
No hallucination. No invention. Only cited evidence.
The ledger is the difference between "an LLM that sometimes calls tools" and "a disciplined investigator that builds a case."
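The observe, act, record loop and the ledger can be sketched in a few lines. This is an assumption-laden toy: `Ledger`, `run_loop`, and `synthesize` are hypothetical names, tools are plain callables, and "synthesis" is just formatting ledger entries. The part that matters is that the final step reads from the ledger and nothing else.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Persistent state for one investigation: recorded facts plus noted gaps."""
    facts: dict = field(default_factory=dict)
    gaps: dict = field(default_factory=dict)


def run_loop(plan: list, tools: dict, ledger: Ledger, max_iters: int = 10) -> Ledger:
    for _ in range(max_iters):
        # OBSERVE: what is still unknown?
        pending = [k for k in plan if k not in ledger.facts and k not in ledger.gaps]
        if not pending:  # DONE?
            break
        key = pending[0]
        tool = tools.get(key)
        if tool is None:
            ledger.gaps[key] = "no tool connected"
            continue
        try:
            ledger.facts[key] = tool()  # ACT, then RECORD
        except Exception as exc:
            ledger.gaps[key] = str(exc)  # record the gap instead of crashing
    return ledger


def synthesize(ledger: Ledger) -> list:
    """Build output lines from ledger entries only."""
    lines = [f"{key}: {value}" for key, value in ledger.facts.items()]
    lines += [f"gap: {key} unavailable ({why})" for key, why in ledger.gaps.items()]
    return lines


def failing_traces():
    raise TimeoutError("Splunk timeout")


tools = {
    "alert": lambda: "checkout-service 5xx @ 14:32 UTC",
    "metrics": lambda: "p99 4200ms (baseline 180ms)",
    "traces": failing_traces,
}
ledger = run_loop(["alert", "metrics", "traces", "logs"], tools, Ledger())
report = synthesize(ledger)
print("\n".join(report))
```

Because the ledger outlives any single tool call, a run interrupted by an auth failure could in principle resume with the same ledger after re-authentication, skipping everything already recorded.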
Production Lessons

What Broke, What Scaled, What Surprised Us

WHAT BROKE
• Tool discovery is slow. Listing tools: 200–800ms per tenant. Every turn = sluggish.
  Fix: Per-tenant tool cache with signature-based invalidation.
• MCP server quality is wildly inconsistent. Some return 50KB JSON blobs. No summary. Not LLM-friendly.
  Fix: Wrapping layer that normalizes, truncates, extracts signal.
• Auth expiry mid-investigation. Tool #5 returns 401. Start over? Resume? Worst UX.
  Fix: Context ledger enables resumability after re-auth.
• "No data" ≠ "error", but LLMs confuse them. Empty results = valid evidence ("nothing anomalous"). LLMs retry.
  Fix: Explicit "absence is information" logic in skill definitions.
We shipped to real enterprise tenants. These are the scars.
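The first fix, a per-tenant tool cache with signature-based invalidation, might be sketched as follows. What goes into the "signature" is an assumption here: a hash over the tenant's connection metadata, so that connecting, removing, or upgrading a provider changes the signature and forces a fresh listing. `ToolCache` and the connection dictionary layout are hypothetical.

```python
import hashlib
import json


class ToolCache:
    """Caches each tenant's tool list until its connection signature changes."""

    def __init__(self, list_tools):
        self._list_tools = list_tools  # the slow call (200-800ms in the talk)
        self._entries = {}
        self.misses = 0

    @staticmethod
    def signature(connections: dict) -> str:
        # Stable hash of the connection metadata; key order must not matter.
        blob = json.dumps(connections, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    def get(self, tenant_id: str, connections: dict) -> list:
        sig = self.signature(connections)
        cached = self._entries.get(tenant_id)
        if cached is not None and cached[0] == sig:
            return cached[1]  # signature unchanged: skip the slow listing
        self.misses += 1
        tools = self._list_tools(tenant_id, connections)
        self._entries[tenant_id] = (sig, tools)
        return tools


def slow_list_tools(tenant_id: str, connections: dict) -> list:
    """Stand-in for the real discovery call through the gateway."""
    return sorted(f"{provider}:tools" for provider in connections)


cache = ToolCache(slow_list_tools)
first = cache.get("tenant-a", {"new-relic": "v1"})
second = cache.get("tenant-a", {"new-relic": "v1"})  # served from cache
third = cache.get("tenant-a", {"new-relic": "v1", "datadog": "v3"})  # invalidated
print(cache.misses, third)
```

Computing the signature has to be much cheaper than listing tools for this to pay off, which is why it is derived from metadata the platform already holds rather than from the providers themselves.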
WHAT SCALED
• Skill composition scales linearly. New domain? Write a sub-skill, register it. No monolith coordination. Teams ship skills independently.
• MCP is genuinely vendor-agnostic. New vendor ships an MCP server → our agent supports it. Zero code change. Zero deployment on our side.
• "Two arms" solves the cold-start problem. Day 1 tenant, zero 3P tools: the agent still works via native tools. As they connect providers, it gets better. Never useless.
• Evidence-only synthesis eliminates hallucination. Synthesize only from the context ledger, not from LLM memory. If it's not in the ledger, it's not in the output.
Three ideas that make it work.
Skills: Domain expertise as composable, tool-agnostic workflows. The what and why.
• Composable sub-skills
• Evidence-only synthesis
• Context ledger
• Gap handling built in
Native Tools: Platform-owned capabilities that guarantee a baseline. The always-available foundation.
• Zero-auth, zero-config
• Solves cold-start
• Context + topology + history
• Day-1 useful
MCP: A protocol for dynamic tool discovery and invocation. The how, resolved at runtime.
• Three-hop architecture
• Gateway: discovery + routing
• Zero-deploy vendor support
• Per-tenant scoping
None novel in isolation. The contribution is in how they compose, and in making them work at enterprise scale, with graceful degradation as a first-class requirement.
Thank you.