The Production Agent Secret Nobody Mentions
Everyone talks about building agents with Claude. Nobody talks about deploying them.
Last week I updated an agent in production. Two hours later, it was processing requests from users who shouldn't have access and generating responses that violated business rules.
The agent worked perfectly on my machine.
In production it was chaos.
What I learned: the difference between an agent that works in development and one that survives in production isn't in the model. It's in three patterns I implemented after that disaster.
The Real Problem: Development Agents vs Production Agents
When you test locally, you're the only user. You control the input. You know what to expect.
In production:
→ Multiple simultaneous users with different permissions
→ Inputs you never imagined
→ API calls that fail at the worst possible moment
→ Costs that scale out of control if something goes wrong
The claude-agent-sdk gives you the tools to build the agent. But you have to implement the production patterns yourself.
Pattern 1: Authentication With User Context
This was my first mistake: treating authentication as a simple API key check.
The common mistake:
```typescript
// ❌ This works in dev, fails in production
const agent = new ClaudeAgent({
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: 'claude-3-5-sonnet-20241022'
});

await agent.chat(userMessage);
```
The problem: all users share the same context. No isolation between sessions.
The pattern that works:
```typescript
// ✅ User context in every call
const agent = new ClaudeAgent({
  apiKey: process.env.ANTHROPIC_API_KEY,
  model: 'claude-3-5-sonnet-20241022'
});

// Inject user context into the system prompt
const systemPrompt = `
User Context:
- User ID: ${user.id}
- Role: ${user.role}
- Permissions: ${user.permissions.join(', ')}
- Organization: ${user.orgId}

You may only access data belonging to this user's organization.
If a request exceeds the user's permissions, respond:
"You don't have permissions for this operation."
`;

await agent.chat(userMessage, {
  systemPrompt,
  metadata: { userId: user.id, orgId: user.orgId }
});
```
The key: user context isn't just for logging. It's so the agent knows what it can and cannot do.
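A prompt instruction alone is guidance, not enforcement. A minimal sketch of the code-level check I'd pair with it (the `UserContext` shape and `assertPermission` helper are illustrative, not part of the SDK): validate the permission before the agent call ever happens, so the prompt is a second layer rather than the only one.

```typescript
// Hypothetical shape of the user context injected above; names are illustrative.
interface UserContext {
  id: string;
  role: string;
  orgId: string;
  permissions: string[];
}

// Enforce the permission in code *before* calling the agent,
// so the system-prompt rule is backup, not the sole defense.
function assertPermission(user: UserContext, required: string): void {
  if (!user.permissions.includes(required)) {
    throw new Error(`Permission denied: "${required}" for user ${user.id}`);
  }
}
```

Call it at the top of the request handler, e.g. `assertPermission(user, 'reports:read')`, and only build the system prompt once it passes.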
Pattern 2: System Prompts As Guardrails
System prompts aren't just for personality. They're your executable business rules.
After the incident, I rewrote my system prompts as explicit guardrails:
```typescript
const guardrails = `
OPERATIONAL BOUNDARIES:
1. Data Access: Only query resources where orgId = "${user.orgId}"
2. Rate Limits: Maximum 10 API calls per user request
3. Sensitive Operations: NEVER delete, update, or modify data without explicit confirmation
4. External Calls: Only call approved APIs: [${APPROVED_APIS.join(', ')}]

IF ANY RULE IS VIOLATED:
- Stop processing immediately
- Log the violation with context
- Return an error message to the user
- DO NOT attempt to continue or "work around" the rule

ERROR RESPONSES:
For auth errors: "You don't have permissions for this operation."
For rate limits: "You've reached the query limit. Try again in a few minutes."
For unsafe operations: "This operation requires explicit confirmation."
`;

const agent = new ClaudeAgent({
  systemPrompt: guardrails,
  // ... rest of config
});
```
Why it works:
LLMs are good at following explicit instructions. If you define your business rules as system instructions, the model respects them in most cases.
It's not perfect (no system is), but it's better than relying on post-hoc validations.
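Since the model can still slip, I'd keep one post-hoc check in code for the rule that matters most. A hedged sketch of a backstop for boundary #1 (the `enforceOrgScope` helper is my own illustration, not SDK functionality): every row a tool returns is verified against the caller's organization before the agent ever sees it.

```typescript
// Code-level backstop for guardrail #1: any row whose orgId doesn't match
// the caller's organization is treated as a violation.
function enforceOrgScope<T extends { orgId: string }>(
  rows: T[],
  orgId: string
): T[] {
  const leaked = rows.filter((row) => row.orgId !== orgId);
  if (leaked.length > 0) {
    // Stop immediately, mirroring the "IF ANY RULE IS VIOLATED" instructions.
    throw new Error(
      `Org scope violation: ${leaked.length} row(s) outside org ${orgId}`
    );
  }
  return rows;
}
```

Run tool results through this before handing them back to the model; the system prompt then only has to cover the cases your code can't.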
Pattern 3: Error Handling With Gradual Fallbacks
This was the hardest one to learn.
The first version of my agent had a generic try-catch. If something failed, the user saw "Error processing your request."
Useful, right?
The gradual fallback pattern:
```typescript
async function executeAgentWithFallbacks(
  agent: ClaudeAgent,
  userMessage: string,
  context: UserContext
) {
  // Level 1: Try the complete response
  try {
    return await agent.chat(userMessage, {
      systemPrompt: buildSystemPrompt(context),
      maxTokens: 4096
    });
  } catch (error: any) {
    // Level 2: Try a reduced response
    if (error.type === 'rate_limit_error') {
      await delay(1000);
      try {
        return await agent.chat(userMessage, {
          systemPrompt: buildSystemPrompt(context),
          maxTokens: 1024 // Reduce tokens
        });
      } catch (retryError) {
        // Level 3: Fallback response
        return {
          content: `Your query is queued. We'll notify you when it's ready.`,
          queued: true,
          queueId: await queueRequest(userMessage, context)
        };
      }
    }

    // Level 4: Specific error by type
    if (error.type === 'invalid_request_error') {
      logError('Invalid request', { error, userMessage, context });
      return {
        content: `I couldn't process that request. Can you rephrase it?`,
        suggestion: await getSuggestion(userMessage)
      };
    }

    // Level 5: Generic error with useful context
    logError('Agent execution failed', { error, userMessage, context });
    return {
      content: `Something went wrong. We're already investigating. Meanwhile, you can: [alternative options]`,
      supportId: generateSupportId()
    };
  }
}
```
The philosophy:
Each fallback level tries to give the user the best possible experience given the circumstances.
It's not just "return an error." It's graceful degradation.
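The fallback code leans on a `delay` helper it never defines. A minimal sketch of it, plus the usual refinement over the fixed 1-second wait: exponential backoff with jitter. The `withBackoff` wrapper is my own assumption here, not part of any SDK.

```typescript
// Promise-based sleep used by the retry path above.
function delay(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Retry a flaky async call with exponential backoff plus random jitter,
// so many clients retrying at once don't hammer the API in lockstep.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Wait baseMs, 2*baseMs, 4*baseMs, ... plus up to baseMs of jitter.
      await delay(baseMs * 2 ** attempt + Math.random() * baseMs);
    }
  }
  throw lastError;
}
```

Wrapping the Level 1 call in `withBackoff(() => agent.chat(...))` would fold the retry logic out of the fallback function entirely.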
What Really Changed
After implementing these three patterns:
→ Zero security incidents related to unauthorized data access
→ Visible errors dropped because fallbacks handle most edge cases
→ Predictable costs because guardrails limit excessive calls
→ Happier users because errors include useful context
The Real Learning
What nobody tells you about production agents:
It's not enough that it works. It has to work when:
- The user does something unexpected
- Claude's API hits a rate limit
- Two users access the same resource simultaneously
- Someone tries to exploit the agent
Production patterns aren't about making code "better." They're about making code robust against the real world.
Your Checklist Before Deploying
Before you push your next agent to production, ask yourself:
1. Authentication: Does each request have user context? Are permissions validated?
2. Guardrails: Are your business rules in the system prompt? Are they explicit?
3. Error Handling: Do you have fallbacks for each common error type? Are errors useful to the user?
If the answer to any is "no", don't deploy yet.
Learn from my mistakes. Implement these patterns from day one.
Your future self (and your users) will thank you.
