April 5, 20265 min

Most MCP Servers Break in Production. The 5 Failure Modes I Found.

400+ MCP servers exist. Most work in demos and fail under real load. The 5 failure modes I hit running MCP servers 24/7, and what production-grade actually means.

MCPInfrastructureProduction

There are over 400 MCP servers listed in public directories. I've tried dozens. Most work perfectly in a demo: you connect, call a tool, get a result. Ship that to a cron job running at 3am and watch it fail in ways the README never mentioned.

After running 4+ MCP servers continuously for months, here are the five failure modes that actually matter.

Failure Mode Severity (Production Impact)

Transport Timeouts

No Rate Limiting

Auth Missing

Useless Errors

No Health Checks

Production MCP Stack

Request→Auth Middleware→Rate Limiter→Tool Handler→Structured Error→Health Check

1. Transport Timeouts Kill Silent

The MCP spec supports stdio and HTTP transports. Stdio is simple: launch a process, pipe JSON. HTTP is flexible: connect to a running server. Both have the same problem: no standard timeout handling.

Your agent calls a tool. The MCP server makes an external API request. That API is slow today. 30 seconds pass. The agent's context window is burning tokens waiting. There's no timeout. No heartbeat. No way to know if the server is thinking or dead.

The fix: wrap every tool handler in a timeout. 10 seconds for local operations, 30 for external APIs. Return a structured error on timeout, not silence. Your agent can retry or fall back. Silence is the worst possible response.

2. Auth Is an Afterthought

Most open-source MCP servers have no authentication. Connect and you have full access. That's fine for local development. It's a security hole in any shared environment.

The pattern that works: middleware that validates an API key or OAuth2 token before the request reaches any tool handler. One function, applied to every route. If the token is missing or invalid, return 401 before any tool logic executes.

This isn't complex. It's 20 lines of code. But it needs to exist before the server goes anywhere near a network.

3. No Rate Limiting, No Cost Control

MCP servers that call external APIs (LLMs, databases, third-party services) have no built-in rate limiting. An agent in a loop can call the same tool hundreds of times per minute. Each call might cost money. Each call might hit a rate limit on the external service and start returning errors.

The fix: per-tool rate limits with a sliding window. Track calls per key per minute. Return 429 when exceeded. Log usage for cost tracking. This is standard HTTP middleware; it just doesn't exist in most MCP implementations.

4. Error Messages Are Useless

A typical MCP server error: "Tool execution failed." No context. No error code. No indication of whether it's transient or permanent. The agent has no information to decide between retry, fallback, or abort.

Production errors need structure: an error code, a human-readable message, whether it's retryable, and what the agent should do next. { "error": "RATE_LIMITED", "message": "External API rate limit exceeded", "retryable": true, "retryAfter": 60 } gives the agent everything it needs.

5. No Health Checks, No Observability

MCP servers run as background processes. When they crash, nothing notices. The agent's next tool call fails. The agent might retry, or it might report the task as impossible. Either way, you find out hours later when you check logs.

A health endpoint (/health or a periodic stdio ping) lets monitoring catch crashes in seconds. Structured logging with request IDs lets you trace a failed tool call back to the specific error. These are basic operational requirements that most MCP servers skip entirely.

What Production-Grade Means

It means the server handles the unhappy paths: timeouts, auth failures, rate limits, bad inputs, crashes. It means structured errors that agents can act on. It means health checks that monitoring can watch. It means deployment configs that actually work in Docker and systemd.

The MCP Server Starter Kit gets you from zero to running. The Production MCP Server Kit gets you from running to reliable. Both are on the products page.

The gap between "works in a demo" and "runs unattended at 3am" is where most MCP servers live. Closing that gap isn't glamorous work. But it's the work that matters.