When an MCP tool fails, what happens next depends entirely on how you communicate the failure. A generic “Error occurred” message leaves the AI helpless. A structured error with context, recovery options, and retry guidance lets the AI handle the failure intelligently — sometimes without the user even noticing.
This module covers five error handling patterns that turn failures from dead ends into recoverable situations.
Structured Error Responses
The MCP SDK provides an `isError` flag on tool results, but that alone is not enough. The AI needs to understand what went wrong, why, and what to do about it.
```typescript
// BAD: Unhelpful error
return {
  content: [{ type: "text", text: "Failed to create record" }],
  isError: true,
};

// GOOD: Structured error with recovery guidance
return {
  content: [{
    type: "text",
    text: JSON.stringify({
      error: {
        code: "DUPLICATE_ENTRY",
        message: "A record with this email already exists",
        field: "email",
        value: "[email protected]",
        suggestion: "Use the update action instead of create, or use a different email",
        existing_id: "rec_abc123",
      },
    }, null, 2),
  }],
  isError: true,
};
```

A structured error should include:
- Error code — machine-readable category (`DUPLICATE_ENTRY`, `NOT_FOUND`, `RATE_LIMITED`)
- Human message — what happened in plain language
- Context — which field, which value, which constraint
- Suggestion — what the AI should try instead
- Related data — IDs, existing records, alternative options
The `suggestion` field is the most important. It turns the AI from “I encountered an error” into “That email is taken, let me update the existing record instead.”
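To keep these five fields consistent across every tool, it can help to centralize error construction in one small helper. A minimal sketch, assuming nothing beyond plain TypeScript — the `structuredError` helper and its field names are illustrative, not part of the MCP SDK:

```typescript
// Illustrative helper, not part of the MCP SDK.
interface StructuredError {
  code: string;             // machine-readable category
  message: string;          // plain-language description of what happened
  suggestion: string;       // what the AI should try instead
  [extra: string]: unknown; // context: field, value, related IDs, etc.
}

// Build a tool result carrying a structured error payload.
function structuredError(err: StructuredError) {
  return {
    content: [{
      type: "text" as const,
      text: JSON.stringify({ error: err }, null, 2),
    }],
    isError: true as const,
  };
}

// Usage: every error path in every tool produces the same shape.
const result = structuredError({
  code: "NOT_FOUND",
  message: "No record with id rec_999",
  suggestion: "List records first to find a valid id",
  id: "rec_999",
});
```

Funneling every failure through one constructor also makes it easy to enforce that no error path ships without a `suggestion`.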
Graceful Degradation
Not every failure is total. If a tool fetches data from three sources and one fails, should the whole operation fail? Usually no. Return what you have and note what is missing.
```typescript
server.tool("get_project_overview", {
  project_id: z.string(),
}, async ({ project_id }) => {
  const results: Record<string, unknown> = {};
  const warnings: string[] = [];

  // Always succeeds (local DB)
  results.project = await getProject(project_id);

  // External API — might fail
  try {
    results.github = await getGithubStats(project_id);
  } catch (e) {
    warnings.push("GitHub stats unavailable — API rate limited. Data shown without GitHub metrics.");
    results.github = null;
  }

  // Another external API — might fail
  try {
    results.deployments = await getDeploymentHistory(project_id);
  } catch (e) {
    warnings.push("Deployment history unavailable — service timeout.");
    results.deployments = null;
  }

  return {
    content: [{
      type: "text",
      text: JSON.stringify({
        ...results,
        _warnings: warnings.length > 0 ? warnings : undefined,
        _completeness: warnings.length === 0 ? "full" : "partial",
      }, null, 2),
    }],
    // NOT isError — we have partial results
  };
});
```

The `_completeness` field tells the AI whether it has everything or is working with partial data. The `_warnings` array explains what is missing and why. The AI can then decide whether to proceed with partial data or inform the user.
Retry-Friendly Responses
Some errors are transient — rate limits, network blips, temporary outages. Your error response should tell the AI whether retrying makes sense.
```typescript
// Rate limited — tell the AI when to retry
return {
  content: [{
    type: "text",
    text: JSON.stringify({
      error: {
        code: "RATE_LIMITED",
        message: "GitHub API rate limit exceeded",
        retryable: true,
        retry_after_seconds: 60,
        suggestion: "Wait 60 seconds and try again, or reduce the scope of the request",
      },
    }, null, 2),
  }],
  isError: true,
};

// Permanent failure — don't waste time retrying
return {
  content: [{
    type: "text",
    text: JSON.stringify({
      error: {
        code: "INVALID_CREDENTIALS",
        message: "API key is invalid or expired",
        retryable: false,
        suggestion: "The API key needs to be updated in the server configuration. This cannot be fixed through retry.",
      },
    }, null, 2),
  }],
  isError: true,
};
```

Three fields for retry guidance:
- `retryable` — boolean, is retrying worth attempting?
- `retry_after_seconds` — how long to wait (for rate limits)
- `max_retries` — optional, how many times to try before giving up
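From the caller's side, these fields make retrying mechanical. A hypothetical sketch of how an agent, or a wrapper around tool calls, might honor them — the `callWithRetry` helper and the error shape it reads are assumptions, not part of any SDK:

```typescript
// Shape of the error payload the wrapper inspects (illustrative).
interface RetryHint {
  code: string;
  retryable?: boolean;
  retry_after_seconds?: number;
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Retry a call only when the error says retrying is worthwhile.
async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (e) {
      const hint = e as RetryHint;
      if (!hint.retryable || attempt >= maxRetries) throw e;
      // Honor the server's wait hint; fall back to a 1-second default.
      await sleep((hint.retry_after_seconds ?? 1) * 1000);
    }
  }
}

// Demo: fails twice with a retryable error, then succeeds.
let calls = 0;
const flaky = async () => {
  calls++;
  if (calls < 3) {
    throw { code: "RATE_LIMITED", retryable: true, retry_after_seconds: 0 };
  }
  return "ok";
};
```

Note that a `retryable: false` error falls straight through to the caller, which is exactly the behavior the `INVALID_CREDENTIALS` example above is asking for.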
Circuit Breakers
When an external dependency is down, you don't want every tool call to wait for a timeout. A circuit breaker tracks failures and “opens” after a threshold, immediately returning an error without attempting the call.
```typescript
class CircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private state: "closed" | "open" | "half-open" = "closed";

  constructor(
    private threshold: number = 5,
    private resetTimeout: number = 30000,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.lastFailure > this.resetTimeout) {
        this.state = "half-open";
      } else {
        throw new Error("Circuit breaker is open — service unavailable");
      }
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch (e) {
      this.failures++;
      this.lastFailure = Date.now();
      if (this.failures >= this.threshold) {
        this.state = "open";
      }
      throw e;
    }
  }
}

// Usage in a tool
const githubBreaker = new CircuitBreaker(3, 60000);

server.tool("get_repo_info", { repo: z.string() }, async ({ repo }) => {
  try {
    const info = await githubBreaker.call(() => fetchRepoInfo(repo));
    return { content: [{ type: "text", text: JSON.stringify(info, null, 2) }] };
  } catch (e) {
    return {
      content: [{
        type: "text",
        text: JSON.stringify({
          error: {
            code: "SERVICE_UNAVAILABLE",
            message: "GitHub API is currently unavailable",
            retryable: true,
            retry_after_seconds: 60,
            suggestion: "The GitHub API has been failing repeatedly. Try again in a minute.",
          },
        }, null, 2),
      }],
      isError: true,
    };
  }
});
```

Timeout Handling
Long-running operations need timeout protection. Without it, the AI (and the user) wait indefinitely for a response that may never come.
```typescript
function withTimeout<T>(
  promise: Promise<T>,
  ms: number,
  context: string,
): Promise<T> {
  return Promise.race([
    promise,
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error(
        `Operation timed out after ${ms}ms: ${context}`,
      )), ms),
    ),
  ]);
}

server.tool("analyze_codebase", {
  repo_path: z.string(),
}, async ({ repo_path }) => {
  try {
    const analysis = await withTimeout(
      runCodeAnalysis(repo_path),
      30000,
      "codebase analysis",
    );
    return { content: [{ type: "text", text: JSON.stringify(analysis, null, 2) }] };
  } catch (e) {
    // Narrow the unknown catch value before reading .message
    if (e instanceof Error && e.message.includes("timed out")) {
      return {
        content: [{
          type: "text",
          text: JSON.stringify({
            error: {
              code: "TIMEOUT",
              message: "Analysis took longer than 30 seconds",
              retryable: true,
              suggestion: "Try analyzing a smaller scope — a single directory instead of the whole repo",
            },
          }, null, 2),
        }],
        isError: true,
      };
    }
    throw e;
  }
});
```

Always set timeouts on external calls and provide actionable suggestions when they trigger. “Try a smaller scope” is more useful than “try again.”
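One caveat with the plain `Promise.race` approach: the timer keeps running even after the real promise settles, which can hold a Node process open and wastes a pending callback per call. A variant that clears the timer either way, sketched here as an alternative (the `withTimeoutCleared` name is ours, not from the original code):

```typescript
function withTimeoutCleared<T>(
  promise: Promise<T>,
  ms: number,
  context: string,
): Promise<T> {
  // Definite-assignment: the Promise executor runs synchronously,
  // so `timer` is set before race() is reached.
  let timer!: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Operation timed out after ${ms}ms: ${context}`)),
      ms,
    );
  });
  // Whichever settles first wins; the timer is cleared in both cases.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Also worth remembering: the timeout only abandons the wait, it does not cancel the underlying operation. For calls that support it, passing an `AbortSignal` through to the operation cancels the work itself.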
Exercise: Add Error Handling to a Weather Server
You have a weather MCP server with a `get_forecast` tool that calls a weather API. Currently it throws raw errors on failure.
Your challenge:
- Add structured errors for: invalid city name, API rate limit, API key expired, network timeout
- Implement graceful degradation: if the 7-day forecast fails, return the 3-day forecast from cache
- Add a circuit breaker for the weather API (threshold: 3 failures, reset: 2 minutes)
- Add timeout handling (5 second timeout)
- For each error type, write the `suggestion` field that helps the AI recover
Check Your Understanding
- What five fields should a structured error response include?
- When should you return partial results vs a full error?
- Why is the `retryable` field important for AI agents?
- Explain the three states of a circuit breaker and when transitions happen.
- A tool calls two APIs in parallel and one times out. Should you return an error or partial results? Why?
Key Takeaway
Error handling in MCP is not about catching exceptions — it is about giving the AI enough information to recover. Structure your errors with codes, context, and suggestions. Degrade gracefully when you can. Tell the AI whether to retry. Protect against cascading failures with circuit breakers. The difference between a fragile MCP server and a resilient one is how it behaves when things go wrong.