MCP server SLO

A service level objective (SLO) turns an uptime percentage into a commitment with a concrete error budget: an amount of downtime you're allowed to spend before violating the SLO. Without an SLO, "we want high uptime" is a vague aspiration. With an SLO of 99.9%, you have 43.8 minutes of error budget per month — spend it on planned maintenance, use some to take risks on deployments, and alert when burn rate threatens to exhaust it before month end.

TL;DR

Start with a 99.5% SLO for an indie MCP server (3.65 hours/month error budget — realistic for a side project). Move to 99.9% (43.8 min/month) once the server is relied on by external users. Reserve 99.99% (4.4 min/month) for commercial MCP services with paying customers. Measure the SLO using probe data: error budget = total probes × (1 − SLO target); remaining budget = error budget − failed probes. Alert when burn rate exceeds 5× sustainable (exhausts budget in <6 days). Review the SLO monthly against your probe history. AliveMCP tracks error budget consumption automatically on Team tier ($49/mo).

SLO vs. SLA vs. uptime target

These three terms are often used interchangeably but mean different things:

Uptime target: an internal goal ("we want 99.9% uptime"). No formal measurement methodology, no consequence for missing it. The weakest form of an availability commitment.
SLO (Service Level Objective): an internal commitment with a formal measurement methodology, an error budget, and a defined consequence for breach (typically: freeze new features until the error budget recovers, post-incident review required). Strong enough for most engineering teams. No external obligation.
SLA (Service Level Agreement): an external contractual commitment with financial consequences for breach (service credits, refunds, contract penalties). SLAs are typically set at a lower threshold than SLOs — if your SLO is 99.9%, your SLA should be 99.5% to leave headroom. Breaching your SLO triggers an internal review; breaching your SLA triggers a refund to a customer.

Most MCP server operators need an SLO, not an SLA. Unless you have contractual uptime commitments to customers, an SLA creates legal liability without additional engineering benefit over a well-defined SLO.

Choosing an SLO target

The right SLO target depends on how the MCP server is used and what the consequences of downtime are:

99.0% (7.3 hours/month error budget)

Appropriate for: experimental, development, or internal-only MCP servers. This target allows for multiple multi-hour outages per month — acceptable for a server under active development where reliability is not yet the priority. You can deploy multiple times per day without worrying about the error budget if each deployment causes a 1–2 minute restart.

99.5% (3.65 hours/month error budget)

Appropriate for: public indie MCP servers with low user traffic, early-stage commercial servers, servers where users understand and accept experimental availability. You have about 3.65 hours of downtime budget per month, enough for weekly maintenance windows (30 minutes each), a few deployment hiccups, and one partial-hour incident per month.

99.9% (43.8 minutes/month error budget)

Appropriate for: production MCP servers relied on by external users or teams, commercial MCP services on a paid plan, servers that power agent workflows that run continuously. At 43.8 minutes/month, you have room for roughly 2–3 deployment windows and almost none for unplanned outages. Any deployment that causes more than 10 minutes of downtime spends at least 23% of your monthly budget. Zero-downtime deployment practices become important at this tier.

99.99% (4.4 minutes/month error budget)

Appropriate for: commercial MCP services with SLA-backed contracts, critical infrastructure MCP servers powering high-stakes agent workflows. At 4.4 minutes/month, you have essentially no budget for any unplanned downtime. Hot standby, automatic failover, multi-region deployment, and rigorous deployment rehearsal are required to operate at this level.

Matching the SLO to the dependency chain

Your MCP server's SLO should be achievable given the SLOs of its dependencies. If your MCP server calls a downstream API that itself has a 99.5% uptime guarantee, you cannot credibly commit to 99.9% without error handling that degrades gracefully when the downstream API is unavailable. Dependency-chain SLO math: if every request needs all of N dependencies, each with availability A and failing independently, your server's theoretical maximum availability is A^N. Two 99.9% dependencies give 0.999² ≈ 0.998, a theoretical max of 99.8% for the combined stack.
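The dependency-chain arithmetic can be sketched in a few lines of Python. This is a minimal illustration that assumes every dependency is required for every request and that failures are independent; the function name is hypothetical.

```python
from math import prod


def max_serial_availability(dependency_availabilities: list[float]) -> float:
    """Upper bound on availability when every dependency must be up at once."""
    return prod(dependency_availabilities)


stack = [0.999, 0.999]  # two hard dependencies at 99.9% each
print(f"{max_serial_availability(stack):.2%}")  # roughly 99.8%
```

Dependencies with different availabilities go in the same list. A dependency your server can degrade around gracefully should be left out, since it no longer caps the result.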

Error budget calculation

Error budget is the concrete operationalization of an SLO. Rather than thinking "we want 99.9% uptime," you think "we have 43.8 minutes of downtime to spend this month."
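To see where the monthly figures come from, here is a small sketch that converts an SLO target into minutes of allowed downtime. It assumes an average-length month (365.25 / 12 days), which is what yields 43.8 minutes at 99.9%; the constant and function names are illustrative.

```python
# Average month length in minutes: 365.25 days / 12 months.
MINUTES_PER_AVG_MONTH = 365.25 / 12 * 24 * 60  # 43,830 minutes


def monthly_budget_minutes(slo_target: float) -> float:
    """Allowed downtime per average month for an SLO given as a fraction (0.999)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * MINUTES_PER_AVG_MONTH


for target in (0.99, 0.995, 0.999, 0.9999):
    minutes = monthly_budget_minutes(target)
    print(f"{target:.2%}: {minutes:.1f} min ({minutes / 60:.2f} h)")
```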

Calculating error budget from probe data

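At 60-second probe cadence, there are 43,200 probes per 30-day month. (The 43.8-minute figure quoted for 99.9% elsewhere in this guide assumes an average-length month of about 30.44 days; a 30-day window gives the slightly smaller numbers below.)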

99.9% SLO: 43,200 × 0.001 = 43.2 probe failures allowed per month ≈ 43.2 minutes of downtime.
99.5% SLO: 43,200 × 0.005 = 216 probe failures allowed ≈ 3.6 hours.
99.0% SLO: 43,200 × 0.01 = 432 probe failures ≈ 7.2 hours.

Remaining budget at any point in the month: budget_remaining = budget_total − failed_probes_month_to_date. When budget_remaining drops below zero, the SLO has been breached for the window. (The window resets at the start of the next calendar month, or rolls forward continuously on a 30-day basis, depending on how you define it.)
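The probe-based calculation can be sketched as a small helper. This assumes a fixed probe cadence and a fixed-length window; `ProbeBudget` and its field names are hypothetical, not part of any monitoring product's API.

```python
from dataclasses import dataclass


@dataclass
class ProbeBudget:
    slo_target: float          # e.g. 0.999
    cadence_seconds: int = 60  # one probe per minute
    window_days: int = 30

    @property
    def total_probes(self) -> int:
        return self.window_days * 86_400 // self.cadence_seconds

    @property
    def budget_total(self) -> float:
        """Probe failures allowed in the window."""
        return self.total_probes * (1.0 - self.slo_target)

    def remaining(self, failed_probes: int) -> float:
        """Budget left after the failures so far; negative means breached."""
        return self.budget_total - failed_probes


budget = ProbeBudget(slo_target=0.999)
print(budget.total_probes, round(budget.budget_total, 1), round(budget.remaining(10), 1))
```

With a different cadence, each failed probe stands for `cadence_seconds` of assumed downtime, so the same failure count maps to a different number of minutes.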

Calendar-month vs. rolling 30-day window

Calendar-month SLOs are common because they align with billing and reporting cycles. A rolling 30-day window is more continuous: an outage on the last day of the month keeps counting against the budget for the next 30 days instead of disappearing at midnight on the 1st. For operational purposes, a rolling window gives cleaner signal; for external SLA reporting, calendar months are easier to communicate to customers. Pick one and document it.
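The difference between the two windows shows up most clearly just after a month boundary. The sketch below counts failed probes both ways from a list of failure timestamps; the function names and the sample dates are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def failures_calendar_month(failures: list[datetime], now: datetime) -> int:
    """Failed probes since 00:00 on the 1st of the current month."""
    month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return sum(1 for t in failures if month_start <= t <= now)


def failures_rolling(failures: list[datetime], now: datetime, days: int = 30) -> int:
    """Failed probes in the trailing `days`-day window."""
    window_start = now - timedelta(days=days)
    return sum(1 for t in failures if window_start < t <= now)


# A 20-minute outage on 30 June, evaluated on 2 July.
now = datetime(2025, 7, 2, 12, 0, tzinfo=timezone.utc)
outage = [datetime(2025, 6, 30, 3, m, tzinfo=timezone.utc) for m in range(20)]
print(failures_calendar_month(outage, now), failures_rolling(outage, now))  # prints "0 20"
```

The calendar-month count has already forgotten the outage, while the rolling count still charges all 20 failed probes against the budget.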

Burn rate alerting

Error budget burn rate is the ratio of your current error rate to the "sustainable" error rate that would exactly exhaust your budget by month end.

Sustainable burn rate: if your SLO is 99.9% and your month has 43,200 probe slots, you can afford 43.2 failures spread evenly, which is 1.44 failures per day. If you're currently failing 7 probes per day, your burn rate is 7 / 1.44 ≈ 4.9× sustainable. At 4.9×, a full budget lasts 30 / 4.9 ≈ 6 days.
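The same calculation as a sketch in Python, assuming burn rate is computed as the observed failure fraction over some lookback period divided by the failure fraction the SLO allows. The function names are illustrative.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (failed / total) / (1.0 - slo_target)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full, untouched budget is gone at a constant burn rate."""
    return float("inf") if rate <= 0 else window_days / rate


# 7 failed probes out of the 1,440 sent in one day at 60-second cadence.
rate = burn_rate(failed=7, total=1440, slo_target=0.999)
print(f"{rate:.1f}x sustainable, budget gone in {days_to_exhaustion(rate):.1f} days")
```

Note that `days_to_exhaustion` assumes the budget is still full. If part of it is already spent, scale the result by the remaining fraction.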

Recommended burn rate alert thresholds

P1 alert: burn rate ≥ 14× over 1 hour. At 14× burn rate, you exhaust a 99.9% monthly budget in 2.1 days. A short-window spike at 14× is a major incident, not gradual degradation.
P2 alert: burn rate ≥ 5× over 6 hours. Sustained 5× burn rate exhausts the budget in 6 days. Fire a P2 to review — this isn't emergency-page-at-2am urgency, but it requires investigation before end of business.
P3 alert: burn rate ≥ 2× over 3 days. Slow burn. Sustained for the whole window, 2× exhausts the budget in 15 days, so it is not urgent but should not be ignored. Track it in your SLO review meeting.
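The three thresholds above can be expressed as a small lookup, sketched here under the assumption that you already compute a burn rate per lookback window (1 hour, 6 hours, 72 hours). The table layout and function name are hypothetical.

```python
from typing import Optional

# (severity, minimum burn rate, lookback window in hours), most urgent first.
THRESHOLDS = [
    ("P1", 14.0, 1),
    ("P2", 5.0, 6),
    ("P3", 2.0, 72),
]


def classify_alert(burn_by_window_hours: dict[int, float]) -> Optional[str]:
    """Return the most urgent severity whose window meets its threshold."""
    for severity, min_rate, window_hours in THRESHOLDS:
        if burn_by_window_hours.get(window_hours, 0.0) >= min_rate:
            return severity
    return None


# A brief spike that has faded (1 h at 3x) on top of a sustained 5.5x over 6 h.
print(classify_alert({1: 3.0, 6: 5.5, 72: 2.4}))  # prints "P2"
```

Checking the most urgent rule first means a single incident raises one alert at the highest applicable severity rather than three.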

See MCP server error rate for the full burn rate calculation formula and alert wiring.

SLO measurement for MCP's four-layer protocol

Standard SLOs measure "was the server available" as a binary. For MCP, "available" has four distinct meanings corresponding to the four protocol layers:

Transport SLO: TCP connection succeeds within the probe timeout. Minimum bar for any availability claim.
HTTP SLO: server responds with a non-5xx HTTP status code. Measures the application layer, not just network reachability.
Initialize SLO: JSON-RPC initialize method returns a valid result. Measures the MCP protocol layer specifically.
Tool surface SLO: tools/list returns ≥1 valid tool definition. Measures whether the server is actually functional for agent use, not just alive.

The strictest SLO definition uses tool surface availability — the server is only "available" if an agent can actually discover and use tools. A looser definition uses initialize availability. Document which layer your SLO measures. Most operators use initialize availability as the canonical SLO signal (the server speaks MCP) with a separate tracking metric for tool surface availability.
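One way to make the chosen layer explicit is to record how far each probe got and compare that against the layer the SLO is defined on. The sketch below only classifies results that a prober has already collected; `ProbeResult`, `Layer`, and their fields are hypothetical names, and no MCP client library is used.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Layer(IntEnum):
    NONE = 0
    TRANSPORT = 1   # TCP connection succeeded
    HTTP = 2        # non-5xx status code
    INITIALIZE = 3  # JSON-RPC initialize returned a valid result
    TOOLS = 4       # tools/list returned at least one valid tool


@dataclass
class ProbeResult:
    tcp_ok: bool
    http_status: Optional[int]
    initialize_ok: bool
    tool_count: int


def deepest_layer(p: ProbeResult) -> Layer:
    """Highest layer reached; a failure at one layer fails every layer above it."""
    if not p.tcp_ok:
        return Layer.NONE
    if p.http_status is None or p.http_status >= 500:
        return Layer.TRANSPORT
    if not p.initialize_ok:
        return Layer.HTTP
    if p.tool_count < 1:
        return Layer.INITIALIZE
    return Layer.TOOLS


def is_available(p: ProbeResult, slo_layer: Layer = Layer.INITIALIZE) -> bool:
    """Whether this probe counts as a success for an SLO defined on slo_layer."""
    return deepest_layer(p) >= slo_layer


empty_tools = ProbeResult(tcp_ok=True, http_status=200, initialize_ok=True, tool_count=0)
print(is_available(empty_tools), is_available(empty_tools, Layer.TOOLS))  # prints "True False"
```

The same probe is a success under an initialize-layer SLO and a failure under a tool-surface SLO, which is why the measured layer needs to be documented.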

See MCP server health check for the full four-layer probe sequence.

Monthly SLO review

An SLO without a review process is just a number on a wiki page. Monthly SLO reviews convert monitoring data into reliability improvements:

Pull the month's probe data. Total probes, failed probes by layer, error budget consumed, burn rate peaks.
Identify the top-3 error contributors. Which incident or recurring pattern consumed the most error budget? A single 2-hour outage? Recurring cold-start failures at 2am? A tools/list flap that lasted 3 days at low error rate?
Post-mortems for budget-consuming incidents. Any incident that consumed >10% of the monthly error budget deserves a brief written post-mortem: what happened, why it happened, what changed to prevent recurrence. See MCP server incident response.
SLO target review. If you burned >80% of your budget, consider whether the SLO is achievable or needs adjusting (downward) until reliability improves. If you burned <10% consistently, consider tightening the SLO (upward) to raise the bar.
Infrastructure reliability investments. If the same failure mode appears in multiple months, invest in a fix: zero-downtime deployment if deployments cause budget consumption; cold-start suppression if serverless idle timeouts dominate; multi-region probing if false positives are consuming budget.
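Parts of this review can be mechanized. The sketch below takes the month's budget and the failed probes attributed to each incident, then applies the 10% post-mortem rule and the 80% / 10% target-review rules from the list above. The function name, incident labels, and message strings are illustrative.

```python
def review_actions(budget_total: float, failed_by_incident: dict[str, int]) -> list[str]:
    """Turn one month's budget consumption into review agenda items."""
    actions = []
    # Largest consumers first; anything above 10% of the budget gets a post-mortem.
    ranked = sorted(failed_by_incident.items(), key=lambda kv: kv[1], reverse=True)
    for name, failed in ranked:
        if failed / budget_total > 0.10:
            actions.append(f"post-mortem: {name}")
    consumed = sum(failed_by_incident.values()) / budget_total
    if consumed > 0.80:
        actions.append("SLO review: consider loosening the target")
    elif consumed < 0.10:
        actions.append("SLO review: consider tightening if this persists")
    return actions


month = {"deploy rollback": 30, "tools/list flap": 6, "cold start": 2}
for item in review_actions(43.2, month):
    print(item)
```

Deciding whether low consumption is "consistent" takes several months of history, so the tightening suggestion here is only a prompt for the meeting, not a decision.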

Related questions

Should I make my MCP server's SLO public?

Publishing your SLO target (for example, on your server's status page or README) builds user trust and creates accountability. But only publish an SLO you're confident you can meet — publishing 99.9% and then breaching it for 3 consecutive months erodes trust faster than publishing nothing at all. A good starting point: publish your rolling 30-day uptime on the status page without formally committing to a target. Once you have 3 months of data showing consistent 99.9%+, formalize it as an SLO. See MCP server status page for how to expose availability data publicly.

How do I handle planned maintenance in SLO calculations?

Planned maintenance during a registered maintenance window is typically excluded from SLO calculation — it's "scheduled downtime," not a reliability failure. Register maintenance windows in advance, exclude probe failures during those windows from your error budget calculation, and document the exclusion policy clearly. Unplanned downtime that coincidentally occurs during a maintenance window still counts against the SLO — only proactively registered windows get the exclusion. This prevents gaming the SLO by retroactively declaring maintenance after an outage.
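The "registered in advance" rule can be enforced in code rather than by convention. This sketch assumes each window records when it was registered and excludes a probe failure only if the window was registered before it started; the class, helper, and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class MaintenanceWindow:
    start: datetime
    end: datetime
    registered_at: datetime  # when the window was announced


def counts_against_slo(failure_time: datetime, windows: list[MaintenanceWindow]) -> bool:
    """False only if the failure falls inside a window registered before it began."""
    for w in windows:
        if w.registered_at < w.start and w.start <= failure_time < w.end:
            return False
    return True


def utc(day: int, hour: int, minute: int = 0) -> datetime:
    return datetime(2025, 7, day, hour, minute, tzinfo=timezone.utc)


planned = MaintenanceWindow(start=utc(5, 2), end=utc(5, 2, 30), registered_at=utc(1, 9))
retroactive = MaintenanceWindow(start=utc(8, 14), end=utc(8, 15), registered_at=utc(8, 16))
windows = [planned, retroactive]

print(counts_against_slo(utc(5, 2, 10), windows))   # False: inside the planned window
print(counts_against_slo(utc(8, 14, 20), windows))  # True: window declared after the fact
```

A timestamp check like this cannot tell a planned restart from an unrelated failure that happens inside the window; separating those still needs a human looking at the incident.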

What's the difference between an SLO for uptime vs. an SLO for latency?

Availability SLOs (99.9% uptime) measure whether the server responded successfully. Latency SLOs measure how fast it responded, for example "99.5% of requests complete in under 500 ms." Both types are valid for MCP servers. Availability SLOs are more common because they're simpler to measure and communicate. Latency SLOs are important if your MCP server is in a latency-sensitive workflow where slow responses degrade agent performance as much as full outages. See MCP server latency for the latency monitoring model to pair with a latency SLO.

How do I set an SLO before I have baseline uptime data?

Start with a conservative (lower) target and tighten it after you have data. For a brand-new MCP server with no history, start at 99.0% — this gives you 7.3 hours/month of error budget to absorb early growing pains. After 2–3 months of probe data, you'll know your actual uptime baseline. If you've been consistently above 99.9%, formalize 99.9% as your SLO. If you've been at 99.5%, that's your starting SLO. Never set an SLO tighter than your demonstrated baseline — you'll breach it immediately and lose team confidence in the SLO framework.
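Once the baseline data exists, picking the starting target is a lookup: choose the tightest candidate that the observed availability already meets. The sketch below uses the 99.0% / 99.5% / 99.9% tiers from this guide. Leaving out 99.99% and returning nothing below 99.0% are assumptions of this example, and the function name is hypothetical.

```python
from typing import Optional

# Candidate targets from this guide, tightest first. 99.99% is left out on
# purpose: the guide reserves it for SLA-backed commercial services.
CANDIDATE_TARGETS = (0.999, 0.995, 0.99)


def starting_slo(total_probes: int, failed_probes: int) -> Optional[float]:
    """Tightest candidate target that the observed baseline already meets."""
    if total_probes <= 0:
        raise ValueError("need at least one probe")
    observed = 1.0 - failed_probes / total_probes
    for target in CANDIDATE_TARGETS:
        if observed >= target:
            return target
    return None  # below 99.0%: improve reliability before committing to an SLO


# Three 30-day months at 60-second cadence = 129,600 probes.
print(starting_slo(129_600, 100))  # prints 0.999
print(starting_slo(129_600, 700))  # prints 0.99
```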