Reliability infrastructure for Solana
Routing policies, observability, and incident response for high-throughput RPC infrastructure.
SLO: P99 < 200ms | 99.99% uptime
Interfaces: API, SDK, CLI, Console
SYSTEM STATUS [LIVE]
P99 latency: 142ms
Error rate: 0.02%
Healthy endpoints: 24 / 24
Last incident: 3d ago
[ OK ] routing policies active
System model
Four-stage pipeline from signal to traffic decision
1. Signals → Metrics, traces, health checks
2. Scoring → Weight by latency, errors, capacity
3. Policy eval → Apply rules, thresholds, constraints
4. Traffic decision → Route, shift, or circuit-break
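The four stages can be sketched end to end in a few lines. This is a minimal illustration, not the product's implementation: the `Signal` fields, the scoring formula, and the 0.5 cutoff are all assumptions chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Stage 1: raw signals for one endpoint (hypothetical field names)."""
    endpoint: str
    p99_ms: float      # observed P99 latency in milliseconds
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    healthy: bool      # result of the latest health check


def score(sig: Signal, latency_budget_ms: float = 200.0) -> float:
    """Stage 2: fold latency and errors into a 0..1 score (assumed formula)."""
    # 1.0 while P99 is within budget, shrinking as latency grows past it.
    latency_part = latency_budget_ms / max(sig.p99_ms, latency_budget_ms)
    # A 1% error rate drives this term to zero.
    error_part = max(0.0, 1.0 - sig.error_rate * 100)
    return latency_part * error_part


def evaluate_policy(sig: Signal, s: float, min_score: float = 0.5) -> str:
    """Stage 3: apply a health rule and a single score threshold."""
    if not sig.healthy:
        return "circuit_break"
    return "route" if s >= min_score else "shift"


def decide(signals):
    """Stage 4: map each endpoint to a traffic decision."""
    return {sig.endpoint: evaluate_policy(sig, score(sig)) for sig in signals}
```

An endpoint inside its latency budget keeps receiving traffic, a slow but healthy one has traffic shifted away, and one failing health checks is circuit-broken.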
Interfaces
API-first with SDKs, CLI, and optional web console
API
- /v1/routes - query routing table state
- /v1/policies - manage routing policies
- /v1/telemetry - stream metrics and traces
- /v1/incidents - incident history and context
SDK
- TypeScript client (npm: @gate/client)
- Rust client (crates.io: gate-client)
CLI
$ gate status
$ gate policy apply latency-failover.yaml
$ gate routes list
$ gate incidents show --last 24h
Web Console
- Optional UI for metrics dashboards and incident review
- All functionality available via API and CLI
Primitives
Core abstractions and guarantees
[ Observability ]
Track P50 / P95 / P99 latency, error budgets, and saturation.
- Multi-dimensional metrics
- Distributed trace context
- Endpoint-level breakdowns
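For reference, latency percentiles such as P50 / P95 / P99 can be computed from raw samples with the nearest-rank method. The sketch below is one common definition, assumed here for illustration; the sample values are made up and the service's actual aggregation may differ.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to keep the rank free of rounding surprises.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative latencies in milliseconds (not real measurements).
latencies_ms = [120, 130, 142, 150, 156, 178, 421]
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With only seven samples, a single slow request determines both P95 and P99, which is why tail percentiles need a large enough window to be meaningful.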
[ Routing Policies ]
Declarative failover logic and traffic shaping.
- Versioned policy configs
- Priority, weights, decay, hysteresis
- Health-based routing
- Gradual rollouts
[ Endpoint Scoring ]
Weight endpoints by latency, availability, and cost.
- Real-time scoring
- Geographic affinity
- Capacity-aware selection
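One way to picture scoring is a weighted sum of normalized latency, availability, and cost terms, followed by normalization into traffic weights. The coefficients, the `max_cost` ceiling, and the function names below are assumptions for illustration; they are not the scoring formula the service uses.

```python
def endpoint_score(p99_ms, availability, cost, *,
                   latency_budget_ms=200.0, max_cost=10.0,
                   w_latency=0.5, w_availability=0.4, w_cost=0.1):
    """Combine three 0..1 terms into one score (hypothetical coefficients)."""
    # 1.0 within the latency budget, lower as P99 exceeds it.
    latency_term = latency_budget_ms / max(p99_ms, latency_budget_ms)
    # Cheaper endpoints score higher; cost at or above max_cost scores 0.
    cost_term = max(0.0, 1.0 - cost / max_cost)
    return (w_latency * latency_term
            + w_availability * availability
            + w_cost * cost_term)


def traffic_weights(scores):
    """Normalize scores so the resulting weights sum to 1."""
    total = sum(scores.values())
    if total == 0:
        return {name: 0.0 for name in scores}
    return {name: s / total for name, s in scores.items()}


scores = {
    "fast": endpoint_score(142.0, 0.9999, 2.0),
    "slow": endpoint_score(421.0, 0.99, 2.0),
}
weights = traffic_weights(scores)
```

Plain normalization is the simplest choice. A production scorer would likely penalize degraded endpoints more sharply than this linear split does.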
[ Audit Model ]
Complete history of routing decisions and config changes.
- Config hash verification
- Signed changes with attribution
- Immutable logs
- Full diff viewing and rollback
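Config hash verification and tamper-evident logs are commonly built as a hash chain, where each entry commits to the previous one. The sketch below assumes SHA-256 over canonical JSON and hypothetical entry fields; it omits cryptographic signatures and does not describe the service's actual log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash preceding the first entry


def config_hash(config: dict) -> str:
    """SHA-256 of a canonical JSON encoding, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(log: list, author: str, config: dict) -> dict:
    """Append a change record that commits to the previous entry's hash."""
    prev = log[-1]["entry_hash"] if log else GENESIS
    body = {"author": author, "config_hash": config_hash(config), "prev_hash": prev}
    entry = dict(body, entry_hash=config_hash(body))
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        body = {k: entry[k] for k in ("author", "config_hash", "prev_hash")}
        if entry["prev_hash"] != prev or entry["entry_hash"] != config_hash(body):
            return False
        prev = entry["entry_hash"]
    return True
```

Because each entry hash covers the previous one, rewriting history requires recomputing every later entry, which a signed or externally anchored head hash would expose.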
Failure modes handled
Designed for real-world degradation patterns
- RPC timeout / connection failures
- Partial endpoint degradation (latency increase without total failure)
- Skewed endpoints (one endpoint significantly slower than pool)
- Error rate spikes with intermittent recovery
- Capacity saturation (connection pool exhaustion)
- Network partition / geographic isolation
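As an example of one failure mode above, a skewed endpoint can be detected by comparing each endpoint's P99 to the pool median. The 2x factor and the function name are assumptions for illustration only.

```python
from statistics import median


def skewed_endpoints(p99_by_endpoint, factor=2.0):
    """Return endpoints whose P99 exceeds `factor` times the pool median."""
    pool_median = median(p99_by_endpoint.values())
    return sorted(
        name for name, p99 in p99_by_endpoint.items()
        if p99 > factor * pool_median
    )


# P99 values in milliseconds, matching the routing table dump on this page.
pool = {
    "us-west-1a": 142,
    "us-west-1b": 156,
    "us-east-1a": 178,
    "eu-central-1": 421,
}
outliers = skewed_endpoints(pool)
```

The median is used rather than the mean so that the slow endpoint does not drag the baseline toward itself.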
Policy examples
Declarative YAML configs for routing behavior
# policy/latency-failover.yaml
name: latency_failover
priority: 100
trigger:
  metric: p99_latency
  threshold: 200ms
  window: 30s
action: shift_traffic
target: fallback_pool
hysteresis: 60s
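The effect of `window` and `hysteresis` in the policy above can be illustrated with a small state machine. This sketch assumes one plausible reading: traffic shifts once P99 has stayed above the threshold for the full window, and shifts back only after P99 has stayed below it for the hysteresis period. The class and method names are hypothetical.

```python
class LatencyFailover:
    """Shift on a sustained breach; shift back only after sustained recovery."""

    def __init__(self, threshold_ms=200.0, window_s=30.0, hysteresis_s=60.0):
        self.threshold_ms = threshold_ms
        self.window_s = window_s
        self.hysteresis_s = hysteresis_s
        self.shifted = False
        self._breach_since = None   # time the current breach started
        self._healthy_since = None  # time the current recovery started

    def observe(self, now_s, p99_ms):
        """Feed one measurement; return True while traffic is shifted."""
        if p99_ms > self.threshold_ms:
            self._healthy_since = None
            if self._breach_since is None:
                self._breach_since = now_s
            if now_s - self._breach_since >= self.window_s:
                self.shifted = True
        else:
            self._breach_since = None
            if self._healthy_since is None:
                self._healthy_since = now_s
            if self.shifted and now_s - self._healthy_since >= self.hysteresis_s:
                self.shifted = False
        return self.shifted
```

A single good reading after failover does not restore traffic, which prevents flapping when latency hovers around the threshold.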
# policy/error-budget.yaml
name: error_budget_enforcement
priority: 200
trigger:
  metric: error_rate
  threshold: 0.1%
  burn_rate: 5x
  lookback: 1h
action: circuit_break
recovery_threshold: 0.01%
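Burn rate is conventionally the observed error ratio divided by the budgeted error ratio. The sketch below assumes the policy's 0.1% threshold acts as the budget, that the breaker opens at a 5x burn over the lookback, and that it closes once the error ratio falls to the 0.01% recovery threshold. That reading of the config, and the state names, are assumptions.

```python
def burn_rate(errors, requests, budget_ratio):
    """Observed error ratio divided by the budgeted error ratio."""
    if requests == 0:
        return 0.0
    return (errors / requests) / budget_ratio


def next_state(state, errors, requests, *,
               budget_ratio=0.001,     # 0.1% threshold from the policy
               trip_burn_rate=5.0,     # burn_rate: 5x
               recovery_ratio=0.0001):  # recovery_threshold: 0.01%
    """Advance a two-state breaker using counts from the lookback window."""
    observed = errors / requests if requests else 0.0
    if state == "closed" and burn_rate(errors, requests, budget_ratio) >= trip_burn_rate:
        return "open"
    if state == "open" and observed <= recovery_ratio:
        return "closed"
    return state
```

For example, 60 errors in 10,000 requests is a 0.6% error ratio, a 6x burn against a 0.1% budget, so a closed breaker would open.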
# Routing table dump
$ gate routes list --format=table
ENDPOINT      SCORE  WEIGHT  STATUS    P99
us-west-1a    0.92   0.35    OK        142ms
us-west-1b    0.88   0.32    OK        156ms
us-east-1a    0.85   0.28    OK        178ms
eu-central-1  0.42   0.05    DEGRADED  421ms
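For scripting, the table output above can be parsed into records. This sketch assumes whitespace-separated columns exactly as shown in the dump; the parser and its field names are hypothetical and not part of the CLI.

```python
def parse_routes(table_text):
    """Parse the whitespace-separated routing table into a list of dicts."""
    lines = [ln for ln in table_text.strip().splitlines() if ln.strip()]
    header = lines[0].lower().split()
    rows = []
    for ln in lines[1:]:
        rec = dict(zip(header, ln.split()))
        rows.append({
            "endpoint": rec["endpoint"],
            "score": float(rec["score"]),
            "weight": float(rec["weight"]),
            "status": rec["status"],
            "p99_ms": int(rec["p99"].removesuffix("ms")),  # Python 3.9+
        })
    return rows


dump = """
ENDPOINT      SCORE  WEIGHT  STATUS    P99
us-west-1a    0.92   0.35    OK        142ms
us-west-1b    0.88   0.32    OK        156ms
us-east-1a    0.85   0.28    OK        178ms
eu-central-1  0.42   0.05    DEGRADED  421ms
"""
routes = parse_routes(dump)
degraded = [r["endpoint"] for r in routes if r["status"] != "OK"]
```

In this dump the weights sum to 1.00, and the degraded endpoint still receives a small share of traffic (0.05) rather than being removed outright.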