C-42891
Customer Service Case
SLA: 14 min remaining
Agent fleet offline -- 12 agents unresponsive across secondary cluster
Investigation
Case Lifecycle
| Step | Stage | Status |
|---|---|---|
| 1 | Create | ✓ (2 min) |
| 2 | Triage | ✓ (3 min) |
| 3 | Investigation | In progress |
| 4 | Resolution | -- |
| 5 | Confirm | -- |
| 6 | Close | -- |
Details
Investigation 3
Correspondence 2
Attachments 1
Sub-cases
Audit 8
Case Information
Case ID: C-42891
Case Type: Incident
Urgency *:
Category:
Subcategory: Agent Runtime / Fleet
Channel: Internal Monitoring
Assigned To *: Silas
Assignment Group: Platform Engineering
Customer: BlackRoad Internal
Contact: Alexa Amundson
Description
12 of 27 RoadTrip agents went offline at 20:12 UTC. Affected: cecilia, olympia, gematria, portia, atticus, cicero, valeria, celeste, elias, ophelia, gaia, anastasia. WebSocket drops and heartbeat timeouts. No recent deployments. Suspect NATS broker partition on secondary cluster.
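The heartbeat-timeout symptom in the description can be illustrated with a minimal sketch. It assumes each agent reports the age of its last heartbeat in seconds and that an agent counts as offline once that age exceeds a threshold. The function name, the 30-second threshold, the sample ages, and the `some-healthy-agent` entry are all hypothetical, not values taken from RoadTrip.

```python
def offline_agents(seconds_since_heartbeat: dict[str, float], timeout_s: float) -> list[str]:
    """Return the names of agents whose last heartbeat is older than timeout_s."""
    return sorted(
        name for name, age in seconds_since_heartbeat.items() if age > timeout_s
    )


# Illustrative ages only; real values would come from the monitoring system.
ages = {"cecilia": 95.0, "olympia": 120.0, "some-healthy-agent": 4.0}
print(offline_agents(ages, timeout_s=30.0))  # ['cecilia', 'olympia']
```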
Investigation -- Current Action
Root Cause Analysis *:
Root Cause Category:
Affected CI: roadtrip.blackroad.io
Proposed Resolution:
Routing Decision -- When/Then Rules
| Rule | When: Category | When: Urgency | Then: Route To | SLA | Active |
|---|---|---|---|---|---|
| Rule 1 | Infrastructure | Critical | Platform Engineering | 30 min | ✓ |
| Rule 2 | Application | Critical | Product Engineering | 1 hr | ✓ |
| Rule 3 | Security | Any | Security Team | 15 min | ✓ |
| Rule 4 | Any | Low | General Support | 8 hrs | ✓ |
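The when/then rules above can be sketched as a small matcher. This is an illustration under stated assumptions, not the product's routing engine: it assumes rules are evaluated top to bottom with the first active match winning, and that "Any" acts as a wildcard. The `Rule` class and `route` function are hypothetical names. The example input uses category "Infrastructure" and urgency "Critical", which is itself an assumption, since the Category and Urgency fields show no value on this case. It is consistent with the Platform Engineering assignment and the 30 min target.

```python
from dataclasses import dataclass
from typing import Optional

ANY = "Any"  # wildcard value used in the rules table


@dataclass(frozen=True)
class Rule:
    name: str
    category: str
    urgency: str
    route_to: str
    sla_minutes: int
    active: bool = True

    def matches(self, category: str, urgency: str) -> bool:
        # A rule applies when it is active and each condition is "Any" or equal.
        return (
            self.active
            and self.category in (ANY, category)
            and self.urgency in (ANY, urgency)
        )


# Rows copied from the table above; SLAs converted to minutes.
RULES = [
    Rule("Rule 1", "Infrastructure", "Critical", "Platform Engineering", 30),
    Rule("Rule 2", "Application", "Critical", "Product Engineering", 60),
    Rule("Rule 3", "Security", ANY, "Security Team", 15),
    Rule("Rule 4", ANY, "Low", "General Support", 8 * 60),
]


def route(category: str, urgency: str) -> Optional[Rule]:
    """Return the first matching rule (assumed first-match-wins), or None."""
    for rule in RULES:
        if rule.matches(category, urgency):
            return rule
    return None


match = route("Infrastructure", "Critical")
print(match.route_to, match.sla_minutes)  # Platform Engineering 30
```

Under the first-match assumption, a low-urgency Security case would go to the Security Team (Rule 3) rather than General Support (Rule 4), so rule order matters wherever conditions overlap.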
Roadie Copilot
Recommended: Apply KB0008412
The NATS broker failover procedure matches this case pattern. 94% of similar incidents were resolved with this KB article within 15 minutes.
Confidence: 94%
SLA Tracking
Response SLA: Met (2 min)
Resolution SLA: 14 min left
Target: 30 min (P1)
Elapsed: 16 min
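The resolution SLA figures are related by simple subtraction: time left is the target minus the elapsed time. A minimal sketch of that arithmetic, with `sla_remaining` as a hypothetical helper name:

```python
from datetime import timedelta


def sla_remaining(target: timedelta, elapsed: timedelta) -> timedelta:
    """Time left before the SLA is breached; a negative result means it was breached."""
    return target - elapsed


target = timedelta(minutes=30)   # Target: 30 min (P1)
elapsed = timedelta(minutes=16)  # Elapsed: 16 min
remaining = sla_remaining(target, elapsed)
print(f"{int(remaining.total_seconds() // 60)} min left")  # 14 min left
```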
Pulse
2 min ago -- Silas identified NATS broker as root cause. Restarting with corrected health check config.
8 min ago -- Octavia escalated to P1 Critical. Impact: enterprise-wide agent delegation affected.
10 min ago -- Roadie auto-detected 12 agent heartbeat failures. Correlation: NATS partition. Attached KB0008412.
12 min ago -- Alexa created case. "12 agents dropped off the convoy. Secondary cluster is down."
Related
roadtrip.blackroad.io (App CI)
nats-secondary.blackroad.io (Middleware)
KB0008412 -- NATS failover (Article)
CHG-3891 -- NATS tuning (Change)
Attachments
nats-broker-logs-20260419.txt (2.4 MB)
agent-fleet-status-screenshot.png (842 KB)