Create Runbooks to Guide Automated Debugging

Runbooks in Relvy do more than document investigation steps — they guide the behavior of Relvy’s AI agent during real-time incident analysis. When incidents occur, Relvy’s AI-based investigation engine uses your runbooks as strategic instructions to plan and execute the debugging process.

How Runbooks Work in Relvy

Relvy’s AI investigation system consists of:

  • A Planner, which orchestrates the overall debugging strategy.
  • Specialized Data Source Agents for logs, metrics (dashboards), events, and traces.

When an investigation is triggered (manually or via alert), the planner:

  • Reads the runbooks that match the incident's symptoms or tags.
  • Uses the instructions to prioritize certain signals and tools.
  • Dispatches tasks to the data source agents accordingly.
  • Aggregates findings into a unified Root Cause Analysis.
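The four planner steps above can be pictured roughly as follows. This is only an illustrative sketch: the function names, the dictionary layout, and the keyword-based routing are assumptions made for the example, not Relvy's actual implementation or API.

```python
# Hypothetical sketch: none of these names are part of Relvy's API.

def select_runbooks(runbooks, incident_tags):
    """Step 1: keep runbooks whose tags overlap the incident's tags."""
    return [rb for rb in runbooks if set(rb["tags"]) & set(incident_tags)]


def route(step):
    """Step 2: pick a data source agent for one instruction (keyword match)."""
    lowered = step.lower()
    for keyword, agent in (("log", "logs"), ("trace", "traces"), ("event", "events")):
        if keyword in lowered:
            return agent
    return "metrics"  # dashboards and anything else fall back to metrics


def investigate(runbooks, incident_tags, agents):
    """Steps 3 and 4: dispatch each step to its agent and collect the findings."""
    findings = []
    for rb in select_runbooks(runbooks, incident_tags):
        for step in rb["instructions"]:
            agent_name = route(step)
            findings.append((agent_name, agents[agent_name](step)))
    return findings


runbooks = [
    {"title": "Latency in Core APIs", "tags": ["latency"],
     "instructions": ["Begin with the API service metrics dashboard",
                      "Investigate traces for slow spans"]},
    {"title": "Debugging Kafka issues", "tags": ["kafka"],
     "instructions": ["Check logs for consumer and producer services"]},
]

# Stub agents that simply echo the task they were given.
agents = {name: (lambda step, name=name: f"{name} agent ran: {step}")
          for name in ("logs", "metrics", "events", "traces")}

findings = investigate(runbooks, ["latency"], agents)
for agent_name, result in findings:
    print(agent_name, "->", result)
```

In the real product the routing decision is made by the planner from the natural-language instructions; the keyword match here only stands in for that step.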

Runbooks as AI Guidance

Runbooks provide a way for your team to configure and influence the planner using natural language. You define what your team considers best practices — and the planner follows them when forming its investigation plan.

This allows you to encode organizational knowledge and system-specific workflows directly into the AI.

Creating a Runbook

To add a runbook:

  1. Navigate to the Runbooks tab in the Discovery section on the left sidebar.

  2. Click Create New Runbook.

  3. Fill out the form:

    • Title: A clear and concise label.
    • Type: e.g., General Planning, Log Analysis, Event Analysis.
    • Symptom / When to Use: Describe when this runbook is relevant.
    • Instructions: Write your investigation steps in natural language.
    • Tags (optional): Add tags like user-facing, latency, kubernetes, etc.
  4. Click Create Runbook to save.
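If you draft runbooks outside the UI before pasting them into the form, it can help to think of each one as a small record with the same fields. The sketch below is a hypothetical structure for that purpose: `RunbookDraft` and `missing_fields` are invented names, and treating every field except Tags as required is an assumption based on the form description above.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookDraft:
    """Hypothetical record mirroring the form fields; not a Relvy API type."""
    title: str
    type: str          # e.g. "General Planning", "Log Analysis", "Event Analysis"
    when_to_use: str
    instructions: str
    tags: list[str] = field(default_factory=list)  # optional in the form

    def missing_fields(self):
        """Return the names of the (assumed) required fields that are blank."""
        required = ("title", "type", "when_to_use", "instructions")
        return [name for name in required if not getattr(self, name).strip()]


draft = RunbookDraft(
    title="Latency in Core APIs",
    type="General Planning",
    when_to_use="When alerts mention increased latency in APIs",
    instructions="1. Begin with the API service metrics dashboard",
    tags=["latency"],
)
print(draft.missing_fields())
```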

Example Runbook Instructions

Example 1: General Debugging Instructions

Title: General Debugging Instructions
When to Use: When debugging any incident
Instructions:
1. Check the RED metrics dashboard as a starting point for most investigations
2. For user-facing issues, check frontend service logs and metrics
3. To check recent deployments, filter events by @source:kubernetes and look for pod restarts, scaling, or service deployments
4. Check the Runtime metrics dashboard for CPU/memory utilization
5. To locate traces from logs, use otel.trace_id and otel.span_id

💡 Relvy’s planner will use these general guidelines as a foundation for any investigation, adapting the approach based on the specific incident context.
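To make step 5 of this example concrete, here is a rough sketch of what "locating traces from logs" amounts to: grouping log records by their `otel.trace_id` so that each group points at one trace to open. The record layout and the sample values are invented for illustration; only the `otel.trace_id` and `otel.span_id` field names come from the runbook above.

```python
from collections import defaultdict


def group_logs_by_trace(log_records):
    """Group log records by otel.trace_id, skipping records that have none."""
    by_trace = defaultdict(list)
    for record in log_records:
        trace_id = record.get("otel.trace_id")
        if trace_id:
            by_trace[trace_id].append(record)
    return dict(by_trace)


# Hypothetical log records; real field layouts depend on your log pipeline.
logs = [
    {"message": "checkout failed", "otel.trace_id": "abc123", "otel.span_id": "s1"},
    {"message": "payment timeout", "otel.trace_id": "abc123", "otel.span_id": "s2"},
    {"message": "cart updated", "otel.trace_id": "def456", "otel.span_id": "s3"},
    {"message": "health check ok"},  # no trace context attached
]

grouped = group_logs_by_trace(logs)
for trace_id, records in grouped.items():
    print(trace_id, [r["otel.span_id"] for r in records])
```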

Example 2: Latency in Backend Services

Title: Latency in Core APIs
When to Use: When alerts mention increased latency in APIs
Instructions:
1. Begin with the API service metrics dashboard
2. Compare current latency to 1h and 24h baselines
3. Check for saturation in DB or cache services connected to the API
4. Investigate traces for slow spans and associated service calls
5. Review logs for errors or warnings in the same time range

💡 The planner will dynamically follow these steps, dispatching tasks to metric, trace, and log agents to execute them in order.
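Step 2 of this runbook (comparing current latency to the 1h and 24h baselines) boils down to simple ratio arithmetic. The sketch below shows one way to express it. The function name, the sample numbers, and the 1.5x threshold are assumptions chosen for the example, not values used by Relvy.

```python
def compare_to_baselines(current_ms, baselines_ms, threshold=1.5):
    """Return {label: (ratio, exceeded)} for each positive baseline.

    ratio is current latency divided by the baseline; exceeded is True
    when the ratio is at or above the (assumed) threshold.
    """
    report = {}
    for label, baseline in baselines_ms.items():
        if baseline <= 0:
            continue  # no usable baseline for this window
        ratio = current_ms / baseline
        report[label] = (round(ratio, 2), ratio >= threshold)
    return report


# Hypothetical p95 latencies in milliseconds.
report = compare_to_baselines(480.0, {"1h": 400.0, "24h": 200.0})
print(report)
```

With these sample numbers, latency is only modestly above the 1h baseline but well above the 24h baseline, which would suggest a regression that started more than an hour ago.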

Example 3: Debugging Kafka Issues

Title: Debugging Kafka Issues
When to Use: When debugging a Kafka-related alert
Instructions:
1. Look at the Kafka dashboard for the appropriate topic / consumer group and identify the specific affected partitions
2. Identify whether any consumer pods are down
3. Check logs for consumer and producer services
4. For lag issues, check whether the lag is caused by traffic surges
5. Finally, check the Kafka infra metrics dashboards for issues with Kafka itself

💡 Relvy’s planner will interpret this and structure its investigation to answer the above questions.
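For readers less familiar with Kafka, the "affected partitions" in step 1 are usually the ones with high consumer lag, where lag for a partition is the log end offset minus the consumer group's committed offset. A minimal sketch of that calculation follows. The offsets and the threshold are made-up sample values, and the helper names are hypothetical.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: log end offset minus committed offset.

    A partition with no committed offset is treated as fully lagging,
    which is a simplification for this example.
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def affected_partitions(lag, threshold):
    """Return the partitions whose lag exceeds the threshold, in sorted order."""
    return sorted(p for p, value in lag.items() if value > threshold)


# Hypothetical offsets for one topic and consumer group.
end_offsets = {0: 1500, 1: 1520, 2: 9800}
committed_offsets = {0: 1495, 1: 1520, 2: 4300}

lag = partition_lag(end_offsets, committed_offsets)
print(lag)
print(affected_partitions(lag, threshold=1000))
```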

Example 4: Application Architecture Overview

Title: Application and System Description
When to Use: When debugging all incidents
Instructions:
This is an ecommerce application (Astronomy Shop). This is the list of critical services:

- accounting
- ad
- cart
- checkout
- currency
- email
- frontend
- payment
- product-catalog
- quote
- recommendation
- shipping

💡 Relvy’s planner will use this architectural knowledge to prioritize services during investigations.

Runbook Management

  • Runbooks can be searched and filtered by tags.
  • You can edit them at any time so that they evolve with your system.
  • After an investigation, Relvy highlights the runbooks that were followed and lets you view or modify the instructions for future investigations.
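Conceptually, searching and filtering by tags works like the sketch below: a runbook matches when it carries every requested tag and its title contains the search text. This is an illustration only. The matching rules (all tags required, case-insensitive title search) and the `filter_runbooks` helper are assumptions, and the Relvy UI may behave differently.

```python
def filter_runbooks(runbooks, tags=(), query=""):
    """Return runbooks that carry all requested tags and match the query text."""
    wanted = set(tags)
    needle = query.lower()
    return [
        rb for rb in runbooks
        if wanted <= set(rb["tags"]) and needle in rb["title"].lower()
    ]


# Hypothetical runbook list using titles from the examples above.
runbooks = [
    {"title": "General Debugging Instructions", "tags": ["user-facing"]},
    {"title": "Latency in Core APIs", "tags": ["latency", "user-facing"]},
    {"title": "Debugging Kafka issues", "tags": ["kafka"]},
]

print([rb["title"] for rb in filter_runbooks(runbooks, tags=["user-facing"])])
print([rb["title"] for rb in filter_runbooks(runbooks, query="kafka")])
```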

Benefits of Configurable AI Planning

With runbooks:

  • You encode team knowledge into the AI, turning experience into automation.
  • Investigations become standardized, repeatable, and transparent.
  • New team members benefit from a guided process, and experts can continuously improve it.

Whether you are debugging application errors, latency spikes, infrastructure issues, or deployment regressions, Relvy's AI will follow your instructions step by step.