MicroRCA-Agent: Using Large Language Models to Find Root Causes in Microservices

How LLM agents, logs, traces, and metrics come together to solve one of cloud computing’s hardest problems

Modern cloud applications are almost always built on a microservices architecture. While microservices give us scalability, flexibility, and faster development cycles, they also introduce a serious operational challenge:

When something breaks, finding out why it broke is extremely hard.

Failures in microservices rarely come from a single component. Instead, they propagate across services, containers, nodes, and infrastructure layers — generating huge volumes of logs, traces, and metrics along the way.

The research paper “MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents” (arXiv:2509.15635) proposes a novel solution:

combine classical machine learning with Large Language Model (LLM) reasoning to automatically diagnose failures and explain them.

Why Root Cause Analysis Is So Hard in Microservices

In monolithic systems, failures were easier to trace. In microservices:

  • A single request may trigger dozens of service calls
  • Failures propagate indirectly
  • Logs are unstructured and noisy
  • Metrics fluctuate even during healthy operation
  • Traces are complex call graphs

Traditional monitoring systems can:

  • Detect anomalies
  • Raise alerts

But they fail to explain the true root cause.

Engineers still need to manually:

  • Read logs
  • Inspect traces
  • Correlate metrics
  • Guess causal relationships

This manual RCA process is:

  • Time-consuming
  • Error-prone
  • Not scalable

What MicroRCA-Agent Proposes

The core idea of MicroRCA-Agent is simple but powerful:

Use LLMs as reasoning agents to combine logs, traces, and metrics into a coherent root-cause explanation.

Instead of replacing existing ML methods, the system:

  • Uses classical ML for anomaly detection
  • Uses LLMs for reasoning, summarization, and explanation

This hybrid design avoids common LLM pitfalls while exploiting their strengths.

Key Contributions of the Paper

The paper makes three major contributions:

1. Multimodal RCA Framework

A complete pipeline that integrates logs, traces, and metrics into a single RCA process.

2. LLM-Based Reasoning Agent

LLMs are used not for detection, but for:

  • Interpreting anomalies
  • Correlating evidence
  • Producing human-readable explanations

3. Real-World Validation

The system is evaluated on the CCF International AIOps Challenge 2025 dataset, where it:

  • Achieves a final score of 50.71
  • Demonstrates robustness through ablation studies

System Architecture Overview

MicroRCA-Agent is built as a modular system, making it suitable for real production environments.

Each module is independent but designed to feed structured information to the next.

Input Format and Data Preprocessing

Input JSON

The system accepts a JSON file containing:

  • uuid: fault identifier
  • description: textual fault description
  • start_time and end_time: anomaly window
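
As an illustration, a minimal input record might look like the following (all field values are hypothetical, not taken from the paper):

```python
# Hypothetical input fault record; values are illustrative only
fault_case = {
    "uuid": "f3c1a2b4-0d5e-4f67-9a8b-1c2d3e4f5a6b",  # fault identifier
    "description": "Users report intermittent errors on the checkout flow",
    "start_time": 1726000000000000000,  # anomaly window start (epoch ns)
    "end_time":   1726000600000000000,  # anomaly window end (10 min later)
}
```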

Preprocessing Tasks

  • Timestamp normalization (nanosecond precision)
  • Time window alignment across data sources
  • Data cleaning and synchronization

This step ensures all logs, traces, and metrics refer to the same failure interval.
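
A minimal sketch of the alignment step, assuming pandas DataFrames and integer nanosecond timestamps (the column and helper names are my own, not the paper's):

```python
import pandas as pd

def clip_to_window(df: pd.DataFrame, start_ns: int, end_ns: int,
                   ts_col: str = "ts_ns") -> pd.DataFrame:
    """Keep only rows whose nanosecond timestamp falls inside the fault window."""
    mask = (df[ts_col] >= start_ns) & (df[ts_col] <= end_ns)
    return df.loc[mask]

# Normalize each source to nanosecond epoch timestamps first, e.g.:
#   df["ts_ns"] = pd.to_datetime(df["time"], utc=True).astype("int64")
# then apply the same window to logs, traces, and metrics so all three
# modalities describe the same failure interval:
#   logs = clip_to_window(logs, fault_case["start_time"], fault_case["end_time"])
```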

Log Fault Extraction Module

Problem

Logs are:

  • Unstructured
  • Extremely verbose
  • Filled with irrelevant messages

Solution

MicroRCA-Agent applies multi-stage log processing:

1. Keyword Filtering

Regex-based filtering to retain:

  • Error
  • Exception
  • Failure-related logs
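
A sketch of this stage; the paper's exact keyword list is not reproduced here, so the pattern below is illustrative:

```python
import re

# Illustrative failure-keyword pattern, not the paper's exact list
FAULT_KEYWORDS = re.compile(
    r"\b(error|exception|fail(?:ed|ure)?|fatal|timeout)\b", re.IGNORECASE)

def keep_fault_lines(lines: list[str]) -> list[str]:
    """Retain only log lines that mention failure-related keywords."""
    return [line for line in lines if FAULT_KEYWORDS.search(line)]
```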

2. Drain Log Parsing Algorithm

Drain converts raw logs into templates by:

  • Removing variable fields (IDs, timestamps)
  • Retaining semantic structure

This dramatically reduces log volume while preserving meaning.
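
For example, using the open-source drain3 package (one common Drain implementation; the paper does not necessarily use this exact library):

```python
from drain3 import TemplateMiner  # pip install drain3

miner = TemplateMiner()

# Illustrative raw lines: variable fields differ, structure is identical
for line in [
    "Timeout calling cart-service for order 8841",
    "Timeout calling cart-service for order 9172",
]:
    result = miner.add_log_message(line)
    print(result["template_mined"])

# Both lines collapse into a single template, roughly:
#   "Timeout calling cart-service for order <*>"
```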

3. Multi-Level Filtering

Logs are further filtered by:

  • Time window
  • Service identity
  • Deduplication
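
Continuing the sketch above, a hypothetical multi-level filter over parsed templates (the column names are assumptions):

```python
import pandas as pd

def filter_templates(df: pd.DataFrame, start_ns: int, end_ns: int,
                     services: set[str]) -> pd.DataFrame:
    """Illustrative multi-level filter: time window, service identity, dedup."""
    df = df[(df["ts_ns"] >= start_ns) & (df["ts_ns"] <= end_ns)]  # time window
    df = df[df["service"].isin(services)]                         # service identity
    return df.drop_duplicates(subset=["service", "template"])     # deduplication
```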

Output

A concise set of fault-related log templates, ready for reasoning.

Trace Anomaly Detection Module

What Are Traces?

Traces represent:

  • End-to-end request paths
  • Inter-service call chains
  • Latency and dependency structure

Detection Strategy

The system combines two techniques:

1. Isolation Forest (Unsupervised ML)

  • Trained on normal trace durations
  • Detects abnormal latency patterns
  • Works without labeled data

2. Status Code Analysis

  • Identifies failed requests (e.g., HTTP 5xx)
  • Captures explicit failures missed by latency models

Why Both?

  • Not every latency anomaly is a failure
  • Not every failure shows up as a latency anomaly

Combining both improves accuracy.
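
A minimal sketch of the combined strategy, assuming per-span durations and HTTP status codes as inputs (the feature choice and function names are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def detect_trace_anomalies(normal_durations, fault_durations, status_codes):
    """Flag spans via latency outliers OR explicit HTTP failures."""
    # 1) Isolation Forest trained on normal-period span durations (no labels)
    forest = IsolationForest(contamination="auto", random_state=0)
    forest.fit(np.asarray(normal_durations).reshape(-1, 1))
    latency_outlier = forest.predict(
        np.asarray(fault_durations).reshape(-1, 1)) == -1  # -1 marks outliers

    # 2) Explicit failures the latency model can miss: HTTP 5xx responses
    explicit_failure = np.asarray(status_codes) >= 500

    # One status code per fault-window span is assumed here
    return latency_outlier | explicit_failure
```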

Metric Fault Summarization Module

Metrics Challenges

  • Thousands of metrics
  • High natural variance
  • Noise overwhelms signal

Step 1: Symmetric Ratio Filtering

Metrics are filtered using statistical comparison between:

  • Fault window
  • Normal baseline window

Only significantly changed metrics are retained.
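
The paper's exact statistic is not reproduced here; the sketch below uses one common symmetric formulation, max(a/b, b/a) over window means, purely as an illustration:

```python
import numpy as np

def symmetric_ratio(fault_values, baseline_values, eps=1e-9):
    """Symmetric change score: large whenever the fault-window mean
    differs from the baseline mean in either direction."""
    a = abs(np.mean(fault_values)) + eps
    b = abs(np.mean(baseline_values)) + eps
    return max(a / b, b / a)

def filter_metrics(fault: dict, baseline: dict, threshold: float = 1.5):
    """Keep metrics whose change score exceeds a (tunable) threshold.
    Both dicts map metric name -> list of samples; names are illustrative."""
    return {name: vals for name, vals in fault.items()
            if symmetric_ratio(vals, baseline[name]) > threshold}
```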

Step 2: Two-Stage LLM Summarization

Stage 1

The LLM summarizes metrics at:

  • Pod level
  • Service level

Stage 2

The LLM combines summaries across:

  • Infrastructure layers
  • Services

This staged approach:

  • Reduces token usage
  • Preserves hierarchical context
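
A sketch of the staged flow, with a placeholder call_llm function standing in for whatever model backend is used (the prompt wording and names are mine, not the paper's):

```python
def call_llm(prompt: str) -> str:
    """Placeholder for the actual LLM backend (hypothetical)."""
    raise NotImplementedError

def summarize_metrics(pod_metrics: dict[str, str],
                      service_of: dict[str, str]) -> str:
    # Stage 1: summarize each pod's anomalous metrics in a small prompt
    pod_summaries = {pod: call_llm(f"Summarize these pod metrics:\n{m}")
                     for pod, m in pod_metrics.items()}

    # Stage 2: merge pod summaries, grouped by owning service, into one
    # infrastructure- and service-level overview
    grouped = "\n".join(f"[{service_of[pod]}/{pod}] {summary}"
                        for pod, summary in pod_summaries.items())
    return call_llm("Combine these pod-level summaries into a service- "
                    f"and infrastructure-level overview:\n{grouped}")
```

Keeping Stage 1 per pod bounds each call's token count, while Stage 2 restores the pod-to-service hierarchy.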

Multimodal LLM-Based Root Cause Analysis

This is the core innovation of the paper.

Why LLMs Here?

LLMs excel at:

  • Pattern synthesis
  • Contextual reasoning
  • Natural language explanation

But they are not used for raw detection.

Multimodal Prompt Design

The LLM receives:

  • Log fault templates
  • Trace anomaly summaries
  • Metric behavior summaries

Carefully designed prompts instruct the LLM to:

  1. Correlate evidence
  2. Identify causal chains
  3. Explain why the failure happened
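
A hypothetical assembly of such a prompt (the wording is illustrative; the paper's actual prompt templates differ):

```python
def build_rca_prompt(log_templates: str, trace_summary: str,
                     metric_summary: str) -> str:
    """Combine the three evidence streams into one reasoning prompt."""
    return (
        "You are a root cause analysis assistant for a microservice system.\n"
        "Evidence collected during the fault window:\n\n"
        f"## Log fault templates\n{log_templates}\n\n"
        f"## Trace anomaly summary\n{trace_summary}\n\n"
        f"## Metric behavior summary\n{metric_summary}\n\n"
        "Correlate the evidence, identify the causal chain, and explain "
        "why the failure happened. Name the root cause component."
    )
```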

LLM Output Structure

The output includes:

  • Root cause component
  • Failure description
  • Supporting evidence
  • Logical reasoning path

This transforms raw telemetry into actionable insight.
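
For instance, a structured answer could take a shape like this (a made-up example to show the fields, not actual paper output):

```python
rca_result = {
    "root_cause_component": "checkout-service pod checkout-7d9f",  # hypothetical
    "failure_description": "Connection exhaustion causing HTTP 500s",
    "supporting_evidence": [
        "Log template: 'Timeout calling cart-service for order <*>'",
        "Trace: 5xx responses and latency outliers on checkout spans",
        "Metric: connection count saturated during the fault window",
    ],
    "reasoning_path": "Metric saturation -> request timeouts -> upstream 5xx",
}
```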

Experimental Evaluation

Dataset

  • CCF International AIOps Challenge 2025

Performance

  • Final score: 50.71
  • Competitive with state-of-the-art methods

Ablation Studies

The authors remove components to measure impact:

  • Logs removed: significant performance drop
  • Traces removed: significant performance drop
  • Metrics removed: significant performance drop
  • LLM reasoning removed: severe performance drop

Strengths of MicroRCA-Agent

✅ Multimodal integration
✅ Explainable results
✅ Modular design
✅ Practical deployment potential
✅ Combines ML + LLM strengths

Limitations and Future Directions

The paper acknowledges:

  • Prompt sensitivity
  • LLM cost and latency
  • Need for domain adaptation

Future work may include:

  • Fine-tuned domain LLMs
  • Online learning
  • Autonomous remediation agents

Why This Paper Is Important

MicroRCA-Agent shows that:

LLMs are not just chatbots — they can be intelligent system reasoning agents.

This work bridges:

  • AIOps
  • Microservices
  • Classical ML
  • LLM-based reasoning

It represents a new generation of intelligent observability systems.

MicroRCA-Agent blends traditional techniques (like regex filtering and Isolation Forest) with advanced LLM reasoning to create a powerful RCA engine for microservices. It automates root cause diagnosis while providing interpretable explanations — addressing one of the toughest problems in distributed systems monitoring and reliability.

Paper Reference

MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents
arXiv:2509.15635