MicroRCA-Agent: Using Large Language Models to Find Root Causes in Microservices

How LLM agents, logs, traces, and metrics come together to solve one of cloud computing’s hardest problems

Modern cloud applications are almost always built on a microservices architecture. While microservices give us scalability, flexibility, and faster development cycles, they also introduce a serious operational challenge:

When something breaks, finding out why it broke is extremely hard.

Failures in microservices rarely come from a single component. Instead, they propagate across services, containers, nodes, and infrastructure layers — generating huge volumes of logs, traces, and metrics along the way.

The research paper “MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents” (arXiv:2509.15635) proposes a novel solution:

combine classical machine learning with Large Language Model (LLM) reasoning to automatically diagnose failures and explain them.

Why Root Cause Analysis Is So Hard in Microservices

In monolithic systems, failures were easier to trace. In microservices:

  • A single request may trigger dozens of service calls
  • Failures propagate indirectly
  • Logs are unstructured and noisy
  • Metrics fluctuate even during healthy operation
  • Traces are complex call graphs

Traditional monitoring systems can:

  • Detect anomalies
  • Raise alerts

But they fail to explain the true root cause.

Engineers still need to manually:

  • Read logs
  • Inspect traces
  • Correlate metrics
  • Guess causal relationships

This manual RCA process is:

  • Time-consuming
  • Error-prone
  • Not scalable

What MicroRCA-Agent Proposes

The core idea of MicroRCA-Agent is simple but powerful:

Use LLMs as reasoning agents to combine logs, traces, and metrics into a coherent root-cause explanation.

Instead of replacing existing ML methods, the system:

  • Uses classical ML for anomaly detection
  • Uses LLMs for reasoning, summarization, and explanation

This hybrid design avoids common LLM pitfalls while exploiting their strengths.

Key Contributions of the Paper

The paper makes three major contributions:

1. Multimodal RCA Framework

A complete pipeline that integrates logs, traces, and metrics into a single RCA process.

2. LLM-Based Reasoning Agent

LLMs are used not for detection, but for:

  • Interpreting anomalies
  • Correlating evidence
  • Producing human-readable explanations

3. Real-World Validation

The system is evaluated on the CCF International AIOps Challenge 2025 dataset, where it:

  • Achieves a final score of 50.71
  • Demonstrates robustness through ablation studies

System Architecture Overview

MicroRCA-Agent is built as a modular system, making it suitable for real production environments.

Each module is independent but designed to feed structured information to the next.

Input Format and Data Preprocessing

Input JSON

The system accepts a JSON file containing:

  • uuid: fault identifier
  • description: textual fault description
  • start_time and end_time: anomaly window
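
As an illustration, a minimal input record might look like the following (all field values are hypothetical, not taken from the paper):

```python
# Hypothetical input fault record; values are illustrative only
fault_case = {
    "uuid": "f3c1a2b4-0d5e-4f67-9a8b-1c2d3e4f5a6b",  # fault identifier
    "description": "Users report intermittent errors on the checkout flow",
    "start_time": 1726000000000000000,  # anomaly window start (epoch ns)
    "end_time":   1726000600000000000,  # anomaly window end (10 min later)
}
```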

Preprocessing Tasks

  • Timestamp normalization (nanosecond precision)
  • Time window alignment across data sources
  • Data cleaning and synchronization

This step ensures all logs, traces, and metrics refer to the same failure interval.
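
A minimal sketch of the alignment step, assuming pandas DataFrames and integer nanosecond timestamps (the column and helper names are my own, not the paper's):

```python
import pandas as pd

def clip_to_window(df: pd.DataFrame, start_ns: int, end_ns: int,
                   ts_col: str = "ts_ns") -> pd.DataFrame:
    """Keep only rows whose nanosecond timestamp falls inside the fault window."""
    mask = (df[ts_col] >= start_ns) & (df[ts_col] <= end_ns)
    return df.loc[mask]

# Normalize each source to nanosecond epoch timestamps first, e.g.:
#   df["ts_ns"] = pd.to_datetime(df["time"], utc=True).astype("int64")
# then apply the same window to logs, traces, and metrics so all three
# modalities describe the same failure interval:
#   logs = clip_to_window(logs, fault_case["start_time"], fault_case["end_time"])
```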

Log Fault Extraction Module

Problem

Logs are:

  • Unstructured
  • Extremely verbose
  • Filled with irrelevant messages

Solution

MicroRCA-Agent applies multi-stage log processing:

1. Keyword Filtering

Regex-based filtering to retain:

  • Error
  • Exception
  • Failure-related logs
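
A sketch of this stage; the paper's exact keyword list is not reproduced here, so the pattern below is illustrative:

```python
import re

# Illustrative failure-keyword pattern, not the paper's exact list
FAULT_KEYWORDS = re.compile(
    r"\b(error|exception|fail(?:ed|ure)?|fatal|timeout)\b", re.IGNORECASE)

def keep_fault_lines(lines: list[str]) -> list[str]:
    """Retain only log lines that mention failure-related keywords."""
    return [line for line in lines if FAULT_KEYWORDS.search(line)]
```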

2. Drain Log Parsing Algorithm

Drain converts raw logs into templates by:

  • Removing variable fields (IDs, timestamps)
  • Retaining semantic structure

This dramatically reduces log volume while preserving meaning.
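
For example, using the open-source drain3 package (one common Drain implementation; the paper does not necessarily use this exact library):

```python
from drain3 import TemplateMiner  # pip install drain3

miner = TemplateMiner()

# Illustrative raw lines: variable fields differ, structure is identical
for line in [
    "Timeout calling cart-service for order 8841",
    "Timeout calling cart-service for order 9172",
]:
    result = miner.add_log_message(line)
    print(result["template_mined"])

# Both lines collapse into a single template, roughly:
#   "Timeout calling cart-service for order <*>"
```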

3. Multi-Level Filtering

Logs are further filtered by:

  • Time window
  • Service identity
  • Deduplication
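
Continuing the sketch above, a hypothetical multi-level filter over parsed templates (the column names are assumptions):

```python
import pandas as pd

def filter_templates(df: pd.DataFrame, start_ns: int, end_ns: int,
                     services: set[str]) -> pd.DataFrame:
    """Illustrative multi-level filter: time window, service identity, dedup."""
    df = df[(df["ts_ns"] >= start_ns) & (df["ts_ns"] <= end_ns)]  # time window
    df = df[df["service"].isin(services)]                         # service identity
    return df.drop_duplicates(subset=["service", "template"])     # deduplication
```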

Output

A concise set of fault-related log templates, ready for reasoning.

Trace Anomaly Detection Module

What Are Traces?

Traces represent:

  • End-to-end request paths
  • Inter-service call chains
  • Latency and dependency structure

Detection Strategy

The system combines two techniques:

1. Isolation Forest (Unsupervised ML)

  • Trained on normal trace durations
  • Detects abnormal latency patterns
  • Works without labeled data

2. Status Code Analysis

  • Identifies failed requests (e.g., HTTP 5xx)
  • Captures explicit failures missed by latency models

Why Both?

  • Not every latency anomaly is a failure
  • Not every failure shows up as a latency anomaly

Combining both improves accuracy.
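
A minimal sketch of the combined strategy, assuming per-span durations and HTTP status codes as inputs (the feature choice and function names are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def detect_trace_anomalies(normal_durations, fault_durations, status_codes):
    """Flag spans via latency outliers OR explicit HTTP failures."""
    # 1) Isolation Forest trained on normal-period span durations (no labels)
    forest = IsolationForest(contamination="auto", random_state=0)
    forest.fit(np.asarray(normal_durations).reshape(-1, 1))
    latency_outlier = forest.predict(
        np.asarray(fault_durations).reshape(-1, 1)) == -1  # -1 marks outliers

    # 2) Explicit failures the latency model can miss: HTTP 5xx responses
    explicit_failure = np.asarray(status_codes) >= 500

    # One status code per fault-window span is assumed here
    return latency_outlier | explicit_failure
```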

Metric Fault Summarization Module

Metrics Challenges

  • Thousands of metrics
  • High natural variance
  • Noise overwhelms signal

Step 1: Symmetric Ratio Filtering

Metrics are filtered using statistical comparison between:

  • Fault window
  • Normal baseline window

Only significantly changed metrics are retained.
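
The paper's exact statistic is not reproduced here; the sketch below uses one common symmetric formulation, max(a/b, b/a) over window means, purely as an illustration:

```python
import numpy as np

def symmetric_ratio(fault_values, baseline_values, eps=1e-9):
    """Symmetric change score: large whenever the fault-window mean
    differs from the baseline mean in either direction."""
    a = abs(np.mean(fault_values)) + eps
    b = abs(np.mean(baseline_values)) + eps
    return max(a / b, b / a)

def filter_metrics(fault: dict, baseline: dict, threshold: float = 1.5):
    """Keep metrics whose change score exceeds a (tunable) threshold.
    Both dicts map metric name -> list of samples; names are illustrative."""
    return {name: vals for name, vals in fault.items()
            if symmetric_ratio(vals, baseline[name]) > threshold}
```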

Step 2: Two-Stage LLM Summarization

Stage 1

The LLM summarizes metrics at:

  • Pod level
  • Service level

Stage 2

The LLM combines summaries across:

  • Infrastructure layers
  • Services

This staged approach:

  • Reduces token usage
  • Preserves hierarchical context
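
A sketch of the staged flow, with a placeholder call_llm function standing in for whatever model backend is used (the prompt wording and names are mine, not the paper's):

```python
def call_llm(prompt: str) -> str:
    """Placeholder for the actual LLM backend (hypothetical)."""
    raise NotImplementedError

def summarize_metrics(pod_metrics: dict[str, str],
                      service_of: dict[str, str]) -> str:
    # Stage 1: summarize each pod's anomalous metrics in a small prompt
    pod_summaries = {pod: call_llm(f"Summarize these pod metrics:\n{m}")
                     for pod, m in pod_metrics.items()}

    # Stage 2: merge pod summaries, grouped by owning service, into one
    # infrastructure- and service-level overview
    grouped = "\n".join(f"[{service_of[pod]}/{pod}] {summary}"
                        for pod, summary in pod_summaries.items())
    return call_llm("Combine these pod-level summaries into a service- "
                    f"and infrastructure-level overview:\n{grouped}")
```

Keeping Stage 1 per pod bounds each call's token count, while Stage 2 restores the pod-to-service hierarchy.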

Multimodal LLM-Based Root Cause Analysis

This is the core innovation of the paper.

Why LLMs Here?

LLMs excel at:

  • Pattern synthesis
  • Contextual reasoning
  • Natural language explanation

But they are not used for raw detection.

Multimodal Prompt Design

The LLM receives:

  • Log fault templates
  • Trace anomaly summaries
  • Metric behavior summaries

Carefully designed prompts instruct the LLM to:

  1. Correlate evidence
  2. Identify causal chains
  3. Explain why the failure happened
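
A hypothetical assembly of such a prompt (the wording is illustrative; the paper's actual prompt templates differ):

```python
def build_rca_prompt(log_templates: str, trace_summary: str,
                     metric_summary: str) -> str:
    """Combine the three evidence streams into one reasoning prompt."""
    return (
        "You are a root cause analysis assistant for a microservice system.\n"
        "Evidence collected during the fault window:\n\n"
        f"## Log fault templates\n{log_templates}\n\n"
        f"## Trace anomaly summary\n{trace_summary}\n\n"
        f"## Metric behavior summary\n{metric_summary}\n\n"
        "Correlate the evidence, identify the causal chain, and explain "
        "why the failure happened. Name the root cause component."
    )
```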

LLM Output Structure

The output includes:

  • Root cause component
  • Failure description
  • Supporting evidence
  • Logical reasoning path

This transforms raw telemetry into actionable insight.
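
For instance, a structured answer could take a shape like this (a made-up example to show the fields, not actual paper output):

```python
rca_result = {
    "root_cause_component": "checkout-service pod checkout-7d9f",  # hypothetical
    "failure_description": "Connection exhaustion causing HTTP 500s",
    "supporting_evidence": [
        "Log template: 'Timeout calling cart-service for order <*>'",
        "Trace: 5xx responses and latency outliers on checkout spans",
        "Metric: connection count saturated during the fault window",
    ],
    "reasoning_path": "Metric saturation -> request timeouts -> upstream 5xx",
}
```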

Experimental Evaluation

Dataset

  • CCF International AIOps Challenge 2025

Performance

  • Final score: 50.71
  • Competitive with state-of-the-art methods

Ablation Studies

The authors remove components to measure impact:

  • Logs removed: significant performance drop
  • Traces removed: significant performance drop
  • Metrics removed: significant performance drop
  • LLM reasoning removed: severe performance drop

Strengths of MicroRCA-Agent

✅ Multimodal integration
✅ Explainable results
✅ Modular design
✅ Practical deployment potential
✅ Combines ML + LLM strengths

Limitations and Future Directions

The paper acknowledges:

  • Prompt sensitivity
  • LLM cost and latency
  • Need for domain adaptation

Future work may include:

  • Fine-tuned domain LLMs
  • Online learning
  • Autonomous remediation agents

Why This Paper Is Important

MicroRCA-Agent shows that:

LLMs are not just chatbots — they can be intelligent system reasoning agents.

This work bridges:

  • AIOps
  • Microservices
  • Classical ML
  • LLM-based reasoning

It represents a new generation of intelligent observability systems.

MicroRCA-Agent blends traditional techniques (like regex filtering and Isolation Forest) with advanced LLM reasoning to create a powerful RCA engine for microservices. It automates root cause diagnosis while providing interpretable explanations — addressing one of the toughest problems in distributed systems monitoring and reliability.

Paper Reference

MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents
arXiv:2509.15635