How LLM agents, logs, traces, and metrics come together to solve one of cloud computing’s hardest problems
Modern cloud applications are almost always built on a microservices architecture. While microservices give us scalability, flexibility, and faster development cycles, they also introduce a serious operational challenge:
When something breaks, finding why it broke is extremely hard.
Failures in microservices rarely come from a single component. Instead, they propagate across services, containers, nodes, and infrastructure layers — generating huge volumes of logs, traces, and metrics along the way.
The research paper “MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents” (arXiv:2509.15635) proposes a novel solution:
combine classical machine learning with Large Language Model (LLM) reasoning to automatically diagnose failures and explain them.
Why Root Cause Analysis Is So Hard in Microservices
In monolithic systems, failures were easier to trace. In microservices:
- A single request may trigger dozens of service calls
- Failures propagate indirectly
- Logs are unstructured and noisy
- Metrics fluctuate even during healthy operation
- Traces are complex call graphs
Traditional monitoring systems can:
- Detect anomalies
- Raise alerts
But they fail to explain the true root cause.
Engineers still need to manually:
- Read logs
- Inspect traces
- Correlate metrics
- Guess causal relationships
This manual RCA process is:
- Time-consuming
- Error-prone
- Not scalable
What MicroRCA-Agent Proposes
The core idea of MicroRCA-Agent is simple but powerful:
Use LLMs as reasoning agents to combine logs, traces, and metrics into a coherent root-cause explanation
Instead of replacing existing ML methods, the system:
- Uses classical ML for anomaly detection
- Uses LLMs for reasoning, summarization, and explanation
This hybrid design avoids common LLM pitfalls while exploiting their strengths.
Key Contributions of the Paper
The paper makes three major contributions:
1. Multimodal RCA Framework
A complete pipeline that integrates:
- Logs
- Traces
- Metrics
into a single RCA process.
2. LLM-Based Reasoning Agent
LLMs are used not for detection, but for:
- Interpreting anomalies
- Correlating evidence
- Producing human-readable explanations
3. Real-World Validation
The system is evaluated on the CCF International AIOps Challenge 2025 dataset, where it:
- Achieves a final score of 50.71
- Demonstrates robustness through ablation studies
System Architecture Overview
MicroRCA-Agent is built as a modular system, making it suitable for real production environments.

Each module is independent but designed to feed structured information to the next.

Input Format and Data Preprocessing
Input JSON
The system accepts a JSON file containing:
- uuid: fault identifier
- description: textual fault description
- start_time and end_time: anomaly window
Preprocessing Tasks
- Timestamp normalization (nanosecond precision)
- Time window alignment across data sources
- Data cleaning and synchronization
This step ensures all logs, traces, and metrics refer to the same failure interval.
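To make the preprocessing concrete, here is a minimal Python sketch. The field names (uuid, description, start_time, end_time) come from the paper's input format; the sample values and helper functions are illustrative assumptions, not the authors' code.

```python
import json
from datetime import datetime

# Hypothetical fault case; field names follow the paper, values are made up.
fault_case = json.loads("""
{
  "uuid": "case-0001",
  "description": "checkout latency spike",
  "start_time": "2025-06-01T12:00:00+00:00",
  "end_time": "2025-06-01T12:15:00+00:00"
}
""")

def to_unix_nanos(iso_ts: str) -> int:
    """Normalize an ISO-8601 timestamp to Unix epoch nanoseconds."""
    return int(datetime.fromisoformat(iso_ts).timestamp() * 1_000_000_000)

# Align every data source to the same fault window.
window_start = to_unix_nanos(fault_case["start_time"])
window_end = to_unix_nanos(fault_case["end_time"])

def in_window(ts_nanos: int) -> bool:
    """True if a log/trace/metric timestamp falls inside the fault window."""
    return window_start <= ts_nanos <= window_end
```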
Log Fault Extraction Module
Problem
Logs are:
- Unstructured
- Extremely verbose
- Filled with irrelevant messages
Solution
MicroRCA-Agent applies multi-stage log processing:
1. Keyword Filtering
Regex-based filtering retains only failure-related lines, matching keywords such as:
- error
- exception
- failure
2. Drain Log Parsing Algorithm
Drain converts raw logs into templates by:
- Removing variable fields (IDs, timestamps)
- Retaining semantic structure
This dramatically reduces log volume while preserving meaning.
3. Multi-Level Filtering
Logs are further filtered by:
- Time window
- Service identity
- Deduplication
Output
A concise set of fault-related log templates, ready for reasoning.
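Below is a hedged sketch of this pipeline using the open-source drain3 package (an implementation of the Drain algorithm). The keyword pattern and the Drain configuration are assumptions; only the overall filter-then-template flow follows the paper.

```python
import re
from drain3 import TemplateMiner  # pip install drain3

# Stage 1: keyword filtering. The exact regex is an assumption.
FAULT_PATTERN = re.compile(r"error|exception|fail(ed|ure)?", re.IGNORECASE)

# Stage 2: Drain groups lines into templates, masking variable fields.
miner = TemplateMiner()

def extract_fault_templates(log_lines):
    """Return deduplicated fault-related log templates."""
    templates = set()
    for line in log_lines:
        if not FAULT_PATTERN.search(line):
            continue  # stage 1: drop lines without failure keywords
        result = miner.add_log_message(line)  # stage 2: template mining
        templates.add(result["template_mined"])  # stage 3: dedup via set
    return templates

# Two lines differing only in a request ID collapse into one template,
# roughly: "payment-svc request <*> failed: connection refused"
logs = [
    "payment-svc request 8f3a failed: connection refused",
    "payment-svc request 91bc failed: connection refused",
]
print(extract_fault_templates(logs))
```

Time-window and per-service filtering (the paper's multi-level stage) would slot in before the keyword check; they are omitted here for brevity.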
Trace Anomaly Detection Module
What Are Traces?
Traces represent:
- End-to-end request paths
- Inter-service call chains
- Latency and dependency structure
Detection Strategy
The system combines two techniques:
1. Isolation Forest (Unsupervised ML)
- Trained on normal trace durations
- Detects abnormal latency patterns
- Works without labeled data
2. Status Code Analysis
- Identifies failed requests (e.g., HTTP 5xx)
- Captures explicit failures missed by latency models
Why Both?
- Not every latency anomaly is a failure
- Not every failure shows up as a latency anomaly
Combining both improves accuracy.
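A minimal sketch of this combination, using scikit-learn's IsolationForest. The feature choice (span duration only) and the thresholds are simplifying assumptions; the paper's actual feature engineering is richer.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Train on durations from the healthy baseline window (no labels needed).
rng = np.random.default_rng(0)
normal_durations = rng.normal(loc=120.0, scale=15.0, size=(1000, 1))

forest = IsolationForest(contamination=0.01, random_state=0)
forest.fit(normal_durations)

def is_anomalous(duration_ms: float, http_status: int) -> bool:
    """Flag a span if latency is an outlier OR the status is an explicit error."""
    latency_outlier = forest.predict([[duration_ms]])[0] == -1
    explicit_failure = http_status >= 500
    return latency_outlier or explicit_failure

print(is_anomalous(118.0, 200))  # False: normal latency, success
print(is_anomalous(950.0, 200))  # True: latency outlier, no error code
print(is_anomalous(121.0, 503))  # True: explicit failure at normal latency
```

The last two calls show why both signals are needed: each catches a case the other misses.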
Metric Fault Summarization Module
Metrics Challenges
- Thousands of metrics
- High natural variance
- Noise overwhelms signal
Step 1: Symmetric Ratio Filtering
Metrics are filtered using statistical comparison between:
- Fault window
- Normal baseline window
Only significantly changed metrics are retained.
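The paper's exact statistic isn't reproduced in this summary, but one plausible reading of a symmetric ratio test is sketched below; the formula, the use of window means, and the threshold are all assumptions made for illustration.

```python
def symmetric_ratio(fault_value: float, baseline_value: float,
                    eps: float = 1e-9) -> float:
    """Max of the two directional ratios, so spikes and drops score alike."""
    a, b = abs(fault_value) + eps, abs(baseline_value) + eps
    return max(a / b, b / a)

def filter_metrics(metric_means, threshold=2.0):
    """Keep metrics whose fault-window mean deviates strongly from baseline."""
    return {
        name: round(symmetric_ratio(fault, base), 2)
        for name, (fault, base) in metric_means.items()
        if symmetric_ratio(fault, base) >= threshold
    }

metric_means = {
    "pod_cpu_usage":  (0.92, 0.30),    # ~3.1x increase -> kept
    "pod_mem_bytes":  (512.0, 500.0),  # ~1.02x -> filtered out as noise
    "svc_error_rate": (0.04, 0.001),   # 40x increase -> kept
}
print(filter_metrics(metric_means))
# {'pod_cpu_usage': 3.07, 'svc_error_rate': 40.0}
```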
Step 2: Two-Stage LLM Summarization
Stage 1
The LLM summarizes metrics at:
- Pod level
- Service level
Stage 2
The LLM combines summaries across:
- Infrastructure layers
- Services
This staged approach:
- Reduces token usage
- Preserves hierarchical context
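A rough sketch of how such a staged flow might look; the `llm.complete` client and the prompt wording are hypothetical stand-ins, not the paper's actual API or prompt text.

```python
def summarize(llm, prompt: str) -> str:
    # `llm` is any chat/completion client with a single-call interface;
    # this wrapper is a hypothetical stand-in.
    return llm.complete(prompt)

def two_stage_metric_summary(llm, pod_metrics: dict, service_metrics: dict) -> str:
    # Stage 1: summarize each pod and service independently.
    # Each prompt stays small, which caps per-call token usage.
    pod_summaries = {
        pod: summarize(llm, f"Summarize these metric anomalies:\n{text}")
        for pod, text in pod_metrics.items()
    }
    svc_summaries = {
        svc: summarize(llm, f"Summarize these metric anomalies:\n{text}")
        for svc, text in service_metrics.items()
    }
    # Stage 2: merge the short stage-1 outputs into one system-level view,
    # keeping the pod -> service hierarchy explicit in the prompt.
    merged = "\n".join(f"[pod {p}] {s}" for p, s in pod_summaries.items())
    merged += "\n" + "\n".join(f"[service {n}] {s}" for n, s in svc_summaries.items())
    return summarize(llm, "Combine into one metric fault summary:\n" + merged)
```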

Multimodal LLM-Based Root Cause Analysis
This is the core innovation of the paper.
Why LLMs Here?
LLMs excel at:
- Pattern synthesis
- Contextual reasoning
- Natural language explanation
But they are not used for raw detection.
Multimodal Prompt Design
The LLM receives:
- Log fault templates
- Trace anomaly summaries
- Metric behavior summaries
Carefully designed prompts instruct the LLM to:
- Correlate evidence
- Identify causal chains
- Explain why the failure happened
LLM Output Structure
The output includes:
- Root cause component
- Failure description
- Supporting evidence
- Logical reasoning path
This transforms raw telemetry into actionable insight.
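To illustrate, here is a hedged sketch of how such a multimodal prompt could be assembled. The section layout mirrors the inputs and output fields listed above, but the wording is invented; the paper's actual prompt templates are not reproduced here.

```python
# Prompt skeleton combining the three evidence sources. All sample
# evidence strings below are fabricated for illustration.
RCA_PROMPT = """You are a root cause analysis assistant for microservices.

## Log fault templates
{log_templates}

## Trace anomaly summary
{trace_summary}

## Metric fault summary
{metric_summary}

Correlate the evidence across all three sources, identify the causal
chain, and answer with:
1. Root cause component
2. Failure description
3. Supporting evidence
4. Reasoning path
"""

prompt = RCA_PROMPT.format(
    log_templates="payment-svc request <*> failed: connection refused",
    trace_summary="frontend -> payment-svc spans show 8x latency and 503s",
    metric_summary="payment-svc pod_cpu_usage at 3.1x baseline in fault window",
)
# The LLM's reply is parsed into the four structured fields listed above.
```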
Experimental Evaluation
Dataset
- CCF International AIOps Challenge 2025
Performance
- Final score: 50.71
- Competitive with state-of-the-art methods
Ablation Studies
The authors remove components to measure impact:

| Removed Component | Performance Drop |
| --- | --- |
| Logs | Significant |
| Traces | Significant |
| Metrics | Significant |
| LLM Reasoning | Severe |
Strengths of MicroRCA-Agent
✅ Multimodal integration
✅ Explainable results
✅ Modular design
✅ Practical deployment potential
✅ Combines ML + LLM strengths
Limitations and Future Directions
The paper acknowledges:
- Prompt sensitivity
- LLM cost and latency
- Need for domain adaptation
Future work may include:
- Fine-tuned domain LLMs
- Online learning
- Autonomous remediation agents
Why This Paper Is Important
MicroRCA-Agent shows that:
LLMs are not just chatbots — they can be intelligent system reasoning agents.
This work bridges:
- AIOps
- Microservices
- Classical ML
- LLM-based reasoning
It represents a new generation of intelligent observability systems.
MicroRCA-Agent blends traditional techniques (like regex filtering and Isolation Forest) with advanced LLM reasoning to create a powerful RCA engine for microservices. It automates root cause diagnosis while providing interpretable explanations — addressing one of the toughest problems in distributed systems monitoring and reliability.
Paper Reference
MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents
arXiv:2509.15635
