ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
Teams migrate applications across frameworks to improve maintainability, cloud readiness, developer productivity, and access to modern capabilities. Recent advances in coding agents have sparked excitement around AI-assisted modernization. But an important question remains: Can AI agents reliably modernize real-world enterprise applications?
Key Takeaways
- Existing software engineering benchmarks have demonstrated impressive progress in bug fixing and code generation, but framework migration presents a fundamentally different challenge.
Success requires not only translating code, but also preserving behavior, adapting build systems, and navigating runtime dependencies.
- A simple repository migration can require changes across dependency injection, persistence configuration, queries, and framework descriptors.
Small mistakes in any of these pieces can prevent successful deployment.
- This provides a much more realistic measure of modernization quality.
Benchmark at a Glance Metric Value Applications 34 Framework implementations 102 Migration tasks 204 Lines of code ~151K Source and test files ~2,000 Expert-written tests 1,331 ScarfBench includes both focused migration tasks and whole-application migrations.
- Despite strong performance on traditional software engineering benchmarks, framework migration remains difficult.
Success rates vary considerably across framework pairs and whole-application migrations remain particularly challenging.
- Figure: Migration Outcomes by Target Framework Migration difficulty depends strongly on the target framework, with Jakarta EE proving particularly challenging.
Stats & Key Facts
- #Benchmark at a Glance Metric Value Applications 34 Framework implementations 102 Migration tasks 204 Lines of code ~151K Source and test files ~2,000 Expert-written tests 1,331 ScarfBench includes both focused migration tasks and whole-application migrations.
- #info/leaderboard Even the strongest current agents achieve less than 10% behavioral success, illustrating the gap between generating compilable code and preserving application behavior.
Existing software engineering benchmarks have demonstrated impressive progress in bug fixing and code generation, but framework migration presents a fundamentally different challenge. Success requires not only translating code, but also preserving behavior, adapting build systems, and navigating runtime dependencies. To address this gap, we introduce ScarfBench (Self-Contained Application Refactoring Benchmark) , an open benchmark for evaluating AI agents on cross-framework migration tasks in Enterprise Java.
ScarfBench focuses on migrations across three major Java ecosystems: Spring Jakarta EE Quarkus Unlike traditional benchmarks that compare generated code against reference implementations, ScarfBench evaluates whether migrated applications actually build, deploy, and preserve behavior. Why Migration Is Hard Framework migration is much more than replacing annotations. A simple repository migration can require changes across dependency injection, persistence configuration, queries, and framework descriptors.
Small mistakes in any of these pieces can prevent successful deployment. Figure: Spring → Jakarta Migration Example Framework migration requires translating framework semantics, not just source code. Introducing ScarfBench ScarfBench provides a systematic way to evaluate AI agents on enterprise Java framework migration tasks.
For more details please read the original article at Hugging Face.
Continue Learning
Comments
Sign in to join the conversation