Evaluate AI agents systematically with Agent-EvalKit

Overview

AWS Labs released Agent-EvalKit, a free open-source toolkit under the Apache 2.0 license that tests AI agents by tracing their full execution path, not only their final answers. It plugs into AI coding assistants including Claude Code, Kiro CLI, and Kilo Code, running evaluation through six phases inside the developer's workspace. In a sample travel research agent, the toolkit caught the agent inventing exchange rates and temperatures when its web search came back empty, a failure that polished-looking output alone would hide.

Key Takeaways

Agent-EvalKit is an open-source toolkit from AWS Labs, released under the Apache 2.0 license so businesses use it freely for commercial work.
It works as slash commands inside AI coding assistants, with support for Claude Code, Kiro CLI, and Kilo Code, keeping testing in the development workspace instead of after launch.
The toolkit traces which tools an agent called, what data those tools returned, and whether the final answer matched that data, surfacing hidden failures that look fine on the surface.
Six phases run the workflow: plan, data, trace, run agent, eval, and report, ending with code-level fix recommendations tied to specific locations.
Three core scores guide quality: faithfulness (is the answer grounded in real tool data), tool parameter accuracy (did it call tools correctly), and response quality (is the answer coherent and useful).
A demo travel research agent scored low on faithfulness, showing the agent fabricated facts when its data sources returned nothing.

Stats & Key Facts

#Apache 2.0 license, the permissive open-source license that allows free commercial and personal use
#6 evaluation phases structure the full testing workflow: plan, data, trace, run agent, eval, and report
#3 supported AI coding assistants at launch: Claude Code, Kiro CLI, and Kilo Code
#100 multi-turn test sessions ran against the sample travel research agent across five scenario types
#32.3% faithfulness score for the demo agent, showing weak grounding in real tool data
#83.9% response quality score for the same agent, far above its faithfulness, exposing the gap between polish and accuracy

Evaluate AI agents systematically with Agent-EvalKit

Why final answers alone hide agent failures

The toolkit targets a gap that simple output testing misses.

AI agents pick their own tools and decide the order of steps, so a smooth, confident response does not prove the work behind it was sound. An agent can return a well-written answer while skipping a verification step or inventing a number when a data source comes back empty.

Agent-EvalKit checks the whole execution path. It records which tools the agent called, what those tools returned, and whether the final reply faithfully reflects that data. This approach catches failures that stay invisible when you read only the last message.

Six evaluation phases from planning to code-level fixes

The workflow runs as six slash commands, each handling one stage.

›Plan reads the agent's code and designs metrics matched to concrete test methods, ranked by the quality goals you describe.
›Data generates test cases with expected outcomes aimed at specific behaviors and likely failure modes.
›Trace adds OpenTelemetry-compatible tracing so the execution path becomes visible, with support for Strands, LangGraph, and CrewAI.
›Run agent executes the agent against the test cases and produces structured trace files capturing tool calls and intermediate state.
›Eval turns metrics into runnable code, supporting the DeepEval and Strands Evals SDK libraries, then Report delivers ranked fix recommendations pointing to exact code locations.

Three quality scores: faithfulness, tool accuracy, and response quality

The toolkit centers on three measures that separate polish from correctness.

›Faithfulness checks whether the answer is grounded in the data the tools actually returned, catching fabricated facts.
›Tool parameter accuracy checks whether the agent called each tool with the correct inputs.
›Response quality rates how coherent and useful the final answer reads to a person.

The travel research agent that fabricated data

AWS used a sample agent to show the toolkit at work.

The demonstration agent was built with the Strands Agents SDK on Amazon Bedrock, with tools for web search, flight information, climate data, currency conversion, and budget calculation. Evaluation ran 100 multi-turn test sessions across destination research, seasonal timing, itinerary building, comparisons, and budget planning.

The results exposed a clear problem. The agent scored 32.3% on faithfulness and 64.5% on tool parameter accuracy, yet 83.9% on response quality. In plain terms, the answers read well but often did not match real data. When web search returned empty results, the agent invented exchange rates and temperatures instead of admitting it lacked information.

Built into the coding workflow, not bolted on after launch

Agent-EvalKit runs where developers already work.

Instead of treating evaluation as a separate step after deployment, the toolkit lives inside AI coding assistants as slash commands. Teams describe quality goals and failure modes in plain language, and that guidance shapes test generation, metric choice, and the final recommendations.

Test cases and tracing instrumentation are reusable across runs, and the toolkit fits into CI/CD pipelines as an automated quality gate. Developers re-run any single phase with new guidance, so evaluation becomes part of ongoing development rather than a one-time audit.

What this means for non-technical teams buying or building agents

Here is the plain-language takeaway for business readers.

The headline lesson is that an agent sounding confident and reading well tells you little about whether it is right. The demo agent's 83.9% response quality next to its 32.3% faithfulness is the warning: good prose can mask invented facts. Any business deploying an agent for research, support, or financial tasks faces this risk.

Tools like Agent-EvalKit give teams a way to test for grounded accuracy before customers see the output. Because it is free under Apache 2.0 and works with assistants such as Claude Code, the barrier to adding this kind of checking is low for technical teams, which is worth asking about when vetting an agent vendor or internal build.

Frequently Asked Questions

What is Agent-EvalKit?

Agent-EvalKit is an open-source toolkit from AWS Labs that systematically tests AI agents by tracing their full execution path, including which tools they called and whether the final answer matched the data those tools returned. It is released under the Apache 2.0 license.

Which AI coding assistants does it work with?

At launch it supports Claude Code, Kiro CLI, and Kilo Code. It runs as slash commands inside these assistants, so evaluation happens in the developer's workspace rather than as a separate step after deployment.

What does the faithfulness score measure?

Faithfulness measures whether an agent's response is grounded in the data its tools actually returned. A low score, such as the 32.3% in the demo agent, signals the agent is producing claims not backed by real data, including fabricated facts.

What were the main findings in the demo travel agent test?

Across 100 multi-turn test sessions, the agent scored 83.9% on response quality but only 32.3% on faithfulness. It invented exchange rates and temperatures when its web search returned empty results, showing well-written answers can still be wrong.

Does it cost anything to use?

No. Agent-EvalKit is free and open source under the Apache 2.0 license, which permits both commercial and personal use. Running it does require an AWS account with Amazon Bedrock access for the example setup.

Agent-EvalKit reframes agent testing around what the agent actually did, not only what it said, giving teams a free, structured way to catch fabricated or ungrounded answers before users see them. For any business relying on AI agents, the demo's gap between high response quality and low faithfulness is the lasting reminder that confident output is not the same as correct output.