Need Agent Evaluation Framework
Context engineers need to evaluate whether AI agents called APIs correctly, with the right parameters and in the right order, not just whether the final output looks correct.
Persona Story:
Nina, a context engineer, needs to evaluate whether AI agents called APIs correctly, with the right parameters and in the right order, not just whether the final output looks correct.
Problem Context
- Current eval frameworks judge only what the LLM says at the end
- No verification that the correct API calls were actually made
- Evaluation needs to measure state changes on upstream services (see the sketch after this list)
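To make the gap concrete, here is a minimal sketch of trace-based evaluation: instead of grading only the final answer, the harness records every tool call the agent made and checks names, parameters, and order against an expectation. All names here (ToolCall, AgentTrace, evaluate_trace) are illustrative, not part of any existing framework.

```python
# Minimal sketch (hypothetical names): evaluate an agent run by the tool
# calls it made, not only by its final answer.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str       # tool / API operation invoked
    params: dict    # parameters the agent passed

@dataclass
class AgentTrace:
    final_answer: str
    calls: list[ToolCall] = field(default_factory=list)

def evaluate_trace(trace: AgentTrace, expected: list[ToolCall]) -> list[str]:
    """Return a list of failures; an empty list means the run passed."""
    failures = []
    if len(trace.calls) != len(expected):
        failures.append(f"expected {len(expected)} calls, got {len(trace.calls)}")
    for i, (got, want) in enumerate(zip(trace.calls, expected)):
        if got.name != want.name:
            failures.append(f"step {i}: called {got.name!r}, expected {want.name!r}")
        mismatched = {k: v for k, v in want.params.items() if got.params.get(k) != v}
        if mismatched:
            failures.append(f"step {i}: wrong or missing params {mismatched}")
    return failures

# Example: the agent should create the ticket before notifying the customer.
expected = [
    ToolCall("create_ticket", {"priority": "high"}),
    ToolCall("notify_customer", {"channel": "email"}),
]
trace = AgentTrace(
    final_answer="Ticket created and customer notified.",
    calls=[ToolCall("create_ticket", {"priority": "high"}),
           ToolCall("notify_customer", {"channel": "email"})],
)
print(evaluate_trace(trace, expected) or "PASS")
```

A final-answer judge would pass a run that skipped `create_ticket` entirely; the trace check would not.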
Problem Impact
- Cannot iterate on context engineering approaches without precise measurement
- “Hope that it actually made all the right calls” isn’t acceptable for production
- Without proper evaluation, changes to tool descriptions can silently break integrations
Naftiko Today
- OutputParameters normalization layer ensures raw API payloads never reach the LLM, making it possible to verify that the correct data was extracted and shaped from each API call
- JSON Schema validation and Spectral ruleset (15 rules) enforce structural correctness of capability specs, catching misconfigurations before runtime (see the validation sketch after this list)
- Multi-step orchestration with cross-step data mapping provides a declarative, auditable record of which API calls should happen and in what order
- MCP tool exposure gives a well-defined tool contract that can be tested against expected call patterns
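As a rough illustration of the pre-runtime checks described above, the sketch below validates a declarative multi-step spec against a JSON Schema so that misconfigurations (missing step ids, malformed cross-step inputs) fail before any API is called. The spec shape and field names are hypothetical stand-ins, not Naftiko's actual capability-spec format; only the `jsonschema` library call is standard.

```python
# Hedged sketch: the spec below is illustrative, not Naftiko's real format.
# It shows catching structural misconfigurations before runtime.
import jsonschema  # pip install jsonschema

CAPABILITY_SCHEMA = {
    "type": "object",
    "required": ["name", "steps"],
    "properties": {
        "name": {"type": "string"},
        "steps": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["id", "operation"],
                "properties": {
                    "id": {"type": "string"},
                    "operation": {"type": "string"},
                    # cross-step mapping: values may reference earlier step outputs
                    "inputs": {"type": "object"},
                },
            },
        },
    },
}

spec = {
    "name": "refund_order",
    "steps": [
        {"id": "lookup", "operation": "orders.get",
         "inputs": {"order_id": "$.request.order_id"}},
        {"id": "refund", "operation": "payments.refund",
         "inputs": {"charge_id": "$.steps.lookup.charge_id"}},
    ],
}

jsonschema.validate(spec, CAPABILITY_SCHEMA)  # raises ValidationError on a bad spec
print("spec is structurally valid; step order and data flow are auditable")
```

Because the step order and cross-step inputs are declared rather than improvised by the model, an evaluator can diff an agent's actual call sequence against this record.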
Naftiko Tomorrow
- Mock mode (May 2026) would enable dry-run evaluation of agent tool calls without hitting real APIs, verifying call sequences and parameters (sketched after this list)
- Tool annotations for readOnly/destructive/idempotent (May 2026) would let evaluation frameworks classify and validate the safety profile of each agent action
- Execution history API (future) would provide a full audit trail of which tools were called, with what parameters, enabling precise evaluation of agent behavior
- Conditional steps with if/for-each/parallel-join (May 2026) would make orchestration logic explicit and testable
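A hedged sketch of how dry-run evaluation with safety annotations could fit together once these features land: a mock runner records every call (the audit trail) and flags destructive operations instead of executing them, so an evaluator can assert on call order, parameters, and safety profile. `MockToolRunner` and the annotation keys below are hypothetical, not a shipped Naftiko or MCP API.

```python
# Hypothetical mock-mode runner: records calls instead of hitting real APIs.
class MockToolRunner:
    def __init__(self, annotations: dict):
        self.annotations = annotations   # e.g. {"payments.refund": {"destructive": True}}
        self.history = []                # audit trail of recorded calls

    def call(self, tool: str, **params):
        meta = self.annotations.get(tool, {})
        self.history.append({"tool": tool, "params": params, "meta": meta})
        if meta.get("destructive"):
            # Dry run: record the intent, mutate nothing upstream.
            return {"dry_run": True, "would_execute": tool}
        return {"dry_run": True, "result": "stubbed"}

runner = MockToolRunner({
    "orders.get": {"readOnly": True, "idempotent": True},
    "payments.refund": {"destructive": True},
})
runner.call("orders.get", order_id="o-123")
runner.call("payments.refund", charge_id="ch-456")

# An evaluator can now assert on call order, parameters, and safety profile.
assert [h["tool"] for h in runner.history] == ["orders.get", "payments.refund"]
destructive_calls = [h for h in runner.history if h["meta"].get("destructive")]
assert len(destructive_calls) == 1
print("dry-run trace verified:", [h["tool"] for h in runner.history])
```

The same trace structure would serve the execution history API use case: the recorded history is exactly the audit trail an evaluation framework needs.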