As AI systems evolve from generative models to autonomous agents, moving from experimentation to production brings new challenges.
At ‘Getting Agentic Apps Ready for Production: Lessons in Observability and Evaluation’, a tech deep-dive session at DevSparks Pune 2026, Anannya Roy, Developer Advocate, Gen AI at Amazon Web Services, explored why agentic applications often fail in production, and what teams must do to prevent it.
The session focused on building strong observability and continuous evaluation frameworks that trace agent decisions, monitor behaviour, and ensure reliability at scale.
Roy began by explaining how developer needs have shifted from generative AI to agentic AI, with growing expectations for systems that can reason, plan, and act autonomously.
“Not long ago, we all heard that large language models (LLMs) could talk to us. We provided prompts and instructions, and it was able to respond to us, carrying out tasks such as summarization or finding the right intent. Then we, as developers, realized that this is not going to work for us. It's doubling our work. We wanted agents – systems that could reason, plan and act on our behalf,” she said.
That's where the shift happened. “We started with GenAI and moved to agentic AI – fully autonomous systems that could help us make our lives easier. And with this, we definitely reduced human oversight.”
However, the move to agentic AI also introduces new complexities, especially when transitioning from proof of concept to production.
Developers must understand how agents reason, why they choose specific actions, and how those actions scale from hundreds to millions of users. To do this effectively, they must address challenges around security, governance, scalability, and transparency.
Roy noted that agentic systems also introduce new risks. Their non-deterministic nature means the same prompt can trigger different decision paths. Agents may misinterpret business rules, overstep their authority, or expose sensitive data if guardrails are weak.
These failures often cascade, from hallucinations and faulty reasoning to poor response quality, latency, and rising operational costs. Even small changes, such as modifying a tool, switching models, or adjusting a prompt, can alter outcomes.
For Roy, the solution lies in building strong observability and evaluation frameworks that trace decisions, detect drift, and ensure agents remain reliable, transparent, and production-ready.
Why evaluation frameworks are essential
Roy said observability alone is not enough when deploying agentic systems. The key question is: how should organizations observe these systems, and what exactly should they monitor?
Once agents are deployed in production, they generate vast volumes of logs. Teams must analyze these logs to understand what happened: why an agent took a particular action and whether the outcome was correct.
However, systems cannot automatically distinguish between good and bad outcomes. Human oversight remains essential. Humans are intentionally placed in the loop to evaluate agent behaviour and guide improvements.
This makes structured evaluation critical. Organizations must detect issues such as hallucinations or incorrect reasoning before systems move from local environments to production. Without proper evaluation, customers may receive inaccurate or harmful responses, even when guardrails are in place.
Agentic systems are also highly sensitive to change. A small prompt adjustment, a model update, or a shift in business policy can significantly alter outcomes.
Roy emphasized that evaluation cannot be a one-time exercise. It must be continuous.
“You start by building an agent. You set the right evaluation parameters, identify the right logs that you will be capturing, and the right logs that you have to evaluate. And then finally, you build test datasets and re-run this cycle to monitor how the agent behaves in production.”
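The cycle Roy describes (build the agent, set evaluation parameters, capture the right logs, build test datasets, re-run) can be sketched in code. This is a minimal illustration with hypothetical names (`TestCase`, `toy_agent`, keyword matching as the "evaluation parameter"), not the actual tooling shown in the session:

```python
from dataclasses import dataclass

# Hypothetical test case: a prompt plus the behaviour we expect to see.
@dataclass
class TestCase:
    prompt: str
    expected_keywords: list

@dataclass
class EvalResult:
    prompt: str
    response: str
    passed: bool

def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent invocation; returns a canned response.
    return f"Plan for: {prompt}"

def run_evaluation(agent, dataset):
    """One pass of the build -> evaluate -> re-run cycle: call the agent
    on every test case and record whether it met expectations."""
    results = []
    for case in dataset:
        response = agent(case.prompt)
        passed = all(kw.lower() in response.lower() for kw in case.expected_keywords)
        results.append(EvalResult(case.prompt, response, passed))
    return results

dataset = [TestCase("weekend trip to Pune", ["plan", "pune"])]
results = run_evaluation(toy_agent, dataset)
print(all(r.passed for r in results))  # True
```

In practice the test dataset grows with each production incident, and the same `run_evaluation` pass is re-run after every prompt, model, or tool change.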
Roy then demonstrated the use of multiple test agents to evaluate different use cases, including planning trips, recommending budgets, and handling multi-turn conversations.
She also showcased how the Amazon Bedrock AgentCore platform configures evaluation metrics and monitors agent behaviour across multiple sessions. The demonstration highlighted the importance of continuous evaluation and the role humans play in improving agent performance.
Tracking performance in production
The session then shifted to the production phase. Roy explained that production readiness for agentic systems depends heavily on monitoring and evaluation.
Once an agent is built and deployed, teams must configure how it will be observed in real-world environments.
The process begins by selecting the deployed agent and defining multiple evaluators. These evaluators test different scenarios and behavioural patterns, running test cases repeatedly across multiple sessions to generate trace logs.
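Running evaluators repeatedly across sessions produces the trace logs the rest of the pipeline consumes. A rough sketch of what that generation step might look like, with illustrative field names (the real AgentCore trace schema will differ):

```python
import uuid

def make_trace(session_id, step, action, tool, latency_ms):
    """One trace record of a single agent decision. The fields here
    (action, tool, latency) are illustrative, not a real schema."""
    return {
        "session_id": session_id,
        "step": step,
        "action": action,
        "tool": tool,
        "latency_ms": latency_ms,
    }

def run_test_sessions(n_sessions, steps_per_session):
    """Re-run the same scenario across multiple sessions so recurring
    patterns, not one-off traces, drive the analysis."""
    traces = []
    for _ in range(n_sessions):
        session_id = str(uuid.uuid4())
        for step in range(steps_per_session):
            traces.append(
                make_trace(session_id, step, "tool_call", "search_flights", 120)
            )
    return traces

traces = run_test_sessions(n_sessions=3, steps_per_session=2)
print(len(traces))  # 6
```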
Roy noted that even a single trace log can reveal issues, but recurring patterns are what help teams identify the changes required in the system.
She suggested a hybrid evaluation approach. Offline evaluations involve subject-matter experts (SMEs) reviewing behavior, while online evaluations rely on analytics dashboards that track patterns and performance in real time.
Monitoring ultimately depends on the optimization goal.
“You run these test cases multiple times, and finally, you get an agent that can be responsible and accountable for its actions. It depends on what you monitor, what you observe. Are you trying to optimize your overall application or a particular event?” she asked.
If the focus is on the agent itself, teams observe behavioral indicators: whether the agent selects the right tools, how effectively it uses them, and how it handles multi-turn conversations.
Teams also check for issues such as context overload, memory gaps, or incorrect contextual reasoning.
At the application level, monitoring focuses on broader metrics, including cost, latency, and response quality. Session-level metrics evaluate overall performance, while trace-level metrics assess specific behaviours such as hallucinations, coherence, faithfulness, and tool selection.
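The two levels Roy distinguishes can be connected directly: trace-level scores roll up into session-level summaries. A small sketch, assuming trace-level scores already exist (e.g. from an LLM-as-judge or SME review); the metric names mirror those mentioned in the session, but the numbers are made up:

```python
from statistics import mean

# Illustrative trace-level scores for two sessions.
traces = [
    {"session": "s1", "faithfulness": 0.9,  "latency_ms": 300, "correct_tool": True},
    {"session": "s1", "faithfulness": 0.7,  "latency_ms": 450, "correct_tool": True},
    {"session": "s2", "faithfulness": 0.95, "latency_ms": 280, "correct_tool": False},
]

def session_metrics(traces):
    """Aggregate trace-level scores up to session level."""
    sessions = {}
    for t in traces:
        sessions.setdefault(t["session"], []).append(t)
    report = {}
    for sid, ts in sessions.items():
        report[sid] = {
            "avg_faithfulness": mean(t["faithfulness"] for t in ts),
            "avg_latency_ms": mean(t["latency_ms"] for t in ts),
            "tool_accuracy": mean(1.0 if t["correct_tool"] else 0.0 for t in ts),
        }
    return report

report = session_metrics(traces)
print(report["s1"]["tool_accuracy"])  # 1.0
```

Trace-level fields catch a single bad decision (a wrong tool call, a hallucinated claim); the session rollup shows whether it is a pattern worth fixing.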
Why humans still matter in agentic AI
Roy emphasized that humans-in-the-loop remain critical when deploying agentic AI systems.
“Sometimes humans are present, not by redundancy. They are there by choice, so use them. Use a hybrid approach to figure out what went wrong. Evaluations empower you to check what went wrong. Humans can tell you how they went wrong and what needs to be fixed.”
Subject-matter experts review evaluation scores across different layers, including session accuracy, tool selection, and parameter performance.
Drilling into these metrics helps teams identify the root causes behind failures. Re-running the same prompts and test cases allows organizations to detect when performance drops or correctness changes.
Roy concluded by outlining a structured path to production: build the agent, deploy it, log every activity, and continuously monitor performance.
Teams should define clear pass–fail criteria, test across multiple sessions and edge cases, and apply insights from both automated metrics and human reviews. By combining logs, structured evaluation frameworks, and expert oversight, organizations can refine agents and ensure they consistently take the right actions.
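The pass–fail criteria Roy recommends can act as a release gate. A minimal sketch, where the metric names and thresholds are placeholders a team would set for its own application:

```python
# Hypothetical thresholds; real values depend on the use case.
THRESHOLDS = {
    "tool_accuracy": 0.95,   # fraction of correct tool selections
    "faithfulness": 0.80,    # average faithfulness score
    "p95_latency_ms": 2000,  # 95th-percentile response latency
}

def production_ready(metrics):
    """Return (ready, failures): whether the agent clears every
    threshold, and which metrics it missed if not."""
    failures = []
    if metrics["tool_accuracy"] < THRESHOLDS["tool_accuracy"]:
        failures.append("tool_accuracy")
    if metrics["faithfulness"] < THRESHOLDS["faithfulness"]:
        failures.append("faithfulness")
    if metrics["p95_latency_ms"] > THRESHOLDS["p95_latency_ms"]:
        failures.append("p95_latency_ms")
    return (not failures, failures)

ready, failures = production_ready(
    {"tool_accuracy": 0.97, "faithfulness": 0.75, "p95_latency_ms": 1500}
)
print(ready, failures)  # False ['faithfulness']
```

A gate like this turns the human reviewers' judgment into a repeatable check: SMEs decide the thresholds, and the automated run blocks deployment when any metric regresses.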
(Disclaimer – This post is auto-fetched from publicly available RSS feeds. Original source: Yourstory. All rights belong to the respective publisher.)