Research Synthesis

Agentic Shopping Summary

A Research Synthesis of Twenty-Two Foundational Papers, 2025–2026 — Evidence-Grounded Edition
Andrew Bell · March 2026 · 22 Papers · Companion: 20 Paper Translations

At two in the morning, a woman sits at her kitchen table with the blue light of her laptop shining across an unpaid electric bill, a school form she forgot to sign, and a coffee cup gone cold an hour ago. Her son needs cleats by Friday. Her father needs a replacement blood pressure cuff because the old one stopped working. Her husband asked her, sometime before dinner and after the dog threw up on the rug, if she could also find a birthday gift for his mother. She is tired enough to make mistakes, tired enough to buy the wrong size, the wrong brand, the wrong thing entirely, and this is the promise agentic shopping now extends to people like her: let the machine hold the threads. Let it remember what matters. Let it sort the noise from the need.

But the research gathered across twenty-two papers from 2025 to 2026 tells a more fragile story.

These systems can be impressive when the task is simple, when the path is straight, when the request is clean. Yet the moment real life enters the room with all its cluttered demands, they begin to fail in ways that feel less technological than painfully human.

That is the first hard truth running through this literature. Complexity is where the confidence breaks. An agent may handle a single instruction well enough to seem almost magical. But ask it to balance price, size, urgency, safety, personal preference, and availability across several turns, and performance falls sharply. Not because the system lacks words or has never seen the category before, but because holding many things at once and carrying them faithfully through a decision is harder than the glossy language around artificial intelligence has made it sound. Real shopping is rarely one clean query followed by one neat answer. It is a process shaped by memory, interruption, compromise, and emotion. It is a mother remembering that her daughter hates scratchy fabric. A man trying to stay under budget without buying something cheap enough to break. The quiet panic of needing the right item fast, with no time left for error. The research shows that agents are still poor at living inside that kind of layered reality.

There is another lesson here, and it may matter even more for brands than for consumers. The systems do not see the way people see. Across benchmark after benchmark, vision-only approaches underperform, while structured semantic representations, especially accessibility trees and other forms of ordered navigation, prove far more reliable. That sounds technical until you understand what it means in practice. It means the machine is not won over by the beauty of a page, by the mood of a lifestyle image, by the softness of a story, or by the subtle persuasion that has long made good merchandising powerful. It wants structure. It wants clarity. It wants the catalog to tell the truth in ways a system can retrieve, parse, and trust. For years, many brands have treated product storytelling as the heart of digital commerce. For human beings, perhaps it still is. But for agents, the stronger advantage may lie elsewhere, in clean product data, explicit attributes, stable retrieval, and infrastructure designed not to charm but to guide. In this new world, beauty without structure may be remembered by people and missed by machines.

And then there is personalization, that old dream of modern commerce, the promise that a system can know us, anticipate us, serve us almost tenderly. The research offers both hope and disappointment. At the population level, personas matter enormously. Remove them and performance drops. Give agents a model of what certain groups tend to want and they improve in meaningful ways. But the individual person remains elusive. The exact self, with her contradictions and private loyalties, with the brand she trusts because her mother used it, with the price she will stretch to pay for one category but not another, with the preference she herself could not explain if asked directly, still resists compression into a clean behavioral profile. The agents can learn the crowd more easily than they can learn the person. That may be the saddest and most revealing truth in the research. Commerce has always wanted to know us completely, and technology has always suggested that the final key might be near. But a human being is not a pattern alone. She is history, habit, fear, memory, exhaustion, longing, and sometimes pure irrational grace.

So the future emerging from these papers is not one in which bigger models sweep away uncertainty and become flawless shoppers on our behalf. It is something slower, sterner, and in some ways more honest. The next real gains will come from better retrieval, stronger stateful memory, more reliable constraint handling, and deeper safety architecture. They will come from systems built to remember what was said, preserve what matters, and move carefully through decisions that carry consequences. For brands, the message is not glamorous, but it is clear: the companies that make their catalogs easiest for agents to understand will hold an advantage. For platforms, the lesson is even harder. They are not merely building convenient tools. They are building intermediaries between human need and commercial choice. And whenever something stands in that space, between the tired woman at the kitchen table and the things her family depends on, the standard must be higher than novelty. It must be trust. It must be clarity. It must be a system strong enough not just to answer, but to carry the weight of what people are actually asking.

The evidence is specific. What follows is what I found.

Executive Summary

22 Foundational Papers
30–55% Task Success Range
41% Production Retrieval Relevance
19.72 pts Persona Ablation F1 Drop
~2% Individual Product Overlap
77% vs 35% Human vs Model Safety Rate

The agentic shopping literature points to a field at an inflection point. Across twenty-two papers from 2025 to 2026, performance falls sharply as tasks become more complex, especially when agents must manage multiple constraints across turns, which suggests current limits in decomposition, planning, and constraint tracking rather than simple knowledge gaps. The research also shows a clear asymmetry in perception: vision-only systems perform poorly in several benchmark settings, while structured semantic representations such as accessibility trees are much more reliable. Adding vision to accessibility-tree agents yields inconsistent results: on some tasks performance improves, on others it degrades, and no benchmark has yet demonstrated a reliable fusion strategy that consistently outperforms structured navigation alone. Personalization is also foundational at the population level, with persona removal causing large drops in session-level performance, even as individual-level preference matching remains weak. Taken together, the literature suggests that the next advances in agentic commerce will come less from larger models alone and more from better retrieval, stronger stateful architectures, more robust constraint handling, and improved safety design.

What This Means for Brands

For brands, the message is simple: shopping agents are improving fast, but they still do not behave like expert human shoppers. They struggle when a customer request includes several requirements at once, they perform best when product information is structured clearly, and they still need stronger memory, retrieval, and safety controls before they can be trusted with fully autonomous decisions. The most important takeaway is that success in agentic commerce will depend less on who uses the biggest model and more on who has the cleanest product data, the best search and retrieval systems, and the strongest infrastructure for guiding agents through complex buying journeys.

My Take: Amazon Rufus is the quintessential example of a custom model, combined with a mixture of other models, outperforming the best general-purpose models on shopping-related tasks. The moat is not model size. It is domain specialization and proprietary training data.

Research: The Three Realities

The agentic shopping research landscape reveals a field at a critical inflection point. My analysis of twenty-two foundational papers from 2025–2026 establishes three realities that reshape how we approach agentic commerce, and each one carries direct implications for platform operators, researchers, and governance teams alike.

First, performance degrades predictably with task complexity: not because models lack knowledge, but because they lack hierarchical reasoning structures. Second, vision and structured navigation operate asymmetrically: vision-only agents achieve just 4.55% task success on product search with Claude Sonnet 4 (WebMall, Table 3), while the same model using accessibility trees reaches 56%, and adding vision to accessibility-tree agents yields inconsistent results across tasks and benchmarks. Third, personalization is foundational: in CUSTOMER-R1, persona removal degrades session outcome F1 by 19.72 points (Tables 5 and 8), the largest single ablation effect in the corpus.

Although relatively new, the field has matured rapidly from conceptual frameworks to deployed production systems. Agents achieve roughly 30–55% success on realistic shopping tasks depending on benchmark and complexity (DeepShop: 30% holistic for deep research systems; WebMall: 56% for AX-Tree product search; ShoppingBench: 48.2% for GPT-4.1), with simpler structured tasks reaching substantially higher rates (EcomBench: 90%+ on Level 1). Population-level simulation shows 0.64 correlation with real A/B test outcomes (SimGym, validated across 20 storefronts). Yet a persistent 20–50% performance gap remains. Agents systematically fail at safety-aware decision making (below 40% accuracy), multi-constraint reasoning, and individual preference matching (low single-digit product overlap despite high task-level alignment).

The binding constraint is retrieval quality (41% relevance in production, per Cite Before You Speak, Table 2), not reasoning capability alone. Research attention has been disproportionately directed toward LLM improvements; fundamental gains will come from search quality, decomposition architecture, and safety infrastructure.

Methodology and Scope

The twenty-two papers were selected from the 2025–2026 arXiv corpus using three criteria: the paper must include quantitative evaluation of shopping agent performance (not purely theoretical), it must introduce a benchmark, an architectural contribution, or a production deployment (not a minor variant), and it must be citable at the time of analysis. No systematic search protocol was used; this is a purposive sample weighted toward papers that introduce evaluation infrastructure or production evidence. Papers were identified through citation chaining, arXiv keyword monitoring, and practitioner referral. The sample skews toward English-language work on Western and Chinese e-commerce platforms (Amazon, Lazada, Taobao); multilingual and emerging-market work is underrepresented.

The hidden signal extraction methodology is a structured close-reading protocol applied to each paper in two passes. The first pass extracts headline claims, reported metrics, and stated contributions. The second pass examines subordinate clauses, parenthetical disclosures, appendix tables, and footnotes for information that qualifies, contradicts, or contextualizes the main claims. This approach is inherently subject to confirmation bias: an analyst looking for discrepancies between main text and subordinate text will find them. The synthesis that follows should be read with that limitation in mind. Where a hidden signal is reported, the source location (appendix, footnote, parenthetical) is identified so readers can verify independently.

Each of the twenty-two papers has also been translated into a standalone faithfulness-audited intelligence brief in the companion document (Agentic Commerce Research Intelligence: 20 Paper Translations). That document labels every recommendation as Directly Supported, Cautious Inference, or Unsupported based on the source paper's evidence. The synthesis that follows draws on those translations for specific table and section citations; where a claim in this synthesis is inferential rather than directly measured, it is flagged as such.

Five Findings: What the Research Actually Says

Finding 01

The 30–50% Performance Ceiling Is Real on Hard Benchmarks

EcomBench — Model Performance Comparison
EcomBench — Model Performance Comparison — 12 systems clustered between 43–65%. ChatGPT-5.1 leads by 1 point over Gemini. The compressed 22-point spread shows no model is breaking from the pack. These are aggregate scores across difficulty levels — what they hide is what happens when complexity increases, which the findings below make explicit.

Shopping agents fail systematically across all models tested on complex shopping tasks. GPT-4.1 achieves 48.2% average success rate on ShoppingBench (Table 1). Claude Sonnet scores 39.0%. Vision-only agents catastrophically fail at 4.55% (WebMall, Table 3). This is not measurement variance; it is a structural characteristic. The ceiling appears consistently across DeepShop, WebMall, ShopSimulator, and ShoppingBench on harder task types, indicating the problem is inherent to the task architecture, not to any single model family. On simpler tasks, performance can be much higher — EcomBench reports over 90% on Level 1 — but as complexity increases, all models converge toward the same band.

Whether frontier models with extended reasoning (GPT-5.2 and beyond) break through this ceiling on complex tasks remains an open question. The prudent assumption: do not test your product pages exclusively against frontier models and then assume that performance generalizes to production-scale inference.

Why This Matters: Scaling foundation models alone will not solve agentic commerce. The 30–50% range on hard benchmarks is imposed by architectural constraints and data gaps, not reasoning capacity. Brands betting on frontier models alone will plateau at this performance level on complex, multi-constraint tasks.
Finding 02

Five Root Causes Decompose the Gap Into Solvable Problems

The 30–50% ceiling is not monolithic. Systematic root cause analysis identifies five distinct failure mechanisms, each quantified and each addressable.

ShoppingBench Fig. 2 — Benchmark Construction Pipeline
ShoppingBench Fig. 2 — Benchmark Construction Pipeline — Three stages: product sampling from four intent types (complexity increasing from Product Finder down to Coupon & Budget), field sampling across product attributes, and GPT-4o user instruction simulation. The downward arrow labeled "Challenging level" is the key visual — the benchmark is explicitly engineered as a complexity ladder. The performance cliff below is baked into the architecture.
Root Cause 1 — Multi-Constraint Reasoning Collapse
ShoppingBench Fig. 1 — Eight Simultaneous Constraints in Action
ShoppingBench Fig. 1 — Eight Simultaneous Constraints in Action — A single user request: sunscreen (invisible shield tech, SPF 60, cream texture), foundation stick (full coverage, oil control, built-in brush), budget of 103, conditional voucher valid only when total exceeds 83. The agent calls find_product twice, views candidates, then executes Python to verify budget compliance. The python_execute output shows total_price: 112.0, discount: 10.0, final_price: 102.0, success: true. This is not an edge case. This is the Coupon & Budget tier. This is exactly the architecture that collapses at three-plus constraints.

Simple shopping tasks (single attribute) achieve 59.6% success on ShoppingBench. Adding three simultaneous constraints (budget + brand + availability) causes collapse to 30.4%: a 29.2 percentage point absolute drop, representing approximately 49% relative decline (ShoppingBench, Table 1). ShoppingComp confirms the pattern at scale: product retrieval F1 degrades substantially as constraint count increases (ShoppingComp, Section 5.2). Fine-tuned small models with constraint-aware reward shaping show the mechanism is learnable: Qwen3-4B achieves 48.7% ASR, matching GPT-4.1's 48.2% (ShoppingBench, Table 1).
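
To make the mechanism concrete, here is a minimal Python sketch of the kind of budget check the python_execute tool performs in the Fig. 1 trace. The individual item prices are hypothetical; only the totals mirror the reported output.

```python
# Minimal sketch of the budget check a tool like python_execute performs.
# Item prices are hypothetical; the totals mirror the Fig. 1 trace.

def verify_budget(prices, budget, voucher_value, voucher_min_total):
    """Check a basket against a budget with a conditional voucher."""
    total = sum(prices)
    # The voucher applies only when the pre-discount total clears the threshold.
    discount = voucher_value if total > voucher_min_total else 0.0
    final = total - discount
    return {"total_price": total, "discount": discount,
            "final_price": final, "success": final <= budget}

print(verify_budget([68.0, 44.0], budget=103.0,
                    voucher_value=10.0, voucher_min_total=83.0))
# -> {'total_price': 112.0, 'discount': 10.0, 'final_price': 102.0, 'success': True}
```

The check itself is trivial; the hard part, which the benchmark measures, is whether the agent remembers to run it while also tracking the other constraints.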

ShoppingBench Fig. 3 — The Full API Tool Surface
ShoppingBench Fig. 3 — The Full API Tool Surface — The six tools available to the agent: find_product, view_product_information, python_execute, web_search, recommend_product, and terminate. The python_execute output (total_price, discount, final_price, success: true) shows what financial constraint satisfaction actually requires: code execution, not natural language reasoning. The terminate tool requires a deliberate decision to stop — the action agents systematically underpredict. This is the complete tool surface for the task shown above.
Root Cause 2 — Retrieval Quality Ceiling

Production retrieval systems achieve only 41.1% relevance on real-world e-commerce platforms, compared to ideal scenarios where all evidence is relevant in synthetic benchmarks (Cite Before You Speak, Table 2). This single constraint creates a hard ceiling: CGR drops from 96.52% (synthetic) to 83.86% (real world), a 13-point collapse driven primarily by retrieval quality — though the severity varies by prompting strategy: Citation Prompt mitigates the drop (maintaining 95.46% CGR on real-world data, only a 3.2-point decline from 98.65%), while Guided Prompt suffers a larger 12.6-point collapse. ShoppingComp confirms that retrieval remains the binding constraint for all models tested (Section 5.3). Improving retrieval relevance from 41% to 70% could yield significant absolute performance improvement, potentially 20–30%, without changing agent reasoning capability. This is a projection based on the documented relationship between retrieval relevance and downstream accuracy, not a measured outcome.

OptAgent Framework — Genetic Algorithm Query Optimization
OptAgent Framework — Genetic Algorithm Query Optimization — OptAgent attacks the retrieval problem from the query side rather than the index side. The user's initial query is treated as a population seed. LLMs generate variant rewrites via mutation and crossover. A multi-agent evaluator scores each variant against retrieved products, stores scores in memory, and a purchase agent selects survivors. Top-scoring queries breed the next generation. 21.98% improvement over baseline without changing the underlying model. Note: the $8 per query cost (6,109 LLM calls per optimization run, Appendix E) is visible in this architecture if you count the generation cycles. Architectural elegance and operational cost are two different things.
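
A hedged sketch of the loop the caption describes, assuming placeholder callables for the LLM rewriter and the multi-agent evaluator; only the genetic structure (selection, crossover via rewriting, survival) follows OptAgent's design.

```python
# Sketch of an OptAgent-style genetic loop. llm_rewrite and score_query
# stand in for the LLM mutation/crossover calls and the multi-agent
# evaluator; neither is the paper's actual implementation.
import random

def optimize_query(seed, llm_rewrite, score_query, pop_size=8, generations=3):
    population = [seed] + [llm_rewrite(seed) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=score_query, reverse=True)
        survivors = ranked[: pop_size // 2]             # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            children.append(llm_rewrite(f"{a} | {b}"))  # crossover via rewrite
        population = survivors + children
    return max(population, key=score_query)
```

Every score_query call is itself a batch of LLM calls, which is where the $8-per-query, 6,109-call cost accumulates.
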
AgenticShop Table 6 — Comprehensive Analysis of Curated Products
AgenticShop Table 6 — Comprehensive Analysis of Curated Products — The Invalid Page column is the buried story. ChatGPT Search sends users to invalid product pages 71.65% of the time. Claude Sonnet 4 Search: 64.44%. Gemini: 92.36%. Perplexity: 87.61%. Nearly two-thirds to nine-tenths of all product references are broken links or hallucinated pages. The Out of Budget column shows every system clustering between 0.65 and 0.74 — tight clustering that signals a category-wide architectural failure, not a model-specific one. No current system reliably aggregates base price, shipping, tax, and discount into an accurate final cost.
Root Cause 3 — Session Memory Absence

Agents with full session memory achieve 0.64 Pearson correlation with real human A/B test outcomes (SimGym, Tables 2 and 5). Agents without session memory drop to 0.29, below the threshold for reliable behavior prediction. SimGym's ablation is definitive: removing memory collapsed alignment from 69% to 55%, increased stuck-in-loop rates 5.38x (8.50% to 45.70%), and caused over half of agents to hit step limits (SimGym, Tables 3–4). Memory is non-negotiable for shopping agents.
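
What "session memory" means in practice can be sketched minimally. The field names below are illustrative, not SimGym's schema; the repeat check mirrors the stuck-in-loop failure the ablation measures.

```python
# Illustrative sketch of session state; field names are not SimGym's schema.
from dataclasses import dataclass, field

@dataclass
class SessionMemory:
    constraints: dict = field(default_factory=dict)  # e.g. {"budget": 103.0}
    viewed: list = field(default_factory=list)       # product ids already seen
    actions: list = field(default_factory=list)      # full action history

    def record(self, action, max_repeat=3):
        """Log an action; return False when the agent is stuck in a loop."""
        self.actions.append(action)
        tail = self.actions[-max_repeat:]
        return not (len(tail) == max_repeat and len(set(tail)) == 1)
```

The loop check is the part the ablation makes visible: without an action history, the 45.70% stuck-in-loop rate has nothing to trip against.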

Root Cause 4 — Perception Modality Mismatch
WebMall Table 3 — The Definitive Modality Comparison
WebMall Table 3 — The Definitive Modality Comparison — The headline: Vision-only produces 0.00% completion, precision, recall, and F1 on End to End tasks for both Claude Sonnet 4 and GPT-4.1. Complete failure — not poor performance, zero. AX-Tree alone puts Claude Sonnet 4 at 62.50% on End to End. AX-Tree + Memory: 75.00% CR, 87.50% precision, 84.26% F1. The memory jump is as large as the modality jump on several task types. The fusion paradox: Claude on Specific Product Search improves 56.06% to 60.23% with vision, but on Vague Product Search AX-Tree + Vision drops to 38.06% vs 53.61% for AX-Tree alone. Vision helps on specific tasks and actively hurts on vague ones.
WebMall Table 4 — Task-Category Disaggregation
WebMall Table 4 — Task-Category Disaggregation — Three findings Table 3 hides. First: Checkout is solved — GPT-4.1 achieves 100% CR with AX-Tree + Memory; Claude at 87.50%. The last mile works. The journey to it doesn't. Second: Vision zeros out precisely on judgment-dependent tasks (Add to Cart, Cheapest Offer Vague Requirements, End to End all show 0.00% for Claude under Vision). Third: Find Substitutes is Claude's strongest category at 83.33% CR and F1, holding even partial performance under Vision. Semantic understanding of product relationships is where current agents approach human-level competence. No paper in the corpus has built on this finding.

For Claude Sonnet 4 on Specific Product Search, vision-only agents achieve 4.55% completion rate while accessibility-tree agents achieve 56.06% (WebMall, Table 3). The gap is model-dependent — GPT-4.1 shows a smaller difference on the same task — but the directional advantage of structured navigation is consistent across models. However, adding vision to accessibility-tree agents yields inconsistent results across tasks: on Specific Product Search, performance increases from 56.06% to 60.23%, while on other task types, it decreases (WebMall, Table 3). The strongest conclusion is not that vision hurts, but that no reliable fusion strategy has yet been demonstrated to consistently outperform structured navigation alone. DeepShop confirms the grounding problem: vision-based agents using set-of-mark prompts misclassify interactive buttons, fail to segment review sections, and overlook small filtering widgets (DeepShop, Section 5.4, Figure 12).
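
For readers who have not seen one, an accessibility tree can be pulled with a few lines of Playwright. The snapshot call is real, but its output shape varies by browser and library version, and the URL is a placeholder; treat this as illustrative, not as any paper's harness.

```python
# Sketch of extracting an accessibility tree with Playwright's sync API.
from playwright.sync_api import sync_playwright

def flatten(node, depth=0, rows=None):
    """Flatten the AX tree into (depth, role, name) rows an agent can read."""
    rows = [] if rows is None else rows
    rows.append((depth, node.get("role"), node.get("name")))
    for child in node.get("children", []):
        flatten(child, depth + 1, rows)
    return rows

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    tree = page.accessibility.snapshot()
    if tree:
        for depth, role, name in flatten(tree):
            print("  " * depth, role, name)
    browser.close()
```

The output is exactly the "ordered navigation" the benchmarks reward: roles and names in document order, with none of the visual ambiguity that trips up set-of-mark prompting.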

Root Cause 5 — Safety Awareness Deficit
ShoppingComp Fig. 1 — Four-Panel Leaderboard: The Confidence-Safety Decoupling
ShoppingComp Fig. 1 — Four-Panel Leaderboard: The Confidence-Safety Decoupling — Four panels. Bottom right (Safety Pass Rate): human experts at 77.08%, best model at 37.86%, GPT-4o at 12.50%. Every model below 38%. Bottom left (Report Rationale Validity): GPT-5.2 at 91.73%, human experts at 90.96%. Agents are near-human-level at explaining their choices while being catastrophically below human-level at making safe ones. The reasoning sounds good. The decisions are not. This is not a knowledge gap — it is a confidence-safety decoupling, and this figure is the only place in the corpus where it becomes fully visible.

Best-in-class models achieve approximately 35% safety pass rate on shopping decisions. ShoppingComp reports 35.42% (±6.18%) for GPT-5.2 (Table 3). Human experts achieve 77.08%. The variance is as concerning as the mean: over 5% standard deviation indicates non-deterministic, unstable safety responses across runs (ShoppingComp, Section 5.2). No model has been explicitly trained with safety as a primary optimization objective for shopping decisions. This is not a reasoning limitation (expensive reasoning tokens do not improve safety); it is an objective misalignment.

Why This Matters: Each cause is solvable in isolation. Fine-tuning on shopping-specific data yields large gains (ShoppingBench: Qwen3-4B baseline 18.0% to SFT+RL 48.7%, a 30.7-point gain; Customer-R1: zero-shot 7.32% to SFT+RL 39.58%, a 32-point gain). Retrieval optimization yields 21.98% improvement via query rewriting alone (OptAgent, Table 1). Memory infrastructure is standard engineering. But these causes interact multiplicatively: fail at retrieval, and agents cannot find good products to reason about; lose memory, and constraints are forgotten; lack safety signal, and agents recommend hazardous items.
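
The multiplicative claim is easy to make concrete. In the toy calculation below, only the 41% retrieval relevance figure comes from the corpus; the other stage rates are assumptions chosen purely for illustration.

```python
# Toy illustration of multiplicative interaction between root causes.
# Only the 41% retrieval relevance is from the corpus; the rest is assumed.
stages = {"retrieval": 0.41, "constraint_tracking": 0.60, "memory": 0.80}

p = 1.0
for name, rate in stages.items():
    p *= rate
print(f"end-to-end success if stages were independent: {p:.0%}")  # ~20%
```

Three individually tolerable stages compound into an end-to-end rate near the bottom of the observed band, which is why fixing any single cause in isolation moves the ceiling less than expected.
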
Finding 03

Specialization Decisively Defeats Scale for Shopping

CUSTOMER-R1 Table 3 — The Inversion: Specialization Defeats Scale
CUSTOMER-R1 Table 3 — The Inversion: Specialization Defeats Scale — Read the three tiers. Proprietary models cluster between 6.78–11.69% on generated action accuracy — Claude 3.5 Sonnet v2 at 11.69% / 18.54% F1. These are not bad models. They are the wrong models for this task. Fine-tuned Qwen2.5-7B with reasoning: 17.26% / 33.86% F1. A 7-billion parameter open-source model beats every proprietary model in this table on session outcome F1. Llama-3.2-3B with reasoning hits 33.99% F1. Qwen2.5-1.5B jumps 219% on action accuracy with reasoning. Llama-3.2-3B jumps 618% on session outcome F1. The competitive hierarchy is not just closed. It is inverted.

The strategic inflection point: Domain-specific fine-tuning outperforms general-purpose scaling by substantial margins on shopping tasks. Qwen3-4B, fine-tuned with shopping behavior data from a 2.5M-product catalog, matches or exceeds GPT-4.1 (a far larger model) at a fraction of the inference cost (ShoppingBench, Table 1). On CUSTOMER-R1, the zero-shot baseline achieves 7.32% next-action accuracy; SFT+RL reaches 39.58%, a 32-point gain (Table 4). On ShopSimulator, SFT+RL training on Qwen3-8B improves multi-turn success from 6.48% to 35.50%, a 29-point gain (ShopSimulator, Table 3).

This represents a structural advantage: smaller, domain-specialized models consistently match or exceed frontier models on shopping-specific tasks across multiple independent benchmarks. Session-level purchase decisions show comparable gains when personas are integrated during training rather than prompting.

Why This Matters: The competitive moat in agentic commerce belongs to whoever controls the best shopping-specific training data, not whoever has API access to the largest models. Any Tier 1 retailer with millions of product records and tens of thousands of annotated shopping sessions can build agents that exceed frontier models on their own domain. The race is shifting from "who has the best foundation model" to "who has the best curated shopping data."
Finding 04

Population-Level Personalization Works; Individual-Level Fails

CUSTOMER-R1 Fig. 2 — Framework for Simulating User Behavior
CUSTOMER-R1 Fig. 2 — Framework for Simulating User Behavior — The model holds full HTML observation, complete session behavior history, and a persona ($50,000 income, student, INFJ, cares about brand), then predicts both the rationale and the next action simultaneously. The reward structure separates JSON format validity from action correctness. The persona is a demographic archetype, not a specific person. The rationale ("the product seems nice for my cat, I want to add it to cart") is plausible for the profile and may be completely wrong for the actual individual. Additional signal: this rationale was generated by Claude during training — the model may be learning Claude's reasoning patterns, not actual user cognition.
PAARS Fig. 1 — Designed for Populations, Not Persons
PAARS Fig. 1 — Designed for Populations, Not Persons — The alignment suite (top right) tells the structural story: query prediction gets individual-level accuracy, item selection gets semantic similarity (softer), and session generation has no individual metric at all — only KL divergence, a population distribution comparison. The system is not failing to achieve individual alignment. It is not trying to. The feedback loop propagates improvements back to the population layer, never to the individual layer. The potential applications box names exactly what this is built for: A/B testing, survey tooling, feature launches, marketplace launches. Every application is population-scale.

Personas improve population-level simulation with the real-world correlation documented in Root Cause 3 (SimGym: 0.64 correlation, validated against actual A/B test outcomes across 20 storefronts). This enables substantial speedup on A/B testing: estimated results in under 1 hour versus 2–4 weeks of live traffic. The persona ablation in CUSTOMER-R1, the largest single ablation in the corpus (19.72-point F1 drop, Tables 5 and 8), confirms that population-level conditioning is load-bearing. PAARS confirms the pattern: persona conditioning improves query alignment by approximately 17% relative (PAARS, Section 5, Table 2). KL divergence for query similarity improves from 18.81 to 17.51 (approximately 6.9% improvement).

But individual-level personalization fails. Agents achieve high task-level alignment yet near-zero exact product overlap in controlled studies: Yang et al. (LLM Agent Meets Agentic AI, Abstract) report approximately 2%. They learn what populations do; they do not learn what individuals prefer. This is architectural: aggregation by design loses individual specificity. Scaling individual personalization without hybrid architecture (personas + explicit user feedback + content-based filtering) yields negative ROI.
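
A minimal sketch of what such a hybrid could look like, assuming illustrative weights and scoring callables that no paper in the corpus specifies:

```python
# Illustrative hybrid scorer; weights and helper callables are assumptions,
# not drawn from any paper in the corpus.
def hybrid_score(item, persona_prior, user_feedback, content_sim,
                 weights=(0.3, 0.5, 0.2)):
    """Blend population prior, explicit feedback, and attribute match.
    Explicit feedback gets the largest weight because personas alone
    yield roughly 2% individual product overlap."""
    w_persona, w_feedback, w_content = weights
    return (w_persona * persona_prior(item)      # what this population buys
            + w_feedback * user_feedback(item)   # thumbs, returns, rebuys
            + w_content * content_sim(item))     # match to stated needs
```
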

ShopSimulator Fig. 3 — Error Statistics of Failed Trajectories (Claude-4-Sonnet)
ShopSimulator Fig. 3 — Error Statistics of Failed Trajectories (Claude-4-Sonnet) — Panel (b) Personalization Errors: 55.82% of personalization failures involve ignoring personal information entirely — not misinterpreting it, ignoring it. 35.71% involve over-applying it. The agent swings between two failure modes with almost no calibrated middle. Panel (a) Multi-turn Action Errors: 45.63% of buy now failures involve no detail confirmation, 31.31% involve purchase after rejection — the agent retries items the user already refused. Panel (c) Shopper Errors: agents add extra intent (3.69%), distort target intent (2.93%), go silent on key goals (0.43%). This is not a neutral executor of user intent. It actively reshapes what the user wants.
Why This Matters: Invest heavily in population-level personalization for A/B testing and conversion optimization. Do not deploy individual personalization without explicit feedback mechanisms. The platforms that use personas for what they are good at (testing at scale) and other techniques for what personas cannot do (individual recommendation) will win.
Finding 05

Memory, Constraint Tracking, and Structured Navigation Are Table-Stakes

Stateless architectures cannot function for shopping agents. The evidence across multiple benchmarks is consistent: SimGym's memory ablation (Tables 2–4) shows correlation collapse from 0.64 to 0.29, a 5.38x increase in stuck-in-loop behavior, and goal completion dropping from 90% to 45%. WebMall confirms that accessibility tree navigation vastly outperforms vision-only modality, as documented in Root Cause 4. ShopSimulator demonstrates that multi-turn success halves without session context (Table 3). Constraint tracking across turns prevents redundant search and enables detection of failure loops.

Simple stateless REST-style agents that answer single queries are becoming increasingly implausible for commerce. Production agents require stateful multi-turn session management, constraint preservation across turns, and semantic navigation via accessibility trees.

Why This Matters: Infrastructure complexity is higher than field estimates suggest. Deploying shopping agents is not "call an LLM API." It requires session persistence, constraint state machines, and accessibility tree extraction. Brands that build this infrastructure today have a meaningful head start on competitors.
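
A constraint state machine, named above, can be sketched in a few lines. The states and the example check are illustrative assumptions, not any paper's implementation.

```python
# Minimal sketch of a constraint state machine for a shopping session.
from enum import Enum

class State(Enum):
    PENDING = "pending"
    SATISFIED = "satisfied"
    VIOLATED = "violated"

class ConstraintTracker:
    def __init__(self, checks):
        # e.g. {"budget": lambda cart: cart["total"] <= 103.0}
        self.checks = checks
        self.states = {name: State.PENDING for name in checks}

    def update(self, cart):
        """Re-evaluate every constraint after each agent action, so a
        requirement stated in turn 1 is not silently dropped by turn 7."""
        for name, check in self.checks.items():
            self.states[name] = State.SATISFIED if check(cart) else State.VIOLATED
        return self.states
```
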

Paper Inventory

Twenty-two papers. Four categories. One unified signal: shopping agents exhibit predictable failure modes, measurable capability gaps, and addressable architectural constraints. Full faithfulness-audited translations of each paper are available in the companion document. The inventory below organizes the corpus by contribution type.

Benchmark Papers — 6 Core Systems
Key Signal: DeepShop reveals a steep performance cliff. For simple RAG methods, easy queries reach 16% success and hard queries drop to 0% (Section 5.3, Figure 5b). Web agents show the same degradation pattern at higher absolute rates. This is evidence of reasoning limits, not knowledge limits.

DeepShop (Lyu et al.)

600+ evolved queries across 5 product categories. Reveals agents struggle most with Fashion/Sports (visual reasoning) and that performance drops dramatically with complexity. RAG methods fail entirely at 7.33%; deep research systems plateau at 30% holistic success.

ShoppingBench (Wang et al.)

3,310 instructions across 4 intent types with 2.5M real products from Lazada. First benchmark to systematically measure multi-constraint reasoning and complex financial constraint satisfaction. GPT-4.1 achieves 48.2% overall ASR.

EcomBench (Min, Qiao et al.)

Real-world e-commerce decision-making with 7 task categories. Human-in-the-loop curation. Difficulty stratification via tool-hierarchy methodology. ChatGPT-5.1 achieves 90%+ on Level 1 but drops to 46% on Level 3.

WebMall (Peeters et al.)

First reproducible multi-shop environment. 4,421 real product offers across 3 categories. Shows accessibility trees outperform vision-only (Claude Sonnet 4: 56% vs 4.55% on product search; gap varies by model). Cross-shop aggregation identified as recurring failure mode.

ShoppingComp (Tou et al.)

Open-world shopping with 145 instances and 558 scenarios. Safety Rubric Pass Rate: 35.42% (±6.18%) best model vs 77.08% human. Models exhibit inverted precision-recall trade-off compared to human experts.

AgenticShop (Kim et al.)

First personalized product curation benchmark. 350 personalized shopping tasks across 50 user profiles and 7 scenarios, drawing from real Amazon purchase histories. Systems achieve only 13.56–37.93% curation success. ChatGPT Search hallucinates approximately 20% of product references; Table 6 disaggregates failures across five distinct columns with rates varying widely by system and failure type.

AgenticShop Figs. 4 & 5 — Radar Charts and Intent Distribution
AgenticShop Figs. 4 & 5 — Radar Charts and Intent Distribution — Figure 4 (radar charts): every system across both panels collapses on Review Sensitivity and Price Sensitivity while bulging on Brand Preferences. Agents retrieve brand signals correctly and mishandle financial constraints and social proof simultaneously. Figure 5 (intent distribution): Claude spends 80% of session time on Alternative Evaluation and only 8% on Purchase Decision. Agent-E: 64% evaluation, 6% decision. Every system is a thorough researcher and a poor closer. The purchase decision phase — the moment that generates revenue — is where every system invests the least.
Framework and Architecture Papers — 5

PAARS (Mansour et al.)

Persona-driven agent population simulation. Explicit individual-level vs. group-level alignment distinction. A/B testing validation: directional agreement on 2/3 tests, but 10–30x magnitude error.

CUSTOMER-R1 (Wang et al.)

RL-based persona conditioning with difficulty-aware reward weighting. Persona removal degrades session outcome F1 by 19.72 points (Tables 5, 8). SFT+RL achieves 39.58% next-action accuracy vs. 7.32% zero-shot.

OptAgent (Handa et al.)

Multi-agent genetic algorithm for query optimization. Temperature-sampled ensemble (5 agents) correlates with human judgment at r=0.552 (p<0.001). Tail queries show largest gains. Cost: approximately $8 per query.

See, Think, Act (Zhang et al.)

VLM integration for behavior simulation. Visual grounding adds 6+ percentage point accuracy improvement (text-only SFT+RL at 38.44% to visual SFT+RL at 44.57%) on OPeRA dataset. Introduces difficulty-aware reward scaling (DARS).

ShopSimulator (Wang et al.)

1.3M real Taobao products. SFT+RL with strict multiplicative rewards yields 35.50% multi-turn success vs. 6.48% baseline. GPT-5 achieves 32.65% on single-turn (Table 2). Agents ignore personal preferences in 55.82% of personalization failures and over-apply them in 35.71%.

Simulation and Evaluation Papers — 4
The headline evidence: SimGym achieved 0.64 correlation with real A/B test results across 20 storefronts (Tables 2, 5), providing the strongest validation of the simulation paradigm in the corpus.

Cite Before You Speak (Zeng et al.)

Production citation generation for Amazon's conversational shopping assistant. 13.83% CGR improvement on real-world data. 3–10% customer engagement uplift in A/B test with 10M+ customers. Citation quality shows 50–53% Perfect Sentence Rate (PSR) and 72% Correct Citation Rate (CCR) without fine-tuning.

Can LLM Agents Simulate (Lu et al.)

First process-centric evaluation of action-level fidelity. Prompt-only LLMs achieve approximately 12% next-action accuracy. Fine-tuning with reasoning traces reaches approximately 17%. Agents exhibit systematic purchase overprediction and termination underprediction.

Perplexity Adoption Study (Yang, Yonack et al.)

Hundreds of millions of user interactions with Comet agent across the top 100 countries. Shopping represents 10% of agentic queries; Amazon dominates at 43.2%. Strong GDP correlation with adoption (r=0.85–0.86) and moderate education correlation (r=0.75).

SimGym (Castelo et al.)

Production-deployed offline A/B testing. 69% directional alignment with real human outcomes. Six-stage persona extraction pipeline from production clickstream. Episodic memory ablation: removing it collapses correlation from 0.64 to 0.29.

Architecture, Systems and Governance — 5

WEBSERV (Lu et al.)

Full-stack browser-server environment for RL training. Incus containers: 1.78-second launch, 28 MiB storage vs Docker's approximately 9 seconds, 6.78 GiB. DOM parser streamlines raw HTML while preserving interactive elements. Claude Sonnet 4.5 achieves 46.7% on shopping tasks.

Agentic Web Survey (Yang et al.)

Defines three-dimensional framework (Intelligence × Interaction × Economics). Introduces Agent Attention Economy concept. Catalogs 16 threat types across intelligence, interaction, and value layers. MCP and A2A protocol specification.

Wizard of Shopping / TRACER (Li et al.)

Decision tree-guided dialogue generation. 3,600 conversations with 4.2/5.0 naturalness (GPT-4, human-evaluated). Single-pass generation outperforms interactive by 1.3 naturalness points.

Autonomy Levels Framework (Feng, McDonald, Zhang)

Five-level spectrum (L1 operator to L5 observer) for calibrating agent autonomy independent of capability. Identifies rubber-stamping risk at L4. Proposes autonomy certificates for governance.

LLM Agent Meets Agentic AI (Sun et al.)

Empirical user study with 40 participants identifying specific behavioral gaps: agents explore more than humans (1.9 vs 1.2 items), underestimate satisfaction, and lack trust assessment capability. Users expressed concerns about bias, personalization failures, and loss of control.

Cross-Cutting Themes

Ten patterns emerge across these twenty-two papers. Each carries direct implications for system design, research prioritization, and deployment strategy. I name them here because naming forces precision, and precision is what this field currently lacks.

Theme 01
The Complexity-Performance Correlation

Agent performance degrades dramatically with query/task complexity. The evidence shows steep, cliff-like degradation around 3–4 simultaneous constraints. Whether this constitutes a true phase transition in the technical sense (a discontinuous change of state) or steep but continuous decline remains debatable; the practical distinction matters less than the implication: solutions that work at two constraints fail categorically at four. DeepShop documents the cliff: simple RAG methods achieve 16% on easy queries and 0% on hard; web agents show the same steep decline at higher baselines (Section 5.3). ShoppingBench confirms: single-attribute tasks at 59.6% collapse to 30.4% with three constraints (Table 1). EcomBench shows the same pattern: 90%+ Level 1 to 46% Level 3 (Figure 3).

Contradiction in Evidence: OptAgent shows tail queries (sparse data) benefit most from optimization. Complexity is addressable through search strategy when data is sparse, but not through better language models alone.
Theme 02
The Vision-Language Modality Gap

Vision-only systems fail most catastrophically (4.55% task success for Claude Sonnet 4 on product search); text-only systems using accessibility trees perform far better (56.06% for the same model and task). WebMall provides the definitive comparison (Tables 3–4). Adding vision to accessibility-tree agents produces inconsistent results: performance improves on some tasks (56% to 60% on Specific Product Search) but can degrade on others. No benchmark demonstrates a reliable fusion strategy that consistently outperforms structured navigation alone. DeepShop's error analysis (Section 5.4, Figures 12–14) identifies the mechanism: vision-based agents using set-of-mark prompts misclassify interactive buttons, fail to segment review sections, and overlook small filtering widgets. Vision-only agents achieve 0% success on end-to-end workflows (WebMall, Table 3).

Theme 03
Persona vs. Rationale Trade-off

Both explicit persona conditioning and reasoning/rationale generation improve behavior simulation, but their relative contributions interact in complex ways. CUSTOMER-R1 delivers the headline: persona removal produces the largest single ablation effect in the corpus (19.72-point F1 drop, Tables 5 and 8). PAARS confirms that persona conditioning improves query alignment by approximately 17% relative (Section 5, Table 2). See, Think, Act shows rationale generation adds accuracy while persona provides user-level priors. Synthetic rationales (generated by Claude) may create train-test distribution mismatch, as the model may learn Claude's reasoning patterns rather than actual user cognition (See, Think, Act; CUSTOMER-R1).

Contradiction: Some papers show marginal value from rationales (below 5% improvement); others show 23% improvement. The difference traces to the evaluation metric: session-level decisions (purchase/terminate) are more sensitive to persona than individual actions.
Theme 04
The Hallucination-Confidence Inverse

Larger, more capable models systematically overestimate confidence in unreliable claims, and different model families exhibit systematically different failure modes. DeepShop documents that OpenAI's deep research system overconfidently claims success on complex queries while Gemini more frequently acknowledges failure (Section 5.4, Figures 16–17). Cite Before You Speak shows 3% baseline hallucination in production Amazon conversational shopping (Section 1). AgenticShop reports that ChatGPT Search hallucinates approximately 20% of product references, with Table 6 disaggregating failures across five distinct columns that vary widely by system. OptAgent demonstrates agents systematically penalize low-review products, creating a review-count bias (Section 5.3).

Theme 05
The Retrieval Relevance Ceiling

This is the single most actionable finding across the corpus. Even when LLMs reason correctly, poor retrieval fundamentally limits performance. Cite Before You Speak provides the critical number: real-world benchmark achieves only 41.1% relevance (Table 2). CGR drops from 96.52% to 83.86%: a 13-point collapse driven primarily by retrieval quality, though the severity is setting-specific (Citation Prompt mitigates the drop to just 3.2 points; Guided Prompt suffers the full 12.6-point decline). ShoppingComp confirms that retrieval strategy and constraint enforcement remain the binding constraint across all tested models (Section 5.3). Even with tools, retrieval quality and constraint enforcement remain the binding constraint, not information access.

Contradiction: Authors acknowledge retrieval as the primary constraint but do not deeply explore solutions. Research attention remains focused on reasoning improvements while the actual constraint is upstream.
Theme 06
Multi-Constraint Reasoning Collapse

Agent performance collapses as constraint count increases. The collapse is non-linear: agents handle one or two constraints well, then fall apart at three or more. ShopSimulator reveals the mechanism: multiplicative reward structures create harsh penalties for partial satisfaction, causing agents to avoid risky exploration rather than attempt complex constraint satisfaction (Section 4.1). Some papers use soft matching (allowing partial satisfaction); others use strict (requiring all constraints). ShopSimulator demonstrates strict rewards consistently outperform loose rewards across scenarios (Table 3). This implicit trade-off between evaluation realism and agent performance goes largely undiscussed in the literature.
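
The reward-structure trade-off is easiest to see side by side. This is a generic illustration of strict multiplicative versus loose additive scoring, not ShopSimulator's actual reward code.

```python
# Generic illustration of strict multiplicative vs. loose additive rewards.
def additive_reward(satisfied):
    """Loose: partial credit for each constraint met."""
    return sum(satisfied) / len(satisfied)

def multiplicative_reward(satisfied):
    """Strict: one violated constraint zeroes the episode."""
    reward = 1.0
    for ok in satisfied:
        reward *= 1.0 if ok else 0.0
    return reward

flags = [True, True, False]           # two of three constraints met
print(additive_reward(flags))         # 0.67: looks like progress
print(multiplicative_reward(flags))   # 0.0: the purchase still fails the user
```

Under the strict scheme, partial satisfaction earns nothing, which is exactly the harsh penalty structure the paper credits with better final policies.
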

Theme 07
The Simulation Fidelity Paradox

LLM agents achieve high task-level alignment with humans but near-zero exact product overlap. The paradox emerges across multiple independent studies: very low individual-level product agreement in Lu et al.'s controlled study (Can LLM Agents Simulate, Section 4.2.1), approximately 2% in Yang et al.'s digital-twin study (LLM Agent Meets Agentic AI, Abstract). SimGym's 0.64 correlation validates population-level utility, but personas derive from aggregated clickstream, not individual patterns. Yang et al. reveals the mechanism: agents explore more systematically than humans (1.9 items clicked vs. 1.2 for humans), then report uniformly positive functional feedback while humans voice frustration about personalization failures, bias concerns, and control loss.

Theme 08
The Autonomy Calibration Problem

Agents require different autonomy levels for different task types and user demographics. The Autonomy Levels Framework (Feng, McDonald, Zhang) establishes the L1–L5 spectrum with a critical insight: autonomy is a design decision separable from capability. The framework identifies rubber-stamping risk: users disengage when autonomy is too high, approving agent actions without genuine review (Section 3.4). The safety data from ShoppingComp makes the governance case: best model achieves approximately 35–38% safety rubric pass rate versus 77.08% human. That approximately 40-point gap makes unsupervised financial decisions untenable at any autonomy level above L3.

AgenticShop Fig. 5 — Intent Distribution and Capability Radar
AgenticShop Fig. 5 — Intent Distribution and Capability Radar — Intent distribution (bottom): Claude spends 80% of session time in Alternative Evaluation and only 8% on Purchase Decision. Agent-E: 64% evaluation, 6% decision. Agents that spend 6–8% of their time on purchase decisions are not ready for unsupervised financial autonomy. The rubber-stamping risk at L4 is mirrored in the agent's own action distribution. Radar charts (top): every system collapses on Review Sensitivity and Price Sensitivity while performing best on Brand Preferences — the exact inversion of what matters most for safe financial decisions.
Theme 09
The Cost-Performance Trade-off Opacity

Papers rarely disclose computational costs. When disclosed, costs span four orders of magnitude. OptAgent costs $8 per query with 6,109 LLM calls per optimization (Appendix E). WebMall documents the model-level trade-off: Claude Sonnet 4 costs $0.85–$1.42 per task vs. GPT-4.1 at $0.26–$0.34, but Claude achieves higher completion on end-to-end tasks (Table 5). Most papers report accuracy without reporting cost, making cross-system comparison impossible.

WebMall Table 5 + Fig. 2 — Token Usage, Cost, Runtime, and Cost vs. Completion Rate
WebMall Table 5 + Fig. 2 — Token Usage, Cost, Runtime, and Cost vs. Completion Rate — The scatter plot makes the trade-off visible in a way prose cannot. Vision-only cluster sits bottom right: high cost, low completion. Claude Vision at $1.42 per task delivers 8% completion — paying the most for the worst outcome. Best value: GPT AX-Tree + Memory at $0.34, 55% completion. Adding memory costs 3 cents and delivers an 11-point completion gain. The table adds three findings the scatter hides: Vision causes Claude to take 45.57 average steps vs 26.67 for AX-Tree — 71% more steps to accomplish less. Memory costs almost nothing on GPT (3 cents, steps decrease) and a 35% cost increase on Claude. Runtime: Claude Vision averages 491.6 seconds per task. Over eight minutes. If real-time shopping assistance is the goal, the runtime column may matter more than the cost column.
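
Dividing cost per task by completion rate, both quoted in the caption above, turns the scatter plot into a single number:

```python
# Cost per *completed* task, from the WebMall figures quoted above.
configs = {
    "GPT-4.1 AX-Tree + Memory": (0.34, 0.55),  # ($ per task, completion rate)
    "Claude Sonnet 4 Vision":   (1.42, 0.08),
}
for name, (cost, completion) in configs.items():
    print(f"{name}: ${cost / completion:.2f} per completed task")
# GPT-4.1 AX-Tree + Memory: $0.62; Claude Vision: $17.75, a roughly 29x gap.
```
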
Theme 10
The Hidden Signal Extraction Gap

Papers systematically hide crucial information in subordinate clauses, appendices, and asides. DeepShop's main text claims 84% evaluation agreement; the appendix reveals only a 50-sample validation. Cite Before You Speak claims "production-ready"; hidden in the data: 50–53% PSR suggests a fundamentally limited approach without fine-tuning. PAARS reports "group alignment achieved"; hidden: approximately 7% relative improvement in KL divergence, and some metrics actually degrade. OptAgent claims "agent correlation 0.552"; hidden: this is a correlation metric, not a causal relationship. The companion translations document provides the full hidden signal extraction for each paper.

The State of Agentic Shopping

What Works Today

  • Multi-turn dialogue for shopping: Conversational interfaces outperform keyword search on subjective preferences (TRACER achieves 4.2/5.0 naturalness). Agents maintain context across 5–10 turns effectively.
  • Persona-driven behavior modeling: Persona conditioning improves population-level simulation (PAARS, SimGym). Works best when personas are extracted from observed behavior rather than designed.
  • Structured navigation via accessibility trees: Dramatically outperforms vision-only approaches. The modality advantage is the largest perceptual gap in the corpus. Adding vision yields inconsistent gains.
  • RL training with reward shaping: GRPO and policy gradient methods improve over pure LLM prompting by 5–15% on structured tasks. Strict multiplicative rewards consistently outperform loose additive rewards.
  • Citation generation via ICL: 50–53% Perfect Sentence Rate and 72% Correct Citation Rate without fine-tuning. 3–10% customer engagement uplift in production A/B test.
  • Offline A/B testing with agents: SimGym achieves population-level correlation. Reduces testing cycles from weeks to under one hour.
  • Query optimization through genetic algorithms: OptAgent achieves 21.98% improvement over baseline queries. Tail queries show the largest gains.

What Does Not Work Yet

  • Safety-aware decision making: All models show catastrophic collapse (approximately 35–38% best vs 77.08% human). Variance exceeds 5% across runs.
  • Implicit constraint inference: Only 17.76% F1 on open-world product retrieval. Multi-constraint reasoning collapse documented across 5+ benchmarks.
  • Individual preference matching: Population-level simulation works; individual-level fails (approximately 2% exact product overlap).
  • Real-time cross-platform price optimization: OptAgent optimizes static datasets. AgenticShop confirms: 70% of price failures are out-of-budget recommendations.
  • Interactive element handling: Agents cannot drag price sliders, operate nested dropdowns, or control dynamic sorting menus. Architectural, not reasoning limitation.
  • Long-context memory and planning: Sessions longer than 20–30 turns show degraded performance. Multi-turn tasks plateau at approximately 35% even after SFT+RL.
  • Confidence calibration: OpenAI: false positives; Gemini: false negatives. No unified approach across model families.
Key Pattern: 20–50% performance gap between SOTA agents and humans across all shopping tasks. The gap is smallest on structured tasks (EcomBench: approximately 10% at Level 1) and largest on open-world tasks (ShoppingComp: approximately 50%).

The Research-Production Gap

Research results and production reality diverge systematically. Four forces drive this gap: evaluation metric divergence (papers use different success criteria), cost constraints (research uses expensive inference; production must optimize), risk aversion (autonomous shopping carries liability), and data distribution shift (real users behave differently than evaluation datasets). The Perplexity adoption study confirms scale: shopping represents 10% of agentic queries, Amazon dominates at 43.2% of shopping environments, and adoption correlates strongly with GDP (r=0.85–0.86) and moderately with education (r=0.75).

Key Metrics and Numbers

Success Rate Metrics

Task Completion: 17.76% (ShoppingComp) to 90%+ (EcomBench Level 1) depending on task complexity. DeepShop: RAG baseline Easy 16%, Hard 0%; web agents show same degradation pattern at higher rates. WebMall: 56% AX-Tree product search; 0% end-to-end vision-only. ShoppingBench: 48.2% GPT-4.1 overall. AgenticShop: 13.56–37.93% curation score. ShopSimulator: 32.65% GPT-5 single-turn (Table 2).

Accuracy Metrics

Claim Grounding Rate: 83.86% real-world vs 96.52% synthetic (Cite Before You Speak, Table 2). Perfect Sentence Rate: 50–53% without fine-tuning. Correct Citation Rate: 72% without fine-tuning. Evidence Utilization Rate: 54–58%. Query Prediction Similarity: 0.59 baseline to 0.69 with persona (PAARS, Table 2). Next-action prediction: 11.86% prompt-only to 17.26% fine-tuned (Customer-R1 / Lu et al.).

Alignment Metrics

Human-Agent Task Alignment (F1): High task completion but very low exact product overlap at the individual level (Lu et al., Section 4.2.1; Yang et al., Abstract). A/B Test Prediction Correlation: 0.64 with 69% sign alignment (SimGym, Tables 2, 5). Inter-human Agreement: κ = 0.592 moderate (OptAgent, Appendix D). Population-level KL divergence: 18.81 baseline to 17.51 with persona for query similarity (PAARS); 2.40 to 1.08 for item selection.
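
For reference, the KL divergence these population-level metrics rely on is straightforward to compute. The distributions below are toy counts, not PAARS's data.

```python
# How a population-level KL divergence is computed in principle.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q): lower means the agent distribution sits closer to the
    human one, which is how PAARS-style metrics are read."""
    p = np.asarray(p, dtype=float); p /= p.sum()
    q = np.asarray(q, dtype=float); q /= q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

human_query_mix = [40, 30, 20, 10]   # toy query-category counts
agent_query_mix = [25, 25, 25, 25]
print(kl_divergence(human_query_mix, agent_query_mix))  # ~0.11
```
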

Safety Metrics

Safety Rubric Pass Rate: 35.42% (±6.18%) GPT-5.2 vs 77.08% human (ShoppingComp, Table 3). Hallucination Rate: 3% baseline in Amazon CSA (Cite Before You Speak, Section 1). Product link hallucination: ChatGPT Search hallucinates approximately 20% of product references; Table 6 disaggregates failures across five distinct columns with rates varying by system and failure type (AgenticShop, Table 6).

Cost Metrics

OptAgent: $8 per query, 6,109 LLM calls (Appendix E). WebMall: Claude $0.85–$1.42 per task vs. GPT-4.1 $0.26–$0.34 (Table 5). WEBSERV: Incus containers at 1.78s launch vs Docker at 8.96s. SFT training: approximately 3 hours on H200 clusters.

Dataset Scale

DeepShop: 150 seeds evolved to 600+. ShoppingBench: 3,310 instructions, 2.5M products. ShopSimulator: 1.3M products, 28K annotated tasks. EcomBench: Task set with quarterly updates. Can LLM Agents Simulate: 31,865 sessions, 230,965 actions. Perplexity Study: Hundreds of millions of interactions across top 100 countries. TRACER: 3,600 conversations, 236K+ products.

Hidden Signals Synthesis

Across all analyses, the hidden signal extraction consistently surfaced information that standard deep research analysis missed. The following ten revelations carry the highest strategic weight. Full evidence chains with faithfulness labels (Directly Supported, Cautious Inference, Unsupported) are available in the companion translations.

  1. Retrieval Is the Critical Upstream Constraint, Not Reasoning: Real-world retrieval only 41.1% relevant (Cite Before You Speak, Table 2). Retrieval strategy and constraint enforcement remain the binding constraint for all models (ShoppingComp). The projected 20–30% gain from retrieval improvement is inferential, not measured.
  2. Human Evaluation Agreement Is Moderate, Limiting Benchmark Validity: Inter-human κ = 0.592 (OptAgent, Appendix D); 50-sample validation in DeepShop. Reported metrics carry ±10–15% confidence intervals due to evaluation noise.
  3. Population-Level Simulation Works; Individual-Level Fails: Confirmed independently by Lu et al. (very low overlap), Yang et al. (approximately 2%), and PAARS (individual alignment remains marginal while group alignment improves).
  4. Different Model Families Have Systematically Different Failure Modes: OpenAI: false-positive bias. Gemini: false-negative bias (DeepShop, Section 5.4). ShoppingComp confirms: GPT-5.2 favors breadth-first exploration; Gemini favors depth-first (Section 5.3).
  5. Personas Provide Approximately 20-Point Session-Level Improvement: CUSTOMER-R1 ablation: 19.72-point session outcome F1 drop (Tables 5, 8). Effect is concentrated on purchase/terminate decisions (19.72 points) vs. next-action accuracy (1.78 points).
  6. Vision-Based Agents Fail Entirely on End-to-End Tasks: 0% success for vision-only on checkout/transaction tasks (WebMall, Table 3). HTML-only Agent-E also fails on visual categories (DeepShop, Table 2).
  7. RL Without SFT Initialization Leads to Policy Collapse: RL-only model overselects "click" and never predicts input/terminate actions (CUSTOMER-R1, Table 7). ShopSimulator confirms: standalone RL lags SFT by 6.95 points on multi-turn (Section 4.1).
  8. Synthetic Rationales Create Train-Test Distribution Mismatch: Claude-generated rationales improve prediction but may introduce stylistic bias, as the model may learn reasoning patterns specific to the rationale generator rather than actual user cognition (See Think Act; CUSTOMER-R1, Section 3.2).
  9. Query Complexity Creates Cliff Effects: RAG achieves 16% on medium, 0% on hard in DeepShop (Figure 5b). Tail queries show largest improvements from optimization (OptAgent, Table 1).
  10. Agent Research Costs Span Orders of Magnitude: $0.26–$8.00+ per experiment across OptAgent and WebMall. WebMall documents per-task costs: $0.26–$1.42 depending on model and configuration (Table 5).
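Point 2 turns on a single agreement statistic. For readers who want it concrete, here is a minimal Cohen's κ computation; the pass/fail labels below are invented, not OptAgent's annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Two annotators judging ten agent outcomes as pass/fail (invented labels).
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "fail"]
print(round(cohens_kappa(a, b), 3))  # 0.565: "moderate", near OptAgent's 0.592
```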

Architectural Constraints Revealed Through Candor Zones

Five infrastructure-level limitations emerge from subordinate clauses and appendix disclosures:

  1. Action Space Limitations: agents cannot interact with sliders, dropdowns, or lazy-loaded content (DeepShop, Section 5.4, Figure 14).
  2. Context Window Pressure: HTML truncation loses middle history; WEBSERV addresses this with DOM parsing that streamlines raw HTML while preserving interactive elements (see the sketch after this list).
  3. Feature Interference in Production: Multi-UX-Inference and a paged-attention architecture handle feature interference in production (Cite Before You Speak, Section 5).
  4. Container Launch Latency: Incus containers launch in 1.78 seconds vs. 8.96 seconds for Docker, enabling concurrent agent deployment (WEBSERV, Table 2).
  5. Structured Interaction State: accessibility tree navigation enables agents to recognize repeated actions and maintain structured interaction state (WEBSERV, Section 3.2.1).
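As referenced in point 2, a minimal sketch of DOM streamlining follows, assuming BeautifulSoup. The allowlists of interactive and strippable tags are illustrative assumptions, not WEBSERV's actual parser rules.

```python
from bs4 import BeautifulSoup

# Tags an agent can act on; everything else is presentation noise.
# Both lists are illustrative assumptions, not WEBSERV's actual rules.
INTERACTIVE = ["a", "button", "input", "select", "textarea", "form", "label"]
STRIP = ["script", "style", "svg", "noscript", "iframe"]

def streamline(raw_html: str) -> str:
    """Drop presentation-only markup; keep interactive elements and their text."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup.find_all(STRIP):
        tag.decompose()  # remove non-actionable subtrees entirely
    lines = []
    for el in soup.find_all(INTERACTIVE):
        label = el.get_text(" ", strip=True) or el.get("aria-label", "")
        lines.append(f"<{el.name} id={el.get('id', '?')}> {label}".rstrip())
    return "\n".join(lines)

html = ('<div><style>.x{color:red}</style>'
        '<button id="buy">Add to cart</button>'
        '<a id="next" href="/page2">Next page</a></div>')
print(streamline(html))
# <button id=buy> Add to cart
# <a id=next> Next page
```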

Real Limitations Acknowledged Only in Asides

  • Gender/age bias in participant pools (skew toward males aged 35+ in tech)
  • Single-platform bias (Taobao-, Lazada-, and Amazon-specific findings may not generalize)
  • Language bias (German dominance in OptAgent: German shows a 3.59% improvement while the "Others" category shows only 0.49%, Table 2)
  • Incognito-mode exclusion
  • Enterprise user exclusion
  • Privacy-conscious user self-selection bias

Research Gaps and Future Directions

These gaps are not speculative. Each emerges from the intersection of what the twenty-two papers collectively measure and what they collectively omit.

Persistent Open Problems

  • Multi-constraint reasoning collapse (5+ papers)
  • Safety-aware decision making (performance is catastrophic in every paper that measures it)
  • Retrieval quality ceiling (acknowledged but not solved)

Emerging Open Problems

  • Implicit requirement inference (ShoppingComp's dominant failure mode)
  • Cross-platform generalization
  • Preference alignment vs. task completion trade-off

Strategic Implications

Four stakeholder groups face distinct strategic imperatives, and the research base supports specific recommendations for each. Evidence strength labels follow the companion translations: projections and inferences are flagged where they appear.

For Brands and Sellers

  • Prepare for Agent-Driven Discovery: Agents optimize for different signals than human search. ShoppingComp documents an inverted precision-recall trade-off: models exhibit high recall but low precision, while human experts enforce constraints jointly (Table 2). Surface clear, structured product attributes.
  • Implement Agentic Commerce Infrastructure: Brands need APIs and structured data; schema.org compliance is critical (a minimal markup sketch follows this list). DeepShop documents failures on dropdowns and hidden information (Section 5.4, Figure 14). TRACER demonstrates that product catalog quality directly limits dialogue naturalness.
  • Expect 20–50% Performance Gap: Agents succeed at category matching but select different specific products than humans would. Very low exact product overlap at the individual level means agent recommendations will differ from human choices.
  • Pricing Strategy Shift: Agents optimize for the lowest price more aggressively than humans do. OptAgent confirms that agents exhibit position bias matching real user behavior (Figure 5), but their price-seeking is more systematic and aggressive.
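As referenced above, here is a minimal sketch of what "structured data" means in practice: Python emitting schema.org Product markup as JSON-LD. The product values are placeholders; the types and properties shown are standard schema.org vocabulary, though which properties matter will vary by category.

```python
import json

# Placeholder product values; @type and property names are standard schema.org
# vocabulary, chosen here purely for illustration.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Trail Running Shoe",
    "sku": "TRS-001",
    "brand": {"@type": "Brand", "name": "ExampleBrand"},
    "material": "mesh",
    "size": "US 9",
    "offers": {
        "@type": "Offer",
        "price": "89.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock",
    },
}
print('<script type="application/ld+json">')
print(json.dumps(product, indent=2))
print("</script>")
```

Explicit attributes like size, material, price, and availability are exactly the fields agents can retrieve, parse, and trust; lifestyle copy carries none of this signal.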

For Platform Operators

  • Invest in Retrieval Quality: The relevance ceiling is the single largest addressable constraint. The projected 20–30% gain from retrieval improvement is inferential but directionally supported by the 13-point Claim Grounding Rate collapse between synthetic and real-world benchmarks.
  • Implement an Autonomy Levels Framework: L3–L4 (partial autonomy with human approval) is the current production standard. The rubber-stamping risk at L4 requires active engagement monitoring (a minimal gating sketch follows this list).
  • Prepare for the Agent Attention Economy: The Agentic Web Survey introduces this concept; agents select based on cost, delay, security, and reliability, not visual persuasion.
  • Deploy Additional Safety Layers: All models show below 40% safety awareness with over 5% variance between runs. Platforms cannot rely on agents to autonomously avoid safety issues.
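A minimal sketch of the gating logic behind the autonomy levels recommendation follows. The level semantics track the L0–L5 framing above, but the names, thresholds, and approval hooks are hypothetical, not a published specification.

```python
from enum import IntEnum

class Autonomy(IntEnum):
    """Simplified levels; real frameworks define L0 through L5 with finer grades."""
    L0_MANUAL = 0       # human does everything
    L3_PARTIAL = 3      # agent proposes, human approves each transaction
    L4_SUPERVISED = 4   # agent acts, human can veto; rubber-stamping risk lives here
    L5_FULL = 5         # no human in the loop; premature for payments today

def execute_purchase(level, order, human_approve, place_order):
    """Gate a financial action on the configured autonomy level.
    `human_approve` and `place_order` are hypothetical callbacks."""
    if level >= Autonomy.L5_FULL:
        return place_order(order)   # unsafe until the ~40-point safety gap closes
    if level >= Autonomy.L3_PARTIAL:
        # At L4 approval may be a veto window rather than a pre-approval; either
        # way, approval latency should be logged to detect rubber-stamping.
        return place_order(order) if human_approve(order) else False
    return False                    # below L3, the human executes directly

# Toy run: an always-approving human at L3 (this is where stamping risk begins).
print(execute_purchase(Autonomy.L3_PARTIAL, {"sku": "TRS-001"},
                       human_approve=lambda o: True,
                       place_order=lambda o: True))  # True
```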

For AI Agent Developers

  • Focus on Multi-Constraint Reasoning: This is the most consistent gap across papers (5+ benchmarks). ShopSimulator demonstrates that strict multiplicative rewards consistently outperform loose additive rewards (contrasted in the sketch after this list). Hierarchical planning with constraint relaxation is the structural solution.
  • Build Confidence Calibration: All models are systematically miscalibrated, and post-hoc calibration must be model-family-specific (OpenAI models are overconfident, Gemini models underconfident).
  • Invest in Structured Navigation: The modality advantage is substantial. Accessibility trees dramatically outperform vision-only approaches. Vision fusion remains an open research problem with inconsistent results across tasks.
  • Invest in Persona Engineering: The CUSTOMER-R1 ablation effect is the largest documented in the corpus. Traffic-grounded personas work better than designed personas. SFT initialization is required before RL.
  • Solve the Cost Problem: Current costs span $0.26–$8.00 per task attempt. Distillation can close the gap: Qwen3-4B matches GPT-4.1 at a fraction of inference cost.
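As referenced in the first bullet, the sketch below contrasts the two reward shapes. The constraint names and per-constraint scores are invented; the point is only that a single hard violation vanishes under averaging but not under multiplication.

```python
# Invented constraint names/scores; illustrates the shaping contrast only.

def additive_reward(scores):
    """Loose shaping: partial credit can mask a violated hard constraint."""
    return sum(scores.values()) / len(scores)

def multiplicative_reward(scores):
    """Strict shaping: any zero-scored constraint collapses the reward to zero."""
    reward = 1.0
    for s in scores.values():
        reward *= s
    return reward

# An item that matches category and size but blows the budget entirely.
scores = {"category_match": 0.9, "size_match": 1.0, "under_budget": 0.0}
print(round(additive_reward(scores), 3))   # 0.633, still looks "mostly right"
print(multiplicative_reward(scores))       # 0.0, the violation is unmissable
```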

For Researchers

  • Shift Focus to Retrieval: The retrieval relevance ceiling is the most underpursued constraint. Dense retrieval, re-ranking, and fusion methods could unlock the largest absolute gains (a minimal retrieve-then-rerank skeleton follows this list).
  • Standardize on Unified Benchmarks: The field is fragmented across 6+ benchmarks with different evaluation metrics, and success rates vary 5x depending on which benchmark is cited.
  • Study Safety Systematically: ShoppingComp identifies the approximately 40-point gap; no paper offers a solution. Safety-aware reward shaping is an untested hypothesis.
  • Address the Reproducibility Crisis: Research costs vary dramatically across experiments. WEBSERV's Incus infrastructure (fast resets, minimal per-container storage) is one path toward democratization.
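As referenced in the first bullet, a minimal retrieve-then-rerank skeleton follows. Both scorers are deliberate stand-ins: in practice a dense bi-encoder would replace the first stage and a cross-encoder the second.

```python
# Both scorers are stand-ins: a dense bi-encoder would replace `cheap_score`
# and a cross-encoder would replace `careful_score` in a real pipeline.

def cheap_score(query: str, doc: str) -> float:
    """Stage 1: fast lexical overlap, applied to the whole catalog."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def careful_score(query: str, doc: str) -> float:
    """Stage 2: slower, higher-precision scoring on a short candidate list."""
    return cheap_score(query, doc)  # placeholder for a cross-encoder

def retrieve_then_rerank(query, catalog, k1=100, k2=10):
    stage1 = sorted(catalog, key=lambda d: cheap_score(query, d), reverse=True)[:k1]
    return sorted(stage1, key=lambda d: careful_score(query, d), reverse=True)[:k2]

catalog = ["wireless blood pressure cuff", "soccer cleats youth size 4",
           "birthday gift mug", "manual blood pressure cuff large"]
print(retrieve_then_rerank("blood pressure cuff replacement", catalog, k1=3, k2=2))
```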

Synthesis

The twenty-two papers tell one story: agentic shopping works at population scale, fails at individual scale, and breaks predictably under complexity. The path forward is not larger models: it is better systems. Better retrieval. Better decomposition. Better safety infrastructure. Better personalization architecture.

The governance opportunity is clear: the Autonomy Levels framework provides structure for calibrating agent autonomy as a deliberate design decision. Current production systems operate at L3–L4 (partial autonomy with human approval); full autonomy (L5) remains risky for financial transactions until the approximately 40-point safety gap closes.

Limitations of This Synthesis

This synthesis is a purposive sample of twenty-two papers, not a systematic review. The hidden signal extraction methodology is subject to confirmation bias: an analyst looking for discrepancies between main text and subordinate text will find them. Several projections in this document (retrieval improvement gains, ROI estimates) are inferences from documented correlations, not measured outcomes; they are flagged as such where they appear. Cross-paper metric comparisons should be interpreted cautiously: as the comparison table illustrates, benchmarks use different evaluation frameworks, different product catalogs, and different success criteria. The companion translations document provides faithfulness labels (Directly Supported, Cautious Inference, Unsupported) for every recommendation derived from these papers; this synthesis draws on those labels but does not replicate the full audit for every claim.

The future of agentic shopping is not about better language models alone. It is about better systems, architecture, and safety.

The research base across twenty-two foundational papers is clear. The competitive moat will not be held by whoever licenses the largest foundation model. It will be held by whoever builds the best shopping-specific training data, the most reliable retrieval infrastructure, and the most rigorous safety architecture.

The gaps are named. The metrics are established. The work begins now.

Evidence Grounding Appendix

The Evidence Grounding Appendix maps key quantitative claims in this synthesis to their specific source locations in each paper, enabling independent verification. For full evidence chains including faithfulness labels (Directly Supported, Cautious Inference, Unsupported), see the companion translations document: Agentic Commerce Research Intelligence — 20 Paper Translations.

Research Papers