What is an Agent?
In 2025, we have come to think of an agent as an AI system that autonomously pursues goals through iterative cycles of reasoning and action. Unlike traditional LLMs that simply respond to prompts, agents can break down complex tasks, use tools to gather information or perform actions, and adapt their approach based on results—all without human intervention at each step.
Technically, most modern agents follow the ReAct pattern (Reasoning + Acting): the system generates thoughts about what to do next, executes an action using available tools, observes the results, and repeats this cycle until the goal is achieved. This closed-loop process enables agents to handle multi-step workflows that would otherwise require manual orchestration.
This newly defined “agentic” AI promises to automate entire processes, leading to wide adoption—at least in LinkedIn posts.
However, a simple ReAct loop isn’t always suitable. This guide explores five core patterns—ReAct, Plan-then-Execute, ReWOO, LLMCompiler, and Reflexion—with practical guidance on when to use each to optimize for cost, speed, or quality.
The Challenge: Moving Beyond Basic Agents
Current ReAct-style agents often struggle with complex, multi-step queries requiring comprehensive analysis and actionable recommendations. Consider two contrasting experiences:
Example 1: Simple query, inadequate response
- Query: “What are the best coffee beans for espresso and where can I buy them locally?”
- Result: Agent refused to search inventory or provide store locations, citing unclear parameters
- Problem: User received no actionable insights despite having relevant data available
Example 2: Heavy prompt engineering gives great results
- Approach: Manually guided agent through structured workflow (search beans → filter by roast → check inventory → find stores → compare prices)
- Result: High-quality, data-driven recommendation with detailed supporting evidence
- Challenge: Required significant user effort to orchestrate
The gap? Example 2 required manual orchestration of what should be the agent’s natural capability. Planning patterns can bridge this gap by building structured workflows directly into the agent’s reasoning process.
Pattern 1: ReAct (Reason + Act) — The Current Standard
What It Is
ReAct augments LLM action spaces by interleaving explicit reasoning traces (thoughts) with environment interactions (actions and observations). The agent generates natural language thoughts explaining its reasoning, executes corresponding actions, receives observations from the environment, and repeats until task completion.
Technical innovation: Extends the agent’s action space from just environmental actions to include a language space for reasoning traces. Each thought decomposes goals into subgoals, tracks progress, injects commonsense knowledge, and handles exceptions—all visible to humans.
Architecture Flow
graph TD
A[User Query] --> B[Think: Reason about next step]
B --> C[Act: Select and execute tool]
C --> D[Observe: Get tool result]
D --> E{Task Complete?}
E -->|Yes| F[Return Answer]
E -->|No| B
style A fill:#fff,stroke:#333,stroke-width:2px
style B fill:#e8e8e8,stroke:#333,stroke-width:2px
style C fill:#d0d0d0,stroke:#333,stroke-width:2px
style D fill:#b8b8b8,stroke:#333,stroke-width:2px
style E fill:#a0a0a0,stroke:#333,stroke-width:2px
style F fill:#888,stroke:#333,stroke-width:2px
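The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production agent: `call_llm` and the `tools` registry are hypothetical stand-ins for a real model client and tool set.

```python
# Minimal ReAct loop sketch. `call_llm` and `tools` are hypothetical
# stand-ins for a real model and tool registry.

def call_llm(history):
    # Stub: a real implementation would prompt an LLM with the trajectory
    # so far and parse a Thought plus either an Action or a final Answer.
    if not history:
        return {"thought": "I should look up espresso beans.",
                "action": ("search", "best espresso beans")}
    return {"thought": "I have enough information.",
            "answer": "Use a medium-dark roast blend."}

tools = {"search": lambda q: f"results for: {q}"}  # toy tool registry

def react(query, max_steps=5):
    history = []
    for _ in range(max_steps):
        step = call_llm(history)                # Think
        if "answer" in step:                    # task complete
            return step["answer"], history
        name, arg = step["action"]              # Act
        observation = tools[name](arg)          # Observe
        history.append((step["thought"], step["action"], observation))
    return None, history                        # step budget exhausted

answer, trace = react("What are the best beans for espresso?")
```

Note that every loop iteration pays for one full LLM call carrying the entire trajectory, which is exactly the token-inefficiency limitation discussed below.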
Strengths
- Maximum adaptability: Handles unknown task complexity and ambiguous queries by discovering requirements during exploration
- Transparent reasoning: Every tool call has an associated thought trace, enabling debugging, compliance auditing, and human-in-the-loop intervention
- Multi-hop reasoning: Naturally chains information across 3+ sources (HotpotQA: 27.4% accuracy, ALFWorld: 71% success vs 45% for action-only)
- Reduced hallucination: Grounds reasoning in external knowledge sources (6% false positives vs 14% for Chain-of-Thought alone)
- Interactive environments: Excels in text-based games, web navigation, and embodied AI tasks
Limitations
- Token inefficiency: Requires an LLM call for each tool invocation (typically 3-7 steps), significantly increasing both latency and API costs
- Myopic planning: Only plans for 1 sub-problem at a time without upfront global reasoning, leading to sub-optimal trajectories
- Tool selection overload: Performance degrades with 7+ tools (calendar scheduling drops to 2% with 7+ domains)
- Higher reasoning errors: 47% of failures stem from reasoning errors vs 16% for pure Chain-of-Thought, a side effect of the interleaved thought-action structure
- Repetitive loops: Can generate same action sequence repeatedly (23% of HotpotQA failures)
- Requires large models: Models smaller than 62B parameters show poor performance
When to Use ReAct
- Exploratory tasks where complexity is unknown upfront
- Dynamic scenarios where next steps depend on previous results
- Interpretability critical for debugging, compliance auditing, or human-in-the-loop oversight
- Interactive environments like text games, web navigation, or embodied AI
When to Avoid ReAct
- Cost-sensitive high-volume systems (use ReWOO or LLMCompiler for 80% cost reduction)
- Well-scoped repeatable workflows (use Plan-then-Execute for deterministic execution)
- Speed-critical applications (use LLMCompiler)
- Large tool sets (>7 tools) where tool selection degrades performance
Pattern 2: Plan-then-Execute
What It Is
Plan-then-Execute is a full agentic framework with three distinct components: Planner, Executor, and Replanner. The planner generates initial multi-step plans as structured lists. The executor (typically a ReAct agent) carries out individual steps using available tools. The replanner examines completed steps and decides whether to continue with remaining steps, generate a revised plan, or respond with final results.
Inspired by Plan-and-Solve Prompting: The architecture draws inspiration from Wang et al.’s Plan-and-Solve (PS/PS+) prompting technique (ACL 2023), which improved zero-shot arithmetic reasoning from 70.4% to 76.7% by having LLMs explicitly plan before solving. The original PS+ prompt reduced calculation errors from 7% to 5% and missing-step errors from 12% to 7%.
Architecture Flow
graph TD
A[User Query] --> B[Planner: Generate multi-step plan]
B --> C[Plan visible upfront]
C --> D[Executor: Execute step 1]
D --> E[Executor: Execute step 2]
E --> F[Executor: Execute step N]
F --> G{Replanner: Evaluate results}
G -->|Success| H[Return final answer]
G -->|Need more info| B
G -->|Adjust approach| B
style A fill:#fff,stroke:#333,stroke-width:2px
style B fill:#e8e8e8,stroke:#333,stroke-width:2px
style C fill:#d8d8d8,stroke:#333,stroke-width:2px
style D fill:#c8c8c8,stroke:#333,stroke-width:2px
style E fill:#c8c8c8,stroke:#333,stroke-width:2px
style F fill:#c8c8c8,stroke:#333,stroke-width:2px
style G fill:#a0a0a0,stroke:#333,stroke-width:2px
style H fill:#888,stroke:#333,stroke-width:2px
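The planner-executor-replanner split can be sketched as three plain functions. This is a hedged sketch: `plan_llm` and `execute_step` are stubs standing in for a large planning model and a smaller execution agent.

```python
# Plan-then-Execute sketch. `plan_llm` and `execute_step` are hypothetical
# stubs for real LLM calls; a real executor is typically a small ReAct agent.

def plan_llm(query, past_results=None):
    # Stub planner: a real one would prompt a large model for a step list.
    # When called with results, it acts as the replanner; an empty plan
    # signals "ready to answer".
    if past_results is None:
        return ["search beans", "check inventory", "compare prices"]
    return []

def execute_step(step):
    # Stub executor for a single plan step.
    return f"done: {step}"

def plan_then_execute(query):
    plan = plan_llm(query)
    results = []
    while plan:
        results.extend(execute_step(s) for s in plan)  # run every step
        plan = plan_llm(query, results)                # replanner decides
    return results

results = plan_then_execute("Best espresso beans nearby?")
```

The large model is consulted only at the `plan_llm` boundaries, which is where the pattern's speed and model-tiering savings come from.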
Strengths
- Speed advantage: Multi-step workflows execute faster since the large planning LLM is only called during planning and replanning phases
- Cost optimization: Can use more sophisticated models for planning and smaller models for execution (30-50% cost reduction)
- Quality improvement: Forces the planner to “think through” all steps upfront, creating more coherent multi-step solutions
- Deterministic and auditable: Plan is visible upfront before any execution, making it easier to test, debug, and validate against business requirements
Limitations
- Sequential execution bottleneck: Tasks execute one after another (ReWOO and LLMCompiler address this)
- Planning rigidity: Brittle if initial plan is wrong; limited adaptability if user query needs different tools mid-flight (requires costly replanning)
- Planning overhead: Not justified for simple single-step queries where ReAct or direct function calling would be faster
- Context window limitations: Performance degrades as domain and tool counts increase
- Replanning ambiguity: Deciding when to replan versus respond lacks clear criteria
When to Use Plan-then-Execute
- Multi-step complex tasks with 5+ decomposable reasoning steps (research, data pipelines, long-horizon analysis)
- Arithmetic reasoning where explicit planning reduces missing-step and calculation errors
- Repeatable workflows with predefined procedures (customer support, batch reports)
- Audit-critical scenarios requiring deterministic, visible-upfront plans for validation
- Cost optimization priority using model tiering (larger model for planning, smaller model for execution)
- Stable environments where plans remain valid during execution and accuracy outweighs latency
When to Avoid Plan-then-Execute
- Simple single-step queries where planning overhead isn’t justified (use direct function calling)
- Highly dynamic environments where plans quickly become obsolete or need constant revision
- Exploratory tasks without clear structure requiring adaptive discovery (use ReAct)
- Speed-critical applications with parallelizable tasks (use LLMCompiler for faster execution)
Pattern 3: ReWOO (Reasoning Without Observation)
What It Is
ReWOO introduces a three-module architecture that completely separates planning from execution. The Planner generates a complete multi-step plan before any tool execution using “foreseeable reasoning”—predicting needed information without observing actual results. Plans use variable placeholders (#E1, #E2, #E3) to reference future evidence, enabling subsequent steps to depend explicitly on prior results without waiting for actual observations.
Critical innovation: Variable substitution—planning occurs using placeholders rather than actual tool outputs, eliminating the need to wait for observations during the reasoning phase. Tasks can reference previous outputs using syntax like #E2 (e.g., Search[Stats for #E2]).
Three-Module Architecture
Planner: Generates complete reasoning graph with variable placeholders (#E1, #E2, #E3) before any tool execution. Plans what information is needed without seeing actual results.
Worker: Executes tools based on the Planner’s blueprint, populating evidence variables with actual results—this phase involves no LLM reasoning, just pure execution.
Solver: Receives the complete plan plus all evidence and synthesizes the final answer, prompted to use evidence “with caution” to handle potential errors. Can partially compensate for Planner or Worker failures.
Architecture Flow
graph LR
A[User Query] --> B[Planner]
B --> C[Worker]
C --> D[Solver]
D --> E[Final Answer]
B1[Plan with placeholders<br/>#E1, #E2, #E3] -.-> B
C1[Execute tools<br/>populate evidence] -.-> C
D1[Synthesize with<br/>complete evidence] -.-> D
style A fill:#fff,stroke:#333,stroke-width:2px
style B fill:#e8e8e8,stroke:#333,stroke-width:2px
style C fill:#c8c8c8,stroke:#333,stroke-width:2px
style D fill:#a8a8a8,stroke:#333,stroke-width:2px
style E fill:#888,stroke:#333,stroke-width:2px
style B1 fill:#f8f8f8,stroke:#666,stroke-width:1px,stroke-dasharray: 5 5
style C1 fill:#f8f8f8,stroke:#666,stroke-width:1px,stroke-dasharray: 5 5
style D1 fill:#f8f8f8,stroke:#666,stroke-width:1px,stroke-dasharray: 5 5
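The variable-substitution mechanism is the heart of ReWOO and is easy to show concretely. In this sketch the plan is hand-written rather than LLM-generated, and the tool and solver are stubs.

```python
import re

# ReWOO sketch: a static plan with #E placeholders, a Worker that
# substitutes evidence, and a Solver stub. Plan and tools are illustrative.

plan = [
    ("E1", "search", "2022 US Open winner"),
    ("E2", "search", "hometown of #E1"),   # depends on E1 via placeholder
]

tools = {"search": lambda q: f"<result for: {q}>"}  # toy tool

def worker(plan):
    evidence = {}
    for var, tool, arg in plan:
        # Replace #E references with already-populated evidence values.
        arg = re.sub(r"#(E\d+)", lambda m: evidence[m.group(1)], arg)
        evidence[var] = tools[tool](arg)   # pure execution, no LLM reasoning
    return evidence

def solver(query, plan, evidence):
    # Stub: a real Solver prompts an LLM once with the plan + all evidence.
    return f"answer based on {len(evidence)} pieces of evidence"

evidence = worker(plan)
answer = solver("Where is the winner from?", plan, evidence)
```

Because the whole plan exists before any tool runs, the LLM sees the query only twice (Planner and Solver) instead of once per step.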
Performance Metrics
Token efficiency: On HotpotQA, ReWOO consumed 1,986 tokens vs ReAct’s 9,795 tokens (5× token efficiency), translating to $3.97 per 1,000 queries vs $19.59 for ReAct (80% cost reduction).
Accuracy improvements: HotpotQA 42.4% vs 40.8% for ReAct, TriviaQA 66.6% vs 59.4%, StrategyQA 66.6% vs 64.6%, SOTUQA 70.2% vs 64.8% (+5.4 points).
Strengths
- Dramatic token efficiency: 5× token reduction (1,986 vs 9,795 tokens), 80% cost reduction ($3.97 vs $19.59 per 1K queries)
- Focused context per task: Each task has only required context (input + variable values) rather than full history
- Improved accuracy: 8% absolute improvement across benchmarks
- Eliminates prompt redundancy: Question and context fed only twice (Planner and Solver) vs every step in ReAct
- Robustness under tool failure: 29.2% accuracy drop vs ReAct’s 40.8% drop when tools fail, saves 110 tokens during failure
- Explicit dependency tracking: Variable flow through #E references makes reasoning traceable and debugging straightforward
Limitations
- Sequential execution bottleneck: Tasks execute one after another (total time = sum of tool times)—LLMCompiler addresses this with parallelization
- Planning rigidity: Once committed to a plan, execution proceeds regardless of observations; cannot dynamically switch tools or revise approaches mid-execution
- Initial plan blind spots: Reasoning happens before any observation, might miss edge cases that would be discovered during exploration
- Tool count sensitivity: Performance degraded from 42% with 2 tools to 37% with 7 tools
- Real-time interactive applications: Not suitable for scenarios requiring adaptive strategy based on intermediate results
- No dynamic replanning: Unlike Plan-then-Execute, ReWOO doesn’t have a replanner component
When to Use ReWOO
- Predictable multi-hop question answering where information dependencies are clear (“Find X, then use X to find Y”)
- Complex multi-theory queries requiring synthesis across multiple data sources
- High-volume production systems where cost is critical (80% cost reduction vs ReAct)
- Curated tool environments with 2-5 well-defined complementary tools
When to Avoid ReWOO
- Exploratory tasks requiring adaptive discovery or trial-and-error (use ReAct)
- Large tool sets with >5 options
- Dynamic environments needing adaptive strategy based on intermediate results
- Real-time interactive applications or scenarios with highly uncertain tool reliability
Pattern 4: LLMCompiler (Parallel Function Calling)
What It Is
LLMCompiler draws inspiration from classical compiler design to optimize agent execution through parallel function calling. The framework decomposes user queries into Directed Acyclic Graphs (DAGs) representing tasks with explicit inter-dependencies, then executes independent tasks concurrently. This extends beyond ReWOO’s sequential execution to achieve true parallelization while maintaining dynamic replanning capabilities.
Critical innovation vs ReWOO: LLMCompiler supports two key capabilities explicitly: (1) parallel function calling reducing latency and cost, and (2) dynamic replanning for problems whose execution flow cannot be determined statically upfront.
Three-Component Architecture
Planner: Generates task sequences with dependencies forming a DAG, identifying necessary tasks, input arguments, and inter-dependencies using placeholder variables ($1, $2, $3). Can stream tasks as they’re generated, hiding planning latency behind tool execution through instruction pipelining.
Task Fetching Unit: Schedules and dispatches tasks as soon as dependencies are satisfied using a greedy policy. Replaces placeholder variables with actual outputs from completed tasks without requiring dedicated LLM calls.
Executor: Receives independent tasks and runs them asynchronously in parallel, with each task having dedicated memory for intermediate outcomes.
Architecture Flow
graph TD
A[User Query] --> B[Planner: Generate DAG with dependencies]
B --> C{Task Fetching Unit}
C --> D[Task 1: $1]
C --> E[Task 2: $2]
C --> F[Task 3: $3]
D --> G{Dependencies Satisfied?}
E --> G
F --> G
G -->|Yes| H[Executor: Run parallel tasks]
H --> I[Task 4: Uses $1, $2]
H --> J[Task 5: Uses $3]
I --> K{Replanner: Continue or Finish?}
J --> K
K -->|Finish| L[Final Answer]
K -->|Continue| C
style A fill:#fff,stroke:#333,stroke-width:2px
style B fill:#e8e8e8,stroke:#333,stroke-width:2px
style C fill:#d0d0d0,stroke:#333,stroke-width:2px
style D fill:#c0c0c0,stroke:#333,stroke-width:2px
style E fill:#c0c0c0,stroke:#333,stroke-width:2px
style F fill:#c0c0c0,stroke:#333,stroke-width:2px
style G fill:#b0b0b0,stroke:#333,stroke-width:2px
style H fill:#a0a0a0,stroke:#333,stroke-width:2px
style I fill:#989898,stroke:#333,stroke-width:2px
style J fill:#989898,stroke:#333,stroke-width:2px
style K fill:#888,stroke:#333,stroke-width:2px
style L fill:#707070,stroke:#333,stroke-width:2px
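The scheduling idea can be sketched with `asyncio`: tasks form a DAG with `$N` placeholders, and each task is dispatched as soon as its dependencies resolve. The DAG, tool, and latency here are illustrative stand-ins, not real planner output.

```python
import asyncio
import re

# LLMCompiler sketch: a hand-written DAG with $N placeholders and a
# scheduler that runs all dependency-free tasks concurrently.

async def fetch(arg):
    await asyncio.sleep(0.01)      # simulate I/O-bound tool latency
    return f"<{arg}>"

# task id -> (argument template, set of dependency ids)
dag = {
    1: ("price of A", set()),
    2: ("price of B", set()),
    3: ("compare $1 and $2", {1, 2}),
}

async def run_dag(dag):
    results, pending = {}, dict(dag)
    while pending:
        # Tasks whose dependencies are all satisfied run in parallel.
        ready = [t for t, (_, deps) in pending.items() if deps <= results.keys()]
        args = [re.sub(r"\$(\d+)", lambda m: results[int(m.group(1))],
                       pending[t][0]) for t in ready]
        outs = await asyncio.gather(*(fetch(a) for a in args))
        for t, out in zip(ready, outs):
            results[t] = out
            del pending[t]
    return results

results = asyncio.run(run_dag(dag))
```

Tasks 1 and 2 run concurrently, so wall-clock time per dependency level is bounded by the slowest task in that level rather than the sum of all tasks.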
Performance Claims
- Up to 3.7× latency speedup (Movie Recommendation: 5.47s vs 20.47s for ReAct)
- 6.73× cost reduction on some benchmarks
- 35% faster execution than OpenAI’s proprietary parallel function calling
- 9% accuracy improvement (ParallelQA: 68.14% vs 59.59% for ReAct)
Strengths
- Dramatic performance efficiency: Up to 3.7× latency speedup and 3-7× cost reduction through parallel execution—total time equals longest single tool per dependency level rather than sum of all tools
- Quality improvements from upfront planning: DAG planning prevents common ReAct failure modes, including premature stopping (85% of cases), repetitive loops (10% of cases), and context pollution from intermediate observations
- Instruction pipelining optimization: Streaming task generation hides planning latency behind tool execution, with Task Fetching Unit dispatching tasks as soon as dependencies are satisfied
- Dynamic replanning capability: Unlike ReWOO’s rigid commit-and-execute, supports replanning for problems whose execution flow cannot be determined statically
- Architecture flexibility: Model-agnostic design demonstrated across multiple model families enabling cost-quality trade-offs through model tiering
Limitations
- Parallelization limitations (Amdahl’s Law): Planner overhead (~1.88s) can’t be parallelized, straggler effects mean slowest task determines completion time (1.13s vs 0.61s average), and speedup is highly workload-dependent (3.7× for 8-way parallel vs 1.8× for 2-way)
- Implementation and debugging complexity: Requires DAG scheduler, task fetching logic, and parallel execution infrastructure making it significantly more complex than ReAct/P-t-E—parallel execution complicates error tracing
- Requires parallelizable workflows: Sequential dependencies or complex causal chains see minimal benefit (Game of 24: 2.89× vs Movie Recommendation: 3.7×)
- Production readiness concerns: Newer pattern (Dec 2023) with less battle-testing than ReAct, unknown tool count sensitivity at 7+ tools
- Replanning overhead trade-off: Dynamic replanning adds latency compared to ReWOO’s commit-and-execute approach
When to Use LLMCompiler
- Speed-critical applications demanding fastest execution (1.8-3.7× faster with parallel workflows)
- Embarrassingly parallel workflows with multiple independent data fetches running concurrently
- Cost-sensitive high-volume systems where 3-6× cost reduction matters at scale
- Clear task dependencies where dependency graphs are predictable
When to Avoid LLMCompiler
- Sequential workflows where tasks have long dependency chains (minimal parallelization benefit)
- Exploratory tasks with unpredictable dependencies making DAG planning difficult (use ReAct)
- Resource-constrained environments unable to support parallel execution infrastructure
- Immature tooling concerns if production validation and battle-testing are critical
Pattern 5: Reflexion (Self-Reflective Iterative Improvement)
What It Is
Reflexion introduces verbal self-reflection and iterative refinement to agent architectures. After generating an initial solution, the agent reflects on failures by producing natural language feedback about what went wrong and how to improve. This reflection is stored in an episodic memory buffer and provided as context for subsequent trials, enabling the agent to learn from mistakes within a task without parameter updates.
Critical innovation: Unlike traditional RL which updates model weights through backpropagation, Reflexion stores verbal reflections in episodic memory and provides them as additional context. This enables learning within a task through language-based feedback rather than requiring retraining.
Three-Component Architecture
Actor: Generates text and actions based on state observations and reflection memory (typically a ReAct agent)
Evaluator: Scores outputs using task-specific heuristics, learned reward models, or binary success/failure signals. Provides feedback on what worked and what didn’t.
Self-Reflection: Generates verbal reinforcement cues from evaluation signals and trajectory history. Creates natural language summaries of failure patterns (e.g., “Search query was too specific, try broader terms” or “Missed validating data sources before analysis”)
Architecture Flow
graph TD
A[User Query] --> B[Actor: Generate initial solution using ReAct/P-t-E/etc]
B --> C[Evaluator: Score output success/failure/quality]
C --> D{Success criteria met?}
D -->|Yes| E[Return Final Answer]
D -->|No| F[Self-Reflection: Analyze failures generate verbal feedback]
F --> G[Store reflection in Episodic Memory Buffer]
G --> H[Actor: Retry with reflection context Trial 2, 3, ... N]
H --> I[Evaluator: Re-score new attempt]
I --> J{Success or max trials reached?}
J -->|Success| E
J -->|Max trials| K[Return best attempt]
J -->|Continue| F
style A fill:#fff,stroke:#333,stroke-width:2px
style B fill:#e8e8e8,stroke:#333,stroke-width:2px
style C fill:#d8d8d8,stroke:#333,stroke-width:2px
style D fill:#c8c8c8,stroke:#333,stroke-width:2px
style E fill:#707070,stroke:#333,stroke-width:2px
style F fill:#b8b8b8,stroke:#333,stroke-width:2px
style G fill:#a8a8a8,stroke:#333,stroke-width:2px
style H fill:#989898,stroke:#333,stroke-width:2px
style I fill:#888,stroke:#333,stroke-width:2px
style J fill:#787878,stroke:#333,stroke-width:2px
style K fill:#707070,stroke:#333,stroke-width:2px
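The trial loop maps directly onto code. In this sketch the Actor, Evaluator, and Self-Reflection modules are stubs, and the episodic memory buffer is a plain list of verbal reflections.

```python
# Reflexion sketch: Actor, Evaluator, and Self-Reflection as stubs.
# The episodic memory buffer is a plain list of natural-language reflections.

def actor(query, reflections):
    # Stub: a real Actor is e.g. a ReAct agent conditioned on reflections.
    return "broad answer" if reflections else "too narrow answer"

def evaluator(output):
    # Stub binary success signal; real ones use heuristics or reward models.
    return "narrow" not in output

def self_reflect(output):
    # Stub: a real reflection is LLM-generated feedback on the failure.
    return f"previous attempt '{output}' was too narrow; broaden the search"

def reflexion(query, max_trials=3):
    memory = []                             # episodic memory buffer
    best = None
    for _ in range(max_trials):
        best = actor(query, memory)         # retry with reflection context
        if evaluator(best):
            return best, memory
        memory.append(self_reflect(best))   # learn without weight updates
    return best, memory                     # max trials: return best attempt

answer, memory = reflexion("Analyze market trends")
```

Each failed trial appends a reflection that the next trial reads as context, which is the "learning without parameter updates" mechanism described above.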
Performance Claims
- 20-25% success rate improvement on complex tasks (ALFWorld: 97% vs 75% baseline, +22%)
- Game of 24 improved from 4% to 74% with 3 reflections (+70 percentage points)
- HumanEval code generation reached 91% pass@1 (vs 80% without reflection, +11%)
- 3-12× cost increase due to multiple trial iterations
Strengths
- Dramatic quality improvements through iterative refinement: 20-70 percentage point success rate gains (Game of 24: 4%→74%, ALFWorld: 75%→97%) by learning from failures across trials—particularly effective for long-horizon tasks requiring 50+ steps
- Verbal self-reflection with episodic memory: Core innovation enabling learning within a task without parameter updates—agent generates human-interpretable failure analyses stored in memory buffer, preventing repeated mistakes
- Complementary architecture wrapper: Unlike other patterns that replace ReAct/P-t-E, Reflexion wraps around any existing Actor pattern as a quality-enhancing meta-layer—can add reflection to ReAct, Plan-then-Execute, or ReWOO without changing core architecture
- Adaptive learning from evaluation signals: Supports flexible evaluation approaches including task-specific heuristics, learned reward models, or LLM-as-evaluator—reflections guide Actor toward more promising action spaces
Limitations
- Multi-trial cost-latency multiplication: 3-12× cost increase and proportional latency impact (3 trials = 3× execution time) makes it incompatible with real-time requirements or cost-constrained high-volume systems
- Evaluation quality dependency: Requires reliable evaluator providing meaningful signals—weak evaluators produce poor reflections leading to no improvement or degradation
- No cross-task generalization: Reflections are task-specific ephemeral learning—agent doesn’t improve at new tasks unlike fine-tuning which generalizes across problem types
- Only valuable for failure-prone tasks: ROI exists only when baseline success rate <80%—high-success tasks see minimal benefit from reflection overhead
- Production deployment challenges: Long reflection histories consume context window requiring pruning strategies, unknown interaction with large tool sets, and memory buffer management complexity
When to Use Reflexion
- Quality-critical applications where accuracy/completeness outweigh cost (executive reports, compliance docs)
- High first-attempt failure rate (<80% success) where reflection enables learning from mistakes
- Complex multi-dimensional analysis where missing aspects is common failure mode
- Latency-tolerant scenarios like batch processing or overnight report generation
When to Avoid Reflexion
- Cost-constrained applications where 3-12× cost increase is unacceptable for high-volume queries
- Real-time requirements where multiple trial latency is incompatible with user-facing interactions
- High baseline success rate (>80%) where marginal benefit doesn’t justify cost
- No quality evaluator available—requires reliable scoring mechanism for meaningful reflections
Other Notable Patterns
Tree-of-Thought / Graph-of-Thought
Explore multiple reasoning branches with backtracking and scoring. Useful for generating multiple hypotheses and selecting the best via evaluation.
graph TD
A[User Query] --> B[Thought Branch 1]
A --> C[Thought Branch 2]
A --> D[Thought Branch 3]
B --> E[Evaluate]
C --> F[Evaluate]
D --> G[Evaluate]
E --> H{Select Best}
F --> H
G --> H
H --> I[Answer]
style A fill:#fff,stroke:#333,stroke-width:2px
style B fill:#d8d8d8,stroke:#333,stroke-width:2px
style C fill:#d8d8d8,stroke:#333,stroke-width:2px
style D fill:#d8d8d8,stroke:#333,stroke-width:2px
style E fill:#b0b0b0,stroke:#333,stroke-width:2px
style F fill:#b0b0b0,stroke:#333,stroke-width:2px
style G fill:#b0b0b0,stroke:#333,stroke-width:2px
style H fill:#888,stroke:#333,stroke-width:2px
style I fill:#707070,stroke:#333,stroke-width:2px
Use case: “Generate 3 different architectural approaches for scaling microservices” → evaluate each → pick best. Game of 24 improved from 4% → 74% through search-based reasoning.
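The branch-evaluate-select cycle can be sketched as follows; `propose` and `score` are hypothetical stubs for LLM-driven branch generation and evaluation.

```python
# Tree-of-Thought sketch: branch, score, select. `propose` and `score`
# are stand-ins for LLM sampling and an LLM/heuristic evaluator.

def propose(query, n=3):
    # Stub: a real implementation samples n candidate reasoning branches.
    return [f"approach {i} to: {query}" for i in range(1, n + 1)]

def score(branch):
    # Stub evaluator: here, simply prefer the highest-numbered approach.
    return int(branch.split()[1])

def tree_of_thought(query):
    branches = propose(query)
    scored = [(score(b), b) for b in branches]   # evaluate each branch
    return max(scored)[1]                        # select the best

best = tree_of_thought("scale microservices")
```

A full implementation also expands promising branches into sub-branches and backtracks from dead ends, which this flat one-level sketch omits.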
Decision Framework: Choosing the Right Pattern
Why Planning Patterns Matter (vs ReAct)
Planning patterns (Plan-then-Execute, ReWOO, LLMCompiler) promise improvements over traditional ReAct-style agents:
- Speed: Execute multi-step workflows faster since the large agent doesn’t need to be consulted after each action
- Cost: Significant cost savings over ReAct through model tiering (30-50% cost reduction without accuracy loss)
- Quality: Can perform better overall by forcing the planner to explicitly “think through” all steps required
Performance Comparison Summary
⏰ Speed: LLMCompiler > ReWOO ≥ P-t-E > ReAct >>> Reflexion
💸 Cost: ReWOO ≥ LLMCompiler > P-t-E > ReAct >>> Reflexion
🏆 Quality: Reflexion > LLMCompiler ≥ ReWOO ≥ P-t-E ≥ ReAct
Pattern Comparison Table
| Pattern | Best For | Speed | Cost | Quality | Complexity | Tool Limit |
|---|---|---|---|---|---|---|
| ReAct (current) | Exploratory, unknown complexity, adaptive workflows | Moderate (3-7 steps) | Moderate | Good | Low | ~7 tools (degradation threshold) |
| Plan-then-Execute | Well-scoped repeatable workflows, compliance/audit | Fast (fewer LLM calls) | Low (40-50% reduction) | Good-Excellent | Medium | Unknown |
| ReWOO | Predictable multi-hop, token efficiency critical | Fast (no reasoning loops) | Very Low (80% reduction) | Good-Excellent | Medium | 7 tools at degradation threshold |
| LLMCompiler | Speed-critical, parallel workflows, clear dependencies | Fastest (1.8-3.7× speedup) | Very Low (3-6× reduction) | Good-Excellent | High | Unknown |
| Reflexion | Quality-critical, failure-prone tasks, batch processing | Slowest (2-10 trials) | Very High (3-12× increase) | Excellent | Medium | Unknown |
| Multi-Agent Supervisor | >10 tools, domain specialization needed | Moderate | Moderate | Excellent | High | Specialist: 3-5 tools each |
Quick Decision Cues
- Ambiguity high, info unknown → ReAct ✅ — adapts during exploration
- Workflow known, repeatable → Plan-then-Execute — predictable + cost-efficient
- Complex reasoning with predictable operations → ReWOO — 80% cost reduction, 5× token efficiency
- Speed critical, clear task dependencies → LLMCompiler — 1.8-3.7× speedup via parallel DAG execution
- Quality critical, time flexible → Reflexion on top of any pattern — 20-70% success rate improvement at 3-12× cost
- Exploration and solution diversity needed → Tree/Graph-of-Thought — hypothesis generation
- >7 tools, domain specialization → Multi-Agent Supervisor — split into specialists with 3-5 tools each
Pattern Selection Examples
- Exploratory research (“Find security vulnerabilities in codebase”) → ReAct — unknown complexity, adaptive discovery
- Multi-source analysis (“Compare pricing across competitors + market trends + customer reviews”) → LLMCompiler — parallel data fetching
- Well-scoped reports (“Q3 sales performance analysis”) → Plan-then-Execute — predictable, auditable workflow
- Open-ended questions (“What’s the best database for my use case?”) → ReAct — needs exploration and context gathering
- Speed-critical lookups (“Real-time stock portfolio dashboard”) → LLMCompiler — fastest parallel execution
- Quality-critical outputs (“Investment recommendation report”) → Reflexion — iterative refinement for accuracy
Emerging Hybrid Approaches
Real-world production systems increasingly combine multiple patterns rather than using them in isolation. These hybrid architectures leverage complementary strengths while mitigating individual weaknesses:
1. ReWOO + ReAct Fallback (Graceful Degradation)
Pattern: Start with ReWOO for efficiency; fallback to ReAct on failure
Trigger: If ReWOO plan execution returns empty results or evaluator scores output as low-quality
Benefit: Get 80% cost reduction on successful cases, full adaptability on edge cases
Use case: Predictable multi-hop queries (95% success with ReWOO) with ReAct handling edge cases requiring adaptive exploration
Implementation: Wrap ReWOO in try-catch; on failure, invoke ReAct with full context
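The try-catch wrapping can be sketched in a few lines; both agents here are hypothetical stubs, with the failure condition simulated.

```python
# Graceful-degradation sketch: try the cheap ReWOO path first, fall back
# to an adaptive ReAct path on failure. Both agents are stubs.

class LowQualityResult(Exception):
    """Raised when ReWOO returns empty results or a low evaluator score."""

def rewoo_agent(query):
    # Stub: pretend ReWOO fails on queries that need exploration.
    if "explore" in query:
        raise LowQualityResult("empty evidence")
    return f"rewoo answer to: {query}"

def react_agent(query):
    # Stub adaptive fallback with full context.
    return f"react answer to: {query}"

def answer(query):
    try:
        return rewoo_agent(query)      # cheap path (~80% cost reduction)
    except LowQualityResult:
        return react_agent(query)      # adaptive fallback on edge cases

fast = answer("lookup capital of France")
fallback = answer("explore unknown codebase")
```

The economics work because the expensive ReAct path only runs on the minority of queries where ReWOO's static plan fails.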
2. LLMCompiler + Reflexion (Speed + Quality)
Pattern: Use LLMCompiler for fast parallel execution; add Reflexion layer for quality-critical outputs
Benefit: 1.8-3.7× speedup with 20-25% quality improvement on complex analyses
Use case: Financial reports or research briefs requiring both speed (user-facing) and quality (accuracy-critical)
Trade-off: 1st trial fast (LLMCompiler), 2nd trial expensive (full reflection) but only on failures
Implementation: LLMCompiler as Actor in Reflexion framework; evaluator triggers re-trial if needed
3. Plan-then-Execute with Multi-Agent Workers (Scale + Structure)
Pattern: Planner generates structured plan; route steps to specialized worker agents; replanner coordinates
Benefit: Handles >10 tools by domain specialization while maintaining deterministic workflows
Use case: Comprehensive market research requiring Web Search agent + Data Analysis agent + Report Synthesis agent
Tool distribution:
- Worker 1 (Research Agent): 3 tools (WebSearch, DocumentRetrieval, PDFExtraction)
- Worker 2 (Analysis Agent): 3 tools (DataAggregation, StatisticalAnalysis, Visualization)
- Worker 3 (Synthesis Agent): 2 tools (ReportGeneration, ChartCreation)
Implementation: Planner identifies which specialist per step; supervisor routes to workers; replanner evaluates
4. Reflexion with Model Diversity (X-MAS Pattern)
Pattern: Each Reflexion trial uses different LLM for ensemble quality
Benefit: 70% vs 23.33% accuracy from heterogeneous models (X-MAS research, 2025)
Use case: Critical analyses where consensus across models increases confidence
Trade-off: 3-5 trials × 3 models = 9-15× cost, but dramatic quality improvement
Implementation: Different frontier model per trial with cross-model reflections for ensemble learning
5. ReWOO + Dynamic Tool Loading (Adaptive Efficiency)
Pattern: ReWOO Planner generates plan; Worker dynamically loads only required tools
Benefit: Mitigates tool selection degradation (42% → 37% with 7 tools) by reducing active tool count per query
Use case: Multi-domain analysis where different queries need different tool subsets
Tool loading: Query about “Weather patterns” loads only [WeatherAPI, HistoricalData, Forecasting]; “Stock analysis” loads [MarketData, NewsAPI, FinancialStatements]
Implementation: Planner identifies required tools; Worker initializes only subset; Solver synthesizes
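The tool-subsetting step is simple to show; the tool names and the planner's selection logic here are illustrative stubs.

```python
# Dynamic tool loading sketch: the Planner names required tools, and only
# that subset is initialized for the Worker. Tool names are illustrative.

ALL_TOOLS = {
    "WeatherAPI": lambda q: f"weather:{q}",
    "MarketData": lambda q: f"market:{q}",
    "NewsAPI": lambda q: f"news:{q}",
}

def planner_required_tools(query):
    # Stub: a real planner emits the required tool list as part of its plan.
    return ["MarketData", "NewsAPI"] if "stock" in query else ["WeatherAPI"]

def load_tools(query):
    names = planner_required_tools(query)
    return {n: ALL_TOOLS[n] for n in names}   # initialize only the subset

active = load_tools("stock analysis for ACME")
```

Keeping the active tool count low per query is what counteracts the 42% → 37% degradation noted above.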
6. Hierarchical Planning (Two-Level P-t-E)
Pattern: Strategic Planner creates high-level phases; Tactical Planner details each phase; Executor runs steps
Benefit: Addresses planning rigidity by allowing phase-level replanning without full plan regeneration
Use case: Long-horizon analyses (quarterly reviews, annual reports, research projects) with evolving requirements
Example flow:
- Strategic Plan: [Phase 1: Data Collection] → [Phase 2: Analysis] → [Phase 3: Report Synthesis]
- Tactical Plan for Phase 1: [Fetch market data] → [Download competitor reports] → [Extract key metrics]
- After Phase 1: Tactical Replanner adjusts Phase 2 based on Phase 1 outcomes
Implementation: Nested P-t-E agents; strategic replanner decides whether to continue/revise next phase
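The nesting can be sketched as two planning functions, where the tactical planner is re-invoked per phase with the accumulated context. Both planners are stubs standing in for LLM calls; the phase and step names are illustrative.

```python
# Sketch of two-level Plan-then-Execute: strategic phases are fixed up front,
# tactical steps are (re)planned per phase using earlier phases' outcomes.
def strategic_plan(goal: str) -> list:
    # Stub: an LLM would produce high-level phases for the goal.
    return ["data collection", "analysis", "report synthesis"]

def tactical_plan(phase: str, context: dict) -> list:
    # Stub: an LLM would detail steps for one phase, conditioned on context.
    return [f"{phase}: step {i}" for i in range(1, 3)]

def run(goal: str) -> dict:
    context = {}
    for phase in strategic_plan(goal):
        steps = tactical_plan(phase, context)          # replanned per phase
        context[phase] = [f"done {s}" for s in steps]  # execute each step
    return context

results = run("quarterly review")
```

The point of the split is cost containment: when Phase 1 surprises you, only the tactical plan for Phase 2 is regenerated, not the whole strategic plan.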
7. ReAct with Cached Plans (Learning Pattern Library)
Pattern: ReAct agent builds episodic memory of successful reasoning traces; retrieves similar patterns for new queries
Benefit: Combines ReAct’s adaptability with P-t-E’s efficiency through learned templates
Use case: Recurring query types (“market analysis for X sector”, “code review for Y framework”) that follow similar trajectories
Memory structure: Vector database storing {query_embedding, successful_tool_sequence, outcome_quality}
Retrieval: New query → find top-3 similar past queries → inject their tool sequences as “suggested approach” → ReAct adapts if needed
Implementation: LangChain Memory + vector store; inject retrieved sequences into system prompt
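The retrieval step can be sketched without any vector database by using word-count cosine similarity as a stand-in for embeddings. The memory entries, tool names, and similarity function are illustrative assumptions, not the production setup.

```python
from collections import Counter
from math import sqrt

# Toy episodic memory: past queries -> their successful tool sequences.
MEMORY = [
    ("market analysis for energy sector",
     ["WebSearch", "DataAggregation", "ReportGeneration"]),
    ("code review for Django framework",
     ["RepoFetch", "StaticAnalysis", "CommentDraft"]),
]

def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (stands in for vector embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def suggest_plan(query: str) -> list:
    """Retrieve the most similar past query and reuse its tool sequence."""
    best = max(MEMORY, key=lambda item: similarity(query, item[0]))
    return best[1]

plan = suggest_plan("market analysis for retail sector")
```

The retrieved sequence is injected as a *suggestion*, not a mandate: the ReAct loop still observes results and deviates when the cached trajectory stops fitting.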
Choosing Hybrid Approaches
- Production maturity critical → ReWOO + ReAct Fallback (battle-tested components)
- Budget available, quality paramount → LLMCompiler + Reflexion (best of both worlds)
- Tool count >10 → P-t-E + Multi-Agent Workers (specialization at scale)
- Mission-critical decisions → Reflexion + Model Diversity (consensus across LLMs)
- Recurring query patterns → ReAct + Cached Plans (learn from experience)
- Long-horizon workflows → Hierarchical Planning (phase-level adaptation)
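The decision list above can be encoded as a simple lookup; the constraint keys are illustrative labels, and the fallback to plain ReAct reflects its role as the default pattern discussed throughout.

```python
# The hybrid-selection guidance expressed as a lookup table (keys are
# illustrative labels, not a formal taxonomy).
HYBRID_CHOICES = {
    "production_maturity": "ReWOO + ReAct Fallback",
    "quality_with_budget": "LLMCompiler + Reflexion",
    "many_tools": "P-t-E + Multi-Agent Workers",
    "mission_critical": "Reflexion + Model Diversity",
    "recurring_queries": "ReAct + Cached Plans",
    "long_horizon": "Hierarchical Planning",
}

def choose_hybrid(constraint: str) -> str:
    # Plain ReAct remains the sensible default for unmatched cases.
    return HYBRID_CHOICES.get(constraint, "ReAct")

choice = choose_hybrid("many_tools")
```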
Conclusion
The choice of agentic design pattern significantly impacts your system’s cost, speed, and quality. While ReAct remains a solid default for exploratory tasks, planning patterns like ReWOO and LLMCompiler offer dramatic efficiency gains (80% cost reduction, 3.7× speedup) for predictable workflows. For quality-critical applications, Reflexion’s iterative improvement delivers 20-70% success rate gains at higher cost.
The future lies in hybrid approaches that combine complementary strengths—using ReWOO for efficiency with ReAct fallback for edge cases, or LLMCompiler for speed enhanced with Reflexion for quality. As these patterns mature and tool ecosystems grow past seven tools, multi-agent architectures with specialized workers become increasingly essential.
Key takeaway: There’s no one-size-fits-all solution. Understand your constraints (cost, latency, quality requirements), evaluate your task characteristics (predictable vs exploratory, sequential vs parallelizable), and choose—or combine—patterns accordingly.
Key Insights for Production Systems
Tool Limits Matter:
- ReAct: At its practical upper limit (research shows only 2% success on calendar scheduling with 7+ domains)
- ReWOO: At degradation threshold (42% → 37% performance with 2 → 7 tools)
- LLMCompiler: Unknown tool count sensitivity; research focused on smaller tool sets
- Multi-Agent Supervisor: Best option if expanding beyond 7 tools—split into specialists
Production Trends (End of 2024):
- LangGraph adoption: 43% of organizations
- ReAct pattern: 39.8% of production implementations
- Top concern: Quality/performance (45.8%), Cost second (22.4%)
- Best practice: 5-10 tools per agent, multi-agent for larger tool sets
References
Core Pattern Papers
ReAct: Yao et al., ICLR 2023 - “ReAct: Synergizing Reasoning and Acting in Language Models”
- Introduces interleaved reasoning and acting with 27.4% HotpotQA, 71% ALFWorld, 6% hallucination rate
- Establishes baseline for modern agentic systems with thought-action-observation cycle
Plan-and-Solve: Wang et al., ACL 2023 - “Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning”
- PS+ prompting technique: 70.4% → 76.7% average accuracy on arithmetic reasoning
- MultiArith: 83.8% → 91.8%, GSM8K: 56.4% → 59.3%
- Inspiration for Plan-then-Execute agentic framework
ReWOO: Xu et al., 2023 - “ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models”
- 5× token efficiency (1,986 vs 9,795 tokens), 80% cost reduction ($3.97 vs $19.59 per 1K queries)
- Variable substitution (#E1, #E2, #E3) enables foreseeable reasoning without observations
- HotpotQA: 42.4% vs 40.8% ReAct with dramatic cost savings
LLMCompiler: Kim et al., UC Berkeley, ICML 2024 - “An LLM Compiler for Parallel Function Calling”
- Up to 3.7× latency speedup, 6.73× cost reduction through DAG-based parallel execution
- Beats OpenAI parallel function calling by 35% through instruction pipelining
- HotpotQA: 1.80× speedup with 3.37× cost reduction
Reflexion: Shinn et al., NeurIPS 2023 - “Reflexion: Language Agents with Verbal Reinforcement Learning”
- Self-reflection with episodic memory: ALFWorld 97% vs 75% baseline (+22%)
- HumanEval: 91% vs 80% GPT-4 baseline through iterated self-reflection
- Verbal feedback without model fine-tuning, 3-12× cost increase
Foundational Techniques
Chain-of-Thought: Wei et al., NeurIPS 2022 - “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”
- Establishes step-by-step reasoning as core prompting technique
- Foundation for ReAct’s reasoning traces and Plan-and-Solve improvements
Tree-of-Thoughts: Yao et al., NeurIPS 2023 - “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”
- Explores multiple reasoning branches with backtracking
- Game of 24: 4% → 74% through search-based reasoning
Multi-Agent & Advanced Architectures
Multi-Agent Collaboration: LangGraph documentation on supervisor patterns
- Hub-and-spoke topology with specialist agents handling 3-5 tools each
- 50% performance improvement when tools properly grouped by domain
- Addresses tool selection overload at 7+ tools
X-MAS (Heterogeneous Multi-Agent Systems): 2025 - “X-MAS: Solving Math Word Problems via Cross-Model Augmented Self-Correction”
- 70% accuracy with heterogeneous models vs 23.33% homogeneous (+46.67 points)
- Model diversity through ensemble of different frontier models
- Validates Reflexion with Model Diversity hybrid approach
Security & Reliability
Plan-then-Execute Security: Del Rosario et al., 2025 - “Architecting Resilient LLM Agents: A Guide to Secure Plan-then-Execute Implementations”
- Control-flow integrity through planning-execution separation
- Defense-in-depth strategies: least privilege, sandboxing, human-in-the-loop
- Resilience to indirect prompt injection attacks
Benchmarking & Evaluation
AI Agents That Matter: Princeton & Allen Institute for AI, July 2024
- Simple baselines often outperform complex architectures when cost-controlled
- Emphasizes importance of rigorous evaluation and fair comparison
- Challenges inflated performance claims in agent research
τ-bench: Sierra AI, 2024
- Industry standard for realistic agent evaluation with retail/airline scenarios
- <50% success rates reveal gap between research benchmarks and production reality
- Emphasizes need for practical, grounded agent assessment
Implementation Resources
LangChain Blog: Planning Agents (Feb 2024)
- Compares Plan-and-Execute, ReWOO, and LLMCompiler implementations
- Production insights on speed/cost/quality tradeoffs
- Model tiering strategies for 40-50% cost reduction
LangGraph Tutorials: Official Documentation
- ReAct Agent from Scratch
- Plan-and-Execute - Full implementation with state tracking
- ReWOO - Variable substitution examples
- LLMCompiler - DAG scheduling implementation
BabyAGI: GitHub Repository (Nakajima, 2023)
- Early autonomous agent with task management and prioritization
- Inspiration for task decomposition patterns
Evaluation Tooling
- τ-bench: Sierra AI, 2024 - Realistic retail/airline agent evaluation
- LangSmith: LangChain’s observability and testing platform for agent traces
- LangFuse: Open-source LLM observability and monitoring
- HumanEval/MBPP: Code generation benchmarks (used for Reflexion evaluation)
- HotpotQA: Multi-hop question answering requiring 2-3 Wikipedia passages
- ALFWorld: Embodied AI tasks in text-based household environments (134 tasks)
- WebShop: E-commerce navigation with 1.18M real products
- GSM8K: Grade-school math word problems (2-8 reasoning steps)
Production Insights
Tool Selection Research (2024):
- Performance degrades with 10+ tools even with capable models
- Calendar scheduling: 2% success with 7+ domains (GPT-4o)
- Best practice: 5-10 tools per agent, multi-agent for larger tool sets
This article is based on research and practical experience implementing agentic systems. For the complete source material with additional details, visit the PDF source.