Note: All numerical values and tables on this page should be verified against the paper PDF for the most accurate and up-to-date results.

Results

Architecture, not models, determines performance. Same LLM, same task, wildly different outcomes.

Orchestration Overhead

Framework orchestration alone can create 100× latency differences, even for trivial tasks. We fix the LLM, query, and prompts, varying only the orchestration layer.

Architecture Type                      | Latency (p50)          | Throughput (req/s)
Direct LLM                             | 0.38s                  | 8.88
Graph-based (LangGraph)                | 0.52s (1.4×)           | 6.38
Role-based (AutoGen, OpenAgents)       | 0.50s (1.3×)           | 6.83-7.06
Role-based (CrewAI, Agno, OpenAI SDK)  | 0.61-1.17s (1.6-3.1×)  | 2.86-4.16
GABM (Concordia)                       | 44.47s (117×)          | 0.089

Key Takeaway

Orchestration architecture alone governs baseline scalability. Graph- and role-based designs introduce modest overhead (1.3-3.1×), while GABM execution incurs orders-of-magnitude higher runtime and output volume even for trivial tasks, driven by execution semantics rather than task complexity.
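
For concreteness, the measurement procedure can be sketched as below. This is a minimal illustration, not the benchmark harness used in the paper: call_llm and framework_entrypoint are hypothetical stand-ins for a fixed LLM client and a framework's agent loop.

```python
import time
import statistics

def call_llm(prompt: str) -> str:
    """Stand-in for a fixed LLM endpoint, identical across all conditions."""
    raise NotImplementedError  # plug in the actual client here

def run_direct(query: str) -> str:
    # Direct baseline: one prompt, one completion, no orchestration layer.
    return call_llm(query)

def run_framework(query: str, framework_entrypoint) -> str:
    # Framework condition: same LLM, query, and prompts, but routed through
    # the framework's own agent loop (graph traversal, role dispatch, etc.).
    return framework_entrypoint(query)

def benchmark(run, query: str, n: int = 50) -> dict:
    """Report median latency and the implied single-stream throughput."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        run(query)
        latencies.append(time.perf_counter() - start)
    p50 = statistics.median(latencies)
    return {"p50_s": round(p50, 2), "throughput_rps": round(1.0 / p50, 2)}
```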

Memory Architecture Effects

Memory structure matters more than context size. Retrieval enables stable recall, accumulation enables learning but scales poorly, and hybrid designs work best under bounded context.

Memory Type                        | AR Score | TTL Score | LRU Score | Overall
Retrieval-only (LangGraph)         | 33.2     | 11.4      | 30.4      | 23.8
Hybrid (LangGraph, W=512)          | 44.9     | 24.2      | 17.6      | 21.7
Accumulation (OpenAI SDK, W=8192)  | 33.9     | 20.7      | 27.5      | 20.6
Accumulation (OpenAI SDK, W=50)    | 8.4      | 1.2       | 0.0       | 6.1

Key Takeaways

  • Retrieval enables stable recall: Retrieval-only designs dominate on factual recall (AR) and long-range understanding (LRU), achieving 30.4 on LRU vs 27.5 for large accumulation.
  • Hybrid works best for learning: Hybrid retrieval-accumulation achieves highest TTL (28.9) by anchoring learning to retrieved signals rather than raw prompt growth.
  • Accumulation scales poorly: Runtime grows rapidly with the context window because full histories are replayed for each query; retrieval avoids this repeated processing (see the sketch below).
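
The three memory designs compared above differ in what gets replayed per query. The sketch below is illustrative only: it measures the window in turns rather than tokens and uses naive lexical overlap in place of embedding retrieval, and the class names and parameters are assumptions, not the frameworks' APIs.

```python
from collections import deque

class AccumulationMemory:
    """Append-only prompt history; everything retained is replayed per query."""
    def __init__(self, window: int):
        # The paper's W is a token budget; turns are used here for simplicity.
        self.history = deque(maxlen=window)

    def add(self, turn: str) -> None:
        self.history.append(turn)

    def context(self, query: str) -> str:
        # The full retained history goes back into every prompt, so the
        # per-query cost grows with the window size.
        return "\n".join(self.history)

class RetrievalMemory:
    """External store; only the top-k most relevant turns are injected."""
    def __init__(self, k: int = 4):
        self.k = k
        self.store: list[str] = []

    def add(self, turn: str) -> None:
        self.store.append(turn)

    def context(self, query: str) -> str:
        # Naive lexical overlap stands in for embedding search here.
        ranked = sorted(self.store,
                        key=lambda t: len(set(t.split()) & set(query.split())),
                        reverse=True)
        return "\n".join(ranked[: self.k])

class HybridMemory:
    """Bounded recent window plus retrieval over the full archive."""
    def __init__(self, window: int, k: int = 4):
        self.recent = AccumulationMemory(window)
        self.archive = RetrievalMemory(k)

    def add(self, turn: str) -> None:
        self.recent.add(turn)
        self.archive.add(turn)

    def context(self, query: str) -> str:
        return self.archive.context(query) + "\n" + self.recent.context(query)

# Usage: prompt = memory.context(query) + "\n" + query, then call the LLM.
```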

Planning Interface Effects

Schema-constrained planning reduces accuracy and introduces high failure rates. Free-form planning preserves or improves accuracy with minimal overhead.

Planning Interface              | Accuracy Impact | Formatting Failures | Runtime Overhead
No Plan (Direct)                | Baseline        | 0%                  | 1× (baseline)
Schema-constrained (Crew-Plan)  | -30% to -50%    | Up to 84.7%         | 7.4× to 31.3×
Free-form (Direct-LLM-Plan)     | -5% to +15%     | 0%                  | 1.2× to 6.6×

Key Takeaway

Planning outcomes are driven primarily by interface design, not LLM planning ability. Schema-constrained planning introduces large formatting failure rates (up to 84.7%) and high orchestration overhead (up to 31×). Free-form planning preserves accuracy with minimal overhead. Planning should be implemented as a permissive stage that tolerates variability in plan text.
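
The failure mode is easy to see in a toy comparison of the two interfaces. The sketch below is a simplification that assumes a JSON plan schema; it does not reproduce the actual schemas or parsers used by the evaluated frameworks.

```python
import json

def plan_schema_constrained(llm_output: str) -> list[dict]:
    """Strict interface: the plan must parse as a JSON list of steps with
    required keys. Any deviation in the model's formatting is a hard
    failure, which is where large formatting-failure rates come from."""
    plan = json.loads(llm_output)  # raises on any malformed JSON
    if not isinstance(plan, list):
        raise ValueError("plan must be a JSON list of steps")
    for step in plan:
        if not isinstance(step, dict) or not {"step", "tool", "args"} <= step.keys():
            raise ValueError("plan step missing required fields")
    return plan

def plan_free_form(llm_output: str) -> list[str]:
    """Permissive interface: any non-empty line counts as a plan step, so
    variability in phrasing or layout never causes a formatting failure."""
    return [line.strip(" -*\t") for line in llm_output.splitlines() if line.strip()]
```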

Agent Specialization

Specialization is governed by how frameworks inject task-specific reasoning structure, not by role identity alone. Expert-guided conditioning improves multiclass F1 by up to 58 points.

Conditioning Strategy        | Multiclass F1 | Impact
No Role                      | 41.9          | Baseline
Role-based prompting         | 41.9-42.1     | No improvement
Planning-based conditioning  | 42.0          | No improvement
Expert-guided conditioning   | 96.0-100.0    | +54 to +58 points

Key Takeaway

Role labels and generic planning interfaces fail to activate domain knowledge. Explicit procedural instructions impose structured solution workflows that reliably improve performance. Specialization should be implemented through reasoning procedures embedded in the framework, not through role naming or lightweight prompt modifications.
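
As an illustration of the difference, compare a bare role label with an injected solution procedure. The prompt wording below is invented for illustration; it is not the conditioning text used in the experiments.

```python
def role_prompt(task: str) -> str:
    # Role labeling only: asserts an identity but adds no task structure,
    # which is why it leaves multiclass F1 essentially unchanged.
    return f"You are an expert analyst.\n\nTask: {task}"

def expert_guided_prompt(task: str, procedure: list[str]) -> str:
    # Expert-guided conditioning: the framework injects an explicit,
    # domain-specific procedure the model must follow step by step.
    steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(procedure))
    return (
        f"Task: {task}\n\n"
        "Follow this procedure exactly and report the result of each step:\n"
        f"{steps}"
    )
```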

Coordination and Scaling

Coordination performance is governed by the match between task structure and communication geometry. Topology choice becomes critical at scale.

Topology         | Coloring (n=100)     | Consensus (n=100) | Rounds
Scale-Free       | 97%                  | Failed            | 11
Small-World      | 98%                  | Failed            | 15
Fully Connected  | 98%                  | 100%              | 3
Delaunay         | 81%                  | Failed            | 19
Sequential       | Failed (>40 rounds)  | Failed            | >40

Key Takeaways

  • Local coordination: Sparse topologies (scale-free, small-world) achieve high success (97-98%) on local tasks like Coloring, converging in 11-15 rounds.
  • Global agreement: Only fully connected topologies succeed on Consensus (100%), converging in 3 rounds independent of network size. All sparse topologies fail despite higher runtime and token expenditure.
  • Sequential pipelines fail at scale: Sequential topologies exceed 40 rounds and fail entirely on large networks, showing that increasing the interaction budget alone does not improve outcomes (see the sketch below).
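
The topology comparison can be reproduced in outline with standard graph generators and a round-based message-passing loop, as sketched below. The generator parameters, agent_step, and goal_reached are placeholders (e.g., one LLM call per agent per round), not the paper's implementation.

```python
import networkx as nx

def build_topology(kind: str, n: int = 100) -> nx.Graph:
    # Illustrative generators; the exact construction parameters used in
    # the experiments are assumptions here.
    if kind == "scale_free":
        return nx.barabasi_albert_graph(n, m=2)
    if kind == "small_world":
        return nx.watts_strogatz_graph(n, k=4, p=0.1)
    if kind == "fully_connected":
        return nx.complete_graph(n)
    if kind == "sequential":
        return nx.path_graph(n)
    raise ValueError(f"unknown topology: {kind}")

def run_rounds(graph: nx.Graph, agent_step, goal_reached, max_rounds: int = 40):
    """Generic message-passing loop: each round, every agent updates its
    state from its neighbours' current states. Sparse graphs propagate
    information slowly, which is why global agreement (Consensus) needs
    dense connectivity while local constraints (Coloring) do not."""
    state = {node: None for node in graph.nodes}
    for round_no in range(1, max_rounds + 1):
        new_state = {}
        for node in graph.nodes:
            neighbour_states = [state[m] for m in graph.neighbors(node)]
            new_state[node] = agent_step(node, state[node], neighbour_states)
        state = new_state
        if goal_reached(graph, state):
            return round_no, state
    return None, state  # failed to converge within the round budget
```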

Summary

These results demonstrate that performance in multi-agent LLM systems is governed by framework architecture, not model quality alone. Architectural choices create order-of-magnitude differences in latency, accuracy, and coordination success.

Architecture > Model

Framework-level design choices can create 100× latency differences and 30% accuracy drops, even when the underlying LLM is identical.

Design Dimensions Matter

Orchestration, memory, planning, specialization, and coordination each independently drive performance. No single dimension can compensate for poor choices in others.