Note: All numerical values and tables on this page should be verified against the paper PDF for the most accurate and up-to-date results.

Results

Architecture, not models, determines performance. Same LLM, same task, wildly different outcomes.

Orchestration Overhead

Framework orchestration alone can create 100× latency differences, even for trivial tasks. We fix the LLM, query, and prompts, varying only the orchestration layer.

Architecture Type                      | Latency (p50)          | Throughput (req/s)
Direct LLM                             | 0.38s                  | 8.88
Graph-based (LangGraph)                | 0.52s (1.4×)           | 6.38
Role-based (AutoGen, OpenAgents)       | 0.50s (1.3×)           | 6.83-7.06
Role-based (CrewAI, Agno, OpenAI SDK)  | 0.61-1.17s (1.6-3.1×)  | 2.86-4.16
GABM (Concordia)                       | 44.47s (117×)          | 0.089

Key Takeaway

Orchestration architecture alone governs baseline scalability. Graph- and role-based designs introduce modest overhead (1.3-3.1×), while GABM execution incurs orders-of-magnitude higher runtime and output volume even for trivial tasks, driven by execution semantics rather than task complexity.
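
For concreteness, the measurement procedure can be sketched as below. This is a minimal illustration, not the benchmark harness used in the paper: call_llm and framework_entrypoint are hypothetical stand-ins for a fixed LLM client and a framework's agent loop.

```python
import time
import statistics

def call_llm(prompt: str) -> str:
    """Stand-in for a fixed LLM endpoint, identical across all conditions."""
    raise NotImplementedError  # plug in the actual client here

def run_direct(query: str) -> str:
    # Direct baseline: one prompt, one completion, no orchestration layer.
    return call_llm(query)

def run_framework(query: str, framework_entrypoint) -> str:
    # Framework condition: same LLM, query, and prompts, but routed through
    # the framework's own agent loop (graph traversal, role dispatch, etc.).
    return framework_entrypoint(query)

def benchmark(run, query: str, n: int = 50) -> dict:
    """Report median latency and the implied single-stream throughput."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        run(query)
        latencies.append(time.perf_counter() - start)
    p50 = statistics.median(latencies)
    return {"p50_s": round(p50, 2), "throughput_rps": round(1.0 / p50, 2)}
```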

Memory Architecture Effects

Memory structure matters more than context size. Retrieval enables stable recall, accumulation enables learning but scales poorly, and hybrid designs work best under bounded context.

Memory Type                        | AR Score | TTL Score | LRU Score | Overall
Retrieval-only (LangGraph)         | 33.2     | 11.4      | 30.4      | 23.8
Hybrid (LangGraph, W=512)          | 44.9     | 24.2      | 17.6      | 21.7
Accumulation (OpenAI SDK, W=8192)  | 33.9     | 20.7      | 27.5      | 20.6
Accumulation (OpenAI SDK, W=50)    | 8.4      | 1.2       | 0.0       | 6.1

Key Takeaways

  • Retrieval enables stable recall: Retrieval-only designs dominate on factual recall (AR) and long-range understanding (LRU), achieving 30.4 on LRU vs 27.5 for large accumulation.
  • Hybrid works best for learning: Hybrid retrieval-accumulation achieves highest TTL (28.9) by anchoring learning to retrieved signals rather than raw prompt growth.
  • Accumulation scales poorly: Runtime grows rapidly with the context window because full histories are replayed for each query; retrieval avoids this repeated processing (see the sketch below).
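
The three memory designs compared above differ in what gets replayed per query. The sketch below is illustrative only: it measures the window in turns rather than tokens and uses naive lexical overlap in place of embedding retrieval, and the class names and parameters are assumptions, not the frameworks' APIs.

```python
from collections import deque

class AccumulationMemory:
    """Append-only prompt history; everything retained is replayed per query."""
    def __init__(self, window: int):
        # The paper's W is a token budget; turns are used here for simplicity.
        self.history = deque(maxlen=window)

    def add(self, turn: str) -> None:
        self.history.append(turn)

    def context(self, query: str) -> str:
        # The full retained history goes back into every prompt, so the
        # per-query cost grows with the window size.
        return "\n".join(self.history)

class RetrievalMemory:
    """External store; only the top-k most relevant turns are injected."""
    def __init__(self, k: int = 4):
        self.k = k
        self.store: list[str] = []

    def add(self, turn: str) -> None:
        self.store.append(turn)

    def context(self, query: str) -> str:
        # Naive lexical overlap stands in for embedding search here.
        ranked = sorted(self.store,
                        key=lambda t: len(set(t.split()) & set(query.split())),
                        reverse=True)
        return "\n".join(ranked[: self.k])

class HybridMemory:
    """Bounded recent window plus retrieval over the full archive."""
    def __init__(self, window: int, k: int = 4):
        self.recent = AccumulationMemory(window)
        self.archive = RetrievalMemory(k)

    def add(self, turn: str) -> None:
        self.recent.add(turn)
        self.archive.add(turn)

    def context(self, query: str) -> str:
        return self.archive.context(query) + "\n" + self.recent.context(query)

# Usage: prompt = memory.context(query) + "\n" + query, then call the LLM.
```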

Planning Interface Effects

Schema-constrained planning reduces accuracy and introduces high failure rates. Free-form planning preserves or improves accuracy with minimal overhead.

Planning Interface              | Accuracy Impact | Formatting Failures | Runtime Overhead
No Plan (Direct)                | Baseline        | 0%                  | 1× (baseline)
Schema-constrained (Crew-Plan)  | -30% to -50%    | Up to 84.7%         | 7.4× to 31.3×
Free-form (Direct-LLM-Plan)     | -5% to +15%     | 0%                  | 1.2× to 6.6×

Key Takeaway

Planning outcomes are driven primarily by interface design, not LLM planning ability. Schema-constrained planning introduces large formatting failure rates (up to 84.7%) and high orchestration overhead (up to 31×). Free-form planning preserves accuracy with minimal overhead. Planning should be implemented as a permissive stage that tolerates variability in plan text.
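
The failure mode is easy to see in a toy comparison of the two interfaces. The sketch below is a simplification that assumes a JSON plan schema; it does not reproduce the actual schemas or parsers used by the evaluated frameworks.

```python
import json

def plan_schema_constrained(llm_output: str) -> list[dict]:
    """Strict interface: the plan must parse as a JSON list of steps with
    required keys. Any deviation in the model's formatting is a hard
    failure, which is where large formatting-failure rates come from."""
    plan = json.loads(llm_output)  # raises on any malformed JSON
    if not isinstance(plan, list):
        raise ValueError("plan must be a JSON list of steps")
    for step in plan:
        if not isinstance(step, dict) or not {"step", "tool", "args"} <= step.keys():
            raise ValueError("plan step missing required fields")
    return plan

def plan_free_form(llm_output: str) -> list[str]:
    """Permissive interface: any non-empty line counts as a plan step, so
    variability in phrasing or layout never causes a formatting failure."""
    return [line.strip(" -*\t") for line in llm_output.splitlines() if line.strip()]
```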

Agent Specialization

Specialization is governed by how frameworks inject task-specific reasoning structure, not by role identity alone. Expert-guided conditioning improves multiclass F1 by up to 58 points.

Conditioning Strategy        | Multiclass F1 | Impact
No Role                      | 41.9          | Baseline
Role-based prompting         | 41.9-42.1     | No improvement
Planning-based conditioning  | 42.0          | No improvement
Expert-guided conditioning   | 96.0-100.0    | +54 to +58 points

Key Takeaway

Role labels and generic planning interfaces fail to activate domain knowledge. Explicit procedural instructions impose structured solution workflows that reliably improve performance. Specialization should be implemented through reasoning procedures embedded in the framework, not through role naming or lightweight prompt modifications.
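
As an illustration of the difference, compare a bare role label with an injected solution procedure. The prompt wording below is invented for illustration; it is not the conditioning text used in the experiments.

```python
def role_prompt(task: str) -> str:
    # Role labeling only: asserts an identity but adds no task structure,
    # which is why it leaves multiclass F1 essentially unchanged.
    return f"You are an expert analyst.\n\nTask: {task}"

def expert_guided_prompt(task: str, procedure: list[str]) -> str:
    # Expert-guided conditioning: the framework injects an explicit,
    # domain-specific procedure the model must follow step by step.
    steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(procedure))
    return (
        f"Task: {task}\n\n"
        "Follow this procedure exactly and report the result of each step:\n"
        f"{steps}"
    )
```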

Coordination and Scaling

Coordination performance is governed by the match between task structure and communication geometry. Topology choice becomes critical at scale.

Topology         | Coloring (n=100)     | Consensus (n=100) | Rounds
Scale-Free       | 97%                  | Failed            | 11
Small-World      | 98%                  | Failed            | 15
Fully Connected  | 98%                  | 100%              | 3
Delaunay         | 81%                  | Failed            | 19
Sequential       | Failed (>40 rounds)  | Failed            | >40

Key Takeaways

  • Local coordination: Sparse topologies (scale-free, small-world) achieve high success (97-98%) on local tasks like Coloring, converging in 11-15 rounds.
  • Global agreement: Only fully connected topologies succeed on Consensus (100%), converging in 3 rounds independent of network size. All sparse topologies fail despite higher runtime and token expenditure.
  • Sequential pipelines fail at scale: Sequential topologies exceed 40 rounds and fail entirely on large networks, showing that increasing the interaction budget alone does not improve outcomes (see the sketch below).
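
The topology comparison can be reproduced in outline with standard graph generators and a round-based message-passing loop, as sketched below. The generator parameters, agent_step, and goal_reached are placeholders (e.g., one LLM call per agent per round), not the paper's implementation.

```python
import networkx as nx

def build_topology(kind: str, n: int = 100) -> nx.Graph:
    # Illustrative generators; the exact construction parameters used in
    # the experiments are assumptions here.
    if kind == "scale_free":
        return nx.barabasi_albert_graph(n, m=2)
    if kind == "small_world":
        return nx.watts_strogatz_graph(n, k=4, p=0.1)
    if kind == "fully_connected":
        return nx.complete_graph(n)
    if kind == "sequential":
        return nx.path_graph(n)
    raise ValueError(f"unknown topology: {kind}")

def run_rounds(graph: nx.Graph, agent_step, goal_reached, max_rounds: int = 40):
    """Generic message-passing loop: each round, every agent updates its
    state from its neighbours' current states. Sparse graphs propagate
    information slowly, which is why global agreement (Consensus) needs
    dense connectivity while local constraints (Coloring) do not."""
    state = {node: None for node in graph.nodes}
    for round_no in range(1, max_rounds + 1):
        new_state = {}
        for node in graph.nodes:
            neighbour_states = [state[m] for m in graph.neighbors(node)]
            new_state[node] = agent_step(node, state[node], neighbour_states)
        state = new_state
        if goal_reached(graph, state):
            return round_no, state
    return None, state  # failed to converge within the round budget
```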

Summary

These results demonstrate that performance in multi-agent LLM systems is governed by framework architecture, not model quality alone. Architectural choices create order-of-magnitude differences in latency, accuracy, and coordination success.

Architecture > Model

Framework-level design choices can create 100× latency differences and 30% accuracy drops, even when the underlying LLM is identical.

Design Dimensions Matter

Orchestration, memory, planning, specialization, and coordination each independently drive performance. No single dimension can compensate for poor choices in others.