Note: All numerical values and tables on this page should be verified against the paper PDF for the most accurate and up-to-date results.
Architecture, not models, determines performance. Same LLM, same task, wildly different outcomes.
Framework orchestration alone can create 100× latency differences, even for trivial tasks. We fix the LLM, query, and prompts, varying only the orchestration layer.
| Architecture Type | Latency (p50) | Throughput (req/s) |
|---|---|---|
| Direct LLM | 0.38s | 8.88 |
| Graph-based (LangGraph) | 0.52s (1.4×) | 6.38 |
| Role-based (AutoGen, OpenAgents) | 0.50s (1.3×) | 6.83-7.06 |
| Role-based (CrewAI, Agno, OpenAI SDK) | 0.61-1.17s (1.6-3.1×) | 2.86-4.16 |
| GABM (Concordia) | 44.47s (117×) | 0.089 |
Orchestration architecture alone governs baseline scalability. Graph- and role-based designs introduce modest overhead (1.3-3×), while GABM (generative agent-based modeling) execution incurs orders-of-magnitude higher runtime and output volume even for trivial tasks, driven by execution semantics rather than task complexity.
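Reproducing this kind of comparison needs only a small harness: time the same query through each entry point and report the median. Below is a minimal sketch; `call_direct` and `call_via_framework` are hypothetical stand-ins for a raw model call and its framework-wrapped equivalent, not any framework's actual API.

```python
import time
import statistics
from typing import Callable

def p50_latency(invoke: Callable[[str], str], query: str, runs: int = 30) -> float:
    """Median wall-clock latency over repeated invocations of the same query."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke(query)  # same LLM, same query, same prompt; only orchestration varies
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Hypothetical stand-ins: both must wrap the *same* underlying model call,
# differing only in the orchestration layer around it.
def call_direct(query: str) -> str:
    return "stub"  # replace with a raw API call

def call_via_framework(query: str) -> str:
    return "stub"  # replace with the framework-wrapped equivalent

if __name__ == "__main__":
    for name, fn in [("direct", call_direct), ("framework", call_via_framework)]:
        print(f"{name}: p50 = {p50_latency(fn, 'What is 2 + 2?'):.3f}s")
```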
Memory structure matters more than context size. Retrieval enables stable recall, accumulation enables learning but scales poorly, and hybrid designs work best under bounded context.
| Memory Type | AR Score | TTL Score | LRU Score | Overall |
|---|---|---|---|---|
| Retrieval-only (LangGraph) | 33.2 | 11.4 | 30.4 | 23.8 |
| Hybrid (LangGraph W=512) | 44.9 | 24.2 | 17.6 | 21.7 |
| Accumulation (OpenAI SDK W=8192) | 33.9 | 20.7 | 27.5 | 20.6 |
| Accumulation (OpenAI SDK W=50) | 8.4 | 1.2 | 0.0 | 6.1 |
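The three designs in the table differ in what stays in context. As a simplified illustration, here is a minimal sketch of a hybrid design, assuming W is the bounded accumulation window: recent turns are always in context, evicted turns fall back to a retrieval store. The keyword-overlap scoring is a stand-in for embedding retrieval, not any framework's actual implementation.

```python
from collections import deque

class HybridMemory:
    """Bounded accumulation of recent turns plus retrieval over older ones."""

    def __init__(self, window: int, top_k: int = 3):
        self.recent = deque(maxlen=window)  # accumulation: always in context
        self.archive: list[str] = []        # retrieval store: searched on demand
        self.top_k = top_k

    def add(self, turn: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            self.archive.append(self.recent[0])  # evicted turn stays retrievable
        self.recent.append(turn)

    def context_for(self, query: str) -> list[str]:
        # Crude relevance score: keyword overlap with the query.
        q = set(query.lower().split())
        scored = sorted(self.archive,
                        key=lambda t: len(q & set(t.lower().split())),
                        reverse=True)
        return scored[: self.top_k] + list(self.recent)
```

Pure accumulation is the degenerate case with no retrieval store: a large W recalls everything until the context overflows, while a tiny W forgets almost everything, consistent with the collapse in the last row of the table.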
Schema-constrained planning reduces accuracy and introduces high failure rates. Free-form planning preserves or improves accuracy with minimal overhead.
| Planning Interface | Accuracy Impact | Formatting Failures | Runtime Overhead |
|---|---|---|---|
| No Plan (Direct) | Baseline | 0% | 1× |
| Schema-constrained (Crew-Plan) | -30% to -50% | Up to 84.7% | 7.4× to 31.3× |
| Free-form (Direct-LLM-Plan) | +15% to -5% | 0% | 1.2× to 6.6× |
Planning outcomes are driven primarily by interface design, not LLM planning ability. Schema-constrained planning introduces large formatting failure rates (up to 84.7%) and high orchestration overhead (up to 31×). Free-form planning preserves accuracy with minimal overhead. Planning should be implemented as a permissive stage that tolerates variability in plan text.
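The interface difference is easy to see in code. A hedged sketch, assuming the planner's raw text arrives as a string: the schema-constrained path hard-fails on any malformed output, while the permissive path accepts nearly anything line-shaped.

```python
import json

def schema_constrained_plan(raw: str) -> list[str]:
    """Strict interface: the plan must be valid JSON with a fixed shape.
    Any deviation in the model's output is a hard formatting failure."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict) or "steps" not in data:
        raise ValueError("plan missing required 'steps' field")
    return [str(step) for step in data["steps"]]

def free_form_plan(raw: str) -> list[str]:
    """Permissive interface: treat any non-empty line as a step, stripping
    common list markers. Crude, but tolerant of variability in plan text."""
    steps = []
    for line in raw.splitlines():
        line = line.strip().lstrip("-*0123456789. )")
        if line:
            steps.append(line)
    return steps

raw = "1. Retrieve the document\n2) Extract the totals\n- Sum and report"
print(free_form_plan(raw))
# ['Retrieve the document', 'Extract the totals', 'Sum and report']
```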
Specialization is governed by how frameworks inject task-specific reasoning structure, not role identity alone. Expert-guided conditioning improves F1 by up to 58 points.
| Conditioning Strategy | Multiclass F1 | Impact |
|---|---|---|
| No Role | 41.9 | Baseline |
| Role-based prompting | 41.9-42.1 | No improvement |
| Planning-based conditioning | 42.0 | No improvement |
| Expert-guided conditioning | 96.0-100.0 | +54 to +58 points |
Role labels and generic planning interfaces fail to activate domain knowledge. Explicit procedural instructions impose structured solution workflows that reliably improve performance. Specialization should be implemented through reasoning procedures embedded in the framework, not through role naming or lightweight prompt modifications.
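The distinction is purely in prompt construction. A minimal sketch, with a hypothetical labeling procedure as the example workflow (the role text and procedure steps below are illustrative, not taken from the paper):

```python
def role_prompt(task: str) -> str:
    # Role label only: names an identity, adds no reasoning structure.
    return f"You are an expert computational linguist.\n\n{task}"

def expert_guided_prompt(task: str, procedure: list[str]) -> str:
    # Expert-guided conditioning: embed an explicit, ordered solution workflow.
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(procedure, 1))
    return (
        f"{task}\n\n"
        "Solve this by following the procedure below exactly, "
        "reporting the result of each step before the final answer:\n"
        f"{steps}"
    )

# Hypothetical procedure for a multiclass labeling task:
PROCEDURE = [
    "List the candidate classes and their defining features.",
    "Quote the input spans that support or contradict each class.",
    "Eliminate classes with contradicting evidence.",
    "Output the single remaining class.",
]
```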
Coordination performance is governed by the match between task structure and communication geometry. Topology choice becomes critical at scale.
| Topology | Coloring (n=100) | Consensus (n=100) | Rounds |
|---|---|---|---|
| Scale-Free | 97% | Failed | 11 |
| Small-World | 98% | Failed | 15 |
| Fully Connected | 98% | 100% | 3 |
| Delaunay | 81% | Failed | 19 |
| Sequential | Failed (>40 rounds) | Failed | >40 |
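Why geometry dominates at scale can be seen with a toy information-diffusion model (an illustration only, not the paper's coloring or consensus protocol): count synchronous rounds until one agent's message reaches every other agent.

```python
def fully_connected(n: int) -> dict[int, list[int]]:
    # Every agent can message every other agent directly.
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def sequential_chain(n: int) -> dict[int, list[int]]:
    # Each agent talks only to its immediate predecessor and successor.
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def diffusion_rounds(adj: dict[int, list[int]]) -> int:
    """Synchronous rounds until a message from agent 0 reaches all agents."""
    informed = {0}
    rounds = 0
    while len(informed) < len(adj):
        informed |= {nbr for node in informed for nbr in adj[node]}
        rounds += 1
    return rounds

print(diffusion_rounds(fully_connected(100)))   # 1 round: every pair talks directly
print(diffusion_rounds(sequential_chain(100)))  # 99 rounds: information crawls the chain
```

Global-agreement tasks like consensus need short paths between all pairs of agents, which only dense topologies provide; local tasks like coloring need only neighbor information, so sparse topologies can still succeed.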
These results demonstrate that performance in multi-agent LLM systems is governed by framework architecture, not model quality alone. Architectural choices create order-of-magnitude differences in latency, accuracy, and coordination success.
Framework-level design choices can create 100× latency differences and 30% accuracy drops, even when the underlying LLM is identical.
Orchestration, memory, planning, specialization, and coordination each independently drive performance. No single dimension can compensate for poor choices in others.