MAFBench integrates established benchmarks to evaluate multi-agent frameworks across key architectural dimensions. Metrics follow each benchmark's original definition (Acc. = accuracy, F1 = F1-score, R@5 = recall@5).
| Architectural Dimension | Benchmark | Metrics |
|---|---|---|
| Orchestration overhead | Trivial query | Latency, throughput, tokens |
| Memory architecture | MemoryAgentBench | Acc./F1/R@5 |
| Planning interface | GSM8K, CSQA, MATH | Accuracy, failures, runtime |
| Specialization conditioning | CatDB tasks | Precision, recall, F1 |
| Coordination topology | AGENTSNET (Code) | Success, rounds, tokens, time |
MAFBench provides a standardized agent interface for session-level execution, centralizing configuration of model parameters, session limits, batching, and scoring. This ensures identical conditions across frameworks while isolating architectural effects.
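As a rough illustration, a session-level adapter interface could look like the sketch below. The names (`SessionConfig`, `FrameworkAdapter`) and fields are hypothetical, not MAFBench's actual API; they only show how centralized configuration keeps conditions identical across frameworks.

```python
from dataclasses import dataclass


@dataclass
class SessionConfig:
    """Centralized run configuration shared by every framework adapter.
    Field names are illustrative; the real MAFBench schema may differ."""
    model: str = "gpt-4o-mini"
    temperature: float = 0.0
    max_turns: int = 10          # session limit
    batch_size: int = 8          # batching for dataset-scale runs
    token_budget: int = 100_000  # hard cap for cost control


class FrameworkAdapter:
    """Base class each framework under test implements, so every framework
    runs under the same centrally configured conditions."""

    def __init__(self, config: SessionConfig):
        self.config = config

    def run_session(self, task: str) -> dict:
        """Execute one task and return a result record for scoring.
        Subclasses wrap the framework-specific agent loop here."""
        raise NotImplementedError
```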
We replace string-based metrics with LLM-based semantic evaluation to handle diverse answer formats, enabling fair comparison across different framework output styles.
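A minimal sketch of such an LLM judge, assuming an OpenAI-compatible client; the prompt and the `semantic_match` helper are illustrative, not the exact evaluator MAFBench ships.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def semantic_match(question: str, reference: str, candidate: str,
                   model: str = "gpt-4o-mini") -> bool:
    """Judge semantic equivalence instead of exact string equality, so
    frameworks with different output styles are scored fairly."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return verdict.startswith("CORRECT")
```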
For large-scale long-context evaluation, MAFBench introduces transparent backend routing that redirects compatible API calls to alternative providers (e.g., Groq-hosted models), enabling lower-cost evaluation without modifying framework implementations.
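Because providers such as Groq expose OpenAI-compatible endpoints, this kind of routing can often be implemented by swapping the client's `base_url` and key. The sketch below assumes that pattern; the `ROUTES` table and `routed_client` helper are hypothetical, not MAFBench's internals.

```python
import os
from openai import OpenAI

# Maps a logical model name to an alternative provider. Groq's
# OpenAI-compatible endpoint is shown; the mapping itself is illustrative.
ROUTES = {
    "llama-3.1-8b": {
        "base_url": "https://api.groq.com/openai/v1",
        "api_key_env": "GROQ_API_KEY",
        "model": "llama-3.1-8b-instant",
    },
}


def routed_client(model: str) -> tuple[OpenAI, str]:
    """Return a client and provider-side model name for the requested model,
    falling back to the default provider when no route is configured."""
    route = ROUTES.get(model)
    if route is None:
        return OpenAI(), model
    client = OpenAI(base_url=route["base_url"],
                    api_key=os.environ[route["api_key_env"]])
    return client, route["model"]
```

Since the routing happens behind the client, framework implementations stay untouched; only the benchmark harness decides where calls land.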
Results are logged and aggregated using a shared schema that captures accuracy, runtime, and token usage. Model selection, planning modes, and run budgets are centrally configured to ensure reproducibility and cost control.
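One simple realization of such a shared schema is flat JSONL logging, sketched below; the `RunRecord` fields are illustrative, not MAFBench's exact schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RunRecord:
    """One row of the shared logging schema; field names are illustrative."""
    framework: str
    benchmark: str
    task_id: str
    correct: bool
    runtime_s: float
    prompt_tokens: int
    completion_tokens: int


def log_record(record: RunRecord, path: str = "results.jsonl") -> None:
    """Append one record as a JSON line, so results from all frameworks
    aggregate into a single flat file."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```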
For coordination evaluation, MAFBench implements a topology rewriting engine that transforms base communication graphs into sequential, hierarchical, and centralized structures while preserving agent sets, enabling systematic analysis of how interaction topology affects coordination dynamics.
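A toy version of that rewriting step over a fixed agent set is sketched below; the function and the three concrete structures are a simplified stand-in for MAFBench's engine, kept only to show that the agent set is preserved while edges change.

```python
def rewrite_topology(agents: list[str], topology: str) -> set[tuple[str, str]]:
    """Rewrite the communication graph over a fixed agent set.
    Returns directed edges (sender, receiver); names are illustrative."""
    if topology == "sequential":      # chain: a0 -> a1 -> ... -> an
        return {(a, b) for a, b in zip(agents, agents[1:])}
    if topology == "centralized":     # star: a hub exchanges with everyone
        hub, rest = agents[0], agents[1:]
        return {(hub, a) for a in rest} | {(a, hub) for a in rest}
    if topology == "hierarchical":    # binary tree rooted at agents[0]
        return {(agents[(i - 1) // 2], agents[i])
                for i in range(1, len(agents))}
    raise ValueError(f"unknown topology: {topology}")


# Example: the same five agents rewritten into each structure.
agents = [f"agent{i}" for i in range(5)]
for t in ("sequential", "hierarchical", "centralized"):
    print(t, sorted(rewrite_topology(agents, t)))
```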
By fixing the underlying LLM, prompts, and task structure, MAFBench attributes performance differences to framework architecture rather than model quality, revealing the true architectural impact on memory, planning, specialization, and coordination.
MAFBench is open source and ready to use. Evaluate your framework or contribute to the benchmark.