MAFBench

A unified benchmark for evaluating multi-agent LLM frameworks

Note: All numerical values and statistics on this page should be verified against the paper PDF for the most accurate and up-to-date results.

Benchmarks

MAFBench integrates established benchmarks to evaluate multi-agent frameworks across key architectural dimensions. Metrics follow the original benchmark definitions (Acc. = Accuracy, F1 = F1-score, R@5 = Recall@5).

| Architectural Dimension | Benchmark | Metrics |
| --- | --- | --- |
| Orchestration overhead | Trivial query | Latency, throughput, tokens |
| Memory architecture | MemoryAgentBench | Acc./F1/R@5 |
| Planning interface | GSM8K, CSQA, MATH | Accuracy, failures, runtime |
| Specialization conditioning | CatDB tasks | Precision, recall, F1 |
| Coordination topology | AGENTSNET (Code) | Success, rounds, tokens, time |

MAFBench Contributions

Unified Execution Pipeline

MAFBench provides a standardized agent interface for session-level execution, centralizing configuration of model parameters, session limits, batching, and scoring. This ensures identical conditions across frameworks while isolating architectural effects.
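A minimal sketch of what such a session-level interface and shared configuration could look like; the class, field, and function names below are illustrative, not MAFBench's actual API:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class RunConfig:
    """Central configuration applied identically to every framework (illustrative fields)."""
    model: str = "gpt-4o-mini"   # fixed LLM backbone for all frameworks
    temperature: float = 0.0     # deterministic decoding for comparability
    max_sessions: int = 100      # session/run budget
    batch_size: int = 8          # number of tasks dispatched per batch


class FrameworkAdapter(Protocol):
    """Each framework under test is wrapped behind the same session-level interface."""

    def run_session(self, task: str, config: RunConfig) -> str:
        """Execute one task end to end and return the framework's final answer."""
        ...


def evaluate(adapter: FrameworkAdapter, tasks: list[str], config: RunConfig) -> list[str]:
    """Run every task through a framework under identical conditions."""
    return [adapter.run_session(task, config) for task in tasks[: config.max_sessions]]
```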

Semantic Evaluation

We replace string-based metrics with LLM-based semantic evaluation to handle diverse answer formats, enabling fair comparison across different framework output styles.
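A sketch of such an LLM-based judge, assuming an OpenAI-compatible client; the prompt wording, model name, and function name are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any OpenAI-compatible endpoint works

JUDGE_PROMPT = (
    "You are grading an answer. Reply with exactly one word: CORRECT or INCORRECT.\n"
    "Question: {question}\nReference answer: {reference}\nCandidate answer: {candidate}"
)


def semantic_match(question: str, reference: str, candidate: str,
                   model: str = "gpt-4o-mini") -> bool:
    """Ask a judge model whether the candidate matches the reference semantically,
    regardless of formatting differences between framework outputs."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return response.choices[0].message.content.strip().upper().startswith("CORRECT")
```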

Transparent Backend Routing

For large-scale long-context evaluation, MAFBench introduces transparent backend routing that redirects compatible API calls to alternative providers (e.g., Groq-hosted models), enabling lower-cost evaluation without modifying framework implementations.
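A sketch of how such routing could work with an OpenAI-compatible client: models that a cheaper provider can serve are pointed at Groq's endpoint by swapping the base URL, while the framework code keeps calling the client it is handed. The model list and environment variable names here are assumptions for illustration:

```python
import os
from openai import OpenAI

# Models assumed to be servable by the alternative provider (illustrative mapping).
GROQ_COMPATIBLE = {"llama-3.1-8b-instant", "llama-3.3-70b-versatile"}


def make_client(model: str) -> OpenAI:
    """Return a client pointed at the backend that can serve this model most cheaply.

    The framework under evaluation is unchanged: it simply uses the client it
    receives, so the routing stays transparent to framework implementations.
    """
    if model in GROQ_COMPATIBLE:
        return OpenAI(
            base_url="https://api.groq.com/openai/v1",
            api_key=os.environ["GROQ_API_KEY"],
        )
    return OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```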

Standardized Logging and Reproducibility

Results are logged and aggregated using a shared schema that captures accuracy, runtime, and token usage. Model selection, planning modes, and run budgets are centrally configured to ensure reproducibility and cost control.
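A sketch of a shared result record and JSONL logger in this spirit; the field names, file layout, and example values are illustrative, not MAFBench's actual schema:

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class RunRecord:
    """One row of the shared results schema (illustrative fields)."""
    framework: str
    benchmark: str
    task_id: str
    correct: bool
    runtime_s: float
    prompt_tokens: int
    completion_tokens: int
    model: str
    timestamp: float


def log_record(record: RunRecord, path: str = "results.jsonl") -> None:
    """Append one result as a JSON line so runs from all frameworks aggregate cleanly."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


# Example usage with made-up values, purely for illustration:
log_record(RunRecord(
    framework="example-framework", benchmark="GSM8K", task_id="gsm8k-0042",
    correct=True, runtime_s=3.7, prompt_tokens=512, completion_tokens=88,
    model="gpt-4o-mini", timestamp=time.time(),
))
```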

Topology Rewriting Engine

For coordination evaluation, MAFBench implements a topology rewriting engine that transforms base communication graphs into sequential, hierarchical, and centralized structures while preserving agent sets, enabling systematic analysis of how interaction topology affects coordination dynamics.
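A sketch of what such rewriting amounts to: the same agent set is reshaped into chain (sequential), star (centralized), and tree (hierarchical) communication graphs. Agent names and function signatures are illustrative:

```python
def sequential(agents: list[str]) -> set[tuple[str, str]]:
    """Chain topology: each agent talks only to its successor."""
    return set(zip(agents, agents[1:]))


def centralized(agents: list[str]) -> set[tuple[str, str]]:
    """Star topology: a single hub relays messages for all other agents."""
    hub, spokes = agents[0], agents[1:]
    return {(hub, a) for a in spokes}


def hierarchical(agents: list[str], branching: int = 2) -> set[tuple[str, str]]:
    """Tree topology: agent i reports to agent (i - 1) // branching."""
    return {(agents[(i - 1) // branching], agents[i]) for i in range(1, len(agents))}


# The same five agents, reshaped into three different communication structures.
agents = ["planner", "researcher", "coder", "reviewer", "tester"]
print(sequential(agents))    # chain
print(centralized(agents))   # star with "planner" as the hub
print(hierarchical(agents))  # binary tree rooted at "planner"
```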

Isolated Architectural Impact

By fixing the underlying LLM, prompts, and task structure, MAFBench attributes performance differences to framework architecture rather than model quality, revealing the true architectural impact on memory, planning, specialization, and coordination.

Ready to Run Your Own Tests?

MAFBench is open source and ready to use. Evaluate your framework or contribute to the benchmark.