PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts Paper • 2508.09848 • Published Aug 13 • 67
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds Paper • 2508.12782 • Published Aug 18 • 25
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers Paper • 2508.14704 • Published Aug 20 • 43
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction Paper • 2508.11987 • Published Aug 16 • 71