- Towards General-Purpose Model-Free Reinforcement Learning
  Paper • 2501.16142 • Published • 30
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
  Paper • 2503.14476 • Published • 142
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
  Paper • 2504.13837 • Published • 138
- Learning to Reason under Off-Policy Guidance
  Paper • 2504.14945 • Published • 88
Collections
Collections including paper arxiv:2507.19849
- MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
  Paper • 2507.21183 • Published • 14
- MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
  Paper • 2507.21802 • Published • 17
- EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity
  Paper • 2507.21848 • Published • 8
- Agentic Reinforced Policy Optimization
  Paper • 2507.19849 • Published • 158
- Agentic Reinforced Policy Optimization
  Paper • 2507.19849 • Published • 158
- Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance
  Paper • 2507.22448 • Published • 66
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  Paper • 2508.18265 • Published • 208
- R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
  Paper • 2508.21113 • Published • 110
- GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
  Paper • 2507.19457 • Published • 28
- Agentic Reinforced Policy Optimization
  Paper • 2507.19849 • Published • 158
- Group Sequence Policy Optimization
  Paper • 2507.18071 • Published • 314
- Cache-to-Cache: Direct Semantic Communication Between Large Language Models
  Paper • 2510.03215 • Published • 97