E2HiL: Entropy-Guided Sample Selection for Efficient Real-World Human-in-the-Loop Reinforcement Learning


Abstract

Human-in-the-loop guidance has emerged as an effective way to accelerate the convergence of online reinforcement learning (RL) on complex real-world manipulation tasks. However, existing human-in-the-loop RL (HiL-RL) frameworks often suffer from low sample efficiency: they require substantial human intervention to converge, which leads to high labor costs. To address this, we propose E2HiL, a sample-efficient real-world HiL-RL framework that requires fewer human interventions by actively selecting informative samples. Specifically, a stable reduction of policy entropy enables a better trade-off between exploration and exploitation and thus higher sample efficiency. We first build influence functions that measure the effect of individual samples on the policy entropy, and estimate them efficiently via the covariance between the policy's action probabilities and its soft advantages. We then select samples with moderate influence values, pruning both shortcut samples that induce sharp entropy drops and noisy samples whose effect is negligible. Extensive experiments on four real-world manipulation tasks show that E2HiL achieves a 42.1% higher success rate while requiring 10.1% fewer human interventions than the state-of-the-art HiL-RL method, validating its effectiveness.
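To make the covariance estimate concrete, the block below states a known first-order result for tabular softmax policies updated by natural policy gradient with step size $\eta$, together with one natural way to attribute that covariance to individual samples. The natural-gradient setting, the SAC-style soft advantage with temperature $\alpha$, and the per-sample term $I_i$ are assumptions for illustration, not necessarily the exact estimator used in E2HiL.

```latex
% Assumed: tabular softmax policy, natural policy gradient, step size \eta
\mathcal{H}(\pi_{k+1} \mid s) - \mathcal{H}(\pi_k \mid s)
  \approx -\eta \, \operatorname{Cov}_{a \sim \pi_k(\cdot \mid s)}
  \big( \log \pi_k(a \mid s),\; A^{\mathrm{soft}}_k(s, a) \big)

% Assumed SAC-style soft advantage with temperature \alpha
A^{\mathrm{soft}}_k(s, a) = Q^{\mathrm{soft}}_k(s, a) - V^{\mathrm{soft}}_k(s),
\qquad
V^{\mathrm{soft}}_k(s) = \mathbb{E}_{a \sim \pi_k}\big[ Q^{\mathrm{soft}}_k(s, a) - \alpha \log \pi_k(a \mid s) \big]

% Hypothetical per-sample influence over a batch of N samples (bars = batch means)
I_i = -\big( \log \pi_k(a_i \mid s_i) - \overline{\log \pi_k} \big)
       \big( A^{\mathrm{soft}}_k(s_i, a_i) - \overline{A^{\mathrm{soft}}_k} \big),
\qquad
\frac{1}{N} \sum_{i=1}^{N} I_i = -\widehat{\operatorname{Cov}}\big( \log \pi_k,\; A^{\mathrm{soft}}_k \big)
```

Under vanilla (non-natural) policy gradient, the second argument of the covariance becomes $\pi_k(a \mid s)\,A^{\mathrm{soft}}_k(s, a)$, which is where the action probabilities enter. Read this way, a large negative $I_i$ flags a sample that pushes entropy down sharply (a shortcut candidate), while $I_i \approx 0$ flags a sample with negligible effect (a noisy candidate).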

Method

Entropy-Guided Sample Selection Framework. Our framework consists of two key components: Sample Entropy Dynamics Estimation, which characterizes the entropy dynamics induced by each sample, and Entropy-Bounded Sample Selection, which prunes shortcut and noisy samples by constraining their influence values within dynamic bounds.
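The two components can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the per-sample influence is assumed to be the negated centered product of log-probability and soft advantage (whose batch mean equals the negated covariance), and the "dynamic bounds" are assumed to be quantiles of |influence| recomputed on every batch. The names `entropy_influence`, `entropy_bounded_selection`, `drop_quantile` and `noise_quantile` are hypothetical.

```python
import numpy as np


def entropy_influence(log_probs: np.ndarray, soft_advantages: np.ndarray) -> np.ndarray:
    """Hypothetical per-sample entropy influence.

    Negated centered product of log pi(a|s) and the soft advantage; its batch
    mean equals minus the (biased) covariance of the two quantities.
    Negative values suggest the sample pushes policy entropy down.
    """
    lp = log_probs - log_probs.mean()
    adv = soft_advantages - soft_advantages.mean()
    return -(lp * adv)


def entropy_bounded_selection(
    log_probs: np.ndarray,
    soft_advantages: np.ndarray,
    drop_quantile: float = 0.9,   # assumed upper bound on |influence|
    noise_quantile: float = 0.2,  # assumed lower bound on |influence|
) -> np.ndarray:
    """Return a boolean mask of samples to keep (assumed quantile bounds).

    Both bounds are recomputed from the current batch, so they move as the
    distribution of influence values changes during training.
    """
    infl = entropy_influence(log_probs, soft_advantages)
    magnitude = np.abs(infl)
    upper = np.quantile(magnitude, drop_quantile)
    lower = np.quantile(magnitude, noise_quantile)
    # Shortcut samples: large influence that lowers entropy sharply.
    shortcut = (infl < 0) & (magnitude > upper)
    # Noisy samples: influence too small to matter.
    noisy = magnitude < lower
    return ~(shortcut | noisy)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    log_probs = rng.normal(-1.0, 0.5, size=256)
    soft_adv = rng.normal(0.0, 1.0, size=256)
    keep = entropy_bounded_selection(log_probs, soft_adv)
    print(f"kept {keep.sum()} of {keep.size} samples")
```

In an off-policy HiL-RL loop, one plausible use (again an assumption) is to apply the mask to each sampled replay batch before the actor update; a quantile rule is chosen here only because it adapts to the batch without extra hyperparameters for absolute scale.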


Real-world Online RL Tasks

The real-world experiments are performed on a tabletop setup in which object locations are randomized every episode. E2HiL learns four complex real-world manipulation tasks with an average training time of only about one hour.

Touch Cube

Pick & Place Cube

Pick up Cube

Stack Blocks
