Offline reinforcement learning (RL) aims to learn a policy from a fixed dataset without further interaction with the environment. However, offline datasets are often noisy, consisting of large quantities of sub-optimal or task-agnostic trajectories, which makes it challenging for offline RL to learn an optimal policy. To address this, we use reward machines (RMs) to encode human knowledge about the task and refine datasets for offline RL. Specifically, we define the event-ordered RM and use it to label offline datasets with RM states. We then use the labeled datasets to generate refined datasets, which are smaller but better suited for offline RL. The RM allows us to decompose a long-horizon task into easier sub-tasks, inform the agent of its current stage of task completion, and guide the offline learning process. In addition, we generate counterfactual experiences with the RM to guide the agent in completing each sub-task. Experimental results on the D4RL benchmark confirm that our method achieves better performance on long-horizon manipulation tasks with sub-optimal datasets.
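The abstract only outlines the pipeline, so here is a minimal Python sketch of what the RM-labeling and dataset-refinement steps could look like. Everything below is an illustrative assumption rather than the paper's implementation: the `RewardMachine` class, the event names ("picked", "placed"), the `labeling_fn` interface, and the refinement criterion are all hypothetical.

```python
# Minimal sketch (not the authors' code): a reward machine as a finite-state
# machine over high-level events, used to label an offline trajectory with RM
# states and to filter trajectories that make task progress.
from dataclasses import dataclass, field

@dataclass
class RewardMachine:
    """Finite-state machine whose transitions fire on high-level events."""
    initial_state: int
    transitions: dict = field(default_factory=dict)  # (rm_state, event) -> next rm_state
    rewards: dict = field(default_factory=dict)      # (rm_state, event) -> reward

    def step(self, rm_state, event):
        next_state = self.transitions.get((rm_state, event), rm_state)
        reward = self.rewards.get((rm_state, event), 0.0)
        return next_state, reward

def label_trajectory(rm, trajectory, labeling_fn):
    """Annotate each transition of an offline trajectory with the RM state,
    tracking progress through the sub-tasks encoded by the reward machine."""
    rm_state = rm.initial_state
    labeled = []
    for obs, action, next_obs in trajectory:
        event = labeling_fn(obs, action, next_obs)  # maps a raw transition to an event (or None)
        next_rm_state, rm_reward = rm.step(rm_state, event)
        labeled.append((obs, action, next_obs, rm_state, next_rm_state, rm_reward))
        rm_state = next_rm_state
    return labeled

def refine_dataset(rm, dataset, labeling_fn):
    """One possible refinement criterion (an assumption, not the paper's):
    keep only trajectories that advance the RM past its initial state,
    yielding a smaller dataset aligned with the task."""
    refined = []
    for traj in dataset:
        labeled = label_trajectory(rm, traj, labeling_fn)
        if labeled and labeled[-1][4] != rm.initial_state:
            refined.append(labeled)
    return refined

# Example RM: a two-stage manipulation task "pick the object, then place it".
rm = RewardMachine(
    initial_state=0,
    transitions={(0, "picked"): 1, (1, "placed"): 2},
    rewards={(1, "placed"): 1.0},
)
```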
@inproceedings{SWaamas23,
address = {London, UK},
author = {Haoyuan Sun and Feng Wu},
booktitle = {Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems (AAMAS)},
month = {May},
pages = {1239--1247},
title = {Less Is More: Refining Datasets for Offline Reinforcement Learning with Reward Machines},
url = {https://dl.acm.org/doi/10.5555/3545946.3598769},
year = {2023}
}