Learning to Coordinate from Offline Datasets with Uncoordinated Behavior Policies


In offline multi-agent reinforcement learning (RL), multiple agents must learn to coordinate from previously collected datasets. As in the single-agent case, we must handle the distribution shift between the learned policies and the datasets. More importantly, we also need to deal with possible miscoordination in datasets collected by uncoordinated behavior policies. To address this, we propose a novel offline multi-agent RL method based on counterfactual sample-average approximation with subteam masking. Specifically, we compute the best-response policy for each agent using sample-average approximation. To handle the miscoordination issue, we use a counterfactual mechanism together with subteam masking to reason about each agent's contribution to the team. Based on this, each agent learns to coordinate from the uncoordinated datasets. Empirically, we evaluate our method in two benchmark domains: the continuous multi-agent MuJoCo control domain and the challenging cooperative StarCraft II domain. Our experimental results confirm that our approach achieves significantly better performance than several state-of-the-art methods. The source code is available at: https://github.com/JinmingM/CAST-BCQ.
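
The Python sketch below is only a minimal illustration of the two ideas named in the abstract: a sample-average approximation of each agent's best response, and a counterfactual contribution computed over a masked subteam. It is not the authors' CAST-BCQ implementation; the joint Q-function q_fn, the candidate-action generator candidates, and all parameter names are hypothetical placeholders assumed for the example.

import numpy as np

rng = np.random.default_rng(0)

def sample_average_best_response(q_fn, candidates, dataset_actions, state,
                                 agent, n_samples=16):
    # Pick the agent's action by averaging Q over joint actions sampled from
    # the (possibly uncoordinated) dataset: a sample-average approximation.
    best_a, best_v = None, -np.inf
    for a in candidates(agent, state):
        vals = []
        for _ in range(n_samples):
            joint = dataset_actions[rng.integers(len(dataset_actions))].copy()
            joint[agent] = a                      # fix this agent's action
            vals.append(q_fn(state, joint))
        v = float(np.mean(vals))                  # sample-average value estimate
        if v > best_v:
            best_a, best_v = a, v
    return best_a, best_v

def counterfactual_contribution(q_fn, dataset_actions, state, joint_action,
                                agent, subteam_mask, n_samples=16):
    # Counterfactual advantage of `agent` within a masked subteam: Q of the
    # observed joint action minus the average Q when the agent's action is
    # replaced by actions drawn from the dataset (its behavior policy).
    if not subteam_mask[agent]:
        return 0.0                                # agent masked out of the subteam
    baseline = []
    for _ in range(n_samples):
        cf = joint_action.copy()
        cf[agent] = dataset_actions[rng.integers(len(dataset_actions))][agent]
        baseline.append(q_fn(state, cf))
    return q_fn(state, joint_action) - float(np.mean(baseline))

# Toy usage with a hypothetical 2-agent quadratic Q-function.
if __name__ == "__main__":
    q_fn = lambda s, a: -float(np.sum((np.asarray(a) - s) ** 2))
    candidates = lambda i, s: [0.0, 0.5, 1.0]
    data = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([0.5, 0.5])]
    a, v = sample_average_best_response(q_fn, candidates, data, state=0.5, agent=0)
    adv = counterfactual_contribution(q_fn, data, 0.5, np.array([a, 0.5]),
                                      agent=0, subteam_mask=[True, True])
    print("best action:", a, "value:", v, "counterfactual advantage:", adv)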

Jinming Ma, Feng Wu. Learning to Coordinate from Offline Datasets with Uncoordinated Behavior Policies. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1258-1266, London, UK, May 2023.
@inproceedings{MWaamas23,
 address = {London, UK},
 author = {Jinming Ma and Feng Wu},
 booktitle = {Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems (AAMAS)},
 month = {May},
 pages = {1258-1266},
 title = {Learning to Coordinate from Offline Datasets with Uncoordinated Behavior Policies},
 url = {https://dl.acm.org/doi/abs/10.5555/3545946.3598771},
 year = {2023}
}