LLMs as Efficient Reward Function Searchers for Custom-Environment MORL

1Massachusetts Institute of Technology, 2University of Oxford, 3New Jersey Institute of Technology

ERFSL Architecture & Example Prompts

[Figure: ERFSL architecture and example prompts.]

Introduction

Our approach is based on these observations:

  • Designing and balancing reward components for multi-objective reinforcement learning (MORL) is hard.
  • It is hard for LLMs to design intricate reward functions through trial-and-error exploration, or to rectify incorrect reward functions from implicit feedback.
  • LLMs, especially smaller ones, suffer a severe decline in comprehension over long contexts.
  • LLMs are adept at summarizing and heuristically generating code given specific, clear task contexts, yet they are less proficient at handling numerical contexts.

We propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to act as effective white-box searchers and highlights their strong semantic understanding capabilities.
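As a rough illustration of this decoupled workflow, the sketch below generates the reward components once from the task context and then searches only over the weights, feeding training metrics back to the LLM as concise text. All function names, prompt wording, and the feedback format here are hypothetical placeholders, not the exact pipeline used in the paper.

# Hypothetical sketch of a decoupled ERFSL-style workflow: reward components
# are generated once from the task context, then only the weights are searched
# using short textual feedback. Names and prompts are placeholders.
import json
from typing import Callable, Dict

def erfsl_search(llm: Callable[[str], str],
                 task_description: str,
                 env_code: str,
                 train_and_evaluate: Callable[[str, Dict[str, float]], Dict],
                 max_iters: int = 10) -> Dict[str, float]:
    # 1) Generate one reward component per explicit user requirement.
    components_code = llm(
        f"Task requirements:\n{task_description}\n"
        f"Env class:\n{env_code}\n"
        "Write one reward component function per requirement."
    )

    # 2) Ask for initial weight groups (returned as JSON) so the first
    #    training batch already covers different trade-offs.
    weights: Dict[str, float] = json.loads(
        llm("Propose initial weights for the components above as a JSON object.")
    )

    # 3) Iterate: convert numerical training results into short text feedback
    #    and let the LLM adjust only the weights, not the component code.
    for _ in range(max_iters):
        metrics = train_and_evaluate(components_code, weights)
        if metrics.get("meets_requirements", False):
            break
        weights = json.loads(llm(
            f"Current weights: {json.dumps(weights)}\n"
            f"Training metrics: {json.dumps(metrics)}\n"
            "Return adjusted weights as a JSON object."
        ))
    return weights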

Experimental Results & Example Case Studies


The reward critic effectively detects various errors and rewrites reward components based on the description and variables of the Env class. For each component, only one feedback iteration is needed to obtain a correct result.
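For intuition, here is a minimal sketch of how such a critic prompt could be assembled from the Env class description, its variables, and the observed error. The prompt wording, field names, and example environment values are illustrative assumptions, not the paper's actual prompts.

# Sketch of assembling a reward-critic prompt from the Env class context and
# an observed error (all wording and example values are illustrative).
def build_critic_prompt(env_description: str,
                        env_variables: dict,
                        component_name: str,
                        component_code: str,
                        error_report: str) -> str:
    """Combine the Env description, its variables, the failing component, and its error."""
    variable_lines = "\n".join(f"- {name}: {meaning}" for name, meaning in env_variables.items())
    return (
        f"Environment description:\n{env_description}\n\n"
        f"Available Env variables:\n{variable_lines}\n\n"
        f"Reward component `{component_name}`:\n{component_code}\n\n"
        f"Observed problem:\n{error_report}\n\n"
        "Rewrite this reward component so that it uses only the variables above "
        "and fixes the problem. Return only the corrected function."
    )

# Example usage with hypothetical values:
prompt = build_critic_prompt(
    env_description="Hypothetical multi-agent data-collection environment.",
    env_variables={
        "self.energy_used": "energy consumed in the current step (J)",
        "self.sinr": "signal-to-interference-plus-noise ratio of the uplink",
    },
    component_name="reward_energy",
    component_code="def reward_energy(self):\n    return -self.enery_used  # typo",
    error_report="AttributeError: 'Env' object has no attribute 'enery_used'",
)
print(prompt)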




The reward critic can rectify elusive errors in reward components, but EUREKA-S fails to do so.




Two groups of weights generated by the reward weight initializer already achieve Pareto solutions, so no further search is required, or at most a more refined search is needed.
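To make the role of a weight group concrete, the following linear-scalarization sketch shows how one group of weights could combine the per-objective reward components during training. The component names and numerical values are illustrative, not taken from the paper.

# Sketch: applying a weight group from the reward weight initializer to the
# per-objective reward components via linear scalarization.
def scalarized_reward(components: dict, weights: dict) -> float:
    """Weighted sum of the per-objective reward components."""
    return sum(weights[name] * value for name, value in components.items())

# Two hypothetical weight groups covering different service/energy trade-offs.
weight_groups = [
    {"service": 1.0, "ec": 0.1},
    {"service": 1.0, "ec": 1.0},
]

step_components = {"service": 0.8, "ec": -0.3}   # example per-step component values
print([scalarized_reward(step_components, w) for w in weight_groups])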




In the 500x off experiment, the reward weight searcher quickly recognizes the excessive optimization of energy consumption and tries various adjustments. However, EUREKA-M increases weights in a highly random manner rather than decreasing the energy-consumption penalty.
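The kind of adjustment the searcher converges on can be pictured with the sketch below, in which the energy-consumption weight is divided by a step factor once the service objective is found to be under-served. The threshold, step size, and metric names are invented for illustration only.

# Sketch of one weight-search step for a mis-scaled energy penalty: shrink the
# dominating weight instead of randomly inflating the others.
def adjust_weights(weights: dict, metrics: dict, step: float = 10.0) -> dict:
    """Decrease the energy-consumption weight when it crowds out the service objective."""
    new_weights = dict(weights)
    if metrics["service_rate"] < metrics["service_target"]:
        # The energy term dominates the scalarized reward, so reduce its weight
        # by a multiplicative step rather than increasing every other weight.
        new_weights["ec"] = weights["ec"] / step
    return new_weights

weights = {"service": 1.0, "ec": 500.0}           # deliberately mis-scaled (500x off)
metrics = {"service_rate": 0.2, "service_target": 0.9}
print(adjust_weights(weights, metrics))            # {'service': 1.0, 'ec': 50.0}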




This figure shows the maximum value of w_service/w_ec across the five weight groups during training, for the best-performing run among the five repetitions. The reward weight searcher adapts its step sizes more flexibly than GPT-4o w/o TLA and EUREKA-M.




This table shows the number of iterations required to meet the user requirements. Each experiment is repeated 5 times, and mean values and standard deviations are reported. On average, only 5.2 iterations are needed to meet the user requirements.

ERFSL Prompt Samples

BibTeX

@article{xie2024llmrsearcher,
      title={Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning},
      author={Xie, Guanwen and Xu, Jingzehua and Yang, Yiyuan and Ren, Yong and Ding, Yimian and Zhang, Shuai},
      journal={arXiv preprint arXiv:2409.02428},
      year={2024}
}

