LLMs as Efficient Reward Function Searchers for Custom-Environment MORL

1Massachusetts Institute of Technology, 2University of Oxford, 3New Jersey Institute of Technology

ERFSL Architecture & Example Prompts

[Figure: ERFSL architecture and example prompts.]

Introduction

Our approach is based on these observations:

  • Designing and balancing reward components for multi-objective reinforcement learning (MORL) is hard.
  • It is hard for LLMs to design intricate reward functions through trial-and-error exploration, or to rectify incorrect reward functions from implicit feedback.
  • LLMs, especially smaller ones, suffer a severe decline in comprehension over long contexts.
  • LLMs are adept at summarizing and heuristically generating code given specific, clear task contexts, yet they are less proficient at handling numerical contexts.

We propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to act as effective white-box searchers and highlights their strong semantic understanding capabilities.
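As a rough illustration of this decoupled workflow, the sketch below generates the reward components once from the task context and then searches only over the weights, feeding training metrics back to the LLM as concise text. All function names, prompt wording, and the feedback format here are hypothetical placeholders, not the exact pipeline used in the paper.

# Hypothetical sketch of a decoupled ERFSL-style workflow: reward components
# are generated once from the task context, then only the weights are searched
# using short textual feedback. Names and prompts are placeholders.
import json
from typing import Callable, Dict

def erfsl_search(llm: Callable[[str], str],
                 task_description: str,
                 env_code: str,
                 train_and_evaluate: Callable[[str, Dict[str, float]], Dict],
                 max_iters: int = 10) -> Dict[str, float]:
    # 1) Generate one reward component per explicit user requirement.
    components_code = llm(
        f"Task requirements:\n{task_description}\n"
        f"Env class:\n{env_code}\n"
        "Write one reward component function per requirement."
    )

    # 2) Ask for initial weight groups (returned as JSON) so the first
    #    training batch already covers different trade-offs.
    weights: Dict[str, float] = json.loads(
        llm("Propose initial weights for the components above as a JSON object.")
    )

    # 3) Iterate: convert numerical training results into short text feedback
    #    and let the LLM adjust only the weights, not the component code.
    for _ in range(max_iters):
        metrics = train_and_evaluate(components_code, weights)
        if metrics.get("meets_requirements", False):
            break
        weights = json.loads(llm(
            f"Current weights: {json.dumps(weights)}\n"
            f"Training metrics: {json.dumps(metrics)}\n"
            "Return adjusted weights as a JSON object."
        ))
    return weights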

Experimental Results & Example Case Studies


The reward critic effectively detects various errors and rewrites reward components based on the description and variables of the Env class. For each component, only one feedback iteration is needed to obtain a correct result.
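For intuition, here is a minimal sketch of how such a critic prompt could be assembled from the Env class description, its variables, and the observed error. The prompt wording, field names, and example environment values are illustrative assumptions, not the paper's actual prompts.

# Sketch of assembling a reward-critic prompt from the Env class context and
# an observed error (all wording and example values are illustrative).
def build_critic_prompt(env_description: str,
                        env_variables: dict,
                        component_name: str,
                        component_code: str,
                        error_report: str) -> str:
    """Combine the Env description, its variables, the failing component, and its error."""
    variable_lines = "\n".join(f"- {name}: {meaning}" for name, meaning in env_variables.items())
    return (
        f"Environment description:\n{env_description}\n\n"
        f"Available Env variables:\n{variable_lines}\n\n"
        f"Reward component `{component_name}`:\n{component_code}\n\n"
        f"Observed problem:\n{error_report}\n\n"
        "Rewrite this reward component so that it uses only the variables above "
        "and fixes the problem. Return only the corrected function."
    )

# Example usage with hypothetical values:
prompt = build_critic_prompt(
    env_description="Hypothetical multi-agent data-collection environment.",
    env_variables={
        "self.energy_used": "energy consumed in the current step (J)",
        "self.sinr": "signal-to-interference-plus-noise ratio of the uplink",
    },
    component_name="reward_energy",
    component_code="def reward_energy(self):\n    return -self.enery_used  # typo",
    error_report="AttributeError: 'Env' object has no attribute 'enery_used'",
)
print(prompt)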




The reward critic can rectify elusive errors in reward components, but EUREKA-S fails to do so.




Two groups of weights generated by the reward weight initializer already achieve Pareto solutions, so no further search is required, or at most a more refined search is needed.
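To make the role of a weight group concrete, the following linear-scalarization sketch shows how one group of weights could combine the per-objective reward components during training. The component names and numerical values are illustrative, not taken from the paper.

# Sketch: applying a weight group from the reward weight initializer to the
# per-objective reward components via linear scalarization.
def scalarized_reward(components: dict, weights: dict) -> float:
    """Weighted sum of the per-objective reward components."""
    return sum(weights[name] * value for name, value in components.items())

# Two hypothetical weight groups covering different service/energy trade-offs.
weight_groups = [
    {"service": 1.0, "ec": 0.1},
    {"service": 1.0, "ec": 1.0},
]

step_components = {"service": 0.8, "ec": -0.3}   # example per-step component values
print([scalarized_reward(step_components, w) for w in weight_groups])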




In the 500x off experiment, the reward weight searcher quickly recognizes the excessive optimization of energy consumption and tries various adjustments. However, EUREKA-M increases weights in a highly random manner rather than decreasing the energy-consumption penalty.
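The kind of adjustment the searcher converges on can be pictured with the sketch below, in which the energy-consumption weight is divided by a step factor once the service objective is found to be under-served. The threshold, step size, and metric names are invented for illustration only.

# Sketch of one weight-search step for a mis-scaled energy penalty: shrink the
# dominating weight instead of randomly inflating the others.
def adjust_weights(weights: dict, metrics: dict, step: float = 10.0) -> dict:
    """Decrease the energy-consumption weight when it crowds out the service objective."""
    new_weights = dict(weights)
    if metrics["service_rate"] < metrics["service_target"]:
        # The energy term dominates the scalarized reward, so reduce its weight
        # by a multiplicative step rather than increasing every other weight.
        new_weights["ec"] = weights["ec"] / step
    return new_weights

weights = {"service": 1.0, "ec": 500.0}           # deliberately mis-scaled (500x off)
metrics = {"service_rate": 0.2, "service_target": 0.9}
print(adjust_weights(weights, metrics))            # {'service': 1.0, 'ec': 50.0}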




This figure shows the maximum value of w_service/w_ec across the five weight groups during training, for the best-performing run among the five repetitions. The reward weight searcher adapts its step sizes more flexibly than GPT-4o w/o TLA and EUREKA-M.




This table shows the number of iterations required to meet the user requirements. Each experiment is repeated 5 times, and mean values and standard deviations are reported. On average, only 5.2 iterations are needed to meet the user requirements.

ERFSL Prompt Samples

BibTeX

@article{xie2024llmrsearcher,
      title={Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning},
      author={Xie, Guanwen and Xu, Jingzehua and Yang, Yiyuan and Ren, Yong and Ding, Yimian and Zhang, Shuai},
      journal={arXiv preprint arXiv:2409.02428},
      year={2024}
}

