Our approach is based on these observations:
We propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to serve as effective white-box searchers and highlights their advanced semantic understanding capabilities.
The reward critic effectively detects various errors and then rewrites the reward components based on the description and variables of the Env class; only one feedback iteration per component is needed to obtain a correct result (a minimal sketch of this critic loop is given after this list).
The reward critic can rectify elusive errors in reward components, whereas EUREKA-S fails to do so.
Two groups of weights generated by the reward weight initializer already achieve Pareto solutions, so either no further search is required or only a more refined search is needed.
In the 500x-off experiment, the reward weight searcher quickly recognizes the excessive optimization of energy consumption and tries various adjustments, whereas EUREKA-M increases weights in a highly random manner rather than decreasing the energy consumption penalty (a sketch of the weight search loop is given further below).
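For concreteness, here is a minimal sketch of how such a reward-critic loop can be wired up. The `llm` callable, prompt wording, and helper names are illustrative assumptions, not the exact ERFSL implementation.

```python
# Minimal sketch of an LLM-based reward critic loop (assumed interface, not the
# exact ERFSL prompts): each reward component is checked against the Env class
# description and rewritten once if the critic reports an error.
from typing import Callable, Dict

def critique_and_rewrite(
    llm: Callable[[str], str],              # prompt -> completion (user-supplied)
    env_description: str,                   # docstring + variable list of the Env class
    reward_components: Dict[str, str],      # component name -> Python code snippet
) -> Dict[str, str]:
    revised = {}
    for name, code in reward_components.items():
        critique = llm(
            "You are a reward critic. Environment description:\n"
            f"{env_description}\n\n"
            f"Reward component '{name}':\n{code}\n\n"
            "Reply 'OK' if the component is correct, otherwise describe the error."
        )
        if critique.strip().upper() == "OK":
            revised[name] = code
        else:
            # One rewrite iteration per component; the critic's error description
            # is fed back together with the original code.
            revised[name] = llm(
                f"Rewrite reward component '{name}' to fix this error:\n"
                f"{critique}\n\nOriginal code:\n{code}\n"
                "Use only variables defined in the Env class. Return only code."
            )
    return revised
```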
This figure shows the maximum value of w_service/w_ec among the five weight groups during training, taken from the run with the best performance over the five repetitions. The reward weight searcher adjusts step sizes more flexibly than GPT-4o w/o TLA and EUREKA-M.
This table displays the number of iterations required to meet the user demand. The experiments are repeated 5 times, and the means and standard deviations are reported. On average, only 5.2 iterations are needed to meet the user requirements.
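As a rough illustration of the weight search loop, the sketch below alternates training with LLM-proposed weight updates. The `llm` and `train_and_evaluate` callables, the prompt wording, and the JSON reply format are assumptions made for this example, not the exact ERFSL interface.

```python
# Minimal sketch of an LLM-driven reward weight search loop (assumed interface):
# training metrics are fed back to the LLM, which either declares the user
# requirements met or proposes new weights with a freely chosen step size.
import json
from typing import Callable, Dict

def search_weights(
    llm: Callable[[str], str],                             # prompt -> completion
    train_and_evaluate: Callable[[Dict[str, float]], Dict[str, float]],
    init_weights: Dict[str, float],                        # e.g. {"w_service": 1.0, "w_ec": 1.0}
    user_requirements: str,                                # textual trade-off description
    max_iters: int = 10,
) -> Dict[str, float]:
    weights = dict(init_weights)
    for _ in range(max_iters):
        metrics = train_and_evaluate(weights)              # e.g. service quality, energy use
        reply = llm(
            "You are a reward weight searcher.\n"
            f"User requirements: {user_requirements}\n"
            f"Current weights: {json.dumps(weights)}\n"
            f"Training metrics: {json.dumps(metrics)}\n"
            "If the requirements are met, reply DONE. Otherwise return new weights "
            "as JSON, choosing the step size freely (e.g. strongly reduce a weight "
            "whose objective is over-optimized)."
        )
        if reply.strip().upper().startswith("DONE"):
            break
        weights = json.loads(reply)
    return weights
```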
@article{xie2024llmrsearcher,
title={Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning},
author={Xie, Guanwen and Xu, Jingzehua and Yang, Yiyuan and Ren, Yong and Ding, Yimian and Zhang, Shuai},
journal={arXiv preprint arXiv:2409.02428},
year={2024}
}