Large Language Models are Semi-Parametric RL Agents

ai research
code
Author

Priya Shanmugasundaram

Published

December 12, 2023

Hello everyone! Welcome to my blog. Today we’ll be going through the NeurIPS 2023 paper, “Large Language Models are Semi-Parametric RL Agents”.

Key Points in Abstract:

Key Takeaways from Intro:

Thus RLEM updates the experience memory rather than modifying the model parameters; the memory is updated through analogical RL training so the agent can self-evolve. A minimal sketch of this idea follows below.
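To make the idea concrete, here is a minimal sketch of an experience-memory update in the spirit of Q-learning: a table of Q values keyed by (task, observation, action) that is refreshed after each interaction, with no LLM parameters touched. The class name, record structure, and exact update rule are my assumptions for illustration, not the paper’s exact formulation.

```python
# Sketch of an RLEM-style experience-memory update (illustrative only).
# The memory maps (task, observation, action) keys to Q-value estimates, and a
# Q-learning-style rule updates those values from interaction rewards without
# fine-tuning any LLM parameters.

from collections import defaultdict

class ExperienceMemory:
    def __init__(self, alpha=0.1, gamma=0.9):
        self.q = defaultdict(float)   # (task, observation, action) -> Q value
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor

    def update(self, task, obs, action, reward, next_obs, next_actions):
        """One Q-learning-style update on the stored record."""
        key = (task, obs, action)
        best_next = max((self.q[(task, next_obs, a)] for a in next_actions), default=0.0)
        target = reward + self.gamma * best_next
        self.q[key] += self.alpha * (target - self.q[key])
```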

REFLEXION equips the LLM with a short-term working memory. The authors instead introduce a long-term experience memory, since short-term working memory is tied to a specific task goal and the stored memory cannot be reused for different goals.

Inner Monologue, Corrective Re-prompting, and DEPS take advantage of immediate failure feedback, but only once; the feedback is not persisted for reuse beyond the current attempt.

RLEM + LLM = REMEMBERER: the agent utilizes experiences selectively, based on the current state, to optimize its decisions.

A semi-parametric RL agent evolves its ability through interaction experiences, without fine-tuning its parameters.

Experimental setup

RLEM Pipeline:

For the current task goal \(g\) and observation \(o_t\), each record \((g_i, o_i)\) in the experience memory is scored by

\[ s_i = \lambda f_g(g, g_i) + (1 - \lambda)f_o(o_t, o_i) \]

and the \(m\) records with the highest similarity are retrieved as exemplars. The input part of each exemplar shows task information, past observations, past actions, and interaction feedback; the output part shows encouraged and discouraged actions based on the Q values.
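As a rough illustration of this retrieval step, here is a sketch under the assumption that each memory record stores its task goal \(g_i\), observation \(o_i\), and some payload (e.g. the encouraged/discouraged actions); `f_g` and `f_o` stand in for whatever similarity functions are used.

```python
# Sketch of retrieving the top-m experience records as exemplars, scored by
# s_i = lambda * f_g(g, g_i) + (1 - lambda) * f_o(o_t, o_i).
# `records` is assumed to be a list of (g_i, o_i, payload) tuples; the helper
# names are placeholders, not the paper's implementation.

def retrieve_exemplars(g, o_t, records, f_g, f_o, lam=0.5, m=3):
    scored = [
        (lam * f_g(g, g_i) + (1.0 - lam) * f_o(o_t, o_i), payload)
        for (g_i, o_i, payload) in records
    ]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [payload for _, payload in scored[:m]]
```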

Details on Experiments:

Experiments on WebShop and WikiHow are conducted through the OpenAI API with the GPT-3.5-Turbo and text-davinci-003 models.

WebShop: the agent is instructed to browse the site and shop for target goods. After shopping, a score between 0 and 1 is assigned by assessing the correspondence between the purchased product and the instruction.

webpage representation and a list of available actions –> LLM –> products to shop for

There are no immediate rewards; only the last 5 actions serve as procedure feedback, as in the loop sketched below.
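A rough sketch of such an episode loop, keeping only the last 5 actions as feedback; `env` and `llm_choose_action` are hypothetical stand-ins for illustration, not the paper’s actual interfaces.

```python
# Sketch of a WebShop-style episode: the prompt carries the webpage
# representation, the available actions, and the last 5 actions as procedure
# feedback; the only reward is the final score in [0, 1] after shopping.

from collections import deque

def run_episode(env, llm_choose_action, max_steps=15):
    recent_actions = deque(maxlen=5)        # last 5 actions as procedure feedback
    obs, available_actions = env.reset()
    for _ in range(max_steps):
        action = llm_choose_action(obs, available_actions, list(recent_actions))
        recent_actions.append(action)
        obs, available_actions, done, score = env.step(action)
        if done:
            return score                    # score in [0, 1] rated after shopping
    return 0.0
```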

Inspired by CoT and ReAct, the LLM is prompted to predict a reason for its decision. Observation similarity is based on the webpage representation, which is categorized into four patterns.

Task similarity \(f_g\) is computed with a sentence-similarity transformer.
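A small sketch of what such a task-similarity function might look like using the sentence-transformers library; the particular checkpoint (`all-MiniLM-L6-v2`) is my assumption for illustration, not necessarily the model the authors used.

```python
# Sketch of a task-similarity function f_g via sentence embeddings and cosine
# similarity. The checkpoint name is an assumption for illustration.

from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def f_g(task_a: str, task_b: str) -> float:
    emb = _model.encode([task_a, task_b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```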

WikiHow: the agent follows the instructions and navigates to the required page.

Intermediate rewards are available; the screen is represented as an HTML element sequence, and the last 4 performed actions along with the last reward are given as feedback.

task representation, screen representation, and step instruction –> LLM –> HTML representation of the operated element

Task similarity \(f_g\) is computed from the step instructions, and observation similarity \(f_o\) is based on the length of the longest common sequence of HTML elements in the screen representation.
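Reading “longest common sequence” as a longest common subsequence over the two screens’ HTML element lists, a minimal sketch of \(f_o\) could look like this; the normalization by the longer screen’s length is an assumption for illustration.

```python
# Sketch of an observation-similarity f_o for WikiHow screens: length of the
# longest common subsequence of the two HTML element sequences, normalized by
# the longer screen's length (normalization is an assumption).

def f_o(screen_a: list[str], screen_b: list[str]) -> float:
    n, m = len(screen_a), len(screen_b)
    if n == 0 or m == 0:
        return 0.0
    dp = [[0] * (m + 1) for _ in range(n + 1)]   # LCS dynamic-programming table
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if screen_a[i - 1] == screen_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m] / max(n, m)
```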

They present an ablation analysis of the full model against variants such as w/o bootstrap, w/o random, and w/o discouraged actions, reporting the average reward and average success rate on the two task sets.

They compare the performance of an LLM-only method, ReAct, and REMEMBERER for different training sets and different initial exemplar sets, and see consistent improvements for REMEMBERER across settings.