In my last post, I discussed deep reinforcement learning (RL) in the context of sequential decision making, and argued that it arises very naturally in many settings.
Wow, the point about RLHF not obviously fitting the sequential decision-making framing, which you discussed in your last post, really got me thinking. Does this suggest the 'state' for LLM alignment is inherently more nebulous or continuous than in typical RL scenarios?