Mainstream scenarios

RL from human feedback

RL with verifiable rewards

RL with multi-turn agentic interaction

DeepSpeed-Chat

OpenRLHF

FlexRLHF