Offline Reinforcement Learning — CQL
Conservative α: 1.5
Dataset quality: 50%
CQL penalizes Q(s,a) for out-of-distribution actions, i.e. actions that do not appear in the dataset. The conservative weight α sets the strength of that penalty.
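
The penalty can be sketched in a few lines. The version below is a minimal illustration, assuming discrete actions and the common CQL(H) form of the regularizer (log-sum-exp of Q over all actions minus Q at the dataset action). The function name `cql_penalty`, its arguments, and the default `alpha=1.5` (taken from the setting shown above) are illustrative assumptions, not necessarily what this demo computes.

```python
import math

def cql_penalty(q_values, dataset_actions, alpha=1.5):
    """Sketch of a CQL(H)-style regularizer for discrete actions.

    q_values: list of rows, one per state, each row holding Q(s, a) for every action.
    dataset_actions: list of action indices actually taken in the dataset.
    alpha: conservative weight (hypothetical default, matching the value above).
    """
    total = 0.0
    for q_row, action in zip(q_values, dataset_actions):
        # Numerically stable log-sum-exp over all actions: a soft maximum
        # that grows when any action, seen or unseen, has a large Q-value.
        q_max = max(q_row)
        lse = q_max + math.log(sum(math.exp(q - q_max) for q in q_row))
        # Subtract Q at the dataset action, so in-distribution values
        # are pushed up while everything else is pushed down.
        total += lse - q_row[action]
    # Average over the batch and scale by the conservative weight.
    return alpha * total / len(dataset_actions)

if __name__ == "__main__":
    # Same state, dataset action 0; only the unseen action's Q-value differs.
    print(cql_penalty([[0.0, 0.0]], [0]))
    print(cql_penalty([[0.0, 5.0]], [0]))
```

In a full training loop this term would be added to the usual Bellman error. Raising Q on the unseen action (second call) yields a larger penalty than the first, which is the pressure that keeps out-of-distribution values low.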