The Decision Transformer (DT) model casts offline reinforcement learning as a conditional sequence modeling problem.

Unlike prior approaches to offline RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (return-to-go), past states, and actions, Decision Transformer can generate future actions that achieve the desired return.
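The return conditioning described above can be illustrated with a minimal sketch. It assumes undiscounted returns; the function names `returns_to_go` and `next_target_return` are hypothetical and chosen only for illustration, not taken from the implementation.

```python
def returns_to_go(rewards):
    """Undiscounted return-to-go: rtg[t] = rewards[t] + rewards[t+1] + ..."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the end of the trajectory backwards.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


def next_target_return(target_return, reward):
    """At evaluation time, the desired return shrinks by each observed reward."""
    return target_return - reward


# Training side: every timestep of a logged trajectory gets its return-to-go.
rewards = [1.0, 0.0, 2.0, 1.0]
rtg = returns_to_go(rewards)  # [4.0, 3.0, 3.0, 1.0]

# Evaluation side: start from a desired return and subtract rewards as they arrive.
target = 10.0
prompts = [target]
for r in [1.0, 0.0, 2.0]:
    target = next_target_return(target, r)
    prompts.append(target)  # prompts ends as [10.0, 9.0, 9.0, 7.0]
```

During training the model sees the return-to-go that was actually achieved from each timestep; during evaluation the same input slot is filled with a user-chosen target that is decremented as rewards are collected.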

Thanks to its simple supervised objective and Transformer architecture, Decision Transformer is stable and easy to implement, as it
has a minimal number of moving parts.


Despite its simplicity and stability, DT has a number of drawbacks. It is not capable of stitching suboptimal
trajectories (which is why it performs poorly on AntMaze datasets), and it can also perform poorly in stochastic environments.

Explanation of logged metrics

  • eval/{target_return}_return_mean: mean undiscounted evaluation return when prompted with config.target_return value (there might be more than one)
  • eval/{target_return}_return_std: standard deviation of the undiscounted evaluation return across config.eval_episodes episodes
  • eval/{target_return}_normalized_score_mean: mean normalized score when prompted with config.target_return value (there might be more than one). Typically between 0 and 100; scores above 100 indicate performance above the expert for this environment. Implemented by the D4RL library.
  • eval/{target_return}_normalized_score_std: standard deviation of the normalized score across config.eval_episodes episodes
  • train_loss: current training loss; mean squared error (MSE) for continuous action spaces
  • learning_rate: current learning rate; useful for monitoring the learning rate schedule
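As a rough illustration of how the evaluation metrics above relate to each other, the sketch below aggregates per-episode returns into mean, standard deviation, and normalized score. It assumes the usual D4RL-style normalization, 100 * (return - random) / (expert - random), and a population standard deviation; the helper names and the reference scores are hypothetical, not values from any real environment.

```python
import statistics


def normalized_score(ret, random_score, expert_score):
    """D4RL-style normalization: 0 matches random, 100 matches expert."""
    return 100.0 * (ret - random_score) / (expert_score - random_score)


def eval_metrics(target_return, returns, random_score, expert_score):
    """Aggregate per-episode returns for one target return into logged metrics."""
    scores = [normalized_score(r, random_score, expert_score) for r in returns]
    prefix = f"eval/{target_return}_"
    return {
        prefix + "return_mean": statistics.fmean(returns),
        # Population std (ddof=0) is an assumption of this sketch.
        prefix + "return_std": statistics.pstdev(returns),
        prefix + "normalized_score_mean": statistics.fmean(scores),
        prefix + "normalized_score_std": statistics.pstdev(scores),
    }


# Hypothetical reference scores and episode returns, for illustration only.
metrics = eval_metrics(
    target_return=200.0,
    returns=[100.0, 150.0, 200.0],
    random_score=0.0,
    expert_score=200.0,
)
```

With several values in config.target_return, the same aggregation would be repeated once per target, producing one group of eval/ keys for each.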

Implementation details

  1. Batch sampling weighted by trajectory length ( algorithms/offline/
  2. State normalization during training and inference ( algorithms/offline/
  3. Reward downscaling ( algorithms/offline/
  4. Positional embedding shared across one transition ( algorithms/offline/
  5. Prompting with multiple returns-to-go during evaluation, as DT can be sensitive to the prompt ( algorithms/offline/
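The first three items in the list above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the code from the referenced files: all function names are hypothetical, the reward scale value is made up, and the small epsilon in the normalization is an assumption added to avoid division by zero.

```python
import random
import statistics


def sample_trajectory_index(traj_lengths, rng):
    """Pick a trajectory with probability proportional to its length."""
    return rng.choices(range(len(traj_lengths)), weights=traj_lengths, k=1)[0]


def state_statistics(states):
    """Per-dimension mean and (population) std over all dataset states."""
    dims = list(zip(*states))
    mean = [statistics.fmean(d) for d in dims]
    std = [statistics.pstdev(d) for d in dims]
    return mean, std


def normalize_state(state, mean, std, eps=1e-6):
    """Apply the same statistics during training and inference."""
    return [(s - m) / (sd + eps) for s, m, sd in zip(state, mean, std)]


def scale_rewards(rewards, reward_scale):
    """Downscale rewards so returns-to-go stay in a small numeric range."""
    return [r * reward_scale for r in rewards]


# Longer trajectories are drawn more often, so each transition is
# roughly equally likely to end up in a batch.
rng = random.Random(0)
lengths = [10, 30, 60]
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_trajectory_index(lengths, rng)] += 1

# Statistics come from the dataset; a constant dimension has std 0,
# which is why eps is added in the denominator.
dataset_states = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
mean, std = state_statistics(dataset_states)
normalized = normalize_state([2.0, 10.0], mean, std)  # [0.0, 0.0]

scaled = scale_rewards([100.0, 250.0], reward_scale=0.001)  # hypothetical scale
```

Sampling by trajectory length keeps short trajectories from being over-represented relative to the number of transitions they contain, while state normalization and reward downscaling keep the model inputs in comparable numeric ranges.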

Experimental results

For detailed scores on all benchmarked datasets, see the benchmarks section. The reports visually compare our reproduction results with the original paper's scores to verify that our implementation works properly.

