Reward function (what the RL agent optimises)
r_t = ΔPnL_t
      − λ_dn · max(0, −ΔPnL_t)              ← asymmetric downside (CVaR proxy)
      − λ_var · max(0, VaR_t / Budget − 1)  ← soft VaR-budget constraint
      − λ_cap · max(0, |inv_t| / Cap − 1)   ← soft inventory-cap constraint
      − λ_breach · breaches_t               ← hard breach penalty
R = Σ_t r_t (the policy maximises R; episode-level)

Term                       Heuristic   Learned
Σ ΔPnL (gross objective)   $0          $0
− λ_dn · downside          $0          $0
− λ_var · VaR-over         $0          $0
− λ_cap · cap-over         $0          $0
− λ_breach · breaches      $0          $0
R (total reward)           $0          $0

Move a slider → the table re-scores the same trajectories under the new objective. To actually retrain the policy under your weights, run rl/train_sac.py with the matching --lambda-* flags (see RL tab).
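
The re-scoring described here can be sketched in a few lines of Python. This is a minimal illustration of the formula for r_t and of the per-term rows in the table, not the code behind the sliders or rl/train_sac.py. The names Step, Weights and score_trajectory are hypothetical, and the sketch assumes VaR_t is reported as a positive number and that Budget and Cap are strictly positive.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Weights:
    """Penalty weights (hypothetical container for the four lambdas)."""
    lam_dn: float
    lam_var: float
    lam_cap: float
    lam_breach: float


@dataclass(frozen=True)
class Step:
    """One recorded time step of a trajectory (hypothetical layout)."""
    d_pnl: float   # ΔPnL_t
    var: float     # VaR_t, assumed positive
    inv: float     # inv_t, signed inventory
    breaches: int  # breaches_t, number of hard-limit breaches at this step


def score_trajectory(steps: Sequence[Step], w: Weights,
                     var_budget: float, inv_cap: float) -> Dict[str, float]:
    """Return the per-term contributions to R and their sum.

    Each row mirrors one line of the table: the gross objective plus four
    penalty terms, all summed over the episode.
    """
    gross = sum(s.d_pnl for s in steps)
    downside = sum(max(0.0, -s.d_pnl) for s in steps)
    var_over = sum(max(0.0, s.var / var_budget - 1.0) for s in steps)
    cap_over = sum(max(0.0, abs(s.inv) / inv_cap - 1.0) for s in steps)
    breaches = sum(s.breaches for s in steps)

    rows = {
        "gross": gross,
        "downside": -w.lam_dn * downside,
        "var_over": -w.lam_var * var_over,
        "cap_over": -w.lam_cap * cap_over,
        "breaches": -w.lam_breach * breaches,
    }
    rows["R"] = sum(rows.values())  # R = Σ_t r_t
    return rows
```

Because the trajectories are fixed inputs, calling score_trajectory again with a different Weights instance changes only the penalty rows and R. The gross Σ ΔPnL row stays the same, which is why moving a slider re-scores a policy without retraining it.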