Bitso Corridor Desk — Agent-Based Digital Twin

RFQ internalisation and cross-venue hedging of stablecoin FX corridors, with an RL desk policy

Prepared for E. Arenas · Bitso Alpha

v0.3 · RL

Desk policy

Heuristic: static knobs, the baseline desk.

Learned (CEM): trained policy, regime-conditional de-risk.

Compare: overlay both on the same seed.

Simulation Parameters

Competing desks

Horizon (minutes)

240

Hawkes branching ratio η

0.60

Baseline flow intensity μ

3.0/min

Directional bias (USD→fiat)

80%

Internalisation band

0.40

Base corridor spread

35 bps

Competitor spread

30 bps

Settlement lag (gamma carry)

5 min
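
To make the two flow parameters concrete, here is a minimal sketch of self-exciting RFQ arrivals. It assumes an exponential excitation kernel with decay `beta`; the page does not state the kernel the simulator uses, so `beta`, `simulate_rfq_times`, and `stationary_rate` are hypothetical names and values chosen for illustration. The sampler is standard Ogata thinning, not necessarily the simulator's own method.

```python
import math
import random


def stationary_rate(mu: float, eta: float) -> float:
    """Long-run mean arrival rate of a stationary Hawkes process (requires eta < 1)."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("branching ratio must lie in [0, 1)")
    return mu / (1.0 - eta)


def simulate_rfq_times(mu: float, eta: float, horizon: float,
                       beta: float = 1.0, seed: int = 0) -> list[float]:
    """Ogata thinning for a Hawkes process with kernel alpha * exp(-beta * s).

    beta (decay per minute) is an assumed value. alpha = eta * beta makes the
    kernel integrate to eta, the branching ratio.
    """
    rng = random.Random(seed)
    alpha = eta * beta
    t, excite, times = 0.0, 0.0, []
    while True:
        bound = mu + excite                # intensity only decays until the next event
        wait = rng.expovariate(bound)      # candidate inter-arrival time
        t += wait
        if t >= horizon:
            return times
        excite *= math.exp(-beta * wait)   # decay the excitation to the candidate time
        if rng.random() * bound <= mu + excite:   # accept with prob. intensity / bound
            times.append(t)
            excite += alpha                # each accepted RFQ raises the intensity


rfq_times = simulate_rfq_times(mu=3.0, eta=0.60, horizon=240.0)
```

With the defaults listed above, the stationary rate μ / (1 − η) is about 7.5 RFQs per minute, or roughly 1,800 over the 240-minute horizon; a single simulated path will scatter around that figure.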

Reward function (what the RL agent optimises)

r_t = ΔPnL_t
    − λ_dn   · max(0, −ΔPnL_t)              ← asymmetric downside (CVaR proxy)
    − λ_var  · max(0, VaR_t / Budget − 1)   ← soft VaR-budget constraint
    − λ_cap  · max(0, |inv_t| / Cap  − 1)   ← soft inventory-cap constraint
    − λ_breach · breaches_t                 ← hard breach penalty

R = Σ_t r_t      (the policy maximises R; episode-level)
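
The per-step reward above can be sketched in a few lines of Python. This is an illustration of the formula only, not the simulator's code: `RewardWeights`, `step_reward`, and the argument names are hypothetical, and the defaults are simply the slider values shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Defaults mirror the slider values on this page.
    lam_dn: float = 1.50
    lam_var: float = 50_000.0
    lam_cap: float = 50_000.0
    lam_breach: float = 25_000.0


def step_reward(d_pnl: float, var: float, var_budget: float,
                inv: float, inv_cap: float, breaches: int,
                w: RewardWeights = RewardWeights()) -> float:
    """r_t = ΔPnL_t minus the four penalty terms."""
    downside = max(0.0, -d_pnl)                     # only losses are penalised
    var_over = max(0.0, var / var_budget - 1.0)     # fraction over the VaR budget
    cap_over = max(0.0, abs(inv) / inv_cap - 1.0)   # fraction over the inventory cap
    return (d_pnl
            - w.lam_dn * downside
            - w.lam_var * var_over
            - w.lam_cap * cap_over
            - w.lam_breach * breaches)


def episode_reward(step_rewards: list[float]) -> float:
    """R = Σ_t r_t."""
    return sum(step_rewards)
```

Under these defaults a $1,000 loss with no constraint violations scores −$2,500: the loss itself plus 1.5 times the loss as the downside penalty. Setting every λ to zero recovers plain ΔPnL.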

λ_dn (downside multiplier): 1.50

0 = risk-neutral (maximise mean PnL). Higher values punish drawdowns more and push the policy toward tail protection.

λ_var (soft VaR penalty, $ per over-budget unit): $50K

Soft Lagrangian on the VaR-budget constraint. 0 = ignore.

λ_cap (soft inventory-cap penalty, $ per over-cap unit): $50K

Soft penalty for breaching the warehouse cap. The simulator already trims hard; this shapes behaviour earlier.

λ_breach (per-breach penalty, $): $25K

Discrete cost per hard breach event. Use to suppress breach count regardless of size.

Reward on current run · 0 steps

Term                        Heuristic   Learned
Σ ΔPnL (gross objective)    $0          $0
− λ_dn · downside           $0          $0
− λ_var · VaR-over          $0          $0
− λ_cap · cap-over          $0          $0
− λ_breach · breaches       $0          $0
R (total reward)            $0          $0

Moving a slider re-scores the same trajectories under the new objective; it does not retrain the policy. To retrain the policy under your weights, run rl/train_sac.py with the matching --lambda-* flags (see RL tab).
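
Re-scoring is cheap because R is linear in the four weights: the penalty sums depend only on the trajectory, so they can be computed once and recombined whenever a slider moves. A minimal sketch under that observation follows. The function names, the tuple layout, and the three sample steps are hypothetical illustration data, not output from the simulator.

```python
def penalty_sums(steps):
    """Collapse a trajectory into the five sums that appear in the reward table.

    Each step is (d_pnl, var_ratio, inv_ratio, breaches), where
    var_ratio = VaR_t / Budget and inv_ratio = |inv_t| / Cap.
    """
    sums = {"pnl": 0.0, "downside": 0.0, "var_over": 0.0, "cap_over": 0.0, "breaches": 0.0}
    for d_pnl, var_ratio, inv_ratio, breaches in steps:
        sums["pnl"] += d_pnl
        sums["downside"] += max(0.0, -d_pnl)
        sums["var_over"] += max(0.0, var_ratio - 1.0)
        sums["cap_over"] += max(0.0, inv_ratio - 1.0)
        sums["breaches"] += breaches
    return sums


def rescore(sums, lam_dn, lam_var, lam_cap, lam_breach):
    """Total reward R for one trajectory under a given set of weights."""
    return (sums["pnl"]
            - lam_dn * sums["downside"]
            - lam_var * sums["var_over"]
            - lam_cap * sums["cap_over"]
            - lam_breach * sums["breaches"])


# Hypothetical three-step trajectory, for illustration only.
trajectory = [(500.0, 0.8, 0.5, 0), (-200.0, 1.1, 1.2, 1), (300.0, 0.9, 0.7, 0)]
sums = penalty_sums(trajectory)                    # computed once per run
r_default = rescore(sums, 1.50, 50_000.0, 50_000.0, 25_000.0)
r_neutral = rescore(sums, 0.0, 0.0, 0.0, 0.0)      # reduces to Σ ΔPnL
```

The same sums can be scored for the heuristic and the learned trajectories to fill both columns of the table. Because the trajectories themselves are fixed, this changes only how a run is judged, not how the policy behaves.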

Tune the desk parameters and click Run.