A unified framework bridging behavior-cloning pre-training and reinforcement-learning fine-tuning.
Our method
Context-Smoothed Pre-training (CSP) applies a diffusion-style forward-noising schedule to policy inputs, producing a continuous spectrum from precise imitation to broad coverage. Timestep-Modulated RL (TMRL) then trains a high-level policy πHL(z, σ | s) to select σ per action chunk.
Results
Real-world fine-tuning of π0 to near-perfect success on three robot tasks: under an hour on WidowX, under four hours on Franka. In simulation, TMRL outperforms or matches state-of-the-art baselines.
Motivation
The need for action distribution interpolation.
Robots trained by imitation copy demonstrations narrowly — in sparsely covered contexts the conditional support collapses to near-zero probability on optimal actions, so online rollouts yield no reward signal and RL fine-tuning stalls. We propose to inject forward-diffusion noise into policy inputs at pre-training, smoothly interpolating between the conditional p(a | c) and marginal p(a) action distributions — broadening action coverage so RL can find solutions beyond the demonstrations.
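One way to make this interpolation concrete (a sketch assuming a standard variance-preserving forward process; the paper's exact qσ may differ): corrupt the context with qσ(c̃ | c) = N(√ᾱσ · c, (1 − ᾱσ) I) and train pθ(a | c̃, σ). At σ = 0, ᾱσ = 1 and c̃ = c, so the policy fits p(a | c); as σ → T, ᾱσ → 0, c̃ carries no information about c, and the policy can only fit the marginal p(a).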
On a 2D pointmaze from OGBench, we train two policies — one approximating p(a | c), the other p(a) — where c is the goal position. Both train on pointmaze-large and evaluate on the larger pointmaze-giant environment, with an unseen goal location (agent in yellow, goal in pink).
Watch the OOD goal: left, p(a | c) collapses and drifts away; right, p(a) covers broadly enough to reach it.
Left: p(a | c) · BC · Right: p(a) · CSP, σ = T
The conditional places near-zero probability on optimal actions; the marginal covers them, but only with low likelihood. This motivates a policy that can interpolate between the two, and a mechanism to learn where on that interpolation to operate.
Method
Context smoothing & timestep modulation.
Fig. 1. Left — CSP (pre-training): one policy pθ trained across all σ by corrupting context c with qσ — precise imitation at σ = 0, broad coverage at σ = T. Right — TMRL (fine-tuning): the high-level policy πHL(z, σ | s) learns to modulate σ per action chunk to maximize reward.
Diffusion noise on pointcloud contexts
On the dexterous-grasping suite, the context c is the object pointcloud. As σ rises, qσ progressively corrupts c: the pointcloud slowly loses its original structure, aliasing the out-of-distribution object with nearby in-distribution pointclouds seen during training. The policy can then borrow coherent grasps from related training contexts rather than failing on an unfamiliar shape.
Interactive demo: 1 · Reference rollout → noising c with qσ → 2 · Context-smoothed pointcloud, shown at t = 0 (clean), t = 0.001, t = 0.01, and t = 0.1.
Top: TMRL on a held-out drill (context c = pointcloud). Bottom: qσ removes structure as t rises — aliasing the OOD object back to in-distribution shapes the policy has seen during training.
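A minimal sketch of this corruption on a pointcloud context, assuming a variance-preserving cosine schedule (the paper's exact qσ, schedule, and normalization may differ; `noise_pointcloud` is an illustrative name):

```python
import numpy as np

def noise_pointcloud(points: np.ndarray, t: float, rng=np.random) -> np.ndarray:
    """Corrupt a pointcloud context c (N x 3 array) with q_sigma at timestep t in [0, 1].

    t = 0 returns the clean pointcloud; as t grows, the structure is destroyed and
    the result drifts toward an uninformative (marginal) context.
    """
    alpha_bar = np.cos(0.5 * np.pi * t) ** 2        # assumed cosine schedule
    noise = rng.standard_normal(points.shape)
    return np.sqrt(alpha_bar) * points + np.sqrt(1.0 - alpha_bar) * noise
```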
σ Slider
The conditional ↔ marginal spectrum.
CSP trains one policy across all σ by corrupting the context c with qσ(c̃ | c). The endpoints anchor the spectrum: σ = 0 recovers p(a | c), σ = T recovers p(a).
Drag σ below to see how the policy’s rollouts spread as the context input is more aggressively noised.
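A minimal sketch of one CSP pre-training step under assumptions: uniform σ sampling, a Gaussian context corruption, and an MSE behavior-cloning loss standing in for the actual imitation objective. The names `policy`, `corrupt_context`, and the batch layout are illustrative, not the released code.

```python
import torch

def csp_pretrain_step(policy, corrupt_context, batch, optimizer, T=1.0):
    """One context-smoothed behavior-cloning step (sketch).

    batch["context"]: clean context c (goal, pointcloud, or VLM embedding)
    batch["obs"], batch["actions"]: observation and demonstrated action chunk
    """
    B = batch["actions"].shape[0]
    sigma = torch.rand(B) * T                                  # noise level per example
    noisy_context = corrupt_context(batch["context"], sigma)   # q_sigma(c_tilde | c)

    # The policy conditions on the corrupted context AND the noise level,
    # so a single network covers the whole conditional <-> marginal spectrum.
    pred_actions = policy(batch["obs"], noisy_context, sigma)
    loss = torch.nn.functional.mse_loss(pred_actions, batch["actions"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```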
Success@K — the fraction of out-of-distribution states where at least one of K base-policy rollouts succeeds. On pointmaze-giant and cube-single, CSP exceeds PostBC and BC at every K, with the gap widest on cube-single where both baselines remain at zero.
CSP provides coverage before RL. BC and PostBC stay near zero on cube-single; CSP rises with K.
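Success@K can be computed directly from per-state rollout outcomes; a small helper (names illustrative):

```python
import numpy as np

def success_at_k(outcomes: np.ndarray, k: int) -> float:
    """outcomes: bool array of shape (num_ood_states, num_rollouts).

    Returns the fraction of out-of-distribution start states for which
    at least one of the first k rollouts succeeded.
    """
    return float(np.mean(outcomes[:, :k].any(axis=1)))
```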
RL sample efficiency
We compare against several state-of-the-art baselines. TMRL outperforms or matches all baselines across four simulation tasks.
Simulation RL success rates. TMRL approaches 100% across all four tasks; only TMRL reaches non-trivial success on the long-horizon libero-90.
Real-World RL
Context-Smoothed π0 in the real world.
< 1 hr
DSRL never converges on either WidowX task. TMRL reaches near-perfect success in under an hour of real-robot fine-tuning on both (sausage-in-pot, shrimp-in-drawer).
CSP for VLAs. Noise the VLM embedding context with qσ to allow π0 to sample across the conditional → marginal spectrum.
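In the VLA setting the same corruption is applied to the VLM's embedding of the context before the action expert reads it; a sketch (the embedding shapes, schedule, and hook point into π0 are assumptions, not the released interface):

```python
import torch

def smooth_vlm_context(vlm_embeddings: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Apply q_sigma to VLM embedding tokens of shape (batch, num_tokens, dim).

    sigma near 0 leaves the context intact (precise imitation); larger values
    blur it toward an uninformative context (broad coverage).
    """
    alpha_bar = torch.cos(0.5 * torch.pi * sigma) ** 2   # assumed cosine schedule, sigma in [0, 1]
    alpha_bar = alpha_bar.view(-1, 1, 1)                  # broadcast over tokens and dims
    noise = torch.randn_like(vlm_embeddings)
    return alpha_bar.sqrt() * vlm_embeddings + (1 - alpha_bar).sqrt() * noise
```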
We fine-tune a context-smoothed π0 on BridgeData-v2 (WidowX 250) and DROID (Franka).
Learning curves. Base π0 stalls — its action support is too narrow for any rollout to succeed. CSP widens that support; TMRL leverages it to reach near-perfect success on all three tasks.
Per-task rollouts
Base π0 versus context-smoothed π0, side by side. The learning curve below traces the full RL training run.
Left: Base π0 · Right: Context-Smoothed π0 (ours) · Below: Real-world RL learning curve · Task: Button Press
TMRL learns to be precise when it matters.
πHL treats the diffusion timestep t̃ as an adaptive coverage control, choosing it per action chunk. Below, the converged TMRL policy on WidowX shrimp-in-drawer keeps t̃ low through the precision-critical pick (precise imitation) and raises it during reaching and placing.
“pick up the shrimp and put it into the white drawer”
Per-chunk readout: action chunk 1 / 10, t̃ = 0.01.
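How this per-chunk choice plugs into a rollout, as a sketch: the policy interfaces and the off-policy update are assumptions, with `pi_hl` and `pi_low` as illustrative names for the high-level and context-smoothed base policies.

```python
def tmrl_rollout_chunk(pi_hl, pi_low, env, state):
    """Roll out one action chunk with a high-level-selected noise level (sketch)."""
    prev_state = state
    z, sigma = pi_hl(state)               # RL "action": latent command + diffusion timestep
    actions = pi_low(state, z, sigma)     # context-smoothed base policy decodes the chunk
    chunk_reward, done = 0.0, False
    for a in actions:
        state, reward, done, *_ = env.step(a)   # gym-style step; adjust to your env API
        chunk_reward += reward
        if done:
            break
    # (prev_state, (z, sigma), chunk_reward, state, done) is the transition an
    # off-policy RL update on pi_hl would consume.
    return state, chunk_reward, done
```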
BibTeX
@inproceedings{hong2026tmrl,
title = {TMRL: Diffusion Timestep-Modulated Pre-training Enables Exploration for Efficient Policy Fine-tuning},
author = {Hong, Matthew M. and Zhang, Jesse and Nagabandi, Anusha and Gupta, Abhishek},
booktitle = {Robotics: Science and Systems (RSS)},
year = {2026}
}