FOCUS: DLLMs Know How to Tame Their Compute Bound

Diffusion LLMs process a full token block at every denoising step, but only around 10% of tokens are decoded. FOCUS predicts decodable tokens from early-layer importance deltas, evicts non-decodable tokens on the fly, and turns the saved FLOPs into higher throughput.

Training-free inference · Decodability-aware token eviction · LMDeploy + SDAR / LLaDA2.0
Kaihua Liang¹, Xin Tan², An Zhong¹, Hong Xu², Marco Canini¹
¹King Abdullah University of Science and Technology  ²The Chinese University of Hong Kong

Replay Demo

[Interactive replay demo: FOCUS and LMDeploy decoding the same prompt side by side, with elapsed time shown.]

Overview

[Figure: FOCUS architecture overview]
Design
FOCUS estimates token importance after the early Q/K projections, keeps likely decodable candidates under an adaptive budget, and removes the rest from later-layer computation.
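
A minimal sketch of how one denoising step could be organized under this design, assuming a PyTorch-style layer stack; the model interface, the two-layer scoring split, and the score_importance_delta helper are illustrative assumptions rather than the released implementation (one possible form of the helper is sketched under Decodability Signal below).

import torch

def focus_denoise_step(model, block_hidden, mask_positions, alpha=1.5, decode_rate=0.10):
    """One denoising step: score tokens early, evict the rest, decode the survivors.

    block_hidden:   [batch, block, dim] hidden states for the diffusion block
    mask_positions: [batch, block] bool, True where tokens are still masked
    alpha:          budget multiplier over the expected number of decodable tokens
    """
    # 1) Run only the first two layers; their Q/K projections are enough to score tokens.
    h0 = model.layers[0](block_hidden)
    h1 = model.layers[1](h0)

    # 2) Layer 0-to-1 importance delta (hypothetical helper; see the sketch under
    #    "Decodability Signal" for one possible definition).
    delta = score_importance_delta(model, layer0_in=block_hidden, layer1_in=h0)  # [batch, block]
    delta = delta.masked_fill(~mask_positions, float("-inf"))  # only masked tokens are candidates

    # 3) Adaptive budget: keep roughly alpha times the tokens expected to decode this step.
    block_len = block_hidden.shape[1]
    budget = min(block_len, max(1, int(alpha * decode_rate * block_len)))
    keep_idx = delta.topk(budget, dim=-1).indices  # likely decodable tokens

    # 4) Evict everything else: the later layers only see the kept subset.
    kept = torch.gather(h1, 1, keep_idx.unsqueeze(-1).expand(-1, -1, h1.shape[-1]))
    for layer in model.layers[2:]:
        kept = layer(kept)

    # 5) Decode only the kept positions; evicted tokens stay masked for a later step.
    return model.lm_head(kept), keep_idx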

Motivation

Block-diffusion decoding keeps exact KV caching and parallel decoding, but repeats full-block computation at every step even though only a small subset of tokens decode.
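
As a back-of-the-envelope check of that redundancy, using the ~10% decode rate quoted above and an assumed block size of 32 (both illustrative, not measured):

block_size = 32                 # tokens recomputed at every denoising step (assumed)
decode_rate = 0.10              # rough fraction of block tokens that decode per step

decoded_per_step = decode_rate * block_size   # ~3 tokens make progress
wasted_fraction = 1.0 - decode_rate           # ~90% of per-step block compute feeds
                                              # tokens that stay masked this step
print(f"{decoded_per_step:.1f} tokens decoded per step; "
      f"{wasted_fraction:.0%} of block compute is redundant")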

Decodability Signal

The Layer 0-to-1 importance delta separates decodable from non-decodable tokens, giving FOCUS an early signal before the expensive later layers.
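
The page does not spell out the exact importance measure, so the sketch below assumes one common proxy: the attention mass each token receives, computed from the Q/K projections of layers 0 and 1. Tensor shapes and names are illustrative.

import math
import torch

def attention_importance(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Per-token importance as total attention mass received (column sum of attention).

    q, k: [batch, heads, block, head_dim] from one layer's Q/K projections
    returns: [batch, block]
    """
    attn = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1]), dim=-1)
    # Sum over query positions, average over heads: how much attention each position attracts.
    return attn.sum(dim=-2).mean(dim=1)

def importance_delta(q0, k0, q1, k1) -> torch.Tensor:
    """Layer 0-to-1 change in received attention; tokens with a large positive delta
    are treated as likely to decode at this step."""
    return attention_importance(q1, k1) - attention_importance(q0, k0)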

Runtime Action

FOCUS uses dynamic budgeting, token eviction, ragged execution, and intra-block KV caching to spend computation on likely decodable tokens.
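
A sketch of the runtime mechanics under an assumed varlen ("ragged") attention layout in the style of FlashAttention's cu_seqlens interface; the packing and cache-update helpers are illustrative, not the FOCUS or LMDeploy API.

import torch

def build_ragged_batch(hidden, keep_mask):
    """Pack only the surviving tokens of each sequence into one flat ragged batch.

    hidden:    [batch, block, dim]  full diffusion block per sequence
    keep_mask: [batch, block] bool  True for tokens that survive eviction
    returns:   packed tokens [total_kept, dim] and cu_seqlens offsets for varlen attention
    """
    kept_per_seq = keep_mask.sum(dim=-1)        # dynamic budget realized per sequence
    packed = hidden[keep_mask]                  # [total_kept, dim]
    cu_seqlens = torch.zeros(hidden.shape[0] + 1, dtype=torch.int32, device=hidden.device)
    cu_seqlens[1:] = kept_per_seq.cumsum(0)     # ragged sequence boundaries
    return packed, cu_seqlens

def update_intra_block_kv(kv_cache, k, v, keep_mask):
    """Store K/V of evicted tokens so kept tokens can still attend to them
    without recomputing the evicted positions on later layers or steps."""
    evicted = ~keep_mask
    kv_cache["k"][evicted] = k[evicted]
    kv_cache["v"][evicted] = v[evicted]
    return kv_cache

In a real serving stack the packed batch would feed a varlen attention kernel and the per-sequence budgets would be re-derived at every step, which is how the saved FLOPs show up as throughput.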

Efficiency

Throughput scaling on SDAR and LLaDA2.0 (single NVIDIA A100-SXM4-80GB, tokens/second). FOCUS improves throughput over LMDeploy across datasets, models, and batch sizes.

Model     Dataset   Batch  LMDeploy (tok/s)  FOCUS (tok/s)  Speedup
SDAR      ShareGPT  32     838               1218           1.45x
SDAR      ShareGPT  64     913               1850           2.03x
SDAR      ShareGPT  128    938               2160           2.30x
SDAR      ShareGPT  256    979               2272           2.32x
SDAR      WildChat  32     891               1292           1.45x
SDAR      WildChat  64     983               1890           1.92x
SDAR      WildChat  128    992               2205           2.22x
SDAR      WildChat  256    1001              2324           2.32x
SDAR      MATH      32     1469              1766           1.20x
SDAR      MATH      64     1571              2500           1.59x
SDAR      MATH      128    1628              2788           1.71x
SDAR      MATH      256    1635              2930           1.79x
LLaDA2.0  ShareGPT  32     472               973            2.06x
LLaDA2.0  ShareGPT  64     1770              1644           0.93x
LLaDA2.0  ShareGPT  128    2067              2495           1.21x
LLaDA2.0  ShareGPT  256    2181              3369           1.54x
LLaDA2.0  WildChat  32     470               917            1.95x
LLaDA2.0  WildChat  64     1610              1551           0.96x
LLaDA2.0  WildChat  128    1973              2393           1.21x
LLaDA2.0  WildChat  256    2037              3229           1.59x
LLaDA2.0  MATH      32     764               1544           2.02x
LLaDA2.0  MATH      64     3306              2530           0.77x
LLaDA2.0  MATH      128    3903              3800           0.97x
LLaDA2.0  MATH      256    4022              5017           1.25x

Summary
On a single NVIDIA A100-SXM4-80GB GPU, FOCUS reaches up to 2.32x throughput over the LMDeploy baseline by reducing redundant block computation as batch size grows. The paper also reports up to 3.52x speedup at block size B=64 on ShareGPT with SDAR, where the larger diffusion block makes the compute bottleneck more pronounced.

Quality

Conf. Method alpha GSM8K Math500 HumanEval MBPP IFEval Avg.
0.9 Baseline - 89.20 64.70 69.82 56.81 57.45 67.60
0.9 FOCUS 1.2 89.84 64.60 72.56 58.17 60.97 69.23
0.9 FOCUS 1.5 90.15 64.30 69.51 61.09 60.87 69.18
0.9 FOCUS 1.8 90.75 63.80 71.34 58.95 60.11 68.99
0.8 Baseline - 87.57 59.30 65.85 50.78 58.69 64.44
0.8 FOCUS 1.2 90.33 64.10 67.68 58.56 59.91 68.12
0.8 FOCUS 1.5 89.73 62.20 69.21 59.73 60.97 68.37
0.8 FOCUS 1.8 89.39 62.10 69.82 56.81 61.14 67.85
0.7 Baseline - 84.65 54.60 59.76 50.58 58.47 61.61
0.7 FOCUS 1.2 89.24 62.00 68.29 56.81 61.91 67.65
0.7 FOCUS 1.5 89.31 62.20 64.33 55.25 62.26 66.67
0.7 FOCUS 1.8 88.03 59.80 64.33 53.89 60.08 65.23
Summary
On SDAR, FOCUS improves the five-task average across confidence thresholds, including the default Conf=0.8, alpha=1.5 setting.
Source: paper table "Generation Quality across Thresholds and alpha on SDAR".

Citation

@article{liang2026focus,
  title   = {FOCUS: DLLMs Know How to Tame Their Compute Bound},
  author  = {Kaihua Liang and Xin Tan and An Zhong and Hong Xu and Marco Canini},
  journal = {arXiv preprint arXiv:2601.23278},
  year    = {2026},
  url     = {https://arxiv.org/abs/2601.23278}
}