Auto-regressive (AR) architectures are constrained by a causal bottleneck: each token can condition only on the tokens before it. Diffusion Language Models (DLMs) instead cast text generation as a bidirectional denoising process. We explore the frontiers of non-causal generation, addressing challenges from architectural inertia to latent thinking, to reach the next milestone in LLM evolution.
Latest Updates & Announcements
🚀 We released DLLM-Agent, the first diffusion-based LLM agents. At comparable accuracy, they deliver over 30% faster end-to-end performance on average than autoregressive agents, reaching up to 8× speedups in selected cases through more efficient multi-step planning.
Our proposal on the Top 10 Open Challenges of diffusion LLMs was presented at AAAI'26, outlining current bottlenecks and promising ideas for boosting performance and broadening application domains.
We proposed a novel Diffusion-in-Diffusion paradigm: a 'draft-then-refine' framework designed to overcome the irreversibility and myopia problems inherent in block diffusion models.
Our new DCD (Deferred Commitment Decoding) method addresses the dynamic output-length problem by maintaining a certainty-aware sliding window that resolves tokens only when sufficient contextual evidence is available.
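The certainty-aware sliding window behind DCD can be sketched as a small decoding loop. Everything below is an illustrative assumption rather than the actual DCD interface: the `denoise` callback, the window size, and the confidence threshold are hypothetical stand-ins.

```python
# Minimal sketch of certainty-aware sliding-window decoding in the spirit of
# DCD (Deferred Commitment Decoding). The denoiser interface, window size,
# and threshold are illustrative assumptions, not the method's actual API.

MASK = "<mask>"

def dcd_decode(denoise, seq_len, window=4, threshold=0.9, max_steps=50):
    """Resolve tokens left-to-right, committing a position only when the
    denoiser's confidence inside the active window exceeds `threshold`.

    `denoise(tokens, positions)` is assumed to return a dict mapping each
    position to a (token, confidence) pair given the current partial sequence.
    """
    tokens = [MASK] * seq_len
    frontier = 0  # leftmost uncommitted position
    for _ in range(max_steps):
        if frontier >= seq_len:
            break
        # Active window: the positions currently awaiting commitment.
        window_ids = range(frontier, min(frontier + window, seq_len))
        proposals = denoise(tokens, list(window_ids))
        for pos in window_ids:
            tok, conf = proposals[pos]
            if conf >= threshold:
                tokens[pos] = tok  # enough contextual evidence: commit
            else:
                break  # defer this and later positions to the next step
        while frontier < seq_len and tokens[frontier] != MASK:
            frontier += 1
    return tokens
```

The key property is that low-confidence positions are deferred rather than forced, so the effective output length emerges from the evidence instead of being fixed up front.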
Selected Publications
First Diffusion-based LLM agents with 30% faster planning.
A recursive 'draft-then-refine' generation framework.
Non-blocking diffusion for parallel decoding acceleration.
Solving the dynamic output length issue with certainty windows.
Bottlenecks & Scalability
Native designs for non-causal, iterative updates without redundant global re-computation or ineffective KV caching.
Multi-scale tokenization that moves beyond flat BPE to allocate resources between semantic structuring and lexical polishing.
Solving computational waste in long-sequence pre-training, where only a small fraction of tokens provides gradient feedback.
Structured mechanisms accounting for interdependencies between functional tokens vs. generic filler words.
Predicting optimal output length dynamically using methods like DCD to avoid "hallucinatory padding".
Curating corpora that highlight structural relationships and multi-point dependencies for bidirectional learning.
Balancing denoising quality with the "iterative tax" of multiple steps during high-throughput execution.
Enabling the model to "re-think" or edit its output, allowing iterative self-correction beyond linear trajectories.
Frameworks where prompts serve as global constraints or skeletal scaffolds rather than simple prefixes.
Collapsing understanding and generation into a single denoising manifold for unified multimodal models (e.g., VLA).
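One of the bottlenecks above, computational waste when only masked positions provide gradient feedback, can be made concrete with a loss sketch. The shapes and the 1/t weighting below follow the standard masked-diffusion ELBO formulation; this is a sketch under those assumptions, not the group's training code.

```python
# Sketch of the gradient-sparsity issue in masked-diffusion pre-training:
# at masking ratio t, only the masked positions contribute loss, so most of
# the forward pass over a long sequence yields no gradient. The 1/t factor
# is the standard masked-diffusion ELBO weighting (an assumption here).

import numpy as np

def masked_diffusion_loss(logits, targets, mask, t):
    """logits: (B, L, V) float; targets: (B, L) int; mask: (B, L) bool of
    masked positions; t: masking ratio in (0, 1]."""
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    B, L, _ = logits.shape
    per_tok = -logp[np.arange(B)[:, None], np.arange(L)[None, :], targets]
    # Only masked positions provide feedback; unmasked ones are wasted compute.
    masked_loss = (per_tok * mask).sum() / max(mask.sum(), 1)
    return masked_loss / t  # rarer masks get proportionally larger weight
```

Because the denominator counts only masked positions, a long sequence with a sparse mask does a full forward pass for a handful of supervised tokens, which is exactly the waste the item above targets.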
The Path Forward
Shifting to Diffusion-native ecosystems. We propose stochastic-aware attention and multi-scale tokenizers that simulate hierarchical thought.
Implementing dynamic masking and speculative denoising. Incorporating uncertainty-aware windows directly into the decoding process.
Shifting to Active Remasking. Identifying low-confidence regions for immediate re-generation, enabling self-correction.
Treating understanding and generation as a single continuum to collapse the modality gap in unified multimodal models.
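The active-remasking step above can be sketched as a simple loop: after each denoising pass, low-confidence positions are re-masked and regenerated. The `denoise` callback, the confidence threshold, and the round budget are hypothetical stand-ins, not a real implementation.

```python
# Hedged sketch of active remasking: positions whose confidence falls below
# a threshold are re-masked and handed back to the denoiser, enabling
# iterative self-correction beyond a single linear decoding trajectory.
# The denoiser interface and hyperparameters are illustrative assumptions.

MASK = "<mask>"

def active_remask(denoise, tokens, threshold=0.8, rounds=3):
    """Iteratively re-generate low-confidence spans.

    `denoise(tokens)` is assumed to return (new_tokens, confidences) for a
    full-sequence denoising pass.
    """
    for _ in range(rounds):
        tokens, confs = denoise(tokens)
        low = [i for i, c in enumerate(confs) if c < threshold]
        if not low:
            return tokens  # every position is confident: done
        for i in low:
            tokens[i] = MASK  # re-mask low-confidence region for regeneration
    tokens, _ = denoise(tokens)  # final pass commits whatever remains
    return tokens
```

Unlike a purely left-to-right decoder, this loop can revisit and rewrite earlier output, which is the "re-think" capability the roadmap item describes.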