Diffusion Language Models
Project from Noah's Ark Lab

Auto-regressive (AR) architectures are constrained by a causal bottleneck: tokens must be produced strictly left to right. Diffusion Language Models (DLMs) instead frame text generation as a bidirectional denoising process. We explore the frontiers of non-causal generation, tackling challenges that range from architectural inertia to latent thinking, toward the next milestone in LLM evolution.

News

Latest Updates & Announcements

2026.02

🚀 We released DLLM-Agent, the first diffusion-based LLM agents. On average they complete end-to-end tasks over 30% faster than autoregressive agents at comparable accuracy, reaching up to 8× speedups in selected cases through more efficient multi-step planning.

2026.01

Our proposal on the Top 10 Open Challenges of diffusion LLMs was presented at AAAI'26, outlining current bottlenecks together with promising ideas for boosting performance and expanding application fields.

2026.01

We proposed a novel Diffusion-in-Diffusion paradigm: a 'draft-then-refine' framework designed to overcome the irreversibility and myopia problems inherent in block diffusion models.
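The draft-then-refine loop can be sketched with a toy denoiser. Everything below (the function names, the random confidence model, the thresholds) is illustrative and assumed for the sketch, not the released implementation:

```python
import random

MASK = "<mask>"

def toy_denoiser(tokens, vocab, rng):
    """Toy stand-in for a diffusion denoiser: proposes a token and a
    confidence for every masked position (hypothetical, for illustration)."""
    return [(t, 1.0) if t != MASK else (rng.choice(vocab), rng.random())
            for t in tokens]

def draft_then_refine(tokens, vocab, refine_rounds=2, threshold=0.5, seed=0):
    """The draft pass commits every masked position at once; each refine
    pass remasks the least-confident commitments and re-predicts them,
    so an early mistake is not irreversible."""
    rng = random.Random(seed)
    proposals = toy_denoiser(tokens, vocab, rng)      # coarse draft
    for _ in range(refine_rounds):
        tokens = [tok if conf >= threshold else MASK  # remask weak spots
                  for tok, conf in proposals]
        proposals = toy_denoiser(tokens, vocab, rng)  # refine
    return [tok for tok, _ in proposals]
```

The key contrast with plain block diffusion is that a committed token can re-enter the mask set in a later round instead of being frozen forever.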

2026.01

Dynamic output length issues are addressed by our new DCD (Deferred Commitment Decoding) method, which maintains a certainty-aware sliding window to resolve tokens only when sufficient contextual evidence is available.
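Our reading of a certainty-aware sliding window can be sketched as follows; `propose`, the window size, and the deferral budget are assumptions made for illustration, not the method's actual interface:

```python
from collections import deque

def dcd_decode(propose, length, window=4, threshold=0.9, max_defer=3):
    """Sliding-window sketch of deferred commitment: the leftmost
    tentative token is resolved once its confidence clears `threshold`,
    or forced after `max_defer` re-queries so decoding always terminates.
    `propose(pos, attempt)` is a toy denoiser query returning
    (token, confidence)."""
    pending = deque()   # tentative entries: [pos, token, conf, attempts]
    out = []
    next_pos = 0
    while pending or next_pos < length:
        # Keep a window of tentative proposals ahead of the commit point.
        while next_pos < length and len(pending) < window:
            tok, conf = propose(next_pos, 0)
            pending.append([next_pos, tok, conf, 0])
            next_pos += 1
        head = pending[0]
        if head[2] >= threshold or head[3] >= max_defer:
            pending.popleft()                    # resolve the token
            out.append(head[1])
        else:
            head[3] += 1                         # defer commitment
            head[1], head[2] = propose(head[0], head[3])
    return out
```

The deferral budget is what bounds output-length uncertainty: a token may wait for more contextual evidence, but not indefinitely.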

Featured Research

Selected Publications

Open Challenges

Bottlenecks & Scalability

1. Inference-Efficient Architectures: Native designs for non-causal, iterative updates without redundant global re-computation or ineffective KV caching.

2. Structured Hierarchy: Multi-scale tokenization that moves beyond flat BPE to allocate resources between semantic structuring and lexical polishing.

3. Gradient Sparsity: Solving the computational waste of long-sequence pre-training, where only a small fraction of tokens provide gradient feedback.

4. Advanced Masking: Structured mechanisms that account for the interdependencies between functional tokens and generic filler words.

5. Adaptive Termination: Predicting optimal output length dynamically, using methods such as DCD, to avoid "hallucinatory padding".

6. Data Engineering: Curating corpora that highlight structural relationships and multi-point dependencies for bidirectional learning.

7. Resource Optimization: Balancing denoising quality against the "iterative tax" of multiple steps during high-throughput execution.

8. Latent Thinking: Enabling the model to "re-think" or edit its output, allowing iterative self-correction beyond linear trajectories.

9. Structured Prompting: Frameworks where prompts serve as global constraints or skeletal scaffolds rather than simple prefixes.

10. Unified Multimodality: Collapsing understanding and generation into a single denoising manifold for unified multimodal models (e.g., VLA).

Strategic Roadmap

The Path Forward

Pillar I: Infrastructure

Shifting to Diffusion-native ecosystems. We propose stochastic-aware attention and multi-scale tokenizers that simulate hierarchical thought.

Pillar II: Optimization

Implementing dynamic masking and speculative denoising. Incorporating uncertainty-aware windows directly into the decoding process.
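As a rough sketch of what speculative denoising could look like, by analogy with speculative decoding: the `drafter`/`verifier` split and the agreement rule below are our assumptions, not a published design.

```python
MASK = "<mask>"

def speculative_denoise(tokens, drafter, verifier, max_rounds=4):
    """A cheap `drafter` fills every masked position in one shot; the
    `verifier` (standing in for the full model) re-scores the draft, and
    positions where the two disagree go back to mask. Expensive passes
    are thus spent only where the draft was wrong."""
    for _ in range(max_rounds):
        if MASK not in tokens:
            break
        draft = drafter(tokens)       # fill all masks cheaply
        verdict = verifier(draft)     # full model's token per position
        tokens = [d if d == v else MASK
                  for d, v in zip(draft, verdict)]
    return tokens
```

When drafter and verifier agree often, most positions resolve in a single verification pass rather than one full denoising step per token.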

Pillar III: Cognitive Reasoning

Shifting to Active Remasking. Identifying low-confidence regions for immediate re-generation, enabling self-correction.

Pillar IV: Unified Intelligence

Treating understanding and generation as a single continuum to collapse the modality gap in unified multimodal models.