deepseek-ai/DeepSeek-V4-Pro

DeepSeek V4 flagship MoE (1.6T total / 49B active) with hybrid CSA+HCA attention, manifold-constrained hyper-connections, Muon-trained on 32T+ tokens, and three-tier reasoning.

View on HuggingFace

moe1600B / 49B1,048,576 ctxvLLM 0.20.0+text

Guide

Overview

DeepSeek-V4-Pro is the flagship of the V4 preview family: a 1.6T-total / 49B-active Mixture-of-Experts model. It pairs a hybrid attention stack — Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) — with Manifold-Constrained Hyper-Connections (mHC) to reach 27% of V3.2's per-token inference FLOPs and 10% of V3.2's KV cache at 1M context. Pre-trained on 32T+ tokens with the Muon optimizer for faster convergence; post-training is a two-stage pipeline (domain-specific expert cultivation + unified consolidation via on-policy distillation).

Checkpoint is FP4+FP8 mixed: MoE expert weights are stored in FP4 while the remaining (attention / norm / router) params stay in FP8.

Reasoning modes

The chat template exposes three reasoning-effort modes:

Non-think — fast, intuitive responses.
Think High — explicit chain-of-thought for complex problem-solving and planning.
Think Max — maximum reasoning effort; requires --max-model-len >= 393216 (384K tokens) to avoid truncation.

Recommended sampling: temperature = 1.0, top_p = 1.0.

OpenAI Client Example

For DeepSeek-V4, keep reasoning controls in chat_template_kwargs, as it exposes a custom Think Max mode via "reasoning_effort": "max".

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model = "deepseek-ai/DeepSeek-V4-Pro"
messages = [{"role": "user", "content": "What is 17*19? Return only the final integer."}]

# Non-think
resp = client.chat.completions.create(
    model=model,
    messages=messages,
)

# Think High
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "high",
        },
    },
)

# Think Max
resp = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "reasoning_effort": "max",
        },
    },
)

Recommended deployments

B300 (8× GPU): single-node DP + EP with --data-parallel-size 8.
H200 (8× GPU): DP + EP with --data-parallel-size 8. Context is capped at 800K tokens (--max-model-len 800000) to leave KV headroom with dense params replicated across ranks — applies to both single-node and multi-node H200.
MI355X (8× GPU): validated with ROCm + AITER (VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_LINEAR=1), --moe-backend triton_unfused, --gpu-memory-utilization 0.6, --max-num-seqs 128, --max-num-batched-tokens 8192, and --distributed-executor-backend mp.
GB200 NVL4 (4× GPU per tray): the ~960 GB mixed-precision checkpoint does not fit on one tray; run multi-node DP + EP across 2 trays (8 GPUs total) with --data-parallel-size 8. Pick the "Multi-Node" tab and set nodes to 2.

GSM8K validation (MI355X)

Launch command (TP=8):

export HF_HOME=/data/huggingface-cache
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_LINEAR=1

vllm serve /home/models/DeepSeek-V4-Pro \
  --host localhost \
  --port 8001 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 8 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --distributed-executor-backend mp \
  --trust-remote-code \
  --gpu-memory-utilization 0.6 \
  --moe-backend triton_unfused \
  --tokenizer-mode deepseek_v4 \
  --reasoning-parser deepseek_v4 \
  --async-scheduling \
  --enforce-eager

GSM8K command:

MODEL=/home/models/DeepSeek-V4-Pro
lm_eval --model local-completions \
  --model_args model=$MODEL,base_url=http://0.0.0.0:8001/v1/completions,num_concurrent=2,max_retries=10,max_gen_toks=2048,timeout=60000 \
  --batch_size auto \
  --tasks gsm8k \
  --num_fewshot 8 \
  --output_path . 2>&1 | tee -a eval.log

Reported result from PR #40871:

local-completions ({'model': '/home/models/DeepSeek-V4-Pro', 'base_url': 'http://0.0.0.0:8001/v1/completions', 'num_concurrent': 2, 'max_retries': 10, 'max_gen_toks': 2048, 'timeout': 60000}), gen_kwargs: ({}), limit: None, num_fewshot: 8, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9538|±  |0.0058|
|     |       |strict-match    |     8|exact_match|↑  |0.9545|±  |0.0057|