GPU Utilization Only 10%? A Deep Dive into the Main Causes of LLM Inference Bottlenecks

Date: 2026-06-01 10:18

Prerequisites: none (first article in the series) | Companion code: about 100 lines | Estimated reading time: about 20 minutes

0. Environment Setup and Preparation

import warnings
warnings.filterwarnings("ignore")

# Jupyter magic: render figures as SVG (only valid inside a notebook)
%config InlineBackend.figure_formats = ['svg']

import torch
import time
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModelForCausalLM

# Font fallbacks for non-ASCII characters in plot labels
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device in use: {device}")

Output:

PyTorch version: 2.5.1
CUDA available: False
Device in use: cpu

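1. The Problem: Why Is `model.generate()` So Slow?

Many readers have probably used HuggingFace's model.generate() for text generation. The typical experience: you submit a prompt and then wait while the output trickles out one token at a time. Try to serve 10 requests at once and you either run out of GPU memory or slow to a crawl. More puzzling still, GPU utilization can sit at around 10%, so a card that costs tens of thousands of yuan appears to be doing very little.

The root cause is usually not the quality of your code but structural bottlenecks inherent to large language model (LLM) inference. vLLM was built to address exactly these problems, but before looking at vLLM we need to understand where the bottlenecks are.

Let's start from the most basic setup: run one text generation with Qwen3-0.6B and see the problem first-hand.

Loading a lightweight model

We use Qwen3-0.6B (about 600 million parameters) for the experiments. The model is small, but its inference procedure is the same as that of a large model such as LLaMA-70B, so the nature of the bottleneck does not change with model size.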

# Load Qwen3-0.6B and its tokenizer
model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Recent transformers releases warn that `torch_dtype` is deprecated in favor of `dtype`
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32).to(device)
model.eval()

print(f"Model: {model_name}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")
print(f"Layers: {model.config.num_hidden_layers}, hidden size: {model.config.hidden_size}, attention heads: {model.config.num_attention_heads}")

Output:

`torch_dtype` is deprecated! Use `dtype` instead!
Model: Qwen/Qwen3-0.6B
Parameters: 596.0M
Layers: 28, hidden size: 1024, attention heads: 16

The most basic way to generate text

First, run one generation with model.generate() to get a feel for the speed.

# The most basic way to generate text
prompt = "The future of artificial intelligence is"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

generate_length = 100

# Time the whole generate() call
start = time.time()
with torch.no_grad():
    output = model.generate(input_ids,
                           max_new_tokens=generate_length,
                           do_sample=False,  # greedy decoding, deterministic output
                           )
elapsed = time.time() - start

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
# Count the tokens actually produced; generation may stop early at EOS
num_generated = output.shape[1] - input_ids.shape[1]
tokens_per_sec = num_generated / elapsed

print(f"Prompt: {prompt!r}")
print(f"Generated {num_generated} tokens in {elapsed:.2f} s")
print(f"Speed: {tokens_per_sec:.1f} tokens/s")
print(f"\nGenerated text:\n{generated_text}")

Output:

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

Prompt: 'The future of artificial intelligence is'
Generated 100 tokens in 36.00 s
Speed: 2.8 tokens/s

Generated text:
The future of artificial intelligence is a topic that has sparked considerable debate and speculation in the scientific community. As we delve into this subject, it's essential to understand the multifaceted nature of AI, its potential applications, and the ethical considerations that accompany these advancements. The narrative presented here is a blend of scientific inquiry and philosophical reflection, aiming to provide a comprehensive yet engaging exploration of the topic.  
The first step in understanding the future of AI is to recognize that AI is not a singular entity but a complex system of interconnected technologies.

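2. The Two Key Phases of LLM Inference: Prefill and Decode

model.generate() looks like a single function call, but internally it runs two very different kinds of computation.

2.1 The Prefill phase

When the prompt is fed to the model, all prompt tokens are processed at once. This phase is called Prefill: the input is the whole prompt (6 tokens in our example), the computation is a single forward pass, and all tokens are processed in parallel. The output consists of the logits at the last position (used to pick the first new token) and the KV Cache for every position.

Prefill is compute-bound when the prompt is reasonably long, so the GPU's arithmetic units are kept busy.

2.2 The Decode phase

After the first new token, the remaining tokens have to be generated one at a time, because token N+1 depends on the result for token N. This phase is called Decode: the input is the single token just generated, the computation is again one forward pass but covers only that token (attending to the cached keys and values), and the output is the logits for the next token.

Decode is memory-bound: every step has to read the entire set of model weights from GPU memory while performing very little computation.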

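Prefill phase (parallel):
┌─────────────────────────────────────┐
│[The] [future] [of] [AI] [is] [→]    │← 6 tokens computed in one pass
└─────────────────────────────────────┘
       ↓
    first new token

Decode phase (token by token):
┌───────┐   ┌───────┐   ┌───────┐
│ [a]→  │ → │ [new]→│ → │ [era]→│ → ...  ← only 1 token computed per step
└───────┘   └───────┘   └───────┘

The key point: to generate 100 tokens, Prefill runs once, while Decode runs once for every remaining token (99 steps, because Prefill already yields the first one). Each Decode step does little computation, but there are many of them and each one moves a large amount of data, so the Decode phase is the real performance bottleneck.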
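To make the count concrete, here is a small sketch in plain Python (no model involved) that lists the forward passes of a cached greedy decode and compares how many token positions are processed with and without a KV Cache. The function names are illustrative and not part of any library, and the no-cache variant assumes the whole sequence is re-run at every step.

```python
def forward_pass_schedule(num_prompt_tokens: int, max_new_tokens: int) -> list[tuple[str, int]]:
    """Return (phase, tokens_processed) for every forward pass of a cached greedy decode."""
    if max_new_tokens < 1:
        return []
    # One prefill pass consumes the whole prompt and yields the first new token.
    schedule = [("prefill", num_prompt_tokens)]
    # Every further token needs its own pass over exactly one input token.
    schedule += [("decode", 1)] * (max_new_tokens - 1)
    return schedule


def tokens_processed_without_cache(num_prompt_tokens: int, max_new_tokens: int) -> int:
    """Without a KV cache, each pass re-reads the prompt plus everything generated so far."""
    return sum(num_prompt_tokens + step for step in range(max_new_tokens))


schedule = forward_pass_schedule(num_prompt_tokens=6, max_new_tokens=100)
with_cache = sum(tokens for _, tokens in schedule)
without_cache = tokens_processed_without_cache(6, 100)

print(f"forward passes: {len(schedule)} (1 prefill + {len(schedule) - 1} decode)")
print(f"token positions processed with KV cache:    {with_cache}")
print(f"token positions processed without KV cache: {without_cache}")
```

With the cache, each Decode pass sees a single input token, so its cost is dominated by reading the weights rather than by arithmetic.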

Empirical check: separating Prefill and Decode by hand

Instead of calling model.generate(), we now step through generation manually and time Prefill and Decode separately.

def manual_generate(model, input_ids: torch.Tensor, max_new_tokens: int = 50):
    """Greedy decoding by hand, timing prefill and every decode step separately."""
    def now() -> float:
        # CUDA kernels run asynchronously; wait for them so the timestamps are meaningful
        if input_ids.is_cuda:
            torch.cuda.synchronize()
        return time.time()

    prefill_time = 0.0
    decode_times: list[float] = []
    generated_ids = input_ids.clone()

    with torch.no_grad():
        # === Prefill phase ===
        start = now()
        outputs = model(generated_ids, use_cache=True)
        prefill_time = now() - start

        # Take the logits at the last position and pick the most likely token as the first new token
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated_ids = torch.cat([generated_ids, next_token], dim=-1)
        past_key_values = outputs.past_key_values

        # === Decode phase: one token per step ===
        for _ in range(max_new_tokens - 1):
            start = now()
            outputs = model(next_token, past_key_values=past_key_values, use_cache=True)
            decode_times.append(now() - start)

            next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            generated_ids = torch.cat([generated_ids, next_token], dim=-1)
            past_key_values = outputs.past_key_values

    return generated_ids, prefill_time, decode_times

Run the manual generation and look at how much time Prefill and Decode each take.

# Run the manual generation
prompt = "The future of artificial intelligence is"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

num_prompt_tokens = input_ids.shape[1]
num_generate = 100

generated_ids, prefill_time, decode_times = manual_generate(model, input_ids, num_generate)

total_decode_time = sum(decode_times)
total_time = prefill_time + total_decode_time

print(f"Prompt tokens: {num_prompt_tokens}")
print(f"Generated tokens: {num_generate}")
print(f"{'='*40}")
print(f"Prefill time: {prefill_time*1000:>8.1f} ms ({prefill_time/total_time*100:.1f}%)")
print(f"Decode time:  {total_decode_time*1000:>8.1f} ms ({total_decode_time/total_time*100:.1f}%)")
print(f"{'='*40}")
print(f"Total time:   {total_time*1000:>8.1f} ms")
print(f"Decode per step: {np.mean(decode_times)*1000:>8.1f} ms/token")

Output:

Prompt tokens: 6
Generated tokens: 100
========================================
Prefill time:   1149.8 ms (11.7%)
Decode time:    8712.6 ms (88.3%)
========================================
Total time:     9862.4 ms
Decode per step:     88.0 ms/token
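A quick sanity check on these numbers, as a rough sketch: if every decode step has to read all the weights once, the measured step time implies a weight-streaming rate. The helper name is made up for illustration, and the estimate ignores activations and the KV Cache.

```python
def implied_weight_bandwidth_gbs(n_params: float, bytes_per_param: int, step_seconds: float) -> float:
    """GB/s needed to read every weight once within one decode step."""
    return n_params * bytes_per_param / step_seconds / 1e9


# Numbers taken from the outputs above: 596.0M parameters, float32, 88.0 ms per decode step
bandwidth = implied_weight_bandwidth_gbs(596.0e6, 4, 88.0e-3)
print(f"implied weight-streaming rate: {bandwidth:.1f} GB/s")
```

The result is about 27 GB/s. The run above was on CPU (see the setup output), and a rate of that size is plausible for ordinary system memory (an assumption, since the hardware is not stated), which fits the picture of a step time set by data movement rather than by arithmetic.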

Visualization: latency of each Decode step

Plot the latency of each decode step to see how it varies.

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: total Prefill vs. Decode time
phases = ["Prefill\n(process prompt)", "Decode\n(generate tokens)"]
times_ms = [prefill_time * 1000, total_decode_time * 1000]
colors = ["#4CAF50", "#FF5722"]
bars = axes[0].bar(phases, times_ms, color=colors, width=0.5)
axes[0].set_ylabel("Time (ms)")
axes[0].set_title("Prefill vs. Decode total time")
for bar, t in zip(bars, times_ms):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5, f"{t:.0f}ms", ha='center', fontsize=12)

# Right: latency of each Decode step
decode_ms = [t * 1000 for t in decode_times]
axes[1].plot(range(len(decode_ms)), decode_ms, color="#FF5722", alpha=0.7, linewidth=1.5)
axes[1].axhline(y=np.mean(decode_ms), color='gray', linestyle='--', label=f'Mean: {np.mean(decode_ms):.1f}ms')
axes[1].set_xlabel("Decode step")
axes[1].set_ylabel("Time (ms)")
axes[1].set_title("Latency per Decode step")
axes[1].legend()

plt.tight_layout()
plt.show()

Figure: Prefill vs. Decode latency comparison

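3. Why Is Decode So Slow? Compute-Bound vs. Memory-Bound

The results above show that Decode takes up most of the time. (The run used the CPU, as the setup output shows, so absolute timings would differ on a GPU, but the split between the two phases is what matters here.) Each Decode step handles only 1 token and does very little computation, so why is it the slow part?

The answer lies in how well the available compute is utilized.

3.1 Arithmetic intensity

The key metric for how efficiently a GPU is used is arithmetic intensity: the number of floating-point operations performed per byte moved from memory.

$\text{Arithmetic intensity} = \frac{\text{FLOPs (compute)}}{\text{Bytes (memory traffic)}}$

Prefill: with N tokens, each matrix multiplication is [N, d] × [d, d]. Compute grows in proportion to N while the weights are read only once → arithmetic intensity grows with N → compute-bound once N is large enough → GPU compute is well utilized.
Decode: with 1 token, each matrix multiplication is [1, d] × [d, d]. All the weights still have to be read, but the compute is only 1/N of Prefill's → very low arithmetic intensity → memory-bound → the GPU spends most of its time waiting for data.

An analogy: Prefill is like carrying a whole crate of goods upstairs in one trip; the crate is heavy, but the trip is efficient. Decode is like carrying a single screw per trip and still walking the full round trip every time, so most of the time is spent on the stairs.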
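The argument above is the roofline model in miniature. The sketch below assumes roughly 2 FLOPs per parameter per token, counts only weight traffic, and plugs in approximate A100 figures (312 TFLOP/s dense FP16 and 2 TB/s) as assumptions; substitute the numbers for your own hardware.

```python
def weight_only_intensity(num_tokens: int, bytes_per_param: int) -> float:
    """FLOPs per byte when one pass over the weights serves num_tokens tokens.

    Uses ~2 FLOPs per parameter per token and ignores activations and the KV cache.
    """
    return 2 * num_tokens / bytes_per_param


def attainable_tflops(intensity: float, peak_tflops: float, bandwidth_tb_s: float) -> float:
    """Roofline bound: limited either by peak compute or by intensity * bandwidth."""
    return min(peak_tflops, intensity * bandwidth_tb_s)


PEAK_TFLOPS = 312.0    # assumed: approximate A100 dense FP16 peak
BANDWIDTH_TB_S = 2.0   # assumed: approximate A100 memory bandwidth
ridge_point = PEAK_TFLOPS / BANDWIDTH_TB_S  # intensity needed to become compute-bound

print(f"ridge point: {ridge_point:.0f} FLOP/B")
for num_tokens in (1, 6, 156, 1024):
    intensity = weight_only_intensity(num_tokens, bytes_per_param=2)  # FP16 weights
    perf = attainable_tflops(intensity, PEAK_TFLOPS, BANDWIDTH_TB_S)
    print(f"tokens={num_tokens:>5}  intensity={intensity:>7.1f} FLOP/B  "
          f"attainable={perf:>6.1f} TFLOP/s  ({perf / PEAK_TFLOPS:.1%} of peak)")
```

Under these assumptions a single-token Decode step can reach well under 1% of peak compute, and only batches or prompts of a hundred-plus tokens get close to the compute roof.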

3.2 Let the numbers speak: Prefill vs. Decode compute efficiency

Let's estimate the theoretical compute and memory traffic of Qwen3-0.6B in the Prefill and Decode phases.

# Model configuration of Qwen3-0.6B
n_layers = model.config.num_hidden_layers
d_model = model.config.hidden_size
n_params = sum(p.numel() for p in model.parameters())
bytes_per_param = 4  # float32

# Total size of the model weights
model_size_bytes = n_params * bytes_per_param
model_size_gb = model_size_bytes / 1e9

# Prefill: process N tokens
N_prompt = num_prompt_tokens
# Simplified estimate: a forward pass costs about 2 * n_params FLOPs per token
prefill_flops = 2 * n_params * N_prompt
decode_flops = 2 * n_params * 1  # Decode processes a single token

# Memory traffic: both phases read the weights once (simplified)
memory_access = model_size_bytes  # bytes

# Arithmetic intensity
prefill_intensity = prefill_flops / memory_access
decode_intensity = decode_flops / memory_access

print(f"Model: {model_name}, parameters: {n_params/1e6:.0f}M, weight size: {model_size_gb:.2f} GB")
print(f"{'='*50}")
print(f"{'Metric':<30} {'Prefill':>9} {'Decode':>9}")
print(f"{'='*50}")
print(f"{'Input tokens':<30} {N_prompt:>9} {1:>9}")
print(f"{'Compute (GFLOPs)':<30} {prefill_flops/1e9:>9.2f} {decode_flops/1e9:>9.2f}")
print(f"{'Memory traffic (GB)':<30} {memory_access/1e9:>9.2f} {memory_access/1e9:>9.2f}")
print(f"{'Arithmetic intensity (FLOP/B)':<30} {prefill_intensity:>9.1f} {decode_intensity:>9.1f}")
print(f"{'='*50}")
print(f"\nPrefill's arithmetic intensity is {prefill_intensity/decode_intensity:.0f}x that of Decode!")

Output:

Model: Qwen/Qwen3-0.6B, parameters: 596M, weight size: 2.38 GB
==================================================
Metric                           Prefill    Decode
==================================================
Input tokens                           6         1
Compute (GFLOPs)                    7.15      1.19
Memory traffic (GB)                 2.38      2.38
Arithmetic intensity (FLOP/B)        3.0       0.5
==================================================

Prefill's arithmetic intensity is 6x that of Decode!

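3.3 Another angle: how the bottleneck scales with model size

Qwen3-0.6B has only 600 million parameters, whereas production models are often 7B, 70B, or larger. Let's see how much worse the Decode bottleneck gets as the model grows.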

# Data moved per decoded token for different model sizes
models = {
    "Qwen3-0.6B": 0.6e9,
    "LLaMA-7B": 7e9,
    "LLaMA-13B": 13e9,
    "LLaMA-70B": 70e9,
}

# A100 (80 GB) memory bandwidth is roughly 2 TB/s
a100_bandwidth = 2e12  # bytes/s

print("Assuming an A100 GPU (memory bandwidth ~2 TB/s) and FP16 inference:")
print(f"{'='*65}")
print(f"{'Model':<18} {'Weight size':>11}   {'Decode transfer':>17}   {'Upper bound':>17}")
print(f"{'='*65}")

for name, params in models.items():
    weight_bytes = params * 2  # FP16: 2 bytes per parameter
    weight_gb = weight_bytes / 1e9

    # Each Decode step has to read all weights at least once.
    # (140 GB of FP16 weights would not fit on one 80 GB A100; that row only illustrates bandwidth.)
    decode_latency_ms = (weight_bytes / a100_bandwidth) * 1000
    max_tokens_per_sec = 1000 / decode_latency_ms

    print(f"{name:<18} {weight_gb:>8.1f} GB   {decode_latency_ms:>14.1f} ms   {max_tokens_per_sec:>11.0f} tok/s")

print(f"{'='*65}")
print("\nNote: this is the theoretical upper bound for a single request. Real decoding is slower because of the extra memory traffic for the KV Cache.")

Output:

Assuming an A100 GPU (memory bandwidth ~2 TB/s) and FP16 inference:
=================================================================
Model              Weight size     Decode transfer         Upper bound
=================================================================
Qwen3-0.6B              1.2 GB              0.6 ms          1667 tok/s
LLaMA-7B               14.0 GB              7.0 ms           143 tok/s
LLaMA-13B              26.0 GB             13.0 ms            77 tok/s
LLaMA-70B             140.0 GB             70.0 ms            14 tok/s
=================================================================

Note: this is the theoretical upper bound for a single request. Real decoding is slower because of the extra memory traffic for the KV Cache.
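To get a feel for the KV Cache traffic mentioned in the note, here is a back-of-the-envelope size estimate. The formula (keys plus values, per layer, per KV head) is the standard one; the 7B-style shape used in the example (32 layers, 32 KV heads, head dimension 128, FP16) is an assumption for illustration, not something read from a checkpoint.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store keys and values for every layer, head, and position."""
    # The leading factor 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed 7B-style shape (hypothetical values chosen for illustration)
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **shape)
one_request = kv_cache_bytes(seq_len=2048, **shape)
sixteen_requests = kv_cache_bytes(seq_len=2048, batch_size=16, **shape)

print(f"per token:            {per_token / 2**20:.2f} MiB")
print(f"one 2048-token seq:   {one_request / 2**30:.2f} GiB")
print(f"16 x 2048-token seqs: {sixteen_requests / 2**30:.2f} GiB")
```

Every Decode step has to read the cache for all earlier positions, so this traffic grows with sequence length and batch size on top of the fixed cost of reading the weights.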

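4. Throughput vs. Latency: The Dilemma Facing Inference Engines

With the difference between Prefill and Decode understood, we can look at the central trade-off in inference serving.

4.1 Two key performance metrics

Latency: the total time a single request takes from input to output. It directly shapes user experience; nobody wants to wait 10 seconds for a reply.
Throughput: the total number of tokens processed per unit of time. It drives the provider's cost; handling more requests on the same GPU means more revenue.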
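Both metrics can be computed from per-token timestamps. The sketch below uses a hypothetical RequestTrace record (not part of any library) and made-up timestamps, purely to show how the two definitions differ.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    arrival: float            # time the request arrived (seconds)
    token_times: list[float]  # time each output token was emitted (seconds)


def latency(trace: RequestTrace) -> float:
    """End-to-end latency: from arrival until the last output token."""
    return trace.token_times[-1] - trace.arrival


def throughput(traces: list[RequestTrace]) -> float:
    """Output tokens per second across all requests in the observed window."""
    total_tokens = sum(len(t.token_times) for t in traces)
    window = max(t.token_times[-1] for t in traces) - min(t.arrival for t in traces)
    return total_tokens / window


# Made-up timestamps for two overlapping requests
req_a = RequestTrace(arrival=0.0, token_times=[0.5, 1.0, 1.5, 2.0])
req_b = RequestTrace(arrival=0.5, token_times=[1.0, 1.5, 2.0, 2.5, 3.0, 3.5])

print(f"latency A: {latency(req_a):.2f} s, latency B: {latency(req_b):.2f} s")
print(f"throughput: {throughput([req_a, req_b]):.2f} tokens/s")
```

Latency is a per-request quantity, while throughput is measured over the whole system, which is why the two can move in opposite directions.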

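4.2 Where is the conflict?

A single request's Decode phase is memory-bound, and GPU compute is largely wasted. An intuitive fix is to process several requests together (batching), so that one pass over the weights serves multiple requests and arithmetic intensity rises.

But this brings new problems:

The larger the batch, the higher each request's latency (requests have to wait for the batch to fill)
Requests generate outputs of different lengths, so short ones must wait for long ones, leaving the GPU partly idle
GPU memory is finite, so the batch cannot grow without limit; the KV Cache consumes a large amount of memory

This is the core problem an inference engine must solve: with limited GPU memory, how do we optimize throughput and latency at the same time?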
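The batching intuition can be put into a simple cost model: one Decode step takes as long as the slower of reading the weights once and doing the arithmetic for the whole batch. This is a sketch under strong assumptions (2 FLOPs per parameter per token, FP16 weights, approximate A100 figures of 2 TB/s and 312 TFLOP/s, no KV Cache or attention cost), not a prediction for real hardware.

```python
def decode_step_seconds(n_params: float, batch_size: int, bytes_per_param: int = 2,
                        bandwidth: float = 2e12, peak_flops: float = 312e12) -> float:
    """Idealized time of one decode step for a batch: max of memory time and compute time."""
    memory_time = n_params * bytes_per_param / bandwidth   # weights are read once per step
    compute_time = 2 * n_params * batch_size / peak_flops  # ~2 FLOPs per parameter per token
    return max(memory_time, compute_time)


def decode_tokens_per_second(n_params: float, batch_size: int) -> float:
    """Tokens produced per second when every request in the batch emits one token per step."""
    return batch_size / decode_step_seconds(n_params, batch_size)


for batch_size in (1, 8, 64, 156, 512):
    step_ms = decode_step_seconds(7e9, batch_size) * 1000
    rate = decode_tokens_per_second(7e9, batch_size)
    print(f"batch={batch_size:>3}  step={step_ms:>6.2f} ms  throughput={rate:>8.0f} tok/s")
```

In this model the step time stays flat while the batch is memory-bound, so throughput scales almost linearly with batch size until the compute roof is reached. Real measurements, such as the CPU run in the next section, deviate from this because of KV Cache traffic, attention cost, and framework overhead.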

4.3 Experiment: the effect of batching on throughput

Let's measure how different batch sizes affect throughput.

# Measure throughput at different batch sizes
batch_sizes = [1, 2, 4, 8, 16]
results: list[dict] = []

# Make sure a pad token exists
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.eos_token_id

for bs in batch_sizes:
    # Build a batch by repeating the same prompt (identical prompts need no padding;
    # with mixed-length prompts, decoder-only models should be padded on the left)
    prompts = [prompt] * bs
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

    gen_tokens = 50

    # Warm-up
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=5, do_sample=False)

    # Timed run
    start = time.time()
    with torch.no_grad():
        _ = model.generate(**inputs, max_new_tokens=gen_tokens, do_sample=False)
    elapsed = time.time() - start

    total_tokens = bs * gen_tokens  # assumes no sequence stops early at EOS
    throughput = total_tokens / elapsed
    latency_per_request = elapsed  # all requests in the batch finish together

    results.append({
        "batch_size": bs,
        "total_tokens": total_tokens,
        "elapsed": elapsed,
        "throughput": throughput,
        "latency": latency_per_request,
    })

    print(f"batch_size={bs:>2}: throughput {throughput:>7.1f} tok/s, latency {latency_per_request:.2f}s")

Output:

batch_size= 1: throughput     4.6 tok/s, latency 10.95s
batch_size= 2: throughput     6.0 tok/s, latency 16.60s
batch_size= 4: throughput     9.9 tok/s, latency 20.19s
batch_size= 8: throughput    21.5 tok/s, latency 18.65s
batch_size=16: throughput    28.7 tok/s, latency 27.87s

Plot the relationship between batch size, throughput, and latency.

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

bs_list = [r["batch_size"] for r in results]
tp_list = [r["throughput"] for r in results]
lat_list = [r["latency"] for r in results]

# Left: throughput
axes[0].plot(bs_list, tp_list, 'o-', color="#2196F3", linewidth=2, markersize=8)
axes[0].set_xlabel("Batch size")
axes[0].set_ylabel("Throughput (tokens/s)")
axes[0].set_title("Batch size vs. throughput")
axes[0].set_xticks(bs_list)
axes[0].grid(True, alpha=0.3)

# Right: latency
axes[1].plot(bs_list, lat_list, 'o-', color="#FF5722", linewidth=2, markersize=8)
axes[1].set_xlabel("Batch size")
axes[1].set_ylabel("Latency (s)")
axes[1].set_title("Batch size vs. latency")
axes[1].set_xticks(bs_list)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Observation: larger batch size → higher throughput (good), but latency rises as well (bad).")
print("This is the limitation of naive batching.")

Figure: Batch size vs. throughput and latency

Output:

Observation: larger batch size → higher throughput (good), but latency rises as well (bad).
This is the limitation of naive batching.

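5. vLLM's Core Ideas

We have now identified three core problems in LLM inference:

Decode is memory-bound, so GPU compute is badly underused
The KV Cache takes up a lot of GPU memory, limiting how many requests can be served concurrently
Naive batching has to wait for the slowest request, wasting further resources

How does vLLM address these? Its core innovations fall into three areas.

5.1 PagedAttention: managing the KV Cache the way an operating system manages memory

The traditional approach pre-allocates max_seq_len worth of KV Cache for every request; if a request uses only half of it, the rest is wasted. vLLM borrows the virtual-memory idea from operating systems: it splits the KV Cache into fixed-size "pages" and allocates them on demand. As a result, KV Cache memory utilization can rise from roughly 20-40% with the traditional approach to close to 100%.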

Traditional approach (contiguous pre-allocation, heavy waste):
┌──────────────────────────────┐
│ Req 1 KV Cache ████████░░░░░ │← max_len pre-allocated, only part of it used
│ Req 2 KV Cache ████░░░░░░░░░ │← even more waste
│ [ no room for new requests ] │
└──────────────────────────────┘

vLLM PagedAttention (pages allocated on demand):
┌──────────────────────────────┐
│ [1][1][1][2][2][3][3][3][3]  │← pages from different requests interleaved
│ [1][4][4][4][free][]         │← compact layout, room for new requests
└──────────────────────────────┘
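The paging idea can be sketched with a toy block allocator. This is an illustration only: the class and method names are invented here, it tracks block bookkeeping rather than real tensors, and it is not vLLM's actual implementation.

```python
class PagedKVAllocator:
    """Toy allocator: each request owns a list of fixed-size blocks, allocated on demand."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # physical block ids not in use
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids
        self.num_tokens: dict[str, int] = {}          # request id -> tokens stored

    def append_token(self, request_id: str) -> bool:
        """Reserve room for one more token; return False when no block is available."""
        used = self.num_tokens.get(request_id, 0)
        if used % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                return False
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop(0))
        self.num_tokens[request_id] = used + 1
        return True

    def slot(self, request_id: str, position: int) -> tuple[int, int]:
        """Translate a logical token position into (physical block, offset in block)."""
        table = self.block_tables[request_id]
        return table[position // self.block_size], position % self.block_size

    def free(self, request_id: str) -> None:
        """Return all blocks of a finished request to the free list."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=4)
for _ in range(5):
    allocator.append_token("A")   # A needs two blocks
for _ in range(3):
    allocator.append_token("B")   # B gets the next free block
for _ in range(4):
    allocator.append_token("A")   # A grows into a non-adjacent block

print("block table A:", allocator.block_tables["A"])
print("block table B:", allocator.block_tables["B"])
print("token 8 of A lives at:", allocator.slot("A", 8))
```

Because a request's blocks do not have to be contiguous, memory is wasted only in the last, partly filled block of each request, and freed blocks can be reused by any other request.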

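5.2 Continuous Batching: no waiting for the slowest request

Traditional batching waits until every request in a batch has finished before starting the next batch. vLLM uses continuous batching: at every step (iteration) it decides afresh which requests take part in the computation. As soon as a request finishes, it is removed and a waiting request can take its place right away, which avoids the "wait for the slowest request" problem.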
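The difference can be seen in a small step-counting simulation. It is a simplified sketch: every running request produces one token per step, the prefill cost of a newly admitted request is ignored, and the function names and output lengths are invented for the example.

```python
def static_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes, then the next batch starts."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """At every step, freed slots are refilled from the waiting queue before decoding."""
    waiting = list(output_lengths)   # remaining tokens for requests not yet started
    running: list[int] = []          # remaining tokens for requests in the batch
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Every running request emits one token; requests with none left drop out.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


lengths = [10, 2, 2, 2, 10, 2, 2, 2]   # made-up output lengths (tokens)
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)

useful = sum(lengths)
print(f"static:     {static_steps} steps, slot utilization {useful / (static_steps * 4):.0%}")
print(f"continuous: {continuous_steps} steps, slot utilization {useful / (continuous_steps * 4):.0%}")
```

In this example the short requests no longer hold their slots idle while the long ones finish, so the same work completes in fewer steps.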

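5.3 Smart scheduling: handling memory pressure gracefully

When GPU memory runs short, vLLM's scheduler can temporarily swap the KV Cache of lower-priority requests out to CPU memory, or preempt some requests to free memory for more urgent ones, balancing throughput and latency dynamically.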
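A minimal sketch of preemption under memory pressure follows. It makes several simplifying assumptions that are not taken from the article: memory is counted in token slots, the victim is always the most recently admitted request, and a preempted request drops its cache and starts over (a swap-based variant would keep its progress and pay a transfer cost instead). All names are hypothetical.

```python
from collections import deque


def _slots_after_step(running: list[list]) -> int:
    """Token slots needed if every running request generates one more token."""
    return sum(prompt_len + generated + 1 for _, prompt_len, _, generated in running)


def run_scheduler(requests: list[tuple[str, int, int]], capacity: int) -> list[tuple[int, str, str]]:
    """requests: (name, prompt_len, output_len) in arrival order. Returns (step, event, name)."""
    for name, prompt_len, output_len in requests:
        if prompt_len + output_len > capacity:
            raise ValueError(f"{name} can never fit in {capacity} token slots")

    waiting = deque([name, prompt_len, output_len, 0] for name, prompt_len, output_len in requests)
    running: list[list] = []   # entries are [name, prompt_len, output_len, generated]
    events: list[tuple[int, str, str]] = []
    step = 0
    while waiting or running:
        step += 1
        preempted = False
        # Under memory pressure, evict the most recently admitted request.
        while len(running) > 1 and _slots_after_step(running) > capacity:
            victim = running.pop()
            victim[3] = 0                  # cache dropped: it will be recomputed later
            waiting.appendleft(victim)
            events.append((step, "preempt", victim[0]))
            preempted = True
        # Admit waiting requests in arrival order while they fit.
        while (not preempted and waiting
               and _slots_after_step(running) + waiting[0][1] + 1 <= capacity):
            request = waiting.popleft()
            running.append(request)
            events.append((step, "admit", request[0]))
        # One decode step for everything that is running.
        for request in running:
            request[3] += 1
            if request[3] == request[2]:
                events.append((step, "finish", request[0]))
        running = [request for request in running if request[3] < request[2]]
    return events


events = run_scheduler([("A", 4, 4), ("B", 3, 4)], capacity=12)
for event in events:
    print(event)
```

Here request B is admitted optimistically, evicted once the two caches no longer fit together, and readmitted when A is about to finish, so A's latency is protected at the cost of redoing B's work.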

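6. Comparison with Real vLLM

This article only aims to understand the problem; we have not started solving it yet. Still, it is worth comparing the naive implementation with real vLLM first:

Aspect	Our naive implementation	Real vLLM
KV Cache management	Managed automatically by HuggingFace as contiguous tensors, no paging	PagedAttention, paged on demand
Batching strategy	Static batches, must wait for the slowest request	Continuous Batching, dynamic scheduling
KV Cache memory utilization	20-40% (heavy waste)	Close to 100%
Concurrency	Limited by batch size	Tens to hundreds of concurrent requests
Scheduling	None (first come, first served)	Priority scheduling + preemption
Kernel implementation	PyTorch defaults	Custom CUDA kernels

The vLLM paper reports 2-4x higher throughput than prior serving systems such as FasterTransformer and Orca at the same level of latency, and the vLLM project's launch blog reports up to 24x the throughput of HuggingFace Transformers in some scenarios. In the rest of this series we will use pure Python to implement, step by step, the progression from the left column of the table to the right. We will not write our own CUDA kernels, but the core algorithms and data structures will match vLLM's.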

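7. Summary

Starting from the most basic model.generate() call, this article broke down the performance bottlenecks of LLM inference:

Prefill vs. Decode: inference consists of two very different phases. Prefill is compute-bound (the prompt is processed in parallel), while Decode is memory-bound (tokens are generated one at a time). Decode is the real performance bottleneck.
Why Decode is memory-bound: every Decode step has to move the entire set of model weights out of GPU memory while doing very little computation. The larger the model, the worse this gets: for a 70B model, even at A100-class memory bandwidth, the theoretical single-request Decode ceiling is only about 14 tokens per second.
The throughput-latency trade-off: batching raises throughput but also raises latency, and naive batching adds the "wait for the slowest request" problem on top. An inference engine has to find the best balance between the two.

With these problems understood, the next article starts tackling the bottlenecks hands-on.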

Source: https://juejin.cn/post/7625074868338786330
