Linux CPU Allocation Guide: Two Ways to Allocate Resources to Containers by Weight

Date: 2025-10-30 19:59

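In the Linux kernel's Completely Fair Scheduler (CFS), every logical core maintains its own run queue, struct cfs_rq. Each run queue is organized as a red-black tree, and each tree node is a scheduling entity, struct sched_entity. Note that a sched_entity can be tied either to a concrete process's struct task_struct or to the struct cfs_rq used by a container.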

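The first mechanism the Linux kernel offers for allocating CPU to containers combines period and quota to cap the CPU time a container may use. The kernel also provides a second strategy: allocation by weight. This article looks at how to use weight-based allocation and how it is implemented underneath.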

1. The Linux Completely Fair Scheduler

Before going into container weight allocation, it is worth reviewing the core mechanism of the Completely Fair Scheduler.

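The scheduler gives every logical core an independent run queue, struct cfs_rq, and each queue organizes its runnable entities in a red-black tree. Every node in the tree is a scheduling entity, struct sched_entity, which can represent either a concrete process (task_struct) or a container-level cfs_rq.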

Below is an abridged definition of the cfs_rq kernel object.

    // file: kernel/sched/sched.h
    struct cfs_rq {
        ...
        // Minimum vruntime among all entities in this queue
        u64 min_vruntime;

        // Red-black tree holding the runnable entities
        struct rb_root_cached tasks_timeline;
        ...
    };

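The most important member here is tasks_timeline, of type rb_root_cached, which is the red-black tree itself. Each node of the tree holds a sched_entity object. That entity may belong to an ordinary process (task_struct) or to a container's process group (task_group).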

    // file: kernel/sched/sched.h
    struct task_group {
        ...
        // Scheduling entity of this group on each CPU
        struct sched_entity **se;
        // Run queue owned by this group on each CPU
        struct cfs_rq **cfs_rq;
        unsigned long shares;
    };

    // file: include/linux/sched.h
    struct task_struct {
        ...
        struct sched_entity se;
    };

Whether a sched_entity stands for a concrete process or for a container, it carries a virtual runtime field, vruntime, and a load field that stores its weight.

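During scheduling, each logical core has a timer that periodically triggers the scheduler, which checks whether the leftmost scheduling entity in the red-black tree should replace the currently running process. Several policies influence which process gets switched in, but the core goal is to keep the vruntime of all scheduling entities fair. In other words, no matter how many processes on a Linux system use the Completely Fair Scheduler (processes under real-time policies excluded), their vruntime values end up roughly equal.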
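To make the picking step concrete, here is a minimal user-space sketch, not kernel code. It assumes a linear scan in place of the red-black tree, and all toy_* names are invented for illustration; only the my_q field name mirrors the kernel's sched_entity. It shows how choosing the smallest vruntime can descend from a container's entity into that container's own queue until it reaches a task.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_rq;

/* Toy stand-in for struct sched_entity: either a task or a container group. */
struct toy_entity {
    const char *name;
    uint64_t vruntime;
    struct toy_rq *my_q;   /* NULL for a task; the group's own queue otherwise */
};

/* Toy stand-in for struct cfs_rq: a plain array instead of a red-black tree. */
struct toy_rq {
    struct toy_entity **entities;
    size_t count;
};

/* Return the entity with the smallest vruntime (the "leftmost" node). */
static struct toy_entity *toy_pick_leftmost(const struct toy_rq *rq)
{
    struct toy_entity *best = NULL;

    for (size_t i = 0; i < rq->count; i++) {
        if (best == NULL || rq->entities[i]->vruntime < best->vruntime)
            best = rq->entities[i];
    }
    return best;
}

/*
 * Walk down the hierarchy: pick the leftmost entity, and if it is a group,
 * repeat inside the group's queue until a task is found.
 * Simplification: an empty queue makes the toy return NULL.
 */
static struct toy_entity *toy_pick_next_task(const struct toy_rq *rq)
{
    struct toy_entity *se = NULL;

    while (rq != NULL) {
        if (rq->count == 0)
            return NULL;
        se = toy_pick_leftmost(rq);
        rq = se->my_q;
    }
    return se;
}
```

With a top-level queue that holds one task (vruntime 50) and one container entity (vruntime 10) whose own queue holds two tasks (vruntime 30 and 20), toy_pick_next_task selects the container first and then returns its task with vruntime 20.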

2. Setting the Weight

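The previous section explained that the Completely Fair Scheduler relies on vruntime to let all scheduling entities share the CPU fairly. In practice, though, some services really do need more CPU while others need only a little. For example, on a cloud host some customers buy an 8-core plan and others buy only 1 core. The vruntime calculation therefore needs a way to account for this difference.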

This is where the weight stored in every scheduling entity becomes essential.

    // file: include/linux/sched.h
    struct sched_entity {
        struct load_weight load;
        u64 vruntime;
        ...
    };

    struct load_weight {
        unsigned long weight;
        u32 inv_weight;
    };

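For an ordinary process, the weight can be changed indirectly with the nice command. For a container, it is changed through cgroupfs: the cpu.shares file under cgroup v1, and cpu.weight or cpu.weight.nice under cgroup v2.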
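As a rough illustration of what "changing the file" means in practice, the sketch below writes a number into a cgroup control file from C. The helper name write_cgroup_value and the example paths in the note that follows are assumptions made for this sketch; the real location depends on where cgroupfs is mounted and what the container's cgroup is called, and writing usually requires root.

```c
#include <stdio.h>

/*
 * Write an unsigned integer followed by a newline to a control file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 * Hypothetical helper for illustration; it knows nothing about valid ranges.
 */
static int write_cgroup_value(const char *path, unsigned long value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;

    ok = fprintf(f, "%lu\n", value) > 0;
    if (fclose(f) != 0)   /* the kernel may reject the value at flush time */
        ok = 0;

    return ok ? 0 : -1;
}
```

Under cgroup v2 a call might look like write_cgroup_value("/sys/fs/cgroup/demo/cpu.weight", 200), and under cgroup v1 like write_cgroup_value("/sys/fs/cgroup/cpu/demo/cpu.shares", 2048), where "demo" is a made-up cgroup name. The cgroup v2 documentation gives cpu.weight a range of 1 to 10000 with a default of 100.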

In cgroup v1, a write to cpu.shares ends up in the cpu_shares_write_u64 function.

    // file: kernel/sched/core.c
    static struct cftype cpu_legacy_files[] = {
        {
            .name = "shares",
            .read_u64 = cpu_shares_read_u64,
            .write_u64 = cpu_shares_write_u64,
        },
        ...
    };

In cgroup v2, a write to cpu.weight ends up in the cpu_weight_write_u64 function.

    // file: kernel/sched/core.c
    static struct cftype cpu_files[] = {
        {
            .name = "weight",
            .flags = CFTYPE_NOT_ON_ROOT,
            .read_u64 = cpu_weight_read_u64,
            .write_u64 = cpu_weight_write_u64,
        },
        ...
    };

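Whether the write goes through cpu_shares_write_u64 (cgroup v1, cpu.shares) or cpu_weight_write_u64 (cgroup v2, cpu.weight), it eventually reaches __sched_group_set_shares, which records the weight as shares and pushes it down to the group's scheduling entities (se).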
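Because both interfaces feed the same shares value, the cgroup v2 weight has to be mapped onto the shares scale first. The sketch below reproduces that mapping as I understand it from kernel/sched/core.c (a round-to-nearest of weight * 1024 / 100); treat it as an assumption, since the helper that performs it and its details can differ between kernel versions. The TOY_* macros and the function name are invented here.

```c
#include <stdint.h>

#define TOY_CGROUP_WEIGHT_MIN 1ULL
#define TOY_CGROUP_WEIGHT_DFL 100ULL     /* default cpu.weight */
#define TOY_CGROUP_WEIGHT_MAX 10000ULL

/*
 * Map a cgroup v2 cpu.weight value onto the cpu.shares scale.
 * Returns 0 for out-of-range input (the kernel rejects such writes
 * with an error instead).
 */
static uint64_t toy_weight_to_shares(uint64_t weight)
{
    if (weight < TOY_CGROUP_WEIGHT_MIN || weight > TOY_CGROUP_WEIGHT_MAX)
        return 0;

    /* Round to nearest: the default weight 100 lands exactly on 1024. */
    return (weight * 1024 + TOY_CGROUP_WEIGHT_DFL / 2) / TOY_CGROUP_WEIGHT_DFL;
}
```

Under this mapping the default cpu.weight of 100 corresponds to 1024 shares, and doubling the weight to 200 gives 2048.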

    // file: kernel/sched/fair.c
    static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
    {
        ...
        tg->shares = shares;
        for_each_possible_cpu(i) {
            struct sched_entity *se = tg->se[i];

            for_each_sched_entity(se)
                update_cfs_group(se);
        }
        ...
        return 0;
    }

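The actual update happens in update_cfs_group, which calls reweight_entity, which in turn calls update_load_set to store the weight on the scheduling entity. From then on, the weight of a process or container can be read from its scheduling entity as se->load.weight.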

    // file: kernel/sched/fair.c
    static inline void update_load_set(struct load_weight *lw, unsigned long w)
    {
        lw->weight = w;
        lw->inv_weight = 0;
    }

3. How Container CPU Weight Allocation Is Implemented

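The Completely Fair Scheduler keeps the vruntime of all scheduling entities fair. However, vruntime is scaled according to weight, and that scaling is implemented in the calc_delta_fair function.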

    // file: kernel/sched/fair.c
    static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
    {
        if (unlikely(se->load.weight != NICE_0_LOAD))
            delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

        return delta;
    }

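In this function, the NICE_0_LOAD macro corresponds to 1024, the weight of a nice-0 task. If the weight is 1024, the vruntime increment equals the actual runtime. Otherwise the function calls __calc_delta, which converts the actual runtime into a vruntime increment based on the weight. __calc_delta is written for maximum performance and is fairly intricate, so its source is omitted here. The scaling it applies boils down to: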

    vruntime increment = (actual runtime * ((NICE_0_LOAD * 2^32) / weight)) >> 32
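The formula can be tried out with a small user-space sketch. This is not the kernel's __calc_delta, which caches the inverse weight and adjusts shifts to avoid overflow; it is a direct transcription of the simplified formula, with invented toy_* names. It assumes a nonzero weight and uses unsigned __int128, a GCC/Clang extension, to keep the intermediate product from overflowing.

```c
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024ULL

/*
 * Convert an actual runtime (ns) into a vruntime increment for an entity
 * with the given weight, using 32-bit fixed-point arithmetic.
 * Precondition: weight != 0.
 */
static uint64_t toy_calc_delta_fair(uint64_t delta_exec_ns, uint64_t weight)
{
    uint64_t inv;

    /* Fast path, as in calc_delta_fair: nice-0 weight needs no scaling. */
    if (weight == TOY_NICE_0_LOAD)
        return delta_exec_ns;

    /* (NICE_0_LOAD * 2^32) / weight: the scaling factor in fixed point. */
    inv = (TOY_NICE_0_LOAD << 32) / weight;

    /* Multiply in 128 bits, then drop the 32 fractional bits. */
    return (uint64_t)(((unsigned __int128)delta_exec_ns * inv) >> 32);
}
```

For a 1,000,000 ns slice, a weight of 1024 leaves the value unchanged, a weight of 2048 yields 500,000, and a weight of 512 yields 2,000,000.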

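If the weight is high, the same actual runtime produces a smaller vruntime increment, so the entity is picked more often and receives more CPU time. If the weight is low, the increment is larger than the actual runtime, so the entity receives less CPU time. This is how the Completely Fair Scheduler allocates CPU by weight with such a simple mechanism.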
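The proportional effect is easy to observe in a toy simulation. The sketch below is an assumption-laden model rather than the kernel's algorithm: one CPU, fixed-length ticks, a linear scan for the smallest vruntime, and plain integer division for the scaling. The sim_* names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_NICE_0_LOAD 1024ULL

struct sim_entity {
    uint64_t weight;      /* must be nonzero */
    uint64_t vruntime;    /* weighted virtual runtime, ns */
    uint64_t runtime_ns;  /* real CPU time received, ns */
};

/*
 * Run `ticks` scheduling ticks of `tick_ns` each on a single toy CPU.
 * Every tick goes to the entity with the smallest vruntime (lowest index
 * wins ties), whose vruntime then advances by the weight-scaled amount.
 */
static void sim_run(struct sim_entity *e, size_t n,
                    uint64_t ticks, uint64_t tick_ns)
{
    if (n == 0)
        return;

    for (uint64_t t = 0; t < ticks; t++) {
        size_t next = 0;

        for (size_t i = 1; i < n; i++) {
            if (e[i].vruntime < e[next].vruntime)
                next = i;
        }

        e[next].runtime_ns += tick_ns;
        e[next].vruntime += tick_ns * SIM_NICE_0_LOAD / e[next].weight;
    }
}
```

With two entities of weight 1024 and 2048 and 3000 ticks of 1,000,000 ns, the first ends up with 1,000,000,000 ns of CPU time and the second with 2,000,000,000 ns, matching the 1:2 weight ratio, while their vruntime values stay equal.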

Let's look at an example. Suppose an 8-core physical machine is running a number of containers that belong to services A, B, and C.
