
[Deepin-Kernel-SIG] [Upstream] [linux 6.6-y] sched/fair: Simplify Util_est #572

Merged
13 changes: 7 additions & 6 deletions Documentation/scheduler/sched-capacity.rst
@@ -39,14 +39,15 @@ per Hz, leading to::
-------------------

Two different capacity values are used within the scheduler. A CPU's
``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
which some loss of available performance (e.g. time spent handling IRQs) is
subtracted.
``original capacity`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. This original capacity is returned by
the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
capacity`` to which some loss of available performance (e.g. time spent
handling IRQs) is subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
while ``capacity_orig`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
while ``original capacity`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``original capacity`` for the sake of
brevity.

1.3 Platform examples
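As a rough sketch of the relationship the reworded paragraph describes: a CPU's 'capacity' is the original capacity returned by arch_scale_cpu_capacity() minus some lost performance. This is illustrative only — irq_loss() below is a hypothetical stand-in, not a kernel helper:

/* Illustrative sketch only; irq_loss() is hypothetical. */
static unsigned long effective_capacity(int cpu)
{
	unsigned long orig = arch_scale_cpu_capacity(cpu);	/* original capacity */
	unsigned long lost = irq_loss(cpu);			/* e.g. time spent in IRQs */

	return lost < orig ? orig - lost : 0;
}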
7 changes: 3 additions & 4 deletions Documentation/scheduler/schedutil.rst
@@ -90,17 +90,16 @@ For more detail see:
- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


UTIL_EST / UTIL_EST_FASTUP
==========================
UTIL_EST
========

Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
(DVFS) ramp-up after they are running again.

To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
filter to instantly increase and only decay on decrease.
highest. UTIL_EST filters to instantly increase and only decay on decrease.

A further runqueue wide sum (of runnable tasks) is maintained of:

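A minimal sketch of the filter behaviour described above (not the kernel's exact code): on dequeue the estimate jumps straight up to a higher 'running' value, and only blends downward through the EWMA, using the 1/4 sample weight implied by UTIL_EST_WEIGHT_SHIFT:

#define UTIL_EST_WEIGHT_SHIFT	2	/* new sample weighs 1/4 in the EWMA */

/* Sketch: UTIL_EST update at dequeue; 'running' is the task's util then. */
static unsigned int util_est_update(unsigned int ewma, unsigned int running)
{
	if (running >= ewma)
		return running;		/* increase instantly */

	/* decrease: decay towards 'running', IIR EWMA style */
	return ewma - ((ewma - running) >> UTIL_EST_WEIGHT_SHIFT);
}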
7 changes: 3 additions & 4 deletions Documentation/translations/zh_CN/scheduler/schedutil.rst
@@ -89,16 +89,15 @@ r_cpu被定义为当前CPU的最高性能水平与系统中任何其它CPU的最
- Documentation/translations/zh_CN/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


UTIL_EST / UTIL_EST_FASTUP
==========================
UTIL_EST
========

由于周期性任务的平均数在睡眠时会衰减,而在运行时其预期利用率会和睡眠前相同,
因此它们在再次运行后会面临(DVFS)的上涨。

为了缓解这个问题,(一个默认使能的编译选项)UTIL_EST驱动一个无限脉冲响应
(Infinite Impulse Response,IIR)的EWMA,“运行”值在出队时是最高的。
另一个默认使能的编译选项UTIL_EST_FASTUP修改了IIR滤波器,使其允许立即增加,
仅在利用率下降时衰减。
UTIL_EST滤波使其在遇到更高值时立刻增加,而遇到低值时会缓慢衰减。

进一步,运行队列的(可运行任务的)利用率之和由下式计算:

1 change: 1 addition & 0 deletions arch/arm/include/asm/topology.h
@@ -13,6 +13,7 @@
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref
#endif

/* Replace task scheduler's default cpu-invariant accounting */
1 change: 1 addition & 0 deletions arch/arm64/include/asm/topology.h
@@ -23,6 +23,7 @@ void update_freq_counters_refs(void);
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref

#ifdef CONFIG_ACPI_CPPC_LIB
#define arch_init_invariance_cppc topology_init_cpu_capacity_cppc
1 change: 1 addition & 0 deletions arch/riscv/include/asm/topology.h
@@ -9,6 +9,7 @@
#define arch_set_freq_scale topology_set_freq_scale
#define arch_scale_freq_capacity topology_get_freq_scale
#define arch_scale_freq_invariant topology_scale_freq_invariant
#define arch_scale_freq_ref topology_get_freq_ref

/* Replace task scheduler's default cpu-invariant accounting */
#define arch_scale_cpu_capacity topology_get_cpu_scale
1 change: 1 addition & 0 deletions arch/x86/kernel/cpu/aperfmperf.c
@@ -346,6 +346,7 @@ static DECLARE_WORK(disable_freq_invariance_work,
disable_freq_invariance_workfn);

DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
EXPORT_PER_CPU_SYMBOL_GPL(arch_freq_scale);

static void scale_freq_tick(u64 acnt, u64 mcnt)
{
42 changes: 20 additions & 22 deletions drivers/base/arch_topology.c
@@ -19,14 +19,16 @@
#include <linux/init.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/units.h>

#define CREATE_TRACE_POINTS
#include <trace/events/thermal_pressure.h>

static DEFINE_PER_CPU(struct scale_freq_data __rcu *, sft_data);
static struct cpumask scale_freq_counters_mask;
static bool scale_freq_invariant;
static DEFINE_PER_CPU(u32, freq_factor) = 1;
DEFINE_PER_CPU(unsigned long, capacity_freq_ref) = 1;
EXPORT_PER_CPU_SYMBOL_GPL(capacity_freq_ref);

static bool supports_scale_freq_counters(const struct cpumask *cpus)
{
@@ -170,9 +172,9 @@ DEFINE_PER_CPU(unsigned long, thermal_pressure);
* operating on stale data when hot-plug is used for some CPUs. The
* @capped_freq reflects the currently allowed max CPUs frequency due to
* thermal capping. It might be also a boost frequency value, which is bigger
* than the internal 'freq_factor' max frequency. In such case the pressure
* value should simply be removed, since this is an indication that there is
* no thermal throttling. The @capped_freq must be provided in kHz.
* than the internal 'capacity_freq_ref' max frequency. In such case the
* pressure value should simply be removed, since this is an indication that
* there is no thermal throttling. The @capped_freq must be provided in kHz.
*/
void topology_update_thermal_pressure(const struct cpumask *cpus,
unsigned long capped_freq)
@@ -183,10 +185,7 @@ void topology_update_thermal_pressure(const struct cpumask *cpus,

cpu = cpumask_first(cpus);
max_capacity = arch_scale_cpu_capacity(cpu);
max_freq = per_cpu(freq_factor, cpu);

/* Convert to MHz scale which is used in 'freq_factor' */
capped_freq /= 1000;
max_freq = arch_scale_freq_ref(cpu);

/*
* Handle properly the boost frequencies, which should simply clean
@@ -279,13 +278,13 @@ void topology_normalize_cpu_scale(void)

capacity_scale = 1;
for_each_possible_cpu(cpu) {
capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
capacity_scale = max(capacity, capacity_scale);
}

pr_debug("cpu_capacity: capacity_scale=%llu\n", capacity_scale);
for_each_possible_cpu(cpu) {
capacity = raw_capacity[cpu] * per_cpu(freq_factor, cpu);
capacity = raw_capacity[cpu] * per_cpu(capacity_freq_ref, cpu);
capacity = div64_u64(capacity << SCHED_CAPACITY_SHIFT,
capacity_scale);
topology_set_cpu_scale(cpu, capacity);
@@ -321,15 +320,15 @@ bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
cpu_node, raw_capacity[cpu]);

/*
* Update freq_factor for calculating early boot cpu capacities.
* Update capacity_freq_ref for calculating early boot CPU capacities.
* For non-clk CPU DVFS mechanism, there's no way to get the
* frequency value now, assuming they are running at the same
* frequency (by keeping the initial freq_factor value).
* frequency (by keeping the initial capacity_freq_ref value).
*/
cpu_clk = of_clk_get(cpu_node, 0);
if (!PTR_ERR_OR_ZERO(cpu_clk)) {
per_cpu(freq_factor, cpu) =
clk_get_rate(cpu_clk) / 1000;
per_cpu(capacity_freq_ref, cpu) =
clk_get_rate(cpu_clk) / HZ_PER_KHZ;
clk_put(cpu_clk);
}
} else {
@@ -398,9 +397,6 @@
struct cpufreq_policy *policy = data;
int cpu;

if (!raw_capacity)
return 0;

if (val != CPUFREQ_CREATE_POLICY)
return 0;

@@ -411,12 +407,14 @@
cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);

for_each_cpu(cpu, policy->related_cpus)
per_cpu(freq_factor, cpu) = policy->cpuinfo.max_freq / 1000;
per_cpu(capacity_freq_ref, cpu) = policy->cpuinfo.max_freq;

if (cpumask_empty(cpus_to_visit)) {
topology_normalize_cpu_scale();
schedule_work(&update_topology_flags_work);
free_raw_capacity();
if (raw_capacity) {
topology_normalize_cpu_scale();
schedule_work(&update_topology_flags_work);
free_raw_capacity();
}
pr_debug("cpu_capacity: parsing done\n");
schedule_work(&parsing_done_work);
}
@@ -436,7 +434,7 @@ static int __init register_cpufreq_notifier(void)
* On ACPI-based systems skip registering cpufreq notifier as cpufreq
* information is not needed for cpu capacity initialization.
*/
if (!acpi_disabled || !raw_capacity)
if (!acpi_disabled)
return -EINVAL;

if (!alloc_cpumask_var(&cpus_to_visit, GFP_KERNEL))
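For reference, the normalization above now runs on kHz values from 'capacity_freq_ref' instead of the old MHz 'freq_factor'; the math itself is unchanged. A worked example with made-up big.LITTLE values:

/*
 * Worked example for topology_normalize_cpu_scale(), illustrative values:
 *
 *   big:    raw_capacity = 1024, capacity_freq_ref = 2400000 kHz
 *   little: raw_capacity =  512, capacity_freq_ref = 1800000 kHz
 *
 *   capacity_scale = max(1024 * 2400000, 512 * 1800000) = 2457600000
 *
 *   big:    (1024 * 2400000 << SCHED_CAPACITY_SHIFT) / 2457600000 = 1024
 *   little: ( 512 * 1800000 << SCHED_CAPACITY_SHIFT) / 2457600000 =  384
 */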
4 changes: 2 additions & 2 deletions drivers/cpufreq/cpufreq.c
@@ -454,7 +454,7 @@ void cpufreq_freq_transition_end(struct cpufreq_policy *policy,

arch_set_freq_scale(policy->related_cpus,
policy->cur,
policy->cpuinfo.max_freq);
arch_scale_freq_ref(policy->cpu));

spin_lock(&policy->transition_lock);
policy->transition_ongoing = false;
@@ -2188,7 +2188,7 @@ unsigned int cpufreq_driver_fast_switch(struct cpufreq_policy *policy,

policy->cur = freq;
arch_set_freq_scale(policy->related_cpus, freq,
policy->cpuinfo.max_freq);
arch_scale_freq_ref(policy->cpu));
cpufreq_stats_record_transition(policy, freq);

if (trace_cpu_frequency_enabled()) {
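Both call sites above now pass the per-CPU reference frequency from arch_scale_freq_ref() instead of policy->cpuinfo.max_freq. Assuming the common topology_set_freq_scale() backend, the resulting scale factor is simply the current frequency relative to that reference — a sketch with illustrative numbers:

/*
 * Sketch of the scale-factor math (illustrative values, assuming the
 * topology_set_freq_scale() backend):
 *
 *   cur_freq              = 1200000 kHz
 *   arch_scale_freq_ref() = 2400000 kHz
 *
 *   arch_freq_scale = (1200000 << SCHED_CAPACITY_SHIFT) / 2400000 = 512
 *
 * i.e. the CPU is currently running at half its reference performance.
 */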
7 changes: 7 additions & 0 deletions include/linux/arch_topology.h
@@ -27,6 +27,13 @@ static inline unsigned long topology_get_cpu_scale(int cpu)

void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity);

DECLARE_PER_CPU(unsigned long, capacity_freq_ref);

static inline unsigned long topology_get_freq_ref(int cpu)
{
return per_cpu(capacity_freq_ref, cpu);
}

DECLARE_PER_CPU(unsigned long, arch_freq_scale);

static inline unsigned long topology_get_freq_scale(int cpu)
1 change: 1 addition & 0 deletions include/linux/cpufreq.h
@@ -1220,6 +1220,7 @@ void arch_set_freq_scale(const struct cpumask *cpus,
{
}
#endif

/* the following are really really optional */
extern struct freq_attr cpufreq_freq_attr_scaling_available_freqs;
extern struct freq_attr cpufreq_freq_attr_scaling_boost_freqs;
6 changes: 3 additions & 3 deletions include/linux/energy_model.h
@@ -224,7 +224,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
unsigned long max_util, unsigned long sum_util,
unsigned long allowed_cpu_cap)
{
unsigned long freq, scale_cpu;
unsigned long freq, ref_freq, scale_cpu;
struct em_perf_state *ps;
int cpu;

@@ -241,11 +241,11 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
*/
cpu = cpumask_first(to_cpumask(pd->cpus));
scale_cpu = arch_scale_cpu_capacity(cpu);
ps = &pd->table[pd->nr_perf_states - 1];
ref_freq = arch_scale_freq_ref(cpu);

max_util = map_util_perf(max_util);
max_util = min(max_util, allowed_cpu_cap);
freq = map_util_freq(max_util, ps->frequency, scale_cpu);
freq = map_util_freq(max_util, ref_freq, scale_cpu);

/*
* Find the lowest performance state of the Energy Model above the
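The energy model now maps utilization through the reference frequency instead of the table's highest performance state. Assuming the usual definition map_util_freq(util, freq, cap) = freq * util / cap, a worked example with made-up values:

/*
 * Illustrative only:
 *
 *   max_util = 512, ref_freq = 2000000 kHz, scale_cpu = 1024
 *   freq     = 2000000 * 512 / 1024 = 1000000 kHz
 *
 * em_cpu_energy() then selects the lowest performance state whose
 * frequency is at or above this value.
 */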
49 changes: 12 additions & 37 deletions include/linux/sched.h
@@ -416,42 +416,6 @@ struct load_weight {
u32 inv_weight;
};

/**
* struct util_est - Estimation utilization of FAIR tasks
* @enqueued: instantaneous estimated utilization of a task/cpu
* @ewma: the Exponential Weighted Moving Average (EWMA)
* utilization of a task
*
* Support data structure to track an Exponential Weighted Moving Average
* (EWMA) of a FAIR task's utilization. New samples are added to the moving
* average each time a task completes an activation. Sample's weight is chosen
* so that the EWMA will be relatively insensitive to transient changes to the
* task's workload.
*
* The enqueued attribute has a slightly different meaning for tasks and cpus:
* - task: the task's util_avg at last task dequeue time
* - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
* Thus, the util_est.enqueued of a task represents the contribution on the
* estimated utilization of the CPU where that task is currently enqueued.
*
* Only for tasks we track a moving average of the past instantaneous
* estimated utilization. This allows to absorb sporadic drops in utilization
* of an otherwise almost periodic task.
*
* The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
* updates. When a task is dequeued, its util_est should not be updated if its
* util_avg has not been updated in the meantime.
* This information is mapped into the MSB bit of util_est.enqueued at dequeue
* time. Since max value of util_est.enqueued for a task is 1024 (PELT util_avg
* for a task) it is safe to use MSB.
*/
struct util_est {
unsigned int enqueued;
unsigned int ewma;
#define UTIL_EST_WEIGHT_SHIFT 2
#define UTIL_AVG_UNCHANGED 0x80000000
} __attribute__((__aligned__(sizeof(u64))));

/*
* The load/runnable/util_avg accumulates an infinite geometric series
* (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
@@ -506,11 +470,22 @@ struct sched_avg {
unsigned long load_avg;
unsigned long runnable_avg;
unsigned long util_avg;
struct util_est util_est;
unsigned int util_est;
DEEPIN_KABI_RESERVE(1)
DEEPIN_KABI_RESERVE(2)
} ____cacheline_aligned;

/*
* The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
* updates. When a task is dequeued, its util_est should not be updated if its
* util_avg has not been updated in the meantime.
* This information is mapped into the MSB bit of util_est at dequeue time.
* Since max value of util_est for a task is 1024 (PELT util_avg for a task)
* it is safe to use MSB.
*/
#define UTIL_EST_WEIGHT_SHIFT 2
#define UTIL_AVG_UNCHANGED 0x80000000

struct sched_statistics {
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
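With util_est collapsed from a two-field struct into a single unsigned int, the MSB carries the UTIL_AVG_UNCHANGED sync flag and the low bits carry the estimate (safe, since a task's estimate tops out at 1024). A sketch of unpacking such a field — the helper names are made up for illustration:

#define UTIL_AVG_UNCHANGED	0x80000000

/* Sketch only: unpacking the single util_est word. */
static inline unsigned int util_est_value(unsigned int util_est)
{
	return util_est & ~UTIL_AVG_UNCHANGED;	/* low bits: the estimate */
}

static inline bool util_avg_unchanged(unsigned int util_est)
{
	return util_est & UTIL_AVG_UNCHANGED;	/* MSB: util_avg sync flag */
}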
8 changes: 8 additions & 0 deletions include/linux/sched/topology.h
@@ -275,6 +275,14 @@ void arch_update_thermal_pressure(const struct cpumask *cpus,
{ }
#endif

#ifndef arch_scale_freq_ref
static __always_inline
unsigned int arch_scale_freq_ref(int cpu)
{
return 0;
}
#endif

static inline int task_node(const struct task_struct *p)
{
return cpu_to_node(task_cpu(p));
2 changes: 1 addition & 1 deletion kernel/sched/core.c
@@ -10047,7 +10047,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
rq->sd = NULL;
rq->rd = NULL;
rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
rq->cpu_capacity = SCHED_CAPACITY_SCALE;
rq->balance_callback = &balance_push_callback;
rq->active_balance = 0;
rq->next_balance = jiffies;
2 changes: 1 addition & 1 deletion kernel/sched/cpudeadline.c
@@ -131,7 +131,7 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
if (!dl_task_fits_capacity(p, cpu)) {
cpumask_clear_cpu(cpu, later_mask);

cap = capacity_orig_of(cpu);
cap = arch_scale_cpu_capacity(cpu);

if (cap > max_cap ||
(cpu == task_cpu(p) && cap == max_cap)) {