Age | Commit message (Collapse) | Author |
|
Patch series "fork: Page operation cleanups in the fork code", v3.
This patchset consists of outtakes from a 1 year+ old patchset from Pasha,
which all stand on their own. See:
https://lore.kernel.org/all/20240311164638.2015063-1-pasha.tatashin@soleen.com/
These are good cleanups for readability so I split these off, rebased on
v6.15-rc1, addressed review comments and send them separately.
All mentions of dynamic stack are removed from the patchset as we have no
idea whether that will go anywhere.
This patch (of 3):
There is unneeded OR in the ifdef functions that are used to allocate and
free kernel stacks based on direct map or vmap.
Therefore, clean up by changing the order so OR is no longer needed.
[linus.walleij@linaro.org: rebased]
Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-1-e6c69dd356f2@linaro.org
Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-0-e6c69dd356f2@linaro.org
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://lore.kernel.org/20240311164638.2015063-3-pasha.tatashin@soleen.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "sysfs: add counters for lockups and stalls", v2.
Commits 9db89b411170 ("exit: Expose "oops_count" to sysfs") and
8b05aa263361 ("panic: Expose "warn_count" to sysfs") added counters for
oopses and warnings to sysfs, and these two patches do the same for
hard/soft lockups and RCU stalls.
All of these counters are useful for monitoring tools to detect whether
the machine is healthy. If the kernel has experienced a lockup or a
stall, it's probably due to a kernel bug, and I'd like to detect that
quickly and easily. There is currently no way to detect that, other than
parsing dmesg. Or observing indirect effects: such as certain tasks not
responding, but then I need to observe all tasks, and it may take a while
until these effects become visible/measurable. I'd rather be able to
detect the primary cause more quickly, possibly before everything falls
apart.
This patch (of 2):
There is /proc/sys/kernel/hung_task_detect_count, /sys/kernel/warn_count
and /sys/kernel/oops_count but there is no userspace-accessible counter
for hard/soft lockups. Having this is useful for monitoring tools.
Link: https://lkml.kernel.org/r/20250504180831.4190860-1-max.kellermann@ionos.com
Link: https://lkml.kernel.org/r/20250504180831.4190860-2-max.kellermann@ionos.com
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Cc:
Cc: Core Minyard <cminyard@mvista.com>
Cc: Doug Anderson <dianders@chromium.org>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Crash kernel will retrieve the dm crypt keys based on the dmcryptkeys
command line parameter. When user space writes the key description to
/sys/kernel/config/crash_dm_crypt_key/restore, the crash kernel will save
the encryption keys to the user keyring. Then user space e.g.
cryptsetup's --volume-key-keyring API can use it to unlock the encrypted
device.
Link: https://lkml.kernel.org/r/20250502011246.99238-6-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jan Pazdziora <jpazdziora@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When there are CPU and memory hot un/plugs, the dm crypt keys may need to
be reloaded again depending on the solution for crash hotplug support.
Currently, there are two solutions. One is to utilizes udev to instruct
user space to reload the kdump kernel image and initrd, elfcorehdr and etc
again. The other is to only update the elfcorehdr segment introduced in
commit 247262756121 ("crash: add generic infrastructure for crash hotplug
support").
For the 1st solution, the dm crypt keys need to be reloaded again. The
user space can write true to /sys/kernel/config/crash_dm_crypt_key/reuse
so the stored keys can be re-used.
For the 2nd solution, the dm crypt keys don't need to be reloaded.
Currently, only x86 supports the 2nd solution. If the 2nd solution gets
extended to all arches, this patch can be dropped.
Link: https://lkml.kernel.org/r/20250502011246.99238-5-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jan Pazdziora <jpazdziora@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When the kdump kernel image and initrd are loaded, the dm crypts keys will
be read from keyring and then stored in kdump reserved memory.
Assume a key won't exceed 256 bytes thus MAX_KEY_SIZE=256 according to
"cryptsetup benchmark".
Link: https://lkml.kernel.org/r/20250502011246.99238-4-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jan Pazdziora <jpazdziora@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A configfs /sys/kernel/config/crash_dm_crypt_keys is provided for user
space to make the dm crypt keys persist for the kdump kernel. Take the
case of dumping to a LUKS-encrypted target as an example, here is the life
cycle of the kdump copies of LUKS volume keys,
1. After the 1st kernel loads the initramfs during boot, systemd uses
an user-input passphrase to de-crypt the LUKS volume keys or simply
TPM-sealed volume keys and then save the volume keys to specified
keyring (using the --link-vk-to-keyring API) and the keys will expire
within specified time.
2. A user space tool (kdump initramfs loader like kdump-utils) create
key items inside /sys/kernel/config/crash_dm_crypt_keys to inform
the 1st kernel which keys are needed.
3. When the kdump initramfs is loaded by the kexec_file_load
syscall, the 1st kernel will iterate created key items, save the
keys to kdump reserved memory.
4. When the 1st kernel crashes and the kdump initramfs is booted, the
kdump initramfs asks the kdump kernel to create a user key using the
key stored in kdump reserved memory by writing yes to
/sys/kernel/crash_dm_crypt_keys/restore. Then the LUKS encrypted
device is unlocked with libcryptsetup's --volume-key-keyring API.
5. The system gets rebooted to the 1st kernel after dumping vmcore to
the LUKS encrypted device is finished
Eventually the keys have to stay in the kdump reserved memory for the
kdump kernel to unlock encrypted volumes. During this process, some
measures like letting the keys expire within specified time are desirable
to reduce security risk.
This patch assumes,
1) there are 128 LUKS devices at maximum to be unlocked thus
MAX_KEY_NUM=128.
2) a key description won't exceed 128 bytes thus KEY_DESC_MAX_LEN=128.
And here is a demo on how to interact with
/sys/kernel/config/crash_dm_crypt_keys,
# Add key #1
mkdir /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720
# Add key #1's description
echo cryptsetup:7d26b7b4-e342-4d2d-b660-7426b0996720 > /sys/kernel/config/crash_dm_crypt_keys/description
# how many keys do we have now?
cat /sys/kernel/config/crash_dm_crypt_keys/count
1
# Add key# 2 in the same way
# how many keys do we have now?
cat /sys/kernel/config/crash_dm_crypt_keys/count
2
# the tree structure of /crash_dm_crypt_keys configfs
tree /sys/kernel/config/crash_dm_crypt_keys/
/sys/kernel/config/crash_dm_crypt_keys/
├── 7d26b7b4-e342-4d2d-b660-7426b0996720
│ └── description
├── count
├── fce2cd38-4d59-4317-8ce2-1fd24d52c46a
│ └── description
Link: https://lkml.kernel.org/r/20250502011246.99238-3-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Jan Pazdziora <jpazdziora@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Support kdump with LUKS encryption by reusing LUKS volume
keys", v9.
LUKS is the standard for Linux disk encryption, widely adopted by users,
and in some cases, such as Confidential VMs, it is a requirement. With
kdump enabled, when the first kernel crashes, the system can boot into the
kdump/crash kernel to dump the memory image (i.e., /proc/vmcore) to a
specified target. However, there are two challenges when dumping vmcore
to a LUKS-encrypted device:
- Kdump kernel may not be able to decrypt the LUKS partition. For some
machines, a system administrator may not have a chance to enter the
password to decrypt the device in kdump initramfs after the 1st kernel
crashes; For cloud confidential VMs, depending on the policy the
kdump kernel may not be able to unseal the keys with TPM and the
console virtual keyboard is untrusted.
- LUKS2 by default use the memory-hard Argon2 key derivation function
which is quite memory-consuming compared to the limited memory reserved
for kdump. Take Fedora example, by default, only 256M is reserved for
systems having memory between 4G-64G. With LUKS enabled, ~1300M needs
to be reserved for kdump. Note if the memory reserved for kdump can't
be used by 1st kernel i.e. an user sees ~1300M memory missing in the
1st kernel.
Besides users (at least for Fedora) usually expect kdump to work out of
the box i.e. no manual password input or custom crashkernel value is
needed. And it doesn't make sense to derivate the keys again in kdump
kernel which seems to be redundant work.
This patchset addresses the above issues by making the LUKS volume keys
persistent for kdump kernel with the help of cryptsetup's new APIs
(--link-vk-to-keyring/--volume-key-keyring). Here is the life cycle of
the kdump copies of LUKS volume keys,
1. After the 1st kernel loads the initramfs during boot, systemd
use an user-input passphrase to de-crypt the LUKS volume keys
or TPM-sealed key and then save the volume keys to specified keyring
(using the --link-vk-to-keyring API) and the key will expire within
specified time.
2. A user space tool (kdump initramfs loader like kdump-utils) create
key items inside /sys/kernel/config/crash_dm_crypt_keys to inform
the 1st kernel which keys are needed.
3. When the kdump initramfs is loaded by the kexec_file_load
syscall, the 1st kernel will iterate created key items, save the
keys to kdump reserved memory.
4. When the 1st kernel crashes and the kdump initramfs is booted, the
kdump initramfs asks the kdump kernel to create a user key using the
key stored in kdump reserved memory by writing yes to
/sys/kernel/crash_dm_crypt_keys/restore. Then the LUKS encrypted
device is unlocked with libcryptsetup's --volume-key-keyring API.
5. The system gets rebooted to the 1st kernel after dumping vmcore to
the LUKS encrypted device is finished
After libcryptsetup saving the LUKS volume keys to specified keyring,
whoever takes this should be responsible for the safety of these copies of
keys. The keys will be saved in the memory area exclusively reserved for
kdump where even the 1st kernel has no direct access. And further more,
two additional protections are added,
- save the copy randomly in kdump reserved memory as suggested by Jan
- clear the _PAGE_PRESENT flag of the page that stores the copy as
suggested by Pingfan
This patchset only supports x86. There will be patches to support other
architectures once this patch set gets merged.
This patch (of 9):
Currently, kexec_buf is placed in order which means for the same machine,
the info in the kexec_buf is always located at the same position each time
the machine is booted. This may cause a risk for sensitive information
like LUKS volume key. Now struct kexec_buf has a new field random which
indicates it's supposed to be placed in a random position.
Note this feature is enabled only when CONFIG_CRASH_DUMP is enabled. So
it only takes effect for kdump and won't impact kexec reboot.
Link: https://lkml.kernel.org/r/20250502011246.99238-1-coxu@redhat.com
Link: https://lkml.kernel.org/r/20250502011246.99238-2-coxu@redhat.com
Signed-off-by: Coiby Xu <coxu@redhat.com>
Suggested-by: Jan Pazdziora <jpazdziora@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: "Daniel P. Berrange" <berrange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Liu Pingfan <kernelfans@gmail.com>
Cc: Milan Broz <gmazyland@gmail.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There is no reason to restrict scx_bpf_select_cpu_dfl() invocations to
ops.select_cpu() while allowing scx_bpf_select_cpu_and() to be used from
multiple contexts, as both provide equivalent functionality, with the
latter simply accepting an additional "allowed" cpumask.
Therefore, unify the two APIs, enabling both kfuncs to be used from
ops.select_cpu(), ops.enqueue(), and unlocked contexts (e.g., via BPF
test_run).
This allows schedulers to implement a consistent idle CPU selection
policy and helps reduce code duplication.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
&desc->lock is acquired on 2 consecutive lines in hwirq_show(). This leads
obviously to a deadlock. Drop the raw_spin_lock_irq() and keep guard().
Fixes: 5d964a9f7cd8 ("genirq/irqdesc: Switch to lock guards")
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250521142541.3832130-1-claudiu.beznea.uj@bp.renesas.com
|
|
The PERF_RECORD_THROTTLE records are dumped for all throttled events.
It's not necessary for group events, which are throttled altogether.
Optimize it by only dump the throttle log for the leader.
The sample right after the THROTTLE record must be generated by the
actual target event. It is good enough for the perf tool to locate the
actual target event.
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-3-kan.liang@linux.intel.com
|
|
The current throttle logic doesn't work well with a group, e.g., the
following sampling-read case.
$ perf record -e "{cycles,cycles}:S" ...
$ perf report -D | grep THROTTLE | tail -2
THROTTLE events: 426 ( 9.0%)
UNTHROTTLE events: 425 ( 9.0%)
$ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1):
... sample_read:
.... group nr 2
..... id 0000000000000327, value 000000000cbb993a, lost 0
..... id 0000000000000328, value 00000002211c26df, lost 0
The second cycles event has a much larger value than the first cycles
event in the same group.
The current throttle logic in the generic code only logs the THROTTLE
event. It relies on the specific driver implementation to disable
events. For all ARCHs, the implementation is similar. Only the event is
disabled, rather than the group.
The logic to disable the group should be generic for all ARCHs. Add the
logic in the generic code. The following patch will remove the buggy
driver-specific implementation.
The throttle only happens when an event is overflowed. Stop the entire
group when any event in the group triggers the throttle.
The MAX_INTERRUPTS is set to all throttle events.
The unthrottled could happen in 3 places.
- event/group sched. All events in the group are scheduled one by one.
All of them will be unthrottled eventually. Nothing needs to be
changed.
- The perf_adjust_freq_unthr_events for each tick. Needs to restart the
group altogether.
- The __perf_event_period(). The whole group needs to be restarted
altogether as well.
With the fix,
$ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2):
... sample_read:
.... group nr 2
..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0
..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0
Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-2-kan.liang@linux.intel.com
|
|
The kerneldoc for futex_wait_setup() states it can return "0" or "<1".
This isn't true because the error case is "<0" not less than 1.
Document that <0 is returned on error. Drop the possible return values
and state possible reasons.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20250517151455.1065363-6-bigeasy@linutronix.de
|
|
The commit dfa0a574cbc47 ("sched/uclamg: Handle delayed dequeue")
has add the sched_delayed check to prevent double uclamp_dec/inc.
However, it put the uclamp_rq_inc() after enqueue_task().
This may lead to the following issues:
When a task with uclamp goes through enqueue_task() and could trigger
cpufreq update, its uclamp won't even be considered in the cpufreq
update. It is only after enqueue will the uclamp be added to rq
buckets, and cpufreq will only pick it up at the next update.
This could cause a delay in frequency updating. It may affect
the performance(uclamp_min > 0) or power(uclamp_max < 1024).
So, just like util_est, put the uclamp_rq_inc() before enqueue_task().
And as for the sched_delayed_task, same as util_est, using the
sched_delayed flag to prevent inc the sched_delayed_task's uclamp,
using the ENQUEUE_DELAYED flag to allow inc the sched_delayed_task's uclamp
which is being woken up.
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20250417043457.10632-3-xuewen.yan@unisoc.com
|
|
To prevent double enqueue/dequeue of the util-est for sched_delayed tasks,
commit 729288bc6856 ("kernel/sched: Fix util_est accounting for DELAY_DEQUEUE")
added the corresponding check. This check excludes double en/dequeue during
task migration and priority changes.
In fact, these conditions can be simplified.
For util_est_dequeue, we know that sched_delayed flag is set in dequeue_entity.
When the task is sleeping, we need to call util_est_dequeue to subtract
util-est from the cfs_rq. At this point, sched_delayed has not yet been set.
If we find that sched_delayed is already set, it indicates that this task
has already called dequeue_task_fair once. In this case, there is no need to
call util_est_dequeue again. Therefore, simply checking the sched_delayed flag
should be sufficient to prevent unnecessary util_est updates during the dequeue.
For util_est_enqueue, our goal is to add the util_est to the cfs_rq
when task enqueue. However, we don't want to add the util_est of a
sched_delayed task to the cfs_rq because the task is sleeping.
Therefore, we can exclude the util_est_enqueue for sched_delayed tasks
by checking the sched_delayed flag. However, when waking up a delayed task,
the sched_delayed flag is cleared after util_est_enqueue. As a result,
if we only check the sched_delayed flag, we would miss the util_est_enqueue.
Since waking up a sched_delayed task calls enqueue_task with the ENQUEUE_DELAYED flag,
we can determine whether to call util_est_enqueue by checking if the
enqueue_flag contains ENQUEUE_DELAYED.
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20250417043457.10632-2-xuewen.yan@unisoc.com
|
|
Delayed dequeued feature keeps a sleeping task enqueued until its
lag has elapsed. As a result, it stays also visible in rq->nr_running.
So when in wake_affine_idle(), we should use the real running-tasks
in rq to check whether we should place the wake-up task to
current cpu.
On the other hand, add a helper function to return the nr-delayed.
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Reviewed-and-tested-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20250303105241.17251-2-xuewen.yan@unisoc.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"This fixes a regression in padata as well as an ancient double-free
bug in af_alg"
* tag 'v6.15-p7' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: algif_hash - fix double free in hash_accept
padata: do not leak refcount in reorder_work
|
|
Allow scx_bpf_select_cpu_and() to be used from an unlocked context, in
addition to ops.enqueue() or ops.select_cpu().
This enables schedulers, including user-space ones, to implement a
consistent idle CPU selection policy and helps reduce code duplication.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Validate locking correctness when accessing p->nr_cpus_allowed and
p->cpus_ptr inside scx_bpf_select_cpu_and(): if the rq lock is held,
access is safe; otherwise, require that p->pi_lock is held.
This allows to catch potential unsafe calls to scx_bpf_select_cpu_and().
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Relocate the scx_kf_allowed_if_unlocked(), so it can be used from other
source files (e.g., ext_idle.c).
No functional change.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
There are a few places where a conditional check is performed to validate a
given css on its rstat participation. This new helper tries to make the
code more readable where this check is performed.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
It is possible to eliminate contention between subsystems when
updating/flushing stats by using subsystem-specific locks. Let the existing
rstat locks be dedicated to the cgroup base stats and rename them to
reflect that. Add similar locks to the cgroup_subsys struct for use with
individual subsystems.
Lock initialization is done in the new function ss_rstat_init(ss) which
replaces cgroup_rstat_boot(void). If NULL is passed to this function, the
global base stat locks will be initialized. Otherwise, the subsystem locks
will be initialized.
Change the existing lock helper functions to accept a reference to a css.
Then within these functions, conditionally select the appropriate locks
based on the subsystem affiliation of the given css. Add helper functions
for this selection routine to avoid repeated code.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Different subsystems may call cgroup_rstat_updated() within the same
cgroup, resulting in a tree of pending updates from multiple subsystems.
When one of these subsystems is flushed via cgroup_rstat_flushed(), all
other subsystems with pending updates on the tree will also be flushed.
Change the paradigm of having a single rstat tree for all subsystems to
having separate trees for each subsystem. This separation allows for
subsystems to perform flushes without the side effects of other subsystems.
As an example, flushing the cpu stats will no longer cause the memory stats
to be flushed and vice versa.
In order to achieve subsystem-specific trees, change the tree node type
from cgroup to cgroup_subsys_state pointer. Then remove those pointers from
the cgroup and instead place them on the css. Finally, change update/flush
functions to make use of the different node type (css). These changes allow
a specific subsystem to be associated with an update or flush. Separate
rstat trees will now exist for each unique subsystem.
Since updating/flushing will now be done at the subsystem level, there is
no longer a need to keep track of updated css nodes at the cgroup level.
The list management of these nodes done within the cgroup (rstat_css_list
and related) has been removed accordingly.
Conditional guards for checking validity of a given css were placed within
css_rstat_updated/flush() to prevent undefined behavior occuring from kfunc
usage in bpf programs. Guards were also placed within css_rstat_init/exit()
in order to help consolidate calls to them. At call sites for all four
functions, the existing guards were removed.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Adjust the implementation of css_is_cgroup() so that it compares the given
css to cgroup::self. Rename the function to css_is_self() in order to
reflect that. Change the existing css->ss NULL check to a warning in the
true branch. Finally, adjust call sites to use the new function name.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
An early init subsystem that attempts to make use of rstat can lead to
failures during early boot. The reason for this is the timing in which the
css's of the root cgroup have css_online() invoked on them. At the point of
this call, there is a stated assumption that a cgroup has "successfully
completed all allocations" [0]. An example of a subsystem that relies on
the previously mentioned assumption [0] is the memory subsystem. Within its
implementation of css_online(), work is queued to asynchronously begin
flushing via rstat. In the early init path for a given subsystem, having
rstat enabled leads to this sequence:
cgroup_init_early()
for_each_subsys(ss, ssid)
if (ss->early_init)
cgroup_init_subsys(ss, true)
cgroup_init_subsys(ss, early_init)
css = ss->css_alloc(...)
init_and_link_css(css, ss, ...)
...
online_css(css)
online_css(css)
ss = css->ss
ss->css_online(css)
Continuing to use the memory subsystem as an example, the issue with this
sequence is that css_rstat_init() has not been called yet. This means there
is now a race between the pending async work to flush rstat and the call to
css_rstat_init(). So a flush can occur within the given cgroup while the
rstat fields are not initialized.
Since we are in the early init phase, the rstat fields cannot be
initialized because they require per-cpu allocations. So it's not possible
to have css_rstat_init() called early enough (before online_css()). This
patch treats the combination of early init and rstat the same as as other
invalid conditions.
[0] Documentation/admin-guide/cgroup-v1/cgroups.rst (section: css_online)
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Throughout the verifier's logic, there are multiple checks for
inconsistent states that should never happen and would indicate a
verifier bug. These bugs are typically logged in the verifier logs and
sometimes preceded by a WARN_ONCE.
This patch reworks these checks to consistently emit a verifier log AND
a warning when CONFIG_DEBUG_KERNEL is enabled. The consistent use of
WARN_ONCE should help fuzzers (ex. syzkaller) expose any situation
where they are actually able to reach one of those buggy verifier
states.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/aCs1nYvNNMq8dAWP@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A recent patch that addressed a UAF introduced a reference count leak:
the parallel_data refcount is incremented unconditionally, regardless
of the return value of queue_work(). If the work item is already queued,
the incremented refcount is never decremented.
Fix this by checking the return value of queue_work() and decrementing
the refcount when necessary.
Resolves:
Unreferenced object 0xffff9d9f421e3d80 (size 192):
comm "cryptomgr_probe", pid 157, jiffies 4294694003
hex dump (first 32 bytes):
80 8b cf 41 9f 9d ff ff b8 97 e0 89 ff ff ff ff ...A............
d0 97 e0 89 ff ff ff ff 19 00 00 00 1f 88 23 00 ..............#.
backtrace (crc 838fb36):
__kmalloc_cache_noprof+0x284/0x320
padata_alloc_pd+0x20/0x1e0
padata_alloc_shell+0x3b/0xa0
0xffffffffc040a54d
cryptomgr_probe+0x43/0xc0
kthread+0xf6/0x1f0
ret_from_fork+0x2f/0x50
ret_from_fork_asm+0x1a/0x30
Fixes: dd7d37ccf6b1 ("padata: avoid UAF for reorder_work")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dominik Grzegorzek <dominik.grzegorzek@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
The text_size bit referred to by the comment has been removed as of commit
ac3b43283923 ("module: replace module_layout with module_memory")
and is thus no longer relevant. Remove it and comment about the contents of
the masks array instead.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20250429113242.998312-23-vschneid@redhat.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
|
|
Section .static_call_sites holds data structures that need to be sorted and
processed only at module load time. This initial processing happens in
static_call_add_module(), which is invoked as a callback to the
MODULE_STATE_COMING notification from prepare_coming_module().
The section is never modified afterwards. Make it therefore read-only after
module initialization to avoid any (non-)accidental modifications.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250306131430.7016-4-petr.pavlu@suse.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
|
|
Move the logic to mark special sections as read-only after module
initialization into a separate function, along other related code in
strict_rwx.c. Use a table with names of such sections to make it easier to
add more.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250306131430.7016-3-petr.pavlu@suse.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
|
|
Minor cleanup, this is a non-functional change.
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Link: https://lore.kernel.org/r/20250306131430.7016-2-petr.pavlu@suse.com
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"Nine singleton hotfixes, all MM. Four are cc:stable"
* tag 'mm-hotfixes-stable-2025-05-17-09-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: userfaultfd: correct dirty flags set for both present and swap pte
zsmalloc: don't underflow size calculation in zs_obj_write()
mm/page_alloc: fix race condition in unaccepted memory handling
mm/page_alloc: ensure try_alloc_pages() plays well with unaccepted memory
MAINTAINERS: add mm GUP section
mm/codetag: move tag retrieval back upfront in __free_pages()
mm/memory: fix mapcount / refcount sanity check for mTHP reuse
kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork()
mm: hugetlb: fix incorrect fallback for subpool
|
|
Add a helper to check if an event is in freq mode to improve readability.
No functional changes.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250516182853.2610284-2-kan.liang@linux.intel.com
|
|
`pr_cont()` unfortunately does not work here, as other parts of the
Linux kernel log between the two log lines:
[18445.295056] r8152-cfgselector 4-1.1.3: USB disconnect, device number 5
[18445.295112] OOM killer enabled.
[18445.295115] Restarting tasks ...
[18445.295185] usb 3-1: USB disconnect, device number 2
[18445.295193] usb 3-1.1: USB disconnect, device number 3
[18445.296262] usb 3-1.5: USB disconnect, device number 4
[18445.297017] done.
[18445.297029] random: crng reseeded on system resumption
`pr_cont()` also uses the default log level, normally warning, if the
corresponding log line is interrupted.
Therefore, replace the `pr_cont()`, and explicitly log it as a separate
line with log level info:
Restarting tasks: Starting
[…]
Restarting tasks: Done
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://patch.msgid.link/20250511174648.950430-1-pmenzel@molgen.mpg.de
[ rjw: Rebase on top of an earlier analogous change ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Creating an irq domain that serves as an MSI parent requires
a substantial amount of esoteric boiler-plate code, some of
which is often provided twice (such as the bus token).
To make things a bit simpler for the unsuspecting MSI tinkerer,
provide a helper that does it for them, and serves as documentation
of what needs to be provided.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250513172819.2216709-3-maz@kernel.org
|
|
irqdomain.c's kernel-doc exists, but is not plugged into Documentation/
yet.
Before plugging it in, fix it first: irq_domain_get_irq_data() and
irq_domain_set_info() were documented twice. Identically, by both
definitions for CONFIG_IRQ_DOMAIN_HIERARCHY and !CONFIG_IRQ_DOMAIN_HIERARCHY.
Therefore, switch the second kernel-doc into an ordinary comment -- change
"/**" to simple "/*". This avoids sphinx's: WARNING: Duplicate C
declaration
Next, in commit b7b377332b96 ("irqdomain: Fix the kernel-doc and plug it
into Documentation"), irqdomain.h's (header) kernel-doc was added into
core-api/genericirq.rst. But given the amount of irqdomain functions and
structures, move all these to core-api/irq/irq-domain.rst now.
Finally, add these newly fixed irqdomain.c's (source) docs there as
well.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-58-jirislaby@kernel.org
|
|
Most irq_domain_add_*() functions are unused now, so drop them. The
remaining ones are moved to the deprecated section and will be removed
during the merge window after the patches in various trees have been
merged.
Note: The Chinese docs are touched but unfinished. I cannot parse those.
[ tglx: Remove the leftover in irq-domain.rst and handle merge logistics ]
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-41-jirislaby@kernel.org
|
|
There is no reason to export the function as an extra symbol. It is
simple enough and is just a wrapper to already exported functions.
Therefore, switch the exported function to an inline.
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-13-jirislaby@kernel.org
|
|
All uses of of_node_to_fwnode() in non-irqdomain code were changed to
"officially" defined of_fwnode_handle(). Therefore, the former can be
dropped along with the last uses in the irqdomain code.
Due to merge logistics the inline cannot be dropped immediately. Move it to
a deprecated section, which will be removed during the merge window.
[ tglx: Handle merge logistics ]
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-12-jirislaby@kernel.org
|
|
'rcu/torture-for-6.16' into rcu/for-next
|
|
Currently, the ->gpwrap is not tested (at all per my testing) due to the
requirement of a large delta between a CPU's rdp->gp_seq and its node's
rnp->gpseq.
This results in no testing of ->gpwrap being set. This patch by default
adds 5 minutes of testing with ->gpwrap forced by lowering the delta
between rdp->gp_seq and rnp->gp_seq to just 8 GPs. All of this is
configurable, including the active time for the setting and a full
testing cycle.
By default, the first 25 minutes of a test will have the _default_
behavior there is right now (ULONG_MAX / 4) delta. Then for 5 minutes,
we switch to a smaller delta causing 1-2 wraps in 5 minutes. I believe
this is reasonable since we at least add a little bit of testing for
usecases where ->gpwrap is set.
[ Apply fix for Dan Carpenter's bug report on init path cleanup. ]
[ Apply kernel doc warning fix from Akira Yokosawa. ]
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
|
|
The rcu_torture_reader() and rcu_torture_fwd_prog_cr() functions
run CPU-bound for extended periods of time (tens or even
hundreds of milliseconds), so they invoke tick_dep_set_task() and
tick_dep_clear_task() to ensure that the scheduling-clock tick helps
move grace periods forward.
So why doesn't rcu_torture_fwd_prog_nr() also invoke tick_dep_set_task()
and tick_dep_clear_task()? Because the point of this function is to test
RCU's ability to (eventually) force grace periods forward even when the
tick has been disabled during long CPU-bound kernel execution.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
|
|
For built with CONFIG_PROVE_RCU=y and CONFIG_PREEMPT_RT=y kernels,
Disable BH does not change the SOFTIRQ corresponding bits in
preempt_count(), but change current->softirq_disable_cnt, this
resulted in the following splat:
WARNING: suspicious RCU usage
kernel/rcu/tree_plugin.h:36 Unsafe read of RCU_NOCB offloaded state!
stack backtrace:
CPU: 0 UID: 0 PID: 22 Comm: rcuc/0
Call Trace:
[ 0.407907] <TASK>
[ 0.407910] dump_stack_lvl+0xbb/0xd0
[ 0.407917] dump_stack+0x14/0x20
[ 0.407920] lockdep_rcu_suspicious+0x133/0x210
[ 0.407932] rcu_rdp_is_offloaded+0x1c3/0x270
[ 0.407939] rcu_core+0x471/0x900
[ 0.407942] ? lockdep_hardirqs_on+0xd5/0x160
[ 0.407954] rcu_cpu_kthread+0x25f/0x870
[ 0.407959] ? __pfx_rcu_cpu_kthread+0x10/0x10
[ 0.407966] smpboot_thread_fn+0x34c/0xa50
[ 0.407970] ? trace_preempt_on+0x54/0x120
[ 0.407977] ? __pfx_smpboot_thread_fn+0x10/0x10
[ 0.407982] kthread+0x40e/0x840
[ 0.407990] ? __pfx_kthread+0x10/0x10
[ 0.407994] ? rt_spin_unlock+0x4e/0xb0
[ 0.407997] ? rt_spin_unlock+0x4e/0xb0
[ 0.408000] ? __pfx_kthread+0x10/0x10
[ 0.408006] ? __pfx_kthread+0x10/0x10
[ 0.408011] ret_from_fork+0x40/0x70
[ 0.408013] ? __pfx_kthread+0x10/0x10
[ 0.408018] ret_from_fork_asm+0x1a/0x30
[ 0.408042] </TASK>
Currently, triggering an rdp offloaded state change need the
corresponding rdp's CPU goes offline, and at this time the rcuc
kthreads has already in parking state. this means the corresponding
rcuc kthreads can safely read offloaded state of rdp while it's
corresponding cpu is online.
This commit therefore add softirq_count() check for
Preempt-RT kernels.
Suggested-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
|
|
It's safer to using kcalloc() because it can prevent overflow
problem.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
|
|
This reverts commit f7345ccc62a4b880cf76458db5f320725f28e400.
swake_up_one_online() has been removed because hrtimers can now assign
a proper online target to hrtimers queued from offline CPUs. Therefore
remove the related hackery.
Link: https://lore.kernel.org/all/20241231170712.149394-4-frederic@kernel.org/
Reviewed-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
|
|
When counting the number of hardirqs in the x86 architecture,
it is essential to add arch_irq_stat_cpu to ensure accuracy.
For example, a CPU loop within the rcu_read_lock function.
Before:
[ 70.910184] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 70.910436] rcu: 3-....: (4999 ticks this GP) idle=***
[ 70.910711] rcu: hardirqs softirqs csw/system
[ 70.910870] rcu: number: 0 657 0
[ 70.911024] rcu: cputime: 0 0 2498 ==> 2498(ms)
[ 70.911278] rcu: (t=5001 jiffies g=3677 q=29 ncpus=8)
After:
[ 68.046132] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 68.046354] rcu: 2-....: (4999 ticks this GP) idle=***
[ 68.046628] rcu: hardirqs softirqs csw/system
[ 68.046793] rcu: number: 2498 663 0
[ 68.046951] rcu: cputime: 0 0 2496 ==> 2496(ms)
[ 68.047244] rcu: (t=5000 jiffies g=3825 q=4 ncpus=8)
Fixes: be42f00b73a0 ("rcu: Add RCU stall diagnosis information")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501090842.SfI6QPGS-lkp@intel.com/
Signed-off-by: Yongliang Gao <leonylgao@tencent.com>
Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Link: https://lore.kernel.org/r/20250216084109.3109837-1-leonylgao@gmail.com
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
|
|
It's now ok to perform a wake-up from an offline CPU because the
resulting armed scheduler bandwidth hrtimers are now correctly targeted
by hrtimer infrastructure.
Remove the obsolete hackerry.
Link: https://lore.kernel.org/all/20241231170712.149394-3-frederic@kernel.org/
Reviewed-by: Usama Arif <usamaarif642@gmail.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
|
|
Fix those:
./kernel/futex/futex.h:208: warning: Function parameter or struct member 'drop_hb_ref' not described in 'futex_q'
./kernel/futex/waitwake.c:343: warning: expecting prototype for futex_wait_queue(). Prototype was for futex_do_wait() instead
./kernel/futex/waitwake.c:594: warning: Function parameter or struct member 'task' not described in 'futex_wait_setup'
Fixes: 93f1b6d79a73 ("futex: Move futex_queue() into futex_wait_setup()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250512185641.0450a99b@canb.auug.org.au # report
Link: https://lore.kernel.org/r/20250515171641.24073-1-bp@kernel.org # submission
|
|
Cross-merge networking fixes after downstream PR (net-6.15-rc7).
Conflicts:
tools/testing/selftests/drivers/net/hw/ncdevmem.c
97c4e094a4b2 ("tests/ncdevmem: Fix double-free of queue array")
2f1a805f32ba ("selftests: ncdevmem: Implement devmem TCP TX")
https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au
Adjacent changes:
net/core/devmem.c
net/core/devmem.h
0afc44d8cdf6 ("net: devmem: fix kernel panic when netlink socket close after module unload")
bd61848900bf ("net: devmem: Implement TX path")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
perf always allocates contiguous AUX pages based on aux_watermark.
However, this contiguous allocation doesn't benefit all PMUs. For
instance, ARM SPE and TRBE operate with virtual pages, and Coresight
ETR allocates a separate buffer. For these PMUs, allocating contiguous
AUX pages unnecessarily exacerbates memory fragmentation. This
fragmentation can prevent their use on long-running devices.
This patch modifies the perf driver to be memory-friendly by default,
by allocating non-contiguous AUX pages. For PMUs requiring contiguous
pages (Intel BTS and some Intel PT), the existing
PERF_PMU_CAP_AUX_NO_SG capability can be used. For PMUs that don't
require but can benefit from contiguous pages (some Intel PT), a new
capability, PERF_PMU_CAP_AUX_PREFER_LARGE, is added to maintain their
existing behavior.
Signed-off-by: Yabin Cui <yabinc@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250508232642.148767-1-yabinc@google.com
|
|
Affinity-managed interrupts can be shut down and restarted during CPU
hotunplug/plug. Thereby the interrupt may be left in an unexpected state.
Specifically:
1. Interrupt is affine to CPU N
2. disable_irq() -> depth is 1
3. CPU N goes offline
4. irq_shutdown() -> depth is set to 1 (again)
5. CPU N goes online
6. irq_startup() -> depth is set to 0 (BUG! driver expects that the interrupt
still disabled)
7. enable_irq() -> depth underflow / unbalanced enable_irq() warning
This is only a problem for managed interrupts and CPU hotplug, all other
cases like request()/free()/request() truly needs to reset a possibly stale
disable depth value.
Provide a startup function, which takes the disable depth into account, and
invoked it for the managed interrupts in the CPU hotplug path.
This requires to change irq_shutdown() to do a depth increment instead of
setting it to 1, which allows to retain the disable depth, but is harmless
for the other code paths using irq_startup(), which will still reset the
disable depth unconditionally to keep the original correct behaviour.
A kunit tests will be added separately to cover some of these aspects.
[ tglx: Massaged changelog ]
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250514201353.3481400-2-briannorris@chromium.org
|