path: root/kernel
2025-03-17  sched/deadline: Generalize unique visiting of root domains  (Juri Lelli)
Bandwidth checks and updates that work on root domains currently employ a cookie mechanism for efficiency. This mechanism is very much tied to when root domains are first created and initialized. Generalize the cookie mechanism so that it can also be used later at runtime while updating root domains. Additionally, guard it with sched_domains_mutex, since domains need to be stable while updating them (and it will be required for further dynamic changes). Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MQaiXPvEeW_v7x@jlelli-thinkpadt14gen4.remote.csb
2025-03-17  sched/topology: Wrappers for sched_domains_mutex  (Juri Lelli)
Create wrappers for sched_domains_mutex so that it can transparently be used on both CONFIG_SMP and !CONFIG_SMP, as some functions will need to do. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MP5Oq9RB8jBs3y@jlelli-thinkpadt14gen4.remote.csb
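The wrapper pattern itself is small; a minimal userspace sketch of the idea follows, with a pthread mutex standing in for the kernel mutex. The wrapper names mirror the changelog's intent and are assumptions, not necessarily the names the patch uses:

  #include <pthread.h>
  #include <stdio.h>

  #ifdef CONFIG_SMP
  /* SMP build: wrappers serialize on the real mutex */
  static pthread_mutex_t sched_domains_mutex = PTHREAD_MUTEX_INITIALIZER;
  static void sched_domains_mutex_lock(void)   { pthread_mutex_lock(&sched_domains_mutex); }
  static void sched_domains_mutex_unlock(void) { pthread_mutex_unlock(&sched_domains_mutex); }
  #else
  /* !SMP build: no domains to serialize, the wrappers compile away */
  static void sched_domains_mutex_lock(void)   { }
  static void sched_domains_mutex_unlock(void) { }
  #endif

  int main(void)
  {
          /* callers are identical on both configurations */
          sched_domains_mutex_lock();
          puts("domains are stable while we update them");
          sched_domains_mutex_unlock();
          return 0;
  }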
2025-03-17  sched/deadline: Ignore special tasks when rebuilding domains  (Juri Lelli)
SCHED_DEADLINE special tasks get a fake bandwidth that is only used to make sure sleeping and priority inheritance 'work', but it is ignored for runtime enforcement and admission control. Be consistent with it also when rebuilding root domains. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20250313170011.357208-2-juri.lelli@redhat.com
2025-03-17  tracing: Use preempt_model_str()  (Sebastian Andrzej Siewior)
Use preempt_model_str() instead of manually constructing the preemption model. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20250314160810.2373416-10-bigeasy@linutronix.de
2025-03-17  perf: Clean up pmu specific data  (Kan Liang)
The pmu specific data is saved in task_struct now. Remove it from event context structure. Remove swap_task_ctx() as well. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-7-kan.liang@linux.intel.com
2025-03-17  sched: Add a generic function to return the preemption string  (Sebastian Andrzej Siewior)
The individual architectures often add the preemption model to the beginning of the backtrace. This is the case on X86 or ARM64 for the "die" case but not for a regular warning. With the addition of PREEMPT_DYNAMIC for PREEMPT_RT we end up with CONFIG_PREEMPT and CONFIG_PREEMPT_RT set simultaneously. That means that everyone who tried to add that piece of information gets it wrong for PREEMPT_RT because PREEMPT is checked first. Provide a generic function which returns the current scheduling model considering LAZY preempt and the current state of PREEMPT_DYNAMIC. The resulting strings are:

┏━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Model     ┃ -RT -DYN     ┃ +RT -DYN          ┃ -RT +DYN           ┃ +RT +DYN          ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│NONE       │ NONE         │ n/a               │ PREEMPT(none)      │ n/a               │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│VOLUNTARY  │ VOLUNTARY    │ n/a               │ PREEMPT(voluntary) │ n/a               │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│FULL       │ PREEMPT      │ PREEMPT_RT        │ PREEMPT(full)      │ PREEMPT_{RT,full} │
├───────────┼──────────────┼───────────────────┼────────────────────┼───────────────────┤
│LAZY       │ PREEMPT_LAZY │ PREEMPT_{RT,LAZY} │ PREEMPT(lazy)      │ PREEMPT_{RT,lazy} │
└───────────┴──────────────┴───────────────────┴────────────────────┴───────────────────┘

[ The dynamic building of the string can lead to an empty string if the function is invoked simultaneously on two CPUs. ] Co-developed-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Co-developed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Signed-off-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20250314160810.2373416-2-bigeasy@linutronix.de
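For the non-DYNAMIC rows of the table above, the composition logic is small; here is a userspace sketch (not the kernel's actual implementation) where booleans stand in for the CONFIG_* switches. The static buffer intentionally mirrors the caveat about concurrent invocation noted in the bracketed comment:

  #include <stdio.h>
  #include <string.h>

  /* sketch only: compose the model string per the -DYN columns */
  static const char *preempt_model_str(int rt, int lazy, int full, int voluntary)
  {
          static char buf[32]; /* shared buffer, racy like the note above */

          if (!rt && !lazy && !full)
                  return voluntary ? "VOLUNTARY" : "NONE";
          strcpy(buf, "PREEMPT");
          if (rt && lazy)
                  strcat(buf, "_{RT,LAZY}");
          else if (rt)
                  strcat(buf, "_RT");
          else if (lazy)
                  strcat(buf, "_LAZY");
          return buf;
  }

  int main(void)
  {
          printf("%s\n", preempt_model_str(1, 0, 1, 0)); /* PREEMPT_RT */
          printf("%s\n", preempt_model_str(0, 1, 0, 0)); /* PREEMPT_LAZY */
          printf("%s\n", preempt_model_str(1, 1, 0, 0)); /* PREEMPT_{RT,LAZY} */
          return 0;
  }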
2025-03-17  perf: Supply task information to sched_task()  (Kan Liang)
To save/restore LBR call stack data in system-wide mode, the task_struct information is required. Extend the parameters of sched_task() to supply task_struct information. When scheduling in, the LBR call stack data for the new task will be restored. When scheduling out, the LBR call stack data for the old task will be saved. Only the required task_struct information needs to be passed. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-4-kan.liang@linux.intel.com
2025-03-17  perf: attach/detach PMU specific data  (Kan Liang)
The LBR call stack data has to be saved/restored during context switch to fix the shorter LBR call stacks issue in system-wide mode. Allocate PMU specific data and attach it to the corresponding task_struct during LBR call stack monitoring. When an LBR call stack event is accounted, the perf_ctx_data for the related tasks will be allocated/attached by attach_perf_ctx_data(). When an LBR call stack event is unaccounted, the perf_ctx_data for related tasks will be detached/freed by detach_perf_ctx_data(). The LBR call stack event could be a per-task event or a system-wide event.
- For a per-task event, perf only allocates the perf_ctx_data for the current task. If the allocation fails, perf will error out.
- For a system-wide event, perf has to allocate the perf_ctx_data for both the existing tasks and the upcoming tasks. The allocation for the existing tasks is done in perf_event_alloc(). If any allocation fails, perf will error out. The allocation for the new tasks will be done in perf_event_fork(). A global reader/writer semaphore, global_ctx_data_rwsem, is added to address the global race.
- The perf_ctx_data is only freed by the last LBR call stack event. The number of the per-task events is tracked by the refcount of each task. Since the system-wide events impact all tasks, it's not practical to go through the whole task list to update the refcount for each system-wide event. The number of system-wide events is tracked by a global variable global_ctx_data_ref.
Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250314172700.438923-3-kan.liang@linux.intel.com
2025-03-17  perf: Save PMU specific data in task_struct  (Kan Liang)
Some PMU specific data has to be saved/restored during context switch, e.g. LBR call stack data. Currently, the data is saved in the event context structure, but only for per-process events. For system-wide events, the LBR call stack data is missing after a context switch, so LBR call stacks are always shorter in comparison to per-process mode. For example,

Per-process mode:
  $ perf record --call-graph lbr -- taskset -c 0 ./tchain_edit
  - 99.90% 99.86% tchain_edit tchain_edit [.] f3
      99.86% _start __libc_start_main generic_start_main main f1
      - f2 f3

System-wide mode:
  $ perf record --call-graph lbr -a -- taskset -c 0 ./tchain_edit
  - 99.88% 99.82% tchain_edit tchain_edit [.] f3
    - 62.02% main f1 f2 f3
    - 28.83% f1
      - f2 f3
    - 28.83% f1
      - f2 f3
    - 8.88% generic_start_main main f1 f2 f3

It isn't practical to simply allocate the data for system-wide events in the CPU context structure for all tasks. We have no idea which CPU a task will be scheduled to. The duplicated LBR data would have to be maintained on every CPU context structure. That's a huge waste. Otherwise, the LBR data would still be lost if the task is scheduled to another CPU. Save the pmu specific data in task_struct. The size of pmu specific data is 788 bytes for the LBR call stack. Usually, the overall number of threads doesn't exceed a few thousand. For 10K threads, keeping LBR data would consume an additional ~8MB. The additional space will only be allocated during LBR call stack monitoring. It will be released when the monitoring is finished. Furthermore, moving task_ctx_data from perf_event_context to task_struct can reduce complexity and make things clearer. E.g. perf doesn't need to swap task_ctx_data on the optimized context switch path. This patch set is just the first step. There could be other optimizations/extensions on top of this patch set. E.g. for cgroup profiling, perf just needs to save/store the LBR call stack information for tasks in a specific cgroup. That could reduce the additional space. Also, the LBR call stack can be made available for software events, or even allow debugging use cases, like LBRs on crash later. Because of the alignment requirement of Intel Arch LBR, the Kmem cache is used to allocate the PMU specific data. It's required when a child task allocates the space. Save it in struct perf_ctx_data. The refcount in struct perf_ctx_data is used to track the users of pmu specific data. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexey Budankov <alexey.budankov@linux.intel.com> Link: https://lore.kernel.org/r/20250314172700.438923-1-kan.liang@linux.intel.com
2025-03-17  posix-timers: Drop redundant memset() invocation  (Cyrill Gorcunov)
Initially in commit 6891c4509c79 memset() was required to clear a variable allocated on the stack. Commit 2482097c6c0f removed the on-stack variable and retained the memset() despite the fact that the memory is allocated via kmem_cache_zalloc() and therefore zeroed already. Drop the redundant memset(). Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/Z9ctVxwaYOV4A2g4@grain
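A userspace analogue of the redundancy being removed, with calloc() standing in for kmem_cache_zalloc() (the struct is a stand-in, not the real posix-timers layout):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  struct timer_obj { int id; char pad[60]; };

  int main(void)
  {
          /* calloc(), like kmem_cache_zalloc(), returns zeroed memory... */
          struct timer_obj *t = calloc(1, sizeof(*t));
          if (!t)
                  return 1;
          /* ...so this memset clears memory that is already zero */
          memset(t, 0, sizeof(*t));
          printf("id=%d\n", t->id); /* 0 either way */
          free(t);
          return 0;
  }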
2025-03-17  perf/ring_buffer: Allow the EPOLLRDNORM flag for poll  (Tao Chen)
The poll man page says POLLRDNORM is equivalent to POLLIN. For poll(), it seems that if the user sets pollfd with POLLRDNORM in userspace, perf_poll will not return until the timeout even if perf_output_wakeup() is called, whereas POLLIN returns. Fixes: 76369139ceb9 ("perf: Split up buffer handling from core code") Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250314030036.2543180-1-chen.dylane@linux.dev
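The POLLIN/POLLRDNORM equivalence is easy to observe from userspace; a minimal demo on a pipe (not perf-specific, assuming nothing beyond POSIX poll(2)):

  #define _GNU_SOURCE /* for POLLRDNORM on glibc */
  #include <poll.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          int fds[2];
          if (pipe(fds) || write(fds[1], "x", 1) != 1)
                  return 1;

          /* request only POLLRDNORM, as the report describes */
          struct pollfd pfd = { .fd = fds[0], .events = POLLRDNORM };
          int n = poll(&pfd, 1, 1000);

          if (n > 0 && (pfd.revents & POLLRDNORM))
                  puts("readable data reported via POLLRDNORM, as with POLLIN");
          else
                  puts("no POLLRDNORM within the timeout (the reported perf bug)");
          return 0;
  }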
2025-03-17  perf/core: Use POLLHUP for pinned events in error  (Namhyung Kim)
Pinned performance events can enter an error state when they fail to be scheduled in the context due to a failed constraint or some other conflict or condition. In the error state these events won't generate any samples anymore and are silently ignored until they are recovered by PERF_EVENT_IOC_ENABLE, or until the condition changes so that they can be scheduled in. Tooling should be allowed to know about the state change, but currently there's no mechanism to notify tooling when events enter an error state. One way to do this is to issue a POLLHUP event to poll(2). Reading events in an error state would return 0 (EOF) and it matches the behavior of POLLHUP according to the man page. Tooling should remove the fd of the event from pollfd after getting POLLHUP, otherwise it'll be returned repeatedly. [ mingo: Clarified the changelog ] Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317061745.1777584-1-namhyung@kernel.org
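A sketch of the tooling pattern the changelog prescribes, demonstrated on a pipe whose writer has gone away (so the reader sees POLLHUP plus EOF, matching the described perf behavior):

  #include <poll.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          int fds[2];
          if (pipe(fds))
                  return 1;
          close(fds[1]); /* no writers left: the reader now sees POLLHUP */

          struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
          int nfds = 1;
          while (nfds > 0 && poll(&pfd, nfds, 1000) > 0) {
                  if (pfd.revents & POLLHUP) {
                          char buf[8];
                          printf("POLLHUP, read returns %zd (EOF)\n",
                                 read(pfd.fd, buf, sizeof(buf)));
                          close(pfd.fd);
                          nfds = 0; /* drop the fd, or POLLHUP repeats forever */
                  }
          }
          return 0;
  }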
2025-03-17  configs: drop GENERIC_PTDUMP from debug.config  (Anshuman Khandual)
Patch series "mm: Rework generic PTDUMP configs", v3. The series reworks generic PTDUMP configs before eventually renaming them after some basic cleanups first. This patch (of 5): The platforms that support GENERIC_PTDUMP select the config explicitly. But enabling this feature on platforms that don't really support it either does nothing or might cause a build failure. Hence just drop GENERIC_PTDUMP from the generic debug.config. Link: https://lkml.kernel.org/r/20250226122404.1927473-1-anshuman.khandual@arm.com Link: https://lkml.kernel.org/r/20250226122404.1927473-2-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  reboot: add support for configuring emergency hardware protection action  (Ahmad Fatoum)
We currently leave the decision of whether to shutdown or reboot to protect hardware in an emergency situation to the individual drivers. This works out in some cases, where the driver detecting the critical failure has inside knowledge: It binds to the system management controller for example or is guided by hardware description that defines what to do. In the general case, however, the driver detecting the issue can't know what the appropriate course of action is and shouldn't be dictating the policy of dealing with it. Therefore, add a global hw_protection toggle that allows the user to specify whether shutdown or reboot should be the default action when the driver doesn't set policy. This introduces no functional change yet as hw_protection_trigger() has no callers, but these will be added in subsequent commits. [arnd@arndb.de: hide unused hw_protection_attr] Link: https://lkml.kernel.org/r/20250224141849.1546019-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-7-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  reboot: indicate whether it is a HARDWARE PROTECTION reboot or shutdown  (Ahmad Fatoum)
It currently depends on the caller whether we attempt a hardware protection shutdown (poweroff) or a reboot. A follow-up commit will make this partially user-configurable, so it's a good idea to have the emergency message clearly state whether the kernel is going for a reboot or a shutdown. Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-6-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  reboot: rename now misleading __hw_protection_shutdown symbols  (Ahmad Fatoum)
The __hw_protection_shutdown function name has become misleading since it can cause either a shutdown (poweroff) or a reboot depending on its argument. To avoid further confusion, let's rename it, so it doesn't suggest that a poweroff is all it can do. Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-5-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  reboot: describe do_kernel_restart's cmd argument in kernel-doc  (Ahmad Fatoum)
A W=1 build rightfully complains about the function's kernel-doc being incomplete. Describe its single parameter to fix this. Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-4-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  reboot: reboot, not shutdown, on hw_protection_reboot timeout  (Ahmad Fatoum)
hw_protection_shutdown() will kick off an orderly shutdown and if that takes longer than a configurable amount of time, an emergency shutdown will occur. Recently, hw_protection_reboot() was added for those systems that don't implement a proper shutdown and are better served by rebooting and having the boot firmware worry about doing something about the critical condition. On timeout of the orderly reboot of hw_protection_reboot(), the system would go into shutdown, instead of reboot. This is not a good idea, as going into shutdown was explicitly not asked for. Fix this by always doing an emergency reboot if hw_protection_reboot() is called and the orderly reboot takes too long. Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-2-e1c09b090c0c@pengutronix.de Fixes: 79fa723ba84c ("reboot: Introduce thermal_zone_device_critical_reboot()") Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Reviewed-by: Matti Vaittinen <mazziesaccount@gmail.com> Cc: Benson Leung <bleung@chromium.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Brown <broonie@kernel.org> Cc: Matteo Croce <teknoraver@meta.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  reboot: replace __hw_protection_shutdown bool action parameter with an enum  (Ahmad Fatoum)
Patch series "reboot: support runtime configuration of emergency hw_protection action", v3. We currently leave the decision of whether to shutdown or reboot to protect hardware in an emergency situation to the individual drivers. This works out in some cases, where the driver detecting the critical failure has inside knowledge: It binds to the system management controller for example or is guided by hardware description that defines what to do. This is inadequate in the general case though, as a driver reporting e.g. an imminent power failure can't know whether a shutdown or a reboot would be more appropriate for a given hardware platform. To address this, this series adds a hw_protection kernel parameter and sysfs toggle that can be used to change the action from the shutdown default to reboot. A new hw_protection_trigger API then makes use of this default action. My particular use case is unattended embedded systems that don't have support for shutdown and that power on automatically when power is supplied:
- A brief power cycle gets detected by the driver
- The kernel powers down the system and the SoC goes into shutdown mode
- Power is restored
- The system remains oblivious to the restored power
- The system needs to be manually power cycled for a duration long enough to drain the capacitors
With this series, such systems can configure the kernel with hw_protection=reboot to have the boot firmware worry about critical conditions. This patch (of 12): Currently __hw_protection_shutdown() either reboots or shuts down the system according to its shutdown argument. To make the logic easier to follow, both inside __hw_protection_shutdown and at caller sites, let's replace the bool parameter with an enum. This will be extra useful when, in a later commit, a third action is added to the enumeration. No functional change. Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-0-e1c09b090c0c@pengutronix.de Link: https://lkml.kernel.org/r/20250217-hw_protection-reboot-v3-1-e1c09b090c0c@pengutronix.de Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Cc: Benson Leung <bleung@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Fabio Estevam <festevam@denx.de> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Girdwood <lgirdwood@gmail.com> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Matteo Croce <teknoraver@meta.com> Cc: Matti Vaittinen <mazziesaccount@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Rui Zhang <rui.zhang@intel.com> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  ucount: use rcuref_t for reference counting  (Sebastian Andrzej Siewior)
Use rcuref_t for reference counting. This eliminates the cmpxchg loop in the get and put path. This also eliminates the need to acquire the lock in the put path because once the final user returns the reference, it can no longer be obtained anymore. Link: https://lkml.kernel.org/r/20250203150525.456525-5-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  ucount: use RCU for ucounts lookups  (Sebastian Andrzej Siewior)
The ucounts element is looked up under ucounts_lock. This can be optimized by using RCU for a lockless lookup, returning an element if the reference can be obtained. Replace hlist_head with hlist_nulls_head, which is RCU compatible. Let find_ucounts() search for the required item within an RCU section and return the item if a reference could be obtained. This means alloc_ucounts() will always return an element (unless the memory allocation failed). Let put_ucounts() RCU free the element if the reference counter dropped to zero. Link: https://lkml.kernel.org/r/20250203150525.456525-4-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  ucount: replace get_ucounts_or_wrap() with atomic_inc_not_zero()  (Sebastian Andrzej Siewior)
get_ucounts_or_wrap() increments the counter and, if the counter is negative, decrements it again in order to reset the previous increment. This statement can be replaced with atomic_inc_not_zero() to only increment the counter if it is not yet 0. This simplifies the get function because the put (if the get failed) can be removed. atomic_inc_not_zero() is implemented as a cmpxchg() loop which can be repeated several times if another get/put is performed in parallel. This will be optimized later. Increment the reference counter only if it has not yet dropped to zero. Link: https://lkml.kernel.org/r/20250203150525.456525-3-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai jiangshan <jiangshanlai@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mengen Sun <mengensun@tencent.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: YueHong Wu <yuehongwu@tencent.com> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
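The inc-not-zero pattern in isolation, sketched with C11 atomics (a userspace model of the kernel primitive, not the kernel implementation): take a reference only if the count has not already dropped to zero.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdio.h>

  static bool ref_get_not_zero(atomic_int *count)
  {
          int old = atomic_load(count);
          do {
                  if (old == 0)
                          return false; /* object already on its way to be freed */
          } while (!atomic_compare_exchange_weak(count, &old, old + 1));
          return true;
  }

  int main(void)
  {
          atomic_int refs = 1;
          printf("get: %d (refs now %d)\n",
                 ref_get_not_zero(&refs), atomic_load(&refs));
          atomic_store(&refs, 0);
          printf("get after drop to zero: %d\n", ref_get_not_zero(&refs));
          return 0;
  }

The old get_ucounts_or_wrap() scheme instead incremented unconditionally and undid the increment when it observed a negative value; the CAS loop above fails the get up front, so no compensating put is needed.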
2025-03-16  crash: let arch decide usable memory range in reserved area  (Sourabh Jain)
Although the crashkernel area is reserved, on architectures like PowerPC it is possible for the crashkernel reserved area to contain components like RTAS, TCE, OPAL, etc. To avoid placing kexec segments over these components, PowerPC has its own set of APIs to locate holes in the crashkernel reserved area. Add an arch hook in the generic locate mem hole APIs so that architectures can handle such special regions in the crashkernel area while locating memory holes for kexec segments using the generic APIs. With this, a lot of redundant arch-specific code can be removed, as it performs the exact same job as the generic APIs. To keep the generic and arch-specific changes separate, the changes related to moving PowerPC to use the generic APIs and the removal of PowerPC-specific APIs for memory hole allocation are done in a subsequent patch titled "powerpc/crash: Use generic APIs to locate memory hole for kdump". Link: https://lkml.kernel.org/r/20250131113830.925179-4-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
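The usual kernel idiom for such an arch hook is a __weak default that architectures override; a standalone sketch of that pattern (the function names here are illustrative, not the actual hook added by the patch):

  #include <stdbool.h>
  #include <stdio.h>

  /* generic default: no special regions to avoid; an architecture
     provides a strong definition to override this */
  __attribute__((weak)) bool arch_kexec_range_excluded(unsigned long start,
                                                       unsigned long end)
  {
          (void)start;
          (void)end;
          return false;
  }

  static bool locate_mem_hole(unsigned long start, unsigned long end)
  {
          /* generic placement logic consults the arch hook */
          if (arch_kexec_range_excluded(start, end))
                  return false; /* overlaps RTAS/TCE/OPAL etc. on PowerPC */
          return true;
  }

  int main(void)
  {
          printf("hole usable: %d\n", locate_mem_hole(0x1000, 0x2000));
          return 0;
  }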
2025-03-16  crash: remove an unused argument from reserve_crashkernel_generic()  (Sourabh Jain)
The cmdline argument is not used in reserve_crashkernel_generic(), so remove it. Correspondingly, all the callers have been updated as well. No functional change intended. Link: https://lkml.kernel.org/r/20250131113830.925179-3-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  kexec: initialize ELF lowest address to ULONG_MAX  (Sourabh Jain)
Patch series "powerpc/crash: use generic crashkernel reservation", v3. Commit 0ab97169aa05 ("crash_core: add generic function to do reservation") added a generic function to reserve crashkernel memory. So let's use the same function on powerpc and remove the architecture-specific code that essentially does the same thing. The generic crashkernel reservation also provides a way to split the crashkernel reservation into high and low memory reservations, which can be enabled for powerpc in the future. Additionally move powerpc to use generic APIs to locate memory hole for kexec segments while loading kdump kernel. This patch (of 7): kexec_elf_load() loads an ELF executable and sets the address of the lowest PT_LOAD section to the address held by the lowest_load_addr function argument. To determine the lowest PT_LOAD address, a local variable lowest_addr (type unsigned long) is initialized to UINT_MAX. After loading each PT_LOAD, its address is compared to lowest_addr. If a loaded PT_LOAD address is lower, lowest_addr is updated. However, setting lowest_addr to UINT_MAX won't work when the kernel image is loaded above 4G, as the returned lowest PT_LOAD address would be invalid. This is resolved by initializing lowest_addr to ULONG_MAX instead. This issue was discovered while implementing crashkernel high/low reservation on the PowerPC architecture. Link: https://lkml.kernel.org/r/20250131113830.925179-1-sourabhjain@linux.ibm.com Link: https://lkml.kernel.org/r/20250131113830.925179-2-sourabhjain@linux.ibm.com Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm,procfs: allow read-only remote mm access under CAP_PERFMON  (Andrii Nakryiko)
It's very common for various tracing and profiling tools to need to access /proc/PID/maps contents for stack symbolization needs, to learn which shared libraries are mapped in memory, at which file offset, etc. Currently, access to /proc/PID/maps requires CAP_SYS_PTRACE (unless we are looking at data for our own process, which is a trivial case not too relevant for profiler use cases). Unfortunately, CAP_SYS_PTRACE implies way more than just the ability to discover the memory layout of another process: it allows fully controlling arbitrary other processes. This is problematic from a security POV for applications that only need read-only /proc/PID/maps (and other similar read-only data) access, and in large production settings CAP_SYS_PTRACE is frowned upon even for system-wide profilers. On the other hand, it's already possible to access similar kinds of information (and more) with just the CAP_PERFMON capability. E.g., setting up PERF_RECORD_MMAP collection through perf_event_open() would give one similar information to what /proc/PID/maps provides. CAP_PERFMON, together with CAP_BPF, is already a very common combination for system-wide profiling and observability applications. As such, it's reasonable and convenient to be able to access /proc/PID/maps with CAP_PERFMON capabilities instead of CAP_SYS_PTRACE. For procfs, these permissions are checked through the common mm_access() helper, and so we augment that with a cap_perfmon() check *only* if the requested mode is PTRACE_MODE_READ. I.e., PTRACE_MODE_ATTACH wouldn't be permitted by CAP_PERFMON. So /proc/PID/mem, which uses PTRACE_MODE_ATTACH, won't be permitted by CAP_PERFMON, but /proc/PID/maps, /proc/PID/environ, and a bunch of other read-only contents will be allowable under CAP_PERFMON. Besides procfs itself, mm_access() is used by the process_madvise() and process_vm_{readv,writev}() syscalls. The former uses PTRACE_MODE_READ to avoid leaking ASLR metadata, and as such CAP_PERFMON seems like a meaningful allowable capability as well. process_vm_{readv,writev} currently assume PTRACE_MODE_ATTACH level of permissions (though for readv PTRACE_MODE_READ seems more reasonable, but that's outside the scope of this change), and as such won't be affected by this patch. Link: https://lkml.kernel.org/r/20250127222114.1132392-1-andrii@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
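A userspace model of the augmented mm_access() decision, under the assumption spelled out in the changelog (CAP_PERFMON grants read-only introspection, never attach); the mode names and helper are illustrative:

  #include <stdbool.h>
  #include <stdio.h>

  enum access_mode { MODE_READ, MODE_ATTACH };

  static bool has_cap_sys_ptrace;
  static bool has_cap_perfmon;

  static bool mm_access_allowed(enum access_mode mode)
  {
          if (has_cap_sys_ptrace)
                  return true;                /* full control, as before */
          if (mode == MODE_READ && has_cap_perfmon)
                  return true;                /* new: read-only introspection */
          return false;
  }

  int main(void)
  {
          has_cap_perfmon = true; /* e.g. a system-wide profiler */
          printf("/proc/PID/maps (READ):  %s\n",
                 mm_access_allowed(MODE_READ) ? "ok" : "denied");
          printf("/proc/PID/mem (ATTACH): %s\n",
                 mm_access_allowed(MODE_ATTACH) ? "ok" : "denied");
          return 0;
  }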
2025-03-16  mm: make vma cache SLAB_TYPESAFE_BY_RCU  (Suren Baghdasaryan)
To enable SLAB_TYPESAFE_BY_RCU for the vma cache we need to ensure that object reuse before the RCU grace period is over will be detected by lock_vma_under_rcu(). Current checks are sufficient as long as the vma is detached before it is freed. The only place this is not currently happening is in exit_mmap(). Add the missing vma_mark_detached() in exit_mmap(). Another issue which might trick lock_vma_under_rcu() during vma reuse is vm_area_dup(), which copies the entire content of the vma into a new one, overriding the new vma's vm_refcnt and temporarily making it appear as attached. This might trick a racing lock_vma_under_rcu() into operating on a reused vma if it found the vma before it got reused. To prevent this situation, we should ensure that vm_refcnt stays at the detached state (0) when it is copied and advances to the attached state only after it is added into the vma tree. Introduce vm_area_init_from(), which preserves the new vma's vm_refcnt, and use it in vm_area_dup(). Since all vmas are in the detached state with no current readers when they are freed, lock_vma_under_rcu() will not be able to take vm_refcnt after the vma got detached even if the vma is reused. vma_mark_attached() is modified to include a release fence to ensure all stores to the vma happen before vm_refcnt gets initialized. Finally, make vm_area_cachep SLAB_TYPESAFE_BY_RCU. This will facilitate vm_area_struct reuse and will minimize the number of call_rcu() calls. [surenb@google.com: remove atomic_set_release() usage in tools/] Link: https://lkml.kernel.org/r/20250217054351.2973666-1-surenb@google.com Link: https://lkml.kernel.org/r/20250213224655.1680278-18-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm: replace vm_lock and detached flag with a reference count  (Suren Baghdasaryan)
rw_semaphore is a sizable structure of 40 bytes and consumes considerable space for each vm_area_struct. However vma_lock has two important specifics which can be used to replace rw_semaphore with a simpler structure:
1. Readers never wait. They try to take the vma_lock and fall back to mmap_lock if that fails.
2. Only one writer at a time will ever try to write-lock a vma_lock because writers first take mmap_lock in write mode.
Because of these requirements, full rw_semaphore functionality is not needed and we can replace rw_semaphore and the vma->detached flag with a refcount (vm_refcnt). When vma is in detached state, vm_refcnt is 0 and only a call to vma_mark_attached() can take it out of this state. Note that unlike before, now we enforce both vma_mark_attached() and vma_mark_detached() to be done only after vma has been write-locked. vma_mark_attached() changes vm_refcnt to 1 to indicate that it has been attached to the vma tree. When a reader takes read lock, it increments vm_refcnt, unless the top usable bit of vm_refcnt (0x40000000) is set, indicating presence of a writer. When writer takes write lock, it sets the top usable bit to indicate its presence. If there are readers, writer will wait using newly introduced mm->vma_writer_wait. Since all writers take mmap_lock in write mode first, there can be only one writer at a time. The last reader to release the lock will signal the writer to wake up. refcount might overflow if there are many competing readers, in which case read-locking will fail. Readers are expected to handle such failures. In summary:
1. all readers increment the vm_refcnt;
2. writer sets top usable (writer) bit of vm_refcnt;
3. readers cannot increment the vm_refcnt if the writer bit is set;
4. in the presence of readers, writer must wait for the vm_refcnt to drop to 1 (plus the VMA_LOCK_OFFSET writer bit), indicating an attached vma with no readers;
5. vm_refcnt overflow is handled by the readers.
While this vm_lock replacement does not yet result in a smaller vm_area_struct (it stays at 256 bytes due to cacheline alignment), it allows for further size optimization by structure member regrouping to bring the size of vm_area_struct below 192 bytes. [surenb@google.com: fix a crash due to vma_end_read() that should have been removed] Link: https://lkml.kernel.org/r/20250220200208.323769-1-surenb@google.com Link: https://lkml.kernel.org/r/20250213224655.1680278-13-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm: move mmap_init_lock() out of the header file  (Suren Baghdasaryan)
mmap_init_lock() is used only from mm_init() in fork.c, therefore it does not have to reside in the header file. This move lets us avoid including additional headers in mmap_lock.h later, when mmap_init_lock() needs to initialize rcuwait object. Link: https://lkml.kernel.org/r/20250213224655.1680278-9-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm: mark vma as detached until it's added into vma tree  (Suren Baghdasaryan)
Current implementation does not set detached flag when a VMA is first allocated. This does not represent the real state of the VMA, which is detached until it is added into mm's VMA tree. Fix this by marking new VMAs as detached and resetting detached flag only after VMA is added into a tree. Introduce vma_mark_attached() to make the API more readable and to simplify possible future cleanup when vma->vm_mm might be used to indicate detached vma and vma_mark_attached() will need an additional mm parameter. Link: https://lkml.kernel.org/r/20250213224655.1680278-4-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  mm: move per-vma lock into vm_area_struct  (Suren Baghdasaryan)
Back when per-vma locks were introduced, vm_lock was moved out of vm_area_struct in [1] because of the performance regression caused by false cacheline sharing. Recent investigation [2] revealed that the regression is limited to a rather old Broadwell microarchitecture and even there it can be mitigated by disabling adjacent cacheline prefetching, see [3]. Splitting a single logical structure into multiple ones leads to more complicated management, extra pointer dereferences and overall less maintainable code. When that split-away part is a lock, it complicates things even further. With no performance benefits, there are no reasons for this split. Merging the vm_lock back into vm_area_struct also allows vm_area_struct to use SLAB_TYPESAFE_BY_RCU later in this patchset. Move vm_lock back into vm_area_struct, aligning it at the cacheline boundary and changing the cache to be cacheline-aligned as well. With a kernel compiled using defconfig, this causes VMA memory consumption to grow from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:

slabinfo before:
  <name>           ... <objsize> <objperslab> <pagesperslab> : ...
  vma_lock         ... 40  102 1 : ...
  vm_area_struct   ... 160 51  2 : ...

slabinfo after moving vm_lock:
  <name>           ... <objsize> <objperslab> <pagesperslab> : ...
  vm_area_struct   ... 256 32  2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages, which is 5.5MB per 100000 VMAs. Note that the size of this structure is dependent on the kernel configuration and typically the original size is higher than 160 bytes. Therefore these calculations are close to the worst case scenario. A more realistic vm_area_struct usage before this change is:

  <name>           ... <objsize> <objperslab> <pagesperslab> : ...
  vma_lock         ... 40  102 1 : ...
  vm_area_struct   ... 176 46  2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 54 to 64 pages, which is 3.9MB per 100000 VMAs. This memory consumption growth can be addressed later by optimizing the vm_lock. [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@google.com/ [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/ [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@mail.gmail.com/ Link: https://lkml.kernel.org/r/20250213224655.1680278-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Shivank Garg <shivankg@amd.com> Link: https://lkml.kernel.org/r/5e19ec93-8307-47c2-bb13-3ddf7150624e@amd.com Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sourav Panda <souravpanda@google.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  kernel/events/uprobes: handle device-exclusive entries correctly in __replace_page()  (David Hildenbrand)
Ever since commit b756a3b5e7ea ("mm: device exclusive memory access") we can return with a device-exclusive entry from page_vma_mapped_walk(). __replace_page() is not prepared for that, so teach it about these PFN swap PTEs. Note that device-private entries are so far not applicable on that path, because GUP would never have returned such folios (conversion to device-private happens by page migration, not in-place conversion of the PTE). There is a race between GUP and us locking the folio to look it up using page_vma_mapped_walk(), so this is likely a fix (unless something else could prevent that race, but it doesn't look like it). pte_pfn() on something that is not a present pte could give us garbage, and we'd wrongly mess up the mapcount because it was already adjusted by calling folio_remove_rmap_pte() when making the entry device-exclusive. Link: https://lkml.kernel.org/r/20250210193801.781278-9-david@redhat.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <david@redhat.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  percpu: use TYPEOF_UNQUAL() in variable declarations  (Uros Bizjak)
Use TYPEOF_UNQUAL() to declare variables as a corresponding type without named address space qualifier to avoid "`__seg_gs' specified for auto variable `var'" errors. Link: https://lkml.kernel.org/r/20250127160709.80604-4-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Nadav Amit <nadav.amit@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16  Merge tag 'trace-v6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds)
Pull tracing fix from Steven Rostedt: "Fix ref count of trace_array in error path of histogram file open. Tracing instances have a ref count to keep them around while files within their directories are open. This prevents them from being deleted while they are used. The histogram code had some files that needed to take the ref count and that was added, but the error paths did not decrement the ref counts. This prevented the instances from ever being removed if a histogram file failed to open due to some error."
* tag 'trace-v6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Correct the refcount if the hist/hist_debug file fails to open
2025-03-16  perf/core: Use sysfs_emit() instead of scnprintf()  (XieLudan)
Follow the advice in Documentation/filesystems/sysfs.rst: "- show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space." No change in functionality intended. [ mingo: Updated the changelog ] Signed-off-by: XieLudan <xie.ludan@zte.com.cn> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250315141738452lXIH39UJAXlCmcATCzcBv@zte.com.cn
2025-03-15  bpf: Check map->record at the beginning of check_and_free_fields()  (Hou Tao)
When there are no special fields in the map value, there is no need to invoke bpf_obj_free_fields(). Therefore, check the validity of map->record in advance. After the change, the benchmark result of the per-cpu update case in map_perf_test increased by 40% under a 16-CPU VM. Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250315150930.1511727-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
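A sketch of the shape of the optimization, with illustrative types (not the actual BPF structures): hoisting the record check out of the per-element work skips the whole loop for maps with no special fields.

  #include <stdio.h>

  struct btf_record { int nr_fields; };
  struct bpf_map { struct btf_record *record; };

  static void bpf_obj_free_fields_stub(struct btf_record *rec, void *obj)
  {
          (void)obj;
          printf("freeing %d special fields\n", rec->nr_fields);
  }

  static void check_and_free_fields(struct bpf_map *map, void *objs[], int n)
  {
          if (!map->record)
                  return; /* hoisted check: no special fields, no work */
          for (int i = 0; i < n; i++)
                  bpf_obj_free_fields_stub(map->record, objs[i]);
  }

  int main(void)
  {
          struct bpf_map plain = { .record = 0 };
          void *objs[4] = { 0 };
          check_and_free_fields(&plain, objs, 4); /* fast path, prints nothing */
          return 0;
  }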
2025-03-15  bpf: preload: Add MODULE_DESCRIPTION  (Arnd Bergmann)
Modpost complains when extra warnings are enabled:
  WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/bpf/preload/bpf_preload.o
Add a description from the Kconfig help text. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250310134920.4123633-1-arnd@kernel.org ---- Not sure if that description actually fits what the module does. If not, please add a different description instead. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15  security: Propagate caller information in bpf hooks  (Blaise Boscaccy)
Certain bpf syscall subcommands are available for usage from both userspace and the kernel. LSM modules or eBPF gatekeeper programs may need to take a different course of action depending on whether or not a BPF syscall originated from the kernel or userspace. Additionally, some of the bpf_attr struct fields contain pointers to arbitrary memory. Currently the functionality to determine whether or not a pointer refers to kernel memory or userspace memory is exposed to the bpf verifier, but that information is missing from various LSM hooks. Here we augment the LSM hooks to provide this data, by simply passing a boolean flag indicating whether or not the call originated in the kernel, in any hook that contains a bpf_attr struct that corresponds to a subcommand that may be called from the kernel. Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20250310221737.821889-2-bboscaccy@linux.microsoft.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: fix missing kdoc string fields in cpumask.cEmil Tsalapatis
Some bpf_cpumask-related kfuncs have kdoc strings that are missing return values. Add the missing descriptions for the return values. Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250309230427.26603-4-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
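For reference, the kdoc shape being completed looks like this (an illustrative example, not the patch text):

  /**
   * bpf_cpumask_test_cpu() - Test whether a CPU is set in a BPF cpumask.
   * @cpu: The CPU being queried for.
   * @cpumask: The cpumask being queried.
   *
   * Return: true if @cpu is set in @cpumask, false otherwise.
   */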
2025-03-15bpf: add kfunc for populating cpumask bitsEmil Tsalapatis
Add a helper kfunc that sets the bitmap of a bpf_cpumask from BPF memory. Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com> Acked-by: Hou Tao <houtao1@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20250309230427.26603-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
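A usage sketch from the BPF program side; the kfunc name and signature (bpf_cpumask_populate(), taking a source buffer and its size) follow the series posting and should be treated as assumptions:

  SEC("tp_btf/task_newtask")
  int BPF_PROG(populate_example)
  {
	u64 bits = 0x0f;	/* set CPUs 0-3 */
	struct bpf_cpumask *mask;

	mask = bpf_cpumask_create();
	if (!mask)
		return 0;

	/* Copy the bitmap from BPF memory into the cpumask; the
	 * source buffer is expected to be a whole number of longs.
	 */
	if (bpf_cpumask_populate((struct cpumask *)mask, &bits,
				 sizeof(bits)))
		bpf_printk("bpf_cpumask_populate failed");

	bpf_cpumask_release(mask);
	return 0;
  }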
2025-03-15bpf: correct use/def for may_goto instructionEduard Zingerman
The may_goto instruction does not use any registers, but in compute_insn_live_regs() it was treated as a regular conditional jump of kind BPF_K with r0 as its source register, thus unnecessarily marking r0 as used. Fixes: 14c8552db644 ("bpf: simple DFA-based live registers analysis") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250305085436.2731464-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: use register liveness information for func_states_equalEduard Zingerman
Liveness analysis DFA computes a set of registers live before each instruction. Leverage this information to skip comparison of dead registers in func_states_equal(). This helps with convergence of iterator processing loops, as bpf_reg_state->live marks can't be used when loops are processed.

This has a certain performance impact for selftests; here is a veristat listing using `-f "insns_pct>5" -f "!insns<200"`

selftests:

File                 Program                       States (A)  States (B)  States (DIFF)
-------------------- ----------------------------- ----------  ----------  -------------
arena_htab.bpf.o     arena_htab_llvm                       37          35     -2 (-5.41%)
arena_htab_asm.bpf.o arena_htab_asm                        37          33    -4 (-10.81%)
arena_list.bpf.o     arena_list_add                        37          22   -15 (-40.54%)
dynptr_success.bpf.o test_dynptr_copy                      22          16    -6 (-27.27%)
dynptr_success.bpf.o test_dynptr_copy_xdp                  68          58   -10 (-14.71%)
iters.bpf.o          checkpoint_states_deletion           918          40  -878 (-95.64%)
iters.bpf.o          clean_live_states                    136          66   -70 (-51.47%)
iters.bpf.o          iter_nested_deeply_iters              43          37    -6 (-13.95%)
iters.bpf.o          iter_nested_iters                     72          62   -10 (-13.89%)
iters.bpf.o          iter_pass_iter_ptr_to_subprog         30          26    -4 (-13.33%)
iters.bpf.o          iter_subprog_iters                    68          59    -9 (-13.24%)
iters.bpf.o          loop_state_deps2                      35          32     -3 (-8.57%)
iters_css.bpf.o      iter_css_for_each                     32          29     -3 (-9.38%)
pyperf600_iter.bpf.o on_event                             286         192   -94 (-32.87%)

Total progs: 3578
Old success: 2061
New success: 2061
States diff min: -95.64%
States diff max: 0.00%
-100 .. -90  %: 1
-55  .. -45  %: 3
-45  .. -35  %: 2
-35  .. -25  %: 5
-20  .. -10  %: 12
-10  .. 0    %: 6

sched_ext:

File              Program                States (A)  States (B)  States (DIFF)
----------------- ---------------------- ----------  ----------  ---------------
bpf.bpf.o         lavd_dispatch                8950        7065  -1885 (-21.06%)
bpf.bpf.o         lavd_init                     516         480     -36 (-6.98%)
bpf.bpf.o         layered_dispatch              662         501   -161 (-24.32%)
bpf.bpf.o         layered_dump                  298         237    -61 (-20.47%)
bpf.bpf.o         layered_init                  523         423   -100 (-19.12%)
bpf.bpf.o         layered_init_task              24          22      -2 (-8.33%)
bpf.bpf.o         layered_runnable              151         125    -26 (-17.22%)
bpf.bpf.o         p2dq_dispatch                  66          53    -13 (-19.70%)
bpf.bpf.o         p2dq_init                     170         142    -28 (-16.47%)
bpf.bpf.o         refresh_layer_cpumasks        120          78    -42 (-35.00%)
bpf.bpf.o         rustland_init                  37          34      -3 (-8.11%)
bpf.bpf.o         rustland_init                  37          34      -3 (-8.11%)
bpf.bpf.o         rusty_select_cpu              125         108    -17 (-13.60%)
scx_central.bpf.o central_dispatch               59          43    -16 (-27.12%)
scx_central.bpf.o central_init                   39          28    -11 (-28.21%)
scx_nest.bpf.o    nest_init                      58          51     -7 (-12.07%)
scx_pair.bpf.o    pair_dispatch                 142         111    -31 (-21.83%)
scx_qmap.bpf.o    qmap_dispatch                 174         141    -33 (-18.97%)
scx_qmap.bpf.o    qmap_init                     768         654   -114 (-14.84%)

Total progs: 216
Old success: 186
New success: 186
States diff min: -35.00%
States diff max: 0.00%
-35 .. -25 %: 3
-25 .. -20 %: 6
-20 .. -15 %: 6
-15 .. -5  %: 7
-5  .. 0   %: 6

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: simple DFA-based live registers analysisEduard Zingerman
Compute may-live registers before each instruction in the program. A register is live before instruction I if it is read by I or by some instruction S following I during program execution, and is not overwritten between I and S. This information would be used in the next patch as a hint in func_states_equal().

Use a simple algorithm described in [1] to compute this information:
- define the following:
  - I.use : a set of all registers read by instruction I;
  - I.def : a set of all registers written by instruction I;
  - I.in  : a set of all registers that may be alive before I execution;
  - I.out : a set of all registers that may be alive after I execution;
  - I.successors : a set of instructions S that might immediately
    follow I for some program execution;
- associate separate empty sets 'I.in' and 'I.out' with each instruction;
- visit each instruction in a postorder and update the corresponding
  'I.in' and 'I.out' sets as follows:

      I.out = U [S.in for S in I.successors]
      I.in  = (I.out / I.def) U I.use

  (where U stands for set union and / stands for set difference)
- repeat the computation while I.{in,out} changes for any instruction.

On the implementation side, keep things as simple as possible:
- check_cfg() already marks instructions EXPLORED in post-order;
  modify it to save the index of each EXPLORED instruction in a vector;
- represent I.{in,out,use,def} as bitmasks;
- don't split the program into basic blocks and don't maintain a work
  queue; instead:
  - do the fixed-point computation by visiting each instruction;
  - maintain a simple 'changed' flag if I.{in,out} changes for any
    instruction.

Measurements show that even such a simplistic implementation does not add measurable verification time overhead (for selftests, at least).

Note on the check_cfg() ex_insn_beg/ex_done change: to avoid out-of-bounds access to the env->cfg.insn_postorder array, it must be guaranteed that an instruction transitions to the EXPLORED state only once. Previously this was not the case for incorrect programs with direct calls to exception callbacks.

The 'align' selftest needs adjustment to skip the computed insn/live registers printout; otherwise it matches lines from the live registers printout.

[1] https://en.wikipedia.org/wiki/Live-variable_analysis

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
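A self-contained sketch of that fixed-point loop, with register sets as bitmasks; all names here are illustrative, not the verifier's:

  typedef unsigned short regmask_t;	/* bit N stands for register rN */

  struct insn_live {
	regmask_t use, def;	/* precomputed per instruction */
	regmask_t in, out;	/* start empty, grow to a fixed point */
  };

  /* post[]: instruction indices in postorder (as saved by the modified
   * check_cfg()); succ(): writes up to two successor indices of
   * instruction @idx into @out and returns how many there are.
   */
  static void compute_live_regs(struct insn_live *live, const int *post,
				int insn_cnt,
				int (*succ)(int idx, int *out))
  {
	bool changed = true;

	while (changed) {	/* repeat while any I.{in,out} changes */
		changed = false;
		for (int p = 0; p < insn_cnt; p++) {
			int sidx[2];
			int i = post[p];
			int n = succ(i, sidx);
			regmask_t out = 0, in;

			/* I.out = U [S.in for S in I.successors] */
			for (int s = 0; s < n; s++)
				out |= live[sidx[s]].in;
			/* I.in = (I.out / I.def) U I.use */
			in = (out & ~live[i].def) | live[i].use;

			if (in != live[i].in || out != live[i].out) {
				live[i].in = in;
				live[i].out = out;
				changed = true;
			}
		}
	}
  }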
2025-03-15bpf: get_call_summary() utility functionEduard Zingerman
Refactor mark_fastcall_pattern_for_call() to extract a utility function get_call_summary(). For a helper or kfunc call this function fills the following information: {num_params, is_void, fastcall}. This function will be used in the next patch in order to get the number of parameters of a helper or kfunc call. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250304195024.2478889-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
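In sketch form, the summary is a small struct plus one fill helper; field and parameter names are assumptions based on the description:

  struct call_summary {
	u8 num_params;	/* number of arguments the helper/kfunc takes */
	bool is_void;	/* call does not define a meaningful r0 */
	bool fastcall;	/* call is eligible for the fastcall pattern */
  };

  /* Returns true and fills @cs if @call is a recognized helper or
   * kfunc call, false otherwise.
   */
  static bool get_call_summary(struct bpf_verifier_env *env,
			       struct bpf_insn *call,
			       struct call_summary *cs);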
2025-03-15bpf: jmp_offset() and verbose_insn() utility functionsEduard Zingerman
Extract two utility functions:

- One BPF jump instruction uses the .imm field to encode its jump offset, while the rest use .off. Encapsulate this detail in a jmp_offset() function.
- Avoid duplicating instruction-printing callback definitions by defining a verbose_insn() function, which disassembles an instruction into the verifier log while hiding this detail.

These functions will be used in the next patch.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250304195024.2478889-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
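A sketch of the first helper; the single jump that encodes its offset in .imm is taken here to be the long-range gotol form (BPF_JMP32 | BPF_JA):

  static int jmp_offset(struct bpf_insn *insn)
  {
	/* gotol keeps a 32-bit offset in imm; every other jump uses
	 * the 16-bit off field.
	 */
	if (insn->code == (BPF_JMP32 | BPF_JA))
		return insn->imm;
	return insn->off;
  }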
2025-03-15bpf: Introduce load-acquire and store-release instructionsPeilin Ye
Introduce BPF instructions with load-acquire and store-release semantics, as discussed in [1]. Define 2 new flags:

  #define BPF_LOAD_ACQ    0x100
  #define BPF_STORE_REL   0x110

A "load-acquire" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_LOAD_ACQ (0x100). Similarly, a "store-release" is a BPF_STX | BPF_ATOMIC instruction with the 'imm' field set to BPF_STORE_REL (0x110).

Unlike existing atomic read-modify-write operations that only support BPF_W (32-bit) and BPF_DW (64-bit) size modifiers, load-acquires and store-releases also support BPF_B (8-bit) and BPF_H (16-bit). As an exception, however, 64-bit load-acquires/store-releases are not supported on 32-bit architectures (to fix a build error reported by the kernel test robot).

An 8- or 16-bit load-acquire zero-extends the value before writing it to a 32-bit register, just like ARM64 instruction LDARH and friends.

Similar to existing atomic read-modify-write operations, misaligned load-acquires/store-releases are not allowed (even if BPF_F_ANY_ALIGNMENT is set).

As an example, consider the following 64-bit load-acquire BPF instruction (assuming little-endian):

  db 10 00 00 00 01 00 00  r0 = load_acquire((u64 *)(r1 + 0x0))

  opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
  imm (0x00000100): BPF_LOAD_ACQ

Similarly, a 16-bit BPF store-release:

  cb 21 00 00 10 01 00 00  store_release((u16 *)(r1 + 0x0), w2)

  opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
  imm (0x00000110): BPF_STORE_REL

In arch/{arm64,s390,x86}/net/bpf_jit_comp.c, have bpf_jit_supports_insn(..., /*in_arena=*/true) return false for the new instructions, until the corresponding JIT compiler supports them in arena.

[1] https://lore.kernel.org/all/20240729183246.4110549-1-yepeilin@google.com/

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/a217f46f0e445fbd573a1a024be5c6bf1d5fe716.1741049567.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
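From BPF C, such encodings are what acquire/release atomics would lower to, e.g. (a sketch; it assumes a Clang new enough to emit the new instructions — where compiler support is missing, tests typically hand-encode the raw bytes instead):

  /* 64-bit load-acquire (BPF_ATOMIC | BPF_DW | BPF_STX, imm = BPF_LOAD_ACQ) */
  u64 v = __atomic_load_n((u64 *)addr, __ATOMIC_ACQUIRE);

  /* 16-bit store-release (BPF_ATOMIC | BPF_H | BPF_STX, imm = BPF_STORE_REL) */
  __atomic_store_n((u16 *)addr, val, __ATOMIC_RELEASE);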
2025-03-15bpf: Add verifier support for timed may_gotoKumar Kartikeya Dwivedi
Implement support in the verifier for replacing the may_goto implementation from a counter-based approach to one which samples time on the local CPU to have a bigger loop bound.

We implement it by maintaining 16 bytes per stack frame: 8 bytes for the count that amortizes time sampling, and 8 bytes for the starting timestamp. To minimize overhead, we need to avoid spilling and filling of registers around this sequence, so we push this cost into the time sampling function 'arch_bpf_timed_may_goto'. This is a JIT-specific wrapper around bpf_check_timed_may_goto which returns us the count to store into the stack through BPF_REG_AX. All caller-saved registers (r0-r5) are guaranteed to remain untouched.

The loop can be broken by returning count as 0; otherwise we dispatch into the function when the count drops to 0, and the runtime chooses to refresh it (by returning count as BPF_MAX_TIMED_LOOPS) or to return 0 and abort the loop on the next iteration. Since the check for 0 is done right after loading the count from the stack, all subsequent cond_break sequences should immediately break as well, whether of the same loop or of subsequent loops in the program.

We pass in the stack_depth of the count (and thus the timestamp, by adding 8 to it) to the arch_bpf_timed_may_goto call so that it can be passed in to bpf_check_timed_may_goto as an argument after r1 is saved, by adding the offset to r10/fp. This adjustment will be arch specific, and the next patch will introduce support for x86.

Note that depending on loop complexity, time spent in the loop can be more than the current limit (250 ms), but imposing an upper bound on program runtime is an orthogonal problem which will be addressed when program cancellations are supported. The current time afforded by cond_break may not be enough for cases where BPF programs want to implement locking algorithms inline, and use cond_break as a promise to the verifier that they will eventually terminate.

Below are some benchmarking numbers on the time taken per iteration for an empty loop that counts the number of iterations until cond_break fires. For comparison, we compare it against bpf_for/bpf_repeat, which is another way to achieve the same number of spins (BPF_MAX_LOOPS). The hardware used for benchmarking was a Sapphire Rapids Intel server with the performance governor enabled; mitigations were enabled.

  +-----------------------------+--------------+--------------+------------------+
  | Loop type                   |   Iterations |    Time (ms) |   Time/iter (ns) |
  +-----------------------------+--------------+--------------+------------------+
  | may_goto                    |      8388608 |            3 |             0.36 |
  | timed_may_goto (count=65535)|    589674932 |          250 |             0.42 |
  | bpf_for                     |      8388608 |           10 |             1.19 |
  +-----------------------------+--------------+--------------+------------------+

This gives a good approximation at low overhead while staying close to the current implementation.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250304003239.2390751-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
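A simplified sketch of the refresh path described above; function and constant names follow the commit text, while the struct layout assumes the 16 bytes per stack frame are count plus timestamp:

  struct bpf_timed_may_goto {
	u64 count;
	u64 timestamp;
  };

  u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
  {
	u64 time = ktime_get_mono_fast_ns();

	/* First call for this stack frame: stamp the start time and
	 * hand back a full count.
	 */
	if (!p->timestamp) {
		p->timestamp = time;
		return BPF_MAX_TIMED_LOOPS;
	}
	/* Time slice (250 ms) exhausted: return 0 so that this loop,
	 * and all subsequent cond_break sequences, break.
	 */
	if (time - p->timestamp >= (NSEC_PER_SEC / 4))
		return 0;

	return BPF_MAX_TIMED_LOOPS;
  }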
2025-03-15bpf: Factor out check_load_mem() and check_store_reg()Peilin Ye
Extract BPF_LDX and most non-ATOMIC BPF_STX instruction handling logic in do_check() into helper functions to be used later. While we are here, make that comment about "reserved fields" more specific. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/8b39c94eac2bb7389ff12392ca666f939124ec4f.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-03-15bpf: Factor out check_atomic_rmw()Peilin Ye
Currently, check_atomic() only handles atomic read-modify-write (RMW) instructions. Since we are planning to introduce other types of atomic instructions (i.e., atomic load/store), extract the existing RMW handling logic into its own function named check_atomic_rmw(). Remove the @insn_idx parameter as it is not really necessary. Use 'env->insn_idx' instead, as in other places in verifier.c. Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/6323ac8e73a10a1c8ee547c77ed68cf8eb6b90e1.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
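After the refactor, check_atomic() reduces to a dispatcher of roughly this shape (a sketch; the non-RMW arms arrive with the load-acquire/store-release patches):

  static int check_atomic(struct bpf_verifier_env *env, struct bpf_insn *insn)
  {
	switch (insn->imm) {
	case BPF_ADD:
	case BPF_ADD | BPF_FETCH:
	case BPF_AND:
	case BPF_AND | BPF_FETCH:
	case BPF_OR:
	case BPF_OR | BPF_FETCH:
	case BPF_XOR:
	case BPF_XOR | BPF_FETCH:
	case BPF_XCHG:
	case BPF_CMPXCHG:
		/* All existing read-modify-write ops. */
		return check_atomic_rmw(env, insn);
	default:
		verbose(env, "BPF_ATOMIC uses invalid atomic opcode %02x\n",
			insn->imm);
		return -EINVAL;
	}
  }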
2025-03-15bpf: Factor out atomic_ptr_type_ok()Peilin Ye
Factor out atomic_ptr_type_ok() as a helper function to be used later. Signed-off-by: Peilin Ye <yepeilin@google.com> Link: https://lore.kernel.org/r/e5ef8b3116f3fffce78117a14060ddce05eba52a.1740978603.git.yepeilin@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>