summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2025-05-12mm/huge_memory: remove useless folio pointers passingGavin Guo
Since the previous commit "mm/huge_memory: Adjust try_to_migrate_one() and split_huge_pmd_locked()" has simplified the logic by leveraging the folio verification in page_vma_mapped_walk(), this patch removes the unnecessary folio pointers passing. Link: https://lkml.kernel.org/r/20250425103859.825879-3-gavinguo@igalia.com Link: https://lore.kernel.org/all/98d1d195-7821-4627-b518-83103ade56c0@redhat.com/ Link: https://lore.kernel.org/all/91599a3c-e69e-4d79-bac5-5013c96203d7@redhat.com/ Signed-off-by: Gavin Guo <gavinguo@igalia.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Florent Revest <revest@google.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12vmscan,cgroup: apply mems_effective to reclaimGregory Price
It is possible for a reclaimer to cause demotions of an lruvec belonging to a cgroup with cpuset.mems set to exclude some nodes. Attempt to apply this limitation based on the lruvec's memcg and prevent demotion. Notably, this may still allow demotion of shared libraries or any memory first instantiated in another cgroup. This means cpusets still cannot cannot guarantee complete isolation when demotion is enabled, and the docs have been updated to reflect this. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently - with the noted exceptions. Note on locking: The cgroup_get_e_css reference protects the css->effective_mems, and calls of this interface would be subject to the same race conditions associated with a non-atomic access to cs->effective_mems. So while this interface cannot make strong guarantees of correctness, it can therefore avoid taking a global or rcu_read_lock for performance. Link: https://lkml.kernel.org/r/20250424202806.52632-3-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Suggested-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12cpuset: rename cpuset_node_allowed to cpuset_current_node_allowedGregory Price
Patch series "vmscan: enforce mems_effective during demotion", v5. Change reclaim to respect cpuset.mems_effective during demotion when possible. Presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. Implement cpuset_node_allowed() to check the cpuset.mems_effective associated wih the mem_cgroup of the lruvec being scanned. This only applies to cgroup/cpuset v2, as cpuset exists in a different hierarchy than mem_cgroup in v1. This requires renaming the existing cpuset_node_allowed() to be cpuset_current_now_allowed() - which is more descriptive anyway - to implement the new cpuset_node_allowed() which takes a target cgroup. This patch (of 2): Rename cpuset_node_allowed to reflect that the function checks the current task's cpuset.mems. This allows us to make a new cpuset_node_allowed function that checks a target cgroup's cpuset.mems. Link: https://lkml.kernel.org/r/20250424202806.52632-1-gourry@gourry.net Link: https://lkml.kernel.org/r/20250424202806.52632-2-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12Update Christoph's Email address and make it consistentChristoph Lameter (Ampere)
Use cl@gentwo.org throughout and remove the old email addresses. Link: https://lkml.kernel.org/r/8b962f57-4d98-cbb0-cd82-b6ba456733e8@gentwo.org Signed-off-by: Christoph Lameter <cl@gentwo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm/damon/core: introduce damos quota goal metrics for memory node utilizationSeongJae Park
Patch series "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory". Utilizing DAMON for memory tiering usually requires manual tuning and/or tedious controls. Let it self-tune hotness and coldness thresholds for promotion and demotion aiming high utilization of high memory tiers, by introducing new DAMOS quota goal metrics representing the used and the free memory ratios of specific NUMA nodes. And introduce a sample DAMON module that demonstrates how the new feature can be used for memory tiering use cases. Backgrounds =========== A type of tiered memory system exposes the memory tiers as NUMA nodes. A straightforward pages placement strategy for such systems is placing access-hot and cold pages on upper and lower tiers, reespectively, pursuing higher utilization of upper tiers. Since access temperature can be dynamic, periodically finding and migrating hot pages and cold pages to proper tiers (promoting and demoting) is also required. Linux kernel provides several features for such dynamic and transparent pages placement. Page Faults and LRU ------------------- One widely known way is using NUMA balancing in tiering mode (a.k.a NUMAB-2) and reclaim-based demotion features. In the setup, NUMAB-2 finds hot pages using access check-purpose page faults (a.k.a prot_none) and promote those inside each process' context, until there is no more pages to promote, or the upper tier is filled up and memory pressure happens. In the latter case, LRU-based reclaim logic wakes up as a response to the memory pressure and demotes cold pages to lower tiers in asynchronous (kswapd) and/or synchronous ways (direct reclaim). DAMON ----- Yet another available solution is using DAMOS with migrate_hot and migrate_cold DAMOS actions for promotions and demotions, respectively. To make it optimum, users need to specify aggressiveness and access temperature thresholds for promotions and demotions in a good balance that results in high utilization of upper tiers. The number of parameters is not small, and optimum parameter values depend on characteristics of the underlying hardware and the workload. As a result, it often requires manual, time consuming and repetitive tuning of the DAMOS schemes for given workloads and systems combinations. Self-tuned DAMON-based Memory Tiering ===================================== To solve such manual tuning problems, DAMOS provides aim-oriented feedback-driven quotas self-tuning. Using the feature, we design a self-tuned DAMON-based memory tiering for general multi-tier memory systems. For each memory tier node, if it has a lower tier, run a DAMOS scheme that demotes cold pages of the node, auto-tuning the aggressiveness aiming an amount of free space of the node. The free space is for keeping the headroom that avoids significant memory pressure during upper tier memory usage spike, and promoting hot pages from the lower tier. For each memory tier node, if it has an upper tier, run a DAMOS scheme that promotes hot pages of the current node to the upper tier, auto-tuning the aggressiveness aiming a high utilization ratio of the upper tier. The target ratio is to ensure higher tiers are utilized as much as possible. It should match with the headroom for demotion scheme, but have slight overlap, to ensure promotion and demotion are not entirely stopped. The aim-oriented aggressiveness auto-tuning of DAMOS is already available. Hence, to make such tiering solution implementation, only new quota goal metrics for utilization and free space ratio of specific NUMA node need to be developed. Discussions =========== The design imposes below discussion points. Expected Behaviors ------------------ The system will let upper tier memory node accommodates as many hot data as possible. If total amount of the data is less than the top tier memory's promotion/demotion target utilization, entire data will be just placed on the top tier. Promotion scheme will do nothing since there is no data to promote. Demotion scheme will also do nothing since the free space ratio of the top tier is higher than the goal. Only if the amount of data is larger than the top tier's utilization ratio, demotion scheme will demote cold pages and ensure the headroom free space. Since the promotion and demotion schemes for a single node has small overlap at their target utilization and free space goals, promotions and demotions will continue working with a moderate aggressiveness level. It will keep all data is placed on access hotness under dynamic access pattern, while minimizing the migration overhead. In any case, each node will keep headroom free space and as many upper tiers are utilized as possible. Ease of Use ----------- Users still need to set the target utilization and free space ratio, but it will be easier to set. We argue 99.7 % utilization and 0.5 % free space ratios can be good default values. It can be easily adjusted based on desired headroom size of given use case. Users are also still required to answer the minimum coldness and hotness thresholds. Together with monitoring intervals auto-tuning[2], DAMON will always show meaningful amount of hot and cold memory. And DAMOS quota's prioritization mechanism will make good decision as long as the source information is that colorful. Hence, users can very naively set the minimum criterias. We believe any access observation and no access observation within last one aggregation interval is enough for minimum hot and cold regions criterias. General Tiered Memory Setup Applicability ----------------------------------------- The design can be applied to any number of tiers having any performance characteristics, as long as they can be hierarchical. Hence, applying the system to different tiered memory system will be straightforward. Note that this assumes only single CPU NUMA node case. Because today's DAMON is not aware of which CPU made each access, applying this on systems having multiple CPU NUMA nodes can be complicated. We are planning to extend DAMON for the use case, but that's out of the scope of this patch series. How To Use ---------- Users can implement the auto-tuned DAMON-based memory tiering using DAMON sysfs interface. It can be easily done using DAMON user-space tool like user-space tool. Below evaluation results section shows an example DAMON user-space tool command for that. For wider and simpler deployment, having a kernel module that sets up and run the DAMOS schemes via DAMON kernel API can be useful. The module can enable the memory tiering at boot time via kernel command line parameter or at run time with single command. This patch series implements a sample DAMON kernel module that shows how such module can be implemented. Comparison To Page Faults and LRU-based Approaches -------------------------------------------------- The existing page faults based promotion (NUMAB-2) does hot pages detection and migration in the process context. When there are many pages to promote, it can block the progress of the application's real works. DAMOS works in asynchronous worker thread, so it doesn't block the real works. NUMAB-2 doesn't provide a way to control aggressiveness of promotion other than the maximum amount of pages to promote per given time widnow. If hot pages are found, promotions can happen in the upper-bound speed, regardless of upper tier's memory pressure. If the maximum speed is not well set for the given workload, it can result in slow promotion or unnecessary memory pressure. Self-tuned DAMON-based memory tiering alleviates the problem by adjusting the speed based on current utilization of the upper tier. LRU-based demotion can be triggered in both asynchronous (kswapd) and synchronous (direct reclaim) ways. Other than the way of finding cold pages, asynchronous LRU-based demotion and DAMON-based demotion has no big difference. DAMON-based demotion can make a better balancing with DAMON-based promotion, though. The LRU-based demotion can do better than DAMON-based demotion when the tier is having significant memory pressure. It would be wise to use DAMON-based demotion as a proactive and primary one, but utilizing LRU-based demotions together as a fast backup solution. Evaluation ========== In short, under a setup that requires fast and frequent promotions, self-tuned DAMON-based memory tiering's hot pages promotion improves performance about 4.42 %. We believe this shows self-tuned DAMON-based promotion's effectiveness. Meanwhile, NUMAB-2's hot pages promotion degrades the performance about 7.34 %. We suspect the degradation is mostly due to NUMAB-2's synchronous nature that can block the application's progress, which highlights the advantage of DAMON-based solution's asynchronous nature. Note that the test was done with the RFC version of this patch series. We don't run it again since this patch series got no meaningful change after the RFC, while the test takes pretty long time. Setup ----- Hardware. Use a machine that equips 250 GiB DRAM memory tier and 50 GiB CXL memory tier. The tiers are exposed as NUMA nodes 0 and 1, respectively. Kernel. Use Linux kernel v6.13 that modified as following. Add all DAMON patches that available on mm tree of 2025-03-15, and this patch series. Also modify it to ignore mempolicy() system calls, to avoid bad effects from application's traditional NUMA systems assumed optimizations. Workload. Use a modified version of Taobench benchmark[3] that available on DCPerf benchmark suite. It represents an in-memory caching workload. We set its 'memsize', 'warmup_time', and 'test_time' parameter as 340 GiB, 2,500 seconds and 1,440 seconds. The parameters are chosen to ensure the workload uses more than DRAM memory tier. Its RSS under the parameter grows to 270 GiB within the warmup time. It turned out the workload has a very static access pattrn. Only about 13 % of the RSS is frequently accessed from the beginning to end. Hence promotion shows no meaningful performance difference regardless of different design and implementations. We therefore modify the kernel to periodically demote up to 10 GiB hot pages and promote up to 10 GiB cold pages once per minute. The intention is to simulate periodic access pattern changes. The hotness and coldness threshold is very naively set so that it is more like random access pattern change rather than strict hot/cold pages exchange. This is why we call the workload as "modified". It is implemented as two DAMOS schemes each running on an asynchronous thread. It can be reproduced with DAMON user-space tool like below. # ./damo start \ --ops paddr --numa_node 0 --monitoring_intervals 10s 200s 200s \ --damos_action migrate_hot 1 \ --damos_quota_interval 60s --damos_quota_space 10G \ --ops paddr --numa_node 1 --monitoring_intervals 10s 200s 200s \ --damos_action migrate_cold 0 \ --damos_quota_interval 60s --damos_quota_space 10G \ --nr_schemes 1 1 --nr_targets 1 1 --nr_ctxs 1 1 System configurations. Use below variant system configurations. - Baseline. No memory tiering features are turned on. - Numab_tiering. On the baseline, enable NUMAB-2 and relcaim-based demotion. In detail, following command is executed: echo 2 > /proc/sys/kernel/numa_balancing; echo 1 > /sys/kernel/mm/numa/demotion_enabled; echo 7 > /proc/sys/vm/zone_reclaim_mode - DAMON_tiering. On the baseline, utilize self-tuned DAMON-based memory tiering implementation via DAMON user-space tool. It utilizes two kernel threads, namely promotion thread and demotion thread. Demotion thread monitors access pattern of DRAM node using DAMON with auto-tuned monitoring intervals aiming 4% DAMON-observed access ratio, and demote coldest pages up to 200 MiB per second aiming 0.5% free space of DRAM node. Promotion thread monitors CXL node using same intervals auto-tuning, and promote hot pages in same way but aiming for 99.7% utilization of DRAM node. Because DAMON provides only best-effort accuracy, add young page DAMOS filters to allow only and reject all young pages at promoting and demoting, respectively. It can be reproduced with DAMON user-space tool like below. # ./damo start \ --numa_node 0 --monitoring_intervals_goal 4% 3 5ms 10s \ --damos_action migrate_cold 1 --damos_access_rate 0% 0% \ --damos_apply_interval 1s \ --damos_quota_interval 1s --damos_quota_space 200MB \ --damos_quota_goal node_mem_free_bp 0.5% 0 \ --damos_filter reject young \ --numa_node 1 --monitoring_intervals_goal 4% 3 5ms 10s \ --damos_action migrate_hot 0 --damos_access_rate 5% max \ --damos_apply_interval 1s \ --damos_quota_interval 1s --damos_quota_space 200MB \ --damos_quota_goal node_mem_used_bp 99.7% 0 \ --damos_filter allow young \ --damos_nr_quota_goals 1 1 --damos_nr_filters 1 1 \ --nr_targets 1 1 --nr_schemes 1 1 --nr_ctxs 1 1 Measurment Results ------------------ On each system configuration, run the modified version of Taobench and collect 'score'. 'score' is a metric that calculated and provided by Taobench to represents the performance of the run on the system. To handle the measurement errors, repeat the measurement five times. The results are as below. Config Score Stdev (%) Normalized Baseline 1.6165 0.0319 1.9764 1.0000 Numab_tiering 1.4976 0.0452 3.0209 0.9264 DAMON_tiering 1.6881 0.0249 1.4767 1.0443 'Config' column shows the system config of the measurement. 'Score' column shows the 'score' measurement in average of the five runs on the system config. 'Stdev' column shows the standsard deviation of the five measurements of the scores. '(%)' column shows the 'Stdev' to 'Score' ratio in percentage. Finally, 'Normalized' column shows the averaged score values of the configs that normalized to that of 'Baseline'. The periodic hot pages demotion and cold pages promotion that was conducted to simulate dynamic access pattern was started from the beginning of the workload. It resulted in the DRAM tier utilization always under the watermark, and hence no real demotion was happened for all test runs. This means the above results show no difference between LRU-based and DAMON-based demotions. Only difference between NUMAB-2 and DAMON-based promotions are represented on the results. Numab_tiering config degraded the performance about 7.36 %. We suspect this happened because NUMAB-2's synchronous promotion was blocking the Taobench's real work progress. DAMON_tiering config improved the performance about 4.43 %. We believe this shows effectiveness of DAMON-based promotion that didn't block Taobench's real work progress due to its asynchronous nature. Also this means DAMON's monitoring results are accurate enough to provide visible amount of improvement. Evaluation Limitations ---------------------- As mentioned above, this evaluation shows only comparison of promotion mechanisms. DAMON-based tiering is recommended to be used together with reclaim-based demotion as a faster backup under significant memory pressure, though. From some perspective, the modified version of Taobench may seems making the picture distorted too much. It would be better to evaluate with more realistic workload, or more finely tuned micro benchmarks. Patch Sequence ============== The first patch (patch 1) implements two new quota goal metrics on core layer and expose it to DAMON core kernel API. The second and third ones (patches 2 and 3) further link it to DAMON sysfs interface. Three following patches (patches 4-6) document the new feature and sysfs file on design, usage, and ABI documents. The final one (patch 7) implements a working version of a self-tuned DAMON-based memory tiering solution in an incomplete but easy to understand form as a kernel module under samples/damon/ directory. References ========== [1] https://lore.kernel.org/20231112195602.61525-1-sj@kernel.org/ [2] https://lore.kernel.org/20250303221726.484227-1-sj@kernel.org [3] https://github.com/facebookresearch/DCPerf/blob/main/packages/tao_bench/README.md This patch (of 7): Used and free space ratios for specific NUMA nodes can be useful inputs for NUMA-specific DAMOS schemes' aggressiveness self-tuning feedback loop. Implement DAMOS quota goal metrics for such self-tuned schemes. Link: https://lkml.kernel.org/r/20250420194030.75838-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250420194030.75838-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12Merge branch 'for-6.16/tsm-mr' into tsm-nextDan Williams
Merge measurement-register infrastructure for v6.16. Resolve conflicts with the establishment of drivers/virt/coco/guest/ for cross-vendor common TSM functionality. Address a mis-merge with a fixup from Lukas: Link: http://lore.kernel.org/20250509134031.70559-1-lukas.bulwahn@redhat.com
2025-05-12scsi: sd_zbc: block: Respect bio vector limits for REPORT ZONES bufferSteve Siwinski
The REPORT ZONES buffer size is currently limited by the HBA's maximum segment count to ensure the buffer can be mapped. However, the block layer further limits the number of iovec entries to 1024 when allocating a bio. To avoid allocation of buffers too large to be mapped, further restrict the maximum buffer size to BIO_MAX_INLINE_VECS. Replace the UIO_MAXIOV symbolic name with the more contextually appropriate BIO_MAX_INLINE_VECS. Fixes: b091ac616846 ("sd_zbc: Fix report zones buffer allocation") Cc: stable@vger.kernel.org Signed-off-by: Steve Siwinski <ssiwinski@atto.com> Link: https://lore.kernel.org/r/20250508200122.243129-1-ssiwinski@atto.com Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-05-12Merge branch 'for-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux Tony Nguyen says: ==================== Prepare for Intel IPU E2000 (GEN3) This is the first part in introducing RDMA support for idpf. ---------------------------------------------------------------- Tatyana Nikolova says: To align with review comments, the patch series introducing RDMA RoCEv2 support for the Intel Infrastructure Processing Unit (IPU) E2000 line of products is going to be submitted in three parts: 1. Modify ice to use specific and common IIDC definitions and pass a core device info to irdma. 2. Add RDMA support to idpf and modify idpf to use specific and common IIDC definitions and pass a core device info to irdma. 3. Add RDMA RoCEv2 support for the E2000 products, referred to as GEN3 to irdma. This first part is a 5 patch series based on the original "iidc/ice/irdma: Update IDC to support multiple consumers" patch to allow for multiple CORE PCI drivers, using the auxbus. Patches: 1) Move header file to new name for clarity and replace ice specific DSCP define with a kernel equivalent one in irdma 2) Unify naming convention 3) Separate header file into common and driver specific info 4) Replace ice specific DSCP define with a kernel equivalent one in ice 5) Implement core device info struct and update drivers to use it ---------------------------------------------------------------- v1: https://lore.kernel.org/20250505212037.2092288-1-anthony.l.nguyen@intel.com IWL reviews: [v5] https://lore.kernel.org/20250416021549.606-1-tatyana.e.nikolova@intel.com [v4] https://lore.kernel.org/20250225050428.2166-1-tatyana.e.nikolova@intel.com [v3] https://lore.kernel.org/20250207194931.1569-1-tatyana.e.nikolova@intel.com [v2] https://lore.kernel.org/20240824031924.421-1-tatyana.e.nikolova@intel.com [v1] https://lore.kernel.org/20240724233917.704-1-tatyana.e.nikolova@intel.com * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux: iidc/ice/irdma: Update IDC to support multiple consumers ice: Replace ice specific DSCP mapping num with a kernel define iidc/ice/irdma: Break iidc.h into two headers iidc/ice/irdma: Rename to iidc_* convention iidc/ice/irdma: Rename IDC header file ==================== Link: https://patch.msgid.link/20250509200712.2911060-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12helpers: make few bpf helpers publicMykyta Yatsenko
Make bpf_dynptr_slice_rdwr, bpf_dynptr_check_off_len and __bpf_dynptr_write available outside of the helpers.c by adding their prototypes into linux/include/bpf.h. bpf_dynptr_check_off_len() implementation is moved to header and made inline explicitly, as small function should typically be inlined. These functions are going to be used from bpf_trace.c in the next patch of this series. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20250512205348.191079-2-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-12soc: qcom: llcc-qcom: Add support for SM8750Melody Olvera
Add system cache table and configs for SM8750 SoCs. Signed-off-by: Melody Olvera <melody.olvera@oss.qualcomm.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Link: https://lore.kernel.org/r/20250512-sm8750_llcc_master-v5-3-d78dca6282a5@oss.qualcomm.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2025-05-12io_uring: drain based on allocates reqsPavel Begunkov
Don't rely on CQ sequence numbers for draining, as it has become messy and needs cq_extra adjustments. Instead, base it on the number of allocated requests and only allow flushing when all requests are in the drain list. As a result, cq_extra is gone, no overhead for its accounting in aux cqe posting, less bloating as it was inlined before, and it's in general simpler than trying to track where we should bump it and where it should be put back like in cases of overflow. Also, it'll likely help with cleaning and unifying some of the CQ posting helpers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/46ece1e34320b046c06fee2498d6b4cd12a700f2.1746788718.git.asml.silence@gmail.com Link: https://lore.kernel.org/r/24497b04b004bceada496033d3c9d09ff8e81ae9.1746944903.git.asml.silence@gmail.com [axboe: fold in fix from link2] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-12ACPI: MRRM: Minimal parse of ACPI MRRM tableTony Luck
The resctrl file system code needs to know how many region tags are supported. Parse the ACPI MRRM table and save the max_mem_region value. Provide a function for resctrl to collect that value. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://patch.msgid.link/20250505173819.419271-2-tony.luck@intel.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-12Merge 6.15-rc6 into driver-core-nextGreg Kroah-Hartman
We need the driver core fix in here as well for testing Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-12regulator: pca9450: Add restart handlerPaul Geurts
When restarting a CPU powered by the PCA9450 power management IC, it is beneficial to use the PCA9450 to power cycle the CPU and all its connected peripherals to start up in a known state. The PCA9450 features a cold start procedure initiated by an I2C command. Add a restart handler so that the PCA9450 is used to restart the CPU. The restart handler sends command 0x14 to the SW_RST register, initiating a cold reset (Power recycle all regulators except LDO1, LDO2 and CLK_32K_OUT) As the PCA9450 is a PMIC specific for the i.MX8M family CPU, the restart handler priority is set just slightly higher than imx2_wdt and the PSCI restart handler. This makes sure this restart handler takes precedence. Signed-off-by: Paul Geurts <paul.geurts@prodrive-technologies.com> Link: https://patch.msgid.link/20250505115936.1946891-1-paul.geurts@prodrive-technologies.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-05-12Merge branch 'cznic/platform' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc into soc/drivers These are updates from Marek Behún for the cznic platform drivers: This series adds support for generating ECDSA signatures with hardware stored private key on Turris Omnia and Turris MOX. This ability is exposed via the keyctl() syscall. * 'cznic/platform' of https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: platform: cznic: use ffs() instead of __bf_shf() firmware: turris-mox-rwtm: fix building without CONFIG_KEYS platform: cznic: fix function parameter names firmware: turris-mox-rwtm: Add support for ECDSA signatures with HW private key firmware: turris-mox-rwtm: Drop ECDSA signatures via debugfs platform: cznic: turris-omnia-mcu: Add support for digital message signing with HW private key platform: cznic: Add keyctl helpers for Turris platform platform: cznic: turris-omnia-mcu: Refactor requesting MCU interrupt Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-05-12power: supply: core: Add additional health status valuesArmin Wolf
Some batteries can signal when an internal fuse was blown. In such a case POWER_SUPPLY_HEALTH_DEAD is too vague for userspace applications to perform meaningful diagnostics. Additionally some batteries can also signal when some of their internal cells are imbalanced. In such a case returning POWER_SUPPLY_HEALTH_UNSPEC_FAILURE is again too vague for userspace applications to perform meaningful diagnostics. Add new health status values for both cases. Signed-off-by: Armin Wolf <W_Armin@gmx.de> Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com> Link: https://lore.kernel.org/r/20250429003606.303870-1-W_Armin@gmx.de Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-05-12mm/hmm: provide generic DMA managing logicLeon Romanovsky
HMM callers use PFN list to populate range while calling to hmm_range_fault(), the conversion from PFN to DMA address is done by the callers with help of another DMA list. However, it is wasteful on any modern platform and by doing the right logic, that DMA list can be avoided. Provide generic logic to manage these lists and gave an interface to map/unmap PFNs to DMA addresses, without requiring from the callers to be an experts in DMA core API. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-12mm/hmm: let users to tag specific PFN with DMA mapped bitLeon Romanovsky
Introduce new sticky flag (HMM_PFN_DMA_MAPPED), which isn't overwritten by HMM range fault. Such flag allows users to tag specific PFNs with information if this specific PFN was already DMA mapped. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-12Provide a new two step DMA mapping APILeon Romanovsky
Currently the only efficient way to map a complex memory description through the DMA API is by using the scatter list APIs. The SG APIs are unique in that they efficiently combine the two fundamental operations of sizing and allocating a large IOVA window from the IOMMU and processing all the per-address swiotlb/flushing/p2p/map details. This uniqueness has been a long standing pain point as the scatter list API is mandatory, but expensive to use. It prevents any kind of optimization or feature improvement (such as avoiding struct page for P2P) due to the impossibility of improving the scatter list. Several approaches have been explored to expand the DMA API with additional scatterlist-like structures (BIO, rlist), instead split up the DMA API to allow callers to bring their own data structure. The API is split up into parts: - Allocate IOVA space: To do any pre-allocation required. This is done based on the caller supplying some details about how much IOMMU address space it would need in worst case. - Map and unmap relevant structures to pre-allocated IOVA space: Perform the actual mapping into the pre-allocated IOVA. This is very similar to dma_map_page(). Thanks Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-12Add more devm_ functions to fix PM imbalance inMark Brown
Merge series from Bence Csókás <csokas.bence@prolan.hu>: The probe() function of the atmel-quadspi driver got quite convoluted, especially since the addition of SAMA7G5 support, that was forward-ported from an older vendor kernel. During the port, a bug was introduced, where the PM get() and put() calls were imbalanced. To alleivate this - and similar problems in the future - an effort was made to migrate as many functions as possible, to their devm_ managed counterparts. The few functions, which did not yet have a devm_ variant, are added in patch 1 of this series. Patch 2 then uses these APIs to fix the probe() function.
2025-05-11list: remove redundant 'extern' for function prototypesAndy Shevchenko
The 'extern' keyword is redundant for function prototypes. list.h never used them and new code in general is better without them. Link: https://lkml.kernel.org/r/20250502121742.3997529-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11lib/oid_registry.c: remove unused sprint_OIDDr. David Alan Gilbert
sprint_OID() was added as part of 2012's commit 4f73175d0375 ("X.509: Add utility functions to render OIDs as strings") but it hasn't been used. Remove it. Note that there's also 'sprint_oid' (lower case) which is used in a lot of places; that's left as is except for fixing its case in a comment. Link: https://lkml.kernel.org/r/20250501010502.326472-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11util_macros.h: fix the reference in kernel-docAndy Shevchenko
In PTR_IF() description the text refers to the parameter as (ptr) while the kernel-doc format asks for @ptr. Fix this accordingly. Link: https://lkml.kernel.org/r/20250428072737.3265239-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexandru Ardelean <aardelean@baylibre.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11sort.h: hoist cmp_int() into generic header fileFedor Pchelkin
Deduplicate the same functionality implemented in several places by moving the cmp_int() helper macro into linux/sort.h. The macro performs a three-way comparison of the arguments mostly useful in different sorting strategies and algorithms. Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Suggested-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: Coly Li <colyli@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Carlos Maiolino <cem@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Coly Li <colyli@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11compiler_types.h: fix "unused variable" in __compiletime_assert()Marc Herbert
This refines commit c03567a8e8d5 ("include/linux/compiler.h: don't perform compiletime_assert with -O0") and restores #ifdef __OPTIMIZE__ symmetry by evaluating the 'condition' variable in both compile-time variants of __compiletimeassert(). As __OPTIMIZE__ is always true by default, this commit does not change anything by default. But it fixes warnings with _non-default_ CFLAGS like for instance this: make CFLAGS_tcp.o='-Og -U__OPTIMIZE__' from net/ipv4/tcp.c:273: include/net/sch_generic.h: In function `qdisc_cb_private_validate': include/net/sch_generic.h:511:30: error: unused variable `qcb' [-Werror=unused-variable] { struct qdisc_skb_cb *qcb; BUILD_BUG_ON(sizeof(skb->cb) < sizeof(*qcb)); ... } [akpm@linux-foundation.org: regularize comment layout, reflow comment] Link: https://lkml.kernel.org/r/20250424194048.652571-1-marc.herbert@linux.intel.com Signed-off-by: Marc Herbert <Marc.Herbert@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Changbin Du <changbin.du@intel.com> Cc: Jan Hendrik Farr <kernel@jfarr.cc> Cc: Macro Elver <elver@google.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Tony Ambardar <tony.ambardar@gmail.com> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11treewide: fix typo "previlege"WangYuli
There are some spelling mistakes of 'previlege' in comments which should be 'privilege'. Fix them and add it to scripts/spelling.txt. The typo in arch/loongarch/kvm/main.c was corrected by a different patch [1] and is therefore not included in this submission. [1]. https://lore.kernel.org/all/20250420142208.2252280-1-wheatfox17@icloud.com/ Link: https://lkml.kernel.org/r/46AD404E411A4BAC+20250421074910.66988-1-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11relay: remove unused relay_late_setup_filesDr. David Alan Gilbert
The last use of relay_late_setup_files() was removed in 2018 by commit 2b47733045aa ("drm/i915/guc: Merge log relay file and channel creation") Remove it and the helper it used. relay_late_setup_files() was used for eventually registering 'buffer only' channels. With it gone, delete the docs that explain how to do that. Which suggests it should be possible to lose the 'has_base_filename' flags. (Are there any other uses??) Link: https://lkml.kernel.org/r/20250418234932.490863-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11rapidio: remove unused functionsDr. David Alan Gilbert
rio_request_dma() and rio_dma_prep_slave_sg() were added in 2012 by commit e42d98ebe7d7 ("rapidio: add DMA engine support for RIO data transfers") but never used. rio_find_mport() last use was removed in 2013 by commit 9edbc30b434f ("rapidio: update enumerator registration mechanism") rio_unregister_scan() was added in 2013 by commit a11650e11093 ("rapidio: make enumeration/discovery configurable") but never used. Remove them. Link: https://lkml.kernel.org/r/20250419203012.429787-3-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: Matt Porter <mporter@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11scatterlist: inline sg_next()Caleb Sander Mateos
sg_next() is a short function called frequently in I/O paths. Define it in the header file so it can be inlined into its callers. Link: https://lkml.kernel.org/r/20250416160615.3571958-1-csander@purestorage.com Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11hung_task: show the blocker task if the task is hung on semaphoreLance Yang
Inspired by mutex blocker tracking[1], this patch makes a trade-off to balance the overhead and utility of the hung task detector. Unlike mutexes, semaphores lack explicit ownership tracking, making it challenging to identify the root cause of hangs. To address this, we introduce a last_holder field to the semaphore structure, which is updated when a task successfully calls down() and cleared during up(). The assumption is that if a task is blocked on a semaphore, the holders must not have released it. While this does not guarantee that the last holder is one of the current blockers, it likely provides a practical hint for diagnosing semaphore-related stalls. With this change, the hung task detector can now show blocker task's info like below: [Tue Apr 8 12:19:07 2025] INFO: task cat:945 blocked for more than 120 seconds. [Tue Apr 8 12:19:07 2025] Tainted: G E 6.14.0-rc6+ #1 [Tue Apr 8 12:19:07 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Tue Apr 8 12:19:07 2025] task:cat state:D stack:0 pid:945 tgid:945 ppid:828 task_flags:0x400000 flags:0x00000000 [Tue Apr 8 12:19:07 2025] Call Trace: [Tue Apr 8 12:19:07 2025] <TASK> [Tue Apr 8 12:19:07 2025] __schedule+0x491/0xbd0 [Tue Apr 8 12:19:07 2025] schedule+0x27/0xf0 [Tue Apr 8 12:19:07 2025] schedule_timeout+0xe3/0xf0 [Tue Apr 8 12:19:07 2025] ? __folio_mod_stat+0x2a/0x80 [Tue Apr 8 12:19:07 2025] ? set_ptes.constprop.0+0x27/0x90 [Tue Apr 8 12:19:07 2025] __down_common+0x155/0x280 [Tue Apr 8 12:19:07 2025] down+0x53/0x70 [Tue Apr 8 12:19:07 2025] read_dummy_semaphore+0x23/0x60 [Tue Apr 8 12:19:07 2025] full_proxy_read+0x5f/0xa0 [Tue Apr 8 12:19:07 2025] vfs_read+0xbc/0x350 [Tue Apr 8 12:19:07 2025] ? __count_memcg_events+0xa5/0x140 [Tue Apr 8 12:19:07 2025] ? count_memcg_events.constprop.0+0x1a/0x30 [Tue Apr 8 12:19:07 2025] ? handle_mm_fault+0x180/0x260 [Tue Apr 8 12:19:07 2025] ksys_read+0x66/0xe0 [Tue Apr 8 12:19:07 2025] do_syscall_64+0x51/0x120 [Tue Apr 8 12:19:07 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Tue Apr 8 12:19:07 2025] RIP: 0033:0x7f419478f46e [Tue Apr 8 12:19:07 2025] RSP: 002b:00007fff1c4d2668 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Tue Apr 8 12:19:07 2025] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f419478f46e [Tue Apr 8 12:19:07 2025] RDX: 0000000000020000 RSI: 00007f4194683000 RDI: 0000000000000003 [Tue Apr 8 12:19:07 2025] RBP: 00007f4194683000 R08: 00007f4194682010 R09: 0000000000000000 [Tue Apr 8 12:19:07 2025] R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000 [Tue Apr 8 12:19:07 2025] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 [Tue Apr 8 12:19:07 2025] </TASK> [Tue Apr 8 12:19:07 2025] INFO: task cat:945 blocked on a semaphore likely last held by task cat:938 [Tue Apr 8 12:19:07 2025] task:cat state:S stack:0 pid:938 tgid:938 ppid:584 task_flags:0x400000 flags:0x00000000 [Tue Apr 8 12:19:07 2025] Call Trace: [Tue Apr 8 12:19:07 2025] <TASK> [Tue Apr 8 12:19:07 2025] __schedule+0x491/0xbd0 [Tue Apr 8 12:19:07 2025] ? _raw_spin_unlock_irqrestore+0xe/0x40 [Tue Apr 8 12:19:07 2025] schedule+0x27/0xf0 [Tue Apr 8 12:19:07 2025] schedule_timeout+0x77/0xf0 [Tue Apr 8 12:19:07 2025] ? __pfx_process_timeout+0x10/0x10 [Tue Apr 8 12:19:07 2025] msleep_interruptible+0x49/0x60 [Tue Apr 8 12:19:07 2025] read_dummy_semaphore+0x2d/0x60 [Tue Apr 8 12:19:07 2025] full_proxy_read+0x5f/0xa0 [Tue Apr 8 12:19:07 2025] vfs_read+0xbc/0x350 [Tue Apr 8 12:19:07 2025] ? __count_memcg_events+0xa5/0x140 [Tue Apr 8 12:19:07 2025] ? count_memcg_events.constprop.0+0x1a/0x30 [Tue Apr 8 12:19:07 2025] ? handle_mm_fault+0x180/0x260 [Tue Apr 8 12:19:07 2025] ksys_read+0x66/0xe0 [Tue Apr 8 12:19:07 2025] do_syscall_64+0x51/0x120 [Tue Apr 8 12:19:07 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Tue Apr 8 12:19:07 2025] RIP: 0033:0x7f7c584a646e [Tue Apr 8 12:19:07 2025] RSP: 002b:00007ffdba8ce158 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Tue Apr 8 12:19:07 2025] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f7c584a646e [Tue Apr 8 12:19:07 2025] RDX: 0000000000020000 RSI: 00007f7c5839a000 RDI: 0000000000000003 [Tue Apr 8 12:19:07 2025] RBP: 00007f7c5839a000 R08: 00007f7c58399010 R09: 0000000000000000 [Tue Apr 8 12:19:07 2025] R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000 [Tue Apr 8 12:19:07 2025] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 [Tue Apr 8 12:19:07 2025] </TASK> [1] https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com Link: https://lkml.kernel.org/r/20250414145945.84916-3-ioworker0@gmail.com Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com> Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tomasz Figa <tfiga@chromium.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Yongliang Gao <leonylgao@tencent.com> Cc: Zi Li <amaindex@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11hung_task: replace blocker_mutex with encoded blockerLance Yang
Patch series "hung_task: extend blocking task stacktrace dump to semaphore", v5. Inspired by mutex blocker tracking[1], this patch series extend the feature to not only dump the blocker task holding a mutex but also to support semaphores. Unlike mutexes, semaphores lack explicit ownership tracking, making it challenging to identify the root cause of hangs. To address this, we introduce a last_holder field to the semaphore structure, which is updated when a task successfully calls down() and cleared during up(). The assumption is that if a task is blocked on a semaphore, the holders must not have released it. While this does not guarantee that the last holder is one of the current blockers, it likely provides a practical hint for diagnosing semaphore-related stalls. With this change, the hung task detector can now show blocker task's info like below: [Tue Apr 8 12:19:07 2025] INFO: task cat:945 blocked for more than 120 seconds. [Tue Apr 8 12:19:07 2025] Tainted: G E 6.14.0-rc6+ #1 [Tue Apr 8 12:19:07 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Tue Apr 8 12:19:07 2025] task:cat state:D stack:0 pid:945 tgid:945 ppid:828 task_flags:0x400000 flags:0x00000000 [Tue Apr 8 12:19:07 2025] Call Trace: [Tue Apr 8 12:19:07 2025] <TASK> [Tue Apr 8 12:19:07 2025] __schedule+0x491/0xbd0 [Tue Apr 8 12:19:07 2025] schedule+0x27/0xf0 [Tue Apr 8 12:19:07 2025] schedule_timeout+0xe3/0xf0 [Tue Apr 8 12:19:07 2025] ? __folio_mod_stat+0x2a/0x80 [Tue Apr 8 12:19:07 2025] ? set_ptes.constprop.0+0x27/0x90 [Tue Apr 8 12:19:07 2025] __down_common+0x155/0x280 [Tue Apr 8 12:19:07 2025] down+0x53/0x70 [Tue Apr 8 12:19:07 2025] read_dummy_semaphore+0x23/0x60 [Tue Apr 8 12:19:07 2025] full_proxy_read+0x5f/0xa0 [Tue Apr 8 12:19:07 2025] vfs_read+0xbc/0x350 [Tue Apr 8 12:19:07 2025] ? __count_memcg_events+0xa5/0x140 [Tue Apr 8 12:19:07 2025] ? count_memcg_events.constprop.0+0x1a/0x30 [Tue Apr 8 12:19:07 2025] ? handle_mm_fault+0x180/0x260 [Tue Apr 8 12:19:07 2025] ksys_read+0x66/0xe0 [Tue Apr 8 12:19:07 2025] do_syscall_64+0x51/0x120 [Tue Apr 8 12:19:07 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Tue Apr 8 12:19:07 2025] RIP: 0033:0x7f419478f46e [Tue Apr 8 12:19:07 2025] RSP: 002b:00007fff1c4d2668 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Tue Apr 8 12:19:07 2025] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f419478f46e [Tue Apr 8 12:19:07 2025] RDX: 0000000000020000 RSI: 00007f4194683000 RDI: 0000000000000003 [Tue Apr 8 12:19:07 2025] RBP: 00007f4194683000 R08: 00007f4194682010 R09: 0000000000000000 [Tue Apr 8 12:19:07 2025] R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000 [Tue Apr 8 12:19:07 2025] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 [Tue Apr 8 12:19:07 2025] </TASK> [Tue Apr 8 12:19:07 2025] INFO: task cat:945 blocked on a semaphore likely last held by task cat:938 [Tue Apr 8 12:19:07 2025] task:cat state:S stack:0 pid:938 tgid:938 ppid:584 task_flags:0x400000 flags:0x00000000 [Tue Apr 8 12:19:07 2025] Call Trace: [Tue Apr 8 12:19:07 2025] <TASK> [Tue Apr 8 12:19:07 2025] __schedule+0x491/0xbd0 [Tue Apr 8 12:19:07 2025] ? _raw_spin_unlock_irqrestore+0xe/0x40 [Tue Apr 8 12:19:07 2025] schedule+0x27/0xf0 [Tue Apr 8 12:19:07 2025] schedule_timeout+0x77/0xf0 [Tue Apr 8 12:19:07 2025] ? __pfx_process_timeout+0x10/0x10 [Tue Apr 8 12:19:07 2025] msleep_interruptible+0x49/0x60 [Tue Apr 8 12:19:07 2025] read_dummy_semaphore+0x2d/0x60 [Tue Apr 8 12:19:07 2025] full_proxy_read+0x5f/0xa0 [Tue Apr 8 12:19:07 2025] vfs_read+0xbc/0x350 [Tue Apr 8 12:19:07 2025] ? __count_memcg_events+0xa5/0x140 [Tue Apr 8 12:19:07 2025] ? count_memcg_events.constprop.0+0x1a/0x30 [Tue Apr 8 12:19:07 2025] ? handle_mm_fault+0x180/0x260 [Tue Apr 8 12:19:07 2025] ksys_read+0x66/0xe0 [Tue Apr 8 12:19:07 2025] do_syscall_64+0x51/0x120 [Tue Apr 8 12:19:07 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Tue Apr 8 12:19:07 2025] RIP: 0033:0x7f7c584a646e [Tue Apr 8 12:19:07 2025] RSP: 002b:00007ffdba8ce158 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Tue Apr 8 12:19:07 2025] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f7c584a646e [Tue Apr 8 12:19:07 2025] RDX: 0000000000020000 RSI: 00007f7c5839a000 RDI: 0000000000000003 [Tue Apr 8 12:19:07 2025] RBP: 00007f7c5839a000 R08: 00007f7c58399010 R09: 0000000000000000 [Tue Apr 8 12:19:07 2025] R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000 [Tue Apr 8 12:19:07 2025] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 [Tue Apr 8 12:19:07 2025] </TASK> This patch (of 3): This patch replaces 'struct mutex *blocker_mutex' with 'unsigned long blocker', as only one blocker is active at a time. The blocker filed can store both the lock addrees and the lock type, with LSB used to encode the type as Masami suggested, making it easier to extend the feature to cover other types of locks. Also, once the lock type is determined, we can directly extract the address and cast it to a lock pointer ;) Link: https://lkml.kernel.org/r/20250414145945.84916-1-ioworker0@gmail.com Link: https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com [1] Link: https://lkml.kernel.org/r/20250414145945.84916-2-ioworker0@gmail.com Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com> Signed-off-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tomasz Figa <tfiga@chromium.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Yongliang Gao <leonylgao@tencent.com> Cc: Zi Li <amaindex@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11kernel.h: move PTR_IF() and u64_to_user_ptr() to util_macros.hAndy Shevchenko
While the natural choice of PTR_IF() is kconfig.h, the latter is too broad to include C code and actually the macro was moved out from there in the past. But kernel.h is neither a good choice for that. Move it to util_macros.h. Do the same for u64_to_user_ptr(). While moving, add necessary documentation. Link: https://lkml.kernel.org/r/20250324105228.775784-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexandru Ardelean <aardelean@baylibre.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11kernel.h: move READ/WRITE definitions to <linux/types.h>Ingo Molnar
Patch series "kernel.h: Move out a couple of macros and constants". kernel.h hosts a couple of macros and constants that may be better placed. Do that. Also add missing documentation. No functional changes intended. This patch (of 2): Headers shouldn't be forced to include <linux/kernel.h> just to gain these simple constants. Link: https://lkml.kernel.org/r/20250324105228.775784-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20250324105228.775784-2-andriy.shevchenko@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexandru Ardelean <aardelean@baylibre.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11task_stack.h: remove obsolete __HAVE_ARCH_KSTACK_END checkPasha Tatashin
Remove __HAVE_ARCH_KSTACK_END as it has been obsolete since removal of metag architecture in v4.17. Link: https://lkml.kernel.org/r/20250403-kstack-end-v1-1-7798e71f34d1@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-2-pasha.tatashin@soleen.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: move mmap/vma locking logic into specific filesLorenzo Stoakes
Currently the VMA and mmap locking logic is entangled in two of the most overwrought files in mm - include/linux/mm.h and mm/memory.c. Separate this logic out so we can more easily make changes and create an appropriate MAINTAINERS entry that spans only the logic relating to locking. This should have no functional change. Care is taken to avoid dependency loops, we must regrettably keep release_fault_lock() and assert_fault_locked() in mm.h as a result due to the dependence on the vm_fault type. Additionally we must declare rcuwait_wake_up() manually to avoid a dependency cycle on linux/rcuwait.h. Additionally move the nommu implementatino of lock_mm_and_find_vma() to mmap_lock.c so everything lock-related is in one place. Link: https://lkml.kernel.org/r/bec6c8e29fa8de9267a811a10b1bdae355d67ed4.1744799282.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: convert free_page_and_swap_cache() to free_folio_and_swap_cache()Fan Ni
free_page_and_swap_cache() takes a struct page pointer as input parameter, but it will immediately convert it to folio and all operations following within use folio instead of page. It makes more sense to pass in folio directly. Convert free_page_and_swap_cache() to free_folio_and_swap_cache() to consume folio directly. Link: https://lkml.kernel.org/r/20250416201720.41678-1-nifan.cxl@gmail.com Signed-off-by: Fan Ni <fan.ni@samsung.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11maple_tree: add sufficient heightSidhartha Kumar
In order to support rebalancing and spanning stores using less than the worst case number of nodes, we need to track more than just the vacant height. Using only vacant height to reduce the worst case maple node allocation count can lead to a shortcoming of nodes in the following scenarios. For rebalancing writes, when a leaf node becomes insufficient, it may be combined with a sibling into a single node. This means that the parent node which has entries for this children will lose one entry. If this parent node was just meeting the minimum entries, losing one entry will now cause this parent node to be insufficient. This leads to a cascading operation of rebalancing at different levels and can lead to more node allocations than simply using vacant height can return. For spanning writes, a similar situation occurs. At the location at which a spanning write is detected, the number of ancestor nodes may similarly need to rebalanced into a smaller number of nodes and the same cascading situation could occur. To use less than the full height of the tree for the number of allocations, we also need to track the height at which a non-leaf node cannot become insufficient. This means even if a rebalance occurs to a child of this node, it currently has enough entries that it can lose one without any further action. This field is stored in the maple write state as sufficient height. In mas_prealloc_calc() when figuring out how many nodes to allocate, we check if the vacant node is lower in the tree than a sufficient node (has a larger value). If it is, we cannot use the vacant height and must use the difference in the height and sufficient height as the basis for the number of nodes needed. An off by one bug was also discovered in mast_overflow() where it is using >= rather than >. This caused extra iterations of the mas_spanning_rebalance() loop and lead to unneeded allocations. A test is also added to check the number of allocations is correct. Link: https://lkml.kernel.org/r/20250410191446.2474640-6-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11maple_tree: use vacant nodes to reduce worst case allocationsSidhartha Kumar
In order to determine the store type for a maple tree operation, a walk of the tree is done through mas_wr_walk(). This function descends the tree until a spanning write is detected or we reach a leaf node. While descending, keep track of the height at which we encounter a node with available space. This is done by checking if mas->end is less than the number of slots a given node type can fit. Now that the height of the vacant node is tracked, we can use the difference between the height of the tree and the height of the vacant node to know how many levels we will have to propagate creating new nodes. Update mas_prealloc_calc() to consider the vacant height and reduce the number of worst-case allocations. Rebalancing and spanning stores are not supported and fall back to using the full height of the tree for allocations. Update preallocation testing assertions to take into account vacant height. Link: https://lkml.kernel.org/r/20250410191446.2474640-4-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: skip ptlock_init() for kernel PMDsKevin Brodsky
Split page table locks are not used for pgtables associated to init_mm, at any level. pte_alloc_kernel() does not call ptlock_init() as a result. There is however no separate alloc/free functions for kernel PMDs, and pmd_ptlock_init() is called unconditionally. When ALLOC_SPLIT_PTLOCKS is true (e.g. 32-bit architectures or if CONFIG_PREEMPT_RT is selected), this results in unnecessary dynamic memory allocation every time a kernel PMD is allocated. Now that pagetable_pmd_ctor() is passed the associated mm, we can easily remove this overhead by skipping pmd_ptlock_init() if the pgtable is associated to init_mm. No special-casing is needed on the dtor path, as ptlock_free() is already called unconditionally for all levels. (ptlock_free() is a no-op unless a ptlock was allocated for the given PTP.) Link: https://lkml.kernel.org/r/20250408095222.860601-8-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Cc: <x86@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: call ctor/dtor for kernel PTEsKevin Brodsky
Since [1], constructors/destructors are expected to be called for all page table pages, at all levels and for both user and kernel pgtables. There is however one glaring exception: kernel PTEs are managed via separate helpers (pte_alloc_kernel/pte_free_kernel), which do not call the [cd]tor, at least not in the generic implementation. The most obvious reason for this anomaly is that init_mm is special-cased not to use split page table locks. As a result calling ptlock_init() for PTEs associated with init_mm would be wasteful, potentially resulting in dynamic memory allocation. However, pgtable [cd]tors perform other actions - currently related to accounting/statistics, and potentially more functionally significant in the future. Now that pagetable_pte_ctor() is passed the associated mm, we can make it skip the call to ptlock_init() for init_mm; this allows us to call the ctor from pte_alloc_one_kernel() too. This is matched by a call to the pgtable destructor in pte_free_kernel(); no special-casing is needed on that path, as ptlock_free() is already called unconditionally. (ptlock_free() is a no-op unless a ptlock was allocated for the given PTP.) This patch ensures that all architectures that rely on <asm-generic/pgalloc.h> call the [cd]tor for kernel PTEs. pte_free_kernel() cannot be overridden so changing the generic implementation is sufficient. pte_alloc_one_kernel() can be overridden using __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL, and a few architectures implement it by calling the page allocator directly. We amend those so that they call the generic __pte_alloc_one_kernel() instead, if possible, ensuring that the ctor is called. A few architectures do not use <asm-generic/pgalloc.h>; those will be taken care of separately. [1] https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/ Link: https://lkml.kernel.org/r/20250408095222.860601-4-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> # s390 Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Cc: <x86@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: pass mm down to pagetable_{pte,pmd}_ctorKevin Brodsky
Patch series "Always call constructor for kernel page tables", v2. There has been much confusion around exactly when page table constructors/destructors (pagetable_*_[cd]tor) are supposed to be called. They were initially introduced for user PTEs only (to support split page table locks), then at the PMD level for the same purpose. Accounting was added later on, starting at the PTE level and then moving to higher levels (PMD, PUD). Finally, with my earlier series "Account page tables at all levels" [1], the ctor/dtor is run for all levels, all the way to PGD. I thought this was the end of the story, and it hopefully is for user pgtables, but I was wrong for what concerns kernel pgtables. The current situation there makes very little sense: * At the PTE level, the ctor/dtor is not called (at least in the generic implementation). Specific helpers are used for kernel pgtables at this level (pte_{alloc,free}_kernel()) and those have never called the ctor/dtor, most likely because they were initially irrelevant in the kernel case. * At all other levels, the ctor/dtor is normally called. This is potentially wasteful at the PMD level (more on that later). This series aims to ensure that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. Besides consistency, the main motivation is to guarantee that ctor/dtor hooks are systematically called; this makes it possible to insert hooks to protect page tables [2], for instance. There is however an extra challenge: split locks are not used for kernel pgtables, and it would therefore be wasteful to initialise them (ptlock_init()). It is worth clarifying exactly when split locks are used. They clearly are for user pgtables, but as illustrated in commit 61444cde9170 ("ARM: 8591/1: mm: use fully constructed struct pages for EFI pgd allocations"), they also are for special page tables like efi_mm. The one case where split locks are definitely unused is pgtables owned by init_mm; this is consistent with the behaviour of apply_to_pte_range(). The approach chosen in this series is therefore to pass the mm associated to the pgtables being constructed to pagetable_{pte,pmd}_ctor() (patch 1), and skip ptlock_init() if mm == &init_mm (patch 3 and 7). This makes it possible to call the PTE ctor/dtor from pte_{alloc,free}_kernel() without unintended consequences (patch 3). As a result the accounting functions are now called at all levels for kernel pgtables, and split locks are never initialised. In configurations where ptlocks are dynamically allocated (32-bit, PREEMPT_RT, etc.) and ARCH_ENABLE_SPLIT_PMD_PTLOCK is selected, this series results in the removal of a kmem_cache allocation for every kernel PMD. Additionally, for certain architectures that do not use <asm-generic/pgalloc.h> such as s390, the same optimisation occurs at the PTE level. === Things get more complicated when it comes to special pgtable allocators (patch 8-12). All architectures need such allocators to create initial kernel pgtables; we are not concerned with those as the ctor cannot be called so early in the boot sequence. However, those allocators may also be used later in the boot sequence or during normal operations. There are two main use-cases: 1. Mapping EFI memory: efi_mm (arm, arm64, riscv) 2. arch_add_memory(): init_mm The ctor is already explicitly run (at the PTE/PMD level) in the first case, as required for pgtables that are not associated with init_mm. However the same allocators may also be used for the second use-case (or others), and this is where it gets messy. Patch 1 calls the ctor with NULL as mm in those situations, as the actual mm isn't available. Practically this means that ptlocks will be unconditionally initialised. This is fine on arm - create_mapping_late() is only used for the EFI mapping. On arm64, __create_pgd_mapping() is also used by arch_add_memory(); patch 8/9/11 ensure that ctors are called at all levels with the appropriate mm. The situation is similar on riscv, but propagating the mm down to the ctor would require significant refactoring. Since they are already called unconditionally, this series leaves riscv no worse off - patch 10 adds comments to clarify the situation. From a cursory look at other architectures implementing arch_add_memory(), s390 and x86 may also need a similar treatment to add constructor calls. This is to be taken care of in a future version or as a follow-up. === The complications in those special pgtable allocators beg the question: does it really make sense to treat efi_mm and init_mm differently in e.g. apply_to_pte_range()? Maybe what we really need is a way to tell if an mm corresponds to user memory or not, and never use split locks for non-user mm's. Feedback and suggestions welcome! This patch (of 12): In preparation for calling constructors for all kernel page tables while eliding unnecessary ptlock initialisation, let's pass down the associated mm to the PTE/PMD level ctors. (These are the two levels where ptlocks are used.) In most cases the mm is already around at the point of calling the ctor so we simply pass it down. This is however not the case for special page table allocators: * arch/arm/mm/mmu.c * arch/arm64/mm/mmu.c * arch/riscv/mm/init.c In those cases, the page tables being allocated are either for standard kernel memory (init_mm) or special page directories, which may not be associated to any mm. For now let's pass NULL as mm; this will be refined where possible in future patches. No functional change in this patch. Link: https://lore.kernel.org/linux-mm/20250103184415.2744423-1-kevin.brodsky@arm.com/ [1] Link: https://lore.kernel.org/linux-hardening/20250203101839.1223008-1-kevin.brodsky@arm.com/ [2] Link: https://lkml.kernel.org/r/20250408095222.860601-1-kevin.brodsky@arm.com Link: https://lkml.kernel.org/r/20250408095222.860601-2-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Linus Waleij <linus.walleij@linaro.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: <x86@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11fs/proc/page: refactor to reduce code duplicationLiu Ye
kpageflags_read() and kpagecgroup_read() are quite similar to kpagecount_read(). Refactor common code into a helper function to reduce code duplication. Link: https://lkml.kernel.org/r/20250318063226.223284-1-liuyerd@163.com Signed-off-by: Liu Ye <liuye@kylinos.cn> Acked-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Svetly Todorov <svetly.todorov@memverge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11xarray: make xa_alloc_cyclic() return 0 on all success casesPrzemek Kitszel
Change xa_alloc_cyclic() to return 0 even on wrap-around. Do the same for xa_alloc_cyclic_irq() and xa_alloc_cyclic_bh(). This will prevent any future bug of treating return of 1 as an error: int ret = xa_alloc_cyclic(...) if (ret) // currently mishandles ret==1 goto failure; If there will be someone interested in when wrap-around occurs, there is still __xa_alloc_cyclic() that behaves as before. For now there is no such user. Link: https://lkml.kernel.org/r/20250320102219.8101-1-przemyslaw.kitszel@intel.com Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Link: https://lore.kernel.org/netdev/Z9gUd-5t8b5NX2wE@casper.infradead.org Cc: Andriy Shevchenko <andriy.shevchenko@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Cc: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/ptdump: split effective_prot() into level specific callbacksAnshuman Khandual
Last argument in effective_prot() is u64 assuming pxd_val() returned value (all page table levels) is 64 bit. pxd_val() is very platform specific and its type should not be assumed in generic MM. Split effective_prot() into individual page table level specific callbacks which accepts corresponding pxd_t argument instead and then the subscribing platform (only x86) just derive pxd_val() from the entries as required and proceed as earlier. Link: https://lkml.kernel.org/r/20250407053113.746295-3-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/ptdump: split note_page() into level specific callbacksAnshuman Khandual
Patch series "mm/ptdump: Drop assumption that pxd_val() is u64", v2. Last argument passed down in note_page() is u64 assuming pxd_val() returned value (all page table levels) is 64 bit - which might not be the case going ahead when D128 page tables is enabled on arm64 platform. Besides pxd_val() is very platform specific and its type should not be assumed in generic MM. A similar problem exists for effective_prot(), although it is restricted to x86 platform. This series splits note_page() and effective_prot() into individual page table level specific callbacks which accepts corresponding pxd_t page table entry as an argument instead and later on all subscribing platforms could derive pxd_val() from the table entries as required and proceed as before. Define ptdesc_t type which describes the basic page table descriptor layout on arm64 platform. Subsequently all level specific pxxval_t descriptors are derived from ptdesc_t thus establishing a common original format, which can also be appropriate for page table entries, masks and protection values etc which are used at all page table levels. This patch (of 3): Last argument passed down in note_page() is u64 assuming pxd_val() returned value (all page table levels) is 64 bit - which might not be the case going ahead when D128 page tables is enabled on arm64 platform. Besides pxd_val() is very platform specific and its type should not be assumed in generic MM. Split note_page() into individual page table level specific callbacks which accepts corresponding pxd_t argument instead and then subscribing platforms just derive pxd_val() from the entries as required and proceed as earlier. Also add a note_page_flush() callback for flushing the last page table page that was being handled earlier via level = -1. Link: https://lkml.kernel.org/r/20250407053113.746295-1-anshuman.khandual@arm.com Link: https://lkml.kernel.org/r/20250407053113.746295-2-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11kernel/events/uprobes: pass VMA to set_swbp(), set_orig_insn() and ↵David Hildenbrand
uprobe_write_opcode() We already have the VMA, no need to look it up using get_user_page_vma_remote(). We can now switch to get_user_pages_remote(). Link: https://lkml.kernel.org/r/20250321113713.204682-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <olsajiri@gmail.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Russel King <linux@armlinux.org.uk> Cc: tongtiangen <tongtiangen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: fix parameter passed to page_mapcount_is_type()Gavin Shan
Patch series "Fix parameter passed to page_mapcount_is_type()", v2. Found by code inspection. There are two places where the parameter passed to page_mapcount_is_type() is (page->_mapcount), which is incorrect since it should be one more than the value, as explained in the comments to page_mapcount_is_type(): (a) page_has_type() in page-flags.h (b) __dump_folio() in mm/debug.c PATCH[1] fixes the parameter for (a) PATCH[2] fixes the parameter for (b) Note that the issue doesn't cause any visible impacts due to the safety gap introduced by PGTY_mapcount_underflow limit. So the tag 'Cc: stable@vger.kernel.org' isn't needed. This patch (of 2): As the comments of page_mapcount_is_type() indicate, the parameter passed to the function should be one more than page->_mapcount. However, page->_mapcount (equivalent to page->page_type) is passed to the function by commit 4ffca5a96678 ("mm: support only one page_type per page") page_type_has_type() is replaced by page_mapcount_is_type(), but the parameter isn't adjusted. Fix it by replacing page_mapcount_is_type() with page_type_has_type() in page_has_type(). Note that the issue doesn't cause any visible impacts due to the safety gap introduced by PGTY_mapcount_underflow limit. Link: https://lkml.kernel.org/r/20250321120222.1456770-1-gshan@redhat.com Link: https://lkml.kernel.org/r/20250321120222.1456770-2-gshan@redhat.com Fixes: 4ffca5a96678 ("mm: support only one page_type per page") Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: gehao <gehao@kylinos.cn> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: annotate data race in update_hiwater_rssIgnacio Encinas
mm_struct.hiwater_rss can be accessed concurrently without proper synchronization as reported by KCSAN. This data race is benign as it only affects accounting information. Annotate it with data_race() to make KCSAN happy. Link: https://lkml.kernel.org/r/20250331-mm-maxrss-data-race-v2-1-cf958e6205bf@iencinas.com Signed-off-by: Ignacio Encinas <ignacio@iencinas.com> Reported-by: syzbot+419c4b42acc36c420ad3@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67e3390c.050a0220.1ec46.0001.GAE@google.com/ Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Pedro Falcato <pfalcato@suse.de> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm/compaction: use folio in hugetlb pathwayVishal Moola (Oracle)
Use a folio in the hugetlb pathway during the compaction migrate-able pageblock scan. This removes a call to compound_head(). Link: https://lkml.kernel.org/r/20250401021025.637333-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11memory: implement memory_block_advise/probe_max_sizeGregory Price
Patch series "memory,x86,acpi: hotplug memory alignment advisement", v8. When physical address regions are not aligned to memory block size, the misaligned portion is lost (stranded capacity). Block size (min/max/selected) is architecture defined. Most architectures tend to use the minimum block size or some simplistic heurist. On x86, memory block size increases up to 2GB, and is otherwise fitted to the alignment of non-hotplug (i.e. not special purpose memory). CXL exposes its memory for management through the ACPI CEDT (CXL Early Detection Table) in a field called the CXL Fixed Memory Window. Per the CXL specification, this memory must be aligned to at least 256MB. When a CFMW aligns on a size less than the block size, this causes a loss of up to 2GB per CFMW on x86. It is not uncommon for CFMW to be allocated per-device - though this behavior is BIOS defined. This patch set provides 3 things: 1) implement advise/query functions in driverse/base/memory.c to report/query architecture agnostic hotplug block alignment advice. 2) update x86 memblock size logic to consider the hotplug advice 3) add code in acpi/numa/srat.c to report CFMW alignment advice The advisement interfaces are design to be called during arch_init code prior to allocator and smp_init. start_kernel will call these through setup_arch() (via acpi and mm/init_64.c on x86), which occurs prior to mm_core_init and smp_init - so no need for atomics. There's an attempt to signal callers to advise() that query has already occurred, but this is predicated on the notion that query actually occurs (which presently only happens on the x86 arch). This is to assist debugging future users. Otherwise, the advise() call has been marked __init to help static discovery of bad call times. Once query is called the first time, it will always return the same value. Interfaces return -EBUSY and 0 respectively on systems without hotplug. This patch (of 3): Hotplug memory sources may have opinions on what the memblock size should be - usually for alignment purposes. For example, CXL memory extents can be 256MB with a matching alignment. If this size/alignment is smaller than the block size, it can result in stranded capacity. Implement memory_block_advise_max_size for use prior to allocator init, for software to advise the system on the max block size. Implement memory_block_probe_max_size for use by arch init code to calculate the best block size. Use of advice is architecture defined. The probe value can never change after first probe. Calls to advise after probe will return -EBUSY to aid debugging. On systems without hotplug, always return -ENODEV and 0 respectively. Link: https://lkml.kernel.org/r/20250127153405.3379117-1-gourry@gourry.net Link: https://lkml.kernel.org/r/20250127153405.3379117-2-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Suggested-by: Ira Weiny <ira.weiny@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Fan Ni <fan.ni@samsung.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: Bruno Faccini <bfaccini@nvidia.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haibo Xu <haibo1.xu@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joanthan Cameron <Jonathan.Cameron@huawei.com> Cc: Len Brown <lenb@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Robert Richter <rrichter@amd.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>