summaryrefslogtreecommitdiff
path: root/drivers/block
AgeCommit message (Collapse)Author
2025-03-16zram: switch to new zsmalloc object mapping APISergey Senozhatsky
Use new read/write zsmalloc object API. For cases when RO mapped object spans two physical pages (requires temp buffer) compression streams now carry around one extra physical page. Link: https://lkml.kernel.org/r/20250303022425.285971-16-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: move post-processing target allocationSergey Senozhatsky
Allocate post-processing target in place_pp_slot(). This simplifies scan_slots_for_writeback() and scan_slots_for_recompress() loops because we don't need to track pps pointer state anymore. Previously we have to explicitly NULL the point if it has been added to a post-processing bucket or re-use previously allocated pointer otherwise and make sure we don't leak the memory in the end. We are also fine doing GFP_NOIO allocation, as post-processing can be called under memory pressure so we better pick as many slots as we can as soon as we can and start post-processing them, possibly saving the memory. Allocation failure there is not fatal, we will post-process whatever we put into the buckets on previous iterations. Link: https://lkml.kernel.org/r/20250303022425.285971-12-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: rework recompression loopSergey Senozhatsky
This reworks recompression loop handling: - set a rule that stream-put NULLs the stream pointer If the loop returns with a non-NULL stream then it's a successful recompression, otherwise the stream should always be NULL. - do not count the number of recompressions Mark object as incompressible as soon as the algorithm with the highest priority failed to compress that object. - count compression errors as resource usage Even if compression has failed, we still need to bump num_recomp_pages counter. Link: https://lkml.kernel.org/r/20250303022425.285971-11-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: filter out recomp targets based on prioritySergey Senozhatsky
Do no select for post processing slots that are already compressed with same or higher priority compression algorithm. This should save some memory, as previously we would still put those entries into corresponding post-processing buckets and filter them out later in recompress_slot(). Link: https://lkml.kernel.org/r/20250303022425.285971-10-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: limit max recompress prio to num_active_compsSergey Senozhatsky
Use the actual number of algorithms zram was configure with instead of theoretical limit of ZRAM_MAX_COMPS. Also make sure that min prio is not above max prio. Link: https://lkml.kernel.org/r/20250303022425.285971-9-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: remove writestall zram_stats memberSergey Senozhatsky
There is no zsmalloc handle allocation slow path now and writestall is not possible any longer. Remove it from zram_stats. Link: https://lkml.kernel.org/r/20250303022425.285971-8-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: add GFP_NOWARN to incompressible zsmalloc handle allocationSergey Senozhatsky
We normally use __GFP_NOWARN for zsmalloc handle allocations, add it to write_incompressible_page() allocation too. Link: https://lkml.kernel.org/r/20250303022425.285971-7-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: remove second stage of handle allocationSergey Senozhatsky
Previously zram write() was atomic which required us to pass __GFP_KSWAPD_RECLAIM to zsmalloc handle allocation on a fast path and attempt a slow path allocation (with recompression) if the fast path failed. Since we are not in atomic context anymore we can permit direct reclaim during handle allocation, and hence can have a single allocation path. There is no slow path anymore so we don't unlock per-CPU stream (and don't lose compressed data) which means that there is no need to do recompression now (which should reduce CPU and battery usage). Link: https://lkml.kernel.org/r/20250303022425.285971-6-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: remove max_comp_streams device attrSergey Senozhatsky
max_comp_streams device attribute has been defunct since May 2016 when zram switched to per-CPU compression streams, remove it. Link: https://lkml.kernel.org/r/20250303022425.285971-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: remove unused crypto includeSergey Senozhatsky
We stopped using crypto API (for the time being), so remove its include and replace CRYPTO_MAX_ALG_NAME with a local define. Link: https://lkml.kernel.org/r/20250303022425.285971-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: permit preemption with active compression streamSergey Senozhatsky
Currently, per-CPU stream access is done from a non-preemptible (atomic) section, which imposes the same atomicity requirements on compression backends as entry spin-lock, and makes it impossible to use algorithms that can schedule/wait/sleep during compression and decompression. Switch to preemptible per-CPU model, similar to the one used in zswap. Instead of a per-CPU local lock, each stream carries a mutex which is locked throughout entire time zram uses it for compression or decompression, so that cpu-dead event waits for zram to stop using a particular per-CPU stream and release it. Link: https://lkml.kernel.org/r/20250303022425.285971-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: sleepable entry lockingSergey Senozhatsky
Patch series "zsmalloc/zram: there be preemption", v10. Currently zram runs compression and decompression in non-preemptible sections, e.g. zcomp_stream_get() // grabs CPU local lock zcomp_compress() or zram_slot_lock() // grabs entry spin-lock zcomp_stream_get() // grabs CPU local lock zs_map_object() // grabs rwlock and CPU local lock zcomp_decompress() Potentially a little troublesome for a number of reasons. For instance, this makes it impossible to use async compression algorithms or/and H/W compression algorithms, which can wait for OP completion or resource availability. This also restricts what compression algorithms can do internally, for example, zstd can allocate internal state memory for C/D dictionaries: do_fsync() do_writepages() zram_bio_write() zram_write_page() // become non-preemptible zcomp_compress() zstd_compress() ZSTD_compress_usingCDict() ZSTD_compressBegin_usingCDict_internal() ZSTD_resetCCtx_usingCDict() ZSTD_resetCCtx_internal() zstd_custom_alloc() // memory allocation Not to mention that the system can be configured to maximize compression ratio at a cost of CPU/HW time (e.g. lz4hc or deflate with very high compression level) so zram can stay in non-preemptible section (even under spin-lock or/and rwlock) for an extended period of time. Aside from compression algorithms, this also restricts what zram can do. One particular example is zram_write_page() zsmalloc handle allocation, which has an optimistic allocation (disallowing direct reclaim) and a pessimistic fallback path, which then forces zram to compress the page one more time. This series changes zram to not directly impose atomicity restrictions on compression algorithms (and on itself), which makes zram write() fully preemptible; zram read(), sadly, is not always preemptible yet. There are still indirect atomicity restrictions imposed by zsmalloc(). One notable example is object mapping API, which returns with: a) local CPU lock held b) zspage rwlock held First, zsmalloc's zspage lock is converted from rwlock to a special type of RW-lookalike look with some extra guarantees/features. Second, a new handle mapping is introduced which doesn't use per-CPU buffers (and hence no local CPU lock), does fewer memcpy() calls, but requires users to provide a pointer to temp buffer for object copy-in (when needed). Third, zram is converted to the new zsmalloc mapping API and thus zram read() becomes preemptible. This patch (of 19): Concurrent modifications of meta table entries is now handled by per-entry spin-lock. This has a number of shortcomings. First, this imposes atomic requirements on compression backends. zram can call both zcomp_compress() and zcomp_decompress() under entry spin-lock, which implies that we can use only compression algorithms that don't schedule/sleep/wait during compression and decompression. This, for instance, makes it impossible to use some of the ASYNC compression algorithms (H/W compression, etc.) implementations. Second, this can potentially trigger watchdogs. For example, entry re-compression with secondary algorithms is performed under entry spin-lock. Given that we chain secondary compression algorithms and that some of them can be configured for best compression ratio (and worst compression speed) zram can stay under spin-lock for quite some time. Having a per-entry mutex (or, for instance, a rw-semaphore) significantly increases sizeof() of each entry and hence the meta table. Therefore entry locking returns back to bit locking, as before, however, this time also preempt-rt friendly, because if waits-on-bit instead of spinning-on-bit. Lock owners are also now permitted to schedule, which is a first step on the path of making zram non-atomic. Link: https://lkml.kernel.org/r/20250303022425.285971-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20250303022425.285971-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-14Merge tag 'block-6.14-20250313' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Concurrent pci error and hotplug handling fix (Keith) - Endpoint function fixes (Damien) - Fix for a regression introduced in this cycle with error checking for batched request completions (Shin'ichiro) * tag 'block-6.14-20250313' of git://git.kernel.dk/linux: block: change blk_mq_add_to_batch() third argument type to bool nvme: move error logging from nvme_end_req() to __nvme_end_req() nvmet: pci-epf: Do not add an IRQ vector if not needed nvmet: pci-epf: Set NVMET_PCI_EPF_Q_LIVE when a queue is fully created nvme-pci: fix stuck reset on concurrent DPC and HP
2025-03-13block: remove unused parameter 'q' parameter in __blk_rq_map_sg()Anuj Gupta
request_queue param is no longer used by blk_rq_map_sg and __blk_rq_map_sg. Remove it. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250313035322.243239-1-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-12block: change blk_mq_add_to_batch() third argument type to boolShin'ichiro Kawasaki
Commit 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") modified the evaluation criteria for the third argument, 'ioerror', in the blk_mq_add_to_batch() function. Initially, the function had checked if 'ioerror' equals zero. Following the commit, it started checking for negative error values, with the presumption that such values, for instance -EIO, would be passed in. However, blk_mq_add_to_batch() callers do not pass negative error values. Instead, they pass status codes defined in various ways: - NVMe PCI and Apple drivers pass NVMe status code - virtio_blk driver passes the virtblk request header status byte - null_blk driver passes blk_status_t These codes are either zero or positive, therefore the revised check fails to function as intended. Specifically, with the NVMe PCI driver, this modification led to the failure of the blktests test case nvme/039. In this test scenario, errors are artificially injected to the NVMe driver, resulting in positive NVMe status codes passed to blk_mq_add_to_batch(), which unexpectedly processes the failed I/O in a batch. Hence the failure. To correct the ioerror check within blk_mq_add_to_batch(), make all callers to uniformly pass the argument as boolean. Modify the callers to check their specific status codes and pass the boolean value 'is_error'. Also describe the arguments of blK_mq_add_to_batch as kerneldoc. Fixes: 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250311104359.1767728-3-shinichiro.kawasaki@wdc.com [axboe: fold in documentation update] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10rust: module: introduce `authors` keyGuilherme Giacomo Simoes
In the `module!` macro, the `author` field is currently of type `String`. Since modules can have multiple authors, this limitation prevents specifying more than one. Add an `authors` field as `Option<Vec<String>>` to allow creating modules with multiple authors, and change the documentation and all current users to use it. Eventually, the single `author` field may be removed. [ The `modinfo` key needs to still be `author`; otherwise, tooling may not work properly, e.g.: $ modinfo --author samples/rust/rust_print.ko Rust for Linux Contributors I have also kept the original `author` field (undocumented), so that we can drop it more easily in a kernel cycle or two. - Miguel ] Suggested-by: Miguel Ojeda <ojeda@kernel.org> Link: https://github.com/Rust-for-Linux/linux/issues/244 Reviewed-by: Charalampos Mitrodimas <charmitro@posteo.net> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Signed-off-by: Guilherme Giacomo Simoes <trintaeoitogc@gmail.com> Link: https://lore.kernel.org/r/20250309175712.845622-2-trintaeoitogc@gmail.com [ Fixed `modinfo` key. Kept `author` field. Reworded message accordingly. Updated my email. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2025-03-07Merge tag 'block-6.14-20250306' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - TCP use after free fix on polling (Sagi) - Controller memory buffer cleanup fixes (Icenowy) - Free leaking requests on bad user passthrough commands (Keith) - TCP error message fix (Maurizio) - TCP corruption fix on partial PDU (Maurizio) - TCP memory ordering fix for weakly ordered archs (Meir) - Type coercion fix on message error for TCP (Dan) - Name the RQF flags enum, fixing issues with anon enums and BPF import of it - ublk parameter setting fix - GPT partition 7-bit conversion fix * tag 'block-6.14-20250306' of git://git.kernel.dk/linux: block: Name the RQF flags enum nvme-tcp: fix signedness bug in nvme_tcp_init_connection() block: fix conversion of GPT partition name to 7-bit ublk: set_params: properly check if parameters can be applied nvmet-tcp: Fix a possible sporadic response drops in weakly ordered arch nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu() nvme-tcp: Fix a C2HTermReq error message nvmet: remove old function prototype nvme-ioctl: fix leaked requests on mapping error nvme-pci: skip CMB blocks incompatible with PCI P2P DMA nvme-pci: clean up CMBMSC when registering CMB fails nvme-tcp: fix possible UAF in nvme_tcp_poll
2025-03-06badblocks: use sector_t instead of int to avoid truncation of badblocks lengthZheng Qixing
There is a truncation of badblocks length issue when set badblocks as follow: echo "2055 4294967299" > bad_blocks cat bad_blocks 2055 3 Change 'sectors' argument type from 'int' to 'sector_t'. This change avoids truncation of badblocks length for large sectors by replacing 'int' with 'sector_t' (u64), enabling proper handling of larger disk sizes and ensuring compatibility with 64-bit sector addressing. Fixes: 9e0e252a048b ("badblocks: Add core badblock management code") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06badblocks: return boolean from badblocks_set() and badblocks_clear()Zheng Qixing
Change the return type of badblocks_set() and badblocks_clear() from int to bool, indicating success or failure. Specifically: - _badblocks_set() and _badblocks_clear() functions now return true for success and false for failure. - All calls to these functions are updated to handle the new boolean return type. - This change improves code clarity and ensures a more consistent handling of success and failure states. Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Acked-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05ublk: set_params: properly check if parameters can be appliedUday Shankar
The parameters set by the set_params call are only applied to the block device in the start_dev call. So if a device has already been started, a subsequently issued set_params on that device will not have the desired effect, and should return an error. There is an existing check for this - set_params fails on devices in the LIVE state. But this check is not sufficient to cover the recovery case. In this case, the device will be in the QUIESCED or FAIL_IO states, so set_params will succeed. But this success is misleading, because the parameters will not be applied, since the device has already been started (by a previous ublk server). The bit UB_STATE_USED is set on completion of the start_dev; use it to detect and fail set_params commands which arrive too late to be applied (after start_dev). Signed-off-by: Uday Shankar <ushankar@purestorage.com> Fixes: 0aa73170eba5 ("ublk_drv: add SET_PARAMS/GET_PARAMS control command") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250304-set_params-v1-1-17b5e0887606@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04ublk: enforce ublks_max only for unprivileged devicesUday Shankar
Commit 403ebc877832 ("ublk_drv: add module parameter of ublks_max for limiting max allowed ublk dev"), claimed ublks_max was added to prevent a DoS situation with an untrusted user creating too many ublk devices. If that's the case, ublks_max should only restrict the number of unprivileged ublk devices in the system. Enforce the limit only for unprivileged ublk devices, and rename variables accordingly. Leave the external-facing parameter name unchanged, since changing it may break systems which use it (but still update its documentation to reflect its new meaning). As a result of this change, in a system where there are only normal (non-unprivileged) devices, the maximum number of such devices is increased to 1 << MINORBITS, or 1048576. That ought to be enough for anyone, right? Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250228-ublks_max-v1-1-04b7379190c0@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04loop: Remove struct loop_func_tableZhu Yanjun
The struct is introduced in the commit 754d96798fab ("loop: remove loop.h"), but it is not used now. So remove it. No functional changes. Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250227163343.55952-1-yanjun.zhu@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04ublk: don't cast registered buffer index to intCaleb Sander Mateos
io_buffer_register_bvec() takes index as an unsigned int argument, but ublk_register_io_buf() casts ub_cmd->addr (a u64) to int. Remove the misleading cast and instead pass index as an unsigned value to ublk_register_io_buf() and ublk_unregister_io_buf(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250301190317.950208-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03ublk: add DMA alignment limitMing Lei
The in-tree ublk driver doesn't need DMA alignment limit because there is one data copy between request pages and the userspace buffer. However, ublk is going to support zero copy, then DMA alignment limit is required, because same IO buffer is forwarded to backend which may have specific buffer DMA alignment limit, so the limit has to be exposed from the frontend driver to client application. Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250227103707.2640014-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03null_blk: do partial IO for bad blocksShin'ichiro Kawasaki
The current null_blk implementation checks if any bad blocks exist in the target blocks of each IO. If so, the IO fails and data is not transferred for all of the IO target blocks. However, when real storage devices have bad blocks, the devices may transfer data partially up to the first bad blocks (e.g., SAS drives). Especially, when the IO is a write operation, such partial IO leaves partially written data on the device. To simulate such partial IO using null_blk, introduce the new parameter 'badblocks_partial_io'. When this parameter is set, null_handle_badblocks() returns the number of the sectors for the partial IO as its third pointer argument. Pass the returned number of sectors to the following calls to null_handle_memory_backend() in null_process_cmd() and null_zone_write(). Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-6-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03null_blk: pass transfer size to null_handle_rq()Shin'ichiro Kawasaki
As preparation to support partial data transfer, add a new argument to null_handle_rq() to pass the number of sectors to transfer. While at it, rename the function from null_handle_rq to null_handle_data_transfer. This commit does not change the behavior. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-5-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03null_blk: replace null_process_cmd() call in null_zone_write()Shin'ichiro Kawasaki
As a preparation to support partial data transfer due to badblocks, replace the null_process_cmd() call in null_zone_write() with equivalent calls to null_handle_badblocks() and null_handle_memory_backed(). This commit does not change behavior. It will enable null_handle_badblocks() to return the size of partial data transfer in the following commit, allowing null_zone_write() to move write pointers appropriately. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-4-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03null_blk: introduce badblocks_once parameterShin'ichiro Kawasaki
When IO errors happen on real storage devices, the IOs repeated to the same target range can success by virtue of recovery features by devices, such as reserved block assignment. To simulate such IO errors and recoveries, introduce the new parameter badblocks_once parameter. When this parameter is set to 1, the specified badblocks are cleared after the first IO error, so that the next IO to the blocks succeed. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-3-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03null_blk: generate null_blk configfs features stringShin'ichiro Kawasaki
The null_blk configfs file 'features' provides a string that lists available null_blk features for userspace programs to reference. The string is defined as a long constant in the code, which tends to be forgotten for updates. It also causes checkpatch.pl to report "WARNING: quoted string split across lines". To avoid these drawbacks, generate the feature string on the fly. Refer to the ca_name field of each element in the nullb_device_attrs table and concatenate them in the given buffer. Also, sorted nullb_device_attrs table elements in alphabetical order. Of note is that the feature "index" was missing before this commit. This commit adds it to the generated string. Suggested-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-2-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03ublk: complete command synchronously on errorCaleb Sander Mateos
In case of an error, ublk's ->uring_cmd() functions currently return -EIOCBQUEUED and immediately call io_uring_cmd_done(). -EIOCBQUEUED and io_uring_cmd_done() are intended for asynchronous completions. For synchronous completions, the ->uring_cmd() function can just return the negative return code directly. This skips io_uring_cmd_del_cancelable(), and deferring the completion to task work. So return the error code directly from __ublk_ch_uring_cmd() and ublk_ctrl_uring_cmd(). Update ublk_ch_uring_cmd_cb(), which currently ignores the return value from __ublk_ch_uring_cmd(), to call io_uring_cmd_done() for synchronous completions. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250225212456.2902549-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28io_uring/ublk: report error when unregister operation failsCaleb Sander Mateos
Indicate to userspace applications if a UBLK_IO_UNREGISTER_IO_BUF command specifies an invalid buffer index by returning an error code. Return -EINVAL if no buffer is registered with the given index, and -EBUSY if the registered buffer is not a kernel bvec. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250228231432.642417-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-28ublk: zc register/unregister bvecKeith Busch
Provide new operations for the user to request mapping an active request to an io uring instance's buf_table. The user has to provide the index it wants to install the buffer. A reference count is taken on the request to ensure it can't be completed while it is active in a ring's buf_table. Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250227223916.143006-6-kbusch@meta.com Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24loop: take the file system minimum dio alignment into accountChristoph Hellwig
The loop driver currently uses the logical block size of the underlying bdev as the lower bound of the loop device block size. While this works for many cases, it fails for file systems made up of multiple devices with different logical block sizes (e.g. XFS with a RT device that has a larger logical block size), or when the file systems doesn't support direct I/O writes at the sector size granularity (e.g. because it does out of place writes with a file system block size larger than the sector size). Fix this by querying the minimum direct I/O alignment from statx when available. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250131120120.1315125-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24loop: check in LO_FLAGS_DIRECT_IO in loop_default_blocksizeChristoph Hellwig
We can't go below the minimum direct I/O size no matter if direct I/O is enabled by passing in an O_DIRECT file descriptor or due to the explicit flag. Now that LO_FLAGS_DIRECT_IO is set earlier after assigning a backing file, loop_default_blocksize can check it instead of the O_DIRECT flag to handle both conditions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250131120120.1315125-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24loop: set LO_FLAGS_DIRECT_IO in loop_assign_backing_fileChristoph Hellwig
Assigning LO_FLAGS_DIRECT_IO from the O_DIRECT flag is related to assigning a new backing file. Move the assignment in preparation of using the flag more and earlier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250131120120.1315125-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-24loop: factor out a loop_assign_backing_file helperChristoph Hellwig
Split the code for setting up a backing file into a helper in preparation of adding more code to this path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250131120120.1315125-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-18Revert "driver: block: release the lo_work_lock before queue_work"Zhaoyang Huang
This reverts commit ad934fc1784802fd1408224474b25ee5289fadfc. loop_queue_work should be strictly serialized to loop_process_work since the lo_worker could be freed without noticing new work has been queued again. Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Link: https://lore.kernel.org/r/20250218065835.19503-1-zhaoyang.huang@unisoc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-18null_blk: Switch to use hrtimer_setup()Nam Cao
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/076cd30acb4b2d6b699f3c80463a8983530d4f06.1738746821.git.namcao@linutronix.de
2025-02-11loop: release the lo_work_lock before queue_workZhaoyang Huang
queue_work could spin on wq->cpu_pwq->pool->lock which could lead to concurrent loop_process_work failed on lo_work_lock contention and increase the request latency. Remove this combination by moving the lock release ahead of queue_work. Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Link: https://lore.kernel.org/r/20250207091942.3966756-1-zhaoyang.huang@unisoc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-07Merge tag 'block-6.14-20250207' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - MD pull request via Song: - fix an error handling path for md-linear - NVMe pull request via Keith: - Connection fixes for fibre channel transport (Daniel) - Endian fixes (Keith, Christoph) - Cleanup fix for host memory buffer (Francis) - Platform specific power quirks (Georg) - Target memory leak (Sagi) - Use appropriate controller state accessor (Daniel) - Fixup for a regression introduced last week, where sunvdc wasn't updated for an API change, causing compilation failures on sparc64. * tag 'block-6.14-20250207' of git://git.kernel.dk/linux: drivers/block/sunvdc.c: update the correct AIP call md: Fix linear_set_limits() nvme-fc: use ctrl state getter nvme: make nvme_tls_attrs_group static nvmet: add a missing endianess conversion in nvmet_execute_admin_connect nvmet: the result field in nvmet_alloc_ctrl_args is little endian nvmet: fix a memory leak in controller identify nvme-fc: do not ignore connectivity loss during connecting nvme: handle connectivity loss in nvme_set_queue_count nvme-fc: go straight to connecting state when initializing nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk nvme-pci: remove redundant dma frees in hmb nvmet: fix rw control endian access
2025-02-02drivers/block/sunvdc.c: update the correct AIP callStephen Rothwell
My sparc64 defconfig build failed like this: drivers/block/sunvdc.c: In function 'vdc_queue_drain': drivers/block/sunvdc.c:1130:9: error: too many arguments to function 'blk_mq_unquiesce_queue' 1130 | blk_mq_unquiesce_queue(q, memflags); | ^~~~~~~~~~~~~~~~~~~~~~ In file included from drivers/block/sunvdc.c:10: include/linux/blk-mq.h:895:6: note: declared here 895 | void blk_mq_unquiesce_queue(struct request_queue *q); | ^~~~~~~~~~~~~~~~~~~~~~ drivers/block/sunvdc.c:1131:9: error: too few arguments to function 'blk_mq_unfreeze_queue' 1131 | blk_mq_unfreeze_queue(q); | ^~~~~~~~~~~~~~~~~~~~~ In file included from drivers/block/sunvdc.c:10: include/linux/blk-mq.h:914:1: note: declared here 914 | blk_mq_unfreeze_queue(struct request_queue *q, unsigned int memflags) | ^~~~~~~~~~~~~~~~~~~~~ Fixes: 1e1a9cecfab3 ("block: force noio scope in blk_mq_freeze_queue") Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-31Merge tag 'block-6.14-20250131' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: - MD pull request via Song: - Fix a md-cluster regression introduced - More sysfs race fixes - Mark anything inside queue freezing as not being able to do IO for memory allocations - Fix for a regression introduced in loop in this merge window - Fix for a regression in queue mapping setups introduced in this merge window - Fix for the block dio fops attempting an iov_iter revert upton getting -EIOCBQUEUED on the read side. This one is going to stable as well * tag 'block-6.14-20250131' of git://git.kernel.dk/linux: block: force noio scope in blk_mq_freeze_queue block: fix nr_hw_queue update racing with disk addition/removal block: get rid of request queue ->sysfs_dir_lock loop: don't clear LO_FLAGS_PARTSCAN on LOOP_SET_STATUS{,64} md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime blk-mq: create correct map for fallback case block: don't revert iter for -EIOCBQUEUED
2025-01-31block: force noio scope in blk_mq_freeze_queueChristoph Hellwig
When block drivers or the core block code perform allocations with a frozen queue, this could try to recurse into the block device to reclaim memory and deadlock. Thus all allocations done by a process that froze a queue need to be done without __GFP_IO and __GFP_FS. Instead of tying to track all of them down, force a noio scope as part of freezing the queue. Note that nvme is a bit of a mess here due to the non-owner freezes, and they will be addressed separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250131120352.1315351-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-28Merge tag 'driver-core-6.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core and debugfs updates from Greg KH: "Here is the big set of driver core and debugfs updates for 6.14-rc1. Included in here is a bunch of driver core, PCI, OF, and platform rust bindings (all acked by the different subsystem maintainers), hence the merge conflict with the rust tree, and some driver core api updates to mark things as const, which will also require some fixups due to new stuff coming in through other trees in this merge window. There are also a bunch of debugfs updates from Al, and there is at least one user that does have a regression with these, but Al is working on tracking down the fix for it. In my use (and everyone else's linux-next use), it does not seem like a big issue at the moment. Here's a short list of the things in here: - driver core rust bindings for PCI, platform, OF, and some i/o functions. We are almost at the "write a real driver in rust" stage now, depending on what you want to do. - misc device rust bindings and a sample driver to show how to use them - debugfs cleanups in the fs as well as the users of the fs api for places where drivers got it wrong or were unnecessarily doing things in complex ways. - driver core const work, making more of the api take const * for different parameters to make the rust bindings easier overall. - other small fixes and updates All of these have been in linux-next with all of the aforementioned merge conflicts, and the one debugfs issue, which looks to be resolved "soon"" * tag 'driver-core-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (95 commits) rust: device: Use as_char_ptr() to avoid explicit cast rust: device: Replace CString with CStr in property_present() devcoredump: Constify 'struct bin_attribute' devcoredump: Define 'struct bin_attribute' through macro rust: device: Add property_present() saner replacement for debugfs_rename() orangefs-debugfs: don't mess with ->d_name octeontx2: don't mess with ->d_parent or ->d_parent->d_name arm_scmi: don't mess with ->d_parent->d_name slub: don't mess with ->d_name sof-client-ipc-flood-test: don't mess with ->d_name qat: don't mess with ->d_name xhci: don't mess with ->d_iname mtu3: don't mess wiht ->d_iname greybus/camera - stop messing with ->d_iname mediatek: stop messing with ->d_iname netdevsim: don't embed file_operations into your structs b43legacy: make use of debugfs_get_aux() b43: stop embedding struct file_operations into their objects carl9170: stop embedding file_operations into their objects ...
2025-01-27Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "A small number of improvements all over the place: - vdpa/octeon support for multiple interrupts - virtio-pci support for error recovery - vp_vdpa support for notification with data - vhost/net fix to set num_buffers for spec compliance - virtio-mem now works with kdump on s390 And small cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (23 commits) virtio_blk: Add support for transport error recovery virtio_pci: Add support for PCIe Function Level Reset vhost/net: Set num_buffers for virtio 1.0 vdpa/octeon_ep: read vendor-specific PCI capability virtio-pci: define type and header for PCI vendor data vdpa/octeon_ep: handle device config change events vdpa/octeon_ep: enable support for multiple interrupts per device vdpa: solidrun: Replace deprecated PCI functions s390/kdump: virtio-mem kdump support (CONFIG_PROC_VMCORE_DEVICE_RAM) virtio-mem: support CONFIG_PROC_VMCORE_DEVICE_RAM virtio-mem: remember usable region size virtio-mem: mark device ready before registering callbacks in kdump mode fs/proc/vmcore: introduce PROC_VMCORE_DEVICE_RAM to detect device RAM ranges in 2nd kernel fs/proc/vmcore: factor out freeing a list of vmcore ranges fs/proc/vmcore: factor out allocating a vmcore range and adding it to a list fs/proc/vmcore: move vmcore definitions out of kcore.h fs/proc/vmcore: prefix all pr_* with "vmcore:" fs/proc/vmcore: disallow vmcore modifications while the vmcore is open fs/proc/vmcore: replace vmcoredd_mutex by vmcore_mutex fs/proc/vmcore: convert vmcore_cb_lock into vmcore_mutex ...
2025-01-27loop: don't clear LO_FLAGS_PARTSCAN on LOOP_SET_STATUS{,64}Christoph Hellwig
LOOP_SET_STATUS{,64} can set a lot more flags than it is supposed to clear (the LOOP_SET_STATUS_CLEARABLE_FLAGS vs LOOP_SET_STATUS_SETTABLE_FLAGS defines should have been a hint..). Fix this by only clearing the bits in LOOP_SET_STATUS_CLEARABLE_FLAGS. Fixes: ae074d07a0e5 ("loop: move updating lo_flag s out of loop_set_status_from_info") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250127143045.538279-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-27virtio_blk: Add support for transport error recoveryIsrael Rukshin
Add support for proper cleanup and re-initialization of virtio-blk devices during transport reset error recovery flow. This enhancement includes: - Pre-reset handler (reset_prepare) to perform device-specific cleanup - Post-reset handler (reset_done) to re-initialize the device These changes allow the device to recover from various reset scenarios, ensuring proper functionality after a reset event occurs. Without this implementation, the device cannot properly recover from resets, potentially leading to undefined behavior or device malfunction. This feature has been tested using PCI transport with Function Level Reset (FLR) as an example reset mechanism. The reset can be triggered manually via sysfs (echo 1 > /sys/bus/pci/devices/$PCI_ADDR/reset). Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Message-Id: <1732690652-3065-3-git-send-email-israelr@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-26Merge tag 'mm-stable-2025-01-26-14-59' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ...
2025-01-26Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Mainly individually changelogged singleton patches. The patch series in this pull are: - "lib min_heap: Improve min_heap safety, testing, and documentation" from Kuan-Wei Chiu provides various tightenings to the min_heap library code - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some cleanup and Rust preparation in the xarray library code - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes pathnames in some code comments - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the new secs_to_jiffies() in various places where that is appropriate - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen switches two filesystems to the new mount API - "Convert ocfs2 to use folios" from Matthew Wilcox does that - "Remove get_task_comm() and print task comm directly" from Yafang Shao removes now-unneeded calls to get_task_comm() in various places - "squashfs: reduce memory usage and update docs" from Phillip Lougher implements some memory savings in squashfs and performs some maintainability work - "lib: clarify comparison function requirements" from Kuan-Wei Chiu tightens the sort code's behaviour and adds some maintenance work - "nilfs2: protect busy buffer heads from being force-cleared" from Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a corrupted image - "nilfs2: fix kernel-doc comments for function return values" from Ryusuke Konishi fixes some nilfs kerneldoc - "nilfs2: fix issues with rename operations" from Ryusuke Konishi addresses some nilfs BUG_ONs which syzbot was able to trigger - "minmax.h: Cleanups and minor optimisations" from David Laight does some maintenance work on the min/max library code - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work on the xarray library code" * tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits) ocfs2: use str_yes_no() and str_no_yes() helper functions include/linux/lz4.h: add some missing macros Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent Xarray: remove repeat check in xas_squash_marks() Xarray: distinguish large entries correctly in xas_split_alloc() Xarray: move forward index correctly in xas_pause() Xarray: do not return sibling entries from xas_find_marked() ipc/util.c: complete the kernel-doc function descriptions gcov: clang: use correct function param names latencytop: use correct kernel-doc format for func params minmax.h: remove some #defines that are only expanded once minmax.h: simplify the variants of clamp() minmax.h: move all the clamp() definitions after the min/max() ones minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp() minmax.h: reduce the #define expansion of min(), max() and clamp() minmax.h: update some comments minmax.h: add whitespace around operators and after commas nilfs2: do not update mtime of renamed directory that is not moved nilfs2: handle errors that nilfs_prepare_chunk() may return CREDITS: fix spelling mistake ...
2025-01-25zram: remove zcomp_stream_put() from write_incompressible_page()Sergey Senozhatsky
We cannot and should not put per-CPU compression stream in write_incompressible_page() because that function never gets any per-CPU streams in the first place. It's zram_write_page() that puts the stream before it calls write_incompressible_page(). Link: https://lkml.kernel.org/r/20250115072003.380567-1-senozhatsky@chromium.org Fixes: 485d11509d6d ("zram: factor out ZRAM_HUGE write") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>