2025-11-30  md/raid5: fix IO hang when array is broken with IO inflight  (Yu Kuai)

The following test can cause an IO hang:

  mdadm -CvR /dev/md0 -l10 -n4 /dev/sd[abcd] --assume-clean --chunk=64K --bitmap=none
  sleep 5
  echo 1 > /sys/block/sda/device/delete
  echo 1 > /sys/block/sdb/device/delete
  echo 1 > /sys/block/sdc/device/delete
  echo 1 > /sys/block/sdd/device/delete
  dd if=/dev/md0 of=/dev/null bs=8k count=1 iflag=direct

Root cause:

1) All disks are removed, however all rdevs in the array are still in sync, so IO is issued normally.
2) IO fails on sda and setting badblocks fails; sda becomes faulty and MD_SB_CHANGE_PENDING is set.
3) Error recovery tries to recover this IO from the other disks; IO is issued to sdb, sdc and sdd.
4) IO fails on sdb and setting badblocks fails again; now the array is broken and will become read-only.
5) IO fails on sdc and sdd; however, the stripe can't be handled anymore because MD_SB_CHANGE_PENDING is set:

  handle_stripe
   if (test_bit MD_SB_CHANGE_PENDING)
    set_bit STRIPE_HANDLE
    goto finish // skip handling failed stripe
  release_stripe
   if (test_bit STRIPE_HANDLE)
    list_add_tail conf->handle_list

6) Later, raid5d can't handle the failed stripe either:

  raid5d
   md_check_recovery
    md_update_sb
     if (!md_is_rdwr()) // can't clear pending bit
      return
   if (test_bit MD_SB_CHANGE_PENDING)
    break; // can't handle failed stripe

Since MD_SB_CHANGE_PENDING can never be cleared for a read-only array, fix this problem by skipping this check for read-only arrays.

Link: https://lore.kernel.org/linux-raid/20251117085557.770572-3-yukuai@fnnas.com
Fixes: d87f064f5874 ("md: never update metadata when array is read-only.")
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>

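A minimal sketch of the resulting change in handle_stripe(), assuming the md_is_rdwr() helper and flag names from the upstream md code; the surrounding stripe handling is abbreviated:

```
/* drivers/md/raid5.c (sketch) */
static void handle_stripe(struct stripe_head *sh)
{
	struct r5conf *conf = sh->raid_conf;
	struct mddev *mddev = conf->mddev;

	/*
	 * Only defer the stripe for a superblock update when the array
	 * is read-write: a read-only array can never clear
	 * MD_SB_CHANGE_PENDING, so a failed stripe must be handled now
	 * instead of being requeued forever.
	 */
	if (md_is_rdwr(mddev) &&
	    test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
		set_bit(STRIPE_HANDLE, &sh->state);
		return;
	}
	/* ... normal stripe handling, including failed legs ... */
}
```
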
2025-11-30  md: warn about updating super block failure  (Yu Kuai)

Many personalities handle IO errors from a daemon thread (like raid1d, raid10d, raid5d), and the sb is required to be clean before handling these failed IOs. However, updating the sb can fail, for example because the array is broken by IO failure, or because the user configures the array through the array_state sysfs API. Add a warning for the case that updating the sb fails, in case this is related to an IO hang.

Link: https://lore.kernel.org/linux-raid/20251117085557.770572-2-yukuai@fnnas.com
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Li Nan <linan122@huawei.com>

2025-11-30  md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid  (Yu Kuai)

Commit 2107457e31fa ("md/raid0: Move queue limit setup before r0conf initialization") dereferences mddev->gendisk unconditionally, which is NULL for dm-raid. Fix this problem by reverting to the old code for dm-raid.

Link: https://lore.kernel.org/linux-raid/20251116021816.107648-1-yukuai@fnnas.com
Fixes: 2107457e31fa ("md/raid0: Move queue limit setup before r0conf initialization")
Reported-and-tested-by: Changhui Zhong <czhong@redhat.com>
Closes: https://lore.kernel.org/all/CAGVVp+VqVnvGeneUoTbYvBv2cw6GwQRrR3B-iQ-_9rVfyumoKA@mail.gmail.com/
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>

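The shape of the fix is a guard around the gendisk-based queue-limit setup. A hedged sketch, assuming the mddev_is_dm() and raid0_set_limits() helpers from the current md code; the exact upstream diff may differ:

```
/* drivers/md/raid0.c, create_strip_zones() (sketch) */
if (!mddev_is_dm(mddev)) {
	/* native md array: mddev->gendisk is valid here */
	ret = raid0_set_limits(mddev);
	if (ret)
		return ret;
}
/* dm-raid has no gendisk; dm sets up the queue limits itself */
```
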
2025-11-28  sbitmap: fix all kernel-doc warnings  (Randy Dunlap)

Modify kernel-doc comments in sbitmap.h to prevent warnings:

  Warning: include/linux/sbitmap.h:84 struct member 'alloc_hint' not described in 'sbitmap'
  Warning: include/linux/sbitmap.h:151 struct member 'ws_active' not described in 'sbitmap_queue'
  Warning: include/linux/sbitmap.h:552 No description found for return value of 'sbq_wait_ptr'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

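The fix is purely documentation: each warning disappears once the struct member (or the function's return value) gets a kernel-doc line. A sketch of the format, with illustrative wording rather than the exact upstream text:

```
/**
 * struct sbitmap - Scalable bitmap.
 * @depth: Number of bits used in the whole bitmap.
 * @alloc_hint: Cache of the last successfully allocated or freed bit.
 */

/**
 * sbq_wait_ptr() - Get the next wait queue to use for a sbitmap_queue.
 * @sbq: Bitmap queue to wait on.
 * @wait_index: A counter per "user" of @sbq.
 *
 * Return: the wait queue to use next.
 */
```
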
2025-11-28  ublk: add helper of __ublk_fetch()  (Ming Lei)

Add helper __ublk_fetch() for refactoring ublk_fetch(). Meantime, move ublk_config_io_buf() out of __ublk_fetch() to make the code structure cleaner.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-28  ublk: pass const pointer to ublk_queue_is_zoned()  (Ming Lei)

Pass a const pointer to ublk_queue_is_zoned() because it is read-only.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-28  ublk: refactor auto buffer register in ublk_dispatch_req()  (Ming Lei)

Refactor the auto buffer register code and prepare for supporting the batch IO feature; the main motivation is to put the 'ublk_io' operation code together, so that a per-io lock can be applied to the code block. The key changes are:

- Rename ublk_auto_buf_reg() to ublk_do_auto_buf_reg().
- Introduce an enum `auto_buf_reg_res` to represent the result of the buffer registration attempt (FAIL, FALLBACK, OK).
- Split the existing `ublk_do_auto_buf_reg` function into two:
  - `__ublk_do_auto_buf_reg`: performs the actual buffer registration and returns the `auto_buf_reg_res` status.
  - `ublk_do_auto_buf_reg`: a wrapper that calls the internal function and handles the I/O preparation based on the result.
- Introduce `ublk_prep_auto_buf_reg_io` to encapsulate the logic for preparing the I/O for completion after buffer registration.
- Pass the `tag` directly to `ublk_auto_buf_reg_fallback` to avoid recalculating it.

This refactoring makes the control flow clearer and isolates the different stages of the auto buffer registration process.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

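A hedged sketch of the resulting split, with the enum values taken from the commit text and the function bodies abbreviated:

```
/* drivers/block/ublk_drv.c (sketch) */
enum auto_buf_reg_res {
	AUTO_BUF_REG_FAIL,	/* registration failed, fail the io */
	AUTO_BUF_REG_FALLBACK,	/* use the user-requested fallback */
	AUTO_BUF_REG_OK,	/* buffer registered */
};

/* performs the actual buffer registration */
static enum auto_buf_reg_res __ublk_do_auto_buf_reg(struct ublk_queue *ubq,
		struct request *req, struct ublk_io *io,
		unsigned int issue_flags)
{
	/* ... try to register the buffer, map failure to FAIL/FALLBACK ... */
	return AUTO_BUF_REG_OK;
}

/* wrapper: handles the io preparation based on the result */
static void ublk_do_auto_buf_reg(struct ublk_queue *ubq, struct request *req,
		struct ublk_io *io, unsigned int issue_flags)
{
	switch (__ublk_do_auto_buf_reg(ubq, req, io, issue_flags)) {
	case AUTO_BUF_REG_OK:
	case AUTO_BUF_REG_FALLBACK:
		ublk_prep_auto_buf_reg_io(ubq, req, io);
		break;
	case AUTO_BUF_REG_FAIL:
		/* ... complete the request with an error ... */
		break;
	}
}
```
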
2025-11-28  ublk: add `union ublk_io_buf` with improved naming  (Ming Lei)

Add `union ublk_io_buf` to name the anonymous union of struct ublk_io's addr and buf fields, and apply it to `struct ublk_io` for storing either ublk auto buffer register data or the ublk server io buffer address.

The union uses clear field names:

- `addr`: for regular ublk server io buffer addresses
- `auto_reg`: for ublk auto buffer registration data

This eliminates confusing access patterns and improves code readability.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

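A sketch of the named union; struct ublk_auto_buf_reg is the existing ublk UAPI type, and the other struct ublk_io fields are abbreviated:

```
/* drivers/block/ublk_drv.c (sketch) */
union ublk_io_buf {
	__u64			 addr;		/* ublk server io buffer address */
	struct ublk_auto_buf_reg auto_reg;	/* auto buffer registration data */
};

struct ublk_io {
	union ublk_io_buf buf;
	/* ... flags, result, task, cmd, ... */
};
```

With the union in place, readers see io->buf.addr or io->buf.auto_reg instead of one field silently serving both purposes.
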
2025-11-28  ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg()  (Ming Lei)

Add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() and prepare for reusing this helper for the coming UBLK_BATCH_IO feature, which can fetch & commit one batch of io commands via a single uring_cmd.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-28  kfifo: add kfifo_alloc_node() helper for NUMA awareness  (Ming Lei)

Add __kfifo_alloc_node() by refactoring and reusing __kfifo_alloc(), and define the kfifo_alloc_node() macro to support NUMA-aware memory allocation.

The new __kfifo_alloc_node() function accepts a NUMA node parameter and uses kmalloc_array_node() instead of kmalloc_array() for node-specific allocation. The existing __kfifo_alloc() now calls __kfifo_alloc_node() with NUMA_NO_NODE to maintain backward compatibility.

This enables users to allocate kfifo buffers on specific NUMA nodes, which is important for performance in NUMA systems where the kfifo will be primarily accessed by threads running on specific nodes.

Cc: Stefani Seibold <stefani@seibold.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

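A condensed sketch of the refactor, assuming the real error checks in __kfifo_alloc() (minimum size, allocation failure handling) carry over unchanged:

```
/* lib/kfifo.c (sketch, error checks abbreviated) */
int __kfifo_alloc_node(struct __kfifo *fifo, unsigned int size,
		       size_t esize, gfp_t gfp_mask, int node)
{
	size = roundup_pow_of_two(size);

	fifo->in = 0;
	fifo->out = 0;
	fifo->esize = esize;

	/* node-aware allocation is the only difference */
	fifo->data = kmalloc_array_node(esize, size, gfp_mask, node);
	if (!fifo->data) {
		fifo->mask = 0;
		return -ENOMEM;
	}
	fifo->mask = size - 1;
	return 0;
}

int __kfifo_alloc(struct __kfifo *fifo, unsigned int size,
		  size_t esize, gfp_t gfp_mask)
{
	/* existing callers keep their behavior */
	return __kfifo_alloc_node(fifo, size, esize, gfp_mask, NUMA_NO_NODE);
}
```
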
2025-11-28  blk-mq: fix potential uaf for 'queue_hw_ctx'  (Fengnan Chang)

This just applies Kuai's patch from [1] with minor changes.

blk_mq_realloc_hw_ctxs() will free 'queue_hw_ctx' (e.g. when updating submit_queues through configfs for null_blk), while it might still be used from another context (e.g. switching the elevator to none):

  t1                                  t2
  elevator_switch
   blk_mq_unquiesce_queue
    blk_mq_run_hw_queues
     queue_for_each_hw_ctx
      // assembly code for hctx = (q)->queue_hw_ctx[i]
      mov 0x48(%rbp),%rdx             // read old queue_hw_ctx
                                      __blk_mq_update_nr_hw_queues
                                       blk_mq_realloc_hw_ctxs
                                        hctxs = q->queue_hw_ctx
                                        q->queue_hw_ctx = new_hctxs
                                        kfree(hctxs)
      movslq %ebx,%rax
      mov (%rdx,%rax,8),%rdi          // uaf

This problem was found by code review, and I confirmed that the concurrent scenario does exist (specifically, 'q->queue_hw_ctx' can be changed during blk_mq_run_hw_queues()); however, the uaf problem hasn't been reproduced without hacking the kernel.

Since the queue is frozen in __blk_mq_update_nr_hw_queues(), fix the problem by protecting 'queue_hw_ctx' through RCU where it is accessed without grabbing 'q_usage_counter'.

[1] https://lore.kernel.org/all/20220225072053.2472431-1-yukuai3@huawei.com/

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

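A hedged sketch of the RCU scheme described above; names and the exact publication point are illustrative:

```
/* reader that does not hold q_usage_counter (sketch) */
rcu_read_lock();
hctxs = rcu_dereference(q->queue_hw_ctx);
blk_mq_run_hw_queue(hctxs[i], async);
rcu_read_unlock();

/* resize side in blk_mq_realloc_hw_ctxs() (sketch) */
old = q->queue_hw_ctx;
rcu_assign_pointer(q->queue_hw_ctx, new_hctxs);
synchronize_rcu();	/* wait out readers of the old table */
kfree(old);
```
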
2025-11-28  blk-mq: use array to manage hctx map instead of xarray  (Fengnan Chang)

After commit 4e5cc99e1e48 ("blk-mq: manage hctx map via xarray"), an xarray is used instead of an array to store the hctx pointers, but in poll mode each call to blk_mq_poll() needs xa_load() to find the corresponding hctx, which introduces some cost. In my test, xa_load() may cost 3.8% CPU. This patch reverts the previous change, eliminating the overhead of xa_load(), and can result in a 3% performance improvement.

Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-26  ublk: prevent invalid access with DEBUG  (Kevin Brodsky)

ublk_ch_uring_cmd_local() may jump to the out label before initialising the io pointer. This will cause trouble if DEBUG is defined, because the pr_devel() call dereferences io. Clang reports:

  drivers/block/ublk_drv.c:2403:6: error: variable 'io' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
   2403 |         if (tag >= ub->dev_info.queue_depth)
        |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  drivers/block/ublk_drv.c:2492:32: note: uninitialized use occurs here
   2492 |                 __func__, cmd_op, tag, ret, io->flags);

Fix this by initialising io to NULL and checking it before dereferencing it.

Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

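The fix itself is small. A sketch with the surrounding lookup logic abbreviated (tag, ub and cmd_op come from that elided context):

```
/* drivers/block/ublk_drv.c (sketch) */
static int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
				   unsigned int issue_flags)
{
	struct ublk_io *io = NULL;	/* was uninitialised before the fix */
	int ret = -EINVAL;

	if (tag >= ub->dev_info.queue_depth)
		goto out;		/* io is still NULL on this path */
	/* ... io = ...; normal command handling ... */
 out:
	/* guard the dereference: io may never have been assigned */
	pr_devel("%s: complete: op %d, tag %d ret %d io_flags %x\n",
		 __func__, cmd_op, tag, ret, io ? io->flags : 0);
	return ret;
}
```
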
2025-11-26  s390/dasd: Use scnprintf() instead of sprintf()  (Jan Höppner)

Use scnprintf() instead of sprintf() for those cases where the destination is an array and the size of the array is known at compile time. This prevents theoretical buffer overflows, and also avoids people repeatedly spending time figuring out whether the code is actually safe.

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

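The pattern of the conversion, with a hypothetical buffer and format string for illustration:

```
char name[16];

/* before: safe only as long as every caller keeps the output small enough */
sprintf(name, "dasd%u", index);

/* after: output is bounded by the compile-time size of the array */
scnprintf(name, sizeof(name), "dasd%u", index);
```

scnprintf() also returns the number of characters actually written (not the would-be length, as snprintf() does), which is usually what callers chaining writes into one buffer want.
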
2025-11-26  s390/dasd: Move device name formatting into separate function  (Jan Höppner)

The device name formatting can be generalized and made more readable compared to the current state. SCSI already provides a generalized way to format many devices in the same naming scheme as DASD does, which was introduced with commit 3e1a7ff8a0a7 ("block: allow disk to have extended device number"). Use this much cleaner code from drivers/scsi/sd.c to handle the legacy naming scheme in DASD as a replacement for the current implementation.

For easier error handling in the new function, move the gendisk free portion of dasd_gendisk_free() out into a new function dasd_gd_free().

Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-26  s390/dasd: Remove unnecessary debugfs_create() return checks  (Stefan Haberland)

The DASD driver only uses the dentry pointers when removing debugfs entries, and debugfs_remove() can safely handle both NULL and ERR_PTR. There is therefore no need to check debugfs_create() return values. This simplifies the debugfs setup code without changing functionality.

Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-26  s390/dasd: Fix gendisk parent after copy pair swap  (Stefan Haberland)

After a copy pair swap the block device's "device" symlink points to the secondary CCW device, but the gendisk's parent remained the primary, leaving /sys/block/<dasdx> under the wrong parent. Move the gendisk to the secondary's device with device_move(), keeping the sysfs topology consistent after the swap.

Fixes: 413862caad6f ("s390/dasd: add copy pair swap capability")
Cc: stable@vger.kernel.org #6.1
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-25  block: ignore __blkdev_issue_discard() return value  (Chaitanya Kulkarni)

__blkdev_issue_discard() always returns 0, making the error check in blkdev_issue_discard() dead code. In blkdev_issue_discard(), initialize ret to 0, stop assigning ret from __blkdev_issue_discard(), rely on the bio == NULL check to call submit_bio_wait(), preserve the submit_bio_wait() error handling, and preserve the -EOPNOTSUPP to 0 mapping.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

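A sketch of the simplified function under those constraints; this mirrors the described structure, not necessarily the exact upstream diff:

```
/* block/blk-lib.c (sketch) */
int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
			 sector_t nr_sects, gfp_t gfp_mask)
{
	struct bio *bio = NULL;
	struct blk_plug plug;
	int ret = 0;

	blk_start_plug(&plug);
	/* always succeeds: the return value is intentionally ignored */
	__blkdev_issue_discard(bdev, sector, nr_sects, gfp_mask, &bio);
	if (bio) {
		ret = submit_bio_wait(bio);
		if (ret == -EOPNOTSUPP)
			ret = 0;	/* discard not supported: not an error */
		bio_put(bio);
	}
	blk_finish_plug(&plug);

	return ret;
}
```
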
2025-11-25  block: fix typos in comments and strings in blk-core  (shechenglong)

This patch fixes multiple spelling mistakes in comments and documentation in the file block/blk-core.c. No functional changes intended.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: shechenglong <shechenglong@xfusion.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-25  block: Remove references to __device_add_disk()  (John Garry)

Since commit d1254a874971 ("block: remove support for delayed queue registrations"), function __device_add_disk() has been replaced with device_add_disk(), so fix up comments.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-24  Revert "Merge branch 'loop-aio-nowait' into for-6.19/block"  (Jens Axboe)

This reverts commit f43fdeb9a368a5ff56b088b46edc245bd4b52cde, reversing changes made to 2c6d792d4b7676e2b340df05425330452fee1f40.

There are concerns that doing inline submits can cause excessive stack usage, particularly when going back into the filesystem. Revert the loop dio nowait change for now.

Link: https://lore.kernel.org/linux-block/aSP3SG_KaROJTBHx@infradead.org/
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-20  block: use min() instead of min_t()  (David Laight)

min_t(unsigned int, a, b) casts an 'unsigned long' to 'unsigned int'. Use min(a, b) instead as it promotes any 'unsigned int' to 'unsigned long' and so cannot discard significant bits. In this case the 'unsigned long' value is small enough that the result is ok. (Similarly for max_t() and clamp_t().)

Detected by an extra check added to min_t().

Signed-off-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

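A worked illustration of the difference, with made-up values (assuming a 64-bit unsigned long):

```
#include <linux/minmax.h>

static void minmax_example(void)
{
	unsigned long bytes = 0x100000001UL;	/* larger than UINT_MAX */
	unsigned int cap = 4096;

	/*
	 * min_t(unsigned int, ...) casts both sides first: bytes is
	 * truncated to 1, so the "minimum" is 1 and high bits are lost.
	 */
	unsigned int bad = min_t(unsigned int, bytes, cap);

	/* min() promotes cap to unsigned long instead: result is 4096 */
	unsigned long good = min(bytes, cap);

	(void)bad;
	(void)good;
}
```
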
2025-11-20  zloop: clear nowait flag in workqueue context  (Chaitanya Kulkarni)

The zloop driver advertises REQ_NOWAIT support through BLK_FEAT_NOWAIT (enabled by default for all blk-mq devices), and honors the nowait behavior throughout zloop_queue_rq(). However, actual I/O to the backing file is performed in a workqueue, where blocking is allowed. To avoid imposing unnecessary non-blocking constraints in this blocking context, clear the REQ_NOWAIT flag before processing the request in the workqueue context.

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-20  loop: clear nowait flag in workqueue context  (Chaitanya Kulkarni)

The loop driver advertises REQ_NOWAIT support through BLK_FEAT_NOWAIT (enabled by default for all blk-mq devices), and honors the nowait behavior throughout loop_queue_rq(). However, actual I/O to the backing file is performed in a workqueue, where blocking is allowed. To avoid imposing unnecessary non-blocking constraints in this blocking context, clear the REQ_NOWAIT flag before processing the request in the workqueue context.

Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

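A hedged sketch of the workqueue side for either driver; loop_handle_cmd() and do_req_filebacked() follow the loop driver's structure, but the placement of the one-line change is illustrative:

```
/* drivers/block/loop.c (sketch) */
static void loop_handle_cmd(struct loop_cmd *cmd)
{
	struct request *rq = blk_mq_rq_from_pdu(cmd);

	/*
	 * The request may carry REQ_NOWAIT from the submission path, but
	 * this runs in a workqueue where blocking is allowed, so do not
	 * impose non-blocking constraints on the backing file I/O.
	 */
	rq->cmd_flags &= ~REQ_NOWAIT;

	/* ... do_req_filebacked(lo, rq); ... */
}
```
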
2025-11-19  zloop: fix zone append check in zloop_rw()  (Damien Le Moal)

While commit cf28f6f923cb ("zloop: fail zone append operations that are targeting full zones") added a check in zloop_rw() that a zone append is not issued to a full zone, commit e3a96ca90462 ("zloop: simplify checks for writes to sequential zones") inadvertently removed the check verifying that there is enough unwritten space in a zone for an incoming zone append operation. Re-add this check in zloop_rw() to make sure we do not write beyond the end of a zone. Of note is that this same check is already present in the function zloop_set_zone_append_sector() when ordered zone append is in use.

Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com>
Fixes: e3a96ca90462 ("zloop: simplify checks for writes to sequential zones")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-19  MAINTAINERS: add a maintainer for zoned block device support  (Damien Le Moal)

Add myself as the maintainer of the block layer support for the zoned block device code and user API.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-19  MAINTAINERS: add missing block layer user API header files  (Damien Le Moal)

Add the missing user API header files related to the block layer to the list of matching file patterns for Jens's block layer entry.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-18  block: remove the declaration of elevator_init_mq function  (Chengkaitao)

In commit 1e44bedbc921 ("block: unifying elevator change"), the elevator_init_mq function was deleted, but its declaration in elevator.h was overlooked. This patch fixes it.

Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-18  Revert "block: consider discard merge last"  (Jens Axboe)

This reverts commit 2516c246d01c23a5f5310e9ac78d9f8aad9b1d0e.

Suspected issues with discard merging post this patch, hence revert it for now.

Link: https://lore.kernel.org/linux-block/26acdfdf-de13-430b-8c73-f890c7689a84@kernel.dk/
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-18  block: Remove queue freezing from several sysfs store callbacks  (Bart Van Assche)

Freezing the request queue from inside sysfs store callbacks may cause a deadlock in combination with the dm-multipath driver and the queue_if_no_path option. Additionally, freezing the request queue slows down system boot on systems where sysfs attributes are set synchronously. Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue() calls from the store callbacks that do not strictly need them. Add the __data_racy annotation to request_queue.rq_timeout to suppress KCSAN data race reports about the rq_timeout reads.

This patch may cause a small delay in applying the new settings. For all the attributes affected by this patch, I/O will complete correctly whether the old or the new value of the attribute is used. This patch affects the following sysfs attributes:

* io_poll_delay
* io_timeout
* nomerges
* read_ahead_kb
* rq_affinity

Here is an example of a deadlock triggered by running test srp/002 if this patch is not applied:

  task:multipathd
  Call Trace:
   <TASK>
   __schedule+0x8c1/0x1bf0
   schedule+0xdd/0x270
   schedule_preempt_disabled+0x1c/0x30
   __mutex_lock+0xb89/0x1650
   mutex_lock_nested+0x1f/0x30
   dm_table_set_restrictions+0x823/0xdf0
   __bind+0x166/0x590
   dm_swap_table+0x2a7/0x490
   do_resume+0x1b1/0x610
   dev_suspend+0x55/0x1a0
   ctl_ioctl+0x3a5/0x7e0
   dm_ctl_ioctl+0x12/0x20
   __x64_sys_ioctl+0x127/0x1a0
   x64_sys_call+0xe2b/0x17d0
   do_syscall_64+0x96/0x3a0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>

  task:(udev-worker)
  Call Trace:
   <TASK>
   __schedule+0x8c1/0x1bf0
   schedule+0xdd/0x270
   blk_mq_freeze_queue_wait+0xf2/0x140
   blk_mq_freeze_queue_nomemsave+0x23/0x30
   queue_ra_store+0x14e/0x290
   queue_attr_store+0x23e/0x2c0
   sysfs_kf_write+0xde/0x140
   kernfs_fop_write_iter+0x3b2/0x630
   vfs_write+0x4fd/0x1390
   ksys_write+0xfd/0x230
   __x64_sys_write+0x76/0xc0
   x64_sys_call+0x276/0x17d0
   do_syscall_64+0x96/0x3a0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Cc: Martin Wilck <mwilck@suse.com>
Cc: Benjamin Marzinski <bmarzins@redhat.com>
Cc: stable@vger.kernel.org
Fixes: af2814149883 ("block: freeze the queue in queue_attr_store")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-18  fs: Add the __data_racy annotation to backing_dev_info.ra_pages  (Bart Van Assche)

Some but not all .ra_pages changes happen while block layer I/O is paused with blk_mq_freeze_queue(). Filesystems may read .ra_pages even while block layer I/O is paused, e.g. from inside their .fadvise callback. Annotating all .ra_pages reads with READ_ONCE() would be cumbersome. Hence, add the __data_racy annotation to the .ra_pages member variable.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-18  block: plug attempts to batch allocate tags multiple times  (Xue He)

This patch aims to enable batch allocation of sufficient tags after batch IO submission with the plug mechanism, thereby avoiding the need for frequent individual requests when the initial allocation is insufficient.

  HW: 16 CPUs / 16 poll queues
  Disk: Samsung PM9A3 Gen4 3.84T

  CMD:
    [global]
    ioengine=io_uring
    group_reporting=1
    time_based=1
    runtime=1m
    refill_buffers=1
    norandommap=1
    randrepeat=0
    fixedbufs=1
    registerfiles=1
    rw=randread
    iodepth=128
    iodepth_batch_submit=32
    iodepth_batch_complete_min=32
    iodepth_batch_complete_max=128
    iodepth_low=32
    bs=4k
    numjobs=1
    direct=1
    hipri=1

    [job1]
    filename=/dev/nvme0n1
    name=batch_test

  Perf:
    base code: __blk_mq_alloc_requests() 1.47%
    patch:     __blk_mq_alloc_requests() 0.75%

Signed-off-by: hexue <xue01.he@samsung.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-18  Merge branch 'loop-aio-nowait' into for-6.19/block  (Jens Axboe)

Merge async IO IOCB_NOWAIT support from Ming:

"This patchset improves loop aio perf by using IOCB_NOWAIT for avoiding to queue aio command to workqueue context, meantime refactor lo_rw_aio() a bit.

In my test VM, loop disk perf becomes very close to perf of the backing block device (nvme/mq virtio-scsi). And Mikulas verified that this way can improve 12jobs sequential readwrite io by ~5X, and basically solve the reported problem together with loop MQ change.

https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/

Zhaoyang Huang also mentioned it may fix their performance issue on Android use case.

The loop MQ change will be posted as standalone patch, because it needs UAPI change."

Link: https://lore.kernel.org/linux-block/20251015110735.1361261-1-ming.lei@redhat.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>

* loop-aio-nowait:
  loop: add hint for handling aio via IOCB_NOWAIT
  loop: try to handle loop aio command via NOWAIT IO first
  loop: move command blkcg/memcg initialization into loop_queue_work
  loop: add lo_submit_rw_aio()
  loop: add helper lo_rw_aio_prep()
  loop: add helper lo_cmd_nr_bvec()

2025-11-18  loop: add hint for handling aio via IOCB_NOWAIT  (Ming Lei)

Add a hint for using IOCB_NOWAIT to handle the loop aio command, to avoid causing a write (especially randwrite) perf regression on a sparse backing file.

Try IOCB_NOWAIT in the following situations:

- the backing file is a block device, OR
- it is a READ aio command, OR
- there aren't any queued blocking async WRITEs, because NOWAIT won't cause contention with a blocking WRITE, which often implies an exclusive lock

With this simple policy, the perf regression of randwrite/write on a sparse backing file is fixed.

Link: https://lore.kernel.org/dm-devel/7d6ae2c9-df8e-50d0-7ad6-b787cb3cfab4@redhat.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-18  loop: try to handle loop aio command via NOWAIT IO first  (Ming Lei)

Try to handle the loop aio command via NOWAIT IO first, so we can avoid queueing the aio command to the workqueue. This is usually a big win when the FS block mapping is stable: Mikulas verified [1] that this way improves IO perf by close to 5X in a 12jobs sequential read/write test, in which the FS block mapping is just stable.

Fall back to the workqueue in case of -EAGAIN. This way may bring a little cost from the 1st retry, but when running the following write test over loop/sparse_file, the actual effect on randwrite is obvious:

```
truncate -s 4G 1.img    #1.img is created on XFS/virtio-scsi
losetup -f 1.img --direct-io=on
fio --direct=1 --bs=4k --runtime=40 --time_based --numjobs=1 --ioengine=libaio \
    --iodepth=16 --group_reporting=1 --filename=/dev/loop0 -name=job --rw=$RW
```

- RW=randwrite: obvious IOPS drop observed
- RW=write: a little drop (5% - 10%)

This perf drop on randwrite over a sparse file is addressed in the following patch.

BLK_MQ_F_BLOCKING has to be set for calling into .read_iter() or .write_iter(), which might sleep even though it is NOWAIT; the only effect is that the rcu read lock is replaced with an srcu read lock.

Link: https://lore.kernel.org/linux-block/a8e5c76a-231f-07d1-a394-847de930f638@redhat.com/ [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

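The dispatch policy boils down to "inline NOWAIT first, workqueue only on -EAGAIN". A condensed sketch, with helper names from this series and signatures that are assumptions:

```
/* loop request dispatch (sketch) */
ret = lo_submit_rw_aio(lo, cmd, nr_bvec);	/* issued with IOCB_NOWAIT */
if (ret == -EAGAIN) {
	/* the backing FS would block: fall back to the workqueue path */
	loop_queue_work(lo, cmd);
	return BLK_STS_OK;
}
```
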
2025-11-18  loop: move command blkcg/memcg initialization into loop_queue_work  (Ming Lei)

Move loop command blkcg/memcg initialization into loop_queue_work, and prepare for supporting handling of loop io commands via IOCB_NOWAIT.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-18  loop: add lo_submit_rw_aio()  (Ming Lei)

Refactor lo_rw_aio() by extracting the I/O submission logic into a new helper function lo_submit_rw_aio(). This further improves code organization by separating the I/O preparation, submission, and completion handling into distinct phases.

Prepare for using NOWAIT to improve loop performance.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-18  loop: add helper lo_rw_aio_prep()  (Ming Lei)

Add helper lo_rw_aio_prep() to separate the preparation phase (setting up bio vectors and initializing the iocb structure) from the actual I/O execution in the loop block driver.

Prepare for using NOWAIT to improve loop performance.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-18  loop: add helper lo_cmd_nr_bvec()  (Ming Lei)

Add lo_cmd_nr_bvec() and prepare for refactoring lo_rw_aio().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-18  drbd: turn bitmap I/O comments into regular block comments  (Sukrut Heroorkar)

W=1 build warns because the bitmap I/O comments use '/**', which marks them as kernel-doc comments even though these functions do not document an external API. Convert these comments to regular block comments so kernel-doc no longer parses them.

Signed-off-by: Sukrut Heroorkar <hsukrut3@gmail.com>
Acked-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-17  block: rate-limit capacity change info log  (Li Chen)

loop devices under a heavy stress-ng loop stressor can trigger many capacity change events in a short time. Each event prints an info message from set_capacity_and_notify(), flooding the console and contributing to soft lockups on slow consoles.

Switch the printk in set_capacity_and_notify() to pr_info_ratelimited() so frequent capacity changes do not spam the log while still reporting occasional changes.

Cc: stable@vger.kernel.org
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

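The change is a one-line substitution; the message text below matches set_capacity_and_notify() in block/genhd.c:

```
/* block/genhd.c (sketch of the one-line change) */
pr_info_ratelimited("%s: detected capacity change from %lld to %lld\n",
		    disk->disk_name, capacity, size);
```

pr_info_ratelimited() applies the default printk ratelimit (bursts of up to 10 messages per 5-second window), so occasional capacity changes are still reported.
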
2025-11-17  Documentation: admin-guide: blockdev: update zloop parameters  (Damien Le Moal)

In Documentation/admin-guide/blockdev/zoned_loop.rst, add the description of the zone_append and ordered_zone_append configuration arguments of the zloop "add" command (device creation).

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-17  zloop: introduce the ordered_zone_append configuration parameter  (Damien Le Moal)

The zone append operation processing for zloop devices is similar to any other command, that is, the operation is processed as a command work item, without any special serialization between the work items (besides the zone mutex for mutually exclusive code sections). This processing is fine and gives excellent performance. However, it has a side effect: zone append operations are very often reordered and processed in a sequence that is very different from their issuing order by the user.

This effect is very visible using an XFS file system on top of a zloop device. A simple file write leads to many file extents as the data writes using zone append are reordered, resulting in the physical order being different from the file logical order. E.g. executing:

  $ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
  $ xfs_bmap /mnt/test
  /mnt/test:
   0: [0..4095]: 2162688..2166783
   1: [4096..6143]: 2168832..2170879
   2: [6144..8191]: 2166784..2168831
   3: [8192..10239]: 2170880..2172927
   4: [10240..12287]: 2174976..2177023
   5: [12288..14335]: 2172928..2174975
   6: [14336..20479]: 2177024..2183167

For 10 IOs, 6 extents are created. This is fine and actually allows exercising XFS zone garbage collection very well. However, this also makes debugging/working on XFS data placement harder, as the underlying device will most of the time reorder IOs, resulting in many file extents.

Allow a user to mitigate this with the new ordered_zone_append configuration parameter. For a zloop device created with this parameter specified, the sector of a zone append command is set early, when the command is submitted by the block layer with the zloop_queue_rq() function, instead of in the zloop_rw() function, which is executed later in the command work item context. This change ensures that, more often than not, zone append operation data ends up being written in the same order as the command submission by the user. In the case of XFS, this leads to far fewer file data extents. E.g., for the previous example, we get a single file data extent for the written file:

  $ dd if=/dev/zero of=/mnt/test bs=1M count=10 && sync
  $ xfs_bmap /mnt/test
  /mnt/test:
   0: [0..20479]: 2162688..2183167

Since we cannot use a mutex in the context of the zloop_queue_rq() function to atomically set a zone append operation's sector to the target zone write pointer location and increment that write pointer, a new per-zone spinlock is introduced to protect zone write pointer accesses and modifications. To check a zone write pointer location and set a zone append operation's target sector to that value, the function zloop_set_zone_append_sector() is introduced and called from zloop_queue_rq().

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

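A hedged sketch of zloop_set_zone_append_sector(); field and lock names are illustrative, and writing rq->__sector directly stands in for however the driver records the target sector:

```
/* drivers/block/zloop.c (sketch) */
static bool zloop_set_zone_append_sector(struct zloop_zone *zone,
					 struct request *rq,
					 sector_t zone_wp_limit)
{
	unsigned int nr_sectors = blk_rq_sectors(rq);
	bool ok = false;

	/* a spinlock, not the zone mutex: queue_rq() must not sleep */
	spin_lock(&zone->wp_lock);
	if (zone->cond != BLK_ZONE_COND_FULL &&
	    zone->wp + nr_sectors <= zone_wp_limit) {
		rq->__sector = zone->wp;	/* fix the append sector now */
		zone->wp += nr_sectors;
		ok = true;
	}
	spin_unlock(&zone->wp_lock);
	return ok;
}
```
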
2025-11-17  zloop: introduce the zone_append configuration parameter  (Damien Le Moal)

A zloop zoned block device declares to the block layer that it supports zone append operations. That is, a zloop device resembles an NVMe ZNS device supporting zone append. This native support is fine, but it does not allow exercising the block layer zone write plugging emulation of zone append, as is done with SCSI or ATA SMR HDDs.

Introduce the zone_append configuration parameter to allow creating a zloop device without native support for zone append, thus relying on the block layer zone append emulation. If not specified, zone append support is enabled by default. Otherwise, a value of 0 disables native zone append and a value of 1 enables it.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-17  zloop: simplify checks for writes to sequential zones  (Damien Le Moal)

The function zloop_rw() already checks early that a request is fully contained within the target zone. So this check does not need to be done again for regular writes to sequential zones. Furthermore, since zone append operations are always directed to the zone write pointer location, we do not need to check for their alignment to that value after setting it. So turn the "if" checking the write pointer alignment into an "else if".

While at it, improve the comment describing the write pointer modification and how this value is corrected in case of error.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-17  zloop: fail zone append operations that are targeting full zones  (Damien Le Moal)

zloop_rw() will fail any regular write operation that targets a full sequential zone. The check for this is indirect, achieved by checking the write pointer alignment of the write operation. But this check is ineffective for zone append operations, since these are always automatically directed at a zone's write pointer. Prevent zone append operations from being executed in a full zone with an explicit check of the zone condition.

Fixes: eb0570c7df23 ("block: new zoned loop block device driver")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-17  zloop: make the write pointer of full zones invalid  (Damien Le Moal)

The write pointer of zones that are in the full condition is always invalid. Reflect that fact by setting the write pointer of full zones to ULLONG_MAX.

Fixes: eb0570c7df23 ("block: new zoned loop block device driver")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-17  block/blk-throttle: Remove throtl_slice from struct throtl_data  (Guenter Roeck)

throtl_slice is now a constant. Remove the variable and use the constant directly where needed.

Cc: Yu Kuai <yukuai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-17  block/blk-throttle: drop unneeded blk_stat_enable_accounting  (Guenter Roeck)

After the removal of CONFIG_BLK_DEV_THROTTLING_LOW, it is no longer necessary to enable block accounting, so remove the call to blk_stat_enable_accounting(). With that, the track_bio_latency variable is no longer used and can be deleted from struct throtl_data. Also, including blk-stat.h is no longer necessary.

Fixes: bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
Cc: Yu Kuai <yukuai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2025-11-17  block/blk-throttle: Fix throttle slice time for SSDs  (Guenter Roeck)

Commit d61fcfa4bb18 ("blk-throttle: choose a small throtl_slice for SSD") introduced device type specific throttle slices if BLK_DEV_THROTTLING_LOW was enabled. Commit bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") removed support for BLK_DEV_THROTTLING_LOW, but left the device type specific throttle slices in place. This effectively changed throttling behavior on systems with SSDs, which now use a different and non-configurable slice time compared to non-SSD devices.

Practical impact is that throughput tests with low configured throttle values (65536 bps) experience less than expected throughput on SSDs, presumably due to rounding errors associated with the small throttle slice time used for those devices. The same tests pass when setting the throttle values to 65536 * 4 = 262144 bps.

The original code sets the throttle slice time to DFL_THROTL_SLICE_HD if CONFIG_BLK_DEV_THROTTLING_LOW is disabled. Restore that code to fix the problem. With that, DFL_THROTL_SLICE_SSD is no longer necessary. Revert to the original code and re-introduce DFL_THROTL_SLICE to replace both DFL_THROTL_SLICE_HD and DFL_THROTL_SLICE_SSD. This effectively reverts commit d61fcfa4bb18 ("blk-throttle: choose a small throtl_slice for SSD"). While at it, also remove MAX_THROTL_SLICE since it is not used anymore.

Fixes: bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW")
Cc: Yu Kuai <yukuai@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Yu Kuai <yukuai@fnnas.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>