path: root/block

2024-11-07  block: Add a public bdev_zone_is_seq() helper  (Damien Le Moal)

    Turn the private disk_zone_is_conv() function in blk-zoned.c into a public
    and documented bdev_zone_is_seq() helper with the inverse polarity of the
    original function. Also add a check for non-zoned devices so that all file
    systems can use the helper, even with a regular block device.

    Suggested-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20241107064300.227731-3-dlemoal@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
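A minimal sketch of what such a helper can look like, inferred only from the description above and the RCU commit below; the exact code in the tree may differ (disk_zone_no() and the conv_zones_bitmap field are taken from the surrounding entries):

    static inline bool bdev_zone_is_seq(struct block_device *bdev,
                                        sector_t sector)
    {
            struct gendisk *disk = bdev->bd_disk;
            unsigned long *bitmap;
            bool is_seq;

            /* Regular (non-zoned) block devices have no sequential zones. */
            if (!bdev_is_zoned(bdev))
                    return false;

            /* Inverse polarity of disk_zone_is_conv(): not conventional. */
            rcu_read_lock();
            bitmap = rcu_dereference(disk->conv_zones_bitmap);
            is_seq = !bitmap || !test_bit(disk_zone_no(disk, sector), bitmap);
            rcu_read_unlock();

            return is_seq;
    }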

2024-11-07  block: RCU protect disk->conv_zones_bitmap  (Damien Le Moal)

    Ensure that a disk revalidation changing the conventional zones bitmap of
    a disk does not cause invalid memory references when using the
    disk_zone_is_conv() helper, by RCU protecting the disk->conv_zones_bitmap
    pointer.

    disk_zone_is_conv() is modified to operate under the RCU read lock and the
    function disk_set_conv_zones_bitmap() is added to update a disk
    conv_zones_bitmap pointer using rcu_replace_pointer() with the disk
    zone_wplugs_lock spinlock held.

    disk_free_zone_resources() is modified to call disk_update_zone_resources()
    with a NULL bitmap pointer to free the disk conv_zones_bitmap.
    disk_set_conv_zones_bitmap() is also used in disk_update_zone_resources()
    to set the new (revalidated) bitmap and free the old one.

    Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20241107064300.227731-2-dlemoal@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
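A hedged sketch of the update side described above; the function and field names follow the text, but the exact freeing scheme in the tree may differ:

    static void disk_set_conv_zones_bitmap(struct gendisk *disk,
                                           unsigned long *bitmap)
    {
            unsigned long *old_bitmap;
            unsigned long flags;

            /* Publish the new bitmap under the zone write plug lock. */
            spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
            old_bitmap = rcu_replace_pointer(disk->conv_zones_bitmap, bitmap,
                            lockdep_is_held(&disk->zone_wplugs_lock));
            spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);

            /* Wait out disk_zone_is_conv() RCU readers, then free the old one. */
            synchronize_rcu();
            kfree(old_bitmap);
    }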

2024-11-07  block: Replace sprintf() with sysfs_emit()  (zhangguopeng)

    Per Documentation/filesystems/sysfs.rst, show() should only use
    sysfs_emit() or sysfs_emit_at() when formatting the value to be returned
    to user space.

    No functional change intended.

    Signed-off-by: zhangguopeng <zhangguopeng@kylinos.cn>
    Suggested-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20241107104258.29742-1-zhangguopeng@kylinos.cn
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
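The conversion pattern, shown on a hypothetical show() method (foo_show and foo_value are illustrative names, not from the patch):

    static ssize_t foo_show(struct device *dev,
                            struct device_attribute *attr, char *buf)
    {
            /* was: return sprintf(buf, "%d\n", foo_value); */
            return sysfs_emit(buf, "%d\n", foo_value);
    }

sysfs_emit() knows the buffer is a full sysfs page and refuses to write past PAGE_SIZE, a guarantee a bare sprintf() cannot give.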

2024-11-07  block: Switch to using refcount_t for zone write plugs  (Damien Le Moal)

    Replace the raw atomic_t reference counting of zone write plugs with a
    refcount_t. No functional changes.

    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/oe-kbuild-all/202411050650.ilIZa8S7-lkp@intel.com/
    Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20241107065438.236348-1-dlemoal@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
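The shape of such a conversion, on a hypothetical structure (real zone write plugs carry more state than this):

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct my_wplug {                       /* illustrative stand-in */
            refcount_t ref;                 /* was: atomic_t ref; */
    };

    static void my_wplug_put(struct my_wplug *plug)
    {
            /* was: if (atomic_dec_and_test(&plug->ref)) */
            if (refcount_dec_and_test(&plug->ref))
                    kfree(plug);
    }

refcount_t saturates and warns on increment-from-zero and overflow, which is what makes the swap worthwhile even with no functional change intended.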
2024-11-07Revert "block: pre-calculate max_zone_append_sectors"Jens Axboe
This causes issue on, at least, nvme-mpath where my boot fails with: WARNING: CPU: 354 PID: 2729 at block/blk-settings.c:75 blk_validate_limits+0x356/0x380 Modules linked in: tg3(+) nvme usbcore scsi_mod ptp i2c_piix4 libphy nvme_core crc32c_intel scsi_common usb_common pps_core i2c_smbus CPU: 354 UID: 0 PID: 2729 Comm: kworker/u2061:1 Not tainted 6.12.0-rc6+ #181 Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.8.3 04/02/2024 Workqueue: async async_run_entry_fn RIP: 0010:blk_validate_limits+0x356/0x380 Code: f6 47 01 04 75 28 83 bf 94 00 00 00 00 75 39 83 bf 98 00 00 00 00 75 34 83 7f 68 00 75 32 31 c0 83 7f 5c 00 0f 84 9b fd ff ff <0f> 0b eb 13 0f 0b eb 0f 48 c7 c0 74 12 58 92 48 89 c7 e8 13 76 46 RSP: 0018:ffffa8a1dfb93b30 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff9232829c8388 RCX: 0000000000000088 RDX: 0000000000000080 RSI: 0000000000000200 RDI: ffffa8a1dfb93c38 RBP: 000000000000000c R08: 00000000ffffffff R09: 000000000000ffff R10: 0000000000000000 R11: 0000000000000000 R12: ffff9232829b9000 R13: ffff9232829b9010 R14: ffffa8a1dfb93c38 R15: ffffa8a1dfb93c38 FS: 0000000000000000(0000) GS:ffff923867c80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055c1b92480a8 CR3: 0000002484ff0002 CR4: 0000000000370ef0 Call Trace: <TASK> ? __warn+0xca/0x1a0 ? blk_validate_limits+0x356/0x380 ? report_bug+0x11a/0x1a0 ? handle_bug+0x5e/0x90 ? exc_invalid_op+0x16/0x40 ? asm_exc_invalid_op+0x16/0x20 ? blk_validate_limits+0x356/0x380 blk_alloc_queue+0x7a/0x250 __blk_alloc_disk+0x39/0x80 nvme_mpath_alloc_disk+0x13d/0x1b0 [nvme_core] nvme_scan_ns+0xcc7/0x1010 [nvme_core] async_run_entry_fn+0x27/0x120 process_scheduled_works+0x1a0/0x360 worker_thread+0x2bc/0x350 ? pr_cont_work+0x1b0/0x1b0 kthread+0x111/0x120 ? kthread_unuse_mm+0x90/0x90 ret_from_fork+0x30/0x40 ? kthread_unuse_mm+0x90/0x90 ret_from_fork_asm+0x11/0x20 </TASK> ---[ end trace 0000000000000000 ]--- presumably due to max_zone_append_sectors not being cleared to zero, resulting in blk_validate_zoned_limits() complaining and failing. This reverts commit 2a8f6153e1c2db06a537a5c9d61102eb591776f1. Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-11-04  block: pre-calculate max_zone_append_sectors  (Christoph Hellwig)

    max_zone_append_sectors differs from all other queue limits in that the
    final value used is not stored in the queue_limits but needs to be
    obtained using the queue_limits_max_zone_append_sectors helper. This not
    only adds (tiny) extra overhead to the I/O path, but also can easily be
    forgotten in file system code.

    Fix this by adding a new max_hw_zone_append_sectors value to queue_limits
    which is set by the driver, and calculating max_zone_append_sectors from
    that and the other inputs in blk_validate_zoned_limits, similar to how
    max_sectors is calculated.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-11-04  block: remove the max_zone_append_sectors check in blk_revalidate_disk_zones  (Christoph Hellwig)

    With the block layer zone append emulation, we are now always setting a
    max_zone_append_sectors value for zoned devices, and this check can't
    ever trigger.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20241104073955.112324-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-11-04  block: update blk_stack_limits documentation  (Christoph Hellwig)

    Listing every single feature that needs to be pre-set by stacking
    drivers does not scale.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20241104054218.45596-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-11-01  Merge tag 'block-6.12-20241101' of git://git.kernel.dk/linux  (Linus Torvalds)

    Pull block fixes from Jens Axboe:

     - Fixup for a recent blk_rq_map_user_bvec() patch

     - NVMe pull request via Keith:
         - Spec compliant identification fix (Keith)
         - Module parameter to enable backward compatibility on unusual
           namespace formats (Keith)
         - Target double free fix when using keys (Vitaliy)
         - Passthrough command error handling fix (Keith)

    * tag 'block-6.12-20241101' of git://git.kernel.dk/linux:
      nvme: re-fix error-handling for io_uring nvme-passthrough
      nvmet-auth: assign dh_key to NULL after kfree_sensitive
      nvme: module parameter to disable pi with offsets
      block: fix queue limits checks in blk_rq_map_user_bvec for real
      nvme: enhance cns version checking

2024-10-31  block: remove bio_add_zone_append_page  (Christoph Hellwig)

    This is only used by the nvmet zns passthrough code, which can trivially
    just use bio_add_pc_page and do the sanity check for the max zone append
    limit itself.

    All future zoned file systems should follow the btrfs lead and let the
    upper layers fill up bios unlimited by hardware constraints and split
    them to the limits in the I/O submission handler.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20241030051859.280923-3-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-31  block: remove zone append special casing from the direct I/O path  (Christoph Hellwig)

    This code is unused. All future zoned file systems should follow the
    btrfs lead of splitting the bios themselves to the zoned limits in the
    I/O submission handler; if they didn't, they would be hit by commit
    ed9832bc08db ("block: introduce folio awareness and add a bigger size
    from folio") breaking this code when the zone append limit (that is
    usually the max_hw_sectors limit) is smaller than the largest possible
    folio size.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20241030051859.280923-2-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-30  blk-integrity: remove seed for user mapped buffers  (Keith Busch)

    The seed is only used for kernel generation and verification. That
    doesn't happen for user buffers, so passing the seed around doesn't
    accomplish anything.

    Signed-off-by: Keith Busch <kbusch@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
    Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
    Link: https://lore.kernel.org/r/20241016201309.1090320-1-kbusch@meta.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-29  block: add a bdev_limits helper  (Christoph Hellwig)

    Add a helper to get the queue_limits from the bdev without having to
    poke into the request_queue.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Garry <john.g.garry@oracle.com>
    Link: https://lore.kernel.org/r/20241029141937.249920-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
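Based on that description, the helper is plausibly just (hedged sketch):

    static inline const struct queue_limits *bdev_limits(struct block_device *bdev)
    {
            return &bdev_get_queue(bdev)->limits;
    }

so callers can write bdev_limits(bdev)->max_sectors instead of reaching through bdev_get_queue(bdev)->limits.max_sectors.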

2024-10-28  block: fix queue limits checks in blk_rq_map_user_bvec for real  (Christoph Hellwig)

    blk_rq_map_user_bvec currently only has ad-hoc checks for queue limits,
    and the last fix to it enabled valid NVMe I/O to pass, but also allowed
    invalid I/O for drivers that set a max_segment_size or seg_boundary
    limit.

    Fix it once and for all by using the bio_split_rw_at helper from the I/O
    path, which indicates if and where a bio would have to be split to adhere
    to the queue limits. If it returns a positive value, turn that into
    -EREMOTEIO to retry using the copy path.

    Fixes: 2ff949441802 ("block: fix sanity checks in blk_rq_map_user_bvec")
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Garry <john.g.garry@oracle.com>
    Link: https://lore.kernel.org/r/20241028090840.446180-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-27  Merge tag 'block-6.12-20241026' of git://git.kernel.dk/linux  (Linus Torvalds)

    Pull block fixes from Jens Axboe:

     - Pull request for MD via Song fixing a few issues

     - Fix a wrong check in blk_rq_map_user_bvec(), causing IO errors on
       passthrough IO (Xinyu)

    * tag 'block-6.12-20241026' of git://git.kernel.dk/linux:
      block: fix sanity checks in blk_rq_map_user_bvec
      md/raid10: fix null ptr dereference in raid10_size()
      md: ensure child flush IO does not affect origin bio->bi_status

2024-10-26  block: model freeze & enter queue as lock for supporting lockdep  (Ming Lei)

    Recently we got several deadlock reports [1][2][3] caused by
    blk_mq_freeze_queue() and blk_enter_queue(). It turns out the two behave
    just like acquiring a read/write lock, so model them as a read/write lock
    for lockdep support:

    1) model q->q_usage_counter as two locks (io and queue lock)
       - queue lock covers sync with blk_enter_queue()
       - io lock covers sync with bio_enter_queue()

    2) make the lockdep class/key per-queue:
       - different subsystems have very different lock use patterns, and a
         shared lock class causes false positives easily
       - freeze_queue degrades to no lock in case the disk state becomes
         DEAD, because bio_enter_queue() won't be blocked any more
       - freeze_queue degrades to no lock in case the request queue becomes
         dying, because blk_enter_queue() won't be blocked any more

    3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock
       - it is an exclusive lock, so the dependency with blk_enter_queue()
         is covered
       - it is a trylock, because blk_mq_freeze_queue() calls are allowed to
         run concurrently

    4) model blk_enter_queue() & bio_enter_queue() as acquire_read()
       - nested blk_enter_queue() calls are allowed
       - the dependency with blk_mq_freeze_queue() is covered
       - blk_queue_exit() is often called from other contexts (such as irq),
         and it can't be annotated as lock_release(), so simply do it in
         blk_enter_queue(); this way still covers as many cases as possible

    With lockdep support, such reports can be flagged as soon as possible,
    without waiting until the real deadlock is triggered. For example, a
    lockdep report can be triggered for report [3] with this patch applied.

    [1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
        https://bugzilla.kernel.org/show_bug.cgi?id=219166
    [2] del_gendisk() vs blk_queue_enter() race condition
        https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/
    [3] queue_freeze & queue_enter deadlock in scsi
        https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20241025003722.3630252-4-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
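A hedged sketch of the modelling idea using lockdep's rwsem annotation macros; q_lockdep_map is a hypothetical name, and the real patch uses separate io/queue lock maps as described in point 1) above:

    /* freeze side: exclusive, annotated as a trylock (may run concurrently) */
    static void blk_freeze_acquire_lockdep(struct request_queue *q)
    {
            rwsem_acquire(&q->q_lockdep_map, 0, 1, _RET_IP_);
    }

    /* enter side: shared/read, nesting allowed */
    static void blk_enter_acquire_lockdep(struct request_queue *q)
    {
            rwsem_acquire_read(&q->q_lockdep_map, 0, 0, _RET_IP_);
            /*
             * blk_queue_exit() may run from irq context, so the release
             * annotation happens here rather than at exit time.
             */
            rwsem_release(&q->q_lockdep_map, _RET_IP_);
    }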

2024-10-26  blk-mq: add non_owner variant of start_freeze/unfreeze queue APIs  (Ming Lei)

    Add a non_owner variant of the start_freeze/unfreeze queue APIs, so that
    the caller knows what they are doing, and we can skip lockdep support
    for the non_owner variant at the per-call level.

    This prepares for supporting lockdep for freezing/unfreezing queues.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Suggested-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20241025003722.3630252-2-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-23  block: fix sanity checks in blk_rq_map_user_bvec  (Xinyu Zhang)

    blk_rq_map_user_bvec contains a check bytes + bv->bv_len > nr_iter which
    causes unnecessary failures in NVMe passthrough I/O, reproducible as
    follows:

    - register a 2 page, page-aligned buffer against a ring
    - use that buffer to do a 1 page io_uring NVMe passthrough read

    The second (i = 1) iteration of the loop in blk_rq_map_user_bvec will
    then have nr_iter == 1 page, bytes == 1 page, bv->bv_len == 1 page, so
    the check bytes + bv->bv_len > nr_iter will succeed, causing the I/O to
    fail. This failure is unnecessary, as when the check succeeds, it means
    we've checked the entire buffer that will be used by the request, i.e.
    blk_rq_map_user_bvec should complete successfully. Therefore, terminate
    the loop early and return successfully when the check
    bytes + bv->bv_len > nr_iter succeeds.

    While we're at it, also remove the check that all segments in the bvec
    are single-page. While this seems to be true for all users of the
    function, it doesn't appear to be required anywhere downstream.

    CC: stable@vger.kernel.org
    Signed-off-by: Xinyu Zhang <xizhang@purestorage.com>
    Co-developed-by: Uday Shankar <ushankar@purestorage.com>
    Signed-off-by: Uday Shankar <ushankar@purestorage.com>
    Fixes: 37987547932c ("block: extend functionality to map bvec iterator")
    Link: https://lore.kernel.org/r/20241023211519.4177873-1-ushankar@purestorage.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
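In outline, the fix turns the failure case into a successful early termination (a hedged paraphrase of the described change, not the verbatim diff):

    /* inside the bvec-walking loop of blk_rq_map_user_bvec() */
    if (bytes + bv->bv_len > nr_iter) {
            /*
             * The registered buffer already covers the whole request;
             * the remaining bvecs are irrelevant. Stop, successfully.
             * (was: treated as an error, failing valid passthrough I/O)
             */
            break;
    }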

2024-10-23  blk-mq: Unexport blk_mq_flush_busy_ctxs()  (Bart Van Assche)

    Commit a6088845c2bf ("block: kyber: make kyber more friendly with
    merging") removed the only blk_mq_flush_busy_ctxs() call from outside
    the block layer core. Hence unexport blk_mq_flush_busy_ctxs().

    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20241023202850.3469279-1-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  blk-mq: Make blk_mq_quiesce_tagset() hold the tag list mutex less long  (Bart Van Assche)

    Make sure that the tag_list_lock mutex is not held any longer than
    necessary. This change reduces latency if e.g. blk_mq_quiesce_tagset()
    is called concurrently from more than one thread. This function is used
    by the NVMe core and also by the UFS driver.

    Reported-by: Peter Wang <peter.wang@mediatek.com>
    Cc: Chao Leng <lengchao@huawei.com>
    Cc: Ming Lei <ming.lei@redhat.com>
    Cc: stable@vger.kernel.org
    Fixes: 414dd48e882c ("blk-mq: add tagset quiesce interface")
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Link: https://lore.kernel.org/r/20241022181617.2716173-1-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: remove redundant explicit memory barrier from rq_qos waiter and waker  (Muchun Song)

    The memory barriers in list_del_init_careful() and list_empty_careful(),
    in pairs, already handle the proper ordering between data.got_token and
    data.wq.entry, so remove the redundant explicit barriers. Also change a
    "break" statement to "return" to avoid a redundant call to finish_wait().

    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
    Reviewed-by: Omar Sandoval <osandov@fb.com>
    Link: https://lore.kernel.org/r/20241021085251.73353-1-songmuchun@bytedance.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  Merge branch 'for-6.13/block-atomic' into for-6.13/block  (Jens Axboe)

    Merge in block/fs prep patches for the atomic write support.

    * for-6.13/block-atomic:
      block: Add bdev atomic write limits helpers
      fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
      block/fs: Pass an iocb to generic_atomic_write_valid()

2024-10-22  block: flush all throttled bios when deleting the cgroup  (Li Lingfeng)

    When a process migrates to another cgroup and the original cgroup is
    deleted, the restrictions of throttled bios cannot be removed. If the
    restrictions are set too low, it will take a long time to complete these
    bios.

    Refer to the process of deleting a disk to remove the restrictions and
    issue bios when deleting the cgroup.

    This changes the behavior of throttled bios:
    Before: the limit of the throttled bios can't be changed and the bios
            will complete under this limit.
    Now:    the limit will be canceled and the throttled bios will be
            flushed immediately.

    References:
    [1] https://lore.kernel.org/r/20220318130144.1066064-4-ming.lei@redhat.com
    [2] https://lore.kernel.org/all/da861d63-58c6-3ca0-2535-9089993e9e28@huaweicloud.com/

    Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20240817071108.1919729-1-lilingfeng@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: fix ordering between checking BLK_MQ_S_STOPPED request adding  (Muchun Song)

    Supposing the first scenario with a virtio_blk driver:

    CPU0                                  CPU1

    blk_mq_try_issue_directly()
      __blk_mq_issue_directly()
        q->mq_ops->queue_rq()
          virtio_queue_rq()
            blk_mq_stop_hw_queue()
                                          virtblk_done()
      blk_mq_request_bypass_insert()  1) store
                                            blk_mq_start_stopped_hw_queue()
                                              clear_bit(BLK_MQ_S_STOPPED)  3) store
                                              blk_mq_run_hw_queue()
                                                if (!blk_mq_hctx_has_pending())  4) load
                                                  return
                                                blk_mq_sched_dispatch_requests()
      blk_mq_run_hw_queue()
        if (!blk_mq_hctx_has_pending())
          return
        blk_mq_sched_dispatch_requests()
          if (blk_mq_hctx_stopped())  2) load
            return
          __blk_mq_sched_dispatch_requests()

    Supposing another scenario:

    CPU0                                  CPU1

    blk_mq_requeue_work()
      blk_mq_insert_request()  1) store
                                          virtblk_done()
                                            blk_mq_start_stopped_hw_queue()
      blk_mq_run_hw_queues()                  clear_bit(BLK_MQ_S_STOPPED)  3) store
                                              blk_mq_run_hw_queue()
                                                if (!blk_mq_hctx_has_pending())  4) load
                                                  return
                                                blk_mq_sched_dispatch_requests()
        if (blk_mq_hctx_stopped())  2) load
          continue
        blk_mq_run_hw_queue()

    Both scenarios are similar: a full memory barrier should be inserted
    between 1) and 2), as well as between 3) and 4), to make sure that either
    CPU0 sees that BLK_MQ_S_STOPPED is cleared or CPU1 sees the dispatch
    list. Otherwise, either CPU will not rerun the hardware queue, causing
    starvation of the request.

    The easy way to fix it is to add the essential full memory barrier to the
    blk_mq_hctx_stopped() helper. In order not to affect the fast path (the
    hardware queue is not stopped most of the time), we only insert the
    barrier into the slow path. Actually, only the slow path needs to care
    about a missed dispatch of the request to the low-level device driver.

    Fixes: 320ae51feed5 ("blk-mq: new multi-queue block IO queueing mechanism")
    Cc: stable@vger.kernel.org
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20241014092934.53630-4-songmuchun@bytedance.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
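A hedged sketch of the described helper change; the comments paraphrase the commit text and the exact upstream code may differ:

    static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
    {
            /* Fast path: the hardware queue is not stopped most of the time. */
            if (!test_bit(BLK_MQ_S_STOPPED, &hctx->state))
                    return false;

            /*
             * Slow path only: order the re-read of BLK_MQ_S_STOPPED (the
             * 2)/4) loads above) against the earlier store of the request
             * to the dispatch list (the 1)/3) stores), so that either the
             * cleared bit or the pending request is guaranteed visible.
             */
            smp_mb();
            return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
    }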

2024-10-22  block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding  (Muchun Song)

    Supposing the following scenario:

    CPU0                                  CPU1

    blk_mq_insert_request()  1) store
                                          blk_mq_unquiesce_queue()
                                            blk_queue_flag_clear()  3) store
                                            blk_mq_run_hw_queues()
                                              blk_mq_run_hw_queue()
                                                if (!blk_mq_hctx_has_pending())  4) load
                                                  return
    blk_mq_run_hw_queue()
      if (blk_queue_quiesced())  2) load
        return
      blk_mq_sched_dispatch_requests()

    A full memory barrier should be inserted between 1) and 2), as well as
    between 3) and 4), to make sure that either CPU0 sees that
    QUEUE_FLAG_QUIESCED is cleared or CPU1 sees the dispatch list or the
    setting of the bitmap of the software queue. Otherwise, either CPU will
    not rerun the hardware queue, causing starvation.

    There are two possible fixes: 1) add a pair of memory barriers, or
    2) use hctx->queue->queue_lock to synchronize QUEUE_FLAG_QUIESCED. Here
    we chose 2), since memory barriers are not easy to maintain.

    Fixes: f4560ffe8cec ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
    Cc: stable@vger.kernel.org
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20241014092934.53630-3-songmuchun@bytedance.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: fix missing dispatching request when queue is started or unquiesced  (Muchun Song)

    Supposing the following scenario with a virtio_blk driver:

    CPU0                          CPU1                            CPU2

    blk_mq_try_issue_directly()
      __blk_mq_issue_directly()
        q->mq_ops->queue_rq()
          virtio_queue_rq()
            blk_mq_stop_hw_queue()
                                                                  virtblk_done()
                                  blk_mq_try_issue_directly()
                                    if (blk_mq_hctx_stopped())
      blk_mq_request_bypass_insert()                                blk_mq_run_hw_queue()
      blk_mq_run_hw_queue()           blk_mq_run_hw_queue()
                                      blk_mq_insert_request()
                                      return

    After CPU0 has marked the queue as stopped, CPU1 will see that the queue
    is stopped. But before CPU1 puts the request on the dispatch list, CPU2
    receives the interrupt of completion of the request, so it will run the
    hardware queue and mark the queue as non-stopped. Meanwhile, CPU1 also
    runs the same hardware queue. After both CPU1 and CPU2 complete
    blk_mq_run_hw_queue(), CPU1 just puts the request on the same hardware
    queue and returns. It misses dispatching a request.

    Fix it by running the hardware queue explicitly. And
    blk_mq_request_issue_directly() should handle a similar situation; fix
    it as well.

    Fixes: d964f04a8fde ("blk-mq: fix direct issue")
    Cc: stable@vger.kernel.org
    Cc: Muchun Song <muchun.song@linux.dev>
    Signed-off-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20241014092934.53630-2-songmuchun@bytedance.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-22Revert "blk-throttle: Fix IO hang for a corner case"Xiuhong Wang
This reverts commit 5b7048b89745c3c5fb4b3080fb7bced61dba2a2b. The main purpose of this patch is cleanup. The throtl_adjusted_limit function was removed after commit bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW"), so the problem of not being able to scale after setting bps or iops to 1 will not occur. So revert this commit that bps/iops can be set to 1. Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiuhong Wang <xiuhong.wang@unisoc.com> Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20241016024508.3340330-1-xiuhong.wang@unisoc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: replace call_rcu by kfree_rcu for simple kmem_cache_free callback  (Julia Lawall)

    Since SLOB was removed, and since commit 6c6c47b063b5 ("mm, slab: call
    kvfree_rcu_barrier() from kmem_cache_destroy()"), it is not necessary to
    use call_rcu when the callback only performs kmem_cache_free. Use
    kfree_rcu() directly.

    The changes were made using Coccinelle.

    Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
    Link: https://lore.kernel.org/r/20241013201704.49576-10-Julia.Lawall@inria.fr
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
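The shape of the conversion on a hypothetical object (foo and foo_release are illustrative names; kfree() works on kmem_cache-allocated objects, which is why the hand-written callback can go away):

    #include <linux/slab.h>

    struct foo {
            struct rcu_head rcu;
            /* ... payload ... */
    };

    static void foo_release(struct foo *f)
    {
            /*
             * was: call_rcu(&f->rcu, foo_free_rcu), where foo_free_rcu()
             * only called kmem_cache_free(foo_cache, f)
             */
            kfree_rcu(f, rcu);
    }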

2024-10-22  block: sed-opal: add ioctl IOC_OPAL_SET_SID_PW  (Greg Joyce)

    After a SED drive is provisioned, there is no way to change the SID
    password via the ioctl() interface. A new ioctl IOC_OPAL_SET_SID_PW will
    allow the password to be changed. The valid current password is required.

    Signed-off-by: Greg Joyce <gjoyce@linux.ibm.com>
    Reviewed-by: Daniel Wagner <dwagner@suse.de>
    Link: https://lore.kernel.org/r/20240829175639.6478-2-gjoyce@linux.ibm.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: enable passthrough command statistics  (Keith Busch)

    Applications using the passthrough interfaces for IO want to continue
    seeing the disk stats. These requests had been fenced off from this
    block layer feature.

    While the block layer doesn't necessarily know what a passthrough
    command does, we do know the data size and direction, which is enough to
    account for the command's stats.

    Since tracking these has the potential to produce unexpected results,
    the passthrough stats are locked behind a new queue flag that needs to
    be enabled with the /sys/block/<dev>/queue/iostats_passthrough
    attribute.

    Signed-off-by: Keith Busch <kbusch@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20241007153236.2818562-1-kbusch@meta.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: return void from the queue_sysfs_entry load_module method  (Christoph Hellwig)

    Requesting a module either succeeds or does nothing; returning an error
    from this method does not make sense. Also move load_module after the
    store method in the struct declaration, to keep the important show and
    store methods together.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
    Link: https://lore.kernel.org/r/20241008050841.104602-1-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: add partition uuid into uevent as "PARTUUID"  (Konstantin Khlebnikov)

    Both of the most common formats have a UUID in addition to the partition
    name:

    GPT: standard uuid xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    DOS: 4 byte disk signature and 1 byte partition xxxxxxxx-xx

    Tools from util-linux use the same notation for them.

    Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
    Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
    [dianders: rebased to modern kernels]
    Signed-off-by: Douglas Anderson <dianders@google.com>
    Signed-off-by: Douglas Anderson <dianders@chromium.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20241004171340.v2.1.I938c91d10e454e841fdf5d64499a8ae8514dc004@changeid
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: move issue side time stamping to blk_account_io_start()  (Jens Axboe)

    It's known to be needed at that point, and it's cleaner to just assign
    it there rather than rely on it being reliably set before hitting the IO
    accounting. Hence, move it out of blk_mq_rq_time_init(), which now only
    does the allocation side timing.

    While at it, get rid of passing a '0' time to blk_mq_rq_time_init();
    just pass in blk_time_get_ns() for the two cases where 0 was being
    explicitly passed in. The rest pass in the previously cached allocation
    time.

    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: set issue time stamp based on queue state  (Jens Axboe)

    A previous commit moved setting RQF_IO_STAT into blk_account_io_done(),
    rather than at allocation time. Unfortunately we do check for that flag
    in blk_mq_rq_time_init(), and hence setting start_time_ns wasn't being
    done. This led to unwieldy inflight IO counts and times, as IO
    completion accounting would use a 0 value rather than the issue time in
    its subtraction math.

    Fix this by switching the blk_mq_rq_time_init() check to use the queue
    state rather than the request state.

    Fixes: b8f762400ae8 ("block: move iostat check into blk_acount_io_start()")
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Closes: https://lore.kernel.org/oe-lkp/202410062110.512391df-oliver.sang@intel.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: add support for partition table defined in OF  (Christian Marangi)

    Add support for a partition table defined in the Device Tree. Similar to
    how it's done with MTD, add support for defining a fixed partition table
    in the device tree. A common scenario for this is fixed-block (eMMC)
    embedded devices that have no MBR or GPT partition table, to save
    storage space. The bootloader accesses the block device with absolute
    addresses of data.

    This completes the functionality, as an equivalent of providing the
    partition table with bootargs, for cases where the bootargs can't be
    modified and tweaking the Device Tree is the only solution to get a
    usable partition table.

    The implementation follows the fixed-partitions parser used on MTD
    devices: a "partitions" node is expected to be declared with a
    "fixed-partitions" compatible in the OF node of the disk device
    (mmc-card for eMMC, for example), and each child node declares a label
    and a reg with offset and size. If a label is not declared, the node
    name is used as a fallback. It is also possible to declare the read-only
    property to flag the partition as read-only.

    Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20241002221306.4403-6-ansuelsmth@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: introduce add_disk_fwnode()  (Christian Marangi)

    Introduce add_disk_fwnode() as a replacement of device_add_disk() that
    permits passing and attaching a fwnode to the disk device. This variant
    can be useful for eMMC, which might have the partition table for the
    disk defined in DT. A parser can later make use of the attached fwnode
    to parse the related table and init the hardcoded partitions for the
    disk.

    device_add_disk() is converted to a simple wrapper of add_disk_fwnode()
    with the fwnode entry set as NULL.

    Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20241002221306.4403-4-ansuelsmth@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
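Per the description, the existing API becomes a thin wrapper; a sketch, with the add_disk_fwnode() parameter order assumed:

    int device_add_disk(struct device *parent, struct gendisk *disk,
                        const struct attribute_group **groups)
    {
            /* no firmware node attached: behavior is unchanged */
            return add_disk_fwnode(parent, disk, groups, NULL);
    }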

2024-10-22  block: add support for defining read-only partitions  (Christian Marangi)

    Add support for defining read-only partitions, and complete support for
    it in the cmdline partition parser, where the additional "ro" flag after
    a partition was scanned but never actually applied.

    Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20241002221306.4403-2-ansuelsmth@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: kill blk_do_io_stat() helper  (Jens Axboe)

    It's now just checking whether or not RQF_IO_STAT is set, so let's get
    rid of it and just open-code the specific flag that is being checked.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: remove 'req->part' check for stats accounting  (Jens Axboe)

    If RQF_IO_STAT is set, then accounting is enabled. There's no need to
    further gate this on req->part being set or not; RQF_IO_STAT should
    never be set if accounting is not being done for this request.

    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-22  block: move iostat check into blk_acount_io_start()  (Jens Axboe)

    Rather than have blk_do_io_stat() check for both RQF_IO_STAT and whether
    the request is a passthrough request every time, move both of those
    checks into blk_account_io_start(). Then blk_do_io_stat() can be reduced
    to just checking for RQF_IO_STAT.

    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
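After this move, the hot-path helper plausibly reduces to a single flag test (a sketch; the helper is then deleted outright by "block: kill blk_do_io_stat() helper" further up this log):

    static inline bool blk_do_io_stat(struct request *rq)
    {
            return rq->rq_flags & RQF_IO_STAT;
    }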

2024-10-19  fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()  (John Garry)

    Currently FMODE_CAN_ATOMIC_WRITE is set if the bdev supports atomic
    writes and the file is open for direct IO. This does not work if the
    file is not opened for direct IO, yet fcntl(O_DIRECT) is used on the fd
    later.

    Change to check for direct IO on a per-IO basis in
    generic_atomic_write_valid(). Since we want to report -EOPNOTSUPP for
    non-direct IO for an atomic write, change to return an error code.

    Relocate the block fops atomic write checks to the common write path, so
    as to catch non-direct IO.

    Fixes: c34fc6f26ab8 ("fs: Initial atomic write support")
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: John Garry <john.g.garry@oracle.com>
    Link: https://lore.kernel.org/r/20241019125113.369994-3-john.g.garry@oracle.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-19  block/fs: Pass an iocb to generic_atomic_write_valid()  (John Garry)

    Darrick and Hannes both thought it better that generic_atomic_write_valid()
    should be passed a struct iocb, and not just the member of that struct
    which is referenced; see [0] and [1]. I think that makes a more generic
    and clean API, so make that change.

    [0] https://lore.kernel.org/linux-block/680ce641-729b-4150-b875-531a98657682@suse.de/
    [1] https://lore.kernel.org/linux-xfs/20240620212401.GA3058325@frogsfrogsfrogs/

    Fixes: c34fc6f26ab8 ("fs: Initial atomic write support")
    Suggested-by: Darrick J. Wong <djwong@kernel.org>
    Suggested-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: John Garry <john.g.garry@oracle.com>
    Link: https://lore.kernel.org/r/20241019125113.369994-2-john.g.garry@oracle.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-18  Merge tag 'block-6.12-20241018' of git://git.kernel.dk/linux  (Linus Torvalds)

    Pull block fixes from Jens Axboe:

     - NVMe pull request via Keith:
         - Fix target passthrough identifier (Nilay)
         - Fix tcp locking (Hannes)
         - Replace list with sbitmap for tracking RDMA rsp tags (Guixen)
         - Remove unnecessary fallthrough statements (Tokunori)
         - Remove ready-without-media support (Greg)
         - Fix multipath partition scan deadlock (Keith)
         - Fix concurrent PCI reset and remove queue mapping (Maurizio)
         - Fabrics shutdown fixes (Nilay)

     - Fix for a kerneldoc warning (Keith)

     - Fix a race with blk-rq-qos and wakeups (Omar)

     - Cleanup of checking for always-set tag_set (SurajSonawane2415)

     - Fix for a crash with CPU hotplug notifiers (Ming)

     - Don't allow zero-copy ublk on unprivileged device (Ming)

     - Use array_index_nospec() for CDROM (Josh)

     - Remove dead code in drbd (David)

     - Tweaks to elevator loading (Breno)

    * tag 'block-6.12-20241018' of git://git.kernel.dk/linux:
      cdrom: Avoid barrier_nospec() in cdrom_ioctl_media_changed()
      nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
      nvme: make keep-alive synchronous operation
      nvme-loop: flush off pending I/O while shutting down loop controller
      nvme-pci: fix race condition between reset and nvme_dev_disable()
      ublk: don't allow user copy for unprivileged device
      blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race
      nvme-multipath: defer partition scanning
      blk-mq: setup queue ->tag_set before initializing hctx
      elevator: Remove argument from elevator_find_get
      elevator: do not request_module if elevator exists
      drbd: Remove unused conn_lowest_minor
      nvme: disable CC.CRIME (NVME_CC_CRIME)
      nvme: delete unnecessary fallthru comment
      nvmet-rdma: use sbitmap to replace rsp free list
      block: Fix elevator_get_default() checking for NULL q->tag_set
      nvme: tcp: avoid race between queue_lock lock and destroy
      nvmet-passthru: clear EUID/NGUID/UUID while using loop target
      block: fix blk_rq_map_integrity_sg kernel-doc

2024-10-16  blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race  (Omar Sandoval)

    We're seeing crashes from rq_qos_wake_function that look like this:

    BUG: unable to handle page fault for address: ffffafe180a40084
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0002) - not-present page
    PGD 100000067 P4D 100000067 PUD 10027c067 PMD 10115d067 PTE 0
    Oops: Oops: 0002 [#1] PREEMPT SMP PTI
    CPU: 17 UID: 0 PID: 0 Comm: swapper/17 Not tainted 6.12.0-rc3-00013-geca631b8fe80 #11
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
    RIP: 0010:_raw_spin_lock_irqsave+0x1d/0x40
    Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 54 9c 41 5c fa 65 ff 05 62 97 30 4c 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 0a 4c 89 e0 41 5c c3 cc cc cc cc 89 c6 e8 2c 0b 00
    RSP: 0018:ffffafe180580ca0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffffafe180a3f7a8 RCX: 0000000000000011
    RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffafe180a40084
    RBP: 0000000000000000 R08: 00000000001e7240 R09: 0000000000000011
    R10: 0000000000000028 R11: 0000000000000888 R12: 0000000000000002
    R13: ffffafe180a40084 R14: 0000000000000000 R15: 0000000000000003
    FS:  0000000000000000(0000) GS:ffff9aaf1f280000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffafe180a40084 CR3: 000000010e428002 CR4: 0000000000770ef0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:
     <IRQ>
     try_to_wake_up+0x5a/0x6a0
     rq_qos_wake_function+0x71/0x80
     __wake_up_common+0x75/0xa0
     __wake_up+0x36/0x60
     scale_up.part.0+0x50/0x110
     wb_timer_fn+0x227/0x450
     ...

    So rq_qos_wake_function() calls wake_up_process(data->task), which calls
    try_to_wake_up(), which faults in raw_spin_lock_irqsave(&p->pi_lock).

    p comes from data->task, and data comes from the waitqueue entry, which
    is stored on the waiter's stack in rq_qos_wait(). Analyzing the core
    dump with drgn, I found that the waiter had already woken up and moved
    on to a completely unrelated code path, clobbering what was previously
    data->task. Meanwhile, the waker was passing the clobbered garbage in
    data->task to wake_up_process(), leading to the crash.

    What's happening is that in between rq_qos_wake_function() deleting the
    waitqueue entry and calling wake_up_process(), rq_qos_wait() is finding
    that it already got a token and returning. The race looks like this:

    rq_qos_wait()                           rq_qos_wake_function()
    ==============================================================
    prepare_to_wait_exclusive()
                                            data->got_token = true;
                                            list_del_init(&curr->entry);
    if (data.got_token)
            break;
    finish_wait(&rqw->wait, &data.wq);
      ^- returns immediately because
         list_empty_careful(&wq_entry->entry)
         is true
    ... return, go do something else ...
                                            wake_up_process(data->task)
                                              (NO LONGER VALID!)-^

    Normally, finish_wait() is supposed to synchronize against the waker.
    But, as noted above, it is returning immediately because the waitqueue
    entry has already been removed from the waitqueue.

    The bug is that rq_qos_wake_function() is accessing the waitqueue entry
    AFTER deleting it. Note that autoremove_wake_function() wakes the waiter
    and THEN deletes the waitqueue entry, which is the proper order.

    Fix it by swapping the order. We also need to use
    list_del_init_careful() to match the list_empty_careful() in
    finish_wait().

    Fixes: 38cfb5a45ee0 ("blk-wbt: improve waking of tasks")
    Cc: stable@vger.kernel.org
    Signed-off-by: Omar Sandoval <osandov@fb.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/d3bee2463a67b1ee597211823bf7ad3721c26e41.1729014591.git.osandov@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
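A hedged sketch of the fixed waker ordering (field names follow the commit text; the inflight-token logic and exact upstream code are elided or assumed):

    static int rq_qos_wake_function(struct wait_queue_entry *curr,
                                    unsigned int mode, int wake_flags, void *key)
    {
            struct rq_qos_wait_data *data =
                    container_of(curr, struct rq_qos_wait_data, wq);

            /* ... inflight-token acquisition elided ... */

            data->got_token = true;
            smp_wmb();
            /* Wake FIRST, while data->task on the waiter's stack is valid. */
            wake_up_process(data->task);
            /*
             * THEN detach the entry, using the _careful variant to pair with
             * list_empty_careful() in finish_wait(). After this point the
             * waiter may return and reuse its stack.
             */
            list_del_init_careful(&curr->entry);
            return 1;
    }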

2024-10-14  blk-mq: setup queue ->tag_set before initializing hctx  (Ming Lei)

    Commit 7b815817aa58 ("blk-mq: add helper for checking if one CPU is
    mapped to specified hctx") needs to check the queue mapping via the tag
    set in hctx's cpuhp handler. However, q->tag_set may not be set up yet
    when the cpuhp handler is enabled, and then a kernel oops is triggered.

    Fix the issue by setting up the queue tag_set before initializing hctx.

    Cc: stable@vger.kernel.org
    Reported-and-tested-by: Rick Koch <mr.rickkoch@gmail.com>
    Closes: https://lore.kernel.org/linux-block/CANa58eeNDozLaBHKPLxSAhEy__FPfJT_F71W=sEQw49UCrC9PQ@mail.gmail.com
    Fixes: 7b815817aa58 ("blk-mq: add helper for checking if one CPU is mapped to specified hctx")
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: John Garry <john.g.garry@oracle.com>
    Link: https://lore.kernel.org/r/20241014005115.2699642-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
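The fix is essentially an ordering change during queue initialization; a hedged outline (the surrounding function and hctx-allocation call are assumptions based on the description):

    /* sketched from blk_mq_init_allocated_queue(): */
    q->tag_set = set;                  /* publish the tag_set first ...      */
    blk_mq_realloc_hw_ctxs(set, q);    /* ... then create the hctxs, whose
                                          cpuhp handlers read q->tag_set    */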

2024-10-11  elevator: Remove argument from elevator_find_get  (Breno Leitao)

    Commit e4eb37cc0f3ed ("block: Remove elevator required features") removed
    the usage of `struct request_queue` from elevator_find_get(), but didn't
    remove the argument.

    Remove the "struct request_queue *q" argument from elevator_find_get(),
    given it is useless.

    Fixes: e4eb37cc0f3e ("block: Remove elevator required features")
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Link: https://lore.kernel.org/r/20241011155615.3361143-1-leitao@debian.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-11  elevator: do not request_module if elevator exists  (Breno Leitao)

    Whenever an I/O elevator is changed, the system attempts to load a
    module for the new elevator. This occurs regardless of whether the
    elevator is already loaded or built directly into the kernel. This
    behavior introduces unnecessary overhead and potential issues: it makes
    the operation slower and more error-prone. For instance, it makes the
    problem fixed by [1] visible even for users that don't rely on loadable
    modules at all.

    Do not try to load the I/O scheduler if it is already visible.

    This change brings two main benefits: it improves the performance of
    elevator changes, and it reduces the likelihood of errors occurring
    during this process.

    [1] Commit e3accac1a976 ("block: Fix elv_iosched_local_module handling
        of "none" scheduler")

    Fixes: 734e1a860312 ("block: Prevent deadlocks when switching elevators")
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Link: https://lore.kernel.org/r/20241011170122.3880087-1-leitao@debian.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
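In outline, the lookup becomes "find first, load only on a miss"; elevator_find_or_load is a hypothetical wrapper name, while "%s-iosched" is the module alias format the elevator core uses:

    static struct elevator_type *elevator_find_or_load(const char *name)
    {
            struct elevator_type *e = elevator_find_get(name);

            if (!e) {
                    /* Only involve the module loader for unknown schedulers. */
                    request_module("%s-iosched", name);
                    e = elevator_find_get(name);
            }
            return e;
    }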

2024-10-07  block: Fix elevator_get_default() checking for NULL q->tag_set  (SurajSonawane2415)

    elevator_get_default() and elv_support_iosched() both check whether or
    not q->tag_set is non-NULL; however, it's not possible for it to be
    NULL. This messes up some static checkers, as the checking of tag_set
    isn't consistent.

    Remove the checks, which both simplifies the logic and avoids checker
    errors.

    Signed-off-by: SurajSonawane2415 <surajsonawane0215@gmail.com>
    Link: https://lore.kernel.org/r/20241007111416.13814-1-surajsonawane0215@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

2024-10-04  Merge tag 'block-6.12-20241004' of git://git.kernel.dk/linux  (Linus Torvalds)

    Pull block fixes from Jens Axboe:

     - Fix another use-after-free in aoe

     - Fixup wrong nested non-saving irq disable/restore in blk-iocost

     - Fixup a kerneldoc complaint introduced by a merge window patch

    * tag 'block-6.12-20241004' of git://git.kernel.dk/linux:
      aoe: fix the potential use-after-free problem in more places
      blk_iocost: remove some duplicate irq disable/enables
      block: fix blk_rq_map_integrity_sg kernel-doc

2024-10-02  move asm/unaligned.h to linux/unaligned.h  (Al Viro)

    asm/unaligned.h is always an include of asm-generic/unaligned.h; might
    as well move that thing to linux/unaligned.h and include that - there's
    nothing arch-specific in that header.

    auto-generated by the following:

    for i in `git grep -l -w asm/unaligned.h`; do
        sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
    done
    for i in `git grep -l -w asm-generic/unaligned.h`; do
        sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
    done
    git mv include/asm-generic/unaligned.h include/linux/unaligned.h
    git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
    sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
    sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h