summaryrefslogtreecommitdiff
path: root/block/blk-mq.c
AgeCommit message (Collapse)Author
2025-05-13Merge tag 'md-6.16-20250513' of ↵Jens Axboe
https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.16/block Pull MD changes from Yu Kuai: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters. * tag 'md-6.16-20250513' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md: clean up accounting for issued sync IO md: fix is_mddev_idle() md: add a new api sync_io_depth md: record dm-raid gendisk in mddev block: export API to get the number of bdev inflight IO block: clean up blk_mq_in_flight_rw() block: WARN if bdev inflight counter is negative block: reuse part_in_flight_rw for part_in_flight blk-mq: remove blk_mq_in_flight()
2025-05-12block: unfreeze queue if realloc tag set fails during nr_hw_queues updateNilay Shroff
In __blk_mq_update_nr_hw_queues(), the current sequence involves: 1. unregistering sysfs/debugfs attributes 2. freeze the queue 3. reallocating the tag set 4. updating the queue map 5. reallocating hardware contexts 6. updating the elevator (which unfreeze the queue again) 7. re-register sysfs/debugfs attributes If tag set reallocation fails at step 3, the function skips steps 4–6 and proceeds directly to step 7, re-registering the sysfs/debugfs attributes without unfreezing the queue first. This is incorrect and can lead to a system hang or lockdep splat, as the queue remains frozen and is never properly unfrozen. This patch addresses the issue by explicitly unfreezing the queue before re-registering the sysfs/debugfs attributes in the event of a tag set reallocation failure. Fixes: 9dc7a882ce96 ("block: move hctx debugfs/sysfs registering out of freezing queue") Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250512092952.135887-1-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-10block: clean up blk_mq_in_flight_rw()Yu Kuai
Also add comment for part_inflight_show() for the difference between bio-based and rq-based device. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-4-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-10blk-mq: remove blk_mq_in_flight()Yu Kuai
After commit 7be835694dae ("block: fix that util can be greater than 100%"), it's not used and can be removed. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-06block: move hctx cpuhp add/del out of queue freezingMing Lei
Move hctx cpuhp add/del out of queue freezing for not connecting freeze lock with cpuhp locks, then lockdep warning can be avoided. This way is safe because both needn't queue to be frozen and scheduler switch isn't allowed, with same reason for moving hctx debugfs/sysfs register out of queue freeze. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-25-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: don't acquire ->elevator_lock in blk_mq_map_swqueue and ↵Ming Lei
blk_mq_realloc_hw_ctxs Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are called before the request queue is added to tagset list, so the two won't run concurrently with blk_mq_update_nr_hw_queues(). When the two functions are only called from queue initialization or blk_mq_update_nr_hw_queues(), elevator switch can't happen. So remove ->elevator_lock uses from the two functions. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-24-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: move hctx debugfs/sysfs registering out of freezing queueMing Lei
Move hctx debugfs/sysfs register out of freezing queue in __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency can be killed: #2 (&q->q_usage_counter(io)#16){++++}-{0:0}: #1 (fs_reclaim){+.+.}-{0:0}: #0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs And registering/un-registering hctx debugfs/sysfs does not require queue to be frozen: - hctx sysfs attributes show() are drained when removing kobject, and there isn't store() implementation for hctx sysfs attributes - debugfs entry read() is drained too when removing debugfs directory, and there isn't write() implementation for hctx debugfs too - so it is safe to register/unregister hctx sysfs/debugfs without freezing queue because the cod paths changes nothing, and we just need to keep hctx live Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-23-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: move elv_register[unregister]_queue out of elevator_lockMing Lei
Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing, so we can kill many lockdep warnings. elv_register[unregister]_queue() is serialized, and just dealing with sysfs/ debugfs things, no need to be done with queue frozen: - when it is called from adding disk, elevator switch isn't possible because ->queue_kobj isn't added yet - when it is called from deleting disk, disable_elv_switch() is responsible for preventing new elevator switch and draining old elevator switch. - when it is called from blk_mq_update_nr_hw_queues(), adding/removing disk and elevator switch can't be allowed or in-progress With this change, elevator's ->exit() is called before calling elv_unregister_queue, then user may call into ->show()/store() of elevator's sysfs attributes, and we have covered this issue by adding `ELEVATOR_FLAG_DYNG`. For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by debugfs code, meantime hctx->sched_tags is updated with ->elevator_lock, so there isn't such issue. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-22-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: simplify elevator reattachment for updating nr_hw_queuesMing Lei
In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and elevator data depends on it, and elevator has to be reattached, so call elevator_switch() to force attachment. Add elv_update_nr_hw_queues() simply for blk_mq_update_nr_hw_queues() to reattach elevator, since elevator switch isn't likely when running blk_mq_update_nr_hw_queues(). This way removes the current switch none and switch back code. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-14-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: look up the elevator type in elevator_switchChristoph Hellwig
That makes the function nicely self-contained and can be used to avoid code duplication. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-11-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: prevent adding/deleting disk during updating nr_hw_queuesMing Lei
Both adding/deleting disk code are reader of `nr_hw_queues`, so we can't allow them in-progress when updating nr_hw_queues, kernel panic and kasan has been reported in [1]. Prevent adding/deleting disk during updating nr_hw_queues by adding rw_semaphore to tagset, write lock is grabbed in blk_mq_update_nr_hw_queues(), and read lock is acquired when adding/deleting disk. Also mark GFP_NOIO allocation scope for adding/deleting disk because blk_mq_update_nr_hw_queues() is part of some driver's error handler. This way avoids lot of trouble. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Suggested-by: Nilay Shroff <nilay@linux.ibm.com> Reported-by: Nilay Shroff <nilay@linux.ibm.com> Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-9-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()Ming Lei
Move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue(), and publish this request queue to tagset after everything is setup. This way is safe because BLK_MQ_F_TAG_QUEUE_SHARED isn't used by blk_mq_map_swqueue(), and this flag is mainly checked in fast IO code path. Prepare for removing ->elevator_lock from blk_mq_map_swqueue() which is supposed to be called when elevator switch can't be done. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reported-by: Nilay Shroff <nilay@linux.ibm.com> Closes: https://lore.kernel.org/linux-block/567cb7ab-23d6-4cee-a915-c8cdac903ddd@linux.ibm.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05block: remove bounce buffering supportChristoph Hellwig
The block layer bounce buffering support is unused now, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02block: avoid hctx spinlock for plug with multiple queuesCaleb Sander Mateos
blk_mq_flush_plug_list() has a fast path if all requests in the plug are destined for the same request_queue. It calls ->queue_rqs() with the whole batch of requests, falling back on ->queue_rq() for any requests not handled by ->queue_rqs(). However, if the requests are destined for multiple queues, blk_mq_flush_plug_list() has a slow path that calls blk_mq_dispatch_list() repeatedly to filter the requests by ctx/hctx. Each queue's requests are inserted into the hctx's dispatch list under a spinlock, then __blk_mq_sched_dispatch_requests() takes them out of the dispatch list (taking the spinlock again), and finally blk_mq_dispatch_rq_list() calls ->queue_rq() on each request. Acquiring the hctx spinlock twice and calling ->queue_rq() instead of ->queue_rqs() makes the slow path significantly more expensive. Thus, batching more requests into a single plug (e.g. io_uring_enter syscall) can counterintuitively hurt performance by causing the plug to span multiple queues. We have observed 2-3% of CPU time spent acquiring the hctx spinlock alone on workloads issuing requests to multiple NVMe devices in the same io_uring SQE batches. Add a medium path in blk_mq_flush_plug_list() for plugs that don't have elevators or come from a schedule, but do span multiple queues. Filter the requests by queue and call ->queue_rqs()/->queue_rq() on the list of requests destined to each request_queue. With this change, we no longer see any CPU time spent in _raw_spin_lock from blk_mq_flush_plug_list and throughput increases accordingly. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-4-csander@purestorage.com [axboe: fix whitespace damage] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02block: factor out blk_mq_dispatch_queue_requests() helperCaleb Sander Mateos
Factor out the logic from blk_mq_flush_plug_list() that calls ->queue_rqs() with a fallback to ->queue_rq() into a helper function blk_mq_dispatch_queue_requests(). This is in preparation for using this code with other lists of requests. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02block: take rq_list instead of plug in dispatch functionsCaleb Sander Mateos
blk_mq_plug_issue_direct(), __blk_mq_flush_plug_list(), and blk_mq_dispatch_plug_list() take a struct blk_plug * but only use its mq_list. Pass the struct rq_list * instead in preparation for calling them with other lists of requests. Drop "plug" from the function names as they are no longer plug-specific. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250426011728.4189119-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21block: ensure that struct blk_mq_alloc_data is fully initializedJens Axboe
On x86, rep stos will be emitted to clear the the blk_mq_alloc_data struct, as not all members are being explicitly initialied. Depending on the type of CPU, this is a noticeable slowdown compared to just ensuring that the struct is fully initialized when setup. For the 4 spots that setup a struct blk_mq_alloc_data on the stack, ensure all members are being initialized. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-21block: Simplify blk_mq_dispatch_rq_list() and its callersBart Van Assche
The 'nr_budgets' argument of blk_mq_dispatch_rq_list() is either the number of elements in the 'list' argument or zero. Instead of passing the number of list elements to blk_mq_dispatch_rq_list(), pass a boolean argument that indicates whether or not blk_mq_dispatch_rq_list() should request the block driver for a budget for each request in 'list'. Remove the code for counting list elements from blk_mq_dispatch_rq_list() callers where possible. Remove the code that decrements nr_budgets from blk_mq_dispatch_rq_list() because it is superfluous. Each request that is processed by blk_mq_dispatch_rq_list() is in one of these two states if 'get_budget' is false: * Either the request is on 'list' and the budget for the request has to be released from the error path. * Or the request is not on 'list' and q->mq_ops->queue_rq() has already released the budget (ret != BLK_STS_OK) or q->mq_ops->queue_rq() will release the budget asynchronously (ret == BLK_STS_OK). Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: John Garry <john.g.garry@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20250415205134.3650042-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-03block: don't grab elevator lock during queue initializationMing Lei
->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c. queue freeze lock depends on fs_reclaim. So don't grab elevator lock during queue initialization which needs to call kmalloc(GFP_KERNEL), and we can cut the dependency between ->elevator_lock and fs_reclaim, then the lockdep warning can be killed. This way is safe because elevator setting isn't ready to run during queue initialization. There isn't such issue in __blk_mq_update_nr_hw_queues() because memalloc_noio_save() is called before acquiring elevator lock. Fixes the following lockdep warning: https://lore.kernel.org/linux-block/67e6b425.050a0220.2f068f.007b.GAE@google.com/ Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com Cc: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250403105402.1334206-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-01block: remove unused nseg parameterNitesh Shetty
We are no longer using nr_segs, after blk_mq_attempt_bio_merge was moved out of blk_mq_get_new_request. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Link: https://lore.kernel.org/r/20250401044348.15588-1-nj.shetty@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-12block: remove unused parameterGuixin Liu
The blk_mq_map_queue()'s request_queue param is not used anymore, remove it, same with blk_get_flush_queue(). Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://lore.kernel.org/r/20250312084722.129680-1-kanie@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10block: make sure ->nr_integrity_segments is cloned in blk_rq_prep_cloneMing Lei
Make sure ->nr_integrity_segments is cloned in blk_rq_prep_clone(), otherwise requests cloned by device-mapper multipath will not have the proper nr_integrity_segments values set, then BUG() is hit from sg_alloc_table_chained(). Fixes: b0fd271d5fba ("block: add request clone interface (v2)") Cc: stable@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250310115453.2271109-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10block: protect hctx attributes/params using q->elevator_lockNilay Shroff
Currently, hctx attributes (nr_tags, nr_reserved_tags, and cpu_list) are protected using `q->sysfs_lock`. However, these attributes can be updated in multiple scenarios: - During the driver's probe method. - When updating nr_hw_queues. - When writing to the sysfs attribute nr_requests, which can modify nr_tags. The nr_requests attribute is already protected using q->elevator_lock, but none of the update paths actually use q->sysfs_lock to protect hctx attributes. So to ensure proper synchronization, replace q->sysfs_lock with q->elevator_lock when reading hctx attributes through sysfs. Additionally, blk_mq_update_nr_hw_queues allocates and updates hctx. The allocation of hctx is protected using q->elevator_lock, however, updating hctx params happens without any protection, so safeguard hctx param update path by also using q->elevator_lock. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250306093956.2818808-1-nilay@linux.ibm.com [axboe: wrap comment at 80 chars] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10block: introduce a dedicated lock for protecting queue elevator updatesNilay Shroff
A queue's elevator can be updated either when modifying nr_hw_queues or through the sysfs scheduler attribute. Currently, elevator switching/ updating is protected using q->sysfs_lock, but this has led to lockdep splats[1] due to inconsistent lock ordering between q->sysfs_lock and the freeze-lock in multiple block layer call sites. As the scope of q->sysfs_lock is not well-defined, its (mis)use has resulted in numerous lockdep warnings. To address this, introduce a new q->elevator_lock, dedicated specifically for protecting elevator switches/updates. And we'd now use this new q->elevator_lock instead of q->sysfs_lock for protecting elevator switches/updates. While at it, make elv_iosched_load_module() a static function, as it is only called from elv_iosched_store(). Also, remove redundant parameters from elv_iosched_load_module() function signature. [1] https://lore.kernel.org/all/67637e70.050a0220.3157ee.000c.GAE@google.com/ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250304102551.2533767-5-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-31block: force noio scope in blk_mq_freeze_queueChristoph Hellwig
When block drivers or the core block code perform allocations with a frozen queue, this could try to recurse into the block device to reclaim memory and deadlock. Thus all allocations done by a process that froze a queue need to be done without __GFP_IO and __GFP_FS. Instead of tying to track all of them down, force a noio scope as part of freezing the queue. Note that nvme is a bit of a mess here due to the non-owner freezes, and they will be addressed separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250131120352.1315351-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-14blk-mq: Move more error handling into blk_mq_submit_bio()Bart Van Assche
The error handling code in blk_mq_get_new_requests() cannot be understood without knowing that this function is only called by blk_mq_submit_bio(). Hence move the code for handling blk_mq_get_new_requests() failures into blk_mq_submit_bio(). Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20241218212246.1073149-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-14block: Reorder the request allocation code in blk_mq_submit_bio()Bart Van Assche
Help the CPU branch predictor in case of a cache hit by handling the cache hit scenario first. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20241218212246.1073149-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10block: don't update BLK_FEAT_POLL in __blk_mq_update_nr_hw_queuesChristoph Hellwig
When __blk_mq_update_nr_hw_queues changes the number of tag sets, it might have to disable poll queues. Currently it does so by adjusting the BLK_FEAT_POLL, which is a bit against the intent of features that describe hardware / driver capabilities, but more importantly causes nasty lock order problems with the broadly held freeze when updating the number of hardware queues and the limits lock. Fix this by leaving BLK_FEAT_POLL alone, and instead check for the number of poll queues in the bio submission and poll handlers. While this adds extra work to the fast path, the variables are in cache lines used by these operations anyway, so it should be cheap enough. Fixes: 8023e144f9d6 ("block: move the poll flag to queue_limits") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250110054726.1499538-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-10block: check BLK_FEAT_POLL under q_usage_countChristoph Hellwig
Otherwise feature reconfiguration can race with I/O submission. Also drop the bio_clear_polled in the error path, as the flag does not matter for instant error completions, it is a left over from when we allowed polled I/O to proceed unpolled in this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250110054726.1499538-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-06block: simplify tag allocation policy selectionChristoph Hellwig
Use a plain BLK_MQ_F_* flag to select the round robin tag selection instead of overlaying an enum with just two possible values into the flags space. Doing so allows adding a BLK_MQ_F_MAX sentinel for simplified overflow checking in the messy debugfs helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20250106083531.799976-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-04block: remove blk_rq_bio_prepChristoph Hellwig
There is not real point in a helper just to assign three values to four fields, especially when the surrounding code is working on the neighbor fields directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20250103073417.459715-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23block: track queue dying state automatically for modeling queue freeze lockdepMing Lei
Now we only verify the outmost freeze & unfreeze in current context in case that !q->mq_freeze_depth, so it is reliable to save queue lying state when we want to lock the freeze queue since the state is one per-task variable now. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241127135133.3952153-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23block: track disk DEAD state automatically for modeling queue freeze lockdepMing Lei
Now we only verify the outmost freeze & unfreeze in current context in case that !q->mq_freeze_depth, so it is reliable to save disk DEAD state when we want to lock the freeze queue since the state is one per-task variable now. Doing this way can kill lots of false positive when freeze queue is called before adding disk[1]. [1] https://lore.kernel.org/linux-block/6741f6b2.050a0220.1cc393.0017.GAE@google.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241127135133.3952153-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-23block: remove unnecessary check in blk_unfreeze_check_owner()Ming Lei
The following check of 'q->mq_freeze_owner != current' covers the previous one, so remove the unnecessary check. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241127135133.3952153-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-18block: avoid to reuse `hctx` not removed from cpuhp callback listMing Lei
If the 'hctx' isn't removed from cpuhp callback list, we can't reuse it, otherwise use-after-free may be triggered. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202412172217.b906db7c-lkp@intel.com Tested-by: kernel test robot <oliver.sang@intel.com> Fixes: 22465bbac53c ("blk-mq: move cpuhp callback registering out of q->sysfs_lock") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241218101617.3275704-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-18block: Revert "block: Fix potential deadlock while freezing queue and ↵Ming Lei
acquiring sysfs_lock" This reverts commit be26ba96421ab0a8fa2055ccf7db7832a13c44d2. Commit be26ba96421a ("block: Fix potential deadlock while freezing queue and acquiring sysfs_loc") actually reverts commit 22465bbac53c ("blk-mq: move cpuhp callback registering out of q->sysfs_lock"), and causes the original resctrl lockdep warning. So revert it and we need to fix the issue in another way. Cc: Nilay Shroff <nilay@linux.ibm.com> Fixes: be26ba96421a ("block: Fix potential deadlock while freezing queue and acquiring sysfs_loc") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241218101617.3275704-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-13block: Fix potential deadlock while freezing queue and acquiring sysfs_lockNilay Shroff
For storing a value to a queue attribute, the queue_attr_store function first freezes the queue (->q_usage_counter(io)) and then acquire ->sysfs_lock. This seems not correct as the usual ordering should be to acquire ->sysfs_lock before freezing the queue. This incorrect ordering causes the following lockdep splat which we are able to reproduce always simply by accessing /sys/kernel/debug file using ls command: [ 57.597146] WARNING: possible circular locking dependency detected [ 57.597154] 6.12.0-10553-gb86545e02e8c #20 Tainted: G W [ 57.597162] ------------------------------------------------------ [ 57.597168] ls/4605 is trying to acquire lock: [ 57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0 [ 57.597200] but task is already holding lock: [ 57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 [ 57.597226] which lock already depends on the new lock. [ 57.597233] the existing dependency chain (in reverse order) is: [ 57.597241] -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}: [ 57.597255] down_write+0x6c/0x18c [ 57.597264] start_creating+0xb4/0x24c [ 57.597274] debugfs_create_dir+0x2c/0x1e8 [ 57.597283] blk_register_queue+0xec/0x294 [ 57.597292] add_disk_fwnode+0x2e4/0x548 [ 57.597302] brd_alloc+0x2c8/0x338 [ 57.597309] brd_init+0x100/0x178 [ 57.597317] do_one_initcall+0x88/0x3e4 [ 57.597326] kernel_init_freeable+0x3cc/0x6e0 [ 57.597334] kernel_init+0x34/0x1cc [ 57.597342] ret_from_kernel_user_thread+0x14/0x1c [ 57.597350] -> #4 (&q->debugfs_mutex){+.+.}-{4:4}: [ 57.597362] __mutex_lock+0xfc/0x12a0 [ 57.597370] blk_register_queue+0xd4/0x294 [ 57.597379] add_disk_fwnode+0x2e4/0x548 [ 57.597388] brd_alloc+0x2c8/0x338 [ 57.597395] brd_init+0x100/0x178 [ 57.597402] do_one_initcall+0x88/0x3e4 [ 57.597410] kernel_init_freeable+0x3cc/0x6e0 [ 57.597418] kernel_init+0x34/0x1cc [ 57.597426] ret_from_kernel_user_thread+0x14/0x1c [ 57.597434] -> #3 (&q->sysfs_lock){+.+.}-{4:4}: [ 57.597446] __mutex_lock+0xfc/0x12a0 [ 57.597454] queue_attr_store+0x9c/0x110 [ 57.597462] sysfs_kf_write+0x70/0xb0 [ 57.597471] kernfs_fop_write_iter+0x1b0/0x2ac [ 57.597480] vfs_write+0x3dc/0x6e8 [ 57.597488] ksys_write+0x84/0x140 [ 57.597495] system_call_exception+0x130/0x360 [ 57.597504] system_call_common+0x160/0x2c4 [ 57.597516] -> #2 (&q->q_usage_counter(io)#21){++++}-{0:0}: [ 57.597530] __submit_bio+0x5ec/0x828 [ 57.597538] submit_bio_noacct_nocheck+0x1e4/0x4f0 [ 57.597547] iomap_readahead+0x2a0/0x448 [ 57.597556] xfs_vm_readahead+0x28/0x3c [ 57.597564] read_pages+0x88/0x41c [ 57.597571] page_cache_ra_unbounded+0x1ac/0x2d8 [ 57.597580] filemap_get_pages+0x188/0x984 [ 57.597588] filemap_read+0x13c/0x4bc [ 57.597596] xfs_file_buffered_read+0x88/0x17c [ 57.597605] xfs_file_read_iter+0xac/0x158 [ 57.597614] vfs_read+0x2d4/0x3b4 [ 57.597622] ksys_read+0x84/0x144 [ 57.597629] system_call_exception+0x130/0x360 [ 57.597637] system_call_common+0x160/0x2c4 [ 57.597647] -> #1 (mapping.invalidate_lock#2){++++}-{4:4}: [ 57.597661] down_read+0x6c/0x220 [ 57.597669] filemap_fault+0x870/0x100c [ 57.597677] xfs_filemap_fault+0xc4/0x18c [ 57.597684] __do_fault+0x64/0x164 [ 57.597693] __handle_mm_fault+0x1274/0x1dac [ 57.597702] handle_mm_fault+0x248/0x484 [ 57.597711] ___do_page_fault+0x428/0xc0c [ 57.597719] hash__do_page_fault+0x30/0x68 [ 57.597727] do_hash_fault+0x90/0x35c [ 57.597736] data_access_common_virt+0x210/0x220 [ 57.597745] _copy_from_user+0xf8/0x19c [ 57.597754] sel_write_load+0x178/0xd54 [ 57.597762] vfs_write+0x108/0x6e8 [ 57.597769] ksys_write+0x84/0x140 [ 57.597777] system_call_exception+0x130/0x360 [ 57.597785] system_call_common+0x160/0x2c4 [ 57.597794] -> #0 (&mm->mmap_lock){++++}-{4:4}: [ 57.597806] __lock_acquire+0x17cc/0x2330 [ 57.597814] lock_acquire+0x138/0x400 [ 57.597822] __might_fault+0x7c/0xc0 [ 57.597830] filldir64+0xe8/0x390 [ 57.597839] dcache_readdir+0x80/0x2d4 [ 57.597846] iterate_dir+0xd8/0x1d4 [ 57.597855] sys_getdents64+0x88/0x2d4 [ 57.597864] system_call_exception+0x130/0x360 [ 57.597872] system_call_common+0x160/0x2c4 [ 57.597881] other info that might help us debug this: [ 57.597888] Chain exists of: &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3 [ 57.597905] Possible unsafe locking scenario: [ 57.597911] CPU0 CPU1 [ 57.597917] ---- ---- [ 57.597922] rlock(&sb->s_type->i_mutex_key#3); [ 57.597932] lock(&q->debugfs_mutex); [ 57.597940] lock(&sb->s_type->i_mutex_key#3); [ 57.597950] rlock(&mm->mmap_lock); [ 57.597958] *** DEADLOCK *** [ 57.597965] 2 locks held by ls/4605: [ 57.597971] #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154 [ 57.597989] #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 Prevent the above lockdep warning by acquiring ->sysfs_lock before freezing the queue while storing a queue attribute in queue_attr_store function. Later, we also found[1] another function __blk_mq_update_nr_ hw_queues where we first freeze queue and then acquire the ->sysfs_lock. So we've also updated lock ordering in __blk_mq_update_nr_hw_queues function and ensured that in all code paths we follow the correct lock ordering i.e. acquire ->sysfs_lock before freezing the queue. [1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/ Reported-by: kjain@linux.ibm.com Fixes: af2814149883 ("block: freeze the queue in queue_attr_store") Tested-by: kjain@linux.ibm.com Cc: hch@lst.de Cc: axboe@kernel.dk Cc: ritesh.list@gmail.com Cc: ming.lei@redhat.com Cc: gjoyce@linux.ibm.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241210144222.1066229-1-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-13blk-mq: Clean up blk_mq_requeue_work()Bart Van Assche
Move a statement that occurs in both branches of an if-statement in front of the if-statement. Fix a typo in a source code comment. No functionality has been changed. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241212212941.1268662-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-06blk-mq: move cpuhp callback registering out of q->sysfs_lockMing Lei
Registering and unregistering cpuhp callback requires global cpu hotplug lock, which is used everywhere. Meantime q->sysfs_lock is used in block layer almost everywhere. It is easy to trigger lockdep warning[1] by connecting the two locks. Fix the warning by moving blk-mq's cpuhp callback registering out of q->sysfs_lock. Add one dedicated global lock for covering registering & unregistering hctx's cpuhp, and it is safe to do so because hctx is guaranteed to be live if our request_queue is live. [1] https://lore.kernel.org/lkml/Z04pz3AlvI4o0Mr8@agluck-desk3/ Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Newman <peternewman@google.com> Cc: Babu Moger <babu.moger@amd.com> Reported-by: Luck Tony <tony.luck@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20241206111611.978870-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-12-06blk-mq: register cpuhp callback after hctx is added to xarray tableMing Lei
We need to retrieve 'hctx' from xarray table in the cpuhp callback, so the callback should be registered after this 'hctx' is added to xarray table. Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Newman <peternewman@google.com> Cc: Babu Moger <babu.moger@amd.com> Cc: Luck Tony <tony.luck@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20241206111611.978870-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-25block: Remove extra part pointer NULLify in blk_rq_init()John Garry
The rq->part pointer is already NULLified in the memset() call, so - like for other pointers in rq - don't re-NULLify. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241125100258.4172774-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-19block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactorSuraj Sonawane
Fix an issue detected by the `smatch` tool: block/blk-mq.c:3314 blk_rq_prep_clone() error: uninitialized symbol 'bio'. This patch refactors `blk_rq_prep_clone()` to improve code readability and ensure safety by addressing potential misuse of the `bio` variable: - Move the bio_put(bio); call to the bio_ctr error handling block, which is the only place where it can be triggered. - Move the bio variable into the __rq_for_each_bio loop scope. This change removes the need to set bio to NULL at the loop's end. discussion on why bio remains uninitialized: https://lore.kernel.org/lkml/20241004141037.43277-1-surajsonawane0215@gmail.com Summary of above discussion: - I pointed out that `bio` can remain uninitialized if the allocation with `bio_alloc_clone` fails. - Keith Busch explained that `bio` is initialized to `NULL` when `bio_alloc_clone()` fails, preventing uninitialized usage. - John Garry questioned whether `rq_src->bio` being `NULL` could leave `bio` uninitialized. Keith clarified that in such cases, `bio` is not referenced, so it does not need initialization. - Christoph Hellwig recommended code improvements: - move the bio_put to the bio_ctr error handling, which is the only case where it can happen - move the bio variable into the __rq_for_each_bio scope, which also removed the need to zero it at the end of the loop These changes enhance code clarity, address static analysis tool warnings, and make the function more maintainable. thread of previous version patch discussion: https://lore.kernel.org/lkml/20241004100842.9052-1-surajsonawane0215@gmail.com Signed-off-by: Suraj Sonawane <surajsonawane0215@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241119164412.37609-1-surajsonawane0215@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13block: don't reorder requests in blk_add_rq_to_plugChristoph Hellwig
Add requests to the tail of the list instead of the front so that they are queued up in submission order. Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs and nvme_queue_rqs now that the list is ordered as expected. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-13block: add a rq_list typeChristoph Hellwig
Replace the semi-open coded request list helpers with a proper rq_list type that mirrors the bio_list and has head and tail pointers. Besides better type safety this actually allows to insert at the tail of the list, which will be useful soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-12block: remove the ioprio field from struct requestChristoph Hellwig
The request ioprio is only initialized from the first attached bio, so requests without a bio already never set it. Directly use the bio field instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241112170050.1612998-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-12block: remove the write_hint field from struct requestChristoph Hellwig
The write_hint is only used for read/write requests, which must have a bio attached to them. Just use the bio field instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241112170050.1612998-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07block: always verify unfreeze lock on the owner taskMing Lei
commit f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") tries to apply lockdep for verifying freeze & unfreeze. However, the verification is only done the outmost freeze and unfreeze. This way is actually not correct because q->mq_freeze_depth still may drop to zero on other task instead of the freeze owner task. Fix this issue by always verifying the last unfreeze lock on the owner task context, and make sure both the outmost freeze & unfreeze are verified in the current task. Fixes: f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241031133723.303835-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-07block: remove blk_freeze_queue()Ming Lei
No one use blk_freeze_queue(), so remove it and the obsolete comment. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241031133723.303835-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-26block: model freeze & enter queue as lock for supporting lockdepMing Lei
Recently we got several deadlock report[1][2][3] caused by blk_mq_freeze_queue and blk_enter_queue(). Turns out the two are just like acquiring read/write lock, so model them as read/write lock for supporting lockdep: 1) model q->q_usage_counter as two locks(io and queue lock) - queue lock covers sync with blk_enter_queue() - io lock covers sync with bio_enter_queue() 2) make the lockdep class/key as per-queue: - different subsystem has very different lock use pattern, shared lock class causes false positive easily - freeze_queue degrades to no lock in case that disk state becomes DEAD because bio_enter_queue() won't be blocked any more - freeze_queue degrades to no lock in case that request queue becomes dying because blk_enter_queue() won't be blocked any more 3) model blk_mq_freeze_queue() as acquire_exclusive & try_lock - it is exclusive lock, so dependency with blk_enter_queue() is covered - it is trylock because blk_mq_freeze_queue() are allowed to run concurrently 4) model blk_enter_queue() & bio_enter_queue() as acquire_read() - nested blk_enter_queue() are allowed - dependency with blk_mq_freeze_queue() is covered - blk_queue_exit() is often called from other contexts(such as irq), and it can't be annotated as lock_release(), so simply do it in blk_enter_queue(), this way still covered cases as many as possible With lockdep support, such kind of reports may be reported asap and needn't wait until the real deadlock is triggered. For example, lockdep report can be triggered in the report[3] with this patch applied. [1] occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler' https://bugzilla.kernel.org/show_bug.cgi?id=219166 [2] del_gendisk() vs blk_queue_enter() race condition https://lore.kernel.org/linux-block/20241003085610.GK11458@google.com/ [3] queue_freeze & queue_enter deadlock in scsi https://lore.kernel.org/linux-block/ZxG38G9BuFdBpBHZ@fedora/T/#u Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241025003722.3630252-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-26blk-mq: add non_owner variant of start_freeze/unfreeze queue APIsMing Lei
Add non_owner variant of start_freeze/unfreeze queue APIs, so that the caller knows that what they are doing, and we can skip lockdep support for non_owner variant in per-call level. Prepare for supporting lockdep for freezing/unfreezing queue. Reviewed-by: Christoph Hellwig <hch@lst.de> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241025003722.3630252-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>