|
Merge tag 'md-6.16-20250705' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into block-6.16
Pull MD fixes from Yu:
" - fix uaf due to stack memory used for bio mempool, from Jinchao
- fix raid10/raid1 nowait IO error path, from Nigel and Qixing
- fix kernel crash from reading bitmap sysfs entry, by Håkon"
* tag 'md-6.16-20250705' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux:
md/md-bitmap: fix GPF in bitmap_get_stats()
md/raid1,raid10: strip REQ_NOWAIT from member bios
raid10: cleanup memleak at raid10_make_request
md/raid1: Fix stack memory use after return in raid1_reshape
|
|
The commit message of commit 6ec1f0239485 ("md/md-bitmap: fix stats
collection for external bitmaps") states:
Remove the external bitmap check as the statistics should be
available regardless of bitmap storage location.
Return -EINVAL only for invalid bitmap with no storage (neither in
superblock nor in external file).
But the code does not adhere to the above, as it only checks for a
valid superblock for "internal" bitmaps. Hence, we observe:
Oops: GPF, probably for non-canonical address 0x1cd66f1f40000028
RIP: 0010:bitmap_get_stats+0x45/0xd0
Call Trace:
seq_read_iter+0x2b9/0x46a
seq_read+0x12f/0x180
proc_reg_read+0x57/0xb0
vfs_read+0xf6/0x380
ksys_read+0x6d/0xf0
do_syscall_64+0x8c/0x1b0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
We fix this by checking for the existence of a superblock for both the
internal and external cases.
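A minimal sketch of the shape of the fix (bitmap->storage.sb_page and
struct md_bitmap_stats follow the md-bitmap code; the surrounding
details are illustrative, not the literal patch):

static int bitmap_get_stats(void *data, struct md_bitmap_stats *stats)
{
	struct bitmap *bitmap = data;
	bitmap_super_t *sb;

	if (!bitmap)
		return -ENOENT;
	/*
	 * Reject any bitmap without a superblock page, internal or
	 * external, before dereferencing it.
	 */
	if (!bitmap->storage.sb_page)
		return -EINVAL;

	sb = kmap_local_page(bitmap->storage.sb_page);
	stats->sync_size = le64_to_cpu(sb->sync_size);
	kunmap_local(sb);
	return 0;
}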
Fixes: 6ec1f0239485 ("md/md-bitmap: fix stats collection for external bitmaps")
Cc: stable@vger.kernel.org
Reported-by: Gerald Gibson <gerald.gibson@oracle.com>
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Link: https://lore.kernel.org/linux-raid/20250702091035.2061312-1-haakon.bugge@oracle.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
RAID layers don't implement proper non-blocking semantics for
REQ_NOWAIT, making the flag potentially misleading when propagated
to member disks.
This patch clears REQ_NOWAIT from the cloned bios in raid1/raid10,
while retaining the original bio's REQ_NOWAIT flag for upper-layer
error handling. Non-blocking I/O handling within RAID may be
implemented in future work.
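A minimal sketch of the idea (bio_alloc_clone() is the real helper; the
surrounding names follow raid1/raid10 but are illustrative):

/*
 * Clone for a member disk: drop REQ_NOWAIT so the member never applies
 * nowait semantics the RAID layer cannot honor; the original bio keeps
 * the flag for upper-layer error handling.
 */
struct bio *mbio = bio_alloc_clone(rdev->bdev, bio, GFP_NOIO,
				   &mddev->bio_set);

mbio->bi_opf &= ~REQ_NOWAIT;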
Fixes: 9f346f7d4ea7 ("md/raid1,raid10: don't handle IO error for REQ_RAHEAD and REQ_NOWAIT")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250702102341.1969154-1-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
If raid10_read_request or raid10_write_request registers a new
request and the REQ_NOWAIT flag is set, the code does not free the
memory allocated from the mempool.
unreferenced object 0xffff8884802c3200 (size 192):
comm "fio", pid 9197, jiffies 4298078271
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 88 41 02 00 00 00 00 00 .........A......
08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc c1a049a2):
__kmalloc+0x2bb/0x450
mempool_alloc+0x11b/0x320
raid10_make_request+0x19e/0x650 [raid10]
md_handle_request+0x3b3/0x9e0
__submit_bio+0x394/0x560
__submit_bio_noacct+0x145/0x530
submit_bio_noacct_nocheck+0x682/0x830
__blkdev_direct_IO_async+0x4dc/0x6b0
blkdev_read_iter+0x1e5/0x3b0
__io_read+0x230/0x1110
io_read+0x13/0x30
io_issue_sqe+0x134/0x1180
io_submit_sqes+0x48c/0xe90
__do_sys_io_uring_enter+0x574/0x8b0
do_syscall_64+0x5c/0xe0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
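A minimal sketch of the missing error-path cleanup (conf->r10bio_pool
and bio_wouldblock_error() are real; the exact placement in
raid10_read_request/raid10_write_request is illustrative):

if (bio->bi_opf & REQ_NOWAIT) {
	/*
	 * Return the r10_bio taken from the mempool before erroring
	 * the bio, otherwise it is leaked.
	 */
	mempool_free(r10_bio, &conf->r10bio_pool);
	bio_wouldblock_error(bio);
	return;
}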
V4: changing backing tree to see if CKI tests will pass.
The patch code has not changed between any versions.
Fixes: c9aa889b035f ("md: raid10 add nowait support")
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Link: https://lore.kernel.org/linux-raid/c0787379-9caa-42f3-b5fc-369aed784400@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
In the raid1_reshape function, newpool is
allocated on the stack and assigned to conf->r1bio_pool.
This results in conf->r1bio_pool.wait.head pointing
to a stack address.
Accessing this address later can lead to a kernel panic.
Example access path:
raid1_reshape()
{
	// newpool is on the stack
	mempool_t newpool, oldpool;
	// initialize newpool.wait.head to stack address
	mempool_init(&newpool, ...);
	conf->r1bio_pool = newpool;
}

raid1_read_request() or raid1_write_request()
{
	alloc_r1bio()
	{
		mempool_alloc()
		{
			// if pool->alloc fails
			remove_element()
			{
				--pool->curr_nr;
			}
		}
	}
}

mempool_free()
{
	if (pool->curr_nr < pool->min_nr) {
		// pool->wait.head is a stack address
		// wake_up() will try to access this invalid address
		// which leads to a kernel panic
		wake_up(&pool->wait);
		return;
	}
}
Fix: reinitialize conf->r1bio_pool.wait after assigning newpool.
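A minimal sketch of the fix (init_waitqueue_head() and the mempool_t
wait field are real; shown in isolation):

/*
 * raid1_reshape(): after copying the stack-local pool, re-point the
 * embedded wait queue head away from the old stack frame.
 */
conf->r1bio_pool = newpool;
init_waitqueue_head(&conf->r1bio_pool.wait);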
Fixes: afeee514ce7f ("md: convert to bioset_init()/mempool_init()")
Signed-off-by: Wang Jinchao <wangjinchao600@gmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250612112901.3023950-1-wangjinchao600@gmail.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
|
|
Pull NVMe fixes from Christoph:
"- fix incorrect cdw15 value in passthru error logging (Alok Tiwari)
- fix memory leak of bio integrity in nvmet (Dmitry Bogdanov)
- refresh visible attrs after being checked (Eugen Hristev)
- fix suspicious RCU usage warning in the multipath code (Geliang Tang)
- correctly account for namespace head reference counter (Nilay Shroff)"
* tag 'nvme-6.16-2025-07-03' of git://git.infradead.org/nvme:
nvme-multipath: fix suspicious RCU usage warning
nvme-pci: refresh visible attrs after being checked
nvmet: fix memory leak of bio integrity
nvme: correctly account for namespace head reference counter
nvme: Fix incorrect cdw15 value in passthru error logging
|
|
__xa_cmpxchg() is called with rcu_read_lock() held, but it may allocate
memory if necessary.
Fix the problem by moving rcu_read_lock() to after __xa_cmpxchg(); it
must still be taken before xa_unlock(), to prevent the returned page
from being freed by a concurrent discard.
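A minimal sketch of the reordering (field and call names follow brd.c;
details are illustrative):

void *old;

xa_lock(&brd->brd_pages);
/* may allocate xarray nodes, so must not run under rcu_read_lock() */
old = __xa_cmpxchg(&brd->brd_pages, sector, NULL, page, gfp);
rcu_read_lock();		/* moved: taken only now */
xa_unlock(&brd->brd_pages);	/* page now protected against discard */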
Fixes: bbcacab2e8ee ("brd: avoid extra xarray lookups on first write")
Reported-by: syzbot+ea4c8fd177a47338881a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/685ec4c9.a00a0220.129264.000c.GAE@google.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250630112828.421219-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 524346e9d79f ("ublk: build batch from IOs in same io_ring_ctx and io task")
needs to dereference `io->cmd` for checking if the IO can be added to the
current batch, see ublk_belong_to_same_batch() and
io_uring_cmd_ctx_handle(). However, `io->cmd` may become invalid after
the uring_cmd is canceled.
Fix it by only queueing this IO in case ublk_prep_req() returns
`BLK_STS_OK`, when `io->cmd` is guaranteed to be valid.
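A rough sketch of the intended ordering (pseudocode; the helper names
come from the commit text, their argument lists here are hypothetical):

res = ublk_prep_req(ubq, req);
if (res != BLK_STS_OK)
	goto fail;		/* io->cmd must not be trusted here */
/* io->cmd is valid now, safe to inspect for batching */
if (ublk_belong_to_same_batch(io, last_io))
	add_to_current_batch(req);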
Reported-by: Changhui Zhong <czhong@redhat.com>
Fixes: 524346e9d79f ("ublk: build batch from IOs in same io_ring_ctx and io task")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250701072325.1458109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When I run the NVME over TCP test in virtme-ng, I get the following
"suspicious RCU usage" warning in nvme_mpath_add_sysfs_link():
'''
[ 5.024557][ T44] nvmet: Created nvm controller 1 for subsystem nqn.2025-06.org.nvmexpress.mptcp for NQN nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77.
[ 5.027401][ T183] nvme nvme0: creating 2 I/O queues.
[ 5.029017][ T183] nvme nvme0: mapped 2/0/0 default/read/poll queues.
[ 5.032587][ T183] nvme nvme0: new ctrl: NQN "nqn.2025-06.org.nvmexpress.mptcp", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77
[ 5.042214][ T25]
[ 5.042440][ T25] =============================
[ 5.042579][ T25] WARNING: suspicious RCU usage
[ 5.042705][ T25] 6.16.0-rc3+ #23 Not tainted
[ 5.042812][ T25] -----------------------------
[ 5.042934][ T25] drivers/nvme/host/multipath.c:1203 RCU-list traversed in non-reader section!!
[ 5.043111][ T25]
[ 5.043111][ T25] other info that might help us debug this:
[ 5.043111][ T25]
[ 5.043341][ T25]
[ 5.043341][ T25] rcu_scheduler_active = 2, debug_locks = 1
[ 5.043502][ T25] 3 locks held by kworker/u9:0/25:
[ 5.043615][ T25] #0: ffff888008730948 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work+0x7ed/0x1350
[ 5.043830][ T25] #1: ffffc900001afd40 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0xcf3/0x1350
[ 5.044084][ T25] #2: ffff888013ee0020 (&head->srcu){.+.+}-{0:0}, at: nvme_mpath_add_sysfs_link.part.0+0xb4/0x3a0
[ 5.044300][ T25]
[ 5.044300][ T25] stack backtrace:
[ 5.044439][ T25] CPU: 0 UID: 0 PID: 25 Comm: kworker/u9:0 Not tainted 6.16.0-rc3+ #23 PREEMPT(full)
[ 5.044441][ T25] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 5.044442][ T25] Workqueue: async async_run_entry_fn
[ 5.044445][ T25] Call Trace:
[ 5.044446][ T25] <TASK>
[ 5.044449][ T25] dump_stack_lvl+0x6f/0xb0
[ 5.044453][ T25] lockdep_rcu_suspicious.cold+0x4f/0xb1
[ 5.044457][ T25] nvme_mpath_add_sysfs_link.part.0+0x2fb/0x3a0
[ 5.044459][ T25] ? queue_work_on+0x90/0xf0
[ 5.044461][ T25] ? lockdep_hardirqs_on+0x78/0x110
[ 5.044466][ T25] nvme_mpath_set_live+0x1e9/0x4f0
[ 5.044470][ T25] nvme_mpath_add_disk+0x240/0x2f0
[ 5.044472][ T25] ? __pfx_nvme_mpath_add_disk+0x10/0x10
[ 5.044475][ T25] ? add_disk_fwnode+0x361/0x580
[ 5.044480][ T25] nvme_alloc_ns+0x81c/0x17c0
[ 5.044483][ T25] ? kasan_quarantine_put+0x104/0x240
[ 5.044487][ T25] ? __pfx_nvme_alloc_ns+0x10/0x10
[ 5.044495][ T25] ? __pfx_nvme_find_get_ns+0x10/0x10
[ 5.044496][ T25] ? rcu_read_lock_any_held+0x45/0xa0
[ 5.044498][ T25] ? validate_chain+0x232/0x4f0
[ 5.044503][ T25] nvme_scan_ns+0x4c8/0x810
[ 5.044506][ T25] ? __pfx_nvme_scan_ns+0x10/0x10
[ 5.044508][ T25] ? find_held_lock+0x2b/0x80
[ 5.044512][ T25] ? ktime_get+0x16d/0x220
[ 5.044517][ T25] ? kvm_clock_get_cycles+0x18/0x30
[ 5.044520][ T25] ? __pfx_nvme_scan_ns_async+0x10/0x10
[ 5.044522][ T25] async_run_entry_fn+0x97/0x560
[ 5.044523][ T25] ? rcu_is_watching+0x12/0xc0
[ 5.044526][ T25] process_one_work+0xd3c/0x1350
[ 5.044532][ T25] ? __pfx_process_one_work+0x10/0x10
[ 5.044536][ T25] ? assign_work+0x16c/0x240
[ 5.044539][ T25] worker_thread+0x4da/0xd50
[ 5.044545][ T25] ? __pfx_worker_thread+0x10/0x10
[ 5.044546][ T25] kthread+0x356/0x5c0
[ 5.044548][ T25] ? __pfx_kthread+0x10/0x10
[ 5.044549][ T25] ? ret_from_fork+0x1b/0x2e0
[ 5.044552][ T25] ? __lock_release.isra.0+0x5d/0x180
[ 5.044553][ T25] ? ret_from_fork+0x1b/0x2e0
[ 5.044555][ T25] ? rcu_is_watching+0x12/0xc0
[ 5.044557][ T25] ? __pfx_kthread+0x10/0x10
[ 5.044559][ T25] ret_from_fork+0x218/0x2e0
[ 5.044561][ T25] ? __pfx_kthread+0x10/0x10
[ 5.044562][ T25] ret_from_fork_asm+0x1a/0x30
[ 5.044570][ T25] </TASK>
'''
This patch uses the sleepable-RCU helper list_for_each_entry_srcu()
instead of list_for_each_entry_rcu() to fix it.
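A minimal sketch of the change (list_for_each_entry_srcu() and
srcu_read_lock_held() are real kernel helpers; the loop body is
elided):

int srcu_idx = srcu_read_lock(&head->srcu);
struct nvme_ns *ns;

/*
 * SRCU-aware iterator with a lockdep condition, instead of
 * list_for_each_entry_rcu() in a non-RCU reader section.
 */
list_for_each_entry_srcu(ns, &head->list, siblings,
			 srcu_read_lock_held(&head->srcu)) {
	/* create the sysfs link for each sibling path */
}
srcu_read_unlock(&head->srcu, srcu_idx);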
Fixes: 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy")
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The sysfs attributes are registered early, but the driver does not know
whether they are needed or not at that moment.
For the CMB attributes, commit e917a849c3fc ("nvme-pci: refresh visible
attrs for cmb attributes") solved this problem by
calling nvme_update_attrs after mapping the CMB. However, the issue
persists for the HMB attributes. To solve the problem, move the call to
nvme_update_attrs to after nvme_setup_host_mem, which sets up the HMB.
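A minimal sketch of the reordering in the probe path
(nvme_setup_host_mem() and nvme_update_attrs() are the real helpers;
the error label is illustrative):

result = nvme_setup_host_mem(dev);
if (result < 0)
	goto out;
/*
 * Moved here: visibility is refreshed once both CMB and HMB state
 * are known.
 */
nvme_update_attrs(dev);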
Fixes: e917a849c3fc ("nvme-pci: refresh visible attrs for cmb attributes")
Fixes: 86adbf0cdb9e ("nvme: simplify transport specific device attribute handling")
Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
If nvmet receives commands with metadata, there is a continuous memory
leak of the kmalloc-128 slab, or more precisely of bio->bi_integrity.
Since commit bf4c89fc8797 ("block: don't call bio_uninit from bio_endio")
each user of bio_init has to use bio_uninit as well, otherwise the bio
integrity payload is never freed. nvmet uses bio_init for inline bios.
Uninit the inline bio to complete the deallocation of its integrity
payload.
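A minimal sketch of the change (bio_uninit() is real; the wrapper shape
follows nvmet's inline-bio convention but is illustrative):

static inline void nvmet_req_bio_put(struct nvmet_req *req, struct bio *bio)
{
	if (bio != &req->b.inline_bio)
		bio_put(bio);
	else
		bio_uninit(bio);	/* frees bio->bi_integrity */
}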
Fixes: bf4c89fc8797 ("block: don't call bio_uninit from bio_endio")
Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The blktests nvme/058 manifests an issue where the NVMe subsystem
kobject entry remains stale in sysfs, causing a failure during
subsequent NVMe module reloads[1]. Specifically, when attempting to
register a new NVMe subsystem, the driver encounters a kobject name
collision because a stale kobject still exists. Note that nvme/058
itself passes without reporting any failure; it is only during
subsequent NVMe module reloads that the stale nvme subsystem kobject
entry in sysfs causes the observed symptom[1].
This issue stems from an imbalance in the get/put usage of the namespace
head (nshead) reference counter. The nshead holds a reference to the
associated NVMe subsystem. If the nshead reference is not properly
released, it prevents the cleanup of the subsystem's kobject, leaving
a stale nvme subsystem entry behind in sysfs.
During the failure case, the last namespace path referencing a nshead
is removed, but the nshead reference was not released. This occurs
because the release logic currently only puts the nshead reference
when its state is LIVE. However, in configurations where ANA (Asymmetric
Namespace Access) is enabled, a namespace may be associated with an ANA
state that is neither optimized nor non-optimized. In this case, the
nshead may never transition to LIVE, and the corresponding nshead
reference is then never dropped. In fact, nvme/058 associates some of
the nvme namespaces with an inaccessible ANA state; the nshead is
created, but its state never transitions to LIVE. So the current logic
then causes the nshead reference to be leaked for non-LIVE states.
In another scenario, during namespace allocation, the driver first
allocates a nshead and then issues an Identify Namespace command. If
this command fails — which can happen in tests like nvme/058 that
rapidly enable and disable namespaces — we must release the reference
to the newly allocated nshead. However, this reference release is
currently missing in the failure path, causing a nshead reference leak.
To fix this, we now unconditionally release the nshead reference when
the last nvme path referencing the nshead is removed, regardless of
the head's state. Also, in the Identify Namespace failure case, we now
properly release the nshead reference. This ensures proper cleanup of
the nshead and, consequently, of the NVMe subsystem and its associated
kobject.
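A minimal sketch of the last-path change (nvme_put_ns_head() is the
real helper; the surrounding condition is illustrative):

/*
 * Last path for this head went away: put the reference
 * unconditionally instead of only when head->state == LIVE.
 */
if (list_empty(&head->list))
	nvme_put_ns_head(head);

/*
 * And in the namespace allocation error path: drop the freshly taken
 * nshead reference when the Identify Namespace command fails.
 */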
This change prevents stale kobject entries from lingering in sysfs and
eliminates the module reload failures observed just after running
nvme/058.
[1] https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/
Reported-by: yi.zhang@redhat.com
Closes: https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/
Fixes: 62188639ec16 ("nvme-multipath: introduce delayed removal of the multipath head node")
Tested-by: yi.zhang@redhat.com
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Fix an error in nvme_log_err_passthru() where cdw14 was incorrectly
printed twice instead of cdw15. This fix ensures accurate logging of
the full passthrough command payload.
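A minimal sketch of the one-argument fix (field names follow struct
nvme_common_command; the format string is abbreviated):

pr_err("cdw14=%08x cdw15=%08x\n",
       le32_to_cpu(cmd->common.cdw14),
       le32_to_cpu(cmd->common.cdw15));	/* was: cdw14 printed again */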
Fixes: 9f079dda1433 ("nvme: allow passthru cmd error logging")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
While bdev_count_inflight is iterating over all CPUs, if some IOs are
issued from a traversed CPU and then completed from a CPU that has not
been traversed yet:
cpu0
			cpu1
bdev_count_inflight
 //for_each_possible_cpu
 // cpu0 is 0
 inflight += 0
			// issue an io
			blk_account_io_start
			// cpu0 inflight ++
						cpu2
						// the io is done
						blk_account_io_done
						// cpu2 inflight --
 // cpu1 is 0
 inflight += 0
 // cpu2 is -1
 inflight += -1
...
In this case, the total inflight will be -1, causing lots of false
warnings. Fix the problem by removing the warning.
Note that there is still a valid warning for nvme-mpath (from Yi) that
is not fixed yet.
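A minimal sketch of the summation without the warning (assuming the
part_stat helpers from block/blk.h and a clamp for the transient
negative case):

int read = 0, write = 0, cpu;

for_each_possible_cpu(cpu) {
	read += part_stat_local_read_cpu(part, in_flight[READ], cpu);
	write += part_stat_local_read_cpu(part, in_flight[WRITE], cpu);
}
/*
 * Clamp instead of WARN: a racing start/done pair can make the
 * non-atomic per-cpu snapshot go negative.
 */
inflight[READ] = read > 0 ? read : 0;
inflight[WRITE] = write > 0 ? write : 0;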
Fixes: f5482ee5edb9 ("block: WARN if bdev inflight counter is negative")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#mae89155a5006463d0a21a4a2c35ae0034b26a339
Reported-and-tested-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#m1d935a00070bf95055d0ac84e6075158b08acaef
Reported-by: Dave Chinner <david@fromorbit.com>
Closes: https://lore.kernel.org/linux-block/aFuypjqCXo9-5_En@dread.disaster.area/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250626115743.1641443-1-yukuai3@huawei.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull NVMe fixes from Christoph:
" - reset delayed remove_work after reconnect (Keith Busch)
- fix atomic write size validation (Christoph Hellwig)"
* tag 'nvme-6.16-2025-06-26' of git://git.infradead.org/nvme:
nvme: fix atomic write size validation
nvme: refactor the atomic write unit detection
nvme: reset delayed remove_work after reconnect
|
|
Add additional checks that queue depth and number of queues are
non-zero.
Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250626022046.235018-1-ronniesahlberg@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Don't mix the namespace and controller values, and validate the
per-controller limit when probing the controller. This avoids spurious
failures for controllers that have namespaces with different logical
block sizes, or that report the per-namespace values only for some
namespaces.
It also fixes a missing queue_limits_cancel_update in an error path by
removing that error path.
Fixes: 8695f060a029 ("nvme: all namespaces in a subsystem must adhere to a common atomic write size")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
|
|
Move all the code out of nvme_update_disk_info into the helper, and
rename the helper to have a somewhat less clumsy name.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
|
|
The remove_work will proceed with permanently disconnecting on the
initial final path failure if the head shows no paths after the delay.
If a new path connects while the remove_work is pending, and if that new
path happens to disconnect before that remove_work executes, the delayed
removal should reset based on the most recent path disconnect time, but
queue_delayed_work() won't do anything if the work is already pending.
Attempt to cancel the delayed work when a new path connects, and use
mod_delayed_work() in case the remove_work remains pending anyway.
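A minimal sketch of the two call sites (cancel_delayed_work() and
mod_delayed_work() are the real APIs; field names follow the multipath
code):

/* a new path connected: stop any pending removal */
cancel_delayed_work(&head->remove_work);

/*
 * Last path disconnected again: restart the timer even if the work
 * was already queued.
 */
mod_delayed_work(nvme_wq, &head->remove_work,
		 head->delayed_removal_secs * HZ);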
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
If ublk_get_data() fails, -EIOCBQUEUED is returned and the current
command becomes ASYNC. The only reason for the failure is that mapping
the data can't move on, because of insufficient pages or a pending
signal; the current ublk request then has to be requeued.
Once the request needs to be requeued, we have to set up `ublk_io`
correctly, including io->cmd and flags; otherwise the request may not
be forwarded to the ublk server successfully.
Fixes: 9810362a57cb ("ublk: don't call ublk_dispatch_req() for NEED_GET_DATA")
Reported-by: Changhui Zhong <czhong@redhat.com>
Closes: https://lore.kernel.org/linux-block/CAGVVp+VN9QcpHUz_0nasFf5q9i1gi8H8j-G-6mkBoqa3TyjRHA@mail.gmail.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Changhui Zhong <czhong@redhat.com>
Link: https://lore.kernel.org/r/20250624104121.859519-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
UBLK_F_SUPPORT_ZERO_COPY has a very old comment describing the initial
idea for how zero-copy would be implemented. The actual implementation
added in commit 1f6540e2aabb ("ublk: zc register/unregister bvec") uses
io_uring registered buffers rather than shared memory mapping.
Remove the inaccurate remarks about mapping ublk request memory into the
ublk server's address space and requiring 4K block size. Replace them
with a description of the current zero-copy mechanism.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250621171015.354932-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When a C++ file compiled with -Wc++11-narrowing includes the UAPI header
linux/ublk_cmd.h, ublk_sqe_addr_to_auto_buf_reg()'s assignments of u64
values to u8, u16, and u32 fields result in compiler warnings. Add
explicit casts to the intended types to avoid these warnings. Drop the
unnecessary bitmasks.
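A minimal sketch of the helper with the explicit casts (field names
follow include/uapi/linux/ublk_cmd.h; shown for illustration):

static inline struct ublk_auto_buf_reg
ublk_sqe_addr_to_auto_buf_reg(__u64 sqe_addr)
{
	struct ublk_auto_buf_reg reg = {
		.index = (__u16)sqe_addr,
		.flags = (__u8)(sqe_addr >> 16),
		.reserved0 = (__u8)(sqe_addr >> 24),
		.reserved1 = (__u32)(sqe_addr >> 32),
	};

	return reg;
}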
Reported-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 99c1e4eb6a3f ("ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250621162842.337452-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Don't use the same backing file for more than one ublk device, and
avoid concurrent writes to the same file from multiple ublk disks.
Fixes: 8ccebc19ee3d ("selftests: ublk: support UBLK_F_AUTO_BUF_REG")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250623011934.741788-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk_queue_cmd_list() dispatches the whole batch list by scheduling task
work via the tail request's io_uring_cmd. This is fine even when more
than one io_ring_ctx is involved in the batch, since there is just one
running context.
However, the task work handler ublk_cmd_list_tw_cb() takes the
`issue_flags` of the tail uring_cmd's io_ring_ctx for completing all
commands. This is wrong if any uring_cmd was issued from a different
io_ring_ctx.
Fix it by always building batch IOs from the same io_ring_ctx and io
task, because ublk_dispatch_req() does validate the task context, and
the IO needs to be aborted in case it runs from the fallback task work
context.
For a typical per-queue or per-io daemon implementation, this shouldn't
make a difference from a performance viewpoint, because a single
io_ring_ctx is used in each daemon in the normal use case.
Fixes: d796cea7b9f3 ("ublk: implement ->queue_rqs()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250625022554.883571-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Sanity check the values for queue depth and number of queues
we get from userspace when adding a device.
Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Fixes: 62fe99cef94a ("ublk: add read()/write() support for ublk char device")
Link: https://lore.kernel.org/r/20250619021031.181340-1-ronniesahlberg@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When aoe's rexmit_timer() notices that an aoe target fails to respond to
commands for more than aoe_deadsecs, it calls aoedev_downdev() which
cleans the outstanding aoe and block queues. This can involve sleeping,
such as in blk_mq_freeze_queue(), which should not occur in irq context.
This patch defers that aoedev_downdev() call to the aoe device's
workqueue.
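A rough sketch of the deferral (the flag name here is hypothetical and
d->work follows the aoedev pattern; the real patch may differ):

/* rexmit_timer(), atomic context: only mark and kick the worker */
d->flags |= DEVFL_NEED_DOWN;	/* hypothetical flag */
schedule_work(&d->work);

/* worker, process context: safe to sleep */
if (d->flags & DEVFL_NEED_DOWN)
	aoedev_downdev(d);	/* may block in blk_mq_freeze_queue() */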
Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665
Signed-off-by: Justin Sanders <jsanders.devel@gmail.com>
Link: https://lore.kernel.org/r/20250610170600.869-2-jsanders.devel@gmail.com
Tested-By: Valentin Kleibel <valentin@vrvis.at>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
An aoe device's rq_list contains accepted block requests that are
waiting to be transmitted to the aoe target. This queue was added as
part of the conversion to blk_mq. However, the queue was not cleaned out
when an aoe device is downed, which caused blk_mq_freeze_queue() to sleep
indefinitely waiting for those requests to complete, causing a hang. This
fix cleans out the queue before calling blk_mq_freeze_queue().
Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665
Fixes: 3582dd291788 ("aoe: convert aoeblk to blk-mq")
Signed-off-by: Justin Sanders <jsanders.devel@gmail.com>
Link: https://lore.kernel.org/r/20250610170600.869-1-jsanders.devel@gmail.com
Tested-By: Valentin Kleibel <valentin@vrvis.at>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, NVMe uring_cmd completions will complete locally if they are
polled. This is done because those completions are always invoked from
task context. And while that is true, there's no guarantee that it's
invoked under the right ring context, or even task. If someone does
NVMe passthrough via multiple threads and with a limited number of
poll queues, then ringA may find completions from ringB. For that case,
completing the request may not be sound.
Always just punt the passthrough completions via task_work, which will
redirect the completion, if needed.
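A minimal sketch of the end_io change (io_uring_cmd_do_in_task_lazy()
is the real helper; status bookkeeping is elided):

static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
						blk_status_t err)
{
	struct io_uring_cmd *ioucmd = req->end_io_data;

	/*
	 * Was: complete inline if blk_rq_is_poll(req). Now always punt
	 * to task_work so the issuing ring's context finishes it.
	 */
	io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
	return RQ_END_IO_FREE;
}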
Cc: stable@vger.kernel.org
Fixes: 585079b6e425 ("nvme: wire up async polling for io passthrough commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Stephen Rothwell reports htmldocs warning on ublk docs:
Documentation/block/ublk.rst:414: ERROR: Unexpected indentation. [docutils]
Fix the warning by separating sublists of auto buffer registration
fallback behavior from their appropriate parent list item.
Fixes: ff20c516485e ("ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/linux-next/20250612132638.193de386@canb.auug.org.au/
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20250613023857.15971-1-bagasdotme@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Similarly to 26064d3e2b4d ("block: fix adding folio to bio"), if
we attempt to add a folio that is larger than 4GB, we'll silently
truncate the offset and len. Widen the parameters to size_t, assert
that the length is less than 4GB and set the first page that contains
the interesting data rather than the first page of the folio.
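A minimal sketch of the widened helper (bvec_set_page() and
folio_page() are the real helpers):

static inline void bvec_set_folio(struct bio_vec *bv, struct folio *folio,
				  size_t len, size_t offset)
{
	unsigned long nr = offset / PAGE_SIZE;

	WARN_ON_ONCE(len > UINT_MAX);	/* bv_len is 32 bits wide */
	bvec_set_page(bv, folio_page(folio, nr), len, offset % PAGE_SIZE);
}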
Fixes: 26db5ee15851 ("block: add a bvec_set_folio helper")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20250612144255.2850278-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It is possible for physically contiguous folios to have discontiguous
struct pages if SPARSEMEM is enabled and SPARSEMEM_VMEMMAP is not.
This is correctly handled by folio_page_idx(), so remove this open-coded
implementation.
Fixes: 640d1930bef4 ("block: Add bio_for_each_folio_all()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20250612144126.2849931-1-willy@infradead.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Previously, the block layer stored the requests in the plug list in
LIFO order. For this reason, blk_attempt_plug_merge() would check
just the head entry for a back merge attempt, and abort after that
unless requests for multiple queues existed in the plug list. Since
commit e70c301faece ("block: don't reorder requests in
blk_add_rq_to_plug") the plug list is kept in FIFO order, so the head
entry is the oldest rather than the newest request. If more than one
request is present in the plug list, this makes the one-shot back
merging less useful than before, as it'll always fail to find a quick
merge candidate.
Use the tail entry for the one-shot merge attempt, which is the last
added request in the list. If that fails, abort immediately unless
there are multiple queues available. If multiple queues are available,
then scan the list. Ideally the latter scan would be a backwards scan
of the list, but as it currently stands, the plug list is singly linked
and hence this isn't easily feasible.
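A rough sketch of the new one-shot attempt (blk_attempt_bio_merge() is
the real internal helper; the tail access assumes the plug's rq_list
keeps a tail pointer):

struct request *rq = plug->mq_list.tail;	/* last added request */

if (rq && rq->q == q &&
    blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == BIO_MERGE_OK)
	return true;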
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Fixes: e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Bios queued up in the zone write plug have already gone through all all
preparation in the submit_bio path, including the freeze protection.
Submitting them through submit_bio_noacct_nocheck duplicates the work
and can can cause deadlocks when freezing a queue with pending bio
write plugs.
Go straight to ->submit_bio or blk_mq_submit_bio to bypass the
superfluous extra freeze protection and checks.
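A minimal sketch of the direct resubmission (fops->submit_bio and
blk_mq_submit_bio() are the real entry points):

/*
 * The bio already holds freeze protection from its first pass through
 * submit_bio(), so go straight to the driver.
 */
if (bdev->bd_disk->fops->submit_bio)
	bdev->bd_disk->fops->submit_bio(bio);	/* bio-based driver */
else
	blk_mq_submit_bio(bio);			/* blk-mq driver */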
Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250611044416.2351850-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When blk_zone_write_plug_bio_endio() is called for a regular write BIO
used to emulate a zone append operation, that is, a BIO flagged with
BIO_EMULATES_ZONE_APPEND, the BIO operation code is restored to the
original REQ_OP_ZONE_APPEND but the BIO_EMULATES_ZONE_APPEND flag is not
cleared. Clear it to fully return the BIO to its original definition.
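A minimal sketch of the completion fix (bio_flagged() and
bio_clear_flag() are the real helpers):

if (bio_flagged(bio, BIO_EMULATES_ZONE_APPEND)) {
	bio->bi_opf &= ~REQ_OP_MASK;
	bio->bi_opf |= REQ_OP_ZONE_APPEND;
	bio_clear_flag(bio, BIO_EMULATES_ZONE_APPEND);	/* missing step */
}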
Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250611005915.89843-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Document the recently merged auto buffer registration feature (UBLK_F_AUTO_BUF_REG).
Cc: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250611085632.109719-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It isn't necessary to freeze the queue for updating the disk size, given
that submit_bio() doesn't grab the queue usage counter for checking EOD.
Also, many drivers don't freeze the queue when calling
set_capacity_and_notify().
Move lo_set_size() out of the queue freeze to fix many lockdep warning
reports.
Link: https://lore.kernel.org/linux-block/67ea99e0.050a0220.3c3d88.0042.GAE@google.com/
Reported-by: syzbot+9dd7dbb1a4b915dee638@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250611084938.108829-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull NVMe updates and fixes from Christoph:
"nvme updates for Linux 6.16
- TCP error handling fix (Shin'ichiro Kawasaki)
- TCP I/O stall handling fixes (Hannes Reinecke)
- fix command limits status code (Keith Busch)
- support vectored buffers also for passthrough (Pavel Begunkov)
- spelling fixes (Yi Zhang)"
* tag 'nvme-6.16-2025-06-05' of git://git.infradead.org/nvme:
nvme: spelling fixes
nvme-tcp: fix I/O stalls on congested sockets
nvme-tcp: sanitize request list handling
nvme-tcp: remove tag set when second admin queue config fails
nvme: enable vectored registered bufs for passthrough cmds
nvme: fix implicit bool to flags conversion
nvme: fix command limits status code
|
|
Fix various spelling errors in comments.
Signed-off-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When the socket is busy processing, nvme_tcp_try_recv() might return
-EAGAIN, but this doesn't automatically imply that the sending side is
blocked, too. So check if there are pending requests once
nvme_tcp_try_recv() returns -EAGAIN and continue with the sending loop
to avoid I/O stalls.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Acked-by: Chris Leech <cleech@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Validate the request in nvme_tcp_handle_r2t() to ensure it's not part of
any list, otherwise a malicious R2T PDU might inject a loop in request
list processing.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Commit 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure
concatenation") modified nvme_tcp_setup_ctrl() to call
nvme_tcp_configure_admin_queue() twice. The first call prepares for
DH-CHAP negotiation, and the second call is required for secure
concatenation. However, this change triggered BUG KASAN slab-use-after-
free in blk_mq_queue_tag_busy_iter(). This BUG can be recreated by
repeating the blktests test case nvme/063 a few times [1].
When the BUG happens, nvme_tcp_create_ctrl() fails in the call chain
below:
nvme_tcp_create_ctrl()
  nvme_tcp_alloc_ctrl() new=true             ... Alloc nvme_tcp_ctrl and admin_tag_set
  nvme_tcp_setup_ctrl() new=true
    nvme_tcp_configure_admin_queue() new=true  ... Succeed
      nvme_alloc_admin_tag_set()             ... Alloc the tag set for admin_tag_set
    nvme_stop_keep_alive()
    nvme_tcp_teardown_admin_queue() remove=false
    nvme_tcp_configure_admin_queue() new=false
      nvme_tcp_alloc_admin_queue()           ... Fail, but do not call nvme_remove_admin_tag_set()
  nvme_uninit_ctrl()
  nvme_put_ctrl()                            ... Free up the nvme_tcp_ctrl and admin_tag_set
The first call of nvme_tcp_configure_admin_queue() succeeds with
new=true argument. The second call fails with new=false argument. This
second call does not call nvme_remove_admin_tag_set() on failure, due to
the new=false argument. Then the admin tag set is not removed. However,
nvme_tcp_create_ctrl() assumes that nvme_tcp_setup_ctrl() would call
nvme_remove_admin_tag_set(). Then it frees up struct nvme_tcp_ctrl which
has admin_tag_set field. Later on, the timeout handler accesses the
admin_tag_set field and causes the BUG KASAN slab-use-after-free.
To avoid leaking the admin tag set, call nvme_remove_admin_tag_set()
when the second nvme_tcp_configure_admin_queue() call fails. Do not
return from nvme_tcp_setup_ctrl() on failure; instead, jump to the
"destroy_admin" goto label to call nvme_tcp_teardown_admin_queue(),
which calls nvme_remove_admin_tag_set().
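A minimal sketch of the error-path change in nvme_tcp_setup_ctrl()
(the label name comes from the commit text):

ret = nvme_tcp_configure_admin_queue(ctrl, false);
if (ret)
	goto destroy_admin;	/* was: plain return, leaking admin_tag_set */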
Fixes: 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
nvme already supports registered buffers for non-vectored io_uring
passthrough commands, enable it for the vectored mode as well. It takes
an iovec, each entry of which should contain a range within the same
registered buffer specified in sqe->buf_index.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
nvme_map_user_request() takes flags as the last argument, but
nvme_uring_cmd_io() shoves a bool "vec" into it. It behaves as
expected because bool is converted to 0/1 and NVME_IOCTL_VEC is
defined as 1, but it's better to pass flags explicitly.
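A minimal sketch of the call-site change (the argument list of
nvme_map_user_request() is abbreviated here):

/* before: bool silently converted into the flags argument */
nvme_map_user_request(req, ubuf, len, meta, meta_len, ioucmd, vec);

/* after: explicit flags */
nvme_map_user_request(req, ubuf, len, meta, meta_len, ioucmd,
		      vec ? NVME_IOCTL_VEC : 0);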
Fixes: 7b7fdb8e2dbc1 ("nvme: replace the "bool vec" arguments with flags in the ioctl path")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The command specific status code 0x183 was introduced in the NVMe 2.0
specification, defined as "Command Size Limits Exceeded", and only ever
applied to DSM and Copy commands. Fix the name, and remove the incorrect
translation to error codes and the special treatment in the target code
for it.
Fixes: 3b7c33b28a44d4 ("nvme.h: add Write Zeroes definitions")
Cc: Chaitanya Kulkarni <chaitanyak@nvidia.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Some failure modes are handled poorly by kublk. For example, if ublk_drv
is built as a module but not currently loaded into the kernel, ./kublk
add ... just hangs forever. This happens because in this case (and a few
others), the worker process does not notify its parent (via a write to
the shared eventfd) that it has tried and failed to initialize, so the
parent hangs forever. Fix this by ensuring that we always notify the
parent process of any initialization failure, and have the parent print
a (not very descriptive) log line when this happens.
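A rough sketch of the handshake (the helper and field names are
hypothetical, modeled on kublk's eventfd notification):

static void ublk_send_dev_event(const struct dev_ctx *ctx, int dev_id)
{
	/* 0 signals failure; dev_id + 1 signals success */
	uint64_t id = dev_id >= 0 ? (uint64_t)dev_id + 1 : 0;

	if (write(ctx->evtfd, &id, sizeof(id)) != sizeof(id))
		exit(EXIT_FAILURE);
}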
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250603-ublk_init_fail-v1-1-87c91486230e@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk_rq_integrity_map_user() creates the ubuf iter with ITER_DEST for
write-direction operations and ITER_SOURCE for read-direction ones.
This is backwards; writes use the user buffer as a source for metadata
and reads use it as a destination. Switch to the rq_data_dir() helper,
which maps writes to ITER_SOURCE (WRITE) and reads to ITER_DEST (READ).
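A minimal sketch of the direction fix (import_ubuf() and rq_data_dir()
are the real helpers):

/*
 * A WRITE consumes the user buffer (ITER_SOURCE == WRITE), a READ
 * fills it (ITER_DEST == READ); rq_data_dir() yields that mapping.
 */
ret = import_ubuf(rq_data_dir(rq), ubuf, bytes, &iter);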
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: fe8f4ca7107e ("block: modify bio_integrity_map_user to accept iov_iter as argument")
Link: https://lore.kernel.org/r/20250603184752.1185676-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The direction is determined from the bio, which is already passed in. Compute
op_is_write(bio_op(bio)) directly instead of converting it to an iter
direction and back to a bool.
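The resulting check is a one-liner (sketch):

bool write = op_is_write(bio_op(bio));	/* no iter-direction round trip */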
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/20250603183133.1178062-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We have stress_03, stress_04 and stress_05 for checking new features
against stress IO, device removal, ublk server crash and recovery, so
let the three existing stress tests cover PER_IO_DAEMON.
Then stress_06 can be removed, since the same test function is included in
stress_03.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250602132113.1398645-1-ming.lei@redhat.com
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
[axboe: remove test_stress_06.sh from Makefile too]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Explain the restrictions imposed on ublk servers in two cases:
1. When UBLK_F_PER_IO_DAEMON is set (current ublk_drv)
2. When UBLK_F_PER_IO_DAEMON is not set (legacy)
Remove most references to per-queue daemons, as the new
UBLK_F_PER_IO_DAEMON feature renders that concept obsolete.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-9-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a new test_stress_06 for the per io daemons feature. This is just a
copy of test_stress_01 with the per_io_tasks flag added, with varying
amounts of nthreads. This test is able to reproduce a panic which was
caught manually during development [1]; in the current version of this
patch set, it passes.
Note that this commit also makes all stress tests using the
run_io_and_remove helper more stressful by additionally exercising the
batch submit (queue_rqs) path.
[1] https://lore.kernel.org/linux-block/aDgwGoGCEpwd1mFY@fedora/
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-8-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|