path: root/drivers/nvme
Age / Commit message / Author
11 daysMerge tag 'block-6.16-20250704' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe:

 - NVMe fixes via Christoph:
     - fix incorrect cdw15 value in passthru error logging (Alok Tiwari)
     - fix memory leak of bio integrity in nvmet (Dmitry Bogdanov)
     - refresh visible attrs after being checked (Eugen Hristev)
     - fix suspicious RCU usage warning in the multipath code (Geliang Tang)
     - correctly account for namespace head reference counter (Nilay Shroff)

 - Fix for a regression introduced in ublk in this cycle, where it would attempt to queue a canceled request.

 - brd RCU sleeping fix, also introduced in this cycle. Bare bones fix, should be improved upon for the next release.

* tag 'block-6.16-20250704' of git://git.kernel.dk/linux:
  brd: fix sleeping function called from invalid context in brd_insert_page()
  ublk: don't queue request if the associated uring_cmd is canceled
  nvme-multipath: fix suspicious RCU usage warning
  nvme-pci: refresh visible attrs after being checked
  nvmet: fix memory leak of bio integrity
  nvme: correctly account for namespace head reference counter
  nvme: Fix incorrect cdw15 value in passthru error logging
2025-07-01nvme-multipath: fix suspicious RCU usage warningGeliang Tang
When I run the NVMe over TCP test in virtme-ng, I get the following "suspicious RCU usage" warning in nvme_mpath_add_sysfs_link():

'''
[ 5.024557][ T44] nvmet: Created nvm controller 1 for subsystem nqn.2025-06.org.nvmexpress.mptcp for NQN nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77.
[ 5.027401][ T183] nvme nvme0: creating 2 I/O queues.
[ 5.029017][ T183] nvme nvme0: mapped 2/0/0 default/read/poll queues.
[ 5.032587][ T183] nvme nvme0: new ctrl: NQN "nqn.2025-06.org.nvmexpress.mptcp", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:f7f6b5e0-ff97-4894-98ac-c85309e0bc77
[ 5.042214][ T25]
[ 5.042440][ T25] =============================
[ 5.042579][ T25] WARNING: suspicious RCU usage
[ 5.042705][ T25] 6.16.0-rc3+ #23 Not tainted
[ 5.042812][ T25] -----------------------------
[ 5.042934][ T25] drivers/nvme/host/multipath.c:1203 RCU-list traversed in non-reader section!!
[ 5.043111][ T25]
[ 5.043111][ T25] other info that might help us debug this:
[ 5.043111][ T25]
[ 5.043341][ T25]
[ 5.043341][ T25] rcu_scheduler_active = 2, debug_locks = 1
[ 5.043502][ T25] 3 locks held by kworker/u9:0/25:
[ 5.043615][ T25] #0: ffff888008730948 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work+0x7ed/0x1350
[ 5.043830][ T25] #1: ffffc900001afd40 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0xcf3/0x1350
[ 5.044084][ T25] #2: ffff888013ee0020 (&head->srcu){.+.+}-{0:0}, at: nvme_mpath_add_sysfs_link.part.0+0xb4/0x3a0
[ 5.044300][ T25]
[ 5.044300][ T25] stack backtrace:
[ 5.044439][ T25] CPU: 0 UID: 0 PID: 25 Comm: kworker/u9:0 Not tainted 6.16.0-rc3+ #23 PREEMPT(full)
[ 5.044441][ T25] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 5.044442][ T25] Workqueue: async async_run_entry_fn
[ 5.044445][ T25] Call Trace:
[ 5.044446][ T25] <TASK>
[ 5.044449][ T25] dump_stack_lvl+0x6f/0xb0
[ 5.044453][ T25] lockdep_rcu_suspicious.cold+0x4f/0xb1
[ 5.044457][ T25] nvme_mpath_add_sysfs_link.part.0+0x2fb/0x3a0
[ 5.044459][ T25] ? queue_work_on+0x90/0xf0
[ 5.044461][ T25] ? lockdep_hardirqs_on+0x78/0x110
[ 5.044466][ T25] nvme_mpath_set_live+0x1e9/0x4f0
[ 5.044470][ T25] nvme_mpath_add_disk+0x240/0x2f0
[ 5.044472][ T25] ? __pfx_nvme_mpath_add_disk+0x10/0x10
[ 5.044475][ T25] ? add_disk_fwnode+0x361/0x580
[ 5.044480][ T25] nvme_alloc_ns+0x81c/0x17c0
[ 5.044483][ T25] ? kasan_quarantine_put+0x104/0x240
[ 5.044487][ T25] ? __pfx_nvme_alloc_ns+0x10/0x10
[ 5.044495][ T25] ? __pfx_nvme_find_get_ns+0x10/0x10
[ 5.044496][ T25] ? rcu_read_lock_any_held+0x45/0xa0
[ 5.044498][ T25] ? validate_chain+0x232/0x4f0
[ 5.044503][ T25] nvme_scan_ns+0x4c8/0x810
[ 5.044506][ T25] ? __pfx_nvme_scan_ns+0x10/0x10
[ 5.044508][ T25] ? find_held_lock+0x2b/0x80
[ 5.044512][ T25] ? ktime_get+0x16d/0x220
[ 5.044517][ T25] ? kvm_clock_get_cycles+0x18/0x30
[ 5.044520][ T25] ? __pfx_nvme_scan_ns_async+0x10/0x10
[ 5.044522][ T25] async_run_entry_fn+0x97/0x560
[ 5.044523][ T25] ? rcu_is_watching+0x12/0xc0
[ 5.044526][ T25] process_one_work+0xd3c/0x1350
[ 5.044532][ T25] ? __pfx_process_one_work+0x10/0x10
[ 5.044536][ T25] ? assign_work+0x16c/0x240
[ 5.044539][ T25] worker_thread+0x4da/0xd50
[ 5.044545][ T25] ? __pfx_worker_thread+0x10/0x10
[ 5.044546][ T25] kthread+0x356/0x5c0
[ 5.044548][ T25] ? __pfx_kthread+0x10/0x10
[ 5.044549][ T25] ? ret_from_fork+0x1b/0x2e0
[ 5.044552][ T25] ? __lock_release.isra.0+0x5d/0x180
[ 5.044553][ T25] ? ret_from_fork+0x1b/0x2e0
[ 5.044555][ T25] ? rcu_is_watching+0x12/0xc0
[ 5.044557][ T25] ? __pfx_kthread+0x10/0x10
[ 5.044559][ T25] ret_from_fork+0x218/0x2e0
[ 5.044561][ T25] ? __pfx_kthread+0x10/0x10
[ 5.044562][ T25] ret_from_fork_asm+0x1a/0x30
[ 5.044570][ T25] </TASK>
'''

The list is traversed while holding head->srcu (lock #2 in the log above) rather than rcu_read_lock(). Fix the warning by using the sleepable RCU helper list_for_each_entry_srcu() instead of list_for_each_entry_rcu().
Fixes: 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy") Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-30nvme-pci: refresh visible attrs after being checkedEugen Hristev
The sysfs attributes are registered early, but at that point the driver does not yet know whether they are needed. For the CMB attributes, commit e917a849c3fc ("nvme-pci: refresh visible attrs for cmb attributes") solved this problem by calling nvme_update_attrs after mapping the CMB. However, the issue persists for the HMB attributes. To solve the problem, move the call to nvme_update_attrs after nvme_setup_host_mem, which sets up the HMB.
Fixes: e917a849c3fc ("nvme-pci: refresh visible attrs for cmb attributes") Fixes: 86adbf0cdb9e ("nvme: simplify transport specific device attribute handling") Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
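The ordering problem described above can be sketched in userspace C. This is a simplified model, not the driver code: it assumes attribute visibility is computed by a callback and cached whenever attributes are refreshed. `struct dev_model`, `hmb_is_visible()`, `update_attrs()` and `setup_host_mem()` are hypothetical stand-ins for the driver state, an is_visible callback, nvme_update_attrs and nvme_setup_host_mem.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state; hmb_size models the result of HMB setup. */
struct dev_model {
    size_t hmb_size;       /* 0 until the host memory buffer is set up */
    bool   hmb_attr_shown; /* cached visibility of the HMB attribute */
};

/* Visibility callback: the attribute only makes sense once an HMB exists. */
static bool hmb_is_visible(const struct dev_model *dev)
{
    return dev->hmb_size != 0;
}

/* Re-evaluate the callback and cache the result (models nvme_update_attrs). */
static void update_attrs(struct dev_model *dev)
{
    dev->hmb_attr_shown = hmb_is_visible(dev);
}

/* Models nvme_setup_host_mem: the HMB becomes available here. */
static void setup_host_mem(struct dev_model *dev, size_t size)
{
    dev->hmb_size = size;
}

/* Refresh before HMB setup: the attribute stays hidden. */
static bool probe_refresh_too_early(void)
{
    struct dev_model dev = {0};

    update_attrs(&dev);
    setup_host_mem(&dev, 4096);
    return dev.hmb_attr_shown;
}

/* Refresh after HMB setup, as the fix does: the attribute becomes visible. */
static bool probe_refresh_after_hmb(void)
{
    struct dev_model dev = {0};

    setup_host_mem(&dev, 4096);
    update_attrs(&dev);
    return dev.hmb_attr_shown;
}
```

In this model only the second ordering exposes the attribute, which mirrors why the call had to move after the HMB setup.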
2025-06-30nvmet: fix memory leak of bio integrityDmitry Bogdanov
If nvmet receives commands with metadata, there is a continuous memory leak from the kmalloc-128 slab, or more precisely of bio->bi_integrity. Since commit bf4c89fc8797 ("block: don't call bio_uninit from bio_endio"), each user of bio_init has to call bio_uninit as well; otherwise the bio integrity payload is never freed. nvmet uses bio_init for inline bios. Uninit the inline bio to complete deallocation of its integrity payload.
Fixes: bf4c89fc8797 ("block: don't call bio_uninit from bio_endio") Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
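The init/uninit pairing can be illustrated with a small userspace model. It is a sketch only: `struct bio_model` and the `bio_model_*()` helpers are hypothetical and merely mimic the rule that completion no longer releases the integrity payload, so the owner of an initialized bio has to uninit it.

```c
#include <stdlib.h>

/* Counts integrity payloads that are currently allocated. */
static int live_integrity_allocs;

/* Hypothetical stand-in for a bio with an attached integrity payload. */
struct bio_model {
    void *integrity;
};

static void bio_model_init(struct bio_model *bio)
{
    bio->integrity = NULL;
}

static int bio_model_add_integrity(struct bio_model *bio)
{
    bio->integrity = malloc(128);
    if (!bio->integrity)
        return -1;
    live_integrity_allocs++;
    return 0;
}

/* Completion: in this model it does not release any resources. */
static void bio_model_endio(struct bio_model *bio)
{
    (void)bio;
}

/* Teardown: releases the integrity payload if one is attached. */
static void bio_model_uninit(struct bio_model *bio)
{
    if (bio->integrity) {
        free(bio->integrity);
        bio->integrity = NULL;
        live_integrity_allocs--;
    }
}

/* Returns how many payloads are still allocated after completion. */
static int outstanding_after_completion(int call_uninit)
{
    struct bio_model bio;
    int outstanding;

    bio_model_init(&bio);
    if (bio_model_add_integrity(&bio))
        return -1;
    bio_model_endio(&bio);
    if (call_uninit)
        bio_model_uninit(&bio);
    outstanding = live_integrity_allocs;
    bio_model_uninit(&bio); /* tidy up so the demo itself does not leak */
    return outstanding;
}
```

Without the uninit call one payload stays allocated per completed bio, which corresponds to the continuous leak described above.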
2025-06-30nvme: correctly account for namespace head reference counterNilay Shroff
The blktests test nvme/058 exposes an issue where the NVMe subsystem kobject entry remains stale in sysfs, causing a failure during subsequent NVMe module reloads [1]. Specifically, when attempting to register a new NVMe subsystem, the driver encounters a kobject name collision because a stale kobject still exists. Note that nvme/058 itself reports no failure and passes; the stale NVMe subsystem kobject entry in sysfs causes the observed symptom [1] only during subsequent NVMe module reloads.

This issue stems from an imbalance in the get/put usage of the namespace head (nshead) reference counter. The nshead holds a reference to the associated NVMe subsystem. If the nshead reference is not properly released, it prevents the cleanup of the subsystem's kobject, leaving a stale NVMe subsystem entry behind in sysfs.

In the failure case, the last namespace path referencing an nshead is removed, but the nshead reference is not released. This occurs because the release logic currently only puts the nshead reference when its state is LIVE. However, in configurations where ANA (Asymmetric Namespace Access) is enabled, a namespace may be associated with an ANA state that is neither optimized nor non-optimized. In this case, the nshead may never transition to LIVE, and the corresponding nshead reference is then never dropped. In fact, nvme/058 associates some NVMe namespaces with an inaccessible ANA state, so an nshead is created but its state never transitions to LIVE. The current logic therefore leaks the nshead reference for non-LIVE states.

In another scenario, during namespace allocation, the driver first allocates an nshead and then issues an Identify Namespace command. If this command fails, which can happen in tests like nvme/058 that rapidly enable and disable namespaces, we must release the reference to the newly allocated nshead. However, this reference release is currently missing in the failure path, causing an nshead reference leak.

To fix this, we now unconditionally release the nshead reference when the last NVMe path referencing the nshead is removed, regardless of the head's state. In the Identify Namespace failure case, we now also properly release the nshead reference. This ensures proper cleanup of the nshead and, consequently, of the NVMe subsystem and its associated kobject.

This change prevents stale kobject entries from lingering in sysfs and eliminates the module reload failures observed just after running nvme/058.

[1] https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/

Reported-by: yi.zhang@redhat.com Closes: https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/ Fixes: 62188639ec16 ("nvme-multipath: introduce delayed removal of the multipath head node") Tested-by: yi.zhang@redhat.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
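The get/put imbalance for non-LIVE heads can be modeled in a few lines of userspace C. This is a minimal sketch under stated assumptions: `struct head_model`, its two states and the helper names are hypothetical, and only the condition under which the last-path removal drops the reference follows the description above.

```c
#include <stdbool.h>

/* Hypothetical head states; only LIVE versus not-LIVE matters here. */
enum head_state { HEAD_NEW, HEAD_LIVE };

struct head_model {
    int refs;             /* reference count on the namespace head */
    enum head_state state;
    int nr_paths;         /* namespace paths attached to the head */
};

static void head_get(struct head_model *h) { h->refs++; }
static void head_put(struct head_model *h) { h->refs--; }

/* Behaviour before the fix: the reference is dropped only for LIVE heads. */
static void remove_path_old(struct head_model *h)
{
    h->nr_paths--;
    if (h->nr_paths == 0 && h->state == HEAD_LIVE)
        head_put(h);
}

/* Behaviour after the fix: the last path always drops the reference. */
static void remove_path_fixed(struct head_model *h)
{
    h->nr_paths--;
    if (h->nr_paths == 0)
        head_put(h);
}

/* Returns the reference count left after the only path goes away. */
static int refs_after_removal(enum head_state state, bool fixed)
{
    struct head_model h = { .refs = 0, .state = state, .nr_paths = 1 };

    head_get(&h);
    if (fixed)
        remove_path_fixed(&h);
    else
        remove_path_old(&h);
    return h.refs;
}
```

A head that never reached LIVE keeps one reference under the old rule and none under the new one; in the real driver that leftover reference is what pins the subsystem kobject.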
2025-06-30nvme: Fix incorrect cdw15 value in passthru error loggingAlok Tiwari
Fix an error in nvme_log_err_passthru() where cdw14 was incorrectly printed twice instead of cdw15. This fix ensures accurate logging of the full passthrough command payload. Fixes: 9f079dda1433 ("nvme: allow passthru cmd error logging") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
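The class of bug fixed here, a repeated format argument, is easy to reproduce in userspace. The sketch below is illustrative only: the message layout and `struct passthru_model` are assumptions and do not reproduce the actual format string used by nvme_log_err_passthru().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subset of a passthrough command: dwords 10 to 15. */
struct passthru_model {
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
};

/* Buggy variant: the last argument repeats cdw14 instead of cdw15. */
static int log_buggy(char *buf, size_t len, const struct passthru_model *c)
{
    return snprintf(buf, len,
        "cdw10=0x%x cdw11=0x%x cdw12=0x%x cdw13=0x%x cdw14=0x%x cdw15=0x%x",
        (unsigned)c->cdw10, (unsigned)c->cdw11, (unsigned)c->cdw12,
        (unsigned)c->cdw13, (unsigned)c->cdw14, (unsigned)c->cdw14);
}

/* Fixed variant: each slot prints its own dword. */
static int log_fixed(char *buf, size_t len, const struct passthru_model *c)
{
    return snprintf(buf, len,
        "cdw10=0x%x cdw11=0x%x cdw12=0x%x cdw13=0x%x cdw14=0x%x cdw15=0x%x",
        (unsigned)c->cdw10, (unsigned)c->cdw11, (unsigned)c->cdw12,
        (unsigned)c->cdw13, (unsigned)c->cdw14, (unsigned)c->cdw15);
}

/* Returns 1 if the formatted line reports the real cdw15 value. */
static int logs_real_cdw15(int fixed)
{
    struct passthru_model c = { 0x10, 0x11, 0x12, 0x13, 0x14, 0x15 };
    char buf[128];

    if (fixed)
        log_fixed(buf, sizeof(buf), &c);
    else
        log_buggy(buf, sizeof(buf), &c);
    return strstr(buf, "cdw15=0x15") != NULL;
}
```

Both variants compile cleanly because the argument types match, which is why this kind of mistake usually only shows up when someone reads the log.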
2025-06-27Merge tag 'block-6.16-20250626' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe:

 - Fixes for ublk:
     - fix C++ narrowing warnings in the uapi header
     - update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header
     - fix for the ublk ->queue_rqs() implementation, limiting a batch to just the specific task AND ring
     - ublk_get_data() error handling fix
     - sanity check more arguments in ublk_ctrl_add_dev()
     - selftest addition

 - NVMe pull request via Christoph:
     - reset delayed remove_work after reconnect
     - fix atomic write size validation

 - Fix for a warning introduced in bdev_count_inflight_rw() in this merge window

* tag 'block-6.16-20250626' of git://git.kernel.dk/linux:
  block: fix false warning in bdev_count_inflight_rw()
  ublk: sanity check add_dev input for underflow
  nvme: fix atomic write size validation
  nvme: refactor the atomic write unit detection
  nvme: reset delayed remove_work after reconnect
  ublk: setup ublk_io correctly in case of ublk_get_data() failure
  ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header
  ublk: fix narrowing warnings in UAPI header
  selftests: ublk: don't take same backing file for more than one ublk devices
  ublk: build batch from IOs in same io_ring_ctx and io task
2025-06-26nvme: fix atomic write size validationChristoph Hellwig
Don't mix the namespace and controller values, and validate the per-controller limit when probing the controller. This avoids spurious failures for controllers whose namespaces have different logical block sizes, or that report the per-namespace values only for some namespaces. It also fixes a missing queue_limits_cancel_update in an error path by removing that error path.
Fixes: 8695f060a029 ("nvme: all namespaces in a subsystem must adhere to a common atomic write size") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Tested-by: Yi Zhang <yi.zhang@redhat.com>
2025-06-26nvme: refactor the atomic write unit detectionChristoph Hellwig
Move all the code out of nvme_update_disk_info into the helper, and rename the helper to have a somewhat less clumsy name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com>
2025-06-26nvme: reset delayed remove_work after reconnectKeith Busch
The remove_work will proceed with permanently disconnecting on the initial final path failure if the head shows no paths after the delay. If a new path connects while the remove_work is pending, and if that new path happens to disconnect before that remove_work executes, the delayed removal should reset based on the most recent path disconnect time, but queue_delayed_work() won't do anything if the work is already pending. Attempt to cancel the delayed work when a new path connects, and use mod_delayed_work() in case the remove_work remains pending anyway. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
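The difference between the two workqueue calls can be shown with a small userspace model. It is a sketch, not kernel code: `struct delayed_work_model` and the `model_*()` helpers are hypothetical and only mimic the documented behaviour that queue_delayed_work() leaves an already pending work untouched while mod_delayed_work() re-arms its timer. Time is a plain counter.

```c
#include <stdbool.h>

/* Hypothetical model of a delayed work item with a single timer. */
struct delayed_work_model {
    bool pending;
    unsigned long expires;
};

/* Like queue_delayed_work(): does nothing if the work is already pending. */
static bool model_queue_delayed_work(struct delayed_work_model *w,
                                     unsigned long now, unsigned long delay)
{
    if (w->pending)
        return false;
    w->pending = true;
    w->expires = now + delay;
    return true;
}

/* Like mod_delayed_work(): always (re)arms the timer from "now". */
static bool model_mod_delayed_work(struct delayed_work_model *w,
                                   unsigned long now, unsigned long delay)
{
    bool was_pending = w->pending;

    w->pending = true;
    w->expires = now + delay;
    return was_pending;
}

/*
 * First path loss at t=0 arms the removal. A new path connects and
 * disconnects again at t=7 while the work is still pending.
 * Returns the time at which the removal will fire.
 */
static unsigned long expiry_after_second_disconnect(bool use_mod)
{
    struct delayed_work_model w = { false, 0 };
    const unsigned long delay = 10;

    model_queue_delayed_work(&w, 0, delay);
    if (use_mod)
        model_mod_delayed_work(&w, 7, delay);
    else
        model_queue_delayed_work(&w, 7, delay);
    return w.expires;
}
```

With the queue variant the removal still fires at t=10, only three ticks after the latest disconnect; with the mod variant it fires at t=17, a full delay after it.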
2025-06-14Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe:

 - Fix for a deadlock on queue freeze with zoned writes
 - Fix for zoned append emulation
 - Two bio folio fixes, for sparsemem and for very large folios
 - Fix for a performance regression introduced in 6.13 when plug insertion was changed
 - Fix for NVMe passthrough handling for polled IO
 - Document the ublk auto registration feature
 - loop lockdep warning fix

* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
  nvme: always punt polled uring_cmd end_io work to task_work
  Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
  block: Fix bvec_set_folio() for very large folios
  bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
  block: use plug request list tail for one-shot backmerge attempt
  block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
  block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
  ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
  loop: move lo_set_size() out of queue freeze
2025-06-13nvme: always punt polled uring_cmd end_io work to task_workJens Axboe
Currently NVMe uring_cmd completions will complete locally, if they are polled. This is done because those completions are always invoked from task context. And while that is true, there's no guarantee that it's invoked under the right ring context, or even task. If someone does NVMe passthrough via multiple threads and with a limited number of poll queues, then ringA may find completions from ringB. For that case, completing the request may not be sound. Always just punt the passthrough completions via task_work, which will redirect the completion, if needed. Cc: stable@vger.kernel.org Fixes: 585079b6e425 ("nvme: wire up async polling for io passthrough commands") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-08treewide, timers: Rename from_timer() to timer_container_of()Ingo Molnar
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-06Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe:

 - NVMe pull request via Christoph:
     - TCP error handling fix (Shin'ichiro Kawasaki)
     - TCP I/O stall handling fixes (Hannes Reinecke)
     - fix command limits status code (Keith Busch)
     - support vectored buffers also for passthrough (Pavel Begunkov)
     - spelling fixes (Yi Zhang)

 - MD pull request via Yu:
     - fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
     - fix max_write_behind setting for dm-raid
     - some minor cleanups

 - Integrity data direction fix and cleanup
 - bcache NULL pointer fix
 - Fix for loop missing write start/end handling
 - Decouple hardware queues and IO threads in ublk
 - Slew of ublk selftests additions and updates

* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
  nvme: spelling fixes
  nvme-tcp: fix I/O stalls on congested sockets
  nvme-tcp: sanitize request list handling
  nvme-tcp: remove tag set when second admin queue config fails
  nvme: enable vectored registered bufs for passthrough cmds
  nvme: fix implicit bool to flags conversion
  nvme: fix command limits status code
  selftests: ublk: kublk: improve behavior on init failure
  block: flip iter directions in blk_rq_integrity_map_user()
  block: drop direction param from bio_integrity_copy_user()
  selftests: ublk: cover PER_IO_DAEMON in more stress tests
  Documentation: ublk: document UBLK_F_PER_IO_DAEMON
  selftests: ublk: add stress test for per io daemons
  selftests: ublk: add functional test for per io daemons
  selftests: ublk: kublk: decouple ublk_queues from ublk server threads
  selftests: ublk: kublk: move per-thread data out of ublk_queue
  selftests: ublk: kublk: lift queue initialization out of thread
  selftests: ublk: kublk: tie sqe allocation to io instead of queue
  selftests: ublk: kublk: plumb q_id in io_uring user_data
  ublk: have a per-io daemon instead of a per-queue daemon
  ...
2025-06-04nvme: spelling fixesYi Zhang
Fix various spelling errors in comments. Signed-off-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-04nvme-tcp: fix I/O stalls on congested socketsHannes Reinecke
When the socket is busy processing, nvme_tcp_try_recv() might return -EAGAIN, but this doesn't automatically imply that the sending side is blocked, too. So check if there are pending requests once nvme_tcp_try_recv() returns -EAGAIN and continue with the sending loop to avoid I/O stalls.
Signed-off-by: Hannes Reinecke <hare@kernel.org> Acked-by: Chris Leech <cleech@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-04nvme-tcp: sanitize request list handlingHannes Reinecke
Validate the request in nvme_tcp_handle_r2t() to ensure it's not part of any list, otherwise a malicious R2T PDU might inject a loop in request list processing. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
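Why queueing an already queued request is dangerous can be seen in a generic linked-list model. This is a sketch under assumptions, not the nvme-tcp data structures: `struct req_model`, `struct queue_model` and the helpers are hypothetical, and the `queued` flag stands in for whatever check the driver performs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request that can sit on one singly linked send list. */
struct req_model {
    struct req_model *next;
    bool queued;
};

struct queue_model {
    struct req_model *head;
};

/* Unchecked push: linking a request that is already the head makes it
 * point at itself. */
static void push_unchecked(struct queue_model *q, struct req_model *r)
{
    r->next = q->head;
    q->head = r;
    r->queued = true;
}

/* Checked variant: refuse a request that is already on a list. */
static int push_checked(struct queue_model *q, struct req_model *r)
{
    if (r->queued)
        return -1;
    push_unchecked(q, r);
    return 0;
}

/* Floyd cycle detection over the list. */
static bool has_cycle(const struct queue_model *q)
{
    const struct req_model *slow = q->head, *fast = q->head;

    while (fast && fast->next) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast)
            return true;
    }
    return false;
}

/* Simulates two R2T PDUs that reference the same request. */
static bool duplicate_r2t_creates_loop(bool checked)
{
    struct queue_model q = { NULL };
    struct req_model r = { NULL, false };
    int i;

    for (i = 0; i < 2; i++) {
        if (checked)
            push_checked(&q, &r);
        else
            push_unchecked(&q, &r);
    }
    return has_cycle(&q);
}
```

In the unchecked case any code that walks the list until it reaches NULL would never terminate, which is the loop the validation is meant to rule out.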
2025-06-04nvme-tcp: remove tag set when second admin queue config failsShin'ichiro Kawasaki
Commit 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation") modified nvme_tcp_setup_ctrl() to call nvme_tcp_configure_admin_queue() twice. The first call prepares for DH-CHAP negotiation, and the second call is required for secure concatenation. However, this change triggered a KASAN slab-use-after-free BUG in blk_mq_queue_tag_busy_iter(). This BUG can be recreated by repeating the blktests test case nvme/063 a few times [1].

When the BUG happens, nvme_tcp_create_ctrl() fails in the call chain below:

nvme_tcp_create_ctrl()
 nvme_tcp_alloc_ctrl() new=true ... Alloc nvme_tcp_ctrl and admin_tag_set
 nvme_tcp_setup_ctrl() new=true
  nvme_tcp_configure_admin_queue() new=true ... Succeed
   nvme_alloc_admin_tag_set() ... Alloc the tag set for admin_tag_set
  nvme_stop_keep_alive()
  nvme_tcp_teardown_admin_queue() remove=false
  nvme_tcp_configure_admin_queue() new=false
   nvme_tcp_alloc_admin_queue() ... Fail, but do not call nvme_remove_admin_tag_set()
 nvme_uninit_ctrl()
 nvme_put_ctrl() ... Free up the nvme_tcp_ctrl and admin_tag_set

The first call of nvme_tcp_configure_admin_queue() succeeds with the new=true argument. The second call fails with the new=false argument. Because of new=false, this second call does not call nvme_remove_admin_tag_set() on failure, so the admin tag set is not removed. However, nvme_tcp_create_ctrl() assumes that nvme_tcp_setup_ctrl() would call nvme_remove_admin_tag_set(). It then frees struct nvme_tcp_ctrl, which contains the admin_tag_set field. Later on, the timeout handler accesses the admin_tag_set field and causes the KASAN slab-use-after-free BUG.

To avoid leaving the admin tag set behind, call nvme_remove_admin_tag_set() when the second nvme_tcp_configure_admin_queue() call fails. Do not return from nvme_tcp_setup_ctrl() on failure. Instead, jump to the "destroy_admin" goto label to call nvme_tcp_teardown_admin_queue(), which calls nvme_remove_admin_tag_set().
Fixes: 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1] Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
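The error-path change can be sketched as a userspace model of the two setup variants. All names below are hypothetical stand-ins for the functions named in the message, and the `remove=true` argument on the cleanup path is an assumption about how the teardown ends up removing the tag set; the point is only the contrast between returning early and jumping to a cleanup label.

```c
#include <stdbool.h>

/* Hypothetical controller state: only tracks whether the tag set exists. */
struct ctrl_model {
    bool tag_set_allocated;
};

/* Models the configure step: allocates the tag set only when new=true.
 * On failure with new=false it leaves an existing tag set alone. */
static int configure_admin_queue(struct ctrl_model *c, bool new, bool fail)
{
    if (fail)
        return -1;
    if (new)
        c->tag_set_allocated = true;
    return 0;
}

/* Models the teardown step: removes the tag set only when remove=true. */
static void teardown_admin_queue(struct ctrl_model *c, bool remove)
{
    if (remove)
        c->tag_set_allocated = false;
}

/* Before the fix: a failing second configure call returns directly. */
static int setup_ctrl_old(struct ctrl_model *c, bool second_fails)
{
    int ret = configure_admin_queue(c, true, false);

    if (ret)
        return ret;
    teardown_admin_queue(c, false);
    return configure_admin_queue(c, false, second_fails);
}

/* After the fix: the failure jumps to a label that tears everything down. */
static int setup_ctrl_fixed(struct ctrl_model *c, bool second_fails)
{
    int ret = configure_admin_queue(c, true, false);

    if (ret)
        return ret;
    teardown_admin_queue(c, false);
    ret = configure_admin_queue(c, false, second_fails);
    if (ret)
        goto destroy_admin;
    return 0;

destroy_admin:
    teardown_admin_queue(c, true);
    return ret;
}

/* Reports whether a failed setup leaves the tag set allocated. */
static bool tag_set_left_behind(bool fixed)
{
    struct ctrl_model c = { false };

    if (fixed)
        setup_ctrl_fixed(&c, true);
    else
        setup_ctrl_old(&c, true);
    return c.tag_set_allocated;
}
```

In the model, the old variant leaves the tag set allocated after a failed setup, which is the state in which the caller then frees the enclosing controller structure.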
2025-06-04nvme: enable vectored registered bufs for passthrough cmdsPavel Begunkov
nvme already supports registered buffers for non-vectored io_uring passthrough commands; enable them for the vectored mode as well. It takes an iovec, each entry of which should contain a range within the same registered buffer specified in sqe->buf_index.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-04nvme: fix implicit bool to flags conversionPavel Begunkov
nvme_map_user_request() takes flags as the last argument, but nvme_uring_cmd_io() shoves a bool "vec" into it. It behaves as expected because bool is converted to 0/1 and NVME_IOCTL_VEC is defined as 1, but it's better to pass flags explicitly. Fixes: 7b7fdb8e2dbc1 ("nvme: replace the "bool vec" arguments with flags in the ioctl path") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
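The reason the implicit conversion only works by accident can be shown in plain C. This is an illustrative sketch: the helper names are hypothetical, and the flag value is passed as a parameter so that the same code can be checked against the value 1 (which the message says NVME_IOCTL_VEC has) and against a hypothetical value of 2.

```c
#include <stdbool.h>

/* Implicit conversion: a bool becomes 0 or 1, whatever the flag's value. */
static unsigned int implicit_flags(bool vec)
{
    return vec;
}

/* Explicit conversion: the bool selects the named flag bit. */
static unsigned int explicit_flags(bool vec, unsigned int vec_flag)
{
    return vec ? vec_flag : 0;
}

/* Returns true if the given flag bit is set in flags. */
static bool is_set(unsigned int flags, unsigned int flag)
{
    return (flags & flag) != 0;
}
```

If the flag is bit 0, both conversions agree; if the flag were ever renumbered to another bit, only the explicit form would keep working.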
2025-06-04nvme: fix command limits status codeKeith Busch
The command-specific status code 0x183 was introduced in the NVMe 2.0 specification, is defined as "Command Size Limits Exceeded", and has only ever applied to DSM and Copy commands. Fix the name, and remove the incorrect translation to error codes and the special treatment for it in the target code.
Fixes: 3b7c33b28a44d4 ("nvme.h: add Write Zeroes definitions") Cc: Chaitanya Kulkarni <chaitanyak@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
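As a worked example of why 0x183 counts as "command specific", the value can be split into its two fields. The sketch assumes the layout in which bits 7:0 hold the Status Code and bits 10:8 hold the Status Code Type, with type 1 meaning command specific; the helper names are hypothetical.

```c
#include <stdint.h>

/* Status Code Type: bits 10:8 of the status value (assumed layout). */
static unsigned int status_code_type(uint16_t status)
{
    return (status >> 8) & 0x7;
}

/* Status Code: bits 7:0 of the status value (assumed layout). */
static unsigned int status_code(uint16_t status)
{
    return status & 0xff;
}
```

Under that assumption, 0x183 decodes to type 1 with code 0x83, so its meaning depends on which command returned it.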
2025-05-31Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds
Pull non-MM updates from Andrew Morton:

 - "hung_task: extend blocking task stacktrace dump to semaphore" from Lance Yang enhances the hung task detector. The detector presently dumps the blocking task's stack when it is blocked on a mutex. Lance's series extends this to semaphores

 - "nilfs2: improve sanity checks in dirty state propagation" from Wentao Liang addresses a couple of minor flaws in nilfs2

 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn fixes a couple of issues in the gdb scripts

 - "Support kdump with LUKS encryption by reusing LUKS volume keys" from Coiby Xu addresses a usability problem with kdump. When the dump device is LUKS-encrypted, the kdump kernel may not have the keys to the encrypted filesystem. A full writeup of this is in the series [0/N] cover letter

 - "sysfs: add counters for lockups and stalls" from Max Kellermann adds /sys/kernel/hardlockup_count, /sys/kernel/softlockup_count and /sys/kernel/rcu_stall_count

 - "fork: Page operation cleanups in the fork code" from Pasha Tatashin implements a number of code cleanups in fork.c

 - "scripts/gdb/symbols: determine KASLR offset on s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in the gdb scripts

* tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits)
  llist: make llist_add_batch() a static inline
  delayacct: remove redundant code and adjust indentation
  squashfs: add optional full compressed block caching
  crash_dump, nvme: select CONFIGFS_FS as built-in
  scripts/gdb/symbols: determine KASLR offset on s390 during early boot
  scripts/gdb/symbols: factor out pagination_off()
  scripts/gdb/symbols: factor out get_vmlinux()
  kernel/panic.c: format kernel-doc comments
  mailmap: update and consolidate Casey Connolly's name and email
  nilfs2: remove wbc->for_reclaim handling
  fork: define a local GFP_VMAP_STACK
  fork: check charging success before zeroing stack
  fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code
  fork: clean-up ifdef logic around stack allocation
  kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
  kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
  x86/crash: make the page that stores the dm crypt keys inaccessible
  x86/crash: pass dm crypt keys to kdump kernel
  Revert "x86/mm: Remove unused __set_memory_prot()"
  crash_dump: retrieve dm crypt keys in kdump kernel
  ...
2025-05-28Merge tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds
Pull networking updates from Paolo Abeni: "Core: - Implement the Device Memory TCP transmit path, allowing zero-copy data transmission on top of TCP from e.g. GPU memory to the wire. - Move all the IPv6 routing tables management outside the RTNL scope, under its own lock and RCU. The route control path is now 3x faster. - Convert queue related netlink ops to instance lock, reducing again the scope of the RTNL lock. This improves the control plane scalability. - Refactor the software crc32c implementation, removing unneeded abstraction layers and improving significantly the related micro-benchmarks. - Optimize the GRO engine for UDP-tunneled traffic, for a 10% performance improvement in related stream tests. - Cover more per-CPU storage with local nested BH locking; this is prep work to remove the current per-CPU lock in local_bh_disable() on PREEMPT_RT. - Introduce and use nlmsg_payload helper, combining buffer bounds verification with accessing payload carried by netlink messages. Netfilter: - Rewrite the procfs conntrack table implementation, improving considerably the dump performance. A lot of user-space tools still use this interface. - Implement support for wildcard netdevice in netdev basechain and flowtables. - Integrate conntrack information into nft trace infrastructure. - Export set count and backend name to userspace, for better introspection. BPF: - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops programs and can be controlled in similar way to traditional qdiscs using the "tc qdisc" command. - Refactor the UDP socket iterator, addressing long standing issues WRT duplicate hits or missed sockets. Protocols: - Improve TCP receive buffer auto-tuning and increase the default upper bound for the receive buffer; overall this improves the single flow maximum throughput on a 200Gb/s link by over 60%. 
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport security for connections to the AFS fileserver and VL server. - Improve TCP multipath routing, so that the source address always matches the nexthop device. - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS, and thus preventing DoS caused by passing around problematic FDs. - Retire DCCP socket. DCCP only receives updates for bugs, and major distros disable it by default. Its removal allows for better organisation of TCP fields to reduce the number of cache lines hit in the fast path. - Extend TCP drop-reason support to cover PAWS checks. Driver API: - Reorganize PTP ioctl flag support to require an explicit opt-in for the drivers, avoiding the problem of drivers not rejecting new unsupported flags. - Converted several device drivers to timestamping APIs. - Introduce per-PHY ethtool dump helpers, improving the support for dump operations targeting PHYs. Tests and tooling: - Add support for classic netlink in user space C codegen, so that ynl-c can now read, create and modify links, routes addresses and qdisc layer configuration. - Add ynl sub-types for binary attributes, allowing ynl-c to output known struct instead of raw binary data, clarifying the classic netlink output. - Extend MPTCP selftests to improve the code-coverage. - Add tests for XDP tail adjustment in AF_XDP. New hardware / drivers: - OpenVPN virtual driver: offload OpenVPN data channels processing to the kernel-space, increasing the data transfer throughput WRT the user-space implementation. - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC. - Broadcom asp-v3.0 ethernet driver. - AMD Renoir ethernet device. - RealTek MT9888 2.5G ethernet PHY driver. - Aeonsemi 10G C45 PHYs driver. 
Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - refactor the steering table handling to significantly reduce the amount of memory used - add support for complex matches in H/W flow steering - improve flow steering error handling - convert to netdev instance locking - Intel (100G, ice, igb, ixgbe, idpf): - ice: add switchdev support for LLDP traffic over VF - ixgbe: add firmware manipulation and regions devlink support - igb: introduce support for frame transmission preemption - igb: add persistent NAPI configuration - idpf: introduce RDMA support - idpf: add initial PTP support - Meta (fbnic): - extend hardware stats coverage - add devlink dev flash support - Broadcom (bnxt): - add support for RX-side device memory TCP - Wangxun (txgbe): - implement support for udp tunnel offload - complete PTP and SRIOV support for AML 25G/10G devices - Ethernet NICs embedded and virtual: - Google (gve): - add device memory TCP TX support - Amazon (ena): - support persistent per-NAPI config - Airoha: - add H/W support for L2 traffic offload - add per flow stats for flow offloading - RealTek (rtl8211): add support for WoL magic packet - Synopsys (stmmac): - dwmac-socfpga 1000BaseX support - add Loongson-2K3000 support - introduce support for hardware-accelerated VLAN stripping - Broadcom (bcmgenet): - expose more H/W stats - Freescale (enetc, dpaa2-eth): - enetc: add MAC filter, VLAN filter RSS and loopback support - dpaa2-eth: convert to H/W timestamping APIs - vxlan: convert FDB table to rhashtable, for better scalability - veth: apply qdisc backpressure on full ring to reduce TX drops - Ethernet switches: - Microchip (ksz88x3): add ETS scheduler support - Ethernet PHYs: - RealTek (rtl8211): - add support for WoL magic packet - add support for PHY LEDs - CAN: - Adds RZ/G3E CANFD support to the rcar_canfd driver. - Preparatory work for CAN-XL support. - Add self-tests framework with support for CAN physical interfaces. 
- WiFi: - mac80211: - scan improvements with multi-link operation (MLO) - Qualcomm (ath12k): - enable AHB support for IPQ5332 - add monitor interface support to QCN9274 - add multi-link operation support to WCN7850 - add 802.11d scan offload support to WCN7850 - monitor mode for WCN7850, better 6 GHz regulatory - Qualcomm (ath11k): - restore hibernation support - MediaTek (mt76): - WiFi-7 improvements - implement support for mt7990 - Intel (iwlwifi): - enhanced multi-link single-radio (EMLSR) support on 5 GHz links - rework device configuration - RealTek (rtw88): - improve throughput for RTL8814AU - RealTek (rtw89): - add multi-link operation support - STA/P2P concurrency improvements - support different SAR configs by antenna - Bluetooth: - introduce HCI Driver protocol - btintel_pcie: do not generate coredump for diagnostic events - btusb: add HCI Drv commands for configuring altsetting - btusb: add RTL8851BE device 0x0bda:0xb850 - btusb: add new VID/PID 13d3/3584 for MT7922 - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925 - btnxpuart: implement host-wakeup feature" * tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits) selftests/bpf: Fix bpf selftest build warning selftests: netfilter: Fix skip of wildcard interface test net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames net: openvswitch: Fix the dead loop of MPLS parse calipso: Don't call calipso functions for AF_INET sk. 
selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem net_sched: hfsc: Address reentrant enqueue adding class to eltree twice octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback octeontx2-pf: QOS: Perform cache sync on send queue teardown net: mana: Add support for Multi Vports on Bare metal net: devmem: ncdevmem: remove unused variable net: devmem: ksft: upgrade rx test to send 1K data net: devmem: ksft: add 5 tuple FS support net: devmem: ksft: add exit_wait to make rx test pass net: devmem: ksft: add ipv4 support net: devmem: preserve sockc_err page_pool: fix ugly page_pool formatting net: devmem: move list_add to net_devmem_bind_dmabuf. selftests: netfilter: nft_queue.sh: include file transfer duration in log message net: phy: mscc: Fix memory leak when using one step timestamping ...
2025-05-27 crash_dump, nvme: select CONFIGFS_FS as built-in (Arnd Bergmann)
Configfs can be configured as a loadable module, which causes a link-time failure for dm-crypt crash dump support: crash_dump_dm_crypt.c:(.text+0x3a4): undefined reference to `config_item_init_type_name' aarch64-linux-ld: kernel/crash_dump_dm_crypt.o: in function `configfs_dmcrypt_keys_init': crash_dump_dm_crypt.c:(.init.text+0x90): undefined reference to `config_group_init' aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xb4): undefined reference to `configfs_register_subsystem' aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xd8): undefined reference to `configfs_unregister_subsystem' This could be avoided with a dependency on CONFIGFS_FS=y, but the dependency has an additional problem of causing Kconfig dependency loops since most other uses select the symbol. Using a simple 'select CONFIGFS_FS' here in turn fails with CONFIG_DM_CRYPT=m, because that still only causes configfs to be a loadable module. The only version I found that fixes this reliably uses an additional Kconfig symbol to ensure the 'select' actually turns on configfs as builtin, with two additional changes to avoid dependency loops with nvme and sysfs. There is no compile-time dependency between configfs and sysfs, so selecting configfs from a driver with sysfs disabled does not cause link failures, only the default /sys/kernel/config mount point will not be created. Link: https://lkml.kernel.org/r/20250521160359.2132363-1-arnd@kernel.org Fixes: 6b23858fd63b ("crash_dump: make dm crypt keys persist for the kdump kernel") Fixes: 1fb470408497 ("nvme-loop: add configfs dependency") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Breno Leitao <leitao@debian.org> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Coiby Xu <coxu@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-27 net: core: Convert inet_addr_is_any() to sockaddr_storage (Kees Cook)
All the callers of inet_addr_is_any() have a sockaddr_storage-backed sockaddr. Avoid casts and switch prototype to the actual object being used. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20250521204619.2301870-1-kees@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
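The prototype change can be illustrated with a small user-space sketch. This is not the kernel implementation: the function name addr_is_any() is hypothetical, and the sketch only assumes that the check means "the address is the IPv4 or IPv6 wildcard". The point is that the caller passes its sockaddr_storage object directly and never casts at the call site.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical user-space analogue of a "is this the wildcard address"
 * check that takes the storage type the callers actually hold.
 * Unknown address families are reported as "not any".
 */
static bool addr_is_any(const struct sockaddr_storage *addr)
{
	if (addr->ss_family == AF_INET6) {
		const struct sockaddr_in6 *in6 =
			(const struct sockaddr_in6 *)addr;

		/* :: is the IPv6 unspecified (wildcard) address */
		return IN6_IS_ADDR_UNSPECIFIED(&in6->sin6_addr);
	}
	if (addr->ss_family == AF_INET) {
		const struct sockaddr_in *in =
			(const struct sockaddr_in *)addr;

		/* 0.0.0.0 is the IPv4 wildcard address */
		return in->sin_addr.s_addr == htonl(INADDR_ANY);
	}
	return false;
}
```

The family-specific casts still exist, but they are confined to the one helper that inspects ss_family first.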
2025-05-26 Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux (Linus Torvalds)
Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. 
This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...
2025-05-22 Merge tag 'block-6.15-20250522' of git://git.kernel.dk/linux (Linus Torvalds)
Pull block fixes from Jens Axboe: - Fix for a regression with setting up loop on a file system without ->write_iter() - Fix for an nvme sysfs regression * tag 'block-6.15-20250522' of git://git.kernel.dk/linux: nvme: avoid creating multipath sysfs group under namespace path devices loop: don't require ->write_iter for writable files in loop_configure
2025-05-22 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: 80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X") 4bcc063939a5 ("ice, irdma: fix an off by one in error handling code") c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers") https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au No extra adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-21 nvme-tcp: use crc32c() and skb_copy_and_crc32c_datagram_iter() (Eric Biggers)
Now that the crc32c() library function directly takes advantage of architecture-specific optimizations and there also now exists a function skb_copy_and_crc32c_datagram_iter(), it is unnecessary to go through the crypto_ahash API. Just use those functions. This is much simpler, and it also improves performance due to eliminating the crypto API overhead. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-10-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
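For readers unfamiliar with the checksum itself, the following is a minimal user-space sketch of CRC32C (Castagnoli). It is a plain bitwise reference, not the kernel's optimized crc32c() library code. The names crc32c_update() and data_digest() are hypothetical, and the all-ones seed plus final inversion in data_digest() is the conventional CRC32C framing, assumed here rather than taken from the commit.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC32C over a buffer, using the reflected Castagnoli
 * polynomial 0x82F63B78. The running value is passed in and returned
 * so a message can be checksummed in several chunks.
 */
static uint32_t crc32c_update(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			/* shift right; fold in the polynomial if a 1 fell out */
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Hypothetical digest helper: seed with all-ones, invert the result. */
static uint32_t data_digest(const void *data, size_t len)
{
	return ~crc32c_update(~0u, data, len);
}
```

Because the update step carries the running value, a payload split across several buffers can be fed in piece by piece, which is the property a copy-and-checksum helper relies on.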
2025-05-21 nvme: avoid creating multipath sysfs group under namespace path devices (Nilay Shroff)
Commit 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy") introduced the creation of the multipath sysfs group under the NVMe head gendisk device node. However, it also inadvertently added the same sysfs group under each namespace path device that the head node refers to, which is incorrect. The multipath sysfs group should only be exposed through the namespace head gendisk node. This is sufficient, as the head device already provides symbolic links to the individual namespace paths it manages. This patch fixes the issue by preventing the creation of the multipath sysfs group under namespace path devices, ensuring it only appears under the head disk node. Fixes: 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy") Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20 nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk (Nilay Shroff)
In the NVMe context, the term "shutdown" has a specific technical meaning. To avoid confusion, this commit renames the nvme_mpath_shutdown_disk function to nvme_mpath_remove_disk to better reflect its purpose (i.e. removing the disk from the system). However, nvme_mpath_remove_disk was already in use, and its functionality is related to releasing or putting the head node disk. To resolve this naming conflict and improve clarity, the existing nvme_mpath_remove_disk function is also renamed to nvme_mpath_put_disk. This renaming improves code readability and better aligns function names with their actual roles. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20 nvme: introduce multipath_always_on module param (Nilay Shroff)
Currently, a multipath head disk node is not created for single-ported NVMe adapters or private namespaces with non-unique NSID. However, creating a head node in these cases can help transparently handle transient PCIe link failures. Without a head node, features like delayed removal cannot be leveraged, making it difficult to tolerate such link failures. To address this, this commit introduces the nvme_core module parameter multipath_always_on. When multipath_always_on is set to true, it forces the creation of a multipath head node regardless of the NVMe disk or namespace type. This option thus allows the delayed removal of the head node to be used even for single-ported NVMe disks and private namespaces with a unique NSID, which helps transparently handle transient PCIe link failures. By default multipath_always_on is set to false, preserving the existing behavior. Setting it to true enables improved fault tolerance in PCIe setups. Note that enabling this option also implicitly enables nvme_core.multipath. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20 nvme-multipath: introduce delayed removal of the multipath head node (Nilay Shroff)
Currently, the multipath head node of an NVMe disk is removed as soon as all paths of the disk are removed. However, this can cause issues in scenarios such as: - Disk hot-removal followed by re-addition. - Transient PCIe link failures that trigger re-enumeration, temporarily removing and then restoring the disk. In these cases, removing the head node prematurely may lead to a head disk node name change upon re-addition, requiring applications to reopen their handles if they were performing I/O during the failure. To address this, introduce a delayed removal mechanism for the head disk node. During a transient failure, instead of removing the head disk node immediately, the system waits for a configurable timeout, allowing the disk to recover. If an application sends any I/O during the transient failure, it is queued instead of being failed immediately. If the disk comes back online within the timeout, the queued I/Os are resubmitted to the disk, ensuring seamless operation. If the disk cannot recover from the failure, the queued I/Os are failed and the application receives the error. This way, if the disk comes back online within the configured period, the head node remains unchanged, ensuring uninterrupted workloads without requiring applications to reopen device handles. A new sysfs attribute named "delayed_removal_secs" is added under the head disk blkdev for users who wish to configure the delay before the head disk node is removed. The default value of this attribute is zero seconds, ensuring no behavior change unless explicitly configured. 
Link: https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/ Link: https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/ Suggested-by: Keith Busch <kbusch@kernel.org> Suggested-by: Christoph Hellwig <hch@infradead.org> [nilay: reworked based on the original idea/POC from Christoph and Keith] Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
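The behavior described in this commit message can be modeled as a small state machine. The sketch below is a user-space illustration only: every name in it (struct demo_head, demo_submit_io() and so on) is hypothetical, time is a plain integer in seconds, and queued I/O is reduced to a counter. It is meant to show the queue, resubmit and fail transitions, not how the driver implements them.

```c
#include <stdbool.h>

enum demo_io_result { DEMO_IO_SUBMITTED, DEMO_IO_QUEUED, DEMO_IO_FAILED };

/* Hypothetical model of a multipath head node. */
struct demo_head {
	unsigned int delayed_removal_secs; /* 0 keeps immediate removal */
	unsigned int nr_paths;             /* live paths to the disk */
	unsigned int nr_queued;            /* I/Os parked while waiting */
	unsigned long remove_at;           /* deadline while pending */
	bool removal_pending;
	bool removed;
};

/* The last path went away at time 'now'. */
static void demo_last_path_gone(struct demo_head *h, unsigned long now)
{
	h->nr_paths = 0;
	if (h->delayed_removal_secs == 0) {
		h->removed = true; /* old behavior: remove at once */
		return;
	}
	h->removal_pending = true;
	h->remove_at = now + h->delayed_removal_secs;
}

/* An application submits one I/O to the head node. */
static enum demo_io_result demo_submit_io(struct demo_head *h)
{
	if (h->removed)
		return DEMO_IO_FAILED;
	if (h->nr_paths > 0)
		return DEMO_IO_SUBMITTED;
	if (h->removal_pending) {
		h->nr_queued++; /* park it until the disk returns or times out */
		return DEMO_IO_QUEUED;
	}
	return DEMO_IO_FAILED;
}

/* A path came back; returns how many queued I/Os get resubmitted. */
static unsigned int demo_path_added(struct demo_head *h)
{
	unsigned int resubmitted = h->nr_queued;

	if (h->removed)
		return 0;
	h->nr_paths++;
	h->removal_pending = false;
	h->nr_queued = 0;
	return resubmitted;
}

/* Timer check; returns how many queued I/Os were failed on expiry. */
static unsigned int demo_timer_tick(struct demo_head *h, unsigned long now)
{
	unsigned int failed;

	if (!h->removal_pending || now < h->remove_at)
		return 0;
	failed = h->nr_queued;
	h->nr_queued = 0;
	h->removal_pending = false;
	h->removed = true;
	return failed;
}
```

With delayed_removal_secs left at zero the model removes the head as soon as the last path disappears, which corresponds to the default described above.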
2025-05-20 nvme-pci: derive and better document max segments limits (Christoph Hellwig)
Redefine the max segments and max integrity limits based on the limiting factors. This keeps exactly the same values for 4k PAGE_SIZE systems, but increases the number of segments for larger page size as it properly derives the scatterlist allocation based limit for them instead of assuming a 4k PAGE_SIZE. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org>
2025-05-20 nvme-pci: use struct_size for allocation struct nvme_dev (Christoph Hellwig)
This avoids open coding the variable size array arithmetic. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Leon Romanovsky <leon@kernel.org>
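As a rough illustration of what such a helper replaces, here is a user-space sketch of sizing a structure that ends in a flexible array. The kernel's struct_size() macro is not available outside the kernel, so flex_size() below is a hypothetical stand-in that mimics its saturate-on-overflow behavior. struct demo_dev and its fields are likewise invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical device structure ending in a flexible array member. */
struct demo_dev {
	unsigned int nr_nodes;
	struct demo_pools {
		void *large;
		void *small;
	} pools[];
};

/*
 * Stand-in for a struct_size()-style helper: header size plus
 * count * element size, saturating to SIZE_MAX on overflow so a
 * following allocation fails instead of being undersized.
 */
static size_t flex_size(size_t base, size_t elem, size_t count)
{
	if (elem != 0 && count > (SIZE_MAX - base) / elem)
		return SIZE_MAX;
	return base + elem * count;
}

static struct demo_dev *demo_dev_alloc(unsigned int nr_nodes)
{
	size_t size = flex_size(sizeof(struct demo_dev),
				sizeof(struct demo_pools), nr_nodes);
	struct demo_dev *dev;

	if (size == SIZE_MAX)
		return NULL;
	dev = calloc(1, size);
	if (dev)
		dev->nr_nodes = nr_nodes;
	return dev;
}
```

The open-coded alternative, sizeof(*dev) + nr_nodes * sizeof(dev->pools[0]), computes the same value but silently wraps if the multiplication overflows.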
2025-05-20 nvme-pci: add a symolic name for the small pool size (Leon Romanovsky)
Open coding magic numbers in multiple places is never a good idea. Signed-off-by: Leon Romanovsky <leon@kernel.org> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
2025-05-20 nvme-pci: use a better encoding for small prp pool allocations (Christoph Hellwig)
Add a separate flag to encode that the transfer is using the small page sized pool, and use a normal 0..n count for the number of descriptors. Contains improvements and suggestions from Kanchan Joshi <joshi.k@samsung.com> and Leon Romanovsky <leon@kernel.org>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Leon Romanovsky <leon@kernel.org>
2025-05-20 nvme-pci: rename the descriptor pools (Christoph Hellwig)
They are used for both PRPs and SGLs, and we use descriptor elsewhere when referring to their allocations, so use that name here as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Leon Romanovsky <leon@kernel.org>
2025-05-20 nvme-pci: remove struct nvme_descriptor (Christoph Hellwig)
There is no real point in having a union of two pointer types here, just use a void pointer as we mix and match types between the arms of the union between the allocation and freeing side already. Also rename the nr_allocations field to nr_descriptors to better describe what it does. Signed-off-by: Christoph Hellwig <hch@lst.de> [leon: ported forward to include metadata SGL support] Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2025-05-20 nvme-pci: store aborted state in flags variable (Leon Romanovsky)
Instead of keeping a dedicated "bool aborted" variable, switch to a flags field that can be used for other flags as well. Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
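The pattern is the usual one of folding a boolean into a bit of a flags word. A minimal sketch follows; the type and flag names are hypothetical and do not claim to match the driver's. The second flag is included only to show why a shared flags field leaves room for later additions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-request flag bits. */
enum demo_iod_flags {
	DEMO_IOD_ABORTED    = 1u << 0, /* replaces a dedicated bool */
	DEMO_IOD_SMALL_POOL = 1u << 1, /* room for further state */
};

/* Hypothetical per-request state. */
struct demo_iod {
	uint8_t flags;
	uint8_t nr_descriptors;
};

static void demo_iod_mark_aborted(struct demo_iod *iod)
{
	iod->flags |= DEMO_IOD_ABORTED;
}

static bool demo_iod_aborted(const struct demo_iod *iod)
{
	return (iod->flags & DEMO_IOD_ABORTED) != 0;
}
```

Each additional boolean then costs one bit instead of one more struct member.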
2025-05-20 nvme-pci: don't try to use SGLs for metadata on the admin queue (Christoph Hellwig)
No admin command defined in an NVMe specification supports metadata, but to protect against vendor specific commands using metadata, ensure that we don't try to use SGLs for metadata on the admin queue, as NVMe does not support SGLs on the admin queue for the PCI transport. Do this by checking if the data transfer has been set up using SGLs, as that is required for using SGLs for metadata. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Leon Romanovsky <leon@kernel.org>
2025-05-20 nvme-pci: make PRP list DMA pools per-NUMA-node (Caleb Sander Mateos)
NVMe commands with over 8 KB of discontiguous data allocate PRP list pages from the per-nvme_device dma_pool prp_page_pool or prp_small_pool. Each call to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock. These device-global spinlocks are a significant source of contention when many CPUs are submitting to the same NVMe devices. On a workload issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2 NUMA nodes to 23 NVMe devices, we observed 2.4% of CPU time spent in _raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free. Ideally, the dma_pools would be per-hctx to minimize contention. But that could impose considerable resource costs in a system with many NVMe devices and CPUs. As a compromise, allocate per-NUMA-node PRP list DMA pools. Map each nvme_queue to the set of DMA pools corresponding to its device and its hctx's NUMA node. This reduces the _raw_spin_lock_irqsave overhead by about half, to 1.2%. Preventing the sharing of PRP list pages across NUMA nodes also makes them cheaper to initialize. Link: https://lore.kernel.org/linux-nvme/CADUfDZqa=OOTtTTznXRDmBQo1WrFcDw1hBA7XwM7hzJ-hpckcA@mail.gmail.com/T/#u Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
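The mapping described here, one set of pools per NUMA node and each queue bound to the set for its node, can be sketched in user space as follows. All names are hypothetical, the pools are reduced to an allocation counter, and locking is omitted. In the scheme the commit describes, each pool carries its own spinlock, so binding queues on different nodes to different pools is what splits the contention.

```c
#include <stddef.h>

#define DEMO_MAX_NODES 4 /* hypothetical fixed node count */

/* Hypothetical per-node pool set; a real one would hold DMA pools. */
struct demo_pools {
	int node;
	unsigned long allocs; /* stands in for pool activity */
};

struct demo_dev {
	struct demo_pools pools[DEMO_MAX_NODES];
};

struct demo_queue {
	struct demo_pools *pools; /* set once when the queue is bound */
};

static void demo_dev_init(struct demo_dev *dev)
{
	for (int node = 0; node < DEMO_MAX_NODES; node++) {
		dev->pools[node].node = node;
		dev->pools[node].allocs = 0;
	}
}

/* Bind a queue to the pool set of the NUMA node it runs on. */
static int demo_queue_bind(struct demo_dev *dev, struct demo_queue *q,
			   int numa_node)
{
	if (numa_node < 0 || numa_node >= DEMO_MAX_NODES)
		return -1;
	q->pools = &dev->pools[numa_node];
	return 0;
}

/* An allocation on this queue only touches its node's pool set. */
static void demo_queue_alloc(struct demo_queue *q)
{
	q->pools->allocs++;
}
```

Queues on the same node still share a pool set, which is the compromise between one device-wide pool and one pool per hardware queue.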
2025-05-20 nvme-pci: factor out a nvme_init_hctx_common() helper (Caleb Sander Mateos)
nvme_init_hctx() and nvme_admin_init_hctx() are very similar. In preparation for adding more logic, factor out a nvme_init_hctx_common() helper. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20 nvme-fc: do not reference lsrsp after failure (Daniel Wagner)
The lsrsp object is maintained by the LLDD. The lifetime of the lsrsp object is implicit. Because there is no explicit cleanup/free call into the LLDD, it is not safe to assume after xml_rsp_fails that the lsrsp is still valid. The LLDD could have freed the object already. With the recent changes to how fcloop tracks resources, this is now the case. Thus don't access lsrsp after xml_rsp_fails. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20 nvmet-fcloop: don't wait for lport cleanup (Daniel Wagner)
The lifetime of the fcloop_lsreq is no longer tied to the lifetime of the host or target port, so there is no need to synchronize the cleanup path anymore. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20 nvmet-fcloop: add missing fcloop_callback_host_done (Daniel Wagner)
Add the missing fcloop_call_host_done calls so that the caller frees resources when something goes wrong. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20 nvmet-fc: take tgtport refs for portentry (Daniel Wagner)
Ensure that the tgtport does not go away as long as the portentry holds a pointer to it. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20 nvmet-fc: free pending reqs on tgtport unregister (Daniel Wagner)
When nvmet_fc_unregister_targetport is called by the LLDD, it is no longer possible to communicate with the host, so pending requests will never be processed. Explicitly free them instead. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20 nvmet-fcloop: drop response if targetport is gone (Daniel Wagner)
When the target port is gone, the lsrsp pointer is invalid. Thus don't call the done function anymore; just drop the response instead. This happens when the target sends a disconnect association. After this, the target starts tearing down all resources and doesn't expect any response. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20 nvmet-fcloop: allocate/free fcloop_lsreq directly (Daniel Wagner)
fcloop depends on the host or the target to allocate the fcloop_lsreq object. This means that the lifetime of the fcloop_lsreq is tied to either the host or the target. Consequently, the host or the target must cooperate during shutdown. Unfortunately, this approach does not work well when the target forces a shutdown, as there are dependencies that are difficult to resolve in a clean way. The simplest solution is to decouple the lifetime of the fcloop_lsreq object by managing them directly within fcloop. Since this is not a performance-critical path and only a small number of LS objects are used during setup and cleanup, it does not significantly impact performance to allocate them during normal operation. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>