summaryrefslogtreecommitdiff
path: root/drivers/nvme
AgeCommit message (Collapse)Author
2025-04-11Merge tag 'block-6.15-20250411' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block fixes from Jens Axboe: "Apparently my internal clock was off, or perhaps it was just wishful thinking, but I sent out block fixes yesterday as my brain assumed it was Friday. Subsequently, that missed the NVMe fixes that should go into this weeks release as well. Hence, here's a followup with those, and another simple fix. - NVMe pull request via Christoph: - nvmet fc/fcloop refcounting fixes (Daniel Wagner) - fix missed namespace/ANA scans (Hannes Reinecke) - fix a use after free in the new TCP netns support (Kuniyuki Iwashima) - fix a NULL instead of false review in multipath (Uday Shankar) - Use strscpy() for null_blk disk name copy" * tag 'block-6.15-20250411' of git://git.kernel.dk/linux: null_blk: Use strscpy() instead of strscpy_pad() in null_add_dev() nvmet-fc: put ref when assoc->del_work is already scheduled nvmet-fc: take tgtport reference only once nvmet-fc: update tgtport ref per assoc nvmet-fc: inline nvmet_fc_free_hostport nvmet-fc: inline nvmet_fc_delete_assoc nvmet-fcloop: add ref counting to lport nvmet-fcloop: replace kref with refcount nvmet-fcloop: swap list_add_tail arguments nvme-tcp: fix use-after-free of netns by kernel TCP socket. nvme: multipath: fix return value of nvme_available_path nvme: re-read ANA log page after ns scan completes nvme: requeue namespace scan on missed AENs
2025-04-09nvmet-fc: put ref when assoc->del_work is already scheduledDaniel Wagner
Do not leak the tgtport reference when the work is already scheduled. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fc: take tgtport reference only onceDaniel Wagner
The reference counting code can be simplified. Instead taking a tgtport refrerence at the beginning of nvmet_fc_alloc_hostport and put it back if not a new hostport object is allocated, only take it when a new hostport object is allocated. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fc: update tgtport ref per assocDaniel Wagner
We need to take for each unique association a reference. nvmet_fc_alloc_hostport for each newly created association. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fc: inline nvmet_fc_free_hostportDaniel Wagner
No need for this tiny helper with only one user, let's inline it. And since the hostport ref counter needs to stay in sync, it's not optional anymore to give back the reference. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fc: inline nvmet_fc_delete_assocDaniel Wagner
No need for this tiny helper with only one user, just inline it. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fcloop: add ref counting to lportDaniel Wagner
The fcloop_lport objects live time is controlled by the user interface add_local_port and del_local_port. nport, rport and tport objects are pointing to the lport objects but here is no clear tracking. Let's introduce an explicit ref counter for the lport objects and prepare the stage for restructuring how lports are used. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fcloop: replace kref with refcountDaniel Wagner
The kref wrapper is not really adding any value ontop of refcount. Thus replace the kref API with the refcount API. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvmet-fcloop: swap list_add_tail argumentsDaniel Wagner
The newly element to be added to the list is the first argument of list_add_tail. This fix is missing dcfad4ab4d67 ("nvmet-fcloop: swap the list_add_tail arguments"). Fixes: 437c0b824dbd ("nvme-fcloop: add target to host LS request support") Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvme-tcp: fix use-after-free of netns by kernel TCP socket.Kuniyuki Iwashima
Commit 1be52169c348 ("nvme-tcp: fix selinux denied when calling sock_sendmsg") converted sock_create() in nvme_tcp_alloc_queue() to sock_create_kern(). sock_create_kern() creates a kernel socket, which does not hold a reference to netns. If the code does not manage the netns lifetime properly, use-after-free could happen. Also, TCP kernel socket with sk_net_refcnt 0 has a socket leak problem: it remains FIN_WAIT_1 if it misses FIN after close() because tcp_close() stops all timers. To fix such problems, let's hold netns ref by sk_net_refcnt_upgrade(). We had the same issue in CIFS, SMC, etc, and applied the same solution, see commit ef7134c7fc48 ("smb: client: Fix use-after-free of network namespace.") and commit 9744d2bf1976 ("smc: Fix use-after-free in tcp_write_timer_handler()."). Fixes: 1be52169c348 ("nvme-tcp: fix selinux denied when calling sock_sendmsg") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-08nvme: multipath: fix return value of nvme_available_pathUday Shankar
The function returns bool so we should return false, not NULL. No functional changes are expected. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-08nvme: re-read ANA log page after ns scan completesHannes Reinecke
When scanning for new namespaces we might have missed an ANA AEN. The NVMe base spec (NVMe Base Specification v2.1, Figure 151 'Asynchonous Event Information - Notice': Asymmetric Namespace Access Change) states: A controller shall not send this even if an Attached Namespace Attribute Changed asynchronous event [...] is sent for the same event. so we need to re-read the ANA log page after we rescanned the namespace list to update the ANA states of the new namespaces. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-07nvme: requeue namespace scan on missed AENsHannes Reinecke
Scanning for namespaces can take some time, so if the target is reconfigured while the scan is running we may miss a Attached Namespace Attribute Changed AEN. Check if the NVME_AER_NOTICE_NS_CHANGED bit is set once the scan has finished, and requeue scanning to pick up any missed change. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-05treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-03Merge tag 'block-6.15-20250403' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: - NVMe pull request via Keith: - PCI endpoint target cleanup (Damien) - Early import for uring_cmd fixed buffer (Caleb) - Multipath documentation and notification improvements (John) - Invalid pci sq doorbell write fix (Maurizio) - Queue init locking fix - Remove dead nsegs parameter from blk_mq_get_new_requests() * tag 'block-6.15-20250403' of git://git.kernel.dk/linux: block: don't grab elevator lock during queue initialization nvme-pci: skip nvme_write_sq_db on empty rqlist nvme-multipath: change the NVME_MULTIPATH config option nvme: update the multipath warning in nvme_init_ns_head nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io() nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request() nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer nvmet: pci-epf: Keep completion queues mapped block: remove unused nseg parameter
2025-04-02Merge tag 'objtool-urgent-2025-04-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull objtool fixes from Ingo Molnar: "These are objtool fixes and updates by Josh Poimboeuf, centered around the fallout from the new CONFIG_OBJTOOL_WERROR=y feature, which, despite its default-off nature, increased the profile/impact of objtool warnings: - Improve error handling and the presentation of warnings/errors - Revert the new summary warning line that some test-bot tools interpreted as new regressions - Fix a number of objtool warnings in various drivers, core kernel code and architecture code. About half of them are potential problems related to out-of-bounds accesses or potential undefined behavior, the other half are additional objtool annotations - Update objtool to latest (known) compiler quirks and objtool bugs triggered by compiler code generation - Misc fixes" * tag 'objtool-urgent-2025-04-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) objtool/loongarch: Add unwind hints in prepare_frametrace() rcu-tasks: Always inline rcu_irq_work_resched() context_tracking: Always inline ct_{nmi,irq}_{enter,exit}() sched/smt: Always inline sched_smt_active() objtool: Fix verbose disassembly if CROSS_COMPILE isn't set objtool: Change "warning:" to "error: " for fatal errors objtool: Always fail on fatal errors Revert "objtool: Increase per-function WARN_FUNC() rate limit" objtool: Append "()" to function name in "unexpected end of section" warning objtool: Ignore end-of-section jumps for KCOV/GCOV objtool: Silence more KCOV warnings, part 2 objtool, drm/vmwgfx: Don't ignore vmw_send_msg() for ORC objtool: Fix STACK_FRAME_NON_STANDARD for cold subfunctions objtool: Fix segfault in ignore_unreachable_insn() objtool: Fix NULL printf() '%s' argument in builtin-check.c:save_argv() objtool, lkdtm: Obfuscate the do_nothing() pointer objtool, regulator: rk808: Remove potential undefined behavior in rk806_set_mode_dcdc() objtool, ASoC: codecs: wcd934x: Remove potential undefined behavior in wcd934x_slim_irq_handler() objtool, Input: cyapa - Remove undefined behavior in cyapa_update_fw_store() objtool, panic: Disable SMAP in __stack_chk_fail() ...
2025-04-01Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - The series "powerpc/crash: use generic crashkernel reservation" from Sourabh Jain changes powerpc's kexec code to use more of the generic layers. - The series "get_maintainer: report subsystem status separately" from Vlastimil Babka makes some long-requested improvements to the get_maintainer output. - The series "ucount: Simplify refcounting with rcuref_t" from Sebastian Siewior cleans up and optimizing the refcounting in the ucount code. - The series "reboot: support runtime configuration of emergency hw_protection action" from Ahmad Fatoum improves the ability for a driver to perform an emergency system shutdown or reboot. - The series "Converge on using secs_to_jiffies() part two" from Easwar Hariharan performs further migrations from msecs_to_jiffies() to secs_to_jiffies(). - The series "lib/interval_tree: add some test cases and cleanup" from Wei Yang permits more userspace testing of kernel library code, adds some more tests and performs some cleanups. - The series "hung_task: Dump the blocking task stacktrace" from Masami Hiramatsu arranges for the hung_task detector to dump the stack of the blocking task and not just that of the blocked task. - The series "resource: Split and use DEFINE_RES*() macros" from Andy Shevchenko provides some cleanups to the resource definition macros. - Plus the usual shower of singleton patches - please see the individual changelogs for details. * tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) mailmap: consolidate email addresses of Alexander Sverdlin fs/procfs: fix the comment above proc_pid_wchan() relay: use kasprintf() instead of fixed buffer formatting resource: replace open coded variant of DEFINE_RES() resource: replace open coded variants of DEFINE_RES_*_NAMED() resource: replace open coded variant of DEFINE_RES_NAMED_DESC() resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED() samples: add hung_task detector mutex blocking sample hung_task: show the blocker task if the task is hung on mutex kexec_core: accept unaccepted kexec segments' destination addresses watchdog/perf: optimize bytes copied and remove manual NUL-termination lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap() lib/interval_tree: skip the check before go to the right subtree lib/interval_tree: add test case for span iteration lib/interval_tree: add test case for interval_tree_iter_xxx() helpers lib/rbtree: add random seed lib/rbtree: split tests lib/rbtree: enable userland test suite for rbtree related data structure checkpatch: describe --min-conf-desc-length scripts/gdb/symbols: determine KASLR offset on s390 ...
2025-04-01nvme-pci: skip nvme_write_sq_db on empty rqlistMaurizio Lombardi
nvme_submit_cmds() should check the rqlist before calling nvme_write_sq_db(); if the list is empty, it must return immediately. Fixes: beadf0088501 ("nvme-pci: reverse request order in nvme_queue_rqs") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01nvme-multipath: change the NVME_MULTIPATH config optionJohn Meneghini
Fix up the NVME_MULTIPATH config description so that it accurately describes what it does. Signed-off-by: John Meneghini <jmeneghi@redhat.com> Tested-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01nvme: update the multipath warning in nvme_init_ns_headJohn Meneghini
The new NVME_MULTIPATH_PARAM config option requires updates to the warning message in nvme_init_ns_head(). Signed-off-by: John Meneghini <jmeneghi@redhat.com> Tested-by: John Meneghini <jmeneghi@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io()Caleb Sander Mateos
nvme_map_user_request() is called from both nvme_submit_user_cmd() and nvme_uring_cmd_io(). But the ioucmd branch is only applicable to nvme_uring_cmd_io(). Move it to nvme_uring_cmd_io() and just pass the resulting iov_iter to nvme_map_user_request(). For NVMe passthru operations with fixed buffers, the fixed buffer lookup happens in io_uring_cmd_import_fixed(). But nvme_uring_cmd_io() can return -EAGAIN first from nvme_alloc_user_request() if all tags in the tag set are in use. This ordering difference is observable when using UBLK_U_IO_{,UN}REGISTER_IO_BUF SQEs to modify the fixed buffer table. If the NVMe passthru operation is followed by UBLK_U_IO_UNREGISTER_IO_BUF to unregister the fixed buffer and the NVMe passthru goes async, the fixed buffer lookup will fail because it happens after the unregister. Userspace should not depend on the order in which io_uring issues SQEs submitted in parallel, but it may try submitting the SQEs together and fall back on a slow path if the fixed buffer lookup fails. To make the fast path more likely, do the import before nvme_alloc_user_request(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request()Caleb Sander Mateos
The callers of nvme_map_user_request() (nvme_submit_user_cmd() and nvme_uring_cmd_io()) allocate the request, so have them free it if nvme_map_user_request() fails. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01nvme/ioctl: don't warn on vectorized uring_cmd with fixed bufferCaleb Sander Mateos
The vectorized io_uring NVMe passthru opcodes don't yet support fixed buffers. But since userspace can trigger this condition based on the io_uring SQE parameters, it shouldn't cause a kernel warning. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: 23fd22e55b76 ("nvme: wire up fixed buffer support for nvme passthrough") Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-04-01nvmet: pci-epf: Keep completion queues mappedDamien Le Moal
Instead of mapping and unmapping the completion queues memory to the host PCI address space whenever nvmet_pci_epf_cq_work() is called, map a completion queue to the host PCI address space when the completion queue is created with nvmet_pci_epf_create_cq() and unmap it when the completion queue is deleted with nvmet_pci_epf_delete_cq(). This removes the completion queue mapping/unmapping from nvmet_pci_epf_cq_work() and significantly increases performance. For a single job 4K random read QD=1 workload, the IOPS is increased from 23 KIOPS to 25 KIOPS. Some significant throughput increasde for high queue depth and large IOs workloads can also be seen. Since the functions nvmet_pci_epf_map_queue() and nvmet_pci_epf_unmap_queue() are called respectively only from nvmet_pci_epf_create_cq() and nvmet_pci_epf_delete_cq(), these functions are removed and open-coded in their respective call sites. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-27Merge tag 'soc-drivers-6.15-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull SoC driver updates from Arnd Bergmann: "These are the updates for SoC specific drivers and related subsystems: - Firmware driver updates for SCMI, FF-A and SMCCC firmware interfaces, adding support for additional firmware features including SoC identification and FF-A SRI callbacks as well as various bugfixes - Memory controller updates for Nvidia and Mediatek - Reset controller support for microchip sam9x7 and imx8qxp/imx8qm - New hardware support for multiple Mediatek, Renesas and Samsung Exynos chips - Minor updates on Zynq, Qualcomm, Amlogic, TI, Samsung, Nvidia and Apple chips There will be a follow up with a few more driver updates that are still causing build regressions at the moment" * tag 'soc-drivers-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (97 commits) irqchip: Add support for Amlogic A4 and A5 SoCs dt-bindings: interrupt-controller: Add support for Amlogic A4 and A5 SoCs reset: imx: fix incorrect module device table dt-bindings: power: qcom,kpss-acc-v2: add qcom,msm8916-acc compatible bus: qcom-ssc-block-bus: Fix the error handling path of qcom_ssc_block_bus_probe() bus: qcom-ssc-block-bus: Remove some duplicated iounmap() calls soc: qcom: pd-mapper: Add support for SDM630/636 reset: imx: Add SCU reset driver for i.MX8QXP and i.MX8QM dt-bindings: firmware: imx: add property reset-controller dt-bindings: reset: atmel,at91sam9260-reset: add sam9x7 memory: mtk-smi: Add ostd setting for mt8192 dt-bindings: soc: samsung: exynos-usi: Drop unnecessary status from example firmware: tegra: bpmp: Fix typo in bpmp-abi.h soc/tegra: pmc: Use str_enable_disable-like helpers soc: samsung: include linux/array_size.h where needed firmware: arm_scmi: use ioread64() instead of ioread64_hi_lo() soc: mediatek: mtk-socinfo: Add extra entry for MT8395AV/ZA Genio 1200 soc: mediatek: mt8188-mmsys: Add support for DSC on VDO0 soc: mediatek: mmsys: Migrate all tables to MMSYS_ROUTE() macro soc: mediatek: mt8365-mmsys: Fix routing table masks and values ...
2025-03-26Merge tag 'for-6.15/block-20250322' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - Fixes for integrity handling - NVMe pull request via Keith: - Secure concatenation for TCP transport (Hannes) - Multipath sysfs visibility (Nilay) - Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li) - Correct use of 64-bit BARs for pci-epf target (Niklas) - Socket fix for selinux when used in containers (Peijie) - MD pull request via Yu: - fix recovery can preempt resync (Li Nan) - fix md-bitmap IO limit (Su Yue) - fix raid10 discard with REQ_NOWAIT (Xiao Ni) - fix raid1 memory leak (Zheng Qixing) - fix mddev uaf (Yu Kuai) - fix raid1,raid10 IO flags (Yu Kuai) - some refactor and cleanup (Yu Kuai) - Series cleaning up and fixing bugs in the bad block handling code - Improve support for write failure simulation in null_blk - Various lock ordering fixes - Fixes for locking for debugfs attributes - Various ublk related fixes and improvements - Cleanups for blk-rq-qos wait handling - blk-throttle fixes - Fixes for loop dio and sync handling - Fixes and cleanups for the auto-PI code - Block side support for hardware encryption keys in blk-crypto - Various cleanups and fixes * tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits) nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi) nvme-tcp: fix selinux denied when calling sock_sendmsg nvmet: pci-epf: Always configure BAR0 as 64-bit nvmet: Remove duplicate uuid_copy nvme: zns: Simplify nvme_zone_parse_entry() nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls nvmet-fc: Remove unused functions nvme-pci: remove stale comment nvme-fc: Utilise min3() to simplify queue count calculation nvme-multipath: Add visibility for queue-depth io-policy nvme-multipath: Add visibility for numa io-policy nvme-multipath: Add visibility for round-robin io-policy nvmet: add tls_concat and tls_key debugfs entries nvmet-tcp: support secure channel concatenation nvmet: Add 'sq' argument to alloc_ctrl_args nvme-fabrics: reset admin connection for secure concatenation nvme-tcp: request secure channel concatenation nvme-keyring: add nvme_tls_psk_refresh() nvme: add nvme_auth_derive_tls_psk() nvme: add nvme_auth_generate_digest() ...
2025-03-26Merge tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: "This is the first of the io_uring pull requests for the 6.15 merge window, there will be others once the net tree has gone in. This contains: - Cleanup and unification of cancelation handling across various request types. - Improvement for bundles, supporting them both for incrementally consumed buffers, and for non-multishot requests. - Enable toggling of using iowait while waiting on io_uring events or not. Unfortunately this is still tied with CPU frequency boosting on short waits, as the scheduler side has not been very receptive to splitting the (useless) iowait stat from the cpufreq implied boost. - Add support for kbuf nodes, enabling zero-copy support for the ublk block driver. - Various cleanups for resource node handling. - Series greatly cleaning up the legacy provided (non-ring based) buffers. For years, we've been pushing the ring provided buffers as the way to go, and that is what people have been using. Reduce the complexity and code associated with legacy provided buffers. - Series cleaning up the compat handling. - Series improving and cleaning up the recvmsg/sendmsg iovec and msg handling. - Series of cleanups for io-wq. - Start adding a bunch of selftests. The liburing repository generally carries feature and regression tests for everything, but at least for ublk initially, we'll try and go the route of having it in selftests as well. We'll see how this goes, might decide to migrate more tests this way in the future. - Various little cleanups and fixes" * tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits) selftests: ublk: add stripe target selftests: ublk: simplify loop io completion selftests: ublk: enable zero copy for null target selftests: ublk: prepare for supporting stripe target selftests: ublk: move common code into common.c selftests: ublk: increase max buffer size to 1MB selftests: ublk: add single sqe allocator helper selftests: ublk: add generic_01 for verifying sequential IO order selftests: ublk: fix starting ublk device io_uring: enable toggle of iowait usage when waiting on CQEs selftests: ublk: fix write cache implementation selftests: ublk: add variable for user to not show test result selftests: ublk: don't show `modprobe` failure selftests: ublk: add one dependency header io_uring/kbuf: enable bundles for incrementally consumed buffers Revert "io_uring/rsrc: simplify the bvec iter count calculation" selftests: ublk: improve test usability selftests: ublk: add stress test for covering IO vs. killing ublk server selftests: ublk: add one stress test for covering IO vs. removing device selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests ...
2025-03-25objtool, nvmet: Fix out-of-bounds stack access in nvmet_ctrl_state_show()Josh Poimboeuf
The csts_state_names[] array only has six sparse entries, but the iteration code in nvmet_ctrl_state_show() iterates seven, resulting in a potential out-of-bounds stack read. Fix that. Fixes the following warning with an UBSAN kernel: vmlinux.o: warning: objtool: .text.nvmet_ctrl_state_show: unexpected end of section Fixes: 649fd41420a8 ("nvmet: add debugfs support") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/f1f60858ee7a941863dc7f5506c540cb9f97b5f6.1742852847.git.jpoimboe@kernel.org Closes: https://lore.kernel.org/oe-kbuild-all/202503171547.LlCTJLQL-lkp@intel.com/
2025-03-20nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi)Li Haoran
This patch replaces max(a, min(b, c)) by clamp(val, lo, hi) in the nvme driver. The clamp() macro explicitly expresses the intent of constraining a value within bounds, improving code readability. Signed-off-by: Li Haoran <li.haoran7@zte.com.cn> Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme-tcp: fix selinux denied when calling sock_sendmsgPeijie Shao
In a SELinux enabled kernel, socket_create() initializes the security label of the socket using the security label of the calling process, this typically works well. However, in a containerized environment like Kubernetes, problem arises when a privileged container(domain spc_t) connects to an NVMe target and mounts the NVMe as persistent storage for unprivileged containers(domain container_t). This is because the container_t domain cannot access resources labeled with spc_t, resulting in socket_sendmsg returning -EACCES. The solution is to use socket_create_kern() instead of socket_create(), which labels the socket context to kernel_t. Access control will then be handled by the VFS layer rather than the socket itself. Signed-off-by: Peijie Shao <shaopeijie@cestc.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvmet: pci-epf: Always configure BAR0 as 64-bitNiklas Cassel
NVMe PCIe Transport Specification 1.1, section 2.1.10, claims that the BAR0 type is Implementation Specific. However, in NVMe 1.1, the type is required to be 64-bit. Thus, to make our PCI EPF work on as many host systems as possible, always configure the BAR0 type to be 64-bit. In the rare case that the underlying PCI EPC does not support configuring BAR0 as 64-bit, the call to pci_epc_set_bar() will fail, and we will return a failure back to the user. This should not be a problem, as most PCI EPCs support configuring a BAR as 64-bit (and those EPCs with .only_64bit set to true in epc_features only support configuring the BAR as 64-bit). Tested-by: Damien Le Moal <dlemoal@kernel.org> Fixes: 0faa0fe6f90e ("nvmet: New NVMe PCI endpoint function target driver") Signed-off-by: Niklas Cassel <cassel@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvmet: Remove duplicate uuid_copyMike Christie
We do uuid_copy twice in nvmet_alloc_ctrl so this patch deletes one of the calls. Signed-off-by: Mike Christie <michael.christie@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme: zns: Simplify nvme_zone_parse_entry()Damien Le Moal
Instead of passing a pointer to a struct nvme_ctrl and a pointer to a struct nvme_ns_head as the first two arguments of nvme_zone_parse_entry(), pass only a pointer to a struct nvme_ns as both the controller structure and ns head structure can be infered from the namespace structure. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvmet: pci-epf: Remove redundant 'flush_workqueue()' callsChen Ni
'destroy_workqueue()' already drains the queue before destroying it, so there is no need to flush it explicitly. Remove the redundant 'flush_workqueue()' calls. This was generated with coccinelle: @@ expression E; @@ - flush_workqueue(E); destroy_workqueue(E); Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvmet-fc: Remove unused functionsWangYuli
The functions nvmet_fc_iodnum() and nvmet_fc_fodnum() are currently unutilized. Following commit c53432030d86 ("nvme-fabrics: Add target support for FC transport"), which introduced these two functions, they have not been used at all in practice. Remove them to resolve the compiler warnings. Fix follow errors with clang-19 when W=1e: drivers/nvme/target/fc.c:177:1: error: unused function 'nvmet_fc_iodnum' [-Werror,-Wunused-function] 177 | nvmet_fc_iodnum(struct nvmet_fc_ls_iod *iodptr) | ^~~~~~~~~~~~~~~ drivers/nvme/target/fc.c:183:1: error: unused function 'nvmet_fc_fodnum' [-Werror,-Wunused-function] 183 | nvmet_fc_fodnum(struct nvmet_fc_fcp_iod *fodptr) | ^~~~~~~~~~~~~~~ 2 errors generated. make[8]: *** [scripts/Makefile.build:207: drivers/nvme/target/fc.o] Error 1 make[7]: *** [scripts/Makefile.build:465: drivers/nvme/target] Error 2 make[6]: *** [scripts/Makefile.build:465: drivers/nvme] Error 2 make[6]: *** Waiting for unfinished jobs.... Fixes: c53432030d86 ("nvme-fabrics: Add target support for FC transport") Signed-off-by: WangYuli <wangyuli@uniontech.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme-pci: remove stale commentBaruch Siach
The ns variable has been removed in commit 62451a2b2e7e ("nvme: separate command prep and issue"). Drop reference to ns in comment. Fixes: 62451a2b2e7e ("nvme: separate command prep and issue") Signed-off-by: Baruch Siach <baruch@tkos.co.il> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme-fc: Utilise min3() to simplify queue count calculationQasim Ijaz
Refactor nvme_fc_create_io_queues() and nvme_fc_recreate_io_queues() to use the min3() macro to find the minimum between 3 values instead of multiple min()'s. This shortens the code and makes it easier to read. Signed-off-by: Qasim Ijaz <qasdev00@gmail.com> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme-multipath: Add visibility for queue-depth io-policyNilay Shroff
This patch helps add nvme native multipath visibility for queue-depth io-policy. It adds a new attribute file named "queue_depth" under namespace device path node which would print the number of active/ in-flight I/O requests currently queued for the given path. For instance, if we have a shared namespace accessible from two different controllers/paths then accessing head block node of the shared namespace would show the following output: $ ls -l /sys/block/nvme1n1/multipath/ nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1 nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1 In the above example, nvme1n1 is head gendisk node created for a shared namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1 paths. For queue-depth io-policy we can then refer the "queue_depth" attribute file created under each namespace path: $ cat /sys/block/nvme1n1/multipath/nvme1c1n1/queue_depth 518 $cat /sys/block/nvme1n1/multipath/nvme1c3n1/queue_depth 504 >From the above output, we can infer that I/O workload targeted at nvme1n1 uses two paths nvme1c1n1 and nvme1c3n1 and the current queue depth of each path is 518 and 504 respectively. Reading "queue_depth" file when configured io-policy is anything but queue-depth would show no output. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme-multipath: Add visibility for numa io-policyNilay Shroff
This patch helps add nvme native multipath visibility for numa io-policy. It adds a new attribute file named "numa_nodes" under namespace gendisk device path node which prints the list of numa nodes preferred by the given namespace path. The numa nodes value is comma delimited list of nodes or A-B range of nodes. For instance, if we have a shared namespace accessible from two different controllers/paths then accessing head node of the shared namespace would show the following output: $ ls -l /sys/block/nvme1n1/multipath/ nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1 nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1 In the above example, nvme1n1 is head gendisk node created for a shared namespace and this namespace is accessible from nvme1c1n1 and nvme1c3n1 paths. For numa io-policy we can then refer the "numa_nodes" attribute file created under each namespace path: $ cat /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes 0-1 $ cat /sys/block/nvme1n1/multipath/nvme1c3n1/numa_nodes 2-3 >From the above output, we infer that I/O workload targeted at nvme1n1 and running on numa nodes 0 and 1 would prefer using path nvme1c1n1. Similarly, I/O workload running on numa nodes 2 and 3 would prefer using path nvme1c3n1. Reading "numa_nodes" file when configured io-policy is anything but numa would show no output. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme-multipath: Add visibility for round-robin io-policyNilay Shroff
This patch helps add nvme native multipath visibility for round-robin io-policy. It creates a "multipath" sysfs directory under head gendisk device node directory and then from "multipath" directory it adds a link to each namespace path device the head node refers. For instance, if we have a shared namespace accessible from two different controllers/paths then we create a soft link to each path device from head disk node as shown below: $ ls -l /sys/block/nvme1n1/multipath/ nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1 nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1 In the above example, nvme1n1 is head gendisk node created for a shared namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1 paths. For round-robin I/O policy, we could easily infer from the above output that I/O workload targeted to nvme1n1 would toggle across paths nvme1c1n1 and nvme1c3n1. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvmet: add tls_concat and tls_key debugfs entriesHannes Reinecke
Add debugfs entries to display the 'concat' and 'tls_key' controller attributes. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvmet-tcp: support secure channel concatenationHannes Reinecke
Evaluate the SC_C flag during DH-CHAP-HMAC negotiation to check if secure concatenation as specified in the NVMe Base Specification v2.1, section 8.3.4.3: "Secure Channel Concatenationand" is requested. If requested the generated PSK is inserted into the keyring once negotiation has finished allowing for an encrypted connection once the admin queue is restarted. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvmet: Add 'sq' argument to alloc_ctrl_argsHannes Reinecke
For secure concatenation the result of the TLS handshake will be stored in the 'sq' struct, so add it to the alloc_ctrl_args struct. Cc: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme-fabrics: reset admin connection for secure concatenationHannes Reinecke
When secure concatenation is requested the connection needs to be reset to enable TLS encryption on the new cnnection. That implies that the original connection used for the DH-CHAP negotiation really shouldn't be used, and we should reset as soon as the DH-CHAP negotiation has succeeded on the admin queue. Based on an idea from Sagi. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme-tcp: request secure channel concatenationHannes Reinecke
Add a fabrics option 'concat' to request secure channel concatenation as specified the NVME Base Specification v2.1, section 8.3.4.3: Secure Channel Concatenation. When secure channel concatenation is enabled a 'generated PSK' is inserted into the keyring such that it's available after reset. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme-keyring: add nvme_tls_psk_refresh()Hannes Reinecke
Add a function to refresh a generated PSK in the specified keyring. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme: add nvme_auth_derive_tls_psk()Hannes Reinecke
Add a function to derive the TLS PSK as specified TP8018. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme: add nvme_auth_generate_digest()Hannes Reinecke
Add a function to calculate the PSK digest as specified in TP8018. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme: add nvme_auth_generate_psk()Hannes Reinecke
Add a function to generate a NVMe PSK from the shared credentials negotiated by DH-HMAC-CHAP. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-17nvme: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies() to avoid the multiplication This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies +secs_to_jiffies (E - * \( 1000 \| MSEC_PER_SEC \) ) Link: https://lkml.kernel.org/r/20250225-converge-secs-to-jiffies-part-two-v3-11-a43967e36c88@linux.microsoft.com Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Acked-by: Keith Busch <kbusch@kernel.org> Cc: Carlos Maiolino <cem@kernel.org> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dick Kennedy <dick.kennedy@broadcom.com> Cc: Dongsheng Yang <dongsheng.yang@easystack.cn> Cc: Fabio Estevam <festevam@gmail.com> Cc: Frank Li <frank.li@nxp.com> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Ilpo Jarvinen <ilpo.jarvinen@linux.intel.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: James Smart <james.smart@broadcom.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Brown <broonie@kernel.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Nicolas Palix <nicolas.palix@imag.fr> Cc: Niklas Cassel <cassel@kernel.org> Cc: Oded Gabbay <ogabbay@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Sebastian Reichel <sre@kernel.org> Cc: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com> Cc: Takashi Iwai <tiwai@suse.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>