summaryrefslogtreecommitdiff
path: root/drivers/nvme/host/tcp.c
AgeCommit message (Collapse)Author
2025-03-05nvme-tcp: fix signedness bug in nvme_tcp_init_connection()Dan Carpenter
The kernel_recvmsg() function returns an int which could be either negative error codes or the number of bytes received. The problem is that the condition: if (ret < sizeof(*icresp)) { is type promoted to type unsigned long and negative values are treated as high positive values which is success, when they should be treated as failure. Handle invalid positive returns separately from negative error codes to avoid this problem. Fixes: 578539e09690 ("nvme-tcp: fix connect failure on receiving partial ICResp PDU") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-27nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu()Maurizio Lombardi
nvme_tcp_recv_pdu() doesn't check the validity of the header length. When header digests are enabled, a target might send a packet with an invalid header length (e.g. 255), causing nvme_tcp_verify_hdgst() to access memory outside the allocated area and cause memory corruptions by overwriting it with the calculated digest. Fix this by rejecting packets with an unexpected header length. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-27nvme-tcp: Fix a C2HTermReq error messageMaurizio Lombardi
In H2CTermReq, a FES with value 0x05 means "R2T Limit Exceeded"; but in C2HTermReq the same value has a different meaning (Data Transfer Limit Exceeded). Fixes: 84e009042d0f ("nvme-tcp: add basic support for the C2HTermReq PDU") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-24nvme-tcp: fix possible UAF in nvme_tcp_pollSagi Grimberg
nvme_tcp_poll() may race with the send path error handler because it may complete the request while it is actively being polled for completion, resulting in a UAF panic [1]: We should make sure to stop polling when we see an error when trying to read from the socket. Hence make sure to propagate the error so that the block layer breaks the polling cycle. [1]: -- [35665.692310] nvme nvme2: failed to send request -13 [35665.702265] nvme nvme2: unsupported pdu type (3) [35665.702272] BUG: kernel NULL pointer dereference, address: 0000000000000000 [35665.702542] nvme nvme2: queue 1 receive failed: -22 [35665.703209] #PF: supervisor write access in kernel mode [35665.703213] #PF: error_code(0x0002) - not-present page [35665.703214] PGD 8000003801cce067 P4D 8000003801cce067 PUD 37e6f79067 PMD 0 [35665.703220] Oops: 0002 [#1] SMP PTI [35665.703658] nvme nvme2: starting error recovery [35665.705809] Hardware name: Inspur aaabbb/YZMB-00882-104, BIOS 4.1.26 09/22/2022 [35665.705812] Workqueue: kblockd blk_mq_requeue_work [35665.709172] RIP: 0010:_raw_spin_lock+0xc/0x30 [35665.715788] Call Trace: [35665.716201] <TASK> [35665.716613] ? show_trace_log_lvl+0x1c1/0x2d9 [35665.717049] ? show_trace_log_lvl+0x1c1/0x2d9 [35665.717457] ? blk_mq_request_bypass_insert+0x2c/0xb0 [35665.717950] ? __die_body.cold+0x8/0xd [35665.718361] ? page_fault_oops+0xac/0x140 [35665.718749] ? blk_mq_start_request+0x30/0xf0 [35665.719144] ? nvme_tcp_queue_rq+0xc7/0x170 [nvme_tcp] [35665.719547] ? exc_page_fault+0x62/0x130 [35665.719938] ? asm_exc_page_fault+0x22/0x30 [35665.720333] ? _raw_spin_lock+0xc/0x30 [35665.720723] blk_mq_request_bypass_insert+0x2c/0xb0 [35665.721101] blk_mq_requeue_work+0xa5/0x180 [35665.721451] process_one_work+0x1e8/0x390 [35665.721809] worker_thread+0x53/0x3d0 [35665.722159] ? process_one_work+0x390/0x390 [35665.722501] kthread+0x124/0x150 [35665.722849] ? set_kthread_struct+0x50/0x50 [35665.723182] ret_from_fork+0x1f/0x30 Reported-by: Zhang Guanghui <zhang.guanghui@cestc.cn> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18nvme-tcp: fix connect failure on receiving partial ICResp PDUCaleb Sander Mateos
nvme_tcp_init_connection() attempts to receive an ICResp PDU but only checks that the return value from recvmsg() is non-negative. If the sender closes the TCP connection or sends fewer than 128 bytes, this check will pass even though the full PDU wasn't received. Ensure the full ICResp PDU is received by checking that recvmsg() returns the expected 128 bytes. Additionally set the MSG_WAITALL flag for recvmsg(), as a sender could split the ICResp over multiple TCP frames. Without MSG_WAITALL, recvmsg() could return prematurely with only part of the PDU. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18nvme: tcp: Fix compilation warning with W=1Damien Le Moal
When compiling with W=1, a warning result for the function nvme_tcp_set_queue_io_cpu(): host/tcp.c:1578: warning: Function parameter or struct member 'queue' not described in 'nvme_tcp_set_queue_io_cpu' host/tcp.c:1578: warning: expecting prototype for Track the number of queues assigned to each cpu using a global per(). Prototype was for nvme_tcp_set_queue_io_cpu() instead Avoid this warning by using the regular comment format for the function nvme_tcp_set_queue_io_cpu() instead of the kdoc comment format. Fixes: 32193789878c ("nvme-tcp: Fix I/O queue cpu spreading for multiple controllers") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-18nvme-tcp: add basic support for the C2HTermReq PDUMaurizio Lombardi
Previously, the NVMe/TCP host driver did not handle the C2HTermReq PDU, instead printing "unsupported pdu type (3)" when received. This patch adds support for processing the C2HTermReq PDU, allowing the driver to print the Fatal Error Status field. Example of output: nvme nvme4: Received C2HTermReq (FES = Invalid PDU Header Field) Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-20Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe pull requests via Keith: - Target support for PCI-Endpoint transport (Damien) - TCP IO queue spreading fixes (Sagi, Chaitanya) - Target handling for "limited retry" flags (Guixen) - Poll type fix (Yongsoo) - Xarray storage error handling (Keisuke) - Host memory buffer free size fix on error (Francis) - MD pull requests via Song: - Reintroduce md-linear (Yu Kuai) - md-bitmap refactor and fix (Yu Kuai) - Replace kmap_atomic with kmap_local_page (David Reaver) - Quite a few queue freeze and debugfs deadlock fixes Ming introduced lockdep support for this in the 6.13 kernel, and it has (unsurprisingly) uncovered quite a few issues - Use const attributes for IO schedulers - Remove bio ioprio wrappers - Fixes for stacked device atomic write support - Refactor queue affinity helpers, in preparation for better supporting isolated CPUs - Cleanups of loop O_DIRECT handling - Cleanup of BLK_MQ_F_* flags - Add rotational support for null_blk - Various fixes and cleanups * tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits) block: Don't trim an atomic write block: Add common atomic writes enable flag md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add() block: limit disk max sectors to (LLONG_MAX >> 9) block: Change blk_stack_atomic_writes_limits() unit_min check block: Ensure start sector is aligned for stacking atomic writes blk-mq: Move more error handling into blk_mq_submit_bio() block: Reorder the request allocation code in blk_mq_submit_bio() nvme: fix bogus kzalloc() return check in nvme_init_effects_log() md/md-bitmap: move bitmap_{start, end}write to md upper layer md/raid5: implement pers->bitmap_sector() md: add a new callback pers->bitmap_sector() md/md-bitmap: remove the last parameter for bimtap_ops->endwrite() md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() md: Replace deprecated kmap_atomic() with kmap_local_page() md: reintroduce md-linear partitions: ldm: remove the initial kernel-doc notation blk-cgroup: rwstat: fix kernel-doc warnings in header file blk-cgroup: fix kernel-doc warnings in header file nbd: fix partial sending ...
2025-01-10nvme-tcp: Fix I/O queue cpu spreading for multiple controllersSagi Grimberg
Since day-1 we are assigning the queue io_cpu very naively. We always base the queue id (controller scope) and assign it its matching cpu from the online mask. This works fine when the number of queues match the number of cpu cores. The problem starts when we have less queues than cpu cores. First, we should take into account the mq_map and select a cpu within the cpus that are assigned to this queue by the mq_map in order to minimize cross numa cpu bouncing. Second, even worse is that we don't take into account multiple controllers may have assigned queues to a given cpu. As a result we may simply compund more and more queues on the same set of cpus, which is suboptimal. We fix this by introducing global per-cpu counters that tracks the number of queues assigned to each cpu, and we select the least used cpu based on the mq_map and the per-cpu counters, and assign it as the queue io_cpu. The behavior for a single controller is slightly optimized by selecting better cpu candidates by consulting with the mq_map, and multiple controllers are spreading queues among cpu cores much better, resulting in lower average cpu load, and less likelihood to hit hotspots. Note that the accounting is not 100% perfect, but we don't need to be, we're simply putting our best effort to select the best candidate cpu core that we find at any given point. Another byproduct is that every controller reset/reconnect may change the queues io_cpu mapping, based on the current LRU accounting scheme. Here is the baseline queue io_cpu assignment for 4 controllers, 2 queues per controller, and 4 cpus on the host: nvme1: queue 0: using cpu 0 nvme1: queue 1: using cpu 1 nvme2: queue 0: using cpu 0 nvme2: queue 1: using cpu 1 nvme3: queue 0: using cpu 0 nvme3: queue 1: using cpu 1 nvme4: queue 0: using cpu 0 nvme4: queue 1: using cpu 1 And this is the fixed io_cpu assignment: nvme1: queue 0: using cpu 0 nvme1: queue 1: using cpu 2 nvme2: queue 0: using cpu 1 nvme2: queue 1: using cpu 3 nvme3: queue 0: using cpu 0 nvme3: queue 1: using cpu 2 nvme4: queue 0: using cpu 1 nvme4: queue 1: using cpu 3 Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Suggested-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [fixed kbuild reported errors] Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-27nvme-tcp: remove nvme_tcp_destroy_io_queues()Chunguang.xu
Now when destroying the IO queue we call nvme_tcp_stop_io_queues() twice, nvme_tcp_destroy_io_queues() has an unnecessary call. Here we try to remove nvme_tcp_destroy_io_queues() and merge it into nvme_tcp_teardown_io_queues(), simplify the code and align with nvme-rdma, make it easy to maintaince. Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04nvme-tcp: simplify nvme_tcp_teardown_io_queues()Chunguang.xu
As nvme_tcp_teardown_io_queues() is the only one caller of nvme_tcp_destroy_admin_queue(), so we can merge it into nvme_tcp_teardown_io_queues() to simplify the code. Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04nvme-tcp: no need to quiesce admin_q in nvme_tcp_teardown_io_queues()Chunguang.xu
As we quiesce admin_q in nvme_tcp_teardown_admin_queue(), so we should no need to quiesce it in nvme_tcp_reaardown_io_queues(), make things simple. Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-12-04nvme-tcp: fix the memleak while create new ctrl failedChunguang.xu
Now while we create new ctrl failed, we have not free the tagset occupied by admin_q, here try to fix it. Fixes: fd1418de10b9 ("nvme-tcp: avoid open-coding nvme_tcp_teardown_admin_queue()") Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-10-03nvme: tcp: avoid race between queue_lock lock and destroyHannes Reinecke
Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during error recovery") added a mutex_lock() call for the queue->queue_lock in nvme_tcp_get_address(). However, the mutex_lock() races with mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below. DEBUG_LOCKS_WARN_ON(lock->magic != lock) WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 __mutex_lock+0xcf0/0x1220 Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs] CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:__mutex_lock+0xcf0/0x1220 Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 RSP: 0018:ffff88811305f760 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001 RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341 R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000 R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058 FS: 00007f9713ae4980(0000) GS:ffff8883ae180000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __warn.cold+0x5b/0x1af ? __mutex_lock+0xcf0/0x1220 ? report_bug+0x1ec/0x390 ? handle_bug+0x3c/0x80 ? exc_invalid_op+0x13/0x40 ? asm_exc_invalid_op+0x16/0x20 ? __mutex_lock+0xcf0/0x1220 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] ? __pfx___mutex_lock+0x10/0x10 ? __lock_acquire+0xd6a/0x59e0 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp] ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp] nvme_sysfs_show_address+0x81/0xc0 [nvme_core] dev_attr_show+0x42/0x80 ? __asan_memset+0x1f/0x40 sysfs_kf_seq_show+0x1f0/0x370 seq_read_iter+0x2cb/0x1130 ? rw_verify_area+0x3b1/0x590 ? __mutex_lock+0x433/0x1220 vfs_read+0x6a6/0xa20 ? lockdep_hardirqs_on+0x78/0x100 ? __pfx_vfs_read+0x10/0x10 ksys_read+0xf7/0x1d0 ? __pfx_ksys_read+0x10/0x10 ? __x64_sys_openat+0x105/0x1d0 do_syscall_64+0x93/0x180 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx_ksys_read+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f9713f55cfa Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011 RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148 R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0 </TASK> The WARN is observed when the blktests test case nvme/014 is repeated with tcp transport. It is rare, and 200 times repeat is required to recreate in some test environments. To avoid the WARN, check the NVME_TCP_Q_LIVE flag before locking queue->queue_lock. The flag is cleared long time before the lock gets destroyed. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22nvme-tcp: check for invalidated or revoked keyHannes Reinecke
key_lookup() will always return a key, even if that key is revoked or invalidated. So check for invalid keys before continuing. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-08-22nvme-tcp: sanitize TLS key handlingHannes Reinecke
There is a difference between TLS configured (ie the user has provisioned/requested a key) and TLS enabled (ie the connection is encrypted with TLS). This becomes important for secure concatenation, where the initial authentication is run on an unencrypted connection (ie with TLS configured, but not enabled), and then the queue is reset to run over TLS (ie TLS configured _and_ enabled). So to differentiate between those two states store the generated key in opts->tls_key (as we're using the same TLS key for all queues), the key serial of the resulting TLS handshake in ctrl->tls_pskid (to signal that TLS on the admin queue is enabled), and a simple flag for the queues to indicated that TLS has been enabled. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-07-28nvme-tcp: use sendpages_ok() instead of sendpage_ok()Ofir Gal
Currently nvme_tcp_try_send_data() use sendpage_ok() in order to disable MSG_SPLICE_PAGES, it check the first page of the iterator, the iterator may represent contiguous pages. MSG_SPLICE_PAGES enables skb_splice_from_iter() which checks all the pages it sends with sendpage_ok(). When nvme_tcp_try_send_data() sends an iterator that the first page is sendable, but one of the other pages isn't skb_splice_from_iter() warns and aborts the data transfer. Using the new helper sendpages_ok() in order to disable MSG_SPLICE_PAGES solves the issue. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ofir Gal <ofir.gal@volumez.com> Link: https://lore.kernel.org/r/20240718084515.3833733-3-ofir.gal@volumez.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-26nvme-pci: do not directly handle subsys reset falloutKeith Busch
Scheduling reset_work after a nvme subsystem reset is expected to fail on pcie, but this also prevents potential handling the platform's pcie services may provide that might successfully recovering the link without re-enumeration. Such examples include AER, DPC, and power's EEH. Provide a pci specific operation that safely initiates a subsystem reset, and instead of scheduling reset work, read back the status register to trigger a pcie read error. Since this only affects pci, the other fabrics drivers subscribe to a generic nvmf subsystem reset that is exactly the same as before. The loop fabric doesn't use it because nvmet doesn't support setting that property anyway. And since we're using the magic NSSR value in two places now, provide a symbolic define for it. Reported-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24nvme: split device add from initializationKeith Busch
Combining both creates an ambiguous cleanup scenario for the caller if an error is returned: does the device reference need to be dropped or did the error occur before the device was initialized? If an error occurs after the device is added, then the existing cleanup routines will leak memory. Furthermore, the nvme core is taking it upon itself to free the device's kobj name under certain conditions rather than go through the core device API. We shouldn't be peaking into these implementation details. Split the device initialization from the addition to make it easier to know the error handling actions, fix the existing memory leaks, and stop the device layering violations. Link: https://lore.kernel.org/linux-nvme/c4050a37-ecc9-462c-9772-65e25166f439@grimberg.me/ Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-06-24nvme: tcp: split controller bringup handlingKeith Busch
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl. Split the allocation side out to make the error handling boundary easier to navigate. The nvme tcp driver's error handling had different returns in the error goto label's, which harm readability. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-14Merge tag 'nvme-6.10-2024-05-14' of git://git.infradead.org/nvme into block-6.10Jens Axboe
Pull NVMe updates and fixes from Keith: "nvme updates for Linux 6.10 - Fabrics connection retries (Daniel, Hannes) - Fabrics logging enhancements (Tokunori) - RDMA delete optimization (Sagi)" * tag 'nvme-6.10-2024-05-14' of git://git.infradead.org/nvme: nvme-rdma, nvme-tcp: include max reconnects for reconnect logging nvmet-rdma: Avoid o(n^2) loop in delete_ctrl nvme: do not retry authentication failures nvme-fabrics: short-circuit reconnect retries nvme: return kernel error codes for admin queue connect nvmet: return DHCHAP status codes from nvmet_setup_auth() nvmet: lock config semaphore when accessing DH-HMAC-CHAP key
2024-05-07nvme-rdma, nvme-tcp: include max reconnects for reconnect loggingTokunori Ikegami
Makes clear max reconnects translated by ctrl loss tmo and reconnect delay. Signed-off-by: Tokunori Ikegami <ikegami.t@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01nvme-fabrics: short-circuit reconnect retriesHannes Reinecke
Returning a nvme status from nvme_tcp_setup_ctrl() indicates that the association was established and we have received a status from the controller; consequently we should honour the DNR bit. If not any future reconnect attempts will just return the same error, so we can short-circuit the reconnect attempts and fail the connection directly. Signed-off-by: Hannes Reinecke <hare@suse.de> [dwagner: - extended nvme_should_reconnect] Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01nvme-tcp: strict pdu pacing to avoid send stalls on TLSHannes Reinecke
TLS requires a strict pdu pacing via MSG_EOR to signal the end of a record and subsequent encryption. If we do not set MSG_EOR at the end of a sequence the record won't be closed, encryption doesn't start, and we end up with a send stall as the message will never be passed on to the TCP layer. So do not check for the queue status when TLS is enabled but rather make the MSG_MORE setting dependent on the current request only. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-21Merge tag 'nvme-6.9-2024-03-21' of git://git.infradead.org/nvme into block-6.9Jens Axboe
Pull NVMe fixes from Keith: "nvme updates for Linux 6.9 - Make an informative message less ominous (Keith) - Enhanced trace decoding (Guixin) - TCP updates (Hannes, Li) - Fabrics connect deadlock fix (Chunguang) - Platform API migration update (Uwe) - A new device quirk (Jiawei)" * tag 'nvme-6.9-2024-03-21' of git://git.infradead.org/nvme: nvmet-rdma: remove NVMET_RDMA_REQ_INVALIDATE_RKEY flag nvme: remove redundant BUILD_BUG_ON check nvme/tcp: Add wq_unbound modparam for nvme_tcp_wq nvme-tcp: Export the nvme_tcp_wq to sysfs drivers/nvme: Add quirks for device 126f:2262 nvme: parse format command's lbafu when tracing nvme: add tracing of reservation commands nvme: parse zns command's zsa and zrasf to string nvme: use nvme_disk_is_ns_head helper nvme: fix reconnection fail due to reserved tag allocation nvmet: add tracing of zns commands nvmet: add tracing of authentication commands nvme-apple: Convert to platform remove callback returning void nvmet-tcp: do not continue for invalid icreq nvme: change shutdown timeout setting message
2024-03-18nvme/tcp: Add wq_unbound modparam for nvme_tcp_wqLi Feng
The default nvme_tcp_wq will use all CPUs to process tasks. Sometimes it is necessary to set CPU affinity to improve performance. A new module parameter wq_unbound is added here. If set to true, users can configure cpu affinity through /sys/devices/virtual/workqueue/nvme_tcp_wq/cpumask. Signed-off-by: Li Feng <fengli@smartx.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-18nvme-tcp: Export the nvme_tcp_wq to sysfsLi Feng
Make the workqueue userspace visible for easy viewing and configuration. Signed-off-by: Li Feng <fengli@smartx.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-05net: introduce page_frag_cache_drain()Yunsheng Lin
When draining a page_frag_cache, most user are doing the similar steps, so introduce an API to avoid code duplication. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-01nvme-tcp: show hostnqn when connecting to tcp targetNitin U. Yewale
Log hostnqn when connecting to nvme target. As hostnqn could be changed, logging this information in syslog at appropriate time may help in troubleshooting. Signed-off-by: Nitin U. Yewale <nyewale@redhat.com> Reviewed-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31nvme: split out fabrics version of nvme_opcode_str()Caleb Sander
nvme_opcode_str() currently supports admin, IO, and fabrics commands. However, fabrics commands aren't allowed for the pci transport. Currently the pci caller passes 0 as the fctype, which means any fabrics command would be displayed as "Property Set". Move fabrics command support into a function nvme_fabrics_opcode_str() and remove the fctype argument to nvme_opcode_str(). This way, a fabrics command will display as "Unknown" for pci. Convert the rdma and tcp transports to use nvme_fabrics_opcode_str(). Signed-off-by: Caleb Sander <csander@purestorage.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-24nvme: add module description to stop warningsChaitanya Kulkarni
Add MODULE_DESCRIPTION() in order to remove warnings & get clean build:- WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-core.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fabrics.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-rdma.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fc.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-tcp.o Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-18Merge tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes, Christoph) - discard fixes and improvements (Christoph) - timeout debug improvements (Keith, Max) - various cleanups (Daniel, Max, Giuxen) - trace event string fixes (Arnd) - shadow doorbell setup on reset fix (William) - a write zeroes quirk for SK Hynix (Jim) - MD pull request via Song: - Sparse warning since v6.0 (Bart) - /proc/mdstat regression since v6.7 (Yu Kuai) - Use symbolic error value (Christian) - IO Priority documentation update (Christian) - Fix for accessing queue limits without having entered the queue (Christoph, me) - Fix for loop dio support (Christoph) - Move null_blk off deprecated ida interface (Christophe) - Ensure nbd initializes full msghdr (Eric) - Fix for a regression with the folio conversion, which is now easier to hit because of an unrelated change (Matthew) - Remove redundant check in virtio-blk (Li) - Fix for a potential hang in sbitmap (Ming) - Fix for partial zone appending (Damien) - Misc changes and fixes (Bart, me, Kemeng, Dmitry) * tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux: (45 commits) Documentation: block: ioprio: Update schedulers loop: fix the the direct I/O support check when used on top of block devices blk-mq: Remove the hctx 'run' debugfs attribute nbd: always initialize struct msghdr completely block: Fix iterating over an empty bio with bio_for_each_folio_all block: bio-integrity: fix kcalloc() arguments order virtio_blk: remove duplicate check if queue is broken in virtblk_done sbitmap: remove stale comment in sbq_calc_wake_batch block: Correct a documentation comment in blk-cgroup.c null_blk: Remove usage of the deprecated ida_simple_xx() API block: ensure we hold a queue reference when using queue limits blk-mq: rename blk_mq_can_use_cached_rq block: print symbolic error name instead of error code blk-mq: fix IO hang from sbitmap wakeup race nvmet-rdma: avoid circular locking dependency on install_queue() nvmet-tcp: avoid circular locking dependency on install_queue() nvme-pci: set doorbell config before unquiescing block: fix partial zone append completion handling in req_bio_endio() block/iocost: silence warning on 'last_period' potentially being unused md/raid1: Use blk_opf_t for read and write operations ...
2024-01-08nvme-tcp: enhance timeout kernel logMax Gurtovoy
Print the command_id along side blk-mq's tag to help match commands with protocol wire traces and logs. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvme: tcp: remove unnecessary goto statementGuixin Liu
There is no requirement to call nvme_tcp_free_queue() for queue deallocation if the pskid is null or the queue allocation fails, as the NVME_TCP_Q_ALLOCATED flag would not be set in such scenarios. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04nvme: ensure reset state check orderingKeith Busch
A different CPU may be setting the ctrl->state value, so ensure proper barriers to prevent optimizing to a stale state. Normally it isn't a problem to observe the wrong state as it is merely advisory to take a quicker path during initialization and error recovery, but seeing an old state can report unexpected ENETRESET errors when a reset request was in fact successful. Reported-by: Minh Hoang <mh2022@meta.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Hannes Reinecke <hare@suse.de>
2023-11-22nvme: tcp: fix compile-time checks for TLS modeArnd Bergmann
When CONFIG_NVME_KEYRING is enabled as a loadable module, but the TCP host code is built-in, it fails to link: arm-linux-gnueabi-ld: drivers/nvme/host/tcp.o: in function `nvme_tcp_setup_ctrl': tcp.c:(.text+0x1940): undefined reference to `nvme_tls_psk_default' The problem is that the compile-time conditionals are inconsistent here, using a mix of #ifdef CONFIG_NVME_TCP_TLS, IS_ENABLED(CONFIG_NVME_TCP_TLS) and IS_ENABLED(CONFIG_NVME_KEYRING) checks, with CONFIG_NVME_KEYRING controlling whether the implementation is actually built. Change it to use IS_ENABLED(CONFIG_NVME_KEYRING) checks consistently, which should help readability and make it less error-prone. Combining it with the check for the ctrl->opts->tls flag lets the compiler drop all the TLS code in configurations without this feature, which also helps runtime behavior in addition to avoiding the link failure. To make it possible for the compiler to build the dead code, both the tls_handshake_timeout variable and the TLS specific members of nvme_tcp_queue need to be moved out of the #ifdef block as well, but at least the former of these gets optimized out again. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20231122224719.4042108-4-arnd@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-22nvme: move nvme_stop_keep_alive() back to original positionHannes Reinecke
Stopping keep-alive not only stops the keep-alive workqueue, but also needs to be synchronized with I/O termination as we must not send a keep-alive command after all I/O had been terminated. So to avoid any regressions move the call to stop_keep_alive() back to its original position and ensure that keep-alive is correctly stopped failing to setup the admin queue. Fixes: 4733b65d82bd ("nvme: start keep-alive after admin queue setup") Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme-tcp: only evaluate 'tls' option if TLS is selectedHannes Reinecke
We only need to evaluate the 'tls' connect option if TLS is enabled; otherwise we might be getting a link error. Fixes: 706add13676d ("nvme: keyring: fix conditional compilation") Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/r/202311140426.0eHrTXBr-lkp@intel.com/ Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme-tcp: avoid open-coding nvme_tcp_teardown_admin_queue()Hannes Reinecke
nvme_tcp_setup_ctrl() has an open-coded version of nvme_tcp_teardown_admin_queue(). Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Mark O'Donovan <shiftee@posteo.net> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme-tcp: Fix a memory leakChristophe JAILLET
All error handling path end to the error handling path, except this one. Go to the error handling branch as well here, otherwise 'icreq' and 'icresp' will leak. Fixes: 2837966ab2a8 ("nvme-tcp: control message handling for recvmsg()") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11nvme-fabrics: parse options 'keyring' and 'tls_key'Hannes Reinecke
Parse the fabrics options 'keyring' and 'tls_key' and store the referenced keys in the options structure. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11nvme-tcp: improve icreq/icresp loggingHannes Reinecke
When icreq/icresp fails we should be printing out a warning to inform the user that the connection could not be established; without it there won't be anything in the kernel message log, just an error code returned to nvme-cli. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11nvme-tcp: control message handling for recvmsg()Hannes Reinecke
kTLS is sending TLS ALERT messages as control messages for recvmsg(). As we can't do anything sensible with it just abort the connection and let the userspace agent to a re-negotiation. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11nvme-tcp: enable TLS handshake upcallHannes Reinecke
Add a fabrics option 'tls' and start the TLS handshake upcall with the default PSK. When TLS is started the PSK key serial number is displayed in the sysfs attribute 'tls_key' Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-10-11nvme-tcp: allocate socket fileHannes Reinecke
When using the TLS upcall we need to allocate a socket file such that the userspace daemon is able to use the socket. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-08-11Merge tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Fixes for request_queue state (Ming) - Another uuid quirk (August) - RCU poll fix for NVMe (Ming) - Fix for an IO stall with polled IO (me) - Fix for blk-iocost stats enable/disable accounting (Chengming) - Regression fix for large pages for zram (Christoph) * tag 'block-6.5-2023-08-11' of git://git.kernel.dk/linux: nvme: core: don't hold rcu read lock in nvme_ns_chr_uring_cmd_iopoll blk-iocost: fix queue stats accounting block: don't make REQ_POLLED imply REQ_NOWAIT block: get rid of unused plug->nowait flag zram: take device and not only bvec offset into account nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G nvme-rdma: fix potential unbalanced freeze & unfreeze nvme-tcp: fix potential unbalanced freeze & unfreeze nvme: fix possible hang when removing a controller during error recovery
2023-07-21nvme-tcp: fix potential unbalanced freeze & unfreezeMing Lei
Move start_freeze into nvme_tcp_configure_io_queues(), and there is at least two benefits: 1) fix unbalanced freeze and unfreeze, since re-connection work may fail or be broken by removal 2) IO during error recovery can be failfast quickly because nvme fabrics unquiesces queues after teardown. One side-effect is that !mpath request may timeout during connecting because of queue topo change, but that looks not one big deal: 1) same problem exists with current code base 2) compared with !mpath, mpath use case is dominant Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic") Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-07-02nvme-tcp: Fix comma-related oopsDavid Howells
Fix a comma that should be a semicolon. The comma is at the end of an if-body and thus makes the statement after (a bvec_set_page()) conditional too, resulting in an oops because we didn't fill out the bio_vec[]: BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page ... Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp] RIP: 0010:skb_splice_from_iter+0xf1/0x370 ... Call Trace: tcp_sendmsg_locked+0x3a6/0xdd0 tcp_sendmsg+0x31/0x50 inet_sendmsg+0x47/0x80 sock_sendmsg+0x99/0xb0 nvme_tcp_try_send_data+0x149/0x490 [nvme_tcp] nvme_tcp_try_send+0x1b7/0x300 [nvme_tcp] nvme_tcp_io_work+0x40/0xc0 [nvme_tcp] process_one_work+0x21c/0x430 worker_thread+0x54/0x3e0 kthread+0xf8/0x130 Fixes: 7769887817c3 ("nvme-tcp: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage") Reported-by: Aurelien Aptel <aaptel@nvidia.com> Link: https://lore.kernel.org/r/253mt0il43o.fsf@mtr-vdi-124.i-did-not-set--mail-host-address--so-tickle-me/ Signed-off-by: David Howells <dhowells@redhat.com> cc: Sagi Grimberg <sagi@grimberg.me> cc: Willem de Bruijn <willemb@google.com> cc: Keith Busch <kbusch@kernel.org> cc: Jens Axboe <axboe@fb.com> cc: Christoph Hellwig <hch@lst.de> cc: Chaitanya Kulkarni <kch@nvidia.com> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: Jens Axboe <axboe@kernel.dk> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-nvme@lists.infradead.org cc: netdev@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-28Merge tag 'net-next-6.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking changes from Jakub Kicinski: "WiFi 7 and sendpage changes are the biggest pieces of work for this release. The latter will definitely require fixes but I think that we got it to a reasonable point. Core: - Rework the sendpage & splice implementations Instead of feeding data into sockets page by page extend sendmsg handlers to support taking a reference on the data, controlled by a new flag called MSG_SPLICE_PAGES Rework the handling of unexpected-end-of-file to invoke an additional callback instead of trying to predict what the right combination of MORE/NOTLAST flags is Remove the MSG_SENDPAGE_NOTLAST flag completely - Implement SCM_PIDFD, a new type of CMSG type analogous to SCM_CREDENTIALS, but it contains pidfd instead of plain pid - Enable socket busy polling with CONFIG_RT - Improve reliability and efficiency of reporting for ref_tracker - Auto-generate a user space C library for various Netlink families Protocols: - Allow TCP to shrink the advertised window when necessary, prevent sk_rcvbuf auto-tuning from growing the window all the way up to tcp_rmem[2] - Use per-VMA locking for "page-flipping" TCP receive zerocopy - Prepare TCP for device-to-device data transfers, by making sure that payloads are always attached to skbs as page frags - Make the backoff time for the first N TCP SYN retransmissions linear. Exponential backoff is unnecessarily conservative - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO) - Avoid waking up applications using TLS sockets until we have a full record - Allow using kernel memory for protocol ioctl callbacks, paving the way to issuing ioctls over io_uring - Add nolocalbypass option to VxLAN, forcing packets to be fully encapsulated even if they are destined for a local IP address - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure in-kernel ECMP implementation (e.g. Open vSwitch) select the same link for all packets. Support L4 symmetric hashing in Open vSwitch - PPPoE: make number of hash bits configurable - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client (ipconfig) - Add layer 2 miss indication and filtering, allowing higher layers (e.g. ACL filters) to make forwarding decisions based on whether packet matched forwarding state in lower devices (bridge) - Support matching on Connectivity Fault Management (CFM) packets - Hide the "link becomes ready" IPv6 messages by demoting their printk level to debug - HSR: don't enable promiscuous mode if device offloads the proto - Support active scanning in IEEE 802.15.4 - Continue work on Multi-Link Operation for WiFi 7 BPF: - Add precision propagation for subprogs and callbacks. This allows maintaining verification efficiency when subprograms are used, or in fact passing the verifier at all for complex programs, especially those using open-coded iterators - Improve BPF's {g,s}setsockopt() length handling. Previously BPF assumed the length is always equal to the amount of written data. But some protos allow passing a NULL buffer to discover what the output buffer *should* be, without writing anything - Accept dynptr memory as memory arguments passed to helpers - Add routing table ID to bpf_fib_lookup BPF helper - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark maps as read-only) - Show target_{obj,btf}_id in tracing link fdinfo - Addition of several new kfuncs (most of the names are self-explanatory): - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(), bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size() and bpf_dynptr_clone(). - bpf_task_under_cgroup() - bpf_sock_destroy() - force closing sockets - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs Netfilter: - Relax set/map validation checks in nf_tables. Allow checking presence of an entry in a map without using the value - Increase ip_vs_conn_tab_bits range for 64BIT builds - Allow updating size of a set - Improve NAT tuple selection when connection is closing Driver API: - Integrate netdev with LED subsystem, to allow configuring HW "offloaded" blinking of LEDs based on link state and activity (i.e. packets coming in and out) - Support configuring rate selection pins of SFP modules - Factor Clause 73 auto-negotiation code out of the drivers, provide common helper routines - Add more fool-proof helpers for managing lifetime of MDIO devices associated with the PCS layer - Allow drivers to report advanced statistics related to Time Aware scheduler offload (taprio) - Allow opting out of VF statistics in link dump, to allow more VFs to fit into the message - Split devlink instance and devlink port operations New hardware / drivers: - Ethernet: - Synopsys EMAC4 IP support (stmmac) - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches - Marvell 88E6250 7 port switches - Microchip LAN8650/1 Rev.B0 PHYs - MediaTek MT7981/MT7988 built-in 1GE PHY driver - WiFi: - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps - Realtek RTL8723DS (SDIO variant) - Realtek RTL8851BE - CAN: - Fintek F81604 Drivers: - Ethernet NICs: - Intel (100G, ice): - support dynamic interrupt allocation - use meta data match instead of VF MAC addr on slow-path - nVidia/Mellanox: - extend link aggregation to handle 4, rather than just 2 ports - spawn sub-functions without any features by default - OcteonTX2: - support HTB (Tx scheduling/QoS) offload - make RSS hash generation configurable - support selecting Rx queue using TC filters - Wangxun (ngbe/txgbe): - add basic Tx/Rx packet offloads - add phylink support (SFP/PCS control) - Freescale/NXP (enetc): - report TAPRIO packet statistics - Solarflare/AMD: - support matching on IP ToS and UDP source port of outer header - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6 - add devlink dev info support for EF10 - Virtual NICs: - Microsoft vNIC: - size the Rx indirection table based on requested configuration - support VLAN tagging - Amazon vNIC: - try to reuse Rx buffers if not fully consumed, useful for ARM servers running with 16kB pages - Google vNIC: - support TCP segmentation of >64kB frames - Ethernet embedded switches: - Marvell (mv88e6xxx): - enable USXGMII (88E6191X) - Microchip: - lan966x: add support for Egress Stage 0 ACL engine - lan966x: support mapping packet priority to internal switch priority (based on PCP or DSCP) - Ethernet PHYs: - Broadcom PHYs: - support for Wake-on-LAN for BCM54210E/B50212E - report LPI counter - Microsemi PHYs: support RGMII delay configuration (VSC85xx) - Micrel PHYs: receive timestamp in the frame (LAN8841) - Realtek PHYs: support optional external PHY clock - Altera TSE PCS: merge the driver into Lynx PCS which it is a variant of - CAN: Kvaser PCIEcan: - support packet timestamping - WiFi: - Intel (iwlwifi): - major update for new firmware and Multi-Link Operation (MLO) - configuration rework to drop test devices and split the different families - support for segmented PNVM images and power tables - new vendor entries for PPAG (platform antenna gain) feature - Qualcomm 802.11ax (ath11k): - Multiple Basic Service Set Identifier (MBSSID) and Enhanced MBSSID Advertisement (EMA) support in AP mode - support factory test mode - RealTek (rtw89): - add RSSI based antenna diversity - support U-NII-4 channels on 5 GHz band - RealTek (rtl8xxxu): - AP mode support for 8188f - support USB RX aggregation for the newer chips" * tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits) net: scm: introduce and use scm_recv_unix helper af_unix: Skip SCM_PIDFD if scm->pid is NULL. net: lan743x: Simplify comparison netlink: Add __sock_i_ino() for __netlink_diag_dump(). net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses Revert "af_unix: Call scm_recv() only after scm_set_cred()." phylink: ReST-ify the phylink_pcs_neg_mode() kdoc libceph: Partially revert changes to support MSG_SPLICE_PAGES net: phy: mscc: fix packet loss due to RGMII delays net: mana: use vmalloc_array and vcalloc net: enetc: use vmalloc_array and vcalloc ionic: use vmalloc_array and vcalloc pds_core: use vmalloc_array and vcalloc gve: use vmalloc_array and vcalloc octeon_ep: use vmalloc_array and vcalloc net: usb: qmi_wwan: add u-blox 0x1312 composition perf trace: fix MSG_SPLICE_PAGES build error ipvlan: Fix return value of ipvlan_queue_xmit() netfilter: nf_tables: fix underflow in chain reference counter netfilter: nf_tables: unbind non-anonymous set if rule construction fails ...
2023-06-24nvme-tcp: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpageDavid Howells
When transmitting data, call down into TCP using a sendmsg with MSG_SPLICE_PAGES instead of sendpage. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Willem de Bruijn <willemb@google.com> cc: Keith Busch <kbusch@kernel.org> cc: Jens Axboe <axboe@fb.com> cc: Christoph Hellwig <hch@lst.de> cc: Chaitanya Kulkarni <kch@nvidia.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-nvme@lists.infradead.org Link: https://lore.kernel.org/r/20230623225513.2732256-8-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>