Age | Commit message | Author
2024-08-11 | net: ethernet: use ip_hdrlen() instead of bit shift | Moon Yeounsu
`ip_hdr(skb)->ihl << 2` is the same as `ip_hdrlen(skb)`. Therefore, we should use a well-defined function rather than a bit shift to find the header length. It also compresses two lines into a single line. Signed-off-by: Moon Yeounsu <yyyynoom@gmail.com> Reviewed-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
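The equivalence the commit relies on can be sketched in userspace C. This is a minimal illustration, not kernel code: `struct demo_iphdr`, `hdrlen_shift()` and `demo_ip_hdrlen()` are hypothetical stand-ins, and the assumption is that the kernel helper computes the IHL field (a count of 32-bit words) times four.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's IP header; only IHL matters here. */
struct demo_iphdr {
    uint8_t ihl;    /* header length in 32-bit words, valid range 5..15 */
};

/* Open-coded form of the kind the commit replaces. */
static unsigned int hdrlen_shift(const struct demo_iphdr *iph)
{
    return (unsigned int)iph->ihl << 2;
}

/* Named helper: IHL words times 4 bytes per word (assumed to mirror ip_hdrlen()). */
static unsigned int demo_ip_hdrlen(const struct demo_iphdr *iph)
{
    return iph->ihl * 4u;
}
```

Both functions return 20 for a header without options (IHL of 5); the named form simply states the intent.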
2024-08-11 | Merge branch 'l2tp-misc-improvements' | David S. Miller
James Chapman says: ==================== l2tp: misc improvements This series makes several improvements to l2tp: * update documentation to be consistent with recent l2tp changes. * move l2tp_ip socket tables to per-net data. * fix handling of hash key collisions in l2tp_v3_session_get * implement and use get-next APIs for management and procfs/debugfs. * improve l2tp refcount helpers. * use per-cpu dev->tstats in l2tpeth devices. * fix a lockdep splat. * fix a race between l2tp_pre_exit_net and pppol2tp_release. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-11 | l2tp: flush workqueue before draining it | James Chapman
syzbot exposes a race where a net used by l2tp is removed while an existing pppol2tp socket is closed. In l2tp_pre_exit_net, l2tp queues TUNNEL_DELETE work items to close each tunnel in the net. When these are run, new SESSION_DELETE work items are queued to delete each session in the tunnel. This all happens in drain_workqueue. However, drain_workqueue allows only new work items if they are queued by other work items which are already in the queue. If pppol2tp_release runs after drain_workqueue has started, it may queue a SESSION_DELETE work item, which results in the warning below in drain_workqueue. Address this by flushing the workqueue before drain_workqueue such that all queued TUNNEL_DELETE work items run before drain_workqueue is started. This will queue SESSION_DELETE work items for each session in the tunnel, hence pppol2tp_release or other API requests won't queue SESSION_DELETE requests once drain_workqueue is started. WARNING: CPU: 1 PID: 5467 at kernel/workqueue.c:2259 __queue_work+0xcd3/0xf50 kernel/workqueue.c:2258 Modules linked in: CPU: 1 UID: 0 PID: 5467 Comm: syz.3.43 Not tainted 6.11.0-rc1-syzkaller-00247-g3608d6aca5e7 #0 Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024 RIP: 0010:__queue_work+0xcd3/0xf50 kernel/workqueue.c:2258 Code: ff e8 11 84 36 00 90 0f 0b 90 e9 1e fd ff ff e8 03 84 36 00 eb 13 e8 fc 83 36 00 eb 0c e8 f5 83 36 00 eb 05 e8 ee 83 36 00 90 <0f> 0b 90 48 83 c4 60 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc RSP: 0018:ffffc90004607b48 EFLAGS: 00010093 RAX: ffffffff815ce274 RBX: ffff8880661fda00 RCX: ffff8880661fda00 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: ffffffff815cd6d4 R09: 0000000000000000 R10: ffffc90004607c20 R11: fffff520008c0f85 R12: ffff88802ac33800 R13: ffff88802ac339c0 R14: dffffc0000000000 R15: 0000000000000008 FS: 00005555713eb500(0000) GS:ffff8880b9300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 
0000000080050033 CR2: 0000000000000008 CR3: 000000001eda6000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> queue_work_on+0x1c2/0x380 kernel/workqueue.c:2392 pppol2tp_release+0x163/0x230 net/l2tp/l2tp_ppp.c:445 __sock_release net/socket.c:659 [inline] sock_close+0xbc/0x240 net/socket.c:1421 __fput+0x24a/0x8a0 fs/file_table.c:422 task_work_run+0x24f/0x310 kernel/task_work.c:228 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop kernel/entry/common.c:114 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x168/0x370 kernel/entry/common.c:218 do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f061e9779f9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffff1c1fce8 EFLAGS: 00000246 ORIG_RAX: 00000000000001b4 RAX: 0000000000000000 RBX: 000000000001017d RCX: 00007f061e9779f9 RDX: 0000000000000000 RSI: 000000000000001e RDI: 0000000000000003 RBP: 00007ffff1c1fdc0 R08: 0000000000000001 R09: 00007ffff1c1ffcf R10: 00007f061e800000 R11: 0000000000000246 R12: 0000000000000032 R13: 00007ffff1c1fde0 R14: 00007ffff1c1fe00 R15: ffffffffffffffff </TASK> Fixes: fc7ec7f554d7 ("l2tp: delete sessions using work queue") Reported-by: syzbot+0e85b10481d2f5478053@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0e85b10481d2f5478053 Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-11 | l2tp: l2tp_eth: use per-cpu counters from dev->tstats | James Chapman
l2tp_eth uses old-style dev->stats for fastpath packet/byte counters. Convert it to use dev->tstats per-cpu counters. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-11 | l2tp: improve tunnel/session refcount helpers | James Chapman
l2tp_tunnel_inc_refcount and l2tp_session_inc_refcount wrap refcount_inc. They add no value so just use the refcount APIs directly and drop l2tp's helpers. l2tp already uses refcount_inc_not_zero anyway. Rename l2tp_tunnel_dec_refcount and l2tp_session_dec_refcount to l2tp_tunnel_put and l2tp_session_put to better match their use pairing various _get getters. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-11 | l2tp: use get_next APIs for management requests and procfs/debugfs | James Chapman
l2tp netlink and procfs/debugfs iterate over tunnel and session lists to obtain data. They currently use very inefficient get_nth functions to do so. Replace these with get_next. For netlink, use nl cb->ctx[] for passing state instead of the obsolete cb->args[]. l2tp_tunnel_get_nth and l2tp_session_get_nth are no longer used so they can be removed. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-11 | l2tp: add tunnel/session get_next helpers | James Chapman
l2tp management APIs and procfs/debugfs iterate over l2tp tunnel and session lists. Since these lists are now implemented using IDR, we can use IDR get_next APIs to iterate them. Add tunnel/session get_next functions to do so. The session get_next functions get the next session in a given tunnel and need to account for l2tpv2 and l2tpv3 differences: * l2tpv2 sessions are keyed by tunnel ID / session ID. Iteration for a given tunnel ID, TID, can therefore start with a key given by TID/0 and finish when the next entry's tunnel ID is not TID. This is possible only because the tunnel ID part of the key is the upper 16 bits and the session ID part the lower 16 bits; when idr_next increments the key value, it therefore finds the next sessions of the current tunnel before those of the next tunnel. Entries with session ID 0 are always skipped because they are used internally by pppol2tp. * l2tpv3 sessions are keyed by session ID. Iteration starts at the first IDR entry and skips entries where the tunnel does not match. Iteration must also consider session ID collisions and walk the list of colliding sessions (if any) for one which matches the supplied tunnel. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
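The l2tpv2 iteration described above can be sketched in userspace C. This is only an illustration under stated assumptions: a sorted array stands in for the IDR, and `v2_key()`, `next_at_or_after()` and `count_tunnel_sessions()` are hypothetical names, not l2tp functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack an l2tpv2 key: tunnel ID in the upper 16 bits, session ID in the lower 16. */
static uint32_t v2_key(uint16_t tunnel_id, uint16_t session_id)
{
    return ((uint32_t)tunnel_id << 16) | session_id;
}

/*
 * Hypothetical stand-in for an IDR "get next" lookup: return the index of
 * the first key >= start in an ascending array, or n if there is none.
 */
static size_t next_at_or_after(const uint32_t *keys, size_t n, uint32_t start)
{
    size_t i = 0;

    while (i < n && keys[i] < start)
        i++;
    return i;
}

/* Count the sessions of one tunnel, skipping session ID 0. */
static size_t count_tunnel_sessions(const uint32_t *keys, size_t n, uint16_t tid)
{
    size_t count = 0;
    size_t i = next_at_or_after(keys, n, v2_key(tid, 0));

    for (; i < n; i++) {
        if ((uint16_t)(keys[i] >> 16) != tid)
            break;          /* reached the next tunnel's entries */
        if ((keys[i] & 0xffff) == 0)
            continue;       /* session ID 0 is reserved for internal use */
        count++;
    }
    return count;
}
```

Because the tunnel ID occupies the high bits, all keys of one tunnel are contiguous in ascending key order, which is why the walk can stop at the first entry with a different tunnel ID.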
2024-08-11 | l2tp: handle hash key collisions in l2tp_v3_session_get | James Chapman
To handle colliding l2tpv3 session IDs, l2tp_v3_session_get searches a hashed list keyed by ID and sk. Although unlikely, if hash keys collide, it is possible that hash_for_each_possible loops over a session which doesn't have the ID that we are searching for. So check for session ID match when looping over possible hash key matches. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
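The check described above can be sketched with a small chained hash table in userspace C. All names here (`struct demo_session`, `demo_hash()`, `demo_insert()`, `demo_session_get()`) are hypothetical, and the hash is deliberately weak and based on the ID alone so that collisions are easy to provoke; the real code hashes on ID and sk.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_BUCKETS 4

struct demo_session {
    uint32_t session_id;
    const void *sk;             /* owning socket, used only as an identity */
    struct demo_session *next;  /* bucket chain */
};

/* Deliberately weak hash: distinct IDs can land in the same bucket. */
static size_t demo_hash(uint32_t session_id)
{
    return session_id % DEMO_BUCKETS;
}

static void demo_insert(struct demo_session **table, struct demo_session *s)
{
    size_t b = demo_hash(s->session_id);

    s->next = table[b];
    table[b] = s;
}

/*
 * Walk every entry in the candidate bucket and compare the session ID
 * as well as the socket: sharing a bucket does not imply sharing an ID.
 */
static struct demo_session *demo_session_get(struct demo_session **table,
                                             uint32_t session_id,
                                             const void *sk)
{
    struct demo_session *s;

    for (s = table[demo_hash(session_id)]; s; s = s->next) {
        if (s->session_id == session_id && s->sk == sk)
            return s;
    }
    return NULL;
}
```

Without the `session_id` comparison, a lookup for ID 5 could return the session with ID 1, since both fall into the same bucket in this sketch.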
2024-08-11 | l2tp: move l2tp_ip and l2tp_ip6 data to pernet | James Chapman
l2tp_ip[6] have always used global socket tables. It is therefore not possible to create l2tpip sockets in different namespaces with the same socket address. To support this, move l2tpip socket tables to pernet data. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-11 | l2tp: remove inline from functions in c sources | James Chapman
Update l2tp to remove the inline keyword from several functions in C sources, since this is now discouraged. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-11 | documentation/networking: update l2tp docs | James Chapman
l2tp no longer uses sk_user_data in tunnel sockets and now manages tunnel/session lifetimes slightly differently. Update docs to cover this. CC: linux-doc@vger.kernel.org CC: corbet@lwn.net Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-08-10 | gpio: mlxbf3: Support shutdown() function | Asmaa Mnebhi
During Linux graceful reboot, the GPIO interrupts are not disabled. Since the drivers are not removed during graceful reboot, the logic to call mlxbf3_gpio_irq_disable() is not triggered. Interrupts that remain enabled can cause issues on subsequent boots. For example, the mlxbf-gige driver contains PHY logic to bring up the link. If the gpio-mlxbf3 driver loads first, the mlxbf-gige driver will use a GPIO interrupt to bring up the link. Otherwise, it will use polling. The next time Linux boots and loads the drivers in this order, we encounter the issue: - mlxbf-gige loads first and uses polling while the GPIO10 interrupt is still enabled from the previous boot. So if the interrupt triggers, there is nothing to clear it. - gpio-mlxbf3 loads. - i2c-mlxbf loads. The interrupt doesn't trigger for I2C because it is shared with the GPIO interrupt line which was not cleared. The solution is to add a shutdown function to the GPIO driver to clear and disable all interrupts. Also clear the interrupt after disabling it in mlxbf3_gpio_irq_disable(). Fixes: 38a700efc510 ("gpio: mlxbf3: Add gpio driver support") Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com> Reviewed-by: David Thompson <davthompson@nvidia.com> Reviewed-by: Andy Shevchenko <andy@kernel.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20240611171509.22151-1-asmaa@nvidia.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2024-08-10 | Merge tag 'nfsd-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux | Linus Torvalds
Pull nfsd fixes from Chuck Lever:
 - Two minor fixes for recent changes
* tag 'nfsd-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: don't set SVC_SOCK_ANONYMOUS when creating nfsd sockets
  sunrpc: avoid -Wformat-security warning
2024-08-10 | Merge tag 'i2c-for-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux | Linus Torvalds
Pull i2c fixes from Wolfram Sang:
 - Two fixes for SMBusAlert handling in the I2C core: one to avoid an endless loop when scanning for handlers and one to make sure handlers are always called even if HW has broken behaviour
 - I2C header build fix for when ACPI is enabled but I2C isn't
 - The testunit gets a rename in the code to match the documentation
 - Two fixes for the Qualcomm GENI I2C controller clean up the error exit path in the runtime_resume() function: the first disables the clock, the second disables the icc on the way out
* tag 'i2c-for-6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: testunit: match HostNotify test name with docs
  i2c: qcom-geni: Add missing geni_icc_disable in geni_i2c_runtime_resume
  i2c: qcom-geni: Add missing clk_disable_unprepare in geni_i2c_runtime_resume
  i2c: Fix conditional for substituting empty ACPI functions
  i2c: smbus: Send alert notifications to all devices if source not found
  i2c: smbus: Improve handling of stuck alerts
2024-08-10 | Merge tag 'dma-mapping-6.11-2024-08-10' of git://git.infradead.org/users/hch/dma-mapping | Linus Torvalds
Pull dma-mapping fix from Christoph Hellwig:
 - avoid a deadlock with dma-debug and netconsole (Rik van Riel)
* tag 'dma-mapping-6.11-2024-08-10' of git://git.infradead.org/users/hch/dma-mapping:
  dma-debug: avoid deadlock between dma debug vs printk and netconsole
2024-08-10 | Merge tag 'bcachefs-2024-08-10' of git://evilpiepirate.org/bcachefs | Linus Torvalds
Pull more bcachefs fixes from Kent Overstreet:
 "A couple last minute fixes for the new disk accounting
   - fix a bug that was causing ACLs to seemingly "disappear"
   - new on disk format version, bcachefs_metadata_version_disk_accounting_v3
     bcachefs_metadata_version_disk_accounting_v2 accidentally included padding in disk_accounting_key; fortunately, 6.11 isn't out yet so we can fix this with another version bump"
* tag 'bcachefs-2024-08-10' of git://evilpiepirate.org/bcachefs:
  bcachefs: bcachefs_metadata_version_disk_accounting_v3
  bcachefs: improve bch2_dev_usage_to_text()
  bcachefs: bch2_accounting_invalid()
  bcachefs: Switch to .get_inode_acl()
2024-08-10 | wifi: brcmfmac: cfg80211: Handle SSID based pmksa deletion | Janne Grunau
Since 1efdba5fdc2c ("Handle PMKSA flush in the driver for SAE/OWE offload cases"), wpa_supplicant 2.11 sends SSID based PMKSA del commands. brcmfmac is not prepared for this and tries to dereference the NULL bssid and pmkid pointers in cfg80211_pmksa. PMKID_V3 operations support SSID based updates, so copy the SSID. Fixes: a96202acaea4 ("wifi: brcmfmac: cfg80211: Add support for PMKID_V3 operations") Cc: stable@vger.kernel.org # 6.4.x Signed-off-by: Janne Grunau <j@jannau.net> Reviewed-by: Neal Gompa <neal@gompa.dev> Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://patch.msgid.link/20240803-brcmfmac_pmksa_del_ssid-v1-1-4e85f19135e1@jannau.net
2024-08-10 | ALSA: timer: Relax start tick time check for slave timer elements | Takashi Iwai
The recent addition of a sanity check for a too low start tick time seems to break some applications that use aloop with a certain slave timer setup. They may have an initial resolution of 0, which is then treated as if it were a too low value. Relax and skip the check for the slave timer instance to address the regression. Fixes: 4a63bd179fa8 ("ALSA: timer: Set lower bound of start tick time") Cc: <stable@vger.kernel.org> Link: https://github.com/raspberrypi/linux/issues/6294 Link: https://patch.msgid.link/20240810084833.10939-1-tiwai@suse.de Signed-off-by: Takashi Iwai <tiwai@suse.de>
2024-08-10 | irqchip/riscv-aplic: Retrigger MSI interrupt on source configuration | Yong-Xuan Wang
Section 4.5.2 of the RISC-V AIA specification says that "any write to a sourcecfg register of an APLIC might (or might not) cause the corresponding interrupt-pending bit to be set to one if the rectified input value is high (= 1) under the new source mode." When the interrupt type is changed in the sourcecfg register, the APLIC device might not set the corresponding pending bit, so the interrupt might never become pending. To handle sourcecfg register changes for level-triggered interrupts in MSI mode, manually set the pending bit so that the interrupt is retriggered if it was already asserted. Fixes: ca8df97fe679 ("irqchip/riscv-aplic: Add support for MSI-mode") Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20240809071049.2454-1-yongxuan.wang@sifive.com
2024-08-10 | irqchip/xilinx: Fix shift out of bounds | Radhey Shyam Pandey
The device tree property 'xlnx,kind-of-intr' is sanity checked that the bitmask contains only set bits which are in the range of the number of interrupts supported by the controller. The check is done by shifting the mask right by the number of supported interrupts and checking the result for zero. The data type of the mask is u32 and the number of supported interrupts is up to 32. In case of 32 interrupts the shift is out of bounds, resulting in a mismatch warning. The out of bounds condition is also reported by UBSAN: UBSAN: shift-out-of-bounds in irq-xilinx-intc.c:332:22 shift exponent 32 is too large for 32-bit type 'unsigned int' Fix it by promoting the mask to u64 for the test. Fixes: d50466c90724 ("microblaze: intc: Refactor DT sanity check") Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/1723186944-3571957-1-git-send-email-radhey.shyam.pandey@amd.com
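The range check and its fix can be sketched in plain C. This is an illustration only: `mask_within_range()` is a hypothetical helper, not the driver's code. Shifting a 32-bit value by 32 is undefined behaviour in C, so the mask is widened to 64 bits before the shift.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if mask has no bits set at or above nr_irq (1..32).
 * Widening to 64 bits keeps the shift well defined when nr_irq == 32.
 */
static bool mask_within_range(uint32_t mask, unsigned int nr_irq)
{
    return ((uint64_t)mask >> nr_irq) == 0;
}
```

With the widened operand, a full 32-bit mask on a 32-interrupt controller shifts down to zero and passes the check, instead of relying on whatever the hardware shift instruction happens to do.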
2024-08-09 | Merge branch 'mlx5-misc-fixes-2024-08-08' | Jakub Kicinski
Tariq Toukan says: ==================== mlx5 misc fixes 2024-08-08 This patchset provides misc bug fixes from the team to the mlx5 core and Eth drivers. ==================== Link: https://patch.msgid.link/20240808144107.2095424-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | net/mlx5e: Fix queue stats access to non-existing channels splat | Gal Pressman
The queue stats API queries the queues according to the real_num_[tr]x_queues, in case the device is down and channels were not yet created, don't try to query their statistics. To trigger the panic, run this command before the interface is brought up: ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml --dump qstats-get --json '{"ifindex": 4}' BUG: kernel NULL pointer dereference, address: 0000000000000c00 PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 3 UID: 0 PID: 977 Comm: python3 Not tainted 6.10.0+ #40 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5e_get_queue_stats_rx+0x3c/0xb0 [mlx5_core] Code: fc 55 48 63 ee 53 48 89 d3 e8 40 3d 70 e1 85 c0 74 58 4c 89 ef e8 d4 07 04 00 84 c0 75 41 49 8b 84 24 f8 39 00 00 48 8b 04 e8 <48> 8b 90 00 0c 00 00 48 03 90 40 0a 00 00 48 89 53 08 48 8b 90 08 RSP: 0018:ffff888116be37d0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888116be3868 RCX: 0000000000000004 RDX: ffff88810ada4000 RSI: 0000000000000000 RDI: ffff888109df09c0 RBP: 0000000000000000 R08: 0000000000000004 R09: 0000000000000004 R10: ffff88813461901c R11: ffffffffffffffff R12: ffff888109df0000 R13: ffff888109df09c0 R14: ffff888116be38d0 R15: 0000000000000000 FS: 00007f4375d5c740(0000) GS:ffff88852c980000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000c00 CR3: 0000000106ada006 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die+0x1f/0x60 ? page_fault_oops+0x14e/0x3d0 ? exc_page_fault+0x73/0x130 ? asm_exc_page_fault+0x22/0x30 ? mlx5e_get_queue_stats_rx+0x3c/0xb0 [mlx5_core] netdev_nl_stats_by_netdev+0x2a6/0x4c0 ? __rmqueue_pcplist+0x351/0x6f0 netdev_nl_qstats_get_dumpit+0xc4/0x1b0 genl_dumpit+0x2d/0x80 netlink_dump+0x199/0x410 __netlink_dump_start+0x1aa/0x2c0 genl_family_rcv_msg_dumpit+0x94/0xf0 ? 
__pfx_genl_start+0x10/0x10 ? __pfx_genl_dumpit+0x10/0x10 ? __pfx_genl_done+0x10/0x10 genl_rcv_msg+0x116/0x2b0 ? __pfx_netdev_nl_qstats_get_dumpit+0x10/0x10 ? __pfx_genl_rcv_msg+0x10/0x10 netlink_rcv_skb+0x54/0x100 genl_rcv+0x24/0x40 netlink_unicast+0x21a/0x340 netlink_sendmsg+0x1f4/0x440 __sys_sendto+0x1b6/0x1c0 ? do_sock_setsockopt+0xc3/0x180 ? __sys_setsockopt+0x60/0xb0 __x64_sys_sendto+0x20/0x30 do_syscall_64+0x50/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f43757132b0 Code: c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 1d 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 68 c3 0f 1f 80 00 00 00 00 41 54 48 83 ec 20 RSP: 002b:00007ffd258da048 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007ffd258da0f8 RCX: 00007f43757132b0 RDX: 000000000000001c RSI: 00007f437464b850 RDI: 0000000000000003 RBP: 00007f4375085de0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007f43751a6147 </TASK> Modules linked in: netconsole xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core zram zsmalloc mlx5_core fuse [last unloaded: netconsole] CR2: 0000000000000c00 ---[ end trace 0000000000000000 ]--- RIP: 0010:mlx5e_get_queue_stats_rx+0x3c/0xb0 [mlx5_core] Code: fc 55 48 63 ee 53 48 89 d3 e8 40 3d 70 e1 85 c0 74 58 4c 89 ef e8 d4 07 04 00 84 c0 75 41 49 8b 84 24 f8 39 00 00 48 8b 04 e8 <48> 8b 90 00 0c 00 00 48 03 90 40 0a 00 00 48 89 53 08 48 8b 90 08 RSP: 0018:ffff888116be37d0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888116be3868 RCX: 0000000000000004 RDX: ffff88810ada4000 RSI: 0000000000000000 RDI: ffff888109df09c0 RBP: 0000000000000000 R08: 0000000000000004 R09: 
0000000000000004 R10: ffff88813461901c R11: ffffffffffffffff R12: ffff888109df0000 R13: ffff888109df09c0 R14: ffff888116be38d0 R15: 0000000000000000 FS: 00007f4375d5c740(0000) GS:ffff88852c980000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000c00 CR3: 0000000106ada006 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 7b66ae536a78 ("net/mlx5e: Add per queue netdev-genl stats") Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20240808144107.2095424-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | net/mlx5e: Correctly report errors for ethtool rx flows | Cosmin Ratiu
Previously, an ethtool rx flow with no attrs would not be added to the NIC as it has no rules to configure the hw with, but it would be reported as successful to the caller (return code 0). This is confusing for the user as ethtool then reports "Added rule $num", but no rule was actually added. This change corrects that by instead reporting these wrong rules as -EINVAL. Fixes: b29c61dac3a2 ("net/mlx5e: Ethtool steering flow validation refactoring") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808144107.2095424-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
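The behavioural change can be sketched as follows. This is a minimal illustration, assuming a hypothetical `demo_add_flow()` that receives the number of match attributes; it is not the driver's validation code.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical flow-add entry point: a flow with no match attributes
 * cannot be programmed into hardware, so report it as invalid rather
 * than returning success without adding anything.
 */
static int demo_add_flow(size_t num_attrs)
{
    if (num_attrs == 0)
        return -EINVAL;

    /* ... the rule would be configured in hardware here ... */
    return 0;
}
```

Returning a negative errno lets the caller (here, ethtool) report a failure instead of printing a misleading "Added rule" message.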
2024-08-09 | net/mlx5e: Take state lock during tx timeout reporter | Dragos Tatulea
mlx5e_safe_reopen_channels() requires the state lock to be taken. The change referenced in the Fixes tag removed the lock to fix another issue. This patch adds it back, but at a later point (when calling mlx5e_safe_reopen_channels()), to avoid the deadlock referenced in the Fixes tag. Fixes: eab0da38912e ("net/mlx5e: Fix possible deadlock on mlx5e_tx_timeout_work") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Link: https://lore.kernel.org/all/ZplpKq8FKi3vwfxv@gmail.com/T/ Reviewed-by: Breno Leitao <leitao@debian.org> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808144107.2095424-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | net/mlx5e: SHAMPO, Increase timeout to improve latency | Dragos Tatulea
During latency tests (netperf TCP_RR) a 30% degradation of HW GRO vs SW GRO was observed. This is due to SHAMPO triggering timeout filler CQEs instead of delivering the CQE for the packet. Having a short timeout for SHAMPO doesn't bring any benefits as it is the driver that does the merging, not the hardware. On the contrary, it can have a negative impact: additional filler CQEs are generated due to the timeout. As there is no way to disable this timeout, this change sets it to the maximum value. Instead of using the packet_merge.timeout parameter which is also used for LRO, set the value directly when filling in the rest of the SHAMPO parameters in mlx5e_build_rq_param(). Fixes: 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808144107.2095424-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | net/mlx5: SD, Do not query MPIR register if no sd_group | Tariq Toukan
Unconditionally calling the MPIR query on BF separate mode yields the FW syndrome below [1]. Do not call it unless admin clearly specified the SD group, i.e. expressing the intention of using the multi-PF netdev feature. This fix covers cases not covered in commit fca3b4791850 ("net/mlx5: Do not query MPIR on embedded CPU function"). [1] mlx5_cmd_out_err:808:(pid 8267): ACCESS_REG(0x805) op_mod(0x1) failed, status bad system state(0x4), syndrome (0x685f19), err(-5) Fixes: 678eb448055a ("net/mlx5: SD, Implement basic query and instantiation") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20240808144107.2095424-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | Merge branch 'mlx5-misc-patches-2024-08-08' | Jakub Kicinski
Tariq Toukan says: ==================== mlx5 misc patches 2024-08-08 This patchset contains multiple enhancements from the team to the mlx5 core and Eth drivers. Patch #1 by Chris bumps a defined value to permit more devices doing TC offloads. Patch #2 by Jianbo adds an IPsec fast-path optimization to replace the slow async handling. Patches #3 and #4 by Jianbo add TC offload support for complicated rules to overcome a firmware limitation. Patch #5 by Gal unifies the access macro to advertised/supported link modes. Patches #6 to #9 by Gal add extack messages in ethtool ops to replace prints to the kernel log. Patch #10 by Cosmin switches to using the 'update' verb instead of 'replace' to better reflect the operation. Patch #11 by Cosmin exposes an update connection tracking operation to replace the assumed delete+add implementation. ==================== Link: https://patch.msgid.link/20240808055927.2059700-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | net/mlx5e: CT: Update connection tracking steering entries | Cosmin Ratiu
Previously, replacing a connection tracking steering entry was done by adding a new rule (with the same tag but possibly different mod hdr actions/labels) then removing the old rule. This approach doesn't work in hardware steering because two steering entries with the same tag cannot coexist in a hardware steering table. This commit prepares for that by adding a new ct_rule_update operation on the ct_fs_ops struct which is used instead of add+delete. Implementations for both dmfs (firmware steering) and smfs (software steering) are provided, which simply add the new rule and delete the old one. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808055927.2059700-12-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | net/mlx5e: CT: 'update' rules instead of 'replace' | Cosmin Ratiu
Offloaded rules can be updated with a new modify header action containing a changed restore cookie. This was done using the verb 'replace', while in some configurations 'update' is a better fit. This commit renames the functions used to reflect that. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808055927.2059700-11-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | net/mlx5e: Use extack in get module eeprom by page callback | Gal Pressman
In case of errors in get module eeprom by page, reflect it through extack instead of a dmesg print. While at it, make the messages more human friendly. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808055927.2059700-10-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | net/mlx5e: Use extack in set coalesce callback | Gal Pressman
In case of errors in set coalesce, reflect it through extack instead of a dmesg print. While at it, make the messages more human friendly. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808055927.2059700-9-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | net/mlx5e: Use extack in get coalesce callback | Gal Pressman
In case of errors in get coalesce, reflect it through extack instead of a dmesg print. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808055927.2059700-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | net/mlx5e: Use extack in set ringparams callback | Gal Pressman
In case of errors in set ringparams, reflect it through extack instead of a dmesg print. While at it, make the messages more human friendly and remove two redundant checks that are already validated by the core. Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808055927.2059700-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09 | net/mlx5e: Be consistent with bitmap handling of link modes | Gal Pressman
Use the bitmap operations when accessing the advertised/supported link modes and remove places that access them as arrays of unsigned longs (the underlying implementation of the bitmap). This makes the code much more readable and clear. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808055927.2059700-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
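The difference between bitmap accessors and raw array access can be sketched in userspace C. The helpers below are hypothetical stand-ins (`demo_set_bit()`, `demo_test_bit()`), and `DEMO_MODE_BITS` is an assumed size, not the real number of link modes.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_MODE_BITS 128  /* hypothetical number of link-mode bits */
#define DEMO_LONGS ((DEMO_MODE_BITS + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* Set bit nr; the caller never needs to know which word holds it. */
static void demo_set_bit(unsigned long *map, size_t nr)
{
    map[nr / DEMO_BITS_PER_LONG] |= 1UL << (nr % DEMO_BITS_PER_LONG);
}

/* Test bit nr using the same word/offset arithmetic. */
static bool demo_test_bit(const unsigned long *map, size_t nr)
{
    return (map[nr / DEMO_BITS_PER_LONG] >> (nr % DEMO_BITS_PER_LONG)) & 1UL;
}
```

Code that indexes the array directly (for example `map[0] & mask`) silently depends on the word size and on the bit falling into the first word; the accessors hide both details.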
2024-08-09 | net/mlx5e: TC, Offload rewrite and mirror to both internal and external dests | Jianbo Liu
Firmware has the limitation that it cannot offload a rule with rewrite and mirror to internal and external destinations simultaneously. This patch adds a workaround for this issue. Here the destination array is split again, just as in the previous commit, but after the action indexed by split_count - 1. An extra rule is added for the leftover destinations. Such a rule can be offloaded, even if there are both internal and external destinations, because the header rewrite is left in the original FTE. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808055927.2059700-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09net/mlx5e: TC, Offload rewrite and mirror on tunnel over ovs internal portJianbo Liu
To offload the encap rule when the tunnel IP is configured on an openvswitch internal port, the driver needs to overwrite the vport metadata in reg_c0 with the value assigned to the internal port, then forward packets to the root table to be processed again by the rules matching on the metadata for that internal port. When such a rule is combined with header rewrite and mirror, openvswitch generates a rule like the following, because it resets mirror after packets are modified. in_port(enp8s0f0npf0sf1),.., actions:enp8s0f0npf0sf2,set(tunnel(...)),set(ipv4(...)),vxlan_sys_4789,enp8s0f0npf0sf2 The split_count was introduced earlier to support rewrite and mirror. The driver splits the rule into two different hardware rules in order to offload it. But that is not enough to offload the above complicated rule because of limitations in both driver and firmware. To resolve this issue, the destination array is split again after the destination indexed by split_count. An extra rule is added for the leftover destinations (in the above example, enp8s0f0npf0sf2) and is inserted into the post_act table. An extra destination is added to the original rule to forward to the post_act table, so the extra mirror is done there. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808055927.2059700-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09net/mlx5e: Enable remove flow for hard packet limitJianbo Liu
In commit a2a73ea14b1a ("net/mlx5e: Don't listen to remove flows event"), the remove_flow_enable event was removed, and the hard limit usually relies on the software mechanism added in commit b2f7b01d36a9 ("net/mlx5e: Simulate missing IPsec TX limits hardware functionality"). But the delayed work is rescheduled only once per second, which is slow for fast traffic. As a result, traffic may not be blocked even when it reaches the hard limit, which usually happens when the soft and hard limits are very close. In reality it won't happen because the soft limit is much lower than the hard limit. But, as an optimization for RX to block traffic when reaching the hard limit, remove_flow_enable needs to be set. When remove flow is enabled, the IPSEC HARD_LIFETIME ASO syndrome will be set in the metadata defined in the ASO return register if packets reach the hard lifetime threshold, and those packets are dropped immediately by the steering table. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808055927.2059700-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09net/mlx5: E-Switch, Increase max int port number for offloadChris Mi
Currently MLX5E_TC_MAX_INT_PORT_NUM is 8. Usually int port has one ingress and one egress rules. But sometimes, a temporary rule can be offloaded as well, eg: recirc_id(0),in_port(br-phy),eth(src=10:70:fd:87:57:c0,dst=33:33:00:00:00:16), eth_type(0x86dd),ipv6(frag=no), packets:2, bytes:180, used:0.060s, actions:enp8s0f0 If one int port device offloads 3 rules, only 2 devices can offload. Other devices will hit the limit and fail to offload. Actually it is insufficient for customers. So increase the number to 32. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808055927.2059700-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
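The arithmetic behind the limit can be checked with a small sketch. The helper name `max_offloading_devices` is hypothetical, and the model (every device consuming the same number of entries) is a simplifying assumption taken from the example in the message, not from the driver.

```c
/*
 * Hypothetical model: each int port device consumes rules_per_dev
 * entries out of a shared budget of `limit` entries.
 */
static unsigned int max_offloading_devices(unsigned int limit,
					   unsigned int rules_per_dev)
{
	if (rules_per_dev == 0)
		return 0; /* avoid dividing by zero; nothing is consumed */
	return limit / rules_per_dev;
}
```

Under this model, a budget of 8 with 3 rules per device allows 2 devices, matching the message, while a budget of 32 allows 10.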
2024-08-09net: ag71xx: use phylink_mii_ioctlRosen Penev
f1294617d2f38bd2b9f6cce516b0326858b61182 removed the custom function for ndo_eth_ioctl and used the standard phy_do_ioctl which calls phy_mii_ioctl. However since then, this driver was ported to phylink where it makes more sense to call phylink_mii_ioctl. Bring back custom function that calls phylink_mii_ioctl. Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20240807215834.33980-1-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09Merge branch 'ibmvnic-ibmvnic-rr-patchset'Jakub Kicinski
Nick Child says: ==================== ibmvnic: ibmvnic rr patchset v1 - https://lore.kernel.org/netdev/20240801212340.132607-1-nnac123@linux.ibm.com/ v2 - https://lore.kernel.org/netdev/20240806193706.998148-1-nnac123@linux.ibm.com/ ==================== Link: https://patch.msgid.link/20240807211809.1259563-1-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09ibmvnic: Perform tx CSO during send scrq directNick Child
During initialization with the vnic server, a bitstring is communicated to the client regarding header info needed during CSO (See "VNIC Capabilities" in PAPR). Most of the time, to be safe, the vnic server requests header info for CSO. When header info is needed, multiple TX descriptors are required per skb. This limits the driver to using send_subcrq_indirect instead of send_subcrq_direct.

Previously, the vnic server request for header info was ignored. This allowed the use of send_subcrq_direct. Transmissions were successful because the bitstring returned by the vnic server is broad and overly cautious. It was observed that mlx backing devices could actually transmit and handle CSO packets without the vnic server receiving header info (despite the fact that the bitstring requested it).

There was a trust issue: the bitstring was overly cautious. This extra precaution (requesting header info when the backing device may not use it) comes at the cost of performance (using direct vs indirect hcalls has a 30% delta in small packet RR transaction rate). So it has been requested that the vnic server team try to ensure that the bitstring is more exact. In the meantime, disable CSO when it is possible to use the skb in the send_subcrq_direct path. In other words, calculate the checksum before handing the packet to FW when the packet is not segmented and xmit_more is false.

Since the code path is only possible if the skb is non GSO and xmit_more is false, the cost of doing the checksum in the send_subcrq_direct path is minimal. Any large segmented skb will have xmit_more set to true more frequently, and it is inexpensive to do checksumming on a small skb. The worst-case workload would be a 9000 MTU TCP_RR test with close to MTU sized packets (and TSO off). This allows xmit_more to be false more frequently and opens the code path up to use send_subcrq_direct. Observing trace data (graph-time = 1) and packet rate with this workload shows minimal performance degradation:

1. NIC does checksum w headers, safely use send_subcrq_indirect:
  - Packet rate: 631k txs
  - Trace data:
    ibmvnic_xmit = 44344685.87 us / 6234576 hits = AVG 7.11 us
    skb_checksum_help = 4.07 us / 2 hits = AVG 2.04 us
     ^ Notice hits, tracing this just for reassurance
    ibmvnic_tx_scrq_flush = 33040649.69 us / 5638441 hits = AVG 5.86 us
    send_subcrq_indirect = 37438922.24 us / 6030859 hits = AVG 6.21 us

2. NIC does checksum w/o headers, dangerously use send_subcrq_direct:
  - Packet rate: 831k txs
  - Trace data:
    ibmvnic_xmit = 48940092.29 us / 8187630 hits = AVG 5.98 us
    skb_checksum_help = 2.03 us / 1 hits = AVG 2.03
    ibmvnic_tx_scrq_flush = 31141879.57 us / 7948960 hits = AVG 3.92 us
    send_subcrq_indirect = 8412506.03 us / 728781 hits = AVG 11.54
     ^ notice hits is much lower b/c send_subcrq_direct was called (send_subcrq_direct wasn't traceable)

3. driver does checksum, safely use send_subcrq_direct (THIS PATCH):
  - Packet rate: 829k txs
  - Trace data:
    ibmvnic_xmit = 56696077.63 us / 8066168 hits = AVG 7.03 us
    skb_checksum_help = 8587456.16 us / 7526072 hits = AVG 1.14 us
    ibmvnic_tx_scrq_flush = 30219545.55 us / 7782409 hits = AVG 3.88 us
    send_subcrq_indirect = 8638326.44 us / 763693 hits = AVG 11.31 us

If the bitstring ever specifies that CSO does not require headers (dependent on VIOS vnic server changes), then this patch should be removed and replaced with one that investigates the bitstring before using send_subcrq_direct.

Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://patch.msgid.link/20240807211809.1259563-8-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
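For readers unfamiliar with what "calculate the checksum before handing the packet to FW" costs, the following is a minimal userspace sketch of the Internet checksum (RFC 1071) over a byte buffer. It is not the kernel's skb_checksum_help() and not the driver's code; the function name `inet_checksum` is hypothetical, and the sketch assumes buffers small enough that the 32-bit accumulator cannot overflow (well under 128 KiB).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Internet checksum (RFC 1071): ones'-complement sum of 16-bit
 * big-endian words, folded to 16 bits and inverted.
 * Assumes len is small enough that `sum` does not overflow.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	/* Add up whole 16-bit words. */
	while (len > 1) {
		sum += ((uint32_t)data[0] << 8) | data[1];
		data += 2;
		len -= 2;
	}

	/* A trailing odd byte is padded with a zero low byte. */
	if (len)
		sum += (uint32_t)data[0] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

The work is one pass over the payload, which is consistent with the message's point that checksumming a small, non-segmented skb is inexpensive.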
2024-08-09ibmvnic: Only record tx completed bytes once per handlerNick Child
Byte Queue Limits depends on dql_completed being called once per tx completion round in order to adjust its algorithm appropriately. The dql->limit value is an approximation of the amount of bytes that the NIC can consume per irq interval. If this approximation is too high then the NIC will become over-saturated. Too low and the NIC will starve. The dql->limit depends on dql->prev-* stats to calculate an optimal value. If dql_completed() is called more than once per irq handler then those prev-* values become unreliable (because they are not an accurate representation of the previous state of the NIC) resulting in a sub-optimal limit value. Therefore, move the call to netdev_tx_completed_queue() to the end of ibmvnic_complete_tx(). When performing 150 sessions of TCP rr (request-response 1 byte packets) workloads, one could observe: PREVIOUSLY: - limit and inflight values hovering around 130 - transaction rate of around 750k pps. NOW: - limit rises and falls in response to inflight (130-900) - transaction rate of around 1M pps (33% improvement) Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://patch.msgid.link/20240807211809.1259563-7-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
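The shape of the change (accumulate inside the completion loop, report once at the end) can be sketched as follows. All names here (`struct tx_completion`, `struct bql_stats`, `report_completed`, `complete_tx`) are hypothetical stand-ins; in the driver the single report is the netdev_tx_completed_queue() call mentioned above.

```c
#include <stddef.h>

/* Hypothetical record for one completed transmit. */
struct tx_completion {
	unsigned int bytes;
};

/* Hypothetical stand-in for the per-queue BQL state. */
struct bql_stats {
	unsigned int calls;       /* how many times completion was reported */
	unsigned int total_pkts;
	unsigned int total_bytes;
};

/* Stand-in for the one-per-round completion report. */
static void report_completed(struct bql_stats *q, unsigned int pkts,
			     unsigned int bytes)
{
	q->calls++;
	q->total_pkts += pkts;
	q->total_bytes += bytes;
}

/* Process a whole completion round, then report exactly once. */
static void complete_tx(struct bql_stats *q, const struct tx_completion *c,
			size_t n)
{
	unsigned int pkts = 0, bytes = 0;

	for (size_t i = 0; i < n; i++) {
		pkts++;
		bytes += c[i].bytes;
	}

	report_completed(q, pkts, bytes);
}
```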
2024-08-09ibmvnic: Introduce send sub-crq directNick Child
Firmware supports two hcalls to send a sub-crq request: H_SEND_SUB_CRQ_INDIRECT and H_SEND_SUB_CRQ. The indirect hcall allows for submission of batched messages while the other hcall is limited to only one message. This protocol is defined in PAPR section 17.2.3.3.

Previously, the ibmvnic xmit function only used the indirect hcall. This allowed the driver to batch its skbs. A single skb can occupy a few entries per hcall depending on whether FW requires skb header information or not. The FW only needs header information if the packet is segmented. By this logic, if an skb is not GSO then it can fit in one sub-crq message and therefore is a candidate for H_SEND_SUB_CRQ.

Batching skb transmission is only useful when there are more packets coming down the line (i.e. netdev_xmit_more is true). As it turns out, H_SEND_SUB_CRQ induces less latency than H_SEND_SUB_CRQ_INDIRECT. Therefore, use H_SEND_SUB_CRQ where appropriate.

Small latency gains are seen when doing TCP_RR_150 (request/response workload). Ftrace results (graph-time=1):
  Previous:
    ibmvnic_xmit = 29618270.83 us / 8860058.0 hits = AVG 3.34
    ibmvnic_tx_scrq_flush = 21972231.02 us / 6553972.0 hits = AVG 3.35
  Now:
    ibmvnic_xmit = 22153350.96 us / 8438942.0 hits = AVG 2.63
    ibmvnic_tx_scrq_flush = 15858922.4 us / 6244076.0 hits = AVG 2.54

Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://patch.msgid.link/20240807211809.1259563-6-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
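The selection rule described in this message can be written as a small predicate. This is a sketch under the assumption that the two stated conditions (not GSO, no more packets pending) are the only inputs; the enum and function names are hypothetical and the real driver may check more state.

```c
#include <stdbool.h>

/* Hypothetical labels for the two hcall paths. */
enum send_path {
	SEND_DIRECT,   /* single message, H_SEND_SUB_CRQ */
	SEND_INDIRECT  /* batched messages, H_SEND_SUB_CRQ_INDIRECT */
};

/*
 * A non-GSO skb fits in one sub-crq message, and batching only helps
 * when more packets are on the way, so the direct path is chosen only
 * when both conditions hold.
 */
static enum send_path pick_send_path(bool is_gso, bool xmit_more)
{
	if (!is_gso && !xmit_more)
		return SEND_DIRECT;
	return SEND_INDIRECT;
}
```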
2024-08-09ibmvnic: Remove duplicate memory barriers in txNick Child
send_subcrq_[in]direct() already has a dma memory barrier. Remove the earlier one. Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://patch.msgid.link/20240807211809.1259563-5-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09ibmvnic: Reduce memcpys in tx descriptor generationNick Child
Previously, when creating the header descriptors, the driver would:
1. allocate a temporary buffer on the stack (in build_hdr_descs_arr)
2. memcpy the header info into the temporary buffer (in build_hdr_data)
3. memcpy the temp buffer into a local variable (in create_hdr_descs)
4. copy the local variable into the return buffer (in create_hdr_descs)

Since there is no opportunity for errors during this process, the temp buffer is not needed and work can be done on the return buffer directly. Repurpose build_hdr_data() to only calculate the header lengths. Rename it to get_hdr_lens(). Edit create_hdr_descs() to read from the skb directly and copy directly into the return buffer. The process now involves fewer memory and write operations while also being more readable.

Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://patch.msgid.link/20240807211809.1259563-4-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09ibmvnic: Use header len helper functions on txNick Child
Use the header length helper functions rather than trying to calculate it within the driver. There are defined functions for mac and network headers (skb_mac_header_len and skb_network_header_len) but no such function exists for the transport header length. Also, hdr_data was memset during allocation to all 0's so no need to memset again. Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://patch.msgid.link/20240807211809.1259563-3-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09ibmvnic: Only replenish rx pool when resources are getting lowNick Child
Previously, the driver would replenish the rx pool if the polling function consumed less than the budget. The logic was that the driver did not exhaust its budget, so it must not be busy and must have cycles to spare for replenishing the pool. So pool replenishment happened on every poll which did not consume the budget. This can be very costly during request-response tests. In fact, an extra ~100pps can be seen in TCP_RR_150 tests when we remove this conditional.

Trace results (ftrace, graph-time=1) for the poll function are below:
  Previous results:
    ibmvnic_poll = 64951846.0 us / 4167628.0 hits = AVG 15.58
    replenish_rx_pool = 17602846.0 us / 4710437.0 hits = AVG 3.74
  Now:
    ibmvnic_poll = 57673941.0 us / 4791737.0 hits = AVG 12.04
    replenish_rx_pool = 3938171.6 us / 4314.0 hits = AVG 912.88

While the replenish function takes longer, it is hit less frequently, meaning the ibmvnic_poll function is, on average, faster. Furthermore, this change does not have a negative effect on performance bandwidth/latency measurements.

Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://patch.msgid.link/20240807211809.1259563-2-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
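The idea of replenishing on a low-water mark rather than on every under-budget poll can be sketched as below. The message does not state the threshold the driver uses, so the half-full mark here is purely an assumption for illustration, as are the names `struct rx_pool` and `should_replenish`.

```c
#include <stdbool.h>

/* Hypothetical view of an rx buffer pool. */
struct rx_pool {
	unsigned int size;      /* total buffers the pool can hold */
	unsigned int available; /* buffers currently posted to the device */
};

/*
 * ASSUMPTION: replenish once fewer than half of the buffers remain.
 * The actual threshold is not given in the commit message.
 */
static bool should_replenish(const struct rx_pool *pool)
{
	return pool->available < pool->size / 2;
}
```

With a check of this kind, replenishment runs rarely but refills many buffers at once, which fits the trace data above: far fewer hits on replenish_rx_pool, each one taking longer.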
2024-08-09net: fs_enet: Fix warning due to wrong typeChristophe Leroy
Building fs_enet on powerpc e500 leads to the following warning:

    CC      drivers/net/ethernet/freescale/fs_enet/mac-scc.o
  In file included from ./include/linux/build_bug.h:5,
                   from ./include/linux/container_of.h:5,
                   from ./include/linux/list.h:5,
                   from ./include/linux/module.h:12,
                   from drivers/net/ethernet/freescale/fs_enet/mac-scc.c:15:
  drivers/net/ethernet/freescale/fs_enet/mac-scc.c: In function 'allocate_bd':
  ./include/linux/err.h:28:49: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
     28 | #define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned long)-MAX_ERRNO)
        |                                                 ^
  ./include/linux/compiler.h:77:45: note: in definition of macro 'unlikely'
     77 | # define unlikely(x) __builtin_expect(!!(x), 0)
        |                                         ^
  drivers/net/ethernet/freescale/fs_enet/mac-scc.c:138:13: note: in expansion of macro 'IS_ERR_VALUE'
    138 |         if (IS_ERR_VALUE(fep->ring_mem_addr))
        |             ^~~~~~~~~~~~

This is because fep->ring_mem_addr is not a pointer but a DMA address, which is 64 bits wide on that platform, while pointers are 32 bits wide because this is a 32-bit platform with a wider physical bus.

However, using fep->ring_mem_addr is just wrong because cpm_muram_alloc() returns an offset within the muram and not a physical address directly. So use fpi->dpram_offset instead.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/ec67ea3a3bef7e58b8dc959f7c17d405af0d27e4.1723101144.git.christophe.leroy@csgroup.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
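The range test that IS_ERR_VALUE() performs can be sketched in userspace without the pointer cast that triggers the warning. This is a simplified illustration, not the kernel macro: `offset_is_err` is a hypothetical helper, `MAX_ERRNO` is redefined locally with the value the kernel uses, and the offset is assumed to be a pointer-sized unsigned long.

```c
#include <stdbool.h>

/* Same value as the kernel's MAX_ERRNO; redefined for this sketch. */
#define MAX_ERRNO 4095

/*
 * An allocator that returns either an offset or a negative errno
 * encodes errors in the top MAX_ERRNO values of the unsigned range.
 * The comparison only makes sense when the value being tested has
 * the width of unsigned long, which is why feeding it a wider DMA
 * address is a type error.
 */
static bool offset_is_err(unsigned long offset)
{
	return offset >= (unsigned long)-MAX_ERRNO;
}
```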
2024-08-09net: usb: cdc_ether: don't spew notificationszhangxiangqian
The usbnet_link_change function is not called, if the link has not changed. ... [16913.807393][ 3] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 12 may have been dropped [16913.822266][ 2] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 12 may have been dropped [16913.826296][ 2] cdc_ether 1-2:2.0 enx00e0995fd1ac: kevent 11 may have been dropped ... kevent 11 is scheduled too frequently and may affect other event schedules. Signed-off-by: zhangxiangqian <zhangxiangqian@kylinos.cn> Acked-by: Oliver Neukum <oneukum@suse.com> Link: https://patch.msgid.link/1723109985-11996-1-git-send-email-zhangxiangqian@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
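The behaviour described here (skip the notification when the reported state matches the current one) can be sketched as follows. `struct link_state` and `handle_connection_event` are hypothetical names; the counter merely stands in for scheduling a link-change event and is not how usbnet tracks state.

```c
#include <stdbool.h>

/* Hypothetical link tracker; `notifications` counts scheduled events. */
struct link_state {
	bool up;
	unsigned int notifications;
};

/* Schedule a link-change event only when the state actually changes. */
static void handle_connection_event(struct link_state *s, bool reported_up)
{
	if (s->up == reported_up)
		return; /* repeated status report: nothing to do */

	s->up = reported_up;
	s->notifications++;
}
```

A device that repeats the same status in every notification then produces one event per real transition instead of one per report.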
2024-08-09Merge branch 'don-t-take-hw-uso-path-when-packets-can-t-be-checksummed-by-device'Jakub Kicinski
Jakub Sitnicki says: ==================== Don't take HW USO path when packets can't be checksummed by device This series addresses a recent regression report from syzbot [1]. After enabling UDP_SEGMENT for egress devices which don't support checksum offload [2], we need to tighten down the checks which let packets take the HW USO path. The fix consists of two parts: 1. don't let devices offer USO without checksum offload, and 2. force software USO fallback in presence of IPv6 extension headers. [1] https://lore.kernel.org/all/000000000000e1609a061d5330ce@google.com/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10154dbded6d6a2fecaebdfda206609de0f121a9 v3: https://lore.kernel.org/r/20240807-udp-gso-egress-from-tunnel-v3-0-8828d93c5b45@cloudflare.com v2: https://lore.kernel.org/r/20240801-udp-gso-egress-from-tunnel-v2-0-9a2af2f15d8d@cloudflare.com v1: https://lore.kernel.org/r/20240725-udp-gso-egress-from-tunnel-v1-0-5e5530ead524@cloudflare.com ==================== Link: https://patch.msgid.link/20240808-udp-gso-egress-from-tunnel-v4-0-f5c5b4149ab9@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
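The two parts of the fix can be modelled with a pair of small predicates. This is a sketch of the stated rules only; `struct dev_caps`, `fixup_caps`, and `can_take_hw_uso` are hypothetical names and do not correspond to the kernel's feature flags or functions.

```c
#include <stdbool.h>

/* Hypothetical capability set for an egress device. */
struct dev_caps {
	bool csum_offload; /* device can checksum packets */
	bool uso;          /* device offers UDP segmentation offload */
};

/* Part 1: a device without checksum offload must not offer USO. */
static void fixup_caps(struct dev_caps *caps)
{
	if (!caps->csum_offload)
		caps->uso = false;
}

/*
 * Part 2: even on a capable device, fall back to software USO when
 * the packet carries IPv6 extension headers.
 */
static bool can_take_hw_uso(const struct dev_caps *caps, bool has_ipv6_ext_hdrs)
{
	return caps->uso && caps->csum_offload && !has_ipv6_ext_hdrs;
}
```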