summaryrefslogtreecommitdiff
path: root/drivers/net
AgeCommit message (Collapse)Author
2025-04-23Revert "wifi: iwlwifi: add support for BE213"Miri Korenblit
This reverts commit 16a8d9a739430bec9c11eda69226c5a39f3478aa. This device needs commit 75a3313f52b7 ("wifi: iwlwifi: make no_160 more generic"), which has a bug and is being reverted until it is fixed. Since this device wasn't shipped yet it is ok to not support it. Reported-by: Todd Brandt <todd.e.brandt@intel.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220029 Fixes: 16a8d9a73943 ("wifi: iwlwifi: add support for BE213") Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://patch.msgid.link/20250420115541.581160ae3e4b.Icecc46baee8a797c00ad04fab92d7d1114b84829@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-04-22xdp: create locked/unlocked instances of xdp redirect target settersJoshua Washington
Commit 03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock") introduces the netdev lock to xdp_set_features_flag(). The change includes a _locked version of the method, as it is possible for a driver to have already acquired the netdev lock before calling this helper. However, the same applies to xdp_features_(set|clear)_redirect_flags(), which ends up calling the unlocked version of xdp_set_features_flags() leading to deadlocks in GVE, which grabs the netdev lock as part of its suspend, reset, and shutdown processes: [ 833.265543] WARNING: possible recursive locking detected [ 833.270949] 6.15.0-rc1 #6 Tainted: G E [ 833.276271] -------------------------------------------- [ 833.281681] systemd-shutdow/1 is trying to acquire lock: [ 833.287090] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: xdp_set_features_flag+0x29/0x90 [ 833.295470] [ 833.295470] but task is already holding lock: [ 833.301400] ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve] [ 833.309508] [ 833.309508] other info that might help us debug this: [ 833.316130] Possible unsafe locking scenario: [ 833.316130] [ 833.322142] CPU0 [ 833.324681] ---- [ 833.327220] lock(&dev->lock); [ 833.330455] lock(&dev->lock); [ 833.333689] [ 833.333689] *** DEADLOCK *** [ 833.333689] [ 833.339701] May be due to missing lock nesting notation [ 833.339701] [ 833.346582] 5 locks held by systemd-shutdow/1: [ 833.351205] #0: ffffffffa9c89130 (system_transition_mutex){+.+.}-{4:4}, at: __se_sys_reboot+0xe6/0x210 [ 833.360695] #1: ffff93b399e5c1b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xb4/0x1f0 [ 833.369144] #2: ffff949d19a471b8 (&dev->mutex){....}-{4:4}, at: device_shutdown+0xc2/0x1f0 [ 833.377603] #3: ffffffffa9eca050 (rtnl_mutex){+.+.}-{4:4}, at: gve_shutdown+0x33/0x90 [gve] [ 833.386138] #4: ffff949d2b148c68 (&dev->lock){+.+.}-{4:4}, at: gve_shutdown+0x44/0x90 [gve] Introduce xdp_features_(set|clear)_redirect_target_locked() versions which assume that the netdev lock has already been acquired before setting the XDP feature flag and update GVE to use the locked version. Fixes: 03df156dd3a6 ("xdp: double protect netdev->xdp_flags with netdev->lock") Tested-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250422011643.3509287-1-joshwash@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: wangxun: restrict feature flags for tunnel packetsJiawen Wu
Implement ndo_features_check to restrict Tx checksum offload flags, since there are some inner layer length and protocols unsupported. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/20250421022956.508018-3-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: txgbe: Support to set UDP tunnel portJiawen Wu
Tunnel types VXLAN/VXLAN_GPE/GENEVE are supported for txgbe devices. The hardware supports to set only one port for each tunnel type. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Link: https://patch.msgid.link/20250421022956.508018-2-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22ppp: Split ppp_exit_net() to ->exit_rtnl().Kuniyuki Iwashima
ppp_exit_net() unregisters devices related to the netns under RTNL and destroys lists and IDR. Let's use ->exit_rtnl() for the device unregistration part to save RTNL dances for each netns. Note that we delegate the for_each_netdev_safe() part to default_device_exit_batch() and replace unregister_netdevice_queue() with ppp_nl_dellink() to align with bond, geneve, gtp, and pfcp. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418003259.48017-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22pfcp: Convert pfcp_net_exit() to ->exit_rtnl().Kuniyuki Iwashima
pfcp_net_exit() holds RTNL and cleans up all devices in the netns and other devices tied to sockets in the netns. We can use ->exit_rtnl() to save RTNL dance for all dying netns. Note that we delegate the for_each_netdev() part to default_device_exit_batch() to avoid a list corruption splat like the one reported in commit 4ccacf86491d ("gtp: Suppress list corruption splat in gtp_net_exit_batch_rtnl().") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418003259.48017-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: stmmac: visconti: convert to set_clk_tx_rate() methodRussell King (Oracle)
Convert visconti to use the set_clk_tx_rate() method. By doing so, the GMAC control register will already have been updated (unlike with the fix_mac_speed() method) so this code can be removed while porting to the set_clk_tx_rate() method. There is also no need for the spinlock, and has never been - neither fix_mac_speed() nor set_clk_tx_rate() can be called by more than one thread at a time, so the lock does nothing useful. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1u5SiQ-001I0B-OQ@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: dsa: rzn1_a5psw: Make the read-only array offsets static constColin Ian King
Don't populate the read-only array offsets on the stack at run time, instead make it static const. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Link: https://patch.msgid.link/20250417161353.490219-1-colin.i.king@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: ethernet: mtk_eth_soc: net: revise NETSYSv3 hardware configurationBo-Cun Chen
Change hardware configuration for the NETSYSv3. - Enable PSE dummy page mechanism for the GDM1/2/3 - Enable PSE drop mechanism when the WDMA Rx ring full - Enable PSE no-drop mechanism for packets from the WDMA Tx - Correct PSE free drop threshold - Correct PSE CDMA high threshold Fixes: 1953f134a1a8b ("net: ethernet: mtk_eth_soc: add NETSYS_V3 version support") Signed-off-by: Bo-Cun Chen <bc-bocun.chen@mediatek.com> Signed-off-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/b71f8fd9d4bb69c646c4d558f9331dd965068606.1744907886.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: stmmac: Add DWMAC glue layer for Renesas GBETHLad Prabhakar
Add the DWMAC glue layer for the GBETH IP found in the Renesas RZ/V2H(P) SoC. Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de> Link: https://patch.msgid.link/20250417084015.74154-4-prabhakar.mahadev-lad.rj@bp.renesas.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22virtio-net: disable delayed refill when pausing rxBui Quang Minh
When pausing rx (e.g. set up xdp, xsk pool, rx resize), we call napi_disable() on the receive queue's napi. In delayed refill_work, it also calls napi_disable() on the receive queue's napi. When napi_disable() is called on an already disabled napi, it will sleep in napi_disable_locked while still holding the netdev_lock. As a result, later napi_enable gets stuck too as it cannot acquire the netdev_lock. This leads to refill_work and the pause-then-resume tx are stuck altogether. This scenario can be reproducible by binding a XDP socket to virtio-net interface without setting up the fill ring. As a result, try_fill_recv will fail until the fill ring is set up and refill_work is scheduled. This commit adds virtnet_rx_(pause/resume)_all helpers and fixes up the virtnet_rx_resume to disable future and cancel all inflights delayed refill_work before calling napi_disable() to pause the rx. Fixes: 413f0271f396 ("net: protect NAPI enablement with netdev_lock()") Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250417072806.18660-2-minhquangbui99@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: phy: leds: fix memory leakQingfang Deng
A network restart test on a router led to an out-of-memory condition, which was traced to a memory leak in the PHY LED trigger code. The root cause is misuse of the devm API. The registration function (phy_led_triggers_register) is called from phy_attach_direct, not phy_probe, and the unregister function (phy_led_triggers_unregister) is called from phy_detach, not phy_remove. This means the register and unregister functions can be called multiple times for the same PHY device, but devm-allocated memory is not freed until the driver is unbound. This also prevents kmemleak from detecting the leak, as the devm API internally stores the allocated pointer. Fix this by replacing devm_kzalloc/devm_kcalloc with standard kzalloc/kcalloc, and add the corresponding kfree calls in the unregister path. Fixes: 3928ee6485a3 ("net: phy: leds: Add support for "link" trigger") Fixes: 2e0bc452f472 ("net: phy: leds: add support for led triggers on phy link state change") Signed-off-by: Hao Guan <hao.guan@siflower.com.cn> Signed-off-by: Qingfang Deng <qingfang.deng@siflower.com.cn> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250417032557.2929427-1-dqfext@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22emulex/benet: Annotate flash_cookie as nonstringKees Cook
GCC 15's new -Wunterminated-string-initialization notices that the 32 character "flash_cookie" (which is not used as a C-String) needs to be marked as "nonstring": drivers/net/ethernet/emulex/benet/be_cmds.c:2618:51: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Wunterminated-string-initialization] 2618 | static char flash_cookie[2][16] = {"*** SE FLAS", "H DIRECTORY *** "}; | ^~~~~~~~~~~~~~~~~~ Add this annotation, avoid using a multidimensional array, but keep the string split (with a comment about why). Additionally mark it const and annotate the "cookie" member that is being memcmp()ed against as nonstring too. Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250416221028.work.967-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: phy: dp83822: Add support for changing the MAC terminationDimitri Fedrau
The dp83822 provides the possibility to set the resistance value of the the MAC termination. Modifying the resistance to an appropriate value can reduce signal reflections and therefore improve signal quality. Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com> Link: https://patch.msgid.link/20250416-dp83822-mac-impedance-v3-4-028ac426cddb@liebherr.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: phy: Add helper for getting MAC termination resistanceDimitri Fedrau
Add helper which returns the MAC termination resistance value. Modifying the resistance to an appropriate value can reduce signal reflections and therefore improve signal quality. Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Dimitri Fedrau <dimitri.fedrau@liebherr.com> Link: https://patch.msgid.link/20250416-dp83822-mac-impedance-v3-3-028ac426cddb@liebherr.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22r8169: use pci_prepare_to_sleep in rtl_shutdownHeiner Kallweit
Use pci_prepare_to_sleep() like PCI core does in pci_pm_suspend_noirq. This aligns setting a low-power mode during shutdown with the handling of the transition to system suspend. Also the transition to runtime suspend uses pci_target_state() instead of setting D3hot unconditionally. Note: pci_prepare_to_sleep() uses device_may_wakeup() to check whether device may generate wakeup events. So we don't lose anything by not passing tp->saved_wolopts any longer. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/f573fdbd-ba6d-41c1-b68f-311d3c88db2c@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22octeontx2-af: Remove unused rvu_npc_enable_bcast_entryDr. David Alan Gilbert
The last use of rvu_npc_enable_bcast_entry() was removed in 2021 by commit 967db3529eca ("octeontx2-af: add support for multicast/promisc packet replication feature") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250420225810.171852-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22net: phylink: fix suspend/resume with WoL enabled and link downRussell King (Oracle)
When WoL is enabled, we update the software state in phylink to indicate that the link is down, and disable the resolver from bringing the link back up. On resume, we attempt to bring the overall state into consistency by calling the .mac_link_down() method, but this is wrong if the link was already down, as phylink strictly orders the .mac_link_up() and .mac_link_down() methods - and this would break that ordering. Fixes: f97493657c63 ("net: phylink: add suspend/resume support") Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1u55Qf-0016RN-PA@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-22rtase: Add ndo_setup_tc support for CBS offload in traffic control setupJustin Lai
Add support for ndo_setup_tc to enable CBS offload functionality as part of traffic control configuration for network devices, where CBS is applied from the CPU to the switch. More specifically, CBS is applied at the GMAC in the topmost architecture diagram. Signed-off-by: Justin Lai <justinlai0215@realtek.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250416115757.28156-1-justinlai0215@realtek.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22net: phy: microchip: force IRQ polling mode for lan88xxFiona Klute
With lan88xx based devices the lan78xx driver can get stuck in an interrupt loop while bringing the device up, flooding the kernel log with messages like the following: lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped Removing interrupt support from the lan88xx PHY driver forces the driver to use polling instead, which avoids the problem. The issue has been observed with Raspberry Pi devices at least since 4.14 (see [1], bug report for their downstream kernel), as well as with Nvidia devices [2] in 2020, where disabling interrupts was the vendor-suggested workaround (together with the claim that phylib changes in 4.9 made the interrupt handling in lan78xx incompatible). Iperf reports well over 900Mbits/sec per direction with client in --dualtest mode, so there does not seem to be a significant impact on throughput (lan88xx device connected via switch to the peer). [1] https://github.com/raspberrypi/linux/issues/2447 [2] https://forums.developer.nvidia.com/t/jetson-xavier-and-lan7800-problem/142134/11 Link: https://lore.kernel.org/0901d90d-3f20-4a10-b680-9c978e04ddda@lunn.ch Fixes: 792aec47d59d ("add microchip LAN88xx phy driver") Signed-off-by: Fiona Klute <fiona.klute@gmx.de> Cc: kernel-list@raspberrypi.com Cc: stable@vger.kernel.org Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250416102413.30654-1-fiona.klute@gmx.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22ionic: add module eeprom channel data to ionic_if and ethtoolShannon Nelson
Make the CMIS module type's page 17 channel data available for ethtool to request. As done previously, carve space for this data from the port_info reserved space. In the future, if additional pages are needed, a new firmware AdminQ command will be added for accessing random pages. Reviewed-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415231317.40616-4-shannon.nelson@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22ionic: support ethtool get_module_eeprom_by_pageShannon Nelson
Add support for the newer get_module_eeprom_by_page interface. Only the upper half of the 256 byte page is available for reading, and the firmware puts the two sections into the extended sprom buffer, so a union is used over the extended sprom buffer to make clear which page is to be accessed. With get_module_eeprom_by_page implemented there is no need for the older get_module_info or git_module_eeprom interfaces, so remove them. Reviewed-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415231317.40616-3-shannon.nelson@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22ionic: extend the QSFP module sprom for more pagesShannon Nelson
Some QSFP modules have more eeprom to be read by ethtool than the initial high and low page 0 that is currently available in the DSC's ionic sprom[] buffer. Since the current sprom[] is baked into the middle of an existing API struct, to make the high end of page 1 and page 2 available a block is carved from a reserved space of the existing port_info struct and the ionic_get_module_eeprom() service is taught how to get there. Newer firmware writes the additional QSFP page info here, yet this remains backward compatible because older firmware sets this space to all 0 and older ionic drivers do not use the reserved space. Reviewed-by: Brett Creeley <brett.creeley@amd.com> Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250415231317.40616-2-shannon.nelson@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Convert FDB table to rhashtableIdo Schimmel
FDB entries are currently stored in a hash table with a fixed number of buckets (256), resulting in performance degradation as the number of entries grows. Solve this by converting the driver to use rhashtable which maintains more or less constant performance regardless of the number of entries. Measured transmitted packets per second using a single pktgen thread with varying number of entries when the transmitted packet always hits the default entry (worst case): Number of entries | Improvement ------------------|------------ 1k | +1.12% 4k | +9.22% 16k | +55% 64k | +585% 256k | +2460% In addition, the change reduces the size of the VXLAN device structure from 2584 bytes to 672 bytes. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-16-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Introduce FDB key structureIdo Schimmel
In preparation for converting the FDB table to rhashtable, introduce a key structure that includes the MAC address and source VNI. No functional changes intended. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-15-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Do not treat dst cache initialization errors as fatalIdo Schimmel
FDB entries are allocated in an atomic context as they can be added from the data path when learning is enabled. After converting the FDB hash table to rhashtable, the insertion rate will be much higher (*) which will entail a much higher rate of per-CPU allocations via dst_cache_init(). When adding a large number of entries (e.g., 256k) in a batch, a small percentage (< 0.02%) of these per-CPU allocations will fail [1]. This does not happen with the current code since the insertion rate is low enough to give the per-CPU allocator a chance to asynchronously create new chunks of per-CPU memory. Given that: a. Only a small percentage of these per-CPU allocations fail. b. The scenario where this happens might not be the most realistic one. c. The driver can work correctly without dst caches. The dst_cache_*() APIs first check that the dst cache was properly initialized. d. The dst caches are not always used (e.g., 'tos inherit'). It seems reasonable to not treat these allocation failures as fatal. Therefore, do not bail when dst_cache_init() fails and suppress warnings by specifying '__GFP_NOWARN'. [1] percpu: allocation failed, size=40 align=8 atomic=1, atomic alloc failed, no space left (*) 97% reduction in average latency of vxlan_fdb_update() when adding 256k entries in a batch. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-14-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Create wrappers for FDB lookupIdo Schimmel
__vxlan_find_mac() is called from both the data path (e.g., during learning) and the control path (e.g., when replacing an entry). The function is missing lockdep annotations to make sure that the FDB hash lock is held during FDB updates. Rename __vxlan_find_mac() to vxlan_find_mac_rcu() to reflect the fact that it should be called from an RCU read-side critical section and call it from vxlan_find_mac() which checks that the FDB hash lock is held. Change callers to invoke the appropriate function. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-13-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Rename FDB Tx lookup functionIdo Schimmel
vxlan_find_mac() is only expected to be called from the Tx path as it updates the 'used' timestamp. Rename it to vxlan_find_mac_tx() to reflect that and to avoid incorrect updates of this timestamp like those addressed by commit 9722f834fe9a ("vxlan: Avoid unnecessary updates to FDB 'used' time"). No functional changes intended. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-12-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Convert FDB flushing to RCUIdo Schimmel
Instead of holding the FDB hash lock when traversing the FDB linked list during flushing, use RCU and only acquire the lock for entries that need to be flushed. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-11-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Convert FDB garbage collection to RCUIdo Schimmel
Instead of holding the FDB hash lock when traversing the FDB linked list during garbage collection, use RCU and only acquire the lock for entries that need to be removed (aged out). Avoid races by using hlist_unhashed() to check that the entry has not been removed from the list by another thread. Note that vxlan_fdb_destroy() uses hlist_del_init_rcu() to remove an entry from the list which should cause list_unhashed() to return true. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-10-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Use linked list to traverse FDB entriesIdo Schimmel
In preparation for removing the fixed size hash table, convert FDB entry traversal to use the newly added FDB linked list. No functional changes intended. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-9-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Add a linked list of FDB entriesIdo Schimmel
Currently, FDB entries are stored in a hash table with a fixed number of buckets. The table is used for both lookups and entry traversal. Subsequent patches will convert the table to rhashtable which is not suitable for entry traversal. In preparation for this conversion, add FDB entries to a linked list. Subsequent patches will convert the driver to use this list when traversing entries during dump, flush, etc. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-8-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Use a single lock to protect the FDB tableIdo Schimmel
Currently, the VXLAN driver stores FDB entries in a hash table with a fixed number of buckets (256). Subsequent patches are going to convert this table to rhashtable with a linked list for entry traversal, as rhashtable is more scalable. In preparation for this conversion, move from a per-bucket spin lock to a single spin lock that protects the entire FDB table. The per-bucket spin locks were introduced by commit fe1e0713bbe8 ("vxlan: Use FDB_HASH_SIZE hash_locks to reduce contention") citing "huge contention when inserting/deleting vxlan_fdbs into the fdb_head". It is not clear from the commit message which code path was holding the spin lock for long periods of time, but the obvious suspect is the FDB cleanup routine (vxlan_cleanup()) that periodically traverses the entire table in order to delete aged-out entries. This will be solved by subsequent patches that will convert the FDB cleanup routine to traverse the linked list of FDB entries using RCU, only acquiring the spin lock when deleting an aged-out entry. The change reduces the size of the VXLAN device structure from 3600 bytes to 2576 bytes. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-7-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Relocate assignment of default remote deviceIdo Schimmel
The default FDB entry can be associated with a net device if a physical device (i.e., 'dev PHYS_DEV') was specified during the creation of the VXLAN device. The assignment of the net device pointer to 'dst->remote_dev' logically belongs in the if block that resolves the pointer from the specified ifindex, so move it there. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-6-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Unsplit default FDB entry creation and notificationIdo Schimmel
Commit 0241b836732f ("vxlan: fix default fdb entry netlink notify ordering during netdev create") split the creation of the default FDB entry from its notification to avoid sending a RTM_NEWNEIGH notification before RTM_NEWLINK. Previous patches restructured the code so that the default FDB entry is created after registering the VXLAN device and the notification about the new entry immediately follows its creation. Therefore, simplify the code and revert back to vxlan_fdb_update() which takes care of both creating the FDB entry and notifying user space about it. Hold the FDB hash lock when calling vxlan_fdb_update() like it expects. A subsequent patch will add a lockdep assertion to make sure this is indeed the case. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-5-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Insert FDB into hash table in vxlan_fdb_create()Ido Schimmel
Commit 7c31e54aeee5 ("vxlan: do not destroy fdb if register_netdevice() is failed") split the insertion of FDB entries into the FDB hash table from the function where they are created. This was done in order to work around a problem that is no longer possible after the previous patch. Simplify the code and move the body of vxlan_fdb_insert() back into vxlan_fdb_create(). Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-4-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Simplify creation of default FDB entryIdo Schimmel
There is asymmetry in how the default FDB entry (all-zeroes) is created and destroyed in the VXLAN driver. It is created as part of the driver's newlink() routine, but destroyed as part of its ndo_uninit() routine. This caused multiple problems in the past. First, commit 0241b836732f ("vxlan: fix default fdb entry netlink notify ordering during netdev create") split the notification about the entry from its creation so that it will not be notified to user space before the VXLAN device is registered. Then, commit 6db924687139 ("vxlan: Fix error path in __vxlan_dev_create()") made the error path in __vxlan_dev_create() asymmetric by destroying the FDB entry before unregistering the net device. Otherwise, the FDB entry would have been freed twice: By ndo_uninit() as part of unregister_netdevice() and by vxlan_fdb_destroy() in the error path. Finally, commit 7c31e54aeee5 ("vxlan: do not destroy fdb if register_netdevice() is failed") split the insertion of the FDB entry into the hash table from its creation, moving the insertion after the registration of the net device. Otherwise, like before, the FDB entry would have been freed twice: By ndo_uninit() as part of register_netdevice()'s error path and by vxlan_fdb_destroy() in the error path of __vxlan_dev_create(). The end result is that the code is unnecessarily complex. In addition, the fixed size hash table cannot be converted to rhashtable as vxlan_fdb_insert() cannot fail, which will no longer be true with rhashtable. Solve this by making the addition and deletion of the default FDB entry completely symmetric. Namely, as part of newlink() routine, create the entry, insert it into to the hash table and send a notification to user space after the net device was registered. Note that at this stage the net device is still administratively down and cannot transmit / receive packets. Move the deletion from ndo_uninit() to the dellink routine(): Flush the default entry together with all the other entries, before unregistering the net device. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-3-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-22vxlan: Add RCU read-side critical sections in the Tx pathIdo Schimmel
The Tx path does not run from an RCU read-side critical section which makes the current lockless accesses to FDB entries invalid. As far as I am aware, this has not been a problem in practice, but traces will be generated once we transition the FDB lookup to rhashtable_lookup(). Add rcu_read_{lock,unlock}() around the handling of FDB entries in the Tx path. Remove the RCU read-side critical section from vxlan_xmit_nh() as now the function is always called from an RCU read-side critical section. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250415121143.345227-2-idosch@nvidia.com Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-21net: enetc: fix frame corruption on bpf_xdp_adjust_head/tail() and XDP_PASSVladimir Oltean
Vlatko Markovikj reported that XDP programs attached to ENETC do not work well if they use bpf_xdp_adjust_head() or bpf_xdp_adjust_tail(), combined with the XDP_PASS verdict. A typical use case is to add or remove a VLAN tag. The resulting sk_buff passed to the stack is corrupted, because the algorithm used by the driver for XDP_PASS is to unwind the current buffer pointer in the RX ring and to re-process the current frame with enetc_build_skb() as if XDP hadn't run. That is incorrect because XDP may have modified the geometry of the buffer, which we then are completely unaware of. We are looking at a modified buffer with the original geometry. The initial reaction, both from me and from Vlatko, was to shop around the kernel for code to steal that would calculate a delta between the old and the new XDP buffer geometry, and apply that to the sk_buff too. We noticed that veth and generic xdp have such code. The headroom adjustment is pretty uncontroversial, but what turned out severely problematic is the tailroom. veth has this snippet: __skb_put(skb, off); /* positive on grow, negative on shrink */ which on first sight looks decent enough, except __skb_put() takes an "unsigned int" for the second argument, and the arithmetic seems to only work correctly by coincidence. Second issue, __skb_put() contains a SKB_LINEAR_ASSERT(). It's not a great pattern to make more widespread. The skb may still be nonlinear at that point - it only becomes linear later when resetting skb->data_len to zero. To avoid the above, bpf_prog_run_generic_xdp() does this instead: skb_set_tail_pointer(skb, xdp->data_end - xdp->data); skb->len += off; /* positive on grow, negative on shrink */ which is more open-coded, uses lower-level functions and is in general a bit too much to spread around in driver code. Then there is the snippet: if (xdp_buff_has_frags(xdp)) skb->data_len = skb_shinfo(skb)->xdp_frags_size; else skb->data_len = 0; One would have expected __pskb_trim() to be the function of choice for this task. But it's not used in veth/xdpgeneric because the extraneous fragments were _already_ freed by bpf_xdp_adjust_tail() -> bpf_xdp_frags_shrink_tail() -> ... -> __xdp_return() - the backing memory for the skb frags and the xdp frags is the same, but they don't keep individual references. In fact, that is the biggest reason why this snippet cannot be reused as-is, because ENETC temporarily constructs an skb with the original len and the original number of frags. Because the extraneous frags are already freed by bpf_xdp_adjust_tail() and returned to the page allocator, it means the entire approach of using enetc_build_skb() is questionable for XDP_PASS. To avoid that, one would need to elevate the page refcount of all frags before calling bpf_prog_run_xdp() and drop it after XDP_PASS. There are other things that are missing in ENETC's handling of XDP_PASS, like for example updating skb_shinfo(skb)->meta_len. These are all handled correctly and cleanly in commit 539c1fba1ac7 ("xdp: add generic xdp_build_skb_from_buff()"), added to net-next in Dec 2024, and in addition might even be quicker that way. I have a very strong preference towards backporting that commit for "stable", and that is what is used to fix the handling bugs. It is way too messy to go this deep into the guts of an sk_buff from the code of a device driver. Fixes: d1b15102dd16 ("net: enetc: add support for XDP_DROP and XDP_PASS") Reported-by: Vlatko Markovikj <vlatko.markovikj@etas.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20250417120005.3288549-4-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21net: enetc: refactor bulk flipping of RX buffers to separate functionVladimir Oltean
This small snippet of code ensures that we do something with the array of RX software buffer descriptor elements after passing the skb to the stack. In this case, we see if the other half of the page is reusable, and if so, we "turn around" the buffers, making them directly usable by enetc_refill_rx_ring() without going to enetc_new_page(). We will need to perform this kind of buffer flipping from a new code path, i.e. from XDP_PASS. Currently, enetc_build_skb() does it there buffer by buffer, but in a subsequent change we will stop using enetc_build_skb() for XDP_PASS. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20250417120005.3288549-3-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21net: enetc: register XDP RX queues with frag_sizeVladimir Oltean
At the time when bpf_xdp_adjust_tail() gained support for non-linear buffers, ENETC was already generating this kind of geometry on RX, due to its use of 2K half page buffers. Frames larger than 1472 bytes (without FCS) are stored as multi-buffer, presenting a need for multi buffer support to work properly even in standard MTU circumstances. Allow bpf_xdp_frags_increase_tail() to know the allocation size of paged data, so it can safely permit growing the tailroom of the buffer from XDP programs. Fixes: bf25146a5595 ("bpf: add frags support to the bpf_xdp_adjust_tail() API") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20250417120005.3288549-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21xen-netfront: handle NULL returned by xdp_convert_buff_to_frame()Alexey Nepomnyashih
The function xdp_convert_buff_to_frame() may return NULL if it fails to correctly convert the XDP buffer into an XDP frame due to memory constraints, internal errors, or invalid data. Failing to check for NULL may lead to a NULL pointer dereference if the result is used later in processing, potentially causing crashes, data corruption, or undefined behavior. On XDP redirect failure, the associated page must be released explicitly if it was previously retained via get_page(). Failing to do so may result in a memory leak, as the pages reference count is not decremented. Cc: stable@vger.kernel.org # v5.9+ Fixes: 6c5aa6fc4def ("xen networking: add basic XDP support for xen-netfront") Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru> Link: https://patch.msgid.link/20250417122118.1009824-1-sdl@nppct.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21bnxt_en: Remove unused macros in bnxt_ulp.hKalesh AP
BNXT_ROCE_ULP and BNXT_MAX_ULP are no longer used. Remove them to clean up the code. Reviewed-by: Shruti Parab <shruti.parab@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250417172448.1206107-5-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21bnxt_en: Remove unused field "ref_count" in struct bnxt_ulpKalesh AP
The "ref_count" field in struct bnxt_ulp is unused after commit a43c26fa2e6c ("RDMA/bnxt_re: Remove the sriov config callback"). So we can just remove it now. Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250417172448.1206107-4-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21bnxt_en: Report the ethtool coredump length after copying the coredumpShruti Parab
ethtool first calls .get_dump_flags() to get the dump length. For coredump, the driver calls the FW to get the coredump length (L1). The min. of L1 and the user specified length is then passed to .get_dump_data() (L2) to get the coredump. The actual coredump length retrieved by the FW (L3) during .get_dump_data() may be smaller than L1. This length discrepancy will trigger a WARN_ON() in ethtool_get_dump_data(). ethtool has already vzalloc'ed a buffer with size L1. Just report the coredump length as L2 even though the actual coredump length L3 may be smaller. The extra zero padding does not matter. This will prevent the warning that may alarm the user. For correctness, only do the final length update if there is no error. Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Signed-off-by: Shruti Parab <shruti.parab@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250417172448.1206107-3-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21bnxt_en: Change FW message timeout warningMichael Chan
The firmware advertises a "hwrm_cmd_max_timeout" value to the driver for NVRAM and coredump related functions that can take tens of seconds to complete. The driver polls for the operation to complete under mutex and may trigger hung task watchdog warning if the wait is too long. To warn the user about this, the driver currently prints a warning if this advertised value exceeds 40 seconds: Device requests max timeout of %d seconds, may trigger hung task watchdog Initially, we chose 40 seconds, well below the kernel's default CONFIG_DEFAULT_HUNG_TASK_TIMEOUT (120 seconds) to avoid triggering the hung task watchdog. But 60 seconds is the timeout on most production FW and cannot be reduced further. Change the driver's warning threshold to 60 seconds to avoid triggering this warning on all production devices. We also print the warning if the value exceeds CONFIG_DEFAULT_HUNG_TASK_TIMEOUT which may be set to architecture specific defaults as low as 10 seconds. Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250417172448.1206107-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21net: stmmac: socfpga: convert to devm_stmmac_pltfr_probe()Russell King (Oracle)
Convert socfpga to use devm_stmmac_pltfr_probe() to further simplify the probe function, wrapping the call to the set_phy_mode() method into socfpga_dwmac_init() which can be called from the plat_dat->init() method. Also call this from socfpga_dwmac_resume() thereby simplifying that function. Using the devm variant also means we can remove the call to stmmac_pltfr_remove(). Unfortunately, we can't also convert to stmmac_pltfr_pm_ops as there is extra work done in socfpga_dwmac_resume(). Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1u5Sns-001IJw-OY@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21net: stmmac: socfpga: call set_phy_mode() before registrationRussell King (Oracle)
Initialisation/setup after registration is a bug. This is the second of two patches fixing this in socfpga. The set_phy_mode() functions do various hardware setup that would interfere with a netdev that has been published, and thus available to be opened by the kernel/userspace. However, set_phy_mode() relies upon the netdev having been initialised to get at the plat_stmmacenet_data structure, which is probably why it was placed after stmmac_drv_probe(). We can remove that need by storing a pointer to struct plat_stmmacenet_data in struct socfpga_dwmac. Move the call to set_phy_mode() before calling stmmac_dvr_probe(). This also simplifies the probe function as there is no need to unregister the netdev if set_phy_mode() fails. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1u5Snn-001IJq-L0@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21net: stmmac: socfpga: convert to stmmac_pltfr_pm_opsRussell King (Oracle)
Convert socfpga to use the generic stmmac_pltfr_pm_ops, which can be achieved by adding an appropriate plat_dat->init function to do the setup. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1u5Sni-001IJk-Gi@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21net: stmmac: socfpga: provide init functionRussell King (Oracle)
Both the resume and probe path needs to configure the phy mode, so provide a common function to do this which can later be hooked into plat_dat->init. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/E1u5Snd-001IJe-Cx@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>