Age  Commit message  Author
2025-02-05  r8169: don't scan PHY addresses > 0  Heiner Kallweit
The PHY address is a dummy, because r8169 PHY access registers don't support a PHY address. Therefore scan address 0 only. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/830637dd-4016-4a68-92b3-618fcac6589d@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05  net: flush_backlog() small changes  Eric Dumazet
Add READ_ONCE() around reads of skb->dev->reg_state, because this field can be changed from other threads/CPUs. Instead of calling dev_kfree_skb_irq() and kfree_skb() while interrupts are masked and locks are held, use a temporary list and __skb_queue_purge_reason(). Use the SKB_DROP_REASON_DEV_READY drop reason to better describe why these skbs are dropped. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20250204144825.316785-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
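The "temporary list" idea in the message above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct pkt`, `detach_for_dev()`, `purge_list()` and `push_pkt()` are hypothetical names, a plain `dev_id` comparison stands in for the kernel's check of the device registration state, and the lock is indicated by comments rather than taken.

```c
#include <stdlib.h>

/* Hypothetical packet node; stands in for an skb on a backlog queue. */
struct pkt {
    struct pkt *next;
    int dev_id;
};

/* Detach every packet belonging to dev_id from *queue and push it onto
 * *doomed. In the kernel this step would run with the queue lock held and
 * interrupts masked, so it only relinks pointers and never frees memory.
 * Returns the number of packets moved. */
int detach_for_dev(struct pkt **queue, int dev_id, struct pkt **doomed)
{
    struct pkt **pp = queue;
    int moved = 0;

    while (*pp) {
        struct pkt *p = *pp;

        if (p->dev_id == dev_id) {
            *pp = p->next;        /* unlink from the live queue */
            p->next = *doomed;    /* park on the temporary list */
            *doomed = p;
            moved++;
        } else {
            pp = &p->next;
        }
    }
    return moved;
}

/* Free everything on the temporary list. This is meant to be called after
 * the lock has been dropped. Returns the number of packets freed. */
int purge_list(struct pkt **doomed)
{
    int freed = 0;

    while (*doomed) {
        struct pkt *p = *doomed;

        *doomed = p->next;
        free(p);
        freed++;
    }
    return freed;
}

/* Helper for building a queue: prepend a packet for dev_id. */
struct pkt *push_pkt(struct pkt *head, int dev_id)
{
    struct pkt *p = malloc(sizeof(*p));

    if (!p)
        return head;
    p->dev_id = dev_id;
    p->next = head;
    return p;
}
```

Splitting the work this way keeps the locked section short: the expensive frees happen only once the critical section is over.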
2025-02-05  s390/net: Remove LCS driver  Aswin Karuvally
The original Open Systems Adapter (OSA) was introduced by IBM in the mid-90s. These were then superseded by OSA-Express in 1999, which used Queued Direct IO to greatly improve throughput. The newer cards retained the older, slower non-QDIO (OSE) modes for compatibility with older systems. In Linux, the lcs driver was responsible for cards operating in the older OSE mode and the qeth driver was introduced to allow the OSA-Express cards to operate in the newer QDIO (OSD) mode. For an S390 machine from 1998 or later, there is no reason to use the OSE mode and lcs driver as all OSA cards since 1999 provide the faster OSD mode. As a result, it's been years since we have heard of a customer configuration involving the lcs driver. This patch removes the lcs driver. The technology it supports has been obsolete for the past 25+ years and is irrelevant for current use cases. Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Aswin Karuvally <aswin@linux.ibm.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250204103135.1619097-1-wintera@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05  cxgb4: Avoid a -Wflex-array-member-not-at-end warning  Gustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Move the conflicting declaration to the end of the structure. Notice that `struct ethtool_dump` is a flexible structure --a structure that contains a flexible-array member. Fix the following warning: ./drivers/net/ethernet/chelsio/cxgb4/cxgb4.h:1215:29: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://patch.msgid.link/Z6GBZ4brXYffLkt_@kspp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
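The layout rule behind this warning can be sketched as follows. The structures are hypothetical stand-ins, not the cxgb4 definitions: `struct dump_hdr` plays the role of a flexible structure such as `struct ethtool_dump`, and `struct adapter` plays the role of the containing structure. Embedding a flexible structure inside another structure relies on a GNU C extension, which is assumed here.

```c
#include <stdlib.h>

/* Hypothetical flexible structure: its last member is a flexible array
 * whose storage follows the structure in memory. */
struct dump_hdr {
    unsigned int len;
    unsigned char data[];      /* flexible-array member */
};

/* Hypothetical container. The flexible structure is the LAST member, so no
 * other field can overlap the bytes that data[] grows into. Declaring any
 * member after `dump` is the layout that
 * -Wflex-array-member-not-at-end reports. */
struct adapter {
    int nports;
    int flags;
    struct dump_hdr dump;      /* keep last */
};

/* Allocate a zeroed adapter with room for `payload` bytes behind dump.data.
 * Returns NULL on allocation failure. */
struct adapter *adapter_alloc(unsigned int payload)
{
    struct adapter *a = calloc(1, sizeof(*a) + payload);

    if (a)
        a->dump.len = payload;
    return a;
}
```

With the flexible structure at the end, the extra bytes requested in `adapter_alloc()` belong unambiguously to `dump.data`.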
2025-02-05  bridge: mdb: Allow replace of a host-joined group  Petr Machata
Attempts to replace an MDB group membership of the host itself are currently bounced: # ip link add name br up type bridge vlan_filtering 1 # bridge mdb replace dev br port br grp 239.0.0.1 vid 2 # bridge mdb replace dev br port br grp 239.0.0.1 vid 2 Error: bridge: Group is already joined by host. A similar operation done on a member port would succeed. Ignore the check for replacement of host group memberships as well. The bit of code that this enables is br_multicast_host_join(), which, for already-joined groups only refreshes the MC group expiration timer, which is desirable; and a userspace notification, also desirable. Change a selftest that exercises this code path from expecting a rejection to expecting a pass. The rest of MDB selftests pass without modification. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/e5c5188b9787ae806609e7ca3aa2a0a501b9b5c4.1738685648.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05  selftests: net: suppress ReST file generation when building selftests  Jakub Kicinski
Some selftests need libynl.a. When building it, try to skip generating the ReST documentation; libynl.a does not depend on it. Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250203214850.1282291-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05  Merge branch 'net-sysfs-remove-the-rtnl_trylock-restart_syscall-construction'  Jakub Kicinski
Antoine Tenart says: ==================== net-sysfs: remove the rtnl_trylock/restart_syscall construction The series initially aimed at improving spins (and thus delays) while accessing net sysfs under rtnl lock contention[1]. The culprit was the trylock/restart_syscall constructions. There wasn't much interest at the time but it got traction recently for other reasons (lowering the rtnl lock pressure). Since v1[2]: - Do not export rtnl_lock_interruptible [Stephen]. - Add netdev_warn_once messages in rx_queue_add_kobject [Jakub]. Since the RFC[1]: - Limit the breaking of the sysfs protection to sysfs_rtnl_lock() only as this is not needed in the whole rtnl locking section thanks to the additional check on dev_isalive(). This simplifies error handling as well as the unlocking path. - Used an interruptible version of rtnl_lock, as done by Jakub in his experiments. - Removed a WARN_ONCE_ONCE [Greg]. - Removed explicit inline markers [Stephen]. Most of the reasoning is explained in comments added in patch 1. This was tested by stress-testing net sysfs attributes (read/write ops) while adding/removing queues and adding/removing veths, all in parallel. I also used an OCP single node cluster, spawning lots of pods. [1] https://lore.kernel.org/all/20231018154804.420823-1-atenart@kernel.org/T/ [2] https://lore.kernel.org/all/20250117102612.132644-1-atenart@kernel.org/T/ ==================== Link: https://patch.msgid.link/20250204170314.146022-1-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05  net-sysfs: remove rtnl_trylock from queue attributes  Antoine Tenart
Similar to the commit removing rtnl_trylock from device attributes, we apply the same technique here to networking queues. Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250204170314.146022-5-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05  net-sysfs: prevent uncleared queues from being re-added  Antoine Tenart
With the (upcoming) removal of the rtnl_trylock/restart_syscall logic and because of how Tx/Rx queues are implemented (and their requirements), it might happen that a queue is re-added before having the chance to be cleared. In such a rare case, do not complete the queue addition operation. Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250204170314.146022-4-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05  net-sysfs: move queue attribute groups outside the default groups  Antoine Tenart
Rx/tx queues embed their own kobject for registering their per-queue sysfs files. The issue is they're using the kobject default groups for this and entirely rely on the kobject refcounting for releasing their sysfs paths. In order to remove rtnl_trylock calls we need sysfs files not to rely on their associated kobject refcounting for their release. Thus we here move queues sysfs files from the kobject default groups to their own groups which can be removed separately. Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250204170314.146022-3-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05  net-sysfs: remove rtnl_trylock from device attributes  Antoine Tenart
There is an ABBA deadlock between net device unregistration and sysfs files being accessed[1][2]. To prevent this from happening, all paths taking the rtnl lock after the sysfs one (actually kn->active refcount) use rtnl_trylock and return early (using restart_syscall)[3], which can make syscalls spin for a long time when there is contention on the rtnl lock[4]. There are not many possibilities to improve the above: - Rework the entire net/ locking logic. - Invert two locks in one of the paths (not possible). But here it's actually possible to drop one of the locks safely: the kernfs_node refcount. More details in the code itself, which comes with lots of comments. Note that we check the device is alive in the added sysfs_rtnl_lock helper to disallow sysfs operations from running after device dismantle has started. This also helps keep the same behavior as before. Because of this, calls to dev_isalive in sysfs ops were removed. [1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/ [2] https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/ [3] https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/ [4] https://lore.kernel.org/all/20210928125500.167943-1-atenart@kernel.org/T/ Signed-off-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250204170314.146022-2-atenart@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-05  net: phy: realtek: use string choices helpers  Heiner Kallweit
Use string choices helpers to simplify the code. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501190707.qQS8PGHW-lkp@intel.com/ Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
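For readers unfamiliar with string choices helpers, the sketch below imitates two of them in user space. The kernel provides helpers with these names in `linux/string_choices.h`; the bodies here are re-implementations for illustration, and which helper the realtek patch actually uses is not stated in the message above. `format_status()` and the feature name passed to it are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* User-space imitations of kernel string choices helpers: map a boolean
 * to one of two fixed strings instead of open-coding a ternary. */
const char *str_enabled_disabled(bool v)
{
    return v ? "enabled" : "disabled";
}

const char *str_on_off(bool v)
{
    return v ? "on" : "off";
}

/* Hypothetical caller: format "<feature>: enabled|disabled" into buf.
 * Returns what snprintf() returns (length that would have been written). */
int format_status(char *buf, size_t len, const char *feature, bool on)
{
    return snprintf(buf, len, "%s: %s", feature, str_enabled_disabled(on));
}
```

The benefit is mostly consistency: every caller prints the same spelling for the same state.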
2025-02-04  r8169: make Kconfig option for LED support user-visible  Heiner Kallweit
Make config option R8169_LEDS user-visible, so that users can remove support if not needed. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/d29f0cdb-32bf-435f-b59d-dc96bca1e3ab@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04  net: phy: realtek: make HWMON support a user-visible Kconfig symbol  Heiner Kallweit
Make config symbol REALTEK_PHY_HWMON user-visible, so that users can remove support if not needed. Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/3466ee92-166a-4b0f-9ae7-42b9e046f333@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04  netconsole: selftest: Add test for fragmented messages  Breno Leitao
Add a new selftest to verify netconsole's handling of messages that exceed the packet size limit and require fragmentation. The test sends messages with varying sizes and userdata, validating that: 1. Large messages are correctly fragmented and reassembled 2. Userdata fields are properly preserved across fragments 3. Messages work correctly with and without kernel release version appending The test creates a networking environment using netdevsim, sends messages through /dev/kmsg, and verifies the received fragments maintain message integrity. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250203-netcons_frag_msgs-v1-1-5bc6bedf2ac0@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04  net: atlantic: Avoid -Wflex-array-member-not-at-end warnings  Gustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Remove unused flexible-array member `buf` and, with this, fix the following warnings: drivers/net/ethernet/aquantia/atlantic/aq_hw.h:197:36: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] drivers/net/ethernet/aquantia/atlantic/hw_atl/../aq_hw.h:197:36: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Suggested-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Igor Russkikh <irusskikh@marvell.com> Link: https://patch.msgid.link/Z6F3KZVfnAZ2FoJm@kspp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04  net: warn if NAPI instance wasn't shut down  Jakub Kicinski
Drivers should always disable a NAPI instance before removing it. If they don't the instance may be queued for polling. Since commit 86e25f40aa1e ("net: napi: Add napi_config") we also remove the NAPI from the busy polling hash table in napi_disable(), so not disabling would leave a stale entry there. Use of busy polling is relatively uncommon so bugs may be lurking in the drivers. Add an explicit warning. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250203215816.1294081-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04  cavium/liquidio: Remove unused lio_get_device_id  Dr. David Alan Gilbert
lio_get_device_id() has been unused since 2018's commit 64fecd3ec512 ("liquidio: remove obsolete functions and data structures") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250203183343.193691-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04  mlxsw: spectrum_router: Remove unused functions  Dr. David Alan Gilbert
mlxsw_sp_ipip_lb_ul_vr_id() has been unused since 2020's commit acde33bf7319 ("mlxsw: spectrum_router: Reduce mlxsw_sp_ipip_fib_entry_op_gre4()") mlxsw_sp_rif_exists() has been unused since 2023's commit 49c3a615d382 ("mlxsw: spectrum_router: Replay MACVLANs when RIF is made") mlxsw_sp_rif_vid() has been unused since 2023's commit a5b52692e693 ("mlxsw: spectrum_switchdev: Manage RIFs on PVID change") Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250203190141.204951-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04  net/mlx5: Remove unused mlx5dr_domain_sync  Dr. David Alan Gilbert
mlx5dr_domain_sync() was added in 2019 by commit 70605ea545e8 ("net/mlx5: DR, Expose APIs for direct rule managing") but hasn't been used. Remove it. mlx5dr_domain_sync() was the only user of mlx5dr_send_ring_force_drain(). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20250203185958.204794-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04  mlx4: Remove unused functions  Dr. David Alan Gilbert
The last use of mlx4_find_cached_mac() was removed in 2014 by commit 2f5bb473681b ("mlx4: Add ref counting to port MAC table for RoCE") mlx4_zone_free_entries() was added in 2014 by commit 7a89399ffad7 ("net/mlx4: Add mlx4_bitmap zone allocator") but hasn't been used. (The _unique version is used) Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20250203185229.204279-1-linux@treblig.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04  net: qed: fix typos  Andrew Kreimer
There are some typos in comments/messages: - Valiate -> Validate - acceptible -> acceptable - acces -> access - relased -> released Fix them via codespell. Signed-off-by: Andrew Kreimer <algonell@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250203175419.4146-1-algonell@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04  dt-bindings: net: faraday,ftgmac100: Add phys mode  Ninad Palsule
Aspeed device supports rgmii, rgmii-id, rgmii-rxid, rgmii-txid so document them. Acked-by: Rob Herring (Arm) <robh@kernel.org> Signed-off-by: Ninad Palsule <ninad@linux.ibm.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250203151306.276358-2-ninad@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-04  neighbour: remove neigh_parms_destroy()  Eric Dumazet
neigh_parms_destroy() is a simple kfree(), no need for a forward declaration. neigh_parms_put() can instead call kfree() directly. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250203151152.3163876-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
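The shape of this cleanup can be sketched with a generic reference-counted object. Everything below is hypothetical: `struct parms` and its helpers are not the neighbour code, a plain `int` stands in for the kernel's `refcount_t`, and `free()` stands in for `kfree()`.

```c
#include <stdlib.h>

/* Hypothetical reference-counted object. */
struct parms {
    int refcnt;    /* plain int stands in for the kernel's refcount_t */
};

/* Allocate an object holding one reference; NULL on failure. */
struct parms *parms_alloc(void)
{
    struct parms *p = malloc(sizeof(*p));

    if (p)
        p->refcnt = 1;
    return p;
}

/* Take an additional reference. */
void parms_hold(struct parms *p)
{
    p->refcnt++;
}

/* Drop one reference. When the destructor would be nothing more than a
 * free, the put helper can free directly on the last reference, which
 * removes the need for a forward-declared destroy function.
 * Returns 1 if the object was freed, 0 otherwise. */
int parms_put(struct parms *p)
{
    if (--p->refcnt == 0) {
        free(p);
        return 1;
    }
    return 0;
}
```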
2025-02-04  bonding: delete always true device check  Leon Romanovsky
XFRM API makes sure that xs->xso.dev is valid in all XFRM offload callbacks. There is no need to check it again. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/0b2f8f5f09701bb43bbd83b94bfe5cb506b57adc.1738587150.git.leon@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-30  Merge tag 'net-6.14-rc1' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from IPSec, netfilter and Bluetooth. Nothing really stands out, but as usual there's a slight concentration of fixes for issues added in the last two weeks before the merge window, and driver bugs from 6.13 which tend to get discovered upon wider distribution. Current release - regressions: - net: revert RTNL changes in unregister_netdevice_many_notify() - Bluetooth: fix possible infinite recursion of btusb_reset - eth: adjust locking in some old drivers which protect their state with spinlocks to avoid sleeping in atomic; core protects netdev state with a mutex now Previous releases - regressions: - eth: - mlx5e: make sure we pass node ID, not CPU ID to kvzalloc_node() - bgmac: reduce max frame size to support just 1500 bytes; the jumbo frame support would previously cause OOB writes, but now fails outright - mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted, avoid false detection of MPTCP blackholing Previous releases - always broken: - mptcp: handle fastopen disconnect correctly - xfrm: - make sure skb->sk is a full sock before accessing its fields - fix taking a lock with preempt disabled for RT kernels - usb: ipheth: improve safety of packet metadata parsing; prevent potential OOB accesses - eth: renesas: fix missing rtnl lock in suspend/resume path" * tag 'net-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits) MAINTAINERS: add Neal to TCP maintainers net: revert RTNL changes in unregister_netdevice_many_notify() net: hsr: fix fill_frame_info() regression vs VLAN packets doc: mptcp: sysctl: blackhole_timeout is per-netns mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted netfilter: nf_tables: reject mismatching sum of field_len with set key length net: sh_eth: Fix missing rtnl lock in suspend/resume path net: ravb: Fix missing rtnl lock in suspend/resume path selftests/net: Add test for 
loading devbound XDP program in generic mode net: xdp: Disallow attaching device-bound programs in generic mode tcp: correct handling of extreme memory squeeze bgmac: reduce max frame size to support just MTU 1500 vsock/test: Add test for connect() retries vsock/test: Add test for UAF due to socket unbinding vsock/test: Introduce vsock_connect_fd() vsock/test: Introduce vsock_bind() vsock: Allow retrying on connect() failure vsock: Keep the binding until socket destruction Bluetooth: L2CAP: accept zero as a special value for MTU auto-selection Bluetooth: btnxpuart: Fix glitches seen in dual A2DP streaming ...
2025-01-30  Merge tag 'docs-6.14-2' of git://git.lwn.net/linux  Linus Torvalds
Pull documentation fixes from Jonathan Corbet: "Two fixes for footnote-related warnings that appeared with Sphinx 8.x. We want to encourage use of newer Sphinx - they fixed a performance problem and the docs build takes less than half the time it used to" * tag 'docs-6.14-2' of git://git.lwn.net/linux: docs: power: Fix footnote reference for Toshiba Satellite P10-554 Documentation: ublk: Drop Stefan Hajnoczi's message footnote
2025-01-30  Merge tag 's390-6.14-3' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Alexander Gordeev: - Architecture-specific ftrace recursion trylock tests were removed in favour of the generic function_graph_enter(), but s390 got missed. Remove this test for s390 as well. - Add ftrace_get_symaddr() for s390, which returns the symbol address from the ftrace 'ip' parameter * tag 's390-6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/tracing: Define ftrace_get_symaddr() for s390 s390/fgraph: Fix to remove ftrace_test_recursion_trylock()
2025-01-30  Merge tag 's390-6.14-2' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Alexander Gordeev: - The rework that uncoupled physical and virtual address spaces inadvertently prevented KASAN shadow mappings from using large pages. Restore large page mappings for KASAN shadows - Add decompressor routine physmem_alloc() that may fail, unlike physmem_alloc_or_die(). This allows callers to implement fallback paths - Allow falling back from large pages to smaller pages (1MB or 4KB) if the allocation of 2GB pages in the decompressor cannot be fulfilled - Add to the decompressor boot print support of "%%" format string, width and padding handling, length modifiers and decimal conversion specifiers - Add to the decompressor message severity levels similar to kernel ones. Support command-line options that control console output verbosity - Replace boot_printk() calls with appropriate loglevel-specific helpers such as boot_emerg(), boot_warn(), and boot_debug(). - Collect all boot messages into a ring buffer independent of the current log level. This is particularly useful for early crash analysis - If 'earlyprintk' command line parameter is not specified, store decompressor boot messages in a ring buffer to be printed later by the kernel, once the console driver is registered - Add 'bootdebug' command line parameter to enable printing of decompressor debug messages when needed.
That parameter allows message suppression and filtering - Dump boot messages on a decompressor crash, but only if 'bootdebug' command line parameter is enabled - When CONFIG_PRINTK_TIME is enabled, add timestamps to boot messages in the same format as regular printk() - Dump physical memory tracking information on boot: online ranges, reserved areas and vmem allocations - Dump virtual memory layout and randomization details - Improve decompression error reporting and dump the message ring buffer in case the boot failed and the system halted - Add an exception handler which handles exceptions raised by an attempt to set the FPU control register to an invalid value. Remove the '.fixup' section as a result of this change - Use 'A', 'O', and 'R' inline assembly format flags, which allows recent Clang compilers to generate better FPU code - Rework uaccess code so it reads better and generates more efficient code - Cleanup futex inline assembly code - Disable KMSAN instrumentation for futex inline assemblies, which contain dereferenced user pointers. Otherwise, shadows for the user pointers would be accessed - PFs which are not initially configured but in standby create only a single-function PCI domain. If they are configured later on, sibling PFs and their child VFs will not be added to their PCI domain breaking SR-IOV expectations.
Fix that by allowing initially configured but in standby PFs create multi-function PCI domains - Add '-std=gnu11' to decompressor and purgatory CFLAGS to avoid compile errors caused by kernel's own definitions of 'bool', 'false', and 'true' conflicting with the C23 reserved keywords - Fix sclp subsystem failure when a sclp console is not present - Fix misuse of non-NULL terminated strings in vmlogrdr driver - Various other small improvements, cleanups and fixes * tag 's390-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (53 commits) s390/vmlogrdr: Use array instead of string initializer s390/vmlogrdr: Use internal_name for error messages s390/sclp: Initialize sclp subsystem via arch_cpu_finalize_init() s390/tools: Use array instead of string initializer s390/vmem: Fix null-pointer-arithmetic warning in vmem_map_init() s390: Add '-std=gnu11' to decompressor and purgatory CFLAGS s390/bitops: Use correct constraint for arch_test_bit() inline assembly s390/pci: Fix SR-IOV for PFs initially in standby s390/futex: Avoid KMSAN instrumention for user pointers s390/uaccess: Rename get_put_user_noinstr_attributes to uaccess_kmsan_or_inline s390/futex: Cleanup futex_atomic_cmpxchg_inatomic() s390/futex: Generate futex atomic op functions s390/uaccess: Remove INLINE_COPY_FROM_USER and INLINE_COPY_TO_USER s390/uaccess: Use asm goto for put_user()/get_user() s390/uaccess: Remove usage of the oac specifier s390/uaccess: Replace EX_TABLE_UA_LOAD_MEM exception handling s390/uaccess: Cleanup noinstr __put_user()/__get_user() inline assembly constraints s390/uaccess: Remove __put_user_fn()/__get_user_fn() wrappers s390/uaccess: Move put_user() / __put_user() close to put_user() asm code s390/uaccess: Use asm goto for __mvc_kernel_nofault() ...
2025-01-30  Merge tag 'gpio-fixes-for-v6.14-rc1' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: - update gpio-sim selftests to not fail now that we no longer allow rmdir() on configfs entries of active devices - remove leftover code from gpio-mxc * tag 'gpio-fixes-for-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: selftests: gpio: gpio-sim: Fix missing chip disablements gpio: mxc: remove dead code after switch to DT-only
2025-01-30  Merge tag 'pull-revalidate' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs d_revalidate updates from Al Viro: "Provide stable parent and name to ->d_revalidate() instances Most of the filesystem methods where we care about dentry name and parent have their stability guaranteed by the callers; ->d_revalidate() is the major exception. It's easy enough for callers to supply stable values for expected name and expected parent of the dentry being validated. That kills quite a bit of boilerplate in ->d_revalidate() instances, along with a bunch of races where they used to access ->d_name without sufficient precautions" * tag 'pull-revalidate' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: 9p: fix ->rename_sem exclusion orangefs_d_revalidate(): use stable parent inode and name passed by caller ocfs2_dentry_revalidate(): use stable parent inode and name passed by caller nfs: fix ->d_revalidate() UAF on ->d_name accesses nfs{,4}_lookup_validate(): use stable parent inode passed by caller gfs2_drevalidate(): use stable parent inode and name passed by caller fuse_dentry_revalidate(): use stable parent inode and name passed by caller vfat_revalidate{,_ci}(): use stable parent inode passed by caller exfat_d_revalidate(): use stable parent inode passed by caller fscrypt_d_revalidate(): use stable parent inode passed by caller ceph_d_revalidate(): propagate stable name down into request encoding ceph_d_revalidate(): use stable parent inode passed by caller afs_d_revalidate(): use stable name and parent inode passed by caller Pass parent directory inode and expected name to ->d_revalidate() generic_ci_d_compare(): use shortname_storage ext4 fast_commit: make use of name_snapshot primitives dissolve external_name.u into separate members make take_dentry_name_snapshot() lockless dcache: back inline names with a struct-wrapped array of unsigned long make sure that DNAME_INLINE_LEN is a multiple of word size
2025-01-30  Merge tag 'nf-25-01-30' of ↵  Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains one Netfilter fix: 1) Reject mismatching sum of field_len with set key length which allows to create a set without inconsistent pipapo rule width and set key length. * tag 'nf-25-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: reject mismatching sum of field_len with set key length ==================== Link: https://patch.msgid.link/20250130113307.2327470-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
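The consistency check described in the Netfilter fix can be illustrated with a minimal sketch. It assumes a plain byte-for-byte comparison; the rounding rules the kernel applies and the real nf_tables data structures are not described above, so `check_concat_lens()` and its parameters are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical validator for a concatenated set key: the lengths of the
 * individual fields must add up to the declared key length.
 * Returns 0 when they match, -EINVAL otherwise. */
int check_concat_lens(const unsigned int *field_len, size_t nfields,
                      unsigned int key_len)
{
    unsigned int sum = 0;

    for (size_t i = 0; i < nfields; i++)
        sum += field_len[i];

    return sum == key_len ? 0 : -EINVAL;
}
```

Rejecting the mismatch at set creation time means later code can rely on the two descriptions of the key agreeing.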
2025-01-30  MAINTAINERS: add Neal to TCP maintainers  Jakub Kicinski
Neal Cardwell has been indispensable in TCP reviews and investigations, especially protocol-related. Neal is also the author of packetdrill. Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250129191332.2526140-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-30  net: revert RTNL changes in unregister_netdevice_many_notify()  Eric Dumazet
This patch reverts following changes: 83419b61d187 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2) ae646f1a0bb9 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1) cfa579f66656 net: no longer hold RTNL while calling flush_all_backlogs() This caused issues in layers holding a private mutex: cleanup_net() rtnl_lock(); mutex_lock(subsystem_mutex); unregister_netdevice(); rtnl_unlock(); // LOCKDEP violation rtnl_lock(); I will revisit this in next cycle, opt-in for the new behavior from safe contexts only. Fixes: cfa579f66656 ("net: no longer hold RTNL while calling flush_all_backlogs()") Fixes: ae646f1a0bb9 ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)") Fixes: 83419b61d187 ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)") Reported-by: syzbot+5b9196ecf74447172a9a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6789d55f.050a0220.20d369.004e.GAE@google.com/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250129142726.747726-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-30  net: hsr: fix fill_frame_info() regression vs VLAN packets  Eric Dumazet
Stephan Wurm reported that my recent patch broke VLAN support. Apparently skb->mac_len is not correct for VLAN traffic as shown by debug traces [1]. Use instead pskb_may_pull() to make sure the expected header is present in skb->head. Many thanks to Stephan for his help. [1] kernel: skb len=170 headroom=2 headlen=170 tailroom=20 mac=(2,14) mac_len=14 net=(16,-1) trans=-1 shinfo(txflags=0 nr_frags=0 gso(size=0 type=0 segs=0)) csum(0x0 start=0 offset=0 ip_summed=0 complete_sw=0 valid=0 level=0) hash(0x0 sw=0 l4=0) proto=0x0000 pkttype=0 iif=0 priority=0x0 mark=0x0 alloc_cpu=0 vlan_all=0x0 encapsulation=0 inner(proto=0x0000, mac=0, net=0, trans=0) kernel: dev name=prp0 feat=0x0000000000007000 kernel: sk family=17 type=3 proto=0 kernel: skb headroom: 00000000: 74 00 kernel: skb linear: 00000000: 01 0c cd 01 00 01 00 d0 93 53 9c cb 81 00 80 00 kernel: skb linear: 00000010: 88 b8 00 01 00 98 00 00 00 00 61 81 8d 80 16 52 kernel: skb linear: 00000020: 45 47 44 4e 43 54 52 4c 2f 4c 4c 4e 30 24 47 4f kernel: skb linear: 00000030: 24 47 6f 43 62 81 01 14 82 16 52 45 47 44 4e 43 kernel: skb linear: 00000040: 54 52 4c 2f 4c 4c 4e 30 24 44 73 47 6f 6f 73 65 kernel: skb linear: 00000050: 83 07 47 6f 49 64 65 6e 74 84 08 67 8d f5 93 7e kernel: skb linear: 00000060: 76 c8 00 85 01 01 86 01 00 87 01 00 88 01 01 89 kernel: skb linear: 00000070: 01 00 8a 01 02 ab 33 a2 15 83 01 00 84 03 03 00 kernel: skb linear: 00000080: 00 91 08 67 8d f5 92 77 4b c6 1f 83 01 00 a2 1a kernel: skb linear: 00000090: a2 06 85 01 00 83 01 00 84 03 03 00 00 91 08 67 kernel: skb linear: 000000a0: 8d f5 92 77 4b c6 1f 83 01 00 kernel: skb tailroom: 00000000: 80 18 02 00 fe 4e 00 00 01 01 08 0a 4f fd 5e d1 kernel: skb tailroom: 00000010: 4f fd 5e cd Fixes: b9653d19e556 ("net: hsr: avoid potential out-of-bound access in fill_frame_info()") Reported-by: Stephan Wurm <stephan.wurm@a-eberle.de> Tested-by: Stephan Wurm <stephan.wurm@a-eberle.de> Closes: 
https://lore.kernel.org/netdev/Z4o_UC0HweBHJ_cw@PC-LX-SteWu/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250129130007.644084-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
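The hex dump in trace [1] can be decoded by hand to see why a 14-byte mac_len is too short for this frame. The following is a minimal sketch in Python, not kernel code: it parses only the first 18 bytes of the "skb linear" dump quoted above, and the helper name l2_header_len is made up for illustration.

```python
import struct

# First 18 bytes of the "skb linear" dump from the commit message:
# destination MAC, source MAC, 802.1Q TPID, TCI, encapsulated EtherType.
frame = bytes.fromhex("010ccd010001" "00d093539ccb" "8100" "8000" "88b8")

ETH_HLEN = 14          # dst MAC (6) + src MAC (6) + EtherType (2)
VLAN_HLEN = 4          # 802.1Q tag: TCI (2) + encapsulated EtherType (2)
ETH_P_8021Q = 0x8100   # EtherType value announcing an 802.1Q tag


def l2_header_len(buf: bytes) -> int:
    """Hypothetical helper: L2 header length, counting one 802.1Q tag if present."""
    (ethertype,) = struct.unpack_from("!H", buf, 12)
    if ethertype == ETH_P_8021Q:
        return ETH_HLEN + VLAN_HLEN
    return ETH_HLEN


(outer,) = struct.unpack_from("!H", frame, 12)  # EtherType field at offset 12
(inner,) = struct.unpack_from("!H", frame, 16)  # EtherType after the VLAN tag
print(hex(outer), hex(inner), l2_header_len(frame))
```

The field at offset 12 is 0x8100, so the frame carries a VLAN tag and the real EtherType (0x88b8) sits at offset 16. The Ethernet plus VLAN header therefore spans 18 bytes, while the trace reports mac_len=14.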
2025-01-30Merge tag 'ntfs3_for_6.14' of ↵Linus Torvalds
https://github.com/Paragon-Software-Group/linux-ntfs3 Pull ntfs3 fixes from Konstantin Komarov: - unify inode corruption marking and mark them as bad immediately upon detection of an error in attribute enumeration - folio cleanup * tag 'ntfs3_for_6.14' of https://github.com/Paragon-Software-Group/linux-ntfs3: fs/ntfs3: Unify inode corruption marking with _ntfs_bad_inode() fs/ntfs3: Mark inode as bad as soon as error detected in mi_enum_attr() ntfs3: Remove an access to page->index
2025-01-30Merge tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: - second half of a fix for a bug that'd been causing oopses on filesystems using snapshots with memory pressure (key cache fills for snapshots btrees are tricky) - build fix for strange compiler configurations that double stack frame size - "journal stuck timeout" now takes into account device latency: this fixes some spurious warnings, and the main remaining source of SRCU lock hold time warnings (I'm no longer seeing this in my CI, so any users still seeing this should definitely ping me) - fix for slow/hanging unmounts ("Improve journal pin flushing") - some more tracepoint fixes/improvements, to chase down the "rebalance isn't making progress" issues * tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs: bcachefs: Improve trace_move_extent_finish bcachefs: Fix trace_copygc bcachefs: Journal writes are now IOPRIO_CLASS_RT bcachefs: Improve journal pin flushing bcachefs: fix bch2_btree_node_flags bcachefs: rebalance, copygc enabled are runtime opts bcachefs: Improve decompression error messages bcachefs: bset_blacklisted_journal_seq is now AUTOFIX bcachefs: "Journal stuck" timeout now takes into account device latency bcachefs: Reduce stack frame size of __bch2_str_hash_check_key() bcachefs: Fix btree_trans_peek_key_cache()
2025-01-30Merge branch 'mptcp-blackhole-only-if-1st-syn-retrans-w-o-mpc-is-accepted'Paolo Abeni
Matthieu Baerts says: ==================== mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted Here are two small fixes for issues introduced in v6.12. - Patch 1: reset the mpc_drop mark for other SYN retransmits, to only consider an MPTCP blackhole when the first SYN retransmitted without the MPTCP options is accepted, as initially intended. - Patch 2: also mention in the doc that the blackhole_timeout sysctl knob is per-netns, like all the others. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> ==================== Link: https://patch.msgid.link/20250129-net-mptcp-blackhole-fix-v1-0-afe88e5a6d2c@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30doc: mptcp: sysctl: blackhole_timeout is per-netnsMatthieu Baerts (NGI0)
All other sysctl entries mention it, and it is a per-namespace sysctl. So mention it as well. Fixes: 27069e7cb3d1 ("mptcp: disable active MPTCP in case of blackhole") Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30mptcp: blackhole only if 1st SYN retrans w/o MPC is acceptedMatthieu Baerts (NGI0)
The Fixes commit mentioned this: > An MPTCP firewall blackhole can be detected if the following SYN > retransmission after a fallback to "plain" TCP is accepted. But in fact, this blackhole was detected if any of the following SYN retransmissions after a fallback to TCP was accepted. That's because 'mptcp_subflow_early_fallback()' will set 'request_mptcp' to 0, and 'mpc_drop' is never reset to 0 afterwards. This is an issue, because some not so unusual situations might cause the kernel to detect a false-positive blackhole, e.g. a client trying to connect to a server while the network is not ready yet, causing a few SYN retransmissions, before reaching the end server. Fixes: 27069e7cb3d1 ("mptcp: disable active MPTCP in case of blackhole") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
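The difference between the old and the intended behaviour can be shown with a toy model. This is only a sketch in Python of the rule as described in the message above, not the kernel logic: the function name and both parameters (accepted_syn, mpc_syns) are hypothetical, and the number of SYNs sent with the MPTCP option is assumed to be a fixed value.

```python
def blackhole_detected(accepted_syn: int, mpc_syns: int, fixed: bool) -> bool:
    """Toy model of the blackhole decision (hypothetical names).

    accepted_syn: 1-based index of the SYN that the peer finally accepted.
    mpc_syns:     assumed number of initial SYNs carrying the MPTCP option
                  before the fallback to plain TCP.
    fixed:        True models the intended rule, False the previous behaviour.
    """
    if accepted_syn <= mpc_syns:
        # The accepted SYN still carried the MPTCP option: nothing dropped it.
        return False
    first_plain_syn = mpc_syns + 1
    if fixed:
        # Only the first SYN sent without the MPTCP option counts.
        return accepted_syn == first_plain_syn
    # Previous behaviour: any later plain-TCP SYN being accepted counted.
    return True


# Network not ready yet: the 5th SYN is the first one to reach the server.
print(blackhole_detected(accepted_syn=5, mpc_syns=2, fixed=False))  # old rule
print(blackhole_detected(accepted_syn=5, mpc_syns=2, fixed=True))   # intended rule
# Middlebox dropping MPTCP SYNs: the first plain SYN gets through.
print(blackhole_detected(accepted_syn=3, mpc_syns=2, fixed=True))
```

In the "network not ready" case the old rule reports a blackhole although nothing was filtering MPTCP, which is the false positive described above.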
2025-01-30netfilter: nf_tables: reject mismatching sum of field_len with set key lengthPablo Neira Ayuso
The field length description provides the length of each separated key field in the concatenation; each field gets rounded up to 32-bits to calculate the pipapo rule width from pipapo_init(). The set key length provides the total size of the key aligned to 32-bits. Register-based arithmetic still allows for combining mismatching set key length and field length description, e.g. set key length 10 and field description [ 5, 4 ] leading to pipapo width of 12. Cc: stable@vger.kernel.org Fixes: 3ce67e3793f4 ("netfilter: nf_tables: do not allow mismatch field size and set key length") Reported-by: Noam Rathaus <noamr@ssd-disclosure.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
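The example in the message can be worked through numerically. The sketch below is plain Python arithmetic, not the nf_tables code; the helper names are made up for illustration. It shows that the 32-bit-rounded values agree even though the raw field lengths do not add up to the key length.

```python
def round_up_32bit(nbytes: int) -> int:
    """Round a byte count up to the next multiple of 4 bytes (32 bits)."""
    return (nbytes + 3) // 4 * 4


def pipapo_width(field_len: list[int]) -> int:
    """Sum of the per-field lengths, each rounded up to 32 bits."""
    return sum(round_up_32bit(n) for n in field_len)


key_len = 10          # set key length from the commit message
field_len = [5, 4]    # field length description from the commit message

print(round_up_32bit(key_len))    # key length aligned to 32 bits
print(pipapo_width(field_len))    # 8 + 4
print(sum(field_len) == key_len)  # raw sum of field_len versus key length
```

Both rounded values are 12, so a comparison on rounded sizes passes, while the raw sum of the fields is 9 against a key length of 10. That mismatch is what the subject line says is now rejected.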
2025-01-30Merge branch 'fix-missing-rtnl-lock-in-suspend-path'Paolo Abeni
Kory Maincent says: ==================== Fix missing rtnl lock in suspend path Fix the suspend path by ensuring the rtnl lock is held where required. Calls to open, close and WOL operations must be performed under the rtnl lock to prevent conflicts with ongoing ndo operations. Discussion about this issue can be found here: https://lore.kernel.org/netdev/20250120141926.1290763-1-kory.maincent@bootlin.com/ While working on the ravb fix, it was discovered that the sh_eth driver has the same issue. This patch series addresses both drivers. I do not have access to hardware for either of these MACs, so it would be great if maintainers or others with the relevant boards could test these fixes. v2: https://lore.kernel.org/r/20250123-fix_missing_rtnl_lock_phy_disconnect-v2-0-e6206f5508ba@bootlin.com v1: https://lore.kernel.org/r/20250122-fix_missing_rtnl_lock_phy_disconnect-v1-0-8cb9f6f88fd1@bootlin.com Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> ==================== Link: https://patch.msgid.link/20250129-fix_missing_rtnl_lock_phy_disconnect-v3-0-24c4ba185a92@bootlin.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30net: sh_eth: Fix missing rtnl lock in suspend/resume pathKory Maincent
Fix the suspend/resume path by ensuring the rtnl lock is held where required. Calls to sh_eth_close, sh_eth_open and wol operations must be performed under the rtnl lock to prevent conflicts with ongoing ndo operations. Fixes: b71af04676e9 ("sh_eth: add more PM methods") Tested-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30net: ravb: Fix missing rtnl lock in suspend/resume pathKory Maincent
Fix the suspend/resume path by ensuring the rtnl lock is held where required. Calls to ravb_open, ravb_close and wol operations must be performed under the rtnl lock to prevent conflicts with ongoing ndo operations. Without this fix, the following warning is triggered: [ 39.032969] ============================= [ 39.032983] WARNING: suspicious RCU usage [ 39.033019] ----------------------------- [ 39.033033] drivers/net/phy/phy_device.c:2004 suspicious rcu_dereference_protected() usage! ... [ 39.033597] stack backtrace: [ 39.033613] CPU: 0 UID: 0 PID: 174 Comm: python3 Not tainted 6.13.0-rc7-next-20250116-arm64-renesas-00002-g35245dfdc62c #7 [ 39.033623] Hardware name: Renesas SMARC EVK version 2 based on r9a08g045s33 (DT) [ 39.033628] Call trace: [ 39.033633] show_stack+0x14/0x1c (C) [ 39.033652] dump_stack_lvl+0xb4/0xc4 [ 39.033664] dump_stack+0x14/0x1c [ 39.033671] lockdep_rcu_suspicious+0x16c/0x22c [ 39.033682] phy_detach+0x160/0x190 [ 39.033694] phy_disconnect+0x40/0x54 [ 39.033703] ravb_close+0x6c/0x1cc [ 39.033714] ravb_suspend+0x48/0x120 [ 39.033721] dpm_run_callback+0x4c/0x14c [ 39.033731] device_suspend+0x11c/0x4dc [ 39.033740] dpm_suspend+0xdc/0x214 [ 39.033748] dpm_suspend_start+0x48/0x60 [ 39.033758] suspend_devices_and_enter+0x124/0x574 [ 39.033769] pm_suspend+0x1ac/0x274 [ 39.033778] state_store+0x88/0x124 [ 39.033788] kobj_attr_store+0x14/0x24 [ 39.033798] sysfs_kf_write+0x48/0x6c [ 39.033808] kernfs_fop_write_iter+0x118/0x1a8 [ 39.033817] vfs_write+0x27c/0x378 [ 39.033825] ksys_write+0x64/0xf4 [ 39.033833] __arm64_sys_write+0x18/0x20 [ 39.033841] invoke_syscall+0x44/0x104 [ 39.033852] el0_svc_common.constprop.0+0xb4/0xd4 [ 39.033862] do_el0_svc+0x18/0x20 [ 39.033870] el0_svc+0x3c/0xf0 [ 39.033880] el0t_64_sync_handler+0xc0/0xc4 [ 39.033888] el0t_64_sync+0x154/0x158 [ 39.041274] ravb 11c30000.ethernet eth0: Link is Down Reported-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Closes: 
https://lore.kernel.org/netdev/4c6419d8-c06b-495c-b987-d66c2e1ff848@tuxon.dev/ Fixes: 0184165b2f42 ("ravb: add sleep PM suspend/resume support") Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Tested-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30Merge tag 'for-net-2025-01-29' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - btusb: mediatek: Add locks for usb_driver_claim_interface() - L2CAP: accept zero as a special value for MTU auto-selection - btusb: Fix possible infinite recursion of btusb_reset - Add ABI doc for sysfs reset - btnxpuart: Fix glitches seen in dual A2DP streaming * tag 'for-net-2025-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: L2CAP: accept zero as a special value for MTU auto-selection Bluetooth: btnxpuart: Fix glitches seen in dual A2DP streaming Bluetooth: Add ABI doc for sysfs reset Bluetooth: Fix possible infinite recursion of btusb_reset Bluetooth: btusb: mediatek: Add locks for usb_driver_claim_interface() ==================== Link: https://patch.msgid.link/20250129210057.1318963-1-luiz.dentz@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-29selftests/net: Add test for loading devbound XDP program in generic modeToke Høiland-Jørgensen
Add a test to bpf_offload.py for loading a devbound XDP program in generic mode, checking that it fails correctly. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250127131344.238147-2-toke@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29net: xdp: Disallow attaching device-bound programs in generic modeToke Høiland-Jørgensen
Device-bound programs are used to support RX metadata kfuncs. These kfuncs are driver-specific and rely on the driver context to read the metadata. This means they can't work in generic XDP mode. However, there is no check to disallow such programs from being attached in generic mode, in which case the metadata kfuncs will be called in an invalid context, leading to crashes. Fix this by adding a check to disallow attaching device-bound programs in generic mode. Fixes: 2b3486bc2d23 ("bpf: Introduce device-bound XDP programs") Reported-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> Closes: https://lore.kernel.org/r/dae862ec-43b5-41a0-8edf-46c59071cdda@hetzner-cloud.de Tested-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20250127131344.238147-1-toke@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29tcp: correct handling of extreme memory squeezeJon Maloy
Testing with iperf3 using the "pasta" protocol splicer has revealed a problem in the way TCP handles window advertising in extreme memory squeeze situations. Under memory pressure, a socket endpoint may temporarily advertise a zero-sized window, but this is not stored as part of the socket data. The reasoning behind this is that it is considered a temporary setting which shouldn't influence any further calculations. However, if we happen to stall at an unfortunate value of the current window size, the algorithm selecting a new value will consistently fail to advertise a non-zero window once we have freed up enough memory. This means that this side's notion of the current window size is different from the one last advertised to the peer, causing the latter to not send any data to resolve the situation. The problem occurs on the iperf3 server side, and the socket in question is a completely regular socket with the default settings for the fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket. The following excerpt of a logging session, with our own comments added, shows in more detail what is happening: // tcp_v4_rcv(->) // tcp_rcv_established(->) [5201<->39222]: ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ==== [5201<->39222]: tcp_data_queue(->) [5201<->39222]: DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0] [OFO queue: gap: 65480, len: 0] [5201<->39222]: tcp_data_queue(<-) [5201<->39222]: __tcp_transmit_skb(->) [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] [5201<->39222]: tcp_select_window(->) [5201<->39222]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? 
--> TRUE [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] returning 0 [5201<->39222]: tcp_select_window(<-) [5201<->39222]: ADVERTISING WIN 0, ACK_SEQ: 265600160 [5201<->39222]: [__tcp_transmit_skb(<-) [5201<->39222]: tcp_rcv_established(<-) [5201<->39222]: tcp_v4_rcv(<-) // Receive queue is at 85 buffers and we are out of memory. // We drop the incoming buffer, although it is in sequence, and decide // to send an advertisement with a window of zero. // We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means // we unconditionally shrink the window. [5201<->39222]: tcp_recvmsg_locked(->) [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: [new_win = 0, win_now = 131184, 2 * win_now = 262368] [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] [5201<->39222]: NOT calling tcp_send_ack() [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] [5201<->39222]: __tcp_cleanup_rbuf(<-) [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0] returning 6104 bytes [5201<->39222]: tcp_recvmsg_locked(<-) // After each read, the algorithm for calculating the new receive // window in __tcp_cleanup_rbuf() finds it is too small to advertise // or to update tp->rcv_wnd. // Meanwhile, the peer thinks the window is zero, and will not send // any more data to trigger an update from the interrupt mode side. [5201<->39222]: tcp_recvmsg_locked(->) [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] [5201<->39222]: [new_win >= (2 * win_now) ? 
--> time_to_ack = 0] [5201<->39222]: NOT calling tcp_send_ack() [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] [5201<->39222]: __tcp_cleanup_rbuf(<-) [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0] returning 131072 bytes [5201<->39222]: tcp_recvmsg_locked(<-) // The above pattern repeats again and again, since nothing changes // between the reads. [...] [5201<->39222]: tcp_recvmsg_locked(->) [5201<->39222]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160 [5201<->39222]: [new_win = 262144, win_now = 131184, 2 * win_now = 262368] [5201<->39222]: [new_win >= (2 * win_now) ? --> time_to_ack = 0] [5201<->39222]: NOT calling tcp_send_ack() [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160] [5201<->39222]: __tcp_cleanup_rbuf(<-) [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184] [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0] returning 54672 bytes [5201<->39222]: tcp_recvmsg_locked(<-) // The receive queue is empty, but no new advertisement has been sent. // The peer still thinks the receive window is zero, and sends nothing. // We have ended up in a deadlock situation. Note that well-behaved endpoints will send win0 probes, so the problem will not occur. Furthermore, we have observed that in these situations this side may send out an updated 'th->ack_seq' which is not stored in tp->rcv_wup as it should be. Backing ack_seq seems to be harmless, but is of course still wrong from a protocol viewpoint. We fix this by updating the socket state correctly when a packet has been dropped because of memory exhaustion and we have to advertise a zero window. Further testing shows that the connection recovers neatly from the squeeze situation, and traffic can continue indefinitely. 
Fixes: e2142825c120 ("net: tcp: send zero-window ACK when no memory") Cc: Menglong Dong <menglong8.dong@gmail.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250127231304.1465565-1-jmaloy@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
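The stall can be reproduced on paper from the numbers in the log above. The sketch below is a Python model of the two quantities the log prints (win_now and the new_win >= 2 * win_now test), not the kernel functions themselves. The last step assumes that "updating the socket state correctly" means recording the zero window as rcv_wnd = 0 with rcv_wup = rcv_nxt; that is an interpretation of the message, not something the log states.

```python
def receive_window(rcv_wup: int, rcv_wnd: int, rcv_nxt: int) -> int:
    """Window this side believes it last advertised and that is still open."""
    return max(rcv_wup + rcv_wnd - rcv_nxt, 0)


def time_to_ack(new_win: int, win_now: int) -> bool:
    """Model of the check shown in the log: send a window update only if
    the new window is non-zero and at least twice the current one."""
    return new_win != 0 and new_win >= 2 * win_now


# Values taken from the log: the zero window was advertised but never stored.
rcv_wup, rcv_wnd, rcv_nxt = 265469200, 262144, 265600160
new_win = 262144

stale_win_now = receive_window(rcv_wup, rcv_wnd, rcv_nxt)
print(stale_win_now, 2 * stale_win_now)     # matches win_now and 2 * win_now in the log
print(time_to_ack(new_win, stale_win_now))  # no ACK is sent, the peer stays blocked

# Assumed corrected state: the advertised zero window is recorded.
recorded_win_now = receive_window(rcv_nxt, 0, rcv_nxt)
print(time_to_ack(new_win, recorded_win_now))
```

With the stale state, win_now is 131184 and new_win (262144) falls just short of 262368, so the update is suppressed on every read. Once the zero window is recorded, win_now is 0 and any non-zero new window passes the test.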
2025-01-29bgmac: reduce max frame size to support just MTU 1500Rafał Miłecki
bgmac allocates a new replacement buffer before handling each received frame. Allocating & DMA-preparing 9724 B each time consumes a lot of CPU time. Ideally bgmac should just respect currently set MTU but it isn't the case right now. For now just revert to the old limited frame size. This change bumps NAT masquerade speed by ~95%. Since commit 8218f62c9c9b ("mm: page_frag: use initial zero offset for page_frag_alloc_align()"), the bgmac driver fails to open its network interface successfully and runs out of memory in the following call stack: bgmac_open -> bgmac_dma_init -> bgmac_dma_rx_skb_for_slot -> netdev_alloc_frag BGMAC_RX_ALLOC_SIZE = 10048 and PAGE_FRAG_CACHE_MAX_SIZE = 32768. Eventually we land into __page_frag_alloc_align() with the following parameters across multiple successive calls: __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=0 __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=10048 __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=20096 __page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=30144 So in that case we do indeed have offset + fragsz (40192) > size (32768) and so we would eventually return NULL. Reverting to the older 1500-byte MTU allows the network driver to be usable again. Fixes: 8c7da63978f1 ("bgmac: configure MTU and add support for frames beyond 8192 byte size") Signed-off-by: Rafał Miłecki <rafal@milecki.pl> [florian: expand commit message about recent commits] Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://patch.msgid.link/20250127175159.1788246-1-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
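The offsets in the trace follow directly from the two sizes quoted in the message. The sketch below is a Python model of that arithmetic only; it is not the page_frag allocator, and it deliberately ignores what the real code does after the bounds check fails (the message only says that NULL is eventually returned). The function name is made up for illustration.

```python
def carve_fragments(fragsz: int, size: int) -> tuple[list[int], int]:
    """Hand out fragments from offset 0 upward while offset + fragsz fits.

    Returns the offsets that were granted and the offset at which the
    bounds check (offset + fragsz > size) first fails.
    """
    granted = []
    offset = 0
    while offset + fragsz <= size:
        granted.append(offset)
        offset += fragsz
    return granted, offset


BGMAC_RX_ALLOC_SIZE = 10048        # from the commit message
PAGE_FRAG_CACHE_MAX_SIZE = 32768   # from the commit message

granted, failed_at = carve_fragments(BGMAC_RX_ALLOC_SIZE, PAGE_FRAG_CACHE_MAX_SIZE)
print(granted)                                  # offsets of the calls that fit
print(failed_at, failed_at + BGMAC_RX_ALLOC_SIZE)  # the call that overflows
```

Three 10048-byte fragments fit in 32768 bytes; the fourth call arrives at offset 30144 and would end at 40192, past the end of the cache, which is the condition shown in the trace.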
2025-01-29Merge branch 'vsock-transport-reassignment-and-error-handling-issues'Jakub Kicinski
Michal Luczaj says: ==================== vsock: Transport reassignment and error handling issues Series deals with two issues: - socket reference count imbalance due to an unforgiving transport release (triggered by transport reassignment); - unintentional API feature, a failing connect() making the socket impossible to use for any subsequent connect() attempts. v2: https://lore.kernel.org/20250121-vsock-transport-vs-autobind-v2-0-aad6069a4e8c@rbox.co v1: https://lore.kernel.org/20250117-vsock-transport-vs-autobind-v1-0-c802c803762d@rbox.co ==================== Link: https://patch.msgid.link/20250128-vsock-transport-vs-autobind-v3-0-1cf57065b770@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>