summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-02-11sfc: extend NVRAM MCDI handlersEdward Cree
Support variable write-alignment, and background updates. The latter allows other MCDI to continue while the device is processing an MC_CMD_NVRAM_UPDATE_FINISH, since this can take a long time owing to e.g. cryptographic signature verification. Expose these handlers in mcdi.h, and build them even when CONFIG_SFC_MTD=n, so they can be used for devlink flash in a subsequent patch. Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/de3d9e14fee69e15d95b46258401a93b75659f78.1739186253.git.ecree.xilinx@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11sfc: parse headers of devlink flash imagesEdward Cree
This parsing is necessary to obtain the metadata which will be used in a subsequent patch to write the image to the device. Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/65318300f3f1b1462925f917f7c0d0ac833955ae.1739186253.git.ecree.xilinx@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11Merge branch 'use-hwmon_channel_info-macro-to-simplify-code'Jakub Kicinski
Huisong Li says: ==================== Use HWMON_CHANNEL_INFO macro to simplify code The HWMON_CHANNEL_INFO macro is provided by hwmon.h and used widely by many other drivers. This series use HWMON_CHANNEL_INFO macro to simplify code in net subsystem. Note: These patches do not depend on each other. Put them togeter just for belonging to the same subsystem. ==================== Link: https://patch.msgid.link/20250210054710.12855-1-lihuisong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: phy: aquantia: Use HWMON_CHANNEL_INFO macro to simplify codeHuisong Li
Use HWMON_CHANNEL_INFO macro to simplify code. Signed-off-by: Huisong Li <lihuisong@huawei.com> Link: https://patch.msgid.link/20250210054710.12855-6-lihuisong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: phy: marvell10g: Use HWMON_CHANNEL_INFO macro to simplify codeHuisong Li
Use HWMON_CHANNEL_INFO macro to simplify code. Signed-off-by: Huisong Li <lihuisong@huawei.com> Link: https://patch.msgid.link/20250210054710.12855-5-lihuisong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: phy: marvell: Use HWMON_CHANNEL_INFO macro to simplify codeHuisong Li
Use HWMON_CHANNEL_INFO macro to simplify code. Signed-off-by: Huisong Li <lihuisong@huawei.com> Link: https://patch.msgid.link/20250210054710.12855-4-lihuisong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: nfp: Use HWMON_CHANNEL_INFO macro to simplify codeHuisong Li
Use HWMON_CHANNEL_INFO macro to simplify code. Signed-off-by: Huisong Li <lihuisong@huawei.com> Link: https://patch.msgid.link/20250210054710.12855-3-lihuisong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: aquantia: Use HWMON_CHANNEL_INFO macro to simplify codeHuisong Li
Use HWMON_CHANNEL_INFO macro to simplify code. Signed-off-by: Huisong Li <lihuisong@huawei.com> Link: https://patch.msgid.link/20250210054710.12855-2-lihuisong@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11rxrpc: Fix alteration of headers whilst zerocopy pendingDavid Howells
rxrpc: Fix alteration of headers whilst zerocopy pending AF_RXRPC now uses MSG_SPLICE_PAGES to do zerocopy of the DATA packets when it transmits them, but to reduce the number of descriptors required in the DMA ring, it allocates a space for the protocol header in the memory immediately before the data content so that it can include both in a single descriptor. This is used for either the main RX header or the smaller jumbo subpacket header as appropriate: +----+------+ | RX | | +-+--+DATA | |JH| | +--+------+ Now, when it stitches a large jumbo packet together from a number of individual DATA packets (each of which is 1412 bytes of data), it uses the full RX header from the first and then the jumbo subpacket header for the rest of the components: +---+--+------+--+------+--+------+--+------+--+------+--+------+ |UDP|RX|DATA |JH|DATA |JH|DATA |JH|DATA |JH|DATA |JH|DATA | +---+--+------+--+------+--+------+--+------+--+------+--+------+ As mentioned, the main RX header and the jumbo header overlay one another in memory and the formats don't match, so switching from one to the other means rearranging the fields and adjusting the flags. However, now that TLP has been included, it wants to retransmit the last subpacket as a new data packet on its own, which means switching between the header formats... and if the transmission is still pending, because of the MSG_SPLICE_PAGES, we end up corrupting the jumbo subheader. This has a variety of effects, with the RX service number overwriting the jumbo checksum/key number field and the RX checksum overwriting the jumbo flags - resulting in, at the very least, a confused connection-level abort from the peer. Fix this by leaving the jumbo header in the allocation with the data, but allocating the RX header from the page frag allocator and concocting it on the fly at the point of transmission as it does for ACK packets. Fixes: 7c482665931b ("rxrpc: Implement RACK/TLP to deal with transmission stalls [RFC8985]") Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Chuck Lever <chuck.lever@oracle.com> cc: Simon Horman <horms@kernel.org> cc: linux-afs@lists.infradead.org Link: https://patch.msgid.link/2181712.1739131675@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: wwan: t7xx: don't include '<linux/pm_wakeup.h>' directlyWolfram Sang
The header clearly states that it does not want to be included directly, only via '<linux/(platform_)?device.h>'. Which is already present, so delete the superfluous include. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com> Link: https://patch.msgid.link/20250210113710.52071-2-wsa+renesas@sang-engineering.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: phy: broadcom: don't include '<linux/pm_wakeup.h>' directlyWolfram Sang
The header clearly states that it does not want to be included directly, only via '<linux/(platform_)?device.h>'. Replace the include accordingly. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250210113658.52019-2-wsa+renesas@sang-engineering.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: freescale: ucc_geth: remove unused PHY_INIT_TIMEOUT and PHY_CHANGE_TIMEHeiner Kallweit
Both definitions are unused. Last users have been removed with: 1577ecef7666 ("netdev: Merge UCC and gianfar MDIO bus drivers") 728de4c927a3 ("ucc_geth: migrate ucc_geth to phylib") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Link: https://patch.msgid.link/62e9429b-57e0-42ec-96a5-6a89553f441d@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: phy: remove unused PHY_INIT_TIMEOUT and PHY_FORCE_TIMEOUTHeiner Kallweit
Both definitions are unused. Last users have been removed with: f3ba9d490d6e ("net: s6gmac: remove driver") 2bd229df5e2e ("net: phy: remove state PHY_FORCING") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/f8e7b8ed-a665-41ad-b0ce-cbfdb65262ef@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11hamradio: baycom: replace strcpy() with strscpy()Ethan Carter Edwards
The strcpy() function has been deprecated and replaced with strscpy(). There is an effort to make this change treewide: https://github.com/KSPP/linux/issues/88. Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/3qo3fbrak7undfgocsi2s74v4uyjbylpdqhie4dohfoh4welfn@joq7up65ug6v Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11blackhole_dev: convert self-test to KUnitTamir Duberstein
Convert this very simple smoke test to a KUnit test. Add a missing `htons` call that was spotted[0] by kernel test robot <lkp@intel.com> after initial conversion to KUnit. Link: https://lore.kernel.org/oe-kbuild-all/202502090223.qCYMBjWT-lkp@intel.com/ [0] Signed-off-by: Tamir Duberstein <tamird@gmail.com> Link: https://patch.msgid.link/20250208-blackholedev-kunit-convert-v2-1-182db9bd56ec@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: phylink: make configuring clock-stop dependent on MAC supportRussell King (Oracle)
We should not be configuring the PHYs clock-stop settings unless the MAC supports phylink managed EEE. Make this dependent on MAC support. This was noticed in a suspicious RCU usage report from the kernel test robot (the suspicious RCU usage due to calling phy_detach() remains unaddressed, but is triggered by the error this was generating.) Fixes: 03abf2a7c654 ("net: phylink: add EEE management") Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/E1tgjNn-003q0w-Pw@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11vxlan: check vxlan_vnigroup_init() return valueEric Dumazet
vxlan_init() must check vxlan_vnigroup_init() success otherwise a crash happens later, spotted by syzbot. Oops: general protection fault, probably for non-canonical address 0xdffffc000000002c: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000160-0x0000000000000167] CPU: 0 UID: 0 PID: 7313 Comm: syz-executor147 Not tainted 6.14.0-rc1-syzkaller-00276-g69b54314c975 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:vxlan_vnigroup_uninit+0x89/0x500 drivers/net/vxlan/vxlan_vnifilter.c:912 Code: 00 48 8b 44 24 08 4c 8b b0 98 41 00 00 49 8d 86 60 01 00 00 48 89 c2 48 89 44 24 10 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 4d 04 00 00 49 8b 86 60 01 00 00 48 ba 00 00 00 RSP: 0018:ffffc9000cc1eea8 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 0000000000000001 RCX: ffffffff8672effb RDX: 000000000000002c RSI: ffffffff8672ecb9 RDI: ffff8880461b4f18 RBP: ffff8880461b4ef4 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000020000 R13: ffff8880461b0d80 R14: 0000000000000000 R15: dffffc0000000000 FS: 00007fecfa95d6c0(0000) GS:ffff88806a600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fecfa95cfb8 CR3: 000000004472c000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> vxlan_uninit+0x1ab/0x200 drivers/net/vxlan/vxlan_core.c:2942 unregister_netdevice_many_notify+0x12d6/0x1f30 net/core/dev.c:11824 unregister_netdevice_many net/core/dev.c:11866 [inline] unregister_netdevice_queue+0x307/0x3f0 net/core/dev.c:11736 register_netdevice+0x1829/0x1eb0 net/core/dev.c:10901 __vxlan_dev_create+0x7c6/0xa30 drivers/net/vxlan/vxlan_core.c:3981 vxlan_newlink+0xd1/0x130 drivers/net/vxlan/vxlan_core.c:4407 rtnl_newlink_create net/core/rtnetlink.c:3795 [inline] __rtnl_newlink net/core/rtnetlink.c:3906 [inline] Fixes: f9c4bb0b245c ("vxlan: vni filtering support on collect metadata device") Reported-by: syzbot+6a9624592218c2c5e7aa@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/67a9d9b4.050a0220.110943.002d.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Roopa Prabhu <roopa@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250210105242.883482-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11Merge branch 'net-phy-rename-eee_broken_mode'Jakub Kicinski
Heiner Kallweit says: ==================== net: phy: rename eee_broken_mode eee_broken_mode is used also if an EEE mode isn't actually broken but e.g. not supported by MAC. So rename it. This is split out from a bigger series that needs more rework. ==================== Link: https://patch.msgid.link/d7924d4e-49b0-4182-831f-73c558d4425e@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: phy: rename phy_set_eee_broken to phy_disable_eee_modeHeiner Kallweit
Consider that an EEE mode may not be broken but simply not supported by the MAC, and rename function phy_set_eee_broken(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/30deb630-3f6b-4ffb-a1e6-a9736021f43a@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11net: phy: rename eee_broken_modes to eee_disabled_modesHeiner Kallweit
This bitmap is used also if the MAC doesn't support an EEE mode. So the mode isn't necessarily broken in the PHY. Therefore rename the bitmap. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/6cd11422-dd67-4c87-a642-308de694af92@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11btrfs: fix hole expansion when writing at an offset beyond EOFFilipe Manana
At btrfs_write_check() if our file's i_size is not sector size aligned and we have a write that starts at an offset larger than the i_size that falls within the same page of the i_size, then we end up not zeroing the file range [i_size, write_offset). The code is this: start_pos = round_down(pos, fs_info->sectorsize); oldsize = i_size_read(inode); if (start_pos > oldsize) { /* Expand hole size to cover write data, preventing empty gap */ loff_t end_pos = round_up(pos + count, fs_info->sectorsize); ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos); if (ret) return ret; } So if our file's i_size is 90269 bytes and a write at offset 90365 bytes comes in, we get 'start_pos' set to 90112 bytes, which is less than the i_size and therefore we don't zero out the range [90269, 90365) by calling btrfs_cont_expand(). This is an old bug introduced in commit 9036c10208e1 ("Btrfs: update hole handling v2"), from 2008, and the buggy code got moved around over the years. Fix this by discarding 'start_pos' and comparing against the write offset ('pos') without any alignment. This bug was recently exposed by test case generic/363 which tests this scenario by polluting ranges beyond EOF with an mmap write and than verify that after a file increases we get zeroes for the range which is supposed to be a hole and not what we wrote with the previous mmaped write. We're only seeing this exposed now because generic/363 used to run only on xfs until last Sunday's fstests update. The test was failing like this: $ ./check generic/363 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025 MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 generic/363 0s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad) --- tests/generic/363.out 2025-02-05 15:31:14.013646509 +0000 +++ /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad 2025-02-05 17:25:33.112630781 +0000 @@ -1 +1,46 @@ QA output created by 363 +READ BAD DATA: offset = 0xdcad, size = 0xd921, fname = /home/fdmanana/btrfs-tests/dev/junk +OFFSET GOOD BAD RANGE +0x1609d 0x0000 0x3104 0x0 +operation# (mod 256) for the bad data may be 4 +0x1609e 0x0000 0x0472 0x1 +operation# (mod 256) for the bad data may be 4 ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/363.out /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad' to see the entire diff) Ran: generic/363 Failures: generic/363 Failed 1 of 1 tests Fixes: 9036c10208e1 ("Btrfs: update hole handling v2") CC: stable@vger.kernel.org Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-11btrfs: fix stale page cache after race between readahead and direct IO writeFilipe Manana
After commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read") we can now trigger a race between a task doing a direct IO write and readahead. When this race is triggered it results in tasks getting stale data when they attempt do a buffered read (including the task that did the direct IO write). This race can be sporadically triggered with test case generic/418, failing like this: $ ./check generic/418 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025 MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad) --- tests/generic/418.out 2020-06-10 19:29:03.850519863 +0100 +++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad 2025-02-03 15:42:36.974609476 +0000 @@ -1,2 +1,5 @@ QA output created by 418 +cmpbuf: offset 0: Expected: 0x1, got 0x0 +[6:0] FAIL - comparison failed, offset 24576 +diotest -wp -b 4096 -n 8 -i 4 failed at loop 3 Silence is golden ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad' to see the entire diff) Ran: generic/418 Failures: generic/418 Failed 1 of 1 tests The race happens like this: 1) A file has a prealloc extent for the range [16K, 28K); 2) Task A starts a direct IO write against file range [24K, 28K). At the start of the direct IO write it invalidates the page cache at __iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file offset 24K; 3) Task A enters btrfs_dio_iomap_begin() and locks the extent range [24K, 28K); 4) Task B starts a readahead for file range [16K, 28K), entering btrfs_readahead(). First it attempts to read the page at offset 16K by entering btrfs_do_readpage(), where it calls get_extent_map(), locks the range [16K, 20K) and gets the extent map for the range [16K, 28K), caching it into the 'em_cached' variable declared in the local stack of btrfs_readahead(), and then unlocks the range [16K, 20K). Since the extent map has the prealloc flag, at btrfs_do_readpage() we zero out the page's content and don't submit any bio to read the page from the extent. Then it attempts to read the page at offset 20K entering btrfs_do_readpage() where we reuse the previously cached extent map (decided by get_extent_map()) since it spans the page's range and it's still in the inode's extent map tree. Just like for the previous page, we zero out the page's content since the extent map has the prealloc flag set. Then it attempts to read the page at offset 24K entering btrfs_do_readpage() where we reuse the previously cached extent map (decided by get_extent_map()) since it spans the page's range and it's still in the inode's extent map tree. Just like for the previous pages, we zero out the page's content since the extent map has the prealloc flag set. Note that we didn't lock the extent range [24K, 28K), so we didn't synchronize with the ongoing direct IO write being performed by task A; 5) Task A enters btrfs_create_dio_extent() and creates an ordered extent for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and BTRFS_ORDERED_PREALLOC set; 6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin(); 7) The ordered extent enters btrfs_finish_one_ordered() and locks the range [24K, 28K); 8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries to invalidate the page at offset 24K by calling kiocb_invalidate_post_direct_write(), resulting in a call chain that ends up at btrfs_release_folio(). The btrfs_release_folio() call ends up returning false because the range for the page at file offset 24K is currently locked by the task doing the ordered extent completion in the previous step (7), so we have: btrfs_release_folio() -> __btrfs_release_folio() -> try_release_extent_mapping() -> try_release_extent_state() This last function checking that the range is locked and returning false and propagating it up to btrfs_release_folio(). So this results in a failure to invalidate the page and kiocb_invalidate_post_direct_write() triggers this message logged in dmesg: Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! After this we leave the page cache with stale data for the file range [24K, 28K), filled with zeroes instead of the data written by direct IO write (all bytes with a 0x01 value), so any task attempting to read with buffered IO, including the task that did the direct IO write, will get all bytes in the range with a 0x00 value instead of the written data. Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(), at the two callers of btrfs_do_readpage() instead of doing it at get_extent_map(), just like we did before commit ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read"), and unlocking the range after all the calls to btrfs_do_readpage(). This way we never reuse a cached extent map without flushing any pending ordered extents from a concurrent direct IO write. Fixes: ac325fc2aad5 ("btrfs: do not hold the extent lock for entire read") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-11Merge tag 'tomoyo-pr-20250211' of git://git.code.sf.net/p/tomoyo/tomoyoLinus Torvalds
Pull tomoyo fixes from Tetsuo Handa: "Redo of pathname patternization and fix spelling errors" * tag 'tomoyo-pr-20250211' of git://git.code.sf.net/p/tomoyo/tomoyo: tomoyo: use better patterns for procfs in learning mode tomoyo: fix spelling errors tomoyo: fix spelling error
2025-02-11x86/cpu/kvm: SRSO: Fix possible missing IBPB on VM-ExitPatrick Bellasi
In [1] the meaning of the synthetic IBPB flags has been redefined for a better separation of concerns: - ENTRY_IBPB -- issue IBPB on entry only - IBPB_ON_VMEXIT -- issue IBPB on VM-Exit only and the Retbleed mitigations have been updated to match this new semantics. Commit [2] was merged shortly before [1], and their interaction was not handled properly. This resulted in IBPB not being triggered on VM-Exit in all SRSO mitigation configs requesting an IBPB there. Specifically, an IBPB on VM-Exit is triggered only when X86_FEATURE_IBPB_ON_VMEXIT is set. However: - X86_FEATURE_IBPB_ON_VMEXIT is not set for "spec_rstack_overflow=ibpb", because before [1] having X86_FEATURE_ENTRY_IBPB was enough. Hence, an IBPB is triggered on entry but the expected IBPB on VM-exit is not. - X86_FEATURE_IBPB_ON_VMEXIT is not set also when "spec_rstack_overflow=ibpb-vmexit" if X86_FEATURE_ENTRY_IBPB is already set. That's because before [1] this was effectively redundant. Hence, e.g. a "retbleed=ibpb spec_rstack_overflow=bpb-vmexit" config mistakenly reports the machine still vulnerable to SRSO, despite an IBPB being triggered both on entry and VM-Exit, because of the Retbleed selected mitigation config. - UNTRAIN_RET_VM won't still actually do anything unless CONFIG_MITIGATION_IBPB_ENTRY is set. For "spec_rstack_overflow=ibpb", enable IBPB on both entry and VM-Exit and clear X86_FEATURE_RSB_VMEXIT which is made superfluous by X86_FEATURE_IBPB_ON_VMEXIT. This effectively makes this mitigation option similar to the one for 'retbleed=ibpb', thus re-order the code for the RETBLEED_MITIGATION_IBPB option to be less confusing by having all features enabling before the disabling of the not needed ones. For "spec_rstack_overflow=ibpb-vmexit", guard this mitigation setting with CONFIG_MITIGATION_IBPB_ENTRY to ensure UNTRAIN_RET_VM sequence is effectively compiled in. Drop instead the CONFIG_MITIGATION_SRSO guard, since none of the SRSO compile cruft is required in this configuration. Also, check only that the required microcode is present to effectively enabled the IBPB on VM-Exit. Finally, update the KConfig description for CONFIG_MITIGATION_IBPB_ENTRY to list also all SRSO config settings enabled by this guard. Fixes: 864bcaa38ee4 ("x86/cpu/kvm: Provide UNTRAIN_RET_VM") [1] Fixes: d893832d0e1e ("x86/srso: Add IBPB on VMEXIT") [2] Reported-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Patrick Bellasi <derkling@google.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-02-11platform/x86: int3472: Call "reset" GPIO "enable" for INT347ESakari Ailus
The DT bindings for ov7251 specify "enable" GPIO (xshutdown in documentation) but the int3472 indiscriminately provides this as a "reset" GPIO to sensor drivers. Take this into account by assigning it as "enable" with active high polarity for INT347E devices, i.e. ov7251. "reset" with active low polarity remains the default GPIO name for other devices. Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20250211072841.7713-3-sakari.ailus@linux.intel.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-02-11platform/x86: int3472: Use correct type for "polarity", call it gpio_flagsSakari Ailus
Struct gpiod_lookup flags field's type is unsigned long. Thus use unsigned long for values to be assigned to that field. Similarly, also call the field gpio_flags which it really is. Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20250211072841.7713-2-sakari.ailus@linux.intel.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-02-11igc: Set buffer type for empty frames in igc_init_empty_frameSong Yoong Siang
Set the buffer type to IGC_TX_BUFFER_TYPE_SKB for empty frame in the igc_init_empty_frame function. This ensures that the buffer type is correctly identified and handled during Tx ring cleanup. Fixes: db0b124f02ba ("igc: Enhance Qbv scheduling by using first flag bit") Cc: stable@vger.kernel.org # 6.2+ Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11igc: Fix HW RX timestamp when passed by ZC XDPZdenek Bouska
Fixes HW RX timestamp in the following scenario: - AF_PACKET socket with enabled HW RX timestamps is created - AF_XDP socket with enabled zero copy is created - frame is forwarded to the BPF program, where the timestamp should still be readable (extracted by igc_xdp_rx_timestamp(), kfunc behind bpf_xdp_metadata_rx_timestamp()) - the frame got XDP_PASS from BPF program, redirecting to the stack - AF_PACKET socket receives the frame with HW RX timestamp Moves the skb timestamp setting from igc_dispatch_skb_zc() to igc_construct_skb_zc() so that igc_construct_skb_zc() is similar to igc_construct_skb(). This issue can also be reproduced by running: # tools/testing/selftests/bpf/xdp_hw_metadata enp1s0 When a frame with the wrong port 9092 (instead of 9091) is used: # echo -n xdp | nc -u -q1 192.168.10.9 9092 then the RX timestamp is missing and xdp_hw_metadata prints: skb hwtstamp is not found! With this fix or when copy mode is used: # tools/testing/selftests/bpf/xdp_hw_metadata -c enp1s0 then RX timestamp is found and xdp_hw_metadata prints: found skb hwtstamp = 1736509937.852786132 Fixes: 069b142f5819 ("igc: Add support for PTP .getcyclesx64()") Signed-off-by: Zdenek Bouska <zdenek.bouska@siemens.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com> Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com> Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11ixgbe: Fix possible skb NULL pointer dereferencePiotr Kwapulinski
The commit c824125cbb18 ("ixgbe: Fix passing 0 to ERR_PTR in ixgbe_run_xdp()") stopped utilizing the ERR-like macros for xdp status encoding. Propagate this logic to the ixgbe_put_rx_buffer(). The commit also relaxed the skb NULL pointer check - caught by Smatch. Restore this check. Fixes: c824125cbb18 ("ixgbe: Fix passing 0 to ERR_PTR in ixgbe_run_xdp()") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/intel-wired-lan/2c7d6c31-192a-4047-bd90-9566d0e14cc0@stanley.mountain/ Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11idpf: call set_real_num_queues in idpf_openJoshua Hay
On initial driver load, alloc_etherdev_mqs is called with whatever max queue values are provided by the control plane. However, if the driver is loaded on a system where num_online_cpus() returns less than the max queues, the netdev will think there are more queues than are actually available. Only num_online_cpus() will be allocated, but skb_get_queue_mapping(skb) could possibly return an index beyond the range of allocated queues. Consequently, the packet is silently dropped and it appears as if TX is broken. Set the real number of queues during open so the netdev knows how many queues will be allocated. Fixes: 1c325aac10a8 ("idpf: configure resources for TX queues") Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11idpf: record rx queue in skb for RSC packetsSridhar Samudrala
Move the call to skb_record_rx_queue in idpf_rx_process_skb_fields() so that RX queue is recorded for RSC packets too. Fixes: 90912f9f4f2d ("idpf: convert header split mode to libeth + napi_build_skb()") Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11idpf: fix handling rsc packet with a single segmentSridhar Samudrala
Handle rsc packet with a single segment same as a multi segment rsc packet so that CHECKSUM_PARTIAL is set in the skb->ip_summed field. The current code is passing CHECKSUM_NONE resulting in TCP GRO layer doing checksum in SW and hiding the issue. This will fail when using dmabufs as payload buffers as skb frag would be unreadable. Fixes: 3a8845af66ed ("idpf: add RX splitq napi poll support") Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11bcachefs: Fix want_new_bset() so we write until the end of the btree nodeKent Overstreet
want_new_bset() returns the address of a new bset to initialize if we wish to do so in a btree node - either because the previous one is too big, or because it's been written. The case for 'previous bset was written' was wrong: it's only supposed to check for if we have space in the node for one more block, but because it subtracted the header from the space available it would never initialize a new bset if we were down to the last block in a node. Fixing this results in fewer btree node splits/compactions, which fixes a bug with flushing the journal to go read-only sometimes not terminating or taking excessively long. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-11bcachefs: Split out journal pins by btree levelKent Overstreet
This lets us flush the journal to go read-only more effectively. Flushing the journal and going read-only requires halting mutually recursive processes, which strictly speaking are not guaranteed to terminate. Flushing btree node journal pins will kick off a btree node write, and btree node writes on completion must do another btree update to the parent node to update the 'sectors_written' field for that node's key. If the parent node is full and requires a split or compaction, that's going to generate a whole bunch of additional btree updates - alloc info, LRU btree, and more - which then have to be flushed, and the cycle repeats. This process will terminate much more effectively if we tweak journal reclaim to flush btree updates leaf to root: i.e., don't flush updates for a given btree node (kicking off a write, and consuming space within that node up to the next block boundary) if there might still be unflushed updates in child nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-11bcachefs: Fix use after freeAlan Huang
acc->k.data should be used with the lock hold: 00221 ========= TEST generic/187 00221 run fstests generic/187 at 2025-02-09 21:08:10 00221 spectre-v4 mitigation disabled by command-line option 00222 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro 00222 bcachefs (vdc): initializing new filesystem 00222 bcachefs (vdc): going read-write 00222 bcachefs (vdc): marking superblocks 00222 bcachefs (vdc): initializing freespace 00222 bcachefs (vdc): done initializing freespace 00222 bcachefs (vdc): reading snapshots table 00222 bcachefs (vdc): reading snapshots done 00222 bcachefs (vdc): done starting filesystem 00222 bcachefs (vdc): shutting down 00222 bcachefs (vdc): going read-only 00222 bcachefs (vdc): finished waiting for writes to stop 00223 bcachefs (vdc): flushing journal and stopping allocators, journal seq 6 00223 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 8 00223 bcachefs (vdc): clean shutdown complete, journal seq 9 00223 bcachefs (vdc): marking filesystem clean 00223 bcachefs (vdc): shutdown complete 00223 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro 00223 bcachefs (vdc): initializing new filesystem 00223 bcachefs (vdc): going read-write 00223 bcachefs (vdc): marking superblocks 00223 bcachefs (vdc): initializing freespace 00223 bcachefs (vdc): done initializing freespace 00223 bcachefs (vdc): reading snapshots table 00223 bcachefs (vdc): reading snapshots done 00223 bcachefs (vdc): done starting filesystem 00244 hrtimer: interrupt took 123350440 ns 00264 bcachefs (vdc): shutting down 00264 bcachefs (vdc): going read-only 00264 bcachefs (vdc): finished waiting for writes to stop 00264 bcachefs (vdc): flushing journal and stopping allocators, journal seq 97 00265 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 101 00265 bcachefs (vdc): clean shutdown complete, journal seq 102 00265 bcachefs (vdc): marking filesystem clean 00265 bcachefs (vdc): shutdown complete 00265 bcachefs (vdc): starting version 1.20: directory_size opts=errors=ro 00265 bcachefs (vdc): recovering from clean shutdown, journal seq 102 00265 bcachefs (vdc): accounting_read... 00265 ================================================================== 00265 done 00265 BUG: KASAN: slab-use-after-free in bch2_fs_to_text+0x12b4/0x1728 00265 bcachefs (vdc): alloc_read... done 00265 bcachefs (vdc): stripes_read... done 00265 Read of size 4 at addr ffffff80c57eac00 by task cat/7531 00265 bcachefs (vdc): snapshots_read... done 00265 00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103 00265 Hardware name: linux,dummy-virt (DT) 00265 Call trace: 00265 show_stack+0x1c/0x30 (C) 00265 dump_stack_lvl+0x6c/0x80 00265 print_report+0xf8/0x5d8 00265 kasan_report+0x90/0xd0 00265 __asan_report_load4_noabort+0x1c/0x28 00265 bch2_fs_to_text+0x12b4/0x1728 00265 bch2_fs_show+0x94/0x188 00265 sysfs_kf_seq_show+0x1a4/0x348 00265 kernfs_seq_show+0x12c/0x198 00265 seq_read_iter+0x27c/0xfd0 00265 kernfs_fop_read_iter+0x390/0x4f8 00265 vfs_read+0x480/0x7f0 00265 ksys_read+0xe0/0x1e8 00265 __arm64_sys_read+0x70/0xa8 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 00265 Allocated by task 7510: 00265 kasan_save_stack+0x28/0x50 00265 kasan_save_track+0x1c/0x38 00265 kasan_save_alloc_info+0x3c/0x50 00265 __kasan_kmalloc+0xac/0xb0 00265 __kmalloc_node_noprof+0x168/0x348 00265 __kvmalloc_node_noprof+0x20/0x140 00265 __bch2_darray_resize_noprof+0x90/0x1b0 00265 __bch2_accounting_mem_insert+0x76c/0xb08 00265 bch2_accounting_mem_insert+0x224/0x3b8 00265 bch2_accounting_mem_mod_locked+0x480/0xc58 00265 bch2_accounting_read+0xa94/0x3eb8 00265 bch2_run_recovery_pass+0x80/0x178 00265 bch2_run_recovery_passes+0x340/0x698 00265 bch2_fs_recovery+0x1c98/0x2bd8 00265 bch2_fs_start+0x240/0x490 00265 bch2_fs_get_tree+0xe1c/0x1458 00265 vfs_get_tree+0x7c/0x250 00265 path_mount+0xe24/0x1648 00265 __arm64_sys_mount+0x240/0x438 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 00265 Freed by task 7510: 00265 kasan_save_stack+0x28/0x50 00265 kasan_save_track+0x1c/0x38 00265 kasan_save_free_info+0x48/0x88 00265 __kasan_slab_free+0x48/0x60 00265 kfree+0x188/0x408 00265 kvfree+0x3c/0x50 00265 __bch2_darray_resize_noprof+0xe0/0x1b0 00265 __bch2_accounting_mem_insert+0x76c/0xb08 00265 bch2_accounting_mem_insert+0x224/0x3b8 00265 bch2_accounting_mem_mod_locked+0x480/0xc58 00265 bch2_accounting_read+0xa94/0x3eb8 00265 bch2_run_recovery_pass+0x80/0x178 00265 bch2_run_recovery_passes+0x340/0x698 00265 bch2_fs_recovery+0x1c98/0x2bd8 00265 bch2_fs_start+0x240/0x490 00265 bch2_fs_get_tree+0xe1c/0x1458 00265 vfs_get_tree+0x7c/0x250 00265 path_mount+0xe24/0x1648 00265 bcachefs (vdc): going read-write 00265 __arm64_sys_mount+0x240/0x438 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 00265 The buggy address belongs to the object at ffffff80c57eac00 00265 which belongs to the cache kmalloc-128 of size 128 00265 The buggy address is located 0 bytes inside of 00265 freed 128-byte region [ffffff80c57eac00, ffffff80c57eac80) 00265 00265 The buggy address belongs to the physical page: 00265 page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1057ea 00265 head: order:1 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 00265 flags: 0x8000000000000040(head|zone=2) 00265 page_type: f5(slab) 00265 raw: 8000000000000040 ffffff80c0002800 dead000000000100 dead000000000122 00265 raw: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400 00265 head: 8000000000000040 ffffff80c0002800 dead000000000100 dead000000000122 00265 head: 0000000000000000 0000000000200020 00000001f5000000 ffffff80c57a6400 00265 head: 8000000000000001 fffffffec315fa81 ffffffffffffffff 0000000000000000 00265 head: 0000000000000002 0000000000000000 00000000ffffffff 0000000000000000 00265 page dumped because: kasan: bad access detected 00265 00265 Memory state around the buggy address: 00265 ffffff80c57eab00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00265 ffffff80c57eab80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 00265 >ffffff80c57eac00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 00265 ^ 00265 ffffff80c57eac80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 00265 ffffff80c57ead00: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc 00265 ================================================================== 00265 Kernel panic - not syncing: kasan.fault=panic set ... 00265 CPU: 6 UID: 0 PID: 7531 Comm: cat Not tainted 6.13.0-rc3-ktest-g16fc6fa3819d #14103 00265 Hardware name: linux,dummy-virt (DT) 00265 Call trace: 00265 show_stack+0x1c/0x30 (C) 00265 dump_stack_lvl+0x30/0x80 00265 dump_stack+0x18/0x20 00265 panic+0x4d4/0x518 00265 start_report.constprop.0+0x0/0x90 00265 kasan_report+0xa0/0xd0 00265 __asan_report_load4_noabort+0x1c/0x28 00265 bch2_fs_to_text+0x12b4/0x1728 00265 bch2_fs_show+0x94/0x188 00265 sysfs_kf_seq_show+0x1a4/0x348 00265 kernfs_seq_show+0x12c/0x198 00265 seq_read_iter+0x27c/0xfd0 00265 kernfs_fop_read_iter+0x390/0x4f8 00265 vfs_read+0x480/0x7f0 00265 ksys_read+0xe0/0x1e8 00265 __arm64_sys_read+0x70/0xa8 00265 invoke_syscall.constprop.0+0x74/0x1e8 00265 do_el0_svc+0xc8/0x1c8 00265 el0_svc+0x20/0x60 00265 el0t_64_sync_handler+0x104/0x130 00265 el0t_64_sync+0x154/0x158 00265 SMP: stopping secondary CPUs 00265 Kernel Offset: disabled 00265 CPU features: 0x000,00000070,00000010,8240500b 00265 Memory Limit: none 00265 ---[ end Kernel panic - not syncing: kasan.fault=panic set ... ]--- 00270 ========= FAILED TIMEOUT generic.187 in 1200s Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-02-11mfd: syscon: Restore device_node_to_regmap() for non-syscon nodesRob Herring (Arm)
Commit ba5095ebbc7a ("mfd: syscon: Allow syscon nodes without a "syscon" compatible") broke drivers which call device_node_to_regmap() on nodes without a "syscon" compatible. Restore the prior behavior for device_node_to_regmap(). This also makes using device_node_to_regmap() incompatible with of_syscon_register_regmap() again, so add kerneldoc for device_node_to_regmap() and syscon_node_to_regmap() to make it clear how and when each one should be used. Fixes: ba5095ebbc7a ("mfd: syscon: Allow syscon nodes without a "syscon" compatible") Reported-by: Vaishnav Achath <vaishnav.a@ti.com> Signed-off-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Tested-by: Chen-Yu Tsai <wenst@chromium.org> Tested-by: Nishanth Menon <nm@ti.com> Tested-by: Daniel Golle <daniel@makrotopia.org> Tested-by: Frank Wunderlich <frank-w@public-files.de> Tested-by: Dhruva Gole <d-gole@ti.com> Tested-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> Tested-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://lore.kernel.org/r/20250124191644.2309790-1-robh@kernel.org Signed-off-by: Lee Jones <lee@kernel.org>
2025-02-11Merge branch 'tcp-allow-to-reduce-max-rto'Paolo Abeni
Eric Dumazet says: ==================== tcp: allow to reduce max RTO This is a followup of a discussion started 6 months ago by Jason Xing. Some applications want to lower the time between each retransmit attempts. TCP_KEEPINTVL and TCP_KEEPCNT socket options don't work around the issue. This series adds: - a new TCP level socket option (TCP_RTO_MAX_MS) - a new sysctl (/proc/sys/net/ipv4/tcp_rto_max_ms) Admins and/or applications can now change the max rto value at their own risk. Link: https://lore.kernel.org/netdev/20240715033118.32322-1-kerneljasonxing@gmail.com/T/ ==================== Link: https://patch.msgid.link/20250207152830.2527578-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11tcp: add tcp_rto_max_ms sysctlEric Dumazet
Previous patch added a TCP_RTO_MAX_MS socket option to tune a TCP socket max RTO value. Many setups prefer to change a per netns sysctl. This patch adds /proc/sys/net/ipv4/tcp_rto_max_ms Its initial value is 120000 (120 seconds). Keep in mind that a decrease of tcp_rto_max_ms means shorter overall timeouts, unless tcp_retries2 sysctl is increased. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11tcp: add the ability to control max RTOEric Dumazet
Currently, TCP stack uses a constant (120 seconds) to limit the RTO value exponential growth. Some applications want to set a lower value. Add TCP_RTO_MAX_MS socket option to set a value (in ms) between 1 and 120 seconds. It is discouraged to change the socket rto max on a live socket, as it might lead to unexpected disconnects. Following patch is adding a netns sysctl to control the default value at socket creation time. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11tcp: use tcp_reset_xmit_timer()Eric Dumazet
In order to reduce TCP_RTO_MAX occurrences, replace: inet_csk_reset_xmit_timer(sk, what, when, TCP_RTO_MAX) With: tcp_reset_xmit_timer(sk, what, when, false); Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11tcp: add a @pace_delay parameter to tcp_reset_xmit_timer()Eric Dumazet
We want to factorize calls to inet_csk_reset_xmit_timer(), to ease TCP_RTO_MAX change. Current users want to add tcp_pacing_delay(sk) to the timeout. Remaining calls to inet_csk_reset_xmit_timer() do not add the pacing delay. Following patch will convert them, passing false for @pace_delay. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11tcp: remove tcp_reset_xmit_timer() @max_when argumentEric Dumazet
All callers use TCP_RTO_MAX, we can factorize this constant, becoming a variable soon. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11Merge branch 'mptcp-pm-misc-cleanups-part-2'Paolo Abeni
Matthieu Baerts says: ==================== mptcp: pm: misc cleanups, part 2 These cleanups lead the way to the unification of the path-manager interfaces, and allow future extensions. The following patches are not all linked to each others, but are all related to the path-managers. - Patch 1: drop unneeded parameter in a function helper. - Patch 2: clearer NL error message when an NL attribute is missing. - Patch 3: more precise NL messages by avoiding 'this or that is NOK'. - Patch 4: improve too vague or missing NL err messages. - Patch 5: use GENL_REQ_ATTR_CHECK to look for mandatory NL attributes. - Patch 6: avoid overriding the error message. - Patch 7: check all mandatory NL attributes with GENL_REQ_ATTR_CHECK. - Patch 8: use NL_SET_ERR_MSG_ATTR instead of GENL_SET_ERR_MSG - Patch 9: move doit callbacks used for both PM to pm.c. - Patch 10: drop another unneeded parameter in a function helper. - Patch 11: share the ID parsing code for the 'get_addr' callback. - Patch 12: share sending NL code for the 'get_addr' callback. - Patch 13: drop yet another unneeded parameter in a function helper. - Patch 14: pick the usual structure type for the remote address. - Patch 15: share the local addr parsing code for the 'set_flags' cb. The behaviour when there are no errors should then not be modified. Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> ==================== v2: https://lore.kernel.org/r/20250117-net-next-mptcp-pm-misc-cleanup-2-v2-0-61d4fe0586e8@kernel.org v1: https://lore.kernel.org/r/20250116-net-next-mptcp-pm-misc-cleanup-2-v1-0-c0b43f18fe06@kernel.org Link: https://patch.msgid.link/20250207-net-next-mptcp-pm-misc-cleanup-2-v3-0-71753ed957de@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11mptcp: pm: add local parameter for set_flagsGeliang Tang
This patch updates the interfaces set_flags to reduce repetitive code, adds a new parameter 'local' for them. The local address is parsed in public helper mptcp_pm_nl_set_flags_doit(), then pass it to mptcp_pm_nl_set_flags() and mptcp_userspace_pm_set_flags(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11mptcp: pm: change rem type of set_flagsGeliang Tang
Generally, in the path manager interfaces, the local address is defined as an mptcp_pm_addr_entry type address, while the remote address is defined as an mptcp_addr_info type one: (struct mptcp_pm_addr_entry *local, struct mptcp_addr_info *remote) But the set_flags() interface uses two mptcp_pm_addr_entry type parameters. This patch changes the second one to mptcp_addr_info type and use helper mptcp_pm_parse_addr() to parse it instead of using mptcp_pm_parse_entry(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11mptcp: pm: drop skb parameter of set_flagsGeliang Tang
The first parameter 'skb' in mptcp_pm_nl_set_flags() is only used to obtained the network namespace, which can also be obtained through the second parameters 'info' by using genl_info_net() helper. This patch drops these useless parameters 'skb' in all three set_flags() interfaces. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11mptcp: pm: reuse sending nlmsg code in get_addrGeliang Tang
The netlink messages are sent both in mptcp_pm_nl_get_addr() and mptcp_userspace_pm_get_addr(), this makes the code somewhat repetitive. This is because the netlink PM and userspace PM use different locks to protect the address entry that needs to be sent via the netlink message. The former uses rcu read lock, and the latter uses msk->pm.lock. The current get_addr() flow looks like this: lock(); entry = get_entry(); send_nlmsg(entry); unlock(); After holding the lock, get the entry from the list, send the entry, and finally release the lock. This patch changes the process by getting the entry while holding the lock, then making a copy of the entry so that the lock can be released. Finally, the copy of the entry is sent without locking: lock(); entry = get_entry(); *copy = *entry; unlock(); send_nlmsg(copy); This way we can reuse the send_nlmsg() code in get_addr() interfaces between the netlink PM and userspace PM. They only need to implement their own get_addr() interfaces to hold the different locks, get the entry from the different lists, then release the locks. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11mptcp: pm: add id parameter for get_addrGeliang Tang
The address id is parsed both in mptcp_pm_nl_get_addr() and mptcp_userspace_pm_get_addr(), this makes the code somewhat repetitive. So this patch adds a new parameter 'id' for all get_addr() interfaces. The address id is only parsed in mptcp_pm_nl_get_addr_doit(), then pass it to both mptcp_pm_nl_get_addr() and mptcp_userspace_pm_get_addr(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11mptcp: pm: drop skb parameter of get_addrGeliang Tang
The first parameters 'skb' of get_addr() interfaces are now useless since mptcp_userspace_pm_get_sock() helper is used. This patch drops these useless parameters of them. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-11mptcp: pm: make three pm wrappers staticGeliang Tang
Three netlink functions: mptcp_pm_nl_get_addr_doit() mptcp_pm_nl_get_addr_dumpit() mptcp_pm_nl_set_flags_doit() are generic, implemented for each PM, in-kernel PM and userspace PM. It's clearer to move them from pm_netlink.c to pm.c. And the linked three path manager wrappers mptcp_pm_get_addr() mptcp_pm_dump_addr() mptcp_pm_set_flags() can be changed as static functions, no need to export them in protocol.h. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>