summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-08-17net: do not allow gso_size to be set to GSO_BY_FRAGSEric Dumazet
One missing check in virtio_net_hdr_to_skb() allowed syzbot to crash kernels again [1] Do not allow gso_size to be set to GSO_BY_FRAGS (0xffff), because this magic value is used by the kernel. [1] general protection fault, probably for non-canonical address 0xdffffc000000000e: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077] CPU: 0 PID: 5039 Comm: syz-executor401 Not tainted 6.5.0-rc5-next-20230809-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 RIP: 0010:skb_segment+0x1a52/0x3ef0 net/core/skbuff.c:4500 Code: 00 00 00 e9 ab eb ff ff e8 6b 96 5d f9 48 8b 84 24 00 01 00 00 48 8d 78 70 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e ea 21 00 00 48 8b 84 24 00 01 RSP: 0018:ffffc90003d3f1c8 EFLAGS: 00010202 RAX: dffffc0000000000 RBX: 000000000001fffe RCX: 0000000000000000 RDX: 000000000000000e RSI: ffffffff882a3115 RDI: 0000000000000070 RBP: ffffc90003d3f378 R08: 0000000000000005 R09: 000000000000ffff R10: 000000000000ffff R11: 5ee4a93e456187d6 R12: 000000000001ffc6 R13: dffffc0000000000 R14: 0000000000000008 R15: 000000000000ffff FS: 00005555563f2380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020020000 CR3: 000000001626d000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> udp6_ufo_fragment+0x9d2/0xd50 net/ipv6/udp_offload.c:109 ipv6_gso_segment+0x5c4/0x17b0 net/ipv6/ip6_offload.c:120 skb_mac_gso_segment+0x292/0x610 net/core/gso.c:53 __skb_gso_segment+0x339/0x710 net/core/gso.c:124 skb_gso_segment include/net/gso.h:83 [inline] validate_xmit_skb+0x3a5/0xf10 net/core/dev.c:3625 __dev_queue_xmit+0x8f0/0x3d60 net/core/dev.c:4329 dev_queue_xmit include/linux/netdevice.h:3082 [inline] packet_xmit+0x257/0x380 net/packet/af_packet.c:276 packet_snd net/packet/af_packet.c:3087 [inline] packet_sendmsg+0x24c7/0x5570 net/packet/af_packet.c:3119 sock_sendmsg_nosec net/socket.c:727 [inline] sock_sendmsg+0xd9/0x180 net/socket.c:750 ____sys_sendmsg+0x6ac/0x940 net/socket.c:2496 ___sys_sendmsg+0x135/0x1d0 net/socket.c:2550 __sys_sendmsg+0x117/0x1e0 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7ff27cdb34d9 Fixes: 3953c46c3ac7 ("sk_buff: allow segmenting based on frag sizes") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://lore.kernel.org/r/20230816142158.1779798-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17sock: Fix misuse of sk_under_memory_pressure()Abel Wu
The status of global socket memory pressure is updated when: a) __sk_mem_raise_allocated(): enter: sk_memory_allocated(sk) > sysctl_mem[1] leave: sk_memory_allocated(sk) <= sysctl_mem[0] b) __sk_mem_reduce_allocated(): leave: sk_under_memory_pressure(sk) && sk_memory_allocated(sk) < sysctl_mem[0] So the conditions of leaving global pressure are inconstant, which may lead to the situation that one pressured net-memcg prevents the global pressure from being cleared when there is indeed no global pressure, thus the global constrains are still in effect unexpectedly on the other sockets. This patch fixes this by ignoring the net-memcg's pressure when deciding whether should leave global memory pressure. Fixes: e1aab161e013 ("socket: initial cgroup code.") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Link: https://lore.kernel.org/r/20230816091226.1542-1-wuyun.abel@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17sfc: don't fail probe if MAE/TC setup failsEdward Cree
Existing comment in the source explains why we don't want efx_init_tc() failure to be fatal. Cited commit erroneously consolidated failure paths causing the probe to be failed in this case. Fixes: 7e056e2360d9 ("sfc: obtain device mac address based on firmware handle for ef100") Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://lore.kernel.org/r/aa7f589dd6028bd1ad49f0a85f37ab33c09b2b45.1692114888.git.ecree.xilinx@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17sfc: don't unregister flow_indr if it was never registeredEdward Cree
In efx_init_tc(), move the setting of efx->tc->up after the flow_indr_dev_register() call, so that if it fails, efx_fini_tc() won't call flow_indr_dev_unregister(). Fixes: 5b2e12d51bd8 ("sfc: bind indirect blocks for TC offload on EF100") Suggested-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com> Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://lore.kernel.org/r/a81284d7013aba74005277bd81104e4cfbea3f6f.1692114888.git.ecree.xilinx@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17ice: refactor ice_vf_lib to make functions staticJan Sokolowski
As following methods are not used outside ice_vf_lib, they can be made static: ice_vf_rebuild_host_vlan_cfg ice_vf_rebuild_host_tx_rate_cfg ice_vf_set_host_trust_cfg ice_vf_rebuild_host_mac_cfg ice_vf_rebuild_aggregator_node_cfg ice_vf_rebuild_host_cfg ice_set_vf_state_qs_dis ice_vf_set_initialized In order to achieve that, the order in which these were defined was reorganized. Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-08-17ice: refactor ice_lib to make functions staticJan Sokolowski
As following methods are not used outside of ice_lib, they can be made static: ice_vsi_is_vlan_pruning_ena ice_vsi_cfg_frame_size Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-08-17ice: refactor ice_ddp to make functions staticJan Sokolowski
As following methods are not used outside of ice_ddp, they can be made static: ice_verify_pgk ice_pkg_val_buf ice_aq_download_pkg ice_aq_update_pkg ice_find_seg_in_pkg Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-08-17ice: remove unused methodsJan Sokolowski
Following methods were found to no longer be in use: ice_is_pca9575_present ice_mac_fltr_exist ice_napi_del Remove them. Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-08-17Merge tag 'nfsd-6.5-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - Fix new MSG_SPLICE_PAGES support in server's TCP sendmsg helper * tag 'nfsd-6.5-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: sunrpc: set the bv_offset of first bvec in svc_tcp_sendmsg
2023-08-17Revert "drm/edid: Fix csync detailed mode parsing"Jani Nikula
This reverts commit ca62297b2085b5b3168bd891ca24862242c635a1. Commit ca62297b2085 ("drm/edid: Fix csync detailed mode parsing") fixed EDID detailed mode sync parsing. Unfortunately, there are quite a few displays out there that have bogus (zero) sync field that are broken by the change. Zero means analog composite sync, which is not right for digital displays, and the modes get rejected. Regardless, it used to work, and it needs to continue to work. Revert the change. Rejecting modes with analog composite sync was the part that fixed the gitlab issue 8146 [1]. We'll need to get back to the drawing board with that. [1] https://gitlab.freedesktop.org/drm/intel/-/issues/8146 Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8789 Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8930 Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/9044 Fixes: ca62297b2085 ("drm/edid: Fix csync detailed mode parsing") Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: dri-devel@lists.freedesktop.org Cc: <stable@vger.kernel.org> # v6.4+ Signed-off-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230815101907.2900768-1-jani.nikula@intel.com
2023-08-16net: dsa: mv88e6xxx: Wait for EEPROM done before HW resetAlfred Lee
If the switch is reset during active EEPROM transactions, as in just after an SoC reset after power up, the I2C bus transaction may be cut short leaving the EEPROM internal I2C state machine in the wrong state. When the switch is reset again, the bad state machine state may result in data being read from the wrong memory location causing the switch to enter unexpected mode rendering it inoperational. Fixes: a3dcb3e7e70c ("net: dsa: mv88e6xxx: Wait for EEPROM done after HW reset") Signed-off-by: Alfred Lee <l00g33k@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230815001323.24739-1-l00g33k@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-08-16 We've added 17 non-merge commits during the last 6 day(s) which contain a total of 20 files changed, 1179 insertions(+), 37 deletions(-). The main changes are: 1) Add a BPF hook in sys_socket() to change the protocol ID from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy applications, from Geliang Tang. 2) Follow-up/fallout fix from the SO_REUSEPORT + bpf_sk_assign work to fix a splat on non-fullsock sks in inet[6]_steal_sock, from Lorenz Bauer. 3) Improvements to struct_ops links to avoid forcing presence of update/validate callbacks. Also add bpf_struct_ops fields documentation, from David Vernet. 4) Ensure libbpf sets close-on-exec flag on gzopen, from Marco Vedovati. 5) Several new tcx selftest additions and bpftool link show support for tcx and xdp links, from Daniel Borkmann. 6) Fix a smatch warning on uninitialized symbol in bpf_perf_link_fill_kprobe, from Yafang Shao. 7) BPF selftest fixes e.g. misplaced break in kfunc_call test, from Yipeng Zou. 8) Small cleanup to remove unused declaration bpf_link_new_file, from Yue Haibing. 9) Small typo fix to bpftool's perf help message, from Daniel T. Lee. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Add mptcpify test selftests/bpf: Fix error checks of mptcp open_and_load selftests/bpf: Add two mptcp netns helpers bpf: Add update_socket_protocol hook bpftool: Implement link show support for xdp bpftool: Implement link show support for tcx selftests/bpf: Add selftest for fill_link_info bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe() net: Fix slab-out-of-bounds in inet[6]_steal_sock bpf: Document struct bpf_struct_ops fields bpf: Support default .validate() and .update() behavior for struct_ops links selftests/bpf: Add various more tcx test cases selftests/bpf: Clean up fmod_ret in bench_rename test script selftests/bpf: Fix repeat option when kfunc_call verification fails libbpf: Set close-on-exec flag on gzopen bpftool: fix perf help message bpf: Remove unused declaration bpf_link_new_file() ==================== Link: https://lore.kernel.org/r/20230816212840.1539-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16Revert "net: ethernet: ti: am65-cpsw: add mqprio qdisc offload in channel mode"Jakub Kicinski
This reverts commit 90bc21aaef4adaefceda2d385756138fc247c0c2. Patch was merged too hastily, Vladimir requested changes in: https://lore.kernel.org/all/20230816121305.5dio5tk3chge2ndh@skbuf/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16drm/nouveau/disp: fix use-after-free in error handling of ↵Karol Herbst
nouveau_connector_create We can't simply free the connector after calling drm_connector_init on it. We need to clean up the drm side first. It might not fix all regressions from commit 2b5d1c29f6c4 ("drm/nouveau/disp: PIOR DP uses GPIO for HPD, not PMGR AUX interrupts"), but at least it fixes a memory corruption in error handling related to that commit. Link: https://lore.kernel.org/lkml/20230806213107.GFZNARG6moWpFuSJ9W@fat_crate.local/ Fixes: 95983aea8003 ("drm/nouveau/disp: add connector class") Signed-off-by: Karol Herbst <kherbst@redhat.com> Reviewed-by: Lyude Paul <lyude@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230814144933.3956959-1-kherbst@redhat.com
2023-08-16net/mlx5: Fix mlx5_cmd_update_root_ft() error flowShay Drory
The cited patch change mlx5_cmd_update_root_ft() to work with multiple peer devices. However, it didn't align the error flow as well. Hence, Fix the error code to work with multiple peer devices. Fixes: 222dd185833e ("{net/RDMA}/mlx5: introduce lag_for_each_peer") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-16net/mlx5e: XDP, Fix fifo overrun on XDP_REDIRECTDragos Tatulea
Before this fix, running high rate traffic through XDP_REDIRECT with multibuf could overrun the fifo used to release the xdp frames after tx completion. This resulted in corrupted data being consumed on the free side. The culplirt was a miscalculation of the fifo size: the maximum ratio between fifo entries / data segments was incorrect. This ratio serves to calculate the max fifo size for a full sq where each packet uses the worst case number of entries in the fifo. This patch fixes the formula and names the constant. It also makes sure that future values will use a power of 2 number of entries for the fifo mask to work. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Fixes: 3f734b8c594b ("net/mlx5e: XDP, Use multiple single-entry objects in xdpi_fifo") Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-16Revert "Revert "drm/amdgpu/display: change pipe policy for DCN 2.0""Alex Deucher
This reverts commit 27dd79c00aeab36cd7542c7a4481a32549038659. It appears MPC_SPLIT_DYNAMIC still causes problems with multiple displays on DCN2.0 hardware. Switch back to MPC_SPLIT_AVOID_MULT_DISP. This increases power usage with multiple displays, but avoids hangs. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2475 Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.4.x
2023-08-16drm/amd: flush any delayed gfxoff on suspend entryMario Limonciello
DCN 3.1.4 is reported to hang on s2idle entry if graphics activity is happening during entry. This is because GFXOFF was scheduled as delayed but RLC gets disabled in s2idle entry sequence which will hang GFX IP if not already in GFXOFF. To help this problem, flush any delayed work for GFXOFF early in s2idle entry sequence to ensure that it's off when RLC is changed. commit 4b31b92b143f ("drm/amdgpu: complete gfxoff allow signal during suspend without delay") modified power gating flow so that if called in s0ix that it ensured that GFXOFF wasn't put in work queue but instead processed immediately. This is dead code due to commit 10cb67eb8a1b ("drm/amdgpu: skip CG/PG for gfx during S0ix") because GFXOFF will now not be explicitly called as part of the suspend entry code. Remove that dead code. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-08-16drm/amdgpu: skip fence GFX interrupts disable/enable for S0ixTim Huang
GFX v11.0.1 reported fence fallback timer expired issue on SDMA and GFX rings after S0ix resume. This is generated by EOP interrupts are disabled when S0ix suspend but fails to re-enable when resume because of the GFX is in GFXOFF. [ 203.349571] [drm] Fence fallback timer expired on ring sdma0 [ 203.349572] [drm] Fence fallback timer expired on ring gfx_0.0.0 [ 203.861635] [drm] Fence fallback timer expired on ring gfx_0.0.0 For S0ix, GFX is in GFXOFF state, avoid to touch the GFX registers to configure the fence driver interrupts for rings that belong to GFX. The interrupts configuration will be restored by GFXOFF exit. Signed-off-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-08-16drm/amdgpu: skip xcp drm device allocation when out of drm resourceJames Zhu
Return 0 when drm device alloc failed with -ENOSPC in order to allow amdgpu drive loading. But the xcp without drm device node assigned won't be visiable in user space. This helps amdgpu driver loading on system which has more than 64 nodes, the current limitation. The proposal to add more drm nodes is discussed in public, which will support up to 2^20 nodes totally. kernel drm: https://lore.kernel.org/lkml/20230724211428.3831636-1-michal.winiarski@intel.com/T/ libdrm: https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/305 Signed-off-by: James Zhu <James.Zhu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-16drm/amd/pm: Update pci link width for smu v13.0.6Asad Kamal
Update addresses of PCIE link width registers, & link width format used to populate gpu metrics table for smu v13.0.6 v2: Removed ESM register update v3: Updated patch subject and message Signed-off-by: Asad Kamal <asad.kamal@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-16drm/amd/pm: Fix temperature unit of SMU v13.0.6Lijo Lazar
Temperature needs to be reported in millidegree Celsius. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-16drm/amdgpu/pm: fix throttle_status for other than MP1 11.0.7Umio Yasuno
Use the right metrics table version based on the firmware. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2720 Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Umio Yasuno <coelacanth_dream@protonmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-08-16drm/amdgpu: disable mcbp if parameter zero is setJiadong Zhu
The parameter amdgpu_mcbp shall have priority against the default value calculated from the chip version. User could disable mcbp by setting the parameter mcbp as zero. v2: do not trigger preemption in sw ring muxer when mcbp is disabled. Signed-off-by: Jiadong Zhu <Jiadong.Zhu@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-16Merge branch 'bpf: Force to MPTCP'Martin KaFai Lau
Geliang Tang says: ==================== As is described in the "How to use MPTCP?" section in MPTCP wiki [1]: "Your app should create sockets with IPPROTO_MPTCP as the proto: ( socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP); ). Legacy apps can be forced to create and use MPTCP sockets instead of TCP ones via the mptcpize command bundled with the mptcpd daemon." But the mptcpize (LD_PRELOAD technique) command has some limitations [2]: - it doesn't work if the application is not using libc (e.g. GoLang apps) - in some envs, it might not be easy to set env vars / change the way apps are launched, e.g. on Android - mptcpize needs to be launched with all apps that want MPTCP: we could have more control from BPF to enable MPTCP only for some apps or all the ones of a netns or a cgroup, etc. - it is not in BPF, we cannot talk about it at netdev conf. So this patchset attempts to use BPF to implement functions similer to mptcpize. The main idea is to add a hook in sys_socket() to change the protocol id from IPPROTO_TCP (or 0) to IPPROTO_MPTCP. [1] https://github.com/multipath-tcp/mptcp_net-next/wiki [2] https://github.com/multipath-tcp/mptcp_net-next/issues/79 v14: - Use getsockopt(MPTCP_INFO) to verify mptcp protocol intead of using nstat command. v13: - drop "Use random netns name for mptcp" patch. v12: - update diag_* log of update_socket_protocol. - add 'ip netns show' after 'ip netns del' to check if there is a test did not clean up its netns. - return libbpf_get_error() instead of -EIO for the error from open_and_load(). - Use getsockopt(SOL_PROTOCOL) to verify mptcp protocol intead of using 'ss -tOni'. v11: - add comments about outputs of 'ss' and 'nstat'. - use "err = verify_mptcpify()" instead of using =+. v10: - drop "#ifdef CONFIG_BPF_JIT". - include vmlinux.h and bpf_tracing_net.h to avoid defining some macros. - drop unneeded checks for mptcp. v9: - update comment for 'update_socket_protocol'. v8: - drop the additional checks on the 'protocol' value after the 'update_socket_protocol()' call. v7: - add __weak and __diag_* for update_socket_protocol. v6: - add update_socket_protocol. v5: - add bpf_mptcpify helper. v4: - use lsm_cgroup/socket_create v3: - patch 8: char cmd[128]; -> char cmd[256]; v2: - Fix build selftests errors reported by CI Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/79 ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16selftests/bpf: Add mptcpify testGeliang Tang
Implement a new test program mptcpify: if the family is AF_INET or AF_INET6, the type is SOCK_STREAM, and the protocol ID is 0 or IPPROTO_TCP, set it to IPPROTO_MPTCP. It will be hooked in update_socket_protocol(). Extend the MPTCP test base, add a selftest test_mptcpify() for the mptcpify case. Open and load the mptcpify test prog to mptcpify the TCP sockets dynamically, then use start_server() and connect_to_fd() to create a TCP socket, but actually what's created is an MPTCP socket, which can be verified through 'getsockopt(SOL_PROTOCOL)' and 'getsockopt(MPTCP_INFO)'. Acked-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/364e72f307e7bb38382ec7442c182d76298a9c41.1692147782.git.geliang.tang@suse.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16selftests/bpf: Fix error checks of mptcp open_and_loadGeliang Tang
Return libbpf_get_error(), instead of -EIO, for the error from mptcp_sock__open_and_load(). Load success means prog_fd and map_fd are always valid. So drop these unneeded ASSERT_GE checks for them in mptcp run_test(). Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/db5fcb93293df9ab173edcbaf8252465b80da6f2.1692147782.git.geliang.tang@suse.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16selftests/bpf: Add two mptcp netns helpersGeliang Tang
Add two netns helpers for mptcp tests: create_netns() and cleanup_netns(). Use them in test_base(). These new helpers will be re-used in the following commits introducing new tests. Acked-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/7506371fb6c417b401cc9d7365fe455754f4ba3f.1692147782.git.geliang.tang@suse.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16bpf: Add update_socket_protocol hookGeliang Tang
Add a hook named update_socket_protocol in __sys_socket(), for bpf progs to attach to and update socket protocol. One user case is to force legacy TCP apps to create and use MPTCP sockets instead of TCP ones. Define a fmod_ret set named bpf_mptcp_fmodret_ids, add the hook update_socket_protocol into this set, and register it in bpf_mptcp_kfunc_init(). Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/79 Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/ac84be00f97072a46f8a72b4e2be46cbb7fa5053.1692147782.git.geliang.tang@suse.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16bpftool: Implement link show support for xdpDaniel Borkmann
Add support to dump XDP link information to bpftool. This reuses the recently added show_link_ifindex_{plain,json}(). The XDP link info only exposes the ifindex. Below shows an example link dump output, and a cgroup link is included for comparison, too: # bpftool link [...] 10: cgroup prog 2466 cgroup_id 1 attach_type cgroup_inet6_post_bind [...] 16: xdp prog 2477 ifindex enp5s0(3) [...] Equivalent json output: # bpftool link --json [...] { "id": 10, "type": "cgroup", "prog_id": 2466, "cgroup_id": 1, "attach_type": "cgroup_inet6_post_bind" }, [...] { "id": 16, "type": "xdp", "prog_id": 2477, "devname": "enp5s0", "ifindex": 3 } [...] Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/r/20230816095651.10014-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16bpftool: Implement link show support for tcxDaniel Borkmann
Add support to dump tcx link information to bpftool. This adds a common helper show_link_ifindex_{plain,json}() which can be reused also for other link types. The plain text and json device output is the same format as in bpftool net dump. Below shows an example link dump output along with a cgroup link for comparison: # bpftool link [...] 10: cgroup prog 1977 cgroup_id 1 attach_type cgroup_inet6_post_bind [...] 13: tcx prog 2053 ifindex enp5s0(3) attach_type tcx_ingress 14: tcx prog 2080 ifindex enp5s0(3) attach_type tcx_egress [...] Equivalent json output: # bpftool link --json [...] { "id": 10, "type": "cgroup", "prog_id": 1977, "cgroup_id": 1, "attach_type": "cgroup_inet6_post_bind" }, [...] { "id": 13, "type": "tcx", "prog_id": 2053, "devname": "enp5s0", "ifindex": 3, "attach_type": "tcx_ingress" }, { "id": 14, "type": "tcx", "prog_id": 2080, "devname": "enp5s0", "ifindex": 3, "attach_type": "tcx_egress" } [...] Suggested-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20230816095651.10014-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16drm/amd/pm: disallow the fan setting if there is no fan on smu 13.0.0Kenneth Feng
drm/amd/pm: disallow the fan setting if there is no fan on smu 13.0.0 V2: depend on pm.no_fan to check Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-08-16virtchnl: fix fake 1-elem arrays for structures allocated as `nents`Alexander Lobakin
Finally, fix 3 structures which are allocated technically correctly, i.e. the calculated size equals to the one that struct_size() would return, except for sizeof(). For &virtchnl_vlan_filter_list_v2, use the same approach when there are no enough space as taken previously for &virtchnl_vlan_filter_list, i.e. let the maximum size be calculated automatically instead of trying to guestimate it using maths. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-08-16virtchnl: fix fake 1-elem arrays in structures allocated as `nents + 1`Alexander Lobakin
There are five virtchnl structures, which are allocated and checked in the code as `nents + 1`, meaning that they always have memory for one excessive element regardless of their actual number. This comes from that their sizeof() includes space for 1 element and then they get allocated via struct_size() or its open-coded equivalents, passing the actual number of elements. Expand virtchnl_struct_size() to handle such structures and replace those 1-elem arrays with proper flex ones. Also fix several places which open-code %IAVF_VIRTCHNL_VF_RESOURCE_SIZE. Finally, let the virtchnl_ether_addr_list size be computed automatically when there's no enough space for the whole list, otherwise we have to open-code reverse struct_size() logics. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-08-16virtchnl: fix fake 1-elem arrays in structs allocated as `nents + 1` - 1Alexander Lobakin
The two most problematic virtchnl structures are virtchnl_rss_key and virtchnl_rss_lut. Their "flex" arrays have the type of u8, thus, when allocating / checking, the actual size is calculated as `sizeof + nents - 1 byte`. But their sizeof() is not 1 byte larger than the size of such structure with proper flex array, it's two bytes larger due to the padding. That said, their size is always 1 byte larger unless there are no tail elements -- then it's +2 bytes. Add virtchnl_struct_size() macro which will handle this case (and later other cases as well). Make its calling conv the same as we call struct_size() to allow it to be drop-in, even though it's unlikely to become possible to switch to generic API. The macro will calculate a proper size of a structure with a flex array at the end, so that it becomes transparent for the compilers, but add the difference from the old values, so that the real size of sorta-ABI-messages doesn't change. Use it on the allocation side in IAVF and the receiving side (defined as static inline in virtchnl.h) for the mentioned two structures. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-08-16i40e: fix misleading debug logsAndrii Staikov
Change "write" into the actual "read" word. Change parameters description. Fixes: 7073f46e443e ("i40e: Add AQ commands for NVM Update for X722") Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Andrii Staikov <andrii.staikov@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-08-16iavf: fix FDIR rule fields masks validationPiotr Gardocki
Return an error if a field's mask is neither full nor empty. When a mask is only partial the field is not being used for rule programming but it gives a wrong impression it is used. Fix by returning an error on any partial mask to make it clear they are not supported. The ip_ver assignment is moved earlier in code to allow using it in iavf_validate_fdir_fltr_masks. Fixes: 527691bf0682 ("iavf: Support IPv4 Flow Director filters") Fixes: e90cbc257a6f ("iavf: Support IPv6 Flow Director filters") Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-08-16selftests/bpf: Add selftest for fill_link_infoYafang Shao
Add selftest for the fill_link_info of uprobe, kprobe and tracepoint. The result: $ tools/testing/selftests/bpf/test_progs --name=fill_link_info #79/1 fill_link_info/kprobe_link_info:OK #79/2 fill_link_info/kretprobe_link_info:OK #79/3 fill_link_info/kprobe_invalid_ubuff:OK #79/4 fill_link_info/tracepoint_link_info:OK #79/5 fill_link_info/uprobe_link_info:OK #79/6 fill_link_info/uretprobe_link_info:OK #79/7 fill_link_info/kprobe_multi_link_info:OK #79/8 fill_link_info/kretprobe_multi_link_info:OK #79/9 fill_link_info/kprobe_multi_invalid_ubuff:OK #79 fill_link_info:OK Summary: 1/9 PASSED, 0 SKIPPED, 0 FAILED The test case for kprobe_multi won't be run on aarch64, as it is not supported. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20230813141900.1268-3-laoar.shao@gmail.com
2023-08-16bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe()Yafang Shao
The commit 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event") leads to the following Smatch static checker warning: kernel/bpf/syscall.c:3416 bpf_perf_link_fill_kprobe() error: uninitialized symbol 'type'. That can happens when uname is NULL. So fix it by verifying the uname when we really need to fill it. Fixes: 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Closes: https://lore.kernel.org/bpf/85697a7e-f897-4f74-8b43-82721bebc462@kili.mountain Link: https://lore.kernel.org/bpf/20230813141900.1268-2-laoar.shao@gmail.com
2023-08-16Merge branch 'ipv6-expired-routes'David S. Miller
Kui-Feng Lee says: ==================== Remove expired routes with a separated list of routes. FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree can be expensive if the number of routes in a table is big, even if most of them are permanent. Checking routes in a separated list of routes having expiration will avoid this potential issue. Background ========== The size of a Linux IPv6 routing table can become a big problem if not managed appropriately. Now, Linux has a garbage collector to remove expired routes periodically. However, this may lead to a situation in which the routing path is blocked for a long period due to an excessive number of routes. For example, years ago, there is a commit c7bb4b89033b ("ipv6: tcp: drop silly ICMPv6 packet too big messages"). The root cause is that malicious ICMPv6 packets were sent back for every small packet sent to them. These packets add routes with an expiration time that prompts the GC to periodically check all routes in the tables, including permanent ones. Why Route Expires ================= Users can add IPv6 routes with an expiration time manually. However, the Neighbor Discovery protocol may also generate routes that can expire. For example, Router Advertisement (RA) messages may create a default route with an expiration time. [RFC 4861] For IPv4, it is not possible to set an expiration time for a route, and there is no RA, so there is no need to worry about such issues. Create Routes with Expires ========================== You can create routes with expires with the command. For example, ip -6 route add 2001:b000:591::3 via fe80::5054:ff:fe12:3457 \ dev enp0s3 expires 30 The route that has been generated will be deleted automatically in 30 seconds. GC of FIB6 ========== The function called fib6_run_gc() is responsible for performing garbage collection (GC) for the Linux IPv6 stack. It checks for the expiration of every route by traversing the trees of routing tables. The time taken to traverse a routing table increases with its size. Holding the routing table lock during traversal is particularly undesirable. Therefore, it is preferable to keep the lock for the shortest possible duration. Solution ======== The cause of the issue is keeping the routing table locked during the traversal of large trees. To solve this problem, we can create a separate list of routes that have expiration. This will prevent GC from checking permanent routes. Result ====== We conducted a test to measure the execution times of fib6_gc_timer_cb() and observed that it enhances the GC of FIB6. During the test, we added permanent routes with the following numbers: 1000, 3000, 6000, and 9000. Additionally, we added a route with an expiration time. Here are the average execution times for the kernel without the patch. - 120020 ns with 1000 permanent routes - 308920 ns with 3000 ... - 581470 ns with 6000 ... - 855310 ns with 9000 ... The kernel with the patch consistently takes around 14000 ns to execute, regardless of the number of permanent routes that are installed. Major changes from v7: - Fix warings raised by the patchwork. Major changes from v6: - Remove unnecessary check of tb6 in fib6_clean_expires_locked(). - Use ib6_clean_expires_locked() instead in fib6_purge_rt(). Major changes from v5: - Change the order of adding new routes to the GC list and starting GC timer. - Remove time measurements from the test case. - Stop forcing GC flush. Major changes from v4: - Detect existence of 'strace' in the test case. Major changes from v3: - Fix the type of arg according to feedback. - Add 1k temporary routes and 5K permanent routes in the test case. Measure time spending on GC with strace. Major changes from v2: - Remove unnecessary and incorrect sysctl restoring in the test case. Major changes from v1: - Moved gc_link to avoid creating a hole in fib6_info. - Moved fib6_set_expires*() and fib6_clean_expires*() to the header file and inlined. And removed duplicated lines. - Added a test case. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16selftests: fib_tests: Add a test case for IPv6 garbage collectionKui-Feng Lee
Add 1000 IPv6 routes with expiration time (w/ and w/o additional 5000 permanet routes in the background.) Wait for a few seconds to make sure they are removed correctly. The expected output of the test looks like the following example. > Fib6 garbage collection test > TEST: ipv6 route garbage collection [ OK ] Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net/ipv6: Remove expired routes with a separated list of routes.Kui-Feng Lee
FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree can be expensive if the number of routes in a table is big, even if most of them are permanent. Checking routes in a separated list of routes having expiration will avoid this potential issue. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16e1000e: Use PME poll to circumvent unreliable ACPI wakeKai-Heng Feng
On some I219 devices, ethernet cable plugging detection only works once from PCI D3 state. Subsequent cable plugging does set PME bit correctly, but device still doesn't get woken up. Since I219 connects to the root complex directly, it relies on platform firmware (ACPI) to wake it up. In this case, the GPE from _PRW only works for first cable plugging but fails to notify the driver for subsequent plugging events. The issue was originally found on CNP, but the same issue can be found on ADL too. So workaround the issue by continuing use PME poll after first ACPI wake. As PME poll is always used, the runtime suspend restriction for CNP can also be removed. Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net-memcg: Fix scope of sockmem pressure indicatorsAbel Wu
Now there are two indicators of socket memory pressure sit inside struct mem_cgroup, socket_pressure and tcpmem_pressure, indicating memory reclaim pressure in memcg->memory and ->tcpmem respectively. When in legacy mode (cgroupv1), the socket memory is charged into ->tcpmem which is independent of ->memory, so socket_pressure has nothing to do with socket's pressure at all. Things could be worse by taking socket_pressure into consideration in legacy mode, as a pressure in ->memory can lead to premature reclamation/throttling in socket. While for the default mode (cgroupv2), the socket memory is charged into ->memory, and ->tcpmem/->tcpmem_pressure are simply not used. So {socket,tcpmem}_pressure are only used in default/legacy mode respectively for indicating socket memory pressure. This patch fixes the pieces of code that make mixed use of both. Fixes: 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16nfp: update maintainerLouis Peens
Take over maintainership of the nfp driver from Simon as he is moving away from Corigine. Signed-off-by: Louis Peens <louis.peens@corigine.com> Acked-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16net: ethernet: ti: am65-cpsw: add mqprio qdisc offload in channel modeGrygorii Strashko
This patch adds MQPRIO Qdisc offload in full 'channel' mode which allows not only setting up pri:tc mapping, but also configuring TX shapers on external port FIFOs. The K3 CPSW MQPRIO Qdisc offload is expected to work with VLAN/priority tagged packets. Non-tagged packets have to be mapped only to TC0. - TX traffic classes must be rated starting from TC that has highest priority and with no gaps - Traffic classes are used starting from 0, that has highest priority - min_rate defines Committed Information Rate (guaranteed) - max_rate defines Excess Information Rate (non guaranteed) and offloaded as (max_rate[i] - tcX_min_rate[i]) - VLAN/priority tagged packets mapped to TC0 will exit switch with VLAN tag priority 0 The configuration example: ethtool -L eth1 tx 5 ethtool --set-priv-flags eth1 p0-rx-ptype-rrobin off tc qdisc add dev eth1 parent root handle 100: mqprio num_tc 3 \ map 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 \ queues 1@0 1@1 1@2 hw 1 mode channel \ shaper bw_rlimit min_rate 0 100mbit 200mbit max_rate 0 101mbit 202mbit tc qdisc replace dev eth2 handle 100: parent root mqprio num_tc 1 \ map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@0 hw 1 ip link add link eth1 name eth1.100 type vlan id 100 ip link set eth1.100 type vlan egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7 In the above example two ports share the same TX CPPI queue 0 for low priority traffic. 3 traffic classes are defined for eth1 and mapped to: TC0 - low priority, TX CPPI queue 0 -> ext Port 1 fifo0, no rate limit TC1 - prio 2, TX CPPI queue 1 -> ext Port 1 fifo1, CIR=100Mbit/s, EIR=1Mbit/s TC2 - prio 3, TX CPPI queue 2 -> ext Port 1 fifo2, CIR=200Mbit/s, EIR=2Mbit/s Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Roger Quadros <rogerq@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16Merge tag 'nf-23-08-16' of ↵David S. Miller
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florisn Westphal says: ==================== These are netfilter fixes for the *net* tree. First patch resolves a false-positive lockdep splat: rcu_dereference is used outside of rcu read lock. Let lockdep validate that the transaction mutex is locked. Second patch fixes a kdoc warning added in previous PR. Third patch fixes a memory leak: The catchall element isn't disabled correctly, this allows userspace to deactivate the element again. This results in refcount underflow which in turn prevents memory release. This was always broken since the feature was added in 5.13. Patch 4 fixes an incorrect change in the previous pull request: Adding a duplicate key to a set should work if the duplicate key has expired, restore this behaviour. All from myself. Patch #5 resolves an old historic artifact in sctp conntrack: a 300ms timeout for shutdown_ack. Increase this to 3s. From Xin Long. Patch #6 fixes a sysctl data race in ipvs, two threads can clobber the sysctl value, from Sishuai Gong. This is a day-0 bug that predates git history. Patches 7, 8 and 9, from Pablo Neira Ayuso, are also followups for the previous GC rework in nf_tables: The netlink notifier and the netns exit path must both increment the gc worker seqcount, else worker may encounter stale (free'd) pointers. ================ Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16Merge branch 'inet-data-races'David S. Miller
Eric Dumazet says: ==================== inet: socket lock and data-races avoidance In this series, I converted 20 bits in "struct inet_sock" and made them truly atomic. This allows to implement many IP_ socket options in a lockless fashion (no need to acquire socket lock), and fixes data-races that were showing up in various KCSAN reports. I also took care of IP_TTL/IP_MINTTL, but left few other options for another series. v4: Rebased after recent mptcp changes. Added Reviewed-by: tags from Simon (thanks !) v3: fixed patch 7, feedback from build bot about ipvs set_mcast_loop() v2: addressed a feedback from a build bot in patch 9 by removing unused issk variable in mptcp_setsockopt_sol_ip_set_transparent() Added Acked-by: tags from Soheil (thanks !) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: implement lockless IP_MINTTLEric Dumazet
inet->min_ttl is already read with READ_ONCE(). Implementing IP_MINTTL socket option set/read without holding the socket lock is easy. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: implement lockless IP_TTLEric Dumazet
ip_select_ttl() is racy, because it reads inet->uc_ttl without proper locking. Add READ_ONCE()/WRITE_ONCE() annotations while allowing IP_TTL socket option to be set/read without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>