Age | Commit message (Collapse) | Author |
|
Commit 86e25f40aa1e ("net: napi: Add napi_config") moved napi->napi_id
assignment to a later point in time (napi_hash_add_with_id). This breaks
__xdp_rxq_info_reg which copies napi_id at an earlier time and now
stores 0 napi_id. It also makes sk_mark_napi_id_once_xdp and
__sk_mark_napi_id_once useless because they now work against 0 napi_id.
Since sk_busy_loop requires valid napi_id to busy-poll on, there is no way
to busy-poll AF_XDP sockets anymore.
Bring back the ability to busy-poll on XSK by resolving socket's napi_id
at bind time. This relies on relatively recent netif_queue_set_napi,
but (assume) at this point most popular drivers should have been converted.
This also removes per-tx/rx cycles which used to check and/or set
the napi_id value.
Confirmed by running a busy-polling AF_XDP socket
(github.com/fomichev/xskrtt) on mlx5 and looking at BusyPollRxPackets
from /proc/net/netstat.
Fixes: 86e25f40aa1e ("net: napi: Add napi_config")
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250109003436.2829560-1-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As pointed out in the original comment, lookup in sockmap can return a TCP
ESTABLISHED socket. Such TCP socket may have had SO_ATTACH_REUSEPORT_EBPF
set before it was ESTABLISHED. In other words, a non-NULL sk_reuseport_cb
does not imply a non-refcounted socket.
Drop sk's reference in both error paths.
unreferenced object 0xffff888101911800 (size 2048):
comm "test_progs", pid 44109, jiffies 4297131437
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
80 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace (crc 9336483b):
__kmalloc_noprof+0x3bf/0x560
__reuseport_alloc+0x1d/0x40
reuseport_alloc+0xca/0x150
reuseport_attach_prog+0x87/0x140
sk_reuseport_attach_bpf+0xc8/0x100
sk_setsockopt+0x1181/0x1990
do_sock_setsockopt+0x12b/0x160
__sys_setsockopt+0x7b/0xc0
__x64_sys_setsockopt+0x1b/0x30
do_syscall_64+0x93/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: 64d85290d79c ("bpf: Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASH")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250110-reuseport-memleak-v1-1-fa1ddab0adfe@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
ipsec-next-2025-01-09
1) Implement the AGGFRAG protocol and basic IP-TFS (RFC9347) functionality.
From Christian Hopps.
2) Support ESN context update to hardware for TX.
From Jianbo Liu.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When jumping to 'martian_destination' a drop reason is always set but
that label falls-through the 'e_nobufs' one, overriding the value.
The behavior was introduced by the mentioned commit. The logic went
from,
goto martian_destination;
...
martian_destination:
...
e_inval:
err = -EINVAL;
goto out;
e_nobufs:
err = -ENOBUFS;
goto out;
to,
reason = ...;
goto martian_destination;
...
martian_destination:
...
e_nobufs:
reason = SKB_DROP_REASON_NOMEM;
goto out;
A 'goto out' is clearly missing now after 'martian_destination' to avoid
overriding the drop reason.
Fixes: 5b92112acd8e ("net: ip: make ip_route_input_slow() return drop reasons")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250108165725.404564-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Switch to using hvhdk.h everywhere in the kernel. This header
includes all the new Hyper-V headers in include/hyperv, which form a
superset of the definitions found in hyperv-tlfs.h.
This makes it easier to add new Hyper-V interfaces without being
restricted to those in the TLFS doc (reflected in hyperv-tlfs.h).
To be more consistent with the original Hyper-V code, the names of
some definitions are changed slightly. Update those where needed.
Update comments in mshyperv.h files to point to include/hyperv for
adding new definitions.
Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Reviewed-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Link: https://lore.kernel.org/r/1732577084-2122-5-git-send-email-nunodasneves@linux.microsoft.com
Link: https://lore.kernel.org/r/20250108222138.1623703-3-romank@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.13-rc7).
Conflicts:
a42d71e322a8 ("net_sched: sch_cake: Add drop reasons")
737d4d91d35b ("sched: sch_cake: add bounds checks to host bulk flow fairness counts")
Adjacent changes:
drivers/net/ethernet/meta/fbnic/fbnic.h
3a856ab34726 ("eth: fbnic: add IRQ reuse support")
95978931d55f ("eth: fbnic: Revert "eth: fbnic: Add hardware monitoring support via HWMON interface"")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix imbalance between flowtable BIND and UNBIND calls to configure
hardware offload, this fixes a possible kmemleak.
2) Clamp maximum conntrack hashtable size to INT_MAX to fix a possible
WARN_ON_ONCE splat coming from kvmalloc_array(), only possible from
init_netns.
* tag 'nf-25-01-09' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: conntrack: clamp maximum hashtable size to INT_MAX
netfilter: nf_tables: imbalance in flowtable binding
====================
Link: https://patch.msgid.link/20250109123532.41768-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The per-netns structure can be obtained from the table->data using
container_of(), then the 'net' one can be retrieved from the listen
socket (if available).
Fixes: c6a58ffed536 ("RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-9-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'net' structure can be obtained from the table->data using
container_of().
Note that table->data could also be used directly, as this is the only
member needed from the 'net' structure, but that would increase the size
of this fix, to use '*data' everywhere 'net->sctp.probe_interval' is
used.
Fixes: d1e462a7a5f3 ("sctp: add probe_interval in sysctl and sock/asoc/transport")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-8-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'net' structure can be obtained from the table->data using
container_of().
Note that table->data could also be used directly, but that would
increase the size of this fix, while 'sctp.ctl_sock' still needs to be
retrieved from 'net' structure.
Fixes: 046c052b475e ("sctp: enable udp tunneling socks")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-7-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'net' structure can be obtained from the table->data using
container_of().
Note that table->data could also be used directly, but that would
increase the size of this fix, while 'sctp.ctl_sock' still needs to be
retrieved from 'net' structure.
Fixes: b14878ccb7fa ("net: sctp: cache auth_enable per endpoint")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-6-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'net' structure can be obtained from the table->data using
container_of().
Note that table->data could also be used directly, as this is the only
member needed from the 'net' structure, but that would increase the size
of this fix, to use '*data' everywhere 'net->sctp.rto_min/max' is used.
Fixes: 4f3fdf3bc59c ("sctp: add check rto_min and rto_max in sysctl")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-5-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As mentioned in a previous commit of this series, using the 'net'
structure via 'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'net' structure can be obtained from the table->data using
container_of().
Note that table->data could also be used directly, as this is the only
member needed from the 'net' structure, but that would increase the size
of this fix, to use '*data' everywhere 'net->sctp.sctp_hmac_alg' is
used.
Fixes: 3c68198e7511 ("sctp: Make hmac algorithm selection for cookie generation dynamic")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-4-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As mentioned in the previous commit, using the 'net' structure via
'current' is not recommended for different reasons:
- Inconsistency: getting info from the reader's/writer's netns vs only
from the opener's netns.
- current->nsproxy can be NULL in some cases, resulting in an 'Oops'
(null-ptr-deref), e.g. when the current task is exiting, as spotted by
syzbot [1] using acct(2).
The 'pernet' structure can be obtained from the table->data using
container_of().
Fixes: 27069e7cb3d1 ("mptcp: disable active MPTCP in case of blackhole")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com [1]
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-3-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Using the 'net' structure via 'current' is not recommended for different
reasons.
First, if the goal is to use it to read or write per-netns data, this is
inconsistent with how the "generic" sysctl entries are doing: directly
by only using pointers set to the table entry, e.g. table->data. Linked
to that, the per-netns data should always be obtained from the table
linked to the netns it had been created for, which may not coincide with
the reader's or writer's netns.
Another reason is that access to current->nsproxy->netns can oops if
attempted when current->nsproxy had been dropped when the current task
is exiting. This is what syzbot found, when using acct(2):
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f]
CPU: 1 UID: 0 PID: 5924 Comm: syz-executor Not tainted 6.13.0-rc5-syzkaller-00004-gccb98ccef0e5 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
RIP: 0010:proc_scheduler+0xc6/0x3c0 net/mptcp/ctrl.c:125
Code: 03 42 80 3c 38 00 0f 85 fe 02 00 00 4d 8b a4 24 08 09 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7c 24 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 cc 02 00 00 4d 8b 7c 24 28 48 8d 84 24 c8 00 00
RSP: 0018:ffffc900034774e8 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 1ffff9200068ee9e RCX: ffffc90003477620
RDX: 0000000000000005 RSI: ffffffff8b08f91e RDI: 0000000000000028
RBP: 0000000000000001 R08: ffffc90003477710 R09: 0000000000000040
R10: 0000000000000040 R11: 00000000726f7475 R12: 0000000000000000
R13: ffffc90003477620 R14: ffffc90003477710 R15: dffffc0000000000
FS: 0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fee3cd452d8 CR3: 000000007d116000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
proc_sys_call_handler+0x403/0x5d0 fs/proc/proc_sysctl.c:601
__kernel_write_iter+0x318/0xa80 fs/read_write.c:612
__kernel_write+0xf6/0x140 fs/read_write.c:632
do_acct_process+0xcb0/0x14a0 kernel/acct.c:539
acct_pin_kill+0x2d/0x100 kernel/acct.c:192
pin_kill+0x194/0x7c0 fs/fs_pin.c:44
mnt_pin_kill+0x61/0x1e0 fs/fs_pin.c:81
cleanup_mnt+0x3ac/0x450 fs/namespace.c:1366
task_work_run+0x14e/0x250 kernel/task_work.c:239
exit_task_work include/linux/task_work.h:43 [inline]
do_exit+0xad8/0x2d70 kernel/exit.c:938
do_group_exit+0xd3/0x2a0 kernel/exit.c:1087
get_signal+0x2576/0x2610 kernel/signal.c:3017
arch_do_signal_or_restart+0x90/0x7e0 arch/x86/kernel/signal.c:337
exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0x150/0x2a0 kernel/entry/common.c:218
do_syscall_64+0xda/0x250 arch/x86/entry/common.c:89
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fee3cb87a6a
Code: Unable to access opcode bytes at 0x7fee3cb87a40.
RSP: 002b:00007fffcccac688 EFLAGS: 00000202 ORIG_RAX: 0000000000000037
RAX: 0000000000000000 RBX: 00007fffcccac710 RCX: 00007fee3cb87a6a
RDX: 0000000000000041 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000003 R08: 00007fffcccac6ac R09: 00007fffcccacac7
R10: 00007fffcccac710 R11: 0000000000000202 R12: 00007fee3cd49500
R13: 00007fffcccac6ac R14: 0000000000000000 R15: 00007fee3cd4b000
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:proc_scheduler+0xc6/0x3c0 net/mptcp/ctrl.c:125
Code: 03 42 80 3c 38 00 0f 85 fe 02 00 00 4d 8b a4 24 08 09 00 00 48 b8 00 00 00 00 00 fc ff df 49 8d 7c 24 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 cc 02 00 00 4d 8b 7c 24 28 48 8d 84 24 c8 00 00
RSP: 0018:ffffc900034774e8 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 1ffff9200068ee9e RCX: ffffc90003477620
RDX: 0000000000000005 RSI: ffffffff8b08f91e RDI: 0000000000000028
RBP: 0000000000000001 R08: ffffc90003477710 R09: 0000000000000040
R10: 0000000000000040 R11: 00000000726f7475 R12: 0000000000000000
R13: ffffc90003477620 R14: ffffc90003477710 R15: dffffc0000000000
FS: 0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fee3cd452d8 CR3: 000000007d116000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess), 1 bytes skipped:
0: 42 80 3c 38 00 cmpb $0x0,(%rax,%r15,1)
5: 0f 85 fe 02 00 00 jne 0x309
b: 4d 8b a4 24 08 09 00 mov 0x908(%r12),%r12
12: 00
13: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
1a: fc ff df
1d: 49 8d 7c 24 28 lea 0x28(%r12),%rdi
22: 48 89 fa mov %rdi,%rdx
25: 48 c1 ea 03 shr $0x3,%rdx
* 29: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction
2d: 0f 85 cc 02 00 00 jne 0x2ff
33: 4d 8b 7c 24 28 mov 0x28(%r12),%r15
38: 48 rex.W
39: 8d .byte 0x8d
3a: 84 24 c8 test %ah,(%rax,%rcx,8)
Here with 'net.mptcp.scheduler', the 'net' structure is not really
needed, because the table->data already has a pointer to the current
scheduler, the only thing needed from the per-netns data.
Simply use 'data', instead of getting (most of the time) the same thing,
but from a longer and indirect way.
Fixes: 6963c508fd7a ("mptcp: only allow set existing scheduler for net.mptcp.scheduler")
Cc: stable@vger.kernel.org
Reported-by: syzbot+e364f774c6f57f2c86d1@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/67769ecb.050a0220.3a8527.003f.GAE@google.com
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-2-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
'net.mptcp.available_schedulers' sysctl knob is there to list available
schedulers, not to modify this list.
There are then no reasons to give write access to it.
Nothing would have been written anyway, but no errors would have been
returned, which is unexpected.
Fixes: 73c900aa3660 ("mptcp: add net.mptcp.available_schedulers")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250108-net-sysctl-current-nsproxy-v1-1-5df34b2083e8@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Even though we fixed a logic error in the commit cited below, syzbot
still managed to trigger an underflow of the per-host bulk flow
counters, leading to an out of bounds memory access.
To avoid any such logic errors causing out of bounds memory accesses,
this commit factors out all accesses to the per-host bulk flow counters
to a series of helpers that perform bounds-checking before any
increments and decrements. This also has the benefit of improving
readability by moving the conditional checks for the flow mode into
these helpers, instead of having them spread out throughout the
code (which was the cause of the original logic error).
As part of this change, the flow quantum calculation is consolidated
into a helper function, which means that the dithering applied to the
ost load scaling is now applied both in the DRR rotation and when a
sparse flow's quantum is first initiated. The only user-visible effect
of this is that the maximum packet size that can be sent while a flow
stays sparse will now vary with +/- one byte in some cases. This should
not make a noticeable difference in practice, and thus it's not worth
complicating the code to preserve the old behaviour.
Fixes: 546ea84d07e3 ("sched: sch_cake: fix bulk flow accounting logic for host fairness")
Reported-by: syzbot+f63600d288bfb7057424@syzkaller.appspotmail.com
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Dave Taht <dave.taht@gmail.com>
Link: https://patch.msgid.link/20250107120105.70685-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Linus suggested during one of past maintainer summits (in context of
a DMA_BUF discussion) that symbol namespaces can be used to prevent
unwelcome but in-tree code from using all exported functions.
Create a namespace for netdev.
Export netdev_rx_queue_restart(), drivers may want to use it since
it gives them a simple and safe way to restart a queue to apply
config changes. But it's both too low level and too actively developed
to be used outside netdev.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Netlink code depends on NAPI instances being sorted by ID on
the netdev list for dump continuation. We need to be able to
find the position on the list where we left off if dump does
not fit in a single skb, and in the meantime NAPI instances
can come and go.
This was trivially true when we were assigning a new ID to every
new NAPI instance. Since we added the NAPI config API, we try
to retain the ID previously used for the same queue, but still
add the new NAPI instance at the start of the list.
This is fine if we reset the entire netdev and all NAPIs get
removed and added back. If driver replaces a NAPI instance
during an operation like DEVMEM queue reset, or recreates
a subset of NAPI instances in other ways we may end up with
broken ordering, and therefore Netlink dumps with either
missing or duplicated entries.
At this stage the problem is theoretical. Only two drivers
support queue API, bnxt and gve. gve recreates NAPIs during
queue reset, but it doesn't support NAPI config.
bnxt supports NAPI config but doesn't recreate instances
during reset.
We need to save the ID in the config as soon as it is assigned
because otherwise the new NAPI will not know what ID it will
get at enable time, at the time it is being added.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Nadia Pinaeva writes:
I am working on a tool that allows collecting network performance
metrics by using conntrack events.
Start time of a conntrack entry is used to evaluate seen_reply
latency, therefore the sooner it is timestamped, the better the
precision is.
In particular, when using this tool to compare the performance of the
same feature implemented using iptables/nftables/OVS it is crucial
to have the entry timestamped earlier to see any difference.
At this time, conntrack events can only get timestamped at recv time in
userspace, so there can be some delay between the event being generated
and the userspace process consuming the message.
There is sys/net/netfilter/nf_conntrack_timestamp, which adds a
64bit timestamp (ns resolution) that records start and stop times,
but its not suited for this either, start time is the 'hashtable insertion
time', not 'conntrack allocation time'.
There is concern that moving the start-time moment to conntrack
allocation will add overhead in case of flooding, where conntrack
entries are allocated and released right away without getting inserted
into the hashtable.
Also, even if this was changed it would not with events other than
new (start time) and destroy (stop time).
Pablo suggested to add new CTA_TIMESTAMP_EVENT, this adds this feature.
The timestamp is recorded in case both events are requested and the
sys/net/netfilter/nf_conntrack_timestamp toggle is enabled.
Reported-by: Nadia Pinaeva <n.m.pinaeva@gmail.com>
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Use INT_MAX as maximum size for the conntrack hashtable. Otherwise, it
is possible to hit WARN_ON_ONCE in __kvmalloc_node_noprof() when
resizing hashtable because __GFP_NOWARN is unset. See:
0708a0afe291 ("mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls")
Note: hashtable resize is only possible from init_netns.
Fixes: 9cc1c73ad666 ("netfilter: conntrack: avoid integer overflow when resizing")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
All these cases cause imbalance between BIND and UNBIND calls:
- Delete an interface from a flowtable with multiple interfaces
- Add a (device to a) flowtable with --check flag
- Delete a netns containing a flowtable
- In an interactive nft session, create a table with owner flag and
flowtable inside, then quit.
Fix it by calling FLOW_BLOCK_UNBIND when unregistering hooks, then
remove late FLOW_BLOCK_UNBIND call when destroying flowtable.
Fixes: ff4bf2f42a40 ("netfilter: nf_tables: add nft_unregister_flowtable_hook()")
Reported-by: Phil Sutter <phil@nwl.cc>
Tested-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
A synchronize_rcu() was added by mistake in commit
c5a759117210 ("net/hsr: Use list_head (and rcu) instead
of array for slave devices.")
RCU does not mandate to observe a grace period after
list_add_tail_rcu().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250107144701.503884-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In commit 66e4c8d95008 ("net: warn if transport header was not set")
I added a debug check in skb_transport_header() to detect
if a caller expects the transport_header to be set to a meaningful
value by a prior code path.
Unfortunately, __netif_receive_skb_core() resets the transport header
to the same value than the network header, defeating this check
in receive paths.
Pretending the transport and network headers are the same
is usually wrong.
This patch removes this reset for CONFIG_DEBUG_NET=y builds
to let fuzzers and CI find bugs.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250107144342.499759-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This change introduces a mechanism for notifying userspace
applications about changes to IPv6 anycast addresses via netlink. It
includes:
* Addition and deletion of IPv6 anycast addresses are reported using
RTM_NEWANYCAST and RTM_DELANYCAST.
* A new netlink group (RTNLGRP_IPV6_ACADDR) for subscribing to these
notifications.
This enables user space applications(e.g. ip monitor) to efficiently
track anycast addresses through netlink messages, improving metrics
collection and system monitoring. It also unlocks the potential for
advanced anycast management in user space, such as hardware offload
control and fine grained network control.
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250107114355.1766086-1-yuyanghuang@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- btmtk: Fix failed to send func ctrl for MediaTek devices.
- hci_sync: Fix not setting Random Address when required
- MGMT: Fix Add Device to responding before completing
- btnxpuart: Fix driver sending truncated data
* tag 'for-net-2025-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: btmtk: Fix failed to send func ctrl for MediaTek devices.
Bluetooth: btnxpuart: Fix driver sending truncated data
Bluetooth: MGMT: Fix Add Device to responding before completing
Bluetooth: hci_sync: Fix not setting Random Address when required
====================
Link: https://patch.msgid.link/20250108162627.1623760-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
bpf_sk_storage_clone() will call bpf_selem_free() to free the clone
element when the allocation of new sock storage fails. bpf_selem_free()
will call check_and_free_fields() to free the special fields in the
element. Since the allocated element is not visible to bpf syscall or
bpf program when bpf_local_storage_alloc() fails, these special fields
in the element must be all zero when invoking bpf_selem_free().
To be uniform with other callers of bpf_selem_free(), disabling
migration when cloning sock storage. Adding migrate_{disable|enable}
pair also benefits the potential switching from kzalloc to bpf memory
allocator for sock storage.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-9-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When destroying sock storage, it invokes bpf_local_storage_destroy() to
remove all storage elements saved in the sock storage. The destroy
procedure will call bpf_selem_free() to free the element, and
bpf_selem_free() calls bpf_obj_free_fields() to free the special fields
in map value (e.g., kptr). Since kptrs may be allocated from bpf memory
allocator, migrate_{disable|enable} pairs are necessary for the freeing
of these kptrs.
To simplify reasoning about when migrate_disable() is needed for the
freeing of these dynamically-allocated kptrs, let the caller to
guarantee migration is disabled before invoking bpf_obj_free_fields().
Therefore, the patch adds migrate_{disable|enable} pair in
bpf_sock_storage_free(). The migrate_{disable|enable} pairs in the
underlying implementation of bpf_obj_free_fields() will be removed by
The following patch.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20250108010728.207536-8-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This is a follow-up to 3c5b4d69c358 ("net: annotate data-races around
sk->sk_mark"). sk->sk_mark can be read and written without holding
the socket lock. IPv6 equivalent is already covered with READ_ONCE()
annotation in tcp_v6_send_response().
Fixes: 3c5b4d69c358 ("net: annotate data-races around sk->sk_mark")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/f459d1fc44f205e13f6d8bdca2c8bfb9902ffac9.1736244569.git.daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The NAPI IDs were not fully exposed to user space prior to the netlink
API, so they were never namespaced. The netlink API must ensure that
at the very least NAPI instance belongs to the same netns as the owner
of the genl sock.
napi_by_id() can become static now, but it needs to move because of
dev_get_by_napi_id().
Cc: stable@vger.kernel.org
Fixes: 1287c1ae0fc2 ("netdev-genl: Support setting per-NAPI config values")
Fixes: 27f91aaf49b3 ("netdev-genl: Add netlink framework functions for napi")
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250106180137.1861472-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
kthread_create() creates a kthread without running it yet. kthread_run()
creates a kthread and runs it.
On the other hand, kthread_create_worker() creates a kthread worker and
runs it.
This difference in behaviours is confusing. Also there is no way to
create a kthread worker and affine it using kthread_bind_mask() or
kthread_affine_preferred() before starting it.
Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]()
that behaves just like kthread_run(). kthread_create_worker[_on_cpu]()
will now only create a kthread worker without starting it.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
|
|
Use usb_autopm_get_interface() and usb_autopm_put_interface()
in btmtk_usb_shutdown(), it could send func ctrl after enabling
autosuspend.
Bluetooth: btmtk_usb_hci_wmt_sync() hci0: Execution of wmt command
timed out
Bluetooth: btmtk_usb_shutdown() hci0: Failed to send wmt func ctrl
(-110)
Fixes: 5c5e8c52e3ca ("Bluetooth: btmtk: move btusb_mtk_[setup, shutdown] to btmtk.c")
Signed-off-by: Chris Lu <chris.lu@mediatek.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Add Device with LE type requires updating resolving/accept list which
requires quite a number of commands to complete and each of them may
fail, so instead of pretending it would always work this checks the
return of hci_update_passive_scan_sync which indicates if everything
worked as intended.
Fixes: e8907f76544f ("Bluetooth: hci_sync: Make use of hci_cmd_sync_queue set 3")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
This fixes errors such as the following when Own address type is set to
Random Address but it has not been programmed yet due to either be
advertising or connecting:
< HCI Command: LE Set Exte.. (0x08|0x0041) plen 13
Own address type: Random (0x03)
Filter policy: Ignore not in accept list (0x01)
PHYs: 0x05
Entry 0: LE 1M
Type: Passive (0x00)
Interval: 60.000 msec (0x0060)
Window: 30.000 msec (0x0030)
Entry 1: LE Coded
Type: Passive (0x00)
Interval: 180.000 msec (0x0120)
Window: 90.000 msec (0x0090)
> HCI Event: Command Complete (0x0e) plen 4
LE Set Extended Scan Parameters (0x08|0x0041) ncmd 1
Status: Success (0x00)
< HCI Command: LE Set Exten.. (0x08|0x0042) plen 6
Extended scan: Enabled (0x01)
Filter duplicates: Enabled (0x01)
Duration: 0 msec (0x0000)
Period: 0.00 sec (0x0000)
> HCI Event: Command Complete (0x0e) plen 4
LE Set Extended Scan Enable (0x08|0x0042) ncmd 1
Status: Invalid HCI Command Parameters (0x12)
Fixes: c45074d68a9b ("Bluetooth: Fix not generating RPA when required")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
All implementations of get_mac_eee() now just return zero without doing
anything useful. Remove the call to this method in preparation to
removing the method from each DSA driver.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/E1tUlkz-007Uyl-UA@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot presented an use-after-free report [0] regarding ipvlan and
linkwatch.
ipvlan does not hold a refcnt of the lower device unlike vlan and
macvlan.
If the linkwatch work is triggered for the ipvlan dev, the lower dev
might have already been freed, resulting in UAF of ipvlan->phy_dev in
ipvlan_get_iflink().
We can delay the lower dev unregistration like vlan and macvlan by
holding the lower dev's refcnt in dev->netdev_ops->ndo_init() and
releasing it in dev->priv_destructor().
Jakub pointed out calling .ndo_XXX after unregister_netdevice() has
returned is error prone and suggested [1] addressing this UAF in the
core by taking commit 750e51603395 ("net: avoid potential UAF in
default_operstate()") further.
Let's assume unregistering devices DOWN and use RCU protection in
default_operstate() not to race with the device unregistration.
[0]:
BUG: KASAN: slab-use-after-free in ipvlan_get_iflink+0x84/0x88 drivers/net/ipvlan/ipvlan_main.c:353
Read of size 4 at addr ffff0000d768c0e0 by task kworker/u8:35/6944
CPU: 0 UID: 0 PID: 6944 Comm: kworker/u8:35 Not tainted 6.13.0-rc2-g9bc5c9515b48 #12 4c3cb9e8b4565456f6a355f312ff91f4f29b3c47
Hardware name: linux,dummy-virt (DT)
Workqueue: events_unbound linkwatch_event
Call trace:
show_stack+0x38/0x50 arch/arm64/kernel/stacktrace.c:484 (C)
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0xbc/0x108 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x16c/0x6f0 mm/kasan/report.c:489
kasan_report+0xc0/0x120 mm/kasan/report.c:602
__asan_report_load4_noabort+0x20/0x30 mm/kasan/report_generic.c:380
ipvlan_get_iflink+0x84/0x88 drivers/net/ipvlan/ipvlan_main.c:353
dev_get_iflink+0x7c/0xd8 net/core/dev.c:674
default_operstate net/core/link_watch.c:45 [inline]
rfc2863_policy+0x144/0x360 net/core/link_watch.c:72
linkwatch_do_dev+0x60/0x228 net/core/link_watch.c:175
__linkwatch_run_queue+0x2f4/0x5b8 net/core/link_watch.c:239
linkwatch_event+0x64/0xa8 net/core/link_watch.c:282
process_one_work+0x700/0x1398 kernel/workqueue.c:3229
process_scheduled_works kernel/workqueue.c:3310 [inline]
worker_thread+0x8c4/0xe10 kernel/workqueue.c:3391
kthread+0x2b0/0x360 kernel/kthread.c:389
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:862
Allocated by task 9303:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x30/0x68 mm/kasan/common.c:68
kasan_save_alloc_info+0x44/0x58 mm/kasan/generic.c:568
poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
__kasan_kmalloc+0x84/0xa0 mm/kasan/common.c:394
kasan_kmalloc include/linux/kasan.h:260 [inline]
__do_kmalloc_node mm/slub.c:4283 [inline]
__kmalloc_node_noprof+0x2a0/0x560 mm/slub.c:4289
__kvmalloc_node_noprof+0x9c/0x230 mm/util.c:650
alloc_netdev_mqs+0xb4/0x1118 net/core/dev.c:11209
rtnl_create_link+0x2b8/0xb60 net/core/rtnetlink.c:3595
rtnl_newlink_create+0x19c/0x868 net/core/rtnetlink.c:3771
__rtnl_newlink net/core/rtnetlink.c:3896 [inline]
rtnl_newlink+0x122c/0x15c0 net/core/rtnetlink.c:4011
rtnetlink_rcv_msg+0x61c/0x918 net/core/rtnetlink.c:6901
netlink_rcv_skb+0x1dc/0x398 net/netlink/af_netlink.c:2542
rtnetlink_rcv+0x34/0x50 net/core/rtnetlink.c:6928
netlink_unicast_kernel net/netlink/af_netlink.c:1321 [inline]
netlink_unicast+0x618/0x838 net/netlink/af_netlink.c:1347
netlink_sendmsg+0x5fc/0x8b0 net/netlink/af_netlink.c:1891
sock_sendmsg_nosec net/socket.c:711 [inline]
__sock_sendmsg net/socket.c:726 [inline]
__sys_sendto+0x2ec/0x438 net/socket.c:2197
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__arm64_sys_sendto+0xe4/0x110 net/socket.c:2200
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x90/0x278 arch/arm64/kernel/syscall.c:49
el0_svc_common+0x13c/0x250 arch/arm64/kernel/syscall.c:132
do_el0_svc+0x54/0x70 arch/arm64/kernel/syscall.c:151
el0_svc+0x4c/0xa8 arch/arm64/kernel/entry-common.c:744
el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:762
el0t_64_sync+0x198/0x1a0 arch/arm64/kernel/entry.S:600
Freed by task 10200:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x30/0x68 mm/kasan/common.c:68
kasan_save_free_info+0x58/0x70 mm/kasan/generic.c:582
poison_slab_object mm/kasan/common.c:247 [inline]
__kasan_slab_free+0x48/0x68 mm/kasan/common.c:264
kasan_slab_free include/linux/kasan.h:233 [inline]
slab_free_hook mm/slub.c:2338 [inline]
slab_free mm/slub.c:4598 [inline]
kfree+0x140/0x420 mm/slub.c:4746
kvfree+0x4c/0x68 mm/util.c:693
netdev_release+0x94/0xc8 net/core/net-sysfs.c:2034
device_release+0x98/0x1c0
kobject_cleanup lib/kobject.c:689 [inline]
kobject_release lib/kobject.c:720 [inline]
kref_put include/linux/kref.h:65 [inline]
kobject_put+0x2b0/0x438 lib/kobject.c:737
netdev_run_todo+0xdd8/0xf48 net/core/dev.c:10924
rtnl_unlock net/core/rtnetlink.c:152 [inline]
rtnl_net_unlock net/core/rtnetlink.c:209 [inline]
rtnl_dellink+0x484/0x680 net/core/rtnetlink.c:3526
rtnetlink_rcv_msg+0x61c/0x918 net/core/rtnetlink.c:6901
netlink_rcv_skb+0x1dc/0x398 net/netlink/af_netlink.c:2542
rtnetlink_rcv+0x34/0x50 net/core/rtnetlink.c:6928
netlink_unicast_kernel net/netlink/af_netlink.c:1321 [inline]
netlink_unicast+0x618/0x838 net/netlink/af_netlink.c:1347
netlink_sendmsg+0x5fc/0x8b0 net/netlink/af_netlink.c:1891
sock_sendmsg_nosec net/socket.c:711 [inline]
__sock_sendmsg net/socket.c:726 [inline]
____sys_sendmsg+0x410/0x708 net/socket.c:2583
___sys_sendmsg+0x178/0x1d8 net/socket.c:2637
__sys_sendmsg net/socket.c:2669 [inline]
__do_sys_sendmsg net/socket.c:2674 [inline]
__se_sys_sendmsg net/socket.c:2672 [inline]
__arm64_sys_sendmsg+0x12c/0x1c8 net/socket.c:2672
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x90/0x278 arch/arm64/kernel/syscall.c:49
el0_svc_common+0x13c/0x250 arch/arm64/kernel/syscall.c:132
do_el0_svc+0x54/0x70 arch/arm64/kernel/syscall.c:151
el0_svc+0x4c/0xa8 arch/arm64/kernel/entry-common.c:744
el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:762
el0t_64_sync+0x198/0x1a0 arch/arm64/kernel/entry.S:600
The buggy address belongs to the object at ffff0000d768c000
which belongs to the cache kmalloc-cg-4k of size 4096
The buggy address is located 224 bytes inside of
freed 4096-byte region [ffff0000d768c000, ffff0000d768d000)
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x117688
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff0000c77ef981
flags: 0xbfffe0000000040(head|node=0|zone=2|lastcpupid=0x1ffff)
page_type: f5(slab)
raw: 0bfffe0000000040 ffff0000c000f500 dead000000000100 dead000000000122
raw: 0000000000000000 0000000000040004 00000001f5000000 ffff0000c77ef981
head: 0bfffe0000000040 ffff0000c000f500 dead000000000100 dead000000000122
head: 0000000000000000 0000000000040004 00000001f5000000 ffff0000c77ef981
head: 0bfffe0000000003 fffffdffc35da201 ffffffffffffffff 0000000000000000
head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff0000d768bf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff0000d768c000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff0000d768c080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff0000d768c100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff0000d768c180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fixes: 8c55facecd7a ("net: linkwatch: only report IF_OPER_LOWERLAYERDOWN if iflink is actually down")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20250102174400.085fd8ac@kernel.org/ [1]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106071911.64355-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
(un)?register_netdevice_notifier_dev_net() hold RTNL before triggering
the notifier for all netdev in the netns.
Let's convert the RTNL to rtnl_net_lock().
Note that move_netdevice_notifiers_dev_net() is assumed to be (but not
yet) protected by per-netns RTNL of both src and dst netns; we need to
convert wireless and hyperv drivers that call dev_change_net_namespace().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
(un)?register_netdevice_notifier_net() hold RTNL before triggering the
notifier for all netdev in the netns.
Let's convert the RTNL to rtnl_net_lock().
Note that the per-netns netdev notifier is protected by per-netns RTNL.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
(un)?register_netdevice_notifier() hold pernet_ops_rwsem and RTNL,
iterate all netns, and trigger the notifier for all netdev.
Let's hold __rtnl_net_lock() before triggering the notifier.
Note that we will need protection for netdev_chain when RTNL is
removed. (e.g. blocking_notifier conversion [0] with a lockdep
annotation [1])
Link: https://lore.kernel.org/netdev/20250104063735.36945-2-kuniyu@amazon.com/ [0]
Link: https://lore.kernel.org/netdev/20250105075957.67334-1-kuniyu@amazon.com/ [1]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250106070751.63146-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In commit d7811e623dd4 ("[NET]: Drop tx lock in dev_watchdog_up")
dev_watchdog_up() became a simple wrapper for __netdev_watchdog_up()
Herbert also said : "In 2.6.19 we can eliminate the unnecessary
__dev_watchdog_up and replace it with dev_watchdog_up."
This patch consolidates things to have only two functions, with
a common prefix.
- netdev_watchdog_up(), exported for the sake of one freescale driver.
This replaces __netdev_watchdog_up() and dev_watchdog_up().
- netdev_watchdog_down(), static to net/sched/sch_generic.c
This replaces dev_watchdog_down().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20250105090924.1661822-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We've noticed that NFS can hang when using RPC over TLS on an unstable
connection, and investigation shows that the RPC layer is stuck in a tight
loop attempting to transmit, but forever getting -EBADMSG back from the
underlying network. The loop begins when tcp_sendmsg_locked() returns
-EPIPE to tls_tx_records(), but that error is converted to -EBADMSG when
calling the socket's error reporting handler.
Instead of converting errors from tcp_sendmsg_locked(), let's pass them
along in this path. The RPC layer handles -EPIPE by reconnecting the
transport, which prevents the endless attempts to transmit on a broken
connection.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance")
Link: https://patch.msgid.link/9594185559881679d81f071b181a10eb07cd079f.1736004079.git.bcodding@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The skb_buff struct in br_is_nd_neigh_msg() is never modified. Mark it as
const.
Signed-off-by: Ted Chen <znscnchen@gmail.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20250104083846.71612-1-znscnchen@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Let's hold per-netns RTNL of dev_net(dev) in register_netdev()
and unregister_netdev().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
rtnl_lock_killable() is used only in register_netdev()
and will be converted to per-netns RTNL.
Let's unexport it and add the corresponding helper.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Previously xfrm_dev_state_advance_esn() was added for RX only. But
it's possible that ESN context also need to be synced to hardware for
TX, so call it for outbound in this patch.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
We use NAPI ID as the key for continuing dumps. We also depend
on the NAPIs being sorted by ID within the driver list. Tx NAPIs
(which don't have an ID assigned) break this expectation, it's
not currently possible to dump them reliably. Since Tx NAPIs
are relatively rare, and can't be used in doit (GET or SET)
hide them from the dump API as well.
Fixes: 27f91aaf49b3 ("netdev-genl: Add netlink framework functions for napi")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250103183207.1216004-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Use kfree_rcu() instead of synchronize_rcu()+kfree().
This might allow syzbot to fuzz HSR a bit faster...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250103101148.3594545-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot found a lockdep issue [1].
We should remove ax25 RTNL dependency in ax25_setsockopt()
This should also fix a variety of possible UAF in ax25.
[1]
WARNING: possible circular locking dependency detected
6.13.0-rc3-syzkaller-00762-g9268abe611b0 #0 Not tainted
------------------------------------------------------
syz.5.1818/12806 is trying to acquire lock:
ffffffff8fcb3988 (rtnl_mutex){+.+.}-{4:4}, at: ax25_setsockopt+0xa55/0xe90 net/ax25/af_ax25.c:680
but task is already holding lock:
ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1618 [inline]
ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: ax25_setsockopt+0x209/0xe90 net/ax25/af_ax25.c:574
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (sk_lock-AF_AX25){+.+.}-{0:0}:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
lock_sock_nested+0x48/0x100 net/core/sock.c:3642
lock_sock include/net/sock.h:1618 [inline]
ax25_kill_by_device net/ax25/af_ax25.c:101 [inline]
ax25_device_event+0x24d/0x580 net/ax25/af_ax25.c:146
notifier_call_chain+0x1a5/0x3f0 kernel/notifier.c:85
__dev_notify_flags+0x207/0x400
dev_change_flags+0xf0/0x1a0 net/core/dev.c:9026
dev_ifsioc+0x7c8/0xe70 net/core/dev_ioctl.c:563
dev_ioctl+0x719/0x1340 net/core/dev_ioctl.c:820
sock_do_ioctl+0x240/0x460 net/socket.c:1234
sock_ioctl+0x626/0x8e0 net/socket.c:1339
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (rtnl_mutex){+.+.}-{4:4}:
check_prev_add kernel/locking/lockdep.c:3161 [inline]
check_prevs_add kernel/locking/lockdep.c:3280 [inline]
validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
__lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
__mutex_lock_common kernel/locking/mutex.c:585 [inline]
__mutex_lock+0x1ac/0xee0 kernel/locking/mutex.c:735
ax25_setsockopt+0xa55/0xe90 net/ax25/af_ax25.c:680
do_sock_setsockopt+0x3af/0x720 net/socket.c:2324
__sys_setsockopt net/socket.c:2349 [inline]
__do_sys_setsockopt net/socket.c:2355 [inline]
__se_sys_setsockopt net/socket.c:2352 [inline]
__x64_sys_setsockopt+0x1ee/0x280 net/socket.c:2352
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_AX25);
lock(rtnl_mutex);
lock(sk_lock-AF_AX25);
lock(rtnl_mutex);
*** DEADLOCK ***
1 lock held by syz.5.1818/12806:
#0: ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1618 [inline]
#0: ffff8880617ac258 (sk_lock-AF_AX25){+.+.}-{0:0}, at: ax25_setsockopt+0x209/0xe90 net/ax25/af_ax25.c:574
stack backtrace:
CPU: 1 UID: 0 PID: 12806 Comm: syz.5.1818 Not tainted 6.13.0-rc3-syzkaller-00762-g9268abe611b0 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074
check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206
check_prev_add kernel/locking/lockdep.c:3161 [inline]
check_prevs_add kernel/locking/lockdep.c:3280 [inline]
validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
__lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
__mutex_lock_common kernel/locking/mutex.c:585 [inline]
__mutex_lock+0x1ac/0xee0 kernel/locking/mutex.c:735
ax25_setsockopt+0xa55/0xe90 net/ax25/af_ax25.c:680
do_sock_setsockopt+0x3af/0x720 net/socket.c:2324
__sys_setsockopt net/socket.c:2349 [inline]
__do_sys_setsockopt net/socket.c:2355 [inline]
__se_sys_setsockopt net/socket.c:2352 [inline]
__x64_sys_setsockopt+0x1ee/0x280 net/socket.c:2352
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f7b62385d29
Fixes: c433570458e4 ("ax25: fix a use-after-free in ax25_fillin_cb()")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250103210514.87290-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Define inet_sk_dscp() to get a dscp_t value from struct inet_sock, so
that sctp_v4_get_dst() can easily set ->flowi4_tos from a dscp_t
variable. For the SCTP_DSCP_SET_MASK case, we can just use
inet_dsfield_to_dscp() to get a dscp_t value.
Then, when converting ->flowi4_tos from __u8 to dscp_t, we'll just have
to drop the inet_dscp_to_dsfield() conversion function.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/1a645f4a0bc60ad18e7c0916642883ce8a43c013.1735835456.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
rcu_read_lock/rcu_read_unlock has already provide protection for the
pointer we will reference when we call c_show. Therefore, there is no
need to obtain a cache reference to help protect cache_head.
Additionally, the .put such as expkey_put/svc_export_put will invoke
dput, which can sleep and break rcu. Stop get cache reference to fix
them all.
Fixes: ae74136b4bb6 ("SUNRPC: Allow cache lookups to use RCU protection rather than the r/w spinlock")
Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|