Age | Commit message (Collapse) | Author |
|
Rob suggests to move of_net.c from under drivers/of/ somewhere
to the networking code.
Suggested-by: Rob Herring <robh@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
ipsec
Steffen Klassert says:
====================
pull request (net): ipsec 2021-10-07
1) Fix a sysbot reported shift-out-of-bounds in xfrm_get_default.
From Pavel Skripkin.
2) Fix XFRM_MSG_MAPPING ABI breakage. The new XFRM_MSG_MAPPING
messages were accidentally not paced at the end.
Fix by Eugene Syromiatnikov.
3) Fix the uapi for the default policy, use explicit field and macros
and make it accessible to userland.
From Nicolas Dichtel.
4) Fix a missing rcu lock in xfrm_notify_userpolicy().
From Nicolas Dichtel.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and
'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver
modules parameters and retrieve their status.
The first parameter to control is the power mode of the module. It is
only relevant for paged memory modules, as flat memory modules always
operate in low power mode.
When a paged memory module is in low power mode, its power consumption
is reduced to the minimum, the management interface towards the host is
available and the data path is deactivated.
User space can choose to put modules that are not currently in use in
low power mode and transition them to high power mode before putting the
associated ports administratively up. This is useful for user space that
favors reduced power consumption and lower temperatures over reduced
link up times. In QSFP-DD modules the transition from low power mode to
high power mode can take a few seconds and this transition is only
expected to get longer with future / more complex modules.
User space can control the power mode of the module via the power mode
policy attribute ('ETHTOOL_A_MODULE_POWER_MODE_POLICY'). Possible
values:
* high: Module is always in high power mode.
* auto: Module is transitioned by the host to high power mode when the
first port using it is put administratively up and to low power mode
when the last port using it is put administratively down.
The operational power mode of the module is available to user space via
the 'ETHTOOL_A_MODULE_POWER_MODE' attribute. The attribute is not
reported to user space when a module is not plugged-in.
The user API is designed to be generic enough so that it could be used
for modules with different memory maps (e.g., SFF-8636, CMIS).
The only implementation of the device driver API in this series is for a
MAC driver (mlxsw) where the module is controlled by the device's
firmware, but it is designed to be generic enough so that it could also
be used by implementations where the module is controlled by the CPU.
CMIS testing
============
# ethtool -m swp11
Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
...
Module State : 0x03 (ModuleReady)
LowPwrAllowRequestHW : Off
LowPwrRequestSW : Off
The module is not in low power mode, as it is not forced by hardware
(LowPwrAllowRequestHW is off) or by software (LowPwrRequestSW is off).
The power mode can be queried from the kernel. In case
LowPwrAllowRequestHW was on, the kernel would need to take into account
the state of the LowPwrRequestHW signal, which is not visible to user
space.
$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high
power-mode high
Change the power mode policy to 'auto':
# ethtool --set-module swp11 power-mode-policy auto
Query the power mode again:
$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy auto
power-mode low
Verify with the data read from the EEPROM:
# ethtool -m swp11
Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
...
Module State : 0x01 (ModuleLowPwr)
LowPwrAllowRequestHW : Off
LowPwrRequestSW : On
Put the associated port administratively up which will instruct the host
to transition the module to high power mode:
# ip link set dev swp11 up
Query the power mode again:
$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy auto
power-mode high
Verify with the data read from the EEPROM:
# ethtool -m swp11
Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
...
Module State : 0x03 (ModuleReady)
LowPwrAllowRequestHW : Off
LowPwrRequestSW : Off
Put the associated port administratively down which will instruct the
host to transition the module to low power mode:
# ip link set dev swp11 down
Query the power mode again:
$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy auto
power-mode low
Verify with the data read from the EEPROM:
# ethtool -m swp11
Identifier : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
...
Module State : 0x01 (ModuleLowPwr)
LowPwrAllowRequestHW : Off
LowPwrRequestSW : On
SFF-8636 testing
================
# ethtool -m swp13
Identifier : 0x11 (QSFP28)
...
Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) enabled
Power set : Off
Power override : On
...
Transmit avg optical power (Channel 1) : 0.7733 mW / -1.12 dBm
Transmit avg optical power (Channel 2) : 0.7649 mW / -1.16 dBm
Transmit avg optical power (Channel 3) : 0.7790 mW / -1.08 dBm
Transmit avg optical power (Channel 4) : 0.7837 mW / -1.06 dBm
Rcvr signal avg optical power(Channel 1) : 0.9302 mW / -0.31 dBm
Rcvr signal avg optical power(Channel 2) : 0.9079 mW / -0.42 dBm
Rcvr signal avg optical power(Channel 3) : 0.8993 mW / -0.46 dBm
Rcvr signal avg optical power(Channel 4) : 0.8778 mW / -0.57 dBm
The module is not in low power mode, as it is not forced by hardware
(Power override is on) or by software (Power set is off).
The power mode can be queried from the kernel. In case Power override
was off, the kernel would need to take into account the state of the
LPMode signal, which is not visible to user space.
$ ethtool --show-module swp13
Module parameters for swp13:
power-mode-policy high
power-mode high
Change the power mode policy to 'auto':
# ethtool --set-module swp13 power-mode-policy auto
Query the power mode again:
$ ethtool --show-module swp13
Module parameters for swp13:
power-mode-policy auto
power-mode low
Verify with the data read from the EEPROM:
# ethtool -m swp13
Identifier : 0x11 (QSFP28)
Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) not enabled
Power set : On
Power override : On
...
Transmit avg optical power (Channel 1) : 0.0000 mW / -inf dBm
Transmit avg optical power (Channel 2) : 0.0000 mW / -inf dBm
Transmit avg optical power (Channel 3) : 0.0000 mW / -inf dBm
Transmit avg optical power (Channel 4) : 0.0000 mW / -inf dBm
Rcvr signal avg optical power(Channel 1) : 0.0000 mW / -inf dBm
Rcvr signal avg optical power(Channel 2) : 0.0000 mW / -inf dBm
Rcvr signal avg optical power(Channel 3) : 0.0000 mW / -inf dBm
Rcvr signal avg optical power(Channel 4) : 0.0000 mW / -inf dBm
Put the associated port administratively up which will instruct the host
to transition the module to high power mode:
# ip link set dev swp13 up
Query the power mode again:
$ ethtool --show-module swp13
Module parameters for swp13:
power-mode-policy auto
power-mode high
Verify with the data read from the EEPROM:
# ethtool -m swp13
Identifier : 0x11 (QSFP28)
...
Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) enabled
Power set : Off
Power override : On
...
Transmit avg optical power (Channel 1) : 0.7934 mW / -1.01 dBm
Transmit avg optical power (Channel 2) : 0.7859 mW / -1.05 dBm
Transmit avg optical power (Channel 3) : 0.7885 mW / -1.03 dBm
Transmit avg optical power (Channel 4) : 0.7985 mW / -0.98 dBm
Rcvr signal avg optical power(Channel 1) : 0.9325 mW / -0.30 dBm
Rcvr signal avg optical power(Channel 2) : 0.9034 mW / -0.44 dBm
Rcvr signal avg optical power(Channel 3) : 0.9086 mW / -0.42 dBm
Rcvr signal avg optical power(Channel 4) : 0.8885 mW / -0.51 dBm
Put the associated port administratively down which will instruct the
host to transition the module to low power mode:
# ip link set dev swp13 down
Query the power mode again:
$ ethtool --show-module swp13
Module parameters for swp13:
power-mode-policy auto
power-mode low
Verify with the data read from the EEPROM:
# ethtool -m swp13
Identifier : 0x11 (QSFP28)
...
Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) not enabled
Power set : On
Power override : On
...
Transmit avg optical power (Channel 1) : 0.0000 mW / -inf dBm
Transmit avg optical power (Channel 2) : 0.0000 mW / -inf dBm
Transmit avg optical power (Channel 3) : 0.0000 mW / -inf dBm
Transmit avg optical power (Channel 4) : 0.0000 mW / -inf dBm
Rcvr signal avg optical power(Channel 1) : 0.0000 mW / -inf dBm
Rcvr signal avg optical power(Channel 2) : 0.0000 mW / -inf dBm
Rcvr signal avg optical power(Channel 3) : 0.0000 mW / -inf dBm
Rcvr signal avg optical power(Channel 4) : 0.0000 mW / -inf dBm
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
rtnl_fill_statsinfo() is filling skb with one mandatory if_stats_msg structure.
nlmsg_put(skb, pid, seq, type, sizeof(struct if_stats_msg), flags);
But if_nlmsg_stats_size() never considered the needed storage.
This bug did not show up because alloc_skb(X) allocates skb with
extra tailroom, because of added alignments. This could very well
be changed in the future to have deterministic behavior.
Fixes: 10c9ead9f3c6 ("rtnetlink: add new RTM_GETSTATS message to dump link stats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Roopa Prabhu <roopa@nvidia.com>
Acked-by: Roopa Prabhu <roopa@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap") sets
unix domain socket peer state to TCP_CLOSE in unix_shutdown. This could
happen when the local end is shutdown but the other end is not. Then,
the other end will get read or write failures which is not expected.
Fix the issue by setting the local state to shutdown.
Fixes: 94531cfcbe79 ("af_unix: Add unix_stream_proto for sockmap")
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Jiang Wang <jiang.wang@bytedance.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Reviewed-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211004232530.2377085-1-jiang.wang@bytedance.com
|
|
This adds selftests that tests the success and failure path for modules
kfuncs (in presence of invalid kfunc calls) for both libbpf and
gen_loader. It also adds a prog_test kfunc_btf_id_list so that we can
add module BTF ID set from bpf_testmod.
This also introduces a couple of test cases to verifier selftests for
validating whether we get an error or not depending on if invalid kfunc
call remains after elimination of unreachable instructions.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-10-memxor@gmail.com
|
|
This commit moves BTF ID lookup into the newly added registration
helper, in a way that the bbr, cubic, and dctcp implementation set up
their sets in the bpf_tcp_ca kfunc_btf_set list, while the ones not
dependent on modules are looked up from the wrapper function.
This lifts the restriction for them to be compiled as built in objects,
and can be loaded as modules if required. Also modify Makefile.modfinal
to call resolve_btfids for each module.
Note that since kernel kfunc_ids never overlap with module kfunc_ids, we
only match the owner for module btf id sets.
See following commits for background on use of:
CONFIG_X86 ifdef:
569c484f9995 (bpf: Limit static tcp-cc functions in the .BTF_ids list to x86)
CONFIG_DYNAMIC_FTRACE ifdef:
7aae231ac93b (bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE)
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-6-memxor@gmail.com
|
|
This change adds support on the kernel side to allow for BPF programs to
call kernel module functions. Userspace will prepare an array of module
BTF fds that is passed in during BPF_PROG_LOAD using fd_array parameter.
In the kernel, the module BTFs are placed in the auxilliary struct for
bpf_prog, and loaded as needed.
The verifier then uses insn->off to index into the fd_array. insn->off
0 is reserved for vmlinux BTF (for backwards compat), so userspace must
use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is
sorted based on offset in an array, and each offset corresponds to one
descriptor, with a max limit up to 256 such module BTFs.
We also change existing kfunc_tab to distinguish each element based on
imm, off pair as each such call will now be distinct.
Another change is to check_kfunc_call callback, which now include a
struct module * pointer, this is to be used in later patch such that the
kfunc_id and module pointer are matched for dynamically registered BTF
sets from loadable modules, so that same kfunc_id in two modules doesn't
lead to check_kfunc_call succeeding. For the duration of the
check_kfunc_call, the reference to struct module exists, as it returns
the pointer stored in kfunc_btf_tab.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- Add support for MediaTek MT7922 and MT7921
- Enable support for AOSP extention in Qualcomm WCN399x and Realtek
8822C/8852A.
- Add initial support for link quality and audio/codec offload.
- Rework of sockets sendmsg to avoid locking issues.
- Add vhci suspend/resume emulation.
====================
Link: https://lore.kernel.org/r/20211001230850.3635543-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
While existing code is correct, KCSAN is reporting
a data-race in netlink_insert / netlink_sendmsg [1]
It is correct to read nlk->bound without a lock, as netlink_autobind()
will acquire all needed locks.
[1]
BUG: KCSAN: data-race in netlink_insert / netlink_sendmsg
write to 0xffff8881031c8b30 of 1 bytes by task 18752 on cpu 0:
netlink_insert+0x5cc/0x7f0 net/netlink/af_netlink.c:597
netlink_autobind+0xa9/0x150 net/netlink/af_netlink.c:842
netlink_sendmsg+0x479/0x7c0 net/netlink/af_netlink.c:1892
sock_sendmsg_nosec net/socket.c:703 [inline]
sock_sendmsg net/socket.c:723 [inline]
____sys_sendmsg+0x360/0x4d0 net/socket.c:2392
___sys_sendmsg net/socket.c:2446 [inline]
__sys_sendmsg+0x1ed/0x270 net/socket.c:2475
__do_sys_sendmsg net/socket.c:2484 [inline]
__se_sys_sendmsg net/socket.c:2482 [inline]
__x64_sys_sendmsg+0x42/0x50 net/socket.c:2482
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
read to 0xffff8881031c8b30 of 1 bytes by task 18751 on cpu 1:
netlink_sendmsg+0x270/0x7c0 net/netlink/af_netlink.c:1891
sock_sendmsg_nosec net/socket.c:703 [inline]
sock_sendmsg net/socket.c:723 [inline]
__sys_sendto+0x2a8/0x370 net/socket.c:2019
__do_sys_sendto net/socket.c:2031 [inline]
__se_sys_sendto net/socket.c:2027 [inline]
__x64_sys_sendto+0x74/0x90 net/socket.c:2027
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
value changed: 0x00 -> 0x01
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 18751 Comm: syz-executor.0 Not tainted 5.14.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: da314c9923fe ("netlink: Replace rhash_portid with bound")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
No users in tree since commit a3498436b3a0 ("netns: restrict uevents"),
so remove this functionality.
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There is a comment in qdisc_create() about us not calling ops->reset()
in some cases.
err_out4:
/*
* Any broken qdiscs that would require a ops->reset() here?
* The qdisc was never in action so it shouldn't be necessary.
*/
As taprio sets a timer before actually receiving a packet, we need
to cancel it from ops->destroy, just in case ops->reset has not
been called.
syzbot reported:
ODEBUG: free active (active state 0) object type: hrtimer hint: advance_sched+0x0/0x9a0 arch/x86/include/asm/atomic64_64.h:22
WARNING: CPU: 0 PID: 8441 at lib/debugobjects.c:505 debug_print_object+0x16e/0x250 lib/debugobjects.c:505
Modules linked in:
CPU: 0 PID: 8441 Comm: syz-executor813 Not tainted 5.14.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:debug_print_object+0x16e/0x250 lib/debugobjects.c:505
Code: ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 af 00 00 00 48 8b 14 dd e0 d3 e3 89 4c 89 ee 48 c7 c7 e0 c7 e3 89 e8 5b 86 11 05 <0f> 0b 83 05 85 03 92 09 01 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e c3
RSP: 0018:ffffc9000130f330 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000000
RDX: ffff88802baeb880 RSI: ffffffff815d87b5 RDI: fffff52000261e58
RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff815d25ee R11: 0000000000000000 R12: ffffffff898dd020
R13: ffffffff89e3ce20 R14: ffffffff81653630 R15: dffffc0000000000
FS: 0000000000f0d300(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffb64b3e000 CR3: 0000000036557000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
__debug_check_no_obj_freed lib/debugobjects.c:987 [inline]
debug_check_no_obj_freed+0x301/0x420 lib/debugobjects.c:1018
slab_free_hook mm/slub.c:1603 [inline]
slab_free_freelist_hook+0x171/0x240 mm/slub.c:1653
slab_free mm/slub.c:3213 [inline]
kfree+0xe4/0x540 mm/slub.c:4267
qdisc_create+0xbcf/0x1320 net/sched/sch_api.c:1299
tc_modify_qdisc+0x4c8/0x1a60 net/sched/sch_api.c:1663
rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5571
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
____sys_sendmsg+0x6e8/0x810 net/socket.c:2403
___sys_sendmsg+0xf3/0x170 net/socket.c:2457
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2486
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
Fixes: 44d4775ca518 ("net/sched: sch_taprio: reset child qdiscs before freeing them")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Davide Caratti <dcaratti@redhat.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit de1799667b00 ("net: bridge: add STP xstats")
added an additional nla_reserve_64bit() in br_fill_linkxstats(),
but forgot to update br_get_linkxstats_size() accordingly.
This can trigger the following in rtnl_stats_get()
WARN_ON(err == -EMSGSIZE);
Fixes: de1799667b00 ("net: bridge: add STP xstats")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
bridge_fill_linkxstats() is using nla_reserve_64bit().
We must use nla_total_size_64bit() instead of nla_total_size()
for corresponding data structure.
Fixes: 1080ab95e3c7 ("net: bridge: add support for IGMP/MLD stats and export them via netlink")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is an operational low memory situation that needs to be
flagged. The new tracepoint records a timestamp and the nfsd thread
that failed to allocate pages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
There are currently three separate purposes being served by single
tracepoints. Split them up, as was done with wc_send.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
There are currently three separate purposes being served by a single
tracepoint here. They need to be split up.
svcrdma_wc_send:
- status is always zero, so there's no value in recording it.
- vendor_err is meaningless unless status is not zero, so
there's no value in recording it.
- This tracepoint is needed only when developing modifications,
so it should be left disabled most of the time.
svcrdma_wc_send_flush:
- As above, needed only rarely, and not an error.
svcrdma_wc_send_err:
- This tracepoint can be left persistently enabled because
completion errors are run-time problems (except for FLUSHED_ERR).
- Tracepoint name now ends in _err to reflect its purpose.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
There are currently three separate purposes being served by a single
tracepoint here. They need to be split up.
svcrdma_wc_recv:
- status is always zero, so there's no value in recording it.
- vendor_err is meaningless unless status is not zero, so
there's no value in recording it.
- This tracepoint is needed only when developing modifications,
so it should be left disabled most of the time.
svcrdma_wc_recv_flush:
- As above, needed only rarely, and not an error.
svcrdma_wc_recv_err:
- received is always zero, so there's no value in recording it.
- This tracepoint can be left enabled because completion
errors are run-time problems (except for FLUSHED_ERR).
- Tracepoint name now ends in _err to reflect its purpose.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
A packet received on a trunk will have bit 2 set in Forward DSA tagged
frame. Bit 1 can be either 0 or 1 and is otherwise undefined and bit 0
indicates the frame CFI. Masking with 7 thus results in frames as
being identified as being from a trunk when in fact they are not. Fix
the mask to just look at bit 2.
Fixes: 5b60dadb71db ("net: dsa: tag_dsa: Support reception of packets from LAG devices")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
sdata->tun_src should be freed before sdata is freed
because sdata->tun_src is allocated after sdata allocation.
So, kfree(sdata) and kfree(rcu_dereference_raw(sdata->tun_src)) are
changed code order.
Fixes: f04ed7d277e8 ("net: ipv6: check return value of rhashtable_init")
Signed-off-by: MichelleJin <shjy180909@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch adds support for the ip6ip6 encapsulation by providing three encap
modes: inline, encap and auto.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This prerequisite patch provides some minor edits (alignments, renames) and a
minor modification inside a function to facilitate the next patch by using
existing nla_* functions.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch anticipates the support for the IOAM insertion inside in-transit
packets, by making a difference between input and output in order to determine
the right value for its hop-limit (inherited from the IPv6 hop-limit).
Input case: happens before ip6_forward, the IPv6 hop-limit is not decremented
yet -> decrement the IOAM hop-limit to reflect the new hop inside the trace.
Output case: happens after ip6_forward, the IPv6 hop-limit has already been
decremented -> keep the same value for the IOAM hop-limit.
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The clearing of the XPRT_LOCKED bit has to happen after we clear
xprt->snd_task, but we don't require any extra memory barriers after
that.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Don't let xprtiod pre-empt softirq.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Allow tasks that need to pre-empt rpciod/xprtiod to do so when it is
safe.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
The premise of commit 6f9f17287e78 ("SUNRPC: Mitigate cond_resched() in
xprt_transmit()") was that cond_resched() is expensive and unnecessary
when there has been just a single send.
The point of cond_resched() is to ensure that tasks that should pre-empt
this one get a chance to do so when it is safe to do so. The code prior
to commit 6f9f17287e78 failed to take into account that it was keeping a
rpc_task pinned for longer than it needed to, and so rather than doing a
full revert, let's just move the cond_resched.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Add multi-packet route input tests, for message reassembly. These will
feed packets to be received by a bound socket, or dropped.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a few tests for single-packet route inputs, testing the
mctp_route_input function.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a few tests for the initial packet ingress through
mctp_pkttype_receive function; mainly packet header sanity checks. Full
input routing checks will be added as a separate change.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a new object for shared test utilities
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This change adds the first kunit test for the mctp subsystem, and an
initial test for the fragmentation path.
We're adding tests under a new net/mctp/test/ directory.
Incorporates a fix for module configs:
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Refactor.
Now that the NFSv2 and NFSv3 XDR decoders have been converted to
use xdr_streams, the WRITE decoder functions can use
xdr_stream_subsegment() to extract the WRITE payload into its own
xdr_buf, just as the NFSv4 WRITE XDR decoder currently does.
That makes it possible to pass the first kvec, pages array + length,
page_base, and total payload length via a single function parameter.
The payload's page_base is not yet assigned or used, but will be in
subsequent patches.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
xdr_stream_subsegment() was introduced in commit c1346a1216ab
("NFSD: Replace the internals of the READ_BUF() macro").
There are two call sites for xdr_stream_subsegment(). One is
nfsd4_decode_write(), and the other is nfsd4_decode_setxattr().
Currently neither of these call sites calls this API when
xdr_buf::page_base is a non-zero value.
However, I'm about to add a case where page_base will sometimes not
be zero when nfsd4_decode_write() invokes this API. Replace the
logic in xdr_stream_subsegment() that advances to the next data item
in the xdr_stream with something more generic in order to handle
this new use case.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
Convert from ether_addr_copy() to eth_hw_addr_set():
@@
expression dev, np;
@@
- ether_addr_copy(dev->dev_addr, np)
+ eth_hw_addr_set(dev, np)
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Convert all Ethernet drivers from memcpy(... ETH_ADDR)
to eth_hw_addr_set():
@@
expression dev, np;
@@
- memcpy(dev->dev_addr, np, ETH_ALEN)
+ eth_hw_addr_set(dev, np)
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Convert sw drivers from memcpy(... ETH_ADDR) to eth_hw_addr_set():
@@
expression dev, np;
@@
- memcpy(dev->dev_addr, np, ETH_ALEN)
+ eth_hw_addr_set(dev, np)
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Currently, all packets injected into Ocelot switches are classified to
VLAN 0, regardless of whether they are VLAN-tagged or not. This is
because the switch only looks at the VLAN TCI from the DSA tag.
VLAN 0 is then stripped on egress due to REW_TAG_CFG_TAG_CFG. There are
2 cases really, below is the explanation for ocelot_port_set_native_vlan:
- Port is VLAN-aware, we set REW_TAG_CFG_TAG_CFG to 1 (egress-tag all
frames except VID 0 and the native VLAN) if a native VLAN exists, or
to 3 otherwise (tag all frames, including VID 0).
- Port is VLAN-unaware, we set REW_TAG_CFG_TAG_CFG to 0 (port tagging
disabled, classified VLAN never appears in the packet).
One can already see an inconsistency: when a native VLAN exists, VID 0
is egress-untagged, but when it doesn't, VID 0 is egress-tagged.
So when we do this:
ip link add br0 type bridge vlan_filtering 1
ip link set swp0 master br0
bridge vlan del dev swp0 vid 1
bridge vlan add dev swp0 vid 1 pvid # but not untagged
and we ping through swp0, packets will look like this:
MAC > 33:33:00:00:00:02, ethertype 802.1Q (0x8100): vlan 0, p 0,
ethertype 802.1Q (0x8100), vlan 1, p 0, ethertype IPv6 (0x86dd),
ICMP6, router solicitation, length 16
So VID 1 frames (sent that way by the Linux bridge) are encapsulated in
a VID 0 header - the classified VLAN of the packets as far as the hw is
concerned. To avoid that, what we really need to do is stop injecting
packets using the classified VLAN of 0.
This patch strips the VLAN header from the skb payload, if that VLAN
exists and if the port is under a VLAN-aware bridge. Then it copies that
VLAN header into the DSA injection frame header.
A positive side effect is that VCAP ES0 VLAN rewriting rules now work
for packets injected from the CPU into a port that's under a VLAN-aware
bridge, and we are able to match those packets by the VLAN ID that was
sent by the network stack, and not by VLAN ID 0.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
napi_gro_complete always returned the same value, NET_RX_SUCCESS
And the value was not used anywhere
Signed-off-by: Gyumin Hwang <hkm73560@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pablo Neira Ayuso says:
====================
Netfilter fixes for net (v2)
The following patchset contains Netfilter fixes for net:
1) Move back the defrag users fields to the global netns_nf area.
Kernel fails to boot if conntrack is builtin and kernel is booted
with: nf_conntrack.enable_hooks=1. From Florian Westphal.
2) Rule event notification is missing relevant context such as
the position handle and the NLM_F_APPEND flag.
3) Rule replacement is expanded to add + delete using the existing
rule handle, reverse order of this operation so it makes sense
from rule notification standpoint.
4) Propagate to userspace the NLM_F_CREATE and NLM_F_EXCL flags
from the rule notification path.
Patches #2, #3 and #4 are used by 'nft monitor' and 'iptables-monitor'
userspace utilities which are not correctly representing the following
operations through netlink notifications:
- rule insertions
- rule addition/insertion from position handle
- create table/chain/set/map/flowtable/...
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Include the NLM_F_CREATE and NLM_F_EXCL flags in netlink event
notifications, otherwise userspace cannot distiguish between create and
add commands.
Fixes: 96518518cc41 ("netfilter: add nftables")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Daniel Borkmann says:
====================
bpf-next 2021-10-02
We've added 85 non-merge commits during the last 15 day(s) which contain
a total of 132 files changed, 13779 insertions(+), 6724 deletions(-).
The main changes are:
1) Massive update on test_bpf.ko coverage for JITs as preparatory work for
an upcoming MIPS eBPF JIT, from Johan Almbladh.
2) Add a batched interface for RX buffer allocation in AF_XDP buffer pool,
with driver support for i40e and ice from Magnus Karlsson.
3) Add legacy uprobe support to libbpf to complement recently merged legacy
kprobe support, from Andrii Nakryiko.
4) Add bpf_trace_vprintk() as variadic printk helper, from Dave Marchevsky.
5) Support saving the register state in verifier when spilling <8byte bounded
scalar to the stack, from Martin Lau.
6) Add libbpf opt-in for stricter BPF program section name handling as part
of libbpf 1.0 effort, from Andrii Nakryiko.
7) Add a document to help clarifying BPF licensing, from Alexei Starovoitov.
8) Fix skel_internal.h to propagate errno if the loader indicates an internal
error, from Kumar Kartikeya Dwivedi.
9) Fix build warnings with -Wcast-function-type so that the option can later
be enabled by default for the kernel, from Kees Cook.
10) Fix libbpf to ignore STT_SECTION symbols in legacy map definitions as it
otherwise errors out when encountering them, from Toke Høiland-Jørgensen.
11) Teach libbpf to recognize specialized maps (such as for perf RB) and
internally remove BTF type IDs when creating them, from Hengqi Chen.
12) Various fixes and improvements to BPF selftests.
====================
Link: https://lore.kernel.org/r/20211002001327.15169-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
prevent_wake logic is backward since what it is really checking is
if the device may wakeup the system or not, not that it will prevent
the to be awaken.
Also looking on how other subsystems have the entry as power/wakeup
this also renames the force_prevent_wake to force_wakeup in vhci driver.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
|
|
napi_busy_loop() disables preemption and performs a NAPI poll. We can't acquire
sleeping locks with disabled preemption which would be required while
__napi_poll() invokes the callback of the driver.
A threaded interrupt performing the NAPI-poll can be preempted on PREEMPT_RT.
A RT thread on another CPU may observe NAPIF_STATE_SCHED bit set and busy-spin
until it is cleared or its spin time runs out. Given it is the task with the
highest priority it will never observe the NEED_RESCHED bit set.
In this case the time is better spent by simply sleeping.
The NET_RX_BUSY_POLL is disabled by default (the system wide sysctls for
poll/read are set to zero). Disabling NET_RX_BUSY_POLL on PREEMPT_RT to avoid
wrong locking context in case it is used.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20211001145841.2308454-1-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzbot reported another NULL deref in fifo_set_limit() [1]
I could repro the issue with :
unshare -n
tc qd add dev lo root handle 1:0 tbf limit 200000 burst 70000 rate 100Mbit
tc qd replace dev lo parent 1:0 pfifo_fast
tc qd change dev lo root handle 1:0 tbf limit 300000 burst 70000 rate 100Mbit
pfifo_fast does not have a change() operation.
Make fifo_set_limit() more robust about this.
[1]
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 1cf99067 P4D 1cf99067 PUD 7ca49067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 14443 Comm: syz-executor959 Not tainted 5.15.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:0x0
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
RSP: 0018:ffffc9000e2f7310 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffffffff8d6ecc00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff888024c27910 RDI: ffff888071e34000
RBP: ffff888071e34000 R08: 0000000000000001 R09: ffffffff8fcfb947
R10: 0000000000000001 R11: 0000000000000000 R12: ffff888024c27910
R13: ffff888071e34018 R14: 0000000000000000 R15: ffff88801ef74800
FS: 00007f321d897700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000000722c3000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
fifo_set_limit net/sched/sch_fifo.c:242 [inline]
fifo_set_limit+0x198/0x210 net/sched/sch_fifo.c:227
tbf_change+0x6ec/0x16d0 net/sched/sch_tbf.c:418
qdisc_change net/sched/sch_api.c:1332 [inline]
tc_modify_qdisc+0xd9a/0x1a60 net/sched/sch_api.c:1634
rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5572
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504
netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1340
netlink_sendmsg+0x86d/0xdb0 net/netlink/af_netlink.c:1929
sock_sendmsg_nosec net/socket.c:704 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:724
____sys_sendmsg+0x6e8/0x810 net/socket.c:2409
___sys_sendmsg+0xf3/0x170 net/socket.c:2463
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2492
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: fb0305ce1b03 ("net-sched: consolidate default fifo qdisc setup")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20210930212239.3430364-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If sd_max is unsigned, then sd_max - GSS_SEQ_WIN is a very large number
whenever sd_max is less than GSS_SEQ_WIN, and the comparison:
seq_num <= sd->sd_max - GSS_SEQ_WIN
in gss_check_seq_num is pretty much always true, even when that's
clearly not what was intended.
This was causing pynfs to hang when using krb5, because pynfs uses zero
as the initial gss sequence number. That's perfectly legal, but this
logic error causes knfsd to drop the rpc in that case. Out-of-order
sequence IDs in the first GSS_SEQ_WIN (128) calls will also cause this.
Fixes: 10b9d99a3dbb ("SUNRPC: Augment server-side rpcgss tracepoints")
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This reverts commit 4f42ad2011d2fcbd89f5cdf56121271a8cd5ee5d, reversing
changes made to ea2dd331bfaaeba74ba31facf437c29044f7d4cb.
These chanfges break the build when mctp is modular.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Each region has an independently configurable number of maximum
snapshots. This information is not reported to userspace, making it not
very discoverable. Fix this by adding a new
DEVLINK_ATTR_REGION_MAX_SNAPSHOST attribute which is used to report this
maximum.
Ex:
$devlink region
pci/0000:af:00.0/nvm-flash: size 10485760 snapshot [] max 1
pci/0000:af:00.0/device-caps: size 4096 snapshot [] max 10
pci/0000:af:00.1/nvm-flash: size 10485760 snapshot [] max 1
pci/0000:af:00.1/device-caps: size 4096 snapshot [] max 10
This information enables users to understand why a new region command
may fail due to having too many existing snapshots.
Reported-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel)
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add multi-packet route input tests, for message reassembly. These will
feed packets to be received by a bound socket, or dropped.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a few tests for single-packet route inputs, testing the
mctp_route_input function.
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|