Age | Commit message (Collapse) | Author |
|
Add check for the return value of nfp_app_ctrl_msg_alloc() in
nfp_bpf_cmsg_alloc() to prevent null pointer dereference.
Fixes: ff3d43f7568c ("nfp: bpf: implement helpers for FW map ops")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Link: https://patch.msgid.link/20250218030409.2425798-1-haoxiang_li2024@163.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Xiumei reported hitting the WARN in xfrm6_tunnel_net_exit while
running tests that boil down to:
- create a pair of netns
- run a basic TCP test over ipcomp6
- delete the pair of netns
The xfrm_state found on spi_byaddr was not deleted at the time we
delete the netns, because we still have a reference on it. This
lingering reference comes from a secpath (which holds a ref on the
xfrm_state), which is still attached to an skb. This skb is not
leaked, it ends up on sk_receive_queue and then gets defer-free'd by
skb_attempt_defer_free.
The problem happens when we defer freeing an skb (push it on one CPU's
defer_list), and don't flush that list before the netns is deleted. In
that case, we still have a reference on the xfrm_state that we don't
expect at this point.
We already drop the skb's dst in the TCP receive path when it's no
longer needed, so let's also drop the secpath. At this point,
tcp_filter has already called into the LSM hooks that may require the
secpath, so it should not be needed anymore. However, in some of those
places, the MPTCP extension has just been attached to the skb, so we
cannot simply drop all extensions.
Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/5055ba8f8f72bdcb602faa299faca73c280b7735.1739743613.git.sd@queasysnail.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The external PHY will undergo a soft reset twice during the resume process
when it wake up from suspend. The first reset occurs when the axienet
driver calls phylink_of_phy_connect(), and the second occurs when
mdio_bus_phy_resume() invokes phy_init_hw(). The second soft reset of the
external PHY does not reinitialize the internal PHY, which causes issues
with the internal PHY, resulting in the PHY link being down. To prevent
this, setting the mac_managed_pm flag skips the mdio_bus_phy_resume()
function.
Fixes: a129b41fe0a8 ("Revert "net: phy: dp83867: perform soft reset and retain established link"")
Signed-off-by: Nick Hu <nick.hu@sifive.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20250217055843.19799-1-nick.hu@sifive.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Jakub Kicinski says:
====================
selftests: drv-net: add a simple TSO test
Add a simple test for exercising TSO over tunnels.
Similarly to csum test we want to iterate over ip versions.
Rework how addresses are stored in env to make this easier.
====================
Link: https://patch.msgid.link/20250218225426.77726-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a simple test for TSO. Send a few MB of data and check device
stats to verify that the device was performing segmentation.
Do the same thing over a few tunnel types.
Injecting GSO packets directly would give us more ability to test
corner cases, but perhaps starting simple is good enough?
# ./ksft-net-drv/drivers/net/hw/tso.py
# Detected qstat for LSO wire-packets
KTAP version 1
1..14
ok 1 tso.ipv4 # SKIP Test requires IPv4 connectivity
ok 2 tso.vxlan4_ipv4 # SKIP Test requires IPv4 connectivity
ok 3 tso.vxlan6_ipv4 # SKIP Test requires IPv4 connectivity
ok 4 tso.vxlan_csum4_ipv4 # SKIP Test requires IPv4 connectivity
ok 5 tso.vxlan_csum6_ipv4 # SKIP Test requires IPv4 connectivity
ok 6 tso.gre4_ipv4 # SKIP Test requires IPv4 connectivity
ok 7 tso.gre6_ipv4 # SKIP Test requires IPv4 connectivity
ok 8 tso.ipv6
ok 9 tso.vxlan4_ipv6
ok 10 tso.vxlan6_ipv6
ok 11 tso.vxlan_csum4_ipv6
ok 12 tso.vxlan_csum6_ipv6
# Testing with mangleid enabled
ok 13 tso.gre4_ipv6
ok 14 tso.gre6_ipv6
# Totals: pass:7 fail:0 xfail:0 xpass:0 skip:7 error:0
Note that the test currently depends on the driver reporting
the LSO count via qstat, which appears to be relatively rare
(virtio, cisco/enic, sfc/efc; but virtio needs host support).
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250218225426.77726-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Looks like more and more tests want to iterate over IP version,
run the same test over ipv4 and ipv6. The current naming of
members in the env class makes it a bit awkward, we have
separate members for ipv4 and ipv6 parameters.
Store the parameters inside dicts, so that tests can easily
index them with ip version.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250218225426.77726-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We already record output of ip link for NETIF in env for easy access.
Record the detailed version. TSO test will want to know the max tso size.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250218225426.77726-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Find out and record in env the name of the interface which remote host
will use for the IP address provided via config.
Interface name is useful for mausezahn and for setting up tunnels.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20250218225426.77726-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Matthieu Baerts says:
====================
mptcp: rx path refactor
Paolo worked on this RX path refactor for these two main reasons:
- Currently, the MPTCP RX path introduces quite a bit of 'exceptional'
accounting/locking processing WRT to plain TCP, adding up to the
implementation complexity in a miserable way.
- The performance gap WRT plain TCP for single subflow connections is
quite measurable.
The present refactor addresses both the above items: most of the
additional complexity is dropped, and single stream performances
increase measurably, from 55Gbps to 71Gbps in Paolo's loopback test.
As a reference, plain TCP was around 84Gbps on the same host.
The above comes to a price: the patch are invasive, even in subtle ways.
Note: patch 5/7 removes the sk_forward_alloc_get() helper, which caused
some trivial modifications in different places in the net tree: sockets,
IPv4, sched. That's why a few more people have been Cc here. Feel free
to only look at this patch 5/7.
====================
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-0-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After the RX path refactor the mentioned function is expected to run
frequently, let's optimize it a bit.
Scan for ready subflow from the last processed one, and stop after
traversing the list once or reaching the msk memory limit - instead of
looking for dubious per-subflow conditions.
Also re-order the memory limit checks, to avoid duplicate tests.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-7-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After the RX path refactor, it become a wrapper for sk_rmem_alloc
access, with a slightly misleading name. Just drop it.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-6-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After the previous patch we can remove the forward_alloc_get
proto callback, basically reverting commit 292e6077b040 ("net: introduce
sk_forward_alloc_get()") and commit 66d58f046c9d ("net: use
sk_forward_alloc_get() in sk_get_meminfo()").
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-5-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After the previous patch, updating sk_forward_memory is cheap and
we can drop a lot of complexity from the MPTCP memory accounting,
removing the custom fwd mem allocations for rmem.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-4-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After commit c2e6048fa1cf ("mptcp: fix race in release_cb") we can
move the whole MPTCP rx path under the socket lock leveraging the
release_cb.
We can drop a bunch of spin_lock pairs in the receive functions, use
a single receive queue and invoke __mptcp_move_skbs only when subflows
ask for it.
This will allow more cleanup in the next patch.
Some changes are worth specific mention:
The msk rcvbuf update now always happens under both the msk and the
subflow socket lock: we can drop a bunch of ONCE annotation and
consolidate the checks.
When the skbs move is delayed at msk release callback time, even the
msk rcvbuf update is delayed; additionally take care of such action in
__mptcp_move_skbs().
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-3-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When we will move the whole RX path under the msk socket lock, updating
the already queued skb for passive fastopen socket at 3rd ack time will
be extremely painful and race prone
The map_seq for already enqueued skbs is used only to allow correct
coalescing with later data; preventing collapsing to the first skb of
a fastopen connect we can completely remove the
__mptcp_fastopen_gen_msk_ackseq() helper.
Before dropping this helper, a new item had to be added to the
mptcp_skb_cb structure. Because this item will be frequently tested in
the fast path -- almost on every packet -- and because there is free
space there, a single byte is used instead of a bitfield. This micro
optimisation slightly reduces the number of CPU operations to do the
associated check.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-2-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Consolidate all the cleanup actions requiring the worker in a single
helper and ensure the dummy data fin creation for fallback socket is
performed only when the tcp rx queue is empty.
There are no functional changes intended, but this will simplify the
next patch, when the tcp rx queue spooling could be delayed at release_cb
time.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218-net-next-mptcp-rx-path-refactor-v1-1-4a47d90d7998@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
nfc_llc_unregister() has been unused since it was added in 2012's
commit 67cccfe17d1b ("NFC: Add an LLC Core layer to HCI")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://patch.msgid.link/20250219020258.297995-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fixes minor spelling errors:
- `simult_flows.sh`: "al testcases" -> "all testcases"
- `psock_tpacket.c`: "accross" -> "across"
Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250218165923.20740-1-suchitkarunakaran@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Breno Leitao says:
====================
net: core: improvements to device lookup by hardware address.
The first patch adds a new dev_getbyhwaddr() helper function for
finding devices by hardware address when the rtnl lock is held. This
prevents PROVE_LOCKING warnings that occurred when rtnl lock was held
but the RCU read lock wasn't. The common address comparison logic is
extracted into dev_comp_addr() to avoid code duplication.
The second coverts arp_req_set_public() to the new helper.
====================
Link: https://patch.msgid.link/20250218-arm_fix_selftest-v5-0-d3d6892db9e1@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The arp_req_set_public() function is called with the rtnl lock held,
which provides enough synchronization protection. This makes the RCU
variant of dev_getbyhwaddr() unnecessary. Switch to using the simpler
dev_getbyhwaddr() function since we already have the required rtnl
locking.
This change helps maintain consistency in the networking code by using
the appropriate helper function for the existing locking context.
Since we're not holding the RCU read lock in arp_req_set_public()
existing code could trigger false positive locking warnings.
Fixes: 941666c2e3e0 ("net: RCU conversion of dev_getbyhwaddr() and arp_ioctl()")
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20250218-arm_fix_selftest-v5-2-d3d6892db9e1@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add dedicated helper for finding devices by hardware address when
holding rtnl_lock, similar to existing dev_getbyhwaddr_rcu(). This prevents
PROVE_LOCKING warnings when rtnl_lock is held but RCU read lock is not.
Extract common address comparison logic into dev_addr_cmp().
The context about this change could be found in the following
discussion:
Link: https://lore.kernel.org/all/20250206-scarlet-ermine-of-improvement-1fcac5@leitao/
Cc: kuniyu@amazon.com
Cc: ushankar@purestorage.com
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250218-arm_fix_selftest-v5-1-d3d6892db9e1@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Russell King says:
====================
net: stmmac: further cleanups
This small series does further cleanups to the stmmac driver:
1. Name priv->pause to indicate that it's a timeout and clarify the
units of the "pause" module parameter
2. Remove useless priv->flow_ctrl member and deprecate the useless
"flow_ctrl" module parameter
3. Fix long-standing signed-ness issue with "speed" passed around the
driver from the mac_link_up method.
====================
Link: https://patch.msgid.link/Z7Rf2daOaf778TOg@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
priv->plat->fix_mac_speed() is called from stmmac_mac_link_up(), which
is passed the speed as an "int". However, fix_mac_speed() implicitly
casts this to an unsigned int. Some platform glue code print this value
using %u, others with %d. Some implicitly cast it back to an int, and
others to u32.
Good practice is to use one type and only one type to represent a value
being passed around a driver.
Switch all of these over to consistently use "int" when dealing with a
speed passed from stmmac_mac_link_up(), even though the speed will
always be positive.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Acked-by: Chen-Yu Tsai <wens@csie.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Link: https://patch.msgid.link/E1tkKmN-004ObM-Ge@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
priv->flow_ctrl is only accessed by stmmac_main.c, and the only place
that it is read is in stmmac_mac_flow_ctrl(). This function is only
called from stmmac_mac_link_up() which always sets priv->flow_ctrl
immediately before calling this function.
Therefore, initialising this at probe time is ineffectual because it
will always be overwritten before it's read. As such, the "flow_ctrl"
module parameter has been useless for some time. Rather than remove
the module parameter, which would risk module load failure, change the
description to indicate that it is obsolete, and warn if it is set by
userspace.
Moreover, storing the value in the stmmac_priv has no benefit as it's
set and then immediately read stmmac_mac_flow_ctrl(). Instead, pass it
as a parameter to this function..
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
Link: https://patch.msgid.link/E1tkKmI-004ObG-DL@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
priv->pause corresponds with "pause_time" in the 802.3 specification,
and is also called "pause_time" in the various MAC backends. For
consistency, use the same name in the core stmmac code.
Clarify the units of the "pause" module parameter which sets up this
struct member to indicate that it's in units of the pause_quanta
defined by 802.3, which is 512 bit times.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Link: https://patch.msgid.link/E1tkKmD-004ObA-9K@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
According to the C11 standard (ISO/IEC 9899:2011, 6.5.7):
"If E1 has a signed type and E1 x 2^E2 is not representable in the result
type, the behavior is undefined."
Shifting 1 << 31 causes signed integer overflow, which leads to undefined
behavior.
Fix this by explicitly using '1U << 31' to ensure the shift operates on
an unsigned type, avoiding undefined behavior.
Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com>
Link: https://patch.msgid.link/20250218081217.3468369-1-eleanor15x@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cong Wang says:
====================
flow_dissector: Fix handling of mixed port and port-range keys
This patchset contains two fixes for flow_dissector handling of mixed
port and port-range keys, for both tc-flower case and bpf case. Each
of them also comes with a selftest.
====================
Link: https://patch.msgid.link/20250218043210.732959-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After this patch:
#102/1 flow_dissector_classification/ipv4:OK
#102/2 flow_dissector_classification/ipv4_continue_dissect:OK
#102/3 flow_dissector_classification/ipip:OK
#102/4 flow_dissector_classification/gre:OK
#102/5 flow_dissector_classification/port_range:OK
#102/6 flow_dissector_classification/ipv6:OK
#102 flow_dissector_classification:OK
Summary: 1/6 PASSED, 0 SKIPPED, 0 FAILED
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250218043210.732959-5-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix how port range keys are handled in __skb_flow_bpf_to_target() by:
- Separating PORTS and PORTS_RANGE key handling
- Using correct key_ports_range structure for range keys
- Properly initializing both key types independently
This ensures port range information is correctly stored in its dedicated
structure rather than incorrectly using the regular ports key structure.
Fixes: 59fb9b62fb6c ("flow_dissector: Fix to use new variables for port ranges in bpf hook")
Reported-by: Qiang Zhang <dtzq01@gmail.com>
Closes: https://lore.kernel.org/netdev/CAPx+-5uvFxkhkz4=j_Xuwkezjn9U6kzKTD5jz4tZ9msSJ0fOJA@mail.gmail.com/
Cc: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://patch.msgid.link/20250218043210.732959-4-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
port-range
After this patch:
# ./tc_flower_port_range.sh
TEST: Port range matching - IPv4 UDP [ OK ]
TEST: Port range matching - IPv4 TCP [ OK ]
TEST: Port range matching - IPv6 UDP [ OK ]
TEST: Port range matching - IPv6 TCP [ OK ]
TEST: Port range matching - IPv4 UDP Drop [ OK ]
Cc: Qiang Zhang <dtzq01@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250218043210.732959-3-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch fixes a bug in TC flower filter where rules combining a
specific destination port with a source port range weren't working
correctly.
The specific case was when users tried to configure rules like:
tc filter add dev ens38 ingress protocol ip flower ip_proto udp \
dst_port 5000 src_port 2000-3000 action drop
The root cause was in the flow dissector code. While both
FLOW_DISSECTOR_KEY_PORTS and FLOW_DISSECTOR_KEY_PORTS_RANGE flags
were being set correctly in the classifier, the __skb_flow_dissect_ports()
function was only populating one of them: whichever came first in
the enum check. This meant that when the code needed both a specific
port and a port range, one of them would be left as 0, causing the
filter to not match packets as expected.
Fix it by removing the either/or logic and instead checking and
populating both key types independently when they're in use.
Fixes: 8ffb055beae5 ("cls_flower: Fix the behavior using port ranges with hw-offload")
Reported-by: Qiang Zhang <dtzq01@gmail.com>
Closes: https://lore.kernel.org/netdev/CAPx+-5uvFxkhkz4=j_Xuwkezjn9U6kzKTD5jz4tZ9msSJ0fOJA@mail.gmail.com/
Cc: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250218043210.732959-2-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add more logs to assist in debugging and monitoring
driver behaviour, making it easier to identify potential
issues during development and testing.
Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/1739842455-23899-1-git-send-email-ernis@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Kuniyuki Iwashima says:
====================
gtp/geneve: Suppress list_del() splat during ->exit_batch_rtnl().
The common pattern in tunnel device's ->exit_batch_rtnl() is iterating
two netdev lists for each netns: (i) for_each_netdev() to clean up
devices in the netns, and (ii) the device type specific list to clean
up devices in other netns.
list_for_each_entry(net, net_list, exit_list) {
for_each_netdev_safe(net, dev, next) {
/* (i) call unregister_netdevice_queue(dev, list) */
}
list_for_each_entry_safe(xxx, xxx_next, &net->yyy, zzz) {
/* (ii) call unregister_netdevice_queue(xxx->dev, list) */
}
}
Then, ->exit_batch_rtnl() could touch the same device twice.
Say we have two netns A & B and device B that is created in netns A and
moved to netns B.
1. cleanup_net() processes netns A and then B.
2. ->exit_batch_rtnl() finds the device B while iterating netns A's (ii)
[ device B is not yet unlinked from netns B as
unregister_netdevice_many() has not been called. ]
3. ->exit_batch_rtnl() finds the device B while iterating netns B's (i)
gtp and geneve calls ->dellink() at 2. and 3. that calls list_del() for (ii)
and unregister_netdevice_queue().
Calling unregister_netdevice_queue() twice is fine because it uses
list_move_tail(), but the 2nd list_del() triggers a splat when
CONFIG_DEBUG_LIST is enabled.
Possible solution is either of
(a) Use list_del_init() in ->dellink()
(b) Iterate dev with empty ->unreg_list for (i) like
#define for_each_netdev_alive(net, d) \
list_for_each_entry(d, &(net)->dev_base_head, dev_list) \
if (list_empty(&d->unreg_list))
(c) Remove (i) and delegate it to default_device_exit_batch().
This series avoids the 2nd ->dellink() by (c) to suppress the splat for
gtp and geneve.
Note that IPv4/IPv6 tunnels calls just unregister_netdevice() during
->exit_batch_rtnl() and dev is unlinked from (ii) later in ->ndo_uninit(),
so they are safe.
Also, pfcp has the same pattern but is safe because
unregister_netdevice_many() is called for each netns.
====================
Link: https://patch.msgid.link/20250217203705.40342-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As explained in the previous patch, iterating for_each_netdev() and
gn->geneve_list during ->exit_batch_rtnl() could trigger ->dellink()
twice for the same device.
If CONFIG_DEBUG_LIST is enabled, we will see a list_del() corruption
splat in the 2nd call of geneve_dellink().
Let's remove for_each_netdev() in geneve_destroy_tunnels() and delegate
that part to default_device_exit_batch().
Fixes: 9593172d93b9 ("geneve: Fix use-after-free in geneve_find_dev().")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250217203705.40342-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Brad Spengler reported the list_del() corruption splat in
gtp_net_exit_batch_rtnl(). [0]
Commit eb28fd76c0a0 ("gtp: Destroy device along with udp socket's netns
dismantle.") added the for_each_netdev() loop in gtp_net_exit_batch_rtnl()
to destroy devices in each netns as done in geneve and ip tunnels.
However, this could trigger ->dellink() twice for the same device during
->exit_batch_rtnl().
Say we have two netns A & B and gtp device B that resides in netns B but
whose UDP socket is in netns A.
1. cleanup_net() processes netns A and then B.
2. gtp_net_exit_batch_rtnl() finds the device B while iterating
netns A's gn->gtp_dev_list and calls ->dellink().
[ device B is not yet unlinked from netns B
as unregister_netdevice_many() has not been called. ]
3. gtp_net_exit_batch_rtnl() finds the device B while iterating
netns B's for_each_netdev() and calls ->dellink().
gtp_dellink() cleans up the device's hash table, unlinks the dev from
gn->gtp_dev_list, and calls unregister_netdevice_queue().
Basically, calling gtp_dellink() multiple times is fine unless
CONFIG_DEBUG_LIST is enabled.
Let's remove for_each_netdev() in gtp_net_exit_batch_rtnl() and
delegate the destruction to default_device_exit_batch() as done
in bareudp.
[0]:
list_del corruption, ffff8880aaa62c00->next (autoslab_size_M_dev_P_net_core_dev_11127_8_1328_8_S_4096_A_64_n_139+0xc00/0x1000 [slab object]) is LIST_POISON1 (ffffffffffffff02) (prev is 0xffffffffffffff04)
kernel BUG at lib/list_debug.c:58!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 UID: 0 PID: 1804 Comm: kworker/u8:7 Tainted: G T 6.12.13-grsec-full-20250211091339 #1
Tainted: [T]=RANDSTRUCT
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Workqueue: netns cleanup_net
RIP: 0010:[<ffffffff84947381>] __list_del_entry_valid_or_report+0x141/0x200 lib/list_debug.c:58
Code: c2 76 91 31 c0 e8 9f b1 f7 fc 0f 0b 4d 89 f0 48 c7 c1 02 ff ff ff 48 89 ea 48 89 ee 48 c7 c7 e0 c2 76 91 31 c0 e8 7f b1 f7 fc <0f> 0b 4d 89 e8 48 c7 c1 04 ff ff ff 48 89 ea 48 89 ee 48 c7 c7 60
RSP: 0018:fffffe8040b4fbd0 EFLAGS: 00010283
RAX: 00000000000000cc RBX: dffffc0000000000 RCX: ffffffff818c4054
RDX: ffffffff84947381 RSI: ffffffff818d1512 RDI: 0000000000000000
RBP: ffff8880aaa62c00 R08: 0000000000000001 R09: fffffbd008169f32
R10: fffffe8040b4f997 R11: 0000000000000001 R12: a1988d84f24943e4
R13: ffffffffffffff02 R14: ffffffffffffff04 R15: ffff8880aaa62c08
RBX: kasan shadow of 0x0
RCX: __wake_up_klogd.part.0+0x74/0xe0 kernel/printk/printk.c:4554
RDX: __list_del_entry_valid_or_report+0x141/0x200 lib/list_debug.c:58
RSI: vprintk+0x72/0x100 kernel/printk/printk_safe.c:71
RBP: autoslab_size_M_dev_P_net_core_dev_11127_8_1328_8_S_4096_A_64_n_139+0xc00/0x1000 [slab object]
RSP: process kstack fffffe8040b4fbd0+0x7bd0/0x8000 [kworker/u8:7+netns 1804 ]
R09: kasan shadow of process kstack fffffe8040b4f990+0x7990/0x8000 [kworker/u8:7+netns 1804 ]
R10: process kstack fffffe8040b4f997+0x7997/0x8000 [kworker/u8:7+netns 1804 ]
R15: autoslab_size_M_dev_P_net_core_dev_11127_8_1328_8_S_4096_A_64_n_139+0xc08/0x1000 [slab object]
FS: 0000000000000000(0000) GS:ffff888116000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000748f5372c000 CR3: 0000000015408000 CR4: 00000000003406f0 shadow CR4: 00000000003406f0
Stack:
0000000000000000 ffffffff8a0c35e7 ffffffff8a0c3603 ffff8880aaa62c00
ffff8880aaa62c00 0000000000000004 ffff88811145311c 0000000000000005
0000000000000001 ffff8880aaa62000 fffffe8040b4fd40 ffffffff8a0c360d
Call Trace:
<TASK>
[<ffffffff8a0c360d>] __list_del_entry_valid include/linux/list.h:131 [inline] fffffe8040b4fc28
[<ffffffff8a0c360d>] __list_del_entry include/linux/list.h:248 [inline] fffffe8040b4fc28
[<ffffffff8a0c360d>] list_del include/linux/list.h:262 [inline] fffffe8040b4fc28
[<ffffffff8a0c360d>] gtp_dellink+0x16d/0x360 drivers/net/gtp.c:1557 fffffe8040b4fc28
[<ffffffff8a0d0404>] gtp_net_exit_batch_rtnl+0x124/0x2c0 drivers/net/gtp.c:2495 fffffe8040b4fc88
[<ffffffff8e705b24>] cleanup_net+0x5a4/0xbe0 net/core/net_namespace.c:635 fffffe8040b4fcd0
[<ffffffff81754c97>] process_one_work+0xbd7/0x2160 kernel/workqueue.c:3326 fffffe8040b4fd88
[<ffffffff81757195>] process_scheduled_works kernel/workqueue.c:3407 [inline] fffffe8040b4fec0
[<ffffffff81757195>] worker_thread+0x6b5/0xfa0 kernel/workqueue.c:3488 fffffe8040b4fec0
[<ffffffff817782a0>] kthread+0x360/0x4c0 kernel/kthread.c:397 fffffe8040b4ff78
[<ffffffff814d8594>] ret_from_fork+0x74/0xe0 arch/x86/kernel/process.c:172 fffffe8040b4ffb8
[<ffffffff8110f509>] ret_from_fork_asm+0x29/0xc0 arch/x86/entry/entry_64.S:399 fffffe8040b4ffe8
</TASK>
Modules linked in:
Fixes: eb28fd76c0a0 ("gtp: Destroy device along with udp socket's netns dismantle.")
Reported-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250217203705.40342-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Ido Schimmel says:
====================
net: fib_rules: Add port mask support
In some deployments users would like to encode path information into
certain bits of the IPv6 flow label, the UDP source port and the DSCP
field and use this information to route packets accordingly.
Redirecting traffic to a routing table based on specific bits in the UDP
source port is not currently possible. Only exact match and range are
currently supported by FIB rules.
This patchset extends FIB rules to match on layer 4 ports with an
optional mask. The mask is not supported when matching on a range. A
future patchset will add support for matching on the DSCP field with an
optional mask.
Patches #1-#6 gradually extend FIB rules to match on layer 4 ports with
an optional mask.
Patches #7-#8 add test cases for FIB rule port matching.
iproute2 support can be found here [1].
[1] https://github.com/idosch/iproute2/tree/submit/fib_rule_mask_v1
====================
Link: https://patch.msgid.link/20250217134109.311176-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add tests for FIB rules that match on source and destination ports with
a mask. Test both good and bad flows.
# ./fib_rule_tests.sh
IPv6 FIB rule tests
[...]
TEST: rule6 check: sport and dport redirect to table [ OK ]
TEST: rule6 check: sport and dport no redirect to table [ OK ]
TEST: rule6 del by pref: sport and dport redirect to table [ OK ]
TEST: rule6 check: sport and dport range redirect to table [ OK ]
TEST: rule6 check: sport and dport range no redirect to table [ OK ]
TEST: rule6 del by pref: sport and dport range redirect to table [ OK ]
TEST: rule6 check: sport and dport masked redirect to table [ OK ]
TEST: rule6 check: sport and dport masked no redirect to table [ OK ]
TEST: rule6 del by pref: sport and dport masked redirect to table [ OK ]
[...]
Tests passed: 292
Tests failed: 0
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-9-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, only matching on specific ports is tested. Add port range
testing to make sure this use case does not regress.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-8-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add new port mask attributes to the spec. Example:
# ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/rt_rule.yaml \
--do newrule \
--json '{"family": 2, "sport-range": { "start": 12345, "end": 12345 }, "sport-mask": 65535, "action": 1, "table": 1}'
None
# ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/rt_rule.yaml \
--do newrule \
--json '{"family": 2, "dport-range": { "start": 54321, "end": 54321 }, "dport-mask": 65535, "action": 1, "table": 2}'
None
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/rt_rule.yaml \
--dump getrule --json '{"family": 2}' --output-json | jq '.[]'
[...]
{
"table": 2,
"suppress-prefixlen": "0xffffffff",
"protocol": 0,
"priority": 32764,
"dport-range": {
"start": 54321,
"end": 54321
},
"dport-mask": "0xffff",
"family": 2,
"dst-len": 0,
"src-len": 0,
"tos": 0,
"action": "to-tbl",
"flags": 0
}
{
"table": 1,
"suppress-prefixlen": "0xffffffff",
"protocol": 0,
"priority": 32765,
"sport-range": {
"start": 12345,
"end": 12345
},
"sport-mask": "0xffff",
"family": 2,
"dst-len": 0,
"src-len": 0,
"tos": 0,
"action": "to-tbl",
"flags": 0
}
[...]
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-7-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Allow user space to configure FIB rules that match on the source and
destination ports with a mask, now that support has been added to the
FIB rule core and the IPv4 and IPv6 address families.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-6-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Extend IPv6 FIB rules to match on source and destination ports using a
mask. Note that the mask is only set when not matching on a range.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-5-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Extend IPv4 FIB rules to match on source and destination ports using a
mask. Note that the mask is only set when not matching on a range.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-4-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add support for configuring and deleting rules that match on source and
destination ports using a mask as well as support for dumping such rules
to user space.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add attributes that allow matching on source and destination ports with
a mask. Matching on the source port with a mask is needed in deployments
where users encode path information into certain bits of the UDP source
port.
Temporarily set the type of the attributes to 'NLA_REJECT' while support
is being added.
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250217134109.311176-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
BCM53101 is a ethernet switch, very similar to the BCM53115.
Signed-off-by: Claus Stovgaard <claus.stovgaard@prevas.dk>
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Link: https://patch.msgid.link/20250217080503.1390282-2-claus.stovgaard@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
BCM53101 is a ethernet switch, very similar to the BCM53115.
Enable support for it, in the existing b53 dsa driver.
Signed-off-by: Torben Nielsen <torben.nielsen@prevas.dk>
Signed-off-by: Claus Stovgaard <claus.stovgaard@prevas.dk>
Link: https://patch.msgid.link/20250217080503.1390282-1-claus.stovgaard@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"18 hotfixes. 5 are cc:stable and the remainder address post-6.13
issues or aren't considered necessary for -stable kernels.
10 are for MM and 8 are for non-MM. All are singletons, please see the
changelogs for details"
* tag 'mm-hotfixes-stable-2025-02-19-17-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
test_xarray: fix failure in check_pause when CONFIG_XARRAY_MULTI is not defined
kasan: don't call find_vm_area() in a PREEMPT_RT kernel
MAINTAINERS: update Nick's contact info
selftests/mm: fix check for running THP tests
mm: hugetlb: avoid fallback for specific node allocation of 1G pages
memcg: avoid dead loop when setting memory.max
mailmap: update Nick's entry
mm: pgtable: fix incorrect reclaim of non-empty PTE pages
taskstats: modify taskstats version
getdelays: fix error format characters
mm/migrate_device: don't add folio to be freed to LRU in migrate_device_finalize()
tools/mm: fix build warnings with musl-libc
mailmap: add entry for Feng Tang
.mailmap: add entries for Jeff Johnson
mm,madvise,hugetlb: check for 0-length range after end address adjustment
mm/zswap: fix inconsistency when zswap_store_page() fails
lib/iov_iter: fix import_iovec_ubuf iovec management
procfs: fix a locking bug in a vmcore_add_device_dump() error path
|
|
We don't want to be holding the srcu lock while waiting on btree write
completions - easily fixed.
Reported-by: Janpieter Sollie <janpieter.sollie@edpnet.be>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We had some error handling confusion here;
-BCH_ERR_missing_indirect_extent is thrown by
trans_trigger_reflink_p_segment(); at this point we haven't decide
whether we're generating an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Error handling was wrong, causing unhandled transaction restart errors.
check_directory_size() was also inefficient, since keys in multiple
snapshots would be iterated over once for every snapshot. Convert it to
the same scheme used for i_sectors and subdir count checking.
Cc: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|