Age | Commit message | Author |
|
An optimization for promiscuous mode adds a high-priority steering
table with a single catch-all rule to steer all traffic directly to
the TTC table.
However, a gap exists between the creation of this table and the
insertion of the catch-all rule. Packets arriving in this brief window
miss, as no rule has been inserted yet; they unnecessarily increment
the 'rx_steer_missed_packets' counter and are dropped.
This patch resolves the issue by introducing a new prio for this
table, placing it between MLX5E_TC_PRIO and MLX5E_NIC_PRIO. By doing
so, packets arriving during the window now fall through to the next
prio (at MLX5E_NIC_PRIO) instead of being dropped.
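A minimal sketch of the resulting prio ordering; the enum shape and
the new prio's name are assumptions, only MLX5E_TC_PRIO and
MLX5E_NIC_PRIO come from this message:

  enum {
          MLX5E_TC_PRIO = 0,
          MLX5E_PROMISC_PRIO,     /* assumed name: catch-all table */
          MLX5E_NIC_PRIO,         /* traffic falls through here while
                                   * the catch-all rule is missing */
  };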
Fixes: 1c46d7409f30 ("net/mlx5e: Optimize promiscuous mode")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/1752155624-24095-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There's a race between disabling DIM and NAPI callbacks using the dim
pointer on the RQ or SQ.
If NAPI checks the DIM state bit and sees it still set, it assumes
`rq->dim` or `sq->dim` is valid. But if DIM gets disabled right after
that check, the pointer might already be set to NULL, leading to a NULL
pointer dereference in net_dim().
Fix this by calling `synchronize_net()` before freeing the DIM context.
This ensures all in-progress NAPI callbacks are finished before the
pointer is cleared.
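A sketch of the resulting teardown order; the function and bit names
are assumed, only synchronize_net() and the dim pointer come from this
message:

  static void mlx5e_dim_disable(struct mlx5e_rq *rq)
  {
          clear_bit(MLX5E_RQ_STATE_DIM, &rq->state); /* assumed bit */
          synchronize_net();      /* wait for in-flight NAPI polls */
          kvfree(rq->dim);        /* no NAPI callback can see it now */
          rq->dim = NULL;
  }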
Kernel log:
BUG: kernel NULL pointer dereference, address: 0000000000000000
...
RIP: 0010:net_dim+0x23/0x190
...
Call Trace:
<TASK>
? __die+0x20/0x60
? page_fault_oops+0x150/0x3e0
? common_interrupt+0xf/0xa0
? sysvec_call_function_single+0xb/0x90
? exc_page_fault+0x74/0x130
? asm_exc_page_fault+0x22/0x30
? net_dim+0x23/0x190
? mlx5e_poll_ico_cq+0x41/0x6f0 [mlx5_core]
? sysvec_apic_timer_interrupt+0xb/0x90
mlx5e_handle_rx_dim+0x92/0xd0 [mlx5_core]
mlx5e_napi_poll+0x2cd/0xac0 [mlx5_core]
? mlx5e_poll_ico_cq+0xe5/0x6f0 [mlx5_core]
busy_poll_stop+0xa2/0x200
? mlx5e_napi_poll+0x1d9/0xac0 [mlx5_core]
? mlx5e_trigger_irq+0x130/0x130 [mlx5_core]
__napi_busy_loop+0x345/0x3b0
? sysvec_call_function_single+0xb/0x90
? asm_sysvec_call_function_single+0x16/0x20
? sysvec_apic_timer_interrupt+0xb/0x90
? pcpu_free_area+0x1e4/0x2e0
napi_busy_loop+0x11/0x20
xsk_recvmsg+0x10c/0x130
sock_recvmsg+0x44/0x70
__sys_recvfrom+0xbc/0x130
? __schedule+0x398/0x890
__x64_sys_recvfrom+0x20/0x30
do_syscall_64+0x4c/0x100
entry_SYSCALL_64_after_hwframe+0x4b/0x53
...
---[ end trace 0000000000000000 ]---
...
---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
Fixes: 445a25f6e1a2 ("net/mlx5e: Support updating coalescing configuration without resetting channels")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/1752155624-24095-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When changing a node's parent, its scheduling element is destroyed and
re-created with bw_share 0. However, the node's bw_share field was not
updated accordingly.
Set the node's bw_share to 0 after re-creation to keep the software
state in sync with the firmware configuration.
Fixes: 9c7bbf4c3304 ("net/mlx5: Add support for setting parent of nodes")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/1752155624-24095-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The message "Error getting PHY irq. Use polling instead"
is emitted when the mlxbf_gige driver is loaded by the
kernel before the associated gpio-mlxbf driver, and thus
the call to get the PHY IRQ fails since it is not yet
available. The driver probe() must return -EPROBE_DEFER
if acpi_dev_gpio_irq_get_by() returns the same.
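A sketch of that probe handling; acpi_dev_gpio_irq_get_by() is named
in this message, while the variable names, the "phy" connection id and
the PHY_POLL fallback are assumptions:

  phy_irq = acpi_dev_gpio_irq_get_by(ACPI_COMPANION(dev), "phy", 0);
  if (phy_irq == -EPROBE_DEFER)
          return -EPROBE_DEFER;   /* retry after gpio-mlxbf probes */
  if (phy_irq < 0)
          phy_irq = PHY_POLL;     /* fall back to polling */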
Fixes: 6c2a6ddca763 ("net: mellanox: mlxbf_gige: Replace non-standard interrupt handling")
Signed-off-by: David Thompson <davthompson@nvidia.com>
Reviewed-by: Asmaa Mnebhi <asmaa@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250618135902.346-1-davthompson@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth and wireless.
Current release - regressions:
- af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD
Current release - new code bugs:
- eth: airoha: correct enable mask for RX queues 16-31
- veth: prevent NULL pointer dereference in veth_xdp_rcv when peer
disappears under traffic
- ipv6: move fib6_config_validate() to ip6_route_add(), prevent
invalid routes
Previous releases - regressions:
- phy: phy_caps: don't skip better duplex match on non-exact match
- dsa: b53: fix untagged traffic sent via cpu tagged with VID 0
- Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused
transient packet loss, exact reason not fully understood, yet
Previous releases - always broken:
- net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6)
- sched: sfq: fix a potential crash on gso_skb handling
- Bluetooth: intel: improve rx buffer posting to avoid causing issues
in the firmware
- eth: intel: i40e: make reset handling robust against multiple
requests
- eth: mlx5: ensure FW pages are always allocated on the local NUMA
node, even when the device is configured to 'serve' another node
- wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850,
prevent kernel crashes
- wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request()
for 3 sec if fw_stats_done is not set"
* tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context
net: ethtool: Don't check if RSS context exists in case of context 0
af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.
ipv6: Move fib6_config_validate() to ip6_route_add().
net: drv: netdevsim: don't napi_complete() from netpoll
net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get()
veth: prevent NULL pointer dereference in veth_xdp_rcv
net_sched: remove qdisc_tree_flush_backlog()
net_sched: ets: fix a race in ets_qdisc_change()
net_sched: tbf: fix a race in tbf_change()
net_sched: red: fix a race in __red_change()
net_sched: prio: fix a race in prio_tune()
net_sched: sch_sfq: reject invalid perturb period
net: phy: phy_caps: Don't skip better duplex macth on non-exact match
MAINTAINERS: Update Kuniyuki Iwashima's email address.
selftests: net: add test case for NAT46 looping back dst
net: clear the dst when changing skb protocol
net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper
net/mlx5e: Fix leak of Geneve TLV option object
net/mlx5: HWS, make sure the uplink is the last destination
...
|
|
Check whether ida_alloc() or rhashtable_lookup_get_insert_fast() fails.
Fixes: 17e0accac577 ("net/mlx5: HWS, support complex matchers")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Link: https://patch.msgid.link/aEmBONjyiF6z5yCV@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When the link is up, either eth_proto_oper or ext_eth_proto_oper
typically reports the active link protocol, from which both speed
and number of lanes can be retrieved. However, in certain cases,
such as when a NIC is connected via a non-standard cable, the
firmware may not report the protocol.
In such scenarios, the speed can still be obtained from the
data_rate_oper field in PTYS register. Since data_rate_oper
provides only speed information and lacks lane details, it is
incorrect to derive the number of lanes from it.
This patch corrects the behavior by setting the number of lanes to
UNKNOWN instead of incorrectly using MAX_LANES when relying on
data_rate_oper.
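A sketch with assumed names; only data_rate_oper and MAX_LANES come
from this message:

  if (!proto_oper && !ext_proto_oper) {
          speed = speed_from_data_rate(data_rate_oper);
          lanes = LANES_UNKNOWN;  /* was MAX_LANES; data_rate_oper
                                   * carries no lane information */
  }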
Fixes: 7e959797f021 ("net/mlx5e: Enable lanes configuration when auto-negotiation is off")
Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250610151514.1094735-10-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Previously, a unique tunnel id was added for matching on TC non-zero
chains, to support inner header rewrite with the goto action. Later,
it was used to support VF tunnel offload for vxlan, then for Geneve
and GRE. To support VF tunnel, a temporary mlx5_flow_spec is used to
parse tunnel options. For Geneve, if there is a TLV option, an object
is created, or its refcount is incremented if it already exists. But
the temporary mlx5_flow_spec is freed right after parsing, which
causes the leak, because no information about the object is saved in
the flow's mlx5_flow_spec, which is what is used to free the object
when the flow is deleted.
To fix the leak, call mlx5_geneve_tlv_option_del() before freeing the
temporary spec if it holds a TLV object.
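A sketch of the cleanup; mlx5_geneve_tlv_option_del() is named above,
the surrounding names are assumed:

  /* drop the refcount that parsing took on the TLV option object
   * before the temporary spec is freed */
  if (tmp_spec_has_tlv_option)
          mlx5_geneve_tlv_option_del(priv->mdev->geneve);
  kvfree(tmp_spec);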
Fixes: 521933cdc4aa ("net/mlx5e: Support Geneve and GRE with VF tunnel offload")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Alex Lazar <alazar@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250610151514.1094735-9-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When there is more than one destination, we create a FW flow table
and provide it with all the destinations. FW requires the wire
destination, if present, to be the last in the list; otherwise the
operation fails with a FW syndrome.
This patch fixes the destination array action creation: if the array
contains a wire destination, it is moved to the end.
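A standalone sketch of the reordering; the types and the wire check
are made up for illustration:

  static void move_wire_dest_last(struct dest *dests, int num)
  {
          for (int i = 0; i < num - 1; i++) {
                  if (dest_is_wire(&dests[i])) {
                          struct dest tmp = dests[i];

                          /* shift the tail left, put wire at the end */
                          memmove(&dests[i], &dests[i + 1],
                                  (num - i - 1) * sizeof(*dests));
                          dests[num - 1] = tmp;
                          break;
                  }
          }
  }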
Fixes: 504e536d9010 ("net/mlx5: HWS, added actions handling")
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250610151514.1094735-7-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix missing field handling in definer - outer IP version.
Fixes: 74a778b4a63f ("net/mlx5: HWS, added definers handling")
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250610151514.1094735-6-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The newly introduced mutex is only used for reformat actions, but it was
initialized for modify header instead.
The struct that contains the mutex is zero-initialized and an all-zero
mutex is valid, so the issue only shows up with CONFIG_DEBUG_MUTEXES.
Fixes: b206d9ec19df ("net/mlx5: HWS, register reformat actions with fw")
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250610151514.1094735-5-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When attempting to add a rule to an existing flow group, if a matching
flow group exists but is not active, the error code returned should be
EAGAIN, so that the rule can be added once the matching flow group
becomes active, rather than ENOENT, which indicates that no matching
flow group was found.
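A sketch of the distinction; EAGAIN and ENOENT come from this message,
the rest is assumed:

  fg = lookup_matching_flow_group(ft, spec);
  if (!fg)
          return ERR_PTR(-ENOENT);  /* truly no matching group */
  if (!fg->active)
          return ERR_PTR(-EAGAIN);  /* exists; retry once active */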
Fixes: bd71b08ec2ee ("net/mlx5: Support multiple updates of steering rules in parallel")
Signed-off-by: Gavi Teitz <gavi@nvidia.com>
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250610151514.1094735-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix a shutdown flow UAF when a virtual function is created on the
embedded chip (ECVF) of a BlueField device. In such a case the vport
ACL ingress table is not properly destroyed.
ECVF functionality is independent of the ecpf_vport_exists capability,
and thus the functions mlx5_eswitch_(enable|disable)_pf_vf_vports()
should not test it when enabling/disabling ECVF vports.
kernel log:
[] refcount_t: underflow; use-after-free.
[] WARNING: CPU: 3 PID: 1 at lib/refcount.c:28
refcount_warn_saturate+0x124/0x220
----------------
[] Call trace:
[] refcount_warn_saturate+0x124/0x220
[] tree_put_node+0x164/0x1e0 [mlx5_core]
[] mlx5_destroy_flow_table+0x98/0x2c0 [mlx5_core]
[] esw_acl_ingress_table_destroy+0x28/0x40 [mlx5_core]
[] esw_acl_ingress_lgcy_cleanup+0x80/0xf4 [mlx5_core]
[] esw_legacy_vport_acl_cleanup+0x44/0x60 [mlx5_core]
[] esw_vport_cleanup+0x64/0x90 [mlx5_core]
[] mlx5_esw_vport_disable+0xc0/0x1d0 [mlx5_core]
[] mlx5_eswitch_unload_ec_vf_vports+0xcc/0x150 [mlx5_core]
[] mlx5_eswitch_disable_sriov+0x198/0x2a0 [mlx5_core]
[] mlx5_device_disable_sriov+0xb8/0x1e0 [mlx5_core]
[] mlx5_sriov_detach+0x40/0x50 [mlx5_core]
[] mlx5_unload+0x40/0xc4 [mlx5_core]
[] mlx5_unload_one_devl_locked+0x6c/0xe4 [mlx5_core]
[] mlx5_unload_one+0x3c/0x60 [mlx5_core]
[] shutdown+0x7c/0xa4 [mlx5_core]
[] pci_device_shutdown+0x3c/0xa0
[] device_shutdown+0x170/0x340
[] __do_sys_reboot+0x1f4/0x2a0
[] __arm64_sys_reboot+0x2c/0x40
[] invoke_syscall+0x78/0x100
[] el0_svc_common.constprop.0+0x54/0x184
[] do_el0_svc+0x30/0xac
[] el0_svc+0x48/0x160
[] el0t_64_sync_handler+0xa4/0x12c
[] el0t_64_sync+0x1a4/0x1a8
[] --[ end trace 9c4601d68c70030e ]---
Fixes: a7719b29a821 ("net/mlx5: Add management of EC VF vports")
Reviewed-by: Daniel Jurgens <danielj@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250610151514.1094735-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When firmware asks the driver to allocate more pages, using the
give_pages event, the driver should always allocate them from the same
NUMA node, the original device NUMA node. The current code uses
dev_to_node(), which can point to a different NUMA node, as it is
changed by other driver flows, such as
mlx5_dma_zalloc_coherent_node(). Instead, use the saved NUMA node for
allocating firmware pages.
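A sketch of the change; the saved-node field is assumed, while
dev_to_node() and alloc_pages_node() are standard kernel APIs:

  /* before: the node may have been changed by other flows */
  page = alloc_pages_node(dev_to_node(device), gfp, 0);
  /* after: always the NUMA node saved at probe time */
  page = alloc_pages_node(dev->priv.numa_node, gfp, 0);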
Fixes: 311c7c71c9bb ("net/mlx5e: Allocate DMA coherent memory on reader NUMA node")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250610151514.1094735-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
|
|
The "freq" variable is in terms of MHz and "max_val_cycles" is in terms
of Hz. The fact that "max_val_cycles" is a u64 suggests that support
for high frequency is intended but the "freq_khz * 1000" would overflow
the u32 type if we went above 4GHz. Use unsigned long long type for the
mutliplication to prevent that.
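The overflow and its fix, illustrated standalone (the value is made
up):

  u32 freq_khz = 4100000;               /* e.g. a 4.1 GHz clock */
  u64 max_val_cycles;

  max_val_cycles = freq_khz * 1000;     /* u32 math: wraps past ~4.29e9 */
  max_val_cycles = freq_khz * 1000ULL;  /* 64-bit math: correct */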
Fixes: 31c128b66e5b ("net/mlx4_en: Choose time-stamping shift value according to HW frequency")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/aDbFHe19juIJKjsb@stanley.mountain
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Merge in late fixes to prepare for the 6.16 net-next PR.
No conflicts nor adjacent changes.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
'struct thermal_zone_device_ops' are not modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
While at it, also constify a struct thermal_zone_params.
On a x86_64, with allmodconfig:
Before:
======
text data bss dec hex filename
24899 8036 0 32935 80a7 drivers/net/ethernet/mellanox/mlxsw/core_thermal.o
After:
=====
text data bss dec hex filename
25379 7556 0 32935 80a7 drivers/net/ethernet/mellanox/mlxsw/core_thermal.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/4516676973f5adc1cdb76db1691c0f98b6fa6614.1748164348.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This was intended to be negative (-ENOMEM) but the '-' character was
accidentally left off. This typo doesn't affect runtime because the
caller treats all non-zero returns the same.
Fixes: 17e0accac577 ("net/mlx5: HWS, support complex matchers")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/aDCbjNcquNC68Hyj@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The function mlx5_query_nic_vport_node_guid() calls the function
mlx5_query_nic_vport_context() but does not check its return value.
A proper implementation can be found in mlx5_nic_vport_query_local_lb().
Add error handling for mlx5_query_nic_vport_context(). If it fails,
free the out buffer via kvfree() and return the error code.
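A sketch of the described handling (the argument list is abbreviated):

  err = mlx5_query_nic_vport_context(mdev, vport, out);
  if (err) {
          kvfree(out);    /* free the out buffer on failure */
          return err;
  }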
Fixes: 9efa75254593 ("net/mlx5_core: Introduce access functions to query vport RoCE fields")
Cc: stable@vger.kernel.org # v4.5
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250524163425.1695-1-vulab@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A representor netdev does not correspond to real hardware that needs to
be updated when setting the MAC address. The default eth_mac_addr() is
sufficient for simply updating the netdev's MAC address with validation.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1747898036-1121904-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The function mlx5_query_nic_vport_qkey_viol_cntr() calls the function
mlx5_query_nic_vport_context() but does not check its return value. This
could lead to undefined behavior if the query fails. A proper
implementation can be found in mlx5_nic_vport_query_local_lb().
Add error handling for mlx5_query_nic_vport_context(). If it fails,
free the out buffer via kvfree() and return the error code.
Fixes: 9efa75254593 ("net/mlx5_core: Introduce access functions to query vport RoCE fields")
Cc: stable@vger.kernel.org # v4.5
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250521133620.912-1-vulab@iscas.ac.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
1) Remove some unnecessary strscpy_pad() size arguments.
From Thorsten Blum.
2) Correct use of xso.real_dev on bonding offloads.
Patchset from Cosmin Ratiu.
3) Add hardware offload configuration to XFRM_MSG_MIGRATE.
From Chiachang Wang.
4) Refactor migration setup during cloning. This was
done after the clone was created. Now it is done
in the cloning function itself.
From Chiachang Wang.
5) Validate assignment of the maximal possible SEQ number.
Prevent setting it to the maximum sequence number,
as this would cause traffic drops.
From Leon Romanovsky.
6) Prevent configuration of interface index when offload
is used. Hardware can't handle this case.
From Leon Romanovsky.
7) Always use kfree_sensitive() for SA secret zeroization.
From Zilin Guan.
ipsec-next-2025-05-23
* tag 'ipsec-next-2025-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: use kfree_sensitive() for SA secret zeroization
xfrm: prevent configuration of interface index when offload is used
xfrm: validate assignment of maximal possible SEQ number
xfrm: Refactor migration setup during the cloning process
xfrm: Migrate offload configuration
bonding: Fix multiple long standing offload races
bonding: Mark active offloaded xfrm_states
xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
xfrm: Remove unneeded device check from validate_xmit_xfrm
xfrm: Use xdo.dev instead of xdo.real_dev
net/mlx5: Avoid using xso.real_dev unnecessarily
xfrm: Remove unnecessary strscpy_pad() size arguments
====================
Link: https://patch.msgid.link/20250523075611.3723340-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This patch converts mlx5 to use the new netdev instance lock in addition
to the pre-existing state_lock (and the RTNL).
mlx5e_priv.state_lock was already used throughout mlx5 to protect
against concurrent state modifications on the same netdev, usually in
addition to the RTNL. The new netdev instance lock will eventually
replace it, but for now, it is acquired in addition to the existing
locks in the order RTNL -> instance lock -> state_lock.
All three netdev types handled by mlx5 are converted to the new style of
locking, because they share a lot of code related to initializing
channels and dealing with NAPI, so it's better to convert all three
rather than introduce different assumptions deep in the call stack
depending on the type of device.
Because of the nature of the call graphs in mlx5, it wasn't possible to
incrementally convert parts of the driver to use the new lock, since
either all call paths into NAPI have to possess the new lock if the
*_locked variants are used, or none of them can have the lock.
One area which required extra care is the interaction between closing
channels and devlink health reporter tasks.
Previously, the recovery tasks were unconditionally acquiring the
RTNL, which could lead to deadlocks in these scenarios:
T1: mlx5e_close (== .ndo_stop(), has RTNL) -> mlx5e_close_locked
-> mlx5e_close_channels -> mlx5e_ptp_close
-> mlx5e_ptp_close_queues -> mlx5e_ptp_close_txqsqs
-> mlx5e_ptp_close_txqsq
-> cancel_work_sync(&ptpsq->report_unhealthy_work) waits for
T2: mlx5e_ptpsq_unhealthy_work -> mlx5e_reporter_tx_ptpsq_unhealthy
-> mlx5e_health_report -> devlink_health_report
-> devlink_health_reporter_recover
-> mlx5e_tx_reporter_ptpsq_unhealthy_recover which does:
rtnl_lock(); => Deadlock.
Another similar instance of this is:
T1: mlx5e_close (== .ndo_stop(), has RTNL) -> mlx5e_close_locked
-> mlx5e_close_channels -> mlx5e_ptp_close
-> mlx5e_ptp_close_queues -> mlx5e_ptp_close_txqsqs
-> mlx5e_ptp_close_txqsq
-> cancel_work_sync(&sq->recover_work) waits for
T2: mlx5e_tx_err_cqe_work -> mlx5e_reporter_tx_err_cqe
-> mlx5e_health_report -> devlink_health_report
-> devlink_health_reporter_recover
-> mlx5e_tx_reporter_err_cqe_recover which does:
rtnl_lock(); => Another deadlock.
Fix that by using the same pattern previously used in
mlx5e_tx_timeout_work, where the driver repeatedly tries to acquire
the RTNL until either:
a) it is successfully acquired, or
b) there's no need for the work to be done any more (the channel is
being closed).
Now, for all three recovery tasks, the driver repeatedly tries to
acquire the instance lock until it succeeds or the channel/SQ is
closed.
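A minimal sketch of the retry pattern, modeled on the older RTNL
variant in mlx5e_tx_timeout_work; the instance-lock version swaps in
the netdev instance trylock:

  while (!rtnl_trylock()) {
          if (!test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &priv->state))
                  return;         /* channel is closing; nothing to do */
          msleep(20);
  }
  /* ... perform the recovery while holding the lock ... */
  rtnl_unlock();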
As a side effect, drop the !test_bit(MLX5E_STATE_OPENED, &priv->state)
check from mlx5e_tx_timeout_work; it's weaker than
!test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &priv->state) and unnecessary.
Future patches will introduce new call paths (from netdev queue
management ops) which can close channels (and call cancel_work_sync on
the recovery tasks) without the RTNL lock and only with the netdev
instance lock.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747829342-1018757-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There's no explanation in the original commit of why that was done, but
presumably flashing takes a long time and holding RTNL for so long
blocks other interactions with the netdev layer.
However, the stack is moving towards netdev instance locking and
dropping and reacquiring RTNL in the context of flashing introduces
locking ordering issues: RTNL must be acquired before the netdev
instance lock and released after it.
This patch therefore takes the simpler approach of no longer dropping
and reacquiring the RTNL; RTNL for ethtool will soon be removed,
leaving only the instance lock to protect against races.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747829342-1018757-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Having adjacent accelerated modify header actions (so-called
pattern-argument actions) may result in an inconsistent outcome.
These inconsistencies can take the form of two writes to the same
field, or a read coupled with a write to the same field. The solution
is to detect such dependencies and insert nops between the offending
actions.
The existing implementation had a few issues, which pretty much
required a complete rewrite of the code that handles these
dependencies.
In the new implementation we're doing the following:
* Checking any two adjacent actions for conflicts (not just
odd-even pairs).
* Marking 'set' and 'add' action fields as destination, rather
than source, for the purposes of checking for conflicts.
* Checking all types of actions ('add', 'set', 'copy') for
dependencies.
* Managing offsets of the args in the buffer - copy the action
args to the right place in the buffer.
* Checking that after inserting nops we're still within the number
of supported actions - return an error otherwise.
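A standalone sketch of the adjacent-action conflict check from the
first bullets; the types and field layout are made up for
illustration:

  enum mh_op { MH_SET, MH_ADD, MH_COPY };

  struct mh_action {
          enum mh_op op;
          unsigned int dst_field; /* written by set/add/copy */
          unsigned int src_field; /* read by copy only */
  };

  /* true if a nop must be inserted between two adjacent actions */
  static bool needs_nop(const struct mh_action *prev,
                        const struct mh_action *next)
  {
          /* write-after-write on the same field */
          if (prev->dst_field == next->dst_field)
                  return true;
          /* read-after-write: copy reads what prev wrote */
          if (next->op == MH_COPY && next->src_field == prev->dst_field)
                  return true;
          return false;
  }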
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747766802-958178-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Fix typo - rename 'nope_locations' to 'nop_locations', which describes
the locations of 'nop' actions. To shorten the lines, this renaming
also required some refactoring.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747766802-958178-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Hardware steering handles actions differently from firmware, but for
termination rules that use encapsulation the firmware needs to be aware
of the action.
Fix this by registering reformat actions with the firmware the first
time this is needed. To do this, add a third possible owner for an
action, and also a lock to protect against registration of the same
action from different threads.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747766802-958178-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The firmware reformat id is a u32 and can't safely be returned as an
int. Because the functions also need a way to signal error, prefer to
return the id as an output parameter and keep the return code only for
success/error.
While we're at it, also extract some duplicate code to fetch the
reformat id from a more generic struct pkt_reformat.
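A sketch of the signature change (names assumed):

  /* before: the u32 id and errors share one int return */
  int create_reformat(struct mlx5hws_context *ctx);
  /* after: id via output parameter, return code only for errors */
  int create_reformat(struct mlx5hws_context *ctx, u32 *reformat_id);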
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1747766802-958178-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by
zero-initializing all stack variables on function entry. The mlx5 XDP
RX path previously allocated a struct mlx5e_xdp_buff on the stack per
received CQE, resulting in measurable performance degradation under
this config.
This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct,
avoiding per-CQE stack allocations and repeated zeroing.
With this change, XDP_DROP and XDP_TX performance matches that of
kernels built without CONFIG_INIT_STACK_ALL_ZERO.
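A sketch of the reuse; mlx5e_xdp_buff and mlx5e_rq are named in this
message, the field name and surrounding shape are assumed:

  struct mlx5e_rq {
          /* ... */
          struct mlx5e_xdp_buff mxbuf;    /* reused for every CQE */
  };

  /* in the RX handler: borrow the per-RQ buff instead of declaring
   * one on the stack, avoiding the per-CQE zero-initialization */
  struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;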
Performance was measured on a ConnectX-6Dx using a single RX channel
(1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from
net-next-6.15.
Stack zeroing disabled:
- XDP_DROP:
* baseline: 31.47 Mpps
* baseline + per-RQ allocation: 32.31 Mpps (+2.68%)
- XDP_TX:
* baseline: 12.41 Mpps
* baseline + per-RQ allocation: 12.95 Mpps (+4.30%)
Stack zeroing enabled:
- XDP_DROP:
* baseline: 24.32 Mpps
* baseline + per-RQ allocation: 32.27 Mpps (+32.7%)
- XDP_TX:
* baseline: 11.80 Mpps
* baseline + per-RQ allocation: 12.24 Mpps (+3.72%)
Reported-by: Sebastiano Miano <mianosebastiano@gmail.com>
Reported-by: Samuel Dobron <sdobron@redhat.com>
Link: https://lore.kernel.org/all/CAMENy5pb8ea+piKLg5q5yRTMZacQqYWAoVLE1FE9WhQPq92E0g@mail.gmail.com/
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/1747253032-663457-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.15-rc7).
Conflicts:
tools/testing/selftests/drivers/net/hw/ncdevmem.c
97c4e094a4b2 ("tests/ncdevmem: Fix double-free of queue array")
2f1a805f32ba ("selftests: ncdevmem: Implement devmem TCP TX")
https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au
Adjacent changes:
net/core/devmem.c
net/core/devmem.h
0afc44d8cdf6 ("net: devmem: fix kernel panic when netlink socket close after module unload")
bd61848900bf ("net: devmem: Implement TX path")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use to_delayed_work() instead of open-coding it.
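For reference, the helper and the open-coded form it replaces:

  struct delayed_work *dwork = to_delayed_work(work);
  /* instead of: container_of(work, struct delayed_work, work) */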
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Acked-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250514072419.2707578-1-nichen@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The driver only offloads neighbors that are constructed on top of net
devices registered by it or their uppers (which are all Ethernet). The
device supports GRE encapsulation and decapsulation of forwarded
traffic, but the driver will not offload dummy neighbors constructed on
top of GRE net devices as they are not uppers of its net devices:
# ip link add name gre1 up type gre tos inherit local 192.0.2.1 remote 198.51.100.1
# ip neigh add 0.0.0.0 lladdr 0.0.0.0 nud noarp dev gre1
$ ip neigh show dev gre1 nud noarp
0.0.0.0 lladdr 0.0.0.0 NOARP
(Note that the neighbor is not marked with 'offload')
When the driver is reloaded and the existing configuration is replayed,
the driver does not perform the same check regarding existing neighbors
and offloads the previously added one:
# devlink dev reload pci/0000:01:00.0
$ ip neigh show dev gre1 nud noarp
0.0.0.0 lladdr 0.0.0.0 offload NOARP
If the neighbor is later deleted, the driver will ignore the
notification (given the GRE net device is not its upper) and will
therefore keep referencing freed memory, resulting in a use-after-free
[1] when the net device is deleted:
# ip neigh del 0.0.0.0 lladdr 0.0.0.0 dev gre1
# ip link del dev gre1
Fix by skipping neighbor replay if the net device for which the replay
is performed is not our upper.
[1]
BUG: KASAN: slab-use-after-free in mlxsw_sp_neigh_entry_update+0x1ea/0x200
Read of size 8 at addr ffff888155b0e420 by task ip/2282
[...]
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xa0
print_address_description.constprop.0+0x6f/0x350
print_report+0x108/0x205
kasan_report+0xdf/0x110
mlxsw_sp_neigh_entry_update+0x1ea/0x200
mlxsw_sp_router_rif_gone_sync+0x2a8/0x440
mlxsw_sp_rif_destroy+0x1e9/0x750
mlxsw_sp_netdevice_ipip_ol_event+0x3c9/0xdc0
mlxsw_sp_router_netdevice_event+0x3ac/0x15e0
notifier_call_chain+0xca/0x150
call_netdevice_notifiers_info+0x7f/0x100
unregister_netdevice_many_notify+0xc8c/0x1d90
rtnl_dellink+0x34e/0xa50
rtnetlink_rcv_msg+0x6fb/0xb70
netlink_rcv_skb+0x131/0x360
netlink_unicast+0x426/0x710
netlink_sendmsg+0x75a/0xc20
__sock_sendmsg+0xc1/0x150
____sys_sendmsg+0x5aa/0x7b0
___sys_sendmsg+0xfc/0x180
__sys_sendmsg+0x121/0x1b0
do_syscall_64+0xbb/0x1d0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: 8fdb09a7674c ("mlxsw: spectrum_router: Replay neighbours when RIF is made")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/c53c02c904fde32dad484657be3b1477884e9ad6.1747225701.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
New timestamping API was introduced in commit 66f7223039c0 ("net: add
NDOs for configuring hardware timestamping") from kernel v6.6. It is
time to convert the mlxsw driver to the new API, so that the
ndo_eth_ioctl() path can be removed completely.
The UAPI is still ioctl-only, but it's best to remove the "ioctl"
mentions from the driver in case a netlink variant appears.
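A sketch of the conversion; the NDO names come from the cited commit,
the handler names are assumed:

  static const struct net_device_ops mlxsw_sp_port_netdev_ops = {
          /* ... */
          .ndo_hwtstamp_get = mlxsw_sp_port_hwtstamp_get,
          .ndo_hwtstamp_set = mlxsw_sp_port_hwtstamp_set,
  };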
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20250512154411.848614-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
MACsec offload is not supported in switchdev mode for uplink
representors. When switching to the uplink representor profile, the
MACsec offload feature must be cleared from the netdevice's features.
If left enabled, attempts to add offloads result in a null pointer
dereference, as the uplink representor does not support MACsec offload
even though the feature bit remains set.
Clear NETIF_F_HW_MACSEC in mlx5e_fix_uplink_rep_features().
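A sketch of that one-line fix inside the function named above (its
surrounding shape is assumed):

  static netdev_features_t
  mlx5e_fix_uplink_rep_features(struct net_device *netdev,
                                netdev_features_t features)
  {
          /* MACsec offload is unsupported on uplink representors */
          features &= ~NETIF_F_HW_MACSEC;
          /* ... */
          return features;
  }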
Kernel log:
Oops: general protection fault, probably for non-canonical address 0xdffffc000000000f: 0000 [#1] SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000078-0x000000000000007f]
CPU: 29 UID: 0 PID: 4714 Comm: ip Not tainted 6.14.0-rc4_for_upstream_debug_2025_03_02_17_35 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__mutex_lock+0x128/0x1dd0
Code: d0 7c 08 84 d2 0f 85 ad 15 00 00 8b 35 91 5c fe 03 85 f6 75 29 49 8d 7e 60 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 a6 15 00 00 4d 3b 76 60 0f 85 fd 0b 00 00 65 ff
RSP: 0018:ffff888147a4f160 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000001
RDX: 000000000000000f RSI: 0000000000000000 RDI: 0000000000000078
RBP: ffff888147a4f2e0 R08: ffffffffa05d2c19 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
R13: dffffc0000000000 R14: 0000000000000018 R15: ffff888152de0000
FS: 00007f855e27d800(0000) GS:ffff88881ee80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004e5768 CR3: 000000013ae7c005 CR4: 0000000000372eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Call Trace:
<TASK>
? die_addr+0x3d/0xa0
? exc_general_protection+0x144/0x220
? asm_exc_general_protection+0x22/0x30
? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core]
? __mutex_lock+0x128/0x1dd0
? lockdep_set_lock_cmp_fn+0x190/0x190
? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core]
? mutex_lock_io_nested+0x1ae0/0x1ae0
? lock_acquire+0x1c2/0x530
? macsec_upd_offload+0x145/0x380
? lockdep_hardirqs_on_prepare+0x400/0x400
? kasan_save_stack+0x30/0x40
? kasan_save_stack+0x20/0x40
? kasan_save_track+0x10/0x30
? __kasan_kmalloc+0x77/0x90
? __kmalloc_noprof+0x249/0x6b0
? genl_family_rcv_msg_attrs_parse.constprop.0+0xb5/0x240
? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core]
mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core]
? mlx5e_macsec_add_rxsa+0x11a0/0x11a0 [mlx5_core]
macsec_update_offload+0x26c/0x820
? macsec_set_mac_address+0x4b0/0x4b0
? lockdep_hardirqs_on_prepare+0x284/0x400
? _raw_spin_unlock_irqrestore+0x47/0x50
macsec_upd_offload+0x2c8/0x380
? macsec_update_offload+0x820/0x820
? __nla_parse+0x22/0x30
? genl_family_rcv_msg_attrs_parse.constprop.0+0x15e/0x240
genl_family_rcv_msg_doit+0x1cc/0x2a0
? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240
? cap_capable+0xd4/0x330
genl_rcv_msg+0x3ea/0x670
? genl_family_rcv_msg_dumpit+0x2a0/0x2a0
? lockdep_set_lock_cmp_fn+0x190/0x190
? macsec_update_offload+0x820/0x820
netlink_rcv_skb+0x12b/0x390
? genl_family_rcv_msg_dumpit+0x2a0/0x2a0
? netlink_ack+0xd80/0xd80
? rwsem_down_read_slowpath+0xf90/0xf90
? netlink_deliver_tap+0xcd/0xac0
? netlink_deliver_tap+0x155/0xac0
? _copy_from_iter+0x1bb/0x12c0
genl_rcv+0x24/0x40
netlink_unicast+0x440/0x700
? netlink_attachskb+0x760/0x760
? lock_acquire+0x1c2/0x530
? __might_fault+0xbb/0x170
netlink_sendmsg+0x749/0xc10
? netlink_unicast+0x700/0x700
? __might_fault+0xbb/0x170
? netlink_unicast+0x700/0x700
__sock_sendmsg+0xc5/0x190
____sys_sendmsg+0x53f/0x760
? import_iovec+0x7/0x10
? kernel_sendmsg+0x30/0x30
? __copy_msghdr+0x3c0/0x3c0
? filter_irq_stacks+0x90/0x90
? stack_depot_save_flags+0x28/0xa30
___sys_sendmsg+0xeb/0x170
? kasan_save_stack+0x30/0x40
? copy_msghdr_from_user+0x110/0x110
? do_syscall_64+0x6d/0x140
? lock_acquire+0x1c2/0x530
? __virt_addr_valid+0x116/0x3b0
? __virt_addr_valid+0x1da/0x3b0
? lock_downgrade+0x680/0x680
? __delete_object+0x21/0x50
__sys_sendmsg+0xf7/0x180
? __sys_sendmsg_sock+0x20/0x20
? kmem_cache_free+0x14c/0x4e0
? __x64_sys_close+0x78/0xd0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f855e113367
Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
RSP: 002b:00007ffd15e90c88 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f855e113367
RDX: 0000000000000000 RSI: 00007ffd15e90cf0 RDI: 0000000000000004
RBP: 00007ffd15e90dbc R08: 0000000000000028 R09: 000000000045d100
R10: 00007f855e011dd8 R11: 0000000000000246 R12: 0000000000000019
R13: 0000000067c6b785 R14: 00000000004a1e80 R15: 0000000000000000
</TASK>
Modules linked in: 8021q garp mrp sch_ingress openvswitch nsh mlx5_ib mlx5_fwctl mlx5_dpll mlx5_core rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay zram zsmalloc fuse [last unloaded: mlx5_core]
---[ end trace 0000000000000000 ]---
Fixes: 8ff0ac5be144 ("net/mlx5: Add MACsec offload Tx command support")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1746958552-561295-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Failing to insert/delete a rule should not happen. If it does happen,
it would be good to know at which stage it happened and what was the
failure. This patch adds printing of bad CQE details.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-11-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Reworking the rehash loop - simplifying the code and making it less
error prone:
- Instead of doing round-robin on all the queues with batch of rules in
each cycle, just go over all the queues and move all the rules that
belong to this queue.
- If at some stage of moving the rule we get a failure (which should
not happen), this can't be rolled back. So instead of aborting
rehash and leaving the matcher in a broken state, allow the loop
to continue: attempt to move the rest of the rules and delete the
old matcher. A rule that failed to move to a new matcher will loose
its match STE once the rehash is completed and the old matcher is
deleted, so the rule won't match any traffic any more. This rule's
packets will fall back to the steering pipeline w/o HW offload.
Rehash procedure will return an error, which will cause the rule
insertion to fail for the rule that started this whole rehash.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-10-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a rule is inserted into a matcher, we search for a suitable
action template. If no such template is found, the action template
array is extended with the new template. However, when several threads
are doing this in parallel, there is a race: we can end up extending
the action template array with the same template more than once.
This patch is doing the following:
- refactor the code to find action template index in rule create and
update, have the common code in an auxiliary function
- after locking all the queues, check again if the action template
array still needs to be extended
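A sketch of that double-check, with assumed helper names:

  idx = find_action_template(matcher, rule);
  if (idx >= 0)
          return idx;

  lock_all_queues(matcher);
  idx = find_action_template(matcher, rule); /* re-check under lock */
  if (idx < 0)
          idx = extend_action_templates(matcher, rule);
  unlock_all_queues(matcher);
  return idx;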
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-9-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently the counter that counts the number of rules in a matcher is
increased only when rule insertion is completed. In a multi-threaded
use case this can lead to a scenario where many rules are in the
process of insertion into the same matcher while none of them has
completed, so the rule counter is not updated. This results in a rule
insertion failure for many of them on the first attempt, which leads
to all of them requiring a rehash and taking all the queue locks.
This patch fixes the case by increasing the rule counter in the
beginning of insertion process and decreasing in case of any failure.
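A sketch of the counter discipline (names assumed):

  atomic_inc(&matcher->num_of_rules);     /* count the rule up front */
  err = insert_rule(matcher, rule);
  if (err)
          atomic_dec(&matcher->num_of_rules); /* undo on failure */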
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-8-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rules are inserted into hash table in accordance with their hash index.
When a certain number of rules is reached, the table is rehashed:
a bigger new table is allocated and all the rules are moved there.
But sometimes a new rule can't be inserted into the hash table
because its index is full, even though the number of rules in the
table is well below the threshold. The hash function is not perfect,
so such cases are not rare. When that happens, we want to do the same
rehash, in order to increase the table size and lower the probability
for such cases.
This patch fixes the use case where rule insertion was failing but a
rehash couldn't be initiated due to the low number of rules: it adds a
flag that denotes that a rehash is required, even if the number of
rules in the table is below the rehash threshold.
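A sketch of the flag's role (field names assumed):

  if (hash_index_full)
          matcher->rehash_required = true;  /* remember the failure */
  if (num_of_rules > rehash_threshold || matcher->rehash_required)
          err = rehash(matcher);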
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-7-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This patch adds support for Complex Matchers/Rules.
Overview:
--------
A matcher can match on a certain set of match parameters. However, the
number and size of match params for a single matcher are limited: all
the parameters must fit within a single definer.
A common example of this limitation is IPv6 address matching, where
matching both source and destination IPs requires more bits than a
single definer can support.
SW Steering addresses this limitation by chaining multiple Steering
Table Entries (STEs) within the same matcher, where each STE matches
on a subset of the parameters.
In HW Steering, such chaining is not possible — the matcher's STEs
are managed in a hash table, and a single definer is used to calculate
the hash index for STEs.
To address this limitation in HW Steering, we introduce Complex
Matchers, which consist of two chained matchers. This allows matching
on twice as many parameters. Complex Matchers are filled with Complex
Rules — rules that are split into two parts and inserted into their
respective matchers.
The first half of the Complex Matcher is a regular matcher and points
to the second half, which is an Isolated Matcher. An Isolated Matcher
has its own isolated table and is accessible only by traffic coming
from the first half of the Complex Matcher.
This splitting of matchers/rules into multiple parts is transparent to
users. It is hidden under the BWC HWS API. It becomes visible only when
dumping steering debug information, where the Complex Matcher appears
as two separate matchers: one in the user-created table and another
in its isolated table.
Some implementation details:
---------------------------
All user actions are performed on the second part of the rules only.
The first part handles matching and applies two actions: modify header
(set metadata, see details below) and go-to-table (directing traffic to
the isolated table containing the isolated matcher).
Rule updates (updating rule actions) are applied to the second part of
the rule since user-provided actions are not executed in the first
matcher.
We use the REG_C_6 metadata register to set and match on a unique
per-rule tag (see details below).
Splitting rules into two parts introduces new challenges:
1. Invalid Combinations
Consider two rules with different matching values:
- Rule 1: A+B
- Rule 2: C+D
Let's split the rules into two parts as follows:
|---| |---|
| A | | B |
|---| --> |---|
| C | | D |
|---| |---|
Splitting these rules results in invalid combinations like A+D
and C+B.
To resolve this, we assign unique tags to each rule on the first
matcher and match these tags on the second matcher (the tag is
implemented through modify_hdr action that sets value to metadata
register REG_C_6):
|----------| |---------|
| A | | B, TagA |
| action: | | |
| set TagA | | |
|----------| --> |---------|
| C | | D, TagB |
| action: | | |
| set TagB | | |
|----------| |---------|
2. Duplicated Entries:
Consider two rules with overlapping values:
- Rule 1: A+B
- Rule 2: A+D
Let's split the rules into two parts as follows:
|---| |---|
| A | | B |
|---| --> |---|
| | | D |
|---| |---|
This leads to the duplicated entries on the first matcher, which HWS
doesn't allow: subsequent delete of either of the rules will delete
the only entry in the first matcher, leaving the remaining rule
broken.
To address this, we use a reference count for entries in the first
matcher and delete STEs only when their refcount reaches zero.
Both challenges are resolved by having a per-matcher data structure
(implemented with rhashtable) that manages refcounts for the first part
of the rules and holds unique tags (managed via IDA) for these rules to
set and to match on the second matcher.
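A sketch of that structure's shape; the names are made up, only the
rhashtable, the IDA-managed tags, the refcounts and REG_C_6 come from
this message:

  struct complex_first_part_node {
          struct rhash_head hash;   /* keyed by the first-part match */
          u32 tag;                  /* IDA-allocated, set in REG_C_6 */
          refcount_t refcount;      /* rules sharing this first part */
  };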
Limitations:
-----------
We utilize metadata register REG_C_6 in this implementation, so its
usage anywhere along the steering of the flow that might include the
need for Complex Matcher is prohibited.
The number and size of match parameters remain limited — now it is
constrained by what can be represented by two definers instead of one.
This architectural limitation arises from the structure of Complex
Matchers. If future requirements demand more parameters,
Complex Matchers can be extended beyond two matchers.
Additionally, there is an implementation limit of 32 match parameters
per rule (disregarding parameter size). This limit can be lifted if
needed.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In preparation for complex matcher support, introduce the isolated
matcher.
Isolated matcher is a matcher that has its own isolated table.
It is used as the second half of the complex matcher: when the rule
is split into two parts (complex rule), then matching on the first
part will send the packet to the isolated matcher that will try to
match on the second part. In case of miss, the packet goes back to
the matcher's end flow table.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In preparation for complex matcher, expose the function that is
polling queue for completion (mlx5hws_bwc_queue_poll) in header
file, so that it will be used by complex matcher code.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In preparation for complex matcher support, add function for
converting definer fname to str, which will be used in following
patches.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In preparation for complex matcher support, make function
mlx5hws_table_ft_set_next_ft() non-static and expose it in header.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1746992290-568936-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As mlx4 has implemented skb_tx_timestamp() in mlx4_en_xmit(), the
SOFTWARE flag is needed when users try to get timestamp
information.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250510093442.79711-1-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Having a software timestamp (along with the existing hardware one) is
useful to trace how packets flow through the stack.
mlx5e_tx_skb_update_hwts_flags() is called from the TX paths to set up
the HW timestamp; extend it to add a software one as well.
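A sketch of the extension; the helper name is from this message, the
body shown is assumed:

  static void mlx5e_tx_skb_update_hwts_flags(struct sk_buff *skb)
  {
          /* ... existing HW timestamp setup ... */
          skb_tx_timestamp(skb);  /* new: software timestamp as well */
  }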
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250508235109.585096-1-stfomichev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.15-rc5).
No conflicts or adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In preparation for making the kmalloc family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "unsigned long **", but the returned type will be
"long **". These are the same size allocation (pointer size) but the
types do not match. Adjust the allocation type to match the assignment.
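An illustration of the general pattern (the variable is made up; the
point is that the sizeof expression matches the assigned pointer
type):

  unsigned long **bitmaps;

  bitmaps = kmalloc_array(n, sizeof(unsigned long *), GFP_KERNEL);
  /* same allocation size as sizeof(long *), but the types now match,
   * as the type-aware allocators will require */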
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20250426060757.work.865-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The cited commit assumes that enabling RoCE always succeeds, but that
is not true. Add error handling for it.
Fixes: 80f09dfc237f ("net/mlx5: Eswitch, enable RoCE loopback traffic")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250423083611.324567-6-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|