summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/mellanox
AgeCommit message (Collapse)Author
2025-04-15net: ptp: introduce .supported_perout_flags to ptp_clock_infoJacob Keller
The PTP_PEROUT_REQUEST2 ioctl has gained support for flags specifying specific output behavior including PTP_PEROUT_ONE_SHOT, PTP_PEROUT_DUTY_CYCLE, PTP_PEROUT_PHASE. Driver authors are notorious for not checking the flags of the request. This results in misinterpreting the request, generating an output signal that does not match the requested value. It is anticipated that even more flags will be added in the future, resulting in even more broken requests. Expecting these issues to be caught during review or playing whack-a-mole after the fact is not a great solution. Instead, introduce the supported_perout_flags field in the ptp_clock_info structure. Update the core character device logic to explicitly reject any request which has a flag not on this list. This ensures that drivers must 'opt in' to the flags they support. Drivers which don't set the .supported_perout_flags field will not need to check that unsupported flags aren't passed, as the core takes care of this. Update the drivers which do support flags to set this new field. Note the following driver files set n_per_out to a non-zero value but did not check the flags at all: • drivers/ptp/ptp_clockmatrix.c • drivers/ptp/ptp_idt82p33.c • drivers/ptp/ptp_fc3.c • drivers/net/ethernet/ti/am65-cpts.c • drivers/net/ethernet/aquantia/atlantic/aq_ptp.c • drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c • drivers/net/dsa/sja1105/sja1105_ptp.c • drivers/net/ethernet/freescale/dpaa2/dpaa2-ptp.c • drivers/net/ethernet/mscc/ocelot_vsc7514.c • drivers/net/ethernet/intel/i40e/i40e_ptp.c Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250414-jk-supported-perout-flags-v2-2-f6b17d15475c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15net: ptp: introduce .supported_extts_flags to ptp_clock_infoJacob Keller
The PTP_EXTTS_REQUEST(2) ioctl has a flags field which specifies how the external timestamp request should behave. This includes which edge of the signal to timestamp, as well as a specialized "offset" mode. It is expected that more flags will be added in the future. Driver authors routinely do not check the flags, often accepting requests with flags which they do not support. Even drivers which do check flags may not be future-proofed to reject flags not yet defined. Thus, any future flag additions often require manually updating drivers to reject these flags. This approach of hoping we catch flag checks during review, or playing whack-a-mole after the fact is the wrong approach. Introduce the "supported_extts_flags" field to the ptp_clock_info structure. This field defines the set of flags the device actually supports. Update the core character device logic to check this field and reject unsupported requests. Getting this right is somewhat tricky. First, to avoid unnecessary repetition and make basic functionality work when .supported_extts_flags is 0, the core always accepts the PTP_ENABLE_FEATURE flag. This flag is used to set the 'on' parameter to the .enable function and is thus always 'supported' by all drivers. For backwards compatibility, the PTP_RISING_EDGE and PTP_FALLING_EDGE flags are merely "hints" when using the old PTP_EXTTS_REQUEST ioctl, and are not expected to be enforced. If the user issues PTP_EXTTS_REQUEST2, the PTP_STRICT_FLAGS flag is added which is supposed to inform the driver to strictly validate the flags and reject unsupported requests. To handle this, first check if the driver reports PTP_STRICT_FLAGS support. If it does not, then always allow the PTP_RISING_EDGE and PTP_FALLING_EDGE flags. This keeps backwards compatibility with the original PTP_EXTTS_REQUEST ioctl where these flags are not guaranteed to be honored. This way, drivers which do not set the supported_extts_flags will continue to accept requests for the original PTP_EXTTS_REQUEST ioctl. The core will automatically reject requests with new flags, and correctly reject requests with PTP_STRICT_FLAGS, where the driver is supposed to strictly validate the flags. Update the various drivers, refactoring their validation logic into the .supported_extts_flags field. For consistency and readability, PTP_ENABLE_FEATURE is not set in the supported flags list, and PTP_EXTTS_EDGES is expanded to PTP_RISING_EDGE | PTP_FALLING_EDGE in all cases. Note the following driver files set n_ext_ts to a non-zero value but did not check flags at all: • drivers/net/ethernet/freescale/dpaa2/dpaa2-ptp.c • drivers/net/ethernet/freescale/enetc/enetc_ptp.c • drivers/net/ethernet/intel/i40e/i40e_ptp.c • drivers/net/ethernet/marvell/octeontx2/nic/otx2_ptp.c • drivers/net/ethernet/renesas/ravb_ptp.c • drivers/net/ethernet/renesas/rtsn.c • drivers/net/ethernet/renesas/rtsn.h • drivers/net/ethernet/ti/am65-cpts.c • drivers/net/ethernet/ti/cpts.h • drivers/net/ethernet/ti/icssg/icss_iep.c • drivers/net/ethernet/xscale/ptp_ixp46x.c • drivers/net/phy/bcm-phy-ptp.c • drivers/ptp/ptp_ocp.c • drivers/ptp/ptp_pch.c • drivers/ptp/ptp_qoriq.c These drivers behavior does change slightly: they will now reject the PTP_EXTTS_REQUEST2 ioctl, because they do not strictly validate their flags. This also makes them no longer incorrectly accept PTP_EXT_OFFSET. Also note that the renesas ravb driver does not support PTP_STRICT_FLAGS. We could leave the .supported_extts_flags as 0, but I added the PTP_RISING_EDGE | PTP_FALLING_EDGE since the driver previously manually validated these flags. This is equivalent to 0 because the core will allow these flags regardless unless PTP_STRICT_FLAGS is also set. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250414-jk-supported-perout-flags-v2-1-f6b17d15475c@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Export action STE tables to debugfsVlad Dogaru
Introduce a new type of dump object and dump all action STE tables, along with information on their RTCs and STEs. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-13-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Free unused action STE tablesVlad Dogaru
Periodically check for unused action STE tables and free their associated resources. In order to do this safely, add a per-queue lock to synchronize the garbage collect work with regular operations on steering rules. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-12-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Cleanup matcher action STE tableVlad Dogaru
Remove the matcher action STE implementation now that the code uses per-queue action STE pools. This also allows simplifying matcher code because it is now only handling a single type of RTC/STE. The matcher resize data is also going away. Matchers were saving old action STE data because the rules still used it, but now that data lives in the action STE pool and is no longer coupled to a matcher. Furthermore, matchers no longer need to rehash a due to action template addition. If a new action template needs more action STEs, we simply update the matcher's num_of_action_stes and future rules will allocate the correct number. Existing rules are unaffected by such an operation and can continue to use their existing action STEs. The range action was using the matcher action STE implementation, but there was no reason to do this other than the container fitting the purpose. Extract that information to a separate structure. Finally, stop dumping per-matcher information about action RTCs, because they no longer exist. A later patch in this series will add support for dumping action STE pools. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-11-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Use the new action STE poolVlad Dogaru
Use the central action STE pool when creating / updating rules. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-10-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Implement action STE poolVlad Dogaru
Implement a per-queue pool of action STEs that match STEs can link to, regardless of matcher. The code relies on hints to optimize whether a given rule is added to rx-only, tx-only or both. Correspondingly, action STEs need to be added to different RTC for ingress or egress paths. For rx-and-tx rules, the current rule implementation dictates that the offsets for a given rule must be the same in both RTCs. To avoid wasting STEs, each action STE pool element holds 3 pools: rx-only, tx-only, and rx-and-tx, corresponding to the possible values of the pool optimization enum. The implementation then chooses at rule creation / update which of these elements to allocate from. Each element holds multiple action STE tables, which wrap an RTC, an STE range, the logic to buddy-allocate offsets from the range, and an STC that allows match STEs to point to this table. When allocating offsets from an element, we iterate through available action STE tables and, if needed, create a new table. Similar to the previous implementation, this iteration does not free any resources. This is implemented in a subsequent patch. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-9-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Fix pool size optimizationVlad Dogaru
The optimization to create a size-one STE range for the unused direction was broken. The hardware prevents us from creating RTCs over unallocated STE space, so the only reason this has worked so far is because the optimization was never used. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-8-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Add fullness tracking to poolVlad Dogaru
Future users will need to query whether a pool is empty. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-7-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Cleanup after pool refactoringVlad Dogaru
Remove members which are now no longer used. In fact, many of the `struct mlx5hws_pool_chunk` were not even written to beyond being initialized, but they were used in various internals. Also cleanup some local variables which made more sense when the API was thicker. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Refactor pool implementationVlad Dogaru
Refactor the pool implementation to remove unused flags and clarify its usage. A pool represents a single range of STEs or STCs which are allocated at pool creation time. Pools are used under three patterns: 1. STCs are allocated one at a time from a global pool using a bitmap based implementation. 2. Action STEs are allocated in power-of-two blocks using a buddy algorithm. 3. Match STEs do not use allocation, since insertion into these tables is based on hashes or direct addressing. In such cases we use a pool only to create the STE range. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Make pool single resourceVlad Dogaru
The pool implementation claimed to support multiple resources, but this does not really make sense in context. Callers always allocate a single STC or STE chunk of exactly the size provided. The code that handled multiple resources was unused (and likely buggy) due to the combination of flags passed by callers. Simplify the pool by having it handle a single resource. As a result of this simplification, chunks no longer contain a resource offset (there is now only one resource per pool), and the get_base_id functions no longer take a chunk parameter. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Remove unused element arrayVlad Dogaru
Remove the array of elements wrapped in a struct because in reality only the first element was ever used. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14net/mlx5: HWS, Fix matcher action template attachVlad Dogaru
The procedure of attaching an action template to an existing matcher had a few issues: 1. Attaching accidentally overran the `at` array in bwc_matcher, which would result in memory corruption. This bug wasn't triggered, but it is possible to trigger it by attaching action templates beyond the initial buffer size of 8. Fix this by converting to a dynamically sized buffer and reallocating if needed. 2. Similarly, the `at` array inside the native matcher was never reallocated. Fix this the same as above. 3. The bwc layer treated any error in action template attach as a signal that the matcher should be rehashed to account for a larger number of action STEs. In reality, there are other unrelated errors that can arise and they should be propagated upstack. Fix this by adding a `need_rehash` output parameter that's orthogonal to error codes. Fixes: 2111bb970c78 ("net/mlx5: HWS, added backward-compatible API handling") Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/1744312662-356571-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14page_pool: Move pp_magic check into helper functionsToke Høiland-Jørgensen
Since we are about to stash some more information into the pp_magic field, let's move the magic signature checks into a pair of helper functions so it can be changed in one place. Reviewed-by: Mina Almasry <almasrymina@google.com> Tested-by: Yonglong Liu <liuyonglong@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-05treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-02eth: mlx4: select PAGE_POOLGreg Thelen
With commit 8533b14b3d65 ("eth: mlx4: create a page pool for Rx") mlx4 started using functions guarded by PAGE_POOL. This change introduced build errors when CONFIG_MLX4_EN is set but CONFIG_PAGE_POOL is not: ld: vmlinux.o: in function `mlx4_en_alloc_frags': en_rx.c:(.text+0xa5eaf9): undefined reference to `page_pool_alloc_pages' ld: vmlinux.o: in function `mlx4_en_create_rx_ring': (.text+0xa5ee91): undefined reference to `page_pool_create' Make MLX4_EN select PAGE_POOL to fix the ml;x4 build errors. Fixes: 8533b14b3d65 ("eth: mlx4: create a page pool for Rx") Signed-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250401015315.2306092-1-gthelen@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-01Merge tag 'net-6.15-rc0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Rather tiny pull request, mostly so that we can get into our trees your fix to the x86 Makefile. Current release - regressions: - Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc", error queue accounting was missed Current release - new code bugs: - 5 fixes for the netdevice instance locking work Previous releases - regressions: - usbnet: restore usb%d name exception for local mac addresses Previous releases - always broken: - rtnetlink: allocate vfinfo size for VF GUIDs when supported, avoid spurious GET_LINK failures - eth: mana: Switch to page pool for jumbo frames - phy: broadcom: Correct BCM5221 PHY model detection Misc: - selftests: drv-net: replace helpers for referring to other files" * tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits) Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc" bnxt_en: bring back rtnl lock in bnxt_shutdown eth: gve: add missing netdev locks on reset and shutdown paths selftests: mptcp: ignore mptcp_diag binary selftests: mptcp: close fd_in before returning in main_loop selftests: mptcp: fix incorrect fd checks in main_loop mptcp: fix NULL pointer in can_accept_new_subflow octeontx2-af: Free NIX_AF_INT_VEC_GEN irq octeontx2-af: Fix mbox INTR handler when num VFs > 64 net: fix use-after-free in the netdev_nl_sock_priv_destroy() selftests: net: use Path helpers in ping selftests: net: use the dummy bpf from net/lib selftests: drv-net: replace the rpath helper with Path objects net: lapbether: use netdev_lockdep_set_classes() helper net: phy: broadcom: Correct BCM5221 PHY model detection net: usb: usbnet: restore usb%d name exception for local mac addresses net/mlx5e: SHAMPO, Make reserved size independent of page size net: mana: Switch to page pool for jumbo frames MAINTAINERS: Add dedicated entries for phy_link_topology net: move replay logic to tc_modify_qdisc ...
2025-03-29Merge tag 'for-linus-fwctl' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull fwctl subsystem from Jason Gunthorpe: "fwctl is a new subsystem intended to bring some common rules and order to the growing pattern of exposing a secure FW interface directly to userspace. Unlike existing places like RDMA/DRM/VFIO/uacce that are exposing a device for datapath operations fwctl is focused on debugging, configuration and provisioning of the device. It will not have the necessary features like interrupt delivery to support a datapath. This concept is similar to the long standing practice in the "HW" RAID space of having a device specific misc device to manage the RAID controller FW. fwctl generalizes this notion of a companion debug and management interface that goes along with a dataplane implemented in an appropriate subsystem. There have been three LWN articles written discussing various aspects of this: https://lwn.net/Articles/955001/ https://lwn.net/Articles/969383/ https://lwn.net/Articles/990802/ This includes three drivers to launch the subsystem: - CXL provides a vendor scheme for executing commands and a way to learn the 'command effects' (ie the security properties) of such commands. The fwctl driver allows access to these mechanism within the fwctl security model - mlx5 is family of networking products, the driver supports all current Mellanox HW still receiving FW feature updates. This includes RDMA multiprotocol NICs like ConnectX and the Bluefield family of Smart NICs. - AMD/Pensando Distributed Services card is a multi protocol Smart NIC with a multi PCI function design. fwctl works on the management PCI function following a 'command effects' model similar to CXL" * tag 'for-linus-fwctl' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (30 commits) pds_fwctl: add Documentation entries pds_fwctl: add rpc and query support pds_fwctl: initial driver framework pds_core: add new fwctl auxiliary_device pds_core: specify auxiliary_device to be created pds_core: make pdsc_auxbus_dev_del() void cxl: Fixup kdoc issues for include/cxl/features.h fwctl/cxl: Add documentation to FWCTL CXL cxl/test: Add Set Feature support to cxl_test cxl/test: Add Get Feature support to cxl_test cxl: Add support to handle user feature commands for set feature cxl: Add support to handle user feature commands for get feature cxl: Add support for fwctl RPC command to enable CXL feature commands cxl: Move cxl feature command structs to user header cxl: Add FWCTL support to CXL mlx5: Create an auxiliary device for fwctl_mlx5 fwctl/mlx5: Support for communicating with mlx5 fw fwctl: Add documentation fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware taint: Add TAINT_FWCTL ...
2025-03-28net/mlx5e: SHAMPO, Make reserved size independent of page sizeLama Kayal
When hw-gro is enabled, the maximum number of header entries that are needed per wqe (hd_per_wqe) is calculated based on the size of the reservations among other parameters. Miscalculation of the size of reservations leads to incorrect calculation of hd_per_wqe as 0, particularly in the case of large page size like in aarch64, this prevents the SHAMPO header from being correctly initialized in the device, ultimately causing the following cqe err that indicates a violation of PD. mlx5_core 0000:00:08.0 eth2: ERR CQE on RQ: 0x1180 mlx5_core 0000:00:08.0 eth2: Error cqe on cqn 0x510, ci 0x0, qn 0x1180, opcode 0xe, syndrome 0x4, vendor syndrome 0x32 00000000: 00 00 00 00 04 4a 00 00 00 00 00 00 20 00 93 32 00000010: 55 00 00 00 fb cc 00 00 00 00 00 00 07 18 00 00 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 4a 00000030: 00 00 00 9a 93 00 32 04 00 00 00 00 00 00 da e1 Use the correct formula for calculating the size of reservations, precisely it shouldn't be dependent on page size, instead use the correct multiply of MLX5E_SHAMPO_WQ_BASE_RESRV_SIZE. Fixes: e5ca8fb08ab2 ("net/mlx5e: Add control path for SHAMPO feature") Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1742732906-166564-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Merge in late fixes to prepare for the 6.15 net-next PR. No conflicts, adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c 919f9f497dbc ("eth: bnxt: fix out-of-range access of vnic_info array") fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge tag 'ipsec-next-2025-03-24' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2025-03-24 1) Prevent setting high order sequence number bits input in non-ESN mode. From Leon Romanovsky. 2) Support PMTU handling in tunnel mode for packet offload. From Leon Romanovsky. 3) Make xfrm_state_lookup_byaddr lockless. From Florian Westphal. 4) Remove unnecessary NULL check in xfrm_lookup_with_ifid(). From Dan Carpenter. * tag 'ipsec-next-2025-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: Remove unnecessary NULL check in xfrm_lookup_with_ifid() xfrm: state: make xfrm_state_lookup_byaddr lockless xfrm: check for PMTU in tunnel mode for packet offload xfrm: provide common xdo_dev_offload_ok callback implementation xfrm: rely on XFRM offload xfrm: simplify SA initialization routine xfrm: delay initialization of offload path till its actually requested xfrm: prevent high SEQ input in non-ESN mode ==================== Link: https://patch.msgid.link/20250324061855.4116819-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5e: TC, Don't offload CT commit if it's the last actionJianbo Liu
For CT action with commit argument, it's usually followed by the forward action, either to the output netdev or next chain. The default behavior for software is to drop by setting action attribute to TC_ACT_SHOT instead of TC_ACT_PIPE if it's the last action. But driver can't handle it, so block the offload for such case. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1742392983-153050-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5e: CT: Filter legacy rules that are unrelated to nicPaul Blakey
In nic mode CT setup where we do hairpin between the two nics, both nics register to the same flow table (per zone), and try to offload all rules on it. Instead, filter the rules that originated from the relevant nic (so only one side is offloaded for each nic). Signed-off-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1742392983-153050-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5: Update pfnum retrieval for devlink port attributesShay Drory
Align mlx5 driver usage of 'pfnum' with the documentation clarification introduced in commit bb70b0d48d8e ("devlink: Improve the port attributes description"). Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1742392983-153050-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5: fw reset, check bridge accessibility at earlier stageAmir Tzin
Currently, mlx5_is_reset_now_capable() checks whether the pci bridge is accessible only on bridge hot plug capability check. If the pci bridge is not accessible, reset now will fail regardless of bridge hotplug capability. Move this check to function mlx5_is_reset_now_capable() which, in such case, aborts the reset and does so in the request phase instead of the reset now phase. Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Amir Tzin <amirtz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1742392983-153050-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5: Lag, use port selection tables when availableMark Bloch
As queue affinity is being deprecated and will no longer be supported in the future, Always check for the presence of the port selection namespace. When available, leverage it to distribute traffic across the physical ports via steering, ensuring compatibility with future NICs. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1742392983-153050-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net/mlx5e: TX, Utilize WQ fragments edge for multi-packet WQEsTariq Toukan
For simplicity reasons, the driver avoids crossing work queue fragment boundaries within the same TX WQE (Work-Queue Element). Until today, as the number of packets in a TX MPWQE (Multi-Packet WQE) descriptor is not known in advance, the driver pre-prepared contiguous memory for the largest possible WQE. For this, when getting too close to the fragment edge, having no room for the largest WQE possible, the driver was filling the fragment remainder with NOP descriptors, aligning the next descriptor to the beginning of the next fragment. Generating and handling these NOPs wastes resources, like: CPU cycles, work-queue entries fetched to the device, and PCI bandwidth. In this patch, we replace this NOPs filling mechanism in the TX MPWQE flow. Instead, we utilize the remaining entries of the fragment with a TX MPWQE. If this room turns out to be too small, we simply open an additional descriptor starting at the beginning of the next fragment. Performance benchmark: uperf test, single server against 3 clients. TCP multi-stream, bidir, traffic profile "2x350B read, 1400B write". Bottleneck is in inbound PCI bandwidth (device POV). +---------------+------------+------------+--------+ | | Before | After | | +---------------+------------+------------+--------+ | BW | 117.4 Gbps | 121.1 Gbps | +3.1% | +---------------+------------+------------+--------+ | tx_packets | 15 M/sec | 15.5 M/sec | +3.3% | +---------------+------------+------------+--------+ | tx_nops | 3 M/sec | 0 | -100% | +---------------+------------+------------+--------+ Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1742391746-118647-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24net/mlx5: Start health poll after enable hcaMoshe Shemesh
The health poll mechanism performs periodic checks to detect firmware errors. One of the checks verifies the function is still enabled on firmware side, but the function is enabled only after enable_hca command completed. Start health poll after enable_hca command to avoid a race between function enabled and first health polling. Fixes: 9b98d395b85d ("net/mlx5: Start health poll at earlier stage of driver load") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/1742331077-102038-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24net/mlx5: LAG, reload representors on LAG creation failureMark Bloch
When LAG creation fails, the driver reloads the RDMA devices. If RDMA representors are present, they should also be reloaded. This step was missed in the cited commit. Fixes: 598fe77df855 ("net/mlx5: Lag, Create shared FDB when in switchdev mode") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/1742331077-102038-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24mlxsw: spectrum_acl_bloom_filter: Workaround for some LLVM versionsWangYuli
This is a workaround to mitigate a compiler anomaly. During LLVM toolchain compilation of this driver on s390x architecture, an unreasonable __write_overflow_field warning occurs. Contextually, chunk_index is restricted to 0, 1 or 2. By expanding these possibilities, the compile warning is suppressed. Fix follow error with clang-19 when -Werror: In file included from drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_bloom_filter.c:5: In file included from ./include/linux/gfp.h:7: In file included from ./include/linux/mmzone.h:8: In file included from ./include/linux/spinlock.h:63: In file included from ./include/linux/lockdep.h:14: In file included from ./include/linux/smp.h:13: In file included from ./include/linux/cpumask.h:12: In file included from ./include/linux/bitmap.h:13: In file included from ./include/linux/string.h:392: ./include/linux/fortify-string.h:571:4: error: call to '__write_overflow_field' declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror,-Wattribute-warning] 571 | __write_overflow_field(p_size_field, size); | ^ 1 error generated. According to the testing, we can be fairly certain that this is a clang compiler bug, impacting only clang-19 and below. Clang versions 20 and 21 do not exhibit this behavior. Link: https://lore.kernel.org/all/484364B641C901CD+20250311141025.1624528-1-wangyuli@uniontech.com/ Fixes: 7585cacdb978 ("mlxsw: spectrum_acl: Add Bloom filter handling") Co-developed-by: Zijian Chen <czj2441@163.com> Signed-off-by: Zijian Chen <czj2441@163.com> Co-developed-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Suggested-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Tested-by: WangYuli <wangyuli@uniontech.com> Signed-off-by: WangYuli <wangyuli@uniontech.com> Link: https://patch.msgid.link/A1858F1D36E653E0+20250318103654.708077-1-wangyuli@uniontech.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24mlxsw: Add VXLAN bridge ports to same hardware domain as physical bridge portsAmit Cohen
When hardware floods packets to bridge ports, but flooding to VXLAN bridge port fails during encapsulation to one of the remote VTEPs, the packets are trapped to CPU. In such case, the packets are marked with skb->offload_fwd_mark, which means that packet was L2-forwarded in hardware. Software data path repeats flooding, but packets which are marked with skb->offload_fwd_mark will not be flooded by the bridge to bridge ports which are in the same hardware domain as the ingress port. Currently, mlxsw does not add VXLAN bridge ports to the same hardware domain as physical bridge ports despite the fact that the device is able to forward packets to and from VXLAN tunnels in hardware. In some scenarios (as mentioned above) this can result in remote VTEPs receiving duplicate packets. The packets are first flooded by hardware and after an encapsulation failure, they are flooded again to all remote VTEPs by software. Solve this by adding VXLAN bridge ports to the same hardware domain as physical bridge ports, so then nbp_switchdev_allowed_egress() will return false also for VXLAN, and packets will not be sent twice from VXLAN device. switchdev_bridge_port_offload() should get vxlan_dev not as const, so some changes are required. Call switchdev API from mlxsw_sp_bridge_vxlan_{join,leave}() which handle offload configurations. Reported-by: Vladimir Oltean <olteanv@gmail.com> Closes: https://lore.kernel.org/all/20250210152246.4ajumdchwhvbarik@skbuf/ Reported-by: Vladyslav Mykhaliuk <vmykhaliuk@nvidia.com> Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/7279056843140fae3a72c2d204c7886b79d03899.1742224300.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24mlxsw: spectrum_switchdev: Move mlxsw_sp_bridge_vxlan_join()Amit Cohen
Next patch will call __mlxsw_sp_bridge_vxlan_leave() from mlxsw_sp_bridge_vxlan_join() as part of error flow, move the function to be able to call the second one. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/64750a0965536530482318578bada30fac372b8a.1742224300.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24mlxsw: spectrum_switchdev: Add an internal API for VXLAN leaveAmit Cohen
There is asymmetry in how the VXLAN join and leave functions are used. The join function (mlxsw_sp_bridge_vxlan_join()) is only called in response to netdev events (e.g., VXLAN device joining a bridge), but the leave function is also called in response to switchdev events (e.g., VLAN configuration on top of the VXLAN device) in order to invalidate VNI to FID mappings. This asymmetry will cause problems when the functions will be later extended to mark VXLAN bridge ports as offloaded or not. Therefore, create an internal function (__mlxsw_sp_bridge_vxlan_leave()) that is used to invalidate VNI to FID mappings and call it from mlxsw_sp_bridge_vxlan_leave() which will only be invoked in response to netdev events, like mlxsw_sp_bridge_vxlan_join(). No functional changes intended. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/f3a32bd2d87a0b7ac4d2bb98a427dc6d95a01cd0.1742224300.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24mlxsw: spectrum: Call mlxsw_sp_bridge_vxlan_{join, leave}() for VLAN-aware ↵Amit Cohen
bridge mlxsw_sp_bridge_vxlan_{join,leave}() are not called when a VXLAN device joins or leaves a VLAN-aware bridge. As mentioned in the comment - when the bridge is VLAN-aware, the VNI of the VXLAN device needs to be mapped to a VLAN, but at this point no VLANs are configured on the VxLAN device. This means that we can call the APIs, but there is no point to do that, as they do not configure anything in such cases. Next patch will extend mlxsw_sp_bridge_vxlan_{join,leave}() to set hardware domain for VXLAN, this should be done also when a VXLAN device joins or leaves a VLAN-aware bridge. Call the APIs, which for now do not do anything in these flows. Align the call to mlxsw_sp_bridge_vxlan_leave() to be called like mlxsw_sp_bridge_vxlan_join(), only in case that the VXLAN device is up, so move the check to be done before calling mlxsw_sp_bridge_vxlan_{join,leave}(). This does not change the existing behavior, as there is a similar check inside mlxsw_sp_bridge_vxlan_leave(). Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/994c1ea93520f9ea55d1011cd47dc2180d526484.1742224300.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24mlxsw: Trap ARP packets at layer 2 instead of layer 3Amit Cohen
Next patch will set the same hardware domain for all bridge ports, including VXLAN, to prevent packets from being forwarded by software when they were already forwarded by hardware. ARP packets are not flooded by hardware to VXLAN, so software should handle such flooding. When hardware domain of VXLAN device will be changed, ARP packets which are trapped and marked with offload_fwd_mark will not be flooded to VXLAN also in software, which will break VXLAN traffic. To prevent such breaking, trap ARP packets at layer 2 and don't mark them as L2-forwarded in hardware, then flooding ARP packets will be done only in software, and VXLAN will send ARP packets. Remove NVE_ENCAP_ARP which is no longer needed, as now ARP packets are trapped when they enter the device. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/b2a2cc607a1f4cb96c10bd3b0b0244ba3117fd2e.1742224300.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24net/mlx5e: Always select CONFIG_PAGE_POOL_STATSTariq Toukan
Always set PAGE_POOL_STATS in mlx5 Eth driver. Cleanup the corresponding #ifdefs. Page pool stats are essential to monitor and analyze RX performance. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/1742412199-159596-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24net/mlx5e: Use right API to free bitmap memoryMark Zhang
Use bitmap_free() to free memory allocated with bitmap_zalloc_node(). This fixes memtrack error: mtl rsc inconsistency: memtrack_free: .../drivers/net/ethernet/mellanox/mlx5/core/en_main.c::466: kfree for unknown address=0xFFFF0000CA3619E8, device=0x0 Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/1742412199-159596-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24net/mlx5: Remove NULL check before dev_{put, hold}Gal Pressman
Fix coccinelle warnings: WARNING: NULL check before dev_{put, hold} functions is not needed. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/1742412199-159596-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24net/mlx5e: Fix ethtool -N flow-type ip4 to RSS contextMaxim Mikityanskiy
There commands can be used to add an RSS context and steer some traffic into it: # ethtool -X eth0 context new New RSS context is 1 # ethtool -N eth0 flow-type ip4 dst-ip 1.1.1.1 context 1 Added rule with ID 1023 However, the second command fails with EINVAL on mlx5e: # ethtool -N eth0 flow-type ip4 dst-ip 1.1.1.1 context 1 rmgr: Cannot insert RX class rule: Invalid argument Cannot insert classification rule It happens when flow_get_tirn calls flow_type_to_traffic_type with flow_type = IP_USER_FLOW or IPV6_USER_FLOW. That function only handles IPV4_FLOW and IPV6_FLOW cases, but unlike all other cases which are common for hash and spec, IPv4 and IPv6 defines different contants for hash and for spec: #define TCP_V4_FLOW 0x01 /* hash or spec (tcp_ip4_spec) */ #define UDP_V4_FLOW 0x02 /* hash or spec (udp_ip4_spec) */ ... #define IPV4_USER_FLOW 0x0d /* spec only (usr_ip4_spec) */ #define IP_USER_FLOW IPV4_USER_FLOW #define IPV6_USER_FLOW 0x0e /* spec only (usr_ip6_spec; nfc only) */ #define IPV4_FLOW 0x10 /* hash only */ #define IPV6_FLOW 0x11 /* hash only */ Extend the switch in flow_type_to_traffic_type to support both, which fixes the failing ethtool -N command with flow-type ip4 or ip6. Fixes: 248d3b4c9a39 ("net/mlx5e: Support flow classification into RSS contexts") Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Tested-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250319124508.3979818-1-maxim@isovalent.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-21net/mlx5e: Expose port reset cycle recovery counter via ethtoolYael Chemla
Display recovery event of PPCNT recovery counters group. Counts (per link) the number of total successful recovery events of any recovery types during port reset cycle. Signed-off-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1742112876-2890-5-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-21net/mlx5e: Get counter group size by FW capabilityYael Chemla
Retrieve the number of fields supported by each PPCNT counter group based on the FW capability for this group. Signed-off-by: Yael Chemla <ychemla@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/1742112876-2890-4-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-21net/mlx5e: Access PHY layer counter group as other counter groupsYael Chemla
Adjust the way physical layer counters group is accessed to match the generic method used for accessing other PPCNT counter groups. Signed-off-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1742112876-2890-3-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-21net/mlx5e: Ensure each counter group uses its PCAM bitYael Chemla
The code was incorrectly relying on PCAM bit of ppcnt_statistical_group for accessing per_lane_error_counters. If ppcnt_statistical_group PCAM bit was not set, we would not read per_lane_error_counters, even when its PCAM bit is set. Given the existing device capabilities, it seems to cause no harm, so this change primarily serves as cleanup. Signed-off-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/1742112876-2890-2-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19net/mlx5: HWS, log the unsupported mask in definerYevgeny Kliteynik
If a user requested to match on an unsupported combination of fields, print the unsupported combination in the error message. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1741780194-137519-4-git-send-email-tariqt@nvidia.com Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19net/mlx5: HWS, use list_move() instead of del/addYevgeny Kliteynik
Wherever applicable, use list_move function instead of list_del + list_add. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1741780194-137519-3-git-send-email-tariqt@nvidia.com Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-19net/mlx5: HWS, remove unused code for alias flow tablesYevgeny Kliteynik
Alias flow tables are not in use by HWS - remove the unused code. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1741780194-137519-2-git-send-email-tariqt@nvidia.com Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18net/mlx5: Add support for setting parent of nodesCarolina Jubran
Introduce `mlx5_esw_devlink_rate_node_parent_set()` to allow assigning a parent to scheduling nodes. Implement `mlx5_esw_qos_node_update_parent()` and `mlx5_esw_qos_node_validate_set_parent()` to enforce constraints on node reassignment. Don't allow reassignment of nodes with active rate objects. Update `esw_qos_node_set_parent()` to handle cases where the parent is NULL. A NULL parent indicates that the scheduling element is attached to the root scheduling element, and since only rate nodes can be connected to the root, this update is now necessary. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1741642016-44918-5-git-send-email-tariqt@nvidia.com Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18net/mlx5: Preserve rate settings when creating a rate nodeCarolina Jubran
Modify `esw_qos_create_node_sched_elem()` to receive max_rate and bw_share values while maintaining the previous configuration. This change is essential for the upcoming patch that will modify rate nodes and requires the existing settings to be preserved unless explicitly changed. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1741642016-44918-4-git-send-email-tariqt@nvidia.com Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18net/mlx5: Introduce hierarchy level tracking on scheduling nodesCarolina Jubran
Add a `level` field to `mlx5_esw_sched_node` to track the hierarchy depth of each scheduling node. This allows enforcement of the scheduling depth constraints based on `log_esw_max_sched_depth`. Modify `esw_qos_node_set_parent()` and `__esw_qos_alloc_node()` to correctly assign hierarchy levels. Ensure that nodes inherit their parent’s level incrementally. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1741642016-44918-3-git-send-email-tariqt@nvidia.com Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>