Age | Commit message (Collapse) | Author |
|
RTNL needs to be acquired before state_lock.
Fixes: fdce06bda7e5 ("net/mlx5e: Acquire RTNL lock before RQs/SQs activation/deactivation")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250423083611.324567-5-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Previously the offload of the rule with header rewrite and mirror to
both internal and external destinations is skipped if the encap entry
is not valid. But it shouldn't because driver will try to offload it
again if neighbor is updated and encap entry is valid, to replace the
old FTE added for slow path. But the extra split attr doesn't exist at
that time as the process is skipped, driver then fails to offload it.
To fix this issue, remove the checking and continue the attr process
if encap entry is invalid.
Fixes: b11bde56246e ("net/mlx5e: TC, Offload rewrite and mirror to both internal and external dests")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250423083611.324567-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Initialize the source MAC address when creating the default GID entry.
Since this entry is used only for loopback traffic, it only needs to
be a unicast address. A zeroed-out MAC address is sufficient for this
purpose.
Without this fix, random bits would be assigned as the source address.
If these bits formed a multicast address, the firmware would return an
error, preventing the user from switching to switchdev mode:
Error: mlx5_core: Failed setting eswitch to offloads.
kernel answers: Invalid argument
Fixes: 80f09dfc237f ("net/mlx5: Eswitch, enable RoCE loopback traffic")
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250423083611.324567-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Symbolic (e.g. "vxlan") and custom (e.g. "tunnel_header_0") tunnels
cannot be combined, but the match params interface does not have fields
for matching on vxlan gbp. To match vxlan bgp, the tc_tun layer uses
tunnel_header_0.
Allow matching on both VNI and GBP by matching the VNI with a custom
tunnel header instead of the symbolic field name.
Matching solely on the VNI continues to use the symbolic field name.
Fixes: 74a778b4a63f ("net/mlx5: HWS, added definers handling")
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250423083611.324567-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.15-rc4).
This pull includes wireless and a fix to vxlan which isn't
in Linus's tree just yet. The latter creates with a silent conflict
/ build breakage, so merging it now to avoid causing problems.
drivers/net/vxlan/vxlan_vnifilter.c
094adad91310 ("vxlan: Use a single lock to protect the FDB table")
087a9eb9e597 ("vxlan: vnifilter: Fix unlocked deletion of default FDB entry")
https://lore.kernel.org/20250423145131.513029-1-idosch@nvidia.com
No "normal" conflicts, or adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Signal clearly to the user, via an error, that mixing IPv4 and IPv6
rules in the same matcher is not supported. Previously such cases
silently failed by adding a rule that did not work correctly.
Rules can specify an IP version by one of two fields: IP version or
ethertype. At matcher creation, store whether the template matches on
any of these two fields. If yes, inspect each rule for its corresponding
match value and store the IP version inside the matcher to guard against
inconsistencies with subsequent rules.
Furthermore, also check rules for internal consistency, i.e. verify that
the ethertype and IP version match values do not contradict each other.
The logic applies to inner and outer headers independently, to account
for tunneling.
Rules that do not match on IP addresses are not affected.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250422092540.182091-4-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Replicate some sanity checks that firmware does, since hardware steering
does not go through firmware.
When creating a definer, disallow matching on IP addresses without also
matching on IP version. The latter can be satisfied by matching either
on the version field in the IP header, or on the ethertype field.
Also refuse to match IPv4 IHL alongside IPv6.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250422092540.182091-3-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Unify the check for IP version when creating a definer. A given matcher
is deemed to match on IPv6 if any of the higher order (>31) bits of
source or destination address mask are set.
A single packet cannot mix IP versions between source and destination
addresses, so it makes no sense that they would be decided on
independently.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250422092540.182091-2-mbloch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Relocate the memory allocation for ttc table after the switch statement
that validates params->ns_type in both mlx5_create_inner_ttc_table() and
mlx5_create_ttc_table(). This ensures memory is only allocated after
confirming valid input, eliminating potential memory leaks when invalid
ns_type cases occur.
Fixes: 137f3d50ad2a ("net/mlx5: Support matching on l4_type for ttc_table")
Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250418023814.71789-3-bsdhenrymartin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add NULL check for mlx5_get_flow_namespace() returns in
mlx5_create_inner_ttc_table() and mlx5_create_ttc_table() to prevent
NULL pointer dereference.
Fixes: 137f3d50ad2a ("net/mlx5: Support matching on l4_type for ttc_table")
Signed-off-by: Henry Martin <bsdhenrymartin@gmail.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250418023814.71789-2-bsdhenrymartin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There is a spelling mistake in a mlx5_core_dbg and two spelling mistakes
in comment blocks. Fix them.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Acked-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/20250418135703.542722-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The new GCC 15 warning -Wunterminated-string-initialization reports:
In file included from drivers/net/ethernet/mellanox/mlx5/core/en.h:55,
from drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:34:
drivers/net/ethernet/mellanox/mlx5/core/en_stats.h:57:46: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
57 | #define MLX5E_DECLARE_PTP_RQ_STAT(type, fld) "ptp_rq%d_"#fld, offsetof(type, fld)
| ^~~~~~~~~~~
drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:2279:11: note: in expansion of macro 'MLX5E_DECLARE_PTP_RQ_STAT'
2279 | { MLX5E_DECLARE_PTP_RQ_STAT(struct mlx5e_rq_stats, csum_complete_tail_slow) },
| ^~~~~~~~~~~~~~~~~~~~~~~~~
This stat string is being used in ethtool_sprintf(), so it must be a
valid NUL-terminated string. Currently the string lacks the final NUL
byte (as GCC warns), but by absolute luck, the next byte in memory is a
space (decimal 32) followed by a NUL. "format" is immediately followed
by little-endian size_t:
struct counter_desc {
char format[32]; /* 0 32 */
size_t offset; /* 32 8 */
};
The "offset" member is populated by the stats member offset:
#define MLX5E_DECLARE_PTP_RQ_STAT(type, fld) "ptp_rq%d_"#fld, offsetof(type, fld)
which for this struct mlx5e_rq_stats member, csum_complete_tail_slow, is
32, or space, and then the rest of the "offset" bytes are NULs.
struct mlx5e_rq_stats {
...
u64 csum_complete_tail_slow; /* 32 8 */
The use of vsnprintf(), within ethtool_sprintf(), reads past the end of
"format" and sees the format string as "ptp_rq%d_csum_complete_tail_slow ",
with %d getting resolved by MLX5E_PTP_CHANNEL_IX (value 0):
ethtool_sprintf(data, ptp_rq_stats_desc[i].format,
MLX5E_PTP_CHANNEL_IX);
With an output result of "ptp_rq0_csum_complete_tail_slow", which gets
precisely truncated to 31 characters with a trailing NUL.
So, instead of accidentally getting this correct due to the NUL bytes
at the end of the size_t that happens to follow the format string, just
make the string initializer 1 byte shorter by replacing "%d" with "0",
since MLX5E_PTP_CHANNEL_IX is already hard-coded. This results in no
initializer truncation and no need to call sprintf().
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20250416020109.work.297-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Many drivers populate the stats buffer using C-String based APIs (e.g.
ethtool_sprintf() and ethtool_puts()), usually when building up the
list of stats individually (i.e. with a for() loop). This, however,
requires that the source strings be populated in such a way as to have
a terminating NUL byte in the source.
Other drivers populate the stats buffer directly using one big memcpy()
of an entire array of strings. No NUL termination is needed here, as the
bytes are being directly passed through. Yet others will build up the
stats buffer individually, but also use memcpy(). This, too, does not
need NUL termination of the source strings.
However, there are cases where the strings that populate the
source stats strings are exactly ETH_GSTRING_LEN long, and GCC
15's -Wunterminated-string-initialization option complains that the
trailing NUL byte has been truncated. This situation is fine only if the
driver is using the memcpy() approach. If the C-String APIs are used,
the destination string name will have its final byte truncated by the
required trailing NUL byte applied by the C-string API.
For drivers that are already using memcpy() but have initializers that
truncate the NUL terminator, mark their source strings as __nonstring to
silence the GCC warnings.
For drivers that have initializers that truncate the NUL terminator and
are using the C-String APIs, switch to memcpy() to avoid destination
string truncation and mark their source strings as __nonstring to silence
the GCC warnings. (Also introduce ethtool_cpy() as a helper to make this
an easy replacement).
Specifically the following warnings were investigated and addressed:
../drivers/net/ethernet/chelsio/cxgb/cxgb2.c:364:9: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
364 | "TxFramesAbortedDueToXSCollisions",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/freescale/enetc/enetc_ethtool.c:165:33: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
165 | { ENETC_PM_R1523X(0), "MAC rx 1523 to max-octet packets" },
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/freescale/enetc/enetc_ethtool.c:190:33: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
190 | { ENETC_PM_T1523X(0), "MAC tx 1523 to max-octet packets" },
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/google/gve/gve_ethtool.c:76:9: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
76 | "adminq_dcfg_device_resources_cnt", "adminq_set_driver_parameter_cnt",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c:117:53: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
117 | STMMAC_STAT(ptp_rx_msg_type_pdelay_follow_up),
| ^
../drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c:46:12: note: in definition of macro 'STMMAC_STAT'
46 | { #m, sizeof_field(struct stmmac_extra_stats, m), \
| ^
../drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:328:24: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
328 | .str = "a_mac_control_frames_transmitted",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c:340:24: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (33 chars into 32 available) [-Wunterminated-string-initialization]
340 | .str = "a_pause_mac_ctrl_frames_received",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Petr Machata <petrm@nvidia.com> # for mlxsw
Reviewed-by: Harshitha Ramamurthy <hramamurthy@google.com>
Link: https://patch.msgid.link/20250416010210.work.904-kees@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Previously, device driver IPSec offload implementations would fall into
two categories:
1. Those that used xso.dev to determine the offload device.
2. Those that used xso.real_dev to determine the offload device.
The first category didn't work with bonding while the second did.
In a non-bonding setup the two pointers are the same.
This commit adds explicit pointers for the offload netdevice to
.xdo_dev_state_add() / .xdo_dev_state_delete() / .xdo_dev_state_free()
which eliminates the confusion and allows drivers from the first
category to work with bonding.
xso.real_dev now becomes a private pointer managed by the bonding
driver.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
The policy offload struct was reused from the state offload and
real_dev was copied from dev, but it was never set to anything else.
Simplify the code by always using xdo.dev for policies.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
xso.real_dev is the active device of an offloaded xfrm state and is
managed by bonding. As such, it's subject to change when states are
migrated to a new device. Using it in places other than
offloading/unoffloading the states is risky.
This commit saves the device into the driver-specific struct
mlx5e_ipsec_sa_entry and switches mlx5e_ipsec_init_macs() and
mlx5e_ipsec_netevent_event() to make use of it.
Additionally, mlx5e_xfrm_update_stats() used xso.real_dev to validate
that correct net locks are held. But in a bonding config, the net of the
master device is the same as the underlying devices, and the net is
already a local var, so use that instead.
The only remaining references to xso.real_dev are now in the
.xdo_dev_state_add() / .xdo_dev_state_delete() path.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
The PTP_PEROUT_REQUEST2 ioctl has gained support for flags specifying
specific output behavior including PTP_PEROUT_ONE_SHOT,
PTP_PEROUT_DUTY_CYCLE, PTP_PEROUT_PHASE.
Driver authors are notorious for not checking the flags of the request.
This results in misinterpreting the request, generating an output signal
that does not match the requested value. It is anticipated that even more
flags will be added in the future, resulting in even more broken requests.
Expecting these issues to be caught during review or playing whack-a-mole
after the fact is not a great solution.
Instead, introduce the supported_perout_flags field in the ptp_clock_info
structure. Update the core character device logic to explicitly reject any
request which has a flag not on this list.
This ensures that drivers must 'opt in' to the flags they support. Drivers
which don't set the .supported_perout_flags field will not need to check
that unsupported flags aren't passed, as the core takes care of this.
Update the drivers which do support flags to set this new field.
Note the following driver files set n_per_out to a non-zero value but did
not check the flags at all:
• drivers/ptp/ptp_clockmatrix.c
• drivers/ptp/ptp_idt82p33.c
• drivers/ptp/ptp_fc3.c
• drivers/net/ethernet/ti/am65-cpts.c
• drivers/net/ethernet/aquantia/atlantic/aq_ptp.c
• drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c
• drivers/net/dsa/sja1105/sja1105_ptp.c
• drivers/net/ethernet/freescale/dpaa2/dpaa2-ptp.c
• drivers/net/ethernet/mscc/ocelot_vsc7514.c
• drivers/net/ethernet/intel/i40e/i40e_ptp.c
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250414-jk-supported-perout-flags-v2-2-f6b17d15475c@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The PTP_EXTTS_REQUEST(2) ioctl has a flags field which specifies how the
external timestamp request should behave. This includes which edge of the
signal to timestamp, as well as a specialized "offset" mode. It is expected
that more flags will be added in the future.
Driver authors routinely do not check the flags, often accepting requests
with flags which they do not support. Even drivers which do check flags may
not be future-proofed to reject flags not yet defined. Thus, any future
flag additions often require manually updating drivers to reject these
flags.
This approach of hoping we catch flag checks during review, or playing
whack-a-mole after the fact is the wrong approach.
Introduce the "supported_extts_flags" field to the ptp_clock_info
structure. This field defines the set of flags the device actually
supports.
Update the core character device logic to check this field and reject
unsupported requests. Getting this right is somewhat tricky. First, to
avoid unnecessary repetition and make basic functionality work when
.supported_extts_flags is 0, the core always accepts the PTP_ENABLE_FEATURE
flag. This flag is used to set the 'on' parameter to the .enable function
and is thus always 'supported' by all drivers.
For backwards compatibility, the PTP_RISING_EDGE and PTP_FALLING_EDGE flags
are merely "hints" when using the old PTP_EXTTS_REQUEST ioctl, and are not
expected to be enforced. If the user issues PTP_EXTTS_REQUEST2, the
PTP_STRICT_FLAGS flag is added which is supposed to inform the driver to
strictly validate the flags and reject unsupported requests. To handle
this, first check if the driver reports PTP_STRICT_FLAGS support. If it
does not, then always allow the PTP_RISING_EDGE and PTP_FALLING_EDGE flags.
This keeps backwards compatibility with the original PTP_EXTTS_REQUEST
ioctl where these flags are not guaranteed to be honored.
This way, drivers which do not set the supported_extts_flags will continue
to accept requests for the original PTP_EXTTS_REQUEST ioctl. The core will
automatically reject requests with new flags, and correctly reject requests
with PTP_STRICT_FLAGS, where the driver is supposed to strictly validate
the flags.
Update the various drivers, refactoring their validation logic into the
.supported_extts_flags field. For consistency and readability,
PTP_ENABLE_FEATURE is not set in the supported flags list, and
PTP_EXTTS_EDGES is expanded to PTP_RISING_EDGE | PTP_FALLING_EDGE in all
cases.
Note the following driver files set n_ext_ts to a non-zero value but did
not check flags at all:
• drivers/net/ethernet/freescale/dpaa2/dpaa2-ptp.c
• drivers/net/ethernet/freescale/enetc/enetc_ptp.c
• drivers/net/ethernet/intel/i40e/i40e_ptp.c
• drivers/net/ethernet/marvell/octeontx2/nic/otx2_ptp.c
• drivers/net/ethernet/renesas/ravb_ptp.c
• drivers/net/ethernet/renesas/rtsn.c
• drivers/net/ethernet/renesas/rtsn.h
• drivers/net/ethernet/ti/am65-cpts.c
• drivers/net/ethernet/ti/cpts.h
• drivers/net/ethernet/ti/icssg/icss_iep.c
• drivers/net/ethernet/xscale/ptp_ixp46x.c
• drivers/net/phy/bcm-phy-ptp.c
• drivers/ptp/ptp_ocp.c
• drivers/ptp/ptp_pch.c
• drivers/ptp/ptp_qoriq.c
These drivers behavior does change slightly: they will now reject the
PTP_EXTTS_REQUEST2 ioctl, because they do not strictly validate their
flags. This also makes them no longer incorrectly accept PTP_EXT_OFFSET.
Also note that the renesas ravb driver does not support PTP_STRICT_FLAGS.
We could leave the .supported_extts_flags as 0, but I added the
PTP_RISING_EDGE | PTP_FALLING_EDGE since the driver previously manually
validated these flags. This is equivalent to 0 because the core will allow
these flags regardless unless PTP_STRICT_FLAGS is also set.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250414-jk-supported-perout-flags-v2-1-f6b17d15475c@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Introduce a new type of dump object and dump all action STE tables,
along with information on their RTCs and STEs.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Hamdan Agbariya <hamdani@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-13-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Periodically check for unused action STE tables and free their
associated resources. In order to do this safely, add a per-queue lock
to synchronize the garbage collect work with regular operations on
steering rules.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-12-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the matcher action STE implementation now that the code uses
per-queue action STE pools. This also allows simplifying matcher code
because it is now only handling a single type of RTC/STE.
The matcher resize data is also going away. Matchers were saving old
action STE data because the rules still used it, but now that data lives
in the action STE pool and is no longer coupled to a matcher.
Furthermore, matchers no longer need to rehash a due to action template
addition. If a new action template needs more action STEs, we simply
update the matcher's num_of_action_stes and future rules will allocate
the correct number. Existing rules are unaffected by such an operation
and can continue to use their existing action STEs.
The range action was using the matcher action STE implementation, but
there was no reason to do this other than the container fitting the
purpose. Extract that information to a separate structure.
Finally, stop dumping per-matcher information about action RTCs,
because they no longer exist. A later patch in this series will add
support for dumping action STE pools.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-11-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use the central action STE pool when creating / updating rules.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-10-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Implement a per-queue pool of action STEs that match STEs can link to,
regardless of matcher.
The code relies on hints to optimize whether a given rule is added to
rx-only, tx-only or both. Correspondingly, action STEs need to be added
to different RTC for ingress or egress paths. For rx-and-tx rules, the
current rule implementation dictates that the offsets for a given rule
must be the same in both RTCs.
To avoid wasting STEs, each action STE pool element holds 3 pools:
rx-only, tx-only, and rx-and-tx, corresponding to the possible values of
the pool optimization enum. The implementation then chooses at rule
creation / update which of these elements to allocate from.
Each element holds multiple action STE tables, which wrap an RTC, an STE
range, the logic to buddy-allocate offsets from the range, and an STC
that allows match STEs to point to this table. When allocating offsets
from an element, we iterate through available action STE tables and, if
needed, create a new table.
Similar to the previous implementation, this iteration does not free any
resources. This is implemented in a subsequent patch.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-9-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The optimization to create a size-one STE range for the unused direction
was broken. The hardware prevents us from creating RTCs over unallocated
STE space, so the only reason this has worked so far is because the
optimization was never used.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-8-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Future users will need to query whether a pool is empty.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-7-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove members which are now no longer used. In fact, many of the
`struct mlx5hws_pool_chunk` were not even written to beyond being
initialized, but they were used in various internals.
Also cleanup some local variables which made more sense when the API was
thicker.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Refactor the pool implementation to remove unused flags and clarify its
usage. A pool represents a single range of STEs or STCs which are
allocated at pool creation time.
Pools are used under three patterns:
1. STCs are allocated one at a time from a global pool using a bitmap
based implementation.
2. Action STEs are allocated in power-of-two blocks using a buddy
algorithm.
3. Match STEs do not use allocation, since insertion into these tables
is based on hashes or direct addressing. In such cases we use a pool
only to create the STE range.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The pool implementation claimed to support multiple resources, but this
does not really make sense in context. Callers always allocate a single
STC or STE chunk of exactly the size provided.
The code that handled multiple resources was unused (and likely buggy)
due to the combination of flags passed by callers.
Simplify the pool by having it handle a single resource. As a result of
this simplification, chunks no longer contain a resource offset (there
is now only one resource per pool), and the get_base_id functions no
longer take a chunk parameter.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the array of elements wrapped in a struct because in reality only
the first element was ever used.
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The procedure of attaching an action template to an existing matcher had
a few issues:
1. Attaching accidentally overran the `at` array in bwc_matcher, which
would result in memory corruption. This bug wasn't triggered, but it
is possible to trigger it by attaching action templates beyond the
initial buffer size of 8. Fix this by converting to a dynamically
sized buffer and reallocating if needed.
2. Similarly, the `at` array inside the native matcher was never
reallocated. Fix this the same as above.
3. The bwc layer treated any error in action template attach as a signal
that the matcher should be rehashed to account for a larger number of
action STEs. In reality, there are other unrelated errors that can
arise and they should be propagated upstack. Fix this by adding a
`need_rehash` output parameter that's orthogonal to error codes.
Fixes: 2111bb970c78 ("net/mlx5: HWS, added backward-compatible API handling")
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Link: https://patch.msgid.link/1744312662-356571-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since we are about to stash some more information into the pp_magic
field, let's move the magic signature checks into a pair of helper
functions so it can be changed in one place.
Reviewed-by: Mina Almasry <almasrymina@google.com>
Tested-by: Yonglong Liu <liuyonglong@huawei.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20250409-page-pool-track-dma-v9-1-6a9ef2e0cba8@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
With commit 8533b14b3d65 ("eth: mlx4: create a page pool for Rx") mlx4
started using functions guarded by PAGE_POOL. This change introduced
build errors when CONFIG_MLX4_EN is set but CONFIG_PAGE_POOL is not:
ld: vmlinux.o: in function `mlx4_en_alloc_frags':
en_rx.c:(.text+0xa5eaf9): undefined reference to `page_pool_alloc_pages'
ld: vmlinux.o: in function `mlx4_en_create_rx_ring':
(.text+0xa5ee91): undefined reference to `page_pool_create'
Make MLX4_EN select PAGE_POOL to fix the ml;x4 build errors.
Fixes: 8533b14b3d65 ("eth: mlx4: create a page pool for Rx")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250401015315.2306092-1-gthelen@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Rather tiny pull request, mostly so that we can get into our trees
your fix to the x86 Makefile.
Current release - regressions:
- Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc", error
queue accounting was missed
Current release - new code bugs:
- 5 fixes for the netdevice instance locking work
Previous releases - regressions:
- usbnet: restore usb%d name exception for local mac addresses
Previous releases - always broken:
- rtnetlink: allocate vfinfo size for VF GUIDs when supported, avoid
spurious GET_LINK failures
- eth: mana: Switch to page pool for jumbo frames
- phy: broadcom: Correct BCM5221 PHY model detection
Misc:
- selftests: drv-net: replace helpers for referring to other files"
* tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits)
Revert "tcp: avoid atomic operations on sk->sk_rmem_alloc"
bnxt_en: bring back rtnl lock in bnxt_shutdown
eth: gve: add missing netdev locks on reset and shutdown paths
selftests: mptcp: ignore mptcp_diag binary
selftests: mptcp: close fd_in before returning in main_loop
selftests: mptcp: fix incorrect fd checks in main_loop
mptcp: fix NULL pointer in can_accept_new_subflow
octeontx2-af: Free NIX_AF_INT_VEC_GEN irq
octeontx2-af: Fix mbox INTR handler when num VFs > 64
net: fix use-after-free in the netdev_nl_sock_priv_destroy()
selftests: net: use Path helpers in ping
selftests: net: use the dummy bpf from net/lib
selftests: drv-net: replace the rpath helper with Path objects
net: lapbether: use netdev_lockdep_set_classes() helper
net: phy: broadcom: Correct BCM5221 PHY model detection
net: usb: usbnet: restore usb%d name exception for local mac addresses
net/mlx5e: SHAMPO, Make reserved size independent of page size
net: mana: Switch to page pool for jumbo frames
MAINTAINERS: Add dedicated entries for phy_link_topology
net: move replay logic to tc_modify_qdisc
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull fwctl subsystem from Jason Gunthorpe:
"fwctl is a new subsystem intended to bring some common rules and order
to the growing pattern of exposing a secure FW interface directly to
userspace.
Unlike existing places like RDMA/DRM/VFIO/uacce that are exposing a
device for datapath operations fwctl is focused on debugging,
configuration and provisioning of the device. It will not have the
necessary features like interrupt delivery to support a datapath.
This concept is similar to the long standing practice in the "HW" RAID
space of having a device specific misc device to manage the RAID
controller FW. fwctl generalizes this notion of a companion debug and
management interface that goes along with a dataplane implemented in
an appropriate subsystem.
There have been three LWN articles written discussing various aspects
of this:
https://lwn.net/Articles/955001/
https://lwn.net/Articles/969383/
https://lwn.net/Articles/990802/
This includes three drivers to launch the subsystem:
- CXL provides a vendor scheme for executing commands and a way to
learn the 'command effects' (ie the security properties) of such
commands. The fwctl driver allows access to these mechanism within
the fwctl security model
- mlx5 is family of networking products, the driver supports all
current Mellanox HW still receiving FW feature updates. This
includes RDMA multiprotocol NICs like ConnectX and the Bluefield
family of Smart NICs.
- AMD/Pensando Distributed Services card is a multi protocol Smart
NIC with a multi PCI function design. fwctl works on the management
PCI function following a 'command effects' model similar to CXL"
* tag 'for-linus-fwctl' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (30 commits)
pds_fwctl: add Documentation entries
pds_fwctl: add rpc and query support
pds_fwctl: initial driver framework
pds_core: add new fwctl auxiliary_device
pds_core: specify auxiliary_device to be created
pds_core: make pdsc_auxbus_dev_del() void
cxl: Fixup kdoc issues for include/cxl/features.h
fwctl/cxl: Add documentation to FWCTL CXL
cxl/test: Add Set Feature support to cxl_test
cxl/test: Add Get Feature support to cxl_test
cxl: Add support to handle user feature commands for set feature
cxl: Add support to handle user feature commands for get feature
cxl: Add support for fwctl RPC command to enable CXL feature commands
cxl: Move cxl feature command structs to user header
cxl: Add FWCTL support to CXL
mlx5: Create an auxiliary device for fwctl_mlx5
fwctl/mlx5: Support for communicating with mlx5 fw
fwctl: Add documentation
fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
taint: Add TAINT_FWCTL
...
|
|
When hw-gro is enabled, the maximum number of header entries that are
needed per wqe (hd_per_wqe) is calculated based on the size of the
reservations among other parameters.
Miscalculation of the size of reservations leads to incorrect
calculation of hd_per_wqe as 0, particularly in the case of large page
size like in aarch64, this prevents the SHAMPO header from being
correctly initialized in the device, ultimately causing the following
cqe err that indicates a violation of PD.
mlx5_core 0000:00:08.0 eth2: ERR CQE on RQ: 0x1180
mlx5_core 0000:00:08.0 eth2: Error cqe on cqn 0x510, ci 0x0, qn 0x1180, opcode 0xe, syndrome 0x4, vendor syndrome 0x32
00000000: 00 00 00 00 04 4a 00 00 00 00 00 00 20 00 93 32
00000010: 55 00 00 00 fb cc 00 00 00 00 00 00 07 18 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 4a
00000030: 00 00 00 9a 93 00 32 04 00 00 00 00 00 00 da e1
Use the correct formula for calculating the size of reservations,
precisely it shouldn't be dependent on page size, instead use the
correct multiply of MLX5E_SHAMPO_WQ_BASE_RESRV_SIZE.
Fixes: e5ca8fb08ab2 ("net/mlx5e: Add control path for SHAMPO feature")
Signed-off-by: Lama Kayal <lkayal@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1742732906-166564-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Merge in late fixes to prepare for the 6.15 net-next PR.
No conflicts, adjacent changes:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
919f9f497dbc ("eth: bnxt: fix out-of-range access of vnic_info array")
fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2025-03-24
1) Prevent setting high order sequence number bits input in
non-ESN mode. From Leon Romanovsky.
2) Support PMTU handling in tunnel mode for packet offload.
From Leon Romanovsky.
3) Make xfrm_state_lookup_byaddr lockless.
From Florian Westphal.
4) Remove unnecessary NULL check in xfrm_lookup_with_ifid().
From Dan Carpenter.
* tag 'ipsec-next-2025-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: Remove unnecessary NULL check in xfrm_lookup_with_ifid()
xfrm: state: make xfrm_state_lookup_byaddr lockless
xfrm: check for PMTU in tunnel mode for packet offload
xfrm: provide common xdo_dev_offload_ok callback implementation
xfrm: rely on XFRM offload
xfrm: simplify SA initialization routine
xfrm: delay initialization of offload path till its actually requested
xfrm: prevent high SEQ input in non-ESN mode
====================
Link: https://patch.msgid.link/20250324061855.4116819-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For CT action with commit argument, it's usually followed by the
forward action, either to the output netdev or next chain. The default
behavior for software is to drop by setting action attribute to
TC_ACT_SHOT instead of TC_ACT_PIPE if it's the last action. But driver
can't handle it, so block the offload for such case.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1742392983-153050-6-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In nic mode CT setup where we do hairpin between the two
nics, both nics register to the same flow table (per zone),
and try to offload all rules on it.
Instead, filter the rules that originated from the relevant nic
(so only one side is offloaded for each nic).
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1742392983-153050-5-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Align mlx5 driver usage of 'pfnum' with the documentation clarification
introduced in commit bb70b0d48d8e ("devlink: Improve the port attributes
description").
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1742392983-153050-4-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, mlx5_is_reset_now_capable() checks whether the pci bridge is
accessible only on bridge hot plug capability check. If the pci bridge
is not accessible, reset now will fail regardless of bridge hotplug
capability. Move this check to function mlx5_is_reset_now_capable()
which, in such case, aborts the reset and does so in the request phase
instead of the reset now phase.
Signed-off-by: Aya Levin <ayal@nvidia.com>
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Amir Tzin <amirtz@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1742392983-153050-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As queue affinity is being deprecated and will no longer be supported
in the future, Always check for the presence of the port selection
namespace. When available, leverage it to distribute traffic
across the physical ports via steering, ensuring compatibility with
future NICs.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/1742392983-153050-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For simplicity reasons, the driver avoids crossing work queue fragment
boundaries within the same TX WQE (Work-Queue Element). Until today, as
the number of packets in a TX MPWQE (Multi-Packet WQE) descriptor is not
known in advance, the driver pre-prepared contiguous memory for the
largest possible WQE. For this, when getting too close to the fragment
edge, having no room for the largest WQE possible, the driver was
filling the fragment remainder with NOP descriptors, aligning the next
descriptor to the beginning of the next fragment.
Generating and handling these NOPs wastes resources, like: CPU cycles,
work-queue entries fetched to the device, and PCI bandwidth.
In this patch, we replace this NOPs filling mechanism in the TX MPWQE
flow. Instead, we utilize the remaining entries of the fragment with a
TX MPWQE. If this room turns out to be too small, we simply open an
additional descriptor starting at the beginning of the next fragment.
Performance benchmark:
uperf test, single server against 3 clients.
TCP multi-stream, bidir, traffic profile "2x350B read, 1400B write".
Bottleneck is in inbound PCI bandwidth (device POV).
+---------------+------------+------------+--------+
| | Before | After | |
+---------------+------------+------------+--------+
| BW | 117.4 Gbps | 121.1 Gbps | +3.1% |
+---------------+------------+------------+--------+
| tx_packets | 15 M/sec | 15.5 M/sec | +3.3% |
+---------------+------------+------------+--------+
| tx_nops | 3 M/sec | 0 | -100% |
+---------------+------------+------------+--------+
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1742391746-118647-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The health poll mechanism performs periodic checks to detect firmware
errors. One of the checks verifies the function is still enabled on
firmware side, but the function is enabled only after enable_hca command
completed. Start health poll after enable_hca command to avoid a race
between function enabled and first health polling.
Fixes: 9b98d395b85d ("net/mlx5: Start health poll at earlier stage of driver load")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/1742331077-102038-3-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When LAG creation fails, the driver reloads the RDMA devices. If RDMA
representors are present, they should also be reloaded. This step was
missed in the cited commit.
Fixes: 598fe77df855 ("net/mlx5: Lag, Create shared FDB when in switchdev mode")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Link: https://patch.msgid.link/1742331077-102038-2-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is a workaround to mitigate a compiler anomaly.
During LLVM toolchain compilation of this driver on s390x architecture, an
unreasonable __write_overflow_field warning occurs.
Contextually, chunk_index is restricted to 0, 1 or 2. By expanding these
possibilities, the compile warning is suppressed.
Fix follow error with clang-19 when -Werror:
In file included from drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_bloom_filter.c:5:
In file included from ./include/linux/gfp.h:7:
In file included from ./include/linux/mmzone.h:8:
In file included from ./include/linux/spinlock.h:63:
In file included from ./include/linux/lockdep.h:14:
In file included from ./include/linux/smp.h:13:
In file included from ./include/linux/cpumask.h:12:
In file included from ./include/linux/bitmap.h:13:
In file included from ./include/linux/string.h:392:
./include/linux/fortify-string.h:571:4: error: call to '__write_overflow_field' declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror,-Wattribute-warning]
571 | __write_overflow_field(p_size_field, size);
| ^
1 error generated.
According to the testing, we can be fairly certain that this is a clang
compiler bug, impacting only clang-19 and below. Clang versions 20 and
21 do not exhibit this behavior.
Link: https://lore.kernel.org/all/484364B641C901CD+20250311141025.1624528-1-wangyuli@uniontech.com/
Fixes: 7585cacdb978 ("mlxsw: spectrum_acl: Add Bloom filter handling")
Co-developed-by: Zijian Chen <czj2441@163.com>
Signed-off-by: Zijian Chen <czj2441@163.com>
Co-developed-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Link: https://patch.msgid.link/A1858F1D36E653E0+20250318103654.708077-1-wangyuli@uniontech.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When hardware floods packets to bridge ports, but flooding to VXLAN bridge
port fails during encapsulation to one of the remote VTEPs, the packets are
trapped to CPU. In such case, the packets are marked with
skb->offload_fwd_mark, which means that packet was L2-forwarded in
hardware. Software data path repeats flooding, but packets which are
marked with skb->offload_fwd_mark will not be flooded by the bridge to
bridge ports which are in the same hardware domain as the ingress port.
Currently, mlxsw does not add VXLAN bridge ports to the same hardware
domain as physical bridge ports despite the fact that the device is able
to forward packets to and from VXLAN tunnels in hardware. In some scenarios
(as mentioned above) this can result in remote VTEPs receiving duplicate
packets. The packets are first flooded by hardware and after an
encapsulation failure, they are flooded again to all remote VTEPs by
software.
Solve this by adding VXLAN bridge ports to the same hardware domain as
physical bridge ports, so then nbp_switchdev_allowed_egress() will return
false also for VXLAN, and packets will not be sent twice from VXLAN device.
switchdev_bridge_port_offload() should get vxlan_dev not as const, so
some changes are required. Call switchdev API from
mlxsw_sp_bridge_vxlan_{join,leave}() which handle offload configurations.
Reported-by: Vladimir Oltean <olteanv@gmail.com>
Closes: https://lore.kernel.org/all/20250210152246.4ajumdchwhvbarik@skbuf/
Reported-by: Vladyslav Mykhaliuk <vmykhaliuk@nvidia.com>
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/7279056843140fae3a72c2d204c7886b79d03899.1742224300.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Next patch will call __mlxsw_sp_bridge_vxlan_leave() from
mlxsw_sp_bridge_vxlan_join() as part of error flow, move the function to
be able to call the second one.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/64750a0965536530482318578bada30fac372b8a.1742224300.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
There is asymmetry in how the VXLAN join and leave functions are used.
The join function (mlxsw_sp_bridge_vxlan_join()) is only called in
response to netdev events (e.g., VXLAN device joining a bridge), but the
leave function is also called in response to switchdev events (e.g.,
VLAN configuration on top of the VXLAN device) in order to invalidate
VNI to FID mappings.
This asymmetry will cause problems when the functions will be later
extended to mark VXLAN bridge ports as offloaded or not.
Therefore, create an internal function (__mlxsw_sp_bridge_vxlan_leave())
that is used to invalidate VNI to FID mappings and call it from
mlxsw_sp_bridge_vxlan_leave() which will only be invoked in response to
netdev events, like mlxsw_sp_bridge_vxlan_join().
No functional changes intended.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/f3a32bd2d87a0b7ac4d2bb98a427dc6d95a01cd0.1742224300.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|