Age | Commit message (Collapse) | Author |
|
mlx5 driver queries the device for VECTOR_CALC and SHAMPO caps, but
there isn't any user who requires them.
As well as, MLX5_MCAM_REGS_0x9080_0x90FF is queried but not used.
Thus, drop all usages and definitions of the mentioned caps above.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
sw_function_id contains sfnum, so fix the error message to name the
value properly.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Since mlx5_vhca_event_supported() is called in mlx5_sf_dev_supported(),
remove the redundant call.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
MLX5_CAP_GEN()
There is a helper called mlx5_sf_start_function_id() that
wraps up a query to get base SF function id. Use that instead of
calling MLX5_CAP_GEN() directly.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Since mlx5_sf_supported() check is done as a first thing in
mlx5_sf_max_functions(), remove the redundant check.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Instead of using device_put(), use auxiliary_device_uninit() for
auxiliary device uninit which internally just calls device_put().
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Firmware doesn't allow flow rules in FDB to do header rewrite and send
packets to both internal and uplink vports. The following syndrome
will be generated when trying to offload such kind of rules:
mlx5_core 0000:08:00.0: mlx5_cmd_out_err:803:(pid 23569): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x8c8f08), err(-22)
To avoid this syndrome, add a checking before creating FTE. If a rule
with header rewrite action forwards packets to both VF and PF, an
error is returned directly.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Even if the PF driver had no error on his part of the sync reset flow,
the firmware can see wider picture as it syncs all the PFs in the flow.
So add at end of sync reset flow check with firmware by reading MFRL
register and initialization segment that the flow had no issue from
firmware point of view too.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Introduce devlink resource for exposing max possible SFs on mlx5
devices.
For example:
$ devlink resource show pci/0000:00:0b.0
pci/0000:00:0b.0:
name max_local_SFs size 5 unit entry dpipe_tables none
name max_external_SFs size 0 unit entry dpipe_tables none
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
A new check for the tx devlink health reporter is introduced for
determining when the PTP port timestamping SQ is considered unhealthy. If
there are enough CQEs considered never to be delivered, the space that can
be utilized on the SQ decreases significantly, impacting performance and
usability of the SQ. The health reporter is triggered when the number of
likely never delivered port timestamping CQEs that utilize the space of the
PTP SQ is greater than 93.75% of the total capacity of the SQ. A devlink
health reporter recover method is also provided for this specific TX error
context that restarts the PTP SQ.
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Use a map structure for associating CQEs containing port timestamping
information with the appropriate skb. Track order of WQEs submitted using a
FIFO. Check if the corresponding port timestamping CQEs from the lookup
values in the FIFO are considered dropped due to time elapsed. Return the
lookup value to a freelist after consuming the skb. Reuse the freed lookup
in future WQE submission iterations.
The map structure uses an integer identifier for the key and returns an skb
corresponding to that identifier. Embed the integer identifier in the WQE
submitted to the WQ for the transmit path when the SQ is a PTP (port
timestamping) SQ. The embedded identifier can then be queried using a field
in the CQE of the corresponding port timestamping CQ. In the port
timestamping napi_poll context, the identifier is queried from the CQE
polled from CQ and used to lookup the corresponding skb from the WQE submit
path. The skb reference is removed from map and then embedded with the port
HW timestamp information from the CQE and eventually consumed.
The metadata freelist FIFO is an array containing integer identifiers that
can be pushed and popped in the FIFO. The purpose of this structure is
bookkeeping what identifier values can safely be used in a subsequent WQE
submission and should not contain identifiers that have still not been
reaped by processing a corresponding CQE completion on the port
timestamping CQ.
The ts_cqe_pending_list structure is a combination of an array and linked
list. The array is pre-populated with the nodes that will be added and
removed from the head of the linked list. Each node contains the unique
identifier value associated with the values submitted in the WQEs and
retrieved in the port timestamping CQEs. When a WQE is submitted, the node
in the array corresponding to the identifier popped from the metadata
freelist is added to the end of the CQE pending list and is marked as
"in-use". The node is removed from the linked list under two conditions.
The first condition is that the corresponding port timestamping CQE is
polled in the PTP napi_poll context. The second condition is that more than
a second has elapsed since the DMA timestamp value corresponding to the WQE
submission. When the first condition occurs, the "in-use" bit in the linked
list node is cleared, and the resources corresponding to the WQE submission
are then released. The second condition, however, indicates that the port
timestamping CQE will likely never be delivered. It's not impossible for
the device to post a CQE after an infinite amount of time though highly
improbable. In order to be resilient to this improbable case, resources
related to the corresponding WQE submission are still kept, the identifier
value is not returned to the freelist, and the "in-use" bit is cleared on
the node to indicate that it's no longer part of the linked list of "likely
to be delivered" port timestamping CQE identifiers. A count for the number
of port timestamping CQEs considered highly likely to never be delivered by
the device is maintained. This count gets decremented in the unlikely event
a port timestamping CQE considered unlikely to ever be delivered is polled
in the PTP napi_poll context.
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
De-duplicate documentation by removing mellanox/mlx5/devlink.rst. Instead,
only use the generic devlink documentation directory to document mlx5
devlink parameters. Avoid providing general devlink tool usage information
in mlx5-specific documentation.
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Jiri Pirko says:
====================
devlink: introduce selective dumps
Motivation:
For SFs, one devlink instance per SF is created. There might be
thousands of these on a single host. When a user needs to know port
handle for specific SF, he needs to dump all devlink ports on the host
which does not scale good.
Solution:
Allow user to pass devlink handle (and possibly other attributes)
alongside the dump command and dump only objects which are matching
the selection.
Use split ops to generate policies for dump callbacks acccording to
the attributes used for selection.
The userspace can use ctrl genetlink GET_POLICY command to find out if
the selective dumps are supported by kernel for particular command.
Example:
$ devlink port show
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
$ devlink port show auxiliary/mlx5_core.eth.0
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
$ devlink port show auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
Extension:
patches #12 and #13 extends selection attributes by port index
for health reporter dumping.
====================
Link: https://lore.kernel.org/r/20230811155714.1736405-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Allow user to pass port index for health reporter dump request.
Re-generate the related code.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-14-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Introduce a possibility for devlink object to expose attributes it
supports for selection of dumped objects.
Use this by health reporter to indicate it supports port index based
selection of dump objects. Implement this selection mechanism in
devlink_nl_cmd_health_reporter_get_dump_one()
Example:
$ devlink health
pci/0000:08:00.0:
reporter fw
state healthy error 0 recover 0 auto_dump true
reporter fw_fatal
state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
reporter vnic
state healthy error 0 recover 0
pci/0000:08:00.0/32768:
reporter vnic
state healthy error 0 recover 0
pci/0000:08:00.0/32769:
reporter vnic
state healthy error 0 recover 0
pci/0000:08:00.0/32770:
reporter vnic
state healthy error 0 recover 0
pci/0000:08:00.1:
reporter fw
state healthy error 0 recover 0 auto_dump true
reporter fw_fatal
state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
reporter vnic
state healthy error 0 recover 0
pci/0000:08:00.1/98304:
reporter vnic
state healthy error 0 recover 0
pci/0000:08:00.1/98305:
reporter vnic
state healthy error 0 recover 0
pci/0000:08:00.1/98306:
reporter vnic
state healthy error 0 recover 0
$ devlink health show pci/0000:08:00.0
pci/0000:08:00.0:
reporter fw
state healthy error 0 recover 0 auto_dump true
reporter fw_fatal
state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
reporter vnic
state healthy error 0 recover 0
pci/0000:08:00.0/32768:
reporter vnic
state healthy error 0 recover 0
pci/0000:08:00.0/32769:
reporter vnic
state healthy error 0 recover 0
pci/0000:08:00.0/32770:
reporter vnic
state healthy error 0 recover 0
$ devlink health show pci/0000:08:00.0/32768
pci/0000:08:00.0/32768:
reporter vnic
state healthy error 0 recover 0
The last command is possible because of this patch.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-13-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
instance attributes
Extend per-instance dump command definitions to accept instance
attributes. Allow parsing of devlink handle attributes so they could
be used for instance selection.
Re-generate the related code.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-12-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
For SFs, one devlink instance per SF is created. There might be
thousands of these on a single host. When a user needs to know port
handle for specific SF, he needs to dump all devlink ports on the host
which does not scale good.
Allow user to pass devlink handle attributes alongside the dump command
and dump only objects which are under selected devlink instance.
Example:
$ devlink port show
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
$ devlink port show auxiliary/mlx5_core.eth.0
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
$ devlink port show auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-11-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As the commands are already defined in split ops, remove them
from small ops.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-10-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Remove the duplicate temporary netlink callback prototype as the
generated ones are already in place.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-9-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add the definitions for the commands that do per-instance dump
and re-generate the related code.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-8-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In order to easily set NLM_F_DUMP_FILTERED for partial dumps, pass the
flags as an arg of dump_one() callback. Currently, it is always
NLM_F_MULTI.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-7-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Introduce dumpit callbacks for generated split ops. Have them
as a thin wrapper around iteration function and allow to pass dump_one()
function pointer directly without need to store in devlink_cmd structs.
Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-6-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rename netlink doit callback functions for the commands that do
implement per-instance dump to match the generated names that are going
to be introduce in the follow-up patch.
Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-5-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Define port handling helpers what don't rely on internal_flags.
Have __devlink_nl_pre_doit() to accept the flags as a function arg and
make devlink_nl_pre_doit() a wrapper helper function calling it.
Introduce new helpers devlink_nl_pre_doit_port() and
devlink_nl_pre_doit_port_optional() to be used by split ops in follow-up
patch.
Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-4-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
No need to give the rate any special treatment in netlink attributes
parsing, as unlike for ports, there is only a couple of commands
benefiting from that.
Remove DEVLINK_NL_FLAG_NEED_RATE*, make pre_doit() callback simpler
by moving the rate attributes parsing to rate_*_doit() ops.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-3-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
No need to give the linecards any special treatment in netlink attribute
parsing, as unlike for ports, there is only a couple of commands
benefiting from that.
Remove DEVLINK_NL_FLAG_NEED_LINECARD, make pre_doit() callback simpler
by moving the linecard attribute parsing to linecard_[gs]et_doit() ops.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-2-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Running the bench_rename test script, the following error occurs:
# ./benchs/run_bench_rename.sh
base : 0.819 ± 0.012M/s
kprobe : 0.538 ± 0.009M/s
kretprobe : 0.503 ± 0.004M/s
rawtp : 0.779 ± 0.020M/s
fentry : 0.726 ± 0.007M/s
fexit : 0.691 ± 0.007M/s
benchmark 'rename-fmodret' not found
The bench_rename_fmodret has been removed in commit b000def2e052
("selftests: Remove fmod_ret from test_overhead"), thus remove it
from the runners in the test script.
Fixes: b000def2e052 ("selftests: Remove fmod_ret from test_overhead")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230814030727.3010390-1-zouyipeng@huawei.com
|
|
There is no way where topts.repeat can be set to 1 when tc_test fails.
Fix the typo where the break statement slipped by one line.
Fixes: fb66223a244f ("selftests/bpf: add test for accessing ctx from syscall program type")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/bpf/20230814031434.3077944-1-zouyipeng@huawei.com
|
|
Enable the close-on-exec flag when using gzopen. This is especially important
for multithreaded programs making use of libbpf, where a fork + exec could
race with libbpf library calls, potentially resulting in a file descriptor
leaked to the new process. This got missed in 59842c5451fe ("libbpf: Ensure
libbpf always opens files with O_CLOEXEC").
Fixes: 59842c5451fe ("libbpf: Ensure libbpf always opens files with O_CLOEXEC")
Signed-off-by: Marco Vedovati <marco.vedovati@crowdstrike.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230810214350.106301-1-martin.kelly@crowdstrike.com
|
|
The PSGMII interface is similar to QSGMII. The main difference
is that the PSGMII interface combines five SGMII lines into a
single link while in QSGMII only four lines are combined.
Similarly to the QSGMII, this interface mode might also needs
special handling within the MAC driver.
It is commonly used by Qualcomm with their QCA807x PHY series and
modern WiSoC-s.
Add definitions for the PHY layer to allow to express this type
of connection between the MAC and PHY.
Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a new PSGMII mode which is similar to QSGMII with the difference being
that it combines 5 SGMII lines into a single link compared to 4 on QSGMII.
It is commonly used by Qualcomm on their QCA807x PHY series.
Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Petr Machata says:
====================
mlxsw: Support traffic redirection from a locked bridge port
Ido Schimmel writes:
It is possible to add a filter that redirects traffic from the ingress
of a bridge port that is locked (i.e., performs security / SMAC lookup)
and has learning enabled. For example:
# ip link add name br0 type bridge
# ip link set dev swp1 master br0
# bridge link set dev swp1 learning on locked on mab on
# tc qdisc add dev swp1 clsact
# tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw src_ip 192.0.2.1 action mirred egress redirect dev swp2
In the kernel's Rx path, this filter is evaluated before the Rx handler
of the bridge, which means that redirected traffic should not be
affected by bridge port configuration such as learning.
However, the hardware data path is a bit different and the redirect
action (FORWARDING_ACTION in hardware) merely attaches a pointer to the
packet, which is later used by the L2 lookup stage to understand how to
forward the packet. Between both stages - ingress ACL and L2 lookup -
learning and security lookup are performed, which means that redirected
traffic is affected by bridge port configuration, unlike in the kernel's
data path.
The learning discrepancy was handled in commit 577fa14d2100 ("mlxsw:
spectrum: Do not process learned records with a dummy FID") by simply
ignoring learning notifications generated by the redirected traffic. A
similar solution is not possible for the security / SMAC lookup since
- unlike learning - the CPU is not involved and packets that failed the
lookup are dropped by the device.
Instead, solve this by prepending the ignore action to the redirect
action and use it to instruct the device to disable both learning and
the security / SMAC lookup for redirected traffic.
Patch #1 adds the ignore action.
Patch #2 prepends the action to the redirect action in flower offload
code.
Patch #3 removes the workaround in commit 577fa14d2100 ("mlxsw:
spectrum: Do not process learned records with a dummy FID") since it is
no longer needed.
Patch #4 adds a test case.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Check that traffic can be redirected from a locked bridge port and that
it does not create locked FDB entries.
Cc: Hans J. Schultz <netdev@kapio-technology.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
As explained in the previous patch, with the ignore action prepended to
the redirect action, it is not longer possible for redirected traffic to
generate learning notifications.
Therefore, remove the workaround that was added in commit 577fa14d2100
("mlxsw: spectrum: Do not process learned records with a dummy FID") as
it is no longer needed.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It is possible to add a filter that redirects traffic from the ingress
of a bridge port that is locked (i.e., performs security / SMAC lookup)
and has learning enabled. For example:
# ip link add name br0 type bridge
# ip link set dev swp1 master br0
# bridge link set dev swp1 learning on locked on mab on
# tc qdisc add dev swp1 clsact
# tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw src_ip 192.0.2.1 action mirred egress redirect dev swp2
In the kernel's Rx path, this filter is evaluated before the Rx handler
of the bridge, which means that redirected traffic should not be
affected by bridge port configuration such as learning.
However, the hardware data path is a bit different and the redirect
action (FORWARDING_ACTION in hardware) merely attaches a pointer to the
packet, which is later used by the L2 lookup stage to understand how to
forward the packet. Between both stages - ingress ACL and L2 lookup -
learning and security lookup are performed, which means that redirected
traffic is affected by bridge port configuration, unlike in the kernel's
data path.
The learning discrepancy was handled in commit 577fa14d2100 ("mlxsw:
spectrum: Do not process learned records with a dummy FID") by simply
ignoring learning notifications generated by the redirected traffic. A
similar solution is not possible for the security / SMAC lookup since
- unlike learning - the CPU is not involved and packets that failed the
lookup are dropped by the device.
Instead, solve this by prepending the ignore action to the redirect
action and use it to instruct the device to disable both learning and
the security / SMAC lookup for redirected traffic.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add the IGNORE_ACTION which is used to ignore basic switching functions
such as learning on a per-packet basis.
The action will be prepended to the FORWARDING_ACTION in subsequent
patches.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
1. Show TSSTSSEL(Timestamp System Time Source),
ADDMACADRSEL(additional MAC addresses), SMASEL(SMA/MDIO Interface),
HDSEL(Half-duplex Support) in debugfs.
2. Show exact number of additional MAC address registers for XGMAC2 core.
3. XGMAC2 core does not have different IP checksum offload types, so just
show rx_coe instead of rx_coe_type1 or rx_coe_type2.
4. XGMAC2 core does not have rxfifo_over_2048 definition, skip it.
Signed-off-by: Furong Xu <0x1207@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Li Zetao says:
====================
Use helper functions to update stats
The patch set uses the helper functions dev_sw_netstats_rx_add() and
dev_sw_netstats_tx_add() to update stats, which is the same as
implementing the function separately.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use the helper functions dev_sw_netstats_rx_add() and
dev_sw_netstats_tx_add() to update stats, which helps to
provide code readability.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use the helper functions dev_sw_netstats_rx_add() and
dev_sw_netstats_tx_add() to update stats, which helps to
provide code readability.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.
Background:
The vmxnet3 rx consists of three rings: ring0, ring1, and dataring.
For r0 and r1, buffers at r0 are allocated using alloc_skb APIs and dma
mapped to the ring's descriptor. If LRO is enabled and packet size larger
than 3K, VMXNET3_MAX_SKB_BUF_SIZE, then r1 is used to mapped the rest of
the buffer larger than VMXNET3_MAX_SKB_BUF_SIZE. Each buffer in r1 is
allocated using alloc_page. So for LRO packets, the payload will be in one
buffer from r0 and multiple from r1, for non-LRO packets, only one
descriptor in r0 is used for packet size less than 3k.
When receiving a packet, the first descriptor will have the sop (start of
packet) bit set, and the last descriptor will have the eop (end of packet)
bit set. Non-LRO packets will have only one descriptor with both sop and
eop set.
Other than r0 and r1, vmxnet3 dataring is specifically designed for
handling packets with small size, usually 128 bytes, defined in
VMXNET3_DEF_RXDATA_DESC_SIZE, by simply copying the packet from the backend
driver in ESXi to the ring's memory region at front-end vmxnet3 driver, in
order to avoid memory mapping/unmapping overhead. In summary, packet size:
A. < 128B: use dataring
B. 128B - 3K: use ring0 (VMXNET3_RX_BUF_SKB)
C. > 3K: use ring0 and ring1 (VMXNET3_RX_BUF_SKB + VMXNET3_RX_BUF_PAGE)
As a result, the patch adds XDP support for packets using dataring
and r0 (case A and B), not the large packet size when LRO is enabled.
XDP Implementation:
When user loads and XDP prog, vmxnet3 driver checks configurations, such
as mtu, lro, and re-allocate the rx buffer size for reserving the extra
headroom, XDP_PACKET_HEADROOM, for XDP frame. The XDP prog will then be
associated with every rx queue of the device. Note that when using dataring
for small packet size, vmxnet3 (front-end driver) doesn't control the
buffer allocation, as a result we allocate a new page and copy packet
from the dataring to XDP frame.
The receive side of XDP is implemented for case A and B, by invoking the
bpf program at vmxnet3_rq_rx_complete and handle its returned action.
The vmxnet3_process_xdp(), vmxnet3_process_xdp_small() function handles
the ring0 and dataring case separately, and decides the next journey of
the packet afterward.
For TX, vmxnet3 has split header design. Outgoing packets are parsed
first and protocol headers (L2/L3/L4) are copied to the backend. The
rest of the payload are dma mapped. Since XDP_TX does not parse the
packet protocol, the entire XDP frame is dma mapped for transmission
and transmitted in a batch. Later on, the frame is freed and recycled
back to the memory pool.
Performance:
Tested using two VMs inside one ESXi vSphere 7.0 machine, using single
core on each vmxnet3 device, sender using DPDK testpmd tx-mode attached
to vmxnet3 device, sending 64B or 512B UDP packet.
VM1 txgen:
$ dpdk-testpmd -l 0-3 -n 1 -- -i --nb-cores=3 \
--forward-mode=txonly --eth-peer=0,<mac addr of vm2>
option: add "--txonly-multi-flow"
option: use --txpkts=512 or 64 byte
VM2 running XDP:
$ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> --skb-mode
$ ./samples/bpf/xdp_rxq_info -d ens160 -a <options>
options: XDP_DROP, XDP_PASS, XDP_TX
To test REDIRECT to cpu 0, use
$ ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
Single core performance comparison with skb-mode.
64B: skb-mode -> native-mode
XDP_DROP: 1.6Mpps -> 2.4Mpps
XDP_PASS: 338Kpps -> 367Kpps
XDP_TX: 1.1Mpps -> 2.3Mpps
REDIRECT-drop: 1.3Mpps -> 2.3Mpps
512B: skb-mode -> native-mode
XDP_DROP: 863Kpps -> 1.3Mpps
XDP_PASS: 275Kpps -> 376Kpps
XDP_TX: 554Kpps -> 1.2Mpps
REDIRECT-drop: 659Kpps -> 1.2Mpps
Demo: https://youtu.be/4lm1CSCi78Q
Future work:
- XDP frag support
- use napi_consume_skb() instead of dev_kfree_skb_any at unmap
- stats using u64_stats_t
- using bitfield macro BIT()
- optimization for DMA synchronization using actual frame length,
instead of always max_len
Signed-off-by: William Tu <u9012063@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Adrian Moreno says:
====================
openvswitch: add drop reasons
There is currently a gap in drop visibility in the openvswitch module.
This series tries to improve this by adding a new drop reason subsystem
for OVS.
Apart from adding a new drop reasson subsystem and some common drop
reasons, this series takes Eric's preliminary work [1] on adding an
explicit drop action and integrates it into the same subsystem.
A limitation of this series is that it does not report upcall errors.
The reason is that there could be many sources of upcall drops and the
most common one, which is the netlink buffer overflow, cannot be
reported via kfree_skb() because the skb is freed in the netlink layer
(see [2]). Therefore, using a reason for the rare events and not the
common one would be even more misleading. I'd propose we add (in a
follow up patch) a tracepoint to better report upcall errors.
[1] https://lore.kernel.org/netdev/202306300609.tdRdZscy-lkp@intel.com/T/
[2] commit 1100248a5c5c ("openvswitch: Fix double reporting of drops in dropwatch")
---
v4 -> v5:
- Rebased
- Added a helper function to explicitly convert drop reason enum types
v3 -> v4:
- Changed names of errors following Ilya's suggestions
- Moved the ovs-dpctl.py changes from patch 7/7 to 3/7
- Added a test to ensure actions following a drop are rejected
rfc2 -> v3:
- Rebased on top of latest net-next
rfc1 -> rfc2:
- Fail when an explicit drop is not the last
- Added a drop reason for action errors
- Added braces around macros
- Dropped patch that added support for masks in ovs-dpctl.py as it's now
included in Aaron's series [2].
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Test explicit drops generate the right drop reason. Also, verify that
the kernel rejects flows with actions following an explicit drop.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Test if the correct drop reason is reported when OVS drops a packet due
to an explicit flow.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use drop reasons from include/net/dropreason-core.h when a reasonable
candidate exists.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
By using an independent drop reason it makes it easy to distinguish
between QoS-triggered or flow-triggered drop.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
From: Eric Garver <eric@garver.life>
This adds an explicit drop action. This is used by OVS to drop packets
for which it cannot determine what to do. An explicit action in the
kernel allows passing the reason _why_ the packet is being dropped or
zero to indicate no particular error happened (i.e: OVS intentionally
dropped the packet).
Since the error codes coming from userspace mean nothing for the kernel,
we squash all of them into only two drop reasons:
- OVS_DROP_EXPLICIT_WITH_ERROR to indicate a non-zero value was passed
- OVS_DROP_EXPLICIT to indicate a zero value was passed (no error)
e.g. trace all OVS dropped skbs
# perf trace -e skb:kfree_skb --filter="reason >= 0x30000"
[..]
106.023 ping/2465 skb:kfree_skb(skbaddr: 0xffffa0e8765f2000, \
location:0xffffffffc0d9b462, protocol: 2048, reason: 196611)
reason: 196611 --> 0x30003 (OVS_DROP_EXPLICIT)
Also, this patch allows ovs-dpctl.py to add explicit drop actions as:
"drop" -> implicit empty-action drop
"drop(0)" -> explicit non-error action drop
"drop(42)" -> explicit error action drop
Signed-off-by: Eric Garver <eric@garver.life>
Co-developed-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a drop reason for packets that are dropped because an action
returns a non-zero error code.
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Create a new drop reason subsystem for openvswitch and add the first
drop reason to represent last-action drops.
Last-action drops happen when a flow has an empty action list or there
is no action that consumes the packet (output, userspace, recirc, etc).
It is the most common way in which OVS drops packets.
Implementation-wise, most of these skb-consuming actions already call
"consume_skb" internally and return directly from within the
do_execute_actions() loop so with minimal changes we can assume that
any skb that exits the loop normally is a packet drop.
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Matthieu Baerts says:
====================
mptcp: get rid of msk->subflow
The MPTCP protocol maintains an additional struct socket per connection,
mainly to be able to easily use tcp-level struct socket operations.
This leads to several side effects, beyond the quite unfortunate /
confusing 'subflow' field name:
- active and passive sockets behaviour is inconsistent: only active ones
have a not NULL msk->subflow, leading to different error handling and
different error code returned to the user-space in several places.
- active sockets uses an unneeded, larger amount of memory
- passive sockets can't successfully go through accept(), disconnect(),
accept() sequence, see [1] for more details.
The 13 first patches of this series are from Paolo and address all the
above, finally getting rid of the blamed field:
- The first patch is a minor clean-up.
- In the next 11 patches, msk->subflow usage is systematically removed
from the MPTCP protocol, replacing it with direct msk->first usage,
eventually introducing new core helpers when needed.
- The 13th patch finally disposes the field, and it's the only patch in
the series intended to produce functional changes.
The last and 14th patch is from Kuniyuki and it is not linked to the
previous ones: it is a small clean-up to get rid of an unnecessary check
in mptcp_init_sock().
[1] https://github.com/multipath-tcp/mptcp_net-next/issues/290
====================
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
|