summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-08-23Merge branch 'mlx4-aux-bus'David S. Miller
Petr Pavlu says: ==================== Convert mlx4 to use auxiliary bus This series converts the mlx4 drivers to use auxiliary bus, similarly to how mlx5 was converted [1]. The first 6 patches are preparatory changes, the remaining 4 are the final conversion. Initial motivation for this change was to address a problem related to loading mlx4_en/mlx4_ib by mlx4_core using request_module_nowait(). When doing such a load in initrd, the operation is asynchronous to any init control and can get unexpectedly affected/interrupted by an eventual root switch. Using an auxiliary bus leaves these module loads to udevd which better integrates with systemd processing. [2] General benefit is to get rid of custom interface logic and instead use a common facility available for this task. An obvious risk is that some new bug is introduced by the conversion. Leon Romanovsky was kind enough to check for me that the series passes their verification tests. Changes since v2 [3]: * Use 'void *' as the event param of mlx4_dispatch_event(). Changes since v1 [4]: * Fix a missing definition of the err variable in mlx4_en_add(). * Remove not needed comments about the event type in mlx4_en_event() and mlx4_ib_event(). [1] https://lore.kernel.org/netdev/20201101201542.2027568-1-leon@kernel.org/ [2] https://lore.kernel.org/netdev/0a361ac2-c6bd-2b18-4841-b1b991f0635e@suse.com/ [3] https://lore.kernel.org/netdev/20230813145127.10653-1-petr.pavlu@suse.com/ [4] https://lore.kernel.org/netdev/20230804150527.6117-1-petr.pavlu@suse.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23mlx4: Delete custom device management logicPetr Pavlu
After the conversion to use the auxiliary bus, the custom device management is not needed anymore and can be deleted. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23mlx4: Connect the infiniband part to the auxiliary busPetr Pavlu
Use the auxiliary bus to perform device management of the infiniband part of the mlx4 driver. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23mlx4: Connect the ethernet part to the auxiliary busPetr Pavlu
Use the auxiliary bus to perform device management of the ethernet part of the mlx4 driver. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23mlx4: Register mlx4 devices to an auxiliary virtual busPetr Pavlu
Add an auxiliary virtual bus to model the mlx4 driver structure. The code is added along the current custom device management logic. Subsequent patches switch mlx4_en and mlx4_ib to the auxiliary bus and the old interface is then removed. Structure mlx4_priv gains a new adev dynamic array to keep track of its auxiliary devices. Access to the array is protected by the global mlx4_intf mutex. Functions mlx4_register_device() and mlx4_unregister_device() are updated to expose auxiliary devices on the bus in order to load mlx4_en and/or mlx4_ib. Functions mlx4_register_auxiliary_driver() and mlx4_unregister_auxiliary_driver() are added to substitute mlx4_register_interface() and mlx4_unregister_interface(), respectively. Function mlx4_do_bond() is adjusted to walk over the adev array and re-adds a specific auxiliary device if its driver sets the MLX4_INTFF_BONDING flag. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23mlx4: Avoid resetting MLX4_INTFF_BONDING per driverPetr Pavlu
The mlx4_core driver has a logic that allows a sub-driver to set the MLX4_INTFF_BONDING flag which then causes that function mlx4_do_bond() asks the sub-driver to fully re-probe a device when its bonding configuration changes. Performing this operation is disallowed in mlx4_register_interface() when it is detected that any mlx4 device is multifunction (SRIOV). The code then resets MLX4_INTFF_BONDING in the driver flags. Move this check directly into mlx4_do_bond(). It provides a better separation as mlx4_core no longer directly modifies the sub-driver flags and it will allow to get rid of explicitly keeping track of all mlx4 devices by the intf.c code when it is switched to an auxiliary bus. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23mlx4: Move the bond work to the core driverPetr Pavlu
Function mlx4_en_queue_bond_work() is used in mlx4_en to start a bond reconfiguration. It gathers data about a new port map setting, takes a reference on the netdev that triggered the change and queues a work object on mlx4_en_priv.mdev.workqueue to perform the operation. The scheduled work is mlx4_en_bond_work() which calls mlx4_bond()/mlx4_unbond() and consequently mlx4_do_bond(). At the same time, function mlx4_change_port_types() in mlx4_core might be invoked to change the port type configuration. As part of its logic, it re-registers the whole device by calling mlx4_unregister_device(), followed by mlx4_register_device(). The two operations can result in concurrent access to the data about currently active interfaces on the device. Functions mlx4_register_device() and mlx4_unregister_device() lock the intf_mutex to gain exclusive access to this data. The current implementation of mlx4_do_bond() doesn't do that which could result in an unexpected behavior. An updated version of mlx4_do_bond() for use with an auxiliary bus goes and locks the intf_mutex when accessing a new auxiliary device array. However, doing so can then result in the following deadlock: * A two-port mlx4 device is configured as an Ethernet bond. * One of the ports is changed from eth to ib, for instance, by writing into a mlx4_port<x> sysfs attribute file. * mlx4_change_port_types() is called to update port types. It invokes mlx4_unregister_device() to unregister the device which locks the intf_mutex and starts removing all associated interfaces. * Function mlx4_en_remove() gets invoked and starts destroying its first netdev. This triggers mlx4_en_netdev_event() which recognizes that the configured bond is broken. It runs mlx4_en_queue_bond_work() which takes a reference on the netdev. Removing the netdev now cannot proceed until the work is completed. * Work function mlx4_en_bond_work() gets scheduled. It calls mlx4_unbond() -> mlx4_do_bond(). The latter function tries to lock the intf_mutex but that is not possible because it is held already by mlx4_unregister_device(). This particular case could be possibly solved by unregistering the mlx4_en_netdev_event() notifier in mlx4_en_remove() earlier, but it seems better to decouple mlx4_en more and break this reference order. Avoid then this scenario by recognizing that the bond reconfiguration operates only on a mlx4_dev. The logic to queue and execute the bond work can be moved into the mlx4_core driver. Only a reference on the respective mlx4_dev object is needed to be taken during the work's lifetime. This removes a call from mlx4_en that can directly result in needing to lock the intf_mutex, it remains a privilege of the core driver. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23mlx4: Get rid of the mlx4_interface.activate callbackPetr Pavlu
The mlx4_interface.activate callback was introduced in commit 79857cd31fe7 ("net/mlx4: Postpone the registration of net_device"). It dealt with a situation when a netdev notifier received a NETDEV_REGISTER event for a new net_device created by mlx4_en but the same device was not yet visible to mlx4_get_protocol_dev(). The callback can be removed now that mlx4_get_protocol_dev() is gone. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23mlx4: Replace the mlx4_interface.event callback with a notifierPetr Pavlu
Use a notifier to implement mlx4_dispatch_event() in preparation to switch mlx4_en and mlx4_ib to be an auxiliary device. A problem is that if the mlx4_interface.event callback was replaced with something as mlx4_adrv.event then the implementation of mlx4_dispatch_event() would need to acquire a lock on a given device before executing this callback. That is necessary because otherwise there is no guarantee that the associated driver cannot get unbound when the callback is running. However, taking this lock is not possible because mlx4_dispatch_event() can be invoked from the hardirq context. Using an atomic notifier allows the driver to accurately record when it wants to receive these events and solves this problem. A handler registration is done by both mlx4_en and mlx4_ib at the end of their mlx4_interface.add callback. This matches the current situation when mlx4_add_device() would enable events for a given device immediately after this callback, by adding the device on the mlx4_priv.list. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23mlx4: Use 'void *' as the event param of mlx4_dispatch_event()Petr Pavlu
Function mlx4_dispatch_event() takes an 'unsigned long' as its event parameter. The actual value is none (MLX4_DEV_EVENT_CATASTROPHIC_ERROR), a pointer to mlx4_eqe (MLX4_DEV_EVENT_PORT_MGMT_CHANGE), or a 32-bit integer (remaining events). In preparation to switch mlx4_en and mlx4_ib to be an auxiliary device, the mlx4_interface.event callback is replaced with a notifier and function mlx4_dispatch_event() gets updated to invoke atomic_notifier_call_chain(). This requires forwarding the input 'param' value from the former function to the latter. A problem is that the notifier call takes 'void *' as its 'param' value, compared to 'unsigned long' used by mlx4_dispatch_event(). Re-passing the value would need either punning it to 'void *' or passing down the address of the input 'param'. Both approaches create a number of unnecessary casts. Change instead the input 'param' of mlx4_dispatch_event() from 'unsigned long' to 'void *'. A mlx4_eqe pointer can be passed directly, callers using an int value are adjusted to pass its address. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23mlx4: Rename member mlx4_en_dev.nb to netdev_nbPetr Pavlu
Rename the mlx4_en_dev.nb notifier_block member to netdev_nb in preparation to add a mlx4 core notifier_block. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23mlx4: Get rid of the mlx4_interface.get_dev callbackPetr Pavlu
Simplify the mlx4 driver interface by removing mlx4_get_protocol_dev() and the associated mlx4_interface.get_dev callbacks. This is done in preparation to use an auxiliary bus to model the mlx4 driver structure. The change is motivated by the following situation: * The mlx4_en interface is being initialized by mlx4_en_add() and mlx4_en_activate(). * The latter activate function calls mlx4_en_init_netdev() -> register_netdev() to register a new net_device. * A netdev event NETDEV_REGISTER is raised for the device. * The netdev notififier mlx4_ib_netdev_event() is called and it invokes mlx4_ib_scan_netdevs() -> mlx4_get_protocol_dev() -> mlx4_en_get_netdev() [via mlx4_interface.get_dev]. This chain creates a problem when mlx4_en gets switched to be an auxiliary driver. It contains two device calls which would both need to take a respective device lock. Avoid this situation by updating mlx4_ib_scan_netdevs() to no longer call mlx4_get_protocol_dev() but instead to utilize the information passed in net_device.parent and net_device.dev_port. This data is sufficient to determine that an updated port is one that the mlx4_ib driver should take care of and to keep mlx4_ib_dev.iboe.netdevs up to date. Following that, update mlx4_ib_get_netdev() to also not call mlx4_get_protocol_dev() and instead scan all current netdevs to find find a matching one. Note that mlx4_ib_get_netdev() is called early from ib_register_device() and cannot use data tracked in mlx4_ib_dev.iboe.netdevs which is not at that point yet set. Finally, remove function mlx4_get_protocol_dev() and the mlx4_interface.get_dev callbacks (only mlx4_en_get_netdev()) as they became unused. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Tested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23qed/qede: Remove unused declarationsYue Haibing
Commit 8cd160a29415 ("qede: convert to new udp_tunnel_nic infra") removed qede_udp_tunnel_{add,del}() but not the declarations. Commit 0ebcebbef1cc ("qed: Read device port count from the shmem") removed qed_device_num_engines() but not its declaration. Commit 1e128c81290a ("qed: Add support for hardware offloaded FCoE.") declared but never implemented qed_fcoe_set_pf_params(). Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23octeontx2-pf: Use PTP HW timestamp counter atomic update featureSai Krishna
Some of the newer silicon versions in CN10K series supports a feature where in the current PTP timestamp in HW can be updated atomically without losing any cpu cycles unlike read/modify/write register. This patch uses this feature so that PTP accuracy can be improved while adjusting the master offset in HW. There is no need for SW timecounter when using this feature. So removed references to SW timecounter wherever appropriate. Signed-off-by: Sai Krishna <saikrishnag@marvell.com> Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-22net/mlx5e: Support IPsec upper TCP protocol selectorLeon Romanovsky
Support TCP as protocol selector for policy and state in IPsec packet offload mode. Example of state configuration is as follows: ip xfrm state add src 192.168.25.3 dst 192.168.25.1 \ proto esp spi 1001 reqid 10001 aead 'rfc4106(gcm(aes))' \ 0x54a7588d36873b031e4bd46301be5a86b3a53879 128 mode transport \ offload packet dev re0 dir in sel src 192.168.25.3 dst 192.168.25.1 \ proto tcp dport 9003 Acked-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5e: Support IPsec upper protocol selector field offload for RXEmeel Hakim
Support RX policy/state upper protocol selector field offload, to enable selecting RX traffic for IPsec operation based on l4 protocol UDP with specific source/destination port. Signed-off-by: Emeel Hakim <ehakim@nvidia.com> Reviewed-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Store vport in struct mlx5_devlink_port and use it in port opsJiri Pirko
Instead of using internal devlink_port->index to perform vport lookup in every devlink port op, store the vport pointer to the container struct mlx5_devlink_port and use it directly in port ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Check vhca_resource_manager capability in each op and add extack msgJiri Pirko
Since the follow-up patch is going to remove mlx5_devlink_port_fn_get_vport() entirely, move the vhca_resource_manager capability checking to individual ops. Add proper extack message on the way. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Relax mlx5_devlink_eswitch_get() return value checkingJiri Pirko
If called from port ops, it is not needed to perform the checks in mlx5_devlink_eswitch_get(). The reason is devlink port would not be registered if the checks are not true. Introduce relaxed version mlx5_devlink_eswitch_nocheck_get() and use it in port ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Return -EOPNOTSUPP in mlx5_devlink_port_fn_migratable_set() directlyJiri Pirko
Instead of initializing "err" variable, just return "-EOPNOTSUPP" directly where it is needed. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Reduce number of vport lookups passing vport pointer instead of indexJiri Pirko
During devlink port init/cleanup and register/unregister calls, there are many lookups of vport. Instead of passing vport_num as argument to functions, pass the vport struct pointer directly and avoid repeated lookups. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Embed struct devlink_port into driver structureJiri Pirko
Struct devlink_port is usually embedded in a driver-specific struct which allows to carry driver context to devlink port ops. Introduce a container struct to include devlink_port struct in preparation to also include driver context for devlink port ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Don't register ops for non-PF/VF/SF port and avoid checks in opsJiri Pirko
Currently each PF/VF/SF devlink port op called into mlx5 code calls is_port_function_supported() to check if the port is either PF, VF or SF. So make sure that the ops are registered with devlink port only for those and avoid the is_port_function_supported() checks in ops. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Remove no longer used mlx5_esw_offloads_sf_vport_enable/disable()Jiri Pirko
Since the previous patch removed the only users of these functions, remove them. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Introduce mlx5_eswitch_load/unload_sf_vport() and use it from SF codeJiri Pirko
Similar to the PF/VF helpers, introduce a set of load/unload helpers for SF vports. From there, call mlx5_eswitch_load/unload_vport() which are common for PFs/VFs and newly introduced SF helpers. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Allow mlx5_esw_offloads_devlink_port_register() to register SFsJiri Pirko
Currently there is a separate set of functions used to register/unregister the SF. The only difference is currently the ops struct. Move the struct up and use it for SFs in mlx5_esw_offloads_devlink_port_register(). Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Push devlink port PF/VF init/cleanup calls out of ↵Jiri Pirko
devlink_port_register/unregister() In order to prepare for mlx5_esw_offloads_devlink_port_register/unregister() to be used for SFs as well, push out the PF/VF specific init/cleanup calls outside. Introduce mlx5_eswitch_load/unload_pf_vf_vport() and call them from there. Use these new helpers of PF/VF loading and make mlx5_eswitch_local/unload_vport() reusable for SFs. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Push out SF devlink port init and cleanup code to separate helpersJiri Pirko
Similar to what was done for PFs/VFs, introduce devlink port init and cleanup helpers for SFs and manage the vport->dl_port pointer there. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22net/mlx5: Rework devlink port alloc/free into init/cleanupJiri Pirko
In order to prepare the devlink port registration function to be common for PFs/VFs and SFs, change the existing devlink port allocation and free functions into PF/VF init and cleanup, so similar helpers could be later on introduced for SFs. Make the init/cleanup helpers responsible for setting/clearing the vport->dl_port pointer. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-22Merge tag 'nf-next-23-08-22' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== netfilter updates for net-next First patch resolves a fortify warning by wrapping the to-be-copied members via struct_group. Second patch replaces array[0] with array[] in ebtables uapi. Both changes from GONG Ruiqi. The largest chunk is replacement of strncpy with strscpy_pad() in netfilter, from Justin Stitt. Last patch, from myself, aborts ruleset validation if a fatal signal is pending, this speeds up process exit. * tag 'nf-next-23-08-22' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: allow loop termination for pending fatal signal netfilter: xtables: refactor deprecated strncpy netfilter: x_tables: refactor deprecated strncpy netfilter: nft_meta: refactor deprecated strncpy netfilter: nft_osf: refactor deprecated strncpy netfilter: nf_tables: refactor deprecated strncpy netfilter: nf_tables: refactor deprecated strncpy netfilter: ipset: refactor deprecated strncpy netfilter: ebtables: replace zero-length array members netfilter: ebtables: fix fortify warnings in size_entry_mwt() ==================== Link: https://lore.kernel.org/r/20230822154336.12888-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22igc: Fix the typo in the PTM Control macroSasha Neftin
The IGC_PTM_CTRL_SHRT_CYC defines the time between two consecutive PTM requests. The bit resolution of this field is six bits. That bit five was missing in the mask. This patch comes to correct the typo in the IGC_PTM_CTRL_SHRT_CYC macro. Fixes: a90ec8483732 ("igc: Add support for PTP getcrosststamp()") Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://lore.kernel.org/r/20230821171721.2203572-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22Merge branch 'mptcp-prepare-mptcp-packet-scheduler-for-bpf-extension'Jakub Kicinski
Mat Martineau says: ==================== mptcp: Prepare MPTCP packet scheduler for BPF extension The kernel's MPTCP packet scheduler has, to date, been a one-size-fits all algorithm that is hard-coded. It attempts to balance latency and throughput when transmitting data across multiple TCP subflows, and has some limited tunability through sysctls. It has been a long-term goal of the Linux MPTCP community to support customizable packet schedulers for use cases that need to make different trade-offs regarding latency, throughput, redundancy, and other metrics. BPF is well-suited for configuring customized, per-packet scheduling decisions without having to modify the kernel or manage out-of-tree kernel modules. The first steps toward implementing BPF packet schedulers are to update the existing MPTCP transmit loops to allow more flexible scheduling decisions, and to add infrastructure for swappable packet schedulers. The existing scheduling algorithm remains the default. BPF-related changes will be in a future patch series. This code has been in the MPTCP development tree for quite a while, undergoing testing in our CI and community. Patches 1 and 2 refactor the transmit code and do some related cleanup. Patches 3-9 add infrastructure for registering and calling multiple schedulers. Patch 10 connects the in-kernel default scheduler to the new infrastructure. ==================== Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-0-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: register default schedulerGeliang Tang
This patch defines the default packet scheduler mptcp_sched_default. Register it in mptcp_sched_init(), which is invoked in mptcp_proto_init(). Skip deleting this default scheduler in mptcp_unregister_scheduler(). Set msk->sched to the default scheduler when the input parameter of mptcp_init_sched() is NULL. Invoke mptcp_sched_default_get_subflow in get_send() and get_retrans() if the defaut scheduler is set or msk->sched is NULL. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-10-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: use get_retrans wrapperGeliang Tang
This patch adds the multiple subflows support for __mptcp_retrans(). Use get_retrans() wrapper instead of mptcp_subflow_get_retrans() in it. Check the subflow scheduled flags to test which subflow or subflows are picked by the scheduler, use them to send data. Move msk_owned_by_me() and fallback checks into get_retrans() wrapper from mptcp_subflow_get_retrans(). Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-9-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: use get_send wrapperGeliang Tang
This patch adds the multiple subflows support for __mptcp_push_pending and __mptcp_subflow_push_pending. Use get_send() wrapper instead of mptcp_subflow_get_send() in them. Check the subflow scheduled flags to test which subflow or subflows are picked by the scheduler, use them to send data. Move msk_owned_by_me() and fallback checks into get_send() wrapper from mptcp_subflow_get_send(). This commit allows the scheduler to set the subflow->scheduled bit in multiple subflows, but it does not allow for sending redundant data. Multiple scheduled subflows will send sequential data on each subflow. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-8-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: add scheduler wrappersGeliang Tang
This patch defines two packet scheduler wrappers mptcp_sched_get_send() and mptcp_sched_get_retrans(), invoke get_subflow() of msk->sched in them. Set data->reinject to true in mptcp_sched_get_retrans(), set it false in mptcp_sched_get_send(). If msk->sched is NULL, use default functions mptcp_subflow_get_send() and mptcp_subflow_get_retrans() to send data. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-7-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: add scheduled in mptcp_subflow_contextGeliang Tang
This patch adds a new member scheduled in struct mptcp_subflow_context, which will be set in the MPTCP scheduler context when the scheduler picks this subflow to send data. Add a new helper mptcp_subflow_set_scheduled() to set this flag using WRITE_ONCE(). Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-6-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: add sched in mptcp_sockGeliang Tang
This patch adds a new struct member sched in struct mptcp_sock. And two helpers mptcp_init_sched() and mptcp_release_sched() to init and release it. Init it with the sysctl scheduler in mptcp_init_sock(), copy the scheduler from the parent in mptcp_sk_clone(), and release it in __mptcp_destroy_sock(). Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-5-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: add a new sysctl schedulerGeliang Tang
This patch adds a new sysctl, named scheduler, to support for selection of different schedulers. Export mptcp_get_scheduler helper to get this sysctl. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-4-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: add struct mptcp_sched_opsGeliang Tang
This patch defines struct mptcp_sched_ops, which has three struct members, name, owner and list, and four function pointers: init(), release() and get_subflow(). The scheduler function get_subflow() have a struct mptcp_sched_data parameter, which contains a reinject flag for retrans or not, a subflows number and a mptcp_subflow_context array. Add the scheduler registering, unregistering and finding functions to add, delete and find a packet scheduler on the global list mptcp_sched_list. Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-3-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: drop last_snd and MPTCP_RESET_SCHEDULERGeliang Tang
Since the burst check conditions have moved out of the function mptcp_subflow_get_send(), it makes all msk->last_snd useless. This patch drops them as well as the macro MPTCP_RESET_SCHEDULER. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-2-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22mptcp: refactor push_pending logicGeliang Tang
To support redundant package schedulers more easily, this patch refactors __mptcp_push_pending() logic from: For each dfrag: While sends succeed: Call the scheduler (selects subflow and msk->snd_burst) Update subflow locks (push/release/acquire as needed) Send the dfrag data with mptcp_sendmsg_frag() Update already_sent, snd_nxt, snd_burst Update msk->first_pending Push/release on final subflow -> While first_pending isn't empty: Call the scheduler (selects subflow and msk->snd_burst) Update subflow locks (push/release/acquire as needed) For each pending dfrag: While sends succeed: Send the dfrag data with mptcp_sendmsg_frag() Update already_sent, snd_nxt, snd_burst Update msk->first_pending Break if required by msk->snd_burst / etc Push/release on final subflow Refactors __mptcp_subflow_push_pending logic from: For each dfrag: While sends succeed: Call the scheduler (selects subflow and msk->snd_burst) Send the dfrag data with mptcp_subflow_delegate(), break Send the dfrag data with mptcp_sendmsg_frag() Update dfrag->already_sent, msk->snd_nxt, msk->snd_burst Update msk->first_pending -> While first_pending isn't empty: Call the scheduler (selects subflow and msk->snd_burst) Send the dfrag data with mptcp_subflow_delegate(), break Send the dfrag data with mptcp_sendmsg_frag() For each pending dfrag: While sends succeed: Send the dfrag data with mptcp_sendmsg_frag() Update already_sent, snd_nxt, snd_burst Update msk->first_pending Break if required by msk->snd_burst / etc Move the duplicate code from __mptcp_push_pending() and __mptcp_subflow_push_pending() into a new helper function, named __subflow_push_pending(). Simplify __mptcp_push_pending() and __mptcp_subflow_push_pending() by invoking this helper. Also move the burst check conditions out of the function mptcp_subflow_get_send(), check them in __subflow_push_pending() in the inner "for each pending dfrag" loop. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20230821-upstream-net-next-20230818-v1-1-0c860fb256a8@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22batman-adv: Hold rtnl lock during MTU update via netlinkSven Eckelmann
The automatic recalculation of the maximum allowed MTU is usually triggered by code sections which are already rtnl lock protected by callers outside of batman-adv. But when the fragmentation setting is changed via batman-adv's own batadv genl family, then the rtnl lock is not yet taken. But dev_set_mtu requires that the caller holds the rtnl lock because it uses netdevice notifiers. And this code will then fail the check for this lock: RTNL: assertion failed at net/core/dev.c (1953) Cc: stable@vger.kernel.org Reported-by: syzbot+f8812454d9b3ac00d282@syzkaller.appspotmail.com Fixes: c6a953cce8d0 ("batman-adv: Trigger events for auto adjusted MTU") Signed-off-by: Sven Eckelmann <sven@narfation.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230821-batadv-missing-mtu-rtnl-lock-v1-1-1c5a7bfe861e@narfation.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22igb: Avoid starting unnecessary workqueuesAlessio Igor Bogani
If ptp_clock_register() fails or CONFIG_PTP isn't enabled, avoid starting PTP related workqueues. In this way we can fix this: BUG: unable to handle page fault for address: ffffc9000440b6f8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 100000067 P4D 100000067 PUD 1001e0067 PMD 107dc5067 PTE 0 Oops: 0000 [#1] PREEMPT SMP [...] Workqueue: events igb_ptp_overflow_check RIP: 0010:igb_rd32+0x1f/0x60 [...] Call Trace: igb_ptp_read_82580+0x20/0x50 timecounter_read+0x15/0x60 igb_ptp_overflow_check+0x1a/0x50 process_one_work+0x1cb/0x3c0 worker_thread+0x53/0x3f0 ? rescuer_thread+0x370/0x370 kthread+0x142/0x160 ? kthread_associate_blkcg+0xc0/0xc0 ret_from_fork+0x1f/0x30 Fixes: 1f6e8178d685 ("igb: Prevent dropped Tx timestamps via work items and interrupts.") Fixes: d339b1331616 ("igb: add PTP Hardware Clock code") Signed-off-by: Alessio Igor Bogani <alessio.bogani@elettra.eu> Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230821171927.2203644-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2023-08-21 (ice) This series contains updates to ice driver only. Jesse fixes an issue on calculating buffer size. Petr Oros reverts a commit that does not fully resolve VF reset issues and implements one that provides a fuller fix. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ice: Fix NULL pointer deref during VF reset Revert "ice: Fix ice VF reset during iavf initialization" ice: fix receive buffer size miscalculation ==================== Link: https://lore.kernel.org/r/20230821171633.2203505-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22Merge branch 'can-fixes-for-6-5-rc7'Jakub Kicinski
Oliver Hartkopp says: ==================== CAN fixes for 6.5-rc7 The isotp fix removes an unnecessary check which leads to delays and/or a wrong error notification. The fix for the CAN_RAW socket solves the last issue that has been introduced with commit ee8b94c8510c ("can: raw: fix receiver memory leak") in this upstream cycle (detected by Eric Dumazet). ==================== Link: https://lore.kernel.org/r/20230821144547.6658-1-socketcan@hartkopp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22can: raw: add missing refcount for memory leak fixOliver Hartkopp
Commit ee8b94c8510c ("can: raw: fix receiver memory leak") introduced a new reference to the CAN netdevice that has assigned CAN filters. But this new ro->dev reference did not maintain its own refcount which lead to another KASAN use-after-free splat found by Eric Dumazet. This patch ensures a proper refcount for the CAN nedevice. Fixes: ee8b94c8510c ("can: raw: fix receiver memory leak") Reported-by: Eric Dumazet <edumazet@google.com> Cc: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20230821144547.6658-3-socketcan@hartkopp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22can: isotp: fix support for transmission of SF without flow controlOliver Hartkopp
The original implementation had a very simple handling for single frame transmissions as it just sent the single frame without a timeout handling. With the new echo frame handling the echo frame was also introduced for single frames but the former exception ('simple without timers') has been maintained by accident. This leads to a 1 second timeout when closing the socket and to an -ECOMM error when CAN_ISOTP_WAIT_TX_DONE is selected. As the echo handling is always active (also for single frames) remove the wrong extra condition for single frames. Fixes: 9f39d36530e5 ("can: isotp: add support for transmission without flow control") Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20230821144547.6658-2-socketcan@hartkopp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22bnx2x: new flag for track HW resource allocationThinh Tran
While injecting PCIe errors to the upstream PCIe switch of a BCM57810 NIC, system hangs/crashes were observed. After several calls to bnx2x_tx_timout() complete, bnx2x_nic_unload() is called to free up HW resources and bnx2x_napi_disable() is called to release NAPI objects. Later, when the EEH driver calls bnx2x_io_slot_reset() to complete the recovery process, bnx2x attempts to disable NAPI again by calling bnx2x_napi_disable() and freeing resources which have already been freed, resulting in a hang or crash. Introduce a new flag to track the HW resource and NAPI allocation state, refactor duplicated code into a single function, check page pool allocation status before freeing, and reduces debug output when a TX timeout event occurs. Reviewed-by: Manish Chopra <manishc@marvell.com> Tested-by: Abdul Haleem <abdhalee@in.ibm.com> Tested-by: David Christensen <drc@linux.vnet.ibm.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Venkata Sai Duggi <venkata.sai.duggi@ibm.com> Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com> Link: https://lore.kernel.org/r/20230818161443.708785-2-thinhtr@linux.vnet.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22Merge tag 'devicetree-fixes-for-6.5-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull devicetree fixes from Rob Herring: - Fix DT node refcount when creating platform devices - Fix deadlock in changeset code due to printing with devtree_lock held - Fix unittest EXPECT strings for parse_phandle_with_args_map() test - Fix IMA kexec memblock freeing * tag 'devicetree-fixes-for-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: of/platform: increase refcount of fwnode of: dynamic: Refactor action prints to not use "%pOF" inside devtree_lock of: unittest: Fix EXPECT for parse_phandle_with_args_map() test mm,ima,kexec,of: use memblock_free_late from ima_free_kexec_buffer