summaryrefslogtreecommitdiff
path: root/drivers/infiniband/hw/mlx5
AgeCommit message (Collapse)Author
2023-06-11RDMA/mlx5: Create an indirect flow table for steering anchorMark Bloch
A misbehaved user can create a steering anchor that points to a kernel flow table and then destroy the anchor without freeing the associated STC. This creates a problem as the kernel can't destroy the flow table since there is still a reference to it. As a result, this can exhaust all available flow table resources, preventing other users from using the RDMA device. To prevent this problem, a solution is implemented where a special flow table with two steering rules is created when a user creates a steering anchor for the first time. The rules include one that drops all traffic and another that points to the kernel flow table. If the steering anchor is destroyed, only the rule pointing to the kernel's flow table is removed. Any traffic reaching the special flow table after that is dropped. Since the special flow table is not destroyed when the steering anchor is destroyed, any issues are prevented from occurring. The remaining resources are only destroyed when the RDMA device is destroyed, which happens after all DEVX objects are freed, including the STCs, thus mitigating the issue. Fixes: 0c6ab0ca9a66 ("RDMA/mlx5: Expose steering anchor to userspace") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://lore.kernel.org/r/b4a88a871d651fa4e8f98d552553c1cfe9ba2cd6.1685960567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11RDMA/mlx5: Initiate dropless RQ for RAW Ethernet functionsMaher Sanalla
Delay drop data is initiated for PFs that have the capability of rq_delay_drop and are in roce profile. However, PFs with RAW ethernet profile do not initiate delay drop data on function load, causing kernel panic if delay drop struct members are accessed later on in case a dropless RQ is created. Thus, stage the delay drop initialization as part of RAW ethernet PF loading process. Fixes: b5ca15ad7e61 ("IB/mlx5: Add proper representors support") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://lore.kernel.org/r/2e9d386785043d48c38711826eb910315c1de141.1685960567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11RDMA/mlx5: Return the firmware result upon destroying QP/RQPatrisious Haddad
Previously when destroying a QP/RQ, the result of the firmware destruction function was ignored and upper layers weren't informed about the failure. Which in turn could lead to various problems since when upper layer isn't aware of the failure it continues its operation thinking that the related QP/RQ was successfully destroyed while it actually wasn't, which could lead to the below kernel WARN. Currently, we return the correct firmware destruction status to upper layers which in case of the RQ would be mlx5_ib_destroy_wq() which was already capable of handling RQ destruction failure or in case of a QP to destroy_qp_common(), which now would actually warn upon qp destruction failure. WARNING: CPU: 3 PID: 995 at drivers/infiniband/core/rdma_core.c:940 uverbs_destroy_ufile_hw+0xcb/0xe0 [ib_uverbs] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core overlay mlx5_core fuse CPU: 3 PID: 995 Comm: python3 Not tainted 5.16.0-rc5+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:uverbs_destroy_ufile_hw+0xcb/0xe0 [ib_uverbs] Code: 41 5c 41 5d 41 5e e9 44 34 f0 e0 48 89 df e8 4c 77 ff ff 49 8b 86 10 01 00 00 48 85 c0 74 a1 4c 89 e7 ff d0 eb 9a 0f 0b eb c1 <0f> 0b be 04 00 00 00 48 89 df e8 b6 f6 ff ff e9 75 ff ff ff 90 0f RSP: 0018:ffff8881533e3e78 EFLAGS: 00010287 RAX: ffff88811b2cf3e0 RBX: ffff888106209700 RCX: 0000000000000000 RDX: ffff888106209780 RSI: ffff8881533e3d30 RDI: ffff888109b101a0 RBP: 0000000000000001 R08: ffff888127cb381c R09: 0de9890000000009 R10: ffff888127cb3800 R11: 0000000000000000 R12: ffff888106209780 R13: ffff888106209750 R14: ffff888100f20660 R15: 0000000000000000 FS: 00007f8be353b740(0000) GS:ffff88852c980000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8bd5b117c0 CR3: 000000012cd8a004 CR4: 0000000000370ea0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ib_uverbs_close+0x1a/0x90 [ib_uverbs] __fput+0x82/0x230 task_work_run+0x59/0x90 exit_to_user_mode_prepare+0x138/0x140 syscall_exit_to_user_mode+0x1d/0x50 ? __x64_sys_close+0xe/0x40 do_syscall_64+0x4a/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f8be3ae0abb Code: 03 00 00 00 0f 05 48 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 83 43 f9 ff 8b 7c 24 0c 41 89 c0 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 0c e8 c1 43 f9 ff 8b 44 RSP: 002b:00007ffdb51909c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000557bb7f7c020 RCX: 00007f8be3ae0abb RDX: 0000557bb7c74010 RSI: 0000557bb7f14ca0 RDI: 0000000000000005 RBP: 0000557bb7fbd598 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000293 R12: 0000557bb7fbd5b8 R13: 0000557bb7fbd5a8 R14: 0000000000001000 R15: 0000557bb7f7c020 </TASK> Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://lore.kernel.org/r/c6df677f931d18090bafbe7f7dbb9524047b7d9b.1685953497.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11RDMA/mlx5: Handle DCT QP logic separately from low level QP interfacePatrisious Haddad
Previously when destroying a DCT, if the firmware function for the destruction failed, the common resource would have been destroyed either way, since it was destroyed before the firmware object. Which leads to kernel warning "refcount_t: underflow" which indicates possible use-after-free. Which is triggered when we try to destroy the common resource for the second time and execute refcount_dec_and_test(&common->refcount). So, let's fix the destruction order by factoring out the DCT QP logic to be in separate XArray database. refcount_t: underflow; use-after-free. WARNING: CPU: 8 PID: 1002 at lib/refcount.c:28 refcount_warn_saturate+0xd8/0xe0 Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core overlay mlx5_core fuse CPU: 8 PID: 1002 Comm: python3 Not tainted 5.16.0-rc5+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:refcount_warn_saturate+0xd8/0xe0 Code: ff 48 c7 c7 18 f5 23 82 c6 05 60 70 ff 00 01 e8 d0 0a 45 00 0f 0b c3 48 c7 c7 c0 f4 23 82 c6 05 4c 70 ff 00 01 e8 ba 0a 45 00 <0f> 0b c3 0f 1f 44 00 00 8b 07 3d 00 00 00 c0 74 12 83 f8 01 74 13 RSP: 0018:ffff8881221d3aa8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff8881313e8d40 RCX: ffff88852cc1b5c8 RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff88852cc1b5c0 RBP: ffff888100f70000 R08: ffff88853ffd1ba8 R09: 0000000000000003 R10: 00000000fffff000 R11: 3fffffffffffffff R12: 0000000000000246 R13: ffff888100f71fa0 R14: ffff8881221d3c68 R15: 0000000000000020 FS: 00007efebbb13740(0000) GS:ffff88852cc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005611aac29f80 CR3: 00000001313de004 CR4: 0000000000370ea0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> destroy_resource_common+0x6e/0x95 [mlx5_ib] mlx5_core_destroy_rq_tracked+0x38/0xbe [mlx5_ib] mlx5_ib_destroy_wq+0x22/0x80 [mlx5_ib] ib_destroy_wq_user+0x1f/0x40 [ib_core] uverbs_free_wq+0x19/0x40 [ib_uverbs] destroy_hw_idr_uobject+0x18/0x50 [ib_uverbs] uverbs_destroy_uobject+0x2f/0x190 [ib_uverbs] uobj_destroy+0x3c/0x80 [ib_uverbs] ib_uverbs_cmd_verbs+0x3e4/0xb80 [ib_uverbs] ? uverbs_free_wq+0x40/0x40 [ib_uverbs] ? ip_list_rcv+0xf7/0x120 ? netif_receive_skb_list_internal+0x1b6/0x2d0 ? task_tick_fair+0xbf/0x450 ? __handle_mm_fault+0x11fc/0x1450 ib_uverbs_ioctl+0xa4/0x110 [ib_uverbs] __x64_sys_ioctl+0x3e4/0x8e0 ? handle_mm_fault+0xb9/0x210 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7efebc0be17b Code: 0f 1e fa 48 8b 05 1d ad 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed ac 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe71813e78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffe71813fb8 RCX: 00007efebc0be17b RDX: 00007ffe71813fa0 RSI: 00000000c0181b01 RDI: 0000000000000005 RBP: 00007ffe71813f80 R08: 00005611aae96020 R09: 000000000000004f R10: 00007efebbf9ffa0 R11: 0000000000000246 R12: 00007ffe71813f80 R13: 00007ffe71813f4c R14: 00005611aae2eca0 R15: 00007efeae6c89d0 </TASK> Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/4470888466c8a898edc9833286967529cc5f3c0d.1685953497.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-06-11RDMA/mlx5: Reduce QP table exposureLeon Romanovsky
driver.h is common header to whole mlx5 code base, but struct mlx5_qp_table is used in mlx5_ib driver only. So move that struct to be under sole responsibility of mlx5_ib. Link: https://lore.kernel.org/r/bec0dc1158e795813b135d1143147977f26bf668.1685953497.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-06-07{net/RDMA}/mlx5: introduce lag_for_each_peerShay Drory
Introduce a generic APIs to iterate over all the devices which are part of the LAG. This API replace mlx5_lag_get_peer_mdev() which retrieve only a single peer device from the lag. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-07RDMA/mlx5: Free second uplink ib portShay Drory
The cited patch introduce ib port for the slave device uplink in case of multiport eswitch. However, this ib port didn't perform anything when unloaded. Unload the new ib port properly. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-29Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Usual wide collection of unrelated items in drivers: - Driver bug fixes and treewide cleanups in hfi1, siw, qib, mlx5, rxe, usnic, usnic, bnxt_re, ocrdma, iser: - remove unnecessary NULL checks - kmap obsolescence - pci_enable_pcie_error_reporting() obsolescence - unused variables and macros - trace event related warnings - casting warnings - Code cleanups for irdm and erdma - EFA reporting of 128 byte PCIe TLP support - mlx5 more agressively uses the out of order HW feature - Big rework of how state machines and tasks work in rxe - Fix a syzkaller found crash netdev refcount leak in siw - bnxt_re revises their HW description header - Congestion control for bnxt_re - Use mmu_notifiers more safely in hfi1 - mlx5 gets better support for PCIe relaxed ordering inside VMs" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (81 commits) RDMA/efa: Add rdma write capability to device caps RDMA/mlx5: Use correct device num_ports when modify DC RDMA/irdma: Drop spurious WQ_UNBOUND from alloc_ordered_workqueue() call RDMA/rxe: Fix spinlock recursion deadlock on requester RDMA/mlx5: Fix flow counter query via DEVX RDMA/rxe: Protect QP state with qp->state_lock RDMA/rxe: Move code to check if drained to subroutine RDMA/rxe: Remove qp->req.state RDMA/rxe: Remove qp->comp.state RDMA/rxe: Remove qp->resp.state RDMA/mlx5: Allow relaxed ordering read in VFs and VMs net/mlx5: Update relaxed ordering read HCA capabilities RDMA/mlx5: Check pcie_relaxed_ordering_enabled() in UMR RDMA/mlx5: Remove pcie_relaxed_ordering_enabled() check for RO write RDMA: Add ib_virt_dma_to_page() RDMA/rxe: Fix the error "trying to register non-static key in rxe_cleanup_task" RDMA/irdma: Slightly optimize irdma_form_ah_cm_frame() RDMA/rxe: Fix incorrect TASKLET_STATE_SCHED check in rxe_task.c IB/hfi1: Place struct mmu_rb_handler on cache line start IB/hfi1: Fix bugs with non-PAGE_SIZE-end multi-iovec user SDMA requests ...
2023-04-21RDMA/mlx5: Use correct device num_ports when modify DCMark Zhang
Just like other QP types, when modify DC, the port_num should be compared with dev->num_ports, instead of HCA_CAP.num_ports. Otherwise Multi-port vHCA on DC may not work. Fixes: 776a3906b692 ("IB/mlx5: Add support for DC target QP") Link: https://lore.kernel.org/r/20230420013906.1244185-1-markzhang@nvidia.com Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-04-18RDMA/mlx5: Fix flow counter query via DEVXMark Bloch
Commit cited in "fixes" tag added bulk support for flow counters but it didn't account that's also possible to query a counter using a non-base id if the counter was allocated as bulk. When a user performs a query, validate the flow counter id given in the mailbox is inside the valid range taking bulk value into account. Fixes: 208d70f562e5 ("IB/mlx5: Support flow counters offset for bulk counters") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://lore.kernel.org/r/79d7fbe291690128e44672418934256254d93115.1681377114.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-16RDMA/mlx5: Allow relaxed ordering read in VFs and VMsAvihai Horon
According to PCIe spec, Enable Relaxed Ordering value in the VF's PCI config space is wired to 0 and PF relaxed ordering (RO) setting should be applied to the VF. In QEMU (and maybe others), when assigning VFs, the RO bit in PCI config space is not emulated properly and is always set to 0. Therefore, pcie_relaxed_ordering_enabled() always returns 0 for VFs and VMs and thus MKeys can't be created with RO read even if the PF supports it. pcie_relaxed_ordering_enabled() check was added to avoid a syndrome when creating a MKey with relaxed ordering (RO) enabled when the driver's relaxed_ordering_read_pci_enabled HCA capability is out of sync with FW. With the new relaxed_ordering_read capability this can't happen, as it's set regardless of RO value in PCI config space and thus can't change during runtime. Hence, to allow RO read in VFs and VMs, use the new HCA capability relaxed_ordering_read without checking pcie_relaxed_ordering_enabled(). The old capability checks are kept for backward compatibility with older FWs. Allowing RO in VFs and VMs is valuable since it can greatly improve performance on some setups. For example, testing throughput of a VF on an AMD EPYC 7763 and ConnectX-6 Dx setup showed roughly 60% performance improvement. Signed-off-by: Avihai Horon <avihaih@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Link: https://lore.kernel.org/r/e7048640d66c341a8fa0465e099926e7989184bc.1681131553.git.leon@kernel.org Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-16net/mlx5: Update relaxed ordering read HCA capabilitiesAvihai Horon
Rename existing HCA capability relaxed_ordering_read to relaxed_ordering_read_pci_enabled. This is in accordance with recent PRM change to better describe the capability, as it's set only if both the device supports relaxed ordering (RO) read and RO is enabled in PCI config space. In addition, add new HCA capability relaxed_ordering_read which is set if the device supports RO read, regardless of RO in PCI config space. This will be used in the following patch to allow RO in VFs and VMs. Signed-off-by: Avihai Horon <avihaih@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Link: https://lore.kernel.org/r/caa0002fd8135086357dfcc368e2f5cc73b08480.1681131553.git.leon@kernel.org Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-16RDMA/mlx5: Check pcie_relaxed_ordering_enabled() in UMRAvihai Horon
relaxed_ordering_read HCA capability is set if both the device supports relaxed ordering (RO) read and RO is set in PCI config space. RO in PCI config space can change during runtime. This will change the value of relaxed_ordering_read HCA capability in FW, but the driver will not see it since it queries the capabilities only once. This can lead to the following scenario: 1. RO in PCI config space is enabled. 2. User creates MKey without RO. 3. RO in PCI config space is disabled. As a result, relaxed_ordering_read HCA capability is turned off in FW but remains on in driver copy of the capabilities. 4. User requests to reconfig the MKey with RO via UMR. 5. Driver will try to reconfig the MKey with RO read although it shouldn't (as relaxed_ordering_read HCA capability is really off). To fix this, check pcie_relaxed_ordering_enabled() before setting RO read in UMR. Fixes: 896ec9735336 ("RDMA/mlx5: Set mkey relaxed ordering by UMR with ConnectX-7") Signed-off-by: Avihai Horon <avihaih@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Link: https://lore.kernel.org/r/8d39eb8317e7bed1a354311a20ae707788fd94ed.1681131553.git.leon@kernel.org Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-16RDMA/mlx5: Remove pcie_relaxed_ordering_enabled() check for RO writeAvihai Horon
pcie_relaxed_ordering_enabled() check was added to avoid a syndrome when creating a MKey with relaxed ordering (RO) enabled when the driver's relaxed_ordering_{read,write} HCA capabilities are out of sync with FW. While this can happen with relaxed_ordering_read, it can't happen with relaxed_ordering_write as it's set if the device supports RO write, regardless of RO in PCI config space, and thus can't change during runtime. Therefore, drop the pcie_relaxed_ordering_enabled() check for relaxed_ordering_write while keeping it for relaxed_ordering_read. Doing so will also allow the usage of RO write in VFs and VMs (where RO in PCI config space is not reported/emulated properly). Signed-off-by: Avihai Horon <avihaih@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Link: https://lore.kernel.org/r/7e8f55e31572c1702d69cae015a395d3a824a38a.1681131553.git.leon@kernel.org Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-04-03RDMA/mlx5: Remove unused num_alloc_xa_entries variableTom Rix
clang with W=1 reports drivers/infiniband/hw/mlx5/devx.c:1996:6: error: variable 'num_alloc_xa_entries' set but not used [-Werror,-Wunused-but-set-variable] int num_alloc_xa_entries = 0; ^ This variable is not used so remove it. Signed-off-by: Tom Rix <trix@redhat.com> Link: https://lore.kernel.org/r/20230330153607.1838750-1-trix@redhat.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-29RDMA/mlx5: Expand switchdev Q-counters to expose representor statisticsPatrisious Haddad
Previously for switchdev only per device counters were supported. Currently we allocate counters for switchdev per port, which also includes the ports that belong to VF representors in order to expose them to users through the rdma tool, allowing the host to track the VFs statistics through their representors counters. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://lore.kernel.org/r/ea31e1103c125cd27931ba213f307cde30d2eaed.1679566038.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-23Enable IB out-of-order by default in mlx5Leon Romanovsky
This series from Or changes default of IB out-of-order feature and allows to the RDMA users to decide if they need to wait for completion for all segments or it is enough to wait for last segment completion only. Thanks Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-23RDMA/mlx5: Disable out-of-order in integrity enabled QPsOr Har-Toov
Set retry_mode to GO_BACK_N when qp is created with INTEGRITY_EN flag because out-of-order is not supported when doing HW offload of signature operations. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/362de42cdc7a541afa5b1fd0ec6ae706061764a2.1679230449.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-20IB/mlx5: Add support for 400G_8X lane speedMaher Sanalla
Currently, when driver queries PTYS to report which link speed is being used on its RoCE ports, it does not check the case of having 400Gbps transmitted over 8 lanes. Thus it fails to report the said speed and instead it defaults to report 10G over 4 lanes. Add a check for the said speed when querying PTYS and report it back correctly when needed. Fixes: 08e8676f1607 ("IB/mlx5: Add support for 50Gbps per lane link modes") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/ec9040548d119d22557d6a4b4070d6f421701fd4.1678973994.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-03-19RDMA/mlx5: Coding style fix reported by checkpatchRohit Chavan
Block comments should align the * on each line on line 2849 Avoid line continuations in quoted strings on line 3848 Signed-off-by: Rohit Chavan <roheetchavan@gmail.com> Link: https://lore.kernel.org/r/20230319100847.5566-1-roheetchavan@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-02-24Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Quite a small cycle this time, even with the rc8. I suppose everyone went to sleep over xmas. - Minor driver updates for hfi1, cxgb4, erdma, hns, irdma, mlx5, siw, mana - inline CQE support for hns - Have mlx5 display device error codes - Pinned DMABUF support for irdma - Continued rxe cleanups, particularly converting the MRs to use xarray - Improvements to what can be cached in the mlx5 mkey cache" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (61 commits) IB/mlx5: Extend debug control for CC parameters IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors IB/hfi1: Fix math bugs in hfi1_can_pin_pages() RDMA/irdma: Add support for dmabuf pin memory regions RDMA/mlx5: Use query_special_contexts for mkeys net/mlx5e: Use query_special_contexts for mkeys net/mlx5: Change define name for 0x100 lkey value net/mlx5: Expose bits for querying special mkeys RDMA/rxe: Fix missing memory barriers in rxe_queue.h RDMA/mana_ib: Fix a bug when the PF indicates more entries for registering memory on first packet RDMA/rxe: Remove rxe_alloc() RDMA/cma: Distinguish between sockaddr_in and sockaddr_in6 by size Subject: RDMA/rxe: Handle zero length rdma iw_cxgb4: Fix potential NULL dereference in c4iw_fill_res_cm_id_entry() RDMA/mlx5: Use rdma_umem_for_each_dma_block() RDMA/umem: Remove unused 'work' member from struct ib_umem RDMA/irdma: Cap MSIX used to online CPUs + 1 RDMA/mlx5: Check reg_create() create for errors RDMA/restrack: Correct spelling RDMA/cxgb4: Fix potential null-ptr-deref in pass_establish() ...
2023-02-23Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
2023-02-19IB/mlx5: Extend debug control for CC parametersEdward Srouji
This patch adds rtt_resp_dscp to the current debug controllability of congestion control (CC) parameters. rtt_resp_dscp can be read or written through debugfs. If set, its value overwrites the DSCP of the generated RTT response. Signed-off-by: Edward Srouji <edwards@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://lore.kernel.org/r/1dcc3440ee53c688f19f579a051ded81a2aaa70a.1676538714.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-02-17Merge mlx5-next into rdma.git for-nextJason Gunthorpe
Synchronize the shared mlx5 branch with net: - From Jiri: fixe a deadlock in mlx5_ib's netdev notifier unregister. - From Mark and Patrisious: add IPsec RoCEv2 support. - From Or: Rely on firmware to get special mkeys * branch mlx5-next: RDMA/mlx5: Use query_special_contexts for mkeys net/mlx5e: Use query_special_contexts for mkeys net/mlx5: Change define name for 0x100 lkey value net/mlx5: Expose bits for querying special mkeys net/mlx5: Configure IPsec steering for egress RoCEv2 traffic net/mlx5: Configure IPsec steering for ingress RoCEv2 traffic net/mlx5: Add IPSec priorities in RDMA namespaces net/mlx5: Implement new destination type TABLE_TYPE net/mlx5: Introduce new destination type TABLE_TYPE RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister net/mlx5e: Propagate an internal event in case uplink netdev changes net/mlx5e: Fix trap event handling Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-02-17RDMA/mlx5: Use query_special_contexts for mkeysOr Har-Toov
Use query_sepcial_contexts to get the correct value of mkeys such as null_mkey, terminate_scatter_list_mkey and dump_fill_mkey, as FW will change them in certain configurations. Link: https://lore.kernel.org/r/000236f0a9487d48809f87bcc3620a3964b2d3d3.1673960981.git.leon@kernel.org Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-02-17net/mlx5: Change define name for 0x100 lkey valueOr Har-Toov
Change define of 0x100 lkey value from MLX5_INVALID_LKEY to be MLX5_TERMINATE_SCATTER_LIST_LKEY as 0x100 is the value of terminate_scatter_list_mkey. Link: https://lore.kernel.org/r/3a116dc3fbae4cb6b76a63d27d418830b06ade0c.1673960981.git.leon@kernel.org Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-02-15RDMA/mlx5: Use rdma_umem_for_each_dma_block()Jason Gunthorpe
Replace an open coding of rdma_umem_for_each_dma_block() with the proper function. Fixes: b3d47ebd4908 ("RDMA/mlx5: Use mlx5_umr_post_send_wait() to update MR pas") Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/0-v1-c13a5b88359b+556d0-mlx5_umem_block_jgg@nvidia.com Reviewed-by: Devesh Sharma <devesh.s.sharma@oracle.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-02-14net/mlx5: Lag, Add single RDMA device in multiport modeMark Bloch
In MultiPort E-Switch mode a single RDMA is created. This device has multiple RDMA ports that represent the uplink ports that are connected to the E-Switch. Account for this when creating the RDMA device so it has an additional port for the non native uplink. As a side effect of this patch, use shared fdb in multiport eswitch mode. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-09mm: replace vma->vm_flags direct modifications with modifier callsSuren Baghdasaryan
Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-02-08RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregisterJiri Pirko
When removing a network namespace with mlx5 devlink instance being in it, following callchain is performed: cleanup_net (takes down_read(&pernet_ops_rwsem) devlink_pernet_pre_exit() devlink_reload() mlx5_devlink_reload_down() mlx5_unload_one_devl_locked() mlx5_detach_device() del_adev() mlx5r_remove() __mlx5_ib_remove() mlx5_ib_roce_cleanup() mlx5_remove_netdev_notifier() unregister_netdevice_notifier (takes down_write(&pernet_ops_rwsem) This deadlocks. Resolve this by converting to register_netdevice_notifier_dev_net() which does not take pernet_ops_rwsem and moves the notifier block around according to netdev it takes as arg. Use previously introduced netdev added/removed events to track uplink netdev to be used for register_netdevice_notifier_dev_net() purposes. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-07RDMA/mlx5: Check reg_create() create for errorsDan Carpenter
The reg_create() can fail. Check for errors before dereferencing it. Fixes: dd1b913fb0d0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow") Signed-off-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/r/Y+ERYy4wN0LsKsm+@kili Reviewed-by: Devesh Sharma <devesh.s.sharma@oracle.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-02-06RDMA/mlx5: Remove impossible check of mkey cache cleanup failureLeon Romanovsky
mlx5_mkey_cache_cleanup() can't fail and can be changed to be void. Link: https://lore.kernel.org/r/1acd9528995d083114e7dec2a2afc59436406583.1675328463.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-02-06RDMA/mlx5: Fix MR cache debugfs error in IB representors modeLeon Romanovsky
Block MR cache debugfs creation for IB representor flow as MR cache shouldn't be used at all in that mode. As part of this change, add missing debugfs cleanup in error path too. This change fixes the following debugfs errors: bond0: (slave enp8s0f1): Enslaving as a backup interface with an up link mlx5_core 0000:08:00.0: lag map: port 1:1 port 2:1 mlx5_core 0000:08:00.0: shared_fdb:1 mode:queue_affinity mlx5_core 0000:08:00.0: Operation mode is single FDB debugfs: Directory '2' with parent '/' already present! ... debugfs: Directory '22' with parent '/' already present! Fixes: 73d09b2fe833 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://lore.kernel.org/r/482a78c54acbcfa1742a0e06a452546428900ffa.1675328463.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2023-01-27RDMA/mlx5: Add work to remove temporary entries from the cacheMichael Guralnik
The non-cache mkeys are stored in the cache only to shorten restarting application time. Don't store them longer than needed. Configure cache entries that store non-cache MRs as temporary entries. If 30 seconds have passed and no user reclaimed the temporarily cached mkeys, an asynchronous work will destroy the mkeys entries. Link: https://lore.kernel.org/r/20230125222807.6921-7-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-27RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flowMichael Guralnik
Currently, when dereging an MR, if the mkey doesn't belong to a cache entry, it will be destroyed. As a result, the restart of applications with many non-cached mkeys is not efficient since all the mkeys are destroyed and then recreated. This process takes a long time (for 100,000 MRs, it is ~20 seconds for dereg and ~28 seconds for re-reg). To shorten the restart runtime, insert all cacheable mkeys to the cache. If there is no fitting entry to the mkey properties, create a temporary entry that fits it. After a predetermined timeout, the cache entries will shrink to the initial high limit. The mkeys will still be in the cache when consuming them again after an application restart. Therefore, the registration will be much faster (for 100,000 MRs, it is ~4 seconds for dereg and ~5 seconds for re-reg). The temporary cache entries created to store the non-cache mkeys are not exposed through sysfs like the default cache entries. Link: https://lore.kernel.org/r/20230125222807.6921-6-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-27RDMA/mlx5: Introduce mlx5r_cache_rb_keyMichael Guralnik
Switch from using the mkey order to using the new struct as the key to the RB tree of cache entries. The key is all the mkey properties that UMR operations can't modify. Using this key to define the cache entries and to search and create cache mkeys. Link: https://lore.kernel.org/r/20230125222807.6921-5-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-27RDMA/mlx5: Change the cache structure to an RB-treeMichael Guralnik
Currently, the cache structure is a static linear array. Therefore, his size is limited to the number of entries in it and is not expandable. The entries are dedicated to mkeys of size 2^x and no access_flags. Mkeys with different properties are not cacheable. In this patch, we change the cache structure to an RB-tree. This will allow to extend the cache to support more entries with different mkey properties. Link: https://lore.kernel.org/r/20230125222807.6921-4-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-27RDMA/mlx5: Remove implicit ODP cache entryAharon Landau
Implicit ODP mkey doesn't have unique properties. It shares the same properties as the order 18 cache entry. There is no need to devote a special entry for that. Link: https://lore.kernel.org/r/20230125222807.6921-3-michaelgur@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-27RDMA/mlx5: Don't keep umrable 'page_shift' in cache entriesAharon Landau
mkc.log_page_size can be changed using UMR. Therefore, don't treat it as a cache entry property. Removing it from struct mlx5_cache_ent. All cache mkeys will be created with default PAGE_SHIFT, and updated with the needed page_shift using UMR when passing them to a user. Link: https://lore.kernel.org/r/20230125222807.6921-2-michaelgur@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2023-01-15RDMA/mlx5: Print error syndrome in case of fatal QP errorsPatrisious Haddad
Print syndromes in case of fatal QP events. This is helpful for upper level debugging, as there maybe no CQEs. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/edc794f622a33e4ee12d7f5d218d1a59aa7c6af5.1672821186.git.leonro@nvidia.com Reviewed-by: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-01-15RDMA/mlx: Calling qp event handler in workqueue contextMark Zhang
Move the call of qp event handler from atomic to workqueue context, so that the handler is able to block. This is needed by following patches. Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://lore.kernel.org/r/0cd17b8331e445f03942f4bb28d447f24ac5669d.1672821186.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-01-01RDMA/mlx5: Fix validation of max_rd_atomic caps for DCMaor Gottlieb
Currently, when modifying DC, we validate max_rd_atomic user attribute against the RC cap, validate against DC. RC and DC QP types have different device limitations. This can cause userspace created DC QPs to malfunction. Fixes: c32a4f296e1d ("IB/mlx5: Add support for DC Initiator QP") Link: https://lore.kernel.org/r/0c5aee72cea188c3bb770f4207cce7abc9b6fc74.1672231736.git.leonro@nvidia.com Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2023-01-01RDMA/mlx5: Fix mlx5_ib_get_hw_stats when used for deviceShay Drory
Currently, when mlx5_ib_get_hw_stats() is used for device (port_num = 0), there is a special handling in order to use the correct counters, but, port_num is being passed down the stack without any change. Also, some functions assume that port_num >=1. As a result, the following oops can occur. BUG: unable to handle page fault for address: ffff89510294f1a8 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP CPU: 8 PID: 1382 Comm: devlink Tainted: G W 6.1.0-rc4_for_upstream_base_2022_11_10_16_12 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:_raw_spin_lock+0xc/0x20 Call Trace: <TASK> mlx5_ib_get_native_port_mdev+0x73/0xe0 [mlx5_ib] do_get_hw_stats.constprop.0+0x109/0x160 [mlx5_ib] mlx5_ib_get_hw_stats+0xad/0x180 [mlx5_ib] ib_setup_device_attrs+0xf0/0x290 [ib_core] ib_register_device+0x3bb/0x510 [ib_core] ? atomic_notifier_chain_register+0x67/0x80 __mlx5_ib_add+0x2b/0x80 [mlx5_ib] mlx5r_probe+0xb8/0x150 [mlx5_ib] ? auxiliary_match_id+0x6a/0x90 auxiliary_bus_probe+0x3c/0x70 ? driver_sysfs_add+0x6b/0x90 really_probe+0xcd/0x380 __driver_probe_device+0x80/0x170 driver_probe_device+0x1e/0x90 __device_attach_driver+0x7d/0x100 ? driver_allows_async_probing+0x60/0x60 ? driver_allows_async_probing+0x60/0x60 bus_for_each_drv+0x7b/0xc0 __device_attach+0xbc/0x200 bus_probe_device+0x87/0xa0 device_add+0x404/0x940 ? dev_set_name+0x53/0x70 __auxiliary_device_add+0x43/0x60 add_adev+0x99/0xe0 [mlx5_core] mlx5_attach_device+0xc8/0x120 [mlx5_core] mlx5_load_one_devl_locked+0xb2/0xe0 [mlx5_core] devlink_reload+0x133/0x250 devlink_nl_cmd_reload+0x480/0x570 ? devlink_nl_pre_doit+0x44/0x2b0 genl_family_rcv_msg_doit.isra.0+0xc2/0x110 genl_rcv_msg+0x180/0x2b0 ? devlink_nl_cmd_region_read_dumpit+0x540/0x540 ? devlink_reload+0x250/0x250 ? devlink_put+0x50/0x50 ? genl_family_rcv_msg_doit.isra.0+0x110/0x110 netlink_rcv_skb+0x54/0x100 genl_rcv+0x24/0x40 netlink_unicast+0x1f6/0x2c0 netlink_sendmsg+0x237/0x490 sock_sendmsg+0x33/0x40 __sys_sendto+0x103/0x160 ? handle_mm_fault+0x10e/0x290 ? do_user_addr_fault+0x1c0/0x5f0 __x64_sys_sendto+0x25/0x30 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fix it by setting port_num to 1 in order to get device status and remove unused variable. Fixes: aac4492ef23a ("IB/mlx5: Update counter implementation for dual port RoCE") Link: https://lore.kernel.org/r/98b82994c3cd3fa593b8a75ed3f3901e208beb0f.1672231736.git.leonro@nvidia.com Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-12-14Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Usual size of updates, a new driver, and most of the bulk focusing on rxe: - Usual typos, style, and language updates - Driver updates for mlx5, irdma, siw, rts, srp, hfi1, hns, erdma, mlx4, srp - Lots of RXE updates: * Improve reply error handling for bad MR operations * Code tidying * Debug printing uses common loggers * Remove half implemented RD related stuff * Support IBA's recently defined Atomic Write and Flush operations - erdma support for atomic operations - New driver 'mana' for Ethernet HW available in Azure VMs. This driver only supports DPDK" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (122 commits) IB/IPoIB: Fix queue count inconsistency for PKEY child interfaces RDMA: Add missed netdev_put() for the netdevice_tracker RDMA/rxe: Enable RDMA FLUSH capability for rxe device RDMA/cm: Make QP FLUSHABLE for supported device RDMA/rxe: Implement flush completion RDMA/rxe: Implement flush execution in responder side RDMA/rxe: Implement RC RDMA FLUSH service in requester side RDMA/rxe: Extend rxe packet format to support flush RDMA/rxe: Allow registering persistent flag for pmem MR only RDMA/rxe: Extend rxe user ABI to support flush RDMA: Extend RDMA kernel verbs ABI to support flush RDMA: Extend RDMA user ABI to support flush RDMA/rxe: Fix incorrect responder length checking RDMA/rxe: Fix oops with zero length reads RDMA/mlx5: Remove not-used IB_FLOW_SPEC_IB define RDMA/hns: Fix XRC caps on HIP08 RDMA/hns: Fix error code of CMD RDMA/hns: Fix page size cap from firmware RDMA/hns: Fix PBL page MTR find RDMA/hns: Fix AH attr queried by query_qp ...
2022-12-08RDMA/mlx5: Remove not-used IB_FLOW_SPEC_IB defineZhu Yanjun
IB_FLOW_SPEC_IB is not used in mlx5 and can be deleted. Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://lore.kernel.org/r/20221208101954.687960-1-yanjun.zhu@intel.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-12-04RDMA/mlx5: no need to kfree NULL pointerLi Zhijian
Goto label 'free' where it will kfree the 'in' is not needed though it's safe to kfree NULL. Return err code directly to simplify the code. 1973 free: 1974 kfree(in); 1975 return err; Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Link: https://lore.kernel.org/r/20221203033714.25870-1-lizhijian@fujitsu.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-11-29net/mlx5: Generalize name of UMR alignment definitionTariq Toukan
Per the device spec, MLX5_UMR_MTT_ALIGNMENT is good not only for UMR MTT entries, but for all other entries as well, like KLMs and KSMs. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-06RDMA/mlx5: Change debug log level for remote access error syndromesArumugam Kolappan
The mlx5 driver dumps the entire CQE buffer by default for few syndromes. Some syndromes are expected due to the application behavior [ex: MLX5_CQE_SYNDROME_REMOTE_ACCESS_ERR, MLX5_CQE_SYNDROME_REMOTE_OP_ERR and MLX5_CQE_SYNDROME_LOCAL_PROT_ERR]. Hence, for these syndromes, the patch converts the log level from KERN_WARNING to KERN_DEBUG. This enables the application to get the CQE buffer dump by changing to KERN_DEBUG level as and when needed. Suggested-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Arumugam Kolappan <aru.kolappan@oracle.com> Link: https://lore.kernel.org/r/1667287664-19377-1-git-send-email-aru.kolappan@oracle.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2022-10-07Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Not a big list of changes this cycle, mostly small things. The new MANA rdma driver should come next cycle along with a bunch of work on rxe. Summary: - Small bug fixes in mlx5, efa, rxe, hns, irdma, erdma, siw - rts tracing improvements - Code improvements: strlscpy conversion, unused parameter, spelling mistakes, unused variables, flex arrays - restrack device details report for hns - Simplify struct device initialization in SRP - Eliminate the never-used service_mask support in IB CM - Make rxe not print to the console for some kinds of network packets - Asymetric paths and router support in the CM through netlink messages - DMABUF importer support for mlx5devx umem's" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (84 commits) RDMA/rxe: Remove error/warning messages from packet receiver path RDMA/usnic: fix set-but-not-unused variable 'flags' warning IB/hfi1: Use skb_put_data() instead of skb_put/memcpy pair RDMA/hns: Unified Log Printing Style RDMA/hns: Replacing magic number with macros in apply_func_caps() RDMA/hns: Repacing 'dseg_len' by macros in fill_ext_sge_inl_data() RDMA/hns: Remove redundant 'max_srq_desc_sz' in caps RDMA/hns: Remove redundant 'num_mtt_segs' and 'max_extend_sg' RDMA/hns: Remove redundant 'phy_addr' in hns_roce_hem_list_find_mtt() RDMA/hns: Remove redundant 'use_lowmem' argument from hns_roce_init_hem_table() RDMA/hns: Remove redundant 'bt_level' for hem_list_alloc_item() RDMA/hns: Remove redundant 'attr_mask' in modify_qp_init_to_init() RDMA/hns: Remove unnecessary brackets when getting point RDMA/hns: Remove unnecessary braces for single statement blocks RDMA/hns: Cleanup for a spelling error of Asynchronous IB/rdmavt: Add __init/__exit annotations to module init/exit funcs RDMA/rxe: Remove redundant num_sge fields RDMA/mlx5: Enable ATS support for MRs and umems RDMA/mlx5: Add support for dmabuf to devx umem RDMA/core: Add UVERBS_ATTR_RAW_FD ...
2022-10-06Merge tag 'v6.0' into rdma.git for-nextJason Gunthorpe
Trvial merge conflicts against rdma.git for-rc resolved matching linux-next: drivers/infiniband/hw/hns/hns_roce_hw_v2.c drivers/infiniband/hw/hns/hns_roce_main.c https://lore.kernel.org/r/20220929124005.105149-1-broonie@kernel.org Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>