Age | Commit message (Collapse) | Author |
|
Extend the output to print also the request type.
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-8-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In the function init_conns(), after the create_con() and create_cm() for
loop if something fails. In the cleanup for loop after the destroy tag, we
access out of bound memory because cid is set to clt_path->s.con_num.
This commits resets the cid to clt_path->s.con_num - 1, to stay in bounds
in the cleanup loop later.
Fixes: 6a98d71daea1 ("RDMA/rtrs: client: main functionality")
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-7-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
mr has a member need_inval, which can be used to indicate if
local invalidate is needed, switch to it and remove need_inv
from rtrs_clt_io_req.
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-6-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Reset hb_missed_cnt after receiving traffic from other peer, so
hb is more robust again high load on host or network.
Fixes: 6a98d71daea1 ("RDMA/rtrs: client: main functionality")
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-5-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
On network errors, a large number of these logs are printed due to all the
inflight IOs, rate limit them so they do not clutter kernel log.
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-4-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In some cases need_inv can be missed for write requests, additionally
driver has to handle missing invalidates for write requests. While at
it, remove the else case from write invalidate path as it is possible
to reach there.
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-3-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In case of HB error, we need to know the specific path on which it
happened, for better debugging. Since the clt/srv path structures are not
available in rtrs.c, it needs to be done in the individual HB error
handler.
This commit add those loging. A sample kernel log output after this commit:
rtrs_core L357: <blya>: HB missed max reached.
rtrs_server L717: <blya>: HB err handler for path=ip:x.x.x.x@ip:x.x.x.x
.
.
rtrs_core L357: <blya>: HB missed max reached.
rtrs_client L1519: <blya>: HB err handler for path=ip:x.x.x.x@ip:x.x.x.x
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Grzegorz Prajsner <grzegorz.prajsner@ionos.com>
Link: https://patch.msgid.link/20240821112217.41827-2-haris.iqbal@ionos.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Selvin Xavier says:
=============
Enable the Variable size Work Queue entry support for Gen P7
adapters. This would help in the better utilization of the queue memory
and pci bandwidth due to the smaller send queue Work entries.
=============
Based on v6.11-rc5 for dependencies.
* bnxt_re_variable_wqes: (829 commits)
RDMA/bnxt_re: Enable variable size WQEs for user space applications
RDMA/bnxt_re: Handle variable WQE support for user applications
RDMA/bnxt_re: Fix the table size for PSN/MSN entries
RDMA/bnxt_re: Get the WQE index from slot index while completing the WQEs
RDMA/bnxt_re: Add support for Variable WQE in Genp7 adapters
Linux 6.11-rc5
...
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Add backward compatibility code to enable variable size WQEs only if the
user lib supports it.
Link: https://patch.msgid.link/r/1724042847-1481-6-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
User library calculates the number of slots required for user applications
and it can pass that information to the driver. Driver can use this value
and update the HW directly. This mechanism is currently used only for the
newly introduced variable size WQEs.
Extend the bnxt_re_qp_req structure to pass the Send Queue slot count.
Reorganize the code to get the sq_slots before initializing the Send Queue
attributes.
Link: https://patch.msgid.link/r/1724042847-1481-5-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
HW MSN table size is always a power of 2. So the pages should be mapped
accordingly.
Use the power of two calculation while get the number of PSN/MSN entries.
Fixes: 6f6bfbc595fb ("RDMA/bnxt_re: Expose the MSN table capability for user library")
Link: https://patch.msgid.link/r/1724042847-1481-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
While reporting the completions, SQ Work Queue index is required to
identify the WQE that generated the completions. In variable WQE mode, FW
returns the slot index for Error completions. Driver need to walk through
the shadow queue between the consumer index and producer index and matches
the slot index returned by FW. If a match is found, the next index of the
shadow queue is the WQE index to be considered for remaining poll_cq loop.
Link: https://patch.msgid.link/r/1724042847-1481-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Variable size WQE means that each send Work Queue Entry to HW can use
different WQE sizes as opposed to the static WQE size on the current
devices. Set variable WQE mode for Gen P7 devices. Depth of the Queue will
be a multiple of slot which is 16 bytes. The number of slots should be a
multiple of 256 as per the HW requirement.
Initialize the Software shadow queue to hold requests equal to the number
of slots. Also, do not expose the variable size WQE capability until the
last patch in the series.
Link: https://patch.msgid.link/r/1724042847-1481-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Hongguang Gao <hongguang.gao@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Propagate the unique, per device, ID in the device attributes to the
standard node_guid value in IB device.
Link: https://patch.msgid.link/r/20240822171143.2800-1-mrgolin@amazon.com
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Yehuda Yitschak <yehuday@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
In the commit aee2424246f9 ("RDMA/iwcm: Fix a use-after-free related to
destroying CM IDs"), the function flush_workqueue is invoked to flush the
work queue iwcm_wq.
But at that time, the work queue iwcm_wq was created via the function
alloc_ordered_workqueue without the flag WQ_MEM_RECLAIM.
Because the current process is trying to flush the whole iwcm_wq, if
iwcm_wq doesn't have the flag WQ_MEM_RECLAIM, verify that the current
process is not reclaiming memory or running on a workqueue which doesn't
have the flag WQ_MEM_RECLAIM as that can break forward-progress guarantee
leading to a deadlock.
The call trace is as below:
[ 125.350876][ T1430] Call Trace:
[ 125.356281][ T1430] <TASK>
[ 125.361285][ T1430] ? __warn (kernel/panic.c:693)
[ 125.367640][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
[ 125.375689][ T1430] ? report_bug (lib/bug.c:180 lib/bug.c:219)
[ 125.382505][ T1430] ? handle_bug (arch/x86/kernel/traps.c:239)
[ 125.388987][ T1430] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1))
[ 125.395831][ T1430] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621)
[ 125.403125][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
[ 125.410984][ T1430] ? check_flush_dependency (kernel/workqueue.c:3706 (discriminator 9))
[ 125.418764][ T1430] __flush_workqueue (kernel/workqueue.c:3970)
[ 125.426021][ T1430] ? __pfx___might_resched (kernel/sched/core.c:10151)
[ 125.433431][ T1430] ? destroy_cm_id (drivers/infiniband/core/iwcm.c:375) iw_cm
[ 125.441209][ T1430] ? __pfx___flush_workqueue (kernel/workqueue.c:3910)
[ 125.473900][ T1430] ? _raw_spin_lock_irqsave (arch/x86/include/asm/atomic.h:107 include/linux/atomic/atomic-arch-fallback.h:2170 include/linux/atomic/atomic-instrumented.h:1302 include/asm-generic/qspinlock.h:111 include/linux/spinlock.h:187 include/linux/spinlock_api_smp.h:111 kernel/locking/spinlock.c:162)
[ 125.473909][ T1430] ? __pfx__raw_spin_lock_irqsave (kernel/locking/spinlock.c:161)
[ 125.482537][ T1430] _destroy_id (drivers/infiniband/core/cma.c:2044) rdma_cm
[ 125.495072][ T1430] nvme_rdma_free_queue (drivers/nvme/host/rdma.c:656 drivers/nvme/host/rdma.c:650) nvme_rdma
[ 125.505827][ T1430] nvme_rdma_reset_ctrl_work (drivers/nvme/host/rdma.c:2180) nvme_rdma
[ 125.505831][ T1430] process_one_work (kernel/workqueue.c:3231)
[ 125.515122][ T1430] worker_thread (kernel/workqueue.c:3306 kernel/workqueue.c:3393)
[ 125.515127][ T1430] ? __pfx_worker_thread (kernel/workqueue.c:3339)
[ 125.531837][ T1430] kthread (kernel/kthread.c:389)
[ 125.539864][ T1430] ? __pfx_kthread (kernel/kthread.c:342)
[ 125.550628][ T1430] ret_from_fork (arch/x86/kernel/process.c:147)
[ 125.558840][ T1430] ? __pfx_kthread (kernel/kthread.c:342)
[ 125.558844][ T1430] ret_from_fork_asm (arch/x86/entry/entry_64.S:257)
[ 125.566487][ T1430] </TASK>
[ 125.566488][ T1430] ---[ end trace 0000000000000000 ]---
Fixes: aee2424246f9 ("RDMA/iwcm: Fix a use-after-free related to destroying CM IDs")
Link: https://patch.msgid.link/r/20240820113336.19860-1-yanjun.zhu@linux.dev
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202408151633.fc01893c-oliver.sang@intel.com
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
__bth_set_resv6a is used to clear BIT [24, 29] of rxe_bth::qpn, the
wrong expression leads other BITs into 1.
Link: https://patch.msgid.link/r/20240822065223.1117056-4-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Fix 'rmda' into 'RDMA'.
Link: https://patch.msgid.link/r/20240822065223.1117056-3-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Use 'sizeof(union rdma_network_hdr)' instead of hard code GRH length
for GSI and UD.
Link: https://patch.msgid.link/r/20240822065223.1117056-2-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Let alloc_ordered_workqueue() format the workqueue name instead of calling
snprintf() explicitly.
Link: https://patch.msgid.link/r/20240823101840.515398-5-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Let alloc_ordered_workqueue() format the workqueue name instead of calling
snprintf() explicitly.
Link: https://patch.msgid.link/r/20240823101840.515398-4-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Let alloc_ordered_workqueue() format the workqueue name instead of calling
snprintf() explicitly.
Link: https://patch.msgid.link/r/20240823101840.515398-3-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Let alloc_ordered_workqueue() format the workqueue name instead of
calling snprintf() explicitly.
Link: https://patch.msgid.link/r/20240823101840.515398-2-ruanjinjie@huawei.com
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
There are some declarations without function definition, which are listed
as below:
1. ipoib_ib_tx_timer_func() has also been removed since
commit 8966e28d2e40 ("IB/ipoib: Use NAPI in UD/TX flows")
2. ipoib_pkey_event() has been removed since
commit ee1e2c82c245 ("IPoIB: Refresh paths instead of flushing them on
SM change events")
3. ipoib_mcast_dev_down() has been removed since
commit 988bd50300ef ("IPoIB: Fix memory leak of multicast group
structures")
4. ipoib_pkey_open() has been removed since
commit dd57c9308aff ("IB/ipoib: Avoid multicast join attempts with
invalid P_key")
Remove these unused declarations.
Link: https://patch.msgid.link/r/20240818055702.79547-3-zhangzekun11@huawei.com
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The definition of rdma_resolve_ip_route() has been removed. Remove the
unused declaration.
Fixes: 6aaecd385685 ("RDMA/core: Simplify roce_resolve_route_from_path()")
Link: https://patch.msgid.link/r/20240818055702.79547-2-zhangzekun11@huawei.com
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Commit e6fb246ccafb ("RDMA/mlx5: Consolidate MR destruction to
mlx5_ib_dereg_mr()") removed mlx5_ib_free_implicit_mr() but left
the declaration.
Commit d98995b4bf98 ("net/mlx5: Reimplement write combining test") left
mlx5_ib_test_wc().
Remove the unused declarations.
Link: https://patch.msgid.link/r/20240816101358.881247-1-yuehaibing@huawei.com
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
We want the compiler to see that fdput() on empty instance
is a no-op. The emptiness check is that file reference is NULL,
while fdput() is "fput() if FDPUT_FPUT is present in flags".
The reason why fdput() on empty instance is a no-op is something
compiler can't see - it's that we never generate instances with
NULL file reference combined with non-zero flags.
It's not that hard to deal with - the real primitives behind
fdget() et.al. are returning an unsigned long value, unpacked by (inlined)
__to_fd() into the current struct file * + int. The lower bits are
used to store flags, while the rest encodes the pointer. Linus suggested
that keeping this unsigned long around with the extractions done by inlined
accessors should generate a sane code and that turns out to be the case.
Namely, turning struct fd into a struct-wrapped unsinged long, with
fd_empty(f) => unlikely(f.word == 0)
fd_file(f) => (struct file *)(f.word & ~3)
fdput(f) => if (f.word & 1) fput(fd_file(f))
ends up with compiler doing the right thing. The cost is the patch
footprint, of course - we need to switch f.file to fd_file(f) all over
the tree, and it's not doable with simple search and replace; there are
false positives, etc.
Note that the sole member of that structure is an opaque
unsigned long - all accesses should be done via wrappers and I don't
want to use a name that would invite manual casts to file pointers,
etc. The value of that member is equal either to (unsigned long)p | flags,
p being an address of some struct file instance, or to 0 for an empty fd.
For now the new predicate (fd_empty(f)) has no users; all the
existing checks have form (!fd_file(f)). We will convert to fd_empty()
use later; here we only define it (and tell the compiler that it's
unlikely to return true).
This commit only deals with representation change; there will
be followups.
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Introduce the 'GET_DATA_DIRECT_SYSFS_PATH' ioctl to return the sysfs
path of the affiliated 'data direct' device for a given device.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/403745463e0ef52adbef681ff09aa6a29a756352.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add support for DMABUF MR registrations with Data-direct device.
Upon userspace calling to register a DMABUF MR with the data direct bit
set, the below algorithm will be followed.
1) Obtain a pinned DMABUF umem from the IB core using the user input
parameters (FD, offset, length) and the DMA PF device. The DMA PF
device is needed to allow the IOMMU to enable the DMA PF to access the
user buffer over PCI.
2) Create a KSM MKEY by setting its entries according to the user buffer
VA to IOVA mapping, with the MKEY being the data direct device-crossed
MKEY. This KSM MKEY is umrable and will be used as part of the MR cache.
The PD for creating it is the internal device 'data direct' kernel one.
3) Create a crossing MKEY that points to the KSM MKEY using the crossing
access mode.
4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs
managed on the mlx5_ib device.
5) Return the crossing MKEY to the user, created with its supplied PD.
Upon DMA PF unbind flow, the driver will revoke the KSM entries.
The final deregistration will occur under the hood once the application
deregisters its MKEY.
Notes:
- This version supports only the PINNED UMEM mode, so there is no
dependency on ODP.
- The IOVA supplied by the application must be system page aligned due to
HW translations of KSM.
- The crossing MKEY will not be umrable or part of the MR cache, as we
cannot change its crossed (i.e. KSM) MKEY over UMR.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Pass uverbs_attr_bundle as part of '.reg_user_mr_dmabuf' API instead of
udata.
This enables passing some new ioctl attributes to the drivers, as will
be introduced in the next patches for mlx5 driver.
Change the involved drivers accordingly.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/9a25b2fc02443f7c36c2d93499ae25252b6afd40.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Introduce an option to revoke DMABUF umem.
This option will retain the umem allocation while revoking its DMA
mapping. Furthermore, any subsequent attempts to map the pages should
fail once the umem has been revoked.
This functionality will be utilized in the upcoming patches in the
series, where we aim to delay umem deallocation until the mkey
deregistration. However, we must unmap its pages immediately.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/a38270f2fe4a194868ca2312f4c1c760e51bcbff.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add support for creating pinned DMABUF umem with a specified DMA device
instead of the DMA device of the given IB device.
This API will be utilized in the upcoming patches of the series when
multiple path DMAs are implemented.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/038aad36a43797e5591b20ba81051fc5758124f9.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add the NET device initialization flow to utilize the 'data
direct' device.
When a NET mlx5_ib device is capable of 'data direct', the following
sequence of actions will occur:
- Find its affiliated 'data direct' VUID via a firmware command.
- Create its own private PD and 'data direct' mkey.
- Register to be notified when its 'data direct' driver is probed or removed.
The DMA device of the affiliated 'data direct' device, including the
private PD and the 'data direct' mkey, will be used later during MR
registrations that request the data direct functionality.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/b11fa87b2a65bce4db8d40341bb6cee490fa4d06.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Introduce the 'data direct' driver for a ConnectX-8 Data Direct device.
The 'data direct' driver functions as the affiliated DMA device for one
or more capable mlx5_ib devices. This DMA device, as the name suggests,
is used exclusively for DMA operations. It can be considered a DMA engine
managed by a PF/VF, lacking network capabilities and having minimal overall
capabilities.
Consequently, the DMA NIC PF will not be exposed to or directly used by
software applications. The driver will not have any direct interface or
interaction with the firmware (no command interface, no capabilities,
etc.). It will operate solely over PCI to enable its DMA functionality.
Registration and un-registration of the driver are handled as part of
the mlx5_ib initialization and exit processes, as the mlx5_ib devices
will effectively be its clients.
The driver will serve as the DMA device for accessing another PCI device
to achieve optimal performance (both on the same NUMA node, P2P access,
etc.).
Upon probing, it will read its VUID over PCI to handle mlx5_ib device
registrations with the same VUID.
Upon removal, it will notify its clients to allow them to clean up the
resources that were mmaped with its DMA device.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/b77edecfd476c3f445da96ab6aef499ae47b2829.1722512548.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
In multiport mode, RDMA devices make it impossible for userspace to use
DEVX to discover vhca id values for ports beyond port 1. This patch
addresses the issue by exposing the vhca id of all ports.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://patch.msgid.link/41dea83aa51843aa4c067b4f73f28d64e51bd53c.1722331101.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Use strict parsing validation for set commands, and liberal
validation for get commands. Additionally, remove all usage of
nlmsg_parse_depricate().
Strict parsing validation fails when encountering unrecognized
attributes in the Netlink message, while liberal parsing
validation ignores them.
In 57d7a8fd904c ("rdma: Add an option to display driver-specific QPs in the rdma tool")
in iproute2, the attribute RDMA_NLDEV_ATTR_DRIVER_DETAILS
was added. This cause backwards compatibility issues when using
the rdma tool with the new attribute and an older kernel which does
recognize this attribute.
In this case, the command "rdma stat show mr" would fail, because the
new rdma tool would fill the netlink message with the new attribute and
the older kernel would fail as it used strict parsing and did not
recognize the new attribute.
In general, strict validation is appropriate for set commands as they
modify the system, while liberal validation is suitable for get
commands which only query system information.
Replace all uses of nlmsg_parse_deprecated() with __nlmsg_parse(),
using the NL_VALIDATE_LIBERAL flag.
The nlmsg_parse_deprecated() function internally calls
__nlmsg_parse() with the NL_VALIDATE_LIBERAL flag, but its name
is confusing.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/f633a979a49db090d05c24a3ba83d30727bb777b.1722331020.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Current timeout handler of mad agent acquires/releases mad_agent_priv
lock for every timed out WRs. This causes heavy locking contention
when higher no. of WRs are to be handled inside timeout handler.
This leads to softlockup with below trace in some use cases where
rdma-cm path is used to establish connection between peer nodes
Trace:
-----
BUG: soft lockup - CPU#4 stuck for 26s! [kworker/u128:3:19767]
CPU: 4 PID: 19767 Comm: kworker/u128:3 Kdump: loaded Tainted: G OE
------- --- 5.14.0-427.13.1.el9_4.x86_64 #1
Hardware name: Dell Inc. PowerEdge R740/01YM03, BIOS 2.4.8 11/26/2019
Workqueue: ib_mad1 timeout_sends [ib_core]
RIP: 0010:__do_softirq+0x78/0x2ac
RSP: 0018:ffffb253449e4f98 EFLAGS: 00000246
RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 000000000000001f
RDX: 000000000000001d RSI: 000000003d1879ab RDI: fff363b66fd3a86b
RBP: ffffb253604cbcd8 R08: 0000009065635f3b R09: 0000000000000000
R10: 0000000000000040 R11: ffffb253449e4ff8 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000040
FS: 0000000000000000(0000) GS:ffff8caa1fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd9ec9db900 CR3: 0000000891934006 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
? show_trace_log_lvl+0x1c4/0x2df
? show_trace_log_lvl+0x1c4/0x2df
? __irq_exit_rcu+0xa1/0xc0
? watchdog_timer_fn+0x1b2/0x210
? __pfx_watchdog_timer_fn+0x10/0x10
? __hrtimer_run_queues+0x127/0x2c0
? hrtimer_interrupt+0xfc/0x210
? __sysvec_apic_timer_interrupt+0x5c/0x110
? sysvec_apic_timer_interrupt+0x37/0x90
? asm_sysvec_apic_timer_interrupt+0x16/0x20
? __do_softirq+0x78/0x2ac
? __do_softirq+0x60/0x2ac
__irq_exit_rcu+0xa1/0xc0
sysvec_call_function_single+0x72/0x90
</IRQ>
<TASK>
asm_sysvec_call_function_single+0x16/0x20
RIP: 0010:_raw_spin_unlock_irq+0x14/0x30
RSP: 0018:ffffb253604cbd88 EFLAGS: 00000247
RAX: 000000000001960d RBX: 0000000000000002 RCX: ffff8cad2a064800
RDX: 000000008020001b RSI: 0000000000000001 RDI: ffff8cad5d39f66c
RBP: ffff8cad5d39f600 R08: 0000000000000001 R09: 0000000000000000
R10: ffff8caa443e0c00 R11: ffffb253604cbcd8 R12: ffff8cacb8682538
R13: 0000000000000005 R14: ffffb253604cbd90 R15: ffff8cad5d39f66c
cm_process_send_error+0x122/0x1d0 [ib_cm]
timeout_sends+0x1dd/0x270 [ib_core]
process_one_work+0x1e2/0x3b0
? __pfx_worker_thread+0x10/0x10
worker_thread+0x50/0x3a0
? __pfx_worker_thread+0x10/0x10
kthread+0xdd/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x29/0x50
</TASK>
Simplified timeout handler by creating local list of timed out WRs
and invoke send handler post creating the list. The new method acquires/
releases lock once to fetch the list and hence helps to reduce locking
contetiong when processing higher no. of WRs
Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com>
Link: https://lore.kernel.org/r/20240722110325.195085-1-saravanan.vajravel@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Toggling link while running NVME-oF over siw hits a kernel panic
due to race condition within siw_handler and ib_destroy_qp().
The IB_EVENT_PORT_ERR event can alone handle destroying qps.
therefore remove unwanted processing in siw.
Suggested-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Showrya M N <showrya@chelsio.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Link: https://lore.kernel.org/r/20240724085428.3813-1-showrya@chelsio.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
dma_alloc_coherent() allocates contiguous memory irrespective of
iommu mode, but after commit f5ff79fddf0e ("dma-mapping: remove CONFIG_DMA_REMAP")
if iommu is enabled in translate mode, dma_alloc_coherent() may
allocate non-contiguous memory. Attempt to map this memory results in panic.
This patch fixes the issue by using dma_mmap_coherent() to map each page
to user space.
Signed-off-by: Anumula Murali Mohan Reddy <anumula@chelsio.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Link: https://lore.kernel.org/r/20240716142532.97423-1-anumula@chelsio.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Will Deacon:
"Core:
- Support for the "ats-supported" device-tree property
- Removal of the 'ops' field from 'struct iommu_fwspec'
- Introduction of iommu_paging_domain_alloc() and partial conversion
of existing users
- Introduce 'struct iommu_attach_handle' and provide corresponding
IOMMU interfaces which will be used by the IOMMUFD subsystem
- Remove stale documentation
- Add missing MODULE_DESCRIPTION() macro
- Misc cleanups
Allwinner Sun50i:
- Ensure bypass mode is disabled on H616 SoCs
- Ensure page-tables are allocated below 4GiB for the 32-bit
page-table walker
- Add new device-tree compatible strings
AMD Vi:
- Use try_cmpxchg64() instead of cmpxchg64() when updating pte
Arm SMMUv2:
- Print much more useful information on context faults
- Fix Qualcomm TBU probing when CONFIG_ARM_SMMU_QCOM_DEBUG=n
- Add new Qualcomm device-tree bindings
Arm SMMUv3:
- Support for hardware update of access/dirty bits and reporting via
IOMMUFD
- More driver rework from Jason, this time updating the PASID/SVA
support to prepare for full IOMMUFD support
- Add missing MODULE_DESCRIPTION() macro
- Minor fixes and cleanups
NVIDIA Tegra:
- Fix for benign fwspec initialisation issue exposed by rework on the
core branch
Intel VT-d:
- Use try_cmpxchg64() instead of cmpxchg64() when updating pte
- Use READ_ONCE() to read volatile descriptor status
- Remove support for handling Execute-Requested requests
- Avoid calling iommu_domain_alloc()
- Minor fixes and refactoring
Qualcomm MSM:
- Updates to the device-tree bindings"
* tag 'iommu-updates-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (72 commits)
iommu/tegra-smmu: Pass correct fwnode to iommu_fwspec_init()
iommu/vt-d: Fix identity map bounds in si_domain_init()
iommu: Move IOMMU_DIRTY_NO_CLEAR define
dt-bindings: iommu: Convert msm,iommu-v0 to yaml
iommu/vt-d: Fix aligned pages in calculate_psi_aligned_address()
iommu/vt-d: Limit max address mask to MAX_AGAW_PFN_WIDTH
docs: iommu: Remove outdated Documentation/userspace-api/iommu.rst
arm64: dts: fvp: Enable PCIe ATS for Base RevC FVP
iommu/of: Support ats-supported device-tree property
dt-bindings: PCI: generic: Add ats-supported property
iommu: Remove iommu_fwspec ops
OF: Simplify of_iommu_configure()
ACPI: Retire acpi_iommu_fwspec_ops()
iommu: Resolve fwspec ops automatically
iommu/mediatek-v1: Clean up redundant fwspec checks
RDMA/usnic: Use iommu_paging_domain_alloc()
wifi: ath11k: Use iommu_paging_domain_alloc()
wifi: ath10k: Use iommu_paging_domain_alloc()
drm/msm: Use iommu_paging_domain_alloc()
vhost-vdpa: Use iommu_paging_domain_alloc()
...
|
|
Pull rdma updates from Jason Gunthorpe:
"Usual collection of small improvements and fixes:
- Bug fixes and minor improvments in efa, irdma, mlx4, mlx5, rxe,
hf1, qib, ocrdma
- bnxt_re support for MSN, which is a new retransmit logic
- Initial mana support for RC qps
- Use after free bug and cleanups in iwcm
- Reduce resource usage in mlx5 when RDMA verbs features are not used
- New verb to drain shared recieve queues, similar to normal recieve
queues. This is necessary to allow ULPs a clean shutdown. Used in
the iscsi rdma target
- mlx5 support for more than 16 bits of doorbell indexes
- Doorbell moderation support for bnxt_re
- IB multi-plane support for mlx5
- New EFA adaptor PCI IDs
- RDMA_NAME_ASSIGN_TYPE_USER to hint to userspace that it shouldn't
rename the device
- A collection of hns bugs
- Fix long standing bug in bnxt_re with incorrect endian handling of
immediate data"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (65 commits)
IB/hfi1: Constify struct flag_table
RDMA/mana_ib: Set correct device into ib
bnxt_re: Fix imm_data endianness
RDMA: Fix netdev tracker in ib_device_set_netdev
RDMA/hns: Fix mbx timing out before CMD execution is completed
RDMA/hns: Fix insufficient extend DB for VFs.
RDMA/hns: Fix undifined behavior caused by invalid max_sge
RDMA/hns: Fix shift-out-bounds when max_inline_data is 0
RDMA/hns: Fix missing pagesize and alignment check in FRMR
RDMA/hns: Fix unmatch exception handling when init eq table fails
RDMA/hns: Fix soft lockup under heavy CEQE load
RDMA/hns: Check atomic wr length
RDMA/ocrdma: Don't inline statistics functions
RDMA/core: Introduce "name_assign_type" for an IB device
RDMA/qib: Fix truncation compilation warnings in qib_verbs.c
RDMA/qib: Fix truncation compilation warnings in qib_init.c
RDMA/efa: Add EFA 0xefa3 PCI ID
RDMA/mlx5: Support per-plane port IB counters by querying PPCNT register
net/mlx5: mlx5_ifc update for accessing ppcnt register of plane ports
RDMA/mlx5: Add plane index support when querying PTYS registers
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Saeed Mahameed says:
====================
aux-sysfs-irqs
Shay Says:
==========
Introduce auxiliary bus IRQs sysfs
Today, PCI PFs and VFs, which are anchored on the PCI bus, display their
IRQ information in the <pci_device>/msi_irqs/<irq_num> sysfs files. PCI
subfunctions (SFs) are similar to PFs and VFs and these SFs are anchored
on the auxiliary bus. However, these PCI SFs lack such IRQ information
on the auxiliary bus, leaving users without visibility into which IRQs
are used by the SFs. This absence makes it impossible to debug
situations and to understand the source of interrupts/SFs for
performance tuning and debug.
Additionally, the SFs are multifunctional devices supporting RDMA,
network devices, clocks, and more, similar to their peer PCI PFs and
VFs. Therefore, it is desirable to have SFs' IRQ information available
at the bus/device level.
To overcome the above limitations, this short series extends the
auxiliary bus to display IRQ information in sysfs, similar to that of
PFs and VFs.
It adds an 'irqs' directory under the auxiliary device and includes an
<irq_num> sysfs file within it.
For example:
$ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
50 51 52 53 54 55 56 57 58
Patch summary:
patch-1 adds auxiliary bus to support irqs used by auxiliary device
patch-2 mlx5 driver using exposing irqs for PCI SF devices via auxiliary
bus
==========
* tag 'aux-sysfs-irqs' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Expose SFs IRQs
driver core: auxiliary bus: show auxiliary device IRQs
RDMA/mlx5: Add Qcounters req_transport_retries_exceeded/req_rnr_retries_exceeded
net/mlx5: Reimplement write combining test
====================
Link: https://patch.msgid.link/20240711213140.256997-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
'struct flag_table' are not modified in this driver.
Constifying this structure moves some data to a read-only section, so
increase overall security.
On a x86_64, with allmodconfig:
Before:
======
text data bss dec hex filename
302932 40271 112 343315 53d13 drivers/infiniband/hw/hfi1/chip.o
After:
=====
text data bss dec hex filename
311636 31567 112 343315 53d13 drivers/infiniband/hw/hfi1/chip.o
Link: https://lore.kernel.org/r/782b6a648bfbbf2bb83f81a73c0460b5bb7642a1.1720959310.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Add mana_get_primary_netdev_rcu helper to get a primary
netdevice for a given port. When mana is used with
netvsc, the VF netdev is controlled by an upper netvsc
device. In a baremetal case, the VF netdev is the
primary device.
Use the mana_get_primary_netdev_rcu() helper in the mana_ib
to get the correct device for querying network states.
Fixes: 8b184e4f1c32 ("RDMA/mana_ib: Enable RoCE on port 1")
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Link: https://lore.kernel.org/r/1720705077-322-1-git-send-email-kotaranov@linux.microsoft.com
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When map a device between servers with MLX and BCM RoCE nics, RTRS
server complain about unknown imm type, and can't map the device,
After more debug, it seems bnxt_re wrongly handle the
imm_data, this patch fixed the compat issue with MLX for us.
In off list discussion, Selvin confirmed HW is working in little endian format
and all data needs to be converted to LE while providing.
This patch fix the endianness for imm_data
Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20240710122102.37569-1-jinpu.wang@ionos.com
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
If a netdev has already been assigned, ib_device_set_netdev needs to
release the reference on the older netdev but it is mistakenly being
called for the new netdev. Fix it and in the process use netdev_put
to be symmetrical with the netdev_hold.
Fixes: 09f530f0c6d6 ("RDMA: Add netdevice_tracker to ib_device_set_netdev()")
Signed-off-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240710203310.19317-1-dsahern@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When a large number of tasks are issued, the speed of HW processing
mbx will slow down. The standard for judging mbx timeout in the current
firmware is 30ms, and the current timeout standard for the driver is also
30ms.
Considering that firmware scheduling in multi-function scenarios takes a
certain amount of time, this will cause the driver to time out too early
and report a failure before mbx execution times out.
This patch introduces a new mechanism that can set different timeouts for
different cmds and extends the timeout of mbx to 35ms.
Fixes: a04ff739f2a9 ("RDMA/hns: Add command queue support for hip08 RoCE driver")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240710133705.896445-9-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
VFs and its PF will share the memory of the extend DB. Currently,
the number of extend DB allocated by driver is only enough for PF.
This leads to a probability of DB loss and some other problems in
scenarios where both PF and VFs use a large number of QPs.
Fixes: 6b63597d3540 ("RDMA/hns: Add TSQ link table support")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240710133705.896445-8-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
If max_sge has been set to 0, roundup_pow_of_two() in
set_srq_basic_param() may have undefined behavior.
Fixes: 9dd052474a26 ("RDMA/hns: Allocate one more recv SGE for HIP08")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240710133705.896445-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
A shift-out-bounds may occur, if the max_inline_data has not been set.
The related log:
UBSAN: shift-out-of-bounds in kernel/include/linux/log2.h:57:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
Call trace:
dump_backtrace+0xb0/0x118
show_stack+0x20/0x38
dump_stack_lvl+0xbc/0x120
dump_stack+0x1c/0x28
__ubsan_handle_shift_out_of_bounds+0x104/0x240
set_ext_sge_param+0x40c/0x420 [hns_roce_hw_v2]
hns_roce_create_qp+0xf48/0x1c40 [hns_roce_hw_v2]
create_qp.part.0+0x294/0x3c0
ib_create_qp_kernel+0x7c/0x150
create_mad_qp+0x11c/0x1e0
ib_mad_init_device+0x834/0xc88
add_client_context+0x248/0x318
enable_device_and_get+0x158/0x280
ib_register_device+0x4ac/0x610
hns_roce_init+0x890/0xf98 [hns_roce_hw_v2]
__hns_roce_hw_v2_init_instance+0x398/0x720 [hns_roce_hw_v2]
hns_roce_hw_v2_init_instance+0x108/0x1e0 [hns_roce_hw_v2]
hclge_init_roce_client_instance+0x1a0/0x358 [hclge]
hclge_init_client_instance+0xa0/0x508 [hclge]
hnae3_register_client+0x18c/0x210 [hnae3]
hns_roce_hw_v2_init+0x28/0xff8 [hns_roce_hw_v2]
do_one_initcall+0xe0/0x510
do_init_module+0x110/0x370
load_module+0x2c6c/0x2f20
init_module_from_file+0xe0/0x140
idempotent_init_module+0x24c/0x350
__arm64_sys_finit_module+0x88/0xf8
invoke_syscall+0x68/0x1a0
el0_svc_common.constprop.0+0x11c/0x150
do_el0_svc+0x38/0x50
el0_svc+0x50/0xa0
el0t_64_sync_handler+0xc0/0xc8
el0t_64_sync+0x1a4/0x1a8
Fixes: 0c5e259b06a8 ("RDMA/hns: Fix incorrect sge nums calculation")
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Link: https://lore.kernel.org/r/20240710133705.896445-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|