Age | Commit message (Collapse) | Author |
|
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230705190309.579783-16-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When a VLAN interface is in use, get and use the VLAN
egress mapping.
Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Link: https://lore.kernel.org/r/20230711175318.1301-1-shiraz.saleem@intel.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
This code is trying to ensure that only the flags specified in the list
are allowed. The problem is that ucmd->rx_hash_fields_mask is a u64 and
the flags are an enum which is treated as a u32 in this context. That
means the test doesn't check whether the highest 32 bits are zero.
Fixes: 4d02ebd9bbbd ("IB/mlx4: Fix RSS hash fields restrictions")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/233ed975-982d-422a-b498-410f71d8a101@moroto.mountain
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Delete a duplicate statement from this function implementation.
Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Minjie Du <duminjie@vivo.com>
Acked-by: Alok Prasad <palok@marvell.com>
Link: https://lore.kernel.org/r/20230706022704.1260-1-duminjie@vivo.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Update device API and request RDMA write counters if RDMA write is
supported by device. Expose newly added counters through ib core
counters mechanism.
Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
Link: https://lore.kernel.org/r/20230703153404.30877-1-mrgolin@amazon.com
Reviewed-by: Gal Pressman <gal.pressman@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The MR memory allocation requests extra bytes to guarantee that there
is enough space to find the memory aligned to MLX5_UMR_ALIGN.
For power-of-two sizes, the alignment can be guaranteed by kmalloc()
according to commit 59bb47985c1d ("mm, sl[aou]b: guarantee natural
alignment for kmalloc(power-of-two)").
So if target alignment is power-of-two and adding the extra bytes
crosses a power-of-two boundary, use the next power-of-two as the
allocation size.
Signed-off-by: Yuanyuan Zhong <yzhong@purestorage.com>
Link: https://lore.kernel.org/r/20230629213248.3184245-2-yzhong@purestorage.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Pull rdma updates from Jason Gunthorpe:
"This cycle saw a focus on rxe and bnxt_re drivers:
- Code cleanups for irdma, rxe, rtrs, hns, vmw_pvrdma
- rxe uses workqueues instead of tasklets
- rxe has better compliance around access checks for MRs and rereg_mr
- mana supportst he 'v2' FW interface for RX coalescing
- hfi1 bug fix for stale cache entries in its MR cache
- mlx5 buf fix to handle FW failures when destroying QPs
- erdma HW has a new doorbell allocation mechanism for uverbs that is
secure
- Lots of small cleanups and rework in bnxt_re:
- Use the common mmap functions
- Support disassociation
- Improve FW command flow
- support for 'low latency push'"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (71 commits)
RDMA/bnxt_re: Fix an IS_ERR() vs NULL check
RDMA/bnxt_re: Fix spelling mistake "priviledged" -> "privileged"
RDMA/bnxt_re: Remove duplicated include in bnxt_re/main.c
RDMA/bnxt_re: Refactor code around bnxt_qplib_map_rc()
RDMA/bnxt_re: Remove incorrect return check from slow path
RDMA/bnxt_re: Enable low latency push
RDMA/bnxt_re: Reorg the bar mapping
RDMA/bnxt_re: Move the interface version to chip context structure
RDMA/bnxt_re: Query function capabilities from firmware
RDMA/bnxt_re: Optimize the bnxt_re_init_hwrm_hdr usage
RDMA/bnxt_re: Add disassociate ucontext support
RDMA/bnxt_re: Use the common mmap helper functions
RDMA/bnxt_re: Initialize opcode while sending message
RDMA/cma: Remove NULL check before dev_{put, hold}
RDMA/rxe: Simplify cq->notify code
RDMA/rxe: Fixes mr access supported list
RDMA/bnxt_re: optimize the parameters passed to helper functions
RDMA/bnxt_re: remove redundant cmdq_bitmap
RDMA/bnxt_re: use firmware provided max request timeout
RDMA/bnxt_re: cancel all control path command waiters upon error
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Jakub Kicinski:
"WiFi 7 and sendpage changes are the biggest pieces of work for this
release. The latter will definitely require fixes but I think that we
got it to a reasonable point.
Core:
- Rework the sendpage & splice implementations
Instead of feeding data into sockets page by page extend sendmsg
handlers to support taking a reference on the data, controlled by a
new flag called MSG_SPLICE_PAGES
Rework the handling of unexpected-end-of-file to invoke an
additional callback instead of trying to predict what the right
combination of MORE/NOTLAST flags is
Remove the MSG_SENDPAGE_NOTLAST flag completely
- Implement SCM_PIDFD, a new type of CMSG type analogous to
SCM_CREDENTIALS, but it contains pidfd instead of plain pid
- Enable socket busy polling with CONFIG_RT
- Improve reliability and efficiency of reporting for ref_tracker
- Auto-generate a user space C library for various Netlink families
Protocols:
- Allow TCP to shrink the advertised window when necessary, prevent
sk_rcvbuf auto-tuning from growing the window all the way up to
tcp_rmem[2]
- Use per-VMA locking for "page-flipping" TCP receive zerocopy
- Prepare TCP for device-to-device data transfers, by making sure
that payloads are always attached to skbs as page frags
- Make the backoff time for the first N TCP SYN retransmissions
linear. Exponential backoff is unnecessarily conservative
- Create a new MPTCP getsockopt to retrieve all info
(MPTCP_FULL_INFO)
- Avoid waking up applications using TLS sockets until we have a full
record
- Allow using kernel memory for protocol ioctl callbacks, paving the
way to issuing ioctls over io_uring
- Add nolocalbypass option to VxLAN, forcing packets to be fully
encapsulated even if they are destined for a local IP address
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
in-kernel ECMP implementation (e.g. Open vSwitch) select the same
link for all packets. Support L4 symmetric hashing in Open vSwitch
- PPPoE: make number of hash bits configurable
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
(ipconfig)
- Add layer 2 miss indication and filtering, allowing higher layers
(e.g. ACL filters) to make forwarding decisions based on whether
packet matched forwarding state in lower devices (bridge)
- Support matching on Connectivity Fault Management (CFM) packets
- Hide the "link becomes ready" IPv6 messages by demoting their
printk level to debug
- HSR: don't enable promiscuous mode if device offloads the proto
- Support active scanning in IEEE 802.15.4
- Continue work on Multi-Link Operation for WiFi 7
BPF:
- Add precision propagation for subprogs and callbacks. This allows
maintaining verification efficiency when subprograms are used, or
in fact passing the verifier at all for complex programs,
especially those using open-coded iterators
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF
assumed the length is always equal to the amount of written data.
But some protos allow passing a NULL buffer to discover what the
output buffer *should* be, without writing anything
- Accept dynptr memory as memory arguments passed to helpers
- Add routing table ID to bpf_fib_lookup BPF helper
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
maps as read-only)
- Show target_{obj,btf}_id in tracing link fdinfo
- Addition of several new kfuncs (most of the names are
self-explanatory):
- Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
and bpf_dynptr_clone().
- bpf_task_under_cgroup()
- bpf_sock_destroy() - force closing sockets
- bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter:
- Relax set/map validation checks in nf_tables. Allow checking
presence of an entry in a map without using the value
- Increase ip_vs_conn_tab_bits range for 64BIT builds
- Allow updating size of a set
- Improve NAT tuple selection when connection is closing
Driver API:
- Integrate netdev with LED subsystem, to allow configuring HW
"offloaded" blinking of LEDs based on link state and activity
(i.e. packets coming in and out)
- Support configuring rate selection pins of SFP modules
- Factor Clause 73 auto-negotiation code out of the drivers, provide
common helper routines
- Add more fool-proof helpers for managing lifetime of MDIO devices
associated with the PCS layer
- Allow drivers to report advanced statistics related to Time Aware
scheduler offload (taprio)
- Allow opting out of VF statistics in link dump, to allow more VFs
to fit into the message
- Split devlink instance and devlink port operations
New hardware / drivers:
- Ethernet:
- Synopsys EMAC4 IP support (stmmac)
- Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
- Marvell 88E6250 7 port switches
- Microchip LAN8650/1 Rev.B0 PHYs
- MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi:
- Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
- Realtek RTL8723DS (SDIO variant)
- Realtek RTL8851BE
- CAN:
- Fintek F81604
Drivers:
- Ethernet NICs:
- Intel (100G, ice):
- support dynamic interrupt allocation
- use meta data match instead of VF MAC addr on slow-path
- nVidia/Mellanox:
- extend link aggregation to handle 4, rather than just 2 ports
- spawn sub-functions without any features by default
- OcteonTX2:
- support HTB (Tx scheduling/QoS) offload
- make RSS hash generation configurable
- support selecting Rx queue using TC filters
- Wangxun (ngbe/txgbe):
- add basic Tx/Rx packet offloads
- add phylink support (SFP/PCS control)
- Freescale/NXP (enetc):
- report TAPRIO packet statistics
- Solarflare/AMD:
- support matching on IP ToS and UDP source port of outer
header
- VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
- add devlink dev info support for EF10
- Virtual NICs:
- Microsoft vNIC:
- size the Rx indirection table based on requested
configuration
- support VLAN tagging
- Amazon vNIC:
- try to reuse Rx buffers if not fully consumed, useful for ARM
servers running with 16kB pages
- Google vNIC:
- support TCP segmentation of >64kB frames
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- enable USXGMII (88E6191X)
- Microchip:
- lan966x: add support for Egress Stage 0 ACL engine
- lan966x: support mapping packet priority to internal switch
priority (based on PCP or DSCP)
- Ethernet PHYs:
- Broadcom PHYs:
- support for Wake-on-LAN for BCM54210E/B50212E
- report LPI counter
- Microsemi PHYs: support RGMII delay configuration (VSC85xx)
- Micrel PHYs: receive timestamp in the frame (LAN8841)
- Realtek PHYs: support optional external PHY clock
- Altera TSE PCS: merge the driver into Lynx PCS which it is a
variant of
- CAN: Kvaser PCIEcan:
- support packet timestamping
- WiFi:
- Intel (iwlwifi):
- major update for new firmware and Multi-Link Operation (MLO)
- configuration rework to drop test devices and split the
different families
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature
- Qualcomm 802.11ax (ath11k):
- Multiple Basic Service Set Identifier (MBSSID) and Enhanced
MBSSID Advertisement (EMA) support in AP mode
- support factory test mode
- RealTek (rtw89):
- add RSSI based antenna diversity
- support U-NII-4 channels on 5 GHz band
- RealTek (rtl8xxxu):
- AP mode support for 8188f
- support USB RX aggregation for the newer chips"
* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
net: scm: introduce and use scm_recv_unix helper
af_unix: Skip SCM_PIDFD if scm->pid is NULL.
net: lan743x: Simplify comparison
netlink: Add __sock_i_ino() for __netlink_diag_dump().
net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
Revert "af_unix: Call scm_recv() only after scm_set_cred()."
phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
libceph: Partially revert changes to support MSG_SPLICE_PAGES
net: phy: mscc: fix packet loss due to RGMII delays
net: mana: use vmalloc_array and vcalloc
net: enetc: use vmalloc_array and vcalloc
ionic: use vmalloc_array and vcalloc
pds_core: use vmalloc_array and vcalloc
gve: use vmalloc_array and vcalloc
octeon_ep: use vmalloc_array and vcalloc
net: usb: qmi_wwan: add u-blox 0x1312 composition
perf trace: fix MSG_SPLICE_PAGES build error
ipvlan: Fix return value of ipvlan_queue_xmit()
netfilter: nf_tables: fix underflow in chain reference counter
netfilter: nf_tables: unbind non-anonymous set if rule construction fails
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
|
|
Linux 6.4
Resolve conflicts between rdma rc and next in rxe_cq matching linux-next:
drivers/infiniband/sw/rxe/rxe_cq.c:
https://lore.kernel.org/r/20230622115246.365d30ad@canb.auug.org.au
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The bnxt_re_mmap_entry_insert() function returns NULL, not error pointers.
Update the check for errors accordingly.
Fixes: 360da60d6c6e ("RDMA/bnxt_re: Enable low latency push")
Link: https://lore.kernel.org/r/8d92e85f-626b-4eca-8501-ca7024cfc0ee@moroto.mountain
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
There is a spelling mistake in a comment and in a dev_err error message.
Fix them.
Link: https://lore.kernel.org/r/20230626083535.53303-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
./drivers/infiniband/hw/bnxt_re/main.c: ib_verbs.h is included more than once.
Link: https://lore.kernel.org/r/20230626003632.60435-1-yang.lee@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5588
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Update function comment of bnxt_qplib_map_rc()
Remove intermediate return value ENXIO and directly called
bnxt_qplib_map_rc() from __send_message_basic_sanity().
Link: https://lore.kernel.org/r/20230616061700.741769-2-kashyap.desai@broadcom.com
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The commit 691eb7c6110f ("RDMA/bnxt_re: handle command completions after
driver detect a timedout") introduced code resulting in below warning
issued by the smatch static checker.
drivers/infiniband/hw/bnxt_re/qplib_rcfw.c:513 __bnxt_qplib_rcfw_send_message()
warn: duplicate check 'rc' (previous on line 506)
Fix the warning by removing incorrect code block.
Fixes: 691eb7c6110f ("RDMA/bnxt_re: handle command completions after driver detect a timedout")
Link: https://lore.kernel.org/r/20230616061700.741769-1-kashyap.desai@broadcom.com
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Introduce driver specific uapi functionalites. Added a alloc_page
functionality for user library to allocate specific pages. Currently added
support for allocating write combine pages for push functinality. This
interface shall be extended for other page allocations.
Allocate a WC page using the uapi hook for enabling the low latency push
in Gen P5 adapters for small packets. This is supported only for the user
space QPs.
Link: https://lore.kernel.org/r/1686679943-17117-8-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Reorganize the code for allocation and mapping of Doorbell
pages. Implements new HW command to get the BAR length used by L2
driver. These changes are used by the future patch which maps the WC
Doorbell pages.
Also, introduced a new lock dpi_tbl_lock for synchronize the DB page
allocation from users.
Link: https://lore.kernel.org/r/1686679943-17117-7-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
FW interface version check is required for multiple features. Moving the
interface version to chip context structure.
Link: https://lore.kernel.org/r/1686679943-17117-6-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Query Function capabilities to enable advanced features.
Link: https://lore.kernel.org/r/1686679943-17117-5-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
As of now bnxt_re_init_hwrm_hdr is taking only the opcode from the
caller. compl_ring and target_id field is always -1. These fields might be
changed when newer features are added. For now, removing these parameters
as they are hard coded. Also, remove the rdev field which is not used.
Also, initialize the structure bnxt_fw_msg during declaration itself.
Link: https://lore.kernel.org/r/1686679943-17117-4-git-send-email-selvin.xavier@broadcom.com
Suggested-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Add driver disassociation support. Driver uses the APIs rdma_user_mmap_io
api while mapping the IO pages to user space. Add empty stub for
disassociate ucontext.
Link: https://lore.kernel.org/r/1686679943-17117-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Replace the mmap handling function with common code in IB core. Create
rdma_user_mmap_entry for each mmap resource and add to the ib_core mmap
list. Add mmap_free verb support. Also, use rdma_user_mmap_io while
mapping Doorbell pages.
Link: https://lore.kernel.org/r/1686679943-17117-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Fix compilation warning:
drivers/infiniband/hw/bnxt_re/qplib_rcfw.c:325:18:
error: variable 'opcode' is uninitialized when used here [-Werror,-Wuninitialized]
crsqe->opcode = opcode;
^~~~~~
drivers/infiniband/hw/bnxt_re/qplib_rcfw.c:291:11:
note: initialize the variable 'opcode' to silence this warning
u8 opcode;
^
= '\0'
Fixes: bcfee4ce3e01 ("RDMA/bnxt_re: remove redundant cmdq_bitmap")
Link: https://lore.kernel.org/r/6ad1e44be2b560986da6fdc6b68da606413e9026.1686644105.git.leonro@nvidia.com
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
include/linux/mlx5/driver.h
617f5db1a626 ("RDMA/mlx5: Fix affinity assignment")
dc13180824b7 ("net/mlx5: Enable devlink port for embedded cpu VF vports")
https://lore.kernel.org/all/20230613125939.595e50b8@canb.auug.org.au/
tools/testing/selftests/net/mptcp/mptcp_join.sh
47867f0a7e83 ("selftests: mptcp: join: skip check if MIB counter not supported")
425ba803124b ("selftests: mptcp: join: support RM_ADDR for used endpoints or not")
45b1a1227a7a ("mptcp: introduces more address related mibs")
0639fa230a21 ("selftests: mptcp: add explicit check for new mibs")
https://lore.kernel.org/netdev/20230609-upstream-net-20230610-mptcp-selftests-support-old-kernels-part-3-v1-0-2896fe2ee8a3@tessares.net/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Avoid passing arguments like Opcode which can be retrieved from
bnxt_qplib_crsqe structure.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-18-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
cmdq_bitmap is used to derive the next available index in the CMDQ.
This is not required as the we can get the next index
using the existing bnxt_qplib_crsqe array.
Driver will use bnxt_qplib_crsqe array and flag is_in_used to
derive valid entries. is_in_used is replacement of cmdq_bitmap.
There is no change in the existing mechanism of the circular buffer
used to get index.
Added opcode field in bnxt_qplib_crsqe array so that it is easy to map
opcode associated with pending rcfw command.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-17-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Firmware provides max request timeout value as part of hwrm_ver_get
API. Driver gets the timeout from firmware and if that interface is
not available then fall back to hardcoded timeout value.
Also, Add a helper function to check the FW status.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-16-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When an error is detected in FW, wake up all the waiters as the
all of them need to be completed with timeout. Add the device
error state also as a wait condition.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-15-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
If destroy_ah is timed out, it is likely to be destroyed by firmware
but it is taking longer time due to temporary slowness
in processing the rcfw command. In worst case, there might be
AH resource leak in firmware.
Sending timeout return value can dump warning message from ib_core
which can be avoided if we map timeout of destroy_ah as success.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-14-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
AH create may be called from interrpt context and driver has a special
timeout (8 sec) for this command. This is to avoid soft lockups when
the FW command takes more time. Driver returns -ETIMEOUT and fail
create AH, without waiting for actual completion from firmware.
When FW completion is received, use is_waiter_alive flag to avoid
a regular completion path.
If create_ah opcode is detected in completion path which does not have
waiter alive, driver will fetch ah_id from successful firmware
completion in the interrupt context and sends destroy_ah command
for same ah_id. This special post is done in quick manner using helper
function __send_message_no_waiter.
timeout_send is only used for debugging purposes.
If timeout_send value keeps incrementing, it indicates out of sync
active ah counter between driver and firmware. This is a limitation
but graceful handling is possible in future.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-13-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Every completion will update last_seen value in the unit of jiffies.
last_seen field will be used to know if firmware is alive and
is useful to detect firmware stall.
Non blocking interface __wait_for_resp will have logic to detect
firmware stall. After every 10 second interval if __wait_for_resp
has not received completion for a given command it will check for
firmware stall condition.
If current jiffies is greater than last_seen
jiffies + RCFW_FW_STALL_TIMEOUT_SEC * HZ, it is a firmware stall.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-12-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
If calling context detect command timeout, associated memory stored on
stack will not be valid. If firmware complete the same command later,
this causes incorrect memory access by driver.
Added is_waiter_alive to handle delayed completion by firmware.
is_waiter_alive is set and reset under command queue lock.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-11-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
This interface will be used if the driver has not enabled interrupt
and/or interrupt is disabled for a short period of time.
Completion is not possible from interrupt so this interface does
self-polling.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-10-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
- Use __send_message_basic_sanity helper function.
- Do not retry posting same command if there is a queue full detection.
- ENXIO is used to indicate controller recovery.
- In the case of ERR_DEVICE_DETACHED state, the driver should not post
commands to the firmware, but also return fabricated written code.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-9-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Whenever there is a fast path IO and create/destroy resources from
the slow path is happening in parallel, we may notice high latency
of slow path command completion.
Introduces a shadow queue depth to prevent the outstanding requests
to the FW. Driver will not allow more than #RCFW_CMD_NON_BLOCKING_SHADOW_QD
non-blocking commands to the Firmware.
Shadow queue depth is a soft limit only for non-blocking
commands. Blocking commands will be posted to the firmware
as long as there is a free slot.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-8-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add a check to avoid waiting if driver already detects a
FW timeout. Return success for resource destroy in case
the device is detached. Add helper function to map timeout
error code to success.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-7-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Use jiffies based timewait instead of counting iteration for
commands that block for FW response.
Also add a poll routine for control path commands. This is for
polling completion if the waiting commands timeout. This avoids cases
where the driver misses completion interrupts.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-6-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
There is no need of setting max command queue entries based on
firmware version check.
Removing deperecated code.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-5-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
There is a common FW communication offset for both PF and VF.
Removed code around virt_fn check while creating FW channel.
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-4-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
bnxt_qplib_service_creq can be called from interrupt or tasklet or
process context. So the function take irq variant of spin_lock.
But when wake_up is invoked with the lock held, it is putting the
calling context to sleep.
[exception RIP: __wake_up_common+190]
RIP: ffffffffb7539d7e RSP: ffffa73300207ad8 RFLAGS: 00000083
RAX: 0000000000000001 RBX: ffff91fa295f69b8 RCX: dead000000000200
RDX: ffffa733344af940 RSI: ffffa73336527940 RDI: ffffa73336527940
RBP: 000000000000001c R8: 0000000000000002 R9: 00000000000299c0
R10: 0000017230de82c5 R11: 0000000000000002 R12: ffffa73300207b28
R13: 0000000000000000 R14: ffffa733341bf928 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
Call the wakeup after releasing the lock.
Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-3-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Driver is not handling the wraparound of the mbox producer index correctly.
Currently the wraparound happens once u32 max is reached.
Bit 31 of the producer index register is special and should be set
only once for the first command. Because the producer index overflow
setting bit31 after a long time, FW goes to initialization sequence
and this causes FW hang.
Fix is to wraparound the mbox producer index once it reaches u16 max.
Fixes: cee0c7bba486 ("RDMA/bnxt_re: Refactor command queue management code")
Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Link: https://lore.kernel.org/r/1686308514-11996-2-git-send-email-selvin.xavier@broadcom.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The original doorbell allocation mechanism is complex and does not meet
the isolation requirement. So we introduce a new doorbell mechanism and the
original mechanism (only be used with CAP_SYS_RAWIO if hardware does not
support the new mechanism) needs to be kept as simple as possible for
compatibility.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230606055005.80729-5-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
For the isolation requirement, each QP/CQ can only issue doorbells from the
allocated mmio space. Configure the relationship between QPs/CQs and
mmio doorbell spaces to hardware in create_qp/create_cq interfaces.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230606055005.80729-4-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Each ucontext will try to allocate doorbell resources in the extended bar
space from hardware. For compatibility, we change nothing for the original
bar space, and it will be used only for applications with CAP_SYS_RAWIO
authority in the older HW/FW environments.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230606055005.80729-3-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add a new CMDQ message to configure hardware. Initially the page size (in
the format of shift) will be passed to hardware, so that hardware can
organize the mmio space properly. It's called only if hardware supports it.
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230606055005.80729-2-chengyou@linux.alibaba.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
The cited commit aimed to ensure that Virtual Functions (VFs) assign a
queue affinity to a Queue Pair (QP) to distribute traffic when
the LAG master creates a hardware LAG. If the affinity was set while
the hardware was not in LAG, the firmware would ignore the affinity value.
However, this commit unintentionally assigned an affinity to QPs on the LAG
master's VPORT even if the RDMA device was not marked as LAG-enabled.
In most cases, this was not an issue because when the hardware entered
hardware LAG configuration, the RDMA device of the LAG master would be
destroyed and a new one would be created, marked as LAG-enabled.
The problem arises when a user configures Equal-Cost Multipath (ECMP).
In ECMP mode, traffic can be directed to different physical ports based on
the queue affinity, which is intended for use by VPORTS other than the
E-Switch manager. ECMP mode is supported only if both E-Switch managers are
in switchdev mode and the appropriate route is configured via IP. In this
configuration, the RDMA device is not destroyed, and we retain the RDMA
device that is not marked as LAG-enabled.
To ensure correct behavior, Send Queues (SQs) opened by the E-Switch
manager through verbs should be assigned strict affinity. This means they
will only be able to communicate through the native physical port
associated with the E-Switch manager. This will prevent the firmware from
assigning affinity and will not allow the SQs to be remapped in case of
failover.
Fixes: 802dcc7fc5ec ("RDMA/mlx5: Support TX port affinity for VF drivers in LAG mode")
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/425b05f4da840bc684b0f7e8ebf61aeb5cef09b0.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Previously we used the core device associated to the IB device in order
to do the Q-counters query to the FW, but in LAG mode it is possible
that the core device isn't the one that created this VF.
Hence instead of using the core device to query the Q-counters
we use the ESW core device which is guaranteed to be that of the VF.
Fixes: d22467a71ebe ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/778d7d7a24892348d0bdef17d2e5f9e044717e86.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Previously the Q-counters initialization assumed that the vport Q-counters
structures and the normal Q-counters structures are identical in size,
and hence when a Q-counter was added to normal Q-counters structure but
not to the vport Q-counters struct it would lead to that counter name
being NULL in switchdev mode, which could cause the kernel crash below.
Currently break the dependency between those two structure and always
use the appropriate struct size, in order to remove the assumption
that both structure sizes are equal.
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 20c64a067 P4D 20c64a067 PUD 20152b067 PMD 0
Oops: 0000 [#1] SMP
CPU: 19 PID: 11717 Comm: devlink Tainted: G OE 6.2.0_mlnx #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:strlen+0x0/0x20
Code: 66 2e 0f 1f 84 00 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 48 89 f8 74 10 48 83 c7 01 80 3f 00 75 f7 48 29 c7 48 89
RSP: 0018:ffffc9000318b618 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000002c00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffff888211918110 R09: ffff888211918000
R10: 000000000000001e R11: ffff888211918000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8881038ec250
FS: 00007fa53342fe80(0000) GS:ffff88885fcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000002042b2003 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
kernfs_name_hash+0x12/0x80
kernfs_find_ns+0x35/0xb0
kernfs_remove_by_name_ns+0x46/0xc0
remove_files.isra.1+0x30/0x70
internal_create_group+0x253/0x380
internal_create_groups.part.4+0x3e/0xa0
setup_port+0x27a/0x8c0 [ib_core]
ib_setup_port_attrs+0x9d/0x300 [ib_core]
ib_register_device+0x48e/0x550 [ib_core]
__mlx5_ib_add+0x2b/0x80 [mlx5_ib]
mlx5_ib_vport_rep_load+0x141/0x360 [mlx5_ib]
mlx5_esw_offloads_rep_load+0x48/0xa0 [mlx5_core]
esw_offloads_enable+0x41e/0xd10 [mlx5_core]
mlx5_eswitch_enable_locked+0x1e3/0x340 [mlx5_core]
? __cond_resched+0x15/0x30
mlx5_devlink_eswitch_mode_set+0x204/0x3c0 [mlx5_core]
devlink_nl_cmd_eswitch_set_doit+0x8d/0x100
genl_family_rcv_msg_doit.isra.19+0xea/0x110
genl_rcv_msg+0x19b/0x290
? devlink_nl_cmd_region_read_dumpit+0x760/0x760
? devlink_nl_cmd_port_param_get_doit+0x30/0x30
? devlink_put+0x50/0x50
? genl_get_cmd_both+0x60/0x60
netlink_rcv_skb+0x54/0x100
genl_rcv+0x24/0x40
netlink_unicast+0x1be/0x2a0
netlink_sendmsg+0x361/0x4d0
sock_sendmsg+0x30/0x40
__sys_sendto+0x11a/0x150
? handle_mm_fault+0x101/0x2b0
? do_user_addr_fault+0x21d/0x720
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fa533611cba
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c
RSP: 002b:00007ffdb6a898a8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000daab00 RCX: 00007fa533611cba
RDX: 0000000000000038 RSI: 0000000000daab00 RDI: 0000000000000003
RBP: 0000000000daa910 R08: 00007fa533822000 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
</TASK>
Modules linked in: rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) mlxfw(OE) memtrack(OE) pci_hyperv_intf nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat dns_resolver nf_nat br_netfilter nfs bridge stp llc lockd grace fscache netfs rfkill overlay iTCO_wdt iTCO_vendor_support kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel i2c_i801 sunrpc lpc_ich sha512_ssse3 pcspkr i2c_smbus mfd_core drm sch_fq_codel i2c_core ip_tables fuse crc32c_intel serio_raw virtio_net net_failover failover [last unloaded: mlxfw]
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
RIP: 0010:strlen+0x0/0x20
Code: 66 2e 0f 1f 84 00 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 48 89 f8 74 10 48 83 c7 01 80 3f 00 75 f7 48 29 c7 48 89
RSP: 0018:ffffc9000318b618 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000002c00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffff888211918110 R09: ffff888211918000
R10: 000000000000001e R11: ffff888211918000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8881038ec250
FS: 00007fa53342fe80(0000) GS:ffff88885fcc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000002042b2003 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception ]---
Fixes: d22467a71ebe ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/016777b7f16eb6bb178999ff59097d0c0f91f68a.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Previously Q-counters data was being allocated over the PF for all of
the available vports, however that isn't necessary.
Since each VF or SF has a Q-counter allocated for itself.
So we only need to allocate two counters data structures, one for the
device counters, and one for all the other vports to expose the
representors, since they only need to read from it in order to
determine mainly counters numbers and names, so they can all share.
This in turn also solves a bug we previously had where we couldn't
switch the device to switchdev mode when there were more than 128 SF/VFs
configured, since that is the maximum amount of Q-counters available for
a single port
Fixes: d22467a71ebe ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/f54671df16e2227a069b229b33b62cd9ee24c475.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
A misbehaved user can create a steering anchor that points to a kernel
flow table and then destroy the anchor without freeing the associated
STC. This creates a problem as the kernel can't destroy the flow
table since there is still a reference to it. As a result, this can
exhaust all available flow table resources, preventing other users from
using the RDMA device.
To prevent this problem, a solution is implemented where a special flow
table with two steering rules is created when a user creates a steering
anchor for the first time. The rules include one that drops all traffic
and another that points to the kernel flow table. If the steering anchor
is destroyed, only the rule pointing to the kernel's flow table is removed.
Any traffic reaching the special flow table after that is dropped.
Since the special flow table is not destroyed when the steering anchor is
destroyed, any issues are prevented from occurring. The remaining resources
are only destroyed when the RDMA device is destroyed, which happens after
all DEVX objects are freed, including the STCs, thus mitigating the issue.
Fixes: 0c6ab0ca9a66 ("RDMA/mlx5: Expose steering anchor to userspace")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://lore.kernel.org/r/b4a88a871d651fa4e8f98d552553c1cfe9ba2cd6.1685960567.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|