path: root/drivers/gpu/drm/xe
Age  Commit message  Author
2025-02-26  drm/xe/eustall: Add EU stall sampling support for Xe2  Harish Chegondi
Add EU stall sampling support for Xe2 architecture GPUs - LNL and BMG. EU stall data format for LNL and BMG is different from that of PVC. v10: Update comments as per review feedback v9: Use GRAPHICS_VER() check instead of platform v8: Renamed struct drm_xe_eu_stall_data_xe2 to struct xe_eu_stall_data_xe2 since it is a local structure. Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/d85093e9ab1204d14d2cc783f304a4bc8688951c.1740533885.git.harish.chegondi@intel.com
2025-02-26  drm/xe/eustall: Add support to handle dropped EU stall data  Harish Chegondi
If the user space doesn't read the EU stall data fast enough, it is possible that the EU stall data buffer can get filled, and if the hardware wants to write more data, it simply drops data due to unavailable buffer space. In that case, hardware sets a bit in a register. If the driver detects data drop, the driver read() returns -EIO error to let the user space know that HW has dropped data. The -EIO error is returned even if there is EU stall data in the buffer. A subsequent read by the user space returns the remaining EU stall data. v12: Move 'goto exit_drop;' to the next 'if (read_data_size == 0)' statement. v11: Clear drop bit even for empty data buffer as the data was read from the buffer in the previous read. v10: Reverted the changes back to v8: Clear the drop bits only after reading the data. v9: Move all data drop handling code to this patch Clear all drop data bits before returning -EIO. Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/6fbfd7cfa42cb3ef5515b6412573d74c7cd3d27a.1740533885.git.harish.chegondi@intel.com
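The drop handling described above implies a small retry pattern on the user-space side. The following is a minimal sketch of it, assuming `fd` is the EU stall stream file descriptor and that the driver's -EIO surfaces as `read()` returning -1 with `errno == EIO`; the helper name `read_eu_stall` is hypothetical and not part of the uAPI.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read from an EU stall stream fd.
 * If the first read() fails with EIO (hardware dropped data), record that
 * in *dropped and read once more to collect the data still in the buffer.
 * Any other error, or an error on the second read, is returned unchanged.
 */
ssize_t read_eu_stall(int fd, void *buf, size_t len, int *dropped)
{
	ssize_t n;

	*dropped = 0;
	n = read(fd, buf, len);
	if (n < 0 && errno == EIO) {
		*dropped = 1;		/* samples were lost before this point */
		n = read(fd, buf, len);	/* fetch the remaining data */
	}
	return n;
}
```

A caller would typically treat `*dropped != 0` as a gap marker in the sample stream rather than as a fatal error.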
2025-02-26  drm/xe/eustall: Add support to read() and poll() EU stall data  Harish Chegondi
Implement the EU stall sampling APIs to read() and poll() EU stall data. A work function periodically polls the EU stall data buffer write pointer registers to look for any new data and caches the write pointer. The read function compares the cached read and write pointers and copies any new data to the user space. v11: Used gt->eu_stall->stream_lock instead of stream->buf_lock. Removed read and write offsets from trace and added read size. Moved workqueue from struct xe_eu_stall_data_stream to struct xe_eu_stall_gt. v10: Used cancel_delayed_work_sync() instead of flush_delayed_work() Replaced per xecore lock with a lock for all the xecore buffers Code movement and optimizations as per review feedback v9: New patch split from the previous patch. Used *_delayed_work functions instead of hrtimer Addressed the review feedback in read and poll functions Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/369dee85a3b6bd2c08aeae89ca55e66a9a0242d2.1740533885.git.harish.chegondi@intel.com
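Comparing a read pointer with a cached write pointer is the usual circular-buffer bookkeeping. The sketch below shows that idea in plain user-space C; it is not the driver's code. The function names are hypothetical, and it assumes that equal offsets mean "empty" (the driver may distinguish full from empty differently).

```c
#include <stddef.h>
#include <string.h>

/* Bytes available between read_off and write_off in a ring of `size` bytes. */
size_t ring_avail(size_t read_off, size_t write_off, size_t size)
{
	return write_off >= read_off ? write_off - read_off
				     : size - read_off + write_off;
}

/*
 * Copy up to dst_len new bytes out of the ring, handling wrap-around,
 * and advance *read_off. Returns the number of bytes copied.
 */
size_t ring_copy(const unsigned char *ring, size_t size, size_t *read_off,
		 size_t write_off, unsigned char *dst, size_t dst_len)
{
	size_t n = ring_avail(*read_off, write_off, size);
	size_t first;

	if (n > dst_len)
		n = dst_len;
	first = size - *read_off;	/* bytes before the end of the ring */
	if (first > n)
		first = n;
	memcpy(dst, ring + *read_off, first);
	memcpy(dst + first, ring, n - first);	/* wrapped part, may be 0 */
	*read_off = (*read_off + n) % size;
	return n;
}
```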
2025-02-26  drm/xe/eustall: Add support to init, enable and disable EU stall sampling  Harish Chegondi
Implement EU stall sampling APIs introduced in the previous patch for Xe_HPC (PVC). Add register definitions and the code that accesses these registers to the APIs. Add initialization and clean up functions and their implementations, EU stall enable and disable functions. v11: Move stream->xecore_buf alloc to xe_eu_stall_data_buf_alloc(). Register xe_eu_stall_fini() with devm_add_action_or_reset() instead of calling it from xe_gt_fini(). Changed a couple of variables in struct xe_eu_stall_data_stream from unsigned int to int. v10: Fixed error rewinding code Moved code around as per review feedback v9: Moved structure definitions from xe_eu_stall.h to xe_eu_stall.c Moved read and poll implementations to the next patch Used xe_bo_create_pin_map_at_aligned instead of xe_bo_create_pin_map Changed lock names as per review feedback Moved drop data handling into a subsequent patch Moved code around as per review feedback v8: Updated copyright year in xe_eu_stall_regs.h to 2025. Renamed struct drm_xe_eu_stall_data_pvc to struct xe_eu_stall_data_pvc since it is a local structure. v6: Fix buffer wrap around over write bug (Matt Olson) Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/b6aeca593d521828a0b4fbf6cfd2844716c4fc66.1740533885.git.harish.chegondi@intel.com
2025-02-26  drm/xe/uapi: Introduce API for EU stall sampling  Harish Chegondi
A new hardware feature first introduced in PVC gives capability to periodically sample EU stall state and record counts for different stall reasons, on a per IP basis, aggregate across all EUs in a subslice and record the samples in a buffer in each subslice. Eventually, the aggregated data is written out to a buffer in the memory. This feature is also supported in XE2 and later architecture GPUs. Use an existing IOCTL - DRM_IOCTL_XE_OBSERVATION as the interface into the driver from the user space to do initial setup and obtain a file descriptor for the EU stall data stream. Input parameter to the IOCTL is a struct drm_xe_observation_param in which observation_type should be set to DRM_XE_OBSERVATION_TYPE_EU_STALL, observation_op should be DRM_XE_OBSERVATION_OP_STREAM_OPEN and param should point to a chain of drm_xe_ext_set_property structures in which each structure has a pair of property and value. The EU stall sampling input properties are defined in drm_xe_eu_stall_property_id enum. With the file descriptor obtained from DRM_IOCTL_XE_OBSERVATION, user space can enable and disable EU stall sampling with the IOCTLs: DRM_XE_OBSERVATION_IOCTL_ENABLE and DRM_XE_OBSERVATION_IOCTL_DISABLE. User space can also call poll() to check for availability of data in the buffer. The data can be read with read(). Finally, the file descriptor can be closed with close(). v11: Changed a couple of variables in struct eu_stall_open_properties from unsigned int to int. v10: Use extension number while parsing chain of extensions. Remove function description for static functions. Move code around as per review feedback. v9: Changed some u32 to unsigned int. Moved some code around as per review feedback from v8. v8: Used div_u64 instead of / to fix 32-bit build issue. Changed copyright year in xe_eu_stall.c/h to 2025. v7: Renamed input property DRM_XE_EU_STALL_PROP_EVENT_REPORT_COUNT to DRM_XE_EU_STALL_PROP_WAIT_NUM_REPORTS to be consistent with OA. 
Renamed the corresponding internal variables. Fixed some commit messages based on review feedback. v6: Change the input sampling rate to GPU cycles instead of GPU cycles multiplier. Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/bb707a27975c33e4a912b9839b023acb7a1f9c90.1740533885.git.harish.chegondi@intel.com
2025-02-26  drm/xe/topology: Add a function to find the index of the last enabled DSS in a mask  Harish Chegondi
Last enabled DSS in a DSS mask can help estimate the maximum DSSes enabled in the DSS mask, as the enabled DSSes can be discontiguous. Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/79944bb27eb4f7ce5df01f964aebbf431b3a6c61.1740533885.git.harish.chegondi@intel.com
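To illustrate why the last enabled index matters for a discontiguous mask, here is a minimal user-space sketch. It assumes the mask is an array of 32-bit words and returns -1 for an empty mask; both choices, and the name `last_enabled_dss`, are assumptions for illustration and not the driver's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the highest set bit across `nwords` 32-bit words,
 * or -1 if no bit is set. Word 0 holds bits 0..31, word 1 bits 32..63, etc.
 */
int last_enabled_dss(const uint32_t *mask, size_t nwords)
{
	for (size_t w = nwords; w-- > 0; ) {
		if (mask[w]) {
			int bit = 31;

			/* Walk down from the top bit to the first set one. */
			while (!(mask[w] & (UINT32_C(1) << bit)))
				bit--;
			return (int)(w * 32) + bit;
		}
	}
	return -1;
}
```

With a mask of 0x5 only two DSSes are enabled, yet the last index is 2, so anything indexed by DSS number needs three slots rather than two.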
2025-02-26  drm/xe: Fix uninitialized pointer def  Colin Ian King
In the case where a set of checks on xe->info.platform don't assign a value to pointer def the pointer remains uninitialized and hence can fail the following !def check. Fix this by ensuring pointer def is initialized to NULL. Fixes: 292b1a8a5054 ("drm/xe: Stop ignoring errors from xe_heci_gsc_init()") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250226160524.566074-1-colin.i.king@gmail.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
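The pattern behind this fix can be shown with a small stand-alone sketch. The platform ids, the `dev_def` structure and `pick_def()` are hypothetical stand-ins, not the driver's types; the point is only that initializing the pointer to NULL makes the "no platform matched" case reliably detectable.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver's platform ids and definitions. */
enum platform { PLATFORM_A, PLATFORM_B, PLATFORM_OTHER };

struct dev_def {
	const char *name;
};

static const struct dev_def def_a = { "a" };
static const struct dev_def def_b = { "b" };

/* Return the definition for a platform, or NULL if none applies. */
const struct dev_def *pick_def(enum platform p)
{
	/* Without "= NULL", the unmatched case would return garbage. */
	const struct dev_def *def = NULL;

	if (p == PLATFORM_A)
		def = &def_a;
	else if (p == PLATFORM_B)
		def = &def_b;

	return def;
}
```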
2025-02-26  drm/xe/oa: Refactor WAs to use XE_WA() macro  Aradhya Bhatia
Refactor Wa_18013179988, Wa_14015568240, Wa_1508761755, and Wa_1509372804, to use the proper workaround-check implementation for out-of-band workarounds, XE_WA(), and drop the use of the platform based WA selection. Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Aradhya Bhatia <aradhya.bhatia@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250220094645.358647-3-aradhya.bhatia@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-26  drm/xe: Add Wa_16021333562 and Wa_14016712196  Aradhya Bhatia
Wa_16021333562 and Wa_14016712196 are permanent workarounds that apply to multiple platforms. Wa_16021333562 applies to platforms ranging from TGL (12.00) to Xe_LPM (13.00), while Wa_14016712196 from DG2 (12.55) to Xe_LPG (12.74). Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Signed-off-by: Aradhya Bhatia <aradhya.bhatia@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250220094645.358647-2-aradhya.bhatia@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-26  drm/xe: cancel pending job timer before freeing scheduler  Tejas Upadhyay
The async call to __guc_exec_queue_fini_async frees the scheduler while a submission may time out and restart. To prevent this race condition, the pending job timer should be canceled before freeing the scheduler. V3(MattB): - Adjust position of cancel pending job - Remove gitlab issue# from commit message V2(MattB): - Cancel pending jobs before scheduler finish Fixes: a20c75dba192 ("drm/xe: Call __guc_exec_queue_fini_async direct for KERNEL exec_queues") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250225045754.600905-1-tejas.upadhyay@intel.com Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com> (cherry picked from commit 18fbd567e75f9b97b699b2ab4f1fa76b7cf268f6) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-02-26  drm/xe/regs: remove a duplicate definition for RING_CTL_SIZE(size)  Mingcong Bai
Commit b79e8fd954c4 ("drm/xe: Remove dependency on intel_engine_regs.h") introduced an internal set of engine registers. However, as part of this change, it also introduced two duplicate `define' lines for `RING_CTL_SIZE(size)'. This commit was introduced to the tree in v6.8-rc1. This is harmless: the definitions did not change, so no compiler warning was observed. Drop this line anyway for the sake of correctness. Cc: stable@vger.kernel.org # v6.8-rc1+ Fixes: b79e8fd954c4 ("drm/xe: Remove dependency on intel_engine_regs.h") Signed-off-by: Mingcong Bai <jeffbai@aosc.io> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250225073104.865230-1-jeffbai@aosc.io Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> (cherry picked from commit 6b68c4542ffecc36087a9e14db8fc990c88bb01b) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-02-26  drm/xe/gt_pagefault: Change vma_pagefault unit to kilobyte  Francois Dugast
Increase the amount of bytes that can be counted before the counter overflows, while not losing information as the VMA is not expected to have sub-kilobyte size. Suggested-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250225195902.1247100-3-francois.dugast@intel.com Signed-off-by: Francois Dugast <francois.dugast@intel.com>
2025-02-26  drm/xe/gt_stats: Use atomic64_t for counters  Francois Dugast
The stats counters are now used for things like counting the VMA bytes during page faults. During workload execution, the counter value can grow fast and easily reach the atomic int limit, in which case it overflows. To make this less likely to happen, push the limit by switching to 64b atomic to store the counter value. Overhead is very small as there are only 3 stat entries per GT as of now, and stats are only enabled with CONFIG_DEBUG_FS. Suggested-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250225195902.1247100-2-francois.dugast@intel.com Signed-off-by: Francois Dugast <francois.dugast@intel.com>
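The effect of the wider counter and of the kilobyte unit can be checked with simple arithmetic. The sketch below is plain user-space C, not kernel code; the helper name and the 2 MiB VMA size in the usage note are assumptions chosen only to make the numbers concrete.

```c
#include <stdint.h>

/*
 * How many increments of `increment` units fit into a counter whose
 * largest representable value is `counter_max` before it overflows.
 */
uint64_t events_until_overflow(uint64_t counter_max, uint64_t increment)
{
	return counter_max / increment;
}
```

For a hypothetical 2 MiB VMA per fault, a 32-bit signed counter in bytes gives `events_until_overflow(INT32_MAX, 2u << 20)`, which is 1023 faults. Counting in kilobytes (2048 per fault) raises that to 1048575, and a 64-bit counter (`INT64_MAX`) pushes the limit far beyond any realistic workload.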
2025-02-26  drm/xe: cancel pending job timer before freeing scheduler  Tejas Upadhyay
The async call to __guc_exec_queue_fini_async frees the scheduler while a submission may time out and restart. To prevent this race condition, the pending job timer should be canceled before freeing the scheduler. V3(MattB): - Adjust position of cancel pending job - Remove gitlab issue# from commit message V2(MattB): - Cancel pending jobs before scheduler finish Fixes: a20c75dba192 ("drm/xe: Call __guc_exec_queue_fini_async direct for KERNEL exec_queues") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250225045754.600905-1-tejas.upadhyay@intel.com Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
2025-02-25  drm/xe/regs: remove a duplicate definition for RING_CTL_SIZE(size)  Mingcong Bai
Commit b79e8fd954c4 ("drm/xe: Remove dependency on intel_engine_regs.h") introduced an internal set of engine registers. However, as part of this change, it also introduced two duplicate `define' lines for `RING_CTL_SIZE(size)'. This commit was introduced to the tree in v6.8-rc1. This is harmless: the definitions did not change, so no compiler warning was observed. Drop this line anyway for the sake of correctness. Cc: stable@vger.kernel.org # v6.8-rc1+ Fixes: b79e8fd954c4 ("drm/xe: Remove dependency on intel_engine_regs.h") Signed-off-by: Mingcong Bai <jeffbai@aosc.io> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250225073104.865230-1-jeffbai@aosc.io Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-02-25  drm/xe: Stop ignoring errors from xe_ttm_sys_mgr_init()  Lucas De Marchi
xe_ttm_sys_mgr_init() already cleans up after itself, just return error if that failed. Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-12-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-25  drm/xe: Rename update_device_info() after sriov  Lucas De Marchi
This is only changing info flags for SR-IOV reasons. Rename it accordingly, because there are several other places in probe where the flags are updated, which is not inside this function. Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-11-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-25  drm/xe: Stop ignoring errors from xe_heci_gsc_init()  Lucas De Marchi
Do not ignore errors from xe_heci_gsc_init(). For example, it shouldn't be fine to report successfully entering survivability mode when there's no communication with gsc working. The driver should also not be half-initialized in the normal case either. Cc: Riana Tauro <riana.tauro@intel.com> Cc: Alexander Usyskin <alexander.usyskin@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-10-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-25  drm/xe: Move survivability entirely to xe_pci  Lucas De Marchi
There's an odd split between xe_pci.c and xe_device.c wrt xe_survivability: it's initialized by xe_device, but then finalized by xe_pci. Move it entirely to the outer layer, xe_pci, so it controls the flow entirely. This also makes it possible to stop ignoring some of the errors. E.g.: if there's an -ENOMEM, it shouldn't continue as if survivability had been enabled. One change worth mentioning is that if "wait for lmem" fails, it will also check the pcode status to decide whether or not it should enter survivability mode, which it was not doing before. The bit from pcode for that decision should remain the same after lmem failed initialization, so it should be fine. Cc: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Riana Tauro <riana.tauro@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-9-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-25  drm/xe/display: Drop xe_display_driver_remove()  Lucas De Marchi
Handle it as part of xe_display_fini(). The error handling was already calling it if a step after xe_display_init() failed. Just re-use the same xe_display_fini() for driver remove. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-8-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-25  drm/xe: Drop remove callback support  Lucas De Marchi
Now that devres supports component driver cleanup during driver removal cleanup, the xe custom support for removal callbacks is not needed anymore. Drop it. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-7-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-25  drm/xe: Switch from xe to devm actions  Lucas De Marchi
Now that component drivers are compatible with devm, switch to using it instead of our own. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-6-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-25  drm/xe: Stop setting drvdata to NULL  Lucas De Marchi
PCI subsystem is not supposed to call the remove() function when probe fails and doesn't need a protection for that. The only places checking for NULL drvdata are 2 sysfs files, and they shouldn't be needed since the files are removed and reads on open fds just return an error. For this protection the core driver implementation in drivers/base/dd.c:device_unbind_cleanup() already sets it to NULL, after the release of dev resources. Remove the setting to NULL so it's possible to obtain the xe pointer from callbacks like the component unbind from device_unbind_cleanup(), i.e. after xe_pci_remove() already finished. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250222001051.3012936-5-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-25  drm/xe/oa: Allow oa_exponent value of 0  Umesh Nerlige Ramappa
OA exponent value of 0 is a valid value for periodic reports. Allow user to pass 0 for the OA sampling interval since it gets converted to 2 gt clock ticks. v2: Update the check in xe_oa_stream_init as well (Ashutosh) v3: Fix mi-rpc failure by setting default exponent to -1 (CI) v4: Add the Fixes tag Fixes: b6fd51c62119 ("drm/xe/oa/uapi: Define and parse OA stream properties") Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250221213352.1712932-1-umesh.nerlige.ramappa@intel.com (cherry picked from commit 30341f0b8ea71725cc4ab2c43e3a3b749892fc92) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
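The message states that an exponent of 0 is converted to 2 gt clock ticks. That matches a sampling period of 2^(exponent + 1) ticks; the general formula is an assumption here and is not confirmed by this log. A minimal sketch under that assumption, with a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Assumed mapping from OA exponent to sampling period in gt clock ticks:
 * period = 2^(exponent + 1). Valid for exponent < 63 in this sketch.
 */
uint64_t oa_period_ticks(unsigned int exponent)
{
	return UINT64_C(2) << exponent;
}
```

Under this mapping exponent 0 is the shortest period the scheme can express, which is why rejecting it would exclude a legitimate setting.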
2025-02-25  Merge tag 'v6.14-rc4' into drm-next  Dave Airlie
Backmerge Linux 6.14-rc4 at the request of tzimmermann so misc-next can base on rc4. Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-02-24  drm/xe/oa: Allow oa_exponent value of 0  Umesh Nerlige Ramappa
OA exponent value of 0 is a valid value for periodic reports. Allow user to pass 0 for the OA sampling interval since it gets converted to 2 gt clock ticks. v2: Update the check in xe_oa_stream_init as well (Ashutosh) v3: Fix mi-rpc failure by setting default exponent to -1 (CI) v4: Add the Fixes tag Fixes: b6fd51c62119 ("drm/xe/oa/uapi: Define and parse OA stream properties") Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250221213352.1712932-1-umesh.nerlige.ramappa@intel.com
2025-02-24  drm/xe/devcoredump: Remove IS_ERR_OR_NULL check for kzalloc  Shuicheng Lin
kzalloc returns a valid pointer or NULL if the allocation fails. It never returns an error pointer. It is better to check for NULL directly. Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250220001710.1803749-3-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-24  drm/xe/devcoredump: Fix print typo of offset  Shuicheng Lin
The log should print with "offset" instead of "size". Correct the typo in the comment. v2: split kzalloc change and add typo fix in commit message (Lucas) Signed-off-by: Shuicheng Lin <shuicheng.lin@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250220001710.1803749-2-shuicheng.lin@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-24  drm/xe/xe_pmu: Acquire forcewake on event init for engine events  Riana Tauro
When the engine events are created, acquire GT forcewake to read gpm timestamp required for the events and release on event destroy. This cannot be done during read due to the raw spinlock held by pmu. v2: remove forcewake counting (Umesh) v3: remove extra space (Umesh) v4: use event pmu private data (Lucas) free local copy (Umesh) Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250224053903.2253539-6-riana.tauro@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-24  drm/xe/xe_pmu: Add PMU support for engine activity  Riana Tauro
PMU provides two counters (engine-active-ticks, engine-total-ticks) to calculate engine activity. When querying engine activity, user must group these 2 counters using the perf_event group mechanism to ensure both counters are sampled together. To list the events ./perf list xe_0000_03_00.0/engine-active-ticks/ [Kernel PMU event] xe_0000_03_00.0/engine-total-ticks/ [Kernel PMU event] The formats to be used with the above are engine_instance - config:12-19 engine_class - config:20-27 gt - config:60-63 The events can then be read using perf tool ./perf stat -e xe_0000_03_00.0/engine-active-ticks,gt=0, engine_class=0,engine_instance=0/, xe_0000_03_00.0/engine-total-ticks,gt=0, engine_class=0,engine_instance=0/ -I 1000 Engine activity can then be calculated as below engine activity % = (engine active ticks/engine total ticks) * 100 v2: validate gt rename total-ticks to engine-total-ticks add helper to get hwe (Umesh) v3: fix checkpatch warning add details to documentation (Umesh) remove ascii formats from documentation (Lucas) v4: remove unnecessary warn within raw_spinlock (Lucas) Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250224053903.2253539-5-riana.tauro@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
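The format fields and the activity formula above can be sketched as two small helpers. The bit ranges (engine_instance 12-19, engine_class 20-27, gt 60-63) and the percentage formula come from the message; the function names are hypothetical, and the event id that also lives in the config value is not given in this log, so the sketch only builds the three listed fields.

```c
#include <stdint.h>

/*
 * Pack the gt, engine_class and engine_instance fields into their
 * documented bit ranges. The caller would OR in the event id separately.
 */
uint64_t xe_pmu_config_fields(unsigned int gt, unsigned int engine_class,
			      unsigned int engine_instance)
{
	return ((uint64_t)(gt & 0xf) << 60) |
	       ((uint64_t)(engine_class & 0xff) << 20) |
	       ((uint64_t)(engine_instance & 0xff) << 12);
}

/* Engine activity % = (active ticks / total ticks) * 100; 0 if no ticks. */
double engine_activity_percent(uint64_t active_ticks, uint64_t total_ticks)
{
	if (total_ticks == 0)
		return 0.0;
	return 100.0 * (double)active_ticks / (double)total_ticks;
}
```

The two tick values passed to `engine_activity_percent()` should come from the same perf_event group read, since the message requires both counters to be sampled together.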
2025-02-24  drm/xe/userptr: fix EFAULT handling  Matthew Auld
Currently we treat EFAULT from hmm_range_fault() as a non-fatal error when called from xe_vm_userptr_pin() with the idea that we want to avoid killing the entire vm and chucking an error, under the assumption that the user just did an unmap or something, and has no intention of actually touching that memory from the GPU. At this point we have already zapped the PTEs so any access should generate a page fault, and if the pin fails there also it will then become fatal. However it looks like it's possible for the userptr vma to still be on the rebind list in preempt_rebind_work_func(), if we had to retry the pin again due to something happening in the caller before we did the rebind step, but in the meantime needing to re-validate the userptr and this time hitting the EFAULT. This explains an internal user report of hitting: [ 191.738349] WARNING: CPU: 1 PID: 157 at drivers/gpu/drm/xe/xe_res_cursor.h:158 xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738551] Workqueue: xe-ordered-wq preempt_rebind_work_func [xe] [ 191.738616] RIP: 0010:xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738690] Call Trace: [ 191.738692] <TASK> [ 191.738694] ? show_regs+0x69/0x80 [ 191.738698] ? __warn+0x93/0x1a0 [ 191.738703] ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738759] ? report_bug+0x18f/0x1a0 [ 191.738764] ? handle_bug+0x63/0xa0 [ 191.738767] ? exc_invalid_op+0x19/0x70 [ 191.738770] ? asm_exc_invalid_op+0x1b/0x20 [ 191.738777] ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe] [ 191.738834] ? ret_from_fork_asm+0x1a/0x30 [ 191.738849] bind_op_prepare+0x105/0x7b0 [xe] [ 191.738906] ? dma_resv_reserve_fences+0x301/0x380 [ 191.738912] xe_pt_update_ops_prepare+0x28c/0x4b0 [xe] [ 191.738966] ? kmemleak_alloc+0x4b/0x80 [ 191.738973] ops_execute+0x188/0x9d0 [xe] [ 191.739036] xe_vm_rebind+0x4ce/0x5a0 [xe] [ 191.739098] ? 
trace_hardirqs_on+0x4d/0x60 [ 191.739112] preempt_rebind_work_func+0x76f/0xd00 [xe] Followed by NPD, when running some workload, since the sg was never actually populated but the vma is still marked for rebind when it should be skipped for this special EFAULT case. This is confirmed to fix the user report. v2 (MattB): - Move earlier. v3 (MattB): - Update the commit message to make it clear that this indeed fixes the issue. Fixes: 521db22a1d70 ("drm/xe: Invalidate userptr VMA on page pin fault") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: <stable@vger.kernel.org> # v6.10+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250221143840.167150-5-matthew.auld@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 6b93cb98910c826c2e2004942f8b060311e43618) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-02-24  drm/xe/userptr: restore invalidation list on error  Matthew Auld
On error, restore anything still on the pin_list back to the invalidation list. For the actual pin, so long as the vma is tracked on either list it should get picked up on the next pin, however it looks possible for the vma to get nuked but still be present on this per vm pin_list leading to corruption. An alternative might be then to instead just remove the link when destroying the vma. v2: - Also add some asserts. - Keep the overzealous locking so that we are consistent with the docs; updating the docs and related bits will be done as a follow up. Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr vmas") Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250221143840.167150-4-matthew.auld@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> (cherry picked from commit 4e37e928928b730de9aa9a2f5dc853feeebc1742) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-02-24  drm/xe/guc: Expose engine activity only for supported GuC version  Riana Tauro
Engine activity is supported only on GuC submission version >= 1.14.1 Allow enabling/reading engine activity only on supported GuC versions. Warn once if not supported. v2: use guc interface version (John) v3: use debug log (Umesh) v4: use variable for supported and use gt logs use a friendlier log message (Michal) v5: fix kernel-doc do not continue in init if not supported (Michal) v6: remove hardcoding values (Michal) Cc: John Harrison <John.C.Harrison@Intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250224053903.2253539-4-riana.tauro@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-24  drm/xe/trace: Add trace for engine activity  Riana Tauro
Add engine activity related information to trace events for better debuggability v2: add trace for engine activity (Umesh) v3: use hex for quanta_ratio Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250224053903.2253539-3-riana.tauro@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-24  drm/xe: Add engine activity support  Riana Tauro
GuC provides support to read engine counters to calculate the engine activity. KMD exposes two counters via the PMU interface to calculate engine activity Engine Active Ticks(engine-active-ticks) - active ticks of engine Engine Total Ticks (engine-total-ticks) - total ticks of engine Engine activity percentage can be calculated as below Engine activity % = (engine active ticks/engine total ticks) * 100. v2: fix cosmetic review comments add forcewake for gpm_ts (Umesh) v3: fix CI hooks error change function parameters and unpin bo on error of allocate_activity_buffers fix kernel-doc (Umesh) use engine activity (Umesh, Lucas) rename xe_engine_activity to xe_guc_engine_* fix commit message to use engine activity (Lucas, Umesh) v4: add forcewake in PMU layer v5: fix makefile use drmm_kcalloc instead of kmalloc_array remove managed bo skip init for VF fix cosmetic review comments (Michal) Signed-off-by: Riana Tauro <riana.tauro@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250224053903.2253539-2-riana.tauro@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-24  drm/xe/userptr: remove tmp_evict list  Matthew Auld
Doesn't look to be used. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250221143840.167150-6-matthew.auld@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-24drm/xe/userptr: fix EFAULT handlingMatthew Auld
Currently we treat EFAULT from hmm_range_fault() as a non-fatal error when called from xe_vm_userptr_pin() with the idea that we want to avoid killing the entire vm and chucking an error, under the assumption that the user just did an unmap or something, and has no intention of actually touching that memory from the GPU. At this point we have already zapped the PTEs so any access should generate a page fault, and if the pin fails there also it will then become fatal.

However it looks like it's possible for the userptr vma to still be on the rebind list in preempt_rebind_work_func(), if we had to retry the pin again due to something happening in the caller before we did the rebind step, but in the meantime needing to re-validate the userptr and this time hitting the EFAULT.

This explains an internal user report of hitting:

[ 191.738349] WARNING: CPU: 1 PID: 157 at drivers/gpu/drm/xe/xe_res_cursor.h:158 xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
[ 191.738551] Workqueue: xe-ordered-wq preempt_rebind_work_func [xe]
[ 191.738616] RIP: 0010:xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
[ 191.738690] Call Trace:
[ 191.738692] <TASK>
[ 191.738694] ? show_regs+0x69/0x80
[ 191.738698] ? __warn+0x93/0x1a0
[ 191.738703] ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
[ 191.738759] ? report_bug+0x18f/0x1a0
[ 191.738764] ? handle_bug+0x63/0xa0
[ 191.738767] ? exc_invalid_op+0x19/0x70
[ 191.738770] ? asm_exc_invalid_op+0x1b/0x20
[ 191.738777] ? xe_pt_stage_bind.constprop.0+0x60a/0x6b0 [xe]
[ 191.738834] ? ret_from_fork_asm+0x1a/0x30
[ 191.738849] bind_op_prepare+0x105/0x7b0 [xe]
[ 191.738906] ? dma_resv_reserve_fences+0x301/0x380
[ 191.738912] xe_pt_update_ops_prepare+0x28c/0x4b0 [xe]
[ 191.738966] ? kmemleak_alloc+0x4b/0x80
[ 191.738973] ops_execute+0x188/0x9d0 [xe]
[ 191.739036] xe_vm_rebind+0x4ce/0x5a0 [xe]
[ 191.739098] ? trace_hardirqs_on+0x4d/0x60
[ 191.739112] preempt_rebind_work_func+0x76f/0xd00 [xe]

Followed by NPD, when running some workload, since the sg was never actually populated but the vma is still marked for rebind when it should be skipped for this special EFAULT case. This is confirmed to fix the user report.

v2 (MattB):
- Move earlier.
v3 (MattB):
- Update the commit message to make it clear that this indeed fixes the issue.

Fixes: 521db22a1d70 ("drm/xe: Invalidate userptr VMA on page pin fault")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: <stable@vger.kernel.org> # v6.10+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250221143840.167150-5-matthew.auld@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-24drm/xe/userptr: restore invalidation list on errorMatthew Auld
On error, restore anything still on the pin_list back to the invalidation list. For the actual pin, so long as the vma is tracked on either list it should get picked up on the next pin; however, it looks possible for the vma to get nuked but still be present on this per-vm pin_list, leading to corruption. An alternative might be to instead just remove the link when destroying the vma. v2: - Also add some asserts. - Keep the overzealous locking so that we are consistent with the docs; updating the docs and related bits will be done as a follow up. Fixes: ed2bdf3b264d ("drm/xe/vm: Subclass userptr vmas") Suggested-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250221143840.167150-4-matthew.auld@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-24drm/xe/wa: Limit char per line to 100Tejas Upadhyay
checkpatch complains about lines longer than 100 characters. Wrap the offending lines. Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250221115344.389975-1-tejas.upadhyay@intel.com Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
2025-02-21drm/xe/oa: Ensure that polled read returns latest dataUmesh Nerlige Ramappa
In polled mode, user calls poll() for read data to be available before performing a read(). In the duration between these 2 calls, there may be new data available in the OA buffer. To ensure user reads all available data, check for latest data in the OA buffer in polled read. Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250212010255.1423343-1-umesh.nerlige.ramappa@intel.com
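The userspace side of the polled-read pattern described in this commit message can be sketched roughly as follows. This is a generic poll()-then-read() helper on an arbitrary file descriptor, not the xe OA uAPI itself; the helper name and its timeout convention are assumptions made for the example.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait until fd is readable, then read up to len bytes.
 * Returns the number of bytes read, 0 on timeout, or -1 with errno set.
 */
static ssize_t read_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	/* Retry if a signal interrupts the wait. */
	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret <= 0)
		return ret; /* 0: timed out, -1: poll() failed */

	/* Data may have grown since poll() returned; read() sees what is there now. */
	return read(fd, buf, len);
}
```

With the change described above, data that arrives in the OA buffer between the two calls is expected to be included in the read() result instead of waiting for the next poll() cycle.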
2025-02-21drm/xe: Add fault injection for xe_sync_entry_parsePriyanka Dandamudi
Add fault injection for xe_sync_entry_parse to allow it to fail while executing xe_vm_bind_ioctl(). This needs to be added as it cannot be reached by injecting error through IOCTL arguments. Signed-off-by: Priyanka Dandamudi <priyanka.dandamudi@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Reviewed-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250212093212.3069356-1-priyanka.dandamudi@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-02-20drm/xe/client: Skip show_run_ticks if unable to read timestampMarcin Bernatowicz
RING_TIMESTAMP registers are inaccessible in VF mode. Without drm-total-cycles-*, other keys provide little value. Skip all optional "run_ticks" keys in this case. Signed-off-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250205191644.2550879-3-marcin.bernatowicz@linux.intel.com
2025-02-20drm/xe/vf: Return EOPNOTSUPP for DRM_XE_DEVICE_QUERY_ENGINE_CYCLES if VFMarcin Bernatowicz
RING_TIMESTAMP registers are not available for VF (Virtual Function) drivers. Return -EOPNOTSUPP when the DRM_XE_DEVICE_QUERY_ENGINE_CYCLES ioctl is invoked on a VF device. Signed-off-by: Marcin Bernatowicz <marcin.bernatowicz@linux.intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250205191644.2550879-2-marcin.bernatowicz@linux.intel.com
2025-02-19drm/xe: Drop unnecessary GT lookup in xe_exec_queue_create_ioctl()Matt Roper
xe_exec_queue_create_ioctl() performs a lookup of the xe_gt for the GT ID passed from userspace, but the result is never actually used. Since there's already a separate (and earlier) check that the ID passed from userspace is valid, the unnecessary lookup can be removed. Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250218200511.4050060-2-matthew.d.roper@intel.com Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
2025-02-18drm/xe/display: Spin-off xe_display runtime/d3cold sequencesRodrigo Vivi
No functional change. This patch only splits the xe_display_pm suspend/resume functions into the regular suspend/resume ones and the runtime/d3cold ones. v2: - Rename d3cold functions (Jonathan) - Rebase Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250218010330.761340-1-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-02-18drm/{i915, xe}/display: Move dsm registration under intel_driverRodrigo Vivi
Move dsm register/unregister calls from the drivers to under intel_display_driver register/unregister. v2: Rebase only Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250217200133.741758-1-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2025-02-18drm/xe/guc: Fix size_t print formatLucas De Marchi
Use %zx format to print size_t to remove the following warning when building for i386:

>> drivers/gpu/drm/xe/xe_guc_ct.c:1727:43: warning: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Wformat]
    1727 |         drm_printf(p, "[CTB].length: 0x%lx\n", snapshot->ctb_size);
         |                                        ~~~     ^~~~~~~~~~~~~~~~~~
         |                                        %zx

Cc: José Roberto de Souza <jose.souza@intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202501281627.H6nj184e-lkp@intel.com/
Fixes: 643f209ba3fd ("drm/xe: Make GUC binaries dump consistent with other binaries in devcoredump")
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250128154242.3371687-1-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 7748289df510638ba61fed86b59ce7d2fb4a194c)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
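The portability point behind this fix can be shown in plain userspace C. The sketch below uses snprintf() rather than the kernel's drm_printf(), and the helper name is made up for the example; only the format string mirrors the one quoted in the warning.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format a length line with %zx, the conversion that matches size_t on both
 * 32-bit and 64-bit targets. %lx is only correct where size_t happens to be
 * unsigned long, which is not the case on i386.
 */
static int format_ctb_length(char *out, size_t out_size, size_t ctb_size)
{
	return snprintf(out, out_size, "[CTB].length: 0x%zx\n", ctb_size);
}
```

Called with a value of 0x1000, the helper writes the line `[CTB].length: 0x1000` followed by a newline, regardless of the width of size_t on the build target.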
2025-02-18drm/xe: Make GUC binaries dump consistent with other binaries in devcoredumpJosé Roberto de Souza
All other binaries (hwsp, hwctx and vmas) follow this format:

[name].length: 0x1000
[name].data: xxxxxxx
[name].error: errno

The error entry is there in case the binary could not be captured for some reason. So these GuC binaries should follow the same pattern.

v2:
- renamed GUC binary to LOG

Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: José Roberto de Souza <jose.souza@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250123202307.95103-3-jose.souza@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit cb1f868ca13756c0c18ba54d1591332476760d07)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
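To make the `[name].length` / `[name].data` / `[name].error` layout concrete, here is a small userspace sketch that emits one section in that shape. It is not the driver's code: the function name is hypothetical, the hex encoding of the data is an assumption (the commit message only shows a placeholder), and so is the choice to print only the error line when the capture failed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Write one binary section into out.
 * If data is NULL the capture is treated as failed and only the error line
 * is written (assumption). Hex encoding for .data is an assumption as well.
 * Returns the number of characters written, or -1 if out is too small.
 */
static int format_binary_section(char *out, size_t out_size, const char *name,
				 const uint8_t *data, size_t len, int err)
{
	size_t pos;
	int n;

	if (!data) {
		n = snprintf(out, out_size, "[%s].error: %d\n", name, err);
		return (n < 0 || (size_t)n >= out_size) ? -1 : n;
	}

	n = snprintf(out, out_size, "[%s].length: 0x%zx\n[%s].data: ",
		     name, len, name);
	if (n < 0 || (size_t)n >= out_size)
		return -1;
	pos = (size_t)n;

	/* Append the payload two hex digits per byte. */
	for (size_t i = 0; i < len; i++) {
		n = snprintf(out + pos, out_size - pos, "%02x",
			     (unsigned int)data[i]);
		if (n < 0 || (size_t)n >= out_size - pos)
			return -1;
		pos += (size_t)n;
	}

	n = snprintf(out + pos, out_size - pos, "\n");
	if (n < 0 || (size_t)n >= out_size - pos)
		return -1;

	return (int)(pos + (size_t)n);
}
```

Keeping every binary in the same three-key shape means a single parser on the consumer side can handle all sections by name.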
2025-02-18drm/xe: Add xe_mmio_init() initialization functionIlia Levi
Add a convenience function for minimal initialization of struct xe_mmio. This function also validates that the entirety of the provided mmio region is usable with struct xe_reg. v2: Modify commit message, add kernel doc, refactor assert (Michal) v3: Fix off-by-one bug, add clarifying macro (Michal) v4: Derive bitfield width from size (Michal) Signed-off-by: Ilia Levi <ilia.levi@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250213093559.204652-1-ilia.levi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-02-18drm/xe: s/xe_mmio_init/xe_mmio_probe_earlyIlia Levi
Rename so that xe_mmio_init() can be used in subsequent patches to initialize an instance of struct xe_mmio. Signed-off-by: Ilia Levi <ilia.levi@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250130105057.136586-1-ilia.levi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>