path: root/drivers/accel
Age | Commit message | Author
2023-04-08 | accel/habanalabs: fix wrong reset and event flags | Ofir Bitton
During event handling, the driver sets the relevant reset and user event notifier flags. Fix a few wrong flag settings. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 | accel/habanalabs: fix events mask of decoder abnormal interrupts | Tomer Tayar
The decoder IRQ status register may have several bits set upon an abnormal interrupt. Therefore, when setting the events mask, we need to check all bits rather than use an if-else chain. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
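The difference between an if-else chain and checking every bit can be sketched in plain userspace C. This is only an illustration of the pattern the commit message describes: the bit names, the event flags and both function names are hypothetical and are not taken from the habanalabs driver.

```c
#include <stdint.h>

/* Hypothetical IRQ status bits and event flags, for illustration only. */
#define DEC_IRQ_AXI_ERR     (1u << 0)
#define DEC_IRQ_TIMEOUT     (1u << 1)

#define EVENT_MASK_AXI_ERR  (1ull << 0)
#define EVENT_MASK_TIMEOUT  (1ull << 1)

/* if-else chain: stops at the first matching bit, so later bits are lost. */
uint64_t events_if_else(uint32_t irq_status)
{
	uint64_t event_mask = 0;

	if (irq_status & DEC_IRQ_AXI_ERR)
		event_mask |= EVENT_MASK_AXI_ERR;
	else if (irq_status & DEC_IRQ_TIMEOUT)
		event_mask |= EVENT_MASK_TIMEOUT;

	return event_mask;
}

/* Independent checks: every set status bit contributes to the mask. */
uint64_t events_all_bits(uint32_t irq_status)
{
	uint64_t event_mask = 0;

	if (irq_status & DEC_IRQ_AXI_ERR)
		event_mask |= EVENT_MASK_AXI_ERR;
	if (irq_status & DEC_IRQ_TIMEOUT)
		event_mask |= EVENT_MASK_TIMEOUT;

	return event_mask;
}
```

With both status bits set, events_if_else() reports only the first event, while events_all_bits() reports both.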
2023-04-08 | accel/habanalabs: remove completion from abnormal interrupt work name | Tomer Tayar
Decoder abnormal interrupts are for errors and not for completion, so rename the relevant work and work function to not include 'completion'. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 | accel/habanalabs: print raw binning masks in debug level | Ofir Bitton
There are rare cases of failures when cards are initialized due to wrong values in efuse mappings that are parsed by firmware. To help debug those cases, print (in debug level) the raw binning masks as fetched from the firmware during device initialization. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 | accel/habanalabs: fix HBM MMU interrupt handling | Ofir Bitton
The current mapping between HMMU event and HMMU block is wrong. In addition, the captured address in case of a page fault or an access error is scrambled, hence we must call the descramble function. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 | accel/habanalabs: improvements to FW ver extraction | Dafna Hirschfeld
1. Rename the func to hl_get_preboot_major_minor because we also set the extracted values in hdev fields.
2. Free the allocated string in the calling function, which makes more sense.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 | accel/habanalabs: fix access error clear event | Dani Liberman
The register which needs to be cleared is the valid register instead of the address. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-08 | accel/habanalabs: send disable pci when compute ctx is active | Tal Cohen
Fix an issue in hard reset flow in which the driver didn't send a disable pci message if there was an active compute context. In hard reset, disable pci message should be sent no matter if a compute context exists or not. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 | accel/habanalabs: remove duplicated disable pci msg | Tal Cohen
The disable pci message is sent in reset device. It informs the FW not to raise more EQs. The Driver may ignore received EQs, when the device is in disabled mode. The duplication happens when hard reset is scheduled during compute reset and also performs 'escalate_reset_flow'. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 | accel/habanalabs: change COMMS warning messages to error level | Koby Elbaz
COMMS protocol is used for LKD <--> FW communication, and any communication failure between the two might turn out to be destructive, hence, it should be well emphasized. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 | accel/habanalabs: check return value of add_va_block_locked | Dafna Hirschfeld
The function might fail, and we should propagate the failure. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 | accel/habanalabs: print event type when device is disabled | Tal Cohen
When the device is in disabled state, the driver isn't supposed to receive any events from FW. Printing the event type as part of the message that was already printed should help to get more info if this unexpected message is received. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
2023-04-08 | accel/habanalabs: unmap mapped memory when TLB inv fails | Koby Elbaz
Once a memory mapping is added to the page tables, it's followed by a TLB invalidation request which could potentially fail (HW failure). Removing the mapping is simply a part of this failure handling routine. TLB invalidation failure prints were updated to be more accurate. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
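The map, invalidate, roll-back-on-failure flow described above can be sketched as a small userspace model. Everything here is an assumption made for illustration: struct toy_mmu, toy_tlb_invalidate() and toy_map_page() are hypothetical names, and the "page table" is just a boolean array rather than anything resembling the driver's real MMU code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_PAGES 8

/* Hypothetical toy model: one flag per page, plus a simulated HW fault. */
struct toy_mmu {
	bool mapped[TOY_NUM_PAGES];
	bool tlb_broken; /* when true, TLB invalidation fails */
};

/* Simulated TLB invalidation request that may fail (HW failure). */
int toy_tlb_invalidate(const struct toy_mmu *mmu)
{
	return mmu->tlb_broken ? -EIO : 0;
}

/* Map a page, then invalidate the TLB; undo the mapping if that fails. */
int toy_map_page(struct toy_mmu *mmu, size_t page)
{
	int rc;

	assert(mmu != NULL);

	if (page >= TOY_NUM_PAGES)
		return -EINVAL;
	if (mmu->mapped[page])
		return -EEXIST;

	mmu->mapped[page] = true;

	rc = toy_tlb_invalidate(mmu);
	if (rc) {
		/* Failure handling: remove the mapping that was just added. */
		mmu->mapped[page] = false;
		return rc;
	}

	return 0;
}
```

The point of the rollback is that a caller who sees an error code can assume the page tables are back in the state they were in before the call.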
2023-04-08 | accel/habanalabs: Remove redundant pci_clear_master | Cai Huoqing
Remove pci_clear_master to simplify the code, the bus-mastering is also cleared in do_pci_disable_device, like this:

    ./drivers/pci/pci.c:2197
    static void do_pci_disable_device(struct pci_dev *dev)
    {
            u16 pci_command;

            pci_read_config_word(dev, PCI_COMMAND, &pci_command);
            if (pci_command & PCI_COMMAND_MASTER) {
                    pci_command &= ~PCI_COMMAND_MASTER;
                    pci_write_config_word(dev, PCI_COMMAND, pci_command);
            }

            pcibios_disable_device(dev);
    }

And dev->is_busmaster is set to 0 in pci_disable_device. Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-04-07 | hwmon: constify pointers to hwmon_channel_info | Krzysztof Kozlowski
HWmon core receives an array of pointers to hwmon_channel_info and it does not modify it, thus it can be array of const pointers for safety. This allows drivers to make them also const. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2023-04-06 | accel/qaic: Add qaic driver to the build system | Jeffrey Hugo
Now that we have all the components of a minimum QAIC which can boot and run an AIC100 device, add the infrastructure that allows the QAIC driver to be built. Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Reviewed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Acked-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/1679932497-30277-8-git-send-email-quic_jhugo@quicinc.com
2023-04-06 | accel/qaic: Add mhi_qaic_cntl | Pranjal Ramajor Asha Kanojiya
Some of the MHI channels for an AIC100 device need to be routed to userspace so that userspace can communicate directly with QSM. The MHI bus does not support this, and while the WWAN subsystem does (for the same reasons), AIC100 is not a WWAN device. Also, MHI is not something that other accelerators are expected to share, thus an accel subsystem function that meets this usecase is unlikely. Create a QAIC specific MHI userspace shim that exposes these channels. Start with QAIC_SAHARA which is required to boot AIC100 and is consumed by the kickstart application as documented in aic100.rst. Each AIC100 instance (currently, up to 16) in a system will create a chardev for QAIC_SAHARA. This chardev will be found as /dev/<mhi instance>_QAIC_SAHARA, for example /dev/mhi0_QAIC_SAHARA. Signed-off-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Acked-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/1679932497-30277-7-git-send-email-quic_jhugo@quicinc.com
2023-04-06 | accel/qaic: Add datapath | Jeffrey Hugo
Add the datapath component that manages BOs and submits them to running workloads on the qaic device via the dma_bridge hardware. This allows QAIC clients to interact with their workloads (run inferences) via the following ioctls along with mmap():

    DRM_IOCTL_QAIC_CREATE_BO
    DRM_IOCTL_QAIC_MMAP_BO
    DRM_IOCTL_QAIC_ATTACH_SLICE_BO
    DRM_IOCTL_QAIC_EXECUTE_BO
    DRM_IOCTL_QAIC_PARTIAL_EXECUTE_BO
    DRM_IOCTL_QAIC_WAIT_BO
    DRM_IOCTL_QAIC_PERF_STATS_BO

Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Reviewed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Acked-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/1679932497-30277-6-git-send-email-quic_jhugo@quicinc.com
2023-04-06 | accel/qaic: Add control path | Jeffrey Hugo
Add the control path component that talks to the management processor (QSM) to load workloads onto the AIC100 device. This implements the KMD portion of the NNC protocol over the QAIC_CONTROL MHI channel and the DRM_IOCTL_QAIC_MANAGE IOCTL to userspace. With this functionality, QAIC clients are able to load, run, and cleanup their workloads on the device but not interact with the workloads (run inferences). Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Reviewed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Acked-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/1679932497-30277-5-git-send-email-quic_jhugo@quicinc.com
2023-04-06 | accel/qaic: Add MHI controller | Jeffrey Hugo
An AIC100 device contains a MHI interface with a number of different channels for controlling different aspects of the device. The MHI controller works with the MHI bus to enable and drive that interface. AIC100 uses the BHI protocol in PBL to load SBL. The MHI controller expects the SBL to be located at /lib/firmware/qcom/aic100/sbl.bin and expects the MHI bus to manage the process of loading and sending SBL to the device. Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Reviewed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Acked-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/1679932497-30277-4-git-send-email-quic_jhugo@quicinc.com
2023-04-06 | accel/qaic: Add uapi and core driver file | Jeffrey Hugo
Add the QAIC driver uapi file and core driver file that binds to the PCIe device. The core driver file also creates the accel device and manages all the interconnections between the different parts of the driver. The driver can be built as a module. If so, it will be called "qaic.ko". Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Reviewed-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Acked-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/1679932497-30277-3-git-send-email-quic_jhugo@quicinc.com
2023-04-05 | accel/ivpu: Fix S3 system suspend when not idle | Jacek Lawrynowicz
Wait for VPU to be idle in ivpu_pm_suspend_cb() before powering off the device, so jobs are not lost and TDRs are not triggered after resume. Fixes: 852be13f3bd3 ("accel/ivpu: Add PM support") Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230331113603.2802515-3-stanislaw.gruszka@linux.intel.com
2023-04-05 | accel/ivpu: Add dma fence to command buffers only | Karol Wachowski
Currently job->done_fence is added to every BO handle within a job. If a job handle (command buffer) is shared between multiple submits, the KMD will add the fence in each of them. Then bo_wait_ioctl() executed on the command buffer will exit only when all jobs containing that handle are done. This creates a deadlock scenario for the user mode driver when the job handle is added as a dependency of another job, because bo_wait_ioctl() of the first job will wait until the second job finishes, and the second job cannot finish before the first one. Having fences added only to the job buffer handle allows user space to execute bo_wait_ioctl() on the job even if its handle is submitted with another job. Fixes: cd7272215c44 ("accel/ivpu: Add command buffer submission logic") Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230331113603.2802515-2-stanislaw.gruszka@linux.intel.com
2023-04-05 | accel/ivpu: Remove D3hot delay for Meteorlake | Karol Wachowski
VPU on MTL has hardware optimizations and does not require the 10ms D0 - D3hot transition delay imposed by the PCI specification (PCIe r6.0, sec 5.9). The delay removal is traditionally done by adding the PCI ID to quirk_remove_d3hot_delay() in drivers/pci/quirks.c. But since we do not need that optimization before driver probe, and we can better specify in the ivpu driver on which (future) hardware to use the optimization, we do not use quirk_remove_d3hot_delay() for that. Signed-off-by: Karol Wachowski <karol.wachowski@linux.intel.com> Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230403121545.2995279-1-stanislaw.gruszka@linux.intel.com
2023-04-03 | Merge 6.3-rc5 into driver-core-next | Greg Kroah-Hartman
We need the fixes in here for testing, as well as the driver core changes for documentation updates to build on. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-24 | accel/ivpu: Fix IPC buffer header status field value | Andrzej Kacprowski
IPC messages transmitted to the device must be marked as allocated - status field must be set to 1. The VPU driver has IVPU_IPC_HDR_ALLOCATED incorrectly defined. Future VPU firmware versions will reject all IPC messages with invalid status and will not work with a VPU driver that is missing this fix. Fixes: 5d7422cfb498 ("accel/ivpu: Add IPC driver and JSM messages") Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com> Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230323125504.2586442-9-stanislaw.gruszka@linux.intel.com
2023-03-24 | accel/ivpu: Fix VPU clock calculation | Stanislaw Gruszka
The driver calculates the wrong frequency because it ignores the workpoint config, and this causes undesired power/performance characteristics. Fix this by using the workpoint config in the freq calculations. Fixes: 35b137630f08 ("accel/ivpu: Introduce a new DRM driver for Intel VPU") Co-developed-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com> Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com> Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230323125504.2586442-8-stanislaw.gruszka@linux.intel.com
2023-03-24 | accel/ivpu: Remove support for 1 tile SKUs | Stanislaw Gruszka
The support for single tile SKUs was dropped from MTL. Note that we can still boot the VPU with 1-tile work point config - this is independent from number of tiles present in the VPU. Co-developed-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com> Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com> Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230323125504.2586442-7-stanislaw.gruszka@linux.intel.com
2023-03-24 | accel/ivpu: Disable buttress on device removal | Stanislaw Gruszka
Use pci_set_power_state() to disable buttress when the device is removed. This is a workaround for a hardware bug that hangs the system. Additionally, not disabling buttress prevents the CPU from entering deeper Pkg-C states when the driver is unloaded or fails to probe. Fixes: 35b137630f08 ("accel/ivpu: Introduce a new DRM driver for Intel VPU") Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230323125504.2586442-6-stanislaw.gruszka@linux.intel.com
2023-03-24 | accel/ivpu: Fix power down sequence | Stanislaw Gruszka
Remove FPGA workaround on power_down to skip checking for noc quiescent state. Put VPU in reset before powering it down and skip manipulating registers that are reset by the VPU reset. This fixes power down errors where VPU is powered down just after VPU is booted. Fixes: 35b137630f08 ("accel/ivpu: Introduce a new DRM driver for Intel VPU") Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230323125504.2586442-5-stanislaw.gruszka@linux.intel.com
2023-03-24 | accel/ivpu: Do not use SSID 1 | Stanislaw Gruszka
SSID=1 is used by the firmware as the default value in case SSID mapping is not initialized. This allows detecting the use of misconfigured memory contexts. Future FW versions may not allow using SSID=1. SSID=65 is a valid value: the number of contexts is limited by the number of available command queues, but SSID can be any u16 value. Fixes: 35b137630f08 ("accel/ivpu: Introduce a new DRM driver for Intel VPU") Co-developed-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com> Signed-off-by: Andrzej Kacprowski <andrzej.kacprowski@linux.intel.com> Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230323125504.2586442-4-stanislaw.gruszka@linux.intel.com
2023-03-24 | accel/ivpu: Cancel recovery work | Stanislaw Gruszka
Prevent running recovery_work after device is removed. Fixes: 852be13f3bd3 ("accel/ivpu: Add PM support") Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230323125504.2586442-3-stanislaw.gruszka@linux.intel.com
2023-03-24 | accel/ivpu: Do not access HW registers after unbind | Stanislaw Gruszka
We should not access hardware after we unbind from the bus. Use drm_dev_enter() / drm_dev_exit() to mark code sections where hardware is accessed (and not already protected by other locks) and drm_dev_unplug() to mark device is gone. Fixes: 35b137630f08 ("accel/ivpu: Introduce a new DRM driver for Intel VPU") Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jacek Lawrynowicz <jacek.lawrynowicz@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230323125504.2586442-2-stanislaw.gruszka@linux.intel.com
2023-03-22 | Merge tag 'drm-habanalabs-next-2023-03-20' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next | Dave Airlie
This tag contains habanalabs driver and accel changes for v6.4:

- uAPI changes:
  - Add opcodes to the CS ioctl to allow user to stall/resume specific engines inside Gaudi2. This is to allow the user to perform power testing/measurements when training different topologies.
  - Expose in the INFO ioctl the amount of device memory that the driver and f/w reserve for themselves.
  - Expose in the INFO ioctl a bit-mask of the available rotator engines in Gaudi2. This is to align with other engines that are already exposed.
  - Expose in the INFO ioctl the register's address of the f/w that should be used to trigger interrupts from within the user's code running in the compute engines.
  - Add a critical-event bit in the eventfd bitmask so the user will know the event that was received was critical, and a reset will now occur
  - Expose in the INFO ioctl two new opcodes to fetch information on h/w and f/w events. The events recorded are the events that were reported in the eventfd.

- New features and improvements:
  - Add a dedicated interrupt ID in MSI-X in the device to the notification of an unexpected user-related event in Gaudi2. Handle it in the driver by reporting this event.
  - Allow the user to fetch the device memory current usage even when the device is undergoing compute-reset (a reset type that only clears the compute engines).
  - Enable graceful reset mechanism for compute-reset. This will give the user a few seconds before the device is reset. For example, the user can, during that time, perform certain device operations (dump data for debug) or close the device in an orderly fashion.
  - Align the decoder with the rest of the engines in regard to notification to the user about interrupts and in regard to performing graceful reset when needed (instead of immediate reset).
  - Add support for assert interrupt from the TPC engine.
  - Get the reset type that is necessary to perform per event from the auto-generated irq_map array.
  - Print the specific reason why a device is still in use when notifying to the user about it (after the user closed the device's FD).
  - Move to threaded IRQ when handling interrupts of workload completions.

- Firmware related fixes:
  - Fix RAZWI event handler to match newest f/w version.
  - Read error cause register in dma core events because the f/w doesn't do that.
  - Increase maximum time to wait for completion of Gaudi2 reset due to f/w bug.
  - Align to the latest firmware specs.

- Enforce the release order of the compute device and dma-buf. i.e increment the device file refcount for any dma-buf that was exported for that device. This will make sure the compute device release function won't be called until the user closes all the FDs of the relevant dma-bufs. Without this change, closing the device's FD before/without closing the dma-buf's FD would always lead to hard-reset of the device.
- Fix a link in the drm documentation to correctly point to the accel section.
- Compilation warnings cleanups
- Misc bug fixes and code cleanups

Signed-off-by: Dave Airlie <airlied@redhat.com> From: Oded Gabbay <ogabbay@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20230320154026.GA766126@ogabbay-vm-u20.habana-labs.com
2023-03-20 | accel/habanalabs: remove redundant TODOs | Ofir Bitton
As mmu refactor and nic resume are not relevant anymore, remove their TODO comments. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: change razwi handle after fw fix | Dani Liberman
FW had one data route for tpc0 and tpc1 when running in secured mode and a different one when running without secured mode. After FW fixed this issue, both modes have the same data path. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: add handling for unexpected user event | Ofir Bitton
In order for the user to be aware of unexpected events in Gaudi2 that aren't assigned to a specific engine, we are adding the handling of this dedicated interrupt. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: fix a missing-braces compilation warning | Tomer Tayar
Replace initialization of "struct cpucp_packet" from "{0}" to "{}" to avoid a "missing braces around initializer" compilation warning. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: fix a maybe-uninitialized compilation warnings | Tomer Tayar
Initialize 'index' in gaudi2_handle_qman_err() and 'offset' in gaudi2_get_nic_idle_status() to avoid "maybe-uninitialized" compilation warnings. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: fix page fault event clear | Dani Liberman
After getting page fault in gaudi2, we need to clear the valid bit instead of the address. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: expose rotator mask to userspace | Ofir Bitton
All engine masks are exposed to user, make sure user gets the correct rotator enabled mask in gaudi2. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: regenerate gaudi2 ids_map_extended | Ohad Sharabi
Some event names have been modified/added. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: expose dram reserved size by kmd | Ofir Bitton
We expose this in order for user applications to know how much dram is reserved for internal use. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: remove '\n' when passing strings to gaudi2_print_event() | Tomer Tayar
Remove all '\n' from strings which are passed as arguments to gaudi2_print_event(), because the newline character is added internally in this function. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: return tlb inv error code upon failure | Koby Elbaz
Now that CQ-completion based jobs do not trigger a reset upon failure, failure of such jobs (e.g., MMU cache invalidation) should be handled by the caller itself depending on the error code returned to it. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: in {e/p}dma_core events read the err cause reg | Dafna Hirschfeld
Since the err_cause register is unprivileged, we should read it from the driver instead of using the param that came from the FW. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: fix use of var reset_sleep_ms | Dafna Hirschfeld
- remove reset_sleep_ms arg from functions that don't use it.
- move the call msleep(reset_sleep_ms) from btm poll to gaudi2_hw_fini as it is called from there already for other flow.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: in hw_fini return error code if polling timed-out | Dafna Hirschfeld
In hw_fini callback, we use either the cpucp packet method or polling a register. Currently we return error only in the case of cpucp packet failure. In this patch we also return error if polling timed out. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
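Returning an error when register polling times out can be sketched as a bounded poll loop. This is a userspace illustration under stated assumptions, not the driver's hw_fini code: poll_reset_done(), the read_reg_fn callback type and the fake_reg helper are all hypothetical, and the real driver polls with time-based timeouts rather than a fixed iteration count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register accessor: returns the current register value. */
typedef uint32_t (*read_reg_fn)(void *ctx);

/*
 * Poll until any bit in done_mask is set, at most max_polls times.
 * Returns 0 on success and -ETIMEDOUT if the bit never shows up, so the
 * caller can propagate the failure instead of silently continuing.
 */
int poll_reset_done(read_reg_fn read_reg, void *ctx,
		    uint32_t done_mask, unsigned int max_polls)
{
	assert(read_reg != NULL);

	for (unsigned int i = 0; i < max_polls; i++) {
		if (read_reg(ctx) & done_mask)
			return 0;
	}

	return -ETIMEDOUT;
}

/* Fake register for experiments: reports "done" after a number of reads. */
struct fake_reg {
	unsigned int reads_until_done;
};

uint32_t fake_read(void *ctx)
{
	struct fake_reg *reg = ctx;

	if (reg->reads_until_done == 0)
		return 1u; /* done bit set */

	reg->reads_until_done--;
	return 0;
}
```

A caller would check the return value of poll_reset_done() in the same way it already checks the result of the packet-based path, and return the error code upward on either failure.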
2023-03-20 | accel/habanalabs: increase reset poll timeout | Ofir Bitton
Due to a firmware bug we need to increase reset poll timeout or else we will timeout in secured environments. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2023-03-20 | accel/habanalabs: do not verify engine modes after being changed | Koby Elbaz
Engines idle state can't always be verified between changes of engine modes (e.g., stall/halt). For example, if a CS is inflight when altering engine's mode, idle state will return NOT idle, always. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>