summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-02-07drm/amd/pm: correct the way for retrieving enabled ppfeatures on RenoirEvan Quan
As other dGPU asics, Renoir should use smu_cmn_get_enabled_mask() for that job. Signed-off-by: Evan Quan <evan.quan@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amd/display: Cap pflip irqs per max otg numberRoman Li
[Why] pflip interrupt order are mapped 1 to 1 to otg id. e.g. if irq_src=26 corresponds to otg0 then 27->otg1, 28->otg2... Linux DM registers pflip interrupts per number of crtcs. In fused pipe case crtc numbers can be less than otg id. e.g. if one pipe out of 3(otg#0-2) is fused adev->mode_info.num_crtc=2 so DM only registers irq_src 26,27. This is a bug since if pipe#2 remains unfused DM never gets otg2 pflip interrupt (irq_src=28) That may results in gfx failure due to pflip timeout. [How] Register pflip interrupts per max num of otg instead of num_crtc Signed-off-by: Roman Li <Roman.Li@amd.com> Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: check the GART table before invalidating TLBAaron Liu
Bypass group programming (utcl2_harvest) aims to forbid UTCL2 to send invalidation command to harvested SE/SA. Once invalidation command comes into harvested SE/SA, SE/SA has no response and system hang. This patch is to add checking if the GART table is already allocated before invalidating TLB. The new procedure is as following: 1. Calling amdgpu_gtt_mgr_init() in amdgpu_ttm_init(). After this step GTT BOs can be allocated, but GART mappings are still ignored. 2. Calling amdgpu_gart_table_vram_alloc() from the GMC code. This allocates the GART backing store. 3. Initializing the hardware, and programming the backing store into VMID0 for all VMHUBs. 4. Calling amdgpu_gtt_mgr_recover() to make sure the table is updated with the GTT allocations done before it was allocated. Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Aaron Liu <aaron.liu@amd.com> Acked-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: add utcl2_harvest to gc 10.3.1Aaron Liu
Confirmed with hardware team, there is harvesting for gc 10.3.1. Signed-off-by: Aaron Liu <aaron.liu@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: fix list add issue in vram reserveTao Zhou
The parameter order in the list_add_tail is incorrect, it causes the reuse of ras reserved page. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07Revert "drm/amdgpu: Add judgement to avoid infinite loop"yipechai
The commit d5e8ff5f7b2a ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop") had fixed this defect. Revert workaround commit a2170b4af62f ("drm/amdgpu: Add judgement to avoid infinite loop"). Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: Fixed the defect of soft lock caused by infinite loopyipechai
1. The infinite loop case only occurs on multiple cards support ras functions. 2. The explanation of root cause refer to commit 76641cbbf196 ("drm/amdgpu: Add judgement to avoid infinite loop"). 3. Create new node to manage each unique ras instance to guarantee each device .ras_list is completely independent. 4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block interface for each ras block"). 5. The soft locked logs are as follows: [ 262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G OE 5.13.0-27-generic #29~20.04.1-Ubuntu [ 262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020 [ 262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu] [ 262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu] [ 262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45 [ 262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202 [ 262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248 [ 262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000 [ 262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001 [ 262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e [ 262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003 [ 262.166258] FS: 0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000 [ 262.166261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0 [ 262.166267] Call Trace: [ 262.166272] amdgpu_ras_do_recovery+0x130/0x290 [amdgpu] [ 262.166529] ? psi_task_switch+0xd2/0x250 [ 262.166537] ? __switch_to+0x11d/0x460 [ 262.166542] ? __switch_to_asm+0x36/0x70 [ 262.166549] process_one_work+0x220/0x3c0 [ 262.166556] worker_thread+0x4d/0x3f0 [ 262.166560] ? process_one_work+0x3c0/0x3c0 [ 262.166563] kthread+0x12b/0x150 [ 262.166568] ? set_kthread_struct+0x40/0x40 [ 262.166571] ret_from_fork+0x22/0x30 Signed-off-by: yipechai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: Set FRU bus for Aldebaran and Vega 20Luben Tuikov
The FRU and RAS EEPROMs share the same I2C bus on Aldebaran and Vega 20 ASICs. Set the FRU bus "pointer" to this single bus, as access to the FRU is sought through that bus "pointer" and not through the RAS bus "pointer". Cc: Roy Sun <Roy.Sun@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Fixes: 2f60dd50769efc ("drm/amd: Expose the FRU SMU I2C bus") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: Fix recursive locking warningRajneesh Bhardwaj
Noticed the below warning while running a pytorch workload on vega10 GPUs. Change to trylock to avoid conflicts with already held reservation locks. [ +0.000003] WARNING: possible recursive locking detected [ +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted [ +0.000004] -------------------------------------------- [ +0.000002] python/4822 is trying to acquire lock: [ +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000203] but task is already holding lock: [ +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: ttm_eu_reserve_buffers+0x270/0x470 [ttm] [ +0.000017] other info that might help us debug this: [ +0.000002] Possible unsafe locking scenario: [ +0.000003] CPU0 [ +0.000002] ---- [ +0.000002] lock(reservation_ww_class_mutex); [ +0.000004] lock(reservation_ww_class_mutex); [ +0.000003] *** DEADLOCK *** [ +0.000002] May be due to missing lock nesting notation [ +0.000003] 7 locks held by python/4822: [ +0.000003] #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at: kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu] [ +0.000232] #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu] [ +0.000241] #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu] [ +0.000236] #3: ffffb2b35606fd28 (reservation_ww_class_acquire){+.+.}-{0:0}, at: amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu] [ +0.000235] #4: ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: ttm_eu_reserve_buffers+0x270/0x470 [ttm] [ +0.000015] #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at: drm_dev_enter+0x5/0xa0 [drm] [ +0.000038] #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3}, at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu] [ +0.000195] stack backtrace: [ +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted 5.13.0-kfd-rajneesh #1030 [ +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02 08/29/2018 [ +0.000003] Call Trace: [ +0.000003] dump_stack+0x6d/0x89 [ +0.000010] __lock_acquire+0xb93/0x1a90 [ +0.000009] lock_acquire+0x25d/0x2d0 [ +0.000005] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000184] ? lock_is_held_type+0xa2/0x110 [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000184] __ww_mutex_lock.constprop.17+0xca/0x1060 [ +0.000007] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000183] ? lock_release+0x13f/0x270 [ +0.000005] ? lock_is_held_type+0xa2/0x110 [ +0.000006] ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000183] amdgpu_bo_release_notify+0xc4/0x160 [amdgpu] [ +0.000185] ttm_bo_release+0x4c6/0x580 [ttm] [ +0.000010] amdgpu_bo_unref+0x1a/0x30 [amdgpu] [ +0.000183] amdgpu_vm_free_table+0x76/0xa0 [amdgpu] [ +0.000189] amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu] [ +0.000189] amdgpu_vm_update_ptes+0x411/0x770 [amdgpu] [ +0.000191] amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu] [ +0.000191] amdgpu_vm_bo_update+0x251/0x610 [amdgpu] [ +0.000191] update_gpuvm_pte+0xcc/0x290 [amdgpu] [ +0.000229] ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu] [ +0.000190] amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60 [amdgpu] [ +0.000234] kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu] [ +0.000218] kfd_ioctl+0x2b9/0x600 [amdgpu] [ +0.000216] ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu] [ +0.000216] ? lock_release+0x13f/0x270 [ +0.000006] ? __fget_files+0x107/0x1e0 [ +0.000007] __x64_sys_ioctl+0x8b/0xd0 [ +0.000007] do_syscall_64+0x36/0x70 [ +0.000004] entry_SYSCALL_64_after_hwframe+0x44/0xae [ +0.000007] RIP: 0033:0x7fbff90a7317 [ +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48 [ +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX: 00007fbff90a7317 [ +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI: 0000000000000004 [ +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09: 00007fbcc402d880 [ +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12: 00000000c0184b18 [ +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15: 00007fbcc402d820 Cc: Christian König <christian.koenig@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: Prevent random memory access in FRU codeLuben Tuikov
Prevent random memory access in the FRU EEPROM code by passing the size of the destination buffer to the reading routine, and reading no more than the size of the buffer. Cc: Kent Russell <kent.russell@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: Don't offset by 2 in FRU EEPROMLuben Tuikov
Read buffers no longer expose the I2C address, and so we don't need to offset by two when we get the read data. Cc: Alex Deucher <Alexander.Deucher@amd.com> Cc: Kent Russell <kent.russell@amd.com> Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Fixes: bd607166af7fe3 ("drm/amdgpu: Enable reading FRU chip via I2C v3") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: Nerf "buff" to "buf"Luben Tuikov
Buffer is abbreviated "buf" (buf-fer), not "buff" (buff-er). This is consistent with the rest of the kernel code. Cc: Kent Russell <kent.russell@amd.com> Cc: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: Bump up KFD API version for CRIURajneesh Bhardwaj
- Change KFD minor version to 7 for CRIU Proposed userspace changes: https://github.com/RadeonOpenCompute/criu Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU resume shared virtual memory rangesRajneesh Bhardwaj
In CRIU resume stage, resume all the shared virtual memory ranges from the data stored inside the resuming kfd process during CRIU restore phase. Also setup xnack mode and free up the resources. KFD_IOCTL_SVM_ATTR_CLR_FLAGS is not available for querying via get_attr interface but we must clear the flags during restore as there might be some default flags set when the prange is created. Also handle the invalid PREFETCH atribute values saved during checkpoint by replacing them with another dummy KFD_IOCTL_SVM_ATTR_SET_FLAGS attribute. (rajneesh: Fixed the checkpatch reported problems) Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU prepare for svm resumeRajneesh Bhardwaj
During CRIU restore phase, the VMAs for the virtual address ranges are not at their final location yet so in this stage, only cache the data required to successfully resume the svm ranges during an imminent CRIU resume phase. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU Save Shared Virtual Memory rangesRajneesh Bhardwaj
During checkpoint stage, save the shared virtual memory ranges and attributes for the target process. A process may contain a number of svm ranges and each range might contain a number of attributes. While not all attributes may be applicable for a given prange but during checkpoint we store all possible values for the max possible attribute types. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU Discover svm rangesRajneesh Bhardwaj
A KFD process may contain a number of virtual address ranges for shared virtual memory management and each such range can have many SVM attributes spanning across various nodes within the process boundary. This change reports the total number of such SVM ranges and their total private data size by extending the PROCESS_INFO op of the the CRIU IOCTL to discover the svm ranges in the target process and a future patches brings in the required support for checkpoint and restore for SVM ranges. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: use user_gpu_id for svm rangesRajneesh Bhardwaj
Currently the SVM ranges use actual_gpu_id but with Checkpoint Restore support its possible that the SVM ranges can be resumed on another node where the actual_gpu_id may not be same as the original (user_gpu_id) gpu id. So modify svm code to use user_gpu_id. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU allow external mm for svm rangesRajneesh Bhardwaj
Both svm_range_get_attr and svm_range_set_attr helpers use mm struct from current but for a Checkpoint or Restore operation, the current->mm will fetch the mm for the CRIU master process. So modify these helpers to accept the task mm for a target kfd process to support Checkpoint Restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU checkpoint and restore xnack modeRajneesh Bhardwaj
Recoverable page faults are represented by the xnack mode setting inside a kfd process and are used to represent the device page faults. For CR, we don't consider negative values which are typically used for querying the current xnack mode without modifying it. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU export BOs as prime dmabuf objectsRajneesh Bhardwaj
KFD buffer objects do not associate a GEM handle with them so cannot directly be used with libdrm to initiate a system dma (sDMA) operation to speedup the checkpoint and restore operation so export them as dmabuf objects and use with libdrm helper (amdgpu_bo_import) to further process the sdma command submissions. With sDMA, we see huge improvement in checkpoint and restore operations compared to the generic pci based access via host data path. Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU implement gpu_id remappingDavid Yat Sin
When doing a restore on a different node, the gpu_id's on the restore node may be different. But the user space application will still refer use the original gpu_id's in the ioctl calls. Adding code to create a gpu id mapping so that kfd can determine actual gpu_id during the user ioctl's. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU checkpoint and restore eventsDavid Yat Sin
Add support to existing CRIU ioctl's to save and restore events during criu checkpoint and restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU checkpoint and restore queue control stackDavid Yat Sin
Checkpoint contents of queue control stacks on CRIU dump and restore them during CRIU restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU checkpoint and restore queue mqdsDavid Yat Sin
Checkpoint contents of queue MQD's on CRIU dump and restore them during CRIU restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU restore queue doorbell idDavid Yat Sin
When re-creating queues during CRIU restore, restore the queue with the same doorbell id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU restore sdma id for queuesDavid Yat Sin
When re-creating queues during CRIU restore, restore the queue with the same sdma id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU restore queue idsDavid Yat Sin
When re-creating queues during CRIU restore, restore the queue with the same queue id value used during CRIU dump. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU add queues supportDavid Yat Sin
Add support to existing CRIU ioctl's to save number of queues and queue properties for each queue during checkpoint and re-create queues on restore. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU Implement KFD unpause operationDavid Yat Sin
Introducing UNPAUSE op. After CRIU amdgpu plugin performs a PROCESS_INFO op the queues will be stay in an evicted state. Once the plugin is done draining BO contents, it is safe to perform an UNPAUSE op for the queues to resume. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU Implement KFD resume ioctlRajneesh Bhardwaj
This adds support to create userptr BOs on restore and introduces a new ioctl op to restart memory notifiers for the restored userptr BOs. When doing CRIU restore MMU notifications can happen anytime after we call amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the restore process i.e. criu_resume ioctl op is received, and the process is ready to be resumed. This ioctl is different from other KFD CRIU ioctls since its called by CRIU master restore process for all the target processes being resumed by CRIU. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU Implement KFD restore ioctlRajneesh Bhardwaj
This implements the KFD CRIU Restore ioctl that lays the basic foundation for the CRIU restore operation. It provides support to create the buffer objects corresponding to the checkpointed image. This ioctl creates various types of buffer objects such as VRAM, MMIO, Doorbell, GTT based on the date sent from the userspace plugin. The data mostly contains the previously checkpointed KFD images from some KFD processs. While restoring a criu process, attach old IDR values to newly created BOs. This also adds the minimal gpu mapping support for a single gpu checkpoint restore use case. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU Implement KFD checkpoint ioctlRajneesh Bhardwaj
This adds support to discover the buffer objects that belong to a process being checkpointed. The data corresponding to these buffer objects is returned to user space plugin running under criu master context which then stores this info to recreate these buffer objects during a restore operation. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU Implement KFD process_info ioctlRajneesh Bhardwaj
This IOCTL op is expected to be called as a precursor to the actual Checkpoint operation. This does the basic discovery into the target process seized by CRIU and relays the information to the userspace that utilizes it to start the Checkpoint operation via another dedicated IOCTL op. The process_info IOCTL op determines the number of GPUs, buffer objects that are associated with the target process, its process id in caller's namespace since /proc/pid/mem interface maybe used to drain the contents of the discovered buffer objects in userspace and getpid returns the pid of CRIU dumper process. Also the pid of a process inside a container might be different than its global pid so return the ns pid. Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdkfd: CRIU Introduce Checkpoint-Restore APIsRajneesh Bhardwaj
Checkpoint-Restore in userspace (CRIU) is a powerful tool that can snapshot a running process and later restore it on same or a remote machine but expects the processes that have a device file (e.g. GPU) associated with them, provide necessary driver support to assist CRIU and its extensible plugin interface. Thus, In order to support the Checkpoint-Restore of any ROCm process, the AMD Radeon Open Compute Kernel driver, needs to provide a set of new APIs that provide necessary VRAM metadata and its contents to a userspace component (CRIU plugin) that can store it in form of image files. This introduces some new ioctls which will be used to checkpoint-Restore any KFD bound user process. KFD only allows ioctl calls from the same process that opened the KFD file descriptor. Since these ioctls are expected to be called from a KFD criu plugin which has elevated ptrace attached privileges and CAP_CHECKPOINT_RESTORE capabilities attached with the file descriptors so modify KFD to allow such calls. (API redesigned by David Yat Sin) Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: David Yat Sin <david.yatsin@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: Print once if RAS unsupportedLuben Tuikov
MESA polls for errors every 2-3 seconds. Printing with dev_info() causes the dmesg log to fill up with the same message, e.g, [18028.206676] amdgpu 0000:0b:00.0: amdgpu: df doesn't config ras function. Make it dev_dbg_once(), as it isn't something correctible during boot or thereafter, so printing just once is sufficient. Also sanitize the message. Cc: Alex Deucher <Alexander.Deucher@amd.com> Cc: Hawking Zhang <Hawking.Zhang@amd.com> Cc: John Clements <john.clements@amd.com> Cc: Tao Zhou <tao.zhou1@amd.com> Cc: yipechai <YiPeng.Chai@amd.com> Fixes: 8b0fb0e967c1 ("drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops") Signed-off-by: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: rename amdgpu_vm_bo_rmv to _delChristian König
Some people complained about the name and this matches much more Linux naming conventions for object functions. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07drm/amdgpu: add some lockdep checks to the VM codeChristian König
Whenever a bo_va structure is added or removed the VM and eventually added BO should be locked. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02drm/amd/display: Use NULL pointer instead of plain integerMagali Lemes
Assigning 0L to a pointer variable caused the following warning: drivers/gpu/drm/amd/amdgpu/../display/dc/dml/dsc/rc_calc_fpu.c:71:40: warning: Using plain integer as NULL pointer In order to remove this warning, this commit assigns a NULL pointer to the pointer variable that caused this issue. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Magali Lemes <magalilemes00@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02amdgpu/pm: Implement new API function "emit" that accepts buffer base and ↵Darren Powell
write offset (v3) Rewrote patchset to order patches as (API, hw impl, usecase) - added API for new power management function emit_clk_levels This function should duplicate the functionality of print_clk_levels, but this solution passes the buffer base and write offset down the stack. - new powerplay function emit_clock_levels, implemented by smu_emit_ppclk_levels() This function parallels the implementation of smu_print_ppclk_levels and calls emit_clk_levels, and allows the returns of errors - new helper function smu_convert_to_smuclk called by smu_print_ppclk_levels and smu_emit_ppclk_levels Signed-off-by: Darren Powell <darren.powell@amd.com> Reviewed-By: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02drm/amdgpu: limit the number of dst address in traceSomalapuram Amaranath
trace_amdgpu_vm_update_ptes trace unable to log when nptes too large Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02drm/amd: avoid suspend on dGPUs w/ s2idle support when runtime PM enabledMario Limonciello
dGPUs connected to Intel systems configured for suspend to idle will not have the power rails cut at suspend and resetting the GPU may lead to problematic behaviors. Fixes: e25443d2765f4 ("drm/amdgpu: add a dev_pm_ops prepare callback (v2)") Link: https://gitlab.freedesktop.org/drm/amd/-/issues/1879 Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02drm/amdgpu: restructure amdgpu_fill_buffer v2Christian König
We ran into the problem that clearing really larger buffer (60GiB) caused an SDMA timeout. Restructure the function to use the dst window instead of mapping the whole buffer into the GART and then fill only 2MiB/256MiB chunks at a time. v2: rebase on restructured window map. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02drm/amdgpu: rework GART copy window handlingChristian König
Instead of limiting the size before we call the mapping function let the function itself limit the size. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02drm/amdgpu: lower BUG_ON into WARN_ON for AMDGPU_PL_PREEMPTChristian König
That should never happen, but make sure that we only warn instead of crash. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02drm/amdgpu: fix logic inversion in checkChristian König
We probably never trigger this, but the logic inside the check is inverted. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02drm/amd/display: Force link_rate as LINK_RATE_RBR2 for 2018 15" Apple Retina ↵Aun-Ali Zaidi
panels The eDP link rate reported by the DP_MAX_LINK_RATE dpcd register (0xa) is contradictory to the highest rate supported reported by EDID (0xc = LINK_RATE_RBR2). The effects of this compounded with commit '4a8ca46bae8a ("drm/amd/display: Default max bpc to 16 for eDP")' results in no display modes being found and a dark panel. For now, simply force the maximum supported link rate for the eDP attached 2018 15" Apple Retina panels. Additionally, we must also check the firmware revision since the device ID reported by the DPCD is identical to that of the more capable 16,1, incorrectly quirking it. We also use said firmware check to quirk the refreshed 15,1 models with Vega graphics as they use a slightly newer firmware version. Tested-by: Aun-Ali Zaidi <admin@kodeit.net> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Aun-Ali Zaidi <admin@kodeit.net> Signed-off-by: Aditya Garg <gargaditya08@live.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02drm/amd/display: clean up some inconsistent indentingYang Li
Eliminate the follow smatch warning: drivers/gpu/drm/amd/display/dc/core/dc_link_dp.c:2246 dp_perform_8b_10b_link_training() warn: inconsistent indenting Reviewed-by: Harry Wentland <harry.wentland@amd.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02drm/amd/display: Trigger DP2 Sequence With Uncertified CableFangzhi Zuo
DP2 sequence is triggered only if VESA certified cable is detected. Force DP2 sequence with uncertified cable for testing purpose. Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Reviewed-by: Wenjing Liu <Wenjing.Liu@amd.com> Acked-by: Stylon Wang <stylon.wang@amd.com> Signed-off-by: Fangzhi Zuo <Jerry.Zuo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02drm/amd/display: 3.2.171Aric Cyr
This version brings along following fixes: - DC refactor and bug fixes for DP links - Bug fixes for DP2 - Fix regressions causing display not light up - Improved debug trace - Improved DP AUX transfer - Updated watermark latencies to fix underflows in some modes Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Acked-by: Stylon Wang <stylon.wang@amd.com> Signed-off-by: Aric Cyr <aric.cyr@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>