summaryrefslogtreecommitdiff
path: root/drivers/misc/habanalabs/common
AgeCommit message (Collapse)Author
2022-09-18habanalabs: remove left-over code from bring-upOded Gabbay
There is some left-over code from the gaudi2 bring-up that wasn't removed so far. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs/gaudi2: change device f/w security checkfarah kassabri
On Gaudi2 the f/w always configures the PCIe iATU and allows access to scratchpad registers. Therefore, we can know if the f/w is secured by reading a status bit from the f/w registers. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: move common function out of debugfs.cOded Gabbay
A common function that is called from multiple places can't be located in degugfs.c because that file is only compiled if debugfs is enabled in the kernel config file. This can lead to undefined symbol compilation error. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: add a missing lock for in_reset indicationTomer Tayar
Add a missing lock in hl_device_resume() when it assigns a value to the 'in_reset' indication. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: fix vma fields assignments order in hl_hw_block_mmap()Tomer Tayar
In hl_hw_block_mmap(), the vma's 'vm_private_data' and 'vm_ops' fields are assigned before filling the content of the private data. In between there is a call to the ASIC hw_block_mmap() function, and if it fails, the vma close function will be called with a bad private data value. Fix the order of assignments to avoid this issue. In hl_hw_block_mmap() the vma's 'vm_private_data and vm_ops are assigned before setting the Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: avoid returning a valid handle if map_block() failsTomer Tayar
map_block() sets the block id handle even if get_hw_block_id() fails, and in this case it uses block id 0 which might be a valid id. Modify it to set the handle only if get_hw_block_id() succeeds. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: fix command submission sanity checkTal Cohen
When a CS is submitted, the ioctl handler checks the CS flags and performs a sanity check, according to its value. As new CS flags are added, the sanity check needs to be updated according to the new flags. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: add uapi to retrieve engines statusDani Liberman
Currently, to get engines status, user needed to read debugfs file with root permissions. This new uapi allows user apace apps retrieve status, so for example, in case of failure, status can be retrieved immediately by the application itself which runs without root permissions. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: remove all kdma locksOded Gabbay
We don't use KDMA concurrently in the driver. The only use is through debugfs and we don't protect concurrent access through it. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: wrap macro arg with parenthesesOhad Sharabi
The macro argument <val> is cast-ed to u32 in some of the places. Because this arg can be some arithmetic computation (e.g. address + offset) the cast should be on the whole expression. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: fix spelling mistakesBharat Jauhari
Cosmetic commit, no logical changes. It just fixes the spelling mistakes. Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs/gaudi: increase default cs timeout to 10 minutesOded Gabbay
In order to improve scalability and reduce host overhead, it is better to increase the default TDR timeout of Gaudi1 from 30 seconds to 10 minutes. This will allow the DL Framework (e.g. PyTorch, TensorFlow) to remove the host sync they are using now and improve overall performance on scaleout training. Note that one can always set the timeout to a custom value via a kernel module parameter given during driver load. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: add return code field to module iteratorOhad Sharabi
Up until now the module iterator called void callback functions and so caller activating callback that may fail suffered from 2 issues: 1. The need to "plant" return called in the private data. This is a drawback since the iterator itself should not be aware of the private data of the caller. 2. Due to 1 even in a failure the iterator would keep iterating instead of break upon error. To overcome this an optional rc field added to the iterator context. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: rename non_hard_reset to compute_resetOfir Bitton
In order to be more explicit we should use the term compute_reset for describing the reset in which only the compute engines gets reset. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: removed seq_file parameter from is_idle asic functionsDani Liberman
Change is_idle functions so it would be more usable outside debugfs. Do this by replacing seq_file parameter with regular string. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: rename soft reset to compute resetOded Gabbay
Doing compute reset can be the traditional inference soft reset that is supported only in Goya. Or it can be the new reset upon device release, which is supported in Gaudi2 and above. Therefore, wherever suitable, use the terminology of compute reset instead of soft reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add status of reset after device releaseOded Gabbay
The user might want to know the device is in reset after device release, which is not an erroneous event as a regular reset. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: fix update of is_in_soft_resetOded Gabbay
reset_info.is_in_soft_reset should be updated both before in_reset and inside the spin lock of the reset info structure. The reasons are: - When we are inside soft reset, it implies we are in reset. Therefore, if someone checks if we are in soft reset, he can deduce we are in reset, while the opposite is not correct and might be misleading. - Both these flags are changed together so they must be changed inside the reset info spinlock. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: expose only valid debugfs nodesOfir Bitton
In case security is enabled on the device, some debugfs nodes will fail. Hence, we do not expose them. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add a value field to hl_fw_send_pci_access_msg()Tomer Tayar
For gaudi2 we need to send a value to F/W as part of the PCI_ACCESS packet. As a preparation, modify hl_fw_send_pci_access_msg() to have a 'value' field. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: fixes to the poll-timeout macrosOhad Sharabi
- use conventional internal macro variables (double underscore prefix) - adjust address casting - on register poll using ELBI use ELBI read rather than BAR read on error condition - remove unused macro Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: initialize variable explicitlyOded Gabbay
Fix warning of "warning: ‘old_base’ may be used uninitialized in this function" Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: Use the bitmap API to allocate bitmapsChristophe JAILLET
Use bitmap_zalloc()/bitmap_free() instead of hand-writing them. It is less verbose and it improves the semantic. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: make sure variable is set before usedOded Gabbay
timestamp could be unset in both _hl_interrupt_wait_ioctl() and _hl_interrupt_wait_ioctl_user_addr() so it is better to explicitly initialize it to 0 when declaring it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: don't declare tmp twice in same functionOded Gabbay
tmp is declared in the scope of the function cs_do_release() and inside a block inside that function. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: do not set max power on a secured deviceOfir Bitton
Max power API is not supported in secured devices. Hence, we should skip setting it during boot. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: allow detection of unsupported f/w packetsOded Gabbay
If we send a packet to the f/w, and that packet is unsupported, we want to be able to identify this situation and possibly ignore this. Therefore, if the f/w returned an error, we need to propagate it to the callers in the result value, if those callers were interested in it. In addition, no point of printing the error code here because each caller prints its own error with a specific message. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: save f/w preboot minor versionSagiv Ozeri
We need this property for backward compatibility against the f/w. Signed-off-by: Sagiv Ozeri <sozeri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add support for common decoder interruptsOfir Bitton
User application should be able to get notification for any decoder completion. Hence, we introduce a new interface in which a user can wait for all current decoder pending interrupts. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: naming refactor of user interrupt flowOfir Bitton
Current naming convention can be misleading. Hence renaming some variables and defines in order to be more explicit. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: wait for preboot ready after hard resetOhad Sharabi
Currently we are not waiting for preboot ready after hard reset. This leads to a race in which COMMs protocol begins but will get no response from the f/w. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: enable gaudi2 code in driverOded Gabbay
Enable the Gaudi2 ASIC code in the pci probe callback of the driver so the driver will handle Gaudi2 ASICs. Add the PCI ID to the PCI table and add the ASIC enum value to all relevant places. Fixup the device parameters initialization for Gaudi2. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add gaudi2 MMU supportMoti Haimovski
Gaudi2 has new MMU units. A PMMU for device->host accesses, and HMMU for HBM accesses. The page tables of both MMUs are located in the host's memory (referred to in the code as host-resident pgt). Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add gaudi2 wait-for-CS supportOded Gabbay
In Gaudi2 we moved to a different wait for command submission completion model. Instead of receiving interrupt only on external queues, we use the device's sync manager to notify us when the entire command submission finishes. This enables us to remove the categorization of queues to external and internal, and treat each queue equally, without the need to parse and patch any command buffer. This change also requires refactoring to the IRQ handling of CS completions. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add generic security moduleOfir Bitton
As the ASICs become more complex and have many more registers, we need a better way to configure the security properties. As a reminder, we have two dedicated mechanisms for security: Range Registers and Protection bits. Those mechanisms protect sensitive memory and configuration areas inside the device. The generic module handles the low-level part of the configuration, because the configuration mechanism is identical in all ASICs. The difference is the address ranges and register names. Any ASIC that use this block should first block all the register blocks in the ASIC. Then, it should open only the registers that need to be accessed by the user (This is opposed to Goya and Gaudi, where we blocked only what should not be accesses by the user). The module contains several functions, to unblock single register, multiple registers, entire blocks, ranges, ranges with mask. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: remove obsolete device variables used for testingOded Gabbay
There are a couple of device variables that are used for testing purposes and they are set to fixed values. Remove the variables that are not relevant anymore and document the remaining variables. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: initialize new asic propertiesOded Gabbay
New asic properties were added for Gaudi2. We want to initialize and use them, when relevant, also for Goya and Gaudi. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add gaudi2 asic-specific codeOded Gabbay
Add the ASIC-specific code for Gaudi2. Supply (almost) all of the function callbacks that the driver's common code need to initialize, finalize and submit workloads to the Gaudi2 ASIC. It also contains the code to initialize the F/W of the Gaudi2 ASIC and to receive events from the F/W. It contains new debugfs entry to dump razwi events. razwi is a case where the device's engines create a transaction that reaches an invalid destination. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: remove redundant argument in access_dev_mem APIsOfir Bitton
Region structure is derived from region type, hence no need to pass it as an argument. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: communicate supported page sizes to userOhad Sharabi
Because in future ASICs the driver will allow the user to set the page size we need to make sure this data is propagated in all APIs. In addition, since this is already an ASIC property we no longer need ASIC function for it. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: remove dead code from free_device_memory()Tomer Tayar
free_device_memory() ends with if and else, each has a return statement, followed by another return statement that can never be reached. Restructure the function and remove this dead code. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: page size can only be a power of 2Ohad Sharabi
We dropped support for page sizes that are not power of 2. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: refactor dma asic-specific functionsOhad Sharabi
This is a pre-requisite patch for adding tracepoints to the DMA memory operations (allocation/free) in the driver. The main purpose is to be able to cross data with the map operations and determine whether memory violation occurred, for example free DMA allocation before unmapping it from device memory. To achieve this the DMA alloc/free code flows were refactored so that a single DMA tracepoint will catch many flows. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: set default value for memory_scrubDafna Hirschfeld
Set a default value for memory scrubbing Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: move call to scrub_device_mem after ctx_finiDafna Hirschfeld
In future ASICs, it would be possible to have a non-idle device when context is released. We thus need to postpone the scrubbing. Postpone it to hpriv release if reset is not executed or to device late init if reset is executed. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: don't send addr and size to scrub_device_mem cbDafna Hirschfeld
We use scrub_device_mem only to scrub the entire SRAM and entire DRAM. Therefore there is no need to send addr and size args to the callback. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: don't do memory scrubbing when unmappingDafna Hirschfeld
There is no need to do memory scrub when unmapping anymore as it is an overhead as long as we have a single user at any given time. Remove that code and change return value of free_phys_pg_pack to void Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: print if firmware is secured during loadOfir Bitton
For easier debug, it is desirable to have a simple way to know whether the device is secured or not, hence we dump this indication during boot. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs/gaudi: fix a race condition causing DMAR errorYuri Nudelman
There is a rare race condition in CB completion mechanism, that can occur under a very high pressure of command submissions. The preconditions for this to happen are: 1. There should be enough command submissions for the pre-allocated patched CB pool to run out of commands. At this stage we start allocating new patched CBs as they arrive. 2. CB size has to be exactly (128*n + 104)B for some n, i.e. 24B below a cache line end. The flow: 1. Two command buffers being completed on different streams, at the same time. Denote those CB1 and CB2. 2. Each command buffer is injected with two messages, 16B each - one for a HBW update of the completion queue, another to raise interrupt. 3. Assume CB1 updated the completion queue and raise the interrupt. 4. Assume CB2 updated the completion queue but did not raise the interrupt yet. 5. The host receives the interrupt. It goes over the completion queue and sees two completions - CB1 and CB2. Release them both. 6. CB2 performs the last command. The problem is that the last command is split between 2 cache lines. So to read the last 8B of the last command, it has to access the host again. Problem is - CB2 is already released. This causes a DMAR error. The solution to this problem is simply to make sure the last two commands in the CB are always in the same cache line, using NOP padding. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: move memory_scrub_val to hdev structDafna Hirschfeld
move the field memory_scrub_val from struct hl_dbg_device_entry to struct hl_device. This is because we want to use this field also if debugfs is off. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>