linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2023-06-23	PCI: epf-test: Improve handling of command and status registers	Damien Le Moal
	The pci-epf-test driver uses the test register BAR memory directly to get and execute a test registers set by the RC side and defined using a struct pci_epf_test_reg. This direct use relies on using the register BAR address as a pointer to a struct pci_epf_test_reg to execute the test case and to send back the test result through the status field of struct pci_epf_test_reg. In practice, the status field is always updated before an interrupt is raised in pci_epf_test_raise_irq(), to ensure that the RC side sees the updated status when receiving an interrupt. However, such assignment direct access does not ensure that changes to the status register make it to memory, and so visible to the host, before an interrupt is raised, thus potentially resulting in the RC host not seeing the correct status result for a test. Avoid this potential problem by using READ_ONCE()/WRITE_ONCE() when accessing the command and status fields of a pci_epf_test_reg structure. This ensure that a test start (pci_epf_test_cmd_handler() function) and completion (with the function pci_epf_test_raise_irq()) achieve a correct synchronization with the MMIO register accesses on the RC host. Link: https://lore.kernel.org/r/20230415023542.77601-10-dlemoal@kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
2023-06-23	PCI: epf-test: Simplify IRQ test commands execution	Damien Le Moal
	For the commands COMMAND_RAISE_LEGACY_IRQ, COMMAND_RAISE_MSI_IRQ and COMMAND_RAISE_MSIX_IRQ, the function pci_epf_test_cmd_handler() sets the STATUS_IRQ_RAISED status flag and calls the epc function pci_epc_raise_irq() directly. However, this is also exactly what the pci_epf_test_raise_irq() function does. Avoid duplicating these operations by directly using pci_epf_test_raise_irq() for the IRQ test commands. It is OK to do so as the host side endpoint test driver always set the correct IRQ type for the IRQ test commands. At the same time, move the IRQ number check done for the COMMAND_RAISE_MSI_IRQ and COMMAND_RAISE_MSIX_IRQ commands to pci_epf_test_raise_irq(), to also check the IRQ number requested by the host for other test commands. This significantly simplifies pci_epf_test_cmd_handler(). Link: https://lore.kernel.org/r/20230415023542.77601-9-dlemoal@kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
2023-06-23	PCI: epf-test: Simplify pci_epf_test_raise_irq()	Damien Le Moal
	Change the interface of the function pci_epf_test_raise_irq() to directly pass a pointer to the struct pci_epf_test_reg defining the test being executed. This avoids the need for grabbing this pointer using the register BAR address and simplifies the call sites as the IRQ type and IRQ numbers do not have to be passed as arguments. Link: https://lore.kernel.org/r/20230415023542.77601-8-dlemoal@kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
2023-06-23	PCI: epf-test: Simplify read/write/copy test functions	Damien Le Moal
	The function pci_epf_test_cmd_handler() uses the register BAR address as a pointer to a struct pci_epf_test_reg to determine the command sent by the host and to execute the test function accordingly. There is no need for doing this assignment again in each of the read, write and copy test functions. We can simply pass the reg pointer as an argument to the functions pci_epf_test_write(), pci_epf_test_read() and pci_epf_test_copy(). Link: https://lore.kernel.org/r/20230415023542.77601-7-dlemoal@kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
2023-06-23	PCI: epf-test: Use dmaengine_submit() to initiate DMA transfer	Damien Le Moal
	Instead of an open coded call to the tx_submit() operation of struct dma_async_tx_descriptor, use the helper function dmaengine_submit(). No functional change is introduced with this. Link: https://lore.kernel.org/r/20230415023542.77601-6-dlemoal@kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
2023-06-23	PCI: epf-test: Fix DMA transfer completion detection	Damien Le Moal
	pci_epf_test_data_transfer() and pci_epf_test_dma_callback() are not handling DMA transfer completion correctly, leading to completion notifications to the RC side that are too early. This problem can be detected when the RC side is running an IOMMU with messages such as: pci-endpoint-test 0000:0b:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x001c address=0xfff00000 flags=0x0000] When running the pcitest.sh tests: the address used for a previous test transfer generates the above error while the next test transfer is running. Fix this by testing the DMA transfer status in pci_epf_test_dma_callback() and notifying the completion only when the transfer status is DMA_COMPLETE or DMA_ERROR. Furthermore, in pci_epf_test_data_transfer(), be paranoid and check again the transfer status and always call dmaengine_terminate_sync() before returning. Link: https://lore.kernel.org/r/20230415023542.77601-5-dlemoal@kernel.org Fixes: 8353813c88ef ("PCI: endpoint: Enable DMA tests for endpoints with DMA capabilities") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org> Cc: stable@vger.kernel.org
2023-06-23	PCI: epf-test: Fix DMA transfer completion initialization	Damien Le Moal
	Reinitialize the transfer_complete DMA transfer completion before calling tx_submit(), to avoid seeing the DMA transfer complete before the completion is initialized, thus potentially losing the completion notification. Link: https://lore.kernel.org/r/20230415023542.77601-4-dlemoal@kernel.org Fixes: 8353813c88ef ("PCI: endpoint: Enable DMA tests for endpoints with DMA capabilities") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org> Cc: stable@vger.kernel.org
2023-06-23	PCI: endpoint: Move pci_epf_type_add_cfs() code	Damien Le Moal
	pci_epf_type_add_cfs() is called only from pci_ep_cfs_add_type_group() in drivers/pci/endpoint/pci-ep-cfs.c, so there is no need to export this function. Move its code from pci-epf-core.c to pci-ep-cfs.c as a static function. Link: https://lore.kernel.org/r/20230415023542.77601-3-dlemoal@kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
2023-06-23	PCI: endpoint: Automatically create a function specific attributes group	Damien Le Moal
	A PCI endpoint function driver can define function specific attributes under its function configfs directory using the add_cfs() endpoint driver operation. This is done by tying up the mkdir operation for the function configfs directory to a call to the add_cfs() operation. However, there are no checks preventing the user from repeatedly creating function specific attribute directories with different names, resulting in the same endpoint specific attributes group being added multiple times, which also result in an invalid reference counting for the attribute groups. E.g., using the pci-epf-ntb function driver as an example, the user creates the function as follows: $ modprobe pci-epf-ntb $ cd /sys/kernel/config/pci_ep/functions/pci_epf_ntb $ mkdir func0 $ tree func0 func0/ \|-- baseclass_code \|-- cache_line_size \|-- ... `-- vendorid $ mkdir func0/attrs $ tree func0 func0/ \|-- attrs \| \|-- db_count \| \|-- mw1 \| \|-- mw2 \| \|-- mw3 \| \|-- mw4 \| \|-- num_mws \| `-- spad_count \|-- baseclass_code \|-- cache_line_size \|-- ... `-- vendorid At this point, the function can be started by linking the EP controller. However, if the user mistakenly creates again a directory: $ mkdir func0/attrs2 $ tree func0 func0/ \|-- attrs \| \|-- db_count \| \|-- mw1 \| \|-- mw2 \| \|-- mw3 \| \|-- mw4 \| \|-- num_mws \| `-- spad_count \|-- attrs2 \| \|-- db_count \| \|-- mw1 \| \|-- mw2 \| \|-- mw3 \| \|-- mw4 \| \|-- num_mws \| `-- spad_count \|-- baseclass_code \|-- cache_line_size \|-- ... `-- vendorid The endpoint function specific attributes are duplicated and cause a crash when the endpoint function device is torn down: refcount_t: addition on 0; use-after-free. WARNING: CPU: 2 PID: 834 at lib/refcount.c:25 refcount_warn_saturate+0xc8/0x144 CPU: 2 PID: 834 Comm: rmdir Not tainted 6.3.0-rc1 #1 Hardware name: Pine64 RockPro64 v2.1 (DT) pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) ... Call trace: refcount_warn_saturate+0xc8/0x144 config_item_get+0x7c/0x80 configfs_rmdir+0x17c/0x30c vfs_rmdir+0x8c/0x204 do_rmdir+0x158/0x184 __arm64_sys_unlinkat+0x64/0x80 invoke_syscall+0x48/0x114 ... Fix this by modifying pci_epf_cfs_work() to execute the new function pci_ep_cfs_add_type_group() which itself calls pci_epf_type_add_cfs() to obtain the function specific attribute group and the group name (directory name) from the endpoint function driver. If the function driver defines an attribute group, pci_ep_cfs_add_type_group() then proceeds to register this group using configfs_register_group(), thus automatically exposing the function type specific configfs attributes to the user. E.g.: $ modprobe pci-epf-ntb $ cd /sys/kernel/config/pci_ep/functions/pci_epf_ntb $ mkdir func0 $ tree func0 func0/ \|-- baseclass_code \|-- cache_line_size \|-- ... \|-- pci_epf_ntb.0 \| \|-- db_count \| \|-- mw1 \| \|-- mw2 \| \|-- mw3 \| \|-- mw4 \| \|-- num_mws \| `-- spad_count \|-- primary \|-- ... `-- vendorid With this change, there is no need for the user to create or delete directories in the endpoint function attributes directory. The pci_epf_type_group_ops group operations are thus removed. Also update the documentation for the pci-epf-ntb and pci-epf-vntb function drivers to reflect this change, removing the explanations showing the need to manually create the sub-directory for the function specific attributes. Link: https://lore.kernel.org/r/20230415023542.77601-2-dlemoal@kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
2023-06-23	PCI: endpoint: Fix a Kconfig prompt of vNTB driver	Shunsuke Mie
	vNTB driver and NTB driver have same Kconfig prompt. Changed to make it distinguishable. Link: https://lore.kernel.org/r/20230202103832.2038286-1-mie@igel.co.jp Fixes: e35f56bb0330 ("PCI: endpoint: Support NTB transfer between RC and EP") Signed-off-by: Shunsuke Mie <mie@igel.co.jp> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
2023-06-23	PCI/ACPI: Call _REG when transitioning D-states	Mario Limonciello
	ACPI r6.5, sec 6.5.4, describes how AML is unable to access an OperationRegion unless _REG has been called to connect a handler: The OS runs _REG control methods to inform AML code of a change in the availability of an operation region. When an operation region handler is unavailable, AML cannot access data fields in that region. (Operation region writes will be ignored and reads will return indeterminate data.) The PCI core does not call _REG at any time, leading to the undefined behavior mentioned in the spec. The spec explains that _REG should be executed to indicate whether a given region can be accessed: Once _REG has been executed for a particular operation region, indicating that the operation region handler is ready, a control method can access fields in the operation region. Conversely, control methods must not access fields in operation regions when _REG method execution has not indicated that the operation region handler is ready. An example included in the spec demonstrates calling _REG when devices are turned off: "when the host controller or bridge controller is turned off or disabled, PCI Config Space Operation Regions for child devices are no longer available. As such, ETH0’s _REG method will be run when it is turned off and will again be run when PCI1 is turned off." It is reported that ASMedia PCIe GPIO controllers fail functional tests after the system has returning from suspend (S3 or s2idle). This is because the BIOS checks whether the OSPM has called the _REG method to determine whether it can interact with the OperationRegion assigned to the device as part of the other AML called for the device. To fix this issue, call acpi_evaluate_reg() when devices are transitioning to D3cold or D0. [bhelgaas: split pci_power_t checking to preliminary patch] Link: https://uefi.org/specs/ACPI/6.5/06_Device_Configuration.html#reg-region Link: https://lore.kernel.org/r/20230620140451.21007-1-mario.limonciello@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael@kernel.org>
2023-06-23	PCI/ACPI: Validate acpi_pci_set_power_state() parameter	Bjorn Helgaas
	Previously acpi_pci_set_power_state() assumed the requested power state was valid (PCI_D0 ... PCI_D3cold). If a caller supplied something else, we could index outside the state_conv[] array and pass junk to acpi_device_set_power(). Validate the pci_power_t parameter and return -EINVAL if it's invalid. Link: https://lore.kernel.org/r/20230621222857.GA122930@bhelgaas Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
2023-06-23	PCI/PM: Avoid putting EloPOS E2/S2/H2 PCIe Ports in D3cold	Ondrej Zary
	The quirk for Elo i2 introduced in commit 92597f97a40b ("PCI/PM: Avoid putting Elo i2 PCIe Ports in D3cold") is also needed by EloPOS E2/S2/H2 which uses the same Continental Z2 board. Change the quirk to match the board instead of system. Link: https://bugzilla.kernel.org/show_bug.cgi?id=215715 Link: https://lore.kernel.org/r/20230614074253.22318-1-linux@zary.sk Signed-off-by: Ondrej Zary <linux@zary.sk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org
2023-06-22	PCI: rockchip: Set address alignment for endpoint mode	Damien Le Moal
	The address translation unit of the rockchip EP controller does not use the lower 8 bits of a PCIe-space address to map local memory. Thus we must set the align feature field to 256 to let the user know about this constraint. Link: https://lore.kernel.org/r/20230418074700.1083505-12-rick.wertenbroek@gmail.com Fixes: cf590b078391 ("PCI: rockchip: Add EP driver for Rockchip PCIe controller") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: stable@vger.kernel.org
2023-06-22	PCI: rockchip: Don't advertise MSI-X in PCIe capabilities	Rick Wertenbroek
	The RK3399 PCIe endpoint controller cannot generate MSI-X IRQs. This is documented in the RK3399 technical reference manual (TRM) section 17.5.9 "Interrupt Support". MSI-X capability should therefore not be advertised. Remove the MSI-X capability by editing the capability linked-list. The previous entry is the MSI capability, therefore get the next entry from the MSI-X capability entry and set it as next entry for the MSI capability. This in effect removes MSI-X from the list. Linked list before : MSI cap -> MSI-X cap -> PCIe Device cap -> ... Linked list now : MSI cap -> PCIe Device cap -> ... Link: https://lore.kernel.org/r/20230418074700.1083505-11-rick.wertenbroek@gmail.com Fixes: cf590b078391 ("PCI: rockchip: Add EP driver for Rockchip PCIe controller") Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: stable@vger.kernel.org
2023-06-22	PCI: rockchip: Use u32 variable to access 32-bit registers	Rick Wertenbroek
	Previously u16 variables were used to access 32-bit registers, this resulted in not all of the data being read from the registers. Also the left shift of more than 16-bits would result in moving data out of the variable. Use u32 variables to access 32-bit registers Link: https://lore.kernel.org/r/20230418074700.1083505-10-rick.wertenbroek@gmail.com Fixes: cf590b078391 ("PCI: rockchip: Add EP driver for Rockchip PCIe controller") Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: stable@vger.kernel.org
2023-06-22	PCI: rockchip: Fix window mapping and address translation for endpoint	Rick Wertenbroek
	The RK3399 PCI endpoint core has 33 windows for PCIe space, now in the driver up to 32 fixed size (1M) windows are used and pages are allocated and mapped accordingly. The driver first used a single window and allocated space inside which caused translation issues (between CPU space and PCI space) because a window can only have a single translation at a given time, which if multiple pages are allocated inside will cause conflicts. Now each window is a single region of 1M which will always guarantee that the translation is not in conflict. Set the translation register addresses for physical function. As documented in the technical reference manual (TRM) section 17.5.5 "PCIe Address Translation" and section 17.6.8 "Address Translation Registers Description" Link: https://lore.kernel.org/r/20230418074700.1083505-9-rick.wertenbroek@gmail.com Fixes: cf590b078391 ("PCI: rockchip: Add EP driver for Rockchip PCIe controller") Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: stable@vger.kernel.org
2023-06-22	PCI: rockchip: Fix legacy IRQ generation for RK3399 PCIe endpoint core	Rick Wertenbroek
	Fix legacy IRQ generation for RK3399 PCIe endpoint core according to the technical reference manual (TRM). Assert and deassert legacy interrupt (INTx) through the legacy interrupt control register ("PCIE_CLIENT_LEGACY_INT_CTRL") instead of manually generating a PCIe message. The generation of the legacy interrupt was tested and validated with the PCIe endpoint test driver. Link: https://lore.kernel.org/r/20230418074700.1083505-8-rick.wertenbroek@gmail.com Fixes: cf590b078391 ("PCI: rockchip: Add EP driver for Rockchip PCIe controller") Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: stable@vger.kernel.org
2023-06-22	PCI: rockchip: Add poll and timeout to wait for PHY PLLs to be locked	Rick Wertenbroek
	The RK3399 PCIe controller should wait until the PHY PLLs are locked. Add poll and timeout to wait for PHY PLLs to be locked. If they cannot be locked generate error message and jump to error handler. Accessing registers in the PHY clock domain when PLLs are not locked causes hang The PHY PLLs status is checked through a side channel register. This is documented in the TRM section 17.5.8.1 "PCIe Initialization Sequence". Link: https://lore.kernel.org/r/20230418074700.1083505-5-rick.wertenbroek@gmail.com Fixes: cf590b078391 ("PCI: rockchip: Add EP driver for Rockchip PCIe controller") Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: stable@vger.kernel.org
2023-06-22	PCI: rockchip: Assert PCI Configuration Enable bit after probe	Rick Wertenbroek
	Assert PCI Configuration Enable bit after probe. When this bit is left to 0 in the endpoint mode, the RK3399 PCIe endpoint core will generate configuration request retry status (CRS) messages back to the root complex. Assert this bit after probe to allow the RK3399 PCIe endpoint core to reply to configuration requests from the root complex. This is documented in section 17.5.8.1.2 of the RK3399 TRM. Link: https://lore.kernel.org/r/20230418074700.1083505-4-rick.wertenbroek@gmail.com Fixes: cf590b078391 ("PCI: rockchip: Add EP driver for Rockchip PCIe controller") Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: stable@vger.kernel.org
2023-06-22	PCI: rockchip: Write PCI Device ID to correct register	Rick Wertenbroek
	Write PCI Device ID (DID) to the correct register. The Device ID was not updated through the correct register. Device ID was written to a read-only register and therefore did not work. The Device ID is now set through the correct register. This is documented in the RK3399 TRM section 17.6.6.1.1 Link: https://lore.kernel.org/r/20230418074700.1083505-3-rick.wertenbroek@gmail.com Fixes: cf590b078391 ("PCI: rockchip: Add EP driver for Rockchip PCIe controller") Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: stable@vger.kernel.org
2023-06-22	PCI: rockchip: Remove writes to unused registers	Rick Wertenbroek
	Remove write accesses to registers that are marked "unused" (and therefore read-only) in the technical reference manual (TRM) (see RK3399 TRM 17.6.8.1) Link: https://lore.kernel.org/r/20230418074700.1083505-2-rick.wertenbroek@gmail.com Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Rick Wertenbroek <rick.wertenbroek@gmail.com> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
2023-06-20	PCI/ASPM: Avoid link retraining race	Ilpo Järvinen
	PCIe r6.0.1, sec 7.5.3.7, recommends setting the link control parameters, then waiting for the Link Training bit to be clear before setting the Retrain Link bit. This avoids a race where the LTSSM may not use the updated parameters if it is already in the midst of link training because of other normal link activity. Wait for the Link Training bit to be clear before toggling the Retrain Link bit to ensure that the LTSSM uses the updated link control parameters. [bhelgaas: commit log, return 0 (success)/-ETIMEDOUT instead of bool for both pcie_wait_for_retrain() and the existing pcie_retrain_link()] Suggested-by: Lukas Wunner <lukas@wunner.de> Fixes: 7d715a6c1ae5 ("PCI: add PCI Express ASPM support") Link: https://lore.kernel.org/r/20230502083923.34562-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Lukas Wunner <lukas@wunner.de> Cc: stable@vger.kernel.org
2023-06-20	PCI/ASPM: Factor out pcie_wait_for_retrain()	Ilpo Järvinen
	Factor pcie_wait_for_retrain() out from pcie_retrain_link(). No functional change intended. [bhelgaas: split out from https://lore.kernel.org/r/20230502083923.34562-1-ilpo.jarvinen@linux.intel.com] Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-06-20	PCI/ASPM: Return 0 or -ETIMEDOUT from pcie_retrain_link()	Bjorn Helgaas
	"pcie_retrain_link" is not a question with a true/false answer, so "bool" isn't quite the right return type. Return 0 for success or -ETIMEDOUT if the retrain failed. No functional change intended. [bhelgaas: based on Ilpo's patch below] Link: https://lore.kernel.org/r/20230502083923.34562-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-06-20	PCI: Add failed link recovery for device reset events	Maciej W. Rozycki
	Request failed link recovery with any upstream PCIe bridge where a device has not come back after reset within PCI_RESET_WAIT time. Reset the polling interval if recovery succeeded, otherwise continue as usual. [bhelgaas: inline pcie_parent_link_retrain()] Link: https://lore.kernel.org/r/alpine.DEB.2.21.2306111631050.64925@angie.orcam.me.uk Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-06-20	PCI: Work around PCIe link training failures	Maciej W. Rozycki
	Attempt to handle cases such as with a downstream port of the ASMedia ASM2824 PCIe switch where link training never completes and the link continues switching between speeds indefinitely with the data link layer never reaching the active state. It has been observed with a downstream port of the ASMedia ASM2824 Gen 3 switch wired to the upstream port of the Pericom PI7C9X2G304 Gen 2 switch, using a Delock Riser Card PCI Express x1 > 2 x PCIe x1 device, P/N 41433, wired to a SiFive HiFive Unmatched board. In this setup the switches should negotiate a link speed of 5.0GT/s, falling back to 2.5GT/s if necessary. Instead the link continues oscillating between the two speeds, at the rate of 34-35 times per second, with link training reported repeatedly active ~84% of the time. Limiting the target link speed to 2.5GT/s with the upstream ASM2824 device makes the two switches communicate correctly. Removing the speed restriction afterwards makes the two devices switch to 5.0GT/s then. Make use of these observations and detect the inability to train the link by checking for the Data Link Layer Link Active status bit being off while the Link Bandwidth Management Status indicating that hardware has changed the link speed or width in an attempt to correct unreliable link operation. Restrict the speed to 2.5GT/s then with the Target Link Speed field, request a retrain and wait 200ms for the data link to go up. If this is successful, lift the restriction, letting the devices negotiate a higher speed. Also check for a 2.5GT/s speed restriction the firmware may have already arranged and lift it too with ports of devices known to continue working afterwards (currently only ASM2824), that already report their data link being up. [bhelgaas: reorder and squash stubs from https://lore.kernel.org/r/alpine.DEB.2.21.2306111619570.64925@angie.orcam.me.uk to avoid adding stubs that do nothing] Link: https://lore.kernel.org/r/alpine.DEB.2.21.2203022037020.56670@angie.orcam.me.uk/ Link: https://source.denx.de/u-boot/u-boot/-/commit/a398a51ccc68 Link: https://lore.kernel.org/r/alpine.DEB.2.21.2305310038540.59226@angie.orcam.me.uk Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-06-20	PCI: Use pcie_wait_for_link_status() in pcie_wait_for_link_delay()	Maciej W. Rozycki
	Remove a DLLLA status bit polling loop from pcie_wait_for_link_delay() and call almost identical code in pcie_wait_for_link_status() instead. This reduces the lower bound on the polling interval from 10ms to 1ms, possibly increasing the CPU load on the system in favour to reducing the wait time. Link: https://lore.kernel.org/r/alpine.DEB.2.21.2306111611170.64925@angie.orcam.me.uk Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-06-20	PCI: Add support for polling DLLLA to pcie_retrain_link()	Maciej W. Rozycki
	Let the caller of pcie_retrain_link() specify whether they want to use the LT bit or the DLLLA bit of the Link Status Register to determine if link training has completed. It is up to the caller to verify whether the use of the DLLLA bit, the implementation of which is optional, is valid for the device requested. Link: https://lore.kernel.org/r/alpine.DEB.2.21.2306110310540.64925@angie.orcam.me.uk Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-06-20	PCI: Export pcie_retrain_link() for use outside ASPM	Maciej W. Rozycki
	Export pcie_retrain_link() for link retrain needs outside ASPM. Struct pcie_link_state is local to ASPM and only used by pcie_retrain_link() to get at the associated PCI device, so change the operand and adjust the lone call site accordingly. Document the interface. No functional change at this point. Link: https://lore.kernel.org/r/alpine.DEB.2.21.2306110229010.64925@angie.orcam.me.uk Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-06-20	PCI: Export PCIe link retrain timeout	Maciej W. Rozycki
	Convert LINK_RETRAIN_TIMEOUT from jiffies to milliseconds, accordingly rename to PCIE_LINK_RETRAIN_TIMEOUT_MS, and make available via "pci.h" for the PCI core to use. Use in pcie_wait_for_link_delay(). Link: https://lore.kernel.org/r/alpine.DEB.2.21.2305310030280.59226@angie.orcam.me.uk Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-06-20	PCI: Execute quirk_enable_clear_retrain_link() earlier	Maciej W. Rozycki
	Make quirk_enable_clear_retrain_link() an early quirk so that any later fixups can rely on dev->clear_retrain_link to have been already initialised. [bhelgaas: reorder to just before it becomes possible to call pcie_retrain_link() earlier] Link: https://lore.kernel.org/r/alpine.DEB.2.21.2305310049000.59226@angie.orcam.me.uk Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-06-20	PCI/ASPM: Factor out waiting for link training to complete	Maciej W. Rozycki
	Move code polling for the Link Training bit to clear into a function of its own. [bhelgaas: reorder to clean up before exposing to PCI core] Link: https://lore.kernel.org/r/alpine.DEB.2.21.2306111605060.64925@angie.orcam.me.uk Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-06-20	PCI/ASPM: Avoid unnecessary pcie_link_state use	Maciej W. Rozycki
	[bhelgaas: extract from expose patch, reorder to clean up before exposing] Link: https://lore.kernel.org/r/alpine.DEB.2.21.2306110229010.64925@angie.orcam.me.uk Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-06-20	PCI: qcom: Do not advertise hotplug capability for IP v2.1.0	Manivannan Sadhasivam
	SoCs making use of Qcom PCIe controller IP v2.1.0 do not support hotplug functionality. But the hotplug capability bit is set by default in the hardware. This causes the kernel PCI core to register hotplug service for the controller and send hotplug commands to it. But those commands will timeout generating messages as below during boot and suspend/resume. [ 5.782159] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x03c0 (issued 2020 msec ago) [ 5.810161] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x03c0 (issued 2048 msec ago) [ 7.838162] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x07c0 (issued 2020 msec ago) [ 7.870159] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x07c0 (issued 2052 msec ago) This not only spams the console output but also induces a delay of a couple of seconds. To fix this issue, let's clear the HPC bit in PCI_EXP_SLTCAP register as a part of the post init sequence to not advertise the hotplug capability for the controller. Link: https://lore.kernel.org/r/20230619150408.8468-10-manivannan.sadhasivam@linaro.org Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
2023-06-20	PCI: qcom: Do not advertise hotplug capability for IP v1.0.0	Manivannan Sadhasivam
	SoCs making use of Qcom PCIe controller IP v1.0.0 do not support hotplug functionality. But the hotplug capability bit is set by default in the hardware. This causes the kernel PCI core to register hotplug service for the controller and send hotplug commands to it. But those commands will timeout generating messages as below during boot and suspend/resume. [ 5.782159] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x03c0 (issued 2020 msec ago) [ 5.810161] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x03c0 (issued 2048 msec ago) [ 7.838162] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x07c0 (issued 2020 msec ago) [ 7.870159] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x07c0 (issued 2052 msec ago) This not only spams the console output but also induces a delay of a couple of seconds. To fix this issue, let's clear the HPC bit in PCI_EXP_SLTCAP register as a part of the post init sequence to not advertise the hotplug capability for the controller. Link: https://lore.kernel.org/r/20230619150408.8468-9-manivannan.sadhasivam@linaro.org Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
2023-06-20	PCI: qcom: Use post init sequence of IP v2.3.2 for v2.4.0	Manivannan Sadhasivam
	The post init sequence of IP v2.4.0 is same as v2.3.2. So let's reuse the v2.3.2 sequence which now also disables hotplug capability of the controller as it is not at all supported on any SoCs making use of this IP. Link: https://lore.kernel.org/r/20230619150408.8468-8-manivannan.sadhasivam@linaro.org Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2023-06-20	PCI: qcom: Do not advertise hotplug capability for IP v2.3.2	Manivannan Sadhasivam
	SoCs making use of Qcom PCIe controller IP v2.3.2 do not support hotplug functionality. But the hotplug capability bit is set by default in the hardware. This causes the kernel PCI core to register hotplug service for the controller and send hotplug commands to it. But those commands will timeout generating messages as below during boot and suspend/resume. [ 5.782159] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x03c0 (issued 2020 msec ago) [ 5.810161] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x03c0 (issued 2048 msec ago) [ 7.838162] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x07c0 (issued 2020 msec ago) [ 7.870159] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x07c0 (issued 2052 msec ago) This not only spams the console output but also induces a delay of a couple of seconds. To fix this issue, let's clear the HPC bit in PCI_EXP_SLTCAP register as a part of the post init sequence to not advertise the hotplug capability for the controller. Link: https://lore.kernel.org/r/20230619150408.8468-7-manivannan.sadhasivam@linaro.org Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
2023-06-20	PCI: qcom: Do not advertise hotplug capability for IPs v2.3.3 and v2.9.0	Manivannan Sadhasivam
	SoCs making use of Qcom PCIe controller IPs v2.3.3 and v2.9.0 do not support hotplug functionality. But the hotplug capability bit is set by default in the hardware. This causes the kernel PCI core to register hotplug service for the controller and send hotplug commands to it. But those commands will timeout generating messages as below during boot and suspend/resume. [ 5.782159] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x03c0 (issued 2020 msec ago) [ 5.810161] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x03c0 (issued 2048 msec ago) [ 7.838162] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x07c0 (issued 2020 msec ago) [ 7.870159] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x07c0 (issued 2052 msec ago) This not only spams the console output but also induces a delay of a couple of seconds. To fix this issue, let's not set the HPC bit in PCI_EXP_SLTCAP register as a part of the post init sequence to not advertise the hotplug capability for the controller. Link: https://lore.kernel.org/r/20230619150408.8468-6-manivannan.sadhasivam@linaro.org Tested-by: Sricharan Ramabadhran <quic_srichara@quicinc.com> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2023-06-20	PCI: qcom: Do not advertise hotplug capability for IPs v2.7.0 and v1.9.0	Manivannan Sadhasivam
	SoCs making use of Qcom PCIe controller IPs v2.7.0 and v1.9.0 do not support hotplug functionality. But the hotplug capability bit is set by default in the hardware. This causes the kernel PCI core to register hotplug service for the controller and send hotplug commands to it. But those commands will timeout generating messages as below during boot and suspend/resume. [ 5.782159] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x03c0 (issued 2020 msec ago) [ 5.810161] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x03c0 (issued 2048 msec ago) [ 7.838162] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x07c0 (issued 2020 msec ago) [ 7.870159] pcieport 0001:00:00.0: pciehp: Timeout on hotplug command 0x07c0 (issued 2052 msec ago) This not only spams the console output but also induces a delay of a couple of seconds. To fix this issue, let's clear the HPC bit in PCI_EXP_SLTCAP register as a part of the post init sequence to not advertise the hotplug capability for the controller. Link: https://lore.kernel.org/r/20230619150408.8468-5-manivannan.sadhasivam@linaro.org Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2023-06-20	PCI: qcom: Disable write access to read only registers for IP v2.9.0	Manivannan Sadhasivam
	In the post init sequence of v2.9.0, write access to read only registers are not disabled after updating the registers. Fix it by disabling the access after register update. While at it, let's also add a newline after existing dw_pcie_dbi_ro_wr_en() guard function to align with rest of the driver. Link: https://lore.kernel.org/r/20230619150408.8468-4-manivannan.sadhasivam@linaro.org Fixes: 0cf7c2efe8ac ("PCI: qcom: Add IPQ60xx support") Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2023-06-20	PCI: qcom: Use DWC helpers for modifying the read-only DBI registers	Manivannan Sadhasivam
	DWC core already exposes dw_pcie_dbi_ro_wr_{en/dis} helper APIs for enabling and disabling the write access to read only DBI registers. So let's use them instead of doing it manually. Also, the existing code doesn't disable the write access when it's done. This is also fixed now. Link: https://lore.kernel.org/r/20230619150408.8468-3-manivannan.sadhasivam@linaro.org Fixes: 5d76117f070d ("PCI: qcom: Add support for IPQ8074 PCIe controller") Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
2023-06-20	PCI: qcom: Disable write access to read only registers for IP v2.3.3	Manivannan Sadhasivam
	In the post init sequence of v2.9.0, write access to read only registers are not disabled after updating the registers. Fix it by disabling the access after register update. Link: https://lore.kernel.org/r/20230619150408.8468-2-manivannan.sadhasivam@linaro.org Fixes: 5d76117f070d ("PCI: qcom: Add support for IPQ8074 PCIe controller") Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: <stable@vger.kernel.org>
2023-06-19	Merge tag 'hyperv-fixes-signed-20230619' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Fix races in Hyper-V PCI controller (Dexuan Cui) - Fix handling of hyperv_pcpu_input_arg (Michael Kelley) - Fix vmbus_wait_for_unload to scan present CPUs (Michael Kelley) - Call hv_synic_free in the failure path of hv_synic_alloc (Dexuan Cui) - Add noop for real mode handlers for virtual trust level code (Saurabh Sengar) * tag 'hyperv-fixes-signed-20230619' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: PCI: hv: Add a per-bus mutex state_lock Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally" PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic PCI: hv: Fix a race condition bug in hv_pci_query_relations() arm64/hyperv: Use CPUHP_AP_HYPERV_ONLINE state to fix CPU online sequencing x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline Drivers: hv: vmbus: Fix vmbus_wait_for_unload() to scan present CPUs Drivers: hv: vmbus: Call hv_synic_free() if hv_synic_alloc() fails x86/hyperv/vtl: Add noop for realmode pointers
2023-06-19	scatterlist: add dedicated config for DMA flags	Robin Murphy
	The DMA flags field will be useful for users beyond PCI P2P, so upgrade to its own dedicated config option. [catalin.marinas@arm.com: use #ifdef CONFIG_NEED_SG_DMA_FLAGS in scatterlist.h] [catalin.marinas@arm.com: update PCI_P2PDMA dma_flags comment in scatterlist.h] Link: https://lkml.kernel.org/r/20230612153201.554742-13-catalin.marinas@arm.com Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Isaac J. Manjarres <isaacmanjarres@google.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jerry Snitselaar <jsnitsel@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jonathan Cameron <jic23@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Lars-Peter Clausen <lars@metafoo.de> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19	PCI: imx6: Save and restore root port MSI control in suspend and resume	Richard Zhu
	The imx6 PCI host controller suffers from a HW integration bug whereby the MSI enable bit in the root port MSI capability enables/disables MSIs interrupts for all downstream components in the PCI tree. This requires, as implemented in 75cb8d20c112 ("PCI: imx: Enable MSI from downstream components") that the root port MSI enable bit should be set in order for downstream PCI devices MSIs to function. The MSI enable bit programming might be lost during the suspend and should be re-stored during resume. Save the MSI control during suspend and restore it in resume. Link: https://lore.kernel.org/r/1670479534-22154-1-git-send-email-hongxing.zhu@nxp.com Signed-off-by: Richard Zhu <hongxing.zhu@nxp.com> [lpieralisi@kernel.org: commit log] Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
2023-06-18	PCI: hv: Add a per-bus mutex state_lock	Dexuan Cui
	In the case of fast device addition/removal, it's possible that hv_eject_device_work() can start to run before create_root_hv_pci_bus() starts to run; as a result, the pci_get_domain_bus_and_slot() in hv_eject_device_work() can return a 'pdev' of NULL, and hv_eject_device_work() can remove the 'hpdev', and immediately send a message PCI_EJECTION_COMPLETE to the host, and the host immediately unassigns the PCI device from the guest; meanwhile, create_root_hv_pci_bus() and the PCI device driver can be probing the dead PCI device and reporting timeout errors. Fix the issue by adding a per-bus mutex 'state_lock' and grabbing the mutex before powering on the PCI bus in hv_pci_enter_d0(): when hv_eject_device_work() starts to run, it's able to find the 'pdev' and call pci_stop_and_remove_bus_device(pdev): if the PCI device driver has loaded, the PCI device driver's probe() function is already called in create_root_hv_pci_bus() -> pci_bus_add_devices(), and now hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able to call the PCI device driver's remove() function and remove the device reliably; if the PCI device driver hasn't loaded yet, the function call hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able to remove the PCI device reliably and the PCI device driver's probe() function won't be called; if the PCI device driver's probe() is already running (e.g., systemd-udev is loading the PCI device driver), it must be holding the per-device lock, and after the probe() finishes and releases the lock, hv_eject_device_work() -> pci_stop_and_remove_bus_device() is able to proceed to remove the device reliably. Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs") Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230615044451.5580-6-decui@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-06-18	Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"	Dexuan Cui
	This reverts commit d6af2ed29c7c1c311b96dac989dcb991e90ee195. The statement "the hv_pci_bus_exit() call releases structures of all its child devices" in commit d6af2ed29c7c is not true: in the path hv_pci_probe() -> hv_pci_enter_d0() -> hv_pci_bus_exit(hdev, true): the parameter "keep_devs" is true, so hv_pci_bus_exit() does not release the child "struct hv_pci_dev *hpdev" that is created earlier in pci_devices_present_work() -> new_pcichild_device(). The commit d6af2ed29c7c was originally made in July 2020 for RHEL 7.7, where the old version of hv_pci_bus_exit() was used; when the commit was rebased and merged into the upstream, people didn't notice that it's not really necessary. The commit itself doesn't cause any issue, but it makes hv_pci_probe() more complicated. Revert it to facilitate some upcoming changes to hv_pci_probe(). Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Wei Hu <weh@microsoft.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230615044451.5580-5-decui@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-06-18	PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev	Dexuan Cui
	The hpdev->state is never really useful. The only use in hv_pci_eject_device() and hv_eject_device_work() is not really necessary. Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Acked-by: Lorenzo Pieralisi <lpieralisi@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230615044451.5580-4-decui@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-06-18	PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic	Dexuan Cui
	When the host tries to remove a PCI device, the host first sends a PCI_EJECT message to the guest, and the guest is supposed to gracefully remove the PCI device and send a PCI_EJECTION_COMPLETE message to the host; the host then sends a VMBus message CHANNELMSG_RESCIND_CHANNELOFFER to the guest (when the guest receives this message, the device is already unassigned from the guest) and the guest can do some final cleanup work; if the guest fails to respond to the PCI_EJECT message within one minute, the host sends the VMBus message CHANNELMSG_RESCIND_CHANNELOFFER and removes the PCI device forcibly. In the case of fast device addition/removal, it's possible that the PCI device driver is still configuring MSI-X interrupts when the guest receives the PCI_EJECT message; the channel callback calls hv_pci_eject_device(), which sets hpdev->state to hv_pcichild_ejecting, and schedules a work hv_eject_device_work(); if the PCI device driver is calling pci_alloc_irq_vectors() -> ... -> hv_compose_msi_msg(), we can break the while loop in hv_compose_msi_msg() due to the updated hpdev->state, and leave data->chip_data with its default value of NULL; later, when the PCI device driver calls request_irq() -> ... -> hv_irq_unmask(), the guest crashes in hv_arch_irq_unmask() due to data->chip_data being NULL. Fix the issue by not testing hpdev->state in the while loop: when the guest receives PCI_EJECT, the device is still assigned to the guest, and the guest has one minute to finish the device removal gracefully. We don't really need to (and we should not) test hpdev->state in the loop. Fixes: de0aa7b2f97d ("PCI: hv: Fix 2 hang issues in hv_compose_msi_msg()") Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230615044451.5580-3-decui@microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>