author     Caleb Sander Mateos <csander@purestorage.com>   2025-05-12 15:50:40 +0200
committer  Christoph Hellwig <hch@lst.de>                  2025-05-20 05:34:27 +0200
commit     d977506f8863807129d7a11f4057dfb1b38085ea (patch)
tree       bc960debf3a808f9ddf530ad8a292cd29675732f
parent     b9d1ec530cdb631a3298de97c7b2ccc6145e1798 (diff)
nvme-pci: make PRP list DMA pools per-NUMA-node
NVMe commands with over 8 KB of discontiguous data allocate PRP list
pages from one of the per-nvme_device dma_pools, prp_page_pool or
prp_small_pool.
Each call to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool
spinlock. These device-global spinlocks are a significant source of
contention when many CPUs are submitting to the same NVMe devices. On a
workload issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2
NUMA nodes to 23 NVMe devices, we observed 2.4% of CPU time spent in
_raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free.
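For reference, a minimal sketch of the pre-patch layout (struct and
function names below are illustrative; the pool names and sizes mirror
the PAGE_SIZE "prp list page" and 256-byte "prp list 256" pools in
nvme-pci, but may differ in detail). Both pools are created once per
device, so every queue on every CPU funnels PRP list allocations through
the same two pool spinlocks:

    #include <linux/dmapool.h>
    #include <linux/device.h>

    /* Illustrative stand-in for the two per-device pools in struct nvme_dev. */
    struct prp_pools_sketch {
            struct dma_pool *prp_page_pool;   /* full-page PRP lists */
            struct dma_pool *prp_small_pool;  /* small (256-byte) PRP lists */
    };

    static int sketch_setup_prp_pools(struct prp_pools_sketch *p,
                                      struct device *dmadev)
    {
            /*
             * One pool pair for the whole device: every dma_pool_alloc()
             * and dma_pool_free() from any CPU takes that pool's spinlock.
             */
            p->prp_page_pool = dma_pool_create("prp list page", dmadev,
                                               PAGE_SIZE, PAGE_SIZE, 0);
            if (!p->prp_page_pool)
                    return -ENOMEM;

            p->prp_small_pool = dma_pool_create("prp list 256", dmadev,
                                                256, 256, 0);
            if (!p->prp_small_pool) {
                    dma_pool_destroy(p->prp_page_pool);
                    return -ENOMEM;
            }
            return 0;
    }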
Ideally, the dma_pools would be per-hctx to minimize contention. But
that could impose considerable resource costs in a system with many NVMe
devices and CPUs.
As a compromise, allocate per-NUMA-node PRP list DMA pools. Map each
nvme_queue to the set of DMA pools corresponding to its device and its
hctx's NUMA node. This reduces the _raw_spin_lock_irqsave overhead by
about half, to 1.2%. Preventing the sharing of PRP list pages across
NUMA nodes also makes them cheaper to initialize.
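A rough sketch of the per-NUMA-node arrangement (names such as
prp_node_pools and sketch_queue_pools are illustrative, not necessarily
what the patch uses): keep one pool pair per possible node, and have
each nvme_queue look up the pair matching its hctx's node:

    #include <linux/dmapool.h>
    #include <linux/numa.h>
    #include <linux/topology.h>

    struct prp_node_pools {
            struct dma_pool *page_pool;
            struct dma_pool *small_pool;
    };

    /*
     * One entry per NUMA node, e.g. allocated with
     * kcalloc(nr_node_ids, sizeof(*prp_node_pools_array), GFP_KERNEL).
     */
    static struct prp_node_pools *prp_node_pools_array;

    /*
     * Map a queue to the pool pair for its hctx's NUMA node, so PRP list
     * pages are only shared between CPUs on the same node.
     */
    static struct prp_node_pools *sketch_queue_pools(int hctx_node)
    {
            /* Fallback for queues with no node affinity (illustrative choice). */
            int node = (hctx_node == NUMA_NO_NODE) ? numa_node_id() : hctx_node;

            return &prp_node_pools_array[node];
    }

In the I/O path each queue then calls dma_pool_alloc()/dma_pool_free()
on its node-local pair, so spinlock contention is limited to CPUs on the
same node as the queue.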
Link: https://lore.kernel.org/linux-nvme/CADUfDZqa=OOTtTTznXRDmBQo1WrFcDw1hBA7XwM7hzJ-hpckcA@mail.gmail.com/T/#u
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>