diff options
| author | Yang Shi <yang@os.amperecomputing.com> | 2025-09-17 12:02:09 -0700 | 
|---|---|---|
| committer | Will Deacon <will@kernel.org> | 2025-09-18 21:36:37 +0100 | 
| commit | a166563e7ec375b38a0fd3a58f7b77e50a6bc6a8 (patch) | |
| tree | b3b6ac772cac763eb881231a06252488df7203fd /tools/perf/scripts/python/intel-pt-events.py | |
| parent | a660194dd101e937c319171ad99c3fbe466fd825 (diff) | |
arm64: mm: support large block mapping when rodata=full
When rodata=full is specified, kernel linear mapping has to be mapped at
PTE level since large page table can't be split due to break-before-make
rule on ARM64.
This resulted in a couple of problems:
  - performance degradation
  - more TLB pressure
  - memory waste for kernel page table
With FEAT_BBM level 2 support, splitting large block page table to
smaller ones doesn't need to make the page table entry invalid anymore.
This allows kernel split large block mapping on the fly.
Add kernel page table split support and use large block mapping by
default when FEAT_BBM level 2 is supported for rodata=full.  When
changing permissions for kernel linear mapping, the page table will be
split to smaller size.
The machine without FEAT_BBM level 2 will fallback to have kernel linear
mapping PTE-mapped when rodata=full.
With this we saw significant performance boost with some benchmarks and
much less memory consumption on my AmpereOne machine (192 cores, 1P)
with 256GB memory.
* Memory use after boot
Before:
MemTotal:       258988984 kB
MemFree:        254821700 kB
After:
MemTotal:       259505132 kB
MemFree:        255410264 kB
Around 500MB more memory are free to use.  The larger the machine, the
more memory saved.
* Memcached
We saw performance degradation when running Memcached benchmark with
rodata=full vs rodata=on.  Our profiling pointed to kernel TLB pressure.
With this patchset we saw ops/sec is increased by around 3.5%, P99
latency is reduced by around 9.6%.
The gain mainly came from reduced kernel TLB misses.  The kernel TLB
MPKI is reduced by 28.5%.
The benchmark data is now on par with rodata=on too.
* Disk encryption (dm-crypt) benchmark
Ran fio benchmark with the below command on a 128G ramdisk (ext4) with
disk encryption (by dm-crypt).
fio --directory=/data --random_generator=lfsr --norandommap            \
    --randrepeat 1 --status-interval=999 --rw=write --bs=4k --loops=1  \
    --ioengine=sync --iodepth=1 --numjobs=1 --fsync_on_close=1         \
    --group_reporting --thread --name=iops-test-job --eta-newline=1    \
    --size 100G
The IOPS is increased by 90% - 150% (the variance is high, but the worst
number of good case is around 90% more than the best number of bad
case). The bandwidth is increased and the avg clat is reduced
proportionally.
* Sequential file read
Read 100G file sequentially on XFS (xfs_io read with page cache
populated). The bandwidth is increased by 150%.
Co-developed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Will Deacon <will@kernel.org>
Diffstat (limited to 'tools/perf/scripts/python/intel-pt-events.py')
0 files changed, 0 insertions, 0 deletions
