path: root/mm/percpu-km.c
AgeCommit message (Collapse)Author
2020-08-12mm: memcg/percpu: account percpu memory to memory cgroupsRoman Gushchin
Percpu memory is becoming more and more widely used by various subsystems, and the total amount of memory controlled by the percpu allocator can make a good part of the total memory. As an example, bpf maps can consume a lot of percpu memory, and they are created by a user. Also, some cgroup internals (e.g. memory controller statistics) can be quite large. On a machine with many CPUs and big number of cgroups they can consume hundreds of megabytes. So the lack of memcg accounting is creating a breach in the memory isolation. Similar to the slab memory, percpu memory should be accounted by default. To implement the perpcu accounting it's possible to take the slab memory accounting as a model to follow. Let's introduce two types of percpu chunks: root and memcg. What makes memcg chunks different is an additional space allocated to store memcg membership information. If __GFP_ACCOUNT is passed on allocation, a memcg chunk should be be used. If it's possible to charge the corresponding size to the target memory cgroup, allocation is performed, and the memcg ownership data is recorded. System-wide allocations are performed using root chunks, so there is no additional memory overhead. To implement a fast reparenting of percpu memory on memcg removal, we don't store mem_cgroup pointers directly: instead we use obj_cgroup API, introduced for slab accounting. [ fix CONFIG_MEMCG_KMEM=n build errors and warning] [ move unreachable code, per Roman] [ mm/percpu: fix 'defined but not used' warning] Link: Signed-off-by: Roman Gushchin <> Signed-off-by: Bixuan Cui <> Signed-off-by: Andrew Morton <> Reviewed-by: Shakeel Butt <> Acked-by: Dennis Zhou <> Cc: Christoph Lameter <> Cc: David Rientjes <> Cc: Johannes Weiner <> Cc: Joonsoo Kim <> Cc: Mel Gorman <> Cc: Michal Hocko <> Cc: Pekka Enberg <> Cc: Tejun Heo <> Cc: Tobin C. Harding <> Cc: Vlastimil Babka <> Cc: Waiman Long <> Cc: Bixuan Cui <> Cc: Michal Koutný <> Cc: Stephen Rothwell <> Link: Signed-off-by: Linus Torvalds <>
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428Thomas Gleixner
Based on 1 normalized pattern(s): this file is released under the gplv2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 68 file(s). Signed-off-by: Thomas Gleixner <> Reviewed-by: Armijn Hemel <> Reviewed-by: Allison Randal <> Cc: Link: Signed-off-by: Greg Kroah-Hartman <>
2019-03-13percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZEDennis Zhou
Previously, block size was flexible based on the constraint that the GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the overhead that keeping a floating number of populated free pages required scanning over the free regions of a chunk. Setting the block size to be fixed at PAGE_SIZE lets us know when an empty page becomes used as we will break a full contig_hint of a block. This means we no longer have to scan the whole chunk upon breaking a contig_hint which empty page management piggybacked off. A later patch takes advantage of this to optimize the allocation path by only scanning forward using the scan_hint introduced later too. Signed-off-by: Dennis Zhou <> Reviewed-by: Peng Fan <>
2019-02-26percpu: km: no need to consider pcpu_group_offsets[0]Peng Fan
percpu-km is used on UP systems which only has one group, so the group offset will be always 0, there is no need to subtract pcpu_group_offsets[0] when assigning chunk->base_addr Signed-off-by: Peng Fan <> Acked-by: Christoph Lameter <> Signed-off-by: Dennis Zhou <>
2018-12-18percpu: convert spin_lock_irq to spin_lock_irqsave.Dennis Zhou
From Michael Cree: "Bisection lead to commit b38d08f3181c ("percpu: restructure locking") as being the cause of lockups at initial boot on the kernel built for generic Alpha. On a suggestion by Tejun Heo that: So, the only thing I can think of is that it's calling spin_unlock_irq() while irq handling isn't set up yet. Can you please try the followings? 1. Convert all spin_[un]lock_irq() to spin_lock_irqsave/unlock_irqrestore()." Fixes: b38d08f3181c ("percpu: restructure locking") Reported-and-tested-by: Michael Cree <> Acked-by: Tejun Heo <> Signed-off-by: Dennis Zhou <>
2018-02-18percpu: allow select gfp to be passed to underlying allocatorsDennis Zhou
The prior patch added support for passing gfp flags through to the underlying allocators. This patch allows users to pass along gfp flags (currently only __GFP_NORETRY and __GFP_NOWARN) to the underlying allocators. This should allow users to decide if they are ok with failing allocations recovering in a more graceful way. Additionally, gfp passing was done as additional flags in the previous patch. Instead, change this to caller passed semantics. GFP_KERNEL is also removed as the default flag. It continues to be used for internally caused underlying percpu allocations. V2: Removed gfp_percpu_mask in favor of doing it inline. Removed GFP_KERNEL as a default flag for __alloc_percpu_gfp. Signed-off-by: Dennis Zhou <> Suggested-by: Daniel Borkmann <> Acked-by: Christoph Lameter <> Signed-off-by: Tejun Heo <>
2018-02-18percpu: add __GFP_NORETRY semantics to the percpu balancing pathDennis Zhou
Percpu memory using the vmalloc area based chunk allocator lazily populates chunks by first requesting the full virtual address space required for the chunk and subsequently adding pages as allocations come through. To ensure atomic allocations can succeed, a workqueue item is used to maintain a minimum number of empty pages. In certain scenarios, such as reported in [1], it is possible that physical memory becomes quite scarce which can result in either a rather long time spent trying to find free pages or worse, a kernel panic. This patch adds support for __GFP_NORETRY and __GFP_NOWARN passing them through to the underlying allocators. This should prevent any unnecessary panics potentially caused by the workqueue item. The passing of gfp around is as additional flags rather than a full set of flags. The next patch will change these to caller passed semantics. V2: Added const modifier to gfp flags in the balance path. Removed an extra whitespace. [1] Signed-off-by: Dennis Zhou <> Suggested-by: Daniel Borkmann <> Reported-by: Acked-by: Christoph Lameter <> Signed-off-by: Tejun Heo <>
2017-07-26percpu: replace area map allocator with bitmapDennis Zhou (Facebook)
The percpu memory allocator is experiencing scalability issues when allocating and freeing large numbers of counters as in BPF. Additionally, there is a corner case where iteration is triggered over all chunks if the contig_hint is the right size, but wrong alignment. This patch replaces the area map allocator with a basic bitmap allocator implementation. Each subsequent patch will introduce new features and replace full scanning functions with faster non-scanning options when possible. Implementation: This patchset removes the area map allocator in favor of a bitmap allocator backed by metadata blocks. The primary goal is to provide consistency in performance and memory footprint with a focus on small allocations (< 64 bytes). The bitmap removes the heavy memmove from the freeing critical path and provides a consistent memory footprint. The metadata blocks provide a bound on the amount of scanning required by maintaining a set of hints. In an effort to make freeing fast, the metadata is updated on the free path if the new free area makes a page free, a block free, or spans across blocks. This causes the chunk's contig hint to potentially be smaller than what it could allocate by up to the smaller of a page or a block. If the chunk's contig hint is contained within a block, a check occurs and the hint is kept accurate. Metadata is always kept accurate on allocation, so there will not be a situation where a chunk has a later contig hint than available. Evaluation: I have primarily done testing against a simple workload of allocation of 1 million objects (2^20) of varying size. Deallocation was done by in order, alternating, and in reverse. These numbers were collected after rebasing ontop of a80099a152. I present the worst-case numbers here: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 310 | 4770 16B | 557 | 1325 64B | 436 | 273 256B | 776 | 131 1024B | 3280 | 122 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 490 | 70 16B | 515 | 75 64B | 610 | 80 256B | 950 | 100 1024B | 3520 | 200 This data demonstrates the inability for the area map allocator to handle less than ideal situations. In the best case of reverse deallocation, the area map allocator was able to perform within range of the bitmap allocator. In the worst case situation, freeing took nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator dramatically improves the consistency of the free path. The small allocations performed nearly identical regardless of the freeing pattern. While it does add to the allocation latency, the allocation scenario here is optimal for the area map allocator. The area map allocator runs into trouble when it is allocating in chunks where the latter half is full. It is difficult to replicate this, so I present a variant where the pages are second half filled. Freeing was done sequentially. Below are the numbers for this scenario: Area Map Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 4118 | 4892 16B | 1651 | 1163 64B | 598 | 285 256B | 771 | 158 1024B | 3034 | 160 Bitmap Allocator: Object Size | Alloc Time (ms) | Free Time (ms) ---------------------------------------------- 4B | 481 | 67 16B | 506 | 69 64B | 636 | 75 256B | 892 | 90 1024B | 3262 | 147 The data shows a parabolic curve of performance for the area map allocator. This is due to the memmove operation being the dominant cost with the lower object sizes as more objects are packed in a chunk and at higher object sizes, the traversal of the chunk slots is the dominating cost. The bitmap allocator suffers this problem as well. The above data shows the inability to scale for the allocation path with the area map allocator and that the bitmap allocator demonstrates consistent performance in general. The second problem of additional scanning can result in the area map allocator completing in 52 minutes when trying to allocate 1 million 4-byte objects with 8-byte alignment. The same workload takes approximately 16 seconds to complete for the bitmap allocator. V2: Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap using bytes instead of bits. Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight. Signed-off-by: Dennis Zhou <> Reviewed-by: Josef Bacik <> Signed-off-by: Tejun Heo <>
2017-06-29percpu: fix static checker warnings in pcpu_destroy_chunkDennis Zhou
From 5021b97f4026334d2c8dfad80797dd1028cddd73 Mon Sep 17 00:00:00 2001 From: Dennis Zhou <> Date: Thu, 29 Jun 2017 07:11:41 -0700 Add NULL check in pcpu_destroy_chunk to correct static checker warnings. Signed-off-by: Dennis Zhou <> Reported-by: Dan Carpenter <> Signed-off-by: Tejun Heo <>
2017-06-20percpu: add tracepoint support for percpu memoryDennis Zhou
Add support for tracepoints to the following events: chunk allocation, chunk free, area allocation, area free, and area allocation failure. This should let us replay percpu memory requests and evaluate corresponding decisions. Signed-off-by: Dennis Zhou <> Signed-off-by: Tejun Heo <>
2017-06-20percpu: expose statistics about percpu memory via debugfsDennis Zhou
There is limited visibility into the use of percpu memory leaving us unable to reason about correctness of parameters and overall use of percpu memory. These counters and statistics aim to help understand basic statistics about percpu memory such as number of allocations over the lifetime, allocation sizes, and fragmentation. New Config: PERCPU_STATS Signed-off-by: Dennis Zhou <> Signed-off-by: Tejun Heo <>
2016-03-17mm: percpu: use pr_fmt to prefix outputJoe Perches
Use the normal mechanism to make the logging output consistently "percpu:" instead of a mix of "PERCPU:" and "percpu:" Signed-off-by: Joe Perches <> Acked-by: Tejun Heo <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-03-17mm: convert printk(KERN_<LEVEL> to pr_<level>Joe Perches
Most of the mm subsystem uses pr_<level> so make it consistent. Miscellanea: - Realign arguments - Add missing newline to format - kmemleak-test.c has a "kmemleak: " prefix added to the "Kmemleak testing" logging message via pr_fmt Signed-off-by: Joe Perches <> Acked-by: Tejun Heo <> [percpu] Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-09-02percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populatedTejun Heo
pcpu_nr_empty_pop_pages counts the number of empty populated pages across all chunks and chunk->nr_populated counts the number of populated pages in a chunk. Both will be used to implement pre/async population for atomic allocations. pcpu_chunk_[de]populated() are added to update chunk->populated, chunk->nr_populated and pcpu_nr_empty_pop_pages together. All successful chunk [de]populations should be followed by the corresponding pcpu_chunk_[de]populated() calls. Signed-off-by: Tejun Heo <>
2014-09-02percpu: restructure lockingTejun Heo
At first, the percpu allocator required a sleepable context for both alloc and free paths and used pcpu_alloc_mutex to protect everything. Later, pcpu_lock was introduced to protect the index data structure so that the free path can be invoked from atomic contexts. The conversion only updated what's necessary and left most of the allocation path under pcpu_alloc_mutex. The percpu allocator is planned to add support for atomic allocation and this patch restructures locking so that the coverage of pcpu_alloc_mutex is further reduced. * pcpu_alloc() now grab pcpu_alloc_mutex only while creating a new chunk and populating the allocated area. Everything else is now protected soley by pcpu_lock. After this change, multiple instances of pcpu_extend_area_map() may race but the function already implements sufficient synchronization using pcpu_lock. This also allows multiple allocators to arrive at new chunk creation. To avoid creating multiple empty chunks back-to-back, a new chunk is created iff there is no other empty chunk after grabbing pcpu_alloc_mutex. * pcpu_lock is now held while modifying chunk->populated bitmap. After this, all data structures are protected by pcpu_lock. Signed-off-by: Tejun Heo <>
2014-09-02percpu: make percpu-km set chunk->populated bitmap properlyTejun Heo
percpu-km instantiates the whole chunk on creation and doesn't make use of chunk->populated bitmap and leaves it as zero. While this currently doesn't cause any problem, the inconsistency makes it difficult to build further logic on top of chunk->populated. This patch makes percpu-km fill chunk->populated on creation so that the bitmap is always consistent. Signed-off-by: Tejun Heo <> Acked-by: Christoph Lameter <>
2014-09-02percpu: move region iterations out of pcpu_[de]populate_chunk()Tejun Heo
Previously, pcpu_[de]populate_chunk() were called with the range which may contain multiple target regions in it and pcpu_[de]populate_chunk() iterated over the regions. This has the benefit of batching up cache flushes for all the regions; however, we're planning to add more bookkeeping logic around [de]population to support atomic allocations and this delegation of iterations gets in the way. This patch moves the region iterations out of pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and pcpu_reclaim() - so that we can later add logic to track more states around them. This change may make cache and tlb flushes more frequent but multi-region [de]populations are rare anyway and if this actually becomes a problem, it's not difficult to factor out cache flushes as separate callbacks which are directly invoked from percpu.c. Signed-off-by: Tejun Heo <>
2014-09-02percpu: move common parts out of pcpu_[de]populate_chunk()Tejun Heo
percpu-vm and percpu-km implement separate versions of pcpu_[de]populate_chunk() and some part which is or should be common are currently in the specific implementations. Make the following changes. * Allocate area clearing is moved from the pcpu_populate_chunk() implementations to pcpu_alloc(). This makes percpu-km's version noop. * Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved to their respective callers so that they are applied to percpu-km too. This doesn't make any meaningful difference as both functions are noop for percpu-km; however, this is more consistent and will help implementing atomic allocation support. Signed-off-by: Tejun Heo <>
2010-09-10percpu: clear memory allocated with the km allocatorTejun Heo
Percpu allocator should clear memory before returning it but the km allocator forgot to do it. Fix it. Signed-off-by: Tejun Heo <> Reported-by: Peter Zijlstra <> Acked-by: Peter Zijlstra <>
2010-09-08percpu: use percpu allocator on UP tooTejun Heo
On UP, percpu allocations were redirected to kmalloc. This has the following problems. * For certain amount of allocations (determined by PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), percpu allocator can be used before the usual kernel memory allocator is brought online. On SMP, this is used to initialize the kernel memory allocator. * percpu allocator honors alignment upto PAGE_SIZE but kmalloc() doesn't. For example, workqueue makes use of larger alignments for cpu_workqueues. Currently, users of percpu allocators need to handle UP differently, which is somewhat fragile and ugly. Other than small amount of memory, there isn't much to lose by enabling percpu allocator on UP. It can simply use kernel memory based chunk allocation which was added for SMP archs w/o MMUs. This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and makes UP build use percpu-km. As percpu addresses and kernel addresses are always identity mapped and static percpu variables don't need any special treatment, nothing is arch dependent and mm/percpu.c implements generic setup_per_cpu_areas() for UP. Signed-off-by: Tejun Heo <> Reviewed-by: Christoph Lameter <> Acked-by: Pekka Enberg <>
2010-05-01percpu: implement kernel memory based chunk allocationTejun Heo
Implement an alternate percpu chunk management based on kernel memeory for nommu SMP architectures. Instead of mapping into vmalloc area, chunks are allocated as a contiguous kernel memory using alloc_pages(). As such, percpu allocator on nommu will have the following restrictions. * It can't fill chunks on-demand page-by-page. It has to allocate each chunk fully upfront. * It can't support sparse chunk for NUMA configurations. SMP w/o mmu is crazy enough. Let's hope no one does NUMA w/o mmu. :-P * If chunk size isn't power-of-two multiple of PAGE_SIZE, the unaligned amount will be wasted on each chunk. So, archs which use this better align chunk size. For instructions on how to use this, read the comment on top of mm/percpu-km.c. Signed-off-by: Tejun Heo <> Reviewed-by: David Howells <> Cc: Graff Yang <> Cc: Sonic Zhang <>