summaryrefslogtreecommitdiff
path: root/drivers/md/dm-bufio.c
AgeCommit message (Collapse)Author
2025-05-26Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezeing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...
2025-05-07dm-bufio: use bio_add_virt_nofailChristoph Hellwig
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-23dm-bufio: don't schedule in atomic contextLongPing Wei
A BUG was reported as below when CONFIG_DEBUG_ATOMIC_SLEEP and try_verify_in_tasklet are enabled. [ 129.444685][ T934] BUG: sleeping function called from invalid context at drivers/md/dm-bufio.c:2421 [ 129.444723][ T934] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 934, name: kworker/1:4 [ 129.444740][ T934] preempt_count: 201, expected: 0 [ 129.444756][ T934] RCU nest depth: 0, expected: 0 [ 129.444781][ T934] Preemption disabled at: [ 129.444789][ T934] [<ffffffd816231900>] shrink_work+0x21c/0x248 [ 129.445167][ T934] kernel BUG at kernel/sched/walt/walt_debug.c:16! [ 129.445183][ T934] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [ 129.445204][ T934] Skip md ftrace buffer dump for: 0x1609e0 [ 129.447348][ T934] CPU: 1 PID: 934 Comm: kworker/1:4 Tainted: G W OE 6.6.56-android15-8-o-g6f82312b30b9-debug #1 1400000003000000474e5500b3187743670464e8 [ 129.447362][ T934] Hardware name: Qualcomm Technologies, Inc. Parrot QRD, Alpha-M (DT) [ 129.447373][ T934] Workqueue: dm_bufio_cache shrink_work [ 129.447394][ T934] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 129.447406][ T934] pc : android_rvh_schedule_bug+0x0/0x8 [sched_walt_debug] [ 129.447435][ T934] lr : __traceiter_android_rvh_schedule_bug+0x44/0x6c [ 129.447451][ T934] sp : ffffffc0843dbc90 [ 129.447459][ T934] x29: ffffffc0843dbc90 x28: ffffffffffffffff x27: 0000000000000c8b [ 129.447479][ T934] x26: 0000000000000040 x25: ffffff804b3d6260 x24: ffffffd816232b68 [ 129.447497][ T934] x23: ffffff805171c5b4 x22: 0000000000000000 x21: ffffffd816231900 [ 129.447517][ T934] x20: ffffff80306ba898 x19: 0000000000000000 x18: ffffffc084159030 [ 129.447535][ T934] x17: 00000000d2b5dd1f x16: 00000000d2b5dd1f x15: ffffffd816720358 [ 129.447554][ T934] x14: 0000000000000004 x13: ffffff89ef978000 x12: 0000000000000003 [ 129.447572][ T934] x11: ffffffd817a823c4 x10: 0000000000000202 x9 : 7e779c5735de9400 [ 129.447591][ T934] x8 : ffffffd81560d004 x7 : 205b5d3938373434 x6 : ffffffd8167397c8 [ 129.447610][ T934] x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffffffc0843db9e0 [ 129.447629][ T934] x2 : 0000000000002f15 x1 : 0000000000000000 x0 : 0000000000000000 [ 129.447647][ T934] Call trace: [ 129.447655][ T934] android_rvh_schedule_bug+0x0/0x8 [sched_walt_debug 1400000003000000474e550080cce8a8a78606b6] [ 129.447681][ T934] __might_resched+0x190/0x1a8 [ 129.447694][ T934] shrink_work+0x180/0x248 [ 129.447706][ T934] process_one_work+0x260/0x624 [ 129.447718][ T934] worker_thread+0x28c/0x454 [ 129.447729][ T934] kthread+0x118/0x158 [ 129.447742][ T934] ret_from_fork+0x10/0x20 [ 129.447761][ T934] Code: ???????? ???????? ???????? d2b5dd1f (d4210000) [ 129.447772][ T934] ---[ end trace 0000000000000000 ]--- dm_bufio_lock will call spin_lock_bh when try_verify_in_tasklet is enabled, and __scan will be called in atomic context. Fixes: 7cd326747f46 ("dm bufio: remove dm_bufio_cond_resched()") Signed-off-by: LongPing Wei <weilongping@oppo.com> Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-24dm-bufio: remove unused return valueMikulas Patocka
The return value of the function "forget_buffer" is not tested, so we can remove it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-11-20dm-bufio: use kmalloc to allocate power-of-two sized buffersMikulas Patocka
Vlastimil Babka said [1] that kmalloc will return a power-of-two-aligned buffer if it was called with a power-of-two size. So, we can use kmalloc instead of our own slab cache in dm-bufio. Note that the code for the slab cache was not removed because dm-bufio supports non-power-of-two buffer sizes. Link: https://lore.kernel.org/linux-mm/e7fca292-7c79-4f97-a90c-d68178d8ca59@suse.cz/ [1] Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-11-11dm-bufio: fix warnings about duplicate slab cachesMikulas Patocka
The commit 4c39529663b9 adds a warning about duplicate cache names if CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-bufio code. The dm-bufio code allocates a slab cache with each client. It is not possible to preallocate the caches in the module init function because the size of auxiliary per-buffer data is not known at this point. So, this commit changes dm-bufio so that it appends a unique atomic value to the cache name, to avoid the warnings. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
2024-08-22dm bufio: Remove NULL check of list_entry()Yuesong Li
list_entry() will never return a NULL pointer, thus remove the check. Signed-off-by: Yuesong Li <liyuesong@vivo.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2024-02-20dm: update relevant MODULE_AUTHOR entries to latest dm-devel mailing listMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm bufio: Support IO priorityHongyu Jin
Some IO will dispatch from kworker with different io_context settings than the submitting task, we may need to specify a priority to avoid losing priority. Add dm_bufio_read_with_ioprio() and dm_bufio_prefetch_with_ioprio() for use by bufio users to pass an ioprio other than IOPRIO_DEFAULT. Co-developed-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> [snitzer: introduced _with_ioprio() wrappers to reduce churn] Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm io: Support IO priorityHongyu Jin
Some IO will dispatch from kworker with different io_context settings than the submitting task, we may need to specify a priority to avoid losing priority. Add IO priority parameter to dm_io() and update all callers. Co-developed-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Yibin Ding <yibin.ding@unisoc.com> Signed-off-by: Hongyu Jin <hongyu.jin@unisoc.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-01-08mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDERKirill A. Shutemov
commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-11-17dm-bufio: fix no-sleep modeMikulas Patocka
dm-bufio has a no-sleep mode. When activated (with the DM_BUFIO_CLIENT_NO_SLEEP flag), the bufio client is read-only and we could call dm_bufio_get from tasklets. This is used by dm-verity. Unfortunately, commit 450e8dee51aa ("dm bufio: improve concurrent IO performance") broke this and the kernel would warn that cache_get() was calling down_read() from no-sleeping context. The bug can be reproduced by using "veritysetup open" with the "--use-tasklets" flag. This commit fixes dm-bufio, so that the tasklet mode works again, by expanding use of the 'no_sleep_enabled' static_key to conditionally use either a rw_semaphore or rwlock_t (which are colocated in the buffer_tree structure using a union). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # v6.4 Fixes: 450e8dee51aa ("dm bufio: improve concurrent IO performance") Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-10-04dm: dynamically allocate the dm-bufio shrinkerQi Zheng
In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the dm-bufio shrinker, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct dm_bufio_client. Link: https://lkml.kernel.org/r/20230911094444.68966-24-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-27dm: get rid of GFP_NOIO workarounds for __vmalloc and kvmallocMikulas Patocka
In the past, the function __vmalloc didn't respect the GFP flags - it allocated memory with the provided gfp flags, but it allocated page tables with GFP_KERNEL. This was fixed in commit 451769ebb7e7 ("mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc") so the memalloc_noio_{save,restore} workaround is no longer needed. The function kvmalloc didn't like flags different from GFP_KERNEL. This was fixed in commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc"), so kvmalloc can now be called with GFP_NOIO. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm thin metadata: Fix ABBA deadlock by resetting dm_bufio_clientLi Lingfeng
As described in commit 8111964f1b85 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata"), ABBA deadlocks will be triggered because shrinker_rwsem currently needs to held by dm_pool_abort_metadata() as a side-effect of thin-pool metadata operation failure. The following three problem scenarios have been noticed: 1) Described by commit 8111964f1b85 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata") 2) shrinker_rwsem and throttle->lock P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab super_cache_scan prune_icache_sb dispose_list evict ext4_evict_inode ext4_clear_inode ext4_discard_preallocations ext4_mb_load_buddy_gfp ext4_mb_init_cache ext4_wait_block_bitmap __ext4_error ext4_handle_error ext4_commit_super ... dm_submit_bio do_worker throttle_work_update down_write(&t->lock) -- LOCK B process_deferred_bios commit metadata_operation_failed dm_pool_abort_metadata dm_block_manager_create dm_bufio_client_create register_shrinker down_write(&shrinker_rwsem) -- LOCK A thin_map thin_bio_map thin_defer_bio_with_throttle throttle_lock down_read(&t->lock) - LOCK B 3) shrinker_rwsem and wait_on_buffer P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab ... ext4_wait_block_bitmap __ext4_error ext4_handle_error jbd2_journal_abort jbd2_journal_update_sb_errno jbd2_write_superblock submit_bh // LOCK B // RELEASE B do_worker throttle_work_update down_write(&t->lock) - LOCK B process_deferred_bios process_bio commit metadata_operation_failed dm_pool_abort_metadata dm_block_manager_create dm_bufio_client_create register_shrinker register_shrinker_prepared down_write(&shrinker_rwsem) - LOCK A bio_endio wait_on_buffer __wait_on_buffer Fix these by resetting dm_bufio_client without holding shrinker_rwsem. Fixes: 8111964f1b85 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata") Cc: stable@vger.kernel.org Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-04-27Merge tag 'mm-stable-2023-04-27-15-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of switching from a user process to a kernel thread. - More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav. - zsmalloc performance improvements from Sergey Senozhatsky. - Yue Zhao has found and fixed some data race issues around the alteration of memcg userspace tunables. - VFS rationalizations from Christoph Hellwig: - removal of most of the callers of write_one_page() - make __filemap_get_folio()'s return value more useful - Luis Chamberlain has changed tmpfs so it no longer requires swap backing. Use `mount -o noswap'. - Qi Zheng has made the slab shrinkers operate locklessly, providing some scalability benefits. - Keith Busch has improved dmapool's performance, making part of its operations O(1) rather than O(n). - Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd, permitting userspace to wr-protect anon memory unpopulated ptes. - Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather than exclusive, and has fixed a bunch of errors which were caused by its unintuitive meaning. - Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature, which causes minor faults to install a write-protected pte. - Vlastimil Babka has done some maintenance work on vma_merge(): cleanups to the kernel code and improvements to our userspace test harness. - Cleanups to do_fault_around() by Lorenzo Stoakes. - Mike Rapoport has moved a lot of initialization code out of various mm/ files and into mm/mm_init.c. - Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for DRM, but DRM doesn't use it any more. - Lorenzo has also coverted read_kcore() and vread() to use iterators and has thereby removed the use of bounce buffers in some cases. - Lorenzo has also contributed further cleanups of vma_merge(). - Chaitanya Prakash provides some fixes to the mmap selftesting code. - Matthew Wilcox changes xfs and afs so they no longer take sleeping locks in ->map_page(), a step towards RCUification of pagefaults. - Suren Baghdasaryan has improved mmap_lock scalability by switching to per-VMA locking. - Frederic Weisbecker has reworked the percpu cache draining so that it no longer causes latency glitches on cpu isolated workloads. - Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig logic. - Liu Shixin has changed zswap's initialization so we no longer waste a chunk of memory if zswap is not being used. - Yosry Ahmed has improved the performance of memcg statistics flushing. - David Stevens has fixed several issues involving khugepaged, userfaultfd and shmem. - Christoph Hellwig has provided some cleanup work to zram's IO-related code paths. - David Hildenbrand has fixed up some issues in the selftest code's testing of our pte state changing. - Pankaj Raghav has made page_endio() unneeded and has removed it. - Peter Xu contributed some rationalizations of the userfaultfd selftests. - Yosry Ahmed has fixed an issue around memcg's page recalim accounting. - Chaitanya Prakash has fixed some arm-related issues in the selftests/mm code. - Longlong Xia has improved the way in which KSM handles hwpoisoned pages. - Peter Xu fixes a few issues with uffd-wp at fork() time. - Stefan Roesch has changed KSM so that it may now be used on a per-process and per-cgroup basis. * tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm,unmap: avoid flushing TLB in batch if PTE is inaccessible shmem: restrict noswap option to initial user namespace mm/khugepaged: fix conflicting mods to collapse_file() sparse: remove unnecessary 0 values from rc mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area() hugetlb: pte_alloc_huge() to replace huge pte_alloc_map() maple_tree: fix allocation in mas_sparse_area() mm: do not increment pgfault stats when page fault handler retries zsmalloc: allow only one active pool compaction context selftests/mm: add new selftests for KSM mm: add new KSM process and sysfs knobs mm: add new api to enable ksm per process mm: shrinkers: fix debugfs file permissions mm: don't check VMA write permissions if the PTE/PMD indicates write permissions migrate_pages_batch: fix statistics for longterm pin retry userfaultfd: use helper function range_in_vma() lib/show_mem.c: use for_each_populated_zone() simplify code mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list() fs/buffer: convert create_page_buffers to folio_create_buffers fs/buffer: add folio_create_empty_buffers helper ...
2023-04-05mm, treewide: redefine MAX_ORDER sanelyKirill A. Shutemov
MAX_ORDER currently defined as number of orders page allocator supports: user can ask buddy allocator for page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and lead to number of bugs all over the kernel. Change the definition of MAX_ORDER to be inclusive: the range of orders user can ask from buddy allocator is 0..MAX_ORDER now. [kirill@shutemov.name: fix min() warning] Link: https://lkml.kernel.org/r/20230315153800.32wib3n5rickolvh@box [akpm@linux-foundation.org: fix another min_t warning] [kirill@shutemov.name: fixups per Zi Yan] Link: https://lkml.kernel.org/r/20230316232144.b7ic4cif4kjiabws@box.shutemov.name [akpm@linux-foundation.org: fix underlining in docs] Link: https://lore.kernel.org/oe-kbuild-all/202303191025.VRCTk6mP-lkp@intel.com/ Link: https://lkml.kernel.org/r/20230315113133.11326-11-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-30dm: improve hash_locks sizing and hash functionJoe Thornber
Both bufio and bio-prison-v1 use the identical model for splitting their respective locks and rbtrees. Improve dm_num_hash_locks() to distribute across more rbtrees to improve overall performance -- but the maximum number of locks/rbtrees is still 64. Also factor out a common hash function named dm_hash_locks_index(), the magic numbers used were determined to be best using this program: https://gist.github.com/jthornber/e05c47daa7b500c56dc339269c5467fc Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: intelligently size dm_buffer_cache's buffer_treesMike Snitzer
Size the dm_buffer_cache's number of buffer_tree structs using dm_num_hash_locks(). Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: prepare to intelligently size dm_buffer_cache's buffer_treesMike Snitzer
Add num_locks member to dm_buffer_cache struct and use it rather than the NR_LOCKS magic value (64). Next commit will size the dm_buffer_cache's buffer_trees according to dm_num_hash_locks(). Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: use multi-page bio vectorMikulas Patocka
The kernel supports multi page bio vector entries, so we can use them in dm-bufio as an optimization. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: use waitqueue_active in __free_buffer_wakeMikulas Patocka
Save one spinlock by using waitqueue_active. We hold the bufio lock at this place, so no one can add entries to the waitqueue at this point. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: move dm_bufio_client members to avoid spanning cachelinesMike Snitzer
Movement also consolidates holes in dm_bufio_client struct. But the overall size of the struct isn't changed. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: add lock_history optimization for cache iteratorsJoe Thornber
Sometimes it is beneficial to repeatedly get and drop locks as part of an iteration. Introduce lock_history struct to help avoid redundant drop and gets of the same lock. Optimizes cache_iterate, cache_mark_many and cache_evict. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: improve concurrent IO performanceJoe Thornber
When multiple threads perform IO to a thin device, the underlying dm_bufio object can become a bottleneck; slowing down access to btree nodes that store the thin metadata. Prior to this commit, each bufio instance had a single mutex that was taken for every bufio operation. This commit concentrates on improving the common case where: a user of dm_bufio wishes to access, but not modify, a buffer which is already within the dm_bufio cache. Implementation:: The code has been refactored; pulling out an 'lru' abstraction and a 'buffer cache' abstraction (see 2 previous commits). This commit updates higher level bufio code (that performs allocation of buffers, IO and eviction/cache sizing) to leverage both abstractions. It also deals with the delicate locking requirements of both abstractions to provide finer grained locking. The result is significantly better concurrent IO performance. Before this commit, bufio has a global lru list it used to evict the oldest, clean buffers from _all_ clients. With the new locking we don’t want different ways to access the same buffer, so instead do_global_cleanup() loops around the clients asking them to free buffers older than a certain time. This commit also converts many old BUG_ONs to WARN_ON_ONCE, see the lru_evict and cache_evict code in particular. They will return ER_DONT_EVICT if a given buffer somehow meets the invariants that should _never_ happen. [Aside from revising this commit's header and fixing coding style and whitespace nits: this switching to WARN_ON_ONCE is Mike Snitzer's lone contribution to this commit] Testing:: Some of the low level functions have been unit tested using dm-unit: https://github.com/jthornber/dm-unit/blob/main/src/tests/bufio.rs Higher level concurrency and IO is tested via a test only target found here: https://github.com/jthornber/linux/blob/2023-03-24-thin-concurrency-9/drivers/md/dm-bufio-test.c The associated userland side of these tests is here: https://github.com/jthornber/dmtest-python/blob/main/src/dmtest/bufio/bufio_tests.py In addition the full dmtest suite of tests (dm-thin, dm-cache, etc) has been run (~450 tests). Performance:: Most bufio operations have unchanged performance. But if multiple threads are attempting to get buffers concurrently, and these buffers are already in the cache then there's a big speed up. Eg, one test has 16 'hotspot' threads simulating btree lookups while another thread dirties the whole device. In this case the hotspot threads acquire the buffers about 25 times faster. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: add dm_buffer_cache abstractionJoe Thornber
The buffer cache is responsible for managing the holder count, tracking clean/dirty state, and choosing buffers via predicates. Higher level code is responsible for allocation of buffers, IO and eviction/cache sizing. The buffer cache has thread safe methods for acquiring a reference to an existing buffer. All other methods in buffer cache are _not_ threadsafe, and only contain enough locking to guarantee the safe methods. Rather than a single mutex, sharded rw_semaphores are used to allow concurrent threads to 'get' buffers. Each rw_semaphore protects its own rbtree of buffer entries. Code that uses this new dm_buffer_cache abstraction will be introduced in a following commit. This commit moves the dm_buffer struct in preparation for finer grained dm_buffer changes, in the next commit, to be more easily seen. It also introduces temporary dm_buffer struct members to allow compilation of this intermediate commit (they will be elided in the next commit). This commit will cause "defined but not used" compiler warnings that will be resolved by the next commit. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: add LRU abstractionJoe Thornber
A CLOCK algorithm is used in this LRU abstraction. This avoids relinking list nodes, which would require a write lock protecting it. None of the LRU methods are threadsafe; locking must be done at a higher level. Code that uses this new LRU will be introduced in the next 2 commits. As such, this commit will cause "defined but not used" compiler warnings that will be resolved by the next 2 commits. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: don't bug for clear developer oversightMike Snitzer
Reasonable to relax to WARN_ON because these are easily avoided but do offer some assurance future coding mistakes won't occur (if changes tested properly). Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: never crash if dm_bufio_in_request()Mike Snitzer
All these instances are entirely avoidable given that they speak to coding mistakes that result in inappropriate use. Proper testing during development will catch any such coding bug so its best to relax all of these from BUG_ON to WARN_ON_ONCE. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: use WARN_ON in dm_bufio_client_destroy and dm_bufio_exitMike Snitzer
Using BUG_ON when tearing down is excessive. Relax these to WARN_ONs. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-03-30dm bufio: remove unused dm_bufio_release_move interfaceJoe Thornber
Was used by multi-snapshot DM target that never went upstream. Signed-off-by: Joe Thornber <ejt@redhat.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: fix use of sizeof() macroHeinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: avoid using symbolic permissionsHeinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: add missing empty linesHeinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: add argument identifier namesHeinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: correct block comments format.Heinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: avoid initializing static variablesHeinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: change "unsigned" to "unsigned int"Heinz Mauelshagen
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-02-14dm: add missing SPDX-License-IndentifiersHeinz Mauelshagen
'GPL-2.0-only' is used instead of 'GPL-2.0' because SPDX has deprecated its use. Suggested-by: John Wiele <jwiele@redhat.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-01-30dm: improve shrinker debug namesMike Snitzer
Commit e33c267ab70d ("mm: shrinkers: provide shrinkers with names") chose some fairly bad names for DM's shrinkers. Fixes: e33c267ab70d ("mm: shrinkers: provide shrinkers with names") Signed-off-by : Mike Snitzer <snitzer@kernel.org>
2022-11-18dm bufio: Fix missing decrement of no_sleep_enabled if ↵Zhihao Cheng
dm_bufio_client_create failed The 'no_sleep_enabled' should be decreased in error handling path in dm_bufio_client_create() when the DM_BUFIO_CLIENT_NO_SLEEP flag is set, otherwise static_branch_unlikely() will always return true even if no dm_bufio_client instances have DM_BUFIO_CLIENT_NO_SLEEP flag set. Cc: stable@vger.kernel.org Fixes: 3c1c875d0586 ("dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-10-18dm bufio: use the acquire memory barrier when testing for B_READINGMikulas Patocka
The function test_bit doesn't provide any memory barrier. It may be possible that the read requests that follow test_bit(B_READING, &b->state) are reordered before the test, reading invalid data that existed before B_READING was cleared. Fix this bug by changing test_bit to test_bit_acquire. This is particularly important on arches with weak(er) memory ordering (e.g. arm64). Depends-On: 8238b4579866 ("wait_on_bit: add an acquire memory barrier") Depends-On: d6ffe6067a54 ("provide arch_test_bit_acquire for architectures that define test_bit") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-11Merge tag 'for-6.0/dm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - A few fixes for the DM verity and bufio changes in this merge window - A smatch warning fix for DM writecache locking in writecache_map * tag 'for-6.0/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm bufio: fix some cases where the code sleeps with spinlock held dm writecache: fix smatch warning about invalid return from writecache_map dm verity: fix verity_parse_opt_args parsing dm verity: fix DM_VERITY_OPTS_MAX value yet again dm bufio: simplify DM_BUFIO_CLIENT_NO_SLEEP locking
2022-08-11dm bufio: fix some cases where the code sleeps with spinlock heldMikulas Patocka
Commit b32d45824aa7 ("dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag") added a "NO_SLEEP" mode, it replaces a mutex with a spinlock, and it is only usable when the device is in read-only mode (because the write path may be sleeping while holding the dm_bufio_client lock). However, there are still two points where the code could sleep even in read-only mode. One is in __get_unclaimed_buffer -> __make_buffer_clean. The other is in __try_evict_buffer -> __make_buffer_clean. These functions will call __make_buffer_clean which sleeps if the buffer is being read. Fix these cases so that if c->no_sleep is set __make_buffer_clean will not be called and the buffer will be skipped instead. Fixes: b32d45824aa7 ("dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-08-09dm bufio: simplify DM_BUFIO_CLIENT_NO_SLEEP lockingMike Snitzer
Historically none of the bufio code runs in interrupt context but with the use of DM_BUFIO_CLIENT_NO_SLEEP a bufio client can, see: commit 5721d4e5a9cd ("dm verity: Add optional "try_verify_in_tasklet" feature") That said, the new tasklet usecase still doesn't require interrupts be disabled by bufio (let alone conditionally restore them). Yet with PREEMPT_RT, and falling back from tasklet to workqueue, care must be taken to properly synchronize between softirq and process context, otherwise ABBA deadlock may occur. While it is unnecessary to disable bottom-half preemption within a tasklet, we must consistently do so in process context to ensure locking is in the proper order. Fix these issues by switching from spin_lock_irq{save,restore} to using spin_{lock,unlock}_bh instead. Also remove the 'spinlock_flags' member in dm_bufio_client struct (that can be used unsafely if bufio must recurse on behalf of some caller, e.g. block layer's submit_bio). Fixes: 5721d4e5a9cd ("dm verity: Add optional "try_verify_in_tasklet" feature") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2022-08-06Merge tag 'for-6.0/dm-changes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull more device mapper updates from Mike Snitzer: - Add flags argument to dm_bufio_client_create and introduce DM_BUFIO_CLIENT_NO_SLEEP flag to have dm-bufio use spinlock rather than mutex for its locking. - Add optional "try_verify_in_tasklet" feature to DM verity target. This feature gives users the option to improve IO latency by using a tasklet to verify, using hashes in bufio's cache, rather than wait to schedule a work item via workqueue. But if there is a bufio cache miss, or an error, then the tasklet will fallback to using workqueue. - Incremental changes to both dm-bufio and the DM verity target to use jump_label to minimize cost of branching associated with the niche "try_verify_in_tasklet" feature. DM-bufio in particular is used by quite a few other DM targets so it doesn't make sense to incur additional bufio cost in those targets purely for the benefit of this niche verity feature if the feature isn't ever used. - Optimize verity_verify_io, which is used by both workqueue and tasklet based verification, if FEC is not configured or tasklet based verification isn't used. - Remove DM verity target's verify_wq's use of the WQ_CPU_INTENSIVE flag since it uses WQ_UNBOUND. Also, use the WQ_HIGHPRI flag if "try_verify_in_tasklet" is specified. * tag 'for-6.0/dm-changes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm verity: have verify_wq use WQ_HIGHPRI if "try_verify_in_tasklet" dm verity: remove WQ_CPU_INTENSIVE flag since using WQ_UNBOUND dm verity: only copy bvec_iter in verity_verify_io if in_tasklet dm verity: optimize verity_verify_io if FEC not configured dm verity: conditionally enable branching for "try_verify_in_tasklet" dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEP dm verity: allow optional args to alter primary args handling dm verity: Add optional "try_verify_in_tasklet" feature dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag dm bufio: Add flags argument to dm_bufio_client_create
2022-08-05Merge tag 'mm-stable-2022-08-03' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Most of the MM queue. A few things are still pending. Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time. Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1. Summary: - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport - Some kmemleak fixes from Patrick Wang and Waiman Long - DAMON updates from SeongJae Park - memcg debug/visibility work from Roman Gushchin - vmalloc speedup from Uladzislau Rezki - more folio conversion work from Matthew Wilcox - enhancements for coherent device memory mapping from Alex Sierra - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan - hugetlb optimizations from Mike Kravetz - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour. - mprotect soft-dirty checking has been improved by Peter Xu - Many other singleton patches all over the place" [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ] * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits) tools/testing/selftests/vm/hmm-tests.c: fix build mm: Kconfig: fix typo mm: memory-failure: convert to pr_fmt() mm: use is_zone_movable_page() helper hugetlbfs: fix inaccurate comment in hugetlbfs_statfs() hugetlbfs: cleanup some comments in inode.c hugetlbfs: remove unneeded header file hugetlbfs: remove unneeded hugetlbfs_ops forward declaration hugetlbfs: use helper macro SZ_1{K,M} mm: cleanup is_highmem() mm/hmm: add a test for cross device private faults selftests: add soft-dirty into run_vmtests.sh selftests: soft-dirty: add test for mprotect mm/mprotect: fix soft-dirty check in can_change_pte_writable() mm: memcontrol: fix potential oom_lock recursion deadlock mm/gup.c: fix formatting in check_and_migrate_movable_page() xfs: fail dax mount if reflink is enabled on a partition mm/memcontrol.c: remove the redundant updating of stats_flush_threshold userfaultfd: don't fail on unrecognized features hugetlb_cgroup: fix wrong hugetlb cgroup numa stat ...
2022-08-04dm bufio: conditionally enable branching for DM_BUFIO_CLIENT_NO_SLEEPMike Snitzer
Use jump_label to limit the need for branching unless the optional DM_BUFIO_CLIENT_NO_SLEEP is used. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flagNathan Huckleberry
Add an optional flag that ensures dm_bufio_client does not sleep (primary focus is to service dm_bufio_get without sleeping). This allows the dm-bufio cache to be queried from interrupt context. To ensure that dm-bufio does not sleep, dm-bufio must use a spinlock instead of a mutex. Additionally, to avoid deadlocks, special care must be taken so that dm-bufio does not sleep while holding the spinlock. But again: the scope of this no_sleep is initially confined to dm_bufio_get, so __alloc_buffer_wait_no_callback is _not_ changed to avoid sleeping because __bufio_new avoids allocation for NF_GET. Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28dm bufio: Add flags argument to dm_bufio_client_createNathan Huckleberry
Add a flags argument to dm_bufio_client_create and update all the callers. This is in preparation to add the DM_BUFIO_NO_SLEEP flag. Signed-off-by: Nathan Huckleberry <nhuck@google.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>