author     SeongJae Park <sj@kernel.org>               2025-02-05 22:15:14 -0800
committer  Andrew Morton <akpm@linux-foundation.org>   2025-03-16 22:06:04 -0700
commit     4cc39f91ef6c6f876651eb231974a59ffbcb3a21
tree       23d9b1ed7ab498711cca37bba87a607416109416
parent     023fff71d893d9aa7814078f24d730afccbab9b9
mm/madvise: split out mmap locking operations for madvise()
Patch series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()".
process_madvise() calls do_madvise() for each address range.  Each
do_madvise() invocation then holds and releases the same mmap_lock.
Optimize away these redundant lock operations by splitting out
do_madvise()'s internal logic, including the mmap_lock operations, and
calling the smaller pieces directly from process_madvise() in a
sequence that avoids the repeated locking.
As a result of this change, process_madvise() becomes more efficient and
less racy in terms of its results and latency.
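For illustration, the intended call pattern looks roughly like the
sketch below.  madvise_do_behavior() is a hypothetical stand-in for the
split-out per-range logic of this series, not code added by this
particular patch:

    /*
     * Rough sketch of the batched call pattern this series aims for.
     * madvise_do_behavior() is an illustrative stand-in for the
     * split-out per-range advice logic.
     */
    static ssize_t batched_madvise(struct mm_struct *mm,
                                   struct iov_iter *iter, int behavior)
    {
            ssize_t ret;

            /* Take the mmap_lock once for the whole batch... */
            ret = madvise_lock(mm, behavior);
            if (ret)
                    return ret;

            /* ...apply the advice to each range without re-locking... */
            while (iov_iter_count(iter)) {
                    ret = madvise_do_behavior(mm,
                                    (unsigned long)iter_iov_addr(iter),
                                    iter_iov_len(iter), behavior);
                    if (ret < 0)
                            break;
                    iov_iter_advance(iter, iter_iov_len(iter));
            }

            /* ...and drop it once at the end. */
            madvise_unlock(mm, behavior);
            return ret;
    }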
Note that the potential downside of this series is that other mmap_lock
holders may wait longer, since the mmap_lock critical section of a
process_madvise() call becomes longer.  But the batch size has a
maximum limit in the kernel (IOV_MAX), and userspace can further
control the critical section length by setting the request size.
Hence, the downside is limited and controllable.
Evaluation
==========
I measured the time to apply MADV_DONTNEED advice to 256 MiB of memory
using multiple madvise() calls, 4 KiB per call.  I also did the same
with process_madvise(), with the batch size (vlen) varying from 1 to
1024.  The source code for the measurement is available on GitHub[1].
Because the microbenchmark result is not very stable, I ran each
configuration five times and used the average.
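Conceptually, the per-configuration measurement works like the sketch
below.  This is a simplified illustration, not the exact benchmark
source (see [1] for that); pidfd setup (e.g. via pidfd_open()) and
error handling are elided:

    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <time.h>
    #include <unistd.h>

    #define BUF_SZ  (256UL << 20)   /* 256 MiB */
    #define PG_SZ   4096UL          /* 4 KiB per range */

    static uint64_t now_ns(void)
    {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    /* Time MADV_DONTNEED over BUF_SZ, sz_batches ranges per call. */
    static uint64_t measure(int pidfd, char *buf, size_t sz_batches)
    {
            struct iovec iov[1024];  /* largest tested batch size */
            uint64_t start = now_ns();
            unsigned long off = 0;

            while (off < BUF_SZ) {
                    size_t vlen;

                    for (vlen = 0; vlen < sz_batches && off < BUF_SZ;
                         vlen++, off += PG_SZ) {
                            iov[vlen].iov_base = buf + off;
                            iov[vlen].iov_len = PG_SZ;
                    }
                    syscall(SYS_process_madvise, pidfd, iov, vlen,
                            MADV_DONTNEED, 0);
            }
            return now_ns() - start;
    }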
The measurement results are below.  The 'sz_batches' column shows the
batch size of the process_madvise() calls; a batch size of '0' denotes
the madvise() case.  The 'before' and 'after' columns give the measured
time, in nanoseconds, to apply MADV_DONTNEED to the 256 MiB memory
buffer on kernels built without and with the last patch of this series,
respectively, so lower values mean better efficiency.  The
'after/before' column is the ratio of 'after' to 'before'.
sz_batches    before         after          after/before
0             146294215.2    121280536.2    0.829017989769427
1             165851018.8    136305598.2    0.821855658085351
2             129469321.2    103740383.6    0.801273866569094
4             110369232.4    87835896.2     0.795836795182785
8             102906232.4    77420920.2     0.752344327397609
16            97551017.4     74959714.4     0.768415506038587
32            94809848.2     71200848.4     0.750985786305689
64            96087575.6     72593180       0.755489765942227
128           96154163.8     68517055.4     0.712575022154163
256           92901257.6     69054216.6     0.743307662177439
512           93646170.8     67053296.2     0.716028168874151
1024          92663219.2     70168196.8     0.75723892830177
Despite the unstable nature of the test program, the trend is as
expected.  The measurement shows that this patchset reduces
process_madvise() latency, increasingly so with larger batch sizes.
The latency gain was about 20% with batch size 2, and grew to about 28%
with batch size 512, since more mmap_lock acquisitions are elided as
the batch size grows.
Note that the standard deviation of the measurements for each
sz_batches configuration ranged from 1.9% to 7.2%; that is, this result
is not very stable.  The averages of the standard deviations across
batch sizes were 4.62% and 4.70% for the 'before' and 'after' kernel
measurements, respectively.
Also note that this patch has somewhat decreased the latencies of
madvise() and of single-batch process_madvise().  This code path seems
small enough to be significantly affected by compiler optimizations,
including inlining of the split-out functions.  Please focus only on
how the improvement changes with the batch size.
[1] https://github.com/sjp38/eval_proc_madvise
This patch (of 4):
Split out the madvise behavior-dependent mmap_lock operations from
do_madvise(), for easier reuse of the logic in an upcoming change.
[lorenzo.stoakes@oracle.com: fix madvise_[un]lock() issue]
Link: https://lkml.kernel.org/r/2f448f7b-1da7-4099-aa9e-0179d47fde40@lucifer.local
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20250206061517.2958-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250206061517.2958-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Liam R. Howlett <howlett@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
 mm/madvise.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 49 insertions(+), 13 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 08b207f8e61e..fa5dae5a7723 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1574,6 +1574,50 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
                                  madvise_vma_anon_name);
 }
 #endif /* CONFIG_ANON_VMA_NAME */
+
+#ifdef CONFIG_MEMORY_FAILURE
+static bool is_memory_failure(int behavior)
+{
+        switch (behavior) {
+        case MADV_HWPOISON:
+        case MADV_SOFT_OFFLINE:
+                return true;
+        default:
+                return false;
+        }
+}
+#else
+static bool is_memory_failure(int behavior)
+{
+        return false;
+}
+#endif
+
+static int madvise_lock(struct mm_struct *mm, int behavior)
+{
+        if (is_memory_failure(behavior))
+                return 0;
+
+        if (madvise_need_mmap_write(behavior)) {
+                if (mmap_write_lock_killable(mm))
+                        return -EINTR;
+        } else {
+                mmap_read_lock(mm);
+        }
+        return 0;
+}
+
+static void madvise_unlock(struct mm_struct *mm, int behavior)
+{
+        if (is_memory_failure(behavior))
+                return;
+
+        if (madvise_need_mmap_write(behavior))
+                mmap_write_unlock(mm);
+        else
+                mmap_read_unlock(mm);
+}
+
 /*
  * The madvise(2) system call.
  *
@@ -1650,7 +1694,6 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 {
         unsigned long end;
         int error;
-        int write;
         size_t len;
         struct blk_plug plug;
 
@@ -1672,19 +1715,15 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
         if (end == start)
                 return 0;
 
+        error = madvise_lock(mm, behavior);
+        if (error)
+                return error;
+
 #ifdef CONFIG_MEMORY_FAILURE
         if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
                 return madvise_inject_error(behavior, start, start + len_in);
 #endif
 
-        write = madvise_need_mmap_write(behavior);
-        if (write) {
-                if (mmap_write_lock_killable(mm))
-                        return -EINTR;
-        } else {
-                mmap_read_lock(mm);
-        }
-
         start = untagged_addr_remote(mm, start);
         end = start + len;
 
@@ -1701,10 +1740,7 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
         }
         blk_finish_plug(&plug);
 
-        if (write)
-                mmap_write_unlock(mm);
-        else
-                mmap_read_unlock(mm);
+        madvise_unlock(mm, behavior);
 
         return error;
 }