path: root/mm/process_vm_access.c
Age | Commit message | Author
2021-05-05  mm/process_vm_access.c: remove duplicate include  (Zhang Yunkai)
'linux/compat.h' included in 'process_vm_access.c' is duplicated. Link: Signed-off-by: Zhang Yunkai <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-01-12  mm/process_vm_access.c: include compat.h  (Andrew Morton)
Fix the build error: mm/process_vm_access.c:277:5: error: implicit declaration of function 'in_compat_syscall'; did you mean 'in_ia32_syscall'? [-Werror=implicit-function-declaration] Fixes: 38dc5079da7081e "Fix compat regression in process_vm_rw()" Reported-by: Cc: Kyle Huey <> Cc: Jens Axboe <> Cc: Al Viro <> Cc: Christoph Hellwig <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2020-12-15  mm/process_vm_access: remove redundant initialization of iov_r  (Colin Ian King)
The pointer iov_r is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Link: Signed-off-by: Colin Ian King <> Reviewed-by: Andrew Morton <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2020-10-27  mm/process_vm_access: Add missing #include <linux/compat.h>  (Geert Uytterhoeven)
With e.g. m68k/defconfig: mm/process_vm_access.c: In function ‘process_vm_rw’: mm/process_vm_access.c:277:5: error: implicit declaration of function ‘in_compat_syscall’ [-Werror=implicit-function-declaration] 277 | in_compat_syscall()); | ^~~~~~~~~~~~~~~~~ Fix this by adding #include <linux/compat.h>. Reported-by: Reported-by: damian <> Reported-by: Naresh Kamboju <> Fixes: 38dc5079da7081e8 ("Fix compat regression in process_vm_rw()") Signed-off-by: Geert Uytterhoeven <> Acked-by: Jens Axboe <> Signed-off-by: Linus Torvalds <>
2020-10-27  Fix compat regression in process_vm_rw()  (Jens Axboe)
The removal of compat_process_vm_{readv,writev} didn't change process_vm_rw(), which always assumes it's not doing a compat syscall. Instead of passing in 'false' unconditionally for 'compat', make it conditional on in_compat_syscall(). [ Both Al and Christoph point out that trying to access a 64-bit process from a 32-bit one cannot work anyway, and is likely better prohibited, but that's a separate issue - Linus ] Fixes: c3973b401ef2 ("mm: remove compat_process_vm_{readv,writev}") Reported-and-tested-by: Kyle Huey <> Signed-off-by: Jens Axboe <> Acked-by: Al Viro <> Reviewed-by: Christoph Hellwig <> Signed-off-by: Linus Torvalds <>
2020-10-03  mm: remove compat_process_vm_{readv,writev}  (Christoph Hellwig)
Now that import_iovec handles compat iovecs, the native syscalls can be used for the compat case as well. Signed-off-by: Christoph Hellwig <> Signed-off-by: Al Viro <>
2020-10-03  iov_iter: transparently handle compat iovecs in import_iovec  (Christoph Hellwig)
Use in_compat_syscall() to import either native or compat iovecs, and remove the now superfluous compat_import_iovec. This removes the need for special compat logic in most callers, and the remaining ones can still be simplified by using __import_iovec with a bool compat parameter. Signed-off-by: Christoph Hellwig <> Signed-off-by: Al Viro <>
2020-10-03  iov_iter: refactor rw_copy_check_uvector and import_iovec  (Christoph Hellwig)
Split rw_copy_check_uvector into two new helpers with more sensible calling conventions:
- iovec_from_user copies an iovec from userspace, either into the provided stack buffer if it fits or into a newly allocated buffer, and returns the actually used iovec. It also verifies that iov_len fits a signed type, and handles compat iovecs if the compat flag is set.
- __import_iovec consolidates the native and compat versions of import_iovec. It calls iovec_from_user, then validates that each iovec actually points to user addresses, and ensures the total length doesn't overflow.
This has two major implications:
- the access_process_vm case loses the total length checking, which wasn't required anyway, given that each call receives two iovecs for the local and remote side of the operation, and it verifies the total length on the local side already.
- instead of a single loop there are now two loops over the iovecs. Given that the iovecs are cache hot, this doesn't make a major difference.
Signed-off-by: Christoph Hellwig <> Signed-off-by: Al Viro <>
2020-08-12  mm/gup: remove task_struct pointer for all gup code  (Peter Xu)
After the cleanup of page fault accounting, gup does not need to pass task_struct around any more. Remove that parameter in the whole gup stack. Signed-off-by: Peter Xu <> Signed-off-by: Andrew Morton <> Reviewed-by: John Hubbard <> Link: Signed-off-by: Linus Torvalds <>
2020-06-09  mmap locking API: use coccinelle to convert mmap_sem rwsem call sites  (Michel Lespinasse)
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .
@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <> Signed-off-by: Andrew Morton <> Reviewed-by: Daniel Jordan <> Reviewed-by: Laurent Dufour <> Reviewed-by: Vlastimil Babka <> Cc: Davidlohr Bueso <> Cc: David Rientjes <> Cc: Hugh Dickins <> Cc: Jason Gunthorpe <> Cc: Jerome Glisse <> Cc: John Hubbard <> Cc: Liam Howlett <> Cc: Matthew Wilcox <> Cc: Peter Zijlstra <> Cc: Ying Han <> Link: Signed-off-by: Linus Torvalds <>
2020-03-25  mm: docs: Fix a comment in process_vm_rw_core  (Bernd Edlinger)
This removes a duplicate "a" in the comment in process_vm_rw_core. Signed-off-by: Bernd Edlinger <> Reviewed-by: Kees Cook <> Signed-off-by: Eric W. Biederman <>
2020-01-31  mm, tree-wide: rename put_user_page*() to unpin_user_page*()  (John Hubbard)
In order to provide a clearer, more symmetric API for pinning and unpinning DMA pages. This way, pin_user_pages*() calls match up with unpin_user_pages*() calls, and the API is a lot closer to being self-explanatory. Link: Signed-off-by: John Hubbard <> Reviewed-by: Jan Kara <> Cc: Alex Williamson <> Cc: Aneesh Kumar K.V <> Cc: Björn Töpel <> Cc: Christoph Hellwig <> Cc: Daniel Vetter <> Cc: Dan Williams <> Cc: Hans Verkuil <> Cc: Ira Weiny <> Cc: Jason Gunthorpe <> Cc: Jason Gunthorpe <> Cc: Jens Axboe <> Cc: Jerome Glisse <> Cc: Jonathan Corbet <> Cc: Kirill A. Shutemov <> Cc: Leon Romanovsky <> Cc: Mauro Carvalho Chehab <> Cc: Mike Rapoport <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2020-01-31  mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()  (John Hubbard)
Convert process_vm_access to use the new pin_user_pages_remote() call, which sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages. Also, release the pages via put_user_page*(). Also, rename "pages" to "pinned_pages", as this makes for easier reading of process_vm_rw_single_vec(). Link: Signed-off-by: John Hubbard <> Reviewed-by: Jan Kara <> Reviewed-by: Jérôme Glisse <> Reviewed-by: Ira Weiny <> Cc: Alex Williamson <> Cc: Aneesh Kumar K.V <> Cc: Björn Töpel <> Cc: Christoph Hellwig <> Cc: Daniel Vetter <> Cc: Dan Williams <> Cc: Hans Verkuil <> Cc: Jason Gunthorpe <> Cc: Jason Gunthorpe <> Cc: Jens Axboe <> Cc: Jonathan Corbet <> Cc: Kirill A. Shutemov <> Cc: Leon Romanovsky <> Cc: Mauro Carvalho Chehab <> Cc: Mike Rapoport <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-05-30  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152  (Thomas Gleixner)
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <> Reviewed-by: Allison Randal <> Cc: Link: Signed-off-by: Greg Kroah-Hartman <>
2018-02-06  mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors  (Mike Rapoport)
Link: Signed-off-by: Mike Rapoport <> Cc: Jonathan Corbet <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-02-06  mm: docs: fix parameter names mismatch  (Mike Rapoport)
There are several places where parameter descriptions do not match the actual code. Fix it. Link: Signed-off-by: Mike Rapoport <> Cc: Jonathan Corbet <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-02-06  pids: introduce find_get_task_by_vpid() helper  (Mike Rapoport)
There are several functions that do find_task_by_vpid() followed by get_task_struct(). We can use a helper function instead. Link: Signed-off-by: Mike Rapoport <> Acked-by: Oleg Nesterov <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2017-03-02  sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h>  (Ingo Molnar)
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/mm.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. The APIs that are going to be moved first are: mm_alloc() __mmdrop() mmdrop() mmdrop_async_fn() mmdrop_async() mmget_not_zero() mmput() mmput_async() get_task_mm() mm_access() mm_release() Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <> Cc: Mike Galbraith <> Cc: Peter Zijlstra <> Cc: Thomas Gleixner <> Cc: Signed-off-by: Ingo Molnar <>
2016-12-14  mm: unexport __get_user_pages_unlocked()  (Lorenzo Stoakes)
Unexport the low-level __get_user_pages_unlocked() function and replace invocations with calls to more appropriate higher-level functions. In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked() with get_user_pages_unlocked() since we can now pass gup_flags. In async_pf_execute() and process_vm_rw_single_vec() we need to pass different tsk, mm arguments so get_user_pages_remote() is the sane replacement in these cases (having added manual acquisition and release of mmap_sem.) Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH flag. However, this flag was originally silently dropped by commit 1e9877902dc7 ("mm/gup: Introduce get_user_pages_remote()"), so this appears to have been unintentional and reintroducing it is therefore not an issue. [ coding-style fixes] Link: Signed-off-by: Lorenzo Stoakes <> Acked-by: Michal Hocko <> Cc: Jan Kara <> Cc: Hugh Dickins <> Cc: Dave Hansen <> Cc: Rik van Riel <> Cc: Mel Gorman <> Cc: Paolo Bonzini <> Cc: Radim Krcmar <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2016-10-18  mm: remove write/force parameters from __get_user_pages_unlocked()  (Lorenzo Stoakes)
This removes the redundant 'write' and 'force' parameters from __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in callers as use of this flag can result in surprising behaviour (and hence bugs) within the mm subsystem. Signed-off-by: Lorenzo Stoakes <> Acked-by: Paolo Bonzini <> Reviewed-by: Jan Kara <> Acked-by: Michal Hocko <> Signed-off-by: Linus Torvalds <>
2016-02-16  mm/gup: Introduce get_user_pages_remote()  (Dave Hansen)
For protection keys, we need to understand whether protections should be enforced in software or not. In general, we enforce protections when working on our own task, but not when on others. We call these "current" and "remote" operations. This patch introduces a new get_user_pages() variant: get_user_pages_remote() Which is a replacement for when get_user_pages() is called on non-current tsk/mm. We also introduce a new gup flag: FOLL_REMOTE which can be used for the "__" gup variants to get this new behavior. The uprobes is_trap_at_addr() location holds mmap_sem and calls get_user_pages(current->mm) on an instruction address. This makes it a pretty unique gup caller. Being an instruction access and also really originating from the kernel (vs. the app), I opted to consider this a 'remote' access where protection keys will not be enforced. Without protection keys, this patch should not change any behavior. Signed-off-by: Dave Hansen <> Reviewed-by: Thomas Gleixner <> Cc: Andrea Arcangeli <> Cc: Andrew Morton <> Cc: Andy Lutomirski <> Cc: Borislav Petkov <> Cc: Brian Gerst <> Cc: Dave Hansen <> Cc: Denys Vlasenko <> Cc: H. Peter Anvin <> Cc: Kirill A. Shutemov <> Cc: Linus Torvalds <> Cc: Naoya Horiguchi <> Cc: Peter Zijlstra <> Cc: Rik van Riel <> Cc: Srikar Dronamraju <> Cc: Vlastimil Babka <> Cc: Cc: Link: Signed-off-by: Ingo Molnar <>
2016-01-20  ptrace: use fsuid, fsgid, effective creds for fs access checks  (Jann Horn)
By checking the effective credentials instead of the real UID / permitted capabilities, ensure that the calling process actually intended to use its credentials. To ensure that all ptrace checks use the correct caller credentials (e.g. in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS flag), use two new flags and require one of them to be set. The problem was that when a privileged task had temporarily dropped its privileges, e.g. by calling setreuid(0, user_uid), with the intent to perform following syscalls with the credentials of a user, it still passed ptrace access checks that the user would not be able to pass. While an attacker should not be able to convince the privileged task to perform a ptrace() syscall, this is a problem because the ptrace access check is reused for things in procfs. In particular, the following somewhat interesting procfs entries only rely on ptrace access checks:
/proc/$pid/stat - uses the check for determining whether pointers should be visible, useful for bypassing ASLR
/proc/$pid/maps - also useful for bypassing ASLR
/proc/$pid/cwd - useful for gaining access to restricted directories that contain files with lax permissions, e.g. in this scenario:
lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
drwx------ root root /root
drwxr-xr-x root root /root/foobar
-rw-r--r-- root root /root/foobar/secret
Therefore, on a system where a root-owned mode 6755 binary changes its effective credentials as described and then dumps a user-specified file, this could be used by an attacker to reveal the memory layout of root's processes or reveal the contents of files he is not allowed to access (through /proc/$pid/cwd). [ fix warning] Signed-off-by: Jann Horn <> Acked-by: Kees Cook <> Cc: Casey Schaufler <> Cc: Oleg Nesterov <> Cc: Ingo Molnar <> Cc: James Morris <> Cc: "Serge E. Hallyn" <> Cc: Andy Shevchenko <> Cc: Andy Lutomirski <> Cc: Al Viro <> Cc: "Eric W. Biederman" <> Cc: Willy Tarreau <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2015-04-11  process_vm_access: switch to {compat_,}import_iovec()  (Al Viro)
Signed-off-by: Al Viro <>
2015-02-11  mm: gup: use get_user_pages_unlocked  (Andrea Arcangeli)
This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to the page fault in order to release the mmap_sem during the I/O. Signed-off-by: Andrea Arcangeli <> Reviewed-by: Kirill A. Shutemov <> Cc: Andres Lagar-Cavilla <> Cc: Peter Feiner <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-05-06  start adding the tag to iov_iter  (Al Viro)
For now, just use the same thing we pass to ->direct_IO() - it's all iovec-based at the moment. Pass it explicitly to iov_iter_init() and account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO() uses. Signed-off-by: Al Viro <>
2014-05-06  kill iov_iter_copy_from_user()  (Al Viro)
all callers can use copy_page_from_iter() and it actually simplifies them. Signed-off-by: Al Viro <>
2014-04-12  Merge branch 'for-linus' of git://  (Linus Torvalds)
Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencies between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..."
* 'for-linus' of git:// (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...
2014-04-03  mm/process_vm_access.c: mark function as static  (Rashika Kheria)
Mark function as static in process_vm_access.c because it is not used outside this file. This eliminates the following warning in mm/process_vm_access.c: mm/process_vm_access.c:416:1: warning: no previous prototype for `compat_process_vm_rw' [-Wmissing-prototypes] [ remove unneeded asmlinkage - compat_process_vm_rw isn't referenced from asm] Signed-off-by: Rashika Kheria <> Reviewed-by: Josh Triplett <> Acked-by: David Rientjes <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2014-04-01  process_vm_access: tidy up a bit  (Al Viro)
saner variable names, update linuxdoc comments, etc. Signed-off-by: Al Viro <>
2014-04-01  process_vm_access: don't bother with returning the amounts of bytes copied  (Al Viro)
we can calculate that in the caller just fine, TYVM Signed-off-by: Al Viro <>
2014-04-01  process_vm_rw_pages(): pass accurate amount of bytes  (Al Viro)
... makes passing the amount of pages unnecessary Signed-off-by: Al Viro <>
2014-04-01  process_vm_access: take get_user_pages/put_pages one level up  (Al Viro)
... and trim the fuck out of process_vm_rw_pages() argument list. Signed-off-by: Al Viro <>
2014-04-01  process_vm_access: switch to copy_page_to_iter/iov_iter_copy_from_user  (Al Viro)
... rather than open-coding those. As a side benefit, we get much saner loop calling those; we can just feed entire pages, instead of the "copy would span the iovec boundary, let's do it in two loop iterations" mess. Signed-off-by: Al Viro <>
2014-04-01  process_vm_access: switch to iov_iter  (Al Viro)
instead of keeping its pieces in separate variables and passing pointers to all of them... Signed-off-by: Al Viro <>
2014-04-01  untangling process_vm_..., part 4  (Al Viro)
instead of passing vector size (by value) and index (by reference), pass the number of elements remaining. That's all we care about in these functions by that point. Signed-off-by: Al Viro <>
2014-04-01  untangling process_vm_..., part 3  (Al Viro)
lift iov one more level out - from process_vm_rw_single_vec to process_vm_rw_core(). Same story as with the previous commit. Signed-off-by: Al Viro <>
2014-04-01  untangling process_vm_..., part 2  (Al Viro)
move iov to caller's stack frame; the value we assign to it on the next call of process_vm_rw_pages() is equal to the value it had when the last time we were leaving process_vm_rw_pages(). drop lvec argument of process_vm_rw_pages() - it's not used anymore. Signed-off-by: Al Viro <>
2014-04-01  untangling process_vm_..., part 1  (Al Viro)
we want to massage it to use of iov_iter. This one is an equivalent transformation - just introduce a local variable mirroring lvec + *lvec_current. Signed-off-by: Al Viro <>
2014-03-06  mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types  (Heiko Carstens)
In order to allow the COMPAT_SYSCALL_DEFINE macro generate code that performs proper zero and sign extension convert all 64 bit parameters to their corresponding 32 bit compat counterparts. Signed-off-by: Heiko Carstens <>
2013-03-12  Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys  (Mathieu Desnoyers)
Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to compat_process_vm_rw() shows that the compatibility code requires an explicit "access_ok()" check before calling compat_rw_copy_check_uvector(). The same difference seems to appear when we compare fs/read_write.c:do_readv_writev() to fs/compat.c:compat_do_readv_writev(). This subtle difference between the compat and non-compat requirements should probably be debated, as it seems to be error-prone. In fact, there are two other sites that use this function in the Linux kernel, and they both seem to get it wrong: Now shifting our attention to fs/aio.c, we see that aio_setup_iocb() also ends up calling compat_rw_copy_check_uvector() through aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to be missing. Same situation for security/keys/compat.c:compat_keyctl_instantiate_key_iov(). I propose that we add the access_ok() check directly into compat_rw_copy_check_uvector(), so callers don't have to worry about it, and it therefore makes the compat call code similar to its non-compat counterpart. Place the access_ok() check in the same location where copy_from_user() can trigger a -EFAULT error in the non-compat code, so the ABI behaviors are alike on both compat and non-compat. While we are here, fix compat_do_readv_writev() so it checks for compat_rw_copy_check_uvector() negative return values. And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error handling. Acked-by: Linus Torvalds <> Acked-by: Al Viro <> Signed-off-by: Mathieu Desnoyers <> Signed-off-by: Linus Torvalds <>
2012-05-31  aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector()  (Christopher Yeoh)
A cleanup of rw_copy_check_uvector and compat_rw_copy_check_uvector after changes made to support CMA in an earlier patch. Rather than having an additional check_access parameter to these functions, the first parameter type is overloaded to allow the caller to specify CHECK_IOVEC_ONLY, which means check that the contents of the iovec are valid but do not check the memory that they point to. This is used by process_vm_readv/writev where we need to validate that an iovec passed to the syscall is valid but do not want to check the memory that it points to at this point because it refers to an address space in another process. Signed-off-by: Chris Yeoh <> Reviewed-by: Oleg Nesterov <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2012-02-02  Fix race in process_vm_rw_core  (Christopher Yeoh)
This fixes the race in process_vm_core found by Oleg (see for details). This has been updated since I last sent it as the creation of the new mm_access() function did almost exactly the same thing as parts of the previous version of this patch did. In order to use mm_access() even when /proc isn't enabled, we move it to kernel/fork.c where other related process mm access functions already are. Signed-off-by: Chris Yeoh <> Signed-off-by: Linus Torvalds <>
2011-10-31  Cross Memory Attach  (Christopher Yeoh)
The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory. The following patch attempts to achieve this by allowing a destination process, given an address and size from a source process, to copy memory directly from the source process into its own address space via a system call. There is also a symmetrical ability to copy from the current process's address space into a destination process's address space.
- Use of /proc/pid/mem has been considered, but there are issues with using it:
  - Does not allow for specifying iovecs for both src and dest; assuming preadv or pwritev was implemented, either the area read from or written to would need to be contiguous.
  - Currently mem_read allows only processes who are currently ptrace'ing the target and are still able to ptrace the target to read from the target. This check could possibly be moved to the open call, but it's not clear exactly what race this restriction is stopping (reason appears to have been lost).
  - Having to send the fd of /proc/self/mem via SCM_RIGHTS on a unix domain socket is a bit ugly from a userspace point of view, especially when you may have hundreds if not (eventually) thousands of processes that all need to do this with each other.
  - Doesn't allow for some future use of the interface we would like to consider adding in the future (see below).
  - Interestingly, reading from /proc/pid/mem currently actually involves two copies! (But this could be fixed pretty easily.)
As mentioned previously, use of vmsplice instead was considered, but has problems. Since you need the reader and writer working co-operatively, if the pipe is not drained then you block, which requires some wrapping to do non-blocking on the send side or polling on the receive. In all-to-all communication it requires ordering, otherwise you can deadlock.
And in the example of many MPI tasks writing to one MPI task vmsplice serialises the copying. There are some cases of MPI collectives where even a single copy interface does not get us the performance gain we could. For example in an MPI_Reduce rather than copy the data from the source we would like to instead use it directly in a mathops (say the reduce is doing a sum) as this would save us doing a copy. We don't need to keep a copy of the data from the source. I haven't implemented this, but I think this interface could in the future do all this through the use of the flags - eg could specify the math operation and type and the kernel rather than just copying the data would apply the specified operation between the source and destination and store it in the destination. Although we don't have a "second user" of the interface (though I've had some nibbles from people who may be interested in using it for intra process messaging which is not MPI). This interface is something which hardware vendors are already doing for their custom drivers to implement fast local communication. And so in addition to this being useful for OpenMPI it would mean the driver maintainers don't have to fix things up when the mm changes. There was some discussion about how much faster a true zero copy would go. Here's a link back to the email with some testing I did on that: There is a basic man page for the proposed interface here: This has been implemented for x86 and powerpc; other architectures should mainly (I think) just need to add syscall numbers for the process_vm_readv and process_vm_writev. There are 32 bit compatibility versions for 64-bit kernels. For arch maintainers there are some simple tests to be able to quickly verify that the syscalls are working correctly here: Signed-off-by: Chris Yeoh <> Cc: Ingo Molnar <> Cc: "H. Peter Anvin" <> Cc: Thomas Gleixner <> Cc: Arnd Bergmann <> Cc: Paul Mackerras <> Cc: Benjamin Herrenschmidt <> Cc: David Howells <> Cc: James Morris <> Cc: <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>