summaryrefslogtreecommitdiff
path: root/fs/nfs
AgeCommit message (Collapse)Author
2024-07-08NFSv4: Add support for the FATTR4_OPEN_ARGUMENTS attributeTrond Myklebust
Query the server for the OPEN arguments that it supports so that we can figure out which extensions we can use. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Don't request atime/mtime/size if they are delegated to usTrond Myklebust
If the timestamps and size are delegated to the client, then it is authoritative w.r.t. their values, so we should not be requesting those values from the server. Note that this allows us to optimise away most GETATTR calls if the only changes to the attributes are the result of read() or write(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Fix up delegated attributes in nfs_setattrTrond Myklebust
nfs_setattr calls nfs_update_inode() directly, so we have to reset the m/ctime there. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Delegreturn must set m/atime when they are delegatedTrond Myklebust
If the atime or mtime attributes were delegated, then we need to propagate their new values back to the server when returning the delegation. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Enable attribute delegationsTrond Myklebust
If we see that the server supports attribute delegations, then request them by setting the appropriate OPEN arguments. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Add a capability for delegated attributesTrond Myklebust
Cache whether or not the server may have support for delegated attributes in a capability flag. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Add recovery of attribute delegationsTrond Myklebust
After a reboot of the NFSv4.2 server, the recovery code needs to specify whether the delegation to be recovered is an attribute delegation or not. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Add support for delegated atime and mtime attributesTrond Myklebust
Ensure that we update the mtime and atime correctly when we read or write data to the file and when we truncate. Let the server manage ctime on other attribute updates. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Add a flags argument to the 'have_delegation' callbackTrond Myklebust
This argument will be used to allow the caller to specify whether or not they need to know that this is an attribute delegation. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Add CB_GETATTR support for delegated attributesTrond Myklebust
When the client holds an attribute delegation, the server may retrieve all the timestamps through a CB_GETATTR callback. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Plumb in XDR support for the new delegation-only setattr opTrond Myklebust
We want to send the updated atime and mtime as part of the delegreturn compound. Add a special structure to hold those variables. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Add new attribute delegation definitionsTrond Myklebust
Add the attribute delegation XDR definitions from the spec. Signed-off-by: Tom Haynes <loghyr@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Refactor nfs4_opendata_check_deleg()Trond Myklebust
Modify it to no longer depend directly on the struct opendata. This will enable sharing with WANT_DELEGATION. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFSv4: Clean up open delegation return structureTrond Myklebust
Instead of having the fields open coded in the struct nfs_openres, add a separate structure for them so that we can reuse that code for the WANT_DELEGATION case. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08fs: nfs: add missing MODULE_DESCRIPTION() macrosJeff Johnson
Fix the 'make W=1' warnings: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nfs_common/nfs_acl.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nfs_common/grace.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nfs/nfs.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nfs/nfsv2.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nfs/nfsv3.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nfs/nfsv4.o Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08NFS: remove unused struct 'mnt_fhstatus'Dr. David Alan Gilbert
'mnt_fhstatus' has been unused since commit 065015e5efff ("NFS: Remove unused XDR decoder functions"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08nfs: add support for large foliosChristoph Hellwig
NFS already is void of folio size assumption, so just pass the chunk size to __filemap_get_folio and set the large folio address_space flag for all regular files. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-03nfs: drop usage of folio_file_posKairui Song
folio_file_pos is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio_pos instead. After commit e1209d3a7a67 ("mm: introduce ->swap_rw and use it for reads from SWP_FS_OPS swap-space"), swap cache should never be exposed to nfs. So remove the usage of folio_file_pos in following NFS functions / helpers: - nfs_vm_page_mkwrite It's only used by nfs_file_vm_ops.page_mkwrite - trace event helper: nfs_folio_event - trace event helper: nfs_folio_event_done These two are used through DEFINE_NFS_FOLIO_EVENT and DEFINE_NFS_FOLIO_EVENT_DONE, which defined following events: - trace_nfs_aop_readpage{_done}: only called by nfs_read_folio - trace_nfs_writeback_folio: only called by nfs_wb_folio - trace_nfs_invalidate_folio: only called by nfs_invalidate_folio - trace_nfs_launder_folio_done: only called by nfs_launder_folio None of them could possibly be used on swap cache folio, nfs_read_folio only called by: .write_begin -> nfs_read_folio .read_folio nfs_wb_folio only called by nfs mapping: .release_folio -> nfs_wb_folio .launder_folio -> nfs_wb_folio .write_begin -> nfs_read_folio -> nfs_wb_folio .read_folio -> nfs_wb_folio .write_end -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio .page_mkwrite -> nfs_update_folio -> nfs_writepage_setup -> nfs_setup_write_request -> nfs_try_to_update_request -> nfs_wb_folio .write_begin -> nfs_flush_incompatible -> nfs_wb_folio .page_mkwrite -> nfs_vm_page_mkwrite -> nfs_flush_incompatible -> nfs_wb_folio nfs_invalidate_folio is only called by .invalidate_folio. nfs_launder_folio is only called by .launder_folio - nfs_grow_file - nfs_update_folio nfs_grow_file is only called by nfs_update_folio, and all possible callers of them are: .write_end -> nfs_update_folio .page_mkwrite -> nfs_update_folio - nfs_wb_folio_cancel .invalidate_folio -> nfs_wb_folio_cancel Also, seeing from the swap side, swap_rw is now the only interface calling into fs, the offset info is always in iocb.ki_pos now. So we can remove all these folio_file_pos call safely. Link: https://lkml.kernel.org/r/20240521175854.96038-8-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03NFS: remove nfs_page_lengthg and usage of page_indexKairui Song
This function is no longer used after commit 4fa7a717b432 ("NFS: Fix up nfs_vm_page_mkwrite() for folios"), all users have been converted to use folio instead, just delete it to remove usage of page_index. Link: https://lkml.kernel.org/r/20240521175854.96038-5-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc Dionne <marc.dionne@auristor.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: NeilBrown <neilb@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-24nfs: drop the incorrect assertion in nfs_swap_rw()Christoph Hellwig
Since commit 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS swap-space"), we can plug multiple pages then unplug them all together. That means iov_iter_count(iter) could be way bigger than PAGE_SIZE, it actually equals the size of iov_iter_npages(iter, INT_MAX). Note this issue has nothing to do with large folios as we don't support THP_SWPOUT to non-block devices. [v-songbaohua@oppo.com: figure out the cause and correct the commit message] Link: https://lkml.kernel.org/r/20240618065647.21791-1-21cnbao@gmail.com Fixes: 2282679fb20b ("mm: submit multipage write for SWP_FS_OPS swap-space") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Closes: https://lore.kernel.org/linux-mm/20240617053201.GA16852@lst.de/ Reviewed-by: Martin Wege <martin.l.wege@gmail.com> Cc: NeilBrown <neilb@suse.de> Cc: Anna Schumaker <anna@kernel.org> Cc: Steve French <sfrench@samba.org> Cc: Trond Myklebust <trondmy@kernel.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-06-13Merge tag 'nfs-for-6.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust: "Bugfixes: - NFSv4.2: Fix a memory leak in nfs4_set_security_label - NFSv2/v3: abort nfs_atomic_open_v23 if the name is too long. - NFS: Add appropriate memory barriers to the sillyrename code - Propagate readlink errors in nfs_symlink_filler - NFS: don't invalidate dentries on transient errors - NFS: fix unnecessary synchronous writes in random write workloads - NFSv4.1: enforce rootpath check when deciding whether or not to trunk Other: - Change email address for Trond Myklebust due to email server concerns" * tag 'nfs-for-6.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: add barriers when testing for NFS_FSDATA_BLOCKED SUNRPC: return proper error from gss_wrap_req_priv NFSv4.1 enforce rootpath check in fs_location query NFS: abort nfs_atomic_open_v23 if name is too long. nfs: don't invalidate dentries on transient errors nfs: Avoid flushing many pages with NFS_FILE_SYNC nfs: propagate readlink errors in nfs_symlink_filler MAINTAINERS: Change email address for Trond Myklebust NFSv4: Fix memory leak in nfs4_set_security_label
2024-05-31nfs: Remove calls to folio_set_errorMatthew Wilcox (Oracle)
Common code doesn't test the error flag, so we don't need to set it in nfs. We can use folio_end_read() to combine the setting (or not) of the uptodate flag and clearing the lock flag. Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: linux-nfs@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240530202110.2653630-10-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-30NFS: add barriers when testing for NFS_FSDATA_BLOCKEDNeilBrown
dentry->d_fsdata is set to NFS_FSDATA_BLOCKED while unlinking or renaming-over a file to ensure that no open succeeds while the NFS operation progressed on the server. Setting dentry->d_fsdata to NFS_FSDATA_BLOCKED is done under ->d_lock after checking the refcount is not elevated. Any attempt to open the file (through that name) will go through lookp_open() which will take ->d_lock while incrementing the refcount, we can be sure that once the new value is set, __nfs_lookup_revalidate() *will* see the new value and will block. We don't have any locking guarantee that when we set ->d_fsdata to NULL, the wait_var_event() in __nfs_lookup_revalidate() will notice. wait/wake primitives do NOT provide barriers to guarantee order. We must use smp_load_acquire() in wait_var_event() to ensure we look at an up-to-date value, and must use smp_store_release() before wake_up_var(). This patch adds those barrier functions and factors out block_revalidate() and unblock_revalidate() far clarity. There is also a hypothetical bug in that if memory allocation fails (which never happens in practice) we might leave ->d_fsdata locked. This patch adds the missing call to unblock_revalidate(). Reported-and-tested-by: Richard Kojedzinszky <richard+debian+bugreport@kojedz.in> Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071501 Fixes: 3c59366c207e ("NFS: don't unhash dentry during unlink/rename") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-30NFSv4.1 enforce rootpath check in fs_location queryOlga Kornievskaia
In commit 4ca9f31a2be66 ("NFSv4.1 test and add 4.1 trunking transport"), we introduce the ability to query the NFS server for possible trunking locations of the existing filesystem. However, we never checked the returned file system path for these alternative locations. According to the RFC, the server can say that the filesystem currently known under "fs_root" of fs_location also resides under these server locations under the following "rootpath" pathname. The client cannot handle trunking a filesystem that reside under different location under different paths other than what the main path is. This patch enforces the check that fs_root path and rootpath path in fs_location reply is the same. Fixes: 4ca9f31a2be6 ("NFSv4.1 test and add 4.1 trunking transport") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-30NFS: abort nfs_atomic_open_v23 if name is too long.NeilBrown
An attempt to open a file with a name longer than NFS3_MAXNAMLEN will trigger a WARN_ON_ONCE in encode_filename3() because nfs_atomic_open_v23() doesn't have the test on ->d_name.len that nfs_atomic_open() has. So add that test. Reported-by: James Clark <james.clark@arm.com> Closes: https://lore.kernel.org/all/20240528105249.69200-1-james.clark@arm.com/ Fixes: 7c6c5249f061 ("NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly.") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-24nfs: don't invalidate dentries on transient errorsScott Mayhew
This is a slight variation on a patch previously proposed by Neil Brown that never got merged. Prior to commit 5ceb9d7fdaaf ("NFS: Refactor nfs_lookup_revalidate()"), any error from nfs_lookup_verify_inode() other than -ESTALE would result in nfs_lookup_revalidate() returning that error (-ESTALE is mapped to zero). Since that commit, all errors result in nfs_lookup_revalidate() returning zero, resulting in dentries being invalidated where they previously were not (particularly in the case of -ERESTARTSYS). Fix it by passing the actual error code to nfs_lookup_revalidate_done(), and leaving the decision on whether to map the error code to zero or one to nfs_lookup_revalidate_done(). A simple reproducer is to run the following python code in a subdirectory of an NFS mount (not in the root of the NFS mount): ---8<--- import os import multiprocessing import time if __name__=="__main__": multiprocessing.set_start_method("spawn") count = 0 while True: try: os.getcwd() pool = multiprocessing.Pool(10) pool.close() pool.terminate() count += 1 except Exception as e: print(f"Failed after {count} iterations") print(e) break ---8<--- Prior to commit 5ceb9d7fdaaf, the above code would run indefinitely. After commit 5ceb9d7fdaaf, it fails almost immediately with -ENOENT. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-24nfs: Avoid flushing many pages with NFS_FILE_SYNCJan Kara
When we are doing WB_SYNC_ALL writeback, nfs submits write requests with NFS_FILE_SYNC flag to the server (which then generally treats it as an O_SYNC write). This helps to reduce latency for single requests but when submitting more requests, additional fsyncs on the server side hurt latency. NFS generally avoids this additional overhead by not setting NFS_FILE_SYNC if desc->pg_moreio is set. However this logic doesn't always work. When we do random 4k writes to a huge file and then call fsync(2), each page writeback is going to be sent with NFS_FILE_SYNC because after preparing one page for writeback, we start writing back next, nfs_do_writepage() will call nfs_pageio_cond_complete() which finds the page is not contiguous with previously prepared IO and submits is *without* setting desc->pg_moreio. Hence NFS_FILE_SYNC is used resulting in poor performance. Fix the problem by setting desc->pg_moreio in nfs_pageio_cond_complete() before submitting outstanding IO. This improves throughput of fsync-after-random-writes on my test SSD from ~70MB/s to ~250MB/s. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-23Merge tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Stable fixes: - nfs: fix undefined behavior in nfs_block_bits() - NFSv4.2: Fix READ_PLUS when server doesn't support OP_READ_PLUS Bugfixes: - Fix mixing of the lock/nolock and local_lock mount options - NFSv4: Fixup smatch warning for ambiguous return - NFSv3: Fix remount when using the legacy binary mount api - SUNRPC: Fix the handling of expired RPCSEC_GSS contexts - SUNRPC: fix the NFSACL RPC retries when soft mounts are enabled - rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL Features and cleanups: - NFSv3: Use the atomic_open API to fix open(O_CREAT|O_TRUNC) - pNFS/filelayout: S layout segment range in LAYOUTGET - pNFS: rework pnfs_generic_pg_check_layout to check IO range - NFSv2: Turn off enabling of NFS v2 by default" * tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: fix undefined behavior in nfs_block_bits() pNFS: rework pnfs_generic_pg_check_layout to check IO range pNFS/filelayout: check layout segment range pNFS/filelayout: fixup pNfs allocation modes rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL NFS: Don't enable NFS v2 by default NFS: Fix READ_PLUS when server doesn't support OP_READ_PLUS sunrpc: fix NFSACL RPC retry on soft mount SUNRPC: fix handling expired GSS context nfs: keep server info for remounts NFSv4: Fixup smatch warning for ambiguous return NFS: make sure lock/nolock overriding local_lock mount option NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly. pNFS/filelayout: Specify the layout segment range in LAYOUTGET pNFS/filelayout: Remove the whole file layout requirement
2024-05-22tracing/treewide: Remove second parameter of __assign_str()Steven Rostedt (Google)
With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22nfs: propagate readlink errors in nfs_symlink_fillerSagi Grimberg
There is an inherent race where a symlink file may have been overriden (by a different client) between lookup and readlink, resulting in a spurious EIO error returned to userspace. Fix this by propagating back ESTALE errors such that the vfs will retry the lookup/get_link (similar to nfs4_file_open) at least once. Cc: Dan Aloni <dan.aloni@vastdata.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-22NFSv4: Fix memory leak in nfs4_set_security_labelDmitry Mastykin
We leak nfs_fattr and nfs4_label every time we set a security xattr. Signed-off-by: Dmitry Mastykin <mastichi@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-21nfs: fix undefined behavior in nfs_block_bits()Sergey Shtylyov
Shifting *signed int* typed constant 1 left by 31 bits causes undefined behavior. Specify the correct *unsigned long* type by using 1UL instead. Found by Linux Verification Center (linuxtesting.org) with the Svace static analysis tool. Cc: stable@vger.kernel.org Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-21pNFS: rework pnfs_generic_pg_check_layout to check IO rangeOlga Kornievskaia
All callers of pnfs_generic_pg_check_layout() also want to do a call to check that the layout's range covers the IO range. Merge the functionality of the pnfs_generic_pg_check_range() into that of pnfs_generic_pg_check_layout(). Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-21pNFS/filelayout: check layout segment rangeOlga Kornievskaia
Before doing the IO, check that we have the layout covering the range of IO. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-21pNFS/filelayout: fixup pNfs allocation modesOlga Kornievskaia
Change left over allocation flags. Fixes: a245832aaa99 ("pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20NFS: Don't enable NFS v2 by defaultAnna Schumaker
This came up during one of the Bake-a-thon discussions. NFS v2 support was dropped from nfs-utils/mount.nfs in December 2021. Let's turn it off by default in the kernel too, since this means there isn't a way to mount and test it. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: Jeffrey Layton <jlayton@kernel.org> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20NFS: Fix READ_PLUS when server doesn't support OP_READ_PLUSAnna Schumaker
Olga showed me a case where the client was sending multiple READ_PLUS calls to the server in parallel, and the server replied NFS4ERR_OPNOTSUPP to each. The client would fall back to READ for the first reply, but fail to retry the other calls. I fix this by removing the test for NFS_CAP_READ_PLUS in nfs4_read_plus_not_supported(). This allows us to reschedule any READ_PLUS call that has a NFS4ERR_OPNOTSUPP return value, even after the capability has been cleared. Reported-by: Olga Kornievskaia <kolga@netapp.com> Fixes: c567552612ec ("NFS: Add READ_PLUS data segment support") Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20nfs: keep server info for remountsMartin Kaiser
With newer kernels that use fs_context for nfs mounts, remounts fail with -EINVAL. $ mount -t nfs -o nolock 10.0.0.1:/tmp/test /mnt/test/ $ mount -t nfs -o remount /mnt/test/ mount: mounting 10.0.0.1:/tmp/test on /mnt/test failed: Invalid argument For remounts, the nfs server address and port are populated by nfs_init_fs_context and later overwritten with 0x00 bytes by nfs23_parse_monolithic. The remount then fails as the server address is invalid. Fix this by not overwriting nfs server info in nfs23_parse_monolithic if we're doing a remount. Fixes: f2aedb713c28 ("NFS: Add fs_context support.") Signed-off-by: Martin Kaiser <martin@kaiser.cx> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20NFSv4: Fixup smatch warning for ambiguous returnBenjamin Coddington
Dan Carpenter reports smatch warning for nfs4_try_migration() when a memory allocation failure results in a zero return value. In this case, a transient allocation failure error will likely be retried the next time the server responds with NFS4ERR_MOVED. We can fixup the smatch warning with a small refactor: attempt all three allocations before testing and returning on a failure. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: c3ed222745d9 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20NFS: make sure lock/nolock overriding local_lock mount optionChen Hanxiao
Currently, mount option lock/nolock and local_lock option may override NFS_MOUNT_LOCAL_FLOCK NFS_MOUNT_LOCAL_FCNTL flags when passing in different order: mount -o vers=3,local_lock=all,lock: local_lock=none mount -o vers=3,lock,local_lock=all: local_lock=all This patch will let lock/nolock override local_lock option as nfs(5) suggested. Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly.NeilBrown
With two clients, each with NFSv3 mounts of the same directory, the sequence: client1 client2 ls -l afile echo hello there > afile echo HELLO > afile cat afile will show HELLO there because the O_TRUNC requested in the final 'echo' doesn't take effect. This is because the "Negative dentry, just create a file" section in lookup_open() assumes that the file *does* get created since the dentry was negative, so it sets FMODE_CREATED, and this causes do_open() to clear O_TRUNC and so the file doesn't get truncated. Even mounting with -o lookupcache=none does not help as nfs_neg_need_reval() always returns false if LOOKUP_CREATE is set. This patch fixes the problem by providing an atomic_open inode operation for NFSv3 (and v2). The code is largely the code from the branch in lookup_open() when atomic_open is not provided. The significant change is that the O_TRUNC flag is passed a new nfs_do_create() which add 'trunc' handling to nfs_create(). With this change we also optimise away an unnecessary LOOKUP before the file is created. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20pNFS/filelayout: Specify the layout segment range in LAYOUTGETAnna Schumaker
Move from only requesting full file layout segments to requesting layout segments that match our I/O size. This means the server is still free to return a full file layout if it wants, but partial layouts will no longer cause an error. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20pNFS/filelayout: Remove the whole file layout requirementAnna Schumaker
Layout segments have been supported in pNFS for years, so remove the requirement that the server always sends whole file layouts. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-13Merge tag 'vfs-6.10.netfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull netfs updates from Christian Brauner: "This reworks the netfslib writeback implementation so that pages read from the cache are written to the cache through ->writepages(), thereby allowing the fscache page flag to be retired. The reworking also: - builds on top of the new writeback_iter() infrastructure - makes it possible to use vectored write RPCs as discontiguous streams of pages can be accommodated - makes it easier to do simultaneous content crypto and stream division - provides support for retrying writes and re-dividing a stream - replaces the ->launder_folio() op, so that ->writepages() is used instead - uses mempools to allocate the netfs_io_request and netfs_io_subrequest structs to avoid allocation failure in the writeback path Some code that uses the fscache page flag is retained for compatibility purposes with nfs and ceph. The code is switched to using the synonymous private_2 label instead and marked with deprecation comments. The merge commit contains additional details on the new algorithm that I've left out of here as it would probably be excessively detailed. On top of the netfslib infrastructure this contains the work to convert cifs over to netfslib" * tag 'vfs-6.10.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) cifs: Enable large folio support cifs: Remove some code that's no longer used, part 3 cifs: Remove some code that's no longer used, part 2 cifs: Remove some code that's no longer used, part 1 cifs: Cut over to using netfslib cifs: Implement netfslib hooks cifs: Make add_credits_and_wake_if() clear deducted credits cifs: Add mempools for cifs_io_request and cifs_io_subrequest structs cifs: Set zero_point in the copy_file_range() and remap_file_range() cifs: Move cifs_loose_read_iter() and cifs_file_write_iter() to file.c cifs: Replace the writedata replay bool with a netfs sreq flag cifs: Make wait_mtu_credits take size_t args cifs: Use more fields from netfs_io_subrequest cifs: Replace cifs_writedata with a wrapper around netfs_io_subrequest cifs: Replace cifs_readdata with a wrapper around netfs_io_subrequest cifs: Use alternative invalidation to using launder_folio netfs, afs: Use writeback retry to deal with alternate keys netfs: Miscellaneous tidy ups netfs: Remove the old writeback code netfs: Cut over to using new writeback code ...
2024-04-29Merge tag 'nfs-for-6.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust: - Fix an Oops in xs_tcp_tls_setup_socket - Fix an Oops due to missing error handling in nfs_net_init() * tag 'nfs-for-6.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: Handle error of rpc_proc_register() in nfs_net_init(). SUNRPC: add a missing rpc_stat for TCP TLS
2024-04-29mm: Remove the PG_fscache alias for PG_private_2David Howells
Remove the PG_fscache alias for PG_private_2 and use the latter directly. Use of this flag for marking pages undergoing writing to the cache should be considered deprecated and the folios should be marked dirty instead and the write done in ->writepages(). Note that PG_private_2 itself should be considered deprecated and up for future removal by the MM folks too. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> cc: Shyam Prasad N <sprasad@microsoft.com> cc: Tom Talpey <tom@talpey.com> cc: Bharath SM <bharathsm@microsoft.com> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Anna Schumaker <anna@kernel.org> cc: netfs@lists.linux.dev cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org
2024-04-29netfs: Replace PG_fscache by setting folio->private and marking dirtyDavid Howells
When dirty data is being written to the cache, setting/waiting on/clearing the fscache flag is always done in tandem with setting/waiting on/clearing the writeback flag. The netfslib buffered write routines wait on and set both flags and the write request cleanup clears both flags, so the fscache flag is almost superfluous. The reason it isn't superfluous is because the fscache flag is also used to indicate that data just read from the server is being written to the cache. The flag is used to prevent a race involving overlapping direct-I/O writes to the cache. Change this to indicate that a page is in need of being copied to the cache by placing a magic value in folio->private and marking the folios dirty. Then when the writeback code sees a folio marked in this way, it only writes it to the cache and not to the server. If a folio that has this magic value set is modified, the value is just replaced and the folio will then be uplodaded too. With this, PG_fscache is no longer required by the netfslib core, 9p and afs. Ceph and nfs, however, still need to use the old PG_fscache-based tracking. To deal with this, a flag, NETFS_ICTX_USE_PGPRIV2, now has to be set on the flags in the netfs_inode struct for those filesystems. This reenables the use of PG_fscache in that inode. 9p and afs use the netfslib write helpers so get switched over; cifs, for the moment, does page-by-page manual access to the cache, so doesn't use PG_fscache and is unaffected. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Ronnie Sahlberg <ronniesahlberg@gmail.com> cc: Shyam Prasad N <sprasad@microsoft.com> cc: Tom Talpey <tom@talpey.com> cc: Bharath SM <bharathsm@microsoft.com> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Anna Schumaker <anna@kernel.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org
2024-04-25mm: change inlined allocation helpers to account at the call siteSuren Baghdasaryan
Main goal of memory allocation profiling patchset is to provide accounting that is cheap enough to run in production. To achieve that we inject counters using codetags at the allocation call sites to account every time allocation is made. This injection allows us to perform accounting efficiently because injected counters are immediately available as opposed to the alternative methods, such as using _RET_IP_, which would require counter lookup and appropriate locking that makes accounting much more expensive. This method requires all allocation functions to inject separate counters at their call sites so that their callers can be individually accounted. Counter injection is implemented by allocation hooks which should wrap all allocation functions. Inlined functions which perform allocations but do not use allocation hooks are directly charged for the allocations they perform. In most cases these functions are just specialized allocation wrappers used from multiple places to allocate objects of a specific type. It would be more useful to do the accounting at their call sites instead. Instrument these helpers to do accounting at the call site. Simple inlined allocation wrappers are converted directly into macros. More complex allocators or allocators with documentation are converted into _noprof versions and allocation hooks are added. This allows memory allocation profiling mechanism to charge allocations to the callers of these functions. Link: https://lkml.kernel.org/r/20240415020731.1152108-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Jan Kara <jack@suse.cz> [jbd2] Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Jakub Sitnicki <jakub@cloudflare.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-04nfs: Handle error of rpc_proc_register() in nfs_net_init().Kuniyuki Iwashima
syzkaller reported a warning [0] triggered while destroying immature netns. rpc_proc_register() was called in init_nfs_fs(), but its error has been ignored since at least the initial commit 1da177e4c3f4 ("Linux-2.6.12-rc2"). Recently, commit d47151b79e32 ("nfs: expose /proc/net/sunrpc/nfs in net namespaces") converted the procfs to per-netns and made the problem more visible. Even when rpc_proc_register() fails, nfs_net_init() could succeed, and thus nfs_net_exit() will be called while destroying the netns. Then, remove_proc_entry() will be called for non-existing proc directory and trigger the warning below. Let's handle the error of rpc_proc_register() properly in nfs_net_init(). [0]: name 'nfs' WARNING: CPU: 1 PID: 1710 at fs/proc/generic.c:711 remove_proc_entry+0x1bb/0x2d0 fs/proc/generic.c:711 Modules linked in: CPU: 1 PID: 1710 Comm: syz-executor.2 Not tainted 6.8.0-12822-gcd51db110a7e #12 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:remove_proc_entry+0x1bb/0x2d0 fs/proc/generic.c:711 Code: 41 5d 41 5e c3 e8 85 09 b5 ff 48 c7 c7 88 58 64 86 e8 09 0e 71 02 e8 74 09 b5 ff 4c 89 e6 48 c7 c7 de 1b 80 84 e8 c5 ad 97 ff <0f> 0b eb b1 e8 5c 09 b5 ff 48 c7 c7 88 58 64 86 e8 e0 0d 71 02 eb RSP: 0018:ffffc9000c6d7ce0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff8880422b8b00 RCX: ffffffff8110503c RDX: ffff888030652f00 RSI: ffffffff81105045 RDI: 0000000000000001 RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: ffffffff81bb62cb R12: ffffffff84807ffc R13: ffff88804ad6fcc0 R14: ffffffff84807ffc R15: ffffffff85741ff8 FS: 00007f30cfba8640(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff51afe8000 CR3: 000000005a60a005 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> rpc_proc_unregister+0x64/0x70 net/sunrpc/stats.c:310 nfs_net_exit+0x1c/0x30 fs/nfs/inode.c:2438 ops_exit_list+0x62/0xb0 net/core/net_namespace.c:170 setup_net+0x46c/0x660 net/core/net_namespace.c:372 copy_net_ns+0x244/0x590 net/core/net_namespace.c:505 create_new_namespaces+0x2ed/0x770 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xae/0x160 kernel/nsproxy.c:228 ksys_unshare+0x342/0x760 kernel/fork.c:3322 __do_sys_unshare kernel/fork.c:3393 [inline] __se_sys_unshare kernel/fork.c:3391 [inline] __x64_sys_unshare+0x1f/0x30 kernel/fork.c:3391 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x46/0x4e RIP: 0033:0x7f30d0febe5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007f30cfba7cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00000000004bbf80 RCX: 00007f30d0febe5d RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000006c020600 RBP: 00000000004bbf80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002 R13: 000000000000000b R14: 00007f30d104c530 R15: 0000000000000000 </TASK> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>