summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-04-13nfsd: decrease sc_count directly if fail to queue dl_recallLi Lingfeng
A deadlock warning occurred when invoking nfs4_put_stid following a failed dl_recall queue operation: T1 T2 nfs4_laundromat nfs4_get_client_reaplist nfs4_anylock_blockers __break_lease spin_lock // ctx->flc_lock spin_lock // clp->cl_lock nfs4_lockowner_has_blockers locks_owner_has_blockers spin_lock // flctx->flc_lock nfsd_break_deleg_cb nfsd_break_one_deleg nfs4_put_stid refcount_dec_and_lock spin_lock // clp->cl_lock When a file is opened, an nfs4_delegation is allocated with sc_count initialized to 1, and the file_lease holds a reference to the delegation. The file_lease is then associated with the file through kernel_setlease. The disassociation is performed in nfsd4_delegreturn via the following call chain: nfsd4_delegreturn --> destroy_delegation --> destroy_unhashed_deleg --> nfs4_unlock_deleg_lease --> kernel_setlease --> generic_delete_lease The corresponding sc_count reference will be released after this disassociation. Since nfsd_break_one_deleg executes while holding the flc_lock, the disassociation process becomes blocked when attempting to acquire flc_lock in generic_delete_lease. This means: 1) sc_count in nfsd_break_one_deleg will not be decremented to 0; 2) The nfs4_put_stid called by nfsd_break_one_deleg will not attempt to acquire cl_lock; 3) Consequently, no deadlock condition is created. Given that sc_count in nfsd_break_one_deleg remains non-zero, we can safely perform refcount_dec on sc_count directly. This approach effectively avoids triggering deadlock warnings. Fixes: 230ca758453c ("nfsd: put dl_stid if fail to queue dl_recall") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-04-13nfs: add missing selections of CONFIG_CRC32Eric Biggers
nfs.ko, nfsd.ko, and lockd.ko all use crc32_le(), which is available only when CONFIG_CRC32 is enabled. But the only NFS kconfig option that selected CONFIG_CRC32 was CONFIG_NFS_DEBUG, which is client-specific and did not actually guard the use of crc32_le() even on the client. The code worked around this bug by only actually calling crc32_le() when CONFIG_CRC32 is built-in, instead hard-coding '0' in other cases. This avoided randconfig build errors, and in real kernels the fallback code was unlikely to be reached since CONFIG_CRC32 is 'default y'. But, this really needs to just be done properly, especially now that I'm planning to update CONFIG_CRC32 to not be 'default y'. Therefore, make CONFIG_NFS_FS, CONFIG_NFSD, and CONFIG_LOCKD select CONFIG_CRC32. Then remove the fallback code that becomes unnecessary, as well as the selection of CONFIG_CRC32 from CONFIG_NFS_DEBUG. Fixes: 1264a2f053a3 ("NFS: refactor code for calculating the crc32 hash of a filehandle") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-14NFSD: Add a Kconfig setting to enable delegated timestampsChuck Lever
After three tries, we still see test failures with delegated timestamps. Disable them by default, but leave the implementation intact so that development can continue. Cc: stable@vger.kernel.org # v6.14 Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10sysctl: Fixes nsm_local_state boundsNicolas Bouchinet
Bound nsm_local_state sysctl writings between SYSCTL_ZERO and SYSCTL_INT_MAX. The proc_handler has thus been updated to proc_dointvec_minmax. Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr> [ cel: updated to handle zero - UINT_MAX instead ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: use a long for the count in nfsd4_state_shrinker_count()Jeff Layton
If there are no courtesy clients then the return value from the atomic_long_read() could overflow an int. Use a long to store the value instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: remove obsolete comment from nfs4_alloc_stidJeff Layton
idr_alloc_cyclic() is what guarantees that now, not this long-gone trick. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: remove unneeded forward declaration of nfsd4_mark_cb_fault()Jeff Layton
This isn't needed. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: reorganize struct nfs4_delegation for better packingJeff Layton
Move dl_type field above dl_time, which shaves 8 bytes off this struct. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: handle errors from rpc_call_async()Jeff Layton
It's possible for rpc_call_async() to fail (mainly due to memory allocation failure). If it does, there isn't much recourse other than to requeue the callback and try again later. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: move cb_need_restart flag into cb_flagsJeff Layton
Since there is now a cb_flags word, use a new NFSD4_CALLBACK_REQUEUE flag in that instead of a separate boolean. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: replace CB_GETATTR_BUSY with NFSD4_CALLBACK_RUNNINGJeff Layton
These flags serve essentially the same purpose and get set and cleared at the same time. Drop CB_GETATTR_BUSY and just use NFSD4_CALLBACK_RUNNING instead. For this to work, we must use clear_and_wake_up_bit(), but doing that on for other types of callbacks is wasteful. Declare a new NFSD4_CALLBACK_WAKE flag in cb_flags to indicate that wake_up is needed, and only set that for CB_GETATTRs. Also, make the wait use a TASK_UNINTERRUPTIBLE sleep. This is done in the context of an nfsd thread, and it should never need to deal with signals. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: eliminate cl_ra_cblist and NFSD4_CLIENT_CB_RECALL_ANYJeff Layton
deleg_reaper() will walk the client_lru list and put any suitable entries onto "cblist" using the cl_ra_cblist pointer. It then walks the objects outside the spinlock and queues callbacks for them. None of the operations that deleg_reaper() does outside the nn->client_lock are blocking operations. Just queue their workqueue jobs under the nn->client_lock instead. Also, the NFSD4_CLIENT_CB_RECALL_ANY and NFSD4_CALLBACK_RUNNING flags serve an identical purpose now. Drop the NFSD4_CLIENT_CB_RECALL_ANY flag and just use the one in the callback. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: prevent callback tasks running concurrentlyJeff Layton
The nfsd4_callback workqueue jobs exist to queue backchannel RPCs to rpciod. Because they run in different workqueue contexts, the rpc_task can run concurrently with the workqueue job itself, should it become requeued. This is problematic as there is no locking when accessing the fields in the nfsd4_callback. Add a new unsigned long to nfsd4_callback and declare a new NFSD4_CALLBACK_RUNNING flag to be set in it. When attempting to run a workqueue job, do a test_and_set_bit() on that flag first, and don't queue the workqueue job if it returns true. Clear NFSD4_CALLBACK_RUNNING in nfsd41_destroy_cb(). This also gives us a more reliable mechanism for handling queueing failures in codepaths where we have to take references under spinlocks. We can now do the test_and_set_bit on NFSD4_CALLBACK_RUNNING first, and only take references to the objects if that returns false. Most of the nfsd4_run_cb() callers are converted to use this new flag or the nfsd4_try_run_cb() wrapper. The main exception is the callback channel probe, which has its own synchronization. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: disallow file locking and delegations for NFSv4 reexportMike Snitzer
We do not and cannot support file locking with NFS reexport over NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport server reboot cannot allow clients to recover locks because the source NFS server has not rebooted, and so it is not in grace. Since the source NFS server is not in grace, it cannot offer any guarantees that the file won't have been changed between the locks getting lost and any attempt to recover/reclaim them. The same applies to delegations and any associated locks, so disallow them too. Clients are no longer allowed to get file locks or delegations from a reexport server, any attempts will fail with operation not supported. Update the "Reboot recovery" section accordingly in Documentation/filesystems/nfs/reexport.rst Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: filecache: drop the list_lru lock during lock gc scansNeilBrown
Under a high NFSv3 load with lots of different files being accessed, the LRU list of garbage-collectable files can become quite long. Asking list_lru_scan_node() to scan the whole list can result in a long period during which a spinlock is held, blocking the addition of new LRU items. So ask list_lru_scan_node() to scan only a few entries at a time, and repeat until the scan is complete. If the shrinker runs between two consecutive calls of list_lru_scan_node() it could invalidate the "remaining" counter which could lead to premature freeing. So add a spinlock to avoid that. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: filecache: don't repeatedly add/remove files on the lru listNeilBrown
There is no need to remove a file from the lru every time we access it, and then add it back. It is sufficient to set the REFERENCED flag every time we put the file. The order in the lru of REFERENCED files is largely irrelevant as they will all be moved to the end. With this patch, files are added only when they are allocated (if want_gc) and they are removed only by the list_lru_(shrink_)walk callback or when forcibly removing a file. This should reduce contention on the list_lru spinlock(s) and reduce memory traffic a little. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: filecache: introduce NFSD_FILE_RECENTNeilBrown
The filecache lru is walked in 2 circumstances for 2 different reasons. 1/ When called from the shrinker we want to discard the first few entries on the list, ignoring any with NFSD_FILE_REFERENCED set because they should really be at the end of the LRU as they have been referenced recently. So those ones are ROTATED. 2/ When called from the nfsd_file_gc() timer function we want to discard anything that hasn't been used since before the previous call, and mark everything else as unused at this point in time. Using the same flag for both of these can result in some unexpected outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then nfsd_file_gc() will think the file hasn't been used in a while, while really it has. I think it is easier to reason about the behaviour if we instead have two flags. NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please put it there when convenient" NFSD_FILE_RECENT means "this has been used recently - since the last run of nfsd_file_gc() When either caller finds an NFSD_FILE_REFERENCED entry, that entry should be moved to the end of the LRU and the flag cleared. This can safely happen at any time. The actual order on the lru might not be strictly least-recently-used, but that is normal for linux lrus. The shrinker callback can ignore the "recent" flag. If it ends up freeing something that is "recent" that simply means that memory pressure is sufficient to limit the acceptable cache age to less than the nfsd_file_gc frequency. The gc callback should primarily focus on NFSD_FILE_RECENT. It should free everything that doesn't have this flag set, and should clear the flag on everything else. When it clears the flag it is convenient to clear the "REFERENCED" flag and move to the end of the LRU too. With this, calls from the shrinker do not prematurely age files. It will focus only on freeing those that are least recently used. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc()NeilBrown
list_lru_walk() is only useful when the aim is to remove all elements from the list_lru. It will repeatedly visit rotated elements of the first per-node sublist before proceeding to subsequent sublists. This patch changes nfsd_file_gc() to use list_lru_walk_node() and list_lru_count_node() on each NUMA node. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync()NeilBrown
nfsd_file_close_inode_sync() contains an exact copy of nfsd_file_dispose_list(). This patch removes that copy and calls nfsd_file_dispose_list() instead. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10NFSD: Re-organize nfsd_file_gc_worker()Chuck Lever
Dave opines: IMO, there is no need to do this unnecessary work on every object that is added to the LRU. Changing the gc worker to always run every 2s and check if it has work to do like so: static void nfsd_file_gc_worker(struct work_struct *work) { - nfsd_file_gc(); - if (list_lru_count(&nfsd_file_lru)) - nfsd_file_schedule_laundrette(); + if (list_lru_count(&nfsd_file_lru)) + nfsd_file_gc(); + nfsd_file_schedule_laundrette(); } means that nfsd_file_gc() will be run the same way and have the same behaviour as the current code. When the system it idle, it does a list_lru_count() check every 2 seconds and goes back to sleep. That's going to be pretty much unnoticable on most machines that run NFS servers. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: filecache: remove race handling.NeilBrown
The race that this code tries to protect against is not interesting. The code is problematic as we access the "nf" after we have given our reference to the lru system. While that takes 2+ seconds to free things, it is still poor form. The only interesting race I can find would be with nfsd_file_close_inode_sync(); This is the only place that really doesn't want the file to stay on the LRU when unhashed (which is the direct consequence of the race). However for the race to happen, some other thread must own a reference to a file and be putting it while nfsd_file_close_inode_sync() is trying to close all files for an inode. If this is possible, that other thread could simply call nfsd_file_put() a little bit later and the result would be the same: not all files are closed when nfsd_file_close_inode_sync() completes. If this was really a problem, we would need to wait in close_inode_sync for the other references to be dropped. We probably don't want to do that. So it is best to simply remove this code. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10fs: nfs: acl: Avoid -Wflex-array-member-not-at-end warningGustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. So, in order to avoid ending up with a flexible-array member in the middle of other structs, we use the `struct_group_tagged()` helper to create a new tagged `struct posix_acl_hdr`. This structure groups together all the members of the flexible `struct posix_acl` except the flexible array. As a result, the array is effectively separated from the rest of the members without modifying the memory layout of the flexible structure. We then change the type of the middle struct member currently causing trouble from `struct posix_acl` to `struct posix_acl_hdr`. We also want to ensure that when new members need to be added to the flexible structure, they are always included within the newly created tagged struct. For this, we use `static_assert()`. This ensures that the memory layout for both the flexible structure and the new tagged struct is the same after any changes. This approach avoids having to implement `struct posix_acl_hdr` as a completely separate structure, thus preventing having to maintain two independent but basically identical structures, closing the door to potential bugs in the future. We also use `container_of()` whenever we need to retrieve a pointer to the flexible structure, through which we can access the flexible-array member, if necessary. So, with these changes, fix the following warning: fs/nfs_common/nfsacl.c:45:26: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10NFSD: Fix callback decoder status codesChuck Lever
fs/nfsd/nfs4callback.c implements a callback client. Thus its XDR decoders are decoding replies, not calls. NFS4ERR_BAD_XDR is an on-the-wire status code that reports that the client sent a corrupted RPC /call/. It's not used as the internal error code when a /reply/ can't be decoded, since that kind of failure is never reported to the sender of that RPC message. Instead, a reply decoder should return -EIO, as the reply decoders in the NFS client do. Fixes: 6487a13b5c6b ("NFSD: add support for CB_GETATTR callback") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: eliminate special handling of NFS4ERR_SEQ_MISORDEREDJeff Layton
On a SEQ_MISORDERED error, the current code will reattempt the call, but set the slot sequence ID to 1. I can find no mention of this remedy in the spec, and it seems potentially dangerous. It's possible that the last call was sent with seqid 1, and doing this will cause a retransmission of the reply. Drop this special handling, and always treat SEQ_MISORDERED like BADSLOT. Retry the call, but leak the slot so that it is no longer used. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: handle NFS4ERR_BADSLOT on CB_SEQUENCE betterJeff Layton
Currently it just restarts the call, without getting a new slot. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: when CB_SEQUENCE gets ESERVERFAULT don't increment seq_nrJeff Layton
ESERVERFAULT means that the server sent a successful and legitimate reply, but the session info didn't match what was expected. Don't increment the seq_nr in that case. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: only check RPC_SIGNALLED() when restarting rpc_taskJeff Layton
nfsd4_cb_sequence_done() currently checks RPC_SIGNALLED() when processing the compound and releasing the slot. If RPC_SIGNALLED() returns true, then that means that the client is going to be torn down. Don't check RPC_SIGNALLED() after processing a successful reply. Check it only before restarting the rpc_task. If it returns true, then requeue the callback instead of restarting the task. Also, handle rpc_restart_call() and rpc_restart_call_prepare() failures correctly, by requeueing the callback. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: always release slot when requeueing callbackJeff Layton
If the callback is going to be requeued to the workqueue, then release the slot. The callback client and session could change and the slot may no longer be valid after that point. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: lift NFSv4.0 handling out of nfsd4_cb_sequence_done()Jeff Layton
It's a bit strange to call nfsd4_cb_sequence_done() on a callback with no CB_SEQUENCE. Lift the handling of restarting a call into a new helper, and move the handling of NFSv4.0 into nfsd4_cb_done(). Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: prepare nfsd4_cb_sequence_done() for error handling reworkJeff Layton
There is only one case where we want to proceed with processing the rest of the CB_COMPOUND, and that's when the cb_seq_status is 0. Make the default return value be false, and only set it to true in that case. Rename the "need_restart" label to "requeue", to better indicate that it's being requeued to the workqueue. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: put dl_stid if fail to queue dl_recallLi Lingfeng
Before calling nfsd4_run_cb to queue dl_recall to the callback_wq, we increment the reference count of dl_stid. We expect that after the corresponding work_struct is processed, the reference count of dl_stid will be decremented through the callback function nfsd4_cb_recall_release. However, if the call to nfsd4_run_cb fails, the incremented reference count of dl_stid will not be decremented correspondingly, leading to the following nfs4_stid leak: unreferenced object 0xffff88812067b578 (size 344): comm "nfsd", pid 2761, jiffies 4295044002 (age 5541.241s) hex dump (first 32 bytes): 01 00 00 00 6b 6b 6b 6b b8 02 c0 e2 81 88 ff ff ....kkkk........ 00 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 ad 4e ad de .kkkkkkk.....N.. backtrace: kmem_cache_alloc+0x4b9/0x700 nfsd4_process_open1+0x34/0x300 nfsd4_open+0x2d1/0x9d0 nfsd4_proc_compound+0x7a2/0xe30 nfsd_dispatch+0x241/0x3e0 svc_process_common+0x5d3/0xcc0 svc_process+0x2a3/0x320 nfsd+0x180/0x2e0 kthread+0x199/0x1d0 ret_from_fork+0x30/0x50 ret_from_fork_asm+0x1b/0x30 unreferenced object 0xffff8881499f4d28 (size 368): comm "nfsd", pid 2761, jiffies 4295044005 (age 5541.239s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 30 4d 9f 49 81 88 ff ff ........0M.I.... 30 4d 9f 49 81 88 ff ff 20 00 00 00 01 00 00 00 0M.I.... ....... backtrace: kmem_cache_alloc+0x4b9/0x700 nfs4_alloc_stid+0x29/0x210 alloc_init_deleg+0x92/0x2e0 nfs4_set_delegation+0x284/0xc00 nfs4_open_delegation+0x216/0x3f0 nfsd4_process_open2+0x2b3/0xee0 nfsd4_open+0x770/0x9d0 nfsd4_proc_compound+0x7a2/0xe30 nfsd_dispatch+0x241/0x3e0 svc_process_common+0x5d3/0xcc0 svc_process+0x2a3/0x320 nfsd+0x180/0x2e0 kthread+0x199/0x1d0 ret_from_fork+0x30/0x50 ret_from_fork_asm+0x1b/0x30 Fix it by checking the result of nfsd4_run_cb and call nfs4_put_stid if fail to queue dl_recall. Cc: stable@vger.kernel.org Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: allow SC_STATUS_FREEABLE when searching via nfs4_lookup_stateid()Jeff Layton
The pynfs DELEG8 test fails when run against nfsd. It acquires a delegation and then lets the lease time out. It then tries to use the deleg stateid and expects to see NFS4ERR_DELEG_REVOKED, but it gets bad NFS4ERR_BAD_STATEID instead. When a delegation is revoked, it's initially marked with SC_STATUS_REVOKED, or SC_STATUS_ADMIN_REVOKED and later, it's marked with the SC_STATUS_FREEABLE flag, which denotes that it is waiting for s FREE_STATEID call. nfs4_lookup_stateid() accepts a statusmask that includes the status flags that a found stateid is allowed to have. Currently, that mask never includes SC_STATUS_FREEABLE, which means that revoked delegations are (almost) never found. Add SC_STATUS_FREEABLE to the always-allowed status flags, and remove it from nfsd4_delegreturn() since it's now always implied. Fixes: 8dd91e8d31fe ("nfsd: fix race between laundromat and free_stateid") Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: don't ignore the return code of svc_proc_register()Jeff Layton
Currently, nfsd_proc_stat_init() ignores the return value of svc_proc_register(). If the procfile creation fails, then the kernel will WARN when it tries to remove the entry later. Fix nfsd_proc_stat_init() to return the same type of pointer as svc_proc_register(), and fix up nfsd_net_init() to check that and fail the nfsd_net construction if it occurs. svc_proc_register() can fail if the dentry can't be allocated, or if an identical dentry already exists. The second case is pretty unlikely in the nfsd_net construction codepath, so if this happens, return -ENOMEM. Reported-by: syzbot+e34ad04f27991521104c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-nfs/67a47501.050a0220.19061f.05f9.GAE@google.com/ Cc: stable@vger.kernel.org # v6.9 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10NFSD: Fix trace_nfsd_slot_seqid_sequenceChuck Lever
While running down the problem triggered by disconnect injection, I noticed the "in use" string was actually never hooked up in this trace point, so it always showed the traced slot as not in use. But what might be more useful is showing all the slot status flags. Also, this trace point can record and report the slot's index number, which among other things is useful for troubleshooting slot table expansion and contraction. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10NFSD: Return NFS4ERR_FILE_OPEN only when linking an open fileChuck Lever
RFC 8881 Section 18.9.4 paragraphs 1 - 2 tell us that RENAME should return NFS4ERR_FILE_OPEN only when the target object is a file that is currently open. If the target is a directory, some other status must be returned. The VFS is unlikely to return -EBUSY, but NFSD has to ensure that errno does not leak to clients as a status code that is not permitted by spec. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10NFSD: Return NFS4ERR_FILE_OPEN only when renaming over an open fileChuck Lever
RFC 8881 Section 18.26.4 paragraphs 1 - 3 tell us that RENAME should return NFS4ERR_FILE_OPEN only when the target object is a file that is currently open. If the target is a directory, some other status must be returned. Generally I expect that a delegation recall will be triggered in some of these circumstances. In other cases, the VFS might return -EBUSY for other reasons, and NFSD has to ensure that errno does not leak to clients as a status code that is not permitted by spec. There are some error flows where the target dentry hasn't been found yet. The default value for @type therefore is S_IFDIR to return an alternate status code in those cases. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10NFSD: Never return NFS4ERR_FILE_OPEN when removing a directoryChuck Lever
RFC 8881 Section 18.25.4 paragraph 5 tells us that the server should return NFS4ERR_FILE_OPEN only if the target object is an opened file. This suggests that returning this status when removing a directory will confuse NFS clients. This is a version-specific issue; nfsd_proc_remove/rmdir() and nfsd3_proc_remove/rmdir() already return nfserr_access as appropriate. Unfortunately there is no quick way for nfsd4_remove() to determine whether the target object is a file or not, so the check is done in in nfsd_unlink() for now. Reported-by: Trond Myklebust <trondmy@hammerspace.com> Fixes: 466e16f0920f ("nfsd: check for EBUSY from vfs_rmdir/vfs_unink.") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10NFSD: nfsd_unlink() clobbers non-zero status returned from fh_fill_pre_attrs()Chuck Lever
If fh_fill_pre_attrs() returns a non-zero status, the error flow takes it through out_unlock, which then overwrites the returned status code with err = nfserrno(host_err); Fixes: a332018a91c4 ("nfsd: handle failure to collect pre/post-op attrs more sanely") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: remove the redundant mapping of nfserr_mlinkLi Lingfeng
There two mappings of nfserr_mlink in nfs_errtbl. Remove one of them. Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10NFSD: Skip sending CB_RECALL_ANY when the backchannel isn't upChuck Lever
NFSD sends CB_RECALL_ANY to clients when the server is low on memory or that client has a large number of delegations outstanding. We've seen cases where NFSD attempts to send CB_RECALL_ANY requests to disconnected clients, and gets confused. These calls never go anywhere if a backchannel transport to the target client isn't available. Before the server can send any backchannel operation, the client has to connect first and then do a BIND_CONN_TO_SESSION. This patch doesn't address the root cause of the confusion, but there's no need to queue up these optional operations if they can't go anywhere. Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: adjust WARN_ON_ONCE in revoke_delegationOlga Kornievskaia
A WARN_ON_ONCE() is added to revoke delegations to make sure that the state has been marked for revocation. However, that's only true for 4.1+ stateids. For 4.0 stateids, in unhash_delegation_locked() the sc_status is set to SC_STATUS_CLOSED. Modify the check to reflect it, otherwise a WARN_ON_ONCE is erronously triggered. Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10nfsd: fix management of listener transportsOlga Kornievskaia
Currently, when no active threads are running, a root user using nfsdctl command can try to remove a particular listener from the list of previously added ones, then start the server by increasing the number of threads, it leads to the following problem: [ 158.835354] refcount_t: addition on 0; use-after-free. [ 158.835603] WARNING: CPU: 2 PID: 9145 at lib/refcount.c:25 refcount_warn_saturate+0x160/0x1a0 [ 158.836017] Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd grace overlay isofs uinput snd_seq_dummy snd_hrtimer nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables qrtr sunrpc vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops uvc videobuf2_v4l2 videodev videobuf2_common snd_hda_codec_generic mc e1000e snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore sg loop dm_multipath dm_mod nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs libcrc32c crct10dif_ce ghash_ce vmwgfx sha2_ce sha256_arm64 sr_mod sha1_ce cdrom nvme drm_client_lib drm_ttm_helper ttm nvme_core drm_kms_helper nvme_auth drm fuse [ 158.840093] CPU: 2 UID: 0 PID: 9145 Comm: nfsd Kdump: loaded Tainted: G B W 6.13.0-rc6+ #7 [ 158.840624] Tainted: [B]=BAD_PAGE, [W]=WARN [ 158.840802] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24006586.BA64.2406042154 06/04/2024 [ 158.841220] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 158.841563] pc : refcount_warn_saturate+0x160/0x1a0 [ 158.841780] lr : refcount_warn_saturate+0x160/0x1a0 [ 158.842000] sp : ffff800089be7d80 [ 158.842147] x29: ffff800089be7d80 x28: ffff00008e68c148 x27: ffff00008e68c148 [ 158.842492] x26: ffff0002e3b5c000 x25: ffff600011cd1829 x24: ffff00008653c010 [ 158.842832] x23: ffff00008653c000 x22: 1fffe00011cd1829 x21: ffff00008653c028 [ 158.843175] x20: 0000000000000002 x19: ffff00008653c010 x18: 0000000000000000 [ 158.843505] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 158.843836] x14: 0000000000000000 x13: 0000000000000001 x12: ffff600050a26493 [ 158.844143] x11: 1fffe00050a26492 x10: ffff600050a26492 x9 : dfff800000000000 [ 158.844475] x8 : 00009fffaf5d9b6e x7 : ffff000285132493 x6 : 0000000000000001 [ 158.844823] x5 : ffff000285132490 x4 : ffff600050a26493 x3 : ffff8000805e72bc [ 158.845174] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000098588000 [ 158.845528] Call trace: [ 158.845658] refcount_warn_saturate+0x160/0x1a0 (P) [ 158.845894] svc_recv+0x58c/0x680 [sunrpc] [ 158.846183] nfsd+0x1fc/0x348 [nfsd] [ 158.846390] kthread+0x274/0x2f8 [ 158.846546] ret_from_fork+0x10/0x20 [ 158.846714] ---[ end trace 0000000000000000 ]--- nfsd_nl_listener_set_doit() would manipulate the list of transports of server's sv_permsocks and close the specified listener but the other list of transports (server's sp_xprts list) would not be changed leading to the problem above. Instead, determined if the nfsdctl is trying to remove a listener, in which case, delete all the existing listener transports and re-create all-but-the-removed ones. Fixes: 16a471177496 ("NFSD: add listener-{set,get} netlink command") Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10lockd: add netlink control interfaceJeff Layton
The legacy rpc.nfsd tool will set the nlm_grace_period if the NFSv4 grace period is set. nfsdctl is missing this functionality, so add a new netlink control interface for lockd that it can use. For now, it only allows setting the grace period, and the tcp and udp listener ports. lockd currently uses module parameters and sysctls for configuration, so all of its settings are global. With this change, lockd now tracks these values on a per-net-ns basis. It will only fall back to using the global values if any of them are 0. Finally, as a backward compatibility measure, if updating the nlm settings in the init_net namespace, also update the legacy global values to match. Link: https://issues.redhat.com/browse/RHEL-71698 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-08Merge tag 'mm-hotfixes-stable-2025-03-08-16-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "33 hotfixes. 24 are cc:stable and the remainder address post-6.13 issues or aren't considered necessary for -stable kernels. 26 are for MM and 7 are for non-MM. - "mm: memory_failure: unmap poisoned folio during migrate properly" from Ma Wupeng fixes a couple of two year old bugs involving the migration of hwpoisoned folios. - "selftests/damon: three fixes for false results" from SeongJae Park fixes three one year old bugs in the SAMON selftest code. The remainder are singletons and doubletons. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2025-03-08-16-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (33 commits) mm/page_alloc: fix uninitialized variable rapidio: add check for rio_add_net() in rio_scan_alloc_net() rapidio: fix an API misues when rio_add_net() fails MAINTAINERS: .mailmap: update Sumit Garg's email address Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone" mm: fix finish_fault() handling for large folios mm: don't skip arch_sync_kernel_mappings() in error paths mm: shmem: remove unnecessary warning in shmem_writepage() userfaultfd: fix PTE unmapping stack-allocated PTE copies userfaultfd: do not block on locking a large folio with raised refcount mm: zswap: use ATOMIC_LONG_INIT to initialize zswap_stored_pages mm: shmem: fix potential data corruption during shmem swapin mm: fix kernel BUG when userfaultfd_move encounters swapcache selftests/damon/damon_nr_regions: sort collected regiosn before checking with min/max boundaries selftests/damon/damon_nr_regions: set ops update for merge results check to 100ms selftests/damon/damos_quota: make real expectation of quota exceeds include/linux/log2.h: mark is_power_of_2() with __always_inline NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback mm, swap: avoid BUG_ON in relocate_cluster() mm: swap: use correct step in loop to wait all clusters in wait_for_allocation() ...
2025-03-07Merge tag 'execve-v6.14-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull core dumping fix from Kees Cook: - Only sort VMAs when core_sort_vma sysctl is set * tag 'execve-v6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: coredump: Only sort VMAs when core_sort_vma sysctl is set
2025-03-07Merge tag 'for-6.14-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix leaked extent map after error when reading chunks - replace use of deprecated strncpy - in zoned mode, fixed range when ulocking extent range, causing a hang * tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix a leaked chunk map issue in read_one_chunk() btrfs: replace deprecated strncpy() with strscpy() btrfs: zoned: fix extent range end unlock in cow_file_range()
2025-03-06fs/pipe: add simpler helpers for common casesLinus Torvalds
The fix to atomically read the pipe head and tail state when not holding the pipe mutex has caused a number of headaches due to the size change of the involved types. It turns out that we don't have _that_ many places that access these fields directly and were affected, but we have more than we strictly should have, because our low-level helper functions have been designed to have intimate knowledge of how the pipes work. And as a result, that random noise of direct 'pipe->head' and 'pipe->tail' accesses makes it harder to pinpoint any actual potential problem spots remaining. For example, we didn't have a "is the pipe full" helper function, but instead had a "given these pipe buffer indexes and this pipe size, is the pipe full". That's because some low-level pipe code does actually want that much more complicated interface. But most other places literally just want a "is the pipe full" helper, and not having it meant that those places ended up being unnecessarily much too aware of this all. It would have been much better if only the very core pipe code that cared had been the one aware of this all. So let's fix it - better late than never. This just introduces the trivial wrappers for "is this pipe full or empty" and to get how many pipe buffers are used, so that instead of writing if (pipe_full(pipe->head, pipe->tail, pipe->max_usage)) the places that literally just want to know if a pipe is full can just say if (pipe_is_full(pipe)) instead. The existing trivial cases were converted with a 'sed' script. This cuts down on the places that access pipe->head and pipe->tail directly outside of the pipe code (and core splice code) quite a lot. The splice code in particular still revels in doing the direct low-level accesses, and the fuse fuse_dev_splice_write() code also seems a bit unnecessarily eager to go very low-level, but it's at least a bit better than it used to be. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-03-06Merge tag 'bcachefs-2025-03-06' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: - Fix a compatibility issue: we shouldn't be setting incompat feature bits unless explicitly requested - Fix another bug where the journal alloc/resize path could spuriously fail with -BCH_ERR_open_buckets_empty - Copygc shouldn't run on read-only devices: fragmentation isn't an issue if we're not currently writing to a given device, and it may not have anywhere to move the data to * tag 'bcachefs-2025-03-06' of git://evilpiepirate.org/bcachefs: bcachefs: copygc now skips non-rw devices bcachefs: Fix bch2_dev_journal_alloc() spuriously failing bcachefs: Don't set BCH_FEATURE_incompat_version_field unless requested
2025-03-06bcachefs: copygc now skips non-rw devicesKent Overstreet
There's no point in doing copygc on non-rw devices: the fragmentation doesn't matter if we're not writing to them, and we may not have anywhere to put the data on our other devices. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-06bcachefs: Fix bch2_dev_journal_alloc() spuriously failingKent Overstreet
Previously, we fixed journal resize spuriousl failing with -BCH_ERR_open_buckets_empty, but initial journal allocation was missed because it didn't invoke the "block on allocator" loop at all. Factor out the "loop on allocator" code to fix that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>