summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
13 daysbtrfs: fix race between async reclaim worker and close_ctree()Filipe Manana
Syzbot reported an assertion failure due to an attempt to add a delayed iput after we have set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state: WARNING: CPU: 0 PID: 65 at fs/btrfs/inode.c:3420 btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Modules linked in: CPU: 0 UID: 0 PID: 65 Comm: kworker/u8:4 Not tainted 6.15.0-next-20250530-syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Workqueue: btrfs-endio-write btrfs_work_helper RIP: 0010:btrfs_add_delayed_iput+0x2f8/0x370 fs/btrfs/inode.c:3420 Code: 4e ad 5d (...) RSP: 0018:ffffc9000213f780 EFLAGS: 00010293 RAX: ffffffff83c635b7 RBX: ffff888058920000 RCX: ffff88801c769e00 RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000 RBP: 0000000000000001 R08: ffff888058921b67 R09: 1ffff1100b12436c R10: dffffc0000000000 R11: ffffed100b12436d R12: 0000000000000001 R13: dffffc0000000000 R14: ffff88807d748000 R15: 0000000000000100 FS: 0000000000000000(0000) GS:ffff888125c53000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00002000000bd038 CR3: 000000006a142000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> btrfs_put_ordered_extent+0x19f/0x470 fs/btrfs/ordered-data.c:635 btrfs_finish_one_ordered+0x11d8/0x1b10 fs/btrfs/inode.c:3312 btrfs_work_helper+0x399/0xc20 fs/btrfs/async-thread.c:312 process_one_work kernel/workqueue.c:3238 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3321 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3402 kthread+0x70e/0x8a0 kernel/kthread.c:464 ret_from_fork+0x3fc/0x770 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> This can happen due to a race with the async reclaim worker like this: 1) The async metadata reclaim worker enters shrink_delalloc(), which calls btrfs_start_delalloc_roots() with an nr_pages argument that has a value less than LONG_MAX, and that in turn enters start_delalloc_inodes(), which sets the local variable 'full_flush' to false because wbc->nr_to_write is less than LONG_MAX; 2) There it finds inode X in a root's delalloc list, grabs a reference for inode X (with igrab()), and triggers writeback for it with filemap_fdatawrite_wbc(), which creates an ordered extent for inode X; 3) The unmount sequence starts from another task, we enter close_ctree() and we flush the workqueue fs_info->endio_write_workers, which waits for the ordered extent for inode X to complete and when dropping the last reference of the ordered extent, with btrfs_put_ordered_extent(), when we call btrfs_add_delayed_iput() we don't add the inode to the list of delayed iputs because it has a refcount of 2, so we decrement it to 1 and return; 4) Shortly after at close_ctree() we call btrfs_run_delayed_iputs() which runs all delayed iputs, and then we set BTRFS_FS_STATE_NO_DELAYED_IPUT in the fs_info state; 5) The async reclaim worker, after calling filemap_fdatawrite_wbc(), now calls btrfs_add_delayed_iput() for inode X and there we trigger an assertion failure since the fs_info state has the flag BTRFS_FS_STATE_NO_DELAYED_IPUT set. Fix this by setting BTRFS_FS_STATE_NO_DELAYED_IPUT only after we wait for the async reclaim workers to finish, after we call cancel_work_sync() for them at close_ctree(), and by running delayed iputs after wait for the reclaim workers to finish and before setting the bit. This race was recently introduced by commit 19e60b2a95f5 ("btrfs: add extra warning if delayed iput is added when it's not allowed"). Without the new validation at btrfs_add_delayed_iput(), this described scenario was safe because close_ctree() later calls btrfs_commit_super(). That will run any final delayed iputs added by reclaim workers in the window between the btrfs_run_delayed_iputs() and the the reclaim workers being shut down. Reported-by: syzbot+0ed30ad435bf6f5b7a42@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6840481c.a00a0220.d4325.000c.GAE@google.com/T/#u Fixes: 19e60b2a95f5 ("btrfs: add extra warning if delayed iput is added when it's not allowed") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: fix assertion when building free space treeFilipe Manana
When building the free space tree with the block group tree feature enabled, we can hit an assertion failure like this: BTRFS info (device loop0 state M): rebuilding free space tree assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102 ------------[ cut here ]------------ kernel BUG at fs/btrfs/free-space-tree.c:1102! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 sp : ffff8000a4ce7600 x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8 x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001 x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160 x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0 x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00 x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0 x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e Call trace: populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P) btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337 btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 Code: f0047182 91178042 528089c3 9771d47b (d4210000) ---[ end trace 0000000000000000 ]--- This happens because we are processing an empty block group, which has no extents allocated from it, there are no items for this block group, including the block group item since block group items are stored in a dedicated tree when using the block group tree feature. It also means this is the block group with the highest start offset, so there are no higher keys in the extent root, hence btrfs_search_slot_for_read() returns 1 (no higher key found). Fix this by asserting 'ret' is 0 only if the block group tree feature is not enabled, in which case we should find a block group item for the block group since it's stored in the extent root and block group item keys are greater than extent item keys (the value for BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively). In case 'ret' is 1, we just need to add a record to the free space tree which spans the whole block group, and we can achieve this by making 'ret == 0' as the while loop's condition. Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: don't silently ignore unexpected extent type when replaying logFilipe Manana
If there's an unexpected (invalid) extent type, we just silently ignore it. This means a corruption or some bug somewhere, so instead return -EUCLEAN to the caller, making log replay fail, and print an error message with relevant information. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: fix invalid inode pointer dereferences during log replayFilipe Manana
In a few places where we call read_one_inode(), if we get a NULL pointer we end up jumping into an error path, or fallthrough in case of __add_inode_ref(), where we then do something like this: iput(&inode->vfs_inode); which results in an invalid inode pointer that triggers an invalid memory access, resulting in a crash. Fix this by making sure we don't do such dereferences. Fixes: b4c50cbb01a1 ("btrfs: return a btrfs_inode from read_one_inode()") CC: stable@vger.kernel.org # 6.15+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: fix double unlock of buffer_tree xarray when releasing subpage ebFilipe Manana
If we break out of the loop because an extent buffer doesn't have the bit EXTENT_BUFFER_TREE_REF set, we end up unlocking the xarray twice, once before we tested for the bit and break out of the loop, and once again after the loop. Fix this by testing the bit and exiting before unlocking the xarray. The time spent testing the bit is negligible and it's not worth trying to do that outside the critical section delimited by the xarray lock due to the code complexity required to avoid it (like using a local boolean variable to track whether the xarray is locked or not). The xarray unlock only needs to be done before calling release_extent_buffer(), as that needs to lock the xarray (through xa_cmpxchg_irq()) and does a more significant amount of work. Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/linux-btrfs/aDRNDU0GM1_D4Xnw@stanley.mountain/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: update superblock's device bytes_used when dropping chunkMark Harmstone
Each superblock contains a copy of the device item for that device. In a transaction which drops a chunk but doesn't create any new ones, we were correctly updating the device item in the chunk tree but not copying over the new bytes_used value to the superblock. This can be seen by doing the following: # dd if=/dev/zero of=test bs=4096 count=2621440 # mkfs.btrfs test # mount test /root/temp # cd /root/temp # for i in {00..10}; do dd if=/dev/zero of=$i bs=4096 count=32768; done # sync # rm * # sync # btrfs balance start -dusage=0 . # sync # cd # umount /root/temp # btrfs check test For btrfs-check to detect this, you will also need my patch at https://github.com/kdave/btrfs-progs/pull/991. Change btrfs_remove_dev_extents() so that it adds the devices to the fs_info->post_commit_list if they're not there already. This causes btrfs_commit_device_sizes() to be called, which updates the bytes_used value in the superblock. Fixes: bbbf7243d62d ("btrfs: combine device update operations during transaction commit") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: fix a race between renames and directory loggingFilipe Manana
We have a race between a rename and directory inode logging that if it happens and we crash/power fail before the rename completes, the next time the filesystem is mounted, the log replay code will end up deleting the file that was being renamed. This is best explained following a step by step analysis of an interleaving of steps that lead into this situation. Consider the initial conditions: 1) We are at transaction N; 2) We have directories A and B created in a past transaction (< N); 3) We have inode X corresponding to a file that has 2 hardlinks, one in directory A and the other in directory B, so we'll name them as "A/foo_link1" and "B/foo_link2". Both hard links were persisted in a past transaction (< N); 4) We have inode Y corresponding to a file that as a single hard link and is located in directory A, we'll name it as "A/bar". This file was also persisted in a past transaction (< N). The steps leading to a file loss are the following and for all of them we are under transaction N: 1) Link "A/foo_link1" is removed, so inode's X last_unlink_trans field is updated to N, through btrfs_unlink() -> btrfs_record_unlink_dir(); 2) Task A starts a rename for inode Y, with the goal of renaming from "A/bar" to "A/baz", so we enter btrfs_rename(); 3) Task A inserts the new BTRFS_INODE_REF_KEY for inode Y by calling btrfs_insert_inode_ref(); 4) Because the rename happens in the same directory, we don't set the last_unlink_trans field of directoty A's inode to the current transaction id, that is, we don't cal btrfs_record_unlink_dir(); 5) Task A then removes the entries from directory A (BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items) when calling __btrfs_unlink_inode() (actually the dir index item is added as a delayed item, but the effect is the same); 6) Now before task A adds the new entry "A/baz" to directory A by calling btrfs_add_link(), another task, task B is logging inode X; 7) Task B starts a fsync of inode X and after logging inode X, at btrfs_log_inode_parent() it calls btrfs_log_all_parents(), since inode X has a last_unlink_trans value of N, set at in step 1; 8) At btrfs_log_all_parents() we search for all parent directories of inode X using the commit root, so we find directories A and B and log them. Bu when logging direct A, we don't have a dir index item for inode Y anymore, neither the old name "A/bar" nor for the new name "A/baz" since the rename has deleted the old name but has not yet inserted the new name - task A hasn't called yet btrfs_add_link() to do that. Note that logging directory A doesn't fallback to a transaction commit because its last_unlink_trans has a lower value than the current transaction's id (see step 4); 9) Task B finishes logging directories A and B and gets back to btrfs_sync_file() where it calls btrfs_sync_log() to persist the log tree; 10) Task B successfully persisted the log tree, btrfs_sync_log() completed with success, and a power failure happened. We have a log tree without any directory entry for inode Y, so the log replay code deletes the entry for inode Y, name "A/bar", from the subvolume tree since it doesn't exist in the log tree and the log tree is authorative for its index (we logged a BTRFS_DIR_LOG_INDEX_KEY item that covers the index range for the dentry that corresponds to "A/bar"). Since there's no other hard link for inode Y and the log replay code deletes the name "A/bar", the file is lost. The issue wouldn't happen if task B synced the log only after task A called btrfs_log_new_name(), which would update the log with the new name for inode Y ("A/bar"). Fix this by pinning the log root during renames before removing the old directory entry, and unpinning after btrfs_log_new_name() is called. Fixes: 259c4b96d78d ("btrfs: stop doing unnecessary log updates during a rename") CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: scrub: add prefix for the error messagesAnand Jain
Add a "scrub: " prefix to all messages logged by scrub so that it's easy to filter them from dmesg for analysis. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: warn if leaking delayed_nodes in btrfs_put_root()Leo Martins
Add a warning for leaked delayed_nodes when putting a root. We currently do this for inodes, but not delayed_nodes. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Remove the changelog from the commit message. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: fix delayed ref refcount leak in debug assertionLeo Martins
If the delayed_root is not empty we are increasing the number of references to a delayed_node without decreasing it, causing a leak. Fix by decrementing the delayed_node reference count. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> [ Remove the changelog from the commit message. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
13 daysbtrfs: include root in error message when unlinking inodeFilipe Manana
To help debugging include the root number in the error message, and since this is a critical error that implies a metadata inconsistency and results in a transaction abort change the log message level from "info" to "critical", which is a much better fit. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
14 dayserofs: refuse crafted out-of-file-range encoded extentsGao Xiang
Crafted encoded extents could record out-of-range `lstart`, which should not happen in normal cases. It caused an iomap_iter_done() complaint [1] reported by syzbot. [1] https://lore.kernel.org/r/684cb499.a00a0220.c6bd7.0010.GAE@google.com Fixes: 1d191b4ca51d ("erofs: implement encoded extent metadata") Reported-and-tested-by: syzbot+d8f000c609f05f52d9b5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d8f000c609f05f52d9b5 Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250619032839.2642193-1-hsiangkao@linux.alibaba.com
14 daysMerge tag '6.16-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fixes from Steve French: - Fix alternate data streams bug - Important fix for null pointer deref with Kerberos authentication - Fix oops in smbdirect (RDMA) in free_transport * tag '6.16-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: handle set/get info file for streamed file ksmbd: fix null pointer dereference in destroy_previous_session ksmbd: add free_transport ops in ksmbd connection
2025-06-18f2fs: fix to zero post-eof pageChao Yu
fstest reports a f2fs bug: generic/363 42s ... [failed, exit status 1]- output mismatch (see /share/git/fstests/results//generic/363.out.bad) --- tests/generic/363.out 2025-01-12 21:57:40.271440542 +0800 +++ /share/git/fstests/results//generic/363.out.bad 2025-05-19 19:55:58.000000000 +0800 @@ -1,2 +1,78 @@ QA output created by 363 fsx -q -S 0 -e 1 -N 100000 +READ BAD DATA: offset = 0xd6fb, size = 0xf044, fname = /mnt/f2fs/junk +OFFSET GOOD BAD RANGE +0x1540d 0x0000 0x2a25 0x0 +operation# (mod 256) for the bad data may be 37 +0x1540e 0x0000 0x2527 0x1 ... (Run 'diff -u /share/git/fstests/tests/generic/363.out /share/git/fstests/results//generic/363.out.bad' to see the entire diff) Ran: generic/363 Failures: generic/363 Failed 1 of 1 tests The root cause is user can update post-eof page via mmap [1], however, f2fs missed to zero post-eof page in below operations, so, once it expands i_size, then it will include dummy data locates previous post-eof page, so during below operations, we need to zero post-eof page. Operations which can include dummy data after previous i_size after expanding i_size: - write - mapwrite [1] - truncate - fallocate * preallocate * zero_range * insert_range * collapse_range - clone_range (doesn’t support in f2fs) - copy_range (doesn’t support in f2fs) [1] https://man7.org/linux/man-pages/man2/mmap.2.html 'BUG section' Cc: stable@kernel.org Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-18cifs: Remove duplicate fattr->cf_dtype assignment from wsl_to_fattr() functionPali Rohár
Commit 8bd25b61c5a5 ("smb: client: set correct d_type for reparse DFS/DFSR and mount point") deduplicated assignment of fattr->cf_dtype member from all places to end of the function cifs_reparse_point_to_fattr(). The only one missing place which was not deduplicated is wsl_to_fattr(). Fix it. Fixes: 8bd25b61c5a5 ("smb: client: set correct d_type for reparse DFS/DFSR and mount point") Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-18smb: fix secondary channel creation issue with kerberos by populating ↵Bharath SM
hostname when adding channels When mounting a share with kerberos authentication with multichannel support, share mounts correctly, but fails to create secondary channels. This occurs because the hostname is not populated when adding the channels. The hostname is necessary for the userspace cifs.upcall program to retrieve the required credentials and pass it back to kernel, without hostname secondary channels fails establish. Cc: stable@vger.kernel.org Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Bharath SM <bharathsm@microsoft.com> Reported-by: xfuren <xfuren@gmail.com> Link: https://bugzilla.samba.org/show_bug.cgi?id=15824 Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-18erofs: impersonate the opener's credentials when accessing backing fileTatsuyuki Ishi
Previously, file operations on a file-backed mount used the current process' credentials to access the backing FD. Attempting to do so on Android lead to SELinux denials, as ACL rules on the backing file (e.g. /system/apex/foo.apex) is restricted to a small set of process. Arguably, this error is redundant and leaking implementation details, as access to files on a mount is already ACL'ed by path. Instead, override to use the opener's cred when accessing the backing file. This makes the behavior similar to a loop-backed mount, which uses kworker cred when accessing the backing file and does not cause SELinux denials. Signed-off-by: Tatsuyuki Ishi <ishitatsuyuki@google.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20250612-b4-erofs-impersonate-v1-1-8ea7d6f65171@google.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-06-17bcachefs: Fix missing newlines before eroKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17bcachefs: fix spurious error in read_btree_roots()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17bcachefs: fsck: Fix oops in key_visible_in_snapshot()Kent Overstreet
The normal fsck code doesn't call key_visible_in_snapshot() with an empty list of snapshot IDs seen (the current snapshot ID will always be on the list), but str_hash_repair_key() -> bch2_get_snapshot_overwrites() can, and that's totally fine as long as we check for it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17bcachefs: fsck: fix unhandled restart in topology repairKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17ksmbd: handle set/get info file for streamed fileNamjae Jeon
The bug only appears when: - windows 11 copies a file that has an alternate data stream - streams_xattr is enabled on the share configuration. Microsoft Edge adds a ZoneIdentifier data stream containing the URL for files it downloads. Another way to create a test file: - open cmd.exe - echo "hello from default data stream" > hello.txt - echo "hello again from ads" > hello.txt:ads.txt If you open the file using notepad, we'll see the first message. If you run "notepad hello.txt:ads.txt" in cmd.exe, we should see the second message. dir /s /r should least all streams for the file. The truncation happens because the windows 11 client sends a SetInfo/EndOfFile message on the ADS, but it is instead applied on the main file, because we don't check fp->stream. When receiving set/get info file for a stream file, Change to process requests using stream position and size. Truncate is unnecessary for stream files, so we skip set_file_allocation_info and set_end_of_file_info operations. Reported-by: Marios Makassikis <mmakassikis@freebox.fr> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-17ksmbd: fix null pointer dereference in destroy_previous_sessionNamjae Jeon
If client set ->PreviousSessionId on kerberos session setup stage, NULL pointer dereference error will happen. Since sess->user is not set yet, It can pass the user argument as NULL to destroy_previous_session. sess->user will be set in ksmbd_krb5_authenticate(). So this patch move calling destroy_previous_session() after ksmbd_krb5_authenticate(). Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-27391 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-17ksmbd: add free_transport ops in ksmbd connectionNamjae Jeon
free_transport function for tcp connection can be called from smbdirect. It will cause kernel oops. This patch add free_transport ops in ksmbd connection, and add each free_transports for tcp and smbdirect. Fixes: 21a4e47578d4 ("ksmbd: fix use-after-free in __smb2_lease_break_noti()") Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-17bcachefs: fsck: Fix check_directory_structure when no check_direntsKent Overstreet
check_directory_structure runs after check_dirents, so it expects that it won't see any inodes with missing backpointers - normally. But online fsck can't run check_dirents yet, or the user might only be running a specific pass, so we need to be careful that this isn't an error. If an inode is unreachable, that's handled by a separate pass. Also, add a new 'bch2_inode_has_backpointer()' helper, since we were doing this inconsistently. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-17bcachefs: Fix restart handling in btree_node_scrub_work()Kent Overstreet
btree node scrub was sometimes failing to rewrite nodes with errors; bch2_btree_node_rewrite() can return a transaction restart and we weren't checking - the lockrestart_do() needs to wrap the entire operation. And there's a better helper it should've been using, bch2_btree_node_rewrite_key(), which makes all this more convenient. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Fix bch2_read_bio_to_text()Kent Overstreet
We can only pass negative error codes to bch2_err_str(); if it's a positive integer it's not an error and we trip an assert. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: Fix check_path_loop() + snapshotsKent Overstreet
A path exists in a particular snapshot: we should do the pathwalk in the snapshot ID of the inode we started from, _not_ change snapshot ID as we walk inodes and dirents. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: check_subdir_count logs pathKent Overstreet
We can easily go from inode number -> path now, which makes for more useful log messages. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: additional diagnostics for reattach_inode()Kent Overstreet
Log the inode's new path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: check_directory_structure runs in reverse orderKent Overstreet
When we find a directory connectivity problem, we should do the repair in the oldest snapshot that has the issue - so that we don't end up duplicating work or making a real mess of things. Oldest snapshot IDs have the highest integer value, so - just walk inodes in reverse order. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: Fix reattach_inode() for subvol rootsKent Overstreet
bch_subvolume.fs_path_parent needs to be updated as well, it should match inode.bi_parent_subvol. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: Fix remove_backpointer() for subvol rootsKent Overstreet
The dirent will be in a different snapshot if the inode is a subvolume root. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: Print path when we find a subvol loopKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Fix __bch2_inum_to_path() when crossing subvol boundariesKent Overstreet
The bch2_subvolume_get_snapshot() call needs to happen before the dirent lookup - the dirent is in the parent subvolume. Also, check for loops. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Call bch2_fs_init_rw() early if we'll be going rwKent Overstreet
kthread creation checks for pending signals, which is _very_ annoying if we have to do a long recovery and don't go rw until we've done significant work. Check if we'll be going rw and pre-allocate kthreads/workqueues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: fsck: Improve check_key_has_inode()Kent Overstreet
Print out more info when we find a key (extent, dirent, xattr) for a missing inode - was there a good inode in an older snapshot, full(ish) list of keys for that missing inode, so we can make better decisions on how to repair. If it looks like it should've been deleted, autofix it. If we ever hit the non-autofix cases, we'll want to write more repair code (possibly reconstituting the inode). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: don't return fsck_fix for unfixable node errors in __btree_errBharadwaj Raju
After cd3cdb1ef706 ("Single err message for btree node reads"), all errors caused __btree_err to return -BCH_ERR_fsck_fix no matter what the actual error type was if the recovery pass was scanning for btree nodes. This lead to the code continuing despite things like bad node formats when they earlier would have caused a jump to fsck_err, because btree_err only jumps when the return from __btree_err does not match fsck_fix. Ultimately this lead to undefined behavior by attempting to unpack a key based on an invalid format. Make only errors of type -BCH_ERR_btree_node_read_err_fixable cause __btree_err to return -BCH_ERR_fsck_fix when scanning for btree nodes. Reported-by: syzbot+cfd994b9cdf00446fd54@syzkaller.appspotmail.com Fixes: cd3cdb1ef706 ("bcachefs: Single err message for btree node reads") Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Fix pool->alloc NULL pointer dereferenceAlan Huang
btree_interior_update_pool has not been initialized before the filesystem becomes read-write, thus mempool_alloc in bch2_btree_update_start will trigger pool->alloc NULL pointer dereference in mempool_alloc_noprof Reported-by: syzbot+2f3859bd28f20fa682e6@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Move bset size check before csum checkAlan Huang
In syzbot's crash, the bset's u64s is larger than the btree node. Reported-by: syzbot+bfaeaa8e26281970158d@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: mark more errors autofixKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: Kill unused tracepointsKent Overstreet
Dead code cleanup. Link: https://lore.kernel.org/linux-bcachefs/20250612224059.39fddd07@batman.local.home/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16bcachefs: opts.journal_rewindKent Overstreet
Add a mount option for rewinding the journal, bringing the entire filesystem to where it was at a previous point in time. This is for extreme disaster recovery scenarios - it's not intended as an undelete operation. The option takes a journal sequence number; the desired sequence number can be determined with 'bcachefs list_journal' Caveats: - The 'journal_transaction_names' option must have been enabled (it's on by default). The option controls emitting of extra debug info in the journal, so we can see what individual transactions were doing; It also enables journalling of keys being overwritten, which is what we rely on here. - A full fsck run will be automatically triggered since alloc info will be inconsistent. Only leaf node updates to non-alloc btrees are rewound, since rewinding interior btree updates isn't possible or desirable. - We can't do anything about data that was deleted and overwritten. Lots of metadata updates after the point in time we're rewinding to shouldn't cause a problem, since we segragate data and metadata allocations (this is in order to make repair by btree node scan practical on larger filesystems; there's a small 64-bit per device bitmap in the superblock of device ranges with btree nodes, and we try to keep this small). However, having discards enabled will cause problems, since buckets are discarded as soon as they become empty (this is why we don't implement fstrim: we don't need it). Hopefully, this feature will be a one-off thing that's never used again: this was implemented for recovering from the "vfs i_nlink 0 -> subvol deletion" bug, and that bug was unusually disastrous and additional safeguards have since been implemented. But if it does turn out that we need this more in the future, I'll have to implement an option so that empty buckets aren't discarded immediately - lagging by perhaps 1% of device capacity. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-16x86,fs/resctrl: Remove inappropriate references to cacheinfo in the resctrl ↵Qinyun Tan
subsystem In the resctrl subsystem's Sub-NUMA Cluster (SNC) mode, the rdt_mon_domain structure representing a NUMA node relies on the cacheinfo interface (rdt_mon_domain::ci) to store L3 cache information (e.g., shared_cpu_map) for monitoring. The L3 cache information of a SNC NUMA node determines which domains are summed for the "top level" L3-scoped events. rdt_mon_domain::ci is initialized using the first online CPU of a NUMA node. When this CPU goes offline, its shared_cpu_map is cleared to contain only the offline CPU itself. Subsequently, attempting to read counters via smp_call_on_cpu(offline_cpu) fails (and error ignored), returning zero values for "top-level events" without any error indication. Replace the cacheinfo references in struct rdt_mon_domain and struct rmid_read with the cacheinfo ID (a unique identifier for the L3 cache). rdt_domain_hdr::cpu_mask contains the online CPUs associated with that domain. When reading "top-level events", select a CPU from rdt_domain_hdr::cpu_mask and utilize its L3 shared_cpu_map to determine valid CPUs for reading RMID counter via the MSR interface. Considering all CPUs associated with the L3 cache improves the chances of picking a housekeeping CPU on which the counter reading work can be queued, avoiding an unnecessary IPI. Fixes: 328ea68874642 ("x86/resctrl: Prepare for new Sub-NUMA Cluster (SNC) monitor files") Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/20250530182053.37502-2-qinyuntan@linux.alibaba.com
2025-06-16Merge tag 'vfs-6.16-rc3.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix a regression in overlayfs caused by reworking the lookup_one*() set of helpers - Make sure that the name of the dentry is printed in overlayfs' mkdir() helper - Add missing iocb values to TRACE_IOCB_STRINGS define - Unlock the superblock during iterate_supers_type(). This was an accidental internal api change - Drop a misleading assert in file_seek_cur_needs_f_lock() helper - Never refuse to return PIDFD_GET_INGO when parent pid is zero That can trivially happen in container scenarios where the parent process might be located in an ancestor pid namespace - Don't revalidate in try_lookup_noperm() as that causes regression for filesystems such as cifs - Fix simple_xattr_list() and reset the err variable after security_inode_listsecurity() got called so as not to confuse userspace about the length of the xattr * tag 'vfs-6.16-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: drop assert in file_seek_cur_needs_f_lock fs: unlock the superblock during iterate_supers_type ovl: fix debug print in case of mkdir error VFS: change try_lookup_noperm() to skip revalidation fs: add missing values to TRACE_IOCB_STRINGS fs/xattr.c: fix simple_xattr_list() ovl: fix regression caused by lookup helpers API changes pidfs: never refuse ppid == 0 in PIDFD_GET_INFO
2025-06-16fs: drop assert in file_seek_cur_needs_f_lockLuis Henriques
The assert in function file_seek_cur_needs_f_lock() can be triggered very easily because there are many users of vfs_llseek() (such as overlayfs) that do their custom locking around llseek instead of relying on fdget_pos(). Just drop the overzealous assertion. Fixes: da06e3c51794 ("fs: don't needlessly acquire f_lock") Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Luis Henriques <luis@igalia.com> Link: https://lore.kernel.org/20250613101111.17716-1-luis@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-15bcachefs: fsck: fix extent past end of inode repairKent Overstreet
Fix the case where we're deleting in a different snapshot and need to emit a whiteout - that requires a regular BTREE_ITER_filter_snapshots iterator. Also, only delete the part of the extent that extents past i_size. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: fsck: fix add_inode()Kent Overstreet
the inode btree uses the offset field for the inum, not the inode field. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Fix snapshot_key_missing_inode_snapshot repairKent Overstreet
When the inode was a whiteout, we were inserting a new whiteout at the wrong (old) snapshot. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-15bcachefs: Fix "now allowing incompatible features" messageKent Overstreet
Check against version_incompat_allowed, not version_incompat. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>