path: root/fs
AgeCommit message (Collapse)Author
2021-06-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "24 patches, based on 4a09d388f2ab382f217a764e6a152b3f614246f6. Subsystems affected by this patch series: mm (thp, vmalloc, hugetlb, memory-failure, and pagealloc), nilfs2, kthread, MAINTAINERS, and mailmap" * emailed patches from Andrew Morton <>: (24 commits) mailmap: add Marek's other e-mail address and identity without diacritics MAINTAINERS: fix Marek's identity again mm/page_alloc: do bulk array bounds check after checking populated elements mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array mm/hwpoison: do not lock page again when me_huge_page() successfully recovers mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned mm/memory-failure: use a mutex to avoid memory_failure() races mm, futex: fix shared futex pgoff on shmem huge page kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() kthread_worker: split code for canceling the delayed work timer mm/vmalloc: unbreak kasan vmalloc support KVM: s390: prepare for hugepage vmalloc mm/vmalloc: add vmalloc_no_huge nilfs2: fix memory leak in nilfs_sysfs_delete_device_group mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes mm: page_vma_mapped_walk(): get vma_address_end() earlier mm: page_vma_mapped_walk(): use goto instead of while (1) mm: page_vma_mapped_walk(): add a level of indentation mm: page_vma_mapped_walk(): crossing page table boundary ...
2021-06-25Merge tag 'ceph-for-5.13-rc8' of Torvalds
Pull ceph fixes from Ilya Dryomov: "Two regression fixes from the merge window: one in the auth code affecting old clusters and one in the filesystem for proper propagation of MDS request errors. Also included a locking fix for async creates, marked for stable" * tag 'ceph-for-5.13-rc8' of libceph: set global_id as soon as we get an auth ticket libceph: don't pass result into ac->ops->handle_reply() ceph: fix error handling in ceph_atomic_open and ceph_lookup ceph: must hold snap_rwsem when filling inode for async create
2021-06-25Merge tag 'netfs-fixes-20210621' of ↵Linus Torvalds
git:// Pull netfs fixes from David Howells: "This contains patches to fix netfs_write_begin() and afs_write_end() in the following ways: (1) In netfs_write_begin(), extract the decision about whether to skip a page out to its own helper and have that clear around the region to be written, but not clear that region. This requires the filesystem to patch it up afterwards if the hole doesn't get completely filled. (2) Use offset_in_thp() in (1) rather than manually calculating the offset into the page. (3) Due to (1), afs_write_end() now needs to handle short data write into the page by generic_perform_write(). I've adopted an analogous approach to ceph of just returning 0 in this case and letting the caller go round again. It also adds a note that (in the future) the len parameter may extend beyond the page allocated. This is because the page allocation is deferred to write_begin() and that gets to decide what size of THP to allocate." Jeff Layton points out: "The netfs fix in particular fixes a data corruption bug in cephfs" * tag 'netfs-fixes-20210621' of git:// netfs: fix test for whether we can skip read when writing beyond EOF afs: Fix afs_write_end() to handle short writes
2021-06-24nilfs2: fix memory leak in nilfs_sysfs_delete_device_groupPavel Skripkin
My local syzbot instance hit memory leak in nilfs2. The problem was in missing kobject_put() in nilfs_sysfs_delete_device_group(). kobject_del() does not call kobject_cleanup() for passed kobject and it leads to leaking duped kobject name if kobject_put() was not called. Fail log: BUG: memory leak unreferenced object 0xffff8880596171e0 (size 8): comm "syz-executor379", pid 8381, jiffies 4294980258 (age 21.100s) hex dump (first 8 bytes): 6c 6f 6f 70 30 00 00 00 loop0... backtrace: kstrdup+0x36/0x70 mm/util.c:60 kstrdup_const+0x53/0x80 mm/util.c:83 kvasprintf_const+0x108/0x190 lib/kasprintf.c:48 kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289 kobject_add_varg lib/kobject.c:384 [inline] kobject_init_and_add+0xc9/0x160 lib/kobject.c:473 nilfs_sysfs_create_device_group+0x150/0x800 fs/nilfs2/sysfs.c:999 init_nilfs+0xe26/0x12b0 fs/nilfs2/the_nilfs.c:637 Link: Fixes: da7141fb78db ("nilfs2: add /sys/fs/nilfs2/<device> group") Signed-off-by: Pavel Skripkin <> Acked-by: Ryusuke Konishi <> Cc: Michael L. Semon <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-06-22ceph: fix error handling in ceph_atomic_open and ceph_lookupJeff Layton
Commit aa60cfc3f7ee broke the error handling in these functions such that they don't handle non-ENOENT errors from ceph_mdsc_do_request properly. Move the checking of -ENOENT out of ceph_handle_snapdir and into the callers, and if we get a different error, return it immediately. Fixes: aa60cfc3f7ee ("ceph: don't use d_add in ceph_handle_snapdir") Signed-off-by: Jeff Layton <> Reviewed-by: Ilya Dryomov <> Signed-off-by: Ilya Dryomov <>
2021-06-22ceph: must hold snap_rwsem when filling inode for async createJeff Layton
...and add a lockdep assertion for it to ceph_fill_inode(). Cc: # v5.7+ Fixes: 9a8d03ca2e2c3 ("ceph: attempt to do async create when possible") Signed-off-by: Jeff Layton <> Reviewed-by: Ilya Dryomov <> Signed-off-by: Ilya Dryomov <>
2021-06-21netfs: fix test for whether we can skip read when writing beyond EOFJeff Layton
It's not sufficient to skip reading when the pos is beyond the EOF. There may be data at the head of the page that we need to fill in before the write. Add a new helper function that corrects and clarifies the logic of when we can skip reads, and have it only zero out the part of the page that won't have data copied in for the write. Finally, don't set the page Uptodate after zeroing. It's not up to date since the write data won't have been copied in yet. [DH made the following changes: - Prefixed the new function with "netfs_". - Don't call zero_user_segments() for a full-page write. - Altered the beyond-last-page check to avoid a DIV instruction and got rid of then-redundant zero-length file check. ] Fixes: e1b1240c1ff5f ("netfs: Add write_begin helper") Reported-by: Andrew W Elble <> Signed-off-by: Jeff Layton <> Signed-off-by: David Howells <> Reviewed-by: Matthew Wilcox (Oracle) <> cc: Link: Link: # v1 Link: # v2
2021-06-21afs: Fix afs_write_end() to handle short writesDavid Howells
Fix afs_write_end() to correctly handle a short copy into the intended write region of the page. Two things are necessary: (1) If the page is not up to date, then we should just return 0 (ie. indicating a zero-length copy). The loop in generic_perform_write() will go around again, possibly breaking up the iterator into discrete chunks[1]. This is analogous to commit b9de313cf05fe08fa59efaf19756ec5283af672a for ceph. (2) The page should not have been set uptodate if it wasn't completely set up by netfs_write_begin() (this will be fixed in the next patch), so we need to set uptodate here in such a case. Also remove the assertion that was checking that the page was set uptodate since it's now set uptodate if it wasn't already a few lines above. The assertion was from when uptodate was set elsewhere. Changes: v3: Remove the handling of len exceeding the end of the page. Fixes: 3003bbd0697b ("afs: Use the netfs_write_begin() helper") Reported-by: Jeff Layton <> Signed-off-by: David Howells <> Acked-by: Jeff Layton <> Reviewed-by: Matthew Wilcox (Oracle) <> cc: Al Viro <> cc: Link: [1] Link: # v1 Link: # v2
2021-06-18Merge tag 'for-5.13-rc6-tag' of ↵Linus Torvalds
git:// Pull btrfs fix from David Sterba: "One more fix, for a space accounting bug in zoned mode. It happens when a block group is switched back rw->ro and unusable bytes (due to zoned constraints) are subtracted twice. It has user visible effects so I consider it important enough for late -rc inclusion and backport to stable" * tag 'for-5.13-rc6-tag' of git:// btrfs: zoned: fix negative space_info->bytes_readonly
2021-06-18afs: Re-enable freezing once a page fault is interruptedMatthew Wilcox (Oracle)
If a task is killed during a page fault, it does not currently call sb_end_pagefault(), which means that the filesystem cannot be frozen at any time thereafter. This may be reported by lockdep like this: ==================================== WARNING: fsstress/10757 still has locks held! 5.13.0-rc4-build4+ #91 Not tainted ------------------------------------ 1 lock held by fsstress/10757: #0: ffff888104eac530 ( sb_pagefaults as filesystem freezing is modelled as a lock. Fix this by removing all the direct returns from within the function, and using 'ret' to indicate whether we were interrupted or successful. Fixes: 1cf7a1518aef ("afs: Implement shared-writeable mmap") Signed-off-by: Matthew Wilcox (Oracle) <> Signed-off-by: David Howells <> cc: Link: Signed-off-by: Linus Torvalds <>
2021-06-17Merge tag 'fixes_for_v5.13-rc7' of ↵Linus Torvalds
git:// Pull quota and fanotify fixes from Jan Kara: "A fixup finishing disabling of quotactl_path() syscall (I've missed archs using different way to declare syscalls) and a fix of an fd leak in error handling path of fanotify" * tag 'fixes_for_v5.13-rc7' of git:// quota: finish disable quotactl_path syscall fanotify: fix copy_event_to_user() fid error clean up
2021-06-17btrfs: zoned: fix negative space_info->bytes_readonlyNaohiro Aota
Consider we have a using block group on zoned btrfs. |<- ZU ->|<- used ->|<---free--->| `- Alloc offset ZU: Zone unusable Marking the block group read-only will migrate the zone unusable bytes to the read-only bytes. So, we will have this. |<- RO ->|<- used ->|<--- RO --->| RO: Read only When marking it back to read-write, btrfs_dec_block_group_ro() subtracts the above "RO" bytes from the space_info->bytes_readonly. And, it moves the zone unusable bytes back and again subtracts those bytes from the space_info->bytes_readonly, leading to negative bytes_readonly. This can be observed in the output as eg.: Data, single: total=512.00MiB, used=165.21MiB, zone_unusable=16.00EiB Data, single: total=536870912, used=173256704, zone_unusable=18446744073603186688 This commit fixes the issue by reordering the operations. Link: Reported-by: David Sterba <> Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") CC: # 5.12+ Reviewed-by: Johannes Thumshirn <> Signed-off-by: Naohiro Aota <> Signed-off-by: David Sterba <>
2021-06-16mm/hugetlb: expand restore_reserve_on_error functionalityMike Kravetz
The routine restore_reserve_on_error is called to restore reservation information when an error occurs after page allocation. The routine alloc_huge_page modifies the mapping reserve map and potentially the reserve count during allocation. If code calling alloc_huge_page encounters an error after allocation and needs to free the page, the reservation information needs to be adjusted. Currently, restore_reserve_on_error only takes action on pages for which the reserve count was adjusted(HPageRestoreReserve flag). There is nothing wrong with these adjustments. However, alloc_huge_page ALWAYS modifies the reserve map during allocation even if the reserve count is not adjusted. This can cause issues as observed during development of this patch [1]. One specific series of operations causing an issue is: - Create a shared hugetlb mapping Reservations for all pages created by default - Fault in a page in the mapping Reservation exists so reservation count is decremented - Punch a hole in the file/mapping at index previously faulted Reservation and any associated pages will be removed - Allocate a page to fill the hole No reservation entry, so reserve count unmodified Reservation entry added to map by alloc_huge_page - Error after allocation and before instantiating the page Reservation entry remains in map - Allocate a page to fill the hole Reservation entry exists, so decrement reservation count This will cause a reservation count underflow as the reservation count was decremented twice for the same index. A user would observe a very large number for HugePages_Rsvd in /proc/meminfo. This would also likely cause subsequent allocations of hugetlb pages to fail as it would 'appear' that all pages are reserved. This sequence of operations is unlikely to happen, however they were easily reproduced and observed using hacked up code as described in [1]. Address the issue by having the routine restore_reserve_on_error take action on pages where HPageRestoreReserve is not set. In this case, we need to remove any reserve map entry created by alloc_huge_page. A new helper routine vma_del_reservation assists with this operation. There are three callers of alloc_huge_page which do not currently call restore_reserve_on error before freeing a page on error paths. Add those missing calls. [1] Link: Fixes: 96b96a96ddee ("mm/hugetlb: fix huge page reservation leak in private mapping error paths" Signed-off-by: Mike Kravetz <> Reviewed-by: Mina Almasry <> Cc: Axel Rasmussen <> Cc: Peter Xu <> Cc: Muchun Song <> Cc: Michal Hocko <> Cc: Naoya Horiguchi <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-06-15proc: only require mm_struct for writingLinus Torvalds
Commit 591a22c14d3f ("proc: Track /proc/$pid/attr/ opener mm_struct") we started using __mem_open() to track the mm_struct at open-time, so that we could then check it for writes. But that also ended up making the permission checks at open time much stricter - and not just for writes, but for reads too. And that in turn caused a regression for at least Fedora 29, where NIC interfaces fail to start when using NetworkManager. Since only the write side wanted the mm_struct test, ignore any failures by __mem_open() at open time, leaving reads unaffected. The write() time verification of the mm_struct pointer will then catch the failure case because a NULL pointer will not match a valid 'current->mm'. Link: Fixes: 591a22c14d3f ("proc: Track /proc/$pid/attr/ opener mm_struct") Reported-and-tested-by: Leon Romanovsky <> Cc: Kees Cook <> Cc: Christian Brauner <> Cc: Andrea Righi <> Signed-off-by: Linus Torvalds <>
2021-06-15afs: Fix an IS_ERR() vs NULL checkDan Carpenter
The proc_symlink() function returns NULL on error, it doesn't return error pointers. Fixes: 5b86d4ff5dce ("afs: Implement network namespacing") Signed-off-by: Dan Carpenter <> Signed-off-by: David Howells <> cc: Link: Signed-off-by: Linus Torvalds <>
2021-06-14fanotify: fix copy_event_to_user() fid error clean upMatthew Bobrowski
Ensure that clean up is performed on the allocated file descriptor and struct file object in the event that an error is encountered while copying fid info objects. Currently, we return directly to the caller when an error is experienced in the fid info copying helper, which isn't ideal given that the listener process could be left with a dangling file descriptor in their fdtable. Fixes: 5e469c830fdb ("fanotify: copy event fid info to user") Fixes: 44d705b0370b ("fanotify: report name info for FAN_DIR_MODIFY event") Link: Link: Signed-off-by: Matthew Bobrowski <> Signed-off-by: Jan Kara <>
2021-06-13Merge tag 'nfs-for-5.13-3' of git:// Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Highlights include: Stable fixes: - Fix use-after-free in nfs4_init_client() Bugfixes: - Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode() - Fix second deadlock in nfs4_evict_inode() - nfs4_proc_set_acl should not change the value of NFS_CAP_UIDGID_NOMAP - Fix setting of the NFS_CAP_SECURITY_LABEL capability" * tag 'nfs-for-5.13-3' of git:// NFSv4: Fix second deadlock in nfs4_evict_inode() NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode() NFS: FMODE_READ and friends are C macros, not enum types NFS: Fix a potential NULL dereference in nfs_get_client() NFS: Fix use-after-free in nfs4_init_client() NFS: Ensure the NFS_CAP_SECURITY_LABEL capability is set when appropriate NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.
2021-06-12Merge tag 'driver-core-5.13-rc6' of ↵Linus Torvalds
git:// Pull driver core fix from Greg KH: "A single debugfs fix for 5.13-rc6, fixing a bug in debugfs_read_file_str() that showed up in 5.13-rc1. It has been in linux-next for a full week with no reported problems" * tag 'driver-core-5.13-rc6' of git:// debugfs: Fix debugfs_read_file_str()
2021-06-12Merge tag 'io_uring-5.13-2021-06-12' of git:// Torvalds
Pull io_uring fixes from Jens Axboe: "Just an API change for the registration changes that went into this release. Better to get it sorted out now than before it's too late" * tag 'io_uring-5.13-2021-06-12' of git:// io_uring: add feature flag for rsrc tags io_uring: change registration/upd/rsrc tagging ABI
2021-06-10io_uring: add feature flag for rsrc tagsPavel Begunkov
Add IORING_FEAT_RSRC_TAGS indicating that io_uring supports a bunch of new IORING_REGISTER operations, in particular IORING_REGISTER_[FILES[,UPDATE]2,BUFFERS[2,UPDATE]] that support rsrc tagging, and also indicating implemented dynamic fixed buffer updates. Signed-off-by: Pavel Begunkov <> Link: Signed-off-by: Jens Axboe <>
2021-06-10io_uring: change registration/upd/rsrc tagging ABIPavel Begunkov
There are ABI moments about recently added rsrc registration/update and tagging that might become a nuisance in the future. First, IORING_REGISTER_RSRC[_UPD] hide different types of resources under it, so breaks fine control over them by restrictions. It works for now, but once those are wanted under restrictions it would require a rework. It was also inconvenient trying to fit a new resource not supporting all the features (e.g. dynamic update) into the interface, so better to return to IORING_REGISTER_* top level dispatching. Second, register/update were considered to accept a type of resource, however that's not a good idea because there might be several ways of registration of a single resource type, e.g. we may want to add non-contig buffers or anything more exquisite as dma mapped memory. So, remove IORING_RSRC_[FILE,BUFFER] out of the ABI, and place them internally for now to limit changes. Signed-off-by: Pavel Begunkov <> Link: Signed-off-by: Jens Axboe <>
2021-06-10coredump: Limit what can interrupt coredumpsEric W. Biederman
Olivier Langlois has been struggling with coredumps being incompletely written in processes using io_uring. Olivier Langlois <> writes: > io_uring is a big user of task_work and any event that io_uring made a > task waiting for that occurs during the core dump generation will > generate a TIF_NOTIFY_SIGNAL. > > Here are the detailed steps of the problem: > 1. io_uring calls vfs_poll() to install a task to a file wait queue > with io_async_wake() as the wakeup function cb from io_arm_poll_handler() > 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL > 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling > set_notify_signal() The coredump code deliberately supports being interrupted by SIGKILL, and depends upon prepare_signal to filter out all other signals. Now that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack in dump_emitted by the coredump code no longer works. Make the coredump code more robust by explicitly testing for all of the wakeup conditions the coredump code supports. This prevents new wakeup conditions from breaking the coredump code, as well as fixing the current issue. The filesystem code that the coredump code uses already limits itself to only aborting on fatal_signal_pending. So it should not develop surprising wake-up reasons either. v2: Don't remove the now unnecessary code in prepare_signal. Cc: Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") Reported-by: Olivier Langlois <> Signed-off-by: "Eric W. Biederman" <> Signed-off-by: Linus Torvalds <>
2021-06-09Merge tag 'for-5.13-rc5-tag' of ↵Linus Torvalds
git:// Pull btrfs fixes from David Sterba: "A few more fixes that people hit during testing. Zoned mode fix: - fix 32bit value wrapping when calculating superblock offsets Error handling fixes: - properly check filesystema and device uuids - properly return errors when marking extents as written - do not write supers if we have an fs error" * tag 'for-5.13-rc5-tag' of git:// btrfs: promote debugging asserts to full-fledged checks in validate_super btrfs: return value from btrfs_mark_extent_written() in case of error btrfs: zoned: fix zone number to sector/physical calculation btrfs: do not write supers if we have an fs error
2021-06-08proc: Track /proc/$pid/attr/ opener mm_structKees Cook
Commit bfb819ea20ce ("proc: Check /proc/$pid/attr/ writes against file opener") tried to make sure that there could not be a confusion between the opener of a /proc/$pid/attr/ file and the writer. It used struct cred to make sure the privileges didn't change. However, there were existing cases where a more privileged thread was passing the opened fd to a differently privileged thread (during container setup). Instead, use mm_struct to track whether the opener and writer are still the same process. (This is what several other proc files already do, though for different reasons.) Reported-by: Christian Brauner <> Reported-by: Andrea Righi <> Tested-by: Andrea Righi <> Fixes: bfb819ea20ce ("proc: Check /proc/$pid/attr/ writes against file opener") Cc: Signed-off-by: Kees Cook <> Signed-off-by: Linus Torvalds <>
2021-06-07afs: Fix partial writeback of large files on fsync and closeMarc Dionne
In commit e87b03f5830e ("afs: Prepare for use of THPs"), the return value for afs_write_back_from_locked_page was changed from a number of pages to a length in bytes. The loop in afs_writepages_region uses the return value to compute the index that will be used to find dirty pages in the next iteration, but treats it as a number of pages and wrongly multiplies it by PAGE_SIZE. This gives a very large index value, potentially skipping any dirty data that was not covered in the first pass, which is limited to 256M. This causes fsync(), and indirectly close(), to only do a partial writeback of a large file's dirty data. The rest is eventually written back by background threads after dirty_expire_centisecs. Fixes: e87b03f5830e ("afs: Prepare for use of THPs") Signed-off-by: Marc Dionne <> Signed-off-by: David Howells <> Reviewed-by: Jeffrey Altman <> cc: Link: Signed-off-by: Linus Torvalds <>
2021-06-06Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git:// Pull ext4 fixes from Ted Ts'o: "Miscellaneous ext4 bug fixes" * tag 'ext4_for_linus_stable' of git:// ext4: Only advertise encrypted_casefold when encryption and unicode are enabled ext4: fix no-key deletion for encrypt+casefold ext4: fix memory leak in ext4_fill_super ext4: fix fast commit alignment issues ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed ext4: fix accessing uninit percpu counter variable with fast_commit ext4: fix memory leak in ext4_mb_init_backend on error path.
2021-06-06ext4: Only advertise encrypted_casefold when encryption and unicode are enabledDaniel Rosenberg
Encrypted casefolding is only supported when both encryption and casefolding are both enabled in the config. Fixes: 471fbbea7ff7 ("ext4: handle casefolding with encryption") Cc: # 5.13+ Signed-off-by: Daniel Rosenberg <> Link: Signed-off-by: Theodore Ts'o <>
2021-06-06ext4: fix no-key deletion for encrypt+casefoldDaniel Rosenberg
commit 471fbbea7ff7 ("ext4: handle casefolding with encryption") is missing a few checks for the encryption key which are needed to support deleting enrypted casefolded files when the key is not present. This bug made it impossible to delete encrypted+casefolded directories without the encryption key, due to errors like: W : EXT4-fs warning (device vdc): __ext4fs_dirhash:270: inode #49202: comm Binder:378_4: Siphash requires key Repro steps in kvm-xfstests test appliance: mkfs.ext4 -F -E encoding=utf8 -O encrypt /dev/vdc mount /vdc mkdir /vdc/dir chattr +F /vdc/dir keyid=$(head -c 64 /dev/zero | xfs_io -c add_enckey /vdc | awk '{print $NF}') xfs_io -c "set_encpolicy $keyid" /vdc/dir for i in `seq 1 100`; do mkdir /vdc/dir/$i done xfs_io -c "rm_enckey $keyid" /vdc rm -rf /vdc/dir # fails with the bug Fixes: 471fbbea7ff7 ("ext4: handle casefolding with encryption") Signed-off-by: Daniel Rosenberg <> Link: Signed-off-by: Theodore Ts'o <>
2021-06-06ext4: fix memory leak in ext4_fill_superAlexey Makhalov
Buffer head references must be released before calling kill_bdev(); otherwise the buffer head (and its page referenced by b_data) will not be freed by kill_bdev, and subsequently that bh will be leaked. If blocksizes differ, sb_set_blocksize() will kill current buffers and page cache by using kill_bdev(). And then super block will be reread again but using correct blocksize this time. sb_set_blocksize() didn't fully free superblock page and buffer head, and being busy, they were not freed and instead leaked. This can easily be reproduced by calling an infinite loop of: systemctl start <ext4_on_lvm>.mount, and systemctl stop <ext4_on_lvm>.mount ... since systemd creates a cgroup for each slice which it mounts, and the bh leak get amplified by a dying memory cgroup that also never gets freed, and memory consumption is much more easily noticed. Fixes: ce40733ce93d ("ext4: Check for return value from sb_set_blocksize") Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3") Link: Signed-off-by: Alexey Makhalov <> Signed-off-by: Theodore Ts'o <> Cc:
2021-06-06ext4: fix fast commit alignment issuesHarshad Shirwadkar
Fast commit recovery data on disk may not be aligned. So, when the recovery code reads it, this patch makes sure that fast commit info found on-disk is first memcpy-ed into an aligned variable before accessing it. As a consequence of it, we also remove some macros that could resulted in unaligned accesses. Cc: Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Harshad Shirwadkar <> Link: Signed-off-by: Theodore Ts'o <>
2021-06-06ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failedYe Bin
We got follow bug_on when run fsstress with injecting IO fault: [130747.323114] kernel BUG at fs/ext4/extents_status.c:762! [130747.323117] Internal error: Oops - BUG: 0 [#1] SMP ...... [130747.334329] Call trace: [130747.334553] ext4_es_cache_extent+0x150/0x168 [ext4] [130747.334975] ext4_cache_extents+0x64/0xe8 [ext4] [130747.335368] ext4_find_extent+0x300/0x330 [ext4] [130747.335759] ext4_ext_map_blocks+0x74/0x1178 [ext4] [130747.336179] ext4_map_blocks+0x2f4/0x5f0 [ext4] [130747.336567] ext4_mpage_readpages+0x4a8/0x7a8 [ext4] [130747.336995] ext4_readpage+0x54/0x100 [ext4] [130747.337359] generic_file_buffered_read+0x410/0xae8 [130747.337767] generic_file_read_iter+0x114/0x190 [130747.338152] ext4_file_read_iter+0x5c/0x140 [ext4] [130747.338556] __vfs_read+0x11c/0x188 [130747.338851] vfs_read+0x94/0x150 [130747.339110] ksys_read+0x74/0xf0 This patch's modification is according to Jan Kara's suggestion in: "I see. Now I understand your patch. Honestly, seeing how fragile is trying to fix extent tree after split has failed in the middle, I would probably go even further and make sure we fix the tree properly in case of ENOSPC and EDQUOT (those are easily user triggerable). Anything else indicates a HW problem or fs corruption so I'd rather leave the extent tree as is and don't try to fix it (which also means we will not create overlapping extents)." Cc: Signed-off-by: Ye Bin <> Reviewed-by: Jan Kara <> Link: Signed-off-by: Theodore Ts'o <>
2021-06-05ocfs2: fix data corruption by fallocateJunxiao Bi
When fallocate punches holes out of inode size, if original isize is in the middle of last cluster, then the part from isize to the end of the cluster will be zeroed with buffer write, at that time isize is not yet updated to match the new size, if writeback is kicked in, it will invoke ocfs2_writepage()->block_write_full_page() where the pages out of inode size will be dropped. That will cause file corruption. Fix this by zero out eof blocks when extending the inode size. Running the following command with qemu-image 4.2.1 can get a corrupted coverted image file easily. qemu-img convert -p -t none -T none -f qcow2 $qcow_image \ -O qcow2 -o compat=1.1 $qcow_image.conv The usage of fallocate in qemu is like this, it first punches holes out of inode size, then extend the inode size. fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2276196352, 65536) = 0 fallocate(11, 0, 2276196352, 65536) = 0 v1: v2: Link: Signed-off-by: Junxiao Bi <> Reviewed-by: Joseph Qi <> Cc: Jan Kara <> Cc: Mark Fasheh <> Cc: Joel Becker <> Cc: Changwei Ge <> Cc: Gang He <> Cc: Jun Piao <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-06-04debugfs: Fix debugfs_read_file_str()Dietmar Eggemann
Read the entire size of the buffer, including the trailing new line character. Discovered while reading the sched domain names of CPU0: before: cat /sys/kernel/debug/sched/domains/cpu0/domain*/name SMTMCDIE after: cat /sys/kernel/debug/sched/domains/cpu0/domain*/name SMT MC DIE Fixes: 9af0440ec86eb ("debugfs: Implement debugfs_create_str()") Reviewed-by: Steven Rostedt (VMware) <> Acked-by: Peter Zijlstra (Intel) <> Signed-off-by: Dietmar Eggemann <> Link: Signed-off-by: Greg Kroah-Hartman <>
2021-06-04btrfs: promote debugging asserts to full-fledged checks in validate_superNikolay Borisov
Syzbot managed to trigger this assert while performing its fuzzing. Turns out it's better to have those asserts turned into full-fledged checks so that in case buggy btrfs images are mounted the users gets an error and mounting is stopped. Alternatively with CONFIG_BTRFS_ASSERT disabled such image would have been erroneously allowed to be mounted. Reported-by: CC: # 5.4+ Reviewed-by: Johannes Thumshirn <> Reviewed-by: Qu Wenruo <> Signed-off-by: Nikolay Borisov <> Reviewed-by: David Sterba <> [ add uuids to the messages ] Signed-off-by: David Sterba <>
2021-06-04btrfs: return value from btrfs_mark_extent_written() in case of errorRitesh Harjani
We always return 0 even in case of an error in btrfs_mark_extent_written(). Fix it to return proper error value in case of a failure. All callers handle it. CC: # 4.4+ Signed-off-by: Ritesh Harjani <> Reviewed-by: David Sterba <> Signed-off-by: David Sterba <>
2021-06-04btrfs: zoned: fix zone number to sector/physical calculationNaohiro Aota
In btrfs_get_dev_zone_info(), we have "u32 sb_zone" and calculate "sector_t sector" by shifting it. But, this "sector" is calculated in 32bit, leading it to be 0 for the 2nd superblock copy. Since zone number is u32, shifting it to sector (sector_t) or physical address (u64) can easily trigger a missing cast bug like this. This commit introduces helpers to convert zone number to sector/LBA, so we won't fall into the same pitfall again. Reported-by: Dmitry Fomichev <> Fixes: 12659251ca5d ("btrfs: implement log-structured superblock for ZONED mode") CC: # 5.11+ Reviewed-by: Johannes Thumshirn <> Signed-off-by: Naohiro Aota <> Reviewed-by: David Sterba <> Signed-off-by: David Sterba <>
2021-06-04btrfs: do not write supers if we have an fs errorJosef Bacik
Error injection testing uncovered a pretty severe problem where we could end up committing a super that pointed to the wrong tree roots, resulting in transid mismatch errors. The way we commit the transaction is we update the super copy with the current generations and bytenrs of the important roots, and then copy that into our super_for_commit. Then we allow transactions to continue again, we write out the dirty pages for the transaction, and then we write the super. If the write out fails we'll bail and skip writing the supers. However since we've allowed a new transaction to start, we can have a log attempting to sync at this point, which would be blocked on fs_info->tree_log_mutex. Once the commit fails we're allowed to do the log tree commit, which uses super_for_commit, which now points at fs tree's that were not written out. Fix this by checking BTRFS_FS_STATE_ERROR once we acquire the tree_log_mutex. This way if the transaction commit fails we're sure to see this bit set and we can skip writing the super out. This patch fixes this specific transid mismatch error I was seeing with this particular error path. CC: # 5.12+ Reviewed-by: Filipe Manana <> Signed-off-by: Josef Bacik <> Reviewed-by: David Sterba <> Signed-off-by: David Sterba <>
2021-06-03Merge tag 'io_uring-5.13-2021-06-03' of git:// Torvalds
Pull io_uring fix from Jens Axboe: "Just a single one-liner fix for an accounting regression in this release" * tag 'io_uring-5.13-2021-06-03' of git:// io_uring: fix misaccounting fix buf pinned pages
2021-06-03Merge tag 'for-5.13-rc4-tag' of ↵Linus Torvalds
git:// Pull btrfs fixes from David Sterba: "Error handling improvements, caught by error injection: - handle errors during checksum deletion - set error on mapping when ordered extent io cannot be finished - inode link count fixup in tree-log - missing return value checks for inode updates in tree-log - abort transaction in rename exchange if adding second reference fails Fixes: - fix fsync failure after writes to prealloc extents - fix deadlock when cloning inline extents and low on available space - fix compressed writes that cross stripe boundary" * tag 'for-5.13-rc4-tag' of git:// MAINTAINERS: add btrfs IRC link btrfs: fix deadlock when cloning inline extents and low on available space btrfs: fix fsync failure and transaction abort after writes to prealloc extents btrfs: abort in rename_exchange if we fail to insert the second ref btrfs: check error value from btrfs_update_inode in tree log btrfs: fixup error handling in fixup_inode_link_counts btrfs: mark ordered extent and inode with error if we fail to finish btrfs: return errors from btrfs_del_csums in cleanup_ref_head btrfs: fix error handling in btrfs_del_csums btrfs: fix compressed writes that cross stripe boundary
2021-06-03NFSv4: Fix second deadlock in nfs4_evict_inode()Trond Myklebust
If the inode is being evicted but has to return a layout first, then that too can cause a deadlock in the corner case where the server reboots. Signed-off-by: Trond Myklebust <>
2021-06-03NFSv4: Fix deadlock between nfs4_evict_inode() and nfs4_opendata_get_inode()Trond Myklebust
If the inode is being evicted, but has to return a delegation first, then it can cause a deadlock in the corner case where the server reboots before the delegreturn completes, but while the call to iget5_locked() in nfs4_opendata_get_inode() is waiting for the inode free to complete. Since the open call still holds a session slot, the reboot recovery cannot proceed. In order to break the logjam, we can turn the delegation return into a privileged operation for the case where we're evicting the inode. We know that in that case, there can be no other state recovery operation that conflicts. Reported-by: zhangxiaoxu (A) <> Fixes: 5fcdfacc01f3 ("NFSv4: Return delegations synchronously in evict_inode") Signed-off-by: Trond Myklebust <>
2021-06-03NFS: FMODE_READ and friends are C macros, not enum typesChuck Lever
Address a sparse warning: CHECK fs/nfs/nfstrace.c fs/nfs/nfstrace.c: note: in included file (through /home/cel/src/linux/rpc-over-tls/include/trace/trace_events.h, /home/cel/src/linux/rpc-over-tls/include/trace/define_trace.h, ...): fs/nfs/./nfstrace.h:424:1: warning: incorrect type in initializer (different base types) fs/nfs/./nfstrace.h:424:1: expected unsigned long eval_value fs/nfs/./nfstrace.h:424:1: got restricted fmode_t [usertype] fs/nfs/./nfstrace.h:425:1: warning: incorrect type in initializer (different base types) fs/nfs/./nfstrace.h:425:1: expected unsigned long eval_value fs/nfs/./nfstrace.h:425:1: got restricted fmode_t [usertype] fs/nfs/./nfstrace.h:426:1: warning: incorrect type in initializer (different base types) fs/nfs/./nfstrace.h:426:1: expected unsigned long eval_value fs/nfs/./nfstrace.h:426:1: got restricted fmode_t [usertype] Signed-off-by: Chuck Lever <> Signed-off-by: Trond Myklebust <>
2021-06-03NFS: Fix a potential NULL dereference in nfs_get_client()Dan Carpenter
None of the callers are expecting NULL returns from nfs_get_client() so this code will lead to an Oops. It's better to return an error pointer. I expect that this is dead code so hopefully no one is affected. Fixes: 31434f496abb ("nfs: check hostname in nfs_get_client") Signed-off-by: Dan Carpenter <> Signed-off-by: Trond Myklebust <>
2021-06-03NFS: Fix use-after-free in nfs4_init_client()Anna Schumaker
KASAN reports a use-after-free when attempting to mount two different exports through two different NICs that belong to the same server. Olga was able to hit this with kernels starting somewhere between 5.7 and 5.10, but I traced the patch that introduced the clear_bit() call to 4.13. So something must have changed in the refcounting of the clp pointer to make this call to nfs_put_client() the very last one. Fixes: 8dcbec6d20 ("NFSv41: Handle EXCHID4_FLAG_CONFIRMED_R during NFSv4.1 migration") Cc: # 4.13+ Signed-off-by: Anna Schumaker <> Signed-off-by: Trond Myklebust <>
2021-06-03NFS: Ensure the NFS_CAP_SECURITY_LABEL capability is set when appropriateScott Mayhew
Commit ce62b114bbad ("NFS: Split attribute support out from the server capabilities") removed the logic from _nfs4_server_capabilities() that sets the NFS_CAP_SECURITY_LABEL capability based on the presence of FATTR4_WORD2_SECURITY_LABEL in the attr_bitmask of the server's response. Now NFS_CAP_SECURITY_LABEL is never set, which breaks labelled NFS. This was replaced with logic that clears the NFS_ATTR_FATTR_V4_SECURITY_LABEL bit in the newly added fattr_valid field based on the absence of FATTR4_WORD2_SECURITY_LABEL in the attr_bitmask of the server's response. This essentially has no effect since there's nothing looks for that bit in fattr_supported. So revert that part of the commit, but adding the logic that sets NFS_CAP_SECURITY_LABEL near where the other capabilities are set in _nfs4_server_capabilities(). Fixes: ce62b114bbad ("NFS: Split attribute support out from the server capabilities") Signed-off-by: Scott Mayhew <> Signed-off-by: Trond Myklebust <>
2021-06-02ext4: fix accessing uninit percpu counter variable with fast_commitRitesh Harjani
When running generic/527 with fast_commit configuration, the following issue is seen on Power. With fast_commit, during ext4_fc_replay() (which can be called from ext4_fill_super()), if inode eviction happens then it can access an uninitialized percpu counter variable. This patch adds the check before accessing the counters in ext4_free_inode() path. [ 321.165371] run fstests generic/527 at 2021-04-29 08:38:43 [ 323.027786] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: block_validity. Quota mode: none. [ 323.618772] BUG: Unable to handle kernel data access on read at 0x1fbd80000 [ 323.619767] Faulting instruction address: 0xc000000000bae78c cpu 0x1: Vector: 300 (Data Access) at [c000000010706ef0] pc: c000000000bae78c: percpu_counter_add_batch+0x3c/0x100 lr: c0000000006d0bb0: ext4_free_inode+0x780/0xb90 pid = 5593, comm = mount ext4_free_inode+0x780/0xb90 ext4_evict_inode+0xa8c/0xc60 evict+0xfc/0x1e0 ext4_fc_replay+0xc50/0x20f0 do_one_pass+0xfe0/0x1350 jbd2_journal_recover+0x184/0x2e0 jbd2_journal_load+0x1c0/0x4a0 ext4_fill_super+0x2458/0x4200 mount_bdev+0x1dc/0x290 ext4_mount+0x28/0x40 legacy_get_tree+0x4c/0xa0 vfs_get_tree+0x4c/0x120 path_mount+0xcf8/0xd70 do_mount+0x80/0xd0 sys_mount+0x3fc/0x490 system_call_exception+0x384/0x3d0 system_call_common+0xec/0x278 Cc: Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Ritesh Harjani <> Reviewed-by: Harshad Shirwadkar <> Link: Signed-off-by: Theodore Ts'o <>
2021-06-01Revert "gfs2: Fix mmap locking for write faults"Andreas Gruenbacher
This reverts commit b7f55d928e75557295c1ac280c291b738905b6fb. As explained by Linus in [*], write faults on a mmap region are reads from a filesysten point of view, so taking the inode glock exclusively on write faults is incorrect. Instead, when a page is marked writable, the .page_mkwrite vm operation will be called, which is where the exclusive lock taking needs to happen. I got this wrong because of a broken test case that made me believe .page_mkwrite isn't getting called when it actually is. [*] Signed-off-by: Andreas Gruenbacher <>
2021-06-01NFSv4: nfs4_proc_set_acl needs to restore NFS_CAP_UIDGID_NOMAP on error.Dai Ngo
Currently if __nfs4_proc_set_acl fails with NFS4ERR_BADOWNER it re-enables the idmapper by clearing NFS_CAP_UIDGID_NOMAP before retrying again. The NFS_CAP_UIDGID_NOMAP remains cleared even if the retry fails. This causes problem for subsequent setattr requests for v4 server that does not have idmapping configured. This patch modifies nfs4_proc_set_acl to detect NFS4ERR_BADOWNER and NFS4ERR_BADNAME and skips the retry, since the kernel isn't involved in encoding the ACEs, and return -EINVAL. Steps to reproduce the problem: # mount -o vers=4.1,sec=sys server:/export/test /tmp/mnt # touch /tmp/mnt/file1 # chown 99 /tmp/mnt/file1 # nfs4_setfacl -a /tmp/mnt/file1 Failed setxattr operation: Invalid argument # chown 99 /tmp/mnt/file1 chown: changing ownership of ‘/tmp/mnt/file1’: Invalid argument # umount /tmp/mnt # mount -o vers=4.1,sec=sys server:/export/test /tmp/mnt # chown 99 /tmp/mnt/file1 # v2: detect NFS4ERR_BADOWNER and NFS4ERR_BADNAME and skip retry in nfs4_proc_set_acl. Signed-off-by: Dai Ngo <> Signed-off-by: Trond Myklebust <>
2021-05-31Merge tag 'gfs2-v5.13-rc2-fixes' of ↵Linus Torvalds
git:// Pull gfs2 fixes from Andreas Gruenbacher: "Various gfs2 fixes" * tag 'gfs2-v5.13-rc2-fixes' of git:// gfs2: Fix use-after-free in gfs2_glock_shrink_scan gfs2: Fix mmap locking for write faults gfs2: Clean up revokes on normal withdraws gfs2: fix a deadlock on withdraw-during-mount gfs2: fix scheduling while atomic bug in glocks gfs2: Fix I_NEW check in gfs2_dinode_in gfs2: Prevent direct-I/O write fallback errors from getting lost
2021-05-31Merge tag 'fsnotify_for_v5.13-rc5' of ↵Linus Torvalds
git:// Pull fsnotify fixes from Jan Kara: "A fix for permission checking with fanotify unpriviledged groups. Also there's a small update in MAINTAINERS file for fanotify" * tag 'fsnotify_for_v5.13-rc5' of git:// fanotify: fix permission model of unprivileged group MAINTAINERS: Add Matthew Bobrowski as a reviewer