summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-05-30Merge tag 'f2fs-for-6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, Matthew converted most of page operations to using folio. Beyond the work, we've applied some performance tunings such as GC and linear lookup, in addition to enhancing fault injection and sanity checks. Enhancements: - large number of folio conversions - add a control to turn on/off the linear lookup for performance - tune GC logics for zoned block device - improve fault injection and sanity checks Bug fixes: - handle error cases of memory donation - fix to correct check conditions in f2fs_cross_rename - fix to skip f2fs_balance_fs() if checkpoint is disabled - don't over-report free space or inodes in statvfs - prevent the current section from being selected as a victim during GC - fix to calculate first_zoned_segno correctly - fix to avoid inconsistence between SIT and SSA for zoned block device As usual, there are several debugging patches and clean-ups as well" * tag 'f2fs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (195 commits) f2fs: fix to correct check conditions in f2fs_cross_rename f2fs: use d_inode(dentry) cleanup dentry->d_inode f2fs: fix to skip f2fs_balance_fs() if checkpoint is disabled f2fs: clean up to check bi_status w/ BLK_STS_OK f2fs: introduce is_{meta,node}_folio f2fs: add ckpt_valid_blocks to the section entry f2fs: add a method for calculating the remaining blocks in the current segment in LFS mode. f2fs: introduce FAULT_VMALLOC f2fs: use vmalloc instead of kvmalloc in .init_{,de}compress_ctx f2fs: add f2fs_bug_on() in f2fs_quota_read() f2fs: add f2fs_bug_on() to detect potential bug f2fs: remove unused sbi argument from checksum functions f2fs: fix 32-bits hexademical number in fault injection doc f2fs: don't over-report free space or inodes in statvfs f2fs: return bool from __write_node_folio f2fs: simplify return value handling in f2fs_fsync_node_pages f2fs: always unlock the page in f2fs_write_single_data_page f2fs: remove wbc->for_reclaim handling f2fs: return bool from __f2fs_write_meta_folio f2fs: fix to return correct error number in f2fs_sync_node_pages() ...
2025-05-30bcachefs: Mark bch_errcode helpers __attribute__((const))Kent Overstreet
These don't access global memory or defer pointer arguments - this enables CSE optimizations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Add missing printbuf_reset() in bch2_check_dirent_inode_dirent()Kent Overstreet
We were accidentally including the contents from the previous fsck_err(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: sysfs/errorsKent Overstreet
Make the superblock error counters available in sysfs; the only other way they can be seen is 'show-super', but we don't write the superblock every time the error count gets incremented. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: bch2_check_fix_ptrs() can now repair btree rootsKent Overstreet
This is straightforward enough: check_fix_ptrs() currently only runs before we go RW, so updating the btree root pointer in c->btree_roots suffices - it'll be written out in the first journal write we do. For that, do_bch2_trans_commit_to_journal_replay() now handles JSET_ENTRY_btree_root entries. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Include b->ob.nr in cached_btree_node_to_text()Kent Overstreet
We have a bug report that looks like we might be leaking open buckets - let's check if they got left attached to the cached btree node. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Move devs_sorted to alloc_requestKent Overstreet
More stack usage work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: reduce stack usage in alloc_sectors_start()Kent Overstreet
with typical config options, variables in different inline functions aren't sharing stack space - and these are slowpaths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: bch2_alloc_v4_to_text()Kent Overstreet
Specialize the .to_text() for alloc_v4, to avoid the temporary on the stack for conversion from old versions. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Tweak bch2_data_update_init() for stack usageKent Overstreet
- Separate out a slowpath for bkey_nocow_lock() - Don't call bch2_bkey_ptrs_c() or loop over pointers more than necessary Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: kill replicas_sectors arg to __trigger_extent()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Don't stack allocate bch_writepage_stateKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: factor out break_cycle_fail()Kent Overstreet
More stack usage work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: btree_node_missing_err()Kent Overstreet
Factor out an error path for a small stack usage improvement. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Kill bkey_buf in btree_path_down()Kent Overstreet
Allocate some (smaller) temporary storage in btree_trans for this - btree_path_down() is in our max-stack call stack. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Add missing error logging in delete_dead_inodes()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Fix misaligned bucket check in journal space calculationsKent Overstreet
Fix an assertion pop in the tiering_misaligned test: rounding down to bucket size at the end of the journal space calculations leaves cur_entry_sectors == 0, which is incorrect with !cur_entry_err. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Fix incorrect multiple dev check in journal write pathKent Overstreet
It's uncomon to have multiple devices with journalling only on a subset, but can be specified with the 'data_allowed' option. We need to know if we're doing data/metadata writes to multiple devices, as that requires issuing flushes before the journal writes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Catch data_update_done events in trace_io_move_start_failKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: io_move_evacuate_bucket tracepoint, counterKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: trace_io_move_predKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Fix infinite loop in journal_entry_btree_keys_to_text()Kent Overstreet
Fix an infinite loop when bkey_i->k.u64s is 0. This only happens in userspace, where 'bcachefs list_journal' can print the entire contents of the journal, and non-dirty entries aren't validated. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30bcachefs: Journal read error message improvementsKent Overstreet
- Don't print a checksum error when we first read a journal entry: we print a checksum error later if we'll be using the journal entry. - Continuing with the theme of of improving error messages and grouping errors into a single log message per error, print a single 'checksum error' message per journal entry, and use bch2_journal_ptr_to_text() to print out where on the device it was. - Factor out checksum error messages and checking for missing journal entries into helpers, bch2_journal_read() has gotten obnoxiously big. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-30fs/dax: Fix "don't skip locked entries when scanning entries"Alistair Popple
Commit 6be3e21d25ca ("fs/dax: don't skip locked entries when scanning entries") introduced a new function, wait_entry_unlocked_exclusive(), which waits for the current entry to become unlocked without advancing the XArray iterator state. Waiting for the entry to become unlocked requires dropping the XArray lock. This requires calling xas_pause() prior to dropping the lock which leaves the xas in a suitable state for the next iteration. However this has the side-effect of advancing the xas state to the next index. Normally this isn't an issue because xas_for_each() contains code to detect this state and thus avoid advancing the index a second time on the next loop iteration. However both callers of and wait_entry_unlocked_exclusive() itself subsequently use the xas state to reload the entry. As xas_pause() updated the state to the next index this will cause the current entry which is being waited on to be skipped. This caused the following warning to fire intermittently when running xftest generic/068 on an XFS filesystem with FS DAX enabled: [ 35.067397] ------------[ cut here ]------------ [ 35.068229] WARNING: CPU: 21 PID: 1640 at mm/truncate.c:89 truncate_folio_batch_exceptionals+0xd8/0x1e0 [ 35.069717] Modules linked in: nd_pmem dax_pmem nd_btt nd_e820 libnvdimm [ 35.071006] CPU: 21 UID: 0 PID: 1640 Comm: fstest Not tainted 6.15.0-rc7+ #77 PREEMPT(voluntary) [ 35.072613] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/204 [ 35.074845] RIP: 0010:truncate_folio_batch_exceptionals+0xd8/0x1e0 [ 35.075962] Code: a1 00 00 00 f6 47 0d 20 0f 84 97 00 00 00 4c 63 e8 41 39 c4 7f 0b eb 61 49 83 c5 01 45 39 ec 7e 58 42 f68 [ 35.079522] RSP: 0018:ffffb04e426c7850 EFLAGS: 00010202 [ 35.080359] RAX: 0000000000000000 RBX: ffff9d21e3481908 RCX: ffffb04e426c77f4 [ 35.081477] RDX: ffffb04e426c79e8 RSI: ffffb04e426c79e0 RDI: ffff9d21e34816e8 [ 35.082590] RBP: ffffb04e426c79e0 R08: 0000000000000001 R09: 0000000000000003 [ 35.083733] R10: 0000000000000000 R11: 822b53c0f7a49868 R12: 000000000000001f [ 35.084850] R13: 0000000000000000 R14: ffffb04e426c78e8 R15: fffffffffffffffe [ 35.085953] FS: 00007f9134c87740(0000) GS:ffff9d22abba0000(0000) knlGS:0000000000000000 [ 35.087346] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 35.088244] CR2: 00007f9134c86000 CR3: 000000040afff000 CR4: 00000000000006f0 [ 35.089354] Call Trace: [ 35.089749] <TASK> [ 35.090168] truncate_inode_pages_range+0xfc/0x4d0 [ 35.091078] truncate_pagecache+0x47/0x60 [ 35.091735] xfs_setattr_size+0xc7/0x3e0 [ 35.092648] xfs_vn_setattr+0x1ea/0x270 [ 35.093437] notify_change+0x1f4/0x510 [ 35.094219] ? do_truncate+0x97/0xe0 [ 35.094879] do_truncate+0x97/0xe0 [ 35.095640] path_openat+0xabd/0xca0 [ 35.096278] do_filp_open+0xd7/0x190 [ 35.096860] do_sys_openat2+0x8a/0xe0 [ 35.097459] __x64_sys_openat+0x6d/0xa0 [ 35.098076] do_syscall_64+0xbb/0x1d0 [ 35.098647] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 35.099444] RIP: 0033:0x7f9134d81fc1 [ 35.100033] Code: 75 57 89 f0 25 00 00 41 00 3d 00 00 41 00 74 49 80 3d 2a 26 0e 00 00 74 6d 89 da 48 89 ee bf 9c ff ff ff5 [ 35.102993] RSP: 002b:00007ffcd41e0d10 EFLAGS: 00000202 ORIG_RAX: 0000000000000101 [ 35.104263] RAX: ffffffffffffffda RBX: 0000000000000242 RCX: 00007f9134d81fc1 [ 35.105452] RDX: 0000000000000242 RSI: 00007ffcd41e1200 RDI: 00000000ffffff9c [ 35.106663] RBP: 00007ffcd41e1200 R08: 0000000000000000 R09: 0000000000000064 [ 35.107923] R10: 00000000000001a4 R11: 0000000000000202 R12: 0000000000000066 [ 35.109112] R13: 0000000000100000 R14: 0000000000100000 R15: 0000000000000400 [ 35.110357] </TASK> [ 35.110769] irq event stamp: 8415587 [ 35.111486] hardirqs last enabled at (8415599): [<ffffffff8d74b562>] __up_console_sem+0x52/0x60 [ 35.113067] hardirqs last disabled at (8415610): [<ffffffff8d74b547>] __up_console_sem+0x37/0x60 [ 35.114575] softirqs last enabled at (8415300): [<ffffffff8d6ac625>] handle_softirqs+0x315/0x3f0 [ 35.115933] softirqs last disabled at (8415291): [<ffffffff8d6ac811>] __irq_exit_rcu+0xa1/0xc0 [ 35.117316] ---[ end trace 0000000000000000 ]--- Fix this by using xas_reset() instead, which is equivalent in implementation to xas_pause() but does not advance the XArray state. Fixes: 6be3e21d25ca ("fs/dax: don't skip locked entries when scanning entries") Signed-off-by: Alistair Popple <apopple@nvidia.com> Link: https://lore.kernel.org/20250523043749.1460780-1-apopple@nvidia.com Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: "Matthew Wilcow (Oracle)" <willy@infradead.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-29Merge tag 'fs_for_v6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2 and isofs updates from Jan Kara: - isofs fix of handling of particularly formatted Rock Ridge timestamps - Add deprecation notice about support of DAX in ext2 filesystem driver * tag 'fs_for_v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: Deprecate DAX isofs: fix Y2038 and Y2156 issues in Rock Ridge TF entry
2025-05-29Merge tag 'fsnotify_for_v6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "Two fanotify cleanups and support for watching namespace-owned filesystems by namespace admins (most useful for being able to watch for new mounts / unmounts happening within a user namespace)" * tag 'fsnotify_for_v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: support watching filesystems and mounts inside userns fanotify: remove redundant permission checks fanotify: Drop use of flex array in fanotify_fh
2025-05-29Merge tag 'driver-core-6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core updates from Greg KH: "Here are the driver core / kernfs changes for 6.16-rc1. Not a huge number of changes this development cycle, here's the summary of what is included in here: - kernfs locking tweaks, pushing some global locks down into a per-fs image lock - rust driver core and pci device bindings added for new features. - sysfs const work for bin_attributes. The final churn of switching away from and removing the transitional struct members, "read_new", "write_new" and "bin_attrs_new" will come after the merge window to avoid unnecesary merge conflicts. - auxbus device creation helpers added - fauxbus fix for creating sysfs files after the probe completed properly - other tiny updates for driver core things. All of these have been in linux-next for over a week with no reported issues" * tag 'driver-core-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: kernfs: Relax constraint in draining guard Documentation: embargoed-hardware-issues.rst: Remove myself drivers: hv: fix up const issue with vmbus_chan_bin_attrs firmware_loader: use SHA-256 library API instead of crypto_shash API docs: debugfs: do not recommend debugfs_remove_recursive PM: wakeup: Do not expose 4 device wakeup source APIs kernfs: switch global kernfs_rename_lock to per-fs lock kernfs: switch global kernfs_idr_lock to per-fs lock driver core: auxiliary bus: Fix IS_ERR() vs NULL mixup in __devm_auxiliary_device_create() sysfs: constify attribute_group::bin_attrs sysfs: constify bin_attribute argument of bin_attribute::read/write() software node: Correct a OOB check in software_node_get_reference_args() devres: simplify devm_kstrdup() using devm_kmemdup() platform: replace magic number with macro PLATFORM_DEVID_NONE component: do not try to unbind unbound components driver core: auxiliary bus: add device creation helpers driver core: faux: Add sysfs groups after probing
2025-05-29fuse: increase readdir buffer sizeMiklos Szeredi
Increase the buffer size to the count requested by userspace. This improves performance. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
2025-05-29readdir: supply dir_context.count as readdir buffer size hintMiklos Szeredi
This is a preparation for large readdir buffers in fuse. Simply setting the fuse buffer size to the userspace buffer size should work, the record sizes are similar (fuse's is slightly larger than libc's, so no overflow should ever happen). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Jaco Kroon <jaco@uls.co.za>
2025-05-29fuse: don't allow signals to interrupt getdents copyingMiklos Szeredi
When getting the directory contents, the entries are first fetched to a kernel buffer, then they are copied to userspace with dir_emit(). This second phase is non-blocking as long as the userspace buffer is not paged out, making it interruptible makes zero sense. Overload d_type as flags, since it only uses 4 bits from 32. Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for writebackJoanne Koong
Add support for folios larger than one page size for writeback. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for readaheadJoanne Koong
Add support for folios larger than one page size for readahead. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for queued writesJoanne Koong
Add support for folios larger than one page size for queued writes. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for storesJoanne Koong
Add support for folios larger than one page size for stores. Also change variable naming from "this_num" to "nr_bytes". Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for symlinksJoanne Koong
Support large folios for symlinks and change the name from fuse_getlink_page() to fuse_getlink_folio(). Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for folio readsJoanne Koong
Add support for folios larger than one page size for folio reads into the page cache. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for writethrough writesJoanne Koong
Add support for folios larger than one page size for writethrough writes. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: refactor fuse_fill_write_pages()Joanne Koong
Refactor the logic in fuse_fill_write_pages() for copying out write data. This will make the future change for supporting large folios for writes easier. No functional changes. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for retrievesJoanne Koong
Add support for folios larger than one page size for retrieves. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support copying large foliosJoanne Koong
Currently, all folios associated with fuse are one page size. As part of the work to enable large folios, this commit adds support for copying to/from folios larger than one page size. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-28Merge tag 'net-next-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Implement the Device Memory TCP transmit path, allowing zero-copy data transmission on top of TCP from e.g. GPU memory to the wire. - Move all the IPv6 routing tables management outside the RTNL scope, under its own lock and RCU. The route control path is now 3x times faster. - Convert queue related netlink ops to instance lock, reducing again the scope of the RTNL lock. This improves the control plane scalability. - Refactor the software crc32c implementation, removing unneeded abstraction layers and improving significantly the related micro-benchmarks. - Optimize the GRO engine for UDP-tunneled traffic, for a 10% performance improvement in related stream tests. - Cover more per-CPU storage with local nested BH locking; this is a prep work to remove the current per-CPU lock in local_bh_disable() on PREMPT_RT. - Introduce and use nlmsg_payload helper, combining buffer bounds verification with accessing payload carried by netlink messages. Netfilter: - Rewrite the procfs conntrack table implementation, improving considerably the dump performance. A lot of user-space tools still use this interface. - Implement support for wildcard netdevice in netdev basechain and flowtables. - Integrate conntrack information into nft trace infrastructure. - Export set count and backend name to userspace, for better introspection. BPF: - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops programs and can be controlled in similar way to traditional qdiscs using the "tc qdisc" command. - Refactor the UDP socket iterator, addressing long standing issues WRT duplicate hits or missed sockets. Protocols: - Improve TCP receive buffer auto-tuning and increase the default upper bound for the receive buffer; overall this improves the single flow maximum thoughput on 200Gbs link by over 60%. - Add AFS GSSAPI security class to AF_RXRPC; it provides transport security for connections to the AFS fileserver and VL server. - Improve TCP multipath routing, so that the sources address always matches the nexthop device. - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS, and thus preventing DoS caused by passing around problematic FDs. - Retire DCCP socket. DCCP only receives updates for bugs, and major distros disable it by default. Its removal allows for better organisation of TCP fields to reduce the number of cache lines hit in the fast path. - Extend TCP drop-reason support to cover PAWS checks. Driver API: - Reorganize PTP ioctl flag support to require an explicit opt-in for the drivers, avoiding the problem of drivers not rejecting new unsupported flags. - Converted several device drivers to timestamping APIs. - Introduce per-PHY ethtool dump helpers, improving the support for dump operations targeting PHYs. Tests and tooling: - Add support for classic netlink in user space C codegen, so that ynl-c can now read, create and modify links, routes addresses and qdisc layer configuration. - Add ynl sub-types for binary attributes, allowing ynl-c to output known struct instead of raw binary data, clarifying the classic netlink output. - Extend MPTCP selftests to improve the code-coverage. - Add tests for XDP tail adjustment in AF_XDP. New hardware / drivers: - OpenVPN virtual driver: offload OpenVPN data channels processing to the kernel-space, increasing the data transfer throughput WRT the user-space implementation. - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC. - Broadcom asp-v3.0 ethernet driver. - AMD Renoir ethernet device. - ReakTek MT9888 2.5G ethernet PHY driver. - Aeonsemi 10G C45 PHYs driver. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - refactor the steering table handling to significantly reduce the amount of memory used - add support for complex matches in H/W flow steering - improve flow streeing error handling - convert to netdev instance locking - Intel (100G, ice, igb, ixgbe, idpf): - ice: add switchdev support for LLDP traffic over VF - ixgbe: add firmware manipulation and regions devlink support - igb: introduce support for frame transmission premption - igb: adds persistent NAPI configuration - idpf: introduce RDMA support - idpf: add initial PTP support - Meta (fbnic): - extend hardware stats coverage - add devlink dev flash support - Broadcom (bnxt): - add support for RX-side device memory TCP - Wangxun (txgbe): - implement support for udp tunnel offload - complete PTP and SRIOV support for AML 25G/10G devices - Ethernet NICs embedded and virtual: - Google (gve): - add device memory TCP TX support - Amazon (ena): - support persistent per-NAPI config - Airoha: - add H/W support for L2 traffic offload - add per flow stats for flow offloading - RealTek (rtl8211): add support for WoL magic packet - Synopsys (stmmac): - dwmac-socfpga 1000BaseX support - add Loongson-2K3000 support - introduce support for hardware-accelerated VLAN stripping - Broadcom (bcmgenet): - expose more H/W stats - Freescale (enetc, dpaa2-eth): - enetc: add MAC filter, VLAN filter RSS and loopback support - dpaa2-eth: convert to H/W timestamping APIs - vxlan: convert FDB table to rhashtable, for better scalabilty - veth: apply qdisc backpressure on full ring to reduce TX drops - Ethernet switches: - Microchip (kzZ88x3): add ETS scheduler support - Ethernet PHYs: - RealTek (rtl8211): - add support for WoL magic packet - add support for PHY LEDs - CAN: - Adds RZ/G3E CANFD support to the rcar_canfd driver. - Preparatory work for CAN-XL support. - Add self-tests framework with support for CAN physical interfaces. - WiFi: - mac80211: - scan improvements with multi-link operation (MLO) - Qualcomm (ath12k): - enable AHB support for IPQ5332 - add monitor interface support to QCN9274 - add multi-link operation support to WCN7850 - add 802.11d scan offload support to WCN7850 - monitor mode for WCN7850, better 6 GHz regulatory - Qualcomm (ath11k): - restore hibernation support - MediaTek (mt76): - WiFi-7 improvements - implement support for mt7990 - Intel (iwlwifi): - enhanced multi-link single-radio (EMLSR) support on 5 GHz links - rework device configuration - RealTek (rtw88): - improve throughput for RTL8814AU - RealTek (rtw89): - add multi-link operation support - STA/P2P concurrency improvements - support different SAR configs by antenna - Bluetooth: - introduce HCI Driver protocol - btintel_pcie: do not generate coredump for diagnostic events - btusb: add HCI Drv commands for configuring altsetting - btusb: add RTL8851BE device 0x0bda:0xb850 - btusb: add new VID/PID 13d3/3584 for MT7922 - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925 - btnxpuart: implement host-wakeup feature" * tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits) selftests/bpf: Fix bpf selftest build warning selftests: netfilter: Fix skip of wildcard interface test net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames net: openvswitch: Fix the dead loop of MPLS parse calipso: Don't call calipso functions for AF_INET sk. selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem net_sched: hfsc: Address reentrant enqueue adding class to eltree twice octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback octeontx2-pf: QOS: Perform cache sync on send queue teardown net: mana: Add support for Multi Vports on Bare metal net: devmem: ncdevmem: remove unused variable net: devmem: ksft: upgrade rx test to send 1K data net: devmem: ksft: add 5 tuple FS support net: devmem: ksft: add exit_wait to make rx test pass net: devmem: ksft: add ipv4 support net: devmem: preserve sockc_err page_pool: fix ugly page_pool formatting net: devmem: move list_add to net_devmem_bind_dmabuf. selftests: netfilter: nft_queue.sh: include file transfer duration in log message net: phy: mscc: Fix memory leak when using one step timestamping ...
2025-05-28flexfiles/pNFS: update stats on NFS4ERR_DELAY for v4.1 DSesTigran Mkrtchyan
On NFS4ERR_DELAY nfs slient updates its stats, but misses for flexfiles v4.1 DSes. Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: change nfsd_file_put_local() to take a pointer to __rcu pointerNeilBrown
Instead of calling xchg() and unrcu_pointer() before nfsd_file_put_local(), we now pass pointer to the __rcu pointer and call xchg() and unrcu_pointer() inside that function. Where unrcu_pointer() is currently called the internals of "struct nfsd_file" are not known and that causes older compilers such as gcc-8 to complain. In some cases we have a __kernel (aka normal) pointer not an __rcu pointer so we need to cast it to __rcu first. This is strictly a weakening so no information is lost. Somewhat surprisingly, this cast is accepted by gcc-8. This has the pleasing result that the cmpxchg() which sets ro_file and rw_file, and also the xchg() which clears them, are both now in the nfsd code. Reported-by: Pali Rohár <pali@kernel.org> Reported-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: protect race between nfs_uuid_put() and nfs_close_local_fh()NeilBrown
nfs_uuid_put() and nfs_close_local_fh() can race if a "struct nfs_file_localio" is released at the same time that nfsd calls nfs_localio_invalidate_clients(). It is important that neither of these functions completes after the other has started looking at a given nfs_file_localio and before it finishes. If nfs_uuid_put() exits while nfs_close_local_fh() is closing ro_file and rw_file it could return to __nfd_file_cache_purge() while some files are still referenced so the purge may not succeed. If nfs_close_local_fh() exits while nfsd_uuid_put() is still closing the files then the "struct nfs_file_localio" could be freed while nfsd_uuid_put() is still looking at it. This side is currently handled by copying the pointers out of ro_file and rw_file before deleting from the list in nfsd_uuid. We need to preserve this while ensuring that nfsd_uuid_put() does wait for nfs_close_local_fh(). This patch use nfl->uuid and nfl->list to provide the required interlock. nfs_uuid_put() removes the nfs_file_localio from the list, then drops locks and puts the two files, then reclaims the spinlock and sets ->nfs_uuid to NULL. nfs_close_local_fh() operates in the reverse order, setting ->nfs_uuid to NULL, then closing the files, then unlinking from the list. If nfs_uuid_put() finds that ->nfs_uuid is already NULL, it waits for the nfs_file_localio to be removed from the list. If nfs_close_local_fh() find that it has already been unlinked it waits for ->nfs_uuid to become NULL. This ensure that one of the two tries to close the files, but they each waits for the other. As nfs_uuid_put() is making the list empty, change from a list_for_each_safe loop to a while that always takes the first entry. This makes the intent more clear. Also don't move the list to a temporary local list as this would defeat the guarantees required for the interlock. Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: duplicate nfs_close_local_fh()NeilBrown
nfs_close_local_fh() is called from two different places for quite different use case. It is called from nfs_uuid_put() when the nfs_uuid is being detached - possibly because the nfs server is not longer serving that filesystem. In this case there will always be an nfs_uuid and so rcu_read_lock() is not needed. It is also called when the nfs_file_localio is no longer needed. In this case there may not be an active nfs_uuid. These two can race, and handling the race properly while avoiding excessive locking will require different handling on each side. This patch prepares the way by opencoding nfs_close_local_fh() into nfs_uuid_put(), then simplifying the code there as befits the context. Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: simplify interface to nfsd for getting nfsd_fileNeilBrown
The nfsd_localio_operations structure contains nfsd_file_get() to get a reference to an nfsd_file. This is only used in one place, where nfsd_open_local_fh() is also used. This patch combines the two, calling nfsd_open_local_fh() passing a pointer to where the nfsd_file pointer might be stored. If there is a pointer there an nfsd_file_get() can get a reference, that reference is returned. If not a new nfsd_file is acquired, stored at the pointer, and returned. When we store a reference we also increase the refcount on the net, as that refcount is decrements when we clear the stored pointer. We now get an extra reference *before* storing the new nfsd_file at the given location. This avoids possible races with the nfsd_file being freed before the final reference can be taken. This patch moves the rcu_dereference() needed after fetching from ro_file or rw_file into the nfsd code where the 'struct nfs_file' is fully defined. This avoids an error reported by older versions of gcc such as gcc-8 which complain about rcu_dereference() use in contexts where the structure (which will supposedly be accessed) is not fully defined. Reported-by: Pali Rohár <pali@kernel.org> Reported-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: always hold nfsd net ref with nfsd_file refNeilBrown
Having separate nfsd_file_put and nfsd_file_put_local in struct nfsd_localio_operations doesn't make much sense. The difference is that nfsd_file_put doesn't drop a reference to the nfs_net which is what keeps nfsd from shutting down. Currently, if nfsd tries to shutdown it will invalidate the files stored in the list from the nfs_uuid and this will drop all references to the nfsd net that the client holds. But the client could still hold some references to nfsd_files for active IO. So nfsd might think is has completely shut down local IO, but hasn't and has no way to wait for those active IO requests to complete. So this patch changes nfsd_file_get to nfsd_file_get_local and has it increase the ref count on the nfsd net and it replaces all calls to ->nfsd_put_file to ->nfsd_put_file_local. It also changes ->nfsd_open_local_fh to return with the refcount on the net elevated precisely when a valid nfsd_file is returned. This means that whenever the client holds a valid nfsd_file, there will be an associated count on the nfsd net, and so the count can only reach zero when all nfsd_files have been returned. nfs_local_file_put() is changed to call nfs_to_nfsd_file_put_local() instead of replacing calls to one with calls to the other because this will help a later patch which changes nfs_to_nfsd_file_put_local() to take an __rcu pointer while nfs_local_file_put() doesn't. Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: use cmpxchg() to install new nfs_file_localioNeilBrown
Rather than using nfs_uuid.lock to protect installing a new ro_file or rw_file, change to use cmpxchg(). Removing the file already uses xchg() so this improves symmetry and also makes the code a little simpler. Also remove the optimisation of not taking the lock, and not removing the nfs_file_localio from the linked list, when both ->ro_file and ->rw_file are already NULL. Given that ->nfs_uuid was not NULL, it is extremely unlikely that neither ->ro_file or ->rw_file is NULL so this optimisation can be of little value and it complicates understanding of the code - why can the list_del_init() be skipped? Finally, move the assignment of NULL to ->nfs_uuid until after the last action on the nfs_file_localio (the list_del_init). As soon as this is NULL a racing nfs_close_local_fh() can bypass all the locking and go on to free the nfs_file_localio, so we must be certain to be finished with it first. Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs: fix incorrect handling of large-number NFS errors in nfs4_do_mkdir()NeilBrown
A recent commit introduced nfs4_do_mkdir() which reports an error from nfs4_call_sync() by returning it with ERR_PTR(). This is a problem as nfs4_call_sync() can return negative NFS-specific errors with values larger than MAX_ERRNO (4095). One example is NFS4ERR_DELAY which has value 10008. This "pointer" gets to PTR_ERR_OR_ZERO() in nfs4_proc_mkdir() which chooses ZERO because it isn't in the range of value errors. Ultimately the pointer is dereferenced. This patch changes nfs4_do_mkdir() to report the dentry pointer and status separately - pointer as a return value, status in an "int *" parameter. The same separation is used for _nfs4_proc_mkdir() and the two are combined only in nfs4_proc_mkdir() after the status has passed through nfs4_handle_exception(), which ensures the error code does not exceed MAX_ERRNO. It also fixes a problem in the even when nfs4_handle_exception() updated the error value, the original 'alias' was still returned. Reported-by: Anna Schumaker <anna@kernel.org> Fixes: 8376583b84a1 ("nfs: change mkdir inode_operation to return alternate dentry if needed.") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs: ignore SB_RDONLY when remounting nfsLi Lingfeng
In some scenarios, when mounting NFS, more than one superblock may be created. The final superblock used is the last one created, but only the first superblock carries the ro flag passed from user space. If a ro flag is added to the superblock via remount, it will trigger the issue described in Link[1]. Link[2] attempted to address this by marking the superblock as ro during the initial mount. However, this introduced a new problem in scenarios where multiple mount points share the same superblock: [root@a ~]# mount /dev/sdb /mnt/sdb [root@a ~]# echo "/mnt/sdb *(rw,no_root_squash)" > /etc/exports [root@a ~]# echo "/mnt/sdb/test_dir2 *(ro,no_root_squash)" >> /etc/exports [root@a ~]# systemctl restart nfs-server [root@a ~]# mount -t nfs -o rw 127.0.0.1:/mnt/sdb/test_dir1 /mnt/test_mp1 [root@a ~]# mount | grep nfs4 127.0.0.1:/mnt/sdb/test_dir1 on /mnt/test_mp1 type nfs4 (rw,relatime,... [root@a ~]# mount -t nfs -o ro 127.0.0.1:/mnt/sdb/test_dir2 /mnt/test_mp2 [root@a ~]# mount | grep nfs4 127.0.0.1:/mnt/sdb/test_dir1 on /mnt/test_mp1 type nfs4 (ro,relatime,... 127.0.0.1:/mnt/sdb/test_dir2 on /mnt/test_mp2 type nfs4 (ro,relatime,... [root@a ~]# When mounting the second NFS, the shared superblock is marked as ro, causing the previous NFS mount to become read-only. To resolve both issues, the ro flag is no longer applied to the superblock during remount. Instead, the ro flag on the mount is used to control whether the mount point is read-only. Fixes: 281cad46b34d ("NFS: Create a submount rpc_op") Link[1]: https://lore.kernel.org/all/20240604112636.236517-3-lilingfeng@huaweicloud.com/ Link[2]: https://lore.kernel.org/all/20241130035818.1459775-1-lilingfeng3@huawei.com/ Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>