Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, Matthew converted most of page operations to using
folio. Beyond the work, we've applied some performance tunings such as
GC and linear lookup, in addition to enhancing fault injection and
sanity checks.
Enhancements:
- large number of folio conversions
- add a control to turn on/off the linear lookup for performance
- tune GC logics for zoned block device
- improve fault injection and sanity checks
Bug fixes:
- handle error cases of memory donation
- fix to correct check conditions in f2fs_cross_rename
- fix to skip f2fs_balance_fs() if checkpoint is disabled
- don't over-report free space or inodes in statvfs
- prevent the current section from being selected as a victim during GC
- fix to calculate first_zoned_segno correctly
- fix to avoid inconsistence between SIT and SSA for zoned block device
As usual, there are several debugging patches and clean-ups as well"
* tag 'f2fs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (195 commits)
f2fs: fix to correct check conditions in f2fs_cross_rename
f2fs: use d_inode(dentry) cleanup dentry->d_inode
f2fs: fix to skip f2fs_balance_fs() if checkpoint is disabled
f2fs: clean up to check bi_status w/ BLK_STS_OK
f2fs: introduce is_{meta,node}_folio
f2fs: add ckpt_valid_blocks to the section entry
f2fs: add a method for calculating the remaining blocks in the current segment in LFS mode.
f2fs: introduce FAULT_VMALLOC
f2fs: use vmalloc instead of kvmalloc in .init_{,de}compress_ctx
f2fs: add f2fs_bug_on() in f2fs_quota_read()
f2fs: add f2fs_bug_on() to detect potential bug
f2fs: remove unused sbi argument from checksum functions
f2fs: fix 32-bits hexademical number in fault injection doc
f2fs: don't over-report free space or inodes in statvfs
f2fs: return bool from __write_node_folio
f2fs: simplify return value handling in f2fs_fsync_node_pages
f2fs: always unlock the page in f2fs_write_single_data_page
f2fs: remove wbc->for_reclaim handling
f2fs: return bool from __f2fs_write_meta_folio
f2fs: fix to return correct error number in f2fs_sync_node_pages()
...
|
|
These don't access global memory or defer pointer arguments - this
enables CSE optimizations.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We were accidentally including the contents from the previous
fsck_err().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Make the superblock error counters available in sysfs; the only other
way they can be seen is 'show-super', but we don't write the superblock
every time the error count gets incremented.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This is straightforward enough: check_fix_ptrs() currently only runs
before we go RW, so updating the btree root pointer in c->btree_roots
suffices - it'll be written out in the first journal write we do.
For that, do_bch2_trans_commit_to_journal_replay() now handles
JSET_ENTRY_btree_root entries.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We have a bug report that looks like we might be leaking open buckets -
let's check if they got left attached to the cached btree node.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
More stack usage work.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
with typical config options, variables in different inline functions
aren't sharing stack space - and these are slowpaths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Specialize the .to_text() for alloc_v4, to avoid the temporary on the
stack for conversion from old versions.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
- Separate out a slowpath for bkey_nocow_lock()
- Don't call bch2_bkey_ptrs_c() or loop over pointers more than
necessary
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
More stack usage work.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Factor out an error path for a small stack usage improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Allocate some (smaller) temporary storage in btree_trans for this -
btree_path_down() is in our max-stack call stack.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix an assertion pop in the tiering_misaligned test: rounding down to
bucket size at the end of the journal space calculations leaves
cur_entry_sectors == 0, which is incorrect with !cur_entry_err.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
It's uncomon to have multiple devices with journalling only on a subset,
but can be specified with the 'data_allowed' option. We need to know if
we're doing data/metadata writes to multiple devices, as that requires
issuing flushes before the journal writes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix an infinite loop when bkey_i->k.u64s is 0.
This only happens in userspace, where 'bcachefs list_journal' can print
the entire contents of the journal, and non-dirty entries aren't
validated.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
- Don't print a checksum error when we first read a journal entry: we
print a checksum error later if we'll be using the journal entry.
- Continuing with the theme of of improving error messages and grouping
errors into a single log message per error, print a single 'checksum
error' message per journal entry, and use bch2_journal_ptr_to_text()
to print out where on the device it was.
- Factor out checksum error messages and checking for missing journal
entries into helpers, bch2_journal_read() has gotten obnoxiously big.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Commit 6be3e21d25ca ("fs/dax: don't skip locked entries when scanning
entries") introduced a new function, wait_entry_unlocked_exclusive(),
which waits for the current entry to become unlocked without advancing
the XArray iterator state.
Waiting for the entry to become unlocked requires dropping the XArray
lock. This requires calling xas_pause() prior to dropping the lock
which leaves the xas in a suitable state for the next iteration. However
this has the side-effect of advancing the xas state to the next index.
Normally this isn't an issue because xas_for_each() contains code to
detect this state and thus avoid advancing the index a second time on
the next loop iteration.
However both callers of and wait_entry_unlocked_exclusive() itself
subsequently use the xas state to reload the entry. As xas_pause()
updated the state to the next index this will cause the current entry
which is being waited on to be skipped. This caused the following
warning to fire intermittently when running xftest generic/068 on an XFS
filesystem with FS DAX enabled:
[ 35.067397] ------------[ cut here ]------------
[ 35.068229] WARNING: CPU: 21 PID: 1640 at mm/truncate.c:89 truncate_folio_batch_exceptionals+0xd8/0x1e0
[ 35.069717] Modules linked in: nd_pmem dax_pmem nd_btt nd_e820 libnvdimm
[ 35.071006] CPU: 21 UID: 0 PID: 1640 Comm: fstest Not tainted 6.15.0-rc7+ #77 PREEMPT(voluntary)
[ 35.072613] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/204
[ 35.074845] RIP: 0010:truncate_folio_batch_exceptionals+0xd8/0x1e0
[ 35.075962] Code: a1 00 00 00 f6 47 0d 20 0f 84 97 00 00 00 4c 63 e8 41 39 c4 7f 0b eb 61 49 83 c5 01 45 39 ec 7e 58 42 f68
[ 35.079522] RSP: 0018:ffffb04e426c7850 EFLAGS: 00010202
[ 35.080359] RAX: 0000000000000000 RBX: ffff9d21e3481908 RCX: ffffb04e426c77f4
[ 35.081477] RDX: ffffb04e426c79e8 RSI: ffffb04e426c79e0 RDI: ffff9d21e34816e8
[ 35.082590] RBP: ffffb04e426c79e0 R08: 0000000000000001 R09: 0000000000000003
[ 35.083733] R10: 0000000000000000 R11: 822b53c0f7a49868 R12: 000000000000001f
[ 35.084850] R13: 0000000000000000 R14: ffffb04e426c78e8 R15: fffffffffffffffe
[ 35.085953] FS: 00007f9134c87740(0000) GS:ffff9d22abba0000(0000) knlGS:0000000000000000
[ 35.087346] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 35.088244] CR2: 00007f9134c86000 CR3: 000000040afff000 CR4: 00000000000006f0
[ 35.089354] Call Trace:
[ 35.089749] <TASK>
[ 35.090168] truncate_inode_pages_range+0xfc/0x4d0
[ 35.091078] truncate_pagecache+0x47/0x60
[ 35.091735] xfs_setattr_size+0xc7/0x3e0
[ 35.092648] xfs_vn_setattr+0x1ea/0x270
[ 35.093437] notify_change+0x1f4/0x510
[ 35.094219] ? do_truncate+0x97/0xe0
[ 35.094879] do_truncate+0x97/0xe0
[ 35.095640] path_openat+0xabd/0xca0
[ 35.096278] do_filp_open+0xd7/0x190
[ 35.096860] do_sys_openat2+0x8a/0xe0
[ 35.097459] __x64_sys_openat+0x6d/0xa0
[ 35.098076] do_syscall_64+0xbb/0x1d0
[ 35.098647] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 35.099444] RIP: 0033:0x7f9134d81fc1
[ 35.100033] Code: 75 57 89 f0 25 00 00 41 00 3d 00 00 41 00 74 49 80 3d 2a 26 0e 00 00 74 6d 89 da 48 89 ee bf 9c ff ff ff5
[ 35.102993] RSP: 002b:00007ffcd41e0d10 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[ 35.104263] RAX: ffffffffffffffda RBX: 0000000000000242 RCX: 00007f9134d81fc1
[ 35.105452] RDX: 0000000000000242 RSI: 00007ffcd41e1200 RDI: 00000000ffffff9c
[ 35.106663] RBP: 00007ffcd41e1200 R08: 0000000000000000 R09: 0000000000000064
[ 35.107923] R10: 00000000000001a4 R11: 0000000000000202 R12: 0000000000000066
[ 35.109112] R13: 0000000000100000 R14: 0000000000100000 R15: 0000000000000400
[ 35.110357] </TASK>
[ 35.110769] irq event stamp: 8415587
[ 35.111486] hardirqs last enabled at (8415599): [<ffffffff8d74b562>] __up_console_sem+0x52/0x60
[ 35.113067] hardirqs last disabled at (8415610): [<ffffffff8d74b547>] __up_console_sem+0x37/0x60
[ 35.114575] softirqs last enabled at (8415300): [<ffffffff8d6ac625>] handle_softirqs+0x315/0x3f0
[ 35.115933] softirqs last disabled at (8415291): [<ffffffff8d6ac811>] __irq_exit_rcu+0xa1/0xc0
[ 35.117316] ---[ end trace 0000000000000000 ]---
Fix this by using xas_reset() instead, which is equivalent in
implementation to xas_pause() but does not advance the XArray state.
Fixes: 6be3e21d25ca ("fs/dax: don't skip locked entries when scanning entries")
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Link: https://lore.kernel.org/20250523043749.1460780-1-apopple@nvidia.com
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: "Matthew Wilcow (Oracle)" <willy@infradead.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext2 and isofs updates from Jan Kara:
- isofs fix of handling of particularly formatted Rock Ridge timestamps
- Add deprecation notice about support of DAX in ext2 filesystem driver
* tag 'fs_for_v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
ext2: Deprecate DAX
isofs: fix Y2038 and Y2156 issues in Rock Ridge TF entry
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
"Two fanotify cleanups and support for watching namespace-owned
filesystems by namespace admins (most useful for being able to watch
for new mounts / unmounts happening within a user namespace)"
* tag 'fsnotify_for_v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fanotify: support watching filesystems and mounts inside userns
fanotify: remove redundant permission checks
fanotify: Drop use of flex array in fanotify_fh
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core updates from Greg KH:
"Here are the driver core / kernfs changes for 6.16-rc1.
Not a huge number of changes this development cycle, here's the
summary of what is included in here:
- kernfs locking tweaks, pushing some global locks down into a per-fs
image lock
- rust driver core and pci device bindings added for new features.
- sysfs const work for bin_attributes.
The final churn of switching away from and removing the
transitional struct members, "read_new", "write_new" and
"bin_attrs_new" will come after the merge window to avoid
unnecesary merge conflicts.
- auxbus device creation helpers added
- fauxbus fix for creating sysfs files after the probe completed
properly
- other tiny updates for driver core things.
All of these have been in linux-next for over a week with no reported
issues"
* tag 'driver-core-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
kernfs: Relax constraint in draining guard
Documentation: embargoed-hardware-issues.rst: Remove myself
drivers: hv: fix up const issue with vmbus_chan_bin_attrs
firmware_loader: use SHA-256 library API instead of crypto_shash API
docs: debugfs: do not recommend debugfs_remove_recursive
PM: wakeup: Do not expose 4 device wakeup source APIs
kernfs: switch global kernfs_rename_lock to per-fs lock
kernfs: switch global kernfs_idr_lock to per-fs lock
driver core: auxiliary bus: Fix IS_ERR() vs NULL mixup in __devm_auxiliary_device_create()
sysfs: constify attribute_group::bin_attrs
sysfs: constify bin_attribute argument of bin_attribute::read/write()
software node: Correct a OOB check in software_node_get_reference_args()
devres: simplify devm_kstrdup() using devm_kmemdup()
platform: replace magic number with macro PLATFORM_DEVID_NONE
component: do not try to unbind unbound components
driver core: auxiliary bus: add device creation helpers
driver core: faux: Add sysfs groups after probing
|
|
Increase the buffer size to the count requested by userspace. This
improves performance.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
|
|
This is a preparation for large readdir buffers in fuse.
Simply setting the fuse buffer size to the userspace buffer size should
work, the record sizes are similar (fuse's is slightly larger than libc's,
so no overflow should ever happen).
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jaco Kroon <jaco@uls.co.za>
|
|
When getting the directory contents, the entries are first fetched to a
kernel buffer, then they are copied to userspace with dir_emit(). This
second phase is non-blocking as long as the userspace buffer is not paged
out, making it interruptible makes zero sense.
Overload d_type as flags, since it only uses 4 bits from 32.
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Add support for folios larger than one page size for writeback.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Add support for folios larger than one page size for readahead.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Add support for folios larger than one page size for queued writes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Add support for folios larger than one page size for stores.
Also change variable naming from "this_num" to "nr_bytes".
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Support large folios for symlinks and change the name from
fuse_getlink_page() to fuse_getlink_folio().
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Add support for folios larger than one page size for folio reads into
the page cache.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Add support for folios larger than one page size for writethrough
writes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Refactor the logic in fuse_fill_write_pages() for copying out write
data. This will make the future change for supporting large folios for
writes easier. No functional changes.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Add support for folios larger than one page size for retrieves.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Currently, all folios associated with fuse are one page size. As part of
the work to enable large folios, this commit adds support for copying
to/from folios larger than one page size.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Implement the Device Memory TCP transmit path, allowing zero-copy
data transmission on top of TCP from e.g. GPU memory to the wire.
- Move all the IPv6 routing tables management outside the RTNL scope,
under its own lock and RCU. The route control path is now 3x times
faster.
- Convert queue related netlink ops to instance lock, reducing again
the scope of the RTNL lock. This improves the control plane
scalability.
- Refactor the software crc32c implementation, removing unneeded
abstraction layers and improving significantly the related
micro-benchmarks.
- Optimize the GRO engine for UDP-tunneled traffic, for a 10%
performance improvement in related stream tests.
- Cover more per-CPU storage with local nested BH locking; this is a
prep work to remove the current per-CPU lock in local_bh_disable()
on PREMPT_RT.
- Introduce and use nlmsg_payload helper, combining buffer bounds
verification with accessing payload carried by netlink messages.
Netfilter:
- Rewrite the procfs conntrack table implementation, improving
considerably the dump performance. A lot of user-space tools still
use this interface.
- Implement support for wildcard netdevice in netdev basechain and
flowtables.
- Integrate conntrack information into nft trace infrastructure.
- Export set count and backend name to userspace, for better
introspection.
BPF:
- BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops
programs and can be controlled in similar way to traditional qdiscs
using the "tc qdisc" command.
- Refactor the UDP socket iterator, addressing long standing issues
WRT duplicate hits or missed sockets.
Protocols:
- Improve TCP receive buffer auto-tuning and increase the default
upper bound for the receive buffer; overall this improves the
single flow maximum thoughput on 200Gbs link by over 60%.
- Add AFS GSSAPI security class to AF_RXRPC; it provides transport
security for connections to the AFS fileserver and VL server.
- Improve TCP multipath routing, so that the sources address always
matches the nexthop device.
- Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS,
and thus preventing DoS caused by passing around problematic FDs.
- Retire DCCP socket. DCCP only receives updates for bugs, and major
distros disable it by default. Its removal allows for better
organisation of TCP fields to reduce the number of cache lines hit
in the fast path.
- Extend TCP drop-reason support to cover PAWS checks.
Driver API:
- Reorganize PTP ioctl flag support to require an explicit opt-in for
the drivers, avoiding the problem of drivers not rejecting new
unsupported flags.
- Converted several device drivers to timestamping APIs.
- Introduce per-PHY ethtool dump helpers, improving the support for
dump operations targeting PHYs.
Tests and tooling:
- Add support for classic netlink in user space C codegen, so that
ynl-c can now read, create and modify links, routes addresses and
qdisc layer configuration.
- Add ynl sub-types for binary attributes, allowing ynl-c to output
known struct instead of raw binary data, clarifying the classic
netlink output.
- Extend MPTCP selftests to improve the code-coverage.
- Add tests for XDP tail adjustment in AF_XDP.
New hardware / drivers:
- OpenVPN virtual driver: offload OpenVPN data channels processing to
the kernel-space, increasing the data transfer throughput WRT the
user-space implementation.
- Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC.
- Broadcom asp-v3.0 ethernet driver.
- AMD Renoir ethernet device.
- ReakTek MT9888 2.5G ethernet PHY driver.
- Aeonsemi 10G C45 PHYs driver.
Drivers:
- Ethernet high-speed NICs:
- nVidia/Mellanox (mlx5):
- refactor the steering table handling to significantly
reduce the amount of memory used
- add support for complex matches in H/W flow steering
- improve flow streeing error handling
- convert to netdev instance locking
- Intel (100G, ice, igb, ixgbe, idpf):
- ice: add switchdev support for LLDP traffic over VF
- ixgbe: add firmware manipulation and regions devlink support
- igb: introduce support for frame transmission premption
- igb: adds persistent NAPI configuration
- idpf: introduce RDMA support
- idpf: add initial PTP support
- Meta (fbnic):
- extend hardware stats coverage
- add devlink dev flash support
- Broadcom (bnxt):
- add support for RX-side device memory TCP
- Wangxun (txgbe):
- implement support for udp tunnel offload
- complete PTP and SRIOV support for AML 25G/10G devices
- Ethernet NICs embedded and virtual:
- Google (gve):
- add device memory TCP TX support
- Amazon (ena):
- support persistent per-NAPI config
- Airoha:
- add H/W support for L2 traffic offload
- add per flow stats for flow offloading
- RealTek (rtl8211): add support for WoL magic packet
- Synopsys (stmmac):
- dwmac-socfpga 1000BaseX support
- add Loongson-2K3000 support
- introduce support for hardware-accelerated VLAN stripping
- Broadcom (bcmgenet):
- expose more H/W stats
- Freescale (enetc, dpaa2-eth):
- enetc: add MAC filter, VLAN filter RSS and loopback support
- dpaa2-eth: convert to H/W timestamping APIs
- vxlan: convert FDB table to rhashtable, for better scalabilty
- veth: apply qdisc backpressure on full ring to reduce TX drops
- Ethernet switches:
- Microchip (kzZ88x3): add ETS scheduler support
- Ethernet PHYs:
- RealTek (rtl8211):
- add support for WoL magic packet
- add support for PHY LEDs
- CAN:
- Adds RZ/G3E CANFD support to the rcar_canfd driver.
- Preparatory work for CAN-XL support.
- Add self-tests framework with support for CAN physical interfaces.
- WiFi:
- mac80211:
- scan improvements with multi-link operation (MLO)
- Qualcomm (ath12k):
- enable AHB support for IPQ5332
- add monitor interface support to QCN9274
- add multi-link operation support to WCN7850
- add 802.11d scan offload support to WCN7850
- monitor mode for WCN7850, better 6 GHz regulatory
- Qualcomm (ath11k):
- restore hibernation support
- MediaTek (mt76):
- WiFi-7 improvements
- implement support for mt7990
- Intel (iwlwifi):
- enhanced multi-link single-radio (EMLSR) support on 5 GHz links
- rework device configuration
- RealTek (rtw88):
- improve throughput for RTL8814AU
- RealTek (rtw89):
- add multi-link operation support
- STA/P2P concurrency improvements
- support different SAR configs by antenna
- Bluetooth:
- introduce HCI Driver protocol
- btintel_pcie: do not generate coredump for diagnostic events
- btusb: add HCI Drv commands for configuring altsetting
- btusb: add RTL8851BE device 0x0bda:0xb850
- btusb: add new VID/PID 13d3/3584 for MT7922
- btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925
- btnxpuart: implement host-wakeup feature"
* tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits)
selftests/bpf: Fix bpf selftest build warning
selftests: netfilter: Fix skip of wildcard interface test
net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames
net: openvswitch: Fix the dead loop of MPLS parse
calipso: Don't call calipso functions for AF_INET sk.
selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem
net_sched: hfsc: Address reentrant enqueue adding class to eltree twice
octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback
octeontx2-pf: QOS: Perform cache sync on send queue teardown
net: mana: Add support for Multi Vports on Bare metal
net: devmem: ncdevmem: remove unused variable
net: devmem: ksft: upgrade rx test to send 1K data
net: devmem: ksft: add 5 tuple FS support
net: devmem: ksft: add exit_wait to make rx test pass
net: devmem: ksft: add ipv4 support
net: devmem: preserve sockc_err
page_pool: fix ugly page_pool formatting
net: devmem: move list_add to net_devmem_bind_dmabuf.
selftests: netfilter: nft_queue.sh: include file transfer duration in log message
net: phy: mscc: Fix memory leak when using one step timestamping
...
|
|
On NFS4ERR_DELAY nfs slient updates its stats, but misses for
flexfiles v4.1 DSes.
Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Instead of calling xchg() and unrcu_pointer() before
nfsd_file_put_local(), we now pass pointer to the __rcu pointer and call
xchg() and unrcu_pointer() inside that function.
Where unrcu_pointer() is currently called the internals of "struct
nfsd_file" are not known and that causes older compilers such as gcc-8
to complain.
In some cases we have a __kernel (aka normal) pointer not an __rcu
pointer so we need to cast it to __rcu first. This is strictly a
weakening so no information is lost. Somewhat surprisingly, this cast
is accepted by gcc-8.
This has the pleasing result that the cmpxchg() which sets ro_file and
rw_file, and also the xchg() which clears them, are both now in the nfsd
code.
Reported-by: Pali Rohár <pali@kernel.org>
Reported-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client")
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
nfs_uuid_put() and nfs_close_local_fh() can race if a "struct
nfs_file_localio" is released at the same time that nfsd calls
nfs_localio_invalidate_clients().
It is important that neither of these functions completes after the
other has started looking at a given nfs_file_localio and before it
finishes.
If nfs_uuid_put() exits while nfs_close_local_fh() is closing ro_file
and rw_file it could return to __nfd_file_cache_purge() while some files
are still referenced so the purge may not succeed.
If nfs_close_local_fh() exits while nfsd_uuid_put() is still closing the
files then the "struct nfs_file_localio" could be freed while
nfsd_uuid_put() is still looking at it. This side is currently handled
by copying the pointers out of ro_file and rw_file before deleting from
the list in nfsd_uuid. We need to preserve this while ensuring that
nfsd_uuid_put() does wait for nfs_close_local_fh().
This patch use nfl->uuid and nfl->list to provide the required
interlock.
nfs_uuid_put() removes the nfs_file_localio from the list, then drops
locks and puts the two files, then reclaims the spinlock and sets
->nfs_uuid to NULL.
nfs_close_local_fh() operates in the reverse order, setting ->nfs_uuid
to NULL, then closing the files, then unlinking from the list.
If nfs_uuid_put() finds that ->nfs_uuid is already NULL, it waits for
the nfs_file_localio to be removed from the list. If
nfs_close_local_fh() find that it has already been unlinked it waits for
->nfs_uuid to become NULL. This ensure that one of the two tries to
close the files, but they each waits for the other.
As nfs_uuid_put() is making the list empty, change from a
list_for_each_safe loop to a while that always takes the first entry.
This makes the intent more clear.
Also don't move the list to a temporary local list as this would defeat
the guarantees required for the interlock.
Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client")
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
nfs_close_local_fh() is called from two different places for quite
different use case.
It is called from nfs_uuid_put() when the nfs_uuid is being detached -
possibly because the nfs server is not longer serving that filesystem.
In this case there will always be an nfs_uuid and so rcu_read_lock() is
not needed.
It is also called when the nfs_file_localio is no longer needed. In
this case there may not be an active nfs_uuid.
These two can race, and handling the race properly while avoiding
excessive locking will require different handling on each side.
This patch prepares the way by opencoding nfs_close_local_fh() into
nfs_uuid_put(), then simplifying the code there as befits the context.
Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client")
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
The nfsd_localio_operations structure contains nfsd_file_get() to get a
reference to an nfsd_file. This is only used in one place, where
nfsd_open_local_fh() is also used.
This patch combines the two, calling nfsd_open_local_fh() passing a
pointer to where the nfsd_file pointer might be stored. If there is a
pointer there an nfsd_file_get() can get a reference, that reference is
returned. If not a new nfsd_file is acquired, stored at the pointer,
and returned. When we store a reference we also increase the refcount
on the net, as that refcount is decrements when we clear the stored
pointer.
We now get an extra reference *before* storing the new nfsd_file at the
given location. This avoids possible races with the nfsd_file being
freed before the final reference can be taken.
This patch moves the rcu_dereference() needed after fetching from
ro_file or rw_file into the nfsd code where the 'struct nfs_file' is
fully defined. This avoids an error reported by older versions of gcc
such as gcc-8 which complain about rcu_dereference() use in contexts
where the structure (which will supposedly be accessed) is not fully
defined.
Reported-by: Pali Rohár <pali@kernel.org>
Reported-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client")
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Having separate nfsd_file_put and nfsd_file_put_local in struct
nfsd_localio_operations doesn't make much sense. The difference is that
nfsd_file_put doesn't drop a reference to the nfs_net which is what
keeps nfsd from shutting down.
Currently, if nfsd tries to shutdown it will invalidate the files stored
in the list from the nfs_uuid and this will drop all references to the
nfsd net that the client holds. But the client could still hold some
references to nfsd_files for active IO. So nfsd might think is has
completely shut down local IO, but hasn't and has no way to wait for
those active IO requests to complete.
So this patch changes nfsd_file_get to nfsd_file_get_local and has it
increase the ref count on the nfsd net and it replaces all calls to
->nfsd_put_file to ->nfsd_put_file_local.
It also changes ->nfsd_open_local_fh to return with the refcount on the
net elevated precisely when a valid nfsd_file is returned.
This means that whenever the client holds a valid nfsd_file, there will
be an associated count on the nfsd net, and so the count can only reach
zero when all nfsd_files have been returned.
nfs_local_file_put() is changed to call nfs_to_nfsd_file_put_local()
instead of replacing calls to one with calls to the other because this
will help a later patch which changes nfs_to_nfsd_file_put_local() to
take an __rcu pointer while nfs_local_file_put() doesn't.
Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client")
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Rather than using nfs_uuid.lock to protect installing
a new ro_file or rw_file, change to use cmpxchg().
Removing the file already uses xchg() so this improves symmetry
and also makes the code a little simpler.
Also remove the optimisation of not taking the lock, and not removing
the nfs_file_localio from the linked list, when both ->ro_file and
->rw_file are already NULL. Given that ->nfs_uuid was not NULL, it is
extremely unlikely that neither ->ro_file or ->rw_file is NULL so
this optimisation can be of little value and it complicates
understanding of the code - why can the list_del_init() be skipped?
Finally, move the assignment of NULL to ->nfs_uuid until after
the last action on the nfs_file_localio (the list_del_init). As soon as
this is NULL a racing nfs_close_local_fh() can bypass all the locking
and go on to free the nfs_file_localio, so we must be certain to be
finished with it first.
Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client")
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
A recent commit introduced nfs4_do_mkdir() which reports an error from
nfs4_call_sync() by returning it with ERR_PTR().
This is a problem as nfs4_call_sync() can return negative NFS-specific
errors with values larger than MAX_ERRNO (4095). One example is
NFS4ERR_DELAY which has value 10008.
This "pointer" gets to PTR_ERR_OR_ZERO() in nfs4_proc_mkdir() which
chooses ZERO because it isn't in the range of value errors. Ultimately
the pointer is dereferenced.
This patch changes nfs4_do_mkdir() to report the dentry pointer and
status separately - pointer as a return value, status in an "int *"
parameter.
The same separation is used for _nfs4_proc_mkdir() and the two are
combined only in nfs4_proc_mkdir() after the status has passed through
nfs4_handle_exception(), which ensures the error code does not exceed
MAX_ERRNO.
It also fixes a problem in the even when nfs4_handle_exception() updated
the error value, the original 'alias' was still returned.
Reported-by: Anna Schumaker <anna@kernel.org>
Fixes: 8376583b84a1 ("nfs: change mkdir inode_operation to return alternate dentry if needed.")
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
In some scenarios, when mounting NFS, more than one superblock may be
created. The final superblock used is the last one created, but only the
first superblock carries the ro flag passed from user space. If a ro flag
is added to the superblock via remount, it will trigger the issue
described in Link[1].
Link[2] attempted to address this by marking the superblock as ro during
the initial mount. However, this introduced a new problem in scenarios
where multiple mount points share the same superblock:
[root@a ~]# mount /dev/sdb /mnt/sdb
[root@a ~]# echo "/mnt/sdb *(rw,no_root_squash)" > /etc/exports
[root@a ~]# echo "/mnt/sdb/test_dir2 *(ro,no_root_squash)" >> /etc/exports
[root@a ~]# systemctl restart nfs-server
[root@a ~]# mount -t nfs -o rw 127.0.0.1:/mnt/sdb/test_dir1 /mnt/test_mp1
[root@a ~]# mount | grep nfs4
127.0.0.1:/mnt/sdb/test_dir1 on /mnt/test_mp1 type nfs4 (rw,relatime,...
[root@a ~]# mount -t nfs -o ro 127.0.0.1:/mnt/sdb/test_dir2 /mnt/test_mp2
[root@a ~]# mount | grep nfs4
127.0.0.1:/mnt/sdb/test_dir1 on /mnt/test_mp1 type nfs4 (ro,relatime,...
127.0.0.1:/mnt/sdb/test_dir2 on /mnt/test_mp2 type nfs4 (ro,relatime,...
[root@a ~]#
When mounting the second NFS, the shared superblock is marked as ro,
causing the previous NFS mount to become read-only.
To resolve both issues, the ro flag is no longer applied to the superblock
during remount. Instead, the ro flag on the mount is used to control
whether the mount point is read-only.
Fixes: 281cad46b34d ("NFS: Create a submount rpc_op")
Link[1]: https://lore.kernel.org/all/20240604112636.236517-3-lilingfeng@huaweicloud.com/
Link[2]: https://lore.kernel.org/all/20241130035818.1459775-1-lilingfeng3@huawei.com/
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|