Age | Commit message (Collapse) | Author |
|
btree_update_nodes_written() needs to wait on in-flight writes to old
nodes before marking them as freed. But it has no reason to pin those
old nodes in memory, so some trickyness ensues.
The update we're completing deleted references to those nodes from the
btree, so we know if they've been evicted they can't be pulled back in.
We just have to check if the nodes we have pointers to are still those
old nodes, and haven't been reused.
To do that we check the node's "sequence number" (actually a random 64
bit cookie), but that lives in the node's data buffer. 'struct btree'
can't be freed until filesystem shutdown (as they're quite small), but
the data buffers can be freed or swapped around.
Commit 1f88c3567495, which was fixing a kmsan warning, assumed that we
could safely do this locklessly with just a READ_ONCE() - if we've got a
non-null ptr it would be safe to read from.
But that's not true if the data buffer is a vmalloc allocation, so we
need to restore the locking that commit deleted (or alternatively RCU
free those data buffers, but there's no other reason for that).
Fixes: 1f88c3567495 ("bcachefs: Fix a KMSAN splat in btree_update_nodes_written()")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Instead of simply recreating a mis-casefolded dirent, use the str_hash
repair code, which will rename it if necessary - the dirent might have
been created again with the correct casefolding.
Factor out out bch2_str_hash_repair key() from
__bch2_str_hash_check_key() for the new path to use, and export
bch2_dirent_create_key() as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch2_fsck_renamed_dirent was creating bch_dirent keys open-coded - but
we need to use the appropriate helper, if the directory is casefolded.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Redo (and simplify somewhat) how casefolded and non casefolded dirents
are initialized, and export this to be used by fsck_rename_dirent().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Clang warns (or errors with CONFIG_WERROR=y):
fs/bcachefs/fsck.c:2325:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions]
2325 | int ret = bch2_trans_run(c,
| ^
On clang-17 and older, this is an unconditional error:
fs/bcachefs/fsck.c:2325:2: error: expected expression
2325 | int ret = bch2_trans_run(c,
| ^
Move the declaration of ret to the top of the function to resolve both
ways this issue manifests.
Fixes: c72def523799 ("bcachefs: Run check_dirents second time if required")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
rdtgroup_mutex
A lockdep fix removed two rdt_last_cmd_clear() calls that were used to
clear the last_cmd_status buffer but called without holding the required
rdtgroup_mutex.
The impacted resctrl commands are writing to the cpus or cpus_list files
and creating a new monitor or control group. With stale data in the
last_cmd_status buffer the impacted resctrl commands report the stale error
on success, or append its own failure message to the stale error on
failure.
Consequently, restore the rdt_last_cmd_clear() calls after acquiring
rdtgroup_mutex.
Fixes: c8eafe149530 ("x86/resctrl: Fix potential lockdep warning")
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/all/20250603125828.1590067-1-zengheng4@huawei.com
|
|
When a server has multichannel enabled, we keep polling the server
for interfaces periodically. However, when this query fails, we
disable the polling. This can be problematic as it takes away the
chance for the server to start advertizing again.
This change reschedules the delayed work, even if the current call
failed. That way, multichannel sessions can recover.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Today, during smb2_reconnect, session_mutex is released as soon as
the tcon is reconnected and is in a good state. However, in case
multichannel is enabled, there is also a query of server interfaces that
follows. We've seen that this query can race with reconnects of other
channels, causing them to step on each other with reconnects.
This change extends the hold of session_mutex till after the query of
server interfaces is complete. In order to avoid recursive smb2_reconnect
checks during query ioctl, this change also introduces a session flag
for sessions where such a query is in progress.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Our current approach to select a channel for sending requests is this:
1. iterate all channels to find the min and max queue depth
2. if min and max are not the same, pick the channel with min depth
3. if min and max are same, round robin, as all channels are equally loaded
The problem with this approach is that there's a lag between selecting
a channel and sending the request (that increases the queue depth on the channel).
While these numbers will eventually catch up, there could be a skew in the
channel usage, depending on the application's I/O parallelism and the server's
speed of handling requests.
With sufficient parallelism, this lag can artificially increase the queue depth,
thereby impacting the performance negatively.
This change will change the step 1 above to start the iteration from the last
selected channel. This is to reduce the skew in channel usage even in the presence
of this lag.
Fixes: ea90708d3cf3 ("cifs: use the least loaded channel for sending requests")
Cc: <stable@vger.kernel.org>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This is the next step in the direction of a common smbdirect layer.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This is the next step in the direction of a common smbdirect layer.
Currently only structures are shared, but that will change
over time until everything is shared.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This abstracts the common smbdirect layer.
Currently with just a few things in it,
but that will change over time until everything is
in common.
Will be used in client and server in the next commits
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Will be used in client and server in the next commits.
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
CC: Meetakshi Setiya <meetakshisetiyaoss@gmail.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
This is just a start moving into a common smbdirect layer.
It will be used in the next commits...
Cc: Steve French <smfrench@gmail.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Long Li <longli@microsoft.com>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com>
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Pull NFS clent updates from Anna Schumaker:
"New Features:
- Implement the Sunrpc rfc2203 rpcsec_gss sequence number cache
- Add support for FALLOC_FL_ZERO_RANGE on NFS v4.2
- Add a localio sysfs attribute
Stable Fixes:
- Fix double-unlock bug in nfs_return_empty_folio()
- Don't check for OPEN feature support in v4.1
- Always probe for LOCALIO support asynchronously
- Prevent hang on NFS mounts with xprtsec=[m]tls
Other Bugfixes:
- xattr handlers should check for absent nfs filehandles
- Fix setattr caching of TIME_[MODIFY|ACCESS]_SET when timestamps are
delegated
- Fix listxattr to return selinux security labels
- Connect to NFSv3 DS using TLS if MDS connection uses TLS
- Clear SB_RDONLY before getting a superblock, and ignore when
remounting
- Fix incorrect handling of NFS error codes in nfs4_do_mkdir()
- Various nfs_localio fixes from Neil Brown that include fixing an
rcu compilation error found by older gcc versions.
- Update stats on flexfiles pNFS DSes when receiving NFS4ERR_DELAY
Cleanups:
- Add a refcount tracker for struct net in the nfs_client
- Allow FREE_STATEID to clean up delegations
- Always set NLINK even if the server doesn't support it
- Cleanups to the NFS folio writeback code
- Remove dead code from xs_tcp_tls_setup_socket()"
* tag 'nfs-for-6.16-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (30 commits)
flexfiles/pNFS: update stats on NFS4ERR_DELAY for v4.1 DSes
nfs_localio: change nfsd_file_put_local() to take a pointer to __rcu pointer
nfs_localio: protect race between nfs_uuid_put() and nfs_close_local_fh()
nfs_localio: duplicate nfs_close_local_fh()
nfs_localio: simplify interface to nfsd for getting nfsd_file
nfs_localio: always hold nfsd net ref with nfsd_file ref
nfs_localio: use cmpxchg() to install new nfs_file_localio
SUNRPC: Remove dead code from xs_tcp_tls_setup_socket()
SUNRPC: Prevent hang on NFS mount with xprtsec=[m]tls
nfs: fix incorrect handling of large-number NFS errors in nfs4_do_mkdir()
nfs: ignore SB_RDONLY when remounting nfs
nfs: clear SB_RDONLY before getting superblock
NFS: always probe for LOCALIO support asynchronously
pnfs/flexfiles: connect to NFSv3 DS using TLS if MDS connection uses TLS
NFS: add localio to sysfs
nfs: use writeback_iter directly
nfs: refactor nfs_do_writepage
nfs: don't return AOP_WRITEPAGE_ACTIVATE from nfs_do_writepage
nfs: fold nfs_page_async_flush into nfs_do_writepage
NFSv4: Always set NLINK even if the server doesn't support it
...
|
|
git://git.samba.org/sfrench/cifs-2.6
Pull smb client updates from Steve French:
- multichannel fixes (mostly reconnect related), and clarification of
locking documentation
- automount null pointer check fix
- fixes to add support for ParentLeaseKey
- minor cleanup
- smb1/cifs fixes
* tag 'v6.16-rc-part1-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update the lock ordering comments with new mutex
cifs: dns resolution is needed only for primary channel
cifs: update dstaddr whenever channel iface is updated
cifs: reset connections for all channels when reconnect requested
smb: client: use ParentLeaseKey in cifs_do_create
smb: client: use ParentLeaseKey in open_cached_dir
smb: client: add ParentLeaseKey support
cifs: Fix cifs_query_path_info() for Windows NT servers
cifs: Fix validation of SMB1 query reparse point response
cifs: Correctly set SMB1 SessionKey field in Session Setup Request
cifs: Fix encoding of SMB1 Session Setup NTLMSSP Request in non-UNICODE mode
smb: client: add NULL check in automount_fullpath
smb: client: Remove an unused function and variable
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- "zram: support algorithm-specific parameters" from Sergey Senozhatsky
adds infrastructure for passing algorithm-specific parameters into
zram. A single parameter `winbits' is implemented at this time.
- "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg
charging nmi-safe, which is required by BFP, which can operate in NMI
context.
- "Some random fixes and cleanup to shmem" from Kemeng Shi implements
small fixes and cleanups in the shmem code.
- "Skip mm selftests instead when kernel features are not present" from
Zi Yan fixes some issues in the MM selftest code.
- "mm/damon: build-enable essential DAMON components by default" from
SeongJae Park reworks DAMON Kconfig to make it easier to enable
CONFIG_DAMON.
- "sched/numa: add statistics of numa balance task migration" from Libo
Chen adds more info into sysfs and procfs files to improve visibility
into the NUMA balancer's task migration activity.
- "selftests/mm: cow and gup_longterm cleanups" from Mark Brown
provides various updates to some of the MM selftests to make them
play better with the overall containing framework.
* tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits)
mm/khugepaged: clean up refcount check using folio_expected_ref_count()
selftests/mm: fix test result reporting in gup_longterm
selftests/mm: report unique test names for each cow test
selftests/mm: add helper for logging test start and results
selftests/mm: use standard ksft_finished() in cow and gup_longterm
selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled
sched/numa: add statistics of numa balance task
sched/numa: fix task swap by skipping kernel threads
tools/testing: check correct variable in open_procmap()
tools/testing/vma: add missing function stub
mm/gup: update comment explaining why gup_fast() disables IRQs
selftests/mm: two fixes for the pfnmap test
mm/khugepaged: fix race with folio split/free using temporary reference
mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order
mmu_notifiers: remove leftover stub macros
selftests/mm: deduplicate test names in madv_populate
kcov: rust: add flags for KCOV with Rust
mm: rust: make CONFIG_MMU ifdefs more narrow
mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()
mm/damon/Kconfig: enable CONFIG_DAMON by default
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fix from Andreas Gruenbacher:
- Fix a NULL pointer dereference reported by syzbot
* tag 'gfs2-for-6.16-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Don't clear sb->s_fs_info in gfs2_sys_fs_add
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
- Remove tmp page copying in writeback path (Joanne).
This removes ~300 lines and with that a lot of complexity related to
avoiding reclaim related deadlock. The old mechanism is replaced with
a mapping flag that tells the MM not to block reclaim waiting for
writeback to complete. The MM parts have been reviewed/acked by
respective maintainers.
- Convert more code to handle large folios (Joanne). This still just
adds the code to deal with large folios and does not enable them yet.
- Allow invalidating all cached lookups atomically (Luis Henriques).
This feature is useful for CernVMFS, which currently does this
iteratively.
- Align write prefaulting in fuse with generic one (Dave Hansen)
- Fix race causing invalid data to be cached when setting attributes on
different nodes of a distributed fs (Guang Yuan Wu)
- Update documentation for passthrough (Chen Linxuan)
- Add fdinfo about the device number associated with an opened
/dev/fuse instance (Chen Linxuan)
- Increase readdir buffer size (Miklos). This depends on a patch to VFS
readdir code that was already merged through Christians tree.
- Optimize io-uring request expiration (Joanne)
- Misc cleanups
* tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
fuse: increase readdir buffer size
readdir: supply dir_context.count as readdir buffer size hint
fuse: don't allow signals to interrupt getdents copying
fuse: support large folios for writeback
fuse: support large folios for readahead
fuse: support large folios for queued writes
fuse: support large folios for stores
fuse: support large folios for symlinks
fuse: support large folios for folio reads
fuse: support large folios for writethrough writes
fuse: refactor fuse_fill_write_pages()
fuse: support large folios for retrieves
fuse: support copying large folios
fs: fuse: add dev id to /dev/fuse fdinfo
docs: filesystems: add fuse-passthrough.rst
MAINTAINERS: update filter of FUSE documentation
fuse: fix race between concurrent setattrs from multiple nodes
fuse: remove tmp folio for writebacks and internal rb tree
mm: skip folio reclaim in legacy memcg contexts for deadlockable mappings
fuse: optimize over-io-uring request expiration check
...
|
|
The lock ordering rules listed as comments in cifsglob.h were
missing some lock details and also the fid_lock.
Updated those notes in this commit.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull netfs updates from Christian Brauner:
- The main API document has been extensively updated/rewritten
- Fix an oops in write-retry due to mis-resetting the I/O iterator
- Fix the recording of transferred bytes for short DIO reads
- Fix a request's work item to not require a reference, thereby
avoiding the need to get rid of it in BH/IRQ context
- Fix waiting and waking to be consistent about the waitqueue used
- Remove NETFS_SREQ_SEEK_DATA_READ, NETFS_INVALID_WRITE,
NETFS_ICTX_WRITETHROUGH, NETFS_READ_HOLE_CLEAR,
NETFS_RREQ_DONT_UNLOCK_FOLIOS, and NETFS_RREQ_BLOCKED
- Reorder structs to eliminate holes
- Remove netfs_io_request::ractl
- Only provide proc_link field if CONFIG_PROC_FS=y
- Remove folio_queue::marks3
- Fix undifferentiation of DIO reads from unbuffered reads
* tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix undifferentiation of DIO reads from unbuffered reads
netfs: Fix wait/wake to be consistent about the waitqueue used
netfs: Fix the request's work item to not require a ref
netfs: Fix setting of transferred bytes with short DIO reads
netfs: Fix oops in write-retry from mis-resetting the subreq iterator
fs/netfs: remove unused flag NETFS_RREQ_BLOCKED
fs/netfs: remove unused flag NETFS_RREQ_DONT_UNLOCK_FOLIOS
folio_queue: remove unused field `marks3`
fs/netfs: declare field `proc_link` only if CONFIG_PROC_FS=y
fs/netfs: remove `netfs_io_request.ractl`
fs/netfs: reorder struct fields to eliminate holes
fs/netfs: remove unused enum choice NETFS_READ_HOLE_CLEAR
fs/netfs: remove unused flag NETFS_ICTX_WRITETHROUGH
fs/netfs: remove unused source NETFS_INVALID_WRITE
fs/netfs: remove unused flag NETFS_SREQ_SEEK_DATA_READ
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix the AT_HANDLE_CONNECTABLE option so filesystems that don't know
how to decode a connected non-dir dentry fail the request
- Use repr(transparent) to ensure identical layout between the C and
Rust implementation of struct file
- Add a missing xas_pause() into the dax code employing
wait_entry_unlocked_exclusive()
- Fix FOP_DONTCACHE which we disabled for v6.15.
A folio could get redirtied and/or scheduled for writeback after the
initial dropbehind test. Change the test accordingly to handle these
cases so we can re-enable FOP_DONTCACHE again
* tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
exportfs: require ->fh_to_parent() to encode connectable file handles
rust: file: improve safety comments
rust: file: mark `LocalFile` as `repr(transparent)`
fs/dax: Fix "don't skip locked entries when scanning entries"
iomap: don't lose folio dropbehind state for overwrites
mm/filemap: unify dropbehind flag testing and clearing
mm/filemap: unify read/write dropbehind naming
Revert "Disable FOP_DONTCACHE for now due to bugs"
mm/filemap: use filemap_end_dropbehind() for read invalidation
mm/filemap: gate dropbehind invalidate on folio !dirty && !writeback
|
|
When calling cifs_reconnect, before the connection to the
server is reestablished, the code today does a DNS resolution and
updates server->dstaddr.
However, this is not necessary for secondary channels. Secondary
channels use the interface list returned by the server to decide
which address to connect to. And that happens after tcon is reconnected
and server interfaces are requested.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
When the server interface info changes (more common in clustered
servers like Azure Files), the per-channel iface gets updated.
However, this did not update the corresponding dstaddr. As a result
these channels will still connect (or try connecting) to older addresses.
Fixes: b54034a73baf ("cifs: during reconnect, update interface if necessary")
Cc: <stable@vger.kernel.org>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
cifs_reconnect can be called with a flag to mark the session as needing
reconnect too. When this is done, we expect the connections of all
channels to be reconnected too, which is not happening today.
Without doing this, we have seen bad things happen when primary and
secondary channels are connected to different servers (in case of cloud
services like Azure Files SMB).
This change would force all connections to reconnect as well, not just
the sessions and tcons.
Cc: <stable@vger.kernel.org>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
If we move a key backwards, we'll need a second pass to run the rest of
the fsck checks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We don't want this running out of the same workqueue, and blocking,
writes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Snapshot deletion v2 added sentinal values for deleted snapshots, so
"key for deleted snapshot" - i.e. snapshot deletion missed something -
is safe to repair automatically.
But if we find a key for a missing snapshot we have no idea what
happened, and we shouldn't delete it unless we're very sure that
everything else is consistent.
So hook it up to the new bch2_require_recovery_pass(), we'll now only
delete if snapshots and subvolumes have recenlty been checked.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a superblock flag to temporarily disable ratelimiting for a recovery
pass.
This will be used to make check_key_has_snapshot safer: we don't want to
delete a key for a missing snapshot unless we know that the snapshots
and subvolumes btrees are consistent, i.e. check_snapshots and
check_subvols have run recently.
Changing those btrees - creating/deleting a subvolume or snapshot - will
set the "disable ratelimit" flag, i.e. ensuring that those passes run if
check_key_has_snapshot discovers an error.
We're only disabling ratelimiting in the snapshot/subvol delete paths,
we're not so concerned about the create paths.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a helper for requiring that a recovery pass has already run: either
run it directly, if we're still in recovery, or if we're not in recovery
check if it has run recently and schedule it if it hasn't.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a tracepoint for any time we return an error and unwind.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We had a bug due due to an incomplete revert of the patch implementing
directory i_size (summing up the size of the dirents), leading to
completely screwy i_size values that underflow.
Most userspace programs don't seem to care (e.g. du ignores it), but it
turns out this broke sshfs, so needs to be repaired.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
'inode_has_wrong_backpointer'; we have more specific errors for every
case afterwards.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Implement ParentLeaseKey logic in cifs_do_create() by looking up the
parent cfid, copying its lease key into the fid struct, and setting
the appropriate lease flag.
Fixes: f047390a097e ("CIFS: Add create lease v2 context for SMB3")
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Implement ParentLeaseKey logic in open_cached_dir() by looking up the
parent cfid, copying its lease key into the fid struct, and setting
the appropriate lease flag.
Fixes: f047390a097e ("CIFS: Add create lease v2 context for SMB3")
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
According to MS-SMB2 3.2.4.3.8, when opening a file the client must
lookup its parent directory, copy that entry’s LeaseKey into
ParentLeaseKey, and set SMB2_LEASE_FLAG_PARENT_LEASE_KEY_SET.
Extend lease context functions to carry a parent_lease_key and
lease_flags and to add them to the lease context buffer accordingly in
smb3_create_lease_buf. Also add a parent_lease_key field to struct
cifs_fid and lease_flags to cifs_open_parms.
Only applies to the SMB 3.x dialect family.
Fixes: f047390a097e ("CIFS: Add create lease v2 context for SMB3")
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
For TRANS2 QUERY_PATH_INFO request when the path does not exist, the
Windows NT SMB server returns error response STATUS_OBJECT_NAME_NOT_FOUND
or ERRDOS/ERRbadfile without the SMBFLG_RESPONSE flag set. Similarly it
returns STATUS_DELETE_PENDING when the file is being deleted. And looks
like that any error response from TRANS2 QUERY_PATH_INFO does not have
SMBFLG_RESPONSE flag set.
So relax check in check_smb_hdr() for detecting if the packet is response
for this special case.
This change fixes stat() operation against Windows NT SMB servers and also
all operations which depends on -ENOENT result from stat like creat() or
mkdir().
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Validate the SMB1 query reparse point response per [MS-CIFS] section
2.2.7.2 NT_TRANSACT_IOCTL.
NT_TRANSACT_IOCTL response contains one word long setup data after which is
ByteCount member. So check that SetupCount is 1 before trying to read and
use ByteCount member.
Output setup data contains ReturnedDataLen member which is the output
length of executed IOCTL command by remote system. So check that output was
not truncated before transferring over network.
Change MaxSetupCount of NT_TRANSACT_IOCTL request from 4 to 1 as io_rsp
structure already expects one word long output setup data. This should
prevent server sending incompatible structure (in case it would be extended
in future, which is unlikely).
Change MaxParameterCount of NT_TRANSACT_IOCTL request from 2 to 0 as
NT IOCTL does not have any documented output parameters and this function
does not parse any output parameters at all.
Fixes: ed3e0a149b58 ("smb: client: implement ->query_reparse_point() for SMB1")
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
[MS-CIFS] specification in section 2.2.4.53.1 where is described
SMB_COM_SESSION_SETUP_ANDX Request, for SessionKey field says:
The client MUST set this field to be equal to the SessionKey field in
the SMB_COM_NEGOTIATE Response for this SMB connection.
Linux SMB client currently set this field to zero. This is working fine
against Windows NT SMB servers thanks to [MS-CIFS] product behavior <94>:
Windows NT Server ignores the client's SessionKey.
For compatibility with [MS-CIFS], set this SessionKey field in Session
Setup Request to value retrieved from Negotiate response.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
SMB1 Session Setup NTLMSSP Request in non-UNICODE mode is similar to
UNICODE mode, just strings are encoded in ASCII and not in UTF-16.
With this change it is possible to setup SMB1 session with NTLM
authentication in non-UNICODE mode with Windows SMB server.
This change fixes mounting SMB1 servers with -o nounicode mount option
together with -o sec=ntlmssp mount option (which is the default sec=).
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
page is checked for null in __build_path_from_dentry_optional_prefix
when tcon->origin_fullpath is not set. However, the check is missing when
it is set.
Add a check to prevent a potential NULL pointer dereference.
Signed-off-by: Ruben Devos <devosruben6@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
The CephFS kernel driver forgets to set the filesystem magic signature in
its superblock. As a result, IMA policy rules based on fsmagic matching do
not apply as intended. This causes a major performance regression in Talos
Linux [1] when mounting CephFS volumes, such as when deploying Rook Ceph
[2]. Talos Linux ships a hardened kernel with the following IMA policy
(irrelevant lines omitted):
[...]
dont_measure fsmagic=0xc36400 # CEPH_SUPER_MAGIC
[...]
measure func=FILE_CHECK mask=^MAY_READ euid=0
measure func=FILE_CHECK mask=^MAY_READ uid=0
[...]
Currently, IMA compares 0xc36400 == 0x0 for CephFS files, resulting in all
files opened with O_RDONLY or O_RDWR getting measured with SHA512 on every
open(2):
10 69990c87e8af323d47e2d6ae4... ima-ng sha512:<hash> /data/cephfs/test-file
Since O_WRONLY is rare, this results in an order of magnitude lower
performance than expected for practically all file operations. Properly
setting CEPH_SUPER_MAGIC in the CephFS superblock resolves the regression.
Tests performed on a 3x replicated Ceph v19.3.0 cluster across three
i5-7200U nodes each equipped with one Micron 7400 MAX M.2 disk (BlueStore)
and Gigabit ethernet, on Talos Linux v1.10.2:
FS-Mark 3.3
Test: 500 Files, Empty
Files/s > Higher Is Better
6.12.27-talos . 16.6 |====
+twelho patch . 208.4 |====================================================
FS-Mark 3.3
Test: 500 Files, 1KB Size
Files/s > Higher Is Better
6.12.27-talos . 15.6 |=======
+twelho patch . 118.6 |====================================================
FS-Mark 3.3
Test: 500 Files, 32 Sub Dirs, 1MB Size
Files/s > Higher Is Better
6.12.27-talos . 12.7 |===============
+twelho patch . 44.7 |=====================================================
IO500 [3] 2fcd6d6 results (benchmarks within variance omitted):
| IO500 benchmark | 6.12.27-talos | +twelho patch | Speedup |
|-------------------|----------------|----------------|-----------|
| mdtest-easy-write | 0.018524 kIOPS | 1.135027 kIOPS | 6027.33 % |
| mdtest-hard-write | 0.018498 kIOPS | 0.973312 kIOPS | 5161.71 % |
| ior-easy-read | 0.064727 GiB/s | 0.155324 GiB/s | 139.97 % |
| mdtest-hard-read | 0.018246 kIOPS | 0.780800 kIOPS | 4179.29 % |
This applies outside of synthetic benchmarks as well, for example, the time
to rsync a 55 MiB directory with ~12k of mostly small files drops from an
unusable 10m5s to a reasonable 26s (23x the throughput).
[1]: https://www.talos.dev/
[2]: https://www.talos.dev/v1.10/kubernetes-guides/configuration/ceph-with-rook/
[3]: https://github.com/IO500/io500
Cc: stable@vger.kernel.org
Signed-off-by: Dennis Marttinen <twelho@welho.tech>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
The ceph/export.c contains very confusing logic of file handle size
calculation based on hardcoded values. This patch makes the cleanup of
this logic by means of introduction the named constants.
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
In 'ceph_zero_objects', promote 'object_size' to 'u64' to avoid possible
integer overflow.
Compile tested only.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Dmitry Kandybka <d.kandybka@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
The generic/397 test hits a BUG_ON for the case of encrypted inode with
unaligned file size (for example, 33K or 1K):
[ 877.737811] run fstests generic/397 at 2025-01-03 12:34:40
[ 877.875761] libceph: mon0 (2)127.0.0.1:40674 session established
[ 877.876130] libceph: client4614 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 877.991965] libceph: mon0 (2)127.0.0.1:40674 session established
[ 877.992334] libceph: client4617 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.017234] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.017594] libceph: client4620 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.031394] xfs_io (pid 18988) is setting deprecated v1 encryption policy; recommend upgrading to v2.
[ 878.054528] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.054892] libceph: client4623 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.070287] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.070704] libceph: client4626 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.264586] libceph: mon0 (2)127.0.0.1:40674 session established
[ 878.265258] libceph: client4629 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
[ 878.374578] -----------[ cut here ]------------
[ 878.374586] kernel BUG at net/ceph/messenger.c:1070!
[ 878.375150] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 878.378145] CPU: 2 UID: 0 PID: 4759 Comm: kworker/2:9 Not tainted 6.13.0-rc5+ #1
[ 878.378969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 878.380167] Workqueue: ceph-msgr ceph_con_workfn
[ 878.381639] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50
[ 878.382152] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
[ 878.383928] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287
[ 878.384447] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000
[ 878.385129] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378
[ 878.385839] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000
[ 878.386539] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030
[ 878.387203] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001
[ 878.387877] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000
[ 878.388663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 878.389212] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0
[ 878.389921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 878.390620] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 878.391307] PKRU: 55555554
[ 878.391567] Call Trace:
[ 878.391807] <TASK>
[ 878.392021] ? show_regs+0x71/0x90
[ 878.392391] ? die+0x38/0xa0
[ 878.392667] ? do_trap+0xdb/0x100
[ 878.392981] ? do_error_trap+0x75/0xb0
[ 878.393372] ? ceph_msg_data_cursor_init+0x42/0x50
[ 878.393842] ? exc_invalid_op+0x53/0x80
[ 878.394232] ? ceph_msg_data_cursor_init+0x42/0x50
[ 878.394694] ? asm_exc_invalid_op+0x1b/0x20
[ 878.395099] ? ceph_msg_data_cursor_init+0x42/0x50
[ 878.395583] ? ceph_con_v2_try_read+0xd16/0x2220
[ 878.396027] ? _raw_spin_unlock+0xe/0x40
[ 878.396428] ? raw_spin_rq_unlock+0x10/0x40
[ 878.396842] ? finish_task_switch.isra.0+0x97/0x310
[ 878.397338] ? __schedule+0x44b/0x16b0
[ 878.397738] ceph_con_workfn+0x326/0x750
[ 878.398121] process_one_work+0x188/0x3d0
[ 878.398522] ? __pfx_worker_thread+0x10/0x10
[ 878.398929] worker_thread+0x2b5/0x3c0
[ 878.399310] ? __pfx_worker_thread+0x10/0x10
[ 878.399727] kthread+0xe1/0x120
[ 878.400031] ? __pfx_kthread+0x10/0x10
[ 878.400431] ret_from_fork+0x43/0x70
[ 878.400771] ? __pfx_kthread+0x10/0x10
[ 878.401127] ret_from_fork_asm+0x1a/0x30
[ 878.401543] </TASK>
[ 878.401760] Modules linked in: hctr2 nhpoly1305_avx2 nhpoly1305_sse2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 essiv authenc mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit kvm_intel kvm crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel joydev crypto_simd cryptd rapl input_leds psmouse sch_fq_codel serio_raw bochs i2c_piix4 floppy qemu_fw_cfg i2c_smbus mac_hid pata_acpi msr parport_pc ppdev lp parport efi_pstore ip_tables x_tables
[ 878.407319] ---[ end trace 0000000000000000 ]---
[ 878.407775] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50
[ 878.408317] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
[ 878.410087] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287
[ 878.410609] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000
[ 878.411318] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378
[ 878.412014] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000
[ 878.412735] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030
[ 878.413438] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001
[ 878.414121] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000
[ 878.414935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 878.415516] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0
[ 878.416211] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 878.416907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 878.417630] PKRU: 55555554
(gdb) l *ceph_msg_data_cursor_init+0x42
0xffffffff823b45a2 is in ceph_msg_data_cursor_init (net/ceph/messenger.c:1070).
1065
1066 void ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor,
1067 struct ceph_msg *msg, size_t length)
1068 {
1069 BUG_ON(!length);
1070 BUG_ON(length > msg->data_length);
1071 BUG_ON(!msg->num_data_items);
1072
1073 cursor->total_resid = length;
1074 cursor->data = msg->data;
The issue takes place because of this:
[ 202.628853] libceph: net/ceph/messenger_v2.c:2034 prepare_sparse_read_data(): msg->data_length 33792, msg->sparse_read_total 36864
1070 BUG_ON(length > msg->data_length);
The generic/397 test (xfstests) executes such steps:
(1) create encrypted files and directories;
(2) access the created files and folders with encryption key;
(3) access the created files and folders without encryption key.
The issue takes place in this portion of code:
if (IS_ENCRYPTED(inode)) {
struct page **pages;
size_t page_off;
err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len,
&page_off);
if (err < 0) {
doutc(cl, "%llx.%llx failed to allocate pages, %d\n",
ceph_vinop(inode), err);
goto out;
}
/* should always give us a page-aligned read */
WARN_ON_ONCE(page_off);
len = err;
err = 0;
osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false,
false);
The reason of the issue is that subreq->io_iter.count keeps unaligned
value of length:
[ 347.751182] lib/iov_iter.c:1185 __iov_iter_get_pages_alloc(): maxsize 36864, maxpages 4294967295, start 18446659367320516064
[ 347.752808] lib/iov_iter.c:1196 __iov_iter_get_pages_alloc(): maxsize 33792, maxpages 4294967295, start 18446659367320516064
[ 347.754394] lib/iov_iter.c:1015 iter_folioq_get_pages(): maxsize 33792, maxpages 4294967295, extracted 0, _start_offset 18446659367320516064
This patch simply assigns the aligned value to subreq->io_iter.count
before calling iov_iter_get_pages_alloc2().
[ idryomov: tag the comment with FIXME to make it clear that it's only
a workaround for netfslib not coexisting with fscrypt nicely
(this is also noted in another pre-existing comment) ]
Cc: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org
Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|