Age | Commit message (Collapse) | Author |
|
Pull io_uring updates from Jens Axboe:
- Avoid indirect function calls in io-wq for executing and freeing
work.
The design of io-wq is such that it can be a generic mechanism, but
as it's just used by io_uring now, may as well avoid these indirect
calls
- Clean up registered buffers for networking
- Add support for IORING_OP_PIPE. Pretty straight forward, allows
creating pipes with io_uring, particularly useful for having these be
instantiated as direct descriptors
- Clean up the coalescing support fore registered buffers
- Add support for multiple interface queues for zero-copy rx
networking. As this feature was merged for 6.15 it supported just a
single ifq per ring
- Clean up the eventfd support
- Add dma-buf support to zero-copy rx
- Clean up and improving the request draining support
- Clean up provided buffer support, most notably with an eye toward
making the legacy support less intrusive
- Minor fdinfo cleanups, dropping support for dumping what credentials
are registered
- Improve support for overflow CQE handling, getting rid of GFP_ATOMIC
for allocating overflow entries where possible
- Improve detection of cases where io-wq doesn't need to spawn a new
worker unnecessarily
- Various little cleanups
* tag 'for-6.16/io_uring-20250523' of git://git.kernel.dk/linux: (59 commits)
io_uring/cmd: warn on reg buf imports by ineligible cmds
io_uring/io-wq: only create a new worker if it can make progress
io_uring/io-wq: ignore non-busy worker going to sleep
io_uring/io-wq: move hash helpers to the top
trace/io_uring: fix io_uring_local_work_run ctx documentation
io_uring: finish IOU_OK -> IOU_COMPLETE transition
io_uring: add new helpers for posting overflows
io_uring: pass in struct io_big_cqe to io_alloc_ocqe()
io_uring: make io_alloc_ocqe() take a struct io_cqe pointer
io_uring: split alloc and add of overflow
io_uring: open code io_req_cqe_overflow()
io_uring/fdinfo: get rid of dumping credentials
io_uring/fdinfo: only compile if CONFIG_PROC_FS is set
io_uring/kbuf: unify legacy buf provision and removal
io_uring/kbuf: refactor __io_remove_buffers
io_uring/kbuf: don't compute size twice on prep
io_uring/kbuf: drop extra vars in io_register_pbuf_ring
io_uring/kbuf: use mem_is_zero()
io_uring/kbuf: account ring io_buffer_list memory
io_uring: drain based on allocates reqs
...
|
|
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezeing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregisgering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
|
|
Following patches need the RDMA rc branch since we are past the RC cycle
now.
Merge conflicts resolved based on Linux-next:
- For RXE odp changes keep for-next version and fixup new places that
need to call is_odp_mr()
https://lore.kernel.org/r/20250422143019.500201bd@canb.auug.org.au
https://lore.kernel.org/r/20250514122455.3593b083@canb.auug.org.au
- irdma is keeping the while/kfree bugfix from -rc and the pf/cdev_info
change from for-next
https://lore.kernel.org/r/20250513130630.280ee6c5@canb.auug.org.au
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull coredump updates from Christian Brauner:
"This adds support for sending coredumps over an AF_UNIX socket. It
also makes (implicit) use of the new SO_PEERPIDFD ability to hand out
pidfds for reaped peer tasks
The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a saf way to
handle them instead of relying on super privileged coredumping helpers
This will also be significantly more lightweight since the kernel
doens't have to do a fork()+exec() for each crashing process to spawn
a usermodehelper. Instead the kernel just connects to the AF_UNIX
socket and userspace can process it concurrently however it sees fit.
Support for userspace is incoming starting with systemd-coredump
There's more work coming in that direction next cycle. The rest below
goes into some details and background
Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd
For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional
parameters pass information about the task that is generating the
coredump to the binary that processes the coredump
In the example the core_pattern shown causes the kernel to spawn
systemd-coredump as a usermode helper. There's various conceptual
consequences of this (non-exhaustive list):
- systemd-coredump is spawned with file descriptor number 0 (stdin)
connected to the read-end of the pipe. All other file descriptors
are closed. That specifically includes 1 (stdout) and 2 (stderr).
This has already caused bugs because userspace assumed that this
cannot happen (Whether or not this is a sane assumption is
irrelevant)
- systemd-coredump will be spawned as a child of system_unbound_wq.
So it is not a child of any userspace process and specifically not
a child of PID 1. It cannot be waited upon and is in a weird hybrid
upcall which are difficult for userspace to control correctly
- systemd-coredump is spawned with full kernel privileges. This
necessitates all kinds of weird privilege dropping excercises in
userspace to make this safe
- A new usermode helper has to be spawned for each crashing process
This adds a new mode:
(3) Dumping into an AF_UNIX socket
Userspace can set /proc/sys/kernel/core_pattern to:
@/path/to/coredump.socket
The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps
The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket:
- The coredump server uses SO_PEERPIDFD to get a stable handle on the
connected crashing task. The retrieved pidfd will provide a stable
reference even if the crashing task gets SIGKILLed while generating
the coredump. That is a huge attack vector right now
- By setting core_pipe_limit non-zero userspace can guarantee that
the crashing task cannot be reaped behind it's back and thus
process all necessary information in /proc/<pid>. The SO_PEERPIDFD
can be used to detect whether /proc/<pid> still refers to the same
process
The core_pipe_limit isn't used to rate-limit connections to the
socket. This can simply be done via AF_UNIX socket directly
- The pidfd for the crashing task will contain information how the
task coredumps. The PIDFD_GET_INFO ioctl gained a new flag
PIDFD_INFO_COREDUMP which can be used to retreive the coredump
information
If the coredump gets a new coredump client connection the kernel
guarantees that PIDFD_INFO_COREDUMP information is available.
Currently the following information is provided in the new
@coredump_mask extension to struct pidfd_info:
* PIDFD_COREDUMPED is raised if the task did actually coredump
* PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping
(e.g., undumpable)
* PIDFD_COREDUMP_USER is raised if this is a regular coredump and
doesn't need special care by the coredump server
* PIDFD_COREDUMP_ROOT is raised if the generated coredump should
be treated as sensitive and the coredump server should restrict
access to the generated coredump to sufficiently privileged
users"
* tag 'vfs-6.16-rc1.coredump' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
mips, net: ensure that SOCK_COREDUMP is defined
selftests/coredump: add tests for AF_UNIX coredumps
selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
coredump: validate socket name as it is written
coredump: show supported coredump modes
pidfs, coredump: add PIDFD_INFO_COREDUMP
coredump: add coredump socket
coredump: reflow dump helpers a little
coredump: massage do_coredump()
coredump: massage format_corename()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pidfs updates from Christian Brauner:
"Features:
- Allow handing out pidfds for reaped tasks for AF_UNIX SO_PEERPIDFD
socket option
SO_PEERPIDFD is a socket option that allows to retrieve a pidfd for
the process that called connect() or listen(). This is heavily used
to safely authenticate clients in userspace avoiding security bugs
due to pid recycling races (dbus, polkit, systemd, etc.)
SO_PEERPIDFD currently doesn't support handing out pidfds if the
sk->sk_peer_pid thread-group leader has already been reaped. In
this case it currently returns EINVAL. Userspace still wants to get
a pidfd for a reaped process to have a stable handle it can pass
on. This is especially useful now that it is possible to retrieve
exit information through a pidfd via the PIDFD_GET_INFO ioctl()'s
PIDFD_INFO_EXIT flag
Another summary has been provided by David Rheinsberg:
> A pidfd can outlive the task it refers to, and thus user-space
> must already be prepared that the task underlying a pidfd is
> gone at the time they get their hands on the pidfd. For
> instance, resolving the pidfd to a PID via the fdinfo must be
> prepared to read `-1`.
>
> Despite user-space knowing that a pidfd might be stale, several
> kernel APIs currently add another layer that checks for this. In
> particular, SO_PEERPIDFD returns `EINVAL` if the peer-task was
> already reaped, but returns a stale pidfd if the task is reaped
> immediately after the respective alive-check.
>
> This has the unfortunate effect that user-space now has two ways
> to check for the exact same scenario: A syscall might return
> EINVAL/ESRCH/... *or* the pidfd might be stale, even though
> there is no particular reason to distinguish both cases. This
> also propagates through user-space APIs, which pass on pidfds.
> They must be prepared to pass on `-1` *or* the pidfd, because
> there is no guaranteed way to get a stale pidfd from the kernel.
>
> Userspace must already deal with a pidfd referring to a reaped
> task as the task may exit and get reaped at any time will there
> are still many pidfds referring to it
In order to allow handing out reaped pidfd SO_PEERPIDFD needs to
ensure that PIDFD_INFO_EXIT information is available whenever a
pidfd for a reaped task is created by PIDFD_INFO_EXIT. The uapi
promises that reaped pidfds are only handed out if it is guaranteed
that the caller sees the exit information:
TEST_F(pidfd_info, success_reaped)
{
struct pidfd_info info = {
.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT,
};
/*
* Process has already been reaped and PIDFD_INFO_EXIT been set.
* Verify that we can retrieve the exit status of the process.
*/
ASSERT_EQ(ioctl(self->child_pidfd4, PIDFD_GET_INFO, &info), 0);
ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
ASSERT_TRUE(WIFEXITED(info.exit_code));
ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
}
To hand out pidfds for reaped processes we thus allocate a pidfs
entry for the relevant sk->sk_peer_pid at the time the
sk->sk_peer_pid is stashed and drop it when the socket is
destroyed. This guarantees that exit information will always be
recorded for the sk->sk_peer_pid task and we can hand out pidfds
for reaped processes
- Hand a pidfd to the coredump usermode helper process
Give userspace a way to instruct the kernel to install a pidfd for
the crashing process into the process started as a usermode helper.
There's still tricky race-windows that cannot be easily or
sometimes not closed at all by userspace. There's various ways like
looking at the start time of a process to make sure that the
usermode helper process is started after the crashing process but
it's all very very brittle and fraught with peril
The crashed-but-not-reaped process can be killed by userspace
before coredump processing programs like systemd-coredump have had
time to manually open a PIDFD from the PID the kernel provides
them, which means they can be tricked into reading from an
arbitrary process, and they run with full privileges as they are
usermode helper processes
Even if that specific race-window wouldn't exist it's still the
safest and cleanest way to let the kernel provide the pidfd
directly instead of requiring userspace to do it manually. In
parallel with this commit we already have systemd adding support
for this in [1]
When the usermode helper process is forked we install a pidfd file
descriptor three into the usermode helper's file descriptor table
so it's available to the exec'd program
Since usermode helpers are either children of the system_unbound_wq
workqueue or kthreadd we know that the file descriptor table is
empty and can thus always use three as the file descriptor number
Note, that we'll install a pidfd for the thread-group leader even
if a subthread is calling do_coredump(). We know that task linkage
hasn't been removed yet and even if this @current isn't the actual
thread-group leader we know that the thread-group leader cannot be
reaped until
@current has exited
- Allow telling when a task has not been found from finding the wrong
task when creating a pidfd
We currently report EINVAL whenever a struct pid has no tasked
attached anymore thereby conflating two concepts:
(1) The task has already been reaped
(2) The caller requested a pidfd for a thread-group leader but the
pid actually references a struct pid that isn't used as a
thread-group leader
This is causing issues for non-threaded workloads as in where they
expect ESRCH to be reported, not EINVAL
So allow userspace to reliably distinguish between (1) and (2)
- Make it possible to detect when a pidfs entry would outlive the
struct pid it pinned
- Add a range of new selftests
Cleanups:
- Remove unneeded NULL check from pidfd_prepare() for passed struct
pid
- Avoid pointless reference count bump during release_task()
Fixes:
- Various fixes to the pidfd and coredump selftests
- Fix error handling for replace_fd() when spawning coredump usermode
helper"
* tag 'vfs-6.16-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
pidfs: detect refcount bugs
coredump: hand a pidfd to the usermode coredump helper
coredump: fix error handling for replace_fd()
pidfs: move O_RDWR into pidfs_alloc_file()
selftests: coredump: Raise timeout to 2 minutes
selftests: coredump: Fix test failure for slow machines
selftests: coredump: Properly initialize pointer
net, pidfs: enable handing out pidfds for reaped sk->sk_peer_pid
pidfs: get rid of __pidfd_prepare()
net, pidfs: prepare for handing out pidfds for reaped sk->sk_peer_pid
pidfs: register pid in pidfs
net, pidfd: report EINVAL for ESRCH
release_task: kill the no longer needed get/put_pid(thread_pid)
pidfs: ensure consistent ENOENT/ESRCH reporting
exit: move wake_up_all() pidfd waiters into __unhash_process()
selftest/pidfd: add test for thread-group leader pidfd open for thread
pidfd: improve uapi when task isn't found
pidfd: remove unneeded NULL check from pidfd_prepare()
selftests/pidfd: adapt to recent changes
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next,
specifically 26 patches: 5 patches adding/updating selftests,
4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables):
1) Improve selftest coverage for pipapo 4 bit group format, from
Florian Westphal.
2) Fix incorrect dependencies when compiling a kernel without
legacy ip{6}tables support, also from Florian.
3) Two patches to fix nft_fib vrf issues, including selftest updates
to improve coverage, also from Florian Westphal.
4) Fix incorrect nesting in nft_tunnel's GENEVE support, from
Fernando F. Mancera.
5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure
and nft_inner to match in inner headers, from Sebastian Andrzej Siewior.
6) Integrate conntrack information into nft trace infrastructure,
from Florian Westphal.
7) A series of 13 patches to allow to specify wildcard netdevice in
netdev basechain and flowtables, eg.
table netdev filter {
chain ingress {
type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept;
}
}
This also allows for runtime hook registration on NETDEV_{UN}REGISTER
event, from Phil Sutter.
netfilter pull request 25-05-23
* tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits)
selftests: netfilter: Torture nftables netdev hooks
netfilter: nf_tables: Add notifications for hook changes
netfilter: nf_tables: Support wildcard netdev hook specs
netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc()
netfilter: nf_tables: Handle NETDEV_CHANGENAME events
netfilter: nf_tables: Wrap netdev notifiers
netfilter: nf_tables: Respect NETDEV_REGISTER events
netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events
netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook
netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook()
netfilter: nf_tables: Introduce nft_register_flowtable_ops()
netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()
netfilter: nf_tables: Introduce functions freeing nft_hook objects
netfilter: nf_tables: add packets conntrack state to debug trace info
netfilter: conntrack: make nf_conntrack_id callable without a module dependency
netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit
netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx
netfilter: nf_dup{4, 6}: Move duplication check to task_struct
netfilter: nft_tunnel: fix geneve_opt dump
selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs
...
====================
Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Add feature UBLK_F_QUIESCE, which adds control command `UBLK_U_CMD_QUIESCE_DEV`
for quiescing device, then device state can become `UBLK_S_DEV_QUIESCED`
or `UBLK_S_DEV_FAIL_IO` finally from ublk_ch_release() with ublk server
cooperation.
This feature can help to support to upgrade ublk server application by
shutting down ublk server gracefully, meantime keep ublk block device
persistent during the upgrading period.
The feature is only available for UBLK_F_USER_RECOVERY.
Suggested-by: Yoav Cohen <yoav@nvidia.com>
Link: https://lore.kernel.org/linux-block/DM4PR12MB632807AB7CDCE77D1E5AB7D0A9B92@DM4PR12MB6328.namprd12.prod.outlook.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250522163523.406289-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Notify user space if netdev hooks are updated due to netdev add/remove
events. Send minimal notification messages by introducing
NFT_MSG_NEWDEV/DELDEV message types describing a single device only.
Upon NETDEV_CHANGENAME, the callback has no information about the
interface's old name. To provide a clear message to user space, include
the hook's stored interface name in the notification.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add the minimal relevant info needed for userspace ("nftables monitor
trace") to provide the conntrack view of the packet:
- state (new, related, established)
- direction (original, reply)
- status (e.g., if connection is subject to dnat)
- id (allows to query ctnetlink for remaining conntrack state info)
Example:
trace id a62 inet filter PRE_RAW packet: iif "enp0s3" ether [..]
[..]
trace id a62 inet filter PRE_MANGLE conntrack: ct direction original ct state new ct id 32
trace id a62 inet filter PRE_MANGLE packet: [..]
[..]
trace id a62 inet filter IN conntrack: ct direction original ct state new ct status dnat-done ct id 32
[..]
In this case one can see that while NAT is active, the new connection
isn't subject to a translation.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
* kvm-arm64/nv-nv:
: .
: Flick the switch on the NV support by adding the missing piece
: in the form of the VNCR page management. From the cover letter:
:
: "This is probably the most interesting bit of the whole NV adventure.
: So far, everything else has been a walk in the park, but this one is
: where the real fun takes place.
:
: With FEAT_NV2, most of the NV support revolves around tricking a guest
: into accessing memory while it tries to access system registers. The
: hypervisor's job is to handle the context switch of the actual
: registers with the state in memory as needed."
: .
KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
KVM: arm64: Document NV caps and vcpu flags
KVM: arm64: Allow userspace to request KVM_ARM_VCPU_EL2*
KVM: arm64: nv: Remove dead code from ERET handling
KVM: arm64: nv: Plumb TLBI S1E2 into system instruction dispatch
KVM: arm64: nv: Add S1 TLB invalidation primitive for VNCR_EL2
KVM: arm64: nv: Program host's VNCR_EL2 to the fixmap address
KVM: arm64: nv: Handle VNCR_EL2 invalidation from MMU notifiers
KVM: arm64: nv: Handle mapping of VNCR_EL2 at EL2
KVM: arm64: nv: Handle VNCR_EL2-triggered faults
KVM: arm64: nv: Add userspace and guest handling of VNCR_EL2
KVM: arm64: nv: Add pseudo-TLB backing VNCR_EL2
KVM: arm64: nv: Don't adjust PSTATE.M when L2 is nesting
KVM: arm64: nv: Move TLBI range decoding to a helper
KVM: arm64: nv: Snapshot S1 ASID tagging information during walk
KVM: arm64: nv: Extract translation helper from the AT code
KVM: arm64: nv: Allocate VNCR page when required
arm64: sysreg: Add layout for VNCR_EL2
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
As long as recvmsg() or recvmmsg() is used with cmsg, it is not
possible to avoid receiving file descriptors via SCM_RIGHTS.
This behaviour has occasionally been flagged as problematic, as
it can be (ab)used to trigger DoS during close(), for example, by
passing a FUSE-controlled fd or a hung NFS fd.
For instance, as noted on the uAPI Group page [0], an untrusted peer
could send a file descriptor pointing to a hung NFS mount and then
close it. Once the receiver calls recvmsg() with msg_control, the
descriptor is automatically installed, and then the responsibility
for the final close() now falls on the receiver, which may result
in blocking the process for a long time.
Regarding this, systemd calls cmsg_close_all() [1] after each
recvmsg() to close() unwanted file descriptors sent via SCM_RIGHTS.
However, this cannot work around the issue at all, because the final
fput() may still occur on the receiver's side once sendmsg() with
SCM_RIGHTS succeeds. Also, even filtering by LSM at recvmsg() does
not work for the same reason.
Thus, we need a better way to refuse SCM_RIGHTS at sendmsg().
Let's introduce SO_PASSRIGHTS to disable SCM_RIGHTS.
Note that this option is enabled by default for backward
compatibility.
Link: https://uapi-group.org/kernel-features/#disabling-reception-of-scm_rights-for-af_unix-sockets #[0]
Link: https://github.com/systemd/systemd/blob/v257.5/src/basic/fd-util.c#L612-L628 #[1]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
UBLK_F_AUTO_BUF_REG requires that the buffer registered automatically
is unregistered in same `io_ring_ctx`, so check it explicitly.
Document this requirement for UBLK_F_AUTO_BUF_REG.
Drop WARN_ON_ONCE() which is triggered from userspace code path.
Fixes: 99c1e4eb6a3f ("ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250522152043.399824-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Multi-PTP source support within a network topology has been merged,
but the hardware timestamp source is not yet exposed to users.
Currently, users only see the PTP index, which does not indicate
whether the timestamp comes from a PHY or a MAC.
Add support for reporting the hwtstamp source using a
hwtstamp-source field, alongside hwtstamp-phyindex, to describe
the origin of the hardware timestamp.
Remove HWTSTAMP_SOURCE_UNSPEC enum value as it is not used at all.
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250519-feature_ptp_source-v4-1-5d10e19a0265@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When applying a recent commit to the <uapi/linux/perf_event.h>
header I noticed that we have accumulated quite a bit of
historic noise in this header, so do a bit of spring cleaning:
- Define bitfields in a vertically aligned fashion, like
perf_event_mmap_page::capabilities already does. This
makes it easier to see the distribution and sizing of
bits within a word, at a glance. The following is much
more readable:
__u64 cap_bit0 : 1,
cap_bit0_is_deprecated : 1,
cap_user_rdpmc : 1,
cap_user_time : 1,
cap_user_time_zero : 1,
cap_user_time_short : 1,
cap_____res : 58;
Than:
__u64 cap_bit0:1,
cap_bit0_is_deprecated:1,
cap_user_rdpmc:1,
cap_user_time:1,
cap_user_time_zero:1,
cap_user_time_short:1,
cap_____res:58;
So convert all bitfield definitions from the latter style to the
former style.
- Fix typos and grammar
- Fix capitalization
- Remove whitespace noise
- Harmonize the definitions of various generations and groups of
PERF_MEM_ ABI values.
- Vertically align all definitions and assignments to the same
column (48), as the first definition (enum perf_type_id),
throughout the entire header.
- And in general make the code and comments to be more in sync
with each other and to be more readable overall.
No change in functionality.
Copy the changes over to tools/include/uapi/linux/perf_event.h.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250521221529.2547099-1-irogers@google.com
|
|
AAUX data for PERF_SAMPLE_AUX appears last. PERF_SAMPLE_CGROUP is
missing from the comment.
This makes the <uapi/linux/perf_event.h> comment match that in the
perf_event_open man page.
Signed-off-by: Ian Rogers <irogers@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-perf-users@vger.kernel.org
Link: https://lore.kernel.org/r/20250521221529.2547099-1-irogers@google.com
|
|
Kory is reporting build issues after recent additions to YNL
if the system headers are old.
Link: https://lore.kernel.org/20250519164949.597d6e92@kmaincent-XPS-13-7390
Reported-by: Kory Maincent <kory.maincent@bootlin.com>
Fixes: 0939a418b3b0 ("tools: ynl: submsg: reverse parse / error reporting")
Tested-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250520161916.413298-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
- Provide user register access over IOCTL.
Both register read and write are supported.
- APML interface does not provide a synchronization method. By defining,
a register access path, we use APML modules and library for
all APML transactions. Without having to use external tools such as
i2c-tools, which may cause race conditions.
Reviewed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Akshay Gupta <akshay.gupta@amd.com>
Link: https://lore.kernel.org/r/20250428063034.2145566-10-akshay.gupta@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
- AMD provides custom protocol to read Machine Check Architecture(MCA)
registers over sideband. The information is accessed for range of
MCA registers by passing register address and thread ID to the protocol.
MCA register read command using the register address to access
Core::X86::Msr::MCG_CAP which determines the number of MCA banks.
Access is read-only
Reviewed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Akshay Gupta <akshay.gupta@amd.com>
Link: https://lore.kernel.org/r/20250428063034.2145566-9-akshay.gupta@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
- AMD provides custom protocol to read Processor feature
capabilities and configuration information through side band.
The information is accessed by providing CPUID Function,
extended function and thread ID to the protocol.
Undefined function returns 0.
Reviewed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Akshay Gupta <akshay.gupta@amd.com>
Link: https://lore.kernel.org/r/20250428063034.2145566-8-akshay.gupta@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The present sbrmi module only support reporting power via hwmon.
However, AMD data center range of processors support various
system management functionality using custom protocols defined in
Advanced Platform Management Link (APML) specification.
Register a miscdevice, which creates a device /dev/sbrmiX with an IOCTL
interface for the user space to invoke the APML Mailbox protocol, which
is already defined in sbrmi_mailbox_xfer().
The APML protocols depend on a set of RMI registers. Having an IOCTL
as a single entry point will help in providing synchronization among
these protocols as multiple transactions on RMI register set may
create race condition.
Support for other protocols will be added in subsequent patches.
APML mailbox protocol returns additional error codes written by
SMU firmware in the out-bound register 0x37. These errors include,
invalid core, message not supported over platform and
others. This additional error codes can be used to provide more
details to user space.
Open-sourced and widely used https://github.com/amd/esmi_oob_library
will continue to provide user-space programmable API.
Reviewed-by: Naveen Krishna Chatradhi <naveenkrishna.chatradhi@amd.com>
Signed-off-by: Akshay Gupta <akshay.gupta@amd.com>
Link: https://lore.kernel.org/r/20250428063034.2145566-7-akshay.gupta@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Extend the PIDFD_INFO_COREDUMP ioctl() with the new PIDFD_INFO_COREDUMP
mask flag. This adds the @coredump_mask field to struct pidfd_info.
When a task coredumps the kernel will provide the following information
to userspace in @coredump_mask:
* PIDFD_COREDUMPED is raised if the task did actually coredump.
* PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
undumpable).
* PIDFD_COREDUMP_USER is raised if this is a regular coredump and
doesn't need special care by the coredump server.
* PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
treated as sensitive and the coredump server should restrict to the
generated coredump to sufficiently privileged users.
The kernel guarantees that by the time the connection is made the all
PIDFD_INFO_COREDUMP info is available.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-5-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The console dimension and cursor position are available through the
/dev/vcsa interface already. However the /dev/vcsa header format uses
single-byte fields therefore those values are clamped to 255.
As surprizing as this may seem, some people do use 240-column 67-row
screens (a 1920x1080 monitor with 8x16 pixel fonts) which is getting
close to the limit. Monitors with higher resolution are not uncommon
these days (3840x2160 producing a 480x135 character display) and it is
just a matter of time before someone with, say, a braille display using
the Linux VT console and BRLTTY on such a screen reports a bug about
missing and oddly misaligned screen content.
Let's add VT_GETCONSIZECSRPOS for the retrieval of console size and cursor
position without byte-sized limitations. The actual console size limit as
encoded in vt.c is 32767x32767 so using a short here is appropriate. Then
this can be used to get the cursor position when /dev/vcsa reports 255.
The screen dimension may already be obtained using TIOCGWINSZ and adding
the same information to VT_GETCONSIZECSRPOS might be redundant. However
applications that care about cursor position also care about display
size and having 2 separate system calls to obtain them separately is
wasteful. Also, the cursor position can be queried by writing "\e[6n" to
a tty and reading back the result but that may be done only by the actual
application using that tty and not a sideline observer.
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Link: https://lore.kernel.org/r/20250520171851.1219676-3-nico@fluxnic.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This is comprised of 3 aspects:
- Take note of when applications advertise bracketed paste support via
"\e[?2004h" and "\e[?2004l".
- Insert bracketed paste markers ("\e[200~" and "\e[201~") around pasted
content in paste_selection() when bracketed paste is active.
- Add TIOCL_GETBRACKETEDPASTE to return bracketed paste status so user
space daemons implementing cut-and-paste functionality (e.g. gpm,
BRLTTY) may know when to insert bracketed paste markers.
Link: https://en.wikipedia.org/wiki/Bracketed-paste
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20250520171851.1219676-2-nico@fluxnic.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Problem
========
commit 658eb5ab916d ("delayacct: add delay max to record delay peak")
- adding more fields
commit f65c64f311ee ("delayacct: add delay min to record delay peak")
- adding more fields
commit b016d0873777 ("taskstats: modify taskstats version")
- version bump to 15
Since version 15 (TASKSTATS_VERSION=15) the new layout of the structure
adds fields in the middle of the structure, rendering all old software
incompatible with newer kernels and software compiled against the new
kernel headers incompatible with older kernels.
Solution
=========
move delay max and delay min to the end of taskstat, and bump
the version to 16 after the change
[wang.yaxin@zte.com.cn: adjust indentation]
Link: https://lkml.kernel.org/r/202505192131489882NSciXV4EGd8zzjLuwoOK@zte.com.cn
Link: https://lkml.kernel.org/r/20250510155413259V4JNRXxukdDgzsaL0Fo6a@zte.com.cn
Fixes: f65c64f311ee ("delayacct: add delay min to record delay peak")
Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a toggleable VM capability to reset the VCPU from userspace by
setting MP_STATE_INIT_RECEIVED through IOCTL.
Reset through a mp_state to avoid adding a new IOCTL.
Do not reset on a transition from STOPPED to RUNNABLE, because it's
better to avoid side effects that would complicate userspace adoption.
The MP_STATE_INIT_RECEIVED is not a permanent mp_state -- IOCTL resets
the VCPU while preserving the original mp_state -- because we wouldn't
gain much from having a new state it in the rest of KVM, but it's a very
non-standard use of the IOCTL.
Signed-off-by: Radim Krčmář <rkrcmar@ventanamicro.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20250515143723.2450630-5-rkrcmar@ventanamicro.com
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
https://gitlab.freedesktop.org/drm/nova into drm-next
Nova changes for v6.16
auxiliary:
- bus abstractions
- implementation for driver registration
- add sample driver
drm:
- implement __drm_dev_alloc()
- DRM core infrastructure Rust abstractions
- device, driver and registration
- DRM IOCTL
- DRM File
- GEM object
- IntoGEMObject rework
- generically implement AlwaysRefCounted through IntoGEMObject
- refactor unsound from_gem_obj() into as_ref()
- refactor into_gem_obj() into as_raw()
driver-core:
- merge topic/device-context-2025-04-17 from driver-core tree
- implement Devres::access()
- fix: doctest build under `!CONFIG_PCI`
- accessor for Device::parent()
- fix: conditionally expect `dead_code` for `parent()`
- impl TryFrom<&Device> bus devices (PCI, platform)
nova-core:
- remove completed Vec extentions from task list
- register auxiliary device for nova-drm
- derive useful traits for Chipset
- add missing GA100 chipset
- take &Device<Bound> in Gpu::new()
- infrastructure to generate register definitions
- fix register layout of NV_PMC_BOOT_0
- move Firmware into own (Rust) module
- fix: select AUXILIARY_BUS
nova-drm:
- initial driver skeleton (depends on drm and auxiliary bus
abstractions)
- fix: select AUXILIARY_BUS
Rust (dependencies):
- implement Opaque::zeroed()
- implement Revocable::try_access_with()
- implement Revocable::access()
From: Danilo Krummrich <dakr@kernel.org>
Link: https://lore.kernel.org/r/aCxAf3RqQAXLDhAj@cassiopeiae
|
|
For UBLK_F_AUTO_BUF_REG, buffer is registered to uring_cmd context
automatically with the provided buffer index. User may provide one wrong
buffer index, or the specified buffer is registered by application already.
Add UBLK_AUTO_BUF_REG_FALLBACK for supporting to auto buffer registering
fallback by completing the uring_cmd and telling ublk server the
register failure via UBLK_AUTO_BUF_REG_FALLBACK, then ublk server still
can register the buffer from userspace.
So we can provide reliable way for supporting auto buffer register.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250520045455.515691-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
UBLK_F_AUTO_BUF_REG
Add UBLK_F_AUTO_BUF_REG for supporting to register buffer automatically
to local io_uring context with provided buffer index.
Add UAPI structure `struct ublk_auto_buf_reg` for holding user parameter
to register request buffer automatically, one 'flags' field is defined, and
there is still 32bit available for future extension, such as, adding one
io_ring FD field for registering buffer to external io_uring.
`struct ublk_auto_buf_reg` is populated from ublk uring_cmd's sqe->addr,
and all existing ublk commands are data-less, so it is just fine to reuse
sqe->addr for this purpose.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250520045455.515691-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since we're (almost) feature complete, let's allow userspace to
request KVM_ARM_VCPU_EL2* by bumping KVM_VCPU_MAX_FEATURES up.
We also now advertise the features to userspace with new capabilities.
It's going to be great...
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20250514103501.2225951-17-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The TCA_FLOWER_KEY_CFM enum has a UNSPEC and MAX with _OPT
in the name, but the real attributes don't. Add a MAX that
more reasonably matches the attrs.
The PAD in TCA_TAPRIO is the only attr which doesn't have
_ATTR in it, perhaps signifying that it's not a real attr?
If so interesting idea in abstract but it makes codegen painful.
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250513221752.843102-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Support new version of HBM.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add bind-tx netlink call to attach dmabuf for TX; queue is not
required, only ifindex and dmabuf fd for attachment.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250508004830.4100853-4-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We need the USB fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add the initial nova-drm driver skeleton.
nova-drm is connected to nova-core through the auxiliary bus and
implements the DRM parts of the nova driver stack.
For now, it implements the fundamental DRM abstractions, i.e. creates a
DRM device and registers it, exposing a three sample IOCTLs.
DRM_IOCTL_NOVA_GETPARAM
- provides the PCI bar size from the bar that maps the GPUs VRAM
from nova-core
DRM_IOCTL_NOVA_GEM_CREATE
- creates a new dummy DRM GEM object and returns a handle
DRM_IOCTL_NOVA_GEM_INFO
- provides metadata for the DRM GEM object behind a given handle
I implemented a small userspace test suite [1] that utilizes this
interface.
Link: https://gitlab.freedesktop.org/dakr/drm-test [1]
Reviewed-by: Maxime Ripard <mripard@kernel.org>
Acked-by: Dave Airlie <airlied@redhat.com>
Link: https://lore.kernel.org/r/20250424160452.8070-3-dakr@kernel.org
[ Kconfig: depend on DRM=y rather than just DRM. - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
|
|
Patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard
regions", v2.
Introduce the PAGE_IS_GUARD flag in the PAGEMAP_SCAN ioctl to expose
information about guard regions. This allows userspace tools, such as
CRIU, to detect and handle guard regions.
Currently, CRIU utilizes PAGEMAP_SCAN as a more efficient alternative to
parsing /proc/pid/pagemap. Without this change, guard regions are
incorrectly reported as swap-anon regions, leading CRIU to attempt dumping
them and subsequently failing.
The series includes updates to the documentation and selftests to reflect
the new functionality.
This patch (of 3):
Introduce the PAGE_IS_GUARD flag in the PAGEMAP_SCAN ioctl to expose
information about guard regions. This allows userspace tools, such as
CRIU, to detect and handle guard regions.
Link: https://lkml.kernel.org/r/20250324065328.107678-1-avagin@google.com
Link: https://lkml.kernel.org/r/20250324065328.107678-2-avagin@google.com
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
PTRACE_SET_SYSCALL_INFO is a generic ptrace API that complements
PTRACE_GET_SYSCALL_INFO by letting the ptracer modify details of system
calls the tracee is blocked in.
This API allows ptracers to obtain and modify system call details in a
straightforward and architecture-agnostic way, providing a consistent way
of manipulating the system call number and arguments across architectures.
As in case of PTRACE_GET_SYSCALL_INFO, PTRACE_SET_SYSCALL_INFO also does
not aim to address numerous architecture-specific system call ABI
peculiarities, like differences in the number of system call arguments for
such system calls as pread64 and preadv.
The current implementation supports changing only those bits of system
call information that are used by strace system call tampering, namely,
syscall number, syscall arguments, and syscall return value.
Support of changing additional details returned by
PTRACE_GET_SYSCALL_INFO, such as instruction pointer and stack pointer,
could be added later if needed, by using struct ptrace_syscall_info.flags
to specify the additional details that should be set. Currently, "flags"
and "reserved" fields of struct ptrace_syscall_info must be initialized
with zeroes; "arch", "instruction_pointer", and "stack_pointer" fields are
currently ignored.
PTRACE_SET_SYSCALL_INFO currently supports only PTRACE_SYSCALL_INFO_ENTRY,
PTRACE_SYSCALL_INFO_EXIT, and PTRACE_SYSCALL_INFO_SECCOMP operations.
Other operations could be added later if needed.
Ideally, PTRACE_SET_SYSCALL_INFO should have been introduced along with
PTRACE_GET_SYSCALL_INFO, but it didn't happen. The last straw that
convinced me to implement PTRACE_SET_SYSCALL_INFO was apparent failure to
provide an API of changing the first system call argument on riscv
architecture.
ptrace(2) man page:
long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
...
PTRACE_SET_SYSCALL_INFO
Modify information about the system call that caused the stop.
The "data" argument is a pointer to struct ptrace_syscall_info
that specifies the system call information to be set.
The "addr" argument should be set to sizeof(struct ptrace_syscall_info)).
Link: https://lore.kernel.org/all/59505464-c84a-403d-972f-d4b2055eeaac@gmail.com/
Link: https://lkml.kernel.org/r/20250303112044.GF24170@strace.io
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
Reviewed-by: Alexey Gladkov <legion@kernel.org>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Eugene Syromiatnikov <esyr@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: anton ivanov <anton.ivanov@cambridgegreys.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christian Zankel <chris@zankel.net>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davide Berardi <berardi.dav@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Eugene Syromyatnikov <evgsyr@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Maciej W. Rozycki <macro@orcam.me.uk>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Renzo Davoi <renzo@cs.unibo.it>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.16-2025-05-09:
amdgpu:
- IPS fixes
- DSC cleanup
- DC Scaling updates
- DC FP fixes
- Fused I2C-over-AUX updates
- SubVP fixes
- Freesync fix
- DMUB AUX fixes
- VCN fix
- Hibernation fixes
- HDP fixes
- DCN 2.1 fixes
- DPIA fixes
- DMUB updates
- Use drm_file_err in amdgpu
- Enforce isolation updates
- Use new dma_fence helpers
- USERQ fixes
- Documentation updates
- Misc code cleanups
- SR-IOV updates
- RAS updates
- PSP 12 cleanups
amdkfd:
- Update error messages for SDMA
- Userptr updates
drm:
- Add drm_file_err function
dma-buf:
- Add a helper to sort and deduplicate dma_fence arrays
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250509230951.3871914-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
requested
Symmetric RSS hash requires that:
* No other fields besides IP src/dst and/or L4 src/dst are set
* If src is set, dst must also be set
This restriction was only enforced when RXNFC was configured after
symmetric hash was enabled. In the opposite order of operations (RXNFC
then symmetric enablement) the check was not performed.
Perform the sanity check on set_rxfh as well, by iterating over all flow
types hash fields and making sure they are all symmetric.
Introduce a function that returns whether a flow type is hashable (not
spec only) and needs to be iterated over. To make sure that no one
forgets to update the list of hashable flow types when adding new flow
types, a static assert is added to draw the developer's attention.
The conversion of uapi #defines to enum is not ideal, but as Jakub
mentioned [1], we have precedent for that.
[1] https://lore.kernel.org/netdev/20250324073509.6571ade3@kernel.org/
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250508103034.885536-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Adding support to retrieve ref_ctr_offset for uprobe perf link,
which got somehow omitted from the initial uprobe link info changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/bpf/20250509153539.779599-2-jolsa@kernel.org
|
|
Add a header that describes the 3A statistics buffer and the
parameters buffer for C3 ISP
Reviewed-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
Signed-off-by: Keke Li <keke.li@amlogic.com>
Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
|
|
C3ISP_PARAMS is the C3 ISP Parameters format.
C3ISP_STATS is the C3 ISP Statistics format.
Reviewed-by: Daniel Scally <dan.scally@ideasonboard.com>
Reviewed-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
Signed-off-by: Keke Li <keke.li@amlogic.com>
Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
Signed-off-by: Hans Verkuil <hverkuil@xs4all.nl>
|
|
Cross-merge networking fixes after downstream PR (net-6.15-rc6).
No conflicts.
Adjacent changes:
net/core/dev.c:
08e9f2d584c4 ("net: Lock netdevices during dev_shutdown")
a82dc19db136 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
SST PP revision 2 added fabric 1 P0, P1 and Pm frequencies. Export them
by using a new IOCTL ISST_IF_GET_PERF_LEVEL_FABRIC_INFO. This IOCTL
requires platforms with SST PP revision 2 or higher.
To accommodate potential future increases in fabric count and avoid ABI
changes, support is extended for up to 8 fabrics.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://lore.kernel.org/r/20250506163531.1061185-3-srinivas.pandruvada@linux.intel.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
|
|
When switching network namespaces with the bpf_redirect_peer helper, the
skb->mark and skb->tstamp fields are not zeroed out like they can be on
a typical netns switch. This patch clarifies that in the helper
description.
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/ccc86af26d43c5c0b776bcba2601b7479c0d46d0.1746460653.git.paul.chaignon@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
XFS will be able to support large atomic writes (atomic write > 1x block)
in future. This will be achieved by using different operating methods,
depending on the size of the write.
Specifically a new method of operation based in FS atomic extent remapping
will be supported in addition to the current HW offload-based method.
The FS method will generally be appreciably slower performing than the
HW-offload method. However the FS method will be typically able to
contribute to achieving a larger atomic write unit max limit.
XFS will support a hybrid mode, where HW offload method will be used when
possible, i.e. HW offload is used when the length of the write is
supported, and for other times FS-based atomic writes will be used.
As such, there is an atomic write length at which the user may experience
appreciably slower performance.
Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt.
When zero, it means that there is no such performance boundary.
Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is
ok for older kernels which don't support this new field, as they would
report 0 in this field (from zeroing in cp_statx()) already. Furthermore
those older kernels don't support large atomic writes - apart from block
fops, but there would be consistent performance there for atomic writes
in range [unit min, unit max].
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
wireless features, notably
* stack
- free SKBTX_WIFI_STATUS flag
- fixes for VLAN multicast in multi-link
- improve codel parameters (revert some old twiddling)
* ath12k
- Enable AHB support for IPQ5332.
- Add monitor interface support to QCN9274.
- Add MLO support to WCN7850.
- Add 802.11d scan offload support to WCN7850.
* ath11k
- Restore hibernation support
* iwlwifi
- EMLSR on two 5 GHz links
* mwifiex
- cleanups/refactoring
along with many other small features/cleanups
* tag 'wireless-next-2025-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (177 commits)
Revert "wifi: iwlwifi: clean up config macro"
wifi: iwlwifi: move phy_filters to fw_runtime
wifi: iwlwifi: pcie: make sure to lock rxq->read
wifi: iwlwifi: add definitions for iwl_mac_power_cmd version 2
wifi: iwlwifi: clean up config macro
wifi: iwlwifi: mld: simplify iwl_mld_rx_fill_status()
wifi: iwlwifi: mld: rx: simplify channel handling
wifi: iwlwifi: clean up band in RX metadata
wifi: iwlwifi: mld: skip unknown FW channel load values
wifi: iwlwifi: define API for external FSEQ images
wifi: iwlwifi: mld: allow EMLSR on separated 5 GHz subbands
wifi: iwlwifi: mld: use cfg80211_chandef_get_width()
wifi: iwlwifi: mld: fix iwl_mld_emlsr_disallowed_with_link() return
wifi: iwlwifi: mld: clarify variable type
wifi: iwlwifi: pcie: add support for the reset handshake in MSI
wifi: mac80211_hwsim: Prevent tsf from setting if beacon is disabled
wifi: mac80211: restructure tx profile retrieval for MLO MBSSID
wifi: nl80211: add link id of transmitted profile for MLO MBSSID
wifi: ieee80211: Add helpers to fetch EMLSR delay and timeout values
wifi: mac80211: update ML STA with EML capabilities
...
====================
Link: https://patch.msgid.link/20250506174656.119970-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Devlink param and health reporter fmsg use attributes with dynamic type
which is determined according to a different type. Currently used values
are NLA_*. The problem is, they are not part of UAPI. They may change
which would cause a break.
To make this future safe, introduce a enum that shadows NLA_* values in
it and is part of UAPI.
Also, this allows to possibly carry types that are unrelated to NLA_*
values.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Link: https://patch.msgid.link/20250505114513.53370-3-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add support for dmabuf backed zcrx areas. To use it, the user should
pass IORING_ZCRX_AREA_DMABUF in the struct io_uring_zcrx_area_reg flags
field and pass a dmabuf fd in the dmabuf_fd field.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20bb1890e60a82ec945ab36370d1fd54be414ab6.1746097431.git.asml.silence@gmail.com
Link: https://lore.kernel.org/io-uring/6e37db97303212bbd8955f9501cf99b579f8aece.1746547722.git.asml.silence@gmail.com
[axboe: fold in fixup]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Allow userspace to pass a per-I/O write stream in the SQE:
__u8 write_stream;
The __u8 type matches the size the filesystems and block layer support.
Application can query the supported values from the block devices
max_write_streams sysfs attribute. Unsupported values are ignored by
file operations that do not support write streams or rejected with an
error by those that support them.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-7-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|