summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2025-11-03PCI/IDE: Enumerate Selective Stream IDE capabilitiesDan Williams
Link encryption is a new PCIe feature enumerated by "PCIe r7.0 section 7.9.26 IDE Extended Capability". It is both a standalone port + endpoint capability, and a building block for the security protocol defined by "PCIe r7.0 section 11 TEE Device Interface Security Protocol (TDISP)". That protocol coordinates device security setup between a platform TSM (TEE Security Manager) and a device DSM (Device Security Manager). While the platform TSM can allocate resources like Stream ID and manage keys, it still requires system software to manage the IDE capability register block. Add register definitions and basic enumeration in preparation for Selective IDE Stream establishment. A follow on change selects the new CONFIG_PCI_IDE symbol. Note that while the IDE specification defines both a point-to-point "Link Stream" and a Root Port to endpoint "Selective Stream", only "Selective Stream" is considered for Linux as that is the predominant mode expected by Trusted Execution Environment Security Managers (TSMs), and it is the security model that limits the number of PCI components within the TCB in a PCIe topology with switches. Co-developed-by: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Co-developed-by: Xu Yilun <yilun.xu@linux.intel.com> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Alexey Kardashevskiy <aik@amd.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Link: https://patch.msgid.link/20251031212902.2256310-3-dan.j.williams@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2025-11-03coco/tsm: Introduce a core device for TEE Security ManagersDan Williams
A "TSM" is a platform component that provides an API for securely provisioning resources for a confidential guest (TVM) to consume. The name originates from the PCI specification for platform agent that carries out operations for PCIe TDISP (TEE Device Interface Security Protocol). Instances of this core device are parented by a device representing the platform security function like CONFIG_CRYPTO_DEV_CCP or CONFIG_INTEL_TDX_HOST. This device interface is a frontend to the aspects of a TSM and TEE I/O that are cross-architecture common. This includes mechanisms like enumerating available platform TEE I/O capabilities and provisioning connections between the platform TSM and device DSMs (Device Security Manager (TDISP)). For now this is just the scaffolding for registering a TSM device sysfs interface. Cc: Xu Yilun <yilun.xu@linux.intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Co-developed-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Alexey Kardashevskiy <aik@amd.com> Link: https://patch.msgid.link/20251031212902.2256310-2-dan.j.williams@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2025-11-03soc: qcom: ubwc: Add config for KaanapaliAkhil P Oommen
Add the ubwc configuration for Kaanapali chipset. This chipset brings support for UBWC v6 version. The rest of the configurations remains as usual. Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Link: https://lore.kernel.org/r/20250930-kaana-gpu-support-v1-1-73530b0700ed@oss.qualcomm.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2025-11-03ethtool: netlink: add ETHTOOL_MSG_MSE_GET and wire up PHY MSE accessOleksij Rempel
Introduce the userspace entry point for PHY MSE diagnostics via ethtool netlink. This exposes the core API added previously and returns both capability information and one or more snapshots. Userspace sends ETHTOOL_MSG_MSE_GET. The reply carries: - ETHTOOL_A_MSE_CAPABILITIES: scale limits and timing information - ETHTOOL_A_MSE_CHANNEL_* nests: one or more snapshots (per-channel if available, otherwise WORST, otherwise LINK) Link down returns -ENETDOWN. Changes: - YAML: add attribute sets (mse, mse-capabilities, mse-snapshot) and the mse-get operation - UAPI (generated): add ETHTOOL_A_MSE_* enums and message IDs, ETHTOOL_MSG_MSE_GET/REPLY - ethtool core: add net/ethtool/mse.c implementing the request, register genl op, and hook into ethnl dispatch - docs: document MSE_GET in ethtool-netlink.rst The include/uapi/linux/ethtool_netlink_generated.h is generated from Documentation/netlink/specs/ethtool.yaml. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20251027122801.982364-3-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03net: phy: introduce internal API for PHY MSE diagnosticsOleksij Rempel
Add the base infrastructure for Mean Square Error (MSE) diagnostics, as proposed by the OPEN Alliance "Advanced diagnostic features for 100BASE-T1 automotive Ethernet PHYs" [1] specification. The OPEN Alliance spec defines only average MSE and average peak MSE over a fixed number of symbols. However, other PHYs, such as the KSZ9131, additionally expose a worst-peak MSE value latched since the last channel capture. This API accounts for such vendor extensions by adding a distinct capability bit and snapshot field. Channel-to-pair mapping is normally straightforward, but in some cases (e.g. 100BASE-TX with MDI-X resolution unknown) the mapping is ambiguous. If hardware does not expose MDI-X status, the exact pair cannot be determined. To avoid returning misleading per-channel data in this case, a LINK selector is defined for aggregate MSE measurements. All investigated devices differ in MSE capabilities, such as sample rate, number of analyzed symbols, and scaling factors. For example, the KSZ9131 uses different scaling for MSE and pMSE. To make this visible to callers, scale limits and timing information are returned via get_mse_capability(). Some PHYs sample very few symbols at high frequency (e.g. 2 us update rate). To cover such cases and allow for future high-speed PHYs with even shorter intervals, the refresh rate is reported as u64 in picoseconds. This patch introduces the internal PHY API for Mean Square Error diagnostics. It defines new kernel-side data types and driver hooks: - struct phy_mse_capability: describes supported metrics, scale limits, update interval, and sampling length. - struct phy_mse_snapshot: holds one correlated measurement set. - New phy_driver ops: `get_mse_capability()` and `get_mse_snapshot()`. These definitions form the core kernel API. No user-visible interfaces are added in this commit. Standardization notes: OPEN Alliance defines presence and interpretation of some metrics but does not fix numeric scales or sampling internals: - SQI (3-bit, 0..7) is mandatory; correlation to SNR/BER is informative (OA 100BASE-T1 TC1 v1.0 6.1.2; OA 1000BASE-T1 TC12 v2.2 6.1.2). - MSE is optional; OA recommends 2^16 symbols and scaling to 0..511, with a worst-case latch since last read (OA 100BASE-T1 TC1 v1.0 6.1.1; OA 1000BASE-T1 TC12 v2.2 6.1.1). Refresh is recommended (~0.8-2.0 ms for 100BASE-T1; ~80-200 us for 1000BASE-T1). Exact scaling/time windows are vendor-specific. - Peak MSE (pMSE) is defined only for 100BASE-T1 as optional, e.g. 128-symbol sliding window with 8-bit range and worst-case latch (OA 100BASE-T1 TC1 v1.0 6.1.3). Therefore this API exposes which measures and selectors a PHY supports, and documents where behavior is standard-referenced vs vendor-specific. [1] <https://opensig.org/wp-content/uploads/2024/01/ Advanced_PHY_features_for_automotive_Ethernet_V1.0.pdf> Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20251027122801.982364-2-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03net: Extend NAPI threaded polling to allow kthread based busy pollingSamiullah Khawaja
Add a new state NAPI_STATE_THREADED_BUSY_POLL to the NAPI state enum to enable and disable threaded busy polling. When threaded busy polling is enabled for a NAPI, enable NAPI_STATE_THREADED also. When the threaded NAPI is scheduled, set NAPI_STATE_IN_BUSY_POLL to signal napi_complete_done not to rearm interrupts. Whenever NAPI_STATE_THREADED_BUSY_POLL is unset, the NAPI_STATE_IN_BUSY_POLL will be unset, napi_complete_done unsets the NAPI_STATE_SCHED_THREADED bit also, which in turn will make the kthread go to sleep. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Martin Karsten <mkarsten@uwaterloo.ca> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Link: https://patch.msgid.link/20251028203007.575686-2-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Protect net->mpls.platform_label with a per-netns mutex.Kuniyuki Iwashima
MPLS (re)uses RTNL to protect net->mpls.platform_label, but the lock does not need to be RTNL at all. Let's protect net->mpls.platform_label with a dedicated per-netns mutex. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-13-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03ipv6: Add in6_dev_rcu().Kuniyuki Iwashima
rcu_dereference_rtnl() does not clearly tell whether the caller is under RCU or RTNL. Let's add in6_dev_rcu() to make it easy to remove __in6_dev_get() in the future. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03Merge branch '20251030-gcc_kaanapali-v2-v2-3-a774a587af6f@oss.qualcomm.com' ↵Bjorn Andersson
into clk-for-6.19 Merge Kaanapali RPMh, TCSR and global clock controllers through a topic branch, so they can be made available in the DeviceTree branch as well.
2025-11-03dt-bindings: clock: qcom: Add Kaanapali Global clock controllerTaniya Das
Add device tree bindings for the global clock controller on Qualcomm Kaanapali platform. Signed-off-by: Jingyi Wang <jingyi.wang@oss.qualcomm.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Taniya Das <taniya.das@oss.qualcomm.com> Link: https://lore.kernel.org/r/20251030-gcc_kaanapali-v2-v2-3-a774a587af6f@oss.qualcomm.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2025-11-03dt-bindings: arm: qcom,ids: Add SoC ID for QCS6490Komal Bajaj
Add unique ID for Qualcomm QCS6490 SoC. Signed-off-by: Komal Bajaj <komal.bajaj@oss.qualcomm.com> Link: https://lore.kernel.org/r/20251103-qcs6490_soc_id-v1-1-c139dd1e32c8@oss.qualcomm.com Signed-off-by: Bjorn Andersson <andersson@kernel.org>
2025-11-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc4Alexei Starovoitov
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-03sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to ↵Tejun Heo
finish_task_switch() sched_ext_free() was called from __put_task_struct() when the last reference to the task is dropped, which could be long after the task has finished running. This causes cgroup-related problems: - ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d during scheduler load, because the cgroup might be destroyed/unlinked while the zombie or dead task is still lingering on the scx_tasks list. - ops.cgroup_exit() could be called before ops.exit_task() is called on all member tasks, leading to incorrect exit ordering. Fix by moving it to finish_task_switch() to be called right after the final context switch away from the dying task, matching when sched_class->task_dead() is called. Rename it to sched_ext_dead() to match the new calling context. By calling sched_ext_dead() before cgroup_task_dead(), we ensure that: - Tasks visible on scx_tasks list have valid cgroups during scheduler load, as cgroup_mutex prevents cgroup destruction while the task is still linked. - All member tasks have ops.exit_task() called and are removed from scx_tasks before the cgroup can be destroyed and trigger ops.cgroup_exit(). This fix is made possible by the cgroup_task_dead() split in the previous patch. This also makes more sense resource-wise as there's no point in keeping scheduler side resources around for dead tasks. Reported-by: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03sched_ext: Merge branch 'for-6.19' of ↵Tejun Heo
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-6.19 Pull cgroup/for-6.19 to receive: 16dad7801aad ("cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()") 260fbcb92bbe ("cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()") d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done switching out") These are needed for the sched_ext cgroup exit ordering fix. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03cgroup: Defer task cgroup unlink until after the task is done switching outTejun Heo
When a task exits, css_set_move_task(tsk, cset, NULL, false) unlinks the task from its cgroup. From the cgroup's perspective, the task is now gone. If this makes the cgroup empty, it can be removed, triggering ->css_offline() callbacks that notify controllers the cgroup is going offline resource-wise. However, the exiting task can still run, perform memory operations, and schedule until the final context switch in finish_task_switch(). This creates a confusing situation where controllers are told a cgroup is offline while resource activities are still happening in it. While this hasn't broken existing controllers, it has caused direct confusion for sched_ext schedulers. Split cgroup_task_exit() into two functions. cgroup_task_exit() now only calls the subsystem exit callbacks and continues to be called from do_exit(). The css_set cleanup is moved to the new cgroup_task_dead() which is called from finish_task_switch() after the final context switch, so that the cgroup only appears empty after the task is truly done running. This also reorders operations so that subsys->exit() is now called before unlinking from the cgroup, which shouldn't break anything. Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()Tejun Heo
The current names cgroup_exit(), cgroup_release(), and cgroup_free() are confusing because they look like they're operating on cgroups themselves when they're actually task lifecycle hooks. For example, cgroup_init() initializes the cgroup subsystem while cgroup_exit() is a task exit notification to cgroup. Rename them to cgroup_task_exit(), cgroup_task_release(), and cgroup_task_free() to make it clear that these operate on tasks. Cc: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03arch: hookup listns() system callChristian Brauner
Add the listns() system call to all architectures. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-20-2e6f823ebdc0@kernel.org Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03nstree: add listns()Christian Brauner
Add a new listns() system call that allows userspace to iterate through namespaces in the system. This provides a programmatic interface to discover and inspect namespaces, enhancing existing namespace apis. Currently, there is no direct way for userspace to enumerate namespaces in the system. Applications must resort to scanning /proc/<pid>/ns/ across all processes, which is: 1. Inefficient - requires iterating over all processes 2. Incomplete - misses inactive namespaces that aren't attached to any running process but are kept alive by file descriptors, bind mounts, or parent namespace references 3. Permission-heavy - requires access to /proc for many processes 4. No ordering or ownership. 5. No filtering per namespace type: Must always iterate and check all namespaces. The list goes on. The listns() system call solves these problems by providing direct kernel-level enumeration of namespaces. It is similar to listmount() but obviously tailored to namespaces. /* * @req: Pointer to struct ns_id_req specifying search parameters * @ns_ids: User buffer to receive namespace IDs * @nr_ns_ids: Size of ns_ids buffer (maximum number of IDs to return) * @flags: Reserved for future use (must be 0) */ ssize_t listns(const struct ns_id_req *req, u64 *ns_ids, size_t nr_ns_ids, unsigned int flags); Returns: - On success: Number of namespace IDs written to ns_ids - On error: Negative error code /* * @size: Structure size * @ns_id: Starting point for iteration; use 0 for first call, then * use the last returned ID for subsequent calls to paginate * @ns_type: Bitmask of namespace types to include (from enum ns_type): * 0: Return all namespace types * MNT_NS: Mount namespaces * NET_NS: Network namespaces * USER_NS: User namespaces * etc. Can be OR'd together * @user_ns_id: Filter results to namespaces owned by this user namespace: * 0: Return all namespaces (subject to permission checks) * LISTNS_CURRENT_USER: Namespaces owned by caller's user namespace * Other value: Namespaces owned by the specified user namespace ID */ struct ns_id_req { __u32 size; /* sizeof(struct ns_id_req) */ __u32 spare; /* Reserved, must be 0 */ __u64 ns_id; /* Last seen namespace ID (for pagination) */ __u32 ns_type; /* Filter by namespace type(s) */ __u32 spare2; /* Reserved, must be 0 */ __u64 user_ns_id; /* Filter by owning user namespace */ }; Example 1: List all namespaces void list_all_namespaces(void) { struct ns_id_req req = { .size = sizeof(req), .ns_id = 0, /* Start from beginning */ .ns_type = 0, /* All types */ .user_ns_id = 0, /* All user namespaces */ }; uint64_t ids[100]; ssize_t ret; printf("All namespaces in the system:\n"); do { ret = listns(&req, ids, 100, 0); if (ret < 0) { perror("listns"); break; } for (ssize_t i = 0; i < ret; i++) printf(" Namespace ID: %llu\n", (unsigned long long)ids[i]); /* Continue from last seen ID */ if (ret > 0) req.ns_id = ids[ret - 1]; } while (ret == 100); /* Buffer was full, more may exist */ } Example 2: List network namespaces only void list_network_namespaces(void) { struct ns_id_req req = { .size = sizeof(req), .ns_id = 0, .ns_type = NET_NS, /* Only network namespaces */ .user_ns_id = 0, }; uint64_t ids[100]; ssize_t ret; ret = listns(&req, ids, 100, 0); if (ret < 0) { perror("listns"); return; } printf("Network namespaces: %zd found\n", ret); for (ssize_t i = 0; i < ret; i++) printf(" netns ID: %llu\n", (unsigned long long)ids[i]); } Example 3: List namespaces owned by current user namespace void list_owned_namespaces(void) { struct ns_id_req req = { .size = sizeof(req), .ns_id = 0, .ns_type = 0, /* All types */ .user_ns_id = LISTNS_CURRENT_USER, /* Current userns */ }; uint64_t ids[100]; ssize_t ret; ret = listns(&req, ids, 100, 0); if (ret < 0) { perror("listns"); return; } printf("Namespaces owned by my user namespace: %zd\n", ret); for (ssize_t i = 0; i < ret; i++) printf(" ns ID: %llu\n", (unsigned long long)ids[i]); } Example 4: List multiple namespace types void list_network_and_mount_namespaces(void) { struct ns_id_req req = { .size = sizeof(req), .ns_id = 0, .ns_type = NET_NS | MNT_NS, /* Network and mount */ .user_ns_id = 0, }; uint64_t ids[100]; ssize_t ret; ret = listns(&req, ids, 100, 0); printf("Network and mount namespaces: %zd found\n", ret); } Example 5: Pagination through large namespace sets void list_all_with_pagination(void) { struct ns_id_req req = { .size = sizeof(req), .ns_id = 0, .ns_type = 0, .user_ns_id = 0, }; uint64_t ids[50]; size_t total = 0; ssize_t ret; printf("Enumerating all namespaces with pagination:\n"); while (1) { ret = listns(&req, ids, 50, 0); if (ret < 0) { perror("listns"); break; } if (ret == 0) break; /* No more namespaces */ total += ret; printf(" Batch: %zd namespaces\n", ret); /* Last ID in this batch becomes start of next batch */ req.ns_id = ids[ret - 1]; if (ret < 50) break; /* Partial batch = end of results */ } printf("Total: %zu namespaces\n", total); } Permission Model listns() respects namespace isolation and capabilities: (1) Global listing (user_ns_id = 0): - Requires CAP_SYS_ADMIN in the namespace's owning user namespace - OR the namespace must be in the caller's namespace context (e.g., a namespace the caller is currently using) - User namespaces additionally allow listing if the caller has CAP_SYS_ADMIN in that user namespace itself (2) Owner-filtered listing (user_ns_id != 0): - Requires CAP_SYS_ADMIN in the specified owner user namespace - OR the namespace must be in the caller's namespace context - This allows unprivileged processes to enumerate namespaces they own (3) Visibility: - Only "active" namespaces are listed - A namespace is active if it has a non-zero __ns_ref_active count - This includes namespaces used by running processes, held by open file descriptors, or kept active by bind mounts - Inactive namespaces (kept alive only by internal kernel references) are not visible via listns() Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-19-2e6f823ebdc0@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03nstree: add unified namespace listChristian Brauner
Allow to walk the unified namespace list completely locklessly. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-18-2e6f823ebdc0@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03nstree: maintain list of owned namespacesChristian Brauner
The namespace tree doesn't express the ownership concept of namespace appropriately. Maintain a list of directly owned namespaces per user namespace. This will allow userspace and the kernel to use the listns() system call to walk the namespace tree by owning user namespace. The rbtree is used to find the relevant namespace entry point which allows to continue iteration and the owner list can be used to walk the tree completely lock free. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-16-2e6f823ebdc0@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03nstree: assign fixed ids to the initial namespacesChristian Brauner
The initial set of namespace comes with fixed inode numbers making it easy for userspace to identify them solely based on that information. This has long preceeded anything here. Similarly, let's assign fixed namespace ids for the initial namespaces. Kill the cookie and use a sequentially increasing number. This has the nice side-effect that the owning user namespace will always have a namespace id that is smaller than any of it's descendant namespaces. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-15-2e6f823ebdc0@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03nstree: introduce a unified treeChristian Brauner
This will allow userspace to lookup and stat a namespace simply by its identifier without having to know what type of namespace it is. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-13-2e6f823ebdc0@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03ns: use anonymous struct to group list memberChristian Brauner
Make it easier to spot that they belong together conceptually. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-12-2e6f823ebdc0@kernel.org Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03ns: add active reference countChristian Brauner
The namespace tree is, among other things, currently used to support file handles for namespaces. When a namespace is created it is placed on the namespace trees and when it is destroyed it is removed from the namespace trees. While a namespace is on the namespace trees with a valid reference count it is possible to reopen it through a namespace file handle. This is all fine but has some issues that should be addressed. On current kernels a namespace is visible to userspace in the following cases: (1) The namespace is in use by a task. (2) The namespace is persisted through a VFS object (namespace file descriptor or bind-mount). Note that (2) only cares about direct persistence of the namespace itself not indirectly via e.g., file->f_cred file references or similar. (3) The namespace is a hierarchical namespace type and is the parent of a single or multiple child namespaces. Case (3) is interesting because it is possible that a parent namespace might not fulfill any of (1) or (2), i.e., is invisible to userspace but it may still be resurrected through the NS_GET_PARENT ioctl(). Currently namespace file handles allow much broader access to namespaces than what is currently possible via (1)-(3). The reason is that namespaces may remain pinned for completely internal reasons yet are inaccessible to userspace. For example, a user namespace my remain pinned by get_cred() calls to stash the opener's credentials into file->f_cred. As it stands file handles allow to resurrect such a users namespace even though this should not be possible via (1)-(3). This is a fundamental uapi change that we shouldn't do if we don't have to. Consider the following insane case: Various architectures support the CONFIG_MMU_LAZY_TLB_REFCOUNT option which uses lazy TLB destruction. When this option is set a userspace task's struct mm_struct may be used for kernel threads such as the idle task and will only be destroyed once the cpu's runqueue switches back to another task. But because of ptrace() permission checks struct mm_struct stashes the user namespace of the task that struct mm_struct originally belonged to. The kernel thread will take a reference on the struct mm_struct and thus pin it. So on an idle system user namespaces can be persisted for arbitrary amounts of time which also means that they can be resurrected using namespace file handles. That makes no sense whatsoever. The problem is of course excarabted on large systems with a huge number of cpus. To handle this nicely we introduce an active reference count which tracks (1)-(3). This is easy to do as all of these things are already managed centrally. Only (1)-(3) will count towards the active reference count and only namespaces which are active may be opened via namespace file handles. The problem is that namespaces may be resurrected. Which means that they can become temporarily inactive and will be reactived some time later. Currently the only example of this is the SIOGCSKNS socket ioctl. The SIOCGSKNS ioctl allows to open a network namespace file descriptor based on a socket file descriptor. If a socket is tied to a network namespace that subsequently becomes inactive but that socket is persisted by another process in another network namespace (e.g., via SCM_RIGHTS of pidfd_getfd()) then the SIOCGSKNS ioctl will resurrect this network namespace. So calls to open_related_ns() and open_namespace() will end up resurrecting the corresponding namespace tree. Note that the active reference count does not regulate the lifetime of the namespace itself. This is still done by the normal reference count. The active reference count can only be elevated if the regular reference count is elevated. The active reference count also doesn't regulate the presence of a namespace on the namespace trees. It only regulates its visiblity to namespace file handles (and in later patches to listns()). A namespace remains on the namespace trees from creation until its actual destruction. This will allow the kernel to always reach any namespace trivially and it will also enable subsystems like bpf to walk the namespace lists on the system for tracing or general introspection purposes. Note that different namespaces have different visibility lifetimes on current kernels. While most namespace are immediately released when the last task using them exits, the user- and pid namespace are persisted and thus both remain accessible via /proc/<pid>/ns/<ns_type>. The user namespace lifetime is aliged with struct cred and is only released through exit_creds(). However, it becomes inaccessible to userspace once the last task using it is reaped, i.e., when release_task() is called and all proc entries are flushed. Similarly, the pid namespace is also visible until the last task using it has been reaped and the associated pid numbers are freed. The active reference counts of the user- and pid namespace are decremented once the task is reaped. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-11-2e6f823ebdc0@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03ns: rename to exit_nsproxy_namespaces()Christian Brauner
The current naming is very misleading as this really isn't exiting all of the task's namespaces. It is only exiting the namespaces that hang of off nsproxy. Reflect that in the name. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-10-2e6f823ebdc0@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03ns: add __ns_ref_read()Christian Brauner
Implement ns_ref_read() the same way as ns_ref_{get,put}(). No point in making that any more special or different from the other helpers. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-9-2e6f823ebdc0@kernel.org Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03ns: initialize ns_list_node for initial namespacesChristian Brauner
Make sure that the list is always initialized for initial namespaces. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-8-2e6f823ebdc0@kernel.org Fixes: 885fc8ac0a4d ("nstree: make iterator generic") Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03ns: add NS_COMMON_INIT()Christian Brauner
Add an initializer that can be used for the ns common initialization for static namespace such as most init namespaces. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/87ecqhy2y5.ffs@tglx Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03ns: add missing authorshipChristian Brauner
I authored the files a short while ago. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03io_uring/zcrx: remove sync refill uapiPavel Begunkov
There is a better way to handle the problem IORING_REGISTER_ZCRX_REFILL solves. The uapi can also be slightly adjusted to accommodate future extensions. Remove the feature for now, it'll be reworked for the next release. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03io_uring/uring_cmd: avoid double indirect call in task work dispatchCaleb Sander Mateos
io_uring task work dispatch makes an indirect call to struct io_kiocb's io_task_work.func field to allow running arbitrary task work functions. In the uring_cmd case, this calls io_uring_cmd_work(), which immediately makes another indirect call to struct io_uring_cmd's task_work_cb field. Change the uring_cmd task work callbacks to functions whose signatures match io_req_tw_func_t. Add a function io_uring_cmd_from_tw() to convert from the task work's struct io_tw_req argument to struct io_uring_cmd *. Define a constant IO_URING_CMD_TASK_WORK_ISSUE_FLAGS to avoid manufacturing issue_flags in the uring_cmd task work callbacks. Now uring_cmd task work dispatch makes a single indirect call to the uring_cmd implementation's callback. This also allows removing the task_work_cb field from struct io_uring_cmd, freeing up 8 bytes for future storage. Since fuse_uring_send_in_task() now has access to the io_tw_token_t, check its cancel field directly instead of relying on the IO_URING_F_TASK_DEAD issue flag. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03io_uring: add wrapper type for io_req_tw_func_t argCaleb Sander Mateos
In preparation for uring_cmd implementations to implement functions with the io_req_tw_func_t signature, introduce a wrapper struct io_tw_req to hide the struct io_kiocb * argument. The intention is for only the io_uring core to access the inner struct io_kiocb *. uring_cmd implementations should instead call a helper from io_uring/cmd.h to convert struct io_tw_req to struct io_uring_cmd *. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03blktrace: add support for REQ_OP_WRITE_ZEROES tracingChaitanya Kulkarni
Currently, REQ_OP_WRITE_ZEROES operations are not handled in the blktrace infrastructure, resulting in incorrect or missing operation labels in ftrace blktrace output. This manifests as write-zeroes operations appearing with incorrect labels like "N" instead of a proper "WZ" designation. This patch adds complete support for REQ_OP_WRITE_ZEROES across the blktrace infrastructure: Add BLK_TC_WRITE_ZEROES trace category in blktrace_api.h and update BLK_TC_END_V2 marker accordingly Map REQ_OP_WRITE_ZEROES to BLK_TC_WRITE_ZEROES in __blk_add_trace() to ensure proper trace event categorization Update fill_rwbs() to generate "WZ" label for write-zeroes operations in ftrace output, making them easily identifiable Add "write-zeroes" string mapping in act_to_str array for debugfs filter interface Update blk_fill_rwbs() to handle REQ_OP_WRITE_ZEROES for block layer event tracing With this fix, write-zeroes operations are now correctly traced and displayed. =========================================================== BEFORE THIS PATCH =========================================================== blkdiscard -z -o 0 -l 40960 /dev/nvme0n1 blkdiscard-3809 [030] ..... 1212.253701: block_bio_queue: 259,0 NS 0 + 80 [blkdiscard] blkdiscard-3809 [030] ..... 1212.253703: block_getrq: 259,0 NS 0 + 80 [blkdiscard] blkdiscard-3809 [030] ..... 1212.253704: block_io_start: 259,0 NS 40960 () 0 + 80 be,0,4 [blkdiscard] blkdiscard-3809 [030] ..... 1212.253704: block_plug: [blkdiscard] blkdiscard-3809 [030] ..... 1212.253706: block_unplug: [blkdiscard] 1 blkdiscard-3809 [030] ..... 1212.253706: block_rq_insert: 259,0 NS 40960 () 0 + 80 be,0,4 [blkdiscard] kworker/30:1H-566 [030] ..... 1212.253726: block_rq_issue: 259,0 NS 40960 () 0 + 80 be,0,4 [kworker/30:1H] <idle>-0 [030] d.h1. 1212.253957: block_rq_complete: 259,0 NS () 0 + 80 be,0,4 [0] <idle>-0 [030] dNh1. 1212.253960: block_io_done: 259,0 NS 0 () 0 + 0 none,0,0 [swapper/30] Trace Event Breakdown: Event | Device | Op | Sector | Sectors | Byte Size | Calculation block_bio_queue | 259,0 | NS | 0 | 80 | - | 80 × 512 = 40,960 block_getrq | 259,0 | NS | 0 | 80 | - | 80 × 512 = 40,960 block_io_start | 259,0 | NS | 0 | 80 | 40960 | Direct from trace block_rq_insert | 259,0 | NS | 0 | 80 | 40960 | Direct from trace block_rq_issue | 259,0 | NS | 0 | 80 | 40960 | Direct from trace block_rq_complete | 259,0 | NS | 0 | 80 | - | 80 × 512 = 40,960 block_io_done | 259,0 | NS | 0 | 0 | 0 | Completion (no data) Total Bytes Transferred: Sectors: 80 Bytes: 80 × 512 = 40,960 bytes =========================================================== AFTER THIS PATCH =========================================================== blkdiscard -z -o 0 -l 40960 /dev/nvme0n1 blkdiscard-2477 [020] ..... 960.989131: block_bio_queue: 259,0 WZS 0 + 80 [blkdiscard] blkdiscard-2477 [020] ..... 960.989134: block_getrq: 259,0 WZS 0 + 80 [blkdiscard] blkdiscard-2477 [020] ..... 960.989135: block_io_start: 259,0 WZS 40960 () 0 + 80 be,0,4 [blkdiscard] blkdiscard-2477 [020] ..... 960.989138: block_plug: [blkdiscard] blkdiscard-2477 [020] ..... 960.989140: block_unplug: [blkdiscard] 1 blkdiscard-2477 [020] ..... 960.989141: block_rq_insert: 259,0 WZS 40960 () 0 + 80 be,0,4 [blkdiscard] kworker/20:1H-736 [020] ..... 960.989166: block_rq_issue: 259,0 WZS 40960 () 0 + 80 be,0,4 [kworker/20:1H] <idle>-0 [020] d.h1. 960.989476: block_rq_complete: 259,0 WZS () 0 + 80 be,0,4 [0] <idle>-0 [020] dNh1. 960.989482: block_io_done: 259,0 WZS 0 () 0 + 0 none,0,0 [swapper/20] Trace Event Breakdown: Event | Device | Op | Sector | Sectors | Byte Size | Calculation block_bio_queue | 259,0 | WZS | 0 | 80 | - | 80 × 512 = 40,960 block_getrq | 259,0 | WZS | 0 | 80 | - | 80 × 512 = 40,960 block_io_start | 259,0 | WZS | 0 | 80 | 40960 | Direct from trace block_rq_insert | 259,0 | WZS | 0 | 80 | 40960 | Direct from trace block_rq_issue | 259,0 | WZS | 0 | 80 | 40960 | Direct from trace block_rq_complete | 259,0 | WZS | 0 | 80 | - | 80 × 512 = 40,960 block_io_done | 259,0 | WZS | 0 | 0 | 0 | Completion (no data) Total Bytes Transferred: Sectors: 80 Bytes: 80 × 512 = 40,960 bytes Tested with ftrace blktrace on NVMe devices using blkdiscard with the -z (write-zeroes) flag. Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03media: saa7146: Replace saa7146_ext_vv.vbi_fops with write functionLaurent Pinchart
The vbi_fops stored in struct saa7146_ext_vv is a full v4l2_file_operations, but only its .write field is used. Replace it with a single vbi_write function pointer to save memory. Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
2025-11-03media: mc: Make macros to obtain containers const-awareSakari Ailus
Retain the constness of the graph objects and interfaces in macros to obtain their containers, by switching to container_of_const(). Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
2025-11-03media: v4l2-dev: Make macros to obtain containers const-awareSakari Ailus
Retain the constness of the object in media_entity_to_video_device() and to_video_device(), by switching to container_of_const(). Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
2025-11-03media: v4l2-subdev: Make media_entity_to_v4l2_subdev() const-awareSakari Ailus
Retain the constness of the object in media_entity_to_v4l2_subdev(), by switching to container_of_const(). Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
2025-11-03uaccess: Provide ASM GOTO safe wrappers for unsafe_*_user()Thomas Gleixner
ASM GOTO is miscompiled by GCC when it is used inside a auto cleanup scope: bool foo(u32 __user *p, u32 val) { scoped_guard(pagefault) unsafe_put_user(val, p, efault); return true; efault: return false; } e80: e8 00 00 00 00 call e85 <foo+0x5> e85: 65 48 8b 05 00 00 00 00 mov %gs:0x0(%rip),%rax e8d: 83 80 04 14 00 00 01 addl $0x1,0x1404(%rax) // pf_disable++ e94: 89 37 mov %esi,(%rdi) e96: 83 a8 04 14 00 00 01 subl $0x1,0x1404(%rax) // pf_disable-- e9d: b8 01 00 00 00 mov $0x1,%eax // success ea2: e9 00 00 00 00 jmp ea7 <foo+0x27> // ret ea7: 31 c0 xor %eax,%eax // fail ea9: e9 00 00 00 00 jmp eae <foo+0x2e> // ret which is broken as it leaks the pagefault disable counter on failure. Clang at least fails the build. Linus suggested to add a local label into the macro scope and let that jump to the actual caller supplied error label. __label__ local_label; \ arch_unsafe_get_user(x, ptr, local_label); \ if (0) { \ local_label: \ goto label; \ That works for both GCC and clang. clang: c80: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) c85: 65 48 8b 0c 25 00 00 00 00 mov %gs:0x0,%rcx c8e: ff 81 04 14 00 00 incl 0x1404(%rcx) // pf_disable++ c94: 31 c0 xor %eax,%eax // set retval to false c96: 89 37 mov %esi,(%rdi) // write c98: b0 01 mov $0x1,%al // set retval to true c9a: ff 89 04 14 00 00 decl 0x1404(%rcx) // pf_disable-- ca0: 2e e9 00 00 00 00 cs jmp ca6 <foo+0x26> // ret The exception table entry points correctly to c9a GCC: f70: e8 00 00 00 00 call f75 <baz+0x5> f75: 65 48 8b 05 00 00 00 00 mov %gs:0x0(%rip),%rax f7d: 83 80 04 14 00 00 01 addl $0x1,0x1404(%rax) // pf_disable++ f84: 8b 17 mov (%rdi),%edx f86: 89 16 mov %edx,(%rsi) f88: 83 a8 04 14 00 00 01 subl $0x1,0x1404(%rax) // pf_disable-- f8f: b8 01 00 00 00 mov $0x1,%eax // success f94: e9 00 00 00 00 jmp f99 <baz+0x29> // ret f99: 83 a8 04 14 00 00 01 subl $0x1,0x1404(%rax) // pf_disable-- fa0: 31 c0 xor %eax,%eax // fail fa2: e9 00 00 00 00 jmp fa7 <baz+0x37> // ret The exception table entry points correctly to f99 So both compilers optimize out the extra goto and emit correct and efficient code. Provide a generic wrapper to do that to avoid modifying all the affected architecture specific implementation with that workaround. The only change required for architectures is to rename unsafe_*_user() to arch_unsafe_*_user(). That's done in subsequent changes. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/877bweujtn.ffs@tglx
2025-11-03rtc: ds1685: stop setting max_user_freqAlexandre Belloni
max_user_freq has not been related to the hardware RTC since commit 6610e0893b8b ("RTC: Rework RTC code to use timerqueue for events"). Stop setting it from individual driver to avoid confusing new contributors. Acked-by: Joshua Kinard <linux@kumba.dev> Link: https://patch.msgid.link/20251101-max_user_freq-v1-2-c9a274fd6883@bootlin.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2025-11-03perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT coresYicong Yang
CPU_CYCLES is expected to count the logical CPU (PE) clock. Currently it's preferred to use PMCCNTR_EL0 for counting CPU_CYCLES, but it'll count processor clock rather than the PE clock (ARM DDI0487 L.b D13.1.3) if one of the SMT siblings is not idle on a multi-threaded implementation. So don't use it on SMT cores. Introduce topology_core_has_smt() for knowing the SMT implementation and cached it in arm_pmu::has_smt during allocation. When counting cycles on SMT CPU 2-3 and CPU 3 is idle, without this patch we'll get: [root@client1 tmp]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1 --taskset 2 --timeout 1 [...] Performance counter stats for 'CPU(s) 2-3': CPU2 2880457316 cycles CPU3 2880459810 cycles 1.254688470 seconds time elapsed With this patch the idle state of CPU3 is observed as expected: [root@client1 ~]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1 --taskset 2 --timeout 1 [...] Performance counter stats for 'CPU(s) 2-3': CPU2 2558580492 cycles CPU3 305749 cycles 1.113626410 seconds time elapsed Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Will Deacon <will@kernel.org>
2025-11-03mtd: spear_smi: fix kernel-doc warnings <linux/mtd/spear_smi.h>Randy Dunlap
Correct most kernel-doc warnings in include/linux/mtd/spear_smi.h by adding a leading '@' to the description of struct members. Add a new description for the missing @np member. Warning: spear_smi.h:48 struct member 'name' not described in 'spear_smi_flash_info' Warning: spear_smi.h:48 struct member 'mem_base' not described in 'spear_smi_flash_info' Warning: spear_smi.h:48 struct member 'size' not described in 'spear_smi_flash_info' Warning: spear_smi.h:48 struct member 'partitions' not described in 'spear_smi_flash_info' Warning: spear_smi.h:48 struct member 'nr_partitions' not described in 'spear_smi_flash_info' Warning: spear_smi.h:48 struct member 'fast_mode' not described in 'spear_smi_flash_info' Warning: spear_smi.h:62 struct member 'clk_rate' not described in 'spear_smi_plat_data' Warning: spear_smi.h:62 struct member 'num_flashes' not described in 'spear_smi_plat_data' Warning: spear_smi.h:62 struct member 'board_flash_info' not described in 'spear_smi_plat_data' Warning: spear_smi.h:62 struct member 'np' not described in 'spear_smi_plat_data' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
2025-11-03drm/{i915, xe}/display: Add display runtime pm parent interfaceJouni Högander
We have differing implementations for display runtime pm in i915 and xe drivers. Add struct of function pointers into display_parent_interface which will contain used implementation of runtime pm. v2: - add _interface suffix to rpm function pointer struct - add struct ref_tracker forward declaration - use kernel-doc comments Reviewed-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Jouni Högander <jouni.hogander@intel.com> Link: https://patch.msgid.link/20251030202836.1815680-3-jouni.hogander@intel.com
2025-11-03drm/{i915, xe}/display: pass parent interface to display probeJani Nikula
Let's gradually start calling i915 and xe parent, or core, drivers from display via function pointers passed at display probe. Going forward, the struct intel_display_parent_interface is expected to include const pointers to sub-structs by functionality, for example: struct intel_display_rpm { struct ref_tracker *(*get)(struct drm_device *drm); /* ... */ }; struct intel_display_parent_interface { /* ... */ const struct intel_display_rpm *rpm; }; This is a baby step towards not building display as part of both i915 and xe drivers, but rather making it an independent driver interfacing with the two. v3: useless include additions dropped v2: unrelated include removal dropped Cc: Jouni Högander <jouni.hogander@intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Signed-off-by: Jouni Högander <jouni.hogander@intel.com> Link: https://patch.msgid.link/20251030202836.1815680-2-jouni.hogander@intel.com
2025-11-039p: convert to the new mount APIEric Sandeen
Convert 9p to the new mount API. This patch consolidates all parsing into fs/9p/v9fs.c, which stores all results into a filesystem context which can be passed to the various transports as needed. Some of the parsing helper functions such as get_cache_mode() have been eliminated in favor of using the new mount API's enum param type, for simplicity. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Message-ID: <20251010214222.1347785-5-sandeen@redhat.com> [ Dominique: handled source explicitly as per follow-up discussion ] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-11-039p: create a v9fs_context structure to hold parsed optionsEric Sandeen
This patch creates a new v9fs_context structure which includes new p9_session_opts and p9_client_opts structures, as well as re-using the existing p9_fd_opts and p9_rdma_opts to store options during parsing. The new structure will be used in the next commit to pass all parsed options to the appropriate transports. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Message-ID: <20251010214222.1347785-4-sandeen@redhat.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-11-03net/9p: move structures and macros to header filesEric Sandeen
With the new mount API all option parsing will need to happen in fs/v9fs.c, so move some existing data structures and macros to header files to facilitate this. Rename some to reflect the transport they are used for (rdma, fd, etc), for clarity. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Message-ID: <20251010214222.1347785-3-sandeen@redhat.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-11-03fs/fs_parse: add back fsparam_u32hexEric Sandeen
296b67059 removed fsparam_u32hex because there were no callers (yet) and it didn't build due to using the nonexistent symbol fs_param_is_u32_hex. fs/9p will need this parser, so add it back with the appropriate fix (use fs_param_is_u32). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Message-ID: <20251010214222.1347785-2-sandeen@redhat.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-11-03net/9p: cleanup: change p9_trans_module->def to boolDominique Martinet
'->def' is only ever used as a true/false flag Reported-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Message-ID: <20251103-v9fs_trans_def_bool-v1-1-f33dc7ed9e81@codewreck.org> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-11-039p: Use kvmalloc for message buffers on supported transportsPierre Barre
While developing a 9P server (https://github.com/Barre/ZeroFS) and testing it under high-load, I was running into allocation failures. The failures occur even with plenty of free memory available because kmalloc requires contiguous physical memory. This results in errors like: ls: page allocation failure: order:7, mode:0x40c40(GFP_NOFS|__GFP_COMP) This patch introduces a transport capability flag (supports_vmalloc) that indicates whether a transport can work with vmalloc'd buffers (non-physically contiguous memory). Transports requiring DMA should leave this flag as false. The fd-based transports (tcp, unix, fd) set this flag to true, and p9_fcall_init will use kvmalloc instead of kmalloc for these transports. This allows the allocator to fall back to vmalloc when contiguous physical memory is not available. Additionally, if kmem_cache_alloc fails, the code falls back to kvmalloc for transports that support it. Signed-off-by: Pierre Barre <pierre@barre.sh> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Message-ID: <d2017c29-11fb-44a5-bd0f-4204329bbefb@app.fastmail.com> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2025-11-02Merge patch series "target: RW/num_cmds stats improvements"Martin K. Petersen
Mike Christie <michael.christie@oracle.com> says: The following patches were made over Linus tree. They fix/improve the stats used in the main IO path. The first patch fixes an issue where I made some stats u32 when they should have stayed u64. The rest of the patches improve the handling of RW/num_cmds stats to reduce code duplication and improve performance. Link: https://patch.msgid.link/20250917221338.14813-1-michael.christie@oracle.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>