path: root/include
Age  Commit message  Author
2025-11-26  drivers: hid: renegotiate resolution multipliers with device after reset  Benedek Kupper
The scroll resolution multipliers are set in the context of hidinput_connect(), which is only called at probe time: when the host changes the value on the device with a SET_REPORT(FEATURE), and the device accepts it, these multipliers are stored on the host side, and used to calculate the final scroll event values sent to userspace. After a USB suspend, the resume operation on many hubs and chipsets involves a USB reset signal as well. A reset on the device side clears all previous state information, including the value of the multiplier report. This reset is not handled by the multiplier handling logic, so the host still expects high-resolution scroll events while the device has been reset to default resolution, making the effective, user-perceived scroll speed incredibly slow. The solution is to renegotiate the multiplier selection after each reset. This is not the only bug related to the high-resolution scrolling implementation in the kernel (the other one is https://bugzilla.kernel.org/show_bug.cgi?id=220144), but for this one there is no device-side workaround, leading to poor user experience with our product: https://github.com/UltimateHackingKeyboard/firmware/issues/1155 https://github.com/UltimateHackingKeyboard/firmware/issues/1261 https://github.com/UltimateHackingKeyboard/firmware/pull/1355 This patch was tested by an affected user and has been reported to fix the issue (see discussion in 1355). Signed-off-by: Benedek Kupper <kupper.benedek@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
2025-11-26  mod_devicetable: Bump auxiliary_device_id name size  Raag Jadav
We have an upcoming driver named "intel_ehl_pse_io". This creates an auxiliary child device for its GPIO sub-functionality, which matches against "intel_ehl_pse_io.gpio-elkhartlake" and overshoots the current maximum limit of 32 bytes for the auxiliary device id string. Bump the size to 40 bytes to satisfy such cases. Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Raag Jadav <raag.jadav@intel.com> Link: https://patch.msgid.link/20251106052838.433673-1-raag.jadav@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
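The size problem described above can be checked with a small userspace sketch. The `MODEL_` constants and `model_name_fits()` are hypothetical names for illustration only; they just encode the 32 and 40 byte limits quoted in the commit message and assume the limit includes the terminating NUL.

```c
#include <stdbool.h>
#include <string.h>

/* Sizes quoted in the commit message; MODEL_ names are not kernel identifiers. */
#define MODEL_OLD_NAME_SIZE 32
#define MODEL_NEW_NAME_SIZE 40

/* True if name plus its terminating NUL fits in a buffer of buf_size bytes. */
static bool model_name_fits(const char *name, size_t buf_size)
{
	return strlen(name) + 1 <= buf_size;
}
```

Calling `model_name_fits("intel_ehl_pse_io.gpio-elkhartlake", MODEL_OLD_NAME_SIZE)` fails, while the same call with `MODEL_NEW_NAME_SIZE` succeeds.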
2025-11-26  sysfs: simplify attribute definition macros  Thomas Weißschuh
Define the macros in terms of each other. This makes them easier to understand and also will make it easier to implement the transition machinery for 'const struct attribute'. __ATTR_RO_MODE() can't be implemented in terms of __ATTR() as not all attributes have a .store callback. The same issue theoretically exists for __ATTR_WO(), but practically that does not occur today. Reorder __ATTR_RO() below __ATTR_RO_MODE() to keep the order of the macro definition consistent with respect to each other. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-7-ea7d745acff4@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26  sysfs: attribute_group: enable const variants of is_visible()  Thomas Weißschuh
When constifying instances of struct attribute, for consistency the corresponding .is_visible() callback should be adapted, too. Introduce a temporary transition mechanism until all callbacks are converted. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-4-ea7d745acff4@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26  sysfs: introduce __SYSFS_FUNCTION_ALTERNATIVE()  Thomas Weißschuh
For the constification phase of 'struct attribute' various callback struct members will need to exist in both const and non-const variants. Keeping both members in a union avoids memory and CPU overhead but will be detected and trapped by Control Flow Integrity (CFI). By deciding between a struct and a union depending whether CFI is enabled, most configurations can avoid this overhead. Code using these callbacks will still need to be updated to handle both members explicitly. In the union case the compiler will recognize that testing for one union member is enough and optimize away the code for the other one. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-3-ea7d745acff4@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
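A rough userspace sketch of the struct-versus-union idea follows. It is not the kernel macro: `MODEL_FUNCTION_ALTERNATIVE`, `MODEL_CFI`, `struct model_attr` and the callback names are all made up for illustration, and the real definition of `__SYSFS_FUNCTION_ALTERNATIVE()` may differ.

```c
#include <stddef.h>

struct model_attr { const char *name; };

/* Define MODEL_CFI to mimic a CFI-enabled build (hypothetical switch). */
#ifdef MODEL_CFI
#define MODEL_FUNCTION_ALTERNATIVE(...) struct { __VA_ARGS__ }
#else
#define MODEL_FUNCTION_ALTERNATIVE(...) union { __VA_ARGS__ }
#endif

struct model_ops {
	/* Const and non-const callback variants live side by side. */
	MODEL_FUNCTION_ALTERNATIVE(
		int (*show)(struct model_attr *attr, char *buf);
		int (*show_const)(const struct model_attr *attr, char *buf);
	);
};

/* Example const-variant callback: writes the first character of the name. */
static int model_demo_show(const struct model_attr *attr, char *buf)
{
	buf[0] = attr->name[0];
	return 1;
}

/* Callers test both members explicitly, as the commit message requires. */
static int model_call_show(const struct model_ops *ops,
			   struct model_attr *attr, char *buf)
{
	if (ops->show_const)
		return ops->show_const(attr, buf);
	if (ops->show)
		return ops->show(attr, buf);
	return -1;
}
```

In the union build both members share storage, so a compiler can fold the two tests into one; in the struct build each pointer keeps its own type, which is what a CFI scheme checks at the call site.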
2025-11-26  sysfs: transparently handle const pointers in ATTRIBUTE_GROUPS()  Thomas Weißschuh
To ease the constification process of 'struct attribute', transparently handle the const pointers in ATTRIBUTE_GROUPS(). A cast is used instead of assigning to .attrs_new as it keeps the macro smaller. As both members are aliased to each other the result is identical. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-2-ea7d745acff4@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26  sysfs: attribute_group: allow registration of const attribute  Thomas Weißschuh
To be able to constify instances of struct attribute it has to be possible to add them to struct attribute_group. The current type of the attrs member however is not compatible with that. Introduce a union that allows registration of both const and non-const attributes to enable a piecewise transition. As both union member types are compatible no logic needs to be adapted. Technically it is now possible to register a const struct attribute and receive it as a mutable pointer in the callbacks. This is a soundness issue. But this same soundness issue already exists today in sysfs_create_file(). Also the struct definition and callback implementation are always closely linked and are meant to be moved to const in lockstep. Similar to commit 906c508afdca ("sysfs: attribute_group: allow registration of const bin_attribute") Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-1-ea7d745acff4@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26  virt: acrn: split acrn_mmio_dev_res out of acrn_mmiodev  Randy Dunlap
Add struct acrn_mmio_dev_res before struct acrn_mmio_dev. The former is used in the latter and breaking them up provides better kernel-doc documentation for the struct members. Suggested-by: Fei Li <fei1.li@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Fei Li <fei1.li@intel.com> Link: https://patch.msgid.link/20251028040409.868254-1-rdunlap@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26  comedi: kcomedilib: Add loop checking variants of open and close  Ian Abbott
Add `comedi_open_from(path, from)` and `comedi_close_from(dev, from)` as variants of the existing `comedi_open(path)` and `comedi_close(dev)`. The additional `from` parameter is a minor device number that tells the function that the COMEDI device is being opened or closed from another COMEDI device if the value is in the range [0, `COMEDI_NUM_BOARD_MINORS`-1]. In that case the function will refuse to open the device if it would lead to a chain of devices opening each other. (It will also impose a limit on the number of simultaneous opens from one device to another because we need to count those.) The new functions are intended to be used by the "comedi_bond" driver, which is the only driver that uses the existing `comedi_open()` and `comedi_close()` functions. The new functions will be used to avoid some possible deadlock situations. Replace the existing, exported `comedi_open()` and `comedi_close()` functions with inline wrapper functions that call the newly exported `comedi_open_from()` and `comedi_close_from()` functions. Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Link: https://patch.msgid.link/20251027153748.4569-2-abbotti@mev.co.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26  comedi: Add reference counting for Comedi command handling  Ian Abbott
For interrupts from badly behaved hardware (as emulated by Syzbot), it is possible for the Comedi core functions that manage the progress of asynchronous data acquisition to be called from driver ISRs while no asynchronous command has been set up, which can cause problems such as invalid pointer dereferencing or dividing by zero. To help protect against that, introduce new functions to maintain a reference counter for asynchronous commands that are being set up. `comedi_get_is_subdevice_running(s)` will check if a command has been set up on a subdevice and is still marked as running, and if so will increment the reference counter and return `true`, otherwise it will return `false` without modifying the reference counter. `comedi_put_is_subdevice_running(s)` will decrement the reference counter and set a completion event when decremented to 0. Change the `do_cmd_ioctl()` function (responsible for setting up the asynchronous command) to reinitialize the completion event and set the reference counter to 1 before it marks the subdevice as running. Change the `do_become_nonbusy()` function (responsible for destroying a completed command) to call `comedi_put_is_subdevice_running(s)` and wait for the completion event after marking the subdevice as not running. Because the subdevice normally gets marked as not running before the call to `do_become_nonbusy()` (and may also be called when the Comedi device is being detached from the low-level driver), add a new flag `COMEDI_SRF_BUSY` to the set of subdevice run-flags that indicates that an asynchronous command was set up and will need to be destroyed. This flag is set by `do_cmd_ioctl()` and cleared and checked by `do_become_nonbusy()`. 
Subsequent patches will change the Comedi core functions that are called from low-level drivers for asynchronous command handling to make use of the `comedi_get_is_subdevice_running()` and `comedi_put_is_subdevice_running()` functions, and will modify the ISRs of some of these low-level drivers if they dereference the subdevice's `async` pointer directly. Signed-off-by: Ian Abbott <abbotti@mev.co.uk> Link: https://patch.msgid.link/20251023133001.8439-2-abbotti@mev.co.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
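The get/put protocol described in this commit message can be modelled in a few lines. The sketch below is a single-threaded illustration only: `struct model_subdev` and the `model_*` functions are hypothetical stand-ins, plain integers replace the kernel's atomic reference counter, and a boolean replaces the completion event that the real teardown path would wait on.

```c
#include <stdbool.h>

struct model_subdev {
	bool running;    /* stands in for the "running" run-flag */
	bool busy;       /* stands in for COMEDI_SRF_BUSY */
	int refs;        /* references held on the current command */
	bool completed;  /* stands in for the completion event */
};

/* Command setup: the counter starts at 1 before "running" is set. */
static void model_cmd_start(struct model_subdev *s)
{
	s->completed = false;
	s->refs = 1;
	s->busy = true;
	s->running = true;
}

/* Take a reference only if a command is set up and still running. */
static bool model_get_is_running(struct model_subdev *s)
{
	if (!s->running)
		return false;
	s->refs++;
	return true;
}

/* Drop a reference; signal completion when the count reaches 0. */
static void model_put_is_running(struct model_subdev *s)
{
	if (--s->refs == 0)
		s->completed = true;
}

/*
 * Teardown: mark not running, then drop the initial reference.
 * Returns true if no user is left; real code would wait instead.
 */
static bool model_become_nonbusy(struct model_subdev *s)
{
	if (!s->busy)
		return true;
	s->busy = false;
	s->running = false;
	model_put_is_running(s);
	return s->completed;
}
```

An ISR-side caller would bracket its work with `model_get_is_running()` and `model_put_is_running()`, and skip the work entirely when the get fails.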
2025-11-26  drm/ttm: rework pipelined eviction fence handling  Pierre-Eric Pelloux-Prayer
Until now ttm stored a single pipelined eviction fence which means drivers had to use a single entity for these evictions. To lift this requirement, this commit allows up to 8 entities to be used. Ideally a dma_resv object would have been used as a container of the eviction fences, but the locking rules make it complex. dma_resv all have the same ww_class, which means "Attempting to lock more mutexes after ww_acquire_done." is an error. One alternative considered was to introduce a 2nd ww_class for specific resv to hold a single "transient" lock (= the resv lock would only be held for a short period, without taking any other locks). The other option is to statically reserve a fence array, and extend the existing code to deal with N fences, instead of 1. The driver is still responsible for reserving the correct number of fence slots. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Link: https://lore.kernel.org/r/20251121101315.3585-20-pierre-eric.pelloux-prayer@amd.com Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com>
2025-11-26  can: netlink: add PWM netlink interface  Vincent Mailhol
When the TMS is switched on, the node uses PWM (Pulse Width Modulation) during the data phase instead of the classic NRZ (Non Return to Zero) encoding. PWM is configured by three parameters:
- PWMS: Pulse Width Modulation Short phase
- PWML: Pulse Width Modulation Long phase
- PWMO: Pulse Width Modulation Offset time
For each of these parameters, define three IFLA symbols:
- IFLA_CAN_PWM_PWM*_MIN: the minimum allowed value.
- IFLA_CAN_PWM_PWM*_MAX: the maximum allowed value.
- IFLA_CAN_PWM_PWM*: the runtime value.
This results in a total of nine IFLA symbols which are all nested in a parent IFLA_CAN_XL_PWM symbol. IFLA_CAN_PWM_PWM*_MIN and IFLA_CAN_PWM_PWM*_MAX define the range of allowed values and will match the value statically configured by the device in struct can_pwm_const. IFLA_CAN_PWM_PWM* match the runtime values stored in struct can_pwm. Those parameters may only be configured when the tms mode is on. If the PWMS, PWML and PWMO parameters are provided, check that all the needed parameters are present using can_validate_pwm(), then check their value using can_validate_pwm_bittiming(). PWMO defaults to zero if omitted. Otherwise, if CAN_CTRLMODE_XL_TMS is true but none of the PWM parameters are provided, calculate them using can_calc_pwm().
Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-11-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26  can: calc_bittiming: add PWM calculation  Vincent Mailhol
Perform the PWM calculation according to CiA recommendations. Note that for databitrates greater than 5 MBPS, tqmin is less than CAN_PWM_NS_MAX (which is defined to 200 nano seconds), consequently, the result of the division: DIV_ROUND_UP(xl_ns, CAN_PWM_NS_MAX) is one and thus the for loop automatically stops on the first iteration giving a single PWM symbol per bit as expected. Because of that, there is no actual need for a separate conditional branch for when the databitrate is greater than 5 MBPS. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-10-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
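The rounding argument above can be made concrete with a tiny sketch. It assumes that `xl_ns` is the duration of one data-phase bit in nanoseconds, which the commit message does not spell out; `MODEL_DIV_ROUND_UP`, `MODEL_PWM_NS_MAX` and `model_pwm_symbols_per_bit()` are illustrative names, with only the 200 ns value taken from the text.

```c
/* Same rounding-up division as the kernel's DIV_ROUND_UP() macro. */
#define MODEL_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* 200 ns, the CAN_PWM_NS_MAX value quoted in the commit message. */
#define MODEL_PWM_NS_MAX 200U

/* PWM symbols needed to cover one data bit lasting xl_ns nanoseconds. */
static unsigned int model_pwm_symbols_per_bit(unsigned int xl_ns)
{
	return MODEL_DIV_ROUND_UP(xl_ns, MODEL_PWM_NS_MAX);
}
```

Under that assumption a 10 Mbit/s data phase (100 ns per bit) needs one symbol, while a 2 Mbit/s data phase (500 ns per bit) needs three, so no separate fast-bitrate branch is required.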
2025-11-26  can: bittiming: add PWM validation  Vincent Mailhol
Add can_validate_pwm() to validate the values pwms, pwml and pwmo. Error messages are added to each of the checks to inform the user on what went wrong. Refer to those error messages to understand the validation logic. The boundary values CAN_PWM_DECODE_NS (the transceiver minimum decoding margin) and CAN_PWM_NS_MAX (the maximum PWM symbol duration) are hardcoded for the moment. Note that a transceiver capable of bitrates higher than 20 Mbps may be able to handle a CAN_PWM_DECODE_NS below 5 ns. If such transceivers become commercially available, this code could be revisited to make this parameter configurable. For now, leave it static. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-9-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26  can: bittiming: add PWM parameters  Vincent Mailhol
In CAN XL, higher data bit rates require the CAN transceiver to switch its operation mode to use Pulse-Width Modulation (PWM) transmission mode instead of the classic dominant/recessive transmission mode. The PWM parameters are:
- PWMS: pulse width modulation short phase
- PWML: pulse width modulation long phase
- PWMO: pulse width modulation offset
CiA 612-2 specifies PWMS and PWML to be at least 1 (arguably, PWML shall be at least 2 to respect the PWMS < PWML rule). PWMO's minimum is expected to always be zero. It is added more for consistency than anything else. Add struct can_pwm_const so that the different devices can provide their minimum and maximum values. When TMS is on, the runtime PWMS, PWML and PWMO are needed (either calculated or provided by the user): add struct can_pwm to store these. TDC and PWM can not be used at the same time (TDC can only be used when TMS is off and PWM only when TMS is on). struct can_pwm is thus put together with struct can_tdc inside a union to save some space. The netlink logic will be added in an upcoming change.
Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-8-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26  can: dev: can_dev_dropped_skb: drop CC/FD frames in CANXL-only mode  Oliver Hartkopp
The error-signalling (ES) is a mandatory functionality for CAN CC and CAN FD to report CAN frame format violations by sending an error-frame signal on the bus. A so-called 'mixed-mode' is intended to have (XL-tolerant) CAN FD nodes and CAN XL nodes on one CAN segment, where the FD-controllers can talk CC/FD and the XL-controllers can talk CC/FD/XL. This mixed-mode utilizes the error-signalling for sending CC/FD/XL frames. The CANXL-only mode disables the error-signalling in the CAN XL controller. This mode does not allow CC/FD frames to be sent but additionally offers a CAN XL transceiver mode switching (TMS). Configured with CAN_CTRLMODE_FD and CAN_CTRLMODE_XL this leads to:
FD=0 XL=0  CC-only mode         (ES=1)
FD=1 XL=0  FD/CC mixed-mode     (ES=1)
FD=1 XL=1  XL/FD/CC mixed-mode  (ES=1)
FD=0 XL=1  XL-only mode         (ES=0, TMS optional)
The helper function can_dev_in_xl_only_mode() determines the required value to disable error signalling in the CAN XL controller.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-7-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
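The four FD/XL combinations above reduce to a single predicate. The sketch below is a userspace model, not the kernel helper: the `MODEL_CTRLMODE_*` bit values are placeholders rather than the UAPI values, and `model_in_xl_only_mode()` only mirrors the behaviour the commit message attributes to can_dev_in_xl_only_mode().

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for this sketch only; not the kernel's UAPI values. */
#define MODEL_CTRLMODE_FD 0x1u
#define MODEL_CTRLMODE_XL 0x2u

/* XL-only mode: CAN XL enabled while CAN FD is disabled. */
static bool model_in_xl_only_mode(uint32_t ctrlmode)
{
	return (ctrlmode & MODEL_CTRLMODE_XL) && !(ctrlmode & MODEL_CTRLMODE_FD);
}

/* Error signalling (ES) stays on in every mode except XL-only. */
static bool model_error_signalling(uint32_t ctrlmode)
{
	return !model_in_xl_only_mode(ctrlmode);
}
```

Only the FD=0, XL=1 combination yields ES=0, matching the last row of the mode list.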
2025-11-26  can: netlink: add CAN_CTRLMODE_XL_TMS flag  Vincent Mailhol
The Transceiver Mode Switching (TMS) indicates whether the CAN XL controller shall use the PWM or NRZ encoding during the data phase. The term "transceiver mode switching" is used in both ISO 11898-1 and CiA 612-2 (although only the latter one uses the abbreviation TMS). We adopt the same naming convention here for consistency. Add the CAN_CTRLMODE_XL_TMS flag to the list of the CAN control modes. Add can_validate_xl_flags() to check the coherency of the TMS flag. That function will be reused in upcoming changes to validate the other CAN XL flags. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-6-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26  can: netlink: add initial CAN XL support  Vincent Mailhol
CAN XL uses bittiming parameters different from Classical CAN and CAN FD. Thus, all the data bittiming parameters, including TDC, need to be duplicated for CAN XL. Add the CAN XL netlink interface for all the features which are common with CAN FD. Any new CAN XL specific features are added later on. The first time CAN XL is activated, the MTU is set by default to CANXL_MAX_MTU. The user may then configure a custom MTU within the CANXL_MIN_MTU to CANXL_MAX_MTU range, in which case, the custom MTU value will be kept as long as CAN XL remains active. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-5-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26  can: netlink: add CAN_CTRLMODE_RESTRICTED  Vincent Mailhol
ISO 11898-1:2024 adds a new restricted operation mode. This mode is added as a mandatory feature for nodes which support CAN XL and is retrofitted as optional for legacy nodes (i.e. the ones which only support Classical CAN and CAN FD). The restricted operation mode is nearly the same as the listen only mode: the node can not send data frames or remote frames and can not send dominant bits if an error occurs. The only exception is that the node shall still send the acknowledgment bit. A second niche exception is that the node may still send a data frame containing a time reference message if the node is a primary time provider, but because the time provider feature is not yet implemented in the kernel, this second exception is not relevant to us at the moment. Add the CAN_CTRLMODE_RESTRICTED control mode flag and update the can_dev_dropped_skb() helper function accordingly. Finally, bail out if both CAN_CTRLMODE_LISTENONLY and CAN_CTRLMODE_RESTRICTED are provided. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-4-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26  can: dev: can_dev_dropped_skb: drop CAN FD skbs if FD is off  Vincent Mailhol
Currently, the CAN FD skb validation logic is based on the MTU: the interface is deemed FD capable if and only if its MTU is greater or equal to CANFD_MTU. This logic is showing its limit with the introduction of CAN XL. For example, consider the two scenarios below: 1. An interface configured with CAN FD on and CAN XL on 2. An interface configured with CAN FD off and CAN XL on In those two scenarios, the interfaces would have the same MTU: CANXL_MTU making it impossible to differentiate which one has CAN FD turned on and which one has it off. Because of the limitation, the only non-UAPI-breaking workaround is to do the check at the device level using the can_priv->ctrlmode flags. Unfortunately, the virtual interfaces (vcan, vxcan), which do not have a can_priv, are left behind. Add a check on the CAN_CTRLMODE_FD flag in can_dev_dropped_skb() and drop FD frames whenever the feature is turned off. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-3-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26  can: bittiming: apply NL_SET_ERR_MSG() to can_calc_bittiming()  Vincent Mailhol
When CONFIG_CAN_CALC_BITTIMING is disabled, the can_calc_bittiming() functions can not be used and the user needs to provide all the bittiming parameters. Currently, can_calc_bittiming() prints an error message to the kernel log. Instead use NL_SET_ERR_MSG() to make it return the error message through the netlink interface so that the user can directly see it. Signed-off-by: Vincent Mailhol <mailhol@kernel.org> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-2-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26  Merge tag 'kvm-x86-svm-6.19' of https://github.com/kvm-x86/linux into HEAD  Paolo Bonzini
KVM SVM changes for 6.19:
- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on VMRUN and #VMEXIT.
- Fix a bug where KVM corrupts the guest code stream when re-injecting a soft interrupt if the guest patched the underlying code after the VM-Exit, e.g. when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to userspace, and extend KVM "support" to all policy bits that don't require any actual support from KVM.
2025-11-26  Merge tag 'kvm-x86-tdx-6.19' of https://github.com/kvm-x86/linux into HEAD  Paolo Bonzini
KVM TDX changes for 6.19:
- Overhaul the TDX code to address systemic races where KVM (acting on behalf of userspace) could inadvertently trigger lock contention in the TDX-Module, which KVM was either working around in weird, ugly ways, or was simply oblivious to (as proven by Yan tripping several KVM_BUG_ON()s with clever selftests).
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a vCPU if creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying capabilities to userspace.
2025-11-26  Merge tag 'kvm-x86-gmem-6.19' of https://github.com/kvm-x86/linux into HEAD  Paolo Bonzini
KVM guest_memfd changes for 6.19:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety of rough edges in guest_memfd along the way.
- Define a CLASS to automatically handle get+put when grabbing a guest_memfd from a memslot to make it harder to leak references.
- Enhance KVM selftests to make it easier to develop and debug selftests like those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs often result in hard-to-debug SIGBUS errors.
- Misc cleanups.
2025-11-25  tcp: remove icsk->icsk_retransmit_timer  Eric Dumazet
Now that sk->sk_timer is no longer used by TCP keepalive, we can use its storage for TCP and MPTCP retransmit timers for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25  tcp: introduce icsk->icsk_keepalive_timer  Eric Dumazet
sk->sk_timer has been used for TCP keepalives. Keepalive timers are not in fast path, we want to use sk->sk_timer storage for retransmit timers, for better cache locality. Create icsk->icsk_keepalive_timer and change keepalive code to no longer use sk->sk_timer. Added space is reclaimed in the following patch. This includes changes to MPTCP, which was also using sk_timer. Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer for inet_sk_diag_fill() sake. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25  net: move sk_dst_pending_confirm and sk_pacing_status to sock_read_tx group  Eric Dumazet
These two fields are mostly read in the TCP tx path; move them to a more appropriate group for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25  tcp: rename icsk_timeout() to tcp_timeout_expires()  Eric Dumazet
In preparation of sk->tcp_timeout_timer introduction, rename icsk_timeout() helper and change its argument to plain 'const struct sock *sk'. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25  tools: ynl-gen: add regeneration comment  Asbjørn Sloth Tønnesen
Add a comment on regeneration to the generated files. The comment is placed after the YNL-GEN line[1], as to not interfere with ynl-regen.sh's detection logic. [1] and after the optional YNL-ARG line. Link: https://lore.kernel.org/r/aR5m174O7pklKrMR@zx2c4.com/ Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251120174429.390574-3-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25  bpf: Introduce internal bpf_map_check_op_flags helper function  Leon Hwang
It is to unify map flags checking for lookup_elem, update_elem, lookup_batch and update_batch APIs. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20251125145857.98134-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-25  hfs/hfsplus: move on-disk layout declarations into hfs_common.h  Viacheslav Dubeyko
Currently, HFS declares its on-disk layout metadata structures in fs/hfs/hfs.h and HFS+ declares them in fs/hfsplus/hfsplus_raw.h. However, the HFS and HFS+ on-disk layouts are similar and their declarations overlap. As a result, fs/hfs/hfs.h and fs/hfsplus/hfsplus_raw.h contain multiple duplicated declarations. Moreover, both HFS and HFS+ drivers implement the same functionality in multiple places. This patch moves the on-disk layout declarations from fs/hfs/hfs.h and fs/hfsplus/hfsplus_raw.h into include/linux/hfs_common.h with the goal of removing the duplicated declarations. Also, this patch prepares the basis for creating a hfslib that can aggregate common functionality without the need to duplicate the same code in the HFS and HFS+ drivers. Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> cc: Yangtao Li <frank.li@vivo.com> cc: linux-fsdevel@vger.kernel.org Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
2025-11-25  sched/mmcid: Switch over to the new mechanism  Thomas Gleixner
Now that all pieces are in place, change the implementations of sched_mm_cid_fork() and sched_mm_cid_exit() to adhere to the new strict ownership scheme and switch context_switch() over to use the new mm_cid_schedin() functionality. The common case is that there is no mode change required, which makes fork() and exit() just update the user count and the constraints. If a new user would exceed the CID space limit, the fork() context handles the transition to per CPU mode with mm::mm_cid::mutex held. exit() handles the transition back to per task mode when the user count drops below the switch back threshold. fork() might also be forced to handle a deferred switch back to per task mode, when an affinity change increased the number of allowed CPUs enough. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.280380631@linutronix.de
2025-11-25  sched/mmcid: Implement deferred mode change  Thomas Gleixner
When affinity changes cause an increase of the number of CPUs allowed for tasks which are related to a MM, that might result in a situation where the ownership mode can go back from per CPU mode to per task mode. As affinity changes happen with runqueue lock held there is no way to do the actual mode change and required fixup right there. Add the infrastructure to defer it to a workqueue. The scheduled work can race with a fork() or exit(). Whatever happens first takes care of it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.216484739@linutronix.de
2025-11-25  irqwork: Move data struct to a types header  Thomas Gleixner
... to avoid header recursion hell. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.152813625@linutronix.de
2025-11-25sched/mmcid: Provide CID ownership mode fixup functionsThomas Gleixner
CIDs are either owned by tasks or by CPUs. The ownership mode depends on the number of tasks related to a MM and the number of CPUs on which these tasks are theoretically allowed to run on. Theoretically because that number is the superset of CPU affinities of all tasks which only grows and never shrinks. Switching to per CPU mode happens when the user count becomes greater than the maximum number of CIDs, which is calculated by: opt_cids = min(mm_cid::nr_cpus_allowed, mm_cid::users); max_cids = min(1.25 * opt_cids, nr_cpu_ids); The +25% allowance is useful for tight CPU masks in scenarios where only a few threads are created and destroyed to avoid frequent mode switches. Though this allowance shrinks, the closer opt_cids becomes to nr_cpu_ids, which is the (unfortunate) hard ABI limit. At the point of switching to per CPU mode the new user is not yet visible in the system, so the task which initiated the fork() runs the fixup function: mm_cid_fixup_tasks_to_cpu() walks the thread list and either transfers each tasks owned CID to the CPU the task runs on or drops it into the CID pool if a task is not on a CPU at that point in time. Tasks which schedule in before the task walk reaches them do the handover in mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus() completes it's guaranteed that no task related to that MM owns a CID anymore. Switching back to task mode happens when the user count goes below the threshold which was recorded on the per CPU mode switch: pcpu_thrs = min(opt_cids - (opt_cids / 4), nr_cpu_ids / 2); This threshold is updated when a affinity change increases the number of allowed CPUs for the MM, which might cause a switch back to per task mode. If the switch back was initiated by a exiting task, then that task runs the fixup function. If it was initiated by a affinity change, then it's run either in the deferred update function in context of a workqueue or by a task which forks a new one or by a task which exits. Whatever happens first. 
mm_cid_fixup_cpus_to_tasks() walks through the possible CPUs and either transfers the CPU owned CIDs to a related task which runs on the CPU or drops them into the pool. Tasks which schedule in on a CPU which the walk did not cover yet do the handover themselves. This transition from CPU to per task ownership happens in two phases: 1) mm:mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the task CID and denotes that the CID is only temporarily owned by the task. When it schedules out, the task drops the CID back into the pool if this bit is set. 2) The initiating context walks the per CPU space and after completion clears mm:mm_cid.transit. After that point the CIDs are strictly task owned again. This two phase transition is required to prevent CID space exhaustion during the transition as a direct transfer of ownership would fail if two tasks are scheduled in on the same CPU before the fixup freed per CPU CIDs. When mm_cid_fixup_cpus_to_tasks() completes it's guaranteed that no CID related to that MM is owned by a CPU anymore. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.088189028@linutronix.de
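The threshold formulas in this message can be tried out with a small user-space sketch. This is not the kernel code: the helper names (min_uint(), cid_opt(), cid_max(), cid_pcpu_thrs(), cid_needs_percpu()) are hypothetical, and the 1.25 factor is assumed to be evaluated in truncating integer arithmetic as opt_cids + opt_cids / 4.

```c
#include <assert.h>

/* Smaller of two unsigned values. */
static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* opt_cids = min(nr_cpus_allowed, users) */
static unsigned int cid_opt(unsigned int nr_cpus_allowed, unsigned int users)
{
	return min_uint(nr_cpus_allowed, users);
}

/*
 * max_cids = min(1.25 * opt_cids, nr_cpu_ids)
 * Assumption: the +25% allowance is computed with integer truncation.
 */
static unsigned int cid_max(unsigned int opt_cids, unsigned int nr_cpu_ids)
{
	return min_uint(opt_cids + opt_cids / 4, nr_cpu_ids);
}

/* pcpu_thrs = min(opt_cids - (opt_cids / 4), nr_cpu_ids / 2) */
static unsigned int cid_pcpu_thrs(unsigned int opt_cids, unsigned int nr_cpu_ids)
{
	return min_uint(opt_cids - opt_cids / 4, nr_cpu_ids / 2);
}

/* Per CPU mode is entered when the user count exceeds max_cids. */
static int cid_needs_percpu(unsigned int users, unsigned int nr_cpus_allowed,
			    unsigned int nr_cpu_ids)
{
	assert(nr_cpu_ids > 0);
	return users > cid_max(cid_opt(nr_cpus_allowed, users), nr_cpu_ids);
}
```

With 8 allowed CPUs on a 64 CPU system, this sketch tolerates up to 10 users in task mode and records a switch-back threshold of 6. With 60 allowed CPUs the allowance is capped by nr_cpu_ids, which illustrates how the +25% margin shrinks near the hard limit.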
2025-11-25sched/mmcid: Provide new scheduler CID mechanismThomas Gleixner
The MM CID management has two fundamental requirements: 1) It has to guarantee that at no given point in time the same CID is used by concurrent tasks in userspace. 2) The CID space must not exceed the number of possible CPUs in a system. While most allocators (glibc, tcmalloc, jemalloc) do not care about that, there seems to be at least some LTTng library depending on it. The CID space compaction itself is not a functional correctness requirement, it is only a useful optimization mechanism to reduce the memory footprint in unused user space pools. The optimal CID space is: min(nr_tasks, nr_cpus_allowed); Where @nr_tasks is the number of actual user space threads associated to the mm and @nr_cpus_allowed is the superset of all task affinities. It is growth only as it would be insane to take a racy snapshot of all task affinities when the affinity of one task changes just to redo it 2 milliseconds later when the next task changes its affinity. That means that as long as the number of tasks is lower than or equal to the number of CPUs allowed, each task owns a CID. If the number of tasks exceeds the number of CPUs allowed it switches to per CPU mode, where the CPUs own the CIDs and the tasks borrow them as long as they are scheduled in. For transition periods CIDs can go beyond the optimal space as long as they don't go beyond the number of possible CPUs. The current upstream implementation adds overhead into task migration to keep the CID with the task. It also has to do the CID space consolidation work from a task work in the exit to user space path. As that work is assigned to a random task related to a MM this can inflict unwanted exit latencies. Implement the context switch parts of a strict ownership mechanism to address this. This removes most of the work from the task which schedules out. Only during transitioning from per CPU to per task ownership it is required to drop the CID when leaving the CPU to prevent CID space exhaustion. 
Other than that scheduling out is just a single check and branch. The task which schedules in has to check whether: 1) The ownership mode changed 2) The CID is within the optimal CID space In stable situations this results in zero work. The only short disruption is when ownership mode changes or when the associated CID is not in the optimal CID space. The latter only happens when tasks exit and therefore the optimal CID space shrinks. That mechanism is strictly optimized for the common case where no change happens. The only case where it actually causes a temporary one time spike is on mode changes when and only when a lot of tasks related to a MM schedule exactly at the same time and have eventually to compete on allocating a CID from the bitmap. In the sysbench test case which triggered the spinlock contention in the initial CID code, __schedule() drops significantly in perf top on a 128 Core (256 threads) machine when running sysbench with 255 threads, which fits into the task mode limit of 256 together with the parent thread: [k] __schedule: Upstream 0.42%, rseq/perf branch 0.37%, +CID rework 0.32%. Increasing the number of threads to 256, which puts the test process into per CPU mode looks about the same. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172550.023984859@linutronix.de
2025-11-25sched/mmcid: Introduce per task/CPU ownership infrastructureThomas Gleixner
The MM CID management has two fundamental requirements: 1) It has to guarantee that at no given point in time the same CID is used by concurrent tasks in userspace. 2) The CID space must not exceed the number of possible CPUs in a system. While most allocators (glibc, tcmalloc, jemalloc) do not care about that, there seems to be at least librseq depending on it. The CID space compaction itself is not a functional correctness requirement, it is only a useful optimization mechanism to reduce the memory footprint in unused user space pools. The optimal CID space is: min(nr_tasks, nr_cpus_allowed); Where @nr_tasks is the number of actual user space threads associated to the mm and @nr_cpus_allowed is the superset of all task affinities. It is growth only as it would be insane to take a racy snapshot of all task affinities when the affinity of one task changes just to redo it 2 milliseconds later when the next task changes its affinity. That means that as long as the number of tasks is lower than or equal to the number of CPUs allowed, each task owns a CID. If the number of tasks exceeds the number of CPUs allowed it switches to per CPU mode, where the CPUs own the CIDs and the tasks borrow them as long as they are scheduled in. For transition periods CIDs can go beyond the optimal space as long as they don't go beyond the number of possible CPUs. The current upstream implementation adds overhead into task migration to keep the CID with the task. It also has to do the CID space consolidation work from a task work in the exit to user space path. As that work is assigned to a random task related to a MM this can inflict unwanted exit latencies. This can be done differently by implementing a strict CID ownership mechanism. Either the CIDs are owned by the tasks or by the CPUs. The latter provides less locality when tasks are heavily migrating, but there is no justification to optimize for overcommit scenarios and thereby penalizing everyone else. 
Provide the basic infrastructure to implement this: - Change the UNSET marker to BIT(31) from ~0U - Add the ONCPU marker as BIT(30) - Add the TRANSIT marker as BIT(29) That allows checking for ownership trivially and provides a simple check for UNSET as well. The TRANSIT marker is required to prevent CID space exhaustion when switching from per CPU to per task mode. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251119172549.960252358@linutronix.de
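A rough user-space illustration of how such marker bits make ownership checks cheap. The bit positions come from the message above; MM_CID_TRANSIT is named in a later commit of this series, while MM_CID_UNSET, MM_CID_ONCPU, MM_CID_FLAGS and all helper functions are names assumed here for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT(n)          (UINT32_C(1) << (n))
#define MM_CID_UNSET    BIT(31)   /* assumed name: no CID assigned */
#define MM_CID_ONCPU    BIT(30)   /* assumed name: CID is owned by a CPU */
#define MM_CID_TRANSIT  BIT(29)   /* CID is only temporarily task owned */
#define MM_CID_FLAGS    (MM_CID_UNSET | MM_CID_ONCPU | MM_CID_TRANSIT) /* hypothetical */

/* True if no CID is assigned. */
static bool cid_is_unset(uint32_t cid)
{
	return (cid & MM_CID_UNSET) != 0;
}

/* True if the CID is owned by a CPU rather than by a task. */
static bool cid_is_cpu_owned(uint32_t cid)
{
	return (cid & MM_CID_ONCPU) != 0;
}

/* True if the CID must be dropped when the task schedules out. */
static bool cid_is_in_transit(uint32_t cid)
{
	return (cid & MM_CID_TRANSIT) != 0;
}

/* True if the value is a plain CID with no marker bits set. */
static bool cid_is_strictly_task_owned(uint32_t cid)
{
	return (cid & MM_CID_FLAGS) == 0;
}

/* Tag a plain CID as CPU owned. */
static uint32_t cid_mark_oncpu(uint32_t cid)
{
	assert((cid & MM_CID_FLAGS) == 0);
	return cid | MM_CID_ONCPU;
}

/* Strip all marker bits to recover the numeric CID. */
static uint32_t cid_value(uint32_t cid)
{
	return cid & ~MM_CID_FLAGS;
}
```

Because all markers live in the top bits, each check is a single mask operation, and any valid CID (bounded by the number of possible CPUs) leaves those bits clear.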
2025-11-25sched/mmcid: Serialize sched_mm_cid_fork()/exit() with a mutexThomas Gleixner
Prepare for the new CID management scheme which puts the CID ownership transition into the fork() and exit() slow path by serializing sched_mm_cid_fork()/exit() with a mutex, so task list and cpu mask walks can be done in interruptible and preemptible code. The contention on it is not worse than on other concurrency controls in the fork()/exit() machinery. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.895826703@linutronix.de
2025-11-25sched/mmcid: Provide precomputed maximal valueThomas Gleixner
Reading mm::mm_users and mm::mm_cid::nr_cpus_allowed every time to compute the maximal CID value is just wasteful as that value is only changing on fork(), exit() and eventually when the affinity changes. So it can be easily precomputed at those points and provided in mm::mm_cid for consumption in the hot path. But there is an issue with using mm::mm_users for accounting because that does not necessarily reflect the number of user space tasks as other kernel code can take temporary references on the MM which skew the picture. Solve that by adding a users counter to struct mm_mm_cid, which is modified by fork() and exit() and used for precomputing under mm_mm_cid::lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.832764634@linutronix.de
2025-11-25sched/mmcid: Move initialization out of lineThomas Gleixner
It's getting bigger soon, so just move it out of line to the rest of the code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.769636491@linutronix.de
2025-11-25signal: Move MMCID exit out of sighand lockThomas Gleixner
There is no need anymore to keep this under sighand lock as the current code and the upcoming replacement are not depending on the exit state of a task anymore. That allows using a mutex in the exit path. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.706439391@linutronix.de
2025-11-25sched/mmcid: Convert mm CID mask to a bitmapThomas Gleixner
This is truly a bitmap and just conveniently uses a cpumask because the maximum size of the bitmap is nr_cpu_ids. But that prevents searching for a zero bit in a limited range, which is helpful to provide an efficient mechanism to consolidate the CID space when the number of users decreases. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.642866767@linutronix.de
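The benefit of a range-limited zero bit search can be shown with a plain user-space bitmap. The following is only a sketch: first_zero_bit_below() and cid_alloc_below() are hypothetical helpers written for this example, not the kernel's bitmap API, and the bit-by-bit loop is chosen for clarity rather than speed.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Return the index of the first clear bit in [0, limit), or limit if none. */
static size_t first_zero_bit_below(const unsigned long *map, size_t limit)
{
	assert(map != NULL);
	for (size_t i = 0; i < limit; i++) {
		if (!(map[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD))))
			return i;
	}
	return limit;
}

/*
 * Allocate the lowest free CID below limit and mark it used.
 * Returns limit if every CID in the range is already taken.
 */
static size_t cid_alloc_below(unsigned long *map, size_t limit)
{
	size_t cid = first_zero_bit_below(map, limit);

	if (cid < limit)
		map[cid / BITS_PER_WORD] |= 1UL << (cid % BITS_PER_WORD);
	return cid;
}
```

Passing a smaller limit than the full bitmap size restricts allocation to the compacted range, which is the property the message says a cpumask based search could not provide.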
2025-11-25cpumask: Cache num_possible_cpus()Thomas Gleixner
Reevaluating num_possible_cpus() over and over does not make sense. That becomes a constant after init as cpu_possible_mask is marked ro_after_init. Cache the value during initialization and provide that for consumption. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20251119172549.578653738@linutronix.de
2025-11-25sched: idle: Respect the CPU system wakeup QoS limit for s2idleUlf Hansson
A CPU system wakeup QoS limit may have been requested by user space. To avoid breaking this constraint when entering a low power state during s2idle, let's start to take into account the QoS limit. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dhruva Gole <d-gole@ti.com> Reviewed-by: Kevin Hilman (TI) <khilman@baylibre.com> Tested-by: Kevin Hilman (TI) <khilman@baylibre.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/20251125112650.329269-5-ulf.hansson@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-25pmdomain: Respect the CPU system wakeup QoS limit for s2idleUlf Hansson
A CPU system wakeup QoS limit may have been requested by user space. To avoid breaking this constraint when entering a low power state during s2idle through genpd, let's extend the corresponding genpd governor for CPUs. More precisely, during s2idle let the genpd governor select a suitable domain idle state, by taking into account the QoS limit. Reviewed-by: Dhruva Gole <d-gole@ti.com> Reviewed-by: Kevin Hilman (TI) <khilman@baylibre.com> Tested-by: Kevin Hilman (TI) <khilman@baylibre.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/20251125112650.329269-3-ulf.hansson@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-25PM: QoS: Introduce a CPU system wakeup QoS limitUlf Hansson
Some platforms support multiple low power states for CPUs that can be used when entering system-wide suspend. Currently we are always selecting the deepest possible state for the CPUs, which can break the system wakeup latency constraint that may be required for a use case. Let's take the first step towards addressing this problem, by introducing an interface for user space, that allows us to specify the CPU system wakeup QoS limit. Subsequent changes will start taking into account the new QoS limit. Reviewed-by: Dhruva Gole <d-gole@ti.com> Reviewed-by: Kevin Hilman (TI) <khilman@baylibre.com> Tested-by: Kevin Hilman (TI) <khilman@baylibre.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/20251125112650.329269-2-ulf.hansson@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-25vfio/pci: Add vfio_pci_dma_buf_iommufd_map()Jason Gunthorpe
This function is used to establish the "private interconnect" between the VFIO DMABUF exporter and the iommufd DMABUF importer. This is intended to be a temporary API until the core DMABUF interface is improved to natively support a private interconnect and revocable negotiation. This function should only be called by iommufd when trying to map a DMABUF. For now iommufd will only support VFIO DMABUFs. The following improvements are needed in the DMABUF API to generically support more exporters with iommufd/kvm type importers that cannot use the DMA API: 1) Revoke semantics. VFIO needs to be able to prevent access to the MMIO during FLR, and so it will use dma_buf_move_notify() to prevent access. iommufd does not support fault handling so it cannot implement the full move_notify. Instead if revoke is negotiated the exporter promises not to use move_notify() unless the importer can experience failures. iommufd will unmap the dmabuf from the iommu page tables while it is revoked. 2) Private interconnect negotiation. iommufd will only be able to map a "private interconnect" that provides a phys_addr_t and a struct p2pdma_provider * to describe the memory. It cannot use a DMA mapped scatterlist since it is directly calling iommu_map(). 3) NULL device during dma_buf_dynamic_attach(). Since iommufd doesn't use the DMA API it doesn't have a DMAable struct device to pass here. Link: https://patch.msgid.link/r/1-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Shuai Xue <xueshuai@linux.alibaba.com> Acked-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-11-25net_sched: add qdisc_dequeue_drop() helperEric Dumazet
Some qdiscs, like cake, codel and fq_codel, might drop packets in their dequeue() method. This is currently problematic because dequeue() runs with the qdisc spinlock held. Freeing skbs can be extremely expensive. Add qdisc_dequeue_drop() method and a new TCQ_F_DEQUEUE_DROPS so that these qdiscs can opt in to defer the skb frees after the qdisc spinlock is released. TCQ_F_DEQUEUE_DROPS is an attempt to not penalize other qdiscs with an extra cache line miss. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-14-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
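The general pattern of collecting drops under a lock and freeing them afterwards can be sketched in user space as follows. Everything here is a toy model and an assumption of this example: struct pkt, struct toy_qdisc, the boolean that stands in for the spinlock, and all toy_*() helpers are hypothetical and do not reflect the actual qdisc_dequeue_drop() implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct pkt {
	struct pkt *next;
	bool stale;             /* stand-in for "the AQM decided to drop this" */
};

struct toy_qdisc {
	bool locked;            /* models the qdisc spinlock being held */
	struct pkt *head;       /* queued packets */
	struct pkt *to_free;    /* drops collected while the lock is held */
};

/* Push a packet at the front of the queue (toy helper for setting up state). */
static void toy_push(struct toy_qdisc *q, bool stale)
{
	struct pkt *p = malloc(sizeof(*p));

	assert(p != NULL);
	p->stale = stale;
	p->next = q->head;
	q->head = p;
}

/* Lock held: park stale packets on to_free instead of freeing them here. */
static struct pkt *toy_dequeue(struct toy_qdisc *q)
{
	assert(q->locked);
	while (q->head) {
		struct pkt *p = q->head;

		q->head = p->next;
		if (!p->stale) {
			p->next = NULL;
			return p;
		}
		p->next = q->to_free;
		q->to_free = p;
	}
	return NULL;
}

/* Lock held: detach the deferred list so it can be freed after unlock. */
static struct pkt *toy_detach_drops(struct toy_qdisc *q)
{
	struct pkt *list = q->to_free;

	assert(q->locked);
	q->to_free = NULL;
	return list;
}

/* No lock needed: free a detached list and return how many were freed. */
static unsigned int toy_free_list(struct pkt *list)
{
	unsigned int n = 0;

	while (list) {
		struct pkt *p = list;

		list = p->next;
		free(p);
		n++;
	}
	return n;
}
```

The expensive part, toy_free_list(), operates only on a privately detached list, so in this model the time spent with the lock held no longer depends on how costly freeing is.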
2025-11-25net_sched: add tcf_kfree_skb_list() helperEric Dumazet
Using kfree_skb_list_reason() to free a list of skbs from qdisc operations seems wrong as each skb might have a different drop reason. Clean up __dev_xmit_skb() to call tcf_kfree_skb_list() once in preparation of the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-13-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: add Qdisc_read_mostly and Qdisc_write groupsEric Dumazet
It is possible to reorg Qdisc to avoid always dirtying 2 cache lines in fast path by reducing this to a single dirtied cache line.

In current layout, we change only four/six fields in the first cache line:
 - q.spinlock
 - q.qlen
 - bstats.bytes
 - bstats.packets
 - some Qdisc also change q.next/q.prev

In the second cache line we change in the fast path:
 - running
 - state
 - qstats.backlog

        /* --- cacheline 2 boundary (128 bytes) --- */
        struct sk_buff_head gso_skb __attribute__((__aligned__(64))); /* 0x80 0x18 */
        struct qdisc_skb_head q; /* 0x98 0x18 */
        struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /* 0xb0 0x10 */

        /* --- cacheline 3 boundary (192 bytes) --- */
        struct gnet_stats_queue qstats; /* 0xc0 0x14 */
        bool running; /* 0xd4 0x1 */

        /* XXX 3 bytes hole, try to pack */

        unsigned long state; /* 0xd8 0x8 */
        struct Qdisc * next_sched; /* 0xe0 0x8 */
        struct sk_buff_head skb_bad_txq; /* 0xe8 0x18 */
        /* --- cacheline 4 boundary (256 bytes) --- */

Reorganize things to have a first cache line mostly read, then a mostly written one. This gives a ~3% increase of performance under tx stress. Note that there is an additional hole because @qstats now spans over a third cache line.

        /* --- cacheline 2 boundary (128 bytes) --- */
        __u8 __cacheline_group_begin__Qdisc_read_mostly[0] __attribute__((__aligned__(64))); /* 0x80 0 */
        struct sk_buff_head gso_skb; /* 0x80 0x18 */
        struct Qdisc * next_sched; /* 0x98 0x8 */
        struct sk_buff_head skb_bad_txq; /* 0xa0 0x18 */
        __u8 __cacheline_group_end__Qdisc_read_mostly[0]; /* 0xb8 0 */

        /* XXX 8 bytes hole, try to pack */

        /* --- cacheline 3 boundary (192 bytes) --- */
        __u8 __cacheline_group_begin__Qdisc_write[0] __attribute__((__aligned__(64))); /* 0xc0 0 */
        struct qdisc_skb_head q; /* 0xc0 0x18 */
        unsigned long state; /* 0xd8 0x8 */
        struct gnet_stats_basic_sync bstats __attribute__((__aligned__(16))); /* 0xe0 0x10 */
        bool running; /* 0xf0 0x1 */

        /* XXX 3 bytes hole, try to pack */

        struct gnet_stats_queue qstats; /* 0xf4 0x14 */
        /* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
        __u8 __cacheline_group_end__Qdisc_write[0]; /* 0x108 0 */

        /* XXX 56 bytes hole, try to pack */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251121083256.674562-8-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
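The idea of separating read-mostly fields from frequently written ones can be reproduced in a small C11 sketch. This is an illustrative layout only: struct layout_demo, its field names, the 64 byte CACHELINE constant and cacheline_of() are assumptions of this example and do not match struct Qdisc or the kernel's cacheline group macros.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64   /* assumed cache line size */

struct layout_demo {
	/* Read-mostly group: starts on its own cache line. */
	_Alignas(CACHELINE) void *gso_head;
	void *next_sched;
	void *bad_txq_head;

	/* Write group: forced onto the next cache line. */
	_Alignas(CACHELINE) unsigned int qlen;
	unsigned long state;
	unsigned long long bytes;
	unsigned long long packets;
	unsigned int backlog;
	_Bool running;
};

/* Compile-time check that the write group begins on a cache line boundary. */
static_assert(offsetof(struct layout_demo, qlen) % CACHELINE == 0,
	      "write group must start on a cache line boundary");

/* Index of the cache line that a given struct offset falls into. */
static size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE;
}
```

With this layout, updating qlen, state or the counters dirties only the second cache line, while readers of the first group keep a clean line. The trade-off, as the message notes for the real structure, is padding between the groups.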