Age | Commit message (Collapse) | Author |
|
tick_freeze() acquires a raw spinlock (tick_freeze_lock). Later in the
callchain (timekeeping_suspend() -> mc146818_avoid_UIP()) the RTC driver
acquires a spinlock which becomes a sleeping lock on PREEMPT_RT. Lockdep
complains about this lock nesting.
Add a lockdep override for this special case and a comment explaining
why it is okay.
Reported-by: Borislav Petkov <bp@alien8.de>
Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250404133429.pnAzf-eF@linutronix.de
Closes: https://lore.kernel.org/all/20250330113202.GAZ-krsjAnurOlTcp-@fat_crate.local/
Closes: https://lore.kernel.org/all/CAP-bSRZ0CWyZZsMtx046YV8L28LhY0fson2g4EqcwRAVN1Jk+Q@mail.gmail.com/
|
|
The "function" field of struct hrtimer has been changed to private, but
two instances have not been converted to use ACCESS_PRIVATE().
Convert them to use ACCESS_PRIVATE().
Fixes: 04257da0c99c ("hrtimers: Make callback function pointer private")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250408103854.1851093-1-namcao@linutronix.de
Closes: https://lore.kernel.org/oe-kbuild-all/202504071931.vOVl13tt-lkp@intel.com/
Closes: https://lore.kernel.org/oe-kbuild-all/202504072155.5UAZjYGU-lkp@intel.com/
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A set of final cleanups for the timer subsystem:
- Convert all del_timer[_sync]() instances over to the new
timer_delete[_sync]() API and remove the legacy wrappers.
Conversion was done with coccinelle plus some manual fixups as
coccinelle chokes on scoped_guard().
- The final cleanup of the hrtimer_init() to hrtimer_setup()
conversion.
This has been delayed to the end of the merge window, so that all
patches which have been merged through other trees are in mainline
and all new users are catched.
Doing this right before rc1 ensures that new code which is merged post
rc1 is not introducing new instances of the original functionality"
* tag 'timers-cleanups-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracing/timers: Rename the hrtimer_init event to hrtimer_setup
hrtimers: Rename debug_init_on_stack() to debug_setup_on_stack()
hrtimers: Rename debug_init() to debug_setup()
hrtimers: Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper()
hrtimers: Remove unnecessary NULL check in hrtimer_start_range_ns()
hrtimers: Make callback function pointer private
hrtimers: Merge __hrtimer_init() into __hrtimer_setup()
hrtimers: Switch to use __htimer_setup()
hrtimers: Delete hrtimer_init()
treewide: Convert new and leftover hrtimer_init() users
treewide: Switch/rename to timer_delete[_sync]()
|
|
The function hrtimer_init() doesn't exist anymore. It was replaced by
hrtimer_setup().
Thus, rename the hrtimer_init trace event to hrtimer_setup to keep it
consistent.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/cba84c3d853c5258aa3a262363a6eac08e2c7afc.1738746927.git.namcao@linutronix.de
|
|
All the hrtimer_init*() functions have been renamed to hrtimer_setup*().
Rename debug_init_on_stack() to debug_setup_on_stack() as well, to keep the
names consistent.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/073cf6162779a2f5b12624677d4c49ee7eccc1ed.1738746927.git.namcao@linutronix.de
|
|
All the hrtimer_init*() functions have been renamed to hrtimer_setup*().
Rename debug_init() to debug_setup() as well, to keep the names consistent.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/4b730c1f79648b16a1c5413f928fdc2e138dfc43.1738746927.git.namcao@linutronix.de
|
|
All the hrtimer_init*() functions have been renamed to hrtimer_setup*().
Rename __hrtimer_init_sleeper() to __hrtimer_setup_sleeper() as well, to
keep the names consistent.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/807694aedad9353421c4a7347629a30c5c31026f.1738746927.git.namcao@linutronix.de
|
|
The struct hrtimer::function field can only be changed using
hrtimer_setup*() or hrtimer_update_function(), and both already null-check
'function'. Therefore, null-checking 'function' in hrtimer_start_range_ns()
is not necessary.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/4661c571ee87980c340ccc318fc1a473c0c8f6bc.1738746927.git.namcao@linutronix.de
|
|
Make the struct hrtimer::function field private, to prevent users from
changing this field in an unsafe way. hrtimer_update_function() should be
used if the callback function needs to be changed.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/7d0e6e0c5c59a64a9bea940051aac05d750bc0c2.1738746927.git.namcao@linutronix.de
|
|
__hrtimer_init() is only called by __hrtimer_setup(). Simplify by merging
__hrtimer_init() into __hrtimer_setup().
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/8a0a847a35f711f66b2d05b57255aa44e7e61279.1738746927.git.namcao@linutronix.de
|
|
__hrtimer_init_sleeper() calls __hrtimer_init() and also sets up the
callback function. But there is already __hrtimer_setup() which does both
actions.
Switch to use __hrtimer_setup() to simplify the code.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/d9a45a51b6a8aa0045310d63f73753bf6b33f385.1738746927.git.namcao@linutronix.de
|
|
hrtimer_init() is now unused. Delete it.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/all/003722f60c7a2a4f8d4ed24fb741aa313b7e5136.1738746927.git.namcao@linutronix.de
|
|
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This reverts commit 757b000f7b936edf79311ab0971fe465bbda75ea.
Miroslav reported that the changes for handling the inconsistencies in the
coarse time getters result in a regression on the adjtimex() side.
There are two issues:
1) The forwarding of the base time moves the update out of the original
period and establishes a new one.
2) The clearing of the accumulated NTP error is changing the behaviour as
well.
Userspace expects that multiplier/frequency updates are in effect, when the
syscall returns, so delaying the update to the next tick is not solving the
problem either.
Revert the change, so that the established expectations of user space
implementations (ntpd, chronyd) are restored. The re-introduced
inconsistency of the coarse time getters will be addressed in a subsequent
fix.
Fixes: 757b000f7b93 ("timekeeping: Fix possible inconsistencies in _COARSE clockids")
Reported-by: Miroslav Lichvar <mlichvar@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/Z-qsg6iDGlcIJulJ@localhost
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Continue Netlink conversions to per-namespace RTNL lock
(IPv4 routing, routing rules, routing next hops, ARP ioctls)
- Continue extending the use of netdev instance locks. As a driver
opt-in protect queue operations and (in due course) ethtool
operations with the instance lock and not RTNL lock.
- Support collecting TCP timestamps (data submitted, sent, acked) in
BPF, allowing for transparent (to the application) and lower
overhead tracking of TCP RPC performance.
- Tweak existing networking Rx zero-copy infra to support zero-copy
Rx via io_uring.
- Optimize MPTCP performance in single subflow mode by 29%.
- Enable GRO on packets which went thru XDP CPU redirect (were queued
for processing on a different CPU). Improving TCP stream
performance up to 2x.
- Improve performance of contended connect() by 200% by searching for
an available 4-tuple under RCU rather than a spin lock. Bring an
additional 229% improvement by tweaking hash distribution.
- Avoid unconditionally touching sk_tsflags on RX, improving
performance under UDP flood by as much as 10%.
- Avoid skb_clone() dance in ping_rcv() to improve performance under
ping flood.
- Avoid FIB lookup in netfilter if socket is available, 20% perf win.
- Rework network device creation (in-kernel) API to more clearly
identify network namespaces and their roles. There are up to 4
namespace roles but we used to have just 2 netns pointer arguments,
interpreted differently based on context.
- Use sysfs_break_active_protection() instead of trylock to avoid
deadlocks between unregistering objects and sysfs access.
- Add a new sysctl and sockopt for capping max retransmit timeout in
TCP.
- Support masking port and DSCP in routing rule matches.
- Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.
- Support specifying at what time packet should be sent on AF_XDP
sockets.
- Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin
users.
- Add Netlink YAML spec for WiFi (nl80211) and conntrack.
- Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
which only need to be exported when IPv6 support is built as a
module.
- Age FDB entries based on Rx not Tx traffic in VxLAN, similar to
normal bridging.
- Allow users to specify source port range for GENEVE tunnels.
- netconsole: allow attaching kernel release, CPU ID and task name to
messages as metadata
Driver API:
- Continue rework / fixing of Energy Efficient Ethernet (EEE) across
the SW layers. Delegate the responsibilities to phylink where
possible. Improve its handling in phylib.
- Support symmetric OR-XOR RSS hashing algorithm.
- Support tracking and preserving IRQ affinity by NAPI itself.
- Support loopback mode speed selection for interface selftests.
Device drivers:
- Remove the IBM LCS driver for s390
- Remove the sb1000 cable modem driver
- Add support for SFP module access over SMBus
- Add MCTP transport driver for MCTP-over-USB
- Enable XDP metadata support in multiple drivers
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- add PCIe TLP Processing Hints (TPH) support for new AMD
platforms
- support dumping RoCE queue state for debug
- opt into instance locking
- Intel (100G, ice, idpf):
- ice: rework MSI-X IRQ management and distribution
- ice: support for E830 devices
- iavf: add support for Rx timestamping
- iavf: opt into instance locking
- nVidia/Mellanox:
- mlx4: use page pool memory allocator for Rx
- mlx5: support for one PTP device per hardware clock
- mlx5: support for 200Gbps per-lane link modes
- mlx5: move IPSec policy check after decryption
- AMD/Solarflare:
- support FW flashing via devlink
- Cisco (enic):
- use page pool memory allocator for Rx
- enable 32, 64 byte CQEs
- get max rx/tx ring size from the device
- Meta (fbnic):
- support flow steering and RSS configuration
- report queue stats
- support TCP segmentation
- support IRQ coalescing
- support ring size configuration
- Marvell/Cavium:
- support AF_XDP
- Wangxun:
- support for PTP clock and timestamping
- Huawei (hibmcge):
- checksum offload
- add more statistics
- Ethernet virtual:
- VirtIO net:
- aggressively suppress Tx completions, improve perf by 96%
with 1 CPU and 55% with 2 CPUs
- expose NAPI to IRQ mapping and persist NAPI settings
- Google (gve):
- support XDP in DQO RDA Queue Format
- opt into instance locking
- Microsoft vNIC:
- support BIG TCP
- Ethernet NICs consumer, and embedded:
- Synopsys (stmmac):
- cleanup Tx and Tx clock setting and other link-focused
cleanups
- enable SGMII and 2500BASEX mode switching for Intel platforms
- support Sophgo SG2044
- Broadcom switches (b53):
- support for BCM53101
- TI:
- iep: add perout configuration support
- icssg: support XDP
- Cadence (macb):
- implement BQL
- Xilinx (axinet):
- support dynamic IRQ moderation and changing coalescing at
runtime
- implement BQL
- report standard stats
- MediaTek:
- support phylink managed EEE
- Intel:
- igc: don't restart the interface on every XDP program change
- RealTek (r8169):
- support reading registers of internal PHYs directly
- increase max jumbo packet size on RTL8125/RTL8126
- Airoha:
- support for RISC-V NPU packet processing unit
- enable scatter-gather and support MTU up to 9kB
- Tehuti (tn40xx):
- support cards with TN4010 MAC and an Aquantia AQR105 PHY
- Ethernet PHYs:
- support for TJA1102S, TJA1121
- dp83tg720: add randomized polling intervals for link detection
- dp83822: support changing the transmit amplitude voltage
- support for LEDs on 88q2xxx
- CAN:
- canxl: support Remote Request Substitution bit access
- flexcan: add S32G2/S32G3 SoC
- WiFi:
- remove cooked monitor support
- strict mode for better AP testing
- basic EPCS support
- OMI RX bandwidth reduction support
- batman-adv: add support for jumbo frames
- WiFi drivers:
- RealTek (rtw88):
- support RTL8814AE and RTL8814AU
- RealTek (rtw89):
- switch using wiphy_lock and wiphy_work
- add BB context to manipulate two PHY as preparation of MLO
- improve BT-coexistence mechanism to play A2DP smoothly
- Intel (iwlwifi):
- add new iwlmld sub-driver for latest HW/FW combinations
- MediaTek (mt76):
- preparation for mt7996 Multi-Link Operation (MLO) support
- Qualcomm/Atheros (ath12k):
- continued work on MLO
- Silabs (wfx):
- Wake-on-WLAN support
- Bluetooth:
- add support for skb TX SND/COMPLETION timestamping
- hci_core: enable buffer flow control for SCO/eSCO
- coredump: log devcd dumps into the monitor
- Bluetooth drivers:
- intel: add support to configure TX power
- nxp: handle bootloader error during cmd5 and cmd7"
* tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits)
unix: fix up for "apparmor: add fine grained af_unix mediation"
mctp: Fix incorrect tx flow invalidation condition in mctp-i2c
net: usb: asix: ax88772: Increase phy_name size
net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string
net: libwx: fix Tx L4 checksum
net: libwx: fix Tx descriptor content for some tunnel packets
atm: Fix NULL pointer dereference
net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards
net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card
net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus
net: phy: aquantia: add essential functions to aqr105 driver
net: phy: aquantia: search for firmware-name in fwnode
net: phy: aquantia: add probe function to aqr105 for firmware loading
net: phy: Add swnode support to mdiobus_scan
gve: add XDP DROP and PASS support for DQ
gve: update XDP allocation path support RX buffer posting
gve: merge packet buffer size fields
gve: update GQ RX to use buf_size
gve: introduce config-based allocation for XDP
gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull VDSO infrastructure updates from Thomas Gleixner:
- Consolidate the VDSO storage
The VDSO data storage and data layout has been largely architecture
specific for historical reasons. That increases the maintenance
effort and causes inconsistencies over and over.
There is no real technical reason for architecture specific layouts
and implementations. The architecture specific details can easily be
integrated into a generic layout, which also reduces the amount of
duplicated code for managing the mappings.
Convert all architectures over to a unified layout and common mapping
infrastructure. This splits the VDSO data layout into subsystem
specific blocks, timekeeping, random and architecture parts, which
provides a better structure and allows to improve and update the
functionalities without conflict and interaction.
- Rework the timekeeping data storage
The current implementation is designed for exposing system
timekeeping accessors, which was good enough at the time when it was
designed.
PTP and Time Sensitive Networking (TSN) change that as there are
requirements to expose independent PTP clocks, which are not related
to system timekeeping.
Replace the monolithic data storage by a structured layout, which
allows to add support for independent PTP clocks on top while reusing
both the data structures and the time accessor implementations.
* tag 'timers-vdso-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
sparc/vdso: Always reject undefined references during linking
x86/vdso: Always reject undefined references during linking
vdso: Rework struct vdso_time_data and introduce struct vdso_clock
vdso: Move architecture related data before basetime data
powerpc/vdso: Prepare introduction of struct vdso_clock
arm64/vdso: Prepare introduction of struct vdso_clock
x86/vdso: Prepare introduction of struct vdso_clock
time/namespace: Prepare introduction of struct vdso_clock
vdso/namespace: Rename timens_setup_vdso_data() to reflect new vdso_clock struct
vdso/vsyscall: Prepare introduction of struct vdso_clock
vdso/gettimeofday: Prepare helper functions for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_coarse_timens() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_coarse() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_hres_timens() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare do_hres() for introduction of struct vdso_clock
vdso/gettimeofday: Prepare introduction of struct vdso_clock
vdso/helpers: Prepare introduction of struct vdso_clock
vdso/datapage: Define vdso_clock to prepare for multiple PTP clocks
vdso: Make vdso_time_data cacheline aligned
arm64: Make asm/cache.h compatible with vDSO
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the
upcoming Rust integration and is obviously a silly implementation to
begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member"
* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
wifi: rt2x00: Switch to use hrtimer_update_function()
io_uring: Use helper function hrtimer_update_function()
serial: xilinx_uartps: Use helper function hrtimer_update_function()
ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
RDMA: Switch to use hrtimer_setup()
virtio: mem: Switch to use hrtimer_setup()
drm/vmwgfx: Switch to use hrtimer_setup()
drm/xe/oa: Switch to use hrtimer_setup()
drm/vkms: Switch to use hrtimer_setup()
drm/msm: Switch to use hrtimer_setup()
drm/i915/request: Switch to use hrtimer_setup()
drm/i915/uncore: Switch to use hrtimer_setup()
drm/i915/pmu: Switch to use hrtimer_setup()
drm/i915/perf: Switch to use hrtimer_setup()
drm/i915/gvt: Switch to use hrtimer_setup()
drm/i915/huc: Switch to use hrtimer_setup()
drm/amdgpu: Switch to use hrtimer_setup()
stm class: heartbeat: Switch to use hrtimer_setup()
i2c: Switch to use hrtimer_setup()
iio: Switch to use hrtimer_setup()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner:
- Fix a memory ordering issue in posix-timers
Posix-timer lookup is lockless and reevaluates the timer validity
under the timer lock, but the update which validates the timer is not
protected by the timer lock. That allows the store to be reordered
against the initialization stores, so that the lookup side can
observe a partially initialized timer. That's mostly a theoretical
problem, but incorrect nevertheless.
- Fix a long standing inconsistency of the coarse time getters
The coarse time getters read the base time of the current update
cycle without reading the actual hardware clock. NTP frequency
adjustment can set the base time backwards. The fine grained
interfaces compensate this by reading the clock and applying the new
conversion factor, but the coarse grained time getters use the base
time directly. That allows the user to observe time going backwards.
Cure it by always forwarding base time, when NTP changes the
frequency with an immediate step.
- Rework of posix-timer hashing
The posix-timer hash is not scalable and due to the CRIU timer
restore mechanism prone to massive contention on the global hash
bucket lock.
Replace the global hash lock with a fine grained per bucket locking
scheme to address that.
- Rework the proc/$PID/timers interface.
/proc/$PID/timers is provided for CRIU to be able to restore a timer.
The printout happens with sighand lock held and interrupts disabled.
That's not required as this can be done with RCU protection as well.
- Provide a sane mechanism for CRIU to restore a timer ID
CRIU restores timers by creating and deleting them until the kernel
internal per process ID counter reached the requested ID. That's
horribly slow for sparse timer IDs.
Provide a prctl() which allows CRIU to restore a timer with a given
ID. When enabled the ID pointer is used as input pointer to read the
requested ID from user space. When disabled, the normal allocation
scheme (next ID) is active as before. This is backwards compatible
for both kernel and user space.
- Make hrtimer_update_function() less expensive.
The sanity checks are valuable, but expensive for high frequency
usage in io/uring. Make the debug checks conditional and enable them
only when lockdep is enabled.
- Small updates, cleanups and improvements
* tag 'timers-core-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
selftests/timers: Improve skew_consistency by testing with other clockids
timekeeping: Fix possible inconsistencies in _COARSE clockids
posix-timers: Drop redundant memset() invocation
selftests/timers/posix-timers: Add a test for exact allocation mode
posix-timers: Provide a mechanism to allocate a given timer ID
posix-timers: Dont iterate /proc/$PID/timers with sighand:: Siglock held
posix-timers: Make per process list RCU safe
posix-timers: Avoid false cacheline sharing
posix-timers: Switch to jhash32()
posix-timers: Improve hash table performance
posix-timers: Make signal_struct:: Next_posix_timer_id an atomic_t
posix-timers: Make lock_timer() use guard()
posix-timers: Rework timer removal
posix-timers: Simplify lock/unlock_timer()
posix-timers: Use guards in a few places
posix-timers: Remove SLAB_PANIC from kmem cache
posix-timers: Remove a few paranoid warnings
posix-timers: Cleanup includes
posix-timers: Add cond_resched() to posix_timer_add() search loop
posix-timers: Initialise timer before adding it to the hash table
...
|
|
CONFIG_TRACE_BRANCH_PROFILING inserts a call to ftrace_likely_update()
for each use of likely() or unlikely(). That breaks noinstr rules if
the affected function is annotated as noinstr.
Disable branch profiling for files with noinstr functions. In addition
to some individual files, this also includes the entire arch/x86
subtree, as well as the kernel/entry, drivers/cpuidle, and drivers/idle
directories, all of which are noinstr-heavy.
Due to the nature of how sched binaries are built by combining multiple
.c files into one, branch profiling is disabled more broadly across the
sched code than would otherwise be needed.
This fixes many warnings like the following:
vmlinux.o: warning: objtool: do_syscall_64+0x40: call to ftrace_likely_update() leaves .noinstr.text section
vmlinux.o: warning: objtool: __rdgsbase_inactive+0x33: call to ftrace_likely_update() leaves .noinstr.text section
vmlinux.o: warning: objtool: handle_bug.isra.0+0x198: call to ftrace_likely_update() leaves .noinstr.text section
...
Reported-by: Ingo Molnar <mingo@kernel.org>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/fb94fc9303d48a5ed370498f54500cc4c338eb6d.1742586676.git.jpoimboe@kernel.org
|
|
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time
inconsistencies.
Lei tracked down that this was being caused by the adjustment
tk->tkr_mono.xtime_nsec -= offset;
which is made to compensate for the unaccumulated cycles in offset when the
multiplicator is adjusted forward, so that the non-_COARSE clockids don't
see inconsistencies.
However, the _COARSE clockid getter functions use the adjusted xtime_nsec
value directly and do not compensate the negative offset via the
clocksource delta multiplied with the new multiplicator. In that case the
caller can observe time going backwards in consecutive calls.
By design, this negative adjustment should be fine, because the logic run
from timekeeping_adjust() is done after it accumulated approximately
multiplicator * interval_cycles
into xtime_nsec. The accumulated value is always larger then the
mult_adj * offset
value, which is subtracted from xtime_nsec. Both operations are done
together under the tk_core.lock, so the net change to xtime_nsec is always
always be positive.
However, do_adjtimex() calls into timekeeping_advance() as well, to to
apply the NTP frequency adjustment immediately. In this case,
timekeeping_advance() does not return early when the offset is smaller then
interval_cycles. In that case there is no time accumulated into
xtime_nsec. But the subsequent call into timekeeping_adjust(), which
modifies the multiplicator, subtracts from xtime_nsec to correct
for the new multiplicator.
Here because there was no accumulation, xtime_nsec becomes smaller than
before, which opens a window up to the next accumulation, where the _COARSE
clockid getters, which don't compensate for the offset, can observe the
inconsistency.
To fix this, rework the timekeeping_advance() logic so that when invoked
from do_adjtimex(), the time is immediately forwarded to accumulate also
the sub-interval portion into xtime. That means the remaining offset
becomes zero and the subsequent multiplier adjustment therefore does not
modify xtime_nsec.
There is another related inconsistency. If xtime is forwarded due to the
instantaneous multiplier adjustment, the NTP error, which was accumulated
with the previous setting, becomes meaningless.
Therefore clear the NTP error as well, after forwarding the clock for the
instantaneous multiplier update.
Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE")
Reported-by: Lei Chen <lei.chen@smartx.com>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250320200306.1712599-1-jstultz@google.com
Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
|
|
Initially in commit 6891c4509c79 memset() was required to clear a variable
allocated on stack. Commit 2482097c6c0f removed the on stack variable and
retained the memset() despite the fact that the memory is allocated via
kmem_cache_zalloc() and therefore zereoed already.
Drop the redundant memset().
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/Z9ctVxwaYOV4A2g4@grain
|
|
Checkpoint/Restore in Userspace (CRIU) requires to reconstruct posix timers
with the same timer ID on restore. It uses sys_timer_create() and relies on
the monotonic increasing timer ID provided by this syscall. It creates and
deletes timers until the desired ID is reached. This is can loop for a long
time, when the checkpointed process had a very sparse timer ID range.
It has been debated to implement a new syscall to allow the creation of
timers with a given timer ID, but that's tideous due to the 32/64bit compat
issues of sigevent_t and of dubious value.
The restore mechanism of CRIU creates the timers in a state where all
threads of the restored process are held on a barrier and cannot issue
syscalls. That means the restorer task has exclusive control.
This allows to address this issue with a prctl() so that the restorer
thread can do:
if (prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_ON))
goto linear_mode;
create_timers_with_explicit_ids();
prctl(PR_TIMER_CREATE_RESTORE_IDS, PR_TIMER_CREATE_RESTORE_IDS_OFF);
This is backwards compatible because the prctl() fails on older kernels and
CRIU can fall back to the linear timer ID mechanism. CRIU versions which do
not know about the prctl() just work as before.
Implement the prctl() and modify timer_create() so that it copies the
requested timer ID from userspace by utilizing the existing timer_t
pointer, which is used to copy out the allocated timer ID on success.
If the prctl() is disabled, which it is by default, timer_create() works as
before and does not try to read from the userspace pointer.
There is no problem when a broken or rogue user space application enables
the prctl(). If the user space pointer does not contain a valid ID, then
timer_create() fails. If the data is not initialized, but constains a
random valid ID, timer_create() will create that random timer ID or fail if
the ID is already given out.
As CRIU must use the raw syscall to avoid manipulating the internal state
of the restored process, this has no library dependencies and can be
adopted by CRIU right away.
Recreating two timers with IDs 1000000 and 2000000 takes 1.5 seconds with
the create/delete method. With the prctl() it takes 3 microseconds.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Tested-by: Cyrill Gorcunov <gorcunov@gmail.com>
Link: https://lore.kernel.org/all/87jz8vz0en.ffs@tglx
|
|
Preparatory change to remove the sighand locking from the /proc/$PID/timers
iterator.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155624.403223080@linutronix.de
|
|
struct k_itimer has the hlist_node, which is used for lookup in the hash
bucket, and the timer lock in the same cache line.
That's obviously bad, if one CPU fiddles with a timer and the other is
walking the hash bucket on which that timer is queued.
Avoid this by restructuring struct k_itimer, so that the read mostly (only
modified during setup and teardown) fields are in the first cache line and
the lock and the rest of the fields which get written to are in cacheline
2-N.
Reduces cacheline contention in a test case of 64 processes creating and
accessing 20000 timers each by almost 30% according to perf.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155624.341108067@linutronix.de
|
|
The hash distribution of hash_32() is suboptimal. jhash32() provides a way
better distribution, which evens out the length of the hash bucket lists,
which in turn avoids large outliers in list walk times.
Due to the sparse ID space (thanks CRIU) there is no guarantee that the
timers will be fully evenly distributed over the hash buckets, but the
behaviour is way better than with hash_32() even for randomly sparse ID
spaces.
For a pathological test case with 64 processes creating and accessing
20000 timers each, this results in a runtime reduction of ~10% and a
significantly reduced runtime variation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250308155624.279080328@linutronix.de
|
|
Eric and Ben reported a significant performance bottleneck on the global
hash, which is used to store posix timers for lookup.
Eric tried to do a lockless validation of a new timer ID before trying to
insert the timer, but that does not solve the problem.
For the non-contended case this is a pointless exercise and for the
contended case this extra lookup just creates enough interleaving that all
tasks can make progress.
There are actually two real solutions to the problem:
1) Provide a per process (signal struct) xarray storage
2) Implement a smarter hash like the one in the futex code
#1 works perfectly fine for most cases, but the fact that CRIU enforced a
linear increasing timer ID to restore timers makes this problematic.
It's easy enough to create a sparse timer ID space, which amounts very
fast to a large junk of memory consumed for the xarray. 2048 timers with
a ID offset of 512 consume more than one megabyte of memory for the
xarray storage.
#2 The main advantage of the futex hash is that it uses per hash bucket
locks instead of a global hash lock. Aside of that it is scaled
according to the number of CPUs at boot time.
Experiments with artifical benchmarks have shown that a scaled hash with
per bucket locks comes pretty close to the xarray performance and in some
scenarios it performes better.
Test 1:
A single process creates 20000 timers and afterwards invokes
timer_getoverrun(2) on each of them:
mainline Eric newhash xarray
create 23 ms 23 ms 9 ms 8 ms
getoverrun 14 ms 14 ms 5 ms 4 ms
Test 2:
A single process creates 50000 timers and afterwards invokes
timer_getoverrun(2) on each of them:
mainline Eric newhash xarray
create 98 ms 219 ms 20 ms 18 ms
getoverrun 62 ms 62 ms 10 ms 9 ms
Test 3:
A single process creates 100000 timers and afterwards invokes
timer_getoverrun(2) on each of them:
mainline Eric newhash xarray
create 313 ms 750 ms 48 ms 33 ms
getoverrun 261 ms 260 ms 20 ms 14 ms
Erics changes create quite some overhead in the create() path due to the
double list walk, as the main issue according to perf is the list walk
itself. With 100k timers each hash bucket contains ~200 timers, which in
the worst case need to be all inspected. The same problem applies for
getoverrun() where the lookup has to walk through the hash buckets to find
the timer it is looking for.
The scaled hash obviously reduces hash collisions and lock contention
significantly. This becomes more prominent with concurrency.
Test 4:
A process creates 63 threads and all threads wait on a barrier before
each instance creates 20000 timers and afterwards invokes
timer_getoverrun(2) on each of them. The threads are pinned on
seperate CPUs to achive maximum concurrency. The numbers are the
average times per thread:
mainline Eric newhash xarray
create 180239 ms 38599 ms 579 ms 813 ms
getoverrun 2645 ms 2642 ms 32 ms 7 ms
Test 5:
A process forks 63 times and all forks wait on a barrier before each
instance creates 20000 timers and afterwards invokes
timer_getoverrun(2) on each of them. The processes are pinned on
seperate CPUs to achive maximum concurrency. The numbers are the
average times per process:
mainline eric newhash xarray
create 157253 ms 40008 ms 83 ms 60 ms
getoverrun 2611 ms 2614 ms 40 ms 4 ms
So clearly the reduction of lock contention with Eric's changes makes a
significant difference for the create() loop, but it does not mitigate the
problem of long list walks, which is clearly visible on the getoverrun()
side because that is purely dominated by the lookup itself. Once the timer
is found, the syscall just reads from the timer structure with no other
locks or code paths involved and returns.
The reason for the difference between the thread and the fork case for the
new hash and the xarray is that both suffer from contention on
sighand::siglock and the xarray suffers additionally from contention on the
xarray lock on insertion.
The only case where the reworked hash slighly outperforms the xarray is a
tight loop which creates and deletes timers.
Test 4:
A process creates 63 threads and all threads wait on a barrier before
each instance runs a loop which creates and deletes a timer 100000
times in a row. The threads are pinned on seperate CPUs to achive
maximum concurrency. The numbers are the average times per thread:
mainline Eric newhash xarray
loop 5917 ms 5897 ms 5473 ms 7846 ms
Test 5:
A process forks 63 times and all forks wait on a barrier before each
each instance runs a loop which creates and deletes a timer 100000
times in a row. The processes are pinned on seperate CPUs to achive
maximum concurrency. The numbers are the average times per process:
mainline Eric newhash xarray
loop 5137 ms 7828 ms 891 ms 872 ms
In both test there is not much contention on the hash, but the ucount
accounting for the signal and in the thread case the sighand::siglock
contention (plus the xarray locking) contribute dominantly to the overhead.
As the memory consumption of the xarray in the sparse ID case is
significant, the scaled hash with per bucket locks seems to be the better
overall option. While the xarray has faster lookup times for a large number
of timers, the actual syscall usage, which requires the lookup is not an
extreme hotpath. Most applications utilize signal delivery and all syscalls
except timer_getoverrun(2) are all but cheap.
So implement a scaled hash with per bucket locks, which offers the best
tradeoff between performance and memory consumption.
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155624.216091571@linutronix.de
|
|
The global hash_lock protecting the posix timer hash table can be heavily
contended especially when there is an extensive linear search for a timer
ID.
Timer IDs are handed out by monotonically increasing next_posix_timer_id
and then validating that there is no timer with the same ID in the hash
table. Both operations happen with the global hash lock held.
To reduce the hash lock contention the hash will be reworked to a scaled
hash with per bucket locks, which requires to handle the ID counter
lockless.
Prepare for this by making next_posix_timer_id an atomic_t, which can be
used lockless with atomic_inc_return().
[ tglx: Adopted from Eric's series, massaged change log and simplified it ]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250219125522.2535263-2-edumazet@google.com
Link: https://lore.kernel.org/all/20250308155624.151545978@linutronix.de
|
|
The lookup and locking of posix timers requires the same repeating pattern
at all usage sites:
tmr = lock_timer(tiner_id);
if (!tmr)
return -EINVAL;
....
unlock_timer(tmr);
Solve this with a guard implementation, which works in most places out of
the box except for those, which need to unlock the timer inside the guard
scope.
Though the only places where this matters are timer_delete() and
timer_settime(). In both cases the timer pointer needs to be preserved
across the end of the scope, which is solved by storing the pointer in a
variable outside of the scope.
timer_settime() also has to protect the timer with RCU before unlocking,
which obviously can't use guard(rcu) before leaving the guard scope as that
guard is cleaned up before the unlock. Solve this by providing the RCU
protection open coded.
[ tglx: Made it work and added change log ]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250224162103.GD11590@noisy.programming.kicks-ass.net
Link: https://lore.kernel.org/all/20250308155624.087465658@linutronix.de
|
|
sys_timer_delete() and the do_exit() cleanup function itimer_delete() are
doing the same thing, but have needlessly different implementations instead
of sharing the code.
The other oddity of timer deletion is the fact that the timer is not
invalidated before the actual deletion happens, which allows concurrent
lookups to succeed.
That's wrong because a timer which is in the process of being deleted
should not be visible and any actions like signal queueing, delivery and
rearming should not happen once the task, which invoked timer_delete(), has
the timer locked.
Rework the code so that:
1) The signal queueing and delivery code ignore timers which are marked
invalid
2) The deletion implementation between sys_timer_delete() and
itimer_delete() is shared
3) The timer is invalidated and removed from the linked lists before
the deletion callback of the relevant clock is invoked.
That requires to rework timer_wait_running() as it does a lookup of
the timer when relocking it at the end. In case of deletion this
lookup would fail due to the preceding invalidation and the wait loop
would terminate prematurely.
But due to the preceding invalidation the timer cannot be accessed by
other tasks anymore, so there is no way that the timer has been freed
after the timer lock has been dropped.
Move the re-validation out of timer_wait_running() and handle it at
the only other usage site, timer_settime().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/87zfht1exf.ffs@tglx
|
|
Since the integration of sigqueue into the timer struct, lock_timer() is
only used in task context. So taking the lock with irqsave() is not longer
required.
Convert it to use spin_[un]lock_irq().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.959825668@linutronix.de
|
|
Switch locking and RCU to guards where applicable.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.892762130@linutronix.de
|
|
There is no need to panic when the posix-timer kmem_cache can't be
created. timer_create() will fail with -ENOMEM and that's it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.829215801@linutronix.de
|
|
Warnings about a non-initialized timer or non-existing callbacks are just
useful for implementing new posix clocks, but there a NULL pointer
dereference is expected anyway. :)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.765462334@linutronix.de
|
|
Remove pointless includes and sort the remaining ones alphabetically.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.701301552@linutronix.de
|
|
With a large number of POSIX timers the search for a valid ID might cause a
soft lockup on PREEMPT_NONE/VOLUNTARY kernels.
Add cond_resched() to the loop to prevent that.
[ tglx: Split out from Eric's series ]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250214135911.2037402-2-edumazet@google.com
Link: https://lore.kernel.org/all/20250308155623.635612865@linutronix.de
|
|
A timer is only valid in the hashtable when both timer::it_signal and
timer::it_id are set to their final values, but timers are added without
those values being set.
The timer ID is allocated when the timer is added to the hash in invalid
state. The ID is taken from a monotonically increasing per process counter
which wraps around after reaching INT_MAX. The hash insertion validates
that there is no timer with the allocated ID in the hash table which
belongs to the same process. That opens a mostly theoretical race condition:
If other threads of the same process manage to create/delete timers in
rapid succession before the newly created timer is fully initialized and
wrap around to the timer ID which was handed out, then a duplicate timer ID
will be inserted into the hash table.
Prevent this by:
1) Setting timer::it_id before inserting the timer into the hashtable.
2) Storing the signal pointer in timer::it_signal with bit 0 set before
inserting it into the hashtable.
Bit 0 acts as a invalid bit, which means that the regular lookup for
sys_timer_*() will fail the comparison with the signal pointer.
But the lookup on insertion masks out bit 0 and can therefore detect a
timer which is not yet valid, but allocated in the hash table. Bit 0
in the pointer is cleared once the initialization of the timer
completed.
[ tglx: Fold ID and signal iniitializaion into one patch and massage change
log and comments. ]
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250219125522.2535263-3-edumazet@google.com
Link: https://lore.kernel.org/all/20250308155623.572035178@linutronix.de
|
|
Frederic pointed out that the memory operations to initialize the timer are
not guaranteed to be visible, when __lock_timer() observes timer::it_signal
valid under timer::it_lock:
T0 T1
--------- -----------
do_timer_create()
// A
new_timer->.... = ....
spin_lock(current->sighand)
// B
WRITE_ONCE(new_timer->it_signal, current->signal)
spin_unlock(current->sighand)
sys_timer_*()
t = __lock_timer()
spin_lock(&timr->it_lock)
// observes B
if (timr->it_signal == current->signal)
return timr;
if (!t)
return;
// Is not guaranteed to observe A
Protect the write of timer::it_signal, which makes the timer valid, with
timer::it_lock as well. This guarantees that T1 must observe the
initialization A completely, when it observes the valid signal pointer
under timer::it_lock. sighand::siglock must still be taken to protect the
signal::posix_timers list.
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250308155623.507944489@linutronix.de
|
|
The size argument of strscpy() is only required when the destination
pointer is not a fixed sized array or when the copy needs to be smaller
than the size of the fixed sized destination array.
For fixed sized destination arrays and full copies, strscpy() automatically
determines the length of the destination buffer if the size argument is
omitted.
This makes the explicit sizeof() unnecessary. Remove it.
[ tglx: Massaged change log ]
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250311110624.495718-2-thorsten.blum@linux.dev
|
|
This reverts commit f590308536db ("timer debug: Hide kernel addresses via
%pK in /proc/timer_list")
The timer list helper SEQ_printf() uses either the real seq_printf() for
procfs output or vprintk() to print to the kernel log, when invoked from
SysRq-q. It uses %pK for printing pointers.
In the past %pK was prefered over %p as it would not leak raw pointer
values into the kernel log. Since commit ad67b74d2469 ("printk: hash
addresses printed with %p") the regular %p has been improved to avoid this
issue.
Furthermore, restricted pointers ("%pK") were never meant to be used
through printk(). They can still unintentionally leak raw pointers or
acquire sleeping looks in atomic contexts.
Switch to the regular pointer formatting which is safer, easier to reason
about and sufficient here.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/
Link: https://lore.kernel.org/all/20250311-restricted-pointers-timer-v1-1-6626b91e54ab@linutronix.de
|
|
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be an array of VDSO clocks.
Now that all preparatory changes are in place:
Split the clock related struct members into a separate struct
vdso_clock. Make sure all users are aware, that vdso_time_data is no longer
initialized as an array and vdso_clock is now the array inside
vdso_data. Remove the vdso_clock define, which mapped it to vdso_time_data
for the transition.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-19-c1b5c69a166f@linutronix.de
|
|
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be array of VDSO clocks. At the moment,
vdso_clock is simply a define which maps vdso_clock to vdso_time_data.
To prepare for the rework of the data structures, replace the struct
vdso_time_data pointer with a struct vdso_clock pointer where applicable.
No functional change.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-14-c1b5c69a166f@linutronix.de
|
|
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be array of VDSO clocks. At the moment,
vdso_clock is simply a define which maps vdso_clock to vdso_time_data.
For time namespaces, vdso_time_data needs to be set up. But only the clock
related part of the vdso_data thats requires this setup. To reflect the
future struct vdso_clock, rename timens_setup_vdso_data() to
timns_setup_vdso_clock_data().
No functional change.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-13-c1b5c69a166f@linutronix.de
|
|
To support multiple PTP clocks, the VDSO data structure needs to be
reworked. All clock specific data will end up in struct vdso_clock and in
struct vdso_time_data there will be array of VDSO clocks. At the moment,
vdso_clock is simply a define which maps vdso_clock to vdso_time_data.
To prepare for the rework of the data structures, replace the struct
vdso_time_data pointer with a struct vdso_clock pointer where applicable.
No functional change.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250303-vdso-clock-v1-12-c1b5c69a166f@linutronix.de
|
|
Many devices implement highly accurate clocks, which the kernel manages
as PTP Hardware Clocks (PHCs). Userspace applications rely on these
clocks to timestamp events, trace workload execution, correlate
timescales across devices, and keep various clocks in sync.
The kernel’s current implementation of PTP clocks does not enforce file
permissions checks for most device operations except for POSIX clock
operations, where file mode is verified in the POSIX layer before
forwarding the call to the PTP subsystem. Consequently, it is common
practice to not give unprivileged userspace applications any access to
PTP clocks whatsoever by giving the PTP chardevs 600 permissions. An
example of users running into this limitation is documented in [1].
Additionally, POSIX layer requires WRITE permission even for readonly
adjtime() calls which are used in PTP layer to return current frequency
offset applied to the PHC.
Add permission checks for functions that modify the state of a PTP
device. Continue enforcing permission checks for POSIX clock operations
(settime, adjtime) in the POSIX layer. Only require WRITE access for
dynamic clocks adjtime() if any flags are set in the modes field.
[1] https://lists.nwtime.org/sympa/arc/linuxptp-users/2024-01/msg00036.html
Changes in v4:
- Require FMODE_WRITE in ajtime() only for calls modifying the clock in
any way.
Acked-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Wojtek Wasko <wwasko@nvidia.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
File descriptor based pc_clock_*() operations of dynamic posix clocks
have access to the file pointer and implement permission checks in the
generic code before invoking the relevant dynamic clock callback.
Character device operations (open, read, poll, ioctl) do not implement a
generic permission control and the dynamic clock callbacks have no
access to the file pointer to implement them.
Extend struct posix_clock_context with a struct file pointer and
initialize it in posix_clock_open(), so that all dynamic clock callbacks
can access it.
Acked-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Wojtek Wasko <wwasko@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The normal and compat ioctl handlers are identical,
which is fine as compat ioctls are detected and handled dynamically
inside the underlying clock implementation.
The duplicate definition however is unnecessary.
Just reuse the regular ioctl handler also for compat ioctls.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Link: https://lore.kernel.org/all/20250225-posix-clock-compat-cleanup-v2-1-30de86457a2b@weissschuh.net
|
|
All users of the time releated parts of the vDSO are now using the generic
storage implementation. Remove the therefore unnecessary compatibility
accessor functions and symbols.
Co-developed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250204-vdso-store-rng-v3-18-13a4669dfc8c@linutronix.de
|
|
Historically each architecture defined their own way to store the vDSO
data page. Add a generic mechanism to provide storage for that page.
Furthermore this generic storage will be extended to also provide
uniform storage for *non*-time-related data, like the random state or
architecture-specific data. These will have their own pages and data
structures, so rename 'vdso_data' into 'vdso_time_data' to make that
split clear from the name.
Also introduce a new consistent naming scheme for the symbols related to
the vDSO, which makes it clear if the symbol is accessible from
userspace or kernel space and the type of data behind the symbol.
The generic fault handler contains an optimization to prefault the vvar
page when the timens page is accessed. This was lifted from s390 and x86.
Co-developed-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250204-vdso-store-rng-v3-5-13a4669dfc8c@linutronix.de
|
|
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Most of this patch is generated by Coccinelle. Except for the TX thrtimer
in bcm_tx_setup() because this timer is not used and the callback function
is never set. For this particular case, set the callback to
hrtimer_dummy_timeout()
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/all/a3a6be42c818722ad41758457408a32163bfd9a0.1738746872.git.namcao@linutronix.de
|
|
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/170bb691a0d59917c8268a98c80b607128fc9f7f.1738746821.git.namcao@linutronix.de
|