summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2025-03-25rwonce: handle KCSAN like KASAN in read_word_at_a_time()Jann Horn
read_word_at_a_time() is allowed to read out of bounds by straddling the end of an allocation (and the caller is expected to then mask off out-of-bounds data). This works as long as the caller guarantees that the access won't hit a pagefault (either by ensuring that addr is aligned or by explicitly checking where the next page boundary is). Such out-of-bounds data could include things like KASAN redzones, adjacent allocations that are concurrently written to, or simply an adjacent struct field that is concurrently updated. KCSAN should ignore racy reads of OOB data that is not actually used, just like KASAN, so (similar to the code above) change read_word_at_a_time() to use __no_sanitize_or_inline instead of __no_kasan_or_inline, and explicitly inform KCSAN that we're reading the first byte. We do have an instrument_read() helper that calls into both KASAN and KCSAN, but I'm instead open-coding that here to avoid having to pull the entire instrumented.h header into rwonce.h. Also, since this read can be racy by design, we should technically do READ_ONCE(), so add that. Fixes: dfd402a4c4ba ("kcsan: Add Kernel Concurrency Sanitizer infrastructure") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Marco Elver <elver@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-03-25Bluetooth: L2CAP: add TX timestampingPauli Virtanen
Support TX timestamping in L2CAP sockets. Support MSG_ERRQUEUE recvmsg. For other than SOCK_STREAM L2CAP sockets, if a packet from sendmsg() is fragmented, only the first ACL fragment is timestamped. For SOCK_STREAM L2CAP sockets, use the bytestream convention and timestamp the last fragment and count bytes in tskey. Timestamps are not generated in the Enhanced Retransmission mode, as meaning of COMPLETION stamp is unclear if L2CAP layer retransmits. Signed-off-by: Pauli Virtanen <pav@iki.fi> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25Bluetooth: ISO: add TX timestampingPauli Virtanen
Add BT_SCM_ERROR socket CMSG type. Support TX timestamping in ISO sockets. Support MSG_ERRQUEUE in ISO recvmsg. If a packet from sendmsg() is fragmented, only the first ACL fragment is timestamped. Signed-off-by: Pauli Virtanen <pav@iki.fi> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25Bluetooth: add support for skb TX SND/COMPLETION timestampingPauli Virtanen
Support enabling TX timestamping for some skbs, and track them until packet completion. Generate software SCM_TSTAMP_COMPLETION when getting completion report from hardware. Generate software SCM_TSTAMP_SND before sending to driver. Sending from driver requires changes in the driver API, and drivers mostly are going to send the skb immediately. Make the default situation with no COMPLETION TX timestamping more efficient by only counting packets in the queue when there is nothing to track. When there is something to track, we need to make clones, since the driver may modify sent skbs. The tx_q queue length is bounded by the hdev flow control, which will not send new packets before it has got completion reports for old ones. Signed-off-by: Pauli Virtanen <pav@iki.fi> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25net-timestamp: COMPLETION timestamp on packet tx completionPauli Virtanen
Add SOF_TIMESTAMPING_TX_COMPLETION, for requesting a software timestamp when hardware reports a packet completed. Completion tstamp is useful for Bluetooth, as hardware timestamps do not exist in the HCI specification except for ISO packets, and the hardware has a queue where packets may wait. In this case the software SND timestamp only reflects the kernel-side part of the total latency (usually small) and queue length (usually 0 unless HW buffers congested), whereas the completion report time is more informative of the true latency. It may also be useful in other cases where HW TX timestamps cannot be obtained and user wants to estimate an upper bound to when the TX probably happened. Signed-off-by: Pauli Virtanen <pav@iki.fi> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25Bluetooth: HCI: Add definition of hci_rp_remote_name_req_cancelWentao Guan
Return Parameters is not only status, also bdaddr: BLUETOOTH CORE SPECIFICATION Version 5.4 | Vol 4, Part E page 1870: BLUETOOTH CORE SPECIFICATION Version 5.0 | Vol 2, Part E page 802: Return parameters: Status: Size: 1 octet BD_ADDR: Size: 6 octets Note that it also fixes the warning: "Bluetooth: hci0: unexpected cc 0x041a length: 7 > 1" Fixes: c8992cffbe741 ("Bluetooth: hci_event: Use of a function table to handle Command Complete") Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25Bluetooth: hci_core: Enable buffer flow control for SCO/eSCOLuiz Augusto von Dentz
This enables buffer flow control for SCO/eSCO (see: Bluetooth Core 6.0 spec: 6.22. Synchronous Flow Control Enable), recently this has caused the following problem and is actually a nice addition for the likes of Socket TX complete: < HCI Command: Read Buffer Size (0x04|0x0005) plen 0 > HCI Event: Command Complete (0x0e) plen 11 Read Buffer Size (0x04|0x0005) ncmd 1 Status: Success (0x00) ACL MTU: 1021 ACL max packet: 5 SCO MTU: 240 SCO max packet: 8 ... < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 < SCO Data TX: Handle 257 flags 0x00 dlen 120 > HCI Event: Hardware Error (0x10) plen 1 Code: 0x0a To fix the code will now attempt to enable buffer flow control when HCI_QUIRK_SYNC_FLOWCTL_SUPPORTED is set by the driver: < HCI Command: Write Sync Fl.. (0x03|0x002f) plen 1 Flow control: Enabled (0x01) > HCI Event: Command Complete (0x0e) plen 4 Write Sync Flow Control Enable (0x03|0x002f) ncmd 1 Status: Success (0x00) On success then HCI_SCO_FLOWCTL would be set which indicates sco_cnt shall be used for flow contro. Fixes: 7fedd3bb6b77 ("Bluetooth: Prioritize SCO traffic") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Pauli Virtanen <pav@iki.fi>
2025-03-25Bluetooth: Add quirk for broken READ_PAGE_SCAN_TYPEPedro Nishiyama
Some fake controllers cannot be initialized because they return a smaller report than expected for READ_PAGE_SCAN_TYPE. Signed-off-by: Pedro Nishiyama <nishiyama.pedro@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25Bluetooth: Add quirk for broken READ_VOICE_SETTINGPedro Nishiyama
Some fake controllers cannot be initialized because they return a smaller report than expected for READ_VOICE_SETTING. Signed-off-by: Pedro Nishiyama <nishiyama.pedro@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25Bluetooth: L2CAP: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies() for readability. Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25Bluetooth: MGMT: Remove unused mgmt_*_discovery_completeDr. David Alan Gilbert
mgmt_start_discovery_complete() and mgmt_stop_discovery_complete() last uses were removed in 2022 by commit ec2904c259c5 ("Bluetooth: Remove dead code from hci_request.c") Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2025-03-25virtio_net: Split struct virtio_net_rss_configAkihiko Odaki
struct virtio_net_rss_config was less useful in actual code because of a flexible array placed in the middle. Add new structures that split it into two to avoid having a flexible array in the middle. Suggested-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Lei Yang <leiyang@redhat.com> Link: https://patch.msgid.link/20250321-virtio-v2-1-33afb8f4640b@daynix.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge tag 'irq-msi-2025-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull MSI irq updates from Thomas Gleixner: - Switch the MSI descriptor locking to guards - Replace the broken PCI/TPH implementation, which lacks any form of serialization against concurrent modifications with a properly serialized mechanism in the PCI/MSI core code - Replace the MSI descriptor abuse in the SCSI/UFS Qualcom driver with dedicated driver internal storage * tag 'irq-msi-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/msi: Rename msi_[un]lock_descs() scsi: ufs: qcom: Remove the MSI descriptor abuse PCI/TPH: Replace the broken MSI-X control word update PCI/MSI: Provide a sane mechanism for TPH PCI: hv: Switch MSI descriptor locking to guard() PCI/MSI: Switch to MSI descriptor locking to guard() NTB/msi: Switch MSI descriptor locking to lock guard() soc: ti: ti_sci_inta_msi: Switch MSI descriptor locking to guard() genirq/msi: Use lock guards for MSI descriptor locking cleanup: Provide retain_ptr() genirq/msi: Make a few functions static
2025-03-25Revert "udp_tunnel: GRO optimizations"Jakub Kicinski
Revert "udp_tunnel: use static call for GRO hooks when possible" This reverts commit 311b36574ceaccfa3f91b74054a09cd4bb877702. Revert "udp_tunnel: create a fastpath GRO lookup." This reverts commit 8d4880db378350f8ed8969feea13bdc164564fc1. There are multiple small issues with the series. In the interest of unblocking the merge window let's opt for a revert. Link: https://lore.kernel.org/cover.1742557254.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge tag 'irq-core-2025-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "A small set of core changes for the interrupt subsystem: - Expose the MSI message in the existing debug filesystem dump. That's useful for validation and debugging. - Small cleanups" * tag 'irq-core-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Make a few functions static irqdomain: Remove extern from function declarations genirq/msi: Expose MSI message data in debugfs
2025-03-25Merge tag 'ipsec-next-2025-03-24' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2025-03-24 1) Prevent setting high order sequence number bits input in non-ESN mode. From Leon Romanovsky. 2) Support PMTU handling in tunnel mode for packet offload. From Leon Romanovsky. 3) Make xfrm_state_lookup_byaddr lockless. From Florian Westphal. 4) Remove unnecessary NULL check in xfrm_lookup_with_ifid(). From Dan Carpenter. * tag 'ipsec-next-2025-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: Remove unnecessary NULL check in xfrm_lookup_with_ifid() xfrm: state: make xfrm_state_lookup_byaddr lockless xfrm: check for PMTU in tunnel mode for packet offload xfrm: provide common xdo_dev_offload_ok callback implementation xfrm: rely on XFRM offload xfrm: simplify SA initialization routine xfrm: delay initialization of offload path till its actually requested xfrm: prevent high SEQ input in non-ESN mode ==================== Link: https://patch.msgid.link/20250324061855.4116819-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge tag 'nf-next-25-03-23' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next: 1) Use kvmalloc in xt_hashlimit, from Denis Kirjanov. 2) Tighten nf_conntrack sysctl accepted values for nf_conntrack_max and nf_ct_expect_max, from Nicolas Bouchinet. 3) Avoid lookup in nft_fib if socket is available, from Florian Westphal. 4) Initialize struct lsm_context in nfnetlink_queue to avoid hypothetical ENOMEM errors, Chenyuan Yang. 5) Use strscpy() instead of _pad when initializing xtables table name, kzalloc is already used to initialized the table memory area. From Thorsten Blum. 6) Missing socket lookup by conntrack information for IPv6 traffic in nft_socket, there is a similar chunk in IPv4, this was never added when IPv6 NAT was introduced. From Maxim Mikityanskiy. 7) Fix clang issues with nf_tables CONFIG_MITIGATION_RETPOLINE, from WangYuli. * tag 'nf-next-25-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nf_tables: Only use nf_skip_indirect_calls() when MITIGATION_RETPOLINE netfilter: socket: Lookup orig tuple for IPv6 SNAT netfilter: xtables: Use strscpy() instead of strscpy_pad() netfilter: nfnetlink_queue: Initialize ctx to avoid memory allocation error netfilter: fib: avoid lookup if socket is available netfilter: conntrack: Bound nf_conntrack sysctl writes netfilter: xt_hashlimit: replace vmalloc calls with kvmalloc ==================== Link: https://patch.msgid.link/20250323100922.59983-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: rfs: hash function changeEric Dumazet
RFS is using two kinds of hash tables. First one is controlled by /proc/sys/net/core/rps_sock_flow_entries = 2^N and using the N low order bits of the l4 hash is good enough. Then each RX queue has its own hash table, controlled by /sys/class/net/eth1/queues/rx-$q/rps_flow_cnt = 2^X Current hash function, using the X low order bits is suboptimal, because RSS is usually using Func(hash) = (hash % power_of_two); For example, with 32 RX queues, 6 low order bits have no entropy for a given queue. Switch this hash function to hash_32(hash, log) to increase chances to use all possible slots and reduce collisions. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250321171309.634100-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge tag 'wireless-next-2025-03-20' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Johannes Berg says: ==================== More features for 6.15, major changes: * cfg80211/mac80211: fix and enable link reconfiguration * rtw88: support RTL8814AE/RTL8814AU * mt7996: preparations for MLO * ath12k: continued work on MLO * iwlwifi: add new iwlmld sub-driver/op-mode for some current and future devices * wfx: wowlan support * tag 'wireless-next-2025-03-20' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (311 commits) wifi: mt76: mt7996: fix locking in mt7996_mac_sta_rc_work() wifi: mt76: mt76x2u: add TP-Link TL-WDN6200 ID to device table wifi: mt76: mt792x: re-register CHANCTX_STA_CSA only for the mt7921 series wifi: mt76: mt7996: Update mt7996_tx to MLO support wifi: mt76: mt7996: rework mt7996_ampdu_action to support MLO wifi: mt76: mt7996: rework set/get_tsf callabcks to support MLO wifi: mt76: mt7996: set vif default link_id adding/removing vif links wifi: mt76: mt7996: rework mt7996_mcu_beacon_inband_discov to support MLO wifi: mt76: mt7996: rework mt7996_mcu_add_obss_spr to support MLO wifi: mt76: mt7996: rework mt7996_net_fill_forward_path to support MLO wifi: mt76: mt7996: rework mt7996_update_mu_group to support MLO wifi: mt76: mt7996: rework mt7996_mac_sta_poll to support MLO wifi: mt76: mt7996: rework mt7996_mac_sta_rc_work to support MLO wifi: mt76: mt7996: remove mt7996_mac_enable_rtscts() wifi: mt76: mt7996: rework mt7996_sta_hw_queue_read to support MLO wifi: mt76: mt7996: rework mt7996_set_hw_key to support MLO wifi: mt76: mt7996: Add mt7996_sta_link to mt7996_mcu_add_bss_info signature wifi: mt76: mt7996: rework mt7996_sta_set_4addr and mt7996_sta_set_decap_offload to support MLO wifi: mt76: mt7996: rework mt7996_rx_get_wcid to support MLO wifi: mt76: mt7996: Rely on wcid_to_sta in mt7996_mac_add_txs_skb() ... ==================== Link: https://patch.msgid.link/20250320131106.33266-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25bonding: check xdp prog when set bond modeWang Liang
Following operations can trigger a warning[1]: ip netns add ns1 ip netns exec ns1 ip link add bond0 type bond mode balance-rr ip netns exec ns1 ip link set dev bond0 xdp obj af_xdp_kern.o sec xdp ip netns exec ns1 ip link set bond0 type bond mode broadcast ip netns del ns1 When delete the namespace, dev_xdp_uninstall() is called to remove xdp program on bond dev, and bond_xdp_set() will check the bond mode. If bond mode is changed after attaching xdp program, the warning may occur. Some bond modes (broadcast, etc.) do not support native xdp. Set bond mode with xdp program attached is not good. Add check for xdp program when set bond mode. [1] ------------[ cut here ]------------ WARNING: CPU: 0 PID: 11 at net/core/dev.c:9912 unregister_netdevice_many_notify+0x8d9/0x930 Modules linked in: CPU: 0 UID: 0 PID: 11 Comm: kworker/u4:0 Not tainted 6.14.0-rc4 #107 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 Workqueue: netns cleanup_net RIP: 0010:unregister_netdevice_many_notify+0x8d9/0x930 Code: 00 00 48 c7 c6 6f e3 a2 82 48 c7 c7 d0 b3 96 82 e8 9c 10 3e ... RSP: 0018:ffffc90000063d80 EFLAGS: 00000282 RAX: 00000000ffffffa1 RBX: ffff888004959000 RCX: 00000000ffffdfff RDX: 0000000000000000 RSI: 00000000ffffffea RDI: ffffc90000063b48 RBP: ffffc90000063e28 R08: ffffffff82d39b28 R09: 0000000000009ffb R10: 0000000000000175 R11: ffffffff82d09b40 R12: ffff8880049598e8 R13: 0000000000000001 R14: dead000000000100 R15: ffffc90000045000 FS: 0000000000000000(0000) GS:ffff888007a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000d406b60 CR3: 000000000483e000 CR4: 00000000000006f0 Call Trace: <TASK> ? __warn+0x83/0x130 ? unregister_netdevice_many_notify+0x8d9/0x930 ? report_bug+0x18e/0x1a0 ? handle_bug+0x54/0x90 ? exc_invalid_op+0x18/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? unregister_netdevice_many_notify+0x8d9/0x930 ? bond_net_exit_batch_rtnl+0x5c/0x90 cleanup_net+0x237/0x3d0 process_one_work+0x163/0x390 worker_thread+0x293/0x3b0 ? __pfx_worker_thread+0x10/0x10 kthread+0xec/0x1e0 ? __pfx_kthread+0x10/0x10 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> ---[ end trace 0000000000000000 ]--- Fixes: 9e2ee5c7e7c3 ("net, bonding: Add XDP support to the bonding driver") Signed-off-by: Wang Liang <wangliang74@huawei.com> Acked-by: Jussi Maki <joamaki@gmail.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250321044852.1086551-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: phylink: add functions to block/unblock rx clock stopRussell King (Oracle)
Some MACs require the PHY receive clock to be running to complete setup actions. This may fail if the PHY has negotiated EEE, the MAC supports receive clock stop, and the link has entered LPI state. Provide a pair of APIs that MAC drivers can use to temporarily block the PHY disabling the receive clock. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1tvO6k-008Vjt-MZ@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: phylink: add phylink_prepare_resume()Russell King (Oracle)
When the system is suspended, the PHY may be placed in low-power mode by setting the BMCR 0.11 Power down bit. IEEE 802.3 states that the behaviour of the PHY in this state is implementation specific, and the PHY is not required to meet the RX_CLK and TX_CLK requirements. Essentially, this means that a PHY may stop the clocks that it is generating while in power down state. However, MACs exist which require the clocks from the PHY to be running in order to properly resume. phylink_prepare_resume() provides them with a way to clear the Power down bit early. Note, however, that IEEE 802.3 gives PHYs up to 500ms grace before the transmit and receive clocks meet the requirements after clearing the power down bit. Add a resume preparation function, which will ensure that the receive clock from the PHY is appropriately configured while resuming. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/E1tvO6V-008Vjb-AP@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25net: reorganize IP MIB values (II)Eric Dumazet
Commit 14a196807482 ("net: reorganize IP MIB values") changed MIB values to group hot fields together. Since then 5 new fields have been added without caring about data locality. This patch moves IPSTATS_MIB_OUTPKTS, IPSTATS_MIB_NOECTPKTS, IPSTATS_MIB_ECT1PKTS, IPSTATS_MIB_ECT0PKTS, IPSTATS_MIB_CEPKTS to the hot portion of per-cpu data. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250320101434.3174412-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25tcp: avoid atomic operations on sk->sk_rmem_allocEric Dumazet
TCP uses generic skb_set_owner_r() and sock_rfree() for received packets, with socket lock being owned. Switch to private versions, avoiding two atomic operations per packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250320121604.3342831-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25ipv6: fix _DEVADD() and _DEVUPD() macrosEric Dumazet
ip6_rcv_core() is using: __IP6_ADD_STATS(net, idev, IPSTATS_MIB_NOECTPKTS + (ipv6_get_dsfield(hdr) & INET_ECN_MASK), max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs)); This is currently evaluating both expressions twice. Fix _DEVADD() and _DEVUPD() macros to evaluate their arguments once. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250319212516.2385451-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasidYi Liu
This extends the VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT ioctls to attach/detach a given pasid of a vfio device to/from an IOAS/HWPT. Link: https://patch.msgid.link/r/20250321180143.8468-4-yi.l.liu@intel.com Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-03-25vfio-iommufd: Support pasid [at|de]tach for physical VFIO devicesYi Liu
This adds pasid_at|de]tach_ioas ops for attaching hwpt to pasid of a device and the helpers for it. For now, only vfio-pci supports pasid attach/detach. Link: https://patch.msgid.link/r/20250321180143.8468-3-yi.l.liu@intel.com Signed-off-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-03-25ida: Add ida_find_first_range()Yi Liu
There is no helpers for user to check if a given ID is allocated or not, neither a helper to loop all the allocated IDs in an IDA and do something for cleanup. With the two needs, a helper to get the lowest allocated ID of a range and two variants based on it. Caller can check if a given ID is allocated or not by: bool ida_exists(struct ida *ida, unsigned int id) Caller can iterate all allocated IDs by: int id; while ((id = ida_find_first(&pasid_ida)) >= 0) { //anything to do with the allocated ID ida_free(pasid_ida, pasid); } Link: https://patch.msgid.link/r/20250321180143.8468-2-yi.l.liu@intel.com Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-03-25iommufd: Allow allocating PASID-compatible domainYi Liu
The underlying infrastructure has supported the PASID attach and related enforcement per the requirement of the IOMMU_HWPT_ALLOC_PASID flag. This extends iommufd to support PASID compatible domain requested by userspace. Link: https://patch.msgid.link/r/20250321171940.7213-15-yi.l.liu@intel.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-03-25iommufd: Support pasid attach/replaceYi Liu
This extends the below APIs to support PASID. Device drivers to manage pasid attach/replace/detach. int iommufd_device_attach(struct iommufd_device *idev, ioasid_t pasid, u32 *pt_id); int iommufd_device_replace(struct iommufd_device *idev, ioasid_t pasid, u32 *pt_id); void iommufd_device_detach(struct iommufd_device *idev, ioasid_t pasid); The pasid operations share underlying attach/replace/detach infrastructure with the device operations, but still have some different implications: - no reserved region per pasid otherwise SVA architecture is already broken (CPU address space doesn't count device reserved regions); - accordingly no sw_msi trick; Cache coherency enforcement is still applied to pasid operations since it is about memory accesses post page table walking (no matter the walk is per RID or per PASID). Link: https://patch.msgid.link/r/20250321171940.7213-12-yi.l.liu@intel.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-03-25iommu: Drop sw_msi from iommu_domainNicolin Chen
There are only two sw_msi implementations in the entire system, thus it's not very necessary to have an sw_msi pointer. Instead, check domain->cookie_type to call the two sw_msi implementations directly from the core code. Link: https://patch.msgid.link/r/7ded87c871afcbaac665b71354de0a335087bf0f.1742871535.git.nicolinc@nvidia.com Suggested-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-03-25iommu: Sort out domain user dataRobin Murphy
When DMA/MSI cookies were made first-class citizens back in commit 46983fcd67ac ("iommu: Pull IOVA cookie management into the core"), there was no real need to further expose the two different cookie types. However, now that IOMMUFD wants to add a third type of MSI-mapping cookie, we do have a nicely compelling reason to properly dismabiguate things at the domain level beyond just vaguely guessing from the domain type. Meanwhile, we also effectively have another "cookie" in the form of the anonymous union for other user data, which isn't much better in terms of being vague and unenforced. The fact is that all these cookie types are mutually exclusive, in the sense that combining them makes zero sense and/or would be catastrophic (iommu_set_fault_handler() on an SVA domain, anyone?) - the only combination which *might* be reasonable is perhaps a fault handler and an MSI cookie, but nobody's doing that at the moment, so let's rule it out as well for the sake of being clear and robust. To that end, we pull DMA and MSI cookies apart a little more, mostly to clear up the ambiguity at domain teardown, then for clarity (and to save a little space), move them into the union, whose ownership we can then properly describe and enforce entirely unambiguously. [nicolinc: rebase on latest tree; use prefix IOMMU_COOKIE_; merge unions in iommu_domain; add IOMMU_COOKIE_IOMMUFD for iommufd_hwpt] Link: https://patch.msgid.link/r/1ace9076c95204bbe193ee77499d395f15f44b23.1742871535.git.nicolinc@nvidia.com Signed-off-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-03-25Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-03-18 (ice, idpf) For ice: Przemek modifies string declarations to resolve compile issues on gcc 7.5. Karol adds padding to initial programming of GLTSYN_TIME* registers to ensure it will occur in the future to prevent hardware issues. Jesse Brandeburg turns off driver RDMA capability when the corresponding kernel config is not enabled to aid in preventing resource exhaustion. Jan adjusts type declaration to properly catch error conditions and prevent truncation of values. He also adds bounds checking to prevent overflow in ice_vc_cfg_q_quanta(). Lukasz adds checking and error reporting for invalid values in ice_vc_cfg_q_bw(). Mateusz adds check for valid size for ice_vc_fdir_parse_raw(). For idpf: Emil adds check, and handling, on failure to register netdev. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: idpf: check error for register_netdev() on init ice: fix using untrusted value of pkt_len in ice_vc_fdir_parse_raw() ice: fix input validation for virtchnl BW ice: validate queue quanta parameters to prevent OOB access ice: stop truncating queue ids when checking virtchnl: make proto and filter action count unsigned ice: fix reservation of resources for RDMA when disabled ice: ensure periodic output start time is in the future ice: health.c: fix compilation on gcc 7.5 ==================== Link: https://patch.msgid.link/20250318200511.2958251-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25Merge tag 'i2c-host-6.15' of ↵Wolfram Sang
git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-mergewindow i2c-host updates for v6.15 Refactoring and cleanups - octeon, cadence, i801, pasemi, mlxbf, bcm-iproc: general refactorings - octeon: remove 10-bit address support Improvements - amd-asf: improved error handling - designware: use guard(mutex) - amd-asf, designware: update naming to follow latest specs - cadence: fix cleanup path in probe - i801: use MMIO and I/O mapping helpers to access registers - pxa: handle error after clk_prepare_enable New features - added i2c_10bit_addr_*_from_msg() and updated multiple drivers - omap: added multiplexer state handling - qcom-geni: update frequency configuration - qup: introduce DMA usage policy New hardware support - exynos: add support for Samsung exynos7870 - k1: add support for spacemit k1 (new driver) - imx: add support for i.mx94 lpi2c - rk3x: add support for rk3562 Multiplexers - ltc4306, reg: fix assignment in platform_driver structure
2025-03-25af_unix: Explicitly include headers for non-pointer struct fields.Kuniyuki Iwashima
include/net/af_unix.h indirectly includes some definitions for structs. Let's include such headers explicitly. linux/atomic.h : scm_stat.nr_fds linux/net.h : unix_sock.peer_wq linux/path.h : unix_sock.path linux/spinlock.h : unix_sock.lock linux/wait.h : unix_sock.peer_wake uapi/linux/un.h : unix_address.name[] linux/socket.h is removed as the structs there are not used directly, and linux/un.h is clarified with uapi as un.h only exists under include/uapi. While at it, duplicate headers are removed from .c files. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250318034934.86708-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25af_unix: Move internal definitions to net/unix/.Kuniyuki Iwashima
net/af_unix.h is included by core and some LSMs, but most definitions need not be. Let's move struct unix_{vertex,edge} to net/unix/garbage.c and other definitions to net/unix/af_unix.h. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250318034934.86708-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25af_unix: Sort headers.Kuniyuki Iwashima
This is a prep patch to make the following changes cleaner. No functional change intended. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250318034934.86708-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25tcp: support TCP_DELACK_MAX_US for set/getsockopt useJason Xing
Support adjusting/reading delayed ack max for socket level by using set/getsockopt(). This option aligns with TCP_BPF_DELACK_MAX usage. Considering that bpf option was implemented before this patch, so we need to use a standalone new option for pure tcp set/getsockopt() use. Add WRITE_ONCE/READ_ONCE() to prevent data-race if setsockopt() happens to write one value to icsk_delack_max while icsk_delack_max is being read. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250317120314.41404-3-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25tcp: support TCP_RTO_MIN_US for set/getsockopt useJason Xing
Support adjusting/reading RTO MIN for socket level by using set/getsockopt(). This new option has the same effect as TCP_BPF_RTO_MIN, which means it doesn't affect RTAX_RTO_MIN usage (by using ip route...). Considering that bpf option was implemented before this patch, so we need to use a standalone new option for pure tcp set/getsockopt() use. When the socket is created, its icsk_rto_min is set to the default value that is controlled by sysctl_tcp_rto_min_us. Then if application calls setsockopt() with TCP_RTO_MIN_US flag to pass a valid value, then icsk_rto_min will be overridden in jiffies unit. This patch adds WRITE_ONCE/READ_ONCE to avoid data-race around icsk_rto_min. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250317120314.41404-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25objtool: Fix up some outdated references to ENTRY/ENDPROCJosh Poimboeuf
ENTRY and ENDPROC were deprecated years ago and replaced with SYM_FUNC_{START,END}. Fix up a few outdated references in the objtool documentation and comments. Also fix a few typos. Suggested-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/5eb7e06e1a0e87aaeda8d583ab060e7638a6ea8e.1742852846.git.jpoimboe@kernel.org
2025-03-24Merge tag 'x86-fpu-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/fpu updates from Ingo Molnar: - Improve crypto performance by making kernel-mode FPU reliably usable in softirqs ((Eric Biggers) - Fully optimize out WARN_ON_FPU() (Eric Biggers) - Initial steps to support Support Intel APX (Advanced Performance Extensions) (Chang S. Bae) - Fix KASAN for arch_dup_task_struct() (Benjamin Berg) - Refine and simplify the FPU magic number check during signal return (Chang S. Bae) - Fix inconsistencies in guest FPU xfeatures (Chao Gao, Stanislav Spassov) - selftests/x86/xstate: Introduce common code for testing extended states (Chang S. Bae) - Misc fixes and cleanups (Borislav Petkov, Colin Ian King, Uros Bizjak) * tag 'x86-fpu-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/fpu/xstate: Fix inconsistencies in guest FPU xfeatures x86/fpu: Clarify the "xa" symbolic name used in the XSTATE* macros x86/fpu: Use XSAVE{,OPT,C,S} and XRSTOR{,S} mnemonics in xstate.h x86/fpu: Improve crypto performance by making kernel-mode FPU reliably usable in softirqs x86/fpu/xstate: Simplify print_xstate_features() x86/fpu: Refine and simplify the magic number check during signal return selftests/x86/xstate: Fix spelling mistake "hader" -> "header" x86/fpu: Avoid copying dynamic FP state from init_task in arch_dup_task_struct() vmlinux.lds.h: Remove entry to place init_task onto init_stack selftests/x86/avx: Add AVX tests selftests/x86/xstate: Clarify supported xstates selftests/x86/xstate: Consolidate test invocations into a single entry selftests/x86/xstate: Introduce signal ABI test selftests/x86/xstate: Refactor ptrace ABI test selftests/x86/xstate: Refactor context switching test selftests/x86/xstate: Enumerate and name xstate components selftests/x86/xstate: Refactor XSAVE helpers for general use selftests/x86: Consolidate redundant signal helper functions x86/fpu: Fix guest FPU state buffer allocation size x86/fpu: Fully optimize out WARN_ON_FPU()
2025-03-24Merge tag 'x86-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: "x86 CPU features support: - Generate the <asm/cpufeaturemasks.h> header based on build config (H. Peter Anvin, Xin Li) - x86 CPUID parsing updates and fixes (Ahmed S. Darwish) - Introduce the 'setcpuid=' boot parameter (Brendan Jackman) - Enable modifying CPU bug flags with '{clear,set}puid=' (Brendan Jackman) - Utilize CPU-type for CPU matching (Pawan Gupta) - Warn about unmet CPU feature dependencies (Sohil Mehta) - Prepare for new Intel Family numbers (Sohil Mehta) Percpu code: - Standardize & reorganize the x86 percpu layout and related cleanups (Brian Gerst) - Convert the stackprotector canary to a regular percpu variable (Brian Gerst) - Add a percpu subsection for cache hot data (Brian Gerst) - Unify __pcpu_op{1,2}_N() macros to __pcpu_op_N() (Uros Bizjak) - Construct __percpu_seg_override from __percpu_seg (Uros Bizjak) MM: - Add support for broadcast TLB invalidation using AMD's INVLPGB instruction (Rik van Riel) - Rework ROX cache to avoid writable copy (Mike Rapoport) - PAT: restore large ROX pages after fragmentation (Kirill A. Shutemov, Mike Rapoport) - Make memremap(MEMREMAP_WB) map memory as encrypted by default (Kirill A. Shutemov) - Robustify page table initialization (Kirill A. Shutemov) - Fix flush_tlb_range() when used for zapping normal PMDs (Jann Horn) - Clear _PAGE_DIRTY for kernel mappings when we clear _PAGE_RW (Matthew Wilcox) KASLR: - x86/kaslr: Reduce KASLR entropy on most x86 systems, to support PCI BAR space beyond the 10TiB region (CONFIG_PCI_P2PDMA=y) (Balbir Singh) CPU bugs: - Implement FineIBT-BHI mitigation (Peter Zijlstra) - speculation: Simplify and make CALL_NOSPEC consistent (Pawan Gupta) - speculation: Add a conditional CS prefix to CALL_NOSPEC (Pawan Gupta) - RFDS: Exclude P-only parts from the RFDS affected list (Pawan Gupta) System calls: - Break up entry/common.c (Brian Gerst) - Move sysctls into arch/x86 (Joel Granados) Intel LAM support updates: (Maciej Wieczor-Retman) - selftests/lam: Move cpu_has_la57() to use cpuinfo flag - selftests/lam: Skip test if LAM is disabled - selftests/lam: Test get_user() LAM pointer handling AMD SMN access updates: - Add SMN offsets to exclusive region access (Mario Limonciello) - Add support for debugfs access to SMN registers (Mario Limonciello) - Have HSMP use SMN through AMD_NODE (Yazen Ghannam) Power management updates: (Patryk Wlazlyn) - Allow calling mwait_play_dead with an arbitrary hint - ACPI/processor_idle: Add FFH state handling - intel_idle: Provide the default enter_dead() handler - Eliminate mwait_play_dead_cpuid_hint() Build system: - Raise the minimum GCC version to 8.1 (Brian Gerst) - Raise the minimum LLVM version to 15.0.0 (Nathan Chancellor) Kconfig: (Arnd Bergmann) - Add cmpxchg8b support back to Geode CPUs - Drop 32-bit "bigsmp" machine support - Rework CONFIG_GENERIC_CPU compiler flags - Drop configuration options for early 64-bit CPUs - Remove CONFIG_HIGHMEM64G support - Drop CONFIG_SWIOTLB for PAE - Drop support for CONFIG_HIGHPTE - Document CONFIG_X86_INTEL_MID as 64-bit-only - Remove old STA2x11 support - Only allow CONFIG_EISA for 32-bit Headers: - Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI and non-UAPI headers (Thomas Huth) Assembly code & machine code patching: - x86/alternatives: Simplify alternative_call() interface (Josh Poimboeuf) - x86/alternatives: Simplify callthunk patching (Peter Zijlstra) - KVM: VMX: Use named operands in inline asm (Josh Poimboeuf) - x86/hyperv: Use named operands in inline asm (Josh Poimboeuf) - x86/traps: Cleanup and robustify decode_bug() (Peter Zijlstra) - x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h> (Uros Bizjak) - Use named operands in inline asm (Uros Bizjak) - Improve performance by using asm_inline() for atomic locking instructions (Uros Bizjak) Earlyprintk: - Harden early_serial (Peter Zijlstra) NMI handler: - Add an emergency handler in nmi_desc & use it in nmi_shootdown_cpus() (Waiman Long) Miscellaneous fixes and cleanups: - by Ahmed S. Darwish, Andy Shevchenko, Ard Biesheuvel, Artem Bityutskiy, Borislav Petkov, Brendan Jackman, Brian Gerst, Dan Carpenter, Dr. David Alan Gilbert, H. Peter Anvin, Ingo Molnar, Josh Poimboeuf, Kevin Brodsky, Mike Rapoport, Lukas Bulwahn, Maciej Wieczor-Retman, Max Grobecker, Patryk Wlazlyn, Pawan Gupta, Peter Zijlstra, Philip Redkin, Qasim Ijaz, Rik van Riel, Thomas Gleixner, Thorsten Blum, Tom Lendacky, Tony Luck, Uros Bizjak, Vitaly Kuznetsov, Xin Li, liuye" * tag 'x86-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (211 commits) zstd: Increase DYNAMIC_BMI2 GCC version cutoff from 4.8 to 11.0 to work around compiler segfault x86/asm: Make asm export of __ref_stack_chk_guard unconditional x86/mm: Only do broadcast flush from reclaim if pages were unmapped perf/x86/intel, x86/cpu: Replace Pentium 4 model checks with VFM ones perf/x86/intel, x86/cpu: Simplify Intel PMU initialization x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in non-UAPI headers x86/headers: Replace __ASSEMBLY__ with __ASSEMBLER__ in UAPI headers x86/locking/atomic: Improve performance by using asm_inline() for atomic locking instructions x86/asm: Use asm_inline() instead of asm() in clwb() x86/asm: Use CLFLUSHOPT and CLWB mnemonics in <asm/special_insns.h> x86/hweight: Use asm_inline() instead of asm() x86/hweight: Use ASM_CALL_CONSTRAINT in inline asm() x86/hweight: Use named operands in inline asm() x86/stackprotector/64: Only export __ref_stack_chk_guard on CONFIG_SMP x86/head/64: Avoid Clang < 17 stack protector in startup code x86/kexec: Merge x86_32 and x86_64 code using macros from <asm/asm.h> x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro x86/cpu/intel: Limit the non-architectural constant_tsc model checks x86/mm/pat: Replace Intel x86_model checks with VFM ones x86/cpu/intel: Fix fast string initialization for extended Families ...
2025-03-24Merge tag 'perf-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Core: - Move perf_event sysctls into kernel/events/ (Joel Granados) - Use POLLHUP for pinned events in error (Namhyung Kim) - Avoid the read if the count is already updated (Peter Zijlstra) - Allow the EPOLLRDNORM flag for poll (Tao Chen) - locking/percpu-rwsem: Add guard support [ NOTE: this got (mis-)merged into the perf tree due to related work ] (Peter Zijlstra) perf_pmu_unregister() related improvements: (Peter Zijlstra) - Simplify the perf_event_alloc() error path - Simplify the perf_pmu_register() error path - Simplify perf_pmu_register() - Simplify perf_init_event() - Simplify perf_event_alloc() - Merge struct pmu::pmu_disable_count into struct perf_cpu_pmu_context::pmu_disable_count - Add this_cpc() helper - Introduce perf_free_addr_filters() - Robustify perf_event_free_bpf_prog() - Simplify the perf_mmap() control flow - Further simplify perf_mmap() - Remove retry loop from perf_mmap() - Lift event->mmap_mutex in perf_mmap() - Detach 'struct perf_cpu_pmu_context' and 'struct pmu' lifetimes - Fix perf_mmap() failure path Uprobes: - Harden x86 uretprobe syscall trampoline check (Jiri Olsa) - Remove redundant spinlock in uprobe_deny_signal() (Liao Chang) - Remove the spinlock within handle_singlestep() (Liao Chang) x86 Intel PMU enhancements: - Support PEBS counters snapshotting (Kan Liang) - Fix intel_pmu_read_event() (Kan Liang) - Extend per event callchain limit to branch stack (Kan Liang) - Fix system-wide LBR profiling (Kan Liang) - Allocate bts_ctx only if necessary (Li RongQing) - Apply static call for drain_pebs (Peter Zijlstra) x86 AMD PMU enhancements: (Ravi Bangoria) - Remove pointless sample period check - Fix ->config to sample period calculation for OP PMU - Fix perf_ibs_op.cnt_mask for CurCnt - Don't allow freq mode event creation through ->config interface - Add PMU specific minimum period - Add ->check_period() callback - Ceil sample_period to min_period - Add support for OP Load Latency Filtering - Update DTLB/PageSize decode logic Hardware breakpoints: - Return EOPNOTSUPP for unsupported breakpoint type (Saket Kumar Bhaskar) Hardlockup detector improvements: (Li Huafei) - perf_event memory leak - Warn if watchdog_ev is leaked Fixes and cleanups: - Misc fixes and cleanups (Andy Shevchenko, Kan Liang, Peter Zijlstra, Ravi Bangoria, Thorsten Blum, XieLudan)" * tag 'perf-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits) perf: Fix __percpu annotation perf: Clean up pmu specific data perf/x86: Remove swap_task_ctx() perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode perf: Supply task information to sched_task() perf: attach/detach PMU specific data locking/percpu-rwsem: Add guard support perf: Save PMU specific data in task_struct perf: Extend per event callchain limit to branch stack perf/ring_buffer: Allow the EPOLLRDNORM flag for poll perf/core: Use POLLHUP for pinned events in error perf/core: Use sysfs_emit() instead of scnprintf() perf/core: Remove optional 'size' arguments from strscpy() calls perf/x86/intel/bts: Check if bts_ctx is allocated when calling BTS functions uprobes/x86: Harden uretprobe syscall trampoline check watchdog/hardlockup/perf: Warn if watchdog_ev is leaked watchdog/hardlockup/perf: Fix perf_event memory leak perf/x86: Annotate struct bts_buffer::buf with __counted_by() perf/core: Clean up perf_try_init_event() perf/core: Fix perf_mmap() failure path ...
2025-03-24Merge tag 'sched-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core & fair scheduler changes: - Cancel the slice protection of the idle entity (Zihan Zhou) - Reduce the default slice to avoid tasks getting an extra tick (Zihan Zhou) - Force propagating min_slice of cfs_rq when {en,de}queue tasks (Tianchen Ding) - Refactor can_migrate_task() to elimate looping (I Hsin Cheng) - Add unlikey branch hints to several system calls (Colin Ian King) - Optimize current_clr_polling() on certain architectures (Yujun Dong) Deadline scheduler: (Juri Lelli) - Remove redundant dl_clear_root_domain call - Move dl_rebuild_rd_accounting to cpuset.h Uclamp: - Use the uclamp_is_used() helper instead of open-coding it (Xuewen Yan) - Optimize sched_uclamp_used static key enabling (Xuewen Yan) Scheduler topology support: (Juri Lelli) - Ignore special tasks when rebuilding domains - Add wrappers for sched_domains_mutex - Generalize unique visiting of root domains - Rebuild root domain accounting after every update - Remove partition_and_rebuild_sched_domains - Stop exposing partition_sched_domains_locked RSEQ: (Michael Jeanson) - Update kernel fields in lockstep with CONFIG_DEBUG_RSEQ=y - Fix segfault on registration when rseq_cs is non-zero - selftests: Add rseq syscall errors test - selftests: Ensure the rseq ABI TLS is actually 1024 bytes Membarriers: - Fix redundant load of membarrier_state (Nysal Jan K.A.) Scheduler debugging: - Introduce and use preempt_model_str() (Sebastian Andrzej Siewior) - Make CONFIG_SCHED_DEBUG unconditional (Ingo Molnar) Fixes and cleanups: - Always save/restore x86 TSC sched_clock() on suspend/resume (Guilherme G. Piccoli) - Misc fixes and cleanups (Thorsten Blum, Juri Lelli, Sebastian Andrzej Siewior)" * tag 'sched-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) cpuidle, sched: Use smp_mb__after_atomic() in current_clr_polling() sched/debug: Remove CONFIG_SCHED_DEBUG sched/debug: Remove CONFIG_SCHED_DEBUG from self-test config files sched/debug, Documentation: Remove (most) CONFIG_SCHED_DEBUG references from documentation sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional sched/debug: Make 'const_debug' tunables unconditional __read_mostly sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE() rseq/selftests: Fix namespace collision with rseq UAPI header include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.h sched/topology: Stop exposing partition_sched_domains_locked cgroup/cpuset: Remove partition_and_rebuild_sched_domains sched/topology: Remove redundant dl_clear_root_domain call sched/deadline: Rebuild root domain accounting after every update sched/deadline: Generalize unique visiting of root domains sched/topology: Wrappers for sched_domains_mutex sched/deadline: Ignore special tasks when rebuilding domains tracing: Use preempt_model_str() xtensa: Rely on generic printing of preemption model x86: Rely on generic printing of preemption model s390: Rely on generic printing of preemption model ...
2025-03-24Merge tag 'locking-core-2025-03-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "Locking primitives: - Micro-optimize percpu_{,try_}cmpxchg{64,128}_op() and {,try_}cmpxchg{64,128} on x86 (Uros Bizjak) - mutexes: extend debug checks in mutex_lock() (Yunhui Cui) - Misc cleanups (Uros Bizjak) Lockdep: - Fix might_fault() lockdep check of current->mm->mmap_lock (Peter Zijlstra) - Don't disable interrupts on RT in disable_irq_nosync_lockdep.*() (Sebastian Andrzej Siewior) - Disable KASAN instrumentation of lockdep.c (Waiman Long) - Add kasan_check_byte() check in lock_acquire() (Waiman Long) - Misc cleanups (Sebastian Andrzej Siewior) Rust runtime integration: - Use Pin for all LockClassKey usages (Mitchell Levy) - sync: Add accessor for the lock behind a given guard (Alice Ryhl) - sync: condvar: Add wait_interruptible_freezable() (Alice Ryhl) - sync: lock: Add an example for Guard:: Lock_ref() (Boqun Feng) Split-lock detection feature (x86): - Fix warning mode with disabled mitigation mode (Maksim Davydov) Locking events: - Add locking events for rtmutex slow paths (Waiman Long) - Add locking events for lockdep (Waiman Long)" * tag 'locking-core-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: lockdep: Remove disable_irq_lockdep() lockdep: Don't disable interrupts on RT in disable_irq_nosync_lockdep.*() rust: lockdep: Use Pin for all LockClassKey usages rust: sync: condvar: Add wait_interruptible_freezable() rust: sync: lock: Add an example for Guard:: Lock_ref() rust: sync: Add accessor for the lock behind a given guard locking/lockdep: Add kasan_check_byte() check in lock_acquire() locking/lockdep: Disable KASAN instrumentation of lockdep.c locking/lock_events: Add locking events for lockdep locking/lock_events: Add locking events for rtmutex slow paths x86/split_lock: Fix the delayed detection logic lockdep/mm: Fix might_fault() lockdep check of current->mm->mmap_lock x86/locking: Remove semicolon from "lock" prefix locking/mutex: Add MUTEX_WARN_ON() into fast path x86/locking: Use asm_inline for {,try_}cmpxchg{64,128} emulations x86/locking: Use ALT_OUTPUT_SP() for percpu_{,try_}cmpxchg{64,128}_op()
2025-03-24Merge tag 'rcu-next-v6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Boqun Feng: "Documentation: - Add broken-timing possibility to stallwarn.rst - Improve discussion of this_cpu_ptr(), add raw_cpu_ptr() - Document self-propagating callbacks - Point call_srcu() to call_rcu() for detailed memory ordering - Add CONFIG_RCU_LAZY delays to call_rcu() kernel-doc header - Clarify RCU_LAZY and RCU_LAZY_DEFAULT_OFF help text - Remove references to old grace-period-wait primitives srcu: - Introduce srcu_read_{un,}lock_fast(), which is similar to srcu_read_{un,}lock_lite(): avoid smp_mb()s in lock and unlock at the cost of calling synchronize_rcu() in synchronize_srcu() Moreover, by returning the percpu offset of the counter at srcu_read_lock_fast() time, srcu_read_unlock_fast() can avoid extra pointer dereferencing, which makes it faster than srcu_read_{un,}lock_lite() srcu_read_{un,}lock_fast() are intended to replace rcu_read_{un,}lock_trace() if possible RCU torture: - Add get_torture_init_jiffies() to return the start time of the test - Add a test_boost_holdoff module parameter to allow delaying boosting tests when building rcutorture as built-in - Add grace period sequence number logging at the beginning and end of failure/close-call results - Switch to hexadecimal for the expedited grace period sequence number in the rcu_exp_grace_period trace point - Make cur_ops->format_gp_seqs take buffer length - Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool - Complain when invalid SRCU reader_flavor is specified - Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing, which forces SRCU uses atomics even when percpu ops are NMI safe, and use the Kconfig for SRCU lockdep testing Misc: - Split rcu_report_exp_cpu_mult() mask parameter and use for tracing - Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes() - Fix get_state_synchronize_rcu_full() GP-start detection - Move RCU Tasks self-tests to core_initcall() - Print segment lengths in show_rcu_nocb_gp_state() - Make RCU watch ct_kernel_exit_state() warning - Flush console log from kernel_power_off() - rcutorture: Allow a negative value for nfakewriters - rcu: Update TREE05.boot to test normal synchronize_rcu() - rcu: Use _full() API to debug synchronize_rcu() Make RCU handle PREEMPT_LAZY better: - Fix header guard for rcu_all_qs() - rcu: Rename PREEMPT_AUTO to PREEMPT_LAZY - Update __cond_resched comment about RCU quiescent states - Handle unstable rdp in rcu_read_unlock_strict() - Handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y - osnoise: Provide quiescent states - Adjust rcutorture with possible PREEMPT_RCU=n && PREEMPT_COUNT=y combination - Limit PREEMPT_RCU configurations - Make rcutorture senario TREE07 and senario TREE10 use PREEMPT_LAZY=y" * tag 'rcu-next-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (59 commits) rcutorture: Make scenario TREE07 build CONFIG_PREEMPT_LAZY=y rcutorture: Make scenario TREE10 build CONFIG_PREEMPT_LAZY=y rcu: limit PREEMPT_RCU configurations rcutorture: Update ->extendables check for lazy preemption rcutorture: Update rcutorture_one_extend_check() for lazy preemption osnoise: provide quiescent states rcu: Use _full() API to debug synchronize_rcu() rcu: Update TREE05.boot to test normal synchronize_rcu() rcutorture: Allow a negative value for nfakewriters Flush console log from kernel_power_off() context_tracking: Make RCU watch ct_kernel_exit_state() warning rcu/nocb: Print segment lengths in show_rcu_nocb_gp_state() rcu-tasks: Move RCU Tasks self-tests to core_initcall() rcu: Fix get_state_synchronize_rcu_full() GP-start detection torture: Make SRCU lockdep testing use srcu_read_lock_nmisafe() srcu: Add FORCE_NEED_SRCU_NMI_SAFE Kconfig for testing rcutorture: Complain when invalid SRCU reader_flavor is specified rcutorture: Move RCU_TORTURE_TEST_{CHK_RDR_STATE,LOG_CPU} to bool rcutorture: Make cur_ops->format_gp_seqs take buffer length rcutorture: Add ftrace-compatible timestamp to GP# failure/close-call output ...
2025-03-24Merge tag 'bitmap-for-6.15' of https://github.com/norov/linuxLinus Torvalds
Pull bitmap updates from Yury Norov: - cpumask_next_wrap() rework (me) - GENMASK() simplification (I Hsin) - rust bindings for cpumasks (Viresh and me) - scattered cleanups (Andy, Tamir, Vincent, Ignacio and Joel) * tag 'bitmap-for-6.15' of https://github.com/norov/linux: (22 commits) cpumask: align text in comment riscv: fix test_and_{set,clear}_bit ordering documentation treewide: fix typo 'unsigned __init128' -> 'unsigned __int128' MAINTAINERS: add rust bindings entry for bitmap API rust: Add cpumask helpers uapi: Revert "bitops: avoid integer overflow in GENMASK(_ULL)" cpumask: drop cpumask_next_wrap_old() PCI: hv: Switch hv_compose_multi_msi_req_get_cpu() to using cpumask_next_wrap() scsi: lpfc: rework lpfc_next_{online,present}_cpu() scsi: lpfc: switch lpfc_irq_rebalance() to using cpumask_next_wrap() s390: switch stop_machine_yield() to using cpumask_next_wrap() padata: switch padata_find_next() to using cpumask_next_wrap() cpumask: use cpumask_next_wrap() where appropriate cpumask: re-introduce cpumask_next{,_and}_wrap() cpumask: deprecate cpumask_next_wrap() powerpc/xmon: simplify xmon_batch_next_cpu() ibmvnic: simplify ibmvnic_set_queue_affinity() virtio_net: simplify virtnet_set_affinity() objpool: rework objpool_pop() cpumask: add for_each_{possible,online}_cpu_wrap ...
2025-03-24Merge tag 'docs-6.15' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "It has been a reasonably busy cycle for docs... - Significant changes throughout the tree to bring Python code up to current standards and raise the minimum Python required to 3.9 Much of this is preparatory to replacing the ancient Perl scripts/kernel-doc horror with a slightly less horrifying Python implementation, expected for 6.16 - Update the minimum Sphinx required to 3.4.3, allowing us to remove a bunch of older compatibility code - Rework and improve the generation of the ABI documentation (All of the above done by Mauro) - Lots of translation updates. Alex Shi and Yanteng Si are taking on responsibility for the Chinese translations going forward; that work will still get to you via docs-next - Try to standardize the format for indicating a developer's affiliation in commit tags - Clarify the TAB's role in CoC enforcement actions - Try to spell out the rules for when a commit tag can name another developer without their explicit permission Plus lots of other typo fixes and updates" * tag 'docs-6.15' of git://git.lwn.net/linux: (98 commits) docs/zh_CN: fix spelling mistake docs/Chinese: change the disclaimer words docs/zh_CN: Add snp-tdx-threat-model index Chinese translation docs: driver-api: firmware: clarify userspace requirements docs: clarify rules wrt tagging other people docs: Remove outdated highuid.rst documentation Documentation: dma-buf: heaps: Add heap name definitions docs/.../submit-checklist: Use Documentation/admin-guide/abi.rst for cross-ref of README docs: Correct installation instruction Documentation: kcsan: fix "Plain Accesses and Data Races" URL in kcsan.rst Documentation/CoC: Spell out the TAB role in enforcement decisions Documentation: ocxl.rst: Update consortium site scripts: get_feat.pl: substitute s390x with s390 scripts/kernel-doc: drop dead code for Wcontents_before_sections scripts/kernel-doc: don't add not needed new lines docs: driver-api/infiniband.rst: fix Kerneldoc markup drivers: firewire: firewire-cdev.h: fix identation on a kernel-doc markup drivers: media: intel-ipu3.h: fix identation on a kernel-doc markup include/asm-generic/io.h: fix kerneldoc markup Docs/arch/arm64: Fix spelling in amu.rst ...
2025-03-24Merge tag 'sched_ext-for-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Add mechanism to count and report internal events. This significantly improves visibility on subtle corner conditions. - The default idle CPU selection logic is revamped and improved in multiple ways including being made topology aware. - sched_ext was disabling ttwu_queue for simplicity, which can be costly when hardware topology is more complex. Implement SCX_OPS_ALLOWED_QUEUED_WAKEUP so that BPF schedulers can selectively enable ttwu_queue. - tools/sched_ext updates to improve compatibility among others. - Other misc updates and fixes. - sched_ext/for-6.14-fixes were pulled a few times to receive prerequisite fixes and resolve conflicts. * tag 'sched_ext-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (42 commits) sched_ext: idle: Refactor scx_select_cpu_dfl() sched_ext: idle: Honor idle flags in the built-in idle selection policy sched_ext: Skip per-CPU tasks in scx_bpf_reenqueue_local() sched_ext: Add trace point to track sched_ext core events sched_ext: Change the event type from u64 to s64 sched_ext: Documentation: add task lifecycle summary tools/sched_ext: Provide a compatible helper for scx_bpf_events() selftests/sched_ext: Add NUMA-aware scheduler test tools/sched_ext: Provide consistent access to scx flags sched_ext: idle: Fix scx_bpf_pick_any_cpu_node() behavior sched_ext: idle: Introduce scx_bpf_nr_node_ids() sched_ext: idle: Introduce node-aware idle cpu kfunc helpers sched_ext: idle: Per-node idle cpumasks sched_ext: idle: Introduce SCX_OPS_BUILTIN_IDLE_PER_NODE sched_ext: idle: Make idle static keys private sched/topology: Introduce for_each_node_numadist() iterator mm/numa: Introduce nearest_node_nodemask() nodemask: numa: reorganize inclusion path nodemask: add nodes_copy() tools/sched_ext: Sync with scx repo ...
2025-03-24Merge tag 'cgroup-for-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Add deprecation info messages to cgroup1-only features - rstat updates including a bug fix and breaking up a critical section to reduce interrupt latency impact - Other misc and doc updates * tag 'cgroup-for-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: Cleanup flushing functions and locking cgroup/rstat: avoid disabling irqs for O(num_cpu) mm: Fix a build breakage in memcontrol-v1.c blk-cgroup: Simplify policy files registration cgroup: Update file naming comment cgroup: Add deprecation message to legacy freezer controller mm: Add transformation message for per-memcg swappiness RFC cgroup/cpuset-v1: Add deprecation messages to sched_relax_domain_level cgroup/cpuset-v1: Add deprecation messages to memory_migrate cgroup/cpuset-v1: Add deprecation messages to mem_exclusive and mem_hardwall cgroup: Print message when /proc/cgroups is read on v2-only system cgroup/blkio: Add deprecation messages to reset_stats cgroup/cpuset-v1: Add deprecation messages to memory_spread_page and memory_spread_slab cgroup/cpuset-v1: Add deprecation messages to sched_load_balance and memory_pressure_enabled cgroup, docs: Be explicit about independence of RT_GROUP_SCHED and non-cpu controllers cgroup/rstat: Fix forceidle time in cpu.stat cgroup/misc: Remove unused misc_cg_res_total_usage cgroup/cpuset: Move procfs cpuset attribute under cgroup-v1.c cgroup: update comment about dropping cgroup kn refs