Age | Commit message (Collapse) | Author |
|
Add a pre-register hook to allow RTC drivers to adjust the default
set_offset_nsec value.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
|
|
Once a key's reference count has been reduced to 0, the garbage collector
thread may destroy it at any time and so key_put() is not allowed to touch
the key after that point. The most key_put() is normally allowed to do is
to touch key_gc_work as that's a static global variable.
However, in an effort to speed up the reclamation of quota, this is now
done in key_put() once the key's usage is reduced to 0 - but now the code
is looking at the key after the deadline, which is forbidden.
Fix this by using a flag to indicate that a key can be gc'd now rather than
looking at the key's refcount in the garbage collector.
Fixes: 9578e327b2b4 ("keys: update key quotas in key_put()")
Reported-by: syzbot+6105ffc1ded71d194d6d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/673b6aec.050a0220.87769.004a.GAE@google.com/
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: syzbot+6105ffc1ded71d194d6d@syzkaller.appspotmail.com
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from can, bluetooth and ipsec.
This contains a last minute revert of a recent GRE patch, mostly to
allow me stating there are no known regressions outstanding.
Current release - regressions:
- revert "gre: Fix IPv6 link-local address generation."
- eth: ti: am65-cpsw: fix NAPI registration sequence
Previous releases - regressions:
- ipv6: fix memleak of nhc_pcpu_rth_output in fib_check_nh_v6_gw().
- mptcp: fix data stream corruption in the address announcement
- bluetooth: fix connection regression between LE and non-LE adapters
- can:
- flexcan: only change CAN state when link up in system PM
- ucan: fix out of bound read in strscpy() source
Previous releases - always broken:
- lwtunnel: fix reentry loops
- ipv6: fix TCP GSO segmentation with NAT
- xfrm: force software GSO only in tunnel mode
- eth: ti: icssg-prueth: add lock to stats
Misc:
- add Andrea Mayer as a maintainer of SRv6"
* tag 'net-6.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits)
MAINTAINERS: Add Andrea Mayer as a maintainer of SRv6
Revert "gre: Fix IPv6 link-local address generation."
Revert "selftests: Add IPv6 link-local address generation tests for GRE devices."
net/neighbor: add missing policy for NDTPA_QUEUE_LENBYTES
tools headers: Sync uapi/asm-generic/socket.h with the kernel sources
mptcp: Fix data stream corruption in the address announcement
selftests: net: test for lwtunnel dst ref loops
net: ipv6: ioam6: fix lwtunnel_output() loop
net: lwtunnel: fix recursion loops
net: ti: icssg-prueth: Add lock to stats
net: atm: fix use after free in lec_send()
xsk: fix an integer overflow in xp_create_and_assign_umem()
net: stmmac: dwc-qos-eth: use devm_kzalloc() for AXI data
selftests: drv-net: use defer in the ping test
phy: fix xa_alloc_cyclic() error handling
dpll: fix xa_alloc_cyclic() error handling
devlink: fix xa_alloc_cyclic() error handling
ipv6: Set errno after ip_fib_metrics_init() in ip6_route_info_create().
ipv6: Fix memleak of nhc_pcpu_rth_output in fib_check_nh_v6_gw().
net: ipv6: fix TCP GSO segmentation with NAT
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_event: Fix connection regression between LE and non-LE adapters
- Fix error code in chan_alloc_skb_cb()
* tag 'for-net-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: hci_event: Fix connection regression between LE and non-LE adapters
Bluetooth: Fix error code in chan_alloc_skb_cb()
====================
Link: https://patch.msgid.link/20250314163847.110069-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fix from Niklas Cassel:
- Fix a regression on ATI AHCI controllers, where certain Samsung
drives fails to be detected on a warm boot when LPM is enabled.
LPM on ATI AHCI works fine with other drives. Likewise, the
Samsung drives works fine with LPM with other AHI controllers.
Thus, just like the weirdo ATA_QUIRK_NO_NCQ_ON_ATI quirk, add a
new ATA_QUIRK_NO_LPM_ON_ATI quirk to disable LPM only on ATI
AHCI controllers.
* tag 'ata-6.14-final' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-core: Add ATA_QUIRK_NO_LPM_ON_ATI for certain Samsung SSDs
|
|
According to GDMA protocol, holes (zeros) are allowed at the beginning
or middle of the gdma_list_devices_resp message. The existing code
cannot properly handle this, and may miss some devices in the list.
To fix, scan the entire list until the num_of_devs are found, or until
the end of the list.
Cc: stable@vger.kernel.org
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Long Li <longli@microsoft.com>
Reviewed-by: Shradha Gupta <shradhagupta@microsoft.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/1741723974-1534-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Before commit 7627a0edef54 ("ata: ahci: Drop low power policy board type")
the ATI AHCI controllers specified board type 'board_ahci' rather than
board type 'board_ahci'. This means that LPM was historically not enabled
for the ATI AHCI controllers.
By looking at commit 7a8526a5cd51 ("libata: Add ATA_HORKAGE_NO_NCQ_ON_ATI
for Samsung 860 and 870 SSD."), it is clear that, for some unknown reason,
that Samsung SSDs do not play nice with ATI AHCI controllers. (When using
other AHCI controllers, NCQ can be enabled on these Samsung SSDs without
issues.)
In a similar way, from user reports, it is clear the ATI AHCI controllers
can enable LPM on e.g. Maxtor HDDs perfectly fine, but when enabling LPM
on certain Samsung SSDs, things break. (E.g. the SSDs will not get detected
by the ATI AHCI controller even after a COMRESET.)
Yet, when using LPM on these Samsung SSDs with other AHCI controllers, e.g.
Intel AHCI controllers, these Samsung drives appear to work perfectly fine.
Considering that the combination of ATI + Samsung, for some unknown reason,
does not seem to work well, disable LPM when detecting an ATI AHCI
controller with a problematic Samsung SSD.
Apply this new ATA_QUIRK_NO_LPM_ON_ATI quirk for all Samsung SSDs that have
already been reported to not play nice with ATI (ATA_QUIRK_NO_NCQ_ON_ATI).
Fixes: 7627a0edef54 ("ata: ahci: Drop low power policy board type")
Suggested-by: Hans de Goede <hdegoede@redhat.com>
Reported-by: Eric <eric.4.debian@grabatoulnz.fr>
Closes: https://lore.kernel.org/linux-ide/Z8SBZMBjvVXA7OAK@eldamar.lan/
Tested-by: Eric <eric.4.debian@grabatoulnz.fr>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250317170348.1748671-2-cassel@kernel.org
Signed-off-by: Niklas Cassel <cassel@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"15 hotfixes. 7 are cc:stable and the remainder address post-6.13
issues or aren't considered necessary for -stable kernels.
13 are for MM and the other two are for squashfs and procfs.
All are singletons. Please see the individual changelogs for details"
* tag 'mm-hotfixes-stable-2025-03-17-20-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/page_alloc: fix memory accept before watermarks gets initialized
mm: decline to manipulate the refcount on a slab page
memcg: drain obj stock on cpu hotplug teardown
mm/huge_memory: drop beyond-EOF folios with the right number of refs
selftests/mm: run_vmtests.sh: fix half_ufd_size_MB calculation
mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT
mm: memcontrol: fix swap counter leak from offline cgroup
mm/vma: do not register private-anon mappings with khugepaged during mmap
squashfs: fix invalid pointer dereference in squashfs_cache_delete
mm/migrate: fix shmem xarray update during migration
mm/hugetlb: fix surplus pages in dissolve_free_huge_page()
mm/damon/core: initialize damos->walk_completed in damon_new_scheme()
mm/damon: respect core layer filters' allowance decision on ops layer
filemap: move prefaulting out of hot write path
proc: fix UAF in proc_get_inode()
|
|
Slab pages now have a refcount of 0, so nobody should be trying to
manipulate the refcount on them. Doing so has little effect; the object
could be freed and reallocated to a different purpose, although the slab
itself would not be until the refcount was put making it behave rather
like TYPESAFE_BY_RCU.
Unfortunately, __iov_iter_get_pages_alloc() does take a refcount. Fix
that to not change the refcount, and make put_page() silently not change
the refcount. get_page() warns so that we can fix any other callers that
need to be changed.
Long-term, networking needs to stop taking a refcount on the pages that it
uses and rely on the caller to hold whatever references are necessary to
make the memory stable. In the medium term, more page types are going to
hav a zero refcount, so we'll want to move get_page() and put_page() out
of line.
Link: https://lkml.kernel.org/r/20250310143544.1216127-1-willy@infradead.org
Fixes: 9aec2fb0fd5e (slab: allocate frozen pages)
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Hannes Reinecke <hare@suse.de>
Closes: https://lore.kernel.org/all/08c29e4b-2f71-4b6d-8046-27e407214d8c@suse.com/
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 6769183166b3 removed the parameter of id from swap_cgroup_record()
and get the memcg id from mem_cgroup_id(folio_memcg(folio)). However, the
caller of it may update a different memcg's counter instead of
folio_memcg(folio).
E.g. in the caller of mem_cgroup_swapout(), @swap_memcg could be
different with @memcg and update the counter of @swap_memcg, but
swap_cgroup_record() records the wrong memcg's ID. When it is uncharged
from __mem_cgroup_uncharge_swap(), the swap counter will leak since the
wrong recorded ID.
Fix it by bringing the parameter of id back.
Link: https://lkml.kernel.org/r/20250306023133.44838-1-songmuchun@bytedance.com
Fixes: 6769183166b3 ("mm/swap_cgroup: decouple swap cgroup recording and clearing")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Filtering decisions are made in filters evaluation order. Once a decision
is made by a filter, filters that scheduled to be evaluated after the
decision-made filter should just respect it. This is the intended and
documented behavior. Since core layer-handled filters are evaluated
before operations layer-handled filters, decisions made on core layer
should respected by ops layer.
In case of reject filters, the decision is respected, since core
layer-rejected regions are not passed to ops layer. But in case of allow
filters, ops layer filters don't know if the region has passed to them
because it was allowed by core filters or just because it didn't match to
any core layer. The current wrong implementation assumes it was due to
not matched by any core filters. As a reuslt, the decision is not
respected. Pass the missing information to ops layer using a new filed in
'struct damos', and make the ops layer filters respect it.
Link: https://lkml.kernel.org/r/20250228175336.42781-1-sj@kernel.org
Fixes: 491fee286e56 ("mm/damon/core: support damos_filter->allow")
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Fix race between rmmod and /proc/XXX's inode instantiation.
The bug is that pde->proc_ops don't belong to /proc, it belongs to a
module, therefore dereferencing it after /proc entry has been registered
is a bug unless use_pde/unuse_pde() pair has been used.
use_pde/unuse_pde can be avoided (2 atomic ops!) because pde->proc_ops
never changes so information necessary for inode instantiation can be
saved _before_ proc_register() in PDE itself and used later, avoiding
pde->proc_ops->... dereference.
rmmod lookup
sys_delete_module
proc_lookup_de
pde_get(de);
proc_get_inode(dir->i_sb, de);
mod->exit()
proc_remove
remove_proc_subtree
proc_entry_rundown(de);
free_module(mod);
if (S_ISREG(inode->i_mode))
if (de->proc_ops->proc_read_iter)
--> As module is already freed, will trigger UAF
BUG: unable to handle page fault for address: fffffbfff80a702b
PGD 817fc4067 P4D 817fc4067 PUD 817fc0067 PMD 102ef4067 PTE 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 26 UID: 0 PID: 2667 Comm: ls Tainted: G
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:proc_get_inode+0x302/0x6e0
RSP: 0018:ffff88811c837998 EFLAGS: 00010a06
RAX: dffffc0000000000 RBX: ffffffffc0538140 RCX: 0000000000000007
RDX: 1ffffffff80a702b RSI: 0000000000000001 RDI: ffffffffc0538158
RBP: ffff8881299a6000 R08: 0000000067bbe1e5 R09: 1ffff11023906f20
R10: ffffffffb560ca07 R11: ffffffffb2b43a58 R12: ffff888105bb78f0
R13: ffff888100518048 R14: ffff8881299a6004 R15: 0000000000000001
FS: 00007f95b9686840(0000) GS:ffff8883af100000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff80a702b CR3: 0000000117dd2000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
proc_lookup_de+0x11f/0x2e0
__lookup_slow+0x188/0x350
walk_component+0x2ab/0x4f0
path_lookupat+0x120/0x660
filename_lookup+0x1ce/0x560
vfs_statx+0xac/0x150
__do_sys_newstat+0x96/0x110
do_syscall_64+0x5f/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[adobriyan@gmail.com: don't do 2 atomic ops on the common path]
Link: https://lkml.kernel.org/r/3d25ded0-1739-447e-812b-e34da7990dcf@p183
Fixes: 778f3dd5a13c ("Fix procfs compat_ioctl regression")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David S. Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify reverts from Jan Kara:
"Syzbot has found out that fsnotify HSM events generated on page fault
can be generated while we already hold freeze protection for the
filesystem (when you do buffered write from a buffer which is mmapped
file on the same filesystem) which violates expectations for HSM
events and could lead to deadlocks of HSM clients with filesystem
freezing.
Since it's quite late in the cycle we've decided to revert changes
implementing HSM events on page fault for now and instead just
generate one event for the whole range on mmap(2) so that HSM client
can fetch the data at that moment"
* tag 'fsnotify_for_v6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
Revert "fanotify: disable readahead if we have pre-content watches"
Revert "mm: don't allow huge faults for files with pre content watches"
Revert "fsnotify: generate pre-content permission event on page fault"
Revert "xfs: add pre-content fsnotify hook for DAX faults"
Revert "ext4: add pre-content fsnotify hook for DAX faults"
fsnotify: add pre-content hooks on mmap()
|
|
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Concurrent pci error and hotplug handling fix (Keith)
- Endpoint function fixes (Damien)
- Fix for a regression introduced in this cycle with error checking for
batched request completions (Shin'ichiro)
* tag 'block-6.14-20250313' of git://git.kernel.dk/linux:
block: change blk_mq_add_to_batch() third argument type to bool
nvme: move error logging from nvme_end_req() to __nvme_end_req()
nvmet: pci-epf: Do not add an IRQ vector if not needed
nvmet: pci-epf: Set NVMET_PCI_EPF_Q_LIVE when a queue is fully created
nvme-pci: fix stuck reset on concurrent DPC and HP
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of last-minute fixes.
Most of them are for ASoC, and the only one core fix is for reverting
the previous change, while the rest are all device-specific quirks and
fixes, which should be relatively safe to apply"
* tag 'sound-6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ASoC: cs42l43: convert to SYSTEM_SLEEP_PM_OPS
ALSA: hda/realtek: Add mute LED quirk for HP Pavilion x360 14-dy1xxx
ASoC: codecs: wm0010: Fix error handling path in wm0010_spi_probe()
ASoC: rt722-sdca: add missing readable registers
ASoC: amd: yc: Support mic on another Lenovo ThinkPad E16 Gen 2 model
ASoC: cs42l43: Fix maximum ADC Volume
ASoC: ops: Consistently treat platform_max as control value
ASoC: rt1320: set wake_capable = 0 explicitly
ASoC: cs42l43: Add jack delay debounce after suspend
ASoC: tegra: Fix ADX S24_LE audio format
ASoC: codecs: wsa884x: report temps to hwmon in millidegree of Celsius
ASoC: Intel: sof_sdw: Fix unlikely uninitialized variable use in create_sdw_dailinks()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fix from Ingo Molnar:
"Fix a Sparse false positive warning triggered by no_free_ptr()"
* tag 'core-urgent-2025-03-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
<linux/cleanup.h>: Allow the passing of both iomem and non-iomem pointers to no_free_ptr()
|
|
Due to a typo during defining HCI errors it is not possible to connect
LE-capable device with BR/EDR only adapter. The connection is terminated
by the LE adapter because the invalid LL params error code is treated
as unsupported remote feature.
Fixes: 79c0868ad65a ("Bluetooth: hci_event: Use HCI error defines instead of magic values")
Signed-off-by: Arkadiusz Bokowy <arkadiusz.bokowy@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter, bluetooth and wireless.
No known regressions outstanding.
Current release - regressions:
- wifi: nl80211: fix assoc link handling
- eth: lan78xx: sanitize return values of register read/write
functions
Current release - new code bugs:
- ethtool: tsinfo: fix dump command
- bluetooth: btusb: configure altsetting for HCI_USER_CHANNEL
- eth: mlx5: DR, use the right action structs for STEv3
Previous releases - regressions:
- netfilter: nf_tables: make destruction work queue pernet
- gre: fix IPv6 link-local address generation.
- wifi: iwlwifi: fix TSO preparation
- bluetooth: revert "bluetooth: hci_core: fix sleeping function
called from invalid context"
- ovs: revert "openvswitch: switch to per-action label counting in
conntrack"
- eth:
- ice: fix switchdev slow-path in LAG
- bonding: fix incorrect MAC address setting to receive NS
messages
Previous releases - always broken:
- core: prevent TX of unreadable skbs
- sched: prevent creation of classes with TC_H_ROOT
- netfilter: nft_exthdr: fix offset with ipv4_find_option()
- wifi: cfg80211: cancel wiphy_work before freeing wiphy
- mctp: copy headers if cloned
- phy: nxp-c45-tja11xx: add errata for TJA112XA/B
- eth:
- bnxt: fix kernel panic in the bnxt_get_queue_stats{rx | tx}
- mlx5: bridge, fix the crash caused by LAG state check"
* tag 'net-6.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits)
net: mana: cleanup mana struct after debugfs_remove()
net/mlx5e: Prevent bridge link show failure for non-eswitch-allowed devices
net/mlx5: Bridge, fix the crash caused by LAG state check
net/mlx5: Lag, Check shared fdb before creating MultiPort E-Switch
net/mlx5: Fix incorrect IRQ pool usage when releasing IRQs
net/mlx5: HWS, Rightsize bwc matcher priority
net/mlx5: DR, use the right action structs for STEv3
Revert "openvswitch: switch to per-action label counting in conntrack"
net: openvswitch: remove misbehaving actions length check
selftests: Add IPv6 link-local address generation tests for GRE devices.
gre: Fix IPv6 link-local address generation.
netfilter: nft_exthdr: fix offset with ipv4_find_option()
selftests/tc-testing: Add a test case for DRR class with TC_H_ROOT
net_sched: Prevent creation of classes with TC_H_ROOT
ipvs: prevent integer overflow in do_ip_vs_get_ctl()
selftests: netfilter: skip br_netfilter queue tests if kernel is tainted
netfilter: nf_conncount: Fully initialize struct nf_conncount_tuple in insert_tree()
wifi: mac80211: fix MPDU length parsing for EHT 5/6 GHz
qlcnic: fix memory leak issues in qlcnic_sriov_common.c
rtase: Fix improper release of ring list entries in rtase_sw_reset
...
|
|
This reverts commit 8392bc2ff8c8bf7c4c5e6dfa71ccd893a3c046f6.
In the use case of buffered write whose input buffer is mmapped file on a
filesystem with a pre-content mark, the prefaulting of the buffer can
happen under the filesystem freeze protection (obtained in vfs_write())
which breaks assumptions of pre-content hook and introduces potential
deadlock of HSM handler in userspace with filesystem freezing.
Now that we have pre-content hooks at file mmap() time, disable the
pre-content event hooks on page fault to avoid the potential deadlock.
Reported-by: syzbot+7229071b47908b19d5b7@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-fsdevel/7ehxrhbvehlrjwvrduoxsao5k3x4aw275patsb3krkwuq573yv@o2hskrfawbnc/
Fixes: 8392bc2ff8c8 ("fsnotify: generate pre-content permission event on page fault")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250312073852.2123409-5-amir73il@gmail.com
|
|
Pre-content hooks in page faults introduces potential deadlock of HSM
handler in userspace with filesystem freezing.
The requirement with pre-content event is that for every accessed file
range an event covering at least this range will be generated at least
once before the file data is accesses.
In preparation to disabling pre-content event hooks on page faults,
add pre-content hooks at mmap() variants for the entire mmaped range,
so HSM can fill content when user requests to map a portion of the file.
Note that exec() variant also calls vm_mmap_pgoff() internally to map
code sections, so pre-content hooks are also generated in this case.
Link: https://lore.kernel.org/linux-fsdevel/7ehxrhbvehlrjwvrduoxsao5k3x4aw275patsb3krkwuq573yv@o2hskrfawbnc/
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250312073852.2123409-2-amir73il@gmail.com
|
|
Commit 1f47ed294a2b ("block: cleanup and fix batch completion adding
conditions") modified the evaluation criteria for the third argument,
'ioerror', in the blk_mq_add_to_batch() function. Initially, the
function had checked if 'ioerror' equals zero. Following the commit, it
started checking for negative error values, with the presumption that
such values, for instance -EIO, would be passed in.
However, blk_mq_add_to_batch() callers do not pass negative error
values. Instead, they pass status codes defined in various ways:
- NVMe PCI and Apple drivers pass NVMe status code
- virtio_blk driver passes the virtblk request header status byte
- null_blk driver passes blk_status_t
These codes are either zero or positive, therefore the revised check
fails to function as intended. Specifically, with the NVMe PCI driver,
this modification led to the failure of the blktests test case nvme/039.
In this test scenario, errors are artificially injected to the NVMe
driver, resulting in positive NVMe status codes passed to
blk_mq_add_to_batch(), which unexpectedly processes the failed I/O in a
batch. Hence the failure.
To correct the ioerror check within blk_mq_add_to_batch(), make all
callers to uniformly pass the argument as boolean. Modify the callers to
check their specific status codes and pass the boolean value 'is_error'.
Also describe the arguments of blK_mq_add_to_batch as kerneldoc.
Fixes: 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20250311104359.1767728-3-shinichiro.kawasaki@wdc.com
[axboe: fold in documentation update]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
no_free_ptr()
Calling no_free_ptr() for an __iomem pointer results in Sparse
complaining about the types:
warning: incorrect type in argument 1 (different address spaces)
expected void const volatile *val
got void [noderef] __iomem *__val
[ The example is from drivers/platform/x86/intel/pmc/core_ssram.c:283 ]
The problem is caused by the signature of __must_check_fn() added in:
85be6d842447 ("cleanup: Make no_free_ptr() __must_check")
... to enforce that the return value is always used.
Use __force to allow both iomem and non-iomem pointers to be given for
no_free_ptr().
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250310122158.20966-1-ilpo.jarvinen@linux.intel.com
Closes: https://lore.kernel.org/oe-kbuild-all/202403050547.qnZtuNlN-lkp@intel.com/
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"33 hotfixes. 24 are cc:stable and the remainder address post-6.13
issues or aren't considered necessary for -stable kernels.
26 are for MM and 7 are for non-MM.
- "mm: memory_failure: unmap poisoned folio during migrate properly"
from Ma Wupeng fixes a couple of two year old bugs involving the
migration of hwpoisoned folios.
- "selftests/damon: three fixes for false results" from SeongJae Park
fixes three one year old bugs in the SAMON selftest code.
The remainder are singletons and doubletons. Please see the individual
changelogs for details"
* tag 'mm-hotfixes-stable-2025-03-08-16-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (33 commits)
mm/page_alloc: fix uninitialized variable
rapidio: add check for rio_add_net() in rio_scan_alloc_net()
rapidio: fix an API misues when rio_add_net() fails
MAINTAINERS: .mailmap: update Sumit Garg's email address
Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
mm: fix finish_fault() handling for large folios
mm: don't skip arch_sync_kernel_mappings() in error paths
mm: shmem: remove unnecessary warning in shmem_writepage()
userfaultfd: fix PTE unmapping stack-allocated PTE copies
userfaultfd: do not block on locking a large folio with raised refcount
mm: zswap: use ATOMIC_LONG_INIT to initialize zswap_stored_pages
mm: shmem: fix potential data corruption during shmem swapin
mm: fix kernel BUG when userfaultfd_move encounters swapcache
selftests/damon/damon_nr_regions: sort collected regiosn before checking with min/max boundaries
selftests/damon/damon_nr_regions: set ops update for merge results check to 100ms
selftests/damon/damos_quota: make real expectation of quota exceeds
include/linux/log2.h: mark is_power_of_2() with __always_inline
NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback
mm, swap: avoid BUG_ON in relocate_cluster()
mm: swap: use correct step in loop to wait all clusters in wait_for_allocation()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Restore the previous behavior of the ACPI platform_profile sysfs
interface that has been changed recently in a way incompatible with
the existing user space (Mario Limonciello)"
* tag 'acpi-6.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
platform/x86/amd: pmf: Add balanced-performance to hidden choices
platform/x86/amd: pmf: Add 'quiet' to hidden choices
ACPI: platform_profile: Add support for hidden choices
|
|
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- TCP use after free fix on polling (Sagi)
- Controller memory buffer cleanup fixes (Icenowy)
- Free leaking requests on bad user passthrough commands (Keith)
- TCP error message fix (Maurizio)
- TCP corruption fix on partial PDU (Maurizio)
- TCP memory ordering fix for weakly ordered archs (Meir)
- Type coercion fix on message error for TCP (Dan)
- Name the RQF flags enum, fixing issues with anon enums and BPF import
of it
- ublk parameter setting fix
- GPT partition 7-bit conversion fix
* tag 'block-6.14-20250306' of git://git.kernel.dk/linux:
block: Name the RQF flags enum
nvme-tcp: fix signedness bug in nvme_tcp_init_connection()
block: fix conversion of GPT partition name to 7-bit
ublk: set_params: properly check if parameters can be applied
nvmet-tcp: Fix a possible sporadic response drops in weakly ordered arch
nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu()
nvme-tcp: Fix a C2HTermReq error message
nvmet: remove old function prototype
nvme-ioctl: fix leaked requests on mapping error
nvme-pci: skip CMB blocks incompatible with PCI P2P DMA
nvme-pci: clean up CMBMSC when registering CMB fails
nvme-tcp: fix possible UAF in nvme_tcp_poll
|
|
This reverts commit 4d94f05558271654670d18c26c912da0c1c15549 which has
problems (see [1]) and is no longer needed since 581dd2dc168f
("Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating")
has reworked the code where the original bug has been found.
[1] Link: https://lore.kernel.org/linux-bluetooth/877c55ci1r.wl-tiwai@suse.de/T/#t
Fixes: 4d94f0555827 ("Bluetooth: hci_core: Fix sleeping function called from invalid context")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
The fix to atomically read the pipe head and tail state when not holding
the pipe mutex has caused a number of headaches due to the size change
of the involved types.
It turns out that we don't have _that_ many places that access these
fields directly and were affected, but we have more than we strictly
should have, because our low-level helper functions have been designed
to have intimate knowledge of how the pipes work.
And as a result, that random noise of direct 'pipe->head' and
'pipe->tail' accesses makes it harder to pinpoint any actual potential
problem spots remaining.
For example, we didn't have a "is the pipe full" helper function, but
instead had a "given these pipe buffer indexes and this pipe size, is
the pipe full". That's because some low-level pipe code does actually
want that much more complicated interface.
But most other places literally just want a "is the pipe full" helper,
and not having it meant that those places ended up being unnecessarily
much too aware of this all.
It would have been much better if only the very core pipe code that
cared had been the one aware of this all.
So let's fix it - better late than never. This just introduces the
trivial wrappers for "is this pipe full or empty" and to get how many
pipe buffers are used, so that instead of writing
if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
the places that literally just want to know if a pipe is full can just
say
if (pipe_is_full(pipe))
instead. The existing trivial cases were converted with a 'sed' script.
This cuts down on the places that access pipe->head and pipe->tail
directly outside of the pipe code (and core splice code) quite a lot.
The splice code in particular still revels in doing the direct low-level
accesses, and the fuse fuse_dev_splice_write() code also seems a bit
unnecessarily eager to go very low-level, but it's at least a bit better
than it used to be.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix racy non-atomic read-then-increment operation with
PREEMPT_RT in nft_ct, from Sebastian Andrzej Siewior.
2) GC is not skipped when jiffies wrap around in nf_conncount,
from Nicklas Bo Jensen.
3) flush_work() on nf_tables_destroy_work waits for the last queued
instance, this could be an instance that is different from the one
that we must wait for, then make destruction work queue.
* tag 'nf-25-03-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: make destruction work queue pernet
netfilter: nf_conncount: garbage collection is not skipped when jiffies wrap around
netfilter: nft_ct: Use __refcount_inc() for per-CPU nft_ct_pcpu_template.
====================
Link: https://patch.msgid.link/20250306153446.46712-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Commit 5f89154e8e9e3445f9b59 ("block: Use enum to define RQF_x bit
indexes") converted the RQF flags to an anonymous enum, which was
a beneficial change. This patch goes one step further by naming the enum
as "rqf_flags".
This naming enables exporting these flags to BPF clients, eliminating
the need to duplicate these flags in BPF code. Instead, BPF clients can
now access the same kernel-side values through CO:RE (Compile Once, Run
Everywhere), as shown in this example:
rqf_stats = bpf_core_enum_value(enum rqf_flags, __RQF_STATS)
Suggested-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20250306-rqf_flags-v1-1-bbd64918b406@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix spelling mistakes in idmappings.rst
- Fix RCU warnings in override_creds()/revert_creds()
- Create new pid namespaces with default limit now that pid_max is
namespaced
* tag 'vfs-6.14-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
pid: Do not set pid_max in new pid namespaces
doc: correcting two prefix errors in idmappings.rst
cred: Fix RCU warnings in override/revert_creds
|
|
That's what 'pipe_full()' does, so it's more consistent. But more
importantly it gets the type limits right when the pipe head and tail
are no longer necessarily 'unsigned int'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The call to flush_work before tearing down a table from the netlink
notifier was supposed to make sure that all earlier updates (e.g. rule
add) that might reference that table have been processed.
Unfortunately, flush_work() waits for the last queued instance.
This could be an instance that is different from the one that we must
wait for.
This is because transactions are protected with a pernet mutex, but the
work item is global, so holding the transaction mutex doesn't prevent
another netns from queueing more work.
Make the work item pernet so that flush_work() will wait for all
transactions queued from this netns.
A welcome side effect is that we no longer need to wait for transaction
objects from foreign netns.
The gc work queue is still global. This seems to be ok because nft_set
structures are reference counted and each container structure owns a
reference on the net namespace.
The destroy_list is still protected by a global spinlock rather than
pernet one but the hold time is very short anyway.
v2: call cancel_work_sync before reaping the remaining tables (Pablo).
Fixes: 9f6958ba2e90 ("netfilter: nf_tables: unconditionally flush pending work before notifier")
Reported-by: syzbot+5d8c5789c8cb076b2c25@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
When building kernel with randconfig, there is an error:
In function `kvm_is_cr4_bit_set',inlined from
`kvm_update_cpuid_runtime' at arch/x86/kvm/cpuid.c:310:9:
include/linux/compiler_types.h:542:38: error: call to
`__compiletime_assert_380' declared with attribute error:
BUILD_BUG_ON failed: !is_power_of_2(cr4_bit).
'!is_power_of_2(X86_CR4_OSXSAVE)' is False, but gcc treats is_power_of_2()
as non-inline function and a compilation error happens. Fix this by marking
is_power_of_2() with __always_inline.
Link: https://lkml.kernel.org/r/20250221071624.1356899-1-suhui@nfschina.com
Signed-off-by: Su Hui <suhui@nfschina.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add PF_KCOMPACTD flag and current_is_kcompactd() helper to check for it so
nfs_release_folio() can skip calling nfs_wb_folio() from kcompactd.
Otherwise NFS can deadlock waiting for kcompactd enduced writeback which
recurses back to NFS (which triggers writeback to NFSD via NFS loopback
mount on the same host, NFSD blocks waiting for XFS's call to
__filemap_get_folio):
6070.550357] INFO: task kcompactd0:58 blocked for more than 4435 seconds.
{---
[58] "kcompactd0"
[<0>] folio_wait_bit+0xe8/0x200
[<0>] folio_wait_writeback+0x2b/0x80
[<0>] nfs_wb_folio+0x80/0x1b0 [nfs]
[<0>] nfs_release_folio+0x68/0x130 [nfs]
[<0>] split_huge_page_to_list_to_order+0x362/0x840
[<0>] migrate_pages_batch+0x43d/0xb90
[<0>] migrate_pages_sync+0x9a/0x240
[<0>] migrate_pages+0x93c/0x9f0
[<0>] compact_zone+0x8e2/0x1030
[<0>] compact_node+0xdb/0x120
[<0>] kcompactd+0x121/0x2e0
[<0>] kthread+0xcf/0x100
[<0>] ret_from_fork+0x31/0x40
[<0>] ret_from_fork_asm+0x1a/0x30
---}
[akpm@linux-foundation.org: fix build]
Link: https://lkml.kernel.org/r/20250225022002.26141-1-snitzer@kernel.org
Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since the introduction of commit c77c0a8ac4c52 ("mm/hugetlb: defer freeing
of huge pages if in non-task context"), which supports deferring the
freeing of hugetlb pages, the allocation of contiguous memory through
cma_alloc() may fail probabilistically.
In the CMA allocation process, if it is found that the CMA area is
occupied by in-use hugetlb folios, these in-use hugetlb folios need to be
migrated to another location. When there are no available hugetlb folios
in the free hugetlb pool during the migration of in-use hugetlb folios,
new folios are allocated from the buddy system. A temporary state is set
on the newly allocated folio. Upon completion of the hugetlb folio
migration, the temporary state is transferred from the new folios to the
old folios. Normally, when the old folios with the temporary state are
freed, it is directly released back to the buddy system. However, due to
the deferred freeing of hugetlb pages, the PageBuddy() check fails,
ultimately leading to the failure of cma_alloc().
Here is a simplified call trace illustrating the process:
cma_alloc()
->__alloc_contig_migrate_range() // Migrate in-use hugetlb folios
->unmap_and_move_huge_page()
->folio_putback_hugetlb() // Free old folios
->test_pages_isolated()
->__test_page_isolated_in_pageblock()
->PageBuddy(page) // Check if the page is in buddy
To resolve this issue, we have implemented a function named
wait_for_freed_hugetlb_folios(). This function ensures that the hugetlb
folios are properly released back to the buddy system after their
migration is completed. By invoking wait_for_freed_hugetlb_folios()
before calling PageBuddy(), we ensure that PageBuddy() will succeed.
Link: https://lkml.kernel.org/r/1739936804-18199-1-git-send-email-yangge1116@126.com
Fixes: c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task context")
Signed-off-by: Ge Yang <yangge1116@126.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
While looking for incorrect users of the pipe head/tail fields (see
commit c27c66afc449: "fs/pipe: Fix pipe_occupancy() with 16-bit
indexes"), I found a bug in pipe_discard_from() that looked entirely
broken.
However, the fix is trivial: this buggy function isn't actually called
by anything, so let's just remove it ASAP.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit 9bdd10d57a88 ("ASoC: ops: Shift tested values in
snd_soc_put_volsw() by +min"), and makes some additional related
updates.
There are two ways the platform_max could be interpreted; the maximum
register value, or the maximum value the control can be set to. The
patch moved from treating the value as a control value to a register
one. When the patch was applied it was technically correct as
snd_soc_limit_volume() also used the register interpretation. However,
even then most of the other usages treated platform_max as a
control value, and snd_soc_limit_volume() has since been updated to
also do so in commit fb9ad24485087 ("ASoC: ops: add correct range
check for limiting volume"). That patch however, missed updating
snd_soc_put_volsw() back to the control interpretation, and fixing
snd_soc_info_volsw_range(). The control interpretation makes more
sense as limiting is typically done from the machine driver, so it is
appropriate to use the customer facing representation rather than the
internal codec representation. Update all the code to consistently use
this interpretation of platform_max.
Finally, also add some comments to the soc_mixer_control struct to
hopefully avoid further patches switching between the two approaches.
Fixes: fb9ad24485087 ("ASoC: ops: add correct range check for limiting volume")
Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com>
Link: https://patch.msgid.link/20250228151456.3703342-1-ckeepax@opensource.cirrus.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Add htmldoc annotation for the newly introduced "head_tail" member
describing it to be a union of the pipe_inode_info's @head and @tail
members.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/lkml/20250305204609.5e64768e@canb.auug.org.au/
Fixes: 3d252160b818 ("fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The pipe_occupancy() logic implicitly relied on the natural unsigned
modulo arithmetic in C, but that doesn't work for the new 'pipe_index_t'
case, since any arithmetic will be done in 'int' (and here we had also
made it 'unsigned int' due to the function call boundary).
So make the modulo arithmetic explicit by casting the result to the
proper type.
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Swapnil Sapkal <swapnil.sapkal@amd.com>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/all/CAHk-=wjyHsGLx=rxg6PKYBNkPYAejgo7=CbyL3=HGLZLsAaJFQ@mail.gmail.com/
Fixes: 3d252160b818 ("fs/pipe: Read pipe->{head,tail} atomically outside pipe->mutex")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When two drivers don't support all the same profiles the legacy interface
only exports the common profiles.
This causes problems for cases where one driver uses low-power but another
uses quiet because the result is that neither is exported to sysfs.
To allow two drivers to disagree, add support for "hidden choices".
Hidden choices are platform profiles that a driver supports to be
compatible with the platform profile of another driver.
Fixes: 688834743d67 ("ACPI: platform_profile: Allow multiple handlers")
Reported-by: Antheas Kapenekakis <lkml@antheas.dev>
Closes: https://lore.kernel.org/platform-driver-x86/e64b771e-3255-42ad-9257-5b8fc6c24ac9@gmx.de/T/#mc068042dd29df36c16c8af92664860fc4763974b
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Antheas Kapenekakis <lkml@antheas.dev>
Tested-by: Derek J. Clark <derekjohn.clark@gmail.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20250228170155.2623386-2-superm1@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
pipe_readable(), pipe_writable(), and pipe_poll() can read "pipe->head"
and "pipe->tail" outside of "pipe->mutex" critical section. When the
head and the tail are read individually in that order, there is a window
for interruption between the two reads in which both the head and the
tail can be updated by concurrent readers and writers.
One of the problematic scenarios observed with hackbench running
multiple groups on a large server on a particular pipe inode is as
follows:
pipe->head = 36
pipe->tail = 36
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wakes up: pipe not full*
hackbench-118762 [057] ..... 1029.550548: pipe_write: head: 36 -> 37 [tail: 36]
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next reader 118740*
hackbench-118762 [057] ..... 1029.550548: pipe_write: *wake up next writer 118768*
hackbench-118768 [206] ..... 1029.55055X: pipe_write: *writer wakes up*
hackbench-118768 [206] ..... 1029.55055X: pipe_write: head = READ_ONCE(pipe->head) [37]
... CPU 206 interrupted (exact wakeup was not traced but 118768 did read head at 37 in traces)
hackbench-118740 [057] ..... 1029.550558: pipe_read: *reader wakes up: pipe is not empty*
hackbench-118740 [057] ..... 1029.550558: pipe_read: tail: 36 -> 37 [head = 37]
hackbench-118740 [057] ..... 1029.550559: pipe_read: *pipe is empty; wakeup writer 118768*
hackbench-118740 [057] ..... 1029.550559: pipe_read: *sleeps*
hackbench-118766 [185] ..... 1029.550592: pipe_write: *New writer comes in*
hackbench-118766 [185] ..... 1029.550592: pipe_write: head: 37 -> 38 [tail: 37]
hackbench-118766 [185] ..... 1029.550592: pipe_write: *wakes up reader 118766*
hackbench-118740 [185] ..... 1029.550598: pipe_read: *reader wakes up; pipe not empty*
hackbench-118740 [185] ..... 1029.550599: pipe_read: tail: 37 -> 38 [head: 38]
hackbench-118740 [185] ..... 1029.550599: pipe_read: *pipe is empty*
hackbench-118740 [185] ..... 1029.550599: pipe_read: *reader sleeps; wakeup writer 118768*
... CPU 206 switches back to writer
hackbench-118768 [206] ..... 1029.550601: pipe_write: tail = READ_ONCE(pipe->tail) [38]
hackbench-118768 [206] ..... 1029.550601: pipe_write: pipe_full()? (u32)(37 - 38) >= 16? Yes
hackbench-118768 [206] ..... 1029.550601: pipe_write: *writer goes back to sleep*
[ Tasks 118740 and 118768 can then indefinitely wait on each other. ]
The unsigned arithmetic in pipe_occupancy() wraps around when
"pipe->tail > pipe->head" leading to pipe_full() returning true despite
the pipe being empty.
The case of genuine wraparound of "pipe->head" is handled since pipe
buffer has data allowing readers to make progress until the pipe->tail
wraps too after which the reader will wakeup a sleeping writer, however,
mistaking the pipe to be full when it is in fact empty can lead to
readers and writers waiting on each other indefinitely.
This issue became more problematic and surfaced as a hang in hackbench
after the optimization in commit aaec5a95d596 ("pipe_read: don't wake up
the writer if the pipe is still full") significantly reduced the number
of spurious wakeups of writers that had previously helped mask the
issue.
To avoid missing any updates between the reads of "pipe->head" and
"pipe->write", unionize the two with a single unsigned long
"pipe->head_tail" member that can be loaded atomically.
Using "pipe->head_tail" to read the head and the tail ensures the
lockless checks do not miss any updates to the head or the tail and
since those two are only updated under "pipe->mutex", it ensures that
the head is always ahead of, or equal to the tail resulting in correct
calculations.
[ prateek: commit log, testing on x86 platforms. ]
Reported-and-debugged-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Closes: https://lore.kernel.org/lkml/e813814e-7094-4673-bc69-731af065a0eb@amd.com/
Reported-by: Alexey Gladkov <legion@kernel.org>
Closes: https://lore.kernel.org/all/Z8Wn0nTvevLRG_4m@example.org/
Fixes: 8cefc107ca54 ("pipe: Use head and tail pointers for the ring, not cursor and length")
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix RCU warnings in override_creds and revert_creds by turning
the RCU pointer into a normal pointer using rcu_replace_pointer.
These warnings were previously private to the cred code, but due
to the move into the header file they are now polluting unrelated
subsystems.
Fixes: 49dffdfde462 ("cred: Add a light version of override/revert_creds()")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/Z8QGQGW0IaSklKG7@gondor.apana.org.au
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Ryan's been hard at work finding and fixing mm bugs in the arm64 code,
so here's a small crop of fixes for -rc5.
The main changes are to fix our zapping of non-present PTEs for
hugetlb entries created using the contiguous bit in the page-table
rather than a block entry at the level above. Prior to these fixes, we
were pulling the contiguous bit back out of the PTE in order to
determine the size of the hugetlb page but this is clearly bogus if
the thing isn't present and consequently both the clearing of the
PTE(s) and the TLB invalidation were unreliable.
Although the problem was found by code inspection, we really don't
want this sitting around waiting to trigger and the changes are CC'd
to stable accordingly.
Note that the diffstat looks a lot worse than it really is;
huge_ptep_get_and_clear() now takes a size argument from the core code
and so all the arch implementations of that have been updated in a
pretty mechanical fashion.
- Fix a sporadic boot failure due to incorrect randomization of the
linear map on systems that support it
- Fix the zapping (both clearing the entries *and* invalidating the
TLB) of hugetlb PTEs constructed using the contiguous bit"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: hugetlb: Fix flush_hugetlb_tlb_range() invalidation level
arm64: hugetlb: Fix huge_ptep_get_and_clear() for non-present ptes
mm: hugetlb: Add huge page size param to huge_ptep_get_and_clear()
arm64/mm: Fix Boot panic on Ampere Altra
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix TCR_EL2 configuration to not use the ASID in TTBR1_EL2 and not
mess-up T1SZ/PS by using the HCR_EL2.E2H==0 layout.
- Bring back the VMID allocation to the vcpu_load phase, ensuring
that we only setup VTTBR_EL2 once on VHE. This cures an ugly race
that would lead to running with an unallocated VMID.
RISC-V:
- Fix hart status check in SBI HSM extension
- Fix hart suspend_type usage in SBI HSM extension
- Fix error returned by SBI IPI and TIME extensions for unsupported
function IDs
- Fix suspend_type usage in SBI SUSP extension
- Remove unnecessary vcpu kick after injecting interrupt via IMSIC
guest file
x86:
- Fix an nVMX bug where KVM fails to detect that, after nested
VM-Exit, L1 has a pending IRQ (or NMI).
- To avoid freeing the PIC while vCPUs are still around, which would
cause a NULL pointer access with the previous patch, destroy vCPUs
before any VM-level destruction.
- Handle failures to create vhost_tasks"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: retry nx_huge_page_recovery_thread creation
vhost: return task creation error instead of NULL
KVM: nVMX: Process events on nested VM-Exit if injectable IRQ or NMI is pending
KVM: x86: Free vCPUs before freeing VM state
riscv: KVM: Remove unnecessary vcpu kick
KVM: arm64: Ensure a VMID is allocated before programming VTTBR_EL2
KVM: arm64: Fix tcr_el2 initialisation in hVHE mode
riscv: KVM: Fix SBI sleep_type use
riscv: KVM: Fix SBI TIME error generation
riscv: KVM: Fix SBI IPI error generation
riscv: KVM: Fix hart suspend_type use
riscv: KVM: Fix hart suspend status check
|
|
A VMM may send a non-fatal signal to its threads, including vCPU tasks,
at any time, and thus may signal vCPU tasks during KVM_RUN. If a vCPU
task receives the signal while its trying to spawn the huge page recovery
vhost task, then KVM_RUN will fail due to copy_process() returning
-ERESTARTNOINTR.
Rework call_once() to mark the call complete if and only if the called
function succeeds, and plumb the function's true error code back to the
call_once() invoker. This provides userspace with the correct, non-fatal
error code so that the VMM doesn't terminate the VM on -ENOMEM, and allows
subsequent KVM_RUN a succeed by virtue of retrying creation of the NX huge
page task.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
[implemented the kvm user side]
Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-ID: <20250227230631.303431-3-kbusch@meta.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar:
"Fix an objtool false positive, and objtool related build warnings that
happens on PIE-enabled architectures such as LoongArch"
* tag 'objtool-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Add bch2_trans_unlocked_or_in_restart_error() to bcachefs noreturns
objtool: Fix C jump table annotations for Clang
vmlinux.lds: Ensure that const vars with relocations are mapped R/O
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix an rcuref_put() slowpath race"
* tag 'locking-urgent-2025-02-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcuref: Plug slowpath race in rcuref_put()
|
|
Pull block fixes from Jens Axboe:
- Fix plugging for native zone writes
- Fix segment limit settings for != 4K page size archs
- Fix for slab names overflowing
* tag 'block-6.14-20250228' of git://git.kernel.dk/linux:
block: fix 'kmem_cache of name 'bio-108' already exists'
block: Remove zone write plugs when handling native zone append writes
block: make segment size limit workable for > 4K PAGE_SIZE
|
|
In order to fix a bug, arm64 needs to be told the size of the huge page
for which the huge_pte is being cleared in huge_ptep_get_and_clear().
Provide for this by adding an `unsigned long sz` parameter to the
function. This follows the same pattern as huge_pte_clear() and
set_huge_pte_at().
This commit makes the required interface modifications to the core mm as
well as all arches that implement this function (arm64, loongarch, mips,
parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed
in a separate commit.
Cc: stable@vger.kernel.org
Fixes: 66b3923a1a0f ("arm64: hugetlb: add support for PTE contiguous bit")
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> # riscv
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> # s390
Link: https://lore.kernel.org/r/20250226120656.2400136-2-ryan.roberts@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth.
We didn't get netfilter or wireless PRs this week, so next week's PR
is probably going to be bigger. A healthy dose of fixes for bugs
introduced in the current release nonetheless.
Current release - regressions:
- Bluetooth: always allow SCO packets for user channel
- af_unix: fix memory leak in unix_dgram_sendmsg()
- rxrpc:
- remove redundant peer->mtu_lock causing lockdep splats
- fix spinlock flavor issues with the peer record hash
- eth: iavf: fix circular lock dependency with netdev_lock
- net: use rtnl_net_dev_lock() in
register_netdevice_notifier_dev_net() RDMA driver register notifier
after the device
Current release - new code bugs:
- ethtool: fix ioctl confusing drivers about desired HDS user config
- eth: ixgbe: fix media cage present detection for E610 device
Previous releases - regressions:
- loopback: avoid sending IP packets without an Ethernet header
- mptcp: reset connection when MPTCP opts are dropped after join
Previous releases - always broken:
- net: better track kernel sockets lifetime
- ipv6: fix dst ref loop on input in seg6 and rpl lw tunnels
- phy: qca807x: use right value from DTS for DAC_DSP_BIAS_CURRENT
- eth: enetc: number of error handling fixes
- dsa: rtl8366rb: reshuffle the code to fix config / build issue with
LED support"
* tag 'net-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (53 commits)
net: ti: icss-iep: Reject perout generation request
idpf: fix checksums set in idpf_rx_rsc()
selftests: drv-net: Check if combined-count exists
net: ipv6: fix dst ref loop on input in rpl lwt
net: ipv6: fix dst ref loop on input in seg6 lwt
usbnet: gl620a: fix endpoint checking in genelink_bind()
net/mlx5: IRQ, Fix null string in debug print
net/mlx5: Restore missing trace event when enabling vport QoS
net/mlx5: Fix vport QoS cleanup on error
net: mvpp2: cls: Fixed Non IP flow, with vlan tag flow defination.
af_unix: Fix memory leak in unix_dgram_sendmsg()
net: Handle napi_schedule() calls from non-interrupt
net: Clear old fragment checksum value in napi_reuse_skb
gve: unlink old napi when stopping a queue using queue API
net: Use rtnl_net_dev_lock() in register_netdevice_notifier_dev_net().
tcp: Defer ts_recent changes until req is owned
net: enetc: fix the off-by-one issue in enetc_map_tx_tso_buffs()
net: enetc: remove the mm_lock from the ENETC v4 driver
net: enetc: add missing enetc4_link_deinit()
net: enetc: update UDP checksum when updating originTimestamp field
...
|