summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-08-23netfilter: nf_tables: use correct lock to protect gc_listPablo Neira Ayuso
Use nf_tables_gc_list_lock spinlock, not nf_tables_destroy_list_lock to protect the gc list. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-23netfilter: nf_tables: GC transaction race with abort pathPablo Neira Ayuso
Abort path is missing a synchronization point with GC transactions. Add GC sequence number hence any GC transaction losing race will be discarded. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-23netfilter: nf_tables: flush pending destroy work before netlink notifierPablo Neira Ayuso
Destroy work waits for the RCU grace period then it releases the objects with no mutex held. All releases objects follow this path for transactions, therefore, order is guaranteed and references to top-level objects in the hierarchy remain valid. However, netlink notifier might interfer with pending destroy work. rcu_barrier() is not correct because objects are not release via RCU callback. Flush destroy work before releasing objects from netlink notifier path. Fixes: d4bc8271db21 ("netfilter: nf_tables: netlink notifier might race to release objects") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-23netfilter: nf_tables: validate all pending tablesFlorian Westphal
We have to validate all tables in the transaction that are in VALIDATE_DO state, the blamed commit below did not move the break statement to its right location so we only validate one table. Moreover, we can't init table->validate to _SKIP when a table object is allocated. If we do, then if a transcaction creates a new table and then fails the transaction, nfnetlink will loop and nft will hang until user cancels the command. Add back the pernet state as a place to stash the last state encountered. This is either _DO (we hit an error during commit validation) or _SKIP (transaction passed all checks). Fixes: 00c320f9b755 ("netfilter: nf_tables: make validation state per table") Reported-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-23ASoC: cs35l41: Correct amp_gain_tlv valuesCharles Keepax
The current analog gain TLV seems to have completely incorrect values in it. The gain starts at 0.5dB, proceeds in 1dB steps, and has no mute value, correct the control to match. Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Link: https://lore.kernel.org/r/20230823085308.753572-1-ckeepax@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2023-08-23ibmveth: Use dcbf rather than dcbflMichael Ellerman
When building for power4, newer binutils don't recognise the "dcbfl" extended mnemonic. dcbfl RA, RB is equivalent to dcbf RA, RB, 1. Switch to "dcbf" to avoid the build error. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23i40e: fix potential NULL pointer dereferencing of pf->vf i40e_sync_vsi_filters()Andrii Staikov
Add check for pf->vf not being NULL before dereferencing pf->vf[vsi->vf_id] in updating VSI filter sync. Add a similar check before dereferencing !pf->vf[vsi->vf_id].trusted in the condition for clearing promisc mode bit. Fixes: c87c938f62d8 ("i40e: Add VF VLAN pruning") Signed-off-by: Andrii Staikov <andrii.staikov@intel.com> Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23net/sched: fix a qdisc modification with ambiguous command requestJamal Hadi Salim
When replacing an existing root qdisc, with one that is of the same kind, the request boils down to essentially a parameterization change i.e not one that requires allocation and grafting of a new qdisc. syzbot was able to create a scenario which resulted in a taprio qdisc replacing an existing taprio qdisc with a combination of NLM_F_CREATE, NLM_F_REPLACE and NLM_F_EXCL leading to create and graft scenario. The fix ensures that only when the qdisc kinds are different that we should allow a create and graft, otherwise it goes into the "change" codepath. While at it, fix the code and comments to improve readability. While syzbot was able to create the issue, it did not zone on the root cause. Analysis from Vladimir Oltean <vladimir.oltean@nxp.com> helped narrow it down. v1->V2 changes: - remove "inline" function definition (Vladmir) - remove extrenous braces in branches (Vladmir) - change inline function names (Pedro) - Run tdc tests (Victor) v2->v3 changes: - dont break else/if (Simon) Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+a3618a167af2021433cd@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/20230816225759.g25x76kmgzya2gei@skbuf/T/ Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-23dma-buf/sw_sync: Avoid recursive lock during fence signalRob Clark
If a signal callback releases the sw_sync fence, that will trigger a deadlock as the timeline_fence_release recurses onto the fence->lock (used both for signaling and the the timeline tree). To avoid that, temporarily hold an extra reference to the signalled fences until after we drop the lock. (This is an alternative implementation of https://patchwork.kernel.org/patch/11664717/ which avoids some potential UAF issues with the original patch.) v2: Remove now obsolete comment, use list_move_tail() and list_del_init() Reported-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl> Fixes: d3c6dd1fb30d ("dma-buf/sw_sync: Synchronize signal vs syncpt free") Signed-off-by: Rob Clark <robdclark@chromium.org> Link: https://patchwork.freedesktop.org/patch/msgid/20230818145939.39697-1-robdclark@gmail.com Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com>
2023-08-23media: vcodec: Fix potential array out-of-bounds in encoder queue_setupWei Chen
variable *nplanes is provided by user via system call argument. The possible value of q_data->fmt->num_planes is 1-3, while the value of *nplanes can be 1-8. The array access by index i can cause array out-of-bounds. Fix this bug by checking *nplanes against the array size. Fixes: 4e855a6efa54 ("[media] vcodec: mediatek: Add Mediatek V4L2 Video Encoder Driver") Signed-off-by: Wei Chen <harperchen1110@gmail.com> Cc: stable@vger.kernel.org Reviewed-by: Chen-Yu Tsai <wenst@chromium.org> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
2023-08-22igc: Fix the typo in the PTM Control macroSasha Neftin
The IGC_PTM_CTRL_SHRT_CYC defines the time between two consecutive PTM requests. The bit resolution of this field is six bits. That bit five was missing in the mask. This patch comes to correct the typo in the IGC_PTM_CTRL_SHRT_CYC macro. Fixes: a90ec8483732 ("igc: Add support for PTP getcrosststamp()") Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://lore.kernel.org/r/20230821171721.2203572-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22batman-adv: Hold rtnl lock during MTU update via netlinkSven Eckelmann
The automatic recalculation of the maximum allowed MTU is usually triggered by code sections which are already rtnl lock protected by callers outside of batman-adv. But when the fragmentation setting is changed via batman-adv's own batadv genl family, then the rtnl lock is not yet taken. But dev_set_mtu requires that the caller holds the rtnl lock because it uses netdevice notifiers. And this code will then fail the check for this lock: RTNL: assertion failed at net/core/dev.c (1953) Cc: stable@vger.kernel.org Reported-by: syzbot+f8812454d9b3ac00d282@syzkaller.appspotmail.com Fixes: c6a953cce8d0 ("batman-adv: Trigger events for auto adjusted MTU") Signed-off-by: Sven Eckelmann <sven@narfation.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230821-batadv-missing-mtu-rtnl-lock-v1-1-1c5a7bfe861e@narfation.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22igb: Avoid starting unnecessary workqueuesAlessio Igor Bogani
If ptp_clock_register() fails or CONFIG_PTP isn't enabled, avoid starting PTP related workqueues. In this way we can fix this: BUG: unable to handle page fault for address: ffffc9000440b6f8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 100000067 P4D 100000067 PUD 1001e0067 PMD 107dc5067 PTE 0 Oops: 0000 [#1] PREEMPT SMP [...] Workqueue: events igb_ptp_overflow_check RIP: 0010:igb_rd32+0x1f/0x60 [...] Call Trace: igb_ptp_read_82580+0x20/0x50 timecounter_read+0x15/0x60 igb_ptp_overflow_check+0x1a/0x50 process_one_work+0x1cb/0x3c0 worker_thread+0x53/0x3f0 ? rescuer_thread+0x370/0x370 kthread+0x142/0x160 ? kthread_associate_blkcg+0xc0/0xc0 ret_from_fork+0x1f/0x30 Fixes: 1f6e8178d685 ("igb: Prevent dropped Tx timestamps via work items and interrupts.") Fixes: d339b1331616 ("igb: add PTP Hardware Clock code") Signed-off-by: Alessio Igor Bogani <alessio.bogani@elettra.eu> Tested-by: Arpana Arland <arpanax.arland@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230821171927.2203644-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2023-08-21 (ice) This series contains updates to ice driver only. Jesse fixes an issue on calculating buffer size. Petr Oros reverts a commit that does not fully resolve VF reset issues and implements one that provides a fuller fix. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ice: Fix NULL pointer deref during VF reset Revert "ice: Fix ice VF reset during iavf initialization" ice: fix receive buffer size miscalculation ==================== Link: https://lore.kernel.org/r/20230821171633.2203505-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22Merge branch 'can-fixes-for-6-5-rc7'Jakub Kicinski
Oliver Hartkopp says: ==================== CAN fixes for 6.5-rc7 The isotp fix removes an unnecessary check which leads to delays and/or a wrong error notification. The fix for the CAN_RAW socket solves the last issue that has been introduced with commit ee8b94c8510c ("can: raw: fix receiver memory leak") in this upstream cycle (detected by Eric Dumazet). ==================== Link: https://lore.kernel.org/r/20230821144547.6658-1-socketcan@hartkopp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22can: raw: add missing refcount for memory leak fixOliver Hartkopp
Commit ee8b94c8510c ("can: raw: fix receiver memory leak") introduced a new reference to the CAN netdevice that has assigned CAN filters. But this new ro->dev reference did not maintain its own refcount which lead to another KASAN use-after-free splat found by Eric Dumazet. This patch ensures a proper refcount for the CAN nedevice. Fixes: ee8b94c8510c ("can: raw: fix receiver memory leak") Reported-by: Eric Dumazet <edumazet@google.com> Cc: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20230821144547.6658-3-socketcan@hartkopp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22can: isotp: fix support for transmission of SF without flow controlOliver Hartkopp
The original implementation had a very simple handling for single frame transmissions as it just sent the single frame without a timeout handling. With the new echo frame handling the echo frame was also introduced for single frames but the former exception ('simple without timers') has been maintained by accident. This leads to a 1 second timeout when closing the socket and to an -ECOMM error when CAN_ISOTP_WAIT_TX_DONE is selected. As the echo handling is always active (also for single frames) remove the wrong extra condition for single frames. Fixes: 9f39d36530e5 ("can: isotp: add support for transmission without flow control") Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20230821144547.6658-2-socketcan@hartkopp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22bnx2x: new flag for track HW resource allocationThinh Tran
While injecting PCIe errors to the upstream PCIe switch of a BCM57810 NIC, system hangs/crashes were observed. After several calls to bnx2x_tx_timout() complete, bnx2x_nic_unload() is called to free up HW resources and bnx2x_napi_disable() is called to release NAPI objects. Later, when the EEH driver calls bnx2x_io_slot_reset() to complete the recovery process, bnx2x attempts to disable NAPI again by calling bnx2x_napi_disable() and freeing resources which have already been freed, resulting in a hang or crash. Introduce a new flag to track the HW resource and NAPI allocation state, refactor duplicated code into a single function, check page pool allocation status before freeing, and reduces debug output when a TX timeout event occurs. Reviewed-by: Manish Chopra <manishc@marvell.com> Tested-by: Abdul Haleem <abdhalee@in.ibm.com> Tested-by: David Christensen <drc@linux.vnet.ibm.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Tested-by: Venkata Sai Duggi <venkata.sai.duggi@ibm.com> Signed-off-by: Thinh Tran <thinhtr@linux.vnet.ibm.com> Link: https://lore.kernel.org/r/20230818161443.708785-2-thinhtr@linux.vnet.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22clk: Fix slab-out-of-bounds error in devm_clk_release()Andrey Skvortsov
Problem can be reproduced by unloading snd_soc_simple_card, because in devm_get_clk_from_child() devres data is allocated as `struct clk`, but devm_clk_release() expects devres data to be `struct devm_clk_state`. KASAN report: ================================================================== BUG: KASAN: slab-out-of-bounds in devm_clk_release+0x20/0x54 Read of size 8 at addr ffffff800ee09688 by task (udev-worker)/287 Call trace: dump_backtrace+0xe8/0x11c show_stack+0x1c/0x30 dump_stack_lvl+0x60/0x78 print_report+0x150/0x450 kasan_report+0xa8/0xf0 __asan_load8+0x78/0xa0 devm_clk_release+0x20/0x54 release_nodes+0x84/0x120 devres_release_all+0x144/0x210 device_unbind_cleanup+0x1c/0xac really_probe+0x2f0/0x5b0 __driver_probe_device+0xc0/0x1f0 driver_probe_device+0x68/0x120 __driver_attach+0x140/0x294 bus_for_each_dev+0xec/0x160 driver_attach+0x38/0x44 bus_add_driver+0x24c/0x300 driver_register+0xf0/0x210 __platform_driver_register+0x48/0x54 asoc_simple_card_init+0x24/0x1000 [snd_soc_simple_card] do_one_initcall+0xac/0x340 do_init_module+0xd0/0x300 load_module+0x2ba4/0x3100 __do_sys_init_module+0x2c8/0x300 __arm64_sys_init_module+0x48/0x5c invoke_syscall+0x64/0x190 el0_svc_common.constprop.0+0x124/0x154 do_el0_svc+0x44/0xdc el0_svc+0x14/0x50 el0t_64_sync_handler+0xec/0x11c el0t_64_sync+0x14c/0x150 Allocated by task 287: kasan_save_stack+0x38/0x60 kasan_set_track+0x28/0x40 kasan_save_alloc_info+0x20/0x30 __kasan_kmalloc+0xac/0xb0 __kmalloc_node_track_caller+0x6c/0x1c4 __devres_alloc_node+0x44/0xb4 devm_get_clk_from_child+0x44/0xa0 asoc_simple_parse_clk+0x1b8/0x1dc [snd_soc_simple_card_utils] simple_parse_node.isra.0+0x1ec/0x230 [snd_soc_simple_card] simple_dai_link_of+0x1bc/0x334 [snd_soc_simple_card] __simple_for_each_link+0x2ec/0x320 [snd_soc_simple_card] asoc_simple_probe+0x468/0x4dc [snd_soc_simple_card] platform_probe+0x90/0xf0 really_probe+0x118/0x5b0 __driver_probe_device+0xc0/0x1f0 driver_probe_device+0x68/0x120 __driver_attach+0x140/0x294 bus_for_each_dev+0xec/0x160 driver_attach+0x38/0x44 bus_add_driver+0x24c/0x300 driver_register+0xf0/0x210 __platform_driver_register+0x48/0x54 asoc_simple_card_init+0x24/0x1000 [snd_soc_simple_card] do_one_initcall+0xac/0x340 do_init_module+0xd0/0x300 load_module+0x2ba4/0x3100 __do_sys_init_module+0x2c8/0x300 __arm64_sys_init_module+0x48/0x5c invoke_syscall+0x64/0x190 el0_svc_common.constprop.0+0x124/0x154 do_el0_svc+0x44/0xdc el0_svc+0x14/0x50 el0t_64_sync_handler+0xec/0x11c el0t_64_sync+0x14c/0x150 The buggy address belongs to the object at ffffff800ee09600 which belongs to the cache kmalloc-256 of size 256 The buggy address is located 136 bytes inside of 256-byte region [ffffff800ee09600, ffffff800ee09700) The buggy address belongs to the physical page: page:000000002d97303b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4ee08 head:000000002d97303b order:1 compound_mapcount:0 compound_pincount:0 flags: 0x10200(slab|head|zone=0) raw: 0000000000010200 0000000000000000 dead000000000122 ffffff8002c02480 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffffff800ee09580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffffff800ee09600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffffff800ee09680: 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ^ ffffff800ee09700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffffff800ee09780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Fixes: abae8e57e49a ("clk: generalize devm_clk_get() a bit") Signed-off-by: Andrey Skvortsov <andrej.skvortzov@gmail.com> Link: https://lore.kernel.org/r/20230805084847.3110586-1-andrej.skvortzov@gmail.com Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2023-08-22drm/i915: fix Sphinx indentation warningJani Nikula
Fix Sphinx warning about unexpected indent. Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Luca Coelho <luciano.coelho@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230621123156.14907-2-jani.nikula@intel.com (cherry picked from commit 175b036472f678948b03baabce4a008b7ba91ce7) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-08-22RISC-V: vector: export VLENB csr in __sc_riscv_v_stateAndy Chiu
VLENB is critical for callers of ptrace to reconstruct Vector register files from the register dump of NT_RISCV_VECTOR. Also, future systems may will have a writable VLENB, so add it now to potentially save future compatibility issue. Fixes: 0c59922c769a ("riscv: Add ptrace vector support") Signed-off-by: Andy Chiu <andy.chiu@sifive.com> Link: https://lore.kernel.org/r/20230816155450.26200-3-andy.chiu@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-08-22RISC-V: Remove ptrace support for vectorsPalmer Dabbelt
We've found two bugs here: NT_RISCV_VECTOR steps on NT_RISCV_CSR (which is only for embedded), and we don't have vlenb in the core dumps. Given that we've have a pair of bugs croup up as part of the GDB review we've probably got other issues, so let's just cut this for 6.5 and get it right. Fixes: 0c59922c769a ("riscv: Add ptrace vector support") Reviewed-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Andy Chiu <andy.chiu@sifive.com> Link: https://lore.kernel.org/r/20230816155450.26200-2-andy.chiu@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-08-22Merge tag 'devicetree-fixes-for-6.5-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull devicetree fixes from Rob Herring: - Fix DT node refcount when creating platform devices - Fix deadlock in changeset code due to printing with devtree_lock held - Fix unittest EXPECT strings for parse_phandle_with_args_map() test - Fix IMA kexec memblock freeing * tag 'devicetree-fixes-for-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: of/platform: increase refcount of fwnode of: dynamic: Refactor action prints to not use "%pOF" inside devtree_lock of: unittest: Fix EXPECT for parse_phandle_with_args_map() test mm,ima,kexec,of: use memblock_free_late from ima_free_kexec_buffer
2023-08-22sfc: allocate a big enough SKB for loopback selftest packetEdward Cree
Cited commits passed a size to alloc_skb that was only big enough for the actual packet contents, but the following skb_put + memcpy writes the whole struct efx_loopback_payload including leading and trailing padding bytes (which are then stripped off with skb_pull/skb_trim). This could cause an skb_over_panic, although in practice we get saved by kmalloc_size_roundup. Pass the entire size we use, instead of the size of the final packet. Reported-by: Andy Moreton <andy.moreton@amd.com> Fixes: cf60ed469629 ("sfc: use padding to fix alignment in loopback test") Fixes: 30c24dd87f3f ("sfc: siena: use padding to fix alignment in loopback test") Fixes: 1186c6b31ee1 ("sfc: falcon: use padding to fix alignment in loopback test") Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230821180153.18652-1-edward.cree@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22Merge tag 'wireless-2023-08-22' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Two fixes: - reorder buffer filter checks can cause bad shift/UBSAN warning with newer HW, avoid the check (mac80211) - add Kconfig dependency for iwlwifi for PTP clock usage * tag 'wireless-2023-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: mac80211: limit reorder_buf_filtered to avoid UBSAN warning wifi: iwlwifi: mvm: add dependency for PTP clock ==================== Link: https://lore.kernel.org/r/20230822124206.43926-2-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22leds: trigger: netdev: rename 'hw_control' sysfs entry to 'offloaded'Marek Behún
Commit b655892ffd6d ("leds: trigger: netdev: expose hw_control status via sysfs") exposed to sysfs the flag that tells whether the LED trigger is offloaded to hardware, under the name "hw_control", since that is the name under which this setting is called in the code. Everywhere else in kernel when some work that is normally done in software can be made to be done by hardware instead, we use the word "offloading" to describe this, e.g. "LED blinking is offloaded to hardware". Normally renaming sysfs entries is a no-go because of backwards compatibility. But since this patch was not yet released in a stable kernel, I think it is still possible to rename it, if there is consensus. Fixes: b655892ffd6d ("leds: trigger: netdev: expose hw_control status via sysfs") Signed-off-by: Marek Behún <kabel@kernel.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230821121453.30203-1-kabel@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22net: ethernet: mtk_eth_soc: fix NULL pointer on hw resetDaniel Golle
When a hardware reset is triggered on devices not initializing WED the calls to mtk_wed_fe_reset and mtk_wed_fe_reset_complete dereference a pointer on uninitialized stack memory. Break out of both functions in case a hw_list entry is 0. Fixes: 08a764a7c51b ("net: ethernet: mtk_wed: add reset/reset_complete callbacks") Signed-off-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/5465c1609b464cc7407ae1530c40821dcdf9d3e6.1692634266.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-22Merge tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client fixes from Trond Myklebust: - fix a use after free in nfs_direct_join_group() (Cc: stable) - fix sysfs server name memory leak - fix lock recovery hang in NFSv4.0 - fix page free in the error path for nfs42_proc_getxattr() and __nfs4_get_acl_uncached() - SUNRPC/rdma: fix receive buffer dma-mapping after a server disconnect * tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: xprtrdma: Remap Receive buffers after a reconnect NFSv4: fix out path in __nfs4_get_acl_uncached NFSv4.2: fix error handling in nfs42_proc_getxattr NFS: Fix sysfs server name memory leak NFS: Fix a use after free in nfs_direct_join_group() NFSv4: Fix dropped lock for racing OPEN and delegation return
2023-08-22Merge tag 'selinux-pr-20230821' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux fix from Paul Moore: "A small fix for a potential problem when cleaning up after a failed SELinux policy load (list next pointer not being properly initialized to NULL early enough)" * tag 'selinux-pr-20230821' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinux: set next pointer before attaching to list
2023-08-22drm/i915/dgfx: Enable d3cold at s2idleAnshuman Gupta
System wide suspend already has support for lmem save/restore during suspend therefore enabling d3cold for s2idle and keepng it disable for runtime PM.(Refer below commit for d3cold runtime PM disable justification) 'commit 66eb93e71a7a ("drm/i915/dgfx: Keep PCI autosuspend control 'on' by default on all dGPU")' It will reduce the DG2 Card power consumption to ~0 Watt for s2idle power KPI. v2: - Added "Cc: stable@vger.kernel.org". Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8755 Cc: stable@vger.kernel.org Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Tested-by: Aaron Ma <aaron.ma@canonical.com> Tested-by: Jianshui Yu <Jianshui.yu@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230816125216.1722002-1-anshuman.gupta@intel.com (cherry picked from commit 2643e6d1f2a5e51877be24042d53cf956589be10) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-08-22drm/display/dp: Fix the DP DSC Receiver cap sizeAnkit Nautiyal
DP DSC Receiver Capabilities are exposed via DPCD 60h-6Fh. Fix the DSC RECEIVER CAP SIZE accordingly. Fixes: ffddc4363c28 ("drm/dp: Add DP DSC DPCD receiver capability size define and missing SHIFT") Cc: Anusha Srivatsa <anusha.srivatsa@intel.com> Cc: Manasi Navare <manasi.d.navare@intel.com> Cc: <stable@vger.kernel.org> # v5.0+ Signed-off-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com> Reviewed-by: Stanislav Lisovskiy <stanislav.lisovskiy@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230818044436.177806-1-ankit.k.nautiyal@intel.com
2023-08-21tg3: Use slab_build_skb() when neededKees Cook
The tg3 driver will use kmalloc() under some conditions. Check the frag_size and use slab_build_skb() when frag_size is 0. Silences the warning introduced by commit ce098da1497c ("skbuff: Introduce slab_build_skb()"): Use slab_build_skb() instead ... tg3_poll_work+0x638/0xf90 [tg3] Fixes: ce098da1497c ("skbuff: Introduce slab_build_skb()") Reported-by: Fiona Ebner <f.ebner@proxmox.com> Closes: https://lore.kernel.org/all/1bd4cb9c-4eb8-3bdb-3e05-8689817242d1@proxmox.com Cc: Siva Reddy Kallam <siva.kallam@broadcom.com> Cc: Prashant Sreedharan <prashant@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://lore.kernel.org/r/20230818175417.never.273-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-21selftests: bonding: do not set port down before adding to bondHangbin Liu
Before adding a port to bond, it need to be set down first. In the lacpdu test the author set the port down specifically. But commit a4abfa627c38 ("net: rtnetlink: Enslave device before bringing it up") changed the operation order, the kernel will set the port down _after_ adding to bond. So all the ports will be down at last and the test failed. In fact, the veth interfaces are already inactive when added. This means there's no need to set them down again before adding to the bond. Let's just remove the link down operation. Fixes: a4abfa627c38 ("net: rtnetlink: Enslave device before bringing it up") Reported-by: Zhengchao Shao <shaozhengchao@huawei.com> Closes: https://lore.kernel.org/netdev/a0ef07c7-91b0-94bd-240d-944a330fcabd@huawei.com/ Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20230817082459.1685972-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-21samples: ftrace: Replace bti assembly with hint for older compilerGONG, Ruiqi
When cross-building the arm64 kernel with allmodconfig using GCC 9.4, the following error occurs on multiple files under samples/ftrace/: /tmp/ccPC1ODs.s: Assembler messages: /tmp/ccPC1ODs.s:8: Error: selected processor does not support `bti c' Fix this issue by replacing `bti c` with `hint 34`, which is compatible for the older compiler. Link: https://lore.kernel.org/linux-trace-kernel/20230820111509.1470826-1-gongruiqi@huaweicloud.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Florent Revest <revest@chromium.org> Fixes: 8c3526fb86060cb5 ("arm64: ftrace: Add direct call trampoline samples support") Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-08-21scsi: ufs: ufs-qcom: Clear qunipro_g4_sel for HW major version > 5Neil Armstrong
The qunipro_g4_sel clear is also needed for new platforms with major version > 5. Fix the version check to take this into account. Fixes: 9c02aa24bf40 ("scsi: ufs: ufs-qcom: Clear qunipro_g4_sel for HW version major 5") Acked-by: Manivannan Sadhasivam <mani@kernel.org> Reviewed-by: Nitin Rawat <quic_nitirawa@quicinc.com> Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org> Link: https://lore.kernel.org/r/20230821-topic-sm8x50-upstream-ufs-major-5-plus-v2-1-f42a4b712e58@linaro.org Reviewed-by: "Bao D. Nguyen" <quic_nguyenb@quicinc.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-08-21scsi: ufs: mcq: Fix the search/wrap around logicBao D. Nguyen
The search and wrap around logic in the ufshcd_mcq_sqe_search() function does not work correctly when the hwq's queue depth is not a power of two number. Correct it so that any queue depth with a positive integer value within the supported range would work. Signed-off-by: "Bao D. Nguyen" <quic_nguyenb@quicinc.com> Link: https://lore.kernel.org/r/ff49c15be205135ed3ec186f3086694c02867dbd.1692149603.git.quic_nguyenb@quicinc.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Fixes: 8d7290348992 ("scsi: ufs: mcq: Add supporting functions for MCQ abort") Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-08-21of/platform: increase refcount of fwnodePeng Fan
commit 0f8e5651095b ("of/platform: Propagate firmware node by calling device_set_node()") use of_fwnode_handle to replace of_node_get, which introduces a side effect that the refcount is not increased. Then the out of tree jailhouse hypervisor enable/disable test will trigger kernel dump in of_overlay_remove, with the following sequence " of_changeset_revert(&overlay_changeset); of_changeset_destroy(&overlay_changeset); of_overlay_remove(&overlay_id); " So increase the refcount to avoid issues. This patch also release the refcount when releasing amba device to avoid refcount leakage. Fixes: 0f8e5651095b ("of/platform: Propagate firmware node by calling device_set_node()") Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/20230821023928.3324283-2-peng.fan@oss.nxp.com Signed-off-by: Rob Herring <robh@kernel.org>
2023-08-21mm: multi-gen LRU: don't spin during memcg releaseT.J. Mercier
When a memcg is in the process of being released mem_cgroup_tryget will fail because its reference count has already reached 0. This can happen during reclaim if the memcg has already been offlined, and we reclaim all remaining pages attributed to the offlined memcg. shrink_many attempts to skip the empty memcg in this case, and continue reclaiming from the remaining memcgs in the old generation. If there is only one memcg remaining, or if all remaining memcgs are in the process of being released then shrink_many will spin until all memcgs have finished being released. The release occurs through a workqueue, so it can take a while before kswapd is able to make any further progress. This fix results in reductions in kswapd activity and direct reclaim in a test where 28 apps (working set size > total memory) are repeatedly launched in a random sequence: A B delta ratio(%) allocstall_movable 5962 3539 -2423 -40.64 allocstall_normal 2661 2417 -244 -9.17 kswapd_high_wmark_hit_quickly 53152 7594 -45558 -85.71 pageoutrun 57365 11750 -45615 -79.52 Link: https://lkml.kernel.org/r/20230814151636.1639123-1-tjmercier@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: memory-failure: fix unexpected return value in soft_offline_page()Miaohe Lin
When page_handle_poison() fails to handle the hugepage or free page in retry path, soft_offline_page() will return 0 while -EBUSY is expected in this case. Consequently the user will think soft_offline_page succeeds while it in fact failed. So the user will not try again later in this case. Link: https://lkml.kernel.org/r/20230627112808.1275241-1-linmiaohe@huawei.com Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21radix tree: remove unused variableArnd Bergmann
Recent versions of clang warn about an unused variable, though older versions saw the 'slot++' as a use and did not warn: radix-tree.c:1136:50: error: parameter 'slot' set but not used [-Werror,-Wunused-but-set-parameter] It's clearly not needed any more, so just remove it. Link: https://lkml.kernel.org/r/20230811131023.2226509-1-arnd@kernel.org Fixes: 3a08cd52c37c7 ("radix tree: Remove multiorder support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Rong Tao <rongtao@cestc.cn> Cc: Tom Rix <trix@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: add a call to flush_cache_vmap() in vmap_pfn()Alexandre Ghiti
flush_cache_vmap() must be called after new vmalloc mappings are installed in the page table in order to allow architectures to make sure the new mapping is visible. It could lead to a panic since on some architectures (like powerpc), the page table walker could see the wrong pte value and trigger a spurious page fault that can not be resolved (see commit f1cb8f9beba8 ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")). But actually the patch is aiming at riscv: the riscv specification allows the caching of invalid entries in the TLB, and since we recently removed the vmalloc page fault handling, we now need to emit a tlb shootdown whenever a new vmalloc mapping is emitted (https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/). That's a temporary solution, there are ways to avoid that :) Link: https://lkml.kernel.org/r/20230809164633.1556126-1-alexghiti@rivosinc.com Fixes: 3e9a9e256b1e ("mm: add a vmap_pfn function") Reported-by: Dylan Jhong <dylan@andestech.com> Closes: https://lore.kernel.org/linux-riscv/ZMytNY2J8iyjbPPy@atctrx.andestech.com/ Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Reviewed-by: Dylan Jhong <dylan@andestech.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21selftests/mm: FOLL_LONGTERM need to be updated to 0x100Ayush Jain
After commit 2c2241081f7d ("mm/gup: move private gup FOLL_ flags to internal.h") FOLL_LONGTERM flag value got updated from 0x10000 to 0x100 at include/linux/mm_types.h. As hmm.hmm_device_private.hmm_gup_test uses FOLL_LONGTERM Updating same here as well. Before this change test goes in an infinite assert loop in hmm.hmm_device_private.hmm_gup_test ========================================================== RUN hmm.hmm_device_private.hmm_gup_test ... hmm-tests.c:1962:hmm_gup_test:Expected HMM_DMIRROR_PROT_WRITE.. ..(2) == m[2] (34) hmm-tests.c:157:hmm_gup_test:Expected ret (-1) == 0 (0) hmm-tests.c:157:hmm_gup_test:Expected ret (-1) == 0 (0) ... ========================================================== Call Trace: <TASK> ? sched_clock+0xd/0x20 ? __lock_acquire.constprop.0+0x120/0x6c0 ? ktime_get+0x2c/0xd0 ? sched_clock+0xd/0x20 ? local_clock+0x12/0xd0 ? lock_release+0x26e/0x3b0 pin_user_pages_fast+0x4c/0x70 gup_test_ioctl+0x4ff/0xbb0 ? gup_test_ioctl+0x68c/0xbb0 __x64_sys_ioctl+0x99/0xd0 do_syscall_64+0x60/0x90 ? syscall_exit_to_user_mode+0x2a/0x50 ? do_syscall_64+0x6d/0x90 ? syscall_exit_to_user_mode+0x2a/0x50 ? do_syscall_64+0x6d/0x90 ? irqentry_exit_to_user_mode+0xd/0x20 ? irqentry_exit+0x3f/0x50 ? exc_page_fault+0x96/0x200 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f6aaa31aaff After this change test is able to pass successfully. Link: https://lkml.kernel.org/r/20230808124347.79163-1-ayush.jain3@amd.com Fixes: 2c2241081f7d ("mm/gup: move private gup FOLL_ flags to internal.h") Signed-off-by: Ayush Jain <ayush.jain3@amd.com> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers()Ryusuke Konishi
A syzbot stress test reported that create_empty_buffers() called from nilfs_lookup_dirty_data_buffers() can cause a general protection fault. Analysis using its reproducer revealed that the back reference "mapping" from a page/folio has been changed to NULL after dirty page/folio gang lookup in nilfs_lookup_dirty_data_buffers(). Fix this issue by excluding pages/folios from being collected if, after acquiring a lock on each page/folio, its back reference "mapping" differs from the pointer to the address space struct that held the page/folio. Link: https://lkml.kernel.org/r/20230805132038.6435-1-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+0ad741797f4565e7e2d2@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/0000000000002930a705fc32b231@google.com Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via ↵David Hildenbrand
GUP-fast In contrast to most other GUP code, GUP-fast common page table walking code like gup_pte_range() also handles hugetlb pages. But in contrast to other hugetlb page table walking code, it does not look at the hugetlb PTE abstraction whereby we have only a single logical hugetlb PTE per hugetlb page, even when using multiple cont-PTEs underneath -- which is for example what huge_ptep_get() abstracts. So when we have a hugetlb page that is mapped via cont-PTEs, GUP-fast might stumble over a PTE that does not map the head page of a hugetlb page -- not the first "head" PTE of such a cont mapping. Logically, the whole hugetlb page is mapped (entire_mapcount == 1), but we might end up calling gup_must_unshare() with a tail page of a hugetlb page. We only maintain a single PageAnonExclusive flag per hugetlb page (as hugetlb pages cannot get partially COW-shared), stored for the head page. That flag is clear for all tail pages. So when gup_must_unshare() ends up calling PageAnonExclusive() with a tail page of a hugetlb page: 1) With CONFIG_DEBUG_VM_PGFLAGS Stumbles over the: VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); For example, when executing the COW selftests with 64k hugetlb pages on arm64: [ 61.082187] page:00000000829819ff refcount:3 mapcount:1 mapping:0000000000000000 index:0x1 pfn:0x11ee11 [ 61.082842] head:0000000080f79bf7 order:4 entire_mapcount:1 nr_pages_mapped:0 pincount:2 [ 61.083384] anon flags: 0x17ffff80003000e(referenced|uptodate|dirty|head|mappedtodisk|node=0|zone=2|lastcpupid=0xfffff) [ 61.084101] page_type: 0xffffffff() [ 61.084332] raw: 017ffff800000000 fffffc00037b8401 0000000000000402 0000000200000000 [ 61.084840] raw: 0000000000000010 0000000000000000 00000000ffffffff 0000000000000000 [ 61.085359] head: 017ffff80003000e ffffd9e95b09b788 ffffd9e95b09b788 ffff0007ff63cf71 [ 61.085885] head: 0000000000000000 0000000000000002 00000003ffffffff 0000000000000000 [ 61.086415] page dumped because: VM_BUG_ON_PAGE(PageHuge(page) && !PageHead(page)) [ 61.086914] ------------[ cut here ]------------ [ 61.087220] kernel BUG at include/linux/page-flags.h:990! [ 61.087591] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 61.087999] Modules linked in: ... [ 61.089404] CPU: 0 PID: 4612 Comm: cow Kdump: loaded Not tainted 6.5.0-rc4+ #3 [ 61.089917] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 61.090409] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 61.090897] pc : gup_must_unshare.part.0+0x64/0x98 [ 61.091242] lr : gup_must_unshare.part.0+0x64/0x98 [ 61.091592] sp : ffff8000825eb940 [ 61.091826] x29: ffff8000825eb940 x28: 0000000000000000 x27: fffffc00037b8440 [ 61.092329] x26: 0400000000000001 x25: 0000000000080101 x24: 0000000000080000 [ 61.092835] x23: 0000000000080100 x22: ffff0000cffb9588 x21: ffff0000c8ec6b58 [ 61.093341] x20: 0000ffffad6b1000 x19: fffffc00037b8440 x18: ffffffffffffffff [ 61.093850] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548 [ 61.094358] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021 [ 61.094858] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffffd9e958b7a1c0 [ 61.095359] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 00000000002bffa8 [ 61.095873] x5 : ffff0008bb19e708 x4 : 0000000000000000 x3 : 0000000000000000 [ 61.096380] x2 : 0000000000000000 x1 : ffff0000cf6636c0 x0 : 0000000000000046 [ 61.096894] Call trace: [ 61.097080] gup_must_unshare.part.0+0x64/0x98 [ 61.097392] gup_pte_range+0x3a8/0x3f0 [ 61.097662] gup_pgd_range+0x1ec/0x280 [ 61.097942] lockless_pages_from_mm+0x64/0x1a0 [ 61.098258] internal_get_user_pages_fast+0xe4/0x1d0 [ 61.098612] pin_user_pages_fast+0x58/0x78 [ 61.098917] pin_longterm_test_start+0xf4/0x2b8 [ 61.099243] gup_test_ioctl+0x170/0x3b0 [ 61.099528] __arm64_sys_ioctl+0xa8/0xf0 [ 61.099822] invoke_syscall.constprop.0+0x7c/0xd0 [ 61.100160] el0_svc_common.constprop.0+0xe8/0x100 [ 61.100500] do_el0_svc+0x38/0xa0 [ 61.100736] el0_svc+0x3c/0x198 [ 61.100971] el0t_64_sync_handler+0x134/0x150 [ 61.101280] el0t_64_sync+0x17c/0x180 [ 61.101543] Code: aa1303e0 f00074c1 912b0021 97fffeb2 (d4210000) 2) Without CONFIG_DEBUG_VM_PGFLAGS Always detects "not exclusive" for passed tail pages and refuses to PIN the tail pages R/O, as gup_must_unshare() == true. GUP-fast will fallback to ordinary GUP. As ordinary GUP properly considers the logical hugetlb PTE abstraction in hugetlb_follow_page_mask(), pinning the page will succeed when looking at the PageAnonExclusive on the head page only. So the only real effect of this is that with cont-PTE hugetlb pages, we'll always fallback from GUP-fast to ordinary GUP when not working on the head page, which ends up checking the head page and do the right thing. Consequently, the cow selftests pass with cont-PTE hugetlb pages as well without CONFIG_DEBUG_VM_PGFLAGS. Note that this only applies to anon hugetlb pages that are mapped using cont-PTEs: for example 64k hugetlb pages on a 4k arm64 kernel. ... and only when R/O-pinning (FOLL_PIN) such pages that are mapped into the page table R/O using GUP-fast. On production kernels (and even most debug kernels, that don't set CONFIG_DEBUG_VM_PGFLAGS) this patch should theoretically not be required to be backported. But of course, it does not hurt. Link: https://lkml.kernel.org/r/20230805101256.87306-1-david@redhat.com Fixes: a7f226604170 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21selftests: cgroup: fix test_kmem_basic less than errorLucas Karpinski
test_kmem_basic creates 100,000 negative dentries, with each one mapping to a slab object. After memory.high is set, these are reclaimed through the shrink_slab function call which reclaims all 100,000 entries. The test passes the majority of the time because when slab1 or current is calculated, it is often above 0, however, 0 is also an acceptable value. Link: https://lkml.kernel.org/r/7d6gcuyzdjcice6qbphrmpmv5skr5jtglg375unnjxqhstvhxc@qkn6dw6bao6v Signed-off-by: Lucas Karpinski <lkarpins@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: enable page walking API to lock vmas during the walkSuren Baghdasaryan
walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. The change ensures that page walks which prevent concurrent page faults by write-locking mmap_lock, operate correctly after introduction of per-vma locks. With per-vma locks page faults can be handled under vma lock without taking mmap_lock at all, so write locking mmap_lock would not stop them. The change ensures vmas are properly locked during such walks. A sample issue this solves is do_mbind() performing queue_pages_range() to queue pages for migration. Without this change a concurrent page can be faulted into the area and be left out of migration. Link: https://lkml.kernel.org/r/20230804152724.3090321-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Suggested-by: Jann Horn <jannh@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Peter Xu <peterx@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()David Hildenbrand
We shouldn't be using a GUP-internal helper if it can be avoided. Similar to smaps_pte_entry() that uses vm_normal_page(), let's use vm_normal_page_pmd() that similarly refuses to return the huge zeropage. In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd(): (1) Will always return the head page, not a tail page of a THP. If we'd ever call smaps_account with a tail page while setting "compound = true", we could be in trouble, because smaps_account() would look at the memmap of unrelated pages. If we're unlucky, that memmap does not exist at all. Before we removed PG_doublemap, we could have triggered something similar as in commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for migration entry"). This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock"): (a) We're in show_smaps_rollup() and processed a VMA (b) We release the mmap lock in show_smaps_rollup() because it is contended (c) We merged that VMA with another VMA (d) We collapsed a THP in that merged VMA at that position If the end address of the original VMA falls into the middle of a THP area, we would call smap_gather_stats() with a start address that falls into a PMD-mapped THP. It's probably very rare to trigger when not really forced. (2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page() Treat such PMDs here just like smaps_pte_entry() would treat such PTEs. If such pages would be anonymous, we most certainly would want to account them. (3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap() As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE pages. So just like smaps_pte_entry(), we'll now also ignore such PMD entries. Especially, follow_pmd_mask() never ends up calling follow_trans_huge_pmd() on pmd_devmap(). Instead it calls follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN is set. So skipping pmd_devmap() pages seems to be the right thing to do. (4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page() We won't be returning a memmap that should be ignored by core-mm, or worse, a memmap that does not even exist. Note that while walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't. Most probably this case doesn't currently really happen on the PMD level, otherwise we'd already be able to trigger kernel crashes when reading smaps / smaps_rollup. So most probably only (1) is relevant in practice as of now, but could only cause trouble in extreme corner cases. Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future reuse in wrong context. Link: https://lkml.kernel.org/r/20230803143208.383663-3-david@redhat.com Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: liubo <liubo254@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULTDavid Hildenbrand
Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting or due to inaccessible (PROT_NONE) VMAs. As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from gup/gup_fast"): "Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages." liubo reported [1] that smaps_rollup results are imprecise, because they miss accounting of pages that are mapped PROT_NONE. Further, it's easy to reproduce that KSM no longer works on inaccessible VMAs on x86-64, because pte_protnone()/pmd_protnone() also indictaes "true" in inaccessible VMAs, and follow_page() refuses to return such pages right now. As KVM really depends on these NUMA hinting faults, removing the pte_protnone()/pmd_protnone() handling in GUP code completely is not really an option. To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT to restore the original behavior for now and add better comments. Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in is_valid_gup_args(), to add that flag for all external GUP users. Note that there are three GUP-internal __get_user_pages() users that don't end up calling is_valid_gup_args() and consequently won't get FOLL_HONOR_NUMA_FAULT set. 1) get_dump_page(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE and wouldn't have honored NUMA hinting faults already. 2) populate_vma_page_range(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have honored NUMA hinting faults already. 3) faultin_vma_page_range(): we similarly don't want to handle NUMA hinting faults. To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in inaccessible VMAs properly, we have to perform VMA accessibility checks in gup_can_follow_protnone(). As GUP-fast should reject such pages either way in pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and arm64 that both implement pte_protnone() -- let's just always fallback to ordinary GUP when stumbling over pte_protnone()/pmd_protnone(). As Linus notes [2], honoring NUMA faults might only make sense for selected GUP users. So we should really see if we can instead let relevant GUP callers specify it manually, and not trigger NUMA hinting faults from GUP as default. Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and adding appropriate documenation. While at it, remove a stale comment from follow_trans_huge_pmd(): That comment for pmd_protnone() was added in commit 2b4847e73004 ("mm: numa: serialise parallel get_user_page against THP migration"), which noted: THP does not unmap pages due to a lack of support for migration entries at a PMD level. This allows races with get_user_pages Nowadays, we do have PMD migration entries, so the comment no longer applies. Let's drop it. [1] https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com Link: https://lkml.kernel.org/r/20230803143208.383663-2-david@redhat.com Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: liubo <liubo254@huawei.com> Closes: https://lore.kernel.org/r/20230726073409.631838-1-liubo254@huawei.com Reported-by: Peter Xu <peterx@redhat.com> Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/ Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21ice: Fix NULL pointer deref during VF resetPetr Oros
During stress test with attaching and detaching VF from KVM and simultaneously changing VFs spoofcheck and trust there was a NULL pointer dereference in ice_reset_vf that VF's VSI is null. More than one instance of ice_reset_vf() can be running at a given time. When we rebuild the VSI in ice_reset_vf, another reset can be triaged from ice_service_task. In this case we can access the currently uninitialized VSI and cause panic. The window for this racing condition has been around for a long time but it's much worse after commit 227bf4500aaa ("ice: move VSI delete outside deconfig") because the reset runs faster. ice_reset_vf() using vf->cfg_lock and when we move this lock before accessing to the VF VSI, we can fix BUG for all cases. Panic occurs sometimes in ice_vsi_is_rx_queue_active() and sometimes in ice_vsi_stop_all_rx_rings() With our reproducer, we can hit BUG: ~8h before commit 227bf4500aaa ("ice: move VSI delete outside deconfig"). ~20m after commit 227bf4500aaa ("ice: move VSI delete outside deconfig"). After this fix we are not able to reproduce it after ~48h There was commit cf90b74341ee ("ice: Fix call trace with null VSI during VF reset") which also tried to fix this issue, but it was only partially resolved and the bug still exists. [ 6420.658415] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 6420.665382] #PF: supervisor read access in kernel mode [ 6420.670521] #PF: error_code(0x0000) - not-present page [ 6420.675659] PGD 0 [ 6420.677679] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 6420.682038] CPU: 53 PID: 326472 Comm: kworker/53:0 Kdump: loaded Not tainted 5.14.0-317.el9.x86_64 #1 [ 6420.691250] Hardware name: Dell Inc. PowerEdge R750/04V528, BIOS 1.6.5 04/15/2022 [ 6420.698729] Workqueue: ice ice_service_task [ice] [ 6420.703462] RIP: 0010:ice_vsi_is_rx_queue_active+0x2d/0x60 [ice] [ 6420.705860] ice 0000:ca:00.0: VF 0 is now untrusted [ 6420.709494] Code: 00 00 66 83 bf 76 04 00 00 00 48 8b 77 10 74 3e 31 c0 eb 0f 0f b7 97 76 04 00 00 48 83 c0 01 39 c2 7e 2b 48 8b 97 68 04 00 00 <0f> b7 0c 42 48 8b 96 20 13 00 00 48 8d 94 8a 00 00 12 00 8b 12 83 [ 6420.714426] ice 0000:ca:00.0 ens7f0: Setting MAC 22:22:22:22:22:00 on VF 0. VF driver will be reinitialized [ 6420.733120] RSP: 0018:ff778d2ff383fdd8 EFLAGS: 00010246 [ 6420.733123] RAX: 0000000000000000 RBX: ff2acf1916294000 RCX: 0000000000000000 [ 6420.733125] RDX: 0000000000000000 RSI: ff2acf1f2c6401a0 RDI: ff2acf1a27301828 [ 6420.762346] RBP: ff2acf1a27301828 R08: 0000000000000010 R09: 0000000000001000 [ 6420.769476] R10: ff2acf1916286000 R11: 00000000019eba3f R12: ff2acf19066460d0 [ 6420.776611] R13: ff2acf1f2c6401a0 R14: ff2acf1f2c6401a0 R15: 00000000ffffffff [ 6420.783742] FS: 0000000000000000(0000) GS:ff2acf28ffa80000(0000) knlGS:0000000000000000 [ 6420.791829] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6420.797575] CR2: 0000000000000000 CR3: 00000016ad410003 CR4: 0000000000773ee0 [ 6420.804708] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 6420.811034] vfio-pci 0000:ca:01.0: enabling device (0000 -> 0002) [ 6420.811840] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 6420.811841] PKRU: 55555554 [ 6420.811842] Call Trace: [ 6420.811843] <TASK> [ 6420.811844] ice_reset_vf+0x9a/0x450 [ice] [ 6420.811876] ice_process_vflr_event+0x8f/0xc0 [ice] [ 6420.841343] ice_service_task+0x23b/0x600 [ice] [ 6420.845884] ? __schedule+0x212/0x550 [ 6420.849550] process_one_work+0x1e2/0x3b0 [ 6420.853563] ? rescuer_thread+0x390/0x390 [ 6420.857577] worker_thread+0x50/0x3a0 [ 6420.861242] ? rescuer_thread+0x390/0x390 [ 6420.865253] kthread+0xdd/0x100 [ 6420.868400] ? kthread_complete_and_exit+0x20/0x20 [ 6420.873194] ret_from_fork+0x1f/0x30 [ 6420.876774] </TASK> [ 6420.878967] Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iavf vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables bridge stp llc sctp ip6_udp_tunnel udp_tunnel nfp tls nfnetlink bluetooth mlx4_en mlx4_core rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc intel_rapl_msr intel_rapl_common i10nm_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp irdma kvm_intel i40e kvm iTCO_wdt dcdbas ib_uverbs irqbypass iTCO_vendor_support mgag200 mei_me ib_core dell_smbios isst_if_mmio isst_if_mbox_pci rapl i2c_algo_bit drm_shmem_helper intel_cstate drm_kms_helper syscopyarea sysfillrect isst_if_common sysimgblt intel_uncore fb_sys_fops dell_wmi_descriptor wmi_bmof intel_vsec mei i2c_i801 acpi_ipmi ipmi_si i2c_smbus ipmi_devintf intel_pch_thermal acpi_power_meter pcspk r Fixes: efe41860008e ("ice: Fix memory corruption in VF driver") Fixes: f23df5220d2b ("ice: Fix spurious interrupt during removal of trusted VF") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-08-21Revert "ice: Fix ice VF reset during iavf initialization"Petr Oros
This reverts commit 7255355a0636b4eff08d5e8139c77d98f151c4fc. After this commit we are not able to attach VF to VM: virsh attach-interface v0 hostdev --managed 0000:41:01.0 --mac 52:52:52:52:52:52 error: Failed to attach interface error: Cannot set interface MAC to 52:52:52:52:52:52 for ifname enp65s0f0np0 vf 0: Resource temporarily unavailable ice_check_vf_ready_for_cfg() already contain waiting for reset. New condition in ice_check_vf_ready_for_reset() causing only problems. Fixes: 7255355a0636 ("ice: Fix ice VF reset during iavf initialization") Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>