summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-07-18KVM: s390: pv: simplify shutdown and fix raceClaudio Imbrenda
Simplify the shutdown of non-protected VMs. There is no need to do complex manipulations of the counter if it was zero. This also fixes a very rare race which caused pages to be torn down from the address space with a non-zero counter even on older machines that don't support the UVC instruction, causing a crash. Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Fixes: fb491d5500a7 ("KVM: s390: pv: asynchronous destroy for reboot") Reviewed-by: Nico Boehr <nrb@linux.ibm.com> Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Message-ID: <20230705111937.33472-2-imbrenda@linux.ibm.com>
2023-07-18btrfs: fix warning when putting transaction with qgroups enabled after abortFilipe Manana
If we have a transaction abort with qgroups enabled we get a warning triggered when doing the final put on the transaction, like this: [552.6789] ------------[ cut here ]------------ [552.6815] WARNING: CPU: 4 PID: 81745 at fs/btrfs/transaction.c:144 btrfs_put_transaction+0x123/0x130 [btrfs] [552.6817] Modules linked in: btrfs blake2b_generic xor (...) [552.6819] CPU: 4 PID: 81745 Comm: btrfs-transacti Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [552.6819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [552.6819] RIP: 0010:btrfs_put_transaction+0x123/0x130 [btrfs] [552.6821] Code: bd a0 01 00 (...) [552.6821] RSP: 0018:ffffa168c0527e28 EFLAGS: 00010286 [552.6821] RAX: ffff936042caed00 RBX: ffff93604a3eb448 RCX: 0000000000000000 [552.6821] RDX: ffff93606421b028 RSI: ffffffff92ff0878 RDI: ffff93606421b010 [552.6821] RBP: ffff93606421b000 R08: 0000000000000000 R09: ffffa168c0d07c20 [552.6821] R10: 0000000000000000 R11: ffff93608dc52950 R12: ffffa168c0527e70 [552.6821] R13: ffff93606421b000 R14: ffff93604a3eb420 R15: ffff93606421b028 [552.6821] FS: 0000000000000000(0000) GS:ffff93675fb00000(0000) knlGS:0000000000000000 [552.6821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [552.6821] CR2: 0000558ad262b000 CR3: 000000014feda005 CR4: 0000000000370ee0 [552.6822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [552.6822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [552.6822] Call Trace: [552.6822] <TASK> [552.6822] ? __warn+0x80/0x130 [552.6822] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6824] ? report_bug+0x1f4/0x200 [552.6824] ? handle_bug+0x42/0x70 [552.6824] ? exc_invalid_op+0x14/0x70 [552.6824] ? asm_exc_invalid_op+0x16/0x20 [552.6824] ? btrfs_put_transaction+0x123/0x130 [btrfs] [552.6826] btrfs_cleanup_transaction+0xe7/0x5e0 [btrfs] [552.6828] ? _raw_spin_unlock_irqrestore+0x23/0x40 [552.6828] ? try_to_wake_up+0x94/0x5e0 [552.6828] ? __pfx_process_timeout+0x10/0x10 [552.6828] transaction_kthread+0x103/0x1d0 [btrfs] [552.6830] ? __pfx_transaction_kthread+0x10/0x10 [btrfs] [552.6832] kthread+0xee/0x120 [552.6832] ? __pfx_kthread+0x10/0x10 [552.6832] ret_from_fork+0x29/0x50 [552.6832] </TASK> [552.6832] ---[ end trace 0000000000000000 ]--- This corresponds to this line of code: void btrfs_put_transaction(struct btrfs_transaction *transaction) { (...) WARN_ON(!RB_EMPTY_ROOT( &transaction->delayed_refs.dirty_extent_root)); (...) } The warning happens because btrfs_qgroup_destroy_extent_records(), called in the transaction abort path, we free all entries from the rbtree "dirty_extent_root" with rbtree_postorder_for_each_entry_safe(), but we don't actually empty the rbtree - it's still pointing to nodes that were freed. So set the rbtree's root node to NULL to avoid this warning (assign RB_ROOT). Fixes: 81f7eb00ff5b ("btrfs: destroy qgroup extent records on transaction abort") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: fix ordered extent split error handling in btrfs_dio_submit_ioChristoph Hellwig
When the call to btrfs_extract_ordered_extent in btrfs_dio_submit_io fails to allocate memory for a new ordered_extent, it calls into the btrfs_dio_end_io for error handling. btrfs_dio_end_io then assumes that bbio->ordered is set because it is supposed to be at this point, except for this error handling corner case. Try to not overload the btrfs_dio_end_io with error handling of a bio in a non-canonical state, and instead call btrfs_finish_ordered_extent and iomap_dio_bio_end_io directly for this error case. Reported-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com> Fixes: b41b6f6937dc ("btrfs: use btrfs_finish_ordered_extent to complete direct writes") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: syzbot <syzbot+5b82f0e951f8c2bcdb8f@syzkaller.appspotmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: set_page_extent_mapped after read_folio in btrfs_cont_expandJosef Bacik
While trying to get the subpage blocksize tests running, I hit the following panic on generic/476 assertion failed: PagePrivate(page) && page->private, in fs/btrfs/subpage.c:229 kernel BUG at fs/btrfs/subpage.c:229! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 1 PID: 1453 Comm: fsstress Not tainted 6.4.0-rc7+ #12 Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20230301gitf80f052277c8-26.fc38 03/01/2023 pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : btrfs_subpage_assert+0xbc/0xf0 lr : btrfs_subpage_assert+0xbc/0xf0 Call trace: btrfs_subpage_assert+0xbc/0xf0 btrfs_subpage_clear_checked+0x38/0xc0 btrfs_page_clear_checked+0x48/0x98 btrfs_truncate_block+0x5d0/0x6a8 btrfs_cont_expand+0x5c/0x528 btrfs_write_check.isra.0+0xf8/0x150 btrfs_buffered_write+0xb4/0x760 btrfs_do_write_iter+0x2f8/0x4b0 btrfs_file_write_iter+0x1c/0x30 do_iter_readv_writev+0xc8/0x158 do_iter_write+0x9c/0x210 vfs_iter_write+0x24/0x40 iter_file_splice_write+0x224/0x390 direct_splice_actor+0x38/0x68 splice_direct_to_actor+0x12c/0x260 do_splice_direct+0x90/0xe8 generic_copy_file_range+0x50/0x90 vfs_copy_file_range+0x29c/0x470 __arm64_sys_copy_file_range+0xcc/0x498 invoke_syscall.constprop.0+0x80/0xd8 do_el0_svc+0x6c/0x168 el0_svc+0x50/0x1b0 el0t_64_sync_handler+0x114/0x120 el0t_64_sync+0x194/0x198 This happens because during btrfs_cont_expand we'll get a page, set it as mapped, and if it's not Uptodate we'll read it. However between the read and re-locking the page we could have called release_folio() on the page, but left the page in the file mapping. release_folio() can clear the page private, and thus further down we blow up when we go to modify the subpage bits. Fix this by putting the set_page_extent_mapped() after the read. This is safe because read_folio() will call set_page_extent_mapped() before it does the read, and then if we clear page private but leave it on the mapping we're completely safe re-setting set_page_extent_mapped(). With this patch I can now run generic/476 without panicing. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: raid56: always verify the P/Q contents for scrubQu Wenruo
[REGRESSION] Commit 75b470332965 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") changed the behavior of scrub_rbio(). Initially if we have no error reading the raid bio, we will assign @need_check to true, then finish_parity_scrub() would later verify the content of P/Q stripes before writeback. But after that commit we never verify the content of P/Q stripes and just writeback them. This can lead to unrepaired P/Q stripes during scrub, or already corrupted P/Q copied to the dev-replace target. [FIX] The situation is more complex than the regression, in fact the initial behavior is not 100% correct either. If we have the following rare case, it can still lead to the same problem using the old behavior: 0 16K 32K 48K 64K Data 1: |IIIIIII| | Data 2: | | Parity: | |CCCCCCC| | Where "I" means IO error, "C" means corruption. In the above case, we're scrubbing the parity stripe, then read out all the contents of Data 1, Data 2, Parity stripes. But found IO error in Data 1, which leads to rebuild using Data 2 and Parity and got the correct data. In that case, we would not verify if the Parity is correct for range [16K, 32K). So here we have to always verify the content of Parity no matter if we did recovery or not. This patch would remove the @need_check parameter of finish_parity_scrub() completely, and would always do the P/Q verification before writeback. Fixes: 75b470332965 ("btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap") CC: stable@vger.kernel.org # 6.2+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: use irq safe locking when running and adding delayed iputsFilipe Manana
Running delayed iputs, which never happens in an irq context, needs to lock the spinlock fs_info->delayed_iput_lock. When finishing bios for data writes (irq context, bio.c) we call btrfs_put_ordered_extent() which needs to add a delayed iput and for that it needs to acquire the spinlock fs_info->delayed_iput_lock. Without disabling irqs when running delayed iputs we can therefore deadlock on that spinlock. The same deadlock can also happen when adding an inode to the delayed iputs list, since this can be done outside an irq context as well. Syzbot recently reported this, which results in the following trace: ================================ WARNING: inconsistent lock state 6.4.0-syzkaller-09904-ga507db1d8fdc #0 Not tainted -------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. btrfs-cleaner/16079 [HC0[0]:SC0[0]:HE1:SE1] takes: ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:350 [inline] ffff888107804d20 (&fs_info->delayed_iput_lock){+.?.}-{2:2}, at: btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523 {IN-SOFTIRQ-W} state was registered at: lock_acquire kernel/locking/lockdep.c:5761 [inline] lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:350 [inline] btrfs_add_delayed_iput+0x128/0x390 fs/btrfs/inode.c:3490 btrfs_put_ordered_extent fs/btrfs/ordered-data.c:559 [inline] btrfs_put_ordered_extent+0x2f6/0x610 fs/btrfs/ordered-data.c:547 __btrfs_bio_end_io fs/btrfs/bio.c:118 [inline] __btrfs_bio_end_io+0x136/0x180 fs/btrfs/bio.c:112 btrfs_orig_bbio_end_io+0x86/0x2b0 fs/btrfs/bio.c:163 btrfs_simple_end_io+0x105/0x380 fs/btrfs/bio.c:378 bio_endio+0x589/0x690 block/bio.c:1617 req_bio_endio block/blk-mq.c:766 [inline] blk_update_request+0x5c5/0x1620 block/blk-mq.c:911 blk_mq_end_request+0x59/0x680 block/blk-mq.c:1032 lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370 blk_complete_reqs+0xb3/0xf0 block/blk-mq.c:1110 __do_softirq+0x1d4/0x905 kernel/softirq.c:553 run_ksoftirqd kernel/softirq.c:921 [inline] run_ksoftirqd+0x31/0x60 kernel/softirq.c:913 smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164 kthread+0x344/0x440 kernel/kthread.c:389 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 irq event stamp: 39 hardirqs last enabled at (39): [<ffffffff81d5ebc4>] __do_kmem_cache_free mm/slab.c:3558 [inline] hardirqs last enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free mm/slab.c:3582 [inline] hardirqs last enabled at (39): [<ffffffff81d5ebc4>] kmem_cache_free+0x244/0x370 mm/slab.c:3575 hardirqs last disabled at (38): [<ffffffff81d5eb5e>] __do_kmem_cache_free mm/slab.c:3553 [inline] hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free mm/slab.c:3582 [inline] hardirqs last disabled at (38): [<ffffffff81d5eb5e>] kmem_cache_free+0x1de/0x370 mm/slab.c:3575 softirqs last enabled at (0): [<ffffffff814ac99f>] copy_process+0x227f/0x75c0 kernel/fork.c:2448 softirqs last disabled at (0): [<0000000000000000>] 0x0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&fs_info->delayed_iput_lock); <Interrupt> lock(&fs_info->delayed_iput_lock); *** DEADLOCK *** 1 lock held by btrfs-cleaner/16079: #0: ffff888107804860 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x103/0x4b0 fs/btrfs/disk-io.c:1463 stack backtrace: CPU: 3 PID: 16079 Comm: btrfs-cleaner Not tainted 6.4.0-syzkaller-09904-ga507db1d8fdc #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 print_usage_bug kernel/locking/lockdep.c:3978 [inline] valid_state kernel/locking/lockdep.c:4020 [inline] mark_lock_irq kernel/locking/lockdep.c:4223 [inline] mark_lock.part.0+0x1102/0x1960 kernel/locking/lockdep.c:4685 mark_lock kernel/locking/lockdep.c:4649 [inline] mark_usage kernel/locking/lockdep.c:4598 [inline] __lock_acquire+0x8e4/0x5e20 kernel/locking/lockdep.c:5098 lock_acquire kernel/locking/lockdep.c:5761 [inline] lock_acquire+0x1b1/0x520 kernel/locking/lockdep.c:5726 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:350 [inline] btrfs_run_delayed_iputs+0x28/0xe0 fs/btrfs/inode.c:3523 cleaner_kthread+0x2e5/0x4b0 fs/btrfs/disk-io.c:1478 kthread+0x344/0x440 kernel/kthread.c:389 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 </TASK> So fix this by using spin_lock_irq() and spin_unlock_irq() when running delayed iputs, and using spin_lock_irqsave() and spin_unlock_irqrestore() when adding a delayed iput(). Reported-by: syzbot+da501a04be5ff533b102@syzkaller.appspotmail.com Fixes: ec63b84d4611 ("btrfs: add an ordered_extent pointer to struct btrfs_bio") Link: https://lore.kernel.org/linux-btrfs/000000000000d5c89a05ffbd39dd@google.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: fix iput() on error pointer after error during orphan cleanupFilipe Manana
At btrfs_orphan_cleanup(), if we can't find an inode (btrfs_iget() returns an -ENOENT error pointer), we proceed with 'ret' set to -ENOENT and the inode pointer set to ERR_PTR(-ENOENT). Later when we proceed to the body of the following if statement: if (ret == -ENOENT || inode->i_nlink) { (...) trans = btrfs_start_transaction(root, 1); if (IS_ERR(trans)) { ret = PTR_ERR(trans); iput(inode); goto out; } (...) ret = btrfs_del_orphan_item(trans, root, found_key.objectid); btrfs_end_transaction(trans); if (ret) { iput(inode); goto out; } continue; } If we get an error from btrfs_start_transaction() or from the call to btrfs_del_orphan_item() we end calling iput() against an inode pointer that has a value of ERR_PTR(-ENOENT), resulting in a crash with the following trace: [876.667] BUG: kernel NULL pointer dereference, address: 0000000000000096 [876.667] #PF: supervisor read access in kernel mode [876.667] #PF: error_code(0x0000) - not-present page [876.667] PGD 0 P4D 0 [876.668] Oops: 0000 [#1] PREEMPT SMP PTI [876.668] CPU: 0 PID: 2356187 Comm: mount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1 [876.668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [876.668] RIP: 0010:iput+0xa/0x20 [876.668] Code: ff ff ff 66 (...) [876.669] RSP: 0018:ffffafa9c0c9f9d0 EFLAGS: 00010282 [876.669] RAX: ffffffffffffffe4 RBX: 000000000009453b RCX: 0000000000000000 [876.669] RDX: 0000000000000001 RSI: ffffafa9c0c9f930 RDI: fffffffffffffffe [876.669] RBP: ffff95c612f3b800 R08: 0000000000000001 R09: ffffffffffffffe4 [876.670] R10: 00018f2a71010000 R11: 000000000ead96e3 R12: ffff95cb7d6909a0 [876.670] R13: fffffffffffffffe R14: ffff95c60f477000 R15: 00000000ffffffe4 [876.670] FS: 00007f5fbe30a840(0000) GS:ffff95ccdfa00000(0000) knlGS:0000000000000000 [876.670] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [876.671] CR2: 0000000000000096 CR3: 000000055e9f6004 CR4: 0000000000370ef0 [876.671] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [876.671] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [876.672] Call Trace: [876.744] <TASK> [876.744] ? __die_body+0x1b/0x60 [876.744] ? page_fault_oops+0x15d/0x450 [876.745] ? __kmem_cache_alloc_node+0x47/0x410 [876.745] ? do_user_addr_fault+0x65/0x8a0 [876.745] ? exc_page_fault+0x74/0x170 [876.746] ? asm_exc_page_fault+0x22/0x30 [876.746] ? iput+0xa/0x20 [876.746] btrfs_orphan_cleanup+0x221/0x330 [btrfs] [876.746] btrfs_lookup_dentry+0x58f/0x5f0 [btrfs] [876.747] btrfs_lookup+0xe/0x30 [btrfs] [876.747] __lookup_slow+0x82/0x130 [876.785] walk_component+0xe5/0x160 [876.786] path_lookupat.isra.0+0x6e/0x150 [876.786] filename_lookup+0xcf/0x1a0 [876.786] ? mod_objcg_state+0xd2/0x360 [876.786] ? obj_cgroup_charge+0xf5/0x110 [876.787] ? should_failslab+0xa/0x20 [876.787] ? kmem_cache_alloc+0x47/0x450 [876.787] vfs_path_lookup+0x51/0x90 [876.788] mount_subtree+0x8d/0x130 [876.788] btrfs_mount+0x149/0x410 [btrfs] [876.788] ? __kmem_cache_alloc_node+0x47/0x410 [876.788] ? vfs_parse_fs_param+0xc0/0x110 [876.789] legacy_get_tree+0x24/0x50 [876.834] vfs_get_tree+0x22/0xd0 [876.852] path_mount+0x2d8/0x9c0 [876.852] do_mount+0x79/0x90 [876.852] __x64_sys_mount+0x8e/0xd0 [876.853] do_syscall_64+0x38/0x90 [876.899] entry_SYSCALL_64_after_hwframe+0x72/0xdc [876.958] RIP: 0033:0x7f5fbe50b76a [876.959] Code: 48 8b 0d a9 (...) [876.959] RSP: 002b:00007fff01925798 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [876.959] RAX: ffffffffffffffda RBX: 00007f5fbe694264 RCX: 00007f5fbe50b76a [876.960] RDX: 0000561bde6c8720 RSI: 0000561bde6bdec0 RDI: 0000561bde6c31a0 [876.960] RBP: 0000561bde6bdc70 R08: 0000000000000000 R09: 0000000000000001 [876.960] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [876.960] R13: 0000561bde6c31a0 R14: 0000561bde6c8720 R15: 0000561bde6bdc70 [876.960] </TASK> So fix this by setting 'inode' to NULL whenever we get an error from btrfs_iget(), and to make the code simpler, stop testing for 'ret' being -ENOENT to check if we have an inode - instead test for 'inode' being NULL or not. Having a NULL 'inode' prevents any iput() call from crashing, as iput() ignores NULL inode pointers. Also, stop testing for a NULL return value from btrfs_iget() with PTR_ERR_OR_ZERO(), because btrfs_iget() never returns NULL - in case an inode is not found, it returns ERR_PTR(-ENOENT), and in case of memory allocation failure, it returns ERR_PTR(-ENOMEM). We also don't need the extra iput() calls on the error branches for the btrfs_start_transaction() and btrfs_del_orphan_item() calls, as we have already called iput() before, so remove them. Fixes: a13bb2c03848 ("btrfs: add missing iputs on orphan cleanup failure") CC: stable@vger.kernel.org # 6.4 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: fix double iput() on inode after an error during orphan cleanupFilipe Manana
At btrfs_orphan_cleanup(), if we were able to find the inode, we do an iput() on the inode, then if btrfs_drop_verity_items() succeeds and then either btrfs_start_transaction() or btrfs_del_orphan_item() fail, we do another iput() in the respective error paths, resulting in an extra iput() on the inode. Fix this by setting inode to NULL after the first iput(), as iput() ignores a NULL inode pointer argument. Fixes: a13bb2c03848 ("btrfs: add missing iputs on orphan cleanup failure") CC: stable@vger.kernel.org # 6.4 Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-18btrfs: zoned: fix memory leak after finding block group with super blocksFilipe Manana
At exclude_super_stripes(), if we happen to find a block group that has super blocks mapped to it and we are on a zoned filesystem, we error out as this is not supposed to happen, indicating either a bug or maybe some memory corruption for example. However we are exiting the function without freeing the memory allocated for the logical address of the super blocks. Fix this by freeing the logical address. Fixes: 12659251ca5d ("btrfs: implement log-structured superblock for ZONED mode") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-17maple_tree: fix node allocation testing on 32 bitLiam R. Howlett
Internal node counting was altered and the 64 bit test was updated, however the 32bit test was missed. Restore the 32bit test to a functional state. Link: https://lore.kernel.org/linux-mm/CAMuHMdV4T53fOw7VPoBgPR7fP6RYqf=CBhD_y_vOg53zZX_DnA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20230712173916.168805-2-Liam.Howlett@oracle.com Fixes: 541e06b772c1 ("maple_tree: remove GFP_ZERO from kmem_cache_alloc() and kmem_cache_alloc_bulk()") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-17maple_tree: fix 32 bit mas_next testingLiam R. Howlett
The test setup of mas_next is dependent on node entry size to create a 2 level tree, but the tests did not account for this in the expected value when shifting beyond the scope of the tree. Fix this by setting up the test to succeed depending on the node entries which is dependent on the 32/64 bit setup. Link: https://lkml.kernel.org/r/20230712173916.168805-1-Liam.Howlett@oracle.com Fixes: 120b116208a0 ("maple_tree: reorganize testing to restore module testing") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Closes: https://lore.kernel.org/linux-mm/CAMuHMdV4T53fOw7VPoBgPR7fP6RYqf=CBhD_y_vOg53zZX_DnA@mail.gmail.com/ Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-17selftests/mm: mkdirty: fix incorrect position of #endifColin Ian King
The #endif is the wrong side of a } causing a build failure when __NR_userfaultfd is not defined. Fix this by moving the #end to enclose the } Link: https://lkml.kernel.org/r/20230712134648.456349-1-colin.i.king@gmail.com Fixes: 9eac40fc0cc7 ("selftests/mm: mkdirty: test behavior of (pte|pmd)_mkdirty on VMAs without write permissions") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-17maple_tree: set the node limit when creating a new root nodePeng Zhang
Set the node limit of the root node so that the last pivot of all nodes is the node limit (if the node is not full). This patch also fixes a bug in mas_rev_awalk(). Effectively, always setting a maximum makes mas_logical_pivot() behave as mas_safe_pivot(). Without this fix, it is possible that very small tasks would fail to find the correct gap. Although this has not been observed with real tasks, it has been reported to happen in m68k nommu running the maple tree tests. Link: https://lkml.kernel.org/r/20230711035444.526-1-zhangpeng.00@bytedance.com Link: https://lore.kernel.org/linux-mm/CAMuHMdV4T53fOw7VPoBgPR7fP6RYqf=CBhD_y_vOg53zZX_DnA@mail.gmail.com/ Link: https://lkml.kernel.org/r/20230711035444.526-2-zhangpeng.00@bytedance.com Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-17mm/mlock: fix vma iterator conversion of apply_vma_lock_flags()Liam R. Howlett
apply_vma_lock_flags() calls mlock_fixup(), which could merge the VMA after where the vma iterator is located. Although this is not an issue, the next iteration of the loop will check the start of the vma to be equal to the locally saved 'tmp' variable and cause an incorrect failure scenario. Fix the error by setting tmp to the end of the vma iterator value before restarting the loop. There is also a potential of the error code being overwritten when the loop terminates early. Fix the return issue by directly returning when an error is encountered since there is nothing to undo after the loop. Link: https://lkml.kernel.org/r/20230711175020.4091336-1-Liam.Howlett@oracle.com Fixes: 37598f5a9d8b ("mlock: convert mlock to vma iterator") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://lore.kernel.org/linux-mm/50341ca1-d582-b33a-e3d0-acb08a65166f@arm.com/ Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-17prctl: move PR_GET_AUXV out of PR_MCE_KILLMiguel Ojeda
Somehow PR_GET_AUXV got added into PR_MCE_KILL's switch when the patch was applied [1]. Thus move it out of the switch, to the place the patch added it. In the recently released v6.4 kernel some user could, in principle, be already using this feature by mapping the right page and passing the PR_GET_AUXV constant as a pointer: prctl(PR_MCE_KILL, PR_GET_AUXV, ...) So this does change the behavior for users. We could keep the bug since the other subcases in PR_MCE_KILL (PR_MCE_KILL_CLEAR and PR_MCE_KILL_SET) do not overlap. However, v6.4 may be recent enough (2 weeks old) that moving the lines (rather than just adding a new case) does not break anybody? Moreover, the documentation in man-pages was just committed today [2]. Link: https://lkml.kernel.org/r/20230708233344.361854-1-ojeda@kernel.org Fixes: ddc65971bb67 ("prctl: add PR_GET_AUXV to copy auxv to userspace") Link: https://lore.kernel.org/all/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org/ [1] Link: https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=8cf0c06bfd3c2b219b044d4151c96f0da50af9ad [2] Signed-off-by: Miguel Ojeda <ojeda@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-17tpm,tpm_tis: Disable interrupts after 1000 unhandled IRQsLino Sanfilippo
After activation of interrupts for TPM TIS drivers 0-day reports an interrupt storm on an Inspur NF5180M6 server. Fix this by detecting the storm and falling back to polling: Count the number of unhandled interrupts within a 10 ms time interval. In case that more than 1000 were unhandled deactivate interrupts entirely, deregister the handler and use polling instead. Also print a note to point to the tpm_tis_dmi_table. Since the interrupt deregistration function devm_free_irq() waits for all interrupt handlers to finish, only trigger a worker in the interrupt handler and do the unregistration in the worker to avoid a deadlock. Note: the storm detection logic equals the implementation in note_interrupt() which uses timestamps and counters stored in struct irq_desc. Since this structure is private to the generic interrupt core the TPM TIS core uses its own timestamps and counters. Furthermore the TPM interrupt handler always returns IRQ_HANDLED to prevent the generic interrupt core from processing the interrupt storm. Cc: stable@vger.kernel.org # v6.4+ Fixes: e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/oe-lkp/202305041325.ae8b0c43-yujie.liu@intel.com/ Suggested-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Lino Sanfilippo <l.sanfilippo@kunbus.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17tpm/tpm_tis: Disable interrupts for Lenovo L590 devicesFlorian Bezdeka
The Lenovo L590 suffers from an irq storm issue like the T490, T490s and P360 Tiny, so add an entry for it to tpm_tis_dmi_table and force polling. Cc: stable@vger.kernel.org # v6.4+ Link: https://bugzilla.redhat.com/show_bug.cgi?id=2214069#c0 Fixes: e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") Signed-off-by: Florian Bezdeka <florian@bezdeka.de> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17tpm: Do not remap from ACPI resources again for Pluton TPMValentin David
For Pluton TPM devices, it was assumed that there was no ACPI memory regions. This is not true for ASUS ROG Ally. ACPI advertises 0xfd500000-0xfd5fffff. Since remapping is already done in `crb_map_pluton`, remapping again in `crb_map_io` causes EBUSY error: [ 3.510453] tpm_crb MSFT0101:00: can't request region for resource [mem 0xfd500000-0xfd5fffff] [ 3.510463] tpm_crb: probe of MSFT0101:00 failed with error -16 Cc: stable@vger.kernel.org # v6.3+ Fixes: 4d2732882703 ("tpm_crb: Add support for CRB devices based on Pluton") Signed-off-by: Valentin David <valentin.david@gmail.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17tpm/tpm_tis: Disable interrupts for Framework Laptop Intel 13th genChristian Hesse
This device suffer an irq storm, so add it in tpm_tis_dmi_table to force polling. Cc: stable@vger.kernel.org # v6.4+ Link: https://community.frame.work/t/boot-and-shutdown-hangs-with-arch-linux-kernel-6-4-1-mainline-and-arch/33118 Fixes: e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") Reported-by: <roubro1991@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217631 Signed-off-by: Christian Hesse <mail@eworm.de> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17tpm/tpm_tis: Disable interrupts for Framework Laptop Intel 12th genChristian Hesse
This device suffer an irq storm, so add it in tpm_tis_dmi_table to force polling. Cc: stable@vger.kernel.org # v6.4+ Link: https://community.frame.work/t/boot-and-shutdown-hangs-with-arch-linux-kernel-6-4-1-mainline-and-arch/33118 Fixes: e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") Reported-by: <roubro1991@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217631 Signed-off-by: Christian Hesse <mail@eworm.de> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17security: keys: Modify mismatched function nameJiapeng Chong
No functional modification involved. security/keys/trusted-keys/trusted_tpm2.c:203: warning: expecting prototype for tpm_buf_append_auth(). Prototype was for tpm2_buf_append_auth() instead. Fixes: 2e19e10131a0 ("KEYS: trusted: Move TPM2 trusted keys code") Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5524 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17tpm: return false from tpm_amd_is_rng_defective on non-x86 platformsJerry Snitselaar
tpm_amd_is_rng_defective is for dealing with an issue related to the AMD firmware TPM, so on non-x86 architectures just have it inline and return false. Cc: stable@vger.kernel.org # v6.3+ Reported-by: Sachin Sant <sachinp@linux.ibm.com> Reported-by: Aneesh Kumar K. V <aneesh.kumar@linux.ibm.com> Closes: https://lore.kernel.org/lkml/99B81401-DB46-49B9-B321-CF832B50CAC3@linux.ibm.com/ Fixes: f1324bbc4011 ("tpm: disable hwrng for fTPM on some AMD designs") Signed-off-by: Jerry Snitselaar <jsnitsel@redhat.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17keys: Fix linking a duplicate key to a keyring's assoc_arrayPetr Pavlu
When making a DNS query inside the kernel using dns_query(), the request code can in rare cases end up creating a duplicate index key in the assoc_array of the destination keyring. It is eventually found by a BUG_ON() check in the assoc_array implementation and results in a crash. Example report: [2158499.700025] kernel BUG at ../lib/assoc_array.c:652! [2158499.700039] invalid opcode: 0000 [#1] SMP PTI [2158499.700065] CPU: 3 PID: 31985 Comm: kworker/3:1 Kdump: loaded Not tainted 5.3.18-150300.59.90-default #1 SLE15-SP3 [2158499.700096] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 [2158499.700351] Workqueue: cifsiod cifs_resolve_server [cifs] [2158499.700380] RIP: 0010:assoc_array_insert+0x85f/0xa40 [2158499.700401] Code: ff 74 2b 48 8b 3b 49 8b 45 18 4c 89 e6 48 83 e7 fe e8 95 ec 74 00 3b 45 88 7d db 85 c0 79 d4 0f 0b 0f 0b 0f 0b e8 41 f2 be ff <0f> 0b 0f 0b 81 7d 88 ff ff ff 7f 4c 89 eb 4c 8b ad 58 ff ff ff 0f [2158499.700448] RSP: 0018:ffffc0bd6187faf0 EFLAGS: 00010282 [2158499.700470] RAX: ffff9f1ea7da2fe8 RBX: ffff9f1ea7da2fc1 RCX: 0000000000000005 [2158499.700492] RDX: 0000000000000000 RSI: 0000000000000005 RDI: 0000000000000000 [2158499.700515] RBP: ffffc0bd6187fbb0 R08: ffff9f185faf1100 R09: 0000000000000000 [2158499.700538] R10: ffff9f1ea7da2cc0 R11: 000000005ed8cec8 R12: ffffc0bd6187fc28 [2158499.700561] R13: ffff9f15feb8d000 R14: ffff9f1ea7da2fc0 R15: ffff9f168dc0d740 [2158499.700585] FS: 0000000000000000(0000) GS:ffff9f185fac0000(0000) knlGS:0000000000000000 [2158499.700610] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [2158499.700630] CR2: 00007fdd94fca238 CR3: 0000000809d8c006 CR4: 00000000003706e0 [2158499.700702] Call Trace: [2158499.700741] ? key_alloc+0x447/0x4b0 [2158499.700768] ? __key_link_begin+0x43/0xa0 [2158499.700790] __key_link_begin+0x43/0xa0 [2158499.700814] request_key_and_link+0x2c7/0x730 [2158499.700847] ? dns_resolver_read+0x20/0x20 [dns_resolver] [2158499.700873] ? key_default_cmp+0x20/0x20 [2158499.700898] request_key_tag+0x43/0xa0 [2158499.700926] dns_query+0x114/0x2ca [dns_resolver] [2158499.701127] dns_resolve_server_name_to_ip+0x194/0x310 [cifs] [2158499.701164] ? scnprintf+0x49/0x90 [2158499.701190] ? __switch_to_asm+0x40/0x70 [2158499.701211] ? __switch_to_asm+0x34/0x70 [2158499.701405] reconn_set_ipaddr_from_hostname+0x81/0x2a0 [cifs] [2158499.701603] cifs_resolve_server+0x4b/0xd0 [cifs] [2158499.701632] process_one_work+0x1f8/0x3e0 [2158499.701658] worker_thread+0x2d/0x3f0 [2158499.701682] ? process_one_work+0x3e0/0x3e0 [2158499.701703] kthread+0x10d/0x130 [2158499.701723] ? kthread_park+0xb0/0xb0 [2158499.701746] ret_from_fork+0x1f/0x40 The situation occurs as follows: * Some kernel facility invokes dns_query() to resolve a hostname, for example, "abcdef". The function registers its global DNS resolver cache as current->cred.thread_keyring and passes the query to request_key_net() -> request_key_tag() -> request_key_and_link(). * Function request_key_and_link() creates a keyring_search_context object. Its match_data.cmp method gets set via a call to type->match_preparse() (resolves to dns_resolver_match_preparse()) to dns_resolver_cmp(). * Function request_key_and_link() continues and invokes search_process_keyrings_rcu() which returns that a given key was not found. The control is then passed to request_key_and_link() -> construct_alloc_key(). * Concurrently to that, a second task similarly makes a DNS query for "abcdef." and its result gets inserted into the DNS resolver cache. * Back on the first task, function construct_alloc_key() first runs __key_link_begin() to determine an assoc_array_edit operation to insert a new key. Index keys in the array are compared exactly as-is, using keyring_compare_object(). The operation finds that "abcdef" is not yet present in the destination keyring. * Function construct_alloc_key() continues and checks if a given key is already present on some keyring by again calling search_process_keyrings_rcu(). This search is done using dns_resolver_cmp() and "abcdef" gets matched with now present key "abcdef.". * The found key is linked on the destination keyring by calling __key_link() and using the previously calculated assoc_array_edit operation. This inserts the "abcdef." key in the array but creates a duplicity because the same index key is already present. Fix the problem by postponing __key_link_begin() in construct_alloc_key() until an actual key which should be linked into the destination keyring is determined. [jarkko@kernel.org: added a fixes tag and cc to stable] Cc: stable@vger.kernel.org # v5.3+ Fixes: df593ee23e05 ("keys: Hoist locking out of __key_link_begin()") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Joey Lee <jlee@suse.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17tpm: tis_i2c: Limit write bursts to I2C_SMBUS_BLOCK_MAX (32) bytesAlexander Sverdlin
Underlying I2C bus drivers not always support longer transfers and imx-lpi2c for instance doesn't. The fix is symmetric to previous patch which fixed the read direction. Cc: stable@vger.kernel.org # v5.20+ Fixes: bbc23a07b072 ("tpm: Add tpm_tis_i2c backend for tpm_tis_core") Tested-by: Michael Haener <michael.haener@siemens.com> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17tpm: tis_i2c: Limit read bursts to I2C_SMBUS_BLOCK_MAX (32) bytesAlexander Sverdlin
Underlying I2C bus drivers not always support longer transfers and imx-lpi2c for instance doesn't. SLB 9673 offers 427-bytes packets. Visible symptoms are: tpm tpm0: Error left over data tpm tpm0: tpm_transmit: tpm_recv: error -5 tpm_tis_i2c: probe of 1-002e failed with error -5 Cc: stable@vger.kernel.org # v5.20+ Fixes: bbc23a07b072 ("tpm: Add tpm_tis_i2c backend for tpm_tis_core") Tested-by: Michael Haener <michael.haener@siemens.com> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17tpm_tis_spi: Release chip select when flow control failsPeijie Shao
The failure paths in tpm_tis_spi_transfer() do not deactivate chip select. Send an empty message (cs_select == 0) to overcome this. The patch is tested by two ways. One way needs to touch hardware: 1. force pull MISO pin down to GND, it emulates a forever 'WAIT' timing. 2. probe cs pin by an oscilloscope. 3. load tpm_tis_spi.ko. After loading, dmesg prints: "probe of spi0.0 failed with error -110" and oscilloscope shows cs pin goes high(deactivated) after the failure. Before the patch, cs pin keeps low. Second way is by writing a fake spi controller. 1. implement .transfer_one method, fill all rx buf with 0. 2. implement .set_cs method, print the state of cs pin. we can see cs goes high after the failure. Signed-off-by: Peijie Shao <shaopeijie@cestc.cn> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17tpm: tpm_tis: Disable interrupts *only* for AEON UPX-i11Peter Ujfalusi
Further restrict with DMI_PRODUCT_VERSION. Cc: stable@vger.kernel.org # v6.4+ Link: https://lore.kernel.org/linux-integrity/20230517122931.22385-1-peter.ujfalusi@linux.intel.com/ Fixes: 95a9359ee22f ("tpm: tpm_tis: Disable interrupts for AEON UPX-i11") Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17tpm: tpm_vtpm_proxy: fix a race condition in /dev/vtpmx creationJarkko Sakkinen
/dev/vtpmx is made visible before 'workqueue' is initialized, which can lead to a memory corruption in the worst case scenario. Address this by initializing 'workqueue' as the very first step of the driver initialization. Cc: stable@vger.kernel.org Fixes: 6f99612e2500 ("tpm: Proxy driver for supporting multiple emulated TPMs") Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@tuni.fi> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2023-07-17ASoC: fsl_sai: Disable bit clock with transmitterMatus Gajdos
Otherwise bit clock remains running writing invalid data to the DAC. Signed-off-by: Matus Gajdos <matuszpd@gmail.com> Acked-by: Shengjiu Wang <shengjiu.wang@gmail.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230712124934.32232-1-matuszpd@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2023-07-17can: mcp251xfd: __mcp251xfd_chip_set_mode(): increase poll timeoutFedor Ross
The mcp251xfd controller needs an idle bus to enter 'Normal CAN 2.0 mode' or . The maximum length of a CAN frame is 736 bits (64 data bytes, CAN-FD, EFF mode, worst case bit stuffing and interframe spacing). For low bit rates like 10 kbit/s the arbitrarily chosen MCP251XFD_POLL_TIMEOUT_US of 1 ms is too small. Otherwise during polling for the CAN controller to enter 'Normal CAN 2.0 mode' the timeout limit is exceeded and the configuration fails with: | $ ip link set dev can1 up type can bitrate 10000 | [ 731.911072] mcp251xfd spi2.1 can1: Controller failed to enter mode CAN 2.0 Mode (6) and stays in Configuration Mode (4) (con=0x068b0760, osc=0x00000468). | [ 731.927192] mcp251xfd spi2.1 can1: CRC read error at address 0x0e0c (length=4, data=00 00 00 00, CRC=0x0000) retrying. | [ 731.938101] A link change request failed with some changes committed already. Interface can1 may have been left with an inconsistent configuration, please check. | RTNETLINK answers: Connection timed out Make MCP251XFD_POLL_TIMEOUT_US timeout calculation dynamic. Use maximum of 1ms and bit time of 1 full 64 data bytes CAN-FD frame in EFF mode, worst case bit stuffing and interframe spacing at the current bit rate. For easier backporting define the macro MCP251XFD_FRAME_LEN_MAX_BITS that holds the max frame length in bits, which is 736. This can be replaced by can_frame_bits(true, true, true, true, CANFD_MAX_DLEN) in a cleanup patch later. Fixes: 55e5b97f003e8 ("can: mcp25xxfd: add driver for Microchip MCP25xxFD SPI CAN") Signed-off-by: Fedor Ross <fedor.ross@ifm.com> Signed-off-by: Marek Vasut <marex@denx.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20230717-mcp251xfd-fix-increase-poll-timeout-v5-1-06600f34c684@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-07-17iavf: fix reset task race with iavf_remove()Ahmed Zaki
The reset task is currently scheduled from the watchdog or adminq tasks. First, all direct calls to schedule the reset task are replaced with the iavf_schedule_reset(), which is modified to accept the flag showing the type of reset. To prevent the reset task from starting once iavf_remove() starts, we need to check the __IAVF_IN_REMOVE_TASK bit before we schedule it. This is now easily added to iavf_schedule_reset(). Finally, remove the check for IAVF_FLAG_RESET_NEEDED in the watchdog task. It is redundant since all callers who set the flag immediately schedules the reset task. Fixes: 3ccd54ef44eb ("iavf: Fix init state closure on remove") Fixes: 14756b2ae265 ("iavf: Fix __IAVF_RESETTING state usage") Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17iavf: fix a deadlock caused by rtnl and driver's lock circular dependenciesAhmed Zaki
A driver's lock (crit_lock) is used to serialize all the driver's tasks. Lockdep, however, shows a circular dependency between rtnl and crit_lock. This happens when an ndo that already holds the rtnl requests the driver to reset, since the reset task (in some paths) tries to grab rtnl to either change real number of queues of update netdev features. [566.241851] ====================================================== [566.241893] WARNING: possible circular locking dependency detected [566.241936] 6.2.14-100.fc36.x86_64+debug #1 Tainted: G OE [566.241984] ------------------------------------------------------ [566.242025] repro.sh/2604 is trying to acquire lock: [566.242061] ffff9280fc5ceee8 (&adapter->crit_lock){+.+.}-{3:3}, at: iavf_close+0x3c/0x240 [iavf] [566.242167] but task is already holding lock: [566.242209] ffffffff9976d350 (rtnl_mutex){+.+.}-{3:3}, at: iavf_remove+0x6b5/0x730 [iavf] [566.242300] which lock already depends on the new lock. [566.242353] the existing dependency chain (in reverse order) is: [566.242401] -> #1 (rtnl_mutex){+.+.}-{3:3}: [566.242451] __mutex_lock+0xc1/0xbb0 [566.242489] iavf_init_interrupt_scheme+0x179/0x440 [iavf] [566.242560] iavf_watchdog_task+0x80b/0x1400 [iavf] [566.242627] process_one_work+0x2b3/0x560 [566.242663] worker_thread+0x4f/0x3a0 [566.242696] kthread+0xf2/0x120 [566.242730] ret_from_fork+0x29/0x50 [566.242763] -> #0 (&adapter->crit_lock){+.+.}-{3:3}: [566.242815] __lock_acquire+0x15ff/0x22b0 [566.242869] lock_acquire+0xd2/0x2c0 [566.242901] __mutex_lock+0xc1/0xbb0 [566.242934] iavf_close+0x3c/0x240 [iavf] [566.242997] __dev_close_many+0xac/0x120 [566.243036] dev_close_many+0x8b/0x140 [566.243071] unregister_netdevice_many_notify+0x165/0x7c0 [566.243116] unregister_netdevice_queue+0xd3/0x110 [566.243157] iavf_remove+0x6c1/0x730 [iavf] [566.243217] pci_device_remove+0x33/0xa0 [566.243257] device_release_driver_internal+0x1bc/0x240 [566.243299] pci_stop_bus_device+0x6c/0x90 [566.243338] pci_stop_and_remove_bus_device+0xe/0x20 [566.243380] pci_iov_remove_virtfn+0xd1/0x130 [566.243417] sriov_disable+0x34/0xe0 [566.243448] ice_free_vfs+0x2da/0x330 [ice] [566.244383] ice_sriov_configure+0x88/0xad0 [ice] [566.245353] sriov_numvfs_store+0xde/0x1d0 [566.246156] kernfs_fop_write_iter+0x15e/0x210 [566.246921] vfs_write+0x288/0x530 [566.247671] ksys_write+0x74/0xf0 [566.248408] do_syscall_64+0x58/0x80 [566.249145] entry_SYSCALL_64_after_hwframe+0x72/0xdc [566.249886] other info that might help us debug this: [566.252014] Possible unsafe locking scenario: [566.253432] CPU0 CPU1 [566.254118] ---- ---- [566.254800] lock(rtnl_mutex); [566.255514] lock(&adapter->crit_lock); [566.256233] lock(rtnl_mutex); [566.256897] lock(&adapter->crit_lock); [566.257388] *** DEADLOCK *** The deadlock can be triggered by a script that is continuously resetting the VF adapter while doing other operations requiring RTNL, e.g: while :; do ip link set $VF up ethtool --set-channels $VF combined 2 ip link set $VF down ip link set $VF up ethtool --set-channels $VF combined 4 ip link set $VF down done Any operation that triggers a reset can substitute "ethtool --set-channles" As a fix, add a new task "finish_config" that do all the work which needs rtnl lock. With the exception of iavf_remove(), all work that require rtnl should be called from this task. As for iavf_remove(), at the point where we need to call unregister_netdevice() (and grab rtnl_lock), we make sure the finish_config task is not running (cancel_work_sync()) to safely grab rtnl. Subsequent finish_config work cannot restart after that since the task is guarded by the __IAVF_IN_REMOVE_TASK bit in iavf_schedule_finish_config(). Fixes: 5ac49f3c2702 ("iavf: use mutexes for locking of critical sections") Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17Revert "iavf: Do not restart Tx queues after reset task failure"Marcin Szycik
This reverts commit 08f1c147b7265245d67321585c68a27e990e0c4b. Netdev is no longer being detached during reset, so this fix can be reverted. We leave the removal of "hacky" IFF_UP flag update. Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17Revert "iavf: Detach device during reset task"Marcin Szycik
This reverts commit aa626da947e9cd30c4cf727493903e1adbb2c0a0. Detaching device during reset was not fully fixing the rtnl locking issue, as there could be a situation where callback was already in progress before detaching netdev. Furthermore, detaching netdevice causes TX timeouts if traffic is running. To reproduce: ip netns exec ns1 iperf3 -c $PEER_IP -t 600 --logfile /dev/null & while :; do for i in 200 7000 400 5000 300 3000 ; do ip netns exec ns1 ip link set $VF1 mtu $i sleep 2 done sleep 10 done Currently, callbacks such as iavf_change_mtu() wait for the reset. If the reset fails to acquire the rtnl_lock, they schedule the netdev update for later while continuing the reset flow. Operations like MTU changes are performed under the rtnl_lock. Therefore, when the operation finishes, another callback that uses rtnl_lock can start. Signed-off-by: Dawid Wesierski <dawidx.wesierski@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17iavf: Wait for reset in callbacks which trigger itMarcin Szycik
There was a fail when trying to add the interface to bonding right after changing the MTU on the interface. It was caused by bonding interface unable to open the interface due to interface being in __RESETTING state because of MTU change. Add new reset_waitqueue to indicate that reset has finished. Add waiting for reset to finish in callbacks which trigger hw reset: iavf_set_priv_flags(), iavf_change_mtu() and iavf_set_ringparam(). We use a 5000ms timeout period because on Hyper-V based systems, this operation takes around 3000-4000ms. In normal circumstances, it doesn't take more than 500ms to complete. Add a function iavf_wait_for_reset() to reuse waiting for reset code and use it also in iavf_set_channels(), which already waits for reset. We don't use error handling in iavf_set_channels() as this could cause the device to be in incorrect state if the reset was scheduled but hit timeout or the waitng function was interrupted by a signal. Fixes: 4e5e6b5d9d13 ("iavf: Fix return of set the new channel count") Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Co-developed-by: Dawid Wesierski <dawidx.wesierski@intel.com> Signed-off-by: Dawid Wesierski <dawidx.wesierski@intel.com> Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com> Signed-off-by: Kamil Maziarz <kamil.maziarz@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17iavf: use internal state to free traffic IRQsAhmed Zaki
If the system tries to close the netdev while iavf_reset_task() is running, __LINK_STATE_START will be cleared and netif_running() will return false in iavf_reinit_interrupt_scheme(). This will result in iavf_free_traffic_irqs() not being called and a leak as follows: [7632.489326] remove_proc_entry: removing non-empty directory 'irq/999', leaking at least 'iavf-enp24s0f0v0-TxRx-0' [7632.490214] WARNING: CPU: 0 PID: 10 at fs/proc/generic.c:718 remove_proc_entry+0x19b/0x1b0 is shown when pci_disable_msix() is later called. Fix by using the internal adapter state. The traffic IRQs will always exist if state == __IAVF_RUNNING. Fixes: 5b36e8d04b44 ("i40evf: Enable VF to request an alternate queue allocation") Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17iavf: Fix out-of-bounds when setting channels on removeDing Hui
If we set channels greater during iavf_remove(), and waiting reset done would be timeout, then returned with error but changed num_active_queues directly, that will lead to OOB like the following logs. Because the num_active_queues is greater than tx/rx_rings[] allocated actually. Reproducer: [root@host ~]# cat repro.sh #!/bin/bash pf_dbsf="0000:41:00.0" vf0_dbsf="0000:41:02.0" g_pids=() function do_set_numvf() { echo 2 >/sys/bus/pci/devices/${pf_dbsf}/sriov_numvfs sleep $((RANDOM%3+1)) echo 0 >/sys/bus/pci/devices/${pf_dbsf}/sriov_numvfs sleep $((RANDOM%3+1)) } function do_set_channel() { local nic=$(ls -1 --indicator-style=none /sys/bus/pci/devices/${vf0_dbsf}/net/) [ -z "$nic" ] && { sleep $((RANDOM%3)) ; return 1; } ifconfig $nic 192.168.18.5 netmask 255.255.255.0 ifconfig $nic up ethtool -L $nic combined 1 ethtool -L $nic combined 4 sleep $((RANDOM%3)) } function on_exit() { local pid for pid in "${g_pids[@]}"; do kill -0 "$pid" &>/dev/null && kill "$pid" &>/dev/null done g_pids=() } trap "on_exit; exit" EXIT while :; do do_set_numvf ; done & g_pids+=($!) while :; do do_set_channel ; done & g_pids+=($!) wait Result: [ 3506.152887] iavf 0000:41:02.0: Removing device [ 3510.400799] ================================================================== [ 3510.400820] BUG: KASAN: slab-out-of-bounds in iavf_free_all_tx_resources+0x156/0x160 [iavf] [ 3510.400823] Read of size 8 at addr ffff88b6f9311008 by task repro.sh/55536 [ 3510.400823] [ 3510.400830] CPU: 101 PID: 55536 Comm: repro.sh Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1 [ 3510.400832] Hardware name: Powerleader PR2008AL/H12DSi-N6, BIOS 2.0 04/09/2021 [ 3510.400835] Call Trace: [ 3510.400851] dump_stack+0x71/0xab [ 3510.400860] print_address_description+0x6b/0x290 [ 3510.400865] ? iavf_free_all_tx_resources+0x156/0x160 [iavf] [ 3510.400868] kasan_report+0x14a/0x2b0 [ 3510.400873] iavf_free_all_tx_resources+0x156/0x160 [iavf] [ 3510.400880] iavf_remove+0x2b6/0xc70 [iavf] [ 3510.400884] ? iavf_free_all_rx_resources+0x160/0x160 [iavf] [ 3510.400891] ? wait_woken+0x1d0/0x1d0 [ 3510.400895] ? notifier_call_chain+0xc1/0x130 [ 3510.400903] pci_device_remove+0xa8/0x1f0 [ 3510.400910] device_release_driver_internal+0x1c6/0x460 [ 3510.400916] pci_stop_bus_device+0x101/0x150 [ 3510.400919] pci_stop_and_remove_bus_device+0xe/0x20 [ 3510.400924] pci_iov_remove_virtfn+0x187/0x420 [ 3510.400927] ? pci_iov_add_virtfn+0xe10/0xe10 [ 3510.400929] ? pci_get_subsys+0x90/0x90 [ 3510.400932] sriov_disable+0xed/0x3e0 [ 3510.400936] ? bus_find_device+0x12d/0x1a0 [ 3510.400953] i40e_free_vfs+0x754/0x1210 [i40e] [ 3510.400966] ? i40e_reset_all_vfs+0x880/0x880 [i40e] [ 3510.400968] ? pci_get_device+0x7c/0x90 [ 3510.400970] ? pci_get_subsys+0x90/0x90 [ 3510.400982] ? pci_vfs_assigned.part.7+0x144/0x210 [ 3510.400987] ? __mutex_lock_slowpath+0x10/0x10 [ 3510.400996] i40e_pci_sriov_configure+0x1fa/0x2e0 [i40e] [ 3510.401001] sriov_numvfs_store+0x214/0x290 [ 3510.401005] ? sriov_totalvfs_show+0x30/0x30 [ 3510.401007] ? __mutex_lock_slowpath+0x10/0x10 [ 3510.401011] ? __check_object_size+0x15a/0x350 [ 3510.401018] kernfs_fop_write+0x280/0x3f0 [ 3510.401022] vfs_write+0x145/0x440 [ 3510.401025] ksys_write+0xab/0x160 [ 3510.401028] ? __ia32_sys_read+0xb0/0xb0 [ 3510.401031] ? fput_many+0x1a/0x120 [ 3510.401032] ? filp_close+0xf0/0x130 [ 3510.401038] do_syscall_64+0xa0/0x370 [ 3510.401041] ? page_fault+0x8/0x30 [ 3510.401043] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 3510.401073] RIP: 0033:0x7f3a9bb842c0 [ 3510.401079] Code: 73 01 c3 48 8b 0d d8 cb 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 89 24 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 fe dd 01 00 48 89 04 24 [ 3510.401080] RSP: 002b:00007ffc05f1fe18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 3510.401083] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f3a9bb842c0 [ 3510.401085] RDX: 0000000000000002 RSI: 0000000002327408 RDI: 0000000000000001 [ 3510.401086] RBP: 0000000002327408 R08: 00007f3a9be53780 R09: 00007f3a9c8a4700 [ 3510.401086] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ 3510.401087] R13: 0000000000000001 R14: 00007f3a9be52620 R15: 0000000000000001 [ 3510.401090] [ 3510.401093] Allocated by task 76795: [ 3510.401098] kasan_kmalloc+0xa6/0xd0 [ 3510.401099] __kmalloc+0xfb/0x200 [ 3510.401104] iavf_init_interrupt_scheme+0x26f/0x1310 [iavf] [ 3510.401108] iavf_watchdog_task+0x1d58/0x4050 [iavf] [ 3510.401114] process_one_work+0x56a/0x11f0 [ 3510.401115] worker_thread+0x8f/0xf40 [ 3510.401117] kthread+0x2a0/0x390 [ 3510.401119] ret_from_fork+0x1f/0x40 [ 3510.401122] 0xffffffffffffffff [ 3510.401123] In timeout handling, we should keep the original num_active_queues and reset num_req_queues to 0. Fixes: 4e5e6b5d9d13 ("iavf: Fix return of set the new channel count") Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Cc: Donglin Peng <pengdonglin@sangfor.com.cn> Cc: Huang Cun <huangcun@sangfor.com.cn> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17iavf: Fix use-after-free in free_netdevDing Hui
We do netif_napi_add() for all allocated q_vectors[], but potentially do netif_napi_del() for part of them, then kfree q_vectors and leave invalid pointers at dev->napi_list. Reproducer: [root@host ~]# cat repro.sh #!/bin/bash pf_dbsf="0000:41:00.0" vf0_dbsf="0000:41:02.0" g_pids=() function do_set_numvf() { echo 2 >/sys/bus/pci/devices/${pf_dbsf}/sriov_numvfs sleep $((RANDOM%3+1)) echo 0 >/sys/bus/pci/devices/${pf_dbsf}/sriov_numvfs sleep $((RANDOM%3+1)) } function do_set_channel() { local nic=$(ls -1 --indicator-style=none /sys/bus/pci/devices/${vf0_dbsf}/net/) [ -z "$nic" ] && { sleep $((RANDOM%3)) ; return 1; } ifconfig $nic 192.168.18.5 netmask 255.255.255.0 ifconfig $nic up ethtool -L $nic combined 1 ethtool -L $nic combined 4 sleep $((RANDOM%3)) } function on_exit() { local pid for pid in "${g_pids[@]}"; do kill -0 "$pid" &>/dev/null && kill "$pid" &>/dev/null done g_pids=() } trap "on_exit; exit" EXIT while :; do do_set_numvf ; done & g_pids+=($!) while :; do do_set_channel ; done & g_pids+=($!) wait Result: [ 4093.900222] ================================================================== [ 4093.900230] BUG: KASAN: use-after-free in free_netdev+0x308/0x390 [ 4093.900232] Read of size 8 at addr ffff88b4dc145640 by task repro.sh/6699 [ 4093.900233] [ 4093.900236] CPU: 10 PID: 6699 Comm: repro.sh Kdump: loaded Tainted: G O --------- -t - 4.18.0 #1 [ 4093.900238] Hardware name: Powerleader PR2008AL/H12DSi-N6, BIOS 2.0 04/09/2021 [ 4093.900239] Call Trace: [ 4093.900244] dump_stack+0x71/0xab [ 4093.900249] print_address_description+0x6b/0x290 [ 4093.900251] ? free_netdev+0x308/0x390 [ 4093.900252] kasan_report+0x14a/0x2b0 [ 4093.900254] free_netdev+0x308/0x390 [ 4093.900261] iavf_remove+0x825/0xd20 [iavf] [ 4093.900265] pci_device_remove+0xa8/0x1f0 [ 4093.900268] device_release_driver_internal+0x1c6/0x460 [ 4093.900271] pci_stop_bus_device+0x101/0x150 [ 4093.900273] pci_stop_and_remove_bus_device+0xe/0x20 [ 4093.900275] pci_iov_remove_virtfn+0x187/0x420 [ 4093.900277] ? pci_iov_add_virtfn+0xe10/0xe10 [ 4093.900278] ? pci_get_subsys+0x90/0x90 [ 4093.900280] sriov_disable+0xed/0x3e0 [ 4093.900282] ? bus_find_device+0x12d/0x1a0 [ 4093.900290] i40e_free_vfs+0x754/0x1210 [i40e] [ 4093.900298] ? i40e_reset_all_vfs+0x880/0x880 [i40e] [ 4093.900299] ? pci_get_device+0x7c/0x90 [ 4093.900300] ? pci_get_subsys+0x90/0x90 [ 4093.900306] ? pci_vfs_assigned.part.7+0x144/0x210 [ 4093.900309] ? __mutex_lock_slowpath+0x10/0x10 [ 4093.900315] i40e_pci_sriov_configure+0x1fa/0x2e0 [i40e] [ 4093.900318] sriov_numvfs_store+0x214/0x290 [ 4093.900320] ? sriov_totalvfs_show+0x30/0x30 [ 4093.900321] ? __mutex_lock_slowpath+0x10/0x10 [ 4093.900323] ? __check_object_size+0x15a/0x350 [ 4093.900326] kernfs_fop_write+0x280/0x3f0 [ 4093.900329] vfs_write+0x145/0x440 [ 4093.900330] ksys_write+0xab/0x160 [ 4093.900332] ? __ia32_sys_read+0xb0/0xb0 [ 4093.900334] ? fput_many+0x1a/0x120 [ 4093.900335] ? filp_close+0xf0/0x130 [ 4093.900338] do_syscall_64+0xa0/0x370 [ 4093.900339] ? page_fault+0x8/0x30 [ 4093.900341] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 4093.900357] RIP: 0033:0x7f16ad4d22c0 [ 4093.900359] Code: 73 01 c3 48 8b 0d d8 cb 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 89 24 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 fe dd 01 00 48 89 04 24 [ 4093.900360] RSP: 002b:00007ffd6491b7f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 4093.900362] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f16ad4d22c0 [ 4093.900363] RDX: 0000000000000002 RSI: 0000000001a41408 RDI: 0000000000000001 [ 4093.900364] RBP: 0000000001a41408 R08: 00007f16ad7a1780 R09: 00007f16ae1f2700 [ 4093.900364] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000002 [ 4093.900365] R13: 0000000000000001 R14: 00007f16ad7a0620 R15: 0000000000000001 [ 4093.900367] [ 4093.900368] Allocated by task 820: [ 4093.900371] kasan_kmalloc+0xa6/0xd0 [ 4093.900373] __kmalloc+0xfb/0x200 [ 4093.900376] iavf_init_interrupt_scheme+0x63b/0x1320 [iavf] [ 4093.900380] iavf_watchdog_task+0x3d51/0x52c0 [iavf] [ 4093.900382] process_one_work+0x56a/0x11f0 [ 4093.900383] worker_thread+0x8f/0xf40 [ 4093.900384] kthread+0x2a0/0x390 [ 4093.900385] ret_from_fork+0x1f/0x40 [ 4093.900387] 0xffffffffffffffff [ 4093.900387] [ 4093.900388] Freed by task 6699: [ 4093.900390] __kasan_slab_free+0x137/0x190 [ 4093.900391] kfree+0x8b/0x1b0 [ 4093.900394] iavf_free_q_vectors+0x11d/0x1a0 [iavf] [ 4093.900397] iavf_remove+0x35a/0xd20 [iavf] [ 4093.900399] pci_device_remove+0xa8/0x1f0 [ 4093.900400] device_release_driver_internal+0x1c6/0x460 [ 4093.900401] pci_stop_bus_device+0x101/0x150 [ 4093.900402] pci_stop_and_remove_bus_device+0xe/0x20 [ 4093.900403] pci_iov_remove_virtfn+0x187/0x420 [ 4093.900404] sriov_disable+0xed/0x3e0 [ 4093.900409] i40e_free_vfs+0x754/0x1210 [i40e] [ 4093.900415] i40e_pci_sriov_configure+0x1fa/0x2e0 [i40e] [ 4093.900416] sriov_numvfs_store+0x214/0x290 [ 4093.900417] kernfs_fop_write+0x280/0x3f0 [ 4093.900418] vfs_write+0x145/0x440 [ 4093.900419] ksys_write+0xab/0x160 [ 4093.900420] do_syscall_64+0xa0/0x370 [ 4093.900421] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 4093.900422] 0xffffffffffffffff [ 4093.900422] [ 4093.900424] The buggy address belongs to the object at ffff88b4dc144200 which belongs to the cache kmalloc-8k of size 8192 [ 4093.900425] The buggy address is located 5184 bytes inside of 8192-byte region [ffff88b4dc144200, ffff88b4dc146200) [ 4093.900425] The buggy address belongs to the page: [ 4093.900427] page:ffffea00d3705000 refcount:1 mapcount:0 mapping:ffff88bf04415c80 index:0x0 compound_mapcount: 0 [ 4093.900430] flags: 0x10000000008100(slab|head) [ 4093.900433] raw: 0010000000008100 dead000000000100 dead000000000200 ffff88bf04415c80 [ 4093.900434] raw: 0000000000000000 0000000000030003 00000001ffffffff 0000000000000000 [ 4093.900434] page dumped because: kasan: bad access detected [ 4093.900435] [ 4093.900435] Memory state around the buggy address: [ 4093.900436] ffff88b4dc145500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4093.900437] ffff88b4dc145580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4093.900438] >ffff88b4dc145600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4093.900438] ^ [ 4093.900439] ffff88b4dc145680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4093.900440] ffff88b4dc145700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4093.900440] ================================================================== Although the patch #2 (of 2) can avoid the issue triggered by this repro.sh, there still are other potential risks that if num_active_queues is changed to less than allocated q_vectors[] by unexpected, the mismatched netif_napi_add/del() can also cause UAF. Since we actually call netif_napi_add() for all allocated q_vectors unconditionally in iavf_alloc_q_vectors(), so we should fix it by letting netif_napi_del() match to netif_napi_add(). Fixes: 5eae00c57f5e ("i40evf: main driver core") Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Cc: Donglin Peng <pengdonglin@sangfor.com.cn> Cc: Huang Cun <huangcun@sangfor.com.cn> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-07-17iomap: micro optimize the ki_pos assignment in iomap_file_buffered_writeChristoph Hellwig
We have the new value for ki_pos right at hand in iter.pos, so assign that instead of recalculating it from ret. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.harjani@gmail.com>
2023-07-17iomap: fix a regression for partial write errorsChristoph Hellwig
When write* wrote some data it should return the amount of written data and not the error code that caused it to stop. Fix a recent regression in iomap_file_buffered_write that caused it to return the errno instead. Fixes: 219580eea1ee ("iomap: update ki_pos in iomap_file_buffered_write") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Cyril Hrubis <chrubis@suse.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.harjani@gmail.com>
2023-07-17xfs: convert flex-array declarations in xfs attr shortform objectsDarrick J. Wong
As of 6.5-rc1, UBSAN trips over the ondisk extended attribute shortform definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
2023-07-17xfs: convert flex-array declarations in xfs attr leaf blocksDarrick J. Wong
As of 6.5-rc1, UBSAN trips over the ondisk extended attribute leaf block definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. ================================================================================ UBSAN: array-index-out-of-bounds in fs/xfs/libxfs/xfs_attr_leaf.c:2535:24 index 2 is out of range for type '__u8 [1]' Call Trace: <TASK> dump_stack_lvl+0x33/0x50 __ubsan_handle_out_of_bounds+0x9c/0xd0 xfs_attr3_leaf_getvalue+0x2ce/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_leaf_get+0x148/0x1c0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_get_ilocked+0xae/0x110 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_get+0xee/0x150 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_xattr_get+0x7d/0xc0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] __vfs_getxattr+0xa3/0x100 vfs_getxattr+0x87/0x1d0 do_getxattr+0x17a/0x220 getxattr+0x89/0xf0 Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
2023-07-17xfs: convert flex-array declarations in struct xfs_attrlist*Darrick J. Wong
As of 6.5-rc1, UBSAN trips over the attrlist ioctl definitions using an array length of 1 to pretend to be a flex array. Kernel compilers have to support unbounded array declarations, so let's correct this. This may cause friction with userspace header declarations, but suck is life. ================================================================================ UBSAN: array-index-out-of-bounds in fs/xfs/xfs_ioctl.c:345:18 index 1 is out of range for type '__s32 [1]' Call Trace: <TASK> dump_stack_lvl+0x33/0x50 __ubsan_handle_out_of_bounds+0x9c/0xd0 xfs_ioc_attr_put_listent+0x413/0x420 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_list_ilocked+0x170/0x850 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attr_list+0xb7/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_ioc_attr_list+0x13b/0x2e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_attrlist_by_handle+0xab/0x120 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] xfs_file_ioctl+0x1ff/0x15e0 [xfs 4a986a89a77bb77402ab8a87a37da369ef6a3f09] vfs_ioctl+0x1f/0x60 The kernel and xfsprogs code that uses these structures will not have problems, but the long tail of external user programs might. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org>
2023-07-17Merge patch series "can: gs_usb: fix time stamp counter initialization"Marc Kleine-Budde
Marc Kleine-Budde <mkl@pengutronix.de> says: During testing I noticed a crash if unloading/loading the gs_usb driver during high CAN bus load. The current version of the candlelight firmware doesn't flush the queues of the received CAN frames during the reset command. This leads to a crash if hardware timestamps are enabled, and an URB from the device is received before the cycle counter/time counter infrastructure has been setup. First clean up then error handling in gs_can_open(). Then, fix the problem by converting the cycle counter/time counter infrastructure from a per-channel to per-device and set it up before submitting RX-URBs to the USB stack. Link: https://lore.kernel.org/all/20230716-gs_usb-fix-time-stamp-counter-v1-0-9017cefcd9d5@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-07-17perf test task_exit: No need for a cycles event to check if we get an ↵Arnaldo Carvalho de Melo
PERF_RECORD_EXIT The intent of this test is to check we get a PERF_RECORD_EXIT as asked for by setting perf_event_attr.task=1. When the test was written we didn't had the "dummy" event so we went with the default event, "cycles". There were reports of this test failing sometimes, one of these reports was with a PREEMPT_RT_FULL, but I noticed it failing sometimes with an aarch64 Firefly board. In the kernel the call to perf_event_task_output(), that generates the PERF_RECORD_EXIT may fail when there is not enough memory in the ring buffer, if the ring buffer is paused, etc. So switch to using the "dummy" event to use the ring buffer just for what the test was designed for, avoiding uneeded PERF_RECORD_SAMPLEs. Acked-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/lkml/ZLGXmMuNRpx1ubFm@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2023-07-17powerpc/kasan: Disable KCOV in KASAN codeBenjamin Gray
As per the generic KASAN code in mm/kasan, disable KCOV with KCOV_INSTRUMENT := n in the makefile. This fixes a ppc64 boot hang when KCOV and KASAN are enabled. kasan_early_init() gets called before a PACA is initialised, but the KCOV hook expects a valid PACA. Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Benjamin Gray <bgray@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230710044143.146840-1-bgray@linux.ibm.com
2023-07-17Revert "drm/i915: use localized __diag_ignore_all() instead of per file"Jani Nikula
This reverts commit 88e9664434c994e97a9f6f8cdd1535495c660cea. __diag_ignore_all() only works for GCC 8 or later. -Woverride-init (from -Wextra, enabled in i915 Makefile) combined with CONFIG_WERROR=y or W=e breaks the build for older GCC. With i386_defconfig and x86_64_defconfig enabling CONFIG_WERROR=y by default, we really need to roll back the change. An alternative would be to disable -Woverride-init in the Makefile for GCC <8, but the revert seems like the safest bet now. Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8768 Reported-by: John Garry <john.g.garry@oracle.com> References: https://lore.kernel.org/r/ad2601c0-84bb-c574-3702-a83ff8faf98c@oracle.com References: https://lore.kernel.org/r/87wmzezns4.fsf@intel.com Fixes: 88e9664434c9 ("drm/i915: use localized __diag_ignore_all() instead of per file") Cc: Gustavo Sousa <gustavo.sousa@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com> Tested-by: John Garry <john.g.garry@oracle.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230711110214.25093-1-jani.nikula@intel.com (cherry picked from commit 290d161045753240f2100b8f44660426ecc97be5) Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
2023-07-17drm/i915/perf: add sentinel to xehp_oa_b_countersAndrzej Hajda
Arrays passed to reg_in_range_table should end with empty record. The patch solves KASAN detected bug with signature: BUG: KASAN: global-out-of-bounds in xehp_is_valid_b_counter_addr+0x2c7/0x350 [i915] Read of size 4 at addr ffffffffa1555d90 by task perf/1518 CPU: 4 PID: 1518 Comm: perf Tainted: G U 6.4.0-kasan_438-g3303d06107f3+ #1 Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3223.D80.2305311348 05/31/2023 Call Trace: <TASK> ... xehp_is_valid_b_counter_addr+0x2c7/0x350 [i915] Fixes: 0fa9349dda03 ("drm/i915/perf: complete programming whitelisting for XEHPSDV") Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230711153410.1224997-1-andrzej.hajda@intel.com (cherry picked from commit 2f42c5afb34b5696cf5fe79e744f99be9b218798) Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
2023-07-17can: gs_usb: fix time stamp counter initializationMarc Kleine-Budde
If the gs_usb device driver is unloaded (or unbound) before the interface is shut down, the USB stack first calls the struct usb_driver::disconnect and then the struct net_device_ops::ndo_stop callback. In gs_usb_disconnect() all pending bulk URBs are killed, i.e. no more RX'ed CAN frames are send from the USB device to the host. Later in gs_can_close() a reset control message is send to each CAN channel to remove the controller from the CAN bus. In this race window the USB device can still receive CAN frames from the bus and internally queue them to be send to the host. At least in the current version of the candlelight firmware, the queue of received CAN frames is not emptied during the reset command. After loading (or binding) the gs_usb driver, new URBs are submitted during the struct net_device_ops::ndo_open callback and the candlelight firmware starts sending its already queued CAN frames to the host. However, this scenario was not considered when implementing the hardware timestamp function. The cycle counter/time counter infrastructure is set up (gs_usb_timestamp_init()) after the USBs are submitted, resulting in a NULL pointer dereference if timecounter_cyc2time() (via the call chain: gs_usb_receive_bulk_callback() -> gs_usb_set_timestamp() -> gs_usb_skb_set_timestamp()) is called too early. Move the gs_usb_timestamp_init() function before the URBs are submitted to fix this problem. For a comprehensive solution, we need to consider gs_usb devices with more than 1 channel. The cycle counter/time counter infrastructure is setup per channel, but the RX URBs are per device. Once gs_can_open() of _a_ channel has been called, and URBs have been submitted, the gs_usb_receive_bulk_callback() can be called for _all_ available channels, even for channels that are not running, yet. As cycle counter/time counter has not set up, this will again lead to a NULL pointer dereference. Convert the cycle counter/time counter from a "per channel" to a "per device" functionality. Also set it up, before submitting any URBs to the device. Further in gs_usb_receive_bulk_callback(), don't process any URBs for not started CAN channels, only resubmit the URB. Fixes: 45dfa45f52e6 ("can: gs_usb: add RX and TX hardware timestamp support") Closes: https://github.com/candle-usb/candleLight_fw/issues/137#issuecomment-1623532076 Cc: stable@vger.kernel.org Cc: John Whittington <git@jbrengineering.co.uk> Link: https://lore.kernel.org/all/20230716-gs_usb-fix-time-stamp-counter-v1-2-9017cefcd9d5@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-07-17can: gs_usb: gs_can_open(): improve error handlingMarc Kleine-Budde
The gs_usb driver handles USB devices with more than 1 CAN channel. The RX path for all channels share the same bulk endpoint (the transmitted bulk data encodes the channel number). These per-device resources are allocated and submitted by the first opened channel. During this allocation, the resources are either released immediately in case of a failure or the URBs are anchored. All anchored URBs are finally killed with gs_usb_disconnect(). Currently, gs_can_open() returns with an error if the allocation of a URB or a buffer fails. However, if usb_submit_urb() fails, the driver continues with the URBs submitted so far, even if no URBs were successfully submitted. Treat every error as fatal and free all allocated resources immediately. Switch to goto-style error handling, to prepare the driver for more per-device resource allocation. Cc: stable@vger.kernel.org Cc: John Whittington <git@jbrengineering.co.uk> Link: https://lore.kernel.org/all/20230716-gs_usb-fix-time-stamp-counter-v1-1-9017cefcd9d5@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>