summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2024-02-20dm vdo: add thread and synchronization utilitiesMatthew Sakai
This patch adds utilities for managing and using named threads, as well as several locking and synchronization utilities. These utilities help dm-vdo minimize thread transitions and manage interactions between threads. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Co-developed-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm vdo: add vdo type declarations, constants, and simple data structuresMatthew Sakai
Add definitions of constants defining the fixed parameters of a VDO volume, and the default and maximum values of configurable or dynamic parameters. Add definitions of internal status codes used for internal communication within the module and for logging. Add definitions of types and structs used to manage the processing of an I/O operation. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm vdo: add basic logging and support utilitiesMatthew Sakai
Add various support utilities for the vdo target and deduplication index, including logging utilities, string and time management, and index-specific error codes. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm vdo: add memory allocation utilitiesMatthew Sakai
This patch adds standardized allocation macros and memory tracking tools to track and report any allocated memory that is not freed. This makes it easier to ensure that the vdo target does not leak memory. This patch also adds utilities for controlling whether certain threads are allowed to allocate memory, since memory allocation during certain critical code sections can cause the vdo target to deadlock. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Michael Sclafani <dm-devel@lists.linux.dev> Signed-off-by: Michael Sclafani <dm-devel@lists.linux.dev> Co-developed-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Signed-off-by: Thomas Jaskiewicz <tom@jaskiewicz.us> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm vdo: add the MurmurHash3 fast hashing algorithmMatthew Sakai
MurmurHash3 is a fast, non-cryptographic, 128-bit hash. It was originally written by Austin Appleby and placed in the public domain. This version has been modified to produce the same result on both big endian and little endian processors, making it suitable for use in portable persistent data. Co-developed-by: J. corwin Coburn <corwin@hurlbutnet.net> Signed-off-by: J. corwin Coburn <corwin@hurlbutnet.net> Co-developed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Ken Raeburn <raeburn@redhat.com> Co-developed-by: John Wiele <jwiele@redhat.com> Signed-off-by: John Wiele <jwiele@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm crypt: remove redundant state settings after waking upLizhe
The task status has been set to TASK_RUNNING in schedule(). No need to set again here. Signed-off-by: Lizhe <sensor1010@163.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm thin: add braces around conditional code that spans linesMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm-crypt, dm-integrity, dm-verity: bump target versionMike Snitzer
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm-verity, dm-crypt: align "struct bvec_iter" correctlyMikulas Patocka
"struct bvec_iter" is defined with the __packed attribute, so it is aligned on a single byte. On X86 (and on other architectures that support unaligned addresses in hardware), "struct bvec_iter" is accessed using the 8-byte and 4-byte memory instructions, however these instructions are less efficient if they operate on unaligned addresses. (on RISC machines that don't have unaligned access in hardware, GCC generates byte-by-byte accesses that are very inefficient - see [1]) This commit reorders the entries in "struct dm_verity_io" and "struct convert_context", so that "struct bvec_iter" is aligned on 8 bytes. [1] https://lore.kernel.org/all/ZcLuWUNRZadJr0tQ@fedora/T/ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm-crypt: recheck the integrity tag after a failureMikulas Patocka
If a userspace process reads (with O_DIRECT) multiple blocks into the same buffer, dm-crypt reports an authentication error [1]. The error is reported in a log and it may cause RAID leg being kicked out of the array. This commit fixes dm-crypt, so that if integrity verification fails, the data is read again into a kernel buffer (where userspace can't modify it) and the integrity tag is rechecked. If the recheck succeeds, the content of the kernel buffer is copied into the user buffer; if the recheck fails, an integrity error is reported. [1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm-crypt: don't modify the data when using authenticated encryptionMikulas Patocka
It was said that authenticated encryption could produce invalid tag when the data that is being encrypted is modified [1]. So, fix this problem by copying the data into the clone bio first and then encrypt them inside the clone bio. This may reduce performance, but it is needed to prevent the user from corrupting the device by writing data with O_DIRECT and modifying them at the same time. [1] https://lore.kernel.org/all/20240207004723.GA35324@sol.localdomain/T/ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm-verity: recheck the hash after a failureMikulas Patocka
If a userspace process reads (with O_DIRECT) multiple blocks into the same buffer, dm-verity reports an error [1]. This commit fixes dm-verity, so that if hash verification fails, the data is read again into a kernel buffer (where userspace can't modify it) and the hash is rechecked. If the recheck succeeds, the content of the kernel buffer is copied into the user buffer; if the recheck fails, an error is reported. [1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20dm-integrity: recheck the integrity tag after a failureMikulas Patocka
If a userspace process reads (with O_DIRECT) multiple blocks into the same buffer, dm-integrity reports an error [1]. The error is reported in a log and it may cause RAID leg being kicked out of the array. This commit fixes dm-integrity, so that if integrity verification fails, the data is read again into a kernel buffer (where userspace can't modify it) and the integrity tag is rechecked. If the recheck succeeds, the content of the kernel buffer is copied into the user buffer; if the recheck fails, an integrity error is reported. [1] https://people.redhat.com/~mpatocka/testcases/blk-auth-modify/read2.c Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-02-20treewide: replace or remove redundant def_bool in Kconfig filesMasahiro Yamada
'def_bool X' is a shorthand for 'bool' plus 'default X'. 'def_bool' is redundant where 'bool' is already present, so 'def_bool X' can be replaced with 'default X', or removed if X is 'n'. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-02-19bcache: pass queue_limits to blk_mq_alloc_diskChristoph Hellwig
Pass the queue limits directly to blk_alloc_disk instead of setting them one at a time. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20240215071055.2201424-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-19block: pass a queue_limits argument to blk_alloc_diskChristoph Hellwig
Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Also change blk_alloc_disk to return an ERR_PTR instead of just NULL which can't distinguish errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-16Merge tag 'md-6.9-20240216' of ↵Jens Axboe
https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.9/block Pull MD changes from Song: "1. Cleanup redundant checks, by Yu Kuai. 2. Remove deprecated headers, by Marc Zyngier and Song Liu. 3. Concurrency fixes, by Li Lingfeng. 4. Memory leak fix, by Li Nan." * tag 'md-6.9-20240216' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: md: fix kmemleak of rdev->serial md/multipath: Remove md-multipath.h md/linear: Get rid of md-linear.h md: use RCU lock to protect traversal in md_spares_need_change() md: get rdev->mddev with READ_ONCE() md: remove redundant md_wakeup_thread() md: remove redundant check of 'mddev->sync_thread'
2024-02-15md: Don't suspend the array for interrupted reshapeYu Kuai
md_start_sync() will suspend the array if there are spares that can be added or removed from conf, however, if reshape is still in progress, this won't happen at all or data will be corrupted(remove_and_add_spares won't be called from md_choose_sync_action for reshape), hence there is no need to suspend the array if reshape is not done yet. Meanwhile, there is a potential deadlock for raid456: 1) reshape is interrupted; 2) set one of the disk WantReplacement, and add a new disk to the array, however, recovery won't start until the reshape is finished; 3) then issue an IO across reshpae position, this IO will wait for reshape to make progress; 4) continue to reshape, then md_start_sync() found there is a spare disk that can be added to conf, mddev_suspend() is called; Step 4 and step 3 is waiting for each other, deadlock triggered. Noted this problem is found by code review, and it's not reporduced yet. Fix this porblem by don't suspend the array for interrupted reshape, this is safe because conf won't be changed until reshape is done. Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-6-yukuai1@huaweicloud.com
2024-02-15md: Don't register sync_thread for reshape directlyYu Kuai
Currently, if reshape is interrupted, then reassemble the array will register sync_thread directly from pers->run(), in this case 'MD_RECOVERY_RUNNING' is set directly, however, there is no guarantee that md_do_sync() will be executed, hence stop_sync_thread() will hang because 'MD_RECOVERY_RUNNING' can't be cleared. Last patch make sure that md_do_sync() will set MD_RECOVERY_DONE, however, following hang can still be triggered by dm-raid test shell/lvconvert-raid-reshape.sh occasionally: [root@fedora ~]# cat /proc/1982/stack [<0>] stop_sync_thread+0x1ab/0x270 [md_mod] [<0>] md_frozen_sync_thread+0x5c/0xa0 [md_mod] [<0>] raid_presuspend+0x1e/0x70 [dm_raid] [<0>] dm_table_presuspend_targets+0x40/0xb0 [dm_mod] [<0>] __dm_destroy+0x2a5/0x310 [dm_mod] [<0>] dm_destroy+0x16/0x30 [dm_mod] [<0>] dev_remove+0x165/0x290 [dm_mod] [<0>] ctl_ioctl+0x4bb/0x7b0 [dm_mod] [<0>] dm_ctl_ioctl+0x11/0x20 [dm_mod] [<0>] vfs_ioctl+0x21/0x60 [<0>] __x64_sys_ioctl+0xb9/0xe0 [<0>] do_syscall_64+0xc6/0x230 [<0>] entry_SYSCALL_64_after_hwframe+0x6c/0x74 Meanwhile mddev->recovery is: MD_RECOVERY_RUNNING | MD_RECOVERY_INTR | MD_RECOVERY_RESHAPE | MD_RECOVERY_FROZEN Fix this problem by remove the code to register sync_thread directly from raid10 and raid5. And let md_check_recovery() to register sync_thread. Fixes: f67055780caa ("[PATCH] md: Checkpoint and allow restart of raid5 reshape") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-5-yukuai1@huaweicloud.com
2024-02-15md: Make sure md_do_sync() will set MD_RECOVERY_DONEYu Kuai
stop_sync_thread() will interrupt md_do_sync(), and md_do_sync() must set MD_RECOVERY_DONE, so that follow up md_check_recovery() will unregister sync_thread, clear MD_RECOVERY_RUNNING and wake up stop_sync_thread(). If MD_RECOVERY_WAIT is set or the array is read-only, md_do_sync() will return without setting MD_RECOVERY_DONE, and after commit f52f5c71f3d4 ("md: fix stopping sync thread"), dm-raid switch from md_reap_sync_thread() to stop_sync_thread() to unregister sync_thread from md_stop() and md_stop_writes(), causing the test shell/lvconvert-raid-reshape.sh hang. We shouldn't switch back to md_reap_sync_thread() because it's problematic in the first place. Fix the problem by making sure md_do_sync() will set MD_RECOVERY_DONE. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/ece2b06f-d647-6613-a534-ff4c9bec1142@redhat.com/ Fixes: d5d885fd514f ("md: introduce new personality funciton start()") Fixes: 5fd6c1dce06e ("[PATCH] md: allow checkpoint of recovery with version-1 superblock") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-4-yukuai1@huaweicloud.com
2024-02-15md: Don't ignore read-only array in md_check_recovery()Yu Kuai
Usually if the array is not read-write, md_check_recovery() won't register new sync_thread in the first place. And if the array is read-write and sync_thread is registered, md_set_readonly() will unregister sync_thread before setting the array read-only. md/raid follow this behavior hence there is no problem. After commit f52f5c71f3d4 ("md: fix stopping sync thread"), following hang can be triggered by test shell/integrity-caching.sh: 1) array is read-only. dm-raid update super block: rs_update_sbs ro = mddev->ro mddev->ro = 0 -> set array read-write md_update_sb 2) register new sync thread concurrently. 3) dm-raid set array back to read-only: rs_update_sbs mddev->ro = ro 4) stop the array: raid_dtr md_stop stop_sync_thread set_bit(MD_RECOVERY_INTR, &mddev->recovery); md_wakeup_thread_directly(mddev->sync_thread); wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) 5) sync thread done: md_do_sync set_bit(MD_RECOVERY_DONE, &mddev->recovery); md_wakeup_thread(mddev->thread); 6) daemon thread can't unregister sync thread: md_check_recovery if (!md_is_rdwr(mddev) && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery)) return; -> -> MD_RECOVERY_RUNNING can't be cleared, hence step 4 hang; The root cause is that dm-raid manipulate 'mddev->ro' by itself, however, dm-raid really should stop sync thread before setting the array read-only. Unfortunately, I need to read more code before I can refacter the handler of 'mddev->ro' in dm-raid, hence let's fix the problem the easy way for now to prevent dm-raid regression. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/9801e40-8ac7-e225-6a71-309dcf9dc9aa@redhat.com/ Fixes: ecbfb9f118bc ("dm raid: add raid level takeover support") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-3-yukuai1@huaweicloud.com
2024-02-15md: Don't ignore suspended array in md_check_recovery()Yu Kuai
mddev_suspend() never stop sync_thread, hence it doesn't make sense to ignore suspended array in md_check_recovery(), which might cause sync_thread can't be unregistered. After commit f52f5c71f3d4 ("md: fix stopping sync thread"), following hang can be triggered by test shell/integrity-caching.sh: 1) suspend the array: raid_postsuspend mddev_suspend 2) stop the array: raid_dtr md_stop __md_stop_writes stop_sync_thread set_bit(MD_RECOVERY_INTR, &mddev->recovery); md_wakeup_thread_directly(mddev->sync_thread); wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) 3) sync thread done: md_do_sync set_bit(MD_RECOVERY_DONE, &mddev->recovery); md_wakeup_thread(mddev->thread); 4) daemon thread can't unregister sync thread: md_check_recovery if (mddev->suspended) return; -> return directly md_read_sync_thread clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery); -> MD_RECOVERY_RUNNING can't be cleared, hence step 2 hang; This problem is not just related to dm-raid, fix it by ignoring suspended array in md_check_recovery(). And follow up patches will improve dm-raid better to frozen sync thread during suspend. Reported-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/8fb335e-6d2c-dbb5-d7-ded8db5145a@redhat.com/ Fixes: 68866e425be2 ("MD: no sync IO while suspended") Fixes: f52f5c71f3d4 ("md: fix stopping sync thread") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-2-yukuai1@huaweicloud.com
2024-02-12md: fix kmemleak of rdev->serialLi Nan
If kobject_add() is fail in bind_rdev_to_array(), 'rdev->serial' will be alloc not be freed, and kmemleak occurs. unreferenced object 0xffff88815a350000 (size 49152): comm "mdadm", pid 789, jiffies 4294716910 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc f773277a): [<0000000058b0a453>] kmemleak_alloc+0x61/0xe0 [<00000000366adf14>] __kmalloc_large_node+0x15e/0x270 [<000000002e82961b>] __kmalloc_node.cold+0x11/0x7f [<00000000f206d60a>] kvmalloc_node+0x74/0x150 [<0000000034bf3363>] rdev_init_serial+0x67/0x170 [<0000000010e08fe9>] mddev_create_serial_pool+0x62/0x220 [<00000000c3837bf0>] bind_rdev_to_array+0x2af/0x630 [<0000000073c28560>] md_add_new_disk+0x400/0x9f0 [<00000000770e30ff>] md_ioctl+0x15bf/0x1c10 [<000000006cfab718>] blkdev_ioctl+0x191/0x3f0 [<0000000085086a11>] vfs_ioctl+0x22/0x60 [<0000000018b656fe>] __x64_sys_ioctl+0xba/0xe0 [<00000000e54e675e>] do_syscall_64+0x71/0x150 [<000000008b0ad622>] entry_SYSCALL_64_after_hwframe+0x6c/0x74 Fixes: 963c555e75b0 ("md: introduce mddev_create/destroy_wb_pool for the change of member device") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240208085556.2412922-1-linan666@huaweicloud.com
2024-02-12block: remove gfp_flags from blkdev_zone_mgmtJohannes Thumshirn
Now that all callers pass in GFP_KERNEL to blkdev_zone_mgmt() and use memalloc_no{io,fs}_{save,restore}() to define the allocation scope, we can drop the gfp_mask parameter from blkdev_zone_mgmt() as well as blkdev_zone_reset_all() and blkdev_zone_reset_all_emulated(). Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-5-ae3b7c8def61@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-12dm: dm-zoned: guard blkdev_zone_mgmt with noio scopeJohannes Thumshirn
Guard the calls to blkdev_zone_mgmt() with a memalloc_noio scope. This helps us getting rid of the GFP_NOIO argument to blkdev_zone_mgmt(); Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-2-ae3b7c8def61@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-07md: Fix missing release of 'active_io' for flushYu Kuai
submit_flushes atomic_set(&mddev->flush_pending, 1); rdev_for_each_rcu(rdev, mddev) atomic_inc(&mddev->flush_pending); bi->bi_end_io = md_end_flush submit_bio(bi); /* flush io is done first */ md_end_flush if (atomic_dec_and_test(&mddev->flush_pending)) percpu_ref_put(&mddev->active_io) -> active_io is not released if (atomic_dec_and_test(&mddev->flush_pending)) -> missing release of active_io For consequence, mddev_suspend() will wait for 'active_io' to be zero forever. Fix this problem by releasing 'active_io' in submit_flushes() if 'flush_pending' is decreased to zero. Fixes: fa2bbff7b0b4 ("md: synchronize flush io with array reconfiguration") Cc: stable@vger.kernel.org # v6.1+ Reported-by: Blazej Kucman <blazej.kucman@linux.intel.com> Closes: https://lore.kernel.org/lkml/20240130172524.0000417b@linux.intel.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201092559.910982-7-yukuai1@huaweicloud.com
2024-02-05md/multipath: Remove md-multipath.hSong Liu
md-multipath is already deprecated. Remove the header file. Signed-off-by: Song Liu <song@kernel.org>
2024-02-05md/linear: Get rid of md-linear.hMarc Zyngier
Given that 849d18e27be9 ("md: Remove deprecated CONFIG_MD_LINEAR") killed the linear flavour of MD, it seems only logical to drop the leftover include file that used to come with it. I also feel that it should be my own privilege to remove my 30 year old attempt at writing kernel code ;-). RIP! Cc: Song Liu <song@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240201224549.750644-1-maz@kernel.org
2024-02-05md: use RCU lock to protect traversal in md_spares_need_change()Li Lingfeng
Since md_start_sync() will be called without the protect of mddev_lock, and it can run concurrently with array reconfiguration, traversal of rdev in it should be protected by RCU lock. Commit bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") added md_spares_need_change() to md_start_sync(), casusing use of rdev without any protection. Fix this by adding RCU lock in md_spares_need_change(). Fixes: bc08041b32ab ("md: suspend array in md_start_sync() if array need reconfiguration") Cc: stable@vger.kernel.org # 6.7+ Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240104133629.1277517-1-lilingfeng@huaweicloud.com
2024-02-05md: get rdev->mddev with READ_ONCE()Li Lingfeng
Users may get rdev->mddev by sysfs while rdev is releasing. So use both READ_ONCE() and WRITE_ONCE() to prevent load/store tearing and to read/write mddev atomically. Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231229070500.3602712-1-lilingfeng@huaweicloud.com
2024-02-05md: remove redundant md_wakeup_thread()Yu Kuai
On the one hand, mddev_unlock() will call md_wakeup_thread() unconditionally; on the other hand, md_check_recovery() can't make progress if 'reconfig_mutex' can't be grabbed. Hence, it really doesn't make sense to wake up daemon thread while 'reconfig_mutex' is still grabbed. Remove all the md_wakup_thread() for 'mddev->thread' while 'reconfig_mtuex' is still grabbed. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231228125553.2697765-3-yukuai1@huaweicloud.com
2024-02-05md: remove redundant check of 'mddev->sync_thread'Yu Kuai
The lifetime of sync_thread: 1) Set MD_RECOVERY_NEEDED and wake up daemon thread (by ioctl/sysfs or other events); 2) Daemon thread woke up, md_check_recovery() found that MD_RECOVERY_NEEDED is set: a) try to grab reconfig_mutex; b) set MD_RECOVERY_RUNNING; c) clear MD_RECOVERY_NEEDED, and then queue sync_work; 3) md_start_sync() choose sync_action, then register sync_thread; 4) md_do_sync() is done, set MD_RECOVERY_DONE and wake up daemon thread; 5) Daemon thread woke up, md_check_recovery() found that MD_RECOVERY_DONE is set: a) try to grab reconfig_mutex; b) unregister sync_thread; c) clear MD_RECOVERY_RUNNING and MD_RECOVERY_DONE; Hence there is no such case that MD_RECOVERY_RUNNING is not set, while sync_thread is registered. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20231228125553.2697765-2-yukuai1@huaweicloud.com
2024-02-02dm-crypt, dm-verity: disable taskletsMikulas Patocka
Tasklets have an inherent problem with memory corruption. The function tasklet_action_common calls tasklet_trylock, then it calls the tasklet callback and then it calls tasklet_unlock. If the tasklet callback frees the structure that contains the tasklet or if it calls some code that may free it, tasklet_unlock will write into free memory. The commits 8e14f610159d and d9a02e016aaf try to fix it for dm-crypt, but it is not a sufficient fix and the data corruption can still happen [1]. There is no fix for dm-verity and dm-verity will write into free memory with every tasklet-processed bio. There will be atomic workqueues implemented in the kernel 6.9 [2]. They will have better interface and they will not suffer from the memory corruption problem. But we need something that stops the memory corruption now and that can be backported to the stable kernels. So, I'm proposing this commit that disables tasklets in both dm-crypt and dm-verity. This commit doesn't remove the tasklet support, because the tasklet code will be reused when atomic workqueues will be implemented. [1] https://lore.kernel.org/all/d390d7ee-f142-44d3-822a-87949e14608b@suse.de/T/ [2] https://lore.kernel.org/lkml/20240130091300.2968534-1-tj@kernel.org/ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: 39d42fa96ba1b ("dm crypt: add flags to optionally bypass kcryptd workqueues") Fixes: 5721d4e5a9cdb ("dm verity: Add optional "try_verify_in_tasklet" feature") Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-01-30dm writecache: allow allocations larger than 2GiBMikulas Patocka
The function kvmalloc_node limits the allocation size to INT_MAX. This limit will be overflowed if dm-writecache attempts to map a device with 1TiB or larger length. This commit changes kvmalloc_array to vmalloc_array to avoid the limit. The commit also changes vmalloc(array_size()) to vmalloc_array(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-01-30dm stats: limit the number of entriesMikulas Patocka
The kvmalloc function fails with a warning if the size is larger than INT_MAX. Linus said that there should be limits that prevent this warning from being hit. This commit adds the limits to the dm-stats subsystem in DM core. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-01-30dm: limit the number of targets and parameter size areaMikulas Patocka
The kvmalloc function fails with a warning if the size is larger than INT_MAX. The warning was triggered by a syscall testing robot. In order to avoid the warning, this commit limits the number of targets to 1048576 and the size of the parameter area to 1073741824. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2024-01-24md: fix a suspicious RCU usage warningMikulas Patocka
RCU protection was removed in the commit 2d32777d60de ("raid1: remove rcu protection to access rdev from conf"). However, the code in fix_read_error does rcu_dereference outside rcu_read_lock - this triggers the following warning. The warning is triggered by a LVM2 test shell/integrity-caching.sh. This commit removes rcu_dereference. ============================= WARNING: suspicious RCU usage 6.7.0 #2 Not tainted ----------------------------- drivers/md/raid1.c:2265 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 no locks held by mdX_raid1/1859. stack backtrace: CPU: 2 PID: 1859 Comm: mdX_raid1 Not tainted 6.7.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x60/0x70 lockdep_rcu_suspicious+0x153/0x1b0 raid1d+0x1732/0x1750 [raid1] ? lock_acquire+0x9f/0x270 ? finish_wait+0x3d/0x80 ? md_thread+0xf7/0x130 [md_mod] ? lock_release+0xaa/0x230 ? md_register_thread+0xd0/0xd0 [md_mod] md_thread+0xa0/0x130 [md_mod] ? housekeeping_test_cpu+0x30/0x30 kthread+0xdc/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x28/0x40 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork_asm+0x11/0x20 </TASK> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: ca294b34aaf3 ("md/raid1: support read error check") Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/51539879-e1ca-fde3-b8b4-8934ddedcbc@redhat.com
2024-01-18Merge tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes, Christoph) - discard fixes and improvements (Christoph) - timeout debug improvements (Keith, Max) - various cleanups (Daniel, Max, Giuxen) - trace event string fixes (Arnd) - shadow doorbell setup on reset fix (William) - a write zeroes quirk for SK Hynix (Jim) - MD pull request via Song: - Sparse warning since v6.0 (Bart) - /proc/mdstat regression since v6.7 (Yu Kuai) - Use symbolic error value (Christian) - IO Priority documentation update (Christian) - Fix for accessing queue limits without having entered the queue (Christoph, me) - Fix for loop dio support (Christoph) - Move null_blk off deprecated ida interface (Christophe) - Ensure nbd initializes full msghdr (Eric) - Fix for a regression with the folio conversion, which is now easier to hit because of an unrelated change (Matthew) - Remove redundant check in virtio-blk (Li) - Fix for a potential hang in sbitmap (Ming) - Fix for partial zone appending (Damien) - Misc changes and fixes (Bart, me, Kemeng, Dmitry) * tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux: (45 commits) Documentation: block: ioprio: Update schedulers loop: fix the the direct I/O support check when used on top of block devices blk-mq: Remove the hctx 'run' debugfs attribute nbd: always initialize struct msghdr completely block: Fix iterating over an empty bio with bio_for_each_folio_all block: bio-integrity: fix kcalloc() arguments order virtio_blk: remove duplicate check if queue is broken in virtblk_done sbitmap: remove stale comment in sbq_calc_wake_batch block: Correct a documentation comment in blk-cgroup.c null_blk: Remove usage of the deprecated ida_simple_xx() API block: ensure we hold a queue reference when using queue limits blk-mq: rename blk_mq_can_use_cached_rq block: print symbolic error name instead of error code blk-mq: fix IO hang from sbitmap wakeup race nvmet-rdma: avoid circular locking dependency on install_queue() nvmet-tcp: avoid circular locking dependency on install_queue() nvme-pci: set doorbell config before unquiescing block: fix partial zone append completion handling in req_bio_endio() block/iocost: silence warning on 'last_period' potentially being unused md/raid1: Use blk_opf_t for read and write operations ...
2024-01-11Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: "Pretty quiet round this time around. This contains: - NVMe updates via Keith: - nvme fabrics spec updates (Guixin, Max) - nvme target udpates (Guixin, Evan) - nvme attribute refactoring (Daniel) - nvme-fc numa fix (Keith) - MD updates via Song: - Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai) - Fix raid5 hang issue (Junxiao Bi) - Add Yu Kuai as Reviewer of the md subsystem - Remove deprecated flavors (Song Liu) - raid1 read error check support (Li Nan) - Better handle events off-by-1 case (Alex Lyakas) - Efficiency improvements for passthrough (Kundan) - Support for mapping integrity data directly (Keith) - Zoned write fix (Damien) - rnbd fixes (Kees, Santosh, Supriti) - Default to a sane discard size granularity (Christoph) - Make the default max transfer size naming less confusing (Christoph) - Remove support for deprecated host aware zoned model (Christoph) - Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel, Bart, Christoph)" * tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits) block: Treat sequential write preferred zone type as invalid block: remove disk_clear_zoned sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics drivers/block/xen-blkback/common.h: Fix spelling typo in comment blk-cgroup: fix rcu lockdep warning in blkg_lookup() blk-cgroup: don't use removal safe list iterators block: floor the discard granularity to the physical block size mtd_blkdevs: use the default discard granularity bcache: use the default discard granularity zram: use the default discard granularity null_blk: use the default discard granularity nbd: use the default discard granularity ubd: use the default discard granularity block: default the discard granularity to sector size bcache: discard_granularity should not be smaller than a sector block: remove two comments in bio_split_discard block: rename and document BLK_DEF_MAX_SECTORS loop: don't abuse BLK_DEF_MAX_SECTORS aoe: don't abuse BLK_DEF_MAX_SECTORS null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS ...
2024-01-09md/raid1: Use blk_opf_t for read and write operationsBart Van Assche
Use the type blk_opf_t for read and write operations instead of int. This patch does not affect the generated code but fixes the following sparse warning: drivers/md/raid1.c:1993:60: sparse: sparse: incorrect type in argument 5 (different base types) expected restricted blk_opf_t [usertype] opf got int rw Cc: Song Liu <song@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Fixes: 3c5e514db58f ("md/raid1: Use the new blk_opf_t type") Cc: stable@vger.kernel.org # v6.0+ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202401080657.UjFnvQgX-lkp@intel.com/ Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240108001223.23835-1-bvanassche@acm.org
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. - Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2024-01-09md: Fix md_seq_ops() regressionsYu Kuai
Commit cf1b6d4441ff ("md: simplify md_seq_ops") introduce following regressions: 1) If list all_mddevs is emptly, personalities and unused devices won't be showed to user anymore. 2) If seq_file buffer overflowed from md_seq_show(), then md_seq_start() will be called again, hence personalities will be showed to user again. 3) If seq_file buffer overflowed from md_seq_stop(), seq_read_iter() doesn't handle this, hence unused devices won't be showed to user. Fix above problems by printing personalities and unused devices in md_seq_show(). Fixes: cf1b6d4441ff ("md: simplify md_seq_ops") Cc: stable@vger.kernel.org # v6.7+ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240109133957.2975272-1-yukuai1@huaweicloud.com
2024-01-08mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDERKirill A. Shutemov
commit 23baf831a32c ("mm, treewide: redefine MAX_ORDER sanely") has changed the definition of MAX_ORDER to be inclusive. This has caused issues with code that was not yet upstream and depended on the previous definition. To draw attention to the altered meaning of the define, rename MAX_ORDER to MAX_PAGE_ORDER. Link: https://lkml.kernel.org/r/20231228144704.14033-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-01-08Merge tag 'vfs-6.8.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs super updates from Christian Brauner: "This contains the super work for this cycle including the long-awaited series by Jan to make it possible to prevent writing to mounted block devices: - Writing to mounted devices is dangerous and can lead to filesystem corruption as well as crashes. Furthermore syzbot comes with more and more involved examples how to corrupt block device under a mounted filesystem leading to kernel crashes and reports we can do nothing about. Add tracking of writers to each block device and a kernel cmdline argument which controls whether other writeable opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are allowed. Note that this effectively only prevents modification of the particular block device's page cache by other writers. The actual device content can still be modified by other means - e.g. by issuing direct scsi commands, by doing writes through devices lower in the storage stack (e.g. in case loop devices, DM, or MD are involved) etc. But blocking direct modifications of the block device page cache is enough to give filesystems a chance to perform data validation when loading data from the underlying storage and thus prevent kernel crashes. Syzbot can use this cmdline argument option to avoid uninteresting crashes. Also users whose userspace setup does not need writing to mounted block devices can set this option for hardening. We expect that this will be interesting to quite a few workloads. Btrfs is currently opted out of this because they still haven't merged patches we require for this to work from three kernel releases ago. - Reimplement block device freezing and thawing as holder operations on the block device. This allows us to extend block device freezing to all devices associated with a superblock and not just the main device. It also allows us to remove get_active_super() and thus another function that scans the global list of superblocks. Freezing via additional block devices only works if the filesystem chooses to use @fs_holder_ops for these additional devices as well. That currently only includes ext4 and xfs. Earlier releases switched get_tree_bdev() and mount_bdev() to use @fs_holder_ops. The remaining nilfs2 open-coded version of mount_bdev() has been converted to rely on @fs_holder_ops as well. So block device freezing for the main block device will continue to work as before. There should be no regressions in functionality. The only special case is btrfs where block device freezing for the main block device never worked because sb->s_bdev isn't set. Block device freezing for btrfs can be fixed once they can switch to @fs_holder_ops but that can happen whenever they're ready" * tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits) block: Fix a memory leak in bdev_open_by_dev() super: don't bother with WARN_ON_ONCE() super: massage wait event mechanism ext4: Block writes to journal device xfs: Block writes to log device fs: Block writes to mounted block devices btrfs: Do not restrict writes to btrfs devices block: Add config option to not allow writing to mounted devices block: Remove blkdev_get_by_*() functions bcachefs: Convert to bdev_open_by_path() fs: handle freezing from multiple devices fs: remove dead check nilfs2: simplify device handling fs: streamline thaw_super_locked ext4: simplify device handling xfs: simplify device handling fs: simplify setup_bdev_super() calls blkdev: comment fs_holder_ops porting: document block device freeze and thaw changes fs: remove unused helper ...
2023-12-29bcache: use the default discard granularityChristoph Hellwig
The discard granularity now defaults to a single sector, so don't set that value explicitly. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-29bcache: discard_granularity should not be smaller than a sectorChristoph Hellwig
Just like all block I/O, discards are in units of sectors. Thus setting a smaller than sector size discard limit in case of > 512 byte sectors in bcache doesn't make sense. Always set the discard granularity to 512 bytes instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231228075545.362768-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19block: remove support for the host aware zone modelChristoph Hellwig
When zones were first added the SCSI and ATA specs, two different models were supported (in addition to the drive managed one that is invisible to the host): - host managed where non-conventional zones there is strict requirement to write at the write pointer, or else an error is returned - host aware where a write point is maintained if writes always happen at it, otherwise it is left in an under-defined state and the sequential write preferred zones behave like conventional zones (probably very badly performing ones, though) Not surprisingly this lukewarm model didn't prove to be very useful and was finally removed from the ZBC and SBC specs (NVMe never implemented it). Due to to the easily disappearing write pointer host software could never rely on the write pointer to actually be useful for say recovery. Fortunately only a few HDD prototypes shipped using this model which never made it to mass production. Drop the support before it is too late. Note that any such host aware prototype HDD can still be used with Linux as we'll now treat it as a conventional HDD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19md: Remove deprecated CONFIG_MD_FAULTYSong Liu
md-faulty has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-4-song@kernel.org
2023-12-19md: Remove deprecated CONFIG_MD_MULTIPATHSong Liu
md-multipath has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-3-song@kernel.org
2023-12-19md: Remove deprecated CONFIG_MD_LINEARSong Liu
md-linear has been marked as deprecated for 2.5 years. Remove it. Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Guoqing Jiang <guoqing.jiang@linux.dev> Cc: Mateusz Grzonka <mateusz.grzonka@intel.com> Cc: Jes Sorensen <jes@trained-monkey.org> Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20231214222107.2016042-2-song@kernel.org