summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2025-03-05md: switch md-cluster to use md_submodle_headYu Kuai
To make code cleaner, and prepare to add kconfig for bitmap. Also remove the unsed global variables pers_lock, md_cluster_ops and md_cluster_mod, and exported symbols register_md_cluster_operations(), unregister_md_cluster_operations() and md_cluster_ops. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-8-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05md: don't export md_cluster_opsYu Kuai
Add a new field 'cluster_ops' and initialize it md_setup_cluster(), so that the gloable variable 'md_cluter_ops' doesn't need to be exported. Also prepare to switch md-cluster to use md_submod_head. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-7-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05md/md-cluster: cleanup md_cluster_ops referenceYu Kuai
md_cluster_ops->slot_number() is implemented inside md-cluster.c, just call it directly. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-6-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05md: switch personalities to use md_submodule_headYu Kuai
Remove the global list 'pers_list', and switch to use md_submodule_head, which is managed by xarry. Prepare to unify registration and unregistration for all sub modules. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-5-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05md: introduce struct md_submodule_head and APIsYu Kuai
Prepare to unify registration and unregistration of md personalities and md-cluster, also prepare for add kconfig for md-bitmap. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-4-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05md: only include md-cluster.h if necessaryYu Kuai
md-cluster is only supportted by raid1 and raid10, there is no need to include md-cluster.h for other personalities. Also move APIs that is only used in md-cluster.c from md.h to md-cluster.h. Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-3-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05md: merge common code into find_pers()Yu Kuai
- pers_lock() are held and released from caller - try_module_get() is called from caller - error message from caller Merge above code into find_pers(), and rename it to get_pers(), also add a wrapper to module_put() as put_pers(). Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-2-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-03block: split struct bio_integrity_payloadChristoph Hellwig
Many of the fields in struct bio_integrity_payload are only needed for the default integrity buffer in the block layer, and the variable sized array at the end of the structure makes it very hard to embed into caller allocated structures. Reduce struct bio_integrity_payload to the minimal structure needed in common code and create two separate containing structures for the automatically generated payload and the caller allocated payload. The latter is a simple wrapper for struct bio_integrity_payload and the bvecs, while the former contains the additional fields moved out of struct bio_integrity_payload. Always use a dedicated mempool for automatic integrity metadata instead of depending on bio_set that is submitter controlled and thus often doesn't have the mempool initialized and stop using mempools for the submitter buffers as they aren't in the NOIO I/O submission path where we need to guarantee forward progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250225154449.422989-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-25dm vdo indexer: reorder uds_request to reduce paddingKen Raeburn
Reorder fields and make uds_request_type and uds_zone_message packed, to squeeze out some space. Use struct_group so the request reset code no longer needs to care about field order. On x86_64 this reduces the struct size from 144 to 120, which saves 48 kB (about 12%) per VDO hash zone. Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-24Merge tag 'for-6.14/dm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mikulas Patocka: - dm-vdo: add missing spin_lock_init - dm-integrity: divide-by-zero fix - dm-integrity: do not report unused entries in the table line * tag 'for-6.14/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm vdo: add missing spin_lock_init dm-integrity: Do not emit journal configuration in DM table for Inline mode dm-integrity: Avoid divide by zero in table status in Inline mode
2025-02-24dm: fix unconditional IO throttle caused by REQ_PREFLUSHJinliang Zheng
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush() generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC, which causes the flush_bio to be throttled by wbt_wait(). An example from v5.4, similar problem also exists in upstream: crash> bt 2091206 PID: 2091206 TASK: ffff2050df92a300 CPU: 109 COMMAND: "kworker/u260:0" #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8 #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4 #2 [ffff800084a2f880] schedule at ffff800040bfa4b4 #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4 #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0 #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254 #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38 #8 [ffff800084a2fa60] generic_make_request at ffff800040570138 #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4 #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs] #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs] #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs] #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs] #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs] #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs] #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08 #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc #18 [ffff800084a2fe70] kthread at ffff800040118de4 After commit 2def2845cc33 ("xfs: don't allow log IO to be throttled"), the metadata submitted by xlog_write_iclog() should not be throttled. But due to the existence of the dm layer, throttling flush_bio indirectly causes the metadata bio to be throttled. Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes wbt_should_throttle() return false to avoid wbt_wait(). Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Tianxiang Peng <txpeng@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-24dm vdo: rework processing of loaded refcount byte arraysKen Raeburn
Clear provisional refcount values and count free/allocated blocks in one integrated loop. Process 8 aligned bytes at a time instead of every byte individually. On an Intel i7-11850H this reduces the CPU time needed to process a loaded refcount block by a factor of about 5-6. On a large system the refcount loading may be the largest factor in device startup time. Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-24dm vdo: remove remaining ring referencesSweet Tea Dorminy
Lists are the new rings, so update all remaining references to rings to talk about lists. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-24dm-verity: do forward error correction on metadata I/O errorsMikulas Patocka
Do forward error correction if metadata I/O fails. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
2025-02-24dm vdo: add missing spin_lock_initKen Raeburn
Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org
2025-02-24dm-bufio: remove unused return valueMikulas Patocka
The return value of the function "forget_buffer" is not tested, so we can remove it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-24dm-integrity: set ti->error on memory allocation failureMikulas Patocka
The dm-integrity target didn't set the error string when memory allocation failed. This patch fixes it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org
2025-02-21Merge tag 'block-6.14-20250221' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - FC controller state check fixes (Daniel) - PCI Endpoint fixes (Damien) - TCP connection failure fixe (Caleb) - TCP handling C2HTermReq PDU (Maurizio) - RDMA queue state check (Ruozhu) - Apple controller fixes (Hector) - Target crash on disbaled namespace (Hannes) - MD pull request via Yu: - Fix queue limits error handling for raid0, raid1 and raid10 - Fix for a NULL pointer deref in request data mapping - Code cleanup for request merging * tag 'block-6.14-20250221' of git://git.kernel.dk/linux: nvme: only allow entering LIVE from CONNECTING state nvme-fc: rely on state transitions to handle connectivity loss apple-nvme: Support coprocessors left idle apple-nvme: Release power domains when probe fails nvmet: Use enum definitions instead of hardcoded values nvme: Cleanup the definition of the controller config register fields nvme/ioctl: add missing space in err message nvme-tcp: fix connect failure on receiving partial ICResp PDU nvme: tcp: Fix compilation warning with W=1 nvmet: pci-epf: Avoid RCU stalls under heavy workload nvmet: pci-epf: Do not uselessly write the CSTS register nvmet: pci-epf: Correctly initialize CSTS when enabling the controller nvmet-rdma: recheck queue state is LIVE in state lock in recv done nvmet: Fix crash when a namespace is disabled nvme-tcp: add basic support for the C2HTermReq PDU nvme-pci: quirk Acer FA100 for non-uniqueue identifiers block: fix NULL pointer dereferenced within __blk_rq_map_sg block/merge: remove unnecessary min() with UINT_MAX md/raid*: Fix the set_queue_limits implementations
2025-02-17dm: Enable inline crypto passthrough for striped targetEd Tsai
Added DM_TARGET_PASSES_CRYPTO feature to the striped target to utilize the hardware encryption of the underlying storage devices, preventing fallback to the crypto API. Signed-off-by: Ed Tsai <ed.tsai@mediatek.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-17dm-integrity: Do not emit journal configuration in DM table for Inline modeMilan Broz
The Inline mode does not use a journal; it makes no sense to print journal information in DM table. Print it only if the journal is used. The same applies to interleave_sectors (unused for Inline mode). Also, add comments for arg_count, as the current calculation is quite obscure. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-17dm-integrity: Avoid divide by zero in table status in Inline modeMilan Broz
In Inline mode, the journal is unused, and journal_sectors is zero. Calculating the journal watermark requires dividing by journal_sectors, which should be done only if the journal is configured. Otherwise, a simple table query (dmsetup table) can cause OOPS. This bug did not show on some systems, perhaps only due to compiler optimization. On my 32-bit testing machine, this reliably crashes with the following: : Oops: divide error: 0000 [#1] PREEMPT SMP : CPU: 0 UID: 0 PID: 2450 Comm: dmsetup Not tainted 6.14.0-rc2+ #959 : EIP: dm_integrity_status+0x2f8/0xab0 [dm_integrity] ... Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: fb0987682c62 ("dm-integrity: introduce the Inline mode") Cc: stable@vger.kernel.org # 6.11+
2025-02-15md/raid1: fix memory leak in raid1_run() if no active rdevZheng Qixing
When `raid1_set_limits()` fails or when the array has no active `rdev`, the allocated memory for `conf` is not properly freed. Add raid1_free() call to properly free the conf in error path. Fixes: 799af947ed13 ("md/raid1: don't free conf on raid0_run failure") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Link: https://lore.kernel.org/linux-raid/20250215020137.3703757-1-zhengqixing@huaweicloud.com Singed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-02-15md: ensure resync is prioritized over recoveryLi Nan
If a new disk is added during resync, the resync process is interrupted, and recovery is triggered, causing the previous resync to be lost. In reality, disk addition should not terminate resync, fix it. Steps to reproduce the issue: mdadm -CR /dev/md0 -l1 -n3 -x1 /dev/sd[abcd] mdadm --fail /dev/md0 /dev/sdc Fixes: 24dd469d728d ("[PATCH] md: allow a manual resync with md") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/linux-raid/20250213131530.3698600-1-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-02-14md/raid*: Fix the set_queue_limits implementationsBart Van Assche
queue_limits_cancel_update() must only be called if queue_limits_start_update() is called first. Remove the queue_limits_cancel_update() calls from the raid*_set_limits() functions because there is no corresponding queue_limits_start_update() call. Cc: Christoph Hellwig <hch@lst.de> Fixes: c6e56cf6b2e7 ("block: move integrity information into queue_limits") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-raid/20250212171108.3483150-1-bvanassche@acm.org/ Signed-off-by: Yu Kuai <yukuai@kernel.org>
2025-02-10blk-crypto: add basic hardware-wrapped key supportEric Biggers
To prevent keys from being compromised if an attacker acquires read access to kernel memory, some inline encryption hardware can accept keys which are wrapped by a per-boot hardware-internal key. This avoids needing to keep the raw keys in kernel memory, without limiting the number of keys that can be used. Such hardware also supports deriving a "software secret" for cryptographic tasks that can't be handled by inline encryption; this is needed for fscrypt to work properly. To support this hardware, allow struct blk_crypto_key to represent a hardware-wrapped key as an alternative to a raw key, and make drivers set flags in struct blk_crypto_profile to indicate which types of keys they support. Also add the ->derive_sw_secret() low-level operation, which drivers supporting wrapped keys must implement. For more information, see the detailed documentation which this patch adds to Documentation/block/inline-encryption.rst. Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650 Link: https://lore.kernel.org/r/20250204060041.409950-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-08lib/crc32: standardize on crc32c() name for Castagnoli CRC32Eric Biggers
For historical reasons, the Castagnoli CRC32 is available under 3 names: crc32c(), crc32c_le(), and __crc32c_le(). Most callers use crc32c(). The more verbose versions are not really warranted; there is no "_be" version that the "_le" version needs to be differentiated from, and the leading underscores are pointless. Therefore, let's standardize on just crc32c(). Remove the other two names, and update callers accordingly. Specifically, the new crc32c() comes from what was previously __crc32c_le(), so compared to the old crc32c() it now takes a size_t length rather than unsigned int, and it's now in linux/crc32.h instead of just linux/crc32c.h (which includes linux/crc32.h). Later patches will also rename __crc32c_le_combine(), crc32c_le_base(), and crc32c_le_arch(). Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250208024911.14936-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-02-07Merge tag 'block-6.14-20250207' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - MD pull request via Song: - fix an error handling path for md-linear - NVMe pull request via Keith: - Connection fixes for fibre channel transport (Daniel) - Endian fixes (Keith, Christoph) - Cleanup fix for host memory buffer (Francis) - Platform specific power quirks (Georg) - Target memory leak (Sagi) - Use appropriate controller state accessor (Daniel) - Fixup for a regression introduced last week, where sunvdc wasn't updated for an API change, causing compilation failures on sparc64. * tag 'block-6.14-20250207' of git://git.kernel.dk/linux: drivers/block/sunvdc.c: update the correct AIP call md: Fix linear_set_limits() nvme-fc: use ctrl state getter nvme: make nvme_tls_attrs_group static nvmet: add a missing endianess conversion in nvmet_execute_admin_connect nvmet: the result field in nvmet_alloc_ctrl_args is little endian nvmet: fix a memory leak in controller identify nvme-fc: do not ignore connectivity loss during connecting nvme: handle connectivity loss in nvme_set_queue_count nvme-fc: go straight to connecting state when initializing nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk nvme-pci: remove redundant dma frees in hmb nvmet: fix rw control endian access
2025-02-03dm vdo slab-depot: read refcount blocks in large chunks at load timeKen Raeburn
At startup, vdo loads all the reference count data before the device reports that it is ready. Using a pool of large metadata vios can improve the startup speed of vdo. The pool of large vios is released after the device is ready. During normal operation, reference counts are updated 4kB at a time, as before. Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03dm vdo vio-pool: allow variable-sized metadata viosKen Raeburn
With larger-sized metadata vio pools, vdo will sometimes need to issue I/O with a smaller size than the allocated size. Since vio_reset_bio is where the bvec array and I/O size are initialized, this reset interface must now specify what I/O size to use. Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03dm vdo vio-pool: support pools with multiple data blocks per vioKen Raeburn
Support pools with multiple data blocks per vio Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03dm vdo vio-pool: add a pool pointer to pooled_vioKen Raeburn
This allows us to simplify the return_vio_to_pool interface. Also, we don't need to use vdo_forget on local variables or arguments that are about to go out of scope anyway. Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03dm vdo: remove checks that can not failMatthew Sakai
Remove checks that can't fail. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03dm vdo indexer: prevent unterminated string warningChung Chung
Fix array initialization that triggers a warning: error: initializer-string for array of ‘unsigned char’ is too long [-Werror=unterminated-string-initialization] Signed-off-by: Chung Chung <cchung@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03dm vdo: use a short static string for thread name prefixMatthew Sakai
Also remove MODULE_NAME and a BUG_ON check, both unneeded. This fixes a warning about string truncation in snprintf that will never happen in practice: drivers/md/dm-vdo/vdo.c: In function ‘vdo_make’: drivers/md/dm-vdo/vdo.c:564:5: error: ‘%s’ directive output may be truncated writing up to 55 bytes into a region of size 16 [-Werror=format-truncation=] "%s%u", MODULE_NAME, instance); ^~ drivers/md/dm-vdo/vdo.c:563:2: note: ‘snprintf’ output between 2 and 66 bytes into a destination of size 16 snprintf(vdo->thread_name_prefix, sizeof(vdo->thread_name_prefix), ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ "%s%u", MODULE_NAME, instance); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Reported-by: John Garry <john.g.garry@oracle.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03dm-crypt: switch to using the crc32 libraryEric Biggers
Now that the crc32() library function takes advantage of architecture-specific optimizations, it is unnecessary to go through the crypto API. Just use crc32(). This is much simpler, and it improves performance due to eliminating the crypto API overhead. (However, this only affects the TCW IV mode of dm-crypt, which is a compatibility mode that is rarely used compared to other dm-crypt modes.) Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-31Merge tag 'block-6.14-20250131' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: - MD pull request via Song: - Fix a md-cluster regression introduced - More sysfs race fixes - Mark anything inside queue freezing as not being able to do IO for memory allocations - Fix for a regression introduced in loop in this merge window - Fix for a regression in queue mapping setups introduced in this merge window - Fix for the block dio fops attempting an iov_iter revert upton getting -EIOCBQUEUED on the read side. This one is going to stable as well * tag 'block-6.14-20250131' of git://git.kernel.dk/linux: block: force noio scope in blk_mq_freeze_queue block: fix nr_hw_queue update racing with disk addition/removal block: get rid of request queue ->sysfs_dir_lock loop: don't clear LO_FLAGS_PARTSCAN on LOOP_SET_STATUS{,64} md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime blk-mq: create correct map for fallback case block: don't revert iter for -EIOCBQUEUED
2025-01-31md: Fix linear_set_limits()Bart Van Assche
queue_limits_cancel_update() must only be called if queue_limits_start_update() is called first. Remove the queue_limits_cancel_update() call from linear_set_limits() because there is no corresponding queue_limits_start_update() call. This bug was discovered by annotating all mutex operations with clang thread-safety attributes and by building the kernel with clang and -Wthread-safety. Cc: Yu Kuai <yukuai3@huawei.com> Cc: Coly Li <colyli@kernel.org> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Fixes: 127186cfb184 ("md: reintroduce md-linear") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250129225636.2667932-1-bvanassche@acm.org Signed-off-by: Song Liu <song@kernel.org>
2025-01-28treewide: const qualify ctl_tables where applicableJoel Granados
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/inifiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function. Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers. Created this by running an spatch followed by a sed command: Spatch: virtual patch @ depends on !(file in "net") disable optional_qualifier @ identifier table_name != { watchdog_hardlockup_sysctl, iwcm_ctl_table, ucma_ctl_table, memory_allocation_profiling_sysctls, loadpin_sysctl_table }; @@ + const struct ctl_table table_name [] = { ... }; sed: sed --in-place \ -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \ kernel/utsname_sysctl.c Reviewed-by: Song Liu <song@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/ Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-27Merge tag 'for-6.14/dm-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - fix a spelling error in dm-raid - change kzalloc to kcalloc - remove useless test in alloc_multiple_bios - disable REQ_NOWAIT for flushes - dm-transaction-manager: use red-black trees instead of linear lists - atomic writes support for dm-linear, dm-stripe and dm-mirror - dm-crypt: code cleanups and two bugfixes * tag 'for-6.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-crypt: track tag_offset in convert_context dm-crypt: don't initialize cc_sector again dm-crypt: don't update io->sector after kcryptd_crypt_write_io_submit() dm-crypt: use bi_sector in bio when initialize integrity seed dm-crypt: fully initialize clone->bi_iter in crypt_alloc_buffer() dm-crypt: set atomic as false when calling crypt_convert() in kworker dm-mirror: Support atomic writes dm-io: Warn on creating multiple atomic write bios for a region dm-stripe: Enable atomic writes dm-linear: Enable atomic writes dm: Ensure cloned bio is same length for atomic write dm-table: atomic writes support dm-transaction-manager: use red-black trees instead of linear lists dm: disable REQ_NOWAIT for flushes dm: remove useless test in alloc_multiple_bios dm: change kzalloc to kcalloc dm raid: fix spelling errors in raid_ctr()
2025-01-24md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetimeYu Kuai
After commit ec6bb299c7c3 ("md/md-bitmap: add 'sync_size' into struct md_bitmap_stats"), following panic is reported: Oops: general protection fault, probably for non-canonical address RIP: 0010:bitmap_get_stats+0x2b/0xa0 Call Trace: <TASK> md_seq_show+0x2d2/0x5b0 seq_read_iter+0x2b9/0x470 seq_read+0x12f/0x180 proc_reg_read+0x57/0xb0 vfs_read+0xf6/0x380 ksys_read+0x6c/0xf0 do_syscall_64+0x82/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e Root cause is that bitmap_get_stats() can be called at anytime if mddev is still there, even if bitmap is destroyed, or not fully initialized. Deferenceing bitmap in this case can crash the kernel. Meanwhile, the above commit start to deferencing bitmap->storage, make the problem easier to trigger. Fix the problem by protecting bitmap_get_stats() with bitmap_info.mutex. Cc: stable@vger.kernel.org # v6.12+ Fixes: 32a7627cf3a3 ("[PATCH] md: optimised resync using Bitmap based intent logging") Reported-and-tested-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Closes: https://lore.kernel.org/linux-raid/ca3a91a2-50ae-4f68-b317-abd9889f3907@oracle.com/T/#m6e5086c95201135e4941fe38f9efa76daf9666c5 Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250124092055.4050195-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>
2025-01-21dm-crypt: track tag_offset in convert_contextHou Tao
dm-crypt uses tag_offset to index the integrity metadata for each crypt sector. When the initial crypt_convert() returns BLK_STS_DEV_RESOURCE, dm-crypt will try to continue the crypt/decrypt procedure in a kworker. However, it resets tag_offset as zero instead of using the tag_offset related with current sector. It may return unexpected data when using random IV or return unexpected integrity related error. Fix the problem by tracking tag_offset in per-IO convert_context. Therefore, when the crypt/decrypt procedure continues in a kworker, it could use the next tag_offset saved in convert_context. Fixes: 8abec36d1274 ("dm crypt: do not wait for backlogged crypto request completion in softirq") Cc: stable@vger.kernel.org Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-21dm-crypt: don't initialize cc_sector againHou Tao
For aead_recheck case, cc_sector has already been initialized in crypt_convert_init() when trying to re-read the read. Therefore, remove the duplicated initialization. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-21dm-crypt: don't update io->sector after kcryptd_crypt_write_io_submit()Hou Tao
The updates of io->sector are the leftovers when dm-crypt allocated pages for partial write request. However, since commit cf2f1abfbd0db ("dm crypt: don't allocate pages for a partial request"), there is no partial request anymore. After the introduction of write request rb-tree, the updates of io->sectors may interfere the insertion procedure, because ->sectors of these write requests which have already been added in the rb-tree may be changed during the insertion of new write request. Fix it by removing these buggy updates of io->sectors. Considering these updates only effect the write request rb-tree, the commit which introduces the write request rb-tree is used as the fix tag. Fixes: b3c5fd305249 ("dm crypt: sort writes") Cc: stable@vger.kernel.org Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-21dm-crypt: use bi_sector in bio when initialize integrity seedHou Tao
bio->bi_iter.bi_sector has already been initialized when initialize the integrity seed in dm_crypt_integrity_io_alloc(). There is no need to calculate it again. Therefore, use the helper bip_set_seed() to initialize the seed and pass bi_iter.bi_sector to it instead. Mikulas: We can't use bip_set_seed because it doesn't compile without CONFIG_BLK_DEV_INTEGRITY. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-21dm-crypt: fully initialize clone->bi_iter in crypt_alloc_buffer()Hou Tao
Both kcryptd_io_read() and kcryptd_crypt_write_convert() will invoke crypt_alloc_buffer() to allocate a new bio. Both of these two callers initialize bi_iter.bi_sector for the new bio separatedly after crypt_alloc_buffer() returns. However, kcryptd_crypt_write_convert() will copy the bi_iter of the new bio into ctx.iter_out or ctx.iter_in. Although it doesn't incur any harm now, it is better to fully initialize bi_iter before it is used. Therefore, initialize bi_iter.bi_sector in crypt_alloc_buffer() instead. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-21dm-crypt: set atomic as false when calling crypt_convert() in kworkerHou Tao
Both kcryptd_crypt_write_continue() and kcryptd_crypt_read_continue() are running in the kworker context, it is OK to call cond_resched(), Therefore, set atomic as false when invoking crypt_convert() under kworker context. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-20Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe pull requests via Keith: - Target support for PCI-Endpoint transport (Damien) - TCP IO queue spreading fixes (Sagi, Chaitanya) - Target handling for "limited retry" flags (Guixen) - Poll type fix (Yongsoo) - Xarray storage error handling (Keisuke) - Host memory buffer free size fix on error (Francis) - MD pull requests via Song: - Reintroduce md-linear (Yu Kuai) - md-bitmap refactor and fix (Yu Kuai) - Replace kmap_atomic with kmap_local_page (David Reaver) - Quite a few queue freeze and debugfs deadlock fixes Ming introduced lockdep support for this in the 6.13 kernel, and it has (unsurprisingly) uncovered quite a few issues - Use const attributes for IO schedulers - Remove bio ioprio wrappers - Fixes for stacked device atomic write support - Refactor queue affinity helpers, in preparation for better supporting isolated CPUs - Cleanups of loop O_DIRECT handling - Cleanup of BLK_MQ_F_* flags - Add rotational support for null_blk - Various fixes and cleanups * tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits) block: Don't trim an atomic write block: Add common atomic writes enable flag md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add() block: limit disk max sectors to (LLONG_MAX >> 9) block: Change blk_stack_atomic_writes_limits() unit_min check block: Ensure start sector is aligned for stacking atomic writes blk-mq: Move more error handling into blk_mq_submit_bio() block: Reorder the request allocation code in blk_mq_submit_bio() nvme: fix bogus kzalloc() return check in nvme_init_effects_log() md/md-bitmap: move bitmap_{start, end}write to md upper layer md/raid5: implement pers->bitmap_sector() md: add a new callback pers->bitmap_sector() md/md-bitmap: remove the last parameter for bimtap_ops->endwrite() md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() md: Replace deprecated kmap_atomic() with kmap_local_page() md: reintroduce md-linear partitions: ldm: remove the initial kernel-doc notation blk-cgroup: rwstat: fix kernel-doc warnings in header file blk-cgroup: fix kernel-doc warnings in header file nbd: fix partial sending ...
2025-01-17dm-mirror: Support atomic writesJohn Garry
Support atomic writes by setting DM_TARGET_ATOMIC_WRITES for the target type, and also unmasking REQ_ATOMIC from the submitted bio op flags. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17dm-io: Warn on creating multiple atomic write bios for a regionJohn Garry
A region should just be for a single atomic write, so warn when we are creating many. This should not occur if request queue limits are properly configured. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17dm-stripe: Enable atomic writesJohn Garry
Set feature flag DM_TARGET_ATOMIC_WRITES. Similar to md raid0, the chunk size set in stripe_io_hints() limits the atomic write unit max. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>