path: root/block
2025-07-14  iomap: replace iomap_folio_ops with iomap_write_ops  (Christoph Hellwig)
The iomap_folio_ops are only used for buffered writes, including the zero and unshare variants. Rename them to iomap_write_ops to better describe the usage, and pass them through the call chain like the other operation-specific methods instead of through the iomap. xfs_iomap_valid grows an IOMAP_HOLE check to keep the existing behavior that never attached the folio_ops to an iomap representing a hole. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-12-hch@lst.de Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14  iomap: hide ioends from the generic writeback code  (Christoph Hellwig)
Replace the ioend pointer in iomap_writeback_ctx with a void *wb_ctx one to facilitate non-block, non-ioend writeback for users such as fuse. Rename the submit_ioend method to writeback_submit and make it mandatory so that the generic writeback code stops seeing ioends and bios. Co-developed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-6-hch@lst.de Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14  iomap: refactor the writeback interface  (Christoph Hellwig)
Replace ->map_blocks with a new ->writeback_range, which differs in the following ways:
 - it must also queue up the I/O for writeback, that is, it calls into the slightly refactored and extended-in-scope iomap_add_to_ioend for each region
 - it can handle only a part of the requested region, that is, the retry loop for partial mappings moves to the caller
 - it handles cleanup on failures as well, and thus also replaces the discard_folio method only implemented by XFS.
This will allow the iomap writeback code to also be used by file systems that are not block based, like fuse. Co-developed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-5-hch@lst.de Acked-by: Damien Le Moal <dlemoal@kernel.org> # zonefs Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14  iomap: pass more arguments using the iomap writeback context  (Christoph Hellwig)
Add inode and wpc fields to pass the inode and writeback context that are needed in the entire writeback call chain, and let the callers initialize all fields in the writeback context before calling iomap_writepages to simplify the argument passing. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/20250710133343.399917-3-hch@lst.de Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-11  block: fix kobject leak in blk_unregister_queue  (Ming Lei)
The kobject for the queue, `disk->queue_kobj`, is initialized with a reference count of 1 via `kobject_init()` in `blk_register_queue()`. While `kobject_del()` is called during the unregister path to remove the kobject from sysfs, the initial reference is never released. Add a call to `kobject_put()` in `blk_unregister_queue()` to properly decrement the reference count and fix the leak. Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs") Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250711083009.2574432-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
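As a rough illustration of the pairing described above, here is a minimal runnable userspace model (names mirror the kobject API, but this is not kernel code): init sets the count to 1, del only unlinks, and the final put is what actually releases the object.

  /* Hypothetical userspace model of kobject init/del/put pairing. */
  #include <stdio.h>
  #include <stdlib.h>

  struct kobj_model {
          int refcount;
          int linked;             /* "registered in sysfs" */
  };

  static struct kobj_model *model_init(void)
  {
          struct kobj_model *k = calloc(1, sizeof(*k));
          k->refcount = 1;        /* kobject_init() starts at refcount 1 */
          k->linked = 1;          /* sysfs registration */
          return k;
  }

  static void model_del(struct kobj_model *k)
  {
          k->linked = 0;          /* kobject_del(): unlink only, no ref drop */
  }

  static void model_put(struct kobj_model *k)
  {
          if (--k->refcount == 0) {
                  free(k);        /* release() runs only when count hits 0 */
                  printf("released\n");
          }
  }

  int main(void)
  {
          struct kobj_model *k = model_init();
          model_del(k);           /* without a matching put, k leaks here */
          model_put(k);           /* the fix: drop the initial reference */
          return 0;
  }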
2025-07-11  block: fix FS_IOC_GETLBMD_CAP parsing in blkdev_common_ioctl()  (Arnd Bergmann)
Anders and Naresh found that the addition of the FS_IOC_GETLBMD_CAP handling in the blockdev ioctl handler breaks all ioctls with _IOC_NR==2, as the new command is not added to the switch but only a few of the command bits are checked. Move the check into the blk_get_meta_cap() function itself and make it return -ENOIOCTLCMD for any unsupported command code, including those with a smaller size that previously returned -EINVAL. For consistency this also drops the check for a NULL 'arg', which is useless, as any invalid pointer should return -EFAULT. Fixes: 9eb22f7fedfc ("fs: add ioctl to query metadata and protection info capabilities") Link: https://lore.kernel.org/all/CA+G9fYvk9HHE5UJ7cdJHTcY6P5JKnp+_e+sdC5U-ZQFTP9_hqQ@mail.gmail.com/ Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Cc: Anders Roxell <anders.roxell@linaro.org> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/20250711084708.2714436-1-arnd@kernel.org Tested-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
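A schematic C fragment of the described control flow (the helper and command names come from the commit text; the body is an illustrative sketch, not the committed implementation):

  /* Sketch: the helper owns the full command check and returns
   * -ENOIOCTLCMD for anything it does not handle, so unrelated commands
   * sharing _IOC_NR==2 fall through to the rest of blkdev_common_ioctl(). */
  static int blk_get_meta_cap(struct block_device *bdev, unsigned int cmd,
                              void __user *argp)
  {
          if (cmd != FS_IOC_GETLBMD_CAP)
                  return -ENOIOCTLCMD;   /* including smaller sizes that used to -EINVAL */
          /* ... fill in struct logical_block_metadata_cap, copy_to_user() ... */
          return 0;
  }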
2025-07-09  bio: use memzero_page() in bio_truncate()  (Matthew Wilcox (Oracle))
Patch series "Remove zero_user()". The zero_user() API is almost unused these days. Finish the job of removing it. This patch (of 5): memzero_page() is the new name for zero_user(). Link: https://lkml.kernel.org/r/20250612143443.2848197-1-willy@infradead.org Link: https://lkml.kernel.org/r/20250612143443.2848197-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  blk-mq-debugfs: use debugfs_get_aux()  (Al Viro)
instead of manually stashing the data pointer into the parent directory inode's ->i_private, just pass it to debugfs_create_file_aux() so that it can be extracted without that insane chasing through ->d_parent. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20250702212818.GJ3406663@ZenIV Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-01  blk-mq: add number of queue calc helper  (Daniel Wagner)
Add two variants of helper functions that calculate the correct number of queues to use. Two variants are needed because some drivers base their maximum number of queues on the possible CPU mask, while others use the online CPU mask. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-2-13923686b54b@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01  lib/group_cpus: Let group_cpu_evenly() return the number of initialized masks  (Daniel Wagner)
group_cpu_evenly() might have allocated fewer groups than requested:

  group_cpu_evenly()
    __group_cpus_evenly()
      alloc_nodes_groups()
        # allocated total groups may be less than numgrps when
        # active total CPU number is less than numgrps

In this case, the caller will do an out-of-bounds access because it assumes the returned array has numgrps masks. Return the number of groups created so the caller can limit the access range accordingly. Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-1-13923686b54b@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
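A hedged caller-side sketch of the new contract (the exact signature is not quoted in the log; an out-parameter form is assumed here, and setup_queue_affinity() is a hypothetical consumer):

  /* Sketch: only iterate over the masks that were actually created. */
  unsigned int nummasks, i;
  struct cpumask *masks = group_cpus_evenly(numgrps, &nummasks);

  if (!masks)
          return -ENOMEM;
  for (i = 0; i < nummasks; i++)          /* not numgrps: avoids the OOB access */
          setup_queue_affinity(i, &masks[i]);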
2025-07-01  fs: add ioctl to query metadata and protection info capabilities  (Anuj Gupta)
Add a new ioctl, FS_IOC_GETLBMD_CAP, to query metadata and protection info (PI) capabilities. This ioctl returns information about the file's integrity profile. This is useful for userspace applications to understand a file's end-to-end data protection support and configure the I/O accordingly. For now this interface is only supported by block devices. However the design and placement of this ioctl in the generic FS ioctl space allows us to extend it to work over files as well. This may be useful when filesystems start supporting PI-aware layouts. A new structure struct logical_block_metadata_cap is introduced, which contains the following fields:
 1. lbmd_flags: bitmask of logical block metadata capability flags
 2. lbmd_interval: the amount of data described by each unit of logical block metadata
 3. lbmd_size: size in bytes of the logical block metadata associated with each interval
 4. lbmd_opaque_size: size in bytes of the opaque block tag associated with each interval
 5. lbmd_opaque_offset: offset in bytes of the opaque block tag within the logical block metadata
 6. lbmd_pi_size: size in bytes of the T10 PI tuple associated with each interval
 7. lbmd_pi_offset: offset in bytes of T10 PI tuple within the logical block metadata
 8. lbmd_pi_guard_tag_type: T10 PI guard tag type
 9. lbmd_pi_app_tag_size: size in bytes of the T10 PI application tag
 10. lbmd_pi_ref_tag_size: size in bytes of the T10 PI reference tag
 11. lbmd_pi_storage_tag_size: size in bytes of the T10 PI storage tag
The internal logic to fetch the capability is encapsulated in a helper function blk_get_meta_cap(), which uses the blk_integrity profile associated with the device. The ioctl returns -EOPNOTSUPP if CONFIG_BLK_DEV_INTEGRITY is not enabled. Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-5-anuj20.g@samsung.com Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
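A hypothetical userspace usage sketch; FS_IOC_GETLBMD_CAP and struct logical_block_metadata_cap are assumed to come from the uapi headers (linux/fs.h) on kernels that provide this ioctl, with field names per the list above:

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>
  #include <unistd.h>

  int main(void)
  {
          struct logical_block_metadata_cap cap = { 0 };
          int fd = open("/dev/nvme0n1", O_RDONLY);    /* placeholder device */

          if (fd < 0 || ioctl(fd, FS_IOC_GETLBMD_CAP, &cap) < 0) {
                  perror("FS_IOC_GETLBMD_CAP");  /* -EOPNOTSUPP without BLK_DEV_INTEGRITY */
                  return 1;
          }
          printf("interval %u, metadata %u bytes, PI %u bytes at offset %u\n",
                 (unsigned)cap.lbmd_interval, (unsigned)cap.lbmd_size,
                 (unsigned)cap.lbmd_pi_size, (unsigned)cap.lbmd_pi_offset);
          close(fd);
          return 0;
  }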
2025-07-01  block: introduce pi_tuple_size field in blk_integrity  (Anuj Gupta)
Introduce a new pi_tuple_size field in struct blk_integrity to explicitly represent the size (in bytes) of the protection information (PI) tuple. This is a prep patch. Add validation in blk_validate_integrity_limits() to ensure that the PI tuple size matches the expected size for known checksum types and never exceeds the metadata size. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-3-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01  block: rename tuple_size field in blk_integrity to metadata_size  (Anuj Gupta)
The tuple_size field in blk_integrity currently represents the total size of metadata associated with each data interval. To make the meaning more explicit, rename tuple_size to metadata_size. This is a purely mechanical rename with no functional changes. Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-2-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-30  block: add scatterlist-less DMA mapping helpers  (Christoph Hellwig)
Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with the wasteful scatterlist structure. Instead it uses the mapping iterator to either add segments to the IOVA for IOMMU operations, or just maps them one by one for the direct mapping. For the IOMMU case, instead of a scatterlist with an entry for each segment, only a single [dma_addr,len] pair needs to be stored for processing a request, and for the direct mapping the per-segment allocation shrinks from [page,offset,len,dma_addr,dma_len] to just [dma_addr,len]. One big difference to the scatterlist API, which could be considered a downside, is that the IOVA collapsing only works when the driver sets a virt_boundary that matches the IOMMU granule. For NVMe this is done already so it works perfectly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
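A deliberately loose driver-side sketch: only the blk_rq_dma_map / blk_rq_dma_unmap names are taken from the commit text; the iterator shape and the nvme_setup_prp() / blk_rq_dma_map_next() calls are invented here purely for illustration.

  /* Sketch: walk [dma_addr, len] ranges instead of a scatterlist. */
  if (!blk_rq_dma_map(req, dev, &iter))           /* names per the commit */
          return BLK_STS_RESOURCE;
  do {
          nvme_setup_prp(cmd, iter.addr, iter.len);   /* hypothetical */
  } while (blk_rq_dma_map_next(&iter));               /* hypothetical */
  /* ... after completion ... */
  blk_rq_dma_unmap(req, dev, &iter);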
2025-06-30  block: don't merge different kinds of P2P transfers in a single bio  (Christoph Hellwig)
To get the DMA mapping helpers out of having to check every segment for its P2P status, ensure that bios contain either only P2P transfers or only non-P2P transfers, and that a P2P bio only contains ranges from a single device. This means we do the page zone access in the bio add path where the page should still be hot, and will only have to do the fairly expensive P2P topology lookup once per bio down in the DMA mapping path, and only for already-marked bios. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20250625113531.522027-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-30  block: Introduce bio_needs_zone_write_plugging()  (Damien Le Moal)
In preparation for fixing device mapper zone write handling, introduce the inline helper function bio_needs_zone_write_plugging() to test if a BIO requires handling through zone write plugging using the function blk_zone_plug_bio(). This function returns true for any write (op_is_write(bio) == true) operation directed at a zoned block device using zone write plugging, that is, a block device with a disk that has a zone write plug hash table. This helper allows simplifying the check on entry to blk_zone_plug_bio() and is used to protect calls to it for blk-mq devices and DM devices. Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
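A minimal caller-side sketch of the intended pattern (function names from the commit; the surrounding code is illustrative):

  /* Sketch: cheap inline test first, zone write plugging only when the
   * bio actually needs it. */
  if (bio_needs_zone_write_plugging(bio) &&
      blk_zone_plug_bio(bio, nr_segs))
          return;         /* bio was plugged; it will be submitted later */
  /* ... continue with the normal submission path ... */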
2025-06-30  block: Increase BLK_DEF_MAX_SECTORS_CAP  (Damien Le Moal)
Back in 2015, commit d2be537c3ba3 ("block: bump BLK_DEF_MAX_SECTORS to 2560") increased the default maximum size of a block device I/O to 2560 sectors (1280 KiB) to "accommodate a 10-data-disk stripe write with chunk size 128k". This choice is rather arbitrary and, since then, improvements to the block layer have led software RAID drivers to correctly advertise their stripe width through chunk_sectors, and abuses of BLK_DEF_MAX_SECTORS_CAP by drivers (to set the HW limit rather than the default user-controlled maximum I/O size) have been fixed. Since many block devices can benefit from a larger value of BLK_DEF_MAX_SECTORS_CAP, in particular HDDs, increase this value to 4 MiB (8192 sectors). And given that BLK_DEF_MAX_SECTORS_CAP is only used in the block layer and should not be used by drivers directly, move this macro definition to the block layer internal header file block/blk.h. Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250618060045.37593-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
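The sector arithmetic behind the two caps (512-byte sectors):

  2560 sectors * 512 B/sector = 1,310,720 B = 1280 KiB   (old cap)
  8192 sectors * 512 B/sector = 4,194,304 B = 4 MiB      (new cap)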
2025-06-27  Merge tag 'block-6.16-20250626' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull block fixes from Jens Axboe:

 - Fixes for ublk:
     - fix C++ narrowing warnings in the uapi header
     - update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header
     - fix for the ublk ->queue_rqs() implementation, limiting a batch to just the specific task AND ring
     - ublk_get_data() error handling fix
     - sanity check more arguments in ublk_ctrl_add_dev()
     - selftest addition
 - NVMe pull request via Christoph:
     - reset delayed remove_work after reconnect
     - fix atomic write size validation
 - Fix for a warning introduced in bdev_count_inflight_rw() in this merge window

* tag 'block-6.16-20250626' of git://git.kernel.dk/linux:
  block: fix false warning in bdev_count_inflight_rw()
  ublk: sanity check add_dev input for underflow
  nvme: fix atomic write size validation
  nvme: refactor the atomic write unit detection
  nvme: reset delayed remove_work after reconnect
  ublk: setup ublk_io correctly in case of ublk_get_data() failure
  ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header
  ublk: fix narrowing warnings in UAPI header
  selftests: ublk: don't take same backing file for more than one ublk devices
  ublk: build batch from IOs in same io_ring_ctx and io task
2025-06-26  block: fix false warning in bdev_count_inflight_rw()  (Yu Kuai)
While bdev_count_inflight is iterating all CPUs, if some IOs are issued from a traversed CPU and then completed from a CPU that has not been traversed yet:

  cpu0                          cpu1
                                bdev_count_inflight
                                 //for_each_possible_cpu
                                 // cpu0 is 0
                                 inflight += 0
  // issue a io
  blk_account_io_start
  // cpu0 inflight ++
                                        cpu2
                                        // the io is done
                                        blk_account_io_done
                                        // cpu2 inflight --
                                 // cpu1 is 0
                                 inflight += 0
                                 // cpu2 is -1
                                 inflight += -1
                                 ...

In this case, the total inflight will be -1, causing lots of false warnings. Fix the problem by removing the warning. Note that there is still a valid warning for nvme-mpath (from Yi) that is not fixed yet. Fixes: f5482ee5edb9 ("block: WARN if bdev inflight counter is negative") Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#mae89155a5006463d0a21a4a2c35ae0034b26a339 Reported-and-tested-by: Calvin Owens <calvin@wbinvd.org> Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#m1d935a00070bf95055d0ac84e6075158b08acaef Reported-by: Dave Chinner <david@fromorbit.com> Closes: https://lore.kernel.org/linux-block/aFuypjqCXo9-5_En@dread.disaster.area/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250626115743.1641443-1-yukuai3@huawei.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
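The same arithmetic can be reproduced with a small runnable userspace model (per-CPU counters are simulated with an array; the interleaving above is serialized here to make the -1 deterministic):

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_int inflight[3];        /* one slot per "CPU" */

  static void *io_thread(void *arg)
  {
          (void)arg;
          atomic_fetch_add(&inflight[0], 1);  /* start accounted on cpu0 */
          atomic_fetch_sub(&inflight[2], 1);  /* done accounted on cpu2 */
          return NULL;
  }

  int main(void)
  {
          pthread_t t;
          int total = 0;

          total += atomic_load(&inflight[0]); /* cpu0 read as 0 */
          pthread_create(&t, NULL, io_thread, NULL);
          pthread_join(t, NULL);              /* IO starts and completes now */
          total += atomic_load(&inflight[1]); /* cpu1 is 0 */
          total += atomic_load(&inflight[2]); /* cpu2 is -1 */
          printf("total inflight = %d\n", total);  /* prints -1 */
          return 0;
  }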
2025-06-24  fs: Remove three arguments from block_write_end()  (Matthew Wilcox (Oracle))
block_write_end() looks like it can be used as a ->write_end() implementation. However, it can't as it does not unlock nor put the folio. Since it does not use the 'file', 'mapping' nor 'fsdata' arguments, remove them. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/20250624132130.1590285-1-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23  block: add FALLOC_FL_WRITE_ZEROES support  (Zhang Yi)
Add support for FALLOC_FL_WRITE_ZEROES: if the block device enables the unmap write zeroes operation, it will issue a write zeroes command. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-9-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
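A hypothetical userspace sketch of the new mode (FALLOC_FL_WRITE_ZEROES must be present in your linux/falloc.h for this to build; the device node is a placeholder, and whether additional mode flags are required is not specified in this log):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <linux/falloc.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/dev/sdX", O_WRONLY);    /* placeholder device */

          /* Zero the first 1 MiB; the kernel issues a write zeroes
           * command when the device supports unmap write zeroes. */
          if (fd < 0 || fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, 1 << 20) < 0) {
                  perror("fallocate");
                  return 1;
          }
          close(fd);
          return 0;
  }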
2025-06-23  block: factor out common part in blkdev_fallocate()  (Zhang Yi)
Only the flags passed to blkdev_issue_zeroout() differ between the two zeroing branches in blkdev_fallocate(). Therefore, clean up by factoring out the common part. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-8-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23  block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits  (Zhang Yi)
Currently, disks primarily implement the write zeroes command (aka REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves physically writing zeros to the disk media (e.g., HDDs), while the second performs an unmap operation on the logical blocks, effectively putting them into a deallocated state (e.g., SSDs). The first method is generally slow, while the second method is typically very fast. For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate the write zeros operation by placing disk blocks into a deallocated state, which opportunistically avoids writing zeroes to media while still guaranteeing that subsequent reads from the specified block range will return zeroed data. This is a best-effort optimization, not a mandatory requirement; some devices may partially fall back to writing physical zeroes due to factors such as misalignment or being asked to clear a block range smaller than the device's internal allocation unit. Therefore, the speed of this operation is not guaranteed.

It is difficult to determine whether a storage device supports the unmap write zeroes operation. We cannot determine this by only querying bdev_limits(bdev)->max_write_zeroes_sectors. Therefore, first, add a new hardware queue limit parameter, max_hw_wzeroes_unmap_sectors, to indicate whether a device supports this unmap write zeroes operation. Then, add two new counterpart software queue limits, max_wzeroes_unmap_sectors and max_user_wzeroes_unmap_sectors, which allow users to disable this operation if the speed is very slow on some special devices. Finally, for the stacked devices cases, initialize these two parameters to UINT_MAX. This operation should be enabled by both the stacking driver and all underlying devices. Thanks to Martin K. Petersen for optimizing the documentation of the write_zeroes_unmap sysfs interface. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-2-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
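A hedged sketch of how the effective limit could be derived from the hardware and user inputs described above (the exact helper this lives in is not named in the log):

  /* Sketch: the software limit is the hardware limit clamped by the user
   * limit; writing 0 to the user limit disables unmap write zeroes.
   * Stacked devices start both at UINT_MAX so every underlying device
   * gets a say through min(). */
  lim->max_wzeroes_unmap_sectors =
          min(lim->max_hw_wzeroes_unmap_sectors,
              lim->max_user_wzeroes_unmap_sectors);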
2025-06-19  fs: convert most other generic_file_*mmap() users to .mmap_prepare()  (Lorenzo Stoakes)
Update nearly all generic_file_mmap() and generic_file_readonly_mmap() callers to use generic_file_mmap_prepare() and generic_file_readonly_mmap_prepare() respectively. We update the blkdev, 9p, afs, erofs, ext2, nfs, ntfs3, smb, ubifs and vboxsf file systems this way. Remaining users we cannot yet update are ecryptfs, fuse and cramfs. The former two are nested file systems that must support any underlying file system, and cramfs inserts a mixed mapping which currently requires a VMA. Once all file systems have been converted to mmap_prepare(), we can then update nested file systems. Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Link: https://lore.kernel.org/08db85970d89b17a995d2cffae96fb4cc462377f.1750099179.git.lorenzo.stoakes@oracle.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-14  Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull block fixes from Jens Axboe:

 - Fix for a deadlock on queue freeze with zoned writes
 - Fix for zoned append emulation
 - Two bio folio fixes, for sparsemem and for very large folios
 - Fix for a performance regression introduced in 6.13 when plug insertion was changed
 - Fix for NVMe passthrough handling for polled IO
 - Document the ublk auto registration feature
 - loop lockdep warning fix

* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
  nvme: always punt polled uring_cmd end_io work to task_work
  Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
  block: Fix bvec_set_folio() for very large folios
  bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
  block: use plug request list tail for one-shot backmerge attempt
  block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
  block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
  ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
  loop: move lo_set_size() out of queue freeze
2025-06-11  block: use plug request list tail for one-shot backmerge attempt  (Jens Axboe)
Previously, the block layer stored the requests in the plug list in LIFO order. For this reason, blk_attempt_plug_merge() would check just the head entry for a back merge attempt, and abort after that unless requests for multiple queues existed in the plug list. Now that requests are no longer reordered (see the Fixes tag), the head entry is the oldest request, which makes the one-shot back merging less useful than before: whenever more than one request is present in the plug list, it will always fail to find a quick merge candidate. Use the tail entry for the one-shot merge attempt, which is the last added request in the list. If that fails, abort immediately unless there are multiple queues available. If multiple queues are available, then scan the list. Ideally the latter scan would be a backwards scan of the list, but as it currently stands, the plug list is singly linked and hence this isn't easily feasible. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/ Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> Fixes: e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") Signed-off-by: Jens Axboe <axboe@kernel.dk>
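A loose model of the changed lookup (the accessor and merge helper names here are hypothetical; the real code walks the request plug list):

  /* Sketch: with insertion order preserved, the newest request is at the
   * tail, so try the one-shot back merge there first. */
  rq = plug_list_tail(plug);                      /* hypothetical accessor */
  if (rq->q == q && attempt_back_merge(q, rq, bio))
          return true;
  if (!plug->multiple_queues)
          return false;                           /* abort immediately */
  /* otherwise fall back to scanning the whole (singly linked) list */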
2025-06-11  block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work  (Christoph Hellwig)
Bios queued up in the zone write plug have already gone through all preparation in the submit_bio path, including the freeze protection. Submitting them through submit_bio_noacct_nocheck duplicates the work and can cause deadlocks when freezing a queue with pending bio write plugs. Go straight to ->submit_bio or blk_mq_submit_bio to bypass the superfluous extra freeze protection and checks. Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation") Reported-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250611044416.2351850-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-11  block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion  (Damien Le Moal)
When blk_zone_write_plug_bio_endio() is called for a regular write BIO used to emulate a zone append operation, that is, a BIO flagged with BIO_EMULATES_ZONE_APPEND, the BIO operation code is restored to the original REQ_OP_ZONE_APPEND but the BIO_EMULATES_ZONE_APPEND flag is not cleared. Clear it to fully return the BIO to its original definition. Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250611005915.89843-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-08  treewide, timers: Rename from_timer() to timer_container_of()  (Ingo Molnar)
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-06  Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull more block updates from Jens Axboe:

 - NVMe pull request via Christoph:
     - TCP error handling fix (Shin'ichiro Kawasaki)
     - TCP I/O stall handling fixes (Hannes Reinecke)
     - fix command limits status code (Keith Busch)
     - support vectored buffers also for passthrough (Pavel Begunkov)
     - spelling fixes (Yi Zhang)
 - MD pull request via Yu:
     - fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
     - fix max_write_behind setting for dm-raid
     - some minor cleanups
 - Integrity data direction fix and cleanup
 - bcache NULL pointer fix
 - Fix for loop missing write start/end handling
 - Decouple hardware queues and IO threads in ublk
 - Slew of ublk selftests additions and updates

* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
  nvme: spelling fixes
  nvme-tcp: fix I/O stalls on congested sockets
  nvme-tcp: sanitize request list handling
  nvme-tcp: remove tag set when second admin queue config fails
  nvme: enable vectored registered bufs for passthrough cmds
  nvme: fix implicit bool to flags conversion
  nvme: fix command limits status code
  selftests: ublk: kublk: improve behavior on init failure
  block: flip iter directions in blk_rq_integrity_map_user()
  block: drop direction param from bio_integrity_copy_user()
  selftests: ublk: cover PER_IO_DAEMON in more stress tests
  Documentation: ublk: document UBLK_F_PER_IO_DAEMON
  selftests: ublk: add stress test for per io daemons
  selftests: ublk: add functional test for per io daemons
  selftests: ublk: kublk: decouple ublk_queues from ublk server threads
  selftests: ublk: kublk: move per-thread data out of ublk_queue
  selftests: ublk: kublk: lift queue initialization out of thread
  selftests: ublk: kublk: tie sqe allocation to io instead of queue
  selftests: ublk: kublk: plumb q_id in io_uring user_data
  ublk: have a per-io daemon instead of a per-queue daemon
  ...
2025-06-03  block: flip iter directions in blk_rq_integrity_map_user()  (Caleb Sander Mateos)
blk_rq_integrity_map_user() creates the ubuf iter with ITER_DEST for write-direction operations and ITER_SOURCE for read-direction ones. This is backwards; writes use the user buffer as a source for metadata and reads use it as a destination. Switch to the rq_data_dir() helper, which maps writes to ITER_SOURCE (WRITE) and reads to ITER_DEST (READ). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: fe8f4ca7107e ("block: modify bio_integrity_map_user to accept iov_iter as argument") Link: https://lore.kernel.org/r/20250603184752.1185676-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
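The direction mapping in miniature (a sketch; the surrounding import call is illustrative of the pattern, not the committed code):

  /* rq_data_dir() returns WRITE for write requests and READ otherwise;
   * WRITE == ITER_SOURCE and READ == ITER_DEST, so the user buffer is a
   * metadata source on writes and a destination on reads. */
  ret = import_ubuf(rq_data_dir(rq), ubuf, bytes, &iter);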
2025-06-03  Merge tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm  (Linus Torvalds)
Pull device mapper updates from Mikulas Patocka:

 - better error handling when reloading a table
 - use generic disable_* functions instead of open coding them
 - lock queue limits when reading them
 - remove unneeded kvfree from alloc_targets
 - fix BLK_FEAT_ATOMIC_WRITES
 - pass through operations on wrapped inline crypto keys
 - dm-verity:
     - use softirq context only when !need_resched()
     - fix a memory leak if some arguments are specified multiple times
 - dm-mpath:
     - interface for explicit probing of active paths
     - replace spin_lock_irqsave with spin_lock_irq
 - dm-delay: don't busy-wait in kthread
 - dm-bufio: remove maximum age based eviction
 - dm-flakey: various fixes
 - vdo indexer: don't read request structure after enqueuing
 - dm-zone: Use bdev_*() helper functions where applicable
 - dm-mirror: fix a tiny race condition
 - dm-stripe: small code cleanup

* tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
  dm-stripe: small code cleanup
  dm-verity: fix a memory leak if some arguments are specified multiple times
  dm-mirror: fix a tiny race condition
  dm-table: check BLK_FEAT_ATOMIC_WRITES inside limits_lock
  dm mpath: replace spin_lock_irqsave with spin_lock_irq
  dm-mpath: Don't grab work_mutex while probing paths
  dm-zone: Use bdev_*() helper functions where applicable
  dm vdo indexer: don't read request structure after enqueuing
  dm: pass through operations on wrapped inline crypto keys
  blk-crypto: export wrapped key functions
  dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits
  dm mpath: Interface for explicit probing of active paths
  dm: Allow .prepare_ioctl to handle ioctls directly
  dm-flakey: make corrupting read bios work
  dm-flakey: remove useless ERROR_READS check in flakey_end_io
  dm-flakey: error all IOs when num_features is absent
  dm-flakey: Clean up parsing messages
  dm: remove unneeded kvfree from alloc_targets
  dm-bufio: remove maximum age based eviction
  dm-verity: use softirq context only when !need_resched()
  ...
2025-06-03  block: drop direction param from bio_integrity_copy_user()  (Caleb Sander Mateos)
direction is determined from bio, which is already passed in. Compute op_is_write(bio_op(bio)) directly instead of converting it to an iter direction and back to a bool. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20250603183133.1178062-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27  Merge tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds)
Pull cgroup updates from Tejun Heo:

 - cgroup rstat shared the tracking tree across all controllers with the rationale being that a cgroup which is using one resource is likely to be using other resources at the same time (ie. if something is allocating memory, it's probably consuming CPU cycles). However, this turned out to not scale very well especially with memcg using rstat for internal operations which made memcg stat read and flush patterns substantially different from other controllers. JP Kobryn split the rstat tree per controller.
 - cgroup BPF support was hooking into cgroup init/exit paths directly. Convert them to use a notifier chain instead so that other usages can be added easily. The two of the patches which implement this are mislabeled as belonging to sched_ext instead of cgroup. Sorry.
 - Relatively minor cpuset updates
 - Documentation updates

* tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits)
  sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifier
  sched_ext: Introduce cgroup_lifetime_notifier
  cgroup: Minor reorganization of cgroup_create()
  cgroup, docs: cpu controller's interaction with various scheduling policies
  cgroup, docs: convert space indentation to tab indentation
  cgroup: avoid per-cpu allocation of size zero rstat cpu locks
  cgroup, docs: be specific about bandwidth control of rt processes
  cgroup: document the rstat per-cpu initialization
  cgroup: helper for checking rstat participation of css
  cgroup: use subsystem-specific rstat locks to avoid contention
  cgroup: use separate rstat trees for each subsystem
  cgroup: compare css to cgroup::self in helper for distingushing css
  cgroup: warn on rstat usage by early init subsystems
  cgroup/cpuset: drop useless cpumask_empty() in compute_effective_exclusive_cpumask()
  cgroup/rstat: Improve cgroup_rstat_push_children() documentation
  cgroup: fix goto ordering in cgroup_init()
  cgroup: fix pointer check in css_rstat_init()
  cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs
  cgroup/cpuset: Fix obsolete comment in cpuset_css_offline()
  cgroup/cpuset: Always use cpu_active_mask
  ...
2025-05-26  Merge tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux  (Linus Torvalds)
Pull xfs updates from Carlos Maiolino:

 - Atomic writes for XFS
 - Remove experimental warnings for pNFS, scrub and parent pointers

* tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
  xfs: add inode to zone caching for data placement
  xfs: free the item in xfs_mru_cache_insert on failure
  xfs: remove the EXPERIMENTAL warning for pNFS
  xfs: remove some EXPERIMENTAL warnings
  xfs: Remove deprecated xfs_bufd sysctl parameters
  xfs: stop using set_blocksize
  xfs: allow sysadmins to specify a maximum atomic write limit at mount time
  xfs: update atomic write limits
  xfs: add xfs_calc_atomic_write_unit_max()
  xfs: add xfs_file_dio_write_atomic()
  xfs: commit CoW-based atomic writes atomically
  xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
  xfs: add xfs_atomic_write_cow_iomap_begin()
  xfs: refine atomic write size check in xfs_file_write_iter()
  xfs: refactor xfs_reflink_end_cow_extent()
  xfs: allow block allocator to take an alignment hint
  xfs: ignore HW which cannot atomic write a single block
  xfs: add helpers to compute transaction reservation for finishing intent items
  xfs: add helpers to compute log item overhead
  xfs: separate out setting buftarg atomic writes limits
  ...
2025-05-26  Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull block updates from Jens Axboe:

 - ublk updates:
     - Add support for updating the size of a ublk instance
     - Zero-copy improvements
     - Auto-registering of buffers for zero-copy
     - Series simplifying and improving GET_DATA and request lookup
     - Series adding quiesce support
     - Lots of selftests additions
     - Various cleanups
 - NVMe updates via Christoph:
     - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch)
     - nvme-fcloop refcounting fixes (Daniel Wagner)
     - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff)
     - support shared CQs in the PCI endpoint target code (Wilfred Mallawa)
     - support admin-queue only authentication (Hannes Reinecke)
     - use the crc32c library instead of the crypto API (Eric Biggers)
     - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
 - MD updates via Yu:
     - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters
 - Clean up brd, getting rid of atomic kmaps and bvec poking
 - Add loop driver specifically for zoned IO testing
 - Eliminate blk-rq-qos calls with a static key, if not enabled
 - Improve hctx locking for when a plug has IO for multiple queues pending
 - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well
 - Improve blk-throttle support
 - Improve delay support for blk-throttle
 - Improve brd discard support
 - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezing
 - Add support for block write streams via FDP (flexible data placement) on NVMe
 - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code
 - Remove obsolete BLK_MQ pci and virtio Kconfig options
 - Add atomic/untorn write support to blktrace
 - Various little cleanups and fixes

* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
  selftests: ublk: add test for UBLK_F_QUIESCE
  ublk: add feature UBLK_F_QUIESCE
  selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
  traceevent/block: Add REQ_ATOMIC flag to block trace events
  ublk: run auto buf unregisgering in same io_ring_ctx with registering
  io_uring: add helper io_uring_cmd_ctx_handle()
  ublk: remove io argument from ublk_auto_buf_reg_fallback()
  ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
  selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
  selftests: ublk: support UBLK_F_AUTO_BUF_REG
  ublk: support UBLK_AUTO_BUF_REG_FALLBACK
  ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
  ublk: prepare for supporting to register request buffer automatically
  ublk: convert to refcount_t
  selftests: ublk: make IO & device removal test more stressful
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  nvme: introduce multipath_always_on module param
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme-pci: derive and better document max segments limits
  nvme-pci: use struct_size for allocation struct nvme_dev
  ...
2025-05-26  Merge tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs  (Linus Torvalds)
Pull final writepage conversion from Christian Brauner:
 "This converts vboxfs from ->writepage() to ->writepages(). This was the last user of the ->writepage() method. So remove ->writepage() completely and all references to it"

* tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: Remove aops->writepage
  mm: Remove swap_writepage() and shmem_writepage()
  ttm: Call shmem_writeout() from ttm_backup_backup_page()
  i915: Use writeback_iter()
  shmem: Add shmem_writeout()
  writeback: Remove writeback_use_writepage()
  migrate: Remove call to ->writepage
  vboxsf: Convert to writepages
  9p: Add a migrate_folio method
2025-05-19  cgroup: use subsystem-specific rstat locks to avoid contention  (JP Kobryn)
It is possible to eliminate contention between subsystems when updating/flushing stats by using subsystem-specific locks. Let the existing rstat locks be dedicated to the cgroup base stats and rename them to reflect that. Add similar locks to the cgroup_subsys struct for use with individual subsystems. Lock initialization is done in the new function ss_rstat_init(ss) which replaces cgroup_rstat_boot(void). If NULL is passed to this function, the global base stat locks will be initialized. Otherwise, the subsystem locks will be initialized. Change the existing lock helper functions to accept a reference to a css. Then within these functions, conditionally select the appropriate locks based on the subsystem affiliation of the given css. Add helper functions for this selection routine to avoid repeated code. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-16  Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi  (Linus Torvalds)
Pull SCSI fix from James Bottomley:
 "Fix to zone block devices to make the maximum segment count match what the block layer is capable of"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: sd_zbc: block: Respect bio vector limits for REPORT ZONES buffer
2025-05-16  Merge tag 'block-6.15-20250515' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
     - fixes for atomic writes (Alan Adamson)
     - fixes for polled CQs in nvmet-epf (Damien Le Moal)
     - fix for polled CQs in nvme-pci (Keith Busch)
     - fix compile on odd configs that need to be forced to inline (Kees Cook)
     - one more quirk (Ilya Guterman)
 - Fix for missing allocation of an integrity buffer for some cases
 - Fix for a regression with ublk command cancelation

* tag 'block-6.15-20250515' of git://git.kernel.dk/linux:
  ublk: fix dead loop when canceling io command
  nvme-pci: add NVME_QUIRK_NO_DEEPEST_PS quirk for SOLIDIGM P44 Pro
  nvme: all namespaces in a subsystem must adhere to a common atomic write size
  nvme: multipath: enable BLK_FEAT_ATOMIC_WRITES for multipathing
  nvmet: pci-epf: remove NVMET_PCI_EPF_Q_IS_SQ
  nvmet: pci-epf: improve debug message
  nvmet: pci-epf: cleanup nvmet_pci_epf_raise_irq()
  nvmet: pci-epf: do not fall back to using INTX if not supported
  nvmet: pci-epf: clear completion queue IRQ flag on delete
  nvme-pci: acquire cq_poll_lock in nvme_poll_irqdisable
  nvme-pci: make nvme_pci_npages_prp() __always_inline
  block: always allocate integrity buffer when required
2025-05-16  blk-mq: add a copyright notice to blk-mq-dma.c  (Christoph Hellwig)
blk-mq-dma.c was split from blk-merge.c which has no copyright notice, but except for some boilerplate code and comments left from the old version this is all my code, so add my copyright. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250513071433.836797-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-16  blk-mq: move the DMA mapping code to a separate file  (Christoph Hellwig)
While working on the new DMA API I kept getting annoyed how it was placed right in the middle of the bio splitting code in blk-merge.c. Split it out into a separate file. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250513071433.836797-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-15  block: fix elv_update_nr_hw_queues() to reattach elevator  (Nilay Shroff)
When nr_hw_queues is updated, the elevator needs to be switched, that is, detached and reattached, to ensure that hctx->sched_tags is correctly allocated for the new hardware queues. However, elv_update_nr_hw_queues() currently only switches the elevator if the queue is not registered. This is incorrect, as it prevents reattaching the elevator after updating nr_hw_queues, which in turn inhibits allocation of sched_tags. Fix this by allowing the elevator switch if the queue is registered, ensuring proper reattachment and resource allocation. Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues") Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250515134511.548270-1-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-15  block/blk-throttle: silence !BLK_DEV_IO_TRACE variable warnings  (Jens Axboe)
If blk-throttle is enabled but blktrace is not, then the compiler will notice that the following two variables are unused:

  ../block/blk-throttle.c: In function 'throtl_pending_timer_fn':
  ../block/blk-throttle.c:1153:30: warning: unused variable 'bio_cnt_w' [-Wunused-variable]
   1153 |         unsigned int bio_cnt_w = sq_queued(sq, WRITE);
        |                      ^~~~~~~~~
  ../block/blk-throttle.c:1152:30: warning: unused variable 'bio_cnt_r' [-Wunused-variable]
   1152 |         unsigned int bio_cnt_r = sq_queued(sq, READ);
        |                      ^~~~~~~~~

Silence them by annotating the variables with __maybe_unused. Fixes: 28ad83b774a6 ("blk-throttle: Split the service queue") Link: https://lore.kernel.org/all/20250515130830.9671-1-aishwarya.tcv@arm.com/ Reported-by: Aishwarya <aishwarya.tcv@arm.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-14  block: Remove obsolete configs BLK_MQ_{PCI,VIRTIO}  (Lukas Bulwahn)
Commit 9bc1e897a821 ("blk-mq: remove unused queue mapping helpers") makes the two config options, BLK_MQ_PCI and BLK_MQ_VIRTIO, have no remaining effect. Remove the two obsolete config options. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20250514065513.463941-1-lukas.bulwahn@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-14  Merge branch 'atomic_writes-6.16' into xfs-6.16-merge  (Carlos Maiolino)
Required update due to conflict with patch:
  xfs: stop using set_blocksize
Conflicts:
  fs/xfs/xfs_buf.c
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14  Merge branch 'block-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into xfs-6.16-merge  (Carlos Maiolino)
Merging block tree into XFS because of some dependencies, like bdev_validate_blocksize(). Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-13  block: remove the same_page output argument to bvec_try_merge_page  (Christoph Hellwig)
bvec_try_merge_page currently returns whether the added page fragment is within the same page as the last page in the current last bio_vec. This information is used by __bio_iov_iter_get_pages so that we always have a single folio pin per page, even when the page is split over multiple __bio_iov_iter_get_pages calls. Threading this through the entire low-level add-page-to-bio logic is annoying and inefficient, and leads to less code sharing than otherwise possible. Instead add code to __bio_iov_iter_get_pages that checks if the bio_vecs did not change, and thus a merge into the last segment must have happened, and if there is an offset into the page for the currently added fragment, because if yes we must have already had a previous fragment of the same page in the last bio_vec. While this is still a bit ugly, it keeps the logic in the one place that needs it and allows for more code sharing. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250512042354.514329-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
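A schematic fragment of that check (variable names such as nr_vecs_before are illustrative, not the committed code):

  /* Sketch: an unchanged bio_vec count means the fragment merged into
   * the last segment; a non-zero in-page offset then implies an earlier
   * fragment of the same page already holds the folio pin. */
  if (bio->bi_vcnt == nr_vecs_before && offset_in_page(off))
          continue;       /* don't take another pin on this folio */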
2025-05-13  blk-throttle: Prevents the bps restricted io from entering the bps queue again  (Zizhi Wo)
[BUG]
There is an issue of delayed IO dispatch caused by IO splitting. Consider the following scenario:
 1) If we set a BPS limit of 1MB/s and restrict the maximum IO size per dispatch to 4KB, submitting -two- 1MB IO requests results in completion times of 1s and 2s, which is expected.
 2) However, if we additionally set an IOPS limit of 1,000,000/s with the same BPS limit of 1MB/s, submitting -two- 1MB IO requests again results in both completing in 2s, even though the IOPS constraint is being met.

[CAUSE]
This issue arises because BPS and IOPS currently share the same queue in the blkthrotl mechanism:
 1) This issue does not occur when only BPS is limited, because the split IOs return false in blk_should_throtl() and do not go through throtl again.
 2) For split IOs, even if they have been tagged with BIO_BPS_THROTTLED, they still get queued alternately in the same list due to continuous splitting and reordering. As a result, the two IO requests are both completed at the 2-second mark, causing an unintended delay.
 3) It is not difficult to imagine that in this scenario, if N 1MB IOs are issued at once, all IOs will eventually complete together in N seconds.

[FIX]
With the queue separation introduced in the previous patches, we now have separate BPS and IOPS queues. IOs that have already passed the BPS limit do not need to re-enter the BPS queue and can be placed directly into the IOPS queue. Since we have split the queues, when the IOPS queue was previously empty and a new bio is added to the first qnode->bios_iops list in the service_queue, we also need to update the disptime. This patch introduces the "THROTL_TG_IOPS_WAS_EMPTY" flag to mark this. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com> Link: https://lore.kernel.org/r/20250506020935.655574-8-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13  blk-throttle: Split the service queue  (Zizhi Wo)
This patch splits throtl_service_queue->nr_queued into "nr_queued_bps" and "nr_queued_iops", allowing separate accounting of BPS and IOPS queued bios. This prepares for future changes that need to check whether the BPS or IOPS queues are empty. To facilitate updating the number of IOs in the BPS and IOPS queues, the addition logic will be moved from throtl_add_bio_tg() to throtl_qnode_add_bio(), and similarly, the removal logic will be moved from tg_dispatch_one_bio() to throtl_pop_queued(). Also introduce sq_queued() to calculate the total sum of sq->nr_queued. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com> Link: https://lore.kernel.org/r/20250506020935.655574-7-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
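A minimal sketch of the helper introduced here (field names from the description, the two-argument form from the warning output quoted earlier in this log; the per-direction array layout is assumed):

  /* Sketch: the per-direction total is the sum of the split counters. */
  static unsigned int sq_queued(struct throtl_service_queue *sq, int rw)
  {
          return sq->nr_queued_bps[rw] + sq->nr_queued_iops[rw];
  }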