|
The iomap_folio_ops are only used for buffered writes, including the zero
and unshare variants. Rename them to iomap_write_ops to better describe
the usage, and pass them through the call chain like the other operation
specific methods instead of through the iomap.
xfs_iomap_valid grows an IOMAP_HOLE check to keep the existing behavior
of never attaching the folio_ops to an iomap representing a hole.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-12-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Replace the ioend pointer in iomap_writeback_ctx with a void *wb_ctx
one to facilitate non-block, non-ioend writeback for future users. Rename
the submit_ioend method to writeback_submit and make it mandatory so
that the generic writeback code stops seeing ioends and bios.
Co-developed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-6-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Replace ->map_blocks with a new ->writeback_range, which differs in the
following ways:
- it must also queue up the I/O for writeback, that is, it calls into the
  slightly refactored and extended-in-scope iomap_add_to_ioend for
  each region
- it can handle only a part of the requested region, that is, the retry
  loop for partial mappings moves to the caller
- it handles cleanup on failures as well, and thus also replaces the
  discard_folio method that was only implemented by XFS.
This will allow the iomap writeback code to also be used for file systems
that are not block based, like fuse.
Co-developed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-5-hch@lst.de
Acked-by: Damien Le Moal <dlemoal@kernel.org> # zonefs
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add inode and wpc fields to pass the inode and writeback context that
are needed in the entire writeback call chain, and let the callers
initialize all fields in the writeback context before calling
iomap_writepages to simplify the argument passing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/20250710133343.399917-3-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The kobject for the queue, `disk->queue_kobj`, is initialized with a
reference count of 1 via `kobject_init()` in `blk_register_queue()`.
While `kobject_del()` is called during the unregister path to remove
the kobject from sysfs, the initial reference is never released.
Add a call to `kobject_put()` in `blk_unregister_queue()` to properly
decrement the reference count and fix the leak.
Fixes: 2bd85221a625 ("block: untangle request_queue refcounting from sysfs")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250711083009.2574432-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Anders and Naresh found that the addition of the FS_IOC_GETLBMD_CAP
handling in the blockdev ioctl handler breaks all ioctls with
_IOC_NR==2, as the new command is not added to the switch but only
a few of the command bits are checked.
Move the check into the blk_get_meta_cap() function itself and make
it return -ENOIOCTLCMD for any unsupported command code, including
those with a smaller size that previously returned -EINVAL.
For consistency this also drops the check for NULL 'arg' that
is really useless, as any invalid pointer should return -EFAULT.
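For illustration, a hedged sketch of the resulting dispatch (the exact signature of blk_get_meta_cap() is an assumption; only the command check and the -ENOIOCTLCMD fallthrough come from this change):
  int blk_get_meta_cap(struct block_device *bdev, unsigned int cmd,
                       struct logical_block_metadata_cap __user *argp)
  {
          /* Anything other than the one supported command code falls
           * through to the regular blockdev ioctl switch. */
          if (cmd != FS_IOC_GETLBMD_CAP)
                  return -ENOIOCTLCMD;

          /* ... fill the capability structure from the blk_integrity
           * profile and copy it to argp ... */
          return 0;
  }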
Fixes: 9eb22f7fedfc ("fs: add ioctl to query metadata and protection info capabilities")
Link: https://lore.kernel.org/all/CA+G9fYvk9HHE5UJ7cdJHTcY6P5JKnp+_e+sdC5U-ZQFTP9_hqQ@mail.gmail.com/
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/20250711084708.2714436-1-arnd@kernel.org
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Patch series "Remove zero_user()".
The zero_user() API is almost unused these days. Finish the job of
removing it.
This patch (of 5):
memzero_page() is the new name for zero_user().
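For callers the conversion is mechanical; a minimal sketch (same argument order, both helpers declared in include/linux/highmem.h):
  /* before */
  zero_user(page, offset, len);

  /* after: same arguments, same zero-fill semantics */
  memzero_page(page, offset, len);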
Link: https://lkml.kernel.org/r/20250612143443.2848197-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20250612143443.2848197-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Instead of manually stashing the data pointer into the parent directory inode's
->i_private, just pass it to debugfs_create_file_aux() so that it can
be extracted without that insane chasing through ->d_parent.
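A hedged sketch of the pattern this enables (debugfs_create_file_aux() is named above; debugfs_get_aux() as the read-back accessor and the identifiers are my assumptions):
  /* creation side: data for ->i_private as before, aux carried separately */
  debugfs_create_file_aux("state", 0444, parent, data, aux, &state_fops);

  /* consumer side: fetch the aux pointer straight from the file,
   * no ->d_parent chasing needed (accessor name is an assumption) */
  void *aux = debugfs_get_aux(file);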
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250702212818.GJ3406663@ZenIV
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add two variants of helper functions that calculate the correct number
of queues to use. Two variants are needed because some drivers base
their maximum number of queues on the possible CPU mask, while others
use the online CPU mask.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-2-13923686b54b@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
group_cpus_evenly() might have allocated fewer groups than requested:
group_cpus_evenly()
  __group_cpus_evenly()
    alloc_nodes_groups()
      # allocated total groups may be less than numgrps when
      # active total CPU number is less than numgrps
In this case, the caller will do an out-of-bounds access because it
assumes the returned array has numgrps masks.
Return the number of groups created so the caller can limit the access
range accordingly.
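A hedged sketch of the resulting caller pattern (the output parameter and the consumer function are illustrative assumptions):
  unsigned int i, nr_masks;
  struct cpumask *masks;

  masks = group_cpus_evenly(numgrps, &nr_masks);
  if (!masks)
          return -ENOMEM;

  /* Only nr_masks entries were initialized -- may be fewer than numgrps. */
  for (i = 0; i < nr_masks; i++)
          init_queue_map(i, &masks[i]);   /* hypothetical consumer */
  kfree(masks);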
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-1-13923686b54b@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a new ioctl, FS_IOC_GETLBMD_CAP, to query metadata and protection
info (PI) capabilities. This ioctl returns information about the file's
integrity profile. This is useful for userspace applications to
understand a file's end-to-end data protection support and configure the
I/O accordingly.
For now this interface is only supported by block devices. However, the
design and placement of this ioctl in the generic FS ioctl space allows us
to extend it to work over files as well. This may be useful when
filesystems start supporting PI-aware layouts.
A new structure struct logical_block_metadata_cap is introduced, which
contains the following fields:
1. lbmd_flags: bitmask of logical block metadata capability flags
2. lbmd_interval: the amount of data described by each unit of logical
block metadata
3. lbmd_size: size in bytes of the logical block metadata associated
with each interval
4. lbmd_opaque_size: size in bytes of the opaque block tag associated
with each interval
5. lbmd_opaque_offset: offset in bytes of the opaque block tag within
the logical block metadata
6. lbmd_pi_size: size in bytes of the T10 PI tuple associated with each
interval
7. lbmd_pi_offset: offset in bytes of T10 PI tuple within the logical
block metadata
8. lbmd_pi_guard_tag_type: T10 PI guard tag type
9. lbmd_pi_app_tag_size: size in bytes of the T10 PI application tag
10. lbmd_pi_ref_tag_size: size in bytes of the T10 PI reference tag
11. lbmd_pi_storage_tag_size: size in bytes of the T10 PI storage tag
The internal logic to fetch the capability is encapsulated in a helper
function blk_get_meta_cap(), which uses the blk_integrity profile
associated with the device. The ioctl returns -EOPNOTSUPP if
CONFIG_BLK_DEV_INTEGRITY is not enabled.
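A hedged userspace sketch of querying the capability (assuming the ioctl and structure definitions come from the updated <linux/fs.h>; any input requirements on the structure are not covered here):
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>   /* FS_IOC_GETLBMD_CAP, struct logical_block_metadata_cap */

  int main(int argc, char **argv)
  {
          struct logical_block_metadata_cap cap = { 0 };
          int fd;

          if (argc != 2)
                  return 1;
          fd = open(argv[1], O_RDONLY);
          if (fd < 0 || ioctl(fd, FS_IOC_GETLBMD_CAP, &cap) < 0) {
                  perror("FS_IOC_GETLBMD_CAP");
                  return 1;
          }
          printf("metadata interval %u, metadata size %u, PI size %u\n",
                 (unsigned int)cap.lbmd_interval,
                 (unsigned int)cap.lbmd_size,
                 (unsigned int)cap.lbmd_pi_size);
          close(fd);
          return 0;
  }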
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/20250630090548.3317-5-anuj20.g@samsung.com
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Introduce a new pi_tuple_size field in struct blk_integrity to
explicitly represent the size (in bytes) of the protection information
(PI) tuple. This is a prep patch.
Add validation in blk_validate_integrity_limits() to ensure that
pi size matches the expected size for known checksum types and never
exceeds the pi_tuple_size.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/20250630090548.3317-3-anuj20.g@samsung.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The tuple_size field in blk_integrity currently represents the total
size of metadata associated with each data interval. To make the meaning
more explicit, rename tuple_size to metadata_size. This is a purely
mechanical rename with no functional changes.
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/20250630090548.3317-2-anuj20.g@samsung.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a new blk_rq_dma_map / blk_rq_dma_unmap pair that does away with
the wasteful scatterlist structure. Instead it uses the mapping iterator
to either add segments to the IOVA for IOMMU operations, or just maps
them one by one for the direct mapping. For the IOMMU case instead of
a scatterlist with an entry for each segment, only a single [dma_addr,len]
pair needs to be stored for processing a request, and for the direct
mapping the per-segment allocation shrinks from
[page,offset,len,dma_addr,dma_len] to just [dma_addr,len].
One big difference to the scatterlist API, which could be considered
a downside, is that the IOVA collapsing only works when the driver sets
a virt_boundary that matches the IOMMU granule. For NVMe this is done
already so it works perfectly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20250625113531.522027-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To keep the DMA mapping helpers from having to check every segment for
its P2P status, ensure that bios either contain only P2P transfers or only
non-P2P transfers, and that a P2P bio only contains ranges from a single
device.
This means we do the page zone access in the bio add path, where the page
should still be cache hot, and will only have to do the fairly expensive
P2P topology lookup once per bio down in the DMA mapping path, and only
for already marked bios.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20250625113531.522027-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for fixing device mapper zone write handling, introduce
the inline helper function bio_needs_zone_write_plugging() to test if a
BIO requires handling through zone write plugging using the function
blk_zone_plug_bio(). This function returns true for any write
(op_is_write(bio) == true) operation directed at a zoned block device
using zone write plugging, that is, a block device with a disk that has
a zone write plug hash table.
This helper allows simplifying the check on entry to blk_zone_plug_bio()
and will be used to protect calls to it for blk-mq devices and DM devices.
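A minimal sketch of the resulting caller pattern (the call-site details are an assumption; the helper and blk_zone_plug_bio() are named above):
  /* Submission path: only BIOs that actually need zone write plugging
   * take the plugging path; everything else proceeds directly. */
  if (bio_needs_zone_write_plugging(bio)) {
          if (blk_zone_plug_bio(bio, nr_segs))
                  return; /* BIO held in a zone write plug for later submission */
  }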
Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Back in 2015, commit d2be537c3ba3 ("block: bump BLK_DEF_MAX_SECTORS to
2560") increased the default maximum size of a block device I/O to 2560
sectors (1280 KiB) to "accommodate a 10-data-disk stripe write with
chunk size 128k". This choice is rather arbitrary and since then,
improvements to the block layer have software RAID drivers correctly
advertize their stripe width through chunk_sectors and abuses of
BLK_DEF_MAX_SECTORS_CAP by drivers (to set the HW limit rather than the
default user controlled maximum I/O size) have been fixed.
Since many block devices can benefit from a larger value of
BLK_DEF_MAX_SECTORS_CAP, and in particular HDDs, increase this value to
be 4MiB, or 8192 sectors.
And given that BLK_DEF_MAX_SECTORS_CAP is only used in the block layer
and should not be used by drivers directly, move this macro definition
to the block layer internal header file block/blk.h.
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250618060045.37593-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block fixes from Jens Axboe:
- Fixes for ublk:
- fix C++ narrowing warnings in the uapi header
- update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header
- fix for the ublk ->queue_rqs() implementation, limiting a batch
to just the specific task AND ring
- ublk_get_data() error handling fix
- sanity check more arguments in ublk_ctrl_add_dev()
- selftest addition
- NVMe pull request via Christoph:
- reset delayed remove_work after reconnect
- fix atomic write size validation
- Fix for a warning introduced in bdev_count_inflight_rw() in this
merge window
* tag 'block-6.16-20250626' of git://git.kernel.dk/linux:
block: fix false warning in bdev_count_inflight_rw()
ublk: sanity check add_dev input for underflow
nvme: fix atomic write size validation
nvme: refactor the atomic write unit detection
nvme: reset delayed remove_work after reconnect
ublk: setup ublk_io correctly in case of ublk_get_data() failure
ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header
ublk: fix narrowing warnings in UAPI header
selftests: ublk: don't take same backing file for more than one ublk devices
ublk: build batch from IOs in same io_ring_ctx and io task
|
|
While bdev_count_inflight is iterating over all CPUs, some IOs may be
issued from an already traversed CPU and then completed on a CPU that has
not been traversed yet:
cpu0
                cpu1
                bdev_count_inflight
                 // for_each_possible_cpu
                 // cpu0 is 0
                 inflight += 0
// issue an io
blk_account_io_start
// cpu0 inflight ++
                                cpu2
                                // the io is done
                                blk_account_io_done
                                // cpu2 inflight --
                 // cpu1 is 0
                 inflight += 0
                 // cpu2 is -1
                 inflight += -1
In this case, the total inflight will be -1, causing lots of false
warnings. Fix the problem by removing the warning.
Note that there is still a valid warning for nvme-mpath (from Yi) that is
not fixed yet.
Fixes: f5482ee5edb9 ("block: WARN if bdev inflight counter is negative")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#mae89155a5006463d0a21a4a2c35ae0034b26a339
Reported-and-tested-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#m1d935a00070bf95055d0ac84e6075158b08acaef
Reported-by: Dave Chinner <david@fromorbit.com>
Closes: https://lore.kernel.org/linux-block/aFuypjqCXo9-5_En@dread.disaster.area/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250626115743.1641443-1-yukuai3@huawei.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
block_write_end() looks like it can be used as a ->write_end()
implementation. However, it can't, as it does not unlock or put
the folio. Since it does not use the 'file', 'mapping' nor 'fsdata'
arguments, remove them.
Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Link: https://lore.kernel.org/20250624132130.1590285-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add support for FALLOC_FL_WRITE_ZEROES. If the block device enables the
unmap write zeroes operation, it will issue a write zeroes command.
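A hedged userspace sketch (the flag requires a kernel and <linux/falloc.h> with this support; the device path is just an example, and whether additional mode flags such as FALLOC_FL_KEEP_SIZE are required is not covered here):
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <linux/falloc.h>       /* FALLOC_FL_WRITE_ZEROES on new kernels */

  int main(void)
  {
          int fd = open("/dev/nvme0n1", O_RDWR);  /* example device */

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* Zero the first 1 MiB; fails if kernel or device lack support. */
          if (fallocate(fd, FALLOC_FL_WRITE_ZEROES, 0, 1 << 20) < 0)
                  perror("fallocate(FALLOC_FL_WRITE_ZEROES)");
          close(fd);
          return 0;
  }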
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-9-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Only the flags passed to blkdev_issue_zeroout() differ between the two
zeroing branches in blkdev_fallocate(). Therefore, clean up by
factoring them out.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-8-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Currently, disks primarily implement the write zeroes command (aka
REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
physically writing zeros to the disk media (e.g., HDDs), while the
second performs an unmap operation on the logical blocks, effectively
putting them into a deallocated state (e.g., SSDs). The first method is
generally slow, while the second method is typically very fast.
For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
the write zeroes operation by placing disk blocks into a deallocated
state, which opportunistically avoids writing zeroes to media while
still guaranteeing that subsequent reads from the specified block range
will return zeroed data. This is a best-effort optimization, not a
mandatory requirement; some devices may partially fall back to writing
physical zeroes due to factors such as misalignment or being asked to
clear a block range smaller than the device's internal allocation unit.
Therefore, the speed of this operation is not guaranteed.
It is difficult to determine whether a storage device supports the unmap
write zeroes operation. We cannot determine this by only querying
bdev_limits(bdev)->max_write_zeroes_sectors. Therefore, first, add a new
hardware queue limit parameter, max_hw_wzeroes_unmap_sectors, to
indicate whether a device supports this unmap write zeroes operation.
Then, add two new counterpart software queue limits,
max_wzeroes_unmap_sectors and max_user_wzeroes_unmap_sectors, which
allow users to disable this operation if the speed is very slow on some
special devices.
Finally, for the stacked device case, initialize these two parameters
to UINT_MAX. This operation should be enabled by both the stacking
driver and all underlying devices.
Thanks to Martin K. Petersen for optimizing the documentation of the
write_zeroes_unmap sysfs interface.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/20250619111806.3546162-2-yi.zhang@huaweicloud.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Update nearly all generic_file_mmap() and generic_file_readonly_mmap()
callers to use generic_file_mmap_prepare() and
generic_file_readonly_mmap_prepare() respectively.
We update blkdev, 9p, afs, erofs, ext2, nfs, ntfs3, smb, ubifs and vboxsf
file systems this way.
Remaining users we cannot yet update are ecryptfs, fuse and cramfs. The
former two are nested file systems that must support any underlying file
system, and cramfs inserts a mixed mapping which currently requires a VMA.
Once all file systems have been converted to mmap_prepare(), we can then
update nested file systems.
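As an illustration, each conversion boils down to swapping the file_operations hook; a hedged sketch with a hypothetical fops instance:
  static const struct file_operations example_file_operations = {
          .llseek         = generic_file_llseek,
          .read_iter      = generic_file_read_iter,
          /* was: .mmap = generic_file_mmap, */
          .mmap_prepare   = generic_file_mmap_prepare,
  };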
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/08db85970d89b17a995d2cffae96fb4cc462377f.1750099179.git.lorenzo.stoakes@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull block fixes from Jens Axboe:
- Fix for a deadlock on queue freeze with zoned writes
- Fix for zoned append emulation
- Two bio folio fixes, for sparsemem and for very large folios
- Fix for a performance regression introduced in 6.13 when plug
insertion was changed
- Fix for NVMe passthrough handling for polled IO
- Document the ublk auto registration feature
- loop lockdep warning fix
* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
nvme: always punt polled uring_cmd end_io work to task_work
Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
block: Fix bvec_set_folio() for very large folios
bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
block: use plug request list tail for one-shot backmerge attempt
block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
loop: move lo_set_size() out of queue freeze
|
|
Previously, the block layer stored the requests in the plug list in
LIFO order. For this reason, blk_attempt_plug_merge() would check
just the head entry for a back merge attempt, and abort after that
unless requests for multiple queues existed in the plug list. If more
than one request is present in the plug list, this makes the one-shot
back merging less useful than before, as it'll always fail to find a
quick merge candidate.
Use the tail entry for the one-shot merge attempt, which is the last
added request in the list. If that fails, abort immediately unless
there are multiple queues available. If multiple queues are available,
then scan the list. Ideally the latter scan would be a backwards scan
of the list, but as it currently stands, the plug list is singly linked
and hence this isn't easily feasible.
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-block/20250611121626.7252-1-abuehaze@amazon.com/
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Fixes: e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Bios queued up in the zone write plug have already gone through all
preparation in the submit_bio path, including the freeze protection.
Submitting them through submit_bio_noacct_nocheck duplicates the work
and can cause deadlocks when freezing a queue with pending bio
write plugs.
Go straight to ->submit_bio or blk_mq_submit_bio to bypass the
superfluous extra freeze protection and checks.
Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation")
Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250611044416.2351850-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When blk_zone_write_plug_bio_endio() is called for a regular write BIO
used to emulate a zone append operation, that is, a BIO flagged with
BIO_EMULATES_ZONE_APPEND, the BIO operation code is restored to the
original REQ_OP_ZONE_APPEND but the BIO_EMULATES_ZONE_APPEND flag is not
cleared. Clear it to fully return the BIO to its original definition.
Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250611005915.89843-1-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
|
|
Pull more block updates from Jens Axboe:
- NVMe pull request via Christoph:
- TCP error handling fix (Shin'ichiro Kawasaki)
- TCP I/O stall handling fixes (Hannes Reinecke)
- fix command limits status code (Keith Busch)
- support vectored buffers also for passthrough (Pavel Begunkov)
- spelling fixes (Yi Zhang)
- MD pull request via Yu:
- fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
- fix max_write_behind setting for dm-raid
- some minor cleanups
- Integrity data direction fix and cleanup
- bcache NULL pointer fix
- Fix for loop missing write start/end handling
- Decouple hardware queues and IO threads in ublk
- Slew of ublk selftests additions and updates
* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
nvme: spelling fixes
nvme-tcp: fix I/O stalls on congested sockets
nvme-tcp: sanitize request list handling
nvme-tcp: remove tag set when second admin queue config fails
nvme: enable vectored registered bufs for passthrough cmds
nvme: fix implicit bool to flags conversion
nvme: fix command limits status code
selftests: ublk: kublk: improve behavior on init failure
block: flip iter directions in blk_rq_integrity_map_user()
block: drop direction param from bio_integrity_copy_user()
selftests: ublk: cover PER_IO_DAEMON in more stress tests
Documentation: ublk: document UBLK_F_PER_IO_DAEMON
selftests: ublk: add stress test for per io daemons
selftests: ublk: add functional test for per io daemons
selftests: ublk: kublk: decouple ublk_queues from ublk server threads
selftests: ublk: kublk: move per-thread data out of ublk_queue
selftests: ublk: kublk: lift queue initialization out of thread
selftests: ublk: kublk: tie sqe allocation to io instead of queue
selftests: ublk: kublk: plumb q_id in io_uring user_data
ublk: have a per-io daemon instead of a per-queue daemon
...
|
|
blk_rq_integrity_map_user() creates the ubuf iter with ITER_DEST for
write-direction operations and ITER_SOURCE for read-direction ones.
This is backwards; writes use the user buffer as a source for metadata
and reads use it as a destination. Switch to the rq_data_dir() helper,
which maps writes to ITER_SOURCE (WRITE) and reads to ITER_DEST (READ).
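A sketch of the corrected direction selection (whether the iter is built with iov_iter_ubuf() is my assumption; the point is the direction argument):
  /* rq_data_dir() returns WRITE (== ITER_SOURCE) for write requests and
   * READ (== ITER_DEST) for reads, so the user buffer is treated as a
   * metadata source on writes and a destination on reads. */
  iov_iter_ubuf(&iter, rq_data_dir(rq), ubuf, bytes);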
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: fe8f4ca7107e ("block: modify bio_integrity_map_user to accept iov_iter as argument")
Link: https://lore.kernel.org/r/20250603184752.1185676-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:
- better error handling when reloading a table
- use generic disable_* functions instead of open coding them
- lock queue limits when reading them
- remove unneeded kvfree from alloc_targets
- fix BLK_FEAT_ATOMIC_WRITES
- pass through operations on wrapped inline crypto keys
- dm-verity:
- use softirq context only when !need_resched()
- fix a memory leak if some arguments are specified multiple times
- dm-mpath:
- interface for explicit probing of active paths
- replace spin_lock_irqsave with spin_lock_irq
- dm-delay: don't busy-wait in kthread
- dm-bufio: remove maximum age based eviction
- dm-flakey: various fixes
- vdo indexer: don't read request structure after enqueuing
- dm-zone: Use bdev_*() helper functions where applicable
- dm-mirror: fix a tiny race condition
- dm-stripe: small code cleanup
* tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
dm-stripe: small code cleanup
dm-verity: fix a memory leak if some arguments are specified multiple times
dm-mirror: fix a tiny race condition
dm-table: check BLK_FEAT_ATOMIC_WRITES inside limits_lock
dm mpath: replace spin_lock_irqsave with spin_lock_irq
dm-mpath: Don't grab work_mutex while probing paths
dm-zone: Use bdev_*() helper functions where applicable
dm vdo indexer: don't read request structure after enqueuing
dm: pass through operations on wrapped inline crypto keys
blk-crypto: export wrapped key functions
dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits
dm mpath: Interface for explicit probing of active paths
dm: Allow .prepare_ioctl to handle ioctls directly
dm-flakey: make corrupting read bios work
dm-flakey: remove useless ERROR_READS check in flakey_end_io
dm-flakey: error all IOs when num_features is absent
dm-flakey: Clean up parsing messages
dm: remove unneeded kvfree from alloc_targets
dm-bufio: remove maximum age based eviction
dm-verity: use softirq context only when !need_resched()
...
|
|
direction is determined from bio, which is already passed in. Compute
op_is_write(bio_op(bio)) directly instead of converting it to an iter
direction and back to a bool.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/20250603183133.1178062-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cgroup rstat shared the tracking tree across all controllers with the
rationale being that a cgroup which is using one resource is likely
to be using other resources at the same time (ie. if something is
allocating memory, it's probably consuming CPU cycles).
However, this turned out to not scale very well especially with memcg
using rstat for internal operations which made memcg stat read and
flush patterns substantially different from other controllers. JP
Kobryn split the rstat tree per controller.
- cgroup BPF support was hooking into cgroup init/exit paths directly.
Convert them to use a notifier chain instead so that other usages can
be added easily. Two of the patches which implement this are
mislabeled as belonging to sched_ext instead of cgroup. Sorry.
- Relatively minor cpuset updates
- Documentation updates
* tag 'cgroup-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits)
sched_ext: Convert cgroup BPF support to use cgroup_lifetime_notifier
sched_ext: Introduce cgroup_lifetime_notifier
cgroup: Minor reorganization of cgroup_create()
cgroup, docs: cpu controller's interaction with various scheduling policies
cgroup, docs: convert space indentation to tab indentation
cgroup: avoid per-cpu allocation of size zero rstat cpu locks
cgroup, docs: be specific about bandwidth control of rt processes
cgroup: document the rstat per-cpu initialization
cgroup: helper for checking rstat participation of css
cgroup: use subsystem-specific rstat locks to avoid contention
cgroup: use separate rstat trees for each subsystem
cgroup: compare css to cgroup::self in helper for distingushing css
cgroup: warn on rstat usage by early init subsystems
cgroup/cpuset: drop useless cpumask_empty() in compute_effective_exclusive_cpumask()
cgroup/rstat: Improve cgroup_rstat_push_children() documentation
cgroup: fix goto ordering in cgroup_init()
cgroup: fix pointer check in css_rstat_init()
cgroup/cpuset: Add warnings to catch inconsistency in exclusive CPUs
cgroup/cpuset: Fix obsolete comment in cpuset_css_offline()
cgroup/cpuset: Always use cpu_active_mask
...
|
|
Pull xfs updates from Carlos Maiolino:
- Atomic writes for XFS
- Remove experimental warnings for pNFS, scrub and parent pointers
* tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
xfs: add inode to zone caching for data placement
xfs: free the item in xfs_mru_cache_insert on failure
xfs: remove the EXPERIMENTAL warning for pNFS
xfs: remove some EXPERIMENTAL warnings
xfs: Remove deprecated xfs_bufd sysctl parameters
xfs: stop using set_blocksize
xfs: allow sysadmins to specify a maximum atomic write limit at mount time
xfs: update atomic write limits
xfs: add xfs_calc_atomic_write_unit_max()
xfs: add xfs_file_dio_write_atomic()
xfs: commit CoW-based atomic writes atomically
xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()
xfs: add xfs_atomic_write_cow_iomap_begin()
xfs: refine atomic write size check in xfs_file_write_iter()
xfs: refactor xfs_reflink_end_cow_extent()
xfs: allow block allocator to take an alignment hint
xfs: ignore HW which cannot atomic write a single block
xfs: add helpers to compute transaction reservation for finishing intent items
xfs: add helpers to compute log item overhead
xfs: separate out setting buftarg atomic writes limits
...
|
|
Pull block updates from Jens Axboe:
- ublk updates:
- Add support for updating the size of a ublk instance
- Zero-copy improvements
- Auto-registering of buffers for zero-copy
- Series simplifying and improving GET_DATA and request lookup
- Series adding quiesce support
- Lots of selftests additions
- Various cleanups
- NVMe updates via Christoph:
- add per-node DMA pools and use them for PRP/SGL allocations
(Caleb Sander Mateos, Keith Busch)
- nvme-fcloop refcounting fixes (Daniel Wagner)
- support delayed removal of the multipath node and optionally
support the multipath node for private namespaces (Nilay Shroff)
- support shared CQs in the PCI endpoint target code (Wilfred
Mallawa)
- support admin-queue only authentication (Hannes Reinecke)
- use the crc32c library instead of the crypto API (Eric Biggers)
- misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
Reinecke, Leon Romanovsky, Gustavo A. R. Silva)
- MD updates via Yu:
- Fix that normal IO can be starved by sync IO, found by mkfs on
newly created large raid5, with some clean up patches for bdev
inflight counters
- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues
pending
- Remove block layer bouncing support, which in turn means we can
remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep
warnings we've been seeing, after enabling lockdep support for queue
freezing/unfreezeing
- Add support for block write streams via FDP (flexible data placement)
on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of
duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes
* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
selftests: ublk: add test for UBLK_F_QUIESCE
ublk: add feature UBLK_F_QUIESCE
selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
traceevent/block: Add REQ_ATOMIC flag to block trace events
ublk: run auto buf unregisgering in same io_ring_ctx with registering
io_uring: add helper io_uring_cmd_ctx_handle()
ublk: remove io argument from ublk_auto_buf_reg_fallback()
ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
selftests: ublk: support UBLK_F_AUTO_BUF_REG
ublk: support UBLK_AUTO_BUF_REG_FALLBACK
ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
ublk: prepare for supporting to register request buffer automatically
ublk: convert to refcount_t
selftests: ublk: make IO & device removal test more stressful
nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
nvme: introduce multipath_always_on module param
nvme-multipath: introduce delayed removal of the multipath head node
nvme-pci: derive and better document max segments limits
nvme-pci: use struct_size for allocation struct nvme_dev
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull final writepage conversion from Christian Brauner:
"This converts vboxfs from ->writepage() to ->writepages().
This was the last user of the ->writepage() method. So remove
->writepage() completely and all references to it"
* tag 'vfs-6.16-rc1.writepage' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Remove aops->writepage
mm: Remove swap_writepage() and shmem_writepage()
ttm: Call shmem_writeout() from ttm_backup_backup_page()
i915: Use writeback_iter()
shmem: Add shmem_writeout()
writeback: Remove writeback_use_writepage()
migrate: Remove call to ->writepage
vboxsf: Convert to writepages
9p: Add a migrate_folio method
|
|
It is possible to eliminate contention between subsystems when
updating/flushing stats by using subsystem-specific locks. Let the existing
rstat locks be dedicated to the cgroup base stats and rename them to
reflect that. Add similar locks to the cgroup_subsys struct for use with
individual subsystems.
Lock initialization is done in the new function ss_rstat_init(ss) which
replaces cgroup_rstat_boot(void). If NULL is passed to this function, the
global base stat locks will be initialized. Otherwise, the subsystem locks
will be initialized.
Change the existing lock helper functions to accept a reference to a css.
Then within these functions, conditionally select the appropriate locks
based on the subsystem affiliation of the given css. Add helper functions
for this selection routine to avoid repeated code.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fix from James Bottomley:
"Fix to zone block devices to make the maximum segment count match what
the block layer is capable of"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: sd_zbc: block: Respect bio vector limits for REPORT ZONES buffer
|
|
Pull block fixes from Jens Axboe:
- NVMe pull request via Christoph:
- fixes for atomic writes (Alan Adamson)
- fixes for polled CQs in nvmet-epf (Damien Le Moal)
- fix for polled CQs in nvme-pci (Keith Busch)
- fix compile on odd configs that need to be forced to inline
(Kees Cook)
- one more quirk (Ilya Guterman)
- Fix for missing allocation of an integrity buffer for some cases
- Fix for a regression with ublk command cancelation
* tag 'block-6.15-20250515' of git://git.kernel.dk/linux:
ublk: fix dead loop when canceling io command
nvme-pci: add NVME_QUIRK_NO_DEEPEST_PS quirk for SOLIDIGM P44 Pro
nvme: all namespaces in a subsystem must adhere to a common atomic write size
nvme: multipath: enable BLK_FEAT_ATOMIC_WRITES for multipathing
nvmet: pci-epf: remove NVMET_PCI_EPF_Q_IS_SQ
nvmet: pci-epf: improve debug message
nvmet: pci-epf: cleanup nvmet_pci_epf_raise_irq()
nvmet: pci-epf: do not fall back to using INTX if not supported
nvmet: pci-epf: clear completion queue IRQ flag on delete
nvme-pci: acquire cq_poll_lock in nvme_poll_irqdisable
nvme-pci: make nvme_pci_npages_prp() __always_inline
block: always allocate integrity buffer when required
|
|
blk-mq-dma.c was split from blk-merge.c which has no copyright notice,
but except for some boilerplate code and comments left from the old
version this is all my code, so add my copyright.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250513071433.836797-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
While working on the new DMA API I kept getting annoyed by how it was placed
right in the middle of the bio splitting code in blk-merge.c.
Split it out into a separate file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250513071433.836797-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When nr_hw_queues is updated, the elevator needs to be switched to
ensure that we exit the elevator and reattach it so that
hctx->sched_tags is correctly allocated for the new hardware queues.
However, elv_update_nr_hw_queues() currently only switches the
elevator if the queue is not registered. This is incorrect, as it
prevents reattaching the elevator after updating nr_hw_queues, which
in turn inhibits allocation of sched_tags.
Fix this by allowing the elevator switch if the queue is registered,
ensuring proper reattachment and resource allocation.
Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250515134511.548270-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If blk-throttle is enabled but blktrace is not, then the compiler will
notice that the following two variables are unused:
../block/blk-throttle.c: In function 'throtl_pending_timer_fn':
../block/blk-throttle.c:1153:30: warning: unused variable 'bio_cnt_w' [-Wunused-variable]
1153 | unsigned int bio_cnt_w = sq_queued(sq, WRITE);
| ^~~~~~~~~
../block/blk-throttle.c:1152:30: warning: unused variable 'bio_cnt_r' [-Wunused-variable]
1152 | unsigned int bio_cnt_r = sq_queued(sq, READ);
| ^~~~~~~~~
Silence that by annotating them with __maybe_unused.
Fixes: 28ad83b774a6 ("blk-throttle: Split the service queue")
Link: https://lore.kernel.org/all/20250515130830.9671-1-aishwarya.tcv@arm.com/
Reported-by: Aishwarya <aishwarya.tcv@arm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 9bc1e897a821 ("blk-mq: remove unused queue mapping helpers") makes
the two config options, BLK_MQ_PCI and BLK_MQ_VIRTIO, have no remaining
effect.
Remove the two obsolete config options.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250514065513.463941-1-lukas.bulwahn@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Required update due to conflict with patch:
xfs: stop using set_blocksize
Conflicts:
fs/xfs/xfs_buf.c
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into xfs-6.16-merge
Merging block tree into XFS because of some dependencies like
bdev_validate_blocksize()
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
bvec_try_merge_page currently returns whether the added page fragment is
within the same page as the last page in the current last bio_vec.
This information is used by __bio_iov_iter_get_pages so that we always
have a single folio pin per page even when the page is split over
multiple __bio_iov_iter_get_pages calls.
Threading this through the entire low-level add-page-to-bio logic is
annoying and inefficient and leads to less code sharing than otherwise
possible. Instead add code to __bio_iov_iter_get_pages that checks if
the bio_vecs did not change and thus a merge into the last segment must
have happened, and if there is an offset into the page for the currently
added fragment, because if yes we must have already had a previous
fragment of the same page in the last bio_vec. While this is still a bit
ugly, it keeps the logic in the one place that needs it and allows for
more code sharing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250512042354.514329-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
[BUG]
There is an issue of delayed IO dispatch caused by IO splitting. Consider
the following scenario:
1) If we set a BPS limit of 1MB/s and restrict the maximum IO size per
dispatch to 4KB, submitting -two- 1MB IO requests results in completion
times of 1s and 2s, which is expected.
2) However, if we additionally set an IOPS limit of 1,000,000/s with the
same BPS limit of 1MB/s, submitting -two- 1MB IO requests again results in
both completing in 2s, even though the IOPS constraint is being met.
[CAUSE]
This issue arises because BPS and IOPS currently share the same queue in
the blkthrotl mechanism:
1) This issue does not occur when only BPS is limited because the split IOs
return false in blk_should_throtl() and do not go through to throtl again.
2) For split IOs, even if they have been tagged with BIO_BPS_THROTTLED,
they still get queued alternately in the same list due to continuous
splitting and reordering. As a result, the two IO requests are both
completed at the 2-second mark, causing an unintended delay.
3) It is not difficult to imagine that in this scenario, if N 1MB IOs are
issued at once, all IOs will eventually complete together in N seconds.
[FIX]
With the queue separation introduced in the previous patches, we now have
separate BPS and IOPS queues. IOs that have already passed the BPS
limitation do not need to re-enter the BPS queue and can be placed
directly into the IOPS queue.
Since we have split the queues, when the IOPS queue was previously empty
and a new bio is added to the first qnode->bios_iops list in the
service_queue, we also need to update the disptime. This patch introduces
the "THROTL_TG_IOPS_WAS_EMPTY" flag to mark it.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-8-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch splits throtl_service_queue->nr_queued into "nr_queued_bps" and
"nr_queued_iops", allowing separate accounting of BPS and IOPS queued bios.
This prepares for future changes that need to check whether the BPS or IOPS
queues are empty.
To facilitate updating the number of IOs in the BPS and IOPS queues, the
addition logic will be moved from throtl_add_bio_tg() to
throtl_qnode_add_bio(), and similarly, the removal logic will be moved from
tg_dispatch_one_bio() to throtl_pop_queued().
And introduce sq_queued() to calculate the total sum of sq->nr_queued.
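A minimal sketch of the new accounting helper, assuming the split counters are per-direction arrays like the old nr_queued (the exact field layout is an assumption):
  static unsigned int sq_queued(struct throtl_service_queue *sq, int rw)
  {
          /* total bios waiting in this service queue for one direction,
           * whether they wait on the BPS or on the IOPS limit */
          return sq->nr_queued_bps[rw] + sq->nr_queued_iops[rw];
  }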
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-7-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|