summaryrefslogtreecommitdiff
path: root/drivers/block
AgeCommit message (Collapse)Author
2024-06-19block: move the nonrot flag to queue_limitsChristoph Hellwig
Move the nonrot flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Use the chance to switch to defaulting to non-rotational and require the driver to opt into rotational, which matches the polarity of the sysfs interface. For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new rotational flag is not set as they clearly are not rotational despite this being a behavior change. There are some other drivers that unconditionally set the rotational flag to keep the existing behavior as they arguably can be used on rotational devices even if that is probably not their main use today (e.g. virtio_blk and drbd). The flag is automatically inherited in blk_stack_limits matching the existing behavior in dm and md. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19block: move cache control settings out of queue->flagsChristoph Hellwig
Move the cache control settings into the queue_limits so that the flags can be set atomically with the device queue frozen. Add new features and flags field for the driver set flags, and internal (usually sysfs-controlled) flags in the block layer. Note that we'll eventually remove enough field from queue_limits to bring it back to the previous size. The disable flag is inverted compared to the previous meaning, which means it now survives a rescan, similar to the max_sectors and max_discard_sectors user limits. The FLUSH and FUA flags are now inherited by blk_stack_limits, which simplified the code in dm a lot, but also causes a slight behavior change in that dm-switch and dm-unstripe now advertise a write cache despite setting num_flush_bios to 0. The I/O path will handle this gracefully, but as far as I can tell the lack of num_flush_bios and thus flush support is a pre-existing data integrity bug in those targets that really needs fixing, after which a non-zero num_flush_bios should be required in dm for targets that map to underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19nbd: move setting the cache control flags to __nbd_set_sizeChristoph Hellwig
Move setting the cache control flags in nbd in preparation for moving these flags into the queue_limits structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19virtio_blk: remove virtblk_update_cache_modeChristoph Hellwig
virtblk_update_cache_mode boils down to a single call to blk_queue_write_cache. Remove it in preparation for moving the cache control flags into the queue_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240617060532.127975-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19loop: fold loop_update_rotational into loop_reconfigure_limitsChristoph Hellwig
This prepares for moving the rotational flag into the queue_limits and also fixes it for the case where the loop device is backed by a block device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240617060532.127975-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19loop: also use the default block size from an underlying block deviceChristoph Hellwig
Fix the code in loop_reconfigure_limits to pick a default block size for O_DIRECT file descriptors to also work when the loop device sits on top of a block device and not just on a regular file on a block device based file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240617060532.127975-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19loop: regularize upgrading the block size for direct I/OChristoph Hellwig
The LOOP_CONFIGURE path automatically upgrades the block size to that of the underlying file for O_DIRECT file descriptors, but the LOOP_SET_BLOCK_SIZE path does not. Fix this by lifting the code to pick the block size into common code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240617060532.127975-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19loop: always update discard settings in loop_reconfigure_limitsChristoph Hellwig
Simplify loop_reconfigure_limits by always updating the discard limits. This adds a little more work to loop_set_block_size, but doesn't change the outcome as the discard flag won't change. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240617060532.127975-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19loop: stop using loop_reconfigure_limits in __loop_clr_fdChristoph Hellwig
__loop_clr_fd wants to clear all settings on the device. Prepare for moving more settings into the block limits by open coding loop_reconfigure_limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240617060532.127975-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19xen-blkfront: don't disable cache flushes when they failChristoph Hellwig
blkfront always had a robust negotiation protocol for detecting a write cache. Stop simply disabling cache flushes in the block layer as the flags handling is moving to the atomic queue limits API that needs user context to freeze the queue for that. Instead handle the case of the feature flags cleared inside of blkfront. This removes old debug code to check for such a mismatch which was previously impossible to hit, including the check for passthrough requests that blkfront never used to start with. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19z2ram: add missing MODULE_DESCRIPTION() macroJeff Johnson
With ARCH=m68k, make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/z2ram.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240617-md-m68k-drivers-block-v1-3-b200599a315e@quicinc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19ataflop: add missing MODULE_DESCRIPTION() macroJeff Johnson
With ARCH=m68k, make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/ataflop.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240617-md-m68k-drivers-block-v1-2-b200599a315e@quicinc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-19amiflop: add missing MODULE_DESCRIPTION() macroJeff Johnson
With ARCH=m68k, make allmodconfig && make W=1 C=1 reports: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/block/amiflop.o Add the missing invocation of the MODULE_DESCRIPTION() macro. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240617-md-m68k-drivers-block-v1-1-b200599a315e@quicinc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14Merge branch 'for-6.11/block-limits' into for-6.11/blockJens Axboe
Pull in block limits branch, which exists as a shared branch for both the block and SCSI tree. * for-6.11/block-limits: (26 commits) block: move integrity information into queue_limits block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags block: bypass the STABLE_WRITES flag for protection information block: don't require stable pages for non-PI metadata block: use kstrtoul in flag_store block: factor out flag_{store,show} helper for integrity block: remove the blk_flush_integrity call in blk_integrity_unregister block: remove the blk_integrity_profile structure dm-integrity: use the nop integrity profile md/raid1: don't free conf on raid0_run failure md/raid0: don't free conf on raid0_run failure block: initialize integrity buffer to zero before writing it to media block: add special APIs for run-time disabling of discard and friends block: remove unused queue limits API sr: convert to the atomic queue limits API sd: convert to the atomic queue limits API sd: cleanup zoned queue limits initialization sd: factor out a sd_discard_mode helper sd: simplify the disable case in sd_config_discard sd: add a sd_disable_write_same helper ...
2024-06-14block: add special APIs for run-time disabling of discard and friendsChristoph Hellwig
A few drivers optimistically try to support discard, write zeroes and secure erase and disable the features from the I/O completion handler if the hardware can't support them. This disable can't be done using the atomic queue limits API because the I/O completion handlers can't take sleeping locks or freeze the queue. Keep the existing clearing of the relevant field to zero, but replace the old blk_queue_max_* APIs with new disable APIs that force the value to 0. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14block: take io_opt and io_min into account for max_sectorsChristoph Hellwig
The soft max_sectors limit is normally capped by the hardware limits and an arbitrary upper limit enforced by the kernel, but can be modified by the user. A few drivers want to increase this limit (nbd, rbd) or adjust it up or down based on hardware capabilities (sd). Change blk_validate_limits to default max_sectors to the optimal I/O size, or upgrade it to the preferred minimal I/O size if that is larger than the kernel default if no optimal I/O size is provided based on the logic in the SD driver. This keeps the existing kernel default for drivers that do not provide an io_opt or very big io_min value, but picks a much more useful default for those who provide these hints, and allows to remove the hacks to set the user max_sectors limit in nbd, rbd and sd. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14rbd: increase io_opt againChristoph Hellwig
Commit 16d80c54ad42 ("rbd: set io_min, io_opt and discard_granularity to alloc_size") lowered the io_opt size for rbd from objset_bytes which is 4MB for typical setup to alloc_size which is typically 64KB. The commit mostly talks about discard behavior and does mention io_min in passing. Reducing io_opt means reducing the readahead size, which seems counter-intuitive given that rbd currently abuses the user max_sectors setting to actually increase the I/O size. Switch back to the old setting to allow larger reads (the readahead size despite it's name actually limits the size of any buffered read) and to prepare for using io_opt in the max_sectors calculation and getting drivers out of the business of overriding the max_user_sectors value. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240531074837.1648501-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14rust: block: add rnull, Rust null_blk implementationAndreas Hindborg
This patch adds an initial version of the Rust null block driver. Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com> Reviewed-by: Benno Lossin <benno.lossin@proton.me> Link: https://lore.kernel.org/r/20240611114551.228679-3-nmi@metaspace.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-14loop: Disable fallocate() zero and discard if not supportedCyril Hrubis
If fallcate is implemented but zero and discard operations are not supported by the filesystem the backing file is on we continue to fill dmesg with errors from the blk_mq_end_request() since each time we call fallocate() on the loop device the EOPNOTSUPP error from lo_fallocate() ends up propagated into the block layer. In the end syscall succeeds since the blkdev_issue_zeroout() falls back to writing zeroes which makes the errors even more misleading and confusing. How to reproduce: 1. make sure /tmp is mounted as tmpfs 2. dd if=/dev/zero of=/tmp/disk.img bs=1M count=100 3. losetup /dev/loop0 /tmp/disk.img 4. mkfs.ext2 /dev/loop0 5. dmesg |tail [710690.898214] operation not supported error, dev loop0, sector 204672 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0 [710690.898279] operation not supported error, dev loop0, sector 522 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0 [710690.898603] operation not supported error, dev loop0, sector 16906 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0 [710690.898917] operation not supported error, dev loop0, sector 32774 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0 [710690.899218] operation not supported error, dev loop0, sector 49674 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0 [710690.899484] operation not supported error, dev loop0, sector 65542 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0 [710690.899743] operation not supported error, dev loop0, sector 82442 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0 [710690.900015] operation not supported error, dev loop0, sector 98310 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0 [710690.900276] operation not supported error, dev loop0, sector 115210 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0 [710690.900546] operation not supported error, dev loop0, sector 131078 op 0x9:(WRITE_ZEROES) flags 0x8000800 phys_seg 0 prio class 0 This patch changes the lo_fallocate() to clear the flags for zero and discard operations if we get EOPNOTSUPP from the backing file fallocate callback, that way we at least stop spewing errors after the first unsuccessful try. CC: Jan Kara <jack@suse.cz> Signed-off-by: Cyril Hrubis <chrubis@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20240613163817.22640-1-chrubis@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12nbd: Remove __force castsChristoph Hellwig
Make it again possible for sparse to verify that blk_status_t and Unix error codes are used in the proper context by making nbd_send_cmd() return a blk_status_t instead of an integer. No functionality has been changed. Signed-off-by: Christoph Hellwig <hch@lst.de> [ bvanassche: added description and made two small formatting changes ] Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240604221531.327131-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-05null_blk: fix validation of block sizeAndreas Hindborg
Block size should be between 512 and PAGE_SIZE and be a power of 2. The current check does not validate this, so update the check. Without this patch, null_blk would Oops due to a null pointer deref when loaded with bs=1536 [1]. Link: https://lore.kernel.org/all/87wmn8mocd.fsf@metaspace.dk/ Signed-off-by: Andreas Hindborg <a.hindborg@samsung.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240603192645.977968-1-nmi@metaspace.dk [axboe: remove unnecessary braces and != 0 check] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-30null_blk: Do not allow runt zone with zone capacity smaller then zone sizeDamien Le Moal
A zoned device with a smaller last zone together with a zone capacity smaller than the zone size does make any sense as that does not correspond to any possible setup for a real device: 1) For ZNS and zoned UFS devices, all zones are always the same size. 2) For SMR HDDs, all zones always have the same capacity. In other words, if we have a smaller last runt zone, then this zone capacity should always be equal to the zone size. Add a check in null_init_zoned_dev() to prevent a configuration to have both a smaller zone size and a zone capacity smaller than the zone size. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240530054035.491497-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-28null_blk: Print correct max open zones limit in null_init_zoned_dev()Damien Le Moal
When changing the maximum number of open zones, print that number instead of the total number of zones. Fixes: dc4d137ee3b7 ("null_blk: add support for max open/active zone limit for zoned devices") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240528062852.437599-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-27null_blk: Fix return value of nullb_device_power_store()Damien Le Moal
When powering on a null_blk device that is not already on, the return value ret that is initialized to be count is reused to check the return value of null_add_dev(), leading to nullb_device_power_store() to return null_add_dev() return value (0 on success) instead of "count". So make sure to set ret to be equal to count when there are no errors. Fixes: a2db328b0839 ("null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues'") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20240527043445.235267-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-23Merge tag 'block-6.10-20240523' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: "Followup block updates, mostly due to NVMe being a bit late to the party. But nothing major in there, so not a big deal. In detail, this contains: - NVMe pull request via Keith: - Fabrics connection retries (Daniel, Hannes) - Fabrics logging enhancements (Tokunori) - RDMA delete optimization (Sagi) - ublk DMA alignment fix (me) - null_blk sparse warning fixes (Bart) - Discard support for brd (Keith) - blk-cgroup list corruption fixes (Ming) - blk-cgroup stat propagation fix (Waiman) - Regression fix for plugging stall with md (Yu) - Misc fixes or cleanups (David, Jeff, Justin)" * tag 'block-6.10-20240523' of git://git.kernel.dk/linux: (24 commits) null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues' blk-throttle: remove unused struct 'avg_latency_bucket' block: fix lost bio for plug enabled bio based device block: t10-pi: add MODULE_DESCRIPTION() blk-mq: add helper for checking if one CPU is mapped to specified hctx blk-cgroup: Properly propagate the iostat update up the hierarchy blk-cgroup: fix list corruption from reorder of WRITE ->lqueued blk-cgroup: fix list corruption from resetting io stat cdrom: rearrange last_media_change check to avoid unintentional overflow nbd: Fix signal handling nbd: Remove a local variable from nbd_send_cmd() nbd: Improve the documentation of the locking assumptions nbd: Remove superfluous casts nbd: Use NULL to represent a pointer brd: implement discard support null_blk: Fix two sparse warnings ublk_drv: set DMA alignment mask to 3 nvme-rdma, nvme-tcp: include max reconnects for reconnect logging nvmet-rdma: Avoid o(n^2) loop in delete_ctrl nvme: do not retry authentication failures ...
2024-05-23Merge tag 'trace-assign-str-v6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing cleanup from Steven Rostedt: "Remove second argument of __assign_str() The __assign_str() macro logic of the TRACE_EVENT() macro was optimized so that it no longer needs the second argument. The __assign_str() is always matched with __string() field that takes a field name and the source for that field: __string(field, source) The TRACE_EVENT() macro logic will save off the source value and then use that value to copy into the ring buffer via the __assign_str(). Before commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string"), the __assign_str() needed the second argument which would perform the same logic as the __string() source parameter did. Not only would this add overhead, but it was error prone as if the __assign_str() source produced something different, it may not have allocated enough for the string in the ring buffer (as the __string() source was used to determine how much to allocate) Now that the __assign_str() just uses the same string that was used in __string() it no longer needs the source parameter. It can now be removed" * tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/treewide: Remove second parameter of __assign_str()
2024-05-23Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-net is finally supported in vduse - virtio (balloon and mem) interaction with suspend is improved - vhost-scsi now handles signals better/faster And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-pci: Check if is_avq is NULL virtio: delete vq in vp_find_vqs_msix() when request_irq() fails MAINTAINERS: add Eugenio Pérez as reviewer vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API vp_vdpa: don't allocate unused msix vectors sound: virtio: drop owner assignment fuse: virtio: drop owner assignment scsi: virtio: drop owner assignment rpmsg: virtio: drop owner assignment nvdimm: virtio_pmem: drop owner assignment wifi: mac80211_hwsim: drop owner assignment vsock/virtio: drop owner assignment net: 9p: virtio: drop owner assignment net: virtio: drop owner assignment net: caif: virtio: drop owner assignment misc: nsm: drop owner assignment iommu: virtio: drop owner assignment drm/virtio: drop owner assignment gpio: virtio: drop owner assignment firmware: arm_scmi: virtio: drop owner assignment ...
2024-05-23null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues'Yu Kuai
Writing 'power' and 'submit_queues' concurrently will trigger kernel panic: Test script: modprobe null_blk nr_devices=0 mkdir -p /sys/kernel/config/nullb/nullb0 while true; do echo 1 > submit_queues; echo 4 > submit_queues; done & while true; do echo 1 > power; echo 0 > power; done Test result: BUG: kernel NULL pointer dereference, address: 0000000000000148 Oops: 0000 [#1] PREEMPT SMP RIP: 0010:__lock_acquire+0x41d/0x28f0 Call Trace: <TASK> lock_acquire+0x121/0x450 down_write+0x5f/0x1d0 simple_recursive_removal+0x12f/0x5c0 blk_mq_debugfs_unregister_hctxs+0x7c/0x100 blk_mq_update_nr_hw_queues+0x4a3/0x720 nullb_update_nr_hw_queues+0x71/0xf0 [null_blk] nullb_device_submit_queues_store+0x79/0xf0 [null_blk] configfs_write_iter+0x119/0x1e0 vfs_write+0x326/0x730 ksys_write+0x74/0x150 This is because del_gendisk() can concurrent with blk_mq_update_nr_hw_queues(): nullb_device_power_store nullb_apply_submit_queues null_del_dev del_gendisk nullb_update_nr_hw_queues if (!dev->nullb) // still set while gendisk is deleted return 0 blk_mq_update_nr_hw_queues dev->nullb = NULL Fix this problem by resuing the global mutex to protect nullb_device_power_store() and nullb_update_nr_hw_queues() from configfs. Fixes: 45919fbfe1c4 ("null_blk: Enable modifying 'submit_queues' after an instance has been configured") Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs9LgsHLnjg8z06LQ3Pr5cax-+Ps+xT7AP7TPnEjStuwZA@mail.gmail.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://lore.kernel.org/r/20240523153934.1937851-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-22tracing/treewide: Remove second parameter of __assign_str()Steven Rostedt (Google)
With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22virtio_blk: drop owner assignmentKrzysztof Kozlowski
virtio core already sets the .owner, so driver does not need to. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Message-Id: <20240331-module-owner-virtio-v2-6-98f04bfaf46a@linaro.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-05-21Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc vfs updates from Al Viro: "Assorted commits that had missed the last merge window..." * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: remove call_{read,write}_iter() functions do_dentry_open(): kill inode argument kernel_file_open(): get rid of inode argument get_file_rcu(): no need to check for NULL separately fd_is_open(): move to fs/file.c close_on_exec(): pass files_struct instead of fdtable
2024-05-21Merge tag 'pull-set_blocksize' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file *' and verifies that the caller has the device opened exclusively" * tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-14nbd: Fix signal handlingBart Van Assche
Both nbd_send_cmd() and nbd_handle_cmd() return either a negative error number or a positive blk_status_t value. nbd_queue_rq() converts these return values into a blk_status_t value. There is a bug in the conversion code: if nbd_send_cmd() returns BLK_STS_RESOURCE, nbd_queue_rq() should return BLK_STS_RESOURCE instead of BLK_STS_OK. Fix this, move the conversion code into nbd_handle_cmd() and fix the remaining sparse warnings. This patch fixes the following sparse warnings: drivers/block/nbd.c:673:32: warning: incorrect type in return expression (different base types) drivers/block/nbd.c:673:32: expected int drivers/block/nbd.c:673:32: got restricted blk_status_t [usertype] drivers/block/nbd.c:714:48: warning: incorrect type in return expression (different base types) drivers/block/nbd.c:714:48: expected int drivers/block/nbd.c:714:48: got restricted blk_status_t [usertype] drivers/block/nbd.c:1120:21: warning: incorrect type in assignment (different base types) drivers/block/nbd.c:1120:21: expected int [assigned] ret drivers/block/nbd.c:1120:21: got restricted blk_status_t [usertype] drivers/block/nbd.c:1125:16: warning: incorrect type in return expression (different base types) drivers/block/nbd.c:1125:16: expected restricted blk_status_t drivers/block/nbd.c:1125:16: got int [assigned] ret Cc: Christoph Hellwig <hch@lst.de> Cc: Josef Bacik <jbacik@fb.com> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Markus Pargmann <mpa@pengutronix.de> Fixes: fc17b6534eb8 ("blk-mq: switch ->queue_rq return value to blk_status_t") Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240510202313.25209-6-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-14nbd: Remove a local variable from nbd_send_cmd()Bart Van Assche
blk_rq_bytes() returns an unsigned int while 'size' has type unsigned long. This is confusing. Improve code readability by removing the local variable 'size'. Cc: Christoph Hellwig <hch@lst.de> Cc: Josef Bacik <jbacik@fb.com> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240510202313.25209-5-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-14nbd: Improve the documentation of the locking assumptionsBart Van Assche
Document locking assumptions with lockdep_assert_held() instead of source code comments. The advantage of lockdep_assert_held() is that it is verified at runtime if lockdep is enabled in the kernel config. Cc: Christoph Hellwig <hch@lst.de> Cc: Josef Bacik <jbacik@fb.com> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240510202313.25209-4-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-14nbd: Remove superfluous castsBart Van Assche
In Linux kernel code it is preferred not to use a cast when converting a void pointer to another pointer type. Cc: Christoph Hellwig <hch@lst.de> Cc: Josef Bacik <jbacik@fb.com> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Markus Pargmann <mpa@pengutronix.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240510202313.25209-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-13brd: implement discard supportKeith Busch
The ramdisk memory utilization can only go up when data is written to new pages. Implement discard to provide the possibility to reduce memory usage for pages no longer in use. Aligned discards will free the associated pages, if any, and determinisitically return zeroed data until written again. Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240429102308.147627-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-13null_blk: Fix two sparse warningsBart Van Assche
Fix the following sparse warnings: drivers/block/null_blk/main.c:1243:35: warning: incorrect type in return expression (different base types) drivers/block/null_blk/main.c:1243:35: expected int drivers/block/null_blk/main.c:1243:35: got restricted blk_status_t drivers/block/null_blk/main.c:1291:30: warning: incorrect type in return expression (different base types) drivers/block/null_blk/main.c:1291:30: expected restricted blk_status_t drivers/block/null_blk/main.c:1291:30: got int Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240510201816.24921-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-13ublk_drv: set DMA alignment mask to 3Jens Axboe
By default, this will be 511, as that's the block layer default. But drivers these days can support memory alignments that aren't tied to the sector sizes, instead just being limited by what the DMA engine supports. An example is NVMe, where it's generally set to a 32-bit or 64-bit boundary. As ublk itself doesn't really care, just set it low enough that we don't run into issues with NVMe where the required O_DIRECT memory alignment is now more restrictive on ublk than it is on the underlying device. This was triggered by spurious -EINVAL returns on O_DIRECT IO on a setup with ublk managing NVMe devices, which previously worked just fine on the NVMe device itself. With the alignment relaxed, the test works fine. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-13Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - Add a partscan attribute in sysfs, fixing an issue with systemd relying on an internal interface that went away. - Attempt #2 at making long running discards interruptible. The previous attempt went into 6.9, but we ended up mostly reverting it as it had issues. - Remove old ida_simple API in bcache - Support for zoned write plugging, greatly improving the performance on zoned devices. - Remove the old throttle low interface, which has been experimental since 2017 and never made it beyond that and isn't being used. - Remove page->index debugging checks in brd, as it hasn't caught anything and prepares us for removing in struct page. - MD pull request from Song - Don't schedule block workers on isolated CPUs * tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits) blk-throttle: delay initialization until configuration blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW block: fix that util can be greater than 100% block: support to account io_ticks precisely block: add plug while submitting IO bcache: fix variable length array abuse in btree_iter bcache: Remove usage of the deprecated ida_simple_xx() API md: Revert "md: Fix overflow in is_mddev_idle" blk-lib: check for kill signal in ioctl BLKDISCARD block: add a bio_await_chain helper block: add a blk_alloc_discard_bio helper block: add a bio_chain_and_submit helper block: move discard checks into the ioctl handler block: remove the discard_granularity check in __blkdev_issue_discard block/ioctl: prefer different overflow check null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION() block: fix and simplify blkdevparts= cmdline parsing block: refine the EOF check in blkdev_iomap_begin block: add a partscan sysfs attribute for disks block: add a disk_has_partscan helper ...
2024-05-06null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION()Zhu Yanjun
No functional changes intended. Fixes: f2298c0403b0 ("null_blk: multi queue aware block test driver") Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240506075538.6064-1-yanjun.zhu@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-02set_blocksize(): switch to passing struct file *Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02zram: don't bother with reopening - just use O_EXCL for openAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-05-02pktcdvd: sort set_blocksize() calls outAl Viro
1) it doesn't make any sense to have ->open() call set_blocksize() on the device being opened - the caller will override that anyway. 2) setting block size on underlying device, OTOH, ought to be done when we are opening it exclusive - i.e. as part of pkt_open_dev(). Having it done at setup time doesn't guarantee us anything about the state at the time we start talking to it. Worse, if you happen to have the underlying device containing e.g. ext2 with 4Kb blocks that is currently mounted r/o, that set_blocksize() will confuse the hell out of filesystem. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-04-30ublk: remove segment count and size limitsUday Shankar
ublk_drv currently creates block devices with the default max_segments and max_segment_size limits of BLK_MAX_SEGMENTS (128) and BLK_MAX_SEGMENT_SIZE (65536) respectively. These defaults can artificially constrain the I/O size seen by the ublk server - for example, suppose that the ublk server has configured itself to accept I/Os up to 1M and the application is also issuing 1M sized I/Os. If the I/O buffer used by the application is backed by 4K pages, the buffer could consist of up to 1M / 4K = 256 physically discontiguous segments (even if the buffer is virtually contiguous). As such, the I/O could exceed the default max_segments limit and get split. This can cause unnecessary performance issues if the ublk server is optimized to handle 1M I/Os. The block layer's segment count/size limits exist to model hardware constraints which don't exist in ublk_drv's case, so just remove those limits for the block devices created by ublk_drv. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Riley Thomasson <riley@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20240430211623.2802036-1-ushankar@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-25zram: add max_pages param to recompressionSergey Senozhatsky
Introduce "max_pages" param to recompress device attribute which sets an upper limit on the number of entries (pages) zram attempts to recompress (in this particular recompression call). S/W recompression can be quite expensive so limiting the number of pages recompress touches can be quite helpful. Link: https://lkml.kernel.org/r/20240329094050.2815699-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Brian Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25null_blk: Fix missing mutex_destroy() at module removalZhu Yanjun
When a mutex lock is not used any more, the function mutex_destroy should be called to mark the mutex lock uninitialized. Fixes: f2298c0403b0 ("null_blk: multi queue aware block test driver") Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://lore.kernel.org/r/20240425171635.4227-1-yanjun.zhu@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17null_blk: Simplify null_zone_write()Damien Le Moal
In null_zone_write, we do not need to first check if the target zone condition is FULL, READONLY or OFFLINE: for theses conditions, the check of the command sector against the zone write pointer will always result in the command failing. Remove these checks. We still however need to check that the target zone write pointer is not invalid for zone append operations. To do so, add the macro NULL_ZONE_INVALID_WP and use it in null_set_zone_cond() when changing a zone to READONLY or OFFLINE condition. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240411085502.728558-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17null_blk: Do zone resource management only if necessaryDamien Le Moal
For zoned null_blk devices setup without any limit on the maximum number of open and active zones, there is no need to count the number of zones that are implicitly open, explicitly open and closed. This is indicated by the boolean field need_zone_res_mgmt of sturct nullb_device. Modify the zone management functions null_reset_zone(), null_finish_zone(), null_open_zone() and null_close_zone() to manage the zone condition counters only if the device need_zone_res_mgmt field is true. With this change, the function __null_close_zone() is removed and integrated into the 2 caller sites directly, with the null_close_imp_open_zone() call site greatly simplified as this function closes zones that are known to be in the implicit open condition. null_zone_write() is modified in a similar manner to do zone condition accouting only when the device need_zone_res_mgmt field is true. With these changes, the inline helpers null_lock_zone_res() and null_unlock_zone_res() are removed and replaced with direct calls to spin_lock()/spin_unlock(). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240411085502.728558-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>