summaryrefslogtreecommitdiff
path: root/drivers/block
AgeCommit message (Collapse)Author
2025-03-30Merge tag 'rust-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux Pull Rust updates from Miguel Ojeda: "Toolchain and infrastructure: - Extract the 'pin-init' API from the 'kernel' crate and make it into a standalone crate. In order to do this, the contents are rearranged so that they can easily be kept in sync with the version maintained out-of-tree that other projects have started to use too (or plan to, like QEMU). This will reduce the maintenance burden for Benno, who will now have his own sub-tree, and will simplify future expected changes like the move to use 'syn' to simplify the implementation. - Add '#[test]'-like support based on KUnit. We already had doctests support based on KUnit, which takes the examples in our Rust documentation and runs them under KUnit. Now, we are adding the beginning of the support for "normal" tests, similar to those the '#[test]' tests in userspace Rust. For instance: #[kunit_tests(my_suite)] mod tests { #[test] fn my_test() { assert_eq!(1 + 1, 2); } } Unlike with doctests, the 'assert*!'s do not map to the KUnit assertion APIs yet. - Check Rust signatures at compile time for functions called from C by name. In particular, introduce a new '#[export]' macro that can be placed in the Rust function definition. It will ensure that the function declaration on the C side matches the signature on the Rust function: #[export] pub unsafe extern "C" fn my_function(a: u8, b: i32) -> usize { // ... } The macro essentially forces the compiler to compare the types of the actual Rust function and the 'bindgen'-processed C signature. These cases are rare so far. In the future, we may consider introducing another tool, 'cbindgen', to generate C headers automatically. Even then, having these functions explicitly marked may be a good idea anyway. - Enable the 'raw_ref_op' Rust feature: it is already stable, and allows us to use the new '&raw' syntax, avoiding a couple macros. After everyone has migrated, we will disallow the macros. - Pass the correct target to 'bindgen' on Usermode Linux. - Fix 'rusttest' build in macOS. 'kernel' crate: - New 'hrtimer' module: add support for setting up intrusive timers without allocating when starting the timer. Add support for 'Pin<Box<_>>', 'Arc<_>', 'Pin<&_>' and 'Pin<&mut _>' as pointer types for use with timer callbacks. Add support for setting clock source and timer mode. - New 'dma' module: add a simple DMA coherent allocator abstraction and a test sample driver. - 'list' module: make the linked list 'Cursor' point between elements, rather than at an element, which is more convenient to us and allows for cursors to empty lists; and document it with examples of how to perform common operations with the provided methods. - 'str' module: implement a few traits for 'BStr' as well as the 'strip_prefix()' method. - 'sync' module: add 'Arc::as_ptr'. - 'alloc' module: add 'Box::into_pin'. - 'error' module: extend the 'Result' documentation, including a few examples on different ways of handling errors, a warning about using methods that may panic, and links to external documentation. 'macros' crate: - 'module' macro: add the 'authors' key to support multiple authors. The original key will be kept until everyone has migrated. Documentation: - Add error handling sections. MAINTAINERS: - Add Danilo Krummrich as reviewer of the Rust "subsystem". - Add 'RUST [PIN-INIT]' entry with Benno Lossin as maintainer. It has its own sub-tree. - Add sub-tree for 'RUST [ALLOC]'. - Add 'DMA MAPPING HELPERS DEVICE DRIVER API [RUST]' entry with Abdiel Janulgue as primary maintainer. It will go through the sub-tree of the 'RUST [ALLOC]' entry. - Add 'HIGH-RESOLUTION TIMERS [RUST]' entry with Andreas Hindborg as maintainer. It has its own sub-tree. And a few other cleanups and improvements" * tag 'rust-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (71 commits) rust: dma: add `Send` implementation for `CoherentAllocation` rust: macros: fix `make rusttest` build on macOS rust: block: refactor to use `&raw mut` rust: enable `raw_ref_op` feature rust: uaccess: name the correct function rust: rbtree: fix comments referring to Box instead of KBox rust: hrtimer: add maintainer entry rust: hrtimer: add clocksource selection through `ClockId` rust: hrtimer: add `HrTimerMode` rust: hrtimer: implement `HrTimerPointer` for `Pin<Box<T>>` rust: alloc: add `Box::into_pin` rust: hrtimer: implement `UnsafeHrTimerPointer` for `Pin<&mut T>` rust: hrtimer: implement `UnsafeHrTimerPointer` for `Pin<&T>` rust: hrtimer: add `hrtimer::ScopedHrTimerPointer` rust: hrtimer: add `UnsafeHrTimerPointer` rust: hrtimer: allow timer restart from timer handler rust: str: implement `strip_prefix` for `BStr` rust: str: implement `AsRef<BStr>` for `[u8]` and `BStr` rust: str: implement `Index` for `BStr` rust: str: implement `PartialEq` for `BStr` ...
2025-03-29ublk: specify io_cmd_buf pointer typeCaleb Sander Mateos
io_cmd_buf points to an array of ublksrv_io_desc structs but its type is char *. Indexing the array requires an explicit multiplication and cast. The compiler also can't check the pointer types. Change io_cmd_buf's type to struct ublksrv_io_desc * so it can be indexed directly and the compiler can type-check the code. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Shuah Khan <skhan@linuxfoundation.org> Link: https://lore.kernel.org/r/20250328194230.2726862-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: store req in ublk_uring_cmd_pdu for ublk_cmd_tw_cb()Caleb Sander Mateos
Pass struct request *rq to ublk_cmd_tw_cb() through ublk_uring_cmd_pdu, mirroring how it works for ublk_cmd_list_tw_cb(). This saves some pointer dereferences, as well as the bounds check in blk_mq_tag_to_rq(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250328180411.2696494-6-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: avoid redundant io->cmd in ublk_queue_cmd_list()Caleb Sander Mateos
ublk_queue_cmd_list() loads io->cmd twice. The intervening stores prevent the compiler from combining the loads. Since struct ublk_io *io is only used to compute io->cmd, replace the variable with io->cmd. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250328180411.2696494-5-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: get ubq from pdu in ublk_cmd_list_tw_cb()Caleb Sander Mateos
Save a few pointer dereferences by obtaining struct ublk_queue *ubq from the ublk_uring_cmd_pdu instead of the request's mq_hctx. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250328180411.2696494-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: skip 1 NULL check in ublk_cmd_list_tw_cb() loopCaleb Sander Mateos
ublk_cmd_list_tw_cb() is always performed on a non-empty request list. So don't check whether rq is NULL on the first iteration of the loop, just on subsequent iterations. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250328180411.2696494-3-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: remove unused cmd argument to ublk_dispatch_req()Caleb Sander Mateos
ublk_dispatch_req() never uses its struct io_uring_cmd *cmd argument. Drop it so callers don't have to pass a value. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250328180411.2696494-2-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: rename ublk_rq_task_work_cb as ublk_cmd_tw_cbMing Lei
The new name is aligned with ublk_cmd_list_tw_cb(), and looks more readable. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-10-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: implement ->queue_rqs()Ming Lei
Implement ->queue_rqs() for improving perf in case of MQ. In this way, we just need to call io_uring_cmd_complete_in_task() once for whole IO batch, then both io_uring and ublk server can get exact batch from ublk frontend. Follows IOPS improvement: - tests tools/testing/selftests/ublk/kublk add -t null -q 2 [-z] fio/t/io_uring -p0 /dev/ublkb0 - results: more than 10% IOPS boost observed Pass all ublk selftests, especially the io dispatch order test. Cc: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-9-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: add segment parameterMing Lei
IO split is usually bad in io_uring world, since -EAGAIN is caused and IO handling may have to fallback to io-wq, this way does hurt performance. ublk starts to support zero copy recently, for avoiding unnecessary IO split, ublk driver's segment limit should be aligned with backend device's segment limit. Another reason is that io_buffer_register_bvec() needs to allocate bvecs, which number is aligned with ublk request segment number, so that big memory allocation can be avoided by setting reasonable max_segments limit. So add segment parameter for providing ublk server chance to align segment limit with backend, and keep it reasonable from implementation viewpoint. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-7-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: call io_uring_cmd_to_pdu to get uring_cmd pduMing Lei
Call io_uring_cmd_to_pdu() to get uring_cmd pdu, and one big benefit is the automatic pdu size build check. Suggested-by: Uday Shankar <ushankar@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://lore.kernel.org/r/20250327095123.179113-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: add helper of ublk_need_map_io()Ming Lei
ublk_need_map_io() is more readable. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: remove two unused fields from 'struct ublk_queue'Ming Lei
Remove two unused fields(`io_addr` & `max_io_sz`) from `struct ublk_queue`. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: comment on ubq->canceling handling in ublk_queue_rq()Ming Lei
In ublk_queue_rq(), ubq->canceling has to be handled after ->fail_io and ->force_abort are dealt with, otherwise the request may not be failed when deleting disk. Add comment on this usage. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-28ublk: make sure ubq->canceling is set when queue is frozenMing Lei
Now ublk driver depends on `ubq->canceling` for deciding if the request can be dispatched via uring_cmd & io_uring_cmd_complete_in_task(). Once ubq->canceling is set, the uring_cmd can be done via ublk_cancel_cmd() and io_uring_cmd_done(). So set ubq->canceling when queue is frozen, this way makes sure that the flag can be observed from ublk_queue_rq() reliably, and avoids use-after-free on uring_cmd. Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250327095123.179113-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-26Merge tag 'for-6.15/block-20250322' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - Fixes for integrity handling - NVMe pull request via Keith: - Secure concatenation for TCP transport (Hannes) - Multipath sysfs visibility (Nilay) - Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li) - Correct use of 64-bit BARs for pci-epf target (Niklas) - Socket fix for selinux when used in containers (Peijie) - MD pull request via Yu: - fix recovery can preempt resync (Li Nan) - fix md-bitmap IO limit (Su Yue) - fix raid10 discard with REQ_NOWAIT (Xiao Ni) - fix raid1 memory leak (Zheng Qixing) - fix mddev uaf (Yu Kuai) - fix raid1,raid10 IO flags (Yu Kuai) - some refactor and cleanup (Yu Kuai) - Series cleaning up and fixing bugs in the bad block handling code - Improve support for write failure simulation in null_blk - Various lock ordering fixes - Fixes for locking for debugfs attributes - Various ublk related fixes and improvements - Cleanups for blk-rq-qos wait handling - blk-throttle fixes - Fixes for loop dio and sync handling - Fixes and cleanups for the auto-PI code - Block side support for hardware encryption keys in blk-crypto - Various cleanups and fixes * tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits) nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi) nvme-tcp: fix selinux denied when calling sock_sendmsg nvmet: pci-epf: Always configure BAR0 as 64-bit nvmet: Remove duplicate uuid_copy nvme: zns: Simplify nvme_zone_parse_entry() nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls nvmet-fc: Remove unused functions nvme-pci: remove stale comment nvme-fc: Utilise min3() to simplify queue count calculation nvme-multipath: Add visibility for queue-depth io-policy nvme-multipath: Add visibility for numa io-policy nvme-multipath: Add visibility for round-robin io-policy nvmet: add tls_concat and tls_key debugfs entries nvmet-tcp: support secure channel concatenation nvmet: Add 'sq' argument to alloc_ctrl_args nvme-fabrics: reset admin connection for secure concatenation nvme-tcp: request secure channel concatenation nvme-keyring: add nvme_tls_psk_refresh() nvme: add nvme_auth_derive_tls_psk() nvme: add nvme_auth_generate_digest() ...
2025-03-26Merge tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: "This is the first of the io_uring pull requests for the 6.15 merge window, there will be others once the net tree has gone in. This contains: - Cleanup and unification of cancelation handling across various request types. - Improvement for bundles, supporting them both for incrementally consumed buffers, and for non-multishot requests. - Enable toggling of using iowait while waiting on io_uring events or not. Unfortunately this is still tied with CPU frequency boosting on short waits, as the scheduler side has not been very receptive to splitting the (useless) iowait stat from the cpufreq implied boost. - Add support for kbuf nodes, enabling zero-copy support for the ublk block driver. - Various cleanups for resource node handling. - Series greatly cleaning up the legacy provided (non-ring based) buffers. For years, we've been pushing the ring provided buffers as the way to go, and that is what people have been using. Reduce the complexity and code associated with legacy provided buffers. - Series cleaning up the compat handling. - Series improving and cleaning up the recvmsg/sendmsg iovec and msg handling. - Series of cleanups for io-wq. - Start adding a bunch of selftests. The liburing repository generally carries feature and regression tests for everything, but at least for ublk initially, we'll try and go the route of having it in selftests as well. We'll see how this goes, might decide to migrate more tests this way in the future. - Various little cleanups and fixes" * tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits) selftests: ublk: add stripe target selftests: ublk: simplify loop io completion selftests: ublk: enable zero copy for null target selftests: ublk: prepare for supporting stripe target selftests: ublk: move common code into common.c selftests: ublk: increase max buffer size to 1MB selftests: ublk: add single sqe allocator helper selftests: ublk: add generic_01 for verifying sequential IO order selftests: ublk: fix starting ublk device io_uring: enable toggle of iowait usage when waiting on CQEs selftests: ublk: fix write cache implementation selftests: ublk: add variable for user to not show test result selftests: ublk: don't show `modprobe` failure selftests: ublk: add one dependency header io_uring/kbuf: enable bundles for incrementally consumed buffers Revert "io_uring/rsrc: simplify the bvec iter count calculation" selftests: ublk: improve test usability selftests: ublk: add stress test for covering IO vs. killing ublk server selftests: ublk: add one stress test for covering IO vs. removing device selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests ...
2025-03-25Merge tag 'timers-cleanups-2025-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ...
2025-03-19ublk: remove io_cmds list in ublk_queueUday Shankar
The current I/O dispatch mechanism - queueing I/O by adding it to the io_cmds list (and poking task_work as needed), then dispatching it in ublk server task context by reversing io_cmds and completing the io_uring command associated to each one - was introduced by commit 7d4a93176e014 ("ublk_drv: don't forward io commands in reserve order") to ensure that the ublk server received I/O in the same order that the block layer submitted it to ublk_drv. This mechanism was only needed for the "raw" task_work submission mechanism, since the io_uring task work wrapper maintains FIFO ordering (using quite a similar mechanism in fact). The "raw" task_work submission mechanism is no longer supported in ublk_drv as of commit 29dc5d06613f2 ("ublk: kill queuing request by task_work_add"), so the explicit llist/reversal is no longer needed - it just duplicates logic already present in the underlying io_uring APIs. Remove it. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250318-ublk_io_cmds-v1-1-c1bb74798fef@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-18loop: move vfs_fsync() out of loop_update_dio()Ming Lei
If vfs_flush() is called with queue frozen, the queue freeze lock may be connected with FS internal lock, and lockdep warning can be triggered because the queue freeze lock is connected with too many global or sub-system locks. Fix the warning by moving vfs_fsync() out of loop_update_dio(): - vfs_fsync() is only needed when switching to dio - only loop_change_fd() and loop_configure() may switch from buffered IO to direct IO, so call vfs_fsync() directly here. This way is safe because either loop is in unbound, or new file isn't attached - for the other two cases of set_status and set_block_size, direct IO can only become off, so no need to call vfs_fsync() Cc: Christoph Hellwig <hch@infradead.org> Reported-by: Kun Hu <huk23@m.fudan.edu.cn> Reported-by: Jiaji Qin <jjtan24@m.fudan.edu.cn> Closes: https://lore.kernel.org/linux-block/359BC288-B0B1-4815-9F01-3A349B12E816@m.fudan.edu.cn/T/#u Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250318072955.3893805-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-16zram: add might_sleep to zcomp APISergey Senozhatsky
Explicitly state that zcomp compress/decompress must be called from non-atomic context. Link: https://lkml.kernel.org/r/20250303022425.285971-20-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: do not leak page on writeback_store error pathSergey Senozhatsky
Ensure the page used for local object data is freed on error out path. Link: https://lkml.kernel.org/r/20250303022425.285971-19-senozhatsky@chromium.org Fixes: 330edc2bc059 (zram: rework writeback target selection strategy) Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: do not leak page on recompress_store error pathSergey Senozhatsky
Ensure the page used for local object data is freed on error out path. Link: https://lkml.kernel.org/r/20250303022425.285971-18-senozhatsky@chromium.org Fixes: 3f909a60cec1 ("zram: rework recompress target selection strategy") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: permit reclaim in zstd custom allocatorSergey Senozhatsky
When configured with pre-trained compression/decompression dictionary support, zstd requires custom memory allocator, which it calls internally from compression()/decompression() routines. That means allocation from atomic context (either under entry spin-lock, or per-CPU local-lock or both). Now, with non-atomic zram read()/write(), those limitations are relaxed and we can allow direct and indirect reclaim. Link: https://lkml.kernel.org/r/20250303022425.285971-17-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: switch to new zsmalloc object mapping APISergey Senozhatsky
Use new read/write zsmalloc object API. For cases when RO mapped object spans two physical pages (requires temp buffer) compression streams now carry around one extra physical page. Link: https://lkml.kernel.org/r/20250303022425.285971-16-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: move post-processing target allocationSergey Senozhatsky
Allocate post-processing target in place_pp_slot(). This simplifies scan_slots_for_writeback() and scan_slots_for_recompress() loops because we don't need to track pps pointer state anymore. Previously we have to explicitly NULL the point if it has been added to a post-processing bucket or re-use previously allocated pointer otherwise and make sure we don't leak the memory in the end. We are also fine doing GFP_NOIO allocation, as post-processing can be called under memory pressure so we better pick as many slots as we can as soon as we can and start post-processing them, possibly saving the memory. Allocation failure there is not fatal, we will post-process whatever we put into the buckets on previous iterations. Link: https://lkml.kernel.org/r/20250303022425.285971-12-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: rework recompression loopSergey Senozhatsky
This reworks recompression loop handling: - set a rule that stream-put NULLs the stream pointer If the loop returns with a non-NULL stream then it's a successful recompression, otherwise the stream should always be NULL. - do not count the number of recompressions Mark object as incompressible as soon as the algorithm with the highest priority failed to compress that object. - count compression errors as resource usage Even if compression has failed, we still need to bump num_recomp_pages counter. Link: https://lkml.kernel.org/r/20250303022425.285971-11-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: filter out recomp targets based on prioritySergey Senozhatsky
Do no select for post processing slots that are already compressed with same or higher priority compression algorithm. This should save some memory, as previously we would still put those entries into corresponding post-processing buckets and filter them out later in recompress_slot(). Link: https://lkml.kernel.org/r/20250303022425.285971-10-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: limit max recompress prio to num_active_compsSergey Senozhatsky
Use the actual number of algorithms zram was configure with instead of theoretical limit of ZRAM_MAX_COMPS. Also make sure that min prio is not above max prio. Link: https://lkml.kernel.org/r/20250303022425.285971-9-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: remove writestall zram_stats memberSergey Senozhatsky
There is no zsmalloc handle allocation slow path now and writestall is not possible any longer. Remove it from zram_stats. Link: https://lkml.kernel.org/r/20250303022425.285971-8-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: add GFP_NOWARN to incompressible zsmalloc handle allocationSergey Senozhatsky
We normally use __GFP_NOWARN for zsmalloc handle allocations, add it to write_incompressible_page() allocation too. Link: https://lkml.kernel.org/r/20250303022425.285971-7-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: remove second stage of handle allocationSergey Senozhatsky
Previously zram write() was atomic which required us to pass __GFP_KSWAPD_RECLAIM to zsmalloc handle allocation on a fast path and attempt a slow path allocation (with recompression) if the fast path failed. Since we are not in atomic context anymore we can permit direct reclaim during handle allocation, and hence can have a single allocation path. There is no slow path anymore so we don't unlock per-CPU stream (and don't lose compressed data) which means that there is no need to do recompression now (which should reduce CPU and battery usage). Link: https://lkml.kernel.org/r/20250303022425.285971-6-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: remove max_comp_streams device attrSergey Senozhatsky
max_comp_streams device attribute has been defunct since May 2016 when zram switched to per-CPU compression streams, remove it. Link: https://lkml.kernel.org/r/20250303022425.285971-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: remove unused crypto includeSergey Senozhatsky
We stopped using crypto API (for the time being), so remove its include and replace CRYPTO_MAX_ALG_NAME with a local define. Link: https://lkml.kernel.org/r/20250303022425.285971-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: permit preemption with active compression streamSergey Senozhatsky
Currently, per-CPU stream access is done from a non-preemptible (atomic) section, which imposes the same atomicity requirements on compression backends as entry spin-lock, and makes it impossible to use algorithms that can schedule/wait/sleep during compression and decompression. Switch to preemptible per-CPU model, similar to the one used in zswap. Instead of a per-CPU local lock, each stream carries a mutex which is locked throughout entire time zram uses it for compression or decompression, so that cpu-dead event waits for zram to stop using a particular per-CPU stream and release it. Link: https://lkml.kernel.org/r/20250303022425.285971-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-16zram: sleepable entry lockingSergey Senozhatsky
Patch series "zsmalloc/zram: there be preemption", v10. Currently zram runs compression and decompression in non-preemptible sections, e.g. zcomp_stream_get() // grabs CPU local lock zcomp_compress() or zram_slot_lock() // grabs entry spin-lock zcomp_stream_get() // grabs CPU local lock zs_map_object() // grabs rwlock and CPU local lock zcomp_decompress() Potentially a little troublesome for a number of reasons. For instance, this makes it impossible to use async compression algorithms or/and H/W compression algorithms, which can wait for OP completion or resource availability. This also restricts what compression algorithms can do internally, for example, zstd can allocate internal state memory for C/D dictionaries: do_fsync() do_writepages() zram_bio_write() zram_write_page() // become non-preemptible zcomp_compress() zstd_compress() ZSTD_compress_usingCDict() ZSTD_compressBegin_usingCDict_internal() ZSTD_resetCCtx_usingCDict() ZSTD_resetCCtx_internal() zstd_custom_alloc() // memory allocation Not to mention that the system can be configured to maximize compression ratio at a cost of CPU/HW time (e.g. lz4hc or deflate with very high compression level) so zram can stay in non-preemptible section (even under spin-lock or/and rwlock) for an extended period of time. Aside from compression algorithms, this also restricts what zram can do. One particular example is zram_write_page() zsmalloc handle allocation, which has an optimistic allocation (disallowing direct reclaim) and a pessimistic fallback path, which then forces zram to compress the page one more time. This series changes zram to not directly impose atomicity restrictions on compression algorithms (and on itself), which makes zram write() fully preemptible; zram read(), sadly, is not always preemptible yet. There are still indirect atomicity restrictions imposed by zsmalloc(). One notable example is object mapping API, which returns with: a) local CPU lock held b) zspage rwlock held First, zsmalloc's zspage lock is converted from rwlock to a special type of RW-lookalike look with some extra guarantees/features. Second, a new handle mapping is introduced which doesn't use per-CPU buffers (and hence no local CPU lock), does fewer memcpy() calls, but requires users to provide a pointer to temp buffer for object copy-in (when needed). Third, zram is converted to the new zsmalloc mapping API and thus zram read() becomes preemptible. This patch (of 19): Concurrent modifications of meta table entries is now handled by per-entry spin-lock. This has a number of shortcomings. First, this imposes atomic requirements on compression backends. zram can call both zcomp_compress() and zcomp_decompress() under entry spin-lock, which implies that we can use only compression algorithms that don't schedule/sleep/wait during compression and decompression. This, for instance, makes it impossible to use some of the ASYNC compression algorithms (H/W compression, etc.) implementations. Second, this can potentially trigger watchdogs. For example, entry re-compression with secondary algorithms is performed under entry spin-lock. Given that we chain secondary compression algorithms and that some of them can be configured for best compression ratio (and worst compression speed) zram can stay under spin-lock for quite some time. Having a per-entry mutex (or, for instance, a rw-semaphore) significantly increases sizeof() of each entry and hence the meta table. Therefore entry locking returns back to bit locking, as before, however, this time also preempt-rt friendly, because if waits-on-bit instead of spinning-on-bit. Lock owners are also now permitted to schedule, which is a first step on the path of making zram non-atomic. Link: https://lkml.kernel.org/r/20250303022425.285971-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20250303022425.285971-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-14Merge tag 'block-6.14-20250313' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Concurrent pci error and hotplug handling fix (Keith) - Endpoint function fixes (Damien) - Fix for a regression introduced in this cycle with error checking for batched request completions (Shin'ichiro) * tag 'block-6.14-20250313' of git://git.kernel.dk/linux: block: change blk_mq_add_to_batch() third argument type to bool nvme: move error logging from nvme_end_req() to __nvme_end_req() nvmet: pci-epf: Do not add an IRQ vector if not needed nvmet: pci-epf: Set NVMET_PCI_EPF_Q_LIVE when a queue is fully created nvme-pci: fix stuck reset on concurrent DPC and HP
2025-03-13block: remove unused parameter 'q' parameter in __blk_rq_map_sg()Anuj Gupta
request_queue param is no longer used by blk_rq_map_sg and __blk_rq_map_sg. Remove it. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250313035322.243239-1-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-12block: change blk_mq_add_to_batch() third argument type to boolShin'ichiro Kawasaki
Commit 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") modified the evaluation criteria for the third argument, 'ioerror', in the blk_mq_add_to_batch() function. Initially, the function had checked if 'ioerror' equals zero. Following the commit, it started checking for negative error values, with the presumption that such values, for instance -EIO, would be passed in. However, blk_mq_add_to_batch() callers do not pass negative error values. Instead, they pass status codes defined in various ways: - NVMe PCI and Apple drivers pass NVMe status code - virtio_blk driver passes the virtblk request header status byte - null_blk driver passes blk_status_t These codes are either zero or positive, therefore the revised check fails to function as intended. Specifically, with the NVMe PCI driver, this modification led to the failure of the blktests test case nvme/039. In this test scenario, errors are artificially injected to the NVMe driver, resulting in positive NVMe status codes passed to blk_mq_add_to_batch(), which unexpectedly processes the failed I/O in a batch. Hence the failure. To correct the ioerror check within blk_mq_add_to_batch(), make all callers to uniformly pass the argument as boolean. Modify the callers to check their specific status codes and pass the boolean value 'is_error'. Also describe the arguments of blK_mq_add_to_batch as kerneldoc. Fixes: 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250311104359.1767728-3-shinichiro.kawasaki@wdc.com [axboe: fold in documentation update] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-10rust: module: introduce `authors` keyGuilherme Giacomo Simoes
In the `module!` macro, the `author` field is currently of type `String`. Since modules can have multiple authors, this limitation prevents specifying more than one. Add an `authors` field as `Option<Vec<String>>` to allow creating modules with multiple authors, and change the documentation and all current users to use it. Eventually, the single `author` field may be removed. [ The `modinfo` key needs to still be `author`; otherwise, tooling may not work properly, e.g.: $ modinfo --author samples/rust/rust_print.ko Rust for Linux Contributors I have also kept the original `author` field (undocumented), so that we can drop it more easily in a kernel cycle or two. - Miguel ] Suggested-by: Miguel Ojeda <ojeda@kernel.org> Link: https://github.com/Rust-for-Linux/linux/issues/244 Reviewed-by: Charalampos Mitrodimas <charmitro@posteo.net> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org> Signed-off-by: Guilherme Giacomo Simoes <trintaeoitogc@gmail.com> Link: https://lore.kernel.org/r/20250309175712.845622-2-trintaeoitogc@gmail.com [ Fixed `modinfo` key. Kept `author` field. Reworded message accordingly. Updated my email. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2025-03-07Merge tag 'block-6.14-20250306' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - TCP use after free fix on polling (Sagi) - Controller memory buffer cleanup fixes (Icenowy) - Free leaking requests on bad user passthrough commands (Keith) - TCP error message fix (Maurizio) - TCP corruption fix on partial PDU (Maurizio) - TCP memory ordering fix for weakly ordered archs (Meir) - Type coercion fix on message error for TCP (Dan) - Name the RQF flags enum, fixing issues with anon enums and BPF import of it - ublk parameter setting fix - GPT partition 7-bit conversion fix * tag 'block-6.14-20250306' of git://git.kernel.dk/linux: block: Name the RQF flags enum nvme-tcp: fix signedness bug in nvme_tcp_init_connection() block: fix conversion of GPT partition name to 7-bit ublk: set_params: properly check if parameters can be applied nvmet-tcp: Fix a possible sporadic response drops in weakly ordered arch nvme-tcp: fix potential memory corruption in nvme_tcp_recv_pdu() nvme-tcp: Fix a C2HTermReq error message nvmet: remove old function prototype nvme-ioctl: fix leaked requests on mapping error nvme-pci: skip CMB blocks incompatible with PCI P2P DMA nvme-pci: clean up CMBMSC when registering CMB fails nvme-tcp: fix possible UAF in nvme_tcp_poll
2025-03-06badblocks: use sector_t instead of int to avoid truncation of badblocks lengthZheng Qixing
There is a truncation of badblocks length issue when set badblocks as follow: echo "2055 4294967299" > bad_blocks cat bad_blocks 2055 3 Change 'sectors' argument type from 'int' to 'sector_t'. This change avoids truncation of badblocks length for large sectors by replacing 'int' with 'sector_t' (u64), enabling proper handling of larger disk sizes and ensuring compatibility with 64-bit sector addressing. Fixes: 9e0e252a048b ("badblocks: Add core badblock management code") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06badblocks: return boolean from badblocks_set() and badblocks_clear()Zheng Qixing
Change the return type of badblocks_set() and badblocks_clear() from int to bool, indicating success or failure. Specifically: - _badblocks_set() and _badblocks_clear() functions now return true for success and false for failure. - All calls to these functions are updated to handle the new boolean return type. - This change improves code clarity and ensures a more consistent handling of success and failure states. Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Acked-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-05ublk: set_params: properly check if parameters can be appliedUday Shankar
The parameters set by the set_params call are only applied to the block device in the start_dev call. So if a device has already been started, a subsequently issued set_params on that device will not have the desired effect, and should return an error. There is an existing check for this - set_params fails on devices in the LIVE state. But this check is not sufficient to cover the recovery case. In this case, the device will be in the QUIESCED or FAIL_IO states, so set_params will succeed. But this success is misleading, because the parameters will not be applied, since the device has already been started (by a previous ublk server). The bit UB_STATE_USED is set on completion of the start_dev; use it to detect and fail set_params commands which arrive too late to be applied (after start_dev). Signed-off-by: Uday Shankar <ushankar@purestorage.com> Fixes: 0aa73170eba5 ("ublk_drv: add SET_PARAMS/GET_PARAMS control command") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250304-set_params-v1-1-17b5e0887606@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04ublk: enforce ublks_max only for unprivileged devicesUday Shankar
Commit 403ebc877832 ("ublk_drv: add module parameter of ublks_max for limiting max allowed ublk dev"), claimed ublks_max was added to prevent a DoS situation with an untrusted user creating too many ublk devices. If that's the case, ublks_max should only restrict the number of unprivileged ublk devices in the system. Enforce the limit only for unprivileged ublk devices, and rename variables accordingly. Leave the external-facing parameter name unchanged, since changing it may break systems which use it (but still update its documentation to reflect its new meaning). As a result of this change, in a system where there are only normal (non-unprivileged) devices, the maximum number of such devices is increased to 1 << MINORBITS, or 1048576. That ought to be enough for anyone, right? Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250228-ublks_max-v1-1-04b7379190c0@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04loop: Remove struct loop_func_tableZhu Yanjun
The struct is introduced in the commit 754d96798fab ("loop: remove loop.h"), but it is not used now. So remove it. No functional changes. Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250227163343.55952-1-yanjun.zhu@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-04ublk: don't cast registered buffer index to intCaleb Sander Mateos
io_buffer_register_bvec() takes index as an unsigned int argument, but ublk_register_io_buf() casts ub_cmd->addr (a u64) to int. Remove the misleading cast and instead pass index as an unsigned value to ublk_register_io_buf() and ublk_unregister_io_buf(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250301190317.950208-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03ublk: add DMA alignment limitMing Lei
The in-tree ublk driver doesn't need DMA alignment limit because there is one data copy between request pages and the userspace buffer. However, ublk is going to support zero copy, then DMA alignment limit is required, because same IO buffer is forwarded to backend which may have specific buffer DMA alignment limit, so the limit has to be exposed from the frontend driver to client application. Cc: Keith Busch <kbusch@kernel.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250227103707.2640014-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03null_blk: do partial IO for bad blocksShin'ichiro Kawasaki
The current null_blk implementation checks if any bad blocks exist in the target blocks of each IO. If so, the IO fails and data is not transferred for all of the IO target blocks. However, when real storage devices have bad blocks, the devices may transfer data partially up to the first bad blocks (e.g., SAS drives). Especially, when the IO is a write operation, such partial IO leaves partially written data on the device. To simulate such partial IO using null_blk, introduce the new parameter 'badblocks_partial_io'. When this parameter is set, null_handle_badblocks() returns the number of the sectors for the partial IO as its third pointer argument. Pass the returned number of sectors to the following calls to null_handle_memory_backend() in null_process_cmd() and null_zone_write(). Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-6-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03null_blk: pass transfer size to null_handle_rq()Shin'ichiro Kawasaki
As preparation to support partial data transfer, add a new argument to null_handle_rq() to pass the number of sectors to transfer. While at it, rename the function from null_handle_rq to null_handle_data_transfer. This commit does not change the behavior. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20250226100613.1622564-5-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>