Age | Commit message (Collapse) | Author |
|
ublk_init_queues() ensures that all nr_hw_queues queues are initialized,
with each ublk_queue's q_id set to its index. And ublk_init_queues() is
called before ublk_add_chdev(), which creates the cdev. It is therefore
impossible for the !ubq || ub_cmd->q_id != ubq->q_id condition to hit in
__ublk_ch_uring_cmd(). Remove it to avoid some branches in the I/O path.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250416170154.3621609-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
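The invariant described in the commit above can be modeled in user space. This is a minimal sketch, not the driver code: the model_* names and NR_HW_QUEUES are hypothetical, and the range check on q_id is an assumption of the sketch. Once every queue is initialized with q_id equal to its index, a lookup by an in-range q_id can return neither NULL nor a mismatched queue.

```c
#include <stddef.h>

#define NR_HW_QUEUES 4 /* hypothetical queue count for this sketch */

struct model_queue {
	int q_id;
};

struct model_dev {
	int nr_hw_queues;
	struct model_queue queues[NR_HW_QUEUES];
};

/* Models the init step: every queue gets q_id set to its own index. */
static void model_init_queues(struct model_dev *dev)
{
	dev->nr_hw_queues = NR_HW_QUEUES;
	for (int i = 0; i < dev->nr_hw_queues; i++)
		dev->queues[i].q_id = i;
}

/*
 * Models the lookup in the command path. Only the range check (assumed
 * here) is needed; for an in-range q_id the result is never NULL and its
 * q_id always matches, so a further "!q || q_id != q->q_id" test is dead.
 */
static struct model_queue *model_get_queue(struct model_dev *dev, int q_id)
{
	if (q_id < 0 || q_id >= dev->nr_hw_queues)
		return NULL;
	return &dev->queues[q_id];
}
```

Because initialization completes before the device node exists in the real driver, no command can observe a partially initialized queue array; the sketch mirrors that by requiring model_init_queues() before any lookup.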
|
|
Currently ublk only allows the size of the ublkb block device to be
set via UBLK_CMD_SET_PARAMS before UBLK_CMD_START_DEV is triggered.
This does not provide support for extendable user-space block devices
without having to stop and restart the underlying ublkb block device
causing IO interruption.
This patch adds a new ublk command UBLK_U_CMD_UPDATE_SIZE to allow the
ublk block device to be resized on-the-fly.
Feature flag UBLK_F_UPDATE_SIZE is also added to indicate support.
Signed-off-by: Omri Mann <omri@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/2a370ab1-d85b-409d-b762-f9f3f6bdf705@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block fixes from Jens Axboe:
- MD pull via Yu:
- fix raid10 missing discard IO accounting (Yu Kuai)
- fix bitmap stats for bitmap file (Zheng Qixing)
- fix oops while reading all member disks failed during
check/repair (Meir Elisha)
- NVMe pull via Christoph:
- fix scan failure for non-ANA multipath controllers (Hannes
Reinecke)
- fix multipath sysfs links creation for some cases (Hannes
Reinecke)
- PCIe endpoint fixes (Damien Le Moal)
- use NULL instead of 0 in the auth code (Damien Le Moal)
- Various ublk fixes:
- Slew of selftest additions
- Improvements and fixes for IO cancelation
- Tweak to Kconfig verbiage
- Fix for page dirtying for blk integrity mapped pages
- loop fixes:
- buffered IO fix
- uevent fixes
- request priority inheritance fix
- Various little fixes
* tag 'block-6.15-20250417' of git://git.kernel.dk/linux: (38 commits)
selftests: ublk: add generic_06 for covering fault inject
ublk: simplify aborting ublk request
ublk: remove __ublk_quiesce_dev()
ublk: improve detection and handling of ublk server exit
ublk: move device reset into ublk_ch_release()
ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
ublk: add ublk_force_abort_dev()
ublk: properly serialize all FETCH_REQs
selftests: ublk: move creating UBLK_TMP into _prep_test()
selftests: ublk: add test_stress_05.sh
selftests: ublk: support user recovery
selftests: ublk: support target specific command line
selftests: ublk: increase max nr_queues and queue depth
selftests: ublk: set queue pthread's cpu affinity
selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN
selftests: ublk: add two stress tests for zero copy feature
selftests: ublk: run stress tests in parallel
selftests: ublk: make sure _add_ublk_dev can return in sub-shell
selftests: ublk: cleanup backfile automatically
selftests: ublk: add io_uring uapi header
...
|
|
Now that ublk_abort_queue() has moved to the ublk char device release handler,
the request queue is "quiesced" at that point, because either ->canceling was
set from the uring_cmd cancel function or all IOs are inflight and can't be
completed by the ublk server. Things become much easier:
- all uring_cmds are done, so we needn't mark io as UBLK_IO_FLAG_ABORTED
for handling completion from uring_cmd
- ublk char device is closed, no one can hold an IO request reference any more,
so we can simply complete this request or requeue it for ublk_nosrv_should_reissue_outstanding.
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove __ublk_quiesce_dev() and open-code updating the device state to
QUIESCED.
We no longer need to drain inflight requests in __ublk_quiesce_dev(),
because all inflight requests are aborted in the ublk char device release
handler.
We also no longer need to set ->canceling in __ublk_quiesce_dev(),
because it is now done unconditionally in ublk_ch_release().
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There are currently two ways in which ublk server exit is detected by
ublk_drv:
1. uring_cmd cancellation. If there are any outstanding uring_cmds which
have not been completed to the ublk server when it exits, io_uring
calls the uring_cmd callback with a special cancellation flag as the
issuing task is exiting.
2. I/O timeout. This is needed in addition to the above to handle the
"saturated queue" case, when all I/Os for a given queue are in the
ublk server, and therefore there are no outstanding uring_cmds to
cancel when the ublk server exits.
There are a couple of issues with this approach:
- It is complex and inelegant to have two methods to detect the same
condition
- The second method detects ublk server exit only after a long delay
(~30s, the default timeout assigned by the block layer). This delays
the nosrv behavior from kicking in and potential subsequent recovery
of the device.
The second issue is brought to light by the new test_generic_06, which
will be added in a following patch. It fails before this fix:
selftests: ublk: test_generic_06.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 30.0611 s, 0.0 kB/s
DEAD
dd took 31 seconds to exit (>= 5s tolerance)!
generic_06 : [FAIL]
Fix this by instead detecting and handling ublk server exit in the
character file release callback. This has several advantages:
- This one place can handle both saturated and unsaturated queues. Thus,
it replaces both preexisting methods of detecting ublk server exit.
- It runs quickly on ublk server exit - there is no 30s delay.
- It starts the process of removing task references in ublk_drv. This is
needed if we want to relax restrictions in the driver like letting
only one thread serve each queue
There is also the disadvantage that the character file release callback
can also be triggered by intentional close of the file, which is a
significant behavior change. Preexisting ublk servers (libublksrv) are
dependent on the ability to open/close the file multiple times. To
address this, only transition to a nosrv state if the file is released
while the ublk device is live. This allows for programs to open/close
the file multiple times during setup. It is still a behavior change if a
ublk server decides to close/reopen the file while the device is LIVE
(i.e. while it is responsible for serving I/O), but that would be highly
unusual. This behavior is in line with what is done by FUSE, which is
very similar to ublk in that a userspace daemon is providing services
traditionally provided by the kernel.
With this change in, the new test (and all other selftests, and all
ublksrv tests) pass:
selftests: ublk: test_generic_06.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 0.0376731 s, 0.0 kB/s
DEAD
generic_04 : [PASS]
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
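The release-time rule from the commit above ("only transition to a nosrv state if the file is released while the ublk device is live") can be sketched as a tiny state function. This is an illustration only: the model_* names and the three-state enum are hypothetical simplifications, and the real driver distinguishes more states and recovery modes.

```c
/* Hypothetical, simplified device states for this sketch. */
enum model_state {
	MODEL_DEV_SETUP, /* device not yet started; server may open/close freely */
	MODEL_DEV_LIVE,  /* server is responsible for serving I/O */
	MODEL_DEV_NOSRV, /* server is gone; nosrv behavior applies */
};

/*
 * Models the char device release callback: closing the file only counts
 * as "server exit" when the device is live. Any other state is left
 * untouched, so open/close cycles during setup remain harmless.
 */
static enum model_state model_ch_release(enum model_state state)
{
	if (state == MODEL_DEV_LIVE)
		return MODEL_DEV_NOSRV;
	return state;
}
```

The point of the single check is that one code path covers both saturated and unsaturated queues, since the file release happens regardless of how many commands were outstanding.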
|
|
ublk_ch_release() is called after the ublk char device is closed, when all
uring_cmds are done, so it is perfectly fine to move the ublk device reset
from ublk_ctrl_start_recovery() to ublk_ch_release().
This avoids holding on to the exiting daemon's task_struct for too long.
However, the reset of the following ublk IO flags has to be deferred until
the ublk io_uring queues are ready:
- ubq->canceling
For requeuing IO in case of ublk_nosrv_dev_should_queue_io() before device
is recovered
- ubq->fail_io
For failing IO in case of UBLK_F_USER_RECOVERY_FAIL_IO before device is
recovered
- ublk_io->flags
For preventing using io->cmd
This simplifies recovery a lot.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now ublk deals with ublk_nosrv_dev_should_queue_io() by keeping the request
queue quiesced. This is fragile because the queue quiesce crosses syscalls
or process contexts.
Switch to rely on ubq->canceling for dealing with
ublk_nosrv_dev_should_queue_io(), because it has been used for this purpose
during io_uring context exiting, and it can be reused before recovering too.
In ublk_queue_rq(), the request will be added to requeue list without
kicking off requeue in case of ubq->canceling, and finally requests added in
requeue list will be dispatched from either ublk_stop_dev() or
ublk_ctrl_end_recovery().
Meanwhile, we have to move the reset of ubq->canceling from
ublk_ctrl_start_recovery() to ublk_ctrl_end_recovery(), when IO handling can
be recovered completely.
Then blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are always used
in the same context.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250416035444.99569-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
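The requeue behavior described in the commit above can be sketched as a small user-space model. Everything here is hypothetical (the model_* names, the fixed-size arrays, the integer tags); it only illustrates the ordering: while canceling is set, requests are parked on a requeue list without being kicked, and they are dispatched once recovery ends and canceling is cleared.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_REQS 8 /* hypothetical capacity for this sketch */

struct model_ubq {
	bool canceling;
	int requeued[MODEL_MAX_REQS];   /* parked requests (by tag) */
	size_t nr_requeued;
	int dispatched[MODEL_MAX_REQS]; /* requests handed to the server */
	size_t nr_dispatched;
};

static void model_dispatch(struct model_ubq *q, int tag)
{
	if (q->nr_dispatched < MODEL_MAX_REQS)
		q->dispatched[q->nr_dispatched++] = tag;
}

/* Models the queue_rq path: park the request while canceling is set. */
static void model_queue_rq(struct model_ubq *q, int tag)
{
	if (q->canceling) {
		if (q->nr_requeued < MODEL_MAX_REQS)
			q->requeued[q->nr_requeued++] = tag;
		return; /* no requeue kick here */
	}
	model_dispatch(q, tag);
}

/* Models the end of recovery: clear canceling, then flush parked requests. */
static void model_end_recovery(struct model_ubq *q)
{
	q->canceling = false;
	for (size_t i = 0; i < q->nr_requeued; i++)
		model_dispatch(q, q->requeued[i]);
	q->nr_requeued = 0;
}
```

Keeping the decision in a per-queue flag, rather than in the quiesce state of the request queue, means the quiesce and unquiesce calls no longer have to span separate syscalls.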
|
|
Add ublk_force_abort_dev() for handling ublk_nosrv_dev_should_queue_io()
in ublk_stop_dev(). Then queue quiesce and unquiesce can be paired in a
single function.
Meanwhile, no longer change the device state to QUIESCED, since the disk is
going to be removed soon.
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Most uring_cmds issued against ublk character devices are serialized
because each command affects only one queue, and there is an early check
which only allows a single task (the queue's ubq_daemon) to issue
uring_cmds against that queue. However, this mechanism does not work for
FETCH_REQs, since they are expected before ubq_daemon is set. Since
FETCH_REQs are only used at initialization and not in the fast path,
serialize them using the per-ublk-device mutex. This fixes a number of
data races that were previously possible if a badly behaved ublk server
decided to issue multiple FETCH_REQs against the same qid/tag
concurrently.
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
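The serialization idea in the commit above can be sketched with a user-space mutex. This is a hedged illustration, not the driver's code: the model_* names, the fixed queue depth, and the -EBUSY/-EINVAL return values are assumptions of the sketch. The point is that the "already fetched?" check and the state update happen under one per-device lock, so two concurrent fetches for the same tag cannot both succeed.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_QUEUE_DEPTH 4 /* hypothetical queue depth for this sketch */

struct model_dev {
	pthread_mutex_t mutex;           /* stands in for the per-device mutex */
	bool fetched[MODEL_QUEUE_DEPTH]; /* one slot per tag */
};

static void model_dev_init(struct model_dev *dev)
{
	pthread_mutex_init(&dev->mutex, NULL);
	for (int i = 0; i < MODEL_QUEUE_DEPTH; i++)
		dev->fetched[i] = false;
}

/*
 * Models a fetch request for one tag. The check and the update are done
 * under the same lock, so a duplicate fetch for the same tag is rejected
 * instead of racing with the first one.
 */
static int model_fetch_req(struct model_dev *dev, int tag)
{
	int ret = 0;

	if (tag < 0 || tag >= MODEL_QUEUE_DEPTH)
		return -EINVAL;

	pthread_mutex_lock(&dev->mutex);
	if (dev->fetched[tag])
		ret = -EBUSY;
	else
		dev->fetched[tag] = true;
	pthread_mutex_unlock(&dev->mutex);

	return ret;
}
```

A coarse lock is acceptable here for the reason the commit gives: fetches happen only at initialization, not in the fast path.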
|
|
The CONFIG_BLK_DEV_UBLK help text suggests setting the config option to
Y so task_work_add() can be used to dispatch I/O, improving performance.
However, this mechanism was removed in commit 29dc5d06613f2 ("ublk: kill
queuing request by task_work_add"). So remove this paragraph from the
config help text.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250416004111.3242817-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
vfs_iter_{read,write} always perform direct I/O when the file has the
O_DIRECT flag set, which breaks disabling direct I/O using the
LOOP_SET_STATUS / LOOP_SET_STATUS64 ioctls.
This was recently reported as a regression, but as far as I can tell
was only uncovered by better checking for block sizes and has been
around since the direct I/O support was added.
Fix this by using the existing aio code that calls the raw read/write
iter methods instead. Note that despite the comments there is no need
for block drivers to ever call flush_dcache_page themselves, and the
call is a left-over from prehistoric times.
Fixes: ab1cb278bc70 ("block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO")
Reported-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20250409130940.3685677-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the suppression of the uevents before scanning for partitions.
The partitions inherit their suppression settings from their parent device,
which leads to the uevents being dropped.
This is similar to the same changes for LOOP_CONFIGURE done in
commit bb430b694226 ("loop: LOOP_CONFIGURE: send uevents for partitions").
Fixes: 498ef5c777d9 ("loop: suppress uevents while reconfiguring the device")
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20250415-loop-uevent-changed-v3-1-60ff69ac6088@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The original commit message and the wording "uncork" in the code comment
indicate that it is expected that the suppressed event instances are
automatically sent after unsuppressing.
This is not the case, instead they are discarded.
In effect this means that no "changed" events are emitted on the device
itself by default.
While each discovered partition does trigger a changed event on the
device, devices without partitions don't have any event emitted.
This makes udev miss the device creation and prompted workarounds in
userspace. See the linked util-linux/losetup bug.
Explicitly emit the events and drop the confusingly worded comments.
Link: https://github.com/util-linux/util-linux/issues/2434
Fixes: 498ef5c777d9 ("loop: suppress uevents while reconfiguring the device")
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://lore.kernel.org/r/20250415-loop-uevent-changed-v2-1-0c4e6a923b2a@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Set cmd->iocb.ki_ioprio to the ioprio of loop device's request.
The purpose is to inherit the original request ioprio in the aio
flow.
Signed-off-by: Yunlong Xing <yunlong.xing@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250414030159.501180-1-yunlong.xing@unisoc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull more block fixes from Jens Axboe:
"Apparently my internal clock was off, or perhaps it was just wishful
thinking, but I sent out block fixes yesterday as my brain assumed it
was Friday. Subsequently, that missed the NVMe fixes that should go
into this week's release as well. Hence, here's a followup with those,
and another simple fix.
- NVMe pull request via Christoph:
- nvmet fc/fcloop refcounting fixes (Daniel Wagner)
- fix missed namespace/ANA scans (Hannes Reinecke)
- fix a use after free in the new TCP netns support (Kuniyuki
Iwashima)
- fix a NULL instead of false review in multipath (Uday Shankar)
- Use strscpy() for null_blk disk name copy"
* tag 'block-6.15-20250411' of git://git.kernel.dk/linux:
null_blk: Use strscpy() instead of strscpy_pad() in null_add_dev()
nvmet-fc: put ref when assoc->del_work is already scheduled
nvmet-fc: take tgtport reference only once
nvmet-fc: update tgtport ref per assoc
nvmet-fc: inline nvmet_fc_free_hostport
nvmet-fc: inline nvmet_fc_delete_assoc
nvmet-fcloop: add ref counting to lport
nvmet-fcloop: replace kref with refcount
nvmet-fcloop: swap list_add_tail arguments
nvme-tcp: fix use-after-free of netns by kernel TCP socket.
nvme: multipath: fix return value of nvme_available_path
nvme: re-read ANA log page after ns scan completes
nvme: requeue namespace scan on missed AENs
|
|
blk_mq_alloc_disk() already zero-initializes the destination buffer,
making strscpy() sufficient for safely copying the disk's name. The
additional NUL-padding performed by strscpy_pad() is unnecessary.
If the destination buffer has a fixed length, strscpy() automatically
determines its size using sizeof() when the argument is omitted. This
makes the explicit size argument unnecessary.
The source string is also NUL-terminated and meets the __must_be_cstr()
requirement of strscpy().
No functional changes intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250410154727.883207-1-thorsten.blum@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
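The equivalence argued in the commit above can be checked with a user-space model of the two helpers. This is a sketch under stated assumptions: model_strscpy() and model_strscpy_pad() are hypothetical stand-ins that imitate the documented kernel semantics (always NUL-terminate, return the copied length or -E2BIG on truncation); they are not the kernel implementations. With a destination that is already zeroed, the plain copy and the padded copy leave identical buffers.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/*
 * User-space stand-in for strscpy(): copy at most size - 1 bytes, always
 * NUL-terminate, return the number of bytes copied or -E2BIG if the
 * source had to be truncated (or size is 0).
 */
static ssize_t model_strscpy(char *dst, const char *src, size_t size)
{
	size_t len;

	if (size == 0)
		return -E2BIG;

	len = strnlen(src, size);
	if (len == size) {
		/* Source does not fit: truncate and terminate. */
		memcpy(dst, src, size - 1);
		dst[size - 1] = '\0';
		return -E2BIG;
	}

	memcpy(dst, src, len + 1); /* includes the terminating NUL */
	return (ssize_t)len;
}

/* Stand-in for strscpy_pad(): same copy, then zero-fill the tail. */
static ssize_t model_strscpy_pad(char *dst, const char *src, size_t size)
{
	ssize_t n = model_strscpy(dst, src, size);

	if (n >= 0 && (size_t)n + 1 < size)
		memset(dst + n + 1, 0, size - (size_t)n - 1);
	return n;
}
```

The padding only matters when the destination may hold stale bytes; when the allocator has already zeroed it, as the commit states for the disk name, the extra memset is redundant work.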
|
|
Pull block fixes from Jens Axboe:
- Add a missing ublk selftest script, from test additions added last
week
- Two fixes for ublk error recovery and reissue
- Cleanup of ublk argument passing
* tag 'block-6.15-20250410' of git://git.kernel.dk/linux:
ublk: pass ublksrv_ctrl_cmd * instead of io_uring_cmd *
ublk: don't fail request for recovery & reissue in case of ubq->canceling
ublk: fix handling recovery & reissue in ublk_abort_queue()
selftests: ublk: fix test_stripe_04
|
|
The ublk_ctrl_*() handlers all take struct io_uring_cmd *cmd but only
use it to get struct ublksrv_ctrl_cmd *header from the io_uring SQE.
Since the caller ublk_ctrl_uring_cmd() has already computed header, pass
it instead of cmd.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250409012928.3527198-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ubq->canceling is set with the request queue quiesced when the io_uring
context is exiting. USER_RECOVERY or !RECOVERY_FAIL_IO requires requests to
be re-queued and re-dispatched after the device is recovered.
However, commit d796cea7b9f3 ("ublk: implement ->queue_rqs()") may still fail
any request in case of ubq->canceling, which breaks USER_RECOVERY or
!RECOVERY_FAIL_IO.
Fix it by calling __ublk_abort_rq() in case of ubq->canceling.
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Reported-by: Uday Shankar <ushankar@purestorage.com>
Closes: https://lore.kernel.org/linux-block/Z%2FQkkTRHfRxtN%2FmB@dev-ushankar.dev.purestorage.com/
Fixes: d796cea7b9f3 ("ublk: implement ->queue_rqs()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250409011444.2142010-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 8284066946e6 ("ublk: grab request reference when the request is handled
by userspace") doesn't grab a request reference in case of recovery reissue.
Then the request can be requeued, re-dispatched and failed when canceling the
uring command.
If it is a zc request, the request can be freed before io_uring
returns the zc buffer, which then causes a kernel panic:
[ 126.773061] BUG: kernel NULL pointer dereference, address: 00000000000000c8
[ 126.773657] #PF: supervisor read access in kernel mode
[ 126.774052] #PF: error_code(0x0000) - not-present page
[ 126.774455] PGD 0 P4D 0
[ 126.774698] Oops: Oops: 0000 [#1] SMP NOPTI
[ 126.775034] CPU: 13 UID: 0 PID: 1612 Comm: kworker/u64:55 Not tainted 6.14.0_blk+ #182 PREEMPT(full)
[ 126.775676] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
[ 126.776275] Workqueue: iou_exit io_ring_exit_work
[ 126.776651] RIP: 0010:ublk_io_release+0x14/0x130 [ublk_drv]
Fix it by always grabbing a request reference for aborting the request.
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Closes: https://lore.kernel.org/linux-block/CADUfDZodKfOGUeWrnAxcZiLT+puaZX8jDHoj_sfHZCOZwhzz6A@mail.gmail.com/
Fixes: 8284066946e6 ("ublk: grab request reference when the request is handled by userspace")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250409011444.2142010-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC cleanups from Eric Biggers:
"Finish cleaning up the CRC kconfig options by removing the remaining
unnecessary prompts and an unnecessary 'default y', removing
CONFIG_LIBCRC32C, and documenting all the CRC library options"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crc: remove CONFIG_LIBCRC32C
lib/crc: document all the CRC library kconfig options
lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T
lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF
lib/crc: remove unnecessary prompt for CONFIG_CRC16
lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT
lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'
|
|
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Now that LIBCRC32C does nothing besides select CRC32, make every option
that selects LIBCRC32C instead select CRC32 directly. Then remove
LIBCRC32C.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Pull more io_uring updates from Jens Axboe:
"Set of fixes/updates for io_uring that should go into this release.
The ublk bits could've gone via either tree - usually I put them in
block, but they got a bit mixed this series with the zero-copy
support that ended up dipping into both trees.
This contains:
- Fix for sendmsg zc, include in pinned pages accounting like we do
for the other zc types
- Series for ublk fixing request aborting, doing various little
cleanups, fixing some zc issues, and adding queue_rqs support
- Another ublk series doing some code cleanups
- Series cleaning up the io_uring send path, mostly in preparation
for registered buffers
- Series doing little MSG_RING cleanups
- Fix for the newly added zc rx, fixing len being 0 for the last
invocation of the callback
- Add vectored registered buffer support for ublk. With that, ublk
also supports this feature in the kernel revision where it was
generically introduced for rw/net
- A bunch of selftest additions for ublk. This is the majority of the
diffstat
- Silence a KCSAN data race warning for io-wq
- Various little cleanups and fixes"
* tag 'io_uring-6.15-20250403' of git://git.kernel.dk/linux: (44 commits)
io_uring: always do atomic put from iowq
selftests: ublk: enable zero copy for stripe target
io_uring: support vectored kernel fixed buffer
block: add for_each_mp_bvec()
io_uring: add validate_fixed_range() for validate fixed buffer
selftests: ublk: kublk: fix an error log line
selftests: ublk: kublk: use ioctl-encoded opcodes
io_uring/zcrx: return early from io_zcrx_recv_skb if readlen is 0
io_uring/net: avoid import_ubuf for regvec send
io_uring/rsrc: check size when importing reg buffer
io_uring: cleanup {g,s]etsockopt sqe reading
io_uring: hide caches sqes from drivers
io_uring: make zcrx depend on CONFIG_IO_URING
io_uring: add req flag invariant build assertion
Documentation: ublk: remove dead footnote
selftests: ublk: specify io_cmd_buf pointer type
ublk: specify io_cmd_buf pointer type
io_uring: don't pass ctx to tw add remote helper
io_uring/msg: initialise msg request opcode
io_uring/msg: rename io_double_lock_ctx()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively minor cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found when using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes things better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
removes the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DAMON/DAMOS to filter by the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while running our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatory cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallelization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documentation
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permitting the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatory refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux
Pull Rust updates from Miguel Ojeda:
"Toolchain and infrastructure:
- Extract the 'pin-init' API from the 'kernel' crate and make it into
a standalone crate.
In order to do this, the contents are rearranged so that they can
easily be kept in sync with the version maintained out-of-tree that
other projects have started to use too (or plan to, like QEMU).
This will reduce the maintenance burden for Benno, who will now
have his own sub-tree, and will simplify future expected changes
like the move to use 'syn' to simplify the implementation.
- Add '#[test]'-like support based on KUnit.
We already had doctests support based on KUnit, which takes the
examples in our Rust documentation and runs them under KUnit.
Now, we are adding the beginning of the support for "normal" tests,
similar to the '#[test]' tests in userspace Rust. For
instance:
#[kunit_tests(my_suite)]
mod tests {
#[test]
fn my_test() {
assert_eq!(1 + 1, 2);
}
}
Unlike with doctests, the 'assert*!'s do not map to the KUnit
assertion APIs yet.
- Check Rust signatures at compile time for functions called from C
by name.
In particular, introduce a new '#[export]' macro that can be placed
in the Rust function definition. It will ensure that the function
declaration on the C side matches the signature on the Rust
function:
#[export]
pub unsafe extern "C" fn my_function(a: u8, b: i32) -> usize {
// ...
}
The macro essentially forces the compiler to compare the types of
the actual Rust function and the 'bindgen'-processed C signature.
These cases are rare so far. In the future, we may consider
introducing another tool, 'cbindgen', to generate C headers
automatically. Even then, having these functions explicitly marked
may be a good idea anyway.
- Enable the 'raw_ref_op' Rust feature: it is already stable, and
allows us to use the new '&raw' syntax, avoiding a couple macros.
After everyone has migrated, we will disallow the macros.
- Pass the correct target to 'bindgen' on Usermode Linux.
- Fix 'rusttest' build on macOS.
'kernel' crate:
- New 'hrtimer' module: add support for setting up intrusive timers
without allocating when starting the timer. Add support for
'Pin<Box<_>>', 'Arc<_>', 'Pin<&_>' and 'Pin<&mut _>' as pointer
types for use with timer callbacks. Add support for setting clock
source and timer mode.
- New 'dma' module: add a simple DMA coherent allocator abstraction
and a test sample driver.
- 'list' module: make the linked list 'Cursor' point between
elements, rather than at an element, which is more convenient to us
and allows for cursors to empty lists; and document it with
examples of how to perform common operations with the provided
methods.
- 'str' module: implement a few traits for 'BStr' as well as the
'strip_prefix()' method.
- 'sync' module: add 'Arc::as_ptr'.
- 'alloc' module: add 'Box::into_pin'.
- 'error' module: extend the 'Result' documentation, including a few
examples on different ways of handling errors, a warning about
using methods that may panic, and links to external documentation.
'macros' crate:
- 'module' macro: add the 'authors' key to support multiple authors.
The original key will be kept until everyone has migrated.
Documentation:
- Add error handling sections.
MAINTAINERS:
- Add Danilo Krummrich as reviewer of the Rust "subsystem".
- Add 'RUST [PIN-INIT]' entry with Benno Lossin as maintainer. It has
its own sub-tree.
- Add sub-tree for 'RUST [ALLOC]'.
- Add 'DMA MAPPING HELPERS DEVICE DRIVER API [RUST]' entry with
Abdiel Janulgue as primary maintainer. It will go through the
sub-tree of the 'RUST [ALLOC]' entry.
- Add 'HIGH-RESOLUTION TIMERS [RUST]' entry with Andreas Hindborg as
maintainer. It has its own sub-tree.
And a few other cleanups and improvements"
* tag 'rust-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (71 commits)
rust: dma: add `Send` implementation for `CoherentAllocation`
rust: macros: fix `make rusttest` build on macOS
rust: block: refactor to use `&raw mut`
rust: enable `raw_ref_op` feature
rust: uaccess: name the correct function
rust: rbtree: fix comments referring to Box instead of KBox
rust: hrtimer: add maintainer entry
rust: hrtimer: add clocksource selection through `ClockId`
rust: hrtimer: add `HrTimerMode`
rust: hrtimer: implement `HrTimerPointer` for `Pin<Box<T>>`
rust: alloc: add `Box::into_pin`
rust: hrtimer: implement `UnsafeHrTimerPointer` for `Pin<&mut T>`
rust: hrtimer: implement `UnsafeHrTimerPointer` for `Pin<&T>`
rust: hrtimer: add `hrtimer::ScopedHrTimerPointer`
rust: hrtimer: add `UnsafeHrTimerPointer`
rust: hrtimer: allow timer restart from timer handler
rust: str: implement `strip_prefix` for `BStr`
rust: str: implement `AsRef<BStr>` for `[u8]` and `BStr`
rust: str: implement `Index` for `BStr`
rust: str: implement `PartialEq` for `BStr`
...
|
|
io_cmd_buf points to an array of ublksrv_io_desc structs but its type is
char *. Indexing the array requires an explicit multiplication and cast.
The compiler also can't check the pointer types.
Change io_cmd_buf's type to struct ublksrv_io_desc * so it can be
indexed directly and the compiler can type-check the code.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lore.kernel.org/r/20250328194230.2726862-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
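The difference between the two indexing styles described in the commit above can be sketched in plain userspace C. This is only an illustration: `struct io_desc` is a hypothetical, simplified stand-in for `struct ublksrv_io_desc`, and the helper names `desc_untyped()` and `desc_typed()` are invented for this sketch and do not appear in the selftest code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct ublksrv_io_desc. */
struct io_desc {
	uint32_t op_flags;
	uint32_t nr_sectors;
	uint64_t start_sector;
	uint64_t addr;
};

/* Before: the buffer is a char *, so every lookup needs an explicit
 * multiplication and a cast, and the compiler cannot check the type. */
static const struct io_desc *desc_untyped(const char *buf, size_t tag)
{
	assert(buf != NULL);
	return (const struct io_desc *)(buf + tag * sizeof(struct io_desc));
}

/* After: the buffer has the element type, so plain array indexing works
 * and a mismatched pointer type is diagnosed at compile time. */
static const struct io_desc *desc_typed(const struct io_desc *buf, size_t tag)
{
	assert(buf != NULL);
	return &buf[tag];
}
```

Both helpers return the same address for the same tag; the typed version just lets the compiler do the scaling and the checking.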
|
|
Pass struct request *rq to ublk_cmd_tw_cb() through ublk_uring_cmd_pdu,
mirroring how it works for ublk_cmd_list_tw_cb(). This saves some
pointer dereferences, as well as the bounds check in blk_mq_tag_to_rq().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250328180411.2696494-6-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk_queue_cmd_list() loads io->cmd twice. The intervening stores
prevent the compiler from combining the loads. Since struct ublk_io *io
is only used to compute io->cmd, replace the variable with io->cmd.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250328180411.2696494-5-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Save a few pointer dereferences by obtaining struct ublk_queue *ubq from
the ublk_uring_cmd_pdu instead of the request's mq_hctx.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250328180411.2696494-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk_cmd_list_tw_cb() is always performed on a non-empty request list.
So don't check whether rq is NULL on the first iteration of the loop,
just on subsequent iterations.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250328180411.2696494-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
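The loop shape described in the commit above can be sketched as follows. This is a userspace illustration under stated assumptions, not the driver code: `struct req`, `list_pop()` and `handle_all()` are hypothetical names, and the `assert()` only documents the caller's non-empty guarantee (it compiles out with NDEBUG).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal request node; not the kernel's struct request. */
struct req {
	struct req *next;
	int handled;
};

/* Pop the head of the list; returns NULL once the list is empty. */
static struct req *list_pop(struct req **head)
{
	struct req *rq = *head;

	if (rq)
		*head = rq->next;
	return rq;
}

/* The caller guarantees a non-empty list, so a do/while loop handles the
 * first element unconditionally and only tests for NULL on later pops. */
static int handle_all(struct req **head)
{
	struct req *rq = list_pop(head);
	int handled = 0;

	assert(rq != NULL); /* documents the precondition only */
	do {
		rq->handled = 1;
		handled++;
	} while ((rq = list_pop(head)) != NULL);

	return handled;
}
```

Compared with a `while (rq)` loop, this form drops one branch per batch, which is the kind of saving the commit is after.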
|
|
ublk_dispatch_req() never uses its struct io_uring_cmd *cmd argument.
Drop it so callers don't have to pass a value.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250328180411.2696494-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The new name is aligned with ublk_cmd_list_tw_cb(), and looks
more readable.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-10-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Implement ->queue_rqs() to improve performance in the MQ case.
This way, io_uring_cmd_complete_in_task() only needs to be called once for
the whole IO batch, and both io_uring and the ublk server can get the exact
batch from the ublk frontend.
The IOPS improvement is as follows:
- tests
tools/testing/selftests/ublk/kublk add -t null -q 2 [-z]
fio/t/io_uring -p0 /dev/ublkb0
- results:
more than 10% IOPS boost observed
Pass all ublk selftests, especially the io dispatch order test.
Cc: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
IO split is usually bad in the io_uring world, since it causes -EAGAIN and
IO handling may have to fall back to io-wq, which hurts performance.
ublk recently started to support zero copy. To avoid unnecessary IO
splits, the ublk driver's segment limit should be aligned with the backend
device's segment limit.
Another reason is that io_buffer_register_bvec() needs to allocate bvecs,
whose number matches the ublk request's segment count, so a big
memory allocation can be avoided by setting a reasonable max_segments limit.
So add a segment parameter to give the ublk server a chance to align the
segment limit with the backend, and keep it reasonable from an
implementation viewpoint.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Call io_uring_cmd_to_pdu() to get uring_cmd pdu, and one big benefit
is the automatic pdu size build check.
Suggested-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250327095123.179113-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk_need_map_io() is more readable.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove two unused fields (`io_addr` & `max_io_sz`) from `struct ublk_queue`.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In ublk_queue_rq(), ubq->canceling has to be handled after ->fail_io and
->force_abort are dealt with, otherwise the request may not be failed
when deleting disk.
Add comment on this usage.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now ublk driver depends on `ubq->canceling` for deciding if the request
can be dispatched via uring_cmd & io_uring_cmd_complete_in_task().
Once ubq->canceling is set, the uring_cmd can be done via ublk_cancel_cmd()
and io_uring_cmd_done().
So set ubq->canceling when the queue is frozen. This makes sure that the
flag can be observed from ublk_queue_rq() reliably, and avoids
use-after-free on uring_cmd.
Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250327095123.179113-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block updates from Jens Axboe:
- Fixes for integrity handling
- NVMe pull request via Keith:
- Secure concatenation for TCP transport (Hannes)
- Multipath sysfs visibility (Nilay)
- Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li)
- Correct use of 64-bit BARs for pci-epf target (Niklas)
- Socket fix for selinux when used in containers (Peijie)
- MD pull request via Yu:
- fix recovery can preempt resync (Li Nan)
- fix md-bitmap IO limit (Su Yue)
- fix raid10 discard with REQ_NOWAIT (Xiao Ni)
- fix raid1 memory leak (Zheng Qixing)
- fix mddev uaf (Yu Kuai)
- fix raid1,raid10 IO flags (Yu Kuai)
- some refactor and cleanup (Yu Kuai)
- Series cleaning up and fixing bugs in the bad block handling code
- Improve support for write failure simulation in null_blk
- Various lock ordering fixes
- Fixes for locking for debugfs attributes
- Various ublk related fixes and improvements
- Cleanups for blk-rq-qos wait handling
- blk-throttle fixes
- Fixes for loop dio and sync handling
- Fixes and cleanups for the auto-PI code
- Block side support for hardware encryption keys in blk-crypto
- Various cleanups and fixes
* tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits)
nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi)
nvme-tcp: fix selinux denied when calling sock_sendmsg
nvmet: pci-epf: Always configure BAR0 as 64-bit
nvmet: Remove duplicate uuid_copy
nvme: zns: Simplify nvme_zone_parse_entry()
nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls
nvmet-fc: Remove unused functions
nvme-pci: remove stale comment
nvme-fc: Utilise min3() to simplify queue count calculation
nvme-multipath: Add visibility for queue-depth io-policy
nvme-multipath: Add visibility for numa io-policy
nvme-multipath: Add visibility for round-robin io-policy
nvmet: add tls_concat and tls_key debugfs entries
nvmet-tcp: support secure channel concatenation
nvmet: Add 'sq' argument to alloc_ctrl_args
nvme-fabrics: reset admin connection for secure concatenation
nvme-tcp: request secure channel concatenation
nvme-keyring: add nvme_tls_psk_refresh()
nvme: add nvme_auth_derive_tls_psk()
nvme: add nvme_auth_generate_digest()
...
|
|
Pull io_uring updates from Jens Axboe:
"This is the first of the io_uring pull requests for the 6.15 merge
window, there will be others once the net tree has gone in. This
contains:
- Cleanup and unification of cancelation handling across various
request types.
- Improvement for bundles, supporting them both for incrementally
consumed buffers, and for non-multishot requests.
- Enable toggling of using iowait while waiting on io_uring events or
not. Unfortunately this is still tied with CPU frequency boosting
on short waits, as the scheduler side has not been very receptive
to splitting the (useless) iowait stat from the cpufreq implied
boost.
- Add support for kbuf nodes, enabling zero-copy support for the ublk
block driver.
- Various cleanups for resource node handling.
- Series greatly cleaning up the legacy provided (non-ring based)
buffers. For years, we've been pushing the ring provided buffers as
the way to go, and that is what people have been using. Reduce the
complexity and code associated with legacy provided buffers.
- Series cleaning up the compat handling.
- Series improving and cleaning up the recvmsg/sendmsg iovec and msg
handling.
- Series of cleanups for io-wq.
- Start adding a bunch of selftests. The liburing repository
generally carries feature and regression tests for everything, but
at least for ublk initially, we'll try and go the route of having
it in selftests as well. We'll see how this goes, might decide to
migrate more tests this way in the future.
- Various little cleanups and fixes"
* tag 'for-6.15/io_uring-20250322' of git://git.kernel.dk/linux: (108 commits)
selftests: ublk: add stripe target
selftests: ublk: simplify loop io completion
selftests: ublk: enable zero copy for null target
selftests: ublk: prepare for supporting stripe target
selftests: ublk: move common code into common.c
selftests: ublk: increase max buffer size to 1MB
selftests: ublk: add single sqe allocator helper
selftests: ublk: add generic_01 for verifying sequential IO order
selftests: ublk: fix starting ublk device
io_uring: enable toggle of iowait usage when waiting on CQEs
selftests: ublk: fix write cache implementation
selftests: ublk: add variable for user to not show test result
selftests: ublk: don't show `modprobe` failure
selftests: ublk: add one dependency header
io_uring/kbuf: enable bundles for incrementally consumed buffers
Revert "io_uring/rsrc: simplify the bvec iter count calculation"
selftests: ublk: improve test usability
selftests: ublk: add stress test for covering IO vs. killing ublk server
selftests: ublk: add one stress test for covering IO vs. removing device
selftests: ublk: load/unload ublk_drv when preparing & cleaning up tests
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer cleanups from Thomas Gleixner:
"A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
the callback pointer. This turned out to be suboptimal for the
upcoming Rust integration and is obviously a silly implementation to
begin with.
This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
with hrtimer_setup(T, cb);
The conversion was done with Coccinelle and a few manual fixups.
Once the conversion has completely landed in mainline, hrtimer_init()
will be removed and the hrtimer::function becomes a private member"
* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
wifi: rt2x00: Switch to use hrtimer_update_function()
io_uring: Use helper function hrtimer_update_function()
serial: xilinx_uartps: Use helper function hrtimer_update_function()
ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
RDMA: Switch to use hrtimer_setup()
virtio: mem: Switch to use hrtimer_setup()
drm/vmwgfx: Switch to use hrtimer_setup()
drm/xe/oa: Switch to use hrtimer_setup()
drm/vkms: Switch to use hrtimer_setup()
drm/msm: Switch to use hrtimer_setup()
drm/i915/request: Switch to use hrtimer_setup()
drm/i915/uncore: Switch to use hrtimer_setup()
drm/i915/pmu: Switch to use hrtimer_setup()
drm/i915/perf: Switch to use hrtimer_setup()
drm/i915/gvt: Switch to use hrtimer_setup()
drm/i915/huc: Switch to use hrtimer_setup()
drm/amdgpu: Switch to use hrtimer_setup()
stm class: heartbeat: Switch to use hrtimer_setup()
i2c: Switch to use hrtimer_setup()
iio: Switch to use hrtimer_setup()
...
|
|
The current I/O dispatch mechanism - queueing I/O by adding it to the
io_cmds list (and poking task_work as needed), then dispatching it in
ublk server task context by reversing io_cmds and completing the
io_uring command associated to each one - was introduced by commit
7d4a93176e014 ("ublk_drv: don't forward io commands in reserve order")
to ensure that the ublk server received I/O in the same order that the
block layer submitted it to ublk_drv. This mechanism was only needed for
the "raw" task_work submission mechanism, since the io_uring task work
wrapper maintains FIFO ordering (using quite a similar mechanism in
fact). The "raw" task_work submission mechanism is no longer supported
in ublk_drv as of commit 29dc5d06613f2 ("ublk: kill queuing request by
task_work_add"), so the explicit llist/reversal is no longer needed - it
just duplicates logic already present in the underlying io_uring APIs.
Remove it.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250318-ublk_io_cmds-v1-1-c1bb74798fef@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
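The mechanism being removed in the commit above can be pictured with a small userspace sketch: producers prepend to a singly linked list, so the list ends up in reverse submission order, and the consumer reverses it once to recover FIFO order. The names `struct node`, `push_front()` and `reverse()` are hypothetical, and the sketch is single-threaded; it leaves out the atomic operations a real lock-free list would need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; seq records the submission order. */
struct node {
	struct node *next;
	int seq;
};

/* Producer side: prepend to the list head. The newest entry ends up
 * first, so the list is in reverse submission order. */
static void push_front(struct node **head, struct node *n)
{
	assert(n != NULL);
	n->next = *head;
	*head = n;
}

/* Consumer side: reverse the list in place so that walking it visits
 * entries in the order they were submitted (FIFO). */
static struct node *reverse(struct node *head)
{
	struct node *prev = NULL;

	while (head) {
		struct node *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}
```

Per the commit message, the io_uring task work wrapper already maintains FIFO ordering, which is why keeping a second copy of this logic in ublk_drv was redundant.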
|
|
If vfs_fsync() is called with the queue frozen, the queue freeze lock may be
connected with FS internal locks, and a lockdep warning can be triggered
because the queue freeze lock is connected with too many global or
sub-system locks.
Fix the warning by moving vfs_fsync() out of loop_update_dio():
- vfs_fsync() is only needed when switching to dio
- only loop_change_fd() and loop_configure() may switch from buffered
IO to direct IO, so call vfs_fsync() directly here. This way is safe
because either loop is unbound, or the new file isn't attached
- for the other two cases of set_status and set_block_size, direct IO
can only become off, so no need to call vfs_fsync()
Cc: Christoph Hellwig <hch@infradead.org>
Reported-by: Kun Hu <huk23@m.fudan.edu.cn>
Reported-by: Jiaji Qin <jjtan24@m.fudan.edu.cn>
Closes: https://lore.kernel.org/linux-block/359BC288-B0B1-4815-9F01-3A349B12E816@m.fudan.edu.cn/T/#u
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250318072955.3893805-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Explicitly state that zcomp compress/decompress must be called from
non-atomic context.
Link: https://lkml.kernel.org/r/20250303022425.285971-20-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Ensure the page used for local object data is freed on error out path.
Link: https://lkml.kernel.org/r/20250303022425.285971-19-senozhatsky@chromium.org
Fixes: 330edc2bc059 ("zram: rework writeback target selection strategy")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Ensure the page used for local object data is freed on error out path.
Link: https://lkml.kernel.org/r/20250303022425.285971-18-senozhatsky@chromium.org
Fixes: 3f909a60cec1 ("zram: rework recompress target selection strategy")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When configured with pre-trained compression/decompression dictionary
support, zstd requires custom memory allocator, which it calls internally
from compression()/decompression() routines. That means allocation from
atomic context (either under entry spin-lock, or per-CPU local-lock or
both). Now, with non-atomic zram read()/write(), those limitations are
relaxed and we can allow direct and indirect reclaim.
Link: https://lkml.kernel.org/r/20250303022425.285971-17-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|