Age | Commit message (Collapse) | Author |
|
Although the non-preemptible implementation of rcu_exp_handler()
contains checks to enforce idempotency, the preemptible version does not.
The reason for this omission is that in preemptible kernels, there is
no reporting of quiescent states from CPU hotplug notifiers, and thus
no need for idempotency.
In theory, anyway.
In practice, accidents happen. This commit therefore adds checks under
WARN_ON_ONCE() to catch any such accidents.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Currently, the preemptible implementation of rcu_exp_handler()
almost open-codes rcu_exp_need_qs(). A call to that function would be
shorter and would improve expediting in cases where rcu_exp_handler()
interrupted a preemption-disabled or bh-disabled region of code.
This commit therefore moves rcu_exp_need_qs() out of the non-preemptible
leg of the enclosing #ifdef and replaces the open coding in preemptible
rcu_exp_handler() with a call to rcu_exp_need_qs().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
This commit reduces the state space of rcu_report_exp_rdp() by moving
the setting of ->cpu_no_qs.b.exp under the rcu_node structure's ->lock.
The lock isn't really all that important here, given that this per-CPU
field is supposed to be written only by its CPU, but the disabling of
interrupts excludes things like rcu_exp_handler(), which also can write
to this same field. Avoiding this sort of interleaved access reduces
the state space.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
There is a hard-to-trigger bug in the expedited grace-period computation
whose fix requires that the __sync_rcu_exp_select_node_cpus() function
to check that the grace-period sequence number has not changed before
invoking rcu_report_exp_cpu_mult(). However, this check must be done
while holding the leaf rcu_node structure's ->lock.
This commit therefore prepares for that fix by moving this lock's
acquisition from rcu_report_exp_cpu_mult() to its callers (all two
of them).
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Callbacks enqueued after rcutree_report_cpu_dead() fall into RCU barrier
blind spot. Report any potential misuse.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
This commit converts rcutorture.c values for the reader_flavor module
parameter from hexadecimal to the SRCU_READ_FLAVOR_* C-preprocessor
macros. The actual modprobe or kernel-boot-parameter values for
read_flavor must still be entered in hexadecimal.
Link: https://lore.kernel.org/all/c48c9dca-fe07-4833-acaa-28c827e5a79e@amd.com/
Suggested-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
For preemptible RCU, this commit adds an indication for each
reader segments to whether the rcu_torture_reader() task was
on the ->blkd_tasks lists, though only in kernels built with
CONFIG_RCU_TORTURE_TEST_LOG_CPU=y.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Currently, rcutorture_one_extend() reads the CPU ID before making any
change to the type of RCU reader. This can be confusing because the
properties of the code from which the CPU ID is read are not that of
the reader segment that this same CPU ID is listed with.
This commit therefore causes rcutorture_one_extend() to read the CPU
ID just after the new protections have been added, but before the old
protections have been removed. With this change in place, all of the
protections of a given reader segment apply from the reading of one CPU ID
to the reading of the next. This change therefore also allows a single
read of the CPU ID to work for both the old and the new reader segment.
And this dual use of a single read of the CPU ID avoids inflicting any
additional to heisenbugs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
This commit adds the value of preempt_count() to the diagnostics produced
by rcutorture_one_extend_check() to improve debugging.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
This commit adds rcutorture module parameters gp_cond_wi, gp_cond_wi_exp,
gp_poll_wi, and gp_poll_wi_exp to control the wait interval for
conditional, conditional expedited, polled, and polled expedited grace
periods, respectively. When rcu_torture_writer() is testing these types
of grace periods, hrtimers are used to randomly wait up to the specified
number of microseconds, but with nanosecond granularity.
In the case of conditional grace periods (get_state_synchronize_rcu()
and cond_synchronize_rcu(), for example) there is just one
wait. For polled grace periods (start_poll_synchronize_rcu() and
poll_state_synchronize_rcu(), for example), there is a repeated series
of waits until the grace period ends.
For normal grace periods, the default is 16 jiffies (for example, 16,000
microseconds on a HZ=1000 system) and for expedited grace periods the
default is 128 microseconds.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Use of the rcutorture preempt_duration and the default-on fwd_progress
kernel parameters can result in preemption of callback processing during
forward-progress testing, which is an excellent way to OOM your test
if your kernel offloads RCU callbacks. This commit therefore treats
preempt_duration in the same way as stall_cpu in CONFIG_RCU_NOCB_CPU=y
kernels, prohibiting fwd_progress testing and splatting when rcutorture
is built in (as opposed to being a loadable module).
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
This commit adds reader-state debugging checks to a new function named
rcutorture_one_extend_check(), which is invoked before and after setting
new reader states by the existing rcutorture_one_extend() function.
These checks have proven to be rather heavyweight, reducing reproduction
rate of some failures by a factor of two. They are therefore hidden
behind a new RCU_TORTURE_TEST_CHK_RDR_STATE Kconfig option.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
The current "Failure/close-call rcutorture reader segments" output is
good and sufficient, but annoying when you have to interpret several
tens of them after an all-night rcutorture run. This commit therefore
makes them a bit more human-readable.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
The purpose of the "busted" torture type is to test rcutorture code paths
used only when a too-short grace period is detected. Currently, "busted"
only uses normal rcu_read_lock()-style readers, which fails to exercise
much of the "Failure/close-call rcutorture reader segments" functionality.
This commit therefore sets the .extendables field of rcu_busted_ops to
RCUTORTURE_MAX_EXTEND in order to more fully exercise the reporting.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
In kernels built with CONFIG_RCU_TORTURE_TEST_LOG_CPU=y, the CPU is
logged at the beginning of each reader segment. This commit further
logs it at the end of the full set of reader segments in order to show
any migration that might have occurred during the last reader segment.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
This commit checks to see if the RCU reader has been preempted within
its read-side critical section for RCU flavors supporting this notion
(currently only preemptible RCU). If such a preemption occurred, then
this is printed at the end of the "Failure/close-call rcutorture reader
segments" list at the end of the rcutorture run.
[ paulmck: Apply kernel test robot feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Tested-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
The rcu_ops structure currently lacks a ->cond_sync_exp_full function,
which prevents testign of conditional full-state polled grace periods.
This commit therefore adds them, enabling testing this option.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
The rcu_torture_writer() polling currently uses timeouts ranging from
zero to 16 milliseconds to wait for the polled grace period to end.
This works, but it would be better to have a higher probability of
exercising races with the code that cleans up after a grace period.
This commit therefore switches from these millisecond-scale timeouts
to timeouts ranging from zero to 128 microseconds, and with a full
microsecond's worth of timeout fuzz.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
This commit adds CPU number to the "Failure/close-call rcutorture reader
segments" list printed at the end of an rcutorture run that had too-short
grace periods. This information can help debugging interactions with
migration and CPU hotplug.
However, experience indicates that sampling the CPU number in rcutorture's
read-side code can reduce the probability of too-short bugs by a small
integer factor. And small integer factors are crucial to RCU bug hunting,
so this commit also introduces a default-off RCU_TORTURE_TEST_LOG_CPU
Kconfig option to enable this CPU-number-logging functionality at
build time.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
This commit adds the rcutorture.preempt_duration kernel module parameter,
which gives the real-time preemption duration in milliseconds (zero to
disable, which is the default) and also the rcutorture.preempt_interval
module parameter, which gives the interval between successive preemptions,
also in milliseconds, defaulting to one second. The CPU to preempt is
chosen at random from those online at that time. Races between preempting
a given CPU and that CPU going offline are ignored, and preemption is
forgone when this occurs.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Current use cases of torture_sched_setaffinity() are well served by its
unconditional warning on error. However, an upcoming use case for a
preemption kthread needs to avoid warnings that might otherwise arise
when that kthread attempted to bind itself to a CPU on its way offline.
This commit therefore adds a dowarn argument that, when false, suppresses
the warning.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
This commit adds a "sched-clock" test for the sched_clock() function.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Recently we received a patchset that aims to enable file handle encoding
and decoding via name_to_handle_at(2) and open_by_handle_at(2).
A crucical step in the patch series is how to go from inode number to
struct pid without leaking information into unprivileged contexts. The
issue is that in order to find a struct pid the pid number in the
initial pid namespace must be encoded into the file handle via
name_to_handle_at(2). This can be used by containers using a separate
pid namespace to learn what the pid number of a given process in the
initial pid namespace is. While this is a weak information leak it could
be used in various exploits and in general is an ugly wart in the design.
To solve this problem a new way is needed to lookup a struct pid based
on the inode number allocated for that struct pid. The other part is to
remove the custom inode number allocation on 32bit systems that is also
an ugly wart that should go away.
So, a new scheme is used that I was discusssing with Tejun some time
back. A cyclic ida is used for the lower 32 bits and a the high 32 bits
are used for the generation number. This gives a 64 bit inode number
that is unique on both 32 bit and 64 bit. The lower 32 bit number is
recycled slowly and can be used to lookup struct pids.
Link: https://lore.kernel.org/r/20241129-work-pidfs-v2-1-61043d66fbce@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Arguments to a raw tracepoint are tagged as trusted, which carries the
semantics that the pointer will be non-NULL. However, in certain cases,
a raw tracepoint argument may end up being NULL. More context about this
issue is available in [0].
Thus, there is a discrepancy between the reality, that raw_tp arguments can
actually be NULL, and the verifier's knowledge, that they are never NULL,
causing explicit NULL check branch to be dead code eliminated.
A previous attempt [1], i.e. the second fixed commit, was made to
simulate symbolic execution as if in most accesses, the argument is a
non-NULL raw_tp, except for conditional jumps. This tried to suppress
branch prediction while preserving compatibility, but surfaced issues
with production programs that were difficult to solve without increasing
verifier complexity. A more complete discussion of issues and fixes is
available at [2].
Fix this by maintaining an explicit list of tracepoints where the
arguments are known to be NULL, and mark the positional arguments as
PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments
are known to be ERR_PTR, and mark these arguments as scalar values to
prevent potential dereference.
Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2),
shifted by the zero-indexed argument number x 4. This can be represented
as follows:
1st arg: 0x1
2nd arg: 0x10
3rd arg: 0x100
... and so on (likewise for ERR_PTR case).
In the future, an automated pass will be used to produce such a list, or
insert __nullable annotations automatically for tracepoints. Each
compilation unit will be analyzed and results will be collated to find
whether a tracepoint pointer is definitely not null, maybe null, or an
unknown state where verifier conservatively marks it PTR_MAYBE_NULL.
A proof of concept of this tool from Eduard is available at [3].
Note that in case we don't find a specification in the raw_tp_null_args
array and the tracepoint belongs to a kernel module, we will
conservatively mark the arguments as PTR_MAYBE_NULL. This is because
unlike for in-tree modules, out-of-tree module tracepoints may pass NULL
freely to the tracepoint. We don't protect against such tracepoints
passing ERR_PTR (which is uncommon anyway), lest we mark all such
arguments as SCALAR_VALUE.
While we are it, let's adjust the test raw_tp_null to not perform
dereference of the skb->mark, as that won't be allowed anymore, and make
it more robust by using inline assembly to test the dead code
elimination behavior, which should still stay the same.
[0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb
[1]: https://lore.kernel.org/all/20241104171959.2938862-1-memxor@gmail.com
[2]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com
[3]: https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params
Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug
Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix
Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch reverts commit
cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The
patch was well-intended and meant to be as a stop-gap fixing branch
prediction when the pointer may actually be NULL at runtime. Eventually,
it was supposed to be replaced by an automated script or compiler pass
detecting possibly NULL arguments and marking them accordingly.
However, it caused two main issues observed for production programs and
failed to preserve backwards compatibility. First, programs relied on
the verifier not exploring == NULL branch when pointer is not NULL, thus
they started failing with a 'dereference of scalar' error. Next,
allowing raw_tp arguments to be modified surfaced the warning in the
verifier that warns against reg->off when PTR_MAYBE_NULL is set.
More information, context, and discusson on both problems is available
in [0]. Overall, this approach had several shortcomings, and the fixes
would further complicate the verifier's logic, and the entire masking
scheme would have to be removed eventually anyway.
Hence, revert the patch in preparation of a better fix avoiding these
issues to replace this commit.
[0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com
Reported-by: Manu Bretelle <chantra@meta.com>
Fixes: cb4158ce8ec8 ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
These BTF functions are not available unconditionally,
only reference them when they are available.
Avoid the following build warnings:
BTF .tmp_vmlinux1.btf.o
btf_encoder__tag_kfunc: failed to find kfunc 'bpf_send_signal_task' in BTF
btf_encoder__tag_kfuncs: failed to tag kfunc 'bpf_send_signal_task'
NM .tmp_vmlinux1.syms
KSYMS .tmp_vmlinux1.kallsyms.S
AS .tmp_vmlinux1.kallsyms.o
LD .tmp_vmlinux2
NM .tmp_vmlinux2.syms
KSYMS .tmp_vmlinux2.kallsyms.S
AS .tmp_vmlinux2.kallsyms.o
LD vmlinux
BTFIDS vmlinux
WARN: resolve_btfids: unresolved symbol prog_test_ref_kfunc
WARN: resolve_btfids: unresolved symbol bpf_crypto_ctx
WARN: resolve_btfids: unresolved symbol bpf_send_signal_task
WARN: resolve_btfids: unresolved symbol bpf_modify_return_test_tp
WARN: resolve_btfids: unresolved symbol bpf_dynptr_from_xdp
WARN: resolve_btfids: unresolved symbol bpf_dynptr_from_skb
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213-bpf-cond-ids-v1-1-881849997219@weissschuh.net
|
|
The fd_array attribute of the BPF_PROG_LOAD syscall may contain a set
of file descriptors: maps or btfs. This field was introduced as a
sparse array. Introduce a new attribute, fd_array_cnt, which, if
present, indicates that the fd_array is a continuous array of the
corresponding length.
If fd_array_cnt is non-zero, then every map in the fd_array will be
bound to the program, as if it was used by the program. This
functionality is similar to the BPF_PROG_BIND_MAP syscall, but such
maps can be used by the verifier during the program load.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213130934.1087929-5-aspsk@isovalent.com
|
|
Introduce a helper to add btfs to the env->used_maps array. Use it
to simplify the check_pseudo_btf_id() function. This new helper will
also be re-used in a consequent patch.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213130934.1087929-4-aspsk@isovalent.com
|
|
Move some inlined map/prog compatibility checks from the
resolve_pseudo_ldimm64() function to the dedicated
check_map_prog_compatibility() function. Call the latter function
from the add_used_map_from_fd() function directly.
This simplifies code and optimizes logic a bit, as before these
changes the check_map_prog_compatibility() function was executed on
every map usage, which doesn't make sense, as it doesn't include any
per-instruction checks, only map type vs. prog type.
(This patch also simplifies a consequent patch which will call the
add_used_map_from_fd() function from another code path.)
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213130934.1087929-3-aspsk@isovalent.com
|
|
Add a new helper to get a pointer to a struct btf from a file
descriptor. This helper doesn't increase a refcnt. Add a comment
explaining this and pointing to a corresponding function which
does take a reference.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241213130934.1087929-2-aspsk@isovalent.com
|
|
Update the `dsq_hash_params` initialization to use `sizeof_field`
for the `key_len` field instead of a hardcoded value.
This improves code readability and ensures the key length dynamically
matches the size of the `id` field in the `scx_dispatch_q` structure.
Signed-off-by: Liang Jie <liangjie@lixiang.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
dlserver time is accounted when:
- dlserver is active and the dlserver proxies the cfs task.
- dlserver is active but deferred and cfs task runs after being picked
through the normal fair class pick.
dl_server_update is called in two places to make sure that both the
above times are accounted for. But it doesn't check if dlserver is
active or not. Now that we have this dl_server_active flag, we can
consolidate dl_server_update into one place and all we need to check is
whether dlserver is active or not. When dlserver is active there is only
two possible conditions:
- dlserver is deferred.
- cfs task is running on behalf of dlserver.
Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lore.kernel.org/r/20241213032244.877029-2-vineeth@bitbyteword.org
|
|
dlserver can get dequeued during a dlserver pick_task due to the delayed
deueue feature and this can lead to issues with dlserver logic as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused and can lead to double enqueue of
dlserver.
Double enqueue of dlserver could happend due to couple of reasons:
Case 1
------
Delayed dequeue feature[1] can cause dlserver being stopped during a
pick initiated by dlserver:
__pick_next_task
pick_task_dl -> server_pick_task
pick_task_fair
pick_next_entity (if (sched_delayed))
dequeue_entities
dl_server_stop
server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.
Case 2
------
A race condition between a task dequeue on one cpu and same task's enqueue
on this cpu by a remote cpu while the lock is released causing dlserver
double enqueue.
One cpu would be in the schedule() and releasing RQ-lock:
current->state = TASK_INTERRUPTIBLE();
schedule();
deactivate_task()
dl_stop_server();
pick_next_task()
pick_next_task_fair()
sched_balance_newidle()
rq_unlock(this_rq)
at which point another CPU can take our RQ-lock and do:
try_to_wake_up()
ttwu_queue()
rq_lock()
...
activate_task()
dl_server_start() --> first enqueue
wakeup_preempt() := check_preempt_wakeup_fair()
update_curr()
update_curr_task()
if (current->dl_server)
dl_server_update()
enqueue_dl_entity() --> second enqueue
This bug was not apparent as the enqueue in dl_server_start doesn't
usually happen because of the defer logic. But as a side effect of the
first case(dequeue during dlserver pick), dl_throttled and dl_yield will
be set and this causes the time accounting of dlserver to messup and
then leading to a enqueue in dl_server_start.
Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start and reset in dlserver_stop.
Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org
|
|
Add static_call_update_early() for updating static-call targets in
very early boot.
This will be needed for support of Xen guest type specific hypercall
functions.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
|
|
Initially, xdp_frame::mem.id was used to search for the corresponding
&page_pool to return the page correctly.
However, after that struct page was extended to have a direct pointer
to its PP (netmem has it as well), further keeping of this field makes
no sense. xdp_return_frame_bulk() still used it to do a lookup, and
this leftover is now removed.
Remove xdp_frame::mem and replace it with ::mem_type, as only memory
type still matters and we need to know it to be able to free the frame
correctly.
As a cute side effect, we can now make every scalar field in &xdp_frame
of 4 byte width, speeding up accesses to them.
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241211172649.761483-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.13-rc3).
No conflicts or adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Robert Morris reported the following program type which passes the
verifier in [0]:
SEC("struct_ops/bpf_cubic_init")
void BPF_PROG(bpf_cubic_init, struct sock *sk)
{
asm volatile("r2 = *(u16*)(r1 + 0)"); // verifier should demand u64
asm volatile("*(u32 *)(r2 +1504) = 0"); // 1280 in some configs
}
The second line may or may not work, but the first instruction shouldn't
pass, as it's a narrow load into the context structure of the struct ops
callback. The code falls back to btf_ctx_access to ensure correctness
and obtaining the types of pointers. Ensure that the size of the access
is correctly checked to be 8 bytes, otherwise the verifier thinks the
narrow load obtained a trusted BTF pointer and will permit loads/stores
as it sees fit.
Perform the check on size after we've verified that the load is for a
pointer field, as for scalar values narrow loads are fine. Access to
structs passed as arguments to a BPF program are also treated as
scalars, therefore no adjustment is needed in their case.
Existing verifier selftests are broken by this change, but because they
were incorrect. Verifier tests for d_path were performing narrow load
into context to obtain path pointer, had this program actually run it
would cause a crash. The same holds for verifier_btf_ctx_access tests.
[0]: https://lore.kernel.org/bpf/51338.1732985814@localhost
Fixes: 9e15db66136a ("bpf: Implement accurate raw_tp context access via BTF")
Reported-by: Robert Morris <rtm@mit.edu>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241212092050.3204165-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
bpf_prog_aux->func field might be NULL if program does not have
subprograms except for main sub-program. The fixed commit does
bpf_prog_aux->func access unconditionally, which might lead to null
pointer dereference.
The bug could be triggered by replacing the following BPF program:
SEC("tc")
int main_changes(struct __sk_buff *sk)
{
bpf_skb_pull_data(sk, 0);
return 0;
}
With the following BPF program:
SEC("freplace")
long changes_pkt_data(struct __sk_buff *sk)
{
return bpf_skb_pull_data(sk, 0);
}
bpf_prog_aux instance itself represents the main sub-program,
use this property to fix the bug.
Fixes: 81f6d0530ba0 ("bpf: check changes_pkt_data property for extension programs")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202412111822.qGw6tOyB-lkp@intel.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241212070711.427443-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Shifting 1 << 31 on a 32-bit int causes signed integer overflow, which
leads to undefined behavior. To prevent this, cast 1 to u32 before
performing the shift, ensuring well-defined behavior.
This change explicitly avoids any potential overflow by ensuring that
the shift occurs on an unsigned 32-bit integer.
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240928113608.1438087-1-visitorckw@gmail.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
The multi_cpu_stop() contains a loop that can initially be executed with
interrupts enabled (in the MULTI_STOP_NONE and MULTI_STOP_PREPARE states).
Interrupts are guaranteed to be once the MULTI_STOP_DISABLE_IRQ state
is reached. Unfortunately, the rcu_momentary_eqs() function that is
currently invoked on each pass through this loop requires that interrupts
be disabled.
This commit therefore moves this call to rcu_momentary_eqs() to the body
of the "else if (curstate > MULTI_STOP_PREPARE)" portion of the loop, thus
guaranteeing that interrupts will be disabled on each call, as required.
Kudos to 朱恺乾 (Kaiqian) for noting that this had not made it to mainline.
[ paulmck: Update from rcu_momentary_dyntick_idle() to rcu_momentary_eqs(). ]
Link: https://lore.kernel.org/all/1712649736-27058-1-git-send-email-quic_mojha@quicinc.com/
Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
Currently the CSD lock tunables can only be set at boot time in the
kernel commandline, but the way these variables are used means there
is really no reason not to tune them at runtime through /sys.
Make the CSD lock debug tunables tunable through /sys.
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
While adding outer irqsave/restore locking, 0e7ffff1b811 ("scx: Fix raciness
in scx_ops_bypass()") forgot to convert an inner rq_unlock_irqrestore() to
rq_unlock() which could re-enable IRQ prematurely leading to the following
warning:
raw_local_irq_restore() called with IRQs enabled
WARNING: CPU: 1 PID: 96 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x30/0x40
...
Sched_ext: create_dsq (enabling)
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : warn_bogus_irq_restore+0x30/0x40
lr : warn_bogus_irq_restore+0x30/0x40
...
Call trace:
warn_bogus_irq_restore+0x30/0x40 (P)
warn_bogus_irq_restore+0x30/0x40 (L)
scx_ops_bypass+0x224/0x3b8
scx_ops_enable.isra.0+0x2c8/0xaa8
bpf_scx_reg+0x18/0x30
...
irq event stamp: 33739
hardirqs last enabled at (33739): [<ffff8000800b699c>] scx_ops_bypass+0x174/0x3b8
hardirqs last disabled at (33738): [<ffff800080d48ad4>] _raw_spin_lock_irqsave+0xb4/0xd8
Drop the stray _irqrestore().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ihor Solodrai <ihor.solodrai@pm.me>
Link: http://lkml.kernel.org/r/qC39k3UsonrBYD_SmuxHnZIQLsuuccoCrkiqb_BT7DvH945A1_LZwE4g-5Pu9FcCtqZt4lY1HhIPi0homRuNWxkgo1rgP3bkxa0donw8kV4=@pm.me
Fixes: 0e7ffff1b811 ("scx: Fix raciness in scx_ops_bypass()")
Cc: stable@vger.kernel.org # v6.12
|
|
During machine kexec, machine_kexec_mask_interrupts() is responsible for
disabling or masking all interrupts. While the irq_disable() is only
invoked when the interrupt is not yet disabled, it unconditionally invokes
the irq_mask() callback for every interrupt descriptor, even when the
interrupt is already masked or not even started up yet.
A specific issue was observed in the crash kernel flow after unbinding a
device (prior to kexec) that used a GPIO as an IRQ source. The warning was
triggered by the gpiochip_disable_irq() function, which attempts to clear
the FLAG_IRQ_IS_ENABLED flag when FLAG_USED_AS_IRQ was not set.
This issue surfaced after commit a8173820f441 ("gpio: gpiolib: Allow GPIO
IRQs to lazy disable") introduced lazy disablement for GPIO IRQs. It
replaced disable/enable hooks with mask/unmask hooks. Unlike the disable
hook, the mask hook doesn't handle already-masked IRQs.
When a GPIO-IRQ driver is unbound, the IRQ is released, triggering
__irq_disable() and irq_state_set_masked(). A subsequent call to
machine_kexec_mask_interrupts() re-invokes chip->irq_mask(). This results
in a call chain, including gpiochip_irq_mask() and gpiochip_disable_irq().
Since FLAG_USED_AS_IRQ was cleared earlier, the warning is triggered.
Replace the direct invocation of the irq_mask() and irq_disable() callbacks
invoking to irq_shutdown(), which handles the cases correct and avoid it
all together when the interrupt has never been started up.
Signed-off-by: Eliav Farber <farbere@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241204142003.32859-3-farbere@amazon.com
|
|
Consolidate the machine_kexec_mask_interrupts implementation into a common
function located in a new file: kernel/irq/kexec.c. This removes duplicate
implementations from architecture-specific files in arch/arm, arch/arm64,
arch/powerpc, and arch/riscv, reducing code duplication and improving
maintainability.
The new implementation retains architecture-specific behavior for
CONFIG_GENERIC_IRQ_KEXEC_CLEAR_VM_FORWARD, which was previously implemented
for ARM64. When enabled (currently for ARM64), it clears the active state
of interrupts forwarded to virtual machines (VMs) before handling other
interrupt masking operations.
Signed-off-by: Eliav Farber <farbere@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241204142003.32859-2-farbere@amazon.com
|
|
Commit 2a010c412853 ("fs: don't block i_writecount during exec") removed
the legacy behavior of getting ETXTBSY on attempt to open and executable
file for write while it is being executed.
This commit was reverted because an application that depends on this
legacy behavior was broken by the change.
We need to allow HSM writing into executable files while executed to
fill their content on-the-fly.
To that end, disable the ETXTBSY legacy behavior for files that are
watched by pre-content events.
This change is not expected to cause regressions with existing systems
which do not have any pre-content event listeners.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241128142532.465176-1-amir73il@gmail.com
|
|
Isolated CPUs are not allowed to be used in a non-isolated partition.
The only exception is the top cpuset which is allowed to contain boot
time isolated CPUs.
Commit ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation
problem") introduces a simplified scheme of including only partition
roots in sched domain generation. However, it does not properly account
for this exception case. This can result in leakage of isolated CPUs
into a sched domain.
Fix it by making sure that isolated CPUs are excluded from the top
cpuset before generating sched domains.
Also update the way the boot time isolated CPUs are handled in
test_cpuset_prs.sh to make sure that those isolated CPUs are really
isolated instead of just skipping them in the tests.
Fixes: ccac8e8de99c ("cgroup/cpuset: Fix remote root partition creation problem")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Task's cpuset pointer was removed by
commit 8793d854edbc ("Task Control Groups: make cpusets a client of cgroups")
Paragraph "The task_lock() exception ...." was removed by
commit 2df167a300d7 ("cgroups: update comments in cpuset.c")
Remove stale text:
We also require taking task_lock() when dereferencing a
task's cpuset pointer. See "The task_lock() exception", at the end of this
comment.
Accessing a task's cpuset should be done in accordance with the
guidelines for accessing subsystem state in kernel/cgroup.c
and reformat.
Co-developed-by: Michal Koutný <mkoutny@suse.com>
Co-developed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull eprobes fix from Masami Hiramatsu:
- release eprobe when failing to add dyn_event.
This unregisters event call and release eprobe when it fails to add a
dynamic event. Found in cleaning up.
* tag 'probes-fixes-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/eprobe: Fix to release eprobe when failed to add dyn_event
|
|
Currently, the pointer stored in call->prog_array is loaded in
__uprobe_perf_func(), with no RCU annotation and no immediately visible
RCU protection, so it looks as if the loaded pointer can immediately be
dangling.
Later, bpf_prog_run_array_uprobe() starts a RCU-trace read-side critical
section, but this is too late. It then uses rcu_dereference_check(), but
this use of rcu_dereference_check() does not actually dereference anything.
Fix it by aligning the semantics to bpf_prog_run_array(): Let the caller
provide rcu_read_lock_trace() protection and then load call->prog_array
with rcu_dereference_check().
This issue seems to be theoretical: I don't know of any way to reach this
code without having handle_swbp() further up the stack, which is already
holding a rcu_read_lock_trace() lock, so where we take
rcu_read_lock_trace() in __uprobe_perf_func()/bpf_prog_run_array_uprobe()
doesn't actually have any effect.
Fixes: 8c7dcb84e3b7 ("bpf: implement sleepable uprobes by chaining gps")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241210-bpf-fix-uprobe-uaf-v4-1-5fc8959b2b74@google.com
|
|
The bpf_remove_insns() function returns WARN_ON_ONCE(error), where
error is a result of bpf_adj_branches(), and thus should be always 0
However, if for any reason it is not 0, then it will be converted to
boolean by WARN_ON_ONCE and returned to user space as 1, not an actual
error value. Fix this by returning the original err after the WARN check.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241210114245.836164-1-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|