Age | Commit message (Collapse) | Author |
|
Eduard Zingerman says:
====================
verify callbacks as if they are called unknown number of times
This series updates verifier logic for callback functions handling.
Current master simulates callback body execution exactly once,
which leads to verifier not detecting unsafe programs like below:
static int unsafe_on_zero_iter_cb(__u32 idx, struct num_context *ctx)
{
ctx->i = 0;
return 0;
}
SEC("?raw_tp")
int unsafe_on_zero_iter(void *unused)
{
struct num_context loop_ctx = { .i = 32 };
__u8 choice_arr[2] = { 0, 1 };
bpf_loop(100, unsafe_on_zero_iter_cb, &loop_ctx, 0);
return choice_arr[loop_ctx.i];
}
This was reported previously in [0].
The basic idea of the fix is to schedule callback entry state for
verification in env->head until some identical, previously visited
state in current DFS state traversal is found. Same logic as with open
coded iterators, and builds on top recent fixes [1] for those.
The series is structured as follows:
- patches #1,2,3 update strobemeta, xdp_synproxy selftests and
bpf_loop_bench benchmark to allow convergence of the bpf_loop
callback states;
- patches #4,5 just shuffle the code a bit;
- patch #6 is the main part of the series;
- patch #7 adds test cases for #6;
- patch #8 extend patch #6 with same speculative scalar widening
logic, as used for open coded iterators;
- patch #9 adds test cases for #8;
- patch #10 extends patch #6 to track maximal number of callback
executions specifically for bpf_loop();
- patch #11 adds test cases for #10.
Veristat results comparing this series to master+patches #1,2,3 using selftests
show the following difference:
File Program States (A) States (B) States (DIFF)
------------------------- ------------- ---------- ---------- -------------
bpf_loop_bench.bpf.o benchmark 1 2 +1 (+100.00%)
pyperf600_bpf_loop.bpf.o on_event 322 407 +85 (+26.40%)
strobemeta_bpf_loop.bpf.o on_event 113 151 +38 (+33.63%)
xdp_synproxy_kern.bpf.o syncookie_tc 341 291 -50 (-14.66%)
xdp_synproxy_kern.bpf.o syncookie_xdp 344 301 -43 (-12.50%)
Veristat results comparing this series to master using Tetragon BPF
files [2] also show some differences.
States diff varies from +2% to +15% on 23 programs out of 186,
no new failures.
Changelog:
- V3 [5] -> V4, changes suggested by Andrii:
- validate mark_chain_precision() result in patch #10;
- renaming s/cumulative_callback_depth/callback_unroll_depth/.
- V2 [4] -> V3:
- fixes in expected log messages for test cases:
- callback_result_precise;
- parent_callee_saved_reg_precise_with_callback;
- parent_stack_slot_precise_with_callback;
- renamings (suggested by Alexei):
- s/callback_iter_depth/cumulative_callback_depth/
- s/is_callback_iter_next/calls_callback/
- s/mark_callback_iter_next/mark_calls_callback/
- prepare_func_exit() updated to exit with -EFAULT when
callee->in_callback_fn is true but calls_callback() is not true
for callsite;
- test case 'bpf_loop_iter_limit_nested' rewritten to use return
value check instead of verifier log message checks
(suggested by Alexei).
- V1 [3] -> V2, changes suggested by Andrii:
- small changes for error handling code in __check_func_call();
- callback body processing log is now matched in relevant
verifier_subprog_precision.c tests;
- R1 passed to bpf_loop() is now always marked as precise;
- log level 2 message for bpf_loop() iteration termination instead of
iteration depth messages;
- __no_msg macro removed;
- bpf_loop_iter_limit_nested updated to avoid using __no_msg;
- commit message for patch #3 updated according to Alexei's request.
[0] https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
[1] https://lore.kernel.org/bpf/20231024000917.12153-1-eddyz87@gmail.com/
[2] git@github.com:cilium/tetragon.git
[3] https://lore.kernel.org/bpf/20231116021803.9982-1-eddyz87@gmail.com/T/#t
[4] https://lore.kernel.org/bpf/20231118013355.7943-1-eddyz87@gmail.com/T/#t
[5] https://lore.kernel.org/bpf/20231120225945.11741-1-eddyz87@gmail.com/T/#t
====================
Link: https://lore.kernel.org/r/20231121020701.26440-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Check that even if bpf_loop() callback simulation does not converge to
a specific state, verification could proceed via "brute force"
simulation of maximal number of callback calls.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-12-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In some cases verifier can't infer convergence of the bpf_loop()
iteration. E.g. for the following program:
static int cb(__u32 idx, struct num_context* ctx)
{
ctx->i++;
return 0;
}
SEC("?raw_tp")
int prog(void *_)
{
struct num_context ctx = { .i = 0 };
__u8 choice_arr[2] = { 0, 1 };
bpf_loop(2, cb, &ctx, 0);
return choice_arr[ctx.i];
}
Each 'cb' simulation would eventually return to 'prog' and reach
'return choice_arr[ctx.i]' statement. At which point ctx.i would be
marked precise, thus forcing verifier to track multitude of separate
states with {.i=0}, {.i=1}, ... at bpf_loop() callback entry.
This commit allows "brute force" handling for such cases by limiting
number of callback body simulations using 'umax' value of the first
bpf_loop() parameter.
For this, extend bpf_func_state with 'callback_depth' field.
Increment this field when callback visiting state is pushed to states
traversal stack. For frame #N it's 'callback_depth' field counts how
many times callback with frame depth N+1 had been executed.
Use bpf_func_state specifically to allow independent tracking of
callback depths when multiple nested bpf_loop() calls are present.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-11-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A test case to verify that imprecise scalars widening is applied to
callback entering state, when callback call is simulated repeatedly.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-10-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Callbacks are similar to open coded iterators, so add imprecise
widening logic for callback body processing. This makes callback based
loops behave identically to open coded iterators, e.g. allowing to
verify programs like below:
struct ctx { u32 i; };
int cb(u32 idx, struct ctx* ctx)
{
++ctx->i;
return 0;
}
...
struct ctx ctx = { .i = 0 };
bpf_loop(100, cb, &ctx, 0);
...
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-9-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A set of test cases to check behavior of callback handling logic,
check if verifier catches the following situations:
- program not safe on second callback iteration;
- program not safe on zero callback iterations;
- infinite loop inside a callback.
Verify that callback logic works for bpf_loop, bpf_for_each_map_elem,
bpf_user_ringbuf_drain, bpf_find_vma.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-8-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Prior to this patch callbacks were handled as regular function calls,
execution of callback body was modeled exactly once.
This patch updates callbacks handling logic as follows:
- introduces a function push_callback_call() that schedules callback
body verification in env->head stack;
- updates prepare_func_exit() to reschedule callback body verification
upon BPF_EXIT;
- as calls to bpf_*_iter_next(), calls to callback invoking functions
are marked as checkpoints;
- is_state_visited() is updated to stop callback based iteration when
some identical parent state is found.
Paths with callback function invoked zero times are now verified first,
which leads to necessity to modify some selftests:
- the following negative tests required adding release/unlock/drop
calls to avoid previously masked unrelated error reports:
- cb_refs.c:underflow_prog
- exceptions_fail.c:reject_rbtree_add_throw
- exceptions_fail.c:reject_with_cp_reference
- the following precision tracking selftests needed change in expected
log trace:
- verifier_subprog_precision.c:callback_result_precise
(note: r0 precision is no longer propagated inside callback and
I think this is a correct behavior)
- verifier_subprog_precision.c:parent_callee_saved_reg_precise_with_callback
- verifier_subprog_precision.c:parent_stack_slot_precise_with_callback
Reported-by: Andrew Werner <awerner32@gmail.com>
Closes: https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Move code for simulated stack frame creation to a separate utility
function. This function would be used in the follow-up change for
callbacks handling.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Split check_reg_arg() into two utility functions:
- check_reg_arg() operating on registers from current verifier state;
- __check_reg_arg() operating on a specific set of registers passed as
a parameter;
The __check_reg_arg() function would be used by a follow-up change for
callbacks handling.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This is a preparatory change. A follow-up patch "bpf: verify callbacks
as if they are called unknown number of times" changes logic for
callbacks handling. While previously callbacks were verified as a
single function call, new scheme takes into account that callbacks
could be executed unknown number of times.
This has dire implications for bpf_loop_bench:
SEC("fentry/" SYS_PREFIX "sys_getpgid")
int benchmark(void *ctx)
{
for (int i = 0; i < 1000; i++) {
bpf_loop(nr_loops, empty_callback, NULL, 0);
__sync_add_and_fetch(&hits, nr_loops);
}
return 0;
}
W/o callbacks change verifier sees it as a 1000 calls to
empty_callback(). However, with callbacks change things become
exponential:
- i=0: state exploring empty_callback is scheduled with i=0 (a);
- i=1: state exploring empty_callback is scheduled with i=1;
...
- i=999: state exploring empty_callback is scheduled with i=999;
- state (a) is popped from stack;
- i=1: state exploring empty_callback is scheduled with i=1;
...
Avoid this issue by rewriting outer loop as bpf_loop().
Unfortunately, this adds a function call to a loop at runtime, which
negatively affects performance:
throughput latency
before: 149.919 ± 0.168 M ops/s, 6.670 ns/op
after : 137.040 ± 0.187 M ops/s, 7.297 ns/op
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This change prepares strobemeta for update in callbacks verification
logic. To allow bpf_loop() verification converge when multiple
callback iterations are considered:
- track offset inside strobemeta_payload->payload directly as scalar
value;
- at each iteration make sure that remaining
strobemeta_payload->payload capacity is sufficient for execution of
read_{map,str}_var functions;
- make sure that offset is tracked as unbound scalar between
iterations, otherwise verifier won't be able infer that bpf_loop
callback reaches identical states.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This change prepares syncookie_{tc,xdp} for update in callbakcs
verification logic. To allow bpf_loop() verification converge when
multiple callback itreations are considered:
- track offset inside TCP payload explicitly, not as a part of the
pointer;
- make sure that offset does not exceed MAX_PACKET_OFF enforced by
verifier;
- make sure that offset is tracked as unbound scalar between
iterations, otherwise verifier won't be able infer that bpf_loop
callback reaches identical states.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Daniel Borkmann says:
====================
This fixes bpf_redirect_peer stats accounting for veth and netkit,
and adds tstats in the first place for the latter. Utilise indirect
call wrapper for bpf_redirect_peer, and improve test coverage of the
latter also for netkit devices. Details in the patches, thanks!
The series was targeted at bpf originally, and is done here as well,
so it can trigger BPF CI. Jakub, if you think directly going via net
is better since the majority of the diff touches net anyway, that is
fine, too.
Thanks!
v2 -> v3:
- Add kdoc for pcpu_stat_type (Simon)
- Reject invalid type value in netdev_do_alloc_pcpu_stats (Simon)
- Add Reviewed-by tags from list
v1 -> v2:
- Move stats allocation/freeing into net core (Jakub)
- As prepwork for the above, move vrf's dstats over into the core
- Add a check into stats alloc to enforce tstats upon
implementing ndo_get_peer_dev
- Add Acked-by tags from list
Daniel Borkmann (6):
net, vrf: Move dstats structure to core
net: Move {l,t,d}stats allocation to core and convert veth & vrf
netkit: Add tstats per-CPU traffic counters
bpf, netkit: Add indirect call wrapper for fetching peer dev
selftests/bpf: De-veth-ize the tc_redirect test case
selftests/bpf: Add netkit to tc_redirect selftest
====================
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Extend the existing tc_redirect selftest to also cover netkit devices
for exercising the bpf_redirect_peer() code paths, so that we have both
veth as well as netkit covered, all tests still pass after this change.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-9-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
No functional changes to the test case, but just renaming various functions,
variables, etc, to remove veth part of their name for making it more generic
and reusable later on (e.g. for netkit).
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-8-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
ndo_get_peer_dev is used in tcx BPF fast path, therefore make use of
indirect call wrapper and therefore optimize the bpf_redirect_peer()
internal handling a bit. Add a small skb_get_peer_dev() wrapper which
utilizes the INDIRECT_CALL_1() macro instead of open coding.
Future work could potentially add a peer pointer directly into struct
net_device in future and convert veth and netkit over to use it so
that eventually ndo_get_peer_dev can be removed.
Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20231114004220.6495-7-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium)
is not accounted for in the RX stats of supported devices (that is, veth
and netkit), confusing user space metrics collectors such as cAdvisor [0],
as reported by Youlun.
Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update
RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the
@tstats per-CPU counters (instead of @lstats, or @dstats).
To make this more fool-proof, error out when ndo_get_peer_dev is set but
@tstats are not selected.
[0] Specifically, the "container_network_receive_{byte,packet}s_total"
counters are affected.
Fixes: 9aa1206e8f48 ("bpf: Add redirect_peer helper")
Reported-by: Youlun Zhang <zhangyoulun@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-6-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Currently veth devices use the lstats per-CPU traffic counters, which only
cover TX traffic. veth_get_stats64() actually populates RX stats of a veth
device from its peer's TX counters, based on the assumption that a veth
device can _only_ receive packets from its peer, which is no longer true:
For example, recent CNIs (like Cilium) can use the bpf_redirect_peer() BPF
helper to redirect traffic from NIC's tc ingress to veth's tc ingress (in
a different netns), skipping veth's peer device. Unfortunately, this kind
of traffic isn't currently accounted for in veth's RX stats.
In preparation for the fix, use tstats (instead of lstats) to maintain
both RX and TX counters for each veth device. We'll use RX counters for
bpf_redirect_peer() traffic, and keep using TX counters for the usual
"peer-to-peer" traffic. In veth_get_stats64(), calculate RX stats by
_adding_ RX count to peer's TX count, in order to cover both kinds of
traffic.
veth_stats_rx() might need a name change (perhaps to "veth_stats_xdp()")
for less confusion, but let's leave it to another patch to keep the fix
minimal.
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-5-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Add dev->tstats traffic accounting to netkit. The latter contains per-CPU
RX and TX counters.
The dev's TX counters are bumped upon pass/unspec as well as redirect
verdicts, in other words, on everything except for drops.
The dev's RX counters are bumped upon successful __netif_rx(), as well
as from skb_do_redirect() (not part of this commit here).
Using dev->lstats with having just a single packets/bytes counter and
inferring one another's RX counters from the peer dev's lstats is not
possible given skb_do_redirect() can also bump the device's stats.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-4-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Move {l,t,d}stats allocation to the core and let netdevs pick the stats
type they need. That way the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc) - all happening in the core.
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Just move struct pcpu_dstats out of the vrf into the core, and streamline
the field names slightly, so they better align with the {t,l}stats ones.
No functional change otherwise. A conversion of the u64s to u64_stats_t
could be done at a separate point in future. This move is needed as we are
moving the {t,l,d}stats allocation/freeing to the core.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231114004220.6495-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix section mismatch warning messages for riscv and loongarch
- Remove CONFIG_IA64 left-over from linux/export-internal.h
- Fix the location of the quotes for UIMAGE_NAME
- Fix a memory leak bug in Kconfig
* tag 'kbuild-fixes-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: fix memory leak from range properties
kbuild: Move the single quotes for image name
linux/export: clean up the IA-64 KSYM_FUNC macro
modpost: fix section mismatch message for RELA
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
- Flush the translation service tables to prevent unpredictable
behavior on non-coherent GIC devices
* tag 'irq_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Flush ITS tables correctly in non-coherent GIC designs
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Ignore invalid x2APIC entries in order to not waste per-CPU data
- Fix a back-to-back signals handling scenario when shadow stack is in
use
- A documentation fix
- Add Kirill as TDX maintainer
* tag 'x86_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/acpi: Ignore invalid x2APIC entries
x86/shstk: Delay signal entry SSP write until after user accesses
x86/Documentation: Indent 'note::' directive for protocol version number note
MAINTAINERS: Add Intel TDX entry
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Do the push of pending hrtimers away from a CPU which is being
offlined earlier in the offlining process in order to prevent a
deadlock
* tag 'timers_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hrtimers: Push pending hrtimers away from outgoing CPU earlier
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Fix virtual runtime calculation when recomputing a sched entity's
weights
- Fix wrongly rejected unprivileged poll requests to the cgroup psi
pressure files
- Make sure the load balancing is done by only one CPU
* tag 'sched_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix the decision for load balance
sched: psi: fix unprivileged polling against cgroups
sched/eevdf: Fix vruntime adjustment on reweight
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:
- Fix a hardcoded futex flags case which lead to one robust futex test
failure
* tag 'locking_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Fix hardcoded flags
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Make sure the context refcount is transferred too when migrating perf
events
* tag 'perf_urgent_for_v6.7_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix cpuctx refcounting
|
|
W=1 builds now warn if module is built without a MODULE_DESCRIPTION().
Add descriptions to all the sock diag modules in one fell swoop.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
During 'ifconfig <netdev> down' one RSS memory was not getting freed.
This patch fixes the same.
Fixes: 81a4362016e7 ("octeontx2-pf: Add RSS multi group support")
Signed-off-by: Suman Ghosh <sumang@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
wg_xmit() can be called concurrently, KCSAN reported [1]
some device stats updates can be lost.
Use DEV_STATS_INC() for this unlikely case.
[1]
BUG: KCSAN: data-race in wg_xmit / wg_xmit
read-write to 0xffff888104239160 of 8 bytes by task 1375 on cpu 0:
wg_xmit+0x60f/0x680 drivers/net/wireguard/device.c:231
__netdev_start_xmit include/linux/netdevice.h:4918 [inline]
netdev_start_xmit include/linux/netdevice.h:4932 [inline]
xmit_one net/core/dev.c:3543 [inline]
dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3559
...
read-write to 0xffff888104239160 of 8 bytes by task 1378 on cpu 1:
wg_xmit+0x60f/0x680 drivers/net/wireguard/device.c:231
__netdev_start_xmit include/linux/netdevice.h:4918 [inline]
netdev_start_xmit include/linux/netdevice.h:4932 [inline]
xmit_one net/core/dev.c:3543 [inline]
dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3559
...
v2: also change wg_packet_consume_data_done() (Hangbin Liu)
and wg_packet_purge_staged_packets()
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When the device uses a custom subsystem vendor ID, the function
wx_sw_init() returns before the memory of 'wx->mac_table' is allocated.
The null pointer will causes the kernel panic.
Fixes: 79625f45ca73 ("net: wangxun: Move MAC address handling to libwx")
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Seven small fixes, six in drivers and one in sd.
The sd fix is so large because it changes a struct pointer to a struct
but otherwise is fairly simple"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: qcom-ufs: dt-bindings: Document the SM8650 UFS Controller
scsi: sd: Fix sshdr use in sd_suspend_common()
scsi: scsi_debug: Delete some bogus error checking
scsi: scsi_debug: Fix some bugs in sdebug_error_write()
scsi: ufs: core: Fix racing issue between ufshcd_mcq_abort() and ISR
scsi: ufs: core: Expand MCQ queue slot to DeviceQueueDepth + 1
scsi: qla2xxx: Fix system crash due to bad pointer access
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"On parisc we still sometimes need writeable stacks, e.g. if programs
aren't compiled with gcc-14. To avoid issues with the upcoming
systemd-254 we therefore have to disable prctl(PR_SET_MDWE) for now
(for parisc only).
The other two patches are minor: a bugfix for the soft power-off on
qemu with 64-bit kernel and prefer strscpy() over strlcpy():
- Fix power soft-off on qemu
- Disable prctl(PR_SET_MDWE) since parisc sometimes still needs
writeable stacks
- Use strscpy instead of strlcpy in show_cpuinfo()"
* tag 'parisc-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
prctl: Disable prctl(PR_SET_MDWE) on parisc
parisc/power: Fix power soft-off when running on qemu
parisc: Replace strlcpy() with strscpy()
|
|
Pull xfs fixes from Chandan Babu:
- Fix deadlock arising due to intent items in AIL not being cleared
when log recovery fails
- Fix stale data exposure bug when remapping COW fork extents to data
fork
- Fix deadlock when data device flush fails
- Fix AGFL minimum size calculation
- Select DEBUG_FS instead of XFS_DEBUG when XFS_ONLINE_SCRUB_STATS is
selected
- Fix corruption of log inode's extent count field when NREXT64 feature
is enabled
* tag 'xfs-6.7-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: recovery should not clear di_flushiter unconditionally
xfs: inode recovery does not validate the recovered inode
xfs: fix again select in kconfig XFS_ONLINE_SCRUB_STATS
xfs: fix internal error from AGFL exhaustion
xfs: up(ic_sema) if flushing data device fails
xfs: only remap the written blocks in xfs_reflink_end_cow_extent
XFS: Update MAINTAINERS to catch all XFS documentation
xfs: abort intent items when recovery intents fail
xfs: factor out xfs_defer_pending_abort
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix several long-standing bugs in the duplicate reply cache
- Fix a memory leak
* tag 'nfsd-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Fix checksum mismatches in the duplicate reply cache
NFSD: Fix "start of NFS reply" pointer passed to nfsd_cache_update()
NFSD: Update nfsd_cache_append() to use xdr_stream
nfsd: fix file memleak on client_opens_release
|
|
Pull smb client fixes from Steve French:
- multichannel fixes (including a lock ordering fix and an important
refcounting fix)
- spnego fix
* tag '6.7-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix lock ordering while disabling multichannel
cifs: fix leak of iface for primary channel
cifs: fix check of rc in function generate_smb3signingkey
cifs: spnego: add ';' in HOST_KEY_LEN
|
|
systemd-254 tries to use prctl(PR_SET_MDWE) for it's MemoryDenyWriteExecute
functionality, but fails on parisc which still needs executable stacks in
certain combinations of gcc/glibc/kernel.
Disable prctl(PR_SET_MDWE) by returning -EINVAL for now on parisc, until
userspace has catched up.
Signed-off-by: Helge Deller <deller@gmx.de>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Sam James <sam@gentoo.org>
Closes: https://github.com/systemd/systemd/issues/29775
Tested-by: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/all/875y2jro9a.fsf@gentoo.org/
Cc: <stable@vger.kernel.org> # v6.3+
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Various fixes for the DM delay target to address regressions
introduced during the 6.7 merge window
- Fixes to both DM bufio and the verity target for no-sleep mode,
to address sleeping while atomic issues
- Update DM crypt target in response to the treewide change that
made MAX_ORDER inclusive
* tag 'for-6.7/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-crypt: start allocating with MAX_ORDER
dm-verity: don't use blocking calls from tasklets
dm-bufio: fix no-sleep mode
dm-delay: avoid duplicate logic
dm-delay: fix bugs introduced by kthread mode
dm-delay: fix a race between delay_presuspend and delay_bio
|
|
Firmware returns the physical address of the power switch,
so need to use gsc_writel() instead of direct memory access.
Fixes: d0c219472980 ("parisc/power: Add power soft-off when running on qemu")
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: stable@vger.kernel.org # v6.0+
|
|
strlcpy() reads the entire source buffer first. This read may exceed
the destination size limit. This is both inefficient and can lead
to linear read overflows if a source string is not NUL-terminated[1].
Additionally, it returns the size of the source string, not the
resulting size of the destination string. In an effort to remove strlcpy()
completely[2], replace strlcpy() here with strscpy().
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [1]
Link: https://github.com/KSPP/linux/issues/89 [2]
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Azeem Shaikh <azeemshaikh38@gmail.com>
Cc: linux-parisc@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Helge Deller <deller@gmx.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Revert a not-working conversion to generic recovery for PXA,
use proper IO accessors for designware, and use proper PM level
for ocores to allow accessing interrupt providers late"
* tag 'i2c-for-6.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: ocores: Move system PM hooks to the NOIRQ phase
i2c: designware: Fix corrupted memory seen in the ISR
Revert "i2c: pxa: move to generic GPIO recovery"
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull turbostat updates from Len Brown:
- Turbostat features are now table-driven (Rui Zhang)
- Add support for some new platforms (Sumeet Pawnikar, Rui Zhang)
- Gracefully run in configs when CPUs are limited (Rui Zhang, Srinivas
Pandruvada)
- misc minor fixes
[ This came in during the merge window, but sorting out the signed tag
took a while, so thus the late merge - Linus ]
* tag 'turbostat-2023.11.07' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (86 commits)
tools/power turbostat: version 2023.11.07
tools/power/turbostat: bugfix "--show IPC"
tools/power/turbostat: Add initial support for LunarLake
tools/power/turbostat: Add initial support for ArrowLake
tools/power/turbostat: Add initial support for GrandRidge
tools/power/turbostat: Add initial support for SierraForest
tools/power/turbostat: Add initial support for GraniteRapids
tools/power/turbostat: Add MSR_CORE_C1_RES support for spr_features
tools/power/turbostat: Move process to root cgroup
tools/power/turbostat: Handle cgroup v2 cpu limitation
tools/power/turbostat: Abstrct function for parsing cpu string
tools/power/turbostat: Handle offlined CPUs in cpu_subset
tools/power/turbostat: Obey allowed CPUs for system summary
tools/power/turbostat: Obey allowed CPUs for primary thread/core detection
tools/power/turbostat: Abstract several functions
tools/power/turbostat: Obey allowed CPUs during startup
tools/power/turbostat: Obey allowed CPUs when accessing CPU counters
tools/power/turbostat: Introduce cpu_allowed_set
tools/power/turbostat: Remove PC7/PC9 support on ADL/RPL
tools/power/turbostat: Enable MSR_CORE_C1_RES on recent Intel client platforms
...
|
|
Pull bcachefs fixes from Kent Overstreet:
"Lots of small fixes for minor nits and compiler warnings.
Bigger items:
- The six locks lost wakeup is finally fixed: six_read_trylock() was
checking for the waiting bit before decrementing the number of
readers - validated the fix with a torture test.
- Fix for a memory reclaim issue: when needing to reallocate a key
cache key, we now do our usual GFP_NOWAIT; unlock(); GFP_KERNEL
dance.
- Multiple deleted inodes btree fixes
- Fix an issue in fsck, where i_nlink would be recalculated
incorrectly for hardlinked files if a snapshot had ever been taken.
- Kill journal pre-reservations: This is a bigger patch than I would
normally send at this point, but it deletes code and it fixes some
of our tests that would sporadically die with the journal getting
stuck, and it's a performance improvement, too"
* tag 'bcachefs-2023-11-17' of https://evilpiepirate.org/git/bcachefs: (22 commits)
bcachefs: Fix missing locking for dentry->d_parent access
bcachefs: six locks: Fix lost wakeup
bcachefs: Fix no_data_io mode checksum check
bcachefs: Fix bch2_check_nlinks() for snapshots
bcachefs: Don't decrease BTREE_ITER_MAX when LOCKDEP=y
bcachefs: Disable debug log statements
bcachefs: Fix missing transaction commit
bcachefs: Fix error path in bch2_mount()
bcachefs: Fix potential sleeping during mount
bcachefs: Fix iterator leak in may_delete_deleted_inode()
bcachefs: Kill journal pre-reservations
bcachefs: Check for nonce offset inconsistency in data_update path
bcachefs: Make sure to drop/retake btree locks before reclaim
bcachefs: btree_trans->write_locked
bcachefs: Run btree key cache shrinker less aggressively
bcachefs: Split out btree_key_cache_types.h
bcachefs: Guard against insufficient devices to create stripes
bcachefs: Fix null ptr deref in bch2_backpointer_get_node()
bcachefs: Fix multiple -Warray-bounds warnings
bcachefs: Use DECLARE_FLEX_ARRAY() helper and fix multiple -Warray-bounds warnings
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Thirteen hotfixes. Seven are cc:stable and the remainder pertain to
post-6.6 issues or aren't considered suitable for backporting"
* tag 'mm-hotfixes-stable-2023-11-17-14-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: more ptep_get() conversion
parisc: fix mmap_base calculation when stack grows upwards
mm/damon/core.c: avoid unintentional filtering out of schemes
mm: kmem: drop __GFP_NOFAIL when allocating objcg vectors
mm/damon/sysfs-schemes: handle tried region directory allocation failure
mm/damon/sysfs-schemes: handle tried regions sysfs directory allocation failure
mm/damon/sysfs: check error from damon_sysfs_update_target()
mm: fix for negative counter: nr_file_hugepages
selftests/mm: add hugetlb_fault_after_madv to .gitignore
selftests/mm: restore number of hugepages
selftests: mm: fix some build warnings
selftests: mm: skip whole test instead of failure
mm/damon/sysfs: eliminate potential uninitialized variable warning
|
|
Pull block fix from Jens Axboe:
"Just a single fix from Christoph/Ming, fixing a case where integrity
IO could be called without having an appropriate queue reference"
* tag 'block-6.7-2023-11-17' of git://git.kernel.dk/linux:
blk-mq: make sure active queue usage is held for bio_integrity_prep()
|
|
Pull io_uring fix from Jens Axboe:
"Just a single fixup for a change we made in this release, which caused
a regression in sometimes missing fdinfo output if the SQPOLL thread
had the lock held when fdinfo output was retrieved.
This brings us back on par with what we had before, where just the
main uring_lock will prevent that output. We'd love to get rid of that
too, but that is beyond the scope of this release and will have to
wait for 6.8"
* tag 'io_uring-6.7-2023-11-17' of git://git.kernel.dk/linux:
io_uring/fdinfo: remove need for sqpoll lock for thread/pid retrieval
|
|
Pull drm fixes from Daniel Vetter:
"This is a 'blast from the bast' fixes pull, because it contains a
bunch of AGP fixes for amdgpu. Otherwise nothing out of the ordinary.
Next week is back to Dave unless he's knocked out by some conference
bug.
- amdgpu: fixes all over, including a set of AGP fixes
- nouvea: GSP + other bugfixes
- ivpu build fix
- lenovo legion go panel orientation quirk"
* tag 'drm-fixes-2023-11-17' of git://anongit.freedesktop.org/drm/drm: (30 commits)
drm/amdgpu/gmc9: disable AGP aperture
drm/amdgpu/gmc10: disable AGP aperture
drm/amdgpu/gmc11: disable AGP aperture
drm/amdgpu: add a module parameter to control the AGP aperture
drm/amdgpu/gmc11: fix logic typo in AGP check
drm/amd/display: Fix encoder disable logic
drm/amd/display: Change the DMCUB mailbox memory location from FB to inbox
drm/amdgpu: add and populate the port num into xgmi topology info
drm/amd/display: Negate IPS allow and commit bits
drm/amd/pm: Don't send unload message for reset
drm/amdgpu: fix ras err_data null pointer issue in amdgpu_ras.c
drm/amd/display: Clear dpcd_sink_ext_caps if not set
drm/amd/display: Enable fast plane updates on DCN3.2 and above
drm/amd/display: fix NULL dereference
drm/amd/display: fix a NULL pointer dereference in amdgpu_dm_i2c_xfer()
drm/amd/display: Add null checks for 8K60 lightup
drm/amd/pm: Fill pcie error counters for gpu v1_4
drm/amd/pm: Update metric table for smu v13_0_6
drm/amdgpu: correct chunk_ptr to a pointer to chunk.
drm/amd/display: Fix DSC not Enabled on Direct MST Sink
...
|
|
nfsd_cache_csum() currently assumes that the server's RPC layer has
been advancing rq_arg.head[0].iov_base as it decodes an incoming
request, because that's the way it used to work. On entry, it
expects that buf->head[0].iov_base points to the start of the NFS
header, and excludes the already-decoded RPC header.
These days however, head[0].iov_base now points to the start of the
RPC header during all processing. It no longer points at the NFS
Call header when execution arrives at nfsd_cache_csum().
In a retransmitted RPC the XID and the NFS header are supposed to
be the same as the original message, but the contents of the
retransmitted RPC header can be different. For example, for krb5,
the GSS sequence number will be different between the two. Thus if
the RPC header is always included in the DRC checksum computation,
the checksum of the retransmitted message might not match the
checksum of the original message, even though the NFS part of these
messages is identical.
The result is that, even if a matching XID is found in the DRC,
the checksum mismatch causes the server to execute the
retransmitted RPC transaction again.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|