path: root/kernel
AgeCommit message (Collapse)Author
2021-06-27Revert "signal: Allow tasks to cache one sigqueue struct"Linus Torvalds
This reverts commits 4bad58ebc8bc4f20d89cff95417c9b4674769709 (and 399f8dd9a866e107639eabd3c1979cd526ca3a98, which tried to fix it). I do not believe these are correct, and I'm about to release 5.13, so am reverting them out of an abundance of caution. The locking is odd, and appears broken. On the allocation side (in __sigqueue_alloc()), the locking is somewhat straightforward: it depends on sighand->siglock. Since one caller doesn't hold that lock, it further then tests 'sigqueue_flags' to avoid the case with no locks held. On the freeing side (in sigqueue_cache_or_free()), there is no locking at all, and the logic instead depends on 'current' being a single thread, and not able to race with itself. To make things more exciting, there's also the data race between freeing a signal and allocating one, which is handled by using WRITE_ONCE() and READ_ONCE(), and being mutually exclusive wrt the initial state (ie freeing will only free if the old state was NULL, while allocating will obviously only use the value if it was non-NULL, so only one or the other will actually act on the value). However, while the free->alloc paths do seem mutually exclusive thanks to just the data value dependency, it's not clear what the memory ordering constraints are on it. Could writes from the previous allocation possibly be delayed and seen by the new allocation later, causing logical inconsistencies? So it's all very exciting and unusual. And in particular, it seems that the freeing side is incorrect in depending on "current" being single-threaded. Yes, 'current' is a single thread, but in the presense of asynchronous events even a single thread can have data races. And such asynchronous events can and do happen, with interrupts causing signals to be flushed and thus free'd (for example - sending a SIGCONT/SIGSTOP can happen from interrupt context, and can flush previously queued process control signals). So regardless of all the other questions about the memory ordering and locking for this new cached allocation, the sigqueue_cache_or_free() assumptions seem to be fundamentally incorrect. It may be that people will show me the errors of my ways, and tell me why this is all safe after all. We can reinstate it if so. But my current belief is that the WRITE_ONCE() that sets the cached entry needs to be a smp_store_release(), and the READ_ONCE() that finds a cached entry needs to be a smp_load_acquire() to handle memory ordering correctly. And the sequence in sigqueue_cache_or_free() would need to either use a lock or at least be interrupt-safe some way (perhaps by using something like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the percpu operations it needs to be interrupt-safe). Fixes: 399f8dd9a866 ("signal: Prevent sigqueue caching after task got released") Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct") Cc: Thomas Gleixner <> Cc: Peter Zijlstra <> Cc: Oleg Nesterov <> Cc: Christian Brauner <> Signed-off-by: Linus Torvalds <>
2021-06-24mm, futex: fix shared futex pgoff on shmem huge pageHugh Dickins
If more than one futex is placed on a shmem huge page, it can happen that waking the second wakes the first instead, and leaves the second waiting: the key's shared.pgoff is wrong. When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when generating futex_key"), the only shared huge pages came from hugetlbfs, and the code added to deal with its exceptional page->index was put into hugetlb source. Then that was missed when 4.8 added shmem huge pages. page_to_pgoff() is what others use for this nowadays: except that, as currently written, it gives the right answer on hugetlbfs head, but nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific hugetlb_basepage_index() on PageHuge tails as well as on head. Yes, it's unconventional to declare hugetlb_basepage_index() there in pagemap.h, rather than in hugetlb.h; but I do not expect anything but page_to_pgoff() ever to need it. [ give hugetlb_basepage_index() prototype the correct scope] Link: Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Reported-by: Neel Natu <> Signed-off-by: Hugh Dickins <> Reviewed-by: Matthew Wilcox (Oracle) <> Acked-by: Thomas Gleixner <> Cc: "Kirill A. Shutemov" <> Cc: Zhang Yi <> Cc: Mel Gorman <> Cc: Mike Kravetz <> Cc: Ingo Molnar <> Cc: Peter Zijlstra <> Cc: Darren Hart <> Cc: Davidlohr Bueso <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-06-24kthread: prevent deadlock when kthread_mod_delayed_work() races with ↵Petr Mladek
kthread_cancel_delayed_work_sync() The system might hang with the following backtrace: schedule+0x80/0x100 schedule_timeout+0x48/0x138 wait_for_common+0xa4/0x134 wait_for_completion+0x1c/0x2c kthread_flush_work+0x114/0x1cc kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144 kthread_cancel_delayed_work_sync+0x18/0x2c xxxx_pm_notify+0xb0/0xd8 blocking_notifier_call_chain_robust+0x80/0x194 pm_notifier_call_chain_robust+0x28/0x4c suspend_prepare+0x40/0x260 enter_state+0x80/0x3f4 pm_suspend+0x60/0xdc state_store+0x108/0x144 kobj_attr_store+0x38/0x88 sysfs_kf_write+0x64/0xc0 kernfs_fop_write_iter+0x108/0x1d0 vfs_write+0x2f4/0x368 ksys_write+0x7c/0xec It is caused by the following race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync(): CPU0 CPU1 Context: Thread A Context: Thread B kthread_mod_delayed_work() spin_lock() __kthread_cancel_work() spin_unlock() del_timer_sync() kthread_cancel_delayed_work_sync() spin_lock() __kthread_cancel_work() spin_unlock() del_timer_sync() spin_lock() work->canceling++ spin_unlock spin_lock() queue_delayed_work() // dwork is put into the worker->delayed_work_list spin_unlock() kthread_flush_work() // flush_work is put at the tail of the dwork wait_for_completion() Context: IRQ kthread_delayed_work_timer_fn() spin_lock() list_del_init(&work->node); spin_unlock() BANG: flush_work is not longer linked and will never get proceed. The problem is that kthread_mod_delayed_work() checks work->canceling flag before canceling the timer. A simple solution is to (re)check work->canceling after __kthread_cancel_work(). But then it is not clear what should be returned when __kthread_cancel_work() removed the work from the queue (list) and it can't queue it again with the new @delay. The return value might be used for reference counting. The caller has to know whether a new work has been queued or an existing one was replaced. The proper solution is that kthread_mod_delayed_work() will remove the work from the queue (list) _only_ when work->canceling is not set. The flag must be checked after the timer is stopped and the remaining operations can be done under worker->lock. Note that kthread_mod_delayed_work() could remove the timer and then bail out. It is fine. The other canceling caller needs to cancel the timer as well. The important thing is that the queue (list) manipulation is done atomically under worker->lock. Link: Fixes: 9a6b06c8d9a220860468a ("kthread: allow to modify delayed kthread work") Signed-off-by: Petr Mladek <> Reported-by: Martin Liu <> Cc: <> Cc: Minchan Kim <> Cc: Nathan Chancellor <> Cc: Nick Desaulniers <> Cc: Oleg Nesterov <> Cc: Tejun Heo <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-06-24kthread_worker: split code for canceling the delayed work timerPetr Mladek
Patch series "kthread_worker: Fix race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync()". This patchset fixes the race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync() including proper return value handling. This patch (of 2): Simple code refactoring as a preparation step for fixing a race between kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync(). It does not modify the existing behavior. Link: Signed-off-by: Petr Mladek <> Cc: <> Cc: Martin Liu <> Cc: Minchan Kim <> Cc: Nathan Chancellor <> Cc: Nick Desaulniers <> Cc: Oleg Nesterov <> Cc: Tejun Heo <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-06-24Merge tag 'core-urgent-2021-06-24' of ↵Linus Torvalds
git:// Pull sigqueue cache fix from Ingo Molnar: "Fix a memory leak in the recently introduced sigqueue cache" * tag 'core-urgent-2021-06-24' of git:// signal: Prevent sigqueue caching after task got released
2021-06-24Merge tag 'sched-urgent-2021-06-24' of ↵Linus Torvalds
git:// Pull scheduler fix from Ingo Molnar: "A last minute cgroup bandwidth scheduling fix for a recently introduced logic fail which triggered a kernel warning by LTP's cfs_bandwidth01 test" * tag 'sched-urgent-2021-06-24' of git:// sched/fair: Ensure that the CFS parent is added after unthrottling
2021-06-24Merge tag 'objtool-urgent-2021-06-24' of ↵Linus Torvalds
git:// Pull objtool fixes from Ingo Molnar: "Address a number of objtool warnings that got reported. No change in behavior intended, but code generation might be impacted by commit 1f008d46f124 ("x86: Always inline task_size_max()")" * tag 'objtool-urgent-2021-06-24' of git:// locking/lockdep: Improve noinstr vs errors x86: Always inline task_size_max() x86/xen: Fix noinstr fail in exc_xen_unknown_trap() x86/xen: Fix noinstr fail in xen_pv_evtchn_do_upcall() x86/entry: Fix noinstr fail in __do_fast_syscall_32() objtool/x86: Ignore __x86_indirect_alt_* symbols
2021-06-23Merge branch 'stable/for-linus-5.14' of ↵Linus Torvalds
git:// Pull swiotlb fix from Konrad Rzeszutek Wilk: "A fix for the regression for the DMA operations where the offset was ignored and corruptions would appear. Going forward there will be a cleanups to make the offset and alignment logic more clearer and better test-cases to help with this" * 'stable/for-linus-5.14' of git:// swiotlb: manipulate orig_addr when tlb_addr has offset
2021-06-22module: limit enabling module.sig_enforceMimi Zohar
Irrespective as to whether CONFIG_MODULE_SIG is configured, specifying "module.sig_enforce=1" on the boot command line sets "sig_enforce". Only allow "sig_enforce" to be set when CONFIG_MODULE_SIG is configured. This patch makes the presence of /sys/module/module/parameters/sig_enforce dependent on CONFIG_MODULE_SIG=y. Fixes: fda784e50aac ("module: export module signature enforcement status") Reported-by: Nayna Jain <> Tested-by: Mimi Zohar <> Tested-by: Jessica Yu <> Signed-off-by: Mimi Zohar <> Signed-off-by: Jessica Yu <> Signed-off-by: Linus Torvalds <>
2021-06-22signal: Prevent sigqueue caching after task got releasedThomas Gleixner
syzbot reported a memory leak related to sigqueue caching. The assumption that a task cannot cache a sigqueue after the signal handler has been dropped and exit_task_sigqueue_cache() has been invoked turns out to be wrong. Such a task can still invoke release_task(other_task), which cleans up the signals of 'other_task' and ends up in sigqueue_cache_or_free(), which in turn will cache the signal because task->sigqueue_cache is NULL. That's obviously bogus because nothing will free the cached signal of that task anymore, so the cached item is leaked. This happens when e.g. the last non-leader thread exits and reaps the zombie leader. Prevent this by setting tsk::sigqueue_cache to an error pointer value in exit_task_sigqueue_cache() which forces any subsequent invocation of sigqueue_cache_or_free() from that task to hand the sigqueue back to the kmemcache. Add comments to all relevant places. Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct") Reported-by: Signed-off-by: Thomas Gleixner <> Reviewed-by: Oleg Nesterov <> Acked-by: Christian Brauner <> Link:
2021-06-22sched/fair: Ensure that the CFS parent is added after unthrottlingRik van Riel
Ensure that a CFS parent will be in the list whenever one of its children is also in the list. A warning on rq->tmp_alone_branch != &rq->leaf_cfs_rq_list has been reported while running LTP test cfs_bandwidth01. Odin Ugedal found the root cause: $ tree /sys/fs/cgroup/ltp/ -d --charset=ascii /sys/fs/cgroup/ltp/ |-- drain `-- test-6851 `-- level2 |-- level3a | |-- worker1 | `-- worker2 `-- level3b `-- worker3 Timeline (ish): - worker3 gets throttled - level3b is decayed, since it has no more load - level2 get throttled - worker3 get unthrottled - level2 get unthrottled - worker3 is added to list - level3b is not added to list, since nr_running==0 and is decayed [ Vincent Guittot: Rebased and updated to fix for the reported warning. ] Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") Reported-by: Sachin Sant <> Suggested-by: Vincent Guittot <> Signed-off-by: Rik van Riel <> Signed-off-by: Vincent Guittot <> Signed-off-by: Ingo Molnar <> Tested-by: Sachin Sant <> Acked-by: Odin Ugedal <> Link:
2021-06-22locking/lockdep: Improve noinstr vs errorsPeter Zijlstra
Better handle the failure paths. vmlinux.o: warning: objtool: debug_locks_off()+0x23: call to console_verbose() leaves .noinstr.text section vmlinux.o: warning: objtool: debug_locks_off()+0x19: call to __kasan_check_write() leaves .noinstr.text section debug_locks_off+0x19/0x40: instrument_atomic_write at include/linux/instrumented.h:86 (inlined by) __debug_locks_off at include/linux/debug_locks.h:17 (inlined by) debug_locks_off at lib/debug_locks.c:41 Fixes: 6eebad1ad303 ("lockdep: __always_inline more for noinstr") Signed-off-by: Peter Zijlstra (Intel) <> Signed-off-by: Ingo Molnar <> Link:
2021-06-21swiotlb: manipulate orig_addr when tlb_addr has offsetBumyong Lee
in case of driver wants to sync part of ranges with offset, swiotlb_tbl_sync_single() copies from orig_addr base to tlb_addr with offset and ends up with data mismatch. It was removed from "swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single", but said logic has to be added back in. From Linus's email: "That commit which the removed the offset calculation entirely, because the old (unsigned long)tlb_addr & (IO_TLB_SIZE - 1) was wrong, but instead of removing it, I think it should have just fixed it to be (tlb_addr - mem->start) & (IO_TLB_SIZE - 1); instead. That way the slot offset always matches the slot index calculation." (Unfortunatly that broke NVMe). The use-case that drivers are hitting is as follow: 1. Get dma_addr_t from dma_map_single() dma_addr_t tlb_addr = dma_map_single(dev, vaddr, vsize, DMA_TO_DEVICE); |<---------------vsize------------->| +-----------------------------------+ | | original buffer +-----------------------------------+ vaddr swiotlb_align_offset |<----->|<---------------vsize------------->| +-------+-----------------------------------+ | | | swiotlb buffer +-------+-----------------------------------+ tlb_addr 2. Do something 3. Sync dma_addr_t through dma_sync_single_for_device(..) dma_sync_single_for_device(dev, tlb_addr + offset, size, DMA_TO_DEVICE); Error case. Copy data to original buffer but it is from base addr (instead of base addr + offset) in original buffer: swiotlb_align_offset |<----->|<- offset ->|<- size ->| +-------+-----------------------------------+ | | |##########| | swiotlb buffer +-------+-----------------------------------+ tlb_addr |<- size ->| +-----------------------------------+ |##########| | original buffer +-----------------------------------+ vaddr The fix is to copy the data to the original buffer and take into account the offset, like so: swiotlb_align_offset |<----->|<- offset ->|<- size ->| +-------+-----------------------------------+ | | |##########| | swiotlb buffer +-------+-----------------------------------+ tlb_addr |<- offset ->|<- size ->| +-----------------------------------+ | |##########| | original buffer +-----------------------------------+ vaddr [One fix which was Linus's that made more sense to as it created a symmetry would break NVMe. The reason for that is the: unsigned int offset = (tlb_addr - mem->start) & (IO_TLB_SIZE - 1); would come up with the proper offset, but it would lose the alignment (which this patch contains).] Fixes: 16fc3cef33a0 ("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single") Signed-off-by: Bumyong Lee <> Signed-off-by: Chanho Park <> Reviewed-by: Christoph Hellwig <> Reported-by: Dominique MARTINET <> Reported-by: Horia Geantă <> Tested-by: Horia Geantă <> CC: Signed-off-by: Konrad Rzeszutek Wilk <>
2021-06-20Merge tag 'sched_urgent_for_v5.13_rc6' of ↵Linus Torvalds
git:// Pull scheduler fix from Borislav Petkov: "A single fix to restore fairness between control groups with equal priority" * tag 'sched_urgent_for_v5.13_rc6' of git:// sched/fair: Correctly insert cfs_rq's to list on unthrottle
2021-06-18Merge tag 'net-5.13-rc7' of ↵Linus Torvalds
git:// Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.13-rc7, including fixes from wireless, bpf, bluetooth, netfilter and can. Current release - regressions: - mlxsw: spectrum_qdisc: Pass handle, not band number to find_class() to fix modifying offloaded qdiscs - lantiq: net: fix duplicated skb in rx descriptor ring - rtnetlink: fix regression in bridge VLAN configuration, empty info is not an error, bot-generated "fix" was not needed - libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem creation Current release - new code bugs: - ethtool: fix NULL pointer dereference during module EEPROM dump via the new netlink API - mlx5e: don't update netdev RQs with PTP-RQ, the special purpose queue should not be visible to the stack - mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs - mlx5e: verify dev is present in get devlink port ndo, avoid a panic Previous releases - regressions: - neighbour: allow NUD_NOARP entries to be force GCed - further fixes for fallout from reorg of WiFi locking (staging: rtl8723bs, mac80211, cfg80211) - skbuff: fix incorrect msg_zerocopy copy notifications - mac80211: fix NULL ptr deref for injected rate info - Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs Previous releases - always broken: - bpf: more speculative execution fixes - netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local - udp: fix race between close() and udp_abort() resulting in a panic - fix out of bounds when parsing TCP options before packets are validated (in netfilter: synproxy, tc: sch_cake and mptcp) - mptcp: improve operation under memory pressure, add missing wake-ups - mptcp: fix double-lock/soft lookup in subflow_error_report() - bridge: fix races (null pointer deref and UAF) in vlan tunnel egress - ena: fix DMA mapping function issues in XDP - rds: fix memory leak in rds_recvmsg Misc: - vrf: allow larger MTUs - icmp: don't send out ICMP messages with a source address of - cdc_ncm: switch to eth%d interface naming" * tag 'net-5.13-rc7' of git:// (139 commits) net: ethernet: fix potential use-after-free in ec_bhf_remove selftests/net: Add for testing ICMP dummy address responses icmp: don't send out ICMP messages with a source address of net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY net: ll_temac: Fix TX BD buffer overwrite net: ll_temac: Add memory-barriers for TX BD access net: ll_temac: Make sure to free skb when it is completely used MAINTAINERS: add Guvenc as SMC maintainer bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path bnxt_en: Fix TQM fastpath ring backing store computation bnxt_en: Rediscover PHY capabilities after firmware reset cxgb4: fix wrong shift. mac80211: handle various extensible elements correctly mac80211: reset profile_periodicity/ema_ap cfg80211: avoid double free of PMSR request cfg80211: make certificate generation more robust mac80211: minstrel_ht: fix sample time check net: qed: Fix memcpy() overflow of qed_dcbx_params() net: cdc_eem: fix tx fixup skb leak net: hamradio: fix memory leak in mkiss_close ...
2021-06-18Merge tag 'trace-v5.13-rc6' of ↵Linus Torvalds
git:// Pull tracing fixes from Steven Rostedt: - Have recordmcount check for valid st_shndx otherwise some archs may have invalid references for the mcount location. - Two fixes done for mapping pids to task names. Traces were not showing the names of tasks when they should have. - Fix to trace_clock_global() to prevent it from going backwards * tag 'trace-v5.13-rc6' of git:// tracing: Do no increment trace_clock_global() by one tracing: Do not stop recording comms if the trace file is being read tracing: Do not stop recording cmdlines when tracing is off recordmcount: Correct st_shndx handling
2021-06-18Merge tag 'printk-for-5.13-fixup' of ↵Linus Torvalds
git:// Pull printk fixup from Petr Mladek: "Fix misplaced EXPORT_SYMBOL(vsprintf)" * tag 'printk-for-5.13-fixup' of git:// printk: Move EXPORT_SYMBOL() closer to vprintk definition
2021-06-18Merge tag 'pm-5.13-rc7' of ↵Linus Torvalds
git:// Pull power management fix from Rafael Wysocki: "Remove recently added frequency invariance support from the CPPC cpufreq driver, because it has turned out to be problematic and it cannot be fixed properly on time for 5.13 (Viresh Kumar)" * tag 'pm-5.13-rc7' of git:// Revert "cpufreq: CPPC: Add support for frequency invariance"
2021-06-18tracing: Do no increment trace_clock_global() by oneSteven Rostedt (VMware)
The trace_clock_global() tries to make sure the events between CPUs is somewhat in order. A global value is used and updated by the latest read of a clock. If one CPU is ahead by a little, and is read by another CPU, a lock is taken, and if the timestamp of the other CPU is behind, it will simply use the other CPUs timestamp. The lock is also only taken with a "trylock" due to tracing, and strange recursions can happen. The lock is not taken at all in NMI context. In the case where the lock is not able to be taken, the non synced timestamp is returned. But it will not be less than the saved global timestamp. The problem arises because when the time goes "backwards" the time returned is the saved timestamp plus 1. If the lock is not taken, and the plus one to the timestamp is returned, there's a small race that can cause the time to go backwards! CPU0 CPU1 ---- ---- trace_clock_global() { ts = clock() [ 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] <interrupted by NMI> trace_clock_global() { ts = clock() [ 999 ] if (ts < global_ts) ts = global_ts + 1 [ 1001 ] trylock(clock_lock) [ fail ] return ts [ 1001] } unlock(clock_lock); return ts; [ 1000 ] } trace_clock_global() { ts = clock() [ 1000 ] if (ts < global_ts) [ false 1000 == 1000 ] trylock(clock_lock) [ success ] global_ts = ts; [ 1000 ] unlock(clock_lock) return ts; [ 1000 ] } The above case shows to reads of trace_clock_global() on the same CPU, but the second read returns one less than the first read. That is, time when backwards, and this is not what is allowed by trace_clock_global(). This was triggered by heavy tracing and the ring buffer checker that tests for the clock going backwards: Ring buffer clock went backwards: 20613921464 -> 20613921463 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0 Modules linked in: [..] [CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463 [20613915818] PAGE TIME STAMP [20613915818] delta:0 [20613915819] delta:1 [20613916035] delta:216 [20613916465] delta:430 [20613916575] delta:110 [20613916749] delta:174 [20613917248] delta:499 [20613917333] delta:85 [20613917775] delta:442 [20613917921] delta:146 [20613918321] delta:400 [20613918568] delta:247 [20613918768] delta:200 [20613919306] delta:538 [20613919353] delta:47 [20613919980] delta:627 [20613920296] delta:316 [20613920571] delta:275 [20613920862] delta:291 [20613921152] delta:290 [20613921464] delta:312 [20613921464] delta:0 TIME EXTEND [20613921464] delta:0 This happened more than once, and always for an off by one result. It also started happening after commit aafe104aa9096 was added. Cc: Fixes: aafe104aa9096 ("tracing: Restructure trace_clock_global() to never block") Signed-off-by: Steven Rostedt (VMware) <>
2021-06-18tracing: Do not stop recording comms if the trace file is being readSteven Rostedt (VMware)
A while ago, when the "trace" file was opened, tracing was stopped, and code was added to stop recording the comms to saved_cmdlines, for mapping of the pids to the task name. Code has been added that only records the comm if a trace event occurred, and there's no reason to not trace it if the trace file is opened. Cc: Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred") Signed-off-by: Steven Rostedt (VMware) <>
2021-06-18tracing: Do not stop recording cmdlines when tracing is offSteven Rostedt (VMware)
The saved_cmdlines is used to map pids to the task name, such that the output of the tracing does not just show pids, but also gives a human readable name for the task. If the name is not mapped, the output looks like this: <...>-1316 [005] ...2 132.044039: ... Instead of this: gnome-shell-1316 [005] ...2 132.044039: ... The names are updated when tracing is running, but are skipped if tracing is stopped. Unfortunately, this stops the recording of the names if the top level tracer is stopped, and not if there's other tracers active. The recording of a name only happens when a new event is written into a ring buffer, so there is no need to test if tracing is on or not. If tracing is off, then no event is written and no need to test if tracing is off or not. Remove the check, as it hides the names of tasks for events in the instance buffers. Cc: Fixes: 7ffbd48d5cab2 ("tracing: Cache comms only after an event occurred") Signed-off-by: Steven Rostedt (VMware) <>
2021-06-16crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfoPingfan Liu
As mentioned in kernel commit 1d50e5d0c505 ("crash_core, vmcoreinfo: Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS in the formula: #define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate PAGES_PER_SECTION in makedumpfile just like kernel. Unfortunately, this arch-dependent macro SECTION_SIZE_BITS changes, e.g. recently in kernel commit f0b13ee23241 ("arm64/sparsemem: reduce SECTION_SIZE_BITS"). But user space wants a stable interface to get this info. Such info is impossible to be deduced from a crashdump vmcore. Hence append SECTION_SIZE_BITS to vmcoreinfo. Link: Link: Signed-off-by: Pingfan Liu <> Acked-by: Baoquan He <> Cc: Bhupesh Sharma <> Cc: Kazuhito Hagio <> Cc: Dave Young <> Cc: Boris Petkov <> Cc: Ingo Molnar <> Cc: Thomas Gleixner <> Cc: James Morse <> Cc: Mark Rutland <> Cc: Will Deacon <> Cc: Catalin Marinas <> Cc: Michael Ellerman <> Cc: Paul Mackerras <> Cc: Benjamin Herrenschmidt <> Cc: Dave Anderson <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-06-16printk: Move EXPORT_SYMBOL() closer to vprintk definitionPunit Agrawal
Commit 28e1745b9fa2 ("printk: rename vprintk_func to vprintk") while improving readability by removing vprintk indirection, inadvertently placed the EXPORT_SYMBOL() for the newly renamed function at the end of the file. For reader sanity, and as is convention move the EXPORT_SYMBOL() declaration just after the end of the function. Fixes: 28e1745b9fa2 ("printk: rename vprintk_func to vprintk") Signed-off-by: Punit Agrawal <> Acked-by: Rasmus Villemoes <> Acked-by: Sergey Senozhatsky <> Signed-off-by: Petr Mladek <> Link:
2021-06-14bpf: Fix leakage under speculation on mispredicted branchesDaniel Borkmann
The verifier only enumerates valid control-flow paths and skips paths that are unreachable in the non-speculative domain. And so it can miss issues under speculative execution on mispredicted branches. For example, a type confusion has been demonstrated with the following crafted program: // r0 = pointer to a map array entry // r6 = pointer to readable stack slot // r9 = scalar controlled by attacker 1: r0 = *(u64 *)(r0) // cache miss 2: if r0 != 0x0 goto line 4 3: r6 = r9 4: if r0 != 0x1 goto line 6 5: r9 = *(u8 *)(r6) 6: // leak r9 Since line 3 runs iff r0 == 0 and line 5 runs iff r0 == 1, the verifier concludes that the pointer dereference on line 5 is safe. But: if the attacker trains both the branches to fall-through, such that the following is speculatively executed ... r6 = r9 r9 = *(u8 *)(r6) // leak r9 ... then the program will dereference an attacker-controlled value and could leak its content under speculative execution via side-channel. This requires to mistrain the branch predictor, which can be rather tricky, because the branches are mutually exclusive. However such training can be done at congruent addresses in user space using different branches that are not mutually exclusive. That is, by training branches in user space ... A: if r0 != 0x0 goto line C B: ... C: if r0 != 0x0 goto line D D: ... ... such that addresses A and C collide to the same CPU branch prediction entries in the PHT (pattern history table) as those of the BPF program's lines 2 and 4, respectively. A non-privileged attacker could simply brute force such collisions in the PHT until observing the attack succeeding. Alternative methods to mistrain the branch predictor are also possible that avoid brute forcing the collisions in the PHT. A reliable attack has been demonstrated, for example, using the following crafted program: // r0 = pointer to a [control] map array entry // r7 = *(u64 *)(r0 + 0), training/attack phase // r8 = *(u64 *)(r0 + 8), oob address // [...] // r0 = pointer to a [data] map array entry 1: if r7 == 0x3 goto line 3 2: r8 = r0 // crafted sequence of conditional jumps to separate the conditional // branch in line 193 from the current execution flow 3: if r0 != 0x0 goto line 5 4: if r0 == 0x0 goto exit 5: if r0 != 0x0 goto line 7 6: if r0 == 0x0 goto exit [...] 187: if r0 != 0x0 goto line 189 188: if r0 == 0x0 goto exit // load any slowly-loaded value (due to cache miss in phase 3) ... 189: r3 = *(u64 *)(r0 + 0x1200) // ... and turn it into known zero for verifier, while preserving slowly- // loaded dependency when executing: 190: r3 &= 1 191: r3 &= 2 // speculatively bypassed phase dependency 192: r7 += r3 193: if r7 == 0x3 goto exit 194: r4 = *(u8 *)(r8 + 0) // leak r4 As can be seen, in training phase (phase != 0x3), the condition in line 1 turns into false and therefore r8 with the oob address is overridden with the valid map value address, which in line 194 we can read out without issues. However, in attack phase, line 2 is skipped, and due to the cache miss in line 189 where the map value is (zeroed and later) added to the phase register, the condition in line 193 takes the fall-through path due to prior branch predictor training, where under speculation, it'll load the byte at oob address r8 (unknown scalar type at that point) which could then be leaked via side-channel. One way to mitigate these is to 'branch off' an unreachable path, meaning, the current verification path keeps following the is_branch_taken() path and we push the other branch to the verification stack. Given this is unreachable from the non-speculative domain, this branch's vstate is explicitly marked as speculative. This is needed for two reasons: i) if this path is solely seen from speculative execution, then we later on still want the dead code elimination to kick in in order to sanitize these instructions with jmp-1s, and ii) to ensure that paths walked in the non-speculative domain are not pruned from earlier walks of paths walked in the speculative domain. Additionally, for robustness, we mark the registers which have been part of the conditional as unknown in the speculative path given there should be no assumptions made on their content. The fix in here mitigates type confusion attacks described earlier due to i) all code paths in the BPF program being explored and ii) existing verifier logic already ensuring that given memory access instruction references one specific data structure. An alternative to this fix that has also been looked at in this scope was to mark aux->alu_state at the jump instruction with a BPF_JMP_TAKEN state as well as direction encoding (always-goto, always-fallthrough, unknown), such that mixing of different always-* directions themselves as well as mixing of always-* with unknown directions would cause a program rejection by the verifier, e.g. programs with constructs like 'if ([...]) { x = 0; } else { x = 1; }' with subsequent 'if (x == 1) { [...] }'. For unprivileged, this would result in only single direction always-* taken paths, and unknown taken paths being allowed, such that the former could be patched from a conditional jump to an unconditional jump (ja). Compared to this approach here, it would have two downsides: i) valid programs that otherwise are not performing any pointer arithmetic, etc, would potentially be rejected/broken, and ii) we are required to turn off path pruning for unprivileged, where both can be avoided in this work through pushing the invalid branch to the verification stack. The issue was originally discovered by Adam and Ofek, and later independently discovered and reported as a result of Benedict and Piotr's research work. Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation") Reported-by: Adam Morrison <> Reported-by: Ofek Kirzner <> Reported-by: Benedict Schlueter <> Reported-by: Piotr Krysiuk <> Signed-off-by: Daniel Borkmann <> Reviewed-by: John Fastabend <> Reviewed-by: Benedict Schlueter <> Reviewed-by: Piotr Krysiuk <> Acked-by: Alexei Starovoitov <>
2021-06-14bpf: Do not mark insn as seen under speculative path verificationDaniel Borkmann
... in such circumstances, we do not want to mark the instruction as seen given the goal is still to jmp-1 rewrite/sanitize dead code, if it is not reachable from the non-speculative path verification. We do however want to verify it for safety regardless. With the patch as-is all the insns that have been marked as seen before the patch will also be marked as seen after the patch (just with a potentially different non-zero count). An upcoming patch will also verify paths that are unreachable in the non-speculative domain, hence this extension is needed. Signed-off-by: Daniel Borkmann <> Reviewed-by: John Fastabend <> Reviewed-by: Benedict Schlueter <> Reviewed-by: Piotr Krysiuk <> Acked-by: Alexei Starovoitov <>
2021-06-14bpf: Inherit expanded/patched seen count from old aux dataDaniel Borkmann
Instead of relying on current env->pass_cnt, use the seen count from the old aux data in adjust_insn_aux_data(), and expand it to the new range of patched instructions. This change is valid given we always expand 1:n with n>=1, so what applies to the old/original instruction needs to apply for the replacement as well. Not relying on env->pass_cnt is a prerequisite for a later change where we want to avoid marking an instruction seen when verified under speculative execution path. Signed-off-by: Daniel Borkmann <> Reviewed-by: John Fastabend <> Reviewed-by: Benedict Schlueter <> Reviewed-by: Piotr Krysiuk <> Acked-by: Alexei Starovoitov <>
2021-06-14sched/fair: Correctly insert cfs_rq's to list on unthrottleOdin Ugedal
Fix an issue where fairness is decreased since cfs_rq's can end up not being decayed properly. For two sibling control groups with the same priority, this can often lead to a load ratio of 99/1 (!!). This happens because when a cfs_rq is throttled, all the descendant cfs_rq's will be removed from the leaf list. When they initial cfs_rq is unthrottled, it will currently only re add descendant cfs_rq's if they have one or more entities enqueued. This is not a perfect heuristic. Instead, we insert all cfs_rq's that contain one or more enqueued entities, or it its load is not completely decayed. Can often lead to situations like this for equally weighted control groups: $ ps u -C stress USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 10009 88.8 0.0 3676 100 pts/1 R+ 11:04 0:13 stress --cpu 1 root 10023 3.0 0.0 3676 104 pts/1 R+ 11:04 0:00 stress --cpu 1 Fixes: 31bc6aeaab1d ("sched/fair: Optimize update_blocked_averages()") [vingo: !SMP build fix] Signed-off-by: Odin Ugedal <> Signed-off-by: Peter Zijlstra (Intel) <> Reviewed-by: Vincent Guittot <> Link:
2021-06-14Revert "cpufreq: CPPC: Add support for frequency invariance"Viresh Kumar
This reverts commit 4c38f2df71c8e33c0b64865992d693f5022eeaad. There are few races in the frequency invariance support for CPPC driver, namely the driver doesn't stop the kthread_work and irq_work on policy exit during suspend/resume or CPU hotplug. A proper fix won't be possible for the 5.13-rc, as it requires a lot of changes. Lets revert the patch instead for now. Fixes: 4c38f2df71c8 ("cpufreq: CPPC: Add support for frequency invariance") Reported-by: Qian Cai <> Signed-off-by: Viresh Kumar <> Signed-off-by: Rafael J. Wysocki <>
2021-06-12Merge tag 'sched-urgent-2021-06-12' of ↵Linus Torvalds
git:// Pull scheduler fixes from Ingo Molnar: "Misc fixes: - Fix performance regression caused by lack of intended batching of RCU callbacks by over-eager NOHZ-full code. - Fix cgroups related corruption of load_avg and load_sum metrics. - Three fixes to fix blocked load, util_sum/runnable_sum and util_est tracking bugs" * tag 'sched-urgent-2021-06-12' of git:// sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling sched/pelt: Ensure that *_sum is always synced with *_avg tick/nohz: Only check for RCU deferred wakeup on user/guest entry when needed sched/fair: Make sure to update tg contrib for blocked load sched/fair: Keep load_avg and load_sum synced
2021-06-12Merge tag 'perf-urgent-2021-06-12' of ↵Linus Torvalds
git:// Pull perf fixes from Ingo Molnar: "Misc fixes: - Fix the NMI watchdog on ancient Intel CPUs - Remove a misguided, NMI-unsafe KASAN callback from the NMI-safe irq_work path used by perf. - Fix uncore events on Ice Lake servers. - Someone booted maxcpus=1 on an SNB-EP, and the uncore driver emitted warnings and was probably buggy. Fix it. - KCSAN found a genuine data race in the core perf code. Somewhat ironically the bug was introduced through a recent race fix. :-/ In our defense, the new race window was much more narrow. Fix it" * tag 'perf-urgent-2021-06-12' of git:// x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs irq_work: Make irq_work_queue() NMI-safe again perf/x86/intel/uncore: Fix M2M event umask for Ice Lake server perf/x86/intel/uncore: Fix a kernel WARNING triggered by maxcpus=1 perf: Fix data race between pin_count increment/decrement
2021-06-11Merge tag 'trace-v5.13-rc5-2' of ↵Linus Torvalds
git:// Pull tracing fixes from Steven Rostedt: - Fix the length check in the temp buffer filter - Fix build failure in bootconfig tools for "fallthrough" macro - Fix error return of bootconfig apply_xbc() routine * tag 'trace-v5.13-rc5-2' of git:// tracing: Correct the length check which causes memory corruption ftrace: Do not blindly read the ip address in ftrace_bug() tools/bootconfig: Fix a build error accroding to undefined fallthrough tools/bootconfig: Fix error return code in apply_xbc()
2021-06-10Merge branch 'for-5.13-fixes' of ↵Linus Torvalds
git:// Pull cgroup fix from Tejun Heo: "This is a high priority but low risk fix for a cgroup1 bug where rename(2) can change a cgroup's name to something which can break parsing of /proc/PID/cgroup" * 'for-5.13-fixes' of git:// cgroup1: don't allow '\n' in renaming
2021-06-10cgroup1: don't allow '\n' in renamingAlexander Kuznetsov
cgroup_mkdir() have restriction on newline usage in names: $ mkdir $'/sys/fs/cgroup/cpu/test\ntest2' mkdir: cannot create directory '/sys/fs/cgroup/cpu/test\ntest2': Invalid argument But in cgroup1_rename() such check is missed. This allows us to make /proc/<pid>/cgroup unparsable: $ mkdir /sys/fs/cgroup/cpu/test $ mv /sys/fs/cgroup/cpu/test $'/sys/fs/cgroup/cpu/test\ntest2' $ echo $$ > $'/sys/fs/cgroup/cpu/test\ntest2' $ cat /proc/self/cgroup 11:pids:/ 10:freezer:/ 9:hugetlb:/ 8:cpuset:/ 7:blkio:/user.slice 6:memory:/user.slice 5:net_cls,net_prio:/ 4:perf_event:/ 3:devices:/user.slice 2:cpu,cpuacct:/test test2 1:name=systemd:/ 0::/ Signed-off-by: Alexander Kuznetsov <> Reported-by: Andrey Krasichkov <> Acked-by: Dmitry Yakunin <> Cc: Signed-off-by: Tejun Heo <>
2021-06-10irq_work: Make irq_work_queue() NMI-safe againPeter Zijlstra
Someone carelessly put NMI unsafe code in irq_work_queue(), breaking just about every single user. Also, someone has a terrible comment style. Fixes: e2b5bcf9f5ba ("irq_work: record irq_work_queue() call stack") Reported-by: Alexander Shishkin <> Signed-off-by: Peter Zijlstra (Intel) <> Link:
2021-06-08tracing: Correct the length check which causes memory corruptionLiangyan
We've suffered from severe kernel crashes due to memory corruption on our production environment, like, Call Trace: [1640542.554277] general protection fault: 0000 [#1] SMP PTI [1640542.554856] CPU: 17 PID: 26996 Comm: python Kdump: loaded Tainted:G [1640542.556629] RIP: 0010:kmem_cache_alloc+0x90/0x190 [1640542.559074] RSP: 0018:ffffb16faa597df8 EFLAGS: 00010286 [1640542.559587] RAX: 0000000000000000 RBX: 0000000000400200 RCX: 0000000006e931bf [1640542.560323] RDX: 0000000006e931be RSI: 0000000000400200 RDI: ffff9a45ff004300 [1640542.560996] RBP: 0000000000400200 R08: 0000000000023420 R09: 0000000000000000 [1640542.561670] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff9a20608d [1640542.562366] R13: ffff9a45ff004300 R14: ffff9a45ff004300 R15: 696c662f65636976 [1640542.563128] FS: 00007f45d7c6f740(0000) GS:ffff9a45ff840000(0000) knlGS:0000000000000000 [1640542.563937] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1640542.564557] CR2: 00007f45d71311a0 CR3: 000000189d63e004 CR4: 00000000003606e0 [1640542.565279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1640542.566069] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1640542.566742] Call Trace: [1640542.567009] anon_vma_clone+0x5d/0x170 [1640542.567417] __split_vma+0x91/0x1a0 [1640542.567777] do_munmap+0x2c6/0x320 [1640542.568128] vm_munmap+0x54/0x70 [1640542.569990] __x64_sys_munmap+0x22/0x30 [1640542.572005] do_syscall_64+0x5b/0x1b0 [1640542.573724] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [1640542.575642] RIP: 0033:0x7f45d6e61e27 James Wang has reproduced it stably on the latest 4.19 LTS. After some debugging, we finally proved that it's due to ftrace buffer out-of-bound access using a debug tool as follows: [ 86.775200] BUG: Out-of-bounds write at addr 0xffff88aefe8b7000 [ 86.780806] no_context+0xdf/0x3c0 [ 86.784327] __do_page_fault+0x252/0x470 [ 86.788367] do_page_fault+0x32/0x140 [ 86.792145] page_fault+0x1e/0x30 [ 86.795576] strncpy_from_unsafe+0x66/0xb0 [ 86.799789] fetch_memory_string+0x25/0x40 [ 86.804002] fetch_deref_string+0x51/0x60 [ 86.808134] kprobe_trace_func+0x32d/0x3a0 [ 86.812347] kprobe_dispatcher+0x45/0x50 [ 86.816385] kprobe_ftrace_handler+0x90/0xf0 [ 86.820779] ftrace_ops_assist_func+0xa1/0x140 [ 86.825340] 0xffffffffc00750bf [ 86.828603] do_sys_open+0x5/0x1f0 [ 86.832124] do_syscall_64+0x5b/0x1b0 [ 86.835900] entry_SYSCALL_64_after_hwframe+0x44/0xa9 commit b220c049d519 ("tracing: Check length before giving out the filter buffer") adds length check to protect trace data overflow introduced in 0fc1b09ff1ff, seems that this fix can't prevent overflow entirely, the length check should also take the sizeof entry->array[0] into account, since this array[0] is filled the length of trace data and occupy addtional space and risk overflow. Link: Cc: Cc: Ingo Molnar <> Cc: Xunlei Pang <> Cc: Greg Kroah-Hartman <> Fixes: b220c049d519 ("tracing: Check length before giving out the filter buffer") Reviewed-by: Xunlei Pang <> Reviewed-by: yinbinbin <> Reviewed-by: Wetp Zhang <> Tested-by: James Wang <> Signed-off-by: Liangyan <> Signed-off-by: Steven Rostedt (VMware) <>
2021-06-08ftrace: Do not blindly read the ip address in ftrace_bug()Steven Rostedt (VMware)
It was reported that a bug on arm64 caused a bad ip address to be used for updating into a nop in ftrace_init(), but the error path (rightfully) returned -EINVAL and not -EFAULT, as the bug caused more than one error to occur. But because -EINVAL was returned, the ftrace_bug() tried to report what was at the location of the ip address, and read it directly. This caused the machine to panic, as the ip was not pointing to a valid memory address. Instead, read the ip address with copy_from_kernel_nofault() to safely access the memory, and if it faults, report that the address faulted, otherwise report what was in that location. Link: Cc: Fixes: 05736a427f7e1 ("ftrace: warn on failure to disable mcount callers") Reported-by: Mark-PK Tsai <> Tested-by: Mark-PK Tsai <> Signed-off-by: Steven Rostedt (VMware) <>
2021-06-04Merge tag 'net-5.13-rc5' of ↵Linus Torvalds
git:// Pull networking fixes from Jakub Kicinski: "Networking fixes, including fixes from bpf, wireless, netfilter and wireguard trees. The bpf vs lockdown+audit fix is the most notable. Things haven't slowed down just yet, both in terms of regressions in current release and largish fixes for older code, but we usually see a slowdown only after -rc5. Current release - regressions: - virtio-net: fix page faults and crashes when XDP is enabled - mlx5e: fix HW timestamping with CQE compression, and make sure they are only allowed to coexist with capable devices - stmmac: - fix kernel panic due to NULL pointer dereference of mdio_bus_data - fix double clk unprepare when no PHY device is connected Current release - new code bugs: - mt76: a few fixes for the recent MT7921 devices and runtime power management Previous releases - regressions: - ice: - track AF_XDP ZC enabled queues in bitmap to fix copy mode Tx - fix allowing VF to request more/less queues via virtchnl - correct supported and advertised autoneg by using PHY capabilities - allow all LLDP packets from PF to Tx - kbuild: quote OBJCOPY var to avoid a pahole call break the build Previous releases - always broken: - bpf, lockdown, audit: fix buggy SELinux lockdown permission checks - mt76: address the recent FragAttack vulnerabilities not covered by generic fixes - ipv6: fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptions - Bluetooth: - fix the erroneous flush_work() order, to avoid double free - use correct lock to prevent UAF of hdev object - nfc: fix NULL ptr dereference in llcp_sock_getname() after failed connect - ieee802154: multiple fixes to error checking and return values - igb: fix XDP with PTP enabled - intel: add correct exception tracing for XDP - tls: fix use-after-free when TLS offload device goes down and back up - ipvs: ignore IP_VS_SVC_F_HASHED flag when adding service - netfilter: nft_ct: skip expectations for confirmed conntrack - mptcp: fix falling back to TCP in presence of out of order packets early in connection lifetime - wireguard: switch from O(n) to a O(1) algorithm for maintaining peers, fixing stalls and a large memory leak in the process Misc: - devlink: correct VIRTUAL port to not have phys_port attributes - Bluetooth: fix VIRTIO_ID_BT assigned number - net: return the correct errno code ENOBUF -> ENOMEM - wireguard: - peer: allocate in kmem_cache saving 25% on peer memory - do not use -O3" * tag 'net-5.13-rc5' of git:// (91 commits) cxgb4: avoid link re-train during TC-MQPRIO configuration sch_htb: fix refcount leak in htb_parent_to_leaf_offload wireguard: allowedips: free empty intermediate nodes when removing single node wireguard: allowedips: allocate nodes in kmem_cache wireguard: allowedips: remove nodes in O(1) wireguard: allowedips: initialize list head in selftest wireguard: peer: allocate in kmem_cache wireguard: use synchronize_net rather than synchronize_rcu wireguard: do not use -O3 wireguard: selftests: make sure rp_filter is disabled on vethc wireguard: selftests: remove old conntrack kconfig value virtchnl: Add missing padding to virtchnl_proto_hdrs ice: Allow all LLDP packets from PF to Tx ice: report supported and advertised autoneg using PHY capabilities ice: handle the VF VSI rebuild failure ice: Fix VFR issues for AVF drivers that expect ATQLEN cleared ice: Fix allowing VF to request more/less queues via virtchnl virtio-net: fix for skb_over_panic inside big mode ipv6: Fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptions fib: Return the correct errno code ...
2021-06-03sched/fair: Fix util_est UTIL_AVG_UNCHANGED handlingDietmar Eggemann
The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent unnecessary util_est updates uses the LSB of util_est.enqueued. It is exposed via _task_util_est() (and task_util_est()). Commit 92a801e5d5b7 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages") mentions that the LSB is lost for util_est resolution but find_energy_efficient_cpu() checks if task_util_est() returns 0 to return prev_cpu early. _task_util_est() returns the max value of util_est.ewma and util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED. So task_util_est() returning the max of task_util() and _task_util_est() will never return 0 under the default SCHED_FEAT(UTIL_EST, true). To fix this use the MSB of util_est.enqueued instead and keep the flag util_est internal, i.e. don't export it via _task_util_est(). The maximal possible util_avg value for a task is 1024 so the MSB of 'unsigned int util_est.enqueued' isn't used to store a util value. As a caveat the code behind the util_est_se trace point has to filter UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should be easy to do. This also fixes an issue report by Xuewen Yan that util_est_update() only used UTIL_AVG_UNCHANGED for the subtrahend of the equation: last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED) Fixes: b89997aa88f0b sched/pelt: Fix task util_est update filtering Signed-off-by: Dietmar Eggemann <> Signed-off-by: Peter Zijlstra (Intel) <> Reviewed-by: Xuewen Yan <> Reviewed-by: Vincent Donnefort <> Reviewed-by: Vincent Guittot <> Link:
2021-06-03sched/pelt: Ensure that *_sum is always synced with *_avgVincent Guittot
Rounding in PELT calculation happening when entities are attached/detached of a cfs_rq can result into situations where util/runnable_avg is not null but util/runnable_sum is. This is normally not possible so we need to ensure that util/runnable_sum stays synced with util/runnable_avg. detach_entity_load_avg() is the last place where we don't sync util/runnable_sum with util/runnbale_avg when moving some sched_entities Signed-off-by: Vincent Guittot <> Signed-off-by: Peter Zijlstra (Intel) <> Link:
2021-06-02bpf, lockdown, audit: Fix buggy SELinux lockdown permission checksDaniel Borkmann
Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") added an implementation of the locked_down LSM hook to SELinux, with the aim to restrict which domains are allowed to perform operations that would breach lockdown. This is indirectly also getting audit subsystem involved to report events. The latter is problematic, as reported by Ondrej and Serhei, since it can bring down the whole system via audit: 1) The audit events that are triggered due to calls to security_locked_down() can OOM kill a machine, see below details [0]. 2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit() when trying to wake up kauditd, for example, when using trace_sched_switch() tracepoint, see details in [1]. Triggering this was not via some hypothetical corner case, but with existing tools like runqlat & runqslower from bcc, for example, which make use of this tracepoint. Rough call sequence goes like: rq_lock(rq) -> -------------------------+ trace_sched_switch() -> | bpf_prog_xyz() -> +-> deadlock selinux_lockdown() -> | audit_log_end() -> | wake_up_interruptible() -> | try_to_wake_up() -> | rq_lock(rq) --------------+ What's worse is that the intention of 59438b46471a to further restrict lockdown settings for specific applications in respect to the global lockdown policy is completely broken for BPF. The SELinux policy rule for the current lockdown check looks something like this: allow <who> <who> : lockdown { <reason> }; However, this doesn't match with the 'current' task where the security_locked_down() is executed, example: httpd does a syscall. There is a tracing program attached to the syscall which triggers a BPF program to run, which ends up doing a bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does the permission check against 'current', that is, httpd in this example. httpd has literally zero relation to this tracing program, and it would be nonsensical having to write an SELinux policy rule against httpd to let the tracing helper pass. The policy in this case needs to be against the entity that is installing the BPF program. For example, if bpftrace would generate a histogram of syscall counts by user space application: bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' bpftrace would then go and generate a BPF program from this internally. One way of doing it [for the sake of the example] could be to call bpf_get_current_task() helper and then access current->comm via one of bpf_probe_read_kernel{,_str}() helpers. So the program itself has nothing to do with httpd or any other random app doing a syscall here. The BPF program _explicitly initiated_ the lockdown check. The allow/deny policy belongs in the context of bpftrace: meaning, you want to grant bpftrace access to use these helpers, but other tracers on the system like my_random_tracer _not_. Therefore fix all three issues at the same time by taking a completely different approach for the security_locked_down() hook, that is, move the check into the program verification phase where we actually retrieve the BPF func proto. This also reliably gets the task (current) that is trying to install the BPF tracing program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since we're moving this out of the BPF helper's fast-path which can be called several millions of times per second. The check is then also in line with other security_locked_down() hooks in the system where the enforcement is performed at open/load time, for example, open_kcore() for /proc/kcore access or module_sig_check() for module signatures just to pick few random ones. What's out of scope in the fix as well as in other security_locked_down() hook locations /outside/ of BPF subsystem is that if the lockdown policy changes on the fly there is no retrospective action. This requires a different discussion, potentially complex infrastructure, and it's also not clear whether this can be solved generically. Either way, it is out of scope for a suitable stable fix which this one is targeting. Note that the breakage is specifically on 59438b46471a where it started to rely on 'current' as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf42 ("bpf: Restrict bpf when kernel lockdown is in confidentiality mode"). [0], Jakub Hrozek says: I starting seeing this with F-34. When I run a container that is traced with BPF to record the syscalls it is doing, auditd is flooded with messages like: type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality } for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM" scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0 tclass=lockdown permissive=0 This seems to be leading to auditd running out of space in the backlog buffer and eventually OOMs the machine. [...] auditd running at 99% CPU presumably processing all the messages, eventually I get: Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64 Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000 Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1 Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 [...] [1], Serhei Makarov says: Upstream kernel 5.11.0-rc7 and later was found to deadlock during a bpf_probe_read_compat() call within a sched_switch tracepoint. The problem is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on ppc64le. Example stack trace: [...] [ 730.868702] stack backtrace: [ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1 [ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 [ 730.873278] Call Trace: [ 730.873770] dump_stack+0x7f/0xa1 [ 730.874433] check_noncircular+0xdf/0x100 [ 730.875232] __lock_acquire+0x1202/0x1e10 [ 730.876031] ? __lock_acquire+0xfc0/0x1e10 [ 730.876844] lock_acquire+0xc2/0x3a0 [ 730.877551] ? __wake_up_common_lock+0x52/0x90 [ 730.878434] ? lock_acquire+0xc2/0x3a0 [ 730.879186] ? lock_is_held_type+0xa7/0x120 [ 730.880044] ? skb_queue_tail+0x1b/0x50 [ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90 [ 730.881656] ? __wake_up_common_lock+0x52/0x90 [ 730.882532] __wake_up_common_lock+0x52/0x90 [ 730.883375] audit_log_end+0x5b/0x100 [ 730.884104] slow_avc_audit+0x69/0x90 [ 730.884836] avc_has_perm+0x8b/0xb0 [ 730.885532] selinux_lockdown+0xa5/0xd0 [ 730.886297] security_locked_down+0x20/0x40 [ 730.887133] bpf_probe_read_compat+0x66/0xd0 [ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820 [ 730.888917] trace_call_bpf+0xe9/0x240 [ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0 [ 730.890579] perf_trace_sched_switch+0x142/0x180 [ 730.891485] ? __schedule+0x6d8/0xb20 [ 730.892209] __schedule+0x6d8/0xb20 [ 730.892899] schedule+0x5b/0xc0 [ 730.893522] exit_to_user_mode_prepare+0x11d/0x240 [ 730.894457] syscall_exit_to_user_mode+0x27/0x70 [ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae [...] Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") Reported-by: Ondrej Mosnacek <> Reported-by: Jakub Hrozek <> Reported-by: Serhei Makarov <> Reported-by: Jiri Olsa <> Signed-off-by: Daniel Borkmann <> Acked-by: Alexei Starovoitov <> Tested-by: Jiri Olsa <> Cc: Paul Moore <> Cc: James Morris <> Cc: Jerome Marchand <> Cc: Frank Eigler <> Cc: Linus Torvalds <> Link:
2021-05-31perf: Fix data race between pin_count increment/decrementMarco Elver
KCSAN reports a data race between increment and decrement of pin_count: write to 0xffff888237c2d4e0 of 4 bytes by task 15740 on cpu 1: find_get_context kernel/events/core.c:4617 __do_sys_perf_event_open kernel/events/core.c:12097 [inline] __se_sys_perf_event_open kernel/events/core.c:11933 ... read to 0xffff888237c2d4e0 of 4 bytes by task 15743 on cpu 0: perf_unpin_context kernel/events/core.c:1525 [inline] __do_sys_perf_event_open kernel/events/core.c:12328 [inline] __se_sys_perf_event_open kernel/events/core.c:11933 ... Because neither read-modify-write here is atomic, this can lead to one of the operations being lost, resulting in an inconsistent pin_count. Fix it by adding the missing locking in the CPU-event case. Fixes: fe4b04fa31a6 ("perf: Cure task_oncpu_function_call() races") Reported-by: Signed-off-by: Marco Elver <> Signed-off-by: Peter Zijlstra (Intel) <> Link:
2021-05-31tick/nohz: Only check for RCU deferred wakeup on user/guest entry when neededFrederic Weisbecker
Checking for and processing RCU-nocb deferred wakeup upon user/guest entry is only relevant when nohz_full runs on the local CPU, otherwise the periodic tick should take care of it. Make sure we don't needlessly pollute these fast-paths as a -3% performance regression on a will-it-scale.per_process_ops has been reported so far. Fixes: 47b8ff194c1f (entry: Explicitly flush pending rcuog wakeup before last rescheduling point) Fixes: 4ae7dc97f726 (entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point) Reported-by: kernel test robot <> Signed-off-by: Frederic Weisbecker <> Signed-off-by: Peter Zijlstra (Intel) <> Reviewed-by: Paul E. McKenney <> Cc: Link:
2021-05-31sched/fair: Make sure to update tg contrib for blocked loadVincent Guittot
During the update of fair blocked load (__update_blocked_fair()), we update the contribution of the cfs in tg->load_avg if cfs_rq's pelt has decayed. Nevertheless, the pelt values of a cfs_rq could have been recently updated while propagating the change of a child. In this case, cfs_rq's pelt will not decayed because it has already been updated and we don't update tg->load_avg. __update_blocked_fair ... for_each_leaf_cfs_rq_safe: child cfs_rq update cfs_rq_load_avg() for child cfs_rq ... update_load_avg(cfs_rq_of(se), se, 0) ... update cfs_rq_load_avg() for parent cfs_rq -propagation of child's load makes parent cfs_rq->load_sum becoming null -UPDATE_TG is not set so it doesn't update parent cfs_rq->tg_load_avg_contrib .. for_each_leaf_cfs_rq_safe: parent cfs_rq update cfs_rq_load_avg() for parent cfs_rq - nothing to do because parent cfs_rq has already been updated recently so cfs_rq->tg_load_avg_contrib is not updated ... parent cfs_rq is decayed list_del_leaf_cfs_rq parent cfs_rq - but it still contibutes to tg->load_avg we must set UPDATE_TG flags when propagting pending load to the parent Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path") Reported-by: Odin Ugedal <> Signed-off-by: Vincent Guittot <> Signed-off-by: Peter Zijlstra (Intel) <> Reviewed-by: Odin Ugedal <> Link:
2021-05-31sched/fair: Keep load_avg and load_sum syncedVincent Guittot
when removing a cfs_rq from the list we only check _sum value so we must ensure that _avg and _sum stay synced so load_sum can't be null whereas load_avg is not after propagating load in the cgroup hierarchy. Use load_avg to compute load_sum similarly to what is done for util_sum and runnable_sum. Fixes: 0e2d2aaaae52 ("sched/fair: Rewrite PELT migration propagation") Reported-by: Odin Ugedal <> Signed-off-by: Vincent Guittot <> Signed-off-by: Peter Zijlstra (Intel) <> Reviewed-by: Odin Ugedal <> Link:
2021-05-29Merge tag 'seccomp-fixes-v5.13-rc4' of ↵Linus Torvalds
git:// Pull seccomp fixes from Kees Cook: "This fixes a hard-to-hit race condition in the addfd user_notif feature of seccomp, visible since v5.9. And a small documentation fix" * tag 'seccomp-fixes-v5.13-rc4' of git:// seccomp: Refactor notification handler to prepare for new semantics Documentation: seccomp: Fix user notification documentation
2021-05-29seccomp: Refactor notification handler to prepare for new semanticsSargun Dhillon
This refactors the user notification code to have a do / while loop around the completion condition. This has a small change in semantic, in that previously we ignored addfd calls upon wakeup if the notification had been responded to, but instead with the new change we check for an outstanding addfd calls prior to returning to userspace. Rodrigo Campos also identified a bug that can result in addfd causing an early return, when the supervisor didn't actually handle the syscall [1]. [1]: Fixes: 7cf97b125455 ("seccomp: Introduce addfd ioctl to seccomp user notifier") Signed-off-by: Sargun Dhillon <> Acked-by: Tycho Andersen <> Acked-by: Christian Brauner <> Signed-off-by: Kees Cook <> Tested-by: Rodrigo Campos <> Cc: Link:
2021-05-26Merge tag 'net-5.13-rc4' of ↵Linus Torvalds
git:// Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.13-rc4, including fixes from bpf, netfilter, can and wireless trees. Notably including fixes for the recently announced "FragAttacks" WiFi vulnerabilities. Rather large batch, touching some core parts of the stack, too, but nothing hair-raising. Current release - regressions: - tipc: make node link identity publish thread safe - dsa: felix: re-enable TAS guard band mode - stmmac: correct clocks enabled in stmmac_vlan_rx_kill_vid() - stmmac: fix system hang if change mac address after interface ifdown Current release - new code bugs: - mptcp: avoid OOB access in setsockopt() - bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers - ethtool: stats: fix a copy-paste error - init correct array size Previous releases - regressions: - sched: fix packet stuck problem for lockless qdisc - net: really orphan skbs tied to closing sk - mlx4: fix EEPROM dump support - bpf: fix alu32 const subreg bound tracking on bitwise operations - bpf: fix mask direction swap upon off reg sign change - bpf, offload: reorder offload callback 'prepare' in verifier - stmmac: Fix MAC WoL not working if PHY does not support WoL - packetmmap: fix only tx timestamp on request - tipc: skb_linearize the head skb when reassembling msgs Previous releases - always broken: - mac80211: address recent "FragAttacks" vulnerabilities - mac80211: do not accept/forward invalid EAPOL frames - mptcp: avoid potential error message floods - bpf, ringbuf: deny reserve of buffers larger than ringbuf to prevent out of buffer writes - bpf: forbid trampoline attach for functions with variable arguments - bpf: add deny list of functions to prevent inf recursion of tracing programs - tls splice: check SPLICE_F_NONBLOCK instead of MSG_DONTWAIT - can: isotp: prevent race between isotp_bind() and isotp_setsockopt() - netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check, fallback to non-AVX2 version Misc: - bpf: add kconfig knob for disabling unpriv bpf by default" * tag 'net-5.13-rc4' of git:// (172 commits) net: phy: Document phydev::dev_flags bits allocation mptcp: validate 'id' when stopping the ADD_ADDR retransmit timer mptcp: avoid error message on infinite mapping mptcp: drop unconditional pr_warn on bad opt mptcp: avoid OOB access in setsockopt() nfp: update maintainer and mailing list addresses net: mvpp2: add buffer header handling in RX bnx2x: Fix missing error code in bnx2x_iov_init_one() net: zero-initialize tc skb extension on allocation net: hns: Fix kernel-doc sctp: fix the proc_handler for sysctl encap_port sctp: add the missing setting for asoc encap_port bpf, selftests: Adjust few selftest result_unpriv outcomes bpf: No need to simulate speculative domain for immediates bpf: Fix mask direction swap upon off reg sign change bpf: Wrap aux data inside bpf_sanitize_info container bpf: Fix BPF_LSM kconfig symbol dependency selftests/bpf: Add test for l3 use of bpf_redirect_peer bpftool: Add sock_release help info for cgroup attach/prog load command net: dsa: microchip: enable phy errata workaround on 9567 ...
2021-05-25bpf: No need to simulate speculative domain for immediatesDaniel Borkmann
In 801c6058d14a ("bpf: Fix leakage of uninitialized bpf stack under speculation") we replaced masking logic with direct loads of immediates if the register is a known constant. Given in this case we do not apply any masking, there is also no reason for the operation to be truncated under the speculative domain. Therefore, there is also zero reason for the verifier to branch-off and simulate this case, it only needs to do it for unknown but bounded scalars. As a side-effect, this also enables few test cases that were previously rejected due to simulation under zero truncation. Signed-off-by: Daniel Borkmann <> Reviewed-by: Piotr Krysiuk <> Acked-by: Alexei Starovoitov <>
2021-05-25bpf: Fix mask direction swap upon off reg sign changeDaniel Borkmann
Masking direction as indicated via mask_to_left is considered to be calculated once and then used to derive pointer limits. Thus, this needs to be placed into bpf_sanitize_info instead so we can pass it to sanitize_ptr_alu() call after the pointer move. Piotr noticed a corner case where the off reg causes masking direction change which then results in an incorrect final aux->alu_limit. Fixes: 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic mask") Reported-by: Piotr Krysiuk <> Signed-off-by: Daniel Borkmann <> Reviewed-by: Piotr Krysiuk <> Acked-by: Alexei Starovoitov <>
2021-05-25bpf: Wrap aux data inside bpf_sanitize_info containerDaniel Borkmann
Add a container structure struct bpf_sanitize_info which holds the current aux info, and update call-sites to sanitize_ptr_alu() to pass it in. This is needed for passing in additional state later on. Signed-off-by: Daniel Borkmann <> Reviewed-by: Piotr Krysiuk <> Acked-by: Alexei Starovoitov <>