summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-12-14Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-12-14 1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest. 2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua. 3) Support for finding BTF based kernel attach targets through libbpf's bpf_program__set_attach_target() API, from Andrii Nakryiko. 4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song. 5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet. 6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo. 7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel. 8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman. 9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner, KP Singh, Jiri Olsa and various others. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits) selftests/bpf: Add a test for ptr_to_map_value on stack for helper access bpf: Permits pointers on stack for helper calls libbpf: Expose libbpf ring_buffer epoll_fd selftests/bpf: Add set_attach_target() API selftest for module target libbpf: Support modules in bpf_program__set_attach_target() API selftests/bpf: Silence ima_setup.sh when not running in verbose mode. selftests/bpf: Drop the need for LLVM's llc selftests/bpf: fix bpf_testmod.ko recompilation logic samples/bpf: Fix possible hang in xdpsock with multiple threads selftests/bpf: Make selftest compilation work on clang 11 selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore selftests/bpf: Drop tcp-{client,server}.py from Makefile selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV selftests/bpf: Xsk selftests - DRV POLL, NOPOLL selftests/bpf: Xsk selftests - SKB POLL, NOPOLL selftests/bpf: Xsk selftests framework bpf: Only provide bpf_sock_from_file with CONFIG_NET bpf: Return -ENOTSUPP when attaching to non-kernel BTF xsk: Validate socket state in xsk_recvmsg, prior touching socket members ... ==================== Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14bpf: Permits pointers on stack for helper callsYonghong Song
Currently, when checking stack memory accessed by helper calls, for spills, only PTR_TO_BTF_ID and SCALAR_VALUE are allowed. Song discovered an issue where the below bpf program int dump_task(struct bpf_iter__task *ctx) { struct seq_file *seq = ctx->meta->seq; static char[] info = "abc"; BPF_SEQ_PRINTF(seq, "%s\n", info); return 0; } may cause a verifier failure. The verifier output looks like: ; struct seq_file *seq = ctx->meta->seq; 1: (79) r1 = *(u64 *)(r1 +0) ; BPF_SEQ_PRINTF(seq, "%s\n", info); 2: (18) r2 = 0xffff9054400f6000 4: (7b) *(u64 *)(r10 -8) = r2 5: (bf) r4 = r10 ; 6: (07) r4 += -8 ; BPF_SEQ_PRINTF(seq, "%s\n", info); 7: (18) r2 = 0xffff9054400fe000 9: (b4) w3 = 4 10: (b4) w5 = 8 11: (85) call bpf_seq_printf#126 R1_w=ptr_seq_file(id=0,off=0,imm=0) R2_w=map_value(id=0,off=0,ks=4,vs=4,imm=0) R3_w=inv4 R4_w=fp-8 R5_w=inv8 R10=fp0 fp-8_w=map_value last_idx 11 first_idx 0 regs=8 stack=0 before 10: (b4) w5 = 8 regs=8 stack=0 before 9: (b4) w3 = 4 invalid indirect read from stack off -8+0 size 8 Basically, the verifier complains the map_value pointer at "fp-8" location. To fix the issue, if env->allow_ptr_leaks is true, let us also permit pointers on the stack to be accessible by the helper. Reported-by: Song Liu <songliubraving@fb.com> Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20201210013349.943719-1-yhs@fb.com
2020-12-14Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Add speed testing on 1420-byte blocks for networking Algorithms: - Improve performance of chacha on ARM for network packets - Improve performance of aegis128 on ARM for network packets Drivers: - Add support for Keem Bay OCS AES/SM4 - Add support for QAT 4xxx devices - Enable crypto-engine retry mechanism in caam - Enable support for crypto engine on sdm845 in qce - Add HiSilicon PRNG driver support" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (161 commits) crypto: qat - add capability detection logic in qat_4xxx crypto: qat - add AES-XTS support for QAT GEN4 devices crypto: qat - add AES-CTR support for QAT GEN4 devices crypto: atmel-i2c - select CONFIG_BITREVERSE crypto: hisilicon/trng - replace atomic_add_return() crypto: keembay - Add support for Keem Bay OCS AES/SM4 dt-bindings: Add Keem Bay OCS AES bindings crypto: aegis128 - avoid spurious references crypto_aegis128_update_simd crypto: seed - remove trailing semicolon in macro definition crypto: x86/poly1305 - Use TEST %reg,%reg instead of CMP $0,%reg crypto: x86/sha512 - Use TEST %reg,%reg instead of CMP $0,%reg crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg crypto: cpt - Fix sparse warnings in cptpf hwrng: ks-sa - Add dependency on IOMEM and OF crypto: lib/blake2s - Move selftest prototype into header file crypto: arm/aes-ce - work around Cortex-A57/A72 silion errata crypto: ecdh - avoid unaligned accesses in ecdh_set_secret() crypto: ccree - rework cache parameters handling crypto: cavium - Use dma_set_mask_and_coherent to simplify code crypto: marvell/octeontx - Use dma_set_mask_and_coherent to simplify code ...
2020-12-14Merge branch 'cpufreq/arm/linux-next' of ↵Rafael J. Wysocki
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm Pull ARM cpufreq updates for 5.11-rc1 from Viresh Kumar: "This contains the following updates: - Fix imx's NVMEM_IMX_OCOTP dependency (Arnd Bergmann). - Add support for mt8167 and blacklist mt8516 (Fabien Parent). - Some ->get() callback related cleanups to the tegra194 driver and some optimizations in tegra186 driver (Jon Hunter and Sumit Gupta). - Power scale improvements to arm_scmi driver (Lukasz Luba). - Add missing MODULE_DEVICE_TABLE and MODULE_ALIAS to several drivers (Pali Rohár). - Fix error path in mediatek driver (Qinglang Miao). - Fix memleak in ST's cpufreq driver (Yangtao Li)." * 'cpufreq/arm/linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm: (22 commits) cpufreq: arm_scmi: Discover the power scale in performance protocol firmware: arm_scmi: Add power_scale_mw_get() interface cpufreq: tegra194: Rename tegra194_get_speed_common function cpufreq: tegra194: Remove unnecessary frequency calculation cpufreq: tegra186: Simplify cluster information lookup cpufreq: tegra186: Fix sparse 'incorrect type in assignment' warning cpufreq: imx: fix NVMEM_IMX_OCOTP dependency cpufreq: vexpress-spc: Add missing MODULE_ALIAS cpufreq: scpi: Add missing MODULE_ALIAS cpufreq: loongson1: Add missing MODULE_ALIAS cpufreq: sun50i: Add missing MODULE_DEVICE_TABLE cpufreq: st: Add missing MODULE_DEVICE_TABLE cpufreq: qcom: Add missing MODULE_DEVICE_TABLE cpufreq: mediatek: Add missing MODULE_DEVICE_TABLE cpufreq: highbank: Add missing MODULE_DEVICE_TABLE cpufreq: ap806: Add missing MODULE_DEVICE_TABLE cpufreq: mediatek: add missing platform_driver_unregister() on error in mtk_cpufreq_driver_init cpufreq: tegra194: get consistent cpuinfo_cur_freq cpufreq: blacklist mt8516 in cpufreq-dt-platdev cpufreq: mediatek: Add support for mt8167 ...
2020-12-14Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"Steven Rostedt (VMware)
It was believed that metag was the only architecture that required the ring buffer to keep 8 byte words aligned on 8 byte architectures, and with its removal, it was assumed that the ring buffer code did not need to handle this case. It appears that sparc64 also requires this. The following was reported on a sparc64 boot up: kernel: futex hash table entries: 65536 (order: 9, 4194304 bytes, linear) kernel: Running postponed tracer tests: kernel: Testing tracer function: kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140 kernel: Kernel unaligned access at TPC[552a24] trace_function+0x44/0x140 kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140 kernel: Kernel unaligned access at TPC[552a24] trace_function+0x44/0x140 kernel: Kernel unaligned access at TPC[552a20] trace_function+0x40/0x140 kernel: PASSED Need to put back the 64BIT aligned code for the ring buffer. Link: https://lore.kernel.org/r/CADxRZqzXQRYgKc=y-KV=S_yHL+Y8Ay2mh5ezeZUnpRvg+syWKw@mail.gmail.com Cc: stable@vger.kernel.org Fixes: 86b3de60a0b6 ("ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS") Reported-by: Anatoly Pugachev <matorola@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-14ring-buffer: Add rb_check_bpage in __rb_allocate_pagesQiujun Huang
It may be better to check each page is aligned by 4 bytes. The 2 least significant bits of the address will be used as flags. Link: https://lkml.kernel.org/r/20201015113842.2921-1-hqjagain@gmail.com Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-14ring-buffer: Fix two typos in commentsQiujun Huang
s/inerrupting/interrupting/ s/beween/between/ Link: https://lkml.kernel.org/r/20201014152749.29986-1-hqjagain@gmail.com Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-14tracing: Drop unneeded assignment in ring_buffer_resize()Lukas Bulwahn
Since commit 0a1754b2a97e ("ring-buffer: Return 0 on success from ring_buffer_resize()"), computing the size is not needed anymore. Drop unneeded assignment in ring_buffer_resize(). Link: https://lkml.kernel.org/r/20201214084503.3079-1-lukas.bulwahn@gmail.com Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-14tracing: Disable ftrace selftests when any tracer is runningMasami Hiramatsu
Disable ftrace selftests when any tracer (kernel command line options like ftrace=, trace_events=, kprobe_events=, and boot-time tracing) starts running because selftest can disturb it. Currently ftrace= and trace_events= are checked, but kprobe_events has a different flag, and boot-time tracing didn't checked. This unifies the disabled flag and all of those boot-time tracing features sets the flag. This also fixes warnings on kprobe-event selftest (CONFIG_FTRACE_STARTUP_TEST=y and CONFIG_KPROBE_EVENTS=y) with boot-time tracing (ftrace.event.kprobes.EVENT.probes) like below; [ 59.803496] trace_kprobe: Testing kprobe tracing: [ 59.804258] ------------[ cut here ]------------ [ 59.805682] WARNING: CPU: 3 PID: 1 at kernel/trace/trace_kprobe.c:1987 kprobe_trace_self_tests_ib [ 59.806944] Modules linked in: [ 59.807335] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.10.0-rc7+ #172 [ 59.808029] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/204 [ 59.808999] RIP: 0010:kprobe_trace_self_tests_init+0x5f/0x42b [ 59.809696] Code: e8 03 00 00 48 c7 c7 30 8e 07 82 e8 6d 3c 46 ff 48 c7 c6 00 b2 1a 81 48 c7 c7 7 [ 59.812439] RSP: 0018:ffffc90000013e78 EFLAGS: 00010282 [ 59.813038] RAX: 00000000ffffffef RBX: 0000000000000000 RCX: 0000000000049443 [ 59.813780] RDX: 0000000000049403 RSI: 0000000000049403 RDI: 000000000002deb0 [ 59.814589] RBP: ffffc90000013e90 R08: 0000000000000001 R09: 0000000000000001 [ 59.815349] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000ffffffef [ 59.816138] R13: ffff888004613d80 R14: ffffffff82696940 R15: ffff888004429138 [ 59.816877] FS: 0000000000000000(0000) GS:ffff88807dcc0000(0000) knlGS:0000000000000000 [ 59.817772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 59.818395] CR2: 0000000001a8dd38 CR3: 0000000002222000 CR4: 00000000000006a0 [ 59.819144] Call Trace: [ 59.819469] ? init_kprobe_trace+0x6b/0x6b [ 59.819948] do_one_initcall+0x5f/0x300 [ 59.820392] ? rcu_read_lock_sched_held+0x4f/0x80 [ 59.820916] kernel_init_freeable+0x22a/0x271 [ 59.821416] ? rest_init+0x241/0x241 [ 59.821841] kernel_init+0xe/0x10f [ 59.822251] ret_from_fork+0x22/0x30 [ 59.822683] irq event stamp: 16403349 [ 59.823121] hardirqs last enabled at (16403359): [<ffffffff810db81e>] console_unlock+0x48e/0x580 [ 59.824074] hardirqs last disabled at (16403368): [<ffffffff810db786>] console_unlock+0x3f6/0x580 [ 59.825036] softirqs last enabled at (16403200): [<ffffffff81c0033a>] __do_softirq+0x33a/0x484 [ 59.825982] softirqs last disabled at (16403087): [<ffffffff81a00f02>] asm_call_irq_on_stack+0x10 [ 59.827034] ---[ end trace 200c544775cdfeb3 ]--- [ 59.827635] trace_kprobe: error on probing function entry. Link: https://lkml.kernel.org/r/160741764955.3448999.3347769358299456915.stgit@devnote2 Fixes: 4d655281eb1b ("tracing/boot Add kprobe event support") Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-12-14Merge branch 'for-5.11' into for-linusPetr Mladek
2020-12-14Merge branch 'for-5.11-null-console' into for-linusPetr Mladek
2020-12-13Merge tag 'x86-urgent-2020-12-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A set of x86 and membarrier fixes: - Correct a few problems in the x86 and the generic membarrier implementation. Small corrections for assumptions about visibility which have turned out not to be true. - Make the PAT bits for memory encryption correct vs 4K and 2M/1G page table entries as they are at a different location. - Fix a concurrency issue in the the local bandwidth readout of resource control leading to incorrect values - Fix the ordering of allocating a vector for an interrupt. The order missed to respect the provided cpumask when the first attempt of allocating node local in the mask fails. It then tries the node instead of trying the full provided mask first. This leads to erroneous error messages and breaking the (user) supplied affinity request. Reorder it. - Make the INT3 padding detection in optprobe work correctly" * tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kprobes: Fix optprobe to detect INT3 padding correctly x86/apic/vector: Fix ordering in vector assignment x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP membarrier: Execute SYNC_CORE on the calling thread membarrier: Explicitly sync remote cores when SYNC_CORE is requested membarrier: Add an actual barrier before rseq_preempt() x86/membarrier: Get rid of a dubious optimization
2020-12-12kernel: remove checking for TIF_NOTIFY_SIGNALJens Axboe
It's available everywhere now, no need to check or add dummy defines. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12signal: kill JOBCTL_TASK_WORKJens Axboe
It's no longer used, get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12task_work: remove legacy TWA_SIGNAL pathJens Axboe
All archs now support TIF_NOTIFY_SIGNAL. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
xdp_return_frame_bulk() needs to pass a xdp_buff to __xdp_return(). strlcpy got converted to strscpy but here it makes no functional difference, so just keep the right code. Conflicts: net/netfilter/nf_tables_api.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11tick/sched: Make jiffies update quick check more robustThomas Gleixner
The quick check in tick_do_update_jiffies64() whether jiffies need to be updated is not really correct under all circumstances and on all architectures, especially not on 32bit systems. The quick check does: if (now < READ_ONCE(tick_next_period)) return; and the counterpart in the update is: WRITE_ONCE(tick_next_period, next_update_time); This has two problems: 1) On weakly ordered architectures there is no guarantee that the stores before the WRITE_ONCE() are visible which means that other CPUs can operate on a stale jiffies value. 2) On 32bit the store of tick_next_period which is an u64 is split into two 32bit stores. If the first 32bit store advances tick_next_period far out and the second 32bit store is delayed (virt, NMI ...) then jiffies will become stale until the second 32bit store happens. Address this by seperating the handling for 32bit and 64bit. On 64bit problem #1 is addressed by replacing READ_ONCE() / WRITE_ONCE() with smp_load_acquire() / smp_store_release(). On 32bit problem #2 is addressed by protecting the quick check with the jiffies sequence counter. The load and stores can be plain because the sequence count mechanics provides the required barriers already. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/87czzpc02w.fsf@nanos.tec.linutronix.de
2020-12-11bpf: Fix enum names for bpf_this_cpu_ptr() and bpf_per_cpu_ptr() helpersAndrii Nakryiko
Remove bpf_ prefix, which causes these helpers to be reported in verifier dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(), respectively. Lets fix it as long as it is still possible before UAPI freezes on these helpers. Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-11elfcore: fix building with clangArnd Bergmann
kernel/elfcore.c only contains weak symbols, which triggers a bug with clang in combination with recordmcount: Cannot find symbol for section 2: .text. kernel/elfcore.o: failed Move the empty stubs into linux/elfcore.h as inline functions. As only two architectures use these, just use the architecture specific Kconfig symbols to key off the declaration. Link: https://lkml.kernel.org/r/20201204165742.3815221-2-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Barret Rhoden <brho@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-11x86,swiotlb: Adjust SWIOTLB bounce buffer size for SEV guestsAshish Kalra
For SEV, all DMA to and from guest has to use shared (un-encrypted) pages. SEV uses SWIOTLB to make this happen without requiring changes to device drivers. However, depending on the workload being run, the default 64MB of it might not be enough and it may run out of buffers to use for DMA, resulting in I/O errors and/or performance degradation for high I/O workloads. Adjust the default size of SWIOTLB for SEV guests using a percentage of the total memory available to guest for the SWIOTLB buffers. Adds a new sev_setup_arch() function which is invoked from setup_arch() and it calls into a new swiotlb generic code function swiotlb_adjust_size() to do the SWIOTLB buffer adjustment. v5 fixed build errors and warnings as Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Co-developed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2020-12-11cpufreq: schedutil: Simplify sugov_update_next_freq()Rafael J. Wysocki
Rearrange a conditional to make it more straightforward. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2020-12-11genirq/affinity: Add irq_update_affinity_desc()John Garry
Add a function to allow the affinity of an interrupt be switched to managed, such that interrupts allocated for platform devices may be managed. This new interface has certain limitations, and attempts to use it in the following circumstances will fail: - For when the kernel is configured for generic IRQ reservation mode (in config GENERIC_IRQ_RESERVATION_MODE). The reason being that it could conflict with managed vs. non-managed interrupt accounting. - The interrupt is already started, which should not be the case during init - The interrupt is already configured as managed, which means double init Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/1606905417-183214-2-git-send-email-john.garry@huawei.com
2020-12-11Revert "genirq: Add fasteoi IPI flow"Valentin Schneider
handle_percpu_devid_fasteoi_ipi() has no more users, and handle_percpu_devid_irq() can do all that it was supposed to do. Get rid of it. This reverts commit c5e5ec033c4ab25c53f1fd217849e75deb0bf7bf. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20201109094121.29975-6-valentin.schneider@arm.com
2020-12-11ntp: Consolidate the RTC update implementationThomas Gleixner
The code for the legacy RTC and the RTC class based update are pretty much the same. Consolidate the common parts into one function and just invoke the actual setter functions. For RTC class based devices the update code checks whether the offset is valid for the device, which is usually not the case for the first invocation. If it's not the same it stores the correct offset and lets the caller try again. That's not much different from the previous approach where the first invocation had a pretty low probability to actually hit the allowed window. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201206220542.355743355@linutronix.de
2020-12-11ntp: Make the RTC sync offset less obscureThomas Gleixner
The current RTC set_offset_nsec value is not really intuitive to understand. tsched twrite(t2.tv_sec - 1) t2 (seconds increment) The offset is calculated from twrite based on the assumption that t2 - twrite == 1s. That means for the MC146818 RTC the offset needs to be negative so that the write happens 500ms before t2. It's easier to understand when the whole calculation is based on t2. That avoids negative offsets and the meaning is obvious: t2 - twrite: The time defined by the chip when seconds increment after the write. twrite - tsched: The time for the transport to the point where the chip is updated. ==> set_offset_nsec = t2 - tsched ttransport = twrite - tsched tRTCinc = t2 - twrite ==> set_offset_nsec = ttransport + tRTCinc tRTCinc is a chip property and can be obtained from the data sheet. ttransport depends on how the RTC is connected. It is close to 0 for directly accessible RTCs. For RTCs behind a slow bus, e.g. i2c, it's the time required to send the update over the bus. This can be estimated or even calibrated, but that's a different problem. Adjust the implementation and update comments accordingly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201206220542.263204937@linutronix.de
2020-12-11ntp, rtc: Move rtc_set_ntp_time() to ntp codeThomas Gleixner
rtc_set_ntp_time() is not really RTC functionality as the code is just a user of RTC. Move it into the NTP code which allows further cleanups. Requested-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201206220542.166871172@linutronix.de
2020-12-11ntp: Make the RTC synchronization more reliableThomas Gleixner
Miroslav reported that the periodic RTC synchronization in the NTP code fails more often than not to hit the specified update window. The reason is that the code uses delayed_work to schedule the update which needs to be in thread context as the underlying RTC might be connected via a slow bus, e.g. I2C. In the update function it verifies whether the current time is correct vs. the requirements of the underlying RTC. But delayed_work is using the timer wheel for scheduling which is inaccurate by design. Depending on the distance to the expiry the wheel gets less granular to allow batching and to avoid the cascading of the original timer wheel. See 500462a9de65 ("timers: Switch to a non-cascading wheel") and the code for further details. The code already deals with this by splitting the 660 seconds period into a long 659 seconds timer and then retrying with a smaller delta. But looking at the actual granularities of the timer wheel (which depend on the HZ configuration) the 659 seconds timer ends up in an outer wheel level and is affected by a worst case granularity of: HZ Granularity 1000 32s 250 16s 100 40s So the initial timer can be already off by max 12.5% which is not a big issue as the period of the sync is defined as ~11 minutes. The fine grained second attempt schedules to the desired update point with a timer expiring less than a second from now. Depending on the actual delta and the HZ setting even the second attempt can end up in outer wheel levels which have a large enough granularity to make the correctness check fail. As this is a fundamental property of the timer wheel there is no way to make this more accurate short of iterating in one jiffies steps towards the update point. Switch it to an hrtimer instead which schedules the actual update work. The hrtimer will expire precisely (max 1 jiffie delay when high resolution timers are not available). The actual scheduling delay of the work is the same as before. The update is triggered from do_adjtimex() which is a bit racy but not much more racy than it was before: if (ntp_synced()) queue_delayed_work(system_power_efficient_wq, &sync_work, 0); which is racy when the work is currently executed and has not managed to reschedule itself. This becomes now: if (ntp_synced() && !hrtimer_is_queued(&sync_hrtimer)) queue_work(system_power_efficient_wq, &sync_work, 0); which is racy when the hrtimer has expired and the work is currently executed and has not yet managed to rearm the hrtimer. Not a big problem as it just schedules work for nothing. The new implementation has a safe guard in place to catch the case where the hrtimer is queued on entry to the work function and avoids an extra update attempt of the RTC that way. Reported-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Miroslav Lichvar <mlichvar@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Link: https://lore.kernel.org/r/20201206220542.062910520@linutronix.de
2020-12-11sched/fair: Trivial correction of the newidle_balance() commentBarry Song
idle_balance() has been renamed to newidle_balance(). To differentiate with nohz_idle_balance, it seems refining the comment will be helpful for the readers of the code. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201202220641.22752-1-song.bao.hua@hisilicon.com
2020-12-11sched/fair: Clear SMT siblings after determining the core is not idleMel Gorman
The clearing of SMT siblings from the SIS mask before checking for an idle core is a small but unnecessary cost. Defer the clearing of the siblings until the scan moves to the next potential target. The cost of this was not measured as it is borderline noise but it should be self-evident. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201130144020.GS3371@techsingularity.net
2020-12-11sched: Fix kernel-doc markupMauro Carvalho Chehab
Kernel-doc requires that a kernel-doc markup to be immediately below the function prototype, as otherwise it will rename it. So, move sys_sched_yield() markup to the right place. Also fix the cpu_util() markup: Kernel-doc markups should use this format: identifier - description Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/50cd6f460aeb872ebe518a8e9cfffda2df8bdb0a.1606823973.git.mchehab+huawei@kernel.org
2020-12-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) IPsec compat fixes, from Dmitry Safonov. 2) Fix memory leak in xfrm_user_policy(). Fix from Yu Kuai. 3) Fix polling in xsk sockets by using sk_poll_wait() instead of datagram_poll() which keys off of sk_wmem_alloc and such which xsk sockets do not update. From Xuan Zhuo. 4) Missing init of rekey_data in cfgh80211, from Sara Sharon. 5) Fix destroy of timer before init, from Davide Caratti. 6) Missing CRYPTO_CRC32 selects in ethernet driver Kconfigs, from Arnd Bergmann. 7) Missing error return in rtm_to_fib_config() switch case, from Zhang Changzhong. 8) Fix some src/dest address handling in vrf and add a testcase. From Stephen Suryaputra. 9) Fix multicast handling in Seville switches driven by mscc-ocelot driver. From Vladimir Oltean. 10) Fix proto value passed to skb delivery demux in udp, from Xin Long. 11) HW pkt counters not reported correctly in enetc driver, from Claudiu Manoil. 12) Fix deadlock in bridge, from Joseph Huang. 13) Missing of_node_pur() in dpaa2 driver, fromn Christophe JAILLET. 14) Fix pid fetching in bpftool when there are a lot of results, from Andrii Nakryiko. 15) Fix long timeouts in nft_dynset, from Pablo Neira Ayuso. 16) Various stymmac fixes, from Fugang Duan. 17) Fix null deref in tipc, from Cengiz Can. 18) When mss is biog, coose more resonable rcvq_space in tcp, fromn Eric Dumazet. 19) Revert a geneve change that likely isnt necessary, from Jakub Kicinski. 20) Avoid premature rx buffer reuse in various Intel driversm from Björn Töpel. 21) retain EcT bits during TIS reflection in tcp, from Wei Wang. 22) Fix Tso deferral wrt. cwnd limiting in tcp, from Neal Cardwell. 23) MPLS_OPT_LSE_LABEL attribute is 342 ot 8 bits, from Guillaume Nault 24) Fix propagation of 32-bit signed bounds in bpf verifier and add test cases, from Alexei Starovoitov. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits) selftests: fix poll error in udpgro.sh selftests/bpf: Fix "dubious pointer arithmetic" test selftests/bpf: Fix array access with signed variable test selftests/bpf: Add test for signed 32-bit bound check bug bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds. MAINTAINERS: Add entry for Marvell Prestera Ethernet Switch driver net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower net/mlx4_en: Handle TX error CQE net/mlx4_en: Avoid scheduling restart task if it is already running tcp: fix cwnd-limited bug for TSO deferral where we send nothing net: flow_offload: Fix memory leak for indirect flow block tcp: Retain ECT bits for tos reflection ethtool: fix stack overflow in ethnl_parse_bitset() e1000e: fix S0ix flow to allow S0i3.2 subset entry ice: avoid premature Rx buffer reuse ixgbe: avoid premature Rx buffer reuse i40e: avoid premature Rx buffer reuse igb: avoid transmit queue timeout in xdp path igb: use xdp_do_flush igb: skb add metasize for xdp ...
2020-12-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2020-12-10 The following pull-request contains BPF updates for your *net* tree. We've added 21 non-merge commits during the last 12 day(s) which contain a total of 21 files changed, 163 insertions(+), 88 deletions(-). The main changes are: 1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei. 2) Fix ring_buffer__poll() return value, from Andrii. 3) Fix race in lwt_bpf, from Cong. 4) Fix test_offload, from Toke. 5) Various xsk fixes. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git Thanks a lot! Also thanks to reporters, reviewers and testers of commits in this pull-request: Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-10bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.Alexei Starovoitov
The 64-bit signed bounds should not affect 32-bit signed bounds unless the verifier knows that upper 32-bits are either all 1s or all 0s. For example the register with smin_value==1 doesn't mean that s32_min_value is also equal to 1, since smax_value could be larger than 32-bit subregister can hold. The verifier refines the smax/s32_max return value from certain helpers in do_refine_retval_range(). Teach the verifier to recognize that smin/s32_min value is also bounded. When both smin and smax bounds fit into 32-bit subregister the verifier can propagate those bounds. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: Jean-Philippe Brucker <jean-philippe@linaro.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-12-10exec: Transform exec_update_mutex into a rw_semaphoreEric W. Biederman
Recently syzbot reported[0] that there is a deadlock amongst the users of exec_update_mutex. The problematic lock ordering found by lockdep was: perf_event_open (exec_update_mutex -> ovl_i_mutex) chown (ovl_i_mutex -> sb_writes) sendfile (sb_writes -> p->lock) by reading from a proc file and writing to overlayfs proc_pid_syscall (p->lock -> exec_update_mutex) While looking at possible solutions it occured to me that all of the users and possible users involved only wanted to state of the given process to remain the same. They are all readers. The only writer is exec. There is no reason for readers to block on each other. So fix this deadlock by transforming exec_update_mutex into a rw_semaphore named exec_update_lock that only exec takes for writing. Cc: Jann Horn <jannh@google.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christopher Yeoh <cyeoh@au1.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex") [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcuEric W. Biederman
When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Using task_lookup_next_fd_rcu simplifies task_file_seq_get_next, by moving the checking for the maximum file descritor into the generic code, and by remvoing the need for capturing and releasing a reference on files_struct. As the reference count of files_struct no longer needs to be maintained bpf_iter_seq_task_file_info can have it's files member removed and task_file_seq_get_next no longer needs it's fstruct argument. The curr_fd local variable does need to become unsigned to be used with fnext_task. As curr_fd is assigned from and assigned a u32 making curr_fd an unsigned int won't cause problems and might prevent them. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> v1: https://lkml.kernel.org/r/20200817220425.9389-11-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-16-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10kcmp: In get_file_raw_ptr use task_lookup_fd_rcuEric W. Biederman
Modify get_file_raw_ptr to use task_lookup_fd_rcu. The helper task_lookup_fd_rcu does the work of taking the task lock and verifying that task->files != NULL and then calls files_lookup_fd_rcu. So let use the helper to make a simpler implementation of get_file_raw_ptr. Acked-by: Cyrill Gorcunov <gorcunov@gmail.com> Link: https://lkml.kernel.org/r/20201120231441.29911-13-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10file: Replace fcheck_files with files_lookup_fd_rcuEric W. Biederman
This change renames fcheck_files to files_lookup_fd_rcu. All of the remaining callers take the rcu_read_lock before calling this function so the _rcu suffix is appropriate. This change also tightens up the debug check to verify that all callers hold the rcu_read_lock. All callers that used to call files_check with the files->file_lock held have now been changed to call files_lookup_fd_locked. This change of name has helped remind me of which locks and which guarantees are in place helping me to catch bugs later in the patchset. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-9-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10bpf: In bpf_task_fd_query use fget_taskEric W. Biederman
Use the helper fget_task to simplify bpf_task_fd_query. As well as simplifying the code this removes one unnecessary increment of struct files_struct. This unnecessary increment of files_struct.count can result in exec unnecessarily unsharing files_struct and breaking posix locks, and it can result in fget_light having to fallback to fget reducing performance. This simplification comes from the observation that none of the callers of get_files_struct actually need to call get_files_struct that was made when discussing[1] exec and posix file locks. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> v1: https://lkml.kernel.org/r/20200817220425.9389-5-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-5-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10kcmp: In kcmp_epoll_target use fget_taskEric W. Biederman
Use the helper fget_task and simplify the code. As well as simplifying the code this removes one unnecessary increment of struct files_struct. This unnecessary increment of files_struct.count can result in exec unnecessarily unsharing files_struct and breaking posix locks, and it can result in fget_light having to fallback to fget reducing performance. Suggested-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> v1: https://lkml.kernel.org/r/20200817220425.9389-4-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-4-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10exec: Simplify unshare_filesEric W. Biederman
Now that exec no longer needs to return the unshared files to their previous value there is no reason to return displaced. Instead when unshare_fd creates a copy of the file table, call put_files_struct before returning from unshare_files. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-2-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-2-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-09Input: gtco - remove driverDmitry Torokhov
The driver has its own HID descriptor parsing code, that had and still has several issues discovered by syzbot and other tools. Ideally we should move the driver over to the HID subsystem, so that it uses proven parsing code. However the devices in question are EOL, and GTCO is not willing to extend resources for that, so let's simply remove the driver. Note that our HID support has greatly improved over the last 10 years, we may also consider reverting 6f8d9e26e7de ("hid-core.c: Adds all GTCO CalComp Digitizers and InterWrite School Products to blacklist") and see if GTCO devices actually work with normal HID drivers. Link: https://lore.kernel.org/r/X8wbBtO5KidME17K@google.com Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
2020-12-09driver core: Add fwnode_init()Saravana Kannan
There are multiple locations in the kernel where a struct fwnode_handle is initialized. Add fwnode_init() so that we have one way of initializing a fwnode_handle. Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Saravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20201121020232.908850-8-saravanak@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-09Merge remote-tracking branch 'arm64/for-next/fixes' into for-next/coreCatalin Marinas
* arm64/for-next/fixes: (26 commits) arm64: mte: fix prctl(PR_GET_TAGGED_ADDR_CTRL) if TCF0=NONE arm64: mte: Fix typo in macro definition arm64: entry: fix EL1 debug transitions arm64: entry: fix NMI {user, kernel}->kernel transitions arm64: entry: fix non-NMI kernel<->kernel transitions arm64: ptrace: prepare for EL1 irq/rcu tracking arm64: entry: fix non-NMI user<->kernel transitions arm64: entry: move el1 irq/nmi logic to C arm64: entry: prepare ret_to_user for function call arm64: entry: move enter_from_user_mode to entry-common.c arm64: entry: mark entry code as noinstr arm64: mark idle code as noinstr arm64: syscall: exit userspace before unmasking exceptions arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect() arm64: pgtable: Fix pte_accessible() ACPI/IORT: Fix doc warnings in iort.c arm64/fpsimd: add <asm/insn.h> to <asm/kprobes.h> to fix fpsimd build arm64: cpu_errata: Apply Erratum 845719 to KRYO2XX Silver arm64: proton-pack: Add KRYO2XX silver CPUs to spectre-v2 safe-list arm64: kpti: Add KRYO2XX gold/silver CPU cores to kpti safelist ... # Conflicts: # arch/arm64/include/asm/exception.h # arch/arm64/kernel/sdei.c
2020-12-09Merge remote-tracking branch 'arm64/for-next/scs' into for-next/coreCatalin Marinas
* arm64/for-next/scs: arm64: sdei: Push IS_ENABLED() checks down to callee functions arm64: scs: use vmapped IRQ and SDEI shadow stacks scs: switch to vmapped shadow stacks
2020-12-09perf: Break deadlock involving exec_update_mutexpeterz@infradead.org
Syzbot reported a lock inversion involving perf. The sore point being perf holding exec_update_mutex() for a very long time, specifically across a whole bunch of filesystem ops in pmu::event_init() (uprobes) and anon_inode_getfile(). This then inverts against procfs code trying to take exec_update_mutex. Move the permission checks later, such that we need to hold the mutex over less code. Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-12-09locking/rwsem: Remove reader optimistic spinningWaiman Long
Reader optimistic spinning is helpful when the reader critical section is short and there aren't that many readers around. It also improves the chance that a reader can get the lock as writer optimistic spinning disproportionally favors writers much more than readers. Since commit d3681e269fff ("locking/rwsem: Wake up almost all readers in wait queue"), all the waiting readers are woken up so that they can all get the read lock and run in parallel. When the number of contending readers is large, allowing reader optimistic spinning will likely cause reader fragmentation where multiple smaller groups of readers can get the read lock in a sequential manner separated by writers. That reduces reader parallelism. One possible way to address that drawback is to limit the number of readers (preferably one) that can do optimistic spinning. These readers act as representatives of all the waiting readers in the wait queue as they will wake up all those waiting readers once they get the lock. Alternatively, as reader optimistic lock stealing has already enhanced fairness to readers, it may be easier to just remove reader optimistic spinning and simplifying the optimistic spinning code as a result. Performance measurements (locking throughput kops/s) using a locking microbenchmark with 50/50 reader/writer distribution and turbo-boost disabled was done on a 2-socket Cascade Lake system (48-core 96-thread) to see the impacts of these changes: 1) Vanilla - 5.10-rc3 kernel 2) Before - 5.10-rc3 kernel with previous patches in this series 2) limit-rspin - 5.10-rc3 kernel with limited reader spinning patch 3) no-rspin - 5.10-rc3 kernel with reader spinning disabled # of threads CS Load Vanilla Before limit-rspin no-rspin ------------ ------- ------- ------ ----------- -------- 2 1 5,185 5,662 5,214 5,077 4 1 5,107 4,983 5,188 4,760 8 1 4,782 4,564 4,720 4,628 16 1 4,680 4,053 4,567 3,402 32 1 4,299 1,115 1,118 1,098 64 1 3,218 983 1,001 957 96 1 1,938 944 957 930 2 20 2,008 2,128 2,264 1,665 4 20 1,390 1,033 1,046 1,101 8 20 1,472 1,155 1,098 1,213 16 20 1,332 1,077 1,089 1,122 32 20 967 914 917 980 64 20 787 874 891 858 96 20 730 836 847 844 2 100 372 356 360 355 4 100 492 425 434 392 8 100 533 537 529 538 16 100 548 572 568 598 32 100 499 520 527 537 64 100 466 517 526 512 96 100 406 497 506 509 The column "CS Load" represents the number of pause instructions issued in the locking critical section. A CS load of 1 is extremely short and is not likey in real situations. A load of 20 (moderate) and 100 (long) are more realistic. It can be seen that the previous patches in this series have reduced performance in general except in highly contended cases with moderate or long critical sections that performance improves a bit. This change is mostly caused by the "Prevent potential lock starvation" patch that reduce reader optimistic spinning and hence reduce reader fragmentation. The patch that further limit reader optimistic spinning doesn't seem to have too much impact on overall performance as shown in the benchmark data. The patch that disables reader optimistic spinning shows reduced performance at lightly loaded cases, but comparable or slightly better performance on with heavier contention. This patch just removes reader optimistic spinning for now. As readers are not going to do optimistic spinning anymore, we don't need to consider if the OSQ is empty or not when doing lock stealing. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Link: https://lkml.kernel.org/r/20201121041416.12285-6-longman@redhat.com
2020-12-09locking/rwsem: Enable reader optimistic lock stealingWaiman Long
If the optimistic spinning queue is empty and the rwsem does not have the handoff or write-lock bits set, it is actually not necessary to call rwsem_optimistic_spin() to spin on it. Instead, it can steal the lock directly as its reader bias is in the count already. If it is the first reader in this state, it will try to wake up other readers in the wait queue. With this patch applied, the following were the lock event counts after rebooting a 2-socket system and a "make -j96" kernel rebuild. rwsem_opt_rlock=4437 rwsem_rlock=29 rwsem_rlock_steal=19 So lock stealing represents about 0.4% of all the read locks acquired in the slow path. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Link: https://lkml.kernel.org/r/20201121041416.12285-4-longman@redhat.com
2020-12-09locking/rwsem: Prevent potential lock starvationWaiman Long
The lock handoff bit is added in commit 4f23dbc1e657 ("locking/rwsem: Implement lock handoff to prevent lock starvation") to avoid lock starvation. However, allowing readers to do optimistic spinning does introduce an unlikely scenario where lock starvation can happen. The lock handoff bit may only be set when a waiter is being woken up. In the case of reader unlock, wakeup happens only when the reader count reaches 0. If there is a continuous stream of incoming readers acquiring read lock via optimistic spinning, it is possible that the reader count may never reach 0 and so the handoff bit will never be asserted. One way to prevent this scenario from happening is to disallow optimistic spinning if the rwsem is currently owned by readers. If the previous or current owner is a writer, optimistic spinning will be allowed. If the previous owner is a reader but the reader count has reached 0 before, a wakeup should have been issued. So the handoff mechanism will be kicked in to prevent lock starvation. As a result, it should be OK to do optimistic spinning in this case. This patch may have some impact on reader performance as it reduces reader optimistic spinning especially if the lock critical sections are short the number of contending readers are small. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Link: https://lkml.kernel.org/r/20201121041416.12285-3-longman@redhat.com
2020-12-09locking/rwsem: Pass the current atomic count to rwsem_down_read_slowpath()Waiman Long
The atomic count value right after reader count increment can be useful to determine the rwsem state at trylock time. So the count value is passed down to rwsem_down_read_slowpath() to be used when appropriate. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Link: https://lkml.kernel.org/r/20201121041416.12285-2-longman@redhat.com
2020-12-09locking/rwsem: Fold __down_{read,write}*()Peter Zijlstra
There's a lot needless duplication in __down_{read,write}*(), cure that with a helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201207090243.GE3040@hirez.programming.kicks-ass.net