summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-11-19Merge tag 'pm-6.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "The amd-pstate cpufreq driver gets the majority of changes this time. They are mostly fixes and cleanups, but one of them causes it to become the default cpufreq driver on some AMD server platforms. Apart from that, the menu cpuidle governor is modified to not use iowait any more, the intel_idle gets a custom C-states table for Granite Rapids Xeon D, and the intel_pstate driver will use a more aggressive Balance- performance default EPP value on Granite Rapids now. There are also some fixes, cleanups and tooling updates. Specifics: - Update the amd-pstate driver to set the initial scaling frequency policy lower bound to be the lowest non-linear frequency (Dhananjay Ugwekar) - Enable amd-pstate by default on servers starting with newer AMD Epyc processors (Swapnil Sapkal) - Align more codepaths between shared memory and MSR designs in amd-pstate (Dhananjay Ugwekar) - Clean up amd-pstate code to rename functions and remove redundant calls (Dhananjay Ugwekar, Mario Limonciello) - Do other assorted fixes and cleanups in amd-pstate (Dhananjay Ugwekar and Mario Limonciello) - Change the Balance-performance EPP value for Granite Rapids in the intel_pstate driver to a more performance-biased one (Srinivas Pandruvada) - Simplify MSR read on the boot CPU in the ACPI cpufreq driver (Chang S. Bae) - Ensure sugov_eas_rebuild_sd() is always called when sugov_init() succeeds to always enforce sched domains rebuild in case EAS needs to be enabled (Christian Loehle) - Switch cpufreq back to platform_driver::remove() (Uwe Kleine-König) - Use proper frequency unit names in cpufreq (Marcin Juszkiewicz) - Add a built-in idle states table for Granite Rapids Xeon D to the intel_idle driver (Artem Bityutskiy) - Fix some typos in comments in the cpuidle core and drivers (Shen Lichuan) - Remove iowait influence from the menu cpuidle governor (Christian Loehle) - Add min/max available performance state limits to the Energy Model management code (Lukasz Luba) - Update pm-graph to v5.13 (Todd Brandt) - Add documentation for some recently introduced cpupower utility options (Tor Vic) - Make cpupower inform users where cpufreq-bench.conf should be located when opening it fails (Peng Fan) - Allow overriding cross-compiling env params in cpupower (Peng Fan) - Add compile_commands.json to .gitignore in cpupower (John B. Wyatt IV) - Improve disable c_state block in cpupower bindings and add a test to confirm that CPU state is disabled to it (John B. Wyatt IV) - Add Chinese Simplified translation to cpupower (Kieran Moy) - Add checks for xgettext and msgfmt to cpupower (Siddharth Menon)" * tag 'pm-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (38 commits) cpufreq: intel_pstate: Update Balance-performance EPP for Granite Rapids cpufreq: ACPI: Simplify MSR read on the boot CPU sched/cpufreq: Ensure sd is rebuilt for EAS check intel_idle: add Granite Rapids Xeon D support PM: EM: Add min/max available performance state limits cpufreq/amd-pstate: Move registration after static function call update cpufreq/amd-pstate: Push adjust_perf vfunc init into cpu_init cpufreq/amd-pstate: Align offline flow of shared memory and MSR based systems cpufreq/amd-pstate: Call cppc_set_epp_perf in the reenable function cpufreq/amd-pstate: Do not attempt to clear MSR_AMD_CPPC_ENABLE cpufreq/amd-pstate: Rename functions that enable CPPC cpufreq/amd-pstate-ut: Add fix for min freq unit test amd-pstate: Switch to amd-pstate by default on some Server platforms amd-pstate: Set min_perf to nominal_perf for active mode performance gov cpufreq/amd-pstate: Remove the redundant amd_pstate_set_driver() call cpufreq/amd-pstate: Remove the switch case in amd_pstate_init() cpufreq/amd-pstate: Call amd_pstate_set_driver() in amd_pstate_register_driver() cpufreq/amd-pstate: Call amd_pstate_register() in amd_pstate_init() cpufreq/amd-pstate: Set the initial min_freq to lowest_nonlinear_freq cpufreq/amd-pstate: Remove the redundant verify() function ...
2024-11-19Merge tag 'random-6.13-rc1-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: "This contains a single series from Uros to replace uses of <linux/random.h> with prandom.h or other more specific headers as needed, in order to avoid a circular header issue. Uros' goal is to be able to use percpu.h from prandom.h, which will then allow him to define __percpu in percpu.h rather than in compiler_types.h" * tag 'random-6.13-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: prandom: Include <linux/percpu.h> in <linux/prandom.h> random: Do not include <linux/prandom.h> in <linux/random.h> netem: Include <linux/prandom.h> in sch_netem.c lib/test_scanf: Include <linux/prandom.h> instead of <linux/random.h> lib/test_parman: Include <linux/prandom.h> instead of <linux/random.h> bpf/tests: Include <linux/prandom.h> instead of <linux/random.h> lib/rbtree-test: Include <linux/prandom.h> instead of <linux/random.h> random32: Include <linux/prandom.h> instead of <linux/random.h> kunit: string-stream-test: Include <linux/prandom.h> lib/interval_tree_test.c: Include <linux/prandom.h> instead of <linux/random.h> bpf: Include <linux/prandom.h> instead of <linux/random.h> scsi: libfcoe: Include <linux/prandom.h> instead of <linux/random.h> fscrypt: Include <linux/once.h> in fs/crypto/keyring.c mtd: tests: Include <linux/prandom.h> instead of <linux/random.h> media: vivid: Include <linux/prandom.h> in vivid-vid-cap.c drm/lib: Include <linux/prandom.h> instead of <linux/random.h> drm/i915/selftests: Include <linux/prandom.h> instead of <linux/random.h> crypto: testmgr: Include <linux/prandom.h> instead of <linux/random.h> x86/kaslr: Include <linux/prandom.h> instead of <linux/random.h>
2024-11-19Merge tag 'v6.13-p1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Add sig driver API - Remove signing/verification from akcipher API - Move crypto_simd_disabled_for_test to lib/crypto - Add WARN_ON for return values from driver that indicates memory corruption Algorithms: - Provide crc32-arch and crc32c-arch through Crypto API - Optimise crc32c code size on x86 - Optimise crct10dif on arm/arm64 - Optimise p10-aes-gcm on powerpc - Optimise aegis128 on x86 - Output full sample from test interface in jitter RNG - Retry without padata when it fails in pcrypt Drivers: - Add support for Airoha EN7581 TRNG - Add support for STM32MP25x platforms in stm32 - Enable iproc-r200 RNG driver on BCMBCA - Add Broadcom BCM74110 RNG driver" * tag 'v6.13-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (112 commits) crypto: marvell/cesa - fix uninit value for struct mv_cesa_op_ctx crypto: cavium - Fix an error handling path in cpt_ucode_load_fw() crypto: aesni - Move back to module_init crypto: lib/mpi - Export mpi_set_bit crypto: aes-gcm-p10 - Use the correct bit to test for P10 hwrng: amd - remove reference to removed PPC_MAPLE config crypto: arm/crct10dif - Implement plain NEON variant crypto: arm/crct10dif - Macroify PMULL asm code crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply crypto: arm64/crct10dif - Remove obsolete chunking logic crypto: bcm - add error check in the ahash_hmac_init function crypto: caam - add error check to caam_rsa_set_priv_key_form hwrng: bcm74110 - Add Broadcom BCM74110 RNG driver dt-bindings: rng: add binding for BCM74110 RNG padata: Clean up in padata_do_multithreaded() crypto: inside-secure - Fix the return value of safexcel_xcbcmac_cra_init() crypto: qat - Fix missing destroy_workqueue in adf_init_aer() crypto: rsassa-pkcs1 - Reinstate support for legacy protocols ...
2024-11-19Merge tag 'csd-lock.2024.11.16a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull CSD-lock update from Paul McKenney: "This switches from sched_clock() to ktime_get_mono_fast_ns(), which on x86 switches from the rdtsc instruction to the rdtscp instruction, thus avoiding instruction reorderings that cause false-positive reports of CSD-lock stalls of almost 2^46 nanoseconds. These false positives are rare, but really are seen in the wild" * tag 'csd-lock.2024.11.16a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()
2024-11-19Merge tag 'scftorture.2024.11.16a' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu Pull scftorture updates from Paul McKenney: - Avoid divide operation - Fix cleanup code waiting for IPI handlers - Move memory allocations out of preempt-disable region of code for PREEMPT_RT compatibility - Use a lockless list to avoid freeing memory while interrupts are disabled, again for PREEMPT_RT compatibility - Make lockless list scf_add_to_free_list() correctly handle freeing a NULL pointer * tag 'scftorture.2024.11.16a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: scftorture: Handle NULL argument passed to scf_add_to_free_list(). scftorture: Use a lock-less list to free memory. scftorture: Move memory allocation outside of preempt_disable region. scftorture: Wait until scf_cleanup_handler() completes. scftorture: Avoid additional div operation.
2024-11-19perf/core: Save raw sample data conditionally based on sample typeYabin Cui
Currently, space for raw sample data is always allocated within sample records for both BPF output and tracepoint events. This leads to unused space in sample records when raw sample data is not requested. This patch enforces checking sample type of an event in perf_sample_save_raw_data(). So raw sample data will only be saved if explicitly requested, reducing overhead when it is not needed. Fixes: 0a9081cf0a11 ("perf/core: Add perf_sample_save_raw_data() helper") Signed-off-by: Yabin Cui <yabinc@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20240515193610.2350456-2-yabinc@google.com
2024-11-18Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - Support for running Linux in a protected VM under the Arm Confidential Compute Architecture (CCA) - Guarded Control Stack user-space support. Current patches follow the x86 ABI of implicitly creating a shadow stack on clone(). Subsequent patches (already on the list) will add support for clone3() allowing finer-grained control of the shadow stack size and placement from libc - AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are getting close with the upcoming dpISA support) - Other arch features: - In-kernel use of the memcpy instructions, FEAT_MOPS (previously only exposed to user; uaccess support not merged yet) - MTE: hugetlbfs support and the corresponding kselftests - Optimise CRC32 using the PMULL instructions - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG - Optimise the kernel TLB flushing to use the range operations - POE/pkey (permission overlays): further cleanups after bringing the signal handler in line with the x86 behaviour for 6.12 - arm64 perf updates: - Support for the NXP i.MX91 PMU in the existing IMX driver - Support for Ampere SoCs in the Designware PCIe PMU driver - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC - Support for Samsung's 'Mongoose' CPU PMU - Support for PMUv3.9 finer-grained userspace counter access control - Switch back to platform_driver::remove() now that it returns 'void' - Add some missing events for the CXL PMU driver - Miscellaneous arm64 fixes/cleanups: - Page table accessors cleanup: type updates, drop unused macros, reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity check addresses before runtime P4D/PUD folding - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the FEAT_ECV for the generic timers) allowing Linux to boot with firmware deployments that don't set SCTLR_EL3.ECVEn - ACPI/arm64: tighten the check for the array of platform timer structures and adjust the error handling procedure in gtdt_parse_timer_block() - Optimise the cache flush for the uprobes xol slot (skip if no change) and other uprobes/kprobes cleanups - Fix the context switching of tpidrro_el0 when kpti is enabled - Dynamic shadow call stack fixes - Sysreg updates - Various arm64 kselftest improvements * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (168 commits) arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled kselftest/arm64: Try harder to generate different keys during PAC tests kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all() arm64/ptrace: Clarify documentation of VL configuration via ptrace kselftest/arm64: Corrupt P0 in the irritator when testing SSVE acpi/arm64: remove unnecessary cast arm64/mm: Change protval as 'pteval_t' in map_range() kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c kselftest/arm64: Add FPMR coverage to fp-ptrace kselftest/arm64: Expand the set of ZA writes fp-ptrace does kselftets/arm64: Use flag bits for features in fp-ptrace assembler code kselftest/arm64: Enable build of PAC tests with LLVM=1 kselftest/arm64: Check that SVCR is 0 in signal handlers selftests/mm: Fix unused function warning for aarch64_write_signal_pkey() kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests kselftest/arm64: Fix build with stricter assemblers arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux() arm64/scs: Deal with 64-bit relative offsets in FDE frames ...
2024-11-18Merge tag 'lsm-pr-20241112' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: "Thirteen patches, all focused on moving away from the current 'secid' LSM identifier to a richer 'lsm_prop' structure. This move will help reduce the translation that is necessary in many LSMs, offering better performance, and make it easier to support different LSMs in the future" * tag 'lsm-pr-20241112' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: lsm: remove lsm_prop scaffolding netlabel,smack: use lsm_prop for audit data audit: change context data from secid to lsm_prop lsm: create new security_cred_getlsmprop LSM hook audit: use an lsm_prop in audit_names lsm: use lsm_prop in security_inode_getsecid lsm: use lsm_prop in security_current_getsecid audit: update shutdown LSM data lsm: use lsm_prop in security_ipc_getsecid audit: maintain an lsm_prop in audit_context lsm: add lsmprop_to_secctx hook lsm: use lsm_prop in security_audit_rule_match lsm: add the lsm_prop data structure
2024-11-18Merge tag 'audit-pr-20241112' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: "The audit patches are minimal this time around with one patch to correct some kdoc function parameters and one to leverage the `str_yes_no()` function; nothing very exciting" * tag 'audit-pr-20241112' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: Use str_yes_no() helper function audit: Reorganize kerneldoc parameter names
2024-11-18Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...
2024-11-18tracing: Fix function name for trampolineTatsuya S
The issue that unrelated function name is shown on stack trace like following even though it should be trampoline code address is caused by the creation of trampoline code in the area where .init.text section of module was freed after module is loaded. bash-1344 [002] ..... 43.644608: <stack trace> => (MODULE INIT FUNCTION) => vfs_write => ksys_write => do_syscall_64 => entry_SYSCALL_64_after_hwframe To resolve this, when function address of stack trace entry is in trampoline, output without looking up symbol name. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241021071454.34610-2-tatsuya.s2862@gmail.com Signed-off-by: Tatsuya S <tatsuya.s2862@gmail.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-18Merge tag 'vfs-6.13.usercopy' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull copy_struct_to_user helper from Christian Brauner: "This adds a copy_struct_to_user() helper which is a companion helper to the already widely used copy_struct_from_user(). It copies a struct from kernel space to userspace, in a way that guarantees backwards-compatibility for struct syscall arguments as long as future struct extensions are made such that all new fields are appended to the old struct, and zeroed-out new fields have the same meaning as the old struct. The first user is sched_getattr() system call but the new extensible pidfs ioctl will be ported to it as well" * tag 'vfs-6.13.usercopy' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: sched_getattr: port to copy_struct_to_user uaccess: add copy_struct_to_user helper
2024-11-18Merge tag 'vfs-6.13.file' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file updates from Christian Brauner: "This contains changes the changes for files for this cycle: - Introduce a new reference counting mechanism for files. As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has O(N^2) behaviour under contention with N concurrent operations and it is in a hot path in __fget_files_rcu(). The rcuref infrastructures remedies this problem by using an unconditional increment relying on safe- and dead zones to make this work and requiring rcu protection for the data structure in question. This not just scales better it also introduces overflow protection. However, in contrast to generic rcuref, files require a memory barrier and thus cannot rely on *_relaxed() atomic operations and also require to be built on atomic_long_t as having massive amounts of reference isn't unheard of even if it is just an attack. This adds a file specific variant instead of making this a generic library. This has been tested by various people and it gives consistent improvement up to 3-5% on workloads with loads of threads. - Add a fastpath for find_next_zero_bit(). Skip 2-levels searching via find_next_zero_bit() when there is a free slot in the word that contains the next fd. This improves pts/blogbench-1.1.0 read by 8% and write by 4% on Intel ICX 160. - Conditionally clear full_fds_bits since it's very likely that a bit in full_fds_bits has been cleared during __clear_open_fds(). This improves pts/blogbench-1.1.0 read up to 13%, and write up to 5% on Intel ICX 160. - Get rid of all lookup_*_fdget_rcu() variants. They were used to lookup files without taking a reference count. That became invalid once files were switched to SLAB_TYPESAFE_BY_RCU and now we're always taking a reference count. Switch to an already existing helper and remove the legacy variants. - Remove pointless includes of <linux/fdtable.h>. - Avoid cmpxchg() in close_files() as nobody else has a reference to the files_struct at that point. - Move close_range() into fs/file.c and fold __close_range() into it. - Cleanup calling conventions of alloc_fdtable() and expand_files(). - Merge __{set,clear}_close_on_exec() into one. - Make __set_open_fd() set cloexec as well instead of doing it in two separate steps" * tag 'vfs-6.13.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests: add file SLAB_TYPESAFE_BY_RCU recycling stressor fs: port files to file_ref fs: add file_ref expand_files(): simplify calling conventions make __set_open_fd() set cloexec state as well fs: protect backing files with rcu file.c: merge __{set,clear}_close_on_exec() alloc_fdtable(): change calling conventions. fs/file.c: add fast path in find_next_fd() fs/file.c: conditionally clear full_fds fs/file.c: remove sanity_check and add likely/unlikely in alloc_fd() move close_range(2) into fs/file.c, fold __close_range() into it close_files(): don't bother with xchg() remove pointless includes of <linux/fdtable.h> get rid of ...lookup...fdget_rcu() family
2024-11-18Merge tag 'vfs-6.13.mgtime' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs multigrain timestamps from Christian Brauner: "This is another try at implementing multigrain timestamps. This time with significant help from the timekeeping maintainers to reduce the performance impact. Thomas provided a base branch that contains the required timekeeping interfaces for the VFS. It serves as the base for the multi-grain timestamp work: - Multigrain timestamps allow the kernel to use fine-grained timestamps when an inode's attributes is being actively observed via ->getattr(). With this support, it's possible for a file to get a fine-grained timestamp, and another modified after it to get a coarse-grained stamp that is earlier than the fine-grained time. If this happens then the files can appear to have been modified in reverse order, which breaks VFS ordering guarantees. To prevent this, a floor value is maintained for multigrain timestamps. Whenever a fine-grained timestamp is handed out, record it, and when later coarse-grained stamps are handed out, ensure they are not earlier than that value. If the coarse-grained timestamp is earlier than the fine-grained floor, return the floor value instead. The timekeeper changes add a static singleton atomic64_t into timekeeper.c that is used to keep track of the latest fine-grained time ever handed out. This is tracked as a monotonic ktime_t value to ensure that it isn't affected by clock jumps. Because it is updated at different times than the rest of the timekeeper object, the floor value is managed independently of the timekeeper via a cmpxchg() operation, and sits on its own cacheline. Two new public timekeeper interfaces are added: (1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the coarse-grained clock and the floor time (2) ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries to swap it into the floor. A timespec64 is filled with the result. - The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This adds a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. This is where the earlier mentioned timkeeping interfaces help. A global monotonic atomic64_t value is kept that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems)" * tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: reduce pointer chasing in is_mgtime() test tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime timekeeping: Add percpu counter for tracking floor swap events timekeeping: Add interfaces for handling timestamps with a floor value fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps
2024-11-18posix-timers: Fix spurious warning on double enqueue versus do_exit()Frederic Weisbecker
A timer sigqueue may find itself already pending when it is tried to be enqueued. This situation can happen if the timer sigqueue is enqueued but then the timer is reset afterwards and fires before the pending signal managed to be delivered. However when such a double enqueue occurs while the corresponding signal is ignored, the sigqueue is expected to be found either on the dedicated ignored list if the timer was periodic or dropped if the timer was one-shot. In any case it is not supposed to be queued on the real signal queue. An assertion verifies the latter expectation on top of the return value of prepare_signal(), assuming "false" means that the signal is being ignored. But prepare_signal() may also fail if the target is exiting as the last task of its group. In this case the double enqueue observes the sigqueue queued, as in such a situation: TASK A (same group as B) TASK B (same group as A) ------------------------ ------------------------ // timer event // queue signal to TASK B posix_timer_queue_signal() // reset timer through syscall do_timer_settime() // exit, leaving task B alone do_exit() do_exit() synchronize_group_exit() signal->flags = SIGNAL_GROUP_EXIT // ========> <IRQ> timer event posix_timer_queue_signal() // return false due to SIGNAL_GROUP_EXIT if (!prepare_signal()) WARN_ON_ONCE(!list_empty(&q->list)) And this spuriously triggers this warning: WARNING: CPU: 0 PID: 5854 at kernel/signal.c:2008 posixtimer_send_sigqueue CPU: 0 UID: 0 PID: 5854 Comm: syz-executor139 Not tainted 6.12.0-rc6-next-20241108-syzkaller #0 RIP: 0010:posixtimer_send_sigqueue+0x9da/0xbc0 kernel/signal.c:2008 Call Trace: <IRQ> alarm_handle_timer alarmtimer_fired __run_hrtimer __hrtimer_run_queues hrtimer_interrupt local_apic_timer_interrupt __sysvec_apic_timer_interrupt instr_sysvec_apic_timer_interrupt sysvec_apic_timer_interrupt </IRQ> Fortunately the recovery code in that case already does the right thing: just exit from posixtimer_send_sigqueue() and wait for __exit_signal() to flush the pending signal. Just make sure to warn only the case when the sigqueue is queued and the signal is really ignored. Fixes: df7a996b4dab ("signal: Queue ignored posixtimers on ignore list") Reported-by: syzbot+852e935b899bde73626e@syzkaller.appspotmail.com Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: syzbot+852e935b899bde73626e@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/20241116234823.28497-1-frederic@kernel.org Closes: https://lore.kernel.org/all/673549c6.050a0220.1324f8.008c.GAE@google.com
2024-11-18ftrace: Get the true parent ip for function tracerJeff Xie
When using both function tracer and function graph simultaneously, it is found that function tracer sometimes captures a fake parent ip (return_to_handler) instead of the true parent ip. This issue is easy to reproduce. Below are my reproduction steps: jeff-labs:~/bin # ./trace-net.sh jeff-labs:~/bin # cat /sys/kernel/debug/tracing/instances/foo/trace | grep return_to_handler trace-net.sh-405 [001] ...2. 31.859501: avc_has_perm+0x4/0x190 <-return_to_handler+0x0/0x40 trace-net.sh-405 [001] ...2. 31.859503: simple_setattr+0x4/0x70 <-return_to_handler+0x0/0x40 trace-net.sh-405 [001] ...2. 31.859503: truncate_pagecache+0x4/0x60 <-return_to_handler+0x0/0x40 trace-net.sh-405 [001] ...2. 31.859505: unmap_mapping_range+0x4/0x140 <-return_to_handler+0x0/0x40 trace-net.sh-405 [001] ...3. 31.859508: _raw_spin_unlock+0x4/0x30 <-return_to_handler+0x0/0x40 [...] The following is my simple trace script: <snip> jeff-labs:~/bin # cat ./trace-net.sh TRACE_PATH="/sys/kernel/tracing" set_events() { echo 1 > $1/events/net/enable echo 1 > $1/events/tcp/enable echo 1 > $1/events/sock/enable echo 1 > $1/events/napi/enable echo 1 > $1/events/fib/enable echo 1 > $1/events/neigh/enable } set_events ${TRACE_PATH} echo 1 > ${TRACE_PATH}/options/sym-offset echo 1 > ${TRACE_PATH}/options/funcgraph-tail echo 1 > ${TRACE_PATH}/options/funcgraph-proc echo 1 > ${TRACE_PATH}/options/funcgraph-abstime echo 'tcp_orphan*' > ${TRACE_PATH}/set_ftrace_notrace echo function_graph > ${TRACE_PATH}/current_tracer INSTANCE_FOO=${TRACE_PATH}/instances/foo if [ ! -e $INSTANCE_FOO ]; then mkdir ${INSTANCE_FOO} fi set_events ${INSTANCE_FOO} echo 1 > ${INSTANCE_FOO}/options/sym-offset echo 'tcp_orphan*' > ${INSTANCE_FOO}/set_ftrace_notrace echo function > ${INSTANCE_FOO}/current_tracer echo 1 > ${TRACE_PATH}/tracing_on echo 1 > ${INSTANCE_FOO}/tracing_on echo > ${TRACE_PATH}/trace echo > ${INSTANCE_FOO}/trace </snip> Link: https://lore.kernel.org/20241008033159.22459-1-jeff.xie@linux.dev Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Jeff Xie <jeff.xie@linux.dev> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-18cpu: Remove spurious NULL in attribute_group definitionThomas Weißschuh
This NULL value is most-likely a copy-paste error from an array definition. The NULL doesn't have any effect. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20241118-sysfs-const-attribute_group-fixes-v1-3-48e0b0ad8cba@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-18kdb: fix ctrl+e/a/f/b/d/p/n broken in keyboard modeNir Lichtman
Problem: When using kdb via keyboard it does not react to control characters which are supported in serial mode. Example: Chords such as ctrl+a/e/d/p do not work in keyboard mode Solution: Before disregarding non-printable key characters, check if they are one of the supported control characters, I have took the control characters from the switch case upwards in this function that translates scan codes of arrow keys/backspace/home/.. to the control characters. Suggested-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Nir Lichtman <nir@lichtman.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20241111215622.GA161253@lichtman.org Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2024-11-18ring-buffer: Correct a grammatical error in a commentliujing
The word "trace" begins with a consonant sound, so "a" should be used instead of "an". Link: https://lore.kernel.org/20241107095327.6390-1-liujing@cmss.chinamobile.com Signed-off-by: liujing <liujing@cmss.chinamobile.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-18Merge branch 'for-6.13-force-console' into for-linusPetr Mladek
2024-11-16Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "10 hotfixes, 7 of which are cc:stable. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: revert "mm: shmem: fix data-race in shmem_getattr()" ocfs2: uncache inode which has failed entering the group mm: fix NULL pointer dereference in alloc_pages_bulk_noprof mm, doc: update read_ahead_kb for MADV_HUGEPAGE fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args() sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32 mm/mremap: fix address wraparound in move_page_tables() tools/mm: fix compile error mm, swap: fix allocation and scanning race with swapoff
2024-11-16Merge tag 'trace-ringbuffer-v6.12-rc7-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring buffer fixes from Steven Rostedt: - Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU hotplug" A crash that happened on cpu hotplug was actually caused by the incorrect ref counting that was fixed by commit 2cf9733891a4 ("ring-buffer: Fix refcount setting of boot mapped buffers"). The removal of calling cpu hotplug callbacks on memory mapped buffers was not an issue even though the tests at the time pointed toward it. But in fact, there's a check in that code that tests to see if the buffers are already allocated or not, and will not allocate them again if they are. Not calling the cpu hotplug callbacks ended up not initializing the non boot CPU buffers. Simply remove that change. - Clear all CPU buffers when starting tracing in a boot mapped buffer To properly process events from a previous boot, the address space needs to be accounted for due to KASLR and the events in the buffer are updated accordingly when read. This also requires that when the buffer has tracing enabled again in the current boot that the buffers are reset so that events from the previous boot do not interact with the events of the current boot and cause confusing due to not having the proper meta data. It was found that if a CPU is taken offline, that its per CPU buffer is not reset when tracing starts. This allows for events to be from both the previous boot and the current boot to be in the buffer at the same time. Clear all CPU buffers when tracing is started in a boot mapped buffer. * tag 'trace-ringbuffer-v6.12-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/ring-buffer: Clear all memory mapped CPU ring buffers on first recording Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU hotplug"
2024-11-15Merge branches 'rcu/fixes', 'rcu/nocb', 'rcu/torture', 'rcu/stall' and ↵Frederic Weisbecker
'rcu/srcu' into rcu/dev
2024-11-15rcuscale: Remove redundant WARN_ON_ONCE() splatUladzislau Rezki (Sony)
There are two places where WARN_ON_ONCE() is called two times in the error paths. One which is encapsulated into if() condition and another one, which is unnecessary, is placed in the brackets. Remove an extra WARN_ON_ONCE() splat which is in brackets. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-15rcuscale: Do a proper cleanup if kfree_scale_init() failsUladzislau Rezki (Sony)
A static analyzer for C, Smatch, reports and triggers below warnings: kernel/rcu/rcuscale.c:1215 rcu_scale_init() warn: inconsistent returns 'global &fullstop_mutex'. The checker complains about, we do not unlock the "fullstop_mutex" mutex, in case of hitting below error path: <snip> ... if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start < 2 * HZ)) { pr_alert("ERROR: call_rcu() CBs are not being lazy as expected!\n"); WARN_ON_ONCE(1); return -1; ^^^^^^^^^^ ... <snip> it happens because "-1" is returned right away instead of doing a proper unwinding. Fix it by jumping to "unwind" label instead of returning -1. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Closes: https://lore.kernel.org/rcu/ZxfTrHuEGtgnOYWp@pc636/T/ Fixes: 084e04fff160 ("rcuscale: Add laziness and kfree tests") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-15srcu: Unconditionally record srcu_read_lock_lite() in ->srcu_reader_flavorPaul E. McKenney
Currently, srcu_read_lock_lite() uses the SRCU_READ_FLAVOR_LITE bit in ->srcu_reader_flavor to communicate to the grace-period processing in srcu_readers_active_idx_check() that the smp_mb() must be replaced by a synchronize_rcu(). Unfortunately, ->srcu_reader_flavor is not updated unless the kernel is built with CONFIG_PROVE_RCU=y. Therefore in all kernels built with CONFIG_PROVE_RCU=n, srcu_readers_active_idx_check() incorrectly uses smp_mb() instead of synchronize_rcu() for srcu_struct structures whose readers use srcu_read_lock_lite(). This commit therefore causes Tree SRCU srcu_read_lock_lite() to unconditionally update ->srcu_reader_flavor so that srcu_readers_active_idx_check() can make the correct choice. Reported-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Closes: https://lore.kernel.org/all/d07e8f4a-d5ff-4c8e-8e61-50db285c57e9@amd.com/ Fixes: c0f08d6b5a61 ("srcu: Add srcu_read_lock_lite() and srcu_read_unlock_lite()") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2024-11-15Merge branches 'pm-cpuidle' and 'pm-em'Rafael J. Wysocki
Merge cpuidle and Energy Model changes for 6.13-rc1: - Add a built-in idle states table for Granite Rapids Xeon D to the intel_idle driver (Artem Bityutskiy). - Fix some typos in comments in the cpuidle core and drivers (Shen Lichuan). - Remove iowait influence from the menu cpuidle governor (Christian Loehle). - Add min/max available performance state limits to the Energy Model management code (Lukasz Luba). * pm-cpuidle: intel_idle: add Granite Rapids Xeon D support cpuidle: Correct some typos in comments cpuidle: menu: Remove iowait influence * pm-em: PM: EM: Add min/max available performance state limits
2024-11-15bpf: use common instruction history across all statesAndrii Nakryiko
Instead of allocating and copying instruction history each time we enqueue child verifier state, switch to a model where we use one common dynamically sized array of instruction history entries across all states. The key observation for proving this is correct is that instruction history is only relevant while state is active, which means it either is a current state (and thus we are actively modifying instruction history and no other state can interfere with us) or we are checkpointed state with some children still active (either enqueued or being current). In the latter case our portion of instruction history is finalized and won't change or grow, so as long as we keep it immutable until the state is finalized, we are good. Now, when state is finalized and is put into state hash for potentially future pruning lookups, instruction history is not used anymore. This is because instruction history is only used by precision marking logic, and we never modify precision markings for finalized states. So, instead of each state having its own small instruction history, we keep a global dynamically-sized instruction history, where each state in current DFS path from root to active state remembers its portion of instruction history. Current state can append to this history, but cannot modify any of its parent histories. Async callback state enqueueing, while logically detached from parent state, still is part of verification backtracking tree, so has to follow the same schema as normal state checkpoints. Because the insn_hist array can be grown through realloc, states don't keep pointers, they instead maintain two indices, [start, end), into global instruction history array. End is exclusive index, so `start == end` means there is no relevant instruction history. This eliminates a lot of allocations and minimizes overall memory usage. For instance, running a worst-case test from [0] (but without the heuristics-based fix [1]), it took 12.5 minutes until we get -ENOMEM. With the changes in this patch the whole test succeeds in 10 minutes (very slow, so heuristics from [1] is important, of course). To further validate correctness, veristat-based comparison was performed for Meta production BPF objects and BPF selftests objects. In both cases there were no differences *at all* in terms of verdict or instruction and state counts, providing a good confidence in the change. Having this low-memory-overhead solution of keeping dynamic per-instruction history cheaply opens up some new possibilities, like keeping extra information for literally every single validated instruction. This will be used for simplifying precision backpropagation logic in follow up patches. [0] https://lore.kernel.org/bpf/20241029172641.1042523-2-eddyz87@gmail.com/ [1] https://lore.kernel.org/bpf/20241029172641.1042523-1-eddyz87@gmail.com/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241115001303.277272-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-15Merge tag 'sched_ext-for-6.12-rc7-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "One more fix for v6.12-rc7 ops.cpu_acquire() was being invoked with the wrong kfunc mask allowing the operation to call kfuncs which shouldn't be allowed. Fix it by using SCX_KF_REST instead, which is trivial and low risk" * tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST
2024-11-15workqueue: Reduce expensive locks for unbound workqueueWangyang Guo
For unbound workqueue, pwqs usually map to just a few pools. Most of the time, pwqs will be linked sequentially to wq->pwqs list by cpu index. Usually, consecutive CPUs have the same workqueue attribute (e.g. belong to the same NUMA node). This makes pwqs with the same pool cluster together in the pwq list. Only do lock/unlock if the pool has changed in flush_workqueue_prep_pwqs(). This reduces the number of expensive lock operations. The performance data shows this change boosts FIO by 65x in some cases when multiple concurrent threads write to xfs mount points with fsync. FIO Benchmark Details - FIO version: v3.35 - FIO Options: ioengine=libaio,iodepth=64,norandommap=1,rw=write, size=128M,bs=4k,fsync=1 - FIO Job Configs: 64 jobs in total writing to 4 mount points (ramdisks formatted as xfs file system). - Kernel Codebase: v6.12-rc5 - Test Platform: Xeon 8380 (2 sockets) Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-15bpf: Add necessary migrate_disable to range_tree.Yonghong Song
When running bpf selftest (./test_progs -j), the following warnings showed up: $ ./test_progs -t arena_atomics ... BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u19:0/12501 caller is bpf_mem_free+0x128/0x330 ... Call Trace: <TASK> dump_stack_lvl check_preemption_disabled bpf_mem_free range_tree_destroy arena_map_free bpf_map_free_deferred process_scheduled_works ... For selftests arena_htab and arena_list, similar smp_process_id() BUGs are dumped, and the following are two stack trace: <TASK> dump_stack_lvl check_preemption_disabled bpf_mem_alloc range_tree_set arena_map_alloc map_create ... <TASK> dump_stack_lvl check_preemption_disabled bpf_mem_alloc range_tree_clear arena_vm_fault do_pte_missing handle_mm_fault do_user_addr_fault ... Add migrate_{disable,enable}() around related bpf_mem_{alloc,free}() calls to fix the issue. Fixes: b795379757eb ("bpf: Introduce range_tree data structure and use it in bpf arena") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241115060354.2832495-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-15bpf: Do not alloc arena on unsupported archesViktor Malik
Do not allocate BPF arena on arches that do not support it, instead return EOPNOTSUPP. This is useful to prevent bugs such as soft lockups while trying to free the arena which we have witnessed on ppc64le [1]. [1] https://lore.kernel.org/bpf/4afdcb50-13f2-4772-8db1-3fd02bd985b3@redhat.com/ Signed-off-by: Viktor Malik <vmalik@redhat.com> Link: https://lore.kernel.org/r/20241115082548.74972-1-vmalik@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-14crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32Dave Vasilevsky
Fixes boot failures on 6.9 on PPC_BOOK3S_32 machines using Open Firmware. On these machines, the kernel refuses to boot from non-zero PHYSICAL_START, which occurs when CRASH_DUMP is on. Since most PPC_BOOK3S_32 machines boot via Open Firmware, it should default to off for them. Users booting via some other mechanism can still turn it on explicitly. Does not change the default on any other architectures for the time being. Link: https://lkml.kernel.org/r/20240917163720.1644584-1-dave@vasilevsky.ca Fixes: 75bc255a7444 ("crash: clean up kdump related config items") Signed-off-by: Dave Vasilevsky <dave@vasilevsky.ca> Reported-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de> Closes: https://lists.debian.org/debian-powerpc/2024/07/msg00001.html Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Baoquan He <bhe@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Reimar Döffinger <Reimar.Doeffinger@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14sched_ext: Replace scx_next_task_picked() with switch_class() in commentZhao Mengmeng
scx_next_task_picked() has been replaced with siwtch_class(), but comment is still referencing old one, so replace it. Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-14scftorture: Handle NULL argument passed to scf_add_to_free_list().Sebastian Andrzej Siewior
Dan reported that after the rework the newly introduced scf_add_to_free_list() may get a NULL pointer passed. This replaced kfree() which was fine with a NULL pointer but scf_add_to_free_list() isn't. Let scf_add_to_free_list() handle NULL pointer. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/2375aa2c-3248-4ffa-b9b0-f0a24c50f237@stanley.mountain Fixes: 4788c861ad7e9 ("scftorture: Use a lock-less list to free memory.") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2024-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc8). Conflicts: tools/testing/selftests/net/.gitignore 252e01e68241 ("selftests: net: add netlink-dumps to .gitignore") be43a6b23829 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw") https://lore.kernel.org/all/20241113122359.1b95180a@canb.auug.org.au/ drivers/net/phy/phylink.c 671154f174e0 ("net: phylink: ensure PHY momentary link-fails are handled") 7530ea26c810 ("net: phylink: remove "using_mac_select_pcs"") Adjacent changes: drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c 5b366eae7193 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines") e96321fad3ad ("net: ethernet: Switch back to struct platform_driver::remove()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14sched_ext: ops.cpu_acquire() should be called with SCX_KF_RESTTejun Heo
ops.cpu_acquire() is currently called with 0 kf_maks which is interpreted as SCX_KF_UNLOCKED which allows all unlocked kfuncs, but ops.cpu_acquire() is called from balance_one() under the rq lock and should only be allowed call kfuncs that are safe under the rq lock. Update it to use SCX_KF_REST. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: David Vernet <void@manifault.com> Cc: Zhao Mengmeng <zhaomzhao@126.com> Link: http://lkml.kernel.org/r/ZzYvf2L3rlmjuKzh@slm.duckdns.org Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
2024-11-14cgroup/cpuset: Disable cpuset_cpumask_can_shrink() test if not load balancingWaiman Long
With some recent proposed changes [1] in the deadline server code, it has caused a test failure in test_cpuset_prs.sh when a change is being made to an isolated partition. This is due to failing the cpuset_cpumask_can_shrink() check for SCHED_DEADLINE tasks at validate_change(). This is actually a false positive as the failed test case involves an isolated partition with load balancing disabled. The deadline check is not meaningful in this case and the users should know what they are doing. Fix this by doing the cpuset_cpumask_can_shrink() check only when loading balanced is enabled. Also change its arguments to use effective_cpus for the current cpuset and user_xcpus() as an approiximation for the target effective_cpus as the real effective_cpus hasn't been fully computed yet as this early stage. As the check isn't comprehensive, there may be false positives or negatives. We may have to revise the code to do a more thorough check in the future if this becomes a concern. [1] https://lore.kernel.org/lkml/82be06c1-6d6d-4651-86c9-bcc828cbcb80@redhat.com/T/#t Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-11-14tracing/ring-buffer: Clear all memory mapped CPU ring buffers on first recordingSteven Rostedt
The events of a memory mapped ring buffer from the previous boot should not be mixed in with events from the current boot. There's meta data that is used to handle KASLR so that function names can be shown properly. Also, since the timestamps of the previous boot have no meaning to the timestamps of the current boot, having them intermingled in a buffer can also cause confusion because there could possibly be events in the future. When a trace is activated the meta data is reset so that the pointers of are now processed for the new address space. The trace buffers are reset when tracing starts for the first time. The problem here is that the reset only happens on online CPUs. If a CPU is offline, it does not get reset. To demonstrate the issue, a previous boot had tracing enabled in the boot mapped ring buffer on reboot. On the following boot, tracing has not been started yet so the function trace from the previous boot is still visible. # trace-cmd show -B boot_mapped -c 3 | tail <idle>-0 [003] d.h2. 156.462395: __rcu_read_lock <-cpu_emergency_disable_virtualization <idle>-0 [003] d.h2. 156.462396: vmx_emergency_disable_virtualization_cpu <-cpu_emergency_disable_virtualization <idle>-0 [003] d.h2. 156.462396: __rcu_read_unlock <-__sysvec_reboot <idle>-0 [003] d.h2. 156.462397: stop_this_cpu <-__sysvec_reboot <idle>-0 [003] d.h2. 156.462397: set_cpu_online <-stop_this_cpu <idle>-0 [003] d.h2. 156.462397: disable_local_APIC <-stop_this_cpu <idle>-0 [003] d.h2. 156.462398: clear_local_APIC <-disable_local_APIC <idle>-0 [003] d.h2. 156.462574: mcheck_cpu_clear <-stop_this_cpu <idle>-0 [003] d.h2. 156.462575: mce_intel_feature_clear <-stop_this_cpu <idle>-0 [003] d.h2. 156.462575: lmce_supported <-mce_intel_feature_clear Now, if CPU 3 is taken offline, and tracing is started on the memory mapped ring buffer, the events from the previous boot in the CPU 3 ring buffer is not reset. Now those events are using the meta data from the current boot and produces just hex values. # echo 0 > /sys/devices/system/cpu/cpu3/online # trace-cmd start -B boot_mapped -p function # trace-cmd show -B boot_mapped -c 3 | tail <idle>-0 [003] d.h2. 156.462395: 0xffffffff9a1e3194 <-0xffffffff9a0f655e <idle>-0 [003] d.h2. 156.462396: 0xffffffff9a0a1d24 <-0xffffffff9a0f656f <idle>-0 [003] d.h2. 156.462396: 0xffffffff9a1e6bc4 <-0xffffffff9a0f7323 <idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0d12b4 <-0xffffffff9a0f732a <idle>-0 [003] d.h2. 156.462397: 0xffffffff9a1458d4 <-0xffffffff9a0d12e2 <idle>-0 [003] d.h2. 156.462397: 0xffffffff9a0faed4 <-0xffffffff9a0d12e7 <idle>-0 [003] d.h2. 156.462398: 0xffffffff9a0faaf4 <-0xffffffff9a0faef2 <idle>-0 [003] d.h2. 156.462574: 0xffffffff9a0e3444 <-0xffffffff9a0d12ef <idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e4964 <-0xffffffff9a0d12ef <idle>-0 [003] d.h2. 156.462575: 0xffffffff9a0e3fb0 <-0xffffffff9a0e496f Reset all CPUs when starting a boot mapped ring buffer for the first time, and not just the online CPUs. Fixes: 7a1d1e4b9639f ("tracing/ring-buffer: Add last_boot_info file to boot instance") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-14Revert: "ring-buffer: Do not have boot mapped buffers hook to CPU hotplug"Steven Rostedt
A crash happened when testing cpu hotplug with respect to the memory mapped ring buffers. It was assumed that the hot plug code was adding a per CPU buffer that was already created that caused the crash. The real problem was due to ref counting and was fixed by commit 2cf9733891a4 ("ring-buffer: Fix refcount setting of boot mapped buffers"). When a per CPU buffer is created, it will not be created again even with CPU hotplug, so the fix to not use CPU hotplug was a red herring. In fact, it caused only the boot CPU buffer to be created, leaving the other CPU per CPU buffers disabled. Revert that change as it was not the culprit of the fix it was intended to be. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20241113230839.6c03640f@gandalf.local.home Fixes: 912da2c384d5 ("ring-buffer: Do not have boot mapped buffers hook to CPU hotplug") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-14Merge branches 'for-next/gcs', 'for-next/probes', 'for-next/asm-offsets', ↵Catalin Marinas
'for-next/tlb', 'for-next/misc', 'for-next/mte', 'for-next/sysreg', 'for-next/stacktrace', 'for-next/hwcap3', 'for-next/kselftest', 'for-next/crc32', 'for-next/guest-cca', 'for-next/haft' and 'for-next/scs', remote-tracking branch 'arm64/for-next/perf' into for-next/core * arm64/for-next/perf: perf: Switch back to struct platform_driver::remove() perf: arm_pmuv3: Add support for Samsung Mongoose PMU dt-bindings: arm: pmu: Add Samsung Mongoose core compatible perf/dwc_pcie: Fix typos in event names perf/dwc_pcie: Add support for Ampere SoCs ARM: pmuv3: Add missing write_pmuacr() perf/marvell: Marvell PEM performance monitor support perf/arm_pmuv3: Add PMUv3.9 per counter EL0 access control perf/dwc_pcie: Convert the events with mixed case to lowercase perf/cxlpmu: Support missing events in 3.1 spec perf: imx_perf: add support for i.MX91 platform dt-bindings: perf: fsl-imx-ddr: Add i.MX91 compatible drivers perf: remove unused field pmu_node * for-next/gcs: (42 commits) : arm64 Guarded Control Stack user-space support kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c arm64/gcs: Fix outdated ptrace documentation kselftest/arm64: Ensure stable names for GCS stress test results kselftest/arm64: Validate that GCS push and write permissions work kselftest/arm64: Enable GCS for the FP stress tests kselftest/arm64: Add a GCS stress test kselftest/arm64: Add GCS signal tests kselftest/arm64: Add test coverage for GCS mode locking kselftest/arm64: Add a GCS test program built with the system libc kselftest/arm64: Add very basic GCS test program kselftest/arm64: Always run signals tests with GCS enabled kselftest/arm64: Allow signals tests to specify an expected si_code kselftest/arm64: Add framework support for GCS to signal handling tests kselftest/arm64: Add GCS as a detected feature in the signal tests kselftest/arm64: Verify the GCS hwcap arm64: Add Kconfig for Guarded Control Stack (GCS) arm64/ptrace: Expose GCS via ptrace and core files arm64/signal: Expose GCS state in signal frames arm64/signal: Set up and restore the GCS context for signal handlers arm64/mm: Implement map_shadow_stack() ... * for-next/probes: : Various arm64 uprobes/kprobes cleanups arm64: insn: Simulate nop instruction for better uprobe performance arm64: probes: Remove probe_opcode_t arm64: probes: Cleanup kprobes endianness conversions arm64: probes: Move kprobes-specific fields arm64: probes: Fix uprobes for big-endian kernels arm64: probes: Fix simulate_ldr*_literal() arm64: probes: Remove broken LDR (literal) uprobe support * for-next/asm-offsets: : arm64 asm-offsets.c cleanup (remove unused offsets) arm64: asm-offsets: remove PREEMPT_DISABLE_OFFSET arm64: asm-offsets: remove DMA_{TO,FROM}_DEVICE arm64: asm-offsets: remove VM_EXEC and PAGE_SZ arm64: asm-offsets: remove MM_CONTEXT_ID arm64: asm-offsets: remove COMPAT_{RT_,SIGFRAME_REGS_OFFSET arm64: asm-offsets: remove VMA_VM_* arm64: asm-offsets: remove TSK_ACTIVE_MM * for-next/tlb: : TLB flushing optimisations arm64: optimize flush tlb kernel range arm64: tlbflush: add __flush_tlb_range_limit_excess() * for-next/misc: : Miscellaneous patches arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled arm64/ptrace: Clarify documentation of VL configuration via ptrace acpi/arm64: remove unnecessary cast arm64/mm: Change protval as 'pteval_t' in map_range() arm64: uprobes: Optimize cache flushes for xol slot acpi/arm64: Adjust error handling procedure in gtdt_parse_timer_block() arm64: fix .data.rel.ro size assertion when CONFIG_LTO_CLANG arm64/ptdump: Test both PTE_TABLE_BIT and PTE_VALID for block mappings arm64/mm: Sanity check PTE address before runtime P4D/PUD folding arm64/mm: Drop setting PTE_TYPE_PAGE in pte_mkcont() ACPI: GTDT: Tighten the check for the array of platform timer structures arm64/fpsimd: Fix a typo arm64: Expose ID_AA64ISAR1_EL1.XS to sanitised feature consumers arm64: Return early when break handler is found on linked-list arm64/mm: Re-organize arch_make_huge_pte() arm64/mm: Drop _PROT_SECT_DEFAULT arm64: Add command-line override for ID_AA64MMFR0_EL1.ECV arm64: head: Drop SWAPPER_TABLE_SHIFT arm64: cpufeature: add POE to cpucap_is_possible() arm64/mm: Change pgattr_change_is_safe() arguments as pteval_t * for-next/mte: : Various MTE improvements selftests: arm64: add hugetlb mte tests hugetlb: arm64: add mte support * for-next/sysreg: : arm64 sysreg updates arm64/sysreg: Update ID_AA64MMFR1_EL1 to DDI0601 2024-09 * for-next/stacktrace: : arm64 stacktrace improvements arm64: preserve pt_regs::stackframe during exec*() arm64: stacktrace: unwind exception boundaries arm64: stacktrace: split unwind_consume_stack() arm64: stacktrace: report recovered PCs arm64: stacktrace: report source of unwind data arm64: stacktrace: move dump_backtrace() to kunwind_stack_walk() arm64: use a common struct frame_record arm64: pt_regs: swap 'unused' and 'pmr' fields arm64: pt_regs: rename "pmr_save" -> "pmr" arm64: pt_regs: remove stale big-endian layout arm64: pt_regs: assert pt_regs is a multiple of 16 bytes * for-next/hwcap3: : Add AT_HWCAP3 support for arm64 (also wire up AT_HWCAP4) arm64: Support AT_HWCAP3 binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4 * for-next/kselftest: (30 commits) : arm64 kselftest fixes/cleanups kselftest/arm64: Try harder to generate different keys during PAC tests kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all() kselftest/arm64: Corrupt P0 in the irritator when testing SSVE kselftest/arm64: Add FPMR coverage to fp-ptrace kselftest/arm64: Expand the set of ZA writes fp-ptrace does kselftets/arm64: Use flag bits for features in fp-ptrace assembler code kselftest/arm64: Enable build of PAC tests with LLVM=1 kselftest/arm64: Check that SVCR is 0 in signal handlers kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests kselftest/arm64: Fix build with stricter assemblers kselftest/arm64: Test signal handler state modification in fp-stress kselftest/arm64: Provide a SIGUSR1 handler in the kernel mode FP stress test kselftest/arm64: Implement irritators for ZA and ZT kselftest/arm64: Remove unused ADRs from irritator handlers kselftest/arm64: Correct misleading comments on fp-stress irritators kselftest/arm64: Poll less often while waiting for fp-stress children kselftest/arm64: Increase frequency of signal delivery in fp-stress kselftest/arm64: Fix encoding for SVE B16B16 test ... * for-next/crc32: : Optimise CRC32 using PMULL instructions arm64/crc32: Implement 4-way interleave using PMULL arm64/crc32: Reorganize bit/byte ordering macros arm64/lib: Handle CRC-32 alternative in C code * for-next/guest-cca: : Support for running Linux as a guest in Arm CCA arm64: Document Arm Confidential Compute virt: arm-cca-guest: TSM_REPORT support for realms arm64: Enable memory encrypt for Realms arm64: mm: Avoid TLBI when marking pages as valid arm64: Enforce bounce buffers for realm DMA efi: arm64: Map Device with Prot Shared arm64: rsi: Map unprotected MMIO as decrypted arm64: rsi: Add support for checking whether an MMIO is protected arm64: realm: Query IPA size from the RMM arm64: Detect if in a realm and set RIPAS RAM arm64: rsi: Add RSI definitions * for-next/haft: : Support for arm64 FEAT_HAFT arm64: pgtable: Warn unexpected pmdp_test_and_clear_young() arm64: Enable ARCH_HAS_NONLEAF_PMD_YOUNG arm64: Add support for FEAT_HAFT arm64: setup: name 'tcr2' register arm64/sysreg: Update ID_AA64MMFR1_EL1 register * for-next/scs: : Dynamic shadow call stack fixes arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux() arm64/scs: Deal with 64-bit relative offsets in FDE frames arm64/scs: Fix handling of DWARF augmentation data in CIE/FDE frames
2024-11-14Merge tag 'loongarch-kvm-6.13' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.13 1. Add iocsr and mmio bus simulation in kernel. 2. Add in-kernel interrupt controller emulation. 3. Add virt extension support for eiointc irqchip.
2024-11-14Merge tag 'kvmarm-6.13' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 changes for 6.13, part #1 - Support for stage-1 permission indirection (FEAT_S1PIE) and permission overlays (FEAT_S1POE), including nested virt + the emulated page table walker - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call was introduced in PSCIv1.3 as a mechanism to request hibernation, similar to the S4 state in ACPI - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As part of it, introduce trivial initialization of the host's MPAM context so KVM can use the corresponding traps - PMU support under nested virtualization, honoring the guest hypervisor's trap configuration and event filtering when running a nested guest - Fixes to vgic ITS serialization where stale device/interrupt table entries are not zeroed when the mapping is invalidated by the VM - Avoid emulated MMIO completion if userspace has requested synchronous external abort injection - Various fixes and cleanups affecting pKVM, vCPU initialization, and selftests
2024-11-14dma-mapping: save base/size instead of pointer to shared DMA poolGeert Uytterhoeven
On RZ/Five, which is non-coherent, and uses CONFIG_DMA_GLOBAL_POOL=y: Oops - store (or AMO) access fault [#1] CPU: 0 UID: 0 PID: 1 Comm: swapper Not tainted 6.12.0-rc1-00015-g8a6e02d0c00e #201 Hardware name: Renesas SMARC EVK based on r9a07g043f01 (DT) epc : __memset+0x60/0x100 ra : __dma_alloc_from_coherent+0x150/0x17a epc : ffffffff8062d2bc ra : ffffffff80053a94 sp : ffffffc60000ba20 gp : ffffffff812e9938 tp : ffffffd601920000 t0 : ffffffc6000d0000 t1 : 0000000000000000 t2 : ffffffffe9600000 s0 : ffffffc60000baa0 s1 : ffffffc6000d0000 a0 : ffffffc6000d0000 a1 : 0000000000000000 a2 : 0000000000001000 a3 : ffffffc6000d1000 a4 : 0000000000000000 a5 : 0000000000000000 a6 : ffffffd601adacc0 a7 : ffffffd601a841a8 s2 : ffffffd6018573c0 s3 : 0000000000001000 s4 : ffffffd6019541e0 s5 : 0000000200000022 s6 : ffffffd6018f8410 s7 : ffffffd6018573e8 s8 : 0000000000000001 s9 : 0000000000000001 s10: 0000000000000010 s11: 0000000000000000 t3 : 0000000000000000 t4 : ffffffffdefe62d1 t5 : 000000001cd6a3a9 t6 : ffffffd601b2aad6 status: 0000000200000120 badaddr: ffffffc6000d0000 cause: 0000000000000007 [<ffffffff8062d2bc>] __memset+0x60/0x100 [<ffffffff80053e1a>] dma_alloc_from_global_coherent+0x1c/0x28 [<ffffffff80053056>] dma_direct_alloc+0x98/0x112 [<ffffffff8005238c>] dma_alloc_attrs+0x78/0x86 [<ffffffff8035fdb4>] rz_dmac_probe+0x3f6/0x50a [<ffffffff803a0694>] platform_probe+0x4c/0x8a If CONFIG_DMA_GLOBAL_POOL=y, the reserved_mem structure passed to rmem_dma_setup() is saved for later use, by saving the passed pointer. However, when dma_init_reserved_memory() is called later, the pointer has become stale, causing a crash. E.g. in the RZ/Five case, the referenced memory now contains the reserved_mem structure for the "mmode_resv0@30000" node (with base 0x30000 and size 0x10000), instead of the correct "pma_resv0@58000000" node (with base 0x58000000 and size 0x8000000). Fix this by saving the needed reserved_mem structure's contents instead. Fixes: 8a6e02d0c00e7b62 ("of: reserved_mem: Restructure how the reserved memory regions are processed") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Oreoluwa Babatunde <quic_obabatun@quicinc.com> Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-11-14perf/core: Correct perf sampling with guest VMsColton Lewis
Previously any PMU overflow interrupt that fired while a VCPU was loaded was recorded as a guest event whether it truly was or not. This resulted in nonsense perf recordings that did not honor perf_event_attr.exclude_guest and recorded guest IPs where it should have recorded host IPs. Rework the sampling logic to only record guest samples for events with exclude_guest = 0. This way any host-only events with exclude_guest set will never see unexpected guest samples. The behaviour of events with exclude_guest = 0 is unchanged. Note that events configured to sample both host and guest may still misattribute a PMI that arrived in the host as a guest event depending on KVM arch and vendor behavior. Signed-off-by: Colton Lewis <coltonlewis@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241113190156.2145593-6-coltonlewis@google.com
2024-11-14perf/core: Hoist perf_instruction_pointer() and perf_misc_flags()Colton Lewis
For clarity, rename the arch-specific definitions of these functions to perf_arch_* to denote they are arch-specifc. Define the generic-named functions in one place where they can call the arch-specific ones as needed. Signed-off-by: Colton Lewis <coltonlewis@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Thomas Richter <tmricht@linux.ibm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Madhavan Srinivasan <maddy@linux.ibm.com> Acked-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/20241113190156.2145593-3-coltonlewis@google.com
2024-11-13bpf: Introduce range_tree data structure and use it in bpf arenaAlexei Starovoitov
Introduce range_tree data structure and use it in bpf arena to track ranges of allocated pages. range_tree is a large bitmap that is implemented as interval tree plus rbtree. The contiguous sequence of bits represents unallocated pages. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20241108025616.17625-2-alexei.starovoitov@gmail.com
2024-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov
Cross-merge bpf fixes after downstream PR. In particular to bring the fix in commit aa30eb3260b2 ("bpf: Force checkpoint when jmp history is too long"). The follow up verifier work depends on it. And the fix in commit 6801cf7890f2 ("selftests/bpf: Use -4095 as the bad address for bits iterator"). It's fixing instability of BPF CI on s390 arch. No conflicts. Adjacent changes in: Auto-merging arch/Kconfig Auto-merging kernel/bpf/helpers.c Auto-merging kernel/bpf/memalloc.c Auto-merging kernel/bpf/verifier.c Auto-merging mm/slab_common.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-13genirq/proc: Use seq_put_decimal_ull_width() for decimal valuesDavid Wang
seq_printf() is more expensive than seq_put_decimal_ull_width() due to the format string parsing costs. Profiling on a x86 8-core system indicates seq_printf() takes ~47% samples of show_interrupts(). Replacing it with seq_put_decimal_ull_width() yields almost 30% performance gain. [ tglx: Massaged changelog and fixed up coding style ] Signed-off-by: David Wang <00107082@163.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241108160717.9547-1-00107082@163.com
2024-11-12bpf: Add kernel symbol for struct_ops trampolineXu Kuohai
Without kernel symbols for struct_ops trampoline, the unwinder may produce unexpected stacktraces. For example, the x86 ORC and FP unwinders check if an IP is in kernel text by verifying the presence of the IP's kernel symbol. When a struct_ops trampoline address is encountered, the unwinder stops due to the absence of symbol, resulting in an incomplete stacktrace that consists only of direct and indirect child functions called from the trampoline. The arm64 unwinder is another example. While the arm64 unwinder can proceed across a struct_ops trampoline address, the corresponding symbol name is displayed as "unknown", which is confusing. Thus, add kernel symbol for struct_ops trampoline. The name is bpf__<struct_ops_name>_<member_name>, where <struct_ops_name> is the type name of the struct_ops, and <member_name> is the name of the member that the trampoline is linked to. Below is a comparison of stacktraces captured on x86 by perf record, before and after this patch. Before: ffffffff8116545d __lock_acquire+0xad ([kernel.kallsyms]) ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms]) ffffffff813088f4 __bpf_prog_enter+0x34 ([kernel.kallsyms]) After: ffffffff811656bd __lock_acquire+0x30d ([kernel.kallsyms]) ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms]) ffffffff81309024 __bpf_prog_enter+0x34 ([kernel.kallsyms]) ffffffffc000d7e9 bpf__tcp_congestion_ops_cong_avoid+0x3e ([kernel.kallsyms]) ffffffff81f250a5 tcp_ack+0x10d5 ([kernel.kallsyms]) ffffffff81f27c66 tcp_rcv_established+0x3b6 ([kernel.kallsyms]) ffffffff81f3ad03 tcp_v4_do_rcv+0x193 ([kernel.kallsyms]) ffffffff81d65a18 __release_sock+0xd8 ([kernel.kallsyms]) ffffffff81d65af4 release_sock+0x34 ([kernel.kallsyms]) ffffffff81f15c4b tcp_sendmsg+0x3b ([kernel.kallsyms]) ffffffff81f663d7 inet_sendmsg+0x47 ([kernel.kallsyms]) ffffffff81d5ab40 sock_write_iter+0x160 ([kernel.kallsyms]) ffffffff8149c67b vfs_write+0x3fb ([kernel.kallsyms]) ffffffff8149caf6 ksys_write+0xc6 ([kernel.kallsyms]) ffffffff8149cb5d __x64_sys_write+0x1d ([kernel.kallsyms]) ffffffff81009200 x64_sys_call+0x1d30 ([kernel.kallsyms]) ffffffff82232d28 do_syscall_64+0x68 ([kernel.kallsyms]) ffffffff8240012f entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms]) Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112145849.3436772-4-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>