summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2023-04-24Merge tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linuxLinus Torvalds
Pull ITER_UBUF updates from Jens Axboe: "This turns singe vector imports into ITER_UBUF, rather than ITER_IOVEC. The former is more trivial to iterate and advance, and hence a bit more efficient. From some very unscientific testing, ~60% of all iovec imports are single vector" * tag 'iter-ubuf.2-2023-04-21' of git://git.kernel.dk/linux: iov_iter: Mark copy_compat_iovec_from_user() noinline iov_iter: import single vector iovecs as ITER_UBUF iov_iter: convert import_single_range() to ITER_UBUF iov_iter: overlay struct iovec and ubuf/len iov_iter: set nr_segs = 1 for ITER_UBUF iov_iter: remove iov_iter_iovec() iov_iter: add iter_iov_addr() and iter_iov_len() helpers ALSA: pcm: check for user backed iterator, not specific iterator type IB/qib: check for user backed iterator, not specific iterator type IB/hfi1: check for user backed iterator, not specific iterator type iov_iter: add iter_iovec() helper block: ensure bio_alloc_map_data() deals with ITER_UBUF correctly
2023-04-24Merge branches 'pm-core', 'pm-sleep', 'pm-opp' and 'pm-tools'Rafael J. Wysocki
Merge PM core changes, updates related to system sleep support, operating performance points (OPP) changes and power management utilities changes for 6.4-rc1: - Drop unnecessary (void *) conversions from the PM core (Li zeming). - Add sysfs files to represent time spent in a platform sleep state during suspend-to-idle and make AMD and Intel PMC drivers use them (Mario Limonciello). - Use of_property_present() for testing DT property presence (Rob Herring). - Add set_required_opps() callback to the 'struct opp_table', to make the code paths cleaner (Viresh Kumar). - Update the pm-graph siute of utilities to v5.11 with the following changes: * New script which allows users to install the latest pm-graph from the upstream github repo. * Update all the dmesg suspend/resume PM print formats to be able to process recent timelines using dmesg only. * Add ethtool output to the log for the system's ethernet device if ethtool exists. * Make the tool more robustly handle events where mangled dmesg or ftrace outputs do not include all the requisite data. - Make the sleepgraph utility recognize "CPU killed" messages (Xueqin Luo). * pm-core: PM: core: Remove unnecessary (void *) conversions * pm-sleep: platform/x86/intel/pmc: core: Report duration of time in HW sleep state platform/x86/intel/pmc: core: Always capture counters on suspend platform/x86/amd: pmc: Report duration of time in hw sleep state PM: Add sysfs files to represent time spent in hardware sleep state * pm-opp: OPP: Move required opps configuration to specialized callback OPP: Handle all genpd cases together in _set_required_opps() opp: Use of_property_present() for testing DT property presence * pm-tools: PM: tools: sleepgraph: Recognize "CPU killed" messages pm-graph: Update to v5.11
2023-04-24Merge branch 'pm-cpufreq'Rafael J. Wysocki
Merge cpufreq updates for 6.4-rc1: - Fix the frequency unit in cpufreq_verify_current_freq checks() (Sanjay Chandrashekara). - Make mode_state_machine in amd-pstate static (Tom Rix). - Make the cpufreq core require drivers with target_index() to set freq_table (Viresh Kumar). - Fix typo in the ARM_BRCMSTB_AVS_CPUFREQ Kconfig entry (Jingyu Wang). - Use of_property_read_bool() for boolean properties in the pmac32 cpufreq driver (Rob Herring). - Make the cpufreq sysfs interface return proper error codes on obviously invalid input (qinyu). - Add guided autonomous mode support to the AMD P-state driver (Wyes Karny). - Make the Intel P-state driver enable HWP IO boost on all server platforms (Srinivas Pandruvada). - Add opp and bandwidth support to tegra194 cpufreq driver (Sumit Gupta). - Use of_property_present() for testing DT property presence (Rob Herring). - Remove MODULE_LICENSE in non-modules (Nick Alcock). - Add SM7225 to cpufreq-dt-platdev blocklist (Luca Weiss). - Optimizations and fixes for qcom-cpufreq-hw driver (Krzysztof Kozlowski, Konrad Dybcio, and Bjorn Andersson). - DT binding updates for qcom-cpufreq-hw driver (Konrad Dybcio and Bartosz Golaszewski). - Updates and fixes for mediatek driver (Jia-Wei Chang and AngeloGioacchino Del Regno). * pm-cpufreq: (29 commits) cpufreq: use correct unit when verify cur freq cpufreq: tegra194: add OPP support and set bandwidth cpufreq: amd-pstate: Make varaiable mode_state_machine static cpufreq: drivers with target_index() must set freq_table cpufreq: qcom-cpufreq-hw: Revert adding cpufreq qos dt-bindings: cpufreq: cpufreq-qcom-hw: Add QCM2290 dt-bindings: cpufreq: cpufreq-qcom-hw: Sanitize data per compatible dt-bindings: cpufreq: cpufreq-qcom-hw: Allow just 1 frequency domain cpufreq: Add SM7225 to cpufreq-dt-platdev blocklist cpufreq: qcom-cpufreq-hw: fix double IO unmap and resource release on exit cpufreq: mediatek: Raise proc and sram max voltage for MT7622/7623 cpufreq: mediatek: raise proc/sram max voltage for MT8516 cpufreq: mediatek: fix KP caused by handler usage after regulator_put/clk_put cpufreq: mediatek: fix passing zero to 'PTR_ERR' cpufreq: pmac32: Use of_property_read_bool() for boolean properties cpufreq: Fix typo in the ARM_BRCMSTB_AVS_CPUFREQ Kconfig entry cpufreq: warn about invalid vals to scaling_max/min_freq interfaces Documentation: cpufreq: amd-pstate: Update amd_pstate status sysfs for guided cpufreq: amd-pstate: Add guided mode control support via sysfs cpufreq: amd-pstate: Add guided autonomous mode ...
2023-04-24Merge branches 'acpi-bus', 'acpi-video' and 'acpi-misc'Rafael J. Wysocki
Merge ACPI bus type driver changes, ACPI backlight driver updates and a series of cleanups related to of.h for 6.4-rc1: - Ensure that ACPI notify handlers are not running after removal and clean up code in acpi_sb_notify() (Rafael Wysocki). - Remove register_backlight_delay module option and code and remove quirks for false-positive backlight control support advertised on desktop boards (Hans de Goede). - Replace irqdomain.h include with struct declarations in ACPI headers and update several pieces of code previously including of.h implicitly through those headers (Rob Herring). * acpi-bus: ACPI: bus: Ensure that notify handlers are not running after removal ACPI: bus: Add missing braces to acpi_sb_notify() * acpi-video: ACPI: video: Remove desktops without backlight DMI quirks ACPI: video: Remove register_backlight_delay module option and code * acpi-misc: ACPI: Replace irqdomain.h include with struct declarations fpga: lattice-sysconfig-spi: Add explicit include for of.h tpm: atmel: Add explicit include for of.h virtio-mmio: Add explicit include for of.h pata: ixp4xx: Add explicit include for of.h ata: pata_macio: Add explicit include of irqdomain.h serial: 8250_tegra: Add explicit include for of.h net: rfkill-gpio: Add explicit include for of.h staging: iio: resolver: ad2s1210: Add explicit include for of.h iio: adc: ad7292: Add explicit include for of.h
2023-04-24linux/vt_buffer.h: allow either builtin or modular for macrosRandy Dunlap
Fix build errors on ARCH=alpha when CONFIG_MDA_CONSOLE=m. This allows the ARCH macros to be the only ones defined. In file included from ../drivers/video/console/mdacon.c:37: ../arch/alpha/include/asm/vga.h:17:40: error: expected identifier or '(' before 'volatile' 17 | static inline void scr_writew(u16 val, volatile u16 *addr) | ^~~~~~~~ ../include/linux/vt_buffer.h:24:34: note: in definition of macro 'scr_writew' 24 | #define scr_writew(val, addr) (*(addr) = (val)) | ^~~~ ../include/linux/vt_buffer.h:24:40: error: expected ')' before '=' token 24 | #define scr_writew(val, addr) (*(addr) = (val)) | ^ ../arch/alpha/include/asm/vga.h:17:20: note: in expansion of macro 'scr_writew' 17 | static inline void scr_writew(u16 val, volatile u16 *addr) | ^~~~~~~~~~ ../arch/alpha/include/asm/vga.h:25:29: error: expected identifier or '(' before 'volatile' 25 | static inline u16 scr_readw(volatile const u16 *addr) | ^~~~~~~~ Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: dri-devel@lists.freedesktop.org Cc: linux-fbdev@vger.kernel.org Signed-off-by: Helge Deller <deller@gmx.de>
2023-04-24platform/chrome: Replace fake flexible arrays with flexible-array memberGustavo A. R. Silva
Zero-length arrays as fake flexible arrays are deprecated and we are moving towards adopting C99 flexible-array members instead. Use the DECLARE_FLEX_ARRAY() helper macro to transform zero-length arrays in unions with flexible-array members. Address the following warning found with GCC-13 and -fstrict-flex-arrays=3 enabled: drivers/iio/accel/cros_ec_accel_legacy.c:66:46: warning: array subscript <unknown> is outside array bounds of ‘struct ec_response_motion_sensor_data[0]’ [-Warray-bounds=] This helps with the ongoing efforts to tighten the FORTIFY_SOURCE routines on memcpy() and help us make progress towards globally enabling -fstrict-flex-arrays=3 [1]. Link: https://github.com/KSPP/linux/issues/21 Link: https://github.com/KSPP/linux/issues/193 Link: https://github.com/KSPP/linux/issues/262 Link: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602902.html [1] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org> Link: https://lore.kernel.org/r/ZAZUGBmSLc5wg7AK@work
2023-04-24ksmbd: fix racy issue from using ->d_parent and ->d_nameNamjae Jeon
Al pointed out that ksmbd has racy issue from using ->d_parent and ->d_name in ksmbd_vfs_unlink and smb2_vfs_rename(). and use new lock_rename_child() to lock stable parent while underlying rename racy. Introduce vfs_path_parent_lookup helper to avoid out of share access and export vfs functions like the following ones to use vfs_path_parent_lookup(). - rename __lookup_hash() to lookup_one_qstr_excl(). - export lookup_one_qstr_excl(). - export getname_kernel() and putname(). vfs_path_parent_lookup() is used for parent lookup of destination file using absolute pathname given from FILE_RENAME_INFORMATION request. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2023-04-24Merge tag 'pull-lock_rename_child' of ↵Steve French
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into ksmbd-for-next lock_rename_child() (for ksmbd folks) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-04-23serdev: Add method to assert break signal over tty UART portNeeraj Sanjay Kale
Adds serdev_device_break_ctl() and an implementation for ttyport. This function simply calls the break_ctl in tty layer, which can assert a break signal over UART-TX line, if the tty and the underlying platform and UART peripheral supports this operation. Signed-off-by: Neeraj Sanjay Kale <neeraj.sanjaykale@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23serdev: Replace all instances of ENOTSUPP with EOPNOTSUPPNeeraj Sanjay Kale
This replaces all instances of ENOTSUPP with EOPNOTSUPP since ENOTSUPP is not a standard error code. This will help maintain consistency in error codes when new serdev API's are added. Signed-off-by: Neeraj Sanjay Kale <neeraj.sanjaykale@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-04-23net: dsa: tag_ocelot: call only the relevant portion of __skb_vlan_pop() on TXVladimir Oltean
ocelot_xmit_get_vlan_info() calls __skb_vlan_pop() as the most appropriate helper I could find which strips away a VLAN header. That's all I need it to do, but __skb_vlan_pop() has more logic, which will become incompatible with the future revert of commit 6d1ccff62780 ("net: reset mac header in dev_start_xmit()"). Namely, it performs a sanity check on skb_mac_header(), which will stop being set after the above revert, so it will return an error instead of removing the VLAN tag. ocelot_xmit_get_vlan_info() gets called in 2 circumstances: (1) the port is under a VLAN-aware bridge and the bridge sends VLAN-tagged packets (2) the port is under a VLAN-aware bridge and somebody else (an 8021q upper) sends VLAN-tagged packets (using a VID that isn't in the bridge vlan tables) In case (1), there is actually no bug to defend against, because br_dev_xmit() calls skb_reset_mac_header() and things continue to work. However, in case (2), illustrated using the commands below, it can be seen that our intervention is needed, since __skb_vlan_pop() complains: $ ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up $ ip link set $eth master br0 && ip link set $eth up $ ip link add link $eth name $eth.100 type vlan id 100 && ip link set $eth.100 up $ ip addr add 192.168.100.1/24 dev $eth.100 I could fend off the checks in __skb_vlan_pop() with some skb_mac_header_was_set() calls, but seeing how few callers of __skb_vlan_pop() there are from TX paths, that seems rather unproductive. As an alternative solution, extract the bare minimum logic to strip a VLAN header, and move it to a new helper named vlan_remove_tag(), close to the definition of vlan_insert_tag(). Document it appropriately and make ocelot_xmit_get_vlan_info() call this smaller helper instead. Seeing that it doesn't appear illegal to test skb->protocol in the TX path, I guess it would be a good for vlan_remove_tag() to also absorb the vlan_set_encap_proto() function call. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23net: vlan: introduce skb_vlan_eth_hdr()Vladimir Oltean
Similar to skb_eth_hdr() introduced in commit 96cc4b69581d ("macvlan: do not assume mac_header is set in macvlan_broadcast()"), let's introduce a skb_vlan_eth_hdr() helper which can be used in TX-only code paths to get to the VLAN header based on skb->data rather than based on the skb_mac_header(skb). We also consolidate the drivers that dereference skb->data to go through this helper. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23net: vlan: don't adjust MAC header in __vlan_insert_inner_tag() unless setVladimir Oltean
This is a preparatory change for the deletion of skb_reset_mac_header(skb) from __dev_queue_xmit(). After that deletion, skb_mac_header(skb) will no longer be set in TX paths, from which __vlan_insert_inner_tag() can still be called (perhaps indirectly). If we don't make this change, then an unset MAC header (equal to ~0U) will become set after the adjustment with VLAN_HLEN. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23net: optimize napi_threaded_poll() vs RPS/RFSEric Dumazet
We use napi_threaded_poll() in order to reduce our softirq dependency. We can add a followup of 821eba962d95 ("net: optimize napi_schedule_rps()") to further remove the need of firing NET_RX_SOFTIRQ whenever RPS/RFS are used. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-22bpf: fix link failure with NETFILTER=y INET=nFlorian Westphal
Explicitly check if NETFILTER_BPF_LINK is enabled, else configs that have NETFILTER=y but CONFIG_INET=n fail to link: > kernel/bpf/syscall.o: undefined reference to `netfilter_prog_ops' > kernel/bpf/verifier.o: undefined reference to `netfilter_verifier_ops' Fixes: fd9c663b9ad6 ("bpf: minimal support for programs hooked into netfilter framework") Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202304220903.fRZTJtxe-lkp@intel.com/ Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230422073544.17634-1-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21Merge tag 'mlx5-updates-2023-04-20' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2023-04-20 1) Dragos Improves RX page pool, and provides some fixes to his previous series: 1.1) Fix releasing page_pool for striding RQ and legacy RQ nonlinear case 1.2) Hook NAPIs to page pools to gain more performance. 2) From Roi, Some cleanups to TC and eswitch modules. 3) Maher migrates vnic diagnostic counters reporting from debugfs to a dedicated devlink health reporter Maher Says: =========== net/mlx5: Expose vnic diagnostic counters using devlink Currently, vnic diagnostic counters are exposed through the following debugfs: $ ls /sys/kernel/debug/mlx5/0000:08:00.0/esw/vf_0/vnic_diag/ cq_overrun quota_exceeded_command total_q_under_processor_handle invalid_command send_queue_priority_update_flow nic_receive_steering_discard The current design does not allow the hypervisor to view the diagnostic counters of its VFs, in case the VFs get bound to a VM. In other words, the counters are not exposed for representor interfaces. Furthermore, the debugfs design is inconvenient future-wise, in case more counters need to be reported by the driver in the future. As these counters pertain to vNIC health, it is more appropriate to utilize the devlink health reporter to expose them. Thus, this patchest includes the following changes: * Drop the current vnic diagnostic counters debugfs interface. * Add a vnic devlink health reporter for PFs/VFs core devices, which when diagnosed will dump vnic diagnostic counter values that are queried from FW. * Add a vnic devlink health reporter for the representor interface, which serves the same purpose listed in the previous point, in addition to allowing the hypervisor to view its VFs diagnostic counters, even when the VFs are bounded to external VMs. Example of devlink health reporter usage is: $devlink health diagnose pci/0000:08:00.0 reporter vnic vNIC env counters: total_error_queues: 0 send_queue_priority_update_flow: 0 comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0 invalid_command: 0 quota_exceeded_command: 0 nic_receive_steering_discard: 0 =========== 4) SW steering fixes and improvements Yevgeny Kliteynik Says: ======================= These short patch series are just small fixes / improvements for SW steering: - Patch 1: Fix dumping of legacy modify_hdr in debug dump to align to what is expected by parser - Patch 2: Have separate threshold for ICM sync per ICM type - Patch 3: Add more info to the steering debug dump - Linux version and device name - Patch 4: Keep track of number of buddies that are currently in use per domain per buddy type ======================= * tag 'mlx5-updates-2023-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: Update op_mode to op_mod for port selection net/mlx5: E-Switch, Remove unused mlx5_esw_offloads_vport_metadata_set() net/mlx5: E-Switch, Remove redundant dev arg from mlx5_esw_vport_alloc() net/mlx5: Include linux/pci.h for pci_msix_can_alloc_dyn() net/mlx5e: RX, Hook NAPIs to page pools net/mlx5e: RX, Fix XDP_TX page release for legacy rq nonlinear case net/mlx5e: RX, Fix releasing page_pool pages twice for striding RQ net/mlx5e: Add vnic devlink health reporter to representors net/mlx5: Add vnic devlink health reporter to PFs/VFs Revert "net/mlx5: Expose vnic diagnostic counters for eswitch managed vports" Revert "net/mlx5: Expose steering dropped packets counter" net/mlx5: DR, Add memory statistics for domain object net/mlx5: DR, Add more info in domain dbg dump net/mlx5: DR, Calculate sync threshold of each pool according to its type net/mlx5: DR, Fix dumping of legacy modify_hdr in debug dump ==================== Link: https://lore.kernel.org/r/20230421013850.349646-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-04-21 We've added 71 non-merge commits during the last 8 day(s) which contain a total of 116 files changed, 13397 insertions(+), 8896 deletions(-). The main changes are: 1) Add a new BPF netfilter program type and minimal support to hook BPF programs to netfilter hooks such as prerouting or forward, from Florian Westphal. 2) Fix race between btf_put and btf_idr walk which caused a deadlock, from Alexei Starovoitov. 3) Second big batch to migrate test_verifier unit tests into test_progs for ease of readability and debugging, from Eduard Zingerman. 4) Add support for refcounted local kptrs to the verifier for allowing shared ownership, useful for adding a node to both the BPF list and rbtree, from Dave Marchevsky. 5) Migrate bpf_for(), bpf_for_each() and bpf_repeat() macros from BPF selftests into libbpf-provided bpf_helpers.h header and improve kfunc handling, from Andrii Nakryiko. 6) Support 64-bit pointers to kfuncs needed for archs like s390x, from Ilya Leoshkevich. 7) Support BPF progs under getsockopt with a NULL optval, from Stanislav Fomichev. 8) Improve verifier u32 scalar equality checking in order to enable LLVM transformations which earlier had to be disabled specifically for BPF backend, from Yonghong Song. 9) Extend bpftool's struct_ops object loading to support links, from Kui-Feng Lee. 10) Add xsk selftest follow-up fixes for hugepage allocated umem, from Magnus Karlsson. 11) Support BPF redirects from tc BPF to ifb devices, from Daniel Borkmann. 12) Add BPF support for integer type when accessing variable length arrays, from Feng Zhou. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (71 commits) selftests/bpf: verifier/value_ptr_arith converted to inline assembly selftests/bpf: verifier/value_illegal_alu converted to inline assembly selftests/bpf: verifier/unpriv converted to inline assembly selftests/bpf: verifier/subreg converted to inline assembly selftests/bpf: verifier/spin_lock converted to inline assembly selftests/bpf: verifier/sock converted to inline assembly selftests/bpf: verifier/search_pruning converted to inline assembly selftests/bpf: verifier/runtime_jit converted to inline assembly selftests/bpf: verifier/regalloc converted to inline assembly selftests/bpf: verifier/ref_tracking converted to inline assembly selftests/bpf: verifier/map_ptr_mixing converted to inline assembly selftests/bpf: verifier/map_in_map converted to inline assembly selftests/bpf: verifier/lwt converted to inline assembly selftests/bpf: verifier/loops1 converted to inline assembly selftests/bpf: verifier/jeq_infer_not_null converted to inline assembly selftests/bpf: verifier/direct_packet_access converted to inline assembly selftests/bpf: verifier/d_path converted to inline assembly selftests/bpf: verifier/ctx converted to inline assembly selftests/bpf: verifier/btf_ctx_access converted to inline assembly selftests/bpf: verifier/bpf_get_stack converted to inline assembly ... ==================== Link: https://lore.kernel.org/r/20230421211035.9111-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-22netfilter: nf_tables: don't write table validation state without mutexFlorian Westphal
The ->cleanup callback needs to be removed, this doesn't work anymore as the transaction mutex is already released in the ->abort function. Just do it after a successful validation pass, this either happens from commit or abort phases where transaction mutex is held. Fixes: f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-21hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()Hugh Dickins
Some architectures can have their hugetlb pages down at the lowest PTE level: their huge_pte_alloc() using pte_alloc_map(), but without any following pte_unmap(). Since none of these arches uses CONFIG_HIGHPTE, this is not seen as a problem at present; but would become a problem if forthcoming changes were to add an rcu_read_lock() into pte_offset_map(), with the rcu_read_unlock() expected in pte_unmap(). Similarly in their huge_pte_offset(): pte_offset_kernel() is good enough for that, but it's probably less confusing if we define pte_offset_huge() along with pte_alloc_huge(). Only define them without CONFIG_HIGHPTE: so there would be a build error to signal if ever more work is needed. For ease of development, define these now for 6.4-rc1, ahead of any use: then architectures can integrate patches using them, independent from mm. Link: https://lkml.kernel.org/r/ae9e7d98-8a3a-cfd9-4762-bcddffdf96cf@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21mm: add new KSM process and sysfs knobsStefan Roesch
This adds the general_profit KSM sysfs knob and the process profit metric knobs to ksm_stat. 1) expose general_profit metric The documentation mentions a general profit metric, however this metric is not calculated. In addition the formula depends on the size of internal structures, which makes it more difficult for an administrator to make the calculation. Adding the metric for a better user experience. 2) document general_profit sysfs knob 3) calculate ksm process profit metric The ksm documentation mentions the process profit metric and how to calculate it. This adds the calculation of the metric. 4) mm: expose ksm process profit metric in ksm_stat This exposes the ksm process profit metric in /proc/<pid>/ksm_stat. The documentation mentions the formula for the ksm process profit metric, however it does not calculate it. In addition the formula depends on the size of internal structures. So it makes sense to expose it. 5) document new procfs ksm knobs Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21mm: add new api to enable ksm per processStefan Roesch
Patch series "mm: process/cgroup ksm support", v9. So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. Use case 1: The madvise call is not available in the programming language. An example for this are programs with forked workloads using a garbage collected language without pointers. In such a language madvise cannot be made available. In addition the addresses of objects get moved around as they are garbage collected. KSM sharing needs to be enabled "from the outside" for these type of workloads. Use case 2: The same interpreter can also be used for workloads where KSM brings no benefit or even has overhead. We'd like to be able to enable KSM on a workload by workload basis. Use case 3: With the madvise call sharing opportunities are only enabled for the current process: it is a workload-local decision. A considerable number of sharing opportunities may exist across multiple workloads or jobs (if they are part of the same security domain). Only a higler level entity like a job scheduler or container can know for certain if its running one or more instances of a job. That job scheduler however doesn't have the necessary internal workload knowledge to make targeted madvise calls. Security concerns: In previous discussions security concerns have been brought up. The problem is that an individual workload does not have the knowledge about what else is running on a machine. Therefore it has to be very conservative in what memory areas can be shared or not. However, if the system is dedicated to running multiple jobs within the same security domain, its the job scheduler that has the knowledge that sharing can be safely enabled and is even desirable. Performance: Experiments with using UKSM have shown a capacity increase of around 20%. Here are the metrics from an instagram workload (taken from a machine with 64GB main memory): full_scans: 445 general_profit: 20158298048 max_page_sharing: 256 merge_across_nodes: 1 pages_shared: 129547 pages_sharing: 5119146 pages_to_scan: 4000 pages_unshared: 1760924 pages_volatile: 10761341 run: 1 sleep_millisecs: 20 stable_node_chains: 167 stable_node_chains_prune_millisecs: 2000 stable_node_dups: 2751 use_zero_pages: 0 zero_pages_sharing: 0 After the service is running for 30 minutes to an hour, 4 to 5 million shared pages are common for this workload when using KSM. Detailed changes: 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 3. Add general_profit metric The general_profit metric of KSM is specified in the documentation, but not calculated. This adds the general profit metric to /sys/kernel/debug/mm/ksm. 4. Add more metrics to ksm_stat This adds the process profit metric to /proc/<pid>/ksm_stat. 5. Add more tests to ksm_tests and ksm_functional_tests This adds an option to specify the merge type to the ksm_tests. This allows to test madvise and prctl KSM. It also adds a two new tests to ksm_functional_tests: one to test the new prctl options and the other one is a fork test to verify that the KSM process setting is inherited by client processes. This patch (of 3): So far KSM can only be enabled by calling madvise for memory regions. To be able to use KSM for more workloads, KSM needs to have the ability to be enabled / disabled at the process / cgroup level. 1. New options for prctl system command This patch series adds two new options to the prctl system call. The first one allows to enable KSM at the process level and the second one to query the setting. The setting will be inherited by child processes. With the above setting, KSM can be enabled for the seed process of a cgroup and all processes in the cgroup will inherit the setting. 2. Changes to KSM processing When KSM is enabled at the process level, the KSM code will iterate over all the VMA's and enable KSM for the eligible VMA's. When forking a process that has KSM enabled, the setting will be inherited by the new child process. 1) Introduce new MMF_VM_MERGE_ANY flag This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag is set, kernel samepage merging (ksm) gets enabled for all vma's of a process. 2) Setting VM_MERGEABLE on VMA creation When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the VM_MERGEABLE flag will be set for this VMA. 3) support disabling of ksm for a process This adds the ability to disable ksm for a process if ksm has been enabled for the process with prctl. 4) add new prctl option to get and set ksm for a process This adds two new options to the prctl system call - enable ksm for all vmas of a process (if the vmas support it). - query if ksm has been enabled for a process. 3. Disabling MMF_VM_MERGE_ANY for storage keys in s390 In the s390 architecture when storage keys are used, the MMF_VM_MERGE_ANY will be disabled. Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: David Hildenbrand <david@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()Kefeng Wang
Both of them change the arg from page_list to folio_list when convert them to use a folio, but not the declaration, let's correct it, also move the reclaim_pages() from swap.h to internal.h as it only used in mm. Link: https://lkml.kernel.org/r/20230417114807.186786-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviwed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21fs/buffer: add folio_create_empty_buffers helperPankaj Raghav
Folio version of create_empty_buffers(). This is required to convert create_page_buffers() to folio_create_buffers() later in the series. It removes several calls to compound_head() as it works directly on folio compared to create_empty_buffers(). Hence, create_empty_buffers() has been modified to call folio_create_empty_buffers(). Link: https://lkml.kernel.org/r/20230417123618.22094-4-p.raghav@samsung.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21buffer: add folio_alloc_buffers() helperPankaj Raghav
Folio version of alloc_page_buffers() helper. This is required to convert create_page_buffers() to folio_create_buffers() later in the series. alloc_page_buffers() has been modified to call folio_alloc_buffers() which adds one call to compound_head() but folio_alloc_buffers() removes one call to compound_head() compared to the existing alloc_page_buffers() implementation. Link: https://lkml.kernel.org/r/20230417123618.22094-3-p.raghav@samsung.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21fs/buffer: add folio_set_bh helperPankaj Raghav
Patch series "convert create_page_buffers to folio_create_buffers". One of the first kernel panic we hit when we try to increase the block size > 4k is inside create_page_buffers()[1]. Even though buffer.c function do not support large folios (folios > PAGE_SIZE) at the moment, these changes are required when we want to remove that constraint. This patch (of 4): The folio version of set_bh_page(). This is required to convert create_page_buffers() to folio_create_buffers() later in the series. Link: https://lkml.kernel.org/r/20230417123618.22094-1-p.raghav@samsung.com Link: https://lkml.kernel.org/r/20230417123618.22094-2-p.raghav@samsung.com Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-21bpf: add test_run support for netfilter program typeFlorian Westphal
add glue code so a bpf program can be run using userspace-provided netfilter state and packet/skb. Default is to use ipv4:output hook point, but this can be overridden by userspace. Userspace provided netfilter state is restricted, only hook and protocol families can be overridden and only to ipv4/ipv6. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-7-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21bpf: minimal support for programs hooked into netfilter frameworkFlorian Westphal
This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs that will be invoked via the NF_HOOK() points in the ip stack. Invocation incurs an indirect call. This is not a necessity: Its possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the program invocation with the same method already done for xdp progs. This isn't done here to keep the size of this chunk down. Verifier restricts verdicts to either DROP or ACCEPT. Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-3-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21bpf: add bpf_link support for BPF_NETFILTER programsFlorian Westphal
Add bpf_link support skeleton. To keep this reviewable, no bpf program can be invoked yet, if a program is attached only a c-stub is called and not the actual bpf program. Defaults to 'y' if both netfilter and bpf syscall are enabled in kconfig. Uapi example usage: union bpf_attr attr = { }; attr.link_create.prog_fd = progfd; attr.link_create.attach_type = 0; /* unused */ attr.link_create.netfilter.pf = PF_INET; attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN; attr.link_create.netfilter.priority = -128; err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr)); ... this would attach progfd to ipv4:input hook. Such hook gets removed automatically if the calling program exits. BPF_NETFILTER program invocation is added in followup change. NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook, it allows to tell userspace which program is attached at the given hook when user runs 'nft hook list' command rather than just the priority and not-very-helpful 'this hook runs a bpf prog but I can't tell which one'. Will also be used to disallow registration of two bpf programs with same priority in a followup patch. v4: arm32 cmpxchg only supports 32bit operand s/prio/priority/ v3: restrict prog attachment to ip/ip6 for now, lets lift restrictions if more use cases pop up (arptables, ebtables, netdev ingress/egress etc). Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-2-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-21iomap: Remove IOMAP_DIO_NOSYNC unused dio flagRitesh Harjani (IBM)
IOMAP_DIO_NOSYNC earlier was added for use in btrfs. But it seems for aio dsync writes this is not useful anyway. For aio dsync case, we we queue the request and return -EIOCBQUEUED. Now, since IOMAP_DIO_NOSYNC doesn't let iomap_dio_complete() to call generic_write_sync(), hence we may lose the sync write. Hence kill this flag as it is not in use by any FS now. Tested-by: Disha Goel <disgoel@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-04-21fs.h: Add TRACE_IOCB_STRINGS for use in trace pointsRitesh Harjani (IBM)
Add TRACE_IOCB_STRINGS macro which can be used in the trace point patch to print different flag values with meaningful string output. Tested-by: Disha Goel <disgoel@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: line up strings all prettylike] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2023-04-21Merge tag 'irqchip-6.4' of ↵Thomas Gleixner
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core Pull irqchip changes from Marc Zyngier: - Large RISC-V IPI rework to make way for a new interrupt architecture - More Loongarch fixes from Lianmin Lv, fixing issues in the so called "dual-bridge" systems. - Workaround for the nvidia T241 chip that gets confused in 3 and 4 socket configurations, leading to the GIC malfunctionning in some contexts - Drop support for non-firmware driven GIC configurarations now that the old ARM11MP Cavium board is gone - Workaround for the Rockchip 3588 chip that doesn't correctly deal with the shareability attributes. - Replace uses of of_find_property() with the more appropriate of_property_read_bool() - Make bcm-6345-l1 request its MMIO region - Add suspend support to the SiFive PLIC - Drop support for stih415, stih416 and stid127 platforms Link: https://lore.kernel.org/lkml/20230421132104.3021536-1-maz@kernel.org
2023-04-21Merge tag 'wireless-next-2023-04-21' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.4 Most likely the last -next pull request for v6.4. We have changes all over. rtw88 now supports SDIO bus and iwlwifi continues to work on Wi-Fi 7 support. Not much stack changes this time. Major changes: cfg80211/mac80211 - fix some Fine Time Measurement (FTM) frames not being bufferable - flush frames before key removal to avoid potential unencrypted transmission depending on the hardware design iwlwifi - preparation for Wi-Fi 7 EHT and multi-link support rtw88 - SDIO bus support - RTL8822BS, RTL8822CS and RTL8821CS SDIO chipset support rtw89 - framework firmware backwards compatibility brcmfmac - Cypress 43439 SDIO support mt76 - mt7921 P2P support - mt7996 mesh A-MSDU support - mt7996 EHT support - mt7996 coredump support wcn36xx - support for pronto v3 hardware ath11k - PCIe DeviceTree bindings - WCN6750: enable SAR support ath10k - convert DeviceTree bindings to YAML * tag 'wireless-next-2023-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (261 commits) wifi: rtw88: Update spelling in main.h wifi: airo: remove ISA_DMA_API dependency wifi: rtl8xxxu: Simplify setting the initial gain wifi: rtl8xxxu: Add rtl8xxxu_write{8,16,32}_{set,clear} wifi: rtl8xxxu: Don't print the vendor/product/serial wifi: rtw88: Fix memory leak in rtw88_usb wifi: rtw88: call rtw8821c_switch_rf_set() according to chip variant wifi: rtw88: set pkg_type correctly for specific rtw8821c variants wifi: rtw88: rtw8821c: Fix rfe_option field width wifi: rtw88: usb: fix priority queue to endpoint mapping wifi: rtw88: 8822c: add iface combination wifi: rtw88: handle station mode concurrent scan with AP mode wifi: rtw88: prevent scan abort with other VIFs wifi: rtw88: refine reserved page flow for AP mode wifi: rtw88: disallow PS during AP mode wifi: rtw88: 8822c: extend reserved page number wifi: rtw88: add port switch for AP mode wifi: rtw88: add bitmap for dynamic port settings wifi: rtw89: mac: use regular int as return type of DLE buffer request wifi: mac80211: remove return value check of debugfs_create_dir() ... ==================== Link: https://lore.kernel.org/r/20230421104726.800BCC433D2@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-21spi: Add TPM HW flow flagKrishna Yarlagadda
TPM specification [1] defines flow control over SPI. Client device can insert a wait state on MISO when address is transmitted by controller on MOSI. Detecting the wait state in software is only possible for full duplex controllers. For controllers that support only half- duplex, the wait state detection needs to be implemented in hardware. Add a flag SPI_TPM_HW_FLOW for TPM device to set when software flow control is not possible and hardware flow control is expected from SPI controller. Reference: [1] https://trustedcomputinggroup.org/resource/pc-client-platform-tpm -profile-ptp-specification/ Signed-off-by: Krishna Yarlagadda <kyarlagadda@nvidia.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Link: https://lore.kernel.org/r/20230421091309.2672-2-kyarlagadda@nvidia.com Signed-off-by: Mark Brown <broonie@kernel.org>
2023-04-21posix-cpu-timers: Implement the missing timer_wait_running callbackThomas Gleixner
For some unknown reason the introduction of the timer_wait_running callback missed to fixup posix CPU timers, which went unnoticed for almost four years. Marco reported recently that the WARN_ON() in timer_wait_running() triggers with a posix CPU timer test case. Posix CPU timers have two execution models for expiring timers depending on CONFIG_POSIX_CPU_TIMERS_TASK_WORK: 1) If not enabled, the expiry happens in hard interrupt context so spin waiting on the remote CPU is reasonably time bound. Implement an empty stub function for that case. 2) If enabled, the expiry happens in task work before returning to user space or guest mode. The expired timers are marked as firing and moved from the timer queue to a local list head with sighand lock held. Once the timers are moved, sighand lock is dropped and the expiry happens in fully preemptible context. That means the expiring task can be scheduled out, migrated, interrupted etc. So spin waiting on it is more than suboptimal. The timer wheel has a timer_wait_running() mechanism for RT, which uses a per CPU timer-base expiry lock which is held by the expiry code and the task waiting for the timer function to complete blocks on that lock. This does not work in the same way for posix CPU timers as there is no timer base and expiry for process wide timers can run on any task belonging to that process, but the concept of waiting on an expiry lock can be used too in a slightly different way: - Add a mutex to struct posix_cputimers_work. This struct is per task and used to schedule the expiry task work from the timer interrupt. - Add a task_struct pointer to struct cpu_timer which is used to store a the task which runs the expiry. That's filled in when the task moves the expired timers to the local expiry list. That's not affecting the size of the k_itimer union as there are bigger union members already - Let the task take the expiry mutex around the expiry function - Let the waiter acquire a task reference with rcu_read_lock() held and block on the expiry mutex This avoids spin-waiting on a task which might not even be on a CPU and works nicely for RT too. Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT") Reported-by: Marco Elver <elver@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Marco Elver <elver@google.com> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87zg764ojw.ffs@tglx
2023-04-21Merge branch irq/gic-6.4 into irq/irqchip-nextMarc Zyngier
* irq/gic-6.4: : . : Collection of GIC/GICv3 fixes and cleanups : : - Workaround for the nvidia T241 chip that gets confused : in 3 and 4 socket configurations, leading to the GIC : malfunctionning in some contexts : : - Drop support for non-firmware driven GIC configurarations : now that the old ARM11MP Cavium board is gone : : - Workaround for the Rockchip 3588 chip that doesn't : correctly deal with the shareability attributes. : . irqchip/gic-v3: Add Rockchip 3588001 erratum workaround irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4 irqchip/gic: Drop support for board files Signed-off-by: Marc Zyngier <maz@kernel.org>
2023-04-21sched: Fix performance regression introduced by mm_cidMathieu Desnoyers
Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL sysbench regression reported by Aaron Lu. Keep track of the currently allocated mm_cid for each mm/cpu rather than freeing them immediately on context switch. This eliminates most atomic operations when context switching back and forth between threads belonging to different memory spaces in multi-threaded scenarios (many processes, each with many threads). The per-mm/per-cpu mm_cid values are serialized by their respective runqueue locks. Thread migration is handled by introducing invocation to sched_mm_cid_migrate_to() (with destination runqueue lock held) in activate_task() for migrating tasks. If the destination cpu's mm_cid is unset, and if the source runqueue is not actively using its mm_cid, then the source cpu's mm_cid is moved to the destination cpu on migration. Introduce a task-work executed periodically, similarly to NUMA work, which delays reclaim of cid values when they are unused for a period of time. Keep track of the allocation time for each per-cpu cid, and let the task work clear them when they are observed to be older than SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all mm_cids which are greater or equal to the Hamming weight of the mm cidmask to keep concurrency ids compact. Because we want to ensure the mm_cid converges towards the smaller values as migrations happen, the prior optimization that was done when context switching between threads belonging to the same mm is removed, because it could delay the lazy release of the destination runqueue mm_cid after it has been replaced by a migration. Removing this prior optimization is not an issue performance-wise because the introduced per-mm/per-cpu mm_cid tracking also covers this more specific case. Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID") Reported-by: Aaron Lu <aaron.lu@intel.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Aaron Lu <aaron.lu@intel.com> Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
2023-04-21Merge branch 'v6.3-rc7'Peter Zijlstra
Sync with the urgent patches; in particular: a53ce18cacb4 ("sched/fair: Sanitize vruntime of entity being migrated") Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2023-04-21pds_core: publish events to the clientsShannon Nelson
When the Core device gets an event from the device, or notices the device FW to be up or down, it needs to send those events on to the clients that have an event handler. Add the code to pass along the events to the clients. The entry points pdsc_register_notify() and pdsc_unregister_notify() are EXPORTed for other drivers that want to listen for these events. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21pds_core: add the aux client APIShannon Nelson
Add the client API operations for running adminq commands. The core registers the client with the FW, then the client has a context for requesting adminq services. We expect to add additional operations for other clients, including requesting additional private adminqs and IRQs, but don't have the need yet. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21pds_core: add auxiliary_bus devicesShannon Nelson
An auxiliary_bus device is created for each vDPA type VF at VF probe and destroyed at VF remove. The aux device name comes from the driver name + VIF type + the unique id assigned at PCI probe. The VFs are always removed on PF remove, so there should be no issues with VFs trying to access missing PF structures. The auxiliary_device names will look like "pds_core.vDPA.nn" where 'nn' is the VF's uid. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21pds_core: set up the VIF definitions and defaultsShannon Nelson
The Virtual Interfaces (VIFs) supported by the DSC's configuration (vDPA, Eth, RDMA, etc) are reported in the dev_ident struct and made visible in debugfs. At this point only vDPA is supported in this driver so we only setup devices for that feature. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21pds_core: Add adminq processing and commandsShannon Nelson
Add the service routines for submitting and processing the adminq messages and for handling notifyq events. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21pds_core: set up device and adminqShannon Nelson
Set up the basic adminq and notifyq queue structures. These are used mostly by the client drivers for feature configuration. These are essentially the same adminq and notifyq as in the ionic driver. Part of this includes querying for device identity and FW information, so we can make that available to devlink dev info. $ devlink dev info pci/0000:b5:00.0 pci/0000:b5:00.0: driver pds_core serial_number FLM18420073 versions: fixed: asic.id 0x0 asic.rev 0x0 running: fw 1.51.0-73 stored: fw.goldfw 1.15.9-C-22 fw.mainfwa 1.60.0-73 fw.mainfwb 1.60.0-57 Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21pds_core: add devcmd device interfacesShannon Nelson
The devcmd interface is the basic connection to the device through the PCI BAR for low level identification and command services. This does the early device initialization and finds the identity data, and adds devcmd routines to be used by later driver bits. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21pds_core: initial framework for pds_core PF driverShannon Nelson
This is the initial PCI driver framework for the new pds_core device driver and its family of devices. This does the very basics of registering for the new PF PCI device 1dd8:100c, setting up debugfs entries, and registering with devlink. Signed-off-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21bridge: Add internal flags for per-{Port, VLAN} neighbor suppressionIdo Schimmel
Add two internal flags that will be used to enable / disable per-{Port, VLAN} neighbor suppression: 1. 'BR_NEIGH_VLAN_SUPPRESS': A per-port flag used to indicate that per-{Port, VLAN} neighbor suppression is enabled on the bridge port. When set, 'BR_NEIGH_SUPPRESS' has no effect. 2. 'BR_VLFLAG_NEIGH_SUPPRESS_ENABLED': A per-VLAN flag used to indicate that neighbor suppression is enabled on the given VLAN. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21sctp: delete the nested flexible array payloadXin Long
This patch deletes the flexible-array payload[] from the structure sctp_datahdr to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/socket.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:230:29: warning: nested flexible array This member is not even used anywhere. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21sctp: delete the nested flexible array hmacXin Long
This patch deletes the flexible-array hmac[] from the structure sctp_authhdr to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/auth.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:735:29: warning: nested flexible array Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21sctp: delete the nested flexible array variableXin Long
This patch deletes the flexible-array variable[] from the structure sctp_sackhdr and sctp_errhdr to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/sm_statefuns.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:451:28: warning: nested flexible array ./include/linux/sctp.h:393:29: warning: nested flexible array Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-21sctp: delete the nested flexible array skipXin Long
This patch deletes the flexible-array skip[] from the structure sctp_ifwdtsn/fwdtsn_hdr to avoid some sparse warnings: # make C=2 CF="-Wflexible-array-nested" M=./net/sctp/ net/sctp/stream_interleave.c: note: in included file (through include/net/sctp/structs.h, include/net/sctp/sctp.h): ./include/linux/sctp.h:611:32: warning: nested flexible array ./include/linux/sctp.h:628:33: warning: nested flexible array Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>