Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq core updates from Thomas Gleixner:
"Updates for the generic interrupt subsystem core code:
- Address a long standing subtle problem in the CPU hotplug code for
affinity-managed interrupts.
Affinity-managed interrupts are shut down by the core code when the
last CPU in the affinity set goes offline and started up again when
the first CPU in the affinity set becomes online again.
This unfortunately does not take into account whether an interrupt
has been disabled before the last CPU goes offline and starts up
the interrupt unconditionally when the first CPU becomes online
again.
That's obviously not what drivers expect.
Address this by preserving the disabled state for affinity-managed
interrupts accross these CPU hotplug operations. All non-managed
interrupts are not affected by this because startup/shutdown is
coupled to request/free_irq() which obviously has to reset state.
- Support three-cell scheme interrupts to allow GPIO drivers to
specify interrupts from an already existing scheme
- Switch the interrupt subsystem core to lock guards. This gets rid
of quite some copy & pasta boilerplate code all over the place.
- The usual small cleanups and improvements all over the place"
* tag 'irq-core-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
genirq/irqdesc: Remove double locking in hwirq_show()
genirq: Retain disable depth for managed interrupts across CPU hotplug
genirq: Bump the size of the local variable for sprintf()
genirq/manage: Use the correct lock guard in irq_set_irq_wake()
genirq: Consistently use '%u' format specifier for unsigned int variables
genirq: Ensure flags in lock guard is consistently initialized
genirq: Fix inverted condition in handle_nested_irq()
genirq/cpuhotplug: Fix up lock guards conversion brainf..t
genirq: Use scoped_guard() to shut clang up
genirq: Remove unused remove_percpu_irq()
genirq: Remove irq_[get|put]_desc*()
genirq/manage: Rework irq_set_irqchip_state()
genirq/manage: Rework irq_get_irqchip_state()
genirq/manage: Rework teardown_percpu_nmi()
genirq/manage: Rework prepare_percpu_nmi()
genirq/manage: Rework disable_percpu_irq()
genirq/manage: Rework irq_percpu_is_enabled()
genirq/manage: Rework enable_percpu_irq()
genirq/manage: Rework irq_set_parent()
genirq/manage: Rework can_request_irq()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry code updates from Thomas Gleixner:
"Updates for the generic and architecture entry code:
- Move LoongArch and RISC-V ret_from_fork() implementations to C code
so that syscall_exit_user_mode() can be inlined
- Split the RISC-V ret_from_fork() implementation into return to user
and return to kernel, which gives a measurable performance
improvement
- Inline syscall_exit_user_mode() which benefits all architectures by
avoiding a function call and letting the compiler do better
optimizations"
* tag 'core-entry-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
LoongArch: entry: Fix include order
entry: Inline syscall_exit_to_user_mode()
LoongArch: entry: Migrate ret_from_fork() to C
riscv: entry: Split ret_from_fork() into user and kernel
riscv: entry: Convert ret_from_fork() to C
|
|
Expose the virtio-rtc UTC-like clock as an RTC clock to userspace - if it
is present, and if it does not step on leap seconds. The RTC class enables
the virtio-rtc device to resume the system from sleep states on RTC alarm.
Support RTC alarm if the virtio-rtc alarm feature is present. The
virtio-rtc device signals an alarm by marking an alarmq buffer as used.
Peculiarities
-------------
A virtio-rtc clock is a bit special for an RTC clock in that
- the clock may step (also backwards) autonomously at any time and
- the device, and its notification mechanism, will be reset during boot or
resume from sleep.
The virtio-rtc device avoids that the driver might miss an alarm. The
device signals an alarm whenever the clock has reached or passed the alarm
time, and also when the device is reset (on boot or resume from sleep), if
the alarm time is in the past.
Open Issue
----------
The CLOCK_BOOTTIME_ALARM will use the RTC clock to wake up from sleep, and
implicitly assumes that no RTC clock steps will occur during sleep. The RTC
class driver does not know whether the current alarm is a real-time alarm
or a boot-time alarm.
Perhaps this might be handled by the driver also setting a virtio-rtc
monotonic alarm (which uses a clock similar to CLOCK_BOOTTIME_ALARM). The
virtio-rtc monotonic alarm would just be used to wake up in case it was a
CLOCK_BOOTTIME_ALARM alarm.
Otherwise, the behavior should not differ from other RTC class drivers.
Signed-off-by: Peter Hilber <quic_philber@quicinc.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Message-Id: <20250509160734.1772-5-quic_philber@quicinc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
For platforms using the Arm Generic Timer, add precise cross-timestamping
support to virtio_rtc.
Always report the CP15 virtual counter as the HW counter in use by
arm_arch_timer, since the Linux kernel's usage of the Arm Generic Timer
should always be compatible with this.
Signed-off-by: Peter Hilber <quic_philber@quicinc.com>
Message-Id: <20250509160734.1772-4-quic_philber@quicinc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Expose the virtio_rtc clocks as PTP clocks to userspace, similar to
ptp_kvm. virtio_rtc can expose multiple clocks, e.g. a UTC clock and a
monotonic clock.
Userspace should distinguish different clocks through the name assigned by
the driver. In particular, UTC-like clocks can also be distinguished by if
and how leap seconds are smeared. udev rules such as the following can be
used to get different symlinks for different clock types:
SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 0/variant 0", SYMLINK += "ptp_virtio"
SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 1/variant 0", SYMLINK += "ptp_virtio_tai"
SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 2/variant 0", SYMLINK += "ptp_virtio_monotonic"
SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 3/variant 0", SYMLINK += "ptp_virtio_smear_unspecified"
SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 3/variant 1", SYMLINK += "ptp_virtio_smear_noon_linear"
SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 3/variant 2", SYMLINK += "ptp_virtio_smear_sls"
SUBSYSTEM=="ptp", ATTR{clock_name}=="Virtio PTP type 4/variant 0", SYMLINK += "ptp_virtio_maybe_smeared"
The preferred PTP clock reading method is ioctl PTP_SYS_OFFSET_PRECISE2,
through the ptp_clock_info.getcrosststamp() op. For now,
PTP_SYS_OFFSET_PRECISE2 will return -EOPNOTSUPP through a weak function.
PTP_SYS_OFFSET_PRECISE2 requires cross-timestamping support for specific
clocksources, which will be added in the following. If the clocksource
specific code is enabled, check that the Virtio RTC device supports the
respective HW counter before obtaining an actual cross-timestamp from the
Virtio device.
The Virtio RTC device response time may be higher than the timekeeper
seqcount increment interval. Therefore, obtain the cross-timestamp before
calling get_device_system_crosststamp().
As a fallback, support the ioctl PTP_SYS_OFFSET_EXTENDED2 for all
platforms.
Assume that concurrency issues during PTP clock removal are avoided by the
posix_clock framework.
Kconfig recursive dependencies prevent virtio_rtc from implicitly enabling
PTP_1588_CLOCK, therefore just warn the user if PTP_1588_CLOCK is not
available. Since virtio_rtc should in the future also expose clocks as RTC
class devices, do not depend VIRTIO_RTC on PTP_1588_CLOCK.
Signed-off-by: Peter Hilber <quic_philber@quicinc.com>
Message-Id: <20250509160734.1772-3-quic_philber@quicinc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Add the virtio_rtc module and driver core. The virtio_rtc module implements
a driver compatible with the proposed Virtio RTC device specification.
The Virtio RTC (Real Time Clock) device provides information about current
time. The device can provide different clocks, e.g. for the UTC or TAI time
standards, or for physical time elapsed since some past epoch. The driver
can read the clocks with simple or more accurate methods.
Implement the core, which interacts with the Virtio RTC device. Apart from
this, the core does not expose functionality outside of the virtio_rtc
module. Follow-up patches will expose PTP clocks and an RTC Class device.
Provide synchronous messaging, which is enough for the expected time
synchronization use cases through PTP clocks (similar to ptp_kvm) or RTC
Class device.
Signed-off-by: Peter Hilber <quic_philber@quicinc.com>
Message-Id: <20250509160734.1772-2-quic_philber@quicinc.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
Use the bvec_kmap_local helper rather than digging into the bvec
internals.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20250501142244.2888227-1-hch@lst.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
|
|
In preparation for making the kmalloc family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "struct kvec *", but the returned type will be
"struct iovec *". These have the same allocation size, so there is no
bug:
struct kvec {
void *iov_base; /* and that should *never* hold a userland pointer */
size_t iov_len;
};
struct iovec
{
void __user *iov_base; /* BSD uses caddr_t (1003.1g requires void *) */
__kernel_size_t iov_len; /* Must be size_t (1003.1g) */
};
Adjust the allocation type to match the assignment.
Signed-off-by: Kees Cook <kees@kernel.org>
Message-Id: <20250426062214.work.334-kees@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
The result size returned by virtio_pci_admin_dev_parts_get() is 8 bytes
larger than the actual result data size. This occurs because the
result_sg_size field of the command is filled with the result length
from virtqueue_get_buf(), which includes both the data size and an
additional 8 bytes of status.
This oversized result size causes two issues:
1. The state transferred to the destination includes 8 bytes of extra
data at the end.
2. The allocated buffer in the kernel may be smaller than the returned
size, leading to failures when reading beyond the allocated size.
The commit fixes this by subtracting the status size from the result of
virtqueue_get_buf().
This fix has been tested through live migrations with virtio-net,
virtio-net-transitional, and virtio-blk devices.
Fixes: 704806ca400e ("virtio: Extend the admin command to include the result size")
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Message-Id: <1745318025-23103-1-git-send-email-israelr@nvidia.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
|
|
PCI region request functions such as pci_request_region() currently have
the problem of becoming sometimes managed functions, if
pcim_enable_device() instead of pci_enable_device() was called. The PCI
subsystem wants to remove this deprecated behavior from its interfaces.
octeopn_ep enables its device with pcim_enable_device() (for VF. PF uses
manual management), but does so only to get automatic disablement. The
driver wants to manage its PCI resources for VF manually, without devres.
The easiest way not to use automatic resource management at all is by
also handling device enable- and disablement manually.
Replace pcim_enable_device() with pci_enable_device(). Add the necessary
calls to pci_disable_device().
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Acked-by: Vamsi Attunuru <vattunuru@marvell.com>
Message-Id: <20250508085134.24084-2-phasta@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Signed-off-by: Philipp Stanner <<a href="mailto:phasta@kernel.org" target="_blank">phasta@kernel.org</a>><br>
Acked-by: Vamsi Attunuru <<a href="mailto:vattunuru@marvell.com" target="_blank">vattunuru@marvell.com</a>><br>
|
|
Move the `#[cfg(CONFIG_OF)]` attribute to the top of the documentation test
block and hide it. This applies the condition to the entire test and improves
readability.
Placing configuration flags like `CONFIG_OF` at the top serves as a clear
indicator of the conditions under which the example is valid, effectively
acting like configuration metadata for the example itself.
Suggested-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://patch.msgid.link/9d93c783cc4419f16dd8942a4359d74bc0149203.1748323971.git.viresh.kumar@linaro.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
If the first cpu_node = of_cpu_device_node_get() fails then the cleanup.h
code will try to free "state_node" but it hasn't been initialized yet.
Declare the device_nodes where they are initialized to fix this.
Fixes: 5836ebeb4a2b ("cpuidle: psci: Avoid initializing faux device if no DT idle states are present")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/aDVRcfU8O8sez1x7@stanley.mountain
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The error code is not set correctly on if kasprintf() fails. On the
first iteration it would return -EINVAL and subsequent iterations
would return success. Set it to -ENOMEM.
In real life, this allocation will not fail and if it did the system
will not boot so this change is mostly to silence static checker warnings
more than anything else.
Fixes: 04f53540f791 ("ACPI: MRRM: Add /sys files to describe memory ranges")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/aDVTfEm-Jch7FuHG@stanley.mountain
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Call acpi_put_table() before returning the error code.
Fixes: e54b1dc1c4f0 ("ACPI: APEI: EINJ: Remove redundant calls to einj_get_available_error_type()")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://patch.msgid.link/aDVRBok33LZhXcId@stanley.mountain
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
In the zoned mode there's a bug in the extent buffer tree conversion to
xarray. The reference for eb is dropped and code continues but the
references get dropped by releasing the batch.
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/linux-btrfs/202505191521.435b97ac-lkp@intel.com/
Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
* for-next/vdso:
arm64: vdso: Use __arch_counter_get_cntvct()
|
|
* for-next/sme-fixes: (35 commits)
arm64/fpsimd: Allow CONFIG_ARM64_SME to be selected
arm64/fpsimd: ptrace: Gracefully handle errors
arm64/fpsimd: ptrace: Mandate SVE payload for streaming-mode state
arm64/fpsimd: ptrace: Do not present register data for inactive mode
arm64/fpsimd: ptrace: Save task state before generating SVE header
arm64/fpsimd: ptrace/prctl: Ensure VL changes leave task in a valid state
arm64/fpsimd: ptrace/prctl: Ensure VL changes do not resurrect stale data
arm64/fpsimd: Make clone() compatible with ZA lazy saving
arm64/fpsimd: Clear PSTATE.SM during clone()
arm64/fpsimd: Consistently preserve FPSIMD state during clone()
arm64/fpsimd: Remove redundant task->mm check
arm64/fpsimd: signal: Use SMSTOP behaviour in setup_return()
arm64/fpsimd: Add task_smstop_sm()
arm64/fpsimd: Factor out {sve,sme}_state_size() helpers
arm64/fpsimd: Clarify sve_sync_*() functions
arm64/fpsimd: ptrace: Consistently handle partial writes to NT_ARM_(S)SVE
arm64/fpsimd: signal: Consistently read FPSIMD context
arm64/fpsimd: signal: Mandate SVE payload for streaming-mode state
arm64/fpsimd: signal: Clear PSTATE.SM when restoring FPSIMD frame only
arm64/fpsimd: Do not discard modified SVE state
...
|
|
* for-next/selftests:
kselftest/arm64: Set default OUTPUT path when undefined
kselftest/arm64: fp-ptrace: Adjust to new inactive mode behaviour
kselftest/arm64: fp-ptrace: Adjust to new VL change behaviour
kselftest/arm64: tpidr2: Adjust to new clone() behaviour
kselftest/arm64: fp-ptrace: Fix expected FPMR value when PSTATE.SM is changed
|
|
* for-next/psci:
firmware: psci: Fix refcount leak in psci_dt_init
|
|
* for-next/perf:
perf/arm-cmn: Add CMN S3 ACPI binding
perf/arm-cmn: Initialise cmn->cpu earlier
perf/amlogic: Replace smp_processor_id() with raw_smp_processor_id() in meson_ddr_pmu_create()
perf/arm-cmn: Fix REQ2/SNP2 mixup
perf: Do not enable by default during compile testing
perf: arm-ni: Fix missing platform_set_drvdata()
perf: arm-ni: Unregister PMUs on probe failure
perf/arm-cmn: Remove CMN-600 DTC domain special case
|
|
* for-next/mm:
arm64/boot: Disallow BSS exports to startup code
arm64/boot: Move global CPU override variables out of BSS
arm64/boot: Move init_pgdir[] and init_idmap_pgdir[] into __pi_ namespace
arm64: mm: Drop redundant check in pmd_trans_huge()
arm64/mm: Permit lazy_mmu_mode to be nested
arm64/mm: Disable barrier batching in interrupt contexts
arm64/mm: Batch barriers when updating kernel mappings
mm/vmalloc: Enter lazy mmu mode while manipulating vmalloc ptes
arm64/mm: Support huge pte-mapped pages in vmap
mm/vmalloc: Gracefully unmap huge ptes
mm/vmalloc: Warn on improper use of vunmap_range()
arm64/mm: Hoist barriers out of set_ptes_anysz() loop
arm64: hugetlb: Use __set_ptes_anysz() and __ptep_get_and_clear_anysz()
arm64/mm: Refactor __set_ptes() and __ptep_get_and_clear()
mm/page_table_check: Batch-check pmds/puds just like ptes
arm64: hugetlb: Refine tlb maintenance scope
arm64: hugetlb: Cleanup huge_pte size discovery mechanisms
arm64: pageattr: Explicitly bail out when changing permissions for vmalloc_huge mappings
arm64: Support ARM64_VA_BITS=52 when setting ARCH_MMAP_RND_BITS_MAX
arm64/mm: Remove randomization of the linear map
|
|
* for-next/misc:
arm64/cpuinfo: only show one cpu's info in c_show()
arm64: Extend pr_crit message on invalid FDT
arm64: Kconfig: remove unnecessary selection of CRC32
arm64: Add missing includes for mem_encrypt
|
|
Merge in for-next/fixes, as subsequent improvements to our early PI
code that disallow BSS exports depend on the 'arm64_use_ng_mappings'
fix here.
* for-next/fixes:
arm64: cpufeature: Move arm64_use_ng_mappings to the .data section to prevent wrong idmap generation
arm64: errata: Add missing sentinels to Spectre-BHB MIDR arrays
|
|
* for-next/entry:
arm64: el2_setup.h: Make __init_el2_fgt labels consistent, again
arm64: Update comment regarding values in __boot_cpu_mode
arm64/mm: Re-organise setting up FEAT_S1PIE registers PIRE0_EL1 and PIR_EL1
arm64: enable PREEMPT_LAZY
|
|
* for-next/efi:
arm64/fpsimd: Avoid unnecessary per-CPU buffers for EFI runtime calls
|
|
* for-next/cpufeature:
arm64: cputype: Add cputype definition for HIP12
arm64: Expose AIDR_EL1 via sysfs
arm64/cpufeature: Add missing id_aa64mmfr4 feature reg update
|
|
* for-next/acpi:
firmware: SDEI: Allow sdei initialization without ACPI_APEI_GHES
|
|
Corrected "sages" to "messages" in the bitmap allocation description.
Fixed "competed" to "completed" in the recv path datagram handling section.
Corrected "privatee" to "private" in the multipath RDS section.
Fixed "mutlipath" to "multipath" in the transport capabilities description.
These changes improve documentation clarity and maintain consistency.
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20250522074413.3634446-1-alok.a.tiwari@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
A representor netdev does not correspond to real hardware that needs to
be updated when setting the MAC address. The default eth_mac_addr() is
sufficient for simply updating the netdev's MAC address with validation.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1747898036-1121904-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Subbaraya Sundeep says:
====================
octeontx2-pf: Do not detect MACSEC block based on silicon
Out of various silicon variants of CN10K series some have hardware
MACSEC block for offloading MACSEC operations and some do not.
AF driver already has the information of whether MACSEC is present
or not on running silicon. Hence fetch that information from
AF via mailbox message.
====================
Link: https://patch.msgid.link/1747894516-4565-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The presence of MACSEC block is currently figured out based
on the running silicon variant. This may not be correct all
the times since the MACSEC block can be fused out. Hence get
the macsec info from AF via mailbox.
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1747894548-4657-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
MACSEC block may be fused out on some silicons hence modify
get_hw_cap mailbox message to set a capability flag in its
response message based on MACSEC block availability.
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1747894528-4611-1-git-send-email-sbhatta@marvell.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
In commit 7ead4405e06f ("xsk: convert xdp_copy_frags_from_zc() to use
page_pool_dev_alloc()"), when converting from netmem to page, I missed a
call to page_address() around skb_frag_page(frag) to get the virtual
address of the page. This commit uses skb_frag_address() helper to fix
the issue.
Fixes: 7ead4405e06f ("xsk: convert xdp_copy_frags_from_zc() to use page_pool_dev_alloc()")
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Link: https://patch.msgid.link/20250522040115.5057-1-minhquangbui99@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Syzkaller reports the following issue:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:578
__mutex_lock+0x106/0xe80 kernel/locking/mutex.c:746
team_change_rx_flags+0x38/0x220 drivers/net/team/team_core.c:1781
dev_change_rx_flags net/core/dev.c:9145 [inline]
__dev_set_promiscuity+0x3f8/0x590 net/core/dev.c:9189
netif_set_promiscuity+0x50/0xe0 net/core/dev.c:9201
dev_set_promiscuity+0x126/0x260 net/core/dev_api.c:286 packet_dev_mc net/packet/af_packet.c:3698 [inline]
packet_dev_mclist_delete net/packet/af_packet.c:3722 [inline]
packet_notifier+0x292/0xa60 net/packet/af_packet.c:4247
notifier_call_chain+0x1b3/0x3e0 kernel/notifier.c:85
call_netdevice_notifiers_extack net/core/dev.c:2214 [inline]
call_netdevice_notifiers net/core/dev.c:2228 [inline]
unregister_netdevice_many_notify+0x15d8/0x2330 net/core/dev.c:11972
rtnl_delete_link net/core/rtnetlink.c:3522 [inline]
rtnl_dellink+0x488/0x710 net/core/rtnetlink.c:3564
rtnetlink_rcv_msg+0x7cf/0xb70 net/core/rtnetlink.c:6955
netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534
Calling `PACKET_ADD_MEMBERSHIP` on an ops-locked device can trigger
the `NETDEV_UNREGISTER` notifier, which may require disabling promiscuous
and/or allmulti mode. Both of these operations require acquiring
the netdev instance lock.
Move the call to `packet_dev_mc` outside of the RCU critical section.
The `mclist` modifications (add, del, flush, unregister) are protected by
the RTNL, not the RCU. The RCU only protects the `sklist` and its
associated `sks`. The delayed operation on the `mclist` entry remains
within the RTNL.
Reported-by: syzbot+b191b5ccad8d7a986286@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b191b5ccad8d7a986286
Fixes: ad7c7b2172c3 ("net: hold netdev instance lock during sysfs operations")
Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250522031129.3247266-1-stfomichev@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Michal Luczaj says:
====================
vsock: SOCK_LINGER rework
Change vsock's lingerning to wait on close() until all data is sent, i.e.
until workers picked all the packets for processing.
v5: https://lore.kernel.org/r/20250521-vsock-linger-v5-0-94827860d1d6@rbox.co
v4: https://lore.kernel.org/r/20250501-vsock-linger-v4-0-beabbd8a0847@rbox.co
v3: https://lore.kernel.org/r/20250430-vsock-linger-v3-0-ddbe73b53457@rbox.co
v2: https://lore.kernel.org/r/20250421-vsock-linger-v2-0-fe9febd64668@rbox.co
v1: https://lore.kernel.org/r/20250407-vsock-linger-v1-0-1458038e3492@rbox.co
Signed-off-by: Michal Luczaj <mhal@rbox.co>
====================
Link: https://patch.msgid.link/20250522-vsock-linger-v6-0-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
There was an issue with SO_LINGER: instead of blocking until all queued
messages for the socket have been successfully sent (or the linger timeout
has been reached), close() would block until packets were handled by the
peer.
Add a test to alert on close() lingering when it should not.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250522-vsock-linger-v6-5-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add a helper function that sets SO_LINGER. Adapt the caller.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250522-vsock-linger-v6-4-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Distill the virtio_vsock_sock::bytes_unsent checking loop (ioctl SIOCOUTQ)
and move it to utils. Tweak the comment.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20250522-vsock-linger-v6-3-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Lingering should be transport-independent in the long run. In preparation
for supporting other transports, as well as the linger on shutdown(), move
code to core.
Generalize by querying vsock_transport::unsent_bytes(), guard against the
callback being unimplemented. Do not pass sk_lingertime explicitly. Pull
SOCK_LINGER check into vsock_linger().
Flatten the function. Remove the nested block by inverting the condition:
return early on !timeout.
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250522-vsock-linger-v6-2-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently vsock's lingering effectively boils down to waiting (or timing
out) until packets are consumed or dropped by the peer; be it by receiving
the data, closing or shutting down the connection.
To align with the semantics described in the SO_LINGER section of man
socket(7) and to mimic AF_INET's behaviour more closely, change the logic
of a lingering close(): instead of waiting for all data to be handled,
block until data is considered sent from the vsock's transport point of
view. That is until worker picks the packets for processing and decrements
virtio_vsock_sock::bytes_unsent down to 0.
Note that (some interpretation of) lingering was always limited to
transports that called virtio_transport_wait_close() on transport release.
This does not change, i.e. under Hyper-V and VMCI no lingering would be
observed.
The implementation does not adhere strictly to man page's interpretation of
SO_LINGER: shutdown() will not trigger the lingering. This follows AF_INET.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250522-vsock-linger-v6-1-2ad00b0e447e@rbox.co
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Add support for the MaxLinear MxL86110 Gigabit Ethernet PHY, a low-power,
cost-optimized transceiver supporting 10/100/1000 Mbps over twisted-pair
copper, compliant with IEEE 802.3.
The driver implements basic features such as:
- Device initialization
- RGMII interface timing configuration
- Wake-on-LAN support
- LED initialization and control via /sys/class/leds
This driver has been tested on multiple Variscite boards, including:
- VAR-SOM-MX93 (i.MX93)
- VAR-SOM-MX8M-PLUS (i.MX8MP)
Example boot log showing driver probe:
[ 7.692101] imx-dwmac 428a0000.ethernet eth0:
PHY [stmmac-0:00] driver [MXL86110 Gigabit Ethernet] (irq=POLL)
Signed-off-by: Stefano Radaelli <stefano.radaelli21@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20250521212821.593057-1-stefano.radaelli21@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Jason A. Donenfeld says:
====================
wireguard updates for 6.16
This small series contains mostly cleanups and one new feature:
1) Kees' __nonstring annotation comes to wireguard.
2) Two selftest fixes, one to help with compilation on gcc 15, and one
removing stale config options.
3) Adoption of NLA_POLICY_MASK.
4) Jordan has added the ability to run:
# wg set ... peer ... allowed-ips -192.168.1.0/24
Which will remove the allowed IP for that peer. Previously you had to
replace all the IPs non-atomically, or move it to a dummy peer
atomically, which wasn't very clean.
====================
Link: https://patch.msgid.link/20250521212707.1767879-1-Jason@zx2c4.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
GCC 15 defaults to C23, which bash can't compile under, so specify gnu17
explicitly.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://patch.msgid.link/20250521212707.1767879-6-Jason@zx2c4.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The current netlink API for WireGuard does not directly support removal
of allowed ips from a peer. A user can remove an allowed ip from a peer
in one of two ways:
1. By using the WGPEER_F_REPLACE_ALLOWEDIPS flag and providing a new
list of allowed ips which omits the allowed ip that is to be removed.
2. By reassigning an allowed ip to a "dummy" peer then removing that
peer with WGPEER_F_REMOVE_ME.
With the first approach, the driver completely rebuilds the allowed ip
list for a peer. If my current configuration is such that a peer has
allowed ips 192.168.0.2 and 192.168.0.3 and I want to remove 192.168.0.2
the actual transition looks like this.
[192.168.0.2, 192.168.0.3] <-- Initial state
[] <-- Step 1: Allowed ips removed for peer
[192.168.0.3] <-- Step 2: Allowed ips added back for peer
This is true even if the allowed ip list is small and the update does
not need to be batched into multiple WG_CMD_SET_DEVICE requests, as the
removal and subsequent addition of ips is non-atomic within a single
request. Consequently, wg_allowedips_lookup_dst and
wg_allowedips_lookup_src may return NULL while reconfiguring a peer even
for packets bound for ips a user did not intend to remove leading to
unintended interruptions in connectivity. This presents in userspace as
failed calls to sendto and sendmsg for UDP sockets. In my case, I ran
netperf while repeatedly reconfiguring the allowed ips for a peer with
wg.
/usr/local/bin/netperf -H 10.102.73.72 -l 10m -t UDP_STREAM -- -R 1 -m 1024
send_data: data send error: No route to host (errno 113)
netperf: send_omni: send_data failed: No route to host
While this may not be of particular concern for environments where peers
and allowed ips are mostly static, systems like Cilium manage peers and
allowed ips in a dynamic environment where peers (i.e. Kubernetes nodes)
and allowed ips (i.e. pods running on those nodes) can frequently
change making WGPEER_F_REPLACE_ALLOWEDIPS problematic.
The second approach avoids any possible connectivity interruptions
but is hacky and less direct, requiring the creation of a temporary
peer just to dispose of an allowed ip.
Introduce a new flag called WGALLOWEDIP_F_REMOVE_ME which in the same
way that WGPEER_F_REMOVE_ME allows a user to remove a single peer from
a WireGuard device's configuration allows a user to remove an ip from a
peer's set of allowed ips. This enables incremental updates to a
device's configuration without any connectivity blips or messy
workarounds.
A corresponding patch for wg extends the existing `wg set` interface to
leverage this feature.
$ wg set wg0 peer <PUBKEY> allowed-ips +192.168.88.0/24,-192.168.0.1/32
When '+' or '-' is prepended to any ip in the list, wg clears
WGPEER_F_REPLACE_ALLOWEDIPS and sets the WGALLOWEDIP_F_REMOVE_ME flag on
any ip prefixed with '-'.
Signed-off-by: Jordan Rife <jordan@jrife.io>
[Jason: minor style nits, fixes to selftest, bump of wireguard-tools version]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://patch.msgid.link/20250521212707.1767879-5-Jason@zx2c4.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Rather than manually validating flags against the various __ALL_*
constants, put this in the netlink policy description and have the upper
layer machinery check it for us.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://patch.msgid.link/20250521212707.1767879-4-Jason@zx2c4.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
When a character array without a terminating NUL character has a static
initializer, GCC 15's -Wunterminated-string-initialization will only
warn if the array lacks the "nonstring" attribute[1]. Mark the arrays
with __nonstring to correctly identify the char array as "not a C string"
and thereby eliminate the warning:
../drivers/net/wireguard/cookie.c:29:56: warning: initializer-string for array of 'unsigned char' truncates NUL terminator but destination lacks 'nonstring' attribute (9 chars into 8 available) [-Wunterminated-string-initialization]
29 | static const u8 mac1_key_label[COOKIE_KEY_LABEL_LEN] = "mac1----";
| ^~~~~~~~~~
../drivers/net/wireguard/cookie.c:30:58: warning: initializer-string for array of 'unsigned char' truncates NUL terminator but destination lacks 'nonstring' attribute (9 chars into 8 available) [-Wunterminated-string-initialization]
30 | static const u8 cookie_key_label[COOKIE_KEY_LABEL_LEN] = "cookie--";
| ^~~~~~~~~~
../drivers/net/wireguard/noise.c:28:38: warning: initializer-string for array of 'unsigned char' truncates NUL terminator but destination lacks 'nonstring' attribute (38 chars into 37 available) [-Wunterminated-string-initialization]
28 | static const u8 handshake_name[37] = "Noise_IKpsk2_25519_ChaChaPoly_BLAKE2s";
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../drivers/net/wireguard/noise.c:29:39: warning: initializer-string for array of 'unsigned char' truncates NUL terminator but destination lacks 'nonstring' attribute (35 chars into 34 available) [-Wunterminated-string-initialization]
29 | static const u8 identifier_name[34] = "WireGuard v1 zx2c4 Jason@zx2c4.com";
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The arrays are always used with their fixed size, so use __nonstring.
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117178 [1]
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://patch.msgid.link/20250521212707.1767879-3-Jason@zx2c4.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Commit 918327e9b7ff ("ubsan: Remove CONFIG_UBSAN_SANITIZE_ALL")
removed the CONFIG_UBSAN_SANITIZE_ALL configuration option.
Eliminate invalid configurations to improve code readability.
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://patch.msgid.link/20250521212707.1767879-2-Jason@zx2c4.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Kees Cook says:
====================
net: Convert dev_set_mac_address() to struct sockaddr_storage
As part of the effort to allow the compiler to reason about object sizes,
we need to deal with the problematic variably sized struct sockaddr,
which has no internal runtime size tracking. In much of the network
stack the use of struct sockaddr_storage has been adopted. Continue the
transition toward this for more of the internal APIs. Specifically:
- inet_addr_is_any()
- netif_set_mac_address()
- dev_set_mac_address()
- dev_set_mac_address_user()
Only a few callers of dev_set_mac_address() needed adjustment; all others
were already using struct sockaddr_storage internally.
v1: https://lore.kernel.org/all/20250520222452.work.063-kees@kernel.org/
====================
Link: https://patch.msgid.link/20250521204310.it.500-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Convert callers of dev_set_mac_address_user() to use struct
sockaddr_storage. Add sanity checks on dev->addr_len usage.
Signed-off-by: Kees Cook <kees@kernel.org>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-8-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Instead of a heap allocating a variably sized struct sockaddr and lying
about the type in the call to netif_set_mac_address(), use a stack
allocated struct sockaddr_storage. This lets us drop the cast and avoid
the allocation.
Putting "ss" on the stack means it will get a reused stack slot since
it is the same size (128B) as other existing single-scope stack variables,
like the vfinfo array (128B), so no additional stack space is used by
this function.
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kees Cook <kees@kernel.org>
Link: https://patch.msgid.link/20250521204619.2301870-7-kees@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|