Age | Commit message (Collapse) | Author |
|
netdev netlink is the only reader of netdev_{,rx_}queue->napi,
and it already holds netdev->lock. Switch protection of
the writes to netdev->lock to "ops protected".
The expectation will be now that accessing queue->napi
will require netdev->lock for "ops locked" drivers, and
rtnl_lock for all other drivers.
Current "ops locked" drivers don't require any changes.
gve and netdevsim use _locked() helpers right next to
netif_queue_set_napi() so they must be holding the instance
lock. iavf doesn't call it. bnxt is a bit messy but all paths
seem locked.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Drivers which opt into instance lock protection of ops should
only call set_real_num_*_queues() under the instance lock.
This means that queue counts are double protected (writes
are under both rtnl_lock and instance lock, readers under
either).
Some readers may still be under the rtnl_lock, however, so for
now we need double protection of writers.
OTOH queue API paths are only under the protection of the instance
lock, so we need to validate that the instance is actually locking
ops, otherwise the input checks we do against queue count are racy.
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Since commit a953be53ce40 ("net-sysfs: add support for device-specific
rx queue sysfs attributes"), so for at least a decade now it is safe
to call net_rx_queue_update_kobjects() when SYSFS=n. That function
does its own ifdef-inery and will return 0. Remove the unnecessary
stub for netif_set_real_num_rx_queues().
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
net_devmem_unbind_dmabuf()
A recent commit added taking the netdev instance lock
in netdev_nl_bind_rx_doit(), but didn't remove it in
net_devmem_unbind_dmabuf() which it calls from an error path.
Always expect the callers of net_devmem_unbind_dmabuf() to
hold the lock. This is consistent with net_devmem_bind_dmabuf().
(Not so) coincidentally this also protects mp_param with the instance
lock, which the rest of this series needs.
Fixes: 1d22d3060b9b ("net: drop rtnl_lock for queue_mgmt operations")
Reviewed-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250324224537.248800-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Support TX timestamping in SCO sockets.
Not available for hdevs without SCO_FLOWCTL.
Support MSG_ERRQUEUE in SCO recvmsg.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Support TX timestamping in L2CAP sockets.
Support MSG_ERRQUEUE recvmsg.
For other than SOCK_STREAM L2CAP sockets, if a packet from sendmsg() is
fragmented, only the first ACL fragment is timestamped.
For SOCK_STREAM L2CAP sockets, use the bytestream convention and
timestamp the last fragment and count bytes in tskey.
Timestamps are not generated in the Enhanced Retransmission mode, as
meaning of COMPLETION stamp is unclear if L2CAP layer retransmits.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Add BT_SCM_ERROR socket CMSG type.
Support TX timestamping in ISO sockets.
Support MSG_ERRQUEUE in ISO recvmsg.
If a packet from sendmsg() is fragmented, only the first ACL fragment is
timestamped.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Support enabling TX timestamping for some skbs, and track them until
packet completion. Generate software SCM_TSTAMP_COMPLETION when getting
completion report from hardware.
Generate software SCM_TSTAMP_SND before sending to driver. Sending from
driver requires changes in the driver API, and drivers mostly are going
to send the skb immediately.
Make the default situation with no COMPLETION TX timestamping more
efficient by only counting packets in the queue when there is nothing to
track. When there is something to track, we need to make clones, since
the driver may modify sent skbs.
The tx_q queue length is bounded by the hdev flow control, which will
not send new packets before it has got completion reports for old ones.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Add SOF_TIMESTAMPING_TX_COMPLETION, for requesting a software timestamp
when hardware reports a packet completed.
Completion tstamp is useful for Bluetooth, as hardware timestamps do not
exist in the HCI specification except for ISO packets, and the hardware
has a queue where packets may wait. In this case the software SND
timestamp only reflects the kernel-side part of the total latency
(usually small) and queue length (usually 0 unless HW buffers
congested), whereas the completion report time is more informative of
the true latency.
It may also be useful in other cases where HW TX timestamps cannot be
obtained and user wants to estimate an upper bound to when the TX
probably happened.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
This logs the devcd dumps with hci_recv_diag so they appear in the
monitor traces with proper timestamps which can then be used to relate
the HCI traffic that caused the dump:
= Vendor Diagnostic (len 174)
42 6c 75 65 74 6f 6f 74 68 20 64 65 76 63 6f 72 Bluetooth devcor
65 64 75 6d 70 0a 53 74 61 74 65 3a 20 32 0a 00 edump.State: 2..
43 6f 6e 74 72 6f 6c 6c 65 72 20 4e 61 6d 65 3a Controller Name:
20 76 68 63 69 5f 63 74 72 6c 0a 46 69 72 6d 77 vhci_ctrl.Firmw
61 72 65 20 56 65 72 73 69 6f 6e 3a 20 76 68 63 are Version: vhc
69 5f 66 77 0a 44 72 69 76 65 72 3a 20 76 68 63 i_fw.Driver: vhc
69 5f 64 72 76 0a 56 65 6e 64 6f 72 3a 20 76 68 i_drv.Vendor: vh
63 69 0a 2d 2d 2d 20 53 74 61 72 74 20 64 75 6d ci.--- Start dum
70 20 2d 2d 2d 0a 74 65 73 74 20 64 61 74 61 00 p ---.test data.
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 ..............
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Return Parameters is not only status, also bdaddr:
BLUETOOTH CORE SPECIFICATION Version 5.4 | Vol 4, Part E
page 1870:
BLUETOOTH CORE SPECIFICATION Version 5.0 | Vol 2, Part E
page 802:
Return parameters:
Status:
Size: 1 octet
BD_ADDR:
Size: 6 octets
Note that it also fixes the warning:
"Bluetooth: hci0: unexpected cc 0x041a length: 7 > 1"
Fixes: c8992cffbe741 ("Bluetooth: hci_event: Use of a function table to handle Command Complete")
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
This enables buffer flow control for SCO/eSCO
(see: Bluetooth Core 6.0 spec: 6.22. Synchronous Flow Control Enable),
recently this has caused the following problem and is actually a nice
addition for the likes of Socket TX complete:
< HCI Command: Read Buffer Size (0x04|0x0005) plen 0
> HCI Event: Command Complete (0x0e) plen 11
Read Buffer Size (0x04|0x0005) ncmd 1
Status: Success (0x00)
ACL MTU: 1021 ACL max packet: 5
SCO MTU: 240 SCO max packet: 8
...
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
< SCO Data TX: Handle 257 flags 0x00 dlen 120
> HCI Event: Hardware Error (0x10) plen 1
Code: 0x0a
To fix the code will now attempt to enable buffer flow control when
HCI_QUIRK_SYNC_FLOWCTL_SUPPORTED is set by the driver:
< HCI Command: Write Sync Fl.. (0x03|0x002f) plen 1
Flow control: Enabled (0x01)
> HCI Event: Command Complete (0x0e) plen 4
Write Sync Flow Control Enable (0x03|0x002f) ncmd 1
Status: Success (0x00)
On success then HCI_SCO_FLOWCTL would be set which indicates sco_cnt
shall be used for flow contro.
Fixes: 7fedd3bb6b77 ("Bluetooth: Prioritize SCO traffic")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tested-by: Pauli Virtanen <pav@iki.fi>
|
|
A SCO connection without the proper voice_setting can cause
the controller to lock up.
Signed-off-by: Pedro Nishiyama <nishiyama.pedro@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Some fake controllers cannot be initialized because they return a smaller
report than expected for READ_PAGE_SCAN_TYPE.
Signed-off-by: Pedro Nishiyama <nishiyama.pedro@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Some fake controllers cannot be initialized because they return a smaller
report than expected for READ_VOICE_SETTING.
Signed-off-by: Pedro Nishiyama <nishiyama.pedro@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies() for readability.
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies() for readability.
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication.
This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:
@depends on patch@
expression E;
@@
-msecs_to_jiffies(E * 1000)
+secs_to_jiffies(E)
-msecs_to_jiffies(E * MSEC_PER_SEC)
+secs_to_jiffies(E)
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
mgmt_start_discovery_complete() and mgmt_stop_discovery_complete() last
uses were removed in 2022 by
commit ec2904c259c5 ("Bluetooth: Remove dead code from hci_request.c")
Remove them.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
mgmt_pending_find_data() last use was removed in 2021 by
commit 5a7501374664 ("Bluetooth: hci_sync: Convert MGMT_OP_GET_CLOCK_INFO")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Revert "udp_tunnel: use static call for GRO hooks when possible"
This reverts commit 311b36574ceaccfa3f91b74054a09cd4bb877702.
Revert "udp_tunnel: create a fastpath GRO lookup."
This reverts commit 8d4880db378350f8ed8969feea13bdc164564fc1.
There are multiple small issues with the series. In the interest
of unblocking the merge window let's opt for a revert.
Link: https://lore.kernel.org/cover.1742557254.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2025-03-24
1) Prevent setting high order sequence number bits input in
non-ESN mode. From Leon Romanovsky.
2) Support PMTU handling in tunnel mode for packet offload.
From Leon Romanovsky.
3) Make xfrm_state_lookup_byaddr lockless.
From Florian Westphal.
4) Remove unnecessary NULL check in xfrm_lookup_with_ifid().
From Dan Carpenter.
* tag 'ipsec-next-2025-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
xfrm: Remove unnecessary NULL check in xfrm_lookup_with_ifid()
xfrm: state: make xfrm_state_lookup_byaddr lockless
xfrm: check for PMTU in tunnel mode for packet offload
xfrm: provide common xdo_dev_offload_ok callback implementation
xfrm: rely on XFRM offload
xfrm: simplify SA initialization routine
xfrm: delay initialization of offload path till its actually requested
xfrm: prevent high SEQ input in non-ESN mode
====================
Link: https://patch.msgid.link/20250324061855.4116819-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next:
1) Use kvmalloc in xt_hashlimit, from Denis Kirjanov.
2) Tighten nf_conntrack sysctl accepted values for nf_conntrack_max
and nf_ct_expect_max, from Nicolas Bouchinet.
3) Avoid lookup in nft_fib if socket is available, from Florian Westphal.
4) Initialize struct lsm_context in nfnetlink_queue to avoid
hypothetical ENOMEM errors, Chenyuan Yang.
5) Use strscpy() instead of _pad when initializing xtables table name,
kzalloc is already used to initialized the table memory area.
From Thorsten Blum.
6) Missing socket lookup by conntrack information for IPv6 traffic
in nft_socket, there is a similar chunk in IPv4, this was never
added when IPv6 NAT was introduced. From Maxim Mikityanskiy.
7) Fix clang issues with nf_tables CONFIG_MITIGATION_RETPOLINE,
from WangYuli.
* tag 'nf-next-25-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: Only use nf_skip_indirect_calls() when MITIGATION_RETPOLINE
netfilter: socket: Lookup orig tuple for IPv6 SNAT
netfilter: xtables: Use strscpy() instead of strscpy_pad()
netfilter: nfnetlink_queue: Initialize ctx to avoid memory allocation error
netfilter: fib: avoid lookup if socket is available
netfilter: conntrack: Bound nf_conntrack sysctl writes
netfilter: xt_hashlimit: replace vmalloc calls with kvmalloc
====================
Link: https://patch.msgid.link/20250323100922.59983-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
RFS is using two kinds of hash tables.
First one is controlled by /proc/sys/net/core/rps_sock_flow_entries = 2^N
and using the N low order bits of the l4 hash is good enough.
Then each RX queue has its own hash table, controlled by
/sys/class/net/eth1/queues/rx-$q/rps_flow_cnt = 2^X
Current hash function, using the X low order bits is suboptimal,
because RSS is usually using Func(hash) = (hash % power_of_two);
For example, with 32 RX queues, 6 low order bits have no entropy
for a given queue.
Switch this hash function to hash_32(hash, log) to increase
chances to use all possible slots and reduce collisions.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250321171309.634100-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
More features for 6.15, major changes:
* cfg80211/mac80211: fix and enable link reconfiguration
* rtw88: support RTL8814AE/RTL8814AU
* mt7996: preparations for MLO
* ath12k: continued work on MLO
* iwlwifi: add new iwlmld sub-driver/op-mode for
some current and future devices
* wfx: wowlan support
* tag 'wireless-next-2025-03-20' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (311 commits)
wifi: mt76: mt7996: fix locking in mt7996_mac_sta_rc_work()
wifi: mt76: mt76x2u: add TP-Link TL-WDN6200 ID to device table
wifi: mt76: mt792x: re-register CHANCTX_STA_CSA only for the mt7921 series
wifi: mt76: mt7996: Update mt7996_tx to MLO support
wifi: mt76: mt7996: rework mt7996_ampdu_action to support MLO
wifi: mt76: mt7996: rework set/get_tsf callabcks to support MLO
wifi: mt76: mt7996: set vif default link_id adding/removing vif links
wifi: mt76: mt7996: rework mt7996_mcu_beacon_inband_discov to support MLO
wifi: mt76: mt7996: rework mt7996_mcu_add_obss_spr to support MLO
wifi: mt76: mt7996: rework mt7996_net_fill_forward_path to support MLO
wifi: mt76: mt7996: rework mt7996_update_mu_group to support MLO
wifi: mt76: mt7996: rework mt7996_mac_sta_poll to support MLO
wifi: mt76: mt7996: rework mt7996_mac_sta_rc_work to support MLO
wifi: mt76: mt7996: remove mt7996_mac_enable_rtscts()
wifi: mt76: mt7996: rework mt7996_sta_hw_queue_read to support MLO
wifi: mt76: mt7996: rework mt7996_set_hw_key to support MLO
wifi: mt76: mt7996: Add mt7996_sta_link to mt7996_mcu_add_bss_info signature
wifi: mt76: mt7996: rework mt7996_sta_set_4addr and mt7996_sta_set_decap_offload to support MLO
wifi: mt76: mt7996: rework mt7996_rx_get_wcid to support MLO
wifi: mt76: mt7996: Rely on wcid_to_sta in mt7996_mac_add_txs_skb()
...
====================
Link: https://patch.msgid.link/20250320131106.33266-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
TCP uses generic skb_set_owner_r() and sock_rfree()
for received packets, with socket lock being owned.
Switch to private versions, avoiding two atomic operations
per packet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250320121604.3342831-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In rtm_del_nexthop(), only nexthop_find_by_id() and remove_nexthop()
require RTNL as they touch net->nexthop.rb_root.
Let's move RTNL down as rtnl_net_lock() before nexthop_find_by_id().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250319230743.65267-8-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
If we pass false to the rtnl_held param of lwtunnel_valid_encap_type(),
we can move RTNL down before rtm_to_nh_config_rtnl().
Let's use rtnl_net_lock() in rtm_new_nexthop().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250319230743.65267-7-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The number of NHA_GROUP entries is guaranteed to be non-zero in
nh_check_attr_group().
Let's remove the redundant check in nexthop_create_group().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250319230743.65267-6-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
nexthop_add() checks if NLM_F_REPLACE is specified without
non-zero NHA_ID, which does not require RTNL.
Let's move the check to rtm_new_nexthop().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250319230743.65267-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
NHA_OIF needs to look up a device by __dev_get_by_index(),
which requires RTNL.
Let's move NHA_OIF validation to rtm_to_nh_config_rtnl().
Note that the proceeding checks made the original !cfg->nh_fdb
check redundant.
NHA_FDB is set -> NHA_OIF cannot be set
NHA_FDB is set but false -> NHA_OIF must be set
NHA_FDB is not set -> NHA_OIF must be set
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250319230743.65267-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We will push RTNL down to rtm_new_nexthop(), and then we
want to move non-RTNL operations out of the scope.
nh_check_attr_group() validates NHA_GROUP attributes, and
nexthop_find_by_id() and some validation requires RTNL.
Let's factorise such parts as nh_check_attr_group_rtnl()
and call it from rtm_to_nh_config_rtnl().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250319230743.65267-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We will split rtm_to_nh_config() into non-RTNL and RTNL parts,
and then the latter also needs tb.
As a prep, let's move nlmsg_parse() to rtm_new_nexthop().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250319230743.65267-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
net/unix/*.c include many unnecessary header files (rtnetlink.h,
netdevice.h, etc).
Let's clean them up.
af_unix.c:
+uapi/linux/sockios.h : Only exist under include/uapi
+uapi/linux/termios.h : Only exist under include/uapi
-linux/freezer.h : No longer use freezable_schedule_timeout()
-linux/in.h : No ipv4_is_XXX() etc
-linux/module.h : No longer support CONFIG_UNIX=m
-linux/netdevice.h : No dev used
-linux/rtnetlink.h : Not part of rtnetlink API
-linux/signal.h : signal_pending() is defined in sched/signal.h
-linux/stat.h : No struct stat used
-net/checksum.h : CHECKSUM_UNNECESSARY is defined in skbuff.h
diag.c:
+linux/dcache.h : struct dentry in sk_diag_dump_vfs()
+linux/user_namespace.h : struct user_namespace in sk_diag_dump_uid()
+uapi/linux/unix_diag.h : Only exist under include/uapi/
garbage.c:
+linux/list.h : struct unix_{vertex,edge}, etc
+linux/workqueue.h : DECLARE_WORK(unix_gc_work, ...)
-linux/file.h : No fget() etc
-linux/kernel.h : No cond_resched() etc
-linux/netdevice.h : No dev used
-linux/proc_fs.h : No procfs provided
-linux/string.h : No memcpy(), kmemdup(), etc
sysctl_net_unix.c:
+linux/string.h : kmemdup()
+net/net_namespace.h : struct net, net_eq()
-linux/mm.h : slab.h is enough
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250318034934.86708-5-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
include/net/af_unix.h indirectly includes some definitions for structs.
Let's include such headers explicitly.
linux/atomic.h : scm_stat.nr_fds
linux/net.h : unix_sock.peer_wq
linux/path.h : unix_sock.path
linux/spinlock.h : unix_sock.lock
linux/wait.h : unix_sock.peer_wake
uapi/linux/un.h : unix_address.name[]
linux/socket.h is removed as the structs there are not used directly,
and linux/un.h is clarified with uapi as un.h only exists under
include/uapi.
While at it, duplicate headers are removed from .c files.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250318034934.86708-4-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
net/af_unix.h is included by core and some LSMs, but most definitions
need not be.
Let's move struct unix_{vertex,edge} to net/unix/garbage.c and other
definitions to net/unix/af_unix.h.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250318034934.86708-3-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is a prep patch to make the following changes cleaner.
No functional change intended.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250318034934.86708-2-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Support adjusting/reading delayed ack max for socket level by using
set/getsockopt().
This option aligns with TCP_BPF_DELACK_MAX usage. Considering that bpf
option was implemented before this patch, so we need to use a standalone
new option for pure tcp set/getsockopt() use.
Add WRITE_ONCE/READ_ONCE() to prevent data-race if setsockopt()
happens to write one value to icsk_delack_max while icsk_delack_max is
being read.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250317120314.41404-3-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Support adjusting/reading RTO MIN for socket level by using set/getsockopt().
This new option has the same effect as TCP_BPF_RTO_MIN, which means it
doesn't affect RTAX_RTO_MIN usage (by using ip route...). Considering that
bpf option was implemented before this patch, so we need to use a standalone
new option for pure tcp set/getsockopt() use.
When the socket is created, its icsk_rto_min is set to the default
value that is controlled by sysctl_tcp_rto_min_us. Then if application
calls setsockopt() with TCP_RTO_MIN_US flag to pass a valid value, then
icsk_rto_min will be overridden in jiffies unit.
This patch adds WRITE_ONCE/READ_ONCE to avoid data-race around
icsk_rto_min.
Signed-off-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250317120314.41404-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently network taps unbound to any interface are linked in the
global ptype_all list, affecting the performance in all the network
namespaces.
Add per netns ptypes chains, so that in the mentioned case only
the netns owning the packet socket(s) is affected.
While at that drop the global ptype_all list: no in kernel user
registers a tap on "any" type without specifying either the target
device or the target namespace (and IMHO doing that would not make
any sense).
Note that this adds a conditional in the fast path (to check for
per netns ptype_specific list) and increases the dataset size by
a cacheline (owing the per netns lists).
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumaze@google.com>
Link: https://patch.msgid.link/ae405f98875ee87f8150c460ad162de7e466f8a7.1742494826.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The assignment of zero to udph->check is unnecessary as it is
immediately overwritten in the subsequent line. Remove the redundant
assignment.
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20250319-netpoll_nit-v1-1-a7faac5cbd92@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs afs updates from Christian Brauner:
"This contains the work for afs for this cycle:
- Fix an occasional hang that's only really encountered when
rmmod'ing the kafs module
- Remove the "-o autocell" mount option. This is obsolete with the
dynamic root and removing it makes the next patch slightly easier
- Change how the dynamic root mount is constructed. Currently, the
root directory is (de)populated when it is (un)mounted if there are
cells already configured and, further, pairs of automount points
have to be created/removed each time a cell is added/deleted
This is changed so that readdir on the root dir lists all the known
cell automount pairs plus the @cell symlinks and the inodes and
dentries are constructed by lookup on demand. This simplifies the
cell management code
- A few improvements to the afs_volume and afs_server tracepoints
- Pass trace info into the afs_lookup_cell() function to allow the
trace log to indicate the purpose of the lookup
- Remove the 'net' parameter from afs_unuse_cell() as it's
superfluous
- In rxrpc, allow a kernel app (such as kafs) to store a word of
information on rxrpc_peer records
- Use the information stored on the rxrpc_peer record to point to the
afs_server record. This allows the server address lookup to be done
away with
- Simplify the afs_server ref/activity accounting to make each one
self-contained and not garbage collected from the cell management
work item
- Simplify the afs_cell ref/activity accounting to make each one of
these also self-contained and not driven by a central management
work item
The current code was intended to make it such that a single timer
for the namespace and one work item per cell could do all the work
required to maintain these records. This, however, made for some
sequencing problems when cleaning up these records. Further, the
attempt to pass refs along with timers and work items made getting
it right rather tricky when the timer or work item already had a
ref attached and now a ref had to be got rid of"
* tag 'vfs-6.15-rc1.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
afs: Simplify cell record handling
afs: Fix afs_server ref accounting
afs: Use the per-peer app data provided by rxrpc
rxrpc: Allow the app to store private data on peer structs
afs: Drop the net parameter from afs_unuse_cell()
afs: Make afs_lookup_cell() take a trace note
afs: Improve server refcount/active count tracing
afs: Improve afs_volume tracing to display a debug ID
afs: Change dynroot to create contents on demand
afs: Remove the "autocell" mount option
|
|
inet_connection_sock_af_ops.addr2sockaddr() hasn't been used at all
in the git era.
$ git grep addr2sockaddr $(git rev-list HEAD | tail -n 1)
Let's remove it.
Note that there was a 4 bytes hole after sockaddr_len and now it's
6 bytes, so the binary layout is not changed.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250318060112.3729-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add strict buffer parsing index check to avoid the following Smatch
warning:
net/core/pktgen.c:877 get_imix_entries()
warn: check that incremented offset 'i' is capped
Checking the buffer index i after every get_user/i++ step and returning
with error code immediately avoids the current indirect (but correct)
error handling.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/netdev/36cf3ee2-38b1-47e5-a42a-363efeb0ace3@stanley.mountain/
Signed-off-by: Peter Seiderer <ps.report@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250317090401.1240704-1-ps.report@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As a followup of my presentation in Zagreb for netdev 0x19:
icsk_clean_acked is only used by TCP when/if CONFIG_TLS_DEVICE
is enabled from tcp_ack().
Rename it to tcp_clean_acked, move it to tcp_sock structure
in the tcp_sock_read_rx for better cache locality in TCP
fast path.
Define this field only when CONFIG_TLS_DEVICE is enabled
saving 8 bytes on configs not using it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250317085313.2023214-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some field descriptions were missing, some were not very accurate.
Not touching the uAPI header or .c files for now.
Formatting of those comments isn't great in general, but at least
they are not missing anything now.
Before:
$ ./scripts/kernel-doc -none -Wall net/openvswitch/*.h 2>&1 | wc -l
16
After:
$ ./scripts/kernel-doc -none -Wall net/openvswitch/*.h 2>&1 | wc -l
0
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://patch.msgid.link/20250320224431.252489-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Binding AX25 socket by using the autobind feature leads to memory leaks
in ax25_connect() and also refcount leaks in ax25_release(). Memory
leak was detected with kmemleak:
================================================================
unreferenced object 0xffff8880253cd680 (size 96):
backtrace:
__kmalloc_node_track_caller_noprof (./include/linux/kmemleak.h:43)
kmemdup_noprof (mm/util.c:136)
ax25_rt_autobind (net/ax25/ax25_route.c:428)
ax25_connect (net/ax25/af_ax25.c:1282)
__sys_connect_file (net/socket.c:2045)
__sys_connect (net/socket.c:2064)
__x64_sys_connect (net/socket.c:2067)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
================================================================
When socket is bound, refcounts must be incremented the way it is done
in ax25_bind() and ax25_setsockopt() (SO_BINDTODEVICE). In case of
autobind, the refcounts are not incremented.
This bug leads to the following issue reported by Syzkaller:
================================================================
ax25_connect(): syz-executor318 uses autobind, please contact jreuter@yaina.de
------------[ cut here ]------------
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 0 PID: 5317 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
Modules linked in:
CPU: 0 UID: 0 PID: 5317 Comm: syz-executor318 Not tainted 6.14.0-rc4-syzkaller-00278-gece144f151ac #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
...
Call Trace:
<TASK>
__refcount_dec include/linux/refcount.h:336 [inline]
refcount_dec include/linux/refcount.h:351 [inline]
ref_tracker_free+0x6af/0x7e0 lib/ref_tracker.c:236
netdev_tracker_free include/linux/netdevice.h:4302 [inline]
netdev_put include/linux/netdevice.h:4319 [inline]
ax25_release+0x368/0x960 net/ax25/af_ax25.c:1080
__sock_release net/socket.c:647 [inline]
sock_close+0xbc/0x240 net/socket.c:1398
__fput+0x3e9/0x9f0 fs/file_table.c:464
__do_sys_close fs/open.c:1580 [inline]
__se_sys_close fs/open.c:1565 [inline]
__x64_sys_close+0x7f/0x110 fs/open.c:1565
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
...
</TASK>
================================================================
Considering the issues above and the comments left in the code that say:
"check if we can remove this feature. It is broken."; "autobinding in this
may or may not work"; - it is better to completely remove this feature than
to fix it because it is broken and leads to various kinds of memory bugs.
Now calling connect() without first binding socket will result in an
error (-EINVAL). Userspace software that relies on the autobind feature
might get broken. However, this feature does not seem widely used with
this specific driver as it was not reliable at any point of time, and it
is already broken anyway. E.g. ax25-tools and ax25-apps packages for
popular distributions do not use the autobind feature for AF_AX25.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+33841dc6aa3e1d86b78a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=33841dc6aa3e1d86b78a
Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
MITIGATION_RETPOLINE
1. MITIGATION_RETPOLINE is x86-only (defined in arch/x86/Kconfig),
so no need to AND with CONFIG_X86 when checking if enabled.
2. Remove unused declaration of nf_skip_indirect_calls() when
MITIGATION_RETPOLINE is disabled to avoid warnings.
3. Declare nf_skip_indirect_calls() and nf_skip_indirect_calls_enable()
as inline when MITIGATION_RETPOLINE is enabled, as they are called
only once and have simple logic.
Fix follow error with clang-21 when W=1e:
net/netfilter/nf_tables_core.c:39:20: error: unused function 'nf_skip_indirect_calls' [-Werror,-Wunused-function]
39 | static inline bool nf_skip_indirect_calls(void) { return false; }
| ^~~~~~~~~~~~~~~~~~~~~~
1 error generated.
make[4]: *** [scripts/Makefile.build:207: net/netfilter/nf_tables_core.o] Error 1
make[3]: *** [scripts/Makefile.build:465: net/netfilter] Error 2
make[3]: *** Waiting for unfinished jobs....
Fixes: d8d760627855 ("netfilter: nf_tables: add static key to skip retpoline workarounds")
Co-developed-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
nf_sk_lookup_slow_v4 does the conntrack lookup for IPv4 packets to
restore the original 5-tuple in case of SNAT, to be able to find the
right socket (if any). Then socket_match() can correctly check whether
the socket was transparent.
However, the IPv6 counterpart (nf_sk_lookup_slow_v6) lacks this
conntrack lookup, making xt_socket fail to match on the socket when the
packet was SNATed. Add the same logic to nf_sk_lookup_slow_v6.
IPv6 SNAT is used in Kubernetes clusters for pod-to-world packets, as
pods' addresses are in the fd00::/8 ULA subnet and need to be replaced
with the node's external address. Cilium leverages Envoy to enforce L7
policies, and Envoy uses transparent sockets. Cilium inserts an iptables
prerouting rule that matches on `-m socket --transparent` and redirects
the packets to localhost, but it fails to match SNATed IPv6 packets due
to that missing conntrack lookup.
Closes: https://github.com/cilium/cilium/issues/37932
Fixes: eb31628e37a0 ("netfilter: nf_tables: Add support for IPv6 NAT")
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
kzalloc() already zero-initializes the destination buffer, making
strscpy() sufficient for safely copying the name. The additional NUL-
padding performed by strscpy_pad() is unnecessary.
The size parameter is optional, and strscpy() automatically determines
the size of the destination buffer using sizeof() if the argument is
omitted. This makes the explicit sizeof() call unnecessary; remove it.
No functional changes intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|