summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-08-15seg6: add NEXT-C-SID support for SRv6 End.X behaviorAndrea Mayer
The NEXT-C-SID mechanism described in [1] offers the possibility of encoding several SRv6 segments within a single 128 bit SID address. Such a SID address is called a Compressed SID (C-SID) container. In this way, the length of the SID List can be drastically reduced. A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address logically structured in three main blocks: i) Locator-Block; ii) Locator-Node Function; iii) Argument. C-SID container +------------------------------------------------------------------+ | Locator-Block |Loc-Node| Argument | | |Function| | +------------------------------------------------------------------+ <--------- B -----------> <- NF -> <------------- A ---------------> (i) The Locator-Block can be any IPv6 prefix available to the provider; (ii) The Locator-Node Function represents the node and the function to be triggered when a packet is received on the node; (iii) The Argument carries the remaining C-SIDs in the current C-SID container. This patch leverages the NEXT-C-SID mechanism previously introduced in the Linux SRv6 subsystem [2] to support SID compression capabilities in the SRv6 End.X behavior [3]. An SRv6 End.X behavior with NEXT-C-SID flavor works as an End.X behavior but it is capable of processing the compressed SID List encoded in C-SID containers. An SRv6 End.X behavior with NEXT-C-SID flavor can be configured to support user-provided Locator-Block and Locator-Node Function lengths. In this implementation, such lengths must be evenly divisible by 8 (i.e. must be byte-aligned), otherwise the kernel informs the user about invalid values with a meaningful error code and message through netlink_ext_ack. If Locator-Block and/or Locator-Node Function lengths are not provided by the user during configuration of an SRv6 End.X behavior instance with NEXT-C-SID flavor, the kernel will choose their default values i.e., 32-bit Locator-Block and 16-bit Locator-Node Function. [1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression [2] - https://lore.kernel.org/all/20220912171619.16943-1-andrea.mayer@uniroma2.it/ [3] - https://datatracker.ietf.org/doc/html/rfc8986#name-endx-l3-cross-connect Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230812180926.16689-2-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16netfilter: nft_dynset: disallow object mapsPablo Neira Ayuso
Do not allow to insert elements from datapath to objects maps. Fixes: 8aeff920dcc9 ("netfilter: nf_tables: add stateful object reference to set elements") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: GC transaction race with netns dismantlePablo Neira Ayuso
Use maybe_get_net() since GC workqueue might race with netns exit path. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: fix GC transaction races with netns and netlink event ↵Pablo Neira Ayuso
exit path Netlink event path is missing a synchronization point with GC transactions. Add GC sequence number update to netns release path and netlink event path, any GC transaction losing race will be discarded. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16ipvs: fix racy memcpy in proc_do_sync_thresholdSishuai Gong
When two threads run proc_do_sync_threshold() in parallel, data races could happen between the two memcpy(): Thread-1 Thread-2 memcpy(val, valp, sizeof(val)); memcpy(valp, val, sizeof(val)); This race might mess up the (struct ctl_table *) table->data, so we add a mutex lock to serialize them. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/netdev/B6988E90-0A1E-4B85-BF26-2DAF6D482433@gmail.com/ Signed-off-by: Sishuai Gong <sishuai.system@gmail.com> Acked-by: Simon Horman <horms@kernel.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: set default timeout to 3 secs for sctp shutdown send and recv stateXin Long
In SCTP protocol, it is using the same timer (T2 timer) for SHUTDOWN and SHUTDOWN_ACK retransmission. However in sctp conntrack the default timeout value for SCTP_CONNTRACK_SHUTDOWN_ACK_SENT state is 3 secs while it's 300 msecs for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV state. As Paolo Valerio noticed, this might cause unwanted expiration of the ct entry. In my test, with 1s tc netem delay set on the NAT path, after the SHUTDOWN is sent, the sctp ct entry enters SCTP_CONNTRACK_SHUTDOWN_SEND state. However, due to 300ms (too short) delay, when the SHUTDOWN_ACK is sent back from the peer, the sctp ct entry has expired and been deleted, and then the SHUTDOWN_ACK has to be dropped. Also, it is confusing these two sysctl options always show 0 due to all timeout values using sec as unit: net.netfilter.nf_conntrack_sctp_timeout_shutdown_recd = 0 net.netfilter.nf_conntrack_sctp_timeout_shutdown_sent = 0 This patch fixes it by also using 3 secs for sctp shutdown send and recv state in sctp conntrack, which is also RTO.initial value in SCTP protocol. Note that the very short time value for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV was probably used for a rare scenario where SHUTDOWN is sent on 1st path but SHUTDOWN_ACK is replied on 2nd path, then a new connection started immediately on 1st path. So this patch also moves from SHUTDOWN_SEND/RECV to CLOSE when receiving INIT in the ORIGINAL direction. Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") Reported-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: don't fail inserts if duplicate has expiredFlorian Westphal
nftables selftests fail: run-tests.sh testcases/sets/0044interval_overlap_0 Expected: 0-2 . 0-3, got: W: [FAILED] ./testcases/sets/0044interval_overlap_0: got 1 Insertion must ignore duplicate but expired entries. Moreover, there is a strange asymmetry in nft_pipapo_activate: It refetches the current element, whereas the other ->activate callbacks (bitmap, hash, rhash, rbtree) use elem->priv. Same for .remove: other set implementations take elem->priv, nft_pipapo_remove fetches elem->priv, then does a relookup, remove this. I suspect this was the reason for the change that prompted the removal of the expired check in pipapo_get() in the first place, but skipping exired elements there makes no sense to me, this helper is used for normal get requests, insertions (duplicate check) and deactivate callback. In first two cases expired elements must be skipped. For ->deactivate(), this gets called for DELSETELEM, so it seems to me that expired elements should be skipped as well, i.e. delete request should fail with -ENOENT error. Fixes: 24138933b97b ("netfilter: nf_tables: don't skip expired elements during walk") Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: deactivate catchall elements in next generationFlorian Westphal
When flushing, individual set elements are disabled in the next generation via the ->flush callback. Catchall elements are not disabled. This is incorrect and may lead to double-deactivations of catchall elements which then results in memory leaks: WARNING: CPU: 1 PID: 3300 at include/net/netfilter/nf_tables.h:1172 nft_map_deactivate+0x549/0x730 CPU: 1 PID: 3300 Comm: nft Not tainted 6.5.0-rc5+ #60 RIP: 0010:nft_map_deactivate+0x549/0x730 [..] ? nft_map_deactivate+0x549/0x730 nf_tables_delset+0xb66/0xeb0 (the warn is due to nft_use_dec() detecting underflow). Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Reported-by: lonial con <kongln9170@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: fix kdoc warnings after gc reworkFlorian Westphal
Jakub Kicinski says: We've got some new kdoc warnings here: net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc' net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc' include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set' Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/ Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: fix false-positive lockdep splatFlorian Westphal
->abort invocation may cause splat on debug kernels: WARNING: suspicious RCU usage net/netfilter/nft_set_pipapo.c:1697 suspicious rcu_dereference_check() usage! [..] rcu_scheduler_active = 2, debug_locks = 1 1 lock held by nft/133554: [..] (nft_net->commit_mutex){+.+.}-{3:3}, at: nf_tables_valid_genid [..] lockdep_rcu_suspicious+0x1ad/0x260 nft_pipapo_abort+0x145/0x180 __nf_tables_abort+0x5359/0x63d0 nf_tables_abort+0x24/0x40 nfnetlink_rcv+0x1a0a/0x22c0 netlink_unicast+0x73c/0x900 netlink_sendmsg+0x7f0/0xc20 ____sys_sendmsg+0x48d/0x760 Transaction mutex is held, so parallel updates are not possible. Switch to _protected and check mutex is held for lockdep enabled builds. Fixes: 212ed75dc5fb ("netfilter: nf_tables: integrate pipapo into commit protocol") Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-15Merge branch 'genetlink-provide-struct-genl_info-to-dumps'Jakub Kicinski
Jakub Kicinski says: ==================== genetlink: provide struct genl_info to dumps One of the biggest (which is not to say only) annoyances with genetlink handling today is that doit and dumpit need some of the same information, but it is passed to them in completely different structs. The implementations commonly end up writing a _fill() method which populates a message and have to pass at least 6 parameters. 3 of which are extracted manually from request info. After a lot of umming and ahing I decided to populate struct genl_info for dumps, without trying to factor out only the common parts. This makes the adoption easiest. In the future we may add a new version of dump which takes struct genl_info *info as the second argument, instead of struct netlink_callback *cb. For now developers have to call genl_info_dump(cb) to get the info. Typical genetlink families no longer get exposed to netlink protocol internals like pid and seq numbers. v3: - correct the condition in ethtool code (patch 10) v2: https://lore.kernel.org/all/20230810233845.2318049-1-kuba@kernel.org/ - replace the GENL_INFO_NTF() macro with init helper - fix the commit messages v1: https://lore.kernel.org/all/20230809182648.1816537-1-kuba@kernel.org/ ==================== Link: https://lore.kernel.org/r/20230814214723.2924989-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15ethtool: netlink: always pass genl_info to .prepare_dataJakub Kicinski
We had a number of bugs in the past because developers forgot to fully test dumps, which pass NULL as info to .prepare_data. .prepare_data implementations would try to access info->extack leading to a null-deref. Now that dumps and notifications can access struct genl_info we can pass it in, and remove the info null checks. Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # pause Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-11-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15ethtool: netlink: simplify arguments to ethnl_default_parse()Jakub Kicinski
Pass struct genl_info directly instead of its members. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15netdev-genl: use struct genl_info for reply constructionJakub Kicinski
Use the just added APIs to make the code simpler. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: add genlmsg_iput() APIJakub Kicinski
Add some APIs and helpers required for convenient construction of replies and notifications based on struct genl_info. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: add a family pointer to struct genl_infoJakub Kicinski
Having family in struct genl_info is quite useful. It cuts down the number of arguments which need to be passed to helpers which already take struct genl_info. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: use attrs from struct genl_infoJakub Kicinski
Since dumps carry struct genl_info now, use the attrs pointer from genl_info and remove the one in struct genl_dumpit_info. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: add struct genl_info to struct genl_dumpit_infoJakub Kicinski
Netlink GET implementations must currently juggle struct genl_info and struct netlink_callback, depending on whether they were called from doit or dumpit. Add genl_info to the dump state and populate the fields. This way implementations can simply pass struct genl_info around. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: remove userhdr from struct genl_infoJakub Kicinski
Only three families use info->userhdr today and going forward we discourage using fixed headers in new families. So having the pointer to user header in struct genl_info is an overkill. Compute the header pointer at runtime. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: make genl_info->nlhdr constJakub Kicinski
struct netlink_callback has a const nlh pointer, make the pointer in struct genl_info const as well, to make copying between the two easier. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15genetlink: push conditional locking into dumpit/doneJakub Kicinski
Add helpers which take/release the genl mutex based on family->parallel_ops. Remove the separation between handling of ops in locked and parallel families. Future patches would make the duplicated code grow even more. Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230814214723.2924989-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15net: Fix slab-out-of-bounds in inet[6]_steal_sockLorenz Bauer
Kumar reported a KASAN splat in tcp_v6_rcv: bash-5.2# ./test_progs -t btf_skc_cls_ingress ... [ 51.810085] BUG: KASAN: slab-out-of-bounds in tcp_v6_rcv+0x2d7d/0x3440 [ 51.810458] Read of size 2 at addr ffff8881053f038c by task test_progs/226 The problem is that inet[6]_steal_sock accesses sk->sk_protocol without accounting for request or timewait sockets. To fix this we can't just check sock_common->skc_reuseport since that flag is present on timewait sockets. Instead, add a fullsock check to avoid the out of bands access of sk_protocol. Fixes: 9c02bec95954 ("bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assign") Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230815-bpf-next-v2-1-95126eaa4c1b@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-15Merge tag 'parisc-for-6.5-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux Pull parisc fix from Helge Deller: "Fix the parisc TLB ptlock checks so that they can be enabled together with the lightweight spinlock checks" * tag 'parisc-for-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux: parisc: Fix CONFIG_TLB_PTLOCK to work with lightweight spinlock checks
2023-08-15Merge tag '6.5-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: "Three smb client fixes, all for stable: - fix for oops in unmount race with lease break of deferred close - debugging improvement for reconnect - fix for fscache deadlock (folio_wait_bit_common hang)" * tag '6.5-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: display network namespace in debug information cifs: Release folio lock on fscache read hit. cifs: fix potential oops in cifs_oplock_break
2023-08-15Merge tag 'regulator-fix-v6.5-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "Two small driver specific fixes: one incorrect definition for one of the Qualcomm regulators and better handling of poorly formed DTs in the DA9063 driver" * tag 'regulator-fix-v6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: regulator: qcom-rpmh: Fix LDO 12 regulator for PM8550 regulator: da9063: better fix null deref with partial DT
2023-08-15net: fix the RTO timer retransmitting skb every 1ms if linear option is enabledJason Xing
In the real workload, I encountered an issue which could cause the RTO timer to retransmit the skb per 1ms with linear option enabled. The amount of lost-retransmitted skbs can go up to 1000+ instantly. The root cause is that if the icsk_rto happens to be zero in the 6th round (which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero due to the changed calculation method in tcp_retransmit_timer() as follows: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); Above line could be converted to icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0 Therefore, the timer expires so quickly without any doubt. I read through the RFC 6298 and found that the RTO value can be rounded up to a certain value, in Linux, say TCP_RTO_MIN as default, which is regarded as the lower bound in this patch as suggested by Eric. Fixes: 36e31b0af587 ("net: TCP thin linear timeouts") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-15accel/qaic: Clean up integer overflow checking in map_user_pages()Dan Carpenter
The encode_dma() function has some validation on in_trans->size but it would be more clear to move those checks to find_and_map_user_pages(). The encode_dma() had two checks: if (in_trans->addr + in_trans->size < in_trans->addr || !in_trans->size) return -EINVAL; The in_trans->addr variable is the starting address. The in_trans->size variable is the total size of the transfer. The transfer can occur in parts and the resources->xferred_dma_size tracks how many bytes we have already transferred. This patch introduces a new variable "remaining" which represents the amount we want to transfer (in_trans->size) minus the amount we have already transferred (resources->xferred_dma_size). I have modified the check for if in_trans->size is zero to instead check if in_trans->size is less than resources->xferred_dma_size. If we have already transferred more bytes than in_trans->size then there are negative bytes remaining which doesn't make sense. If there are zero bytes remaining to be copied, just return success. The check in encode_dma() checked that "addr + size" could not overflow and barring a driver bug that should work, but it's easier to check if we do this in parts. First check that "in_trans->addr + resources->xferred_dma_size" is safe. Then check that "xfer_start_addr + remaining" is safe. My final concern was that we are dealing with u64 values but on 32bit systems the kmalloc() function will truncate the sizes to 32 bits. So I calculated "total = in_trans->size + offset_in_page(xfer_start_addr);" and returned -EINVAL if it were >= SIZE_MAX. This will not affect 64bit systems. Fixes: 129776ac2e38 ("accel/qaic: Add control path") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Link: https://patchwork.freedesktop.org/patch/msgid/24d3348b-25ac-4c1b-b171-9dae7c43e4e0@moroto.mountain
2023-08-15accel/qaic: Fix slicing memory leakPranjal Ramajor Asha Kanojiya
The temporary buffer storing slicing configuration data from user is only freed on error. This is a memory leak. Free the buffer unconditionally. Fixes: ff13be830333 ("accel/qaic: Add datapath") Signed-off-by: Pranjal Ramajor Asha Kanojiya <quic_pkanojiy@quicinc.com> Reviewed-by: Carl Vanderlip <quic_carlv@quicinc.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230802145937.14827-1-quic_jhugo@quicinc.com
2023-08-15Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio fixes from Michael Tsirkin: "Just a bunch of bugfixes all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (26 commits) virtio-mem: check if the config changed before fake offlining memory virtio-mem: keep retrying on offline_and_remove_memory() errors in Sub Block Mode (SBM) virtio-mem: convert most offline_and_remove_memory() errors to -EBUSY virtio-mem: remove unsafe unplug in Big Block Mode (BBM) pds_vdpa: fix up debugfs feature bit printing pds_vdpa: alloc irq vectors on DRIVER_OK pds_vdpa: clean and reset vqs entries pds_vdpa: always allow offering VIRTIO_NET_F_MAC pds_vdpa: reset to vdpa specified mac virtio-net: Zero max_tx_vq field for VIRTIO_NET_CTRL_MQ_HASH_CONFIG case vdpa/mlx5: Fix crash on shutdown for when no ndev exists vdpa/mlx5: Delete control vq iotlb in destroy_mr only when necessary vdpa/mlx5: Fix mr->initialized semantics vdpa/mlx5: Correct default number of queues when MQ is on virtio-vdpa: Fix cpumask memory leak in virtio_vdpa_find_vqs() vduse: Use proper spinlock for IRQ injection vdpa: Enable strict validation for netlinks ops vdpa: Add max vqp attr to vdpa_nl_policy for nlattr length check vdpa: Add queue index attr to vdpa_nl_policy for nlattr length check vdpa: Add features attr to vdpa_nl_policy for nlattr length check ...
2023-08-14Merge branch 'Update and document struct_ops'Martin KaFai Lau
David Vernet says: ==================== The struct bpf_struct_ops structure in BPF is a framework that allows subsystems to extend themselves using BPF. In commit 68b04864ca425 ("bpf: Create links for BPF struct_ops maps") and commit aef56f2e918bf ("bpf: Update the struct_ops of a bpf_link"), the structure was updated to include new ->validate() and ->update() callbacks respectively in support of allowing struct_ops maps to be created with BPF_F_LINK. The intention was that struct bpf_struct_ops implementations could support map updates through the link. Because map validation and registration would take place in two separate steps for struct_ops maps managed by the link (the first in map update elem, and the latter in link create), the ->validate() callback was added, and any struct_ops implementation that wished to use BPF_F_LINK, even just for lifetime management, would then be required to define both it and ->update(). Not all struct_ops implementations can or will support update, however. For example, the sched_ext struct_ops implementation proposed in [0] will not be able to support atomic map updates because it can race with sysrq, has to cycle tasks through various states in order to safely transition, etc. It can, however, benefit from letting the BPF link automatically evict the struc_ops map when the application exits (e.g. if it crashes). This patch set therefore: 1. Updates the struct_ops implementation to support default values for ->validate() and ->update() so that struct_ops implementations can benefit from BPF_F_LINK management even if they can't support updates. 2. Documents struct bpf_struct_ops so that the semantics are clear and well defined. --- v2: https://lore.kernel.org/bpf/0f5ea3de-c6e7-490f-b5ec-b5c7cd288687@gmail.com/T/ Changes from v2 -> v3: - Add patch 2/2 that documents the struct bpf_struct_ops structure. - Add Kui-Feng's Acked-by tag to patch 1/2. v1: https://lore.kernel.org/lkml/20230811150934.GA542801@maniforge/ Changes from v1 -> v2: - Move the if (!st_map->st_ops->update) check outside of the critical section before we acquire the update_mutex. ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-14bpf: Document struct bpf_struct_ops fieldsDavid Vernet
Subsystems that want to implement a struct bpf_struct_ops structure to enable struct_ops maps must currently reverse engineer how the structure works. Given that this is meant to be a way for subsystem maintainers to extend their subsystems using BPF, let's document it to make it a bit easier on them. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230814185908.700553-3-void@manifault.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-14bpf: Support default .validate() and .update() behavior for struct_ops linksDavid Vernet
Currently, if a struct_ops map is loaded with BPF_F_LINK, it must also define the .validate() and .update() callbacks in its corresponding struct bpf_struct_ops in the kernel. Enabling struct_ops link is useful in its own right to ensure that the map is unloaded if an application crashes. For example, with sched_ext, we want to automatically unload the host-wide scheduler if the application crashes. We would likely never support updating elements of a sched_ext struct_ops map, so we'd have to implement these callbacks showing that they _can't_ support element updates just to benefit from the basic lifetime management of struct_ops links. Let's enable struct_ops maps to work with BPF_F_LINK even if they haven't defined these callbacks, by assuming that a struct_ops map element cannot be updated by default. Acked-by: Kui-Feng Lee <thinker.li@gmail.com> Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230814185908.700553-2-void@manifault.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-14selftests/bpf: Add various more tcx test casesDaniel Borkmann
Add several new tcx test cases to improve test coverage. This also includes a few new tests with ingress instead of clsact qdisc, to cover the fix from commit dc644b540a2d ("tcx: Fix splat in ingress_destroy upon tcx_entry_free"). # ./test_progs -t tc [...] #234 tc_links_after:OK #235 tc_links_append:OK #236 tc_links_basic:OK #237 tc_links_before:OK #238 tc_links_chain_classic:OK #239 tc_links_chain_mixed:OK #240 tc_links_dev_cleanup:OK #241 tc_links_dev_mixed:OK #242 tc_links_ingress:OK #243 tc_links_invalid:OK #244 tc_links_prepend:OK #245 tc_links_replace:OK #246 tc_links_revision:OK #247 tc_opts_after:OK #248 tc_opts_append:OK #249 tc_opts_basic:OK #250 tc_opts_before:OK #251 tc_opts_chain_classic:OK #252 tc_opts_chain_mixed:OK #253 tc_opts_delete_empty:OK #254 tc_opts_demixed:OK #255 tc_opts_detach:OK #256 tc_opts_detach_after:OK #257 tc_opts_detach_before:OK #258 tc_opts_dev_cleanup:OK #259 tc_opts_invalid:OK #260 tc_opts_mixed:OK #261 tc_opts_prepend:OK #262 tc_opts_replace:OK #263 tc_opts_revision:OK [...] Summary: 44/38 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/8699efc284b75ccdc51ddf7062fa2370330dc6c0.1692029283.git.daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-14net: veth: Page pool creation error handling for existing pools onlyLiang Chen
The failure handling procedure destroys page pools for all queues, including those that haven't had their page pool created yet. this patch introduces necessary adjustments to prevent potential risks and inconsistency with the error handling behavior. Fixes: 0ebab78cbcbf ("net: veth: add page_pool for page recycling") Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Link: https://lore.kernel.org/r/20230812023016.10553-1-liangchen.linux@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14Merge branch 'octeon_ep-fixes-for-error-and-remove-paths'Jakub Kicinski
Michal Schmidt says: ==================== octeon_ep: fixes for error and remove paths I have an Octeon card that's misconfigured in a way that exposes a couple of bugs in the octeon_ep driver's error paths. It can reproduce the issues that patches 1 & 4 are fixing. Patches 2 & 3 are a result of reviewing the nearby code. ==================== Link: https://lore.kernel.org/r/20230810150114.107765-1-mschmidt@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14octeon_ep: cancel queued works in probe error pathMichal Schmidt
If it fails to get the devices's MAC address, octep_probe exits while leaving the delayed work intr_poll_task queued. When the work later runs, it's a use after free. Move the cancelation of intr_poll_task from octep_remove into octep_device_cleanup. This does not change anything in the octep_remove flow, but octep_device_cleanup is called also in the octep_probe error path, where the cancelation is needed. Note that the cancelation of ctrl_mbox_task has to follow intr_poll_task's, because the ctrl_mbox_task may be queued by intr_poll_task. Fixes: 24d4333233b3 ("octeon_ep: poll for control messages") Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Link: https://lore.kernel.org/r/20230810150114.107765-5-mschmidt@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14octeon_ep: cancel ctrl_mbox_task after intr_poll_taskMichal Schmidt
intr_poll_task may queue ctrl_mbox_task. The function octep_poll_non_ioq_interrupts_cn93_pf does this. When removing the driver and canceling these two works, cancel ctrl_mbox_task last to guarantee it does not run anymore. Fixes: 24d4333233b3 ("octeon_ep: poll for control messages") Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Link: https://lore.kernel.org/r/20230810150114.107765-4-mschmidt@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14octeon_ep: cancel tx_timeout_task later in remove sequenceMichal Schmidt
tx_timeout_task is canceled too early when removing the driver. Nothing prevents .ndo_tx_timeout from triggering and queuing the work again. Better cancel it after the netdev is unregistered. It's harmless for octep_tx_timeout_task to run in the window between the unregistration and cancelation, because it checks netif_running. Fixes: 862cd659a6fb ("octeon_ep: Add driver framework and device initialization") Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Link: https://lore.kernel.org/r/20230810150114.107765-3-mschmidt@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14octeon_ep: fix timeout value for waiting on mbox responseMichal Schmidt
The intention was to wait up to 500 ms for the mbox response. The third argument to wait_event_interruptible_timeout() is supposed to be the timeout duration. The driver mistakenly passed absolute time instead. Fixes: 577f0d1b1c5f ("octeon_ep: add separate mailbox command and response queues") Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230810150114.107765-2-mschmidt@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14net: macb: In ZynqMP resume always configure PS GTR for non-wakeup sourceRadhey Shyam Pandey
On Zynq UltraScale+ MPSoC ubuntu platform when systemctl issues suspend, network manager bring down the interface and goes into suspend. When it wakes up it again enables the interface. This leads to xilinx-psgtr "PLL lock timeout" on interface bringup, as the power management controller power down the entire FPD (including SERDES) if none of the FPD devices are in use and serdes is not initialized on resume. $ sudo rtcwake -m no -s 120 -v $ sudo systemctl suspend <this does ifconfig eth1 down> $ ifconfig eth1 up xilinx-psgtr fd400000.phy: lane 0 (type 10, protocol 5): PLL lock timeout phy phy-fd400000.phy.0: phy poweron failed --> -110 macb driver is called in this way: 1. macb_close: Stop network interface. In this function, it reset MACB IP and disables PHY and network interface. 2. macb_suspend: It is called in kernel suspend flow. But because network interface has been disabled(netif_running(ndev) is false), it does nothing and returns directly; 3. System goes into suspend state. Some time later, system is waken up by RTC wakeup device; 4. macb_resume: It does nothing because network interface has been disabled; 5. macb_open: It is called to enable network interface again. ethernet interface is initialized in this API but serdes which is power-off by PMUFW during FPD-off suspend is not initialized again and so we hit GT PLL lock issue on open. To resolve this PLL timeout issue always do PS GTR initialization when ethernet device is configured as non-wakeup source. Fixes: f22bd29ba19a ("net: macb: Fix ZynqMP SGMII non-wakeup source resume failure") Fixes: 8b73fa3ae02b ("net: macb: Added ZynqMP-specific initialization") Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com> Link: https://lore.kernel.org/r/1691414091-2260697-1-git-send-email-radhey.shyam.pandey@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14net: dsa: mv88e6060: add phylink_get_caps implementationRussell King (Oracle)
Add a phylink_get_caps implementation for Marvell 88e6060 DSA switch. This is a fast ethernet switch, with internal PHYs for ports 0 through 4. Port 4 also supports MII, REVMII, REVRMII and SNI. Port 5 supports MII, REVMII, REVRMII and SNI without an internal PHY. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/E1qUkx7-003dMX-9b@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14net/mlx5: Don't query MAX caps twiceShay Drory
Whenever mlx5 driver is probed or reloaded, it queries some capabilities in MAX mode via set_hca_cap() API. Afterwards, the driver queries all capabilities in MAX mode via mlx5_query_hca_caps() API. Since MAX caps are read only caps, querying them twice is redundant. Hence, delete the second query. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Remove unused MAX HCA capabilitiesShay Drory
Each device cap has two modes: MAX and CUR. The driver maintains a cache of both modes of the capabilities. For most device caps, the MAX cap mode is never used. Hence, remove all driver queries of the MAX mode of the said caps as well as their helper MACROs. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Remove unused CAPsShay Drory
mlx5 driver queries the device for VECTOR_CALC and SHAMPO caps, but there isn't any user who requires them. As well as, MLX5_MCAM_REGS_0x9080_0x90FF is queried but not used. Thus, drop all usages and definitions of the mentioned caps above. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Fix error message in mlx5_sf_dev_state_change_handler()Jiri Pirko
sw_function_id contains sfnum, so fix the error message to name the value properly. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Remove redundant check of mlx5_vhca_event_supported()Jiri Pirko
Since mlx5_vhca_event_supported() is called in mlx5_sf_dev_supported(), remove the redundant call. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Use mlx5_sf_start_function_id() helper instead of directly calling ↵Jiri Pirko
MLX5_CAP_GEN() There is a helper called mlx5_sf_start_function_id() that wraps up a query to get base SF function id. Use that instead of calling MLX5_CAP_GEN() directly. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Remove redundant SF supported check from mlx5_sf_hw_table_init()Jiri Pirko
Since mlx5_sf_supported() check is done as a first thing in mlx5_sf_max_functions(), remove the redundant check. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: Use auxiliary_device_uninit() instead of device_put()Jiri Pirko
Instead of using device_put(), use auxiliary_device_uninit() for auxiliary device uninit which internally just calls device_put(). Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-14net/mlx5: E-switch, Add checking for flow rule destinationsJianbo Liu
Firmware doesn't allow flow rules in FDB to do header rewrite and send packets to both internal and uplink vports. The following syndrome will be generated when trying to offload such kind of rules: mlx5_core 0000:08:00.0: mlx5_cmd_out_err:803:(pid 23569): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x8c8f08), err(-22) To avoid this syndrome, add a checking before creating FTE. If a rule with header rewrite action forwards packets to both VF and PF, an error is returned directly. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>