Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2023-05-31
net/mlx5: Support 4 ports VF LAG, part 1/2
This series continues the series[1] "Support 4 ports HCAs LAG mode"
by Mark Bloch. This series adds support for 4 ports VF LAG (single FDB
E-Switch).
This series of patches focuses on refactoring different sections of the
code that make assumptions about VF LAG supporting only two ports. For
instance, it assumes that each device can only have one peer.
Patches 1-5:
- Refactor ETH handling of TC rules of eswitches with peers.
Patch 6:
- Refactors peer miss group table.
Patches 7-9:
- Refactor single FDB E-Switch creation.
Patch 10:
- Refactor the DR layer.
Patches 11-14:
- Refactors devcom layer.
Next series will refactor LAG layer and enable 4 ports VF LAG.
This series specifically allows HCAs with 4 ports to create a VF LAG
with only 4 ports. It is not possible to create a VF LAG with 2 or 3
ports using HCAs that have 4 ports.
Currently, the Merged E-Switch feature only supports HCAs with 2 ports.
However, upcoming patches will introduce support for HCAs with 4 ports.
In order to activate VF LAG a user can execute:
devlink dev eswitch set pci/0000:08:00.0 mode switchdev
devlink dev eswitch set pci/0000:08:00.1 mode switchdev
devlink dev eswitch set pci/0000:08:00.2 mode switchdev
devlink dev eswitch set pci/0000:08:00.3 mode switchdev
ip link add name bond0 type bond
ip link set dev bond0 type bond mode 802.3ad
ip link set dev eth2 master bond0
ip link set dev eth3 master bond0
ip link set dev eth4 master bond0
ip link set dev eth5 master bond0
Where eth2, eth3, eth4 and eth5 are net-interfaces of pci/0000:08:00.0
pci/0000:08:00.1 pci/0000:08:00.2 pci/0000:08:00.3 respectively.
User can verify LAG state and type via debugfs:
/sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state
/sys/kernel/debug/mlx5/0000\:08\:00.0/lag/type
[1]
https://lore.kernel.org/netdev/20220510055743.118828-1-saeedm@nvidia.com/
* tag 'mlx5-updates-2023-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: Devcom, extend mlx5_devcom_send_event to work with more than two devices
net/mlx5: Devcom, introduce devcom_for_each_peer_entry
net/mlx5: E-switch, mark devcom as not ready when all eswitches are unpaired
net/mlx5: Devcom, Rename paired to ready
net/mlx5: DR, handle more than one peer domain
net/mlx5: E-switch, generalize shared FDB creation
net/mlx5: E-switch, Handle multiple master egress rules
net/mlx5: E-switch, refactor FDB miss rule add/remove
net/mlx5: E-switch, enlarge peer miss group table
net/mlx5e: Handle offloads flows per peer
net/mlx5e: en_tc, re-factor query route port
net/mlx5e: rep, store send to vport rules per peer
net/mlx5e: tc, Refactor peer add/del flow
net/mlx5e: en_tc, Extend peer flows to a list
====================
Link: https://lore.kernel.org/r/20230602191301.47004-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Andrzej Hajda says:
====================
drm/i915: use ref_tracker library for tracking wakerefs
This is reviewed series of ref_tracker patches, ready to merge
via network tree, rebased on net-next/main.
i915 patches will be merged later via intel-gfx tree.
====================
Merge on top of an -rc tag in case it's needed in another tree.
Link: https://lore.kernel.org/r/20230224-track_gt-v9-0-5b47a33f55d1@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Library can handle allocation failures. To avoid allocation warnings
__GFP_NOWARN has been added everywhere. Moreover GFP_ATOMIC has been
replaced with GFP_NOWAIT in case of stack allocation on tracker free
call.
Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Similar to stack_(depot|trace)_snprint the patch
adds helper to printing stats to memory buffer.
It will be helpful in case of debugfs.
Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In case the library is tracking busy subsystem, simply
printing stack for every active reference will spam log
with long, hard to read, redundant stack traces. To improve
readabilty following changes have been made:
- reports are printed per stack_handle - log is more compact,
- added display name for ref_tracker_dir - it will differentiate
multiple subsystems,
- stack trace is printed indented, in the same printk call,
- info about dropped references is printed as well.
Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
To have reliable detection of leaks, caller must be able to check under
the same lock both: tracked counter and the leaks. dir.lock is natural
candidate for such lock and unlocked print helper can be called with this
lock taken.
As a bonus we can reuse this helper in ref_tracker_dir_exit.
Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Andrii Nakryiko writes:
And we currently don't have an attach type for NETLINK BPF link.
Thankfully it's not too late to add it. I see that link_create() in
kernel/bpf/syscall.c just bypasses attach_type check. We shouldn't
have done that. Instead we need to add BPF_NETLINK attach type to enum
bpf_attach_type. And wire all that properly throughout the kernel and
libbpf itself.
This adds BPF_NETFILTER and uses it. This breaks uabi but this
wasn't in any non-rc release yet, so it should be fine.
v2: check link_attack prog type in link_create too
Fixes: 84601d6ee68a ("bpf: add bpf_link support for BPF_NETFILTER programs")
Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzZ69YgrQW7DHCJUT_X+GqMq_ZQQPBwopaJJVGFD5=d5Vg@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20230605131445.32016-1-fw@strlen.de
|
|
It seems we forgot the normal case to terminate the retry loop,
making us asking 3 times each command, which is probably a little bit
too much.
And remove the ugly "goto exit" that can be replaced by a simpler "break"
Fixes: 586e8fede795 ("HID: logitech-hidpp: Retry commands when device is busy")
Suggested-by: Mark Lord <mlord@pobox.com>
Tested-by: Mark Lord <mlord@pobox.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
https://github.com/robertosassu/linux
Pull asymmetric keys fix from Roberto Sassu:
"Here is a small fix to make an unconditional copy of the buffer passed
to crypto operations, to take into account the case of the stack not
in the linear mapping area.
It has been tested and verified to fix the bug"
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: David Howells <dhowells@redhat.com>
* tag 'asym-keys-fix-for-linus-v6.4-rc5' of https://github.com/robertosassu/linux:
KEYS: asymmetric: Copy sig and digest in public_key_verify_signature()
|
|
Mat Martineau says:
====================
mptcp: Fixes for address advertisement
Patches 1 and 2 allow address advertisements to be removed without
affecting current connected subflows, and updates associated self tests.
Patches 3 and 4 correctly track (and allow removal of) addresses that
were implicitly announced as part of subflow creation. Also updates
associated self tests.
Patch 5 makes subflow and address announcement counters work consistently
between the userspace and in-kernel path managers.
====================
Signed-off-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Increase pm subflows counter on both server side and client side when
userspace pm creates a new subflow, and decrease the counter when it
closes a subflow.
Increase add_addr_signaled counter in mptcp_nl_cmd_announce() when the
address is announced by userspace PM.
This modification is similar to how the in-kernel PM is updating the
counter: when additional subflows are created/removed.
Fixes: 9ab4807c84a4 ("mptcp: netlink: Add MPTCP_PM_CMD_ANNOUNCE")
Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/329
Cc: stable@vger.kernel.org
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To align with what is done by the in-kernel PM, update userspace pm
subflow selftests, by sending the a remove_addrs command together
before the remove_subflows command. This will get a RM_ADDR in
chk_rm_nr().
Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Fixes: 5e986ec46874 ("selftests: mptcp: userspace pm subflow tests")
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/379
Cc: stable@vger.kernel.org
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add the address into userspace_pm_local_addr_list when the subflow is
created. Make sure it can be found in mptcp_nl_cmd_remove(). And delete
it in the new helper mptcp_userspace_pm_delete_local_addr().
By doing this, the "REMOVE" command also works with subflows that have
been created via the "SUB_CREATE" command instead of restricting to
the addresses that have been announced via the "ANNOUNCE" command.
Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/379
Cc: stable@vger.kernel.org
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch is linked to the previous commit ("mptcp: only send RM_ADDR in
nl_cmd_remove").
To align with what is done by the in-kernel PM, update userspace pm addr
selftests, by sending a remove_subflows command together after the
remove_addrs command.
Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Fixes: 97040cf9806e ("selftests: mptcp: userspace pm address tests")
Cc: stable@vger.kernel.org
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The specifications from [1] about the "REMOVE" command say:
Announce that an address has been lost to the peer
It was then only supposed to send a RM_ADDR and not trying to delete
associated subflows.
A new helper mptcp_pm_remove_addrs() is then introduced to do just
that, compared to mptcp_pm_remove_addrs_and_subflows() also removing
subflows.
To delete a subflow, the userspace daemon can use the "SUB_DESTROY"
command, see mptcp_nl_cmd_sf_destroy().
Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Link: https://github.com/multipath-tcp/mptcp/blob/mptcp_v0.96/include/uapi/linux/mptcp.h [1]
Cc: stable@vger.kernel.org
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We must not assign plat_dat->dwmac4_addrs unconditionally as for
structures which don't set them, this will result in the core driver
using zeroes everywhere and breaking the driver for older HW. On EMAC < 2
the address should remain NULL.
Fixes: b68376191c69 ("net: stmmac: dwmac-qcom-ethqos: Add EMAC3 support")
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Andrew Halaney <ahalaney@redhat.com>
Reviewed-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Petr Machata says:
====================
mlxsw, selftests: Cleanups
This patchset consolidates a number of disparate items that can all be
considered cleanups. They are all related to mlxsw in that they are
directly in mlxsw code, or in selftests that mlxsw heavily uses.
- patch #1 fixes a comment, patch #2 propagates an extack
- patches #3 and #4 tweak several loops to query a resource once and cache
in a local variable instead of querying on each iteration
- patches #5 and #6 fix selftest diagrams, and #7 adds a missing diagram
into an existing test
- patch #8 disables a PVID on a bridge in a selftest that should not need
said PVID
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When everything is configured, VLAN membership on the bridge in this
selftest are as follows:
# bridge vlan show
port vlan-id
swp2 1 PVID Egress Untagged
555
br1 1 Egress Untagged
555 PVID Egress Untagged
Note that it is possible for untagged traffic to just flow through as VLAN
1, instead of using VLAN 555 as intended by the test. This configuration
seems too close to "works by accident", and it would be better to just shut
out VLAN 1 altogether.
To that end, configure vlan_default_pvid of 0:
# bridge vlan show
port vlan-id
swp2 555
br1 555 PVID Egress Untagged
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a topology diagram to this selftest to make the configuration easier to
understand.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The topology diagram implies that $swp1 and $swp2 are members of the bridge
br0, when in fact only their uppers, $swp1.10 and $swp2.10 are. Adjust the
diagram.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The topology diagram implies that $swp1 and $swp2 are members of the bridge
br0, when in fact only their uppers, $swp1.10 and $swp2.10 are. Adjust the
diagram.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
MLXSW_CORE_RES_GET involves a call to spectrum_core, a separate module.
Instead of making the call on every iteration, cache it up front, and use
the value.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
MLXSW_CORE_RES_GET involves a call to spectrum_core, a separate module.
Instead of making the call on every iteration, cache it up front, and use
the value.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In commit 26029225d992 ("mlxsw: spectrum_router: Propagate extack
further"), the mlxsw_sp_rif_ops.configure callback got a new argument,
extack. However the callbacks that deal with tunnel configuration,
mlxsw_sp1_rif_ipip_lb_configure() and mlxsw_sp2_rif_ipip_lb_configure(),
were never updated to pass the parameter further. Do that now.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
"Reserved for X" usually means that only X is supposed to use a given
object. Here, it is used in the sense that X should consider the object
"reserved", as in "restricted".
Replace the comment simply by "X", with the implication that that's where
the field is used.
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Russell King says:
====================
convert sja1105 xpcs creation and remove xpcs_create
This series of three patches converts sja1105 to use the newly
provided xpcs_create_mdiodev(), and as there become no users of
xpcs_create(), removes this function from the global namespace to
discourage future direct use.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There are now no callers of xpcs_create(), so let's remove it from
public view to discourage future direct usage.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Use the new xpcs_create_mdiodev() creator, which simplifies the
creation and destruction of the mdio device associated with xpcs.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Put the mdiodev after xpcs_create() so that the XPCS driver can manage
the lifetime of the mdiodev its using.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
this is a pull request of 3 patches for net/master.
All 3 patches target the j1939 stack.
The 1st patch is by Oleksij Rempel and fixes the error queue handling
for (E)TP sessions that run into timeouts.
The last 2 patches are by Fedor Pchelkin and fix a potential
use-after-free in j1939_netdev_start() if j1939_can_rx_register()
fails.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We got multiple syzbot reports, all duplicates of the following [1]
syzbot managed to install fq_pie with a zero TCA_FQ_PIE_QUANTUM,
thus triggering infinite loops.
Use limits similar to sch_fq, with commits
3725a269815b ("pkt_sched: fq: avoid hang when quantum 0") and
d9e15a273306 ("pkt_sched: fq: do not accept silly TCA_FQ_QUANTUM")
[1]
watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
Modules linked in:
irq event stamp: 172817
hardirqs last enabled at (172816): [<ffff80001242fde4>] __el1_irq arch/arm64/kernel/entry-common.c:476 [inline]
hardirqs last enabled at (172816): [<ffff80001242fde4>] el1_interrupt+0x58/0x68 arch/arm64/kernel/entry-common.c:486
hardirqs last disabled at (172817): [<ffff80001242fdb0>] __el1_irq arch/arm64/kernel/entry-common.c:468 [inline]
hardirqs last disabled at (172817): [<ffff80001242fdb0>] el1_interrupt+0x24/0x68 arch/arm64/kernel/entry-common.c:486
softirqs last enabled at (167634): [<ffff800008020c1c>] softirq_handle_end kernel/softirq.c:414 [inline]
softirqs last enabled at (167634): [<ffff800008020c1c>] __do_softirq+0xac0/0xd54 kernel/softirq.c:600
softirqs last disabled at (167701): [<ffff80000802a660>] ____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:80
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.4.0-rc3-syzkaller-geb0f1697d729 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/28/2023
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : fq_pie_qdisc_dequeue+0x10c/0x8ac net/sched/sch_fq_pie.c:246
lr : fq_pie_qdisc_dequeue+0xe4/0x8ac net/sched/sch_fq_pie.c:240
sp : ffff800008007210
x29: ffff800008007280 x28: ffff0000c86f7890 x27: ffff0000cb20c2e8
x26: ffff0000cb20c2f0 x25: dfff800000000000 x24: ffff0000cb20c2e0
x23: ffff0000c86f7880 x22: 0000000000000040 x21: 1fffe000190def10
x20: ffff0000cb20c2e0 x19: ffff0000cb20c2e0 x18: ffff800008006e60
x17: 0000000000000000 x16: ffff80000850af6c x15: 0000000000000302
x14: 0000000000000100 x13: 0000000000000000 x12: 0000000000000001
x11: 0000000000000302 x10: 0000000000000100 x9 : 0000000000000000
x8 : 0000000000000000 x7 : ffff80000841c468 x6 : 0000000000000000
x5 : 0000000000000001 x4 : 0000000000000001 x3 : 0000000000000000
x2 : ffff0000cb20c2e0 x1 : ffff0000cb20c2e0 x0 : 0000000000000001
Call trace:
fq_pie_qdisc_dequeue+0x10c/0x8ac net/sched/sch_fq_pie.c:246
dequeue_skb net/sched/sch_generic.c:292 [inline]
qdisc_restart net/sched/sch_generic.c:397 [inline]
__qdisc_run+0x1fc/0x231c net/sched/sch_generic.c:415
__dev_xmit_skb net/core/dev.c:3868 [inline]
__dev_queue_xmit+0xc80/0x3318 net/core/dev.c:4210
dev_queue_xmit include/linux/netdevice.h:3085 [inline]
neigh_connected_output+0x2f8/0x38c net/core/neighbour.c:1581
neigh_output include/net/neighbour.h:544 [inline]
ip6_finish_output2+0xd60/0x1a1c net/ipv6/ip6_output.c:134
__ip6_finish_output net/ipv6/ip6_output.c:195 [inline]
ip6_finish_output+0x538/0x8c8 net/ipv6/ip6_output.c:206
NF_HOOK_COND include/linux/netfilter.h:292 [inline]
ip6_output+0x270/0x594 net/ipv6/ip6_output.c:227
dst_output include/net/dst.h:458 [inline]
NF_HOOK include/linux/netfilter.h:303 [inline]
ndisc_send_skb+0xc30/0x1790 net/ipv6/ndisc.c:508
ndisc_send_rs+0x47c/0x5d4 net/ipv6/ndisc.c:718
addrconf_rs_timer+0x300/0x58c net/ipv6/addrconf.c:3936
call_timer_fn+0x19c/0x8cc kernel/time/timer.c:1700
expire_timers kernel/time/timer.c:1751 [inline]
__run_timers+0x55c/0x734 kernel/time/timer.c:2022
run_timer_softirq+0x7c/0x114 kernel/time/timer.c:2035
__do_softirq+0x2d0/0xd54 kernel/softirq.c:571
____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:80
call_on_irq_stack+0x24/0x4c arch/arm64/kernel/entry.S:882
do_softirq_own_stack+0x20/0x2c arch/arm64/kernel/irq.c:85
invoke_softirq kernel/softirq.c:452 [inline]
__irq_exit_rcu+0x28c/0x534 kernel/softirq.c:650
irq_exit_rcu+0x14/0x84 kernel/softirq.c:662
__el1_irq arch/arm64/kernel/entry-common.c:472 [inline]
el1_interrupt+0x38/0x68 arch/arm64/kernel/entry-common.c:486
el1h_64_irq_handler+0x18/0x24 arch/arm64/kernel/entry-common.c:491
el1h_64_irq+0x64/0x68 arch/arm64/kernel/entry.S:587
__daif_local_irq_enable arch/arm64/include/asm/irqflags.h:33 [inline]
arch_local_irq_enable+0x8/0xc arch/arm64/include/asm/irqflags.h:55
cpuidle_idle_call kernel/sched/idle.c:170 [inline]
do_idle+0x1f0/0x4e8 kernel/sched/idle.c:282
cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:379
rest_init+0x2dc/0x2f4 init/main.c:735
start_kernel+0x0/0x55c init/main.c:834
start_kernel+0x3f0/0x55c init/main.c:1088
__primary_switched+0xb8/0xc0 arch/arm64/kernel/head.S:523
Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Maxime Chevallier says:
====================
net: add a regmap-based mdio driver and drop TSE PCS
This is the V4 of a series that follows-up on the work [1] aiming to drop the
altera TSE PCS driver, as it turns out to be a version of the Lynx PCS exposed
as a memory-mapped block, instead of living on an MDIO bus.
One step of this removal involved creating a regmap-based mdio driver
that translates MDIO accesses into the actual underlying bus that
exposes the register. The register layout must of course match the
standard MDIO layout, but we can now account for differences in stride
with recent work on the regmap subsystem [2].
Sorry for repeating this, but I didn't hear anything on this matter in previous
iterations, Mark, Net maintainers, this series depends on the patch
e12ff2876493 that was recently merged into the regmap tree [3].
For this series to be usable in net-next, this patch must be applied
beforehand. Should Mark create a tag that would then be merged into
net-next ? Or should we just wait for the next release to merge this
into net-next ?
This series introduces a new MDIO driver, and uses it to convert Altera
TSE from the actual TSE PCS driver to Lynx PCS.
Since it turns out dwmac_socfpga also uses a TSE PCS block, port that
driver to Lynx as well.
Changes in V4 :
- Use new pcs_lynx_create/destroy helpers added by Russell
- Rework the cleanup sequence to avoid leaking data
- Rework a bit KConfig to properly select dependencies
- Fix a few hiccups with misplaced hunks in 2 commits
Changes in V3 :
- Use a dedicated struct for the mii bus's priv data, to avoid
duplicating the whole struct mdio_regmap_config, from which 2 fields
only are necessary after init, as suggested by Russell
- Use ~0 instead of ~0UL for the no-scan bitmask, following Simon's
review.
Changes in V2 :
- Use phy_mask to avoid unnecessarily scanning the whole mdio bus
- Go one step further and completely disable scanning if users
set the .autoscan flag to false, in case the mdiodevice isn't an
actual PHY (a PCS for example).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
dwmac_socfpga re-implements support for the TSE PCS, which is identical
to the already existing TSE PCS, which in turn is the same as the Lynx
PCS. Drop the existing TSE re-implemenation and use the Lynx PCS
instead, relying on the regmap-mdio driver to translate MDIO accesses
into mmio accesses.
Add a lynx_pcs reference in the stmmac's internal structure, and use
.mac_select_pcs() to return the relevant PCS to be used.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Now that we can easily create a mdio-device that represents a
memory-mapped device that exposes an MDIO-like register layout, we don't
need the Altera TSE PCS anymore, since we can use the Lynx PCS instead.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The newly introduced regmap-based MDIO driver allows for an easy mapping
of an mdiodevice onto the memory-mapped TSE PCS, which is actually a
Lynx PCS.
Convert Altera TSE to use this PCS instead of the pcs-altera-tse, which
is nothing more than a memory-mapped Lynx PCS.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There exists several examples today of devices that embed an ethernet
PHY or PCS directly inside an SoC. In this situation, either the device
is controlled through a vendor-specific register set, or sometimes
exposes the standard 802.3 registers that are typically accessed over
MDIO.
As phylib and phylink are designed to use mdiodevices, this driver
allows creating a virtual MDIO bus, that translates mdiodev register
accesses to regmap accesses.
The reason we use regmap is because there are at least 3 such devices
known today, 2 of them are Altera TSE PCS's, memory-mapped, exposed
with a 4-byte stride in stmmac's dwmac-socfpga variant, and a 2-byte
stride in altera-tse. The other one (nxp,sja1110-base-tx-mdio) is
exposed over SPI.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
j1939_can_rx_register fails"
Fedor Pchelkin <pchelkin@ispras.ru> says:
The patch series fixes a possible racy use-after-free scenario
described in 2/2: if j1939_can_rx_register() fails then the concurrent
thread may have already read the invalid priv structure.
The 1/2 makes j1939_netdev_lock a mutex so that access to
j1939_can_rx_register() can be serialized without changing GFP_KERNEL
to GFP_ATOMIC inside can_rx_register(). This seems to be safe.
Note that the patch series has been tested only via Syzkaller and not
with a real device.
Link: https://lore.kernel.org/r/20230526171910.227615-1-pchelkin@ispras.ru
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Syzkaller reports the following failure:
BUG: KASAN: use-after-free in kref_put include/linux/kref.h:64 [inline]
BUG: KASAN: use-after-free in j1939_priv_put+0x25/0xa0 net/can/j1939/main.c:172
Write of size 4 at addr ffff888141c15058 by task swapper/3/0
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.10.144-syzkaller #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x107/0x167 lib/dump_stack.c:118
print_address_description.constprop.0+0x1c/0x220 mm/kasan/report.c:385
__kasan_report mm/kasan/report.c:545 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
check_memory_region_inline mm/kasan/generic.c:186 [inline]
check_memory_region+0x145/0x190 mm/kasan/generic.c:192
instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
atomic_fetch_sub_release include/asm-generic/atomic-instrumented.h:220 [inline]
__refcount_sub_and_test include/linux/refcount.h:272 [inline]
__refcount_dec_and_test include/linux/refcount.h:315 [inline]
refcount_dec_and_test include/linux/refcount.h:333 [inline]
kref_put include/linux/kref.h:64 [inline]
j1939_priv_put+0x25/0xa0 net/can/j1939/main.c:172
j1939_sk_sock_destruct+0x44/0x90 net/can/j1939/socket.c:374
__sk_destruct+0x4e/0x820 net/core/sock.c:1784
rcu_do_batch kernel/rcu/tree.c:2485 [inline]
rcu_core+0xb35/0x1a30 kernel/rcu/tree.c:2726
__do_softirq+0x289/0x9a3 kernel/softirq.c:298
asm_call_irq_on_stack+0x12/0x20
</IRQ>
__run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
do_softirq_own_stack+0xaa/0xe0 arch/x86/kernel/irq_64.c:77
invoke_softirq kernel/softirq.c:393 [inline]
__irq_exit_rcu kernel/softirq.c:423 [inline]
irq_exit_rcu+0x136/0x200 kernel/softirq.c:435
sysvec_apic_timer_interrupt+0x4d/0x100 arch/x86/kernel/apic/apic.c:1095
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:635
Allocated by task 1141:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc.constprop.0+0xc9/0xd0 mm/kasan/common.c:461
kmalloc include/linux/slab.h:552 [inline]
kzalloc include/linux/slab.h:664 [inline]
j1939_priv_create net/can/j1939/main.c:131 [inline]
j1939_netdev_start+0x111/0x860 net/can/j1939/main.c:268
j1939_sk_bind+0x8ea/0xd30 net/can/j1939/socket.c:485
__sys_bind+0x1f2/0x260 net/socket.c:1645
__do_sys_bind net/socket.c:1656 [inline]
__se_sys_bind net/socket.c:1654 [inline]
__x64_sys_bind+0x6f/0xb0 net/socket.c:1654
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x61/0xc6
Freed by task 1141:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0x112/0x170 mm/kasan/common.c:422
slab_free_hook mm/slub.c:1542 [inline]
slab_free_freelist_hook+0xad/0x190 mm/slub.c:1576
slab_free mm/slub.c:3149 [inline]
kfree+0xd9/0x3b0 mm/slub.c:4125
j1939_netdev_start+0x5ee/0x860 net/can/j1939/main.c:300
j1939_sk_bind+0x8ea/0xd30 net/can/j1939/socket.c:485
__sys_bind+0x1f2/0x260 net/socket.c:1645
__do_sys_bind net/socket.c:1656 [inline]
__se_sys_bind net/socket.c:1654 [inline]
__x64_sys_bind+0x6f/0xb0 net/socket.c:1654
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x61/0xc6
It can be caused by this scenario:
CPU0 CPU1
j1939_sk_bind(socket0, ndev0, ...)
j1939_netdev_start()
j1939_sk_bind(socket1, ndev0, ...)
j1939_netdev_start()
mutex_lock(&j1939_netdev_lock)
j1939_priv_set(ndev0, priv)
mutex_unlock(&j1939_netdev_lock)
if (priv_new)
kref_get(&priv_new->rx_kref)
return priv_new;
/* inside j1939_sk_bind() */
jsk->priv = priv
j1939_can_rx_register(priv) // fails
j1939_priv_set(ndev, NULL)
kfree(priv)
j1939_sk_sock_destruct()
j1939_priv_put() // <- uaf
To avoid this, call j1939_can_rx_register() under j1939_netdev_lock so
that a concurrent thread cannot process j1939_priv before
j1939_can_rx_register() returns.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20230526171910.227615-3-pchelkin@ispras.ru
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
It turns out access to j1939_can_rx_register() needs to be serialized,
otherwise j1939_priv can be corrupted when parallel threads call
j1939_netdev_start() and j1939_can_rx_register() fails. This issue is
thoroughly covered in other commit which serializes access to
j1939_can_rx_register().
Change j1939_netdev_lock type to mutex so that we do not need to remove
GFP_KERNEL from can_rx_register().
j1939_netdev_lock seems to be used in normal contexts where mutex usage
is not prohibited.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Suggested-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20230526171910.227615-2-pchelkin@ispras.ru
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
J1939 Socket
This patch addresses an issue within the j1939_sk_send_loop_abort()
function in the j1939/socket.c file, specifically in the context of
Transport Protocol (TP) sessions.
Without this patch, when a TP session is initiated and a Clear To Send
(CTS) frame is received from the remote side requesting one data packet,
the kernel dispatches the first Data Transport (DT) frame and then waits
for the next CTS. If the remote side doesn't respond with another CTS,
the kernel aborts due to a timeout. This leads to the user-space
receiving an EPOLLERR on the socket, and the socket becomes active.
However, when trying to read the error queue from the socket with
sock.recvmsg(, , socket.MSG_ERRQUEUE), it returns -EAGAIN,
given that the socket is non-blocking. This situation results in an
infinite loop: the user-space repeatedly calls epoll(), epoll() returns
the socket file descriptor with EPOLLERR, but the socket then blocks on
the recv() of ERRQUEUE.
This patch introduces an additional check for the J1939_SOCK_ERRQUEUE
flag within the j1939_sk_send_loop_abort() function. If the flag is set,
it indicates that the application has subscribed to receive error queue
messages. In such cases, the kernel can communicate the current transfer
state via the error queue. This allows for the function to return early,
preventing the unnecessary setting of the socket into an error state,
and breaking the infinite loop. It is crucial to note that a socket
error is only needed if the application isn't using the error queue, as,
without it, the application wouldn't be aware of transfer issues.
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Reported-by: David Jander <david@protonic.nl>
Tested-by: David Jander <david@protonic.nl>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20230526081946.715190-1-o.rempel@pengutronix.de
Cc: stable@vger.kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Unlinked list recovery requires errors removing the inode the from
the unlinked list get fed back to the main recovery loop. Now that
we offload the unlinking to the inodegc work, we don't get errors
being fed back when we trip over a corruption that prevents the
inode from being removed from the unlinked list.
This means we never clear the corrupt unlinked list bucket,
resulting in runtime operations eventually tripping over it and
shutting down.
Fix this by collecting inodegc worker errors and feed them
back to the flush caller. This is largely best effort - the only
context that really cares is log recovery, and it only flushes a
single inode at a time so we don't need complex synchronised
handling. Essentially the inodegc workers will capture the first
error that occurs and the next flush will gather them and clear
them. The flush itself will only report the first gathered error.
In the cases where callers can return errors, propagate the
collected inodegc flush error up the error handling chain.
In the case of inode unlinked list recovery, there are several
superfluous calls to flush queued unlinked inodes -
xlog_recover_iunlink_bucket() guarantees that it has flushed the
inodegc and collected errors before it returns. Hence nothing in the
calling path needs to run a flush, even when an error is returned.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Bad things happen in defered extent freeing operations if it is
passed a bad block number in the xefi. This can come from a bogus
agno/agbno pair from deferred agfl freeing, or just a bad fsbno
being passed to __xfs_free_extent_later(). Either way, it's very
difficult to diagnose where a null perag oops in EFI creation
is coming from when the operation that queued the xefi has already
been completed and there's no longer any trace of it around....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
If the agfl or the indexing in the AGF has been corrupted, getting a
block form the AGFL could return an invalid block number. If this
happens, bad things happen. Check the agbno we pull off the AGFL
and return -EFSCORRUPTED if we find somethign bad.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
When a v4 filesystem has fl_last - fl_first != fl_count, we do not
not detect the corruption and allow the AGF to be used as it if was
fully valid. On V5 filesystems, we reset the AGFL to empty in these
cases and avoid the corruption at a small cost of leaked blocks.
If we don't catch the corruption on V4 filesystems, bad things
happen later when an allocation attempts to trim the free list
and either double-frees stale entries in the AGFl or tries to free
NULLAGBNO entries.
Either way, this is bad. Prevent this from happening by using the
AGFL_NEED_RESET logic for v4 filesysetms, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
xfs_bmap_longest_free_extent() can return an error when accessing
the AGF fails. In this case, the behaviour of
xfs_filestream_pick_ag() is conditional on the error. We may
continue the loop, or break out of it. The error handling after the
loop cleans up the perag reference held when the break occurs. If we
continue, the next loop iteration handles cleaning up the perag
reference.
EIther way, we don't need to release the active perag reference when
xfs_bmap_longest_free_extent() fails. Doing so means we do a double
decrement on the active reference count, and this causes tha active
reference count to fall to zero. At this point, new active
references will fail.
This leads to unmount hanging because it tries to grab active
references to that perag, only for it to fail. This happens inside a
loop that retries until a inode tree radix tree tag is cleared,
which cannot happen because we can't get an active reference to the
perag.
The unmount livelocks in this path:
xfs_reclaim_inodes+0x80/0xc0
xfs_unmount_flush_inodes+0x5b/0x70
xfs_unmountfs+0x5b/0x1a0
xfs_fs_put_super+0x49/0x110
generic_shutdown_super+0x7c/0x1a0
kill_block_super+0x27/0x50
deactivate_locked_super+0x30/0x90
deactivate_super+0x3c/0x50
cleanup_mnt+0xc2/0x160
__cleanup_mnt+0x12/0x20
task_work_run+0x5e/0xa0
exit_to_user_mode_prepare+0x1bc/0x1c0
syscall_exit_to_user_mode+0x16/0x40
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Fixes: eb70aa2d8ed9 ("xfs: use for_each_perag_wrap in xfs_filestream_pick_ag")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Commit 6bc6c99a944c was a well-intentioned effort to initiate
consolidation of adjacent bmbt mapping records by setting the PREEN
flag. Consolidation can only happen if the length of the combined
record doesn't overflow the 21-bit blockcount field of the bmbt
recordset. Unfortunately, the length test is inverted, leading to it
triggering on data forks like these:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL
0: [0..16777207]: 76110848..92888055 0 (76110848..92888055) 16777208
1: [16777208..20639743]: 92888056..96750591 0 (92888056..96750591) 3862536
Note that record 0 has a length of 16777208 512b blocks. This
corresponds to 2097151 4k fsblocks, which is the maximum. Hence the two
records cannot be merged.
However, the logic is still wrong even if we change the in-loop
comparison, because the scope of our examination isn't broad enough
inside the loop to detect mappings like this:
0: [0..9]: 76110838..76110847 0 (76110838..76110847) 10
1: [10..16777217]: 76110848..92888055 0 (76110848..92888055) 16777208
2: [16777218..20639753]: 92888056..96750591 0 (92888056..96750591) 3862536
These three records could be merged into two, but one cannot determine
this purely from looking at records 0-1 or 1-2 in isolation.
Hoist the mergability detection outside the loop, and base its decision
making on whether or not a merged mapping could be expressed in fewer
bmbt records. While we're at it, fix the incorrect return type of the
iter function.
Fixes: 336642f79283 ("xfs: alert the user about data/attr fork mappings that could be merged")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
With gcc-5:
In file included from ./include/trace/define_trace.h:102:0,
from ./fs/xfs/scrub/trace.h:988,
from fs/xfs/scrub/trace.c:40:
./fs/xfs/./scrub/trace.h: In function ‘trace_raw_output_xchk_fsgate_class’:
./fs/xfs/scrub/scrub.h:111:28: error: initializer element is not constant
#define XREP_ALREADY_FIXED (1 << 31) /* checking our repair work */
^
Shifting the (signed) value 1 into the sign bit is undefined behavior.
Fix this for all definitions in the file by shifting "1U" instead of
"1".
This was exposed by the first user added in commit 466c525d6d35e691
("xfs: minimize overhead of drain wakeups by using jump labels").
Fixes: 160b5a784525e8a4 ("xfs: hoist the already_fixed variable to the scrub context")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Lock order in XFS is AGI -> AGF, hence for operations involving
inode unlinked list operations we always lock the AGI first. Inode
unlinked list operations operate on the inode cluster buffer,
so the lock order there is AGI -> inode cluster buffer.
For O_TMPFILE operations, this now means the lock order set down in
xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
unlinked ops are done before the directory modifications that may
allocate space and lock the AGF.
Unfortunately, we also now lock the inode cluster buffer when
logging an inode so that we can attach the inode to the cluster
buffer and pin it in memory. This creates a lock order of AGF ->
inode cluster buffer in directory operations as we have to log the
inode after we've allocated new space for it.
This creates a lock inversion between the AGF and the inode cluster
buffer. Because the inode cluster buffer is shared across multiple
inodes, the inversion is not specific to individual inodes but can
occur when inodes in the same cluster buffer are accessed in
different orders.
To fix this we need move all the inode log item cluster buffer
interactions to the end of the current transaction. Unfortunately,
xfs_trans_log_inode() calls are littered throughout the transactions
with no thought to ordering against other items or locking. This
makes it difficult to do anything that involves changing the call
sites of xfs_trans_log_inode() to change locking orders.
However, we do now have a mechanism that allows is to postpone dirty
item processing to just before we commit the transaction: the
->iop_precommit method. This will be called after all the
modifications are done and high level objects like AGI and AGF
buffers have been locked and modified, thereby providing a mechanism
that guarantees we don't lock the inode cluster buffer before those
high level objects are locked.
This change is largely moving the guts of xfs_trans_log_inode() to
xfs_inode_item_precommit() and providing an extra flag context in
the inode log item to track the dirty state of the inode in the
current transaction. This also means we do a lot less repeated work
in xfs_trans_log_inode() by only doing it once per transaction when
all the work is done.
Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
To fix a AGI-AGF-inode cluster buffer deadlock, we need to move
inode cluster buffer operations to the ->iop_precommit() method.
However, this means that deferred operations can require precommits
to be run on the final transaction that the deferred ops pass back
to xfs_trans_commit() context. This will be exposed by attribute
handling, in that the last changes to the inode in the attr set
state machine "disappear" because the precommit operation is not run.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
It was accidentally dropped when refactoring the allocation code,
resulting in the AG iteration always doing blocking AG iteration.
This results in a small performance regression for a specific fsmark
test that runs more user data writer threads than there are AGs.
Reported-by: kernel test robot <oliver.sang@intel.com>
Fixes: 2edf06a50f5b ("xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|