| author | Qu Wenruo <wqu@suse.com> | 2024-01-26 13:51:32 +1030 | 
|---|---|---|
| committer | David Sterba <dsterba@suse.com> | 2024-03-04 16:24:52 +0100 | 
| commit | b2324e08b8b3b38bb86ba779970b0caab32ef0ed (patch) | |
| tree | 712259bd4d8a81eff5b18a9910db6083988a5131 /include/linux/overflow.h | |
| parent | 74cd8cac0b12b3d6f181491aca6af23f5d5a65f1 (diff) | |
btrfs: raid56: extra debugging for raid6 syndrome generation
[BUG]
I have received at least two crash reports for RAID6 syndrome
generation.  Whether the path is AVX2 or SSE2, they all seem to have a
similar call trace with a corrupted RAX:
  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP PTI
  Workqueue: btrfs-rmw rmw_rbio_work [btrfs]
  RIP: 0010:raid6_sse21_gen_syndrome+0x9e/0x130 [raid6_pq]
  RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248
  RDX: 0000000000000000 RSI: ffffa0f74cfa3238 RDI: 0000000000000000
  Call Trace:
   <TASK>
   rmw_rbio+0x5c8/0xa80 [btrfs]
   process_one_work+0x1c7/0x3d0
   worker_thread+0x4d/0x380
   kthread+0xf3/0x120
   ret_from_fork+0x2c/0x50
   </TASK>
[CAUSE]
The cause is not known.  Recently I also hit this in the AVX512 path,
and that was on a v5.15 backport, which doesn't have any of my RAID56
rework.
Furthermore, according to the registers:
  RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffffa0ff4cfa3248
The RAX register holds the number of stripes (including PQ), and its
value (0) is not correct.  But the other two registers both look sane.
- RBX is the sectorsize
  For x86_64 it should always be 4K, which matches the output (0x1000).
- RCX is the pointers array
  It comes from rbio->finish_pointers, and it looks like a sane
  kernel address.
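To illustrate how the three values above are consumed, here is a
minimal userspace sketch of a byte-at-a-time syndrome generator taking
the same (disks, bytes, ptrs) arguments.  It is not the raid6_pq code:
the function name, the return value and the early bail-out on a too
small stripe count are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply by 2 in GF(2^8) with the RAID6 polynomial 0x11d. */
static uint8_t gf_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/*
 * Hypothetical sketch of a gen_syndrome()-style helper.
 *
 * ptrs[0 .. disks - 3] are the data stripes, ptrs[disks - 2] is P and
 * ptrs[disks - 1] is Q.  With disks == 0 the P/Q indexes would become
 * negative, so this sketch refuses to run (an assumption, the real
 * helper returns void and does no such check).
 */
static int gen_syndrome_sketch(int disks, size_t bytes, void **ptrs)
{
	uint8_t **dptr = (uint8_t **)ptrs;
	int z0 = disks - 3;		/* Highest data stripe */
	uint8_t *p, *q;

	if (disks < 3)
		return -1;

	p = dptr[z0 + 1];
	q = dptr[z0 + 2];

	for (size_t d = 0; d < bytes; d++) {
		uint8_t wp = dptr[z0][d];
		uint8_t wq = wp;

		/* Horner's scheme: Q = D0 + 2 * (D1 + 2 * (D2 + ...)) */
		for (int z = z0 - 1; z >= 0; z--) {
			uint8_t wd = dptr[z][d];

			wp ^= wd;
			wq = gf_mul2(wq) ^ wd;
		}
		p[d] = wp;
		q[d] = wq;
	}
	return 0;
}
```

In this sketch a stripe count of 0 makes every stripe index invalid
even when the pointers array itself is a good address, which is why
the bogus RAX value matters.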
[WORKAROUND]
For now, I can only add extra debug ASSERT()s before we call the raid6
gen_syndrome() helper, hoping to catch the problem.
The debug checks require both CONFIG_BTRFS_DEBUG and
CONFIG_BTRFS_ASSERT to be enabled.
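A minimal userspace sketch of this kind of check could look like the
following.  The struct, its field names and the exact conditions are
assumptions for illustration only, and assert() stands in for the
kernel ASSERT():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the stripe counters of a raid bio. */
struct rbio_sketch {
	uint8_t real_stripes;	/* Total stripes, including P/Q */
	uint8_t nr_data;	/* Data stripes only */
};

/* Return true if the stripe counts are sane enough to build parity. */
static bool rbio_sketch_is_sane(const struct rbio_sketch *rbio)
{
	/* At least two stripes are needed (2 disks RAID5). */
	if (rbio->real_stripes < 2)
		return false;
	/* There must be at least one data stripe. */
	if (rbio->nr_data == 0)
		return false;
	/* Data stripes must be fewer than the total stripes. */
	return rbio->nr_data < rbio->real_stripes;
}

/* Abort early, before any syndrome helper sees a bogus stripe count. */
static void assert_rbio_sketch(const struct rbio_sketch *rbio)
{
	assert(rbio_sketch_is_sane(rbio));
}
```

Calling assert_rbio_sketch() right before the syndrome helper would
turn a stripe count of 0 into an immediate assertion failure close to
the corruption, instead of a NULL pointer dereference inside the
SSE2/AVX2 code.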
My current guess is some use-after-free, but it doesn't make much sense
that every report has only a corrupted RAX while the pointers are
seemingly valid.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
