summaryrefslogtreecommitdiff
path: root/doc/kernel_bugs.md
blob: f65be3adc5ae8b72d139aed19d8abf0465f436eb (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
Notes about old and new bugs in Vivante GPL kernel.

Race condition
---------------

In event submission (from #cubox).

    [20:19:57] <_rmk_> so, gckEVENT_Submit claims the event listMutex... allocates an event id,
    drops it, and then submits the command queue...  [20:20:02] <_rmk_> so two threads can do
    this...
    [20:20:15] <_rmk_> CPU0: claim listMutex
    [20:20:20] <_rmk_> CPU0: get event ID
    [20:20:25] <_rmk_> CPU0: drop listMutex
    [20:20:31] <_rmk_> CPU1: claim listMutex
    [20:20:36] <_rmk_> CPU1: get another event ID
    [20:20:41] <_rmk_> CPU1: drop listMutex
    [20:20:49] <_rmk_> CPU1: insert commands into the command queue
    [20:20:56] <_rmk_> CPU0: insert commands into the command queue
    [20:21:16] <_rmk_> and then we have the second event due to fire first, which upsets the event
    handling

Status: still present

Command buffer submission
--------------------------

Version: dove

Submitting a lot of distinct command buffers without queuing (or waiting for) a synchronization
signal causes the kernel to run out of signals. This causes hangs in rendering.

Status: workaround found (see `ETNA_MAX_UNSIGNALED_FLUSHES`)

Command buffer memory management on cubox
-------------------------------------------

    <_rmk_> err, this is not good.
    <_rmk_> gckCOMMAND_Start: command queue is at 0xe7846000
    <_rmk_> gckCOMMAND_Start: WL 380000c8 0cf6d19f 40000002 10281000
    <_rmk_> that's what the first 4 words should've been at the beginning (and where when the GPU was started)
    <_rmk_> they then conveniently become: 4000002C 19CE0000 AAAAAAAA AAAAAAAA
    <_rmk_> the first two change because of the wait being converted to a link
    <_rmk_> the second two...
    <_rmk_> well, 0xaaaaaaaa is the free page poisoning.
    <_rmk_> which suggests that the GPU command page was freed while still in use
    <_rmk_> ah ha, that'll be how.  get lucky with the cache behaviour and this is what happens...
    <_rmk_> page poisoning enabled.  When a lowmem page is allocated, it's memset() through its lowmem mapping, which is cacheable.
    <_rmk_> this data can sit in the CPU cache.
    <_rmk_> it is then ioremap()'d (which I've been dead against for years) which creates a *device* mapping.
    <_rmk_> we then write the wait/link through the device mapping.
    <_rmk_> at some point later, the cache lines get evicted from the _normal memory_ mapping.
    <_rmk_> thereby overwriting the original wait/link commands in the GPU stream with 0xAAAAAAAA
    <_rmk_> I wonder what that a command word with a bit pattern of 10101 in the top 5 bits tells the GPU to do...
    <_rmk_> the only thing I can say is... if people would damn well listen to me when I say "don't do this" then you wouldn't get these bugs.
    <_rmk_> this is one of the reasons why my check in ioremap.c exists to prevent system memory being ioremap'd and therefore this kind of issue cropping up

More races...

    <wumpus> IIRC the blob does a similar thing
    <_rmk_> if you ever see the kernel galcore complain about a lost event or an event not pending, that's because the kernel driver raced
    <wumpus> yes, I've had those races, especially on cubox
    <_rmk_> and its caused by three races
    <wumpus> even when using it only from one process :D
    <_rmk_> one in _GetEvent, where it finds an event slot under a lock but has no way to mark it as having been allocated
    <_rmk_> one in gckEVENT_Submit, where you can end up with two events being allocated and then inserted in the reverse order to allocation
    <wumpus> ugh
    <_rmk_> and one in gckEVENT_Notify, where the processing of pending events can screw it up (I need to write that one up properly in the commit)
    <wumpus> I guess some more correct and linuxy synchronization should be used, not some leaky abstractions
    <_rmk_> the final one - _IsEmpty - can work both ways.  This function can report that the event queues are empty when they aren't, and they aren't empty when they are.

    <_rmk_> oh yes, that's the gckEVENT_Notify race - it scans the event queues/lists without holding any locks - it was coded with the assumption that the only thing you need to protect against is writes to queue->head and apparantly reads are fine.
    <_rmk_> the problem that leads to is that you can end up with queue->head non-NULL, queue->stamp being old, and then gckEVENT_Notify() identifies it as a missed event when in actual fact the event queue is in the middle of being setup
    <_rmk_> ... and NULLs out queue->head and runs all the queued events
    <_rmk_> since I've fixed all these, I'm seriously considering getting rid of my fencing in my Xorg driver... it's just not needed anymore