|
GCxxx hardware
===============
Major optional blocks: each of these can be present or not depending on the specific chip:
- 2D engine
- Composition engine
- 3D engine
- VG engine
Some SoCs have multiple GPU cores, and have distributed the blocks mentioned above over the cores (I suppose
for extra parallelism and/or granularity in power switching). For example the Marvell Armada 620 has a GC2000
with only the 3D engine as well as a GC300 with only the 2D engine. Similarly, the Freescale i.MX6 SoC has a
GC2000 with the 3D engine, a GC320 with the 2D engine and a GC355 with the VG engine.
- State space is a 256kB (65536 times uint32) register file divided up into
separate units for parts of the chip (such as PE, RS, ...)
- Most of the state is latched; that means if it's set to a certain value, it
will keep that value until the next change
- Instead of programming the registers directly (which is possible from kernel
space), the FE, a DMA engine, is used to queue state changes for later
- To perform an operation such as rendering, all the state for that
operation must first be programmed to the desired values
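
The state model described in the list above can be pictured with a small toy model. This is only a sketch in Python, not driver code: the class and method names are made up for illustration, and the real FE consumes a binary command stream rather than a Python list.

```python
STATE_WORDS = 65536             # 65536 x uint32
STATE_BYTES = STATE_WORDS * 4   # 256kB of state space

class StateModel:
    """Toy model: a latched register file plus a queue of pending state changes."""

    def __init__(self):
        self.regs = {}    # address -> last written value (latched)
        self.queue = []   # state changes queued for the FE

    def queue_state(self, addr, value):
        """Queue a state change instead of writing the register directly."""
        if addr % 4 or not 0 <= addr < STATE_BYTES:
            raise ValueError("address must be word-aligned and inside the state space")
        self.queue.append((addr, value & 0xFFFFFFFF))

    def execute(self):
        """Apply the queued changes in order, as the FE would when it reaches them."""
        for addr, value in self.queue:
            self.regs[addr] = value
        self.queue.clear()
```

A value written once stays in `regs` across later `execute()` calls until it is overwritten, which is what "latched" means here.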
Feature bits
=================
Which features are supported on a certain Vivante core is not only determined by the model number
(which AFAIK mainly determines the performance), but specified by a combination of factors:
1) Chip features and minor feature flags
2) Chip specs (number of instructions, pipelines, ...)
3) Chip model (GC800, GC2000, ...)
4) Chip revision of the form 0x1234
All of these are available in read-only registers on the hardware. In most cases it suffices to check the feature flags.
Unlike NV, which parametrizes everything on the model and revision, GC reserves model and revision checks for bugfixes (even these sometimes
have their own feature bit).
For an overview of the feature bits see the enumerations in `state.xml`.
For the Vivante GPUs on some platforms the detailed features and specs are known; these can be found in `doc/gpus_comparison.html`
(`tools/data/gpus.json` for the raw source data).
Modules
==============
(from Vivante SoCIP 2011 presentation [1])
------------------
| Host Interface |
------------------
|
----------------------
| Memory controller |
----------------------
| | | |
| \/ \| |
| ---- ----- |
| |3D| |Tex| ----
\/ Shader--->| |
---- ---- |PE|
|FE|------->|DE|------>| |
---- ---- ----
Functional blocks, indicated by two-letter abbreviations:
- FE Graphics Pipeline Front End (also: DMA engine, Fetch Engine)
- PE Pixel Engine (can be version 1.0 / 2.0)
- SH SHader (vertex + pixel)
- PA Primitive Assembly (clipping, perspective division, viewport transformation)
- SE Setup Engine (depth offset, scissor, clipping)
- RA RAsterizer (multisampling, clipping, culling, varying interpolation, generate fragments)
- TX Texture
- VG Vector Graphics
- IM ? (unknown bit in idle state, may group a few other modules, or maybe the 2D DE)
- FP Fragment Processor?
- MC Memory Controller
- HI Host Interface
- DE 2D drawing and scaling engine
- RS Resolve (resolves rendered image to memory, this is a copy and fill engine)
- VR Video raster (YUV tiler)
- TS Tile Status
These abbreviations are used in `state.xml` for the stripes where appropriate.
[1] http://www.socip.org/socip/speech/pdf/2-Vivante-SoCIP%202011%20Presentation.pdf
Operations
-----------
Modules are programmed and kicked off using state updates, queued through the FE. An exception is 2D and 3D primitive rendering,
which is kicked off directly through an FE command.
The GC320 technical manual [1] describes quite a few operations, but only for the 2D part (DE).
Hands-on Workshop: Graphics Development on the i.MX 6 Series [2] has some tips specific to programming Vivante 3D hardware,
including OpenCL, but is very high level.
Thread walker = Rectangle walker? (seems to have to do with OpenCL)
[1] http://www.vivantecorp.com/Vivante_GC320_Technical_Reference_Manual_V1.0_A.pdf
[2] http://2012ftf.ccidnet.com/pdf/0049.pdf
Connections
-------------
Connections between the different modules follow the OpenGL pipeline design [3].
- FE2VS (FE-VS) fetch engine to vertex shader: attributes
- RA2SH (RA-PS) rasterizer to shader engine: varyings
- SH2PE (PS-PE) shader to pixel engine: color output
Overall:
FE -> VS -> PA -> SE -> RA -> PS -> PE -> RS
See also [1]
- PA assembles 3D primitives from vertices, culls based on trivial rejection and clips based on near Z-plane
- PA transforms from 3D view frustum into 2D screen space
- SE determines rasterization starting point for each primitive, and also culls based on trivial rejection
- RA performs per-tile, per-subtile, per-quad and per-pixel clipping
[1] METHOD FOR DISTRIBUTED CLIPPING OUTSIDE OF VIEW VOLUME
http://www.freepatentsonline.com/y2010/0271370.html
[2] Efficient tile-based rasterization
http://www.google.com/patents/US8009169
[3] OpenGL ES2 pipeline structure
http://www.khronos.org/opengles/2_X/
Command stream
-------------------
Commands and data are sent to the GPU through the FE (Front End interface). The
command stream of the front-end interface has a specific format described in this section.
Overall format
OOOOOxxx xxxxxxxx xxxxxxxx xxxxxxxx Command (O=Opcode, x=argument)
arg0
..
argN-1
Opcodes
00001 Update state
00010 End
00011 NOP
00100 Start DE ([15-8] rect count; 1 parameter word 0xDEADDEED, then 2 parameter words describing the target rect)
00101 Draw primitives
00110 Draw indexed primitives
00111 Wait ([15-0] count)
01000 Link ([15-0] number of bytes, arg address)
01001 Stall (argument seems same format as state 0380C)
01010 Call
01011 Return
01101 Chip select
Each command, including its arguments, is padded to a multiple of two 32-bit words. The number of argument words depends on the opcode, and
sometimes on the first word of the command.
See `cmdstream.xml` for detailed overview of commands and arguments. The most commonly used command is
`LOAD_STATE` whose header word has the following format:
00001FCC CCCCCCCC AAAAAAAA AAAAAAAA Update state
F Fixed point flag: convert a 16.16 fixed point value in the command stream to a floating point value in the state.
C Count of state words that follow
A Base address / 4
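
As a rough illustration of the header layout above, the following Python sketch packs a `LOAD_STATE` command. The function names are hypothetical. Two things are assumptions rather than facts from this document: that padding means appending a zero word when the total word count is odd, and that a count of zero should be rejected (its meaning is not described here).

```python
OP_LOAD_STATE = 0b00001

def to_fixp16_16(x):
    """Convert a number to 16.16 fixed point, as a 32-bit two's complement word."""
    return int(round(x * 65536)) & 0xFFFFFFFF

def load_state_header(addr, count, fixp=False):
    """Build the header word: 00001FCC CCCCCCCC AAAAAAAA AAAAAAAA."""
    if addr % 4 or not 0 <= addr < 0x40000:
        raise ValueError("address must be word-aligned and below 256kB")
    if not 1 <= count <= 0x3FF:
        raise ValueError("count must fit the 10-bit C field (0 rejected, assumption)")
    return (OP_LOAD_STATE << 27) | (int(fixp) << 26) | (count << 16) | (addr >> 2)

def load_state(addr, values, fixp=False):
    """Header followed by the state words, padded to an even number of words."""
    words = [load_state_header(addr, len(values), fixp)]
    words += [to_fixp16_16(v) if fixp else v & 0xFFFFFFFF for v in values]
    if len(words) % 2:
        words.append(0)  # padding word (assumed to be zero)
    return words
```

For example, `load_state(0x0A00, [1, 2])` yields four words: the header, the two values, and one padding word.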
Synchronization
----------------
There are various states related to synchronization, either between different modules in the GPU
or between the GPU and the CPU (through the FE).
- State `GL.SEMAPHORE_TOKEN`
- State `GL.STALL_TOKEN`
- The `STALL` command in command stream
The following sequence of states is common:
GL.SEMAPHORE_TOKEN := FROM=RA,TO=PE
GL.STALL_TOKEN := FROM=RA,TO=PE
The first state load arms the semaphore, the second one stalls the FROM module until the TO module has raised its semaphore. In
this example it stalls the rasterizer until the pixel engine has completed the commands up until now.
The `STALL` command is used to stall the command queue until the semaphore has been received. The stall command has
one argument that has the same format as the `_TOKEN` states above, except that the FROM module is always the FE.
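
The two patterns described above can be written down symbolically. This Python sketch deliberately does not encode register addresses or bit fields (see `state.xml` and `cmdstream.xml` for those); the function names and the tuple representation are made up for illustration, and pairing the `STALL` command with a preceding FE semaphore is my reading of the text, not a confirmed requirement.

```python
def semaphore_and_stall(frm, to):
    """State loads that stall module `frm` until module `to` has raised its semaphore."""
    token = {"FROM": frm, "TO": to}
    return [
        ("LOAD_STATE", "GL.SEMAPHORE_TOKEN", token),  # arms the semaphore
        ("LOAD_STATE", "GL.STALL_TOKEN", token),      # stalls the FROM module
    ]

def stall_command_queue(to):
    """Arm a semaphore from the FE, then stall the command queue with STALL."""
    token = {"FROM": "FE", "TO": to}
    return [
        ("LOAD_STATE", "GL.SEMAPHORE_TOKEN", token),
        ("STALL", None, token),  # FROM is always the FE for this command
    ]
```

`semaphore_and_stall("RA", "PE")` reproduces the common sequence shown earlier in this section.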
Within the 3D engine, not many explicit synchronization points appear to be needed. Some exceptions:
- The blob issues a semaphore and stall from RA to PE when
- Changing depth configuration in PE
- Sometimes when changing stencil config in PE
- The blob issues just a semaphore from RA to PE, and a stall before drawing a primitive when
- Tile status address/configuration changes
- Clearing depth
- Clearing tile status
- The blob issues a semaphore and stall from FE to PE before changing the pipe from 2D to 3D or vice versa
XXX (cwabbott) usually, ISAs have some sort of texture barrier or sync operation to be able to load textures asynchronously
(mali does it w/ pipeline registers) i'm wondering where that is in the vivante isa
Resolve
-----------
The resolve module is a copy and fill engine. It can copy blocks of pixels from one GPU address to another,
optionally tiling/detiling, converting between pixel formats, or scaling down by a factor of 2. The source and
destination address can be the same to fill in tiles that were not touched during the rendering process
(according to the Tile Status, see below) with the background color.
The RS and PE (drawing) share one set of pixel pipes. They will never be active concurrently (AFAIK).
They do however have separate caches, so before using RS to copy from a surface at least the COLOR cache needs to be flushed
(and possibly the RS cache). The blob also flushes the DEPTH cache, though I do not know why.
Tiled or supertiled resolve operation sizes need to be aligned to 16 horizontally and 4 vertically.
Non-tiled to non-tiled:
- need a width of at least 17 (I suppose the safe value is 32)
- height must be multiple of 4
Tiled to non-tiled:
- width must be at least 13 (I suppose the safe value is 16)
- height must be at least 1
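
The size constraints above are easy to get wrong, so a small helper may be useful. This is a sketch in Python with hypothetical function names; it only encodes the limits listed in this section and nothing else about the resolve engine.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def tiled_resolve_size(width, height):
    """Round a tiled or supertiled resolve size up to 16 horizontally, 4 vertically."""
    return align_up(width, 16), align_up(height, 4)

def linear_to_linear_ok(width, height):
    """Non-tiled to non-tiled: width of at least 17, height a multiple of 4."""
    return width >= 17 and height % 4 == 0

def tiled_to_linear_ok(width, height):
    """Tiled to non-tiled: width of at least 13, height of at least 1."""
    return width >= 13 and height >= 1
```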
Tile status (Fast clear)
-------------------------
A render target is divided in tiles, and every tile has a couple of status flags.
An auxiliary buffer (allocated with `gcvSURF_TILE_STATUS`) associated with each render surface keeps track of these tile status flags.
One of these flags is the `clear` flag, that signifies that the tile has been cleared.
`fast clear` happens by setting the clear bit for each tile instead of clearing the actual surface
data.
Tile size is dependent on the hardware, and so is the number of bits per tile (can be two or four).
The tile status bits are cleared using RS, by clearing a small surface with the value
0x55555555. When clearing, only the destination address and stride need to be set,
the source is ignored.
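
To see what the 0x55555555 fill value looks like per tile, the word can be split into fields of two or four bits. This Python sketch only decomposes the bit pattern; the helper name is hypothetical, the lowest-bits-first ordering is an assumption, and what each field value means to the hardware is not documented here.

```python
TS_CLEAR_WORD = 0x55555555

def tile_fields(word, bits_per_tile):
    """Split a 32-bit tile status word into per-tile fields, lowest bits first."""
    mask = (1 << bits_per_tile) - 1
    return [(word >> shift) & mask for shift in range(0, 32, bits_per_tile)]
```

With two bits per tile every field is `0b01`; with four bits per tile every field is `0b0101`.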
An invalid pattern in the tile status memory can result in hangs when rendering. This was discovered
in tests that used a depth surface but did not clear it. The residual data in the TS
caused the GPU to hang mysteriously on rendering.
Shader ISA
================
Vivante GPUs have a unified shader ISA, which means that vertex and pixel shaders share the same
instruction set. See `isa.xml` and `isa.md` for details of the instructions; this section only provides a high-level overview.
- Each instruction consists of 4 32-bit words. These have a fixed format, with bitfields
that have a meaning which differs only very little per opcode. Which of these fields is used (which operands) does differ per opcode.
- Four-component SIMD processor (for most of the instructions)
- Older GPUs have floating point operations only, the newer ones have support for integer operations in the context of OpenCL.
The split is around GC1000, though this being Vivante there is likely some feature bit for it.
- Instructions can have up to three source operands (`SRC0_*`, `SRC1_*`, `SRC2_*`), and one destination operand (`DST_`).
In addition to that, there is a specific operand for texture sampling (`TEX_*`).
- Operands can have these properties:
- `USE`: the operand is enabled (1) or not (0)
- `REG`: register number to read or write
- `SWIZ`: arbitrary swizzle from four to four components (source operands only)
- `COMPS`: which components to affect (destination operand only)
- `AMODE`: addressing mode; this can either be direct or indexed through the X,Y,Z,W component of the address register
- `RGROUP`: chooses the register group to read from (source operands only). Register groups are the temporaries, uniforms, and
possibly others.
- Registers:
- `N` four-component float temporary registers `tX` (the actual number depends on the hardware; the maximum seems to be 64 for all
Vivante GPUs I've encountered up until now, but as with other GPUs, using more registers will likely restrict
the available parallelism)
- `1` four-component address register `a0`
Temporary registers are also used for shader inputs (attributes, varyings) and outputs (colors, positions). They are set to
the input values before the shader executes, and should have the output values when the shader ends. If the output
should be the same as the input (passthrough) an empty shader with only a NOP instruction can be used.
Rendering to framebuffer
=========================
Rendering to the framebuffer is pretty easy (see `etna_fb.c`). The general idea is to get the physical address
of the framebuffer using the `FBIOGET_VSCREENINFO` and `FBIOGET_FSCREENINFO` ioctls on the framebuffer device.
This physical address can then directly be used as target address for a resolve operation, just like when copying
to a normal bitmap.
Even though it would save a resolve operation, it is not useful to use the physical address of the frame buffer
directly for rendering, as it is only possible to render to tiled and supertiled surfaces, and (afaik) no display controller
supports scan out from tiled formats.
In many cases there is more framebuffer memory than that which is used for the current screen, which causes larger virtual resolution
to be returned than the physical resolution. Double-buffering is achieved by changing the y-offset within that virtual frame buffer.
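
The double-buffering arithmetic can be sketched as follows. This Python sketch works on plain numbers rather than calling the ioctls; the parameter names mirror the `yres`, `yres_virtual` and `yoffset` fields of `fb_var_screeninfo` and the `smem_start` and `line_length` fields of `fb_fix_screeninfo`, while the function names are made up for illustration.

```python
def buffer_count(yres, yres_virtual):
    """How many full screens fit in the virtual framebuffer."""
    return yres_virtual // yres

def buffer_yoffset(index, yres, yres_virtual):
    """y-offset to select buffer `index` within the virtual framebuffer."""
    if not 0 <= index < buffer_count(yres, yres_virtual):
        raise ValueError("buffer index outside the virtual framebuffer")
    return index * yres

def buffer_address(smem_start, line_length, yoffset):
    """Physical address of the first scanline of the buffer at `yoffset`."""
    return smem_start + yoffset * line_length
```

The address returned by `buffer_address` is what would be used as the resolve target for the back buffer.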
Operations
========================
An attempt to figure out which operations can be triggered in the hardware, and what state is used to specify
their operation.
- RS: Kick off resolve by writing a value with bit 0 set to `RS_KICKER`. State used:
- `RS_*`
- `TS_*` (only when reading, if fast clear enabled through `TS_CONFIG`)
- FE: Kick off 3D rendering by sending command `DRAW_PRIMITIVES` / `DRAW_INDEXED_PRIMITIVES`
- `FE_*` (vertex element layout, vertex streams, index stream, ...)
- `GL_*` (varyings setup, multisampling)
- `TS_*` (to read and update fast clear status for tiles)
- `PA_*` primitive assembly
- `SE_*` setup engine
- `RA_*` rasterizer
- `PE_*` pixel engine
- `VS_*` vertex shader code + uniforms + linking information
- `PS_*` pixel shader code + uniforms + linking information
- `(N)TE_*` texture samplers
- `SH_*` extra shader code + uniforms
- DE: Kick off 2D rendering by sending command `DRAW_2D`
- `DE_*` 2D state
- `FE_*` `GL_*` possibly
That's all, folks.
Programming peculiarities
=========================
- The FE can convert from 16.16 fixed point format to 32 bit float. This is enabled by the `fixp` bit
in the `LOAD_STATE` command. This is mostly useful for older ARM CPUs without native floating point
support. The blob driver uses it for some states (viewport scaling, offset, scissor, ...)
but not others (uniforms etc).
- It is quite easy to hang the GPU when making a minor programming mistake.
When the GPU is stuck it is still possible to submit command buffers, but nothing gets drawn and nothing
ever finishes.
Ways I've already made it crash:
- Wrong number of VS inputs (must be equal to number of vertex elements)
- Wrong number of temporaries in PS
- Sending 3D commands in the 2D pipe instead of 3D pipe (then using a signal waiting for them to complete)
- Wrong length of shader
- Texture sampling without properly set up texture units
- `SE_SCISSOR`: setting SCISSOR bottom/right to `(x<<16)|5` instead of `(x<<16)-1` causes crashes for higher resolutions
such as 1920x1080 on GC600. I don't know why, maybe some buffer or cache overflow. The rockchip vivante driver always uses |5 AFAIK;
this offset appears to be different per specific chip/revision.
This may be a (kernel) driver problem. It is possible to reset the GPU from user space with an ioctl, but
this usually is not enough to make it un-stuck. It would probably be a better solution to introduce a kernel-based timeout
instead of relying on userspace to be 100% correct (may exist on v4?).
Masked state
-------------
Many groups of state bits, especially in the PE, have a mask bit. These have been named `*_MASK`.
When the mask bit belonging to a group of state bits is *set* on a state write, the accompanying
state bits will be unaffected. If the mask bit is *unset*, the state bits will be written.
This allows setting state per group of bits. For example, it allows setting only
the destination alpha function (`ALPHA_CONFIG.DST_FUNC_ALPHA`) without affecting the
other bits in that state word.
If masking functionality is not desired, simply keep all the `_MASK` bits at zero and write all
bits at once. This is what I used in `etna_pipe`, as I keep track of all state myself.
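
The masking rule can be modelled in a few lines. This is a sketch in Python; `masked_write` and the bit layout used in the example are hypothetical, and the real field and mask positions are in `state.xml`.

```python
def masked_write(old, new, groups):
    """Model a state write with mask bits.

    `groups` is a list of (field_bits, mask_bit) pairs. A group's field bits
    are written only when its mask bit is unset in `new`; otherwise the group
    keeps its old value. Mask bits themselves are not stored in this model.
    """
    result = old
    for field_bits, mask_bit in groups:
        if not new & mask_bit:
            result = (result & ~field_bits) | (new & field_bits)
    return result & 0xFFFFFFFF

# Hypothetical layout: two 4-bit fields, each with its own mask bit.
GROUPS = [(0x0000000F, 0x00000010), (0x00000F00, 0x00001000)]
```

Writing `0x310` over `0xA05` with this layout updates only the second field, because the mask bit of the first field (0x10) is set.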
Texture tiling
----------------
RGBA/RGBx textures and render targets are stored in a 4x4 tiled format.
Tile 1 Tile 2 ... Tile w-1
0 1 2 3 16 17 18 19
4 5 6 7 20 21 22 23
8 9 10 11 24 25 26 27
12 13 14 15 28 29 30 31
The stride of these tiled surfaces is the number of bytes between one row of tiles and the next. So for a surface of width
512, it is `(512/4)*16*4=8192`.
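
The tile numbering and the stride formula above translate into the following Python sketch. It assumes the surface width is a multiple of 4 and, for the stride, 4 bytes per pixel as in the 512-wide example; the function names are made up for illustration.

```python
TILE = 4  # tiles are 4x4 pixels

def tiled_stride(width, bytes_per_pixel=4):
    """Bytes between one row of tiles and the next."""
    return (width // TILE) * (TILE * TILE) * bytes_per_pixel

def tiled_pixel_index(x, y, width):
    """Index of pixel (x, y) in a 4x4 tiled surface, in units of pixels."""
    tile = (y // TILE) * (width // TILE) + (x // TILE)
    return tile * TILE * TILE + (y % TILE) * TILE + (x % TILE)
```

Multiplying the pixel index by the bytes per pixel gives the byte offset into the surface.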
Supertiling
-------------------

It appears that the blob always pads render buffer pixel sizes to a multiple of 64, i.e. a width of 400 becomes 448 and 800 becomes 832.
This is because the render buffer is also tiled, albeit differently than the 4x4 tiling format of the textures.
On a fine level, every tile is the same as for normal tiled surfaces:
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
However, as the name 'supertiled' implies, the tiles themselves are also tiled, to be specific in this pattern:
000 001 008 009 016 017 024 025 032 033 040 041 048 049 056 057
002 003 010 011 018 019 026 027 034 035 042 043 050 051 058 059
004 005 012 013 020 021 028 029 036 037 044 045 052 053 060 061
006 007 014 015 022 023 030 031 038 039 046 047 054 055 062 063
064 065 072 073 080 081 088 089 096 097 104 105 112 113 120 121
066 067 074 075 082 083 090 091 098 099 106 107 114 115 122 123
068 069 076 077 084 085 092 093 100 101 108 109 116 117 124 125
070 071 078 079 086 087 094 095 102 103 110 111 118 119 126 127
128 129 136 137 144 145 152 153 160 161 168 169 176 177 184 185
130 131 138 139 146 147 154 155 162 163 170 171 178 179 186 187
132 133 140 141 148 149 156 157 164 165 172 173 180 181 188 189
134 135 142 143 150 151 158 159 166 167 174 175 182 183 190 191
192 193 200 201 208 209 216 217 224 225 232 233 240 241 248 249
194 195 202 203 210 211 218 219 226 227 234 235 242 243 250 251
196 197 204 205 212 213 220 221 228 229 236 237 244 245 252 253
198 199 206 207 214 215 222 223 230 231 238 239 246 247 254 255
This has some similarity to a Z-order curve (http://en.wikipedia.org/wiki/Z-order_curve) or other space-filling curve,
but is only nested one level. In total this results in 64x64 pixel supertiles.
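
The 16x16 pattern in the table above follows a simple bit interleaving, which can be expressed as a formula. This Python sketch was derived from the table itself, not from vendor documentation, and the function name is hypothetical.

```python
def supertile_tile_index(tx, ty):
    """Index of the 4x4 tile at column tx, row ty (0..15 each) inside one supertile."""
    if not (0 <= tx < 16 and 0 <= ty < 16):
        raise ValueError("tile coordinates must be within the 16x16 supertile")
    return (
        (tx & 1)              # bit 0: column within a 2-wide group
        | ((ty & 3) << 1)     # bits 1-2: row within a 4-high group
        | ((tx >> 1) << 3)    # bits 3-5: which 2-wide group along the row
        | ((ty >> 2) << 6)    # bits 6-7: which 4-high band of the supertile
    )
```

Evaluating the function for `ty = 0` and `tx = 0..15` reproduces the first row of the table (0, 1, 8, 9, 16, 17, ...).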
The GPU can render to normal tiled surfaces (such as used by textures) as well as supertiled surfaces. However,
rendering to supertiled surfaces is likely faster due to better cache locality.
Stride, as used for resolve operations, is for a row of tiles not a row of pixels; 0x1c00 for width 448 (originally 400),
0x3400 for width 832 (originally 800).
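
The padded widths and strides quoted in this section can be checked with a short calculation. This is a Python sketch with hypothetical function names; the 4 bytes per pixel default is an assumption that happens to match the quoted stride values.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment

def supertiled_width(width):
    """Render buffer width padded to a multiple of 64 pixels."""
    return align_up(width, 64)

def tile_row_stride(padded_width, bytes_per_pixel=4):
    """Bytes per row of 4x4 tiles: (width / 4) tiles of 16 pixels each."""
    return (padded_width // 4) * 16 * bytes_per_pixel
```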
Multisampling
--------------
GC600 supports 1, 2, or 4 MSAA samples. Vivante's patent [1] on anti-aliasing may reveal some of the inner workings.
- 256x256 target with 1 sample creates a 256x256 render target (duh)
GL.MULTI_SAMPLE_CONFIG := MSAA_SAMPLES=NONE,MSAA_ENABLES=0xf,UNK12=0x0,UNK16=0x0
PE.COLOR_STRIDE := 0x400
PE.DEPTH_STRIDE := 0x200
- 256x256 target with 2 samples creates a 512x256 render target and depth buffer
GL.MULTI_SAMPLE_CONFIG := MSAA_SAMPLES=2X,MSAA_ENABLES=0x3,UNK12=0x0,UNK16=0x0
RA.MULTISAMPLE_UNK00E04 := 0x0
RA.MULTISAMPLE_UNK00E10[0] := 0xaa22
RA.CENTROID_TABLE[0] := 0x66aa2288
RA.CENTROID_TABLE[1] := 0x88558800
RA.CENTROID_TABLE[2] := 0x88881100
RA.CENTROID_TABLE[3] := 0x33888800
PE.COLOR_STRIDE := 0x800 (doubled)
PE.DEPTH_STRIDE := 0x400 (doubled)
- 256x256 target with 4 samples creates a 512x512 render target and depth buffer
GL.MULTI_SAMPLE_CONFIG := MSAA_SAMPLES=4X,MSAA_ENABLES=0xf,UNK12=0x0,UNK16=0x0
RA.MULTISAMPLE_UNK00E04 := 0x0
RA.MULTISAMPLE_UNK00E10[2] := 0xaaa22a22
RA.CENTROID_TABLE[8] := 0x262a2288
RA.CENTROID_TABLE[9] := 0x886688a2
RA.CENTROID_TABLE[10] := 0x888866aa
RA.CENTROID_TABLE[11] := 0x668888a6
RA.MULTISAMPLE_UNK00E10[1] := 0xe6ae622a
RA.CENTROID_TABLE[4] := 0x46622a88
RA.CENTROID_TABLE[5] := 0x888888ae
RA.CENTROID_TABLE[6] := 0x888888e6
RA.CENTROID_TABLE[7] := 0x888888ca
RA.MULTISAMPLE_UNK00E10[0] := 0xeaa26e26
RA.CENTROID_TABLE[0] := 0x4a6e2688
RA.CENTROID_TABLE[1] := 0x888888a2
RA.CENTROID_TABLE[2] := 0x888888ea
RA.CENTROID_TABLE[3] := 0x888888c6
PE.COLOR_STRIDE := 0x800
PE.DEPTH_STRIDE := 0x400 (doubled)
Other differences when MSAA is enabled:
- `TS.MEM_CONFIG` is different when MSAA is used (see descriptions for fields `MSAA` and `MSAA_FORMAT`).
- The TS surface belonging to the render target is enlarged in the same way; it is treated as if there simply is a bigger render target.
- It also looks like the PS gets an extra input/temporary when MSAA is enabled:
-0x00001f02, /* PS.INPUT_COUNT := COUNT=2,COUNT2=31 */
+0x00001f03, /* PS.INPUT_COUNT := COUNT=3,COUNT2=31 */
-0x00000002, /* PS.TEMP_REGISTER_CONTROL := NUM_TEMPS=2 */
+0x00000003, /* PS.TEMP_REGISTER_CONTROL := NUM_TEMPS=3 */
Haven't yet checked what the value is that is passed in (XXX todo). The shader code itself is unaffected, so the extra
input is added to the end.
- When resolving the supersampled surface to another (normal pixmap) surface, flag `SOURCE_MSAA` must be configured appropriately to
un-subsample the surface. `WINDOW_SIZE` for this resolve is the *doubled* window size as above, so 512x512 for a 256x256 render
target with MSAA.
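
The surface sizes and strides observed in the dumps above fit a simple rule, sketched here in Python. The function names are made up, and the bytes-per-pixel values (4 for color, 2 for depth) are inferred from the stride values in the dumps rather than stated anywhere.

```python
MSAA_SCALE = {1: (1, 1), 2: (2, 1), 4: (2, 2)}  # samples -> (x scale, y scale)

def msaa_surface_size(width, height, samples):
    """Size of the render target actually allocated for an MSAA configuration."""
    sx, sy = MSAA_SCALE[samples]
    return width * sx, height * sy

def pe_stride(width, samples, bytes_per_pixel):
    """PE stride as seen in the dumps: bytes per row of the enlarged surface."""
    surface_width, _ = msaa_surface_size(width, 1, samples)
    return surface_width * bytes_per_pixel
```

The size returned by `msaa_surface_size` is also the `WINDOW_SIZE` to use when resolving the supersampled surface.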
[1] http://www.faqs.org/patents/app/20110249901
Rendering points
------------------
When rendering points (PRIMITIVE_TYPE_POINTS) there are some differences:
- VS can have an extra output, the size of the point `gl_PointSize`
if `PA_CONFIG.POINT_SIZE_ENABLE` is set. This will be the last output in `VS_OUTPUT`.
- There is an extra varying for `gl_PointCoord` with two components. This varying has
its components in `GL_VARYING_COMPONENT_USE` set to `POINTCOORD_X` and `POINTCOORD_Y`.
Its `PA_SHADER_ATTRIBUTES` is set to `0x000002f1`.
The VS output associated to this varying in `VS_OUTPUT` is discarded, so can be set
to any output register.
- `rasterizer.point_size_per_vertex` affects number of vs outputs (just like MSAA!). If point
size per vertex is not set, the value in `PA.POINT_SIZE` is used.
- Distinction between sprite coordinate origin `UPPER_LEFT` / `LOWER_LEFT` is implemented by adding
a 1.0-y instruction when `gl_PointCoord` is used. XXX figure out what is the default.
Vertex texture fetch
--------------------
Vertex samplers live in the same space as fragment samplers. The blob uses a fixed mapping:
sampler 0..7 are used as fragment samplers and 8..11 are used as vertex samplers.
The shaders themselves refer to the absolute sampler number, so tex8 is the first texture unit used in a
vertex shader. The normal TEX instruction can be used to sample textures from a vertex shader.
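
The fixed mapping used by the blob can be captured in a small lookup. This is a Python sketch with hypothetical names; it mirrors the blob's split (0..7 fragment, 8..11 vertex) and says nothing about whether other splits would work.

```python
FRAGMENT_SAMPLERS = range(0, 8)   # samplers 0..7
VERTEX_SAMPLERS = range(8, 12)    # samplers 8..11

def absolute_sampler(stage, unit):
    """Map a per-stage texture unit to the absolute sampler number used in shaders."""
    samplers = {"fragment": FRAGMENT_SAMPLERS, "vertex": VERTEX_SAMPLERS}[stage]
    if not 0 <= unit < len(samplers):
        raise ValueError("no such texture unit for this shader stage")
    return samplers[unit]
```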
Vivante hw has two texture caches that need to be flushed separately, one for fragment shaders and
one for vertex shaders (bits `GL.FLUSH_CACHE.TEXTURE` and `GL.FLUSH_CACHE.TEXTUREVS` respectively).
This solves a problem with running `cubemap_sphere` after `displacement` demo;
it seemed that some leftover cache from using a texture in displacement
caused the texture in `cubemap_sphere` (which is only 1x1x6) to be messed
up (due to containing old values).
Warning: setting the `TEXTUREVS` bit seems to result in crashes when rendering directly afterwards.
Even adding a PE to FE semaphore afterwards or dummy state loads does not fix this. It could be that
a RA to PE (or FE) semaphore *before* the flush solves this crash. A similar issue exists when flushing
the TS cache.
All texture filtering options are allowed for vertex texture fetch.
XXX maybe figure out if the sampler units are shared between fragment and vertex shaders and thus interchangeable. This is
not important for GL/Gallium because it already lives with the assumption that vertex and fragment shaders
are distinct.
Shader size on GC2000
----------------------
The "query chip identity" ioctl on GC2000 reports an instructionCount of 512. Looking at the low-level command
stream dumps the device appears to have 0x0E000 - 0x0C000 = 8192 bytes of instruction memory. With 128 bit
instructions this indeed maps to 512 instructions.
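
The arithmetic behind that figure, as a Python sketch (the function name is hypothetical; the address range is the one observed in the dumps):

```python
INSTRUCTION_BYTES = 4 * 4  # four 32-bit words = 128 bits per instruction

def instruction_capacity(start, end):
    """Number of instructions that fit in the state range [start, end)."""
    return (end - start) // INSTRUCTION_BYTES
```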
XXX does the VS/PS split at instruction 256 during rendering affect OpenCL? Hopefully not...
State changes and caches
--------------------------
It looks like some state changes invalidate the cache. Before these state changes it is important to flush
the appropriate cache so that the rendered cache tiles are properly written back. These are at least:
PE.COLOR_FORMAT.OVERWRITE -> flush COLOR
(others will be added when found)
- before flushing the TS cache (as before a clear) first make sure that DEPTH
and COLOR are flushed, and a stall from RA to PE is done, otherwise a crash will happen.
Memory alignment
-----------------
We should take this errata into account moving down the road with GPU drivers.
The GPU3D L1 cache assumes that all memory requests are 16 bytes. If a request is 16 bytes, there
are no issues since the data boundary lines up evenly. If a request is not aligned to 16 bytes, the
memory controller will split those unaligned requests into two requests, doubling the number of
requests processed internally in L1 cache.
(jnettlet)
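
The effect of misalignment described in the note above can be estimated by counting how many 16-byte-aligned chunks a request touches. This Python sketch is only a model of that description, with a hypothetical function name; it does not claim to match the exact behaviour of the memory controller.

```python
def l1_requests(address, size=16, line=16):
    """Number of 16-byte-aligned chunks touched by [address, address + size)."""
    first = address // line
    last = (address + size - 1) // line
    return last - first + 1
```

An aligned 16-byte request counts as one, while the same request at an offset of 8 bytes counts as two.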
|