Diffstat (limited to 'Documentation/bpf')
-rw-r--r-- | Documentation/bpf/bpf_devel_QA.rst | 8
-rw-r--r-- | Documentation/bpf/bpf_iterators.rst | 119
-rw-r--r-- | Documentation/bpf/btf.rst | 25
-rw-r--r-- | Documentation/bpf/kfuncs.rst | 17
-rw-r--r-- | Documentation/bpf/standardization/instruction-set.rst | 20
5 files changed, 173 insertions, 16 deletions
diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst
index de27e1620821..0acb4c9b8d90 100644
--- a/Documentation/bpf/bpf_devel_QA.rst
+++ b/Documentation/bpf/bpf_devel_QA.rst
@@ -382,6 +382,14 @@ In case of new BPF instructions, once the changes have been accepted
 into the Linux kernel, please implement support into LLVM's BPF back
 end. See LLVM_ section below for further information.

+Q: What is the "BPF_INTERNAL" symbol namespace for?
+---------------------------------------------------
+A: Symbols exported as BPF_INTERNAL can only be used by BPF infrastructure
+like preload kernel modules with light skeleton. Most symbols outside
+of BPF_INTERNAL are not expected to be used by code outside of BPF either.
+Symbols may lack the designation because they predate the namespaces,
+or due to an oversight.
+
 Stable submission
 =================

diff --git a/Documentation/bpf/bpf_iterators.rst b/Documentation/bpf/bpf_iterators.rst
index 07433915aa41..189e3ec1c6c8 100644
--- a/Documentation/bpf/bpf_iterators.rst
+++ b/Documentation/bpf/bpf_iterators.rst
@@ -2,10 +2,117 @@
 BPF Iterators
 =============

+--------
+Overview
+--------
+
+BPF supports two separate entities collectively known as "BPF iterators": BPF
+iterator *program type* and *open-coded* BPF iterators. The former is
+a stand-alone BPF program type which, when attached and activated by user,
+will be called once for each entity (task_struct, cgroup, etc) that is being
+iterated. The latter is a set of BPF-side APIs implementing iterator
+functionality and available across multiple BPF program types. Open-coded
+iterators provide similar functionality to BPF iterator programs, but give
+more flexibility and control to all other BPF program types. BPF iterator
+programs, on the other hand, can be used to implement anonymous or BPF
+FS-mounted special files, whose contents are generated by attached BPF iterator
+program, backed by seq_file functionality. Both are useful depending on
+specific needs.
+
+When adding a new BPF iterator program, it is expected that similar
+functionality will be added as open-coded iterator for maximum flexibility.
+It's also expected that iteration logic and code will be maximally shared and
+reused between two iterator API surfaces.

-----------
-Motivation
-----------
+------------------------
+Open-coded BPF Iterators
+------------------------
+
+Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs
+(constructor, next element fetch, destructor) and iterator-specific type
+describing on-the-stack iterator state, which is guaranteed by the BPF
+verifier to not be tampered with outside of the corresponding
+constructor/destructor/next APIs.
+
+Each kind of open-coded BPF iterator has its own associated
+struct bpf_iter_<type>, where <type> denotes a specific type of iterator.
+bpf_iter_<type> state needs to live on BPF program stack, so make sure it's
+small enough to fit on BPF stack. For performance reasons it's best to avoid
+dynamic memory allocation for iterator state and size the state struct big
+enough to fit everything necessary. But if necessary, dynamic memory
+allocation is a way to bypass BPF stack limitations. Note, state struct size
+is part of iterator's user-visible API, so changing it will break backwards
+compatibility, so be deliberate about designing it.
+
+All kfuncs (constructor, next, destructor) have to be named consistently as
+bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents iterator
+type, and iterator state should be represented as a matching
+`struct bpf_iter_<type>` state type. Also, all iter kfuncs should have
+a pointer to this `struct bpf_iter_<type>` as the very first argument.
+
+Additionally:
+  - Constructor, i.e., `bpf_iter_<type>_new()`, can have arbitrary extra
+    number of arguments. Return type is not enforced either.
+  - Next method, i.e., `bpf_iter_<type>_next()`, has to return a pointer
+    type and should have exactly one argument: `struct bpf_iter_<type> *`
+    (const/volatile/restrict and typedefs are ignored).
+  - Destructor, i.e., `bpf_iter_<type>_destroy()`, should return void and
+    should have exactly one argument, similar to the next method.
+  - `struct bpf_iter_<type>` size is enforced to be positive and
+    a multiple of 8 bytes (to fit stack slots correctly).
+
+Such strictness and consistency allow building generic helpers abstracting
+important, but boilerplate, details to be able to use open-coded iterators
+effectively and ergonomically (see libbpf's bpf_for_each() macro). This is
+enforced at kfunc registration point by the kernel.
+
+Constructor/next/destructor implementation contract is as follows:
+  - constructor, `bpf_iter_<type>_new()`, always initializes iterator state on
+    the stack. If any of the input arguments are invalid, constructor should
+    make sure to still initialize it such that subsequent next() calls will
+    return NULL. I.e., on error, *return error and construct empty iterator*.
+    Constructor kfunc is marked with KF_ITER_NEW flag.
+
+  - next method, `bpf_iter_<type>_next()`, accepts pointer to iterator state
+    and produces an element. Next method should always return a pointer. The
+    contract with the BPF verifier is that next method *guarantees* that it
+    will eventually return NULL when elements are exhausted. Once NULL is
+    returned, subsequent next calls *should keep returning NULL*. Next method
+    is marked with KF_ITER_NEXT (and should also have KF_RET_NULL as
+    NULL-returning kfunc, of course).
+
+  - destructor, `bpf_iter_<type>_destroy()`, is always called once. Even if
+    constructor failed or next returned nothing. Destructor frees up any
+    resources and marks stack space used by `struct bpf_iter_<type>` as usable
+    for something else. Destructor is marked with KF_ITER_DESTROY flag.
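
The constructor/next/destructor contract listed above can be illustrated outside the kernel. What follows is a minimal userspace sketch, not kernel code and not a real kfunc: `struct mock_iter_num`, the `mock_iter_num_*()` functions and `mock_sum_range()` are hypothetical names invented for this example, loosely modeled on a numeric range iterator. It only demonstrates the three obligations: the constructor leaves an empty iterator behind on error, next keeps returning NULL once exhausted, and the destructor runs exactly once on every path.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace mock of an open-coded iterator state. */
struct mock_iter_num {
	int cur;   /* storage for the element handed to the caller */
	int next;  /* next value to produce */
	int end;   /* exclusive upper bound */
};

/* Constructor: always initializes the state. On invalid input it returns
 * an error AND constructs an empty iterator, so next() yields NULL. */
int mock_iter_num_new(struct mock_iter_num *it, int start, int end)
{
	it->cur = 0;
	if (start > end) {
		it->next = 0;
		it->end = 0;
		return -EINVAL;
	}
	it->next = start;
	it->end = end;
	return 0;
}

/* Next: returns a pointer to the current element, or NULL when exhausted.
 * Once NULL is returned, the state no longer changes, so every later call
 * returns NULL as well. */
int *mock_iter_num_next(struct mock_iter_num *it)
{
	if (it->next >= it->end)
		return NULL;
	it->cur = it->next++;
	return &it->cur;
}

/* Destructor: nothing to free in this mock; just reset the state. */
void mock_iter_num_destroy(struct mock_iter_num *it)
{
	it->next = 0;
	it->end = 0;
}

/* Typical usage pattern: new, loop on next until NULL, destroy. */
long mock_sum_range(int start, int end)
{
	struct mock_iter_num it;
	long sum = 0;
	int *v;

	mock_iter_num_new(&it, start, end);  /* on error: empty iterator */
	while ((v = mock_iter_num_next(&it)))
		sum += *v;
	mock_iter_num_destroy(&it);          /* called once on every path */
	return sum;
}
```

Because the constructor still initializes the state on failure, the caller's loop and destroy call need no separate error path: an inverted range such as `mock_sum_range(5, 0)` simply iterates zero times.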
+
+Any open-coded BPF iterator implementation has to implement at least these
+three methods. It is enforced that for any given type of iterator only
+applicable constructor/destructor/next are callable. I.e., verifier ensures
+you can't pass number iterator state into, say, cgroup iterator's next method.
+
+From a 10,000-feet BPF verification point of view, next methods are the points
+of forking a verification state, which are conceptually similar to what
+verifier is doing when validating conditional jumps. Verifier is branching out
+`call bpf_iter_<type>_next` instruction and simulates two outcomes: NULL
+(iteration is done) and non-NULL (new element is returned). NULL is simulated
+first and is supposed to reach exit without looping. After that non-NULL case
+is validated and it either reaches exit (for trivial examples with no real
+loop), or reaches another `call bpf_iter_<type>_next` instruction with the
+state equivalent to already (partially) validated one. State equivalency at
+that point means we technically are going to be looping forever without
+"breaking out" of established "state envelope" (i.e., subsequent
+iterations don't add any new knowledge or constraints to the verifier state,
+so running 1, 2, 10, or a million of them doesn't matter). But taking into
+account the contract stating that iterator next method *has to* return NULL
+eventually, we can conclude that loop body is safe and will eventually
+terminate. Given we validated logic outside of the loop (NULL case), and
+concluded that loop body is safe (though potentially looping many times),
+verifier can claim safety of the overall program logic.
+
+------------------------
+BPF Iterators Motivation
+------------------------

 There are a few existing ways to dump kernel data into user space. The most
 popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
@@ -86,7 +193,7 @@ following steps:
 The following are a few examples of selftest BPF iterator programs:

 * `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
-* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
+* `bpf_iter_task_vmas.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c>`_
 * `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_

 Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
@@ -323,8 +430,8 @@ Now, in the userspace program, pass the pointer of struct to the

 ::

-  link = bpf_program__attach_iter(prog, &opts); iter_fd =
-  bpf_iter_create(bpf_link__fd(link));
+  link = bpf_program__attach_iter(prog, &opts);
+  iter_fd = bpf_iter_create(bpf_link__fd(link));

 If both *tid* and *pid* are zero, an iterator created from this struct
 ``bpf_iter_attach_opts`` will include every opened file of every task in the
diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst
index 2478cef758f8..3b60583f5db2 100644
--- a/Documentation/bpf/btf.rst
+++ b/Documentation/bpf/btf.rst
@@ -102,7 +102,8 @@ Each type contains the following common data::
          * bits 24-28: kind (e.g. int, ptr, array...etc)
          * bits 29-30: unused
          * bit     31: kind_flag, currently used by
-         *             struct, union, fwd, enum and enum64.
+         *             struct, union, enum, fwd, enum64,
+         *             decl_tag and type_tag
          */
         __u32 info;
         /* "size" is used by INT, ENUM, STRUCT, UNION and ENUM64.
@@ -478,7 +479,7 @@ No additional type data follow ``btf_type``.

 ``struct btf_type`` encoding requirement:
  * ``name_off``: offset to a non-empty string
- * ``info.kind_flag``: 0
+ * ``info.kind_flag``: 0 or 1
  * ``info.kind``: BTF_KIND_DECL_TAG
  * ``info.vlen``: 0
  * ``type``: ``struct``, ``union``, ``func``, ``var`` or ``typedef``
@@ -489,7 +490,6 @@ No additional type data follow ``btf_type``.
         __u32   component_idx;
     };

-The ``name_off`` encodes btf_decl_tag attribute string.
 The ``type`` should be ``struct``, ``union``, ``func``, ``var`` or ``typedef``.
 For ``var`` or ``typedef`` type, ``btf_decl_tag.component_idx`` must be ``-1``.
 For the other three types, if the btf_decl_tag attribute is
@@ -499,12 +499,21 @@ the attribute is applied to a ``struct``/``union`` member or
 a ``func`` argument, and ``btf_decl_tag.component_idx`` should be a
 valid index (starting from 0) pointing to a member or an argument.

+If ``info.kind_flag`` is 0, then this is a normal decl tag, and the
+``name_off`` encodes btf_decl_tag attribute string.
+
+If ``info.kind_flag`` is 1, then the decl tag represents an arbitrary
+__attribute__. In this case, ``name_off`` encodes a string
+representing the attribute-list of the attribute specifier. For
+example, for an ``__attribute__((aligned(4)))`` the string's contents
+are ``aligned(4)``.
+
 2.2.18 BTF_KIND_TYPE_TAG
 ~~~~~~~~~~~~~~~~~~~~~~~~

 ``struct btf_type`` encoding requirement:
  * ``name_off``: offset to a non-empty string
- * ``info.kind_flag``: 0
+ * ``info.kind_flag``: 0 or 1
  * ``info.kind``: BTF_KIND_TYPE_TAG
  * ``info.vlen``: 0
  * ``type``: the type with ``btf_type_tag`` attribute
@@ -522,6 +531,14 @@ type_tag, then zero or more const/volatile/restrict/typedef
 and finally the base type. The base type is
 one of int, ptr, array, struct, union, enum, func_proto and float types.

+Similarly to decl tags, if the ``info.kind_flag`` is 0, then this is a
+normal type tag, and the ``name_off`` encodes btf_type_tag attribute
+string.
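
The ``info`` bit layout quoted earlier in this diff (bits 24-28 for the kind, bit 31 for ``kind_flag``) is enough to tell the two tag flavors apart. The following is a rough sketch of that decoding, assuming the numeric kind values 17 (decl tag) and 18 (type tag) that match the 2.2.17 and 2.2.18 section numbers. The `SKETCH_*` constants and `sketch_*()` helpers are hypothetical names for this example only, not the kernel's or libbpf's API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed kind numbers, mirroring BTF_KIND_DECL_TAG / BTF_KIND_TYPE_TAG. */
#define SKETCH_KIND_DECL_TAG 17u
#define SKETCH_KIND_TYPE_TAG 18u

/* bits 24-28 of "info": kind */
static inline uint32_t sketch_btf_kind(uint32_t info)
{
	return (info >> 24) & 0x1f;
}

/* bit 31 of "info": kind_flag */
static inline bool sketch_btf_kind_flag(uint32_t info)
{
	return (info >> 31) != 0;
}

/* True when "info" describes a decl tag or type tag whose name_off string
 * holds an arbitrary __attribute__ attribute-list (kind_flag == 1).
 * False for normal tags (kind_flag == 0) and for every other kind. */
bool sketch_tag_is_attribute(uint32_t info)
{
	uint32_t kind = sketch_btf_kind(info);

	if (kind != SKETCH_KIND_DECL_TAG && kind != SKETCH_KIND_TYPE_TAG)
		return false;
	return sketch_btf_kind_flag(info);
}
```

A consumer would call `sketch_tag_is_attribute(t->info)` before deciding whether to interpret the string at ``name_off`` as a btf_decl_tag/btf_type_tag value or as an attribute-list such as ``aligned(4)``.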
+
+If ``info.kind_flag`` is 1, then the type tag represents an arbitrary
+__attribute__, and the ``name_off`` encodes a string representing the
+attribute-list of the attribute specifier.
+
 2.2.19 BTF_KIND_ENUM64
 ~~~~~~~~~~~~~~~~~~~~~~

diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
index a8f5782bd833..ae468b781d31 100644
--- a/Documentation/bpf/kfuncs.rst
+++ b/Documentation/bpf/kfuncs.rst
@@ -160,6 +160,23 @@ Or::
                 ...
         }

+2.2.6 __prog Annotation
+---------------------------
+This annotation is used to indicate that the argument needs to be fixed up to
+the bpf_prog_aux of the caller BPF program. Any value passed into this argument
+is ignored, and rewritten by the verifier.
+
+An example is given below::
+
+        __bpf_kfunc int bpf_wq_set_callback_impl(struct bpf_wq *wq,
+                                                 int (callback_fn)(void *map, int *key, void *value),
+                                                 unsigned int flags,
+                                                 void *aux__prog)
+         {
+                struct bpf_prog_aux *aux = aux__prog;
+                ...
+         }
+
 .. _BPF_kfunc_nodef:

 2.3 Using an existing kernel function
diff --git a/Documentation/bpf/standardization/instruction-set.rst b/Documentation/bpf/standardization/instruction-set.rst
index ab820d565052..fbe975585236 100644
--- a/Documentation/bpf/standardization/instruction-set.rst
+++ b/Documentation/bpf/standardization/instruction-set.rst
@@ -324,34 +324,42 @@ register.

 .. table:: Arithmetic instructions

-  =====  =====  =======  ==========================================================
+  =====  =====  =======  ===================================================================================
   name   code   offset   description
-  =====  =====  =======  ==========================================================
+  =====  =====  =======  ===================================================================================
   ADD    0x0    0        dst += src
   SUB    0x1    0        dst -= src
   MUL    0x2    0        dst \*= src
   DIV    0x3    0        dst = (src != 0) ? (dst / src) : 0
-  SDIV   0x3    1        dst = (src != 0) ? (dst s/ src) : 0
+  SDIV   0x3    1        dst = (src == 0) ? 0 : ((src == -1 && dst == LLONG_MIN) ? LLONG_MIN : (dst s/ src))
   OR     0x4    0        dst \|= src
   AND    0x5    0        dst &= src
   LSH    0x6    0        dst <<= (src & mask)
   RSH    0x7    0        dst >>= (src & mask)
   NEG    0x8    0        dst = -dst
   MOD    0x9    0        dst = (src != 0) ? (dst % src) : dst
-  SMOD   0x9    1        dst = (src != 0) ? (dst s% src) : dst
+  SMOD   0x9    1        dst = (src == 0) ? dst : ((src == -1 && dst == LLONG_MIN) ? 0 : (dst s% src))
   XOR    0xa    0        dst ^= src
   MOV    0xb    0        dst = src
   MOVSX  0xb    8/16/32  dst = (s8,s16,s32)src
   ARSH   0xc    0        :term:`sign extending<Sign Extend>` dst >>= (src & mask)
   END    0xd    0        byte swap operations (see `Byte swap instructions`_ below)
-  =====  =====  =======  ==========================================================
+  =====  =====  =======  ===================================================================================

 Underflow and overflow are allowed during arithmetic operations, meaning
 the 64-bit or 32-bit value will wrap. If BPF program execution would
 result in division by zero, the destination register is instead set to zero.
+Otherwise, for ``ALU64``, if execution would result in ``LLONG_MIN``
+divided by -1, the destination register is instead set to ``LLONG_MIN``. For
+``ALU``, if execution would result in ``INT_MIN`` divided by -1, the
+destination register is instead set to ``INT_MIN``.
+
 If execution would result in modulo by zero, for ``ALU64`` the value of
 the destination register is unchanged whereas for ``ALU`` the upper
-32 bits of the destination register are zeroed.
+32 bits of the destination register are zeroed. Otherwise, for ``ALU64``,
+if execution would result in ``LLONG_MIN`` modulo -1, the destination
+register is instead set to 0. For ``ALU``, if execution would result in
+``INT_MIN`` modulo -1, the destination register is instead set to 0.

 ``{ADD, X, ALU}``, where 'code' = ``ADD``, 'source' = ``X``, and 'class' = ``ALU``, means::