summaryrefslogtreecommitdiffstats
path: root/net/tls
Commit message (Collapse)AuthorAgeFilesLines
* net/tls: Remove the context from the list in tls_device_downMaxim Mikityanskiy2022-07-241-1/+6
| | | | | | | | | | | | | | | | | | tls_device_down takes a reference on all contexts it's going to move to the degraded state (software fallback). If sk_destruct runs afterwards, it can reduce the reference counter back to 1 and return early without destroying the context. Then tls_device_down will release the reference it took and call tls_device_free_ctx. However, the context will still stay in tls_device_down_list forever. The list will contain an item, memory for which is released, making a memory corruption possible. Fix the above bug by properly removing the context from all lists before any call to tls_device_free_ctx. Fixes: 3740651bf7e2 ("tls: Fix context leak on tls_device_down") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/tls: Fix race in TLS device down flowTariq Toukan2022-07-181-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | Socket destruction flow and tls_device_down function sync against each other using tls_device_lock and the context refcount, to guarantee the device resources are freed via tls_dev_del() by the end of tls_device_down. In the following unfortunate flow, this won't happen: - refcount is decreased to zero in tls_device_sk_destruct. - tls_device_down starts, skips the context as refcount is zero, going all the way until it flushes the gc work, and returns without freeing the device resources. - only then, tls_device_queue_ctx_destruction is called, queues the gc work and frees the context's device resources. Solve it by decreasing the refcount in the socket's destruction flow under the tls_device_lock, for perfect synchronization. This does not slow down the common likely destructor flow, in which both the refcount is decreased and the spinlock is acquired, anyway. Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure") Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/tls: Check for errors in tls_device_initTariq Toukan2022-07-142-3/+8
| | | | | | | | | | | Add missing error checks in tls_device_init. Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure") Reported-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220714070754.1428-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Revert "tls: rx: move counting TlsDecryptErrors for sync"Gal Pressman2022-07-061-4/+4
| | | | | | | | | | | | | | | | | | | | | | | This reverts commit 284b4d93daee56dff3e10029ddf2e03227f50dbf. When using TLS device offload and coming from tls_device_reencrypt() flow, -EBADMSG error in tls_do_decryption() should not be counted towards the TLSTlsDecryptError counter. Move the counter increase back to the decrypt_internal() call site in decrypt_skb_update(). This also fixes an issue where: if (n_sgin < 1) return -EBADMSG; Errors in decrypt_internal() were not counted after the cited patch. Fixes: 284b4d93daee ("tls: rx: move counting TlsDecryptErrors for sync") Cc: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* sock: redo the psock vs ULP protection checkJakub Kicinski2022-06-231-0/+2
| | | | | | | | | | | | | | | | | | | | | | Commit 8a59f9d1e3d4 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()") has moved the inet_csk_has_ulp(sk) check from sk_psock_init() to the new tcp_bpf_update_proto() function. I'm guessing that this was done to allow creating psocks for non-inet sockets. Unfortunately the destruction path for psock includes the ULP unwind, so we need to fail the sk_psock_init() itself. Otherwise if ULP is already present we'll notice that later, and call tcp_update_ulp() with the sk_proto of the ULP itself, which will most likely result in the ULP looping its callbacks. Fixes: 8a59f9d1e3d4 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Tested-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20220620191353.1184629-2-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* Revert "net/tls: fix tls_sk_proto_close executed repeatedly"Jakub Kicinski2022-06-231-3/+0
| | | | | | | | | | | | | | This reverts commit 69135c572d1f84261a6de2a1268513a7e71753e2. This commit was just papering over the issue, ULP should not get ->update() called with its own sk_prot. Each ULP would need to add this check. Fixes: 69135c572d1f ("net/tls: fix tls_sk_proto_close executed repeatedly") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20220620191353.1184629-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* net/tls: fix tls_sk_proto_close executed repeatedlyZiyang Xuan2022-06-201-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | After setting the sock ktls, update ctx->sk_proto to sock->sk_prot by tls_update(), so now ctx->sk_proto->close is tls_sk_proto_close(). When close the sock, tls_sk_proto_close() is called for sock->sk_prot->close is tls_sk_proto_close(). But ctx->sk_proto->close() will be executed later in tls_sk_proto_close(). Thus tls_sk_proto_close() executed repeatedly occurred. That will trigger the following bug. ================================================================= KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017] RIP: 0010:tls_sk_proto_close+0xd8/0xaf0 net/tls/tls_main.c:306 Call Trace: <TASK> tls_sk_proto_close+0x356/0xaf0 net/tls/tls_main.c:329 inet_release+0x12e/0x280 net/ipv4/af_inet.c:428 __sock_release+0xcd/0x280 net/socket.c:650 sock_close+0x18/0x20 net/socket.c:1365 Updating a proto which is same with sock->sk_prot is incorrect. Add proto and sock->sk_prot equality check at the head of tls_update() to fix it. Fixes: 95fa145479fb ("bpf: sockmap/tls, close can race with map free") Reported-by: syzbot+29c3c12f3214b85ad081@syzkaller.appspotmail.com Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tls: Rename TLS_INFO_ZC_SENDFILE to TLS_INFO_ZC_TXMaxim Mikityanskiy2022-06-091-4/+4
| | | | | | | | | | | | | | | | | To embrace possible future optimizations of TLS, rename zerocopy sendfile definitions to more generic ones: * setsockopt: TLS_TX_ZEROCOPY_SENDFILE- > TLS_TX_ZEROCOPY_RO * sock_diag: TLS_INFO_ZC_SENDFILE -> TLS_INFO_ZC_RO_TX RO stands for readonly and emphasizes that the application shouldn't modify the data being transmitted with zerocopy to avoid potential disconnection. Fixes: c1318b39c7d3 ("tls: Add opt-in zerocopy mode of sendfile()") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Link: https://lore.kernel.org/r/20220608153425.3151146-1-maximmi@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net: tls: fix messing up lists when bpf enabledJakub Kicinski2022-05-191-1/+3
| | | | | | | | | | | Artem points out that skb may try to take over the skb and queue it to its own list. Unlink the skb before calling out. Fixes: b1a2c1786330 ("tls: rx: clear ctx->recv_pkt earlier") Reported-by: Artem Savkov <asavkov@redhat.com> Tested-by: Artem Savkov <asavkov@redhat.com> Link: https://lore.kernel.org/r/20220518205644.2059468-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* tls: Add opt-in zerocopy mode of sendfile()Boris Pismenny2022-05-192-13/+95
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TLS device offload copies sendfile data to a bounce buffer before transmitting. It allows to maintain the valid MAC on TLS records when the file contents change and a part of TLS record has to be retransmitted on TCP level. In many common use cases (like serving static files over HTTPS) the file contents are not changed on the fly. In many use cases breaking the connection is totally acceptable if the file is changed during transmission, because it would be received corrupted in any case. This commit allows to optimize performance for such use cases to providing a new optional mode of TLS sendfile(), in which the extra copy is skipped. Removing this copy improves performance significantly, as TLS and TCP sendfile perform the same operations, and the only overhead is TLS header/trailer insertion. The new mode can only be enabled with the new socket option named TLS_TX_ZEROCOPY_SENDFILE on per-socket basis. It preserves backwards compatibility with existing applications that rely on the copying behavior. The new mode is safe, meaning that unsolicited modifications of the file being sent can't break integrity of the kernel. The worst thing that can happen is sending a corrupted TLS record, which is in any case not forbidden when using regular TCP sockets. Sockets other than TLS device offload are not affected by the new socket option. The actual status of zerocopy sendfile can be queried with sock_diag. Performance numbers in a single-core test with 24 HTTPS streams on nginx, under 100% CPU load: * non-zerocopy: 33.6 Gbit/s * zerocopy: 79.92 Gbit/s CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz Signed-off-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20220518092731.1243494-1-maximmi@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2022-05-121-0/+3
|\ | | | | | | | | | | | | | | | | | | | | No conflicts. Build issue in drivers/net/ethernet/sfc/ptp.c 54fccfdd7c66 ("sfc: efx_default_channel_type APIs can be static") 49e6123c65da ("net: sfc: fix memory leak due to ptp channel") https://lore.kernel.org/all/20220510130556.52598fe2@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * tls: Fix context leak on tls_device_downMaxim Mikityanskiy2022-05-121-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit cited below claims to fix a use-after-free condition after tls_device_down. Apparently, the description wasn't fully accurate. The context stayed alive, but ctx->netdev became NULL, and the offload was torn down without a proper fallback, so a bug was present, but a different kind of bug. Due to misunderstanding of the issue, the original patch dropped the refcount_dec_and_test line for the context to avoid the alleged premature deallocation. That line has to be restored, because it matches the refcount_inc_not_zero from the same function, otherwise the contexts that survived tls_device_down are leaked. This patch fixes the described issue by restoring refcount_dec_and_test. After this change, there is no leak anymore, and the fallback to software kTLS still works. Fixes: c55dcdd435aa ("net/tls: Fix use-after-free after the TLS device goes down and up") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220512091830.678684-1-maximmi@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2022-04-281-5/+7
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | include/linux/netdevice.h net/core/dev.c 6510ea973d8d ("net: Use this_cpu_inc() to increment net->core_stats") 794c24e9921f ("net-core: rx_otherhost_dropped to core_stats") https://lore.kernel.org/all/20220428111903.5f4304e0@canb.auug.org.au/ drivers/net/wan/cosa.c d48fea8401cf ("net: cosa: fix error check return value of register_chrdev()") 89fbca3307d4 ("net: wan: remove support for COSA and SRP synchronous serial boards") https://lore.kernel.org/all/20220428112130.1f689e5e@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * tls: Skip tls_append_frag on zero copy sizeMaxim Mikityanskiy2022-04-271-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Calling tls_append_frag when max_open_record_len == record->len might add an empty fragment to the TLS record if the call happens to be on the page boundary. Normally tls_append_frag coalesces the zero-sized fragment to the previous one, but not if it's on page boundary. If a resync happens then, the mlx5 driver posts dump WQEs in tx_post_resync_dump, and the empty fragment may become a data segment with byte_count == 0, which will confuse the NIC and lead to a CQE error. This commit fixes the described issue by skipping tls_append_frag on zero size to avoid adding empty fragments. The fix is not in the driver, because an empty fragment is hardly the desired behavior. Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220426154949.159055-1-maximmi@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: tls: fix async vs NIC crypto offloadJakub Kicinski2022-04-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When NIC takes care of crypto (or the record has already been decrypted) we forget to update darg->async. ->async is supposed to mean whether record is async capable on input and whether record has been queued for async crypto on output. Reported-by: Gal Pressman <gal@nvidia.com> Fixes: 3547a1f9d988 ("tls: rx: use async as an in-out argument") Tested-by: Gal Pressman <gal@nvidia.com> Link: https://lore.kernel.org/r/20220425233309.344858-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: generalize skb freeing deferral to per-cpu listsEric Dumazet2022-04-261-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket lock is released") helped bulk TCP flows to move the cost of skbs frees outside of critical section where socket lock was held. But for RPC traffic, or hosts with RFS enabled, the solution is far from being ideal. For RPC traffic, recvmsg() has to return to user space right after skb payload has been consumed, meaning that BH handler has no chance to pick the skb before recvmsg() thread. This issue is more visible with BIG TCP, as more RPC fit one skb. For RFS, even if BH handler picks the skbs, they are still picked from the cpu on which user thread is running. Ideally, it is better to free the skbs (and associated page frags) on the cpu that originally allocated them. This patch removes the per socket anchor (sk->defer_list) and instead uses a per-cpu list, which will hold more skbs per round. This new per-cpu list is drained at the end of net_action_rx(), after incoming packets have been processed, to lower latencies. In normal conditions, skbs are added to the per-cpu list with no further action. In the (unlikely) cases where the cpu does not run net_action_rx() handler fast enough, we use an IPI to raise NET_RX_SOFTIRQ on the remote cpu. Also, we do not bother draining the per-cpu list from dev_cpu_dead() This is because skbs in this list have no requirement on how fast they should be freed. Note that we can add in the future a small per-cpu cache if we see any contention on sd->defer_lock. Tested on a pair of hosts with 100Gbit NIC, RFS enabled, and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around page recycling strategy used by NIC driver (its page pool capacity being too small compared to number of skbs/pages held in sockets receive queues) Note that this tuning was only done to demonstrate worse conditions for skb freeing for this particular test. These conditions can happen in more general production workload. 10 runs of one TCP_STREAM flow Before: Average throughput: 49685 Mbit. Kernel profiles on cpu running user thread recvmsg() show high cost for skb freeing related functions (*) 57.81% [kernel] [k] copy_user_enhanced_fast_string (*) 12.87% [kernel] [k] skb_release_data (*) 4.25% [kernel] [k] __free_one_page (*) 3.57% [kernel] [k] __list_del_entry_valid 1.85% [kernel] [k] __netif_receive_skb_core 1.60% [kernel] [k] __skb_datagram_iter (*) 1.59% [kernel] [k] free_unref_page_commit (*) 1.16% [kernel] [k] __slab_free 1.16% [kernel] [k] _copy_to_iter (*) 1.01% [kernel] [k] kfree (*) 0.88% [kernel] [k] free_unref_page 0.57% [kernel] [k] ip6_rcv_core 0.55% [kernel] [k] ip6t_do_table 0.54% [kernel] [k] flush_smp_call_function_queue (*) 0.54% [kernel] [k] free_pcppages_bulk 0.51% [kernel] [k] llist_reverse_order 0.38% [kernel] [k] process_backlog (*) 0.38% [kernel] [k] free_pcp_prepare 0.37% [kernel] [k] tcp_recvmsg_locked (*) 0.37% [kernel] [k] __list_add_valid 0.34% [kernel] [k] sock_rfree 0.34% [kernel] [k] _raw_spin_lock_irq (*) 0.33% [kernel] [k] __page_cache_release 0.33% [kernel] [k] tcp_v6_rcv (*) 0.33% [kernel] [k] __put_page (*) 0.29% [kernel] [k] __mod_zone_page_state 0.27% [kernel] [k] _raw_spin_lock After patch: Average throughput: 73076 Mbit. Kernel profiles on cpu running user thread recvmsg() looks better: 81.35% [kernel] [k] copy_user_enhanced_fast_string 1.95% [kernel] [k] _copy_to_iter 1.95% [kernel] [k] __skb_datagram_iter 1.27% [kernel] [k] __netif_receive_skb_core 1.03% [kernel] [k] ip6t_do_table 0.60% [kernel] [k] sock_rfree 0.50% [kernel] [k] tcp_v6_rcv 0.47% [kernel] [k] ip6_rcv_core 0.45% [kernel] [k] read_tsc 0.44% [kernel] [k] _raw_spin_lock_irqsave 0.37% [kernel] [k] _raw_spin_lock 0.37% [kernel] [k] native_irq_return_iret 0.33% [kernel] [k] __inet6_lookup_established 0.31% [kernel] [k] ip6_protocol_deliver_rcu 0.29% [kernel] [k] tcp_rcv_established 0.29% [kernel] [k] llist_reverse_order v2: kdoc issue (kernel bots) do not defer if (alloc_cpu == smp_processor_id()) (Paolo) replace the sk_buff_head with a single-linked list (Jakub) add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | tls: rx: only copy IV from the packet for TLS 1.2Jakub Kicinski2022-04-131-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | TLS 1.3 and ChaChaPoly don't carry IV in the packet. The code before this change would copy out iv_size worth of whatever followed the TLS header in the packet and then for TLS 1.3 | ChaCha overwrite that with the sequence number. Waste of cycles especially with TLS 1.2 being close to dead and TLS 1.3 being the common case. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: use MAX_IV_SIZE for allocationsJakub Kicinski2022-04-131-1/+1
| | | | | | | | | | | | | | | | IVs are 8 or 16 bytes, no point reading out the exact value for quantities this small. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: use async as an in-out argumentJakub Kicinski2022-04-131-15/+16
| | | | | | | | | | | | | | | | | | | | Propagating EINPROGRESS thru multiple layers of functions is error prone. Use darg->async as an in/out argument, like we use darg->zc today. On input it tells the code if async is allowed, on output if it took place. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: return the already-copied data on crypto errorJakub Kicinski2022-04-131-6/+10
| | | | | | | | | | | | | | | | | | | | async crypto handler will report the socket error no need to report it again. We can, however, let the data we already copied be reported to user space but we need to make sure the error will be reported next time around. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: treat process_rx_list() errors as transientJakub Kicinski2022-04-131-12/+8
| | | | | | | | | | | | | | | | | | | | process_rx_list() only fails if it can't copy data to user space. There is no point recording the error onto sk->sk_err or giving up on the data which was read partially. Treat the return value like a normal socket partial read. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: assume crypto always calls our callbackJakub Kicinski2022-04-131-3/+0
| | | | | | | | | | | | | | | | | | If crypto didn't always invoke our callback for async we'd not be clearing skb->sk and would crash in the skb core when freeing it. This if must be dead code. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: don't handle TLS 1.3 in the async crypto callbackJakub Kicinski2022-04-131-10/+5
| | | | | | | | | | | | | | | | | | Async crypto never worked with TLS 1.3 and was explicitly disabled in commit 8497ded2d16c ("net/tls: Disable async decrytion for tls1.3"). There's no need for us to handle TLS 1.3 padding in the async cb. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: move counting TlsDecryptErrors for syncJakub Kicinski2022-04-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Move counting TlsDecryptErrors to tls_do_decryption() where differences between sync and async crypto are reconciled. No functional changes, this code just always gave me a pause. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: reuse leave_on_list label for psockJakub Kicinski2022-04-131-8/+4
| | | | | | | | | | | | | | The code is identical, we can save a few LoC. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: consistently use unlocked accessors for rx_listJakub Kicinski2022-04-131-5/+5
| | | | | | | | | | | | | | | | rx_list is protected by the socket lock, no need to take the built-in spin lock on accesses. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: remove noblock parameter from recvmsg() entitiesOliver Hartkopp2022-04-121-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The internal recvmsg() functions have two parameters 'flags' and 'noblock' that were merged inside skb_recv_datagram(). As a follow up patch to commit f4b41f062c42 ("net: remove noblock parameter from skb_recv_datagram()") this patch removes the separate 'noblock' parameter for recvmsg(). Analogue to the referenced patch for skb_recv_datagram() the 'flags' and 'noblock' parameters are unnecessarily split up with e.g. err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); or in err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg, sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); instead of simply using only flags all the time and check for MSG_DONTWAIT where needed (to preserve for the formerly separated no(n)block condition). Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* | tls: rx: jump out for cases which need to leave skb on listJakub Kicinski2022-04-101-21/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The current invese logic is harder to follow (and adds extra tests to the fast path). We have to enumerate all cases which need to keep the skb before consuming it. It's simpler to jump out of the full record flow as we detect those cases. This makes it clear that partial consumption and peek can only reach end of the function thru the !zc case so move the code up there. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: clear ctx->recv_pkt earlierJakub Kicinski2022-04-101-9/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | Whatever we do in the loop the skb should not remain on as ctx->recv_pkt afterwards. We can clear that pointer and restart strparser earlier. This adds overhead of extra linking and unlinking to rx_list but that's not large (upcoming change will switch to unlocked skb list operations). Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: inline consuming the skb at the end of the loopJakub Kicinski2022-04-101-24/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tls_sw_advance_skb() always consumes the skb at the end of the loop. To fall here the following must be true: !async && !is_peek && !retain_skb retain_skb => !zc && rxm->full_len > len # but non-full record implies !zc, so above can be simplified as retain_skb => rxm->full_len > len !async && !is_peek && !(rxm->full_len > len) !async && !is_peek && rxm->full_len <= len tls_sw_advance_skb() returns false if len < rxm->full_len which can't be true given conditions above. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: pull most of zc check out of the loopJakub Kicinski2022-04-101-4/+5
| | | | | | | | | | | | | | | | | | Most of the conditions deciding if zero-copy can be used do not change throughout the iterations, so pre-calculate them. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: don't track the async countJakub Kicinski2022-04-101-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We track both if the last record was handled by async crypto and how many records were async. This is not necessary. We implicitly assume once crypto goes async it will stay that way, otherwise we'd reorder records. So just track if we're in async mode, the exact number of records is not necessary. This change also forces us into "async" mode more consistently in case crypto ever decided to interleave async and sync. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: don't handle async in tls_sw_advance_skb()Jakub Kicinski2022-04-101-13/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | tls_sw_advance_skb() caters to the async case when skb argument is NULL. In that case it simply unpauses the strparser. These are surprising semantics to a person reading the code, and result in higher LoC, so inline the __strp_unpause and only call tls_sw_advance_skb() when we actually move past an skb. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: factor out writing ContentType to cmsgJakub Kicinski2022-04-101-55/+36
| | | | | | | | | | | | | | | | | | | | | | cmsg can be filled in during rx_list processing or normal receive. Consolidate the code. We don't need to keep the boolean to track if the cmsg was created. 0 is an invalid content type. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: simplify async waitJakub Kicinski2022-04-101-12/+2
| | | | | | | | | | | | | | | | | | Since we are protected from async completions by decrypt_compl_lock we can drop the async_notify and reinit the completion before we start waiting. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: wrap decryption arguments in a structureJakub Kicinski2022-04-101-22/+27
| | | | | | | | | | | | | | | | | | | | We pass zc as a pointer to bool a few functions down as an in/out argument. This is error prone since C will happily evalue a pointer as a boolean (IOW forgetting *zc and writing zc leads to loss of developer time..). Wrap the arguments into a structure. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: don't report text length from the bowels of decryptJakub Kicinski2022-04-101-19/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We plumb pointer to chunk all the way to the decryption method. It's set to the length of the text when decrypt_skb_update() returns. I think the code is written this way because original TLS implementation passed &chunk to zerocopy_from_iter() and this was carried forward as the code gotten more complex, without any refactoring. The fix for peek() introduced a new variable - to_decrypt which for all practical purposes is what chunk is going to get set to. Spare ourselves the pointer passing, use to_decrypt. Use this opportunity to clean things up a little further. Note that chunk / to_decrypt was mostly needed for the async path, since the sync path would access rxm->full_len (decryption transforms full_len from record size to text size). Use the right source of truth more explicitly. We have three cases: - async - it's TLS 1.2 only, so chunk == to_decrypt, but we need the min() because to_decrypt is a whole record and we don't want to underflow len. Note that we can't handle partial record by falling back to sync as it would introduce reordering against records in flight. - zc - again, TLS 1.2 only for now, so chunk == to_decrypt, we don't do zc if len < to_decrypt, no need to check again. - normal - it already handles chunk > len, we can factor out the assignment to rxm->full_len and share it with zc. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: drop unnecessary arguments from tls_setup_from_iter()Jakub Kicinski2022-04-101-8/+6
| | | | | | | | | | | | | | | | | | | | sk is unused, remove it to make it clear the function doesn't poke at the socket. size_used is always 0 on input and @length on success. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: hw: rx: use return value of tls_device_decrypted() to carry statusJakub Kicinski2022-04-082-8/+4
| | | | | | | | | | | | | | | | | | Instead of tls_device poking into internals of the message return 1 from tls_device_decrypted() if the device handled the decryption. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: refactor decrypt_skb_update()Jakub Kicinski2022-04-081-33/+33
| | | | | | | | | | | | | | | | Use early return and a jump label to remove two indentation levels. No functional changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: don't issue wake ups when data is decryptedJakub Kicinski2022-04-081-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | We inform the applications that data is available when the record is received. Decryption happens inline inside recvmsg or splice call. Generating another wakeup inside the decryption handler seems pointless as someone must be actively reading the socket if we are executing this code. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: replace 'back' with 'offset'Jakub Kicinski2022-04-081-5/+4
| | | | | | | | | | | | | | | | | | | | The padding length TLS 1.3 logic is searching for content_type from the end of text. IMHO the code is easier to parse if we calculate offset and decrement it rather than try to maintain positive offset from the end of the record called "back". Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: use a define for tag lengthJakub Kicinski2022-04-081-2/+2
| | | | | | | | | | | | | | | | | | TLS 1.3 has to strip padding, and it starts out 16 bytes from the end of the record. Make it clear this is because of the auth tag. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: init decrypted status in tls_read_size()Jakub Kicinski2022-04-081-4/+1
| | | | | | | | | | | | | | | | We set the record type in tls_read_size(), can as well init the tlm->decrypted field there. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: don't store the decryption status in socket contextJakub Kicinski2022-04-082-5/+8
| | | | | | | | | | | | | | | | Similar justification to previous change, the information about decryption status belongs in the skb. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: don't store the record type in socket contextJakub Kicinski2022-04-081-21/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Original TLS implementation was handling one record at a time. It stashed the type of the record inside tls context (per socket structure) for convenience. When async crypto support was added [1] the author had to use skb->cb to store the type per-message. The use of skb->cb overlaps with strparser, however, so a hybrid approach was taken where type is stored in context while parsing (since we parse a message at a time) but once parsed its copied to skb->cb. Recently a workaround for sockmaps [2] exposed the previously private struct _strp_msg and started a trend of adding user fields directly in strparser's header. This is cleaner than storing information about an skb in the context. This change is not strictly necessary, but IMHO the ownership of the context field is confusing. Information naturally belongs to the skb. [1] commit 94524d8fc965 ("net/tls: Add support for async decryption of tls records") [2] commit b2c4618162ec ("bpf, sockmap: sk_skb data_end access incorrect when src_reg = dst_reg") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: drop pointless else after gotoJakub Kicinski2022-04-081-2/+1
| | | | | | | | | | | | | | | | Pointless else branch after goto makes the code harder to refactor down the line. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tls: rx: jump to a more appropriate labelJakub Kicinski2022-04-081-3/+4
|/ | | | | | | | | | | | 'recv_end:' checks num_async and decrypted, and is then followed by the 'end' label. Since we know that decrypted and num_async are 0 at the start we can jump to 'end'. Move the init of decrypted and num_async to let the compiler catch if I'm wrong. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/tls: fix slab-out-of-bounds bug in decrypt_internalZiyang Xuan2022-04-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The memory size of tls_ctx->rx.iv for AES128-CCM is 12 setting in tls_set_sw_offload(). The return value of crypto_aead_ivsize() for "ccm(aes)" is 16. So memcpy() require 16 bytes from 12 bytes memory space will trigger slab-out-of-bounds bug as following: ================================================================== BUG: KASAN: slab-out-of-bounds in decrypt_internal+0x385/0xc40 [tls] Read of size 16 at addr ffff888114e84e60 by task tls/10911 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_report.cold+0x5e/0x5db ? decrypt_internal+0x385/0xc40 [tls] kasan_report+0xab/0x120 ? decrypt_internal+0x385/0xc40 [tls] kasan_check_range+0xf9/0x1e0 memcpy+0x20/0x60 decrypt_internal+0x385/0xc40 [tls] ? tls_get_rec+0x2e0/0x2e0 [tls] ? process_rx_list+0x1a5/0x420 [tls] ? tls_setup_from_iter.constprop.0+0x2e0/0x2e0 [tls] decrypt_skb_update+0x9d/0x400 [tls] tls_sw_recvmsg+0x3c8/0xb50 [tls] Allocated by task 10911: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x81/0xa0 tls_set_sw_offload+0x2eb/0xa20 [tls] tls_setsockopt+0x68c/0x700 [tls] __sys_setsockopt+0xfe/0x1b0 Replace the crypto_aead_ivsize() with prot->iv_size + prot->salt_size when memcpy() iv value in TLS_1_3_VERSION scenario. Fixes: f295b3ae9f59 ("net/tls: Add support of AES128-CCM based ciphers") Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/tls: optimize judgement processes in tls_set_device_offload()Ziyang Xuan2022-03-211-31/+31
| | | | | | | | | | | | | | | | It is known that priority setting HW offload when set tls TX/RX offload by setsockopt(). Check netdevice whether support NETIF_F_HW_TLS_TX or not at the later stages in the whole tls_set_device_offload() process, some memory allocations have been done before that. We must release those memory and return error if we judge the netdevice not support NETIF_F_HW_TLS_TX. It is redundant. Move NETIF_F_HW_TLS_TX judgement forward, and move start_marker_record and offload_ctx memory allocation back slightly. Thus, we can get simpler exception handling process. Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>