summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-03-22drivers: net: xgene-v2: Add MDIO supportIyappan Subramanian
Added phy management support by using phy abstraction layer APIs. Signed-off-by: Iyappan Subramanian <isubramanian@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22Merge branch 'bpf-map-in-map'David S. Miller
Martin KaFai Lau says: ==================== bpf: Add map-in-map support This patchset adds map-in-map support (map->map). One use case is the (vips -> webservers) in the L4 load balancer so that different vips can be backed by different set of webservers. Please refer to the individual commit log for details. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22bpf: Add tests for map-in-mapMartin KaFai Lau
Test cases for array of maps and hash of maps. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22bpf: Add hash of maps supportMartin KaFai Lau
This patch adds hash of maps support (hashmap->bpf_map). BPF_MAP_TYPE_HASH_OF_MAPS is added. A map-in-map contains a pointer to another map and lets call this pointer 'inner_map_ptr'. Notes on deleting inner_map_ptr from a hash map: 1. For BPF_F_NO_PREALLOC map-in-map, when deleting an inner_map_ptr, the htab_elem itself will go through a rcu grace period and the inner_map_ptr resides in the htab_elem. 2. For pre-allocated htab_elem (!BPF_F_NO_PREALLOC), when deleting an inner_map_ptr, the htab_elem may get reused immediately. This situation is similar to the existing prealloc-ated use cases. However, the bpf_map_fd_put_ptr() calls bpf_map_put() which calls inner_map->ops->map_free(inner_map) which will go through a rcu grace period (i.e. all bpf_map's map_free currently goes through a rcu grace period). Hence, the inner_map_ptr is still safe for the rcu reader side. This patch also includes BPF_MAP_TYPE_HASH_OF_MAPS to the check_map_prealloc() in the verifier. preallocation is a must for BPF_PROG_TYPE_PERF_EVENT. Hence, even we don't expect heavy updates to map-in-map, enforcing BPF_F_NO_PREALLOC for map-in-map is impossible without disallowing BPF_PROG_TYPE_PERF_EVENT from using map-in-map first. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22bpf: Add array of maps supportMartin KaFai Lau
This patch adds a few helper funcs to enable map-in-map support (i.e. outer_map->inner_map). The first outer_map type BPF_MAP_TYPE_ARRAY_OF_MAPS is also added in this patch. The next patch will introduce a hash of maps type. Any bpf map type can be acted as an inner_map. The exception is BPF_MAP_TYPE_PROG_ARRAY because the extra level of indirection makes it harder to verify the owner_prog_type and owner_jited. Multi-level map-in-map is not supported (i.e. map->map is ok but not map->map->map). When adding an inner_map to an outer_map, it currently checks the map_type, key_size, value_size, map_flags, max_entries and ops. The verifier also uses those map's properties to do static analysis. map_flags is needed because we need to ensure BPF_PROG_TYPE_PERF_EVENT is using a preallocated hashtab for the inner_hash also. ops and max_entries are needed to generate inlined map-lookup instructions. For simplicity reason, a simple '==' test is used for both map_flags and max_entries. The equality of ops is implied by the equality of map_type. During outer_map creation time, an inner_map_fd is needed to create an outer_map. However, the inner_map_fd's life time does not depend on the outer_map. The inner_map_fd is merely used to initialize the inner_map_meta of the outer_map. Also, for the outer_map: * It allows element update and delete from syscall * It allows element lookup from bpf_prog The above is similar to the current fd_array pattern. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22bpf: Fix and simplifications on inline map lookupMartin KaFai Lau
Fix in verifier: For the same bpf_map_lookup_elem() instruction (i.e. "call 1"), a broken case is "a different type of map could be used for the same lookup instruction". For example, an array in one case and a hashmap in another. We have to resort to the old dynamic call behavior in this case. The fix is to check for collision on insn_aux->map_ptr. If there is collision, don't inline the map lookup. Please see the "do_reg_lookup()" in test_map_in_map_kern.c in the later patch for how-to trigger the above case. Simplifications on array_map_gen_lookup(): 1. Calculate elem_size from map->value_size. It removes the need for 'struct bpf_array' which makes the later map-in-map implementation easier. 2. Remove the 'elem_size == 1' test Fixes: 81ed18ab3098 ("bpf: add helper inlining infra and optimize map_array lookup") Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22inet: frag: release spinlock before calling icmp_send()Eric Dumazet
Dmitry reported a lockdep splat [1] (false positive) that we can fix by releasing the spinlock before calling icmp_send() from ip_expire() This is a false positive because sending an ICMP message can not possibly re-enter the IP frag engine. [1] [ INFO: possible circular locking dependency detected ] 4.10.0+ #29 Not tainted ------------------------------------------------------- modprobe/12392 is trying to acquire lock: (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] spin_lock include/linux/spinlock.h:299 [inline] (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] __netif_tx_lock include/linux/netdevice.h:3486 [inline] (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 but task is already holding lock: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock include/linux/spinlock.h:299 [inline] (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(&q->lock)->rlock){+.-...}: validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] ip_defrag+0x3a2/0x4130 net/ipv4/ip_fragment.c:669 ip_check_defrag+0x4e3/0x8b0 net/ipv4/ip_fragment.c:713 packet_rcv_fanout+0x282/0x800 net/packet/af_packet.c:1459 deliver_skb net/core/dev.c:1834 [inline] dev_queue_xmit_nit+0x294/0xa90 net/core/dev.c:1890 xmit_one net/core/dev.c:2903 [inline] dev_hard_start_xmit+0x16b/0xab0 net/core/dev.c:2923 sch_direct_xmit+0x31f/0x6d0 net/sched/sch_generic.c:182 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_resolve_output+0x6b9/0xb10 net/core/neighbour.c:1308 neigh_output include/net/neighbour.h:478 [inline] ip_finish_output2+0x8b8/0x15a0 net/ipv4/ip_output.c:228 ip_do_fragment+0x1d93/0x2720 net/ipv4/ip_output.c:672 ip_fragment.constprop.54+0x145/0x200 net/ipv4/ip_output.c:545 ip_finish_output+0x82d/0xe10 net/ipv4/ip_output.c:314 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 raw_sendmsg+0x26de/0x3a00 net/ipv4/raw.c:655 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 ___sys_sendmsg+0x4a3/0x9f0 net/socket.c:1985 __sys_sendmmsg+0x25c/0x750 net/socket.c:2075 SYSC_sendmmsg net/socket.c:2106 [inline] SyS_sendmmsg+0x35/0x60 net/socket.c:2101 do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281 return_from_SYSCALL_64+0x0/0x7a -> #0 (_xmit_ETHER#2){+.-...}: check_prev_add kernel/locking/lockdep.c:1830 [inline] check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940 validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] __netif_tx_lock include/linux/netdevice.h:3486 [inline] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_hh_output include/net/neighbour.h:468 [inline] neigh_output include/net/neighbour.h:476 [inline] ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228 ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394 icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754 ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239 call_timer_fn+0x241/0x820 kernel/time/timer.c:1268 expire_timers kernel/time/timer.c:1307 [inline] __run_timers+0x960/0xcf0 kernel/time/timer.c:1601 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284 invoke_softirq kernel/softirq.c:364 [inline] irq_exit+0x1cc/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:657 [inline] smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707 __read_once_size include/linux/compiler.h:254 [inline] atomic_read arch/x86/include/asm/atomic.h:26 [inline] rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline] __rcu_is_watching kernel/rcu/tree.c:1133 [inline] rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147 rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293 radix_tree_deref_slot include/linux/radix-tree.h:238 [inline] filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335 do_fault_around mm/memory.c:3231 [inline] do_read_fault mm/memory.c:3265 [inline] do_fault+0xbd5/0x2080 mm/memory.c:3370 handle_pte_fault mm/memory.c:3600 [inline] __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714 handle_mm_fault+0x1e2/0x480 mm/memory.c:3751 __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397 do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460 page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&q->lock)->rlock); lock(_xmit_ETHER#2); lock(&(&q->lock)->rlock); lock(_xmit_ETHER#2); *** DEADLOCK *** 10 locks held by modprobe/12392: #0: (&mm->mmap_sem){++++++}, at: [<ffffffff81329758>] __do_page_fault+0x2b8/0xb60 arch/x86/mm/fault.c:1336 #1: (rcu_read_lock){......}, at: [<ffffffff8188cab6>] filemap_map_pages+0x1e6/0x1570 mm/filemap.c:2324 #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] spin_lock include/linux/spinlock.h:299 [inline] #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] pte_alloc_one_map mm/memory.c:2944 [inline] #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] alloc_set_pte+0x13b8/0x1b90 mm/memory.c:3072 #3: (((&q->timer))){+.-...}, at: [<ffffffff81627e72>] lockdep_copy_map include/linux/lockdep.h:175 [inline] #3: (((&q->timer))){+.-...}, at: [<ffffffff81627e72>] call_timer_fn+0x1c2/0x820 kernel/time/timer.c:1258 #4: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock include/linux/spinlock.h:299 [inline] #4: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201 #5: (rcu_read_lock){......}, at: [<ffffffff8389a633>] ip_expire+0x1b3/0x6c0 net/ipv4/ip_fragment.c:216 #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] spin_trylock include/linux/spinlock.h:309 [inline] #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] icmp_xmit_lock net/ipv4/icmp.c:219 [inline] #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] icmp_send+0x803/0x1c80 net/ipv4/icmp.c:681 #7: (rcu_read_lock_bh){......}, at: [<ffffffff838ab9a1>] ip_finish_output2+0x2c1/0x15a0 net/ipv4/ip_output.c:198 #8: (rcu_read_lock_bh){......}, at: [<ffffffff836d1dee>] __dev_queue_xmit+0x23e/0x1e60 net/core/dev.c:3324 #9: (dev->qdisc_running_key ?: &qdisc_running_key){+.....}, at: [<ffffffff836d3a27>] dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 stack backtrace: CPU: 0 PID: 12392 Comm: modprobe Not tainted 4.10.0+ #29 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x2ee/0x3ef lib/dump_stack.c:52 print_circular_bug+0x307/0x3b0 kernel/locking/lockdep.c:1204 check_prev_add kernel/locking/lockdep.c:1830 [inline] check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940 validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] __netif_tx_lock include/linux/netdevice.h:3486 [inline] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_hh_output include/net/neighbour.h:468 [inline] neigh_output include/net/neighbour.h:476 [inline] ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228 ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394 icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754 ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239 call_timer_fn+0x241/0x820 kernel/time/timer.c:1268 expire_timers kernel/time/timer.c:1307 [inline] __run_timers+0x960/0xcf0 kernel/time/timer.c:1601 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284 invoke_softirq kernel/softirq.c:364 [inline] irq_exit+0x1cc/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:657 [inline] smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707 RIP: 0010:__read_once_size include/linux/compiler.h:254 [inline] RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline] RIP: 0010:rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline] RIP: 0010:__rcu_is_watching kernel/rcu/tree.c:1133 [inline] RIP: 0010:rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147 RSP: 0000:ffff8801c391f120 EFLAGS: 00000a03 ORIG_RAX: ffffffffffffff10 RAX: dffffc0000000000 RBX: ffff8801c391f148 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000055edd4374000 RDI: ffff8801dbe1ae0c RBP: ffff8801c391f1a0 R08: 0000000000000002 R09: 0000000000000000 R10: dffffc0000000000 R11: 0000000000000002 R12: 1ffff10038723e25 R13: ffff8801dbe1ae00 R14: ffff8801c391f680 R15: dffffc0000000000 </IRQ> rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293 radix_tree_deref_slot include/linux/radix-tree.h:238 [inline] filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335 do_fault_around mm/memory.c:3231 [inline] do_read_fault mm/memory.c:3265 [inline] do_fault+0xbd5/0x2080 mm/memory.c:3370 handle_pte_fault mm/memory.c:3600 [inline] __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714 handle_mm_fault+0x1e2/0x480 mm/memory.c:3751 __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397 do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460 page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011 RIP: 0033:0x7f83172f2786 RSP: 002b:00007fffe859ae80 EFLAGS: 00010293 RAX: 000055edd4373040 RBX: 00007f83175111c8 RCX: 000055edd4373238 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f8317510970 RBP: 00007fffe859afd0 R08: 0000000000000009 R09: 0000000000000000 R10: 0000000000000064 R11: 0000000000000000 R12: 000055edd4373040 R13: 0000000000000000 R14: 00007fffe859afe8 R15: 0000000000000000 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22tcp: initialize icsk_ack.lrcvtime at session start timeEric Dumazet
icsk_ack.lrcvtime has a 0 value at socket creation time. tcpi_last_data_recv can have bogus value if no payload is ever received. This patch initializes icsk_ack.lrcvtime for active sessions in tcp_finish_connect(), and for passive sessions in tcp_create_openreq_child() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22genetlink: fix counting regression on ctrl_dumpfamily()Stanislaw Gruszka
Commit 2ae0f17df1cd ("genetlink: use idr to track families") replaced if (++n < fams_to_skip) continue; into: if (n++ < fams_to_skip) continue; This subtle change cause that on retry ctrl_dumpfamily() call we omit one family that failed to do ctrl_fill_info() on previous call, because cb->args[0] = n number counts also family that failed to do ctrl_fill_info(). Patch fixes the problem and avoid confusion in the future just decrease n counter when ctrl_fill_info() fail. User visible problem caused by this bug is failure to get access to some genetlink family i.e. nl80211. However problem is reproducible only if number of registered genetlink families is big enough to cause second call of ctrl_dumpfamily(). Cc: Xose Vazquez Perez <xose.vazquez@gmail.com> Cc: Larry Finger <Larry.Finger@lwfinger.net> Cc: Johannes Berg <johannes@sipsolutions.net> Fixes: 2ae0f17df1cd ("genetlink: use idr to track families") Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22socket, bpf: fix sk_filter use after free in sk_clone_lockDaniel Borkmann
In sk_clone_lock(), we create a new socket and inherit most of the parent's members via sock_copy() which memcpy()'s various sections. Now, in case the parent socket had a BPF socket filter attached, then newsk->sk_filter points to the same instance as the original sk->sk_filter. sk_filter_charge() is then called on the newsk->sk_filter to take a reference and should that fail due to hitting max optmem, we bail out and release the newsk instance. The issue is that commit 278571baca2a ("net: filter: simplify socket charging") wrongly combined the dismantle path with the failure path of xfrm_sk_clone_policy(). This means, even when charging failed, we call sk_free_unlock_clone() on the newsk, which then still points to the same sk_filter as the original sk. Thus, sk_free_unlock_clone() calls into __sk_destruct() eventually where it tests for present sk_filter and calls sk_filter_uncharge() on it, which potentially lets sk_omem_alloc wrap around and releases the eBPF prog and sk_filter structure from the (still intact) parent. Fix it by making sure that when sk_filter_charge() failed, we reset newsk->sk_filter back to NULL before passing to sk_free_unlock_clone(), so that we don't mess with the parents sk_filter. Only if xfrm_sk_clone_policy() fails, we did reach the point where either the parent's filter was NULL and as a result newsk's as well or where we previously had a successful sk_filter_charge(), thus for that case, we do need sk_filter_uncharge() to release the prior taken reference on sk_filter. Fixes: 278571baca2a ("net: filter: simplify socket charging") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: stmmac: fix dma operation mode config for older versionsJoao Pinto
The dma operation mode configuration routine was wrongly moved to a function (stmmac_mtl_configuration) that is only executed if the core version is >= 4.00. Fixes: 6deee2221e11 ("net: stmmac: prepare dma op mode config for multiple queues") Reported-by: Corentin Labbe <clabbe.montjoie@gmail.com> Reviewed-by: Thierry Reding <thierry.reding@gmail.com> Signed-off-by: Joao Pinto <jpinto@synopsys.com> Tested-by: Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: ipv6: Add sysctl for minimum prefix len acceptable in RIOs.Joel Scherpelz
This commit adds a new sysctl accept_ra_rt_info_min_plen that defines the minimum acceptable prefix length of Route Information Options. The new sysctl is intended to be used together with accept_ra_rt_info_max_plen to configure a range of acceptable prefix lengths. It is useful to prevent misconfigurations from unintentionally blackholing too much of the IPv6 address space (e.g., home routers announcing RIOs for fc00::/7, which is incorrect). Signed-off-by: Joel Scherpelz <jscherpelz@google.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22ipv4: provide stronger user input validation in nl_fib_input()Eric Dumazet
Alexander reported a KMSAN splat caused by reads of uninitialized field (tb_id_in) from user provided struct fib_result_nl It turns out nl_fib_input() sanity tests on user input is a bit wrong : User can pretend nlh->nlmsg_len is big enough, but provide at sendmsg() time a too small buffer. Reported-by: Alexander Potapenko <glider@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22bpf: fix hashmap extra_elems logicAlexei Starovoitov
In both kmalloc and prealloc mode the bpf_map_update_elem() is using per-cpu extra_elems to do atomic update when the map is full. There are two issues with it. The logic can be misused, since it allows max_entries+num_cpus elements to be present in the map. And alloc_extra_elems() at map creation time can fail percpu alloc for large map values with a warn: WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60 illegal size (32824) or align (8) for percpu allocation The fixes for both of these issues are different for kmalloc and prealloc modes. For prealloc mode allocate extra num_possible_cpus elements and store their pointers into extra_elems array instead of actual elements. Hence we can use these hidden(spare) elements not only when the map is full but during bpf_map_update_elem() that replaces existing element too. That also improves performance, since pcpu_freelist_pop/push is avoided. Unfortunately this approach cannot be used for kmalloc mode which needs to kfree elements after rcu grace period. Therefore switch it back to normal kmalloc even when full and old element exists like it was prior to commit 6c9059817432 ("bpf: pre-allocate hash map elements"). Add tests to check for over max_entries and large map values. Reported-by: Dave Jones <davej@codemonkey.org.uk> Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22Merge branch 'nfp-concurrency'David S. Miller
Jakub Kicinski says: ==================== nfp: allow concurrency in core and minor fixes The first 10 patches of this series prepare nfpcore for concurrent accesses. This will be needed by upcoming hwmon and devlink patches. Most locking is already in place, the patches in this series iron out a few bugs. Last 5 patches are fixes and cleanups to the netdev code, including removal of doorbell pointers used only on old versions of the chip, removal of unnecessarily defensive code and flushing xmit_more more carefully on error paths. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: disable FW on reconfiguration errorsJakub Kicinski
Since we no longer need to keep the FW enabled for .ndo_close() to work we can always stop FW after reconfiguration failure. This seems to make most FWs more resilient to faults (at least in error injection scenarios). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: remove defensive checks around ndo_open()/ndo_close()Jakub Kicinski
Device open and close handlers check if the device is already in the desired state. Thanks to our reconfig infrastructure this should not be necessary, there doesn't seem to be any code in the driver which depends on it. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: flush xmit_more on error pathsJakub Kicinski
In case of ring full or DMA mapping error remember to flush xmit_more delayed kicks. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: remove RX queue pointersJakub Kicinski
NFP6000 doesn't use queue pointers/doorbells for RX, it uses 'done' bit in descriptors. Remove the pointers from data structures. Since we are saving space in rx_ring structure make fields we previously compressed to 16bits word size again. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: don't use netdev_warn() before netdev is registeredJakub Kicinski
Fix warning which was using netdev_warn() instead of dev_warn() to early. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: fix nfp_cpp_read()/nfp_cpp_write() error pathsJakub Kicinski
When acquiring an area fails we can't call function doing both release and free. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: fix invalid area detectionJakub Kicinski
Core should detect when someone is trying to request an access window which is too large for a given type of access. Otherwise the requester will be put on a wait queue for ever without any error message. Add const qualifiers to clarify that we are only looking at read- -only members in relevant functions. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: don't ignore return value of wait_event_interruptibleJakub Kicinski
When signal interrupts waiting for an area to become available we assume success. Pay attention to the return code. Unpack the code a little bit to make it more readable. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: correct return codes when msleep gets interruptedJakub Kicinski
msleep_interruptible() returns time left to wait, not error code. Return ERESTARTSYS when interrupted. While at it correct a comment and make the polling a bit more aggressive. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: lock area cache earlierJakub Kicinski
We shouldn't access area_cache_list without its lock even to check if it's empty. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: document expected locking in the coreJakub Kicinski
Document which fields of nfp_cpp are protected by which locks. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: move mutex code out of nfp_cppcore.cJakub Kicinski
After mutex cache removal we can put the mutex code in a separate source file. This makes it clear it doesn't play with internals of struct nfp_cpp any more. No functional changes. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: remove cpp mutex cacheJakub Kicinski
CPP mutex cache was introduced to work around the fact that the same host could successfully acquire a lock multiple times. It used to collapse multiple users to the same struct nfp_cpp_mutex and track use count. Unfortunately it's racy. Since we now force all nfp_mutex_lock() callers within the host to actually succeed at acquiring the lock we no longer need the cache, let's remove it. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: fail graciously when someone tries to grab global lockJakub Kicinski
The global device lock is acquired to search the resource table. The lock is actually itself part of the table (entry 0). Therefore if someone asks for resource 0 we would deadlock since double locking is no longer allowed. Currently the driver doesn't try to lock that resource so let's simply make sure we fail graciously and not add special handling of this case until really need. Hide the relevant defines in the source file. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22nfp: disallow sharing mutexes on the same machineJakub Kicinski
NFP can be connected to multiple machines via PCI or other buses. Access to hardware resources is arbitrated using locks residing in device memory. Currently nfpcore only respects the mutexes when it comes to inter-host locking, but if we try to acquire the same lock again, on one host - it will simply return success because owner of the lock is already set to that host. This makes the locks useless for arbitration within one host and unfair because whichever host grabbed the lock will have a chance to reacquire it without others getting a shot. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22enic: update enic maintainersGovindarajulu Varadarajan
update enic maintainers Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: bcmgenet: remove bcmgenet_internal_phy_setup()Doug Berger
Commit 6ac3ce8295e6 ("net: bcmgenet: Remove excessive PHY reset") removed the bcmgenet_mii_reset() function from bcmgenet_power_up() and bcmgenet_internal_phy_setup() functions. In so doing it broke the reset of the internal PHY devices used by the GENETv1-GENETv3 which required this reset before the UniMAC was enabled. It also broke the internal GPHY devices used by the GENETv4 because the config_init that installed the AFE workaround was no longer occurring after the reset of the GPHY performed by bcmgenet_phy_power_set() in bcmgenet_internal_phy_setup(). In addition the code in bcmgenet_internal_phy_setup() related to the "enable APD" comment goes with the bcmgenet_mii_reset() so it should have also been removed. Commit bd4060a6108b ("net: bcmgenet: Power on integrated GPHY in bcmgenet_power_up()") moved the bcmgenet_phy_power_set() call to the bcmgenet_power_up() function, but failed to remove it from the bcmgenet_internal_phy_setup() function. Had it done so, the bcmgenet_internal_phy_setup() function would have been empty and could have been removed at that time. Commit 5dbebbb44a6a ("net: bcmgenet: Software reset EPHY after power on") was submitted to correct the functional problems introduced by commit 6ac3ce8295e6 ("net: bcmgenet: Remove excessive PHY reset"). It was included in v4.4 and made available on 4.3-stable. Unfortunately, it didn't fully revert the commit because this bcmgenet_mii_reset() doesn't apply the soft reset to the internal GPHY used by GENETv4 like the previous one did. This prevents the restoration of the AFE work- arounds for internal GPHY devices after the bcmgenet_phy_power_set() in bcmgenet_internal_phy_setup(). This commit takes the alternate approach of removing the unnecessary bcmgenet_internal_phy_setup() function which shouldn't have been in v4.3 so that when bcmgenet_mii_reset() was restored it should have only gone into bcmgenet_power_up(). This will avoid the problems while also removing the redundancy (and hopefully some of the confusion). Fixes: 6ac3ce8295e6 ("net: bcmgenet: Remove excessive PHY reset") Signed-off-by: Doug Berger <opendmb@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: dwc-xlgmac: fix an error code in xlgmac_alloc_pages()Dan Carpenter
The dma_mapping_error() returns true if there is an error but we want to return -ENOMEM and not 1. Fixes: 65e0ace2c5cd ("net: dwc-xlgmac: Initial driver for DesignWare Enterprise Ethernet") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jie Deng <jiedeng@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22rtnetlink: Add dump all for netconfDavid Ahern
Use the rtnl_dump_all to dump all netconf handlers that have been registered. Allows userspace to send a dump request for PF_UNSPEC and get all families. Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22Merge branch 'phy-mmd-cleanup'David S. Miller
Russell King says: ==================== Clean up PHY MMD accessors This series cleans up phylib's MMD accessors, so that we have a common way of accessing the Clause 45 register set. The current situation is far from ideal - we have phy_(read|write)_mmd() which accesses Clause 45 registers over Clause 45 accesses, and we have phy_(read|write)_mmd_indirect(), which accesses Clause 45 registers via Clause 22 register 13/14. Generic code uses the indirect methods to access standard Clause 45 features, and when we come to add Clause 45 PHY support to phylib, we would need to make these conditional upon the PHY type, or duplicate these functions. An alternative solution is to merge these accessors together, and select the appropriate access method depending upon the 802.3 clause that the PHY conforms with. The result is that we have a single set of phy_(read|write)_mmd() accessors. For cases which require special handling, we still allow PHY drivers to override all MMD accesses - except rather than just overriding the indirect accesses. This keeps existing overrides working. Combining the two also has another beneficial side effect - we get rid of similar functions that take arguments in different orders. The old direct accessors took the phy structure, devad and register number, whereas the indirect accessors took the phy structure, register number and devad in that order. Care must be taken when updating future drivers that the argument order is correct, and the function name is not merely replaced. This patch set is against net-next. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: phy: clean up mmd_phy_indirect()Russell King
Make mmd_phy_indirect() use the same terminology as the rest of the code, making clear what each address is - phy address, devad, and register number. While here, remove the "inline" from this static function, leaving it to the compiler to decide whether to inline this function, and get rid of unnecessary parens. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: phy: remove the indirect MMD read/write methodsRussell King
Remove the indirect MMD read/write methods which are now no longer necessary. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: phy: convert micrel to new read_mmd/write_mmd driver methodsRussell King
Convert micrel to the new read_mmd/write_mmd driver methods. This Clause 22 PHY does not support any MMD access method. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: phy: switch remaining users to phy_(read|write)_mmd()Russell King
Switch everyone over to using phy_read_mmd() and phy_write_mmd() now that they are able to handle both Clause 22 indirect addressing and Clause 45 direct addressing methods to the MMD registers. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: lan78xx: update for phy_(read|write)_mmd_indirect() removalRussell King
lan78xx appears to use phylib in a rather weird way, accessing the PHY partly through phylib, and partly by making direct accesses to it, including to the Clause 45 registers. As the indirect MMD accessors are going away, update this driver to use the plain phy_(read|write)_mmd() accessors instead. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Woojung Huh <Woojung.Huh@microchip.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: phy: make phy_(read|write)_mmd() generic MMD accessorsRussell King
Make phy_(read|write)_mmd() generic 802.3 clause 45 register accessors for both Clause 22 and Clause 45 PHYs, using either the direct register reading for Clause 45, or the indirect method for Clause 22 PHYs. Allow this behaviour to be overriden by PHY drivers where necessary. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: phy: move phy MMD accessors to phy-core.cRussell King
Move the phy_(read|write)__mmd() helpers out of line, they will become our main MMD accessor functions, and so will be a little more complex. This complexity doesn't belong in an inline function. Also move the _indirect variants as well to keep like functionality together. Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22ipv6: make sure to initialize sockc.tsflags before first useAlexander Potapenko
In the case udp_sk(sk)->pending is AF_INET6, udpv6_sendmsg() would jump to do_append_data, skipping the initialization of sockc.tsflags. Fix the problem by moving sockc.tsflags initialization earlier. The bug was detected with KMSAN. Fixes: c14ac9451c34 ("sock: enable timestamping using control messages") Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22Merge branch 'fjes-fixes'David S. Miller
YASUAKI ISHIMATSU says: ==================== fjes: Do not load fjes driver The fjes driver is used only by FUJITSU servers and almost of all servers in the world never use it. But currently if ACPI PNP0C02 is defined in the ACPI table, the following message is always shown: "FUJITSU Extended Socket Network Device Driver - version 1.2 - Copyright (c) 2015 FUJITSU LIMITED" The message makes users confused because there is no reason that the message is shown in other vendor servers. To avoid the confusion, the patch adds several checks. v3: - Rebase on latest net tree. - Add _STA method check to avoid loading fjes driver. v2: - Order local variable declarations from longest to shortest line ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22fjes: Do not load fjes driver if extended socket device is not power on.Yasuaki Ishimatsu
The extended device socket cannot turn on/off while system is running. So when system boots up and the device is not power on, the fjes driver does not need be loaded. To check the status of the device, the patch adds ACPI _STA method check. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> CC: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22fjes: Do not load fjes driver if system does not have extended socket device.Yasuaki Ishimatsu
The fjes driver is used only by FUJITSU servers and almost of all servers in the world never use it. But currently if ACPI PNP0C02 is defined in the ACPI table, the following message is always shown: "FUJITSU Extended Socket Network Device Driver - version 1.2 - Copyright (c) 2015 FUJITSU LIMITED" The message makes users confused because there is no reason that the message is shown in other vendor servers. To avoid the confusion, the patch adds a check that the server has a extended socket device or not. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> CC: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: stmmac: Use AVB mode by defaultThierry Reding
Prior to the recent multi-queue changes the driver would configure the queues to use the AVB mode, but the mode then got switched to DCB. The hardware still works fine in DCB mode, but my testing capabilities are limited, so it's safer to revert to the prior setting anyway. Signed-off-by: Thierry Reding <treding@nvidia.com> Acked-By: Joao Pinto <jpinto@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: stmmac: Restore DT backwards-compatibilityThierry Reding
Recent changes to support multiple queues in the device tree bindings resulted in the number of RX and TX queues to be initialized to zero for device trees not adhering to the new bindings. Restore backwards-compatibility with those device trees by falling back to a single RX and TX queues each. Signed-off-by: Thierry Reding <treding@nvidia.com> Acked-By: Joao Pinto <jpinto@synopsys.com> Tested-by: Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22net: stmmac: Always enable MAC RX queuesThierry Reding
The MAC RX queues always need to be enabled in order to receive network packets. Remove the condition that this only needs to be done for multi- queue configurations. Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-22Merge branch 'mlx5-fixes'David S. Miller
Saeed Mahameed says: ==================== Mellanox mlx5 fixes 2017-03-21 This series contains some mlx5 core and ethernet driver fixes. For -stable: net/mlx5e: Count LRO packets correctly (for kernel >= 4.2) net/mlx5e: Count GSO packets correctly (for kernel >= 4.2) net/mlx5: Increase number of max QPs in default profile (for kernel >= 4.0) net/mlx5e: Avoid supporting udp tunnel port ndo for VF reps (for kernel >= 4.10) net/mlx5e: Use the proper UAPI values when offloading TC vlan actions (for kernel >= v4.9) net/mlx5: E-Switch, Don't allow changing inline mode when flows are configured (for kernel >= 4.10) net/mlx5e: Change the TC offload rule add/del code path to be per NIC or E-Switch (for kernel >= 4.10) net/mlx5: Add missing entries for set/query rate limit commands (for kernel >= 4.8) ==================== Signed-off-by: David S. Miller <davem@davemloft.net>