2013-04-19qlge: Update version to 1.00.00.32.Jitendra Kalsaria
Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19qlge: Fix ethtool autoneg advertising.Jitendra Kalsaria
Autoneg is supported on specific port types only. Fix the driver to advertise autoneg based on the port type. Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19qlge: Fix receive path to drop error framesSritej Velaga
o Fix the driver to drop error frames in the receive path
o Update error counter which was not getting incremented
Signed-off-by: Sritej Velaga <sritej.velaga@qlogic.com> Signed-off-by: Jitendra Kalsaria <jitendra.kalsaria@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19Merge branch 'qmi_wwan'David S. Miller
Bjørn Mork says:
====================
This series adds workarounds for 3 different firmware bugs, each preventing the affected devices from working at all. I therefore humbly request that these fixes go to stable-3.8 (if still maintained) and 3.9 (either via net if still possible, or via stable if not).

All 3 workarounds are applied to all devices supported by the driver. Adding quirks for specific devices was considered as an alternative, but was rejected because we have too little information about the exact distribution of the buggy firmwares. All we know is that the same bug shows up in devices from at least 3 different, and presumably independent, vendors.

The workarounds have instead been designed to automatically apply when necessary, and to have as little impact as possible on unaffected devices. The series has been tested on a number of devices both with and without these bugs.

The series should apply cleanly to net/master, net-next/master and stable/linux-3.8.y
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19net: qmi_wwan: prevent duplicate mac address on link (firmware bug workaround)Bjørn Mork
We normally trust and use the CDC functional descriptors provided by a number of devices. But some of these will erroneously list the address reserved for the device end of the link. Attempting to use this on both the device and host side will naturally not work. Work around this bug by ignoring the functional descriptor and assigning a random address instead in this case. Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
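A minimal sketch of that fallback, assuming the reserved device-end address is known up front (the constant and function name below are illustrative, not the actual driver code):

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

/* Hypothetical: the address the firmware reserves for its own end of
 * the link; the host must not use the same address. */
static const u8 device_end_addr[ETH_ALEN] = { 0x00, 0xa0, 0xc6, 0x00, 0x00, 0x00 };

static void fixup_host_addr(struct net_device *net)
{
	/* Descriptor-provided address collides with the device end of the
	 * link: ignore it and use a random, locally administered unicast
	 * address instead. */
	if (ether_addr_equal(net->dev_addr, device_end_addr))
		eth_hw_addr_random(net);
}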
2013-04-19net: qmi_wwan: fixup destination address (firmware bug workaround)Bjørn Mork
Received packets are sometimes addressed to 00:a0:c6:00:00:00 instead of the address the device firmware should have learned from the host:

321.224126 77.16.85.204 -> 148.122.171.134 ICMP 98 Echo (ping) request id=0x4025, seq=64/16384, ttl=64

0000  82 c0 82 c9 f1 67 82 c0 82 c9 f1 67 08 00 45 00   .....g.....g..E.
0010  00 54 00 00 40 00 40 01 57 cc 4d 10 55 cc 94 7a   .T..@.@.W.M.U..z
0020  ab 86 08 00 62 fc 40 25 00 40 b2 bc 6e 51 00 00   ....b.@%.@..nQ..
0030  00 00 6b bd 09 00 00 00 00 00 10 11 12 13 14 15   ..k.............
0040  16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25   .......... !"#$%
0050  26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35   &'()*+,-./012345
0060  36 37                                              67

321.240607 148.122.171.134 -> 77.16.85.204 ICMP 98 Echo (ping) reply id=0x4025, seq=64/16384, ttl=55

0000  00 a0 c6 00 00 00 02 50 f3 00 00 00 08 00 45 00   .......P......E.
0010  00 54 00 56 00 00 37 01 a0 76 94 7a ab 86 4d 10   .T.V..7..v.z..M.
0020  55 cc 00 00 6a fc 40 25 00 40 b2 bc 6e 51 00 00   U...j.@%.@..nQ..
0030  00 00 6b bd 09 00 00 00 00 00 10 11 12 13 14 15   ..k.............
0040  16 17 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25   .......... !"#$%
0050  26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32 33 34 35   &'()*+,-./012345
0060  36 37                                              67

The bogus address is always the same, and matches the address suggested by many devices as a default address. It is likely a hardcoded firmware default.

The circumstances where this bug has been observed indicate that the trigger is related to timing or some other factor the host cannot control. Repeating the exact same configuration sequence that caused it to trigger once will not necessarily cause it to trigger the next time. Reproducing the bug is therefore difficult. This opens up a possibility that the bug is more common than we can confirm, because affected devices often will work properly again after a reset, a procedure most users are likely to try out before reporting a bug.

Unconditionally rewriting the destination address if the first digit of the received packet is 0 is considered an acceptable compromise, since we already have to inspect this digit. The simplification will cause unnecessary rewrites if the real address starts with 0, but this is still better than adding additional tests for this particular case.

Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
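The rewrite itself is cheap because the first nibble is already being inspected to tell raw-IP frames from ethernet frames (see the missing-header workaround below). A hedged sketch of the idea, as it might look in a usbnet rx_fixup hook (hypothetical function name, not the actual driver code):

#include <linux/usb/usbnet.h>
#include <linux/etherdevice.h>
#include <linux/string.h>

static int example_dst_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
{
	/* A first nibble of 0 means we are looking at an ethernet header
	 * whose destination may carry the bogus firmware default, so
	 * overwrite it with the address the host actually uses. */
	if ((skb->data[0] & 0xf0) == 0)
		memcpy(skb->data, dev->net->dev_addr, ETH_ALEN);

	return 1;	/* frame is good, let usbnet process it */
}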
2013-04-19net: qmi_wwan: fixup missing ethernet header (firmware bug workaround)Bjørn Mork
A number of LTE devices from different vendors all suffer from the same firmware bug: most of the packets received from the device while it is attached to a LTE network will not have an ethernet header. The devices work as expected when attached to 2G or 3G networks, sending an ethernet header with all packets. This driver is not aware of which network the modem is attached to, and even if it were, there are still some packet types which are always received with the header intact.

All devices supported by this driver have severely limited networking capabilities:
- can only transmit IPv4, IPv6 and possibly ARP
- can only support a single host hardware address at any time
- will only do point-to-point communication with the host

Because of this, we are able to reliably identify any bogus raw IP packets by simply looking at the 4 IP version bits. All we need to do is to avoid 4 or 6 in the first digit of the mac address. This workaround ensures this, and fixes up the received packets as necessary.

Given the distribution of the bug, it is believed that the source is the chipset vendor. The devices which are verified to be affected are:
Huawei E392u-12 (Qualcomm MDM9200)
Pantech UML290 (Qualcomm MDM9600)
Novatel USB551L (Qualcomm MDM9600)
Novatel E362 (Qualcomm MDM9600)

It is believed that the bug depends on firmware revision, which means that possibly all devices based on the above mentioned chipsets may be affected if we consider all available firmware revisions. The information about affected devices and versions is likely incomplete.

As the additional overhead for packets not needing this fixup is very small, it is considered acceptable to apply the workaround to all devices handled by this driver.

Reported-by: Dan Williams <dcbw@redhat.com> Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
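A hedged sketch of the header fixup described above (hypothetical function name; a simplification of what a usbnet rx_fixup hook would do under the assumptions listed in the commit message):

#include <linux/usb/usbnet.h>
#include <linux/etherdevice.h>
#include <linux/if_ether.h>

static int example_hdr_rx_fixup(struct usbnet *dev, struct sk_buff *skb)
{
	__be16 proto;

	/* Frames that already carry an ethernet header start with the host
	 * hardware address, whose first nibble is guaranteed never to be
	 * 4 or 6; raw IP frames start with the IP version. */
	switch (skb->data[0] & 0xf0) {
	case 0x40:
		proto = htons(ETH_P_IP);
		break;
	case 0x60:
		proto = htons(ETH_P_IPV6);
		break;
	default:
		return 1;	/* header present, pass through unchanged */
	}

	if (skb_headroom(skb) < ETH_HLEN)
		return 0;	/* no room to rebuild the header, drop */

	/* Rebuild the missing ethernet header in front of the IP payload */
	skb_push(skb, ETH_HLEN);
	skb_reset_mac_header(skb);
	eth_hdr(skb)->h_proto = proto;
	memset(eth_hdr(skb)->h_source, 0, ETH_ALEN);
	memcpy(eth_hdr(skb)->h_dest, dev->net->dev_addr, ETH_ALEN);

	return 1;
}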
2013-04-19Merge branch 'bonding'David S. Miller
Nikolay Aleksandrov says:
====================
This patch-set fixes mainly bugs on enslave failure and one case of missing locking. The patches are:

1. On enslave failure mc addresses are not flushed from the slave
2. On enslave failure vlans are not cleaned up from the slave
3. On enslave failure the bond's primary and curr_active_slave are not cleaned up (which might result in use of freed memory)
4. On enslave failure netpoll is not disabled which might result in a memory leak
5. In bond_mc_swap() the bond's mc addr list is walked without netif_addr_lock, since it can be called without rtnl, add it

v2: patch 01 - fix log message and remove unnecessary code move
====================
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19bonding: in bond_mc_swap() bond's mc addr list is walked without locknikolay@redhat.com
Use netif_addr_lock_bh() to acquire the appropriate lock before walking. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
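A minimal sketch of the locking pattern around the walk (illustrative helper reduced to the multicast swap; not the actual bonding code):

#include <linux/netdevice.h>

/* Hypothetical: move the bond's multicast subscriptions from one slave to
 * another. The caller may not hold RTNL, so take the bond device's address
 * list lock before walking its mc list. */
static void example_mc_swap(struct net_device *bond_dev,
			    struct net_device *new_active,
			    struct net_device *old_active)
{
	struct netdev_hw_addr *ha;

	netif_addr_lock_bh(bond_dev);
	netdev_for_each_mc_addr(ha, bond_dev) {
		if (old_active)
			dev_mc_del(old_active, ha->addr);
		if (new_active)
			dev_mc_add(new_active, ha->addr);
	}
	netif_addr_unlock_bh(bond_dev);
}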
2013-04-19bonding: disable netpoll on enslave failurenikolay@redhat.com
slave_disable_netpoll() is not called upon enslave failure which would lead to a memory leak. Call slave_disable_netpoll() after err_detach as that's the first error path after enabling netpoll on that slave. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19bonding: primary_slave & curr_active_slave are not cleaned on enslave failurenikolay@redhat.com
On enslave failure primary_slave can point to new_slave which is to be freed, and the same applies to curr_active_slave. So check if this is the case and clean up properly after err_detach because that's the first error code path after they're set. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19bonding: vlans don't get deleted on enslave failurenikolay@redhat.com
The main problem is with vid refcount which only gets bumped up. Delete the vlans after err_detach as that's the first error path after the vlans are added. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19bonding: mc addresses don't get deleted on enslave failurenikolay@redhat.com
Add bond_mc_list_flush() after err_detach as that's the first error path after the addresses are added. The main issue is the mc addresses' refcount which only gets bumped up. v2: update log message and don't move code unnecessarily Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19pkt_sched: fix error return code in fw_change_attrs()Wei Yongjun
Fix to return -EINVAL when tb[TCA_FW_MASK] is set and head->mask != 0xFFFFFFFF instead of 0 (ifdef CONFIG_NET_CLS_IND and tb[TCA_FW_INDEV]), as done elsewhere in this function. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
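A hedged sketch of the fix's shape, reduced to the relevant checks (attribute and field names follow the commit text; the surrounding function is elided):

#include <net/netlink.h>
#include <linux/pkt_cls.h>

static int example_check_mask(struct nlattr **tb, u32 head_mask)
{
	int err = -EINVAL;	/* was left at 0 before the fix */
	u32 mask;

	if (tb[TCA_FW_MASK]) {
		mask = nla_get_u32(tb[TCA_FW_MASK]);
		if (mask != head_mask)
			return err;
	} else if (head_mask != 0xFFFFFFFF) {
		return err;
	}

	return 0;
}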
2013-04-19irda: small read past the end of array in debug codeDan Carpenter
The "reason" can come from skb->data[] and it hasn't been capped, so it can be anywhere from 0-255 instead of just 0-6. For example in irlmp_state_dtr() the code does:

reason = skb->data[3];
...
irlmp_disconnect_indication(self, reason, skb);

Also LMREASON has a couple of other values which don't have entries in the irlmp_reasons[] array. And 0xff is a valid reason as well which means "unknown". So far as I can see we don't actually care about "reason" except in the debug code. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
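A minimal sketch of a bounds-capped lookup, assuming the existing irlmp_reasons[] string array (the helper name is hypothetical):

#include <linux/kernel.h>

static const char *example_reason_str(u8 reason)
{
	/* Cap the index so a reason byte taken straight from skb->data[]
	 * cannot read past the end of the array; the last entry can serve
	 * as the "unknown" string. */
	reason = min_t(size_t, reason, ARRAY_SIZE(irlmp_reasons) - 1);
	return irlmp_reasons[reason];
}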
2013-04-19sparc64: Fix race in TLB batch processing.David S. Miller
As reported by Dave Kleikamp, when we emit cross calls to do batched TLB flush processing we have a race because we do not synchronize on the sibling cpus completing the cross call. So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.) and either flushes are missed or flushes will flush the wrong addresses.

Fix this by using generic infrastructure to synchronize on the completion of the cross call. This first required getting the flush_tlb_pending() call out from switch_to() which operates with locks held and interrupts disabled. The problem is that smp_call_function_many() cannot be invoked with IRQs disabled and this is explicitly checked for with WARN_ON_ONCE().

We get the batch processing outside of locked IRQ disabled sections by using some ideas from the powerpc port. Namely, we only batch inside of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a region, we flush TLBs synchronously.

1) Get rid of xcall_flush_tlb_pending and per-cpu type implementations.

2) Do TLB batch cross calls instead via:
   smp_call_function_many()
     tlb_pending_func()
       __flush_tlb_pending()

3) Batch only in lazy mmu sequences:
   a) Add 'active' member to struct tlb_batch
   b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
   c) Set 'active' in arch_enter_lazy_mmu_mode()
   d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode()
   e) Check 'active' in tlb_batch_add_one() and do a synchronous flush if it's clear.

4) Add infrastructure for synchronous TLB page flushes.
   a) Implement __flush_tlb_page and per-cpu variants, patch as needed.
   b) Likewise for xcall_flush_tlb_page.
   c) Implement smp_flush_tlb_page() to invoke the cross-call.
   d) Wire up global_flush_tlb_page() to the right routine based upon CONFIG_SMP

5) It turns out that singleton batches are very common, 2 out of every 3 batch flushes have only a single entry in them. The batch flush waiting is very expensive, both because of the poll on sibling cpu completion, as well as because passing the tlb batch pointer to the sibling cpus invokes a shared memory dereference. Therefore, in flush_tlb_pending(), if there is only one entry in the batch, perform a completely asynchronous global_flush_tlb_page() instead.

Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
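A hedged sketch of step 3, the lazy-MMU gating (struct and function bodies are reduced to the essentials and are not the actual sparc64 code; flush_tlb_pending() and global_flush_tlb_page() are assumed to be the routines named above):

#include <linux/percpu.h>
#include <linux/mm_types.h>

#define EXAMPLE_TLB_BATCH_NR 192

struct example_tlb_batch {
	bool active;			/* inside a lazy-MMU region? */
	struct mm_struct *mm;
	unsigned long tlb_nr;
	unsigned long vaddrs[EXAMPLE_TLB_BATCH_NR];
};

static DEFINE_PER_CPU(struct example_tlb_batch, example_batch);

void arch_enter_lazy_mmu_mode(void)
{
	this_cpu_ptr(&example_batch)->active = true;
}

void arch_leave_lazy_mmu_mode(void)
{
	struct example_tlb_batch *tb = this_cpu_ptr(&example_batch);

	if (tb->tlb_nr)
		flush_tlb_pending();	/* run whatever was batched */
	tb->active = false;
}

static void example_tlb_batch_add_one(struct mm_struct *mm, unsigned long vaddr)
{
	struct example_tlb_batch *tb = this_cpu_ptr(&example_batch);

	if (!tb->active) {
		/* Not in a lazy-MMU region: flush this page synchronously. */
		global_flush_tlb_page(mm, vaddr);
		return;
	}

	tb->mm = mm;
	tb->vaddrs[tb->tlb_nr++] = vaddr;
	if (tb->tlb_nr == EXAMPLE_TLB_BATCH_NR)
		flush_tlb_pending();
}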
2013-04-19ARM: 7699/1: sched_clock: Add more notrace to prevent recursionStephen Boyd
cyc_to_sched_clock() is called by sched_clock() and cyc_to_ns() is called by cyc_to_sched_clock(). I suspect that some compilers inline both of these functions into sched_clock() and so we've been getting away without having a notrace marking. It seems that my compiler isn't inlining cyc_to_sched_clock() though, so I'm hitting a recursion bug when I enable the function graph tracer, causing my system to crash. Marking these functions notrace fixes it. Technically cyc_to_ns() doesn't need the notrace because it's already marked inline, but let's just add it so that if we ever remove inline from that function it doesn't blow up. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
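A hedged sketch of what the annotation looks like (signatures simplified; the real helpers live in arch/arm/kernel/sched_clock.c):

#include <linux/compiler.h>
#include <linux/types.h>

/* notrace keeps the function graph tracer out of the sched_clock() path;
 * the tracer itself calls sched_clock(), so instrumenting these helpers
 * would recurse. */
static notrace u64 cyc_to_ns(u64 cyc, u32 mult, u32 shift)
{
	return (cyc * mult) >> shift;
}

static notrace unsigned long long cyc_to_sched_clock(u32 cyc, u32 mask,
						      u32 mult, u32 shift)
{
	return cyc_to_ns(cyc & mask, mult, shift);
}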
2013-04-19bond: add support to read speed and duplex via ethtoolAndy Gospodarek
This patch adds support for the get_settings ethtool op to the bonding driver. This was motivated by users who wanted to get the speed of the bond and compare that against throughput to understand utilization. The behavior before this patch was added was problematic when computing line utilization after trying to get link-speed and throughput via SNMP.

Output from ethtool looks like this for a round-robin bond:

Settings for bond0:
	Supported ports: [ ]
	Supported link modes:   Not reported
	Supported pause frame use: No
	Supports auto-negotiation: No
	Advertised link modes:  Not reported
	Advertised pause frame use: No
	Advertised auto-negotiation: No
	Speed: 11000Mb/s
	Duplex: Full
	Port: Other
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: off
	MDI-X: Unknown
	Link detected: yes

I tested this and verified it works as expected. A test was also done on a version backported to an older kernel and it worked well there.

v2: Switch to using ethtool_cmd_speed_set to set speed, added check to SLAVE_IS_OK for each slave in bond, dropped mode-specific calculations as they were not needed, and set port type to 'Other.'
v3: Fix useless assignment and checkpatch warning.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net> Reviewed-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
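A hedged sketch of the reported behaviour, summing the speed of all usable slaves (not the exact driver code; bonding iteration details are simplified):

#include <linux/ethtool.h>
#include <linux/netdevice.h>
#include "bonding.h"

static int example_bond_get_settings(struct net_device *bond_dev,
				     struct ethtool_cmd *ecmd)
{
	struct bonding *bond = netdev_priv(bond_dev);
	unsigned long speed = 0;
	struct slave *slave;
	int i;

	ecmd->duplex = DUPLEX_UNKNOWN;
	ecmd->port = PORT_OTHER;

	bond_for_each_slave(bond, slave, i) {
		if (SLAVE_IS_OK(slave) && slave->speed != SPEED_UNKNOWN) {
			/* Report the aggregate speed of all active slaves. */
			speed += slave->speed;
			if (ecmd->duplex == DUPLEX_UNKNOWN &&
			    slave->duplex != DUPLEX_UNKNOWN)
				ecmd->duplex = slave->duplex;
		}
	}
	ethtool_cmd_speed_set(ecmd, speed ? speed : SPEED_UNKNOWN);

	return 0;
}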
2013-04-19packet: move hw/sw timestamp extraction into a small helperDaniel Borkmann
This patch introduces a small, internal helper function, that is used by PF_PACKET. Based on the flags that are passed, it extracts the packet timestamp in the receive path. This is merely a refactoring to remove some duplicate code in tpacket_rcv(), to make it more readable, and to enable others to use this function in PF_PACKET as well, e.g. for TX. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
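A hedged sketch of such a helper (hypothetical name and simplified return convention, not the in-tree function):

#include <linux/skbuff.h>
#include <linux/net_tstamp.h>
#include <linux/ktime.h>

static bool example_get_timestamp(struct sk_buff *skb, struct timespec *ts,
				  unsigned int flags)
{
	struct skb_shared_hwtstamps *hwts = skb_hwtstamps(skb);

	/* Prefer the raw hardware timestamp when the socket asked for it
	 * and the driver provided one. */
	if ((flags & SOF_TIMESTAMPING_RAW_HARDWARE) && hwts->hwtstamp.tv64) {
		*ts = ktime_to_timespec(hwts->hwtstamp);
		return true;
	}

	/* Otherwise fall back to the software timestamp, if any. */
	if (skb->tstamp.tv64) {
		*ts = ktime_to_timespec(skb->tstamp);
		return true;
	}

	return false;
}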
2013-04-19net: socket: move ktime2ts to ktime header apiDaniel Borkmann
Currently, ktime2ts is a small helper function that is only used in net/socket.c. Move this helper into the ktime API as a small inline function, so that i) it's maintained together with ktime routines, and ii) also other files can make use of it. The function is named ktime_to_timespec_cond() and placed into the generic part of ktime, since we internally make use of ktime_to_timespec(). ktime_to_timespec() itself does not check the ktime variable for zero, hence, we name this function ktime_to_timespec_cond() for only a conditional conversion, and adapt its users to it. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
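Given that description, the helper is essentially a guarded ktime_to_timespec(), roughly along these lines (pre-timespec64 era API):

#include <linux/ktime.h>

/* Convert a ktime_t to a timespec, but only if it is non-zero.
 * Returns true if *ts was written. */
static inline bool ktime_to_timespec_cond(const ktime_t kt, struct timespec *ts)
{
	if (kt.tv64) {
		*ts = ktime_to_timespec(kt);
		return true;
	}
	return false;
}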
2013-04-19net: Add .gitignore to networking selftests directory.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19net: Add missing netdev feature strings for NETIF_F_HW_VLAN_STAG_*David S. Miller
Noticed by Ben Hutchings. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19Merge branch 'qlcnic'David S. Miller
Rajesh Borundia says:
====================
* "qlcnic: Change 82xx adapter VLAN id endian type".
  - Adapter requires VLAN id in little endian. VLAN id was being converted to __le16 and then passed as a parameter. Pass VLAN id as u16 and then use cpu_to_le16 at appropriate places. It is appropriate for net-next as SR-IOV patches have a dependency on it.
* "qlcnic: Fix loopback test for SR-IOV PF".
  - It is appropriate for net-next as change is needed for SRIOV PF only.
* Remaining patches add enhancements to SR-IOV functionality like
  - FLR handling
  - Adapter reset recovery handling
  - iproute2 tool support for configuring MAC address, Tx rate and VLAN id.
  - Mailbox polling support for SR-IOV PF in case mailbox interrupts are disabled.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19qlcnic: Update version to 5.2.41Rajesh Borundia
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19qlcnic: Support polling for mailbox events.Rajesh Borundia
o When mailbox interrupt is disabled PF should be able to process request from VF. Enable polling for such cases. Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19qlcnic: Fix loopback test for SR-IOV PF.Rajesh Borundia
o Do not disable mailbox interrupts while running loopback test through SR-IOV PF. Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19qlcnic: Support VLAN id config.Rajesh Borundia
o Add support for VLAN id configuration per VF using iproute2 tool.
o VLAN id's 1-4094 are treated as PVID by the PF and Guest VLAN tagging is not allowed by default.
o PVID is disabled when the VLAN id is set to 0
o Guest VLAN tagging is allowed when the VLAN id is set to 4095.
o Only one Guest VLAN id is supported.
o VLAN id can be changed only when the VF driver is not loaded.
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19qlcnic: Support MAC address, Tx rate config.Rajesh Borundia
o Add support for MAC address and Tx rate configuration per VF via iproute2 tool.
o Tx rate change is allowed while the guest is running and the VF driver is loaded.
o MAC address change is allowed only when VF driver is not loaded.
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19qlcnic: VF reset recovery implementation.Rajesh Borundia
o Implement recovery mechanism for VF to recover from adapter resets. Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19qlcnic: VF FLR implementation.Rajesh Borundia
o FLR from Hypervisor - When hypervisor issues a VF FLR request, adapter notifies the parent PF driver of the FLR request for PF driver to perform any cleanup on behalf of that VF.
o FLR from VF Driver - VF driver may initiate a VF FLR request, if VF state needs to be cleaned up before a re-initialization. VF re-initialization during kdump is an example.
o PF driver cleans up all resources allocated on behalf of a VF, on VF FLR notifications from the adapter or from the VF driver.
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: Sucheta Chakraborty <sucheta.chakraborty@qlogic.com> Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19qlcnic: Change 82xx adapter VLAN id endian type.Rajesh Borundia
o 82xx adapter requires VLAN id in little endian format. Instead of passing vlan id parameter as __le16, pass the parameter as u16 and use cpu_to_le16 at appropriate places. Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
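A minimal illustration of the convention, assuming a hypothetical request structure (names made up for the example):

#include <linux/types.h>
#include <asm/byteorder.h>

struct example_hw_req {
	__le16 vlan_id;		/* adapter expects little endian */
	/* ... */
};

/* Callers keep the VLAN id as a plain u16; the byte order conversion
 * happens only where the value crosses into the hardware request. */
static void example_fill_req(struct example_hw_req *req, u16 vlan_id)
{
	req->vlan_id = cpu_to_le16(vlan_id);
}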
2013-04-19Merge branch 'netlink-mmap'David S. Miller
Patrick McHardy says:
====================
The following patches contain an implementation of memory mapped I/O for netlink. The implementation is modelled after AF_PACKET memory mapped I/O with a few differences:

- In order to perform memory mapped I/O to userspace, the kernel allocates skbs with the data area pointing to the data area of the mapped frames. All netlink subsystems assume a linear data area, so for the sake of simplicity, the mapped data area is not attached to the paged area but to skb->data. This requires introduction of a special skb allocation function that just allocates an skb head without the data area. Since this is a quite rare use case, I introduced a new function based on __alloc_skb instead of splitting it up into head and data allocation. The alternative would be to introduce an __alloc_skb_head and __alloc_skb_data function, which would actually be useful for a specific error case in memory mapped netlink, but would require a couple of extra instructions for the common skb allocation case, so it doesn't really seem worth it.

  In order to get the destination memory area for skb->data before message construction, memory mapped netlink I/O needs to look up the destination socket during allocation instead of during transmission because the ring is owned by the receiving socket/process. A special skb allocation function (netlink_alloc_skb) taking the destination pid as an argument is used for this, all subsystems that want to support memory mapped I/O need to use this function, automatic fallback to the receive queue happens for unconverted subsystems. Dumps automatically use memory mapped I/O if the receiving socket has enabled it.

  The visible effect of looking up the destination socket during allocation instead of transmission is that message ordering in userspace might change in case allocation and transmission aren't performed atomically. This usually doesn't matter since most subsystems have a BKL-like lock like the rtnl mutex, to my knowledge the currently only existing case where it might matter is nfnetlink_queue combined with the recently introduced batched verdicts, but a) that subsystem already includes sequence numbers which allow userspace to reorder messages in case it cares to, also the reordering window is quite small and b) with memory mapped transmission batching can be performed in a subsystem independent manner.

- AF_NETLINK contains flow control for database dumps; with regular I/O, dump continuations are triggered based on the socket's receive queue space and by recvmsg() calls. Since with memory mapped I/O there are no recvmsg() calls under normal operation, this is done in netlink_poll(), under the assumption that userspace has processed all pending frames before invoking poll(), thus the ring is expected to have room for new messages. Dumps currently don't benefit as much as they could from memory mapped I/O because each single continuation requires a poll() call. A more aggressive approach seems like a good idea to me, especially in case the socket is not subscribed to any multicast groups (IOW only receiving explicitly requested data).

Besides that, the memory mapped netlink implementation extends the states defined by AF_PACKET between userspace and the kernel by a SKIP status; this is intended for the case that userspace wants to queue frames (specifically when using nfnetlink_queue, an IDS and stream reassembly, requested by Eric Leblond) for a longer period of time. The kernel skips over all frames marked with SKIP when looking for unused frames and only fails when not finding a free frame or when having skipped the entire ring.

Also noteworthy is memory mapped sendmsg: the kernel performs validation of messages before accepting and processing them. In order to prevent userspace from changing the message contents after validation, the kernel checks that the ring is only mapped once and the file descriptor is not shared (in order to avoid having userspace set up another mapping after the first mentioned check). If either of these is not true, the message is copied to an allocated skb and processed as with regular I/O. I'd especially appreciate review of this part since I'm not really versed in memory, file and process management.

The remaining interesting details are included in the changelogs of the individual patches and the documentation, so I won't repeat them here. As an example, nfnetlink_queue is converted to support memory mapped I/O. Other subsystems that would probably benefit are nfnetlink_log, audit and maybe ISCSI, not sure.

Following are some numbers collected by Florian Westphal based on a slightly older version, which included an experimental patch for the nfnetlink_queue ordering issue.

===

Test hardware is a 12-core machine (Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz). ixgbe interfaces are used (i.e., multiqueue nics); irqs are distributed across the cpus.

I've made several tests. The simple one consists of 3GBit UDP traffic, packets are 1500 bytes in size (i.e., no fragmentation), with a single nfqueue and the test client programs in libmnl examples directory. Packets are sent from one /24 net to another /24 net, i.e. there are a few hundred flows active at any given time.

I've also tested with snort, but I disabled all rules. 6Gbit UDP traffic is generated in the snort case, and 6 nfqueues are used (i.e., 6 snorts run in parallel).

I've tested with 3 different kernels, all based on 3.7.1:
- 3.7.1, without the mmap patches
- 3.7.1, with Patricks mmap patches
- 3.7.1, with mmap patches and extended spinlock to ensure packet ids are monotonically increasing and cannot be re-ordered. This is what we currently ship in our product.

[ the spinlock that is extended is the per nfqueue spinlock, it will be held from the time the netlink skb is allocated until the netlink skb is sent to userspace: http://1984.lsi.us.es/git/nf-next/commit/?h=mmap-netlink3&id=b8eb19c46650fef4e9e4fe53f367f99bbf72afc9 ]

snort is normally used in "batch mode", i.e., after processing 25 packets a single "batch verdict" is sent to accept the packets seen so far.

"mmap snort" means RX_RING + sendmsg(), i.e. TX_RING is not used at this time (except where noted below). One reason is that snort has a reload thread, so the kernel needs to copy; also in the snort case no payload rewrite takes place, so compared to the rx path the tx path is cheap.

Results:

3.7.1, without mmap patches, i.e. recv()+sendmsg() for everyone
nfq-queue:            1.7 gbit out
snort-recv-batch-25   5.1 gbit out
snort-recv-no-batch   3.1 gbit out

3.7.1 + mmap + without extended spinlocked section
nfq-queue:            1.7 gbit out (recv/sendmsg)
nfq-queue-mmap:       2.4 gbit out
snort-mmap-batch-25   5.6 gbit out (warning: since ids can be re-ordered, this version is "broken").
snort-recv-batch-25   5.1 gbit out
snort-mmap-no-batch   4.6 gbit out (i.e., one verdict per packet)

Kernel 3.7.1 + mmap + extended spinlock section:
nfq-queue:            1.4 gbit out
nfq-queue-mmap:       2.3 gbit out
snort:                5.6 gbit out

Conclusions:
- The "extended spinlocked section" hurts performance in the single queue case; with 6 snorts there is no measurable slowdown.
- I tried to re-write the mmap-snort to work without batch verdicts, but results were not very encouraging:

kernel 3.7.1 + mmap (without extended spinlocked section):
snort-mmap-batch-25   5.6 gbit out (what we currently ship)
snort-recv-batch-25   5.1 gbit out (without using mmap)
snort-mmap-batch-1    4.6 gbit out (with mmap but without batch verdicts)
snort-mmap-txring-25  5.2 gbit out (with mmap but without batch verdicts)
snort-mmap-txring-1   4.6 gbit out (with mmap but without batch verdicts)

The difference between the last two is that in the txring-25 case, we put a verdict into the tx ring after every packet, but will only invoke sendmsg(, NULL, 0) after processing 25 packets. So the only difference is the number of sendmsg calls/context switches.

So, i.o.w, kernel 3.7.1 + mmap + the extra locking crap is faster than 3.7.1 + mmap-without-extra-locking and single-verdict-per packet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19nfnetlink: add support for memory mapped netlinkPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netfilter: rename netlink related "pid" variables to "portid"Patrick McHardy
Get rid of the confusing mix of pid and portid and use portid consistently for all netlink related socket identities. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netlink: add documentation for memory mapped I/OPatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netlink: add RX/TX-ring support to netlink diagPatrick McHardy
Based on AF_PACKET. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netlink: add flow control for memory mapped I/OPatrick McHardy
Add flow control for memory mapped RX. Since user-space usually doesn't invoke recvmsg() when using memory mapped I/O, flow control is performed in netlink_poll(). Dumps are allowed to continue if at least half of the ring frames are unused. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
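A hedged sketch of that rule (the ring structure and field names here are purely illustrative, not the real netlink ring):

#include <linux/types.h>

struct example_ring {
	unsigned int frame_max;	/* total number of frames in the RX ring */
	unsigned int unused;	/* frames currently available to the kernel */
};

/* Resume a pending dump from poll() only while at least half of the
 * ring is still unused, so userspace keeps up with the kernel. */
static bool example_dump_may_continue(const struct example_ring *ring)
{
	return ring->unused >= ring->frame_max / 2;
}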
2013-04-19netlink: implement memory mapped recvmsg()Patrick McHardy
Add support for mmap'ed recvmsg(). To allow the kernel to construct messages into the mapped area, a dataless skb is allocated and the data pointer is set to point into the ring frame. This means frames will be delivered to userspace in order of allocation instead of order of transmission. This usually doesn't matter since the order is either not determinable by userspace or message creation/transmission is serialized. The only case where this can have a visible difference is nfnetlink_queue. Userspace can't assume mmap'ed messages have ordered IDs anymore and needs to check this if using batched verdicts. For non-mapped sockets, nothing changes. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netlink: implement memory mapped sendmsg()Patrick McHardy
Add support for mmap'ed sendmsg() to netlink. Since the kernel validates received messages before processing them, the code makes sure userspace can't modify the message contents after invoking sendmsg(). To do that only a single mapping of the TX ring is allowed to exist and the socket must not be shared. If either of these two conditions does not hold, it falls back to copying. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netlink: add mmap'ed netlink helper functionsPatrick McHardy
Add helper functions for looking up mmap'ed frame headers, reading and writing their status, allocating skbs with mmap'ed data areas and a poll function. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netlink: mmaped netlink: ring setupPatrick McHardy
Add support for mmap'ed RX and TX ring setup and teardown based on the af_packet.c code. The following patches will use this to add the real mmap'ed receive and transmit functionality. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netlink: add netlink_skb_set_owner_r()Patrick McHardy
For mmap'ed I/O a netlink specific skb destructor needs to be invoked after the final kfree_skb() to clean up state. This doesn't work currently since the skb's ownership is transfered to the receiving socket using skb_set_owner_r(), which orphans the skb, thereby invoking the destructor prematurely. Since netlink doesn't account skbs to the originating socket, there's no need to orphan the skb. Add a netlink specific skb_set_owner_r() variant that does not orphan the skb and use a netlink specific destructor to call sock_rfree(). Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
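A hedged sketch of the variant described (names prefixed with example_; the accounting mirrors skb_set_owner_r() minus the skb_orphan() call):

#include <linux/skbuff.h>
#include <net/sock.h>

static void example_netlink_skb_destructor(struct sk_buff *skb)
{
	/* mmap-specific ring cleanup would go here, then release the
	 * receive-memory accounting taken below. */
	sock_rfree(skb);
}

static void example_netlink_skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
	WARN_ON(skb->sk != NULL);	/* skb must not already be owned */
	skb->sk = sk;
	skb->destructor = example_netlink_skb_destructor;
	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
	sk_mem_charge(sk, skb->truesize);
}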
2013-04-19netlink: don't orphan skb in netlink_trim()Patrick McHardy
Netlink doesn't account skbs to the sending socket, so there's no need to orphan the skb before trimming it. Removing the skb_orphan() call is required for mmap'ed netlink, which uses a netlink specific skb destructor that must not be invoked before the final freeing of the skb. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19net: add function to allocate sk_buff head without data areaPatrick McHardy
Add a function to allocate a sk_buff head without any data. This will be used by memory mapped netlink to attach data from the mmaped area to the skb. Additionally change skb_release_all() to check whether the skb has a data area to allow the skb destructor to clear the data pointer in case only a head has been allocated. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netlink: rename ssk to sk in struct netlink_skb_paramsPatrick McHardy
Memory mapped netlink needs to store the receiving userspace socket when sending from the kernel to userspace. Rename 'ssk' to 'sk' to avoid confusion. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19netlink: add symbolic value for congested statePatrick McHardy
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19Merge branch '8021ad'David S. Miller
Patrick McHardy says:
====================
The following patches add support for 802.1ad (provider tagging) to the VLAN driver. The patchset consists of the following parts:

- renaming of the NETIF_F_HW_VLAN feature flags to indicate that they only operate on CTAGs

- preparation for 802.1ad VLAN filtering offload by adding a proto argument to the rx_{add,kill}_vid net_device_ops callbacks

- preparation of the VLAN code to support multiple protocols by making the protocol used for tagging a property of the VLAN device and converting the device lookup functions accordingly

- second step of preparation of the VLAN code by making the packet tagging functions take a protocol argument

- introduction of 802.1ad support in the VLAN code, consisting mainly of checking for ETH_P_8021AD in a couple of places and adjusting the netdevice offload feature checks to take the protocol into account

- announcement of STAG offloading capabilities in a couple of drivers for virtual network devices

The patchset is based on net-next.git and has been tested with single and double tagging with and without HW acceleration (for CTAGs).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19net: vlan: announce STAG offload capability in some driversPatrick McHardy
- macvlan: propagate STAG filtering capabilities from underlying device
- ifb: announce STAG tagging support in addition to CTAG tagging support
- veth: announce STAG tagging/stripping support in addition to CTAG support
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19net: vlan: add 802.1ad supportPatrick McHardy
Add support for 802.1ad VLAN devices. This mainly consists of checking for ETH_P_8021AD in addition to ETH_P_8021Q in a couple of places and checking offloading capabilities based on the used protocol.

Configuration is done using "ip link":

# ip link add link eth0 eth0.1000 \
	type vlan proto 802.1ad id 1000
# ip link add link eth0.1000 eth0.1000.1000 \
	type vlan proto 802.1q id 1000

52:54:00:12:34:56 > 92:b1:54:28:e4:8c, ethertype 802.1Q (0x8100), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
    20.1.0.2 > 20.1.0.1: ICMP echo request, id 3003, seq 8, length 64
92:b1:54:28:e4:8c > 52:54:00:12:34:56, ethertype 802.1Q-QinQ (0x88a8), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 47944, offset 0, flags [none], proto ICMP (1), length 84)
    20.1.0.1 > 20.1.0.2: ICMP echo reply, id 3003, seq 8, length 64

Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19net: vlan: add protocol argument to packet tagging functionsPatrick McHardy
Add a protocol argument to the VLAN packet tagging functions. In case of HW tagging, we need that protocol available in the ndo_start_xmit functions, so it is stored in a new field in the skb. The new field fits into a hole (on 64 bit) and doesn't increase the skb's size. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
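A hedged sketch of the new calling convention (a simplification, not the exact in-tree helper):

#include <linux/skbuff.h>
#include <linux/if_vlan.h>

static inline struct sk_buff *example_hwaccel_put_tag(struct sk_buff *skb,
						       __be16 vlan_proto,
						       u16 vlan_tci)
{
	/* Record the tagging protocol (802.1Q or 802.1ad) alongside the
	 * TCI so ndo_start_xmit can emit the right ethertype. */
	skb->vlan_proto = vlan_proto;
	skb->vlan_tci = VLAN_TAG_PRESENT | vlan_tci;
	return skb;
}

A driver would then pass htons(ETH_P_8021Q) for a C-tag or htons(ETH_P_8021AD) for an S-tag when offloading the insertion.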