summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2015-09-10memcg: zap try_get_mem_cgroup_from_pageVladimir Davydov
It is only used in mem_cgroup_try_charge, so fold it in and zap it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10hwpoison: use page_cgroup_ino for filtering by memcgVladimir Davydov
Hwpoison allows to filter pages by memory cgroup ino. Currently, it calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and then its ino using cgroup_ino, but now we have a helper method for that, page_cgroup_ino, so use it instead. This patch also loosens the hwpoison memcg filter dependency rules - it makes it depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP, because hwpoison memcg filter does not require anything (nor it used to) from CONFIG_MEMCG_SWAP side. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10memcg: add page_cgroup_ino helperVladimir Davydov
This patchset introduces a new user API for tracking user memory pages that have not been used for a given period of time. The purpose of this is to provide the userspace with the means of tracking a workload's working set, i.e. the set of pages that are actively used by the workload. Knowing the working set size can be useful for partitioning the system more efficiently, e.g. by tuning memory cgroup limits appropriately, or for job placement within a compute cluster. ==== USE CASES ==== The unified cgroup hierarchy has memory.low and memory.high knobs, which are defined as the low and high boundaries for the workload working set size. However, the working set size of a workload may be unknown or change in time. With this patch set, one can periodically estimate the amount of memory unused by each cgroup and tune their memory.low and memory.high parameters accordingly, therefore optimizing the overall memory utilization. Another use case is balancing workloads within a compute cluster. Knowing how much memory is not really used by a workload unit may help take a more optimal decision when considering migrating the unit to another node within the cluster. Also, as noted by Minchan, this would be useful for per-process reclaim (https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle pages only by smart user memory manager. ==== USER API ==== The user API consists of two new files: * /sys/kernel/mm/page_idle/bitmap. This file implements a bitmap where each bit corresponds to a page, indexed by PFN. When the bit is set, the corresponding page is idle. A page is considered idle if it has not been accessed since it was marked idle. To mark a page idle one should set the bit corresponding to the page by writing to the file. A value written to the file is OR-ed with the current bitmap value. Only user memory pages can be marked idle, for other page types input is silently ignored. Writing to this file beyond max PFN results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set. This file can be used to estimate the amount of pages that are not used by a particular workload as follows: 1. mark all pages of interest idle by setting corresponding bits in the /sys/kernel/mm/page_idle/bitmap 2. wait until the workload accesses its working set 3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set * /proc/kpagecgroup. This file contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Only available when CONFIG_MEMCG is set. This file can be used to find all pages (including unmapped file pages) accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one can then estimate the cgroup working set size. For an example of using these files for estimating the amount of unused memory pages per each memory cgroup, please see the script attached below. ==== REASONING ==== The reason to introduce the new user API instead of using /proc/PID/{clear_refs,smaps} is that the latter has two serious drawbacks: - it does not count unmapped file pages - it affects the reclaimer logic The new API attempts to overcome them both. For more details on how it is achieved, please see the comment to patch 6. ==== PATCHSET STRUCTURE ==== The patch set is organized as follows: - patch 1 adds page_cgroup_ino() helper for the sake of /proc/kpagecgroup and patches 2-3 do related cleanup - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is charged to - patch 5 introduces a new mmu notifier callback, clear_young, which is a lightweight version of clear_flush_young; it is used in patch 6 - patch 6 implements the idle page tracking feature, including the userspace API, /sys/kernel/mm/page_idle/bitmap - patch 7 exports idle flag via /proc/kpageflags ==== SIMILAR WORKS ==== Originally, the patch for tracking idle memory was proposed back in 2011 by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main difference between Michel's patch and this one is that Michel implemented a kernel space daemon for estimating idle memory size per cgroup while this patch only provides the userspace with the minimal API for doing the job, leaving the rest up to the userspace. However, they both share the same idea of Idle/Young page flags to avoid affecting the reclaimer logic. ==== PERFORMANCE EVALUATION ==== SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the performance impact introduced by this patch set. Three runs were carried out: - base: kernel without the patch - patched: patched kernel, the feature is not used - patched-active: patched kernel, 1 minute-period daemon is used for tracking idle memory For tracking idle memory, idlememstat utility was used: https://github.com/locker/idlememstat testcase base patched patched-active compiler 537.40 ( 0.00)% 532.26 (-0.96)% 538.31 ( 0.17)% compress 305.47 ( 0.00)% 301.08 (-1.44)% 300.71 (-1.56)% crypto 284.32 ( 0.00)% 282.21 (-0.74)% 284.87 ( 0.19)% derby 411.05 ( 0.00)% 413.44 ( 0.58)% 412.07 ( 0.25)% mpegaudio 189.96 ( 0.00)% 190.87 ( 0.48)% 189.42 (-0.28)% scimark.large 46.85 ( 0.00)% 46.41 (-0.94)% 47.83 ( 2.09)% scimark.small 412.91 ( 0.00)% 415.41 ( 0.61)% 421.17 ( 2.00)% serial 204.23 ( 0.00)% 213.46 ( 4.52)% 203.17 (-0.52)% startup 36.76 ( 0.00)% 35.49 (-3.45)% 35.64 (-3.05)% sunflow 115.34 ( 0.00)% 115.08 (-0.23)% 117.37 ( 1.76)% xml 620.55 ( 0.00)% 619.95 (-0.10)% 620.39 (-0.03)% composite 211.50 ( 0.00)% 211.15 (-0.17)% 211.67 ( 0.08)% time idlememstat: 17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k 448inputs+40outputs (1major+36052minor)pagefaults 0swaps ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ==== #! /usr/bin/python # import os import stat import errno import struct CGROUP_MOUNT = "/sys/fs/cgroup/memory" BUFSIZE = 8 * 1024 # must be multiple of 8 def get_hugepage_size(): with open("/proc/meminfo", "r") as f: for s in f: k, v = s.split(":") if k == "Hugepagesize": return int(v.split()[0]) * 1024 PAGE_SIZE = os.sysconf("SC_PAGE_SIZE") HUGEPAGE_SIZE = get_hugepage_size() def set_idle(): f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE) while True: try: f.write(struct.pack("Q", pow(2, 64) - 1)) except IOError as err: if err.errno == errno.ENXIO: break raise f.close() def count_idle(): f_flags = open("/proc/kpageflags", "rb", BUFSIZE) f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE) with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f: while f.read(BUFSIZE): pass # update idle flag idlememsz = {} while True: s1, s2 = f_flags.read(8), f_cgroup.read(8) if not s1 or not s2: break flags, = struct.unpack('Q', s1) cgino, = struct.unpack('Q', s2) unevictable = (flags >> 18) & 1 huge = (flags >> 22) & 1 idle = (flags >> 25) & 1 if idle and not unevictable: idlememsz[cgino] = idlememsz.get(cgino, 0) + \ (HUGEPAGE_SIZE if huge else PAGE_SIZE) f_flags.close() f_cgroup.close() return idlememsz if __name__ == "__main__": print "Setting the idle flag for each page..." set_idle() raw_input("Wait until the workload accesses its working set, " "then press Enter") print "Counting idle pages..." idlememsz = count_idle() for dir, subdirs, files in os.walk(CGROUP_MOUNT): ino = os.stat(dir)[stat.ST_INO] print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB" ==== END SCRIPT ==== This patch (of 8): Add page_cgroup_ino() helper to memcg. This function returns the inode number of the closest online ancestor of the memory cgroup a page is charged to. It is required for exporting information about which page is charged to which cgroup to userspace, which will be introduced by a following patch. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10zswap: update docs for runtime-changeable attributesDan Streetman
Change the Documentation/vm/zswap.txt doc to indicate that the "zpool" and "compressor" params are now changeable at runtime. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10zswap: change zpool/compressor at runtimeDan Streetman
Update the zpool and compressor parameters to be changeable at runtime. When changed, a new pool is created with the requested zpool/compressor, and added as the current pool at the front of the pool list. Previous pools remain in the list only to remove existing compressed pages from. The old pool(s) are removed once they become empty. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10zswap: dynamic pool creationDan Streetman
Add dynamic creation of pools. Move the static crypto compression per-cpu transforms into each pool. Add a pointer to zswap_entry to the pool it's in. This is required by the following patch which enables changing the zswap zpool and compressor params at runtime. [akpm@linux-foundation.org: fix merge snafus] Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10zpool: add zpool_has_pool()Dan Streetman
This series makes creation of the zpool and compressor dynamic, so that they can be changed at runtime. This makes using/configuring zswap easier, as before this zswap had to be configured at boot time, using boot params. This uses a single list to track both the zpool and compressor together, although Seth had mentioned an alternative which is to track the zpools and compressors using separate lists. In the most common case, only a single zpool and single compressor, using one list is slightly simpler than using two lists, and for the uncommon case of multiple zpools and/or compressors, using one list is slightly less simple (and uses slightly more memory, probably) than using two lists. This patch (of 4): Add zpool_has_pool() function, indicating if the specified type of zpool is available (i.e. zsmalloc or zbud). This allows checking if a pool is available, without actually trying to allocate it, similar to crypto_has_alg(). This is used by a following patch to zswap that enables the dynamic runtime creation of zswap zpools. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10tcp_cubic: better follow cubic curve after idle periodEric Dumazet
Jana Iyengar found an interesting issue on CUBIC : The epoch is only updated/reset initially and when experiencing losses. The delta "t" of now - epoch_start can be arbitrary large after app idle as well as the bic_target. Consequentially the slope (inverse of ca->cnt) would be really large, and eventually ca->cnt would be lower-bounded in the end to 2 to have delayed-ACK slow-start behavior. This particularly shows up when slow_start_after_idle is disabled as a dangerous cwnd inflation (1.5 x RTT) after few seconds of idle time. Jana initial fix was to reset epoch_start if app limited, but Neal pointed out it would ask the CUBIC algorithm to recalculate the curve so that we again start growing steeply upward from where cwnd is now (as CUBIC does just after a loss). Ideally we'd want the cwnd growth curve to be the same shape, just shifted later in time by the amount of the idle period. Reported-by: Jana Iyengar <jri@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Sangtae Ha <sangtae.ha@gmail.com> Cc: Lawrence Brakmo <lawrence@brakmo.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-10tcp: generate CA_EVENT_TX_START on data framesNeal Cardwell
Issuing a CC TX_START event on control frames like pure ACK is a waste of time, as a CC should not care. Following patch needs this change, as we want CUBIC to properly track idle time at a low cost, with a single TX_START being generated. Yuchung might slightly refine the condition triggering TX_START on a followup patch. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Jana Iyengar <jri@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Sangtae Ha <sangtae.ha@gmail.com> Cc: Lawrence Brakmo <lawrence@brakmo.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-10xen-netfront: respect user provided max_queuesWei Liu
Originally that parameter was always reset to num_online_cpus during module initialisation, which renders it useless. The fix is to only set max_queues to num_online_cpus when user has not provided a value. Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: David Vrabel <david.vrabel@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Tested-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-10xen-netback: respect user provided max_queuesWei Liu
Originally that parameter was always reset to num_online_cpus during module initialisation, which renders it useless. The fix is to only set max_queues to num_online_cpus when user has not provided a value. Reported-by: Johnny Strom <johnny.strom@linuxsolutions.fi> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-10r8169: Fix sleeping function called during get_stats64, v2Corinna Vinschen
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=104031 Fixes: 6e85d5ad36a26debc23a9a865c029cbe242b2dc8 Based on the discussion starting at http://www.spinics.net/lists/netdev/msg342193.html Tested locally on RTL8168evl/8111evl with various concurrent processes accessing /proc/net/dev while changing the link state as well as removing/reloading the r8169 module. Signed-off-by: Corinna Vinschen <vinschen@redhat.com> Tested-by: poma <pomidorabelisima@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-10drm/i915: Allow DSI dual link to be configured on any pipeGaurav K Singh
Just like single link MIPI panels, similarly for dual link panels, pipe to be configured is based on the DVO port from VBT Block 2. In hardware, Port A is mapped with Pipe A and Port C is mapped with Pipe B. This issue got introduced in - commit 7e9804fdcffc650515c60f524b8b2076ee59e710 Author: Jani Nikula <jani.nikula@intel.com> Date: Fri Jan 16 14:27:23 2015 +0200 drm/i915/dsi: add drm mipi dsi host support Cc: stable@vger.kernel.org # v4.0 Signed-off-by: Gaurav K Singh <gaurav.k.singh@intel.com> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2015-09-10drm/i915: Don't try to use DDR DVFS on CHV when disabled in the BIOSVille Syrjälä
If one disables DDR DVFS in the BIOS, Punit will apparently ignores all DDR DVFS request. Currently we assume that DDR DVFS is always operational, which leads to errors in dmesg when the DDR DVFS requests time out. Fix the problem by gently prodding Punit during driver load to find out whether it will respond to DDR DVFS requests. If the request times out, we assume that DDR DVFS has been permanenly disabled in the BIOS and no longer perster the Punit about it. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=91629 Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Reviewed-by: Clint Taylor <Clinton.A.Taylor@intel.com> Tested-by: Clint Taylor <Clinton.A.Taylor@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2015-09-10drm/i915: Fix CSR MMIO address checkTakashi Iwai
Fix a wrong logical AND (&&) used for the range check of CSR MMIO. Spotted nicely by gcc -Wlogical-op flag: drivers/gpu/drm/i915/intel_csr.c: In function ‘finish_csr_load’: drivers/gpu/drm/i915/intel_csr.c:353:41: warning: logical ‘and’ of mutually exclusive tests is always false [-Wlogical-op] Fixes: eb805623d8b1 ('drm/i915/skl: Add support to load SKL CSR firmware.') Cc: <stable@vger.kernel.org> # v4.2 Signed-off-by: Takashi Iwai <tiwai@suse.de> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: Animesh Manna <animesh.manna@intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2015-09-09ether: add IEEE 1722 ethertype - TSNHenrik Austad
IEEE 1722 describes AVB (later renamed to TSN - Time Sensitive Networking), a protocol, encapsualtion and synchronization to utilize standard networks for audio/video (and later other time-sensitive) streams. This standard uses ethertype 0x22F0. http://standards.ieee.org/develop/regauth/ethertype/eth.txt This is a respin of a previous patch ("ether: add AVB frame type ETH_P_AVB") CC: "David S. Miller" <davem@davemloft.net> CC: netdev@vger.kernel.org CC: linux-api@vger.kernel.org CC: linux-kernel@vger.kernel.org Signed-off-by: Henrik Austad <henrik@austad.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-10elf-em.h: move EM_MICROBLAZE to the common headerMike Frysinger
The linux/audit.h header uses EM_MICROBLAZE in order to define AUDIT_ARCH_MICROBLAZE, but it's only available in the microblaze asm headers. Move it to the common elf-em.h header so that the define can be used on non-microblaze systems. Otherwise we get build errors that EM_MICROBLAZE isn't defined when we try to use the AUDIT_ARCH_MICROBLAZE symbol. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Michal Simek <michal.simek@xilinx.com>
2015-09-09netlink, mmap: fix edge-case leakages in nf queue zero-copyDaniel Borkmann
When netlink mmap on receive side is the consumer of nf queue data, it can happen that in some edge cases, we write skb shared info into the user space mmap buffer: Assume a possible rx ring frame size of only 4096, and the network skb, which is being zero-copied into the netlink skb, contains page frags with an overall skb->len larger than the linear part of the netlink skb. skb_zerocopy(), which is generic and thus not aware of the fact that shared info cannot be accessed for such skbs then tries to write and fill frags, thus leaking kernel data/pointers and in some corner cases possibly writing out of bounds of the mmap area (when filling the last slot in the ring buffer this way). I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has an advertised length larger than 4096, where the linear part is visible at the slot beginning, and the leaked sizeof(struct skb_shared_info) has been written to the beginning of the next slot (also corrupting the struct nl_mmap_hdr slot header incl. status etc), since skb->end points to skb->data + ring->frame_size - NL_MMAP_HDRLEN. The fix adds and lets __netlink_alloc_skb() take the actual needed linear room for the network skb + meta data into account. It's completely irrelevant for non-mmaped netlink sockets, but in case mmap sockets are used, it can be decided whether the available skb_tailroom() is really large enough for the buffer, or whether it needs to internally fallback to a normal alloc_skb(). >From nf queue side, the information whether the destination port is an mmap RX ring is not really available without extra port-to-socket lookup, thus it can only be determined in lower layers i.e. when __netlink_alloc_skb() is called that checks internally for this. I chose to add the extra ldiff parameter as mmap will then still work: We have data_len and hlen in nfqnl_build_packet_message(), data_len is the full length (capped at queue->copy_range) for skb_zerocopy() and hlen some possible part of data_len that needs to be copied; the rem_len variable indicates the needed remaining linear mmap space. The only other workaround in nf queue internally would be after allocation time by f.e. cap'ing the data_len to the skb_tailroom() iff we deal with an mmap skb, but that would 1) expose the fact that we use a mmap skb to upper layers, and 2) trim the skb where we otherwise could just have moved the full skb into the normal receive queue. After the patch, in my test case the ring slot doesn't fit and therefore shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and thus needs to be picked up via recv(). Fixes: 3ab1f683bf8b ("nfnetlink: add support for memory mapped netlink") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09netlink, mmap: don't walk rx ring on poll if receive queue non-emptyDaniel Borkmann
In case of netlink mmap, there can be situations where received frames have to be placed into the normal receive queue. The ring buffer indicates this through NL_MMAP_STATUS_COPY, so the user is asked to pick them up via recvmsg(2) syscall, and to put the slot back to NL_MMAP_STATUS_UNUSED. Commit 0ef707700f1c ("netlink: rx mmap: fix POLLIN condition") changed polling, so that we walk in the worst case the whole ring through the new netlink_has_valid_frame(), for example, when the ring would have no NL_MMAP_STATUS_VALID, but at least one NL_MMAP_STATUS_COPY frame. Since we do a datagram_poll() already earlier to pick up a mask that could possibly contain POLLIN | POLLRDNORM already (due to NL_MMAP_STATUS_COPY), we can skip checking the rx ring entirely. In case the kernel is compiled with !CONFIG_NETLINK_MMAP, then all this is irrelevant anyway as netlink_poll() is just defined as datagram_poll(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09cxgb4: changes for new firmware 1.14.4.0Hariprasad Shenai
Incorporate fw_ldst_cmd structure change for new firmware and also update version string for the same Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09net: fec: add netif status check before set mac addressNimrod Andy
There exist one issue by below case that case system hang: ifconfig eth0 down ifconfig eth0 hw ether 00:10:19:19:81:19 After eth0 down, all fec clocks are gated off. In the .fec_set_mac_address() function, it will set new MAC address to registers, which causes system hang. So it needs to add netif status check to avoid registers access when clocks are gated off. Until eth0 up the new MAC address are wrote into related registers. V2: As Lucas Stach's suggestion, add a comment in the code to explain why it needed. CC: Lucas Stach <l.stach@pengutronix.de> CC: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09Merge branch 'r8152-autoresume'David S. Miller
Hayes Wang says: ==================== r8152: fix the autoresume may fail Fix the autosuspend issues which occur about linking change. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09r8152: fix the runtime suspend issueshayeswang
Fix the runtime suspend issues result from the linking change. Case 1: a) link down occurs. b) driver disable tx/rx. c) autosuspend occurs. d) hw linking up. e) device suspends without enabling tx/rx. f) couldn't wake up when receiving packets. Case 2: a) Nway results in linking down. b) autosuspend occurs. c) device suspends. d) device may not wake up when linking up. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09r8152: split DRIVER_VERSIONhayeswang
Split DRIVER_VERSION into NETNEXT_VERSION and NET_VERSION. Then, according to the value of DRIVER_VERSION, we could know which patches are used generally without comparing the source code. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09ipv6: fix ifnullfree.cocci warningsWu Fengguang
net/ipv6/route.c:2946:3-8: WARNING: NULL check before freeing functions like kfree, debugfs_remove, debugfs_remove_recursive or usb_free_urb is not needed. Maybe consider reorganizing relevant code to avoid passing NULL values. NULL check before some freeing functions is not needed. Based on checkpatch warning "kfree(NULL) is safe this check is probably not required" and kfreeaddr.cocci by Julia Lawall. Generated by: scripts/coccinelle/free/ifnullfree.cocci CC: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09add microchip LAN88xx phy driverWoojung.Huh@microchip.com
Add Microchip LAN88XX phy driver for phylib. Signed-off-by: Woojung Huh <woojung.huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09stmmac: fix check for phydev being openAlexey Brodkin
Current check of phydev with IS_ERR(phydev) may make not much sense because of_phy_connect() returns NULL on failure instead of error value. Still for checking result of phy_connect() IS_ERR() makes perfect sense. So let's use combined check IS_ERR_OR_NULL() that covers both cases. Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com> Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Cc: David Miller <davem@davemloft.net> Signed-off-by: Alexey Brodkin <abrodkin@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09net: qlcnic: delete redundant memsetsRasmus Villemoes
In all cases, mbx->req.arg and mbx->rsp.arg have just been allocated using kcalloc(), so these six memsets are redundant. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09net: mv643xx_eth: use kzallocRasmus Villemoes
The double memset is a little ugly; using kzalloc avoids it altogether. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09net: jme: use kzalloc() instead of kmalloc+memsetRasmus Villemoes
Using kzalloc saves a tiny bit on .text. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09net: cavium: liquidio: use kzalloc in setup_glist()Rasmus Villemoes
We save a little .text and get rid of the sizeof(...) style inconsistency. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09Merge tag 'qcom-soc-for-4.3-rc2' of ↵Kevin Hilman
git://codeaurora.org/quic/kernel/agross-msm into next/late Qualcomm ARM Based SoC Updates for 4.3-rc2 * Fix errant private access in SMEM * Fix use of correct remote processor ID in SMD transactions * Correct SMD fBLOCKREADINTR handling * tag 'qcom-soc-for-4.3-rc2' of git://codeaurora.org/quic/kernel/agross-msm: soc: qcom: smd: Correct fBLOCKREADINTR handling soc: qcom: smd: Use correct remote processor ID soc: qcom: smem: Fix errant private access devicetree: soc: Add Qualcomm SMD based RPM DT binding soc: qcom: Driver for the Qualcomm RPM over SMD soc: qcom: Add Shared Memory Driver soc: qcom: Add device tree binding for Shared Memory Device drivers: qcom: Select QCOM_SCM unconditionally for QCOM_PM soc: qcom: Add Shared Memory Manager driver
2015-09-09Merge tag 'qcom-dt-for-4.3-rc2' of ↵Kevin Hilman
git://codeaurora.org/quic/kernel/agross-msm into next/late Qualcomm ARM Based Device Tree Updates for v4.3-rc2 * Add labels for serial nodes to be used for aliasing and stdout-path * Add stdout-path for APQ8064 Compulab QS600 * Add stdout-path for APQ8064 Inforce 6410 * Add stdout-path for APQ8074 Dragonboard * Add stdout-path for APQ8084 Inforce 6540 * Add stdout-path for APQ8084 MTP * Add stdout-path for IPQ8064 AP148 * Add stdout-path for MSM8660 Surf * Add stdout-path for MSM8960 CDP * Add stdout-path for MSM8974 Xperia Honami * tag 'qcom-dt-for-4.3-rc2' of git://codeaurora.org/quic/kernel/agross-msm: (24 commits) ARM: dts: qcom: msm8974-sony-xperia-honami: Use stdout-path ARM: dts: qcom: msm8960-cdp: Use stdout-path ARM: dts: qcom: msm8660-surf: Use stdout-path ARM: dts: qcom: ipq8064-ap148: Use stdout-path ARM: dts: qcom: apq8084-mtp: Use stdout-path ARM: dts: qcom: apq8084-ifc6540: Use stdout-path ARM: dts: qcom: apq8074-dragonboard: Use stdout-path ARM: dts: qcom: apq8064-ifc6410: Use stdout-path ARM: dts: qcom: apq8064-cm-qs600: Use stdout-path ARM: dts: qcom: Label serial nodes for aliasing and stdout-path ARM: dts: qs600: Add real regulators to sdcc ARM: dts: ifc6410: add real regulators for sdcc nodes. ARM: dts: apq8064: remove temporary fixed regulator for mmc ARM: dts: apq8064: fix missing gsbi cell-index ARM: dts: apq8064: Add DT support for GSBI6 and for UART pin mux ARM: dts: apq8064: add pm8921 mpp support ARM: dts: apq8064: Add pm8921 mfd and its gpio node ARM: dts: msm8974: Add smem reservation and node ARM: dts: msm8974: Add tcsr mutex node ARM: dts: qcom: Add ks8851 node for wired ethernet ...
2015-09-09Merge branch 'next/defconfig' into next/lateKevin Hilman
* next/defconfig: (45 commits) ARM: multi_v7_defconfig: Enable PBIAS regulator ARM: add TC2 PM support to multi_v7_defconfig ARM: tegra: Update multi_v7_defconfig ARM: tegra: Update default configuration ARM: at91/defconfig: at91_dt: remove ARM_AT91_ETHER ARM: at91/defconfig: at91_dt: enable DRM hlcdc support ARM: at91: at91_dt_defconfig: enable ISI and ov2640 support ARM: multi_v7_defconfig: Enable Allwinner P2WI, PWM, DMA_SUN6I, cryptodev ARM: sunxi_defconfig: Enable DMA_SUN6I, P2WI, PWM, cryptodev, EXTCON, FHANDLE ARM: shmobile: Enable fixed voltage regulator in shmobile_defconfig ARM: multi_v7_defconfig: Select MX6UL and MX7D ARM: prima2_defconfig: enable build for hwspinlock ARM: prima2_defconfig: enable build for RTC ARM: prima2_defconfig: enable build for misc input ARM: prima2_defconfig: enable build for SiRFSoC SDHC host ARM: prima2_defconfig: fix the outdated defconfig ARM: imx_v6_v7_defconfig: Select CONFIG_IKCONFIG_PROC ARM: defconfig: orion5x: add DT support ARM: qcom_defconfig: Enable options for KS8851 ethernet ARM: multi_v7_defconfig: Enable support for PWM Regulators ...
2015-09-09ARM: multi_v7_defconfig: Enable PBIAS regulatorKishon Vijay Abraham I
PBIAS regulator is required for MMC module in OMAP2, OMAP3, OMAP4, OMAP5 and DRA7 SoCs. Enable it here. Signed-off-by: Kishon Vijay Abraham I <kishon@ti.com> Signed-off-by: Kevin Hilman <khilman@linaro.org>
2015-09-09Merge branch 'drivers/reset' into next/lateKevin Hilman
* drivers/reset: reset: ath79: Fix missing spin_lock_init reset: Add (devm_)reset_control_get stub functions reset: reset-zynq: Adding support for Xilinx Zynq reset controller. docs: dts: Added documentation for Xilinx Zynq Reset Controller bindings. MIPS: ath79: Add the reset controller to the AR9132 dtsi reset: Add a driver for the reset controller on the AR71XX/AR9XXX devicetree: Add bindings for the ATH79 reset controller reset: socfpga: Update reset-socfpga to read the altr,modrst-offset property doc: dt: add documentation for lpc1850-rgu reset driver reset: add driver for lpc18xx rgu reset: sti: constify of_device_id array ARM: STi: DT: Move reset controller constants into common location MAINTAINERS: add include/dt-bindings/reset path to reset controller entry
2015-09-09Merge tag 'reset-for-4.3-fixes' of git://git.pengutronix.de/git/pza/linux ↵Kevin Hilman
into drivers/reset Merge "Reset controller fixes for v4.3" from Philipp Zabel: Reset controller fixes for v4.3 - added stubs to avoid build breakage in COMPILE_TEST configurations with RESET_CONTROLLER disabled - fixed missing spinlock initialization in ath79 driver * tag 'reset-for-4.3-fixes' of git://git.pengutronix.de/git/pza/linux: reset: ath79: Fix missing spin_lock_init reset: Add (devm_)reset_control_get stub functions
2015-09-09net: ipv6: use common fib_default_rule_prefPhil Sutter
This switches IPv6 policy routing to use the shared fib_default_rule_pref() function of IPv4 and DECnet. It is also used in multicast routing for IPv4 as well as IPv6. The motivation for this patch is a complaint about iproute2 behaving inconsistent between IPv4 and IPv6 when adding policy rules: Formerly, IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the assigned priority value was decreased with each rule added. Since then all users of the default_pref field have been converted to assign the generic function fib_default_rule_pref(), fib_nl_newrule() may just use it directly instead. Therefore get rid of the function pointer altogether and make fib_default_rule_pref() static, as it's not used outside fib_rules.c anymore. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09net: ethoc: Remove unnecessary #ifdef CONFIG_OFTobias Klauser
For !CONFIG_OF of_get_property() is defined to always return NULL. Thus there's no need to protect the call to of_get_property() with #ifdef CONFIG_OF. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09net: dsa: bcm_sf2: Fix 64-bits register writesFlorian Fainelli
The macro to write 64-bits quantities to the 32-bits register swapped the value and offsets arguments, we want to preserve the ordering of the arguments with respect to how writel() is implemented for instance: value first, offset/base second. Fixes: 246d7f773c13 ("net: dsa: add Broadcom SF2 switch driver") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09bpf: fix out of bounds access in verifier logAlexei Starovoitov
when the verifier log is enabled the print_bpf_insn() is doing bpf_alu_string[BPF_OP(insn->code) >> 4] and bpf_jmp_string[BPF_OP(insn->code) >> 4] where BPF_OP is a 4-bit instruction opcode. Malformed insns can cause out of bounds access. Fix it by sizing arrays appropriately. The bug was found by clang address sanitizer with libfuzzer. Reported-by: Yonghong Song <yhs@plumgrid.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09ipv6: fix multipath route replace error recoveryRoopa Prabhu
Problem: The ecmp route replace support for ipv6 in the kernel, deletes the existing ecmp route too early, ie when it installs the first nexthop. If there is an error in installing the subsequent nexthops, its too late to recover the already deleted existing route leaving the fib in an inconsistent state. This patch reduces the possibility of this by doing the following: a) Changes the existing multipath route add code to a two stage process: build rt6_infos + insert them ip6_route_add rt6_info creation code is moved into ip6_route_info_create. b) This ensures that most errors are caught during building rt6_infos and we fail early c) Separates multipath add and del code. Because add needs the special two stage mode in a) and delete essentially does not care. d) In any event if the code fails during inserting a route again, a warning is printed (This should be unlikely) Before the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: File exists $ip -6 route show /* previously added ecmp route 3000:1000:1000:1000::2 dissappears from * kernel */ After the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: File exists $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 Fixes: 27596472473a ("ipv6: fix ECMP route replacement") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-09soc: qcom: smd: Correct fBLOCKREADINTR handlingBjorn Andersson
fBLOCKREADINTR is masking the notification from the remote and should hence be cleared while we're waiting the tx fifo to drain. Also change the reset state to mask the notification, as send is the only use case where we're interested in it. Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com> Signed-off-by: Andy Gross <agross@codeaurora.org>
2015-09-09soc: qcom: smd: Use correct remote processor IDAndy Gross
This patch fixes SMEM addressing issues when remote processors need to use secure SMEM partitions. Signed-off-by: Andy Gross <agross@codeaurora.org> Reviewed-by: Bjorn Andersson <bjorn.andersson@sonymobile.com>
2015-09-09soc: qcom: smem: Fix errant private accessAndy Gross
This patch corrects private partition item access. Instead of falling back to global for instances where we have an actual host and remote partition existing, return the results of the private lookup. Signed-off-by: Andy Gross <agross@codeaurora.org>
2015-09-09Merge tag 'qcom-soc-for-4.3' into v4.2-rc2Andy Gross
Qualcomm ARM Based SoC Updates for 4.3 * Add SMEM driver * Add SMD driver * Add RPM over SMD driver * Select QCOM_SCM by default
2015-09-09intel_pstate: fix PCT_TO_HWP macroKristen Carlson Accardi
PCT_TO_HWP does not take the actual range of pstates exported by HWP_CAPABILITIES in account, and is broken on most platforms. Remove the macro and set the min and max pstate for hwp by determining the range and adjusting by the min and max percent limits values. Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-09-09intel_pstate: Fix user input of min/max to legal policy regionChen Yu
In current code, max_perf_pct might be smaller than min_perf_pct by improper user input: $ grep . /sys/devices/system/cpu/intel_pstate/m*_perf_pct /sys/devices/system/cpu/intel_pstate/max_perf_pct:100 /sys/devices/system/cpu/intel_pstate/min_perf_pct:100 $ echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct $ grep . /sys/devices/system/cpu/intel_pstate/m*_perf_pct /sys/devices/system/cpu/intel_pstate/max_perf_pct:80 /sys/devices/system/cpu/intel_pstate/min_perf_pct:100 Fix this problem by 2 steps: 1. Normalize the user input to [min_policy, max_policy]. 2. Make sure max_perf_pct>=min_perf_pct, suggested by Seiichi Ikarashi. Signed-off-by: Chen Yu <yu.c.chen@intel.com> Acked-by: Kristen Carlson Accardi <kristen@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-09-09PM / OPP: Return suspend_opp only if it is enabledViresh Kumar
There is no point returning suspend_opp, if it is disabled by the core. As we can't use it at all. Fix it. Fixes: 4eafbd15b6c8 ("PM / OPP: add dev_pm_opp_get_suspend_opp() helper") Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-09-09ARM: dts: qcom: msm8974-sony-xperia-honami: Use stdout-pathStephen Boyd
Use stdout-path so that we don't have to put the console on the kernel command line. Cc: Tim Bird <tim.bird@sonymobile.com> Cc: Bjorn Andersson <bjorn.andersson@sonymobile.com> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>