summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2012-07-31mm, oom: fix potential killing of thread that is disabled from oom killingDavid Rientjes
/proc/sys/vm/oom_kill_allocating_task will immediately kill current when the oom killer is called to avoid a potentially expensive tasklist scan for large systems. Currently, however, it is not checking current's oom_score_adj value which may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom killing. This patch avoids killing current in such a condition and simply falls back to the tasklist scan since memory still needs to be freed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: clear pages_scanned only if draining a pcp adds pages to the buddy ↵KOSAKI Motohiro
allocator again commit 2ff754fa8f ("mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator again") fixed one free_pcppages_bulk() misuse. But two another miuse still exist. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm, fadvise: don't return -EINVAL when filesystem cannot implement fadvise()KOSAKI Motohiro
Eric Wong reported his test suite failex when /tmp is tmpfs. https://lkml.org/lkml/2012/2/24/479 Currentlt the input check of POSIX_FADV_WILLNEED has two problems. - requires a_ops->readpage. But in fact, force_page_cache_readahead() requires that the target filesystem has either ->readpage or ->readpages. - returns -EINVAL when the filesystem doesn't have ->readpage. But posix says that fadvise is merely a hint. Thus fadvise() should return 0 if filesystem has no means of implementing fadvise(). The userland application should not know nor care whcih type of filesystem backs the TMPDIR directory, as Eric pointed out. There is nothing which userspace can do to solve this error. So change the return value to 0 when filesytem doesn't support readahead. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Eric Wong <normalperson@yhbt.net> Tested-by: Eric Wong <normalperson@yhbt.net> Reviewed-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm/compaction: cleanup on compaction_deferredGavin Shan
When CONFIG_COMPACTION is enabled, compaction_deferred() tries to recalculate the deferred limit again, which isn't necessary. When CONFIG_COMPACTION is disabled, compaction_deferred() should return "true" or "false" since it has "bool" for its return value. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31memcg: make mem_cgroup_force_empty_list() return boolKAMEZAWA Hiroyuki
mem_cgroup_force_empty_list() just returns 0 or -EBUSY and -EBUSY indicates 'you need to retry'. Make mem_cgroup_force_empty_list() return a bool to simplify the logic. [akpm@linux-foundation.org: rework mem_cgroup_force_empty_list()'s comment] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31memcg: mem_cgroup_move_parent() doesn't need gfp_maskKAMEZAWA Hiroyuki
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31memcg: clean up force_empty_list() return value checkKamezawa Hiroyuki
After bf544fdc241da8 "memcg: move charges to root cgroup if use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()" mem_cgroup_move_parent() returns only -EBUSY or -EINVAL. So we can remove the -ENOMEM and -EINTR checks. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31memcg: remove check for signal_pending() during rmdir()Kamezawa Hiroyuki
After bf544fdc241da8 "memcg: move charges to root cgroup if use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()", no memory reclaim will occur when removing a memory cgroup. If -EINTR is returned here, cgroup will show a warning. We don't need to handle any user interruption signal. Remove this. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm/memblock.c:memblock_double_array(): cosmetic cleanupsAndrew Morton
This function is an 80-column eyesore, quite unnecessarily. Clean that up, and use standard comment layout style. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Greg Pearson <greg.pearson@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm, oom: do not schedule if current has been killedDavid Rientjes
The oom killer currently schedules away from current in an uninterruptible sleep if it does not have access to memory reserves. It's possible that current was killed because it shares memory with the oom killed thread or because it was killed by the user in the interim, however. This patch only schedules away from current if it does not have a pending kill, i.e. if it does not share memory with the oom killed thread. It's possible that it will immediately retry its memory allocation and fail, but it will immediately be given access to memory reserves if it calls the oom killer again. This prevents the delay of memory freeing when threads that share memory with the oom killed thread get unnecessarily scheduled. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb/cgroup: remove exclude and wakeup rmdir calls from migrateAneesh Kumar K.V
We already hold the hugetlb_lock. That should prevent a parallel cgroup rmdir from touching page's hugetlb cgroup. So remove the exclude and wakeup calls. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to ↵Aneesh Kumar K.V
active list. A page's hugetlb cgroup assignment and movement to the active list should occur with hugetlb_lock held. Otherwise when we remove the hugetlb cgroup we will iterate the active list and find pages with NULL hugetlb cgroup values. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb: move all the in use pages to active listAneesh Kumar K.V
When we fail to allocate pages from the reserve pool, hugetlb tries to allocate huge pages using alloc_buddy_huge_page. Add these to the active list. We also need to add the huge page we allocate when we soft offline the oldpage to active list. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb/cgroup: add HugeTLB controller documentationAneesh Kumar K.V
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during ↵Aneesh Kumar K.V
migration With HugeTLB pages, hugetlb cgroup is uncharged in compound page destructor. Since we are holding a hugepage reference, we can be sure that old page won't get uncharged till the last put_page(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb/cgroup: add hugetlb cgroup control filesAneesh Kumar K.V
Add the control files for hugetlb controller [akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g] [akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb/cgroup: add support for cgroup removalAneesh Kumar K.V
Add support for cgroup removal. If we don't have parent cgroup, the charges are moved to root cgroup. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb/cgroup: add charge/uncharge routines for hugetlb cgroupAneesh Kumar K.V
Add the charge and uncharge routines for hugetlb cgroup. We do cgroup charging in page alloc and uncharge in compound page destructor. Assigning page's hugetlb cgroup is protected by hugetlb_lock. [liwp@linux.vnet.ibm.com: add huge_page_order check to avoid incorrect uncharge] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Wanpeng Li <liwp.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb/cgroup: add the cgroup pointer to page lruAneesh Kumar K.V
Add the hugetlb cgroup pointer to 3rd page lru.next. This limit the usage to hugetlb cgroup to only hugepages with 3 or more normal pages. I guess that is an acceptable limitation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm/hugetlb: add new HugeTLB cgroupAneesh Kumar K.V
Implement a new controller that allows us to control HugeTLB allocations. The extension allows to limit the HugeTLB usage per control group and enforces the controller limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that, the application will get SIGBUS signal if it tries to access HugeTLB pages beyond its limit. This requires the application to know beforehand how much HugeTLB pages it would require for its use. The charge/uncharge calls will be added to HugeTLB code in later patch. Support for cgroup removal will be added in later patches. [akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g] [akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/g] Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb: make some static variables globalAneesh Kumar K.V
We will use them later in hugetlb_cgroup.c Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb: add a list for tracking in-use HugeTLB pagesAneesh Kumar K.V
hugepage_activelist will be used to track currently used HugeTLB pages. We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal. On cgroup removal we update the page's HugeTLB cgroup to point to parent cgroup. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb: simplify migrate_huge_page()Aneesh Kumar K.V
Since we migrate only one hugepage, don't use linked list for passing the page around. Directly pass the page that need to be migrated as argument. This also removes the usage of page->lru in the migrate path. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb: use mmu_gather instead of a temporary linked list for accumulating ↵Aneesh Kumar K.V
pages Use a mmu_gather instead of a temporary linked list for accumulating pages when we unmap a hugepage range Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb: add an inline helper for finding hstate indexAneesh Kumar K.V
Add an inline helper and use it in the code. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb: don't use ERR_PTR with VM_FAULT* valuesAneesh Kumar K.V
The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the VM_FAULT_* values from MAX_ERRNO. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31hugetlb: rename max_hstate to hugetlb_max_hstateAneesh Kumar K.V
This patchset implements a cgroup resource controller for HugeTLB pages. The controller allows to limit the HugeTLB usage per control group and enforces the controller limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that, the application will get SIGBUS signal if it tries to access HugeTLB pages beyond its limit. This requires the application to know beforehand how much HugeTLB pages it would require for its use. The goal is to control how many HugeTLB pages a group of task can allocate. It can be looked at as an extension of the existing quota interface which limits the number of HugeTLB pages per hugetlbfs superblock. HPC job scheduler requires jobs to specify their resource requirements in the job file. Once their requirements can be met, job schedulers like (SLURM) will schedule the job. We need to make sure that the jobs won't consume more resources than requested. If they do we should either error out or kill the application. This patch: Rename max_hstate to hugetlb_max_hstate. We will be using this from other subsystems like hugetlb controller in later patches. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threadsWanpeng Li
Since per-BDI flusher threads were introduced in 2.6, the pdflush mechanism is not used any more. But the old interface exported through /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless. For back-compatibility, printk warning information and return 2 to notify the users that the interface is removed. Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm/buddy: cleanup on should_fail_alloc_pageGavin Shan
Currently, function should_fail() has "bool" for its return value, so it's reasonable to change the return value of function should_fail_alloc_page() into "bool" as well. The patch does cleanup on function should_fail_alloc_page() to have "bool" for its return value. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: account the total_vm in the vm_stat_account()Huang Shijie
vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now. But we can also account for total_vm in the vm_stat_account() which makes the code tidy. Even for mprotect_fixup(), we can get the right result in the end. Signed-off-by: Huang Shijie <shijie8@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31documentation: update how page-cluster affects swap I/OChristian Ehrhardt
Fix of the documentation of /proc/sys/vm/page-cluster to match the behavior of the code and add some comments about what the tunable will change in that behavior. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31swap: allow swap readahead to be mergedChristian Ehrhardt
Swap readahead works fine, but the I/O to disk is almost always done in page size requests, despite the fact that readahead submits 1<<page-cluster pages at a time. On older kernels the old per device plugging behavior might have captured this and merged the requests, but currently all comes down to much more I/Os than required. On a single device this might not be an issue, but as soon as a server runs on shared san resources savin I/Os not only improves swapin throughput but also provides a lower resource utilization. With a load running KVM in a lot of memory overcommitment (the hot memory is 1.5 times the host memory) swapping throughput improves significantly and the lead feels more responsive as well as achieves more throughput. In a test setup with 16 swap disks running blocktrace on one of those disks shows the improved merging: Prior: Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB IO unplugs: 149,614 Timer unplugs: 2,940 With the patch: Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB IO unplugs: 337,130 Timer unplugs: 11,184 I got ~10% to ~40% more throughput in my cases and at the same time much lower cpu consumption when broken down per transferred kilobyte (the majority of that due to saved interrupts and better cache handling). In a shared SAN others might get an additional benefit as well, because this now causes less protocol overhead. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31memcg: remove MEM_CGROUP_CHARGE_TYPE_FORCEKamezawa Hiroyuki
There are no users since commit b24028572fb69 ("memcg: remove PCG_CACHE"). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31memcg: rename MEM_CGROUP_CHARGE_TYPE_MAPPED as MEM_CGROUP_CHARGE_TYPE_ANONKamezawa Hiroyuki
Now, in memcg, 2 "MAPPED" enum/macro are found MEM_CGROUP_CHARGE_TYPE_MAPPED MEM_CGROUP_STAT_FILE_MAPPED Thier names looks similar to each other but the former is used for accounting anonymous memory. rename it as TYPE_ANON. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31memcg: rename MEM_CGROUP_STAT_SWAPOUT as MEM_CGROUP_STAT_SWAPKamezawa Hiroyuki
MEM_CGROUP_STAT_SWAPOUT represents the usage of swap rather than the number of swap-out events. Rename it to be MEM_CGROUP_STAT_SWAP. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: make vb_alloc() more foolproofJan Kara
If someone calls vb_alloc() (or vm_map_ram() for that matter) to allocate 0 bytes (0 pages), get_order() returns BITS_PER_LONG - PAGE_CACHE_SHIFT and interesting stuff happens. So make debugging such problems easier and warn about 0-size allocation. [akpm@linux-foundation.org: use WARN_ON-return-value feature] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31vmalloc: walk vmap_areas by sorted list instead of rb_next()Hong zhi guo
There's a walk by repeating rb_next to find a suitable hole. Could be simply replaced by walk on the sorted vmap_area_list. More simpler and efficient. Mutation of the list and tree only happens in pair within __insert_vmap_area and __free_vmap_area, under protection of vmap_area_lock. The patch code is also under vmap_area_lock, so the list walk is safe, and consistent with the tree walk. Tested on SMP by repeating batch of vmalloc anf vfree for random sizes and rounds for hours. Signed-off-by: Hong Zhiguo <honkiko@gmail.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31drivers/media/video/v4l2-ioctl.c: fix buildAndrew Morton
Fix zillions of these: drivers/media/video/v4l2-ioctl.c:1848: error: unknown field 'func' specified in initializer drivers/media/video/v4l2-ioctl.c:1848: warning: missing braces around initializer drivers/media/video/v4l2-ioctl.c:1848: warning: (near initialization for 'v4l2_ioctls[0].<anonymous>') drivers/media/video/v4l2-ioctl.c:1848: warning: initialization makes integer from pointer without a cast drivers/media/video/v4l2-ioctl.c:1848: error: initializer element is not computable at load time drivers/media/video/v4l2-ioctl.c:1848: error: (near initialization for 'v4l2_ioctls[0].<anonymous>.offset') Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31xtensa: select generic atomic64_t supportFengguang Wu
This will fix build errors: block/blk-cgroup.c:609:2: error: unknown type name 'atomic64_t' block/blk-cgroup.c:609:2: error: implicit declaration of function 'ATOMIC64_INIT' [-Werror=implicit-function-declaration] Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31fault-injection: fix failcmd.sh warningAkinobu Mita
"fault-injection: add tool to run command with failslab or fail_page_alloc" added tools/testing/fault-injection/failcmd.sh to make it easier to inject slab/page allocation failures by fault injection. failcmd.sh prints the following warning when running with arguments for command. # ./failcmd.sh echo aaa failcmd.sh: line 209: [: echo: binary operator expected aaa This warning is caused by an improper check whether at least one parameter is left after parsing command options. Fix it by testing the length of $1 instead of $@ Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31Merge tag 'for-v3.6' of git://git.infradead.org/battery-2.6Linus Torvalds
Pull battery updates from Anton Vorontsov: "The tag contains just a few battery-related changes for v3.6. It's is all pretty straightforward, except one thing. One of our patches added thermal support for power supply class, but thermal/ subsystem changed under our feet. We (well, Stephen, that is) caught the issue and it was decided[1] that I'd just delay the battery pull request, and then will fix it up by merging upstream back into battery tree at the specific commit. That's not all though: another[2] small fixup for thermal subsystem was needed to get rid of a warning in power supply subsystem (the warning was not drivers/power's "fault", the thermal registration function just needed a proper const annotation, which is also done by a small commit on top of the merge. So, to sum this up: - The 'master' branch of the battery tree was in the -next tree for weeks, was never rebased, altered etc. It should be all OK; - Although, for-v3.6 tag contains the 'master' branch + merge + the warning fix. [1] http://lkml.org/lkml/2012/6/19/23 [2] http://lkml.org/lkml/2012/6/18/28" * tag 'for-v3.6' of git://git.infradead.org/battery-2.6: (23 commits) thermal: Constify 'type' argument for the registration routine olpc-battery: update CHARGE_FULL_DESIGN property for BYD LiFe batteries olpc-battery: Add VOLTAGE_MAX_DESIGN property charger-manager: Fix build break related to EXTCON lp8727_charger: Move header file into platform_data directory power_supply: Add min/max alert properties for CAPACITY, TEMP, TEMP_AMBIENT bq27x00_battery: Add support for BQ27425 chip charger-manager: Set current limit of regulator for over current protection charger-manager: Use EXTCON Subsystem to detect charger cables for charging test_power: Add VOLTAGE_NOW and BATTERY_TEMP properties test_power: Add support for USB AC source gpio-charger: Use cansleep version of gpio_set_value bq27x00_battery: Add support for power average and health properties sbs-battery: Don't trigger false supply_changed event twl4030_charger: Allow charger to control the regulator that feeds it twl4030_charger: Add backup-battery charging twl4030_charger: Fix some typos max17042_battery: Support CHARGE_COUNTER power supply attribute smb347-charger: Add constant charge and current properties power_supply: Add constant charge_current and charge_voltage properties ...
2012-07-31Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes updates/cleanups/fixes from Oleg and diverse tooling updates (mostly fixes) now that Arnaldo is back from vacation." * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits) uprobes: __replace_page() needs munlock_vma_page() uprobes: Rename vma_address() and make it return "unsigned long" uprobes: Fix register_for_each_vma()->vma_address() check uprobes: Introduce vaddr_to_offset(vma, vaddr) uprobes: Teach build_probe_list() to consider the range uprobes: Remove insert_vm_struct()->uprobe_mmap() uprobes: Remove copy_vma()->uprobe_mmap() uprobes: Fix overflow in vma_address()/find_active_uprobe() uprobes: Suppress uprobe_munmap() from mmput() uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe() uprobes: Clean up and document write_opcode()->lock_page(old_page) uprobes: Kill write_opcode()->lock_page(new_page) uprobes: __replace_page() should not use page_address_in_vma() uprobes: Don't recheck vma/f_mapping in write_opcode() perf/x86: Fix missing struct before structure name perf/x86: Fix format definition of SNB-EP uncore QPI box perf/x86: Make bitfield unsigned perf/x86: Fix LLC-* and node-* events on Intel SandyBridge perf/x86: Add Intel Nehalem-EX uncore support perf/x86: Fix typo in format definition of uncore PCU filter ...
2012-07-31Merge branch 'merge' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Benjamin Herrenschmidt: "Kumar sent me a handful of Freescale related fixes and I added another regression fix to the pile. PS. I -will- eventually learn about that signed tag business :-)" * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: powerpc/kvm/book3s_32: Fix MTMSR_EERI macro powerpc/85xx: p1022ds: fix DIU/LBC switching with NAND enabled powerpc/85xx: p1022ds: disable the NAND flash node if video is enabled powerpc/85xx: Fix sram_offset parameter type powerpc/85xx: P3041DS - change espi input-clock from 40MHz to 35MHz powerpc/85xx: Fix pci base address error for p2020rdb-pc in dts
2012-07-31Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Martin Schwidefsky: "This it the second batch of s390 patches for the 3.6 merge window. Included is enablement for two common code changes, killable page faults and sorted exception tables. And the regular set of cleanup and bug fix patches." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390: make use of user_mode() macro where possible s390/mm: rename user_mode variable to addressing_mode s390/mm: fix fault handling for page table walk case s390/mm: make page faults killable s390: update defconfig s390/mm: downgrade page table after fork of a 31 bit process s390/ipl: Use diagnose 8 command separation s390/linker script: use RO_DATA_SECTION s390/exceptions: sort exception table at build time s390/debug: remove module_exit function / move EXPORT_SYMBOLs
2012-07-31ipv4: Properly purge netdev references on uncached routes.David S. Miller
When a device is unregistered, we have to purge all of the references to it that may exist in the entire system. If a route is uncached, we currently have no way of accomplishing this. So create a global list that is scanned when a network device goes down. This mirrors the logic in net/core/dst.c's dst_ifdown(). Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-31ipv4: Cache routes in nexthop exception entries.David S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-31Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd changes from J. Bruce Fields: "This has been an unusually quiet cycle--mostly bugfixes and cleanup. The one large piece is Stanislav's work to containerize the server's grace period--but that in itself is just one more step in a not-yet-complete project to allow fully containerized nfs service. There are a number of outstanding delegation, container, v4 state, and gss patches that aren't quite ready yet; 3.7 may be wilder." * 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (35 commits) NFSd: make boot_time variable per network namespace NFSd: make grace end flag per network namespace Lockd: move grace period management from lockd() to per-net functions LockD: pass actual network namespace to grace period management functions LockD: manage grace list per network namespace SUNRPC: service request network namespace helper introduced NFSd: make nfsd4_manager allocated per network namespace context. LockD: make lockd manager allocated per network namespace LockD: manage grace period per network namespace Lockd: add more debug to host shutdown functions Lockd: host complaining function introduced LockD: manage used host count per networks namespace LockD: manage garbage collection timeout per networks namespace LockD: make garbage collector network namespace aware. LockD: mark host per network namespace on garbage collect nfsd4: fix missing fault_inject.h include locks: move lease-specific code out of locks_delete_lock locks: prevent side-effects of locks_release_private before file_lock is initialized NFSd: set nfsd_serv to NULL after service destruction NFSd: introduce nfsd_destroy() helper ...
2012-07-31ipv4: percpu nh_rth_output cacheEric Dumazet
Input path is mostly run under RCU and doesnt touch dst refcnt But output path on forwarding or UDP workloads hits badly dst refcount, and we have lot of false sharing, for example in ipv4_mtu() when reading rt->rt_pmtu Using a percpu cache for nh_rth_output gives a nice performance increase at a small cost. 24 udpflood test on my 24 cpu machine (dummy0 output device) (each process sends 1.000.000 udp frames, 24 processes are started) before : 5.24 s after : 2.06 s For reference, time on linux-3.5 : 6.60 s Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-31ipv4: Restore old dst_free() behavior.Eric Dumazet
commit 404e0a8b6a55 (net: ipv4: fix RCU races on dst refcounts) tried to solve a race but added a problem at device/fib dismantle time : We really want to call dst_free() as soon as possible, even if sockets still have dst in their cache. dst_release() calls in free_fib_info_rcu() are not welcomed. Root of the problem was that now we also cache output routes (in nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in rt_free(), because output route lookups are done in process context. Based on feedback and initial patch from David Miller (adding another call_rcu_bh() call in fib, but it appears it was not the right fix) I left the inet_sk_rx_dst_set() helper and added __rcu attributes to nh_rth_output and nh_rth_input to better document what is going on in this code. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-31Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Sage Weil: "Lots of stuff this time around: - lots of cleanup and refactoring in the libceph messenger code, and many hard to hit races and bugs closed as a result. - lots of cleanup and refactoring in the rbd code from Alex Elder, mostly in preparation for the layering functionality that will be coming in 3.7. - some misc rbd cleanups from Josh Durgin that are finally going upstream - support for CRUSH tunables (used by newer clusters to improve the data placement) - some cleanup in our use of d_parent that Al brought up a while back - a random collection of fixes across the tree There is another patch coming that fixes up our ->atomic_open() behavior, but I'm going to hammer on it a bit more before sending it." Fix up conflicts due to commits that were already committed earlier in drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits) rbd: create rbd_refresh_helper() rbd: return obj version in __rbd_refresh_header() rbd: fixes in rbd_header_from_disk() rbd: always pass ops array to rbd_req_sync_op() rbd: pass null version pointer in add_snap() rbd: make rbd_create_rw_ops() return a pointer rbd: have __rbd_add_snap_dev() return a pointer libceph: recheck con state after allocating incoming message libceph: change ceph_con_in_msg_alloc convention to be less weird libceph: avoid dropping con mutex before fault libceph: verify state after retaking con lock after dispatch libceph: revoke mon_client messages on session restart libceph: fix handling of immediate socket connect failure ceph: update MAINTAINERS file libceph: be less chatty about stray replies libceph: clear all flags on con_close libceph: clean up con flags libceph: replace connection state bits with states libceph: drop unnecessary CLOSED check in socket state change callback libceph: close socket directly from ceph_con_close() ...