linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2014-05-13	cgroup: move cgroup->kn->priv clearing to cgroup_rmdir()	Tejun Heo
	The ->priv field of a cgroup directory kernfs_node points back to the cgroup. This field is RCU cleared in cgroup_destroy_locked() for non-kernfs accesses from css_tryget_from_dir() and cgroupstats_build(). As these are only applicable to cgroups which finished creation successfully and fully initialized cgroups are always removed by cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir(). This will help simplifying cgroup locking and shouldn't introduce any behavior difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: grab cgroup_mutex earlier in cgroup_subtree_control_write()	Tejun Heo
	Move cgroup_lock_live_group() invocation upwards to right below cgroup_tree_mutex in cgroup_subtree_control_write(). This is to help the planned locking simplification. This doesn't make any userland-visible behavioral changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: collapse cgroup_create() into croup_mkdir()	Tejun Heo
	cgroup_mkdir() is the sole user of cgroup_create(). Let's collapse the latter into the former. This will help simplifying locking. While at it, remove now stale comment about inode locking. This patch doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: reorganize cgroup_create()	Tejun Heo
	Reorganize cgroup_create() so that all paths share unlock out path. * All err_* labels are renamed to out_* as they're now shared by both success and failure paths. * @err renamed to @ret for the similar reason as above and so that it's more consistent with other functions. * cgroup memory allocation moved after locking so that freeing failed cgroup happens before unlocking. While this moves more code inside critical section, memory allocations inside cgroup locking are already pretty common and this is unlikely to make any noticeable difference. * While at it, replace a stray @parent->root dereference with @root. This reorganization will help simplifying locking. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: remove cgroup->control_kn	Tejun Heo
	Now that cgroup_subtree_control_write() has access to the associated kernfs_open_file and thus the kernfs_node, there's no need to cache it in cgroup->control_kn on creation. Remove cgroup->control_kn and use @of->kn directly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: convert "tasks" and "cgroup.procs" handle to use cftype->write()	Tejun Heo
	cgroup_tasks_write() and cgroup_procs_write() are currently using cftype->write_u64(). This patch converts them to use cftype->write() instead. This allows access to the associated kernfs_open_file which will be necessary to implement the planned kernfs active protection manipulation for these files. This shifts buffer parsing to attach_task_by_pid() and makes it return @nbytes on success. Let's rename it to __cgroup_procs_write() to clearly indicate that this is a write handler implementation. This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: replace cftype->trigger() with cftype->write()	Tejun Heo
	cftype->trigger() is pointless. It's trivial to ignore the input buffer from a regular ->write() operation. Convert all ->trigger() users to ->write() and remove ->trigger(). This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz>
2014-05-13	cgroup: replace cftype->write_string() with cftype->write()	Tejun Heo
	Convert all cftype->write_string() users to the new cftype->write() which maps directly to kernfs write operation and has full access to kernfs and cgroup contexts. The conversions are mostly mechanical. * @css and @cft are accessed using of_css() and of_cft() accessors respectively instead of being specified as arguments. * Should return @nbytes on success instead of 0. * @buf is not trimmed automatically. Trim if necessary. Note that blkcg and netprio don't need this as the parsers already handle whitespaces. cftype->write_string() has no user left after the conversions and removed. While at it, remove unnecessary local variable @p in cgroup_subtree_control_write() and stale comment about CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c. This patch doesn't introduce any visible behavior changes. v2: netprio was missing from conversion. Converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net>
2014-05-13	cgroup: implement cftype->write()	Tejun Heo
	During the recent conversion to kernfs, cftype's seq_file operations are updated so that they are directly mapped to kernfs operations and thus can fully access the associated kernfs and cgroup contexts; however, write path hasn't seen similar updates and none of the existing write operations has access to, for example, the associated kernfs_open_file. Let's introduce a new operation cftype->write() which maps directly to the kernfs write operation and has access to all the arguments and contexts. This will replace ->write_string() and ->trigger() and ease manipulation of kernfs active protection from cgroup file operations. Two accessors - of_cft() and of_css() - are introduced to enable accessing the associated cgroup context from cftype->write() which only takes kernfs_open_file for the context information. The accessors for seq_file operations - seq_cft() and seq_css() - are rewritten to wrap the of_ accessors. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: rename css_tryget() to css_tryget_online()	Tejun Heo
	Unlike the more usual refcnting, what css_tryget() provides is the distinction between online and offline csses instead of protection against upping a refcnt which already reached zero. cgroup is planning to provide actual tryget which fails if the refcnt already reached zero. Let's rename the existing trygets so that they clearly indicate that they're onliness. I thought about keeping the existing names as-are and introducing new names for the planned actual tryget; however, given that each controller participates in the synchronization of the online state, it seems worthwhile to make it explicit that these functions are about on/offline state. Rename css_tryget() to css_tryget_online() and css_tryget_from_dir() to css_tryget_online_from_dir(). This is pure rename. v2: cgroup_freezer grew new usages of css_tryget(). Update accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
2014-05-13	cgroup: use release_agent_path_lock in cgroup_release_agent_show()	Tejun Heo
	release_path is now protected by release_agent_path_lock to allow accessing it without grabbing cgroup_mutex; however, cgroup_release_agent_show() was still grabbing cgroup_mutex. Let's convert it to release_agent_path_lock so that we don't have to worry about this one for the planned locking updates. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: use restart_syscall() for retries after offline waits in ↵	Tejun Heo
	cgroup_subtree_control_write() After waiting for a child to finish offline, cgroup_subtree_control_write() jumps up to retry from after the input parsing and active protection breaking. This retry makes the scheduled locking update - removal of cgroup_tree_mutex - more difficult. Let's simplify it by returning with restart_syscall() for retries. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: update and fix parsing of "cgroup.subtree_control"	Tejun Heo
	I was confused that strsep() was equivalent to strtok_r() in skipping over consecutive delimiters. strsep() just splits at the first occurrence of one of the delimiters which makes the parsing very inflexible, which makes allowing multiple whitespace chars as delimters kinda moot. Let's just be consistently strict and require list of tokens separated by spaces. This is what Documentation/cgroups/unified-hierarchy.txt describes too. Also, parsing may access beyond the end of the string if the string ends with spaces or is zero-length. Make sure it skips zero-length tokens. Note that this also ensures that the parser doesn't puke on multiple consecutive spaces. v2: Add zero-length token skipping. v3: Added missing space after "==". Spotted by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: css_release() shouldn't clear cgroup->subsys[]	Tejun Heo
	c1a71504e971 ("cgroup: don't recycle cgroup id until all csses' have been destroyed") made cgroup ID persist until a cgroup is released and add cgroup->subsys[] clearing to css_release() so that css_from_id() doesn't return a css which has already been released which happens before cgroup release; however, the right change here was updating offline_css() to clear cgroup->subsys[] which was done by e32978031016 ("cgroup: cgroup->subsys[] should be cleared after the css is offlined") instead of clearing it from css_release(). We're now clearing cgroup->subsys[] twice. This is okay for traditional hierarchies as a css's lifetime is the same as its cgroup's; however, this confuses unified hierarchy and turning on and off a controller repeatedly using "cgroup.subtree_control" can lead to an oops like the following which happens because cgroup->subsys[] is incorrectly cleared asynchronously by css_release(). BUG: unable to handle kernel NULL pointer dereference at 00000000000000 08 IP: [<ffffffff81130c11>] kill_css+0x21/0x1c0 PGD 1170d067 PUD f0ab067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000 RIP: 0010:[<ffffffff81130c11>] [<ffffffff81130c11>] kill_css+0x21/0x1c0 RSP: 0018:ffff88000e199dc8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98 RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000 R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001 R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00 FS: 00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0 Stack: ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80 ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48 ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002 Call Trace: [<ffffffff8113537f>] cgroup_subtree_control_write+0x85f/0xa00 [<ffffffff8112fd18>] cgroup_file_write+0x38/0x1d0 [<ffffffff8126fc97>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f2ae6>] vfs_write+0xb6/0x1c0 [<ffffffff811f35ad>] SyS_write+0x4d/0xc0 [<ffffffff81d0acd2>] system_call_fastpath+0x16/0x1b Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 <48> 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff RIP [<ffffffff81130c11>] kill_css+0x21/0x1c0 RSP <ffff88000e199dc8> CR2: 0000000000000008 ---[ end trace e7aae1f877c4e1b4 ]--- Remove the unnecessary cgroup->subsys[] clearing from css_release(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: cgroup_idr_lock should be bh	Tejun Heo
	cgroup_idr_remove() can be invoked from bh leading to lockdep detecting possible AA deadlock (IN_BH/ON_BH). Make the lock bh-safe. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: fix offlining child waiting in cgroup_subtree_control_write()	Tejun Heo
	cgroup_subtree_control_write() waits for offline to complete child-by-child before enabling a controller; however, it has a couple bugs. * It doesn't initialize the wait_queue_t. This can lead to infinite hang on the following schedule() among other things. * It forgets to pin the child before releasing cgroup_tree_mutex and performing schedule(). The child may already be gone by the time it wakes up and invokes finish_wait(). Pin the child being waited on. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	Merge branch 'for-3.15-fixes' of ↵	Tejun Heo
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup into for-3.16 Pull to receive e37a06f10994 ("cgroup: fix the retry path of cgroup_mount()") to avoid unnecessary conflicts with planned cgroup_tree_mutex removal and also to be able to remove the temp fix added by 36c38fb7144a ("blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()") afterwards. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-13	cgroup: fix rcu_read_lock() leak in update_if_frozen()	Tejun Heo
	While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer: replace freezer->lock with freezer_mutex") introduced a bug in update_if_frozen() where it returns with rcu_read_lock() held. Fix it by adding rcu_read_unlock() before returning. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kbuild test robot <fengguang.wu@intel.com>
2014-05-13	cgroup_freezer: replace freezer->lock with freezer_mutex	Tejun Heo
	After 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem"), css task iterators requires sleepable context as it may block on css_set_rwsem. I missed that cgroup_freezer was iterating tasks under IRQ-safe spinlock freezer->lock. This leads to errors like the following on freezer state reads and transitions. BUG: sleeping function called from invalid context at /work /os/work/kernel/locking/rwsem.c:20 in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash 5 locks held by bash/462: #0: (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0 #1: (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170 #2: (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170 #3: (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0 #4: (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0 Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460 CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000 ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740 0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246 Call Trace: [<ffffffff81cf8c96>] dump_stack+0x4e/0x7a [<ffffffff810cf4f2>] __might_sleep+0x162/0x260 [<ffffffff81d05974>] down_read+0x24/0x60 [<ffffffff81133e87>] css_task_iter_start+0x27/0x70 [<ffffffff8113584d>] freezer_apply_state+0x5d/0x130 [<ffffffff81135a16>] freezer_write+0xf6/0x1e0 [<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230 [<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f0756>] vfs_write+0xb6/0x1c0 [<ffffffff811f121d>] SyS_write+0x4d/0xc0 [<ffffffff81d08292>] system_call_fastpath+0x16/0x1b freezer->lock used to be used in hot paths but that time is long gone and there's no reason for the lock to be IRQ-safe spinlock or even per-cgroup. In fact, given the fact that a cgroup may contain large number of tasks, it's not a good idea to iterate over them while holding IRQ-safe spinlock. Let's simplify locking by replacing per-cgroup freezer->lock with global freezer_mutex. This also makes the comments explaining the intricacies of policy inheritance and the locking around it as the states are protected by a common mutex. The conversion is mostly straight-forward. The followings are worth mentioning. * freezer_css_online() no longer needs double locking. * freezer_attach() now performs propagation simply while holding freezer_mutex. update_if_frozen() race no longer exists and the comment is removed. * freezer_fork() now tests whether the task is in root cgroup using the new task_css_is_root() without doing rcu_read_lock/unlock(). If not, it grabs freezer_mutex and performs the operation. * freezer_read() and freezer_change_state() grab freezer_mutex across the whole operation and pin the css while iterating so that each descendant processing happens in sleepable context. Fixes: 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem") Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	cgroup: introduce task_css_is_root()	Tejun Heo
	Determining the css of a task usually requires RCU read lock as that's the only thing which keeps the returned css accessible till its reference is acquired; however, testing whether a task belongs to the root can be performed without dereferencing the returned css by comparing the returned pointer against the root one in init_css_set[] which never changes. Implement task_css_is_root() which can be invoked in any context. This will be used by the scheduled cgroup_freezer change. v2: cgroup no longer supports modular controllers. No need to export init_css_set. Pointed out by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-13	Merge branch 'for-3.15-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: "Fixes for two bugs in workqueue. One is exiting with internal mutex held in a failure path of wq_update_unbound_numa(). The other is a subtle and unlikely use-after-possible-last-put in the rescuer logic. Both have been around for quite some time now and are unlikely to have triggered noticeably often. All patches are marked for -stable backport" * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix a possible race condition between rescuer and pwq-release workqueue: make rescuer_thread() empty wq->maydays list before exiting workqueue: fix bugs in wq_update_unbound_numa() failure path
2014-05-13	Merge branch 'for-3.15-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "During recent restructuring, device_cgroup unified config input check and enforcement logic; unfortunately, it turned out to share too much. Aristeu's patches fix the breakage and marked for -stable backport. The other two patches are fallouts from kernfs conversion. The blkcg change is temporary and will go away once kernfs internal locking gets simplified (patches pending)" * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats() device_cgroup: check if exception removal is allowed device_cgroup: fix the comment format for recently added functions device_cgroup: rework device access check and exception checking cgroup: fix the retry path of cgroup_mount()
2014-05-12	ntp: Make is_error_status() use its argument	George Spelvin
	is_error_status() is an inline function always called with the global time_status as an argument, so there's zero functional difference with this change, but the non-CONFIG_NTP_PPS version uses the passed-in argument, while the CONFIG_NTP_PPS one ignores its argument and uses the global. Looks like is_error_status was refactored out, but someone forgot to change the logic to check the local argument value. Thus this patch makes it use the argument always; shorter variable names are good. Signed-off-by: George Spelvin <linux@horizon.com> [jstultz: Tweaked commit message] Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-05-12	kernel/workqueue.c: pr_warning/pr_warn & printk/pr_info	Fabian Frederick
	tj: Refreshed on top of wq/for-3.16. Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-12	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	David S. Miller
	Conflicts: drivers/net/ethernet/altera/altera_sgdma.c net/netlink/af_netlink.c net/sched/cls_api.c net/sched/sch_api.c The netlink conflict dealt with moving to netlink_capable() and netlink_ns_capable() in the 'net' tree vs. supporting 'tc' operations in non-init namespaces. These were simple transformations from netlink_capable to netlink_ns_capable. The Altera driver conflict was simply code removal overlapping some void pointer cast cleanups in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-12	ntp: Convert simple_strtol to kstrtol	Fabian Frederick
	Replace obsolete function simple_strtol w/ kstrtol Inspired-By: Andrew Morton <akpm@linux-foundation.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> [jstultz: Tweak commit message] Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-05-12	hrtimer: Set expiry time before switch_hrtimer_base()	Viresh Kumar
	switch_hrtimer_base() calls hrtimer_check_target() which ensures that we do not migrate a timer to a remote cpu if the timer expires before the current programmed expiry time on that remote cpu. But __hrtimer_start_range_ns() calls switch_hrtimer_base() before the new expiry time is set. So the sanity check in hrtimer_check_target() is operating on stale or even uninitialized data. Update expiry time before calling switch_hrtimer_base(). [ tglx: Rewrote changelog once again ] Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: linaro-kernel@lists.linaro.org Cc: linaro-networking@linaro.org Cc: fweisbec@gmail.com Cc: arvind.chauhan@arm.com Link: http://lkml.kernel.org/r/81999e148745fc51bbcd0615823fbab9b2e87e23.1399882253.git.viresh.kumar@linaro.org Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-09	PM / hibernate: convert simple_strtoul to kstrtoul	Fabian Frederick
	Replace obsolete function. Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-05-09	Merge branch 'x86-urgent-for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Peter Anvin: "A somewhat unpleasantly large collection of small fixes. The big ones are the __visible tree sweep and a fix for 'earlyprintk=efi,keep'. It was using __init functions with predictably suboptimal results. Another key fix is a build fix which would produce output that simply would not decompress correctly in some configuration, due to the existing Makefiles picking up an unfortunate local label and mistaking it for the global symbol _end. Additional fixes include the handling of 64-bit numbers when setting the vdso data page (a latent bug which became manifest when i386 started exporting a vdso with time functions), a fix to the new MSR manipulation accessors which would cause features to not get properly unblocked, a build fix for 32-bit userland, and a few new platform quirks" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, vdso, time: Cast tv_nsec to u64 for proper shifting in update_vsyscall() x86: Fix typo in MSR_IA32_MISC_ENABLE_LIMIT_CPUID macro x86: Fix typo preventing msr_set/clear_bit from having an effect x86/intel: Add quirk to disable HPET for the Baytrail platform x86/hpet: Make boot_hpet_disable extern x86-64, build: Fix stack protector Makefile breakage with 32-bit userland x86/reboot: Add reboot quirk for Certec BPC600 asmlinkage: Add explicit __visible to drivers/, lib/, kernel/* asmlinkage, x86: Add explicit __visible to arch/x86/* asmlinkage: Revert "lto: Make asmlinkage __visible" x86, build: Don't get confused by local symbols x86/efi: earlyprintk=efi,keep fix
2014-05-08	Merge tag 'trace-fixes-v3.15-rc4-v2' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "This contains two fixes. The first is a long standing bug that causes bogus data to show up in the refcnt field of the module_refcnt tracepoint. It was introduced by a merge conflict resolution back in 2.6.35-rc days. The result should be 'refcnt = incs - decs', but instead it did 'refcnt = incs + decs'. The second fix is to a bug that was introduced in this merge window that allowed for a tracepoint funcs pointer to be used after it was freed. Moving the location of where the probes are released solved the problem" * tag 'trace-fixes-v3.15-rc4-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracepoint: Fix use of tracepoint funcs after rcu free trace: module: Maintain a valid user count
2014-05-08	tracepoint: Fix use of tracepoint funcs after rcu free	Mathieu Desnoyers
	Commit de7b2973903c "tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints" introduces a use after free by calling release_probes on the old struct tracepoint array before the newly allocated array is published with rcu_assign_pointer. There is a race window where tracepoints (RCU readers) can perform a "use-after-grace-period-after-free", which shows up as a GPF in stress-tests. Link: http://lkml.kernel.org/r/53698021.5020108@oracle.com Link: http://lkml.kernel.org/p/1399549669-25465-1-git-send-email-mathieu.desnoyers@efficios.com Reported-by: Sasha Levin <sasha.levin@oracle.com> CC: Oleg Nesterov <oleg@redhat.com> CC: Dave Jones <davej@redhat.com> Fixes: de7b2973903c "tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints" Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-08	sched/idle: Make cpuidle_idle_call() void	Rafael J. Wysocki
	The only value ever returned by cpuidle_idle_call() is 0 and its only caller ignores that value anyway, so make it void. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/4717784.WmVEpDoliM@vostro.rjw.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-08	sched/idle: Reflow cpuidle_idle_call()	Peter Zijlstra
	Apply goto to reduce lines and nesting levels. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-cc6vb0snt3sr7op6rlbfeqfh@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-08	sched/idle: Delay clearing the polling bit	Peter Zijlstra
	With the generic idle functions assuming !polling we should only clear the polling bit at the very last opportunity in order to avoid spurious IPIs. Ideally we'd flip the default to polling, but that means auditing all arch idle functions. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-vq7719foqzf6z5h4j7eh7f9e@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-08	sched/idle: Avoid spurious wakeup IPIs	Peter Zijlstra
	Because mwait_idle_with_hints() gets called from !idle context it must call current_clr_polling(). This however means that resched_task() is very likely to send an IPI even when we were polling: CPU0 CPU1 if (current_set_polling_and_test()) goto out; __monitor(&ti->flags); if (!need_resched()) __mwait(eax, ecx); set_tsk_need_resched(p); smp_mb(); out: current_clr_polling(); if (!tsk_is_polling(p)) smp_send_reschedule(cpu); So while it is correct (extra IPIs aren't a problem, whereas a missed IPI would be) it is a performance problem (for some). Avoid this issue by using fetch_or() to atomically set NEED_RESCHED and test if POLLING_NRFLAG is set. Since a CPU stuck in mwait is unlikely to modify the flags word, contention on the cmpxchg is unlikely and thus we should mostly succeed in a single go. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Nicolas Pitre <nico@linaro.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-kf5suce6njh5xf5d3od13rr0@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	perf: Simplify perf_event_exit_task_context()	Peter Zijlstra
	Instead of jumping through hoops to make sure to find (and exit) each event, do it the simple straight fwd way. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-tij931199thfkys8vbnokdpf@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	perf: Rework free paths	Peter Zijlstra
	Primarily make perf_event_release_kernel() into put_event(), this will allow kernel space to create per-task inherited events, and is safer in general. Also, document the free_event() assumptions. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-rk9pvr6e1d0559lxstltbztc@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	perf: Validate locking assumption	Peter Zijlstra
	Document and validate the locking assumption of event_sched_in(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-sybq1publ9xt5no77cwvi0eo@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	perf: Always destroy groups on exit	Peter Zijlstra
	Commit 38b435b16c36 ("perf: Fix tear-down of inherited group events") states that we need to destroy groups for inherited events, but it doesn't make any sense to not also destroy groups for normal events. And while it usually makes no difference (the normal events won't leak, and its very likely all the group events will die in quick succession) it does make the code more consistent and closes a potential hole for trouble. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-426egt8zmsm12d2q8k2xz4tt@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	perf: Ensure consistent inherit state in groups	Peter Zijlstra
	Make sure all events in a group have the same inherit state. It was possible for group leaders to have inherit set while sibling events would not have inherit set. In this case we'd still inherit the siblings, leading to some non-fatal weirdness. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-r32tt8yldvic3jlcghd3g35u@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	Merge branch 'perf/urgent' into perf/core, to avoid conflicts	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	sched/fair: Stop searching for tasks in newidle balance if there are ↵	Jason Low
	runnable tasks It was found that when running some workloads (such as AIM7) on large systems with many cores, CPUs do not remain idle for long. Thus, tasks can wake/get enqueued while doing idle balancing. In this patch, while traversing the domains in idle balance, in addition to checking for pulled_task, we add an extra check for this_rq->nr_running for determining if we should stop searching for tasks to pull. If there are runnable tasks on this rq, then we will stop traversing the domains. This reduces the chance that idle balance delays a task from running. This patch resulted in approximately a 6% performance improvement when running a Java Server workload on an 8 socket machine. Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: daniel.lezcano@linaro.org Cc: alex.shi@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: efault@gmx.de Cc: vincent.guittot@linaro.org Cc: morten.rasmussen@arm.com Cc: aswin@hp.com Cc: chegu_vinod@hp.com Link: http://lkml.kernel.org/r/1398303035-18255-4-git-send-email-jason.low2@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	sched: Add a new SD_SHARE_POWERDOMAIN for sched_domain	Vincent Guittot
	A new flag SD_SHARE_POWERDOMAIN is created to reflect whether groups of CPUs in a sched_domain level can or not reach different power state. As an example, the flag should be cleared at CPU level if groups of cores can be power gated independently. This information can be used in the load balance decision or to add load balancing level between group of CPUs that can power gate independantly. This flag is part of the topology flags that can be set by arch. Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: schwidefsky@de.ibm.com Cc: cmetcalf@tilera.com Cc: benh@kernel.crashing.org Cc: preeti@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1397209481-28542-5-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	sched, powerpc: Create a dedicated topology table	Vincent Guittot
	Create a dedicated topology table for handling asymetric feature of powerpc. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andy Fleming <afleming@freescale.com> Cc: Anton Blanchard <anton@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Grant Likely <grant.likely@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Preeti U. Murthy <preeti@linux.vnet.ibm.com> Cc: Rob Herring <robh+dt@kernel.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: schwidefsky@de.ibm.com Cc: cmetcalf@tilera.com Cc: dietmar.eggemann@arm.com Cc: devicetree@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1397209481-28542-4-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	sched, s390: Create a dedicated topology table	Vincent Guittot
	BOOK level is only relevant for s390 so we create a dedicated topology table with BOOK level and remove it from default table. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Cc: cmetcalf@tilera.com Cc: benh@kernel.crashing.org Cc: dietmar.eggemann@arm.com Cc: preeti@linux.vnet.ibm.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: linux390@de.ibm.com Cc: linux-s390@vger.kernel.org Link: http://lkml.kernel.org/r/1397209481-28542-3-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	sched: Rework sched_domain topology definition	Vincent Guittot
	We replace the old way to configure the scheduler topology with a new method which enables a platform to declare additionnal level (if needed). We still have a default topology table definition that can be used by platform that don't want more level than the SMT, MC, CPU and NUMA ones. This table can be overwritten by an arch which either wants to add new level where a load balance make sense like BOOK or powergating level or wants to change the flags configuration of some levels. For each level, we need a function pointer that returns cpumask for each cpu, a function pointer that returns the flags for the level and a name. Only flags that describe topology, can be set by an architecture. The current topology flags are: SD_SHARE_CPUPOWER SD_SHARE_PKG_RESOURCES SD_NUMA SD_ASYM_PACKING Then, each level must be a subset on the next one. The build sequence of the sched_domain will take care of removing useless levels like those with 1 CPU and those with the same CPU span and no more relevant information for load balancing than its children. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hanjun Guo <hanjun.guo@linaro.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jason Low <jason.low2@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux390@de.ibm.com Cc: linux-ia64@vger.kernel.org Cc: linux-s390@vger.kernel.org Link: http://lkml.kernel.org/r/1397209481-28542-2-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	sched/numa: Do not set preferred_node on migration to a second choice node	Rik van Riel
	Setting the numa_preferred_node for a task in task_numa_migrate does nothing on a 2-node system. Either we migrate to the node that already was our preferred node, or we stay where we were. On a 4-node system, it can slightly decrease overhead, by not calling the NUMA code as much. Since every node tends to be directly connected to every other node, running on the wrong node for a while does not do much damage. However, on an 8 node system, there are far more bad nodes than there are good ones, and pretending that a second choice is actually the preferred node can greatly delay, or even prevent, a workload from converging. The only time we can safely pretend that a second choice node is the preferred node is when the task is part of a workload that spans multiple NUMA nodes. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Vinod Chegu <chegu_vinod@hp.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1397235629-16328-4-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	sched/numa: Retry placement more frequently when misplaced	Rik van Riel
	When tasks have not converged on their preferred nodes yet, we want to retry fairly often, to make sure we do not migrate a task's memory to an undesirable location, only to have to move it again later. This patch reduces the interval at which migration is retried, when the task's numa_scan_period is small. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Vinod Chegu <chegu_vinod@hp.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1397235629-16328-3-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	sched/numa: Count pages on active node as local	Rik van Riel
	The NUMA code is smart enough to distribute the memory of workloads that span multiple NUMA nodes across those NUMA nodes. However, it still has a pretty high scan rate for such workloads, because any memory that is left on a node other than the node of the CPU that faulted on the memory is counted as non-local, which causes the scan rate to go up. Counting the memory on any node where the task's numa group is actively running as local, allows the scan rate to slow down once the application is settled in. This should reduce the overhead of the automatic NUMA placement code, when a workload spans multiple NUMA nodes. Signed-off-by: Rik van Riel <riel@redhat.com> Tested-by: Vinod Chegu <chegu_vinod@hp.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1397235629-16328-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07	Merge branch 'sched/urgent' into sched/core, to avoid conflicts	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>