path: root/kernel
2014-05-22  Merge branch 'pm-cpuidle' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm into sched/core (Ingo Molnar)
Pull scheduling related CPU idle updates from Rafael J. Wysocki. Conflicts: kernel/sched/idle.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22  Merge tag 'v3.15-rc6' into sched/core, to pick up the latest fixes (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22  sched: Fix hotplug vs. set_cpus_allowed_ptr() (Lai Jiangshan)
Lai found that: WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b() ... migration_cpu_stop+0x1d/0x22 was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is always a sub-set of cpu_online_mask. This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness"). So set active and online at the same time to avoid this particular problem. Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness") Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael wang <wangyun@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Toshi Kani <toshi.kani@hp.com> Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22  sched/cpupri: Replace NR_CPUS arrays (Peter Zijlstra)
Tejun reported that his resume was failing due to order-3 allocations from sched_domain building. Replace the NR_CPUS arrays in there with a dynamically allocated array. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-7cysnkw1gik45r864t1nkudh@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
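A hedged, generic sketch of the kind of change this and the following entry describe: a static NR_CPUS-sized array becomes a dynamically sized allocation, so the containing structure no longer needs a large-order contiguous allocation. Names are illustrative, not the actual sched/cpupri or deadline code.

    #include <linux/cpumask.h>
    #include <linux/slab.h>

    struct example_vec {
            int *elements;          /* was: int elements[NR_CPUS]; */
    };

    static int example_vec_init(struct example_vec *vec)
    {
            /* Size by nr_cpu_ids (runtime possible CPUs) rather than the
             * compile-time NR_CPUS maximum. */
            vec->elements = kcalloc(nr_cpu_ids, sizeof(*vec->elements),
                                    GFP_KERNEL);
            return vec->elements ? 0 : -ENOMEM;
    }

    static void example_vec_free(struct example_vec *vec)
    {
            kfree(vec->elements);
    }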
2014-05-22  sched/deadline: Replace NR_CPUS arrays (Peter Zijlstra)
Tejun reported that his resume was failing due to order-3 allocations from sched_domain building. Replace the NR_CPUS arrays in there with a dynamically allocated array. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-kat4gl1m5a6dwy6nzuqox45e@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22  sched/deadline: Restrict user params max value to 2^63 ns (Juri Lelli)
Michael Kerrisk noticed that creating SCHED_DEADLINE reservations with certain parameters (e.g, a runtime of something near 2^64 ns) can cause a system freeze for some amount of time. The problem is that in the interface we have u64 sched_runtime; while internally we need to have a signed runtime (to cope with budget overruns) s64 runtime; At the time we setup a new dl_entity we copy the first value in the second. The cast turns out with negative values when sched_runtime is too big, and this causes the scheduler to go crazy right from the start. Moreover, considering how we deal with deadlines wraparound (s64)(a - b) < 0 we also have to restrict acceptable values for sched_{deadline,period}. This patch fixes the thing checking that user parameters are always below 2^63 ns (still large enough for everyone). It also rewrites other conditions that we check, since in __checkparam_dl we don't have to deal with deadline wraparounds and what we have now erroneously fails when the difference between values is too big. Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Dario Faggioli<raistlin@linux.it> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140513141131.20d944f81633ee937f256385@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
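A minimal userspace illustration (not kernel code) of the sign flip described above: copying a u64 runtime at or above 2^63 ns into a signed 64-bit field yields a negative value on common two's-complement ABIs.

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    int main(void)
    {
            uint64_t sched_runtime = UINT64_C(1) << 63; /* user-supplied u64 */
            int64_t runtime = (int64_t)sched_runtime;   /* kernel's signed copy */

            /* Prints a large negative number: the budget looks permanently
             * overrun, which is why such a reservation could wedge the
             * scheduler before this check existed. */
            printf("%" PRId64 "\n", runtime);
            return 0;
    }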
2014-05-22  sched/deadline: Change sched_getparam() behaviour vs SCHED_DEADLINE (Peter Zijlstra)
The way we read POSIX one should only call sched_getparam() when sched_getscheduler() returns either SCHED_FIFO or SCHED_RR. Given that we currently return sched_param::sched_priority=0 for all others, extend the same behaviour to SCHED_DEADLINE. Requested-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Dario Faggioli <raistlin@linux.it> Cc: linux-man <linux-man@vger.kernel.org> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20140512205034.GH13467@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
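A hedged userspace sketch of the behaviour being aligned here: for policies other than SCHED_FIFO/SCHED_RR (now including SCHED_DEADLINE), sched_getparam() reports sched_priority as 0.

    #include <stdio.h>
    #include <sched.h>

    int main(void)
    {
            struct sched_param sp;

            if (sched_getparam(0, &sp) == 0)        /* pid 0 = calling thread */
                    printf("policy=%d sched_priority=%d\n",
                           sched_getscheduler(0), sp.sched_priority);
            /* For SCHED_OTHER and, after this patch, SCHED_DEADLINE tasks the
             * reported sched_priority is 0. */
            return 0;
    }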
2014-05-22  sched: Disallow sched_attr::sched_policy < 0 (Peter Zijlstra)
The scheduler uses policy=-1 to preserve the current policy state to implement sys_sched_setparam(), this got exposed to userspace by accident through sys_sched_setattr(), cure this. Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140509085311.GJ30445@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-22  sched: Make sched_setattr() correctly return -EFBIG (Michael Kerrisk)
The documented[1] behavior of sched_attr() in the proposed man page text is: sched_attr::size must be set to the size of the structure, as in sizeof(struct sched_attr), if the provided structure is smaller than the kernel structure, any additional fields are assumed '0'. If the provided structure is larger than the kernel structure, the kernel verifies all additional fields are '0' if not the syscall will fail with -E2BIG. As currently implemented, sched_copy_attr() returns -EFBIG for this case, but the logic in sys_sched_setattr() converts that error to -EFAULT. This patch fixes the behavior. [1] http://thread.gmane.org/gmane.linux.kernel/1615615/focus=1697760 Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/536CEC17.9070903@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
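A hedged userspace sketch of the size handshake described in the quoted text. There is no glibc wrapper, so the raw syscall is used; SYS_sched_setattr and the local struct layout are assumptions that require kernel 3.14+ headers.

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Local copy of the ABI struct (assumed layout, per the man-page draft). */
    struct sched_attr {
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;
            uint32_t sched_priority;
            uint64_t sched_runtime;
            uint64_t sched_deadline;
            uint64_t sched_period;
    };

    int main(void)
    {
            struct sched_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);       /* caller must report its size */
            attr.sched_policy = 0;          /* SCHED_OTHER */

            /* If the user struct is larger than the kernel's and the extra
             * bytes are non-zero, the call should now fail with E2BIG
             * instead of EFAULT. */
            if (syscall(SYS_sched_setattr, 0, &attr, 0) == -1)
                    perror("sched_setattr");
            return 0;
    }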
2014-05-21  Merge remote-tracking branch 'origin/x86/urgent' into x86/vdso (H. Peter Anvin)
Resolved Conflicts: arch/x86/vdso/vdso32-setup.c Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-21  Merge remote-tracking branch 'origin/x86/espfix' into x86/vdso (H. Peter Anvin)
Merge x86/espfix into x86/vdso, due to changes in the vdso setup code that otherwise cause conflicts. Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-21  net: filter: cleanup invocation of internal BPF (Alexei Starovoitov)
Kernel API for classic BPF socket filters is:
  sk_unattached_filter_create()  - validate classic BPF, convert, JIT
  SK_RUN_FILTER()                - run it
  sk_unattached_filter_destroy() - destroy socket filter
Cleanup internal BPF kernel API as following:
  sk_filter_select_runtime() - final step of internal BPF creation. Try to JIT internal BPF program, if JIT is not available select interpreter
  SK_RUN_FILTER()            - run it
  sk_filter_free()           - free internal BPF program
Disallow direct calls to BPF interpreter. Execution of the BPF program should be done with SK_RUN_FILTER() macro. Example of internal BPF create, run, destroy:
  struct sk_filter *fp;
  fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
  memcpy(fp->insni, prog, prog_len * sizeof(fp->insni[0]));
  fp->len = prog_len;
  sk_filter_select_runtime(fp);
  SK_RUN_FILTER(fp, ctx);
  sk_filter_free(fp);
Sockets, seccomp, testsuite, tracing are using different ways to populate sk_filter, so first steps of program creation are not common. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-21  Merge branch 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup (Linus Torvalds)
Pull more cgroup fixes from Tejun Heo: "Three more patches to fix cgroup_freezer breakage due to the recent cgroup internal locking changes - an operation cgroup_freezer was using now requires sleepable context and cgroup_freezer was invoking that while holding a spin lock. cgroup_freezer was using an overly elaborate hierarchical locking scheme. While it's possible to convert the hierarchical spinlocks directly to mutexes, this patch simplifies the overall locking so that it uses a global mutex. This has the added benefit of avoiding iterating potentially huge number of tasks under a spinlock. While the patch is on the larger side in the devel cycle, the changes made are mostly straight-forward and the locking logic is a lot simpler afterwards"
* 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix rcu_read_lock() leak in update_if_frozen()
  cgroup_freezer: replace freezer->lock with freezer_mutex
  cgroup: introduce task_css_is_root()
2014-05-20  tracing: Add funcgraph_tail option to print function name after closing braces (Robert Elliott)
In the function-graph tracer, add a funcgraph_tail option to print the function name on all } lines, not just functions whose first line is no longer in the trace buffer. If a function calls other traced functions, its total time appears on its } line. This change allows grep to be used to determine the function for which the line corresponds. Update Documentation/trace/ftrace.txt to describe this new option. Link: http://lkml.kernel.org/p/20140520221041.8359.6782.stgit@beardog.cce.hp.com Signed-off-by: Robert Elliott <elliott@hp.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-20  tracing: Eliminate duplicate TRACE_GRAPH_PRINT_xx defines (Robert Elliott)
Eliminate duplicate TRACE_GRAPH_PRINT_xx defines in trace_functions_graph.c that are already in trace.h. Add TRACE_GRAPH_PRINT_IRQS to trace.h, which is the only one that is missing. Link: http://lkml.kernel.org/p/20140520221031.8359.24733.stgit@beardog.cce.hp.com Signed-off-by: Robert Elliott <elliott@hp.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-05-20  workqueue: use generic attach/detach routine for rescuers (Lai Jiangshan)
There are several problems with the code that rescuers use to bind themselves to the target pool's cpumask.
1) It is very different from how the normal workers bind to cpumask, increasing code complexity and maintenance overhead.
2) The code of cpu-binding for rescuers is complicated.
3) If one or more cpu hotplugs happen while a rescuer is processing its scheduled work items, the rescuer may not stay bound to the cpumask of the pool. This is an allowed behavior, but is still hairy. It will be better if the cpumask of the rescuer is always kept synchronized with the pool across cpu hotplugs.
Using the generic attach/detach routine will solve the above problems and results in much simpler code. tj: Minor description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20  workqueue: separate pool-attaching code out from create_worker() (Lai Jiangshan)
Currently, the code to attach a new worker to its pool is embedded in create_worker(). Separating this code out will make the codes clearer and will allow rescuers to share the code path later. tj: Description and comment updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20  workqueue: rename manager_mutex to attach_mutex (Lai Jiangshan)
manager_mutex is only used to protect the attaching for the pool and the pool->workers list. It protects the pool->workers and operations based on this list, such as: cpu-binding for the workers in the pool->workers the operations to set/clear WORKER_UNBOUND So let's rename manager_mutex to attach_mutex to better reflect its role. This patch is a pure rename. tj: Minor command and description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20  workqueue: narrow the protection range of manager_mutex (Lai Jiangshan)
In create_worker(), as pool->worker_ida now uses ida_simple_get()/ida_simple_put() and doesn't require external synchronization, it doesn't need manager_mutex. struct worker allocation and kthread allocation are not visible by any one before attached, so they don't need manager_mutex either. The above operations are before the attaching operation which attaches the worker to the pool. Between attaching and starting the worker, the worker is already attached to the pool, so the cpu hotplug will handle cpu-binding for the worker correctly and we don't need the manager_mutex after attaching. The conclusion is that only the attaching operation needs manager_mutex, so we narrow the protection section of manager_mutex in create_worker(). Some comments about manager_mutex are removed, because we will rename it to attach_mutex and add worker_attach_to_pool() later which will be self-explanatory. tj: Minor description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20  workqueue: convert worker_idr to worker_ida (Lai Jiangshan)
We no longer iterate workers via worker_idr and worker_idr is used only for allocating/freeing ID, so we can convert it to worker_ida. By using ida_simple_get/remove(), worker_ida doesn't require external synchronization, so we don't need manager_mutex to protect it and the ID-removal code is allowed to be moved out from worker_detach_from_pool(). In a later patch, worker_detach_from_pool() will be used in rescuers which don't have IDs, so we move the ID-removal code out from worker_detach_from_pool() into worker_thread(). tj: Minor description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
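A hedged sketch of the ida_simple_* pattern the conversion relies on; names are illustrative, not the actual workqueue code. The point is that these helpers serialize internally, so ID allocation needs no external mutex.

    #include <linux/idr.h>
    #include <linux/gfp.h>

    static DEFINE_IDA(example_ida);

    static int example_alloc_id(void)
    {
            /* Returns a free ID >= 0, or a negative errno on failure. */
            return ida_simple_get(&example_ida, 0, 0, GFP_KERNEL);
    }

    static void example_free_id(int id)
    {
            ida_simple_remove(&example_ida, id);
    }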
2014-05-20  workqueue: separate iteration role from worker_idr (Lai Jiangshan)
worker_idr has the iteration (iterating for attached workers) and worker ID duties. These two duties don't have to be tied together. We can separate them and use a list for tracking attached workers and iteration. Before this separation, it wasn't possible to add rescuer workers to worker_idr because rescuer workers couldn't allocate IDs dynamically: ID allocation depends on memory allocation, which rescuers can't depend on. After the separation, we can easily add the rescuer workers to the list for iteration without any memory allocation. This is required when we attach the rescuer worker to the pool in a later patch. tj: Minor description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20  workqueue: destroy worker directly in the idle timeout handler (Lai Jiangshan)
Since destroy_worker() doesn't need to sleep nor require manager_mutex, it can be called directly from the idle timeout handler. This lets us remove POOL_MANAGE_WORKERS and maybe_destroy_worker() and simplify manage_workers(). After POOL_MANAGE_WORKERS is removed, worker_thread() no longer needs to test whether it needs to manage after processing works, so we can remove the test branch. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2014-05-20  workqueue: async worker destruction (Lai Jiangshan)
worker destruction includes these parts of code: adjust pool's stats remove the worker from idle list detach the worker from the pool kthread_stop() to wait for the worker's task exit free the worker struct We can find out that there is no essential work to do after kthread_stop(), which means destroy_worker() doesn't need to wait for the worker's task exit, so we can remove kthread_stop() and free the worker struct in the worker exiting path. However, put_unbound_pool() still needs to sync the all the workers' destruction before destroying the pool; otherwise, the workers may access to the invalid pool when they are exiting. So we also move the code of "detach the worker" to the exiting path and let put_unbound_pool() to sync with this code via detach_completion. The code of "detach the worker" is wrapped in a new function "worker_detach_from_pool()" although worker_detach_from_pool() is only called once (in worker_thread()) after this patch, but we need to wrap it for these reasons: 1) The code of "detach the worker" is not short enough to unfold them in worker_thread(). 2) the name of "worker_detach_from_pool()" is self-comment, and we add some comments above the function. 3) it will be shared by rescuer in later patch which allows rescuer and normal thread use the same attach/detach frameworks. The worker id is freed when detaching which happens before the worker is fully dead, but this id of the dying worker may be re-used for a new worker, so the dying worker's task name is changed to "worker/dying" to avoid two or several workers having the same name. Since "detach the worker" is moved out from destroy_worker(), destroy_worker() doesn't require manager_mutex, so the "lockdep_assert_held(&pool->manager_mutex)" in destroy_worker() is removed, and destroy_worker() is not protected by manager_mutex in put_unbound_pool(). tj: Minor description updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20  workqueue: destroy_worker() should destroy idle workers only (Lai Jiangshan)
We used to have the CPU online failure path where a worker is created and then destroyed without being started. A worker was created for the CPU coming online and if the online operation failed the created worker was shut down without being started. But this behavior was changed. The first worker is created and started at the same time for the CPU coming online. It means that we had already ensured in the code that destroy_worker() destroys only idle workers and we don't want to allow it to destroy any non-idle worker in the future. Otherwise, it may be buggy and it may be extremely hard to check. We should force destroy_worker() to destroy only idle workers explicitly. Since destroy_worker() destroys only idle workers, this patch does not change any functionality. We just need to update the comments and the sanity check code. In the sanity check code, we will refuse to destroy the worker if !(worker->flags & WORKER_IDLE). If the worker entered idle which means it is already started, so we remove the check of "worker->flags & WORKER_STARTED", after this removal, WORKER_STARTED is totally unneeded, so we remove WORKER_STARTED too. In the comments for create_worker(), "Create a new worker which is bound..." is changed to "... which is attached..." due to we change the name of this behavior to attaching. tj: Minor description / comment updates. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20  workqueue: use manager lock only to protect worker_idr (Lai Jiangshan)
worker_idr is highly bound to managers and is always/only accessed in manager lock context. So we don't need pool->lock for it. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-20  Merge branch 'pm-sleep' into acpi-pm (Rafael J. Wysocki)
2014-05-20  Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds)
Pull timer fix from Thomas Gleixner: "A single bug fix for a long standing issue: - Updating the expiry value of a relative timer _after_ letting the idle logic select a target cpu for the timer based on its stale expiry value is outright stupid. Thanks to Viresh for spotting the brainfart"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  hrtimer: Set expiry time before switch_hrtimer_base()
2014-05-19  PM / OPP: Make OPP invisible to users in Kconfig (Mark Brown)
The OPP code is an in kernel library selected by its users, there is no no architecture code required to implement it and enabling it without a user just increases the kernel size. Since the users select rather than depend on it just remove the ability to directly set the option from Kconfig. Signed-off-by: Mark Brown <broonie@linaro.org> Acked-by: Nishanth Menon <nm@ti.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-05-19  cgroup: disallow debug controller on the default hierarchy (Tejun Heo)
The debug controller, as its name suggests, exposes cgroup core internals to userland to aid debugging. Unfortunately, except for the name, there's no provision to prevent its usage in production configurations and the controller is widely enabled and mounted leaking internal details to userland. Like most other debug information, the information exposed by debug isn't interesting even for debugging itself once the related parts are working reliably. This controller has no reason for existing. This patch implements cgrp_dfl_root_inhibit_ss_mask which can suppress specific subsystems on the default hierarchy and adds the debug subsystem to it so that it can be gradually deprecated as usages move towards the unified hierarchy. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-19  rcu: Provide API to suppress stall warnings while sysrq runs (Rik van Riel)
Some sysrq handlers can run for a long time, because they dump a lot of data onto a serial console. Having RCU stall warnings pop up in the middle of them only makes the problem worse. This commit provides rcu_sysrq_start() and rcu_sysrq_end() APIs to temporarily suppress RCU CPU stall warnings while a sysrq request is handled. Signed-off-by: Rik van Riel <riel@redhat.com> [ paulmck: Fix TINY_RCU build error. ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
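A hedged sketch of how a long-running handler could use the new API; illustrative only, not taken from the actual sysrq code.

    #include <linux/rcupdate.h>

    static void example_sysrq_dump(int key)
    {
            rcu_sysrq_start();      /* suppress RCU CPU stall warnings */

            /* ... dump a large amount of state to a slow serial console ... */

            rcu_sysrq_end();        /* stall warnings behave normally again */
    }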
2014-05-19  perf/events/core: Drop unused variable after cleanup (Borislav Petkov)
... in 3a497f48637 ("perf: Simplify perf_event_exit_task_context()") Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1399720259-28275-1-git-send-email-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-19  perf: Fix a race between ring_buffer_detach() and ring_buffer_attach() (Peter Zijlstra)
Alexander noticed that we use RCU iteration on rb->event_list but do not use list_{add,del}_rcu() to add,remove entries to that list, nor do we observe proper grace periods when re-using the entries. Merge ring_buffer_detach() into ring_buffer_attach() such that attaching to the NULL buffer is detaching. Furthermore, ensure that between any 'detach' and 'attach' of the same event we observe the required grace period, but only when strictly required. In effect this means that only ioctl(.request = PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the normal initial attach and final detach will not be delayed. This patch should, I think, do the right thing under all circumstances, the 'normal' cases all should never see the extra grace period, but the two cases: 1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a ring_buffer set, will now observe the required grace period between removing itself from the old and attaching itself to the new buffer. This case is 'simple' in that both buffers are present in perf_event_set_output() one could think an unconditional synchronize_rcu() would be sufficient; however... 2) an event that has a buffer attached, the buffer is destroyed (munmap) and then the event is attached to a new/different buffer using PERF_EVENT_IOC_SET_OUTPUT. This case is more complex because the buffer destruction does: ring_buffer_attach(.rb = NULL) followed by the ioctl() doing: ring_buffer_attach(.rb = foo); and we still need to observe the grace period between these two calls due to us reusing the event->rb_entry list_head. In order to make 2 happen we use Paul's latest cond_synchronize_rcu() call. Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140507123526.GD13658@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
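A hedged sketch of the cond_synchronize_rcu() pattern the fix leans on: record the RCU state at detach time and wait at attach time only if a full grace period has not already elapsed. Illustrative only, not the actual perf code.

    #include <linux/rcupdate.h>

    static unsigned long example_detach_cookie;

    static void example_detach(void)
    {
            /* ... unlink the entry from the RCU-protected list ... */
            example_detach_cookie = get_state_synchronize_rcu();
    }

    static void example_attach(void)
    {
            /* Blocks only if a grace period hasn't completed since detach. */
            cond_synchronize_rcu(example_detach_cookie);
            /* ... now safe to reuse the list_head and link it elsewhere ... */
    }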
2014-05-19  perf: Prevent false warning in perf_swevent_add (Jiri Olsa)
The perf cpu offline callback takes down all cpu context events and releases swhash->swevent_hlist. This could race with task context software event being just scheduled on this cpu via perf_swevent_add while cpu hotplug code already cleaned up event's data. The race happens in the gap between the cpu notifier code and the cpu being actually taken down. Note that only cpu ctx events are terminated in the perf cpu hotplug code. It's easily reproduced with: $ perf record -e faults perf bench sched pipe while putting one of the cpus offline: # echo 0 > /sys/devices/system/cpu/cpu1/online Console emits following warning: WARNING: CPU: 1 PID: 2845 at kernel/events/core.c:5672 perf_swevent_add+0x18d/0x1a0() Modules linked in: CPU: 1 PID: 2845 Comm: sched-pipe Tainted: G W 3.14.0+ #256 Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008 0000000000000009 ffff880077233ab8 ffffffff81665a23 0000000000200005 0000000000000000 ffff880077233af8 ffffffff8104732c 0000000000000046 ffff88007467c800 0000000000000002 ffff88007a9cf2a0 0000000000000001 Call Trace: [<ffffffff81665a23>] dump_stack+0x4f/0x7c [<ffffffff8104732c>] warn_slowpath_common+0x8c/0xc0 [<ffffffff8104737a>] warn_slowpath_null+0x1a/0x20 [<ffffffff8110fb3d>] perf_swevent_add+0x18d/0x1a0 [<ffffffff811162ae>] event_sched_in.isra.75+0x9e/0x1f0 [<ffffffff8111646a>] group_sched_in+0x6a/0x1f0 [<ffffffff81083dd5>] ? sched_clock_local+0x25/0xa0 [<ffffffff811167e6>] ctx_sched_in+0x1f6/0x450 [<ffffffff8111757b>] perf_event_sched_in+0x6b/0xa0 [<ffffffff81117a4b>] perf_event_context_sched_in+0x7b/0xc0 [<ffffffff81117ece>] __perf_event_task_sched_in+0x43e/0x460 [<ffffffff81096f1e>] ? put_lock_stats.isra.18+0xe/0x30 [<ffffffff8107b3c8>] finish_task_switch+0xb8/0x100 [<ffffffff8166a7de>] __schedule+0x30e/0xad0 [<ffffffff81172dd2>] ? pipe_read+0x3e2/0x560 [<ffffffff8166b45e>] ? preempt_schedule_irq+0x3e/0x70 [<ffffffff8166b45e>] ? preempt_schedule_irq+0x3e/0x70 [<ffffffff8166b464>] preempt_schedule_irq+0x44/0x70 [<ffffffff816707f0>] retint_kernel+0x20/0x30 [<ffffffff8109e60a>] ? lockdep_sys_exit+0x1a/0x90 [<ffffffff812a4234>] lockdep_sys_exit_thunk+0x35/0x67 [<ffffffff81679321>] ? sysret_check+0x5/0x56 Fixing this by tracking the cpu hotplug state and displaying the WARN only if current cpu is initialized properly. Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: stable@vger.kernel.org Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1396861448-10097-1-git-send-email-jolsa@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-19  perf: Limit perf_event_attr::sample_period to 63 bits (Peter Zijlstra)
Vince reported that using a large sample_period (one with bit 63 set) results in wreckage since while the sample_period is fundamentally unsigned (negative periods don't make sense) the way we implement things very much rely on signed logic. So limit sample_period to 63 bits to avoid tripping over this. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/n/tip-p25fhunibl4y3qi0zuqmyf4b@git.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
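A hedged userspace sketch of the new boundary: a sample_period with bit 63 set should now be rejected by perf_event_open() (the commit does not name the errno; EINVAL is the usual choice).

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    int main(void)
    {
            struct perf_event_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.type = PERF_TYPE_SOFTWARE;
            attr.size = sizeof(attr);
            attr.config = PERF_COUNT_SW_TASK_CLOCK;
            attr.sample_period = UINT64_C(1) << 63; /* effectively negative */

            if (syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0) == -1)
                    perror("perf_event_open");      /* expected to fail now */
            return 0;
    }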
2014-05-19  futex: Prevent attaching to kernel threads (Thomas Gleixner)
We happily allow userspace to declare a random kernel thread to be the owner of a user space PI futex. Found while analysing the fallout of Dave Jones syscall fuzzer. We also should validate the thread group for private futexes and find some fast way to validate whether the "alleged" owner has RW access on the file which backs the SHM, but that's a separate issue. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <darren@dvhart.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Clark Williams <williams@redhat.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Carlos ODonell <carlos@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
2014-05-19  futex: Add another early deadlock detection check (Thomas Gleixner)
Dave Jones trinity syscall fuzzer exposed an issue in the deadlock detection code of rtmutex: http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com That underlying issue has been fixed with a patch to the rtmutex code, but the futex code must not call into rtmutex in that case because - it can detect that issue early - it avoids a different and more complex fixup for backing out If the user space variable got manipulated to 0x80000000 which means no lock holder, but the waiters bit set and an active pi_state in the kernel is found we can figure out the recursive locking issue by looking at the pi_state owner. If that is the current task, then we can safely return -EDEADLK. The check should have been added in commit 59fa62451 (futex: Handle futex_pi OWNER_DIED take over correctly) already, but I did not see the above issue caused by user space manipulation back then. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dave Jones <davej@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <darren@dvhart.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Clark Williams <williams@redhat.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Roland McGrath <roland@hack.frob.com> Cc: Carlos ODonell <carlos@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
2014-05-16  PM / hibernate: Fix memory corruption in resumedelay_setup() (Dan Carpenter)
In the original code "resume_delay" is an int so on 64 bits, the call to kstrtoul() will cause memory corruption. We may as well fix a style issue here as well and make "resume_delay" unsigned int, since that's what we pass to ssleep(). Fixes: 317cf7e5e85e (PM / hibernate: convert simple_strtoul to kstrtoul) Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
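A small userspace illustration of the bug class (not the kernel code): on 64-bit, storing an unsigned long through what is really a 4-byte int clobbers the bytes that follow it.

    #include <stdio.h>
    #include <string.h>

    struct opts {
            int resume_delay;       /* only 4 bytes wide... */
            int neighbour;          /* ...so this gets overwritten below */
    };

    int main(void)
    {
            struct opts o = { 0, 0x11111111 };
            unsigned long parsed = 5;       /* what kstrtoul() would produce */

            /* Mimics handing &resume_delay to a parser that writes an
             * unsigned long: 8 bytes land on a 4-byte field (on 64-bit). */
            memcpy(&o.resume_delay, &parsed, sizeof(parsed));

            printf("resume_delay=%d neighbour=0x%x\n",
                   o.resume_delay, o.neighbour); /* neighbour is corrupted */
            return 0;
    }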
2014-05-16  cgroup: convert cgroup_has_live_children() into css_has_online_children() (Tejun Heo)
Now that cgroup liveliness and css onliness are the same state, convert cgroup_has_live_children() into css_has_online_children() so that it can be used for actual csses too. The function now uses css_for_each_child() for iteration and is published. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
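A hedged, kernel-internal sketch of the published iteration pattern the new helper builds on (assumed usage, not the actual css_has_online_children() body; compiles only in-tree).

    #include <linux/cgroup.h>
    #include <linux/rcupdate.h>

    static bool example_has_online_children(struct cgroup_subsys_state *css)
    {
            struct cgroup_subsys_state *child;
            bool ret = false;

            rcu_read_lock();
            css_for_each_child(child, css) {
                    if (child->flags & CSS_ONLINE) {
                            ret = true;
                            break;
                    }
            }
            rcu_read_unlock();
            return ret;
    }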
2014-05-16  cgroup: use CSS_ONLINE instead of CGRP_DEAD (Tejun Heo)
Use CSS_ONLINE on the self css to indicate whether a cgroup has been killed instead of CGRP_DEAD. This will allow re-using css online test for cgroup liveliness test. This doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16  cgroup: iterate cgroup_subsys_states directly (Tejun Heo)
Currently, css_next_child() is implemented as finding the next child cgroup which has the css enabled, which used to be the only way to do it as only cgroups participated in sibling lists and thus could be iterated. This works as long as what's required during iteration is not missing online csses; however, it turns out that there are use cases where offlined but not yet released csses need to be iterated. This is difficult to implement through cgroup iteration on the unified hierarchy as there may be multiple dying csses for the same subsystem associated with a single cgroup. After the recent changes, the cgroup self and regular csses behave identically in how they're linked and unlinked from the sibling lists including assertion of CSS_RELEASED, and css_next_child() can simply switch to iterating csses directly. This both simplifies the logic and ensures that all visible non-released csses are included in the iteration whether there are multiple dying csses for a subsystem or not. As all other iterators depend on css_next_child() for sibling iteration, this changes behaviors of all css iterators. Add and update explanations on the css states which are included in traversal to all iterators. As css iteration could always contain offlined csses, this shouldn't break any of the current users and new usages which need iteration of all on and offline csses can make use of the new semantics. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-05-16  cgroup: introduce CSS_RELEASED and reduce css iteration fallback window (Tejun Heo)
css iterations allow the caller to drop RCU read lock. As long as the caller keeps the current position accessible, it can simply re-grab RCU read lock later and continue iteration. This is achieved by using CGRP_DEAD to detect whether the current positions next pointer is safe to dereference and if not re-iterate from the beginning to the next position using ->serial_nr. CGRP_DEAD is used as the marker to invalidate the next pointer and the only requirement is that the marker is set before the next sibling starts its RCU grace period. Because CGRP_DEAD is set at the end of cgroup_destroy_locked() but the cgroup is unlinked when the reference count reaches zero, we currently have a rather large window where this fallback re-iteration logic can be triggered. This patch introduces CSS_RELEASED which is set when a css is unlinked from its sibling list. This still keeps the re-iteration logic working while drastically reducing the window of its activation. While at it, rewrite the comment in css_next_child() to reflect the new flag and better explain the synchronization. This will also enable iterating csses directly instead of through cgroups. v2: CSS_RELEASED now assigned to 1 << 2 as 1 << 0 is used by CSS_NO_REF. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16  cgroup: move cgroup->serial_nr into cgroup_subsys_state (Tejun Heo)
We're moving towards using cgroup_subsys_states as the fundamental structural blocks. All csses including the cgroup->self and actual ones now form trees through css->children and ->sibling which follow the same rules as what cgroup->children and ->sibling followed. This patch moves cgroup->serial_nr which is used to implement css iteration into css. Note that all csses, regardless of their types, allocate their serial numbers from the same monotonically increasing counter. This doesn't affect the ordering needed by css iteration or cause any other material behavior changes. This will be used to update css iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16  cgroup: link all cgroup_subsys_states in their sibling lists (Tejun Heo)
Currently, while all csses have ->children and ->sibling, only the self csses of cgroups make use of them. This patch makes all other csses to link themselves on the sibling lists too. This will be used to update css iteration. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16  cgroup: move cgroup->sibling and ->children into cgroup_subsys_state (Tejun Heo)
We're moving towards using cgroup_subsys_states as the fundamental structural blocks. Let's move cgroup->sibling and ->children into cgroup_subsys_state. This is pure move without functional change and only cgroup->self's fields are actually used. Other csses will make use of the fields later. While at it, update init_and_link_css() so that it zeroes the whole css before initializing it and remove explicit zeroing of ->flags. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16  cgroup: remove cgroup->parent (Tejun Heo)
cgroup->parent is redundant as cgroup->self.parent can also be used to determine the parent cgroup and we're moving towards using cgroup_subsys_states as the fundamental structural blocks. This patch introduces cgroup_parent() which follows cgroup->self.parent and removes cgroup->parent. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
2014-05-16  cgroup: remove css_parent() (Tejun Heo)
cgroup in general is moving towards using cgroup_subsys_state as the fundamental structural component and css_parent() was introduced to convert from using cgroup->parent to css->parent. It was quite some time ago and we're moving forward with making css more prominent. This patch drops the trivial wrapper css_parent() and let the users dereference css->parent. While at it, explicitly mark fields of css which are public and immutable. v2: New usage from device_cgroup.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-05-16  cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css (Tejun Heo)
9395a4500404 ("cgroup: enable refcnting for root csses") enabled reference counting for root csses (cgroup_subsys_states) so that cgroup's self csses can be used to manage the lifetime of the containing cgroups. Unfortunately, this change was incorrect. During early init, cgrp_dfl_root self css refcnt is used. percpu_ref can't be initialized during early init and its initialization is deferred till cgroup_init() time. This means that cpu was using percpu_ref which wasn't properly initialized. Due to the way percpu variables are laid out on x86, this didn't blow up immediately on x86 but ended up incrementing and decrementing the percpu variable at offset zero, whatever it may be; however, on other archs, this caused a fault and early boot failure. As cgroup self csses for root cgroups of non-dfl hierarchies need working refcounting, we can't revert 9395a4500404. This patch adds CSS_NO_REF which explicitly inhibits reference counting on the css and sets it on all normal (non-self) csses and cgroup_dfl_root self css. v2: cgrp_dfl_root.self is the offending one. Set the flag on it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Warren <swarren@nvidia.com> Tested-by: Stephen Warren <swarren@nvidia.com> Fixes: 9395a4500404 ("cgroup: enable refcnting for root csses")
2014-05-16  genirq: Remove dynamic_irq mess (Thomas Gleixner)
No more users. Get rid of the cruft. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Grant Likely <grant.likely@linaro.org> Tested-by: Tony Luck <tony.luck@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140507154341.012847637@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-16  genirq: Replace dynamic_irq_init/cleanup (Thomas Gleixner)
Create a new interface and confine it with a config switch which makes clear that this is just legacy support and not to be used for new code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Grant Likely <grant.likely@linaro.org> Tested-by: Tony Luck <tony.luck@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140507154340.574437049@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-05-16  genirq: Remove irq_reserve_irq[s] (Thomas Gleixner)
No more users. And it's not going to come back. If you need hotplugable irq chips, use irq domains. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-and-acked-by: Grant Likely <grant.likely@linaro.org> Tested-by: Tony Luck <tony.luck@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140507154340.302183048@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>