summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2012-10-24sched: Track the runnable average on a per-task entity basisPaul Turner
Instead of tracking averaging the load parented by a cfs_rq, we can track entity load directly. With the load for a given cfs_rq then being the sum of its children. To do this we represent the historical contribution to runnable average within each trailing 1024us of execution as the coefficients of a geometric series. We can express this for a given task t as: runnable_sum(t) = \Sum u_i * y^i, runnable_avg_period(t) = \Sum 1024 * y^i load(t) = weight_t * runnable_sum(t) / runnable_avg_period(t) Where: u_i is the usage in the last i`th 1024us period (approximately 1ms) ~ms and y is chosen such that y^k = 1/2. We currently choose k to be 32 which roughly translates to about a sched period. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Ben Segall <bsegall@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.372695337@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-24timers, sched: Correct the comments for tick_sched_timer()Chuansheng Liu
In the comments of function tick_sched_timer(), the sentence "timer->base->cpu_base->lock held" is not right. In function __run_hrtimer(), before call timer->function(), the cpu_base->lock has been unlocked. Signed-off-by: liu chuansheng <chuansheng.liu@intel.com> Cc: fei.li@intel.com Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1351098455.15558.1421.camel@cliu38-desktop-build Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-23console: use might_sleep in console_lockDaniel Vetter
Instead of BUG_ON(in_interrupt()), since that doesn't check for all the newfangled stuff like preempt. Note that this is valid since the console_sem is essentially used like a real mutex with only two twists: - we allow trylock from hardirq context - across suspend/resume we lock the logical console_lock, but drop the semaphore protecting the locking state. Now that doesn't guarantee that no one is playing tricks in single-thread atomic contexts at suspend/resume/boot time, but - I couldn't find anything suspicious with some grepping, - might_sleep shouldn't die, - and I think the upside of catching more potential issues is worth the risk of getting a might_sleep backtrace that would have been save (and then dealing with that fallout). Cc: Dave Airlie <airlied@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-24Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Most of these are uprobes race fixes from Oleg, and their preparatory cleanups. (It's larger than what I'd normally send for an -rc kernel, but they looked significant enough to not delay them.) There's also an oprofile fix and an uncore PMU fix." * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) perf/x86: Disable uncore on virtualized CPUs oprofile, x86: Fix wrapping bug in op_x86_get_ctrl() ring-buffer: Check for uninitialized cpu buffer before resizing uprobes: Fix the racy uprobe->flags manipulation uprobes: Fix prepare_uprobe() race with itself uprobes: Introduce prepare_uprobe() uprobes: Fix handle_swbp() vs unregister() + register() race uprobes: Do not delete uprobe if uprobe_unregister() fails uprobes: Don't return success if alloc_uprobe() fails uprobes/x86: Only rep+nop can be emulated correctly uprobes: Simplify is_swbp_at_addr(), remove stale comments uprobes: Kill set_orig_insn()->is_swbp_at_addr() uprobes: Introduce copy_opcode(), kill read_opcode() uprobes: Kill set_swbp()->is_swbp_at_addr() uprobes: Restrict valid_vma(false) to skip VM_SHARED vmas uprobes: Change valid_vma() to demand VM_MAYEXEC rather than VM_EXEC uprobes: Change write_opcode() to use FOLL_FORCE uprobes: Move clear_thread_flag(TIF_UPROBE) to uprobe_notify_resume() uprobes: Kill UTASK_BP_HIT state uprobes: Fix UPROBE_SKIP_SSTEP checks in handle_swbp() ...
2012-10-23rcu: Dump number of callbacks in stall warning messagesPaul E. McKenney
In theory, if a grace period manages to get started despite there being no callbacks on any of the CPUs, all CPUs could go into dyntick-idle mode, so that the grace period would never end. This commit updates the RCU CPU stall warning messages to detect this condition by summing up the number of callbacks on all CPUs. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23rcu: Add grace-period information to RCU CPU stall warningsPaul E. McKenney
This commit causes the last grace period started and completed to be printed on RCU CPU stall warning messages in order to aid diagnosis. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23rcu: Print remote CPU's stacks in stall warningsPaul E. McKenney
The RCU CPU stall warnings rely on trigger_all_cpu_backtrace() to do NMI-based dump of the stack traces of all CPUs. Unfortunately, a number of architectures do not implement trigger_all_cpu_backtrace(), in which case RCU falls back to just dumping the stack of the running CPU. This is unhelpful in the case where the running CPU has detected that some other CPU has stalled. This commit therefore makes the running CPU dump the stacks of the tasks running on the stalled CPUs. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23srcu: Export process_srcu()Lai Jiangshan
Because process_srcu() will be used in DEFINE_SRCU(), which is a macro that could be expanded pretty much anywhere, it can no longer be static. Note that process_srcu() is still internal to srcu.h. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23srcu: Credit Lai Jiangshan with SRCU rewriteLai Jiangshan
Lai Jiangshan rewrote SRCU, so this commit ensures that he gets his proper share of blame^Wcredit. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23rcu: Fix precedence error in cpu_needs_another_gp()Paul E. McKenney
The fix introduced by a10d206e (rcu: Fix day-one dyntick-idle stall-warning bug) has a C-language precedence error. It turns out that this error is harmless in that the same result is computed for all inputs, but the code is nevertheless a potential source of confusion. This commit therefore introduces parentheses in order to force the execution of the code to reflect the intent. Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23rcu: Add a module parameter to force use of expedited RCU primitivesAntti P Miettinen
There have been some embedded applications that would benefit from use of expedited grace-period primitives. In some ways, this is similar to synchronize_net() doing either a normal or an expedited grace period depending on lock state, but with control outside of the kernel. This commit therefore adds rcu_expedited boot and sysfs parameters that cause the kernel to substitute expedited primitives for the normal grace-period primitives. [ paulmck: Add trace/event/rcu.h to kernel/srcu.c to avoid build error. Get rid of infinite loop through contention path.] Signed-off-by: Antti P Miettinen <amiettinen@nvidia.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23rcu: Remove rcu_switch()Frederic Weisbecker
It's only there to call rcu_user_hooks_switch(). Let's just call rcu_user_hooks_switch() directly, we don't need this function in the middle. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23rcu: Make rcutorture give diagnostics if CPU offline failsPaul E. McKenney
This commit causes rcutorture to print the errno if cpu_down() fails when the rcutorture "verbose" module parameter is specified. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-23rcu: Fix comment about _rcu_barrier()/orphanage exclusionPaul E. McKenney
In the old days, _rcu_barrier() acquired ->onofflock to exclude rcu_send_cbs_to_orphanage(), which allowed the latter to avoid memory barriers in callback handling. However, _rcu_barrier() recently started doing get_online_cpus() to lock out CPU-hotplug operations entirely, which means that the comment in rcu_send_cbs_to_orphanage() that talks about ->onofflock is now obsolete. This commit therefore fixes the comment. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-22console: implement lockdep support for console_lockDaniel Vetter
Dave Airlie recently discovered a locking bug in the fbcon layer, where a timer_del_sync (for the blinking cursor) deadlocks with the timer itself, since both (want to) hold the console_lock: https://lkml.org/lkml/2012/8/21/36 Unfortunately the console_lock isn't a plain mutex and hence has no lockdep support. Which resulted in a few days wasted of tracking down this bug (complicated by the fact that printk doesn't show anything when the console is locked) instead of noticing the bug much earlier with the lockdep splat. Hence I've figured I need to fix that for the next deadlock involving console_lock - and with kms/drm growing ever more complex locking that'll eventually happen. Now the console_lock has rather funky semantics, so after a quick irc discussion with Thomas Gleixner and Dave Airlie I've quickly ditched the original idead of switching to a real mutex (since it won't work) and instead opted to annotate the console_lock with lockdep information manually. There are a few special cases: - The console_lock state is protected by the console_sem, and usually grabbed/dropped at _lock/_unlock time. But the suspend/resume code drops the semaphore without dropping the console_lock (see suspend_console/resume_console). But since the same thread that did the suspend will do the resume, we don't need to fix up anything. - In the printk code there's a special trylock, only used to kick off the logbuffer printk'ing in console_unlock. But all that happens while lockdep is disable (since printk does a few other evil tricks). So no issue there, either. - The console_lock can also be acquired form irq context (but only with a trylock). lockdep already handles that. This all leaves us with annotating the normal console_lock, _unlock and _trylock functions. And yes, it works - simply unloading a drm kms driver resulted in lockdep complaining about the deadlock in fbcon_deinit: ====================================================== [ INFO: possible circular locking dependency detected ] 3.6.0-rc2+ #552 Not tainted ------------------------------------------------------- kms-reload/3577 is trying to acquire lock: ((&info->queue)){+.+...}, at: [<ffffffff81058c70>] wait_on_work+0x0/0xa7 but task is already holding lock: (console_lock){+.+.+.}, at: [<ffffffff81264686>] bind_con_driver+0x38/0x263 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (console_lock){+.+.+.}: [<ffffffff81087440>] lock_acquire+0x95/0x105 [<ffffffff81040190>] console_lock+0x59/0x5b [<ffffffff81209cb6>] fb_flashcursor+0x2e/0x12c [<ffffffff81057c3e>] process_one_work+0x1d9/0x3b4 [<ffffffff810584a2>] worker_thread+0x1a7/0x24b [<ffffffff8105ca29>] kthread+0x7f/0x87 [<ffffffff813b1204>] kernel_thread_helper+0x4/0x10 -> #0 ((&info->queue)){+.+...}: [<ffffffff81086cb3>] __lock_acquire+0x999/0xcf6 [<ffffffff81087440>] lock_acquire+0x95/0x105 [<ffffffff81058cab>] wait_on_work+0x3b/0xa7 [<ffffffff81058dd6>] __cancel_work_timer+0xbf/0x102 [<ffffffff81058e33>] cancel_work_sync+0xb/0xd [<ffffffff8120a3b3>] fbcon_deinit+0x11c/0x1dc [<ffffffff81264793>] bind_con_driver+0x145/0x263 [<ffffffff81264a45>] unbind_con_driver+0x14f/0x195 [<ffffffff8126540c>] store_bind+0x1ad/0x1c1 [<ffffffff8127cbb7>] dev_attr_store+0x13/0x1f [<ffffffff8116d884>] sysfs_write_file+0xe9/0x121 [<ffffffff811145b2>] vfs_write+0x9b/0xfd [<ffffffff811147b7>] sys_write+0x3e/0x6b [<ffffffff813b0039>] system_call_fastpath+0x16/0x1b other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(console_lock); lock((&info->queue)); lock(console_lock); lock((&info->queue)); *** DEADLOCK *** v2: Mark the lockdep_map static, noticed by Jani Nikula. Cc: Dave Airlie <airlied@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-23PM / QoS: Introduce request and constraint data types for PM QoS flagsRafael J. Wysocki
Introduce struct pm_qos_flags_request and struct pm_qos_flags representing PM QoS flags request type and PM QoS flags constraint type, respectively. With these definitions the data structures will be arranged so that the list member of a struct pm_qos_flags object will contain the head of a list of struct pm_qos_flags_request objects representing all of the "flags" requests present for the given device. Then, the effective_flags member of a struct pm_qos_flags object will contain the bitwise OR of the flags members of all the struct pm_qos_flags_request objects in the list. Additionally, introduce helper function pm_qos_update_flags() allowing the caller to manage the list of struct pm_qos_flags_request pointed to by the list member of struct pm_qos_flags. The flags are of type s32 so that the request's "value" field is always of the same type regardless of what kind of request it is (latency requests already have value fields of type s32). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Jean Pihet <j-pihet@ti.com> Acked-by: mark gross <markgross@thegnar.org>
2012-10-22module_signing: fix printk format warningRandy Dunlap
Fix the warning: kernel/module_signing.c:195:2: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'size_t' by using the proper 'z' modifier for printing a size_t. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-21Merge branch 'tip/perf/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent Pull ftrace ring-buffer resizing fix from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-21Merge branch 'uprobes/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into perf/urgent Pull various uprobes bugfixes from Oleg Nesterov - mostly race and failure path fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-21Merge branch 'nohz/core' of git://github.com/fweisbec/linux-dynticks into ↵Ingo Molnar
timers/core Pull uncontroversial cleanup/refactoring nohz patches from Frederic Weisbecker. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-10-20cgroup_freezer: don't use cgroup_lock_live_group()Tejun Heo
freezer_read/write() used cgroup_lock_live_group() to synchronize against task migration into and out of the target cgroup. cgroup_lock_live_group() grabs the internal cgroup lock and using it from outside cgroup core leads to complex and fragile locking dependency issues which are difficult to resolve. Now that freezer_can_attach() is replaced with freezer_attach() and update_if_frozen() updated, nothing requires excluding migration against freezer state reads and changes. This patch removes cgroup_lock_live_group() and the matching cgroup_unlock() usages. The prone-to-bitrot, already outdated and unnecessary global lock hierarchy documentation is replaced with documentation in local scope. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Li Zefan <lizefan@huawei.com>
2012-10-20cgroup_freezer: prepare update_if_frozen() for locking changeTejun Heo
Locking will change such that migration can happen while freezer_read/write() is in progress. This means that update_if_frozen() can no longer assume that all tasks in the cgroup coform to the current freezer state - newly migrated tasks which haven't finished freezer_attach() yet might be in any state. This patch updates update_if_frozen() such that it no longer verifies task states against freezer state. It now simply decides whether FREEZING stage is complete. This removal of verification makes it meaningless to call from freezer_change_state(). Drop it and move the fast exit test from freezer_read() - the only left caller - to update_if_frozen(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Li Zefan <lizefan@huawei.com>
2012-10-20cgroup_freezer: allow moving tasks in and out of a frozen cgroupTejun Heo
cgroup_freezer is one of the few users of cgroup_subsys->can_attach() and uses it to prevent tasks from being migrated into or out of a frozen cgroup. This makes cgroup_freezer cumbersome to use especially when co-mounted with other controllers. ->can_attach() is problematic in general as it can make co-mounting multiple cgroups difficult - migrating tasks may fail for reasons completely irrelevant for other controllers. freezer_can_attach() in particular is more problematic because it messes with cgroup internal locking to ensure that the state verification performed at freezer_can_attach() stays valid until migration is complete. This patch replaces freezer_can_attach() with freezer_attach() so that tasks are always allowed to migrate - they are nudged into the conforming state from freezer_attach(). This means that there can be tasks which are being migrated which don't conform to the current cgroup_freezer state until freezer_attach() is complete. Under the current locking scheme, the only such place is freezer_fork() which is updated to handle such window. While this patch doesn't remove the use of internal cgroup locking from freezer_read/write() paths, it removes the requirement to keep the freezer state constant while migrating and enables such change. Note that this creates a userland visible behavior change - FROZEN cgroup can no longer be used to lock migrations in and out of the cgroup. This behavior change is intended. I don't think the feature is necessary - userland should coordinate accesses to cgroup fs anyway - and even if the feature is needed cgroup_freezer is the completely wrong place to implement it. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <1350426526-14254-1-git-send-email-tj@kernel.org> Cc: Matt Helsley <matthltc@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Li Zefan <lizefan@huawei.com>
2012-10-20rcu: Accelerate callbacks for CPU initiating a grace periodPaul E. McKenney
Because grace-period initialization is carried out by a separate kthread, it might happen on a different CPU than the one that had the callback needing a grace period -- which is where the callback acceleration needs to happen. Fortunately, rcu_start_gp() holds the root rcu_node structure's ->lock, which prevents a new grace period from starting. This allows this function to safely determine that a grace period has not yet started, which in turn allows it to fully accelerate any callbacks that it has pending. This commit adds this acceleration. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-10-19use clamp_t in UNAME26 fixKees Cook
The min/max call needed to have explicit types on some architectures (e.g. mn10300). Use clamp_t instead to avoid the warning: kernel/sys.c: In function 'override_release': kernel/sys.c:1287:10: warning: comparison of distinct pointer types lacks a cast [enabled by default] Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19MODSIGN: Move the magic string to the end of a module and eliminate the searchDavid Howells
Emit the magic string that indicates a module has a signature after the signature data instead of before it. This allows module_sig_check() to be made simpler and faster by the elimination of the search for the magic string. Instead we just need to do a single memcmp(). This works because at the end of the signature data there is the fixed-length signature information block. This block then falls immediately prior to the magic number. From the contents of the information block, it is trivial to calculate the size of the signature data and thus the size of the actual module data. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19Revert "cgroup: Remove task_lock() from cgroup_post_fork()"Tejun Heo
This reverts commit 7e3aa30ac8c904a706518b725c451bb486daaae9. The commit incorrectly assumed that fork path always performed threadgroup_change_begin/end() and depended on that for synchronization against task exit and cgroup migration paths instead of explicitly grabbing task_lock(). threadgroup_change is not locked when forking a new process (as opposed to a new thread in the same process) and even if it were it wouldn't be effective as different processes use different threadgroup locks. Revert the incorrect optimization. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <20121008020000.GB2575@localhost> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: stable@vger.kernel.org
2012-10-19Revert "cgroup: Drop task_lock(parent) on cgroup_fork()"Tejun Heo
This reverts commit 7e381b0eb1e1a9805c37335562e8dc02e7d7848c. The commit incorrectly assumed that fork path always performed threadgroup_change_begin/end() and depended on that for synchronization against task exit and cgroup migration paths instead of explicitly grabbing task_lock(). threadgroup_change is not locked when forking a new process (as opposed to a new thread in the same process) and even if it were it wouldn't be effective as different processes use different threadgroup locks. Revert the incorrect optimization. Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <20121008020000.GB2575@localhost> Acked-by: Li Zefan <lizefan@huawei.com> Bitterly-Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: stable@vger.kernel.org
2012-10-19pidns: remove recursion from free_pid_ns()Cyrill Gorcunov
free_pid_ns() operates in a recursive fashion: free_pid_ns(parent) put_pid_ns(parent) kref_put(&ns->kref, free_pid_ns); free_pid_ns thus if there was a huge nesting of namespaces the userspace may trigger avalanche calling of free_pid_ns leading to kernel stack exhausting and a panic eventually. This patch turns the recursion into an iterative loop. Based on a patch by Andrew Vagin. [akpm@linux-foundation.org: export put_pid_ns() to modules] Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Andrew Vagin <avagin@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-19kernel/sys.c: fix stack memory content leak via UNAME26Kees Cook
Calling uname() with the UNAME26 personality set allows a leak of kernel stack contents. This fixes it by defensively calculating the length of copy_to_user() call, making the len argument unsigned, and initializing the stack buffer to zero (now technically unneeded, but hey, overkill). CVE-2012-0957 Reported-by: PaX Team <pageexec@freemail.hu> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Brad Spengler <spender@grsecurity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-16printk: Fix scheduling-while-atomic problem in console_cpu_notify()Paul E. McKenney
The console_cpu_notify() function runs with interrupts disabled in the CPU_DYING case. It therefore cannot block, for example, as will happen when it calls console_lock(). Therefore, remove the CPU_DYING leg of the switch statement to avoid this problem. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-16cgroup: notify_on_release may not be triggered in some casesDaisuke Nishimura
notify_on_release must be triggered when the last process in a cgroup is move to another. But if the first(and only) process in a cgroup is moved to another, notify_on_release is not triggered. # mkdir /cgroup/cpu/SRC # mkdir /cgroup/cpu/DST # # echo 1 >/cgroup/cpu/SRC/notify_on_release # echo 1 >/cgroup/cpu/DST/notify_on_release # # sleep 300 & [1] 8629 # # echo 8629 >/cgroup/cpu/SRC/tasks # echo 8629 >/cgroup/cpu/DST/tasks -> notify_on_release for /SRC must be triggered at this point, but it isn't. This is because put_css_set() is called before setting CGRP_RELEASABLE in cgroup_task_migrate(), and is a regression introduce by the commit:74a1166d(cgroups: make procs file writable), which was merged into v3.0. Cc: Ben Blum <bblum@andrew.cmu.edu> Cc: <stable@vger.kernel.org> # v3.0.x and later Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-10-16cgroup_freezer: don't stall transition to FROZEN for PF_NOFREEZE or ↵Tejun Heo
PF_FREEZER_SKIP tasks cgroup_freezer doesn't transition from FREEZING to FROZEN if the cgroup contains PF_NOFREEZE tasks or tasks sleeping with PF_FREEZER_SKIP set. Only kernel tasks can be non-freezable (PF_NOFREEZE) and there's nothing cgroup_freezer or userland can do about or to it. It's pointless to stall the transition for PF_NOFREEZE tasks. PF_FREEZER_SKIP indicates that the task can be skipped when determining whether frozen state is reached. A task with PF_FREEZER_SKIP is guaranteed to perform try_to_freeze() after it wakes up and can be considered frozen much like stopped or traced tasks. Note that a vfork parent uses PF_FREEZER_SKIP while waiting for the child. This updates update_if_frozen() such that it only considers freezable tasks and treats %true freezer_should_skip() tasks as frozen. This allows cgroups w/ kthreads and vfork parents successfully reach FROZEN state. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@sisk.pl>
2012-10-16cgroup_freezer: make it official that writes to freezer.state don't failTejun Heo
try_to_freeze_cgroup() has condition checks which are intended to fail the write operation to freezer.state if there are tasks which can't be frozen. The condition checks have been broken for quite some time now. freeze_task() returns %false if the target task can't be frozen, so num_cant_freeze_now is never incremented. In addition, strangely, cgroup freezing proceeds even after the write is failed, which is rather broken. This patch rips out the non-working code intended to fail the write to freezer.state when the cgroup contains non-freezable tasks and makes it official that writes to freezer.state succeed whether there are non-freezable tasks in the cgroup or not. This leaves is_task_frozen_enough() with only one user - upste_if_frozen(). Collapse it into the caller. Note that this removes an extra call to freezing(). This doesn't cause any userland behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@sisk.pl>
2012-10-16cgroup: cgroup_subsys->fork() should be called after the task is added to ↵Tejun Heo
css_set cgroup core has a bug which violates a basic rule about event notifications - when a new entity needs to be added, you add that to the notification list first and then make the new entity conform to the current state. If done in the reverse order, an event happening inbetween will be lost. cgroup_subsys->fork() is invoked way before the new task is added to the css_set. Currently, cgroup_freezer is the only user of ->fork() and uses it to make new tasks conform to the current state of the freezer. If FROZEN state is requested while fork is in progress between cgroup_fork_callbacks() and cgroup_post_fork(), the child could escape freezing - the cgroup isn't frozen when ->fork() is called and the freezer couldn't see the new task on the css_set. This patch moves cgroup_subsys->fork() invocation to cgroup_post_fork() after the new task is added to the css_set. cgroup_fork_callbacks() is removed. Because now a task may be migrated during cgroup_subsys->fork(), freezer_fork() is updated so that it adheres to the usual RCU locking and the rather pointless comment on why locking can be different there is removed (if it doesn't make anything simpler, why even bother?). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@vger.kernel.org
2012-10-16sched: Add WAKEUP_PREEMPTION feature flag, on by defaultIngo Molnar
As per the recent discussion with Mike and Linus, make it easier to test with/without this feature. No change in default behavior. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-izoxq4haeg4mTognnDbwcevt@git.kernel.org
2012-10-15tick: Conditionally build nohz specific code in tick handlerFrederic Weisbecker
This optimize a bit the high res tick sched handler. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org>
2012-10-15tick: Consolidate tick handling for high and low res handlersFrederic Weisbecker
Besides unifying code, this also adds the idle check before processing idle accounting specifics on the low res handler. This way we also generalize this part of the nohz code for !CONFIG_HIGH_RES_TIMERS to prepare for the adaptive tickless features. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org>
2012-10-15tick: Consolidate timekeeping handling codeFrederic Weisbecker
Unify the duplicated timekeeping handling code of low and high res tick sched handlers. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org>
2012-10-14Merge branch 'modules-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull module signing support from Rusty Russell: "module signing is the highlight, but it's an all-over David Howells frenzy..." Hmm "Magrathea: Glacier signing key". Somebody has been reading too much HHGTTG. * 'modules-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (37 commits) X.509: Fix indefinite length element skip error handling X.509: Convert some printk calls to pr_devel asymmetric keys: fix printk format warning MODSIGN: Fix 32-bit overflow in X.509 certificate validity date checking MODSIGN: Make mrproper should remove generated files. MODSIGN: Use utf8 strings in signer's name in autogenerated X.509 certs MODSIGN: Use the same digest for the autogen key sig as for the module sig MODSIGN: Sign modules during the build process MODSIGN: Provide a script for generating a key ID from an X.509 cert MODSIGN: Implement module signature checking MODSIGN: Provide module signing public keys to the kernel MODSIGN: Automatically generate module signing keys if missing MODSIGN: Provide Kconfig options MODSIGN: Provide gitignore and make clean rules for extra files MODSIGN: Add FIPS policy module: signature checking hook X.509: Add a crypto key parser for binary (DER) X.509 certificates MPILIB: Provide a function to read raw data into an MPI X.509: Add an ASN.1 decoder X.509: Add simple ASN.1 grammar compiler ...
2012-10-13Merge tag 'for_linus-3.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb Pull KGDB/KDB fixes and cleanups from Jason Wessel: "Cleanups - Clean up compile warnings in kgdboc.c and x86/kernel/kgdb.c - Add module event hooks for simplified debugging with gdb Fixes - Fix kdb to stop paging with 'q' on bta and dmesg - Fix for data that scrolls off the vga console due to line wrapping when using the kdb pager New - The debug core registers for kernel module events which allows a kernel aware gdb to automatically load symbols and break on entry to a kernel module - Allow kgdboc=kdb to setup kdb on the vga console" * tag 'for_linus-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/kgdb: tty/console: fix warnings in drivers/tty/serial/kgdboc.c kdb,vt_console: Fix missed data due to pager overruns kdb: Fix dmesg/bta scroll to quit with 'q' kgdboc: Accept either kbd or kdb to activate the vga + keyboard kdb shell kgdb,x86: fix warning about unused variable mips,kgdb: fix recursive page fault with CONFIG_KPROBES kgdb: Add module event hooks
2012-10-13Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "This tree includes some late late perf items that missed the first round: tools: - Bash auto completion improvements, now we can auto complete the tools long options, tracepoint event names, etc, from Namhyung Kim. - Look up thread using tid instead of pid in 'perf sched'. - Move global variables into a perf_kvm struct, from David Ahern. - Hists refactorings, preparatory for improved 'diff' command, from Jiri Olsa. - Hists refactorings, preparatory for event group viewieng work, from Namhyung Kim. - Remove double negation on optional feature macro definitions, from Namhyung Kim. - Remove several cases of needless global variables, on most builtins. - misc fixes kernel: - sysfs support for IBS on AMD CPUs, from Robert Richter. - Support for an upcoming Intel CPU, the Xeon-Phi / Knights Corner HPC blade PMU, from Vince Weaver. - misc fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) perf: Fix perf_cgroup_switch for sw-events perf: Clarify perf_cpu_context::active_pmu usage by renaming it to ::unique_pmu perf/AMD/IBS: Add sysfs support perf hists: Add more helpers for hist entry stat perf hists: Move he->stat.nr_events initialization to a template perf hists: Introduce struct he_stat perf diff: Removing the total_period argument from output code perf tool: Add hpp interface to enable/disable hpp column perf tools: Removing hists pair argument from output path perf hists: Separate overhead and baseline columns perf diff: Refactor diff displacement possition info perf hists: Add struct hists pointer to struct hist_entry perf tools: Complete tracepoint event names perf/x86: Add support for Intel Xeon-Phi Knights Corner PMU perf evlist: Remove some unused methods perf evlist: Introduce add_newtp method perf kvm: Move global variables into a perf_kvm struct perf tools: Convert to BACKTRACE_SUPPORT perf tools: Long option completion support for each subcommands perf tools: Complete long option names of perf command ...
2012-10-13Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull third pile of kernel_execve() patches from Al Viro: "The last bits of infrastructure for kernel_thread() et.al., with alpha/arm/x86 use of those. Plus sanitizing the asm glue and do_notify_resume() on alpha, fixing the "disabled irq while running task_work stuff" breakage there. At that point the rest of kernel_thread/kernel_execve/sys_execve work can be done independently for different architectures. The only pending bits that do depend on having all architectures converted are restrictred to fs/* and kernel/* - that'll obviously have to wait for the next cycle. I thought we'd have to wait for all of them done before we start eliminating the longjump-style insanity in kernel_execve(), but it turned out there's a very simple way to do that without flagday-style changes." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: alpha: switch to saner kernel_execve() semantics arm: switch to saner kernel_execve() semantics x86, um: convert to saner kernel_execve() semantics infrastructure for saner ret_from_kernel_thread semantics make sure that kernel_thread() callbacks call do_exit() themselves make sure that we always have a return path from kernel_execve() ppc: eeh_event should just use kthread_run() don't bother with kernel_thread/kernel_execve for launching linuxrc alpha: get rid of switch_stack argument of do_work_pending() alpha: don't bother passing switch_stack separately from regs alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c alpha: simplify TIF_NEED_RESCHED handling
2012-10-13Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull third pile of VFS updates from Al Viro: "Stuff from Jeff Layton, mostly. Sanitizing interplay between audit and namei, removing a lot of insanity from audit_inode() mess and getting things ready for his ESTALE patchset." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: procfs: don't need a PATH_MAX allocation to hold a string representation of an int vfs: embed struct filename inside of names_cache allocation if possible audit: make audit_inode take struct filename vfs: make path_openat take a struct filename pointer vfs: turn do_path_lookup into wrapper around struct filename variant audit: allow audit code to satisfy getname requests from its names_list vfs: define struct filename and have getname() return it vfs: unexport getname and putname symbols acct: constify the name arg to acct_on vfs: allocate page instead of names_cache buffer in mount_block_root audit: overhaul __audit_inode_child to accomodate retrying audit: optimize audit_compare_dname_path audit: make audit_compare_dname_path use parent_len helper audit: remove dirlen argument to audit_compare_dname_path audit: set the name_len in audit_inode for parent lookups audit: add a new "type" field to audit_names struct audit: reverse arguments to audit_inode_child audit: no need to walk list in audit_inode if name is NULL audit: pass in dentry to audit_copy_inode wherever possible audit: remove unnecessary NULL ptr checks from do_path_lookup
2012-10-12audit: make audit_inode take struct filenameJeff Layton
Keep a pointer to the audit_names "slot" in struct filename. Have all of the audit_inode callers pass a struct filename ponter to audit_inode instead of a string pointer. If the aname field is already populated, then we can skip walking the list altogether and just use it directly. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12vfs: make path_openat take a struct filename pointerJeff Layton
...and fix up the callers. For do_file_open_root, just declare a struct filename on the stack and fill out the .name field. For do_filp_open, make it also take a struct filename pointer, and fix up its callers to call it appropriately. For filp_open, add a variant that takes a struct filename pointer and turn filp_open into a wrapper around it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12audit: allow audit code to satisfy getname requests from its names_listJeff Layton
Currently, if we call getname() on a userland string more than once, we'll get multiple copies of the string and multiple audit_names records. Add a function that will allow the audit_names code to satisfy getname requests using info from the audit_names list, avoiding a new allocation and audit_names records. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12vfs: define struct filename and have getname() return itJeff Layton
getname() is intended to copy pathname strings from userspace into a kernel buffer. The result is just a string in kernel space. It would however be quite helpful to be able to attach some ancillary info to the string. For instance, we could attach some audit-related info to reduce the amount of audit-related processing needed. When auditing is enabled, we could also call getname() on the string more than once and not need to recopy it from userspace. This patchset converts the getname()/putname() interfaces to return a struct instead of a string. For now, the struct just tracks the string in kernel space and the original userland pointer for it. Later, we'll add other information to the struct as it becomes convenient. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12infrastructure for saner ret_from_kernel_thread semanticsAl Viro
* allow kernel_execve() leave the actual return to userland to caller (selected by CONFIG_GENERIC_KERNEL_EXECVE). Callers updated accordingly. * architecture that does select GENERIC_KERNEL_EXECVE in its Kconfig should have its ret_from_kernel_thread() do this: call schedule_tail call the callback left for it by copy_thread(); if it ever returns, that's because it has just done successful kernel_execve() jump to return from syscall IOW, its only difference from ret_from_fork() is that it does call the callback. * such an architecture should also get rid of ret_from_kernel_execve() and __ARCH_WANT_KERNEL_EXECVE This is the last part of infrastructure patches in that area - from that point on work on different architectures can live independently. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core update from Thomas Gleixner: - Bug fixes (one for a longstanding dead loop issue) - Rework of time related vsyscalls - Alarm timer updates - Jiffies updates to remove compile time dependencies * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Cast raw_interval to u64 to avoid shift overflow timers: Fix endless looping between cascade() and internal_add_timer() time/jiffies: bring back unconditional LATCH definition time: Convert x86_64 to using new update_vsyscall time: Only do nanosecond rounding on GENERIC_TIME_VSYSCALL_OLD systems time: Introduce new GENERIC_TIME_VSYSCALL time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD time: Move update_vsyscall definitions to timekeeper_internal.h time: Move timekeeper structure to timekeeper_internal.h for vsyscall changes jiffies: Remove compile time assumptions about CLOCK_TICK_RATE jiffies: Kill unused TICK_USEC_TO_NSEC alarmtimer: Rename alarmtimer_remove to alarmtimer_dequeue alarmtimer: Remove unused helpers & defines alarmtimer: Use hrtimer per-alarm instead of per-base alarmtimer: Implement minimum alarm interval for allowing suspend