path: root/kernel
Age | Commit message | Author
2012-07-31 | time: Clean up stray newlines | John Stultz
Ingo noted inconsistent newline usage between functions. This patch cleans those up. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1343414893-45779-4-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 | time/jiffies: Rename ACTHZ to SHIFTED_HZ | John Stultz
Ingo noted that ACTHZ is a confusing name, and requested it be renamed, so this patch renames ACTHZ to SHIFTED_HZ to better describe it. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1343414893-45779-3-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 | Merge branch 'linus' into timers/urgent | Ingo Molnar
Merge in Linus's branch which already has timers/core merged. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 | perf/trace: Add ability to set a target task for events | Andrew Vagin
Some events are interesting not only for the current task. For example, sched_stat_* events are interesting for a task that wakes up. For this reason, it would be good if such events were delivered to a target task too. Now a target task can be set by using __perf_task(). The original idea and a draft patch belong to Peter Zijlstra. I need these events for profiling sleep times. sched_switch is used for getting callchains and sched_stat_* is used for getting time periods. These events are combined in user space, and can then be analyzed by perf tools. Inspired-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arun Sharma <asharma@fb.com> Signed-off-by: Andrew Vagin <avagin@openvz.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 | sched/cleanups: Add load balance cpumask pointer to 'struct lb_env' | Michael Wang
With this patch, struct lb_env will have a pointer to the load balancing cpumask and we don't need to pass a cpumask around anymore. Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4FFE8665.3080705@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-31 | kprobes/x86: ftrace based optimization for x86 | Masami Hiramatsu
Add function tracer based kprobe optimization support handlers on x86. This allows kprobes to use function tracer for probing on mcount call. Link: http://lkml.kernel.org/r/20120605102838.27845.26317.stgit@localhost.localdomain Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> [ Updated to new port of ftrace save regs functions ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-31 | kprobes: introduce ftrace based optimization | Masami Hiramatsu
Introduce function trace based kprobes optimization. With ftrace optimization, kprobes placed on the mcount call address use ftrace's mcount call instead of a breakpoint. Furthermore, unlike the current jump-based optimization, this optimization works with preemptive kernels. Of course, this feature works only if the probe is on the mcount call. If kprobe.break_handler is set, that probe is not optimized with ftrace (nor put on ftrace). The reason for this limitation is that break_handler may be used only from jprobes, which change the ip address (for fetching the function arguments), but the function tracer ignores a modified ip address. Changes in v2: - Fix ftrace_ops registering right after setting its filter. - Unregister ftrace_ops if there is no kprobe using. - Remove notrace dependency from __kprobes macro. Link: http://lkml.kernel.org/r/20120605102832.27845.63461.stgit@localhost.localdomain Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-31 | kprobes: Move locks into appropriate functions | Masami Hiramatsu
Break a big critical region into fine-grained pieces at registering kprobe path. This helps us to solve circular locking dependency when introducing ftrace-based kprobes. Link: http://lkml.kernel.org/r/20120605102826.27845.81689.stgit@localhost.localdomain Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-31 | kprobes: cleanup to separate probe-able check | Masami Hiramatsu
Separate probe-able address checking code from register_kprobe(). Link: http://lkml.kernel.org/r/20120605102820.27845.90133.stgit@localhost.localdomain Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-31 | kprobes: Inverse taking of module_mutex with kprobe_mutex | Steven Rostedt
Currently module_mutex is taken before kprobe_mutex, but this can cause issues when we have kprobes register ftrace, as the ftrace mutex is taken before enabling a tracepoint, which currently takes the module mutex. If module_mutex is taken before kprobe_mutex, then we can not have kprobes use the ftrace infrastructure. There seems to be no reason that the kprobe_mutex can't be taken before the module_mutex. Running lockdep shows that it is safe among the kernels I've run. Link: http://lkml.kernel.org/r/20120605102814.27845.21047.stgit@localhost.localdomain Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-31 | ftrace: add ftrace_set_filter_ip() for address based filter | Masami Hiramatsu
Add a new filter update interface ftrace_set_filter_ip() to set ftrace filter by ip address, not only glob pattern. Link: http://lkml.kernel.org/r/20120605102808.27845.67952.stgit@localhost.localdomain Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-31 | ftrace: Add selftest to test function save-regs support | Steven Rostedt
Add selftests to test the save-regs functionality of ftrace. If the arch supports saving regs, then it will make sure that regs is at least not NULL in the callback. If the arch does not support saving regs, it makes sure that the registering of the ftrace_ops that requests saving regs fails. It then tests the registering of the ftrace_ops succeeds if the 'IF_SUPPORTED' flag is set. Then it makes sure that the regs passed to the function is NULL. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-31 | ftrace: Add selftest to test function trace recursion protection | Steven Rostedt
Add selftests to test the function tracing recursion protection actually does work. It also tests if a ftrace_ops states it will perform its own protection. Although, even if the ftrace_ops states it will protect itself, the ftrace infrastructure may still provide protection if the arch does not support all features or another ftrace_ops is registered. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-31 | ftrace: Only compile ftrace selftest if selftests are enabled | Steven Rostedt
No need to compile in the ftrace selftest helper file if selftests are not being executed. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-07-31 | ftrace: Add default recursion protection for function tracing | Steven Rostedt
As more users of the function tracer utility are being added, they do not always add the necessary recursion protection. To protect from function recursion due to tracing, if the callback ftrace_ops does not specifically specify that it protects against recursion (by setting the FTRACE_OPS_FL_RECURSION_SAFE flag), the list operation will be called by the mcount trampoline which adds recursion protection. If the flag is set, then the function will be called directly with no extra protection. Note, the list operation is called if more than one function callback is registered, or if the arch does not support all of the function tracer features. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
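The default protection described in this message can be pictured with a small user-space sketch. Everything in it is hypothetical (`struct ops`, `OPS_FL_RECURSION_SAFE`, `dispatch()`); it is not the ftrace implementation, only an illustration of the idea that a callback which does not declare itself recursion safe is routed through a guard, while one that sets the flag is called directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the FTRACE_OPS_FL_RECURSION_SAFE flag. */
#define OPS_FL_RECURSION_SAFE 0x1u

struct ops {
    unsigned int flags;
    void (*func)(struct ops *self);
    int calls;              /* how many times func actually ran */
};

/* Set while a guarded callback runs (enough for a single-threaded sketch). */
static bool in_callback;

/* Invoke the callback, adding a recursion guard unless the ops opted out. */
static void dispatch(struct ops *op)
{
    assert(op && op->func);

    if (op->flags & OPS_FL_RECURSION_SAFE) {
        op->func(op);       /* called directly, no extra protection */
        return;
    }
    if (in_callback)
        return;             /* re-entered from inside a callback: drop it */
    in_callback = true;
    op->func(op);
    in_callback = false;
}

/* A careless callback that ends up triggering itself again. */
static void reentrant_cb(struct ops *self)
{
    self->calls++;
    if (self->calls < 5)
        dispatch(self);
}
```

With the guard in place the careless callback runs once per event; with the flag set it is trusted to bound its own recursion, which here it does by stopping after five calls.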
2012-07-31 | kernel/debug: Make use of KGDB_REASON_NMI | Anton Vorontsov
Currently the kernel never sets KGDB_REASON_NMI. We do now, when we enter KGDB/KDB from an NMI. This is not to be confused with kgdb_nmicallback(): the NMI callback is an entry for the slave CPUs during CPU roundup, but REASON_NMI is the entry for the master CPU. Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org> Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-07-31 | kdb: Remove cpu from the more prompt | Jason Wessel
Having the CPU in the more prompt is completely redundant with the standard kdb prompt, and it also wastes 32 bytes on the stack. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-07-31 | kdb: Remove unused KDB_FLAG_ONLY_DO_DUMP | Jason Wessel
This code cleanup was missed in the original kdb merge, and this code is simply not used at all. The code that was previously used to set the KDB_FLAG_ONLY_DO_DUMP was removed prior to the initial kdb merge. Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
2012-07-30 | resource: make sure requested range is included in the root range | Octavian Purdila
When the requested range is outside of the root range, the logic in __reserve_region_with_split will cause an infinite recursion which will overflow the stack, as seen in the warning below. This particular stack overflow was caused by requesting the (100000000-107ffffff) range while the root range was (0-ffffffff). In this case __request_resource would return the whole root range as the conflict range (i.e. 0-ffffffff). Then, the logic in __reserve_region_with_split would continue the recursion, requesting the new range as (conflict->end+1, end), which incidentally in this case equals the originally requested range.

This patch aborts looking for a usable range when the request does not intersect with the root range. When the request partially overlaps with the root range, it adjusts the request to fall in the root range and then continues with the new request. When the request is modified or aborted, errors and a stack trace are logged to allow catching the errors in the upper layers.

  [ 5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
  [ 5.975150] Modules linked in:
  [ 5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
  [ 5.985324] Call Trace:
  [ 5.987759] [<c1039dfc>] ? console_unlock+0x17b/0x18d
  [ 5.992891] [<c1039620>] warn_slowpath_common+0x48/0x5d
  [ 5.998194] [<c1031758>] ? sub_preempt_count+0x63/0x89
  [ 6.003412] [<c1039644>] warn_slowpath_null+0xf/0x13
  [ 6.008453] [<c1031758>] sub_preempt_count+0x63/0x89
  [ 6.013499] [<c14d60c4>] _raw_spin_unlock+0x27/0x3f
  [ 6.018453] [<c10c6349>] add_partial+0x36/0x3b
  [ 6.022973] [<c10c7c0a>] deactivate_slab+0x96/0xb4
  [ 6.027842] [<c14cf9d9>] __slab_alloc.isra.54.constprop.63+0x204/0x241
  [ 6.034456] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
  [ 6.039842] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
  [ 6.045232] [<c10c7dc9>] kmem_cache_alloc_trace+0x51/0xb0
  [ 6.050710] [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
  [ 6.056100] [<c103f78f>] kzalloc.constprop.5+0x29/0x38
  [ 6.061320] [<c17b45e9>] __reserve_region_with_split+0x1c/0xd1
  [ 6.067230] [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
  ...
  [ 7.179057] [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
  [ 7.184970] [<c17b4779>] reserve_region_with_split+0x30/0x42
  [ 7.190709] [<c17a8ebf>] e820_reserve_resources_late+0xd1/0xe9
  [ 7.196623] [<c17c9526>] pcibios_resource_survey+0x23/0x2a
  [ 7.202184] [<c17cad8a>] pcibios_init+0x23/0x35
  [ 7.206789] [<c17ca574>] pci_subsys_init+0x3f/0x44
  [ 7.211659] [<c1002088>] do_one_initcall+0x72/0x122
  [ 7.216615] [<c17ca535>] ? pci_legacy_init+0x3d/0x3d
  [ 7.221659] [<c17a27ff>] kernel_init+0xa6/0x118
  [ 7.226265] [<c17a2759>] ? start_kernel+0x334/0x334
  [ 7.231223] [<c14d7482>] kernel_thread_helper+0x6/0x10

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
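The abort-or-adjust rule from this message can be sketched in user space as follows. `struct range` and `clamp_to_root()` are hypothetical names chosen for illustration; the real change lives in kernel/resource.c and also logs errors, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, in the spirit of a resource's start/end pair. */
struct range {
    uint64_t start;
    uint64_t end;
};

/*
 * Clamp *req to root. Returns false when the two ranges do not
 * intersect, in which case the caller should give up instead of recursing.
 */
static bool clamp_to_root(const struct range *root, struct range *req)
{
    assert(root->start <= root->end && req->start <= req->end);

    if (req->start > root->end || req->end < root->start)
        return false;               /* no overlap at all: abort */

    if (req->start < root->start)
        req->start = root->start;   /* trim the low side */
    if (req->end > root->end)
        req->end = root->end;       /* trim the high side */
    return true;
}
```

Applied to the ranges quoted in the message, a request for 100000000-107ffffff against a root of 0-ffffffff is rejected outright, so the recursion never starts.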
2012-07-30 | taskstats: check nla_reserve() return | Alan Cox
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=44621 Reported-by: <rucsoftsec@gmail.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | sysctl: suppress kmemleak messages | Steven Rostedt
register_sysctl_table() is a strange function, as it makes internal allocations (a header) to register a sysctl_table. This header is a handle to the table that is created, and can be used to unregister the table. But if the table is permanent and never unregistered, the header acts the same as a static variable. Unfortunately, this allocation of memory that is never expected to be freed fools kmemleak in thinking that we have leaked memory. For those sysctl tables that are never unregistered, and have no pointer referencing them, kmemleak will think that these are memory leaks: unreferenced object 0xffff880079fb9d40 (size 192): comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff8146b590>] kmemleak_alloc+0x73/0x98 [<ffffffff8110a935>] kmemleak_alloc_recursive.constprop.42+0x16/0x18 [<ffffffff8110b852>] __kmalloc+0x107/0x153 [<ffffffff8116fa72>] kzalloc.constprop.8+0xe/0x10 [<ffffffff811703c9>] __register_sysctl_paths+0xe1/0x160 [<ffffffff81170463>] register_sysctl_paths+0x1b/0x1d [<ffffffff8117047d>] register_sysctl_table+0x18/0x1a [<ffffffff81afb0a1>] sysctl_init+0x10/0x14 [<ffffffff81b05a6f>] proc_sys_init+0x2f/0x31 [<ffffffff81b0584c>] proc_root_init+0xa5/0xa7 [<ffffffff81ae5b7e>] start_kernel+0x3d0/0x40a [<ffffffff81ae52a7>] x86_64_start_reservations+0xae/0xb2 [<ffffffff81ae53ad>] x86_64_start_kernel+0x102/0x111 [<ffffffffffffffff>] 0xffffffffffffffff The sysctl_base_table used by sysctl itself is one such instance that registers the table to never be unregistered. Use kmemleak_not_leak() to suppress the kmemleak false positive. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | kdump: append newline to the last line of vmcoreinfo note | Vivek Goyal
The last line of vmcoreinfo note does not end with \n. Parsing all the lines in note becomes easier if all lines end with \n instead of trying to special case the last line. I know at least one tool, vmcore-dmesg in kexec-tools tree which made the assumption that all lines end with \n. I think it is a good idea to fix it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | fork: fix error handling in dup_task() | Akinobu Mita
The function dup_task() may fail at the following function calls, in the following order:

  0) alloc_task_struct_node()
  1) alloc_thread_info_node()
  2) arch_dup_task_struct()

An error at 0) does not matter; the function can just return. But an error at 1) requires releasing the task_struct allocated by 0) before returning. Likewise, an error at 2) requires releasing the task_struct and thread_info allocated by 0) and 1).

The existing error handling calls free_task_struct() and free_thread_info(), which not only release task_struct and thread_info but also call the architecture-specific arch_release_task_struct() and arch_release_thread_info(). The problem is that task_struct and thread_info are not fully initialized yet at this point, but arch_release_task_struct() and arch_release_thread_info() are called with them.

For example, x86 defines its own arch_release_task_struct() that releases a task_xstate. If alloc_thread_info_node() fails in dup_task(), this error handling calls arch_release_task_struct() with a task_struct that has just been allocated and is filled with garbage.

This actually happened with tools/testing/fault-injection/failcmd.sh:

  # env FAILCMD_TYPE=fail_page_alloc \
      ./tools/testing/fault-injection/failcmd.sh --times=100 \
      --min-order=0 --ignore-gfp-wait=0 \
      -- make -C tools/testing/selftests/ run_tests

In order to fix this issue, make free_{task_struct,thread_info}() not call arch_release_{task_struct,thread_info}(), and call arch_release_{task_struct,thread_info}() explicitly where needed.

The default arch_release_task_struct() and arch_release_thread_info() are defined as empty. So this change only affects the architectures which implement their own arch_release_task_struct() or arch_release_thread_info(), as listed below.

  arch_release_task_struct(): x86, sh
  arch_release_thread_info(): mn10300, tile

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: David Howells <dhowells@redhat.com> Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Salman Qazi <sqazi@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
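The error-handling shape this message argues for can be illustrated with a user-space sketch. All names here (`struct task`, `dup_task_sketch()`, `fail_at`, `arch_release_task()`) are hypothetical; the point is only that each failure path undoes exactly the steps that succeeded, and that the architecture hook runs solely on fully initialised objects.

```c
#include <assert.h>
#include <stdlib.h>

struct task {
    void *stack;
    int arch_ready;         /* set once the arch-specific copy succeeded */
};

/* Hypothetical failure injection: fail at step 0, 1 or 2; -1 means never. */
static int fail_at = -1;
static int arch_release_calls;  /* counts arch hook invocations */

static void *alloc_step(int step, size_t size)
{
    return step == fail_at ? NULL : malloc(size);
}

/* Arch hook: only valid on a fully initialised task. */
static void arch_release_task(struct task *t)
{
    assert(t->arch_ready);
    arch_release_calls++;
}

static struct task *dup_task_sketch(void)
{
    struct task *t = alloc_step(0, sizeof(*t));
    if (!t)
        return NULL;                /* step 0 failed: nothing to undo */

    t->stack = alloc_step(1, 64);
    if (!t->stack)
        goto free_task;             /* undo step 0 only */

    if (fail_at == 2)
        goto free_stack;            /* arch copy failed: undo steps 1 and 0 */
    t->arch_ready = 1;
    return t;

free_stack:
    free(t->stack);
free_task:
    free(t);                        /* plain free: no arch hook on garbage */
    return NULL;
}

/* Full teardown of a live task: here the arch hook is called explicitly. */
static void release_task_sketch(struct task *t)
{
    arch_release_task(t);
    free(t->stack);
    free(t);
}
```

Setting `fail_at` to 1 or 2 before calling `dup_task_sketch()` exercises the two unwinding paths; neither touches the arch hook.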
2012-07-30 | revert "sched: Fix fork() error path to not crash" | Andrew Morton
To make way for "fork: fix error handling in dup_task()", which fixes the errors more completely. Cc: Salman Qazi <sqazi@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | fork: use vma_pages() to simplify the code | Huang Shijie
The current code can be replaced by vma_pages(). So use it to simplify the code. [akpm@linux-foundation.org: initialise `len' at its definition site] Signed-off-by: Huang Shijie <shijie8@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | kmod: avoid deadlock from recursive kmod call | Tetsuo Handa
The system deadlocks (at least since 2.6.10) when call_usermodehelper(UMH_WAIT_EXEC) request triggers call_usermodehelper(UMH_WAIT_PROC) request. This is because "khelper thread is waiting for the worker thread at wait_for_completion() in do_fork() since the worker thread was created with CLONE_VFORK flag" and "the worker thread cannot call complete() because do_execve() is blocked at UMH_WAIT_PROC request" and "the khelper thread cannot start processing UMH_WAIT_PROC request because the khelper thread is waiting for the worker thread at wait_for_completion() in do_fork()". The easiest example to observe this deadlock is to use a corrupted /sbin/hotplug binary (like shown below). # : > /tmp/dummy # chmod 755 /tmp/dummy # echo /tmp/dummy > /proc/sys/kernel/hotplug # modprobe whatever call_usermodehelper("/tmp/dummy", UMH_WAIT_EXEC) is called from kobject_uevent_env() in lib/kobject_uevent.c upon loading/unloading a module. do_execve("/tmp/dummy") triggers a call to request_module("binfmt-0000") from search_binary_handler() which in turn calls call_usermodehelper(UMH_WAIT_PROC). In order to avoid deadlock, as a for-now and easy-to-backport solution, do not try to call wait_for_completion() in call_usermodehelper_exec() if the worker thread was created by khelper thread with CLONE_VFORK flag. Future and fundamental solution might be replacing singleton khelper thread with some workqueue so that recursive calls up to max_active dependency loop can be handled without deadlock. [akpm@linux-foundation.org: add comment to kmod_thread_locker] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | kernel/kmod.c: document call_usermodehelper_fns() a bit | Andrew Morton
This function's interface is, uh, subtle. Attempt to apologise for it. Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | printk: only look for prefix levels in kernel messages | Joe Perches
vprintk_emit() prefix parsing should only be done for internal kernel messages. This allows existing behavior to be kept in all cases. Signed-off-by: Joe Perches <joe@perches.com> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | printk: add generic functions to find KERN_<LEVEL> headers | Joe Perches
The current form of a KERN_<LEVEL> is "<.>". Add printk_get_level and printk_skip_level functions to handle these formats. These functions centralize tests of KERN_<LEVEL> so a future modification can change the KERN_<LEVEL> style and shorten the number of bytes consumed by these headers. [akpm@linux-foundation.org: fix build error and warning] Signed-off-by: Joe Perches <joe@perches.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Wu Fengguang <wfg@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
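A user-space sketch of what helpers for the "<.>" form might look like is given below. The names `get_level()` and `skip_level()` are hypothetical, and the set of accepted level characters (`0` to `7` and `d`) is an assumption of this sketch rather than something the message states.

```c
#include <assert.h>

/* Return the level character of a "<.>" header, or 0 if there is none. */
static char get_level(const char *buf)
{
    assert(buf);

    /* The checks short-circuit, so a one-character string is never overrun. */
    if (buf[0] == '<' && buf[1] != '\0' && buf[2] == '>') {
        char c = buf[1];

        if ((c >= '0' && c <= '7') || c == 'd')
            return c;
    }
    return 0;
}

/* Return a pointer past the header, or buf itself when there is no header. */
static const char *skip_level(const char *buf)
{
    return get_level(buf) ? buf + 3 : buf;
}
```

Keeping the test for the header in one place is what lets a later change alter the header style without touching every caller.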
2012-07-30 | kmsg: /dev/kmsg - properly return possible copy_from_user() failure | Kay Sievers
Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | kernel/sys.c: avoid argv_free(NULL) | Andrew Morton
If argv_split() failed, the code will end up calling argv_free(NULL). Fix it up and clean things up a bit. Addresses Coverity report 703573. Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Kees Cook <keescook@chromium.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: WANG Cong <xiyou.wangcong@gmail.com> Cc: Alan Cox <alan@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | NMI watchdog: fix for lockup detector breakage on resume | Sameer Nanda
On the suspend/resume path the boot CPU does not go through an offline->online transition. This breaks the NMI detector post-resume since it depends on PMU state that is lost when the system gets suspended. Fix this by forcing a CPU offline->online transition for the lockup detector on the boot CPU during resume. To provide more context, we enable NMI watchdog on Chrome OS. We have seen several reports of systems freezing up completely which indicated that the NMI watchdog was not firing for some reason. Debugging further, we found a simple way of repro'ing system freezes -- issuing the command 'taskset 1 sh -c "echo nmilockup > /proc/breakme"' after the system has been suspended/resumed one or more times. With this patch in place, the system freezes result in panics, as expected. These panics provide a nice stack trace for us to debug the actual issue causing the freeze. [akpm@linux-foundation.org: fiddle with code comment] [akpm@linux-foundation.org: make lockup_detector_bootcpu_resume() conditional on CONFIG_SUSPEND] [akpm@linux-foundation.org: fix section errors] Signed-off-by: Sameer Nanda <snanda@chromium.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Don Zickus <dzickus@redhat.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | panic: fix a possible deadlock in panic() | Vikram Mulukutla
panic_lock is meant to ensure that panic processing takes place only on one cpu; if any of the other cpus encounter a panic, they will spin waiting to be shut down. However, this causes a regression in this scenario: 1. Cpu 0 encounters a panic and acquires the panic_lock and proceeds with the panic processing. 2. There is an interrupt on cpu 0 that also encounters an error condition and invokes panic. 3. This second invocation fails to acquire the panic_lock and enters the infinite while loop in panic_smp_self_stop. Thus all panic processing is stopped, and the cpu is stuck for eternity in the while(1) inside panic_smp_self_stop. To address this, disable local interrupts with local_irq_disable before acquiring the panic_lock. This will prevent interrupt handlers from executing during the panic processing, thus avoiding this particular problem. Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org> Reviewed-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | coredump: warn about unsafe suid_dumpable / core_pattern combo | Kees Cook
When suid_dumpable=2, detect unsafe core_pattern settings and warn when they are seen. Signed-off-by: Kees Cook <keescook@chromium.org> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alan Cox <alan@linux.intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: James Morris <james.l.morris@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | prctl: remove redundant assignment of "error" to zero | Sasikantha babu
Setting "error" to the error number on failure is enough; there is no need to set the "error" variable to zero in each switch case, since it was already initialized to zero. Also replace "return 0" in the switch cases with a break statement. Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Serge E. Hallyn <serge@hallyn.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30 | uprobes: __replace_page() needs munlock_vma_page() | Oleg Nesterov
Like do_wp_page(), __replace_page() should do munlock_vma_page() for the case when the old page still has other !VM_LOCKED mappings. Unfortunately this needs mm/internal.h. Also, move put_page() outside of ptl lock. This doesn't really matter but looks a bit better. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182249.GA20372@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 | uprobes: Rename vma_address() and make it return "unsigned long" | Oleg Nesterov
1. vma_address() returns loff_t, this looks confusing and this is unnecessary after the previous change. Make it return "ulong", all callers truncate the result anyway. 2. Its name conflicts with mm/rmap.c:vma_address(), rename it to offset_to_vaddr(), this matches vaddr_to_offset(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182247.GA20365@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 | uprobes: Fix register_for_each_vma()->vma_address() check | Oleg Nesterov
1. register_for_each_vma() checks that vma_address() == vaddr, but this is not enough. We should also ensure that vaddr >= vm_start, find_vma() guarantees "vaddr < vm_end" only. 2. After the previous changes, register_for_each_vma() is the only reason why vma_address() has to return loff_t, all other users know that we have the valid mapping at this offset and thus the overflow is not possible. Change the code to use vaddr_to_offset() instead, imho this looks more clean/understandable and now we can change vma_address(). 3. While at it, remove the unnecessary type-cast. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182244.GA20362@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30 | uprobes: Introduce vaddr_to_offset(vma, vaddr) | Oleg Nesterov
Add the new helper, vaddr_to_offset(vma, vaddr) which returns the offset in vma->vm_file this vaddr is mapped at. Change build_probe_list() and find_active_uprobe() to use the new helper, the next patch adds another user. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182242.GA20355@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
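As a rough illustration of the arithmetic such a helper performs, here is a user-space re-creation. `struct vma_sketch` is a hypothetical stand-in holding only the fields needed, 4 KiB pages are assumed, and the bodies are a sketch of the mapping between addresses and file offsets rather than the kernel source.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumes 4 KiB pages */

/* Minimal stand-in for the fields of a VMA used here. */
struct vma_sketch {
    uint64_t vm_start;  /* first mapped virtual address */
    uint64_t vm_end;    /* one past the last mapped address */
    uint64_t vm_pgoff;  /* file offset of vm_start, in pages */
};

/* File offset that vaddr is mapped at; vaddr must lie inside the VMA. */
static int64_t vaddr_to_offset(const struct vma_sketch *vma, uint64_t vaddr)
{
    assert(vaddr >= vma->vm_start && vaddr < vma->vm_end);
    return (int64_t)(vma->vm_pgoff << PAGE_SHIFT) +
           (int64_t)(vaddr - vma->vm_start);
}

/* Inverse mapping: virtual address at which a file offset appears. */
static uint64_t offset_to_vaddr(const struct vma_sketch *vma, int64_t offset)
{
    return vma->vm_start + (uint64_t)offset - (vma->vm_pgoff << PAGE_SHIFT);
}
```

The two functions are inverses for any address inside the mapping, which is why a pair of helpers reads more clearly than open-coded arithmetic at each call site.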
2012-07-30 | uprobes: Teach build_probe_list() to consider the range | Oleg Nesterov
Currently build_probe_list() builds the list of all uprobes attached to the given inode, and the caller should filter out those who don't fall into the [start,end) range, this is sub-optimal. This patch turns find_least_offset_node() into find_node_in_range() which returns the first node inside the [min,max] range, and changes build_probe_list() to use this node as a starting point for rb_prev() and rb_next() to find all other nodes the caller needs. The resulting list is no longer sorted but we do not care. This can speed up both build_probe_list() and the callers, but there is another reason to introduce find_node_in_range(). It can be used to figure out whether the given vma has uprobes or not, this will be needed soon. While at it, shift INIT_LIST_HEAD(tmp_list) into build_probe_list(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182240.GA20352@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
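The search strategy described here (find any node inside the range, then walk outwards in both directions) can be sketched on a sorted array instead of an rbtree. This is a swapped-in data structure chosen to keep the example self-contained; `find_in_range()` and `count_in_range()` are hypothetical names, and stepping the index down or up plays the role of rb_prev() and rb_next().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Binary search for any element of the sorted array within [min, max].
 * Returns its index, or -1 when the range holds no element.
 */
static long find_in_range(const long *sorted, size_t n, long min, long max)
{
    size_t lo = 0, hi = n;

    assert(sorted || n == 0);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (sorted[mid] < min)
            lo = mid + 1;       /* everything up to mid is too small */
        else if (sorted[mid] > max)
            hi = mid;           /* everything from mid on is too large */
        else
            return (long)mid;   /* inside the range */
    }
    return -1;
}

/* Count all elements in [min, max] by walking left and right from the hit. */
static size_t count_in_range(const long *sorted, size_t n, long min, long max)
{
    long start = find_in_range(sorted, n, min, max);
    size_t count = 0;

    if (start < 0)
        return 0;
    for (long i = start; i >= 0 && sorted[i] >= min; i--)
        count++;
    for (size_t i = (size_t)start + 1; i < n && sorted[i] <= max; i++)
        count++;
    return count;
}
```

A negative result from `find_in_range()` also answers the "does this range contain anything at all" question on its own, which is the second use the message mentions.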
2012-07-30uprobes: Fix overflow in vma_address()/find_active_uprobe()Oleg Nesterov
vma->vm_pgoff is "unsigned long", it should be promoted to loff_t before the multiplication to avoid the overflow. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182233.GA20339@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
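The overflow can be reproduced in userspace by using uint32_t to model a 32-bit "unsigned long". This is only a sketch under that assumption; the function names and the 4096-byte page size are hypothetical, and the code assumes a platform where unsigned int is 32 bits wide.

```c
#include <stdint.h>

#define MOCK_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Buggy pattern: the product is computed in 32 bits and wraps first. */
static int64_t pgoff_to_offset_wrapping(uint32_t pgoff)
{
    return (int64_t)(pgoff * MOCK_PAGE_SIZE);
}

/* Fixed pattern: widen the operand, then multiply in 64 bits. */
static int64_t pgoff_to_offset_promoted(uint32_t pgoff)
{
    return (int64_t)pgoff * MOCK_PAGE_SIZE;
}
```

With a page offset of 0x100000 the true byte offset is 2^32, which the first variant truncates to 0 while the second preserves it.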
2012-07-30uprobes: Suppress uprobe_munmap() from mmput()Oleg Nesterov
uprobe_munmap() does get_user_pages() and it is also called from the final mmput()->exit_mmap() path. This slows down exit/mmput() for no reason, and I think it is simply dangerous/wrong to try to fault-in a page into the dying mm. If nothing else, this happens after the last sync_mm_rss(), afaics handle_mm_fault() can change the task->rss_stat and make the subsequent check_mm() unhappy. Change uprobe_munmap() to check mm->mm_users != 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182231.GA20336@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()Oleg Nesterov
The bug was introduced by me in 449d0d7c ("uprobes: Simplify the usage of uprobe->pending_list"). Yes, we do not care about uprobe->pending_list after return and nobody can remove the current list entry, but put_uprobe(uprobe) can actually free it and thus we need list_for_each_entry_safe(). Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Link: http://lkml.kernel.org/r/20120729182229.GA20329@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
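The reason a "safe" iterator is needed can be shown with a plain singly linked list in userspace. This sketch does not use the kernel's list macros; `struct node`, `push()`, and `release_all()` are hypothetical, and free() stands in for a put that drops the last reference.

```c
#include <stdlib.h>

struct node {
    struct node *next;
    int value;
};

/* Prepend a node; aborts on allocation failure to keep the sketch short. */
static struct node *push(struct node *head, int value)
{
    struct node *n = malloc(sizeof(*n));

    if (n == NULL)
        abort();
    n->value = value;
    n->next = head;
    return n;
}

/*
 * Release every node. The successor is saved in tmp before the current
 * node is freed; reading n->next after free(n), as a plain
 * "n = n->next" loop step would, is a use-after-free.
 */
static int release_all(struct node *head)
{
    struct node *n, *tmp;
    int released = 0;

    for (n = head; n != NULL; n = tmp) {
        tmp = n->next;
        free(n);
        released++;
    }
    return released;
}
```

The saved `tmp` pointer is the same idea as the extra cursor argument that the kernel's safe list iterators take.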
2012-07-30uprobes: Clean up and document write_opcode()->lock_page(old_page)Oleg Nesterov
The comment above write_opcode()->lock_page(old_page) tells about the race with do_wp_page(). I don't really understand which exactly race it means, but afaics this lock_page() was not enough to close all races with do_wp_page(). Anyway, since: 77fc4af1b59d uprobes: Change register_for_each_vma() to take mm->mmap_sem for writing this code is always called with ->mmap_sem held for writing, so we can forget about do_wp_page(). However, we can't simply remove this lock_page(), and the only (afaics) reason is __replace_page()->try_to_free_swap(). Nothing in write_opcode() needs it, move it into __replace_page() and fix the comment. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182220.GA20322@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: Kill write_opcode()->lock_page(new_page)Oleg Nesterov
write_opcode() does lock_page(new_page) for no reason. Nobody can see this page until __replace_page() exposes it under ptl lock, and we do nothing with this page after pte_unmap_unlock(). If nothing else, the similar code in do_wp_page() doesn't lock the new page for page_add_new_anon_rmap/set_pte_at_notify. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182218.GA20315@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: __replace_page() should not use page_address_in_vma()Oleg Nesterov
page_address_in_vma(old_page) in __replace_page() is ugly and wrong. The caller already knows the correct virtual address, this page was found by get_user_pages(vaddr). However, page_address_in_vma() can actually fail if page->mapping was cleared by __delete_from_page_cache() after get_user_pages() returns. But this means the race with page reclaim, write_opcode() should not fail, it should retry and read this page again. Probably the race with remove_mapping() is not possible due to page_freeze_refs() logic, but afaics at least shmem_writepage()->shmem_delete_from_page_cache() can clear ->mapping. We could change __replace_page() to return -EAGAIN in this case, but it would be better to simply use the caller's vaddr and rely on page_check_address(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182216.GA20311@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-30uprobes: Don't recheck vma/f_mapping in write_opcode()Oleg Nesterov
write_opcode() rechecks valid_vma() and ->f_mapping, this is pointless. The caller, register_for_each_vma() or uprobe_mmap(), has already done these checks under mmap_sem. To clarify, uprobe_mmap() checks valid_vma() only, but we can rely on build_probe_list(vm_file->f_mapping->host). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar.vnet.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20120729182212.GA20304@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-29fs: add link restriction audit reportingKees Cook
Adds audit messages for unexpected link restriction violations so that system owners will have some sort of potentially actionable information about misbehaving processes. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29fs: add link restrictionsKees Cook
This adds symlink and hardlink restrictions to the Linux VFS.
Symlinks:
A long-standing class of security issues is the symlink-based time-of-check-time-of-use race, most commonly seen in world-writable directories like /tmp. The common method of exploitation of this flaw is to cross privilege boundaries when following a given symlink (i.e. a root process follows a symlink belonging to another user). For a likely incomplete list of hundreds of examples across the years, please see: http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp
The solution is to permit symlinks to only be followed when outside a sticky world-writable directory, or when the uid of the symlink and follower match, or when the directory owner matches the symlink's owner.
Some pointers to the history of earlier discussion that I could find:
 1996 Aug, Zygo Blaxell http://marc.info/?l=bugtraq&m=87602167419830&w=2
 1996 Oct, Andrew Tridgell http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
 1997 Dec, Albert D Cahalan http://lkml.org/lkml/1997/12/16/4
 2005 Feb, Lorenzo Hernández García-Hierro http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
 2010 May, Kees Cook https://lkml.org/lkml/2010/5/30/144
Past objections and rebuttals could be summarized as:
 - Violates POSIX.
   - POSIX didn't consider this situation and it's not useful to follow a broken specification at the cost of security.
 - Might break unknown applications that use this feature.
   - Applications that break because of the change are easy to spot and fix. Applications that are vulnerable to symlink ToCToU by not having the change aren't. Additionally, no applications have yet been found that rely on this behavior.
 - Applications should just use mkstemp() or O_CREAT|O_EXCL.
   - True, but applications are not perfect, and new software is written all the time that makes these mistakes; blocking this flaw at the kernel is a single solution to the entire class of vulnerability.
 - This should live in the core VFS.
   - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
 - This should live in an LSM.
   - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)
Hardlinks:
On systems that have user-writable directories on the same partition as system files, a long-standing class of security issues is the hardlink-based time-of-check-time-of-use race, most commonly seen in world-writable directories like /tmp. The common method of exploitation of this flaw is to cross privilege boundaries when following a given hardlink (i.e. a root process follows a hardlink created by another user). Additionally, an issue exists where users can "pin" a potentially vulnerable setuid/setgid file so that an administrator will not actually upgrade a system fully.
The solution is to permit hardlinks to only be created when the user is already the existing file's owner, or if they already have read/write access to the existing file.
Many Linux users are surprised when they learn they can link to files they have no access to, so this change appears to follow the doctrine of "least surprise". Additionally, this change does not violate POSIX, which states "the implementation may require that the calling process has permission to access the existing file"[1].
This change is known to break some implementations of the "at" daemon, though the version used by Fedora and Ubuntu has been fixed[2] for a while. Otherwise, the change has been undisruptive while in use in Ubuntu for the last 1.5 years.
 [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
 [2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279
This patch is based on the patches in Openwall and grsecurity, along with suggestions from Al Viro. I have added a sysctl to enable the protected behavior, and documentation.
Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
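The two policies stated in this message can be written down as plain predicates. The following is a userspace sketch of the rules as worded above, not the code added to the VFS; the function names, parameter layout, and `MOCK_` mode constants are hypothetical.

```c
#include <stdbool.h>

#define MOCK_STICKY         01000u  /* sticky bit in a directory mode */
#define MOCK_WORLD_WRITABLE 00002u  /* write permission for "other" */

/*
 * Symlink rule: following is allowed outside a sticky world-writable
 * directory, or when the symlink's owner is the follower, or when the
 * directory's owner is the symlink's owner.
 */
static bool may_follow_symlink_sketch(unsigned int dir_mode, unsigned int dir_uid,
                                      unsigned int link_uid, unsigned int follower_uid)
{
    bool sticky_ww = (dir_mode & MOCK_STICKY) && (dir_mode & MOCK_WORLD_WRITABLE);

    if (!sticky_ww)
        return true;
    if (link_uid == follower_uid)
        return true;
    return dir_uid == link_uid;
}

/*
 * Hardlink rule: creation is allowed when the caller owns the existing
 * file or already has read/write access to it.
 */
static bool may_create_hardlink_sketch(unsigned int file_uid, unsigned int caller_uid,
                                       bool caller_has_rw_access)
{
    return file_uid == caller_uid || caller_has_rw_access;
}
```

For instance, a root process (uid 0) following a symlink owned by uid 1000 inside a root-owned directory with mode 01777 is refused by the first predicate, which is the /tmp scenario the message describes.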
2012-07-26posix_types.h: Cleanup stale __NFDBITS and related definitionsJosh Boyer
Recently, glibc made a change to suppress sign-conversion warnings in FD_SET (glibc commit ceb9e56b3d1). This uncovered an issue with the kernel's definition of __NFDBITS if applications #include <linux/types.h> after including <sys/select.h>. A build failure would be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2 flags to gcc. It was suggested that the kernel should either match the glibc definition of __NFDBITS or remove that entirely. The current in-kernel uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no uses of the related __FDELT and __FDMASK defines. Given that, we'll continue the cleanup that was started with commit 8b3d1cda4f5f ("posix_types: Remove fd_set macros") and drop the remaining unused macros. Additionally, linux/time.h has similar macros defined that expand to nothing so we'll remove those at the same time. Reported-by: Jeff Law <law@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> CC: <stable@vger.kernel.org> Signed-off-by: Josh Boyer <jwboyer@redhat.com> [ .. and fix up whitespace as per akpm ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
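To make the replacement concrete, here is a userspace sketch of fd-bitmap indexing written in terms of the number of bits in a long. The `MOCK_` macro and the two helpers are hypothetical and only mirror the usual word-index and bit-index arithmetic; they are not the kernel's definitions.

```c
#include <limits.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's BITS_PER_LONG. */
#define MOCK_BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

/* Set the bit for fd: the word index is fd / bits-per-long, the bit is the remainder. */
static void mock_fd_set(unsigned long *bits, size_t fd)
{
    bits[fd / MOCK_BITS_PER_LONG] |= 1UL << (fd % MOCK_BITS_PER_LONG);
}

/* Test the bit for fd using the same arithmetic. */
static int mock_fd_isset(const unsigned long *bits, size_t fd)
{
    return (int)((bits[fd / MOCK_BITS_PER_LONG] >> (fd % MOCK_BITS_PER_LONG)) & 1UL);
}
```

Because the arithmetic depends only on the width of a long, a separate __NFDBITS-style constant adds nothing for this kind of in-kernel use.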