Age | Commit message (Collapse) | Author |
|
Just like what devkmsg_read() does, return -EINVAL if the message len is
bigger than the buf size, or it will trigger a segfault error.
Acked-by: Kay Sievers <kay@vrfy.org>
Acked-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Although syslog_seq and log_next_seq stuff are protected by logbuf_lock
spin log, it's not enough. Say we have two processes A and B, and let
syslog_seq = N, while log_next_seq = N + 1, and the two processes both
come to syslog_print at almost the same time. And No matter which
process get the spin lock first, it will increase syslog_seq by one,
then release spin lock; thus later, another process increase syslog_seq
by one again. In this case, syslog_seq is bigger than syslog_next_seq.
And latter, it would make:
wait_event_interruptiable(log_wait, syslog != log_next_seq)
don't wait any more even there is no new write comes. Thus it introduce
a infinite loop reading.
I can easily see this kind of issue by the following steps:
# cat /proc/kmsg # at meantime, I don't kill rsyslog
# So they are the two processes.
# xinit # I added drm.debug=6 in the kernel parameter line,
# so that it will produce lots of message and let that
# issue happen
It's 100% reproducable on my side. And my disk will be filled up by
/var/log/messages in a quite short time.
So, introduce a mutex_lock to stop syslog_seq from going wild just like
what devkmsg_read() does. It does fix this issue as expected.
v2: use mutex_lock_interruptiable() instead (comments from Kay)
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-By: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
watchdog: Quiet down the boot messages
perf/x86: Fix broken LBR fixup code
tracing: Have tracing_off() actually turn tracing off
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core updates (RCU and locking) from Ingo Molnar:
"Most of the diffstat comes from the RCU slow boot regression fixes,
but there's also a debuggability improvements/fixes."
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
memblock: Document memblock_is_region_{memory,reserved}()
rcu: Precompute RCU_FAST_NO_HZ timer offsets
rcu: Move RCU_FAST_NO_HZ per-CPU variables to rcu_dynticks structure
rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacks
rcu: RCU_FAST_NO_HZ detection of callback adoption
spinlock: Indicate that a lockup is only suspected
kdump: Execute kmsg_dump(KMSG_DUMP_PANIC) after smp_send_stop()
panic: Make panic_on_oops configurable
|
|
Provide an iterator to receive the log buffer content, and convert all
kmsg_dump() users to it.
The structured data in the kmsg buffer now contains binary data, which
should no longer be copied verbatim to the kmsg_dump() users.
The iterator should provide reliable access to the buffer data, and also
supports proper log line-aware chunking of data while iterating.
Signed-off-by: Kay Sievers <kay@vrfy.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Reported-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Tested-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
A bunch of bugzillas have complained how noisy the nmi_watchdog
is during boot-up especially with its expected failure cases
(like virt and bios resource contention).
This is my attempt to quiet them down and keep it less confusing
for the end user. What I did is print the message for cpu0 and
save it for future comparisons. If future cpus have an
identical message as cpu0, then don't print the redundant info.
However, if a future cpu has a different message, happily print
that loudly.
Before the change, you would see something like:
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz stepping 0a
Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
... version: 2
... bit width: 40
... generic registers: 2
... value mask: 000000ffffffffff
... max period: 000000007fffffff
... fixed-purpose events: 3
... event mask: 0000000700000003
NMI watchdog enabled, takes one hw-pmu counter.
Booting Node 0, Processors #1
NMI watchdog enabled, takes one hw-pmu counter.
#2
NMI watchdog enabled, takes one hw-pmu counter.
#3 Ok.
NMI watchdog enabled, takes one hw-pmu counter.
Brought up 4 CPUs
Total of 4 processors activated (22607.24 BogoMIPS).
After the change, it is simplified to:
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz stepping 0a
Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
... version: 2
... bit width: 40
... generic registers: 2
... value mask: 000000ffffffffff
... max period: 000000007fffffff
... fixed-purpose events: 3
... event mask: 0000000700000003
NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
Booting Node 0, Processors #1 #2 #3 Ok.
Brought up 4 CPUs
V2: little changes based on Joe Perches' feedback
V3: printk cleanup based on Ingo's feedback; checkpatch fix
V4: keep printk as one long line
V5: Ingo fix ups
Reported-and-tested-by: Nathan Zimmer <nzimmer@sgi.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: nzimmer@sgi.com
Cc: joe@perches.com
Link: http://lkml.kernel.org/r/1339594548-17227-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
by splice_shrink_spd() called from vmsplice_to_pipe()
commit 35f3d14dbbc5 (pipe: add support for shrinking and growing pipes)
added capability to adjust pipe->buffers.
Problem is some paths don't hold pipe mutex and assume pipe->buffers
doesn't change for their duration.
Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
use it in place of pipe->buffers where appropriate.
splice_shrink_spd() loses its struct pipe_inode_info argument.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Tom Herbert <therbert@google.com>
Cc: stable <stable@vger.kernel.org> # 2.6.35
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 7ff9554bb578ba02166071d2d487b7fc7d860d62, printk: convert
byte-buffer to variable-length record buffer, causes systems using
EABI to crash very early in the boot cycle. The first entry in struct
log is a u64, which for EABI must be 8 byte aligned.
Make use of __alignof__() so the compiler to decide the alignment, but
allow it to be overridden using CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS,
for systems which can perform unaligned access and want to save
a few bytes of space.
Tested on Orion5x and Kirkwood.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Stephen Warren <swarren@wwwdotorg.org>
Acked-by: Stephen Warren <swarren@wwwdotorg.org>
Acked-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
into timers/core
|
|
The next idle expiry time record and idle sleeps tracking are
statistics that only concern idle.
Since we want the nohz APIs to become usable further idle
context, let's pull up the handling of these statistics to the
callers in idle.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
|
|
Since we want to prepare for making the nohz API to work further
the idle case, we need to pull ts->idle_calls incrementation up to
the callers in idle.
To perform this, we split tick_nohz_stop_sched_tick() in two parts:
a first one that checks if we can really stop the tick for idle,
and another that actually stops it. Then from the callers in idle,
we check if we can stop the tick and only then we increment idle_calls
and finally relay to the nohz API that won't care about these details
anymore.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
|
|
Now that idle and nohz logics are going to be independant each others,
ts->idle_tick becomes too much a biased name to describe the field that
saves the last scheduled tick on top of which we re-calculate the next
tick to schedule when the timer is restarted.
We want to reuse this even to stop the tick outside idle cases. So let's
rename it to some more generic name: ts->last_tick.
This changes a bit the timer list stat export so we need to increase its
version.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
|
|
When the timer tick fires, it accounts the new jiffy as either part
of system, user or idle time. This is how we record the cputime
statistics.
But when the tick is stopped from the idle task, we still need
to record the number of jiffies spent tickless until we restart
the tick and fall back to traditional tick-based cputime accounting.
To do this, we take a snapshot of jiffies when the tick is stopped
and compute the difference against the new value of jiffies when
the tick is restarted. Then we account this whole difference to
the idle cputime.
However we are preparing to be able to stop the tick from other places
than idle. So this idle time accounting needs to be performed from
the callers of nohz APIs, not from the nohz APIs themselves because
we now want them to be agnostic against places that stop/restart tick.
Therefore, we pull the tickless idle time accounting out of generic
nohz helpers up to idle entry/exit callers.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
|
|
As we plan to be able to stop the tick outside the idle task, we
need to prepare for separating nohz logic from idle. As a start,
this pulls the idle sleeping time accounting out of the tick
stop/restart API to the callers on idle entry/exit.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent
Merge RCU fixes from Paul E. McKenney:
" This series has four patches, the major point of which is to eliminate
some slowdowns (including boot-time slowdowns) resulting from some
RCU_FAST_NO_HZ changes. The issue with the changes is that posting timers
from the idle loop has no effect if the CPU has entered dyntick-idle
mode because the CPU has already computed its wakeup time, and posting
a timer does not cause it to be recomputed. The short-term fix is for
RCU to precompute the timeout value so that the CPU's calculation is
correct. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix the relax_domain_level boot parameter
sched: Validate assumptions in sched_init_numa()
sched: Always initialize cpu-power
sched: Fix domain iteration
sched/rt: Fix lockdep annotation within find_lock_lowest_rq()
sched/numa: Load balance between remote nodes
sched/x86: Calculate booted cores after construction of sibling_mask
|
|
Fix lots of new kernel-doc warnings in kernel/sched/fair.c:
Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
.. more warnings
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"A bit larger than what I'd wish for - half of it is due to hw driver
updates to Intel Ivy-Bridge which info got recently released,
cycles:pp should work there now too, amongst other things. (but we
are generally making exceptions for hardware enablement of this type.)
There are also callchain fixes in it - responding to mostly
theoretical (but valid) concerns. The tooling side sports perf.data
endianness/portability fixes which did not make it for the merge
window - and various other fixes as well."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
perf/x86: Check user address explicitly in copy_from_user_nmi()
perf/x86: Check if user fp is valid
perf: Limit callchains to 127
perf/x86: Allow multiple stacks
perf/x86: Update SNB PEBS constraints
perf/x86: Enable/Add IvyBridge hardware support
perf/x86: Implement cycles:p for SNB/IVB
perf/x86: Fix Intel shared extra MSR allocation
x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefix
perf: Remove duplicate invocation on perf_event_for_each
perf uprobes: Remove unnecessary check before strlist__delete
perf symbols: Check for valid dso before creating map
perf evsel: Fix 32 bit values endianity swap for sample_id_all header
perf session: Handle endianity swap on sample_id_all header data
perf symbols: Handle different endians properly during symbol load
perf evlist: Pass third argument to ioctl explicitly
perf tools: Update ioctl documentation for PERF_IOC_FLAG_GROUP
perf tools: Make --version show kernel version instead of pull req tag
perf tools: Check if callchain is corrupted
perf callchain: Make callchain cursors TLS
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull leap second timer fix from Thomas Gleixner.
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecond
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/urgent
Pull brown paper bag fix from Steve Rostedt.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This reverts commit 40af1bbdca47e5c8a2044039bb78ca8fd8b20f94.
It's horribly and utterly broken for at least the following reasons:
- calling sync_mm_rss() from mmput() is fundamentally wrong, because
there's absolutely no reason to believe that the task that does the
mmput() always does it on its own VM. Example: fork, ptrace, /proc -
you name it.
- calling it *after* having done mmdrop() on it is doubly insane, since
the mm struct may well be gone now.
- testing mm against NULL before you call it is insane too, since a
NULL mm there would have caused oopses long before.
.. and those are just the three bugs I found before I decided to give up
looking for me and revert it asap. I should have caught it before I
even took it, but I trusted Andrew too much.
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
mm->rss_stat counters have per-task delta: task->rss_stat. Before
changing task->mm pointer the kernel must flush this delta with
sync_mm_rss().
do_exit() already calls sync_mm_rss() to flush the rss-counters before
committing the rss statistics into task->signal->maxrss, taskstats,
audit and other stuff. Unfortunately the kernel does this before
calling mm_release(), which can call put_user() for processing
task->clear_child_tid. So at this point we can trigger page-faults and
task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes
inconsistent and check_mm() will print something like this:
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1
| BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1
This patch moves sync_mm_rss() into mm_release(), and moves mm_release()
out of do_exit() and calls it earlier. After mm_release() there should
be no pagefaults.
[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org> [3.4.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In commit b76437579d13 ("procfs: mark thread stack correctly in
proc/<pid>/maps") the stack allocated via clone() is marked in
/proc/<pid>/maps as [stack:%d] thus it might be out of the former
mm->start_stack/end_stack values (and even has some custom VMA flags
set).
So to be able to restore mm->start_stack/end_stack drop vma flags test,
but still require the underlying VMA to exist.
As always note this feature is under CONFIG_CHECKPOINT_RESTORE and
requires CAP_SYS_RESOURCE to be granted.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Zero is written at clear_tid_address when the process exits. This
functionality is used by pthread_join().
We already have sys_set_tid_address() to change this address for the
current task but there is no way to obtain it from user space.
Without the ability to find this address and dump it we can't restore
pthread'ed apps which call pthread_join() once they have been restored.
This patch introduces the PR_GET_TID_ADDRESS prctl option which allows
the current process to obtain own clear_tid_address.
This feature is available iif CONFIG_CHECKPOINT_RESTORE is set.
[akpm@linux-foundation.org: fix prctl numbering]
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pedro Alves <palves@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Make sure the address being set is greater than mmap_min_addr (as
suggested by Kees Cook).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
A fix for commit b32dfe377102 ("c/r: prctl: add ability to set new
mm_struct::exe_file").
After removing mm->num_exe_file_vmas kernel keeps mm->exe_file until
final mmput(), it never becomes NULL while task is alive.
We can check for other mapped files in mm instead of checking
mm->num_exe_file_vmas, and mark mm with flag MMF_EXE_FILE_CHANGED in
order to forbid second changing of mm->exe_file.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When a CPU is entering dyntick-idle mode, tick_nohz_stop_sched_tick()
calls rcu_needs_cpu() see if RCU needs that CPU, and, if not, computes the
next wakeup time based on the timer wheels. Only later, when actually
entering the idle loop, rcu_prepare_for_idle() will be invoked. In some
cases, rcu_prepare_for_idle() will post timers to wake the CPU back up.
But all for naught: The next wakeup time for the CPU has already been
computed, and posting a timer afterwards does not force that wakeup
time to be recomputed. This means that rcu_prepare_for_idle()'s have
no effect.
This is not a problem on a busy system because something else will wake
up the CPU soon enough. However, on lightly loaded systems, the CPU
might stay asleep for a considerable length of time. If that CPU has
a callback that the rest of the system is waiting on, the system might
run very slowly or (in theory) even hang.
This commit avoids this problem by having rcu_needs_cpu() give
tick_nohz_stop_sched_tick() an estimate of when RCU will need the CPU
to wake back up, which tick_nohz_stop_sched_tick() takes into account
when programming the CPU's wakeup time. An alternative approach is
for rcu_prepare_for_idle() to use hrtimers instead of normal timers,
but timers are much more efficient than are hrtimers for frequently
and repeatedly posting and cancelling a given timer, which is exactly
what RCU_FAST_NO_HZ does.
Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
|
|
The RCU_FAST_NO_HZ code relies on a number of per-CPU variables.
This works, but is hidden from someone scanning the data structures
in rcutree.h. This commit therefore converts these per-CPU variables
to fields in the per-CPU rcu_dynticks structures.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
|
|
In the current code, a short dyntick-idle interval (where there is
at least one non-lazy callback on the CPU) and a long dyntick-idle
interval (where there are only lazy callbacks on the CPU) are traced
identically, which can be less than helpful. This commit therefore
emits different event traces in these two cases.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
|
|
In the present implementations of CPU hotplug, the outgoing CPU is
guaranteed to run its stop-machine process on the way out, which
will guarantee that RCU_FAST_NO_HZ forces the CPU out of dyntick-idle
mode.
However, new versions of CPU hotplug might not work this way. This
commit therefore removes this design constraint by explicitly notifying
CPUs when they adopt non-lazy RCU callbacks.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
|
|
A recent update to have tracing_on/off() only affect the ftrace ring
buffers instead of all ring buffers had a cut and paste error.
The tracing_off() did the exact same thing as tracing_on() and
would not actually turn off tracing. Unfortunately, tracing_off()
is more important to be working than tracing_on() as this is a key
development tool, as it lets the developer turn off tracing as soon
as a problem is discovered. It is also used by panic and oops code.
This bug also breaks the 'echo func:traceoff > set_ftrace_filter'
Cc: <stable@vger.kernel.org> # 3.4
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
__css_put is using atomic_dec on the ref count, and then
looking at the ref count to make decisions. This is prone
to races, as someone else may decrement ref count between
our decrement and our decision. Instead, we should base our
decisions on the value that we decremented the ref count to.
(This results in an actual race on Google's kernel which I
haven't been able to reproduce on the upstream kernel. Having
said that, it's still incorrect by inspection).
Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
|
|
It does not get processed because sched_domain_level_max is 0 at the
time that setup_relax_domain_level() is run.
Simply accept the value as it is, as we don't know the value of
sched_domain_level_max until sched domain construction is completed.
Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine calls
the set_domain_attribute() routine prior to setting the sd->level, however,
the set_domain_attribute() routine relies on the sd->level to decide whether
idle load balancing will be off/on.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Add some code to validate assumptions we're making and output
warnings if they are not.
If this trigger we want to know about it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alex Shi <lkml.alex@gmail.com>
Link: http://lkml.kernel.org/n/tip-6uc3wk5s9udxtdl9cnku0vtt@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Often when we run into mis-shapen topologies the balance iteration
fails to update the cpu power properly and we'll end up in /0 traps.
Always initialize the cpu-power to a semi-sane value so that we can
at least boot the machine, even if the load-balancer might not
function correctly.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3lbhyj25sr169ha7z3qht5na@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Weird topologies can lead to asymmetric domain setups. This needs
further consideration since these setups are typically non-minimal
too.
For now, make it work by adding an extra mask selecting which CPUs
are allowed to iterate up.
The topology that triggered it is the one from David Rientjes:
10 20 20 30
20 10 20 20
20 20 10 20
30 20 20 10
resulting in boxes that wouldn't even boot.
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Roland Dreier reported spurious, hard to trigger lockdep warnings
within the scheduler - without any real lockup.
This bit gives us the right clue:
> [89945.640512] [<ffffffff8103fa1a>] double_lock_balance+0x5a/0x90
> [89945.640568] [<ffffffff8104c546>] push_rt_task+0xc6/0x290
if you look at that code you'll find the double_lock_balance() in
question is the one in find_lock_lowest_rq() [yay for inlining].
Now find_lock_lowest_rq() has a bug.. it fails to use
double_unlock_balance() in one exit path, if this results in a retry in
push_rt_task() we'll call double_lock_balance() again, at which point
we'll run into said lockdep confusion.
Reported-by: Roland Dreier <roland@kernel.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Commit cb83b629b ("sched/numa: Rewrite the CONFIG_NUMA sched
domain support") removed the NODE sched domain and started checking
if the node distance in SLIT table is farther than REMOTE_DISTANCE,
if so, it will lose the load balance chance at exec/fork/wake_affine
points.
But actually, even the node distance is farther than REMOTE_DISTANCE.
Modern CPUs also has QPI like connections, which ensures that memory
access is not too slow between nodes. So the above change in behavior
on NUMA machine causes a performance regression on various benchmarks:
hackbench, tbench, netperf, oltp, etc.
This patch will recover the scheduler behavior to old mode on all my
Intel platforms: NHM EP/EX, WSM EP, SNB EP/EP4S, and thus fixes the
perfromance regressions. (all of them just have 2 kinds distance, 10, 21)
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338965571-9812-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Gilad reported at
http://lkml.kernel.org/r/1336056962-10465-2-git-send-email-gilad@benyossef.com
"Current timer code fails to correctly return a value meaning that
there is no future timer event, with the result that the timer keeps
getting re-armed in HZ one shot mode even when we could turn it off,
generating unneeded interrupts.
What is happening is that when __next_timer_interrupt() wishes
to return a value that signifies "there is no future timer
event", it returns (base->timer_jiffies + NEXT_TIMER_MAX_DELTA).
However, the code in tick_nohz_stop_sched_tick(), which called
__next_timer_interrupt() via get_next_timer_interrupt(),
compares the return value to (last_jiffies + NEXT_TIMER_MAX_DELTA)
to see if the timer needs to be re-armed.
base->timer_jiffies != last_jiffies and so tick_nohz_stop_sched_tick()
interperts the return value as indication that there is a distant
future event 12 days from now and programs the timer to fire next
after KTIME_MAX nsecs instead of avoiding to arm it. This ends up
causing a needless interrupt once every KTIME_MAX nsecs."
Fix this by using the new active timer accounting. This avoids scans
when no active timer is enqueued completely, so we don't have to rely
on base->timer_next and base->timer_jiffies anymore.
Reported-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.317535385@linutronix.de
|
|
The code in get_next_timer_interrupt() is suboptimal as it has to run
through the cascade to find the next expiring timer. On a completely
idle core we should only do that when there is an active timer
enqueued and base->next_timer does not give us a fast answer.
Add accounting of the active timers to the now consolidated
attach/detach code. I deliberately avoided sanity checks because the
code is fully symetric and any fiddling with timers w/o using the API
functions will lead to cute explosions anyway. ulong is big enough
even on 32bit and if we really run into the situation to have more
than 1<<32 timers enqueued there, then we are definitely not in a
state to go idle and run through that code.
This allows us to fix another shortcoming of get_next_timer_interrupt().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.236377028@linutronix.de
|
|
Another bunch of mindlessly copied code. All callers of
internal_add_timer() except the recascading code updates
base->next_timer.
Move this into internal_add_timer() and let the cascading code call
__internal_add_timer().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.189946224@linutronix.de
|
|
Most callers of detach_timer() have the same pattern around
them. Check whether the timer is pending and eventually updating
base->next_timer.
Create detach_if_pending() and replace the duplicated code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20120525214819.131246037@linutronix.de
|
|
Merge two debugging patchlets that were waiting for
preparatory commits to hit upstream.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf fixes from Arnaldo Carvalho de Melo:
* Endianness fixes from Jiri Olsa
* Fixes for make perf tarball
* Fix for DSO name in perf script callchains, from David Ahern
* Segfault fixes for perf top --callchain, from Namhyung Kim
* Minor function result fixes from Srikar Dronamraju
* Add missing 3rd ioctl parameter, from Namhyung Kim
* Fix pager usage in minimal embedded systems, from Avik Sil
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
"This fixes the possible premature superblock release on umount bug
mentioned during v3.5-rc1 pull request.
Originally, cgroup dentry destruction path assumed that cgroup dentry
didn't have any reference left after cgroup removal thus put super
during dentry removal. Now that there can be lingering dentry
references, this led to super being put with live dentries. This
patch fixes the problem by putting super ref on dentry release instead
of removal."
* 'for-3.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: superblock can't be released with active dentries
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Remove NULL assignment of dattr_cur
sched: Remove the last NULL entry from sched_feat_names
sched: Make sched_feat_names const
sched/rt: Fix SCHED_RR across cgroups
sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'
sched: Make sure to not re-read variables after validation
sched: Fix SD_OVERLAP
sched: Don't try allocating memory from offline nodes
sched/nohz: Fix rq->cpu_load calculations some more
sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask
|
|
Commit 6b43ae8a61 (ntp: Fix leap-second hrtimer livelock) broke the
leapsecond update of CLOCK_MONOTONIC. The missing leapsecond update to
wall_to_monotonic causes discontinuities in CLOCK_MONOTONIC.
Adjust wall_to_monotonic when NTP inserted a leapsecond.
Reported-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Tested-by: Richard Cochran <richardcochran@gmail.com>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1338400497-12420-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq and smpboot updates from Thomas Gleixner:
"Just cleanup patches with no functional change and a fix for suspend
issues."
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Introduce irq_do_set_affinity() to reduce duplicated code
genirq: Add IRQS_PENDING for nested and simple irq
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smpboot, idle: Fix comment mismatch over idle_threads_init()
smpboot, idle: Optimize calls to smp_processor_id() in idle_threads_init()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"The clocksource driver is pure hardware enablement and the skew option
is default off, well tested and non dangerous."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Move skew_tick option into the HIGH_RES_TIMER section
clocksource: em_sti: Add DT support
clocksource: em_sti: Emma Mobile STI driver
clockevents: Make clockevents_config() a global symbol
tick: Add tick skew boot option
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull third pile of signal handling patches from Al Viro:
"This time it's mostly helpers and conversions to them; there's a lot
of stuff remaining in the tree, but that'll either go in -rc2
(isolated bug fixes, ideally via arch maintainers' trees) or will sit
there until the next cycle."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
x86: get rid of calling do_notify_resume() when returning to kernel mode
blackfin: check __get_user() return value
whack-a-mole with TIF_FREEZE
FRV: Optimise the system call exit path in entry.S [ver #2]
FRV: Shrink TIF_WORK_MASK [ver #2]
FRV: Prevent syscall exit tracing and notify_resume at end of kernel exceptions
new helper: signal_delivered()
powerpc: get rid of restore_sigmask()
most of set_current_blocked() callers want SIGKILL/SIGSTOP removed from set
set_restore_sigmask() is never called without SIGPENDING (and never should be)
TIF_RESTORE_SIGMASK can be set only when TIF_SIGPENDING is set
don't call try_to_freeze() from do_signal()
pull clearing RESTORE_SIGMASK into block_sigmask()
sh64: failure to build sigframe != signal without handler
openrisc: tracehook_signal_handler() is supposed to be called on success
new helper: sigmask_to_save()
new helper: restore_saved_sigmask()
new helpers: {clear,test,test_and_clear}_restore_sigmask()
HAVE_RESTORE_SIGMASK is defined on all architectures now
|