path: root/kernel/trace/ring_buffer.c
2024-03-12  ring-buffer: Use wait_event_interruptible() in ring_buffer_wait()  [Steven Rostedt (Google)]

Convert ring_buffer_wait() over to wait_event_interruptible(). The default
condition is to execute the wait loop inside __wait_event() just once.

This does not change the ring_buffer_wait() prototype yet, but restructures
the code so that it can take a "cond" and "data" parameter and will call
wait_event_interruptible() with a helper function as the condition.

The helper function (rb_wait_cond) takes the cond function and data
parameters. It will first check if the buffer hit the watermark defined by
the "full" parameter and then call the passed in condition parameter. If
either is true, it returns true.

If rb_wait_cond() does not return true, it will set the appropriate
"waiters_pending" flag and return false.

Link: https://lore.kernel.org/linux-trace-kernel/CAHk-=wgsNgewHFxZAJiAQznwPMqEtQmi1waeS2O1v6L4c_Um5A@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240312121703.399598519@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: f3ddb74ad0790 ("tracing: Wake up ring buffer waiters on closing of the file")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
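For illustration, a minimal sketch of the rb_wait_cond() shape described
above; the struct, names, and signature are stand-ins for the example, not
the exact kernel API:

    #include <stdbool.h>

    /* Illustrative stand-ins; the real kernel types differ. */
    struct rbwork {
            bool waiters_pending;       /* waiting for any data */
            bool full_waiters_pending;  /* waiting for a fill percentage */
    };

    typedef bool (*rb_cond_fn)(void *data);

    /* Check the watermark first, then the caller's condition. Only if
     * neither is satisfied, flag a pending waiter and report "keep
     * sleeping" to the wait_event_interruptible() condition. */
    static bool rb_wait_cond(struct rbwork *work, int full, bool watermark_hit,
                             rb_cond_fn cond, void *data)
    {
            if (watermark_hit || (cond && cond(data)))
                    return true;            /* wake up */

            if (full)
                    work->full_waiters_pending = true;
            else
                    work->waiters_pending = true;
            return false;                   /* sleep (again) */
    }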
2024-03-12  ring-buffer: Reuse rb_watermark_hit() for the poll logic  [Steven Rostedt (Google)]
The check for knowing if the poll should wait or not is basically the exact same logic as rb_watermark_hit(). The only difference is that rb_watermark_hit() also handles the !full case. But for the full case, the logic is the same. Just call that instead of duplicating the code in ring_buffer_poll_wait(). Link: https://lore.kernel.org/linux-trace-kernel/20240312131952.802267543@goodmis.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-12  ring-buffer: Fix full_waiters_pending in poll  [Steven Rostedt (Google)]

If a reader of the ring buffer is doing a poll, and waiting for the ring
buffer to hit a specific watermark, there could be a case where it gets
into an infinite ping-pong loop.

The poll code has:

    rbwork->full_waiters_pending = true;
    if (!cpu_buffer->shortest_full ||
        cpu_buffer->shortest_full > full)
            cpu_buffer->shortest_full = full;

The writer will see full_waiters_pending and check if the ring buffer is
filled over the percentage of the shortest_full value. If it is, it calls
an irq_work to wake up all the waiters.

But the code could get into a circular loop:

    CPU 0                                   CPU 1
    -----                                   -----
    [ Poll ]
      [ shortest_full = 0 ]
    rbwork->full_waiters_pending = true;
                                            if (rbwork->full_waiters_pending &&
                                                [ buffer percent ] > shortest_full) {
                                                    rbwork->wakeup_full = true;
                                                    [ queue_irqwork ]

    cpu_buffer->shortest_full = full;

                                            [ IRQ work ]
                                            if (rbwork->wakeup_full) {
                                                    cpu_buffer->shortest_full = 0;
                                                    wakeup poll waiters;
    [woken]
    if ([ buffer percent ] > full)
            break;
    rbwork->full_waiters_pending = true;
                                            if (rbwork->full_waiters_pending &&
                                                [ buffer percent ] > shortest_full) {
                                                    rbwork->wakeup_full = true;
                                                    [ queue_irqwork ]

    cpu_buffer->shortest_full = full;

                                            [ IRQ work ]
                                            if (rbwork->wakeup_full) {
                                                    cpu_buffer->shortest_full = 0;
                                                    wakeup poll waiters;
    [woken]

    [ Wash, rinse, repeat! ]

In the poll, the shortest_full needs to be set before full_waiters_pending,
as once that is set, the writer will compare the current shortest_full
(which is incorrect) to decide to call the irq_work, which will reset the
shortest_full (expecting the readers to update it).

Also move the setting of full_waiters_pending after the check if the ring
buffer has the required percentage filled. There's no reason to tell the
writer to wake up waiters if there are no waiters.

Link: https://lore.kernel.org/linux-trace-kernel/20240312131952.630922155@goodmis.org
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
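A minimal sketch of the corrected poll-side ordering, with illustrative
stand-in types (not the kernel's):

    #include <stdbool.h>

    struct cpu_buf { int shortest_full; };
    struct rb_work { bool full_waiters_pending; };

    /* watermark_hit is a stand-in for the real percentage test. */
    static void poll_prepare_to_wait(struct cpu_buf *cb, struct rb_work *work,
                                     int full, bool watermark_hit)
    {
            if (watermark_hit)
                    return;     /* enough data already: don't flag a waiter */

            /* Set the watermark first, then advertise the waiter, so a
             * writer that sees the flag also sees a correct shortest_full. */
            if (!cb->shortest_full || cb->shortest_full > full)
                    cb->shortest_full = full;
            work->full_waiters_pending = true;
    }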
2024-03-12  ring-buffer: Do not set shortest_full when full target is hit  [Steven Rostedt (Google)]

The rb_watermark_hit() checks if the amount of data in the ring buffer is
above the percentage level passed in by the "full" variable. If it is, it
returns true.

But it also sets the "shortest_full" field of the cpu_buffer that informs
writers that it needs to call the irq_work if the amount of data on the
ring buffer is above the requested amount.

The rb_watermark_hit() always sets the shortest_full even when the amount
in the ring buffer already satisfies the request. As the caller is not
going to wait in that case, there's no reason to set shortest_full.

Link: https://lore.kernel.org/linux-trace-kernel/20240312115641.6aa8ba08@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 42fb0a1e84ff5 ("tracing/ring-buffer: Have polling block on watermark")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
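For illustration, a sketch of the corrected behavior (hypothetical names,
not the kernel function):

    #include <stdbool.h>

    struct cpu_buf { int shortest_full; };

    /* Return true without touching shortest_full when the buffer already
     * holds at least "full" percent; only record the watermark when the
     * caller is actually going to wait on it. */
    static bool watermark_hit(struct cpu_buf *cb, int full, int percent_filled)
    {
            if (percent_filled >= full)
                    return true;

            if (!cb->shortest_full || cb->shortest_full > full)
                    cb->shortest_full = full;
            return false;
    }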
2024-03-11  Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]

Pull core x86 updates from Ingo Molnar:

 - The biggest change is the rework of the percpu code, to support the
   'Named Address Spaces' GCC feature, by Uros Bizjak:

    - This allows C code to access GS and FS segment relative memory via
      variables declared with such attributes, which allows the compiler
      to better optimize those accesses than the previous inline assembly
      code.

    - The series also includes a number of micro-optimizations for various
      percpu access methods, plus a number of cleanups of %gs accesses in
      assembly code.

    - These changes have been exposed to linux-next testing for the last
      ~5 months, with no known regressions in this area.

 - Fix/clean up __switch_to()'s broken but accidentally working handling
   of FPU switching - which also generates better code

 - Propagate more RIP-relative addressing in assembly code, to generate
   slightly better code

 - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
   make it easier for distros to follow & maintain these options

 - Rework the x86 idle code to cure RCU violations and to clean up the
   logic

 - Clean up the vDSO Makefile logic

 - Misc cleanups and fixes

* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/idle: Select idle routine only once
  x86/idle: Let prefer_mwait_c1_over_halt() return bool
  x86/idle: Cleanup idle_setup()
  x86/idle: Clean up idle selection
  x86/idle: Sanitize X86_BUG_AMD_E400 handling
  sched/idle: Conditionally handle tick broadcast in default_idle_call()
  x86: Increase brk randomness entropy for 64-bit systems
  x86/vdso: Move vDSO to mmap region
  x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
  x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
  x86/retpoline: Ensure default return thunk isn't used at runtime
  x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
  x86/vdso: Use $(addprefix ) instead of $(foreach )
  x86/vdso: Simplify obj-y addition
  x86/vdso: Consolidate targets and clean-files
  x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK
  x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO
  x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY
  x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY
  x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS
  ...
2024-03-10  ring-buffer: Fix resetting of shortest_full  [Steven Rostedt (Google)]

The "shortest_full" variable is used to keep track of the waiter that is
waiting for the smallest amount on the ring buffer before being woken up.
When a task waits on the ring buffer, it passes in a "full" value that is
a percentage. 0 means wake up on any data. 1-100 means wake up from 1% to
100% full buffer.

As all waiters are on the same wait queue, the wake up happens for the
waiter with the smallest percentage.

The problem is that the shortest_full on the cpu_buffer that stores the
smallest amount doesn't get reset when all the waiters are woken up. It
does get reset when the ring buffer is reset
(echo > /sys/kernel/tracing/trace).

This means that tasks may be woken up more often than they want to be.
Instead, have the shortest_full field get reset just before waking up all
the tasks. If the tasks wait again, they will update the shortest_full
before sleeping.

Also add locking around setting of shortest_full in the poll logic, and
change "work" to "rbwork" to match the variable name for rb_irq_work
structures that are used in other places.

Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.948914369@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-03-10  ring-buffer: Fix waking up ring buffer readers  [Steven Rostedt (Google)]

A task can wait on a ring buffer for when it fills up to a specific
watermark. The writer will check the minimum watermark that waiters are
waiting for and if the ring buffer is past that, it will wake up all the
waiters.

The waiters are in a wait loop, and will first check if a signal is
pending and then check if the ring buffer is at the desired level where it
should break out of the loop.

If a file that uses a ring buffer closes, and there's threads waiting on
the ring buffer, it needs to wake up those threads. To do this, a
"wait_index" was used.

Before entering the wait loop, the waiter will read the wait_index. On
wakeup, it will check if the wait_index is different than when it entered
the loop, and will exit the loop if it is. The waker will only need to
update the wait_index before waking up the waiters.

This had a couple of bugs. One trivial one and one broken by design.

The trivial bug was that the waiter checked the wait_index after the
schedule() call. It had to be checked between the prepare_to_wait() and
the schedule() which it was not.

The main bug is that the first check to set the default wait_index will
always be outside the prepare_to_wait() and the schedule(). That's because
the ring_buffer_wait() doesn't have enough context to know if it should
break out of the loop.

The loop itself is not needed, because all the callers to the
ring_buffer_wait() also have their own loops, as the callers have a better
sense of what the context is to decide whether to break out of the loop or
not.

Just have the ring_buffer_wait() block once, and if it gets woken up, exit
the function and let the callers decide what to do next.

Link: https://lore.kernel.org/all/CAHk-=whs5MdtNjzFkTyaUy=vHi=qwWgPi0JgTe6OYUYMNSRZfg@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240308202431.792933613@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linke li <lilinke99@qq.com>
Cc: Rabin Vincent <rabin@rab.in>
Fixes: e30f53aad2202 ("tracing: Do not busy wait in buffer splice")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-21  ring-buffer: Do not let subbuf be bigger than write mask  [Steven Rostedt (Google)]

The data on the subbuffer is measured by a write variable that also
contains status flags. The counter is just 20 bits in length. If the
subbuffer is bigger than what the counter can represent, it will fail.

Make sure that the subbuffer can not be set to greater than the counter
that keeps track of the data on the subbuffer.

Link: https://lore.kernel.org/linux-trace-kernel/20240220095112.77e9cb81@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 2808e31ec12e5 ("ring-buffer: Add interface for configuring trace sub buffer size")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
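A small self-contained sketch of that limit; the 20-bit width matches the
commit text, but the names are illustrative:

    #include <stdio.h>

    /* 20 usable bits for the write counter, per the commit text. */
    #define RB_WRITE_MASK ((1UL << 20) - 1)

    static int subbuf_size_ok(unsigned long subbuf_size)
    {
            return subbuf_size <= RB_WRITE_MASK;
    }

    int main(void)
    {
            printf("%d\n", subbuf_size_ok(1UL << 19)); /* 512K fits:  1 */
            printf("%d\n", subbuf_size_ok(1UL << 21)); /* 2M too big: 0 */
            return 0;
    }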
2024-02-14  Merge branch 'x86/bugs' into x86/core, to pick up pending changes before dependent patches  [Ingo Molnar]

Merge in pending alternatives patching infrastructure changes, before
applying more patches.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-01-31  ring-buffer: Clean ring_buffer_poll_wait() error return  [Vincent Donnefort]

The return type for ring_buffer_poll_wait() is __poll_t. This is behind
the scenes an unsigned where we can set event bits. In case of a
non-allocated CPU, we instead return -EINVAL (0xffffffea). Lucky us, this
ends up setting a few error bits (EPOLLERR | EPOLLHUP | EPOLLNVAL), so
user-space at least is aware something went wrong.

Nonetheless, this is incorrect code. Replace that -EINVAL with a proper
EPOLLERR to clean that output. As this doesn't change the behaviour,
there's no need to treat this change as a bug fix.

Link: https://lore.kernel.org/linux-trace-kernel/20240131140955.3322792-1-vdonnefort@google.com
Cc: stable@vger.kernel.org
Fixes: 6721cb6002262 ("ring-buffer: Do not poll non allocated cpu buffers")
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
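A minimal sketch of why a negative errno misbehaves here;
poll_wait_sketch() is a hypothetical stand-in for the handler:

    #include <stdio.h>
    #include <sys/epoll.h>

    /* Poll handlers return a bitmask of event bits. Returning a negative
     * errno just becomes an arbitrary bit pattern once cast to unsigned. */
    static unsigned int poll_wait_sketch(int cpu_allocated)
    {
            if (!cpu_allocated)
                    return EPOLLERR;  /* a real event bit, not (unsigned)-EINVAL */
            return 0;
    }

    int main(void)
    {
            printf("-EINVAL as a mask: %#x\n", (unsigned int)-22); /* 0xffffffea */
            printf("EPOLLERR:          %#x\n", poll_wait_sketch(0));
            return 0;
    }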
2024-01-18  Merge tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  [Linus Torvalds]

Pull tracing updates from Steven Rostedt:

 - Allow kernel trace instance creation to specify what events are
   created

   Inside the kernel, a subsystem may create a tracing instance that it
   can use to send events to user space. This sub-system may not care
   about the thousands of events that exist in eventfs. Allow the
   sub-system to specify what sub-systems of events it cares about, and
   only those events are exposed to this instance.

 - Allow the ring buffer to be broken up into bigger sub-buffers than
   just the architecture page size.

   A new tracefs file called "buffer_subbuf_size_kb" is created. The user
   can now specify a minimum size the sub-buffer may be in kilobytes.
   Note that the implementation currently makes the sub-buffer size a
   power of 2 pages (1, 2, 4, 8, 16, ...), but the user only writes in
   kilobyte size, and the sub-buffer will be updated to the next size
   that can accommodate it. If the user writes in 10, it will change the
   size to be 4 pages on x86 (16K), as that is the next available size
   that can hold 10K.

 - Update the debug output when a corrupt time is detected in the ring
   buffer. If the ring buffer detects inconsistent timestamps, there's a
   debug config option that will dump the contents of the meta data of
   the sub-buffer that is used for debugging. Add some more information
   to this dump that helps with debugging.

 - Add more timestamp debugging checks (only triggers when the config is
   enabled)

 - Increase the trace_seq iterator to 2 page sizes.

 - Allow strings written into tracefs_marker to be larger. Up to just
   under 2 page sizes (based on what trace_seq can hold).

 - Increase the trace_marker_raw write to be as big as a sub-buffer can
   hold.

 - Remove 32 bit time stamp logic, now that the rb_time_cmpxchg() has
   been removed.

 - More selftests were added.

 - Some code clean ups as well.

* tag 'trace-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits)
  ring-buffer: Remove stale comment from ring_buffer_size()
  tracing histograms: Simplify parse_actions() function
  tracing/selftests: Remove exec permissions from trace_marker.tc test
  ring-buffer: Use subbuf_order for buffer page masking
  tracing: Update subbuffer with kilobytes not page order
  ringbuffer/selftest: Add basic selftest to test changing subbuf order
  ring-buffer: Add documentation on the buffer_subbuf_order file
  ring-buffer: Just update the subbuffers when changing their allocation order
  ring-buffer: Keep the same size when updating the order
  tracing: Stop the tracing while changing the ring buffer subbuf size
  tracing: Update snapshot order along with main buffer order
  ring-buffer: Make sure the spare sub buffer used for reads has same size
  ring-buffer: Do no swap cpu buffers if order is different
  ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure
  ring-buffer: Read and write to ring buffers with custom sub buffer size
  ring-buffer: Set new size of the ring buffer sub page
  ring-buffer: Add interface for configuring trace sub buffer size
  ring-buffer: Page size per ring buffer
  ring-buffer: Have ring_buffer_print_page_header() be able to access ring_buffer_iter
  ring-buffer: Check if absolute timestamp goes backwards
  ...
2024-01-10  x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINE  [Breno Leitao]
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options. [ mingo: Converted a few more uses in comments/messages as well. ] Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ariel Miculas <amiculas@cisco.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
2024-01-09  ring-buffer: Remove stale comment from ring_buffer_size()  [Steven Rostedt (Google)]

It's been 11 years since the ring_buffer_size() function was updated to
use the nr_pages from the buffer->buffers[cpu] structure instead of using
the buffer->nr_pages that no longer exists. The comment in the code is
more of what a change log should have and is pretty much useless for
development. It describes how things worked back in 2012, which has no
bearing on today's code. Remove it.

Link: https://lore.kernel.org/linux-trace-kernel/84d3b41a72bd43dbb9d44921ef535c92@AcuMS.aculab.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231220081028.7cd7e8e2@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2023-12-29  tracing: Fix blocked reader of snapshot buffer  [Steven Rostedt (Google)]
If an application blocks on the snapshot or snapshot_raw files, expecting to be woken up when a snapshot occurs, it will not happen. Or it may happen with an unexpected result. That result is that the application will be reading the main buffer instead of the snapshot buffer. That is because when the snapshot occurs, the main and snapshot buffers are swapped. But the reader has a descriptor still pointing to the buffer that it originally connected to. This is fine for the main buffer readers, as they may be blocked waiting for a watermark to be hit, and when a snapshot occurs, the data that the main readers want is now on the snapshot buffer. But for waiters of the snapshot buffer, they are waiting for an event to occur that will trigger the snapshot and they can then consume it quickly to save the snapshot before the next snapshot occurs. But to do this, they need to read the new snapshot buffer, not the old one that is now receiving new data. Also, it does not make sense to have a watermark "buffer_percent" on the snapshot buffer, as the snapshot buffer is static and does not receive new data except all at once. Link: https://lore.kernel.org/linux-trace-kernel/20231228095149.77f5b45d@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: debdd57f5145f ("tracing: Make a snapshot feature available from userspace") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-29  ring-buffer: Fix wake ups when buffer_percent is set to 100  [Steven Rostedt (Google)]

The tracefs file "buffer_percent" is to allow user space to set a
water-mark on how much of the tracing ring buffer needs to be filled in
order to wake up a blocked reader.

    0 - is to wait until any data is in the buffer
    1 - is to wait for 1% of the sub buffers to be filled
   50 - would be half of the sub buffers are filled with data
  100 - is not to wake the waiter until the ring buffer is completely full

Unfortunately the test for being full was:

    dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
    return (dirty * 100) > (full * nr_pages);

Where "full" is the value for "buffer_percent".

There are two issues with the above when full == 100.

 1. dirty * 100 > 100 * nr_pages will never be true

    That is, the above is basically saying that if the user sets
    buffer_percent to 100, more pages need to be dirty than exist in the
    ring buffer!

 2. The page that the writer is on is never considered dirty, as dirty
    pages are only those that are full. When the writer goes to a new
    sub-buffer, it clears the contents of that sub-buffer.

    That is, even if the check was ">=" it would still not be equal as
    the most pages that can be considered "dirty" is nr_pages - 1.

To fix this, add one to dirty and use ">=" in the compare.

Link: https://lore.kernel.org/linux-trace-kernel/20231226125902.4a057f1d@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
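A self-contained sketch of the corrected test, derived from the commit
text (full_hit() is an illustrative name):

    #include <stdio.h>
    #include <stdbool.h>

    /* Count the page the writer is on (+1) and use ">=", so that
     * full == 100 can actually be satisfied. */
    static bool full_hit(unsigned long dirty, unsigned long nr_pages, int full)
    {
            return (dirty + 1) * 100 >= (unsigned long)full * nr_pages;
    }

    int main(void)
    {
            /* Old test (dirty * 100) > (full * nr_pages): never true at 100%. */
            printf("%d\n", full_hit(15, 16, 100)); /* writer on the last page: 1 */
            printf("%d\n", full_hit(7, 16, 50));   /* 8 of 16 pages = 50%:     1 */
            printf("%d\n", full_hit(3, 16, 50));   /* 4 of 16 pages = 25%:     0 */
            return 0;
    }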
2023-12-21  ring-buffer: Use subbuf_order for buffer page masking  [Steven Rostedt (Google)]
The comparisons to PAGE_SIZE were all converted to use the buffer->subbuf_order, but the use of PAGE_MASK was missed. Convert all the PAGE_MASK usages over to: (PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1 Link: https://lore.kernel.org/linux-trace-kernel/20231219173800.66eefb7a@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: 139f84002145 ("ring-buffer: Page size per ring buffer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
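For illustration, a sketch of the size-based mask arithmetic; the
PAGE_SIZE value and macro name are assumptions for the example:

    #include <stdio.h>

    #define PAGE_SIZE 4096UL                            /* assumed for the example */
    #define SUBBUF_MASK(order) ((PAGE_SIZE << (order)) - 1)

    int main(void)
    {
            unsigned long addr = 0x12345678;
            int order = 2;                              /* 16K sub-buffers */

            /* The fixed PAGE_MASK only works for order 0; the mask must
             * follow the sub-buffer size. */
            printf("offset in subbuf: %#lx\n", addr & SUBBUF_MASK(order));
            printf("subbuf start:     %#lx\n", addr & ~SUBBUF_MASK(order));
            return 0;
    }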
2023-12-21  ring-buffer: Just update the subbuffers when changing their allocation order  [Steven Rostedt (Google)]

The ring_buffer_subbuf_order_set() was creating ring_buffer_per_cpu
cpu_buffers with the new subbuffers with the updated order, and if they
were all successfully created, then the ring buffer's per_cpu buffers
would be freed and replaced by them.

The problem is that the freed per_cpu buffers contain state that would be
lost.

Running the following commands:

 1. # echo 3 > /sys/kernel/tracing/buffer_subbuf_order
 2. # echo 0 > /sys/kernel/tracing/tracing_cpumask
 3. # echo 1 > /sys/kernel/tracing/snapshot
 4. # echo ff > /sys/kernel/tracing/tracing_cpumask
 5. # echo test > /sys/kernel/tracing/trace_marker

Would result in:

 -bash: echo: write error: Bad file descriptor

That's because the state of the per_cpu buffers of the snapshot buffer is
lost when the order is changed (the order of a freed snapshot buffer goes
to 0 to save memory, and when the snapshot buffer is allocated again, it
goes back to what the main buffer is).

In operation 2, the snapshot buffers were set to "disable" (as all the
ring buffer's CPUs were disabled).

In operation 3, the snapshot is allocated and a call to
ring_buffer_subbuf_order_set() replaced the per_cpu buffers, losing the
"record_disable" count.

When it was enabled again, the atomic_dec(&cpu_buffer->record_disable) was
decrementing a zero, setting it to -1. Writing 1 into the snapshot would
swap the snapshot buffer with the main buffer, so now the main buffer is
"disabled", and nothing can write to the ring buffer anymore.

Instead of creating new per_cpu buffers and losing the state of the old
buffers, basically do what the resize does and just allocate new subbuf
pages into the new_pages link list of the per_cpu buffer and if they all
succeed, then replace the old sub buffers with the new ones. This keeps
the per_cpu buffer descriptor intact and by doing so, keeps its state.

Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.944104939@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: f9b94daa542a ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21  ring-buffer: Keep the same size when updating the order  [Steven Rostedt (Google)]

The function ring_buffer_subbuf_order_set() just updated the sub-buffers
to the new size, but this also changes the size of the buffer in doing so,
as the total size is determined by nr_pages * subbuf_size. If the
subbuf_size is increased without decreasing the nr_pages, this causes the
total size of the buffer to increase.

This broke the latency tracers, as the snapshot needs to be the same size
as the main buffer. The size of the snapshot buffer is only expanded when
needed, and because the order is still the same, the size becomes out of
sync with the main buffer, as the main buffer increased in size without
the tracing system knowing.

Calculate the nr_pages to allocate with the new subbuf_size to be
buffer_size / new_subbuf_size.

Link: https://lore.kernel.org/linux-trace-kernel/20231219185630.649397785@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Fixes: f9b94daa542a ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
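A small sketch of that recalculation under an assumed 4K system page
(names are illustrative):

    #include <stdio.h>

    #define PAGE_SIZE 4096UL    /* assumed for the example */

    /* Keep nr_pages * subbuf_size constant across an order change. */
    static unsigned long nr_pages_for_order(unsigned long old_nr_pages,
                                            int old_order, int new_order)
    {
            unsigned long bytes = old_nr_pages * (PAGE_SIZE << old_order);
            unsigned long nr = bytes / (PAGE_SIZE << new_order);

            return nr ? nr : 1;     /* always keep at least one sub-buffer */
    }

    int main(void)
    {
            /* 256 order-0 pages become 32 order-3 sub-buffers: same 1M total. */
            printf("%lu\n", nr_pages_for_order(256, 0, 3));
            return 0;
    }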
2023-12-21  ring-buffer: Do no swap cpu buffers if order is different  [Steven Rostedt (Google)]

As all the subbuffer order (subbuffer sizes) must be the same throughout
the ring buffer, check the order of the buffers that are doing a CPU
buffer swap in ring_buffer_swap_cpu() to make sure they are the same.

If they are not the same, then fail to do the swap, otherwise the ring
buffer will think the CPU buffer has a specific subbuffer size when it
does not.

Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.467894710@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-21  ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure  [Steven Rostedt (Google)]
On failure to allocate ring buffer pages, the pointer to the CPU buffer pages is freed, but the pages that were allocated previously were not. Make sure they are freed too. Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.179352802@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Fixes: f9b94daa542a ("tracing: Set new size of the ring buffer sub page") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-20  ring-buffer: Read and write to ring buffers with custom sub buffer size  [Tzvetomir Stoyanov (VMware)]

As the size of the ring sub buffer page can be changed dynamically, the
logic that reads and writes to the buffer should be fixed to take that
into account. Some internal ring buffer APIs are changed:

  ring_buffer_alloc_read_page()
  ring_buffer_free_read_page()
  ring_buffer_read_page()

A new API is introduced:

  ring_buffer_read_page_data()

Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-6-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.875145995@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
[ Fixed kerneldoc on data_page parameter in ring_buffer_free_read_page() ]
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-20  ring-buffer: Set new size of the ring buffer sub page  [Tzvetomir Stoyanov (VMware)]

There are two approaches when changing the size of the ring buffer sub
page:

 1. Destroying all pages and allocating new pages with the new size.
 2. Allocating new pages, copying the content of the old pages before
    destroying them.

The first approach is easier, it is selected in the proposed
implementation. Changing the ring buffer sub page size is supposed to not
happen frequently. Usually, that size should be set only once, when the
buffer is not in use yet and is supposed to be empty.

Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-5-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.588995543@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-20  ring-buffer: Add interface for configuring trace sub buffer size  [Tzvetomir Stoyanov (VMware)]

The trace ring buffer sub page size can be configured, per trace instance.
A new ftrace file "buffer_subbuf_order" is added to get and set the size
of the ring buffer sub page for the current trace instance. The size must
be an order of system page size, that's why the new interface works with
system page order, instead of absolute page size: 0 means the ring buffer
sub page is equal to 1 system page and so forth:

  0 - 1 system page
  1 - 2 system pages
  2 - 4 system pages
  ...

The ring buffer sub page size is limited between 1 and 128 system pages.
The default value is 1 system page.

New ring buffer APIs are introduced:

  ring_buffer_subbuf_order_set()
  ring_buffer_subbuf_order_get()
  ring_buffer_subbuf_size_get()

Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-4-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.298324722@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
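For illustration, the order-to-size mapping sketched as a tiny program (a
4K system page is assumed; the function name is hypothetical):

    #include <stdio.h>

    #define PAGE_SIZE 4096L     /* assumed system page size */

    /* order 0 = 1 system page, order n = 2^n pages, capped at 128 pages. */
    static long subbuf_size_from_order(int order)
    {
            if (order < 0 || (1L << order) > 128)
                    return -1;          /* outside the 1..128 page range */
            return PAGE_SIZE << order;
    }

    int main(void)
    {
            for (int order = 0; order <= 7; order++)
                    printf("order %d -> %ld bytes\n",
                           order, subbuf_size_from_order(order));
            return 0;
    }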
2023-12-20  ring-buffer: Page size per ring buffer  [Tzvetomir Stoyanov (VMware)]
Currently the size of one sub buffer page is global for all buffers and it is hard coded to one system page. In order to introduce configurable ring buffer sub page size, the internal logic should be refactored to work with sub page size per ring buffer. Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-3-tz.stoyanov@gmail.com Link: https://lore.kernel.org/linux-trace-kernel/20231219185628.009147038@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-20  ring-buffer: Have ring_buffer_print_page_header() be able to access ring_buffer_iter  [Tzvetomir Stoyanov (VMware)]

In order to introduce sub-buffer size per ring buffer, some internal
refactoring is needed. As ring_buffer_print_page_header() will depend on
the trace_buffer structure, it is moved after the structure definition.

Link: https://lore.kernel.org/linux-trace-devel/20211213094825.61876-2-tz.stoyanov@gmail.com
Link: https://lore.kernel.org/linux-trace-kernel/20231219185627.723857541@goodmis.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Tzvetomir Stoyanov (VMware) <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-19  ring-buffer: Check if absolute timestamp goes backwards  [Steven Rostedt (Google)]

The check_buffer() which checks the timestamps of the ring buffer
sub-buffer page, when enabled, only checks if the adding of deltas of the
events from the last absolute timestamp or the timestamp of the sub-buffer
page adds up to the current event. What it does not check is if the
absolute timestamp causes the time of the events to go backwards, as that
can cause issues elsewhere. Test for the timestamp going backwards too.

This also fixes a slight issue where if the warning triggers at boot up
(because of the resetting of the tsc), it will disable all further checks,
even those that are after boot. Have it continue checking if the warning
was ignored during boot up.

Link: https://lore.kernel.org/linux-trace-kernel/20231219074732.18b092d4@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-19  ring-buffer: Add interrupt information to dump of data sub-buffer  [Steven Rostedt (Google)]

When the ring buffer timestamp verifier triggers, it dumps the content of
the sub-buffer. But currently it only dumps the timestamps and the offset
of the data as well as the deltas. It would be even more informative if
the event data also showed the interrupt context level it was in.

That is, if each event showed that the event was written in normal,
softirq, irq or NMI context, it would give a better idea of how the events
may have interrupted each other.

As the payload of the ring buffer is really a black box of the ring
buffer, just assume that if the payload is larger than a trace entry, that
it is a trace entry. As trace entries have the interrupt context
information saved in a flags field, look at that location and report the
output of the flags.

If the payload is not a trace entry, there's no way to really know, and
the information will be garbage. But that's OK, because this is for
debugging only (this output is not used in production as the buffer check
that calls it causes a huge overhead to the tracing). This information,
when available, is crucial for debugging timestamp issues. If it's
garbage, it will also be pretty obvious that it's garbage.

As this output usually happens in kselftests of the tracing code, the user
will know what the payload is at the time.

Link: https://lore.kernel.org/linux-trace-kernel/20231219074542.6f304601@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-19  ring-buffer: Remove 32bit timestamp logic  [Steven Rostedt (Google)]

Each event has a 27 bit timestamp delta that is used to hold the delta
from the last event. If the time between events is greater than 2^27, then
a timestamp is added that holds a 59 bit absolute timestamp.

Until a389d86f7fd09 ("ring-buffer: Have nested events still record running
time stamp"), if an interrupt interrupted an event in progress, all the
events delta would be zero to not deal with the races that need to be
handled. The commit a389d86f7fd09 changed that to handle the races so that
all events, even those that preempt other events, still have an accurate
timestamp.

To handle those races requires performing 64-bit cmpxchg on the
timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
very slow. To try to deal with this the timestamp logic was broken into
two and then three 32-bit cmpxchgs, with the thought that two (or three)
32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
architectures.

Part of the problem with this is that I didn't have any 32-bit
architectures to test on. After hitting several subtle bugs in this code,
an effort was made to try and see if three 32-bit cmpxchgs are indeed
faster than a single 64-bit. After a few people brushed off the dust of
their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
was faster than a single 64-bit, it was in the order of 50% at best, not
300%.

After some more refactoring of the code, all 4 64-bit cmpxchg were
removed:

 https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home
 https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home
 https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home
 https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home/

With all the 64-bit cmpxchg removed, the complex 32-bit workaround can
also be removed. The 32-bit and 64-bit logic is now exactly the same.

Link: https://lore.kernel.org/all/20231213214632.15047c40@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231219074303.28f9abda@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-18  tracing: Allow for max buffer data size trace_marker writes  [Steven Rostedt (Google)]
Allow a trace write to be as big as the ring buffer tracing data will allow. Currently, it only allows writes of 1KB in size, but there's no reason that it cannot allow what the ring buffer can hold. Link: https://lore.kernel.org/linux-trace-kernel/20231212131901.5f501e72@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-18  ring-buffer: Add offset of events in dump on mismatch  [Steven Rostedt (Google)]

On bugs that have the ring buffer timestamp get out of sync, the config
CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS checks for it and, if a mismatch
is detected, causes a dump of the bad sub buffer. It shows each event and
their timestamp as well as the delta in the event. But it's also good to
see the offset into the subbuffer for that event, to know how close to the
end it is.

Also print where the last event actually ended compared to where it was
expected to end.

Link: https://lore.kernel.org/linux-trace-kernel/20231211131623.59eaebd2@gandalf.local.home
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-18  ring-buffer: Fix slowpath of interrupted event  [Steven Rostedt (Google)]

To synchronize the timestamps with the ring buffer reservation, there are
two timestamps that are saved in the buffer meta data.

 1. before_stamp
 2. write_stamp

When the two are equal, the write_stamp is considered valid, as in, it may
be used to calculate the delta of the next event as the write_stamp is the
timestamp of the previous reserved event on the buffer.

This is done by the following:

 /*A*/  w = current position on the ring buffer
        before = before_stamp
        after = write_stamp
        ts = read current timestamp

        if (before != after) {
                write_stamp is not valid, force adding an absolute
                timestamp.
        }

 /*B*/  before_stamp = ts

 /*C*/  write = local_add_return(event length, position on ring buffer)

        if (w == write - event length) {
                /* Nothing interrupted between A and C */
 /*E*/          write_stamp = ts;
                delta = ts - after
                /*
                 * If nothing interrupted again,
                 * before_stamp == write_stamp and write_stamp
                 * can be used to calculate the delta for
                 * events that come in after this one.
                 */
        } else {
                /*
                 * The slow path!
                 * Was interrupted between A and C.
                 */

This is the place that there's a bug. We currently have:

                after = write_stamp
                ts = read current timestamp

 /*F*/          if (write == current position on the ring buffer &&
                    after < ts && cmpxchg(write_stamp, after, ts)) {

                        delta = ts - after;

                } else {
                        delta = 0;
                }

The assumption is that if the current position on the ring buffer hasn't
moved between C and F, then it also was not interrupted, and that the last
event written has a timestamp that matches the write_stamp. That is the
write_stamp is valid.

But this may not be the case:

If a task context event was interrupted by softirq between B and C.

And the softirq wrote an event that got interrupted by a hard irq between
C and E.

and the hard irq wrote an event (does not need to be interrupted)

We have:

 /*B*/ before_stamp = ts of normal context

   ---> interrupted by softirq

        /*B*/ before_stamp = ts of softirq context

          ---> interrupted by hardirq

                /*B*/ before_stamp = ts of hard irq context
                /*E*/ write_stamp = ts of hard irq context

                /* matches and write_stamp valid */
          <----

        /*E*/ write_stamp = ts of softirq context

        /* No longer matches before_stamp, write_stamp is not valid! */

   <---

        w != write - length, go to slow path

// Right now the order of events in the ring buffer is:
//
// |-- softirq event --|-- hard irq event --|-- normal context event --|
//

        after = write_stamp (this is the ts of softirq)
        ts = read current timestamp

        if (write == current position on the ring buffer [true] &&
            after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {

                delta = ts - after  [Wrong!]

The delta is to be between the hard irq event and the normal context
event, but the above logic made the delta between the softirq event and
the normal context event, where the hard irq event is between the two.
This will shift all the remaining event timestamps on the sub-buffer
incorrectly.

The write_stamp is only valid if it matches the before_stamp. The cmpxchg
does nothing to help this.

Instead, the following logic can be done to fix this:

        before = before_stamp
        ts = read current timestamp
        before_stamp = ts

        after = write_stamp

        if (write == current position on the ring buffer &&
            after == before && after < ts) {

                delta = ts - after

        } else {

                delta = 0;
        }

The above will only use the write_stamp if it still matches before_stamp
and was tested to not have changed since C.

As a bonus, with this logic we do not need any 64-bit cmpxchg() at all!
This means the 32-bit rb_time_t workaround can finally be removed. But
that's for a later time.

Link: https://lore.kernel.org/linux-trace-kernel/20231218175229.58ec3daf@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b081@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: dd93942570789 ("ring-buffer: Do not try to put back write_stamp")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15  ring-buffer: Do not record in NMI if the arch does not support cmpxchg in NMI  [Steven Rostedt (Google)]
As the ring buffer recording requires cmpxchg() to work, if the architecture does not support cmpxchg in NMI, then do not do any recording within an NMI. Link: https://lore.kernel.org/linux-trace-kernel/20231213175403.6fc18540@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15  ring-buffer: Have rb_time_cmpxchg() set the msb counter too  [Steven Rostedt (Google)]

The rb_time_cmpxchg() on 32-bit architectures requires setting three
32-bit words to represent the 64-bit timestamp, with some salt for
synchronization. Those are: msb, top, and bottom.

The issue is, the rb_time_cmpxchg() did not properly salt the msb portion,
and the msb that was written was stale.

Link: https://lore.kernel.org/linux-trace-kernel/20231215084114.20899342@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: f03f2abce4f39 ("ring-buffer: Have 32 bit time stamps use all 64 bits")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15  ring-buffer: Fix 32-bit rb_time_read() race with rb_time_cmpxchg()  [Mathieu Desnoyers]

The following race can cause rb_time_read() to observe a corrupted time
stamp:

 rb_time_cmpxchg()
 [...]
        if (!rb_time_read_cmpxchg(&t->msb, msb, msb2))
                return false;
        if (!rb_time_read_cmpxchg(&t->top, top, top2))
                return false;
 <interrupted before updating bottom>

 __rb_time_read()
 [...]
        do {
                c = local_read(&t->cnt);
                top = local_read(&t->top);
                bottom = local_read(&t->bottom);
                msb = local_read(&t->msb);
        } while (c != local_read(&t->cnt));

        *cnt = rb_time_cnt(top);

        /* If top and msb counts don't match, this interrupted a write */
        if (*cnt != rb_time_cnt(msb))
                return false;

 ^ this check fails to catch that "bottom" is still not updated.

So the old "bottom" value is returned, which is wrong.

Fix this by checking that all three of msb, top, and bottom 2-bit cnt
values match.

The reason to favor checking all three fields over requiring a specific
update order for both rb_time_set() and rb_time_cmpxchg() is because
checking all three fields is more robust to handle partial failures of
rb_time_cmpxchg() when interrupted by nested rb_time_set().

Link: https://lore.kernel.org/lkml/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212193049.680122-1-mathieu.desnoyers@efficios.com
Fixes: f458a1453424e ("ring-buffer: Test last update in 32bit version of __rb_time_read()")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15  ring-buffer: Fix a race in rb_time_cmpxchg() for 32 bit archs  [Steven Rostedt (Google)]

Mathieu Desnoyers pointed out an issue in the rb_time_cmpxchg() for 32 bit
architectures. That is:

 static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 set)
 {
        unsigned long cnt, top, bottom, msb;
        unsigned long cnt2, top2, bottom2, msb2;
        u64 val;

        /* The cmpxchg always fails if it interrupted an update */
        if (!__rb_time_read(t, &val, &cnt2))
                return false;

        if (val != expect)
                return false;

 <<<< interrupted here!

        cnt = local_read(&t->cnt);

The problem is that the synchronization counter in the rb_time_t is read
*after* the value of the timestamp is read. That means if an interrupt
were to come in between the value being read and the counter being read,
it can change the value and the counter and the interrupted process would
be clueless about it!

The counter needs to be read first and then the value. That way it is easy
to tell if the value is stale or not. If the counter hasn't been updated,
then the value is still good.

Link: https://lore.kernel.org/linux-trace-kernel/20231211201324.652870-1-mathieu.desnoyers@efficios.com/
Link: https://lore.kernel.org/linux-trace-kernel/20231212115301.7a9c9a64@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Fixes: 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit")
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15  ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()  [Steven Rostedt (Google)]

When filtering is enabled, a temporary buffer is created to place the
content of the trace event output so that the filter logic can decide from
the trace event output if the trace event should be filtered out or not.
If it is to be filtered out, the content in the temporary buffer is simply
discarded, otherwise it is written into the trace buffer.

But if an interrupt were to come in while a previous event was using that
temporary buffer, the event written by the interrupt would actually go
into the ring buffer itself to prevent corrupting the data on the
temporary buffer. If the event is to be filtered out, the event in the
ring buffer is discarded, or if it fails to discard because another event
has already come in, it is turned into padding.

The update to the write_stamp in the rb_try_to_discard() happens after a
fix was made to force the next event after the discard to use an absolute
timestamp by setting the before_stamp to zero so it does not match the
write_stamp (which causes an event to use the absolute timestamp).

But there's an effort in rb_try_to_discard() to put back the write_stamp
to what it was before the event was added. This is useless and wasteful
because nothing is going to be using that write_stamp for calculations as
it still will not match the before_stamp.

Remove this useless update, and in doing so, we remove another cmpxchg64!

Also update the comments to reflect this change as well as remove some
extra white space in another comment.

Link: https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f38fe@rorschach.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes: b2dd797543cf ("ring-buffer: Force absolute timestamp on discard of event")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-15  ring-buffer: Do not try to put back write_stamp  [Steven Rostedt (Google)]

If an update to an event is interrupted by another event between the time
the initial event allocated its buffer and where it wrote to the
write_stamp, the code tries to reset the write_stamp back to what it had
just overwritten. It knows that it was overwritten via checking the
before_stamp, and if it didn't match what it wrote to the before_stamp
before it allocated its space, it knows it was overwritten.

To put back the write_stamp, it uses the before_stamp it read. The problem
here is that by writing the before_stamp to the write_stamp it makes the
two equal again, which means that the write_stamp can be considered valid
as the last timestamp written to the ring buffer. But this is not
necessarily true. The event that interrupted the event could have been
interrupted in a way that it was interrupted as well, and can end up
leaving with an invalid write_stamp. But if this happens and returns to
this context that uses the before_stamp to update the write_stamp again,
it can possibly incorrectly make it valid, causing later events to have
incorrect time stamps.

As it is OK to leave this function with an invalid write_stamp (one that
doesn't match the before_stamp), there's no reason to try to make it valid
again in this case. If this race happens, then just leave with the invalid
write_stamp and the next event to come along will just add an absolute
timestamp and validate everything again.

Bonus points: This gets rid of another cmpxchg64!

Link: https://lore.kernel.org/linux-trace-kernel/20231214222921.193037a7@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12  ring-buffer: Have saved event hold the entire event  [Steven Rostedt (Google)]
For the ring buffer iterator (non-consuming read), the event needs to be copied into the iterator buffer to make sure that a writer does not overwrite it while the user is reading it. If a write happens during the copy, the buffer is simply discarded. But the temp buffer itself was not big enough. The allocation of the buffer was only BUF_MAX_DATA_SIZE, which is the maximum data size that can be passed into the ring buffer and saved. But the temp buffer needs to hold the meta data as well. That would be BUF_PAGE_SIZE and not BUF_MAX_DATA_SIZE. Link: https://lore.kernel.org/linux-trace-kernel/20231212072558.61f76493@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 785888c544e04 ("ring-buffer: Have rb_iter_head_event() handle concurrent writer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
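A minimal sketch of the sizing fix; the constants are hypothetical
stand-ins for the kernel's BUF_PAGE_SIZE / BUF_MAX_DATA_SIZE relationship:

    #include <stdlib.h>

    #define BUF_PAGE_SIZE      4096                  /* assumed sub-buffer size */
    #define BUF_MAX_DATA_SIZE  (BUF_PAGE_SIZE - 44)  /* assumed: page minus headers */

    /* The iterator's copy target must hold meta data plus payload, i.e.
     * up to a full sub-buffer, not just the maximum payload. */
    static void *alloc_iter_event_buffer(void)
    {
            return malloc(BUF_PAGE_SIZE);   /* was BUF_MAX_DATA_SIZE: too small */
    }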
2023-12-12  ring-buffer: Do not update before stamp when switching sub-buffers  [Steven Rostedt (Google)]

The ring buffer timestamps are synchronized by two timestamp placeholders.
One is the "before_stamp" and the other is the "write_stamp" (sometimes
referred to as the "after stamp", but only in the comments). These two
stamps are key to knowing how to handle nested events coming in with a
lockless system.

When moving across sub-buffers, the before stamp is updated but the write
stamp is not. There's an effort to put back the before stamp to something
that seems logical in case there's nested events. But as the current event
is about to cross sub-buffers, and so will any new nested event that
happens, updating the before stamp is useless, and could even introduce
new race conditions.

The first event on a sub-buffer simply uses the sub-buffer's timestamp and
keeps a "delta" of zero. The "before_stamp" and "write_stamp" are not used
in the algorithm in this case. There's no reason to try to fix the
before_stamp when this happens.

As a bonus, it removes a cmpxchg() when crossing sub-buffers!

Link: https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: a389d86f7fd09 ("ring-buffer: Have nested events still record running time stamp")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12  ring-buffer: Fix memory leak of free page  [Steven Rostedt (Google)]

Reading the ring buffer does a swap of a sub-buffer within the ring buffer
with an empty sub-buffer. This allows the reader to have full access to
the content of the sub-buffer that was swapped out without having to worry
about contention with the writer.

The readers call ring_buffer_alloc_read_page() to allocate a page that
will be used to swap with the ring buffer. When the code is finished with
the reader page, it calls ring_buffer_free_read_page(). Instead of freeing
the page, it stores it as a spare. The next call to
ring_buffer_alloc_read_page() will return this spare instead of calling
into the memory management system to allocate a new page.

Unfortunately, on freeing of the ring buffer, this spare page is not
freed, and causes a memory leak.

Link: https://lore.kernel.org/linux-trace-kernel/20231210221250.7b9cc83c@rorschach.local.home
Cc: stable@vger.kernel.org
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 73a757e63114d ("ring-buffer: Return reader page back into existing ring buffer")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-12-12ring-buffer: Fix writing to the buffer with max_data_sizeSteven Rostedt (Google)
The maximum ring buffer data size is the maximum size of data that can be recorded on the ring buffer. Events must be smaller than the sub buffer data size minus any meta data. This size is checked before trying to allocate from the ring buffer because the allocation assumes that the size will fit on the sub buffer. The maximum size was calculated as the size of a sub buffer page (which is currently PAGE_SIZE minus the sub buffer header) minus the size of the meta data of an individual event. But it missed the possible addition of a time stamp for events that are added long enough apart that the event meta data can't hold the time delta. When an event is added that is greater than the current BUF_MAX_DATA_SIZE minus the size of a time stamp, but still less than or equal to BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking for a page that can hold the event. Luckily, there's a check for this loop: after 1000 iterations a warning is emitted and the ring buffer is disabled. But this should never happen. This can happen when a large event is added first, or after a long period where an absolute timestamp is prefixed to the event, increasing its size by 8 bytes. This passes the check and then goes into the algorithm that causes the infinite loop. For events that are the first event on the sub-buffer, there is no need to add a timestamp, because the sub-buffer itself contains an absolute timestamp, and adding one is redundant. The fix is to check if the event is to be the first event on the sub-buffer, and if it is, then do not add a timestamp. This also fixes the 32-bit case of adding a timestamp when a read of before_stamp or write_stamp is interrupted. There's still no need to add that timestamp if the event is going to be the first event on the sub buffer. Also, if the buffer has "time_stamp_abs" set, then also check if the length plus the timestamp is greater than BUF_MAX_DATA_SIZE. Link: https://lore.kernel.org/all/20231212104549.58863438@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20231212071837.5fdd6c13@gandalf.local.home Link: https://lore.kernel.org/linux-trace-kernel/20231212111617.39e02849@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: a4543a2fa9ef3 ("ring-buffer: Get timestamp after event is allocated") Fixes: 58fbc3c63275c ("ring-buffer: Consolidate add_timestamp to remove some branches") Reported-by: Kent Overstreet <kent.overstreet@linux.dev> # (on IRC) Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
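A sketch of the two checks described above (macro and field names assumed, placement simplified):

    /* reject events that cannot fit once a timestamp is added */
    if (unlikely(length > BUF_MAX_DATA_SIZE ||
                 (buffer->time_stamp_abs &&
                  length + RB_LEN_TIME_EXTEND > BUF_MAX_DATA_SIZE)))
        goto out_fail;

    /* first event on the sub-buffer needs no per-event timestamp;
     * the sub-buffer itself holds the absolute time stamp */
    if (!w)
        add_timestamp = RB_ADD_STAMP_NONE;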
2023-12-06ring-buffer: Test last update in 32bit version of __rb_time_read()Steven Rostedt (Google)
Since 64-bit cmpxchg() is very expensive on 32-bit architectures, the timestamp used by the ring buffer does some interesting tricks to be able to still have an atomic 64-bit number. It originally just used 60 bits and broke it up into two 32-bit words where the extra 2 bits were used for synchronization. But this was not enough for all use cases, and all 64 bits were required. The 32-bit version of the ring buffer timestamp was then broken up into three 32-bit words using the same counter trick. But one update was not done. The check to see if the read operation was done without interruption only checked the first two words and not the last one (like it had before this update). Fix it by making sure all three updates happen without interruption by comparing the initial counter with the last updated counter. Link: https://lore.kernel.org/linux-trace-kernel/20231206100050.3100b7bb@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: f03f2abce4f39 ("ring-buffer: Have 32 bit time stamps use all 64 bits") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
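A seqcount-style sketch of the corrected read loop (the real code packs 2-bit counters into each 32-bit word; names and bit layout here are simplified assumptions):

    do {
        c      = local_read(&t->cnt);
        top    = local_read(&t->top);
        bottom = local_read(&t->bottom);
        msb    = local_read(&t->msb);   /* third word, previously unchecked */
    } while (c != local_read(&t->cnt)); /* retry if an update slipped in */

    *ret = ((u64)msb << 60) | ((u64)top << 30) | bottom;  /* layout assumed */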
2023-12-06ring-buffer: Force absolute timestamp on discard of eventSteven Rostedt (Google)
There's a race where, if an event is discarded from the ring buffer and an interrupt happens at that time and inserts an event, the timestamp from the discarded event is still used as an offset. This can screw up the timings. If the event is going to be discarded, set the "before_stamp" to zero. When a new event comes in, it compares the "before_stamp" with the "write_stamp" and if they are not equal, it will insert an absolute timestamp. This will prevent the timings from getting out of sync due to the discarded event. Link: https://lore.kernel.org/linux-trace-kernel/20231206100244.5130f9b3@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 6f6be606e763f ("ring-buffer: Force before_stamp and write_stamp to be different on discard") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
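A sketch of both halves (simplified from the description above, not verbatim kernel code):

    /* discard path: poison before_stamp */
    rb_time_set(&cpu_buffer->before_stamp, 0);

    /* next event reserve: a mismatch forces an absolute timestamp */
    rb_time_read(&cpu_buffer->before_stamp, &before);
    rb_time_read(&cpu_buffer->write_stamp, &write);
    if (before != write)
        add_timestamp |= RB_ADD_STAMP_FORCE | RB_ADD_STAMP_ABSOLUTE;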
2023-10-03ring_buffer: Use try_cmpxchg instead of cmpxchg in rb_insert_pagesUros Bizjak
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in rb_insert_pages. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). No functional change intended. Link: https://lore.kernel.org/linux-trace-kernel/20230914163420.12923-1-ubizjak@gmail.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
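The transformation, in generic sketch form:

    /* before: a separate compare against the old value */
    if (cmpxchg(ptr, old, new) == old)
        success = true;

    /* after: try_cmpxchg() returns the comparison result directly
     * (the CPU's ZF from CMPXCHG on x86) and, on failure, updates
     * 'old' with the value actually found */
    if (try_cmpxchg(ptr, &old, new))
        success = true;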
2023-09-30ring-buffer: Update "shortest_full" in pollingSteven Rostedt (Google)
It was discovered that the ring buffer polling was incorrectly stating that read would not block, but that's because polling did not take into account that reads will block if the "buffer-percent" was set. Instead, the ring buffer polling would say reads would not block if there was any data in the ring buffer. This was incorrect behavior from a user space point of view. This was fixed by commit 42fb0a1e84ff by having the polling code check if the ring buffer had more data than what the user specified "buffer percent" had. The problem now is that the polling code did not register itself to the writer that it wanted to wait for a specific "full" value of the ring buffer. The result was that the writer would wake the polling waiter whenever there was a new event. The polling waiter would then wake up, see that there's not enough data in the ring buffer to notify user space and then go back to sleep. The next event would wake it up again. Before the polling fix was added, the code would wake up around 100 times for a hackbench 30 benchmark. After the "fix", due to the writer constantly waking it, it would wake up over 11,000 times! It would never leave the kernel, so the user space behavior was still "correct", but this definitely is not the desired effect. To fix this, have the polling code add what it's waiting for to the "shortest_full" variable, to tell the writer not to wake it up if the buffer is not as full as it expects to be. Note, after this fix, it appears that the waiter is now woken up around 2x the times it was before (~200). This is a tremendous improvement from the 11,000 times, but I will need to spend some time to see why polling is more aggressive in its wakeups than the read blocking code. Link: https://lore.kernel.org/linux-trace-kernel/20230929180113.01c2cae3@rorschach.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Fixes: 42fb0a1e84ff ("tracing/ring-buffer: Have polling block on watermark") Reported-by: Julia Lawall <julia.lawall@inria.fr> Tested-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
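A sketch of the registration added to the poll path (simplified):

    if (full) {
        poll_wait(filp, &rbwork->full_waiters, poll_table);
        /* record the watermark so the writer skips wakeups
         * until the buffer is at least this full */
        if (!cpu_buffer->shortest_full ||
            cpu_buffer->shortest_full > full)
            cpu_buffer->shortest_full = full;
    }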
2023-09-22ring-buffer: Fix bytes info in per_cpu buffer statsZheng Yejian
The 'bytes' info in the file 'per_cpu/cpu<X>/stats' means the number of bytes in the cpu buffer that have not been consumed. However, currently after consuming data by reading the file 'trace_pipe', the 'bytes' info was not changed as expected. # cat per_cpu/cpu0/stats entries: 0 overrun: 0 commit overrun: 0 bytes: 568 <--- 'bytes' is problematic !!! oldest event ts: 8651.371479 now ts: 8653.912224 dropped events: 0 read events: 8 The root cause is incorrect accounting of cpu_buffer->read_bytes. To fix it: 1. When accounting 'read_bytes', account for the consumed event in rb_advance_reader(); 2. When accounting 'entries_bytes', exclude the discarded padding event that is smaller than the minimum size, because it is invisible to the reader. Then use rb_page_commit() instead of BUF_PAGE_SIZE when accounting for page-based read/remove/overrun. Also correct the comments of ring_buffer_bytes_cpu() in this patch. Link: https://lore.kernel.org/linux-trace-kernel/20230921125425.1708423-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: c64e148a3be3 ("trace: Add ring buffer stats to measure rate of events") Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
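A sketch of item 1 (exact placement within the function assumed):

    static void rb_advance_reader(struct ring_buffer_per_cpu *cpu_buffer)
    {
        struct ring_buffer_event *event = rb_reader_event(cpu_buffer);

        /* account the consumed event so 'bytes' in the stats
         * file shrinks as trace_pipe drains the buffer */
        cpu_buffer->read_bytes += rb_event_length(event);

        /* existing reader-page advance logic elided */
    }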
2023-09-08ring-buffer: Do not attempt to read past "commit"Steven Rostedt (Google)
When iterating over the ring buffer while the ring buffer is active, the writer can corrupt the reader. There are barriers to help detect this and handle it, but that code missed the case where the last event was at the very end of the page and has only 4 bytes left. The checks that detect corruption of reads by the writer need to see the length of the event. If the length in the first 4 bytes is zero then the length is stored in the second 4 bytes. But if the writer is in the process of updating that event, there's a small window where the length in the first 4 bytes could be zero even though the length is only 4 bytes. That will cause rb_event_length() to read the next 4 bytes, which could happen to be off the allocated page. To protect against this, fail immediately if the next event pointer is less than 8 bytes from the end of the commit (last byte of data), as all events must be a minimum of 8 bytes anyway. Link: https://lore.kernel.org/all/20230905141245.26470-1-Tze-nan.Wu@mediatek.com/ Link: https://lore.kernel.org/linux-trace-kernel/20230907122820.0899019c@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Reported-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
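The added guard, in sketch form (names approximate):

    /* 'commit' is the last byte of data on the page; every event
     * is at least 8 bytes, so anything closer than that to the
     * commit cannot be a complete event header */
    if (iter->head + 8 > commit)
        goto reset;  /* treat as a writer race and restart */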
2023-09-07ring-buffer: Avoid softlockup in ring_buffer_resize()Zheng Yejian
When a user resizes all trace ring buffers through the file 'buffer_size_kb', ring_buffer_resize() allocates buffer pages for each cpu in a loop. If the kernel preemption model is PREEMPT_NONE and there are many cpus and many buffer pages to be allocated, the loop may not give up the cpu for a long time and finally cause a softlockup. To avoid it, call cond_resched() after each cpu buffer allocation. Link: https://lore.kernel.org/linux-trace-kernel/20230906081930.3939106-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
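The fix, in sketch form (allocation helper name and signature assumed):

    for_each_buffer_cpu(buffer, cpu) {
        /* allocate this CPU's new buffer pages */
        if (__rb_allocate_pages(cpu_buffer, nr_pages, &pages))
            goto out_err;
        /* yield so a large resize cannot monopolize the CPU
         * under PREEMPT_NONE and trip the softlockup detector */
        cond_resched();
    }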
2023-07-30ring_buffer: Use try_cmpxchg instead of cmpxchgUros Bizjak
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in ring_buffer.c. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). No functional change intended. Link: https://lore.kernel.org/linux-trace-kernel/20230714154418.8884-1-ubizjak@gmail.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-07-28ring-buffer: Fix kernel-doc warnings in ring_buffer.cGaosheng Cui
Fix kernel-doc warnings: kernel/trace/ring_buffer.c:954: warning: Function parameter or member 'cpu' not described in 'ring_buffer_wake_waiters' kernel/trace/ring_buffer.c:3383: warning: Excess function parameter 'event' description in 'ring_buffer_unlock_commit' kernel/trace/ring_buffer.c:5359: warning: Excess function parameter 'cpu' description in 'ring_buffer_reset_online_cpus' Link: https://lkml.kernel.org/r/20230724140827.1023266-2-cuigaosheng1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
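For instance, the first warning goes away once the parameter is described (sketch of the kernel-doc style):

    /**
     * ring_buffer_wake_waiters - wake up any waiters on this ring buffer
     * @buffer: The ring buffer to wake waiters on
     * @cpu: The CPU buffer to wake waiters on
     */
    void ring_buffer_wake_waiters(struct trace_buffer *buffer, int cpu)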