-rw-r--r--  Documentation/ABI/stable/sysfs-bus-mhi | 10
-rw-r--r--  Documentation/ABI/stable/sysfs-class-infiniband | 64
-rw-r--r--  Documentation/ABI/stable/sysfs-class-tpm | 2
-rw-r--r--  Documentation/ABI/stable/sysfs-devices | 7
-rw-r--r--  Documentation/ABI/stable/sysfs-devices-node | 45
-rw-r--r--  Documentation/ABI/stable/sysfs-devices-system-cpu | 19
-rw-r--r--  Documentation/ABI/stable/sysfs-driver-dma-idxd | 150
-rw-r--r--  Documentation/ABI/stable/sysfs-driver-firmware-zynqmp | 141
-rw-r--r--  Documentation/ABI/stable/sysfs-driver-mlxreg-io | 475
-rw-r--r--  Documentation/ABI/stable/sysfs-driver-speakup | 9
-rw-r--r--  Documentation/ABI/stable/sysfs-hypervisor-xen | 13
-rw-r--r--  Documentation/ABI/stable/sysfs-module | 25
-rw-r--r--  Documentation/ABI/testing/configfs-usb-gadget | 13
-rw-r--r--  Documentation/ABI/testing/configfs-usb-gadget-mass-storage | 10
-rw-r--r--  Documentation/ABI/testing/configfs-usb-gadget-uac1 | 43
-rw-r--r--  Documentation/ABI/testing/configfs-usb-gadget-uac2 | 48
-rw-r--r--  Documentation/ABI/testing/configfs-usb-gadget-uvc | 66
-rw-r--r--  Documentation/ABI/testing/debugfs-cros-ec | 22
-rw-r--r--  Documentation/ABI/testing/debugfs-cxl | 35
-rw-r--r--  Documentation/ABI/testing/debugfs-dell-wmi-ddv | 21
-rw-r--r--  Documentation/ABI/testing/debugfs-driver-dcc | 127
-rw-r--r--  Documentation/ABI/testing/debugfs-driver-habanalabs | 112
-rw-r--r--  Documentation/ABI/testing/debugfs-hisi-hpre | 192
-rw-r--r--  Documentation/ABI/testing/debugfs-hisi-sec | 160
-rw-r--r--  Documentation/ABI/testing/debugfs-hisi-zip | 160
-rw-r--r--  Documentation/ABI/testing/debugfs-scmi | 70
-rw-r--r--  Documentation/ABI/testing/debugfs-scmi-raw | 117
-rw-r--r--  Documentation/ABI/testing/evm | 5
-rw-r--r--  Documentation/ABI/testing/ima_policy | 57
-rw-r--r--  Documentation/ABI/testing/procfs-smaps_rollup | 1
-rw-r--r--  Documentation/ABI/testing/pstore | 3
-rw-r--r--  Documentation/ABI/testing/securityfs-secrets-coco | 51
-rw-r--r--  Documentation/ABI/testing/sysfs-amd-pmc | 13
-rw-r--r--  Documentation/ABI/testing/sysfs-amd-pmf | 13
-rw-r--r--  Documentation/ABI/testing/sysfs-ata | 13
-rw-r--r--  Documentation/ABI/testing/sysfs-block | 330
-rw-r--r--  Documentation/ABI/testing/sysfs-block-zram | 14
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-bcma | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-cdx | 56
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-coreboot | 45
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-coresight-devices-etm3x | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-coresight-devices-etm4x | 8
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-coresight-devices-tpdm | 13
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-coresight-devices-ultra_smb | 31
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-counter | 105
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-css | 15
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-cxl | 376
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-event_source-devices-caps | 18
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-event_source-devices-iommu | 37
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-fcoe | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-fsi-devices-sbefifo | 10
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio | 220
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-adc-ad4130 | 46
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-adc-ad7280a | 13
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-adc-max11410 | 13
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-bno055 | 81
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-cdc-ad7746 | 11
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-chemical-sunrise-co2 | 38
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-dac-ltc2688 | 86
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-filter-admv8818 | 16
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-frequency-admv1013 | 38
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-frequency-admv1014 | 23
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-proximity | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-scd30 | 34
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-sx9324 | 29
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-temperature-max31856 | 31
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-thermocouple | 18
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-timer-stm32 | 8
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-iio-vf610 | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-mdio | 9
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-nvdimm | 49
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-papr-pmem | 12
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-pci | 78
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-pci-drivers-xhci_hcd | 52
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-peci | 16
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-platform | 12
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro | 325
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-platform-devices-occ-hwmon | 13
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-platform-onboard-usb-hub | 8
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-rapidio | 32
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-soundwire-master | 20
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-soundwire-slave | 62
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-spi-devices-spi-nor | 6
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-surface_aggregator-tabletsw | 57
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-thunderbolt | 14
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-usb | 340
-rw-r--r--  Documentation/ABI/testing/sysfs-bus-vdpa | 57
-rw-r--r--  Documentation/ABI/testing/sysfs-class-bdi | 96
-rw-r--r--  Documentation/ABI/testing/sysfs-class-cxl | 19
-rw-r--r--  Documentation/ABI/testing/sysfs-class-devfreq-event | 12
-rw-r--r--  Documentation/ABI/testing/sysfs-class-extcon | 12
-rw-r--r--  Documentation/ABI/testing/sysfs-class-fc | 27
-rw-r--r--  Documentation/ABI/testing/sysfs-class-firmware | 77
-rw-r--r--  Documentation/ABI/testing/sysfs-class-firmware-attributes | 87
-rw-r--r--  Documentation/ABI/testing/sysfs-class-gnss | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-class-hwmon | 958
-rw-r--r--  Documentation/ABI/testing/sysfs-class-mei | 18
-rw-r--r--  Documentation/ABI/testing/sysfs-class-mic | 24
-rw-r--r--  Documentation/ABI/testing/sysfs-class-mux | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-class-net-peak_usb | 19
-rw-r--r--  Documentation/ABI/testing/sysfs-class-power | 46
-rw-r--r--  Documentation/ABI/testing/sysfs-class-power-rt9467 | 19
-rw-r--r--  Documentation/ABI/testing/sysfs-class-power-rt9471 | 32
-rw-r--r--  Documentation/ABI/testing/sysfs-class-pwm | 22
-rw-r--r--  Documentation/ABI/testing/sysfs-class-rapidio | 4
-rw-r--r--  Documentation/ABI/testing/sysfs-class-rc | 14
-rw-r--r--  Documentation/ABI/testing/sysfs-class-rc-nuvoton | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-class-regulator | 81
-rw-r--r--  Documentation/ABI/testing/sysfs-class-rtrs-client | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-class-rtrs-server | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-class-thermal | 259
-rw-r--r--  Documentation/ABI/testing/sysfs-class-typec | 10
-rw-r--r--  Documentation/ABI/testing/sysfs-class-usb_power_delivery | 249
-rw-r--r--  Documentation/ABI/testing/sysfs-class-uwb_rc | 26
-rw-r--r--  Documentation/ABI/testing/sysfs-class-uwb_rc-wusbhc | 10
-rw-r--r--  Documentation/ABI/testing/sysfs-class-vduse | 33
-rw-r--r--  Documentation/ABI/testing/sysfs-class-watchdog | 13
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-hisi_ptt | 61
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-mapping | 30
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-physical_location | 42
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-platform-ACPI-TAD | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-platform-dock | 10
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-platform-soc-ipa | 62
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-power | 36
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-removable | 8
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-soc | 14
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-state_synced | 5
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-system-cpu | 98
-rw-r--r--  Documentation/ABI/testing/sysfs-devices-vfio-dev | 8
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-aspeed-uart-routing | 27
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-bd9571mwv-regulator | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-ccp | 87
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-chromeos-acpi | 137
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-eud | 9
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-habanalabs | 40
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon | 77
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-intel-m10-bmc | 8
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-sec-update | 61
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-intel_sdsi | 90
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-qat | 49
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-typec-displayport | 15
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-uacce | 18
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-ufs | 192
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-xen-blkback | 6
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-xen-blkfront | 4
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-xilinx-tmr-manager | 16
-rw-r--r--  Documentation/ABI/testing/sysfs-driver-zynqmp-fpga | 73
-rw-r--r--  Documentation/ABI/testing/sysfs-firmware-efi-esrt | 16
-rw-r--r--  Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info | 29
-rw-r--r--  Documentation/ABI/testing/sysfs-firmware-qemu_fw_cfg | 5
-rw-r--r--  Documentation/ABI/testing/sysfs-fs-erofs | 18
-rw-r--r--  Documentation/ABI/testing/sysfs-fs-f2fs | 274
-rw-r--r--  Documentation/ABI/testing/sysfs-fs-ubifs | 35
-rw-r--r--  Documentation/ABI/testing/sysfs-kernel-address_bits | 10
-rw-r--r--  Documentation/ABI/testing/sysfs-kernel-cpu_byteorder | 12
-rw-r--r--  Documentation/ABI/testing/sysfs-kernel-iommu_groups | 1
-rw-r--r--  Documentation/ABI/testing/sysfs-kernel-livepatch | 8
-rw-r--r--  Documentation/ABI/testing/sysfs-kernel-mm-damon | 346
-rw-r--r--  Documentation/ABI/testing/sysfs-kernel-mm-ksm | 10
-rw-r--r--  Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers | 25
-rw-r--r--  Documentation/ABI/testing/sysfs-kernel-oops_count | 6
-rw-r--r--  Documentation/ABI/testing/sysfs-kernel-slab | 119
-rw-r--r--  Documentation/ABI/testing/sysfs-kernel-warn_count | 6
-rw-r--r--  Documentation/ABI/testing/sysfs-mce | 97
-rw-r--r--  Documentation/ABI/testing/sysfs-module | 7
-rw-r--r--  Documentation/ABI/testing/sysfs-platform-asus-wmi | 41
-rw-r--r--  Documentation/ABI/testing/sysfs-platform-brcmstb-memc | 15
-rw-r--r--  Documentation/ABI/testing/sysfs-platform-chipidea-usb2 | 6
-rw-r--r--  Documentation/ABI/testing/sysfs-platform-dell-privacy-wmi | 60
-rw-r--r--  Documentation/ABI/testing/sysfs-platform-dell-wmi-ddv | 7
-rw-r--r--  Documentation/ABI/testing/sysfs-platform-dptf | 4
-rw-r--r--  Documentation/ABI/testing/sysfs-platform-intel-ifs | 52
-rw-r--r--  Documentation/ABI/testing/sysfs-platform-intel-pmc | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-platform-lg-laptop | 1
-rw-r--r--  Documentation/ABI/testing/sysfs-platform-mellanox-bootctl | 7
-rw-r--r--  Documentation/ABI/testing/sysfs-platform-sst-atom | 2
-rw-r--r--  Documentation/ABI/testing/sysfs-power | 31
-rw-r--r--  Documentation/ABI/testing/sysfs-ptp | 30
-rw-r--r--  Documentation/ABI/testing/sysfs-secvar | 75
-rw-r--r--  Documentation/ABI/testing/sysfs-timecard | 288
-rw-r--r--  Documentation/ABI/testing/sysfs-tty | 32
-rw-r--r--  Documentation/COPYING-logo | 13
-rw-r--r--  Documentation/Kconfig | 33
-rw-r--r--  Documentation/Makefile | 26
-rw-r--r--  Documentation/PCI/endpoint/index.rst | 2
-rw-r--r--  Documentation/PCI/endpoint/pci-vntb-function.rst | 129
-rw-r--r--  Documentation/PCI/endpoint/pci-vntb-howto.rst | 167
-rw-r--r--  Documentation/PCI/index.rst | 6
-rw-r--r--  Documentation/PCI/msi-howto.rst | 10
-rw-r--r--  Documentation/PCI/pci-error-recovery.rst | 8
-rw-r--r--  Documentation/PCI/pci-iov-howto.rst | 7
-rw-r--r--  Documentation/PCI/pci.rst | 16
-rw-r--r--  Documentation/PCI/sysfs-pci.rst | 2
-rw-r--r--  Documentation/RCU/Design/Data-Structures/Data-Structures.rst | 2
-rw-r--r--  Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst | 8
-rw-r--r--  Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg | 4
-rw-r--r--  Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg | 4
-rw-r--r--  Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg | 4
-rw-r--r--  Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg | 4
-rw-r--r--  Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg | 4
-rw-r--r--  Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg | 4
-rw-r--r--  Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg | 4
-rw-r--r--  Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg | 4
-rw-r--r--  Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg | 4
-rw-r--r--  Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst | 71
-rw-r--r--  Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg | 36
-rw-r--r--  Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg | 62
-rw-r--r--  Documentation/RCU/Design/Requirements/Requirements.rst | 48
-rw-r--r--  Documentation/RCU/NMI-RCU.rst | 4
-rw-r--r--  Documentation/RCU/RTFP.txt | 10
-rw-r--r--  Documentation/RCU/UP.rst | 17
-rw-r--r--  Documentation/RCU/arrayRCU.rst | 165
-rw-r--r--  Documentation/RCU/checklist.rst | 258
-rw-r--r--  Documentation/RCU/index.rst | 1
-rw-r--r--  Documentation/RCU/listRCU.rst | 174
-rw-r--r--  Documentation/RCU/lockdep.rst | 19
-rw-r--r--  Documentation/RCU/rcu.rst | 19
-rw-r--r--  Documentation/RCU/rcu_dereference.rst | 35
-rw-r--r--  Documentation/RCU/rcubarrier.rst | 357
-rw-r--r--  Documentation/RCU/rculist_nulls.rst | 111
-rw-r--r--  Documentation/RCU/stallwarn.rst | 176
-rw-r--r--  Documentation/RCU/torture.rst | 91
-rw-r--r--  Documentation/RCU/whatisRCU.rst | 334
-rw-r--r--  Documentation/accel/index.rst | 18
-rw-r--r--  Documentation/accel/introduction.rst | 110
-rw-r--r--  Documentation/accel/qaic/aic100.rst | 510
-rw-r--r--  Documentation/accel/qaic/index.rst | 13
-rw-r--r--  Documentation/accel/qaic/qaic.rst | 170
-rw-r--r--  Documentation/accounting/delay-accounting.rst | 71
-rw-r--r--  Documentation/accounting/psi.rst | 16
-rw-r--r--  Documentation/admin-guide/README.rst | 121
-rw-r--r--  Documentation/admin-guide/acpi/cppc_sysfs.rst | 2
-rw-r--r--  Documentation/admin-guide/acpi/dsdt-override.rst | 13
-rw-r--r--  Documentation/admin-guide/acpi/fan_performance_states.rst | 28
-rw-r--r--  Documentation/admin-guide/acpi/index.rst | 1
-rw-r--r--  Documentation/admin-guide/bcache.rst | 2
-rw-r--r--  Documentation/admin-guide/blockdev/drbd/figures.rst | 4
-rw-r--r--  Documentation/admin-guide/blockdev/drbd/node-states-8.dot | 13
-rw-r--r--  Documentation/admin-guide/blockdev/drbd/peer-states-8.dot | 8
-rw-r--r--  Documentation/admin-guide/blockdev/index.rst | 6
-rw-r--r--  Documentation/admin-guide/blockdev/nbd.rst | 2
-rw-r--r--  Documentation/admin-guide/blockdev/paride.rst | 388
-rw-r--r--  Documentation/admin-guide/blockdev/zram.rst | 133
-rw-r--r--  Documentation/admin-guide/bootconfig.rst | 37
-rw-r--r--  Documentation/admin-guide/cgroup-v1/blkio-controller.rst | 2
-rw-r--r--  Documentation/admin-guide/cgroup-v1/cgroups.rst | 2
-rw-r--r--  Documentation/admin-guide/cgroup-v1/cpusets.rst | 2
-rw-r--r--  Documentation/admin-guide/cgroup-v1/hugetlb.rst | 4
-rw-r--r--  Documentation/admin-guide/cgroup-v1/memcg_test.rst | 2
-rw-r--r--  Documentation/admin-guide/cgroup-v1/memory.rst | 313
-rw-r--r--  Documentation/admin-guide/cgroup-v2.rst | 325
-rw-r--r--  Documentation/admin-guide/cifs/usage.rst | 13
-rw-r--r--  Documentation/admin-guide/cputopology.rst | 29
-rw-r--r--  Documentation/admin-guide/device-mapper/cache-policies.rst | 2
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-ebs.rst | 2
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-flakey.rst | 4
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-init.rst | 8
-rw-r--r--  Documentation/admin-guide/device-mapper/dm-zoned.rst | 2
-rw-r--r--  Documentation/admin-guide/device-mapper/unstriped.rst | 10
-rw-r--r--  Documentation/admin-guide/device-mapper/verity.rst | 4
-rw-r--r--  Documentation/admin-guide/device-mapper/writecache.rst | 18
-rw-r--r--  Documentation/admin-guide/devices.rst | 7
-rw-r--r--  Documentation/admin-guide/devices.txt | 15
-rw-r--r--  Documentation/admin-guide/dynamic-debug-howto.rst | 263
-rw-r--r--  Documentation/admin-guide/efi-stub.rst | 4
-rw-r--r--  Documentation/admin-guide/ext4.rst | 3
-rw-r--r--  Documentation/admin-guide/filesystem-monitoring.rst | 78
-rw-r--r--  Documentation/admin-guide/gpio/gpio-sim.rst | 134
-rw-r--r--  Documentation/admin-guide/gpio/index.rst | 1
-rw-r--r--  Documentation/admin-guide/gpio/sysfs.rst | 2
-rw-r--r--  Documentation/admin-guide/hw-vuln/core-scheduling.rst | 5
-rw-r--r--  Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst | 91
-rw-r--r--  Documentation/admin-guide/hw-vuln/index.rst | 2
-rw-r--r--  Documentation/admin-guide/hw-vuln/mds.rst | 6
-rw-r--r--  Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst | 260
-rw-r--r--  Documentation/admin-guide/hw-vuln/spectre.rst | 141
-rw-r--r--  Documentation/admin-guide/hw-vuln/tsx_async_abort.rst | 2
-rw-r--r--  Documentation/admin-guide/hw_random.rst | 6
-rw-r--r--  Documentation/admin-guide/index.rst | 16
-rw-r--r--  Documentation/admin-guide/iostats.rst | 6
-rw-r--r--  Documentation/admin-guide/kdump/gdbmacros.txt | 2
-rw-r--r--  Documentation/admin-guide/kdump/kdump.rst | 10
-rw-r--r--  Documentation/admin-guide/kdump/vmcoreinfo.rst | 45
-rw-r--r--  Documentation/admin-guide/kernel-parameters.rst | 21
-rw-r--r--  Documentation/admin-guide/kernel-parameters.txt | 1834
-rw-r--r--  Documentation/admin-guide/kernel-per-CPU-kthreads.rst | 4
-rw-r--r--  Documentation/admin-guide/laptops/lg-laptop.rst | 2
-rw-r--r--  Documentation/admin-guide/laptops/thinkpad-acpi.rst | 14
-rw-r--r--  Documentation/admin-guide/md.rst | 2
-rw-r--r--  Documentation/admin-guide/media/bttv.rst | 2
-rw-r--r--  Documentation/admin-guide/media/building.rst | 2
-rw-r--r--  Documentation/admin-guide/media/cec-drivers.rst | 10
-rw-r--r--  Documentation/admin-guide/media/cec.rst | 380
-rw-r--r--  Documentation/admin-guide/media/cpia2.rst | 145
-rw-r--r--  Documentation/admin-guide/media/davinci-vpbe.rst | 65
-rw-r--r--  Documentation/admin-guide/media/dvb-drivers.rst | 1
-rw-r--r--  Documentation/admin-guide/media/fimc.rst | 2
-rw-r--r--  Documentation/admin-guide/media/i2c-cardlist.rst | 18
-rw-r--r--  Documentation/admin-guide/media/imx7.rst | 62
-rw-r--r--  Documentation/admin-guide/media/index.rst | 3
-rw-r--r--  Documentation/admin-guide/media/ipu3.rst | 14
-rw-r--r--  Documentation/admin-guide/media/ivtv.rst | 2
-rw-r--r--  Documentation/admin-guide/media/meye.rst | 93
-rw-r--r--  Documentation/admin-guide/media/omap3isp.rst | 2
-rw-r--r--  Documentation/admin-guide/media/omap4_camera.rst | 2
-rw-r--r--  Documentation/admin-guide/media/other-usb-cardlist.rst | 14
-rw-r--r--  Documentation/admin-guide/media/pci-cardlist.rst | 1
-rw-r--r--  Documentation/admin-guide/media/platform-cardlist.rst | 3
-rw-r--r--  Documentation/admin-guide/media/pulse8-cec.rst | 13
-rw-r--r--  Documentation/admin-guide/media/si476x.rst | 2
-rw-r--r--  Documentation/admin-guide/media/tm6000-cardlist.rst | 83
-rw-r--r--  Documentation/admin-guide/media/usb-cardlist.rst | 7
-rw-r--r--  Documentation/admin-guide/media/v4l-drivers.rst | 4
-rw-r--r--  Documentation/admin-guide/media/vimc.dot | 18
-rw-r--r--  Documentation/admin-guide/media/vimc.rst | 46
-rw-r--r--  Documentation/admin-guide/media/visl.rst | 175
-rw-r--r--  Documentation/admin-guide/media/vivid.rst | 20
-rw-r--r--  Documentation/admin-guide/media/zr364xx.rst | 102
-rw-r--r--  Documentation/admin-guide/mm/cma_debugfs.rst | 10
-rw-r--r--  Documentation/admin-guide/mm/concepts.rst | 13
-rw-r--r--  Documentation/admin-guide/mm/damon/index.rst | 10
-rw-r--r--  Documentation/admin-guide/mm/damon/lru_sort.rst | 294
-rw-r--r--  Documentation/admin-guide/mm/damon/reclaim.rst | 274
-rw-r--r--  Documentation/admin-guide/mm/damon/start.rst | 141
-rw-r--r--  Documentation/admin-guide/mm/damon/usage.rst | 754
-rw-r--r--  Documentation/admin-guide/mm/hugetlbpage.rst | 60
-rw-r--r--  Documentation/admin-guide/mm/idle_page_tracking.rst | 9
-rw-r--r--  Documentation/admin-guide/mm/index.rst | 7
-rw-r--r--  Documentation/admin-guide/mm/ksm.rst | 61
-rw-r--r--  Documentation/admin-guide/mm/memory-hotplug.rst | 149
-rw-r--r--  Documentation/admin-guide/mm/multigen_lru.rst | 162
-rw-r--r--  Documentation/admin-guide/mm/numa_memory_policy.rst | 22
-rw-r--r--  Documentation/admin-guide/mm/numaperf.rst | 8
-rw-r--r--  Documentation/admin-guide/mm/pagemap.rst | 96
-rw-r--r--  Documentation/admin-guide/mm/shrinker_debugfs.rst | 133
-rw-r--r--  Documentation/admin-guide/mm/soft-dirty.rst | 2
-rw-r--r--  Documentation/admin-guide/mm/swap_numa.rst (renamed from Documentation/vm/swap_numa.rst) | 2
-rw-r--r--  Documentation/admin-guide/mm/transhuge.rst | 18
-rw-r--r--  Documentation/admin-guide/mm/userfaultfd.rst | 68
-rw-r--r--  Documentation/admin-guide/mm/zswap.rst (renamed from Documentation/vm/zswap.rst) | 36
-rw-r--r--  Documentation/admin-guide/nfs/nfs-client.rst | 15
-rw-r--r--  Documentation/admin-guide/perf/alibaba_pmu.rst | 100
-rw-r--r--  Documentation/admin-guide/perf/hisi-pcie-pmu.rst | 130
-rw-r--r--  Documentation/admin-guide/perf/hns3-pmu.rst | 136
-rw-r--r--  Documentation/admin-guide/perf/index.rst | 5
-rw-r--r--  Documentation/admin-guide/perf/meson-ddr-pmu.rst | 70
-rw-r--r--  Documentation/admin-guide/perf/nvidia-pmu.rst | 299
-rw-r--r--  Documentation/admin-guide/pm/amd-pstate.rst | 720
-rw-r--r--  Documentation/admin-guide/pm/cpuidle.rst | 15
-rw-r--r--  Documentation/admin-guide/pm/intel-speed-select.rst | 22
-rw-r--r--  Documentation/admin-guide/pm/intel_pstate.rst | 4
-rw-r--r--  Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst | 60
-rw-r--r--  Documentation/admin-guide/pm/working-state.rst | 2
-rw-r--r--  Documentation/admin-guide/quickly-build-trimmed-linux.rst | 1092
-rw-r--r--  Documentation/admin-guide/ramoops.rst | 2
-rw-r--r--  Documentation/admin-guide/ras.rst | 2
-rw-r--r--  Documentation/admin-guide/reporting-issues.rst | 73
-rw-r--r--  Documentation/admin-guide/reporting-regressions.rst | 451
-rw-r--r--  Documentation/admin-guide/serial-console.rst | 36
-rw-r--r--  Documentation/admin-guide/spkguide.txt | 6
-rw-r--r--  Documentation/admin-guide/syscall-user-dispatch.rst | 4
-rw-r--r--  Documentation/admin-guide/sysctl/fs.rst | 240
-rw-r--r--  Documentation/admin-guide/sysctl/kernel.rst | 226
-rw-r--r--  Documentation/admin-guide/sysctl/net.rst | 68
-rw-r--r--  Documentation/admin-guide/sysctl/vm.rst | 57
-rw-r--r--  Documentation/admin-guide/sysrq.rst | 2
-rw-r--r--  Documentation/admin-guide/tainted-kernels.rst | 7
-rw-r--r--  Documentation/admin-guide/thermal/index.rst | 8
-rw-r--r--  Documentation/admin-guide/thermal/intel_powerclamp.rst (renamed from Documentation/driver-api/thermal/intel_powerclamp.rst) | 37
-rw-r--r--  Documentation/admin-guide/unicode.rst | 9
-rw-r--r--  Documentation/admin-guide/workload-tracing.rst | 606
-rw-r--r--  Documentation/admin-guide/xfs.rst | 9
-rw-r--r--  Documentation/arch.rst | 26
-rw-r--r--  Documentation/arch/arc/arc.rst | 85
-rw-r--r--  Documentation/arch/arc/features.rst | 3
-rw-r--r--  Documentation/arch/arc/index.rst | 17
-rw-r--r--  Documentation/arch/ia64/aliasing.rst (renamed from Documentation/ia64/aliasing.rst) | 2
-rw-r--r--  Documentation/arch/ia64/efirtc.rst (renamed from Documentation/ia64/efirtc.rst) | 0
-rw-r--r--  Documentation/arch/ia64/err_inject.rst (renamed from Documentation/ia64/err_inject.rst) | 0
-rw-r--r--  Documentation/arch/ia64/features.rst (renamed from Documentation/ia64/features.rst) | 0
-rw-r--r--  Documentation/arch/ia64/fsys.rst (renamed from Documentation/ia64/fsys.rst) | 0
-rw-r--r--  Documentation/arch/ia64/ia64.rst (renamed from Documentation/ia64/ia64.rst) | 0
-rw-r--r--  Documentation/arch/ia64/index.rst (renamed from Documentation/ia64/index.rst) | 0
-rw-r--r--  Documentation/arch/ia64/irq-redir.rst (renamed from Documentation/ia64/irq-redir.rst) | 0
-rw-r--r--  Documentation/arch/ia64/mca.rst (renamed from Documentation/ia64/mca.rst) | 0
-rw-r--r--  Documentation/arch/ia64/serial.rst (renamed from Documentation/ia64/serial.rst) | 0
-rw-r--r--  Documentation/arch/index.rst | 28
-rw-r--r--  Documentation/arch/m68k/buddha-driver.rst (renamed from Documentation/m68k/buddha-driver.rst) | 0
-rw-r--r--  Documentation/arch/m68k/features.rst (renamed from Documentation/m68k/features.rst) | 0
-rw-r--r--  Documentation/arch/m68k/index.rst (renamed from Documentation/m68k/index.rst) | 0
-rw-r--r--  Documentation/arch/m68k/kernel-options.rst (renamed from Documentation/m68k/kernel-options.rst) | 4
-rw-r--r--  Documentation/arch/nios2/features.rst (renamed from Documentation/nios2/features.rst) | 0
-rw-r--r--  Documentation/arch/nios2/index.rst (renamed from Documentation/nios2/index.rst) | 0
-rw-r--r--  Documentation/arch/nios2/nios2.rst (renamed from Documentation/nios2/nios2.rst) | 0
-rw-r--r--  Documentation/arch/openrisc/features.rst (renamed from Documentation/openrisc/features.rst) | 0
-rw-r--r--  Documentation/arch/openrisc/index.rst (renamed from Documentation/openrisc/index.rst) | 0
-rw-r--r--  Documentation/arch/openrisc/openrisc_port.rst (renamed from Documentation/openrisc/openrisc_port.rst) | 0
-rw-r--r--  Documentation/arch/openrisc/todo.rst (renamed from Documentation/openrisc/todo.rst) | 0
-rw-r--r--  Documentation/arch/parisc/debugging.rst (renamed from Documentation/parisc/debugging.rst) | 0
-rw-r--r--  Documentation/arch/parisc/features.rst (renamed from Documentation/parisc/features.rst) | 0
-rw-r--r--  Documentation/arch/parisc/index.rst (renamed from Documentation/parisc/index.rst) | 0
-rw-r--r--  Documentation/arch/parisc/registers.rst (renamed from Documentation/parisc/registers.rst) | 0
-rw-r--r--  Documentation/arch/sh/booting.rst (renamed from Documentation/sh/booting.rst) | 0
-rw-r--r--  Documentation/arch/sh/features.rst (renamed from Documentation/sh/features.rst) | 0
-rw-r--r--  Documentation/arch/sh/index.rst (renamed from Documentation/sh/index.rst) | 0
-rw-r--r--  Documentation/arch/sh/new-machine.rst (renamed from Documentation/sh/new-machine.rst) | 0
-rw-r--r--  Documentation/arch/sh/register-banks.rst (renamed from Documentation/sh/register-banks.rst) | 0
-rw-r--r--  Documentation/arch/sparc/adi.rst (renamed from Documentation/sparc/adi.rst) | 4
-rw-r--r--  Documentation/arch/sparc/console.rst (renamed from Documentation/sparc/console.rst) | 0
-rw-r--r--  Documentation/arch/sparc/features.rst (renamed from Documentation/sparc/features.rst) | 0
-rw-r--r--  Documentation/arch/sparc/index.rst (renamed from Documentation/sparc/index.rst) | 0
-rw-r--r--  Documentation/arch/sparc/oradax/dax-hv-api.txt (renamed from Documentation/sparc/oradax/dax-hv-api.txt) | 44
-rw-r--r--  Documentation/arch/sparc/oradax/oracle-dax.rst (renamed from Documentation/sparc/oradax/oracle-dax.rst) | 0
-rw-r--r--  Documentation/arch/x86/amd-memory-encryption.rst | 133
-rw-r--r--  Documentation/arch/x86/amd_hsmp.rst | 86
-rw-r--r--  Documentation/arch/x86/boot.rst (renamed from Documentation/x86/boot.rst) | 5
-rw-r--r--  Documentation/arch/x86/booting-dt.rst (renamed from Documentation/x86/booting-dt.rst) | 2
-rw-r--r--  Documentation/arch/x86/buslock.rst (renamed from Documentation/x86/buslock.rst) | 10
-rw-r--r--  Documentation/arch/x86/cpuinfo.rst (renamed from Documentation/x86/cpuinfo.rst) | 5
-rw-r--r--  Documentation/arch/x86/earlyprintk.rst (renamed from Documentation/x86/earlyprintk.rst) | 0
-rw-r--r--  Documentation/arch/x86/elf_auxvec.rst (renamed from Documentation/x86/elf_auxvec.rst) | 0
-rw-r--r--  Documentation/arch/x86/entry_64.rst (renamed from Documentation/x86/entry_64.rst) | 6
-rw-r--r--  Documentation/arch/x86/exception-tables.rst (renamed from Documentation/x86/exception-tables.rst) | 23
-rw-r--r--  Documentation/arch/x86/features.rst (renamed from Documentation/x86/features.rst) | 0
-rw-r--r--  Documentation/arch/x86/i386/IO-APIC.rst (renamed from Documentation/x86/i386/IO-APIC.rst) | 0
-rw-r--r--  Documentation/arch/x86/i386/index.rst (renamed from Documentation/x86/i386/index.rst) | 0
-rw-r--r--  Documentation/arch/x86/ifs.rst | 2
-rw-r--r--  Documentation/arch/x86/index.rst | 44
-rw-r--r--  Documentation/arch/x86/intel-hfi.rst | 72
-rw-r--r--  Documentation/arch/x86/intel_txt.rst (renamed from Documentation/x86/intel_txt.rst) | 0
-rw-r--r--  Documentation/arch/x86/iommu.rst | 151
-rw-r--r--  Documentation/arch/x86/kernel-stacks.rst (renamed from Documentation/x86/kernel-stacks.rst) | 2
-rw-r--r--  Documentation/arch/x86/mds.rst (renamed from Documentation/x86/mds.rst) | 0
-rw-r--r--  Documentation/arch/x86/microcode.rst | 240
-rw-r--r--  Documentation/arch/x86/mtrr.rst (renamed from Documentation/x86/mtrr.rst) | 2
-rw-r--r--  Documentation/arch/x86/orc-unwinder.rst (renamed from Documentation/x86/orc-unwinder.rst) | 6
-rw-r--r--  Documentation/arch/x86/pat.rst (renamed from Documentation/x86/pat.rst) | 0
-rw-r--r--  Documentation/arch/x86/pti.rst (renamed from Documentation/x86/pti.rst) | 0
-rw-r--r--  Documentation/arch/x86/resctrl.rst (renamed from Documentation/x86/resctrl.rst) | 165
-rw-r--r--  Documentation/arch/x86/sgx.rst (renamed from Documentation/x86/sgx.rst) | 64
-rw-r--r--  Documentation/arch/x86/sva.rst (renamed from Documentation/x86/sva.rst) | 53
-rw-r--r--  Documentation/arch/x86/tdx.rst | 261
-rw-r--r--  Documentation/arch/x86/tlb.rst (renamed from Documentation/x86/tlb.rst) | 0
-rw-r--r--  Documentation/arch/x86/topology.rst (renamed from Documentation/x86/topology.rst) | 0
-rw-r--r--  Documentation/arch/x86/tsx_async_abort.rst (renamed from Documentation/x86/tsx_async_abort.rst) | 0
-rw-r--r--  Documentation/arch/x86/usb-legacy-support.rst (renamed from Documentation/x86/usb-legacy-support.rst) | 0
-rw-r--r--  Documentation/arch/x86/x86_64/5level-paging.rst (renamed from Documentation/x86/x86_64/5level-paging.rst) | 2
-rw-r--r--  Documentation/arch/x86/x86_64/boot-options.rst (renamed from Documentation/x86/x86_64/boot-options.rst) | 44
-rw-r--r--  Documentation/arch/x86/x86_64/cpu-hotplug-spec.rst (renamed from Documentation/x86/x86_64/cpu-hotplug-spec.rst) | 0
-rw-r--r--  Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst (renamed from Documentation/x86/x86_64/fake-numa-for-cpusets.rst) | 2
-rw-r--r--  Documentation/arch/x86/x86_64/fsgs.rst (renamed from Documentation/x86/x86_64/fsgs.rst) | 0
-rw-r--r--  Documentation/arch/x86/x86_64/index.rst (renamed from Documentation/x86/x86_64/index.rst) | 0
-rw-r--r--  Documentation/arch/x86/x86_64/machinecheck.rst | 33
-rw-r--r--  Documentation/arch/x86/x86_64/mm.rst (renamed from Documentation/x86/x86_64/mm.rst) | 2
-rw-r--r--  Documentation/arch/x86/x86_64/uefi.rst (renamed from Documentation/x86/x86_64/uefi.rst) | 2
-rw-r--r--  Documentation/arch/x86/xstate.rst | 174
-rw-r--r--  Documentation/arch/x86/zero-page.rst (renamed from Documentation/x86/zero-page.rst) | 2
-rw-r--r--  Documentation/arch/xtensa/atomctl.rst (renamed from Documentation/xtensa/atomctl.rst) | 0
-rw-r--r--  Documentation/arch/xtensa/booting.rst (renamed from Documentation/xtensa/booting.rst) | 0
-rw-r--r--  Documentation/arch/xtensa/features.rst (renamed from Documentation/xtensa/features.rst) | 0
-rw-r--r--  Documentation/arch/xtensa/index.rst (renamed from Documentation/xtensa/index.rst) | 0
-rw-r--r--  Documentation/arch/xtensa/mmu.rst (renamed from Documentation/xtensa/mmu.rst) | 0
-rw-r--r--  Documentation/arm/google/chromebook-boot-flow.rst | 69
-rw-r--r--  Documentation/arm/index.rst | 8
-rw-r--r--  Documentation/arm/marvell.rst | 88
-rw-r--r--  Documentation/arm/microchip.rst | 20
-rw-r--r--  Documentation/arm/samsung-s3c24xx/cpufreq.rst | 76
-rw-r--r--  Documentation/arm/samsung-s3c24xx/eb2410itx.rst | 59
-rw-r--r--  Documentation/arm/samsung-s3c24xx/gpio.rst | 172
-rw-r--r--  Documentation/arm/samsung-s3c24xx/h1940.rst | 41
-rw-r--r--  Documentation/arm/samsung-s3c24xx/index.rst | 20
-rw-r--r--  Documentation/arm/samsung-s3c24xx/nand.rst | 30
-rw-r--r--  Documentation/arm/samsung-s3c24xx/overview.rst | 311
-rw-r--r--  Documentation/arm/samsung-s3c24xx/s3c2412.rst | 121
-rw-r--r--  Documentation/arm/samsung-s3c24xx/s3c2413.rst | 22
-rw-r--r--  Documentation/arm/samsung-s3c24xx/smdk2440.rst | 57
-rw-r--r--  Documentation/arm/samsung-s3c24xx/suspend.rst | 137
-rw-r--r--  Documentation/arm/samsung-s3c24xx/usb-host.rst | 91
-rw-r--r--  Documentation/arm/samsung/gpio.rst | 8
-rw-r--r--  Documentation/arm/samsung/overview.rst | 13
-rw-r--r--  Documentation/arm/sti/overview.rst | 10
-rw-r--r--  Documentation/arm/sti/stih415-overview.rst | 14
-rw-r--r--  Documentation/arm/sti/stih416-overview.rst | 13
-rw-r--r--  Documentation/arm/stm32/stm32-dma-mdma-chaining.rst | 415
-rw-r--r--  Documentation/arm/stm32/stm32mp13-overview.rst | 37
-rw-r--r--  Documentation/arm/stm32/stm32mp151-overview.rst | 36
-rw-r--r--  Documentation/arm/tcm.rst | 2
-rw-r--r--  Documentation/arm/uefi.rst | 4
-rw-r--r--  Documentation/arm64/acpi_object_usage.rst | 2
-rw-r--r--  Documentation/arm64/booting.rst | 55
-rw-r--r--  Documentation/arm64/cpu-feature-registers.rst | 69
-rw-r--r--  Documentation/arm64/elf_hwcaps.rst | 95
-rw-r--r--  Documentation/arm64/index.rst | 1
-rw-r--r--  Documentation/arm64/memory-tagging-extension.rst | 54
-rw-r--r--  Documentation/arm64/memory.rst | 10
-rw-r--r--  Documentation/arm64/perf.rst | 78
-rw-r--r--  Documentation/arm64/pointer-authentication.rst | 9
-rw-r--r--  Documentation/arm64/silicon-errata.rst | 50
-rw-r--r--  Documentation/arm64/sme.rst | 468
-rw-r--r--  Documentation/arm64/sve.rst | 99
-rw-r--r--  Documentation/arm64/tagged-address-abi.rst | 2
-rw-r--r--  Documentation/atomic_bitops.txt | 10
-rw-r--r--  Documentation/atomic_t.txt | 2
-rw-r--r--  Documentation/block/biodoc.rst | 1164
-rw-r--r--  Documentation/block/capability.rst | 10
-rw-r--r--  Documentation/block/index.rst | 5
-rw-r--r--  Documentation/block/inline-encryption.rst | 452
-rw-r--r--  Documentation/block/null_blk.rst | 22
-rw-r--r--  Documentation/block/queue-sysfs.rst | 289
-rw-r--r--  Documentation/block/request.rst | 99
-rw-r--r--  Documentation/block/ublk.rst | 326
-rw-r--r--  Documentation/bpf/bpf_design_QA.rst | 89
-rw-r--r--  Documentation/bpf/bpf_devel_QA.rst | 61
-rw-r--r--  Documentation/bpf/bpf_iterators.rst | 485
-rw-r--r--  Documentation/bpf/bpf_licensing.rst | 92
-rw-r--r--  Documentation/bpf/bpf_lsm.rst | 143
-rw-r--r--  Documentation/bpf/bpf_prog_run.rst | 117
-rw-r--r--  Documentation/bpf/btf.rst | 183
-rw-r--r--  Documentation/bpf/clang-notes.rst | 36
-rw-r--r--  Documentation/bpf/classic_vs_extended.rst | 376
-rw-r--r--  Documentation/bpf/cpumasks.rst | 383
-rw-r--r--  Documentation/bpf/faq.rst | 11
-rw-r--r--  Documentation/bpf/graph_ds_impl.rst | 267
-rw-r--r--  Documentation/bpf/helpers.rst | 7
-rw-r--r--  Documentation/bpf/index.rst | 109
-rw-r--r--  Documentation/bpf/instruction-set.rst | 475
-rw-r--r--  Documentation/bpf/kfuncs.rst | 609
-rw-r--r--  Documentation/bpf/libbpf/index.rst | 29
-rw-r--r--  Documentation/bpf/libbpf/libbpf_naming_convention.rst | 57
-rw-r--r--  Documentation/bpf/libbpf/libbpf_overview.rst | 228
-rw-r--r--  Documentation/bpf/libbpf/program_types.rst | 203
-rw-r--r--  Documentation/bpf/linux-notes.rst | 83
-rw-r--r--  Documentation/bpf/map_array.rst | 262
-rw-r--r--  Documentation/bpf/map_bloom_filter.rst | 174
-rw-r--r--  Documentation/bpf/map_cgroup_storage.rst | 4
-rw-r--r--  Documentation/bpf/map_cgrp_storage.rst | 109
-rw-r--r--  Documentation/bpf/map_cpumap.rst | 177
-rw-r--r--  Documentation/bpf/map_devmap.rst | 238
-rw-r--r--  Documentation/bpf/map_hash.rst | 208
-rw-r--r--  Documentation/bpf/map_lpm_trie.rst | 197
-rw-r--r--  Documentation/bpf/map_of_maps.rst | 130
-rw-r--r--  Documentation/bpf/map_queue_stack.rst | 146
-rw-r--r--  Documentation/bpf/map_sk_storage.rst | 159
-rw-r--r--  Documentation/bpf/map_sockmap.rst | 498
-rw-r--r--  Documentation/bpf/map_xskmap.rst | 192
-rw-r--r--  Documentation/bpf/maps.rst | 82
-rw-r--r--  Documentation/bpf/other.rst | 10
-rw-r--r--  Documentation/bpf/prog_lsm.rst | 143
-rw-r--r--  Documentation/bpf/programs.rst | 12
-rw-r--r--  Documentation/bpf/redirect.rst | 81
-rw-r--r--  Documentation/bpf/ringbuf.rst | 4
-rw-r--r--  Documentation/bpf/syscall_api.rst | 11
-rw-r--r--  Documentation/bpf/test_debug.rst | 9
-rw-r--r--  Documentation/bpf/verifier.rst | 824
-rw-r--r--  Documentation/cdrom/cdrom-standard.rst | 21
-rw-r--r--  Documentation/cdrom/ide-cd.rst | 538
-rw-r--r--  Documentation/cdrom/index.rst | 1
-rw-r--r--  Documentation/cdrom/packet-writing.rst | 4
-rw-r--r--  Documentation/conf.py | 426
-rw-r--r--  Documentation/core-api/asm-annotations.rst (renamed from Documentation/asm-annotations.rst) | 20
-rw-r--r--  Documentation/core-api/bus-virt-phys-mapping.rst | 220
-rw-r--r--  Documentation/core-api/cachetlb.rst | 6
-rw-r--r--  Documentation/core-api/cpu_hotplug.rst | 2
-rw-r--r--  Documentation/core-api/dma-api-howto.rst | 16
-rw-r--r--  Documentation/core-api/dma-api.rst | 14
-rw-r--r--  Documentation/core-api/entry.rst | 279
-rw-r--r--  Documentation/core-api/idr.rst | 3
-rw-r--r--  Documentation/core-api/index.rst | 19
-rw-r--r--  Documentation/core-api/irq/irq-domain.rst | 5
-rw-r--r--  Documentation/core-api/kernel-api.rst | 42
-rw-r--r--  Documentation/core-api/kobject.rst | 16
-rw-r--r--  Documentation/core-api/local_ops.rst | 2
-rw-r--r--  Documentation/core-api/maple_tree.rst | 217
-rw-r--r--  Documentation/core-api/memory-allocation.rst | 17
-rw-r--r--  Documentation/core-api/memory-hotplug.rst | 3
-rw-r--r--  Documentation/core-api/mm-api.rst | 35
-rw-r--r--  Documentation/core-api/netlink.rst | 101
-rw-r--r--  Documentation/core-api/packing.rst | 2
-rw-r--r--  Documentation/core-api/padata.rst | 2
-rw-r--r--  Documentation/core-api/pin_user_pages.rst | 31
-rw-r--r--  Documentation/core-api/printk-formats.rst | 28
-rw-r--r--  Documentation/core-api/printk-index.rst | 137
-rw-r--r--  Documentation/core-api/protection-keys.rst | 44
-rw-r--r--  Documentation/core-api/symbol-namespaces.rst | 4
-rw-r--r--  Documentation/core-api/timekeeping.rst | 1
-rw-r--r--  Documentation/core-api/watch_queue.rst (renamed from Documentation/watch_queue.rst) | 0
-rw-r--r--  Documentation/core-api/workqueue.rst | 25
-rw-r--r--  Documentation/core-api/wrappers/atomic_bitops.rst | 18
-rw-r--r--  Documentation/core-api/wrappers/atomic_t.rst | 19
-rw-r--r--  Documentation/core-api/wrappers/memory-barriers.rst | 18
-rw-r--r--  Documentation/core-api/xarray.rst | 14
-rw-r--r--  Documentation/cpu-freq/core.rst | 6
-rw-r--r--  Documentation/cpu-freq/cpu-drivers.rst | 3
-rw-r--r--  Documentation/cpu-freq/index.rst | 15
-rw-r--r--  Documentation/crypto/crypto_engine.rst | 4
-rw-r--r--  Documentation/crypto/devel-algos.rst | 2
-rw-r--r--  Documentation/crypto/index.rst | 6
-rw-r--r--  Documentation/crypto/userspace-if.rst | 15
-rw-r--r--  Documentation/dev-tools/checkpatch.rst | 88
-rw-r--r--  Documentation/dev-tools/coccinelle.rst | 10
-rw-r--r--  Documentation/dev-tools/gdb-kernel-debugging.rst | 4
-rw-r--r--  Documentation/dev-tools/index.rst | 2
-rw-r--r--  Documentation/dev-tools/kasan.rst | 248
-rw-r--r--  Documentation/dev-tools/kcov.rst | 174
-rw-r--r--  Documentation/dev-tools/kcsan.rst | 76
-rw-r--r--  Documentation/dev-tools/kfence.rst | 35
-rw-r--r--  Documentation/dev-tools/kgdb.rst | 6
-rw-r--r--  Documentation/dev-tools/kmemleak.rst | 3
-rw-r--r--  Documentation/dev-tools/kmsan.rst | 428
-rw-r--r--  Documentation/dev-tools/kselftest.rst | 46
-rw-r--r--  Documentation/dev-tools/ktap.rst | 311
-rw-r--r--  Documentation/dev-tools/kunit/api/functionredirection.rst | 162
-rw-r--r--  Documentation/dev-tools/kunit/api/index.rst | 17
-rw-r--r--  Documentation/dev-tools/kunit/api/resource.rst | 13
-rw-r--r--  Documentation/dev-tools/kunit/api/test.rst | 3
-rw-r--r--  Documentation/dev-tools/kunit/architecture.rst | 196
-rw-r--r--  Documentation/dev-tools/kunit/faq.rst | 79
-rw-r--r--  Documentation/dev-tools/kunit/index.rst | 172
-rw-r--r--  Documentation/dev-tools/kunit/kunit-tool.rst | 232
-rw-r--r--  Documentation/dev-tools/kunit/kunit_suitememorydiagram.svg | 81
-rw-r--r--  Documentation/dev-tools/kunit/run_manual.rst | 57
-rw-r--r--  Documentation/dev-tools/kunit/run_wrapper.rst | 323
-rw-r--r--  Documentation/dev-tools/kunit/running_tips.rst | 17
-rw-r--r--  Documentation/dev-tools/kunit/start.rst | 250
-rw-r--r--  Documentation/dev-tools/kunit/style.rst | 105
-rw-r--r--  Documentation/dev-tools/kunit/tips.rst | 190
-rw-r--r--  Documentation/dev-tools/kunit/usage.rst | 628
-rw-r--r--  Documentation/dev-tools/sparse.rst | 2
-rw-r--r--  Documentation/dev-tools/testing-overview.rst | 63
-rw-r--r--  Documentation/devicetree/bindings/.gitignore | 5
-rw-r--r--  Documentation/devicetree/bindings/.yamllint | 2
-rw-r--r--  Documentation/devicetree/bindings/Makefile | 66
-rw-r--r--  Documentation/devicetree/bindings/arm/actions.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/airoha.yaml | 28
-rw-r--r--  Documentation/devicetree/bindings/arm/altera.yaml | 57
-rw-r--r--  Documentation/devicetree/bindings/arm/amazon,al.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/amlogic.yaml | 33
-rw-r--r--  Documentation/devicetree/bindings/arm/amlogic/amlogic,meson-gx-ao-secure.yaml | 6
-rw-r--r--  Documentation/devicetree/bindings/arm/amlogic/amlogic,meson-mx-secbus2.yaml | 4
-rw-r--r--  Documentation/devicetree/bindings/arm/apple.yaml | 56
-rw-r--r--  Documentation/devicetree/bindings/arm/apple/apple,pmgr.yaml | 135
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,cci-400.yaml | 211
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-catu.yaml | 104
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-cpu-debug.yaml | 81
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-cti.yaml | 334
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-dynamic-funnel.yaml | 129
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-dynamic-replicator.yaml | 129
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-etb10.yaml | 95
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-etm.yaml | 159
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-static-funnel.yaml | 93
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-static-replicator.yaml | 94
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-stm.yaml | 104
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-tmc.yaml | 137
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,coresight-tpiu.yaml | 94
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,corstone1000.yaml | 45
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,embedded-trace-extension.yaml | 77
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,integrator.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,realview.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,trace-buffer-extension.yaml | 49
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,versatile-sysreg.yaml | 35
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,versatile.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/arm,vexpress-juno.yaml | 49
-rw-r--r--  Documentation/devicetree/bindings/arm/arm-dsu-pmu.txt | 27
-rw-r--r--  Documentation/devicetree/bindings/arm/aspeed/aspeed,sbc.yaml | 37
-rw-r--r--  Documentation/devicetree/bindings/arm/aspeed/aspeed.yaml | 92
-rw-r--r--  Documentation/devicetree/bindings/arm/atmel-at91.yaml | 53
-rw-r--r--  Documentation/devicetree/bindings/arm/atmel-sysregs.txt | 15
-rw-r--r--  Documentation/devicetree/bindings/arm/axxia.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/bcm2835.yaml | 4
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,bcm11351.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,bcm21664.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,bcm23550.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,bcm4708.yaml | 21
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,bcm4908.yaml | 41
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,bcm63138.txt | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,bcmbca.yaml | 151
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,brcmstb.txt | 11
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,cygnus.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,hr2.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,ns2.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,nsp.yaml | 67
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,stingray.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/brcm,vulcan-soc.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/bcm/raspberrypi,bcm2835-firmware.yaml | 5
-rw-r--r--  Documentation/devicetree/bindings/arm/bitmain.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/calxeda.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/cci-control-port.yaml | 38
-rw-r--r--  Documentation/devicetree/bindings/arm/cci.txt | 224
-rw-r--r--  Documentation/devicetree/bindings/arm/coresight-cpu-debug.txt | 49
-rw-r--r--  Documentation/devicetree/bindings/arm/coresight-cti.yaml | 332
-rw-r--r--  Documentation/devicetree/bindings/arm/coresight.txt | 397
-rw-r--r--  Documentation/devicetree/bindings/arm/cpus.yaml | 46
-rw-r--r--  Documentation/devicetree/bindings/arm/digicolor.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/ete.yaml | 75
-rw-r--r--  Documentation/devicetree/bindings/arm/firmware/linaro,optee-tz.yaml | 11
-rw-r--r--  Documentation/devicetree/bindings/arm/firmware/tlm,trusted-foundations.txt | 20
-rw-r--r--  Documentation/devicetree/bindings/arm/firmware/tlm,trusted-foundations.yaml | 46
-rw-r--r--  Documentation/devicetree/bindings/arm/freescale/fsl,layerscape-dcfg.txt | 19
-rw-r--r--  Documentation/devicetree/bindings/arm/freescale/fsl,layerscape-scfg.txt | 19
-rw-r--r--  Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt | 270
-rw-r--r--  Documentation/devicetree/bindings/arm/fsl.yaml | 519
-rw-r--r--  Documentation/devicetree/bindings/arm/fw-cfg.txt | 38
-rw-r--r--  Documentation/devicetree/bindings/arm/hisilicon/controller/hip04-bootwrapper.yaml | 5
-rw-r--r--  Documentation/devicetree/bindings/arm/hisilicon/hisilicon.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/hpe,gxp.yaml | 27
-rw-r--r--  Documentation/devicetree/bindings/arm/idle-states.yaml | 661
-rw-r--r--  Documentation/devicetree/bindings/arm/intel,keembay.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/intel,socfpga.yaml | 27
-rw-r--r--  Documentation/devicetree/bindings/arm/intel-ixp4xx.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/keystone/ti,k3-sci-common.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/keystone/ti,sci.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/linux,dummy-virt.yaml | 20
-rw-r--r--  Documentation/devicetree/bindings/arm/marvell/ap80x-system-controller.txt | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt | 26
-rw-r--r--  Documentation/devicetree/bindings/arm/marvell/armada-37xx.yaml | 47
-rw-r--r--  Documentation/devicetree/bindings/arm/marvell/armada-7k-8k.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/marvell/cp110-system-controller.txt | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/marvell/marvell,ac5.yaml | 32
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek.yaml | 68
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,apmixedsys.txt | 34
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,ethsys.txt | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.txt | 41
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.yaml | 84
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml | 59
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-pcie-mirror.yaml | 42
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-wed.yaml | 104
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7986-wed-pcie.yaml | 43
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8186-clock.yaml | 56
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8186-sys-clock.yaml | 57
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8192-clock.yaml | 12
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8192-sys-clock.yaml | 7
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8195-clock.yaml | 238
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8195-sys-clock.yaml | 76
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml | 7
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt | 23
-rw-r--r--  Documentation/devicetree/bindings/arm/mediatek/mediatek,topckgen.txt | 34
-rw-r--r--  Documentation/devicetree/bindings/arm/microchip,sparx5.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/moxart.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/mrvl/mrvl.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/msm/qcom,kpss-acc.txt | 49
-rw-r--r--  Documentation/devicetree/bindings/arm/msm/qcom,kpss-gcc.txt | 44
-rw-r--r--  Documentation/devicetree/bindings/arm/msm/qcom,llcc.yaml | 60
-rw-r--r--  Documentation/devicetree/bindings/arm/msm/qcom,saw2.txt | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/mstar/mstar.yaml | 6
-rw-r--r--  Documentation/devicetree/bindings/arm/npcm/npcm.yaml | 9
-rw-r--r--  Documentation/devicetree/bindings/arm/npcm/nuvoton,gcr.yaml | 50
-rw-r--r--  Documentation/devicetree/bindings/arm/nvidia,tegra194-ccplex.yaml | 8
-rw-r--r--  Documentation/devicetree/bindings/arm/nxp/lpc32xx.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/omap/omap.txt | 3
-rw-r--r--  Documentation/devicetree/bindings/arm/omap/prcm.txt | 7
-rw-r--r--  Documentation/devicetree/bindings/arm/oxnas.txt | 14
-rw-r--r--  Documentation/devicetree/bindings/arm/pmu.yaml | 14
-rw-r--r--  Documentation/devicetree/bindings/arm/psci.yaml | 13
-rw-r--r--  Documentation/devicetree/bindings/arm/qcom,coresight-tpda.yaml | 129
-rw-r--r--  Documentation/devicetree/bindings/arm/qcom,coresight-tpdm.yaml | 93
-rw-r--r--  Documentation/devicetree/bindings/arm/qcom-soc.yaml | 66
-rw-r--r--  Documentation/devicetree/bindings/arm/qcom.yaml | 876
-rw-r--r--  Documentation/devicetree/bindings/arm/rda.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/realtek.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/renesas.yaml | 353
-rw-r--r--  Documentation/devicetree/bindings/arm/rockchip.yaml | 263
-rw-r--r--  Documentation/devicetree/bindings/arm/rockchip/pmu.yaml | 10
-rw-r--r--  Documentation/devicetree/bindings/arm/samsung/exynos-chipid.yaml | 40
-rw-r--r--  Documentation/devicetree/bindings/arm/samsung/pmu.yaml | 128
-rw-r--r--  Documentation/devicetree/bindings/arm/samsung/samsung-boards.yaml | 21
-rw-r--r--  Documentation/devicetree/bindings/arm/samsung/samsung-soc.yaml | 40
-rw-r--r--  Documentation/devicetree/bindings/arm/socionext/milbeaut.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/socionext/uniphier.yaml | 8
-rw-r--r--  Documentation/devicetree/bindings/arm/sp810.txt | 46
-rw-r--r--  Documentation/devicetree/bindings/arm/sp810.yaml | 80
-rw-r--r--  Documentation/devicetree/bindings/arm/spe-pmu.txt | 20
-rw-r--r--  Documentation/devicetree/bindings/arm/spear.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/sprd/sprd.yaml | 7
-rw-r--r--  Documentation/devicetree/bindings/arm/sti.yaml | 4
-rw-r--r--  Documentation/devicetree/bindings/arm/stm32/st,mlahb.yaml | 6
-rw-r--r--  Documentation/devicetree/bindings/arm/stm32/st,stm32-syscon.yaml | 8
-rw-r--r--  Documentation/devicetree/bindings/arm/stm32/stm32.yaml | 86
-rw-r--r--  Documentation/devicetree/bindings/arm/sunplus,sp7021.yaml | 29
-rw-r--r--  Documentation/devicetree/bindings/arm/sunxi.yaml | 45
-rw-r--r--  Documentation/devicetree/bindings/arm/sunxi/allwinner,sun4i-a10-mbus.yaml | 102
-rw-r--r--  Documentation/devicetree/bindings/arm/sunxi/allwinner,sun6i-a31-cpuconfig.yaml | 38
-rw-r--r--  Documentation/devicetree/bindings/arm/sunxi/allwinner,sun9i-a80-prcm.yaml | 33
-rw-r--r--  Documentation/devicetree/bindings/arm/swir.txt | 12
-rw-r--r--  Documentation/devicetree/bindings/arm/syna.txt | 4
-rw-r--r--  Documentation/devicetree/bindings/arm/tegra.yaml | 59
-rw-r--r--  Documentation/devicetree/bindings/arm/tegra/nvidia,tegra-ccplex-cluster.yaml | 50
-rw-r--r--  Documentation/devicetree/bindings/arm/tegra/nvidia,tegra186-pmc.txt | 133
-rw-r--r--  Documentation/devicetree/bindings/arm/tegra/nvidia,tegra186-pmc.yaml | 198
-rw-r--r--  Documentation/devicetree/bindings/arm/tegra/nvidia,tegra194-axi2apb.yaml | 40
-rw-r--r--  Documentation/devicetree/bindings/arm/tegra/nvidia,tegra194-cbb.yaml | 97
-rw-r--r--  Documentation/devicetree/bindings/arm/tegra/nvidia,tegra20-pmc.yaml | 61
-rw-r--r--  Documentation/devicetree/bindings/arm/tegra/nvidia,tegra234-cbb.yaml | 74
-rw-r--r--  Documentation/devicetree/bindings/arm/tesla.yaml | 27
-rw-r--r--  Documentation/devicetree/bindings/arm/ti/k3.yaml | 67
-rw-r--r--  Documentation/devicetree/bindings/arm/ti/ti,davinci.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/toshiba.yaml | 3
-rw-r--r--  Documentation/devicetree/bindings/arm/trbe.yaml | 49
-rw-r--r--  Documentation/devicetree/bindings/arm/ux500.yaml | 37
-rw-r--r--  Documentation/devicetree/bindings/arm/versatile-sysreg.txt | 10
-rw-r--r--  Documentation/devicetree/bindings/arm/vexpress-config.yaml | 285
-rw-r--r--  Documentation/devicetree/bindings/arm/vexpress-sysreg.txt | 103
-rw-r--r--  Documentation/devicetree/bindings/arm/vexpress-sysreg.yaml | 96
-rw-r--r--  Documentation/devicetree/bindings/arm/vt8500.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/arm/xen.txt | 14
-rw-r--r--  Documentation/devicetree/bindings/arm/xilinx.yaml | 19
-rw-r--r--  Documentation/devicetree/bindings/ata/ahci-ceva.txt | 63
-rw-r--r--  Documentation/devicetree/bindings/ata/ahci-common.yaml | 123
-rw-r--r--  Documentation/devicetree/bindings/ata/ahci-platform.txt | 79
-rw-r--r--  Documentation/devicetree/bindings/ata/ahci-platform.yaml | 176
-rw-r--r--  Documentation/devicetree/bindings/ata/allwinner,sun4i-a10-ahci.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/ata/allwinner,sun8i-r40-ahci.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/ata/ata-generic.yaml | 58
-rw-r--r--  Documentation/devicetree/bindings/ata/baikal,bt1-ahci.yaml | 115
-rw-r--r--  Documentation/devicetree/bindings/ata/brcm,sata-brcm.txt | 45
-rw-r--r--  Documentation/devicetree/bindings/ata/brcm,sata-brcm.yaml | 87
-rw-r--r--  Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml | 189
-rw-r--r--  Documentation/devicetree/bindings/ata/cortina,gemini-sata-bridge.txt | 55
-rw-r--r--  Documentation/devicetree/bindings/ata/cortina,gemini-sata-bridge.yaml | 107
-rw-r--r--  Documentation/devicetree/bindings/ata/intel,ixp4xx-compact-flash.yaml | 1
-rw-r--r--  Documentation/devicetree/bindings/ata/renesas,rcar-sata.yaml | 5
-rw-r--r--  Documentation/devicetree/bindings/ata/sata-common.yaml | 17
-rw-r--r--  Documentation/devicetree/bindings/ata/sata_highbank.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/ata/snps,dwc-ahci-common.yaml | 102
-rw-r--r--  Documentation/devicetree/bindings/ata/snps,dwc-ahci.yaml | 75
-rw-r--r--  Documentation/devicetree/bindings/auxdisplay/holtek,ht16k33.yaml | 36
-rw-r--r--  Documentation/devicetree/bindings/bus/allwinner,sun50i-a64-de2.yaml | 8
-rw-r--r--  Documentation/devicetree/bindings/bus/allwinner,sun8i-a23-rsb.yaml | 3
-rw-r--r--  Documentation/devicetree/bindings/bus/aspeed,ast2600-ahbc.yaml | 37
-rw-r--r--  Documentation/devicetree/bindings/bus/brcm,gisb-arb.txt | 34
-rw-r--r--  Documentation/devicetree/bindings/bus/brcm,gisb-arb.yaml | 66
-rw-r--r--  Documentation/devicetree/bindings/bus/fsl,imx8qxp-pixel-link-msi-bus.yaml | 232
-rw-r--r--  Documentation/devicetree/bindings/bus/fsl,spba-bus.yaml | 68
-rw-r--r--  Documentation/devicetree/bindings/bus/imx-weim.txt | 5
-rw-r--r--  Documentation/devicetree/bindings/bus/intel,ixp4xx-expansion-bus-controller.yaml | 168
-rw-r--r--  Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml | 54
-rw-r--r--  Documentation/devicetree/bindings/bus/nvidia,tegra210-aconnect.yaml | 5
-rw-r--r--  Documentation/devicetree/bindings/bus/palmbus.yaml | 80
-rw-r--r--  Documentation/devicetree/bindings/bus/qcom,ssc-block-bus.yaml | 144
-rw-r--r--  Documentation/devicetree/bindings/bus/ti-sysc.txt | 139
-rw-r--r--  Documentation/devicetree/bindings/bus/ti-sysc.yaml | 215
-rw-r--r--  Documentation/devicetree/bindings/bus/xlnx,versal-net-cdx.yaml | 82
-rw-r--r--  Documentation/devicetree/bindings/cache/baikal,bt1-l2-ctl.yaml (renamed from Documentation/devicetree/bindings/memory-controllers/baikal,bt1-l2-ctl.yaml) | 2
-rw-r--r--  Documentation/devicetree/bindings/cache/freescale-l2cache.txt | 55
-rw-r--r--  Documentation/devicetree/bindings/cache/l2c2x0.yaml (renamed from Documentation/devicetree/bindings/arm/l2c2x0.yaml) | 2
-rw-r--r--  Documentation/devicetree/bindings/cache/marvell,feroceon-cache.txt (renamed from Documentation/devicetree/bindings/arm/mrvl/feroceon.txt) | 0
-rw-r--r--  Documentation/devicetree/bindings/cache/marvell,tauros2-cache.txt (renamed from Documentation/devicetree/bindings/arm/mrvl/tauros2.txt) | 0
-rw-r--r--  Documentation/devicetree/bindings/cache/qcom,llcc.yaml | 168
-rw-r--r--  Documentation/devicetree/bindings/cache/sifive,ccache0.yaml | 170
-rw-r--r--  Documentation/devicetree/bindings/cache/socionext,uniphier-system-cache.yaml (renamed from Documentation/devicetree/bindings/arm/socionext/socionext,uniphier-system-cache.yaml) | 3
-rw-r--r--  Documentation/devicetree/bindings/chosen.txt | 137
-rw-r--r--  Documentation/devicetree/bindings/chrome/google,cros-ec-typec.yaml | 18
-rw-r--r--  Documentation/devicetree/bindings/chrome/google,cros-kbd-led-backlight.yaml | 36
-rw-r--r--  Documentation/devicetree/bindings/clock/adi,axi-clkgen.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/airoha,en7523-scu.yaml | 58
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ahb-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-apb0-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-apb1-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-axi-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ccu.yaml | 6
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-cpu-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-display-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-gates-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mbus-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mmc-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mod0-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mod1-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-osc-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll1-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll3-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll5-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll6-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-tcon-ch0-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-usb-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ve-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun5i-a13-ahb-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun6i-a31-pll6-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun7i-a20-gmac-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun7i-a20-out-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun8i-a83t-de2-clk.yaml | 7
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun8i-h3-bus-gates-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-ahb-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-apb0-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-cpus-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-de-clks.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-gt-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-mmc-config-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-pll4-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-clks.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-mod-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-phy-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/amlogic,meson8-ddr-clkc.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/apple,nco.yaml | 63
-rw-r--r--  Documentation/devicetree/bindings/clock/arm,syscon-icst.yaml | 9
-rw-r--r--  Documentation/devicetree/bindings/clock/bitmain,bm1880-clk.yaml | 12
-rw-r--r--  Documentation/devicetree/bindings/clock/brcm,bcm2711-dvp.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/brcm,bcm63268-timer-clocks.yaml | 40
-rw-r--r--  Documentation/devicetree/bindings/clock/calxeda.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/canaan,k210-clk.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/cirrus,cs2000-cp.yaml | 90
-rw-r--r--  Documentation/devicetree/bindings/clock/clock-bindings.txt | 188
-rw-r--r--  Documentation/devicetree/bindings/clock/cs2000-cp.txt | 22
-rw-r--r--  Documentation/devicetree/bindings/clock/efm32-clock.txt | 11
-rw-r--r--  Documentation/devicetree/bindings/clock/exynos5260-clock.txt | 190
-rw-r--r--  Documentation/devicetree/bindings/clock/exynos5410-clock.txt | 50
-rw-r--r--  Documentation/devicetree/bindings/clock/exynos5433-clock.txt | 507
-rw-r--r--  Documentation/devicetree/bindings/clock/exynos7-clock.txt | 108
-rw-r--r--  Documentation/devicetree/bindings/clock/fixed-clock.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/fixed-factor-clock.yaml | 3
-rw-r--r--  Documentation/devicetree/bindings/clock/fixed-mmio-clock.txt | 24
-rw-r--r--  Documentation/devicetree/bindings/clock/fixed-mmio-clock.yaml | 47
-rw-r--r--  Documentation/devicetree/bindings/clock/fsl,imx8m-anatop.yaml | 51
-rw-r--r--  Documentation/devicetree/bindings/clock/fsl,plldig.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/fsl,sai-clock.yaml | 2
-rw-r--r--  Documentation/devicetree/bindings/clock/fsl,scu-clk.yaml | 43
-rw-r--r--  Documentation/devicetree/bindings/clock/gpio-gate-clock.txt | 21
-rw-r--r--  Documentation/devicetree/bindings/clock/gpio-gate-clock.yaml | 42
-rw-r--r--  Documentation/devicetree/bindings/clock/idt,versaclock5.yaml | 18
-rw-r--r--  Documentation/devicetree/bindings/clock/imx1-clock.yaml | 11
-rw-r--r--  Documentation/devicetree/bindings/clock/imx21-clock.yaml | 11
-rw-r--r--  Documentation/devicetree/bindings/clock/imx23-clock.yaml | 11
-rw-r--r--  Documentation/devicetree/bindings/clock/imx25-clock.yaml | 10
-rw-r--r--  Documentation/devicetree/bindings/clock/imx27-clock.yaml | 11
-rw-r--r--  Documentation/devicetree/bindings/clock/imx28-clock.yaml | 11
-rw-r--r--  Documentation/devicetree/bindings/clock/imx31-clock.yaml | 10
-rw-r--r--  Documentation/devicetree/bindings/clock/imx35-clock.yaml | 10
-rw-r--r--Documentation/devicetree/bindings/clock/imx5-clock.yaml11
-rw-r--r--Documentation/devicetree/bindings/clock/imx6q-clock.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/imx6sl-clock.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/imx6sll-clock.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/imx6sx-clock.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/imx6ul-clock.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/imx7d-clock.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/imx7ulp-pcc-clock.yaml13
-rw-r--r--Documentation/devicetree/bindings/clock/imx7ulp-scg-clock.yaml13
-rw-r--r--Documentation/devicetree/bindings/clock/imx8m-clock.yaml8
-rw-r--r--Documentation/devicetree/bindings/clock/imx8mp-audiomix.yaml79
-rw-r--r--Documentation/devicetree/bindings/clock/imx8qxp-lpcg.yaml13
-rw-r--r--Documentation/devicetree/bindings/clock/imx8ulp-cgc-clock.yaml43
-rw-r--r--Documentation/devicetree/bindings/clock/imx8ulp-pcc-clock.yaml50
-rw-r--r--Documentation/devicetree/bindings/clock/imx93-clock.yaml62
-rw-r--r--Documentation/devicetree/bindings/clock/imxrt1050-clock.yaml59
-rw-r--r--Documentation/devicetree/bindings/clock/ingenic,cgu.yaml6
-rw-r--r--Documentation/devicetree/bindings/clock/intc_stratix10.txt20
-rw-r--r--Documentation/devicetree/bindings/clock/intel,agilex.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/intel,cgu-lgm.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/intel,easic-n5x.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/intel,stratix10.yaml35
-rw-r--r--Documentation/devicetree/bindings/clock/loongson,ls1x-clk.yaml45
-rw-r--r--Documentation/devicetree/bindings/clock/loongson,ls2k-clk.yaml63
-rw-r--r--Documentation/devicetree/bindings/clock/marvell,armada-3700-uart-clock.yaml59
-rw-r--r--Documentation/devicetree/bindings/clock/maxim,max77686.txt4
-rw-r--r--Documentation/devicetree/bindings/clock/mediatek,apmixedsys.yaml63
-rw-r--r--Documentation/devicetree/bindings/clock/mediatek,mt6795-clock.yaml66
-rw-r--r--Documentation/devicetree/bindings/clock/mediatek,mt6795-sys-clock.yaml54
-rw-r--r--Documentation/devicetree/bindings/clock/mediatek,mt7621-sysc.yaml14
-rw-r--r--Documentation/devicetree/bindings/clock/mediatek,mt8186-fhctl.yaml58
-rw-r--r--Documentation/devicetree/bindings/clock/mediatek,mt8188-clock.yaml71
-rw-r--r--Documentation/devicetree/bindings/clock/mediatek,mt8188-sys-clock.yaml55
-rw-r--r--Documentation/devicetree/bindings/clock/mediatek,mt8365-clock.yaml42
-rw-r--r--Documentation/devicetree/bindings/clock/mediatek,mt8365-sys-clock.yaml47
-rw-r--r--Documentation/devicetree/bindings/clock/mediatek,topckgen.yaml63
-rw-r--r--Documentation/devicetree/bindings/clock/microchip,lan966x-gck.yaml60
-rw-r--r--Documentation/devicetree/bindings/clock/microchip,mpfs-ccc.yaml80
-rw-r--r--Documentation/devicetree/bindings/clock/microchip,mpfs-clkcfg.yaml80
-rw-r--r--Documentation/devicetree/bindings/clock/milbeaut-clock.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/mstar,msc313-cpupll.yaml45
-rw-r--r--Documentation/devicetree/bindings/clock/nuvoton,npcm845-clk.yaml49
-rw-r--r--Documentation/devicetree/bindings/clock/nvidia,tegra124-car.yaml7
-rw-r--r--Documentation/devicetree/bindings/clock/nvidia,tegra124-dfll.txt2
-rw-r--r--Documentation/devicetree/bindings/clock/nvidia,tegra20-car.yaml42
-rw-r--r--Documentation/devicetree/bindings/clock/pwm-clock.txt26
-rw-r--r--Documentation/devicetree/bindings/clock/pwm-clock.yaml45
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,a53pll.yaml10
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,a7pll.yaml4
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,aoncc-sm8250.yaml11
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,audiocc-sm8250.yaml7
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,camcc-sm8250.yaml26
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,camcc.txt18
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,dispcc-sc8280xp.yaml97
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,dispcc-sm6125.yaml86
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,dispcc-sm6350.yaml86
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,dispcc-sm8x50.yaml27
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-apq8064.yaml84
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-apq8084.yaml86
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-ipq4019.yaml53
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-ipq8064.yaml81
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-ipq8074.yaml37
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-msm8660.yaml54
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-msm8909.yaml62
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-msm8916.yaml66
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-msm8974.yaml61
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-msm8976.yaml83
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-msm8994.yaml56
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-msm8996.yaml48
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-msm8998.yaml58
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-other.yaml50
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-qcm2290.yaml54
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-qcs404.yaml47
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sc7180.yaml34
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sc7280.yaml30
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sc8180x.yaml34
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sc8280xp.yaml121
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sdm660.yaml61
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sdm845.yaml92
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sdx55.yaml37
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sdx65.yaml62
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sm6115.yaml34
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sm6125.yaml34
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sm6350.yaml34
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sm8150.yaml34
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sm8250.yaml34
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sm8350.yaml32
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc-sm8450.yaml71
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gcc.yaml61
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gpucc-sdm660.yaml4
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,gpucc.yaml26
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,ipq5332-gcc.yaml53
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,ipq9574-gcc.yaml61
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,kpss-acc-v1.yaml72
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,kpss-gcc.yaml88
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,lcc.txt22
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,lcc.yaml86
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,lpasscc.txt26
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,mmcc.yaml290
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,msm8996-apcc.yaml17
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,msm8996-cbf.yaml53
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,msm8998-gpucc.yaml8
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,q6sstopcc.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,qcm2290-dispcc.yaml87
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,qdu1000-gcc.yaml51
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,rpmcc.txt62
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,rpmcc.yaml160
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,rpmhcc.yaml9
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sa8775p-gcc.yaml84
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sc7180-camcc.yaml9
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sc7180-dispcc.yaml8
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sc7180-lpasscorecc.yaml9
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sc7180-mss.yaml7
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sc7280-camcc.yaml71
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sc7280-dispcc.yaml8
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sc7280-lpasscc.yaml72
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sc7280-lpasscorecc.yaml192
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sdm845-camcc.yaml65
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sdm845-dispcc.yaml8
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sdm845-lpasscc.yaml47
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm6115-dispcc.yaml69
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm6115-gpucc.yaml58
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm6125-gpucc.yaml64
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm6350-camcc.yaml49
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm6375-dispcc.yaml54
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm6375-gcc.yaml51
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm6375-gpucc.yaml60
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm7150-gcc.yaml52
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm8450-camcc.yaml81
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm8450-dispcc.yaml97
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm8550-dispcc.yaml105
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm8550-gcc.yaml62
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,sm8550-tcsr.yaml55
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,spmi-clkdiv.txt59
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,spmi-clkdiv.yaml71
-rw-r--r--Documentation/devicetree/bindings/clock/qcom,videocc.yaml90
-rw-r--r--Documentation/devicetree/bindings/clock/qoriq-clock.txt1
-rw-r--r--Documentation/devicetree/bindings/clock/renesas,9series.yaml103
-rw-r--r--Documentation/devicetree/bindings/clock/renesas,cpg-div6-clock.yaml12
-rw-r--r--Documentation/devicetree/bindings/clock/renesas,cpg-mssr.yaml6
-rw-r--r--Documentation/devicetree/bindings/clock/renesas,h8300-div-clock.txt24
-rw-r--r--Documentation/devicetree/bindings/clock/renesas,h8s2678-pll-clock.txt23
-rw-r--r--Documentation/devicetree/bindings/clock/renesas,r9a06g032-sysctrl.yaml13
-rw-r--r--Documentation/devicetree/bindings/clock/renesas,rcar-usb2-clock-sel.yaml6
-rw-r--r--Documentation/devicetree/bindings/clock/renesas,rzg2l-cpg.yaml23
-rw-r--r--Documentation/devicetree/bindings/clock/renesas,versaclock7.yaml64
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,px30-cru.txt70
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,px30-cru.yaml119
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3036-cru.txt56
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3036-cru.yaml72
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3128-cru.txt58
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3128-cru.yaml76
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3188-cru.txt61
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3188-cru.yaml78
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3228-cru.txt58
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3228-cru.yaml74
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3288-cru.txt67
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3288-cru.yaml85
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3308-cru.txt60
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3308-cru.yaml76
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3368-cru.txt61
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3368-cru.yaml78
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3399-cru.yaml35
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3568-cru.yaml15
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rk3588-cru.yaml71
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rv1108-cru.txt59
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rv1108-cru.yaml75
-rw-r--r--Documentation/devicetree/bindings/clock/rockchip,rv1126-cru.yaml62
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,exynos-audss-clock.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,exynos-clock.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,exynos-ext-clock.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,exynos4412-isp-clock.yaml3
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,exynos5260-clock.yaml382
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,exynos5410-clock.yaml66
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,exynos5433-clock.yaml524
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,exynos7-clock.yaml272
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,exynos7885-clock.yaml193
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,exynos850-clock.yaml311
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,exynosautov9-clock.yaml263
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,s2mps11.txt49
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,s2mps11.yaml44
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,s3c2410-clock.txt49
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,s3c2412-clock.txt49
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,s3c2443-clock.txt55
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,s5pv210-audss-clock.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,s5pv210-clock.txt77
-rw-r--r--Documentation/devicetree/bindings/clock/samsung,s5pv210-clock.yaml79
-rw-r--r--Documentation/devicetree/bindings/clock/sifive/fu540-prci.yaml1
-rw-r--r--Documentation/devicetree/bindings/clock/sifive/fu740-prci.yaml4
-rw-r--r--Documentation/devicetree/bindings/clock/silabs,si5351.txt2
-rw-r--r--Documentation/devicetree/bindings/clock/skyworks,si521xx.yaml59
-rw-r--r--Documentation/devicetree/bindings/clock/socionext,uniphier-clock.yaml45
-rw-r--r--Documentation/devicetree/bindings/clock/sprd,sc9863a-clk.yaml6
-rw-r--r--Documentation/devicetree/bindings/clock/sprd,ums512-clk.yaml71
-rw-r--r--Documentation/devicetree/bindings/clock/st,stm32mp1-rcc.yaml48
-rw-r--r--Documentation/devicetree/bindings/clock/st/st,flexgen.txt2
-rw-r--r--Documentation/devicetree/bindings/clock/starfive,jh7100-audclk.yaml57
-rw-r--r--Documentation/devicetree/bindings/clock/starfive,jh7100-clkgen.yaml56
-rw-r--r--Documentation/devicetree/bindings/clock/starfive,jh7110-aoncrg.yaml107
-rw-r--r--Documentation/devicetree/bindings/clock/starfive,jh7110-syscrg.yaml104
-rw-r--r--Documentation/devicetree/bindings/clock/stericsson,u8500-clks.yaml178
-rw-r--r--Documentation/devicetree/bindings/clock/sunplus,sp7021-clkc.yaml52
-rw-r--r--Documentation/devicetree/bindings/clock/tesla,fsd-clock.yaml198
-rw-r--r--Documentation/devicetree/bindings/clock/ti,am654-ehrpwm-tbclk.yaml1
-rw-r--r--Documentation/devicetree/bindings/clock/ti,cdce925.txt53
-rw-r--r--Documentation/devicetree/bindings/clock/ti,cdce925.yaml103
-rw-r--r--Documentation/devicetree/bindings/clock/ti,lmk04832.yaml4
-rw-r--r--Documentation/devicetree/bindings/clock/ti,sci-clk.yaml2
-rw-r--r--Documentation/devicetree/bindings/clock/ti-clkctrl.txt4
-rw-r--r--Documentation/devicetree/bindings/clock/ti/clockdomain.txt3
-rw-r--r--Documentation/devicetree/bindings/clock/ti/composite.txt3
-rw-r--r--Documentation/devicetree/bindings/clock/ti/davinci/pll.txt2
-rw-r--r--Documentation/devicetree/bindings/clock/ti/dra7-atl.txt2
-rw-r--r--Documentation/devicetree/bindings/clock/ti/fixed-factor-clock.txt1
-rw-r--r--Documentation/devicetree/bindings/clock/ti/gate.txt3
-rw-r--r--Documentation/devicetree/bindings/clock/ti/interface.txt3
-rw-r--r--Documentation/devicetree/bindings/clock/ti/mux.txt1
-rw-r--r--Documentation/devicetree/bindings/clock/ti/ti,clksel.yaml51
-rw-r--r--Documentation/devicetree/bindings/clock/toshiba,tmpv770x-pipllct.yaml57
-rw-r--r--Documentation/devicetree/bindings/clock/toshiba,tmpv770x-pismu.yaml52
-rw-r--r--Documentation/devicetree/bindings/clock/ux500.txt64
-rw-r--r--Documentation/devicetree/bindings/clock/xlnx,clocking-wizard.yaml77
-rw-r--r--Documentation/devicetree/bindings/connector/usb-connector.yaml155
-rw-r--r--Documentation/devicetree/bindings/counter/ti,am62-ecap-capture.yaml61
-rw-r--r--Documentation/devicetree/bindings/cpu/cpu-capacity.txt (renamed from Documentation/devicetree/bindings/arm/cpu-capacity.txt)8
-rw-r--r--Documentation/devicetree/bindings/cpu/idle-states.yaml855
-rw-r--r--Documentation/devicetree/bindings/cpufreq/apple,cluster-cpufreq.yaml117
-rw-r--r--Documentation/devicetree/bindings/cpufreq/brcm,stb-avs-cpu-freq.txt2
-rw-r--r--Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek-hw.yaml2
-rw-r--r--Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt7
-rw-r--r--Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt172
-rw-r--r--Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.yaml362
-rw-r--r--Documentation/devicetree/bindings/cpufreq/qcom-cpufreq-nvmem.yaml204
-rw-r--r--Documentation/devicetree/bindings/crypto/allwinner,sun4i-a10-crypto.yaml12
-rw-r--r--Documentation/devicetree/bindings/crypto/allwinner,sun8i-ce.yaml34
-rw-r--r--Documentation/devicetree/bindings/crypto/aspeed,ast2500-hace.yaml53
-rw-r--r--Documentation/devicetree/bindings/crypto/aspeed,ast2600-acry.yaml49
-rw-r--r--Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-aes.yaml66
-rw-r--r--Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-sha.yaml60
-rw-r--r--Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-tdes.yaml64
-rw-r--r--Documentation/devicetree/bindings/crypto/atmel-crypto.txt68
-rw-r--r--Documentation/devicetree/bindings/crypto/fsl,sec-v4.0-mon.yaml156
-rw-r--r--Documentation/devicetree/bindings/crypto/fsl,sec-v4.0.yaml266
-rw-r--r--Documentation/devicetree/bindings/crypto/fsl-sec4.txt553
-rw-r--r--Documentation/devicetree/bindings/crypto/intel,ixp4xx-crypto.yaml15
-rw-r--r--Documentation/devicetree/bindings/crypto/intel,keembay-ocs-aes.yaml2
-rw-r--r--Documentation/devicetree/bindings/crypto/intel,keembay-ocs-ecc.yaml47
-rw-r--r--Documentation/devicetree/bindings/crypto/intel,keembay-ocs-hcu.yaml2
-rw-r--r--Documentation/devicetree/bindings/crypto/qcom,inline-crypto-engine.yaml42
-rw-r--r--Documentation/devicetree/bindings/crypto/qcom,prng.txt19
-rw-r--r--Documentation/devicetree/bindings/crypto/qcom,prng.yaml43
-rw-r--r--Documentation/devicetree/bindings/crypto/qcom-qce.txt25
-rw-r--r--Documentation/devicetree/bindings/crypto/qcom-qce.yaml123
-rw-r--r--Documentation/devicetree/bindings/crypto/rockchip,rk3288-crypto.yaml127
-rw-r--r--Documentation/devicetree/bindings/crypto/rockchip-crypto.txt28
-rw-r--r--Documentation/devicetree/bindings/crypto/samsung-slimsss.yaml1
-rw-r--r--Documentation/devicetree/bindings/crypto/st,stm32-crc.yaml4
-rw-r--r--Documentation/devicetree/bindings/crypto/st,stm32-cryp.yaml23
-rw-r--r--Documentation/devicetree/bindings/crypto/st,stm32-hash.yaml27
-rw-r--r--Documentation/devicetree/bindings/crypto/ti,sa2ul.yaml18
-rw-r--r--Documentation/devicetree/bindings/crypto/xlnx,zynqmp-aes.yaml2
-rw-r--r--Documentation/devicetree/bindings/ddr/lpddr2-timings.txt52
-rw-r--r--Documentation/devicetree/bindings/ddr/lpddr2.txt102
-rw-r--r--Documentation/devicetree/bindings/ddr/lpddr3-timings.txt58
-rw-r--r--Documentation/devicetree/bindings/ddr/lpddr3.txt106
-rw-r--r--Documentation/devicetree/bindings/devfreq/event/samsung,exynos-nocp.yaml2
-rw-r--r--Documentation/devicetree/bindings/devfreq/event/samsung,exynos-ppmu.yaml2
-rw-r--r--Documentation/devicetree/bindings/devfreq/exynos-bus.txt488
-rw-r--r--Documentation/devicetree/bindings/devfreq/rk3399_dmc.txt212
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-backend.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-engine.yaml6
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-frontend.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun4i-a10-hdmi.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tcon.yaml8
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tv-encoder.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun6i-a31-drc.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun6i-a31-mipi-dsi.yaml32
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-de2-mixer.yaml4
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-dw-hdmi.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-hdmi-phy.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun8i-r40-tcon-top.yaml135
-rw-r--r--Documentation/devicetree/bindings/display/allwinner,sun9i-a80-deu.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/amlogic,meson-dw-hdmi.yaml12
-rw-r--r--Documentation/devicetree/bindings/display/amlogic,meson-vpu.yaml12
-rw-r--r--Documentation/devicetree/bindings/display/arm,hdlcd.txt79
-rw-r--r--Documentation/devicetree/bindings/display/arm,hdlcd.yaml89
-rw-r--r--Documentation/devicetree/bindings/display/arm,komeda.txt78
-rw-r--r--Documentation/devicetree/bindings/display/arm,komeda.yaml131
-rw-r--r--Documentation/devicetree/bindings/display/arm,malidp.txt68
-rw-r--r--Documentation/devicetree/bindings/display/arm,malidp.yaml119
-rw-r--r--Documentation/devicetree/bindings/display/arm,pl11x.txt110
-rw-r--r--Documentation/devicetree/bindings/display/arm,pl11x.yaml170
-rw-r--r--Documentation/devicetree/bindings/display/atmel,lcdc.txt1
-rw-r--r--Documentation/devicetree/bindings/display/brcm,bcm2711-hdmi.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/brcm,bcm2835-dsi0.yaml3
-rw-r--r--Documentation/devicetree/bindings/display/brcm,bcm2835-hdmi.yaml3
-rw-r--r--Documentation/devicetree/bindings/display/brcm,bcm2835-v3d.yaml3
-rw-r--r--Documentation/devicetree/bindings/display/brcm,bcm2835-vec.yaml3
-rw-r--r--Documentation/devicetree/bindings/display/bridge/adi,adv7511.yaml19
-rw-r--r--Documentation/devicetree/bindings/display/bridge/adi,adv7533.yaml14
-rw-r--r--Documentation/devicetree/bindings/display/bridge/analogix,anx7625.yaml91
-rw-r--r--Documentation/devicetree/bindings/display/bridge/analogix,anx7814.yaml4
-rw-r--r--Documentation/devicetree/bindings/display/bridge/analogix,dp.yaml63
-rw-r--r--Documentation/devicetree/bindings/display/bridge/analogix_dp.txt51
-rw-r--r--Documentation/devicetree/bindings/display/bridge/anx6345.yaml4
-rw-r--r--Documentation/devicetree/bindings/display/bridge/cdns,dsi.txt112
-rw-r--r--Documentation/devicetree/bindings/display/bridge/cdns,dsi.yaml180
-rw-r--r--Documentation/devicetree/bindings/display/bridge/cdns,mhdp8546.yaml4
-rw-r--r--Documentation/devicetree/bindings/display/bridge/chipone,icn6211.yaml28
-rw-r--r--Documentation/devicetree/bindings/display/bridge/chrontel,ch7033.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-ldb.yaml173
-rw-r--r--Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pixel-combiner.yaml144
-rw-r--r--Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pixel-link.yaml144
-rw-r--r--Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pxl2dpi.yaml108
-rw-r--r--Documentation/devicetree/bindings/display/bridge/fsl,ldb.yaml119
-rw-r--r--Documentation/devicetree/bindings/display/bridge/google,cros-ec-anx7688.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/bridge/ingenic,jz4780-hdmi.yaml81
-rw-r--r--Documentation/devicetree/bindings/display/bridge/intel,keembay-dsi.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/bridge/ite,it6505.yaml70
-rw-r--r--Documentation/devicetree/bindings/display/bridge/ite,it66121.yaml11
-rw-r--r--Documentation/devicetree/bindings/display/bridge/lontium,lt8912b.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/bridge/lontium,lt9211.yaml117
-rw-r--r--Documentation/devicetree/bindings/display/bridge/lvds-codec.yaml54
-rw-r--r--Documentation/devicetree/bindings/display/bridge/nxp,ptn3460.yaml106
-rw-r--r--Documentation/devicetree/bindings/display/bridge/nxp,tda998x.yaml109
-rw-r--r--Documentation/devicetree/bindings/display/bridge/parade,ps8622.yaml115
-rw-r--r--Documentation/devicetree/bindings/display/bridge/ps8622.txt31
-rw-r--r--Documentation/devicetree/bindings/display/bridge/ps8640.yaml25
-rw-r--r--Documentation/devicetree/bindings/display/bridge/ptn3460.txt39
-rw-r--r--Documentation/devicetree/bindings/display/bridge/renesas,dsi-csi2-tx.yaml119
-rw-r--r--Documentation/devicetree/bindings/display/bridge/renesas,dsi.yaml183
-rw-r--r--Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/bridge/renesas,lvds.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/bridge/samsung,mipi-dsim.yaml255
-rw-r--r--Documentation/devicetree/bindings/display/bridge/sii902x.txt78
-rw-r--r--Documentation/devicetree/bindings/display/bridge/sii9234.txt49
-rw-r--r--Documentation/devicetree/bindings/display/bridge/sil,sii8620.yaml108
-rw-r--r--Documentation/devicetree/bindings/display/bridge/sil,sii9022.yaml131
-rw-r--r--Documentation/devicetree/bindings/display/bridge/sil,sii9234.yaml110
-rw-r--r--Documentation/devicetree/bindings/display/bridge/sil-sii8620.txt33
-rw-r--r--Documentation/devicetree/bindings/display/bridge/snps,dw-mipi-dsi.yaml18
-rw-r--r--Documentation/devicetree/bindings/display/bridge/synopsys,dw-hdmi.yaml4
-rw-r--r--Documentation/devicetree/bindings/display/bridge/tda998x.txt54
-rw-r--r--Documentation/devicetree/bindings/display/bridge/ti,dlpc3433.yaml117
-rw-r--r--Documentation/devicetree/bindings/display/bridge/ti,sn65dsi83.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/bridge/ti,sn65dsi86.yaml6
-rw-r--r--Documentation/devicetree/bindings/display/bridge/toshiba,tc358762.yaml3
-rw-r--r--Documentation/devicetree/bindings/display/bridge/toshiba,tc358764.txt35
-rw-r--r--Documentation/devicetree/bindings/display/bridge/toshiba,tc358764.yaml89
-rw-r--r--Documentation/devicetree/bindings/display/bridge/toshiba,tc358767.txt54
-rw-r--r--Documentation/devicetree/bindings/display/bridge/toshiba,tc358767.yaml174
-rw-r--r--Documentation/devicetree/bindings/display/bridge/toshiba,tc358768.yaml10
-rw-r--r--Documentation/devicetree/bindings/display/bridge/toshiba,tc358775.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/dp-aux-bus.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/dsi-controller.yaml18
-rw-r--r--Documentation/devicetree/bindings/display/exynos/exynos-mic.txt51
-rw-r--r--Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt60
-rw-r--r--Documentation/devicetree/bindings/display/exynos/exynos7-decon.txt65
-rw-r--r--Documentation/devicetree/bindings/display/exynos/exynos_dp.txt2
-rw-r--r--Documentation/devicetree/bindings/display/exynos/exynos_dsim.txt90
-rw-r--r--Documentation/devicetree/bindings/display/exynos/exynos_hdmi.txt64
-rw-r--r--Documentation/devicetree/bindings/display/exynos/exynos_hdmiddc.txt15
-rw-r--r--Documentation/devicetree/bindings/display/exynos/exynos_hdmiphy.txt15
-rw-r--r--Documentation/devicetree/bindings/display/exynos/exynos_mixer.txt26
-rw-r--r--Documentation/devicetree/bindings/display/exynos/samsung-fimd.txt107
-rw-r--r--Documentation/devicetree/bindings/display/fsl,lcdif.yaml43
-rw-r--r--Documentation/devicetree/bindings/display/ilitek,ili9341.txt27
-rw-r--r--Documentation/devicetree/bindings/display/ilitek,ili9486.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/imx/fsl,imx-fb.txt57
-rw-r--r--Documentation/devicetree/bindings/display/imx/fsl,imx-lcdc.yaml146
-rw-r--r--Documentation/devicetree/bindings/display/imx/nxp,imx8mq-dcss.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/ingenic,ipu.yaml4
-rw-r--r--Documentation/devicetree/bindings/display/ingenic,lcd.yaml8
-rw-r--r--Documentation/devicetree/bindings/display/intel,keembay-display.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/intel,keembay-msscam.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/lvds.yaml (renamed from Documentation/devicetree/bindings/display/panel/lvds.yaml)35
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,aal.yaml90
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,ccorr.yaml85
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,cec.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,color.yaml93
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,disp.txt219
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,dither.yaml86
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,dp.yaml116
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml16
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,dsc.yaml80
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.txt56
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.yaml116
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,ethdr.yaml182
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,gamma.yaml86
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,hdmi-ddc.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,hdmi.yaml9
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,mdp-rdma.yaml88
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,merge.yaml103
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,od.yaml58
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,ovl-2l.yaml91
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,ovl.yaml103
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,postmask.yaml83
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,rdma.yaml117
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml65
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,ufoe.yaml68
-rw-r--r--Documentation/devicetree/bindings/display/mediatek/mediatek,wdma.yaml86
-rw-r--r--Documentation/devicetree/bindings/display/msm/dp-controller.yaml103
-rw-r--r--Documentation/devicetree/bindings/display/msm/dpu-common.yaml56
-rw-r--r--Documentation/devicetree/bindings/display/msm/dpu-sc7180.yaml228
-rw-r--r--Documentation/devicetree/bindings/display/msm/dpu-sdm845.yaml212
-rw-r--r--Documentation/devicetree/bindings/display/msm/dsi-controller-main.yaml300
-rw-r--r--Documentation/devicetree/bindings/display/msm/dsi-phy-10nm.yaml37
-rw-r--r--Documentation/devicetree/bindings/display/msm/dsi-phy-14nm.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/msm/dsi-phy-20nm.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/msm/dsi-phy-28nm.yaml7
-rw-r--r--Documentation/devicetree/bindings/display/msm/dsi-phy-7nm.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/msm/dsi-phy-common.yaml9
-rw-r--r--Documentation/devicetree/bindings/display/msm/edp.txt56
-rw-r--r--Documentation/devicetree/bindings/display/msm/gmu.yaml172
-rw-r--r--Documentation/devicetree/bindings/display/msm/gpu.txt157
-rw-r--r--Documentation/devicetree/bindings/display/msm/gpu.yaml294
-rw-r--r--Documentation/devicetree/bindings/display/msm/hdmi.txt99
-rw-r--r--Documentation/devicetree/bindings/display/msm/hdmi.yaml232
-rw-r--r--Documentation/devicetree/bindings/display/msm/mdp4.txt114
-rw-r--r--Documentation/devicetree/bindings/display/msm/mdp4.yaml124
-rw-r--r--Documentation/devicetree/bindings/display/msm/mdp5.txt160
-rw-r--r--Documentation/devicetree/bindings/display/msm/mdss-common.yaml90
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,mdp5.yaml156
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,mdss.yaml211
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,msm8998-dpu.yaml101
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,msm8998-mdss.yaml272
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,qcm2290-dpu.yaml90
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,qcm2290-mdss.yaml200
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sc7180-dpu.yaml101
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sc7180-mdss.yaml308
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sc7280-dpu.yaml105
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sc7280-mdss.yaml427
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sc8280xp-dpu.yaml122
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sc8280xp-mdss.yaml151
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sdm845-dpu.yaml96
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sdm845-mdss.yaml280
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm6115-dpu.yaml93
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm6115-mdss.yaml187
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm8150-dpu.yaml92
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm8150-mdss.yaml332
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm8250-dpu.yaml99
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm8250-mdss.yaml334
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm8350-dpu.yaml120
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm8350-mdss.yaml223
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm8450-dpu.yaml139
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm8450-mdss.yaml345
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm8550-dpu.yaml133
-rw-r--r--Documentation/devicetree/bindings/display/msm/qcom,sm8550-mdss.yaml333
-rw-r--r--Documentation/devicetree/bindings/display/panel/abt,y030xx067a.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/panel/advantech,idk-1110wr.yaml19
-rw-r--r--Documentation/devicetree/bindings/display/panel/arm,rtsm-display.yaml27
-rw-r--r--Documentation/devicetree/bindings/display/panel/arm,versatile-tft-panel.yaml3
-rw-r--r--Documentation/devicetree/bindings/display/panel/auo,a030jtn01.yaml60
-rw-r--r--Documentation/devicetree/bindings/display/panel/boe,bf060y8m-aj0.yaml81
-rw-r--r--Documentation/devicetree/bindings/display/panel/boe,tv101wum-nl6.yaml10
-rw-r--r--Documentation/devicetree/bindings/display/panel/display-timings.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/panel/ebbg,ft8719.yaml74
-rw-r--r--Documentation/devicetree/bindings/display/panel/elida,kd35t133.yaml9
-rw-r--r--Documentation/devicetree/bindings/display/panel/feiyang,fy07024di26a30d.yaml9
-rw-r--r--Documentation/devicetree/bindings/display/panel/focaltech,gpt3.yaml56
-rw-r--r--Documentation/devicetree/bindings/display/panel/himax,hx8394.yaml76
-rw-r--r--Documentation/devicetree/bindings/display/panel/ilitek,ili9163.yaml70
-rw-r--r--Documentation/devicetree/bindings/display/panel/ilitek,ili9322.yaml4
-rw-r--r--Documentation/devicetree/bindings/display/panel/ilitek,ili9341.yaml51
-rw-r--r--Documentation/devicetree/bindings/display/panel/ilitek,ili9881c.yaml6
-rw-r--r--Documentation/devicetree/bindings/display/panel/innolux,ee101ia-01d.yaml23
-rw-r--r--Documentation/devicetree/bindings/display/panel/innolux,ej030na.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/panel/innolux,p120zdg-bf1.yaml43
-rw-r--r--Documentation/devicetree/bindings/display/panel/jadard,jd9365da-h3.yaml72
-rw-r--r--Documentation/devicetree/bindings/display/panel/jdi,lt070me05000.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/panel/kingdisplay,kd035g6-54nt.yaml7
-rw-r--r--Documentation/devicetree/bindings/display/panel/leadtek,ltk035c5444t.yaml61
-rw-r--r--Documentation/devicetree/bindings/display/panel/leadtek,ltk050h3146w.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/panel/lg,lg4573.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/panel/lgphilips,lb035q02.yaml8
-rw-r--r--Documentation/devicetree/bindings/display/panel/mitsubishi,aa104xd12.yaml19
-rw-r--r--Documentation/devicetree/bindings/display/panel/mitsubishi,aa121td01.yaml19
-rw-r--r--Documentation/devicetree/bindings/display/panel/nec,nl8048hl11.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/panel/newvision,nv3051d.yaml63
-rw-r--r--Documentation/devicetree/bindings/display/panel/novatek,nt35950.yaml106
-rw-r--r--Documentation/devicetree/bindings/display/panel/novatek,nt36523.yaml85
-rw-r--r--Documentation/devicetree/bindings/display/panel/novatek,nt36672a.yaml8
-rw-r--r--Documentation/devicetree/bindings/display/panel/olimex,lcd-olinuxino.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/panel/orisetech,otm8009a.yaml3
-rw-r--r--Documentation/devicetree/bindings/display/panel/panel-edp.yaml188
-rw-r--r--Documentation/devicetree/bindings/display/panel/panel-lvds.yaml57
-rw-r--r--Documentation/devicetree/bindings/display/panel/panel-mipi-dbi-spi.yaml134
-rw-r--r--Documentation/devicetree/bindings/display/panel/panel-simple-dsi.yaml26
-rw-r--r--Documentation/devicetree/bindings/display/panel/panel-simple.yaml29
-rw-r--r--Documentation/devicetree/bindings/display/panel/panel-timing.yaml95
-rw-r--r--Documentation/devicetree/bindings/display/panel/raydium,rm67191.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/panel/raydium,rm68200.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/panel/ronbo,rb070d30.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/panel/samsung,ams495qa01.yaml57
-rw-r--r--Documentation/devicetree/bindings/display/panel/samsung,ld9040.yaml10
-rw-r--r--Documentation/devicetree/bindings/display/panel/samsung,lms380kf01.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/panel/samsung,lms397kf04.yaml3
-rw-r--r--Documentation/devicetree/bindings/display/panel/samsung,s6d27a1.yaml98
-rw-r--r--Documentation/devicetree/bindings/display/panel/samsung,s6e63m0.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/panel/samsung,s6e88a0-ams452ef01.yaml8
-rw-r--r--Documentation/devicetree/bindings/display/panel/samsung,s6e8aa0.yaml1
-rw-r--r--Documentation/devicetree/bindings/display/panel/seiko,43wvf1g.yaml4
-rw-r--r--Documentation/devicetree/bindings/display/panel/sgd,gktw70sdae4se.yaml21
-rw-r--r--Documentation/devicetree/bindings/display/panel/sharp,lq101r1sx01.yaml7
-rw-r--r--Documentation/devicetree/bindings/display/panel/sharp,ls060t1sx01.yaml56
-rw-r--r--Documentation/devicetree/bindings/display/panel/sitronix,st7701.yaml14
-rw-r--r--Documentation/devicetree/bindings/display/panel/sitronix,st7789v.yaml12
-rw-r--r--Documentation/devicetree/bindings/display/panel/sony,acx424akp.yaml11
-rw-r--r--Documentation/devicetree/bindings/display/panel/sony,acx565akm.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/panel/sony,td4353-jdi.yaml82
-rw-r--r--Documentation/devicetree/bindings/display/panel/sony,tulip-truly-nt35521.yaml72
-rw-r--r--Documentation/devicetree/bindings/display/panel/tpo,td.yaml8
-rw-r--r--Documentation/devicetree/bindings/display/panel/tpo,tpg110.yaml1
-rw-r--r--Documentation/devicetree/bindings/display/panel/visionox,rm69299.yaml27
-rw-r--r--Documentation/devicetree/bindings/display/panel/visionox,vtdr6130.yaml63
-rw-r--r--Documentation/devicetree/bindings/display/panel/xinpeng,xpp055c272.yaml8
-rw-r--r--Documentation/devicetree/bindings/display/renesas,du.yaml88
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/analogix_dp-rockchip.txt98
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt93
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/rockchip,analogix-dp.yaml103
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/rockchip,dw-hdmi.yaml46
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/rockchip,dw-mipi-dsi.yaml166
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/rockchip,lvds.yaml170
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/rockchip,rk3066-hdmi.yaml8
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/rockchip-drm.yaml2
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/rockchip-lvds.txt92
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/rockchip-vop2.yaml146
-rw-r--r--Documentation/devicetree/bindings/display/samsung/samsung,exynos-hdmi-ddc.yaml41
-rw-r--r--Documentation/devicetree/bindings/display/samsung/samsung,exynos-hdmi.yaml226
-rw-r--r--Documentation/devicetree/bindings/display/samsung/samsung,exynos-mixer.yaml142
-rw-r--r--Documentation/devicetree/bindings/display/samsung/samsung,exynos5433-decon.yaml145
-rw-r--r--Documentation/devicetree/bindings/display/samsung/samsung,exynos5433-mic.yaml93
-rw-r--r--Documentation/devicetree/bindings/display/samsung/samsung,exynos7-decon.yaml119
-rw-r--r--Documentation/devicetree/bindings/display/samsung/samsung,fimd.yaml197
-rw-r--r--Documentation/devicetree/bindings/display/simple-framebuffer.yaml43
-rw-r--r--Documentation/devicetree/bindings/display/sitronix,st7735r.yaml9
-rw-r--r--Documentation/devicetree/bindings/display/solomon,ssd1307fb.yaml99
-rw-r--r--Documentation/devicetree/bindings/display/sprd/sprd,display-subsystem.yaml65
-rw-r--r--Documentation/devicetree/bindings/display/sprd/sprd,sharkl3-dpu.yaml77
-rw-r--r--Documentation/devicetree/bindings/display/sprd/sprd,sharkl3-dsi-host.yaml88
-rw-r--r--Documentation/devicetree/bindings/display/st,stm32-dsi.yaml24
-rw-r--r--Documentation/devicetree/bindings/display/st,stm32-ltdc.yaml5
-rw-r--r--Documentation/devicetree/bindings/display/ste,mcde.yaml4
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra114-mipi.txt41
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra114-mipi.yaml74
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-dpaux.yaml151
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-sor.yaml197
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-vic.yaml72
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-dc.yaml85
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-display.yaml308
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-dsi-padctl.yaml45
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-dc.yaml182
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-dsi.yaml158
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-epp.yaml69
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-gr2d.yaml73
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-gr3d.yaml213
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-hdmi.yaml125
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-host1x.txt622
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-host1x.yaml430
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-isp.yaml67
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-mpe.yaml70
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-tvo.yaml57
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-vi.yaml162
-rw-r--r--Documentation/devicetree/bindings/display/tegra/nvidia,tegra210-csi.yaml52
-rw-r--r--Documentation/devicetree/bindings/display/ti/ti,am65x-dss.yaml7
-rw-r--r--Documentation/devicetree/bindings/display/ti/ti,j721e-dss.yaml4
-rw-r--r--Documentation/devicetree/bindings/display/ti/ti,k2g-dss.yaml4
-rw-r--r--Documentation/devicetree/bindings/display/tilcdc/tilcdc.txt4
-rw-r--r--Documentation/devicetree/bindings/display/xlnx/xlnx,zynqmp-dpsub.yaml71
-rw-r--r--Documentation/devicetree/bindings/display/xylon,logicvc-display.yaml301
-rw-r--r--Documentation/devicetree/bindings/dma/allwinner,sun4i-a10-dma.yaml4
-rw-r--r--Documentation/devicetree/bindings/dma/allwinner,sun50i-a64-dma.yaml15
-rw-r--r--Documentation/devicetree/bindings/dma/allwinner,sun6i-a31-dma.yaml4
-rw-r--r--Documentation/devicetree/bindings/dma/altr,msgdma.yaml4
-rw-r--r--Documentation/devicetree/bindings/dma/apple,admac.yaml91
-rw-r--r--Documentation/devicetree/bindings/dma/arm,pl330.yaml92
-rw-r--r--Documentation/devicetree/bindings/dma/arm-pl08x.yaml6
-rw-r--r--Documentation/devicetree/bindings/dma/arm-pl330.txt49
-rw-r--r--Documentation/devicetree/bindings/dma/dma-common.yaml2
-rw-r--r--Documentation/devicetree/bindings/dma/dma-controller.yaml12
-rw-r--r--Documentation/devicetree/bindings/dma/dma-router.yaml6
-rw-r--r--Documentation/devicetree/bindings/dma/fsl,edma.yaml155
-rw-r--r--Documentation/devicetree/bindings/dma/fsl,imx-sdma.yaml149
-rw-r--r--Documentation/devicetree/bindings/dma/fsl,mxs-dma.yaml80
-rw-r--r--Documentation/devicetree/bindings/dma/fsl-edma.txt111
-rw-r--r--Documentation/devicetree/bindings/dma/fsl-imx-dma.txt8
-rw-r--r--Documentation/devicetree/bindings/dma/fsl-imx-sdma.txt118
-rw-r--r--Documentation/devicetree/bindings/dma/fsl-mxs-dma.txt60
-rw-r--r--Documentation/devicetree/bindings/dma/ingenic,dma.yaml47
-rw-r--r--Documentation/devicetree/bindings/dma/intel,ldma.yaml2
-rw-r--r--Documentation/devicetree/bindings/dma/mediatek,uart-dma.yaml124
-rw-r--r--Documentation/devicetree/bindings/dma/mmp-dma.txt10
-rw-r--r--Documentation/devicetree/bindings/dma/moxa,moxart-dma.txt4
-rw-r--r--Documentation/devicetree/bindings/dma/mtk-uart-apdma.txt56
-rw-r--r--Documentation/devicetree/bindings/dma/nvidia,tegra186-gpc-dma.yaml117
-rw-r--r--Documentation/devicetree/bindings/dma/nvidia,tegra210-adma.yaml6
-rw-r--r--Documentation/devicetree/bindings/dma/owl-dma.yaml2
-rw-r--r--Documentation/devicetree/bindings/dma/qcom,adm.yaml99
-rw-r--r--Documentation/devicetree/bindings/dma/qcom,bam-dma.yaml100
-rw-r--r--Documentation/devicetree/bindings/dma/qcom,gpi.yaml29
-rw-r--r--Documentation/devicetree/bindings/dma/qcom_adm.txt61
-rw-r--r--Documentation/devicetree/bindings/dma/qcom_bam_dma.txt50
-rw-r--r--Documentation/devicetree/bindings/dma/renesas,rcar-dmac.yaml10
-rw-r--r--Documentation/devicetree/bindings/dma/renesas,rz-dmac.yaml20
-rw-r--r--Documentation/devicetree/bindings/dma/renesas,rzn1-dmamux.yaml51
-rw-r--r--Documentation/devicetree/bindings/dma/renesas,usb-dmac.yaml2
-rw-r--r--Documentation/devicetree/bindings/dma/sifive,fu540-c000-pdma.yaml29
-rw-r--r--Documentation/devicetree/bindings/dma/snps,dma-spear1340.yaml10
-rw-r--r--Documentation/devicetree/bindings/dma/snps,dw-axi-dmac.yaml70
-rw-r--r--Documentation/devicetree/bindings/dma/socionext,uniphier-mio-dmac.yaml2
-rw-r--r--Documentation/devicetree/bindings/dma/socionext,uniphier-xdmac.yaml2
-rw-r--r--Documentation/devicetree/bindings/dma/sprd-dma.txt7
-rw-r--r--Documentation/devicetree/bindings/dma/st,stm32-dma.yaml6
-rw-r--r--Documentation/devicetree/bindings/dma/st,stm32-dmamux.yaml9
-rw-r--r--Documentation/devicetree/bindings/dma/st,stm32-mdma.yaml7
-rw-r--r--Documentation/devicetree/bindings/dma/ste-dma40.txt138
-rw-r--r--Documentation/devicetree/bindings/dma/stericsson,dma40.yaml159
-rw-r--r--Documentation/devicetree/bindings/dma/ti-dma-crossbar.txt2
-rw-r--r--Documentation/devicetree/bindings/dma/ti/k3-bcdma.yaml78
-rw-r--r--Documentation/devicetree/bindings/dma/ti/k3-pktdma.yaml3
-rw-r--r--Documentation/devicetree/bindings/dma/ti/k3-udma.yaml13
-rw-r--r--Documentation/devicetree/bindings/dma/xilinx/xilinx_dma.txt6
-rw-r--r--Documentation/devicetree/bindings/dma/xilinx/xlnx,zynqmp-dma-1.0.yaml85
-rw-r--r--Documentation/devicetree/bindings/dma/xilinx/xlnx,zynqmp-dpdma.yaml4
-rw-r--r--Documentation/devicetree/bindings/dma/xilinx/zynqmp_dma.txt26
-rw-r--r--Documentation/devicetree/bindings/dsp/fsl,dsp.yaml124
-rw-r--r--Documentation/devicetree/bindings/dsp/mediatek,mt8186-dsp.yaml93
-rw-r--r--Documentation/devicetree/bindings/dsp/mediatek,mt8195-dsp.yaml105
-rw-r--r--Documentation/devicetree/bindings/dvfs/performance-domain.yaml15
-rw-r--r--Documentation/devicetree/bindings/edac/dmc-520.yaml2
-rw-r--r--Documentation/devicetree/bindings/eeprom/at24.txt1
-rw-r--r--Documentation/devicetree/bindings/eeprom/at24.yaml34
-rw-r--r--Documentation/devicetree/bindings/eeprom/at25.yaml8
-rw-r--r--Documentation/devicetree/bindings/eeprom/microchip,93lc46b.yaml70
-rw-r--r--Documentation/devicetree/bindings/example-schema.yaml46
-rw-r--r--Documentation/devicetree/bindings/extcon/extcon-usbc-cros-ec.yaml4
-rw-r--r--Documentation/devicetree/bindings/extcon/extcon-usbc-tusb320.yaml6
-rw-r--r--Documentation/devicetree/bindings/extcon/maxim,max77843.yaml40
-rw-r--r--Documentation/devicetree/bindings/extcon/siliconmitus,sm5502-muic.yaml5
-rw-r--r--Documentation/devicetree/bindings/firmware/amlogic,meson-gxbb-sm.yaml39
-rw-r--r--Documentation/devicetree/bindings/firmware/arm,scmi.yaml178
-rw-r--r--Documentation/devicetree/bindings/firmware/arm,scpi.yaml8
-rw-r--r--Documentation/devicetree/bindings/firmware/fsl,scu.yaml215
-rw-r--r--Documentation/devicetree/bindings/firmware/intel,ixp4xx-network-processing-engine.yaml35
-rw-r--r--Documentation/devicetree/bindings/firmware/meson/meson_sm.txt15
-rw-r--r--Documentation/devicetree/bindings/firmware/nvidia,tegra186-bpmp.txt107
-rw-r--r--Documentation/devicetree/bindings/firmware/nvidia,tegra186-bpmp.yaml186
-rw-r--r--Documentation/devicetree/bindings/firmware/qcom,scm.txt52
-rw-r--r--Documentation/devicetree/bindings/firmware/qcom,scm.yaml212
-rw-r--r--Documentation/devicetree/bindings/firmware/qemu,fw-cfg-mmio.yaml54
-rw-r--r--Documentation/devicetree/bindings/fpga/fpga-region.txt2
-rw-r--r--Documentation/devicetree/bindings/fpga/lattice,sysconfig.yaml81
-rw-r--r--Documentation/devicetree/bindings/fpga/microchip,mpf-spi-fpga-mgr.yaml45
-rw-r--r--Documentation/devicetree/bindings/fpga/xilinx-pr-decoupler.txt54
-rw-r--r--Documentation/devicetree/bindings/fpga/xilinx-slave-serial.txt51
-rw-r--r--Documentation/devicetree/bindings/fpga/xilinx-zynq-fpga-mgr.yaml2
-rw-r--r--Documentation/devicetree/bindings/fpga/xlnx,fpga-slave-serial.yaml80
-rw-r--r--Documentation/devicetree/bindings/fpga/xlnx,pr-decoupler.yaml64
-rw-r--r--Documentation/devicetree/bindings/fpga/xlnx,zynqmp-pcap-fpga.yaml2
-rw-r--r--Documentation/devicetree/bindings/fuse/nvidia,tegra20-fuse.txt42
-rw-r--r--Documentation/devicetree/bindings/fuse/nvidia,tegra20-fuse.yaml88
-rw-r--r--Documentation/devicetree/bindings/gnss/brcm,bcm4751.yaml69
-rw-r--r--Documentation/devicetree/bindings/gnss/gnss-common.yaml55
-rw-r--r--Documentation/devicetree/bindings/gnss/gnss.txt37
-rw-r--r--Documentation/devicetree/bindings/gnss/mediatek.txt35
-rw-r--r--Documentation/devicetree/bindings/gnss/mediatek.yaml59
-rw-r--r--Documentation/devicetree/bindings/gnss/sirfstar.txt46
-rw-r--r--Documentation/devicetree/bindings/gnss/sirfstar.yaml76
-rw-r--r--Documentation/devicetree/bindings/gnss/u-blox,neo-6m.yaml58
-rw-r--r--Documentation/devicetree/bindings/gnss/u-blox.txt45
-rw-r--r--Documentation/devicetree/bindings/gpio/airoha,en7523-gpio.yaml66
-rw-r--r--Documentation/devicetree/bindings/gpio/brcm,brcmstb-gpio.txt83
-rw-r--r--Documentation/devicetree/bindings/gpio/brcm,brcmstb-gpio.yaml104
-rw-r--r--Documentation/devicetree/bindings/gpio/delta,tn48m-gpio.yaml39
-rw-r--r--Documentation/devicetree/bindings/gpio/fairchild,74hc595.yaml8
-rw-r--r--Documentation/devicetree/bindings/gpio/faraday,ftgpio010.txt27
-rw-r--r--Documentation/devicetree/bindings/gpio/faraday,ftgpio010.yaml65
-rw-r--r--Documentation/devicetree/bindings/gpio/fcs,fxl6408.yaml58
-rw-r--r--Documentation/devicetree/bindings/gpio/fsl,imx8qxp-sc-gpio.yaml39
-rw-r--r--Documentation/devicetree/bindings/gpio/fsl-imx-gpio.yaml2
-rw-r--r--Documentation/devicetree/bindings/gpio/fujitsu,mb86s70-gpio.txt20
-rw-r--r--Documentation/devicetree/bindings/gpio/fujitsu,mb86s70-gpio.yaml50
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-altera.txt5
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-axp209.txt75
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-consumer-common.yaml64
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-davinci.yaml2
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-eic-sprd.txt97
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-latch.yaml94
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-mvebu.txt93
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-mvebu.yaml146
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-pca9570.yaml4
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-pca95xx.yaml114
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-pisosr.txt2
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-samsung.txt41
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-sprd.txt28
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-stmpe.txt3
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-tpic2810.txt16
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-tpic2810.yaml51
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-vf610.yaml5
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-xilinx.txt48
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-xlp.txt49
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-zynq.yaml61
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio.txt43
-rw-r--r--Documentation/devicetree/bindings/gpio/hisilicon,ascend910-gpio.yaml56
-rw-r--r--Documentation/devicetree/bindings/gpio/loongson,ls-gpio.yaml126
-rw-r--r--Documentation/devicetree/bindings/gpio/loongson,ls1x-gpio.yaml49
-rw-r--r--Documentation/devicetree/bindings/gpio/microchip,mpfs-gpio.yaml97
-rw-r--r--Documentation/devicetree/bindings/gpio/mstar,msc313-gpio.yaml4
-rw-r--r--Documentation/devicetree/bindings/gpio/nvidia,tegra186-gpio.txt165
-rw-r--r--Documentation/devicetree/bindings/gpio/nvidia,tegra186-gpio.yaml214
-rw-r--r--Documentation/devicetree/bindings/gpio/nvidia,tegra20-gpio.txt40
-rw-r--r--Documentation/devicetree/bindings/gpio/nvidia,tegra20-gpio.yaml110
-rw-r--r--Documentation/devicetree/bindings/gpio/nxp,pcf8575.yaml4
-rw-r--r--Documentation/devicetree/bindings/gpio/realtek,otto-gpio.yaml34
-rw-r--r--Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml6
-rw-r--r--Documentation/devicetree/bindings/gpio/rockchip,gpio-bank.yaml4
-rw-r--r--Documentation/devicetree/bindings/gpio/sifive,gpio.yaml10
-rw-r--r--Documentation/devicetree/bindings/gpio/socionext,uniphier-gpio.yaml17
-rw-r--r--Documentation/devicetree/bindings/gpio/sprd,gpio-eic.yaml124
-rw-r--r--Documentation/devicetree/bindings/gpio/sprd,gpio.yaml75
-rw-r--r--Documentation/devicetree/bindings/gpio/ti,omap-gpio.yaml2
-rw-r--r--Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml1
-rw-r--r--Documentation/devicetree/bindings/gpio/x-powers,axp209-gpio.yaml62
-rw-r--r--Documentation/devicetree/bindings/gpio/xlnx,gpio-xilinx.yaml154
-rw-r--r--Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml43
-rw-r--r--Documentation/devicetree/bindings/gpu/arm,mali-bifrost.yaml169
-rw-r--r--Documentation/devicetree/bindings/gpu/arm,mali-midgard.yaml3
-rw-r--r--Documentation/devicetree/bindings/gpu/arm,mali-utgard.yaml3
-rw-r--r--Documentation/devicetree/bindings/gpu/brcm,bcm-v3d.yaml3
-rw-r--r--Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvdec.yaml106
-rw-r--r--Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvenc.yaml135
-rw-r--r--Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvjpg.yaml94
-rw-r--r--Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra234-nvdec.yaml156
-rw-r--r--Documentation/devicetree/bindings/gpu/samsung-rotator.yaml1
-rw-r--r--Documentation/devicetree/bindings/gpu/vivante,gc.yaml2
-rw-r--r--Documentation/devicetree/bindings/h8300/cpu.txt13
-rw-r--r--Documentation/devicetree/bindings/hwinfo/loongson,ls2k-chipid.yaml38
-rw-r--r--Documentation/devicetree/bindings/hwinfo/renesas,prr.yaml (renamed from Documentation/devicetree/bindings/arm/renesas,prr.yaml)4
-rw-r--r--Documentation/devicetree/bindings/hwinfo/samsung,exynos-chipid.yaml41
-rw-r--r--Documentation/devicetree/bindings/hwinfo/samsung,s5pv210-chipid.yaml30
-rw-r--r--Documentation/devicetree/bindings/hwinfo/ti,k3-socinfo.yaml40
-rw-r--r--Documentation/devicetree/bindings/hwlock/allwinner,sun6i-a31-hwspinlock.yaml5
-rw-r--r--Documentation/devicetree/bindings/hwlock/qcom-hwspinlock.yaml29
-rw-r--r--Documentation/devicetree/bindings/hwlock/st,stm32-hwspinlock.yaml5
-rw-r--r--Documentation/devicetree/bindings/hwlock/ti,omap-hwspinlock.yaml33
-rw-r--r--Documentation/devicetree/bindings/hwmon/adi,adm1177.yaml13
-rw-r--r--Documentation/devicetree/bindings/hwmon/adi,adm1266.yaml6
-rw-r--r--Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml68
-rw-r--r--Documentation/devicetree/bindings/hwmon/adi,axi-fan-control.yaml22
-rw-r--r--Documentation/devicetree/bindings/hwmon/adi,ltc2945.yaml49
-rw-r--r--Documentation/devicetree/bindings/hwmon/adi,ltc2947.yaml20
-rw-r--r--Documentation/devicetree/bindings/hwmon/adi,ltc2992.yaml29
-rw-r--r--Documentation/devicetree/bindings/hwmon/adi,max31760.yaml42
-rw-r--r--Documentation/devicetree/bindings/hwmon/adt7475.yaml25
-rw-r--r--Documentation/devicetree/bindings/hwmon/amd,sbrmi.yaml6
-rw-r--r--Documentation/devicetree/bindings/hwmon/amd,sbtsi.yaml6
-rw-r--r--Documentation/devicetree/bindings/hwmon/dps650ab.txt11
-rw-r--r--Documentation/devicetree/bindings/hwmon/hih6130.txt12
-rw-r--r--Documentation/devicetree/bindings/hwmon/hpe,gxp-fan-ctrl.yaml45
-rw-r--r--Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt26
-rw-r--r--Documentation/devicetree/bindings/hwmon/ibm,occ-hwmon.yaml39
-rw-r--r--Documentation/devicetree/bindings/hwmon/ibm,p8-occ-hwmon.txt (renamed from Documentation/devicetree/bindings/i2c/ibm,p8-occ-hwmon.txt)0
-rw-r--r--Documentation/devicetree/bindings/hwmon/iio-hwmon.yaml37
-rw-r--r--Documentation/devicetree/bindings/hwmon/jc42.txt46
-rw-r--r--Documentation/devicetree/bindings/hwmon/jedec,jc42.yaml78
-rw-r--r--Documentation/devicetree/bindings/hwmon/lltc,ltc4151.yaml41
-rw-r--r--Documentation/devicetree/bindings/hwmon/lm70.txt22
-rw-r--r--Documentation/devicetree/bindings/hwmon/lm75.yaml1
-rw-r--r--Documentation/devicetree/bindings/hwmon/lm90.txt51
-rw-r--r--Documentation/devicetree/bindings/hwmon/ltc4151.txt18
-rw-r--r--Documentation/devicetree/bindings/hwmon/mcp3021.txt21
-rw-r--r--Documentation/devicetree/bindings/hwmon/microchip,lan966x.yaml53
-rw-r--r--Documentation/devicetree/bindings/hwmon/microchip,mcp3021.yaml43
-rw-r--r--Documentation/devicetree/bindings/hwmon/microchip,sparx5-temp.yaml4
-rw-r--r--Documentation/devicetree/bindings/hwmon/moortec,mr75203.yaml99
-rw-r--r--Documentation/devicetree/bindings/hwmon/national,lm90.yaml227
-rw-r--r--Documentation/devicetree/bindings/hwmon/ntc-thermistor.yaml141
-rw-r--r--Documentation/devicetree/bindings/hwmon/ntc_thermistor.txt44
-rw-r--r--Documentation/devicetree/bindings/hwmon/nuvoton,nct6775.yaml57
-rw-r--r--Documentation/devicetree/bindings/hwmon/nuvoton,nct7802.yaml145
-rw-r--r--Documentation/devicetree/bindings/hwmon/nxp,mc34vr500.yaml36
-rw-r--r--Documentation/devicetree/bindings/hwmon/pmbus/ti,lm25066.yaml54
-rw-r--r--Documentation/devicetree/bindings/hwmon/pwm-fan.txt68
-rw-r--r--Documentation/devicetree/bindings/hwmon/pwm-fan.yaml97
-rw-r--r--Documentation/devicetree/bindings/hwmon/sensirion,sht15.yaml43
-rw-r--r--Documentation/devicetree/bindings/hwmon/sensirion,shtc1.yaml8
-rw-r--r--Documentation/devicetree/bindings/hwmon/sht15.txt19
-rw-r--r--Documentation/devicetree/bindings/hwmon/starfive,jh71x0-temp.yaml70
-rw-r--r--Documentation/devicetree/bindings/hwmon/ti,ina2xx.yaml27
-rw-r--r--Documentation/devicetree/bindings/hwmon/ti,tmp102.yaml47
-rw-r--r--Documentation/devicetree/bindings/hwmon/ti,tmp108.yaml50
-rw-r--r--Documentation/devicetree/bindings/hwmon/ti,tmp401.yaml104
-rw-r--r--Documentation/devicetree/bindings/hwmon/ti,tmp421.yaml109
-rw-r--r--Documentation/devicetree/bindings/hwmon/ti,tmp464.yaml113
-rw-r--r--Documentation/devicetree/bindings/hwmon/ti,tmp513.yaml22
-rw-r--r--Documentation/devicetree/bindings/hwmon/ti,tps23861.yaml16
-rw-r--r--Documentation/devicetree/bindings/hwmon/tmp108.txt18
-rw-r--r--Documentation/devicetree/bindings/hwmon/vexpress.txt2
-rw-r--r--Documentation/devicetree/bindings/i2c/allwinner,sun6i-a31-p2wi.yaml4
-rw-r--r--Documentation/devicetree/bindings/i2c/amlogic,meson6-i2c.yaml6
-rw-r--r--Documentation/devicetree/bindings/i2c/apple,i2c.yaml67
-rw-r--r--Documentation/devicetree/bindings/i2c/arm,i2c-versatile.yaml29
-rw-r--r--Documentation/devicetree/bindings/i2c/aspeed,i2c.yaml5
-rw-r--r--Documentation/devicetree/bindings/i2c/atmel,at91sam-i2c.yaml146
-rw-r--r--Documentation/devicetree/bindings/i2c/brcm,bcm2835-i2c.txt22
-rw-r--r--Documentation/devicetree/bindings/i2c/brcm,bcm2835-i2c.yaml54
-rw-r--r--Documentation/devicetree/bindings/i2c/brcm,kona-i2c.txt35
-rw-r--r--Documentation/devicetree/bindings/i2c/brcm,kona-i2c.yaml59
-rw-r--r--Documentation/devicetree/bindings/i2c/cdns,i2c-r1p10.yaml18
-rw-r--r--Documentation/devicetree/bindings/i2c/google,cros-ec-i2c-tunnel.yaml4
-rw-r--r--Documentation/devicetree/bindings/i2c/hisilicon,ascend910-i2c.yaml73
-rw-r--r--Documentation/devicetree/bindings/i2c/hpe,gxp-i2c.yaml59
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-at91.txt82
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-efm32.txt33
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-exynos5.txt53
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-exynos5.yaml133
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-gate.yaml3
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-gpio.yaml28
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-imx-lpi2c.yaml27
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-imx.yaml22
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-mpc.yaml5
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-mt65xx.txt51
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-mt65xx.yaml129
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-mt7621.txt25
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-mux-gpio.txt80
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-mux-gpio.yaml104
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-mux-gpmux.yaml1
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-mux-pinctrl.txt93
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-mux-pinctrl.yaml103
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-nomadik.txt23
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-ocores.txt78
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-pxa.yaml2
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-qcom-cci.txt93
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-rk3x.yaml3
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-s3c2410.txt58
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-st.txt41
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-synquacer.txt29
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-versatile.txt10
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c-xlp9xx.txt22
-rw-r--r--Documentation/devicetree/bindings/i2c/i2c.txt4
-rw-r--r--Documentation/devicetree/bindings/i2c/ingenic,i2c.yaml8
-rw-r--r--Documentation/devicetree/bindings/i2c/loongson,ls2x-i2c.yaml51
-rw-r--r--Documentation/devicetree/bindings/i2c/marvell,mv64xxx-i2c.yaml26
-rw-r--r--Documentation/devicetree/bindings/i2c/mediatek,mt7621-i2c.yaml61
-rw-r--r--Documentation/devicetree/bindings/i2c/mellanox,i2c-mlxbf.yaml77
-rw-r--r--Documentation/devicetree/bindings/i2c/microchip,corei2c.yaml56
-rw-r--r--Documentation/devicetree/bindings/i2c/nuvoton,npcm7xx-i2c.yaml27
-rw-r--r--Documentation/devicetree/bindings/i2c/nvidia,tegra186-bpmp-i2c.txt42
-rw-r--r--Documentation/devicetree/bindings/i2c/nvidia,tegra186-bpmp-i2c.yaml45
-rw-r--r--Documentation/devicetree/bindings/i2c/nvidia,tegra20-i2c.txt87
-rw-r--r--Documentation/devicetree/bindings/i2c/nvidia,tegra20-i2c.yaml192
-rw-r--r--Documentation/devicetree/bindings/i2c/opencores,i2c-ocores.yaml113
-rw-r--r--Documentation/devicetree/bindings/i2c/qcom,i2c-cci.yaml277
-rw-r--r--Documentation/devicetree/bindings/i2c/qcom,i2c-geni-qcom.yaml146
-rw-r--r--Documentation/devicetree/bindings/i2c/qcom,i2c-qup.txt40
-rw-r--r--Documentation/devicetree/bindings/i2c/qcom,i2c-qup.yaml89
-rw-r--r--Documentation/devicetree/bindings/i2c/renesas,rcar-i2c.yaml9
-rw-r--r--Documentation/devicetree/bindings/i2c/renesas,riic.yaml7
-rw-r--r--Documentation/devicetree/bindings/i2c/renesas,rzv2m.yaml80
-rw-r--r--Documentation/devicetree/bindings/i2c/samsung,s3c2410-i2c.yaml164
-rw-r--r--Documentation/devicetree/bindings/i2c/socionext,synquacer-i2c.yaml58
-rw-r--r--Documentation/devicetree/bindings/i2c/socionext,uniphier-fi2c.yaml3
-rw-r--r--Documentation/devicetree/bindings/i2c/socionext,uniphier-i2c.yaml3
-rw-r--r--Documentation/devicetree/bindings/i2c/st,nomadik-i2c.yaml115
-rw-r--r--Documentation/devicetree/bindings/i2c/st,sti-i2c.yaml71
-rw-r--r--Documentation/devicetree/bindings/i2c/st,stm32-i2c.yaml41
-rw-r--r--Documentation/devicetree/bindings/i2c/ti,omap4-i2c.yaml2
-rw-r--r--Documentation/devicetree/bindings/i2c/xlnx,xps-iic-2.00.a.yaml17
-rw-r--r--Documentation/devicetree/bindings/i3c/aspeed,ast2600-i3c.yaml72
-rw-r--r--Documentation/devicetree/bindings/i3c/cdns,i3c-master.txt43
-rw-r--r--Documentation/devicetree/bindings/i3c/cdns,i3c-master.yaml60
-rw-r--r--Documentation/devicetree/bindings/i3c/i3c.yaml2
-rw-r--r--Documentation/devicetree/bindings/i3c/mipi-i3c-hci.yaml2
-rw-r--r--Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.txt41
-rw-r--r--Documentation/devicetree/bindings/i3c/snps,dw-i3c-master.yaml52
-rw-r--r--Documentation/devicetree/bindings/iio/accel/adi,adis16201.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/accel/adi,adis16240.yaml11
-rw-r--r--Documentation/devicetree/bindings/iio/accel/adi,adxl313.yaml91
-rw-r--r--Documentation/devicetree/bindings/iio/accel/adi,adxl345.yaml13
-rw-r--r--Documentation/devicetree/bindings/iio/accel/adi,adxl355.yaml91
-rw-r--r--Documentation/devicetree/bindings/iio/accel/adi,adxl367.yaml80
-rw-r--r--Documentation/devicetree/bindings/iio/accel/adi,adxl372.yaml55
-rw-r--r--Documentation/devicetree/bindings/iio/accel/bosch,bma220.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/accel/bosch,bma255.yaml5
-rw-r--r--Documentation/devicetree/bindings/iio/accel/bosch,bmi088.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/accel/fsl,mma7455.yaml8
-rw-r--r--Documentation/devicetree/bindings/iio/accel/kionix,kx022a.yaml65
-rw-r--r--Documentation/devicetree/bindings/iio/accel/kionix,kxcjk1013.yaml5
-rw-r--r--Documentation/devicetree/bindings/iio/accel/kionix,kxsd9.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/accel/memsensing,msa311.yaml52
-rw-r--r--Documentation/devicetree/bindings/iio/accel/murata,sca3300.yaml6
-rw-r--r--Documentation/devicetree/bindings/iio/accel/nxp,fxls8962af.yaml11
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adc.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad4130.yaml262
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7091r5.yaml10
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7124.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7192.yaml49
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7280a.yaml78
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7291.yaml1
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7292.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7298.yaml6
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7476.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7606.yaml50
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7768-1.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7780.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7923.yaml38
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad7949.yaml56
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad799x.yaml73
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,ad9467.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,axi-adc.yaml11
-rw-r--r--Documentation/devicetree/bindings/iio/adc/adi,max11410.yaml177
-rw-r--r--Documentation/devicetree/bindings/iio/adc/allwinner,sun8i-a33-ths.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/adc/aspeed,ast2600-adc.yaml93
-rw-r--r--Documentation/devicetree/bindings/iio/adc/atmel,sama5d2-adc.yaml5
-rw-r--r--Documentation/devicetree/bindings/iio/adc/avia-hx711.yaml4
-rw-r--r--Documentation/devicetree/bindings/iio/adc/cirrus,ep9301-adc.yaml47
-rw-r--r--Documentation/devicetree/bindings/iio/adc/fsl,vf610-adc.yaml11
-rw-r--r--Documentation/devicetree/bindings/iio/adc/holt,hi8435.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ingenic,adc.yaml22
-rw-r--r--Documentation/devicetree/bindings/iio/adc/lltc,ltc2496.yaml8
-rw-r--r--Documentation/devicetree/bindings/iio/adc/lltc,ltc2497.yaml8
-rw-r--r--Documentation/devicetree/bindings/iio/adc/maxim,max1027.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/adc/maxim,max11100.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/maxim,max1118.yaml26
-rw-r--r--Documentation/devicetree/bindings/iio/adc/maxim,max11205.yaml69
-rw-r--r--Documentation/devicetree/bindings/iio/adc/maxim,max1238.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/adc/maxim,max1241.yaml11
-rw-r--r--Documentation/devicetree/bindings/iio/adc/maxim,max1363.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/adc/mediatek,mt2701-auxadc.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/adc/microchip,mcp3201.yaml8
-rw-r--r--Documentation/devicetree/bindings/iio/adc/microchip,mcp3911.yaml16
-rw-r--r--Documentation/devicetree/bindings/iio/adc/motorola,cpcap-adc.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/adc/nuvoton,npcm750-adc.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/nxp,imx8qxp-adc.yaml78
-rw-r--r--Documentation/devicetree/bindings/iio/adc/nxp,imx93-adc.yaml81
-rw-r--r--Documentation/devicetree/bindings/iio/adc/nxp,lpc1850-adc.yaml4
-rw-r--r--Documentation/devicetree/bindings/iio/adc/qcom,pm8018-adc.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/adc/qcom,spmi-iadc.yaml12
-rw-r--r--Documentation/devicetree/bindings/iio/adc/qcom,spmi-rradc.yaml51
-rw-r--r--Documentation/devicetree/bindings/iio/adc/qcom,spmi-vadc.yaml107
-rw-r--r--Documentation/devicetree/bindings/iio/adc/renesas,rcar-gyroadc.yaml60
-rw-r--r--Documentation/devicetree/bindings/iio/adc/renesas,rzg2l-adc.yaml33
-rw-r--r--Documentation/devicetree/bindings/iio/adc/richtek,rtq6056.yaml56
-rw-r--r--Documentation/devicetree/bindings/iio/adc/rockchip-saradc.yaml1
-rw-r--r--Documentation/devicetree/bindings/iio/adc/samsung,exynos-adc.yaml8
-rw-r--r--Documentation/devicetree/bindings/iio/adc/sigma-delta-modulator.yaml4
-rw-r--r--Documentation/devicetree/bindings/iio/adc/sprd,sc2720-adc.yaml62
-rw-r--r--Documentation/devicetree/bindings/iio/adc/st,stm32-adc.yaml186
-rw-r--r--Documentation/devicetree/bindings/iio/adc/st,stm32-dfsdm-adc.yaml18
-rw-r--r--Documentation/devicetree/bindings/iio/adc/st,stmpe-adc.yaml11
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,adc081c.yaml55
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,adc0832.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,adc084s021.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,adc108s102.yaml8
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,adc12138.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,adc128s052.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,adc161s626.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,ads1015.yaml16
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,ads1100.yaml46
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,ads124s08.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,ads131e08.yaml11
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,ads7924.yaml110
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,ads8344.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,ads8688.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,am3359-adc.yaml75
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,lmp92064.yaml70
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,palmas-gpadc.yaml8
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,tlc4541.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/adc/ti,tsc2046.yaml42
-rw-r--r--Documentation/devicetree/bindings/iio/adc/x-powers,axp209-adc.yaml5
-rw-r--r--Documentation/devicetree/bindings/iio/adc/xlnx,zynqmp-ams.yaml236
-rw-r--r--Documentation/devicetree/bindings/iio/addac/adi,ad74115.yaml373
-rw-r--r--Documentation/devicetree/bindings/iio/addac/adi,ad74413r.yaml172
-rw-r--r--Documentation/devicetree/bindings/iio/afe/temperature-sense-rtd.yaml101
-rw-r--r--Documentation/devicetree/bindings/iio/afe/temperature-transducer.yaml114
-rw-r--r--Documentation/devicetree/bindings/iio/amplifiers/adi,ada4250.yaml51
-rw-r--r--Documentation/devicetree/bindings/iio/amplifiers/adi,hmc425a.yaml1
-rw-r--r--Documentation/devicetree/bindings/iio/chemical/senseair,sunrise.yaml55
-rw-r--r--Documentation/devicetree/bindings/iio/chemical/sensirion,scd4x.yaml46
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad3552r.yaml218
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5064.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5360.yaml13
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5380.yaml10
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5421.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5449.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5624r.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5686.yaml10
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5755.yaml10
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5758.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5761.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5764.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5766.yaml28
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5770r.yaml101
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad5791.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad7293.yaml61
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ad8801.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/dac/adi,ltc2688.yaml147
-rw-r--r--Documentation/devicetree/bindings/iio/dac/lltc,ltc1660.yaml6
-rw-r--r--Documentation/devicetree/bindings/iio/dac/lltc,ltc2632.yaml20
-rw-r--r--Documentation/devicetree/bindings/iio/dac/maxim,max5522.yaml49
-rw-r--r--Documentation/devicetree/bindings/iio/dac/microchip,mcp4922.yaml10
-rw-r--r--Documentation/devicetree/bindings/iio/dac/nxp,lpc1850-dac.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/dac/st,stm32-dac.yaml8
-rw-r--r--Documentation/devicetree/bindings/iio/dac/ti,dac082s085.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/dac/ti,dac5571.yaml3
-rw-r--r--Documentation/devicetree/bindings/iio/dac/ti,dac7311.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/dac/ti,dac7612.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/filter/adi,admv8818.yaml66
-rw-r--r--Documentation/devicetree/bindings/iio/frequency/adf4371.yaml19
-rw-r--r--Documentation/devicetree/bindings/iio/frequency/adi,adf4350.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/frequency/adi,adf4377.yaml92
-rw-r--r--Documentation/devicetree/bindings/iio/frequency/adi,admv1013.yaml94
-rw-r--r--Documentation/devicetree/bindings/iio/frequency/adi,admv1014.yaml137
-rw-r--r--Documentation/devicetree/bindings/iio/frequency/adi,admv4420.yaml59
-rw-r--r--Documentation/devicetree/bindings/iio/frequency/adi,adrf6780.yaml134
-rw-r--r--Documentation/devicetree/bindings/iio/gyroscope/adi,adxrs290.yaml19
-rw-r--r--Documentation/devicetree/bindings/iio/gyroscope/bosch,bmg160.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/gyroscope/invensense,mpu3050.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/gyroscope/nxp,fxas21002c.yaml35
-rw-r--r--Documentation/devicetree/bindings/iio/health/ti,afe4403.yaml11
-rw-r--r--Documentation/devicetree/bindings/iio/health/ti,afe4404.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/humidity/dht11.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/humidity/ti,hdc2010.yaml16
-rw-r--r--Documentation/devicetree/bindings/iio/imu/adi,adis16460.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/imu/adi,adis16475.yaml27
-rw-r--r--Documentation/devicetree/bindings/iio/imu/adi,adis16480.yaml11
-rw-r--r--Documentation/devicetree/bindings/iio/imu/bosch,bmi160.yaml39
-rw-r--r--Documentation/devicetree/bindings/iio/imu/bosch,bno055.yaml59
-rw-r--r--Documentation/devicetree/bindings/iio/imu/invensense,icm42600.yaml41
-rw-r--r--Documentation/devicetree/bindings/iio/imu/invensense,mpu6050.yaml41
-rw-r--r--Documentation/devicetree/bindings/iio/imu/nxp,fxos8700.yaml35
-rw-r--r--Documentation/devicetree/bindings/iio/imu/st,lsm6dsx.yaml70
-rw-r--r--Documentation/devicetree/bindings/iio/light/liteon,ltr501.yaml58
-rw-r--r--Documentation/devicetree/bindings/iio/light/liteon,ltrf216a.yaml49
-rw-r--r--Documentation/devicetree/bindings/iio/light/rohm,bu27034.yaml46
-rw-r--r--Documentation/devicetree/bindings/iio/light/stk33xx.yaml6
-rw-r--r--Documentation/devicetree/bindings/iio/magnetometer/asahi-kasei,ak8975.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/magnetometer/ti,tmag5273.yaml75
-rw-r--r--Documentation/devicetree/bindings/iio/magnetometer/yamaha,yas530.yaml18
-rw-r--r--Documentation/devicetree/bindings/iio/multiplexer/io-channel-mux.yaml15
-rw-r--r--Documentation/devicetree/bindings/iio/potentiometer/adi,ad5272.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/potentiometer/microchip,mcp41010.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/potentiometer/microchip,mcp4131.yaml11
-rw-r--r--Documentation/devicetree/bindings/iio/pressure/asc,dlhl60d.yaml4
-rw-r--r--Documentation/devicetree/bindings/iio/pressure/bmp085.yaml30
-rw-r--r--Documentation/devicetree/bindings/iio/pressure/meas,ms5611.yaml6
-rw-r--r--Documentation/devicetree/bindings/iio/pressure/murata,zpa2326.yaml5
-rw-r--r--Documentation/devicetree/bindings/iio/proximity/ams,as3935.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/proximity/google,cros-ec-mkbp-proximity.yaml2
-rw-r--r--Documentation/devicetree/bindings/iio/proximity/semtech,sx9324.yaml200
-rw-r--r--Documentation/devicetree/bindings/iio/proximity/semtech,sx9360.yaml98
-rw-r--r--Documentation/devicetree/bindings/iio/proximity/st,vl53l0x.yaml5
-rw-r--r--Documentation/devicetree/bindings/iio/resolver/adi,ad2s90.yaml7
-rw-r--r--Documentation/devicetree/bindings/iio/samsung,sensorhub-rinato.yaml9
-rw-r--r--Documentation/devicetree/bindings/iio/st,st-sensors.yaml14
-rw-r--r--Documentation/devicetree/bindings/iio/temperature/adi,ltc2983.yaml528
-rw-r--r--Documentation/devicetree/bindings/iio/temperature/maxim,max31855k.yaml4
-rw-r--r--Documentation/devicetree/bindings/iio/temperature/maxim,max31856.yaml6
-rw-r--r--Documentation/devicetree/bindings/iio/temperature/maxim,max31865.yaml54
-rw-r--r--Documentation/devicetree/bindings/iio/temperature/melexis,mlx90632.yaml4
-rw-r--r--Documentation/devicetree/bindings/iio/temperature/ti,tmp117.yaml14
-rw-r--r--Documentation/devicetree/bindings/input/adc-joystick.yaml20
-rw-r--r--Documentation/devicetree/bindings/input/adc-keys.txt67
-rw-r--r--Documentation/devicetree/bindings/input/adc-keys.yaml103
-rw-r--r--Documentation/devicetree/bindings/input/adi,adp5588.yaml111
-rw-r--r--Documentation/devicetree/bindings/input/allwinner,sun4i-a10-lradc-keys.yaml33
-rw-r--r--Documentation/devicetree/bindings/input/ariel-pwrbutton.yaml1
-rw-r--r--Documentation/devicetree/bindings/input/azoteq,iqs7222.yaml950
-rw-r--r--Documentation/devicetree/bindings/input/cap11xx.txt78
-rw-r--r--Documentation/devicetree/bindings/input/cypress-sf.yaml61
-rw-r--r--Documentation/devicetree/bindings/input/da9062-onkey.txt2
-rw-r--r--Documentation/devicetree/bindings/input/elan,ekth3000.yaml81
-rw-r--r--Documentation/devicetree/bindings/input/elan,ekth6915.yaml65
-rw-r--r--Documentation/devicetree/bindings/input/elan_i2c.txt44
-rw-r--r--Documentation/devicetree/bindings/input/fsl,mpr121-touchkey.yaml4
-rw-r--r--Documentation/devicetree/bindings/input/fsl,scu-key.yaml40
-rw-r--r--Documentation/devicetree/bindings/input/goodix,gt7375p.yaml12
-rw-r--r--Documentation/devicetree/bindings/input/google,cros-ec-keyb.yaml38
-rw-r--r--Documentation/devicetree/bindings/input/gpio-beeper.txt13
-rw-r--r--Documentation/devicetree/bindings/input/gpio-beeper.yaml33
-rw-r--r--Documentation/devicetree/bindings/input/gpio-keys.yaml173
-rw-r--r--Documentation/devicetree/bindings/input/hid-over-i2c.txt44
-rw-r--r--Documentation/devicetree/bindings/input/hid-over-i2c.yaml83
-rw-r--r--Documentation/devicetree/bindings/input/ibm,op-panel.yaml50
-rw-r--r--Documentation/devicetree/bindings/input/ilitek,ili2xxx.txt27
-rw-r--r--Documentation/devicetree/bindings/input/imx-keypad.yaml4
-rw-r--r--Documentation/devicetree/bindings/input/input.yaml26
-rw-r--r--Documentation/devicetree/bindings/input/iqs269a.yaml17
-rw-r--r--Documentation/devicetree/bindings/input/iqs626a.yaml107
-rw-r--r--Documentation/devicetree/bindings/input/iqs62x-keys.yaml9
-rw-r--r--Documentation/devicetree/bindings/input/matrix-keymap.yaml4
-rw-r--r--Documentation/devicetree/bindings/input/max77650-onkey.yaml8
-rw-r--r--Documentation/devicetree/bindings/input/mediatek,mt6779-keypad.yaml83
-rw-r--r--Documentation/devicetree/bindings/input/mediatek,pmic-keys.yaml115
-rw-r--r--Documentation/devicetree/bindings/input/microchip,cap11xx.yaml152
-rw-r--r--Documentation/devicetree/bindings/input/mtk-pmic-keys.txt43
-rw-r--r--Documentation/devicetree/bindings/input/pine64,pinephone-keyboard.yaml66
-rw-r--r--Documentation/devicetree/bindings/input/pwm-beeper.txt24
-rw-r--r--Documentation/devicetree/bindings/input/pwm-beeper.yaml41
-rw-r--r--Documentation/devicetree/bindings/input/pwm-vibrator.txt66
-rw-r--r--Documentation/devicetree/bindings/input/pwm-vibrator.yaml57
-rw-r--r--Documentation/devicetree/bindings/input/qcom,pm8921-pwrkey.yaml75
-rw-r--r--Documentation/devicetree/bindings/input/qcom,pm8xxx-pwrkey.txt46
-rw-r--r--Documentation/devicetree/bindings/input/qcom,pm8xxx-vib.txt23
-rw-r--r--Documentation/devicetree/bindings/input/qcom,pm8xxx-vib.yaml38
-rw-r--r--Documentation/devicetree/bindings/input/regulator-haptic.yaml6
-rw-r--r--Documentation/devicetree/bindings/input/snvs-pwrkey.txt1
-rw-r--r--Documentation/devicetree/bindings/input/sprd,sc27xx-vibrator.yaml2
-rw-r--r--Documentation/devicetree/bindings/input/ti,drv260x.txt50
-rw-r--r--Documentation/devicetree/bindings/input/ti,drv260x.yaml109
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/auo_pixcir_ts.txt6
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/chipone,icn8318.yaml2
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/colibri-vf50-ts.txt16
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/cypress,cy8ctma140.yaml2
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/cypress,cy8ctma340.yaml2
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/cypress,tt21000.yaml106
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/edt-ft5x06.yaml10
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/egalax-ts.txt4
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/ektf2127.txt2
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/elan,elants_i2c.yaml14
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/goodix.yaml3
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/himax,hx83112b.yaml63
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/hycon,hy46xx.yaml2
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/hynitron,cstxxx.yaml65
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/ilitek_ts_i2c.yaml7
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/imagis,ist3038c.yaml74
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/melfas,mms114.yaml2
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/mstar,msg2638.yaml10
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/pixcir,pixcir_ts.yaml2
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/silead,gsl1680.yaml91
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/silead_gsl1680.txt44
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/st,stmfts.txt41
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/st,stmfts.yaml72
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/stmpe.txt3
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/ti,am3359-tsc.yaml76
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/ti,tsc2005.yaml2
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/ti-tsc-adc.txt91
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/touchscreen.yaml2
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/zinitix,bt400.yaml115
-rw-r--r--Documentation/devicetree/bindings/input/touchscreen/zinitix.txt40
-rw-r--r--Documentation/devicetree/bindings/interconnect/fsl,imx8m-noc.yaml11
-rw-r--r--Documentation/devicetree/bindings/interconnect/mediatek,cci.yaml142
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,bcm-voter.yaml8
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,msm8998-bwmon.yaml124
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,osm-l3.yaml27
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,qcm2290.yaml137
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,qdu1000-rpmh.yaml70
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,rpm.yaml214
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,rpmh-common.yaml43
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,rpmh.yaml49
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,sa8775p-rpmh.yaml50
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,sc7280-rpmh.yaml71
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,sc8280xp-rpmh.yaml49
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,sdm660.yaml185
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,sm6350-rpmh.yaml82
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,sm8450-rpmh.yaml124
-rw-r--r--Documentation/devicetree/bindings/interconnect/qcom,sm8550-rpmh.yaml139
-rw-r--r--Documentation/devicetree/bindings/interconnect/samsung,exynos-bus.yaml317
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/actions,owl-sirq.yaml4
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/allwinner,sun4i-a10-ic.yaml2
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/allwinner,sun6i-a31-r-intc.yaml2
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/allwinner,sun7i-a20-sc-nmi.yaml2
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/amlogic,meson-gpio-intc.txt1
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/andestech,ativic32.txt19
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/apple,aic.yaml33
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/apple,aic2.yaml143
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/arm,gic-v3.yaml22
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/arm,gic.yaml9
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/brcm,bcm3380-l2-intc.txt39
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/brcm,bcm7038-l1-intc.txt61
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/brcm,bcm7038-l1-intc.yaml91
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/brcm,bcm7120-l2-intc.txt88
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/brcm,bcm7120-l2-intc.yaml152
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/brcm,l2-intc.txt31
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/brcm,l2-intc.yaml72
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/fsl,intmux.yaml3
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/fsl,irqsteer.yaml4
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.txt53
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/fsl,ls-extirq.yaml118
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/fsl,mu-msi.yaml99
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/idt,32434-pic.yaml2
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/ingenic,intc.yaml2
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/intel,ce4100-ioapic.txt26
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/intel,ce4100-ioapic.yaml60
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/intel,ce4100-lapic.yaml71
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/intel,ixp4xx-interrupt.yaml4
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/loongson,cpu-interrupt-controller.yaml34
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/loongson,htpic.yaml4
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/loongson,htvec.yaml4
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/loongson,liointc.yaml8
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/loongson,ls1x-intc.txt24
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/loongson,ls1x-intc.yaml51
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/loongson,pch-msi.yaml10
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/loongson,pch-pic.yaml6
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/mediatek,cirq.txt33
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/mediatek,mtk-cirq.yaml68
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/mediatek,sysirq.txt1
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/microchip,eic.yaml73
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/mrvl,intc.yaml12
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/mscc,ocelot-icpu-intr.yaml4
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/msi-controller.yaml46
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/mti,cpu-interrupt-controller.yaml46
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/nuvoton,wpcm450-aic.yaml2
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/qcom,mpm.yaml96
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/qcom,pdc.txt76
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/qcom,pdc.yaml93
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/rda,8810pl-intc.txt61
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/rda,8810pl-intc.yaml43
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/realtek,rtl-intc.yaml62
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/renesas,h8300h-intc.txt22
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/renesas,h8s-intc.txt22
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/renesas,irqc.yaml5
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/renesas,rzg2l-irqc.yaml134
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/samsung,exynos4210-combiner.yaml2
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/samsung,s3c24xx-irq.txt53
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/sifive,plic-1.0.0.yaml106
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/socionext,synquacer-exiu.txt31
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/socionext,synquacer-exiu.yaml53
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/socionext,uniphier-aidet.yaml1
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/st,sti-irq-syscfg.txt9
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/st,stm32-exti.yaml7
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/sunplus,sp7021-intc.yaml62
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/ti,sci-inta.yaml8
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/ti,sci-intr.yaml3
-rw-r--r--Documentation/devicetree/bindings/iommu/allwinner,sun50i-h6-iommu.yaml2
-rw-r--r--Documentation/devicetree/bindings/iommu/apple,dart.yaml8
-rw-r--r--Documentation/devicetree/bindings/iommu/apple,sart.yaml56
-rw-r--r--Documentation/devicetree/bindings/iommu/arm,smmu-v3.yaml11
-rw-r--r--Documentation/devicetree/bindings/iommu/arm,smmu.yaml295
-rw-r--r--Documentation/devicetree/bindings/iommu/mediatek,iommu.yaml63
-rw-r--r--Documentation/devicetree/bindings/iommu/qcom,iommu.txt121
-rw-r--r--Documentation/devicetree/bindings/iommu/qcom,iommu.yaml113
-rw-r--r--Documentation/devicetree/bindings/iommu/renesas,ipmmu-vmsa.yaml38
-rw-r--r--Documentation/devicetree/bindings/iommu/samsung,sysmmu.yaml11
-rw-r--r--Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml39
-rw-r--r--Documentation/devicetree/bindings/ipmi/aspeed,ast2400-ibt-bmc.txt3
-rw-r--r--Documentation/devicetree/bindings/ipmi/ipmi-ipmb.yaml67
-rw-r--r--Documentation/devicetree/bindings/ipmi/ipmi-smic.yaml2
-rw-r--r--Documentation/devicetree/bindings/ipmi/npcm7xx-kcs-bmc.txt5
-rw-r--r--Documentation/devicetree/bindings/ipmi/ssif-bmc.yaml38
-rw-r--r--Documentation/devicetree/bindings/leds/backlight/common.yaml2
-rw-r--r--Documentation/devicetree/bindings/leds/backlight/gpio-backlight.yaml4
-rw-r--r--Documentation/devicetree/bindings/leds/backlight/kinetic,ktz8866.yaml76
-rw-r--r--Documentation/devicetree/bindings/leds/backlight/led-backlight.yaml6
-rw-r--r--Documentation/devicetree/bindings/leds/backlight/lm3630a-backlight.yaml2
-rw-r--r--Documentation/devicetree/bindings/leds/backlight/mediatek,mt6370-backlight.yaml121
-rw-r--r--Documentation/devicetree/bindings/leds/backlight/pwm-backlight.yaml4
-rw-r--r--Documentation/devicetree/bindings/leds/backlight/qcom-wled.yaml15
-rw-r--r--Documentation/devicetree/bindings/leds/backlight/richtek,rt4831-backlight.yaml5
-rw-r--r--Documentation/devicetree/bindings/leds/common.yaml83
-rw-r--r--Documentation/devicetree/bindings/leds/cznic,turris-omnia-leds.yaml7
-rw-r--r--Documentation/devicetree/bindings/leds/irled/gpio-ir-tx.txt14
-rw-r--r--Documentation/devicetree/bindings/leds/irled/gpio-ir-tx.yaml36
-rw-r--r--Documentation/devicetree/bindings/leds/irled/ir-spi-led.yaml61
-rw-r--r--Documentation/devicetree/bindings/leds/irled/pwm-ir-tx.txt13
-rw-r--r--Documentation/devicetree/bindings/leds/irled/pwm-ir-tx.yaml34
-rw-r--r--Documentation/devicetree/bindings/leds/irled/spi-ir-led.txt29
-rw-r--r--Documentation/devicetree/bindings/leds/issi,is31fl319x.yaml194
-rw-r--r--Documentation/devicetree/bindings/leds/kinetic,ktd2692.yaml87
-rw-r--r--Documentation/devicetree/bindings/leds/leds-aat1290.txt77
-rw-r--r--Documentation/devicetree/bindings/leds/leds-aw2013.yaml3
-rw-r--r--Documentation/devicetree/bindings/leds/leds-bcm63138.yaml95
-rw-r--r--Documentation/devicetree/bindings/leds/leds-bcm6328.txt319
-rw-r--r--Documentation/devicetree/bindings/leds/leds-bcm6328.yaml404
-rw-r--r--Documentation/devicetree/bindings/leds/leds-class-multicolor.yaml34
-rw-r--r--Documentation/devicetree/bindings/leds/leds-gpio.yaml2
-rw-r--r--Documentation/devicetree/bindings/leds/leds-is31fl319x.txt61
-rw-r--r--Documentation/devicetree/bindings/leds/leds-ktd2692.txt50
-rw-r--r--Documentation/devicetree/bindings/leds/leds-lgm.yaml10
-rw-r--r--Documentation/devicetree/bindings/leds/leds-lp50xx.yaml127
-rw-r--r--Documentation/devicetree/bindings/leds/leds-lp55xx.yaml269
-rw-r--r--Documentation/devicetree/bindings/leds/leds-max77650.yaml9
-rw-r--r--Documentation/devicetree/bindings/leds/leds-mt6323.txt2
-rw-r--r--Documentation/devicetree/bindings/leds/leds-mt6360.yaml192
-rw-r--r--Documentation/devicetree/bindings/leds/leds-pca9532.txt49
-rw-r--r--Documentation/devicetree/bindings/leds/leds-pm8058.txt67
-rw-r--r--Documentation/devicetree/bindings/leds/leds-pwm-multicolor.yaml88
-rw-r--r--Documentation/devicetree/bindings/leds/leds-pwm.yaml2
-rw-r--r--Documentation/devicetree/bindings/leds/leds-qcom-lpg.yaml188
-rw-r--r--Documentation/devicetree/bindings/leds/leds-rt4505.yaml3
-rw-r--r--Documentation/devicetree/bindings/leds/leds-sgm3140.yaml5
-rw-r--r--Documentation/devicetree/bindings/leds/maxim,max77693.yaml105
-rw-r--r--Documentation/devicetree/bindings/leds/mediatek,mt6370-flashlight.yaml41
-rw-r--r--Documentation/devicetree/bindings/leds/mediatek,mt6370-indicator.yaml80
-rw-r--r--Documentation/devicetree/bindings/leds/nxp,pca953x.yaml90
-rw-r--r--Documentation/devicetree/bindings/leds/qcom,pm8058-led.yaml57
-rw-r--r--Documentation/devicetree/bindings/leds/qcom,spmi-flash-led.yaml117
-rw-r--r--Documentation/devicetree/bindings/leds/register-bit-led.txt94
-rw-r--r--Documentation/devicetree/bindings/leds/register-bit-led.yaml95
-rw-r--r--Documentation/devicetree/bindings/leds/regulator-led.yaml55
-rw-r--r--Documentation/devicetree/bindings/leds/rohm,bd2606mvv.yaml81
-rw-r--r--Documentation/devicetree/bindings/leds/rohm,bd71828-leds.yaml18
-rw-r--r--Documentation/devicetree/bindings/leds/skyworks,aat1290.yaml95
-rw-r--r--Documentation/devicetree/bindings/leds/ti,tca6507.yaml4
-rw-r--r--Documentation/devicetree/bindings/mailbox/amlogic,meson-gxbb-mhu.yaml7
-rw-r--r--Documentation/devicetree/bindings/mailbox/apple,mailbox.yaml88
-rw-r--r--Documentation/devicetree/bindings/mailbox/arm,mhu.yaml1
-rw-r--r--Documentation/devicetree/bindings/mailbox/fsl,mu.yaml42
-rw-r--r--Documentation/devicetree/bindings/mailbox/mediatek,gce-mailbox.yaml90
-rw-r--r--Documentation/devicetree/bindings/mailbox/microchip,mpfs-mailbox.yaml54
-rw-r--r--Documentation/devicetree/bindings/mailbox/microchip,polarfire-soc-mailbox.yaml47
-rw-r--r--Documentation/devicetree/bindings/mailbox/mtk,adsp-mbox.yaml51
-rw-r--r--Documentation/devicetree/bindings/mailbox/mtk-gce.txt80
-rw-r--r--Documentation/devicetree/bindings/mailbox/nvidia,tegra186-hsp.txt72
-rw-r--r--Documentation/devicetree/bindings/mailbox/nvidia,tegra186-hsp.yaml123
-rw-r--r--Documentation/devicetree/bindings/mailbox/qcom,apcs-kpss-global.yaml156
-rw-r--r--Documentation/devicetree/bindings/mailbox/qcom-ipcc.yaml37
-rw-r--r--Documentation/devicetree/bindings/mailbox/sprd-mailbox.yaml6
-rw-r--r--Documentation/devicetree/bindings/mailbox/st,sti-mailbox.yaml53
-rw-r--r--Documentation/devicetree/bindings/mailbox/st,stm32-ipcc.yaml21
-rw-r--r--Documentation/devicetree/bindings/mailbox/sti-mailbox.txt51
-rw-r--r--Documentation/devicetree/bindings/mailbox/ti,omap-mailbox.yaml9
-rw-r--r--Documentation/devicetree/bindings/mailbox/xlnx,zynqmp-ipi-mailbox.txt127
-rw-r--r--Documentation/devicetree/bindings/mailbox/xlnx,zynqmp-ipi-mailbox.yaml141
-rw-r--r--Documentation/devicetree/bindings/media/allegro,al5e.yaml2
-rw-r--r--Documentation/devicetree/bindings/media/allwinner,sun4i-a10-csi.yaml2
-rw-r--r--Documentation/devicetree/bindings/media/allwinner,sun4i-a10-ir.yaml5
-rw-r--r--Documentation/devicetree/bindings/media/allwinner,sun4i-a10-video-engine.yaml7
-rw-r--r--Documentation/devicetree/bindings/media/allwinner,sun50i-h6-vpu-g2.yaml69
-rw-r--r--Documentation/devicetree/bindings/media/allwinner,sun6i-a31-csi.yaml64
-rw-r--r--Documentation/devicetree/bindings/media/allwinner,sun6i-a31-isp.yaml101
-rw-r--r--Documentation/devicetree/bindings/media/allwinner,sun6i-a31-mipi-csi2.yaml137
-rw-r--r--Documentation/devicetree/bindings/media/allwinner,sun8i-a83t-de2-rotate.yaml2
-rw-r--r--Documentation/devicetree/bindings/media/allwinner,sun8i-a83t-mipi-csi2.yaml125
-rw-r--r--Documentation/devicetree/bindings/media/allwinner,sun8i-h3-deinterlace.yaml2
-rw-r--r--Documentation/devicetree/bindings/media/amlogic,axg-ge2d.yaml6
-rw-r--r--Documentation/devicetree/bindings/media/amlogic,gx-vdec.yaml6
-rw-r--r--Documentation/devicetree/bindings/media/amlogic,meson-ir-tx.yaml4
-rw-r--r--Documentation/devicetree/bindings/media/amlogic,meson6-ir.yaml47
-rw-r--r--Documentation/devicetree/bindings/media/amphion,vpu.yaml180
-rw-r--r--Documentation/devicetree/bindings/media/atmel,isc.yaml4
-rw-r--r--Documentation/devicetree/bindings/media/cec-gpio.txt42
-rw-r--r--Documentation/devicetree/bindings/media/cec.txt8
-rw-r--r--Documentation/devicetree/bindings/media/cec/amlogic,meson-gx-ao-cec.yaml (renamed from Documentation/devicetree/bindings/media/amlogic,meson-gx-ao-cec.yaml)13
-rw-r--r--Documentation/devicetree/bindings/media/cec/cec-common.yaml28
-rw-r--r--Documentation/devicetree/bindings/media/cec/cec-gpio.yaml74
-rw-r--r--Documentation/devicetree/bindings/media/cec/nvidia,tegra114-cec.yaml58
-rw-r--r--Documentation/devicetree/bindings/media/cec/samsung,s5p-cec.yaml66
-rw-r--r--Documentation/devicetree/bindings/media/cec/st,stih-cec.yaml66
-rw-r--r--Documentation/devicetree/bindings/media/cec/st,stm32-cec.yaml (renamed from Documentation/devicetree/bindings/media/st,stm32-cec.yaml)7
-rw-r--r--Documentation/devicetree/bindings/media/coda.yaml1
-rw-r--r--Documentation/devicetree/bindings/media/exynos-fimc-lite.txt16
-rw-r--r--Documentation/devicetree/bindings/media/exynos-jpeg-codec.txt16
-rw-r--r--Documentation/devicetree/bindings/media/exynos4-fimc-is.txt50
-rw-r--r--Documentation/devicetree/bindings/media/exynos5-gsc.txt38
-rw-r--r--Documentation/devicetree/bindings/media/fsl,imx6ull-pxp.yaml88
-rw-r--r--Documentation/devicetree/bindings/media/fsl-pxp.txt26
-rw-r--r--Documentation/devicetree/bindings/media/gpio-ir-receiver.txt20
-rw-r--r--Documentation/devicetree/bindings/media/gpio-ir-receiver.yaml43
-rw-r--r--Documentation/devicetree/bindings/media/i2c/adv748x.txt116
-rw-r--r--Documentation/devicetree/bindings/media/i2c/adv748x.yaml212
-rw-r--r--Documentation/devicetree/bindings/media/i2c/adv7604.yaml16
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ak7375.txt8
-rw-r--r--Documentation/devicetree/bindings/media/i2c/aptina,mt9p031.yaml109
-rw-r--r--Documentation/devicetree/bindings/media/i2c/aptina,mt9v111.yaml2
-rw-r--r--Documentation/devicetree/bindings/media/i2c/asahi-kasei,ak7375.yaml52
-rw-r--r--Documentation/devicetree/bindings/media/i2c/chrontel,ch7322.yaml15
-rw-r--r--Documentation/devicetree/bindings/media/i2c/dongwoon,dw9714.txt9
-rw-r--r--Documentation/devicetree/bindings/media/i2c/dongwoon,dw9714.yaml47
-rw-r--r--Documentation/devicetree/bindings/media/i2c/dongwoon,dw9768.yaml8
-rw-r--r--Documentation/devicetree/bindings/media/i2c/dongwoon,dw9807-vcm.txt9
-rw-r--r--Documentation/devicetree/bindings/media/i2c/dongwoon,dw9807-vcm.yaml41
-rw-r--r--Documentation/devicetree/bindings/media/i2c/hynix,hi846.yaml124
-rw-r--r--Documentation/devicetree/bindings/media/i2c/imx219.yaml2
-rw-r--r--Documentation/devicetree/bindings/media/i2c/imx258.yaml4
-rw-r--r--Documentation/devicetree/bindings/media/i2c/imx290.txt57
-rw-r--r--Documentation/devicetree/bindings/media/i2c/isil,isl79987.yaml113
-rw-r--r--Documentation/devicetree/bindings/media/i2c/maxim,max9286.yaml315
-rw-r--r--Documentation/devicetree/bindings/media/i2c/maxim,max96712.yaml111
-rw-r--r--Documentation/devicetree/bindings/media/i2c/mipi-ccs.yaml5
-rw-r--r--Documentation/devicetree/bindings/media/i2c/mt9p031.txt40
-rw-r--r--Documentation/devicetree/bindings/media/i2c/onnn,ar0521.yaml112
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ov2685.txt41
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ov5640.txt92
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ov5645.txt54
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ov8856.yaml4
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov02a10.yaml4
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov2685.yaml102
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov4689.yaml134
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov5640.yaml154
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov5645.yaml104
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov5648.yaml4
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov5670.yaml93
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov5675.yaml122
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov5693.yaml124
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov772x.yaml5
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov8858.yaml106
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov8865.yaml4
-rw-r--r--Documentation/devicetree/bindings/media/i2c/ovti,ov9282.yaml16
-rw-r--r--Documentation/devicetree/bindings/media/i2c/rda,rda5807.yaml2
-rw-r--r--Documentation/devicetree/bindings/media/i2c/samsung,s5k5baf.yaml101
-rw-r--r--Documentation/devicetree/bindings/media/i2c/samsung,s5k6a3.yaml98
-rw-r--r--Documentation/devicetree/bindings/media/i2c/sony,imx214.yaml2
-rw-r--r--Documentation/devicetree/bindings/media/i2c/sony,imx274.yaml2
-rw-r--r--Documentation/devicetree/bindings/media/i2c/sony,imx290.yaml140
-rw-r--r--Documentation/devicetree/bindings/media/i2c/sony,imx296.yaml106
-rw-r--r--Documentation/devicetree/bindings/media/i2c/sony,imx334.yaml4
-rw-r--r--Documentation/devicetree/bindings/media/i2c/sony,imx335.yaml2
-rw-r--r--Documentation/devicetree/bindings/media/i2c/sony,imx412.yaml15
-rw-r--r--Documentation/devicetree/bindings/media/i2c/sony,imx415.yaml122
-rw-r--r--Documentation/devicetree/bindings/media/i2c/st,st-mipid02.txt82
-rw-r--r--Documentation/devicetree/bindings/media/i2c/st,st-mipid02.yaml176
-rw-r--r--Documentation/devicetree/bindings/media/i2c/st,st-vgxy61.yaml113
-rw-r--r--Documentation/devicetree/bindings/media/i2c/toshiba,tc358746.yaml178
-rw-r--r--Documentation/devicetree/bindings/media/marvell,mmp2-ccic.yaml5
-rw-r--r--Documentation/devicetree/bindings/media/mediatek,mdp3-rdma.yaml95
-rw-r--r--Documentation/devicetree/bindings/media/mediatek,mdp3-rsz.yaml77
-rw-r--r--Documentation/devicetree/bindings/media/mediatek,mdp3-wrot.yaml80
-rw-r--r--  Documentation/devicetree/bindings/media/mediatek,mt8195-jpegdec.yaml  161
-rw-r--r--  Documentation/devicetree/bindings/media/mediatek,mt8195-jpegenc.yaml  140
-rw-r--r--  Documentation/devicetree/bindings/media/mediatek,vcodec-decoder.yaml  162
-rw-r--r--  Documentation/devicetree/bindings/media/mediatek,vcodec-encoder.yaml  185
-rw-r--r--  Documentation/devicetree/bindings/media/mediatek,vcodec-subdev-decoder.yaml  268
-rw-r--r--  Documentation/devicetree/bindings/media/mediatek-jpeg-decoder.txt  38
-rw-r--r--  Documentation/devicetree/bindings/media/mediatek-jpeg-decoder.yaml  81
-rw-r--r--  Documentation/devicetree/bindings/media/mediatek-jpeg-encoder.txt  35
-rw-r--r--  Documentation/devicetree/bindings/media/mediatek-jpeg-encoder.yaml  74
-rw-r--r--  Documentation/devicetree/bindings/media/mediatek-mdp.txt  8
-rw-r--r--  Documentation/devicetree/bindings/media/mediatek-vcodec.txt  129
-rw-r--r--  Documentation/devicetree/bindings/media/meson-ir.txt  20
-rw-r--r--  Documentation/devicetree/bindings/media/microchip,csi2dc.yaml  199
-rw-r--r--  Documentation/devicetree/bindings/media/microchip,sama5d4-vdec.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/media/microchip,xisc.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/media/nvidia,tegra-vde.txt  64
-rw-r--r--  Documentation/devicetree/bindings/media/nvidia,tegra-vde.yaml  119
-rw-r--r--  Documentation/devicetree/bindings/media/nxp,dw100.yaml  69
-rw-r--r--  Documentation/devicetree/bindings/media/nxp,imx-mipi-csi2.yaml  219
-rw-r--r--  Documentation/devicetree/bindings/media/nxp,imx7-csi.yaml  16
-rw-r--r--  Documentation/devicetree/bindings/media/nxp,imx7-mipi-csi2.yaml  226
-rw-r--r--  Documentation/devicetree/bindings/media/nxp,imx8-isi.yaml  173
-rw-r--r--  Documentation/devicetree/bindings/media/nxp,imx8-jpeg.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/media/nxp,imx8mq-mipi-csi2.yaml  22
-rw-r--r--  Documentation/devicetree/bindings/media/nxp,imx8mq-vpu.yaml  72
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,msm8916-camss.yaml  14
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,msm8916-venus.yaml  85
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,msm8996-camss.yaml  24
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,msm8996-venus.yaml  145
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,sc7180-venus.yaml  96
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,sc7280-venus.yaml  138
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,sdm660-camss.yaml  31
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,sdm660-venus.yaml  159
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,sdm845-camss.yaml  35
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,sdm845-venus-v2.yaml  107
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,sdm845-venus.yaml  103
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,sm8250-camss.yaml  463
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,sm8250-venus.yaml  121
-rw-r--r--  Documentation/devicetree/bindings/media/qcom,venus-common.yaml  73
-rw-r--r--  Documentation/devicetree/bindings/media/rc.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,ceu.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,csi2.yaml  7
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,fcp.yaml  45
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,imr.txt  31
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,imr.yaml  67
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,isp.yaml  1
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,jpu.txt  25
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,jpu.yaml  65
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,rzg2l-cru.yaml  157
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,rzg2l-csi2.yaml  149
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,vin.yaml  5
-rw-r--r--  Documentation/devicetree/bindings/media/renesas,vsp1.yaml  64
-rw-r--r--  Documentation/devicetree/bindings/media/rockchip,rk3568-vepu.yaml  69
-rw-r--r--  Documentation/devicetree/bindings/media/rockchip,vdec.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/media/rockchip-isp1.yaml  156
-rw-r--r--  Documentation/devicetree/bindings/media/rockchip-vpu.yaml  5
-rw-r--r--  Documentation/devicetree/bindings/media/s5p-cec.txt  36
-rw-r--r--  Documentation/devicetree/bindings/media/s5p-mfc.txt  10
-rw-r--r--  Documentation/devicetree/bindings/media/samsung,exynos4210-csis.yaml  170
-rw-r--r--  Documentation/devicetree/bindings/media/samsung,exynos4210-fimc.yaml  152
-rw-r--r--  Documentation/devicetree/bindings/media/samsung,exynos4212-fimc-is.yaml  220
-rw-r--r--  Documentation/devicetree/bindings/media/samsung,exynos4212-fimc-lite.yaml  63
-rw-r--r--  Documentation/devicetree/bindings/media/samsung,exynos5250-gsc.yaml  109
-rw-r--r--  Documentation/devicetree/bindings/media/samsung,fimc.yaml  279
-rw-r--r--  Documentation/devicetree/bindings/media/samsung,s5c73m3.yaml  165
-rw-r--r--  Documentation/devicetree/bindings/media/samsung,s5pv210-jpeg.yaml  123
-rw-r--r--  Documentation/devicetree/bindings/media/samsung-fimc.txt  209
-rw-r--r--  Documentation/devicetree/bindings/media/samsung-mipi-csis.txt  81
-rw-r--r--  Documentation/devicetree/bindings/media/samsung-s5c73m3.txt  97
-rw-r--r--  Documentation/devicetree/bindings/media/samsung-s5k5baf.txt  58
-rw-r--r--  Documentation/devicetree/bindings/media/samsung-s5k6a3.txt  33
-rw-r--r--  Documentation/devicetree/bindings/media/si470x.txt  26
-rw-r--r--  Documentation/devicetree/bindings/media/silabs,si470x.yaml  48
-rw-r--r--  Documentation/devicetree/bindings/media/st,stm32-dcmi.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/media/st,stm32-dma2d.yaml  71
-rw-r--r--  Documentation/devicetree/bindings/media/stih-cec.txt  27
-rw-r--r--  Documentation/devicetree/bindings/media/tegra-cec.txt  27
-rw-r--r--  Documentation/devicetree/bindings/media/ti,cal.yaml  16
-rw-r--r--  Documentation/devicetree/bindings/media/ti,vpe.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/media/video-interface-devices.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/media/video-interfaces.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/arm,pl353-smc.yaml  130
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/arm,pl35x-smc.yaml  156
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/brcm,brcmstb-memc-ddr.yaml  52
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/brcm,dpfe-cpu.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/calxeda-ddr-ctrlr.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/canaan,k210-sram.yaml  52
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ddr/jedec,lpddr-channel.yaml  146
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ddr/jedec,lpddr-props.yaml  74
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ddr/jedec,lpddr2-timings.yaml  135
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ddr/jedec,lpddr2.yaml  204
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ddr/jedec,lpddr3-timings.yaml  157
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ddr/jedec,lpddr3.yaml  243
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ddr/jedec,lpddr4.yaml  35
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ddr/jedec,lpddr5.yaml  46
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/exynos-srom.yaml  1
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/fsl/ddr.txt  29
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/fsl/fsl,ddr.yaml  77
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/fsl/fsl,ifc.yaml  113
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/fsl/ifc.txt  82
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/fsl/imx8m-ddrc.yaml  5
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ingenic,nemc-peripherals.yaml  46
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ingenic,nemc.yaml  36
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/intel,ixp4xx-expansion-bus-controller.yaml  107
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/intel,ixp4xx-expansion-peripheral-props.yaml  80
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/marvell,mvebu-sdram-controller.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/mc-peripheral-props.yaml  39
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/mediatek,mt7621-memc.yaml  32
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml  72
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml  32
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/nvidia,tegra124-emc.yaml  1
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/nvidia,tegra186-mc.yaml  165
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/nvidia,tegra20-emc.yaml  23
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/omap-gpmc.txt  157
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/qca,ath79-ddr-controller.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/renesas,dbsc.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/renesas,h8300-bsc.yaml  35
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/renesas,rpc-if.yaml  67
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/rockchip,rk3399-dmc.yaml  384
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/samsung,exynos5422-dmc.yaml  12
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/snps,dw-umctl2-ddrc.yaml  118
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/st,stm32-fmc2-ebi-props.yaml  144
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/st,stm32-fmc2-ebi.yaml  143
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/synopsys,ddrc-ecc.yaml  73
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ti,da8xx-ddrctl.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ti,gpmc-child.yaml  252
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/ti,gpmc.yaml  190
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/xlnx,zynq-ddrc-a05.yaml  38
-rw-r--r--  Documentation/devicetree/bindings/memory-controllers/xlnx,zynqmp-ocmc-1.0.yaml  45
-rw-r--r--  Documentation/devicetree/bindings/mfd/ab8500.txt  282
-rw-r--r--  Documentation/devicetree/bindings/mfd/ac100.txt  50
-rw-r--r--  Documentation/devicetree/bindings/mfd/actions,atc260x.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mfd/allwinner,sun4i-a10-ts.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/allwinner,sun6i-a31-prcm.yaml  42
-rw-r--r--  Documentation/devicetree/bindings/mfd/allwinner,sun8i-a23-prcm.yaml  12
-rw-r--r--  Documentation/devicetree/bindings/mfd/ampere,smpro.yaml  42
-rw-r--r--  Documentation/devicetree/bindings/mfd/aspeed,ast2x00-scu.yaml  110
-rw-r--r--  Documentation/devicetree/bindings/mfd/aspeed-lpc.txt  157
-rw-r--r--  Documentation/devicetree/bindings/mfd/aspeed-lpc.yaml  199
-rw-r--r--  Documentation/devicetree/bindings/mfd/aspeed-scu.txt  48
-rw-r--r--  Documentation/devicetree/bindings/mfd/atmel-flexcom.txt  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/atmel-usart.txt  98
-rw-r--r--  Documentation/devicetree/bindings/mfd/axp20x.txt  273
-rw-r--r--  Documentation/devicetree/bindings/mfd/bd9571mwv.txt  69
-rw-r--r--  Documentation/devicetree/bindings/mfd/brcm,bcm6318-gpio-sysctl.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/brcm,bcm63268-gpio-sysctl.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/brcm,bcm6328-gpio-sysctl.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/brcm,bcm6358-gpio-sysctl.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/brcm,bcm6362-gpio-sysctl.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/brcm,bcm6368-gpio-sysctl.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/brcm,cru.yaml  32
-rw-r--r--  Documentation/devicetree/bindings/mfd/brcm,misc.yaml  60
-rw-r--r--  Documentation/devicetree/bindings/mfd/brcm,twd.yaml  69
-rw-r--r--  Documentation/devicetree/bindings/mfd/canaan,k210-sysctl.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/mfd/cirrus,lochnagar.yaml  13
-rw-r--r--  Documentation/devicetree/bindings/mfd/cirrus,madera.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mfd/da9062.txt  13
-rw-r--r--  Documentation/devicetree/bindings/mfd/da9063.txt  111
-rw-r--r--  Documentation/devicetree/bindings/mfd/delta,tn48m-cpld.yaml  90
-rw-r--r--  Documentation/devicetree/bindings/mfd/dlg,da9063.yaml  146
-rw-r--r--  Documentation/devicetree/bindings/mfd/ene-kb3930.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/ene-kb930.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/mfd/fsl,imx8qxp-csr.yaml  192
-rw-r--r--  Documentation/devicetree/bindings/mfd/gateworks-gsc.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/mfd/google,cros-ec.yaml  190
-rw-r--r--  Documentation/devicetree/bindings/mfd/hisilicon,hi6421-spmi-pmic.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/khadas,mcu.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mfd/max14577.txt  147
-rw-r--r--  Documentation/devicetree/bindings/mfd/max77650.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/max77686.txt  26
-rw-r--r--  Documentation/devicetree/bindings/mfd/max77693.txt  194
-rw-r--r--  Documentation/devicetree/bindings/mfd/max77802.txt  25
-rw-r--r--  Documentation/devicetree/bindings/mfd/maxim,max14577.yaml  196
-rw-r--r--  Documentation/devicetree/bindings/mfd/maxim,max5970.yaml  151
-rw-r--r--  Documentation/devicetree/bindings/mfd/maxim,max77686.yaml  132
-rw-r--r--  Documentation/devicetree/bindings/mfd/maxim,max77693.yaml  143
-rw-r--r--  Documentation/devicetree/bindings/mfd/maxim,max77714.yaml  68
-rw-r--r--  Documentation/devicetree/bindings/mfd/maxim,max77802.yaml  194
-rw-r--r--  Documentation/devicetree/bindings/mfd/maxim,max77843.yaml  145
-rw-r--r--  Documentation/devicetree/bindings/mfd/mediatek,mt6357.yaml  112
-rw-r--r--  Documentation/devicetree/bindings/mfd/mediatek,mt6360.yaml  247
-rw-r--r--  Documentation/devicetree/bindings/mfd/mediatek,mt6370.yaml  282
-rw-r--r--  Documentation/devicetree/bindings/mfd/mediatek,mt8195-scpsys.yaml  68
-rw-r--r--  Documentation/devicetree/bindings/mfd/mps,mp2629.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mfd/mscc,ocelot.yaml  169
-rw-r--r--  Documentation/devicetree/bindings/mfd/mt6397.txt  13
-rw-r--r--  Documentation/devicetree/bindings/mfd/nxp,bbnsm.yaml  101
-rw-r--r--  Documentation/devicetree/bindings/mfd/omap-usb-host.txt  8
-rw-r--r--  Documentation/devicetree/bindings/mfd/qcom,pm8008.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.txt  84
-rw-r--r--  Documentation/devicetree/bindings/mfd/qcom,spmi-pmic.yaml  327
-rw-r--r--  Documentation/devicetree/bindings/mfd/qcom,tcsr.txt  22
-rw-r--r--  Documentation/devicetree/bindings/mfd/qcom,tcsr.yaml  62
-rw-r--r--  Documentation/devicetree/bindings/mfd/qcom-pm8xxx.yaml  40
-rw-r--r--  Documentation/devicetree/bindings/mfd/qcom-rpm.txt  283
-rw-r--r--  Documentation/devicetree/bindings/mfd/richtek,rt5120.yaml  178
-rw-r--r--  Documentation/devicetree/bindings/mfd/rk808.txt  465
-rw-r--r--  Documentation/devicetree/bindings/mfd/rockchip,rk805.yaml  219
-rw-r--r--  Documentation/devicetree/bindings/mfd/rockchip,rk808.yaml  257
-rw-r--r--  Documentation/devicetree/bindings/mfd/rockchip,rk809.yaml  284
-rw-r--r--  Documentation/devicetree/bindings/mfd/rockchip,rk817.yaml  384
-rw-r--r--  Documentation/devicetree/bindings/mfd/rockchip,rk818.yaml  282
-rw-r--r--  Documentation/devicetree/bindings/mfd/rohm,bd70528-pmic.txt  102
-rw-r--r--  Documentation/devicetree/bindings/mfd/rohm,bd71815-pmic.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/mfd/rohm,bd71828-pmic.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/mfd/rohm,bd71837-pmic.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mfd/rohm,bd71847-pmic.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mfd/rohm,bd9571mwv.yaml  127
-rw-r--r--  Documentation/devicetree/bindings/mfd/rohm,bd9576-pmic.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mfd/samsung,exynos5433-lpass.txt  72
-rw-r--r--  Documentation/devicetree/bindings/mfd/samsung,exynos5433-lpass.yaml  117
-rw-r--r--  Documentation/devicetree/bindings/mfd/samsung,s2mpa01.yaml  91
-rw-r--r--  Documentation/devicetree/bindings/mfd/samsung,s2mps11.yaml  267
-rw-r--r--  Documentation/devicetree/bindings/mfd/samsung,s5m8767.yaml  307
-rw-r--r--  Documentation/devicetree/bindings/mfd/samsung,sec-core.txt  86
-rw-r--r--  Documentation/devicetree/bindings/mfd/silergy,sy7636a.yaml  83
-rw-r--r--  Documentation/devicetree/bindings/mfd/sprd,ums512-glbreg.yaml  71
-rw-r--r--  Documentation/devicetree/bindings/mfd/st,stm32-lptimer.yaml  36
-rw-r--r--  Documentation/devicetree/bindings/mfd/st,stm32-timers.yaml  45
-rw-r--r--  Documentation/devicetree/bindings/mfd/st,stmfx.yaml  5
-rw-r--r--  Documentation/devicetree/bindings/mfd/st,stpmic1.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mfd/stericsson,ab8500.yaml  522
-rw-r--r--  Documentation/devicetree/bindings/mfd/stericsson,db8500-prcmu.yaml  1
-rw-r--r--  Documentation/devicetree/bindings/mfd/syscon.yaml  52
-rw-r--r--  Documentation/devicetree/bindings/mfd/ti,am3359-tscadc.yaml  87
-rw-r--r--  Documentation/devicetree/bindings/mfd/ti,j721e-system-controller.yaml  35
-rw-r--r--  Documentation/devicetree/bindings/mfd/ti,nspire-misc.yaml  51
-rw-r--r--  Documentation/devicetree/bindings/mfd/ti,tps65086.yaml  3
-rw-r--r--  Documentation/devicetree/bindings/mfd/wlf,arizona.yaml  3
-rw-r--r--  Documentation/devicetree/bindings/mfd/x-powers,ac100.yaml  116
-rw-r--r--  Documentation/devicetree/bindings/mfd/x-powers,axp152.yaml  412
-rw-r--r--  Documentation/devicetree/bindings/mfd/xylon,logicvc.yaml  7
-rw-r--r--  Documentation/devicetree/bindings/mips/brcm/brcm,bmips.txt  8
-rw-r--r--  Documentation/devicetree/bindings/mips/brcm/soc.yaml  96
-rw-r--r--  Documentation/devicetree/bindings/mips/cpu_irq.txt  47
-rw-r--r--  Documentation/devicetree/bindings/mips/cpus.yaml  115
-rw-r--r--  Documentation/devicetree/bindings/mips/ingenic/devices.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mips/ingenic/ingenic,cpu.yaml  69
-rw-r--r--  Documentation/devicetree/bindings/mips/lantiq/lantiq,dma-xway.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mips/lantiq/rcu.txt  2
-rw-r--r--  Documentation/devicetree/bindings/mips/loongson/devices.yaml  14
-rw-r--r--  Documentation/devicetree/bindings/mips/loongson/ls2k-reset.yaml  38
-rw-r--r--  Documentation/devicetree/bindings/mips/ralink.txt  32
-rw-r--r--  Documentation/devicetree/bindings/mips/ralink.yaml  87
-rw-r--r--  Documentation/devicetree/bindings/mips/realtek-rtl.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/misc/eeprom-93xx46.yaml  70
-rw-r--r--  Documentation/devicetree/bindings/misc/idt,89hpesx.yaml  72
-rw-r--r--  Documentation/devicetree/bindings/misc/idt_89hpesx.txt  44
-rw-r--r--  Documentation/devicetree/bindings/misc/nvidia,tegra186-misc.txt  14
-rw-r--r--  Documentation/devicetree/bindings/misc/nvidia,tegra186-misc.yaml  43
-rw-r--r--  Documentation/devicetree/bindings/misc/nvidia,tegra20-apbmisc.txt  17
-rw-r--r--  Documentation/devicetree/bindings/misc/nvidia,tegra20-apbmisc.yaml  51
-rw-r--r--  Documentation/devicetree/bindings/misc/olpc,xo1.75-ec.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/misc/qcom,fastrpc.txt  78
-rw-r--r--  Documentation/devicetree/bindings/misc/qcom,fastrpc.yaml  144
-rw-r--r--  Documentation/devicetree/bindings/misc/qemu,vcpu-stall-detector.yaml  51
-rw-r--r--  Documentation/devicetree/bindings/misc/xlnx,tmr-inject.yaml  47
-rw-r--r--  Documentation/devicetree/bindings/misc/xlnx,tmr-manager.yaml  47
-rw-r--r--  Documentation/devicetree/bindings/mmc/allwinner,sun4i-a10-mmc.yaml  11
-rw-r--r--  Documentation/devicetree/bindings/mmc/amlogic,meson-gx-mmc.yaml  76
-rw-r--r--  Documentation/devicetree/bindings/mmc/amlogic,meson-gx.txt  39
-rw-r--r--  Documentation/devicetree/bindings/mmc/amlogic,meson-mx-sdhc.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mmc/arasan,sdhci.yaml  17
-rw-r--r--  Documentation/devicetree/bindings/mmc/arm,pl18x.yaml  29
-rw-r--r--  Documentation/devicetree/bindings/mmc/aspeed,sdhci.yaml  1
-rw-r--r--  Documentation/devicetree/bindings/mmc/brcm,sdhci-brcmstb.txt  53
-rw-r--r--  Documentation/devicetree/bindings/mmc/brcm,sdhci-brcmstb.yaml  118
-rw-r--r--  Documentation/devicetree/bindings/mmc/cdns,sdhci.yaml  54
-rw-r--r--  Documentation/devicetree/bindings/mmc/exynos-dw-mshc.txt  92
-rw-r--r--  Documentation/devicetree/bindings/mmc/fsl-imx-esdhc.yaml  63
-rw-r--r--  Documentation/devicetree/bindings/mmc/fsl-imx-mmc.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mmc/fujitsu,sdhci-fujitsu.yaml  66
-rw-r--r--  Documentation/devicetree/bindings/mmc/img-dw-mshc.txt  28
-rw-r--r--  Documentation/devicetree/bindings/mmc/ingenic,mmc.yaml  41
-rw-r--r--  Documentation/devicetree/bindings/mmc/litex,mmc.yaml  78
-rw-r--r--  Documentation/devicetree/bindings/mmc/marvell,dove-sdhci.yaml  44
-rw-r--r--  Documentation/devicetree/bindings/mmc/marvell,orion-sdio.yaml  44
-rw-r--r--  Documentation/devicetree/bindings/mmc/marvell,xenon-sdhci.txt  173
-rw-r--r--  Documentation/devicetree/bindings/mmc/marvell,xenon-sdhci.yaml  277
-rw-r--r--  Documentation/devicetree/bindings/mmc/microchip,dw-sparx5-sdhci.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/mmc/mmc-card.txt  30
-rw-r--r--  Documentation/devicetree/bindings/mmc/mmc-card.yaml  48
-rw-r--r--  Documentation/devicetree/bindings/mmc/mmc-controller.yaml  27
-rw-r--r--  Documentation/devicetree/bindings/mmc/mmc-pwrseq-emmc.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mmc/mmc-pwrseq-sd8787.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mmc/mmc-pwrseq-simple.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mmc/mmc-spi-slot.txt  29
-rw-r--r--  Documentation/devicetree/bindings/mmc/mmc-spi-slot.yaml  75
-rw-r--r--  Documentation/devicetree/bindings/mmc/mtk-sd.yaml  243
-rw-r--r--  Documentation/devicetree/bindings/mmc/mxs-mmc.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mmc/nvidia,tegra20-sdhci.txt  143
-rw-r--r--  Documentation/devicetree/bindings/mmc/nvidia,tegra20-sdhci.yaml  313
-rw-r--r--  Documentation/devicetree/bindings/mmc/orion-sdio.txt  16
-rw-r--r--  Documentation/devicetree/bindings/mmc/owl-mmc.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mmc/renesas,mmcif.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mmc/renesas,sdhi.yaml  74
-rw-r--r--  Documentation/devicetree/bindings/mmc/rockchip-dw-mshc.yaml  10
-rw-r--r--  Documentation/devicetree/bindings/mmc/samsung,exynos-dw-mshc.yaml  160
-rw-r--r--  Documentation/devicetree/bindings/mmc/samsung,s3c6410-sdhci.yaml  81
-rw-r--r--  Documentation/devicetree/bindings/mmc/samsung,s3cmci.txt  42
-rw-r--r--  Documentation/devicetree/bindings/mmc/samsung-sdhci.txt  32
-rw-r--r--  Documentation/devicetree/bindings/mmc/sdhci-am654.yaml  76
-rw-r--r--  Documentation/devicetree/bindings/mmc/sdhci-common.yaml  32
-rw-r--r--  Documentation/devicetree/bindings/mmc/sdhci-dove.txt  14
-rw-r--r--  Documentation/devicetree/bindings/mmc/sdhci-fujitsu.txt  32
-rw-r--r--  Documentation/devicetree/bindings/mmc/sdhci-msm.txt  120
-rw-r--r--  Documentation/devicetree/bindings/mmc/sdhci-msm.yaml  259
-rw-r--r--  Documentation/devicetree/bindings/mmc/sdhci-omap.txt  9
-rw-r--r--  Documentation/devicetree/bindings/mmc/sdhci-pxa.yaml  19
-rw-r--r--  Documentation/devicetree/bindings/mmc/snps,dwcmshc-sdhci.yaml  14
-rw-r--r--  Documentation/devicetree/bindings/mmc/socfpga-dw-mshc.txt  23
-rw-r--r--  Documentation/devicetree/bindings/mmc/socionext,uniphier-sd.yaml  16
-rw-r--r--  Documentation/devicetree/bindings/mmc/starfive,jh7110-mmc.yaml  77
-rw-r--r--  Documentation/devicetree/bindings/mmc/sunplus,mmc.yaml  61
-rw-r--r--  Documentation/devicetree/bindings/mmc/synopsys-dw-mshc-common.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mmc/synopsys-dw-mshc.yaml  39
-rw-r--r--  Documentation/devicetree/bindings/mtd/allwinner,sun4i-a10-nand.yaml  38
-rw-r--r--  Documentation/devicetree/bindings/mtd/amlogic,meson-nand.txt  60
-rw-r--r--  Documentation/devicetree/bindings/mtd/amlogic,meson-nand.yaml  93
-rw-r--r--  Documentation/devicetree/bindings/mtd/arasan,nand-controller.yaml  11
-rw-r--r--  Documentation/devicetree/bindings/mtd/arm,pl353-nand-r2p1.yaml  37
-rw-r--r--  Documentation/devicetree/bindings/mtd/aspeed-smc.txt  51
-rw-r--r--  Documentation/devicetree/bindings/mtd/atmel-nand.txt  6
-rw-r--r--  Documentation/devicetree/bindings/mtd/brcm,brcmnand.yaml  96
-rw-r--r--  Documentation/devicetree/bindings/mtd/common.txt  1
-rw-r--r--  Documentation/devicetree/bindings/mtd/cortina,gemini-flash.txt  24
-rw-r--r--  Documentation/devicetree/bindings/mtd/denali,nand.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/mtd/elm.txt  16
-rw-r--r--  Documentation/devicetree/bindings/mtd/gpmc-nand.txt  147
-rw-r--r--  Documentation/devicetree/bindings/mtd/gpmc-nor.txt  98
-rw-r--r--  Documentation/devicetree/bindings/mtd/gpmc-onenand.txt  48
-rw-r--r--  Documentation/devicetree/bindings/mtd/gpmi-nand.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/mtd/hisilicon,fmc-spi-nor.txt  2
-rw-r--r--  Documentation/devicetree/bindings/mtd/ingenic,nand.yaml  151
-rw-r--r--  Documentation/devicetree/bindings/mtd/intel,lgm-ebunand.yaml  91
-rw-r--r--  Documentation/devicetree/bindings/mtd/intel,lgm-nand.yaml  99
-rw-r--r--  Documentation/devicetree/bindings/mtd/jedec,spi-nor.yaml  39
-rw-r--r--  Documentation/devicetree/bindings/mtd/lpc32xx-mlc.txt  2
-rw-r--r--  Documentation/devicetree/bindings/mtd/lpc32xx-slc.txt  2
-rw-r--r--  Documentation/devicetree/bindings/mtd/mediatek,mtk-nfc.yaml  155
-rw-r--r--  Documentation/devicetree/bindings/mtd/mediatek,nand-ecc-engine.yaml  63
-rw-r--r--  Documentation/devicetree/bindings/mtd/microchip,mchp48l640.yaml  25
-rw-r--r--  Documentation/devicetree/bindings/mtd/mtd-physmap.yaml  26
-rw-r--r--  Documentation/devicetree/bindings/mtd/mtd.yaml  27
-rw-r--r--  Documentation/devicetree/bindings/mtd/mtk-nand.txt  176
-rw-r--r--  Documentation/devicetree/bindings/mtd/mxc-nand.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/mtd/mxicy,nand-ecc-engine.yaml  77
-rw-r--r--  Documentation/devicetree/bindings/mtd/nand-chip.yaml  74
-rw-r--r--  Documentation/devicetree/bindings/mtd/nand-controller.yaml  77
-rw-r--r--  Documentation/devicetree/bindings/mtd/partition.txt  33
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/arm,arm-firmware-suite.txt  17
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/arm,arm-firmware-suite.yaml  30
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/brcm,bcm4908-partitions.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/brcm,bcm947xx-cfe-partitions.txt  42
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/brcm,bcm947xx-cfe-partitions.yaml  50
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/fixed-partitions.yaml  49
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/linksys,ns-partitions.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/nvmem-cells.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/partition.yaml  25
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/partitions.yaml  41
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/qcom,smem-part.yaml  31
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/redboot-fis.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/tplink,safeloader-partitions.yaml  57
-rw-r--r--  Documentation/devicetree/bindings/mtd/partitions/u-boot.yaml  56
-rw-r--r--  Documentation/devicetree/bindings/mtd/qcom,nandc.yaml  158
-rw-r--r--  Documentation/devicetree/bindings/mtd/renesas-nandc.yaml  66
-rw-r--r--  Documentation/devicetree/bindings/mtd/rockchip,nand-controller.yaml  9
-rw-r--r--  Documentation/devicetree/bindings/mtd/spi-nand.txt  5
-rw-r--r--  Documentation/devicetree/bindings/mtd/spi-nand.yaml  28
-rw-r--r--  Documentation/devicetree/bindings/mtd/st,stm32-fmc2-nand.yaml  53
-rw-r--r--  Documentation/devicetree/bindings/mtd/ti,am654-hbmc.yaml  36
-rw-r--r--  Documentation/devicetree/bindings/mtd/ti,elm.yaml  72
-rw-r--r--  Documentation/devicetree/bindings/mtd/ti,gpmc-nand.yaml  129
-rw-r--r--  Documentation/devicetree/bindings/mtd/ti,gpmc-onenand.yaml  84
-rw-r--r--  Documentation/devicetree/bindings/mux/gpio-mux.yaml  13
-rw-r--r--  Documentation/devicetree/bindings/mux/mux-consumer.yaml  23
-rw-r--r--  Documentation/devicetree/bindings/mux/mux-controller.yaml  28
-rw-r--r--  Documentation/devicetree/bindings/mux/reg-mux.yaml  11
-rw-r--r--  Documentation/devicetree/bindings/nds32/andestech-boards  40
-rw-r--r--  Documentation/devicetree/bindings/nds32/atl2c.txt  28
-rw-r--r--  Documentation/devicetree/bindings/nds32/cpus.txt  38
-rw-r--r--  Documentation/devicetree/bindings/net/actions,owl-emac.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/net/adi,adin.yaml  18
-rw-r--r--  Documentation/devicetree/bindings/net/adi,adin1110.yaml  81
-rw-r--r--  Documentation/devicetree/bindings/net/allwinner,sun4i-a10-emac.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/net/allwinner,sun4i-a10-mdio.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml  35
-rw-r--r--  Documentation/devicetree/bindings/net/altera_tse.txt  113
-rw-r--r--  Documentation/devicetree/bindings/net/altr,tse.yaml  168
-rw-r--r--  Documentation/devicetree/bindings/net/amlogic,g12a-mdio-mux.yaml  80
-rw-r--r--  Documentation/devicetree/bindings/net/amlogic,gxl-mdio-mux.yaml  64
-rw-r--r--  Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/net/asix,ax88178.yaml  70
-rw-r--r--  Documentation/devicetree/bindings/net/asix,ax88796c.yaml  74
-rw-r--r--  Documentation/devicetree/bindings/net/aspeed,ast2600-mdio.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/net/bluetooth.txt  5
-rw-r--r--  Documentation/devicetree/bindings/net/bluetooth/bluetooth-controller.yaml  29
-rw-r--r--  Documentation/devicetree/bindings/net/bluetooth/brcm,bcm4377-bluetooth.yaml  81
-rw-r--r--  Documentation/devicetree/bindings/net/bluetooth/nxp,88w8987-bt.yaml  45
-rw-r--r--  Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml (renamed from Documentation/devicetree/bindings/net/qualcomm-bluetooth.yaml)  23
-rw-r--r--  Documentation/devicetree/bindings/net/brcm,amac.txt  30
-rw-r--r--  Documentation/devicetree/bindings/net/brcm,amac.yaml  88
-rw-r--r--  Documentation/devicetree/bindings/net/brcm,bcm6368-mdio-mux.yaml  26
-rw-r--r--  Documentation/devicetree/bindings/net/brcm,bcmgenet.txt  124
-rw-r--r--  Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml  143
-rw-r--r--  Documentation/devicetree/bindings/net/brcm,mdio-mux-iproc.txt  62
-rw-r--r--  Documentation/devicetree/bindings/net/brcm,mdio-mux-iproc.yaml  80
-rw-r--r--  Documentation/devicetree/bindings/net/brcm,systemport.txt  38
-rw-r--r--  Documentation/devicetree/bindings/net/brcm,systemport.yaml  86
-rw-r--r--  Documentation/devicetree/bindings/net/brcm,unimac-mdio.yaml  3
-rw-r--r--  Documentation/devicetree/bindings/net/broadcom-bluetooth.yaml  45
-rw-r--r--  Documentation/devicetree/bindings/net/can/allwinner,sun4i-a10-can.yaml  29
-rw-r--r--  Documentation/devicetree/bindings/net/can/bosch,c_can.yaml  13
-rw-r--r--  Documentation/devicetree/bindings/net/can/bosch,m_can.yaml  64
-rw-r--r--  Documentation/devicetree/bindings/net/can/can-controller.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/net/can/can-transceiver.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/net/can/ctu,ctucanfd.yaml  66
-rw-r--r--  Documentation/devicetree/bindings/net/can/fsl,flexcan.yaml  16
-rw-r--r--  Documentation/devicetree/bindings/net/can/microchip,mcp251xfd.yaml  24
-rw-r--r--  Documentation/devicetree/bindings/net/can/microchip,mpfs-can.yaml  45
-rw-r--r--  Documentation/devicetree/bindings/net/can/nxp,sja1000.yaml  136
-rw-r--r--  Documentation/devicetree/bindings/net/can/renesas,rcar-canfd.yaml  151
-rw-r--r--  Documentation/devicetree/bindings/net/can/sja1000.txt  58
-rw-r--r--  Documentation/devicetree/bindings/net/can/st,stm32-bxcan.yaml  85
-rw-r--r--  Documentation/devicetree/bindings/net/can/tcan4x5x.txt  2
-rw-r--r--  Documentation/devicetree/bindings/net/can/xilinx,can.yaml  161
-rw-r--r--  Documentation/devicetree/bindings/net/can/xilinx_can.txt  61
-rw-r--r--  Documentation/devicetree/bindings/net/cdns,macb.yaml  214
-rw-r--r--  Documentation/devicetree/bindings/net/cortina,gemini-ethernet.txt  92
-rw-r--r--  Documentation/devicetree/bindings/net/cortina,gemini-ethernet.yaml  138
-rw-r--r--  Documentation/devicetree/bindings/net/cpsw.txt  2
-rw-r--r--  Documentation/devicetree/bindings/net/davicom,dm9051.yaml  62
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/ar9331.txt  1
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/arrow,xrs700x.yaml  7
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/brcm,b53.yaml  123
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/brcm,sf2.yaml  27
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/dsa-port.yaml  81
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/dsa.yaml  93
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/hirschmann,hellcreek.yaml  15
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/lan9303.txt  2
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt  1
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/mediatek,mt7530.yaml  812
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/microchip,ksz.yaml  17
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/microchip,lan937x.yaml  192
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/mscc,ocelot.yaml  260
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/mt7530.txt  327
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml  54
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/ocelot.txt  213
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/qca8k.txt  215
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/qca8k.yaml  320
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/realtek-smi.txt  153
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/realtek.yaml  386
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/renesas,rzn1-a5psw.yaml  158
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/vitesse,vsc73xx.txt  2
-rw-r--r--  Documentation/devicetree/bindings/net/emac_rockchip.txt  52
-rw-r--r--  Documentation/devicetree/bindings/net/engleder,tsnep.yaml  118
-rw-r--r--  Documentation/devicetree/bindings/net/ethernet-controller.yaml  229
-rw-r--r--  Documentation/devicetree/bindings/net/ethernet-phy.yaml  68
-rw-r--r--  Documentation/devicetree/bindings/net/ethernet-switch-port.yaml  26
-rw-r--r--  Documentation/devicetree/bindings/net/ethernet-switch.yaml  66
-rw-r--r--  Documentation/devicetree/bindings/net/fsl,fec.yaml  34
-rw-r--r--  Documentation/devicetree/bindings/net/fsl,fman-dtsec.yaml  172
-rw-r--r--  Documentation/devicetree/bindings/net/fsl,qoriq-mc-dpmac.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/net/fsl-fman.txt  164
-rw-r--r--  Documentation/devicetree/bindings/net/gpmc-eth.txt  97
-rw-r--r--  Documentation/devicetree/bindings/net/ingenic,mac.yaml  5
-rw-r--r--  Documentation/devicetree/bindings/net/intel,dwmac-plat.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/net/intel,ixp46x-ptp-timer.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/net/intel,ixp4xx-ethernet.yaml  31
-rw-r--r--  Documentation/devicetree/bindings/net/intel,ixp4xx-hss.yaml  123
-rw-r--r--  Documentation/devicetree/bindings/net/lantiq,etop-xway.yaml  68
-rw-r--r--  Documentation/devicetree/bindings/net/lantiq,xrx200-net.txt  21
-rw-r--r--  Documentation/devicetree/bindings/net/lantiq,xrx200-net.yaml  58
-rw-r--r--  Documentation/devicetree/bindings/net/litex,liteeth.yaml  1
-rw-r--r--  Documentation/devicetree/bindings/net/macb.txt  56
-rw-r--r--  Documentation/devicetree/bindings/net/marvell,dfx-server.yaml  62
-rw-r--r--  Documentation/devicetree/bindings/net/marvell,mvusb.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/net/marvell,orion-mdio.yaml  82
-rw-r--r--  Documentation/devicetree/bindings/net/marvell,pp2.yaml  305
-rw-r--r--  Documentation/devicetree/bindings/net/marvell,prestera.txt  81
-rw-r--r--  Documentation/devicetree/bindings/net/marvell,prestera.yaml  91
-rw-r--r--  Documentation/devicetree/bindings/net/marvell-armada-370-neta.txt  1
-rw-r--r--  Documentation/devicetree/bindings/net/marvell-bluetooth.txt  25
-rw-r--r--  Documentation/devicetree/bindings/net/marvell-bluetooth.yaml  49
-rw-r--r--  Documentation/devicetree/bindings/net/marvell-orion-mdio.txt  54
-rw-r--r--  Documentation/devicetree/bindings/net/marvell-pp2.txt  141
-rw-r--r--  Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml  47
-rw-r--r--  Documentation/devicetree/bindings/net/mctp-i2c-controller.yaml  92
-rw-r--r--  Documentation/devicetree/bindings/net/mdio-gpio.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/net/mdio-mux-meson-g12a.txt  48
-rw-r--r--  Documentation/devicetree/bindings/net/mdio-mux.yaml  7
-rw-r--r--  Documentation/devicetree/bindings/net/mdio.yaml  10
-rw-r--r--  Documentation/devicetree/bindings/net/mediatek,mt7620-gsw.txt  24
-rw-r--r--  Documentation/devicetree/bindings/net/mediatek,net.yaml  489
-rw-r--r--  Documentation/devicetree/bindings/net/mediatek,star-emac.yaml  24
-rw-r--r--  Documentation/devicetree/bindings/net/mediatek-dwmac.txt  91
-rw-r--r--  Documentation/devicetree/bindings/net/mediatek-dwmac.yaml  184
-rw-r--r--  Documentation/devicetree/bindings/net/mediatek-net.txt  98
-rw-r--r--  Documentation/devicetree/bindings/net/micrel,ks8851.yaml  1
-rw-r--r--  Documentation/devicetree/bindings/net/micrel-ksz90x1.txt  1
-rw-r--r--  Documentation/devicetree/bindings/net/micrel.txt  10
-rw-r--r--  Documentation/devicetree/bindings/net/microchip,lan95xx.yaml  65
-rw-r--r--  Documentation/devicetree/bindings/net/microchip,lan966x-switch.yaml  171
-rw-r--r--  Documentation/devicetree/bindings/net/microchip,sparx5-switch.yaml  42
-rw-r--r--  Documentation/devicetree/bindings/net/motorcomm,yt8xxx.yaml  117
-rw-r--r--  Documentation/devicetree/bindings/net/mscc,miim.yaml  61
-rw-r--r--  Documentation/devicetree/bindings/net/mscc,vsc7514-switch.yaml  233
-rw-r--r--  Documentation/devicetree/bindings/net/mscc-miim.txt  26
-rw-r--r--  Documentation/devicetree/bindings/net/mscc-ocelot.txt  83
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/marvell,nci.yaml  170
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/nfcmrvl.txt  84
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/nxp,nci.yaml  62
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/nxp,pn532.yaml  65
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/nxp,pn544.yaml  58
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/nxp-nci.txt  33
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/pn532.txt  46
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/pn544.txt  33
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/samsung,s3fwrn5.yaml  3
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/st,st-nci.yaml  105
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/st,st21nfca.yaml  64
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/st,st95hf.yaml  58
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/st-nci-i2c.txt  38
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/st-nci-spi.txt  36
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/st21nfca.txt  37
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/st95hf.txt  45
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/ti,trf7970a.yaml  99
-rw-r--r--  Documentation/devicetree/bindings/net/nfc/trf7970a.txt  43
-rw-r--r--  Documentation/devicetree/bindings/net/nvidia,tegra234-mgbe.yaml  162
-rw-r--r--  Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml  13
-rw-r--r--  Documentation/devicetree/bindings/net/nxp,tja11xx.yaml  17
-rw-r--r--  Documentation/devicetree/bindings/net/oxnas-dwmac.txt  3
-rw-r--r--  Documentation/devicetree/bindings/net/pcs/fsl,lynx-pcs.yaml  40
-rw-r--r--  Documentation/devicetree/bindings/net/pcs/mediatek,sgmiisys.yaml  55
-rw-r--r--  Documentation/devicetree/bindings/net/pcs/renesas,rzn1-miic.yaml  171
-rw-r--r--  Documentation/devicetree/bindings/net/pse-pd/podl-pse-regulator.yaml  40
-rw-r--r--  Documentation/devicetree/bindings/net/pse-pd/pse-controller.yaml  33
-rw-r--r--  Documentation/devicetree/bindings/net/qca,ar71xx.yaml  17
-rw-r--r--  Documentation/devicetree/bindings/net/qca,ar803x.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/net/qcom,bam-dmux.yaml  92
-rw-r--r--  Documentation/devicetree/bindings/net/qcom,ethqos.txt  64
-rw-r--r--  Documentation/devicetree/bindings/net/qcom,ethqos.yaml  111
-rw-r--r--  Documentation/devicetree/bindings/net/qcom,ipa.yaml  103
-rw-r--r--  Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml  48
-rw-r--r--  Documentation/devicetree/bindings/net/qcom,ipq8064-mdio.yaml  9
-rw-r--r--  Documentation/devicetree/bindings/net/qcom-emac.txt  2
-rw-r--r--  Documentation/devicetree/bindings/net/ralink,rt2880-net.txt  59
-rw-r--r--  Documentation/devicetree/bindings/net/ralink,rt3050-esw.txt  30
-rw-r--r--  Documentation/devicetree/bindings/net/realtek-bluetooth.yaml  25
-rw-r--r--  Documentation/devicetree/bindings/net/renesas,ether.yaml  17
-rw-r--r--  Documentation/devicetree/bindings/net/renesas,etheravb.yaml  98
-rw-r--r--  Documentation/devicetree/bindings/net/renesas,r8a779f0-ether-switch.yaml  262
-rw-r--r--  Documentation/devicetree/bindings/net/rfkill-gpio.yaml  51
-rw-r--r--  Documentation/devicetree/bindings/net/rockchip,emac.yaml  115
-rw-r--r--  Documentation/devicetree/bindings/net/rockchip-dwmac.yaml  13
-rw-r--r--  Documentation/devicetree/bindings/net/sff,sfp.txt  85
-rw-r--r--  Documentation/devicetree/bindings/net/sff,sfp.yaml  143
-rw-r--r--  Documentation/devicetree/bindings/net/smsc,lan91c111.yaml  61
-rw-r--r--  Documentation/devicetree/bindings/net/smsc-lan91c111.txt  17
-rw-r--r--  Documentation/devicetree/bindings/net/snps,dwmac.yaml  468
-rw-r--r--  Documentation/devicetree/bindings/net/socionext,synquacer-netsec.yaml  73
-rw-r--r--  Documentation/devicetree/bindings/net/socionext,uniphier-ave4.yaml  63
-rw-r--r--  Documentation/devicetree/bindings/net/socionext-netsec.txt  56
-rw-r--r--  Documentation/devicetree/bindings/net/starfive,jh7110-dwmac.yaml  144
-rw-r--r--  Documentation/devicetree/bindings/net/sti-dwmac.txt  3
-rw-r--r--  Documentation/devicetree/bindings/net/stm32-dwmac.yaml  26
-rw-r--r--  Documentation/devicetree/bindings/net/sunplus,sp7021-emac.yaml  143
-rw-r--r--  Documentation/devicetree/bindings/net/ti,bluetooth.yaml  92
-rw-r--r--  Documentation/devicetree/bindings/net/ti,cpsw-switch.yaml  19
-rw-r--r--  Documentation/devicetree/bindings/net/ti,davinci-mdio.yaml  11
-rw-r--r--  Documentation/devicetree/bindings/net/ti,dp83822.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/net/ti,dp83867.yaml  26
-rw-r--r--  Documentation/devicetree/bindings/net/ti,dp83869.yaml  10
-rw-r--r--  Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml  72
-rw-r--r--  Documentation/devicetree/bindings/net/ti,k3-am654-cpts.yaml  18
-rw-r--r--  Documentation/devicetree/bindings/net/ti-bluetooth.txt  60
-rw-r--r--  Documentation/devicetree/bindings/net/toshiba,visconti-dwmac.yaml  9
-rw-r--r--  Documentation/devicetree/bindings/net/vertexcom-mse102x.yaml  71
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/brcm,bcm4329-fmac.yaml  49
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/esp,esp8089.txt  30
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/esp,esp8089.yaml  43
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/ieee80211.yaml  3
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/marvell-8xxx.txt  4
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/mediatek,mt76.yaml  76
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/microchip,wilc1000.yaml  28
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/qca,ath9k.txt  48
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/qca,ath9k.yaml  90
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/qcom,ath10k.txt  215
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml  358
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/qcom,ath11k-pci.yaml  58
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/qcom,ath11k.yaml  415
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/silabs,wfx.yaml  132
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/ti,wlcore,spi.txt  57
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/ti,wlcore.txt  45
-rw-r--r--  Documentation/devicetree/bindings/net/wireless/ti,wlcore.yaml  142
-rw-r--r--  Documentation/devicetree/bindings/net/xilinx_axienet.txt  10
-rw-r--r--  Documentation/devicetree/bindings/net/xlnx,emaclite.yaml  63
-rw-r--r--  Documentation/devicetree/bindings/numa.txt  46
-rw-r--r--  Documentation/devicetree/bindings/nvme/apple,nvme-ans.yaml  113
-rw-r--r--  Documentation/devicetree/bindings/nvmem/allwinner,sun4i-a10-sid.yaml  5
-rw-r--r--  Documentation/devicetree/bindings/nvmem/amlogic,meson-gxbb-efuse.yaml  57
-rw-r--r--  Documentation/devicetree/bindings/nvmem/amlogic,meson6-efuse.yaml  57
-rw-r--r--  Documentation/devicetree/bindings/nvmem/amlogic-efuse.txt  48
-rw-r--r--  Documentation/devicetree/bindings/nvmem/amlogic-meson-mx-efuse.txt  22
-rw-r--r--  Documentation/devicetree/bindings/nvmem/apple,efuses.yaml  50
-rw-r--r--  Documentation/devicetree/bindings/nvmem/brcm,nvram.yaml  30
-rw-r--r--  Documentation/devicetree/bindings/nvmem/fsl,layerscape-sfp.yaml  62
-rw-r--r--  Documentation/devicetree/bindings/nvmem/fsl,scu-ocotp.yaml  56
-rw-r--r--  Documentation/devicetree/bindings/nvmem/imx-iim.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/nvmem/imx-ocotp.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/nvmem/ingenic,jz4780-efuse.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/nvmem/layouts/kontron,sl28-vpd.yaml  64
-rw-r--r--  Documentation/devicetree/bindings/nvmem/layouts/nvmem-layout.yaml  34
-rw-r--r--  Documentation/devicetree/bindings/nvmem/layouts/onie,tlv-layout.yaml  147
-rw-r--r--  Documentation/devicetree/bindings/nvmem/mediatek,efuse.yaml  90
-rw-r--r--  Documentation/devicetree/bindings/nvmem/microchip,lan9662-otpc.yaml  45
-rw-r--r--  Documentation/devicetree/bindings/nvmem/microchip,sama7g5-otpc.yaml  50
-rw-r--r--  Documentation/devicetree/bindings/nvmem/mtk-efuse.txt  41
-rw-r--r--  Documentation/devicetree/bindings/nvmem/mxs-ocotp.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/nvmem/nintendo-otp.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/nvmem/nvmem-consumer.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/nvmem/nvmem.yaml  30
-rw-r--r--  Documentation/devicetree/bindings/nvmem/qcom,qfprom.yaml  16
-rw-r--r--  Documentation/devicetree/bindings/nvmem/qcom,spmi-sdam.yaml  63
-rw-r--r--  Documentation/devicetree/bindings/nvmem/rmem.yaml  5
-rw-r--r--  Documentation/devicetree/bindings/nvmem/rockchip-efuse.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/nvmem/snvs-lpgpr.yaml  16
-rw-r--r--  Documentation/devicetree/bindings/nvmem/socionext,uniphier-efuse.yaml  105
-rw-r--r--  Documentation/devicetree/bindings/nvmem/st,stm32-romem.yaml  10
-rw-r--r--  Documentation/devicetree/bindings/nvmem/sunplus,sp7021-ocotp.yaml  84
-rw-r--r--  Documentation/devicetree/bindings/nvmem/u-boot,env.yaml  101
-rw-r--r--  Documentation/devicetree/bindings/opp/allwinner,sun50i-h6-operating-points.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/opp/opp-v1.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/opp/opp-v2-base.yaml  39
-rw-r--r--  Documentation/devicetree/bindings/opp/opp-v2-kryo-cpu.yaml  281
-rw-r--r--  Documentation/devicetree/bindings/opp/opp-v2-qcom-level.yaml  63
-rw-r--r--  Documentation/devicetree/bindings/opp/opp-v2.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/opp/qcom-nvmem-cpufreq.txt  796
-rw-r--r--  Documentation/devicetree/bindings/opp/qcom-opp.txt  19
-rw-r--r--  Documentation/devicetree/bindings/pci/amlogic,axg-pcie.yaml  134
-rw-r--r--  Documentation/devicetree/bindings/pci/amlogic,meson-pcie.txt  70
-rw-r--r--  Documentation/devicetree/bindings/pci/apple,pcie.yaml  175
-rw-r--r--  Documentation/devicetree/bindings/pci/baikal,bt1-pcie.yaml  168
-rw-r--r--  Documentation/devicetree/bindings/pci/brcm,iproc-pcie.txt  133
-rw-r--r--  Documentation/devicetree/bindings/pci/brcm,iproc-pcie.yaml  184
-rw-r--r--  Documentation/devicetree/bindings/pci/brcm,stb-pcie.yaml  33
-rw-r--r--  Documentation/devicetree/bindings/pci/cdns,cdns-pcie-ep.yaml  3
-rw-r--r--  Documentation/devicetree/bindings/pci/cdns,cdns-pcie-host.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/pci/cdns-pcie-ep.yaml  7
-rw-r--r--  Documentation/devicetree/bindings/pci/cdns-pcie-host.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/pci/cdns-pcie.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/pci/fsl,imx6q-pcie-common.yaml  270
-rw-r--r--  Documentation/devicetree/bindings/pci/fsl,imx6q-pcie-ep.yaml  123
-rw-r--r--  Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.yaml  173
-rw-r--r--  Documentation/devicetree/bindings/pci/hisilicon,kirin-pcie.yaml  13
-rw-r--r--  Documentation/devicetree/bindings/pci/host-generic-pci.yaml  3
-rw-r--r--  Documentation/devicetree/bindings/pci/intel,keembay-pcie-ep.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/pci/intel,keembay-pcie.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/pci/layerscape-pci.txt  65
-rw-r--r--  Documentation/devicetree/bindings/pci/mediatek,mt7621-pcie.yaml  145
-rw-r--r--  Documentation/devicetree/bindings/pci/mediatek-pcie-gen3.yaml  88
-rw-r--r--  Documentation/devicetree/bindings/pci/mediatek-pcie.txt  1
-rw-r--r--  Documentation/devicetree/bindings/pci/microchip,pcie-host.yaml  52
-rw-r--r--  Documentation/devicetree/bindings/pci/mvebu-pci.txt  16
-rw-r--r--  Documentation/devicetree/bindings/pci/nvidia,tegra194-pcie-ep.yaml  319
-rw-r--r--  Documentation/devicetree/bindings/pci/nvidia,tegra194-pcie.txt  245
-rw-r--r--  Documentation/devicetree/bindings/pci/nvidia,tegra194-pcie.yaml  380
-rw-r--r--  Documentation/devicetree/bindings/pci/pci-ep.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/pci/pci-rcar-gen2.txt  84
-rw-r--r--  Documentation/devicetree/bindings/pci/qcom,pcie-ep.yaml  204
-rw-r--r--  Documentation/devicetree/bindings/pci/qcom,pcie.txt  376
-rw-r--r--  Documentation/devicetree/bindings/pci/qcom,pcie.yaml  953
-rw-r--r--  Documentation/devicetree/bindings/pci/rcar-pci-ep.yaml  1
-rw-r--r--  Documentation/devicetree/bindings/pci/renesas,pci-rcar-gen2.yaml  186
-rw-r--r--  Documentation/devicetree/bindings/pci/rockchip,rk3399-pcie-common.yaml  69
-rw-r--r--  Documentation/devicetree/bindings/pci/rockchip,rk3399-pcie-ep.yaml  68
-rw-r--r--  Documentation/devicetree/bindings/pci/rockchip,rk3399-pcie.yaml  132
-rw-r--r--  Documentation/devicetree/bindings/pci/rockchip-dw-pcie.yaml  135
-rw-r--r--  Documentation/devicetree/bindings/pci/rockchip-pcie-ep.txt  62
-rw-r--r--  Documentation/devicetree/bindings/pci/rockchip-pcie-host.txt  135
-rw-r--r--  Documentation/devicetree/bindings/pci/samsung,exynos-pcie.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/pci/sifive,fu740-pcie.yaml  13
-rw-r--r--  Documentation/devicetree/bindings/pci/snps,dw-pcie-common.yaml  266
-rw-r--r--  Documentation/devicetree/bindings/pci/snps,dw-pcie-ep.yaml  212
-rw-r--r--  Documentation/devicetree/bindings/pci/snps,dw-pcie.yaml  264
-rw-r--r--  Documentation/devicetree/bindings/pci/socionext,uniphier-pcie-ep.yaml  80
-rw-r--r--  Documentation/devicetree/bindings/pci/socionext,uniphier-pcie.yaml  117
-rw-r--r--  Documentation/devicetree/bindings/pci/ti,am65-pci-ep.yaml  10
-rw-r--r--  Documentation/devicetree/bindings/pci/ti,am65-pci-host.yaml  20
-rw-r--r--  Documentation/devicetree/bindings/pci/ti,j721e-pci-ep.yaml  13
-rw-r--r--  Documentation/devicetree/bindings/pci/ti,j721e-pci-host.yaml  40
-rw-r--r--  Documentation/devicetree/bindings/pci/toshiba,visconti-pcie.yaml  9
-rw-r--r--  Documentation/devicetree/bindings/pci/uniphier-pcie.txt  82
-rw-r--r--  Documentation/devicetree/bindings/pci/xilinx-nwl-pcie.txt  73
-rw-r--r--  Documentation/devicetree/bindings/pci/xilinx-pcie.txt  88
-rw-r--r--  Documentation/devicetree/bindings/pci/xilinx-versal-cpm.yaml  49
-rw-r--r--  Documentation/devicetree/bindings/pci/xlnx,axi-pcie-host.yaml  88
-rw-r--r--  Documentation/devicetree/bindings/pci/xlnx,nwl-pcie.yaml  149
-rw-r--r--  Documentation/devicetree/bindings/peci/peci-aspeed.yaml  72
-rw-r--r--  Documentation/devicetree/bindings/peci/peci-controller.yaml  33
-rw-r--r--  Documentation/devicetree/bindings/perf/amlogic,g12-ddr-pmu.yaml  54
-rw-r--r--  Documentation/devicetree/bindings/perf/arm,ccn.yaml  40
-rw-r--r--  Documentation/devicetree/bindings/perf/arm,cmn.yaml  23
-rw-r--r--  Documentation/devicetree/bindings/perf/arm,dsu-pmu.yaml  44
-rw-r--r--  Documentation/devicetree/bindings/perf/arm,smmu-v3-pmcg.yaml  70
-rw-r--r--  Documentation/devicetree/bindings/perf/arm-ccn.txt  23
-rw-r--r--  Documentation/devicetree/bindings/perf/marvell-cn10k-ddr.yaml  37
-rw-r--r--  Documentation/devicetree/bindings/perf/marvell-cn10k-tad.yaml  63
-rw-r--r--  Documentation/devicetree/bindings/perf/nds32v3-pmu.txt  17
-rw-r--r--  Documentation/devicetree/bindings/perf/riscv,pmu.yaml  160
-rw-r--r--  Documentation/devicetree/bindings/perf/spe-pmu.yaml  40
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun4i-a10-usb-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun50i-a64-usb-phy.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun50i-h6-usb-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun50i-h6-usb3-phy.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun5i-a13-usb-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun6i-a31-mipi-dphy.yaml  25
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun6i-a31-usb-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun8i-a23-usb-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun8i-a83t-usb-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun8i-h3-usb-phy.yaml  28
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun8i-r40-usb-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun8i-v3s-usb-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,sun9i-a80-usb-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/allwinner,suniv-f1c100s-usb-phy.yaml  83
-rw-r--r--  Documentation/devicetree/bindings/phy/amlogic,axg-mipi-dphy.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/phy/amlogic,g12a-mipi-dphy-analog.yaml  35
-rw-r--r--  Documentation/devicetree/bindings/phy/amlogic,g12a-usb2-phy.yaml  78
-rw-r--r--  Documentation/devicetree/bindings/phy/amlogic,g12a-usb3-pcie-phy.yaml  64
-rw-r--r--  Documentation/devicetree/bindings/phy/amlogic,meson-axg-mipi-pcie-analog.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/phy/amlogic,meson-axg-pcie.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/phy/amlogic,meson-g12a-usb2-phy.yaml  78
-rw-r--r--  Documentation/devicetree/bindings/phy/amlogic,meson-g12a-usb3-pcie-phy.yaml  59
-rw-r--r--  Documentation/devicetree/bindings/phy/amlogic,meson-gxl-usb2-phy.yaml  56
-rw-r--r--  Documentation/devicetree/bindings/phy/amlogic,meson8-hdmi-tx-phy.yaml  65
-rw-r--r--  Documentation/devicetree/bindings/phy/amlogic,meson8b-usb2-phy.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/phy/bcm-ns-usb2-phy.yaml  25
-rw-r--r--  Documentation/devicetree/bindings/phy/brcm,bcm63xx-usbh-phy.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/phy/brcm,cygnus-pcie-phy.txt  47
-rw-r--r--  Documentation/devicetree/bindings/phy/brcm,cygnus-pcie-phy.yaml  77
-rw-r--r--  Documentation/devicetree/bindings/phy/brcm,mdio-mux-bus-pci.txt  27
-rw-r--r--  Documentation/devicetree/bindings/phy/brcm,ns2-pcie-phy.yaml  41
-rw-r--r--  Documentation/devicetree/bindings/phy/brcm,sata-phy.yaml  14
-rw-r--r--  Documentation/devicetree/bindings/phy/calxeda-combophy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/cdns,dphy-rx.yaml  42
-rw-r--r--  Documentation/devicetree/bindings/phy/cdns,dphy.txt  20
-rw-r--r--  Documentation/devicetree/bindings/phy/cdns,dphy.yaml  57
-rw-r--r--  Documentation/devicetree/bindings/phy/cdns,salvo-phy.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/phy/fsl,imx8-pcie-phy.yaml  102
-rw-r--r--  Documentation/devicetree/bindings/phy/fsl,imx8mq-usb-phy.yaml  5
-rw-r--r--  Documentation/devicetree/bindings/phy/fsl,imx8qm-lvds-phy.yaml  61
-rw-r--r--  Documentation/devicetree/bindings/phy/fsl,lynx-28g.yaml  40
-rw-r--r--  Documentation/devicetree/bindings/phy/hisilicon,hi3660-usb3.yaml  29
-rw-r--r--  Documentation/devicetree/bindings/phy/hisilicon,hi3670-usb3.yaml  35
-rw-r--r--  Documentation/devicetree/bindings/phy/ingenic,phy-usb.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/phy/intel,combo-phy.yaml  8
-rw-r--r--  Documentation/devicetree/bindings/phy/intel,keembay-phy-usb.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/intel,lgm-emmc-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/intel,lgm-usb-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/lantiq,vrx200-pcie-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/marvell,armada-3700-utmi-phy.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/phy/marvell,armada-cp110-utmi-phy.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/phy/marvell,mmp3-hsic-phy.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/phy/marvell,mmp3-usb-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/mediatek,dsi-phy.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/phy/mediatek,hdmi-phy.yaml  3
-rw-r--r--  Documentation/devicetree/bindings/phy/mediatek,mt7621-pci-phy.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/phy/mediatek,pcie-phy.yaml  75
-rw-r--r--  Documentation/devicetree/bindings/phy/mediatek,tphy.yaml  36
-rw-r--r--  Documentation/devicetree/bindings/phy/mediatek,ufs-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/mediatek,xsphy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/meson-gxl-usb2-phy.txt  21
-rw-r--r--  Documentation/devicetree/bindings/phy/microchip,lan966x-serdes.yaml  59
-rw-r--r--  Documentation/devicetree/bindings/phy/mixel,mipi-dsi-phy.txt  29
-rw-r--r--  Documentation/devicetree/bindings/phy/mixel,mipi-dsi-phy.yaml  107
-rw-r--r--  Documentation/devicetree/bindings/phy/mscc,vsc7514-serdes.yaml  56
-rw-r--r--  Documentation/devicetree/bindings/phy/mxs-usb-phy.txt  5
-rw-r--r--  Documentation/devicetree/bindings/phy/nvidia,tegra124-xusb-padctl.txt  779
-rw-r--r--  Documentation/devicetree/bindings/phy/nvidia,tegra124-xusb-padctl.yaml  654
-rw-r--r--  Documentation/devicetree/bindings/phy/nvidia,tegra186-xusb-padctl.yaml  544
-rw-r--r--  Documentation/devicetree/bindings/phy/nvidia,tegra194-xusb-padctl.yaml  632
-rw-r--r--  Documentation/devicetree/bindings/phy/nvidia,tegra20-usb-phy.txt  74
-rw-r--r--  Documentation/devicetree/bindings/phy/nvidia,tegra20-usb-phy.yaml  373
-rw-r--r--  Documentation/devicetree/bindings/phy/nvidia,tegra210-xusb-padctl.yaml  786
-rw-r--r--  Documentation/devicetree/bindings/phy/phy-cadence-sierra.yaml  23
-rw-r--r--  Documentation/devicetree/bindings/phy/phy-cadence-torrent.yaml  16
-rw-r--r--  Documentation/devicetree/bindings/phy/phy-ocelot-serdes.txt  43
-rw-r--r--  Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.yaml  151
-rw-r--r--  Documentation/devicetree/bindings/phy/phy-rockchip-naneng-combphy.yaml  110
-rw-r--r--  Documentation/devicetree/bindings/phy/phy-stih407-usb.txt  2
-rw-r--r--  Documentation/devicetree/bindings/phy/phy-stm32-usbphyc.yaml  140
-rw-r--r--  Documentation/devicetree/bindings/phy/phy-tegra194-p2u.txt  28
-rw-r--r--  Documentation/devicetree/bindings/phy/phy-tegra194-p2u.yaml  53
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,edp-phy.yaml  77
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,hdmi-phy-other.yaml  125
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,hdmi-phy-qmp.yaml  94
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,ipq8074-qmp-pcie-phy.yaml  299
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,msm8996-qmp-pcie-phy.yaml  189
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,msm8996-qmp-ufs-phy.yaml  244
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,msm8996-qmp-usb3-phy.yaml  394
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,pcie2-phy.yaml  86
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,qmp-phy.yaml  453
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,qmp-usb3-dp-phy.yaml  213
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,qusb2-phy.yaml  176
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,sc7180-qmp-usb3-dp-phy.yaml  276
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-pcie-phy.yaml  216
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-ufs-phy.yaml  113
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-usb3-uni-phy.yaml  102
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,sc8280xp-qmp-usb43dp-phy.yaml  103
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,snps-eusb2-phy.yaml  79
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,snps-eusb2-repeater.yaml  52
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,usb-hs-28nm.yaml  5
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,usb-hs-phy.txt  84
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,usb-hs-phy.yaml  111
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,usb-hsic-phy.txt  65
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,usb-hsic-phy.yaml  67
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,usb-snps-femto-v2.yaml  122
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom,usb-ss.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom-pcie2-phy.txt  42
-rw-r--r--  Documentation/devicetree/bindings/phy/qcom-usb-ipq4019-phy.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/phy/rcar-gen2-phy.txt  112
-rw-r--r--  Documentation/devicetree/bindings/phy/renesas,r8a779f0-ether-serdes.yaml  54
-rw-r--r--  Documentation/devicetree/bindings/phy/renesas,rcar-gen2-usb-phy.yaml  123
-rw-r--r--  Documentation/devicetree/bindings/phy/renesas,usb2-phy.yaml  4
-rw-r--r--  Documentation/devicetree/bindings/phy/renesas,usb3-phy.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/rockchip,inno-usb2phy.yaml  188
-rw-r--r--  Documentation/devicetree/bindings/phy/rockchip,pcie3-phy.yaml  80
-rw-r--r--  Documentation/devicetree/bindings/phy/rockchip,px30-dsi-dphy.yaml  1
-rw-r--r--  Documentation/devicetree/bindings/phy/rockchip,rk3288-dp-phy.yaml  41
-rw-r--r--  Documentation/devicetree/bindings/phy/rockchip-dp-phy.txt  26
-rw-r--r--  Documentation/devicetree/bindings/phy/rockchip-inno-csi-dphy.yaml  3
-rw-r--r--  Documentation/devicetree/bindings/phy/rockchip-mipi-dphy-rx0.yaml  2
-rw-r--r--  Documentation/devicetree/bindings/phy/rockchip-usb-phy.yaml  11
-rw-r--r--  Documentation/devicetree/bindings/phy/samsung,dp-video-phy.yaml  40
-rw-r--r--  Documentation/devicetree/bindings/phy/samsung,exynos-hdmi-phy.yaml  43
-rw-r--r--  Documentation/devicetree/bindings/phy/samsung,exynos-pcie-phy.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/phy/samsung,exynos5250-sata-phy.yaml  64
-rw-r--r--  Documentation/devicetree/bindings/phy/samsung,mipi-video-phy.yaml  112
-rw-r--r--  Documentation/devicetree/bindings/phy/samsung,ufs-phy.yaml  64
-rw-r--r--  Documentation/devicetree/bindings/phy/samsung,usb2-phy.yaml  102
-rw-r--r--  Documentation/devicetree/bindings/phy/samsung,usb3-drd-phy.yaml  126
-rw-r--r--  Documentation/devicetree/bindings/phy/samsung-phy.txt  210
-rw-r--r--  Documentation/devicetree/bindings/phy/socionext,uniphier-ahci-phy.yaml  108
-rw-r--r--  Documentation/devicetree/bindings/phy/socionext,uniphier-pcie-phy.yaml  48
-rw-r--r--  Documentation/devicetree/bindings/phy/socionext,uniphier-usb2-phy.yaml  44
-rw-r--r--  Documentation/devicetree/bindings/phy/socionext,uniphier-usb3hs-phy.yaml  119
-rw-r--r--  Documentation/devicetree/bindings/phy/socionext,uniphier-usb3ss-phy.yaml  125
-rw-r--r--  Documentation/devicetree/bindings/phy/sunplus,sp7021-usb2-phy.yaml  73
-rw-r--r--  Documentation/devicetree/bindings/phy/ti,omap-usb2.yaml  6
-rw-r--r--  Documentation/devicetree/bindings/phy/ti,phy-am654-serdes.yaml  7
-rw-r--r--  Documentation/devicetree/bindings/phy/ti,phy-gmii-sel.yaml  73
-rw-r--r--  Documentation/devicetree/bindings/phy/ti,phy-j721e-wiz.yaml  54
-rw-r--r--  Documentation/devicetree/bindings/phy/ti,tcan104x-can.yaml  12
-rw-r--r--  Documentation/devicetree/bindings/phy/transmit-amplitude.yaml  103
-rw-r--r--  Documentation/devicetree/bindings/phy/xlnx,zynqmp-psgtr.yaml  4
-rw-r--r--Documentation/devicetree/bindings/pinctrl/actions,s500-pinctrl.yaml3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/allwinner,sun4i-a10-pinctrl.yaml37
-rw-r--r--Documentation/devicetree/bindings/pinctrl/amlogic,meson-pinctrl-a1.yaml67
-rw-r--r--Documentation/devicetree/bindings/pinctrl/amlogic,meson-pinctrl-common.yaml57
-rw-r--r--Documentation/devicetree/bindings/pinctrl/amlogic,meson-pinctrl-g12a-aobus.yaml68
-rw-r--r--Documentation/devicetree/bindings/pinctrl/amlogic,meson-pinctrl-g12a-periphs.yaml72
-rw-r--r--Documentation/devicetree/bindings/pinctrl/amlogic,meson8-pinctrl-aobus.yaml76
-rw-r--r--Documentation/devicetree/bindings/pinctrl/amlogic,meson8-pinctrl-cbus.yaml78
-rw-r--r--Documentation/devicetree/bindings/pinctrl/apple,pinctrl.yaml21
-rw-r--r--Documentation/devicetree/bindings/pinctrl/aspeed,ast2400-pinctrl.yaml13
-rw-r--r--Documentation/devicetree/bindings/pinctrl/aspeed,ast2500-pinctrl.yaml72
-rw-r--r--Documentation/devicetree/bindings/pinctrl/aspeed,ast2600-pinctrl.yaml19
-rw-r--r--Documentation/devicetree/bindings/pinctrl/brcm,bcm11351-pinctrl.txt2
-rw-r--r--Documentation/devicetree/bindings/pinctrl/brcm,bcm4908-pinctrl.yaml73
-rw-r--r--Documentation/devicetree/bindings/pinctrl/brcm,bcm6318-pinctrl.yaml3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/brcm,bcm63268-pinctrl.yaml3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/brcm,bcm6328-pinctrl.yaml3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/brcm,bcm6358-pinctrl.yaml3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/brcm,bcm6362-pinctrl.yaml3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/brcm,bcm6368-pinctrl.yaml3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/brcm,ns-pinmux.yaml34
-rw-r--r--Documentation/devicetree/bindings/pinctrl/canaan,k210-fpioa.yaml15
-rw-r--r--Documentation/devicetree/bindings/pinctrl/cirrus,lochnagar.yaml12
-rw-r--r--Documentation/devicetree/bindings/pinctrl/cirrus,madera.yaml25
-rw-r--r--Documentation/devicetree/bindings/pinctrl/cypress,cy8c95x0.yaml134
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imx7d-pinctrl.txt87
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imx7d-pinctrl.yaml113
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imx8m-pinctrl.yaml90
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imx8mm-pinctrl.yaml81
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imx8mn-pinctrl.yaml81
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imx8mp-pinctrl.yaml81
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imx8mq-pinctrl.yaml81
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imx8ulp-pinctrl.yaml3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imx93-pinctrl.yaml85
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imxrt1050.yaml79
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imxrt1170.yaml77
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,scu-pinctrl.yaml74
-rw-r--r--Documentation/devicetree/bindings/pinctrl/ingenic,pinctrl.yaml10
-rw-r--r--Documentation/devicetree/bindings/pinctrl/intel,lgm-io.yaml5
-rw-r--r--Documentation/devicetree/bindings/pinctrl/intel,pinctrl-keembay.yaml3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/lantiq,pinctrl-xway.txt35
-rw-r--r--Documentation/devicetree/bindings/pinctrl/loongson,ls2k-pinctrl.yaml123
-rw-r--r--Documentation/devicetree/bindings/pinctrl/marvell,ac5-pinctrl.yaml73
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt65xx-pinctrl.yaml52
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt6779-pinctrl.yaml127
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt6795-pinctrl.yaml228
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt6797-pinctrl.yaml173
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt7620-pinctrl.yaml298
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt7621-pinctrl.yaml261
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt7622-pinctrl.yaml43
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt76x8-pinctrl.yaml450
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt7981-pinctrl.yaml480
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt7986-pinctrl.yaml462
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt8183-pinctrl.yaml49
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt8186-pinctrl.yaml275
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt8188-pinctrl.yaml232
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt8192-pinctrl.yaml184
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt8195-pinctrl.yaml286
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mediatek,mt8365-pinctrl.yaml230
-rw-r--r--Documentation/devicetree/bindings/pinctrl/meson,pinctrl.txt93
-rw-r--r--Documentation/devicetree/bindings/pinctrl/microchip,sparx5-sgpio.yaml9
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mscc,ocelot-pinctrl.txt41
-rw-r--r--Documentation/devicetree/bindings/pinctrl/mscc,ocelot-pinctrl.yaml116
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nuvoton,wpcm450-pinctrl.yaml161
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra-pinmux-common.yaml178
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra114-pinmux.txt131
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra114-pinmux.yaml155
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra124-dpaux-padctl.txt59
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra124-pinmux.txt153
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra124-pinmux.yaml176
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra194-pinmux.txt107
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra194-pinmux.yaml284
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra20-pinmux.txt143
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra20-pinmux.yaml112
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra210-pinmux.txt166
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra210-pinmux.yaml142
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra30-pinmux.txt144
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nvidia,tegra30-pinmux.yaml176
-rw-r--r--Documentation/devicetree/bindings/pinctrl/nxp,s32g2-siul2-pinctrl.yaml123
-rw-r--r--Documentation/devicetree/bindings/pinctrl/pincfg-node.yaml17
-rw-r--r--Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt41
-rw-r--r--Documentation/devicetree/bindings/pinctrl/pinctrl-mt8192.yaml155
-rw-r--r--Documentation/devicetree/bindings/pinctrl/pinctrl-mt8195.yaml148
-rw-r--r--Documentation/devicetree/bindings/pinctrl/pinctrl-rk805.txt2
-rw-r--r--Documentation/devicetree/bindings/pinctrl/pinctrl-sx150x.txt72
-rw-r--r--Documentation/devicetree/bindings/pinctrl/pinctrl.yaml45
-rw-r--r--Documentation/devicetree/bindings/pinctrl/pinmux-node.yaml4
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,ipq5332-tlmm.yaml125
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,ipq6018-pinctrl.yaml116
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,ipq8074-pinctrl.txt181
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,ipq8074-pinctrl.yaml128
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,ipq9574-tlmm.yaml130
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,lpass-lpi-pinctrl.yaml130
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,mdm9607-pinctrl.yaml133
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,mdm9607-tlmm.yaml124
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,mdm9615-pinctrl.txt161
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,mdm9615-pinctrl.yaml112
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8226-pinctrl.yaml107
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8660-pinctrl.txt96
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8660-pinctrl.yaml117
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8909-tlmm.yaml144
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8916-pinctrl.txt195
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8916-pinctrl.yaml159
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8953-pinctrl.yaml102
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8960-pinctrl.txt190
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8960-pinctrl.yaml157
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8974-pinctrl.txt121
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8974-pinctrl.yaml172
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8976-pinctrl.txt183
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8976-pinctrl.yaml129
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8994-pinctrl.txt186
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8994-pinctrl.yaml155
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8996-pinctrl.txt208
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8996-pinctrl.yaml175
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8998-pinctrl.txt202
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,msm8998-pinctrl.yaml164
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml330
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,pmic-mpp.txt187
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,pmic-mpp.yaml190
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,qcm2290-tlmm.yaml138
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,qcs404-pinctrl.txt199
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,qcs404-pinctrl.yaml169
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,qdu1000-tlmm.yaml125
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sa8775p-tlmm.yaml129
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sc7180-pinctrl.txt187
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sc7180-pinctrl.yaml151
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sc7280-lpass-lpi-pinctrl.yaml139
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sc7280-pinctrl.yaml86
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sc8180x-pinctrl.yaml152
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sc8180x-tlmm.yaml144
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sc8280xp-lpass-lpi-pinctrl.yaml155
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sc8280xp-tlmm.yaml153
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sdm630-pinctrl.yaml181
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sdm660-pinctrl.txt191
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sdm670-tlmm.yaml119
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sdm845-pinctrl.txt176
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sdm845-pinctrl.yaml162
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sdx55-pinctrl.yaml100
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sdx65-tlmm.yaml159
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm6115-pinctrl.yaml179
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm6115-tlmm.yaml144
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm6125-pinctrl.yaml132
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm6125-tlmm.yaml143
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm6350-tlmm.yaml156
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm6375-tlmm.yaml148
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm7150-tlmm.yaml162
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm8150-pinctrl.txt190
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm8150-pinctrl.yaml166
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm8250-lpass-lpi-pinctrl.yaml161
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm8250-pinctrl.yaml192
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm8350-pinctrl.yaml145
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm8350-tlmm.yaml143
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm8450-lpass-lpi-pinctrl.yaml164
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm8450-tlmm.yaml142
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm8550-lpass-lpi-pinctrl.yaml150
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,sm8550-tlmm.yaml154
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,tlmm-common.yaml34
-rw-r--r--Documentation/devicetree/bindings/pinctrl/ralink,rt2880-pinctrl.yaml141
-rw-r--r--Documentation/devicetree/bindings/pinctrl/ralink,rt2880-pinmux.yaml64
-rw-r--r--Documentation/devicetree/bindings/pinctrl/ralink,rt305x-pinctrl.yaml206
-rw-r--r--Documentation/devicetree/bindings/pinctrl/ralink,rt3352-pinctrl.yaml243
-rw-r--r--Documentation/devicetree/bindings/pinctrl/ralink,rt3883-pinctrl.yaml261
-rw-r--r--Documentation/devicetree/bindings/pinctrl/ralink,rt5350-pinctrl.yaml206
-rw-r--r--Documentation/devicetree/bindings/pinctrl/renesas,pfc.yaml5
-rw-r--r--Documentation/devicetree/bindings/pinctrl/renesas,rza1-ports.yaml4
-rw-r--r--Documentation/devicetree/bindings/pinctrl/renesas,rza2-pinctrl.yaml3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/renesas,rzg2l-pinctrl.yaml39
-rw-r--r--Documentation/devicetree/bindings/pinctrl/renesas,rzg2l-poeg.yaml86
-rw-r--r--Documentation/devicetree/bindings/pinctrl/renesas,rzn1-pinctrl.yaml5
-rw-r--r--Documentation/devicetree/bindings/pinctrl/renesas,rzv2m-pinctrl.yaml170
-rw-r--r--Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt114
-rw-r--r--Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.yaml188
-rw-r--r--Documentation/devicetree/bindings/pinctrl/samsung,pinctrl-gpio-bank.yaml52
-rw-r--r--Documentation/devicetree/bindings/pinctrl/samsung,pinctrl-pins-cfg.yaml80
-rw-r--r--Documentation/devicetree/bindings/pinctrl/samsung,pinctrl-wakeup-interrupt.yaml106
-rw-r--r--Documentation/devicetree/bindings/pinctrl/samsung,pinctrl.yaml387
-rw-r--r--Documentation/devicetree/bindings/pinctrl/samsung-pinctrl.txt381
-rw-r--r--Documentation/devicetree/bindings/pinctrl/semtech,sx1501q.yaml208
-rw-r--r--Documentation/devicetree/bindings/pinctrl/socionext,uniphier-pinctrl.yaml62
-rw-r--r--Documentation/devicetree/bindings/pinctrl/st,stm32-pinctrl.yaml47
-rw-r--r--Documentation/devicetree/bindings/pinctrl/starfive,jh7100-pinctrl.yaml307
-rw-r--r--Documentation/devicetree/bindings/pinctrl/starfive,jh7110-aon-pinctrl.yaml124
-rw-r--r--Documentation/devicetree/bindings/pinctrl/starfive,jh7110-sys-pinctrl.yaml142
-rw-r--r--Documentation/devicetree/bindings/pinctrl/sunplus,sp7021-pinctrl.yaml377
-rw-r--r--Documentation/devicetree/bindings/pinctrl/toshiba,visconti-pinctrl.yaml12
-rw-r--r--Documentation/devicetree/bindings/pinctrl/xlnx,zynq-pinctrl.yaml3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/xlnx,zynqmp-pinctrl.yaml3
-rw-r--r--Documentation/devicetree/bindings/power/allwinner,sun20i-d1-ppu.yaml54
-rw-r--r--Documentation/devicetree/bindings/power/amlogic,meson-ee-pwrc.yaml8
-rw-r--r--Documentation/devicetree/bindings/power/amlogic,meson-gx-pwrc.txt4
-rw-r--r--Documentation/devicetree/bindings/power/amlogic,meson-sec-pwrc.yaml8
-rw-r--r--Documentation/devicetree/bindings/power/apple,pmgr-pwrstate.yaml80
-rw-r--r--Documentation/devicetree/bindings/power/avs/qcom,cpr.txt130
-rw-r--r--Documentation/devicetree/bindings/power/avs/qcom,cpr.yaml161
-rw-r--r--Documentation/devicetree/bindings/power/brcm,bcm63xx-power.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/domain-idle-state.yaml12
-rw-r--r--Documentation/devicetree/bindings/power/fsl,imx-gpc.yaml33
-rw-r--r--Documentation/devicetree/bindings/power/fsl,imx-gpcv2.yaml21
-rw-r--r--Documentation/devicetree/bindings/power/fsl,scu-pd.yaml41
-rw-r--r--Documentation/devicetree/bindings/power/mediatek,power-controller.yaml144
-rw-r--r--Documentation/devicetree/bindings/power/power-domain.yaml3
-rw-r--r--Documentation/devicetree/bindings/power/qcom,kpss-acc-v2.yaml42
-rw-r--r--Documentation/devicetree/bindings/power/qcom,rpmpd.yaml17
-rw-r--r--Documentation/devicetree/bindings/power/renesas,apmu.yaml7
-rw-r--r--Documentation/devicetree/bindings/power/renesas,rcar-sysc.yaml21
-rw-r--r--Documentation/devicetree/bindings/power/reset/gpio-poweroff.txt41
-rw-r--r--Documentation/devicetree/bindings/power/reset/gpio-poweroff.yaml59
-rw-r--r--Documentation/devicetree/bindings/power/reset/gpio-restart.txt54
-rw-r--r--Documentation/devicetree/bindings/power/reset/gpio-restart.yaml79
-rw-r--r--Documentation/devicetree/bindings/power/reset/msm-poweroff.txt17
-rw-r--r--Documentation/devicetree/bindings/power/reset/qcom,pon.yaml58
-rw-r--r--Documentation/devicetree/bindings/power/reset/qcom,pshold.yaml35
-rw-r--r--Documentation/devicetree/bindings/power/reset/regulator-poweroff.yaml2
-rw-r--r--Documentation/devicetree/bindings/power/reset/restart-handler.yaml30
-rw-r--r--Documentation/devicetree/bindings/power/reset/syscon-reboot.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/reset/xlnx,zynqmp-power.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/rockchip,power-controller.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/rockchip-io-domain.yaml30
-rw-r--r--Documentation/devicetree/bindings/power/starfive,jh7110-pmu.yaml45
-rw-r--r--Documentation/devicetree/bindings/power/supply/active-semi,act8945a-charger.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/adc-battery.yaml70
-rw-r--r--Documentation/devicetree/bindings/power/supply/battery.yaml7
-rw-r--r--Documentation/devicetree/bindings/power/supply/bq2415x.yaml9
-rw-r--r--Documentation/devicetree/bindings/power/supply/bq24190.yaml10
-rw-r--r--Documentation/devicetree/bindings/power/supply/bq24257.yaml10
-rw-r--r--Documentation/devicetree/bindings/power/supply/bq24735.yaml8
-rw-r--r--Documentation/devicetree/bindings/power/supply/bq2515x.yaml9
-rw-r--r--Documentation/devicetree/bindings/power/supply/bq256xx.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/supply/bq25890.yaml24
-rw-r--r--Documentation/devicetree/bindings/power/supply/bq25980.yaml11
-rw-r--r--Documentation/devicetree/bindings/power/supply/bq27xxx.yaml21
-rw-r--r--Documentation/devicetree/bindings/power/supply/charger-manager.yaml2
-rw-r--r--Documentation/devicetree/bindings/power/supply/cpcap-battery.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/cpcap-charger.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/cw2015_battery.yaml7
-rw-r--r--Documentation/devicetree/bindings/power/supply/dlg,da9150-charger.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/supply/dlg,da9150-fuel-gauge.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/supply/ingenic,battery.yaml10
-rw-r--r--Documentation/devicetree/bindings/power/supply/isp1704.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/supply/lego,ev3-battery.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/lltc,lt3651-charger.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/supply/lltc,ltc294x.yaml8
-rw-r--r--Documentation/devicetree/bindings/power/supply/ltc4162-l.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/supply/maxim,ds2760.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/supply/maxim,max14577.yaml84
-rw-r--r--Documentation/devicetree/bindings/power/supply/maxim,max14656.yaml8
-rw-r--r--Documentation/devicetree/bindings/power/supply/maxim,max17040.yaml14
-rw-r--r--Documentation/devicetree/bindings/power/supply/maxim,max17042.yaml8
-rw-r--r--Documentation/devicetree/bindings/power/supply/maxim,max77693.yaml70
-rw-r--r--Documentation/devicetree/bindings/power/supply/maxim,max77976.yaml44
-rw-r--r--Documentation/devicetree/bindings/power/supply/maxim,max8903.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/mediatek,mt6370-charger.yaml96
-rw-r--r--Documentation/devicetree/bindings/power/supply/mt6360_charger.yaml2
-rw-r--r--Documentation/devicetree/bindings/power/supply/nokia,n900-battery.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/olpc-battery.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/power-supply.yaml13
-rw-r--r--Documentation/devicetree/bindings/power/supply/qcom,pm8941-charger.yaml13
-rw-r--r--Documentation/devicetree/bindings/power/supply/qcom,pm8941-coincell.yaml20
-rw-r--r--Documentation/devicetree/bindings/power/supply/richtek,rt5033-battery.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/richtek,rt9455.yaml8
-rw-r--r--Documentation/devicetree/bindings/power/supply/richtek,rt9467.yaml82
-rw-r--r--Documentation/devicetree/bindings/power/supply/richtek,rt9471.yaml73
-rw-r--r--Documentation/devicetree/bindings/power/supply/rohm,bd99954.yaml3
-rw-r--r--Documentation/devicetree/bindings/power/supply/samsung,battery.yaml56
-rw-r--r--Documentation/devicetree/bindings/power/supply/sbs,sbs-manager.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/sc2731-charger.yaml7
-rw-r--r--Documentation/devicetree/bindings/power/supply/sc27xx-fg.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/supply/stericsson,ab8500-btemp.yaml14
-rw-r--r--Documentation/devicetree/bindings/power/supply/stericsson,ab8500-chargalg.yaml14
-rw-r--r--Documentation/devicetree/bindings/power/supply/stericsson,ab8500-charger.yaml14
-rw-r--r--Documentation/devicetree/bindings/power/supply/stericsson,ab8500-fg.yaml19
-rw-r--r--Documentation/devicetree/bindings/power/supply/summit,smb347-charger.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/supply/ti,lp8727.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/supply/tps65090-charger.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/tps65217-charger.yaml6
-rw-r--r--Documentation/devicetree/bindings/power/supply/twl4030-charger.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/x-powers,axp20x-ac-power-supply.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/x-powers,axp20x-battery-power-supply.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/supply/x-powers,axp20x-usb-power-supply.yaml4
-rw-r--r--Documentation/devicetree/bindings/power/wakeup-source.txt13
-rw-r--r--Documentation/devicetree/bindings/powerpc/fsl/cache_sram.txt20
-rw-r--r--Documentation/devicetree/bindings/powerpc/fsl/cpus.txt2
-rw-r--r--Documentation/devicetree/bindings/powerpc/fsl/l2cache.txt61
-rw-r--r--Documentation/devicetree/bindings/powerpc/fsl/mpc5200.txt2
-rw-r--r--Documentation/devicetree/bindings/powerpc/nintendo/wii.txt10
-rw-r--r--Documentation/devicetree/bindings/powerpc/opal/power-mgt.txt2
-rw-r--r--Documentation/devicetree/bindings/ptp/ptp-idt82p33.yaml2
-rw-r--r--Documentation/devicetree/bindings/ptp/ptp-idtcm.yaml2
-rw-r--r--Documentation/devicetree/bindings/pwm/allwinner,sun4i-a10-pwm.yaml51
-rw-r--r--Documentation/devicetree/bindings/pwm/apple,s5l-fpwm.yaml51
-rw-r--r--Documentation/devicetree/bindings/pwm/atmel,at91sam-pwm.yaml47
-rw-r--r--Documentation/devicetree/bindings/pwm/atmel-pwm.txt35
-rw-r--r--Documentation/devicetree/bindings/pwm/brcm,bcm7038-pwm.txt20
-rw-r--r--Documentation/devicetree/bindings/pwm/brcm,bcm7038-pwm.yaml42
-rw-r--r--Documentation/devicetree/bindings/pwm/clk-pwm.yaml46
-rw-r--r--Documentation/devicetree/bindings/pwm/google,cros-ec-pwm.yaml15
-rw-r--r--Documentation/devicetree/bindings/pwm/imx-pwm.yaml4
-rw-r--r--Documentation/devicetree/bindings/pwm/imx-tpm-pwm.yaml4
-rw-r--r--Documentation/devicetree/bindings/pwm/intel,keembay-pwm.yaml3
-rw-r--r--Documentation/devicetree/bindings/pwm/intel,lgm-pwm.yaml3
-rw-r--r--Documentation/devicetree/bindings/pwm/iqs620a-pwm.yaml4
-rw-r--r--Documentation/devicetree/bindings/pwm/mediatek,mt2712-pwm.yaml94
-rw-r--r--Documentation/devicetree/bindings/pwm/mediatek,pwm-disp.yaml76
-rw-r--r--Documentation/devicetree/bindings/pwm/microchip,corepwm.yaml83
-rw-r--r--Documentation/devicetree/bindings/pwm/mxs-pwm.yaml4
-rw-r--r--Documentation/devicetree/bindings/pwm/nvidia,tegra20-pwm.txt77
-rw-r--r--Documentation/devicetree/bindings/pwm/nvidia,tegra20-pwm.yaml95
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm-amlogic.yaml70
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm-mediatek.txt48
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm-meson.txt29
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm-mtk-disp.txt44
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm-omap-dmtimer.txt2
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm-rockchip.yaml76
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm-samsung.yaml1
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm-sifive.yaml6
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm-tiecap.yaml1
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm-tiehrpwm.yaml1
-rw-r--r--Documentation/devicetree/bindings/pwm/pwm.yaml2
-rw-r--r--Documentation/devicetree/bindings/pwm/renesas,pwm-rcar.yaml27
-rw-r--r--Documentation/devicetree/bindings/pwm/renesas,tpu-pwm.yaml4
-rw-r--r--Documentation/devicetree/bindings/pwm/snps,dw-apb-timers-pwm2.yaml68
-rw-r--r--Documentation/devicetree/bindings/pwm/sunplus,sp7021-pwm.yaml42
-rw-r--r--Documentation/devicetree/bindings/pwm/toshiba,pwm-visconti.yaml4
-rw-r--r--Documentation/devicetree/bindings/regulator/act8865-regulator.txt117
-rw-r--r--Documentation/devicetree/bindings/regulator/act8945a-regulator.txt113
-rw-r--r--Documentation/devicetree/bindings/regulator/active-semi,act8600.yaml139
-rw-r--r--Documentation/devicetree/bindings/regulator/active-semi,act8846.yaml205
-rw-r--r--Documentation/devicetree/bindings/regulator/active-semi,act8865.yaml158
-rw-r--r--Documentation/devicetree/bindings/regulator/active-semi,act8945a.yaml258
-rw-r--r--Documentation/devicetree/bindings/regulator/anatop-regulator.yaml22
-rw-r--r--Documentation/devicetree/bindings/regulator/dlg,da9121.yaml79
-rw-r--r--Documentation/devicetree/bindings/regulator/fan53555.txt24
-rw-r--r--Documentation/devicetree/bindings/regulator/fcs,fan53555.yaml73
-rw-r--r--Documentation/devicetree/bindings/regulator/fixed-regulator.yaml43
-rw-r--r--Documentation/devicetree/bindings/regulator/google,cros-ec-regulator.yaml5
-rw-r--r--Documentation/devicetree/bindings/regulator/gpio-regulator.yaml5
-rw-r--r--Documentation/devicetree/bindings/regulator/max77650-regulator.yaml3
-rw-r--r--Documentation/devicetree/bindings/regulator/max77686.txt71
-rw-r--r--Documentation/devicetree/bindings/regulator/max77802.txt111
-rw-r--r--Documentation/devicetree/bindings/regulator/max8660.yaml10
-rw-r--r--Documentation/devicetree/bindings/regulator/max8893.yaml2
-rw-r--r--Documentation/devicetree/bindings/regulator/max8952.txt52
-rw-r--r--Documentation/devicetree/bindings/regulator/max8973-regulator.txt52
-rw-r--r--Documentation/devicetree/bindings/regulator/max8997-regulator.txt145
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max14577.yaml78
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max20086.yaml106
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max20411.yaml58
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max77686.yaml83
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max77693.yaml60
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max77802.yaml86
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max77843.yaml65
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max8952.yaml109
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max8973.yaml141
-rw-r--r--Documentation/devicetree/bindings/regulator/maxim,max8997.yaml446
-rw-r--r--Documentation/devicetree/bindings/regulator/mediatek,mt6331-regulator.yaml273
-rw-r--r--Documentation/devicetree/bindings/regulator/mediatek,mt6332-regulator.yaml112
-rw-r--r--Documentation/devicetree/bindings/regulator/mediatek,mt6357-regulator.yaml294
-rw-r--r--Documentation/devicetree/bindings/regulator/mps,mp5416.yaml5
-rw-r--r--Documentation/devicetree/bindings/regulator/mps,mp886x.yaml2
-rw-r--r--Documentation/devicetree/bindings/regulator/mps,mpq7920.yaml6
-rw-r--r--Documentation/devicetree/bindings/regulator/mps,mpq7932.yaml68
-rw-r--r--Documentation/devicetree/bindings/regulator/mt6315-regulator.yaml9
-rw-r--r--Documentation/devicetree/bindings/regulator/mt6358-regulator.txt22
-rw-r--r--Documentation/devicetree/bindings/regulator/mt6359-regulator.yaml16
-rw-r--r--Documentation/devicetree/bindings/regulator/mt6360-regulator.yaml16
-rw-r--r--Documentation/devicetree/bindings/regulator/nxp,pca9450-regulator.yaml34
-rw-r--r--Documentation/devicetree/bindings/regulator/nxp,pf8x00-regulator.yaml23
-rw-r--r--Documentation/devicetree/bindings/regulator/pfuze100.yaml12
-rw-r--r--Documentation/devicetree/bindings/regulator/pwm-regulator.txt92
-rw-r--r--Documentation/devicetree/bindings/regulator/pwm-regulator.yaml126
-rw-r--r--Documentation/devicetree/bindings/regulator/qcom,rpm-regulator.yaml128
-rw-r--r--Documentation/devicetree/bindings/regulator/qcom,rpmh-regulator.yaml362
-rw-r--r--Documentation/devicetree/bindings/regulator/qcom,smd-rpm-regulator.yaml36
-rw-r--r--Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.txt346
-rw-r--r--Documentation/devicetree/bindings/regulator/qcom,spmi-regulator.yaml354
-rw-r--r--Documentation/devicetree/bindings/regulator/qcom,usb-vbus-regulator.yaml4
-rw-r--r--Documentation/devicetree/bindings/regulator/qcom-labibb-regulator.yaml18
-rw-r--r--Documentation/devicetree/bindings/regulator/raspberrypi,7inch-touchscreen-panel-regulator.yaml2
-rw-r--r--Documentation/devicetree/bindings/regulator/regulator-output.yaml39
-rw-r--r--Documentation/devicetree/bindings/regulator/regulator.yaml25
-rw-r--r--Documentation/devicetree/bindings/regulator/richtek,rt4801-regulator.yaml22
-rw-r--r--Documentation/devicetree/bindings/regulator/richtek,rt4803.yaml68
-rw-r--r--Documentation/devicetree/bindings/regulator/richtek,rt5190a-regulator.yaml141
-rw-r--r--Documentation/devicetree/bindings/regulator/richtek,rt5739.yaml72
-rw-r--r--Documentation/devicetree/bindings/regulator/richtek,rt5759-regulator.yaml90
-rw-r--r--Documentation/devicetree/bindings/regulator/richtek,rt6190.yaml79
-rw-r--r--Documentation/devicetree/bindings/regulator/richtek,rt6245-regulator.yaml8
-rw-r--r--Documentation/devicetree/bindings/regulator/richtek,rtmv20-regulator.yaml2
-rw-r--r--Documentation/devicetree/bindings/regulator/rohm,bd70528-regulator.txt68
-rw-r--r--Documentation/devicetree/bindings/regulator/rohm,bd71815-regulator.yaml11
-rw-r--r--Documentation/devicetree/bindings/regulator/rohm,bd71828-regulator.yaml30
-rw-r--r--Documentation/devicetree/bindings/regulator/rohm,bd71837-regulator.yaml14
-rw-r--r--Documentation/devicetree/bindings/regulator/rohm,bd71847-regulator.yaml14
-rw-r--r--Documentation/devicetree/bindings/regulator/rohm,bd9576-regulator.yaml6
-rw-r--r--Documentation/devicetree/bindings/regulator/samsung,s2mpa01.txt79
-rw-r--r--Documentation/devicetree/bindings/regulator/samsung,s2mpa01.yaml62
-rw-r--r--Documentation/devicetree/bindings/regulator/samsung,s2mps11.txt102
-rw-r--r--Documentation/devicetree/bindings/regulator/samsung,s2mps11.yaml44
-rw-r--r--Documentation/devicetree/bindings/regulator/samsung,s2mps13.yaml44
-rw-r--r--Documentation/devicetree/bindings/regulator/samsung,s2mps14.yaml61
-rw-r--r--Documentation/devicetree/bindings/regulator/samsung,s2mps15.yaml44
-rw-r--r--Documentation/devicetree/bindings/regulator/samsung,s2mpu02.yaml44
-rw-r--r--Documentation/devicetree/bindings/regulator/samsung,s5m8767.txt145
-rw-r--r--Documentation/devicetree/bindings/regulator/samsung,s5m8767.yaml100
-rw-r--r--Documentation/devicetree/bindings/regulator/silergy,sy8106a.yaml52
-rw-r--r--Documentation/devicetree/bindings/regulator/siliconmitus,sm5703-regulator.yaml49
-rw-r--r--Documentation/devicetree/bindings/regulator/socionext,uniphier-regulator.yaml80
-rw-r--r--Documentation/devicetree/bindings/regulator/st,stm32-booster.yaml8
-rw-r--r--Documentation/devicetree/bindings/regulator/st,stm32-vrefbuf.yaml7
-rw-r--r--Documentation/devicetree/bindings/regulator/st,stm32mp1-pwr-reg.yaml4
-rw-r--r--Documentation/devicetree/bindings/regulator/sy8106a-regulator.txt23
-rw-r--r--Documentation/devicetree/bindings/regulator/ti,tps62360.yaml98
-rw-r--r--Documentation/devicetree/bindings/regulator/ti,tps62864.yaml63
-rw-r--r--Documentation/devicetree/bindings/regulator/ti,tps65219.yaml173
-rw-r--r--Documentation/devicetree/bindings/regulator/tps62360-regulator.txt44
-rw-r--r--Documentation/devicetree/bindings/regulator/vexpress.txt2
-rw-r--r--Documentation/devicetree/bindings/regulator/vqmmc-ipq4019-regulator.yaml2
-rw-r--r--Documentation/devicetree/bindings/regulator/wlf,arizona.yaml6
-rw-r--r--Documentation/devicetree/bindings/remoteproc/amlogic,meson-mx-ao-arc.yaml87
-rw-r--r--Documentation/devicetree/bindings/remoteproc/fsl,imx-rproc.yaml31
-rw-r--r--Documentation/devicetree/bindings/remoteproc/ingenic,vpu.yaml8
-rw-r--r--Documentation/devicetree/bindings/remoteproc/mtk,scp.txt36
-rw-r--r--Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml124
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,adsp.yaml394
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,glink-edge.yaml100
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,glink-rpm-edge.yaml99
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,hexagon-v56.txt140
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,msm8916-mss-pil.yaml291
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,msm8996-mss-pil.yaml393
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,pas-common.yaml89
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,pil-info.yaml4
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,q6v5.txt179
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,qcs404-cdsp-pil.yaml160
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,qcs404-pas.yaml94
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sc7180-mss-pil.yaml247
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sc7180-pas.yaml133
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sc7280-adsp-pil.yaml195
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sc7280-mss-pil.yaml268
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sc7280-wpss-pil.yaml207
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sc8180x-pas.yaml95
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sc8280xp-pas.yaml147
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sdm845-adsp-pil.yaml159
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sdx55-pas.yaml109
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sm6115-pas.yaml143
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sm6350-pas.yaml167
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sm8150-pas.yaml174
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sm8350-pas.yaml182
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,sm8550-pas.yaml178
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,smd-edge.yaml117
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,wcnss-pil.txt177
-rw-r--r--Documentation/devicetree/bindings/remoteproc/qcom,wcnss-pil.yaml294
-rw-r--r--Documentation/devicetree/bindings/remoteproc/renesas,rcar-rproc.yaml65
-rw-r--r--Documentation/devicetree/bindings/remoteproc/st,stm32-rproc.yaml57
-rw-r--r--Documentation/devicetree/bindings/remoteproc/ti,k3-dsp-rproc.yaml20
-rw-r--r--Documentation/devicetree/bindings/remoteproc/ti,k3-r5f-rproc.yaml97
-rw-r--r--Documentation/devicetree/bindings/remoteproc/ti,omap-remoteproc.yaml19
-rw-r--r--Documentation/devicetree/bindings/remoteproc/ti,pru-consumer.yaml60
-rw-r--r--Documentation/devicetree/bindings/remoteproc/ti,pru-rproc.yaml5
-rw-r--r--Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml135
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/framebuffer.yaml52
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/google,open-dice.yaml46
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/memory-region.yaml40
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/nvidia,tegra210-emc-table.yaml31
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/phram.yaml47
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/qcom,cmd-db.txt37
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/qcom,cmd-db.yaml46
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/qcom,rmtfs-mem.txt51
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/qcom,rmtfs-mem.yaml55
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/ramoops.txt66
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/ramoops.yaml144
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt172
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/reserved-memory.yaml181
-rw-r--r--Documentation/devicetree/bindings/reserved-memory/shared-dma-pool.yaml97
-rw-r--r--Documentation/devicetree/bindings/reset/allwinner,sun6i-a31-clock-reset.yaml2
-rw-r--r--Documentation/devicetree/bindings/reset/altr,rst-mgr.yaml47
-rw-r--r--Documentation/devicetree/bindings/reset/amlogic,meson-axg-audio-arb.txt22
-rw-r--r--Documentation/devicetree/bindings/reset/amlogic,meson-axg-audio-arb.yaml56
-rw-r--r--Documentation/devicetree/bindings/reset/amlogic,meson-reset.yaml7
-rw-r--r--Documentation/devicetree/bindings/reset/ath79-reset.txt20
-rw-r--r--Documentation/devicetree/bindings/reset/atmel,at91sam9260-reset.yaml68
-rw-r--r--Documentation/devicetree/bindings/reset/berlin,reset.txt23
-rw-r--r--Documentation/devicetree/bindings/reset/bitmain,bm1880-reset.txt18
-rw-r--r--Documentation/devicetree/bindings/reset/bitmain,bm1880-reset.yaml36
-rw-r--r--Documentation/devicetree/bindings/reset/brcm,bcm6345-reset.yaml4
-rw-r--r--Documentation/devicetree/bindings/reset/brcm,bcm7216-pcie-sata-rescal.yaml4
-rw-r--r--Documentation/devicetree/bindings/reset/brcm,brcmstb-reset.txt27
-rw-r--r--Documentation/devicetree/bindings/reset/brcm,brcmstb-reset.yaml48
-rw-r--r--Documentation/devicetree/bindings/reset/canaan,k210-rst.yaml2
-rw-r--r--Documentation/devicetree/bindings/reset/delta,tn48m-reset.yaml35
-rw-r--r--Documentation/devicetree/bindings/reset/hisilicon,hi3660-reset.yaml5
-rw-r--r--Documentation/devicetree/bindings/reset/hisilicon,hi6220-reset.txt37
-rw-r--r--Documentation/devicetree/bindings/reset/lantiq,reset.txt30
-rw-r--r--Documentation/devicetree/bindings/reset/lantiq,reset.yaml49
-rw-r--r--Documentation/devicetree/bindings/reset/marvell,berlin2-reset.yaml38
-rw-r--r--Documentation/devicetree/bindings/reset/microchip,rst.yaml11
-rw-r--r--Documentation/devicetree/bindings/reset/nuvoton,npcm-reset.txt32
-rw-r--r--Documentation/devicetree/bindings/reset/nuvoton,npcm750-reset.yaml58
-rw-r--r--Documentation/devicetree/bindings/reset/qca,ar7100-reset.yaml40
-rw-r--r--Documentation/devicetree/bindings/reset/qcom,aoss-reset.yaml2
-rw-r--r--Documentation/devicetree/bindings/reset/qcom,pdc-global.yaml2
-rw-r--r--Documentation/devicetree/bindings/reset/renesas,rst.yaml6
-rw-r--r--Documentation/devicetree/bindings/reset/renesas,rzg2l-usbphy-ctrl.yaml4
-rw-r--r--Documentation/devicetree/bindings/reset/snps,axs10x-reset.txt33
-rw-r--r--Documentation/devicetree/bindings/reset/snps,axs10x-reset.yaml48
-rw-r--r--Documentation/devicetree/bindings/reset/socfpga-reset.txt16
-rw-r--r--Documentation/devicetree/bindings/reset/socionext,uniphier-glue-reset.yaml76
-rw-r--r--Documentation/devicetree/bindings/reset/socionext,uniphier-reset.yaml58
-rw-r--r--Documentation/devicetree/bindings/reset/st,sti-picophyreset.txt42
-rw-r--r--Documentation/devicetree/bindings/reset/st,sti-powerdown.txt45
-rw-r--r--Documentation/devicetree/bindings/reset/st,stih407-picophyreset.yaml47
-rw-r--r--Documentation/devicetree/bindings/reset/st,stih407-powerdown.yaml49
-rw-r--r--Documentation/devicetree/bindings/reset/starfive,jh7100-reset.yaml38
-rw-r--r--Documentation/devicetree/bindings/reset/sunplus,reset.yaml38
-rw-r--r--Documentation/devicetree/bindings/reset/ti,sci-reset.yaml2
-rw-r--r--Documentation/devicetree/bindings/reset/ti,tps380x-reset.yaml49
-rw-r--r--Documentation/devicetree/bindings/riscv/cpus.yaml60
-rw-r--r--Documentation/devicetree/bindings/riscv/microchip.yaml25
-rw-r--r--Documentation/devicetree/bindings/riscv/sifive-l2-cache.yaml124
-rw-r--r--Documentation/devicetree/bindings/riscv/starfive.yaml10
-rw-r--r--Documentation/devicetree/bindings/riscv/sunxi.yaml74
-rw-r--r--Documentation/devicetree/bindings/rng/amlogic,meson-rng.yaml6
-rw-r--r--Documentation/devicetree/bindings/rng/apm,rng.txt17
-rw-r--r--Documentation/devicetree/bindings/rng/apm,x-gene-rng.yaml47
-rw-r--r--Documentation/devicetree/bindings/rng/atmel,at91-trng.yaml51
-rw-r--r--Documentation/devicetree/bindings/rng/atmel-trng.txt16
-rw-r--r--Documentation/devicetree/bindings/rng/brcm,iproc-rng200.txt16
-rw-r--r--Documentation/devicetree/bindings/rng/brcm,iproc-rng200.yaml30
-rw-r--r--Documentation/devicetree/bindings/rng/ingenic,rng.yaml2
-rw-r--r--Documentation/devicetree/bindings/rng/ingenic,trng.yaml4
-rw-r--r--Documentation/devicetree/bindings/rng/intel,ixp46x-rng.yaml3
-rw-r--r--Documentation/devicetree/bindings/rng/ks-sa-rng.txt21
-rw-r--r--Documentation/devicetree/bindings/rng/mtk-rng.yaml4
-rw-r--r--Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt12
-rw-r--r--Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.yaml37
-rw-r--r--Documentation/devicetree/bindings/rng/omap3_rom_rng.txt27
-rw-r--r--Documentation/devicetree/bindings/rng/omap_rng.txt38
-rw-r--r--Documentation/devicetree/bindings/rng/omap_rng.yaml81
-rw-r--r--Documentation/devicetree/bindings/rng/samsung,exynos5250-trng.yaml2
-rw-r--r--Documentation/devicetree/bindings/rng/silex-insight,ba431-rng.yaml2
-rw-r--r--Documentation/devicetree/bindings/rng/st,rng.txt15
-rw-r--r--Documentation/devicetree/bindings/rng/st,rng.yaml35
-rw-r--r--Documentation/devicetree/bindings/rng/st,stm32-rng.yaml5
-rw-r--r--Documentation/devicetree/bindings/rng/starfive,jh7110-trng.yaml55
-rw-r--r--Documentation/devicetree/bindings/rng/ti,keystone-rng.yaml50
-rw-r--r--Documentation/devicetree/bindings/rng/ti,omap-rom-rng.yaml41
-rw-r--r--Documentation/devicetree/bindings/rng/timeriomem_rng.txt25
-rw-r--r--Documentation/devicetree/bindings/rng/timeriomem_rng.yaml48
-rw-r--r--Documentation/devicetree/bindings/rng/xiphera,xip8001b-trng.yaml2
-rw-r--r--Documentation/devicetree/bindings/rtc/allwinner,sun4i-a10-rtc.yaml4
-rw-r--r--Documentation/devicetree/bindings/rtc/allwinner,sun6i-a31-rtc.yaml88
-rw-r--r--Documentation/devicetree/bindings/rtc/amlogic,meson-vrtc.yaml44
-rw-r--r--Documentation/devicetree/bindings/rtc/amlogic,meson6-rtc.yaml62
-rw-r--r--Documentation/devicetree/bindings/rtc/atmel,at91rm9200-rtc.yaml4
-rw-r--r--Documentation/devicetree/bindings/rtc/atmel,at91sam9-rtc.txt25
-rw-r--r--Documentation/devicetree/bindings/rtc/atmel,at91sam9260-rtt.yaml69
-rw-r--r--Documentation/devicetree/bindings/rtc/brcm,brcmstb-waketimer.txt20
-rw-r--r--Documentation/devicetree/bindings/rtc/brcm,brcmstb-waketimer.yaml51
-rw-r--r--Documentation/devicetree/bindings/rtc/epson,rx8900.yaml3
-rw-r--r--Documentation/devicetree/bindings/rtc/faraday,ftrtc010.yaml4
-rw-r--r--Documentation/devicetree/bindings/rtc/fsl,scu-rtc.yaml31
-rw-r--r--Documentation/devicetree/bindings/rtc/haoyu,hym8563.txt30
-rw-r--r--Documentation/devicetree/bindings/rtc/haoyu,hym8563.yaml56
-rw-r--r--Documentation/devicetree/bindings/rtc/ingenic,rtc.yaml33
-rw-r--r--Documentation/devicetree/bindings/rtc/microchip,mfps-rtc.yaml67
-rw-r--r--Documentation/devicetree/bindings/rtc/microcrystal,rv3028.yaml54
-rw-r--r--Documentation/devicetree/bindings/rtc/microcrystal,rv3032.yaml5
-rw-r--r--Documentation/devicetree/bindings/rtc/moxa,moxart-rtc.txt12
-rw-r--r--Documentation/devicetree/bindings/rtc/mstar,msc313-rtc.yaml49
-rw-r--r--Documentation/devicetree/bindings/rtc/nuvoton,nct3018y.yaml45
-rw-r--r--Documentation/devicetree/bindings/rtc/nvidia,tegra20-rtc.txt24
-rw-r--r--Documentation/devicetree/bindings/rtc/nvidia,tegra20-rtc.yaml61
-rw-r--r--Documentation/devicetree/bindings/rtc/nxp,pcf2127.yaml7
-rw-r--r--Documentation/devicetree/bindings/rtc/nxp,pcf85063.txt22
-rw-r--r--Documentation/devicetree/bindings/rtc/nxp,pcf85063.yaml92
-rw-r--r--Documentation/devicetree/bindings/rtc/nxp,pcf85363.yaml60
-rw-r--r--Documentation/devicetree/bindings/rtc/nxp,pcf8563.yaml2
-rw-r--r--Documentation/devicetree/bindings/rtc/qcom-pm8xxx-rtc.yaml41
-rw-r--r--Documentation/devicetree/bindings/rtc/renesas,rzn1-rtc.yaml70
-rw-r--r--Documentation/devicetree/bindings/rtc/rtc-ds1307.txt52
-rw-r--r--Documentation/devicetree/bindings/rtc/rtc-ds1307.yaml102
-rw-r--r--Documentation/devicetree/bindings/rtc/rtc-m41t80.txt39
-rw-r--r--Documentation/devicetree/bindings/rtc/rtc-meson-vrtc.txt22
-rw-r--r--Documentation/devicetree/bindings/rtc/rtc-meson.txt35
-rw-r--r--Documentation/devicetree/bindings/rtc/rtc-mt6397.txt2
-rw-r--r--Documentation/devicetree/bindings/rtc/rtc-mxc.yaml2
-rw-r--r--Documentation/devicetree/bindings/rtc/rtc-mxc_v2.yaml2
-rw-r--r--Documentation/devicetree/bindings/rtc/rtc.txt1
-rw-r--r--Documentation/devicetree/bindings/rtc/rtc.yaml2
-rw-r--r--Documentation/devicetree/bindings/rtc/sa1100-rtc.yaml6
-rw-r--r--Documentation/devicetree/bindings/rtc/snvs-rtc.txt1
-rw-r--r--Documentation/devicetree/bindings/rtc/st,m41t80.yaml73
-rw-r--r--Documentation/devicetree/bindings/rtc/st,stm32-rtc.yaml7
-rw-r--r--Documentation/devicetree/bindings/rtc/sunplus,sp7021-rtc.yaml56
-rw-r--r--Documentation/devicetree/bindings/rtc/ti,k3-rtc.yaml62
-rw-r--r--Documentation/devicetree/bindings/rtc/trivial-rtc.yaml10
-rw-r--r--Documentation/devicetree/bindings/rtc/xlnx,zynqmp-rtc.yaml12
-rw-r--r--Documentation/devicetree/bindings/serial/8250.yaml16
-rw-r--r--Documentation/devicetree/bindings/serial/8250_omap.yaml27
-rw-r--r--Documentation/devicetree/bindings/serial/amlogic,meson-uart.yaml32
-rw-r--r--Documentation/devicetree/bindings/serial/atmel,at91-usart.yaml190
-rw-r--r--Documentation/devicetree/bindings/serial/brcm,bcm6345-uart.txt36
-rw-r--r--Documentation/devicetree/bindings/serial/brcm,bcm6345-uart.yaml47
-rw-r--r--Documentation/devicetree/bindings/serial/brcm,bcm7271-uart.yaml2
-rw-r--r--Documentation/devicetree/bindings/serial/cdns,uart.yaml29
-rw-r--r--Documentation/devicetree/bindings/serial/efm32-uart.txt20
-rw-r--r--Documentation/devicetree/bindings/serial/fsl,s32-linflexuart.txt22
-rw-r--r--Documentation/devicetree/bindings/serial/fsl,s32-linflexuart.yaml48
-rw-r--r--Documentation/devicetree/bindings/serial/fsl-imx-uart.yaml38
-rw-r--r--Documentation/devicetree/bindings/serial/fsl-lpuart.yaml22
-rw-r--r--Documentation/devicetree/bindings/serial/fsl-mxs-auart.yaml2
-rw-r--r--Documentation/devicetree/bindings/serial/ingenic,uart.yaml8
-rw-r--r--Documentation/devicetree/bindings/serial/mediatek,uart.yaml122
-rw-r--r--Documentation/devicetree/bindings/serial/mtk-uart.txt58
-rw-r--r--Documentation/devicetree/bindings/serial/mvebu-uart.txt9
-rw-r--r--Documentation/devicetree/bindings/serial/nvidia,tegra194-tcu.txt35
-rw-r--r--Documentation/devicetree/bindings/serial/nvidia,tegra194-tcu.yaml61
-rw-r--r--Documentation/devicetree/bindings/serial/pl011.yaml10
-rw-r--r--Documentation/devicetree/bindings/serial/qcom,msm-uart.txt25
-rw-r--r--Documentation/devicetree/bindings/serial/qcom,msm-uart.yaml56
-rw-r--r--Documentation/devicetree/bindings/serial/qcom,msm-uartdm.txt81
-rw-r--r--Documentation/devicetree/bindings/serial/qcom,msm-uartdm.yaml112
-rw-r--r--Documentation/devicetree/bindings/serial/qcom,serial-geni-qcom.yaml86
-rw-r--r--Documentation/devicetree/bindings/serial/rda,8810pl-uart.txt17
-rw-r--r--Documentation/devicetree/bindings/serial/rda,8810pl-uart.yaml46
-rw-r--r--Documentation/devicetree/bindings/serial/renesas,em-uart.yaml51
-rw-r--r--Documentation/devicetree/bindings/serial/renesas,hscif.yaml40
-rw-r--r--Documentation/devicetree/bindings/serial/renesas,sci.yaml58
-rw-r--r--Documentation/devicetree/bindings/serial/renesas,scif.yaml49
-rw-r--r--Documentation/devicetree/bindings/serial/renesas,scifa.yaml26
-rw-r--r--Documentation/devicetree/bindings/serial/renesas,scifb.yaml16
-rw-r--r--Documentation/devicetree/bindings/serial/rs485.yaml17
-rw-r--r--Documentation/devicetree/bindings/serial/samsung_uart.yaml18
-rw-r--r--Documentation/devicetree/bindings/serial/serial.yaml26
-rw-r--r--Documentation/devicetree/bindings/serial/sifive-serial.yaml8
-rw-r--r--Documentation/devicetree/bindings/serial/snps-dw-apb-uart.yaml16
-rw-r--r--Documentation/devicetree/bindings/serial/socionext,uniphier-uart.yaml5
-rw-r--r--Documentation/devicetree/bindings/serial/sprd-uart.yaml7
-rw-r--r--Documentation/devicetree/bindings/serial/st,stm32-uart.yaml11
-rw-r--r--Documentation/devicetree/bindings/serial/sunplus,sp7021-uart.yaml56
-rw-r--r--Documentation/devicetree/bindings/serial/xlnx,opb-uartlite.txt23
-rw-r--r--Documentation/devicetree/bindings/serial/xlnx,opb-uartlite.yaml88
-rw-r--r--Documentation/devicetree/bindings/serio/allwinner,sun4i-a10-ps2.yaml2
-rw-r--r--Documentation/devicetree/bindings/serio/arm,pl050.yaml67
-rw-r--r--Documentation/devicetree/bindings/serio/ps2-gpio.txt23
-rw-r--r--Documentation/devicetree/bindings/serio/ps2-gpio.yaml64
-rw-r--r--Documentation/devicetree/bindings/slimbus/bus.txt60
-rw-r--r--Documentation/devicetree/bindings/slimbus/qcom,slim-ngd.yaml120
-rw-r--r--Documentation/devicetree/bindings/slimbus/qcom,slim.yaml86
-rw-r--r--Documentation/devicetree/bindings/slimbus/slim-ngd-qcom-ctrl.txt84
-rw-r--r--Documentation/devicetree/bindings/slimbus/slim-qcom-ctrl.txt39
-rw-r--r--Documentation/devicetree/bindings/slimbus/slimbus.yaml95
-rw-r--r--Documentation/devicetree/bindings/soc/amlogic/amlogic,canvas.yaml7
-rw-r--r--Documentation/devicetree/bindings/soc/amlogic/amlogic,meson-gx-clk-measure.yaml40
-rw-r--r--Documentation/devicetree/bindings/soc/amlogic/clk-measure.txt21
-rw-r--r--Documentation/devicetree/bindings/soc/aspeed/uart-routing.yaml56
-rw-r--r--Documentation/devicetree/bindings/soc/bcm/brcm,bcm2835-pm.txt46
-rw-r--r--Documentation/devicetree/bindings/soc/bcm/brcm,bcm2835-pm.yaml86
-rw-r--r--Documentation/devicetree/bindings/soc/bcm/brcm,bcm2835-vchiq.txt17
-rw-r--r--Documentation/devicetree/bindings/soc/bcm/brcm,bcm2835-vchiq.yaml53
-rw-r--r--Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml162
-rw-r--r--Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-tsa.yaml205
-rw-r--r--Documentation/devicetree/bindings/soc/fsl/fsl,layerscape-dcfg.yaml68
-rw-r--r--Documentation/devicetree/bindings/soc/fsl/fsl,layerscape-scfg.yaml58
-rw-r--r--Documentation/devicetree/bindings/soc/imx/fsl,imx-iomuxc-gpr.yaml57
-rw-r--r--Documentation/devicetree/bindings/soc/imx/fsl,imx8mm-disp-blk-ctrl.yaml94
-rw-r--r--Documentation/devicetree/bindings/soc/imx/fsl,imx8mm-vpu-blk-ctrl.yaml164
-rw-r--r--Documentation/devicetree/bindings/soc/imx/fsl,imx8mn-disp-blk-ctrl.yaml97
-rw-r--r--Documentation/devicetree/bindings/soc/imx/fsl,imx8mp-hdmi-blk-ctrl.yaml93
-rw-r--r--Documentation/devicetree/bindings/soc/imx/fsl,imx8mp-hsio-blk-ctrl.yaml92
-rw-r--r--Documentation/devicetree/bindings/soc/imx/fsl,imx8mp-media-blk-ctrl.yaml169
-rw-r--r--Documentation/devicetree/bindings/soc/imx/fsl,imx8mq-vpu-blk-ctrl.yaml71
-rw-r--r--Documentation/devicetree/bindings/soc/imx/fsl,imx93-media-blk-ctrl.yaml80
-rw-r--r--Documentation/devicetree/bindings/soc/imx/fsl,imx93-src.yaml97
-rw-r--r--Documentation/devicetree/bindings/soc/intel/intel,hps-copy-engine.yaml51
-rw-r--r--Documentation/devicetree/bindings/soc/mediatek/devapc.yaml5
-rw-r--r--Documentation/devicetree/bindings/soc/mediatek/mediatek,ccorr.yaml68
-rw-r--r--Documentation/devicetree/bindings/soc/mediatek/mediatek,mt7986-wo-ccif.yaml51
-rw-r--r--Documentation/devicetree/bindings/soc/mediatek/mediatek,mutex.yaml122
-rw-r--r--Documentation/devicetree/bindings/soc/mediatek/mediatek,pwrap.yaml147
-rw-r--r--Documentation/devicetree/bindings/soc/mediatek/mediatek,wdma.yaml81
-rw-r--r--Documentation/devicetree/bindings/soc/mediatek/mtk-svs.yaml91
-rw-r--r--Documentation/devicetree/bindings/soc/mediatek/pwrap.txt70
-rw-r--r--Documentation/devicetree/bindings/soc/microchip/atmel,at91rm9200-tcb.yaml7
-rw-r--r--Documentation/devicetree/bindings/soc/microchip/microchip,mpfs-sys-controller.yaml40
-rw-r--r--Documentation/devicetree/bindings/soc/microchip/microchip,polarfire-soc-sys-controller.yaml35
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,aoss-qmp.yaml20
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,apr-services.yaml53
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,apr.txt134
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,apr.yaml211
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,dcc.yaml44
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,eud.yaml77
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,geni-se.yaml150
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,glink.txt94
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,gsbi.txt87
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,gsbi.yaml132
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,msm8976-ramp-controller.yaml36
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,pmic-glink.yaml97
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,rpm.yaml101
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,rpmh-rsc.yaml267
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,smd-rpm.yaml84
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,smd.txt98
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,smd.yaml62
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,smem.yaml40
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,smp2p.txt110
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,smp2p.yaml145
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,smsm.txt104
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,smsm.yaml138
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,spm.yaml85
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,wcnss.txt131
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,wcnss.yaml134
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom-stats.yaml53
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/rpmh-rsc.txt137
-rw-r--r--Documentation/devicetree/bindings/soc/renesas/renesas,r9a09g011-sys.yaml43
-rw-r--r--Documentation/devicetree/bindings/soc/renesas/renesas,rzg2l-sysc.yaml (renamed from Documentation/devicetree/bindings/power/renesas,rzg2l-sysc.yaml)12
-rw-r--r--Documentation/devicetree/bindings/soc/renesas/renesas,rzv2m-pwc.yaml56
-rw-r--r--Documentation/devicetree/bindings/soc/renesas/renesas.yaml479
-rw-r--r--Documentation/devicetree/bindings/soc/rockchip/grf.yaml33
-rw-r--r--Documentation/devicetree/bindings/soc/samsung/exynos-pmu.yaml195
-rw-r--r--Documentation/devicetree/bindings/soc/samsung/exynos-usi.yaml167
-rw-r--r--Documentation/devicetree/bindings/soc/samsung/samsung,exynos-sysreg.yaml87
-rw-r--r--Documentation/devicetree/bindings/soc/socionext/socionext,uniphier-adamv.yaml50
-rw-r--r--Documentation/devicetree/bindings/soc/socionext/socionext,uniphier-ahci-glue.yaml77
-rw-r--r--Documentation/devicetree/bindings/soc/socionext/socionext,uniphier-dwc3-glue.yaml106
-rw-r--r--Documentation/devicetree/bindings/soc/socionext/socionext,uniphier-mioctrl.yaml65
-rw-r--r--Documentation/devicetree/bindings/soc/socionext/socionext,uniphier-perictrl.yaml64
-rw-r--r--Documentation/devicetree/bindings/soc/socionext/socionext,uniphier-sdctrl.yaml61
-rw-r--r--Documentation/devicetree/bindings/soc/socionext/socionext,uniphier-soc-glue-debug.yaml68
-rw-r--r--Documentation/devicetree/bindings/soc/socionext/socionext,uniphier-soc-glue.yaml114
-rw-r--r--Documentation/devicetree/bindings/soc/socionext/socionext,uniphier-sysctrl.yaml104
-rw-r--r--Documentation/devicetree/bindings/soc/ti/k3-ringacc.yaml13
-rw-r--r--Documentation/devicetree/bindings/soc/ti/k3-socinfo.yaml40
-rw-r--r--Documentation/devicetree/bindings/soc/ti/sci-pm-domain.yaml2
-rw-r--r--Documentation/devicetree/bindings/soc/ti/ti,pruss.yaml23
-rw-r--r--Documentation/devicetree/bindings/soc/ti/wkup-m3-ipc.yaml175
-rw-r--r--Documentation/devicetree/bindings/soc/ti/wkup_m3_ipc.txt57
-rw-r--r--Documentation/devicetree/bindings/sound/adi,adau1372.yaml10
-rw-r--r--Documentation/devicetree/bindings/sound/adi,adau17x1.txt32
-rw-r--r--Documentation/devicetree/bindings/sound/adi,adau17x1.yaml52
-rw-r--r--Documentation/devicetree/bindings/sound/adi,adau1977.yaml8
-rw-r--r--Documentation/devicetree/bindings/sound/adi,adau7002.txt19
-rw-r--r--Documentation/devicetree/bindings/sound/adi,adau7002.yaml40
-rw-r--r--Documentation/devicetree/bindings/sound/adi,adau7118.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/adi,max98363.yaml60
-rw-r--r--Documentation/devicetree/bindings/sound/adi,max98396.yaml141
-rw-r--r--Documentation/devicetree/bindings/sound/ak4375.yaml60
-rw-r--r--Documentation/devicetree/bindings/sound/ak4458.txt28
-rw-r--r--Documentation/devicetree/bindings/sound/ak4613.yaml14
-rw-r--r--Documentation/devicetree/bindings/sound/ak4642.yaml7
-rw-r--r--Documentation/devicetree/bindings/sound/ak5558.txt24
-rw-r--r--Documentation/devicetree/bindings/sound/alc5632.txt43
-rw-r--r--Documentation/devicetree/bindings/sound/allwinner,sun4i-a10-codec.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/allwinner,sun4i-a10-i2s.yaml14
-rw-r--r--Documentation/devicetree/bindings/sound/allwinner,sun4i-a10-spdif.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/allwinner,sun50i-a64-codec-analog.yaml7
-rw-r--r--Documentation/devicetree/bindings/sound/allwinner,sun50i-h6-dmic.yaml87
-rw-r--r--Documentation/devicetree/bindings/sound/allwinner,sun8i-a23-codec-analog.yaml2
-rw-r--r--Documentation/devicetree/bindings/sound/allwinner,sun8i-a33-codec.yaml7
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,aiu.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-fifo.txt34
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-fifo.yaml112
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-pdm.txt29
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-pdm.yaml82
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-sound-card.txt124
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-sound-card.yaml183
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-spdifin.txt27
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-spdifin.yaml86
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-spdifout.txt25
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-spdifout.yaml79
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-tdm-formatters.txt36
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-tdm-formatters.yaml88
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-tdm-iface.txt22
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,axg-tdm-iface.yaml55
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,g12a-toacodec.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,gx-sound-card.yaml9
-rw-r--r--Documentation/devicetree/bindings/sound/amlogic,t9015.yaml11
-rw-r--r--Documentation/devicetree/bindings/sound/apple,mca.yaml135
-rw-r--r--Documentation/devicetree/bindings/sound/arm,pl041.yaml62
-rw-r--r--Documentation/devicetree/bindings/sound/arndale.txt25
-rw-r--r--Documentation/devicetree/bindings/sound/asahi-kasei,ak4458.yaml73
-rw-r--r--Documentation/devicetree/bindings/sound/asahi-kasei,ak5558.yaml48
-rw-r--r--Documentation/devicetree/bindings/sound/atmel,sama5d2-classd.yaml100
-rw-r--r--Documentation/devicetree/bindings/sound/atmel,sama5d2-i2s.yaml85
-rw-r--r--Documentation/devicetree/bindings/sound/atmel,sama5d2-pdmic.yaml98
-rw-r--r--Documentation/devicetree/bindings/sound/atmel-classd.txt55
-rw-r--r--Documentation/devicetree/bindings/sound/atmel-i2s.txt46
-rw-r--r--Documentation/devicetree/bindings/sound/atmel-pdmic.txt55
-rw-r--r--Documentation/devicetree/bindings/sound/atmel-sam9x5-wm8731-audio.txt2
-rw-r--r--Documentation/devicetree/bindings/sound/audio-graph-card.yaml2
-rw-r--r--Documentation/devicetree/bindings/sound/audio-graph-card2.yaml60
-rw-r--r--Documentation/devicetree/bindings/sound/audio-graph-port.yaml103
-rw-r--r--Documentation/devicetree/bindings/sound/audio-graph.yaml13
-rw-r--r--Documentation/devicetree/bindings/sound/awinic,aw8738.yaml54
-rw-r--r--Documentation/devicetree/bindings/sound/awinic,aw88395.yaml53
-rw-r--r--Documentation/devicetree/bindings/sound/bt-sco.txt13
-rw-r--r--Documentation/devicetree/bindings/sound/cirrus,cs35l41.yaml209
-rw-r--r--Documentation/devicetree/bindings/sound/cirrus,cs35l45.yaml156
-rw-r--r--Documentation/devicetree/bindings/sound/cirrus,cs42l42.yaml226
-rw-r--r--Documentation/devicetree/bindings/sound/cirrus,cs42l51.yaml9
-rw-r--r--Documentation/devicetree/bindings/sound/cirrus,ep9301-i2s.yaml66
-rw-r--r--Documentation/devicetree/bindings/sound/cirrus,lochnagar.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/cirrus,madera.yaml3
-rw-r--r--Documentation/devicetree/bindings/sound/component-common.yaml21
-rw-r--r--Documentation/devicetree/bindings/sound/cs42l42.txt114
-rw-r--r--Documentation/devicetree/bindings/sound/da9055.txt2
-rw-r--r--Documentation/devicetree/bindings/sound/dai-common.yaml18
-rw-r--r--Documentation/devicetree/bindings/sound/dai-params.yaml40
-rw-r--r--Documentation/devicetree/bindings/sound/davinci-mcasp-audio.txt86
-rw-r--r--Documentation/devicetree/bindings/sound/davinci-mcasp-audio.yaml202
-rw-r--r--Documentation/devicetree/bindings/sound/designware-i2s.txt35
-rw-r--r--Documentation/devicetree/bindings/sound/dmic-codec.yaml55
-rw-r--r--Documentation/devicetree/bindings/sound/dmic.txt22
-rw-r--r--Documentation/devicetree/bindings/sound/everest,es8316.yaml11
-rw-r--r--Documentation/devicetree/bindings/sound/everest,es8326.yaml116
-rw-r--r--Documentation/devicetree/bindings/sound/fsl,micfil.txt32
-rw-r--r--Documentation/devicetree/bindings/sound/fsl,micfil.yaml86
-rw-r--r--Documentation/devicetree/bindings/sound/fsl,mqs.txt2
-rw-r--r--Documentation/devicetree/bindings/sound/fsl,qmc-audio.yaml117
-rw-r--r--Documentation/devicetree/bindings/sound/fsl,rpmsg.yaml36
-rw-r--r--Documentation/devicetree/bindings/sound/fsl,sai.yaml203
-rw-r--r--Documentation/devicetree/bindings/sound/fsl,spdif.yaml4
-rw-r--r--Documentation/devicetree/bindings/sound/fsl,xcvr.yaml1
-rw-r--r--Documentation/devicetree/bindings/sound/fsl-asoc-card.txt3
-rw-r--r--Documentation/devicetree/bindings/sound/fsl-sai.txt84
-rw-r--r--Documentation/devicetree/bindings/sound/google,cros-ec-codec.yaml7
-rw-r--r--Documentation/devicetree/bindings/sound/google,sc7180-trogdor.yaml10
-rw-r--r--Documentation/devicetree/bindings/sound/google,sc7280-herobrine.yaml192
-rw-r--r--Documentation/devicetree/bindings/sound/imx-audio-card.yaml9
-rw-r--r--Documentation/devicetree/bindings/sound/imx-audio-hdmi.yaml3
-rw-r--r--Documentation/devicetree/bindings/sound/infineon,peb2466.yaml91
-rw-r--r--Documentation/devicetree/bindings/sound/ingenic,aic.yaml19
-rw-r--r--Documentation/devicetree/bindings/sound/ingenic,codec.yaml9
-rw-r--r--Documentation/devicetree/bindings/sound/intel,keembay-i2s.yaml10
-rw-r--r--Documentation/devicetree/bindings/sound/irondevice,sma1303.yaml48
-rw-r--r--Documentation/devicetree/bindings/sound/linux,bt-sco.yaml41
-rw-r--r--Documentation/devicetree/bindings/sound/linux,spdif-dit.yaml37
-rw-r--r--Documentation/devicetree/bindings/sound/marvell,mmp-sspa.yaml8
-rw-r--r--Documentation/devicetree/bindings/sound/max98090.txt59
-rw-r--r--Documentation/devicetree/bindings/sound/max98095.txt22
-rw-r--r--Documentation/devicetree/bindings/sound/max98357a.txt28
-rw-r--r--Documentation/devicetree/bindings/sound/max98371.txt17
-rw-r--r--Documentation/devicetree/bindings/sound/max98504.txt44
-rw-r--r--Documentation/devicetree/bindings/sound/max9867.txt17
-rw-r--r--Documentation/devicetree/bindings/sound/max9892x.txt3
-rw-r--r--Documentation/devicetree/bindings/sound/maxim,max9759.txt18
-rw-r--r--Documentation/devicetree/bindings/sound/maxim,max9759.yaml45
-rw-r--r--Documentation/devicetree/bindings/sound/maxim,max98090.yaml84
-rw-r--r--Documentation/devicetree/bindings/sound/maxim,max98095.yaml54
-rw-r--r--Documentation/devicetree/bindings/sound/maxim,max98357a.yaml52
-rw-r--r--Documentation/devicetree/bindings/sound/maxim,max98371.yaml42
-rw-r--r--Documentation/devicetree/bindings/sound/maxim,max98390.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/maxim,max98504.yaml86
-rw-r--r--Documentation/devicetree/bindings/sound/maxim,max98520.yaml35
-rw-r--r--Documentation/devicetree/bindings/sound/maxim,max9867.yaml60
-rw-r--r--Documentation/devicetree/bindings/sound/mchp,i2s-mcc.yaml108
-rw-r--r--Documentation/devicetree/bindings/sound/mchp,spdifrx.yaml73
-rw-r--r--Documentation/devicetree/bindings/sound/mchp,spdiftx.yaml75
-rw-r--r--Documentation/devicetree/bindings/sound/mediatek,mt8188-afe.yaml208
-rw-r--r--Documentation/devicetree/bindings/sound/mediatek,mt8188-mt6359.yaml97
-rw-r--r--Documentation/devicetree/bindings/sound/microchip,sama7g5-i2smcc.yaml110
-rw-r--r--Documentation/devicetree/bindings/sound/microchip,sama7g5-pdmc.yaml109
-rw-r--r--Documentation/devicetree/bindings/sound/microchip,sama7g5-spdifrx.yaml73
-rw-r--r--Documentation/devicetree/bindings/sound/microchip,sama7g5-spdiftx.yaml78
-rw-r--r--Documentation/devicetree/bindings/sound/mt6358.txt4
-rw-r--r--Documentation/devicetree/bindings/sound/mt6359.yaml2
-rw-r--r--Documentation/devicetree/bindings/sound/mt8186-afe-pcm.yaml175
-rw-r--r--Documentation/devicetree/bindings/sound/mt8186-mt6366-da7219-max98357.yaml85
-rw-r--r--Documentation/devicetree/bindings/sound/mt8186-mt6366-rt1019-rt5682s.yaml98
-rw-r--r--Documentation/devicetree/bindings/sound/mt8192-afe-pcm.yaml100
-rw-r--r--Documentation/devicetree/bindings/sound/mt8192-mt6359-rt1015-rt5682.yaml39
-rw-r--r--Documentation/devicetree/bindings/sound/mt8195-afe-pcm.yaml20
-rw-r--r--Documentation/devicetree/bindings/sound/mt8195-mt6359-rt1019-rt5682.yaml47
-rw-r--r--Documentation/devicetree/bindings/sound/mt8195-mt6359.yaml64
-rw-r--r--Documentation/devicetree/bindings/sound/mvebu-audio.txt14
-rw-r--r--Documentation/devicetree/bindings/sound/name-prefix.txt24
-rw-r--r--Documentation/devicetree/bindings/sound/nau8315.txt6
-rw-r--r--Documentation/devicetree/bindings/sound/nau8821.txt55
-rw-r--r--Documentation/devicetree/bindings/sound/nau8822.txt16
-rw-r--r--Documentation/devicetree/bindings/sound/nau8825.txt6
-rw-r--r--Documentation/devicetree/bindings/sound/nuvoton,nau8822.yaml46
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-alc5632.txt48
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-alc5632.yaml74
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-common.yaml87
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-graph-card.yaml10
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-max9808x.yaml90
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-max98090.txt53
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-max98090.yaml97
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-rt5631.yaml85
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-rt5640.txt52
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-rt5640.yaml84
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-rt5677.txt67
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-rt5677.yaml100
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-sgtl5000.txt42
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-sgtl5000.yaml67
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-trimslice.txt21
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-trimslice.yaml33
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-wm8753.txt40
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-wm8753.yaml79
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-wm8903.txt62
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-wm8903.yaml93
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-wm9712.txt60
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra-audio-wm9712.yaml76
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra186-asrc.yaml81
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra186-dspk.yaml15
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra20-i2s.txt30
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra20-i2s.yaml77
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra20-spdif.yaml88
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-admaif.yaml6
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-adx.yaml77
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-ahub.yaml35
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-amx.yaml79
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-dmic.yaml12
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-i2s.yaml12
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-mbdrc.yaml47
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-mixer.yaml75
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-mvc.yaml77
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-ope.yaml87
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-peq.yaml48
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra210-sfc.yaml74
-rw-r--r--Documentation/devicetree/bindings/sound/nvidia,tegra30-hda.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/nxp,tfa989x.yaml51
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,apq8016-sbc.txt96
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,lpass-cpu.yaml139
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,lpass-rx-macro.yaml88
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,lpass-tx-macro.yaml88
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,lpass-va-macro.yaml101
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,lpass-wsa-macro.yaml87
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6adm-routing.yaml39
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6adm.txt39
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6adm.yaml51
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6afe.txt201
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6afe.yaml68
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6apm-dai.yaml34
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6apm-lpass-dais.yaml35
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6apm.yaml68
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6asm-dais.yaml96
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6asm.txt70
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6asm.yaml68
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6core.txt21
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6core.yaml39
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6dsp-lpass-clocks.yaml41
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6dsp-lpass-ports.yaml164
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,q6prm.yaml50
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,sdm845.txt91
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,sm8250.yaml183
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,wcd9335.txt123
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,wcd9335.yaml156
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,wcd934x.yaml68
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,wcd938x-sdw.yaml2
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,wcd938x.yaml14
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,wsa881x.yaml7
-rw-r--r--Documentation/devicetree/bindings/sound/qcom,wsa883x.yaml81
-rw-r--r--Documentation/devicetree/bindings/sound/realtek,alc5632.yaml63
-rw-r--r--Documentation/devicetree/bindings/sound/realtek,rt1015p.yaml7
-rw-r--r--Documentation/devicetree/bindings/sound/realtek,rt5682s.yaml150
-rw-r--r--Documentation/devicetree/bindings/sound/renesas,fsi.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/renesas,idt821034.yaml75
-rw-r--r--Documentation/devicetree/bindings/sound/renesas,rsnd.yaml225
-rw-r--r--Documentation/devicetree/bindings/sound/renesas,rz-ssi.yaml32
-rw-r--r--Documentation/devicetree/bindings/sound/richtek,rt9120.yaml62
-rw-r--r--Documentation/devicetree/bindings/sound/rockchip,i2s-tdm.yaml192
-rw-r--r--Documentation/devicetree/bindings/sound/rockchip,pdm.txt46
-rw-r--r--Documentation/devicetree/bindings/sound/rockchip,pdm.yaml123
-rw-r--r--Documentation/devicetree/bindings/sound/rockchip,rk3328-codec.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/rockchip-i2s.yaml17
-rw-r--r--Documentation/devicetree/bindings/sound/rockchip-spdif.yaml23
-rw-r--r--Documentation/devicetree/bindings/sound/rohm,bd28623.yaml5
-rw-r--r--Documentation/devicetree/bindings/sound/rt5640.txt3
-rw-r--r--Documentation/devicetree/bindings/sound/rt5659.txt2
-rw-r--r--Documentation/devicetree/bindings/sound/rt5682.txt22
-rw-r--r--Documentation/devicetree/bindings/sound/samsung,aries-wm8994.yaml10
-rw-r--r--Documentation/devicetree/bindings/sound/samsung,arndale.yaml45
-rw-r--r--Documentation/devicetree/bindings/sound/samsung,midas-audio.yaml4
-rw-r--r--Documentation/devicetree/bindings/sound/samsung,odroid.yaml13
-rw-r--r--Documentation/devicetree/bindings/sound/samsung,smdk-wm8994.txt14
-rw-r--r--Documentation/devicetree/bindings/sound/samsung,smdk5250.yaml38
-rw-r--r--Documentation/devicetree/bindings/sound/samsung,snow.yaml76
-rw-r--r--Documentation/devicetree/bindings/sound/samsung,tm2-audio.txt42
-rw-r--r--Documentation/devicetree/bindings/sound/samsung,tm2.yaml80
-rw-r--r--Documentation/devicetree/bindings/sound/samsung-i2s.yaml22
-rw-r--r--Documentation/devicetree/bindings/sound/serial-midi.yaml51
-rw-r--r--Documentation/devicetree/bindings/sound/sgtl5000.yaml10
-rw-r--r--Documentation/devicetree/bindings/sound/simple-amplifier.txt17
-rw-r--r--Documentation/devicetree/bindings/sound/simple-audio-amplifier.yaml45
-rw-r--r--Documentation/devicetree/bindings/sound/simple-audio-mux.yaml9
-rw-r--r--Documentation/devicetree/bindings/sound/simple-card.yaml99
-rw-r--r--Documentation/devicetree/bindings/sound/snow.txt31
-rw-r--r--Documentation/devicetree/bindings/sound/snps,designware-i2s.yaml94
-rw-r--r--Documentation/devicetree/bindings/sound/socionext,uniphier-aio.yaml29
-rw-r--r--Documentation/devicetree/bindings/sound/socionext,uniphier-evea.yaml11
-rw-r--r--Documentation/devicetree/bindings/sound/sound-dai.yaml20
-rw-r--r--Documentation/devicetree/bindings/sound/spdif-transmitter.txt10
-rw-r--r--Documentation/devicetree/bindings/sound/st,stm32-i2s.yaml7
-rw-r--r--Documentation/devicetree/bindings/sound/st,stm32-sai.yaml24
-rw-r--r--Documentation/devicetree/bindings/sound/st,stm32-spdifrx.yaml7
-rw-r--r--Documentation/devicetree/bindings/sound/tas2562.yaml10
-rw-r--r--Documentation/devicetree/bindings/sound/tas2764.yaml76
-rw-r--r--Documentation/devicetree/bindings/sound/tas2770.yaml8
-rw-r--r--Documentation/devicetree/bindings/sound/tas27xx.yaml80
-rw-r--r--Documentation/devicetree/bindings/sound/tas571x.txt1
-rw-r--r--Documentation/devicetree/bindings/sound/tas5720.txt2
-rw-r--r--Documentation/devicetree/bindings/sound/tas5805m.yaml56
-rw-r--r--Documentation/devicetree/bindings/sound/test-component.yaml33
-rw-r--r--Documentation/devicetree/bindings/sound/ti,j721e-cpb-audio.yaml2
-rw-r--r--Documentation/devicetree/bindings/sound/ti,pcm3168a.txt56
-rw-r--r--Documentation/devicetree/bindings/sound/ti,pcm3168a.yaml107
-rw-r--r--Documentation/devicetree/bindings/sound/ti,src4xxx.yaml48
-rw-r--r--Documentation/devicetree/bindings/sound/ti,tlv320adc3xxx.yaml140
-rw-r--r--Documentation/devicetree/bindings/sound/ti,tlv320aic3x.yaml165
-rw-r--r--Documentation/devicetree/bindings/sound/ti,ts3a227e.yaml94
-rw-r--r--Documentation/devicetree/bindings/sound/tlv320adcx140.yaml85
-rw-r--r--Documentation/devicetree/bindings/sound/tlv320aic31xx.txt2
-rw-r--r--Documentation/devicetree/bindings/sound/tlv320aic3x.txt97
-rw-r--r--Documentation/devicetree/bindings/sound/ts3a227e.txt30
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,arizona.yaml3
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8510.yaml41
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8523.yaml40
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8524.yaml40
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8580.yaml42
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8711.yaml40
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8728.yaml40
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8731.yaml99
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8737.yaml40
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8753.yaml62
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8903.yaml116
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8940.yaml60
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8960.yaml88
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8961.yaml43
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8962.yaml124
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8978.yaml61
-rw-r--r--Documentation/devicetree/bindings/sound/wlf,wm8994.yaml194
-rw-r--r--Documentation/devicetree/bindings/sound/wm8510.txt18
-rw-r--r--Documentation/devicetree/bindings/sound/wm8523.txt16
-rw-r--r--Documentation/devicetree/bindings/sound/wm8524.txt16
-rw-r--r--Documentation/devicetree/bindings/sound/wm8580.txt16
-rw-r--r--Documentation/devicetree/bindings/sound/wm8711.txt18
-rw-r--r--Documentation/devicetree/bindings/sound/wm8728.txt18
-rw-r--r--Documentation/devicetree/bindings/sound/wm8731.txt27
-rw-r--r--Documentation/devicetree/bindings/sound/wm8737.txt18
-rw-r--r--Documentation/devicetree/bindings/sound/wm8753.txt40
-rw-r--r--Documentation/devicetree/bindings/sound/wm8903.txt82
-rw-r--r--Documentation/devicetree/bindings/sound/wm8960.txt42
-rw-r--r--Documentation/devicetree/bindings/sound/wm8962.txt43
-rw-r--r--Documentation/devicetree/bindings/sound/wm8994.txt112
-rw-r--r--Documentation/devicetree/bindings/sound/zl38060.yaml7
-rw-r--r--Documentation/devicetree/bindings/soundwire/qcom,sdw.txt188
-rw-r--r--Documentation/devicetree/bindings/soundwire/qcom,soundwire.yaml271
-rw-r--r--Documentation/devicetree/bindings/soundwire/soundwire-controller.yaml2
-rw-r--r--Documentation/devicetree/bindings/spi/allwinner,sun4i-a10-spi.yaml5
-rw-r--r--Documentation/devicetree/bindings/spi/allwinner,sun6i-a31-spi.yaml6
-rw-r--r--Documentation/devicetree/bindings/spi/amlogic,a1-spifc.yaml41
-rw-r--r--Documentation/devicetree/bindings/spi/amlogic,meson-gx-spicc.yaml108
-rw-r--r--Documentation/devicetree/bindings/spi/amlogic,meson6-spifc.yaml31
-rw-r--r--Documentation/devicetree/bindings/spi/aspeed,ast2600-fmc.yaml82
-rw-r--r--Documentation/devicetree/bindings/spi/atmel,at91rm9200-spi.yaml85
-rw-r--r--Documentation/devicetree/bindings/spi/atmel,quadspi.yaml99
-rw-r--r--Documentation/devicetree/bindings/spi/atmel-quadspi.txt37
-rw-r--r--Documentation/devicetree/bindings/spi/brcm,bcm63xx-hsspi.yaml134
-rw-r--r--Documentation/devicetree/bindings/spi/brcm,spi-bcm-qspi.yaml156
-rw-r--r--Documentation/devicetree/bindings/spi/cdns,qspi-nor-peripheral-props.yaml42
-rw-r--r--Documentation/devicetree/bindings/spi/cdns,qspi-nor.yaml121
-rw-r--r--Documentation/devicetree/bindings/spi/cdns,xspi.yaml77
-rw-r--r--Documentation/devicetree/bindings/spi/efm32-spi.txt39
-rw-r--r--Documentation/devicetree/bindings/spi/fsl,spi-fsl-qspi.yaml2
-rw-r--r--Documentation/devicetree/bindings/spi/fsl-imx-cspi.yaml6
-rw-r--r--Documentation/devicetree/bindings/spi/hpe,gxp-spifi.yaml56
-rw-r--r--Documentation/devicetree/bindings/spi/ingenic,spi.yaml75
-rw-r--r--Documentation/devicetree/bindings/spi/marvell,mmp2-ssp.yaml2
-rw-r--r--Documentation/devicetree/bindings/spi/mediatek,spi-mt65xx.yaml113
-rw-r--r--Documentation/devicetree/bindings/spi/mediatek,spi-mtk-nor.yaml32
-rw-r--r--Documentation/devicetree/bindings/spi/mediatek,spi-mtk-snfi.yaml124
-rw-r--r--Documentation/devicetree/bindings/spi/mediatek,spi-slave-mt27xx.yaml58
-rw-r--r--Documentation/devicetree/bindings/spi/microchip,mpfs-spi.yaml58
-rw-r--r--Documentation/devicetree/bindings/spi/mikrotik,rb4xx-spi.yaml2
-rw-r--r--Documentation/devicetree/bindings/spi/mxicy,mx25f0a-spi.yaml65
-rw-r--r--Documentation/devicetree/bindings/spi/mxs-spi.yaml2
-rw-r--r--Documentation/devicetree/bindings/spi/nuvoton,npcm-fiu.txt15
-rw-r--r--Documentation/devicetree/bindings/spi/nuvoton,npcm-pspi.txt3
-rw-r--r--Documentation/devicetree/bindings/spi/nuvoton,wpcm450-fiu.yaml66
-rw-r--r--Documentation/devicetree/bindings/spi/nvidia,tegra210-quad-peripheral-props.yaml32
-rw-r--r--Documentation/devicetree/bindings/spi/nvidia,tegra210-quad.yaml69
-rw-r--r--Documentation/devicetree/bindings/spi/omap-spi.yaml2
-rw-r--r--Documentation/devicetree/bindings/spi/qcom,spi-geni-qcom.txt39
-rw-r--r--Documentation/devicetree/bindings/spi/qcom,spi-geni-qcom.yaml118
-rw-r--r--Documentation/devicetree/bindings/spi/qcom,spi-qcom-qspi.yaml23
-rw-r--r--Documentation/devicetree/bindings/spi/qcom,spi-qup.txt103
-rw-r--r--Documentation/devicetree/bindings/spi/qcom,spi-qup.yaml81
-rw-r--r--Documentation/devicetree/bindings/spi/ralink,mt7621-spi.yaml61
-rw-r--r--Documentation/devicetree/bindings/spi/realtek,rtl-spi.yaml2
-rw-r--r--Documentation/devicetree/bindings/spi/renesas,hspi.yaml1
-rw-r--r--Documentation/devicetree/bindings/spi/renesas,rspi.yaml30
-rw-r--r--Documentation/devicetree/bindings/spi/renesas,sh-msiof.yaml39
-rw-r--r--Documentation/devicetree/bindings/spi/samsung,spi-peripheral-props.yaml33
-rw-r--r--Documentation/devicetree/bindings/spi/samsung,spi.yaml192
-rw-r--r--Documentation/devicetree/bindings/spi/snps,dw-apb-ssi.yaml48
-rw-r--r--Documentation/devicetree/bindings/spi/socionext,f-ospi.yaml57
-rw-r--r--Documentation/devicetree/bindings/spi/socionext,synquacer-spi.yaml73
-rw-r--r--Documentation/devicetree/bindings/spi/spi-bcm63xx-hsspi.txt33
-rw-r--r--Documentation/devicetree/bindings/spi/spi-cadence.yaml11
-rw-r--r--Documentation/devicetree/bindings/spi/spi-controller.yaml63
-rw-r--r--Documentation/devicetree/bindings/spi/spi-davinci.txt2
-rw-r--r--Documentation/devicetree/bindings/spi/spi-fsl-lpspi.yaml33
-rw-r--r--Documentation/devicetree/bindings/spi/spi-gpio.yaml6
-rw-r--r--Documentation/devicetree/bindings/spi/spi-mt65xx.txt68
-rw-r--r--Documentation/devicetree/bindings/spi/spi-mt7621.txt26
-rw-r--r--Documentation/devicetree/bindings/spi/spi-mux.yaml3
-rw-r--r--Documentation/devicetree/bindings/spi/spi-mxic.txt34
-rw-r--r--Documentation/devicetree/bindings/spi/spi-nxp-fspi.txt44
-rw-r--r--Documentation/devicetree/bindings/spi/spi-nxp-fspi.yaml87
-rw-r--r--Documentation/devicetree/bindings/spi/spi-peripheral-props.yaml122
-rw-r--r--Documentation/devicetree/bindings/spi/spi-pl022.yaml27
-rw-r--r--Documentation/devicetree/bindings/spi/spi-rockchip.yaml8
-rw-r--r--Documentation/devicetree/bindings/spi/spi-samsung.txt122
-rw-r--r--Documentation/devicetree/bindings/spi/spi-sifive.yaml6
-rw-r--r--Documentation/devicetree/bindings/spi/spi-slave-mt27xx.txt33
-rw-r--r--Documentation/devicetree/bindings/spi/spi-st-ssc.txt40
-rw-r--r--Documentation/devicetree/bindings/spi/spi-sunplus-sp7021.yaml78
-rw-r--r--Documentation/devicetree/bindings/spi/spi-synquacer.txt27
-rw-r--r--Documentation/devicetree/bindings/spi/spi-xilinx.yaml4
-rw-r--r--Documentation/devicetree/bindings/spi/spi-xlp.txt38
-rw-r--r--Documentation/devicetree/bindings/spi/spi-zynqmp-qspi.yaml15
-rw-r--r--Documentation/devicetree/bindings/spi/spi_atmel.txt36
-rw-r--r--Documentation/devicetree/bindings/spi/sprd,spi-adi.yaml7
-rw-r--r--Documentation/devicetree/bindings/spi/st,ssc-spi.yaml61
-rw-r--r--Documentation/devicetree/bindings/spi/st,stm32-qspi.yaml8
-rw-r--r--Documentation/devicetree/bindings/spi/st,stm32-spi.yaml31
-rw-r--r--Documentation/devicetree/bindings/spi/xlnx,zynq-qspi.yaml2
-rw-r--r--Documentation/devicetree/bindings/spmi/mtk,spmi-mtk-pmif.yaml76
-rw-r--r--Documentation/devicetree/bindings/spmi/qcom,spmi-pmic-arb.txt65
-rw-r--r--Documentation/devicetree/bindings/spmi/qcom,spmi-pmic-arb.yaml127
-rw-r--r--Documentation/devicetree/bindings/spmi/spmi.yaml3
-rw-r--r--Documentation/devicetree/bindings/sram/allwinner,sun4i-a10-system-control.yaml94
-rw-r--r--Documentation/devicetree/bindings/sram/qcom,imem.yaml78
-rw-r--r--Documentation/devicetree/bindings/sram/qcom,ocmem.yaml11
-rw-r--r--Documentation/devicetree/bindings/sram/sram.yaml20
-rw-r--r--Documentation/devicetree/bindings/submitting-patches.rst3
-rw-r--r--Documentation/devicetree/bindings/thermal/allwinner,sun8i-a83t-ths.yaml2
-rw-r--r--Documentation/devicetree/bindings/thermal/amlogic,thermal.yaml2
-rw-r--r--Documentation/devicetree/bindings/thermal/brcm,avs-ro-thermal.yaml2
-rw-r--r--Documentation/devicetree/bindings/thermal/brcm,avs-tmon.txt23
-rw-r--r--Documentation/devicetree/bindings/thermal/brcm,avs-tmon.yaml56
-rw-r--r--Documentation/devicetree/bindings/thermal/exynos-thermal.txt106
-rw-r--r--Documentation/devicetree/bindings/thermal/fsl,scu-thermal.yaml38
-rw-r--r--Documentation/devicetree/bindings/thermal/generic-adc-thermal.yaml84
-rw-r--r--Documentation/devicetree/bindings/thermal/imx-thermal.yaml20
-rw-r--r--Documentation/devicetree/bindings/thermal/imx8mm-thermal.yaml9
-rw-r--r--Documentation/devicetree/bindings/thermal/mediatek,lvts-thermal.yaml142
-rw-r--r--Documentation/devicetree/bindings/thermal/mediatek-thermal.txt3
-rw-r--r--Documentation/devicetree/bindings/thermal/nvidia,tegra124-soctherm.txt2
-rw-r--r--Documentation/devicetree/bindings/thermal/nvidia,tegra186-bpmp-thermal.txt33
-rw-r--r--Documentation/devicetree/bindings/thermal/nvidia,tegra186-bpmp-thermal.yaml42
-rw-r--r--Documentation/devicetree/bindings/thermal/qcom,spmi-temp-alarm.yaml85
-rw-r--r--Documentation/devicetree/bindings/thermal/qcom-lmh.yaml4
-rw-r--r--Documentation/devicetree/bindings/thermal/qcom-spmi-adc-tm-hc.yaml149
-rw-r--r--Documentation/devicetree/bindings/thermal/qcom-spmi-adc-tm5.yaml116
-rw-r--r--Documentation/devicetree/bindings/thermal/qcom-spmi-temp-alarm.txt51
-rw-r--r--Documentation/devicetree/bindings/thermal/qcom-tsens.yaml244
-rw-r--r--Documentation/devicetree/bindings/thermal/qoriq-thermal.yaml4
-rw-r--r--Documentation/devicetree/bindings/thermal/rcar-gen3-thermal.yaml38
-rw-r--r--Documentation/devicetree/bindings/thermal/rcar-thermal.yaml2
-rw-r--r--Documentation/devicetree/bindings/thermal/rockchip-thermal.yaml24
-rw-r--r--Documentation/devicetree/bindings/thermal/rzg2l-thermal.yaml78
-rw-r--r--Documentation/devicetree/bindings/thermal/samsung,exynos-thermal.yaml184
-rw-r--r--Documentation/devicetree/bindings/thermal/socionext,uniphier-thermal.yaml16
-rw-r--r--Documentation/devicetree/bindings/thermal/sprd-thermal.yaml2
-rw-r--r--Documentation/devicetree/bindings/thermal/st,stm32-thermal.yaml4
-rw-r--r--Documentation/devicetree/bindings/thermal/thermal-cooling-devices.yaml12
-rw-r--r--Documentation/devicetree/bindings/thermal/thermal-generic-adc.txt95
-rw-r--r--Documentation/devicetree/bindings/thermal/thermal-idle.yaml160
-rw-r--r--Documentation/devicetree/bindings/thermal/thermal-sensor.yaml2
-rw-r--r--Documentation/devicetree/bindings/thermal/thermal-zones.yaml13
-rw-r--r--Documentation/devicetree/bindings/thermal/ti,am654-thermal.yaml2
-rw-r--r--Documentation/devicetree/bindings/thermal/ti,j72xx-thermal.yaml96
-rw-r--r--Documentation/devicetree/bindings/timer/allwinner,sun4i-a10-timer.yaml3
-rw-r--r--Documentation/devicetree/bindings/timer/allwinner,sun5i-a13-hstimer.yaml2
-rw-r--r--Documentation/devicetree/bindings/timer/amlogic,meson6-timer.txt22
-rw-r--r--Documentation/devicetree/bindings/timer/amlogic,meson6-timer.yaml54
-rw-r--r--Documentation/devicetree/bindings/timer/andestech,atcpit100-timer.txt33
-rw-r--r--Documentation/devicetree/bindings/timer/arm,arch_timer.yaml11
-rw-r--r--Documentation/devicetree/bindings/timer/arm,arch_timer_mmio.yaml3
-rw-r--r--Documentation/devicetree/bindings/timer/arm,armv7m-systick.txt26
-rw-r--r--Documentation/devicetree/bindings/timer/arm,armv7m-systick.yaml54
-rw-r--r--Documentation/devicetree/bindings/timer/brcm,bcmbca-timer.yaml40
-rw-r--r--Documentation/devicetree/bindings/timer/cdns,ttc.yaml6
-rw-r--r--Documentation/devicetree/bindings/timer/fsl,imxgpt.yaml2
-rw-r--r--Documentation/devicetree/bindings/timer/hpe,gxp-timer.yaml47
-rw-r--r--Documentation/devicetree/bindings/timer/ingenic,sysost.yaml4
-rw-r--r--Documentation/devicetree/bindings/timer/ingenic,tcu.yaml13
-rw-r--r--Documentation/devicetree/bindings/timer/intel,ixp4xx-timer.yaml4
-rw-r--r--Documentation/devicetree/bindings/timer/mediatek,mtk-timer.txt10
-rw-r--r--Documentation/devicetree/bindings/timer/mrvl,mmp-timer.yaml2
-rw-r--r--Documentation/devicetree/bindings/timer/mstar,msc313e-timer.yaml46
-rw-r--r--Documentation/devicetree/bindings/timer/nuvoton,npcm7xx-timer.txt21
-rw-r--r--Documentation/devicetree/bindings/timer/nuvoton,npcm7xx-timer.yaml54
-rw-r--r--Documentation/devicetree/bindings/timer/nvidia,tegra-timer.yaml150
-rw-r--r--Documentation/devicetree/bindings/timer/nvidia,tegra186-timer.yaml109
-rw-r--r--Documentation/devicetree/bindings/timer/nvidia,tegra20-timer.txt24
-rw-r--r--Documentation/devicetree/bindings/timer/nvidia,tegra210-timer.txt36
-rw-r--r--Documentation/devicetree/bindings/timer/nvidia,tegra30-timer.txt28
-rw-r--r--Documentation/devicetree/bindings/timer/nxp,sysctr-timer.yaml4
-rw-r--r--Documentation/devicetree/bindings/timer/nxp,tpm-timer.yaml6
-rw-r--r--Documentation/devicetree/bindings/timer/qcom,msm-timer.txt47
-rw-r--r--Documentation/devicetree/bindings/timer/rda,8810pl-timer.txt20
-rw-r--r--Documentation/devicetree/bindings/timer/rda,8810pl-timer.yaml47
-rw-r--r--Documentation/devicetree/bindings/timer/renesas,16bit-timer.txt25
-rw-r--r--Documentation/devicetree/bindings/timer/renesas,8bit-timer.txt25
-rw-r--r--Documentation/devicetree/bindings/timer/renesas,cmt.yaml18
-rw-r--r--Documentation/devicetree/bindings/timer/renesas,ostm.yaml24
-rw-r--r--Documentation/devicetree/bindings/timer/renesas,rz-mtu3.yaml302
-rw-r--r--Documentation/devicetree/bindings/timer/renesas,tmu.yaml2
-rw-r--r--Documentation/devicetree/bindings/timer/riscv,timer.yaml52
-rw-r--r--Documentation/devicetree/bindings/timer/rockchip,rk-timer.yaml7
-rw-r--r--Documentation/devicetree/bindings/timer/samsung,exynos4210-mct.yaml98
-rw-r--r--Documentation/devicetree/bindings/timer/sifive,clint.yaml35
-rw-r--r--Documentation/devicetree/bindings/timer/st,nomadik-mtu.yaml58
-rw-r--r--Documentation/devicetree/bindings/timer/st,stm32-timer.yaml5
-rw-r--r--Documentation/devicetree/bindings/timer/ti,timer-dm.yaml159
-rw-r--r--Documentation/devicetree/bindings/timer/ti,timer.txt44
-rw-r--r--Documentation/devicetree/bindings/timer/xlnx,xps-timer.yaml92
-rw-r--r--Documentation/devicetree/bindings/timestamp/hardware-timestamps-common.yaml29
-rw-r--r--Documentation/devicetree/bindings/timestamp/hte-consumer.yaml39
-rw-r--r--Documentation/devicetree/bindings/timestamp/nvidia,tegra194-hte.yaml140
-rw-r--r--Documentation/devicetree/bindings/trivial-devices.yaml108
-rw-r--r--Documentation/devicetree/bindings/ufs/cdns,ufshc.txt32
-rw-r--r--Documentation/devicetree/bindings/ufs/cdns,ufshc.yaml73
-rw-r--r--Documentation/devicetree/bindings/ufs/hisilicon,ufs.yaml90
-rw-r--r--Documentation/devicetree/bindings/ufs/mediatek,ufs.yaml67
-rw-r--r--Documentation/devicetree/bindings/ufs/qcom,ufs.yaml253
-rw-r--r--Documentation/devicetree/bindings/ufs/renesas,ufs.yaml61
-rw-r--r--Documentation/devicetree/bindings/ufs/samsung,exynos-ufs.yaml26
-rw-r--r--Documentation/devicetree/bindings/ufs/snps,tc-dwc-g210.yaml51
-rw-r--r--Documentation/devicetree/bindings/ufs/sprd,ums9620-ufs.yaml79
-rw-r--r--Documentation/devicetree/bindings/ufs/tc-dwc-g210-pltfrm.txt26
-rw-r--r--Documentation/devicetree/bindings/ufs/ti,j721e-ufs.yaml7
-rw-r--r--Documentation/devicetree/bindings/ufs/ufs-common.yaml82
-rw-r--r--Documentation/devicetree/bindings/ufs/ufs-hisi.txt42
-rw-r--r--Documentation/devicetree/bindings/ufs/ufs-mediatek.txt45
-rw-r--r--Documentation/devicetree/bindings/ufs/ufs-qcom.txt63
-rw-r--r--Documentation/devicetree/bindings/ufs/ufshcd-pltfrm.txt89
-rw-r--r--Documentation/devicetree/bindings/usb/allwinner,sun4i-a10-musb.yaml13
-rw-r--r--Documentation/devicetree/bindings/usb/am33xx-usb.txt7
-rw-r--r--Documentation/devicetree/bindings/usb/amlogic,meson-g12a-usb-ctrl.yaml7
-rw-r--r--Documentation/devicetree/bindings/usb/analogix,anx7411.yaml83
-rw-r--r--Documentation/devicetree/bindings/usb/aspeed,ast2600-udc.yaml52
-rw-r--r--Documentation/devicetree/bindings/usb/aspeed,usb-vhub.yaml2
-rw-r--r--Documentation/devicetree/bindings/usb/atmel-usb.txt7
-rw-r--r--Documentation/devicetree/bindings/usb/brcm,bcm3384-usb.txt11
-rw-r--r--Documentation/devicetree/bindings/usb/brcm,bcm7445-ehci.yaml4
-rw-r--r--Documentation/devicetree/bindings/usb/brcm,bdc.txt29
-rw-r--r--Documentation/devicetree/bindings/usb/brcm,bdc.yaml50
-rw-r--r--Documentation/devicetree/bindings/usb/brcm,usb-pinmap.yaml2
-rw-r--r--Documentation/devicetree/bindings/usb/cdns,usb3.yaml2
-rw-r--r--Documentation/devicetree/bindings/usb/ci-hdrc-usb2.txt158
-rw-r--r--Documentation/devicetree/bindings/usb/ci-hdrc-usb2.yaml448
-rw-r--r--Documentation/devicetree/bindings/usb/cypress,cypd4226.yaml98
-rw-r--r--Documentation/devicetree/bindings/usb/da8xx-usb.txt5
-rw-r--r--Documentation/devicetree/bindings/usb/dwc2.yaml63
-rw-r--r--Documentation/devicetree/bindings/usb/dwc3-cavium.txt2
-rw-r--r--Documentation/devicetree/bindings/usb/dwc3-st.txt2
-rw-r--r--Documentation/devicetree/bindings/usb/dwc3-xilinx.txt56
-rw-r--r--Documentation/devicetree/bindings/usb/dwc3-xilinx.yaml135
-rw-r--r--Documentation/devicetree/bindings/usb/ehci-omap.txt31
-rw-r--r--Documentation/devicetree/bindings/usb/ehci-orion.txt22
-rw-r--r--Documentation/devicetree/bindings/usb/ehci-st.txt2
-rw-r--r--Documentation/devicetree/bindings/usb/exynos-usb.txt115
-rw-r--r--Documentation/devicetree/bindings/usb/faraday,fotg210.txt35
-rw-r--r--Documentation/devicetree/bindings/usb/faraday,fotg210.yaml78
-rw-r--r--Documentation/devicetree/bindings/usb/fcs,fsa4480.yaml72
-rw-r--r--Documentation/devicetree/bindings/usb/fcs,fusb302.txt34
-rw-r--r--Documentation/devicetree/bindings/usb/fcs,fusb302.yaml67
-rw-r--r--Documentation/devicetree/bindings/usb/fsl,imx8mp-dwc3.yaml37
-rw-r--r--Documentation/devicetree/bindings/usb/fsl,imx8mq-dwc3.yaml48
-rw-r--r--Documentation/devicetree/bindings/usb/fsl,usbmisc.yaml68
-rw-r--r--Documentation/devicetree/bindings/usb/generic-ehci.yaml22
-rw-r--r--Documentation/devicetree/bindings/usb/generic-ohci.yaml42
-rw-r--r--Documentation/devicetree/bindings/usb/generic-xhci.yaml4
-rw-r--r--Documentation/devicetree/bindings/usb/genesys,gl850g.yaml49
-rw-r--r--Documentation/devicetree/bindings/usb/gpio-sbu-mux.yaml110
-rw-r--r--Documentation/devicetree/bindings/usb/ingenic,musb.yaml4
-rw-r--r--Documentation/devicetree/bindings/usb/intel,keembay-dwc3.yaml3
-rw-r--r--Documentation/devicetree/bindings/usb/marvell,pxau2o-ehci.yaml2
-rw-r--r--Documentation/devicetree/bindings/usb/maxim,max33359.yaml8
-rw-r--r--Documentation/devicetree/bindings/usb/maxim,max3420-udc.yaml2
-rw-r--r--Documentation/devicetree/bindings/usb/mediatek,mt6360-tcpc.yaml13
-rw-r--r--Documentation/devicetree/bindings/usb/mediatek,mt6370-tcpc.yaml36
-rw-r--r--Documentation/devicetree/bindings/usb/mediatek,mtk-xhci.yaml28
-rw-r--r--Documentation/devicetree/bindings/usb/mediatek,mtu3.yaml25
-rw-r--r--Documentation/devicetree/bindings/usb/mediatek,musb.yaml8
-rw-r--r--Documentation/devicetree/bindings/usb/microchip,mpfs-musb.yaml59
-rw-r--r--Documentation/devicetree/bindings/usb/npcm7xx-usb.txt18
-rw-r--r--Documentation/devicetree/bindings/usb/nvidia,tegra-xudc.yaml36
-rw-r--r--Documentation/devicetree/bindings/usb/nvidia,tegra124-xusb.txt132
-rw-r--r--Documentation/devicetree/bindings/usb/nvidia,tegra124-xusb.yaml200
-rw-r--r--Documentation/devicetree/bindings/usb/nvidia,tegra186-xusb.yaml171
-rw-r--r--Documentation/devicetree/bindings/usb/nvidia,tegra194-xusb.yaml175
-rw-r--r--Documentation/devicetree/bindings/usb/nvidia,tegra210-xusb.yaml195
-rw-r--r--Documentation/devicetree/bindings/usb/nvidia,tegra234-xusb.yaml159
-rw-r--r--Documentation/devicetree/bindings/usb/nxp,isp1760.yaml2
-rw-r--r--Documentation/devicetree/bindings/usb/nxp,ptn5110.yaml72
-rw-r--r--Documentation/devicetree/bindings/usb/ohci-nxp.txt24
-rw-r--r--Documentation/devicetree/bindings/usb/ohci-omap3.txt15
-rw-r--r--Documentation/devicetree/bindings/usb/ohci-st.txt2
-rw-r--r--Documentation/devicetree/bindings/usb/pxa-usb.txt2
-rw-r--r--Documentation/devicetree/bindings/usb/qcom,dwc3.yaml402
-rw-r--r--Documentation/devicetree/bindings/usb/realtek,rts5411.yaml14
-rw-r--r--Documentation/devicetree/bindings/usb/renesas,rzn1-usbf.yaml68
-rw-r--r--Documentation/devicetree/bindings/usb/renesas,rzv2m-usb3drd.yaml129
-rw-r--r--Documentation/devicetree/bindings/usb/renesas,usb-xhci.yaml41
-rw-r--r--Documentation/devicetree/bindings/usb/renesas,usb3-peri.yaml63
-rw-r--r--Documentation/devicetree/bindings/usb/renesas,usbhs.yaml9
-rw-r--r--Documentation/devicetree/bindings/usb/richtek,rt1711h.yaml100
-rw-r--r--Documentation/devicetree/bindings/usb/richtek,rt1719.yaml85
-rw-r--r--Documentation/devicetree/bindings/usb/rockchip,dwc3.yaml12
-rw-r--r--Documentation/devicetree/bindings/usb/rockchip,rk3399-dwc3.yaml115
-rw-r--r--Documentation/devicetree/bindings/usb/samsung,exynos-dwc3.yaml129
-rw-r--r--Documentation/devicetree/bindings/usb/samsung,exynos-usb2.yaml107
-rw-r--r--Documentation/devicetree/bindings/usb/smsc,usb3503.yaml159
-rw-r--r--Documentation/devicetree/bindings/usb/snps,dwc3.yaml95
-rw-r--r--Documentation/devicetree/bindings/usb/spear-usb.txt35
-rw-r--r--Documentation/devicetree/bindings/usb/st,stusb160x.yaml26
-rw-r--r--Documentation/devicetree/bindings/usb/st,typec-stm32g0.yaml91
-rw-r--r--Documentation/devicetree/bindings/usb/ti,am62-usb.yaml103
-rw-r--r--Documentation/devicetree/bindings/usb/ti,hd3ss3220.yaml3
-rw-r--r--Documentation/devicetree/bindings/usb/ti,j721e-usb.yaml14
-rw-r--r--Documentation/devicetree/bindings/usb/ti,keystone-dwc3.yaml12
-rw-r--r--Documentation/devicetree/bindings/usb/ti,tps6598x.yaml17
-rw-r--r--Documentation/devicetree/bindings/usb/ti,usb8041.yaml67
-rw-r--r--Documentation/devicetree/bindings/usb/typec-tcpci.txt49
-rw-r--r--Documentation/devicetree/bindings/usb/udc-xilinx.txt18
-rw-r--r--Documentation/devicetree/bindings/usb/usb-device.yaml3
-rw-r--r--Documentation/devicetree/bindings/usb/usb-drd.yaml3
-rw-r--r--Documentation/devicetree/bindings/usb/usb-hcd.yaml4
-rw-r--r--Documentation/devicetree/bindings/usb/usb-nop-xceiv.yaml10
-rw-r--r--Documentation/devicetree/bindings/usb/usb-xhci.yaml4
-rw-r--r--Documentation/devicetree/bindings/usb/usb.yaml2
-rw-r--r--Documentation/devicetree/bindings/usb/usb251xb.txt89
-rw-r--r--Documentation/devicetree/bindings/usb/usb251xb.yaml271
-rw-r--r--Documentation/devicetree/bindings/usb/usb3503.txt39
-rw-r--r--Documentation/devicetree/bindings/usb/usbmisc-imx.txt18
-rw-r--r--Documentation/devicetree/bindings/usb/vialab,vl817.yaml71
-rw-r--r--Documentation/devicetree/bindings/usb/willsemi,wusb3801.yaml76
-rw-r--r--Documentation/devicetree/bindings/usb/xlnx,usb2.yaml47
-rw-r--r--Documentation/devicetree/bindings/vendor-prefixes.yaml272
-rw-r--r--Documentation/devicetree/bindings/virtio/iommu.txt66
-rw-r--r--Documentation/devicetree/bindings/virtio/mmio.yaml6
-rw-r--r--Documentation/devicetree/bindings/virtio/pci-iommu.yaml101
-rw-r--r--Documentation/devicetree/bindings/virtio/virtio-device.yaml2
-rw-r--r--Documentation/devicetree/bindings/w1/maxim,ds2482.yaml44
-rw-r--r--Documentation/devicetree/bindings/w1/w1-gpio.txt27
-rw-r--r--Documentation/devicetree/bindings/w1/w1-gpio.yaml43
-rw-r--r--Documentation/devicetree/bindings/watchdog/allwinner,sun4i-a10-wdt.yaml31
-rw-r--r--Documentation/devicetree/bindings/watchdog/alphascale,asm9260-wdt.yaml70
-rw-r--r--Documentation/devicetree/bindings/watchdog/alphascale-asm9260.txt35
-rw-r--r--Documentation/devicetree/bindings/watchdog/amlogic,meson-gxbb-wdt.yaml12
-rw-r--r--Documentation/devicetree/bindings/watchdog/amlogic,meson6-wdt.yaml50
-rw-r--r--Documentation/devicetree/bindings/watchdog/apple,wdt.yaml53
-rw-r--r--Documentation/devicetree/bindings/watchdog/arm,sbsa-gwdt.yaml1
-rw-r--r--Documentation/devicetree/bindings/watchdog/arm,sp805.yaml1
-rw-r--r--Documentation/devicetree/bindings/watchdog/arm,twd-wdt.yaml6
-rw-r--r--Documentation/devicetree/bindings/watchdog/arm-smc-wdt.yaml9
-rw-r--r--Documentation/devicetree/bindings/watchdog/atmel,at91sam9-wdt.yaml127
-rw-r--r--Documentation/devicetree/bindings/watchdog/atmel,sama5d4-wdt.yaml19
-rw-r--r--Documentation/devicetree/bindings/watchdog/atmel-wdt.txt51
-rw-r--r--Documentation/devicetree/bindings/watchdog/brcm,bcm7038-wdt.txt19
-rw-r--r--Documentation/devicetree/bindings/watchdog/brcm,bcm7038-wdt.yaml43
-rw-r--r--Documentation/devicetree/bindings/watchdog/da9062-wdt.txt6
-rw-r--r--Documentation/devicetree/bindings/watchdog/faraday,ftwdt010.txt22
-rw-r--r--Documentation/devicetree/bindings/watchdog/faraday,ftwdt010.yaml67
-rw-r--r--Documentation/devicetree/bindings/watchdog/fsl,scu-wdt.yaml34
-rw-r--r--Documentation/devicetree/bindings/watchdog/fsl-imx-wdt.yaml37
-rw-r--r--Documentation/devicetree/bindings/watchdog/fsl-imx7ulp-wdt.yaml22
-rw-r--r--Documentation/devicetree/bindings/watchdog/gpio-wdt.txt28
-rw-r--r--Documentation/devicetree/bindings/watchdog/linux,wdt-gpio.yaml68
-rw-r--r--Documentation/devicetree/bindings/watchdog/maxim,max63xx.yaml3
-rw-r--r--Documentation/devicetree/bindings/watchdog/mediatek,mt7621-wdt.yaml40
-rw-r--r--Documentation/devicetree/bindings/watchdog/mediatek,mtk-wdt.yaml88
-rw-r--r--Documentation/devicetree/bindings/watchdog/meson-wdt.txt21
-rw-r--r--Documentation/devicetree/bindings/watchdog/mstar,msc313e-wdt.yaml2
-rw-r--r--Documentation/devicetree/bindings/watchdog/mt7621-wdt.txt12
-rw-r--r--Documentation/devicetree/bindings/watchdog/mtk-wdt.txt40
-rw-r--r--Documentation/devicetree/bindings/watchdog/nuvoton,npcm-wdt.txt3
-rw-r--r--Documentation/devicetree/bindings/watchdog/of-xilinx-wdt.txt26
-rw-r--r--Documentation/devicetree/bindings/watchdog/qcom,pm8916-wdt.txt28
-rw-r--r--Documentation/devicetree/bindings/watchdog/qcom,pm8916-wdt.yaml51
-rw-r--r--Documentation/devicetree/bindings/watchdog/qcom-wdt.yaml134
-rw-r--r--Documentation/devicetree/bindings/watchdog/ralink,rt2880-wdt.yaml46
-rw-r--r--Documentation/devicetree/bindings/watchdog/realtek,otto-wdt.yaml90
-rw-r--r--Documentation/devicetree/bindings/watchdog/renesas,wdt.yaml130
-rw-r--r--Documentation/devicetree/bindings/watchdog/rt2880-wdt.txt18
-rw-r--r--Documentation/devicetree/bindings/watchdog/samsung-wdt.yaml51
-rw-r--r--Documentation/devicetree/bindings/watchdog/snps,dw-wdt.yaml36
-rw-r--r--Documentation/devicetree/bindings/watchdog/socionext,uniphier-wdt.yaml14
-rw-r--r--Documentation/devicetree/bindings/watchdog/st,stm32-iwdg.yaml18
-rw-r--r--Documentation/devicetree/bindings/watchdog/starfive,jh7100-wdt.yaml71
-rw-r--r--Documentation/devicetree/bindings/watchdog/sunplus,sp7021-wdt.yaml47
-rw-r--r--Documentation/devicetree/bindings/watchdog/ti,rti-wdt.yaml8
-rw-r--r--Documentation/devicetree/bindings/watchdog/toshiba,visconti-wdt.yaml16
-rw-r--r--Documentation/devicetree/bindings/watchdog/watchdog.yaml9
-rw-r--r--Documentation/devicetree/bindings/watchdog/xlnx,xps-timebase-wdt.yaml68
-rw-r--r--Documentation/devicetree/bindings/writing-bindings.rst29
-rw-r--r--Documentation/devicetree/bindings/writing-schema.rst63
-rw-r--r--Documentation/devicetree/of_unittest.rst27
-rw-r--r--Documentation/devicetree/overlay-notes.rst30
-rw-r--r--Documentation/doc-guide/contributing.rst5
-rw-r--r--Documentation/doc-guide/kernel-doc.rst7
-rw-r--r--Documentation/doc-guide/sphinx.rst120
-rw-r--r--Documentation/dontdiff 2
-rw-r--r--Documentation/driver-api/aperture.rst13
-rw-r--r--Documentation/driver-api/auxiliary_bus.rst236
-rw-r--r--Documentation/driver-api/basics.rst3
-rw-r--r--Documentation/driver-api/clk.rst5
-rw-r--r--Documentation/driver-api/cxl/memory-devices.rst327
-rw-r--r--Documentation/driver-api/device-io.rst11
-rw-r--r--Documentation/driver-api/dma-buf.rst49
-rw-r--r--Documentation/driver-api/dmaengine/client.rst2
-rw-r--r--Documentation/driver-api/dmaengine/dmatest.rst19
-rw-r--r--Documentation/driver-api/dmaengine/provider.rst21
-rw-r--r--Documentation/driver-api/driver-model/bus.rst4
-rw-r--r--Documentation/driver-api/driver-model/devres.rst39
-rw-r--r--Documentation/driver-api/eisa.rst2
-rw-r--r--Documentation/driver-api/firewire.rst4
-rw-r--r--Documentation/driver-api/firmware/core.rst1
-rw-r--r--Documentation/driver-api/firmware/firmware-usage-guidelines.rst44
-rw-r--r--Documentation/driver-api/firmware/fw_search_path.rst9
-rw-r--r--Documentation/driver-api/firmware/fw_upload.rst127
-rw-r--r--Documentation/driver-api/firmware/index.rst1
-rw-r--r--Documentation/driver-api/firmware/other_interfaces.rst6
-rw-r--r--Documentation/driver-api/fpga/fpga-bridge.rst6
-rw-r--r--Documentation/driver-api/fpga/fpga-mgr.rst65
-rw-r--r--Documentation/driver-api/fpga/fpga-region.rst12
-rw-r--r--Documentation/driver-api/generic-counter.rst371
-rw-r--r--Documentation/driver-api/gpio/board.rst23
-rw-r--r--Documentation/driver-api/gpio/consumer.rst8
-rw-r--r--Documentation/driver-api/gpio/driver.rst193
-rw-r--r--Documentation/driver-api/gpio/intro.rst6
-rw-r--r--Documentation/driver-api/gpio/legacy.rst45
-rw-r--r--Documentation/driver-api/gpio/using-gpio.rst2
-rw-r--r--Documentation/driver-api/hsi.rst4
-rw-r--r--Documentation/driver-api/hte/hte.rst79
-rw-r--r--Documentation/driver-api/hte/index.rst22
-rw-r--r--Documentation/driver-api/hte/tegra-hte.rst47
-rw-r--r--Documentation/driver-api/index.rst13
-rw-r--r--Documentation/driver-api/io-mapping.rst4
-rw-r--r--Documentation/driver-api/ipmi.rst64
-rw-r--r--Documentation/driver-api/isa.rst2
-rw-r--r--Documentation/driver-api/libata.rst11
-rw-r--r--Documentation/driver-api/md/md-cluster.rst2
-rw-r--r--Documentation/driver-api/md/raid5-cache.rst2
-rw-r--r--Documentation/driver-api/media/cec-core.rst13
-rw-r--r--Documentation/driver-api/media/drivers/ccs/ccs.rst22
-rw-r--r--Documentation/driver-api/media/drivers/cpia2_devel.rst56
-rw-r--r--Documentation/driver-api/media/drivers/davinci-vpbe-devel.rst39
-rw-r--r--Documentation/driver-api/media/drivers/fimc-devel.rst14
-rw-r--r--Documentation/driver-api/media/drivers/index.rst3
-rw-r--r--Documentation/driver-api/media/drivers/pxa_camera.rst2
-rw-r--r--Documentation/driver-api/media/drivers/rkisp1.rst43
-rw-r--r--Documentation/driver-api/media/drivers/vidtv.rst2
-rw-r--r--Documentation/driver-api/media/dtv-demux.rst2
-rw-r--r--Documentation/driver-api/media/maintainer-entry-profile.rst2
-rw-r--r--Documentation/driver-api/media/mc-core.rst43
-rw-r--r--Documentation/driver-api/media/v4l2-event.rst2
-rw-r--r--Documentation/driver-api/media/v4l2-subdev.rst99
-rw-r--r--Documentation/driver-api/mei/nfc.rst2
-rw-r--r--Documentation/driver-api/miscellaneous.rst5
-rw-r--r--Documentation/driver-api/mmc/mmc-tools.rst4
-rw-r--r--Documentation/driver-api/mtd/index.rst2
-rw-r--r--Documentation/driver-api/mtd/intel-spi.rst90
-rw-r--r--Documentation/driver-api/mtd/spi-intel.rst90
-rw-r--r--Documentation/driver-api/mtd/spi-nor.rst3
-rw-r--r--Documentation/driver-api/nfc/nfc-hci.rst2
-rw-r--r--Documentation/driver-api/nvdimm/nvdimm.rst406
-rw-r--r--Documentation/driver-api/nvdimm/security.rst2
-rw-r--r--Documentation/driver-api/nvmem.rst43
-rw-r--r--Documentation/driver-api/pci/pci.rst2
-rw-r--r--Documentation/driver-api/phy/phy.rst49
-rw-r--r--Documentation/driver-api/pin-control.rst510
-rw-r--r--Documentation/driver-api/pldmfw/index.rst2
-rw-r--r--Documentation/driver-api/pwm.rst20
-rw-r--r--Documentation/driver-api/serial/driver.rst485
-rw-r--r--Documentation/driver-api/serial/index.rst3
-rw-r--r--Documentation/driver-api/serial/n_gsm.rst100
-rw-r--r--Documentation/driver-api/serial/serial-rs485.rst58
-rw-r--r--Documentation/driver-api/serial/tty.rst328
-rw-r--r--Documentation/driver-api/spi.rst4
-rw-r--r--Documentation/driver-api/surface_aggregator/client.rst18
-rw-r--r--Documentation/driver-api/surface_aggregator/ssh.rst38
-rw-r--r--Documentation/driver-api/thermal/index.rst2
-rw-r--r--Documentation/driver-api/thermal/intel_dptf.rst317
-rw-r--r--Documentation/driver-api/thermal/sysfs-api.rst265
-rw-r--r--Documentation/driver-api/tty/index.rst73
-rw-r--r--Documentation/driver-api/tty/moxa-smartio.rst (renamed from Documentation/driver-api/serial/moxa-smartio.rst)0
-rw-r--r--Documentation/driver-api/tty/n_gsm.rst192
-rw-r--r--Documentation/driver-api/tty/n_tty.rst22
-rw-r--r--Documentation/driver-api/tty/tty_buffer.rst46
-rw-r--r--Documentation/driver-api/tty/tty_driver.rst128
-rw-r--r--Documentation/driver-api/tty/tty_internals.rst31
-rw-r--r--Documentation/driver-api/tty/tty_ldisc.rst85
-rw-r--r--Documentation/driver-api/tty/tty_port.rst70
-rw-r--r--Documentation/driver-api/tty/tty_struct.rst81
-rw-r--r--Documentation/driver-api/usb/dwc3.rst2
-rw-r--r--Documentation/driver-api/usb/usb3-debug-port.rst2
-rw-r--r--Documentation/driver-api/usb/writing_usb_driver.rst45
-rw-r--r--Documentation/driver-api/vfio-mediated-device.rst169
-rw-r--r--Documentation/driver-api/vfio-pci-device-specific-driver-acceptance.rst35
-rw-r--r--Documentation/driver-api/vfio.rst84
-rw-r--r--Documentation/driver-api/virtio/index.rst11
-rw-r--r--Documentation/driver-api/virtio/virtio.rst145
-rw-r--r--Documentation/driver-api/virtio/writing_virtio_drivers.rst197
-rw-r--r--Documentation/driver-api/vme.rst4
-rw-r--r--Documentation/fault-injection/fault-injection.rst104
-rw-r--r--Documentation/fault-injection/notifier-error-inject.rst4
-rw-r--r--Documentation/fb/modedb.rst10
-rw-r--r--Documentation/fb/udlfb.rst23
-rw-r--r--Documentation/features/core/cBPF-JIT/arch-support.txt3
-rw-r--r--Documentation/features/core/eBPF-JIT/arch-support.txt3
-rw-r--r--Documentation/features/core/generic-idle-thread/arch-support.txt3
-rw-r--r--Documentation/features/core/jump-labels/arch-support.txt5
-rw-r--r--Documentation/features/core/thread-info-in-task/arch-support.txt7
-rw-r--r--Documentation/features/core/tracehook/arch-support.txt3
-rw-r--r--Documentation/features/debug/KASAN/arch-support.txt5
-rw-r--r--Documentation/features/debug/debug-vm-pgtable/arch-support.txt7
-rw-r--r--Documentation/features/debug/gcov-profile-all/arch-support.txt7
-rw-r--r--Documentation/features/debug/kcov/arch-support.txt5
-rw-r--r--Documentation/features/debug/kgdb/arch-support.txt3
-rw-r--r--Documentation/features/debug/kmemleak/arch-support.txt3
-rw-r--r--Documentation/features/debug/kprobes-on-ftrace/arch-support.txt3
-rw-r--r--Documentation/features/debug/kprobes/arch-support.txt3
-rw-r--r--Documentation/features/debug/kretprobes/arch-support.txt3
-rw-r--r--Documentation/features/debug/optprobes/arch-support.txt3
-rw-r--r--Documentation/features/debug/stackprotector/arch-support.txt3
-rw-r--r--Documentation/features/debug/uprobes/arch-support.txt3
-rw-r--r--Documentation/features/debug/user-ret-profiler/arch-support.txt3
-rw-r--r--Documentation/features/io/dma-contiguous/arch-support.txt3
-rw-r--r--Documentation/features/locking/cmpxchg-local/arch-support.txt3
-rw-r--r--Documentation/features/locking/lockdep/arch-support.txt3
-rw-r--r--Documentation/features/locking/queued-rwlocks/arch-support.txt5
-rw-r--r--Documentation/features/locking/queued-spinlocks/arch-support.txt5
-rw-r--r--Documentation/features/perf/kprobes-event/arch-support.txt5
-rw-r--r--Documentation/features/perf/perf-regs/arch-support.txt3
-rw-r--r--Documentation/features/perf/perf-stackdump/arch-support.txt3
-rw-r--r--Documentation/features/sched/membarrier-sync-core/arch-support.txt7
-rw-r--r--Documentation/features/sched/numa-balancing/arch-support.txt3
-rwxr-xr-xDocumentation/features/scripts/features-refresh.sh2
-rw-r--r--Documentation/features/seccomp/seccomp-filter/arch-support.txt5
-rw-r--r--Documentation/features/time/arch-tick-broadcast/arch-support.txt3
-rw-r--r--Documentation/features/time/clockevents/arch-support.txt3
-rw-r--r--Documentation/features/time/context-tracking/arch-support.txt11
-rw-r--r--Documentation/features/time/irq-time-acct/arch-support.txt3
-rw-r--r--Documentation/features/time/virt-cpuacct/arch-support.txt5
-rw-r--r--Documentation/features/vm/ELF-ASLR/arch-support.txt3
-rw-r--r--Documentation/features/vm/PG_uncached/arch-support.txt3
-rw-r--r--Documentation/features/vm/THP/arch-support.txt3
-rw-r--r--Documentation/features/vm/TLB/arch-support.txt3
-rw-r--r--Documentation/features/vm/huge-vmap/arch-support.txt5
-rw-r--r--Documentation/features/vm/ioremap_prot/arch-support.txt5
-rw-r--r--Documentation/features/vm/pte_special/arch-support.txt5
-rw-r--r--Documentation/filesystems/9p.rst52
-rw-r--r--Documentation/filesystems/autofs.rst2
-rw-r--r--Documentation/filesystems/btrfs.rst16
-rw-r--r--Documentation/filesystems/caching/backend-api.rst850
-rw-r--r--Documentation/filesystems/caching/cachefiles.rst184
-rw-r--r--Documentation/filesystems/caching/fscache.rst525
-rw-r--r--Documentation/filesystems/caching/index.rst4
-rw-r--r--Documentation/filesystems/caching/netfs-api.rst1136
-rw-r--r--Documentation/filesystems/caching/object.rst313
-rw-r--r--Documentation/filesystems/caching/operations.rst210
-rw-r--r--Documentation/filesystems/ceph.rst26
-rw-r--r--Documentation/filesystems/cifs/ksmbd.rst56
-rw-r--r--Documentation/filesystems/configfs.rst48
-rw-r--r--Documentation/filesystems/dax.rst24
-rw-r--r--Documentation/filesystems/debugfs.rst8
-rw-r--r--Documentation/filesystems/erofs.rst102
-rw-r--r--Documentation/filesystems/ext2.rst2
-rw-r--r--Documentation/filesystems/ext4/attributes.rst68
-rw-r--r--Documentation/filesystems/ext4/bigalloc.rst2
-rw-r--r--Documentation/filesystems/ext4/bitmaps.rst6
-rw-r--r--Documentation/filesystems/ext4/blockgroup.rst36
-rw-r--r--Documentation/filesystems/ext4/blockmap.rst2
-rw-r--r--Documentation/filesystems/ext4/blocks.rst2
-rw-r--r--Documentation/filesystems/ext4/checksums.rst26
-rw-r--r--Documentation/filesystems/ext4/directory.rst166
-rw-r--r--Documentation/filesystems/ext4/eainode.rst10
-rw-r--r--Documentation/filesystems/ext4/group_descr.rst126
-rw-r--r--Documentation/filesystems/ext4/ifork.rst60
-rw-r--r--Documentation/filesystems/ext4/inlinedata.rst8
-rw-r--r--Documentation/filesystems/ext4/inodes.rst306
-rw-r--r--Documentation/filesystems/ext4/journal.rst214
-rw-r--r--Documentation/filesystems/ext4/mmp.rst36
-rw-r--r--Documentation/filesystems/ext4/orphan.rst44
-rw-r--r--Documentation/filesystems/ext4/overview.rst2
-rw-r--r--Documentation/filesystems/ext4/special_inodes.rst8
-rw-r--r--Documentation/filesystems/ext4/super.rst556
-rw-r--r--Documentation/filesystems/f2fs.rst130
-rw-r--r--Documentation/filesystems/fscrypt.rst143
-rw-r--r--Documentation/filesystems/fsverity.rst180
-rw-r--r--Documentation/filesystems/fuse.rst29
-rw-r--r--Documentation/filesystems/idmappings.rst257
-rw-r--r--Documentation/filesystems/index.rst2
-rw-r--r--Documentation/filesystems/locking.rst163
-rw-r--r--Documentation/filesystems/locks.rst17
-rw-r--r--Documentation/filesystems/mount_api.rst13
-rw-r--r--Documentation/filesystems/netfs_library.rst266
-rw-r--r--Documentation/filesystems/nfs/client-identifier.rst216
-rw-r--r--Documentation/filesystems/nfs/index.rst3
-rw-r--r--Documentation/filesystems/nfs/reexport.rst113
-rw-r--r--Documentation/filesystems/ntfs3.rst30
-rw-r--r--Documentation/filesystems/overlayfs.rst2
-rw-r--r--Documentation/filesystems/porting.rst41
-rw-r--r--Documentation/filesystems/proc.rst370
-rw-r--r--Documentation/filesystems/qnx6.rst2
-rw-r--r--Documentation/filesystems/spufs/spufs.rst2
-rw-r--r--Documentation/filesystems/sysfs.rst47
-rw-r--r--Documentation/filesystems/tmpfs.rst66
-rw-r--r--Documentation/filesystems/ubifs.rst2
-rw-r--r--Documentation/filesystems/vfs.rst368
-rw-r--r--Documentation/filesystems/xfs-delayed-logging-design.rst371
-rw-r--r--Documentation/filesystems/xfs-online-fsck-design.rst5315
-rw-r--r--Documentation/filesystems/xfs-self-describing-metadata.rst1
-rw-r--r--Documentation/filesystems/zonefs.rst52
-rw-r--r--Documentation/firmware-guide/acpi/DSD-properties-rules.rst11
-rw-r--r--Documentation/firmware-guide/acpi/acpi-lid.rst2
-rw-r--r--Documentation/firmware-guide/acpi/apei/einj.rst21
-rw-r--r--Documentation/firmware-guide/acpi/chromeos-acpi-device.rst363
-rw-r--r--Documentation/firmware-guide/acpi/dsd/data-node-references.rst28
-rw-r--r--Documentation/firmware-guide/acpi/dsd/graph.rst40
-rw-r--r--Documentation/firmware-guide/acpi/dsd/leds.rst40
-rw-r--r--Documentation/firmware-guide/acpi/dsd/phy.rst28
-rw-r--r--Documentation/firmware-guide/acpi/enumeration.rst188
-rw-r--r--Documentation/firmware-guide/acpi/gpio-properties.rst61
-rw-r--r--Documentation/firmware-guide/acpi/index.rst2
-rw-r--r--Documentation/firmware-guide/acpi/namespace.rst2
-rw-r--r--Documentation/firmware-guide/acpi/non-d0-probe.rst78
-rw-r--r--Documentation/firmware-guide/acpi/osi.rst27
-rw-r--r--Documentation/fpga/dfl.rst124
-rw-r--r--Documentation/gpu/amdgpu-dc.rst74
-rw-r--r--Documentation/gpu/amdgpu.rst324
-rw-r--r--Documentation/gpu/amdgpu/amdgpu-glossary.rst123
-rw-r--r--Documentation/gpu/amdgpu/apu-asic-info-table.csv10
-rw-r--r--Documentation/gpu/amdgpu/dgpu-asic-info-table.csv26
-rw-r--r--Documentation/gpu/amdgpu/display/config_example.svg414
-rw-r--r--Documentation/gpu/amdgpu/display/dc-debug.rst77
-rw-r--r--Documentation/gpu/amdgpu/display/dc-glossary.rst237
-rw-r--r--Documentation/gpu/amdgpu/display/dc_pipeline_overview.svg1125
-rw-r--r--Documentation/gpu/amdgpu/display/dcn-overview.rst230
-rw-r--r--Documentation/gpu/amdgpu/display/dcn2_cm_drm_current.svg1370
-rw-r--r--Documentation/gpu/amdgpu/display/dcn3_cm_drm_current.svg1529
-rw-r--r--Documentation/gpu/amdgpu/display/display-manager.rst183
-rw-r--r--Documentation/gpu/amdgpu/display/global_sync_vblank.svg485
-rw-r--r--Documentation/gpu/amdgpu/display/index.rst32
-rw-r--r--Documentation/gpu/amdgpu/display/mpo-cursor.svg435
-rw-r--r--Documentation/gpu/amdgpu/display/mpo-overview.rst242
-rw-r--r--Documentation/gpu/amdgpu/display/multi-display-hdcp-mpo-less-pipe-ex.svg220
-rw-r--r--Documentation/gpu/amdgpu/display/multi-display-hdcp-mpo.svg171
-rw-r--r--Documentation/gpu/amdgpu/display/pipeline_4k_no_split.svg958
-rw-r--r--Documentation/gpu/amdgpu/display/pipeline_4k_split.svg1062
-rw-r--r--Documentation/gpu/amdgpu/display/single-display-mpo-multi-video.svg339
-rw-r--r--Documentation/gpu/amdgpu/display/single-display-mpo.svg266
-rw-r--r--Documentation/gpu/amdgpu/driver-core.rst182
-rw-r--r--Documentation/gpu/amdgpu/driver-misc.rst129
-rw-r--r--Documentation/gpu/amdgpu/index.rst17
-rw-r--r--Documentation/gpu/amdgpu/module-parameters.rst7
-rw-r--r--Documentation/gpu/amdgpu/ras.rst62
-rw-r--r--Documentation/gpu/amdgpu/thermal.rst122
-rw-r--r--Documentation/gpu/amdgpu/xgmi.rst5
-rw-r--r--Documentation/gpu/drivers.rst3
-rw-r--r--Documentation/gpu/drm-internals.rst38
-rw-r--r--Documentation/gpu/drm-kms-helpers.rst88
-rw-r--r--Documentation/gpu/drm-kms.rst16
-rw-r--r--Documentation/gpu/drm-mm.rst111
-rw-r--r--Documentation/gpu/drm-uapi.rst16
-rw-r--r--Documentation/gpu/drm-usage-stats.rst132
-rw-r--r--Documentation/gpu/i915.rst74
-rw-r--r--Documentation/gpu/index.rst7
-rw-r--r--Documentation/gpu/introduction.rst60
-rw-r--r--Documentation/gpu/rfc/i915_parallel_execbuf.h122
-rw-r--r--Documentation/gpu/rfc/i915_scheduler.rst4
-rw-r--r--Documentation/gpu/rfc/i915_small_bar.h189
-rw-r--r--Documentation/gpu/rfc/i915_small_bar.rst47
-rw-r--r--Documentation/gpu/rfc/i915_vm_bind.h291
-rw-r--r--Documentation/gpu/rfc/i915_vm_bind.rst245
-rw-r--r--Documentation/gpu/rfc/index.rst8
-rw-r--r--Documentation/gpu/todo.rst266
-rw-r--r--Documentation/gpu/vc4.rst19
-rw-r--r--Documentation/gpu/vgaarbiter.rst2
-rw-r--r--Documentation/gpu/vkms.rst15
-rw-r--r--Documentation/hid/hid-alps.rst2
-rw-r--r--Documentation/hid/hid-bpf.rst522
-rw-r--r--Documentation/hid/hiddev.rst2
-rw-r--r--Documentation/hid/hidraw.rst2
-rw-r--r--Documentation/hid/index.rst1
-rw-r--r--Documentation/hid/intel-ish-hid.rst6
-rw-r--r--Documentation/hwmon/acbel-fsg032.rst80
-rw-r--r--Documentation/hwmon/aht10.rst2
-rw-r--r--Documentation/hwmon/aquacomputer_d5next.rst89
-rw-r--r--Documentation/hwmon/aspeed-pwm-tacho.rst2
-rw-r--r--Documentation/hwmon/asus_ec_sensors.rst66
-rw-r--r--Documentation/hwmon/asus_wmi_sensors.rst78
-rw-r--r--Documentation/hwmon/corsair-psu.rst11
-rw-r--r--Documentation/hwmon/dell-smm-hwmon.rst195
-rw-r--r--Documentation/hwmon/emc2305.rst37
-rw-r--r--Documentation/hwmon/ftsteutates.rst15
-rw-r--r--Documentation/hwmon/gsc-hwmon.rst6
-rw-r--r--Documentation/hwmon/gxp-fan-ctrl.rst28
-rw-r--r--Documentation/hwmon/hwmon-kernel-api.rst84
-rw-r--r--Documentation/hwmon/ina238.rst56
-rw-r--r--Documentation/hwmon/index.rst31
-rw-r--r--Documentation/hwmon/ir38064.rst28
-rw-r--r--Documentation/hwmon/it87.rst47
-rw-r--r--Documentation/hwmon/lan966x.rst40
-rw-r--r--Documentation/hwmon/lm25066.rst2
-rw-r--r--Documentation/hwmon/lm70.rst7
-rw-r--r--Documentation/hwmon/lm90.rst243
-rw-r--r--Documentation/hwmon/lt7182s.rst92
-rw-r--r--Documentation/hwmon/ltc2978.rst2
-rw-r--r--Documentation/hwmon/max16601.rst19
-rw-r--r--Documentation/hwmon/max31760.rst77
-rw-r--r--Documentation/hwmon/max31790.rst1
-rw-r--r--Documentation/hwmon/max6620.rst46
-rw-r--r--Documentation/hwmon/max6639.rst2
-rw-r--r--Documentation/hwmon/max6697.rst2
-rw-r--r--Documentation/hwmon/mc34vr500.rst32
-rw-r--r--Documentation/hwmon/menf21bmc.rst2
-rw-r--r--Documentation/hwmon/mp5023.rst84
-rw-r--r--Documentation/hwmon/nzxt-smart2.rst62
-rw-r--r--Documentation/hwmon/oxp-sensors.rst49
-rw-r--r--Documentation/hwmon/peci-cputemp.rst90
-rw-r--r--Documentation/hwmon/peci-dimmtemp.rst57
-rw-r--r--Documentation/hwmon/pli1209bc.rst75
-rw-r--r--Documentation/hwmon/pmbus-core.rst11
-rw-r--r--Documentation/hwmon/pwm-fan.rst12
-rw-r--r--Documentation/hwmon/sch5627.rst4
-rw-r--r--Documentation/hwmon/sfctemp.rst33
-rw-r--r--Documentation/hwmon/sht4x.rst2
-rw-r--r--Documentation/hwmon/smm665.rst2
-rw-r--r--Documentation/hwmon/smpro-hwmon.rst102
-rw-r--r--Documentation/hwmon/stpddc60.rst2
-rw-r--r--Documentation/hwmon/submitting-patches.rst3
-rw-r--r--Documentation/hwmon/sy7636a-hwmon.rst26
-rw-r--r--Documentation/hwmon/sysfs-interface.rst600
-rw-r--r--Documentation/hwmon/tmp401.rst15
-rw-r--r--Documentation/hwmon/tmp421.rst10
-rw-r--r--Documentation/hwmon/tmp464.rst73
-rw-r--r--Documentation/hwmon/tps546d24.rst35
-rw-r--r--Documentation/hwmon/vexpress.rst2
-rw-r--r--Documentation/hwmon/via686a.rst2
-rw-r--r--Documentation/hwmon/xdpe12284.rst12
-rw-r--r--Documentation/hwmon/xdpe152c4.rst118
-rw-r--r--Documentation/i2c/busses/i2c-i801.rst2
-rw-r--r--Documentation/i2c/busses/i2c-piix4.rst13
-rw-r--r--Documentation/i2c/dev-interface.rst2
-rw-r--r--Documentation/i2c/gpio-fault-injection.rst2
-rw-r--r--Documentation/i2c/i2c-protocol.rst11
-rw-r--r--Documentation/i2c/i2c-sysfs.rst24
-rw-r--r--Documentation/i2c/i2c-topology.rst214
-rw-r--r--Documentation/i2c/instantiating-devices.rst16
-rw-r--r--Documentation/i2c/slave-interface.rst15
-rw-r--r--Documentation/i2c/smbus-protocol.rst22
-rw-r--r--Documentation/i2c/summary.rst8
-rw-r--r--Documentation/i2c/writing-clients.rst19
-rw-r--r--Documentation/ide/ChangeLog.ide-cd.1994-2004268
-rw-r--r--Documentation/ide/ChangeLog.ide-floppy.1996-200263
-rw-r--r--Documentation/ide/ChangeLog.ide-tape.1995-2002257
-rw-r--r--Documentation/ide/changelogs.rst17
-rw-r--r--Documentation/ide/ide-tape.rst68
-rw-r--r--Documentation/ide/ide.rst265
-rw-r--r--Documentation/ide/index.rst21
-rw-r--r--Documentation/ide/warm-plug-howto.rst18
-rw-r--r--Documentation/iio/bno055.rst51
-rw-r--r--Documentation/iio/index.rst2
-rw-r--r--Documentation/images/COPYING-logo21
-rw-r--r--Documentation/images/logo.gif (renamed from Documentation/logo.gif)bin16335 -> 16335 bytes
-rw-r--r--Documentation/images/logo.svg2040
-rw-r--r--Documentation/index.rst166
-rw-r--r--Documentation/input/devices/atarikbd.rst4
-rw-r--r--Documentation/input/devices/ntrig.rst2
-rw-r--r--Documentation/input/event-codes.rst12
-rw-r--r--Documentation/input/gamepad.rst6
-rw-r--r--Documentation/input/index.rst6
-rw-r--r--Documentation/input/input-programming.rst6
-rw-r--r--Documentation/input/joydev/joystick.rst1
-rw-r--r--Documentation/isdn/interface_capi.rst2
-rw-r--r--Documentation/isdn/m_isdn.rst2
-rw-r--r--Documentation/kbuild/Kconfig.recursion-issue-022
-rw-r--r--Documentation/kbuild/gcc-plugins.rst47
-rw-r--r--Documentation/kbuild/kbuild.rst37
-rw-r--r--Documentation/kbuild/kconfig-language.rst14
-rw-r--r--Documentation/kbuild/llvm.rst60
-rw-r--r--Documentation/kbuild/makefiles.rst2201
-rw-r--r--Documentation/kbuild/reproducible-builds.rst18
-rw-r--r--Documentation/kernel-hacking/false-sharing.rst206
-rw-r--r--Documentation/kernel-hacking/hacking.rst41
-rw-r--r--Documentation/kernel-hacking/index.rst1
-rw-r--r--Documentation/kernel-hacking/locking.rst42
-rw-r--r--Documentation/leds/index.rst2
-rw-r--r--Documentation/leds/leds-mt6370-rgb.rst64
-rw-r--r--Documentation/leds/leds-qcom-lpg.rst78
-rw-r--r--Documentation/leds/ledtrig-oneshot.rst2
-rw-r--r--Documentation/leds/well-known-leds.txt44
-rw-r--r--Documentation/litmus-tests/README2
-rw-r--r--Documentation/litmus-tests/locking/DCL-broken.litmus54
-rw-r--r--Documentation/litmus-tests/locking/DCL-fixed.litmus55
-rw-r--r--Documentation/litmus-tests/locking/RM-broken.litmus41
-rw-r--r--Documentation/litmus-tests/locking/RM-fixed.litmus41
-rw-r--r--Documentation/livepatch/api.rst30
-rw-r--r--Documentation/livepatch/index.rst1
-rw-r--r--Documentation/livepatch/module-elf-format.rst41
-rw-r--r--Documentation/livepatch/reliable-stacktrace.rst2
-rw-r--r--Documentation/livepatch/shadow-vars.rst4
-rw-r--r--Documentation/livepatch/system-state.rst4
-rw-r--r--Documentation/locking/locktorture.rst4
-rw-r--r--Documentation/locking/locktypes.rst14
-rw-r--r--Documentation/locking/seqlock.rst2
-rw-r--r--Documentation/locking/ww-mutex-design.rst2
-rw-r--r--Documentation/loongarch/booting.rst42
-rw-r--r--Documentation/loongarch/features.rst3
-rw-r--r--Documentation/loongarch/index.rst22
-rw-r--r--Documentation/loongarch/introduction.rst390
-rw-r--r--Documentation/loongarch/irq-chip-model.rst160
-rw-r--r--Documentation/maintainer/index.rst1
-rw-r--r--Documentation/maintainer/maintainer-entry-profile.rst2
-rw-r--r--Documentation/maintainer/messy-diffstat.rst96
-rw-r--r--Documentation/maintainer/pull-requests.rst2
-rw-r--r--Documentation/maintainer/rebasing-and-merging.rst6
-rw-r--r--Documentation/memory-barriers.txt208
-rw-r--r--Documentation/misc-devices/index.rst1
-rw-r--r--Documentation/misc-devices/oxsemi-tornado.rst131
-rw-r--r--Documentation/mm/active_mm.rst95
-rw-r--r--Documentation/mm/arch_pgtable_helpers.rst (renamed from Documentation/vm/arch_pgtable_helpers.rst)36
-rw-r--r--Documentation/mm/balance.rst100
-rw-r--r--Documentation/mm/bootmem.rst5
-rw-r--r--Documentation/mm/damon/api.rst (renamed from Documentation/vm/damon/api.rst)0
-rw-r--r--Documentation/mm/damon/design.rst176
-rw-r--r--Documentation/mm/damon/faq.rst50
-rw-r--r--Documentation/mm/damon/index.rst35
-rw-r--r--Documentation/mm/damon/maintainer-profile.rst62
-rw-r--r--Documentation/mm/free_page_reporting.rst38
-rw-r--r--Documentation/mm/frontswap.rst264
-rw-r--r--Documentation/mm/highmem.rst209
-rw-r--r--Documentation/mm/hmm.rst450
-rw-r--r--Documentation/mm/hugetlbfs_reserv.rst595
-rw-r--r--Documentation/mm/hwpoison.rst182
-rw-r--r--Documentation/mm/index.rst69
-rw-r--r--Documentation/mm/ksm.rst85
-rw-r--r--Documentation/mm/memory-model.rst175
-rw-r--r--Documentation/mm/mmu_notifier.rst97
-rw-r--r--Documentation/mm/multigen_lru.rst269
-rw-r--r--Documentation/mm/numa.rst148
-rw-r--r--Documentation/mm/oom.rst5
-rw-r--r--Documentation/mm/overcommit-accounting.rst86
-rw-r--r--Documentation/mm/page_allocation.rst5
-rw-r--r--Documentation/mm/page_cache.rst5
-rw-r--r--Documentation/mm/page_frags.rst43
-rw-r--r--Documentation/mm/page_migration.rst193
-rw-r--r--Documentation/mm/page_owner.rst187
-rw-r--r--Documentation/mm/page_reclaim.rst5
-rw-r--r--Documentation/mm/page_table_check.rst54
-rw-r--r--Documentation/mm/page_tables.rst5
-rw-r--r--Documentation/mm/physical_memory.rst371
-rw-r--r--Documentation/mm/process_addrs.rst5
-rw-r--r--Documentation/mm/remap_file_pages.rst31
-rw-r--r--Documentation/mm/shmfs.rst5
-rw-r--r--Documentation/mm/slab.rst5
-rw-r--r--Documentation/mm/slub.rst (renamed from Documentation/vm/slub.rst)79
-rw-r--r--Documentation/mm/split_page_table_lock.rst98
-rw-r--r--Documentation/mm/swap.rst5
-rw-r--r--Documentation/mm/transhuge.rst (renamed from Documentation/vm/transhuge.rst)42
-rw-r--r--Documentation/mm/unevictable-lru.rst559
-rw-r--r--Documentation/mm/vmalloc.rst5
-rw-r--r--Documentation/mm/vmalloced-kernel-stacks.rst153
-rw-r--r--Documentation/mm/vmemmap_dedup.rst249
-rw-r--r--Documentation/mm/z3fold.rst28
-rw-r--r--Documentation/mm/zsmalloc.rst265
-rw-r--r--Documentation/netlink/genetlink-c.yaml331
-rw-r--r--Documentation/netlink/genetlink-legacy.yaml377
-rw-r--r--Documentation/netlink/genetlink.yaml299
-rw-r--r--Documentation/netlink/specs/devlink.yaml198
-rw-r--r--Documentation/netlink/specs/ethtool.yaml1646
-rw-r--r--Documentation/netlink/specs/fou.yaml132
-rw-r--r--Documentation/netlink/specs/handshake.yaml124
-rw-r--r--Documentation/netlink/specs/netdev.yaml101
-rw-r--r--Documentation/netlink/specs/ovs_datapath.yaml153
-rw-r--r--Documentation/netlink/specs/ovs_vport.yaml139
-rw-r--r--Documentation/networking/af_xdp.rst4
-rw-r--r--Documentation/networking/arcnet-hardware.rst2
-rw-r--r--Documentation/networking/batman-adv.rst2
-rw-r--r--Documentation/networking/bonding.rst72
-rw-r--r--Documentation/networking/bridge.rst2
-rw-r--r--Documentation/networking/can.rst37
-rw-r--r--Documentation/networking/can_ucan_protocol.rst2
-rw-r--r--Documentation/networking/cdc_mbim.rst2
-rw-r--r--Documentation/networking/decnet.rst243
-rw-r--r--Documentation/networking/device_drivers/appletalk/index.rst1
-rw-r--r--Documentation/networking/device_drivers/appletalk/ltpc.rst144
-rw-r--r--Documentation/networking/device_drivers/atm/iphase.rst2
-rw-r--r--Documentation/networking/device_drivers/can/can327.rst331
-rw-r--r--Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst638
-rw-r--r--Documentation/networking/device_drivers/can/ctu/fsm_txt_buffer_user.svg151
-rw-r--r--Documentation/networking/device_drivers/can/freescale/flexcan.rst54
-rw-r--r--Documentation/networking/device_drivers/can/index.rst22
-rw-r--r--Documentation/networking/device_drivers/ethernet/3com/vortex.rst2
-rw-r--r--Documentation/networking/device_drivers/ethernet/amazon/ena.rst2
-rw-r--r--Documentation/networking/device_drivers/ethernet/amd/pds_core.rst139
-rw-r--r--Documentation/networking/device_drivers/ethernet/aquantia/atlantic.rst6
-rw-r--r--Documentation/networking/device_drivers/ethernet/dec/de4x5.rst189
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst11
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst1
-rw-r--r--Documentation/networking/device_drivers/ethernet/index.rst9
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e100.rst7
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e1000.rst9
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/e1000e.rst7
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/fm10k.rst7
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/i40e.rst11
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/iavf.rst7
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ice.rst20
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/igb.rst7
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/igbvf.rst7
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ixgb.rst468
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst23
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst7
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst36
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst2
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst702
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst1276
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst292
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/index.rst26
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst168
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst239
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/tracepoints.rst229
-rw-r--r--Documentation/networking/device_drivers/ethernet/neterion/vxge.rst115
-rw-r--r--Documentation/networking/device_drivers/ethernet/netronome/nfp.rst165
-rw-r--r--Documentation/networking/device_drivers/ethernet/pensando/ionic.rst2
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst2
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst2
-rw-r--r--Documentation/networking/device_drivers/ethernet/wangxun/ngbe.rst14
-rw-r--r--Documentation/networking/device_drivers/ethernet/wangxun/txgbe.rst20
-rw-r--r--Documentation/networking/device_drivers/index.rst2
-rw-r--r--Documentation/networking/device_drivers/wan/index.rst18
-rw-r--r--Documentation/networking/device_drivers/wan/z8530book.rst256
-rw-r--r--Documentation/networking/device_drivers/wwan/index.rst1
-rw-r--r--Documentation/networking/device_drivers/wwan/iosm.rst2
-rw-r--r--Documentation/networking/device_drivers/wwan/t7xx.rst120
-rw-r--r--Documentation/networking/devlink/bnxt.rst2
-rw-r--r--Documentation/networking/devlink/devlink-health.rst23
-rw-r--r--Documentation/networking/devlink/devlink-info.rst5
-rw-r--r--Documentation/networking/devlink/devlink-linecard.rst122
-rw-r--r--Documentation/networking/devlink/devlink-params.rst15
-rw-r--r--Documentation/networking/devlink/devlink-port.rst168
-rw-r--r--Documentation/networking/devlink/devlink-region.rst17
-rw-r--r--Documentation/networking/devlink/devlink-selftests.rst38
-rw-r--r--Documentation/networking/devlink/devlink-trap.rst13
-rw-r--r--Documentation/networking/devlink/etas_es58x.rst36
-rw-r--r--Documentation/networking/devlink/ice.rst216
-rw-r--r--Documentation/networking/devlink/index.rst20
-rw-r--r--Documentation/networking/devlink/iosm.rst162
-rw-r--r--Documentation/networking/devlink/mlx5.rst41
-rw-r--r--Documentation/networking/devlink/mlxsw.rst24
-rw-r--r--Documentation/networking/devlink/netdevsim.rst4
-rw-r--r--Documentation/networking/devlink/octeontx2.rst42
-rw-r--r--Documentation/networking/devlink/prestera.rst2
-rw-r--r--Documentation/networking/devlink/sfc.rst57
-rw-r--r--Documentation/networking/driver.rst156
-rw-r--r--Documentation/networking/dsa/configuration.rst98
-rw-r--r--Documentation/networking/dsa/dsa.rst480
-rw-r--r--Documentation/networking/dsa/sja1105.rst27
-rw-r--r--Documentation/networking/ethtool-netlink.rst518
-rw-r--r--Documentation/networking/filter.rst1038
-rw-r--r--Documentation/networking/generic_netlink.rst2
-rw-r--r--Documentation/networking/gtp.rst2
-rw-r--r--Documentation/networking/ieee802154.rst2
-rw-r--r--Documentation/networking/index.rst17
-rw-r--r--Documentation/networking/ip-sysctl.rst343
-rw-r--r--Documentation/networking/ipvlan.rst4
-rw-r--r--Documentation/networking/ipvs-sysctl.rst34
-rw-r--r--Documentation/networking/j1939.rst2
-rw-r--r--Documentation/networking/l2tp.rst2
-rw-r--r--Documentation/networking/mctp.rst117
-rw-r--r--Documentation/networking/mptcp-sysctl.rst17
-rw-r--r--Documentation/networking/msg_zerocopy.rst8
-rw-r--r--Documentation/networking/napi.rst254
-rw-r--r--Documentation/networking/net_failover.rst111
-rw-r--r--Documentation/networking/netconsole.rst2
-rw-r--r--Documentation/networking/netdev-FAQ.rst263
-rw-r--r--Documentation/networking/nf_conntrack-sysctl.rst43
-rw-r--r--Documentation/networking/page_pool.rst63
-rw-r--r--Documentation/networking/phonet.rst2
-rw-r--r--Documentation/networking/phy.rst24
-rw-r--r--Documentation/networking/rds.rst2
-rw-r--r--Documentation/networking/regulatory.rst4
-rw-r--r--Documentation/networking/representors.rst259
-rw-r--r--Documentation/networking/rxrpc.rst34
-rw-r--r--Documentation/networking/sfp-phylink.rst6
-rw-r--r--Documentation/networking/skbuff.rst37
-rw-r--r--Documentation/networking/smc-sysctl.rst61
-rw-r--r--Documentation/networking/snmp_counter.rst4
-rw-r--r--Documentation/networking/statistics.rst1
-rw-r--r--Documentation/networking/switchdev.rst3
-rw-r--r--Documentation/networking/sysfs-tagging.rst2
-rw-r--r--Documentation/networking/tc-queue-filters.rst37
-rw-r--r--Documentation/networking/timestamping.rst42
-rw-r--r--Documentation/networking/tls-handshake.rst217
-rw-r--r--Documentation/networking/tls.rst47
-rw-r--r--Documentation/networking/x25-iface.rst3
-rw-r--r--Documentation/networking/xdp-rx-metadata.rst113
-rw-r--r--Documentation/networking/xfrm_device.rst64
-rw-r--r--Documentation/nvme/feature-and-quirk-policy.rst77
-rw-r--r--Documentation/peci/index.rst16
-rw-r--r--Documentation/peci/peci.rst51
-rw-r--r--Documentation/power/energy-model.rst99
-rw-r--r--Documentation/power/opp.rst14
-rw-r--r--Documentation/power/pci.rst2
-rw-r--r--Documentation/power/power_supply_class.rst4
-rw-r--r--Documentation/power/regulator/consumer.rst2
-rw-r--r--Documentation/power/runtime_pm.rst14
-rw-r--r--Documentation/power/suspend-and-interrupts.rst2
-rw-r--r--Documentation/powerpc/cpu_families.rst13
-rw-r--r--Documentation/powerpc/dawr-power9.rst26
-rw-r--r--Documentation/powerpc/elf_hwcaps.rst231
-rw-r--r--Documentation/powerpc/index.rst1
-rw-r--r--Documentation/powerpc/isa-versions.rst22
-rw-r--r--Documentation/powerpc/kasan.txt58
-rw-r--r--Documentation/process/2.Process.rst15
-rw-r--r--Documentation/process/3.Early-stage.rst9
-rw-r--r--Documentation/process/5.Posting.rst53
-rw-r--r--Documentation/process/8.Conclusion.rst16
-rw-r--r--Documentation/process/applying-patches.rst28
-rw-r--r--Documentation/process/botching-up-ioctls.rst2
-rw-r--r--Documentation/process/changes.rst84
-rw-r--r--Documentation/process/code-of-conduct-interpretation.rst26
-rw-r--r--Documentation/process/coding-style.rst103
-rw-r--r--Documentation/process/contribution-maturity-model.rst109
-rw-r--r--Documentation/process/deprecated.rst62
-rw-r--r--Documentation/process/email-clients.rst89
-rw-r--r--Documentation/process/embargoed-hardware-issues.rst13
-rw-r--r--Documentation/process/handling-regressions.rst746
-rw-r--r--Documentation/process/howto.rst10
-rw-r--r--Documentation/process/index.rst15
-rw-r--r--Documentation/process/kernel-docs.rst503
-rw-r--r--Documentation/process/magic-number.rst70
-rw-r--r--Documentation/process/maintainer-handbooks.rst19
-rw-r--r--Documentation/process/maintainer-netdev.rst396
-rw-r--r--Documentation/process/maintainer-pgp-guide.rst374
-rw-r--r--Documentation/process/maintainer-tip.rst799
-rw-r--r--Documentation/process/programming-language.rst30
-rw-r--r--Documentation/process/researcher-guidelines.rst143
-rw-r--r--Documentation/process/security-bugs.rst (renamed from Documentation/admin-guide/security-bugs.rst)0
-rw-r--r--Documentation/process/stable-kernel-rules.rst40
-rw-r--r--Documentation/process/submitting-drivers.rst194
-rw-r--r--Documentation/process/submitting-patches.rst122
-rw-r--r--Documentation/riscv/hwprobe.rst86
-rw-r--r--Documentation/riscv/index.rst3
-rw-r--r--Documentation/riscv/patch-acceptance.rst22
-rw-r--r--Documentation/riscv/pmu.rst255
-rw-r--r--Documentation/riscv/uabi.rst48
-rw-r--r--Documentation/riscv/vm-layout.rst84
-rw-r--r--Documentation/rust/arch-support.rst21
-rw-r--r--Documentation/rust/coding-guidelines.rst216
-rw-r--r--Documentation/rust/general-information.rst79
-rw-r--r--Documentation/rust/index.rst22
-rw-r--r--Documentation/rust/quick-start.rst232
-rw-r--r--Documentation/s390/index.rst1
-rw-r--r--Documentation/s390/pci.rst4
-rw-r--r--Documentation/s390/vfio-ap-locking.rst115
-rw-r--r--Documentation/s390/vfio-ap.rst501
-rw-r--r--Documentation/s390/vfio-ccw.rst8
-rw-r--r--Documentation/scheduler/index.rst9
-rw-r--r--Documentation/scheduler/sched-arch.rst2
-rw-r--r--Documentation/scheduler/sched-bwc.rst85
-rw-r--r--Documentation/scheduler/sched-capacity.rst4
-rw-r--r--Documentation/scheduler/sched-debug.rst54
-rw-r--r--Documentation/scheduler/sched-design-CFS.rst2
-rw-r--r--Documentation/scheduler/sched-domains.rst8
-rw-r--r--Documentation/scheduler/sched-stats.rst8
-rw-r--r--Documentation/scheduler/sched-util-clamp.rst741
-rw-r--r--Documentation/scheduler/schedutil.rst173
-rw-r--r--Documentation/scheduler/schedutil.txt169
-rw-r--r--Documentation/scsi/ChangeLog.lpfc38
-rw-r--r--Documentation/scsi/ChangeLog.megaraid8
-rw-r--r--Documentation/scsi/ChangeLog.megaraid_sas4
-rw-r--r--Documentation/scsi/ChangeLog.ncr53c8xx16
-rw-r--r--Documentation/scsi/ChangeLog.sym53c8xx14
-rw-r--r--Documentation/scsi/ChangeLog.sym53c8xx_210
-rw-r--r--Documentation/scsi/index.rst6
-rw-r--r--Documentation/scsi/libsas.rst2
-rw-r--r--Documentation/scsi/ncr53c8xx.rst4
-rw-r--r--Documentation/scsi/scsi_eh.rst25
-rw-r--r--Documentation/scsi/scsi_mid_low_api.rst4
-rw-r--r--Documentation/scsi/sym53c8xx_2.rst2
-rw-r--r--Documentation/scsi/tcm_qla2xxx.rst2
-rw-r--r--Documentation/scsi/ufs.rst85
-rw-r--r--Documentation/security/IMA-templates.rst11
-rw-r--r--Documentation/security/SCTP.rst69
-rw-r--r--Documentation/security/index.rst1
-rw-r--r--Documentation/security/keys/core.rst2
-rw-r--r--Documentation/security/keys/trusted-encrypted.rst86
-rw-r--r--Documentation/security/landlock.rst45
-rw-r--r--Documentation/security/lsm-development.rst6
-rw-r--r--Documentation/security/lsm.rst2
-rw-r--r--Documentation/security/secrets/coco.rst103
-rw-r--r--Documentation/security/secrets/index.rst9
-rw-r--r--Documentation/security/self-protection.rst3
-rw-r--r--Documentation/security/siphash.rst44
-rw-r--r--Documentation/sound/alsa-configuration.rst39
-rw-r--r--Documentation/sound/cards/audigy-mixer.rst29
-rw-r--r--Documentation/sound/cards/maya44.rst2
-rw-r--r--Documentation/sound/cards/sb-live-mixer.rst19
-rw-r--r--Documentation/sound/designs/control-names.rst2
-rw-r--r--Documentation/sound/designs/jack-controls.rst2
-rw-r--r--Documentation/sound/designs/seq-oss.rst2
-rw-r--r--Documentation/sound/hd-audio/index.rst1
-rw-r--r--Documentation/sound/hd-audio/intel-multi-link.rst312
-rw-r--r--Documentation/sound/hd-audio/models.rst8
-rw-r--r--Documentation/sound/hd-audio/notes.rst8
-rw-r--r--Documentation/sound/index.rst8
-rw-r--r--Documentation/sound/kernel-api/writing-an-alsa-driver.rst1121
-rw-r--r--Documentation/sound/soc/codec.rst10
-rw-r--r--Documentation/sound/soc/dai.rst2
-rw-r--r--Documentation/sound/soc/platform.rst2
-rw-r--r--Documentation/sphinx-static/custom.css75
-rw-r--r--Documentation/sphinx-static/theme_overrides.css16
-rw-r--r--Documentation/sphinx-static/theme_rtd_colors.css37
-rw-r--r--Documentation/sphinx/automarkup.py81
-rw-r--r--Documentation/sphinx/kernel_abi.py8
-rw-r--r--Documentation/sphinx/kernel_feat.py22
-rwxr-xr-xDocumentation/sphinx/kernel_include.py3
-rw-r--r--Documentation/sphinx/kerneldoc-preamble.sty236
-rw-r--r--Documentation/sphinx/kerneldoc.py2
-rw-r--r--Documentation/sphinx/kfigure.py140
-rw-r--r--Documentation/sphinx/load_config.py6
-rw-r--r--Documentation/sphinx/requirements.txt3
-rw-r--r--Documentation/sphinx/templates/kernel-toc.html16
-rw-r--r--Documentation/spi/pxa2xx.rst44
-rw-r--r--Documentation/spi/spi-lm70llp.rst2
-rw-r--r--Documentation/spi/spi-summary.rst33
-rw-r--r--Documentation/spi/spidev.rst58
-rw-r--r--Documentation/staging/index.rst42
-rw-r--r--Documentation/staging/remoteproc.rst3
-rw-r--r--Documentation/staging/static-keys.rst3
-rw-r--r--Documentation/staging/tee.rst87
-rw-r--r--Documentation/subsystem-apis.rst59
-rw-r--r--Documentation/target/tcmu-design.rst2
-rw-r--r--Documentation/timers/hrtimers.rst21
-rw-r--r--Documentation/timers/no_hz.rst18
-rw-r--r--Documentation/tools/index.rst21
-rw-r--r--Documentation/tools/rtla/Makefile53
-rw-r--r--Documentation/tools/rtla/common_appendix.rst13
-rw-r--r--Documentation/tools/rtla/common_hist_options.rst23
-rw-r--r--Documentation/tools/rtla/common_options.rst47
-rw-r--r--Documentation/tools/rtla/common_osnoise_description.rst8
-rw-r--r--Documentation/tools/rtla/common_osnoise_options.rst27
-rw-r--r--Documentation/tools/rtla/common_timerlat_aa.rst14
-rw-r--r--Documentation/tools/rtla/common_timerlat_description.rst10
-rw-r--r--Documentation/tools/rtla/common_timerlat_options.rst28
-rw-r--r--Documentation/tools/rtla/common_top_options.rst3
-rw-r--r--Documentation/tools/rtla/index.rst27
-rw-r--r--Documentation/tools/rtla/rtla-hwnoise.rst107
-rw-r--r--Documentation/tools/rtla/rtla-osnoise-hist.rst66
-rw-r--r--Documentation/tools/rtla/rtla-osnoise-top.rst61
-rw-r--r--Documentation/tools/rtla/rtla-osnoise.rst59
-rw-r--r--Documentation/tools/rtla/rtla-timerlat-hist.rst106
-rw-r--r--Documentation/tools/rtla/rtla-timerlat-top.rst127
-rw-r--r--Documentation/tools/rtla/rtla-timerlat.rst57
-rw-r--r--Documentation/tools/rtla/rtla.rst48
-rw-r--r--Documentation/tools/rv/Makefile52
-rw-r--r--Documentation/tools/rv/common_appendix.rst16
-rw-r--r--Documentation/tools/rv/common_ikm.rst21
-rw-r--r--Documentation/tools/rv/index.rst24
-rw-r--r--Documentation/tools/rv/rv-list.rst43
-rw-r--r--Documentation/tools/rv/rv-mon-wip.rst44
-rw-r--r--Documentation/tools/rv/rv-mon-wwnr.rst43
-rw-r--r--Documentation/tools/rv/rv-mon.rst55
-rw-r--r--Documentation/tools/rv/rv.rst63
-rw-r--r--Documentation/trace/boottime-trace.rst4
-rw-r--r--Documentation/trace/coresight/coresight-config.rst78
-rw-r--r--Documentation/trace/coresight/coresight-cpu-debug.rst3
-rw-r--r--Documentation/trace/coresight/coresight-etm4x-reference.rst31
-rw-r--r--Documentation/trace/coresight/coresight-perf.rst158
-rw-r--r--Documentation/trace/coresight/coresight-tpda.rst52
-rw-r--r--Documentation/trace/coresight/coresight-tpdm.rst45
-rw-r--r--Documentation/trace/coresight/coresight.rst58
-rw-r--r--Documentation/trace/coresight/ultrasoc-smb.rst83
-rw-r--r--Documentation/trace/events-msr.rst4
-rw-r--r--Documentation/trace/events-nmi.rst6
-rw-r--r--Documentation/trace/events.rst109
-rw-r--r--Documentation/trace/fprobe.rst182
-rw-r--r--Documentation/trace/ftrace.rst66
-rw-r--r--Documentation/trace/hisi-ptt.rst298
-rw-r--r--Documentation/trace/histogram-design.rst12
-rw-r--r--Documentation/trace/histogram.rst445
-rw-r--r--Documentation/trace/index.rst4
-rw-r--r--Documentation/trace/kprobes.rst9
-rw-r--r--Documentation/trace/kprobetrace.rst67
-rw-r--r--Documentation/trace/mmiotrace.rst20
-rw-r--r--Documentation/trace/osnoise-tracer.rst36
-rw-r--r--Documentation/trace/postprocess/trace-pagealloc-postprocess.pl4
-rw-r--r--Documentation/trace/postprocess/trace-vmscan-postprocess.pl4
-rw-r--r--Documentation/trace/rv/da_monitor_instrumentation.rst171
-rw-r--r--Documentation/trace/rv/da_monitor_synthesis.rst147
-rw-r--r--Documentation/trace/rv/deterministic_automata.rst184
-rw-r--r--Documentation/trace/rv/index.rst14
-rw-r--r--Documentation/trace/rv/monitor_wip.rst55
-rw-r--r--Documentation/trace/rv/monitor_wwnr.rst45
-rw-r--r--Documentation/trace/rv/runtime-verification.rst231
-rw-r--r--Documentation/trace/timerlat-tracer.rst27
-rw-r--r--Documentation/trace/tracepoint-analysis.rst8
-rw-r--r--Documentation/trace/uprobetracer.rst30
-rw-r--r--Documentation/trace/user_events.rst265
-rw-r--r--Documentation/translations/conf.py12
-rw-r--r--Documentation/translations/index.rst1
-rw-r--r--Documentation/translations/it_IT/admin-guide/README.rst2
-rw-r--r--Documentation/translations/it_IT/admin-guide/security-bugs.rst2
-rw-r--r--Documentation/translations/it_IT/core-api/symbol-namespaces.rst9
-rw-r--r--Documentation/translations/it_IT/devicetree/bindings/submitting-patches.rst11
-rw-r--r--Documentation/translations/it_IT/doc-guide/kernel-doc.rst4
-rw-r--r--Documentation/translations/it_IT/doc-guide/parse-headers.rst5
-rw-r--r--Documentation/translations/it_IT/doc-guide/sphinx.rst55
-rw-r--r--Documentation/translations/it_IT/index.rst124
-rw-r--r--Documentation/translations/it_IT/kernel-hacking/hacking.rst29
-rw-r--r--Documentation/translations/it_IT/kernel-hacking/locking.rst51
-rw-r--r--Documentation/translations/it_IT/maintainer/configure-git.rst10
-rw-r--r--Documentation/translations/it_IT/networking/netdev-FAQ.rst2
-rw-r--r--Documentation/translations/it_IT/process/2.Process.rst15
-rw-r--r--Documentation/translations/it_IT/process/3.Early-stage.rst17
-rw-r--r--Documentation/translations/it_IT/process/5.Posting.rst45
-rw-r--r--Documentation/translations/it_IT/process/7.AdvancedTopics.rst8
-rw-r--r--Documentation/translations/it_IT/process/8.Conclusion.rst5
-rw-r--r--Documentation/translations/it_IT/process/botching-up-ioctls.rst249
-rw-r--r--Documentation/translations/it_IT/process/changes.rst42
-rw-r--r--Documentation/translations/it_IT/process/clang-format.rst2
-rw-r--r--Documentation/translations/it_IT/process/coding-style.rst48
-rw-r--r--Documentation/translations/it_IT/process/deprecated.rst53
-rw-r--r--Documentation/translations/it_IT/process/email-clients.rst94
-rw-r--r--Documentation/translations/it_IT/process/howto.rst7
-rw-r--r--Documentation/translations/it_IT/process/index.rst4
-rw-r--r--Documentation/translations/it_IT/process/kernel-docs.rst4
-rw-r--r--Documentation/translations/it_IT/process/magic-number.rst71
-rw-r--r--Documentation/translations/it_IT/process/maintainer-handbooks.rst24
-rw-r--r--Documentation/translations/it_IT/process/maintainer-pgp-guide.rst356
-rw-r--r--Documentation/translations/it_IT/process/maintainer-tip.rst10
-rw-r--r--Documentation/translations/it_IT/process/maintainers.rst13
-rw-r--r--Documentation/translations/it_IT/process/programming-language.rst29
-rw-r--r--Documentation/translations/it_IT/process/stable-kernel-rules.rst48
-rw-r--r--Documentation/translations/it_IT/process/submitting-drivers.rst16
-rw-r--r--Documentation/translations/it_IT/process/submitting-patches.rst105
-rw-r--r--Documentation/translations/it_IT/process/volatile-considered-harmful.rst4
-rw-r--r--Documentation/translations/ja_JP/SubmittingPatches56
-rw-r--r--Documentation/translations/ja_JP/howto.rst118
-rw-r--r--Documentation/translations/ja_JP/index.rst6
-rw-r--r--Documentation/translations/ko_KR/howto.rst8
-rw-r--r--Documentation/translations/ko_KR/index.rst5
-rw-r--r--Documentation/translations/ko_KR/memory-barriers.txt157
-rw-r--r--Documentation/translations/sp_SP/disclaimer-sp.rst6
-rw-r--r--Documentation/translations/sp_SP/howto.rst617
-rw-r--r--Documentation/translations/sp_SP/index.rst81
-rw-r--r--Documentation/translations/sp_SP/memory-barriers.txt3134
-rw-r--r--Documentation/translations/sp_SP/process/adding-syscalls.rst632
-rw-r--r--Documentation/translations/sp_SP/process/code-of-conduct.rst97
-rw-r--r--Documentation/translations/sp_SP/process/coding-style.rst1315
-rw-r--r--Documentation/translations/sp_SP/process/deprecated.rst381
-rw-r--r--Documentation/translations/sp_SP/process/email-clients.rst374
-rw-r--r--Documentation/translations/sp_SP/process/index.rst22
-rw-r--r--Documentation/translations/sp_SP/process/kernel-docs.rst187
-rw-r--r--Documentation/translations/sp_SP/process/kernel-enforcement-statement.rst174
-rw-r--r--Documentation/translations/sp_SP/process/magic-number.rst89
-rw-r--r--Documentation/translations/sp_SP/process/programming-language.rst53
-rw-r--r--Documentation/translations/sp_SP/process/submitting-patches.rst894
-rw-r--r--Documentation/translations/sp_SP/wrappers/memory-barriers.rst19
-rw-r--r--Documentation/translations/zh_CN/IRQ.txt39
-rw-r--r--Documentation/translations/zh_CN/PCI/acpi-info.rst139
-rw-r--r--Documentation/translations/zh_CN/PCI/index.rst34
-rw-r--r--Documentation/translations/zh_CN/PCI/msi-howto.rst244
-rw-r--r--Documentation/translations/zh_CN/PCI/pci-iov-howto.rst169
-rw-r--r--Documentation/translations/zh_CN/PCI/pci.rst514
-rw-r--r--Documentation/translations/zh_CN/PCI/pciebus-howto.rst192
-rw-r--r--Documentation/translations/zh_CN/PCI/sysfs-pci.rst126
-rw-r--r--Documentation/translations/zh_CN/accounting/delay-accounting.rst112
-rw-r--r--Documentation/translations/zh_CN/accounting/index.rst4
-rw-r--r--Documentation/translations/zh_CN/accounting/taskstats.rst145
-rw-r--r--Documentation/translations/zh_CN/admin-guide/README.rst112
-rw-r--r--Documentation/translations/zh_CN/admin-guide/bootconfig.rst293
-rw-r--r--Documentation/translations/zh_CN/admin-guide/cputopology.rst96
-rw-r--r--Documentation/translations/zh_CN/admin-guide/index.rst128
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/damon/index.rst29
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/damon/lru_sort.rst263
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/damon/reclaim.rst228
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/damon/start.rst124
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst591
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/index.rst49
-rw-r--r--Documentation/translations/zh_CN/admin-guide/mm/ksm.rst198
-rw-r--r--Documentation/translations/zh_CN/admin-guide/reporting-issues.rst125
-rw-r--r--Documentation/translations/zh_CN/admin-guide/reporting-regressions.rst370
-rw-r--r--Documentation/translations/zh_CN/admin-guide/security-bugs.rst2
-rw-r--r--Documentation/translations/zh_CN/admin-guide/sysrq.rst280
-rw-r--r--Documentation/translations/zh_CN/arch/index.rst29
-rw-r--r--Documentation/translations/zh_CN/arch/openrisc/index.rst32
-rw-r--r--Documentation/translations/zh_CN/arch/openrisc/openrisc_port.rst (renamed from Documentation/translations/zh_CN/openrisc/openrisc_port.rst)4
-rw-r--r--Documentation/translations/zh_CN/arch/openrisc/todo.rst (renamed from Documentation/translations/zh_CN/openrisc/todo.rst)4
-rw-r--r--Documentation/translations/zh_CN/arch/parisc/debugging.rst (renamed from Documentation/translations/zh_CN/parisc/debugging.rst)4
-rw-r--r--Documentation/translations/zh_CN/arch/parisc/index.rst31
-rw-r--r--Documentation/translations/zh_CN/arch/parisc/registers.rst (renamed from Documentation/translations/zh_CN/parisc/registers.rst)4
-rw-r--r--Documentation/translations/zh_CN/core-api/assoc_array.rst473
-rw-r--r--Documentation/translations/zh_CN/core-api/boot-time-mm.rst49
-rw-r--r--Documentation/translations/zh_CN/core-api/cachetlb.rst6
-rw-r--r--Documentation/translations/zh_CN/core-api/circular-buffers.rst210
-rw-r--r--Documentation/translations/zh_CN/core-api/cpu_hotplug.rst435
-rw-r--r--Documentation/translations/zh_CN/core-api/errseq.rst145
-rw-r--r--Documentation/translations/zh_CN/core-api/genalloc.rst109
-rw-r--r--Documentation/translations/zh_CN/core-api/generic-radix-tree.rst23
-rw-r--r--Documentation/translations/zh_CN/core-api/gfp_mask-from-fs-io.rst66
-rw-r--r--Documentation/translations/zh_CN/core-api/idr.rst80
-rw-r--r--Documentation/translations/zh_CN/core-api/index.rst28
-rw-r--r--Documentation/translations/zh_CN/core-api/irq/irq-affinity.rst2
-rw-r--r--Documentation/translations/zh_CN/core-api/irq/irq-domain.rst22
-rw-r--r--Documentation/translations/zh_CN/core-api/kernel-api.rst19
-rw-r--r--Documentation/translations/zh_CN/core-api/kobject.rst12
-rw-r--r--Documentation/translations/zh_CN/core-api/kref.rst311
-rw-r--r--Documentation/translations/zh_CN/core-api/local_ops.rst2
-rw-r--r--Documentation/translations/zh_CN/core-api/memory-allocation.rst138
-rw-r--r--Documentation/translations/zh_CN/core-api/memory-hotplug.rst6
-rw-r--r--Documentation/translations/zh_CN/core-api/mm-api.rst131
-rw-r--r--Documentation/translations/zh_CN/core-api/packing.rst160
-rw-r--r--Documentation/translations/zh_CN/core-api/printk-basics.rst3
-rw-r--r--Documentation/translations/zh_CN/core-api/printk-formats.rst3
-rw-r--r--Documentation/translations/zh_CN/core-api/rbtree.rst391
-rw-r--r--Documentation/translations/zh_CN/core-api/symbol-namespaces.rst2
-rw-r--r--Documentation/translations/zh_CN/core-api/this_cpu_ops.rst285
-rw-r--r--Documentation/translations/zh_CN/core-api/unaligned-memory-access.rst229
-rw-r--r--Documentation/translations/zh_CN/core-api/watch_queue.rst313
-rw-r--r--Documentation/translations/zh_CN/core-api/workqueue.rst25
-rw-r--r--Documentation/translations/zh_CN/core-api/xarray.rst373
-rw-r--r--Documentation/translations/zh_CN/cpu-freq/core.rst24
-rw-r--r--Documentation/translations/zh_CN/cpu-freq/cpu-drivers.rst141
-rw-r--r--Documentation/translations/zh_CN/cpu-freq/cpufreq-stats.rst45
-rw-r--r--Documentation/translations/zh_CN/dev-tools/gdb-kernel-debugging.rst167
-rw-r--r--Documentation/translations/zh_CN/dev-tools/index.rst4
-rw-r--r--Documentation/translations/zh_CN/dev-tools/kasan.rst163
-rw-r--r--Documentation/translations/zh_CN/dev-tools/sparse.rst110
-rw-r--r--Documentation/translations/zh_CN/dev-tools/testing-overview.rst52
-rw-r--r--Documentation/translations/zh_CN/devicetree/changesets.rst37
-rw-r--r--Documentation/translations/zh_CN/devicetree/dynamic-resolution-notes.rst31
-rw-r--r--Documentation/translations/zh_CN/devicetree/index.rst45
-rw-r--r--Documentation/translations/zh_CN/devicetree/kernel-api.rst58
-rw-r--r--Documentation/translations/zh_CN/devicetree/of_unittest.rst189
-rw-r--r--Documentation/translations/zh_CN/devicetree/overlay-notes.rst140
-rw-r--r--Documentation/translations/zh_CN/devicetree/usage-model.rst330
-rw-r--r--Documentation/translations/zh_CN/doc-guide/index.rst2
-rw-r--r--Documentation/translations/zh_CN/doc-guide/kernel-doc.rst2
-rw-r--r--Documentation/translations/zh_CN/doc-guide/sphinx.rst21
-rw-r--r--Documentation/translations/zh_CN/driver-api/gpio/index.rst69
-rw-r--r--Documentation/translations/zh_CN/driver-api/gpio/legacy.rst659
-rw-r--r--Documentation/translations/zh_CN/driver-api/index.rst132
-rw-r--r--Documentation/translations/zh_CN/driver-api/io_ordering.rst60
-rw-r--r--Documentation/translations/zh_CN/filesystems/sysfs.txt4
-rw-r--r--Documentation/translations/zh_CN/glossary.rst36
-rw-r--r--Documentation/translations/zh_CN/gpio.txt650
-rw-r--r--Documentation/translations/zh_CN/iio/iio_configfs.rst12
-rw-r--r--Documentation/translations/zh_CN/index.rst186
-rw-r--r--Documentation/translations/zh_CN/io_ordering.txt67
-rw-r--r--Documentation/translations/zh_CN/kernel-hacking/hacking.rst25
-rw-r--r--Documentation/translations/zh_CN/locking/index.rst43
-rw-r--r--Documentation/translations/zh_CN/locking/mutex-design.rst145
-rw-r--r--Documentation/translations/zh_CN/locking/spinlocks.rst149
-rw-r--r--Documentation/translations/zh_CN/loongarch/booting.rst48
-rw-r--r--Documentation/translations/zh_CN/loongarch/features.rst8
-rw-r--r--Documentation/translations/zh_CN/loongarch/index.rst27
-rw-r--r--Documentation/translations/zh_CN/loongarch/introduction.rst353
-rw-r--r--Documentation/translations/zh_CN/loongarch/irq-chip-model.rst157
-rw-r--r--Documentation/translations/zh_CN/maintainer/pull-requests.rst2
-rw-r--r--Documentation/translations/zh_CN/mm/active_mm.rst85
-rw-r--r--Documentation/translations/zh_CN/mm/balance.rst81
-rw-r--r--Documentation/translations/zh_CN/mm/damon/api.rst32
-rw-r--r--Documentation/translations/zh_CN/mm/damon/design.rst140
-rw-r--r--Documentation/translations/zh_CN/mm/damon/faq.rst48
-rw-r--r--Documentation/translations/zh_CN/mm/damon/index.rst32
-rw-r--r--Documentation/translations/zh_CN/mm/free_page_reporting.rst38
-rw-r--r--Documentation/translations/zh_CN/mm/frontswap.rst196
-rw-r--r--Documentation/translations/zh_CN/mm/highmem.rst151
-rw-r--r--Documentation/translations/zh_CN/mm/hmm.rst361
-rw-r--r--Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst437
-rw-r--r--Documentation/translations/zh_CN/mm/hwpoison.rst166
-rw-r--r--Documentation/translations/zh_CN/mm/index.rst69
-rw-r--r--Documentation/translations/zh_CN/mm/ksm.rst70
-rw-r--r--Documentation/translations/zh_CN/mm/memory-model.rst135
-rw-r--r--Documentation/translations/zh_CN/mm/mmu_notifier.rst97
-rw-r--r--Documentation/translations/zh_CN/mm/numa.rst101
-rw-r--r--Documentation/translations/zh_CN/mm/overcommit-accounting.rst86
-rw-r--r--Documentation/translations/zh_CN/mm/page_frags.rst38
-rw-r--r--Documentation/translations/zh_CN/mm/page_migration.rst228
-rw-r--r--Documentation/translations/zh_CN/mm/page_owner.rst170
-rw-r--r--Documentation/translations/zh_CN/mm/page_table_check.rst56
-rw-r--r--Documentation/translations/zh_CN/mm/remap_file_pages.rst32
-rw-r--r--Documentation/translations/zh_CN/mm/split_page_table_lock.rst96
-rw-r--r--Documentation/translations/zh_CN/mm/vmalloced-kernel-stacks.rst133
-rw-r--r--Documentation/translations/zh_CN/mm/z3fold.rst31
-rw-r--r--Documentation/translations/zh_CN/mm/zsmalloc.rst78
-rw-r--r--Documentation/translations/zh_CN/oops-tracing.txt212
-rw-r--r--Documentation/translations/zh_CN/openrisc/index.rst32
-rw-r--r--Documentation/translations/zh_CN/parisc/index.rst31
-rw-r--r--Documentation/translations/zh_CN/peci/index.rst26
-rw-r--r--Documentation/translations/zh_CN/peci/peci.rst54
-rw-r--r--Documentation/translations/zh_CN/power/energy-model.rst210
-rw-r--r--Documentation/translations/zh_CN/power/index.rst56
-rw-r--r--Documentation/translations/zh_CN/power/opp.rst341
-rw-r--r--Documentation/translations/zh_CN/process/5.Posting.rst11
-rw-r--r--Documentation/translations/zh_CN/process/8.Conclusion.rst1
-rw-r--r--Documentation/translations/zh_CN/process/coding-style.rst274
-rw-r--r--Documentation/translations/zh_CN/process/email-clients.rst265
-rw-r--r--Documentation/translations/zh_CN/process/embargoed-hardware-issues.rst2
-rw-r--r--Documentation/translations/zh_CN/process/howto.rst21
-rw-r--r--Documentation/translations/zh_CN/process/index.rst2
-rw-r--r--Documentation/translations/zh_CN/process/magic-number.rst73
-rw-r--r--Documentation/translations/zh_CN/process/management-style.rst4
-rw-r--r--Documentation/translations/zh_CN/process/programming-language.rst3
-rw-r--r--Documentation/translations/zh_CN/process/submit-checklist.rst84
-rw-r--r--Documentation/translations/zh_CN/process/submitting-drivers.rst160
-rw-r--r--Documentation/translations/zh_CN/process/submitting-patches.rst747
-rw-r--r--Documentation/translations/zh_CN/riscv/index.rst2
-rw-r--r--Documentation/translations/zh_CN/riscv/pmu.rst235
-rw-r--r--Documentation/translations/zh_CN/riscv/vm-layout.rst104
-rw-r--r--Documentation/translations/zh_CN/rust/arch-support.rst23
-rw-r--r--Documentation/translations/zh_CN/rust/coding-guidelines.rst192
-rw-r--r--Documentation/translations/zh_CN/rust/general-information.rst75
-rw-r--r--Documentation/translations/zh_CN/rust/index.rst28
-rw-r--r--Documentation/translations/zh_CN/rust/quick-start.rst211
-rw-r--r--Documentation/translations/zh_CN/scheduler/completion.rst256
-rw-r--r--Documentation/translations/zh_CN/scheduler/index.rst45
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-arch.rst74
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-bwc.rst204
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-capacity.rst390
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-debug.rst51
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-design-CFS.rst205
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-domains.rst72
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-energy.rst351
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-nice-design.rst99
-rw-r--r--Documentation/translations/zh_CN/scheduler/sched-stats.rst156
-rw-r--r--Documentation/translations/zh_CN/scheduler/schedutil.rst165
-rw-r--r--Documentation/translations/zh_CN/sparse.txt91
-rw-r--r--Documentation/translations/zh_CN/staging/index.rst26
-rw-r--r--Documentation/translations/zh_CN/staging/xz.rst100
-rw-r--r--Documentation/translations/zh_CN/userspace-api/accelerators/ocxl.rst168
-rw-r--r--Documentation/translations/zh_CN/userspace-api/ebpf/index.rst22
-rw-r--r--Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst29
-rw-r--r--Documentation/translations/zh_CN/userspace-api/futex2.rst80
-rw-r--r--Documentation/translations/zh_CN/userspace-api/index.rst50
-rw-r--r--Documentation/translations/zh_CN/userspace-api/no_new_privs.rst57
-rw-r--r--Documentation/translations/zh_CN/userspace-api/seccomp_filter.rst293
-rw-r--r--Documentation/translations/zh_CN/userspace-api/sysfs-platform_profile.rst40
-rw-r--r--Documentation/translations/zh_TW/admin-guide/security-bugs.rst2
-rw-r--r--Documentation/translations/zh_TW/filesystems/sysfs.txt4
-rw-r--r--Documentation/translations/zh_TW/gpio.txt35
-rw-r--r--Documentation/translations/zh_TW/index.rst16
-rw-r--r--Documentation/translations/zh_TW/oops-tracing.txt212
-rw-r--r--Documentation/translations/zh_TW/process/5.Posting.rst3
-rw-r--r--Documentation/translations/zh_TW/process/8.Conclusion.rst1
-rw-r--r--Documentation/translations/zh_TW/process/embargoed-hardware-issues.rst2
-rw-r--r--Documentation/translations/zh_TW/process/howto.rst7
-rw-r--r--Documentation/translations/zh_TW/process/index.rst1
-rw-r--r--Documentation/translations/zh_TW/process/magic-number.rst74
-rw-r--r--Documentation/translations/zh_TW/process/programming-language.rst3
-rw-r--r--Documentation/translations/zh_TW/process/submitting-drivers.rst164
-rw-r--r--Documentation/translations/zh_TW/process/submitting-patches.rst32
-rw-r--r--Documentation/usb/CREDITS6
-rw-r--r--Documentation/usb/chipidea.rst19
-rw-r--r--Documentation/usb/functionfs.rst2
-rw-r--r--Documentation/usb/gadget-testing.rst23
-rw-r--r--Documentation/usb/gadget_configfs.rst10
-rw-r--r--Documentation/usb/gadget_multi.rst2
-rw-r--r--Documentation/usb/gadget_uvc.rst380
-rw-r--r--Documentation/usb/index.rst1
-rw-r--r--Documentation/usb/mass-storage.rst11
-rw-r--r--Documentation/usb/usbip_protocol.rst13
-rw-r--r--Documentation/usb/usbmon.rst2
-rw-r--r--Documentation/userspace-api/ELF.rst34
-rw-r--r--Documentation/userspace-api/futex2.rst86
-rw-r--r--Documentation/userspace-api/index.rst4
-rw-r--r--Documentation/userspace-api/ioctl/cdrom.rst119
-rw-r--r--Documentation/userspace-api/ioctl/ioctl-number.rst17
-rw-r--r--Documentation/userspace-api/iommufd.rst223
-rw-r--r--Documentation/userspace-api/landlock.rst238
-rw-r--r--Documentation/userspace-api/media/Makefile3
-rw-r--r--Documentation/userspace-api/media/cec.h.rst.exceptions2
-rw-r--r--Documentation/userspace-api/media/cec/cec-ioc-receive.rst49
-rw-r--r--Documentation/userspace-api/media/cec/cec-pin-error-inj.rst2
-rw-r--r--Documentation/userspace-api/media/drivers/aspeed-video.rst65
-rw-r--r--Documentation/userspace-api/media/drivers/cx2341x-uapi.rst8
-rw-r--r--Documentation/userspace-api/media/drivers/dw100.rst84
-rw-r--r--Documentation/userspace-api/media/drivers/hantro.rst19
-rw-r--r--Documentation/userspace-api/media/drivers/index.rst5
-rw-r--r--Documentation/userspace-api/media/drivers/meye-uapi.rst53
-rw-r--r--Documentation/userspace-api/media/drivers/st-vgxy61.rst25
-rw-r--r--Documentation/userspace-api/media/drivers/uvcvideo.rst2
-rw-r--r--Documentation/userspace-api/media/dvb/fe_property_parameters.rst25
-rw-r--r--Documentation/userspace-api/media/frontend.h.rst.exceptions28
-rw-r--r--Documentation/userspace-api/media/lirc.h.rst.exceptions8
-rw-r--r--Documentation/userspace-api/media/mediactl/media-controller-model.rst6
-rw-r--r--Documentation/userspace-api/media/mediactl/media-types.rst17
-rw-r--r--Documentation/userspace-api/media/rc/lirc-dev-intro.rst19
-rw-r--r--Documentation/userspace-api/media/rc/lirc-func.rst1
-rw-r--r--Documentation/userspace-api/media/rc/lirc-get-features.rst18
-rw-r--r--Documentation/userspace-api/media/rc/lirc-set-rec-timeout-reports.rst49
-rw-r--r--Documentation/userspace-api/media/rc/lirc-set-wideband-receiver.rst2
-rw-r--r--Documentation/userspace-api/media/rc/rc-protos.rst2
-rw-r--r--Documentation/userspace-api/media/rc/rc-tables.rst2
-rw-r--r--Documentation/userspace-api/media/v4l/async.rst9
-rw-r--r--Documentation/userspace-api/media/v4l/biblio.rst10
-rw-r--r--Documentation/userspace-api/media/v4l/buffer.rst51
-rw-r--r--Documentation/userspace-api/media/v4l/capture.c.rst52
-rw-r--r--Documentation/userspace-api/media/v4l/control.rst13
-rw-r--r--Documentation/userspace-api/media/v4l/dev-decoder.rst26
-rw-r--r--Documentation/userspace-api/media/v4l/dev-overlay.rst10
-rw-r--r--Documentation/userspace-api/media/v4l/dev-raw-vbi.rst2
-rw-r--r--Documentation/userspace-api/media/v4l/dev-sdr.rst2
-rw-r--r--Documentation/userspace-api/media/v4l/dev-sliced-vbi.rst4
-rw-r--r--Documentation/userspace-api/media/v4l/dev-subdev.rst166
-rw-r--r--Documentation/userspace-api/media/v4l/ext-ctrls-camera.rst8
-rw-r--r--Documentation/userspace-api/media/v4l/ext-ctrls-codec-stateless.rst1498
-rw-r--r--Documentation/userspace-api/media/v4l/ext-ctrls-codec.rst745
-rw-r--r--Documentation/userspace-api/media/v4l/ext-ctrls-image-source.rst20
-rw-r--r--Documentation/userspace-api/media/v4l/ext-ctrls-jpeg.rst2
-rw-r--r--Documentation/userspace-api/media/v4l/hist-v4l2.rst6
-rw-r--r--Documentation/userspace-api/media/v4l/io.rst4
-rw-r--r--Documentation/userspace-api/media/v4l/libv4l-introduction.rst6
-rw-r--r--Documentation/userspace-api/media/v4l/mmap.rst2
-rw-r--r--Documentation/userspace-api/media/v4l/pixfmt-compressed.rst47
-rw-r--r--Documentation/userspace-api/media/v4l/pixfmt-packed-yuv.rst97
-rw-r--r--Documentation/userspace-api/media/v4l/pixfmt-reserved.rst66
-rw-r--r--Documentation/userspace-api/media/v4l/pixfmt-rgb.rst235
-rw-r--r--Documentation/userspace-api/media/v4l/pixfmt-yuv-luma.rst35
-rw-r--r--Documentation/userspace-api/media/v4l/pixfmt-yuv-planar.rst256
-rw-r--r--Documentation/userspace-api/media/v4l/subdev-formats.rst306
-rw-r--r--Documentation/userspace-api/media/v4l/user-func.rst2
-rw-r--r--Documentation/userspace-api/media/v4l/v4l2grab.c.rst10
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-create-bufs.rst7
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-cropcap.rst2
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-dqevent.rst5
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-g-ctrl.rst3
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-g-ext-ctrls.rst41
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-g-fbuf.rst52
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-qbuf.rst2
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-querycap.rst3
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-queryctrl.rst26
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-reqbufs.rst16
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-streamon.rst3
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-subdev-enum-frame-interval.rst5
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-subdev-enum-frame-size.rst49
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-subdev-enum-mbus-code.rst44
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-subdev-g-client-cap.rst83
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-subdev-g-crop.rst5
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-subdev-g-fmt.rst5
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-subdev-g-frame-interval.rst5
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-subdev-g-routing.rst147
-rw-r--r--Documentation/userspace-api/media/v4l/vidioc-subdev-g-selection.rst5
-rw-r--r--Documentation/userspace-api/media/videodev2.h.rst.exceptions14
-rw-r--r--Documentation/userspace-api/netlink/c-code-gen.rst107
-rw-r--r--Documentation/userspace-api/netlink/genetlink-legacy.rst260
-rw-r--r--Documentation/userspace-api/netlink/index.rst18
-rw-r--r--Documentation/userspace-api/netlink/intro-specs.rst80
-rw-r--r--Documentation/userspace-api/netlink/intro.rst681
-rw-r--r--Documentation/userspace-api/netlink/specs.rst445
-rw-r--r--Documentation/userspace-api/seccomp_filter.rst10
-rw-r--r--Documentation/userspace-api/sysfs-platform_profile.rst2
-rw-r--r--Documentation/virt/coco/sev-guest.rst161
-rw-r--r--Documentation/virt/coco/tdx-guest.rst52
-rw-r--r--Documentation/virt/hyperv/clocks.rst73
-rw-r--r--Documentation/virt/hyperv/index.rst12
-rw-r--r--Documentation/virt/hyperv/overview.rst207
-rw-r--r--Documentation/virt/hyperv/vmbus.rst303
-rw-r--r--Documentation/virt/index.rst9
-rw-r--r--Documentation/virt/kvm/amd-memory-encryption.rst439
-rw-r--r--Documentation/virt/kvm/api.rst1931
-rw-r--r--Documentation/virt/kvm/arm/hyp-abi.rst11
-rw-r--r--Documentation/virt/kvm/arm/hypercalls.rst138
-rw-r--r--Documentation/virt/kvm/arm/index.rst2
-rw-r--r--Documentation/virt/kvm/arm/psci.rst77
-rw-r--r--Documentation/virt/kvm/arm/pvtime.rst14
-rw-r--r--Documentation/virt/kvm/devices/arm-vgic-its.rst5
-rw-r--r--Documentation/virt/kvm/devices/vcpu.rst108
-rw-r--r--Documentation/virt/kvm/devices/vfio.rst5
-rw-r--r--Documentation/virt/kvm/devices/vm.rst86
-rw-r--r--Documentation/virt/kvm/devices/xics.rst2
-rw-r--r--Documentation/virt/kvm/devices/xive.rst2
-rw-r--r--Documentation/virt/kvm/halt-polling.rst13
-rw-r--r--Documentation/virt/kvm/hypercalls.rst192
-rw-r--r--Documentation/virt/kvm/index.rst27
-rw-r--r--Documentation/virt/kvm/locking.rst84
-rw-r--r--Documentation/virt/kvm/s390/index.rst13
-rw-r--r--Documentation/virt/kvm/s390/s390-diag.rst (renamed from Documentation/virt/kvm/s390-diag.rst)0
-rw-r--r--Documentation/virt/kvm/s390/s390-pv-boot.rst (renamed from Documentation/virt/kvm/s390-pv-boot.rst)2
-rw-r--r--Documentation/virt/kvm/s390/s390-pv-dump.rst64
-rw-r--r--Documentation/virt/kvm/s390/s390-pv.rst (renamed from Documentation/virt/kvm/s390-pv.rst)0
-rw-r--r--Documentation/virt/kvm/vcpu-requests.rst43
-rw-r--r--Documentation/virt/kvm/x86/amd-memory-encryption.rst446
-rw-r--r--Documentation/virt/kvm/x86/cpuid.rst (renamed from Documentation/virt/kvm/cpuid.rst)0
-rw-r--r--Documentation/virt/kvm/x86/errata.rst50
-rw-r--r--Documentation/virt/kvm/x86/hypercalls.rst192
-rw-r--r--Documentation/virt/kvm/x86/index.rst18
-rw-r--r--Documentation/virt/kvm/x86/mmu.rst (renamed from Documentation/virt/kvm/mmu.rst)14
-rw-r--r--Documentation/virt/kvm/x86/msr.rst (renamed from Documentation/virt/kvm/msr.rst)0
-rw-r--r--Documentation/virt/kvm/x86/nested-vmx.rst (renamed from Documentation/virt/kvm/nested-vmx.rst)0
-rw-r--r--Documentation/virt/kvm/x86/running-nested-guests.rst (renamed from Documentation/virt/kvm/running-nested-guests.rst)4
-rw-r--r--Documentation/virt/kvm/x86/timekeeping.rst (renamed from Documentation/virt/kvm/timekeeping.rst)0
-rw-r--r--Documentation/virt/ne_overview.rst21
-rw-r--r--Documentation/virt/uml/user_mode_linux_howto_v2.rst145
-rw-r--r--Documentation/vm/.gitignore3
-rw-r--r--Documentation/vm/active_mm.rst91
-rw-r--r--Documentation/vm/balance.rst102
-rw-r--r--Documentation/vm/cleancache.rst296
-rw-r--r--Documentation/vm/damon/design.rst166
-rw-r--r--Documentation/vm/damon/faq.rst51
-rw-r--r--Documentation/vm/damon/index.rst30
-rw-r--r--Documentation/vm/free_page_reporting.rst40
-rw-r--r--Documentation/vm/frontswap.rst293
-rw-r--r--Documentation/vm/highmem.rst147
-rw-r--r--Documentation/vm/hmm.rst452
-rw-r--r--Documentation/vm/hugetlbfs_reserv.rst596
-rw-r--r--Documentation/vm/hwpoison.rst185
-rw-r--r--Documentation/vm/index.rst56
-rw-r--r--Documentation/vm/ksm.rst87
-rw-r--r--Documentation/vm/memory-model.rst177
-rw-r--r--Documentation/vm/mmu_notifier.rst99
-rw-r--r--Documentation/vm/numa.rst150
-rw-r--r--Documentation/vm/overcommit-accounting.rst87
-rw-r--r--Documentation/vm/page_frags.rst45
-rw-r--r--Documentation/vm/page_migration.rst288
-rw-r--r--Documentation/vm/page_owner.rst89
-rw-r--r--Documentation/vm/remap_file_pages.rst33
-rw-r--r--Documentation/vm/split_page_table_lock.rst100
-rw-r--r--Documentation/vm/unevictable-lru.rst605
-rw-r--r--Documentation/vm/z3fold.rst30
-rw-r--r--Documentation/vm/zsmalloc.rst82
-rw-r--r--Documentation/w1/masters/ds2490.rst2
-rw-r--r--Documentation/w1/masters/w1-gpio.rst2
-rw-r--r--Documentation/w1/slaves/w1_therm.rst9
-rw-r--r--Documentation/w1/w1-generic.rst2
-rw-r--r--Documentation/watchdog/hpwdt.rst8
-rw-r--r--Documentation/watchdog/index.rst6
-rw-r--r--Documentation/watchdog/watchdog-parameters.rst12
-rw-r--r--Documentation/x86/amd-memory-encryption.rst97
-rw-r--r--Documentation/x86/index.rst39
-rw-r--r--Documentation/x86/intel-iommu.rst115
-rw-r--r--Documentation/x86/microcode.rst142
-rw-r--r--Documentation/x86/x86_64/machinecheck.rst85
5839 files changed, 333520 insertions, 105690 deletions
diff --git a/Documentation/ABI/obsolete/o2cb b/Documentation/ABI/obsolete/o2cb
new file mode 100644
index 000000000000..fe7e45e17bc7
--- /dev/null
+++ b/Documentation/ABI/obsolete/o2cb
@@ -0,0 +1,11 @@
+What: /sys/o2cb
+Date: Dec 2005
+KernelVersion: 2.6.16
+Contact: ocfs2-devel@oss.oracle.com
+Description: Ocfs2-tools looks at 'interface-revision' for versioning
+ information. Each logmask/ file controls a set of debug prints
+ and can be written into with the strings "allow", "deny", or
+ "off". Reading the file returns the current state.
+ Was renamed to /sys/fs/o2cb/
+Users: ocfs2-tools. It's sufficient to mail proposed changes to
+ ocfs2-devel@oss.oracle.com.
diff --git a/Documentation/ABI/obsolete/procfs-i8k b/Documentation/ABI/obsolete/procfs-i8k
new file mode 100644
index 000000000000..32df4d5bdd15
--- /dev/null
+++ b/Documentation/ABI/obsolete/procfs-i8k
@@ -0,0 +1,10 @@
+What: /proc/i8k
+Date: November 2001
+KernelVersion: 2.4.14
+Contact: Pali Rohár <pali@kernel.org>
+Description: Legacy interface for getting/setting sensor information like
+ fan speed, temperature, serial number, hotkey status, etc.,
+ on Dell laptops.
+ Since the driver is now using the standard hwmon sysfs interface,
+ the procfs interface is deprecated.
+Users: https://github.com/vitorafsr/i8kutils
diff --git a/Documentation/ABI/obsolete/sysfs-bus-iio b/Documentation/ABI/obsolete/sysfs-bus-iio
index c9531bb64816..b64394b0b374 100644
--- a/Documentation/ABI/obsolete/sysfs-bus-iio
+++ b/Documentation/ABI/obsolete/sysfs-bus-iio
@@ -6,6 +6,7 @@ Description:
Since Kernel 5.11, multiple buffers are supported.
so, it is better to use, instead:
+
/sys/bus/iio/devices/iio:deviceX/bufferY/length
What: /sys/bus/iio/devices/iio:deviceX/buffer/enable
@@ -17,6 +18,7 @@ Description:
Since Kernel 5.11, multiple buffers are supported.
so, it is better to use, instead:
+
/sys/bus/iio/devices/iio:deviceX/bufferY/enable
What: /sys/bus/iio/devices/iio:deviceX/scan_elements
@@ -165,6 +167,7 @@ Description:
Since Kernel 5.11, multiple buffers are supported.
so, it is better to use, instead:
+
/sys/bus/iio/devices/iio:deviceX/bufferY/watermark
What: /sys/bus/iio/devices/iio:deviceX/buffer/data_available
@@ -179,4 +182,5 @@ Description:
Since Kernel 5.11, multiple buffers are supported.
so, it is better to use, instead:
+
/sys/bus/iio/devices/iio:deviceX/bufferY/data_available
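For scripts migrating off the deprecated attributes, a minimal sketch (the device name iio:device0 and the helper are illustrative, not part of the ABI) that prefers the per-buffer path and falls back to the legacy one on pre-5.11 kernels::

    import os

    def read_buffer_length(dev="/sys/bus/iio/devices/iio:device0"):
        # Prefer the multi-buffer ABI (kernel >= 5.11); fall back to the
        # deprecated single-buffer path on older kernels.
        for attr in (os.path.join(dev, "buffer0", "length"),
                     os.path.join(dev, "buffer", "length")):
            try:
                with open(attr) as f:
                    return int(f.read())
            except FileNotFoundError:
                continue
        raise FileNotFoundError("no IIO buffer length attribute found")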
diff --git a/Documentation/ABI/obsolete/sysfs-class-dax b/Documentation/ABI/obsolete/sysfs-class-dax
deleted file mode 100644
index 5bcce27458e3..000000000000
--- a/Documentation/ABI/obsolete/sysfs-class-dax
+++ /dev/null
@@ -1,22 +0,0 @@
-What: /sys/class/dax/
-Date: May, 2016
-KernelVersion: v4.7
-Contact: nvdimm@lists.linux.dev
-Description: Device DAX is the device-centric analogue of Filesystem
- DAX (CONFIG_FS_DAX). It allows memory ranges to be
- allocated and mapped without need of an intervening file
- system. Device DAX is strict, precise and predictable.
- Specifically this interface:
-
- 1. Guarantees fault granularity with respect to a given
- page size (pte, pmd, or pud) set at configuration time.
-
- 2. Enforces deterministic behavior by being strict about
- what fault scenarios are supported.
-
- The /sys/class/dax/ interface enumerates all the
- device-dax instances in the system. The ABI is
- deprecated and will be removed after 2020. It is
- replaced with the DAX bus interface /sys/bus/dax/ where
- device-dax instances can be found under
- /sys/bus/dax/devices/
diff --git a/Documentation/ABI/removed/sysfs-mce b/Documentation/ABI/removed/sysfs-mce
new file mode 100644
index 000000000000..ef5dd2a80918
--- /dev/null
+++ b/Documentation/ABI/removed/sysfs-mce
@@ -0,0 +1,37 @@
+What: /sys/devices/system/machinecheck/machinecheckX/tolerant
+Contact: Borislav Petkov <bp@suse.de>
+Date: Dec, 2021
+Description:
+ Unused and obsolete after the advent of recoverable machine
+ checks (see the last sentence below), which have been present
+ since 2010 (Nehalem).
+
+ Original description:
+
+ The entries appear for each CPU, but they are truly shared
+ between all CPUs.
+
+ Tolerance level. When a machine check exception occurs for a
+ non-corrected machine check, the kernel can take different
+ actions.
+
+ Since machine check exceptions can happen any time it is
+ sometimes risky for the kernel to kill a process because it
+ defies normal kernel locking rules. The tolerance level
+ configures how hard the kernel tries to recover even at some
+ risk of deadlock. Higher tolerant values trade potentially
+ better uptime with the risk of a crash or even corruption
+ (for tolerant >= 3).
+
+ == ===========================================================
+ 0 always panic on uncorrected errors, log corrected errors
+ 1 panic or SIGBUS on uncorrected errors, log corrected errors
+ 2 SIGBUS or log uncorrected errors, log corrected errors
+ 3 never panic or SIGBUS, log all errors (for testing only)
+ == ===========================================================
+
+ Default: 1
+
+ Note this only makes a difference if the CPU allows recovery
+ from a machine check exception. Current x86 CPUs generally
+ do not.
diff --git a/Documentation/ABI/obsolete/sysfs-selinux-checkreqprot b/Documentation/ABI/removed/sysfs-selinux-checkreqprot
index ed6b52ca210f..f599a0a87e8b 100644
--- a/Documentation/ABI/obsolete/sysfs-selinux-checkreqprot
+++ b/Documentation/ABI/removed/sysfs-selinux-checkreqprot
@@ -4,6 +4,9 @@ KernelVersion: 2.6.12-rc2 (predates git)
Contact: selinux@vger.kernel.org
Description:
+ REMOVAL UPDATE: The SELinux checkreqprot functionality was removed in
+ March 2023; the original deprecation notice is shown below.
+
The selinuxfs "checkreqprot" node allows SELinux to be configured
to check the protection requested by userspace for mmap/mprotect
calls instead of the actual protection applied by the kernel.
diff --git a/Documentation/ABI/obsolete/sysfs-selinux-disable b/Documentation/ABI/removed/sysfs-selinux-disable
index c340278e3cf8..cb783c64cab3 100644
--- a/Documentation/ABI/obsolete/sysfs-selinux-disable
+++ b/Documentation/ABI/removed/sysfs-selinux-disable
@@ -4,6 +4,9 @@ KernelVersion: 2.6.12-rc2 (predates git)
Contact: selinux@vger.kernel.org
Description:
+ REMOVAL UPDATE: The SELinux runtime disable functionality was removed
+ in March 2023; the original deprecation notice is shown below.
+
The selinuxfs "disable" node allows SELinux to be disabled at runtime
prior to a policy being loaded into the kernel. If disabled via this
mechanism, SELinux will remain disabled until the system is rebooted.
diff --git a/Documentation/ABI/stable/o2cb b/Documentation/ABI/stable/o2cb
index 5eb1545e0b8d..b62a967f01a0 100644
--- a/Documentation/ABI/stable/o2cb
+++ b/Documentation/ABI/stable/o2cb
@@ -1,4 +1,4 @@
-What: /sys/fs/o2cb/ (was /sys/o2cb)
+What: /sys/fs/o2cb/
Date: Dec 2005
KernelVersion: 2.6.16
Contact: ocfs2-devel@oss.oracle.com
diff --git a/Documentation/ABI/stable/sysfs-acpi-pmprofile b/Documentation/ABI/stable/sysfs-acpi-pmprofile
index 2d6314f0e4e4..cd55e421d921 100644
--- a/Documentation/ABI/stable/sysfs-acpi-pmprofile
+++ b/Documentation/ABI/stable/sysfs-acpi-pmprofile
@@ -2,16 +2,17 @@ What: /sys/firmware/acpi/pm_profile
Date: 03-Nov-2011
KernelVersion: v3.2
Contact: linux-acpi@vger.kernel.org
-Description: The ACPI pm_profile sysfs interface exports the platform
- power management (and performance) requirement expectations
- as provided by BIOS. The integer value is directly passed as
- retrieved from the FADT ACPI table.
+Description: The ACPI pm_profile sysfs interface exposes the preferred
+ power management (and performance) profile of the platform
+ as provided in the ACPI FADT Preferred_PM_Profile field.
-Values: For possible values see ACPI specification:
- 5.2.9 Fixed ACPI Description Table (FADT)
- Field: Preferred_PM_Profile
+ The integer value is directly passed as retrieved from the FADT.
- Currently these values are defined by spec:
+Values: For the possible values refer to the Preferred_PM_Profile field
+ definition in Table 5.9 "FADT Format", Section 5.2.9 "Fixed ACPI
+ Description Table (FADT)" of the ACPI specification.
+
+ As of ACPI 6.5, the following values are defined:
== =================
0 Unspecified
@@ -22,5 +23,6 @@ Values: For possible values see ACPI specification:
5 SOHO Server
6 Appliance PC
7 Performance Server
- >7 Reserved
+ 8 Tablet
+ >8 Reserved
== =================
diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
new file mode 100644
index 000000000000..c57e5b7cb532
--- /dev/null
+++ b/Documentation/ABI/stable/sysfs-block
@@ -0,0 +1,737 @@
+What: /sys/block/<disk>/alignment_offset
+Date: April 2009
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ Storage devices may report a physical block size that is
+ bigger than the logical block size (for instance a drive
+ with 4KB physical sectors exposing 512-byte logical
+ blocks to the operating system). This parameter
+ indicates how many bytes the beginning of the device is
+ offset from the disk's natural alignment.
+
+
+What: /sys/block/<disk>/discard_alignment
+Date: May 2011
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ Devices that support discard functionality may
+ internally allocate space in units that are bigger than
+ the exported logical block size. The discard_alignment
+ parameter indicates how many bytes the beginning of the
+ device is offset from the internal allocation unit's
+ natural alignment.
+
+
+What: /sys/block/<disk>/diskseq
+Date: February 2021
+Contact: Matteo Croce <mcroce@microsoft.com>
+Description:
+ The /sys/block/<disk>/diskseq file reports the disk
+ sequence number, which is a monotonically increasing
+ number assigned to every drive.
+ Some devices, like the loop device, refresh this number
+ every time the backing file is changed.
+ The value type is 64 bit unsigned.
+
+
+What: /sys/block/<disk>/inflight
+Date: October 2009
+Contact: Jens Axboe <axboe@kernel.dk>, Nikanth Karthikesan <knikanth@suse.de>
+Description:
+ Reports the number of I/O requests currently in progress
+ (pending / in flight) in a device driver. This can be less
+ than the number of requests queued in the block device queue.
+ The report contains 2 fields: one for read requests
+ and one for write requests.
+ The value type is unsigned int.
+ Cf. Documentation/block/stat.rst which contains a single value for
+ requests in flight.
+ This is related to /sys/block/<disk>/queue/nr_requests
+ and for SCSI device also its queue_depth.
+
+
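As a sketch of the layout described above (the disk name sda and the helper name are assumptions, not part of the ABI), the two fields split cleanly into read and write counts::

    def read_inflight(disk="sda"):
        # The file holds two whitespace-separated unsigned ints:
        # requests in flight for reads, then for writes.
        with open(f"/sys/block/{disk}/inflight") as f:
            reads, writes = (int(x) for x in f.read().split())
        return reads, writes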
+What: /sys/block/<disk>/integrity/device_is_integrity_capable
+Date: July 2014
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ Indicates whether a storage device is capable of storing
+ integrity metadata. Set if the device is T10 PI-capable.
+
+
+What: /sys/block/<disk>/integrity/format
+Date: June 2008
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ Metadata format for integrity capable block device.
+ E.g. T10-DIF-TYPE1-CRC.
+
+
+What: /sys/block/<disk>/integrity/protection_interval_bytes
+Date: July 2015
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ Describes the number of data bytes which are protected
+ by one integrity tuple. Typically the device's logical
+ block size.
+
+
+What: /sys/block/<disk>/integrity/read_verify
+Date: June 2008
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ Indicates whether the block layer should verify the
+ integrity of read requests serviced by devices that
+ support sending integrity metadata.
+
+
+What: /sys/block/<disk>/integrity/tag_size
+Date: June 2008
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ Number of bytes of integrity tag space available per
+ 512 bytes of data.
+
+
+What: /sys/block/<disk>/integrity/write_generate
+Date: June 2008
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ Indicates whether the block layer should automatically
+ generate checksums for write requests bound for
+ devices that support receiving integrity metadata.
+
+
+What: /sys/block/<disk>/<partition>/alignment_offset
+Date: April 2009
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ Storage devices may report a physical block size that is
+ bigger than the logical block size (for instance a drive
+ with 4KB physical sectors exposing 512-byte logical
+ blocks to the operating system). This parameter
+ indicates how many bytes the beginning of the partition
+ is offset from the disk's natural alignment.
+
+
+What: /sys/block/<disk>/<partition>/discard_alignment
+Date: May 2011
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ Devices that support discard functionality may
+ internally allocate space in units that are bigger than
+ the exported logical block size. The discard_alignment
+ parameter indicates how many bytes the beginning of the
+ partition is offset from the internal allocation unit's
+ natural alignment.
+
+
+What: /sys/block/<disk>/<partition>/stat
+Date: February 2008
+Contact: Jerome Marchand <jmarchan@redhat.com>
+Description:
+ The /sys/block/<disk>/<partition>/stat files display the
+ I/O statistics of partition <partition>. The format is the
+ same as the format of /sys/block/<disk>/stat.
+
+
+What: /sys/block/<disk>/queue/add_random
+Date: June 2010
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] This file allows turning off the disk entropy contribution.
+ The default value of this file is '1' (on).
+
+
+What: /sys/block/<disk>/queue/chunk_sectors
+Date: September 2016
+Contact: Hannes Reinecke <hare@suse.com>
+Description:
+ [RO] chunk_sectors has different meaning depending on the type
+ of the disk. For a RAID device (dm-raid), chunk_sectors
+ indicates the size in 512B sectors of the RAID volume stripe
+ segment. For a zoned block device, either host-aware or
+ host-managed, chunk_sectors indicates the size in 512B sectors
+ of the zones of the device, with the eventual exception of the
+ last zone of the device which may be smaller.
+
+
+What: /sys/block/<disk>/queue/crypto/
+Date: February 2022
+Contact: linux-block@vger.kernel.org
+Description:
+ The presence of this subdirectory of /sys/block/<disk>/queue/
+ indicates that the device supports inline encryption. This
+ subdirectory contains files which describe the inline encryption
+ capabilities of the device. For more information about inline
+ encryption, refer to Documentation/block/inline-encryption.rst.
+
+
+What: /sys/block/<disk>/queue/crypto/max_dun_bits
+Date: February 2022
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] This file shows the maximum length, in bits, of data unit
+ numbers accepted by the device in inline encryption requests.
+
+
+What: /sys/block/<disk>/queue/crypto/modes/<mode>
+Date: February 2022
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] For each crypto mode (i.e., encryption/decryption
+ algorithm) the device supports with inline encryption, a file
+ will exist at this location. It will contain a hexadecimal
+ number that is a bitmask of the supported data unit sizes, in
+ bytes, for that crypto mode.
+
+ Currently, the crypto modes that may be supported are:
+
+ * AES-256-XTS
+ * AES-128-CBC-ESSIV
+ * Adiantum
+
+ For example, if a device supports AES-256-XTS inline encryption
+ with data unit sizes of 512 and 4096 bytes, the file
+ /sys/block/<disk>/queue/crypto/modes/AES-256-XTS will exist and
+ will contain "0x1200".
+
+
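To make the bitmask encoding concrete, a small sketch decoding the example value quoted above (the helper is illustrative)::

    def supported_data_unit_sizes(mask):
        # Bit i set in the mask means a data unit size of 2**i bytes is
        # supported; 0x1200 has bits 9 and 12 set -> 512 and 4096 bytes.
        return [1 << i for i in range(mask.bit_length()) if mask & (1 << i)]

    assert supported_data_unit_sizes(0x1200) == [512, 4096]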
+What: /sys/block/<disk>/queue/crypto/num_keyslots
+Date: February 2022
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] This file shows the number of keyslots the device has for
+ use with inline encryption.
+
+
+What: /sys/block/<disk>/queue/dax
+Date: June 2016
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] This file indicates whether the device supports Direct
+ Access (DAX), used by CPU-addressable storage to bypass the
+ pagecache. It shows '1' if true, '0' if not.
+
+
+What: /sys/block/<disk>/queue/discard_granularity
+Date: May 2011
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ [RO] Devices that support discard functionality may internally
+ allocate space using units that are bigger than the logical
+ block size. The discard_granularity parameter indicates the size
+ of the internal allocation unit in bytes if reported by the
+ device. Otherwise the discard_granularity will be set to match
+ the device's physical block size. A discard_granularity of 0
+ means that the device does not support discard functionality.
+
+
+What: /sys/block/<disk>/queue/discard_max_bytes
+Date: May 2011
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ [RW] While discard_max_hw_bytes is the hardware limit for the
+ device, this setting is the software limit. Some devices exhibit
+ large latencies when large discards are issued; setting this
+ value lower will make Linux issue smaller discards and
+ potentially help reduce the latencies induced by large discard
+ operations.
+
+
+What: /sys/block/<disk>/queue/discard_max_hw_bytes
+Date: July 2015
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] Devices that support discard functionality may have
+ internal limits on the number of bytes that can be trimmed or
+ unmapped in a single operation. The `discard_max_hw_bytes`
+ parameter is set by the device driver to the maximum number of
+ bytes that can be discarded in a single operation. Discard
+ requests issued to the device must not exceed this limit. A
+ `discard_max_hw_bytes` value of 0 means that the device does not
+ support discard functionality.
+
+
+What: /sys/block/<disk>/queue/discard_zeroes_data
+Date: May 2011
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ [RO] Will always return 0. Don't rely on any specific behavior
+ for discards, and don't read this file.
+
+
+What: /sys/block/<disk>/queue/dma_alignment
+Date: May 2022
+Contact: linux-block@vger.kernel.org
+Description:
+ Reports the alignment that user space addresses must have to be
+ used for raw block device access with O_DIRECT and other driver
+ specific passthrough mechanisms.
+
+
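A sketch of how an O_DIRECT reader might honour this mask (the disk name, buffer strategy, and sizes are assumptions; Linux-only)::

    import mmap, os

    def direct_read(disk="sda", nbytes=4096):
        with open(f"/sys/block/{disk}/queue/dma_alignment") as f:
            mask = int(f.read())  # addresses must be aligned to mask + 1
        # An anonymous mmap is page aligned, which satisfies any mask
        # smaller than the page size (the common case).
        assert mmap.PAGESIZE % (mask + 1) == 0
        buf = mmap.mmap(-1, nbytes)
        fd = os.open(f"/dev/{disk}", os.O_RDONLY | os.O_DIRECT)
        try:
            os.readv(fd, [buf])  # nbytes is kept a logical-block multiple
            return bytes(buf)
        finally:
            os.close(fd)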
+What: /sys/block/<disk>/queue/fua
+Date: May 2018
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] Whether or not the block driver supports the FUA flag for
+ write requests. FUA stands for Force Unit Access. If the FUA
+ flag is set that means that write requests must bypass the
+ volatile cache of the storage device.
+
+
+What: /sys/block/<disk>/queue/hw_sector_size
+Date: January 2008
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] This is the hardware sector size of the device, in bytes.
+
+
+What: /sys/block/<disk>/queue/independent_access_ranges/
+Date: October 2021
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] The presence of this sub-directory of the
+ /sys/block/<disk>/queue/ directory indicates that the device is
+ capable of executing requests targeting different sector ranges
+ in parallel. For instance, single LUN multi-actuator hard-disks
+ will have an independent_access_ranges directory if the device
+ correctly advertises the sector ranges of its actuators.
+
+ The independent_access_ranges directory contains one directory
+ per access range, with each range described using the sector
+ (RO) attribute file to indicate the first sector of the range
+ and the nr_sectors (RO) attribute file to indicate the total
+ number of sectors in the range starting from the first sector of
+ the range. For example, a dual-actuator hard-disk will have the
+ following independent_access_ranges entries::
+
+ $ tree /sys/block/<disk>/queue/independent_access_ranges/
+ /sys/block/<disk>/queue/independent_access_ranges/
+ |-- 0
+ | |-- nr_sectors
+ | `-- sector
+ `-- 1
+ |-- nr_sectors
+ `-- sector
+
+ The sector and nr_sectors attributes use 512B sector unit,
+ regardless of the actual block size of the device. Independent
+ access ranges do not overlap and include all sectors within the
+ device capacity. The access ranges are numbered in increasing
+ order of the range start sector, that is, the sector attribute
+ of range 0 always has the value 0.
+
+
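A sketch iterating the layout shown in the tree above (the disk name and helper are illustrative)::

    import os

    def access_ranges(disk="sda"):
        # Yield (start_sector, nr_sectors) per range, in 512B units.
        base = f"/sys/block/{disk}/queue/independent_access_ranges"
        for entry in sorted(os.listdir(base), key=int):
            with open(os.path.join(base, entry, "sector")) as f:
                start = int(f.read())
            with open(os.path.join(base, entry, "nr_sectors")) as f:
                count = int(f.read())
            yield start, count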
+What: /sys/block/<disk>/queue/io_poll
+Date: November 2015
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] When read, this file shows whether polling is enabled (1)
+ or disabled (0). Writing '0' to this file will disable polling
+ for this device. Writing any non-zero value will enable this
+ feature.
+
+
+What: /sys/block/<disk>/queue/io_poll_delay
+Date: November 2016
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] This was used to control what kind of polling will be
+ performed. It is now fixed to -1, which is classic polling.
+ In this mode, the CPU will repeatedly ask for completions
+ without giving up any time.
+ <deprecated>
+
+
+What: /sys/block/<disk>/queue/io_timeout
+Date: November 2018
+Contact: Weiping Zhang <zhangweiping@didiglobal.com>
+Description:
+ [RW] io_timeout is the request timeout in milliseconds. If a
+ request does not complete in this time then the block driver
+ timeout handler is invoked. That timeout handler can decide to
+ retry the request, to fail it or to start a device recovery
+ strategy.
+
+
+What: /sys/block/<disk>/queue/iostats
+Date: January 2009
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] This file is used to control (on/off) the iostats
+ accounting of the disk.
+
+
+What: /sys/block/<disk>/queue/logical_block_size
+Date: May 2009
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ [RO] This is the smallest unit the storage device can address.
+ It is typically 512 bytes.
+
+
+What: /sys/block/<disk>/queue/max_active_zones
+Date: July 2020
+Contact: Niklas Cassel <niklas.cassel@wdc.com>
+Description:
+ [RO] For zoned block devices (zoned attribute indicating
+ "host-managed" or "host-aware"), the sum of zones belonging to
+ any of the zone states: EXPLICIT OPEN, IMPLICIT OPEN or CLOSED,
+ is limited by this value. If this value is 0, there is no limit.
+
+ If the host attempts to exceed this limit, the driver should
+ report this error with BLK_STS_ZONE_ACTIVE_RESOURCE, which user
+ space may see as the EOVERFLOW errno.
+
+
+What: /sys/block/<disk>/queue/max_discard_segments
+Date: February 2017
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] The maximum number of DMA scatter/gather entries in a
+ discard request.
+
+
+What: /sys/block/<disk>/queue/max_hw_sectors_kb
+Date: September 2004
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] This is the maximum number of kilobytes supported in a
+ single data transfer.
+
+
+What: /sys/block/<disk>/queue/max_integrity_segments
+Date: September 2010
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] Maximum number of elements in a DMA scatter/gather list
+ with integrity data that will be submitted by the block layer
+ core to the associated block driver.
+
+
+What: /sys/block/<disk>/queue/max_open_zones
+Date: July 2020
+Contact: Niklas Cassel <niklas.cassel@wdc.com>
+Description:
+ [RO] For zoned block devices (zoned attribute indicating
+ "host-managed" or "host-aware"), the sum of zones belonging to
+ any of the zone states: EXPLICIT OPEN or IMPLICIT OPEN, is
+ limited by this value. If this value is 0, there is no limit.
+
+
+What: /sys/block/<disk>/queue/max_sectors_kb
+Date: September 2004
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] This is the maximum number of kilobytes that the block
+ layer will allow for a filesystem request. Must be smaller than
+ or equal to the maximum size allowed by the hardware. Write 0
+ to use default kernel settings.
+
+
+What: /sys/block/<disk>/queue/max_segment_size
+Date: March 2010
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] Maximum size in bytes of a single element in a DMA
+ scatter/gather list.
+
+
+What: /sys/block/<disk>/queue/max_segments
+Date: March 2010
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] Maximum number of elements in a DMA scatter/gather list
+ that is submitted to the associated block driver.
+
+
+What: /sys/block/<disk>/queue/minimum_io_size
+Date: April 2009
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ [RO] Storage devices may report a granularity or preferred
+ minimum I/O size which is the smallest request the device can
+ perform without incurring a performance penalty. For disk
+ drives this is often the physical block size. For RAID arrays
+ it is often the stripe chunk size. A properly aligned multiple
+ of minimum_io_size is the preferred request size for workloads
+ where a high number of I/O operations is desired.
+
+
+What: /sys/block/<disk>/queue/nomerges
+Date: January 2010
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] Standard I/O elevator operations include attempts to merge
+ contiguous I/Os. For known random I/O loads these attempts will
+ always fail and result in extra cycles being spent in the
+ kernel. This allows one to turn off this behavior in one of two
+ ways: When set to 1, complex merge checks are disabled, but the
+ simple one-shot merges with the previous I/O request are
+ enabled. When set to 2, all merge tries are disabled. The
+ default value is 0 - which enables all types of merge tries.
+
+
+What: /sys/block/<disk>/queue/nr_requests
+Date: July 2003
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] This controls how many requests may be allocated in the
+ block layer for read or write requests. Note that the total
+ allocated number may be twice this amount, since it applies only
+ to reads or writes (not the accumulated sum).
+
+ To avoid priority inversion through request starvation, a
+ request queue maintains a separate request pool per each cgroup
+ when CONFIG_BLK_CGROUP is enabled, and this parameter applies to
+ each such per-block-cgroup request pool. IOW, if there are N
+ block cgroups, each request queue may have up to N request
+ pools, each independently regulated by nr_requests.
+
+
+What: /sys/block/<disk>/queue/nr_zones
+Date: November 2018
+Contact: Damien Le Moal <damien.lemoal@wdc.com>
+Description:
+ [RO] nr_zones indicates the total number of zones of a zoned
+ block device ("host-aware" or "host-managed" zone model). For
+ regular block devices, the value is always 0.
+
+
+What: /sys/block/<disk>/queue/optimal_io_size
+Date: April 2009
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ [RO] Storage devices may report an optimal I/O size, which is
+ the device's preferred unit for sustained I/O. This is rarely
+ reported for disk drives. For RAID arrays it is usually the
+ stripe width or the internal track size. A properly aligned
+ multiple of optimal_io_size is the preferred request size for
+ workloads where sustained throughput is desired. If no optimal
+ I/O size is reported this file contains 0.
+
+
+What: /sys/block/<disk>/queue/physical_block_size
+Date: May 2009
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ [RO] This is the smallest unit a physical storage device can
+ write atomically. It is usually the same as the logical block
+ size but may be bigger. One example is SATA drives with 4KB
+ sectors that expose a 512-byte logical block size to the
+ operating system. For stacked block devices the
+ physical_block_size variable contains the maximum
+ physical_block_size of the component devices.
+
+
+What: /sys/block/<disk>/queue/read_ahead_kb
+Date: May 2004
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] Maximum number of kilobytes to read-ahead for filesystems
+ on this block device.
+
+
+What: /sys/block/<disk>/queue/rotational
+Date: January 2009
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] This file is used to indicate whether the device is of
+ rotational or non-rotational type.
+
+
+What: /sys/block/<disk>/queue/rq_affinity
+Date: September 2008
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] If this option is '1', the block layer will migrate request
+ completions to the cpu "group" that originally submitted the
+ request. For some workloads this provides a significant
+ reduction in CPU cycles due to caching effects.
+
+ For storage configurations that need to maximize distribution of
+ completion processing, setting this option to '2' forces the
+ completion to run on the requesting cpu (bypassing the "group"
+ aggregation logic).
+
+
+What: /sys/block/<disk>/queue/scheduler
+Date: October 2004
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] When read, this file will display the current and available
+ IO schedulers for this block device. The currently active IO
+ scheduler will be enclosed in [] brackets. Writing an IO
+ scheduler name to this file will switch control of this block
+ device to that new IO scheduler. Note that writing an IO
+ scheduler name to this file will attempt to load that IO
+ scheduler module, if it isn't already present in the system.
+
+
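A sketch of reading the bracketed active entry and switching schedulers (the names and the disk are assumptions; writing requires root)::

    def current_scheduler(disk="sda"):
        # "none [mq-deadline] kyber" -> "mq-deadline"
        with open(f"/sys/block/{disk}/queue/scheduler") as f:
            names = f.read().split()
        return next(n[1:-1] for n in names if n.startswith("["))

    def set_scheduler(disk, name):
        # Writing a scheduler name switches to it, loading the module
        # if needed.
        with open(f"/sys/block/{disk}/queue/scheduler", "w") as f:
            f.write(name)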
+What: /sys/block/<disk>/queue/stable_writes
+Date: September 2020
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] This file will contain '1' if memory must not be modified
+ while it is being used in a write request to this device. When
+ this is the case and the kernel is performing writeback of a
+ page, the kernel will wait for writeback to complete before
+ allowing the page to be modified again, rather than allowing
+ immediate modification as is normally the case. This
+ restriction arises when the device accesses the memory multiple
+ times where the same data must be seen every time -- for
+ example, once to calculate a checksum and once to actually write
+ the data. If no such restriction exists, this file will contain
+ '0'. This file is writable for testing purposes.
+
+
+What: /sys/block/<disk>/queue/throttle_sample_time
+Date: March 2017
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] This is the time window over which blk-throttle samples
+ data, in milliseconds. blk-throttle makes decisions based on
+ these samples. A lower time means cgroups have smoother
+ throughput, but higher CPU overhead. This exists only when
+ CONFIG_BLK_DEV_THROTTLING_LOW is enabled.
+
+
+What: /sys/block/<disk>/queue/virt_boundary_mask
+Date: April 2021
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] This file shows the I/O segment memory alignment mask for
+ the block device. I/O requests to this device will be split
+ between segments wherever either the memory address of the end
+ of the previous segment or the memory address of the beginning
+ of the current segment is not aligned to virt_boundary_mask + 1
+ bytes.
+
+
+What: /sys/block/<disk>/queue/wbt_lat_usec
+Date: November 2016
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] If the device is registered for writeback throttling, then
+ this file shows the target minimum read latency. If this latency
+ is exceeded in a given window of time (see wb_window_usec), then
+ the writeback throttling will start scaling back writes. Writing
+ a value of '0' to this file disables the feature. Writing a
+ value of '-1' to this file resets the value to the default
+ setting.
+
+
+What: /sys/block/<disk>/queue/write_cache
+Date: April 2016
+Contact: linux-block@vger.kernel.org
+Description:
+ [RW] When read, this file will display whether the device has
+ write back caching enabled or not. It will return "write back"
+ for the former case, and "write through" for the latter. Writing
+ to this file can change the kernel's view of the device, but it
+ doesn't alter the device state. This means that it might not be
+ safe to toggle the setting from "write back" to "write through",
+ since that will also eliminate cache flushes issued by the
+ kernel.
+
+
+What: /sys/block/<disk>/queue/write_same_max_bytes
+Date: January 2012
+Contact: Martin K. Petersen <martin.petersen@oracle.com>
+Description:
+ [RO] Some devices support a write same operation in which a
+ single data block can be written to a range of several
+ contiguous blocks on storage. This can be used to wipe areas on
+ disk or to initialize drives in a RAID configuration.
+ write_same_max_bytes indicates how many bytes can be written in
+ a single write same command. If write_same_max_bytes is 0, write
+ same is not supported by the device.
+
+
+What: /sys/block/<disk>/queue/write_zeroes_max_bytes
+Date: November 2016
+Contact: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
+Description:
+ [RO] Some devices support a write zeroes operation in which a
+ single request can be issued to zero out a range of contiguous
+ blocks on storage without carrying any payload in the request.
+ This can be used to optimize writing zeroes to the device.
+ write_zeroes_max_bytes indicates how many bytes can be written
+ in a single write zeroes command. If write_zeroes_max_bytes is
+ 0, write zeroes is not supported by the device.
+
+
+What: /sys/block/<disk>/queue/zone_append_max_bytes
+Date: May 2020
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] This is the maximum number of bytes that can be written to
+ a sequential zone of a zoned block device using a zone append
+ write operation (REQ_OP_ZONE_APPEND). This value is always 0 for
+ regular block devices.
+
+
+What: /sys/block/<disk>/queue/zone_write_granularity
+Date: January 2021
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] This indicates the alignment constraint, in bytes, for
+ write operations in sequential zones of zoned block devices
+ (devices with a zoned attribute that reports "host-managed" or
+ "host-aware"). This value is always 0 for regular block devices.
+
+
+What: /sys/block/<disk>/queue/zoned
+Date: September 2016
+Contact: Damien Le Moal <damien.lemoal@wdc.com>
+Description:
+ [RO] zoned indicates if the device is a zoned block device and
+ the zone model of the device if it is indeed zoned. The
+ possible values indicated by zoned are "none" for regular block
+ devices and "host-aware" or "host-managed" for zoned block
+ devices. The characteristics of host-aware and host-managed
+ zoned block devices are described in the ZBC (Zoned Block
+ Commands) and ZAC (Zoned Device ATA Command Set) standards.
+ These standards also define the "drive-managed" zone model.
+ However, since drive-managed zoned block devices do not support
+ zone commands, they will be treated as regular block devices and
+ zoned will report "none".
+
+
+What: /sys/block/<disk>/hidden
+Date: March 2023
+Contact: linux-block@vger.kernel.org
+Description:
+ [RO] The block device is hidden. It doesn't produce events and
+ can't be opened from userspace or via blkdev_get*.
+ Used for the underlying components of multipath devices.
+
+
+What: /sys/block/<disk>/stat
+Date: February 2008
+Contact: Jerome Marchand <jmarchan@redhat.com>
+Description:
+ The /sys/block/<disk>/stat file displays the I/O
+ statistics of disk <disk>. It contains 17 fields:
+
+ == ==============================================
+ 1 reads completed successfully
+ 2 reads merged
+ 3 sectors read
+ 4 time spent reading (ms)
+ 5 writes completed
+ 6 writes merged
+ 7 sectors written
+ 8 time spent writing (ms)
+ 9 I/Os currently in progress
+ 10 time spent doing I/Os (ms)
+ 11 weighted time spent doing I/Os (ms)
+ 12 discards completed
+ 13 discards merged
+ 14 sectors discarded
+ 15 time spent discarding (ms)
+ 16 flush requests completed
+ 17 time spent flushing (ms)
+ == ==============================================
+
+ For more details refer to Documentation/admin-guide/iostats.rst
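A sketch parsing the fields in order (the disk name and the descriptive labels below are the editor's, not kernel names; older kernels expose fewer fields, which zip() tolerates)::

    FIELDS = (
        "reads_completed", "reads_merged", "sectors_read", "read_time_ms",
        "writes_completed", "writes_merged", "sectors_written",
        "write_time_ms", "in_flight", "io_time_ms", "weighted_io_time_ms",
        "discards_completed", "discards_merged", "sectors_discarded",
        "discard_time_ms", "flushes_completed", "flush_time_ms",
    )

    def disk_stats(disk="sda"):
        with open(f"/sys/block/{disk}/stat") as f:
            return dict(zip(FIELDS, (int(x) for x in f.read().split())))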
diff --git a/Documentation/ABI/stable/sysfs-bus-mhi b/Documentation/ABI/stable/sysfs-bus-mhi
index ecfe7662f8d0..96ccc3385a2b 100644
--- a/Documentation/ABI/stable/sysfs-bus-mhi
+++ b/Documentation/ABI/stable/sysfs-bus-mhi
@@ -19,3 +19,13 @@ Description: The file holds the OEM PK Hash value of the endpoint device
read without having the device power on at least once, the file
will read all 0's.
Users: Any userspace application or clients interested in device info.
+
+What: /sys/bus/mhi/devices/.../soc_reset
+Date: April 2022
+KernelVersion: 5.19
+Contact: mhi@lists.linux.dev
+Description: Initiates a SoC reset on the MHI controller. A SoC reset is
+ a reset of last resort, and will require a complete re-init.
+ This can be useful as a method of recovery if the device is
+ non-responsive, or as a means of loading new firmware as a
+ system administration task.
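A sketch of triggering the reset from a management script (the device name mhi0 is hypothetical, and it is assumed that any written value initiates the reset)::

    def soc_reset(device="mhi0"):
        with open(f"/sys/bus/mhi/devices/{device}/soc_reset", "w") as f:
            f.write("1")  # assumption: any write triggers the reset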
diff --git a/Documentation/ABI/stable/sysfs-class-infiniband b/Documentation/ABI/stable/sysfs-class-infiniband
index 9b1bdfa43354..ebf08c604336 100644
--- a/Documentation/ABI/stable/sysfs-class-infiniband
+++ b/Documentation/ABI/stable/sysfs-class-infiniband
@@ -232,10 +232,10 @@ Description: The RoCE type of the associated GID resides at index <gid-index>.
or "RoCE v2" for RoCE v2 based GIDs.
-What: /sys/class/infiniband_mad/umadN/ibdev
-What: /sys/class/infiniband_mad/umadN/port
-What: /sys/class/infiniband_mad/issmN/ibdev
-What: /sys/class/infiniband_mad/issmN/port
+What: /sys/class/infiniband_mad/umad<N>/ibdev
+What: /sys/class/infiniband_mad/umad<N>/port
+What: /sys/class/infiniband_mad/issm<N>/ibdev
+What: /sys/class/infiniband_mad/issm<N>/port
Date: Apr, 2005
KernelVersion: v2.6.12
Contact: linux-rdma@vger.kernel.org
@@ -261,8 +261,8 @@ Description:
userspace ABI compatibility of umad & issm devices.
-What: /sys/class/infiniband_verbs/uverbsN/ibdev
-What: /sys/class/infiniband_verbs/uverbsN/abi_version
+What: /sys/class/infiniband_verbs/uverbs<N>/ibdev
+What: /sys/class/infiniband_verbs/uverbs<N>/abi_version
Date: Sept, 2005
KernelVersion: v2.6.14
Contact: linux-rdma@vger.kernel.org
@@ -471,7 +471,7 @@ Description:
=============== ======================================================
-What: /sys/class/infiniband/qibX/ports/N/sl2vl/[0-15]
+What: /sys/class/infiniband/qibX/ports/<N>/sl2vl/[0-15]
Date: May, 2010
KernelVersion: v2.6.35
Contact: linux-rdma@vger.kernel.org
@@ -480,8 +480,8 @@ Description:
the Service Level (SL). Listing the SL files returns the Virtual
Lane (VL) as programmed by the SL.
-What: /sys/class/infiniband/qibX/ports/N/CCMgtA/cc_settings_bin
-What: /sys/class/infiniband/qibX/ports/N/CCMgtA/cc_table_bin
+What: /sys/class/infiniband/qibX/ports/<N>/CCMgtA/cc_settings_bin
+What: /sys/class/infiniband/qibX/ports/<N>/CCMgtA/cc_table_bin
Date: May, 2010
KernelVersion: v2.6.35
Contact: linux-rdma@vger.kernel.org
@@ -499,11 +499,11 @@ Description:
delay.
=============== ================================================
-What: /sys/class/infiniband/qibX/ports/N/linkstate/loopback
-What: /sys/class/infiniband/qibX/ports/N/linkstate/led_override
-What: /sys/class/infiniband/qibX/ports/N/linkstate/hrtbt_enable
-What: /sys/class/infiniband/qibX/ports/N/linkstate/status
-What: /sys/class/infiniband/qibX/ports/N/linkstate/status_str
+What: /sys/class/infiniband/qibX/ports/<N>/linkstate/loopback
+What: /sys/class/infiniband/qibX/ports/<N>/linkstate/led_override
+What: /sys/class/infiniband/qibX/ports/<N>/linkstate/hrtbt_enable
+What: /sys/class/infiniband/qibX/ports/<N>/linkstate/status
+What: /sys/class/infiniband/qibX/ports/<N>/linkstate/status_str
Date: May, 2010
KernelVersion: v2.6.35
Contact: linux-rdma@vger.kernel.org
@@ -523,16 +523,16 @@ Description:
"Fatal_Hardware_Error".
=============== ===============================================
-What: /sys/class/infiniband/qibX/ports/N/diag_counters/rc_resends
-What: /sys/class/infiniband/qibX/ports/N/diag_counters/seq_naks
-What: /sys/class/infiniband/qibX/ports/N/diag_counters/rdma_seq
-What: /sys/class/infiniband/qibX/ports/N/diag_counters/rnr_naks
-What: /sys/class/infiniband/qibX/ports/N/diag_counters/other_naks
-What: /sys/class/infiniband/qibX/ports/N/diag_counters/rc_timeouts
-What: /sys/class/infiniband/qibX/ports/N/diag_counters/look_pkts
-What: /sys/class/infiniband/qibX/ports/N/diag_counters/pkt_drops
-What: /sys/class/infiniband/qibX/ports/N/diag_counters/dma_wait
-What: /sys/class/infiniband/qibX/ports/N/diag_counters/unaligned
+What: /sys/class/infiniband/qibX/ports/<N>/diag_counters/rc_resends
+What: /sys/class/infiniband/qibX/ports/<N>/diag_counters/seq_naks
+What: /sys/class/infiniband/qibX/ports/<N>/diag_counters/rdma_seq
+What: /sys/class/infiniband/qibX/ports/<N>/diag_counters/rnr_naks
+What: /sys/class/infiniband/qibX/ports/<N>/diag_counters/other_naks
+What: /sys/class/infiniband/qibX/ports/<N>/diag_counters/rc_timeouts
+What: /sys/class/infiniband/qibX/ports/<N>/diag_counters/look_pkts
+What: /sys/class/infiniband/qibX/ports/<N>/diag_counters/pkt_drops
+What: /sys/class/infiniband/qibX/ports/<N>/diag_counters/dma_wait
+What: /sys/class/infiniband/qibX/ports/<N>/diag_counters/unaligned
Date: May, 2010
KernelVersion: v2.6.35
Contact: linux-rdma@vger.kernel.org
@@ -650,9 +650,9 @@ Description:
=============== =============================================
-What: /sys/class/infiniband/hfi1_X/ports/N/CCMgtA/cc_settings_bin
-What: /sys/class/infiniband/hfi1_X/ports/N/CCMgtA/cc_table_bin
-What: /sys/class/infiniband/hfi1_X/ports/N/CCMgtA/cc_prescan
+What: /sys/class/infiniband/hfi1_X/ports/<N>/CCMgtA/cc_settings_bin
+What: /sys/class/infiniband/hfi1_X/ports/<N>/CCMgtA/cc_table_bin
+What: /sys/class/infiniband/hfi1_X/ports/<N>/CCMgtA/cc_prescan
Date: May, 2016
KernelVersion: v4.6
Contact: linux-rdma@vger.kernel.org
@@ -675,9 +675,9 @@ Description:
disable.
=============== ================================================
-What: /sys/class/infiniband/hfi1_X/ports/N/sc2vl/[0-31]
-What: /sys/class/infiniband/hfi1_X/ports/N/sl2sc/[0-31]
-What: /sys/class/infiniband/hfi1_X/ports/N/vl2mtu/[0-15]
+What: /sys/class/infiniband/hfi1_X/ports/<N>/sc2vl/[0-31]
+What: /sys/class/infiniband/hfi1_X/ports/<N>/sl2sc/[0-31]
+What: /sys/class/infiniband/hfi1_X/ports/<N>/vl2mtu/[0-15]
Date: May, 2016
KernelVersion: v4.6
Contact: linux-rdma@vger.kernel.org
@@ -691,8 +691,8 @@ Description:
=============== ===================================================
-What: /sys/class/infiniband/hfi1_X/sdma_N/cpu_list
-What: /sys/class/infiniband/hfi1_X/sdma_N/vl
+What: /sys/class/infiniband/hfi1_X/sdma_<N>/cpu_list
+What: /sys/class/infiniband/hfi1_X/sdma_<N>/vl
Date: Sept, 2016
KernelVersion: v4.8
Contact: linux-rdma@vger.kernel.org
diff --git a/Documentation/ABI/stable/sysfs-class-tpm b/Documentation/ABI/stable/sysfs-class-tpm
index d897ecb9615f..411d5895bed4 100644
--- a/Documentation/ABI/stable/sysfs-class-tpm
+++ b/Documentation/ABI/stable/sysfs-class-tpm
@@ -195,7 +195,7 @@ Description: The "tpm_version_major" property shows the TCG spec major version
2
-What: /sys/class/tpm/tpmX/pcr-H/N
+What: /sys/class/tpm/tpmX/pcr-<H>/<N>
Date: March 2021
KernelVersion: 5.12
Contact: linux-integrity@vger.kernel.org
diff --git a/Documentation/ABI/stable/sysfs-devices b/Documentation/ABI/stable/sysfs-devices
index 42bf1eab5677..98a8ef99ac5f 100644
--- a/Documentation/ABI/stable/sysfs-devices
+++ b/Documentation/ABI/stable/sysfs-devices
@@ -23,3 +23,10 @@ Contact: Device Tree mailing list <devicetree@vger.kernel.org>
Description:
If CONFIG_OF is enabled, then this file is present. When
read, it returns the full name of the device node.
+
+What: /sys/devices/*/dev
+Date: Jun 2006
+Contact: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+Description:
+ Major and minor numbers of the character device corresponding
+ to the device (in <major>:<minor> format).
diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 484fc04bcc25..402af4b2b905 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -176,3 +176,48 @@ Contact: Keith Busch <keith.busch@intel.com>
Description:
The cache write policy: 0 for write-back, 1 for write-through,
other or unknown.
+
+What: /sys/devices/system/node/nodeX/x86/sgx_total_bytes
+Date: November 2021
+Contact: Jarkko Sakkinen <jarkko@kernel.org>
+Description:
+ The total amount of SGX physical memory in bytes.
+
+What: /sys/devices/system/node/nodeX/memory_failure/total
+Date: January 2023
+Contact: Jiaqi Yan <jiaqiyan@google.com>
+Description:
+ The total number of raw poisoned pages (pages containing
+ corrupted data due to memory errors) on a NUMA node.
+
+What: /sys/devices/system/node/nodeX/memory_failure/ignored
+Date: January 2023
+Contact: Jiaqi Yan <jiaqiyan@google.com>
+Description:
+ Of the raw poisoned pages on a NUMA node, how many pages are
+ ignored by the memory error recovery attempt, usually because
+ support for this type of page is unavailable and the kernel
+ gives up on recovery.
+
+What: /sys/devices/system/node/nodeX/memory_failure/failed
+Date: January 2023
+Contact: Jiaqi Yan <jiaqiyan@google.com>
+Description:
+ Of the raw poisoned pages on a NUMA node, how many pages are
+ failed by the memory error recovery attempt. This usually means
+ a key recovery operation failed.
+
+What: /sys/devices/system/node/nodeX/memory_failure/delayed
+Date: January 2023
+Contact: Jiaqi Yan <jiaqiyan@google.com>
+Description:
+ Of the raw poisoned pages on a NUMA node, how many pages are
+ delayed by the memory error recovery attempt. Delayed poisoned
+ pages will usually be retried by the kernel.
+
+What: /sys/devices/system/node/nodeX/memory_failure/recovered
+Date: January 2023
+Contact: Jiaqi Yan <jiaqiyan@google.com>
+Description:
+ Of the raw poisoned pages on a NUMA node, how many pages are
+ recovered by the memory error recovery attempt.
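A sketch aggregating these per-node counters system-wide (the helper name is illustrative)::

    import glob, os

    def memory_failure_totals():
        totals = {}
        for node in glob.glob("/sys/devices/system/node/node*/memory_failure"):
            for name in ("total", "ignored", "failed", "delayed", "recovered"):
                with open(os.path.join(node, name)) as f:
                    totals[name] = totals.get(name, 0) + int(f.read())
        return totals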
diff --git a/Documentation/ABI/stable/sysfs-devices-system-cpu b/Documentation/ABI/stable/sysfs-devices-system-cpu
index 516dafea03eb..902392d7eddf 100644
--- a/Documentation/ABI/stable/sysfs-devices-system-cpu
+++ b/Documentation/ABI/stable/sysfs-devices-system-cpu
@@ -42,6 +42,12 @@ Description: the CPU core ID of cpuX. Typically it is the hardware platform's
architecture and platform dependent.
Values: integer
+What: /sys/devices/system/cpu/cpuX/topology/cluster_id
+Description: the cluster ID of cpuX. Typically it is the hardware platform's
+ identifier (rather than the kernel's). The actual value is
+ architecture and platform dependent.
+Values: integer
+
What: /sys/devices/system/cpu/cpuX/topology/book_id
Description: the book ID of cpuX. Typically it is the hardware platform's
identifier (rather than the kernel's). The actual value is
@@ -80,11 +86,24 @@ What: /sys/devices/system/cpu/cpuX/topology/die_cpus
Description: internal kernel map of CPUs within the same die.
Values: hexadecimal bitmask.
+What: /sys/devices/system/cpu/cpuX/topology/ppin
+Description: per-socket protected processor inventory number
+Values: hexadecimal.
+
What: /sys/devices/system/cpu/cpuX/topology/die_cpus_list
Description: human-readable list of CPUs within the same die.
The format is like 0-3, 8-11, 14,17.
Values: decimal list.
+What: /sys/devices/system/cpu/cpuX/topology/cluster_cpus
+Description: internal kernel map of CPUs within the same cluster.
+Values: hexadecimal bitmask.
+
+What: /sys/devices/system/cpu/cpuX/topology/cluster_cpus_list
+Description: human-readable list of CPUs within the same cluster.
+ The format is like 0-3, 8-11, 14,17.
+Values: decimal list.
+
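A sketch expanding the "decimal list" format quoted above, e.g. "0-3, 8-11, 14,17" (the helper is illustrative)::

    def parse_cpu_list(text):
        # Expand "0-3, 8-11, 14,17" into individual CPU numbers.
        cpus = []
        for part in text.replace(" ", "").split(","):
            if not part:
                continue
            if "-" in part:
                lo, hi = part.split("-")
                cpus.extend(range(int(lo), int(hi) + 1))
            else:
                cpus.append(int(part))
        return cpus

    assert parse_cpu_list("0-3, 8-11, 14,17") == [0, 1, 2, 3, 8, 9, 10, 11, 14, 17]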
What: /sys/devices/system/cpu/cpuX/topology/book_siblings
Description: internal kernel map of cpuX's hardware threads within the same
book_id. it's only used on s390.
diff --git a/Documentation/ABI/stable/sysfs-driver-dma-idxd b/Documentation/ABI/stable/sysfs-driver-dma-idxd
index df4afbccf037..534b7a3d59fc 100644
--- a/Documentation/ABI/stable/sysfs-driver-dma-idxd
+++ b/Documentation/ABI/stable/sysfs-driver-dma-idxd
@@ -22,6 +22,7 @@ Date: Oct 25, 2019
KernelVersion: 5.6.0
Contact: dmaengine@vger.kernel.org
Description: The largest number of work descriptors in a batch.
+ It's not visible when the device does not support batch.
What: /sys/bus/dsa/devices/dsa<m>/max_work_queues_size
Date: Oct 25, 2019
@@ -41,14 +42,16 @@ KernelVersion: 5.6.0
Contact: dmaengine@vger.kernel.org
Description: The maximum number of groups that can be created under this device.
-What: /sys/bus/dsa/devices/dsa<m>/max_tokens
-Date: Oct 25, 2019
-KernelVersion: 5.6.0
+What: /sys/bus/dsa/devices/dsa<m>/max_read_buffers
+Date: Dec 10, 2021
+KernelVersion: 5.17.0
Contact: dmaengine@vger.kernel.org
-Description: The total number of bandwidth tokens supported by this device.
- The bandwidth tokens represent resources within the DSA
+Description: The total number of read buffers supported by this device.
+ The read buffers represent resources within the DSA
implementation, and these resources are allocated by engines to
- support operations.
+ support operations. See DSA spec v1.2 9.2.4 Total Read Buffers.
+ It's not visible when the device does not support Read Buffer
+ allocation control.
What: /sys/bus/dsa/devices/dsa<m>/max_transfer_size
Date: Oct 25, 2019
@@ -115,13 +118,15 @@ KernelVersion: 5.6.0
Contact: dmaengine@vger.kernel.org
Description: To indicate if this device is configurable or not.
-What: /sys/bus/dsa/devices/dsa<m>/token_limit
-Date: Oct 25, 2019
-KernelVersion: 5.6.0
+What: /sys/bus/dsa/devices/dsa<m>/read_buffer_limit
+Date: Dec 10, 2021
+KernelVersion: 5.17.0
Contact: dmaengine@vger.kernel.org
-Description: The maximum number of bandwidth tokens that may be in use at
+Description: The maximum number of read buffers that may be in use at
one time by operations that access low bandwidth memory in the
- device.
+ device. See DSA spec v1.2 9.2.8 GENCFG on Global Read Buffer Limit.
+ It's not visible when the device does not support Read Buffer
+ allocation control.
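+
+		A hedged usage sketch (dsa0 and the limit value are assumed
+		examples):
+
+		# echo 8 > /sys/bus/dsa/devices/dsa0/read_buffer_limit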
What: /sys/bus/dsa/devices/dsa<m>/cmd_status
Date: Aug 28, 2020
@@ -131,6 +136,22 @@ Description: The last executed device administrative command's status/error.
Also last configuration error overloaded.
Writing to it will clear the status.
+What: /sys/bus/dsa/devices/dsa<m>/iaa_cap
+Date: Sept 14, 2022
+KernelVersion: 6.0.0
+Contact: dmaengine@vger.kernel.org
+Description: IAA (IAX) capability mask. Exported to user space for application
+ consumption. This attribute should only be visible on IAA devices
+ that are version 2 or later.
+
+What: /sys/bus/dsa/devices/dsa<m>/event_log_size
+Date: Sept 14, 2022
+KernelVersion: 6.4.0
+Contact: dmaengine@vger.kernel.org
+Description:	The event log size to be configured. The default is 64 entries,
+		which occupy 4KB if each event log entry is 64 bytes. It's
+		visible only on platforms that support the capability.
+
What: /sys/bus/dsa/devices/wq<m>.<n>/block_on_fault
Date: Oct 27, 2020
KernelVersion: 5.11.0
@@ -205,6 +226,7 @@ KernelVersion: 5.10.0
Contact: dmaengine@vger.kernel.org
Description: The max batch size for this workqueue. Cannot exceed device
max batch size. Configurable parameter.
+		It's not visible when the device does not support batching.
What: /sys/bus/dsa/devices/wq<m>.<n>/ats_disable
Date: Nov 13, 2020
@@ -213,6 +235,16 @@ Contact: dmaengine@vger.kernel.org
Description: Indicate whether ATS disable is turned on for the workqueue.
0 indicates ATS is on, and 1 indicates ATS is off for the workqueue.
+What: /sys/bus/dsa/devices/wq<m>.<n>/prs_disable
+Date: Sept 14, 2022
+KernelVersion: 6.4.0
+Contact: dmaengine@vger.kernel.org
+Description: Controls whether PRS disable is turned on for the workqueue.
+ 0 indicates PRS is on, and 1 indicates PRS is off for the
+		workqueue. This option overrides the block_on_fault attribute
+ if set. It's visible only on platforms that support the
+ capability.
+
What: /sys/bus/dsa/devices/wq<m>.<n>/occupancy
Date May 25, 2021
KernelVersion: 5.14.0
@@ -220,8 +252,104 @@ Contact: dmaengine@vger.kernel.org
Description: Show the current number of entries in this WQ if WQ Occupancy
Support bit WQ capabilities is 1.
+What: /sys/bus/dsa/devices/wq<m>.<n>/enqcmds_retries
+Date:		Oct 29, 2021
+KernelVersion: 5.17.0
+Contact: dmaengine@vger.kernel.org
+Description:	Indicates the number of retries for an enqcmds submission on a
+		shared wq. The maximum value that can be set is capped at 64.
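+
+		A hedged usage sketch (wq0.0 and the retry count are assumed
+		examples):
+
+		# echo 32 > /sys/bus/dsa/devices/wq0.0/enqcmds_retries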
+
+What: /sys/bus/dsa/devices/wq<m>.<n>/op_config
+Date: Sept 14, 2022
+KernelVersion: 6.0.0
+Contact: dmaengine@vger.kernel.org
+Description:	Shows the operation capability bits displayed in bitmap format,
+		as presented by the %*pb printk() output format specifier.
+		The attribute can be written when the WQ is disabled in
+		order to configure the WQ to accept specific bits that
+		correlate to the operations allowed. It's visible only
+		on platforms that support the capability.
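+
+		A hedged usage sketch (wq0.0 and the bitmap value are assumed
+		examples; the WQ must be disabled before writing):
+
+		# cat /sys/bus/dsa/devices/wq0.0/op_config
+		00000000,00000000,00000000,0000007b
+		# echo 0000007b > /sys/bus/dsa/devices/wq0.0/op_config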
+
What: /sys/bus/dsa/devices/engine<m>.<n>/group_id
Date: Oct 25, 2019
KernelVersion: 5.6.0
Contact: dmaengine@vger.kernel.org
Description: The group that this engine belongs to.
+
+What: /sys/bus/dsa/devices/group<m>.<n>/use_read_buffer_limit
+Date: Dec 10, 2021
+KernelVersion: 5.17.0
+Contact: dmaengine@vger.kernel.org
+Description: Enable the use of global read buffer limit for the group. See DSA
+ spec v1.2 9.2.18 GRPCFG Use Global Read Buffer Limit.
+ It's not visible when the device does not support Read Buffer
+ allocation control.
+
+What: /sys/bus/dsa/devices/group<m>.<n>/read_buffers_allowed
+Date: Dec 10, 2021
+KernelVersion: 5.17.0
+Contact: dmaengine@vger.kernel.org
+Description:	Indicates the maximum number of read buffers that may be in use
+		at one time
+ by all engines in the group. See DSA spec v1.2 9.2.18 GRPCFG Read
+ Buffers Allowed.
+ It's not visible when the device does not support Read Buffer
+ allocation control.
+
+What: /sys/bus/dsa/devices/group<m>.<n>/read_buffers_reserved
+Date: Dec 10, 2021
+KernelVersion: 5.17.0
+Contact: dmaengine@vger.kernel.org
+Description: Indicates the number of Read Buffers reserved for the use of
+ engines in the group. See DSA spec v1.2 9.2.18 GRPCFG Read Buffers
+ Reserved.
+ It's not visible when the device does not support Read Buffer
+ allocation control.
+
+What: /sys/bus/dsa/devices/group<m>.<n>/desc_progress_limit
+Date: Sept 14, 2022
+KernelVersion: 6.0.0
+Contact: dmaengine@vger.kernel.org
+Description: Allows control of the number of work descriptors that can be
+ concurrently processed by an engine in the group as a fraction
+ of the Maximum Work Descriptors in Progress value specified in
+ the ENGCAP register. The acceptable values are 0 (default),
+ 1 (1/2 of max value), 2 (1/4 of the max value), and 3 (1/8 of
+ the max value). It's visible only on platforms that support
+ the capability.
+
+What: /sys/bus/dsa/devices/group<m>.<n>/batch_progress_limit
+Date: Sept 14, 2022
+KernelVersion: 6.0.0
+Contact: dmaengine@vger.kernel.org
+Description: Allows control of the number of batch descriptors that can be
+ concurrently processed by an engine in the group as a fraction
+ of the Maximum Batch Descriptors in Progress value specified in
+ the ENGCAP register. The acceptable values are 0 (default),
+ 1 (1/2 of max value), 2 (1/4 of the max value), and 3 (1/8 of
+ the max value). It's visible only on platforms that support
+ the capability.
+
+What: /sys/bus/dsa/devices/wq<m>.<n>/dsa<x>\!wq<m>.<n>/file<y>/cr_faults
+Date: Sept 14, 2022
+KernelVersion: 6.4.0
+Contact: dmaengine@vger.kernel.org
+Description: Show the number of Completion Record (CR) faults this application
+ has caused.
+
+What: /sys/bus/dsa/devices/wq<m>.<n>/dsa<x>\!wq<m>.<n>/file<y>/cr_fault_failures
+Date: Sept 14, 2022
+KernelVersion: 6.4.0
+Contact: dmaengine@vger.kernel.org
+Description:	Show the number of Completion Record (CR) fault failures that this
+		application has caused. The failure counter is incremented when the
+		driver cannot fault in the address for the CR. Typically this is
+		caused by a bad address programmed in the submitted descriptor, or
+		by a malicious submitter using a bad CR address on purpose.
+
+What: /sys/bus/dsa/devices/wq<m>.<n>/dsa<x>\!wq<m>.<n>/file<y>/pid
+Date: Sept 14, 2022
+KernelVersion: 6.4.0
+Contact: dmaengine@vger.kernel.org
+Description:	Show the process ID of the application that opened the file. This
+		is helpful information for a monitoring daemon that wants to kill
+		the application that opened the file.
diff --git a/Documentation/ABI/stable/sysfs-driver-firmware-zynqmp b/Documentation/ABI/stable/sysfs-driver-firmware-zynqmp
index f5724bb5b462..c3fec3c835af 100644
--- a/Documentation/ABI/stable/sysfs-driver-firmware-zynqmp
+++ b/Documentation/ABI/stable/sysfs-driver-firmware-zynqmp
@@ -113,3 +113,144 @@ Description:
# echo 0 > /sys/devices/platform/firmware\:zynqmp-firmware/health_status
Users: Xilinx
+
+What: /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+Date: Feb 2022
+KernelVersion: 5.18
+Contact: "Ronak Jain" <ronak.jain@xilinx.com>
+Description:
+		This sysfs interface allows the user to configure features at
+		runtime. The user can enable or disable features running in the
+		firmware, and can configure the parameters of those features at
+		runtime. The supported features are over temperature and
+		external watchdog. Here, the external watchdog is completely
+		different from /dev/watchdog: the external watchdog runs on the
+		firmware and is used to monitor the health of the firmware, not
+		of the APU (Linux). Also, the external watchdog is interfaced
+		outside of the ZynqMP SoC.
+
+		The supported config IDs for the feature configuration are:
+ 1. PM_FEATURE_OVERTEMP_STATUS = 1, the user can enable or
+ disable the over temperature feature.
+ 2. PM_FEATURE_OVERTEMP_VALUE = 2, the user can configure the
+		over temperature limit in degrees Celsius.
+ 3. PM_FEATURE_EXTWDT_STATUS = 3, the user can enable or disable
+ the external watchdog feature.
+		4. PM_FEATURE_EXTWDT_VALUE = 4, the user can configure the
+		external watchdog timer interval.
+
+ Usage:
+
+ Select over temperature config ID to enable/disable feature
+ # echo 1 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+
+		Check whether the over temperature config ID is selected or not
+ # cat /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+ The expected result is 1.
+
+ Select over temperature config ID to configure OT limit
+ # echo 2 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+
+		Check whether the over temperature config ID is selected or not
+ # cat /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+ The expected result is 2.
+
+ Select external watchdog config ID to enable/disable feature
+ # echo 3 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+
+		Check whether the external watchdog config ID is selected or not
+ # cat /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+ The expected result is 3.
+
+ Select external watchdog config ID to configure time interval
+ # echo 4 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+
+		Check whether the external watchdog config ID is selected or not
+ # cat /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+ The expected result is 4.
+
+Users: Xilinx
+
+What: /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+Date: Feb 2022
+KernelVersion: 5.18
+Contact: "Ronak Jain" <ronak.jain@xilinx.com>
+Description:
+		This sysfs interface allows the user to configure features at
+		runtime. The user can enable or disable features running in the
+		firmware, and can configure the parameters of those features at
+		runtime. The supported features are over temperature and
+		external watchdog. Here, the external watchdog is completely
+		different from /dev/watchdog: the external watchdog runs on the
+		firmware and is used to monitor the health of the firmware, not
+		of the APU (Linux). Also, the external watchdog is interfaced
+		outside of the ZynqMP SoC.
+
+		By default the features are disabled in the firmware. The user
+		can enable a feature by selecting the appropriate config ID and
+		then writing the desired value.
+
+		The default limit for over temperature is 90 degrees Celsius.
+		The default timer interval for the external watchdog is 570 ms.
+
+		The supported config IDs for the feature configuration are:
+ 1. PM_FEATURE_OVERTEMP_STATUS = 1, the user can enable or
+ disable the over temperature feature.
+ 2. PM_FEATURE_OVERTEMP_VALUE = 2, the user can configure the
+		over temperature limit in degrees Celsius.
+ 3. PM_FEATURE_EXTWDT_STATUS = 3, the user can enable or disable
+ the external watchdog feature.
+		4. PM_FEATURE_EXTWDT_VALUE = 4, the user can configure the
+		external watchdog timer interval.
+
+ Usage:
+
+ Enable over temperature feature
+ # echo 1 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+ # echo 1 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+
+ Check whether the over temperature feature is enabled or not
+ # cat /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+ The expected result is 1.
+
+ Disable over temperature feature
+ # echo 1 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+ # echo 0 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+
+ Check whether the over temperature feature is disabled or not
+ # cat /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+ The expected result is 0.
+
+ Configure over temperature limit to 50 Degree Celsius
+ # echo 2 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+ # echo 50 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+
+ Check whether the over temperature limit is configured or not
+ # cat /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+ The expected result is 50.
+
+ Enable external watchdog feature
+ # echo 3 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+ # echo 1 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+
+ Check whether the external watchdog feature is enabled or not
+ # cat /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+ The expected result is 1.
+
+ Disable external watchdog feature
+ # echo 3 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+ # echo 0 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+
+ Check whether the external watchdog feature is disabled or not
+ # cat /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+ The expected result is 0.
+
+ Configure external watchdog timer interval to 500ms
+ # echo 4 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_id
+ # echo 500 > /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+
+ Check whether the external watchdog timer interval is configured or not
+ # cat /sys/devices/platform/firmware\:zynqmp-firmware/feature_config_value
+ The expected result is 500.
+
+Users: Xilinx
diff --git a/Documentation/ABI/stable/sysfs-driver-mlxreg-io b/Documentation/ABI/stable/sysfs-driver-mlxreg-io
index b2553df2e786..60953903d007 100644
--- a/Documentation/ABI/stable/sysfs-driver-mlxreg-io
+++ b/Documentation/ABI/stable/sysfs-driver-mlxreg-io
@@ -1,7 +1,7 @@
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/asic_health
Date: June 2018
KernelVersion: 4.19
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: This file shows ASIC health status. The possible values are:
0 - health failed, 2 - health OK, 3 - ASIC in booting state.
@@ -11,7 +11,7 @@ What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/cpld1_version
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/cpld2_version
Date: June 2018
KernelVersion: 4.19
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: These files show with which CPLD versions have been burned
on carrier and switch boards.
@@ -20,7 +20,7 @@ Description: These files show with which CPLD versions have been burned
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/fan_dir
Date: December 2018
KernelVersion: 5.0
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: This file shows the system fans direction:
forward direction - relevant bit is set 0;
reversed direction - relevant bit is set 1.
@@ -30,7 +30,7 @@ Description: This file shows the system fans direction:
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/cpld3_version
Date: November 2018
KernelVersion: 5.0
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: These files show with which CPLD versions have been burned
on LED or Gearbox board.
@@ -39,7 +39,7 @@ Description: These files show with which CPLD versions have been burned
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/jtag_enable
Date: November 2018
KernelVersion: 5.0
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: These files enable and disable the access to the JTAG domain.
By default access to the JTAG domain is disabled.
@@ -48,7 +48,7 @@ Description: These files enable and disable the access to the JTAG domain.
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/select_iio
Date: June 2018
KernelVersion: 4.19
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: This file allows iio devices selection.
Attribute select_iio can be written with 0 or with 1. It
@@ -62,7 +62,7 @@ What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/psu1_on
/sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/pwr_down
Date: June 2018
KernelVersion: 4.19
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: These files allow asserting system power cycling, switching
power supply units on and off and system's main power domain
shutdown.
@@ -89,7 +89,7 @@ What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_short_pb
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_sw_reset
Date: June 2018
KernelVersion: 4.19
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: These files show the system reset cause, as following: power
auxiliary outage or power refresh, ASIC thermal shutdown, halt,
hotswap, watchdog, firmware reset, long press power button,
@@ -106,7 +106,7 @@ What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_system
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_voltmon_upgrade_fail
Date: November 2018
KernelVersion: 5.0
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: These files show the system reset cause, as following: ComEx
power fail, reset from ComEx, system platform reset, reset
due to voltage monitor devices upgrade failure,
@@ -119,7 +119,7 @@ Description: These files show the system reset cause, as following: ComEx
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/cpld4_version
Date: November 2018
KernelVersion: 5.0
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: These files show with which CPLD versions have been burned
on LED board.
@@ -133,7 +133,7 @@ What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_sff_wd
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_swb_wd
Date: June 2019
KernelVersion: 5.3
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: These files show the system reset cause, as following:
COMEX thermal shutdown; wathchdog power off or reset was derived
by one of the next components: COMEX, switch board or by Small Form
@@ -148,7 +148,7 @@ What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/config1
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/config2
Date: January 2020
KernelVersion: 5.6
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: These files show system static topology identification
like system's static I2C topology, number and type of FPGA
devices within the system and so on.
@@ -161,7 +161,7 @@ What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_soc
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_sw_pwr_off
Date: January 2020
KernelVersion: 5.6
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: These files show the system reset causes, as following: reset
due to AC power failure, reset invoked from software by
assertion reset signal through CPLD. reset caused by signal
@@ -173,7 +173,7 @@ Description: These files show the system reset causes, as following: reset
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/pcie_asic_reset_dis
Date: January 2020
KernelVersion: 5.6
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: This file allows to retain ASIC up during PCIe root complex
reset, when attribute is set 1.
@@ -182,7 +182,7 @@ Description: This file allows to retain ASIC up during PCIe root complex
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/vpd_wp
Date: January 2020
KernelVersion: 5.6
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: This file allows to overwrite system VPD hardware write
protection when attribute is set 1.
@@ -191,7 +191,7 @@ Description: This file allows to overwrite system VPD hardware write
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/voltreg_update_status
Date: January 2020
KernelVersion: 5.6
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: This file exposes the configuration update status of burnable
voltage regulator devices. The status values are as following:
0 - OK; 1 - CRC failure; 2 = I2C failure; 3 - in progress.
@@ -201,7 +201,7 @@ Description: This file exposes the configuration update status of burnable
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/ufm_version
Date: January 2020
KernelVersion: 5.6
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: This file exposes the firmware version of burnable voltage
regulator devices.
@@ -217,9 +217,448 @@ What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/cpld3_version_min
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/cpld4_version_min
Date: July 2020
KernelVersion: 5.9
-Contact: Vadim Pasternak <vadimpmellanox.com>
+Contact: Vadim Pasternak <vadimp@nvidia.com>
Description: These files show with which CPLD part numbers and minor
versions have been burned CPLD devices equipped on a
system.
The files are read only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/bios_active_image
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/bios_auth_fail
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/bios_upgrade_fail
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description: The files represent BIOS statuses:
+
+		bios_active_image: location of the currently active BIOS image:
+		0: Top, 1: Bottom.
+		The reported value should correspond to the value expected by
+		the OS when BIOS safe mode is 0. This bit is related to the
+		Intel top-swap feature of DualBios on the same flash.
+
+		bios_auth_fail: BIOS upgrade failed because the provided BIOS
+		image was not signed correctly.
+
+		bios_upgrade_fail: BIOS upgrade failed for some other reason,
+		not authentication. For example, due to a physical SPI flash
+		problem.
+
+ The files are read only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc1_enable
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc2_enable
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc3_enable
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc4_enable
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc5_enable
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc6_enable
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc7_enable
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc8_enable
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files control the line card enable state.
+		Expected behavior:
+		When lc{n}_enable is written 1, the related line card is released
+		from the reset state; when written 0, it is held in the reset state.
+
+ The files are read/write.
+
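+		A hedged usage sketch (lc1 is an example line card; the hwmon*
+		instance must be resolved on the target system):
+
+		# echo 1 > /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc1_enable
+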
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc1_pwr
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc2_pwr
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc3_pwr
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc4_pwr
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc5_pwr
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc6_pwr
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc7_pwr
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc8_pwr
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files switch line card power on and off.
+		Expected behavior:
+		When lc{n}_pwr is written 1, the related line card is powered
+		on; when written 0, it is powered off.
+
+ The files are read/write.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc1_rst_mask
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc2_rst_mask
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc3_rst_mask
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc4_rst_mask
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc5_rst_mask
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc6_rst_mask
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc7_rst_mask
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/lc8_rst_mask
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files clear the line card reset bit enforced by the ASIC
+		when the ASIC sets it due to some abnormal behavior.
+		Expected behavior:
+		When lc{n}_rst_mask is written 1, the related line card reset bit
+		is cleared; when written 0, there is no effect.
+
+ The files are write only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/os_started
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file, when written 1, indicates to the programmable devices
+		that the OS is taking control over them.
+
+ The file is read/write.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/pm_mgmt_en
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file assigns power management control ownership.
+		When power management control is provided by hardware, the
+		hardware will automatically power off one or more previously
+		powered line cards if the system power budget becomes
+		insufficient, for example when some of the power units lose
+		their power good state.
+		When pm_mgmt_en is written 1, power management control by
+		software is enabled; when written 0, power management is
+		controlled by hardware.
+		Note that for any setting of the pm_mgmt_en attribute, hardware
+		will not allow powering on a new line card if the system power
+		budget is insufficient.
+		Likewise, if software tries to power on several line cards at
+		once, hardware will power on line cards only while the system
+		has enough power budget.
+		Default is 0.
+
+ The file is read/write.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/psu3_on
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/psu4_on
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files switch power supply units on and off.
+		Expected behavior:
+		When psu3_on or psu4_on is written 1, the related unit will be
+		disconnected from the power source; when written 0, it is connected.
+
+ The files are write only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/shutdown_unlock
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file allows unlocking the ASIC after a thermal shutdown
+		event. When a system thermal shutdown is enforced by the ASIC,
+		the ASIC gets locked and will not be available after the next
+		system boot.
+		Software can decide to unlock it by setting this attribute to
+		1 and then performing a system power cycle by setting the
+		pwr_cycle attribute to 1 (power cycle of the main power domain).
+		Before setting shutdown_unlock to 1 it is recommended to
+		validate that the system reboot cause is reset_asic_thermal or
+		reset_thermal_spc_or_pciesw.
+		If shutdown_unlock is not set to 1, the only way to release the
+		ASIC from locking is a full system power cycle through the
+		external power distribution unit.
+		Default is 1.
+
+ The file is read/write.
+
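+		A hedged sketch of the unlock sequence described above (the
+		hwmon* instance must be resolved on the target system):
+
+		# echo 1 > /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/shutdown_unlock
+		# echo 1 > /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/pwr_cycle
+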
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/cpld1_pn
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/cpld1_version
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/cpld1_version_min
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files show the CPLD major and minor versions and part
+		number that have been burned into the CPLD device on the line
+		card.
+
+ The files are read only.
+
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/fpga1_pn
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/fpga1_version
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/fpga1_version_min
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files show the FPGA major and minor versions and part
+		number that have been burned into the FPGA device on the line
+		card.
+
+ The files are read only.
+
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/vpd_wp
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file allows overriding the line card VPD hardware write
+		protection mode. When the attribute is set to 1, write
+		protection is disabled; when 0, it is enabled.
+		Default is 0.
+		If the system is in locked-down mode, writing this file will not
+		be allowed.
+		The purpose of this file is to allow line card VPD burning
+		during the production flow.
+
+ The file is read/write.
+
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/reset_aux_pwr_or_ref
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/reset_dc_dc_pwr_fail
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/reset_fpga_not_done
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/reset_from_chassis
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/reset_line_card
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/reset_pwr_off_from_chassis
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files show the line card reset cause, as follows: power
+		auxiliary outage or power refresh, DC-to-DC power failure, FPGA
+		reset failed, line card reset failed, power off from chassis.
+		A value of 1 in a file means this is the reset cause, 0
+		otherwise. Only one of the above causes can be 1 at a time,
+		representing only the last reset cause.
+
+ The files are read only.
+
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/cpld_upgrade_en
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/fpga_upgrade_en
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files allow CPLD and FPGA burning. A value of 1 in a file
+		means burning is enabled, 0 otherwise.
+		If the system is in locked-down mode, writing these files will
+		not be allowed.
+		The purpose of these files is to allow line card CPLD and FPGA
+		upgrade through the JTAG daisy-chain.
+
+ The files are read/write.
+
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/qsfp_pwr_en
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/pwr_en
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files allow powering on/off all QSFP ports and the whole
+		line card. The attributes are set to 1 for power on, 0 for
+		power off.
+
+ The files are read/write.
+
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/agb_spi_burn_en
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/fpga_spi_burn_en
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files allow gearbox and FPGA SPI flash burning.
+		The attributes are set to 1 to enable burning, 0 to disable it.
+		If the system is in locked-down mode, writing these files will
+		not be allowed.
+		The purpose of these files is to allow line card gearbox and
+		FPGA burning during the production flow.
+
+		The files are read/write.
+
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/max_power
+What: /sys/devices/platform/mlxplat/i2c_mlxcpld.*/i2c-*/i2c-*/i2c-*/*-0032/mlxreg-io.*/hwmon/hwmon*/config
+Date: October 2021
+KernelVersion: 5.16
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files provide the maximum power required for line card
+		feeding and the line card configuration ID.
+
+ The files are read only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/phy_reset
+Date: May 2022
+KernelVersion: 5.19
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file allows resetting the PHY 88E1548 by setting the
+		attribute to 0 in case of some abnormal PHY behavior.
+		Expected behavior:
+		When phy_reset is written 1, all PHY 88E1548 devices are
+		released from the reset state; when written 0, they are held
+		in the reset state.
+
+		The file is read/write.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/mac_reset
+Date: May 2022
+KernelVersion: 5.19
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file allows resetting the ASIC MT52132 by setting the
+		attribute to 0 in case of some abnormal ASIC behavior.
+		Expected behavior:
+		When mac_reset is written 1, the ASIC MT52132 is released
+		from the reset state; when written 0, it is held in the reset
+		state.
+
+		The file is read/write.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/qsfp_pwr_good
+Date: May 2022
+KernelVersion: 5.19
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file shows the QSFP ports power status. The value is set
+		to 0 when at least one QSFP port is plugged in, and to 1 when
+		no QSFP ports are plugged in.
+		The possible values are:
+		0 - Power good, 1 - Not power good.
+
+		The file is read only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/asic2_health
+Date: July 2022
+KernelVersion: 5.20
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file shows the 2nd ASIC health status. The possible values
+		are: 0 - health failed, 2 - health OK, 3 - ASIC in booting state.
+
+ The file is read only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/asic_reset
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/asic2_reset
+Date: July 2022
+KernelVersion: 5.20
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files allow resetting each of the ASICs by writing 1.
+
+ The files are write only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/comm_chnl_ready
+Date: July 2022
+KernelVersion: 5.20
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file is used to indicate to the remote end (for example, a
+		BMC) that the system host CPU is ready to send telemetry data
+		to it. For indication, the file should be written 1.
+
+ The file is write only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/config3
+Date: January 2020
+KernelVersion: 5.6
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file indicates the COMe module hardware configuration.
+		The value is pushed by hardware through GPIO pins.
+		The purpose is to expose minor BOM changes for the same system SKU.
+
+ The file is read only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_pwr_converter_fail
+Date: February 2023
+KernelVersion: 6.3
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file shows the system reset cause due to a power converter
+		device failure.
+		A value of 1 in the file means this is the reset cause, 0 otherwise.
+
+ The file is read only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/erot1_ap_reset
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/erot2_ap_reset
+Date: February 2023
+KernelVersion: 6.3
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files monitor the status of the External Root of Trust
+		(EROT) processor's RESET output to the Application Processor (AP).
+		By reading a file, it can be determined whether the EROT has
+		invalidated or revoked the AP firmware, in which case it will
+		hold the AP in RESET until valid firmware is loaded. This
+		protects the AP from running unauthorized firmware. In the
+		normal flow, the AP reset should be released after the EROT
+		validates the integrity of the FW, and this should happen as
+		quickly as possible so that the AP boots before the CPU starts
+		to communicate with each ASIC.
+
+ The files are read only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/erot1_recovery
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/erot2_recovery
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/erot1_reset
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/erot2_reset
+Date: February 2023
+KernelVersion: 6.3
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files perform the External Root of Trust (EROT) recovery
+		sequence after an EROT device failure.
+		The EROT devices protect ASICs from unauthorized access, and in
+		the normal flow their reset should be released with system
+		power, at the earliest power-up stage, so that the EROTs can
+		begin the boot and authentication process before the CPU starts
+		to communicate with the ASICs.
+		Issuing a reset to the EROT while asserting the recovery signal
+		causes the EROT Application Processor to enter recovery mode so
+		that the EROT FW can be updated/recovered.
+		For reset/recovery, the related file should be toggled 1/0.
+
+ The files are read/write.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/erot1_wp
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/erot2_wp
+Date: February 2023
+KernelVersion: 6.3
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files allow access to the External Root of Trust (EROT)
+		for the reset and recovery sequence after an EROT device failure.
+		Default is 0 (programming disabled).
+		If the system is in locked-down mode, writing these files will
+		not be allowed.
+
+ The files are read/write.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/spi_chnl_select
+Date: February 2023
+KernelVersion: 6.3
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file allows SPI chip selection for External Root of Trust
+		(EROT) device out-of-band recovery.
+		The file can be written with 0 or 1; it selects which EROT can
+		be accessed through the SPI device.
+
+ The file is read/write.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/asic_pg_fail
+Date: February 2023
+KernelVersion: 6.3
+Contact:	Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file shows the ASIC Power Good status.
+		A value of 1 in the file means ASIC Power Good failed, 0 otherwise.
+
+ The file is read only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/clk_brd1_boot_fail
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/clk_brd2_boot_fail
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/clk_brd_fail
+Date: February 2023
+KernelVersion: 6.3
+Contact:	Vadim Pasternak <vadimp@nvidia.com>
+Description:	These files are related to the clock board status in the system.
+		- clk_brd1_boot_fail: warning that the 1st clock board failed to boot from CI.
+		- clk_brd2_boot_fail: warning that the 2nd clock board failed to boot from CI.
+		- clk_brd_fail: error indicating a common clock board boot failure.
+
+ The files are read only.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/clk_brd_prog_en
+Date: February 2023
+KernelVersion: 6.3
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description: This file enables programming of clock boards.
+ Default is 0 (programming disabled).
+		If the system is in locked-down mode, writing this file will not be allowed.
+
+ The file is read/write.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/pwr_converter_prog_en
+Date: February 2023
+KernelVersion: 6.3
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description: This file enables programming of power converters.
+ Default is 0 (programming disabled).
+		If the system is in locked-down mode, writing this file will not be allowed.
+
+ The file is read/write.
+
+What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_ac_ok_fail
+Date: February 2023
+KernelVersion: 6.3
+Contact: Vadim Pasternak <vadimp@nvidia.com>
+Description:	This file shows the system reset cause due to AC power failure.
+		A value of 1 in the file means this is the reset cause, 0 otherwise.
+
+ The file is read only.
diff --git a/Documentation/ABI/stable/sysfs-driver-speakup b/Documentation/ABI/stable/sysfs-driver-speakup
index dc2a6ba1674b..bcb6831aa114 100644
--- a/Documentation/ABI/stable/sysfs-driver-speakup
+++ b/Documentation/ABI/stable/sysfs-driver-speakup
@@ -35,6 +35,15 @@ Description: This controls cursor delay when using arrow keys. When a
characters. Set this to a higher value to adjust for the delay
and better synchronisation between cursor position and speech.
+What: /sys/accessibility/speakup/cur_phonetic
+KernelVersion: 6.2
+Contact: speakup@linux-speakup.org
+Description:	This allows speakup to speak letters phonetically when arrowing
+		through a word letter by letter. This doesn't affect the spelling
+		when typing the characters. When cur_phonetic=1, speakup will
+		speak characters phonetically when arrowing over a letter. When
+		cur_phonetic=0, speakup will speak letters as normal.
+
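+		A hedged usage sketch:
+
+		# echo 1 > /sys/accessibility/speakup/cur_phonetic
+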
What: /sys/accessibility/speakup/delimiters
KernelVersion: 2.6
Contact: speakup@linux-speakup.org
diff --git a/Documentation/ABI/stable/sysfs-hypervisor-xen b/Documentation/ABI/stable/sysfs-hypervisor-xen
index 748593c64568..be9ca9981bb1 100644
--- a/Documentation/ABI/stable/sysfs-hypervisor-xen
+++ b/Documentation/ABI/stable/sysfs-hypervisor-xen
@@ -120,3 +120,16 @@ Contact: xen-devel@lists.xenproject.org
Description: If running under Xen:
The Xen version is in the format <major>.<minor><extra>
This is the <minor> part of it.
+
+What: /sys/hypervisor/start_flags/*
+Date: March 2023
+KernelVersion: 6.3.0
+Contact: xen-devel@lists.xenproject.org
+Description: If running under Xen:
+ All bits in Xen's start-flags are represented as
+ boolean files, returning '1' if set, '0' otherwise.
+ This takes the place of the defunct /proc/xen/capabilities,
+ which would contain "control_d" on dom0, and be empty
+ otherwise. This flag is now exposed as "initdomain" in
+ addition to the "privileged" flag; all other possible flags
+ are accessible as "unknownXX".
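+
+		A hedged read sketch (returns 1 on dom0):
+
+		# cat /sys/hypervisor/start_flags/initdomain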
diff --git a/Documentation/ABI/stable/sysfs-module b/Documentation/ABI/stable/sysfs-module
index 6272ae5fb366..41b1f16e8795 100644
--- a/Documentation/ABI/stable/sysfs-module
+++ b/Documentation/ABI/stable/sysfs-module
@@ -1,8 +1,7 @@
-What: /sys/module
-Description:
- The /sys/module tree consists of the following structure:
+The /sys/module tree consists of the following structure:
- /sys/module/MODULENAME
+What: /sys/module/<MODULENAME>
+Description:
The name of the module that is in the kernel. This
module name will always show up if the module is loaded as a
dynamic module. If it is built directly into the kernel, it
@@ -12,7 +11,8 @@ Description:
Note: The conditions of creation in the built-in case are not
by design and may be removed in the future.
- /sys/module/MODULENAME/parameters
+What: /sys/module/<MODULENAME>/parameters
+Description:
This directory contains individual files that are each
individual parameters of the module that are able to be
changed at runtime. See the individual module
@@ -25,10 +25,23 @@ Description:
individual driver documentation for details as to the
stability of the different parameters.
- /sys/module/MODULENAME/refcnt
+What: /sys/module/<MODULENAME>/refcnt
+Description:
If the module is able to be unloaded from the kernel, this file
will contain the current reference count of the module.
Note: If the module is built into the kernel, or if the
CONFIG_MODULE_UNLOAD kernel configuration value is not enabled,
this file will not be present.
+
+What: /sys/module/<MODULENAME>/srcversion
+Date: Jun 2005
+Description:
+ If the module source has MODULE_VERSION, this file will contain
+ the checksum of the source code.
+
+What: /sys/module/<MODULENAME>/version
+Date: Jun 2005
+Description:
+ If the module source has MODULE_VERSION, this file will contain
+ the version of the source code.
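+
+		A hedged usage sketch (substitute the name of a loaded module
+		that defines MODULE_VERSION for <MODULENAME>):
+
+		# cat /sys/module/<MODULENAME>/version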
diff --git a/Documentation/ABI/testing/configfs-usb-gadget b/Documentation/ABI/testing/configfs-usb-gadget
index b7943aa7e997..a8bb896def54 100644
--- a/Documentation/ABI/testing/configfs-usb-gadget
+++ b/Documentation/ABI/testing/configfs-usb-gadget
@@ -143,3 +143,16 @@ Description:
qw_sign an identifier to be reported as "OS String"
proper
============= ===============================================
+
+What: /config/usb-gadget/gadget/webusb
+Date: Dec 2022
+KernelVersion: 6.3
+Description:
+ This group contains "WebUSB" extension handling attributes.
+
+ ============= ===============================================
+ use flag turning "WebUSB" support on/off
+ bcdVersion bcd WebUSB specification version number
+		bVendorCode	one-byte value used for custom per-device
+				requests
+ landingPage UTF-8 encoded URL of the device's landing page
+ ============= ===============================================
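+
+		A hedged configuration sketch (the gadget name g1 and the URL
+		are assumed examples; exact value formats may differ):
+
+		# echo 1 > /config/usb-gadget/g1/webusb/use
+		# echo "https://example.com" > /config/usb-gadget/g1/webusb/landingPage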
diff --git a/Documentation/ABI/testing/configfs-usb-gadget-mass-storage b/Documentation/ABI/testing/configfs-usb-gadget-mass-storage
index c86b63a7bb43..fc0328069267 100644
--- a/Documentation/ABI/testing/configfs-usb-gadget-mass-storage
+++ b/Documentation/ABI/testing/configfs-usb-gadget-mass-storage
@@ -19,7 +19,7 @@ KernelVersion: 3.13
Description:
The attributes:
- =========== ==============================================
+ ============ ==============================================
file The path to the backing file for the LUN.
Required if LUN is not marked as removable.
ro Flag specifying access to the LUN shall be
@@ -32,4 +32,10 @@ Description:
being a CD-ROM.
nofua Flag specifying that FUA flag
in SCSI WRITE(10,12)
- =========== ==============================================
+ forced_eject This write-only file is useful only when
+ the function is active. It causes the backing
+ file to be forcibly detached from the LUN,
+ regardless of whether the host has allowed it.
+ Any non-zero number of bytes written will
+ result in ejection.
+ ============ ==============================================
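+
+		A hedged usage sketch for forced_eject (the gadget and LUN
+		paths are assumed examples; the function must be active):
+
+		# echo 1 > /config/usb-gadget/g1/functions/mass_storage.0/lun.0/forced_eject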
diff --git a/Documentation/ABI/testing/configfs-usb-gadget-uac1 b/Documentation/ABI/testing/configfs-usb-gadget-uac1
index dd647d44d975..c4ba92f004c3 100644
--- a/Documentation/ABI/testing/configfs-usb-gadget-uac1
+++ b/Documentation/ABI/testing/configfs-usb-gadget-uac1
@@ -4,23 +4,30 @@ KernelVersion: 4.14
Description:
The attributes:
- ========== ===================================
- c_chmask capture channel mask
- c_srate capture sampling rate
- c_ssize capture sample size (bytes)
- c_mute_present capture mute control enable
+ ===================== =======================================
+ c_chmask capture channel mask
+ c_srate list of capture sampling rates (comma-separated)
+ c_ssize capture sample size (bytes)
+ c_mute_present capture mute control enable
c_volume_present capture volume control enable
- c_volume_min capture volume control min value (in 1/256 dB)
- c_volume_max capture volume control max value (in 1/256 dB)
- c_volume_res capture volume control resolution (in 1/256 dB)
- p_chmask playback channel mask
- p_srate playback sampling rate
- p_ssize playback sample size (bytes)
- p_mute_present playback mute control enable
+ c_volume_min capture volume control min value
+ (in 1/256 dB)
+ c_volume_max capture volume control max value
+ (in 1/256 dB)
+ c_volume_res capture volume control resolution
+ (in 1/256 dB)
+ p_chmask playback channel mask
+ p_srate list of playback sampling rates (comma-separated)
+ p_ssize playback sample size (bytes)
+ p_mute_present playback mute control enable
p_volume_present playback volume control enable
- p_volume_min playback volume control min value (in 1/256 dB)
- p_volume_max playback volume control max value (in 1/256 dB)
- p_volume_res playback volume control resolution (in 1/256 dB)
- req_number the number of pre-allocated request
- for both capture and playback
- ========== ===================================
+ p_volume_min playback volume control min value
+ (in 1/256 dB)
+ p_volume_max playback volume control max value
+ (in 1/256 dB)
+ p_volume_res playback volume control resolution
+ (in 1/256 dB)
+ req_number the number of pre-allocated requests
+ for both capture and playback
+ function_name name of the interface
+ ===================== =======================================
diff --git a/Documentation/ABI/testing/configfs-usb-gadget-uac2 b/Documentation/ABI/testing/configfs-usb-gadget-uac2
index cfd160ff8b56..3371c39f651d 100644
--- a/Documentation/ABI/testing/configfs-usb-gadget-uac2
+++ b/Documentation/ABI/testing/configfs-usb-gadget-uac2
@@ -4,23 +4,35 @@ KernelVersion: 3.18
Description:
The attributes:
- ========= ============================
- c_chmask capture channel mask
- c_srate capture sampling rate
- c_ssize capture sample size (bytes)
- c_sync capture synchronization type (async/adaptive)
- c_mute_present capture mute control enable
+ ===================== =======================================
+ c_chmask capture channel mask
+ c_srate list of capture sampling rates (comma-separated)
+ c_ssize capture sample size (bytes)
+ c_hs_bint capture bInterval for HS/SS (1-4: fixed, 0: auto)
+ c_sync capture synchronization type
+ (async/adaptive)
+ c_mute_present capture mute control enable
c_volume_present capture volume control enable
- c_volume_min capture volume control min value (in 1/256 dB)
- c_volume_max capture volume control max value (in 1/256 dB)
- c_volume_res capture volume control resolution (in 1/256 dB)
- fb_max maximum extra bandwidth in async mode
- p_chmask playback channel mask
- p_srate playback sampling rate
- p_ssize playback sample size (bytes)
- p_mute_present playback mute control enable
+ c_volume_min capture volume control min value
+ (in 1/256 dB)
+ c_volume_max capture volume control max value
+ (in 1/256 dB)
+ c_volume_res capture volume control resolution
+ (in 1/256 dB)
+ fb_max maximum extra bandwidth in async mode
+ p_chmask playback channel mask
+ p_srate list of playback sampling rates (comma-separated)
+ p_ssize playback sample size (bytes)
+ p_hs_bint playback bInterval for HS/SS (1-4: fixed, 0: auto)
+ p_mute_present playback mute control enable
p_volume_present playback volume control enable
- p_volume_min playback volume control min value (in 1/256 dB)
- p_volume_max playback volume control max value (in 1/256 dB)
- p_volume_res playback volume control resolution (in 1/256 dB)
- ========= ============================
+ p_volume_min playback volume control min value
+ (in 1/256 dB)
+ p_volume_max playback volume control max value
+ (in 1/256 dB)
+ p_volume_res playback volume control resolution
+ (in 1/256 dB)
+ req_number the number of pre-allocated requests
+ for both capture and playback
+ function_name name of the interface
+ ===================== =======================================
diff --git a/Documentation/ABI/testing/configfs-usb-gadget-uvc b/Documentation/ABI/testing/configfs-usb-gadget-uvc
index 889ed45be4ca..4feb692c4c1d 100644
--- a/Documentation/ABI/testing/configfs-usb-gadget-uvc
+++ b/Documentation/ABI/testing/configfs-usb-gadget-uvc
@@ -7,6 +7,7 @@ Description: UVC function directory
streaming_maxburst 0..15 (ss only)
streaming_maxpacket 1..1023 (fs), 1..3072 (hs/ss)
streaming_interval 1..16
+ function_name string [32]
=================== =============================
What: /config/usb-gadget/gadget/functions/uvc.name/control
@@ -14,12 +15,14 @@ Date: Dec 2014
KernelVersion: 4.0
Description: Control descriptors
- All attributes read only:
+ All attributes read only except enable_interrupt_ep:
- ================ =============================
+ =================== =============================
bInterfaceNumber USB interface number for this
streaming interface
- ================ =============================
+ enable_interrupt_ep flag to enable the interrupt
+ endpoint for the VC interface
+ =================== =============================
What: /config/usb-gadget/gadget/functions/uvc.name/control/class
Date: Dec 2014
@@ -51,7 +54,7 @@ Date: Dec 2014
KernelVersion: 4.0
Description: Default output terminal descriptors
- All attributes read only:
+ All attributes read only except bSourceID:
============== =============================================
iTerminal index of string descriptor
@@ -73,7 +76,7 @@ Date: Dec 2014
KernelVersion: 4.0
Description: Default camera terminal descriptors
- All attributes read only:
+ All attributes read only except bmControls, which is read/write:
======================== ====================================
bmControls bitmap specifying which controls are
@@ -98,7 +101,7 @@ Date: Dec 2014
KernelVersion: 4.0
Description: Default processing unit descriptors
- All attributes read only:
+ All attributes read only except bmControls, which is read/write:
=============== ========================================
iProcessing index of string descriptor
@@ -110,6 +113,34 @@ Description: Default processing unit descriptors
bUnitID a non-zero id of this unit
=============== ========================================
+What: /config/usb-gadget/gadget/functions/uvc.name/control/extensions
+Date: Nov 2022
+KernelVersion: 6.1
+Description: Extension unit descriptors
+
+What: /config/usb-gadget/gadget/functions/uvc.name/control/extensions/name
+Date: Nov 2022
+KernelVersion: 6.1
+Description: Extension Unit (XU) Descriptor
+
+ bLength, bUnitID and iExtension are read-only. All others are
+ read-write.
+
+ ================= ========================================
+ bLength size of the descriptor in bytes
+ bUnitID non-zero ID of this unit
+ guidExtensionCode Vendor-specific code identifying the XU
+ bNumControls number of controls in this XU
+ bNrInPins number of input pins for this unit
+ baSourceID list of the IDs of the units or terminals
+ to which this XU is connected
+ bControlSize size of the bmControls field in bytes
+ bmControls list of bitmaps detailing which vendor
+ specific controls are supported
+ iExtension index of a string descriptor that describes
+ this extension unit
+ ================= ========================================
+
What: /config/usb-gadget/gadget/functions/uvc.name/control/header
Date: Dec 2014
KernelVersion: 4.0
@@ -164,7 +195,24 @@ Date: Dec 2014
KernelVersion: 4.0
Description: Default color matching descriptors
- All attributes read only:
+ All attributes read/write:
+
+ ======================== ======================================
+ bMatrixCoefficients matrix used to compute luma and
+ chroma values from the color primaries
+ bTransferCharacteristics optoelectronic transfer
+ characteristic of the source picture,
+ also called the gamma function
+ bColorPrimaries color primaries and the reference
+ white
+ ======================== ======================================
+
+What: /config/usb-gadget/gadget/functions/uvc.name/streaming/color_matching/name
+Date: Dec 2022
+KernelVersion: 6.3
+Description: Additional color matching descriptors
+
+ All attributes read/write:
======================== ======================================
bMatrixCoefficients matrix used to compute luma and
@@ -196,7 +244,7 @@ Description: Specific MJPEG format descriptors
read-only
bmaControls this format's data for bmaControls in
the streaming header
- bmInterfaceFlags specifies interlace information,
+ bmInterlaceFlags specifies interlace information,
read-only
bAspectRatioY the X dimension of the picture aspect
ratio, read-only
@@ -252,7 +300,7 @@ Description: Specific uncompressed format descriptors
read-only
bmaControls this format's data for bmaControls in
the streaming header
- bmInterfaceFlags specifies interlace information,
+ bmInterlaceFlags specifies interlace information,
read-only
		bAspectRatioY		the Y dimension of the picture aspect
ratio, read-only
diff --git a/Documentation/ABI/testing/debugfs-cros-ec b/Documentation/ABI/testing/debugfs-cros-ec
index 1fe0add99a2a..9a040c6f5e03 100644
--- a/Documentation/ABI/testing/debugfs-cros-ec
+++ b/Documentation/ABI/testing/debugfs-cros-ec
@@ -54,3 +54,25 @@ Description:
this feature.
Output will be in the format: "0x%08x\n".
+
+What: /sys/kernel/debug/<cros-ec-device>/suspend_timeout_ms
+Date: August 2022
+KernelVersion: 6.1
+Description:
+ Some ECs have a feature where they will track transitions of
+ a hardware-controlled sleep line, such as Intel's SLP_S0 line,
+ in order to detect cases where a system failed to go into deep
+ sleep states. The suspend_timeout_ms file controls the amount of
+ time in milliseconds the EC will wait before declaring a sleep
+ timeout event and attempting to wake the system.
+
+ Supply 0 to use the default value coded into EC firmware. Supply
+ 65535 (EC_HOST_SLEEP_TIMEOUT_INFINITE) to disable the EC sleep
+		failure detection mechanism. Values between 0 and 65535
+ indicate the number of milliseconds the EC should wait after a
+ sleep transition before declaring a timeout. This includes both
+ the duration after a sleep command was received but before the
+ hardware line changed, as well as the duration between when the
+ hardware line changed and the kernel sent an EC resume command.
+
+ Output will be in the format: "%u\n".
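+
+		For example, to request a 2000 ms timeout (the value is
+		chosen purely for illustration)::
+
+		    echo 2000 > /sys/kernel/debug/<cros-ec-device>/suspend_timeout_ms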
diff --git a/Documentation/ABI/testing/debugfs-cxl b/Documentation/ABI/testing/debugfs-cxl
new file mode 100644
index 000000000000..fe61d372e3fa
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-cxl
@@ -0,0 +1,35 @@
+What: /sys/kernel/debug/cxl/memX/inject_poison
+Date: April, 2023
+KernelVersion: v6.4
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (WO) When a Device Physical Address (DPA) is written to this
+ attribute, the memdev driver sends an inject poison command to
+ the device for the specified address. The DPA must be 64-byte
+		aligned and the length of the injected poison is 64 bytes. If
+ successful, the device returns poison when the address is
+ accessed through the CXL.mem bus. Injecting poison adds the
+ address to the device's Poison List and the error source is set
+ to Injected. In addition, the device adds a poison creation
+ event to its internal Informational Event log, updates the
+ Event Status register, and if configured, interrupts the host.
+ It is not an error to inject poison into an address that
+ already has poison present and no error is returned. The
+ inject_poison attribute is only visible for devices supporting
+ the capability.
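+
+		A usage sketch (the memdev and DPA below are arbitrary,
+		illustrative values; the DPA must be 64-byte aligned)::
+
+		    echo 0x40000000 > /sys/kernel/debug/cxl/mem0/inject_poison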
+
+
+What:		/sys/kernel/debug/cxl/memX/clear_poison
+Date: April, 2023
+KernelVersion: v6.4
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (WO) When a Device Physical Address (DPA) is written to this
+ attribute, the memdev driver sends a clear poison command to
+ the device for the specified address. Clearing poison removes
+ the address from the device's Poison List and writes 0 (zero)
+		for 64 bytes starting at the address. It is not an error to clear
+ poison from an address that does not have poison set. If the
+ device cannot clear poison from the address, -ENXIO is returned.
+ The clear_poison attribute is only visible for devices
+ supporting the capability.
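+
+		A usage sketch, clearing the same illustrative DPA::
+
+		    echo 0x40000000 > /sys/kernel/debug/cxl/mem0/clear_poison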
diff --git a/Documentation/ABI/testing/debugfs-dell-wmi-ddv b/Documentation/ABI/testing/debugfs-dell-wmi-ddv
new file mode 100644
index 000000000000..fbcc5d6f7388
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-dell-wmi-ddv
@@ -0,0 +1,21 @@
+What: /sys/kernel/debug/dell-wmi-ddv-<wmi_device_name>/fan_sensor_information
+Date: September 2022
+KernelVersion: 6.1
+Contact: Armin Wolf <W_Armin@gmx.de>
+Description:
+ This file contains the contents of the fan sensor information buffer,
+ which contains fan sensor entries and a terminating character (0xFF).
+
+		Each fan sensor entry consists of three bytes with an unknown meaning;
+		interested people may use this file for reverse-engineering.
+
+What: /sys/kernel/debug/dell-wmi-ddv-<wmi_device_name>/thermal_sensor_information
+Date: September 2022
+KernelVersion: 6.1
+Contact: Armin Wolf <W_Armin@gmx.de>
+Description:
+ This file contains the contents of the thermal sensor information buffer,
+ which contains thermal sensor entries and a terminating character (0xFF).
+
+		Each thermal sensor entry consists of five bytes with an unknown meaning;
+		interested people may use this file for reverse-engineering.
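+
+		Since the entry layout is unknown, a hex dump is a reasonable
+		starting point for reverse-engineering, e.g.::
+
+		    xxd /sys/kernel/debug/dell-wmi-ddv-<wmi_device_name>/thermal_sensor_information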
diff --git a/Documentation/ABI/testing/debugfs-driver-dcc b/Documentation/ABI/testing/debugfs-driver-dcc
new file mode 100644
index 000000000000..27ed5919d21b
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-driver-dcc
@@ -0,0 +1,127 @@
+What: /sys/kernel/debug/dcc/.../ready
+Date: December 2022
+Contact: Souradeep Chowdhury <quic_schowdhu@quicinc.com>
+Description:
+		This file is used to check whether the dcc hardware is
+		ready to receive user configurations. A 'Y' here
+		indicates the dcc is ready.
+
+What: /sys/kernel/debug/dcc/.../trigger
+Date: December 2022
+Contact: Souradeep Chowdhury <quic_schowdhu@quicinc.com>
+Description:
+ This is the debugfs interface for manual software
+ triggers. The trigger can be invoked by writing '1'
+ to the file.
+
+What: /sys/kernel/debug/dcc/.../config_reset
+Date: December 2022
+Contact: Souradeep Chowdhury <quic_schowdhu@quicinc.com>
+Description:
+		This file is used to reset the configuration of
+		a dcc driver to the default configuration. When '1'
+		is written to the file, all the previously stored
+		addresses are removed from the driver and users need
+		to reconfigure addresses again.
+
+What: /sys/kernel/debug/dcc/.../[list-number]/config
+Date: December 2022
+Contact: Souradeep Chowdhury <quic_schowdhu@quicinc.com>
+Description:
+ This stores the addresses of the registers which
+ can be read in case of a hardware crash or manual
+		software triggers. The input address type
+		can be one of the following dcc instructions: read,
+		write, read-write, and loop type. The lists need to
+		be configured sequentially and not in an overlapping
+ manner; e.g. users can jump to list x only after
+ list y is configured and enabled. The input format for
+ each type is as follows:
+
+ i) Read instruction
+
+ ::
+
+	    echo R <addr> <n> <bus type> > /sys/kernel/debug/dcc/../[list-number]/config
+
+ where:
+
+ <addr>
+ The address to be read.
+
+ <n>
+		The address word count, starting from <addr>.
+		Each word is 32 bits (4 bytes). If omitted, it
+		defaults to 1.
+
+ <bus type>
+ The bus type, which can be either 'apb' or 'ahb'.
+		The default is 'ahb' if left out.
+
+ ii) Write instruction
+
+ ::
+
+ echo W <addr> <n> <bus type> > /sys/kernel/debug/dcc/../[list-number]/config
+
+ where:
+
+ <addr>
+ The address to be written.
+
+ <n>
+ The value to be written at <addr>.
+
+ <bus type>
+ The bus type, which can be either 'apb' or 'ahb'.
+
+ iii) Read-write instruction
+
+ ::
+
+ echo RW <addr> <n> <mask> > /sys/kernel/debug/dcc/../[list-number]/config
+
+ where:
+
+ <addr>
+ The address to be read and written.
+
+ <n>
+ The value to be written at <addr>.
+
+ <mask>
+ The value mask.
+
+ iv) Loop instruction
+
+ ::
+
+ echo L <loop count> <address count> <address>... > /sys/kernel/debug/dcc/../[list-number]/config
+
+ where:
+
+ <loop count>
+ Number of iterations
+
+ <address count>
+ total number of addresses to be written
+
+ <address>
+ Space-separated list of addresses.
+
+What: /sys/kernel/debug/dcc/.../[list-number]/enable
+Date: December 2022
+Contact: Souradeep Chowdhury <quic_schowdhu@quicinc.com>
+Description:
+ This debugfs interface is used for enabling the
+ the dcc hardware. A file named "enable" is in the
+ directory list number where users can enable/disable
+ the specific list by writing boolean (1 or 0) to the
+ file.
+
+		On enabling the dcc, all the addresses specified
+		by the user for the corresponding list are written
+		into the dcc sram, which is read by the dcc hardware
+		on manual or crash-induced triggers. Lists must
+		be configured and enabled sequentially; e.g. list
+		2 can only be enabled after list 1 has been.
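+
+		A minimal end-to-end sketch (the register address, word
+		count and list number below are placeholders, not real
+		values)::
+
+		    echo R 0x90a000 4 ahb > /sys/kernel/debug/dcc/.../1/config
+		    echo 1 > /sys/kernel/debug/dcc/.../1/enable
+		    echo 1 > /sys/kernel/debug/dcc/.../trigger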
diff --git a/Documentation/ABI/testing/debugfs-driver-habanalabs b/Documentation/ABI/testing/debugfs-driver-habanalabs
index 284e2dfa61cd..85f6d04f528b 100644
--- a/Documentation/ABI/testing/debugfs-driver-habanalabs
+++ b/Documentation/ABI/testing/debugfs-driver-habanalabs
@@ -12,24 +12,7 @@ What: /sys/kernel/debug/habanalabs/hl<n>/clk_gate
Date: May 2020
KernelVersion: 5.8
Contact: ogabbay@kernel.org
-Description: Allow the root user to disable/enable in runtime the clock
- gating mechanism in Gaudi. Due to how Gaudi is built, the
- clock gating needs to be disabled in order to access the
- registers of the TPC and MME engines. This is sometimes needed
- during debug by the user and hence the user needs this option.
- The user can supply a bitmask value, each bit represents
- a different engine to disable/enable its clock gating feature.
- The bitmask is composed of 20 bits:
-
- ======= ============
- 0 - 7 DMA channels
- 8 - 11 MME engines
- 12 - 19 TPC engines
- ======= ============
-
- The bit's location of a specific engine can be determined
- using (1 << GAUDI_ENGINE_ID_*). GAUDI_ENGINE_ID_* values
- are defined in uapi habanalabs.h file in enum gaudi_engine_id
+Description:	This setting is now deprecated as clock gating is handled solely by the f/w.
What: /sys/kernel/debug/habanalabs/hl<n>/command_buffers
Date: Jan 2019
@@ -108,6 +91,13 @@ Description: Enables the root user to set the device to specific state.
Valid values are "disable", "enable", "suspend", "resume".
User can read this property to see the valid values
+What: /sys/kernel/debug/habanalabs/hl<n>/device_release_watchdog_timeout
+Date: Oct 2022
+KernelVersion: 6.2
+Contact: ttayar@habana.ai
+Description:	The watchdog timeout value in seconds for a device release upon
+		certain error cases, after which the device is reset.
+
What: /sys/kernel/debug/habanalabs/hl<n>/dma_size
Date: Apr 2021
KernelVersion: 5.13
@@ -118,6 +108,15 @@ Description: Specify the size of the DMA transaction when using DMA to read
When the write is finished, the user can read the "data_dma"
blob
+What: /sys/kernel/debug/habanalabs/hl<n>/dump_razwi_events
+Date: Aug 2022
+KernelVersion: 5.20
+Contact: fkassabri@habana.ai
+Description:	Dumps all razwi events to dmesg, if any exist.
+ After reading the status register of an existing event
+ the routine will clear the status register.
+ Usage: cat dump_razwi_events
+
What: /sys/kernel/debug/habanalabs/hl<n>/dump_security_violations
Date: Jan 2021
KernelVersion: 5.12
@@ -138,14 +137,16 @@ Date: Jan 2019
KernelVersion: 5.1
Contact: ogabbay@kernel.org
Description: Sets I2C device address for I2C transaction that is generated
- by the device's CPU
+		by the device's CPU. Not available when the device is loaded
+		with secured firmware
What: /sys/kernel/debug/habanalabs/hl<n>/i2c_bus
Date: Jan 2019
KernelVersion: 5.1
Contact: ogabbay@kernel.org
Description: Sets I2C bus address for I2C transaction that is generated by
- the device's CPU
+		the device's CPU. Not available when the device is loaded
+		with secured firmware
What: /sys/kernel/debug/habanalabs/hl<n>/i2c_data
Date: Jan 2019
@@ -153,32 +154,60 @@ KernelVersion: 5.1
Contact: ogabbay@kernel.org
Description: Triggers an I2C transaction that is generated by the device's
CPU. Writing to this file generates a write transaction while
- reading from the file generates a read transaction
+		reading from the file generates a read transaction. Not
+		available when the device is loaded with secured firmware
+
+What: /sys/kernel/debug/habanalabs/hl<n>/i2c_len
+Date: Dec 2021
+KernelVersion: 5.17
+Contact: obitton@habana.ai
+Description: Sets I2C length in bytes for I2C transaction that is generated by
+		the device's CPU. Not available when the device is loaded
+		with secured firmware
What: /sys/kernel/debug/habanalabs/hl<n>/i2c_reg
Date: Jan 2019
KernelVersion: 5.1
Contact: ogabbay@kernel.org
Description: Sets I2C register id for I2C transaction that is generated by
- the device's CPU
+		the device's CPU. Not available when the device is loaded
+		with secured firmware
What: /sys/kernel/debug/habanalabs/hl<n>/led0
Date: Jan 2019
KernelVersion: 5.1
Contact: ogabbay@kernel.org
-Description: Sets the state of the first S/W led on the device
+Description:	Sets the state of the first S/W led on the device. Not
+		available when the device is loaded with secured firmware
What: /sys/kernel/debug/habanalabs/hl<n>/led1
Date: Jan 2019
KernelVersion: 5.1
Contact: ogabbay@kernel.org
-Description: Sets the state of the second S/W led on the device
+Description:	Sets the state of the second S/W led on the device. Not
+		available when the device is loaded with secured firmware
What: /sys/kernel/debug/habanalabs/hl<n>/led2
Date: Jan 2019
KernelVersion: 5.1
Contact: ogabbay@kernel.org
-Description: Sets the state of the third S/W led on the device
+Description:	Sets the state of the third S/W led on the device. Not
+		available when the device is loaded with secured firmware
+
+What: /sys/kernel/debug/habanalabs/hl<n>/memory_scrub
+Date: May 2022
+KernelVersion: 5.19
+Contact: dhirschfeld@habana.ai
+Description: Allows the root user to scrub the dram memory. The scrubbing
+ value can be set using the debugfs file memory_scrub_val.
+
+What: /sys/kernel/debug/habanalabs/hl<n>/memory_scrub_val
+Date: May 2022
+KernelVersion: 5.19
+Contact: dhirschfeld@habana.ai
+Description:	The value to which the dram will be set when the user
+		scrubs the dram using the 'memory_scrub' debugfs file, and
+		the scrubbing value when using the module param 'memory_scrub'
What: /sys/kernel/debug/habanalabs/hl<n>/mmu
Date: Jan 2019
@@ -200,6 +229,30 @@ Description: Check and display page fault or access violation mmu errors for
echo "0x200" > /sys/kernel/debug/habanalabs/hl0/mmu_error
cat /sys/kernel/debug/habanalabs/hl0/mmu_error
+What: /sys/kernel/debug/habanalabs/hl<n>/monitor_dump
+Date: Mar 2022
+KernelVersion: 5.19
+Contact: osharabi@habana.ai
+Description: Allows the root user to dump monitors status from the device's
+ protected config space.
+ This property is a binary blob that contains the result of the
+ monitors registers dump.
+ This custom interface is needed (instead of using the generic
+ Linux user-space PCI mapping) because this space is protected
+ and cannot be accessed using PCI read.
+ This interface doesn't support concurrency in the same device.
+ Only supported on GAUDI.
+
+What: /sys/kernel/debug/habanalabs/hl<n>/monitor_dump_trig
+Date: Mar 2022
+KernelVersion: 5.19
+Contact: osharabi@habana.ai
+Description: Triggers dump of monitor data. The value to trigger the operation
+		must be 1. Triggering the monitor dump operation initiates a dump
+		of the current register values of all monitors.
+ When the write is finished, the user can read the "monitor_dump"
+ blob
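+
+		A possible sequence (the output file name is illustrative)::
+
+		    echo 1 > /sys/kernel/debug/habanalabs/hl0/monitor_dump_trig
+		    cat /sys/kernel/debug/habanalabs/hl0/monitor_dump > monitors.bin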
+
What: /sys/kernel/debug/habanalabs/hl<n>/set_power_state
Date: Jan 2019
KernelVersion: 5.1
@@ -232,6 +285,13 @@ KernelVersion: 5.6
Contact: ogabbay@kernel.org
Description: Sets the stop-on_error option for the device engines. Value of
"0" is for disable, otherwise enable.
+ Relevant only for GOYA and GAUDI.
+
+What: /sys/kernel/debug/habanalabs/hl<n>/timeout_locked
+Date: Sep 2021
+KernelVersion: 5.16
+Contact: obitton@habana.ai
+Description: Sets the command submission timeout value in seconds.
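+
+		For example (the 30-second value is purely illustrative)::
+
+		    echo 30 > /sys/kernel/debug/habanalabs/hl0/timeout_locked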
What: /sys/kernel/debug/habanalabs/hl<n>/userptr
Date: Jan 2019
@@ -242,7 +302,7 @@ Description: Displays a list with information about the currently user
to DMA addresses
What: /sys/kernel/debug/habanalabs/hl<n>/userptr_lookup
-Date: Aug 2021
+Date: Oct 2021
KernelVersion: 5.15
Contact: ogabbay@kernel.org
Description: Allows to search for specific user pointers (user virtual
diff --git a/Documentation/ABI/testing/debugfs-hisi-hpre b/Documentation/ABI/testing/debugfs-hisi-hpre
index b4be5f1db4b7..82abf92df429 100644
--- a/Documentation/ABI/testing/debugfs-hisi-hpre
+++ b/Documentation/ABI/testing/debugfs-hisi-hpre
@@ -1,140 +1,164 @@
-What: /sys/kernel/debug/hisi_hpre/<bdf>/cluster[0-3]/regs
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Dump debug registers from the HPRE cluster.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/cluster[0-3]/regs
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump debug registers from the HPRE cluster.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/cluster[0-3]/cluster_ctrl
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Write the HPRE core selection in the cluster into this file,
+What: /sys/kernel/debug/hisi_hpre/<bdf>/cluster[0-3]/cluster_ctrl
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Write the HPRE core selection in the cluster into this file,
and then we can read the debug information of the core.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/rdclr_en
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: HPRE cores debug registers read clear control. 1 means enable
+What: /sys/kernel/debug/hisi_hpre/<bdf>/rdclr_en
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: HPRE cores debug registers read clear control. 1 means enable
register read clear, otherwise 0. Writing to this file has no
functional effect, only enable or disable counters clear after
reading of these registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/current_qm
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: One HPRE controller has one PF and multiple VFs, each function
+What: /sys/kernel/debug/hisi_hpre/<bdf>/current_qm
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One HPRE controller has one PF and multiple VFs, each function
has a QM. Select the QM which below qm refers to.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/regs
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Dump debug registers from the HPRE.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/alg_qos
+Date: Jun 2021
+Contact: linux-crypto@vger.kernel.org
+Description:	The <bdf> identifies the function, for both PF and VF.
+		The HPRE driver supports configuring each function's QoS;
+		in the host, write the <bdf> and qos value to alg_qos,
+		e.g. "echo <bdf> value > alg_qos". The qos value is
+		1~1000, meaning 1/1000~1000/1000 of the total QoS.
+		Reading alg_qos returns the related QoS in the host and
+		VM, e.g. "cat alg_qos".
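+
+		For example, setting half of the total QoS (the value 500
+		is purely illustrative)::
+
+		    echo <bdf> 500 > /sys/kernel/debug/hisi_hpre/<bdf>/alg_qos
+		    cat /sys/kernel/debug/hisi_hpre/<bdf>/alg_qos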
+
+What: /sys/kernel/debug/hisi_hpre/<bdf>/regs
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump debug registers from the HPRE.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/regs
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Dump debug registers from the QM.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/regs
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump debug registers from the QM.
Available for PF and VF in host. VF in guest currently only
has one debug register.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/current_q
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: One QM may contain multiple queues. Select specific queue to
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/current_q
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One QM may contain multiple queues. Select specific queue to
show its debug registers in above regs.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/clear_enable
-Date: Sep 2019
-Contact: linux-crypto@vger.kernel.org
-Description: QM debug registers(regs) read clear control. 1 means enable
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/clear_enable
+Date: Sep 2019
+Contact: linux-crypto@vger.kernel.org
+Description: QM debug registers(regs) read clear control. 1 means enable
register read clear, otherwise 0.
Writing to this file has no functional effect, only enable or
disable counters clear after reading of these registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/err_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of invalid interrupts for
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/err_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of invalid interrupts for
QM task completion.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/aeq_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of QM async event queue interrupts.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/aeq_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of QM async event queue interrupts.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/abnormal_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of interrupts for QM abnormal event.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/abnormal_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of interrupts for QM abnormal event.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/create_qp_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of queue allocation errors.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/create_qp_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of queue allocation errors.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/mb_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of failed QM mailbox commands.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/mb_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of failed QM mailbox commands.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/status
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the status of the QM.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/status
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the status of the QM.
Four states: initiated, started, stopped and closed.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of sent requests.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/qm/diff_regs
+Date: Mar 2022
+Contact: linux-crypto@vger.kernel.org
+Description:	Read the current hardware values of the QM debug
+		registers (regs). This node shows how the qm register
+		values have changed and can help users track changes in
+		register values.
+
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/diff_regs
+Date: Mar 2022
+Contact: linux-crypto@vger.kernel.org
+Description:	Read the current hardware values of the HPRE debug
+		registers (regs). This node shows how the register
+		values have changed and can help users track changes in
+		register values.
+
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of sent requests.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/recv_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of received requests.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/recv_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of received requests.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_busy_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of requests sent
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_busy_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of requests sent
with returning busy.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_fail_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of completed but error requests.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/send_fail_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of completed but error requests.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/invalid_req_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of invalid requests being received.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/invalid_req_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of invalid requests being received.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/overtime_thrhld
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Set the threshold time for counting the request which is
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/overtime_thrhld
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Set the threshold time for counting the request which is
processed longer than the threshold.
0: disable(default), 1: 1 microsecond.
Available for both PF and VF, and take no other effect on HPRE.
-What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/over_thrhld_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of time out requests.
+What: /sys/kernel/debug/hisi_hpre/<bdf>/hpre_dfx/over_thrhld_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of time out requests.
Available for both PF and VF, and take no other effect on HPRE.
diff --git a/Documentation/ABI/testing/debugfs-hisi-sec b/Documentation/ABI/testing/debugfs-hisi-sec
index 85feb4408e0f..93c530d1bf0f 100644
--- a/Documentation/ABI/testing/debugfs-hisi-sec
+++ b/Documentation/ABI/testing/debugfs-hisi-sec
@@ -1,113 +1,137 @@
-What: /sys/kernel/debug/hisi_sec2/<bdf>/clear_enable
-Date: Oct 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Enabling/disabling of clear action after reading
+What: /sys/kernel/debug/hisi_sec2/<bdf>/clear_enable
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Enabling/disabling of clear action after reading
the SEC debug registers.
0: disable, 1: enable.
Only available for PF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/current_qm
-Date: Oct 2019
-Contact: linux-crypto@vger.kernel.org
-Description: One SEC controller has one PF and multiple VFs, each function
+What: /sys/kernel/debug/hisi_sec2/<bdf>/current_qm
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One SEC controller has one PF and multiple VFs, each function
has a QM. This file can be used to select the QM which below
qm refers to.
Only available for PF.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/qm_regs
-Date: Oct 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Dump of QM related debug registers.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/alg_qos
+Date: Jun 2021
+Contact: linux-crypto@vger.kernel.org
+Description:	The <bdf> identifies the function, for both PF and VF.
+		The SEC driver supports configuring each function's QoS;
+		in the host, write the <bdf> and qos value to alg_qos,
+		e.g. "echo <bdf> value > alg_qos". The qos value is
+		1~1000, meaning 1/1000~1000/1000 of the total QoS.
+		Reading alg_qos returns the related QoS in the host and
+		VM, e.g. "cat alg_qos".
+
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/qm_regs
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Dump of QM related debug registers.
Available for PF and VF in host. VF in guest currently only
has one debug register.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/current_q
-Date: Oct 2019
-Contact: linux-crypto@vger.kernel.org
-Description: One QM of SEC may contain multiple queues. Select specific
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/current_q
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: One QM of SEC may contain multiple queues. Select specific
queue to show its debug registers in above 'regs'.
Only available for PF.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/clear_enable
-Date: Oct 2019
-Contact: linux-crypto@vger.kernel.org
-Description: Enabling/disabling of clear action after reading
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/clear_enable
+Date: Oct 2019
+Contact: linux-crypto@vger.kernel.org
+Description: Enabling/disabling of clear action after reading
the SEC's QM debug registers.
0: disable, 1: enable.
Only available for PF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/err_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of invalid interrupts for
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/err_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of invalid interrupts for
QM task completion.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/aeq_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of QM async event queue interrupts.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/aeq_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of QM async event queue interrupts.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/abnormal_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of interrupts for QM abnormal event.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/abnormal_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of interrupts for QM abnormal event.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/create_qp_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of queue allocation errors.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/create_qp_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of queue allocation errors.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/mb_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of failed QM mailbox commands.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/mb_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of failed QM mailbox commands.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/status
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the status of the QM.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/status
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the status of the QM.
Four states: initiated, started, stopped and closed.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/send_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of sent requests.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/qm/diff_regs
+Date: Mar 2022
+Contact: linux-crypto@vger.kernel.org
+Description:	Read the current hardware values of the QM debug
+		registers (regs). This node shows how the qm register
+		values have changed and can help users track changes in
+		register values.
+
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/diff_regs
+Date: Mar 2022
+Contact: linux-crypto@vger.kernel.org
+Description:	Read the current hardware values of the SEC debug
+		registers (regs). This node shows how the register
+		values have changed and can help users track changes in
+		register values.
+
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/send_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of sent requests.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/recv_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of received requests.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/recv_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of received requests.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/send_busy_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of requests sent with returning busy.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/send_busy_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of requests sent with returning busy.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/err_bd_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of BD type error requests
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/err_bd_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of BD type error requests
to be received.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/invalid_req_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of invalid requests being received.
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/invalid_req_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of invalid requests being received.
Available for both PF and VF, and take no other effect on SEC.
-What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/done_flag_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of completed but marked error requests
+What: /sys/kernel/debug/hisi_sec2/<bdf>/sec_dfx/done_flag_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of completed but marked error requests
to be received.
Available for both PF and VF, and take no other effect on SEC.
diff --git a/Documentation/ABI/testing/debugfs-hisi-zip b/Documentation/ABI/testing/debugfs-hisi-zip
index 3034a2bf99ca..fd3f314cf8d1 100644
--- a/Documentation/ABI/testing/debugfs-hisi-zip
+++ b/Documentation/ABI/testing/debugfs-hisi-zip
@@ -1,114 +1,138 @@
-What: /sys/kernel/debug/hisi_zip/<bdf>/comp_core[01]/regs
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: Dump of compression cores related debug registers.
+What: /sys/kernel/debug/hisi_zip/<bdf>/comp_core[01]/regs
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: Dump of compression cores related debug registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/decomp_core[0-5]/regs
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: Dump of decompression cores related debug registers.
+What: /sys/kernel/debug/hisi_zip/<bdf>/decomp_core[0-5]/regs
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: Dump of decompression cores related debug registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/clear_enable
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: Compression/decompression core debug registers read clear
+What: /sys/kernel/debug/hisi_zip/<bdf>/clear_enable
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: Compression/decompression core debug registers read clear
control. 1 means enable register read clear, otherwise 0.
Writing to this file has no functional effect, only enable or
disable counters clear after reading of these registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/current_qm
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: One ZIP controller has one PF and multiple VFs, each function
+What: /sys/kernel/debug/hisi_zip/<bdf>/current_qm
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: One ZIP controller has one PF and multiple VFs, each function
has a QM. Select the QM which below qm refers to.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/regs
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: Dump of QM related debug registers.
+What: /sys/kernel/debug/hisi_zip/<bdf>/alg_qos
+Date: Jun 2021
+Contact: linux-crypto@vger.kernel.org
+Description:	The <bdf> identifies the function, for both PF and VF.
+		The ZIP driver supports configuring each function's QoS;
+		in the host, write the <bdf> and qos value to alg_qos,
+		e.g. "echo <bdf> value > alg_qos". The qos value is
+		1~1000, meaning 1/1000~1000/1000 of the total QoS.
+		Reading alg_qos returns the related QoS in the host and
+		VM, e.g. "cat alg_qos".
+
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/regs
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: Dump of QM related debug registers.
Available for PF and VF in host. VF in guest currently only
has one debug register.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/current_q
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: One QM may contain multiple queues. Select specific queue to
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/current_q
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: One QM may contain multiple queues. Select specific queue to
show its debug registers in above regs.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/clear_enable
-Date: Nov 2018
-Contact: linux-crypto@vger.kernel.org
-Description: QM debug registers(regs) read clear control. 1 means enable
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/clear_enable
+Date: Nov 2018
+Contact: linux-crypto@vger.kernel.org
+Description: QM debug registers(regs) read clear control. 1 means enable
register read clear, otherwise 0.
Writing to this file has no functional effect, only enable or
disable counters clear after reading of these registers.
Only available for PF.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/err_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of invalid interrupts for
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/err_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of invalid interrupts for
QM task completion.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/aeq_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of QM async event queue interrupts.
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/aeq_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of QM async event queue interrupts.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/abnormal_irq
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of interrupts for QM abnormal event.
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/abnormal_irq
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of interrupts for QM abnormal event.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/create_qp_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of queue allocation errors.
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/create_qp_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of queue allocation errors.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/mb_err
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the number of failed QM mailbox commands.
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/mb_err
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the number of failed QM mailbox commands.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/qm/status
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the status of the QM.
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/status
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the status of the QM.
Four states: initiated, started, stopped and closed.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/send_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of sent requests.
+What: /sys/kernel/debug/hisi_zip/<bdf>/qm/diff_regs
+Date: Mar 2022
+Contact: linux-crypto@vger.kernel.org
+Description:	Read the current hardware values of the QM debug
+		registers (regs). This node shows how the qm register
+		values have changed and can help users track changes in
+		register values.
+
+What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/diff_regs
+Date: Mar 2022
+Contact: linux-crypto@vger.kernel.org
+Description:	Read the current hardware values of the ZIP debug
+		registers (regs). This node shows how the register
+		values have changed and can help users track changes in
+		register values.
+
+What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/send_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of sent requests.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/recv_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of received requests.
+What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/recv_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of received requests.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/send_busy_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of requests received
+What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/send_busy_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description:	Dump the total number of requests sent
with returning busy.
Available for both PF and VF, and take no other effect on ZIP.
-What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/err_bd_cnt
-Date: Apr 2020
-Contact: linux-crypto@vger.kernel.org
-Description: Dump the total number of BD type error requests
+What: /sys/kernel/debug/hisi_zip/<bdf>/zip_dfx/err_bd_cnt
+Date: Apr 2020
+Contact: linux-crypto@vger.kernel.org
+Description: Dump the total number of BD type error requests
to be received.
Available for both PF and VF, and take no other effect on ZIP.
diff --git a/Documentation/ABI/testing/debugfs-scmi b/Documentation/ABI/testing/debugfs-scmi
new file mode 100644
index 000000000000..ee7179ab2edf
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-scmi
@@ -0,0 +1,70 @@
+What: /sys/kernel/debug/scmi/<n>/instance_name
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: The name of the underlying SCMI instance <n> described by
+ all the debugfs accessors rooted at /sys/kernel/debug/scmi/<n>,
+ expressed as the full name of the top DT SCMI node under which
+ this SCMI instance is rooted.
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/atomic_threshold_us
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: An optional time value, expressed in microseconds, representing,
+ on this SCMI instance <n>, the threshold above which any SCMI
+		command, advertised to have a higher-than-threshold execution
+ latency, should not be considered for atomic mode of operation,
+ even if requested.
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/transport/type
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: A string representing the type of transport configured for this
+ SCMI instance <n>.
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/transport/is_atomic
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: A boolean stating if the transport configured on the underlying
+ SCMI instance <n> is capable of atomic mode of operation.
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/transport/max_rx_timeout_ms
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: Timeout in milliseconds allowed for SCMI synchronous replies
+ for the currently configured SCMI transport for instance <n>.
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/transport/max_msg_size
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: Max message size of allowed SCMI messages for the currently
+ configured SCMI transport for instance <n>.
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/transport/tx_max_msg
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: Max number of concurrently allowed in-flight SCMI messages for
+ the currently configured SCMI transport for instance <n> on the
+ TX channels.
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/transport/rx_max_msg
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: Max number of concurrently allowed in-flight SCMI messages for
+ the currently configured SCMI transport for instance <n> on the
+ RX channels.
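+		Together with the other transport entries above, these can
+		be inspected in one go (the instance number 0 is
+		illustrative)::
+
+		    grep . /sys/kernel/debug/scmi/0/transport/*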
+Users: Debugging, any userspace test suite
diff --git a/Documentation/ABI/testing/debugfs-scmi-raw b/Documentation/ABI/testing/debugfs-scmi-raw
new file mode 100644
index 000000000000..97678cc9535c
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-scmi-raw
@@ -0,0 +1,117 @@
+What: /sys/kernel/debug/scmi/<n>/raw/message
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: SCMI Raw synchronous message injection/snooping facility; write
+ a complete SCMI synchronous command message (header included)
+ in little-endian binary format to have it sent to the configured
+ backend SCMI server for instance <n>.
+ Any subsequently received response can be read from this same
+ entry if it arrived within the configured timeout.
+ Each write to the entry causes one command request to be built
+		and sent while the replies are read back one message at a time
+ (receiving an EOF at each message boundary).
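+		A minimal injection sketch; the four bytes below stand for
+		a 32-bit SCMI message header in little-endian order and are
+		a placeholder, not a valid command::
+
+		    printf '\x12\x34\x00\x00' > /sys/kernel/debug/scmi/0/raw/message
+		    xxd /sys/kernel/debug/scmi/0/raw/message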
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/raw/message_async
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: SCMI Raw asynchronous message injection/snooping facility; write
+ a complete SCMI asynchronous command message (header included)
+ in little-endian binary format to have it sent to the configured
+ backend SCMI server for instance <n>.
+ Any subsequently received response can be read from this same
+ entry if it arrived within the configured timeout.
+ Any additional delayed response received afterwards can be read
+ from this same entry too if it arrived within the configured
+ timeout.
+ Each write to the entry causes one command request to be built
+		and sent while the replies are read back one message at a time
+ (receiving an EOF at each message boundary).
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/raw/errors
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: SCMI Raw message errors facility; any kind of timed-out or
+ generally unexpectedly received SCMI message, for instance <n>,
+ can be read from this entry.
+		Each read gives back one message at a time (receiving an EOF at
+ each message boundary).
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/raw/notification
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: SCMI Raw notification snooping facility; any notification
+ emitted by the backend SCMI server, for instance <n>, can be
+ read from this entry.
+		Each read gives back one message at a time (receiving an EOF at
+ each message boundary).
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/raw/reset
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: SCMI Raw stack reset facility; writing a value to this entry
+ causes the internal queues of any kind of received message,
+ still pending to be read out for instance <n>, to be immediately
+ flushed.
+		Can be used to reset and clean the SCMI Raw stack between
+		two different test runs.
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/raw/channels/<m>/message
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: SCMI Raw synchronous message injection/snooping facility; write
+ a complete SCMI synchronous command message (header included)
+ in little-endian binary format to have it sent to the configured
+ backend SCMI server for instance <n> through the <m> transport
+ channel.
+ Any subsequently received response can be read from this same
+ entry if it arrived on channel <m> within the configured
+ timeout.
+ Each write to the entry causes one command request to be built
+		and sent while the replies are read back one message at a time
+ (receiving an EOF at each message boundary).
+ Channel identifier <m> matches the SCMI protocol number which
+ has been associated with this transport channel in the DT
+ description, with base protocol number 0x10 being the default
+ channel for this instance.
+ Note that these per-channel entries rooted at <..>/channels
+ exist only if the transport is configured to have more than
+ one default channel.
+Users: Debugging, any userspace test suite
+
+What: /sys/kernel/debug/scmi/<n>/raw/channels/<m>/message_async
+Date: March 2023
+KernelVersion: 6.3
+Contact: cristian.marussi@arm.com
+Description: SCMI Raw asynchronous message injection/snooping facility; write
+ a complete SCMI asynchronous command message (header included)
+ in little-endian binary format to have it sent to the configured
+ backend SCMI server for instance <n> through the <m> transport
+ channel.
+ Any subsequently received response can be read from this same
+ entry if it arrived on channel <m> within the configured
+ timeout.
+ Any additional delayed response received afterwards can be read
+ from this same entry too if it arrived within the configured
+ timeout.
+ Each write to the entry causes one command request to be built
+		and sent while the replies are read back one message at a time
+ (receiving an EOF at each message boundary).
+ Channel identifier <m> matches the SCMI protocol number which
+ has been associated with this transport channel in the DT
+ description, with base protocol number 0x10 being the default
+ channel for this instance.
+ Note that these per-channel entries rooted at <..>/channels
+ exist only if the transport is configured to have more than
+ one default channel.
+Users: Debugging, any userspace test suite
diff --git a/Documentation/ABI/testing/evm b/Documentation/ABI/testing/evm
index 553fd8a33e56..44750a933db4 100644
--- a/Documentation/ABI/testing/evm
+++ b/Documentation/ABI/testing/evm
@@ -1,4 +1,5 @@
-What: security/evm
+What: /sys/kernel/security/evm
+What: /sys/kernel/security/*/evm
Date: March 2011
Contact: Mimi Zohar <zohar@us.ibm.com>
Description:
@@ -93,7 +94,7 @@ Description:
core/ima-setup) have support for loading keys at boot
time.
-What: security/integrity/evm/evm_xattrs
+What: /sys/kernel/security/*/evm/evm_xattrs
Date: April 2018
Contact: Matthew Garrett <mjg59@google.com>
Description:
diff --git a/Documentation/ABI/testing/ima_policy b/Documentation/ABI/testing/ima_policy
index 5c2798534950..49db0ff288e5 100644
--- a/Documentation/ABI/testing/ima_policy
+++ b/Documentation/ABI/testing/ima_policy
@@ -1,4 +1,4 @@
-What: security/ima/policy
+What: /sys/kernel/security/*/ima/policy
Date: May 2008
Contact: Mimi Zohar <zohar@us.ibm.com>
Description:
@@ -22,31 +22,47 @@ Description:
action: measure | dont_measure | appraise | dont_appraise |
audit | hash | dont_hash
condition:= base | lsm [option]
- base: [[func=] [mask=] [fsmagic=] [fsuuid=] [uid=]
- [euid=] [fowner=] [fsname=]]
+ base: [[func=] [mask=] [fsmagic=] [fsuuid=] [fsname=]
+ [uid=] [euid=] [gid=] [egid=]
+ [fowner=] [fgroup=]]
lsm: [[subj_user=] [subj_role=] [subj_type=]
[obj_user=] [obj_role=] [obj_type=]]
- option: [[appraise_type=]] [template=] [permit_directio]
- [appraise_flag=] [appraise_algos=] [keyrings=]
+ option: [digest_type=] [template=] [permit_directio]
+ [appraise_type=] [appraise_flag=]
+ [appraise_algos=] [keyrings=]
base:
func:= [BPRM_CHECK][MMAP_CHECK][CREDS_CHECK][FILE_CHECK][MODULE_CHECK]
[FIRMWARE_CHECK]
[KEXEC_KERNEL_CHECK] [KEXEC_INITRAMFS_CHECK]
[KEXEC_CMDLINE] [KEY_CHECK] [CRITICAL_DATA]
- [SETXATTR_CHECK]
+ [SETXATTR_CHECK][MMAP_CHECK_REQPROT]
mask:= [[^]MAY_READ] [[^]MAY_WRITE] [[^]MAY_APPEND]
[[^]MAY_EXEC]
fsmagic:= hex value
fsuuid:= file system UUID (e.g 8bcbe394-4f13-4144-be8e-5aa9ea2ce2f6)
uid:= decimal value
euid:= decimal value
+ gid:= decimal value
+ egid:= decimal value
fowner:= decimal value
+ fgroup:= decimal value
lsm: are LSM specific
option:
- appraise_type:= [imasig] [imasig|modsig]
+ appraise_type:= [imasig] | [imasig|modsig] | [sigv3]
+			where 'imasig' is the original signature format or
+				the signature format v2.
+ where 'modsig' is an appended signature,
+ where 'sigv3' is the signature format v3. (Currently
+ limited to fsverity digest based signatures
+ stored in security.ima xattr. Requires
+ specifying "digest_type=verity" first.)
+
appraise_flag:= [check_blacklist]
Currently, blacklist check is only for files signed with appended
signature.
+ digest_type:= verity
+ Require fs-verity's file digest instead of the
+ regular IMA file hash.
keyrings:= list of keyrings
(eg, .builtin_trusted_keys|.ima). Only valid
when action is "measure" and func is KEY_CHECK.
@@ -145,3 +161,30 @@ Description:
security.ima xattr of a file:
appraise func=SETXATTR_CHECK appraise_algos=sha256,sha384,sha512
+
+ Example of a 'measure' rule requiring fs-verity's digests
+ with indication of type of digest in the measurement list.
+
+ measure func=FILE_CHECK digest_type=verity \
+ template=ima-ngv2
+
+ Example of 'measure' and 'appraise' rules requiring fs-verity
+ signatures (format version 3) stored in security.ima xattr.
+
+ The 'measure' rule specifies the 'ima-sigv3' template option,
+ which includes the indication of type of digest and the file
+ signature in the measurement list.
+
+ measure func=BPRM_CHECK digest_type=verity \
+ template=ima-sigv3
+
+
+ The 'appraise' rule specifies the type and signature format
+ version (sigv3) required.
+
+ appraise func=BPRM_CHECK digest_type=verity \
+ appraise_type=sigv3
+
+ All of these policy rules could, for example, be constrained
+ either based on a filesystem's UUID (fsuuid) or based on LSM
+ labels.
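+
+		Example of a 'measure' rule constrained by file group
+		ownership, using the fgroup conditional described above
+		(the gid value is illustrative):
+
+		 measure func=FILE_CHECK fgroup=1000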
diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup
index a4e31c465194..b446a7154a1b 100644
--- a/Documentation/ABI/testing/procfs-smaps_rollup
+++ b/Documentation/ABI/testing/procfs-smaps_rollup
@@ -22,6 +22,7 @@ Description:
MMUPageSize: 4 kB
Rss: 884 kB
Pss: 385 kB
+ Pss_Dirty: 68 kB
Pss_Anon: 301 kB
Pss_File: 80 kB
Pss_Shmem: 4 kB
diff --git a/Documentation/ABI/testing/pstore b/Documentation/ABI/testing/pstore
index 5b02540781a2..d3cff4a7ee10 100644
--- a/Documentation/ABI/testing/pstore
+++ b/Documentation/ABI/testing/pstore
@@ -1,4 +1,5 @@
-What: /sys/fs/pstore/... (or /dev/pstore/...)
+What: /sys/fs/pstore/...
+What: /dev/pstore/...
Date: March 2011
KernelVersion: 2.6.39
Contact: tony.luck@intel.com
diff --git a/Documentation/ABI/testing/securityfs-secrets-coco b/Documentation/ABI/testing/securityfs-secrets-coco
new file mode 100644
index 000000000000..f2b6909155f9
--- /dev/null
+++ b/Documentation/ABI/testing/securityfs-secrets-coco
@@ -0,0 +1,51 @@
+What: security/secrets/coco
+Date: February 2022
+Contact: Dov Murik <dovmurik@linux.ibm.com>
+Description:
+ Exposes confidential computing (coco) EFI secrets to
+ userspace via securityfs.
+
+ EFI can declare a memory area used by confidential computing
+ platforms (such as AMD SEV and SEV-ES) for secret injection by
+ the Guest Owner during the VM's launch. The secrets are encrypted
+ by the Guest Owner and decrypted inside the trusted enclave,
+ and therefore are not readable by the untrusted host.
+
+ The efi_secret module exposes the secrets to userspace. Each
+ secret appears as a file under <securityfs>/secrets/coco,
+ where the filename is the GUID of the entry in the secrets
+ table. This module is loaded automatically by the EFI driver
+ if the EFI secret area is populated.
+
+ Two operations are supported for the files: read and unlink.
+ Reading the file returns the content of the secret entry.
+ Unlinking the file overwrites the secret data with zeroes and
+ removes the entry from the filesystem. A secret cannot be read
+ after it has been unlinked.
+
+ For example, listing the available secrets::
+
+ # modprobe efi_secret
+ # ls -l /sys/kernel/security/secrets/coco
+ -r--r----- 1 root root 0 Jun 28 11:54 736870e5-84f0-4973-92ec-06879ce3da0b
+ -r--r----- 1 root root 0 Jun 28 11:54 83c83f7f-1356-4975-8b7e-d3a0b54312c6
+ -r--r----- 1 root root 0 Jun 28 11:54 9553f55d-3da2-43ee-ab5d-ff17f78864d2
+ -r--r----- 1 root root 0 Jun 28 11:54 e6f5a162-d67f-4750-a67c-5d065f2a9910
+
+ Reading the secret data by reading a file::
+
+ # cat /sys/kernel/security/secrets/coco/e6f5a162-d67f-4750-a67c-5d065f2a9910
+ the-content-of-the-secret-data
+
+ Wiping a secret by unlinking a file::
+
+ # rm /sys/kernel/security/secrets/coco/e6f5a162-d67f-4750-a67c-5d065f2a9910
+ # ls -l /sys/kernel/security/secrets/coco
+ -r--r----- 1 root root 0 Jun 28 11:54 736870e5-84f0-4973-92ec-06879ce3da0b
+ -r--r----- 1 root root 0 Jun 28 11:54 83c83f7f-1356-4975-8b7e-d3a0b54312c6
+ -r--r----- 1 root root 0 Jun 28 11:54 9553f55d-3da2-43ee-ab5d-ff17f78864d2
+
+ Note: The binary format of the secrets table injected by the
+ Guest Owner is described in
+ drivers/virt/coco/efi_secret/efi_secret.c under "Structure of
+ the EFI secret area".
diff --git a/Documentation/ABI/testing/sysfs-amd-pmc b/Documentation/ABI/testing/sysfs-amd-pmc
new file mode 100644
index 000000000000..c421b72844f1
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-amd-pmc
@@ -0,0 +1,13 @@
+What: /sys/bus/platform/drivers/amd_pmc/*/smu_fw_version
+Date: October 2022
+Contact: Mario Limonciello <mario.limonciello@amd.com>
+Description: Reading this file reports the version of the firmware loaded to
+ the System Management Unit (SMU) contained in AMD CPUs and
+ APUs.
+
+What: /sys/bus/platform/drivers/amd_pmc/*/smu_program
+Date: October 2022
+Contact: Mario Limonciello <mario.limonciello@amd.com>
+Description: Reading this file reports the program corresponding to the SMU
+ firmware version. The program field is used to disambiguate two
+ APU/CPU models that can share the same firmware binary.
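+
+ For example (output values are illustrative and vary by
+ platform):
+
+ # cat /sys/bus/platform/drivers/amd_pmc/*/smu_fw_version
+ 64.66.0
+ # cat /sys/bus/platform/drivers/amd_pmc/*/smu_program
+ 0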
diff --git a/Documentation/ABI/testing/sysfs-amd-pmf b/Documentation/ABI/testing/sysfs-amd-pmf
new file mode 100644
index 000000000000..7fc0e1c2b76b
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-amd-pmf
@@ -0,0 +1,13 @@
+What: /sys/devices/platform/*/cnqf_enable
+Date: September 2022
+Contact: Shyam Sundar S K <Shyam-sundar.S-k@amd.com>
+Description: Reading this file tells whether the AMD Platform Management
+ Framework (PMF) Cool n Quiet Framework (CnQF) feature is enabled.
+
+ This feature is not enabled by default and is only turned on
+ if the OEM BIOS passes a "flag" to the PMF ACPI function (index
+ 11 or 12) or if the user writes "on".
+
+ To turn off CnQF, the user can write "off" to the sysfs node.
+ Note: Systems that support auto mode will not have this sysfs file
+ available.
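+
+ For example, using the wildcard path from above (a real
+ system has a specific platform device name):
+
+ # echo on > /sys/devices/platform/*/cnqf_enable
+ # cat /sys/devices/platform/*/cnqf_enable
+ on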
diff --git a/Documentation/ABI/testing/sysfs-ata b/Documentation/ABI/testing/sysfs-ata
index 9ab0ef1dd1c7..3daecac48964 100644
--- a/Documentation/ABI/testing/sysfs-ata
+++ b/Documentation/ABI/testing/sysfs-ata
@@ -1,4 +1,4 @@
-What: /sys/class/ata_...
+What: /sys/class/ata_*
Description:
Provide a place in sysfs for storing the ATA topology of the
system. This allows retrieving various information about ATA
@@ -107,13 +107,14 @@ Description:
described in ATA8 7.16 and 7.17. Only valid if
the device is not a PM.
- pio_mode: (RO) Transfer modes supported by the device when
- in PIO mode. Mostly used by PATA device.
+ pio_mode: (RO) PIO transfer mode used by the device.
+ Mostly used by PATA devices.
- xfer_mode: (RO) Current transfer mode
+ xfer_mode: (RO) Current transfer mode. Mostly used by
+ PATA devices.
- dma_mode: (RO) Transfer modes supported by the device when
- in DMA mode. Mostly used by PATA device.
+ dma_mode: (RO) DMA transfer mode used by the device.
+ Mostly used by PATA devices.
class: (RO) Device class. Can be "ata" for disk,
"atapi" for packet device, "pmp" for PM, or
diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
deleted file mode 100644
index a0ed87386639..000000000000
--- a/Documentation/ABI/testing/sysfs-block
+++ /dev/null
@@ -1,330 +0,0 @@
-What: /sys/block/<disk>/stat
-Date: February 2008
-Contact: Jerome Marchand <jmarchan@redhat.com>
-Description:
- The /sys/block/<disk>/stat files displays the I/O
- statistics of disk <disk>. They contain 11 fields:
-
- == ==============================================
- 1 reads completed successfully
- 2 reads merged
- 3 sectors read
- 4 time spent reading (ms)
- 5 writes completed
- 6 writes merged
- 7 sectors written
- 8 time spent writing (ms)
- 9 I/Os currently in progress
- 10 time spent doing I/Os (ms)
- 11 weighted time spent doing I/Os (ms)
- 12 discards completed
- 13 discards merged
- 14 sectors discarded
- 15 time spent discarding (ms)
- 16 flush requests completed
- 17 time spent flushing (ms)
- == ==============================================
-
- For more details refer Documentation/admin-guide/iostats.rst
-
-
-What: /sys/block/<disk>/diskseq
-Date: February 2021
-Contact: Matteo Croce <mcroce@microsoft.com>
-Description:
- The /sys/block/<disk>/diskseq files reports the disk
- sequence number, which is a monotonically increasing
- number assigned to every drive.
- Some devices, like the loop device, refresh such number
- every time the backing file is changed.
- The value type is 64 bit unsigned.
-
-
-What: /sys/block/<disk>/<part>/stat
-Date: February 2008
-Contact: Jerome Marchand <jmarchan@redhat.com>
-Description:
- The /sys/block/<disk>/<part>/stat files display the
- I/O statistics of partition <part>. The format is the
- same as the above-written /sys/block/<disk>/stat
- format.
-
-
-What: /sys/block/<disk>/integrity/format
-Date: June 2008
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Metadata format for integrity capable block device.
- E.g. T10-DIF-TYPE1-CRC.
-
-
-What: /sys/block/<disk>/integrity/read_verify
-Date: June 2008
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Indicates whether the block layer should verify the
- integrity of read requests serviced by devices that
- support sending integrity metadata.
-
-
-What: /sys/block/<disk>/integrity/tag_size
-Date: June 2008
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Number of bytes of integrity tag space available per
- 512 bytes of data.
-
-
-What: /sys/block/<disk>/integrity/device_is_integrity_capable
-Date: July 2014
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Indicates whether a storage device is capable of storing
- integrity metadata. Set if the device is T10 PI-capable.
-
-What: /sys/block/<disk>/integrity/protection_interval_bytes
-Date: July 2015
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Describes the number of data bytes which are protected
- by one integrity tuple. Typically the device's logical
- block size.
-
-What: /sys/block/<disk>/integrity/write_generate
-Date: June 2008
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Indicates whether the block layer should automatically
- generate checksums for write requests bound for
- devices that support receiving integrity metadata.
-
-What: /sys/block/<disk>/alignment_offset
-Date: April 2009
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Storage devices may report a physical block size that is
- bigger than the logical block size (for instance a drive
- with 4KB physical sectors exposing 512-byte logical
- blocks to the operating system). This parameter
- indicates how many bytes the beginning of the device is
- offset from the disk's natural alignment.
-
-What: /sys/block/<disk>/<partition>/alignment_offset
-Date: April 2009
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Storage devices may report a physical block size that is
- bigger than the logical block size (for instance a drive
- with 4KB physical sectors exposing 512-byte logical
- blocks to the operating system). This parameter
- indicates how many bytes the beginning of the partition
- is offset from the disk's natural alignment.
-
-What: /sys/block/<disk>/queue/logical_block_size
-Date: May 2009
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- This is the smallest unit the storage device can
- address. It is typically 512 bytes.
-
-What: /sys/block/<disk>/queue/physical_block_size
-Date: May 2009
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- This is the smallest unit a physical storage device can
- write atomically. It is usually the same as the logical
- block size but may be bigger. One example is SATA
- drives with 4KB sectors that expose a 512-byte logical
- block size to the operating system. For stacked block
- devices the physical_block_size variable contains the
- maximum physical_block_size of the component devices.
-
-What: /sys/block/<disk>/queue/minimum_io_size
-Date: April 2009
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Storage devices may report a granularity or preferred
- minimum I/O size which is the smallest request the
- device can perform without incurring a performance
- penalty. For disk drives this is often the physical
- block size. For RAID arrays it is often the stripe
- chunk size. A properly aligned multiple of
- minimum_io_size is the preferred request size for
- workloads where a high number of I/O operations is
- desired.
-
-What: /sys/block/<disk>/queue/optimal_io_size
-Date: April 2009
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Storage devices may report an optimal I/O size, which is
- the device's preferred unit for sustained I/O. This is
- rarely reported for disk drives. For RAID arrays it is
- usually the stripe width or the internal track size. A
- properly aligned multiple of optimal_io_size is the
- preferred request size for workloads where sustained
- throughput is desired. If no optimal I/O size is
- reported this file contains 0.
-
-What: /sys/block/<disk>/queue/nomerges
-Date: January 2010
-Contact:
-Description:
- Standard I/O elevator operations include attempts to
- merge contiguous I/Os. For known random I/O loads these
- attempts will always fail and result in extra cycles
- being spent in the kernel. This allows one to turn off
- this behavior on one of two ways: When set to 1, complex
- merge checks are disabled, but the simple one-shot merges
- with the previous I/O request are enabled. When set to 2,
- all merge tries are disabled. The default value is 0 -
- which enables all types of merge tries.
-
-What: /sys/block/<disk>/discard_alignment
-Date: May 2011
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Devices that support discard functionality may
- internally allocate space in units that are bigger than
- the exported logical block size. The discard_alignment
- parameter indicates how many bytes the beginning of the
- device is offset from the internal allocation unit's
- natural alignment.
-
-What: /sys/block/<disk>/<partition>/discard_alignment
-Date: May 2011
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Devices that support discard functionality may
- internally allocate space in units that are bigger than
- the exported logical block size. The discard_alignment
- parameter indicates how many bytes the beginning of the
- partition is offset from the internal allocation unit's
- natural alignment.
-
-What: /sys/block/<disk>/queue/discard_granularity
-Date: May 2011
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Devices that support discard functionality may
- internally allocate space using units that are bigger
- than the logical block size. The discard_granularity
- parameter indicates the size of the internal allocation
- unit in bytes if reported by the device. Otherwise the
- discard_granularity will be set to match the device's
- physical block size. A discard_granularity of 0 means
- that the device does not support discard functionality.
-
-What: /sys/block/<disk>/queue/discard_max_bytes
-Date: May 2011
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Devices that support discard functionality may have
- internal limits on the number of bytes that can be
- trimmed or unmapped in a single operation. Some storage
- protocols also have inherent limits on the number of
- blocks that can be described in a single command. The
- discard_max_bytes parameter is set by the device driver
- to the maximum number of bytes that can be discarded in
- a single operation. Discard requests issued to the
- device must not exceed this limit. A discard_max_bytes
- value of 0 means that the device does not support
- discard functionality.
-
-What: /sys/block/<disk>/queue/discard_zeroes_data
-Date: May 2011
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Will always return 0. Don't rely on any specific behavior
- for discards, and don't read this file.
-
-What: /sys/block/<disk>/queue/write_same_max_bytes
-Date: January 2012
-Contact: Martin K. Petersen <martin.petersen@oracle.com>
-Description:
- Some devices support a write same operation in which a
- single data block can be written to a range of several
- contiguous blocks on storage. This can be used to wipe
- areas on disk or to initialize drives in a RAID
- configuration. write_same_max_bytes indicates how many
- bytes can be written in a single write same command. If
- write_same_max_bytes is 0, write same is not supported
- by the device.
-
-What: /sys/block/<disk>/queue/write_zeroes_max_bytes
-Date: November 2016
-Contact: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-Description:
- Devices that support write zeroes operation in which a
- single request can be issued to zero out the range of
- contiguous blocks on storage without having any payload
- in the request. This can be used to optimize writing zeroes
- to the devices. write_zeroes_max_bytes indicates how many
- bytes can be written in a single write zeroes command. If
- write_zeroes_max_bytes is 0, write zeroes is not supported
- by the device.
-
-What: /sys/block/<disk>/queue/zoned
-Date: September 2016
-Contact: Damien Le Moal <damien.lemoal@wdc.com>
-Description:
- zoned indicates if the device is a zoned block device
- and the zone model of the device if it is indeed zoned.
- The possible values indicated by zoned are "none" for
- regular block devices and "host-aware" or "host-managed"
- for zoned block devices. The characteristics of
- host-aware and host-managed zoned block devices are
- described in the ZBC (Zoned Block Commands) and ZAC
- (Zoned Device ATA Command Set) standards. These standards
- also define the "drive-managed" zone model. However,
- since drive-managed zoned block devices do not support
- zone commands, they will be treated as regular block
- devices and zoned will report "none".
-
-What: /sys/block/<disk>/queue/nr_zones
-Date: November 2018
-Contact: Damien Le Moal <damien.lemoal@wdc.com>
-Description:
- nr_zones indicates the total number of zones of a zoned block
- device ("host-aware" or "host-managed" zone model). For regular
- block devices, the value is always 0.
-
-What: /sys/block/<disk>/queue/max_active_zones
-Date: July 2020
-Contact: Niklas Cassel <niklas.cassel@wdc.com>
-Description:
- For zoned block devices (zoned attribute indicating
- "host-managed" or "host-aware"), the sum of zones belonging to
- any of the zone states: EXPLICIT OPEN, IMPLICIT OPEN or CLOSED,
- is limited by this value. If this value is 0, there is no limit.
-
-What: /sys/block/<disk>/queue/max_open_zones
-Date: July 2020
-Contact: Niklas Cassel <niklas.cassel@wdc.com>
-Description:
- For zoned block devices (zoned attribute indicating
- "host-managed" or "host-aware"), the sum of zones belonging to
- any of the zone states: EXPLICIT OPEN or IMPLICIT OPEN,
- is limited by this value. If this value is 0, there is no limit.
-
-What: /sys/block/<disk>/queue/chunk_sectors
-Date: September 2016
-Contact: Hannes Reinecke <hare@suse.com>
-Description:
- chunk_sectors has different meaning depending on the type
- of the disk. For a RAID device (dm-raid), chunk_sectors
- indicates the size in 512B sectors of the RAID volume
- stripe segment. For a zoned block device, either
- host-aware or host-managed, chunk_sectors indicates the
- size in 512B sectors of the zones of the device, with
- the eventual exception of the last zone of the device
- which may be smaller.
-
-What: /sys/block/<disk>/queue/io_timeout
-Date: November 2018
-Contact: Weiping Zhang <zhangweiping@didiglobal.com>
-Description:
- io_timeout is the request timeout in milliseconds. If a request
- does not complete in this time then the block driver timeout
- handler is invoked. That timeout handler can decide to retry
- the request, to fail it or to start a device recovery strategy.
diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 14b2bf2e5105..628a00fb20a9 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -137,3 +137,17 @@ Description:
The writeback_limit file is read-write and specifies the maximum
amount of writeback ZRAM can do. The limit could be changed
in run time.
+
+What: /sys/block/zram<id>/recomp_algorithm
+Date: November 2022
+Contact: Sergey Senozhatsky <senozhatsky@chromium.org>
+Description:
+ The recomp_algorithm file is read-write and allows setting
+ or showing the secondary compression algorithms.
+
+What: /sys/block/zram<id>/recompress
+Date: November 2022
+Contact: Sergey Senozhatsky <senozhatsky@chromium.org>
+Description:
+ The recompress file is write-only and triggers re-compression
+ with secondary compression algorithms.
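+
+ A hypothetical usage sketch, with the parameter syntax
+ described in Documentation/admin-guide/blockdev/zram.rst:
+
+ # echo "algo=zstd priority=1" > /sys/block/zram0/recomp_algorithm
+ # echo "type=idle" > /sys/block/zram0/recompress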
diff --git a/Documentation/ABI/testing/sysfs-bus-bcma b/Documentation/ABI/testing/sysfs-bus-bcma
index 721b4aea3020..e93d3ddca844 100644
--- a/Documentation/ABI/testing/sysfs-bus-bcma
+++ b/Documentation/ABI/testing/sysfs-bus-bcma
@@ -3,7 +3,7 @@ Date: May 2011
KernelVersion: 3.0
Contact: Rafał Miłecki <zajec5@gmail.com>
Description:
- Each BCMA core has it's manufacturer id. See
+ Each BCMA core has its manufacturer id. See
include/linux/bcma/bcma.h for possible values.
What: /sys/bus/bcma/devices/.../id
diff --git a/Documentation/ABI/testing/sysfs-bus-cdx b/Documentation/ABI/testing/sysfs-bus-cdx
new file mode 100644
index 000000000000..7af477f49998
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-cdx
@@ -0,0 +1,56 @@
+What: /sys/bus/cdx/rescan
+Date: March 2023
+Contact: nipun.gupta@amd.com
+Description:
+ Writing y/1/on to this file triggers a rescan of the CDX bus.
+ Any new devices are scanned and added to the list of Linux
+ devices and any devices removed are also deleted from Linux.
+
+ For example::
+
+ # echo 1 > /sys/bus/cdx/rescan
+
+What: /sys/bus/cdx/devices/.../vendor
+Date: March 2023
+Contact: nipun.gupta@amd.com
+Description:
+ Vendor ID for this CDX device, in hexadecimal. The Vendor ID
+ is a 16-bit identifier specific to the device manufacturer.
+ The combination of Vendor ID and Device ID identifies a device.
+
+What: /sys/bus/cdx/devices/.../device
+Date: March 2023
+Contact: nipun.gupta@amd.com
+Description:
+ Device ID for this CDX device, in hexadecimal. The Device ID
+ is a 16-bit identifier for a device type within the range of
+ a device manufacturer.
+ The combination of Vendor ID and Device ID identifies a device.
+
+What: /sys/bus/cdx/devices/.../reset
+Date: March 2023
+Contact: nipun.gupta@amd.com
+Description:
+ Writing y/1/on to this file resets the CDX device.
+ On resetting the device, the corresponding driver is notified
+ twice, once before the device is reset, and again after
+ the reset has completed.
+
+ For example::
+
+ # echo 1 > /sys/bus/cdx/.../reset
+
+What: /sys/bus/cdx/devices/.../remove
+Date: March 2023
+Contact: tarak.reddy@amd.com
+Description:
+ Writing y/1/on to this file removes the corresponding
+ device from the CDX bus. If the device is to be
+ reconfigured in the hardware, it can be removed, so
+ that the device driver does not access the device while it
+ is being reconfigured.
+
+ For example::
+
+ # echo 1 > /sys/bus/cdx/devices/.../remove
diff --git a/Documentation/ABI/testing/sysfs-bus-coreboot b/Documentation/ABI/testing/sysfs-bus-coreboot
new file mode 100644
index 000000000000..9c5accecc470
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-coreboot
@@ -0,0 +1,45 @@
+What: /sys/bus/coreboot
+Date: August 2022
+Contact: Jack Rosenthal <jrosenth@chromium.org>
+Description:
+ The coreboot bus provides a variety of virtual devices used to
+ access data structures created by the Coreboot BIOS.
+
+What: /sys/bus/coreboot/devices/cbmem-<id>
+Date: August 2022
+Contact: Jack Rosenthal <jrosenth@chromium.org>
+Description:
+ CBMEM is a downwards-growing memory region created by Coreboot,
+ and contains tagged data structures to be shared with payloads
+ in the boot process and the OS. Each CBMEM entry is given a
+ directory in /sys/bus/coreboot/devices based on its id.
+ A list of ids known to Coreboot can be found in the coreboot
+ source tree at
+ ``src/commonlib/bsd/include/commonlib/bsd/cbmem_id.h``.
+
+What: /sys/bus/coreboot/devices/cbmem-<id>/address
+Date: August 2022
+Contact: Jack Rosenthal <jrosenth@chromium.org>
+Description:
+ This is the physical memory address that the CBMEM entry's data
+ begins at, in hexadecimal (e.g., ``0x76ffe000``).
+
+What: /sys/bus/coreboot/devices/cbmem-<id>/size
+Date: August 2022
+Contact: Jack Rosenthal <jrosenth@chromium.org>
+Description:
+ This is the size of the CBMEM entry's data, in hexadecimal
+ (e.g., ``0x1234``).
+
+What: /sys/bus/coreboot/devices/cbmem-<id>/mem
+Date: August 2022
+Contact: Jack Rosenthal <jrosenth@chromium.org>
+Description:
+ A file exposing read/write access to the entry's data. Note
+ that this file does not support mmap(), as coreboot
+ does not guarantee that the data will be page-aligned.
+
+ The mode of this file is 0600. While there shouldn't be
+ anything security-sensitive contained in CBMEM, read access
+ requires root privileges since this exposes a small subset
+ of physical memory.
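+
+ For example, inspecting a hypothetical CBMEM entry (the id
+ and address shown are illustrative):
+
+ # cat /sys/bus/coreboot/devices/cbmem-c0fec0de/address
+ 0x76ffe000
+ # hexdump -C /sys/bus/coreboot/devices/cbmem-c0fec0de/mem | head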
diff --git a/Documentation/ABI/testing/sysfs-bus-coresight-devices-etm3x b/Documentation/ABI/testing/sysfs-bus-coresight-devices-etm3x
index 651602a61eac..234c33fbdb55 100644
--- a/Documentation/ABI/testing/sysfs-bus-coresight-devices-etm3x
+++ b/Documentation/ABI/testing/sysfs-bus-coresight-devices-etm3x
@@ -236,7 +236,7 @@ What: /sys/bus/coresight/devices/<memory_map>.[etm|ptm]/traceid
Date: November 2014
KernelVersion: 3.19
Contact: Mathieu Poirier <mathieu.poirier@linaro.org>
-Description: (RW) Holds the trace ID that will appear in the trace stream
+Description: (RO) Holds the trace ID that will appear in the trace stream
coming from this trace entity.
What: /sys/bus/coresight/devices/<memory_map>.[etm|ptm]/trigger_event
diff --git a/Documentation/ABI/testing/sysfs-bus-coresight-devices-etm4x b/Documentation/ABI/testing/sysfs-bus-coresight-devices-etm4x
index 8e53a32f8150..08b1964f27d3 100644
--- a/Documentation/ABI/testing/sysfs-bus-coresight-devices-etm4x
+++ b/Documentation/ABI/testing/sysfs-bus-coresight-devices-etm4x
@@ -516,3 +516,11 @@ Contact: Mathieu Poirier <mathieu.poirier@linaro.org>
Description: (Read) Returns the number of special conditional P1 right-hand keys
that the trace unit can use (0x194). The value is taken
directly from the HW.
+
+What: /sys/bus/coresight/devices/etm<N>/ts_source
+Date: October 2022
+KernelVersion: 6.1
+Contact: Mathieu Poirier <mathieu.poirier@linaro.org> or Suzuki K Poulose <suzuki.poulose@arm.com>
+Description: (Read) When FEAT_TRF is implemented, value of TRFCR_ELx.TS used for
+ trace session. Otherwise -1 indicates an unknown time source. Check
+ trcidr0.tssize to see if a global timestamp is available.
diff --git a/Documentation/ABI/testing/sysfs-bus-coresight-devices-tpdm b/Documentation/ABI/testing/sysfs-bus-coresight-devices-tpdm
new file mode 100644
index 000000000000..4a58e649550d
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-coresight-devices-tpdm
@@ -0,0 +1,13 @@
+What: /sys/bus/coresight/devices/<tpdm-name>/integration_test
+Date: January 2023
+KernelVersion: 6.2
+Contact: Jinlong Mao (QUIC) <quic_jinlmao@quicinc.com>, Tao Zhang (QUIC) <quic_taozha@quicinc.com>
+Description:
+ (Write) Run the integration test for the TPDM. The integration
+ test generates test data for the TPDM, which helps to verify
+ that the trace path is enabled and the link configurations
+ are correct.
+
+ Accepts only one of two values - 1 or 2.
+ 1 : Generate 64-bit data
+ 2 : Generate 32-bit data
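+
+ For example, requesting 64-bit test data generation (the
+ device name is illustrative):
+
+ # echo 1 > /sys/bus/coresight/devices/tpdm0/integration_test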
diff --git a/Documentation/ABI/testing/sysfs-bus-coresight-devices-ultra_smb b/Documentation/ABI/testing/sysfs-bus-coresight-devices-ultra_smb
new file mode 100644
index 000000000000..f560918ae738
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-coresight-devices-ultra_smb
@@ -0,0 +1,31 @@
+What: /sys/bus/coresight/devices/ultra_smb<N>/enable_sink
+Date: January 2023
+KernelVersion: 6.3
+Contact: Junhao He <hejunhao3@huawei.com>
+Description: (RW) Add/remove an SMB device from a trace path. There can be
+ multiple sources for a single SMB device.
+
+What: /sys/bus/coresight/devices/ultra_smb<N>/mgmt/buf_size
+Date: January 2023
+KernelVersion: 6.3
+Contact: Junhao He <hejunhao3@huawei.com>
+Description: (RO) Shows the buffer size of each UltraSoc SMB device.
+
+What: /sys/bus/coresight/devices/ultra_smb<N>/mgmt/buf_status
+Date: January 2023
+KernelVersion: 6.3
+Contact: Junhao He <hejunhao3@huawei.com>
+Description: (RO) Shows the value of the UltraSoc SMB status register.
+ A zero BIT(0) means the buffer is empty.
+
+What: /sys/bus/coresight/devices/ultra_smb<N>/mgmt/read_pos
+Date: January 2023
+KernelVersion: 6.3
+Contact: Junhao He <hejunhao3@huawei.com>
+Description: (RO) Shows the value of UltraSoc SMB Read Pointer register.
+
+What: /sys/bus/coresight/devices/ultra_smb<N>/mgmt/write_pos
+Date: January 2023
+KernelVersion: 6.3
+Contact: Junhao He <hejunhao3@huawei.com>
+Description: (RO) Shows the value of UltraSoc SMB Write Pointer register.
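+
+ For example, a hypothetical check of whether the SMB buffer
+ holds trace data (device instance and value are illustrative):
+
+ # cat /sys/bus/coresight/devices/ultra_smb0/mgmt/buf_status
+ 0x1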
diff --git a/Documentation/ABI/testing/sysfs-bus-counter b/Documentation/ABI/testing/sysfs-bus-counter
index 20fe5afd4f9e..1417c4272c6c 100644
--- a/Documentation/ABI/testing/sysfs-bus-counter
+++ b/Documentation/ABI/testing/sysfs-bus-counter
@@ -1,9 +1,45 @@
+What: /sys/bus/counter/devices/counterX/cascade_counts_enable
+KernelVersion: 6.4
+Contact: linux-iio@vger.kernel.org
+Description:
+ Indicates the cascading of Counts on Counter X.
+
+ Valid attribute values are boolean.
+
+What: /sys/bus/counter/devices/counterX/external_input_phase_clock_select
+KernelVersion: 6.4
+Contact: linux-iio@vger.kernel.org
+Description:
+ Selects the external clock pin for phase counting mode of
+ Counter X.
+
+ MTCLKA-MTCLKB:
+ MTCLKA and MTCLKB pins are selected for the external
+ phase clock.
+
+ MTCLKC-MTCLKD:
+ MTCLKC and MTCLKD pins are selected for the external
+ phase clock.
+
+What: /sys/bus/counter/devices/counterX/external_input_phase_clock_select_available
+KernelVersion: 6.4
+Contact: linux-iio@vger.kernel.org
+Description:
+ Discrete set of available values for the respective device
+ configuration are listed in this file.
+
What: /sys/bus/counter/devices/counterX/countY/count
KernelVersion: 5.2
Contact: linux-iio@vger.kernel.org
Description:
Count data of Count Y represented as a string.

+What: /sys/bus/counter/devices/counterX/countY/capture
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Historical capture of the Count Y count data.
+
What: /sys/bus/counter/devices/counterX/countY/ceiling
KernelVersion: 5.2
Contact: linux-iio@vger.kernel.org
@@ -203,6 +239,39 @@ Description:
both edges:
Any state transition.

+What: /sys/bus/counter/devices/counterX/countY/num_overflows
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ This attribute indicates the number of overflows of count Y.
+
+What: /sys/bus/counter/devices/counterX/cascade_counts_enable_component_id
+What: /sys/bus/counter/devices/counterX/external_input_phase_clock_select_component_id
+What: /sys/bus/counter/devices/counterX/countY/capture_component_id
+What: /sys/bus/counter/devices/counterX/countY/ceiling_component_id
+What: /sys/bus/counter/devices/counterX/countY/floor_component_id
+What: /sys/bus/counter/devices/counterX/countY/count_mode_component_id
+What: /sys/bus/counter/devices/counterX/countY/direction_component_id
+What: /sys/bus/counter/devices/counterX/countY/enable_component_id
+What: /sys/bus/counter/devices/counterX/countY/error_noise_component_id
+What: /sys/bus/counter/devices/counterX/countY/prescaler_component_id
+What: /sys/bus/counter/devices/counterX/countY/preset_component_id
+What: /sys/bus/counter/devices/counterX/countY/preset_enable_component_id
+What: /sys/bus/counter/devices/counterX/countY/signalZ_action_component_id
+What: /sys/bus/counter/devices/counterX/countY/num_overflows_component_id
+What: /sys/bus/counter/devices/counterX/signalY/cable_fault_component_id
+What: /sys/bus/counter/devices/counterX/signalY/cable_fault_enable_component_id
+What: /sys/bus/counter/devices/counterX/signalY/filter_clock_prescaler_component_id
+What: /sys/bus/counter/devices/counterX/signalY/index_polarity_component_id
+What: /sys/bus/counter/devices/counterX/signalY/polarity_component_id
+What: /sys/bus/counter/devices/counterX/signalY/synchronous_mode_component_id
+What: /sys/bus/counter/devices/counterX/signalY/frequency_component_id
+KernelVersion: 5.16
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read-only attribute that indicates the component ID of the
+ respective extension or Synapse.
+
What: /sys/bus/counter/devices/counterX/countY/spike_filter_ns
KernelVersion: 5.14
Contact: linux-iio@vger.kernel.org
@@ -212,6 +281,14 @@ Description:
shorter or equal to configured value are ignored. Value 0 means
filter is disabled.

+What: /sys/bus/counter/devices/counterX/events_queue_size
+KernelVersion: 5.16
+Contact: linux-iio@vger.kernel.org
+Description:
+ Size of the Counter events queue in number of struct
+ counter_event data structures. The number of elements will be
+ rounded-up to a power of 2.
+
What: /sys/bus/counter/devices/counterX/name
KernelVersion: 5.2
Contact: linux-iio@vger.kernel.org
@@ -274,6 +351,19 @@ Description:
Discrete set of available values for the respective Signal Y
configuration are listed in this file.

+What: /sys/bus/counter/devices/counterX/signalY/polarity
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Active level of Signal Y. The following polarity values are
+ available:
+
+ positive:
+ Signal high state considered active level (rising edge).
+
+ negative:
+ Signal low state considered active level (falling edge).
+
What: /sys/bus/counter/devices/counterX/signalY/name
KernelVersion: 5.2
Contact: linux-iio@vger.kernel.org
@@ -286,7 +376,14 @@ What: /sys/bus/counter/devices/counterX/signalY/signal
KernelVersion: 5.2
Contact: linux-iio@vger.kernel.org
Description:
- Signal data of Signal Y represented as a string.
+ Signal level state of Signal Y. The following signal level
+ states are available:
+
+ low:
+ Low level state.
+
+ high:
+ High level state.

What: /sys/bus/counter/devices/counterX/signalY/synchronous_mode
KernelVersion: 5.2
@@ -309,3 +406,9 @@ Description:
via index_polarity. The index function (as enabled via
preset_enable) is performed synchronously with the
quadrature clock on the active level of the index input.
+
+What: /sys/bus/counter/devices/counterX/signalY/frequency
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read-only attribute that indicates the signal Y frequency, in Hz.
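+
+ An illustrative sketch of exercising some of these attributes
+ from the shell (device, Count and Signal numbering depend on
+ the hardware):
+
+ # cat /sys/bus/counter/devices/counter0/count0/count
+ 42
+ # cat /sys/bus/counter/devices/counter0/signal0/polarity
+ positive
+ # echo 64 > /sys/bus/counter/devices/counter0/events_queue_size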
diff --git a/Documentation/ABI/testing/sysfs-bus-css b/Documentation/ABI/testing/sysfs-bus-css
index 12a733fe357f..d4d5cfb63b90 100644
--- a/Documentation/ABI/testing/sysfs-bus-css
+++ b/Documentation/ABI/testing/sysfs-bus-css
@@ -1,22 +1,19 @@
What: /sys/bus/css/devices/.../type
Date: March 2008
-Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
- linux-s390@vger.kernel.org
+Contact: linux-s390@vger.kernel.org
Description: Contains the subchannel type, as reported by the hardware.
This attribute is present for all subchannel types.
What: /sys/bus/css/devices/.../modalias
Date: March 2008
-Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
- linux-s390@vger.kernel.org
+Contact: linux-s390@vger.kernel.org
Description: Contains the module alias as reported with uevents.
It is of the format css:t<type> and present for all
subchannel types.
What: /sys/bus/css/drivers/io_subchannel/.../chpids
Date: December 2002
-Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
- linux-s390@vger.kernel.org
+Contact: linux-s390@vger.kernel.org
Description: Contains the ids of the channel paths used by this
subchannel, as reported by the channel subsystem
during subchannel recognition.
@@ -26,8 +23,7 @@ Users: s390-tools, HAL
What: /sys/bus/css/drivers/io_subchannel/.../pimpampom
Date: December 2002
-Contact: Cornelia Huck <cornelia.huck@de.ibm.com>
- linux-s390@vger.kernel.org
+Contact: linux-s390@vger.kernel.org
Description: Contains the PIM/PAM/POM values, as reported by the
channel subsystem when last queried by the common I/O
layer (this implies that this attribute is not necessarily
@@ -38,8 +34,7 @@ Users: s390-tools, HAL
What: /sys/bus/css/devices/.../driver_override
Date: June 2019
-Contact: Cornelia Huck <cohuck@redhat.com>
- linux-s390@vger.kernel.org
+Contact: linux-s390@vger.kernel.org
Description: This file allows the driver for a device to be specified. When
specified, only a driver with a name matching the value written
to driver_override will have an opportunity to bind to the
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
index 0b6a2e6e8fbb..48ac0d911801 100644
--- a/Documentation/ABI/testing/sysfs-bus-cxl
+++ b/Documentation/ABI/testing/sysfs-bus-cxl
@@ -1,3 +1,13 @@
+What: /sys/bus/cxl/flush
+Date: January, 2022
+KernelVersion: v5.18
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (WO) If userspace manually unbinds a port, the kernel schedules
+ all descendant memdevs for unbind. Writing '1' to this attribute
+ flushes that work.
+
+
What: /sys/bus/cxl/devices/memX/firmware_version
Date: December, 2020
KernelVersion: v5.12
@@ -7,6 +17,7 @@ Description:
Memory Device Output Payload in the CXL-2.0
specification.
+
What: /sys/bus/cxl/devices/memX/ram/size
Date: December, 2020
KernelVersion: v5.12
@@ -16,6 +27,7 @@ Description:
identically named field in the Identify Memory Device Output
Payload in the CXL-2.0 specification.
+
What: /sys/bus/cxl/devices/memX/pmem/size
Date: December, 2020
KernelVersion: v5.12
@@ -25,105 +37,395 @@ Description:
identically named field in the Identify Memory Device Output
Payload in the CXL-2.0 specification.
+
+What: /sys/bus/cxl/devices/memX/serial
+Date: January, 2022
+KernelVersion: v5.18
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) 64-bit serial number per the PCIe Device Serial Number
+ capability. Mandatory for CXL devices, see CXL 2.0 8.1.12.2
+ Memory Device PCIe Capabilities and Extended Capabilities.
+
+
+What: /sys/bus/cxl/devices/memX/numa_node
+Date: January, 2022
+KernelVersion: v5.18
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) If NUMA is enabled and the platform has affinitized the
+ host PCI device for this memory device, emit the CPU node
+ affinity for this device.
+
+
What: /sys/bus/cxl/devices/*/devtype
Date: June, 2021
KernelVersion: v5.14
Contact: linux-cxl@vger.kernel.org
Description:
- CXL device objects export the devtype attribute which mirrors
- the same value communicated in the DEVTYPE environment variable
- for uevents for devices on the "cxl" bus.
+ (RO) CXL device objects export the devtype attribute which
+ mirrors the same value communicated in the DEVTYPE environment
+ variable for uevents for devices on the "cxl" bus.
+
+
+What: /sys/bus/cxl/devices/*/modalias
+Date: December, 2021
+KernelVersion: v5.18
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) CXL device objects export the modalias attribute which
+ mirrors the same value communicated in the MODALIAS environment
+ variable for uevents for devices on the "cxl" bus.
+
What: /sys/bus/cxl/devices/portX/uport
Date: June, 2021
KernelVersion: v5.14
Contact: linux-cxl@vger.kernel.org
Description:
- CXL port objects are enumerated from either a platform firmware
- device (ACPI0017 and ACPI0016) or PCIe switch upstream port with
- CXL component registers. The 'uport' symlink connects the CXL
- portX object to the device that published the CXL port
+ (RO) CXL port objects are enumerated from either a platform
+ firmware device (ACPI0017 and ACPI0016) or PCIe switch upstream
+ port with CXL component registers. The 'uport' symlink connects
+ the CXL portX object to the device that published the CXL port
capability.
+
+What: /sys/bus/cxl/devices/{port,endpoint}X/parent_dport
+Date: January, 2023
+KernelVersion: v6.3
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) CXL port objects are instantiated for each upstream port in
+ a CXL/PCIe switch, and for each endpoint to map the
+ corresponding memory device into the CXL port hierarchy. When a
+ descendant CXL port (switch or endpoint) is enumerated it is
+ useful to know which 'dport' object in the parent CXL port
+ routes to this descendant. The 'parent_dport' symlink points to
+ the device representing the downstream port of a CXL switch that
+ routes to {port,endpoint}X.
+
+
What: /sys/bus/cxl/devices/portX/dportY
Date: June, 2021
KernelVersion: v5.14
Contact: linux-cxl@vger.kernel.org
Description:
- CXL port objects are enumerated from either a platform firmware
- device (ACPI0017 and ACPI0016) or PCIe switch upstream port with
- CXL component registers. The 'dportY' symlink identifies one or
- more downstream ports that the upstream port may target in its
- decode of CXL memory resources. The 'Y' integer reflects the
- hardware port unique-id used in the hardware decoder target
- list.
+ (RO) CXL port objects are enumerated from either a platform
+ firmware device (ACPI0017 and ACPI0016) or PCIe switch upstream
+ port with CXL component registers. The 'dportY' symlink
+ identifies one or more downstream ports that the upstream port
+ may target in its decode of CXL memory resources. The 'Y'
+ integer reflects the hardware port unique-id used in the
+ hardware decoder target list.
+
What: /sys/bus/cxl/devices/decoderX.Y
Date: June, 2021
KernelVersion: v5.14
Contact: linux-cxl@vger.kernel.org
Description:
- CXL decoder objects are enumerated from either a platform
+ (RO) CXL decoder objects are enumerated from either a platform
firmware description, or a CXL HDM decoder register set in a
PCIe device (see CXL 2.0 section 8.2.5.12 CXL HDM Decoder
Capability Structure). The 'X' in decoderX.Y represents the
cxl_port container of this decoder, and 'Y' represents the
instance id of a given decoder resource.
+
What: /sys/bus/cxl/devices/decoderX.Y/{start,size}
Date: June, 2021
KernelVersion: v5.14
Contact: linux-cxl@vger.kernel.org
Description:
- The 'start' and 'size' attributes together convey the physical
- address base and number of bytes mapped in the decoder's decode
- window. For decoders of devtype "cxl_decoder_root" the address
- range is fixed. For decoders of devtype "cxl_decoder_switch" the
- address is bounded by the decode range of the cxl_port ancestor
- of the decoder's cxl_port, and dynamically updates based on the
- active memory regions in that address space.
+ (RO) The 'start' and 'size' attributes together convey the
+ physical address base and number of bytes mapped in the
+ decoder's decode window. For decoders of devtype
+ "cxl_decoder_root" the address range is fixed. For decoders of
+ devtype "cxl_decoder_switch" the address is bounded by the
+ decode range of the cxl_port ancestor of the decoder's cxl_port,
+ and dynamically updates based on the active memory regions in
+ that address space.
+
What: /sys/bus/cxl/devices/decoderX.Y/locked
Date: June, 2021
KernelVersion: v5.14
Contact: linux-cxl@vger.kernel.org
Description:
- CXL HDM decoders have the capability to lock the configuration
- until the next device reset. For decoders of devtype
- "cxl_decoder_root" there is no standard facility to unlock them.
- For decoders of devtype "cxl_decoder_switch" a secondary bus
- reset, of the PCIe bridge that provides the bus for this
- decoders uport, unlocks / resets the decoder.
+ (RO) CXL HDM decoders have the capability to lock the
+ configuration until the next device reset. For decoders of
+ devtype "cxl_decoder_root" there is no standard facility to
+ unlock them. For decoders of devtype "cxl_decoder_switch" a
+ secondary bus reset, of the PCIe bridge that provides the bus
+ for this decoder's uport, unlocks / resets the decoder.
+
What: /sys/bus/cxl/devices/decoderX.Y/target_list
Date: June, 2021
KernelVersion: v5.14
Contact: linux-cxl@vger.kernel.org
Description:
- Display a comma separated list of the current decoder target
- configuration. The list is ordered by the current configured
- interleave order of the decoder's dport instances. Each entry in
- the list is a dport id.
+ (RO) Display a comma separated list of the current decoder
+ target configuration. The list is ordered by the current
+ configured interleave order of the decoder's dport instances.
+ Each entry in the list is a dport id.
+
What: /sys/bus/cxl/devices/decoderX.Y/cap_{pmem,ram,type2,type3}
Date: June, 2021
KernelVersion: v5.14
Contact: linux-cxl@vger.kernel.org
Description:
- When a CXL decoder is of devtype "cxl_decoder_root", it
+ (RO) When a CXL decoder is of devtype "cxl_decoder_root", it
represents a fixed memory window identified by platform
firmware. A fixed window may only support a subset of memory
types. The 'cap_*' attributes indicate whether persistent
memory, volatile memory, accelerator memory, and / or expander
memory may be mapped behind this decoder's memory window.
+
What: /sys/bus/cxl/devices/decoderX.Y/target_type
Date: June, 2021
KernelVersion: v5.14
Contact: linux-cxl@vger.kernel.org
Description:
- When a CXL decoder is of devtype "cxl_decoder_switch", it can
- optionally decode either accelerator memory (type-2) or expander
- memory (type-3). The 'target_type' attribute indicates the
- current setting which may dynamically change based on what
+ (RO) When a CXL decoder is of devtype "cxl_decoder_switch", it
+ can optionally decode either accelerator memory (type-2) or
+ expander memory (type-3). The 'target_type' attribute indicates
+ the current setting which may dynamically change based on what
memory regions are activated in this decode hierarchy.
+
+
+What: /sys/bus/cxl/devices/endpointX/CDAT
+Date: July, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) If this sysfs entry is not present, no DOE mailbox was
+ found to support CDAT data. If it is present and the length of
+ the data is 0, reading the CDAT data failed. Otherwise the CDAT
+ data is reported.
+
+
+What: /sys/bus/cxl/devices/decoderX.Y/mode
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
+ translates from a host physical address range, to a device local
+ address range. Device-local address ranges are further split
+ into a 'ram' (volatile memory) range and 'pmem' (persistent
+ memory) range. The 'mode' attribute emits one of 'ram', 'pmem',
+ 'mixed', or 'none'. The 'mixed' indication is for error cases
+ when a decoder straddles the volatile/persistent partition
+ boundary, and 'none' indicates the decoder is not actively
+ decoding, or no DPA allocation policy has been set.
+
+ 'mode' can be written, when the decoder is in the 'disabled'
+ state, with either 'ram' or 'pmem' to set the boundaries for the
+ next allocation.
+
+
+What: /sys/bus/cxl/devices/decoderX.Y/dpa_resource
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) When a CXL decoder is of devtype "cxl_decoder_endpoint",
+ and its 'dpa_size' attribute is non-zero, this attribute
+ indicates the device physical address (DPA) base address of the
+ allocation.
+
+
+What: /sys/bus/cxl/devices/decoderX.Y/dpa_size
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RW) When a CXL decoder is of devtype "cxl_decoder_endpoint" it
+ translates from a host physical address range, to a device local
+ address range. The range, base address plus length in bytes, of
+ DPA allocated to this decoder is conveyed in these 2 attributes.
+ Allocations can be mutated as long as the decoder is in the
+ disabled state. A write to 'dpa_size' releases the previous DPA
+ allocation and then attempts to allocate from the free capacity
+ in the device partition referred to by 'decoderX.Y/mode'.
+ Allocate and free requests can only be performed on the highest
+ instance number disabled decoder with non-zero size. I.e.
+ allocations are enforced to occur in increasing 'decoderX.Y/id'
+ order and frees are enforced to occur in decreasing
+ 'decoderX.Y/id' order.
+
+
+What: /sys/bus/cxl/devices/decoderX.Y/interleave_ways
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) The number of targets across which this decoder's host
+ physical address (HPA) memory range is interleaved. The device
+ maps every Nth block of HPA (of size ==
+ 'interleave_granularity') to consecutive DPA addresses. The
+ decoder's position in the interleave is determined by the
+ device's (endpoint or switch) switch ancestry. For root
+ decoders their interleave is specified by platform firmware and
+ they only specify a downstream target order for host bridges.
+
+
+What: /sys/bus/cxl/devices/decoderX.Y/interleave_granularity
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) The number of consecutive bytes of host physical address
+ space this decoder claims at address N before the decode rotates
+ to the next target in the interleave at address N +
+ interleave_granularity (assuming N is aligned to
+ interleave_granularity).
+
+
+What: /sys/bus/cxl/devices/decoderX.Y/create_{pmem,ram}_region
+Date: May, 2022, January, 2023
+KernelVersion: v6.0 (pmem), v6.3 (ram)
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RW) Write a string in the form 'regionZ' to start the process
+ of defining a new persistent, or volatile memory region
+ (interleave-set) within the decode range bounded by root decoder
+ 'decoderX.Y'. The value written must match the current value
+ returned from reading this attribute. An atomic compare exchange
+ operation is done on write to assign the requested id to a
+ region and allocate the region-id for the next creation attempt.
+ EBUSY is returned if the region name written does not match the
+ current cached value.
+
+
+What: /sys/bus/cxl/devices/decoderX.Y/delete_region
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (WO) Write a string in the form 'regionZ' to delete that region,
+ provided it is currently idle / not bound to a driver.
+
+
+What: /sys/bus/cxl/devices/regionZ/uuid
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RW) Write a unique identifier for the region. This field must
+ be set for persistent regions and it must not conflict with the
+ UUID of another region. For volatile ram regions this
+ attribute is a read-only empty string.
+
+
+What: /sys/bus/cxl/devices/regionZ/interleave_granularity
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RW) Set the number of consecutive bytes each device in the
+ interleave set will claim. The possible interleave granularity
+ values are determined by the CXL spec and the participating
+ devices.
+
+
+What: /sys/bus/cxl/devices/regionZ/interleave_ways
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RW) The number of devices participating in the region is
+ set by writing this value. Each device will provide
+ 1/interleave_ways of storage for the region.
+
+
+What: /sys/bus/cxl/devices/regionZ/size
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RW) System physical address space to be consumed by the region.
+ When written, triggers the driver to allocate space out of the
+ parent root decoder's address space. When read, the size of the
+ address space is reported and should match the span of the
+ region's resource attribute. Size shall be set after the
+ interleave configuration parameters. Once set it cannot be
+ changed, only freed by writing 0. The kernel makes no guarantees
+ that data is maintained over an address space freeing event, and
+ there is no guarantee that a free followed by an allocate
+ results in the same address being allocated.
+
+
+What: /sys/bus/cxl/devices/regionZ/mode
+Date: January, 2023
+KernelVersion: v6.3
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) The mode of a region is established at region creation time
+ and dictates the mode of the endpoint decoders that comprise the
+ region. For more details on the possible modes see
+ /sys/bus/cxl/devices/decoderX.Y/mode
+
+
+What: /sys/bus/cxl/devices/regionZ/resource
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RO) A region is a contiguous partition of a CXL root decoder
+ address space. Region capacity is allocated by writing to the
+ size attribute, the resulting physical address space determined
+ by the driver is reflected here. It is therefore not useful to
+ read this before writing a value to the size attribute.
+
+
+What: /sys/bus/cxl/devices/regionZ/target[0..N]
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RW) Write an endpoint decoder object name to 'targetX' where X
+ is the intended position of the endpoint device in the region
+ interleave and N is the 'interleave_ways' setting for the
+ region. ENXIO is returned if the write results in an impossible
+ to map decode scenario, like the endpoint is unreachable at that
+ position relative to the root decoder interleave. EBUSY is
+ returned if the position in the region is already occupied, or
+ if the region is not in a state to accept interleave
+ configuration changes. EINVAL is returned if the object name is
+ not an endpoint decoder. Once all positions have been
+ successfully written a final validation for decode conflicts is
+ performed before activating the region.
+
+
+What: /sys/bus/cxl/devices/regionZ/commit
+Date: May, 2022
+KernelVersion: v6.0
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (RW) Write a boolean 'true' string value to this attribute to
+ trigger the region to transition from the software programmed
+ state to the actively decoding in hardware state. The commit
+ operation in addition to validating that the region is in proper
+ configured state, validates that the decoders are being
+ committed in spec mandated order (last committed decoder id +
+ 1), and checks that the hardware accepts the commit request.
+ Reading this value indicates whether the region is committed or
+ not.
+
+
+What: /sys/bus/cxl/devices/memX/trigger_poison_list
+Date: April, 2023
+KernelVersion: v6.4
+Contact: linux-cxl@vger.kernel.org
+Description:
+ (WO) When a boolean 'true' is written to this attribute the
+ memdev driver retrieves the poison list from the device. The
+ list consists of addresses that are poisoned, or would result
+ in poison if accessed, and the source of the poison. This
+ attribute is only visible for devices supporting the
+ capability. The retrieved errors are logged as kernel
+ events when cxl_poison event tracing is enabled.
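+
+ A hypothetical end-to-end sketch of assembling and committing
+ a 256MB single-way pmem region (the decoder names are
+ illustrative):
+
+ # region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)
+ # echo $region > /sys/bus/cxl/devices/decoder0.0/create_pmem_region
+ # cd /sys/bus/cxl/devices/$region
+ # uuidgen > uuid
+ # echo 256 > interleave_granularity
+ # echo 1 > interleave_ways
+ # echo $((256 << 20)) > size
+ # echo decoder2.0 > target0
+ # echo 1 > commit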
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps b/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps
new file mode 100644
index 000000000000..8757dcf41c08
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-caps
@@ -0,0 +1,18 @@
+What: /sys/bus/event_source/devices/<dev>/caps
+Date: May 2022
+KernelVersion: 5.19
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+ Attribute group describing the capabilities exposed
+ for a particular PMU. Each attribute of this group can
+ expose information specific to a PMU, say pmu_name, so that
+ userspace can understand some of the features which the
+ platform-specific PMU supports.
+
+ One example of an available capability on supported platforms
+ such as Intel is pmu_name, which exposes the underlying CPU
+ name known to the PMU driver.
+
+ Example output on powerpc:
+ grep . /sys/bus/event_source/devices/cpu/caps/*
+ /sys/bus/event_source/devices/cpu/caps/pmu_name:POWER9
diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-iommu b/Documentation/ABI/testing/sysfs-bus-event_source-devices-iommu
new file mode 100644
index 000000000000..d7af4919302e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-iommu
@@ -0,0 +1,37 @@
+What: /sys/bus/event_source/devices/dmar*/format
+Date: Jan 2023
+KernelVersion: 6.3
+Contact: Kan Liang <kan.liang@linux.intel.com>
+Description: Read-only. Attribute group to describe the magic bits
+ that go into perf_event_attr.config,
+ perf_event_attr.config1 or perf_event_attr.config2 for
+ the IOMMU pmu. (See also
+ ABI/testing/sysfs-bus-event_source-devices-format).
+
+ Each attribute in this group defines a bit range in
+ perf_event_attr.config, perf_event_attr.config1,
+ or perf_event_attr.config2. All supported attributes
+ are listed below (See the VT-d Spec 4.0 for possible
+ attribute values)::
+
+ event = "config:0-27" - event ID
+ event_group = "config:28-31" - event group ID
+
+ filter_requester_en = "config1:0" - Enable Requester ID filter
+ filter_domain_en = "config1:1" - Enable Domain ID filter
+ filter_pasid_en = "config1:2" - Enable PASID filter
+ filter_ats_en = "config1:3" - Enable Address Type filter
+ filter_page_table_en= "config1:4" - Enable Page Table Level filter
+ filter_requester_id = "config1:16-31" - Requester ID filter
+ filter_domain = "config1:32-47" - Domain ID filter
+ filter_pasid = "config2:0-21" - PASID filter
+ filter_ats = "config2:24-28" - Address Type filter
+ filter_page_table = "config2:32-36" - Page Table Level filter
+
+What: /sys/bus/event_source/devices/dmar*/cpumask
+Date: Jan 2023
+KernelVersion: 6.3
+Contact: Kan Liang <kan.liang@linux.intel.com>
+Description: Read-only. This file always returns the CPU to which the
+ IOMMU pmu is bound for access to all IOMMU pmu performance
+ monitoring events.
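+
+ A hypothetical perf invocation built from these format fields
+ (the event and filter values are illustrative, not taken from
+ the spec):
+
+ # perf stat -a -e \
+ dmar0/event=0x1,event_group=0x0,filter_domain_en=0x1,filter_domain=0x10/ \
+ -- sleep 1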
diff --git a/Documentation/ABI/testing/sysfs-bus-fcoe b/Documentation/ABI/testing/sysfs-bus-fcoe
index 8fe787cc4ab7..5a4f2091ac37 100644
--- a/Documentation/ABI/testing/sysfs-bus-fcoe
+++ b/Documentation/ABI/testing/sysfs-bus-fcoe
@@ -31,7 +31,7 @@ Description: 'FCoE Controller' instances on the fcoe bus.
1) Write interface name to ctlr_create 2) Configure the FCoE
Controller (ctlr_X) 3) Enable the FCoE Controller to begin
discovery and login. The FCoE Controller is destroyed by
- writing it's name, i.e. ctlr_X to the ctlr_delete file.
+ writing its name, i.e. ctlr_X to the ctlr_delete file.
Attributes:
diff --git a/Documentation/ABI/testing/sysfs-bus-fsi-devices-sbefifo b/Documentation/ABI/testing/sysfs-bus-fsi-devices-sbefifo
new file mode 100644
index 000000000000..531fe9d6b40a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-fsi-devices-sbefifo
@@ -0,0 +1,10 @@
+What: /sys/bus/fsi/devices/XX.XX.00:06/sbefifoX/timeout
+KernelVersion: 5.15
+Contact: eajames@linux.ibm.com
+Description:
+ Indicates whether or not this SBE device has experienced a
+ timeout; i.e. the SBE did not respond within the time allotted
+ by the driver. A value of 1 indicates that a timeout has
+ occurred and no transfers have completed since the timeout. A
+ value of 0 indicates that no timeout has occurred, or if one
+ has, more recent transfers have completed successfully.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio b/Documentation/ABI/testing/sysfs-bus-iio
index 6ad47a67521c..7140e8e7313f 100644
--- a/Documentation/ABI/testing/sysfs-bus-iio
+++ b/Documentation/ABI/testing/sysfs-bus-iio
@@ -79,6 +79,11 @@ Description:
* "accel-base"
* "accel-display"
+ For devices where an accelerometer is housed in the swivel camera subassembly
+ (for AR applications), the following standardized label is used:
+
+ * "accel-camera"
+
What: /sys/bus/iio/devices/iio:deviceX/current_timestamp_clock
KernelVersion: 4.5
Contact: linux-iio@vger.kernel.org
@@ -102,6 +107,9 @@ Description:
relevant directories. If it affects all of the above
then it is to be found in the base device directory.
+ The stm32-timer-trigger has the additional characteristic that
+ a sampling_frequency of 0 is defined to stop sampling.
+
What: /sys/bus/iio/devices/iio:deviceX/sampling_frequency_available
What: /sys/bus/iio/devices/iio:deviceX/in_intensity_sampling_frequency_available
What: /sys/bus/iio/devices/iio:deviceX/in_proximity_sampling_frequency_available
@@ -188,7 +196,7 @@ Description:
Raw capacitance measurement from channel Y. Units after
application of scale and offset are nanofarads.
-What: /sys/.../iio:deviceX/in_capacitanceY-in_capacitanceZ_raw
+What: /sys/.../iio:deviceX/in_capacitanceY-capacitanceZ_raw
KernelVersion: 3.2
Contact: linux-iio@vger.kernel.org
Description:
@@ -199,6 +207,25 @@ Description:
is required is a consistent labeling. Units after application
of scale and offset are nanofarads.
+What: /sys/.../iio:deviceX/in_capacitanceY-capacitanceZ_zeropoint
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ For differential channels, this is an offset that is applied
+ equally to both inputs. As the reading is of the difference
+ between the two inputs, this should not be applied to the _raw
+ reading by userspace (unlike _offset) and, unlike calibbias,
+ it does not affect the differential value measured, because
+ the effect of _zeropoint cancels out across the two inputs
+ that make up the differential pair. Its purpose is to bring
+ the individual signals, before the differential is measured,
+ within the measurement range of the device. The naming is
+ chosen because if the separate inputs that make the
+ differential pair are drawn on a graph in their
+ _raw units, this is the value that the zero point on the
+ measurement axis represents. It is expressed with the
+ same scaling as _raw.
+
What: /sys/bus/iio/devices/iio:deviceX/in_temp_raw
What: /sys/bus/iio/devices/iio:deviceX/in_tempX_raw
What: /sys/bus/iio/devices/iio:deviceX/in_temp_x_raw
@@ -233,6 +260,15 @@ Description:
Has all of the equivalent parameters as per voltageY. Units
after application of scale and offset are m/s^2.
+What: /sys/bus/iio/devices/iio:deviceX/in_accel_linear_x_raw
+What: /sys/bus/iio/devices/iio:deviceX/in_accel_linear_y_raw
+What: /sys/bus/iio/devices/iio:deviceX/in_accel_linear_z_raw
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ As per in_accel_X_raw attributes, but minus the
+ acceleration due to gravity.
+
What: /sys/bus/iio/devices/iio:deviceX/in_gravity_x_raw
What: /sys/bus/iio/devices/iio:deviceX/in_gravity_y_raw
What: /sys/bus/iio/devices/iio:deviceX/in_gravity_z_raw
@@ -429,6 +465,7 @@ What: /sys/bus/iio/devices/iio:deviceX/in_angl_scale
What: /sys/bus/iio/devices/iio:deviceX/in_intensity_x_scale
What: /sys/bus/iio/devices/iio:deviceX/in_intensity_y_scale
What: /sys/bus/iio/devices/iio:deviceX/in_intensity_z_scale
+What: /sys/bus/iio/devices/iio:deviceX/in_concentration_co2_scale
KernelVersion: 2.6.35
Contact: linux-iio@vger.kernel.org
Description:
@@ -475,6 +512,7 @@ What: /sys/bus/iio/devices/iio:deviceX/in_voltageY_i_calibscale
What: /sys/bus/iio/devices/iio:deviceX/in_voltageY_q_calibscale
What: /sys/bus/iio/devices/iio:deviceX/in_voltage_i_calibscale
What: /sys/bus/iio/devices/iio:deviceX/in_voltage_q_calibscale
+What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage_calibscale
What: /sys/bus/iio/devices/iio:deviceX/in_voltage_calibscale
What: /sys/bus/iio/devices/iio:deviceX/in_accel_x_calibscale
What: /sys/bus/iio/devices/iio:deviceX/in_accel_y_calibscale
@@ -1212,6 +1250,32 @@ Description:
number or direction is not specified, applies to all channels of
this type.
+What: /sys/.../iio:deviceX/events/in_accel_mag_referenced_en
+What: /sys/.../iio:deviceX/events/in_accel_mag_referenced_rising_en
+What: /sys/.../iio:deviceX/events/in_accel_mag_referenced_falling_en
+What: /sys/.../iio:deviceX/events/in_accel_y_mag_referenced_en
+What: /sys/.../iio:deviceX/events/in_accel_y_mag_referenced_rising_en
+What: /sys/.../iio:deviceX/events/in_accel_y_mag_referenced_falling_en
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Similar to in_accel_mag[_y][_rising|_falling]_en, but the event
+ value is relative to a reference magnitude. The reference magnitude
+ includes the gravitational acceleration.
+
+What: /sys/.../iio:deviceX/events/in_accel_mag_referenced_value
+What: /sys/.../iio:deviceX/events/in_accel_mag_referenced_rising_value
+What: /sys/.../iio:deviceX/events/in_accel_mag_referenced_falling_value
+What: /sys/.../iio:deviceX/events/in_accel_y_mag_referenced_value
+What: /sys/.../iio:deviceX/events/in_accel_y_mag_referenced_rising_value
+What: /sys/.../iio:deviceX/events/in_accel_y_mag_referenced_falling_value
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ The value to which the reference magnitude of the channel is
+ compared. If the axis is not specified, it applies to all channels
+ of this type.
+
What: /sys/.../events/in_steps_change_en
KernelVersion: 4.0
Contact: linux-iio@vger.kernel.org
@@ -1251,6 +1315,10 @@ Description:
Actually start the buffer capture up. Will start trigger
if first device and appropriate.
+ Note that it might be impossible to configure other attributes,
+ (e.g.: events, scale, sampling rate) if they impact the currently
+ active buffer capture session.
+
What: /sys/bus/iio/devices/iio:deviceX/bufferY
KernelVersion: 5.11
Contact: linux-iio@vger.kernel.org
@@ -1739,8 +1807,8 @@ What: /sys/bus/iio/devices/iio:deviceX/out_resistanceX_raw
KernelVersion: 4.3
Contact: linux-iio@vger.kernel.org
Description:
- Raw (unscaled no offset etc.) resistance reading that can be processed
- into an ohm value.
+ Raw (unscaled no offset etc.) resistance reading.
+ Units after application of scale and offset are ohms.
What: /sys/bus/iio/devices/iio:deviceX/heater_enable
KernelVersion: 4.1.0
@@ -1826,8 +1894,9 @@ What: /sys/bus/iio/devices/iio:deviceX/in_electricalconductivity_raw
KernelVersion: 4.8
Contact: linux-iio@vger.kernel.org
Description:
- Raw (unscaled no offset etc.) electric conductivity reading that
- can be processed to siemens per meter.
+ Raw (unscaled no offset etc.) electric conductivity reading.
+ Units after application of scale and offset are siemens per
+ meter.
What: /sys/bus/iio/devices/iio:deviceX/in_countY_raw
KernelVersion: 4.10
@@ -1883,8 +1952,8 @@ What: /sys/bus/iio/devices/iio:deviceX/in_phaseY_raw
KernelVersion: 4.18
Contact: linux-iio@vger.kernel.org
Description:
- Raw (unscaled) phase difference reading from channel Y
- that can be processed to radians.
+ Raw (unscaled) phase difference reading from channel Y.
+ Units after application of scale and offset are radians.
What: /sys/bus/iio/devices/iio:deviceX/in_massconcentration_pm1_input
What: /sys/bus/iio/devices/iio:deviceX/in_massconcentrationY_pm1_input
@@ -1957,3 +2026,140 @@ Description:
Specify the percent for light sensor relative to the channel
absolute value that a data field should change before an event
is generated. Units are a percentage of the prior reading.
+
+What: /sys/bus/iio/devices/iio:deviceX/calibration_auto_enable
+Date: June 2020
+KernelVersion: 5.8
+Contact: linux-iio@vger.kernel.org
+Description:
+ Some sensors have the ability to apply auto calibration at
+ runtime. For example, it may be necessary to compensate for
+ contaminant build-up in a measurement chamber or optical
+ element deterioration that would otherwise lead to sensor drift.
+
+ Writing 1 or 0 to this attribute will respectively activate or
+ deactivate this auto calibration function.
+
+ Upon reading, the current status is returned.
+
+What: /sys/bus/iio/devices/iio:deviceX/calibration_forced_value
+Date: June 2020
+KernelVersion: 5.8
+Contact: linux-iio@vger.kernel.org
+Description:
+ Some sensors have the ability to apply a manual calibration using
+ a known measurement value, perhaps obtained from an external
+ reference device.
+
+ Writing a value to this attribute will force such a calibration
+ change. For the scd30 the value should be from the range
+ [400 1 2000].
+
+ Note for the scd30 that a valid value may only be obtained once
+ it has been written. Until then any read back of this value
+ should be ignored. For the scd4x, an error will be returned
+ immediately if the manual calibration has failed.
+
+What: /sys/bus/iio/devices/iio:deviceX/calibration_forced_value_available
+KernelVersion: 5.15
+Contact: linux-iio@vger.kernel.org
+Description:
+ Available range for the forced calibration value, expressed as:
+
+ - a range specified as "[min step max]"
+
+What: /sys/bus/iio/devices/iio:deviceX/in_voltageX_sampling_frequency
+What: /sys/bus/iio/devices/iio:deviceX/in_powerY_sampling_frequency
+What: /sys/bus/iio/devices/iio:deviceX/in_currentZ_sampling_frequency
+KernelVersion: 5.20
+Contact: linux-iio@vger.kernel.org
+Description:
+ Some devices have separate controls of sampling frequency for
+ individual channels. If multiple channels are enabled in a scan,
+ then the sampling_frequency of the scan may be computed from the
+ per channel sampling frequencies.
+
+What: /sys/.../events/in_accel_gesture_singletap_en
+What: /sys/.../events/in_accel_gesture_doubletap_en
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Device generates an event on a single or double tap.
+
+What: /sys/.../events/in_accel_gesture_singletap_value
+What: /sys/.../events/in_accel_gesture_doubletap_value
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Specifies the threshold value that the device is comparing
+ against to generate the tap gesture event. A lower
+ threshold value increases the sensitivity of tap detection.
+ Units and the exact meaning of the value are device-specific.
+
+What: /sys/.../events/in_accel_gesture_tap_value_available
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Lists all available threshold values which can be used to
+ modify the sensitivity of the tap detection.
+
+What: /sys/.../events/in_accel_gesture_singletap_reset_timeout
+What: /sys/.../events/in_accel_gesture_doubletap_reset_timeout
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Specifies the timeout value in seconds for the tap detector
+ to not look for another tap event after an event has
+ occurred; basically the minimum quiet time between two
+ single taps or two double taps.
+
+What: /sys/.../events/in_accel_gesture_tap_reset_timeout_available
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Lists all available tap reset timeout values. Units in seconds.
+
+What: /sys/.../events/in_accel_gesture_doubletap_tap2_min_delay
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Specifies the minimum quiet time in seconds between the two
+ taps of a double tap.
+
+What: /sys/.../events/in_accel_gesture_doubletap_tap2_min_delay_available
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Lists all available delay values between two taps in the double
+ tap. Units in seconds.
+
+What: /sys/.../events/in_accel_gesture_tap_maxtomin_time
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Specifies the maximum time difference allowed between the
+ upper and lower peak of a tap to consider it a valid tap
+ event. Units in seconds.
+
+What: /sys/.../events/in_accel_gesture_tap_maxtomin_time_available
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Lists all available time values between the upper peak and
+ the lower peak. Units in seconds.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_rot_yaw_raw
+What: /sys/bus/iio/devices/iio:deviceX/in_rot_pitch_raw
+What: /sys/bus/iio/devices/iio:deviceX/in_rot_roll_raw
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Raw (unscaled) Euler angle readings. Units after
+ application of scale are degrees.
+
+What: /sys/bus/iio/devices/iio:deviceX/serialnumber
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ An example format is a 16-byte, 2-digits-per-byte hex string
+ representing the sensor's unique ID number.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-adc-ad4130 b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad4130
new file mode 100644
index 000000000000..f24ed6687e90
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad4130
@@ -0,0 +1,46 @@
+What: /sys/bus/iio/devices/iio:deviceX/in_voltage-voltage_filter_mode_available
+KernelVersion: 6.2
+Contact: linux-iio@vger.kernel.org
+Description:
+ Reading returns a list with the possible filter modes.
+
+ * "sinc4" - Sinc 4. Excellent noise performance. Long
+ 1st conversion time. No natural 50/60Hz rejection.
+
+ * "sinc4+sinc1" - Sinc4 + averaging by 8. Low 1st conversion
+ time.
+
+ * "sinc3" - Sinc3. Moderate 1st conversion time.
+ Good noise performance.
+
+ * "sinc3+rej60" - Sinc3 + 60Hz rejection. At a sampling
+ frequency of 50Hz, achieves simultaneous 50Hz and 60Hz
+ rejection.
+
+ * "sinc3+sinc1" - Sinc3 + averaging by 8. Low 1st conversion
+ time. Best used with a sampling frequency of at least
+ 216.19Hz.
+
+ * "sinc3+pf1" - Sinc3 + Post Filter 1. 53dB rejection @
+ 50Hz, 58dB rejection @ 60Hz.
+
+ * "sinc3+pf2" - Sinc3 + Post Filter 2. 70dB rejection @
+ 50Hz, 70dB rejection @ 60Hz.
+
+ * "sinc3+pf3" - Sinc3 + Post Filter 3. 99dB rejection @
+ 50Hz, 103dB rejection @ 60Hz.
+
+ * "sinc3+pf4" - Sinc3 + Post Filter 4. 103dB rejection @
+ 50Hz, 109dB rejection @ 60Hz.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_voltageY-voltageZ_filter_mode
+KernelVersion: 6.2
+Contact: linux-iio@vger.kernel.org
+Description:
+ Set the filter mode of the differential channel. When the filter
+ mode changes, the in_voltageY-voltageZ_sampling_frequency and
+ in_voltageY-voltageZ_sampling_frequency_available attributes
+ might also change to accommodate the new filter mode.
+ If the current sampling frequency is out of range for the new
+ filter mode, the sampling frequency will be changed to the
+ closest valid one.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7280a b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7280a
new file mode 100644
index 000000000000..83b7efe6aa07
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-adc-ad7280a
@@ -0,0 +1,13 @@
+What: /sys/bus/iio/devices/iio:deviceX/in_voltageY-voltageZ_balance_switch_en
+KernelVersion: 5.14
+Contact: linux-iio@vger.kernel.org
+Description:
+ Used to enable an output for balancing cells for the time
+ controlled via in_voltageY-voltageZ_balance_switch_timer.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_voltageY-voltageZ_balance_switch_timer
+KernelVersion: 5.14
+Contact: linux-iio@vger.kernel.org
+Description:
+ Time in seconds for which balance switch will be turned on.
+ Multiple of 71.5 seconds.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-adc-max11410 b/Documentation/ABI/testing/sysfs-bus-iio-adc-max11410
new file mode 100644
index 000000000000..2a53c6b37360
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-adc-max11410
@@ -0,0 +1,13 @@
+What: /sys/bus/iio/devices/iio:deviceX/in_voltage_filterY_notch_en
+Date: September 2022
+KernelVersion: 6.0
+Contact: linux-iio@vger.kernel.org
+Description:
+ Enable or disable a notch filter.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_voltage_filterY_notch_center
+Date: September 2022
+KernelVersion: 6.0
+Contact: linux-iio@vger.kernel.org
+Description:
+ Center frequency of the notch filter in Hz.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-bno055 b/Documentation/ABI/testing/sysfs-bus-iio-bno055
new file mode 100644
index 000000000000..f32b1644e986
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-bno055
@@ -0,0 +1,81 @@
+What: /sys/bus/iio/devices/iio:deviceX/in_accel_raw_range
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Raw (unscaled) range for acceleration readings. Unit after
+ application of scale is m/s^2. Note that this does not affect
+ the scale; use the scale attribute instead when changing the
+ maximum and minimum readable value should also change the
+ reading scaling factor.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_anglvel_raw_range
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Range for angular velocity readings in radians per second. Note
+ that this does not affect the scale; use the scale attribute
+ instead when changing the maximum and minimum readable value
+ should also change the reading scaling factor.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_accel_raw_range_available
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ List of allowed values for in_accel_raw_range attribute
+
+What: /sys/bus/iio/devices/iio:deviceX/in_anglvel_raw_range_available
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ List of allowed values for in_anglvel_raw_range attribute
+
+What: /sys/bus/iio/devices/iio:deviceX/in_magn_calibration_fast_enable
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Can be 1 or 0. Enables/disables the "Fast Magnetometer
+ Calibration" HW function.
+
+What: /sys/bus/iio/devices/iio:deviceX/fusion_enable
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Can be 1 or 0. Enables/disables the "sensor fusion" (a.k.a.
+ NDOF) HW function.
+
+What: /sys/bus/iio/devices/iio:deviceX/calibration_data
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Reports the binary calibration data blob for the IMU sensors.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_accel_calibration_auto_status
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Reports the autocalibration status for the accelerometer sensor.
+ Can be 0 (calibration not even enabled) or 1 to 5, where the greater
+ the number, the better the calibration status.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_gyro_calibration_auto_status
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Reports the autocalibration status for the gyroscope sensor.
+ Can be 0 (calibration not even enabled) or 1 to 5, where the greater
+ the number, the better the calibration status.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_magn_calibration_auto_status
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Reports the autocalibration status for the magnetometer sensor.
+ Can be 0 (calibration not even enabled) or 1 to 5, where the greater
+ the number, the better the calibration status.
+
+What: /sys/bus/iio/devices/iio:deviceX/sys_calibration_auto_status
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Reports the status for the IMU overall autocalibration.
+ Can be 0 (calibration not even enabled) or 1 to 5, where the greater
+ the number, the better the calibration status.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-cdc-ad7746 b/Documentation/ABI/testing/sysfs-bus-iio-cdc-ad7746
new file mode 100644
index 000000000000..02ca8941dce1
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-cdc-ad7746
@@ -0,0 +1,11 @@
+What: /sys/.../iio:deviceX/in_capacitanceY_calibbias_calibration
+What: /sys/.../iio:deviceX/in_capacitanceY_calibscale_calibration
+KernelVersion: 6.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Write 1 to trigger a calibration of the calibbias or
+ calibscale. For calibscale, a full scale capacitance should
+ be connected to the capacitance input and a
+ calibscale_calibration then started. For calibbias see
+ the device datasheet section on "capacitive system offset
+ calibration".
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-chemical-sunrise-co2 b/Documentation/ABI/testing/sysfs-bus-iio-chemical-sunrise-co2
new file mode 100644
index 000000000000..ee7aeb11709b
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-chemical-sunrise-co2
@@ -0,0 +1,38 @@
+What: /sys/bus/iio/devices/iio:deviceX/in_concentration_co2_calibration_factory
+Date: August 2021
+KernelVersion: 5.16
+Contact: Jacopo Mondi <jacopo@jmondi.org>
+Description:
+ Writing '1' triggers a 'Factory' calibration cycle.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_concentration_co2_calibration_background
+Date: August 2021
+KernelVersion: 5.16
+Contact: Jacopo Mondi <jacopo@jmondi.org>
+Description:
+ Writing '1' triggers a 'Background' calibration cycle.
+
+What: /sys/bus/iio/devices/iio:deviceX/error_status_available
+Date: August 2021
+KernelVersion: 5.16
+Contact: Jacopo Mondi <jacopo@jmondi.org>
+Description:
+ Reading returns the list of possible chip error statuses.
+ Available options are:
+ - 'error_fatal': Analog front-end initialization error
+ - 'error_i2c': Read/write to non-existing register
+ - 'error_algorithm': Corrupted parameters
+ - 'error_calibration': Calibration has failed
+ - 'error_self_diagnostic': Internal interface failure
+ - 'error_out_of_range': Measured concentration out of scale
+ - 'error_memory': Error during memory operations
+ - 'error_no_measurement': Cleared at first measurement
+ - 'error_low_voltage': Sensor regulated voltage too low
+ - 'error_measurement_timeout': Unable to complete measurement
+
+What: /sys/bus/iio/devices/iio:deviceX/error_status
+Date: August 2021
+KernelVersion: 5.16
+Contact: Jacopo Mondi <jacopo@jmondi.org>
+Description:
+ Reading returns the current chip error status.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-dac-ltc2688 b/Documentation/ABI/testing/sysfs-bus-iio-dac-ltc2688
new file mode 100644
index 000000000000..1c35971277ba
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-dac-ltc2688
@@ -0,0 +1,86 @@
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_dither_en
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Dither enable. Write 1 to enable dither or 0 to disable it. This is useful
+ for changing the dither parameters. The way it should be done is:
+
+ - disable dither operation;
+ - change dither parameters (eg: frequency, phase...);
+ - enable dither operation.
+
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_dither_raw
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ This raw, unscaled value refers to the dither signal amplitude.
+ The same scale as in out_voltageY_raw applies. However, the
+ offset might be different as it's always 0 for this attribute.
+
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_dither_raw_available
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Available range for dither raw amplitude values.
+
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_dither_offset
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Offset applied to out_voltageY_dither_raw. Read-only attribute,
+ always set to 0.
+
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_dither_frequency
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Sets the dither signal frequency. Units are in Hz.
+
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_dither_frequency_available
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Returns the available values for the dither frequency.
+
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_dither_phase
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Sets the dither signal phase. Units are in Radians.
+
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_dither_phase_available
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Returns the available values for the dither phase.
+
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_toggle_en
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Toggle enable. Write 1 to enable toggle or 0 to disable it. This is
+ useful when one wants to change the DAC output codes. The way it should
+ be done is:
+
+ - disable toggle operation;
+ - change out_voltageY_raw0 and out_voltageY_raw1;
+ - enable toggle operation.
+
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_raw0
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_raw1
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ It has the same meaning as out_voltageY_raw. This attribute is
+ specific to toggle enabled channels and refers to the DAC output
+ code in INPUT_A (_raw0) and INPUT_B (_raw1). The same scale and offset
+ as in out_voltageY_raw applies.
+
+What: /sys/bus/iio/devices/iio:deviceX/out_voltageY_symbol
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Performs a SW toggle. This attribute is specific to toggle
+ enabled channels and allows toggling between out_voltageY_raw0
+ and out_voltageY_raw1 through software. Writing 0 will select
+ out_voltageY_raw0 while 1 selects out_voltageY_raw1.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-filter-admv8818 b/Documentation/ABI/testing/sysfs-bus-iio-filter-admv8818
new file mode 100644
index 000000000000..f6c035752639
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-filter-admv8818
@@ -0,0 +1,16 @@
+What: /sys/bus/iio/devices/iio:deviceX/filter_mode_available
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ Reading this returns the valid values that can be written to the
+ on_altvoltage0_mode attribute:
+
+ - auto -> Adjust bandpass filter to track changes in input clock rate.
+ - manual -> disable/unregister the clock rate notifier / input clock tracking.
+
+What: /sys/bus/iio/devices/iio:deviceX/filter_mode
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ This attribute configures the filter mode.
+ Reading returns the actual mode.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-frequency-admv1013 b/Documentation/ABI/testing/sysfs-bus-iio-frequency-admv1013
new file mode 100644
index 000000000000..de1e323e5d47
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-frequency-admv1013
@@ -0,0 +1,38 @@
+What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0-1_i_calibphase
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read/write unscaled value for the Local Oscillator path quadrature I phase shift.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0-1_q_calibphase
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read/write unscaled value for the Local Oscillator path quadrature Q phase shift.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0_i_calibbias
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read/write value for the Local Oscillator Feedthrough Offset Calibration I Positive
+ side.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0_q_calibbias
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read/write value for the Local Oscillator Feedthrough Offset Calibration Q Positive side.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage1_i_calibbias
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read/write raw value for the Local Oscillator Feedthrough Offset Calibration I Negative
+ side.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage1_q_calibbias
+KernelVersion:
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read/write raw value for the Local Oscillator Feedthrough Offset Calibration Q Negative
+ side.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-frequency-admv1014 b/Documentation/ABI/testing/sysfs-bus-iio-frequency-admv1014
new file mode 100644
index 000000000000..395010a0ef8b
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-frequency-admv1014
@@ -0,0 +1,23 @@
+What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0_i_calibscale_coarse
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read/write value for the digital attenuator gain (IF_I) with coarse steps.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0_q_calibscale_coarse
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read/write value for the digital attenuator gain (IF_Q) with coarse steps.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0_i_calibscale_fine
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read/write value for the digital attenuator gain (IF_I) with fine steps.
+
+What: /sys/bus/iio/devices/iio:deviceX/in_altvoltage0_q_calibscale_fine
+KernelVersion: 5.18
+Contact: linux-iio@vger.kernel.org
+Description:
+ Read/write value for the digital attenuator gain (IF_Q) with fine steps.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-proximity b/Documentation/ABI/testing/sysfs-bus-iio-proximity
index 3aac6dab8775..9b9d1cc9b703 100644
--- a/Documentation/ABI/testing/sysfs-bus-iio-proximity
+++ b/Documentation/ABI/testing/sysfs-bus-iio-proximity
@@ -18,7 +18,7 @@ Description:
on the signal from which time of flight measurements are
taken.
The appropriate values to take is dependent on both the
- sensor and it's operating environment:
+ sensor and its operating environment:
* as3935 (0-31 range)
18 = indoors (default)
14 = outdoors
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-scd30 b/Documentation/ABI/testing/sysfs-bus-iio-scd30
deleted file mode 100644
index b9712f390bec..000000000000
--- a/Documentation/ABI/testing/sysfs-bus-iio-scd30
+++ /dev/null
@@ -1,34 +0,0 @@
-What: /sys/bus/iio/devices/iio:deviceX/calibration_auto_enable
-Date: June 2020
-KernelVersion: 5.8
-Contact: linux-iio@vger.kernel.org
-Description:
- Contaminants build-up in the measurement chamber or optical
- elements deterioration leads to sensor drift.
-
- One can compensate for sensor drift by using automatic self
- calibration procedure (asc).
-
- Writing 1 or 0 to this attribute will respectively activate or
- deactivate asc.
-
- Upon reading current asc status is returned.
-
-What: /sys/bus/iio/devices/iio:deviceX/calibration_forced_value
-Date: June 2020
-KernelVersion: 5.8
-Contact: linux-iio@vger.kernel.org
-Description:
- Contaminants build-up in the measurement chamber or optical
- elements deterioration leads to sensor drift.
-
- One can compensate for sensor drift by using forced
- recalibration (frc). This is useful in case there's known
- co2 reference available nearby the sensor.
-
- Picking value from the range [400 1 2000] and writing it to the
- sensor will set frc.
-
- Upon reading current frc value is returned. Note that after
- power cycling default value (i.e 400) is returned even though
- internally sensor had recalibrated itself.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-sx9324 b/Documentation/ABI/testing/sysfs-bus-iio-sx9324
new file mode 100644
index 000000000000..a8342770e7cb
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-sx9324
@@ -0,0 +1,29 @@
+What: /sys/bus/iio/devices/iio:deviceX/in_proximity<id>_setup
+Date: November 2021
+KernelVersion: 5.17
+Contact: Gwendal Grignou <gwendal@chromium.org>
+Description:
+ SX9324 has 3 inputs, CS0, CS1 and CS2. The hardware layout
+ defines whether an input is
+
+ + not connected (HZ),
+ + grounded (GD),
+ + connected to an antenna where it can act as a base
+ (DS - data shield), or measured input (MI).
+
+ The sensor rotates measurement across 4 phases
+ (PH0, PH1, PH2, PH3), where the inputs are configured
+ and then measured.
+
+ By default, during the first phase, [PH0], CS0 is measured,
+ while CS1 and CS2 are used as shields.
+ `cat in_proximity0_setup` returns "MI,DS,DS".
+ [PH1], CS1 is measured, CS0 and CS2 are shields:
+ `cat in_proximity1_setup` returns "DS,MI,DS".
+ [PH2], CS2 is measured, CS0 and CS1 are shields:
+ `cat in_proximity2_setup` returns "DS,DS,MI".
+ [PH3], CS1 and CS2 are measured (combo mode):
+ `cat in_proximity3_setup` returns "DS,MI,MI".
+
+ Note, these are the chip defaults. The hardware layout will
+ most likely dictate a different output. The entry is read-only.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-temperature-max31856 b/Documentation/ABI/testing/sysfs-bus-iio-temperature-max31856
deleted file mode 100644
index e5ef6d8e5da1..000000000000
--- a/Documentation/ABI/testing/sysfs-bus-iio-temperature-max31856
+++ /dev/null
@@ -1,31 +0,0 @@
-What: /sys/bus/iio/devices/iio:deviceX/fault_oc
-KernelVersion: 5.1
-Contact: linux-iio@vger.kernel.org
-Description:
- Open-circuit fault. The detection of open-circuit faults,
- such as those caused by broken thermocouple wires.
- Reading returns either '1' or '0'.
-
- === =======================================================
- '1' An open circuit such as broken thermocouple wires
- has been detected.
- '0' No open circuit or broken thermocouple wires are detected
- === =======================================================
-
-What: /sys/bus/iio/devices/iio:deviceX/fault_ovuv
-KernelVersion: 5.1
-Contact: linux-iio@vger.kernel.org
-Description:
- Overvoltage or Undervoltage Input Fault. The internal circuitry
- is protected from excessive voltages applied to the thermocouple
- cables by integrated MOSFETs at the T+ and T- inputs, and the
- BIAS output. These MOSFETs turn off when the input voltage is
- negative or greater than VDD.
-
- Reading returns either '1' or '0'.
-
- === =======================================================
- '1' The input voltage is negative or greater than VDD.
- '0' The input voltage is positive and less than VDD (normal
- state).
- === =======================================================
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-thermocouple b/Documentation/ABI/testing/sysfs-bus-iio-thermocouple
new file mode 100644
index 000000000000..01259df297ca
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-iio-thermocouple
@@ -0,0 +1,18 @@
+What: /sys/bus/iio/devices/iio:deviceX/fault_ovuv
+KernelVersion: 5.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Overvoltage or Undervoltage Input Fault. The internal circuitry
+ is protected from excessive voltages applied to the thermocouple
+ cables. The device can also detect if such a condition occurs.
+
+ Reading returns '1' if input voltage is negative or greater
+ than VDD, otherwise '0'.
+
+What: /sys/bus/iio/devices/iio:deviceX/fault_oc
+KernelVersion: 5.1
+Contact: linux-iio@vger.kernel.org
+Description:
+ Open-circuit fault. Detects open-circuit faults, such as
+ those caused by broken thermocouple wires.
+ Reading returns '1' if a fault is present, '0' otherwise.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32 b/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
index c4a4497c249a..05074c4a65e2 100644
--- a/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
+++ b/Documentation/ABI/testing/sysfs-bus-iio-timer-stm32
@@ -90,14 +90,6 @@ Description:
Reading returns the current master modes.
Writing set the master mode
-What: /sys/bus/iio/devices/triggerX/sampling_frequency
-KernelVersion: 4.11
-Contact: benjamin.gaignard@st.com
-Description:
- Reading returns the current sampling frequency.
- Writing an value different of 0 set and start sampling.
- Writing 0 stop sampling.
-
What: /sys/bus/iio/devices/iio:deviceX/in_count0_preset
KernelVersion: 4.12
Contact: benjamin.gaignard@st.com
diff --git a/Documentation/ABI/testing/sysfs-bus-iio-vf610 b/Documentation/ABI/testing/sysfs-bus-iio-vf610
index 308a6756d3bf..491ead804488 100644
--- a/Documentation/ABI/testing/sysfs-bus-iio-vf610
+++ b/Documentation/ABI/testing/sysfs-bus-iio-vf610
@@ -1,4 +1,4 @@
-What: /sys/bus/iio/devices/iio:deviceX/conversion_mode
+What: /sys/bus/iio/devices/iio:deviceX/in_conversion_mode
KernelVersion: 4.2
Contact: linux-iio@vger.kernel.org
Description:
diff --git a/Documentation/ABI/testing/sysfs-bus-mdio b/Documentation/ABI/testing/sysfs-bus-mdio
index da86efc7781b..38be04dfc05e 100644
--- a/Documentation/ABI/testing/sysfs-bus-mdio
+++ b/Documentation/ABI/testing/sysfs-bus-mdio
@@ -1,4 +1,5 @@
What: /sys/bus/mdio_bus/devices/.../statistics/
+What: /sys/class/mdio_bus/.../statistics/
Date: January 2020
KernelVersion: 5.6
Contact: netdev@vger.kernel.org
@@ -7,6 +8,7 @@ Description:
MDIO bus address statistics.
What: /sys/bus/mdio_bus/devices/.../statistics/transfers
+What: /sys/class/mdio_bus/.../transfers
Date: January 2020
KernelVersion: 5.6
Contact: netdev@vger.kernel.org
@@ -14,6 +16,7 @@ Description:
Total number of transfers for this MDIO bus.
What: /sys/bus/mdio_bus/devices/.../statistics/errors
+What: /sys/class/mdio_bus/.../statistics/errors
Date: January 2020
KernelVersion: 5.6
Contact: netdev@vger.kernel.org
@@ -21,6 +24,7 @@ Description:
Total number of transfer errors for this MDIO bus.
What: /sys/bus/mdio_bus/devices/.../statistics/writes
+What: /sys/class/mdio_bus/.../statistics/writes
Date: January 2020
KernelVersion: 5.6
Contact: netdev@vger.kernel.org
@@ -28,6 +32,7 @@ Description:
Total number of write transactions for this MDIO bus.
What: /sys/bus/mdio_bus/devices/.../statistics/reads
+What: /sys/class/mdio_bus/.../statistics/reads
Date: January 2020
KernelVersion: 5.6
Contact: netdev@vger.kernel.org
@@ -35,6 +40,7 @@ Description:
Total number of read transactions for this MDIO bus.
What: /sys/bus/mdio_bus/devices/.../statistics/transfers_<addr>
+What: /sys/class/mdio_bus/.../statistics/transfers_<addr>
Date: January 2020
KernelVersion: 5.6
Contact: netdev@vger.kernel.org
@@ -42,6 +48,7 @@ Description:
Total number of transfers for this MDIO bus address.
What: /sys/bus/mdio_bus/devices/.../statistics/errors_<addr>
+What: /sys/class/mdio_bus/.../statistics/errors_<addr>
Date: January 2020
KernelVersion: 5.6
Contact: netdev@vger.kernel.org
@@ -49,6 +56,7 @@ Description:
Total number of transfer errors for this MDIO bus address.
What: /sys/bus/mdio_bus/devices/.../statistics/writes_<addr>
+What: /sys/class/mdio_bus/.../statistics/writes_<addr>
Date: January 2020
KernelVersion: 5.6
Contact: netdev@vger.kernel.org
@@ -56,6 +64,7 @@ Description:
Total number of write transactions for this MDIO bus address.
What: /sys/bus/mdio_bus/devices/.../statistics/reads_<addr>
+What: /sys/class/mdio_bus/.../statistics/reads_<addr>
Date: January 2020
KernelVersion: 5.6
Contact: netdev@vger.kernel.org
diff --git a/Documentation/ABI/testing/sysfs-bus-nvdimm b/Documentation/ABI/testing/sysfs-bus-nvdimm
index bff84a16812a..de8c5a59c77f 100644
--- a/Documentation/ABI/testing/sysfs-bus-nvdimm
+++ b/Documentation/ABI/testing/sysfs-bus-nvdimm
@@ -6,3 +6,52 @@ Description:
The libnvdimm sub-system implements a common sysfs interface for
platform nvdimm resources. See Documentation/driver-api/nvdimm/.
+
+What: /sys/bus/event_source/devices/nmemX/format
+Date: February 2022
+KernelVersion: 5.18
+Contact: Kajol Jain <kjain@linux.ibm.com>
+Description: (RO) Attribute group to describe the magic bits
+ that go into perf_event_attr.config for a particular pmu.
+ (See ABI/testing/sysfs-bus-event_source-devices-format).
+
+ Each attribute under this group defines a bit range of the
+ perf_event_attr.config. Supported attribute is listed
+ below::
+ event = "config:0-4" - event ID
+
+ For example::
+ ctl_res_cnt = "event=0x1"
+
+What: /sys/bus/event_source/devices/nmemX/events
+Date: February 2022
+KernelVersion: 5.18
+Contact: Kajol Jain <kjain@linux.ibm.com>
+Description: (RO) Attribute group to describe performance monitoring events
+ for the nvdimm memory device. Each attribute in this group
+ describes a single performance monitoring event supported by
+ this nvdimm pmu. The name of the file is the name of the event.
+ (See ABI/testing/sysfs-bus-event_source-devices-events). A
+ listing of the events supported by a given nvdimm provider type
+ can be found in Documentation/driver-api/nvdimm/$provider.
+
+What: /sys/bus/event_source/devices/nmemX/cpumask
+Date: February 2022
+KernelVersion: 5.18
+Contact: Kajol Jain <kjain@linux.ibm.com>
+Description: (RO) This sysfs file exposes the cpumask which is designated
+ to retrieve nvdimm pmu event counter data.
+
+What: /sys/bus/nd/devices/nmemX/cxl/id
+Date: November 2022
+KernelVersion: 6.2
+Contact: Dave Jiang <dave.jiang@intel.com>
+Description: (RO) Show the id (serial) of the device. This is CXL specific.
+
+What: /sys/bus/nd/devices/nmemX/cxl/provider
+Date: November 2022
+KernelVersion: 6.2
+Contact: Dave Jiang <dave.jiang@intel.com>
+Description: (RO) Shows the CXL bridge device that ties a CXL memory device
+ to this NVDIMM device. I.e. the parent of the device returned is
+ a /sys/bus/cxl/devices/memX instance.
diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 95254cec92bf..4ac0673901e7 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -61,3 +61,15 @@ Description:
* "CchRHCnt" : Cache Read Hit Count
* "CchWHCnt" : Cache Write Hit Count
* "FastWCnt" : Fast Write Count
+
+What: /sys/bus/nd/devices/nmemX/papr/health_bitmap_inject
+Date: Jan, 2022
+KernelVersion: v5.17
+Contact: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>, nvdimm@lists.linux.dev,
+Description:
+ (RO) Reports the health bitmap inject bitmap that is applied to
+ the bitmap received from PowerVM via the H_SCM_HEALTH hcall.
+ This is used to forcibly set specific bits returned from the
+ hcall. These are then used to simulate various health or
+ shutdown states for an nvdimm and are set by user-space tools
+ like ndctl by issuing a PAPR DSM.
+
diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index d4ae03296861..ecf47559f495 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -1,4 +1,5 @@
What: /sys/bus/pci/drivers/.../bind
+What: /sys/devices/pciX/.../bind
Date: December 2003
Contact: linux-pci@vger.kernel.org
Description:
@@ -14,6 +15,7 @@ Description:
(Note: kernels before 2.6.28 may require echo -n).
What: /sys/bus/pci/drivers/.../unbind
+What: /sys/devices/pciX/.../unbind
Date: December 2003
Contact: linux-pci@vger.kernel.org
Description:
@@ -29,6 +31,7 @@ Description:
(Note: kernels before 2.6.28 may require echo -n).
What: /sys/bus/pci/drivers/.../new_id
+What: /sys/devices/pciX/.../new_id
Date: December 2003
Contact: linux-pci@vger.kernel.org
Description:
@@ -47,6 +50,7 @@ Description:
# echo "8086 10f5" > /sys/bus/pci/drivers/foo/new_id
What: /sys/bus/pci/drivers/.../remove_id
+What: /sys/devices/pciX/.../remove_id
Date: February 2009
Contact: Chris Wright <chrisw@sous-sol.org>
Description:
@@ -96,6 +100,17 @@ Description:
This attribute indicates the mode that the irq vector named by
the file is in (msi vs. msix)
+What: /sys/bus/pci/devices/.../irq
+Date: August 2021
+Contact: Linux PCI developers <linux-pci@vger.kernel.org>
+Description:
+ If a driver has enabled MSI (not MSI-X), "irq" contains the
+ IRQ of the first MSI vector. Otherwise "irq" contains the
+ IRQ of the legacy INTx interrupt.
+
+ "irq" being set to 0 indicates that the device isn't
+ capable of generating legacy INTx interrupts.
+
What: /sys/bus/pci/devices/.../remove
Date: January 2009
Contact: Linux PCI developers <linux-pci@vger.kernel.org>
@@ -160,7 +175,7 @@ Description:
If the underlying VPD has a writable section then the
corresponding section of this file will be writable.
-What: /sys/bus/pci/devices/.../virtfnN
+What: /sys/bus/pci/devices/.../virtfn<N>
Date: March 2009
Contact: Yu Zhao <yu.zhao@intel.com>
Description:
@@ -187,6 +202,24 @@ Description:
The symbolic link points to the PCI device sysfs entry of the
Physical Function this device associates with.
+What: /sys/bus/pci/devices/.../modalias
+Date: May 2005
+Contact: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+Description:
+ This attribute indicates the PCI ID of the device object.
+
+ That is in the format:
+ pci:vXXXXXXXXdXXXXXXXXsvXXXXXXXXsdXXXXXXXXbcXXscXXiXX,
+ where:
+
+ - vXXXXXXXX contains the vendor ID;
+ - dXXXXXXXX contains the device ID;
+ - svXXXXXXXX contains the sub-vendor ID;
+ - sdXXXXXXXX contains the subsystem device ID;
+ - bcXX contains the device class;
+ - scXX contains the device subclass;
+ - iXX contains the device class programming interface.
+
What: /sys/bus/pci/slots/.../module
Date: June 2009
Contact: linux-pci@vger.kernel.org
@@ -374,6 +407,16 @@ Description:
file contains a '1' if the memory has been published for
use outside the driver that owns the device.
+What: /sys/bus/pci/devices/.../p2pmem/allocate
+Date: August 2022
+Contact: Logan Gunthorpe <logang@deltatee.com>
+Description:
+ This file allows mapping p2pmem into userspace. For each
+ mmap() call on this file, the kernel will allocate a chunk
+ of Peer-to-Peer memory for use in Peer-to-Peer transactions.
+ This memory can be used in O_DIRECT calls to NVMe backed
+ files for Peer-to-Peer copies.
+
What: /sys/bus/pci/devices/.../link/clkpm
/sys/bus/pci/devices/.../link/l0s_aspm
/sys/bus/pci/devices/.../link/l1_aspm
@@ -424,3 +467,36 @@ Description:
The file is writable if the PF is bound to a driver that
implements ->sriov_set_msix_vec_count().
+
+What: /sys/bus/pci/devices/.../resourceN_resize
+Date: September 2022
+Contact: Alex Williamson <alex.williamson@redhat.com>
+Description:
+ These files provide an interface to PCIe Resizable BAR support.
+ A file is created for each BAR resource (N) supported by the
+ PCIe Resizable BAR extended capability of the device. Reading
+ each file exposes the bitmap of available resource sizes:
+
+ # cat resource1_resize
+ 00000000000001c0
+
+ The bitmap represents supported resource sizes for the BAR,
+ where bit0 = 1MB, bit1 = 2MB, bit2 = 4MB, etc. In the above
+ example the device supports 64MB, 128MB, and 256MB BAR sizes.
+
+ When writing the file, the user provides the bit position of
+ the desired resource size, for example:
+
+ # echo 7 > resource1_resize
+
+ This indicates to set the size value corresponding to bit 7,
+ 128MB. The resulting size is 2 ^ (bit# + 20). This definition
+ matches the PCIe specification of this capability.
+
+ In order to make use of resource resizing, all PCI drivers must
+ be unbound from the device and peer devices under the same
+ parent bridge may need to be soft removed. In the case of
+ VGA devices, writing a resize value will remove low level
+ console drivers from the device. Raw users of pci-sysfs
+ resourceN attributes must be terminated prior to resizing.
+ Success of the resizing operation is not guaranteed.
diff --git a/Documentation/ABI/testing/sysfs-bus-pci-drivers-xhci_hcd b/Documentation/ABI/testing/sysfs-bus-pci-drivers-xhci_hcd
index 0088aba4caa8..5a775b8f6543 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci-drivers-xhci_hcd
+++ b/Documentation/ABI/testing/sysfs-bus-pci-drivers-xhci_hcd
@@ -23,3 +23,55 @@ Description:
Reading this attribute gives the state of the DbC. It
can be one of the following states: disabled, enabled,
initialized, connected, configured and stalled.
+
+What: /sys/bus/pci/drivers/xhci_hcd/.../dbc_idVendor
+Date: March 2023
+Contact: Mathias Nyman <mathias.nyman@linux.intel.com>
+Description:
+ This dbc_idVendor attribute lets us change the idVendor field
+ presented in the USB device descriptor by this xhci debug
+ device.
+ Value can only be changed while debug capability (DbC) is in
+ disabled state to prevent USB device descriptor change while
+ connected to a USB host.
+ The default value is 0x1d6b (Linux Foundation).
+ It can be any 16-bit integer.
+
+What: /sys/bus/pci/drivers/xhci_hcd/.../dbc_idProduct
+Date: March 2023
+Contact: Mathias Nyman <mathias.nyman@linux.intel.com>
+Description:
+ This dbc_idProduct attribute lets us change the idProduct field
+ presented in the USB device descriptor by this xhci debug
+ device.
+ Value can only be changed while debug capability (DbC) is in
+ disabled state to prevent USB device descriptor change while
+ connected to a USB host.
+ The default value is 0x0010. It can be any 16-bit integer.
+
+What: /sys/bus/pci/drivers/xhci_hcd/.../dbc_bcdDevice
+Date: March 2023
+Contact: Mathias Nyman <mathias.nyman@linux.intel.com>
+Description:
+ This dbc_bcdDevice attribute lets us change the bcdDevice field
+ presented in the USB device descriptor by this xhci debug
+ device.
+ Value can only be changed while debug capability (DbC) is in
+ disabled state to prevent USB device descriptor change while
+ connected to a USB host.
+ The default value is 0x0010. (device rev 0.10)
+ It can be any 16-bit integer.
+
+What: /sys/bus/pci/drivers/xhci_hcd/.../dbc_bInterfaceProtocol
+Date: March 2023
+Contact: Mathias Nyman <mathias.nyman@linux.intel.com>
+Description:
+ This attribute lets us change the bInterfaceProtocol field
+ presented in the USB Interface descriptor by the xhci debug
+ device.
+ Value can only be changed while debug capability (DbC) is in
+ disabled state to prevent USB descriptor change while
+ connected to a USB host.
+ The default value is 1 (GNU Remote Debug command).
+ The other permissible value is 0, which is for a vendor-defined
+ debug target.
diff --git a/Documentation/ABI/testing/sysfs-bus-peci b/Documentation/ABI/testing/sysfs-bus-peci
new file mode 100644
index 000000000000..87454ec5d981
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-peci
@@ -0,0 +1,16 @@
+What: /sys/bus/peci/rescan
+Date: July 2021
+KernelVersion: 5.18
+Contact: Iwona Winiarska <iwona.winiarska@intel.com>
+Description:
+ Writing a non-zero value to this attribute will
+ initiate a scan for PECI devices on all PECI controllers
+ in the system.
+
+What: /sys/bus/peci/devices/<controller_id>-<device_addr>/remove
+Date: July 2021
+KernelVersion: 5.18
+Contact: Iwona Winiarska <iwona.winiarska@intel.com>
+Description:
+ Writing a non-zero value to this attribute will
+ remove the PECI device and any of its children.
diff --git a/Documentation/ABI/testing/sysfs-bus-platform b/Documentation/ABI/testing/sysfs-bus-platform
index ff30728595ef..c4dfe7355c2d 100644
--- a/Documentation/ABI/testing/sysfs-bus-platform
+++ b/Documentation/ABI/testing/sysfs-bus-platform
@@ -42,3 +42,15 @@ Date: August 2021
Contact: Barry Song <song.bao.hua@hisilicon.com>
Description:
This attribute will show "msi" if <N> is a valid msi irq
+
+What: /sys/bus/platform/devices/.../modalias
+Description:
+ Same as MODALIAS in the uevent at device creation.
+
+ A platform device that is exposed via devicetree uses:
+
+ - of:N`of node name`T`type`
+
+ Other platform devices use, instead:
+
+ - platform:`driver name`
diff --git a/Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro b/Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro
new file mode 100644
index 000000000000..fead760dcf77
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-platform-devices-ampere-smpro
@@ -0,0 +1,325 @@
+What: /sys/bus/platform/devices/smpro-errmon.*/error_[core|mem|pcie|other]_[ce|ue]
+KernelVersion: 6.1
+Contact: Quan Nguyen <quan@os.amperecomputing.com>
+Description:
+ (RO) Contains the 48-byte Ampere (Vendor-Specific) Error Record printed
+ in hex format according to the table below:
+
+ +--------+---------------+-------------+------------------------------------------------------------+
+ | Offset | Field | Size (byte) | Description |
+ +--------+---------------+-------------+------------------------------------------------------------+
+ | 00 | Error Type | 1 | See :ref:`the table below <smpro-error-types>` for details |
+ +--------+---------------+-------------+------------------------------------------------------------+
+ | 01 | Subtype | 1 | See :ref:`the table below <smpro-error-types>` for details |
+ +--------+---------------+-------------+------------------------------------------------------------+
+ | 02 | Instance | 2 | See :ref:`the table below <smpro-error-types>` for details |
+ +--------+---------------+-------------+------------------------------------------------------------+
+ | 04 | Error status | 4 | See ARM RAS specification for details |
+ +--------+---------------+-------------+------------------------------------------------------------+
+ | 08 | Error Address | 8 | See ARM RAS specification for details |
+ +--------+---------------+-------------+------------------------------------------------------------+
+ | 16 | Error Misc 0 | 8 | See ARM RAS specification for details |
+ +--------+---------------+-------------+------------------------------------------------------------+
+ | 24 | Error Misc 1 | 8 | See ARM RAS specification for details |
+ +--------+---------------+-------------+------------------------------------------------------------+
+ | 32 | Error Misc 2 | 8 | See ARM RAS specification for details |
+ +--------+---------------+-------------+------------------------------------------------------------+
+ | 40 | Error Misc 3 | 8 | See ARM RAS specification for details |
+ +--------+---------------+-------------+------------------------------------------------------------+
+
+ The table below defines the value of error types, their subtype, subcomponent and instance:
+
+ .. _smpro-error-types:
+
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | Error Group | Error Type | Sub type | Sub component | Instance |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | CPM (core) | 0 | 0 | Snoop-Logic | CPM # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | CPM (core) | 0 | 2 | Armv8 Core 1 | CPM # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | MCU (mem) | 1 | 1 | ERR1 | MCU # \| SLOT << 11 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | MCU (mem) | 1 | 2 | ERR2 | MCU # \| SLOT << 11 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | MCU (mem) | 1 | 3 | ERR3 | MCU # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | MCU (mem) | 1 | 4 | ERR4 | MCU # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | MCU (mem) | 1 | 5 | ERR5 | MCU # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | MCU (mem) | 1 | 6 | ERR6 | MCU # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | MCU (mem) | 1 | 7 | Link Error | MCU # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | Mesh (other) | 2 | 0 | Cross Point | X \| (Y << 5) \| NS <<11 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | Mesh (other) | 2 | 1 | Home Node(IO) | X \| (Y << 5) \| NS <<11 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | Mesh (other) | 2 | 2 | Home Node(Mem) | X \| (Y << 5) \| NS <<11 \| device<<12 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | Mesh (other) | 2 | 4 | CCIX Node | X \| (Y << 5) \| NS <<11 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | 2P Link (other) | 3 | 0 | N/A | Altra 2P Link # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 0 | ERR0 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 1 | ERR1 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 2 | ERR2 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 3 | ERR3 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 4 | ERR4 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 5 | ERR5 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 6 | ERR6 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 7 | ERR7 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 8 | ERR8 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 9 | ERR9 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 10 | ERR10 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 11 | ERR11 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 12 | ERR12 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | GIC (other) | 5 | 13-21 | ERR13 | RC # + 1 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMMU (other) | 6 | TCU | 100 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMMU (other) | 6 | TBU0 | 0 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMMU (other) | 6 | TBU1 | 1 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMMU (other) | 6 | TBU2 | 2 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMMU (other) | 6 | TBU3 | 3 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMMU (other) | 6 | TBU4 | 4 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMMU (other) | 6 | TBU5 | 5 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMMU (other) | 6 | TBU6 | 6 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMMU (other) | 6 | TBU7 | 7 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMMU (other) | 6 | TBU8 | 8 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMMU (other) | 6 | TBU9 | 9 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | PCIe AER (pcie) | 7 | Root | 0 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | PCIe AER (pcie) | 7 | Device | 1 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | PCIe RC (pcie) | 8 | RCA HB | 0 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | PCIe RC (pcie) | 8 | RCB HB | 1 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | PCIe RC (pcie) | 8 | RASDP | 8 | RC # |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | OCM (other) | 9 | ERR0 | 0 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | OCM (other) | 9 | ERR1 | 1 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | OCM (other) | 9 | ERR2 | 2 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMpro (other) | 10 | ERR0 | 0 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMpro (other) | 10 | ERR1 | 1 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | SMpro (other) | 10 | MPA_ERR | 2 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | PMpro (other) | 11 | ERR0 | 0 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | PMpro (other) | 11 | ERR1 | 1 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+ | PMpro (other) | 11 | MPA_ERR | 2 | 0 |
+ +-----------------+------------+----------+----------------+----------------------------------------+
+
+ Example::
+
+ # cat error_other_ue
+ 880807001e004010401040101500000001004010401040100c0000000000000000000000000000000000000000000000
+
+		The details of each sysfs entry are as below:
+
+ +-------------+---------------------------------------------------------+----------------------------------+
+ | Error | Sysfs entry | Description (when triggered) |
+ +-------------+---------------------------------------------------------+----------------------------------+
+ | Core's CE | /sys/bus/platform/devices/smpro-errmon.*/error_core_ce | Core has CE error |
+ +-------------+---------------------------------------------------------+----------------------------------+
+ | Core's UE | /sys/bus/platform/devices/smpro-errmon.*/error_core_ue | Core has UE error |
+ +-------------+---------------------------------------------------------+----------------------------------+
+ | Memory's CE | /sys/bus/platform/devices/smpro-errmon.*/error_mem_ce | Memory has CE error |
+ +-------------+---------------------------------------------------------+----------------------------------+
+ | Memory's UE | /sys/bus/platform/devices/smpro-errmon.*/error_mem_ue | Memory has UE error |
+ +-------------+---------------------------------------------------------+----------------------------------+
+ | PCIe's CE | /sys/bus/platform/devices/smpro-errmon.*/error_pcie_ce | any PCIe controller has CE error |
+ +-------------+---------------------------------------------------------+----------------------------------+
+ | PCIe's UE | /sys/bus/platform/devices/smpro-errmon.*/error_pcie_ue | any PCIe controller has UE error |
+ +-------------+---------------------------------------------------------+----------------------------------+
+ | Other's CE | /sys/bus/platform/devices/smpro-errmon.*/error_other_ce | any other CE error |
+ +-------------+---------------------------------------------------------+----------------------------------+
+ | Other's UE | /sys/bus/platform/devices/smpro-errmon.*/error_other_ue | any other UE error |
+ +-------------+---------------------------------------------------------+----------------------------------+
+
+		UE: Uncorrectable Error
+		CE: Correctable Error
+
+ For details, see section `3.3 Ampere (Vendor-Specific) Error Record Formats,
+ Altra Family RAS Supplement`.
+
+
+What: /sys/bus/platform/devices/smpro-errmon.*/overflow_[core|mem|pcie|other]_[ce|ue]
+KernelVersion: 6.1
+Contact: Quan Nguyen <quan@os.amperecomputing.com>
+Description:
+		(RO) Returns the overflow status of each type of HW error reported:
+
+ - 0 : No overflow
+ - 1 : There is an overflow and the oldest HW errors are dropped
+
+		The details of each sysfs entry are as below:
+
+ +-------------+-----------------------------------------------------------+---------------------------------------+
+ | Overflow | Sysfs entry | Description |
+ +-------------+-----------------------------------------------------------+---------------------------------------+
+ | Core's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_core_ce | Core CE error overflow |
+ +-------------+-----------------------------------------------------------+---------------------------------------+
+ | Core's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_core_ue | Core UE error overflow |
+ +-------------+-----------------------------------------------------------+---------------------------------------+
+ | Memory's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_mem_ce | Memory CE error overflow |
+ +-------------+-----------------------------------------------------------+---------------------------------------+
+ | Memory's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_mem_ue | Memory UE error overflow |
+ +-------------+-----------------------------------------------------------+---------------------------------------+
+ | PCIe's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_pcie_ce | any PCIe controller CE error overflow |
+ +-------------+-----------------------------------------------------------+---------------------------------------+
+ | PCIe's UE | /sys/bus/platform/devices/smpro-errmon.*/overflow_pcie_ue | any PCIe controller UE error overflow |
+ +-------------+-----------------------------------------------------------+---------------------------------------+
+ | Other's CE | /sys/bus/platform/devices/smpro-errmon.*/overflow_other_ce| any other CE error overflow |
+ +-------------+-----------------------------------------------------------+---------------------------------------+
+		| Other's UE  | /sys/bus/platform/devices/smpro-errmon.*/overflow_other_ue| any other UE error overflow           |
+ +-------------+-----------------------------------------------------------+---------------------------------------+
+
+ where:
+
+		- UE: Uncorrectable Error
+		- CE: Correctable Error
+
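+		For example, reading one of these files (the value shown is
+		illustrative)::
+
+		    # cat overflow_core_ce
+		    0
+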
+What: /sys/bus/platform/devices/smpro-errmon.*/[error|warn]_[smpro|pmpro]
+KernelVersion: 6.1
+Contact: Quan Nguyen <quan@os.amperecomputing.com>
+Description:
+		(RO) Contains the internal firmware errors/warnings printed in hex format.
+
+		The details of each sysfs entry are as below:
+
+ +---------------+------------------------------------------------------+--------------------------+
+ | Error | Sysfs entry | Description |
+ +---------------+------------------------------------------------------+--------------------------+
+ | SMpro error | /sys/bus/platform/devices/smpro-errmon.*/error_smpro | system has SMpro error |
+ +---------------+------------------------------------------------------+--------------------------+
+ | SMpro warning | /sys/bus/platform/devices/smpro-errmon.*/warn_smpro | system has SMpro warning |
+ +---------------+------------------------------------------------------+--------------------------+
+ | PMpro error | /sys/bus/platform/devices/smpro-errmon.*/error_pmpro | system has PMpro error |
+ +---------------+------------------------------------------------------+--------------------------+
+ | PMpro warning | /sys/bus/platform/devices/smpro-errmon.*/warn_pmpro | system has PMpro warning |
+ +---------------+------------------------------------------------------+--------------------------+
+
+ For details, see section `5.10 RAS Internal Error Register Definitions,
+ Altra Family Soc BMC Interface Specification`.
+
+What: /sys/bus/platform/devices/smpro-errmon.*/event_[vrd_warn_fault|vrd_hot|dimm_hot|dimm_2x_refresh]
+KernelVersion: 6.1 (event_[vrd_warn_fault|vrd_hot|dimm_hot]), 6.4 (event_dimm_2x_refresh)
+Contact: Quan Nguyen <quan@os.amperecomputing.com>
+Description:
+		(RO) Contains detailed information about VRD/DIMM warning/hot
+		events, in hex format as below::
+
+ AAAA
+
+ where:
+
+ - ``AAAA``: The event detail information data
+
+		The details of each sysfs entry are as below:
+
+ +---------------+---------------------------------------------------------------+---------------------+
+ | Event | Sysfs entry | Description |
+ +---------------+---------------------------------------------------------------+---------------------+
+ | VRD HOT | /sys/bus/platform/devices/smpro-errmon.*/event_vrd_hot | VRD Hot |
+ +---------------+---------------------------------------------------------------+---------------------+
+ | VR Warn/Fault | /sys/bus/platform/devices/smpro-errmon.*/event_vrd_warn_fault | VR Warning or Fault |
+ +---------------+---------------------------------------------------------------+---------------------+
+ | DIMM HOT | /sys/bus/platform/devices/smpro-errmon.*/event_dimm_hot | DIMM Hot |
+ +---------------+---------------------------------------------------------------+---------------------+
+ | DIMM 2X | /sys/bus/platform/devices/smpro-errmon.*/event_dimm_2x_refresh| DIMM 2x refresh rate|
+ | REFRESH RATE | | event in high temp |
+ +---------------+---------------------------------------------------------------+---------------------+
+
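+		For example, reading one of these files (the event value is
+		illustrative)::
+
+		    # cat event_vrd_hot
+		    0001
+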
+ For more details, see section `5.7 GPI Status Registers and 5.9 Memory Error Register Definitions,
+ Altra Family Soc BMC Interface Specification`.
+
+What: /sys/bus/platform/devices/smpro-errmon.*/event_dimm[0-15]_syndrome
+KernelVersion: 6.4
+Contact: Quan Nguyen <quan@os.amperecomputing.com>
+Description:
+		(RO) Returns the 2-byte failure syndrome data for the DIMM in
+		slot 0-15 if it failed to initialize.
+
+ For more details, see section `5.11 Boot Stage Register Definitions,
+ Altra Family Soc BMC Interface Specification`.
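+
+		For example, reading the syndrome for slot 0 (the value shown
+		is illustrative)::
+
+		    # cat event_dimm0_syndrome
+		    0001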
+
+What: /sys/bus/platform/devices/smpro-misc.*/boot_progress
+KernelVersion: 6.1
+Contact: Quan Nguyen <quan@os.amperecomputing.com>
+Description:
+		(RO) Contains the boot stage information in hex in the format
+		below::
+
+ AABBCCCCCCCC
+
+ where:
+
+		- ``AA`` : The boot stage
+
+ - 00: SMpro firmware booting
+ - 01: PMpro firmware booting
+ - 02: ATF BL1 firmware booting
+ - 03: DDR initialization
+ - 04: DDR training report status
+ - 05: ATF BL2 firmware booting
+ - 06: ATF BL31 firmware booting
+ - 07: ATF BL32 firmware booting
+ - 08: UEFI firmware booting
+ - 09: OS booting
+
+ - ``BB`` : Boot status
+
+ - 00: Not started
+ - 01: Started
+ - 02: Completed without error
+ - 03: Failed.
+
+		- ``CCCCCCCC``: Boot status information defined for each boot stage
+
+ For details, see section `5.11 Boot Stage Register Definitions`
+ and section `6. Processor Boot Progress Codes, Altra Family Soc BMC
+ Interface Specification`.
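+
+		For example, an illustrative read showing stage 09 (OS
+		booting) completed without error::
+
+		    # cat boot_progress
+		    090200000000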
+
+
+What: /sys/bus/platform/devices/smpro-misc*/soc_power_limit
+KernelVersion: 6.1
+Contact: Quan Nguyen <quan@os.amperecomputing.com>
+Description:
+		(RW) Contains the desired SoC power limit in Watts.
+		Writes to this sysfs file set the desired SoC power limit (W).
+		Reads from this file return the current SoC power limit (W).
+		The valid value range is:
+
+ - Minimum: 120 W
+ - Maximum: Socket TDP power
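+
+		A usage sketch (the value is illustrative and must fall within
+		the range above)::
+
+		    # echo 150 > soc_power_limit
+		    # cat soc_power_limit
+		    150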
diff --git a/Documentation/ABI/testing/sysfs-bus-platform-devices-occ-hwmon b/Documentation/ABI/testing/sysfs-bus-platform-devices-occ-hwmon
new file mode 100644
index 000000000000..b24d7ab0278f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-platform-devices-occ-hwmon
@@ -0,0 +1,13 @@
+What: /sys/bus/platform/devices/occ-hwmon.X/ffdc
+KernelVersion: 5.15
+Contact: eajames@linux.ibm.com
+Description:
+		Contains the First Failure Data Capture from the SBEFIFO
+		hardware, if there is any from a previous transfer. Otherwise,
+		the file is empty. The data is cleared when it has been
+		completely read by a user. As the name suggests, only the data
+		from the first error is saved, until it is cleared upon read.
+		The OCC hwmon driver, running on a Baseboard Management
+		Controller (BMC), communicates with POWER9 and later processors
+		over the Self-Boot Engine (SBE) FIFO. In many error conditions,
+		the SBEFIFO will return error data indicating the type of error
+		and system state, etc.
diff --git a/Documentation/ABI/testing/sysfs-bus-platform-onboard-usb-hub b/Documentation/ABI/testing/sysfs-bus-platform-onboard-usb-hub
new file mode 100644
index 000000000000..42deb0552065
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-platform-onboard-usb-hub
@@ -0,0 +1,8 @@
+What: /sys/bus/platform/devices/<dev>/always_powered_in_suspend
+Date: June 2022
+KernelVersion: 5.20
+Contact: Matthias Kaehlcke <matthias@kaehlcke.net>
+ linux-usb@vger.kernel.org
+Description:
+ (RW) Controls whether the USB hub remains always powered
+		during system suspend or not.
\ No newline at end of file
diff --git a/Documentation/ABI/testing/sysfs-bus-rapidio b/Documentation/ABI/testing/sysfs-bus-rapidio
index 634ea207a50a..f8b6728dac10 100644
--- a/Documentation/ABI/testing/sysfs-bus-rapidio
+++ b/Documentation/ABI/testing/sysfs-bus-rapidio
@@ -1,4 +1,4 @@
-What: /sys/bus/rapidio/devices/nn:d:iiii
+What: /sys/bus/rapidio/devices/<nn>:<d>:<iiii>
Description:
For each RapidIO device, the RapidIO subsystem creates files in
an individual subdirectory with the following name format of
@@ -29,7 +29,7 @@ Description:
Attributes Common for All RapidIO Devices
-----------------------------------------
-What: /sys/bus/rapidio/devices/nn:d:iiii/did
+What: /sys/bus/rapidio/devices/<nn>:<d>:<iiii>/did
Date: Nov, 2005
KernelVersion: v2.6.15
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -37,7 +37,7 @@ Contact: Matt Porter <mporter@kernel.crashing.org>,
Description:
(RO) returns the device identifier
-What: /sys/bus/rapidio/devices/nn:d:iiii/vid
+What: /sys/bus/rapidio/devices/<nn>:<d>:<iiii>/vid
Date: Nov, 2005
KernelVersion: v2.6.15
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -45,7 +45,7 @@ Contact: Matt Porter <mporter@kernel.crashing.org>,
Description:
(RO) returns the device vendor identifier
-What: /sys/bus/rapidio/devices/nn:d:iiii/device_rev
+What: /sys/bus/rapidio/devices/<nn>:<d>:<iiii>/device_rev
Date: Nov, 2005
KernelVersion: v2.6.15
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -53,7 +53,7 @@ Contact: Matt Porter <mporter@kernel.crashing.org>,
Description:
(RO) returns the device revision level
-What: /sys/bus/rapidio/devices/nn:d:iiii/asm_did
+What: /sys/bus/rapidio/devices/<nn>:<d>:<iiii>/asm_did
Date: Nov, 2005
KernelVersion: v2.6.15
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -61,7 +61,7 @@ Contact: Matt Porter <mporter@kernel.crashing.org>,
Description:
(RO) returns identifier for the assembly containing the device
-What: /sys/bus/rapidio/devices/nn:d:iiii/asm_rev
+What: /sys/bus/rapidio/devices/<nn>:<d>:<iiii>/asm_rev
Date: Nov, 2005
KernelVersion: v2.6.15
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -70,7 +70,7 @@ Description:
(RO) returns revision level of the assembly containing the
device
-What: /sys/bus/rapidio/devices/nn:d:iiii/asm_vid
+What: /sys/bus/rapidio/devices/<nn>:<d>:<iiii>/asm_vid
Date: Nov, 2005
KernelVersion: v2.6.15
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -79,7 +79,7 @@ Description:
(RO) returns vendor identifier of the assembly containing the
device
-What: /sys/bus/rapidio/devices/nn:d:iiii/destid
+What: /sys/bus/rapidio/devices/<nn>:<d>:<iiii>/destid
Date: Mar, 2011
KernelVersion: v2.6.3
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -88,7 +88,7 @@ Description:
(RO) returns device destination ID assigned by the enumeration
routine
-What: /sys/bus/rapidio/devices/nn:d:iiii/lprev
+What: /sys/bus/rapidio/devices/<nn>:<d>:<iiii>/lprev
Date: Mar, 2011
KernelVersion: v2.6.39
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -97,7 +97,7 @@ Description:
(RO) returns name of previous device (switch) on the path to the
device that that owns this attribute
-What: /sys/bus/rapidio/devices/nn:d:iiii/modalias
+What: /sys/bus/rapidio/devices/<nn>:<d>:<iiii>/modalias
Date: Jul, 2013
KernelVersion: v3.11
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -105,7 +105,7 @@ Contact: Matt Porter <mporter@kernel.crashing.org>,
Description:
(RO) returns the device modalias
-What: /sys/bus/rapidio/devices/nn:d:iiii/config
+What: /sys/bus/rapidio/devices/<nn>:<d>:<iiii>/config
Date: Nov, 2005
KernelVersion: v2.6.15
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -128,7 +128,7 @@ device-specific sysfs attributes by specifying a callback function that may be
set by the switch initialization routine during enumeration or discovery
process.
-What: /sys/bus/rapidio/devices/nn:s:iiii/routes
+What: /sys/bus/rapidio/devices/<nn>:<s>:<iiii>/routes
Date: Nov, 2005
KernelVersion: v2.6.15
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -138,7 +138,7 @@ Description:
This attribute reports only valid routing table entries, one
line for each entry.
-What: /sys/bus/rapidio/devices/nn:s:iiii/destid
+What: /sys/bus/rapidio/devices/<nn>:<s>:<iiii>/destid
Date: Mar, 2011
KernelVersion: v2.6.3
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -147,7 +147,7 @@ Description:
(RO) device destination ID of the associated device that defines
a route to the switch
-What: /sys/bus/rapidio/devices/nn:s:iiii/hopcount
+What: /sys/bus/rapidio/devices/<nn>:<s>:<iiii>/hopcount
Date: Mar, 2011
KernelVersion: v2.6.39
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -155,7 +155,7 @@ Contact: Matt Porter <mporter@kernel.crashing.org>,
Description:
(RO) number of hops on the path to the switch
-What: /sys/bus/rapidio/devices/nn:s:iiii/lnext
+What: /sys/bus/rapidio/devices/<nn>:<s>:<iiii>/lnext
Date: Mar, 2011
KernelVersion: v2.6.39
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -172,7 +172,7 @@ Device-specific Switch Attributes
IDT_GEN2-
-What: /sys/bus/rapidio/devices/nn:s:iiii/errlog
+What: /sys/bus/rapidio/devices/<nn>:<s>:<iiii>/errlog
Date: Oct, 2010
KernelVersion: v2.6.37
Contact: Matt Porter <mporter@kernel.crashing.org>,
diff --git a/Documentation/ABI/testing/sysfs-bus-soundwire-master b/Documentation/ABI/testing/sysfs-bus-soundwire-master
index 46ef038d8722..d2342911ffbb 100644
--- a/Documentation/ABI/testing/sysfs-bus-soundwire-master
+++ b/Documentation/ABI/testing/sysfs-bus-soundwire-master
@@ -1,13 +1,13 @@
-What: /sys/bus/soundwire/devices/sdw-master-N/revision
- /sys/bus/soundwire/devices/sdw-master-N/clk_stop_modes
- /sys/bus/soundwire/devices/sdw-master-N/clk_freq
- /sys/bus/soundwire/devices/sdw-master-N/clk_gears
- /sys/bus/soundwire/devices/sdw-master-N/default_col
- /sys/bus/soundwire/devices/sdw-master-N/default_frame_rate
- /sys/bus/soundwire/devices/sdw-master-N/default_row
- /sys/bus/soundwire/devices/sdw-master-N/dynamic_shape
- /sys/bus/soundwire/devices/sdw-master-N/err_threshold
- /sys/bus/soundwire/devices/sdw-master-N/max_clk_freq
+What: /sys/bus/soundwire/devices/sdw-master-<N>/revision
+ /sys/bus/soundwire/devices/sdw-master-<N>/clk_stop_modes
+ /sys/bus/soundwire/devices/sdw-master-<N>/clk_freq
+ /sys/bus/soundwire/devices/sdw-master-<N>/clk_gears
+ /sys/bus/soundwire/devices/sdw-master-<N>/default_col
+ /sys/bus/soundwire/devices/sdw-master-<N>/default_frame_rate
+ /sys/bus/soundwire/devices/sdw-master-<N>/default_row
+ /sys/bus/soundwire/devices/sdw-master-<N>/dynamic_shape
+ /sys/bus/soundwire/devices/sdw-master-<N>/err_threshold
+ /sys/bus/soundwire/devices/sdw-master-<N>/max_clk_freq
Date: April 2020
diff --git a/Documentation/ABI/testing/sysfs-bus-soundwire-slave b/Documentation/ABI/testing/sysfs-bus-soundwire-slave
index d324aa0b678f..fbf55834dfee 100644
--- a/Documentation/ABI/testing/sysfs-bus-soundwire-slave
+++ b/Documentation/ABI/testing/sysfs-bus-soundwire-slave
@@ -64,37 +64,37 @@ Description: SoundWire Slave Data Port-0 DisCo properties.
Data port 0 are used by the bus to configure the Data Port 0.
-What: /sys/bus/soundwire/devices/sdw:.../dpN_src/max_word
- /sys/bus/soundwire/devices/sdw:.../dpN_src/min_word
- /sys/bus/soundwire/devices/sdw:.../dpN_src/words
- /sys/bus/soundwire/devices/sdw:.../dpN_src/type
- /sys/bus/soundwire/devices/sdw:.../dpN_src/max_grouping
- /sys/bus/soundwire/devices/sdw:.../dpN_src/simple_ch_prep_sm
- /sys/bus/soundwire/devices/sdw:.../dpN_src/ch_prep_timeout
- /sys/bus/soundwire/devices/sdw:.../dpN_src/imp_def_interrupts
- /sys/bus/soundwire/devices/sdw:.../dpN_src/min_ch
- /sys/bus/soundwire/devices/sdw:.../dpN_src/max_ch
- /sys/bus/soundwire/devices/sdw:.../dpN_src/channels
- /sys/bus/soundwire/devices/sdw:.../dpN_src/ch_combinations
- /sys/bus/soundwire/devices/sdw:.../dpN_src/max_async_buffer
- /sys/bus/soundwire/devices/sdw:.../dpN_src/block_pack_mode
- /sys/bus/soundwire/devices/sdw:.../dpN_src/port_encoding
-
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/max_word
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/min_word
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/words
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/type
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/max_grouping
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/simple_ch_prep_sm
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/ch_prep_timeout
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/imp_def_interrupts
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/min_ch
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/max_ch
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/channels
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/ch_combinations
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/max_async_buffer
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/block_pack_mode
- /sys/bus/soundwire/devices/sdw:.../dpN_sink/port_encoding
+What: /sys/bus/soundwire/devices/sdw:.../dp<N>_src/max_word
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/min_word
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/words
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/type
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/max_grouping
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/simple_ch_prep_sm
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/ch_prep_timeout
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/imp_def_interrupts
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/min_ch
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/max_ch
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/channels
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/ch_combinations
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/max_async_buffer
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/block_pack_mode
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_src/port_encoding
+
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/max_word
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/min_word
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/words
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/type
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/max_grouping
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/simple_ch_prep_sm
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/ch_prep_timeout
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/imp_def_interrupts
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/min_ch
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/max_ch
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/channels
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/ch_combinations
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/max_async_buffer
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/block_pack_mode
+ /sys/bus/soundwire/devices/sdw:.../dp<N>_sink/port_encoding
Date: May 2020
diff --git a/Documentation/ABI/testing/sysfs-bus-spi-devices-spi-nor b/Documentation/ABI/testing/sysfs-bus-spi-devices-spi-nor
index d76cd3946434..c800621eff95 100644
--- a/Documentation/ABI/testing/sysfs-bus-spi-devices-spi-nor
+++ b/Documentation/ABI/testing/sysfs-bus-spi-devices-spi-nor
@@ -5,6 +5,9 @@ Contact: linux-mtd@lists.infradead.org
Description: (RO) The JEDEC ID of the SPI NOR flash as reported by the
flash device.
+ The attribute is not present if the flash doesn't support
+ the "Read JEDEC ID" command (9Fh). This is the case for
+ non-JEDEC compliant flashes.
What: /sys/bus/spi/devices/.../spi-nor/manufacturer
Date: April 2021
@@ -12,6 +15,9 @@ KernelVersion: 5.14
Contact: linux-mtd@lists.infradead.org
Description: (RO) Manufacturer of the SPI NOR flash.
+ The attribute is not present if the flash device isn't
+ known to the kernel and is only probed by its SFDP
+ tables.
What: /sys/bus/spi/devices/.../spi-nor/partname
Date: April 2021
diff --git a/Documentation/ABI/testing/sysfs-bus-surface_aggregator-tabletsw b/Documentation/ABI/testing/sysfs-bus-surface_aggregator-tabletsw
new file mode 100644
index 000000000000..74cd9d754e60
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-surface_aggregator-tabletsw
@@ -0,0 +1,57 @@
+What: /sys/bus/surface_aggregator/devices/01:0e:01:00:01/state
+Date: July 2022
+KernelVersion: 5.20
+Contact: Maximilian Luz <luzmaximilian@gmail.com>
+Description:
+ This attribute returns a string with the current type-cover
+ or device posture, as indicated by the embedded controller.
+ Currently returned posture states are:
+
+ - "disconnected": The type-cover has been disconnected.
+
+ - "closed": The type-cover has been folded closed and lies on
+ top of the display.
+
+ - "laptop": The type-cover is open and in laptop-mode, i.e.,
+ ready for normal use.
+
+ - "folded-canvas": The type-cover has been folded back
+ part-ways, but does not lie flush with the back side of the
+ device. In general, this means that the kick-stand is used
+ and extended atop of the cover.
+
+ - "folded-back": The type cover has been fully folded back and
+ lies flush with the back side of the device.
+
+ - "<unknown>": The current state is unknown to the driver, for
+ example due to newer as-of-yet unsupported hardware.
+
+ New states may be introduced with new hardware. Users therefore
+ must not rely on this list of states being exhaustive and
+ gracefully handle unknown states.
+
+What: /sys/bus/surface_aggregator/devices/01:26:01:00:01/state
+Date: July 2022
+KernelVersion: 5.20
+Contact: Maximilian Luz <luzmaximilian@gmail.com>
+Description:
+		This attribute returns a string with the current device
+		posture, as indicated by the embedded controller. Currently
+		returned posture states are:
+
+ - "closed": The lid of the device is closed.
+
+ - "laptop": The lid of the device is opened and the device
+ operates as a normal laptop.
+
+ - "slate": The screen covers the keyboard or has been flipped
+ back and the device operates mainly based on touch input.
+
+ - "tablet": The device operates as tablet and exclusively
+ relies on touch input (or external peripherals).
+
+ - "<unknown>": The current state is unknown to the driver, for
+ example due to newer as-of-yet unsupported hardware.
+
+ New states may be introduced with new hardware. Users therefore
+ must not rely on this list of states being exhaustive and
+ gracefully handle unknown states.
diff --git a/Documentation/ABI/testing/sysfs-bus-thunderbolt b/Documentation/ABI/testing/sysfs-bus-thunderbolt
index b7e87f6c7d47..76ab3e1fe374 100644
--- a/Documentation/ABI/testing/sysfs-bus-thunderbolt
+++ b/Documentation/ABI/testing/sysfs-bus-thunderbolt
@@ -153,7 +153,7 @@ Date: Jan 2020
KernelVersion: 5.5
Contact: Mika Westerberg <mika.westerberg@linux.intel.com>
Description: This attribute reports number of RX lanes the device is
- using simultaneusly through its upstream port.
+ using simultaneously through its upstream port.
What: /sys/bus/thunderbolt/devices/.../tx_speed
Date: Jan 2020
@@ -167,7 +167,7 @@ Date: Jan 2020
KernelVersion: 5.5
Contact: Mika Westerberg <mika.westerberg@linux.intel.com>
Description: This attribute reports number of TX lanes the device is
- using simultaneusly through its upstream port.
+ using simultaneously through its upstream port.
What: /sys/bus/thunderbolt/devices/.../vendor
Date: Sep 2017
@@ -293,6 +293,16 @@ Contact: thunderbolt-software@lists.01.org
Description: This contains XDomain service specific settings as
bitmask. Format: %x
+What: /sys/bus/thunderbolt/devices/usb4_portX/connector
+Date: April 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Symlink to the USB Type-C connector. This link is only
+ created when USB Type-C Connector Class is enabled,
+ and only if the system firmware is capable of
+ describing the connection between a port and its
+ connector.
+
What: /sys/bus/thunderbolt/devices/usb4_portX/link
Date: Sep 2021
KernelVersion: v5.14
diff --git a/Documentation/ABI/testing/sysfs-bus-usb b/Documentation/ABI/testing/sysfs-bus-usb
index 73eb23bc1f34..cb172db41b34 100644
--- a/Documentation/ABI/testing/sysfs-bus-usb
+++ b/Documentation/ABI/testing/sysfs-bus-usb
@@ -1,4 +1,4 @@
-What: /sys/bus/usb/devices/INTERFACE/authorized
+What: /sys/bus/usb/devices/<INTERFACE>/authorized
Date: August 2015
Description:
This allows to authorize (1) or deauthorize (0)
@@ -166,14 +166,31 @@ Description:
The file will be present for all speeds of USB devices, and will
always read "no" for USB 1.1 and USB 2.0 devices.
-What: /sys/bus/usb/devices/.../(hub interface)/portX
+What: /sys/bus/usb/devices/<INTERFACE>/wireless_status
+Date: February 2023
+Contact: Bastien Nocera <hadess@hadess.net>
+Description:
+ Some USB devices use a USB receiver dongle to communicate
+ wirelessly with their device using proprietary protocols. This
+ attribute allows user-space to know whether the device is
+ connected to its receiver dongle, and, for example, consider
+ the device to be absent when choosing whether to show the
+ device's battery, show a headset in a list of outputs, or show
+ an on-screen keyboard if the only wireless keyboard is
+ turned off.
+ This attribute is not to be used to replace protocol specific
+ statuses available in WWAN, WLAN/Wi-Fi, Bluetooth, etc.
+ If the device does not use a receiver dongle with a wireless
+ device, then this attribute will not exist.
+
+What: /sys/bus/usb/devices/.../<hub_interface>/port<X>
Date: August 2012
Contact: Lan Tianyu <tianyu.lan@intel.com>
Description:
- The /sys/bus/usb/devices/.../(hub interface)/portX
+ The /sys/bus/usb/devices/.../<hub_interface>/port<X>
is usb port device's sysfs directory.
-What: /sys/bus/usb/devices/.../(hub interface)/portX/connect_type
+What: /sys/bus/usb/devices/.../<hub_interface>/port<X>/connect_type
Date: January 2013
Contact: Lan Tianyu <tianyu.lan@intel.com>
Description:
@@ -182,7 +199,7 @@ Description:
The file will read "hotplug", "hardwired" and "not used" if the
information is available, and "unknown" otherwise.
-What: /sys/bus/usb/devices/.../(hub interface)/portX/location
+What: /sys/bus/usb/devices/.../<hub_interface>/port<X>/location
Date: October 2018
Contact: Bjørn Mork <bjorn@mork.no>
Description:
@@ -192,7 +209,7 @@ Description:
raw location value as a hex integer.
-What: /sys/bus/usb/devices/.../(hub interface)/portX/quirks
+What: /sys/bus/usb/devices/.../<hub_interface>/port<X>/quirks
Date: May 2018
Contact: Nicolas Boichat <drinkcat@chromium.org>
Description:
@@ -216,7 +233,7 @@ Description:
used to help make enumeration work better on some high speed
devices.
-What: /sys/bus/usb/devices/.../(hub interface)/portX/over_current_count
+What: /sys/bus/usb/devices/.../<hub_interface>/port<X>/over_current_count
Date: February 2018
Contact: Richard Leitner <richard.leitner@skidata.com>
Description:
@@ -230,10 +247,10 @@ Description:
Any time this value changes the corresponding hub device will send a
udev event with the following attributes::
- OVER_CURRENT_PORT=/sys/bus/usb/devices/.../(hub interface)/portX
+ OVER_CURRENT_PORT=/sys/bus/usb/devices/.../<hub_interface>/port<X>
OVER_CURRENT_COUNT=[current value of this sysfs attribute]
-What: /sys/bus/usb/devices/.../(hub interface)/portX/usb3_lpm_permit
+What: /sys/bus/usb/devices/.../<hub_interface>/port<X>/usb3_lpm_permit
Date: November 2015
Contact: Lu Baolu <baolu.lu@linux.intel.com>
Description:
@@ -244,6 +261,37 @@ Description:
is permitted, "u2" if only u2 is permitted, "u1_u2" if both u1 and
u2 are permitted.
+What: /sys/bus/usb/devices/.../<hub_interface>/port<X>/connector
+Date: December 2021
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Link to the USB Type-C connector when available. This link is
+ only created when USB Type-C Connector Class is enabled, and
+ only if the system firmware is capable of describing the
+ connection between a port and its connector.
+
+What: /sys/bus/usb/devices/.../<hub_interface>/port<X>/disable
+Date: June 2022
+Contact: Michael Grzeschik <m.grzeschik@pengutronix.de>
+Description:
+ This file controls the state of a USB port, including
+ Vbus power output (but only on hubs that support
+ power switching -- most hubs don't support it). If
+ a port is disabled, the port is unusable: Devices
+ attached to the port will not be detected, initialized,
+ or enumerated.
+
+What: /sys/bus/usb/devices/.../<hub_interface>/port<X>/early_stop
+Date: Sep 2022
+Contact: Ray Chi <raychi@google.com>
+Description:
+		Some USB hosts have watchdog mechanisms that may cause the
+		device to enter ramdump if port initialization takes too long.
+		This attribute limits each port to two initialization attempts
+		so that port initialization fails quickly. In addition, if a
+		port marked with early_stop has failed to initialize, it will
+		ignore all future connections until this attribute is cleared.
+
What: /sys/bus/usb/devices/.../power/usb2_lpm_l1_timeout
Date: May 2013
Contact: Mathias Nyman <mathias.nyman@linux.intel.com>
@@ -288,3 +336,277 @@ Description:
USB 3.2 adds Dual-lane support, 2 rx and 2 tx -lanes over Type-C.
Inter-Chip SSIC devices support asymmetric lanes up to 4 lanes per
direction. Devices before USB 3.2 are single lane (tx_lanes = 1)
+
+What: /sys/bus/usb/devices/usbX/bAlternateSetting
+Description:
+ The current interface alternate setting number, in decimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bcdDevice
+Description:
+ The device's release number, in hexadecimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bConfigurationValue
+Description:
+		While a USB device typically has just one configuration
+ setting, some devices support multiple configurations.
+
+ This value shows the current configuration, in decimal.
+
+ Changing its value will change the device's configuration
+ to another setting.
+
+ The number of configurations supported by a device is at:
+
+ /sys/bus/usb/devices/usbX/bNumConfigurations
+
+ See USB specs for its meaning.
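+
+		A usage sketch (the configuration value is illustrative and
+		must be one the device actually offers)::
+
+		    # echo 2 > /sys/bus/usb/devices/usbX/bConfigurationValue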
+
+What: /sys/bus/usb/devices/usbX/bDeviceClass
+Description:
+ Class code of the device, in hexadecimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bDeviceProtocol
+Description:
+ Protocol code of the device, in hexadecimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bDeviceSubClass
+Description:
+ Subclass code of the device, in hexadecimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bInterfaceClass
+Description:
+ Class code of the interface, in hexadecimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bInterfaceNumber
+Description:
+ Interface number, in hexadecimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bInterfaceProtocol
+Description:
+ Protocol code of the interface, in hexadecimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bInterfaceSubClass
+Description:
+ Subclass code of the interface, in hexadecimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bmAttributes
+Description:
+ Attributes of the current configuration, in hexadecimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bMaxPacketSize0
+Description:
+ Maximum endpoint 0 packet size, in decimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bMaxPower
+Description:
+ Maximum power consumption of the active configuration of
+		the device, in milliamperes.
+
+What: /sys/bus/usb/devices/usbX/bNumConfigurations
+Description:
+ Number of the possible configurations of the device, in
+ decimal. The current configuration is controlled via:
+
+ /sys/bus/usb/devices/usbX/bConfigurationValue
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bNumEndpoints
+Description:
+ Number of endpoints used on this interface, in hexadecimal.
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/bNumInterfaces
+Description:
+ Number of interfaces on this device, in decimal.
+
+What: /sys/bus/usb/devices/usbX/busnum
+Description:
+ Number of the bus.
+
+What: /sys/bus/usb/devices/usbX/configuration
+Description:
+ Contents of the string descriptor associated with the
+ current configuration. It may include the firmware version
+ of a device and/or its serial number.
+
+What: /sys/bus/usb/devices/usbX/descriptors
+Description:
+ Contains the interface descriptors, in binary.
+
+What: /sys/bus/usb/devices/usbX/idProduct
+Description:
+ Product ID, in hexadecimal.
+
+What: /sys/bus/usb/devices/usbX/idVendor
+Description:
+ Vendor ID, in hexadecimal.
+
+What: /sys/bus/usb/devices/usbX/devspec
+Description:
+ Displays the Device Tree Open Firmware node of the interface.
+
+What: /sys/bus/usb/devices/usbX/avoid_reset_quirk
+Description:
+ Most devices have this set to zero.
+
+		If the value is 1, a USB quirk is enabled that prevents this
+		device from being reset.
+
+ (read/write)
+
+What: /sys/bus/usb/devices/usbX/devnum
+Description:
+ USB interface device number, in decimal.
+
+What: /sys/bus/usb/devices/usbX/devpath
+Description:
+ String containing the USB interface device path.
+
+What: /sys/bus/usb/devices/usbX/manufacturer
+Description:
+ Vendor specific string containing the name of the
+ manufacturer of the device.
+
+What: /sys/bus/usb/devices/usbX/maxchild
+Description:
+		Number of ports of a USB hub.
+
+What: /sys/bus/usb/devices/usbX/persist
+Description:
+ Keeps the device even if it gets disconnected.
+
+What: /sys/bus/usb/devices/usbX/product
+Description:
+ Vendor specific string containing the name of the
+ device's product.
+
+What: /sys/bus/usb/devices/usbX/speed
+Description:
+ Shows the device's max speed, according to the USB version,
+ in Mbps.
+ Can be:
+
+ ======= ====================
+ Unknown speed unknown
+ 1.5 Low speed
+ 15 Full speed
+ 480 High Speed
+ 5000 Super Speed
+ 10000 Super Speed+
+ 20000 Super Speed+ Gen 2x2
+ ======= ====================
+
+What: /sys/bus/usb/devices/usbX/supports_autosuspend
+Description:
+ Returns 1 if the device doesn't support autosuspend.
+ Otherwise, returns 0.
+
+What: /sys/bus/usb/devices/usbX/urbnum
+Description:
+ Number of URBs submitted for the whole device.
+
+What: /sys/bus/usb/devices/usbX/version
+Description:
+ String containing the USB device version, as encoded
+		in the BCD descriptor.
+
+What: /sys/bus/usb/devices/usbX/power/autosuspend
+Description:
+ Time in milliseconds for the device to autosuspend. If the
+ value is negative, then autosuspend is prevented.
+
+ (read/write)
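+
+		A usage sketch (the timeout value is illustrative)::
+
+		    # echo 5000 > /sys/bus/usb/devices/usbX/power/autosuspend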
+
+What: /sys/bus/usb/devices/usbX/power/active_duration
+Description:
+ The total time the device has not been suspended.
+
+What: /sys/bus/usb/devices/usbX/power/connected_duration
+Description:
+ The total time (in msec) that the device has been connected.
+
+What: /sys/bus/usb/devices/usbX/power/level
+Description:
+
+What: /sys/bus/usb/devices/usbX/ep_<N>/bEndpointAddress
+Description:
+ The address of the endpoint described by this descriptor,
+ in hexadecimal. The endpoint direction on this bitmapped field
+ is also shown at:
+
+ /sys/bus/usb/devices/usbX/ep_<N>/direction
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/ep_<N>/bInterval
+Description:
+ The interval of the endpoint as described on its descriptor,
+ in hexadecimal. The actual interval depends on the version
+ of the USB. Also shown in time units at
+ /sys/bus/usb/devices/usbX/ep_<N>/interval.
+
+What: /sys/bus/usb/devices/usbX/ep_<N>/bLength
+Description:
+ Number of bytes of the endpoint descriptor, in hexadecimal.
+
+What: /sys/bus/usb/devices/usbX/ep_<N>/bmAttributes
+Description:
+ Attributes which apply to the endpoint as described on its
+ descriptor, in hexadecimal. The endpoint type on this
+ bitmapped field is also shown at:
+
+ /sys/bus/usb/devices/usbX/ep_<N>/type
+
+ See USB specs for its meaning.
+
+What: /sys/bus/usb/devices/usbX/ep_<N>/direction
+Description:
+ Direction of the endpoint. Can be:
+
+ - both (on control endpoints)
+ - in
+ - out
+
+What: /sys/bus/usb/devices/usbX/ep_<N>/interval
+Description:
+ Interval for polling endpoint for data transfers, in
+		milliseconds or microseconds.
+
+What: /sys/bus/usb/devices/usbX/ep_<N>/type
+Description:
+ Descriptor type. Can be:
+
+ - Control
+ - Isoc
+ - Bulk
+ - Interrupt
+ - unknown
+
+What: /sys/bus/usb/devices/usbX/ep_<N>/wMaxPacketSize
+Description:
+ Maximum packet size this endpoint is capable of
+ sending or receiving, in hexadecimal.
diff --git a/Documentation/ABI/testing/sysfs-bus-vdpa b/Documentation/ABI/testing/sysfs-bus-vdpa
new file mode 100644
index 000000000000..28a6111202ba
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-vdpa
@@ -0,0 +1,57 @@
+What: /sys/bus/vdpa/driver_autoprobe
+Date: March 2020
+Contact: virtualization@lists.linux-foundation.org
+Description:
+		This file determines whether new devices are immediately bound
+		to a driver after creation. It initially contains 1, which
+		means the kernel automatically binds devices to a compatible
+		driver immediately after they are created.
+
+		Writing "0" to this file disables this feature; any other
+		string enables it.
+
+What: /sys/bus/vdpa/driver_probe
+Date: March 2020
+Contact: virtualization@lists.linux-foundation.org
+Description:
+		Writing a device name to this file will cause the kernel to
+		bind the device to a compatible driver.
+
+ This can be useful when /sys/bus/vdpa/driver_autoprobe is
+ disabled.
+
+What: /sys/bus/vdpa/drivers/.../bind
+Date: March 2020
+Contact: virtualization@lists.linux-foundation.org
+Description:
+ Writing a device name to this file will cause the driver to
+ attempt to bind to the device. This is useful for overriding
+ default bindings.
+
+What: /sys/bus/vdpa/drivers/.../unbind
+Date: March 2020
+Contact: virtualization@lists.linux-foundation.org
+Description:
+ Writing a device name to this file will cause the driver to
+ attempt to unbind from the device. This may be useful when
+ overriding default bindings.
+
+What: /sys/bus/vdpa/devices/.../driver_override
+Date: November 2021
+Contact: virtualization@lists.linux-foundation.org
+Description:
+ This file allows the driver for a device to be specified.
+ When specified, only a driver with a name matching the value
+ written to driver_override will have an opportunity to bind to
+ the device. The override is specified by writing a string to the
+ driver_override file (echo vhost-vdpa > driver_override) and may
+ be cleared with an empty string (echo > driver_override).
+		This returns the device to the standard matching rules for
+		binding.
+ Writing to driver_override does not automatically unbind the
+ device from its current driver or make any attempt to
+ automatically load the specified driver. If no driver with a
+ matching name is currently loaded in the kernel, the device will
+ not bind to any driver. This also allows devices to opt-out of
+ driver binding using a driver_override name such as "none".
+		Only a single driver may be specified in the override; there is
+		no support for parsing delimiters.
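+
+		A usage sketch based on the description above (the device name
+		vdpa0 is illustrative)::
+
+		    # echo vhost-vdpa > /sys/bus/vdpa/devices/vdpa0/driver_override
+		    # echo vdpa0 > /sys/bus/vdpa/driver_probe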
diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index 5402bd74ba43..0d2abd88a18c 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -23,14 +23,17 @@ default
The default backing dev, used for non-block device backed
filesystems which do not provide their own BDI.
-Files under /sys/class/bdi/<bdi>/
-
-read_ahead_kb (read-write)
-
+What: /sys/class/bdi/<bdi>/read_ahead_kb
+Date: January 2008
+Contact: Peter Zijlstra <a.p.zijlstra@chello.nl>
+Description:
Size of the read-ahead window in kilobytes
-min_ratio (read-write)
-
+ (read-write)
+What: /sys/class/bdi/<bdi>/min_ratio
+Date: January 2008
+Contact: Peter Zijlstra <a.p.zijlstra@chello.nl>
+Description:
Under normal circumstances each device is given a part of the
total write-back cache that relates to its current average
writeout speed in relation to the other devices.
@@ -39,8 +42,27 @@ min_ratio (read-write)
percentage of the write-back cache to a particular device.
For example, this is useful for providing a minimum QoS.
-max_ratio (read-write)
+ (read-write)
+What: /sys/class/bdi/<bdi>/min_ratio_fine
+Date: November 2022
+Contact: Stefan Roesch <shr@devkernel.io>
+Description:
+ Under normal circumstances each device is given a part of the
+ total write-back cache that relates to its current average
+ writeout speed in relation to the other devices.
+
+ The 'min_ratio_fine' parameter allows assigning a minimum reserve
+ of the write-back cache to a particular device. The value is
+ expressed as part of 1 million. For example, this is useful for
+ providing a minimum QoS.
+
+ (read-write)
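+
+		A usage sketch (the value, in parts per million, is
+		illustrative)::
+
+		    # echo 1000 > /sys/class/bdi/<bdi>/min_ratio_fine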
+
+What: /sys/class/bdi/<bdi>/max_ratio
+Date: January 2008
+Contact: Peter Zijlstra <a.p.zijlstra@chello.nl>
+Description:
Allows limiting a particular device to use not more than the
given percentage of the write-back cache. This is useful in
situations where we want to avoid one device taking all or
@@ -48,7 +70,65 @@ max_ratio (read-write)
mount that is prone to get stuck, or a FUSE mount which cannot
be trusted to play fair.
-stable_pages_required (read-only)
+ (read-write)
+
+What: /sys/class/bdi/<bdi>/max_ratio_fine
+Date: November 2022
+Contact: Stefan Roesch <shr@devkernel.io>
+Description:
+ Allows limiting a particular device to use not more than the
+ given value of the write-back cache. The value is given as part
+ of 1 million. This is useful in situations where we want to avoid
+ one device taking all or most of the write-back cache. For example
+ in case of an NFS mount that is prone to get stuck, or a FUSE mount
+ which cannot be trusted to play fair.
+
+ (read-write)
+What: /sys/class/bdi/<bdi>/min_bytes
+Date: October 2022
+Contact: Stefan Roesch <shr@devkernel.io>
+Description:
+ Under normal circumstances each device is given a part of the
+ total write-back cache that relates to its current average
+ writeout speed in relation to the other devices.
+
+		The 'min_bytes' parameter allows assigning a minimum share of
+		the write-back cache to a particular device, expressed in
+		bytes.
+ For example, this is useful for providing a minimum QoS.
+
+ (read-write)
+
+What: /sys/class/bdi/<bdi>/max_bytes
+Date: October 2022
+Contact: Stefan Roesch <shr@devkernel.io>
+Description:
+ Allows limiting a particular device to use not more than the
+ given 'max_bytes' of the write-back cache. This is useful in
+ situations where we want to avoid one device taking all or
+ most of the write-back cache. For example in case of an NFS
+ mount that is prone to get stuck, a FUSE mount which cannot be
+		trusted to play fair, or an nbd device.
+
+ (read-write)
+
+What: /sys/class/bdi/<bdi>/strict_limit
+Date: October 2022
+Contact: Stefan Roesch <shr@devkernel.io>
+Description:
+		Forces per-BDI checks for the share of the given device in the
+		write-back cache even before the global background dirty limit
+		is reached. This is useful in situations where the global limit
+		is much higher than is affordable for a given relatively slow
+		(or untrusted) device. Turning strictlimit on has no visible
+		effect if max_ratio is equal to 100%.
+
+ (read-write)
+What: /sys/class/bdi/<bdi>/stable_pages_required
+Date: January 2008
+Contact: Peter Zijlstra <a.p.zijlstra@chello.nl>
+Description:
If set, the backing device requires that all pages comprising a write
request must not be changed until writeout is complete.
+
+ (read-only)
diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl
index 818f55970efb..594fda254130 100644
--- a/Documentation/ABI/testing/sysfs-class-cxl
+++ b/Documentation/ABI/testing/sysfs-class-cxl
@@ -103,8 +103,8 @@ What: /sys/class/cxl/<afu>/api_version_compatible
Date: September 2014
Contact: linuxppc-dev@lists.ozlabs.org
Description: read only
- Decimal value of the the lowest version of the userspace API
- this this kernel supports.
+ Decimal value of the lowest version of the userspace API
+ this kernel supports.
Users: https://github.com/ibm-capi/libcxl
@@ -166,10 +166,11 @@ Description: read only
Decimal value of the Per Process MMIO space length.
Users: https://github.com/ibm-capi/libcxl
-What: /sys/class/cxl/<afu>m/pp_mmio_off (not in a guest)
+What: /sys/class/cxl/<afu>m/pp_mmio_off
Date: September 2014
Contact: linuxppc-dev@lists.ozlabs.org
Description: read only
+ (not in a guest)
Decimal value of the Per Process MMIO space offset.
Users: https://github.com/ibm-capi/libcxl
@@ -190,28 +191,31 @@ Description: read only
Identifies the revision level of the PSL.
Users: https://github.com/ibm-capi/libcxl
-What: /sys/class/cxl/<card>/base_image (not in a guest)
+What: /sys/class/cxl/<card>/base_image
Date: September 2014
Contact: linuxppc-dev@lists.ozlabs.org
Description: read only
+ (not in a guest)
Identifies the revision level of the base image for devices
that support loadable PSLs. For FPGAs this field identifies
the image contained in the on-adapter flash which is loaded
during the initial program load.
Users: https://github.com/ibm-capi/libcxl
-What: /sys/class/cxl/<card>/image_loaded (not in a guest)
+What: /sys/class/cxl/<card>/image_loaded
Date: September 2014
Contact: linuxppc-dev@lists.ozlabs.org
Description: read only
+ (not in a guest)
Will return "user" or "factory" depending on the image loaded
onto the card.
Users: https://github.com/ibm-capi/libcxl
-What: /sys/class/cxl/<card>/load_image_on_perst (not in a guest)
+What: /sys/class/cxl/<card>/load_image_on_perst
Date: December 2014
Contact: linuxppc-dev@lists.ozlabs.org
Description: read/write
+ (not in a guest)
Valid entries are "none", "user", and "factory".
"none" means PERST will not cause image to be loaded to the
card. A power cycle is required to load the image.
@@ -235,10 +239,11 @@ Description: write only
contexts on the card AFUs.
Users: https://github.com/ibm-capi/libcxl
-What: /sys/class/cxl/<card>/perst_reloads_same_image (not in a guest)
+What: /sys/class/cxl/<card>/perst_reloads_same_image
Date: July 2015
Contact: linuxppc-dev@lists.ozlabs.org
Description: read/write
+ (not in a guest)
Trust that when an image is reloaded via PERST, it will not
have changed.
diff --git a/Documentation/ABI/testing/sysfs-class-devfreq-event b/Documentation/ABI/testing/sysfs-class-devfreq-event
index ceaf0f686d4a..dbe48495e55a 100644
--- a/Documentation/ABI/testing/sysfs-class-devfreq-event
+++ b/Documentation/ABI/testing/sysfs-class-devfreq-event
@@ -1,25 +1,25 @@
-What: /sys/class/devfreq-event/event(x)/
+What: /sys/class/devfreq-event/event<x>/
Date: January 2017
Contact: Chanwoo Choi <cw00.choi@samsung.com>
Description:
Provide a place in sysfs for the devfreq-event objects.
This allows accessing various devfreq-event specific variables.
- The name of devfreq-event object denoted as 'event(x)' which
+ The name of devfreq-event object denoted as 'event<x>' which
includes the unique number of 'x' for each devfreq-event object.
-What: /sys/class/devfreq-event/event(x)/name
+What: /sys/class/devfreq-event/event<x>/name
Date: January 2017
Contact: Chanwoo Choi <cw00.choi@samsung.com>
Description:
- The /sys/class/devfreq-event/event(x)/name attribute contains
+ The /sys/class/devfreq-event/event<x>/name attribute contains
the name of the devfreq-event object. This attribute is
read-only.
-What: /sys/class/devfreq-event/event(x)/enable_count
+What: /sys/class/devfreq-event/event<x>/enable_count
Date: January 2017
Contact: Chanwoo Choi <cw00.choi@samsung.com>
Description:
- The /sys/class/devfreq-event/event(x)/enable_count attribute
+ The /sys/class/devfreq-event/event<x>/enable_count attribute
contains the reference count to enable the devfreq-event
object. If the device is enabled, the value of attribute is
greater than zero.
diff --git a/Documentation/ABI/testing/sysfs-class-extcon b/Documentation/ABI/testing/sysfs-class-extcon
index fde0fecd5de9..f8e705375b24 100644
--- a/Documentation/ABI/testing/sysfs-class-extcon
+++ b/Documentation/ABI/testing/sysfs-class-extcon
@@ -65,19 +65,19 @@ Description:
interface associated with each cable cannot update
multiple cable states of an extcon device simultaneously.
-What: /sys/class/extcon/.../cable.x/name
+What: /sys/class/extcon/.../cable.X/name
Date: February 2012
Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
Description:
- The /sys/class/extcon/.../cable.x/name shows the name of cable
- "x" (integer between 0 and 31) of an extcon device.
+ The /sys/class/extcon/.../cable.X/name shows the name of cable
+ "X" (integer between 0 and 31) of an extcon device.
-What: /sys/class/extcon/.../cable.x/state
+What: /sys/class/extcon/.../cable.X/state
Date: February 2012
Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
Description:
- The /sys/class/extcon/.../cable.x/state shows and stores the
- state of cable "x" (integer between 0 and 31) of an extcon
+ The /sys/class/extcon/.../cable.X/state shows and stores the
+ state of cable "X" (integer between 0 and 31) of an extcon
device. The state value is either 0 (detached) or 1
(attached).
diff --git a/Documentation/ABI/testing/sysfs-class-fc b/Documentation/ABI/testing/sysfs-class-fc
new file mode 100644
index 000000000000..3057a6d3b8cf
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-fc
@@ -0,0 +1,27 @@
+What: /sys/class/fc/fc_udev_device/appid_store
+Date: Aug 2021
+Contact:	Muneendra Kumar <muneendra.kumar@broadcom.com>
+Description:
+ This interface allows an admin to set an FC application
+ identifier in the blkcg associated with a cgroup id. The
+ identifier is typically a UUID that is associated with
+ an application or logical entity such as a virtual
+ machine or container group. The application or logical
+ entity utilizes a block device via the cgroup id.
+ FC adapter drivers may query the identifier and tag FC
+ traffic based on the identifier. FC host and FC fabric
+ entities can utilize the application id and FC traffic
+ tag to identify traffic sources.
+
+ The interface expects a string "<cgroupid>:<appid>" where:
+		<cgroupid> is the inode of the cgroup in hexadecimal
+		<appid> is a user-provided string of up to 128 characters
+		in length.
+
+ If an appid_store is done for a cgroup id that already
+ has an appid set, the new value will override the
+ previous value.
+
+ If an admin wants to remove an FC application identifier
+ from a cgroup, an appid_store should be done with the
+ following string: "<cgroupid>:"
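+
+		A hypothetical usage sketch (the cgroup path and UUID are
+		invented for illustration; the cgroup id is the cgroup
+		directory's inode number in hexadecimal)::
+
+		  ino=$(printf '%x' "$(stat -c %i /sys/fs/cgroup/vm1.scope)")
+		  echo "${ino}:8a6d53c5-5f4e-4f2f-9c1b-3b2f1d8e9a10" > \
+			/sys/class/fc/fc_udev_device/appid_store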
diff --git a/Documentation/ABI/testing/sysfs-class-firmware b/Documentation/ABI/testing/sysfs-class-firmware
new file mode 100644
index 000000000000..978d3d500400
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-firmware
@@ -0,0 +1,77 @@
+What: /sys/class/firmware/.../data
+Date: July 2022
+KernelVersion: 5.19
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: The data sysfs file is used for firmware-fallback and for
+ firmware uploads. Cat a firmware image to this sysfs file
+ after you echo 1 to the loading sysfs file. When the firmware
+ image write is complete, echo 0 to the loading sysfs file. This
+ sequence will signal the completion of the firmware write and
+ signal the lower-level driver that the firmware data is
+ available.
+
+What: /sys/class/firmware/.../cancel
+Date: July 2022
+KernelVersion: 5.19
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: Write-only. For firmware uploads, write a "1" to this file to
+ request that the transfer of firmware data to the lower-level
+ device be canceled. This request will be rejected (EBUSY) if
+ the update cannot be canceled (e.g. a FLASH write is in
+ progress) or (ENODEV) if there is no firmware update in progress.
+
+What: /sys/class/firmware/.../error
+Date: July 2022
+KernelVersion: 5.19
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: Read-only. Returns a string describing a failed firmware
+ upload. This string will be in the form of <STATUS>:<ERROR>,
+ where <STATUS> will be one of the status strings described
+ for the status sysfs file and <ERROR> will be one of the
+ following: "hw-error", "timeout", "user-abort", "device-busy",
+ "invalid-file-size", "read-write-error", "flash-wearout". The
+ error sysfs file is only meaningful when the current firmware
+ upload status is "idle". If this file is read while a firmware
+ transfer is in progress, then the read will fail with EBUSY.
+
+What: /sys/class/firmware/.../loading
+Date: July 2022
+KernelVersion: 5.19
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: The loading sysfs file is used for both firmware-fallback and
+ for firmware uploads. Echo 1 onto the loading file to indicate
+ you are writing a firmware file to the data sysfs node. Echo
+ -1 onto this file to abort the data write or echo 0 onto this
+ file to indicate that the write is complete. For firmware
+ uploads, the zero value also triggers the transfer of the
+ firmware data to the lower-level device driver.
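+
+		A minimal upload sequence sketch (device name and image
+		path are illustrative)::
+
+		  cd /sys/class/firmware/fw-device
+		  echo 1 > loading
+		  cat /path/to/firmware.bin > data
+		  echo 0 > loading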
+
+What: /sys/class/firmware/.../remaining_size
+Date: July 2022
+KernelVersion: 5.19
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: Read-only. For firmware upload, this file contains the size
+ of the firmware data that remains to be transferred to the
+ lower-level device driver. The size value is initialized to
+ the full size of the firmware image that was previously
+ written to the data sysfs file. This value is periodically
+ updated during the "transferring" phase of the firmware
+ upload.
+ Format: "%u".
+
+What: /sys/class/firmware/.../status
+Date: July 2022
+KernelVersion: 5.19
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: Read-only. Returns a string describing the current status of
+ a firmware upload. The string will be one of the following:
+		"idle", "receiving", "preparing", "transferring", "programming".
+
+What: /sys/class/firmware/.../timeout
+Date: July 2022
+KernelVersion: 5.19
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: This file supports the timeout mechanism for firmware
+		fallback. This file has no effect on firmware uploads. For
+ more information on timeouts please see the documentation
+ for firmware fallback.
diff --git a/Documentation/ABI/testing/sysfs-class-firmware-attributes b/Documentation/ABI/testing/sysfs-class-firmware-attributes
index 90fdf935aa5e..4cdba3477176 100644
--- a/Documentation/ABI/testing/sysfs-class-firmware-attributes
+++ b/Documentation/ABI/testing/sysfs-class-firmware-attributes
@@ -116,7 +116,7 @@ Description:
<value>[ForceIf:<attribute>=<value>]
<value>[ForceIfNot:<attribute>=<value>]
- For example:
+ For example::
LegacyOrom/dell_value_modifier has value:
Disabled[ForceIf:SecureBoot=Enabled]
@@ -161,6 +161,15 @@ Description:
power-on:
Representing a password required to use
the system
+			system-mgmt:
+				Representing the System Management password.
+				See the Lenovo extensions section for details.
+			HDD:
+				Representing the HDD password.
+				See the Lenovo extensions section for details.
+			NVMe:
+				Representing the NVMe password.
+				See the Lenovo extensions section for details.
mechanism:
The means of authentication. This attribute is mandatory.
@@ -203,10 +212,17 @@ Description:
the next boot.
Lenovo specific class extensions
- ------------------------------
+ --------------------------------
On Lenovo systems the following additional settings are available:
+ role: system-mgmt This gives the same authority as the bios-admin password to control
+ security related features. The authorities allocated can be set via
+ the BIOS menu SMP Access Control Policy
+
+	role: HDD & NVMe This password is used to unlock access to the drive at boot. See
+			the 'level' and 'index' extensions below.
+
lenovo_encoding:
The encoding method that is used. This can be either "ascii"
or "scancode". Default is set to "ascii"
@@ -216,6 +232,71 @@ Description:
two char code (e.g. "us", "fr", "gr") and may vary per platform.
Default is set to "us"
+ level:
+ Available for HDD and NVMe authentication to set 'user' or 'master'
+ privilege level.
+ If only the user password is configured then this should be used to
+ unlock the drive at boot. If both master and user passwords are set
+			then either can be used. If a master password is set, a user
+			password is also required.
+			This attribute defaults to the 'user' level.
+
+ index:
+			Used with HDD and NVMe authentication to set the drive index
+			that is being referenced (e.g. hdd0, hdd1, etc.).
+ This attribute defaults to device 0.
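+
+			For example, to set a user-level password on the second
+			drive (a sketch; the authentication directory name follows
+			the role names above, and current_password/new_password are
+			the standard class password attributes)::
+
+			  echo "user" > authentication/HDD/level
+			  echo 1 > authentication/HDD/index
+			  echo "old password" > authentication/HDD/current_password
+			  echo "new password" > authentication/HDD/new_password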
+
+ certificate, signature, save_signature:
+ These attributes are used for certificate based authentication. This is
+ used in conjunction with a signing server as an alternative to password
+ based authentication.
+ The user writes to the attribute(s) with a BASE64 encoded string obtained
+ from the signing server.
+ The attributes can be displayed to check the stored value.
+
+ Some usage examples:
+
+ Installing a certificate to enable feature::
+
+ echo "supervisor password" > authentication/Admin/current_password
+ echo "signed certificate" > authentication/Admin/certificate
+
+ Updating the installed certificate::
+
+ echo "signature" > authentication/Admin/signature
+ echo "signed certificate" > authentication/Admin/certificate
+
+ Removing the installed certificate::
+
+ echo "signature" > authentication/Admin/signature
+ echo "" > authentication/Admin/certificate
+
+ Changing a BIOS setting::
+
+ echo "signature" > authentication/Admin/signature
+ echo "save signature" > authentication/Admin/save_signature
+ echo Enable > attribute/PasswordBeep/current_value
+
+ You cannot enable certificate authentication if a supervisor password
+ has not been set.
+			Clearing the certificate results in no bios-admin authentication
+			method being configured, allowing anyone to make changes.
+ After any of these operations the system must reboot for the changes to
+ take effect.
+
+ certificate_thumbprint:
+ Read only attribute used to display the MD5, SHA1 and SHA256 thumbprints
+ for the certificate installed in the BIOS.
+
+ certificate_to_password:
+ Write only attribute used to switch from certificate based authentication
+ back to password based.
+ Usage::
+
+ echo "signature" > authentication/Admin/signature
+ echo "password" > authentication/Admin/certificate_to_password
+
+
What: /sys/class/firmware-attributes/*/attributes/pending_reboot
Date: February 2021
KernelVersion: 5.11
@@ -268,7 +349,7 @@ Description:
# echo "factory" > /sys/class/firmware-attributes/*/device/attributes/reset_bios
# cat /sys/class/firmware-attributes/*/device/attributes/reset_bios
- # builtinsafe lastknowngood [factory] custom
+ builtinsafe lastknowngood [factory] custom
Note that any changes to this attribute requires a reboot
for changes to take effect.
diff --git a/Documentation/ABI/testing/sysfs-class-gnss b/Documentation/ABI/testing/sysfs-class-gnss
index c8553d972edd..9650f3a7fc03 100644
--- a/Documentation/ABI/testing/sysfs-class-gnss
+++ b/Documentation/ABI/testing/sysfs-class-gnss
@@ -1,4 +1,4 @@
-What: /sys/class/gnss/gnssN/type
+What: /sys/class/gnss/gnss<N>/type
Date: May 2018
KernelVersion: 4.18
Contact: Johan Hovold <johan@kernel.org>
diff --git a/Documentation/ABI/testing/sysfs-class-hwmon b/Documentation/ABI/testing/sysfs-class-hwmon
new file mode 100644
index 000000000000..638f4c6d4ec7
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-hwmon
@@ -0,0 +1,958 @@
+What: /sys/class/hwmon/hwmonX/name
+Description:
+ The chip name.
+ This should be a short, lowercase string, not containing
+ whitespace, dashes, or the wildcard character '*'.
+ This attribute represents the chip name. It is the only
+ mandatory attribute.
+ I2C devices get this attribute created automatically.
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/label
+Description:
+		A descriptive label that allows a device to be uniquely
+		identified within the system.
+ The contents of the label are free-form.
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/update_interval
+Description:
+ The interval at which the chip will update readings.
+ Unit: millisecond
+
+ RW
+
+ Some devices have a variable update rate or interval.
+ This attribute can be used to change it to the desired value.
+
+What: /sys/class/hwmon/hwmonX/inY_min
+Description:
+ Voltage min value.
+
+ Unit: millivolt
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/inY_lcrit
+Description:
+ Voltage critical min value.
+
+ Unit: millivolt
+
+ RW
+
+ If voltage drops to or below this limit, the system may
+ take drastic action such as power down or reset. At the very
+ least, it should report a fault.
+
+What: /sys/class/hwmon/hwmonX/inY_max
+Description:
+ Voltage max value.
+
+ Unit: millivolt
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/inY_crit
+Description:
+ Voltage critical max value.
+
+ Unit: millivolt
+
+ RW
+
+ If voltage reaches or exceeds this limit, the system may
+ take drastic action such as power down or reset. At the very
+ least, it should report a fault.
+
+What: /sys/class/hwmon/hwmonX/inY_input
+Description:
+ Voltage input value.
+
+ Unit: millivolt
+
+ RO
+
+ Voltage measured on the chip pin.
+
+ Actual voltage depends on the scaling resistors on the
+ motherboard, as recommended in the chip datasheet.
+
+ This varies by chip and by motherboard.
+ Because of this variation, values are generally NOT scaled
+ by the chip driver, and must be done by the application.
+ However, some drivers (notably lm87 and via686a)
+ do scale, because of internal resistors built into a chip.
+ These drivers will output the actual voltage. Rule of
+ thumb: drivers should report the voltage values at the
+ "pins" of the chip.
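+
+	For example, if the motherboard halves a 12 V rail with a 1:1
+	resistor divider before the chip pin (an invented setup), a raw
+	reading of::
+
+	  cat /sys/class/hwmon/hwmon0/in2_input
+	  6000
+
+	corresponds to roughly 12 V on the rail, and the application
+	must apply the 2x scaling itself.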
+
+What: /sys/class/hwmon/hwmonX/inY_average
+Description:
+ Average voltage
+
+ Unit: millivolt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/inY_lowest
+Description:
+ Historical minimum voltage
+
+ Unit: millivolt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/inY_highest
+Description:
+ Historical maximum voltage
+
+ Unit: millivolt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/inY_reset_history
+Description:
+ Reset inX_lowest and inX_highest
+
+ WO
+
+What: /sys/class/hwmon/hwmonX/in_reset_history
+Description:
+ Reset inX_lowest and inX_highest for all sensors
+
+ WO
+
+What: /sys/class/hwmon/hwmonX/inY_label
+Description:
+ Suggested voltage channel label.
+
+ Text string
+
+ Should only be created if the driver has hints about what
+ this voltage channel is being used for, and user-space
+ doesn't. In all other cases, the label is provided by
+ user-space.
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/inY_enable
+Description:
+ Enable or disable the sensors.
+
+ When disabled the sensor read will return -ENODATA.
+
+ - 1: Enable
+ - 0: Disable
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/cpuY_vid
+Description:
+ CPU core reference voltage.
+
+ Unit: millivolt
+
+ RO
+
+ Not always correct.
+
+What: /sys/class/hwmon/hwmonX/vrm
+Description:
+ Voltage Regulator Module version number.
+
+	RW (but changing it should no longer be necessary)
+
+ Originally the VRM standard version multiplied by 10, but now
+ an arbitrary number, as not all standards have a version
+ number.
+
+ Affects the way the driver calculates the CPU core reference
+ voltage from the vid pins.
+
+What: /sys/class/hwmon/hwmonX/inY_rated_min
+Description:
+ Minimum rated voltage.
+
+ Unit: millivolt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/inY_rated_max
+Description:
+ Maximum rated voltage.
+
+ Unit: millivolt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/fanY_min
+Description:
+ Fan minimum value
+
+ Unit: revolution/min (RPM)
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/fanY_max
+Description:
+ Fan maximum value
+
+ Unit: revolution/min (RPM)
+
+ Only rarely supported by the hardware.
+ RW
+
+What: /sys/class/hwmon/hwmonX/fanY_input
+Description:
+ Fan input value.
+
+ Unit: revolution/min (RPM)
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/fanY_div
+Description:
+ Fan divisor.
+
+ Integer value in powers of two (1, 2, 4, 8, 16, 32, 64, 128).
+
+ RW
+
+ Some chips only support values 1, 2, 4 and 8.
+ Note that this is actually an internal clock divisor, which
+ affects the measurable speed range, not the read value.
+
+What: /sys/class/hwmon/hwmonX/fanY_pulses
+Description:
+ Number of tachometer pulses per fan revolution.
+
+ Integer value, typically between 1 and 4.
+
+ RW
+
+ This value is a characteristic of the fan connected to the
+ device's input, so it has to be set in accordance with the fan
+ model.
+
+ Should only be created if the chip has a register to configure
+ the number of pulses. In the absence of such a register (and
+ thus attribute) the value assumed by all devices is 2 pulses
+ per fan revolution.
+
+What: /sys/class/hwmon/hwmonX/fanY_target
+Description:
+ Desired fan speed
+
+ Unit: revolution/min (RPM)
+
+ RW
+
+ Only makes sense if the chip supports closed-loop fan speed
+ control based on the measured fan speed.
+
+What: /sys/class/hwmon/hwmonX/fanY_label
+Description:
+ Suggested fan channel label.
+
+ Text string
+
+ Should only be created if the driver has hints about what
+ this fan channel is being used for, and user-space doesn't.
+ In all other cases, the label is provided by user-space.
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/fanY_enable
+Description:
+ Enable or disable the sensors.
+
+ When disabled the sensor read will return -ENODATA.
+
+ - 1: Enable
+ - 0: Disable
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/fanY_fault
+Description:
+ Reports if a fan has reported failure.
+
+ - 1: Failed
+ - 0: Ok
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/pwmY
+Description:
+ Pulse width modulation fan control.
+
+ Integer value in the range 0 to 255
+
+ RW
+
+ 255 is max or 100%.
+
+What: /sys/class/hwmon/hwmonX/pwmY_enable
+Description:
+ Fan speed control method:
+
+ - 0: no fan speed control (i.e. fan at full speed)
+ - 1: manual fan speed control enabled (using `pwmY`)
+ - 2+: automatic fan speed control enabled
+
+ Check individual chip documentation files for automatic mode
+ details.
+
+ RW
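+
+	For example, to take manual control of the first PWM output and
+	set it to roughly 50% duty (device path illustrative)::
+
+	  echo 1 > /sys/class/hwmon/hwmon0/pwm1_enable
+	  echo 128 > /sys/class/hwmon/hwmon0/pwm1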
+
+What: /sys/class/hwmon/hwmonX/pwmY_mode
+Description:
+ - 0: DC mode (direct current)
+ - 1: PWM mode (pulse-width modulation)
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/pwmY_freq
+Description:
+ Base PWM frequency in Hz.
+
+	Only possibly available when pwmY_mode is PWM, but not always
+ present even then.
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/pwmY_auto_channels_temp
+Description:
+ Select which temperature channels affect this PWM output in
+ auto mode.
+
+	Bitfield: 1 is temp1, 2 is temp2, 4 is temp3, etc.
+	Which values are possible depends on the chip used.
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/pwmY_auto_pointZ_pwm
+What: /sys/class/hwmon/hwmonX/pwmY_auto_pointZ_temp
+What: /sys/class/hwmon/hwmonX/pwmY_auto_pointZ_temp_hyst
+Description:
+ Define the PWM vs temperature curve.
+
+ Number of trip points is chip-dependent. Use this for chips
+ which associate trip points to PWM output channels.
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_auto_pointZ_pwm
+What: /sys/class/hwmon/hwmonX/tempY_auto_pointZ_temp
+What: /sys/class/hwmon/hwmonX/tempY_auto_pointZ_temp_hyst
+Description:
+ Define the PWM vs temperature curve.
+
+ Number of trip points is chip-dependent. Use this for chips
+ which associate trip points to temperature channels.
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_type
+Description:
+ Sensor type selection.
+
+ Integers 1 to 6
+
+ RW
+
+ - 1: CPU embedded diode
+ - 2: 3904 transistor
+ - 3: thermal diode
+ - 4: thermistor
+ - 5: AMD AMDSI
+ - 6: Intel PECI
+
+ Not all types are supported by all chips
+
+What: /sys/class/hwmon/hwmonX/tempY_max
+Description:
+ Temperature max value.
+
+ Unit: millidegree Celsius (or millivolt, see below)
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_min
+Description:
+ Temperature min value.
+
+ Unit: millidegree Celsius
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_max_hyst
+Description:
+ Temperature hysteresis value for max limit.
+
+ Unit: millidegree Celsius
+
+ Must be reported as an absolute temperature, NOT a delta
+ from the max value.
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_min_hyst
+Description:
+ Temperature hysteresis value for min limit.
+ Unit: millidegree Celsius
+
+ Must be reported as an absolute temperature, NOT a delta
+ from the min value.
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_input
+Description:
+ Temperature input value.
+
+ Unit: millidegree Celsius
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/tempY_crit
+Description:
+ Temperature critical max value, typically greater than
+ corresponding temp_max values.
+
+ Unit: millidegree Celsius
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_crit_alarm
+Description:
+ Critical high temperature alarm flag.
+
+ - 0: OK
+ - 1: temperature has reached tempY_crit
+
+ RW
+
+ Contrary to regular alarm flags which clear themselves
+ automatically when read, this one sticks until cleared by
+ the user. This is done by writing 0 to the file. Writing
+ other values is unsupported.
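+
+	For example, once the condition has been handled, the sticky
+	flag can be cleared with (path illustrative)::
+
+	  echo 0 > /sys/class/hwmon/hwmon0/temp1_crit_alarm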
+
+What: /sys/class/hwmon/hwmonX/tempY_crit_hyst
+Description:
+ Temperature hysteresis value for critical limit.
+
+ Unit: millidegree Celsius
+
+ Must be reported as an absolute temperature, NOT a delta
+ from the critical value.
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_emergency
+Description:
+ Temperature emergency max value, for chips supporting more than
+	two upper temperature limits. Must be equal to or greater than
+ corresponding temp_crit values.
+
+ Unit: millidegree Celsius
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_emergency_hyst
+Description:
+ Temperature hysteresis value for emergency limit.
+
+ Unit: millidegree Celsius
+
+ Must be reported as an absolute temperature, NOT a delta
+ from the emergency value.
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_lcrit
+Description:
+ Temperature critical min value, typically lower than
+ corresponding temp_min values.
+
+ Unit: millidegree Celsius
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_lcrit_hyst
+Description:
+ Temperature hysteresis value for critical min limit.
+
+ Unit: millidegree Celsius
+
+ Must be reported as an absolute temperature, NOT a delta
+ from the critical min value.
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_offset
+Description:
+ Temperature offset which is added to the temperature reading
+ by the chip.
+
+ Unit: millidegree Celsius
+
+	RW
+
+What: /sys/class/hwmon/hwmonX/tempY_label
+Description:
+ Suggested temperature channel label.
+
+ Text string
+
+ Should only be created if the driver has hints about what
+ this temperature channel is being used for, and user-space
+ doesn't. In all other cases, the label is provided by
+ user-space.
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/tempY_lowest
+Description:
+ Historical minimum temperature
+
+ Unit: millidegree Celsius
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/tempY_highest
+Description:
+ Historical maximum temperature
+
+ Unit: millidegree Celsius
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/tempY_reset_history
+Description:
+ Reset temp_lowest and temp_highest
+
+ WO
+
+What: /sys/class/hwmon/hwmonX/temp_reset_history
+Description:
+ Reset temp_lowest and temp_highest for all sensors
+
+ WO
+
+What: /sys/class/hwmon/hwmonX/tempY_enable
+Description:
+ Enable or disable the sensors.
+
+ When disabled the sensor read will return -ENODATA.
+
+ - 1: Enable
+ - 0: Disable
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/tempY_rated_min
+Description:
+ Minimum rated temperature.
+
+ Unit: millidegree Celsius
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/tempY_rated_max
+Description:
+ Maximum rated temperature.
+
+ Unit: millidegree Celsius
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/currY_max
+Description:
+ Current max value
+
+ Unit: milliampere
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/currY_min
+Description:
+ Current min value.
+
+ Unit: milliampere
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/currY_lcrit
+Description:
+ Current critical low value
+
+ Unit: milliampere
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/currY_crit
+Description:
+ Current critical high value.
+
+ Unit: milliampere
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/currY_input
+Description:
+ Current input value
+
+ Unit: milliampere
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/currY_average
+Description:
+ Average current use
+
+ Unit: milliampere
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/currY_lowest
+Description:
+ Historical minimum current
+
+ Unit: milliampere
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/currY_highest
+Description:
+ Historical maximum current
+
+	Unit: milliampere
+
+	RO
+
+What: /sys/class/hwmon/hwmonX/currY_reset_history
+Description:
+ Reset currX_lowest and currX_highest
+
+ WO
+
+What: /sys/class/hwmon/hwmonX/curr_reset_history
+Description:
+ Reset currX_lowest and currX_highest for all sensors
+
+ WO
+
+What: /sys/class/hwmon/hwmonX/currY_enable
+Description:
+ Enable or disable the sensors.
+
+ When disabled the sensor read will return -ENODATA.
+
+ - 1: Enable
+ - 0: Disable
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/currY_rated_min
+Description:
+ Minimum rated current.
+
+ Unit: milliampere
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/currY_rated_max
+Description:
+ Maximum rated current.
+
+ Unit: milliampere
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_average
+Description:
+ Average power use
+
+ Unit: microWatt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_average_interval
+Description:
+ Power use averaging interval. A poll
+ notification is sent to this file if the
+ hardware changes the averaging interval.
+
+ Unit: milliseconds
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/powerY_average_interval_max
+Description:
+ Maximum power use averaging interval
+
+ Unit: milliseconds
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_average_interval_min
+Description:
+ Minimum power use averaging interval
+
+ Unit: milliseconds
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_average_highest
+Description:
+ Historical average maximum power use
+
+ Unit: microWatt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_average_lowest
+Description:
+ Historical average minimum power use
+
+ Unit: microWatt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_average_max
+Description:
+ A poll notification is sent to
+ `powerY_average` when power use
+ rises above this value.
+
+ Unit: microWatt
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/powerY_average_min
+Description:
+ A poll notification is sent to
+ `powerY_average` when power use
+ sinks below this value.
+
+ Unit: microWatt
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/powerY_input
+Description:
+ Instantaneous power use
+
+ Unit: microWatt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_input_highest
+Description:
+ Historical maximum power use
+
+ Unit: microWatt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_input_lowest
+Description:
+ Historical minimum power use
+
+ Unit: microWatt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_reset_history
+Description:
+ Reset input_highest, input_lowest,
+ average_highest and average_lowest.
+
+ WO
+
+What: /sys/class/hwmon/hwmonX/powerY_accuracy
+Description:
+ Accuracy of the power meter.
+
+ Unit: Percent
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_cap
+Description:
+ If power use rises above this limit, the
+ system should take action to reduce power use.
+ A poll notification is sent to this file if the
+ cap is changed by the hardware. The `*_cap`
+ files only appear if the cap is known to be
+ enforced by hardware.
+
+ Unit: microWatt
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/powerY_cap_hyst
+Description:
+ Margin of hysteresis built around capping and
+ notification.
+
+ Unit: microWatt
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/powerY_cap_max
+Description:
+ Maximum cap that can be set.
+
+ Unit: microWatt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_cap_min
+Description:
+ Minimum cap that can be set.
+
+ Unit: microWatt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_max
+Description:
+ Maximum power.
+
+ Unit: microWatt
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/powerY_crit
+Description:
+ Critical maximum power.
+
+ If power rises to or above this limit, the
+	system is expected to take drastic action to reduce
+ power consumption, such as a system shutdown or
+ a forced powerdown of some devices.
+
+ Unit: microWatt
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/powerY_enable
+Description:
+ Enable or disable the sensors.
+
+ When disabled the sensor read will return
+ -ENODATA.
+
+ - 1: Enable
+ - 0: Disable
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/powerY_rated_min
+Description:
+ Minimum rated power.
+
+ Unit: microWatt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/powerY_rated_max
+Description:
+ Maximum rated power.
+
+ Unit: microWatt
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/energyY_input
+Description:
+ Cumulative energy use
+
+ Unit: microJoule
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/energyY_enable
+Description:
+ Enable or disable the sensors.
+
+ When disabled the sensor read will return
+ -ENODATA.
+
+ - 1: Enable
+ - 0: Disable
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/humidityY_input
+Description:
+ Humidity
+
+ Unit: milli-percent (per cent mille, pcm)
+
+ RO
+
+
+What: /sys/class/hwmon/hwmonX/humidityY_enable
+Description:
+ Enable or disable the sensors
+
+ When disabled the sensor read will return
+ -ENODATA.
+
+ - 1: Enable
+ - 0: Disable
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/humidityY_rated_min
+Description:
+ Minimum rated humidity.
+
+ Unit: milli-percent (per cent mille, pcm)
+
+ RO
+
+What: /sys/class/hwmon/hwmonX/humidityY_rated_max
+Description:
+ Maximum rated humidity.
+
+ Unit: milli-percent (per cent mille, pcm)
+
+ RO
+
+
+What: /sys/class/hwmon/hwmonX/intrusionY_alarm
+Description:
+ Chassis intrusion detection
+
+ - 0: OK
+ - 1: intrusion detected
+
+ RW
+
+ Contrary to regular alarm flags which clear themselves
+ automatically when read, this one sticks until cleared by
+ the user. This is done by writing 0 to the file. Writing
+ other values is unsupported.
+
+What: /sys/class/hwmon/hwmonX/intrusionY_beep
+Description:
+ Chassis intrusion beep
+
+ - 0: disable
+ - 1: enable
+
+ RW
+
+What: /sys/class/hwmon/hwmonX/device/pec
+Description:
+ PEC support on I2C devices
+
+ - 0, off, n: disable
+ - 1, on, y: enable
+
+ RW
diff --git a/Documentation/ABI/testing/sysfs-class-mei b/Documentation/ABI/testing/sysfs-class-mei
index 5c52372b43cb..1db36ddf8e58 100644
--- a/Documentation/ABI/testing/sysfs-class-mei
+++ b/Documentation/ABI/testing/sysfs-class-mei
@@ -6,7 +6,7 @@ Description:
The mei/ class sub-directory belongs to mei device class
-What: /sys/class/mei/meiN/
+What: /sys/class/mei/mei<N>/
Date: May 2014
KernelVersion: 3.17
Contact: Tomas Winkler <tomas.winkler@intel.com>
@@ -14,7 +14,7 @@ Description:
The /sys/class/mei/meiN directory is created for
each probed mei device
-What: /sys/class/mei/meiN/fw_status
+What: /sys/class/mei/mei<N>/fw_status
Date: Nov 2014
KernelVersion: 3.19
Contact: Tomas Winkler <tomas.winkler@intel.com>
@@ -29,7 +29,7 @@ Description: Display fw status registers content
Also number of registers varies between 1 and 6
depending on generation.
-What: /sys/class/mei/meiN/hbm_ver
+What: /sys/class/mei/mei<N>/hbm_ver
Date: Aug 2016
KernelVersion: 4.9
Contact: Tomas Winkler <tomas.winkler@intel.com>
@@ -38,7 +38,7 @@ Description: Display the negotiated HBM protocol version.
The HBM protocol version negotiated
between the driver and the device.
-What: /sys/class/mei/meiN/hbm_ver_drv
+What: /sys/class/mei/mei<N>/hbm_ver_drv
Date: Aug 2016
KernelVersion: 4.9
Contact: Tomas Winkler <tomas.winkler@intel.com>
@@ -46,7 +46,7 @@ Description: Display the driver HBM protocol version.
The HBM protocol version supported by the driver.
-What: /sys/class/mei/meiN/tx_queue_limit
+What: /sys/class/mei/mei<N>/tx_queue_limit
Date: Jan 2018
KernelVersion: 4.16
Contact: Tomas Winkler <tomas.winkler@intel.com>
@@ -55,7 +55,7 @@ Description: Configure tx queue limit
Set maximal number of pending writes
per opened session.
-What: /sys/class/mei/meiN/fw_ver
+What: /sys/class/mei/mei<N>/fw_ver
Date: May 2018
KernelVersion: 4.18
Contact: Tomas Winkler <tomas.winkler@intel.com>
@@ -66,7 +66,7 @@ Description: Display the ME firmware version.
There can be up to three such blocks for different
FW components.
-What: /sys/class/mei/meiN/dev_state
+What: /sys/class/mei/mei<N>/dev_state
Date: Mar 2019
KernelVersion: 5.1
Contact: Tomas Winkler <tomas.winkler@intel.com>
@@ -81,7 +81,7 @@ Description: Display the ME device state.
POWER_DOWN
POWER_UP
-What: /sys/class/mei/meiN/trc
+What: /sys/class/mei/mei<N>/trc
Date: Nov 2019
KernelVersion: 5.5
Contact: Tomas Winkler <tomas.winkler@intel.com>
@@ -91,7 +91,7 @@ Description: Display trc status register content
status information into trc status register
for BIOS and OS to monitor fw health.
-What: /sys/class/mei/meiN/kind
+What: /sys/class/mei/mei<N>/kind
Date: Jul 2020
KernelVersion: 5.8
Contact: Tomas Winkler <tomas.winkler@intel.com>
diff --git a/Documentation/ABI/testing/sysfs-class-mic b/Documentation/ABI/testing/sysfs-class-mic
index bd0e780c3760..5e5f36d10055 100644
--- a/Documentation/ABI/testing/sysfs-class-mic
+++ b/Documentation/ABI/testing/sysfs-class-mic
@@ -8,7 +8,7 @@ Description:
PCIe form factor add-in Coprocessor card based on the Intel Many
Integrated Core (MIC) architecture that runs a Linux OS.
-What: /sys/class/mic/mic(x)
+What: /sys/class/mic/mic<X>
Date: October 2013
KernelVersion: 3.13
Contact: Sudeep Dutt <sudeep.dutt@intel.com>
@@ -17,7 +17,7 @@ Description:
represent MIC devices (0,1,..etc). Each directory has
information specific to that MIC device.
-What: /sys/class/mic/mic(x)/family
+What: /sys/class/mic/mic<X>/family
Date: October 2013
KernelVersion: 3.13
Contact: Sudeep Dutt <sudeep.dutt@intel.com>
@@ -25,7 +25,7 @@ Description:
Provides information about the Coprocessor family for an Intel
MIC device. For example - "x100"
-What: /sys/class/mic/mic(x)/stepping
+What: /sys/class/mic/mic<X>/stepping
Date: October 2013
KernelVersion: 3.13
Contact: Sudeep Dutt <sudeep.dutt@intel.com>
@@ -33,7 +33,7 @@ Description:
Provides information about the silicon stepping for an Intel
MIC device. For example - "A0" or "B0"
-What: /sys/class/mic/mic(x)/state
+What: /sys/class/mic/mic<X>/state
Date: October 2013
KernelVersion: 3.13
Contact: Sudeep Dutt <sudeep.dutt@intel.com>
@@ -69,7 +69,7 @@ Description:
"shutdown" Initiates card OS shutdown.
========== ===================================================
-What: /sys/class/mic/mic(x)/shutdown_status
+What: /sys/class/mic/mic<X>/shutdown_status
Date: October 2013
KernelVersion: 3.13
Contact: Sudeep Dutt <sudeep.dutt@intel.com>
@@ -88,7 +88,7 @@ Description:
"restart" Shutdown because of a restart command.
========== ===================================================
-What: /sys/class/mic/mic(x)/cmdline
+What: /sys/class/mic/mic<X>/cmdline
Date: October 2013
KernelVersion: 3.13
Contact: Sudeep Dutt <sudeep.dutt@intel.com>
@@ -104,7 +104,7 @@ Description:
or modify existing ones and then write the whole kernel command
line back to this entry.
-What: /sys/class/mic/mic(x)/firmware
+What: /sys/class/mic/mic<X>/firmware
Date: October 2013
KernelVersion: 3.13
Contact: Sudeep Dutt <sudeep.dutt@intel.com>
@@ -114,7 +114,7 @@ Description:
card can be found. The entry can be written to change the
firmware image location under /lib/firmware/.
-What: /sys/class/mic/mic(x)/ramdisk
+What: /sys/class/mic/mic<X>/ramdisk
Date: October 2013
KernelVersion: 3.13
Contact: Sudeep Dutt <sudeep.dutt@intel.com>
@@ -124,7 +124,7 @@ Description:
OS boot can be found. The entry can be written to change
the ramdisk image location under /lib/firmware/.
-What: /sys/class/mic/mic(x)/bootmode
+What: /sys/class/mic/mic<X>/bootmode
Date: October 2013
KernelVersion: 3.13
Contact: Sudeep Dutt <sudeep.dutt@intel.com>
@@ -135,7 +135,7 @@ Description:
a) linux - Boot a Linux image.
b) flash - Boot an image for flash updates.
-What: /sys/class/mic/mic(x)/log_buf_addr
+What: /sys/class/mic/mic<X>/log_buf_addr
Date: October 2013
KernelVersion: 3.13
Contact: Sudeep Dutt <sudeep.dutt@intel.com>
@@ -149,7 +149,7 @@ Description:
log buffer address to be written can be found in the System.map
file of the card OS.
-What: /sys/class/mic/mic(x)/log_buf_len
+What: /sys/class/mic/mic<X>/log_buf_len
Date: October 2013
KernelVersion: 3.13
Contact: Sudeep Dutt <sudeep.dutt@intel.com>
@@ -163,7 +163,7 @@ Description:
buffer length address to be written can be found in the
System.map file of the card OS.
-What: /sys/class/mic/mic(x)/heartbeat_enable
+What: /sys/class/mic/mic<X>/heartbeat_enable
Date: March 2015
KernelVersion: 4.4
Contact: Ashutosh Dixit <ashutosh.dixit@intel.com>
diff --git a/Documentation/ABI/testing/sysfs-class-mux b/Documentation/ABI/testing/sysfs-class-mux
index 8715f9c7bd4f..c58b7b6e1aa6 100644
--- a/Documentation/ABI/testing/sysfs-class-mux
+++ b/Documentation/ABI/testing/sysfs-class-mux
@@ -7,7 +7,7 @@ Description:
Framework and provides a sysfs interface for using MUX
controllers.
-What: /sys/class/mux/muxchipN/
+What: /sys/class/mux/muxchip<N>/
Date: April 2017
KernelVersion: 4.13
Contact: Peter Rosin <peda@axentia.se>
diff --git a/Documentation/ABI/testing/sysfs-class-net-peak_usb b/Documentation/ABI/testing/sysfs-class-net-peak_usb
new file mode 100644
index 000000000000..9e3d0bf4d4b2
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-net-peak_usb
@@ -0,0 +1,19 @@
+
+What: /sys/class/net/<iface>/peak_usb/can_channel_id
+Date: November 2022
+KernelVersion: 6.2
+Contact: Stephane Grosjean <s.grosjean@peak-system.com>
+Description:
+ PEAK PCAN-USB devices support user-configurable CAN channel
+ identifiers. Contrary to a USB serial number, these identifiers
+ are writable and can be set per CAN interface. This means that
+ if a USB device exports multiple CAN interfaces, each of them
+ can be assigned a unique channel ID.
+ This attribute provides read-only access to the currently
+ configured value of the channel identifier. Depending on the
+	device type, the identifier is 8 or 32 bits long. The value
+	read from this attribute is always an 8-digit 32-bit
+	hexadecimal value in big-endian format. If the device only
+	supports an 8-bit identifier, the upper 24 bits of the value
+	are set to zero.
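+
+	A read sketch (interface name and value are invented)::
+
+	  cat /sys/class/net/can0/peak_usb/can_channel_id
+	  0000002a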
+
diff --git a/Documentation/ABI/testing/sysfs-class-power b/Documentation/ABI/testing/sysfs-class-power
index ca830c6cd809..7c81f0a25a48 100644
--- a/Documentation/ABI/testing/sysfs-class-power
+++ b/Documentation/ABI/testing/sysfs-class-power
@@ -364,7 +364,10 @@ Date: April 2019
Contact: linux-pm@vger.kernel.org
Description:
Represents a battery percentage level, above which charging will
- stop.
+ stop. Not all hardware is capable of setting this to an arbitrary
+ percentage. Drivers will round written values to the nearest
+ supported value. Reading back the value will show the actual
+ threshold set by the driver.
Access: Read, Write
@@ -380,13 +383,17 @@ Description:
algorithm to adjust the charge rate dynamically, without
any user configuration required. "Custom" means that the charger
uses the charge_control_* properties as configuration for some
- different algorithm.
+ different algorithm. "Long Life" means the charger reduces its
+ charging rate in order to prolong the battery health. "Bypass"
+ means the charger bypasses the charging path around the
+ integrated converter allowing for a "smart" wall adaptor to
+ perform the power conversion externally.
Access: Read, Write
Valid values:
"Unknown", "N/A", "Trickle", "Fast", "Standard",
- "Adaptive", "Custom"
+ "Adaptive", "Custom", "Long Life", "Bypass"
What: /sys/class/power_supply/<supply_name>/charge_term_current
Date: July 2014
@@ -413,7 +420,7 @@ Description:
"Over voltage", "Unspecified failure", "Cold",
"Watchdog timer expire", "Safety timer expire",
"Over current", "Calibration required", "Warm",
- "Cool", "Hot"
+ "Cool", "Hot", "No battery"
What: /sys/class/power_supply/<supply_name>/precharge_current
Date: June 2017
@@ -430,7 +437,8 @@ What: /sys/class/power_supply/<supply_name>/present
Date: May 2007
Contact: linux-pm@vger.kernel.org
Description:
- Reports whether a battery is present or not in the system.
+ Reports whether a battery is present or not in the system. If the
+ property does not exist, the battery is considered to be present.
Access: Read
@@ -455,6 +463,21 @@ Description:
"Unknown", "Charging", "Discharging",
"Not charging", "Full"
+What: /sys/class/power_supply/<supply_name>/charge_behaviour
+Date: November 2021
+Contact: linux-pm@vger.kernel.org
+Description:
+ Represents the charging behaviour.
+
+ Access: Read, Write
+
+ Valid values:
+ ================ ====================================
+ auto: Charge normally, respect thresholds
+ inhibit-charge: Do not charge while AC is attached
+ force-discharge: Force discharge while AC is attached
+ ================ ====================================
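+
+	For example (supply name illustrative; the active value is
+	typically shown in square brackets when read)::
+
+	  cat /sys/class/power_supply/BAT0/charge_behaviour
+	  [auto] inhibit-charge force-discharge
+	  echo force-discharge > /sys/class/power_supply/BAT0/charge_behaviour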
+
What: /sys/class/power_supply/<supply_name>/technology
Date: May 2007
Contact: linux-pm@vger.kernel.org
@@ -480,6 +503,19 @@ Description:
Valid values: Represented in microvolts
+What: /sys/class/power_supply/<supply_name>/cycle_count
+Date: January 2010
+Contact: linux-pm@vger.kernel.org
+Description:
+ Reports the number of full charge + discharge cycles the
+ battery has undergone.
+
+ Access: Read
+
+ Valid values:
+ Integer > 0: representing full cycles
+ Integer = 0: cycle_count info is not available
+
**USB Properties**
What: /sys/class/power_supply/<supply_name>/input_current_limit
diff --git a/Documentation/ABI/testing/sysfs-class-power-rt9467 b/Documentation/ABI/testing/sysfs-class-power-rt9467
new file mode 100644
index 000000000000..619b7c45d145
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-power-rt9467
@@ -0,0 +1,19 @@
+What: /sys/class/power_supply/rt9467-*/sysoff_enable
+Date: Feb 2023
+KernelVersion: 6.3
+Contact: ChiaEn Wu <chiaen_wu@richtek.com>
+Description:
+ This entry allows enabling the sysoff mode of rt9467 charger
+ devices.
+ If enabled and the input is removed, the internal battery FET
+		the device datasheet for details. It is commonly used when the
+		product enters the shipping stage. After entering shipping
+		mode, only a 'VBUS' insertion or a 'Power key' press can make
+		the device leave this mode. Disabling it again can also leave
+		this mode, but that rather aborts the action before the device
+		has really entered shipping mode.
+ abort the action before the device really enter shipping mode.
+
+ Access: Read, Write
+ Valid values:
+ - 1: enabled
+ - 0: disabled
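+
+	For example, to arm shipping mode once the product is ready to
+	ship (the supply directory name is illustrative)::
+
+	  echo 1 > /sys/class/power_supply/rt9467-chg/sysoff_enable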
diff --git a/Documentation/ABI/testing/sysfs-class-power-rt9471 b/Documentation/ABI/testing/sysfs-class-power-rt9471
new file mode 100644
index 000000000000..0a390ee5ac21
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-power-rt9471
@@ -0,0 +1,32 @@
+What: /sys/class/power_supply/rt9471-*/sysoff_enable
+Date: Feb 2023
+KernelVersion: 6.3
+Contact: ChiYuan Huang <cy_huang@richtek.com>
+Description:
+ This entry allows enabling the sysoff mode of rt9471 charger devices.
+ If enabled and the input is removed, the internal battery FET is turned
+ off to reduce the leakage from the BAT pin. See device datasheet for details.
+		It is commonly used when the product enters the shipping stage.
+		After entering shipping mode, only a 'VBUS' insertion or a
+		'Power key' press can make the device leave this mode. Disabling
+		it again can also leave this mode, but that rather aborts the
+		action before the device has really entered shipping mode.
+
+ Access: Read, Write
+ Valid values:
+ - 1: enabled
+ - 0: disabled
+
+What: /sys/class/power_supply/rt9471-*/port_detect_enable
+Date: Feb 2023
+KernelVersion: 6.3
+Contact: ChiYuan Huang <cy_huang@richtek.com>
+Description:
+		This entry allows enabling the USB BC1.2 port detection function
+		of rt9471 charger devices. If enabled and VBUS is inserted, the
+		device will start BC1.2 port detection and report the USB port
+		type once detection is done. See the datasheet for details. This
+		is normally controlled when a Type-C/USB-PD port is integrated.
+
+ Access: Read, Write
+ Valid values:
+ - 1: enabled
+ - 0: disabled
diff --git a/Documentation/ABI/testing/sysfs-class-pwm b/Documentation/ABI/testing/sysfs-class-pwm
index c20e61354561..0638c94d01ef 100644
--- a/Documentation/ABI/testing/sysfs-class-pwm
+++ b/Documentation/ABI/testing/sysfs-class-pwm
@@ -7,7 +7,7 @@ Description:
Framework and provides a sysfs interface for using PWM
channels.
-What: /sys/class/pwm/pwmchipN/
+What: /sys/class/pwm/pwmchip<N>/
Date: May 2013
KernelVersion: 3.11
Contact: H Hartley Sweeten <hsweeten@visionengravers.com>
@@ -16,14 +16,14 @@ Description:
probed PWM controller/chip where N is the base of the
PWM chip.
-What: /sys/class/pwm/pwmchipN/npwm
+What: /sys/class/pwm/pwmchip<N>/npwm
Date: May 2013
KernelVersion: 3.11
Contact: H Hartley Sweeten <hsweeten@visionengravers.com>
Description:
The number of PWM channels supported by the PWM chip.
-What: /sys/class/pwm/pwmchipN/export
+What: /sys/class/pwm/pwmchip<N>/export
Date: May 2013
KernelVersion: 3.11
Contact: H Hartley Sweeten <hsweeten@visionengravers.com>
@@ -31,14 +31,14 @@ Description:
Exports a PWM channel from the PWM chip for sysfs control.
Value is between 0 and /sys/class/pwm/pwmchipN/npwm - 1.
-What: /sys/class/pwm/pwmchipN/unexport
+What: /sys/class/pwm/pwmchip<N>/unexport
Date: May 2013
KernelVersion: 3.11
Contact: H Hartley Sweeten <hsweeten@visionengravers.com>
Description:
Unexports a PWM channel.
-What: /sys/class/pwm/pwmchipN/pwmX
+What: /sys/class/pwm/pwmchip<N>/pwmX
Date: May 2013
KernelVersion: 3.11
Contact: H Hartley Sweeten <hsweeten@visionengravers.com>
@@ -47,21 +47,21 @@ Description:
each exported PWM channel where X is the exported PWM
channel number.
-What: /sys/class/pwm/pwmchipN/pwmX/period
+What: /sys/class/pwm/pwmchip<N>/pwmX/period
Date: May 2013
KernelVersion: 3.11
Contact: H Hartley Sweeten <hsweeten@visionengravers.com>
Description:
Sets the PWM signal period in nanoseconds.
-What: /sys/class/pwm/pwmchipN/pwmX/duty_cycle
+What: /sys/class/pwm/pwmchip<N>/pwmX/duty_cycle
Date: May 2013
KernelVersion: 3.11
Contact: H Hartley Sweeten <hsweeten@visionengravers.com>
Description:
Sets the PWM signal duty cycle in nanoseconds.
-What: /sys/class/pwm/pwmchipN/pwmX/polarity
+What: /sys/class/pwm/pwmchip<N>/pwmX/polarity
Date: May 2013
KernelVersion: 3.11
Contact: H Hartley Sweeten <hsweeten@visionengravers.com>
@@ -69,7 +69,7 @@ Description:
Sets the output polarity of the PWM signal to "normal" or
"inversed".
-What: /sys/class/pwm/pwmchipN/pwmX/enable
+What: /sys/class/pwm/pwmchip<N>/pwmX/enable
Date: May 2013
KernelVersion: 3.11
Contact: H Hartley Sweeten <hsweeten@visionengravers.com>
@@ -78,10 +78,10 @@ Description:
0 is disabled
1 is enabled
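+
+		A minimal sketch driving channel 0 of chip 0 at 1 kHz with a
+		50% duty cycle (chip and channel numbers illustrative)::
+
+		  echo 0 > /sys/class/pwm/pwmchip0/export
+		  echo 1000000 > /sys/class/pwm/pwmchip0/pwm0/period
+		  echo 500000 > /sys/class/pwm/pwmchip0/pwm0/duty_cycle
+		  echo 1 > /sys/class/pwm/pwmchip0/pwm0/enable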
-What: /sys/class/pwm/pwmchipN/pwmX/capture
+What: /sys/class/pwm/pwmchip<N>/pwmX/capture
Date: June 2016
KernelVersion: 4.8
-Contact: Lee Jones <lee.jones@linaro.org>
+Contact: Lee Jones <lee@kernel.org>
Description:
	Capture information about a PWM signal. The output format is a
	pair of unsigned integers (period and duty cycle), separated by a
	single space.
diff --git a/Documentation/ABI/testing/sysfs-class-rapidio b/Documentation/ABI/testing/sysfs-class-rapidio
index 19aefb21b639..81e09145525a 100644
--- a/Documentation/ABI/testing/sysfs-class-rapidio
+++ b/Documentation/ABI/testing/sysfs-class-rapidio
@@ -10,7 +10,7 @@ Description:
NOTE: An mport ID is not a RapidIO destination ID assigned to a
given local mport device.
-What: /sys/class/rapidio_port/rapidioN/sys_size
+What: /sys/class/rapidio_port/rapidio<N>/sys_size
Date: Apr, 2014
KernelVersion: v3.15
Contact: Matt Porter <mporter@kernel.crashing.org>,
@@ -22,7 +22,7 @@ Description:
1 = large (16-bit destination ID, max. 65536 devices).
-What: /sys/class/rapidio_port/rapidioN/port_destid
+What: /sys/class/rapidio_port/rapidio<N>/port_destid
Date: Apr, 2014
KernelVersion: v3.15
Contact: Matt Porter <mporter@kernel.crashing.org>,
diff --git a/Documentation/ABI/testing/sysfs-class-rc b/Documentation/ABI/testing/sysfs-class-rc
index 9c8ff7910858..84e46d70d82b 100644
--- a/Documentation/ABI/testing/sysfs-class-rc
+++ b/Documentation/ABI/testing/sysfs-class-rc
@@ -7,7 +7,7 @@ Description:
core and provides a sysfs interface for configuring infrared
remote controller receivers.
-What: /sys/class/rc/rcN/
+What: /sys/class/rc/rc<N>/
Date: Apr 2010
KernelVersion: 2.6.35
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
@@ -15,7 +15,7 @@ Description:
A /sys/class/rc/rcN directory is created for each remote
control receiver device where N is the number of the receiver.
-What: /sys/class/rc/rcN/protocols
+What: /sys/class/rc/rc<N>/protocols
Date: Jun 2010
KernelVersion: 2.6.36
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
@@ -40,7 +40,7 @@ Description:
Write fails with EINVAL if an invalid protocol combination or
unknown protocol name is used.
-What: /sys/class/rc/rcN/filter
+What: /sys/class/rc/rc<N>/filter
Date: Jan 2014
KernelVersion: 3.15
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
@@ -55,7 +55,7 @@ Description:
This value may be reset to 0 if the current protocol is altered.
-What: /sys/class/rc/rcN/filter_mask
+What: /sys/class/rc/rc<N>/filter_mask
Date: Jan 2014
KernelVersion: 3.15
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
@@ -72,7 +72,7 @@ Description:
This value may be reset to 0 if the current protocol is altered.
-What: /sys/class/rc/rcN/wakeup_protocols
+What: /sys/class/rc/rc<N>/wakeup_protocols
Date: Feb 2017
KernelVersion: 4.11
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
@@ -98,7 +98,7 @@ Description:
unknown protocol name is used, or if wakeup is not supported by
the hardware.
-What: /sys/class/rc/rcN/wakeup_filter
+What: /sys/class/rc/rc<N>/wakeup_filter
Date: Jan 2014
KernelVersion: 3.15
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
@@ -117,7 +117,7 @@ Description:
This value may be reset to 0 if the wakeup protocol is altered.
-What: /sys/class/rc/rcN/wakeup_filter_mask
+What: /sys/class/rc/rc<N>/wakeup_filter_mask
Date: Jan 2014
KernelVersion: 3.15
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
diff --git a/Documentation/ABI/testing/sysfs-class-rc-nuvoton b/Documentation/ABI/testing/sysfs-class-rc-nuvoton
index d3abe45f8690..f7bad8ecd08f 100644
--- a/Documentation/ABI/testing/sysfs-class-rc-nuvoton
+++ b/Documentation/ABI/testing/sysfs-class-rc-nuvoton
@@ -1,4 +1,4 @@
-What: /sys/class/rc/rcN/wakeup_data
+What: /sys/class/rc/rc<N>/wakeup_data
Date: Mar 2016
KernelVersion: 4.6
Contact: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
diff --git a/Documentation/ABI/testing/sysfs-class-regulator b/Documentation/ABI/testing/sysfs-class-regulator
index 8516f08806dd..475b9a372657 100644
--- a/Documentation/ABI/testing/sysfs-class-regulator
+++ b/Documentation/ABI/testing/sysfs-class-regulator
@@ -370,3 +370,84 @@ Description:
'unknown' means software cannot determine the state, or
the reported state is invalid.
+
+What: /sys/class/regulator/.../under_voltage
+Date: April 2022
+KernelVersion: 5.18
+Contact: Zev Weiss <zev@bewilderbeest.net>
+Description:
+ Some regulator directories will contain a field called
+ under_voltage. This indicates if the device reports an
+ under-voltage fault (1) or not (0).
+
+What: /sys/class/regulator/.../over_current
+Date: April 2022
+KernelVersion: 5.18
+Contact: Zev Weiss <zev@bewilderbeest.net>
+Description:
+ Some regulator directories will contain a field called
+ over_current. This indicates if the device reports an
+ over-current fault (1) or not (0).
+
+What: /sys/class/regulator/.../regulation_out
+Date: April 2022
+KernelVersion: 5.18
+Contact: Zev Weiss <zev@bewilderbeest.net>
+Description:
+ Some regulator directories will contain a field called
+ regulation_out. This indicates if the device reports an
+ out-of-regulation fault (1) or not (0).
+
+What: /sys/class/regulator/.../fail
+Date: April 2022
+KernelVersion: 5.18
+Contact: Zev Weiss <zev@bewilderbeest.net>
+Description:
+ Some regulator directories will contain a field called
+ fail. This indicates if the device reports an output failure
+ (1) or not (0).
+
+What: /sys/class/regulator/.../over_temp
+Date: April 2022
+KernelVersion: 5.18
+Contact: Zev Weiss <zev@bewilderbeest.net>
+Description:
+ Some regulator directories will contain a field called
+ over_temp. This indicates if the device reports an
+ over-temperature fault (1) or not (0).
+
+What: /sys/class/regulator/.../under_voltage_warn
+Date: April 2022
+KernelVersion: 5.18
+Contact: Zev Weiss <zev@bewilderbeest.net>
+Description:
+ Some regulator directories will contain a field called
+ under_voltage_warn. This indicates if the device reports an
+ under-voltage warning (1) or not (0).
+
+What: /sys/class/regulator/.../over_current_warn
+Date: April 2022
+KernelVersion: 5.18
+Contact: Zev Weiss <zev@bewilderbeest.net>
+Description:
+ Some regulator directories will contain a field called
+ over_current_warn. This indicates if the device reports an
+ over-current warning (1) or not (0).
+
+What: /sys/class/regulator/.../over_voltage_warn
+Date: April 2022
+KernelVersion: 5.18
+Contact: Zev Weiss <zev@bewilderbeest.net>
+Description:
+ Some regulator directories will contain a field called
+ over_voltage_warn. This indicates if the device reports an
+ over-voltage warning (1) or not (0).
+
+What: /sys/class/regulator/.../over_temp_warn
+Date: April 2022
+KernelVersion: 5.18
+Contact: Zev Weiss <zev@bewilderbeest.net>
+Description:
+ Some regulator directories will contain a field called
+ over_temp_warn. This indicates if the device reports an
+ over-temperature warning (1) or not (0).
diff --git a/Documentation/ABI/testing/sysfs-class-rtrs-client b/Documentation/ABI/testing/sysfs-class-rtrs-client
index 49a4157c7bf1..fecc59d1b96f 100644
--- a/Documentation/ABI/testing/sysfs-class-rtrs-client
+++ b/Documentation/ABI/testing/sysfs-class-rtrs-client
@@ -78,7 +78,7 @@ What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/hca_name
Date: Feb 2020
KernelVersion: 5.7
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
-Description: RO, Contains the the name of HCA the connection established on.
+Description: RO, Contains the name of HCA the connection established on.
What: /sys/class/rtrs-client/<session-name>/paths/<src@dst>/hca_port
Date: Feb 2020
diff --git a/Documentation/ABI/testing/sysfs-class-rtrs-server b/Documentation/ABI/testing/sysfs-class-rtrs-server
index 3b6d5b067df0..b08601d80409 100644
--- a/Documentation/ABI/testing/sysfs-class-rtrs-server
+++ b/Documentation/ABI/testing/sysfs-class-rtrs-server
@@ -24,7 +24,7 @@ What: /sys/class/rtrs-server/<session-name>/paths/<src@dst>/hca_name
Date: Feb 2020
KernelVersion: 5.7
Contact: Jack Wang <jinpu.wang@cloud.ionos.com> Danil Kipnis <danil.kipnis@cloud.ionos.com>
-Description: RO, Contains the the name of HCA the connection established on.
+Description: RO, Contains the name of HCA the connection established on.
What: /sys/class/rtrs-server/<session-name>/paths/<src@dst>/hca_port
Date: Feb 2020
diff --git a/Documentation/ABI/testing/sysfs-class-thermal b/Documentation/ABI/testing/sysfs-class-thermal
new file mode 100644
index 000000000000..8eee37982b2a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-thermal
@@ -0,0 +1,259 @@
+What: /sys/class/thermal/thermal_zoneX/type
+Description:
+ Strings which represent the thermal zone type.
+ This is given by thermal zone driver as part of registration.
+		E.g. "acpitz" indicates it is an ACPI thermal device.
+		In order to keep it consistent with the hwmon sysfs attributes,
+		this should be a short, lowercase string, not containing
+		spaces or dashes.
+
+ RO, Required
+
+What: /sys/class/thermal/thermal_zoneX/temp
+Description:
+ Current temperature as reported by thermal zone (sensor).
+
+ Unit: millidegree Celsius
+
+ RO, Required
+
+What: /sys/class/thermal/thermal_zoneX/mode
+Description:
+ One of the predefined values in [enabled, disabled].
+ This file gives information about the algorithm that is
+		currently managing the thermal zone. It can be either the
+		default kernel-based algorithm or a user space application.
+
+		enabled
+			Enable kernel thermal management.
+		disabled
+			Prevent kernel thermal zone driver actions upon
+			trip points so that a user space application can
+			take full charge of the thermal management.
+
+ RW, Optional
+
+What: /sys/class/thermal/thermal_zoneX/policy
+Description:
+ One of the various thermal governors used for a particular zone.
+
+ RW, Required
+
+What: /sys/class/thermal/thermal_zoneX/available_policies
+Description:
+ Available thermal governors which can be used for a
+ particular zone.
+
+ RO, Required
+
+What: /sys/class/thermal/thermal_zoneX/trip_point_Y_temp
+Description:
+		The temperature above which the trip point will be fired.
+
+ Unit: millidegree Celsius
+
+ RO, Optional
+
+What: /sys/class/thermal/thermal_zoneX/trip_point_Y_type
+Description:
+ Strings which indicate the type of the trip point.
+
+ E.g. it can be one of critical, hot, passive, `active[0-*]`
+ for ACPI thermal zone.
+
+ RO, Optional
+
+What: /sys/class/thermal/thermal_zoneX/trip_point_Y_hyst
+Description:
+ The hysteresis value for a trip point, represented as an
+ integer.
+
+ Unit: Celsius
+
+ RW, Optional
+
+What: /sys/class/thermal/thermal_zoneX/cdevY
+Description:
+		Sysfs link to the thermal cooling device node, where the sysfs
+		interface for cooling device throttling control resides.
+
+ RO, Optional
+
+What: /sys/class/thermal/thermal_zoneX/cdevY_trip_point
+Description:
+ The trip point in this thermal zone which `cdev[0-*]` is
+ associated with; -1 means the cooling device is not
+ associated with any trip point.
+
+ RO, Optional
+
+What: /sys/class/thermal/thermal_zoneX/cdevY_weight
+Description:
+ The influence of `cdev[0-*]` in this thermal zone. This value
+		is relative to the rest of the cooling devices in the thermal
+		zone. For example, if a cooling device has a weight double
+		that of another, it is twice as effective in cooling the
+ thermal zone.
+
+ RW, Optional
+
+What: /sys/class/thermal/thermal_zoneX/emul_temp
+Description:
+ Interface to set the emulated temperature of a thermal zone
+ (sensor). After setting this temperature, the thermal zone
+ may pass it to the platform emulation function, if one is
+ registered, or cache it locally. This is useful for debugging
+ different temperature thresholds and their associated cooling
+ actions. This is a write-only node, and writing 0 to it
+ disables emulation.
+
+ Unit: millidegree Celsius
+
+ WO, Optional
+
+ WARNING:
+ Be careful while enabling this option on production systems,
+ because userland can easily disable the thermal policy by simply
+ flooding this sysfs node with low temperature values.
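+
+ As a sketch, a debugging session could emulate an
+ over-temperature condition and then disable emulation again
+ (zone number and value illustrative)::
+
+   # echo 95000 > /sys/class/thermal/thermal_zone0/emul_temp
+   # echo 0 > /sys/class/thermal/thermal_zone0/emul_temp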
+
+
+What: /sys/class/thermal/thermal_zoneX/k_d
+Description:
+ The derivative term of the power allocator governor's PID
+ controller. For more information see
+ Documentation/driver-api/thermal/power_allocator.rst
+
+ RW, Optional
+
+What: /sys/class/thermal/thermal_zoneX/k_i
+Description:
+ The integral term of the power allocator governor's PID
+ controller. This term allows the PID controller to compensate
+ for long term drift. For more information see
+ Documentation/driver-api/thermal/power_allocator.rst
+
+ RW, Optional
+
+What: /sys/class/thermal/thermal_zoneX/k_po
+Description:
+ The proportional term of the power allocator governor's PID
+ controller during temperature overshoot. Temperature overshoot
+ is when the current temperature is above the "desired
+ temperature" trip point. For more information see
+ Documentation/driver-api/thermal/power_allocator.rst
+
+ RW, Optional
+
+What: /sys/class/thermal/thermal_zoneX/k_pu
+Description:
+ The proportional term of the power allocator governor's PID
+ controller during temperature undershoot. Temperature undershoot
+ is when the current temperature is below the "desired
+ temperature" trip point. For more information see
+ Documentation/driver-api/thermal/power_allocator.rst
+
+ RW, Optional
+
+What: /sys/class/thermal/thermal_zoneX/integral_cutoff
+Description:
+ Temperature offset from the desired temperature trip point
+ above which the integral term of the power allocator
+ governor's PID controller starts accumulating errors. For
+ example, if integral_cutoff is 0, then the integral term only
+ accumulates error when temperature is above the desired
+ temperature trip point. For more information see
+ Documentation/driver-api/thermal/power_allocator.rst
+
+ Unit: millidegree Celsius
+
+ RW, Optional
+
+What: /sys/class/thermal/thermal_zoneX/slope
+Description:
+ The slope constant used in a linear extrapolation model
+ to determine a hotspot temperature based off the sensor's
+ raw readings. It is up to the device driver to determine
+ the usage of these values.
+
+ RW, Optional
+
+What: /sys/class/thermal/thermal_zoneX/offset
+Description:
+ The offset constant used in a linear extrapolation model
+ to determine a hotspot temperature based off the sensor's
+ raw readings. It is up to the device driver to determine
+ the usage of these values.
+
+ RW, Optional
+
+What: /sys/class/thermal/thermal_zoneX/sustainable_power
+Description:
+ An estimate of the sustained power that can be dissipated by
+ the thermal zone. Used by the power allocator governor. For
+ more information see
+ Documentation/driver-api/thermal/power_allocator.rst
+
+ Unit: milliwatts
+
+ RW, Optional
+
+What: /sys/class/thermal/cooling_deviceX/type
+Description:
+ String which represents the type of device, e.g:
+
+ - for generic ACPI: should be "Fan", "Processor" or "LCD"
+ - for memory controller device on intel_menlow platform:
+ should be "Memory controller".
+
+ RO, Required
+
+What: /sys/class/thermal/cooling_deviceX/max_state
+Description:
+ The maximum permissible cooling state of this cooling device.
+
+ RO, Required
+
+What: /sys/class/thermal/cooling_deviceX/cur_state
+Description:
+ The current cooling state of this cooling device.
+ The value can be any integer between 0 and max_state:
+
+ - cur_state == 0 means no cooling
+ - cur_state == max_state means the maximum cooling.
+
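+ For instance, forcing a cooling device to its maximum state
+ (device number and value illustrative)::
+
+   # cat /sys/class/thermal/cooling_device0/max_state
+   3
+   # echo 3 > /sys/class/thermal/cooling_device0/cur_state
+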
+ RW, Required
+
+What: /sys/class/thermal/cooling_deviceX/stats/reset
+Description:
+ Writing any value resets the cooling device's statistics.
+
+ WO, Required
+
+What: /sys/class/thermal/cooling_deviceX/stats/time_in_state_ms
+Description:
+ The amount of time spent by the cooling device in various
+ cooling states. The output has one "<state> <time>" pair
+ per line, meaning this cooling device spent <time> msec of
+ time at <state>.
+
+ Output will have one line for each of the supported states.
+
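+ For example, a hypothetical 3-state device might show
+ (values illustrative)::
+
+   # cat /sys/class/thermal/cooling_device0/stats/time_in_state_ms
+   state0 1410
+   state1 220
+   state2 40
+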
+ RO, Required
+
+What: /sys/class/thermal/cooling_deviceX/stats/total_trans
+Description:
+ A single positive value showing the total number of times
+ the state of a cooling device is changed.
+
+ RO, Required
+
+What: /sys/class/thermal/cooling_deviceX/stats/trans_table
+Description:
+ This gives fine-grained information about all the cooling
+ state transitions. The cat output here is a two-dimensional matrix,
+ where an entry <i,j> (row i, column j) represents the number
+ of transitions from State_i to State_j. If the transition
+ table is bigger than PAGE_SIZE, reading this will return
+ an -EFBIG error.
+
+ RO, Required
diff --git a/Documentation/ABI/testing/sysfs-class-typec b/Documentation/ABI/testing/sysfs-class-typec
index 40122d915ae1..281b995beb05 100644
--- a/Documentation/ABI/testing/sysfs-class-typec
+++ b/Documentation/ABI/testing/sysfs-class-typec
@@ -141,6 +141,14 @@ Description:
- "reverse": CC2 orientation
- "unknown": Orientation cannot be determined.
+What: /sys/class/typec/<port>/select_usb_power_delivery
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Lists the USB Power Delivery Capabilities that the port can
+ advertise to the partner. The currently used capabilities are in
+ brackets. Selection happens by writing to the file.
+
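+ A usage sketch (port name and listed capability names are
+ illustrative)::
+
+   # cat /sys/class/typec/port0/select_usb_power_delivery
+   [pd0] pd1
+   # echo pd1 > /sys/class/typec/port0/select_usb_power_delivery
+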
USB Type-C partner devices (eg. /sys/class/typec/port0-partner/)
What: /sys/class/typec/<port>-partner/accessory_mode
@@ -200,7 +208,7 @@ Description: USB Power Delivery Specification defines a set of product types
amc Alternate Mode Controller
====================== ==========================
-What: /sys/class/typec/<port>-partner>/identity/
+What: /sys/class/typec/<port>-partner/identity/
Date: April 2017
Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Description:
diff --git a/Documentation/ABI/testing/sysfs-class-usb_power_delivery b/Documentation/ABI/testing/sysfs-class-usb_power_delivery
new file mode 100644
index 000000000000..1bf9d1d7902c
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-usb_power_delivery
@@ -0,0 +1,249 @@
+What: /sys/class/usb_power_delivery
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Directory for USB Power Delivery devices.
+
+What: /sys/class/usb_power_delivery/.../revision
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ File showing the USB Power Delivery Specification Revision used
+ in communication.
+
+What: /sys/class/usb_power_delivery/.../version
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ This is an optional attribute file showing the version of the
+ specific revision of the USB Power Delivery Specification. In
+ most cases the specification version is not known and the file
+ is not available.
+
+What: /sys/class/usb_power_delivery/.../source-capabilities
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ The source capabilities message "Source_Capabilities" contains a
+ set of Power Data Objects (PDO), each representing a type of
+ power supply. The order of the PDO objects is defined in the USB
+ Power Delivery Specification. Each PDO - power supply - will
+ have its own device, and the PDO device name will start with the
+ object position number as the first character followed by the
+ power supply type name (":" as delimiter).
+
+ /sys/class/usb_power_delivery/.../source-capabilities/<position>:<type>
+
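+ For instance, listing a source capabilities directory might
+ yield (device and PDO names illustrative)::
+
+   # ls /sys/class/usb_power_delivery/pd0/source-capabilities/
+   1:fixed_supply  2:variable_supply
+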
+What: /sys/class/usb_power_delivery/.../sink-capabilities
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ The sink capability message "Sink_Capabilities" contains a set
+ of Power Data Objects (PDO) just like with source capabilities,
+ but instead of describing the power capabilities, these objects
+ describe the power requirements.
+
+ The order of the objects in the sink capability message is the
+ same as with the source capabilities message.
+
+Fixed Supplies
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:fixed_supply
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Devices containing the attributes (the bit fields) defined for
+ Fixed Supplies.
+
+ The device "1:fixed_supply" is special. USB Power Delivery
+ Specification dictates that the first PDO (at object position
+ 1), and the only mandatory PDO, is always the vSafe5V Fixed
+ Supply Object. vSafe5V Object has additional fields defined for
+ it that the other Fixed Supply Objects do not have and that are
+ related to the USB capabilities rather than power capabilities.
+
+What: /sys/class/usb_power_delivery/.../<capability>/1:fixed_supply/dual_role_power
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ This file contains a boolean value that tells whether the
+ device supports both source and sink power roles.
+
+What: /sys/class/usb_power_delivery/.../source-capabilities/1:fixed_supply/usb_suspend_supported
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ This file shows the value of the USB Suspend Supported bit in
+ vSafe5V Fixed Supply Object. If the bit is set then the device
+ will follow the USB 2.0 and USB 3.2 rules for suspend and
+ resume.
+
+What: /sys/class/usb_power_delivery/.../sink-capabilities/1:fixed_supply/higher_capability
+Date: February 2023
+Contact: Saranya Gopal <saranya.gopal@linux.intel.com>
+Description:
+ This file shows the value of the Higher capability bit in
+ vSafe5V Fixed Supply Object. If the bit is set, then the sink
+ needs more than vSafe5V (e.g. 12 V) to provide full functionality.
+ Valid values: 0, 1
+
+What: /sys/class/usb_power_delivery/.../<capability>/1:fixed_supply/unconstrained_power
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ This file shows the value of the Unconstrained Power bit in
+ vSafe5V Fixed Supply Object. The bit is set when an external
+ source of power, powerful enough to power the entire system on
+ its own, is available for the device.
+
+What: /sys/class/usb_power_delivery/.../<capability>/1:fixed_supply/usb_communication_capable
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ This file shows the value of the USB Communication Capable bit in
+ vSafe5V Fixed Supply Object.
+
+What: /sys/class/usb_power_delivery/.../<capability>/1:fixed_supply/dual_role_data
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ This file shows the value of the Dual-Role Data bit in vSafe5V
+ Fixed Supply Object. Dual-role data means the ability to act
+ as both USB host and USB device.
+
+What: /sys/class/usb_power_delivery/.../<capability>/1:fixed_supply/unchunked_extended_messages_supported
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ This file shows the value of the Unchunked Extended Messages
+ Supported bit in vSafe5V Fixed Supply Object.
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:fixed_supply/voltage
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ The voltage the supply supports in millivolts.
+
+What: /sys/class/usb_power_delivery/.../source-capabilities/<position>:fixed_supply/maximum_current
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Maximum current of the fixed source supply in milliamperes.
+
+What: /sys/class/usb_power_delivery/.../sink-capabilities/<position>:fixed_supply/operational_current
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Operational current of the sink in milliamperes.
+
+What: /sys/class/usb_power_delivery/.../sink-capabilities/<position>:fixed_supply/fast_role_swap_current
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ This file contains the value of the "Fast Role Swap USB Type-C
+ Current" field that tells the current level the sink requires
+ after a Fast Role Swap.
+ 0 - Fast Swap not supported"
+ 1 - Default USB Power"
+ 2 - 1.5A@5V"
+ 3 - 3.0A@5V"
+
+Variable Supplies
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:variable_supply
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Variable Power Supply PDO.
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:variable_supply/maximum_voltage
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Maximum Voltage in millivolts.
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:variable_supply/minimum_voltage
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Minimum Voltage in millivolts.
+
+What: /sys/class/usb_power_delivery/.../source-capabilities/<position>:variable_supply/maximum_current
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ The maximum current in milliamperes that the source can supply
+ at the given Voltage range.
+
+What: /sys/class/usb_power_delivery/.../sink-capabilities/<position>:variable_supply/operational_current
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ The operational current in milliamperes that the sink requires
+ at the given Voltage range.
+
+Battery Supplies
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:battery
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Battery PDO.
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:battery/maximum_voltage
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Maximum Voltage in millivolts.
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:battery/minimum_voltage
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Minimum Voltage in millivolts.
+
+What: /sys/class/usb_power_delivery/.../source-capabilities/<position>:battery/maximum_power
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Maximum allowable Power in milliwatts.
+
+What: /sys/class/usb_power_delivery/.../sink-capabilities/<position>:battery/operational_power
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ The operational power that the sink requires at the given
+ voltage range.
+
+Standard Power Range (SPR) Programmable Power Supplies
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:programmable_supply
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Programmable Power Supply (PPS) Augmented PDO (APDO).
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:programmable_supply/maximum_voltage
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Maximum Voltage in millivolts.
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:programmable_supply/minimum_voltage
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Minimum Voltage in millivolts.
+
+What: /sys/class/usb_power_delivery/.../<capability>/<position>:programmable_supply/maximum_current
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ Maximum Current in milliamperes.
+
+What: /sys/class/usb_power_delivery/.../source-capabilities/<position>:programmable_supply/pps_power_limited
+Date: May 2022
+Contact: Heikki Krogerus <heikki.krogerus@linux.intel.com>
+Description:
+ The PPS Power Limited bit indicates whether or not the source
+ supply will exceed the rated output power if requested.
diff --git a/Documentation/ABI/testing/sysfs-class-uwb_rc b/Documentation/ABI/testing/sysfs-class-uwb_rc
index 6c5dcad21e19..a7ea169dc4eb 100644
--- a/Documentation/ABI/testing/sysfs-class-uwb_rc
+++ b/Documentation/ABI/testing/sysfs-class-uwb_rc
@@ -18,14 +18,14 @@ Description:
and it will be removed. The default is 3 superframes
(~197 ms) as required by the specification.
-What: /sys/class/uwb_rc/uwbN/
+What: /sys/class/uwb_rc/uwb<N>/
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
Description:
An individual UWB radio controller.
-What: /sys/class/uwb_rc/uwbN/beacon
+What: /sys/class/uwb_rc/uwb<N>/beacon
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
@@ -43,7 +43,7 @@ Description:
Reading returns the currently active channel, or -1 if
the radio controller is not beaconing.
-What: /sys/class/uwb_rc/uwbN/ASIE
+What: /sys/class/uwb_rc/uwb<N>/ASIE
Date: August 2014
KernelVersion: 3.18
Contact: linux-usb@vger.kernel.org
@@ -56,7 +56,7 @@ Description:
Reading returns the current ASIE. Writing replaces
the current ASIE with the one written.
-What: /sys/class/uwb_rc/uwbN/scan
+What: /sys/class/uwb_rc/uwb<N>/scan
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
@@ -75,7 +75,7 @@ Description:
4 scan (with start time of <bpst offset>)
== =======================================
-What: /sys/class/uwb_rc/uwbN/mac_address
+What: /sys/class/uwb_rc/uwb<N>/mac_address
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
@@ -85,7 +85,7 @@ Description:
controller's EUI-48 but only do so while the device is
not beaconing or scanning.
-What: /sys/class/uwb_rc/uwbN/wusbhc
+What: /sys/class/uwb_rc/uwb<N>/wusbhc
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
@@ -93,7 +93,7 @@ Description:
A symlink to the device (if any) of the WUSB Host
Controller PAL using this radio controller.
-What: /sys/class/uwb_rc/uwbN/<EUI-48>/
+What: /sys/class/uwb_rc/uwb<N>/<EUI-48>/
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
@@ -102,7 +102,7 @@ Description:
as part of a scan or is a member of the radio
controller's beacon group.
-What: /sys/class/uwb_rc/uwbN/<EUI-48>/BPST
+What: /sys/class/uwb_rc/uwb<N>/<EUI-48>/BPST
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
@@ -111,7 +111,7 @@ Description:
interval superframe timer) of the last beacon from
this device was received.
-What: /sys/class/uwb_rc/uwbN/<EUI-48>/DevAddr
+What: /sys/class/uwb_rc/uwb<N>/<EUI-48>/DevAddr
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
@@ -119,7 +119,7 @@ Description:
The current DevAddr of this device in colon separated
hex octets.
-What: /sys/class/uwb_rc/uwbN/<EUI-48>/EUI_48
+What: /sys/class/uwb_rc/uwb<N>/<EUI-48>/EUI_48
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
@@ -128,7 +128,7 @@ Description:
The EUI-48 of this device in colon separated hex
octets.
-What: /sys/class/uwb_rc/uwbN/<EUI-48>/IEs
+What: /sys/class/uwb_rc/uwb<N>/<EUI-48>/IEs
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
@@ -136,7 +136,7 @@ Description:
The latest IEs included in this device's beacon, in
space separated hex octets with one IE per line.
-What: /sys/class/uwb_rc/uwbN/<EUI-48>/LQE
+What: /sys/class/uwb_rc/uwb<N>/<EUI-48>/LQE
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
@@ -146,7 +146,7 @@ Description:
This gives an estimate on a suitable PHY rate. Refer
to [ECMA-368] section 13.3 for more details.
-What: /sys/class/uwb_rc/uwbN/<EUI-48>/RSSI
+What: /sys/class/uwb_rc/uwb<N>/<EUI-48>/RSSI
Date: July 2008
KernelVersion: 2.6.27
Contact: linux-usb@vger.kernel.org
diff --git a/Documentation/ABI/testing/sysfs-class-uwb_rc-wusbhc b/Documentation/ABI/testing/sysfs-class-uwb_rc-wusbhc
index 5977e2875325..55eb55cac92e 100644
--- a/Documentation/ABI/testing/sysfs-class-uwb_rc-wusbhc
+++ b/Documentation/ABI/testing/sysfs-class-uwb_rc-wusbhc
@@ -1,4 +1,4 @@
-What: /sys/class/uwb_rc/uwbN/wusbhc/wusb_chid
+What: /sys/class/uwb_rc/uwb<N>/wusbhc/wusb_chid
Date: July 2008
KernelVersion: 2.6.27
Contact: David Vrabel <david.vrabel@csr.com>
@@ -9,7 +9,7 @@ Description:
Set an all zero CHID to stop the host controller.
-What: /sys/class/uwb_rc/uwbN/wusbhc/wusb_trust_timeout
+What: /sys/class/uwb_rc/uwb<N>/wusbhc/wusb_trust_timeout
Date: July 2008
KernelVersion: 2.6.27
Contact: David Vrabel <david.vrabel@csr.com>
@@ -24,7 +24,7 @@ Description:
lifetime of PTKs and GTKs) it should not be changed
from the default.
-What: /sys/class/uwb_rc/uwbN/wusbhc/wusb_phy_rate
+What: /sys/class/uwb_rc/uwb<N>/wusbhc/wusb_phy_rate
Date: August 2009
KernelVersion: 2.6.32
Contact: David Vrabel <david.vrabel@csr.com>
@@ -37,7 +37,7 @@ Description:
Refer to [ECMA-368] section 10.3.1.1 for the value to
use.
-What: /sys/class/uwb_rc/uwbN/wusbhc/wusb_dnts
+What: /sys/class/uwb_rc/uwb<N>/wusbhc/wusb_dnts
Date: June 2013
KernelVersion: 3.11
Contact: Thomas Pugliese <thomas.pugliese@gmail.com>
@@ -47,7 +47,7 @@ Description:
often the devices will have the opportunity to send
notifications to the host.
-What: /sys/class/uwb_rc/uwbN/wusbhc/wusb_retry_count
+What: /sys/class/uwb_rc/uwb<N>/wusbhc/wusb_retry_count
Date: June 2013
KernelVersion: 3.11
Contact: Thomas Pugliese <thomas.pugliese@gmail.com>
diff --git a/Documentation/ABI/testing/sysfs-class-vduse b/Documentation/ABI/testing/sysfs-class-vduse
new file mode 100644
index 000000000000..2f2bc5c8fc48
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-vduse
@@ -0,0 +1,33 @@
+What: /sys/class/vduse/
+Date: Oct 2021
+KernelVersion: 5.15
+Contact: Yongji Xie <xieyongji@bytedance.com>
+Description:
+ The vduse/ class sub-directory belongs to the VDUSE
+ framework and provides a sysfs interface for configuring
+ VDUSE devices.
+
+What: /sys/class/vduse/control/
+Date: Oct 2021
+KernelVersion: 5.15
+Contact: Yongji Xie <xieyongji@bytedance.com>
+Description:
+ This directory entry is created for the control device
+ of the VDUSE framework.
+
+What: /sys/class/vduse/<device-name>/
+Date: Oct 2021
+KernelVersion: 5.15
+Contact: Yongji Xie <xieyongji@bytedance.com>
+Description:
+ This directory entry is created when a VDUSE device is
+ created via the control device.
+
+What: /sys/class/vduse/<device-name>/msg_timeout
+Date: Oct 2021
+KernelVersion: 5.15
+Contact: Yongji Xie <xieyongji@bytedance.com>
+Description:
+ (RW) The timeout (in seconds) for waiting for the control
+ message's response from userspace. The default value is
+ 30 seconds. Writing '0' to the file disables the timeout.
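+
+ A minimal sketch (device name illustrative)::
+
+   # cat /sys/class/vduse/vduse-blk0/msg_timeout
+   30
+   # echo 0 > /sys/class/vduse/vduse-blk0/msg_timeout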
diff --git a/Documentation/ABI/testing/sysfs-class-watchdog b/Documentation/ABI/testing/sysfs-class-watchdog
index 585caecda3a5..94fb74615951 100644
--- a/Documentation/ABI/testing/sysfs-class-watchdog
+++ b/Documentation/ABI/testing/sysfs-class-watchdog
@@ -6,6 +6,19 @@ Description:
device at boot. It is equivalent to WDIOC_GETBOOTSTATUS of
ioctl interface.
+What: /sys/class/watchdog/watchdogn/options
+Date: April 2023
+Contact: Thomas Weißschuh
+Description:
+ It is a read-only file. It contains the options of the watchdog device.
+
+What: /sys/class/watchdog/watchdogn/fw_version
+Date: April 2023
+Contact: Thomas Weißschuh
+Description:
+ It is a read-only file. It contains the firmware version of
+ the watchdog device.
+
What: /sys/class/watchdog/watchdogn/identity
Date: August 2015
Contact: Wim Van Sebroeck <wim@iguana.be>
diff --git a/Documentation/ABI/testing/sysfs-devices-hisi_ptt b/Documentation/ABI/testing/sysfs-devices-hisi_ptt
new file mode 100644
index 000000000000..82de6d710266
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-hisi_ptt
@@ -0,0 +1,61 @@
+What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune
+Date: October 2022
+KernelVersion: 6.1
+Contact: Yicong Yang <yangyicong@hisilicon.com>
+Description: This directory contains files for tuning the PCIe link
+ parameters (events). Each file is named after the PCIe
+ link event it tunes.
+
+ See Documentation/trace/hisi-ptt.rst for more information.
+
+What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune/qos_tx_cpl
+Date: October 2022
+KernelVersion: 6.1
+Contact: Yicong Yang <yangyicong@hisilicon.com>
+Description: (RW) Controls the weight of Tx completion TLPs, which influences
+ the proportion of outbound completion TLPs on the PCIe link.
+ The available tune data is [0, 1, 2]. Writing a negative value
+ will return an error, and out-of-range values will be converted
+ to 2. The value indicates a probable level of the event.
+
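+ A usage sketch (device path illustrative)::
+
+   # cat /sys/devices/hisi_ptt0_1/tune/qos_tx_cpl
+   1
+   # echo 2 > /sys/devices/hisi_ptt0_1/tune/qos_tx_cpl
+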
+What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune/qos_tx_np
+Date: October 2022
+KernelVersion: 6.1
+Contact: Yicong Yang <yangyicong@hisilicon.com>
+Description: (RW) Controls the weight of Tx non-posted TLPs, which influences
+ the proportion of outbound non-posted TLPs on the PCIe link.
+ The available tune data is [0, 1, 2]. Writing a negative value
+ will return an error, and out-of-range values will be converted
+ to 2. The value indicates a probable level of the event.
+
+What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune/qos_tx_p
+Date: October 2022
+KernelVersion: 6.1
+Contact: Yicong Yang <yangyicong@hisilicon.com>
+Description: (RW) Controls the weight of Tx posted TLPs, which influences the
+ proportion of outbound posted TLPs on the PCIe link.
+ The available tune data is [0, 1, 2]. Writing a negative value
+ will return an error, and out-of-range values will be converted
+ to 2. The value indicates a probable level of the event.
+
+What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune/rx_alloc_buf_level
+Date: October 2022
+KernelVersion: 6.1
+Contact: Yicong Yang <yangyicong@hisilicon.com>
+Description: (RW) Controls the allocated buffer watermark for inbound packets.
+ The packets will be stored in the buffer first and then transmitted
+ either when the watermark is reached or when a timeout occurs.
+ The available tune data is [0, 1, 2]. Writing a negative value
+ will return an error, and out-of-range values will be converted
+ to 2. The value indicates a probable level of the event.
+
+What: /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune/tx_alloc_buf_level
+Date: October 2022
+KernelVersion: 6.1
+Contact: Yicong Yang <yangyicong@hisilicon.com>
+Description: (RW) Controls the allocated buffer watermark for outbound packets.
+ The packets will be stored in the buffer first and then transmitted
+ either when the watermark is reached or when a timeout occurs.
+ The available tune data is [0, 1, 2]. Writing a negative value
+ will return an error, and out-of-range values will be converted
+ to 2. The value indicates a probable level of the event.
diff --git a/Documentation/ABI/testing/sysfs-devices-mapping b/Documentation/ABI/testing/sysfs-devices-mapping
index 8d202bac9394..2eee1446ad4c 100644
--- a/Documentation/ABI/testing/sysfs-devices-mapping
+++ b/Documentation/ABI/testing/sysfs-devices-mapping
@@ -1,6 +1,6 @@
What: /sys/devices/uncore_iio_x/dieX
Date: February 2020
-Contact: Roman Sudarikov <roman.sudarikov@linux.intel.com>
+Contact: Alexander Antonov <alexander.antonov@linux.intel.com>
Description:
Each IIO stack (PCIe root port) has its own IIO PMON block, so
each dieX file (where X is die number) holds "Segment:Root Bus"
@@ -32,3 +32,31 @@ Description:
IIO PMU 0 on die 1 belongs to PCI RP on bus 0x40, domain 0x0000
IIO PMU 0 on die 2 belongs to PCI RP on bus 0x80, domain 0x0000
IIO PMU 0 on die 3 belongs to PCI RP on bus 0xc0, domain 0x0000
+
+What: /sys/devices/uncore_upi_x/dieX
+Date: March 2022
+Contact: Alexander Antonov <alexander.antonov@linux.intel.com>
+Description:
+ Each /sys/devices/uncore_upi_X/dieY file holds a "upi_Z,die_W"
+ value, which means that UPI link number X on die Y is connected
+ to UPI link Z on die W, and that this link between sockets can
+ be monitored by the UPI PMON block.
+ For example, 4-die Sapphire Rapids platform has the following
+ UPI 0 topology::
+
+ # tail /sys/devices/uncore_upi_0/die*
+ ==> /sys/devices/uncore_upi_0/die0 <==
+ upi_1,die_1
+ ==> /sys/devices/uncore_upi_0/die1 <==
+ upi_0,die_3
+ ==> /sys/devices/uncore_upi_0/die2 <==
+ upi_1,die_3
+ ==> /sys/devices/uncore_upi_0/die3 <==
+ upi_0,die_1
+
+ Which means::
+
+ UPI link 0 on die 0 is connected to UPI link 1 on die 1
+ UPI link 0 on die 1 is connected to UPI link 0 on die 3
+ UPI link 0 on die 2 is connected to UPI link 1 on die 3
+ UPI link 0 on die 3 is connected to UPI link 0 on die 1 \ No newline at end of file
diff --git a/Documentation/ABI/testing/sysfs-devices-physical_location b/Documentation/ABI/testing/sysfs-devices-physical_location
new file mode 100644
index 000000000000..202324b87083
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-physical_location
@@ -0,0 +1,42 @@
+What: /sys/devices/.../physical_location
+Date: March 2022
+Contact: Won Chung <wonchung@google.com>
+Description:
+ This directory contains information on physical location of
+ the device connection point with respect to the system's
+ housing.
+
+What: /sys/devices/.../physical_location/panel
+Date: March 2022
+Contact: Won Chung <wonchung@google.com>
+Description:
+ Describes which panel surface of the system's housing the
+ device connection point resides on.
+
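+ These attributes are read like any other sysfs file; the
+ output below is illustrative::
+
+   # cat /sys/devices/.../physical_location/panel
+   left
+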
+What: /sys/devices/.../physical_location/vertical_position
+Date: March 2022
+Contact: Won Chung <wonchung@google.com>
+Description:
+ Describes vertical position of the device connection point on
+ the panel surface.
+
+What: /sys/devices/.../physical_location/horizontal_position
+Date: March 2022
+Contact: Won Chung <wonchung@google.com>
+Description:
+ Describes horizontal position of the device connection point on
+ the panel surface.
+
+What: /sys/devices/.../physical_location/dock
+Date: March 2022
+Contact: Won Chung <wonchung@google.com>
+Description:
+ "Yes" if the device connection point resides in a docking
+ station or a port replicator. "No" otherwise.
+
+What: /sys/devices/.../physical_location/lid
+Date: March 2022
+Contact: Won Chung <wonchung@google.com>
+Description:
+ "Yes" if the device connection point resides on the lid of
+ laptop system. "No" otherwise.
diff --git a/Documentation/ABI/testing/sysfs-devices-platform-ACPI-TAD b/Documentation/ABI/testing/sysfs-devices-platform-ACPI-TAD
index f7b360a61b21..bc44bc903bc8 100644
--- a/Documentation/ABI/testing/sysfs-devices-platform-ACPI-TAD
+++ b/Documentation/ABI/testing/sysfs-devices-platform-ACPI-TAD
@@ -74,7 +74,7 @@ Description:
Reads also cause the AC alarm timer status to be reset.
- Another way to reset the the status of the AC alarm timer is to
+ Another way to reset the status of the AC alarm timer is to
write (the number) 0 to this file.
If the status return value indicates that the timer has expired,
diff --git a/Documentation/ABI/testing/sysfs-devices-platform-dock b/Documentation/ABI/testing/sysfs-devices-platform-dock
index 1d8c18f905c7..411c174de830 100644
--- a/Documentation/ABI/testing/sysfs-devices-platform-dock
+++ b/Documentation/ABI/testing/sysfs-devices-platform-dock
@@ -1,4 +1,4 @@
-What: /sys/devices/platform/dock.N/docked
+What: /sys/devices/platform/dock.<N>/docked
Date: Dec, 2006
KernelVersion: 2.6.19
Contact: linux-acpi@vger.kernel.org
@@ -6,7 +6,7 @@ Description:
(RO) Value 1 or 0 indicates whether the software believes the
laptop is docked in a docking station.
-What: /sys/devices/platform/dock.N/undock
+What: /sys/devices/platform/dock.<N>/undock
Date: Dec, 2006
KernelVersion: 2.6.19
Contact: linux-acpi@vger.kernel.org
@@ -14,14 +14,14 @@ Description:
(WO) Writing to this file causes the software to initiate an
undock request to the firmware.
-What: /sys/devices/platform/dock.N/uid
+What: /sys/devices/platform/dock.<N>/uid
Date: Feb, 2007
KernelVersion: v2.6.21
Contact: linux-acpi@vger.kernel.org
Description:
(RO) Displays the docking station the laptop is docked to.
-What: /sys/devices/platform/dock.N/flags
+What: /sys/devices/platform/dock.<N>/flags
Date: May, 2007
KernelVersion: v2.6.21
Contact: linux-acpi@vger.kernel.org
@@ -30,7 +30,7 @@ Description:
request has been made by the user (from the immediate_undock
option).
-What: /sys/devices/platform/dock.N/type
+What: /sys/devices/platform/dock.<N>/type
Date: Aug, 2008
KernelVersion: v2.6.27
Contact: linux-acpi@vger.kernel.org
diff --git a/Documentation/ABI/testing/sysfs-devices-platform-soc-ipa b/Documentation/ABI/testing/sysfs-devices-platform-soc-ipa
index c56dcf15bf29..364b1ba41242 100644
--- a/Documentation/ABI/testing/sysfs-devices-platform-soc-ipa
+++ b/Documentation/ABI/testing/sysfs-devices-platform-soc-ipa
@@ -46,33 +46,69 @@ Description:
that is supported by the hardware. The possible values
are "MAPv4" or "MAPv5".
+What: .../XXXXXXX.ipa/endpoint_id/
+Date: July 2022
+KernelVersion: v5.19
+Contact: Alex Elder <elder@kernel.org>
+Description:
+ The .../XXXXXXX.ipa/endpoint_id/ directory contains
+ attributes that define IDs associated with IPA
+ endpoints. The "rx" or "tx" in an endpoint name is
+ from the perspective of the AP. An endpoint ID is a
+ small unsigned integer.
+
+What: .../XXXXXXX.ipa/endpoint_id/modem_rx
+Date: July 2022
+KernelVersion: v5.19
+Contact: Alex Elder <elder@kernel.org>
+Description:
+ The .../XXXXXXX.ipa/endpoint_id/modem_rx file contains
+ the ID of the AP endpoint on which packets originating
+ from the embedded modem are received.
+
+What: .../XXXXXXX.ipa/endpoint_id/modem_tx
+Date: July 2022
+KernelVersion: v5.19
+Contact: Alex Elder <elder@kernel.org>
+Description:
+ The .../XXXXXXX.ipa/endpoint_id/modem_tx file contains
+ the ID of the AP endpoint on which packets destined
+ for the embedded modem are sent.
+
+What: .../XXXXXXX.ipa/endpoint_id/monitor_rx
+Date: July 2022
+KernelVersion: v5.19
+Contact: Alex Elder <elder@kernel.org>
+Description:
+ The .../XXXXXXX.ipa/endpoint_id/monitor_rx file contains
+ the ID of the AP endpoint on which IPA "monitor" data is
+ received. The monitor endpoint supplies replicas of
+ packets that enter the IPA hardware for processing.
+ Each replicated packet is preceded by a fixed-size "ODL"
+ header (see .../XXXXXXX.ipa/feature/monitor, above).
+ Large packets are truncated, to reduce the bandwidth
+ required to provide the monitor function.
+
What: .../XXXXXXX.ipa/modem/
Date: June 2021
KernelVersion: v5.14
Contact: Alex Elder <elder@kernel.org>
Description:
- The .../XXXXXXX.ipa/modem/ directory contains a set of
- attributes describing properties of the modem execution
- environment reachable by the IPA hardware.
+ The .../XXXXXXX.ipa/modem/ directory contains attributes
+ describing properties of the modem embedded in the SoC.
What: .../XXXXXXX.ipa/modem/rx_endpoint_id
Date: June 2021
KernelVersion: v5.14
Contact: Alex Elder <elder@kernel.org>
Description:
- The .../XXXXXXX.ipa/feature/rx_endpoint_id file contains
- the AP endpoint ID that receives packets originating from
- the modem execution environment. The "rx" is from the
- perspective of the AP; this endpoint is considered an "IPA
- producer". An endpoint ID is a small unsigned integer.
+ The .../XXXXXXX.ipa/modem/rx_endpoint_id file duplicates
+ the value found in .../XXXXXXX.ipa/endpoint_id/modem_rx.
What: .../XXXXXXX.ipa/modem/tx_endpoint_id
Date: June 2021
KernelVersion: v5.14
Contact: Alex Elder <elder@kernel.org>
Description:
- The .../XXXXXXX.ipa/feature/tx_endpoint_id file contains
- the AP endpoint ID used to transmit packets destined for
- the modem execution environment. The "tx" is from the
- perspective of the AP; this endpoint is considered an "IPA
- consumer". An endpoint ID is a small unsigned integer.
+ The .../XXXXXXX.ipa/modem/tx_endpoint_id file duplicates
+ the value found in .../XXXXXXX.ipa/endpoint_id/modem_tx.
diff --git a/Documentation/ABI/testing/sysfs-devices-power b/Documentation/ABI/testing/sysfs-devices-power
index 1763e64dd152..54195530e97a 100644
--- a/Documentation/ABI/testing/sysfs-devices-power
+++ b/Documentation/ABI/testing/sysfs-devices-power
@@ -269,3 +269,39 @@ Description:
the current runtime PM status of the device, which may be
"suspended", "suspending", "resuming", "active", "error" (fatal
error), or "unsupported" (runtime PM is disabled).
+
+What: /sys/devices/.../power/runtime_active_time
+Date: Jul 2010
+Contact: Arjan van de Ven <arjan@linux.intel.com>
+Description:
+ Reports the total time that the device has been active.
+ Used for runtime PM statistics.
+
+What: /sys/devices/.../power/runtime_suspended_time
+Date: Jul 2010
+Contact: Arjan van de Ven <arjan@linux.intel.com>
+Description:
+ Reports the total time that the device has been suspended.
+ Used for runtime PM statistics.
+
+What: /sys/devices/.../power/runtime_usage
+Date: Apr 2010
+Contact: Dominik Brodowski <linux@dominikbrodowski.net>
+Description:
+ Reports the runtime PM usage count of a device.
+
+What: /sys/devices/.../power/runtime_enabled
+Date: Apr 2010
+Contact: Dominik Brodowski <linux@dominikbrodowski.net>
+Description:
+ Is runtime PM enabled for this device?
+ States are "enabled", "disabled", "forbidden" or a
+ combination of the latter two.
+
+What: /sys/devices/.../power/runtime_active_kids
+Date: Apr 2010
+Contact: Dominik Brodowski <linux@dominikbrodowski.net>
+Description:
+ Reports the runtime PM children usage count of a device, or
+ 0 if the children will be ignored.
+
diff --git a/Documentation/ABI/testing/sysfs-devices-removable b/Documentation/ABI/testing/sysfs-devices-removable
index bda6c320c8d3..754ecb4587ca 100644
--- a/Documentation/ABI/testing/sysfs-devices-removable
+++ b/Documentation/ABI/testing/sysfs-devices-removable
@@ -7,10 +7,12 @@ Description:
bus / platform-specific way. This attribute is only present for
devices that can support determining such information:
- "removable": device can be removed from the platform by the user
- "fixed": device is fixed to the platform / cannot be removed
+ =========== ===================================================
+ "removable" device can be removed from the platform by the user
+ "fixed" device is fixed to the platform / cannot be removed
by the user.
- "unknown": The information is unavailable / cannot be deduced.
+ "unknown" The information is unavailable / cannot be deduced.
+ =========== ===================================================
Currently this is only supported by USB (which infers the
information from a combination of hub descriptor bits and
diff --git a/Documentation/ABI/testing/sysfs-devices-soc b/Documentation/ABI/testing/sysfs-devices-soc
index ea999e292f11..5269808ec35f 100644
--- a/Documentation/ABI/testing/sysfs-devices-soc
+++ b/Documentation/ABI/testing/sysfs-devices-soc
@@ -1,6 +1,6 @@
What: /sys/devices/socX
Date: January 2012
-contact: Lee Jones <lee.jones@linaro.org>
+contact: Lee Jones <lee@kernel.org>
Description:
The /sys/devices/ directory contains a sub-directory for each
System-on-Chip (SoC) device on a running platform. Information
@@ -14,14 +14,14 @@ Description:
What: /sys/devices/socX/machine
Date: January 2012
-contact: Lee Jones <lee.jones@linaro.org>
+contact: Lee Jones <lee@kernel.org>
Description:
Read-only attribute common to all SoCs. Contains the SoC machine
name (e.g. Ux500).
What: /sys/devices/socX/family
Date: January 2012
-contact: Lee Jones <lee.jones@linaro.org>
+contact: Lee Jones <lee@kernel.org>
Description:
Read-only attribute common to all SoCs. Contains SoC family name
(e.g. DB8500).
@@ -59,7 +59,7 @@ Description:
What: /sys/devices/socX/soc_id
Date: January 2012
-contact: Lee Jones <lee.jones@linaro.org>
+contact: Lee Jones <lee@kernel.org>
Description:
Read-only attribute supported by most SoCs. In the case of
ST-Ericsson's chips this contains the SoC serial number.
@@ -72,21 +72,21 @@ Description:
What: /sys/devices/socX/revision
Date: January 2012
-contact: Lee Jones <lee.jones@linaro.org>
+contact: Lee Jones <lee@kernel.org>
Description:
Read-only attribute supported by most SoCs. Contains the SoC's
manufacturing revision number.
What: /sys/devices/socX/process
Date: January 2012
-contact: Lee Jones <lee.jones@linaro.org>
+contact: Lee Jones <lee@kernel.org>
Description:
Read-only attribute supported by ST-Ericsson's silicon. Contains
the process by which the silicon chip was manufactured.
What: /sys/bus/soc
Date: January 2012
-contact: Lee Jones <lee.jones@linaro.org>
+contact: Lee Jones <lee@kernel.org>
Description:
The /sys/bus/soc/ directory contains the usual sub-folders
expected under most buses. /sys/bus/soc/devices is of particular
diff --git a/Documentation/ABI/testing/sysfs-devices-state_synced b/Documentation/ABI/testing/sysfs-devices-state_synced
index 0c922d7d02fc..c64636ddac41 100644
--- a/Documentation/ABI/testing/sysfs-devices-state_synced
+++ b/Documentation/ABI/testing/sysfs-devices-state_synced
@@ -21,4 +21,9 @@ Description:
at the time the kernel starts are not affected or limited in
any way by sync_state() callbacks.
+ Writing "1" to this file will force a call to the device's
+ sync_state() function if it hasn't been called already. The
+ sync_state() call happens independently of the state of the
+ consumer devices.
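+
+ As a sketch, forcing the sync_state() call for a device
+ (device path illustrative)::
+
+   # echo 1 > /sys/devices/.../state_synced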
+
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index b46ef147616a..f54867cadb0f 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -7,7 +7,7 @@ Description:
Individual CPU attributes are contained in subdirectories
named by the kernel's logical CPU number, e.g.:
- /sys/devices/system/cpu/cpu#/
+ /sys/devices/system/cpu/cpuX/
What: /sys/devices/system/cpu/kernel_max
/sys/devices/system/cpu/offline
@@ -53,7 +53,7 @@ Description: Dynamic addition and removal of CPU's. This is not hotplug
the system. Information written to the file to remove CPUs
is architecture specific.
-What: /sys/devices/system/cpu/cpu#/node
+What: /sys/devices/system/cpu/cpuX/node
Date: October 2009
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: Discover NUMA node a CPU belongs to
@@ -67,41 +67,42 @@ Description: Discover NUMA node a CPU belongs to
/sys/devices/system/cpu/cpu42/node2 -> ../../node/node2
-What: /sys/devices/system/cpu/cpu#/topology/core_id
- /sys/devices/system/cpu/cpu#/topology/core_siblings
- /sys/devices/system/cpu/cpu#/topology/core_siblings_list
- /sys/devices/system/cpu/cpu#/topology/physical_package_id
- /sys/devices/system/cpu/cpu#/topology/thread_siblings
- /sys/devices/system/cpu/cpu#/topology/thread_siblings_list
+What: /sys/devices/system/cpu/cpuX/topology/core_siblings
+ /sys/devices/system/cpu/cpuX/topology/core_siblings_list
+ /sys/devices/system/cpu/cpuX/topology/physical_package_id
+ /sys/devices/system/cpu/cpuX/topology/thread_siblings
+ /sys/devices/system/cpu/cpuX/topology/thread_siblings_list
+ /sys/devices/system/cpu/cpuX/topology/ppin
Date: December 2008
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description: CPU topology files that describe a logical CPU's relationship
to other cores and threads in the same physical package.
- One cpu# directory is created per logical CPU in the system,
+ One cpuX directory is created per logical CPU in the system,
e.g. /sys/devices/system/cpu/cpu42/.
Briefly, the files above are:
- core_id: the CPU core ID of cpu#. Typically it is the
- hardware platform's identifier (rather than the kernel's).
- The actual value is architecture and platform dependent.
-
- core_siblings: internal kernel map of cpu#'s hardware threads
+ core_siblings: internal kernel map of cpuX's hardware threads
within the same physical_package_id.
core_siblings_list: human-readable list of the logical CPU
- numbers within the same physical_package_id as cpu#.
+ numbers within the same physical_package_id as cpuX.
- physical_package_id: physical package id of cpu#. Typically
+ physical_package_id: physical package id of cpuX. Typically
corresponds to a physical socket number, but the actual value
is architecture and platform dependent.
- thread_siblings: internal kernel map of cpu#'s hardware
- threads within the same core as cpu#
+ thread_siblings: internal kernel map of cpuX's hardware
+ threads within the same core as cpuX
+
+ thread_siblings_list: human-readable list of cpuX's hardware
+ threads within the same core as cpuX
- thread_siblings_list: human-readable list of cpu#'s hardware
- threads within the same core as cpu#
+ ppin: human-readable Protected Processor Identification
+ Number of the socket the cpuX belongs to. There should be
+ one per physical_package_id. The file is readable only by
+ admin.
See Documentation/admin-guide/cputopology.rst for more information.
@@ -135,7 +136,7 @@ Description: Discover cpuidle policy and mechanism
Documentation/driver-api/pm/cpuidle.rst for more information.
-What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/name
+What: /sys/devices/system/cpu/cpuX/cpuidle/state<N>/name
/sys/devices/system/cpu/cpuX/cpuidle/stateN/latency
/sys/devices/system/cpu/cpuX/cpuidle/stateN/power
/sys/devices/system/cpu/cpuX/cpuidle/stateN/time
@@ -174,7 +175,7 @@ Description:
(a count).
======== ==== =================================================
-What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/desc
+What: /sys/devices/system/cpu/cpuX/cpuidle/state<N>/desc
Date: February 2008
KernelVersion: v2.6.25
Contact: Linux power management list <linux-pm@vger.kernel.org>
@@ -182,7 +183,7 @@ Description:
(RO) A small description about the idle state (string).
-What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/disable
+What: /sys/devices/system/cpu/cpuX/cpuidle/state<N>/disable
Date: March 2012
KernelVersion: v3.10
Contact: Linux power management list <linux-pm@vger.kernel.org>
@@ -195,14 +196,14 @@ Description:
does not reflect it. Likewise, if one enables a deep state but a
lighter state still is disabled, then this has no effect.
-What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/default_status
+What: /sys/devices/system/cpu/cpuX/cpuidle/state<N>/default_status
Date: December 2019
KernelVersion: v5.6
Contact: Linux power management list <linux-pm@vger.kernel.org>
Description:
(RO) The default status of this state, "enabled" or "disabled".
-What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/residency
+What: /sys/devices/system/cpu/cpuX/cpuidle/state<N>/residency
Date: March 2014
KernelVersion: v3.15
Contact: Linux power management list <linux-pm@vger.kernel.org>
@@ -211,7 +212,7 @@ Description:
time (in microseconds) this cpu should spend in this idle state
to make the transition worth the effort.
-What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/s2idle/
+What: /sys/devices/system/cpu/cpuX/cpuidle/state<N>/s2idle/
Date: March 2018
KernelVersion: v4.17
Contact: Linux power management list <linux-pm@vger.kernel.org>
@@ -221,7 +222,7 @@ Description:
This attribute group is only present for states that can be
used in suspend-to-idle with suspended timekeeping.
-What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/s2idle/time
+What: /sys/devices/system/cpu/cpuX/cpuidle/state<N>/s2idle/time
Date: March 2018
KernelVersion: v4.17
Contact: Linux power management list <linux-pm@vger.kernel.org>
@@ -229,7 +230,7 @@ Description:
Total time spent by the CPU in suspend-to-idle (with scheduler
tick suspended) after requesting this state.
-What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/s2idle/usage
+What: /sys/devices/system/cpu/cpuX/cpuidle/state<N>/s2idle/usage
Date: March 2018
KernelVersion: v4.17
Contact: Linux power management list <linux-pm@vger.kernel.org>
@@ -237,7 +238,7 @@ Description:
Total number of times this state has been requested by the CPU
while entering suspend-to-idle.
-What: /sys/devices/system/cpu/cpu#/cpufreq/*
+What: /sys/devices/system/cpu/cpuX/cpufreq/*
Date: pre-git history
Contact: linux-pm@vger.kernel.org
Description: Discover and change clock speed of CPUs
@@ -252,7 +253,7 @@ Description: Discover and change clock speed of CPUs
See files in Documentation/cpu-freq/ for more information.
-What: /sys/devices/system/cpu/cpu#/cpufreq/freqdomain_cpus
+What: /sys/devices/system/cpu/cpuX/cpufreq/freqdomain_cpus
Date: June 2013
Contact: linux-pm@vger.kernel.org
Description: Discover CPUs in the same CPU frequency coordination domain
@@ -295,22 +296,22 @@ Description: Processor frequency boosting control
This switch controls the boost setting for the whole system.
Boosting allows the CPU and the firmware to run at a frequency
- beyond it's nominal limit.
+ beyond its nominal limit.
More details can be found in
Documentation/admin-guide/pm/cpufreq.rst
-What: /sys/devices/system/cpu/cpu#/crash_notes
- /sys/devices/system/cpu/cpu#/crash_notes_size
+What: /sys/devices/system/cpu/cpuX/crash_notes
+ /sys/devices/system/cpu/cpuX/crash_notes_size
Date: April 2013
Contact: kexec@lists.infradead.org
Description: address and size of the percpu note.
crash_notes: the physical address of the memory that holds the
- note of cpu#.
+ note of cpuX.
- crash_notes_size: size of the note of cpu#.
+ crash_notes_size: size of the note of cpuX.
What: /sys/devices/system/cpu/intel_pstate/max_perf_pct
@@ -487,12 +488,13 @@ What: /sys/devices/system/cpu/cpuX/regs/
/sys/devices/system/cpu/cpuX/regs/identification/
/sys/devices/system/cpu/cpuX/regs/identification/midr_el1
/sys/devices/system/cpu/cpuX/regs/identification/revidr_el1
+ /sys/devices/system/cpu/cpuX/regs/identification/smidr_el1
Date: June 2016
Contact: Linux ARM Kernel Mailing list <linux-arm-kernel@lists.infradead.org>
Description: AArch64 CPU registers
'identification' directory exposes the CPU ID registers for
- identifying model and revision of the CPU.
+ identifying model and revision of the CPU and SMCU.
What: /sys/devices/system/cpu/aarch32_el0
Date: May 2021
@@ -503,12 +505,12 @@ Description: Identifies the subset of CPUs in the system that can execute
If absent, then all or none of the CPUs can execute AArch32
applications and execve() will behave accordingly.
-What: /sys/devices/system/cpu/cpu#/cpu_capacity
+What: /sys/devices/system/cpu/cpuX/cpu_capacity
Date: December 2016
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description: information about CPUs heterogeneity.
- cpu_capacity: capacity of cpu#.
+ cpu_capacity: capacity of cpuX.
What: /sys/devices/system/cpu/vulnerabilities
/sys/devices/system/cpu/vulnerabilities/meltdown
@@ -520,6 +522,8 @@ What: /sys/devices/system/cpu/vulnerabilities
/sys/devices/system/cpu/vulnerabilities/srbds
/sys/devices/system/cpu/vulnerabilities/tsx_async_abort
/sys/devices/system/cpu/vulnerabilities/itlb_multihit
+ /sys/devices/system/cpu/vulnerabilities/mmio_stale_data
+ /sys/devices/system/cpu/vulnerabilities/retbleed
Date: January 2018
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
Description: Information about CPU vulnerabilities
@@ -560,7 +564,7 @@ Description: Control Symmetric Multi Threading (SMT)
If control status is "forceoff" or "notsupported" writes
are rejected.
-What: /sys/devices/system/cpu/cpu#/power/energy_perf_bias
+What: /sys/devices/system/cpu/cpuX/power/energy_perf_bias
Date: March 2019
Contact: linux-pm@vger.kernel.org
Description: Intel Energy and Performance Bias Hint (EPB)
@@ -662,7 +666,23 @@ Description: Preferred MTE tag checking mode
================ ==============================================
"sync" Prefer synchronous mode
+ "asymm" Prefer asymmetric mode
"async" Prefer asynchronous mode
================ ==============================================
See also: Documentation/arm64/memory-tagging-extension.rst
+
+What: /sys/devices/system/cpu/nohz_full
+Date: Apr 2015
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+ (RO) the list of CPUs that are in nohz_full mode.
+ These CPUs are set by boot parameter "nohz_full=".
+
+What: /sys/devices/system/cpu/isolated
+Date: Apr 2015
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+ (RO) the list of CPUs that are isolated and don't
+ participate in load balancing. These CPUs are set by
+ boot parameter "isolcpus=".
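+
+ Both attributes are plain cpulist reads; the output below is
+ illustrative::
+
+   # cat /sys/devices/system/cpu/isolated
+   2-7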
diff --git a/Documentation/ABI/testing/sysfs-devices-vfio-dev b/Documentation/ABI/testing/sysfs-devices-vfio-dev
new file mode 100644
index 000000000000..e21424fd9666
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-vfio-dev
@@ -0,0 +1,8 @@
+What: /sys/.../<device>/vfio-dev/vfioX/
+Date: September 2022
+Contact: Yi Liu <yi.l.liu@intel.com>
+Description:
+ This directory is created when the device is bound to a
+ vfio driver. The layout under this directory matches what
+ exists for a standard 'struct device'. 'X' is a unique
+ index marking this device in vfio.
diff --git a/Documentation/ABI/testing/sysfs-driver-aspeed-uart-routing b/Documentation/ABI/testing/sysfs-driver-aspeed-uart-routing
new file mode 100644
index 000000000000..910df0e5815a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-aspeed-uart-routing
@@ -0,0 +1,27 @@
+What: /sys/bus/platform/drivers/aspeed-uart-routing/\*/uart\*
+Date: September 2021
+Contact: Oskar Senft <osk@google.com>
+ Chia-Wei Wang <chiawei_wang@aspeedtech.com>
+Description: Selects the RX source of the UARTx device.
+
+ When read, each file shows the list of available options with the
+ currently selected option marked by brackets "[]". The list of
+ available options depends on the selected file.
+
+ e.g.
+ cat /sys/bus/platform/drivers/aspeed-uart-routing/\*.uart_routing/uart1
+ [io1] io2 io3 io4 uart2 uart3 uart4 io6
+
+ In this case, UART1 gets its input from IO1 (physical serial port 1).
+
+Users: OpenBMC. Proposed changes should be mailed to
+ openbmc@lists.ozlabs.org
+
+What: /sys/bus/platform/drivers/aspeed-uart-routing/\*/io\*
+Date: September 2021
+Contact: Oskar Senft <osk@google.com>
+ Chia-Wei Wang <chiawei_wang@aspeedtech.com>
+Description: Selects the RX source of the IOx serial port. The current selection
+ will be marked by brackets "[]".
+Users: OpenBMC. Proposed changes should be mailed to
+ openbmc@lists.ozlabs.org
diff --git a/Documentation/ABI/testing/sysfs-driver-bd9571mwv-regulator b/Documentation/ABI/testing/sysfs-driver-bd9571mwv-regulator
index 42214b4ff14a..90596d8bb51c 100644
--- a/Documentation/ABI/testing/sysfs-driver-bd9571mwv-regulator
+++ b/Documentation/ABI/testing/sysfs-driver-bd9571mwv-regulator
@@ -26,6 +26,6 @@ Description: Read/write the current state of DDR Backup Mode, which controls
DDR Backup Mode must be explicitly enabled by the user,
to invoke step 1.
- See also Documentation/devicetree/bindings/mfd/bd9571mwv.txt.
+ See also Documentation/devicetree/bindings/mfd/rohm,bd9571mwv.yaml.
Users: User space applications for embedded boards equipped with a
BD9571MWV PMIC.
diff --git a/Documentation/ABI/testing/sysfs-driver-ccp b/Documentation/ABI/testing/sysfs-driver-ccp
new file mode 100644
index 000000000000..7aded9b75553
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-ccp
@@ -0,0 +1,87 @@
+What: /sys/bus/pci/devices/<BDF>/fused_part
+Date: June 2022
+KernelVersion: 5.19
+Contact: mario.limonciello@amd.com
+Description:
+ The /sys/bus/pci/devices/<BDF>/fused_part file reports
+ whether the CPU or APU has been fused to prevent tampering.
+ Possible values:
+ 0: Not fused
+ 1: Fused
+
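+ These PSP security attributes are simple reads; for example
+ (BDF illustrative)::
+
+   # cat /sys/bus/pci/devices/0000:04:00.2/fused_part
+   1
+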
+What: /sys/bus/pci/devices/<BDF>/debug_lock_on
+Date: June 2022
+KernelVersion: 5.19
+Contact: mario.limonciello@amd.com
+Description:
+ The /sys/bus/pci/devices/<BDF>/debug_lock_on reports
+ whether the AMD CPU or APU has been unlocked for debugging.
+ Possible values:
+ 0: Not locked
+ 1: Locked
+
+What: /sys/bus/pci/devices/<BDF>/tsme_status
+Date: June 2022
+KernelVersion: 5.19
+Contact: mario.limonciello@amd.com
+Description:
+ The /sys/bus/pci/devices/<BDF>/tsme_status file reports
+ the status of transparent secure memory encryption on AMD systems.
+ Possible values:
+ 0: Not active
+ 1: Active
+
+What: /sys/bus/pci/devices/<BDF>/anti_rollback_status
+Date: June 2022
+KernelVersion: 5.19
+Contact: mario.limonciello@amd.com
+Description:
+ The /sys/bus/pci/devices/<BDF>/anti_rollback_status file reports
+ whether the PSP is enforcing rollback protection.
+ Possible values:
+ 0: Not enforcing
+ 1: Enforcing
+
+What: /sys/bus/pci/devices/<BDF>/rpmc_production_enabled
+Date: June 2022
+KernelVersion: 5.19
+Contact: mario.limonciello@amd.com
+Description:
+ The /sys/bus/pci/devices/<BDF>/rpmc_production_enabled file reports
+ whether Replay Protected Monotonic Counter support has been enabled.
+ Possible values:
+ 0: Not enabled
+ 1: Enabled
+
+What: /sys/bus/pci/devices/<BDF>/rpmc_spirom_available
+Date: June 2022
+KernelVersion: 5.19
+Contact: mario.limonciello@amd.com
+Description:
+ The /sys/bus/pci/devices/<BDF>/rpmc_spirom_available file reports
+ whether a Replay Protected Monotonic Counter capable SPI is installed
+ on the system.
+ Possible values:
+ 0: Not present
+ 1: Present
+
+What: /sys/bus/pci/devices/<BDF>/hsp_tpm_available
+Date: June 2022
+KernelVersion: 5.19
+Contact: mario.limonciello@amd.com
+Description:
+ The /sys/bus/pci/devices/<BDF>/hsp_tpm_available file reports
+ whether the HSP TPM has been activated.
+ Possible values:
+ 0: Not activated or not present
+ 1: Activated
+
+What: /sys/bus/pci/devices/<BDF>/rom_armor_enforced
+Date: June 2022
+KernelVersion: 5.19
+Contact: mario.limonciello@amd.com
+Description:
+ The /sys/bus/pci/devices/<BDF>/rom_armor_enforced file reports
+ whether RomArmor SPI protection is enforced.
+ Possible values:
+ 0: Not enforced
+ 1: Enforced
diff --git a/Documentation/ABI/testing/sysfs-driver-chromeos-acpi b/Documentation/ABI/testing/sysfs-driver-chromeos-acpi
new file mode 100644
index 000000000000..c308926e1568
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-chromeos-acpi
@@ -0,0 +1,137 @@
+What: /sys/bus/platform/devices/GGL0001:*/BINF.2
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns active EC firmware of current boot (boolean).
+
+ == ===============================
+ 0 Read only (recovery) firmware.
+ 1 Rewritable firmware.
+ == ===============================
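+
+ e.g. reading the attribute (the ACPI instance suffix ":00" is
+ hypothetical):
+
+ $ cat /sys/bus/platform/devices/GGL0001:00/BINF.2
+ 1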
+
+What: /sys/bus/platform/devices/GGL0001:*/BINF.3
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns main firmware type for current boot (integer).
+
+ == =====================================
+ 0 Recovery.
+ 1 Normal.
+ 2 Developer.
+ 3 Netboot (factory installation only).
+ == =====================================
+
+What: /sys/bus/platform/devices/GGL0001:*/CHSW
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns switch position for Chrome OS specific hardware
+ switches when the firmware is booted (integer).
+
+ ==== ===========================================
+ 0 No changes.
+ 2 Recovery button was pressed.
+ 4 Recovery button was pressed (EC firmware).
+ 32 Developer switch was enabled.
+ 512 Firmware write protection was disabled.
+ ==== ===========================================
+
+What: /sys/bus/platform/devices/GGL0001:*/FMAP
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns physical memory address of the start of the main
+ processor firmware flashmap.
+
+What: /sys/bus/platform/devices/GGL0001:*/FRID
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns firmware version for the read-only portion of the
+ main processor firmware.
+
+What: /sys/bus/platform/devices/GGL0001:*/FWID
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns firmware version for the rewritable portion of the
+ main processor firmware.
+
+What: /sys/bus/platform/devices/GGL0001:*/GPIO.X/GPIO.0
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns type of the GPIO signal for the Chrome OS specific
+ GPIO assignments (integer).
+
+ =========== ==================================
+ 1 Recovery button.
+ 2 Developer mode switch.
+ 3 Firmware write protection switch.
+ 256 to 511 Debug header GPIO 0 to GPIO 255.
+ =========== ==================================
+
+What: /sys/bus/platform/devices/GGL0001:*/GPIO.X/GPIO.1
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns signal attributes of the GPIO signal (integer bitfield).
+
+ == =======================
+ 0 Signal is active low.
+ 1 Signal is active high.
+ == =======================
+
+What: /sys/bus/platform/devices/GGL0001:*/GPIO.X/GPIO.2
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns the GPIO number on the specified GPIO
+ controller.
+
+What: /sys/bus/platform/devices/GGL0001:*/GPIO.X/GPIO.3
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns name of the GPIO controller.
+
+What: /sys/bus/platform/devices/GGL0001:*/HWID
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns hardware ID for the Chromebook.
+
+What: /sys/bus/platform/devices/GGL0001:*/MECK
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns the SHA-1 or SHA-256 hash that is read out of the
+ Management Engine extended registers during boot. The hash
+ is exported via ACPI so the OS can verify that the Management
+ Engine firmware has not changed. If Management Engine is not
+ present, or if the firmware was unable to read the extended
+ registers, this buffer size can be zero.
+
+What: /sys/bus/platform/devices/GGL0001:*/VBNV.0
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns offset in CMOS bank 0 of the verified boot non-volatile
+ storage block, counting from the first writable CMOS byte
+ (that is, 'offset = 0' is the byte following the 14 bytes of
+ clock data).
+
+What: /sys/bus/platform/devices/GGL0001:*/VBNV.1
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns the size in bytes of the verified boot non-volatile
+ storage block.
+
+What: /sys/bus/platform/devices/GGL0001:*/VDAT
+Date: May 2022
+KernelVersion: 5.19
+Description:
+ Returns the verified boot data block shared between the
+ firmware verification step and the kernel verification step
+ (binary).
diff --git a/Documentation/ABI/testing/sysfs-driver-eud b/Documentation/ABI/testing/sysfs-driver-eud
new file mode 100644
index 000000000000..83f3872182a4
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-eud
@@ -0,0 +1,9 @@
+What: /sys/bus/platform/drivers/eud/.../enable
+Date: February 2022
+Contact: Souradeep Chowdhury <quic_schowdhu@quicinc.com>
+Description:
+ The enable/disable sysfs interface for the Embedded
+ USB Debugger (EUD). Writing 1 enables the EUD and
+ writing 0 disables it. Enabling the EUD activates
+ the mini-USB hub of the EUD for debug and trace
+ capabilities.
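+
+ e.g. enabling the EUD (a sketch; "..." stands for the device
+ instance):
+
+ echo 1 > /sys/bus/platform/drivers/eud/.../enable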
diff --git a/Documentation/ABI/testing/sysfs-driver-habanalabs b/Documentation/ABI/testing/sysfs-driver-habanalabs
index 1f127f71d2b4..1b98b6503b23 100644
--- a/Documentation/ABI/testing/sysfs-driver-habanalabs
+++ b/Documentation/ABI/testing/sysfs-driver-habanalabs
@@ -16,7 +16,7 @@ Description: Version of the application running on the device's CPU
What: /sys/class/habanalabs/hl<n>/clk_max_freq_mhz
Date: Jun 2019
-KernelVersion: not yet upstreamed
+KernelVersion: 5.7
Contact: ogabbay@kernel.org
Description: Allows the user to set the maximum clock frequency, in MHz.
The device clock might be set to lower value than the maximum.
@@ -26,7 +26,7 @@ Description: Allows the user to set the maximum clock frequency, in MHz.
What: /sys/class/habanalabs/hl<n>/clk_cur_freq_mhz
Date: Jun 2019
-KernelVersion: not yet upstreamed
+KernelVersion: 5.7
Contact: ogabbay@kernel.org
Description: Displays the current frequency, in MHz, of the device clock.
This property is valid only for the Gaudi ASIC family
@@ -69,6 +69,12 @@ KernelVersion: 5.1
Contact: ogabbay@kernel.org
Description: Displays the device's version from the eFuse
+What: /sys/class/habanalabs/hl<n>/fw_os_ver
+Date: Dec 2021
+KernelVersion: 5.18
+Contact: ogabbay@kernel.org
+Description: Version of the firmware OS running on the device's CPU
+
What: /sys/class/habanalabs/hl<n>/hard_reset
Date: Jan 2019
KernelVersion: 5.1
@@ -115,7 +121,7 @@ What: /sys/class/habanalabs/hl<n>/infineon_ver
Date: Jan 2019
KernelVersion: 5.1
Contact: ogabbay@kernel.org
-Description: Version of the Device's power supply F/W code
+Description: Version of the Device's power supply F/W code. Relevant only to GOYA and GAUDI
What: /sys/class/habanalabs/hl<n>/max_power
Date: Jan 2019
@@ -170,6 +176,12 @@ KernelVersion: 5.1
Contact: ogabbay@kernel.org
Description: Version of the device's preboot F/W code
+What: /sys/class/habanalabs/hl<n>/security_enabled
+Date: Oct 2022
+KernelVersion: 6.1
+Contact: obitton@habana.ai
+Description: Displays the device's security status
+
What: /sys/class/habanalabs/hl<n>/soft_reset
Date: Jan 2019
KernelVersion: 5.1
@@ -189,7 +201,19 @@ What: /sys/class/habanalabs/hl<n>/status
Date: Jan 2019
KernelVersion: 5.1
Contact: ogabbay@kernel.org
-Description: Status of the card: "Operational", "Malfunction", "In reset".
+Description: Status of the card:
+
+ * "operational" - Device is available for work.
+ * "in reset" - Device is going through reset, will be
+ available shortly.
+ * "disabled" - Device is not usable.
+ * "needs reset" - Device is not usable until a hard reset
+ is initiated.
+ * "in device creation" - Device is not available yet, as it
+ is still initializing.
+ * "in reset after device release" - Device is going through
+ a compute-reset which is executed after a device release
+ (relevant for Gaudi2 only).
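+
+ e.g. ("hl0" is a hypothetical device instance):
+
+ cat /sys/class/habanalabs/hl0/status
+ operational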
What: /sys/class/habanalabs/hl<n>/thermal_ver
Date: Jan 2019
@@ -220,4 +244,10 @@ What: /sys/class/habanalabs/hl<n>/uboot_ver
Date: Jan 2019
KernelVersion: 5.1
Contact: ogabbay@kernel.org
-Description: Version of the u-boot running on the device's CPU
\ No newline at end of file
+Description: Version of the u-boot running on the device's CPU
+
+What: /sys/class/habanalabs/hl<n>/vrm_ver
+Date: Jan 2022
+KernelVersion: 5.17
+Contact: ogabbay@kernel.org
+Description: Version of the Device's Voltage Regulator Monitor F/W code. N/A to GOYA and GAUDI
diff --git a/Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon b/Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon
new file mode 100644
index 000000000000..8d7d8f05f6cd
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-intel-i915-hwmon
@@ -0,0 +1,77 @@
+What: /sys/devices/.../hwmon/hwmon<i>/in0_input
+Date: February 2023
+KernelVersion: 6.2
+Contact: intel-gfx@lists.freedesktop.org
+Description: RO. Current Voltage in millivolt.
+
+ Only supported for particular Intel i915 graphics platforms.
+
+What: /sys/devices/.../hwmon/hwmon<i>/power1_max
+Date: February 2023
+KernelVersion: 6.2
+Contact: intel-gfx@lists.freedesktop.org
+Description: RW. Card reactive sustained (PL1/Tau) power limit in microwatts.
+
+ The power controller will throttle the operating frequency
+ if the power averaged over a window (typically seconds)
+ exceeds this limit. A read value of 0 means that the PL1
+ power limit is disabled; writing 0 disables the limit.
+ Writing a value > 0 enables the power limit.
+
+ Only supported for particular Intel i915 graphics platforms.
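+
+ e.g. setting a 25 W (25000000 uW) sustained limit (a sketch;
+ the hwmon index and device path vary by system):
+
+ # echo 25000000 > /sys/devices/.../hwmon/hwmon2/power1_max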
+
+What: /sys/devices/.../hwmon/hwmon<i>/power1_rated_max
+Date: February 2023
+KernelVersion: 6.2
+Contact: intel-gfx@lists.freedesktop.org
+Description: RO. Card default power limit (default TDP setting).
+
+ Only supported for particular Intel i915 graphics platforms.
+
+What: /sys/devices/.../hwmon/hwmon<i>/power1_max_interval
+Date: February 2023
+KernelVersion: 6.2
+Contact: intel-gfx@lists.freedesktop.org
+Description: RW. Sustained power limit interval (Tau in PL1/Tau) in
+ milliseconds over which sustained power is averaged.
+
+ Only supported for particular Intel i915 graphics platforms.
+
+What: /sys/devices/.../hwmon/hwmon<i>/power1_crit
+Date: February 2023
+KernelVersion: 6.2
+Contact: intel-gfx@lists.freedesktop.org
+Description: RW. Card reactive critical (I1) power limit in microwatts.
+
+ Card reactive critical (I1) power limit in microwatts is exposed
+ for client products. The power controller will throttle the
+ operating frequency if the power averaged over a window exceeds
+ this limit.
+
+ Only supported for particular Intel i915 graphics platforms.
+
+What: /sys/devices/.../hwmon/hwmon<i>/curr1_crit
+Date: February 2023
+KernelVersion: 6.2
+Contact: intel-gfx@lists.freedesktop.org
+Description: RW. Card reactive critical (I1) power limit in milliamperes.
+
+ Card reactive critical (I1) power limit in milliamperes is
+ exposed for server products. The power controller will throttle
+ the operating frequency if the power averaged over a window
+ exceeds this limit.
+
+ Only supported for particular Intel i915 graphics platforms.
+
+What: /sys/devices/.../hwmon/hwmon<i>/energy1_input
+Date: February 2023
+KernelVersion: 6.2
+Contact: intel-gfx@lists.freedesktop.org
+Description: RO. Energy input of the device or gt in microjoules.
+
+ For i915 device level hwmon devices (name "i915") this
+ reflects energy input for the entire device. For gt level
+ hwmon devices (name "i915_gtN") this reflects energy input
+ for the gt.
+
+ Only supported for particular Intel i915 graphics platforms.
diff --git a/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc b/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc
index 9773925138af..a8ab58035c95 100644
--- a/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc
+++ b/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc
@@ -1,4 +1,4 @@
-What: /sys/bus/spi/devices/.../bmc_version
+What: /sys/bus/.../drivers/intel-m10-bmc/.../bmc_version
Date: June 2020
KernelVersion: 5.10
Contact: Xu Yilun <yilun.xu@intel.com>
@@ -6,7 +6,7 @@ Description: Read only. Returns the hardware build version of Intel
MAX10 BMC chip.
Format: "0x%x".
-What: /sys/bus/spi/devices/.../bmcfw_version
+What: /sys/bus/.../drivers/intel-m10-bmc/.../bmcfw_version
Date: June 2020
KernelVersion: 5.10
Contact: Xu Yilun <yilun.xu@intel.com>
@@ -14,7 +14,7 @@ Description: Read only. Returns the firmware version of Intel MAX10
BMC chip.
Format: "0x%x".
-What: /sys/bus/spi/devices/.../mac_address
+What: /sys/bus/.../drivers/intel-m10-bmc/.../mac_address
Date: January 2021
KernelVersion: 5.12
Contact: Russ Weight <russell.h.weight@intel.com>
@@ -25,7 +25,7 @@ Description: Read only. Returns the first MAC address in a block
space.
Format: "%02x:%02x:%02x:%02x:%02x:%02x".
-What: /sys/bus/spi/devices/.../mac_count
+What: /sys/bus/.../drivers/intel-m10-bmc/.../mac_count
Date: January 2021
KernelVersion: 5.12
Contact: Russ Weight <russell.h.weight@intel.com>
diff --git a/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-sec-update b/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-sec-update
new file mode 100644
index 000000000000..0a41afe0ab4c
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-intel-m10-bmc-sec-update
@@ -0,0 +1,61 @@
+What: /sys/bus/platform/drivers/intel-m10bmc-sec-update/.../security/sr_root_entry_hash
+Date: Sep 2022
+KernelVersion: 5.20
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: Read only. Returns the root entry hash for the static
+ region if one is programmed, else it returns the
+ string: "hash not programmed". This file is only
+ visible if the underlying device supports it.
+ Format: string.
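+
+ e.g. on a device with no programmed hash (a sketch; "..." is
+ the device instance):
+
+ $ cat .../security/sr_root_entry_hash
+ hash not programmed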
+
+What: /sys/bus/platform/drivers/intel-m10bmc-sec-update/.../security/pr_root_entry_hash
+Date: Sep 2022
+KernelVersion: 5.20
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: Read only. Returns the root entry hash for the partial
+ reconfiguration region if one is programmed, else it
+ returns the string: "hash not programmed". This file
+ is only visible if the underlying device supports it.
+ Format: string.
+
+What: /sys/bus/platform/drivers/intel-m10bmc-sec-update/.../security/bmc_root_entry_hash
+Date: Sep 2022
+KernelVersion: 5.20
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: Read only. Returns the root entry hash for the BMC image
+ if one is programmed, else it returns the string:
+ "hash not programmed". This file is only visible if the
+ underlying device supports it.
+ Format: string.
+
+What: /sys/bus/platform/drivers/intel-m10bmc-sec-update/.../security/sr_canceled_csks
+Date: Sep 2022
+KernelVersion: 5.20
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: Read only. Returns a list of indices for canceled code
+ signing keys for the static region. The standard bitmap
+ list format is used (e.g. "1,2-6,9").
+
+What: /sys/bus/platform/drivers/intel-m10bmc-sec-update/.../security/pr_canceled_csks
+Date: Sep 2022
+KernelVersion: 5.20
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: Read only. Returns a list of indices for canceled code
+ signing keys for the partial reconfiguration region. The
+ standard bitmap list format is used (e.g. "1,2-6,9").
+
+What: /sys/bus/platform/drivers/intel-m10bmc-sec-update/.../security/bmc_canceled_csks
+Date: Sep 2022
+KernelVersion: 5.20
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: Read only. Returns a list of indices for canceled code
+ signing keys for the BMC. The standard bitmap list format
+ is used (e.g. "1,2-6,9").
+
+What: /sys/bus/platform/drivers/intel-m10bmc-sec-update/.../security/flash_count
+Date: Sep 2022
+KernelVersion: 5.20
+Contact: Russ Weight <russell.h.weight@intel.com>
+Description: Read only. Returns number of times the secure update
+ staging area has been flashed.
+ Format: "%u".
diff --git a/Documentation/ABI/testing/sysfs-driver-intel_sdsi b/Documentation/ABI/testing/sysfs-driver-intel_sdsi
new file mode 100644
index 000000000000..f8afed127107
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-intel_sdsi
@@ -0,0 +1,90 @@
+What: /sys/bus/auxiliary/devices/intel_vsec.sdsi.X
+Date: Feb 2022
+KernelVersion: 5.18
+Contact: "David E. Box" <david.e.box@linux.intel.com>
+Description:
+ This directory contains interface files for accessing Intel
+ On Demand (formerly Software Defined Silicon or SDSi) features
+ on a CPU. X represents the socket instance (though not the
+ socket ID). The socket ID is determined by reading the
+ registers file and decoding it per the specification.
+
+ Some files communicate with On Demand hardware through a
+ mailbox. Should the operation fail, one of the following error
+ codes may be returned:
+
+ ========== =====
+ Error Code Cause
+ ========== =====
+ EIO General mailbox failure. Log may indicate cause.
+ EBUSY Mailbox is owned by another agent.
+ EPERM On Demand capability is not enabled in hardware.
+ EPROTO Failure in mailbox protocol detected by driver.
+ See log for details.
+ EOVERFLOW For provision commands, the size of the data
+ exceeds what may be written.
+ ESPIPE Seeking is not allowed.
+ ETIMEDOUT Failure to complete mailbox transaction in time.
+ ========== =====
+
+What: /sys/bus/auxiliary/devices/intel_vsec.sdsi.X/guid
+Date: Feb 2022
+KernelVersion: 5.18
+Contact: "David E. Box" <david.e.box@linux.intel.com>
+Description:
+ (RO) The GUID for the registers file. The GUID identifies
+ the layout of the registers file in this directory.
+ Information about the register layouts for a particular GUID
+ is available at http://github.com/intel/intel-sdsi
+
+What: /sys/bus/auxiliary/devices/intel_vsec.sdsi.X/registers
+Date: Feb 2022
+KernelVersion: 5.18
+Contact: "David E. Box" <david.e.box@linux.intel.com>
+Description:
+ (RO) Contains information needed by applications to provision
+ a CPU and monitor status information. The layout of this file
+ is determined by the GUID in this directory. Information about
+ the layout for a particular GUID is available at
+ http://github.com/intel/intel-sdsi
+
+What: /sys/bus/auxiliary/devices/intel_vsec.sdsi.X/provision_akc
+Date: Feb 2022
+KernelVersion: 5.18
+Contact: "David E. Box" <david.e.box@linux.intel.com>
+Description:
+ (WO) Used to write an Authentication Key Certificate (AKC) to
+ the On Demand NVRAM for the CPU. The AKC is used to authenticate
+ a Capability Activation Payload. Mailbox command.
+
+What: /sys/bus/auxiliary/devices/intel_vsec.sdsi.X/provision_cap
+Date: Feb 2022
+KernelVersion: 5.18
+Contact: "David E. Box" <david.e.box@linux.intel.com>
+Description:
+ (WO) Used to write a Capability Activation Payload (CAP) to the
+ On Demand NVRAM for the CPU. CAPs are used to activate a given
+ CPU feature. A CAP is validated by On Demand hardware using a
+ previously provisioned AKC file. Upon successful authentication,
+ the CPU configuration is updated. A cold reboot is required to
+ fully activate the feature. Mailbox command.
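+
+ A minimal provisioning sketch (the file names and socket
+ instance "0" are hypothetical):
+
+ # dd if=akc.bin of=/sys/bus/auxiliary/devices/intel_vsec.sdsi.0/provision_akc
+ # dd if=cap.bin of=/sys/bus/auxiliary/devices/intel_vsec.sdsi.0/provision_cap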
+
+What: /sys/bus/auxiliary/devices/intel_vsec.sdsi.X/meter_certificate
+Date: Nov 2022
+KernelVersion: 6.2
+Contact: "David E. Box" <david.e.box@linux.intel.com>
+Description:
+ (RO) Used to read back the current meter certificate for the CPU
+ from Intel On Demand hardware. The meter certificate contains
+ utilization metrics of On Demand enabled features. Mailbox
+ command.
+
+What: /sys/bus/auxiliary/devices/intel_vsec.sdsi.X/state_certificate
+Date: Feb 2022
+KernelVersion: 5.18
+Contact: "David E. Box" <david.e.box@linux.intel.com>
+Description:
+ (RO) Used to read back the current state certificate for the CPU
+ from On Demand hardware. The state certificate contains
+ information about the current licenses on the CPU. Mailbox
+ command.
diff --git a/Documentation/ABI/testing/sysfs-driver-qat b/Documentation/ABI/testing/sysfs-driver-qat
new file mode 100644
index 000000000000..087842b1969e
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-qat
@@ -0,0 +1,49 @@
+What: /sys/bus/pci/devices/<BDF>/qat/state
+Date: June 2022
+KernelVersion: 6.0
+Contact: qat-linux@intel.com
+Description: (RW) Reports the current state of the QAT device. Write to
+ the file to start or stop the device.
+
+ The values are:
+
+ * up: the device is up and running
+ * down: the device is down
+
+ The device can be transitioned from up to down only if it is
+ currently up, and vice versa.
+
+ This attribute is only available for qat_4xxx devices.
+
+What: /sys/bus/pci/devices/<BDF>/qat/cfg_services
+Date: June 2022
+KernelVersion: 6.0
+Contact: qat-linux@intel.com
+Description: (RW) Reports the current configuration of the QAT device.
+ Write to the file to change the configured services.
+
+ The values are:
+
+ * sym;asym: the device is configured for running crypto
+ services
+ * dc: the device is configured for running compression services
+
+ It is possible to set the configuration only if the device
+ is in the `down` state (see /sys/bus/pci/devices/<BDF>/qat/state).
+
+ The following example shows how to change the configuration of
+ a device configured for running crypto services in order to
+ run data compression::
+
+ # cat /sys/bus/pci/devices/<BDF>/qat/state
+ up
+ # cat /sys/bus/pci/devices/<BDF>/qat/cfg_services
+ sym;asym
+ # echo down > /sys/bus/pci/devices/<BDF>/qat/state
+ # echo dc > /sys/bus/pci/devices/<BDF>/qat/cfg_services
+ # echo up > /sys/bus/pci/devices/<BDF>/qat/state
+ # cat /sys/bus/pci/devices/<BDF>/qat/cfg_services
+ dc
+
+ This attribute is only available for qat_4xxx devices.
diff --git a/Documentation/ABI/testing/sysfs-driver-typec-displayport b/Documentation/ABI/testing/sysfs-driver-typec-displayport
index 231471ad0d4b..256c87c5219a 100644
--- a/Documentation/ABI/testing/sysfs-driver-typec-displayport
+++ b/Documentation/ABI/testing/sysfs-driver-typec-displayport
@@ -47,3 +47,18 @@ Description:
USB SuperSpeed protocol. From user perspective pin assignments C
and E are equal, where all channels on the connector are used
for carrying DisplayPort protocol (allowing higher resolutions).
+
+What: /sys/bus/typec/devices/.../displayport/hpd
+Date: Dec 2022
+Contact: Badhri Jagan Sridharan <badhri@google.com>
+Description:
+ VESA DisplayPort Alt Mode on USB Type-C Standard defines how
+ HotPlugDetect(HPD) shall be supported on the USB-C connector when
+ operating in DisplayPort Alt Mode. This is a read only node which
+ reflects the current state of HPD.
+
+ Valid values:
+ - 1: when HPD’s logical state is high (HPD_High) as defined
+ by VESA DisplayPort Alt Mode on USB Type-C Standard.
+ - 0: when HPD’s logical state is low (HPD_Low) as defined by
+ VESA DisplayPort Alt Mode on USB Type-C Standard.
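+
+ e.g. while a DisplayPort partner is attached and HPD is high
+ (a sketch; the device path varies):
+
+ $ cat /sys/bus/typec/devices/.../displayport/hpd
+ 1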
diff --git a/Documentation/ABI/testing/sysfs-driver-uacce b/Documentation/ABI/testing/sysfs-driver-uacce
index 08f2591138af..d3f0b8f3c589 100644
--- a/Documentation/ABI/testing/sysfs-driver-uacce
+++ b/Documentation/ABI/testing/sysfs-driver-uacce
@@ -19,6 +19,24 @@ Contact: linux-accelerators@lists.ozlabs.org
Description: Available instances left of the device
Return -ENODEV if uacce_ops get_available_instances is not provided
+What: /sys/class/uacce/<dev_name>/isolate_strategy
+Date: Nov 2022
+KernelVersion: 6.1
+Contact: linux-accelerators@lists.ozlabs.org
+Description: (RW) Configures the error threshold for the hardware isolation
+ strategy. The value is the number of hardware errors allowed
+ within one hour before the device is isolated. The default is 0,
+ which means the device is never isolated. The maximum value is
+ 65535. Write a threshold appropriate for your hardware.
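+
+ e.g. allow at most ten hardware errors per hour before the
+ device is isolated (a sketch):
+
+ echo 10 > /sys/class/uacce/<dev_name>/isolate_strategy
+ cat /sys/class/uacce/<dev_name>/isolate
+ 0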
+
+What: /sys/class/uacce/<dev_name>/isolate
+Date: Nov 2022
+KernelVersion: 6.1
+Contact: linux-accelerators@lists.ozlabs.org
+Description: (RO) Reads the device isolation state. The value 1 means the
+ device is isolated (unavailable); 0 means the device is
+ available.
+
What: /sys/class/uacce/<dev_name>/algorithms
Date: Feb 2020
KernelVersion: 5.7
diff --git a/Documentation/ABI/testing/sysfs-driver-ufs b/Documentation/ABI/testing/sysfs-driver-ufs
index ec3a7149ced5..228aa43e14ed 100644
--- a/Documentation/ABI/testing/sysfs-driver-ufs
+++ b/Documentation/ABI/testing/sysfs-driver-ufs
@@ -13,6 +13,7 @@ Description:
Interface specification for more details.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/device_type
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/device_type
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the device type. This is one of the UFS
@@ -22,6 +23,7 @@ Description: This file shows the device type. This is one of the UFS
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/device_class
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/device_class
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the device class. This is one of the UFS
@@ -31,6 +33,7 @@ Description: This file shows the device class. This is one of the UFS
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/device_sub_class
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/device_sub_class
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the UFS storage subclass. This is one of
@@ -40,6 +43,7 @@ Description: This file shows the UFS storage subclass. This is one of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/protocol
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/protocol
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the protocol supported by an UFS device.
@@ -50,6 +54,7 @@ Description: This file shows the protocol supported by an UFS device.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/number_of_luns
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/number_of_luns
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows number of logical units. This is one of
@@ -59,6 +64,7 @@ Description: This file shows number of logical units. This is one of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/number_of_wluns
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/number_of_wluns
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows number of well known logical units.
@@ -69,6 +75,7 @@ Description: This file shows number of well known logical units.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/boot_enable
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/boot_enable
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows value that indicates whether the device is
@@ -79,6 +86,7 @@ Description: This file shows value that indicates whether the device is
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/descriptor_access_enable
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/descriptor_access_enable
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows value that indicates whether the device
@@ -90,6 +98,7 @@ Description: This file shows value that indicates whether the device
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/initial_power_mode
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/initial_power_mode
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows value that defines the power mode after
@@ -100,6 +109,7 @@ Description: This file shows value that defines the power mode after
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/high_priority_lun
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/high_priority_lun
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the high priority lun. This is one of
@@ -109,6 +119,7 @@ Description: This file shows the high priority lun. This is one of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/secure_removal_type
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/secure_removal_type
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the secure removal type. This is one of
@@ -118,6 +129,7 @@ Description: This file shows the secure removal type. This is one of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/support_security_lun
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/support_security_lun
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows whether the security lun is supported.
@@ -128,6 +140,7 @@ Description: This file shows whether the security lun is supported.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/bkops_termination_latency
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/bkops_termination_latency
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the background operations termination
@@ -138,6 +151,7 @@ Description: This file shows the background operations termination
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/initial_active_icc_level
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/initial_active_icc_level
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the initial active ICC level. This is one
@@ -147,6 +161,7 @@ Description: This file shows the initial active ICC level. This is one
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/specification_version
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/specification_version
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the specification version. This is one
@@ -156,6 +171,7 @@ Description: This file shows the specification version. This is one
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/manufacturing_date
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/manufacturing_date
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the manufacturing date in BCD format.
@@ -166,6 +182,7 @@ Description: This file shows the manufacturing date in BCD format.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/manufacturer_id
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/manufacturer_id
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the manufacturer ID. This is one of the
@@ -175,6 +192,7 @@ Description: This file shows the manufacturer ID. This is one of the
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/rtt_capability
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/rtt_capability
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the maximum number of outstanding RTTs
@@ -185,6 +203,7 @@ Description: This file shows the maximum number of outstanding RTTs
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/rtc_update
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/rtc_update
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the frequency and method of the realtime
@@ -195,6 +214,7 @@ Description: This file shows the frequency and method of the realtime
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/ufs_features
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/ufs_features
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows which features are supported by the device.
@@ -205,6 +225,7 @@ Description: This file shows which features are supported by the device.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/ffu_timeout
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/ffu_timeout
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the FFU timeout. This is one of the
@@ -214,6 +235,7 @@ Description: This file shows the FFU timeout. This is one of the
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/queue_depth
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/queue_depth
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the device queue depth. This is one of the
@@ -223,6 +245,7 @@ Description: This file shows the device queue depth. This is one of the
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/device_version
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/device_version
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the device version. This is one of the
@@ -232,6 +255,7 @@ Description: This file shows the device version. This is one of the
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/number_of_secure_wpa
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/number_of_secure_wpa
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows number of secure write protect areas
@@ -242,6 +266,7 @@ Description: This file shows number of secure write protect areas
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/psa_max_data_size
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/psa_max_data_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the maximum amount of data that may be
@@ -253,6 +278,7 @@ Description: This file shows the maximum amount of data that may be
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/psa_state_timeout
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/psa_state_timeout
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the command maximum timeout for a change
@@ -264,6 +290,7 @@ Description: This file shows the command maximum timeout for a change
What: /sys/bus/platform/drivers/ufshcd/*/interconnect_descriptor/unipro_version
+What: /sys/bus/platform/devices/*.ufs/interconnect_descriptor/unipro_version
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the MIPI UniPro version number in BCD format.
@@ -274,6 +301,7 @@ Description: This file shows the MIPI UniPro version number in BCD format.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/interconnect_descriptor/mphy_version
+What: /sys/bus/platform/devices/*.ufs/interconnect_descriptor/mphy_version
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the MIPI M-PHY version number in BCD format.
@@ -285,6 +313,7 @@ Description: This file shows the MIPI M-PHY version number in BCD format.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/raw_device_capacity
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/raw_device_capacity
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the total memory quantity available to
@@ -296,6 +325,7 @@ Description: This file shows the total memory quantity available to
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/max_number_of_luns
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/max_number_of_luns
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the maximum number of logical units
@@ -306,6 +336,7 @@ Description: This file shows the maximum number of logical units
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/segment_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/segment_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the segment size. This is one of the UFS
@@ -315,6 +346,7 @@ Description: This file shows the segment size. This is one of the UFS
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/allocation_unit_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/allocation_unit_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the allocation unit size. This is one of
@@ -324,6 +356,7 @@ Description: This file shows the allocation unit size. This is one of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/min_addressable_block_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/min_addressable_block_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the minimum addressable block size. This
@@ -334,6 +367,7 @@ Description: This file shows the minimum addressable block size. This
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/optimal_read_block_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/optimal_read_block_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the optimal read block size. This is one
@@ -344,6 +378,7 @@ Description: This file shows the optimal read block size. This is one
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/optimal_write_block_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/optimal_write_block_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the optimal write block size. This is one
@@ -354,6 +389,7 @@ Description: This file shows the optimal write block size. This is one
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/max_in_buffer_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/max_in_buffer_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the maximum data-in buffer size. This
@@ -364,6 +400,7 @@ Description: This file shows the maximum data-in buffer size. This
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/max_out_buffer_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/max_out_buffer_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the maximum data-out buffer size. This
@@ -374,6 +411,7 @@ Description: This file shows the maximum data-out buffer size. This
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/rpmb_rw_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/rpmb_rw_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the maximum number of RPMB frames allowed
@@ -384,6 +422,7 @@ Description: This file shows the maximum number of RPMB frames allowed
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/dyn_capacity_resource_policy
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/dyn_capacity_resource_policy
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the dynamic capacity resource policy. This
@@ -394,6 +433,7 @@ Description: This file shows the dynamic capacity resource policy. This
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/data_ordering
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/data_ordering
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows support for out-of-order data transfer.
@@ -404,6 +444,7 @@ Description: This file shows support for out-of-order data transfer.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/max_number_of_contexts
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/max_number_of_contexts
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows maximum available number of contexts which
@@ -414,6 +455,7 @@ Description: This file shows maximum available number of contexts which
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/sys_data_tag_unit_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/sys_data_tag_unit_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows system data tag unit size. This is one of
@@ -423,6 +465,7 @@ Description: This file shows system data tag unit size. This is one of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/sys_data_tag_resource_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/sys_data_tag_resource_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows maximum storage area size allocated by
@@ -434,6 +477,7 @@ Description: This file shows maximum storage area size allocated by
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/secure_removal_types
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/secure_removal_types
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows supported secure removal types. This is
@@ -444,6 +488,7 @@ Description: This file shows supported secure removal types. This is
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/memory_types
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/memory_types
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows supported memory types. This is one of
@@ -454,6 +499,7 @@ Description: This file shows supported memory types. This is one of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/*_memory_max_alloc_units
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/*_memory_max_alloc_units
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the maximum number of allocation units for
@@ -465,6 +511,7 @@ Description: This file shows the maximum number of allocation units for
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/*_memory_capacity_adjustment_factor
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/*_memory_capacity_adjustment_factor
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the memory capacity adjustment factor for
@@ -477,6 +524,7 @@ Description: This file shows the memory capacity adjustment factor for
What: /sys/bus/platform/drivers/ufshcd/*/health_descriptor/eol_info
+What: /sys/bus/platform/devices/*.ufs/health_descriptor/eol_info
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows pre-end of life information. This is one
@@ -487,6 +535,7 @@ Description: This file shows preend of life information. This is one
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/health_descriptor/life_time_estimation_a
+What: /sys/bus/platform/devices/*.ufs/health_descriptor/life_time_estimation_a
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows indication of the device life time
@@ -497,6 +546,7 @@ Description: This file shows indication of the device life time
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/health_descriptor/life_time_estimation_b
+What: /sys/bus/platform/devices/*.ufs/health_descriptor/life_time_estimation_b
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows indication of the device life time
@@ -508,6 +558,7 @@ Description: This file shows indication of the device life time
What: /sys/bus/platform/drivers/ufshcd/*/power_descriptor/active_icc_levels_vcc*
+What: /sys/bus/platform/devices/*.ufs/power_descriptor/active_icc_levels_vcc*
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows maximum VCC, VCCQ and VCCQ2 value for
@@ -519,6 +570,7 @@ Description: This file shows maximum VCC, VCCQ and VCCQ2 value for
What: /sys/bus/platform/drivers/ufshcd/*/string_descriptors/manufacturer_name
+What: /sys/bus/platform/devices/*.ufs/string_descriptors/manufacturer_name
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file contains a device manufacturer name string.
@@ -528,6 +580,7 @@ Description: This file contains a device manufacturer name string.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/string_descriptors/product_name
+What: /sys/bus/platform/devices/*.ufs/string_descriptors/product_name
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file contains a product name string. The full information
@@ -536,6 +589,7 @@ Description: This file contains a product name string. The full information
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/string_descriptors/oem_id
+What: /sys/bus/platform/devices/*.ufs/string_descriptors/oem_id
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file contains a OEM ID string. The full information
@@ -544,6 +598,7 @@ Description: This file contains a OEM ID string. The full information
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/string_descriptors/serial_number
+What: /sys/bus/platform/devices/*.ufs/string_descriptors/serial_number
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file contains a device serial number string. The full
@@ -553,6 +608,7 @@ Description: This file contains a device serial number string. The full
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/string_descriptors/product_revision
+What: /sys/bus/platform/devices/*.ufs/string_descriptors/product_revision
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file contains a product revision string. The full
@@ -684,6 +740,7 @@ Description: This file shows the granularity of the LUN. This is one of
What: /sys/bus/platform/drivers/ufshcd/*/flags/device_init
+What: /sys/bus/platform/devices/*.ufs/flags/device_init
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the device init status. The full information
@@ -692,6 +749,7 @@ Description: This file shows the device init status. The full information
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/flags/permanent_wpe
+What: /sys/bus/platform/devices/*.ufs/flags/permanent_wpe
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows whether permanent write protection is enabled.
@@ -701,6 +759,7 @@ Description: This file shows whether permanent write protection is enabled.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/flags/power_on_wpe
+What: /sys/bus/platform/devices/*.ufs/flags/power_on_wpe
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows whether write protection is enabled on all
@@ -711,6 +770,7 @@ Description: This file shows whether write protection is enabled on all
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/flags/bkops_enable
+What: /sys/bus/platform/devices/*.ufs/flags/bkops_enable
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows whether the device background operations are
@@ -720,6 +780,7 @@ Description: This file shows whether the device background operations are
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/flags/life_span_mode_enable
+What: /sys/bus/platform/devices/*.ufs/flags/life_span_mode_enable
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows whether the device life span mode is enabled.
@@ -729,6 +790,7 @@ Description: This file shows whether the device life span mode is enabled.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/flags/phy_resource_removal
+What: /sys/bus/platform/devices/*.ufs/flags/phy_resource_removal
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows whether physical resource removal is enabled.
@@ -738,6 +800,7 @@ Description: This file shows whether physical resource removal is enable.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/flags/busy_rtc
+What: /sys/bus/platform/devices/*.ufs/flags/busy_rtc
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows whether the device is executing internal
@@ -747,6 +810,7 @@ Description: This file shows whether the device is executing internal
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/flags/disable_fw_update
+What: /sys/bus/platform/devices/*.ufs/flags/disable_fw_update
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows whether the device FW update is permanently
@@ -757,6 +821,7 @@ Description: This file shows whether the device FW update is permanently
What: /sys/bus/platform/drivers/ufshcd/*/attributes/boot_lun_enabled
+What: /sys/bus/platform/devices/*.ufs/attributes/boot_lun_enabled
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file provides the boot lun enabled UFS device attribute.
@@ -766,6 +831,7 @@ Description: This file provides the boot lun enabled UFS device attribute.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/current_power_mode
+What: /sys/bus/platform/devices/*.ufs/attributes/current_power_mode
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file provides the current power mode UFS device attribute.
@@ -775,6 +841,7 @@ Description: This file provides the current power mode UFS device attribute.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/active_icc_level
+What: /sys/bus/platform/devices/*.ufs/attributes/active_icc_level
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file provides the active icc level UFS device attribute.
@@ -784,6 +851,7 @@ Description: This file provides the active icc level UFS device attribute.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/ooo_data_enabled
+What: /sys/bus/platform/devices/*.ufs/attributes/ooo_data_enabled
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file provides the out of order data transfer enabled UFS
@@ -793,6 +861,7 @@ Description: This file provides the out of order data transfer enabled UFS
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/bkops_status
+What: /sys/bus/platform/devices/*.ufs/attributes/bkops_status
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file provides the background operations status UFS device
@@ -802,6 +871,7 @@ Description: This file provides the background operations status UFS device
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/purge_status
+What: /sys/bus/platform/devices/*.ufs/attributes/purge_status
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file provides the purge operation status UFS device
@@ -811,6 +881,7 @@ Description: This file provides the purge operation status UFS device
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/max_data_in_size
+What: /sys/bus/platform/devices/*.ufs/attributes/max_data_in_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the maximum data size in a DATA IN
@@ -820,6 +891,7 @@ Description: This file shows the maximum data size in a DATA IN
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/max_data_out_size
+What: /sys/bus/platform/devices/*.ufs/attributes/max_data_out_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the maximum number of bytes that can be
@@ -829,6 +901,7 @@ Description: This file shows the maximum number of bytes that can be
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/reference_clock_frequency
+What: /sys/bus/platform/devices/*.ufs/attributes/reference_clock_frequency
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file provides the reference clock frequency UFS device
@@ -838,6 +911,7 @@ Description: This file provides the reference clock frequency UFS device
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/configuration_descriptor_lock
+What: /sys/bus/platform/devices/*.ufs/attributes/configuration_descriptor_lock
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows whether the configuration descriptor is locked.
@@ -845,6 +919,7 @@ Description: This file shows whether the configuration descriptor is locked.
UFS specifications 2.1. The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/max_number_of_rtt
+What: /sys/bus/platform/devices/*.ufs/attributes/max_number_of_rtt
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file provides the maximum current number of
@@ -855,6 +930,7 @@ Description: This file provides the maximum current number of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/exception_event_control
+What: /sys/bus/platform/devices/*.ufs/attributes/exception_event_control
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file provides the exception event control UFS device
@@ -864,6 +940,7 @@ Description: This file provides the exception event control UFS device
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/exception_event_status
+What: /sys/bus/platform/devices/*.ufs/attributes/exception_event_status
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file provides the exception event status UFS device
@@ -873,6 +950,7 @@ Description: This file provides the exception event status UFS device
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/ffu_status
+What: /sys/bus/platform/devices/*.ufs/attributes/ffu_status
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file provides the ffu status UFS device attribute.
@@ -882,6 +960,7 @@ Description: This file provides the ffu status UFS device attribute.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/psa_state
+What: /sys/bus/platform/devices/*.ufs/attributes/psa_state
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description:	This file shows the PSA feature status. The full information
@@ -890,6 +969,7 @@ Description: This file show the PSA feature status. The full information
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/psa_data_size
+What: /sys/bus/platform/devices/*.ufs/attributes/psa_data_size
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
Description: This file shows the amount of data that the host plans to
@@ -903,7 +983,7 @@ Description: This file shows the amount of data that the host plans to
What: /sys/class/scsi_device/*/device/dyn_cap_needed
Date: February 2018
Contact: Stanislav Nijnikov <stanislav.nijnikov@wdc.com>
-Description: This file shows the The amount of physical memory needed
+Description: This file shows the amount of physical memory needed
to be removed from the physical memory resources pool of
the particular logical unit. The full information about
the attribute could be found at UFS specifications 2.1.
@@ -912,6 +992,7 @@ Description: This file shows the The amount of physical memory needed
What: /sys/bus/platform/drivers/ufshcd/*/rpm_lvl
+What: /sys/bus/platform/devices/*.ufs/rpm_lvl
Date: September 2014
Contact: Subhash Jadavani <subhashj@codeaurora.org>
Description: This entry could be used to set or show the UFS device
@@ -938,6 +1019,7 @@ Description: This entry could be used to set or show the UFS device
== ====================================================
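+
+		For example, the level could be read and set as follows (the
+		"1d84000.ufs" device name below is a hypothetical example):
+
+		# cat /sys/bus/platform/devices/1d84000.ufs/rpm_lvl
+		3
+		# echo 1 > /sys/bus/platform/devices/1d84000.ufs/rpm_lvl
+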
What: /sys/bus/platform/drivers/ufshcd/*/rpm_target_dev_state
+What: /sys/bus/platform/devices/*.ufs/rpm_target_dev_state
Date: February 2018
Contact: Subhash Jadavani <subhashj@codeaurora.org>
Description: This entry shows the target power mode of an UFS device
@@ -946,6 +1028,7 @@ Description: This entry shows the target power mode of an UFS device
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/rpm_target_link_state
+What: /sys/bus/platform/devices/*.ufs/rpm_target_link_state
Date: February 2018
Contact: Subhash Jadavani <subhashj@codeaurora.org>
Description: This entry shows the target state of an UFS UIC link
@@ -954,6 +1037,7 @@ Description: This entry shows the target state of an UFS UIC link
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/spm_lvl
+What: /sys/bus/platform/devices/*.ufs/spm_lvl
Date: September 2014
Contact: Subhash Jadavani <subhashj@codeaurora.org>
Description: This entry could be used to set or show the UFS device
@@ -980,6 +1064,7 @@ Description: This entry could be used to set or show the UFS device
== ====================================================
What: /sys/bus/platform/drivers/ufshcd/*/spm_target_dev_state
+What: /sys/bus/platform/devices/*.ufs/spm_target_dev_state
Date: February 2018
Contact: Subhash Jadavani <subhashj@codeaurora.org>
Description: This entry shows the target power mode of an UFS device
@@ -988,6 +1073,7 @@ Description: This entry shows the target power mode of an UFS device
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/spm_target_link_state
+What: /sys/bus/platform/devices/*.ufs/spm_target_link_state
Date: February 2018
Contact: Subhash Jadavani <subhashj@codeaurora.org>
Description: This entry shows the target state of an UFS UIC link
@@ -996,6 +1082,7 @@ Description: This entry shows the target state of an UFS UIC link
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/monitor_enable
+What: /sys/bus/platform/devices/*.ufs/monitor/monitor_enable
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows the status of performance monitor enablement
@@ -1003,6 +1090,7 @@ Description: This file shows the status of performance monitor enablement
is stopped, the performance data collected is also cleared.
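+
+		For example, monitoring could be enabled like this (the
+		"1d84000.ufs" device name is a hypothetical example):
+
+		# echo 1 > /sys/bus/platform/devices/1d84000.ufs/monitor/monitor_enable
+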
What: /sys/bus/platform/drivers/ufshcd/*/monitor/monitor_chunk_size
+What: /sys/bus/platform/devices/*.ufs/monitor/monitor_chunk_size
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file tells the monitor to focus on requests transferring
@@ -1010,6 +1098,7 @@ Description: This file tells the monitor to focus on requests transferring
It can only be changed when monitor is disabled.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/read_total_sectors
+What: /sys/bus/platform/devices/*.ufs/monitor/read_total_sectors
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows how many sectors (in 512 Bytes) have been
@@ -1018,6 +1107,7 @@ Description: This file shows how many sectors (in 512 Bytes) have been
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/read_total_busy
+What: /sys/bus/platform/devices/*.ufs/monitor/read_total_busy
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows how long (in micro seconds) has been spent
@@ -1026,6 +1116,7 @@ Description: This file shows how long (in micro seconds) has been spent
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/read_nr_requests
+What: /sys/bus/platform/devices/*.ufs/monitor/read_nr_requests
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows how many read requests have been sent after
@@ -1034,6 +1125,7 @@ Description: This file shows how many read requests have been sent after
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/read_req_latency_max
+What: /sys/bus/platform/devices/*.ufs/monitor/read_req_latency_max
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows the maximum latency (in micro seconds) of
@@ -1042,6 +1134,7 @@ Description: This file shows the maximum latency (in micro seconds) of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/read_req_latency_min
+What: /sys/bus/platform/devices/*.ufs/monitor/read_req_latency_min
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows the minimum latency (in micro seconds) of
@@ -1050,6 +1143,7 @@ Description: This file shows the minimum latency (in micro seconds) of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/read_req_latency_avg
+What: /sys/bus/platform/devices/*.ufs/monitor/read_req_latency_avg
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows the average latency (in micro seconds) of
@@ -1058,6 +1152,7 @@ Description: This file shows the average latency (in micro seconds) of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/read_req_latency_sum
+What: /sys/bus/platform/devices/*.ufs/monitor/read_req_latency_sum
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows the total latency (in micro seconds) of
@@ -1066,6 +1161,7 @@ Description: This file shows the total latency (in micro seconds) of
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/write_total_sectors
+What: /sys/bus/platform/devices/*.ufs/monitor/write_total_sectors
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows how many sectors (in 512 Bytes) have been sent
@@ -1074,6 +1170,7 @@ Description: This file shows how many sectors (in 512 Bytes) have been sent
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/write_total_busy
+What: /sys/bus/platform/devices/*.ufs/monitor/write_total_busy
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows how long (in micro seconds) has been spent
@@ -1082,6 +1179,7 @@ Description: This file shows how long (in micro seconds) has been spent
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/write_nr_requests
+What: /sys/bus/platform/devices/*.ufs/monitor/write_nr_requests
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows how many write requests have been sent after
@@ -1090,6 +1188,7 @@ Description: This file shows how many write requests have been sent after
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/write_req_latency_max
+What: /sys/bus/platform/devices/*.ufs/monitor/write_req_latency_max
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows the maximum latency (in micro seconds) of write
@@ -1098,6 +1197,7 @@ Description: This file shows the maximum latency (in micro seconds) of write
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/write_req_latency_min
+What: /sys/bus/platform/devices/*.ufs/monitor/write_req_latency_min
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows the minimum latency (in micro seconds) of write
@@ -1106,6 +1206,7 @@ Description: This file shows the minimum latency (in micro seconds) of write
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/write_req_latency_avg
+What: /sys/bus/platform/devices/*.ufs/monitor/write_req_latency_avg
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows the average latency (in micro seconds) of write
@@ -1114,6 +1215,7 @@ Description: This file shows the average latency (in micro seconds) of write
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/monitor/write_req_latency_sum
+What: /sys/bus/platform/devices/*.ufs/monitor/write_req_latency_sum
Date: January 2021
Contact: Can Guo <cang@codeaurora.org>
Description: This file shows the total latency (in micro seconds) of write
@@ -1122,6 +1224,7 @@ Description: This file shows the total latency (in micro seconds) of write
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/wb_presv_us_en
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/wb_presv_us_en
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows if preserve user-space was configured
@@ -1129,6 +1232,7 @@ Description: This entry shows if preserve user-space was configured
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/wb_shared_alloc_units
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/wb_shared_alloc_units
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows the shared allocated units of WB buffer
@@ -1136,6 +1240,7 @@ Description: This entry shows the shared allocated units of WB buffer
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/wb_type
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/wb_type
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows the configured WB type.
@@ -1144,6 +1249,7 @@ Description: This entry shows the configured WB type.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/wb_buff_cap_adj
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/wb_buff_cap_adj
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows the total user-space decrease in shared
@@ -1154,6 +1260,7 @@ Description: This entry shows the total user-space decrease in shared
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/wb_max_alloc_units
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/wb_max_alloc_units
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows the Maximum total WriteBooster Buffer size
@@ -1162,6 +1269,7 @@ Description: This entry shows the Maximum total WriteBooster Buffer size
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/wb_max_wb_luns
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/wb_max_wb_luns
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows the maximum number of luns that can support
@@ -1170,6 +1278,7 @@ Description: This entry shows the maximum number of luns that can support
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/wb_sup_red_type
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/wb_sup_red_type
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: The supportability of user space reduction mode
@@ -1184,6 +1293,7 @@ Description: The supportability of user space reduction mode
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/wb_sup_wb_type
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/wb_sup_wb_type
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: The supportability of WriteBooster Buffer type.
@@ -1198,6 +1308,7 @@ Description: The supportability of WriteBooster Buffer type.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/flags/wb_enable
+What: /sys/bus/platform/devices/*.ufs/flags/wb_enable
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows the status of WriteBooster.
@@ -1210,6 +1321,7 @@ Description: This entry shows the status of WriteBooster.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/flags/wb_flush_en
+What: /sys/bus/platform/devices/*.ufs/flags/wb_flush_en
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows if flush is enabled.
@@ -1222,6 +1334,7 @@ Description: This entry shows if flush is enabled.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/flags/wb_flush_during_h8
+What: /sys/bus/platform/devices/*.ufs/flags/wb_flush_during_h8
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: Flush WriteBooster Buffer during hibernate state.
@@ -1236,6 +1349,7 @@ Description: Flush WriteBooster Buffer during hibernate state.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/wb_avail_buf
+What: /sys/bus/platform/devices/*.ufs/attributes/wb_avail_buf
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows the amount of unused WriteBooster buffer
@@ -1244,6 +1358,7 @@ Description: This entry shows the amount of unused WriteBooster buffer
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/wb_cur_buf
+What: /sys/bus/platform/devices/*.ufs/attributes/wb_cur_buf
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows the amount of unused current buffer.
@@ -1251,6 +1366,7 @@ Description: This entry shows the amount of unused current buffer.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/wb_flush_status
+What: /sys/bus/platform/devices/*.ufs/attributes/wb_flush_status
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows the flush operation status.
@@ -1267,6 +1383,7 @@ Description: This entry shows the flush operation status.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/wb_life_time_est
+What: /sys/bus/platform/devices/*.ufs/attributes/wb_life_time_est
Date: June 2020
Contact: Asutosh Das <asutoshd@codeaurora.org>
Description: This entry shows an indication of the WriteBooster Buffer
@@ -1289,6 +1406,7 @@ Description: This entry shows the configured size of WriteBooster buffer.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/wb_on
+What: /sys/bus/platform/devices/*.ufs/wb_on
Date: January 2021
Contact: Bean Huo <beanhuo@micron.com>
Description: This node is used to set or display whether UFS WriteBooster is
@@ -1299,7 +1417,17 @@ Description: This node is used to set or display whether UFS WriteBooster is
platform that doesn't support UFSHCD_CAP_CLK_SCALING, we can
disable/enable WriteBooster through this sysfs node.
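+
+		For example, WriteBooster could be toggled like this (the
+		"1d84000.ufs" device name is a hypothetical example):
+
+		# echo 1 > /sys/bus/platform/devices/1d84000.ufs/wb_on
+		# cat /sys/bus/platform/devices/1d84000.ufs/wb_on
+		1
+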
+What: /sys/bus/platform/drivers/ufshcd/*/enable_wb_buf_flush
+What: /sys/bus/platform/devices/*.ufs/enable_wb_buf_flush
+Date: July 2022
+Contact: Jinyoung Choi <j-young.choi@samsung.com>
+Description: This entry shows the status of WriteBooster buffer flushing
+ and it can be used to enable or disable the flushing.
+ If flushing is enabled, the device executes the flush
+ operation when the command queue is empty.
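+
+		For example, flushing could be enabled like this (the
+		"1d84000.ufs" device name is a hypothetical example):
+
+		# echo 1 > /sys/bus/platform/devices/1d84000.ufs/enable_wb_buf_flush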
+
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/hpb_version
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/hpb_version
Date: June 2021
Contact: Daejun Park <daejun7.park@samsung.com>
Description: This entry shows the HPB specification version.
@@ -1310,6 +1438,7 @@ Description: This entry shows the HPB specification version.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/device_descriptor/hpb_control
+What: /sys/bus/platform/devices/*.ufs/device_descriptor/hpb_control
Date: June 2021
Contact: Daejun Park <daejun7.park@samsung.com>
Description: This entry shows an indication of the HPB control mode.
@@ -1319,6 +1448,7 @@ Description: This entry shows an indication of the HPB control mode.
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/hpb_region_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/hpb_region_size
Date: June 2021
Contact: Daejun Park <daejun7.park@samsung.com>
Description: This entry shows the bHPBRegionSize which can be calculated
@@ -1328,6 +1458,7 @@ Description: This entry shows the bHPBRegionSize which can be calculated
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/hpb_number_lu
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/hpb_number_lu
Date: June 2021
Contact: Daejun Park <daejun7.park@samsung.com>
Description: This entry shows the maximum number of HPB LU supported by
@@ -1338,6 +1469,7 @@ Description: This entry shows the maximum number of HPB LU supported by
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/hpb_subregion_size
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/hpb_subregion_size
Date: June 2021
Contact: Daejun Park <daejun7.park@samsung.com>
Description: This entry shows the bHPBSubRegionSize, which can be
@@ -1349,6 +1481,7 @@ Description: This entry shows the bHPBSubRegionSize, which can be
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/geometry_descriptor/hpb_max_active_regions
+What: /sys/bus/platform/devices/*.ufs/geometry_descriptor/hpb_max_active_regions
Date: June 2021
Contact: Daejun Park <daejun7.park@samsung.com>
Description: This entry shows the maximum number of active HPB regions that
@@ -1394,7 +1527,7 @@ Description: This entry shows the number of reads that cannot be changed to
The file is read only.
-What: /sys/class/scsi_device/*/device/hpb_stats/rb_noti_cnt
+What: /sys/class/scsi_device/*/device/hpb_stats/rcmd_noti_cnt
Date: June 2021
Contact: Daejun Park <daejun7.park@samsung.com>
Description: This entry shows the number of response UPIUs that has
@@ -1402,19 +1535,23 @@ Description: This entry shows the number of response UPIUs that has
The file is read only.
-What: /sys/class/scsi_device/*/device/hpb_stats/rb_active_cnt
+What: /sys/class/scsi_device/*/device/hpb_stats/rcmd_active_cnt
Date: June 2021
Contact: Daejun Park <daejun7.park@samsung.com>
-Description: This entry shows the number of active sub-regions recommended by
- response UPIUs.
+Description: For the HPB device control mode, this entry shows the number of
+ active sub-regions recommended by response UPIUs. For the HPB host control
+ mode, this entry shows the number of active sub-regions recommended by the
+ HPB host control mode heuristic algorithm.
The file is read only.
-What: /sys/class/scsi_device/*/device/hpb_stats/rb_inactive_cnt
+What: /sys/class/scsi_device/*/device/hpb_stats/rcmd_inactive_cnt
Date: June 2021
Contact: Daejun Park <daejun7.park@samsung.com>
-Description: This entry shows the number of inactive regions recommended by
- response UPIUs.
+Description: For the HPB device control mode, this entry shows the number of
+ inactive regions recommended by response UPIUs. For the HPB host control
+ mode, this entry shows the number of inactive regions recommended by the
+ HPB host control mode heuristic algorithm.
The file is read only.
@@ -1434,6 +1571,7 @@ Description: This entry shows the requeue timeout threshold for write buffer
this entry.
What: /sys/bus/platform/drivers/ufshcd/*/attributes/max_data_size_hpb_single_cmd
+What: /sys/bus/platform/devices/*.ufs/attributes/max_data_size_hpb_single_cmd
Date: June 2021
Contact: Daejun Park <daejun7.park@samsung.com>
Description: This entry shows the maximum HPB data size for using a single HPB
@@ -1450,6 +1588,7 @@ Description: This entry shows the maximum HPB data size for using a single HPB
The file is read only.
What: /sys/bus/platform/drivers/ufshcd/*/flags/hpb_enable
+What: /sys/bus/platform/devices/*.ufs/flags/hpb_enable
Date: June 2021
Contact: Daejun Park <daejun7.park@samsung.com>
Description: This entry shows the status of HPB.
@@ -1461,6 +1600,43 @@ Description: This entry shows the status of HPB.
The file is read only.
+What:		/sys/bus/platform/drivers/ufshcd/*/capabilities/
+What:		/sys/bus/platform/devices/*.ufs/capabilities/
+Date:		August 2022
+Contact:	Daniil Lunev <dlunev@chromium.org>
+Description:	The group represents the effective capabilities of the
+		host-device pair, i.e. the capabilities which are enabled in
+		the driver for the specific host controller, supported by the
+		host controller, and supported and/or compatibly configured
+		on the device side.
+
+What:		/sys/bus/platform/drivers/ufshcd/*/capabilities/clock_scaling
+What:		/sys/bus/platform/devices/*.ufs/capabilities/clock_scaling
+Date:		August 2022
+Contact:	Daniil Lunev <dlunev@chromium.org>
+Description:	Indicates status of clock scaling.
+
+ == ============================
+ 0 Clock scaling is not supported.
+ 1 Clock scaling is supported.
+ == ============================
+
+ The file is read only.
+
+What: /sys/bus/platform/drivers/ufshcd/*/capabilities/write_booster
+What: /sys/bus/platform/devices/*.ufs/capabilities/write_booster
+Date: August 2022
+Contact: Daniil Lunev <dlunev@chromium.org>
+Description: Indicates status of Write Booster.
+
+ == ============================
+		0  Write Booster cannot be enabled.
+ 1 Write Booster can be enabled.
+ == ============================
+
+ The file is read only.
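+
+		For example (the "1d84000.ufs" device name is a hypothetical
+		example):
+
+		# cat /sys/bus/platform/devices/1d84000.ufs/capabilities/write_booster
+		1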
+
What: /sys/class/scsi_device/*/device/hpb_param_sysfs/activation_thld
Date: February 2021
Contact: Avri Altman <avri.altman@wdc.com>
diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback b/Documentation/ABI/testing/sysfs-driver-xen-blkback
index ac2947b98950..fac0f429a869 100644
--- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
+++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
@@ -29,7 +29,7 @@ Description:
What: /sys/module/xen_blkback/parameters/buffer_squeeze_duration_ms
Date: December 2019
KernelVersion: 5.6
-Contact: SeongJae Park <sjpark@amazon.de>
+Contact: Maximilian Heyne <mheyne@amazon.de>
Description:
When memory pressure is reported to blkback this option
controls the duration in milliseconds that blkback will not
@@ -39,8 +39,8 @@ Description:
What: /sys/module/xen_blkback/parameters/feature_persistent
Date: September 2020
KernelVersion: 5.10
-Contact: SeongJae Park <sjpark@amazon.de>
+Contact: Maximilian Heyne <mheyne@amazon.de>
Description:
Whether to enable the persistent grants feature or not. Note
- that this option only takes effect on newly created backends.
+ that this option only takes effect on newly connected backends.
The default is Y (enable).
diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkfront b/Documentation/ABI/testing/sysfs-driver-xen-blkfront
index 28008905615f..4d36c5a10546 100644
--- a/Documentation/ABI/testing/sysfs-driver-xen-blkfront
+++ b/Documentation/ABI/testing/sysfs-driver-xen-blkfront
@@ -12,8 +12,8 @@ Description:
What: /sys/module/xen_blkfront/parameters/feature_persistent
Date: September 2020
KernelVersion: 5.10
-Contact: SeongJae Park <sjpark@amazon.de>
+Contact: Maximilian Heyne <mheyne@amazon.de>
Description:
Whether to enable the persistent grants feature or not. Note
- that this option only takes effect on newly created frontends.
+ that this option only takes effect on newly connected frontends.
The default is Y (enable).
diff --git a/Documentation/ABI/testing/sysfs-driver-xilinx-tmr-manager b/Documentation/ABI/testing/sysfs-driver-xilinx-tmr-manager
new file mode 100644
index 000000000000..57b9b68a73ee
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-xilinx-tmr-manager
@@ -0,0 +1,16 @@
+What: /sys/devices/platform/amba_pl/<dev>/errcnt
+Date: Nov 2022
+Contact: appana.durga.kedareswara.rao@amd.com
+Description: This control file provides the fault detection count.
+ This file cannot be written.
+ Example:
+ # cat /sys/devices/platform/amba_pl/44a10000.tmr_manager/errcnt
+ 1
+
+What: /sys/devices/platform/amba_pl/<dev>/dis_block_break
+Date: Nov 2022
+Contact: appana.durga.kedareswara.rao@amd.com
+Description:	Writing any value to this control file enables the break
+		signal.
+		This file is write only.
+ Example:
+ # echo <any value> > /sys/devices/platform/amba_pl/44a10000.tmr_manager/dis_block_break
diff --git a/Documentation/ABI/testing/sysfs-driver-zynqmp-fpga b/Documentation/ABI/testing/sysfs-driver-zynqmp-fpga
new file mode 100644
index 000000000000..8f93d27b6d91
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-zynqmp-fpga
@@ -0,0 +1,73 @@
+What: /sys/bus/platform/drivers/zynqmp_fpga_manager/firmware:zynqmp-firmware:pcap/status
+Date: February 2023
+KernelVersion: 6.4
+Contact: Nava kishore Manne <nava.kishore.manne@amd.com>
+Description:	(RO) Read FPGA status.
+		Read returns a hexadecimal value that tells the current status
+		of the FPGA device. Each bit position in the status value is
+		described below (see ug570, chapter 9).
+ https://docs.xilinx.com/v/u/en-US/ug570-ultrascale-configuration
+
+ ====================== ==============================================
+ BIT(0) 0: No CRC error
+ 1: CRC error
+
+ BIT(1) 0: Decryptor security not set
+ 1: Decryptor security set
+
+ BIT(2) 0: MMCMs/PLLs are not locked
+ 1: MMCMs/PLLs are locked
+
+ BIT(3) 0: DCI not matched
+ 1: DCI matched
+
+ BIT(4) 0: Start-up sequence has not finished
+ 1: Start-up sequence has finished
+
+ BIT(5) 0: All I/Os are placed in High-Z state
+ 1: All I/Os behave as configured
+
+ BIT(6) 0: Flip-flops and block RAM are write disabled
+ 1: Flip-flops and block RAM are write enabled
+
+ BIT(7) 0: GHIGH_B_STATUS asserted
+ 1: GHIGH_B_STATUS deasserted
+
+ BIT(8) to BIT(10) Status of the mode pins
+
+ BIT(11) 0: Initialization has not finished
+ 1: Initialization finished
+
+ BIT(12) Value on INIT_B_PIN pin
+
+ BIT(13) 0: Signal not released
+ 1: Signal released
+
+ BIT(14) Value on DONE_PIN pin.
+
+ BIT(15) 0: No IDCODE_ERROR
+ 1: IDCODE_ERROR
+
+ BIT(16) 0: No SECURITY_ERROR
+ 1: SECURITY_ERROR
+
+ BIT(17) System Monitor over-temperature if set
+
+ BIT(18) to BIT(20) Start-up state machine (0 to 7)
+ Phase 0 = 000
+ Phase 1 = 001
+ Phase 2 = 011
+ Phase 3 = 010
+ Phase 4 = 110
+ Phase 5 = 111
+ Phase 6 = 101
+ Phase 7 = 100
+
+ BIT(25) to BIT(26) Indicates the detected bus width
+ 00 = x1
+ 01 = x8
+ 10 = x16
+ 11 = x32
+ ====================== ==============================================
+
+ The other bits are reserved.
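+
+		Example (the value shown is purely illustrative):
+
+		# cat /sys/bus/platform/drivers/zynqmp_fpga_manager/firmware:zynqmp-firmware:pcap/status
+		0x3fff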
diff --git a/Documentation/ABI/testing/sysfs-firmware-efi-esrt b/Documentation/ABI/testing/sysfs-firmware-efi-esrt
index 31b57676d4ad..4c2d440487dd 100644
--- a/Documentation/ABI/testing/sysfs-firmware-efi-esrt
+++ b/Documentation/ABI/testing/sysfs-firmware-efi-esrt
@@ -24,14 +24,14 @@ Date: February 2015
Contact: Peter Jones <pjones@redhat.com>
Description: The version of the ESRT structure provided by the firmware.
-What: /sys/firmware/efi/esrt/entries/entry$N/
+What: /sys/firmware/efi/esrt/entries/entry<N>/
Date: February 2015
Contact: Peter Jones <pjones@redhat.com>
Description: Each ESRT entry is identified by a GUID, and each gets a
subdirectory under entries/ .
example: /sys/firmware/efi/esrt/entries/entry0/
-What: /sys/firmware/efi/esrt/entries/entry$N/fw_type
+What: /sys/firmware/efi/esrt/entries/entry<N>/fw_type
Date: February 2015
Contact: Peter Jones <pjones@redhat.com>
Description: What kind of firmware entry this is:
@@ -43,33 +43,33 @@ Description: What kind of firmware entry this is:
3 UEFI Driver
== ===============
-What: /sys/firmware/efi/esrt/entries/entry$N/fw_class
+What: /sys/firmware/efi/esrt/entries/entry<N>/fw_class
Date: February 2015
Contact: Peter Jones <pjones@redhat.com>
Description: This is the entry's guid, and will match the directory name.
-What: /sys/firmware/efi/esrt/entries/entry$N/fw_version
+What: /sys/firmware/efi/esrt/entries/entry<N>/fw_version
Date: February 2015
Contact: Peter Jones <pjones@redhat.com>
Description: The version of the firmware currently installed. This is a
32-bit unsigned integer.
-What: /sys/firmware/efi/esrt/entries/entry$N/lowest_supported_fw_version
+What: /sys/firmware/efi/esrt/entries/entry<N>/lowest_supported_fw_version
Date: February 2015
Contact: Peter Jones <pjones@redhat.com>
Description: The lowest version of the firmware that can be installed.
-What: /sys/firmware/efi/esrt/entries/entry$N/capsule_flags
+What: /sys/firmware/efi/esrt/entries/entry<N>/capsule_flags
Date: February 2015
Contact: Peter Jones <pjones@redhat.com>
Description: Flags that must be passed to UpdateCapsule()
-What: /sys/firmware/efi/esrt/entries/entry$N/last_attempt_version
+What: /sys/firmware/efi/esrt/entries/entry<N>/last_attempt_version
Date: February 2015
Contact: Peter Jones <pjones@redhat.com>
Description: The last firmware version for which an update was attempted.
-What: /sys/firmware/efi/esrt/entries/entry$N/last_attempt_status
+What: /sys/firmware/efi/esrt/entries/entry<N>/last_attempt_status
Date: February 2015
Contact: Peter Jones <pjones@redhat.com>
Description: The result of the last firmware update attempt for the
diff --git a/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
new file mode 100644
index 000000000000..141a6b371469
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
@@ -0,0 +1,29 @@
+What: /sys/firmware/papr/energy_scale_info
+Date: February 2022
+Contact: Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
+Description: Directory hosting a set of platform attributes like
+ energy/frequency on Linux running as a PAPR guest.
+
+ Each file in a directory contains a platform
+ attribute hierarchy pertaining to performance/
+ energy-savings mode and processor frequency.
+
+What: /sys/firmware/papr/energy_scale_info/<id>
+Date: February 2022
+Contact: Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
+Description: Energy, frequency attributes directory for POWERVM servers
+
+What: /sys/firmware/papr/energy_scale_info/<id>/desc
+Date: February 2022
+Contact: Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
+Description: String description of the energy attribute of <id>
+
+What: /sys/firmware/papr/energy_scale_info/<id>/value
+Date: February 2022
+Contact: Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
+Description: Numeric value of the energy attribute of <id>
+
+What: /sys/firmware/papr/energy_scale_info/<id>/value_desc
+Date: February 2022
+Contact: Linux for PowerPC mailing list <linuxppc-dev@ozlabs.org>
+Description: String value of the energy attribute of <id>
diff --git a/Documentation/ABI/testing/sysfs-firmware-qemu_fw_cfg b/Documentation/ABI/testing/sysfs-firmware-qemu_fw_cfg
index ee0d6dbc810e..54d1bfd0db12 100644
--- a/Documentation/ABI/testing/sysfs-firmware-qemu_fw_cfg
+++ b/Documentation/ABI/testing/sysfs-firmware-qemu_fw_cfg
@@ -12,8 +12,9 @@ Description:
configuration data to the guest userspace.
The authoritative guest-side hardware interface documentation
- to the fw_cfg device can be found in "docs/specs/fw_cfg.txt"
- in the QEMU source tree.
+ to the fw_cfg device can be found in "docs/specs/fw_cfg.rst"
+ in the QEMU source tree, or online at:
+ https://qemu-project.gitlab.io/qemu/specs/fw_cfg.html
**SysFS fw_cfg Interface**
diff --git a/Documentation/ABI/testing/sysfs-fs-erofs b/Documentation/ABI/testing/sysfs-fs-erofs
new file mode 100644
index 000000000000..284224d1b56f
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-fs-erofs
@@ -0,0 +1,18 @@
+What: /sys/fs/erofs/features/
+Date: November 2021
+Contact: "Huang Jianan" <huangjianan@oppo.com>
+Description: Shows all enabled kernel features.
+ Supported features:
+ zero_padding, compr_cfgs, big_pcluster, chunked_file,
+ device_table, compr_head2, sb_chksum, ztailpacking,
+ dedupe, fragments.
+
+What: /sys/fs/erofs/<disk>/sync_decompress
+Date: November 2021
+Contact: "Huang Jianan" <huangjianan@oppo.com>
+Description: Control strategy of sync decompression:
+
+ - 0 (default, auto): enable for readpage, and enable for
+ readahead on atomic contexts only.
+ - 1 (force on): enable for readpage and readahead.
+ - 2 (force off): disable for all situations.
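+
+	For example, sync decompression could be forced on like this
+	(the "sda" disk name is a hypothetical example):
+
+	# echo 1 > /sys/fs/erofs/sda/sync_decompress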
diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs
index f627e705e663..8140fc98f5ae 100644
--- a/Documentation/ABI/testing/sysfs-fs-f2fs
+++ b/Documentation/ABI/testing/sysfs-fs-f2fs
@@ -49,15 +49,23 @@ Contact: "Jaegeuk Kim" <jaegeuk.kim@samsung.com>
Description:	Controls the in-place-update (IPU) policy for
		updates in f2fs. User can set:
- ==== =================
- 0x01 F2FS_IPU_FORCE
- 0x02 F2FS_IPU_SSR
- 0x04 F2FS_IPU_UTIL
- 0x08 F2FS_IPU_SSR_UTIL
- 0x10 F2FS_IPU_FSYNC
- 0x20 F2FS_IPU_ASYNC,
- 0x40 F2FS_IPU_NOCACHE
- ==== =================
+ ===== =============== ===================================================
+ value policy description
+	0x00  DISABLE         disable IPU (=default option in LFS mode)
+ 0x01 FORCE all the time
+ 0x02 SSR if SSR mode is activated
+	0x04  UTIL            if FS utilization is over threshold
+	0x08  SSR_UTIL        if SSR mode is activated and FS utilization is over
+	                      threshold
+ 0x10 FSYNC activated in fsync path only for high performance
+ flash storages. IPU will be triggered only if the
+	                      # of dirty pages is over min_fsync_blocks.
+ (=default option)
+ 0x20 ASYNC do IPU given by asynchronous write requests
+ 0x40 NOCACHE disable IPU bio cache
+ 0x80 HONOR_OPU_WRITE use OPU write prior to IPU write if inode has
+ FI_OPU_WRITE flag
+ ===== =============== ===================================================
Refer segment.h for details.
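+
+	The flag values above form a bitmask, so, for example, SSR (0x02)
+	and FSYNC (0x10) could be combined as follows (the "sda" disk name
+	is a hypothetical example):
+
+	# echo 0x12 > /sys/fs/f2fs/sda/ipu_policy
+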
@@ -98,13 +106,47 @@ Description: Controls the issue rate of discard commands that consist of small
checkpoint is triggered, and issued during the checkpoint.
By default, it is disabled with 0.
+What: /sys/fs/f2fs/<disk>/max_ordered_discard
+Date: October 2022
+Contact: "Yangtao Li" <frank.li@vivo.com>
+Description:	Controls the maximum ordered discard; the unit size is one
+		block (4KB). Set it to 16 by default.
+
+What: /sys/fs/f2fs/<disk>/max_discard_request
+Date: December 2021
+Contact: "Konstantin Vyshetsky" <vkon@google.com>
+Description: Controls the number of discards a thread will issue at a time.
+		A higher number will allow the discard thread to finish its
+		work faster, at the cost of higher latency for incoming I/O.
+
+What: /sys/fs/f2fs/<disk>/min_discard_issue_time
+Date: December 2021
+Contact: "Konstantin Vyshetsky" <vkon@google.com>
+Description: Controls the interval the discard thread will wait between
+ issuing discard requests when there are discards to be issued and
+ no I/O aware interruptions occur.
+
+What: /sys/fs/f2fs/<disk>/mid_discard_issue_time
+Date: December 2021
+Contact: "Konstantin Vyshetsky" <vkon@google.com>
+Description: Controls the interval the discard thread will wait between
+ issuing discard requests when there are discards to be issued and
+ an I/O aware interruption occurs.
+
+What: /sys/fs/f2fs/<disk>/max_discard_issue_time
+Date: December 2021
+Contact: "Konstantin Vyshetsky" <vkon@google.com>
+Description: Controls the interval the discard thread will wait when there are
+ no discard operations to be issued.
+
What: /sys/fs/f2fs/<disk>/discard_granularity
Date: July 2017
Contact: "Chao Yu" <yuchao0@huawei.com>
Description: Controls discard granularity of inner discard thread. Inner thread
will not issue discards with size that is smaller than granularity.
The unit size is one block(4KB), now only support configuring
- in range of [1, 512]. Default value is 4(=16KB).
+ in range of [1, 512]. Default value is 16.
+ For small devices, default value is 1.
What: /sys/fs/f2fs/<disk>/umount_discard_timeout
Date: January 2019
@@ -112,6 +154,11 @@ Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
Description: Set timeout to issue discard commands during umount.
Default: 5 secs
+What: /sys/fs/f2fs/<disk>/pending_discard
+Date: November 2021
+Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
+Description: Shows the number of pending discard commands in the queue.
+
What: /sys/fs/f2fs/<disk>/max_victim_search
Date: January 2014
Contact: "Jaegeuk Kim" <jaegeuk.kim@samsung.com>
@@ -143,12 +190,6 @@ Description: Controls the memory footprint used by free nids and cached
nat entries. By default, 1 is set, which indicates
10 MB / 1 GB RAM.
-What: /sys/fs/f2fs/<disk>/batched_trim_sections
-Date: February 2015
-Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
-Description: Controls the trimming rate in batch mode.
- <deprecated>
-
What: /sys/fs/f2fs/<disk>/cp_interval
Date: October 2015
Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
@@ -202,7 +243,7 @@ Description: Shows total written kbytes issued to disk.
What: /sys/fs/f2fs/<disk>/features
Date: July 2017
Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
-Description: <deprecated: should use /sys/fs/f2fs/<disk>/feature_list/
+Description: <deprecated: should use /sys/fs/f2fs/<disk>/feature_list/>
Shows all enabled features in current device.
Supported features:
encryption, blkzoned, extra_attr, projquota, inode_checksum,
@@ -264,11 +305,16 @@ Description: Shows current reserved blocks in system, it may be temporarily
What: /sys/fs/f2fs/<disk>/gc_urgent
Date: August 2017
Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
-Description: Do background GC aggressively when set. When gc_urgent = 1,
- background thread starts to do GC by given gc_urgent_sleep_time
- interval. When gc_urgent = 2, F2FS will lower the bar of
- checking idle in order to process outstanding discard commands
- and GC a little bit aggressively. It is set to 0 by default.
+Description: Do background GC aggressively when set. Set to 0 by default.
+ gc urgent high(1): does GC forcibly in a period of given
+ gc_urgent_sleep_time and ignores I/O idling check. uses greedy
+ GC approach and turns SSR mode on.
+ gc urgent low(2): lowers the bar of checking I/O idling in
+ order to process outstanding discard commands and GC a
+ little bit aggressively. uses cost benefit GC approach.
+ gc urgent mid(3): does GC forcibly in a period of given
+		gc_urgent_sleep_time and executes a mid-level I/O idling check.
+ uses cost benefit GC approach.
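+
+		For example, GC urgent high mode could be entered like this
+		(the "sda" disk name is a hypothetical example):
+
+		# echo 1 > /sys/fs/f2fs/sda/gc_urgent
+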
What: /sys/fs/f2fs/<disk>/gc_urgent_sleep_time
Date: August 2017
@@ -425,8 +471,33 @@ Description: Show status of f2fs superblock in real time.
0x800 SBI_QUOTA_SKIP_FLUSH skip flushing quota in current CP
0x1000 SBI_QUOTA_NEED_REPAIR quota file may be corrupted
0x2000 SBI_IS_RESIZEFS resizefs is in process
+		0x4000 SBI_IS_FREEZING       freeze fs is in process
====== ===================== =================================
+What: /sys/fs/f2fs/<disk>/stat/cp_status
+Date: September 2022
+Contact: "Chao Yu" <chao.yu@oppo.com>
+Description: Show status of f2fs checkpoint in real time.
+
+ =============================== ==============================
+ cp flag value
+ CP_UMOUNT_FLAG 0x00000001
+ CP_ORPHAN_PRESENT_FLAG 0x00000002
+ CP_COMPACT_SUM_FLAG 0x00000004
+ CP_ERROR_FLAG 0x00000008
+ CP_FSCK_FLAG 0x00000010
+ CP_FASTBOOT_FLAG 0x00000020
+ CP_CRC_RECOVERY_FLAG 0x00000040
+ CP_NAT_BITS_FLAG 0x00000080
+ CP_TRIMMED_FLAG 0x00000100
+ CP_NOCRC_RECOVERY_FLAG 0x00000200
+ CP_LARGE_NAT_BITMAP_FLAG 0x00000400
+ CP_QUOTA_NEED_FSCK_FLAG 0x00000800
+ CP_DISABLED_FLAG 0x00001000
+ CP_DISABLED_QUICK_FLAG 0x00002000
+ CP_RESIZEFS_FLAG 0x00004000
+ =============================== ==============================
+
What: /sys/fs/f2fs/<disk>/ckpt_thread_ioprio
Date: January 2021
Contact: "Daeho Jeong" <daehojeong@google.com>
@@ -498,7 +569,7 @@ Date: July 2021
Contact: "Daeho Jeong" <daehojeong@google.com>
Description: Show how many segments have been reclaimed by GC during a specific
GC mode (0: GC normal, 1: GC idle CB, 2: GC idle greedy,
- 3: GC idle AT, 4: GC urgent high, 5: GC urgent low)
+		3: GC idle AT, 4: GC urgent high, 5: GC urgent low, 6: GC urgent mid)
You can re-initialize this value to "0".
What: /sys/fs/f2fs/<disk>/gc_segment_mode
@@ -512,3 +583,160 @@ Date: July 2021
Contact: "Daeho Jeong" <daehojeong@google.com>
Description: You can control the multiplier value of bdi device readahead window size
between 2 (default) and 256 for POSIX_FADV_SEQUENTIAL advise option.
+
+What: /sys/fs/f2fs/<disk>/max_fragment_chunk
+Date: August 2021
+Contact: "Daeho Jeong" <daehojeong@google.com>
+Description: With "mode=fragment:block" mount options, we can scatter block allocation.
+ f2fs will allocate 1..<max_fragment_chunk> blocks in a chunk and make a hole
+ in the length of 1..<max_fragment_hole> by turns. This value can be set
+ between 1..512 and the default value is 4.
+
+What: /sys/fs/f2fs/<disk>/max_fragment_hole
+Date: August 2021
+Contact: "Daeho Jeong" <daehojeong@google.com>
+Description: With "mode=fragment:block" mount options, we can scatter block allocation.
+ f2fs will allocate 1..<max_fragment_chunk> blocks in a chunk and make a hole
+ in the length of 1..<max_fragment_hole> by turns. This value can be set
+ between 1..512 and the default value is 4.
+
+What: /sys/fs/f2fs/<disk>/gc_remaining_trials
+Date: October 2022
+Contact: "Yangtao Li" <frank.li@vivo.com>
+Description:	You can set the trial count limit for GC urgent and idle mode
+		with this value. If the GC thread reaches the limit, the mode
+		will turn back to GC normal mode. By default, the value is
+		zero, which means there is no limit.
+
+What: /sys/fs/f2fs/<disk>/max_roll_forward_node_blocks
+Date: January 2022
+Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
+Description: Controls max # of node block writes to be used for roll forward
+ recovery. This can limit the roll forward recovery time.
+
+What: /sys/fs/f2fs/<disk>/unusable_blocks_per_sec
+Date: June 2022
+Contact: "Jaegeuk Kim" <jaegeuk@kernel.org>
+Description:	Shows the number of unusable blocks in a section, as defined
+		by the zone capacity reported by the underlying zoned device.
+
+What: /sys/fs/f2fs/<disk>/current_atomic_write
+Date: July 2022
+Contact: "Daeho Jeong" <daehojeong@google.com>
+Description: Show the total current atomic write block count, which is not committed yet.
+ This is a read-only entry.
+
+What: /sys/fs/f2fs/<disk>/peak_atomic_write
+Date: July 2022
+Contact: "Daeho Jeong" <daehojeong@google.com>
+Description: Show the peak value of total current atomic write block count after boot.
+		Writing "0" here re-initializes the value to "0".
+
+What: /sys/fs/f2fs/<disk>/committed_atomic_block
+Date: July 2022
+Contact: "Daeho Jeong" <daehojeong@google.com>
+Description: Show the accumulated total committed atomic write block count after boot.
+		Writing "0" here re-initializes the value to "0".
+
+What: /sys/fs/f2fs/<disk>/revoked_atomic_block
+Date: July 2022
+Contact: "Daeho Jeong" <daehojeong@google.com>
+Description: Show the accumulated total revoked atomic write block count after boot.
+		Writing "0" here re-initializes the value to "0".
+
+What: /sys/fs/f2fs/<disk>/gc_mode
+Date: October 2022
+Contact: "Yangtao Li" <frank.li@vivo.com>
+Description: Show the current gc_mode as a string.
+ This is a read-only entry.
+
+What: /sys/fs/f2fs/<disk>/discard_urgent_util
+Date: November 2022
+Contact: "Yangtao Li" <frank.li@vivo.com>
+Description: When space utilization exceeds this, do background DISCARD aggressively.
+		DISCARD will be issued forcibly in a period of given
+		min_discard_issue_time when the number of discards is not 0,
+		and the discard granularity is set to 1.
+ Default: 80
+
+What: /sys/fs/f2fs/<disk>/hot_data_age_threshold
+Date: November 2022
+Contact: "Ping Xiong" <xiongping1@xiaomi.com>
+Description: When DATA SEPARATION is on, it controls the age threshold to indicate
+		the data blocks as hot. By default it is initialized to 262144
+		blocks (equal to 1GB).
+
+What: /sys/fs/f2fs/<disk>/warm_data_age_threshold
+Date: November 2022
+Contact: "Ping Xiong" <xiongping1@xiaomi.com>
+Description: When DATA SEPARATION is on, it controls the age threshold to indicate
+		the data blocks as warm. By default it is initialized to
+		2621440 blocks (equal to 10GB).
+
+What: /sys/fs/f2fs/<disk>/fault_rate
+Date: May 2016
+Contact: "Sheng Yong" <shengyong@oppo.com>
+Contact: "Chao Yu" <chao@kernel.org>
+Description: Enable fault injection in all supported types with
+ specified injection rate.
+
+What: /sys/fs/f2fs/<disk>/fault_type
+Date: May 2016
+Contact: "Sheng Yong" <shengyong@oppo.com>
+Contact: "Chao Yu" <chao@kernel.org>
+Description: Support configuring fault injection type, should be
+ enabled with fault_injection option, fault type value
+ is shown below, it supports single or combined type.
+
+ =================== ===========
+ Type_Name Type_Value
+ =================== ===========
+ FAULT_KMALLOC 0x000000001
+ FAULT_KVMALLOC 0x000000002
+ FAULT_PAGE_ALLOC 0x000000004
+ FAULT_PAGE_GET 0x000000008
+ FAULT_ALLOC_BIO 0x000000010 (obsolete)
+ FAULT_ALLOC_NID 0x000000020
+ FAULT_ORPHAN 0x000000040
+ FAULT_BLOCK 0x000000080
+ FAULT_DIR_DEPTH 0x000000100
+ FAULT_EVICT_INODE 0x000000200
+ FAULT_TRUNCATE 0x000000400
+ FAULT_READ_IO 0x000000800
+ FAULT_CHECKPOINT 0x000001000
+ FAULT_DISCARD 0x000002000
+ FAULT_WRITE_IO 0x000004000
+ FAULT_SLAB_ALLOC 0x000008000
+ FAULT_DQUOT_INIT 0x000010000
+ FAULT_LOCK_OP 0x000020000
+ FAULT_BLKADDR 0x000040000
+ =================== ===========
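+
+	For example, FAULT_BLOCK (0x080) and FAULT_TRUNCATE (0x400)
+	could be combined like this (the "sda" disk name is a
+	hypothetical example):
+
+	# echo 0x480 > /sys/fs/f2fs/sda/fault_type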
+
+What: /sys/fs/f2fs/<disk>/discard_io_aware_gran
+Date: January 2023
+Contact: "Yangtao Li" <frank.li@vivo.com>
+Description:	Controls background discard granularity of the inner discard
+		thread when the system is not idle. The inner thread will not
+		issue discards with a size smaller than the granularity. The
+		unit size is one block (4KB); only values in the range
+		[0, 512] are supported.
+ Default: 512
+
+What: /sys/fs/f2fs/<disk>/last_age_weight
+Date: January 2023
+Contact: "Ping Xiong" <xiongping1@xiaomi.com>
+Description: When DATA SEPARATION is on, it controls the weight of last data block age.
+
+What: /sys/fs/f2fs/<disk>/compress_watermark
+Date: February 2023
+Contact: "Yangtao Li" <frank.li@vivo.com>
+Description:	When compress cache is on, it controls the free memory
+		watermark in order to limit caching of compressed pages. If
+		free memory is lower than the watermark, caching of compressed
+		pages is denied. The value should be in the range (0, 100]; by
+		default it is initialized to 20(%).
+
+What: /sys/fs/f2fs/<disk>/compress_percent
+Date: February 2023
+Contact: "Yangtao Li" <frank.li@vivo.com>
+Description:	When compress cache is on, it controls the cached page
+		percent (compressed pages / free_ram) in order to limit
+		caching of compressed pages. If the cached page percent
+		exceeds the threshold, caching of compressed pages is denied.
+		The value should be in the range (0, 100]; by default it is
+		initialized to 20(%).
diff --git a/Documentation/ABI/testing/sysfs-fs-ubifs b/Documentation/ABI/testing/sysfs-fs-ubifs
new file mode 100644
index 000000000000..af5afda30220
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-fs-ubifs
@@ -0,0 +1,35 @@
+What: /sys/fs/ubifsX_Y/error_magic
+Date: October 2021
+KernelVersion: 5.16
+Contact: linux-mtd@lists.infradead.org
+Description:
+ Exposes magic errors: every node starts with a magic number.
+
+ This counter keeps track of the number of accesses of nodes
+ with a corrupted magic number.
+
+ The counter is reset to 0 with a remount.
+
+What: /sys/fs/ubifsX_Y/error_node
+Date: October 2021
+KernelVersion: 5.16
+Contact: linux-mtd@lists.infradead.org
+Description:
+ Exposes node errors. Every node embeds its type.
+
+ This counter keeps track of the number of accesses of nodes
+ with a corrupted node type.
+
+ The counter is reset to 0 with a remount.
+
+What: /sys/fs/ubifsX_Y/error_crc
+Date: October 2021
+KernelVersion: 5.16
+Contact: linux-mtd@lists.infradead.org
+Description:
+ Exposes crc errors: every node embeds a crc checksum.
+
+ This counter keeps track of the number of accesses of nodes
+ with a bad crc checksum.
+
+ The counter is reset to 0 with a remount.
diff --git a/Documentation/ABI/testing/sysfs-kernel-address_bits b/Documentation/ABI/testing/sysfs-kernel-address_bits
new file mode 100644
index 000000000000..5d09ff84d4d6
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-address_bits
@@ -0,0 +1,10 @@
+What:		/sys/kernel/address_bits
+Date: May 2023
+KernelVersion: 6.3
+Contact: Thomas Weißschuh <linux@weissschuh.net>
+Description:
+ The address size of the running kernel in bits.
+
+ Access: Read
+
+Users: util-linux
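+
+	Example (on a 64-bit kernel):
+
+	# cat /sys/kernel/address_bits
+	64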
diff --git a/Documentation/ABI/testing/sysfs-kernel-cpu_byteorder b/Documentation/ABI/testing/sysfs-kernel-cpu_byteorder
new file mode 100644
index 000000000000..f0e6ac1b5356
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-cpu_byteorder
@@ -0,0 +1,12 @@
+What: /sys/kernel/cpu_byteorder
+Date: February 2023
+KernelVersion: 6.2
+Contact: Thomas Weißschuh <linux@weissschuh.net>
+Description:
+ The endianness of the running kernel.
+
+ Access: Read
+
+ Valid values:
+ "little", "big"
+Users: util-linux
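+
+	Example (on a little-endian machine):
+
+	# cat /sys/kernel/cpu_byteorder
+	little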
diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups b/Documentation/ABI/testing/sysfs-kernel-iommu_groups
index b15af6a5bc08..a42d4383d999 100644
--- a/Documentation/ABI/testing/sysfs-kernel-iommu_groups
+++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups
@@ -53,7 +53,6 @@ Description: /sys/kernel/iommu_groups/<grp_id>/type shows the type of default
The default domain type of a group may be modified only when
- - The group has only one device.
- The device in the group is not bound to any device driver.
So, the users must unbind the appropriate driver before
changing the default domain type.
diff --git a/Documentation/ABI/testing/sysfs-kernel-livepatch b/Documentation/ABI/testing/sysfs-kernel-livepatch
index bea7bd5a1d5f..a5df9b4910dc 100644
--- a/Documentation/ABI/testing/sysfs-kernel-livepatch
+++ b/Documentation/ABI/testing/sysfs-kernel-livepatch
@@ -55,6 +55,14 @@ Description:
The object directory contains subdirectories for each function
that is patched within the object.
+What: /sys/kernel/livepatch/<patch>/<object>/patched
+Date: August 2022
+KernelVersion: 6.1.0
+Contact: live-patching@vger.kernel.org
+Description:
+ An attribute which indicates whether the object is currently
+ patched.
+
What: /sys/kernel/livepatch/<patch>/<object>/<function,sympos>
Date: Nov 2014
KernelVersion: 3.19.0
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-damon b/Documentation/ABI/testing/sysfs-kernel-mm-damon
new file mode 100644
index 000000000000..2744f21b5a6b
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-damon
@@ -0,0 +1,346 @@
+What:		/sys/kernel/mm/damon/
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Interface for Data Access MONitoring (DAMON). Contains files
+ for controlling DAMON. For more details on DAMON itself,
+ please refer to Documentation/admin-guide/mm/damon/index.rst.
+
+What: /sys/kernel/mm/damon/admin/
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description:	Interface for privileged users of DAMON. Contains files for
+		controlling DAMON that are aimed to be used by privileged
+		users.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description:	Writing a number 'N' to this file creates N directories
+		named '0' to 'N-1' under the kdamonds/ directory, each for
+		controlling a DAMON worker thread (kdamond).
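+
+		For example, writing '1' creates a single directory named '0':
+
+		# echo 1 > /sys/kernel/mm/damon/admin/kdamonds/nr_kdamonds
+		# ls /sys/kernel/mm/damon/admin/kdamonds
+		0  nr_kdamonds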
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/state
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description:	Writing 'on' or 'off' to this file makes the kdamond start or
+		stop, respectively. Reading the file returns the keywords
+		based on the current status. Writing 'commit' to this file
+		makes the kdamond read the user inputs in the sysfs files
+		except 'state' again. Writing 'update_schemes_stats' to the
+		file updates contents of schemes stats files of the kdamond.
+		Writing 'update_schemes_tried_regions' to the file updates
+		contents of the 'tried_regions' directory of every scheme
+		directory of this kdamond. Writing
+		'clear_schemes_tried_regions' to the file removes contents of
+		the 'tried_regions' directory.
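+
+		For example, assuming the kdamond directory '0' exists and is
+		fully configured:
+
+		# echo on > /sys/kernel/mm/damon/admin/kdamonds/0/state
+		# cat /sys/kernel/mm/damon/admin/kdamonds/0/state
+		on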
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/pid
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the pid of the kdamond if it is
+ running.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/nr_contexts
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description:	Writing a number 'N' to this file creates N directories
+		named '0' to 'N-1' under the contexts/ directory, each for
+		controlling a DAMON context.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/avail_operations
+Date: Apr 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the available monitoring operations
+ sets on the currently running kernel.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/operations
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a keyword for a monitoring operations set ('vaddr' for
+ virtual address spaces monitoring, 'fvaddr' for fixed virtual
+ address ranges monitoring, and 'paddr' for the physical address
+		space monitoring) to this file makes the context use the
+ operations set. Reading the file returns the keyword for the
+ operations set the context is set to use.
+
+		Note that only the operations sets that are listed in the
+		'avail_operations' file are valid inputs.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/intervals/sample_us
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a value to this file sets the sampling interval of the
+ DAMON context in microseconds as the value. Reading this file
+ returns the value.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/intervals/aggr_us
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a value to this file sets the aggregation interval of
+ the DAMON context in microseconds as the value. Reading this
+ file returns the value.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/intervals/update_us
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a value to this file sets the update interval of the
+ DAMON context in microseconds as the value. Reading this file
+ returns the value.
+
+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/nr_regions/min
+Date:		Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a value to this file sets the minimum number of
+ monitoring regions of the DAMON context as the value. Reading
+ this file returns the value.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/monitoring_attrs/nr_regions/max
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a value to this file sets the maximum number of
+ monitoring regions of the DAMON context as the value. Reading
+ this file returns the value.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/targets/nr_targets
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description:	Writing a number 'N' to this file creates N directories
+		named '0' to 'N-1' under the targets/ directory, each for
+		controlling a DAMON target of the context.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/targets/<T>/pid_target
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the pid of
+ the target process if the context is for virtual address spaces
+ monitoring, respectively.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/targets/<T>/regions/nr_regions
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description:	Writing a number 'N' to this file creates N directories
+		named '0' to 'N-1' under the regions/ directory, each for
+		setting a DAMON target memory region of the context. In
+		case of the virtual address space monitoring, DAMON
+		automatically sets the target memory region based on the
+		target processes' mappings.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/targets/<T>/regions/<R>/start
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the start
+ address of the monitoring region.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/targets/<T>/regions/<R>/end
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the end
+ address of the monitoring region.
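+
+ For example, a sketch restricting monitoring to one region
+ (the addresses are illustrative; only meaningful for 'fvaddr'
+ and 'paddr' contexts)::
+
+     cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/targets/0/regions
+     echo 1 > nr_regions
+     echo $((1 * 1024 * 1024 * 1024)) > 0/start  # 1 GiB
+     echo $((2 * 1024 * 1024 * 1024)) > 0/end    # 2 GiB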
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/nr_schemes
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a number 'N' to this file creates the number of
+ directories for controlling each DAMON-based operation scheme
+ of the context named '0' to 'N-1' under the schemes/ directory.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/action
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the action
+ of the scheme.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/sz/min
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the minimum
+ size of the scheme's target regions in bytes.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/sz/max
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the maximum
+ size of the scheme's target regions in bytes.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/nr_accesses/min
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the minimum
+ 'nr_accesses' of the scheme's target regions.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/nr_accesses/max
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the maximum
+ 'nr_accesses' of the scheme's target regions.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/age/min
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the minimum
+ 'age' of the scheme's target regions.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/access_pattern/age/max
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the maximum
+ 'age' of the scheme's target regions.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/ms
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the time
+ quota of the scheme in milliseconds.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/bytes
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the size
+ quota of the scheme in bytes.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/reset_interval_ms
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the quotas
+ charge reset interval of the scheme in milliseconds.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/sz_permil
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the
+ under-quota limit regions prioritization weight for 'size' in
+ permil.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/nr_accesses_permil
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the
+ under-quota limit regions prioritization weight for
+ 'nr_accesses' in permil.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/quotas/weights/age_permil
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the
+ under-quota limit regions prioritization weight for 'age' in
+ permil.
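+
+ For example, a minimal sketch of one scheme that pages out
+ regions unaccessed for a while, limited by a size quota (all
+ values are illustrative)::
+
+     cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes
+     echo 1 > nr_schemes
+     echo pageout > 0/action
+     echo 4096 > 0/access_pattern/sz/min
+     echo $((1024 * 1024 * 1024)) > 0/access_pattern/sz/max
+     echo 0 > 0/access_pattern/nr_accesses/min
+     echo 0 > 0/access_pattern/nr_accesses/max
+     echo 10 > 0/access_pattern/age/min
+     echo 4294967295 > 0/access_pattern/age/max
+     echo $((128 * 1024 * 1024)) > 0/quotas/bytes
+     echo 1000 > 0/quotas/reset_interval_ms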
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/watermarks/metric
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the metric
+ of the watermarks for the scheme. The writable/readable
+ keywords for this file are 'none' for disabling the watermarks
+ feature, or 'free_mem_rate' for the system's global free memory
+ rate in permil.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/watermarks/interval_us
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the metric
+ check interval of the watermarks for the scheme in
+ microseconds.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/watermarks/high
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the high
+ watermark of the scheme in permil.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/watermarks/mid
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the mid
+ watermark of the scheme in permil.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/watermarks/low
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the low
+ watermark of the scheme in permil.
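+
+ For example, a sketch that keeps the scheme active only while
+ the global free memory rate stays between the low and high
+ watermarks (values in permil, illustrative)::
+
+     cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/watermarks
+     echo free_mem_rate > metric
+     echo 1000000 > interval_us  # check every second
+     echo 500 > high
+     echo 400 > mid
+     echo 50 > low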
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/nr_filters
+Date: Dec 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing a number 'N' to this file creates the number of
+ directories for setting filters of the scheme named '0' to
+ 'N-1' under the filters/ directory.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/type
+Date: Dec 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing to and reading from this file sets and gets the type of
+ memory of interest. 'anon' for anonymous pages, or 'memcg'
+ for a specific memory cgroup, can be written and read.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/memcg_path
+Date: Dec 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: If 'memcg' is written to the 'type' file, writing to and
+ reading from this file sets and gets the path to the memory
+ cgroup of interest.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/matching
+Date: Dec 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Writing 'Y' or 'N' to this file sets whether to filter out
+ pages that do or do not match the 'type' and 'memcg_path',
+ respectively. 'Filter out' means the action of the scheme
+ will not be applied to those pages.
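+
+ For example, a sketch that excludes one cgroup's pages from
+ the scheme's action (the cgroup path is illustrative)::
+
+     cd /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/filters
+     echo 1 > nr_filters
+     echo memcg > 0/type
+     echo /workloads/protected > 0/memcg_path
+     echo Y > 0/matching  # filter out matching pages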
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/nr_tried
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the number of regions that the
+ scheme has tried to apply its action to.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/sz_tried
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the total size of regions that the
+ scheme has tried to apply its action to, in bytes.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/nr_applied
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the number of regions that the
+ scheme has successfully applied its action to.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/sz_applied
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the total size of regions that the
+ scheme has successfully applied its action to, in bytes.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/stats/qt_exceeds
+Date: Mar 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the number of times the scheme's
+ quotas have been exceeded.
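+
+ For example, to dump all statistics of scheme '0' at once (a
+ sketch)::
+
+     grep . /sys/kernel/mm/damon/admin/kdamonds/0/contexts/0/schemes/0/stats/*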
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/start
+Date: Oct 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the start address of a memory region
+ that the corresponding DAMON-based operation scheme has
+ tried to apply its action to.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/end
+Date: Oct 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the end address of a memory region
+ that the corresponding DAMON-based operation scheme has
+ tried to apply its action to.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/nr_accesses
+Date: Oct 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the 'nr_accesses' of a memory region
+ that the corresponding DAMON-based operation scheme has
+ tried to apply its action to.
+
+What: /sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/age
+Date: Oct 2022
+Contact: SeongJae Park <sj@kernel.org>
+Description: Reading this file returns the 'age' of a memory region that
+ the corresponding DAMON-based operation scheme has tried
+ to apply its action to.
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-ksm b/Documentation/ABI/testing/sysfs-kernel-mm-ksm
index 1c9bed5595f5..6041a025b65a 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-ksm
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-ksm
@@ -41,7 +41,7 @@ Description: Kernel Samepage Merging daemon sysfs interface
sleep_millisecs: how many milliseconds ksm should sleep between
scans.
- See Documentation/vm/ksm.rst for more information.
+ See Documentation/mm/ksm.rst for more information.
What: /sys/kernel/mm/ksm/merge_across_nodes
Date: January 2013
@@ -51,3 +51,11 @@ Description: Control merging pages across different NUMA nodes.
When it is set to 0 only pages from the same node are merged,
otherwise pages from all nodes can be merged together (default).
+
+What: /sys/kernel/mm/ksm/general_profit
+Date: April 2023
+KernelVersion: 6.4
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Measure how effective KSM is.
+ general_profit: how effective KSM is. The formula for the
+ calculation is in Documentation/admin-guide/mm/ksm.rst.
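+
+ For example (a sketch; the reported value differs per
+ system)::
+
+     cat /sys/kernel/mm/ksm/general_profit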
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
new file mode 100644
index 000000000000..721a05b90109
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
@@ -0,0 +1,25 @@
+What: /sys/devices/virtual/memory_tiering/
+Date: August 2022
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: A collection of all the memory tiers allocated.
+
+ Individual memory tier details are contained in subdirectories
+ named by the abstract distance of the memory tier.
+
+ /sys/devices/virtual/memory_tiering/memory_tierN/
+
+
+What: /sys/devices/virtual/memory_tiering/memory_tierN/
+ /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
+Date: August 2022
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Directory with details of a specific memory tier
+
+ This is the directory containing information about a particular
+ memory tier, memtierN, where N is derived based on abstract distance.
+
+ A smaller value of N implies a higher (faster) memory tier in the
+ hierarchy.
+
+ nodelist: NUMA nodes that are part of this memory tier.
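+
+ For example, to list the nodes of every present tier (a
+ sketch)::
+
+     grep . /sys/devices/virtual/memory_tiering/memory_tier*/nodelist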
+
diff --git a/Documentation/ABI/testing/sysfs-kernel-oops_count b/Documentation/ABI/testing/sysfs-kernel-oops_count
new file mode 100644
index 000000000000..156cca9dbc96
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-oops_count
@@ -0,0 +1,6 @@
+What: /sys/kernel/oops_count
+Date: November 2022
+KernelVersion: 6.2.0
+Contact: Linux Kernel Hardening List <linux-hardening@vger.kernel.org>
+Description:
+ Shows how many times the system has Oopsed since last boot.
diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab
index c9f12baf8baa..cd5fb8fa3ddf 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -10,7 +10,7 @@ Description:
any cache it aliases, if any).
Users: kernel memory tuning tools
-What: /sys/kernel/slab/cache/aliases
+What: /sys/kernel/slab/<cache>/aliases
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -19,7 +19,7 @@ Description:
The aliases file is read-only and specifies how many caches
have merged into this cache.
-What: /sys/kernel/slab/cache/align
+What: /sys/kernel/slab/<cache>/align
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -28,7 +28,7 @@ Description:
The align file is read-only and specifies the cache's object
alignment in bytes.
-What: /sys/kernel/slab/cache/alloc_calls
+What: /sys/kernel/slab/<cache>/alloc_calls
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -37,9 +37,9 @@ Description:
The alloc_calls file is read-only and lists the kernel code
locations from which allocations for this cache were performed.
The alloc_calls file only contains information if debugging is
- enabled for that cache (see Documentation/vm/slub.rst).
+ enabled for that cache (see Documentation/mm/slub.rst).
-What: /sys/kernel/slab/cache/alloc_fastpath
+What: /sys/kernel/slab/<cache>/alloc_fastpath
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -50,7 +50,7 @@ Description:
current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/alloc_from_partial
+What: /sys/kernel/slab/<cache>/alloc_from_partial
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -62,7 +62,7 @@ Description:
count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/alloc_refill
+What: /sys/kernel/slab/<cache>/alloc_refill
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -73,7 +73,7 @@ Description:
remote cpu frees. It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/alloc_slab
+What: /sys/kernel/slab/<cache>/alloc_slab
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -84,7 +84,7 @@ Description:
clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/alloc_slowpath
+What: /sys/kernel/slab/<cache>/alloc_slowpath
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -96,7 +96,7 @@ Description:
clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/cache_dma
+What: /sys/kernel/slab/<cache>/cache_dma
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -106,7 +106,7 @@ Description:
are from ZONE_DMA.
Available when CONFIG_ZONE_DMA is enabled.
-What: /sys/kernel/slab/cache/cpu_slabs
+What: /sys/kernel/slab/<cache>/cpu_slabs
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -115,7 +115,7 @@ Description:
The cpu_slabs file is read-only and displays how many cpu slabs
are active and their NUMA locality.
-What: /sys/kernel/slab/cache/cpuslab_flush
+What: /sys/kernel/slab/<cache>/cpuslab_flush
Date: April 2009
KernelVersion: 2.6.31
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -128,7 +128,7 @@ Description:
current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/ctor
+What: /sys/kernel/slab/<cache>/ctor
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -138,7 +138,7 @@ Description:
constructor function, which is invoked for each object when a
new slab is allocated.
-What: /sys/kernel/slab/cache/deactivate_empty
+What: /sys/kernel/slab/<cache>/deactivate_empty
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -148,7 +148,7 @@ Description:
was deactivated. It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/deactivate_full
+What: /sys/kernel/slab/<cache>/deactivate_full
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -158,7 +158,7 @@ Description:
was deactivated. It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/deactivate_remote_frees
+What: /sys/kernel/slab/<cache>/deactivate_remote_frees
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -169,7 +169,7 @@ Description:
remotely. It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/deactivate_to_head
+What: /sys/kernel/slab/<cache>/deactivate_to_head
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -180,7 +180,7 @@ Description:
list. It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/deactivate_to_tail
+What: /sys/kernel/slab/<cache>/deactivate_to_tail
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -191,7 +191,7 @@ Description:
list. It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/destroy_by_rcu
+What: /sys/kernel/slab/<cache>/destroy_by_rcu
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -200,7 +200,7 @@ Description:
The destroy_by_rcu file is read-only and specifies whether
slabs (not objects) are freed by rcu.
-What: /sys/kernel/slab/cache/free_add_partial
+What: /sys/kernel/slab/<cache>/free_add_partial
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -211,7 +211,7 @@ Description:
partial list. It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/free_calls
+What: /sys/kernel/slab/<cache>/free_calls
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -219,9 +219,9 @@ Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
Description:
The free_calls file is read-only and lists the locations of
object frees if slab debugging is enabled (see
- Documentation/vm/slub.rst).
+ Documentation/mm/slub.rst).
-What: /sys/kernel/slab/cache/free_fastpath
+What: /sys/kernel/slab/<cache>/free_fastpath
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -232,7 +232,7 @@ Description:
It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/free_frozen
+What: /sys/kernel/slab/<cache>/free_frozen
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -243,7 +243,7 @@ Description:
clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/free_remove_partial
+What: /sys/kernel/slab/<cache>/free_remove_partial
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -255,7 +255,7 @@ Description:
count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/free_slab
+What: /sys/kernel/slab/<cache>/free_slab
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -266,7 +266,7 @@ Description:
the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/free_slowpath
+What: /sys/kernel/slab/<cache>/free_slowpath
Date: February 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -277,7 +277,7 @@ Description:
be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/hwcache_align
+What: /sys/kernel/slab/<cache>/hwcache_align
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -286,7 +286,7 @@ Description:
The hwcache_align file is read-only and specifies whether
objects are aligned on cachelines.
-What: /sys/kernel/slab/cache/min_partial
+What: /sys/kernel/slab/<cache>/min_partial
Date: February 2009
KernelVersion: 2.6.30
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -297,7 +297,7 @@ Description:
allocating new slabs. Such slabs may be reclaimed by utilizing
the shrink file.
-What: /sys/kernel/slab/cache/object_size
+What: /sys/kernel/slab/<cache>/object_size
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -306,7 +306,7 @@ Description:
The object_size file is read-only and specifies the cache's
object size.
-What: /sys/kernel/slab/cache/objects
+What: /sys/kernel/slab/<cache>/objects
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -315,7 +315,7 @@ Description:
The objects file is read-only and displays how many objects are
active and from which nodes they are from.
-What: /sys/kernel/slab/cache/objects_partial
+What: /sys/kernel/slab/<cache>/objects_partial
Date: April 2008
KernelVersion: 2.6.26
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -325,7 +325,7 @@ Description:
objects are on partial slabs and from which nodes they are
from.
-What: /sys/kernel/slab/cache/objs_per_slab
+What: /sys/kernel/slab/<cache>/objs_per_slab
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -333,9 +333,9 @@ Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
Description:
The file objs_per_slab is read-only and specifies how many
objects may be allocated from a single slab of the order
- specified in /sys/kernel/slab/cache/order.
+ specified in /sys/kernel/slab/<cache>/order.
-What: /sys/kernel/slab/cache/order
+What: /sys/kernel/slab/<cache>/order
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -352,7 +352,7 @@ Description:
order is used and this sysfs entry can not be used to change
the order at run time.
-What: /sys/kernel/slab/cache/order_fallback
+What: /sys/kernel/slab/<cache>/order_fallback
Date: April 2008
KernelVersion: 2.6.26
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -365,7 +365,7 @@ Description:
Available when CONFIG_SLUB_STATS is enabled.
-What: /sys/kernel/slab/cache/partial
+What: /sys/kernel/slab/<cache>/partial
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -374,7 +374,7 @@ Description:
The partial file is read-only and displays how many partial
slabs there are and how long each node's list is.
-What: /sys/kernel/slab/cache/poison
+What: /sys/kernel/slab/<cache>/poison
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -383,7 +383,7 @@ Description:
The poison file specifies whether objects should be poisoned
when a new slab is allocated.
-What: /sys/kernel/slab/cache/reclaim_account
+What: /sys/kernel/slab/<cache>/reclaim_account
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -392,7 +392,7 @@ Description:
The reclaim_account file specifies whether the cache's objects
are reclaimable (and grouped by their mobility).
-What: /sys/kernel/slab/cache/red_zone
+What: /sys/kernel/slab/<cache>/red_zone
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -401,7 +401,7 @@ Description:
The red_zone file specifies whether the cache's objects are red
zoned.
-What: /sys/kernel/slab/cache/remote_node_defrag_ratio
+What: /sys/kernel/slab/<cache>/remote_node_defrag_ratio
Date: January 2008
KernelVersion: 2.6.25
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -415,7 +415,7 @@ Description:
Available when CONFIG_NUMA is enabled.
-What: /sys/kernel/slab/cache/sanity_checks
+What: /sys/kernel/slab/<cache>/sanity_checks
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -426,7 +426,7 @@ Description:
checks. Caches that enable sanity_checks cannot be merged with
caches that do not.
-What: /sys/kernel/slab/cache/shrink
+What: /sys/kernel/slab/<cache>/shrink
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -442,7 +442,7 @@ Description:
adversely impact other running applications. So it
should be used with care.
-What: /sys/kernel/slab/cache/slab_size
+What: /sys/kernel/slab/<cache>/slab_size
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -451,7 +451,7 @@ Description:
The slab_size file is read-only and specifies the object size
with metadata (debugging information and alignment) in bytes.
-What: /sys/kernel/slab/cache/slabs
+What: /sys/kernel/slab/<cache>/slabs
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -461,7 +461,7 @@ Description:
there are (both cpu and partial) and from which nodes they are
from.
-What: /sys/kernel/slab/cache/store_user
+What: /sys/kernel/slab/<cache>/store_user
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -470,7 +470,7 @@ Description:
The store_user file specifies whether the location of
allocation or free should be tracked for a cache.
-What: /sys/kernel/slab/cache/total_objects
+What: /sys/kernel/slab/<cache>/total_objects
Date: April 2008
KernelVersion: 2.6.26
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -479,7 +479,7 @@ Description:
The total_objects file is read-only and displays how many total
objects a cache has and from which nodes they are from.
-What: /sys/kernel/slab/cache/trace
+What: /sys/kernel/slab/<cache>/trace
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -488,7 +488,7 @@ Description:
The trace file specifies whether object allocations and frees
should be traced.
-What: /sys/kernel/slab/cache/validate
+What: /sys/kernel/slab/<cache>/validate
Date: May 2007
KernelVersion: 2.6.22
Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
@@ -496,3 +496,24 @@ Contact: Pekka Enberg <penberg@cs.helsinki.fi>,
Description:
Writing to the validate file causes SLUB to traverse all of its
cache's objects and check the validity of metadata.
+
+What: /sys/kernel/slab/<cache>/usersize
+Date: Jun 2017
+Contact: David Windsor <dave@nullcore.net>
+Description:
+ The usersize file is read-only and contains the usercopy
+ region size.
+
+What: /sys/kernel/slab/<cache>/slabs_cpu_partial
+Date: Aug 2011
+Contact: Christoph Lameter <cl@linux.com>
+Description:
+ This read-only file shows the number of partially allocated
+ frozen slabs.
+
+What: /sys/kernel/slab/<cache>/cpu_partial
+Date: Aug 2011
+Contact: Christoph Lameter <cl@linux.com>
+Description:
+ This read-only file shows the number of per-cpu partial
+ pages to keep around.
diff --git a/Documentation/ABI/testing/sysfs-kernel-warn_count b/Documentation/ABI/testing/sysfs-kernel-warn_count
new file mode 100644
index 000000000000..90a029813717
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-warn_count
@@ -0,0 +1,6 @@
+What: /sys/kernel/warn_count
+Date: November 2022
+KernelVersion: 6.2.0
+Contact: Linux Kernel Hardening List <linux-hardening@vger.kernel.org>
+Description:
+ Shows how many times the system has Warned since last boot.
diff --git a/Documentation/ABI/testing/sysfs-mce b/Documentation/ABI/testing/sysfs-mce
new file mode 100644
index 000000000000..83172f50e27c
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-mce
@@ -0,0 +1,97 @@
+What: /sys/devices/system/machinecheck/machinecheckX/
+Contact: Andi Kleen <ak@linux.intel.com>
+Date: Feb, 2007
+Description:
+ (X = CPU number)
+
+ Machine checks report internal hardware error conditions
+ detected by the CPU. Uncorrected errors typically cause a
+ machine check (often with panic), corrected ones cause a
+ machine check log entry.
+
+ For more details about the x86 machine check architecture
+ see the Intel and AMD architecture manuals from their
+ developer websites.
+
+ For more details about the architecture
+ see http://one.firstfloor.org/~andi/mce.pdf
+
+ Each CPU has its own directory.
+
+What: /sys/devices/system/machinecheck/machinecheckX/bank<Y>
+Contact: Andi Kleen <ak@linux.intel.com>
+Date: Feb, 2007
+Description:
+ (Y = bank number)
+
+ 64-bit hex bitmask enabling/disabling specific subevents for
+ bank Y.
+
+ When a bit in the bitmask is zero then the respective
+ subevent will not be reported.
+
+ By default all events are enabled.
+
+ Note that the BIOS maintains another mask to disable specific
+ events per bank. This is not visible here.
+
+What: /sys/devices/system/machinecheck/machinecheckX/check_interval
+Contact: Andi Kleen <ak@linux.intel.com>
+Date: Feb, 2007
+Description:
+ The entries appear for each CPU, but they are truly shared
+ between all CPUs.
+
+ How often to poll for corrected machine check errors, in
+ seconds (Note output is hexadecimal). Default 5 minutes.
+ When the poller finds MCEs it triggers an exponential speedup
+ (poll more often) on the polling interval. When the poller
+ stops finding MCEs, it triggers an exponential backoff
+ (poll less often) on the polling interval. The check_interval
+ variable is both the initial and maximum polling interval.
+ 0 means no polling for corrected machine check errors
+ (but some corrected errors might still be reported
+ in other ways)
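+
+ For example (a sketch; since the value is shared, writing it
+ for one CPU is enough)::
+
+     # disable polling for corrected errors
+     echo 0 > /sys/devices/system/machinecheck/machinecheck0/check_interval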
+
+What: /sys/devices/system/machinecheck/machinecheckX/trigger
+Contact: Andi Kleen <ak@linux.intel.com>
+Date: Feb, 2007
+Description:
+ The entries appear for each CPU, but they are truly shared
+ between all CPUs.
+
+ Program to run when a machine check event is detected.
+ This is an alternative to running mcelog regularly from cron
+ and allows events to be detected faster.
+
+What: /sys/devices/system/machinecheck/machinecheckX/monarch_timeout
+Contact: Andi Kleen <ak@linux.intel.com>
+Date: Feb, 2007
+Description:
+ How long to wait for the other CPUs to machine check too on an
+ exception. 0 to disable waiting for other CPUs.
+
+ Unit: us
+
+What: /sys/devices/system/machinecheck/machinecheckX/ignore_ce
+Contact: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
+Date: Jun 2009
+Description:
+ Disables polling and CMCI for corrected errors.
+ Corrected events are not cleared and are kept in the bank MSRs.
+
+What: /sys/devices/system/machinecheck/machinecheckX/dont_log_ce
+Contact: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
+Date: Jun 2009
+Description:
+ Disables logging for corrected errors.
+ All reported corrected errors will be cleared silently.
+
+ This option will be useful if you never care about corrected
+ errors.
+
+What: /sys/devices/system/machinecheck/machinecheckX/cmci_disabled
+Contact: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
+Date: Jun 2009
+Description:
+ Disables the CMCI feature.
diff --git a/Documentation/ABI/testing/sysfs-module b/Documentation/ABI/testing/sysfs-module
index 88bddf192ceb..08886367d047 100644
--- a/Documentation/ABI/testing/sysfs-module
+++ b/Documentation/ABI/testing/sysfs-module
@@ -41,6 +41,13 @@ KernelVersion: 3.3
Contact: Kay Sievers <kay.sievers@vrfy.org>
Description: Module size in bytes.
+What: /sys/module/*/initstate
+Date: Nov 2006
+KernelVersion: 2.6.19
+Contact: Kay Sievers <kay.sievers@vrfy.org>
+Description: Show the initialization state (live, coming, going) of
+ the module.
+
What: /sys/module/*/taint
Date: Jan 2012
KernelVersion: 3.3
diff --git a/Documentation/ABI/testing/sysfs-platform-asus-wmi b/Documentation/ABI/testing/sysfs-platform-asus-wmi
index 04885738cf15..a77a004a1baa 100644
--- a/Documentation/ABI/testing/sysfs-platform-asus-wmi
+++ b/Documentation/ABI/testing/sysfs-platform-asus-wmi
@@ -57,3 +57,44 @@ Description:
* 0 - default,
* 1 - overboost,
* 2 - silent
+
+What: /sys/devices/platform/<platform>/gpu_mux_mode
+Date: Aug 2022
+KernelVersion: 6.1
+Contact: "Luke Jones" <luke@ljones.dev>
+Description:
+ Switch the GPU hardware MUX mode. Laptops with this feature
+ can be toggled to boot with only the dGPU (discrete mode) or
+ in standard Optimus/Hybrid mode. On switching, a reboot is
+ required:
+
+ * 0 - Discrete GPU,
+ * 1 - Optimus/Hybrid,
+
+What: /sys/devices/platform/<platform>/dgpu_disable
+Date: Aug 2022
+KernelVersion: 5.17
+Contact: "Luke Jones" <luke@ljones.dev>
+Description:
+ Disable discrete GPU:
+ * 0 - Enable dGPU,
+ * 1 - Disable dGPU
+
+What: /sys/devices/platform/<platform>/egpu_enable
+Date: Aug 2022
+KernelVersion: 5.17
+Contact: "Luke Jones" <luke@ljones.dev>
+Description:
+ Enable the external GPU paired with ROG X-Flow laptops.
+ Toggling this setting will also trigger ACPI to disable the dGPU:
+
+ * 0 - Disable,
+ * 1 - Enable
+
+What: /sys/devices/platform/<platform>/panel_od
+Date: Aug 2022
+KernelVersion: 5.17
+Contact: "Luke Jones" <luke@ljones.dev>
+Description:
+ Enable an LCD response-time boost to reduce or remove ghosting:
+ * 0 - Disable,
+ * 1 - Enable
diff --git a/Documentation/ABI/testing/sysfs-platform-brcmstb-memc b/Documentation/ABI/testing/sysfs-platform-brcmstb-memc
new file mode 100644
index 000000000000..2f2b750ac2fd
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-platform-brcmstb-memc
@@ -0,0 +1,15 @@
+What: /sys/bus/platform/devices/*/srpd
+Date: July 2022
+KernelVersion: 5.21
+Contact: Florian Fainelli <f.fainelli@gmail.com>
+Description:
+ Self Refresh Power Down (SRPD) inactivity timeout counted in
+ internal DDR controller clock cycles. Possible values range
+ from 0 (disable inactivity timeout) to 65535 (0xffff).
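+
+ For example (a sketch; '<device>' stands for the MEMC platform
+ device name)::
+
+     # disable the inactivity timeout
+     echo 0 > /sys/bus/platform/devices/<device>/srpd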
+
+What: /sys/bus/platform/devices/*/frequency
+Date: July 2022
+KernelVersion: 5.21
+Contact: Florian Fainelli <f.fainelli@gmail.com>
+Description:
+ DDR PHY frequency in Hz.
diff --git a/Documentation/ABI/testing/sysfs-platform-chipidea-usb2 b/Documentation/ABI/testing/sysfs-platform-chipidea-usb2
index b0f4684a83fe..b9f7d924f28a 100644
--- a/Documentation/ABI/testing/sysfs-platform-chipidea-usb2
+++ b/Documentation/ABI/testing/sysfs-platform-chipidea-usb2
@@ -2,8 +2,8 @@ What: /sys/bus/platform/devices/ci_hdrc.0/role
Date: Mar 2017
Contact: Peter Chen <peter.chen@nxp.com>
Description:
- It returns string "gadget" or "host" when read it, it indicates
- current controller role.
+ When read, it returns string "gadget" or "host", indicating
+ the current controller role.
- It will do role switch when write "gadget" or "host" to it.
+ It will do role switch when "gadget" or "host" is written to it.
Only controller at dual-role configuration supports writing.
diff --git a/Documentation/ABI/testing/sysfs-platform-dell-privacy-wmi b/Documentation/ABI/testing/sysfs-platform-dell-privacy-wmi
index 7f9e18705861..1f1f274a6979 100644
--- a/Documentation/ABI/testing/sysfs-platform-dell-privacy-wmi
+++ b/Documentation/ABI/testing/sysfs-platform-dell-privacy-wmi
@@ -1,55 +1,71 @@
What: /sys/bus/wmi/devices/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_supported_type
Date: Apr 2021
KernelVersion: 5.13
-Contact: "perry.yuan@dell.com>"
+Contact: "<perry.yuan@dell.com>"
Description:
Display which dell hardware level privacy devices are supported
“Dell Privacy” is a set of HW, FW, and SW features to enhance
Dell’s commitment to platform privacy for MIC, Camera, and
ePrivacy screens.
The supported hardware privacy devices are:
-Attributes:
- Microphone Mute:
+
+ Attributes:
+ Microphone Mute:
Identifies the local microphone can be muted by hardware, no applications
is available to capture system mic sound
- Camera Shutter:
+ Camera Shutter:
Identifies camera shutter controlled by hardware, which is a micromechanical
shutter assembly that is built onto the camera module to block capturing images
from outside the laptop
- supported:
+ Values:
+
+ supported:
The privacy device is supported by this system
- unsupported:
+ unsupported:
The privacy device is not supported on this system
- For example to check which privacy devices are supported:
+ For example to check which privacy devices are supported::
- # cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_supported_type
- [Microphone Mute] [supported]
- [Camera Shutter] [supported]
- [ePrivacy Screen] [unsupported]
+ # cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_supported_type
+ [Microphone Mute] [supported]
+ [Camera Shutter] [supported]
+ [ePrivacy Screen] [unsupported]
What: /sys/bus/wmi/devices/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_current_state
Date: Apr 2021
KernelVersion: 5.13
-Contact: "perry.yuan@dell.com>"
+Contact: "<perry.yuan@dell.com>"
Description:
Allow user space to check current dell privacy device state.
Describes the Device State class exposed by BIOS which can be
consumed by various applications interested in knowing the Privacy
feature capabilities
-Attributes:
- muted:
- Identifies the privacy device is turned off and cannot send stream to OS applications
- unmuted:
- Identifies the privacy device is turned on ,audio or camera driver can get
- stream from mic and camera module to OS applications
+ Attributes:
+ Microphone:
+ Identifies that the local microphone can be muted by hardware;
+ no application is available to capture system mic sound
+
+ Camera Shutter:
+ Identifies camera shutter controlled by hardware, which is a micromechanical
+ shutter assembly that is built onto the camera module to block capturing images
+ from outside the laptop
+
+ Values:
+ muted:
+ Identifies the privacy device is turned off
+ and cannot send stream to OS applications
+
+ unmuted:
+ Identifies the privacy device is turned on,
+ audio or camera driver can get stream from mic
+ and camera module to OS applications
- For example to check all supported current privacy device states:
+ For example to check all supported current privacy device states::
- # cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_current_state
- [Microphone] [unmuted]
- [Camera Shutter] [unmuted]
+ # cat /sys/bus/wmi/drivers/dell-privacy/6932965F-1671-4CEB-B988-D3AB0A901919/dell_privacy_current_state
+ [Microphone] [unmuted]
+ [Camera Shutter] [unmuted]
diff --git a/Documentation/ABI/testing/sysfs-platform-dell-wmi-ddv b/Documentation/ABI/testing/sysfs-platform-dell-wmi-ddv
new file mode 100644
index 000000000000..1d97ad615c66
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-platform-dell-wmi-ddv
@@ -0,0 +1,7 @@
+What: /sys/class/power_supply/<battery_name>/eppid
+Date: September 2022
+KernelVersion: 6.1
+Contact: Armin Wolf <W_Armin@gmx.de>
+Description:
+ Reports the Dell ePPID (electronic Dell Piece Part Identification)
+ of the ACPI battery.
diff --git a/Documentation/ABI/testing/sysfs-platform-dptf b/Documentation/ABI/testing/sysfs-platform-dptf
index 53c6b1000320..620fd20434a5 100644
--- a/Documentation/ABI/testing/sysfs-platform-dptf
+++ b/Documentation/ABI/testing/sysfs-platform-dptf
@@ -133,7 +133,10 @@ Contact: linux-acpi@vger.kernel.org
Description:
(RO) Presents SSC (spread spectrum clock) information for EMI
(Electro magnetic interference) control. This is a bit mask.
+
+ ======= ==========================================
Bits Description
+ ======= ==========================================
[7:0] Sets clock spectrum spread percentage:
0x00=0.2% , 0x3F=10%
1 LSB = 0.1% increase in spread (for
@@ -151,3 +154,4 @@ Description:
[10] 0: No white noise. 1: Add white noise
to spread waveform
[11] When 1, future writes are ignored.
+ ======= ==========================================
diff --git a/Documentation/ABI/testing/sysfs-platform-intel-ifs b/Documentation/ABI/testing/sysfs-platform-intel-ifs
new file mode 100644
index 000000000000..41b4d5b1e21c
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-platform-intel-ifs
@@ -0,0 +1,52 @@
+Device instance to test mapping
+intel_ifs_0 -> Scan Test
+intel_ifs_1 -> Array BIST test
+
+What: /sys/devices/virtual/misc/intel_ifs_<N>/run_test
+Date: Nov 16 2022
+KernelVersion: 6.2
+Contact: "Jithu Joseph" <jithu.joseph@intel.com>
+Description: Write <cpu#> to trigger IFS test for one online core.
+ Note that the test is per core. The cpu# can be
+ for any thread on the core. Running on one thread
+ completes the test for the core containing that thread.
+ Example: to test the core containing cpu5: echo 5 >
+ /sys/devices/virtual/misc/intel_ifs_<N>/run_test
+Devices: all
+
+What: /sys/devices/virtual/misc/intel_ifs_<N>/status
+Date: Nov 16 2022
+KernelVersion: 6.2
+Contact: "Jithu Joseph" <jithu.joseph@intel.com>
+Description: The status of the last test. It can be one of "pass", "fail"
+ or "untested".
+Devices: all
+
+What: /sys/devices/virtual/misc/intel_ifs_<N>/details
+Date: Nov 16 2022
+KernelVersion: 6.2
+Contact: "Jithu Joseph" <jithu.joseph@intel.com>
+Description: Additional information regarding the last test. The details file reports
+ the hex value of the STATUS MSR for this test. Note that the error_code field
+ may contain driver defined software code not defined in the Intel SDM.
+Devices: all
+
+What: /sys/devices/virtual/misc/intel_ifs_<N>/image_version
+Date: Nov 16 2022
+KernelVersion: 6.2
+Contact: "Jithu Joseph" <jithu.joseph@intel.com>
+Description: Version (hexadecimal) of the loaded IFS test image. If no
+ test image is loaded, reports "none". Only present for device
+ instances where a test image is applicable.
+Devices: intel_ifs_0
+
+What: /sys/devices/virtual/misc/intel_ifs_<N>/current_batch
+Date: Nov 16 2022
+KernelVersion: 6.2
+Contact: "Jithu Joseph" <jithu.joseph@intel.com>
+Description: Write a number less than or equal to 0xff to load an IFS test image.
+ The number written is treated as the two-digit suffix in the following file name:
+ /lib/firmware/intel/ifs_<N>/ff-mm-ss-02x.scan
+ Reading the file will provide the suffix of the currently loaded IFS test image.
+ This file is present only for device instances where a test image is applicable.
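+
+ For example, a sketch that loads batch 2 and then tests the
+ core containing cpu5::
+
+     echo 2 > /sys/devices/virtual/misc/intel_ifs_0/current_batch
+     echo 5 > /sys/devices/virtual/misc/intel_ifs_0/run_test
+     cat /sys/devices/virtual/misc/intel_ifs_0/status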
+Devices: intel_ifs_0
diff --git a/Documentation/ABI/testing/sysfs-platform-intel-pmc b/Documentation/ABI/testing/sysfs-platform-intel-pmc
index ef199af75ab0..f31d59b21f9b 100644
--- a/Documentation/ABI/testing/sysfs-platform-intel-pmc
+++ b/Documentation/ABI/testing/sysfs-platform-intel-pmc
@@ -11,8 +11,10 @@ Description:
to take effect.
Display global reset setting bits for PMC.
+
* bit 31 - global reset is locked
* bit 20 - global reset is set
+
Writing bit 20 value to the etr3 will induce
a platform "global reset" upon consequent platform reset,
in case the register is not locked.
diff --git a/Documentation/ABI/testing/sysfs-platform-lg-laptop b/Documentation/ABI/testing/sysfs-platform-lg-laptop
index cf47749b19df..0570cd524d0e 100644
--- a/Documentation/ABI/testing/sysfs-platform-lg-laptop
+++ b/Documentation/ABI/testing/sysfs-platform-lg-laptop
@@ -17,6 +17,7 @@ Date: October 2018
KernelVersion: 4.20
Contact: "Matan Ziv-Av <matan@svgalib.org>
Description:
+ Deprecated: use /sys/class/power_supply/CMB0/charge_control_end_threshold
Maximal battery charge level. Accepted values are 80 or 100.
What: /sys/devices/platform/lg-laptop/fan_mode
diff --git a/Documentation/ABI/testing/sysfs-platform-mellanox-bootctl b/Documentation/ABI/testing/sysfs-platform-mellanox-bootctl
index e79ca22e2f45..9b99a81babb1 100644
--- a/Documentation/ABI/testing/sysfs-platform-mellanox-bootctl
+++ b/Documentation/ABI/testing/sysfs-platform-mellanox-bootctl
@@ -68,3 +68,10 @@ Description:
Wasted burnt and invalid
Invalid not burnt but marked as valid (error state).
======= ===============================================
+
+What: /sys/bus/platform/devices/MLNXBF04:00/bootfifo
+Date: Apr 2023
+KernelVersion: 6.4
+Contact: "Liming Sun <limings@nvidia.com>"
+Description:
+ The file used to access the BlueField boot fifo.
diff --git a/Documentation/ABI/testing/sysfs-platform-sst-atom b/Documentation/ABI/testing/sysfs-platform-sst-atom
index d5f6e21f0e42..0154b0fba759 100644
--- a/Documentation/ABI/testing/sysfs-platform-sst-atom
+++ b/Documentation/ABI/testing/sysfs-platform-sst-atom
@@ -1,4 +1,4 @@
-What: /sys/devices/platform/8086%x:00/firmware_version
+What: /sys/devices/platform/8086<x>:00/firmware_version
Date: November 2016
KernelVersion: 4.10
Contact: "Sebastien Guiriec" <sebastien.guiriec@intel.com>
diff --git a/Documentation/ABI/testing/sysfs-power b/Documentation/ABI/testing/sysfs-power
index 90ec4987074b..a3942b1036e2 100644
--- a/Documentation/ABI/testing/sysfs-power
+++ b/Documentation/ABI/testing/sysfs-power
@@ -152,7 +152,7 @@ Description:
case further investigation is required to determine which
device is causing the problem. Note that genuine RTC clock
values (such as when pm_trace has not been used), can still
- match a device and output it's name here.
+ match a device and output its name here.
What: /sys/power/pm_async
Date: January 2009
@@ -413,6 +413,35 @@ Description:
The /sys/power/suspend_stats/last_failed_step file contains
the last failed step in the suspend/resume path.
+What: /sys/power/suspend_stats/last_hw_sleep
+Date: June 2023
+Contact: Mario Limonciello <mario.limonciello@amd.com>
+Description:
+ The /sys/power/suspend_stats/last_hw_sleep file
+ contains the duration of time spent in a hardware sleep
+ state in the most recent system suspend-resume cycle.
+ This number is measured in microseconds.
+
+What: /sys/power/suspend_stats/total_hw_sleep
+Date: June 2023
+Contact: Mario Limonciello <mario.limonciello@amd.com>
+Description:
+ The /sys/power/suspend_stats/total_hw_sleep file
+ contains the aggregate of time spent in a hardware sleep
+ state since the kernel was booted. This number
+ is measured in microseconds.
+
+What: /sys/power/suspend_stats/max_hw_sleep
+Date: June 2023
+Contact: Mario Limonciello <mario.limonciello@amd.com>
+Description:
+ The /sys/power/suspend_stats/max_hw_sleep file
+ contains the maximum amount of time that the hardware can
+ report for time spent in a hardware sleep state. When sleep
+ cycles are longer than this time, the values for
+ 'total_hw_sleep' and 'last_hw_sleep' may not be accurate.
+ This number is measured in microseconds.
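+
+ For example, a sketch comparing the last suspend-resume cycle
+ against the reported ceiling (values in microseconds)::
+
+     cat /sys/power/suspend_stats/last_hw_sleep
+     cat /sys/power/suspend_stats/max_hw_sleep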
+
What: /sys/power/sync_on_suspend
Date: October 2019
Contact: Jonas Meurer <jonas@freesources.org>
diff --git a/Documentation/ABI/testing/sysfs-ptp b/Documentation/ABI/testing/sysfs-ptp
index d378f57c1b73..9c317ac7c47a 100644
--- a/Documentation/ABI/testing/sysfs-ptp
+++ b/Documentation/ABI/testing/sysfs-ptp
@@ -6,7 +6,7 @@ Description:
providing a standardized interface to the ancillary
features of PTP hardware clocks.
-What: /sys/class/ptp/ptpN/
+What: /sys/class/ptp/ptp<N>/
Date: September 2010
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
@@ -14,7 +14,7 @@ Description:
hardware clock registered into the PTP class driver
subsystem.
-What: /sys/class/ptp/ptpN/clock_name
+What: /sys/class/ptp/ptp<N>/clock_name
Date: September 2010
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
@@ -25,7 +25,7 @@ Description:
MAC based ones. The string does not necessarily have
to be any kind of unique id.
-What: /sys/class/ptp/ptpN/max_adjustment
+What: /sys/class/ptp/ptp<N>/max_adjustment
Date: September 2010
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
@@ -33,42 +33,42 @@ Description:
frequency adjustment value (a positive integer) in
parts per billion.
-What: /sys/class/ptp/ptpN/max_vclocks
+What: /sys/class/ptp/ptp<N>/max_vclocks
Date: May 2021
Contact: Yangbo Lu <yangbo.lu@nxp.com>
Description:
This file contains the maximum number of ptp vclocks.
Write an integer to re-configure it.
-What: /sys/class/ptp/ptpN/n_alarms
+What: /sys/class/ptp/ptp<N>/n_alarms
Date: September 2010
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
This file contains the number of periodic or one shot
alarms offered by the PTP hardware clock.
-What: /sys/class/ptp/ptpN/n_external_timestamps
+What: /sys/class/ptp/ptp<N>/n_external_timestamps
Date: September 2010
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
This file contains the number of external timestamp
channels offered by the PTP hardware clock.
-What: /sys/class/ptp/ptpN/n_periodic_outputs
+What: /sys/class/ptp/ptp<N>/n_periodic_outputs
Date: September 2010
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
This file contains the number of programmable periodic
output channels offered by the PTP hardware clock.
-What: /sys/class/ptp/ptpN/n_pins
+What: /sys/class/ptp/ptp<N>/n_pins
Date: March 2014
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
This file contains the number of programmable pins
offered by the PTP hardware clock.
-What: /sys/class/ptp/ptpN/n_vclocks
+What: /sys/class/ptp/ptp<N>/n_vclocks
Date: May 2021
Contact: Yangbo Lu <yangbo.lu@nxp.com>
Description:
@@ -81,7 +81,7 @@ Description:
switches the physical clock back to normal, adjustable
operation.
-What: /sys/class/ptp/ptpN/pins
+What: /sys/class/ptp/ptp<N>/pins
Date: March 2014
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
@@ -94,7 +94,7 @@ Description:
assignment may be changed by writing two numbers into
the file.
-What: /sys/class/ptp/ptpN/pps_available
+What: /sys/class/ptp/ptp<N>/pps_available
Date: September 2010
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
@@ -103,7 +103,7 @@ Description:
"1" means that the PPS is supported, while "0" means
not supported.
-What: /sys/class/ptp/ptpN/extts_enable
+What: /sys/class/ptp/ptp<N>/extts_enable
Date: September 2010
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
@@ -113,7 +113,7 @@ Description:
To disable external timestamps, write the channel
index followed by a "0" into the file.
-What: /sys/class/ptp/ptpN/fifo
+What: /sys/class/ptp/ptp<N>/fifo
Date: September 2010
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
@@ -121,7 +121,7 @@ Description:
the form of three integers: channel index, seconds,
and nanoseconds.
-What: /sys/class/ptp/ptpN/period
+What: /sys/class/ptp/ptp<N>/period
Date: September 2010
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
@@ -132,7 +132,7 @@ Description:
period nanoseconds. To disable a periodic output, set
all the seconds and nanoseconds values to zero.
-What: /sys/class/ptp/ptpN/pps_enable
+What: /sys/class/ptp/ptp<N>/pps_enable
Date: September 2010
Contact: Richard Cochran <richardcochran@gmail.com>
Description:
diff --git a/Documentation/ABI/testing/sysfs-secvar b/Documentation/ABI/testing/sysfs-secvar
index feebb8c57294..857cf12b0904 100644
--- a/Documentation/ABI/testing/sysfs-secvar
+++ b/Documentation/ABI/testing/sysfs-secvar
@@ -18,6 +18,14 @@ Description: A string indicating which backend is in use by the firmware.
This determines the format of the variable and the accepted
format of variable updates.
+ On powernv/OPAL, this value is provided by the OPAL firmware
+ and is expected to be "ibm,edk2-compat-v1".
+
+ On pseries/PLPKS, this is generated by the kernel based on the
+ version number in the SB_VERSION variable in the keystore, and
+ has the form "ibm,plpks-sb-v<version>", or
+ "ibm,plpks-sb-unknown" if there is no SB_VERSION variable.
+
What: /sys/firmware/secvar/vars/<variable name>
Date: August 2019
Contact: Nayna Jain <nayna@linux.ibm.com>
@@ -34,7 +42,7 @@ Description: An integer representation of the size of the content of the
What: /sys/firmware/secvar/vars/<variable_name>/data
Date: August 2019
-Contact: Nayna Jain h<nayna@linux.ibm.com>
+Contact: Nayna Jain <nayna@linux.ibm.com>
Description: A read-only file containing the value of the variable. The size
of the file represents the maximum size of the variable data.
@@ -44,3 +52,68 @@ Contact: Nayna Jain <nayna@linux.ibm.com>
Description: A write-only file that is used to submit the new value for the
variable. The size of the file represents the maximum size of
the variable data that can be written.
+
+What: /sys/firmware/secvar/config
+Date: February 2023
+Contact: Nayna Jain <nayna@linux.ibm.com>
+Description: This optional directory contains read-only config attributes as
+ defined by the secure variable implementation. All data is in
+ ASCII format. The directory is only created if the backing
+ implementation provides variables to populate it, which at
+ present is only PLPKS on the pseries platform.
+
+What: /sys/firmware/secvar/config/version
+Date: February 2023
+Contact: Nayna Jain <nayna@linux.ibm.com>
+Description: Config version as reported by the hypervisor in ASCII decimal
+ format.
+
+ Currently only provided by PLPKS on the pseries platform.
+
+What: /sys/firmware/secvar/config/max_object_size
+Date: February 2023
+Contact: Nayna Jain <nayna@linux.ibm.com>
+Description: Maximum allowed size of objects in the keystore in bytes,
+ represented in ASCII decimal format.
+
+ This is not necessarily the same as the maximum size that can
+ be written to an update file, as writes can contain more than
+ object data; use the size of the update file for that
+ purpose.
+
+ Currently only provided by PLPKS on the pseries platform.
+
+What: /sys/firmware/secvar/config/total_size
+Date: February 2023
+Contact: Nayna Jain <nayna@linux.ibm.com>
+Description: Total size of the PLPKS in bytes, represented in ASCII decimal
+ format.
+
+ Currently only provided by PLPKS on the pseries platform.
+
+What: /sys/firmware/secvar/config/used_space
+Date: February 2023
+Contact: Nayna Jain <nayna@linux.ibm.com>
+Description: Current space consumed by the key store, in bytes, represented
+ in ASCII decimal format.
+
+ Currently only provided by PLPKS on the pseries platform.
+
+What: /sys/firmware/secvar/config/supported_policies
+Date: February 2023
+Contact: Nayna Jain <nayna@linux.ibm.com>
+Description: Bitmask of policy flags supported by the hypervisor,
+ represented as an 8 byte hexadecimal ASCII string. Consult the
+ hypervisor documentation for what these flags are.
+
+ Currently only provided by PLPKS on the pseries platform.
+
+What: /sys/firmware/secvar/config/signed_update_algorithms
+Date: February 2023
+Contact: Nayna Jain <nayna@linux.ibm.com>
+Description: Bitmask of flags indicating which algorithms the hypervisor
+ supports for signed update of objects, represented as a 16 byte
+ hexadecimal ASCII string. Consult the hypervisor documentation
+ for what these flags mean.
+
+ Currently only provided by PLPKS on the pseries platform.
diff --git a/Documentation/ABI/testing/sysfs-timecard b/Documentation/ABI/testing/sysfs-timecard
new file mode 100644
index 000000000000..220478156297
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-timecard
@@ -0,0 +1,288 @@
+What: /sys/class/timecard/
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: This directory contains files and directories
+ providing a standardized interface to the ancillary
+ features of the OpenCompute timecard.
+
+What: /sys/class/timecard/ocpN/
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: This directory contains the attributes of the Nth timecard
+ registered.
+
+What: /sys/class/timecard/ocpN/available_clock_sources
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) The list of available time sources that the PHC
+ uses for clock adjustments.
+
+ ==== =================================================
+ NONE no adjustments
+ PPS adjustments come from the PPS1 selector (default)
+ TOD adjustments from the GNSS/TOD module
+ IRIG adjustments from external IRIG-B signal
+ DCF adjustments from external DCF signal
+ ==== =================================================
+
+What: /sys/class/timecard/ocpN/available_sma_inputs
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Set of available destinations (sinks) for a SMA
+ input signal.
+
+ ===== ================================================
+ 10Mhz signal is used as the 10Mhz reference clock
+ PPS1 signal is sent to the PPS1 selector
+ PPS2 signal is sent to the PPS2 selector
+ TS1 signal is sent to timestamper 1
+ TS2 signal is sent to timestamper 2
+ TS3 signal is sent to timestamper 3
+ TS4 signal is sent to timestamper 4
+ IRIG signal is sent to the IRIG-B module
+ DCF signal is sent to the DCF module
+ FREQ1 signal is sent to frequency counter 1
+ FREQ2 signal is sent to frequency counter 2
+ FREQ3 signal is sent to frequency counter 3
+ FREQ4 signal is sent to frequency counter 4
+ None signal input is disabled
+ ===== ================================================
+
+What: /sys/class/timecard/ocpN/available_sma_outputs
+Date: May 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Set of available sources for a SMA output signal.
+
+ ===== ================================================
+ 10Mhz output is from the 10Mhz reference clock
+ PHC output PPS is from the PHC clock
+ MAC output PPS is from the Miniature Atomic Clock
+ GNSS1 output PPS is from the first GNSS module
+ GNSS2 output PPS is from the second GNSS module
+ IRIG output is from the PHC, in IRIG-B format
+ DCF output is from the PHC, in DCF format
+ GEN1 output is from frequency generator 1
+ GEN2 output is from frequency generator 2
+ GEN3 output is from frequency generator 3
+ GEN4 output is from frequency generator 4
+ GND output is GND
+ VCC output is VCC
+ ===== ================================================
+
+What: /sys/class/timecard/ocpN/clock_source
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RW) Contains the current synchronization source used by
+ the PHC. May be changed by writing one of the listed
+ values from the available_clock_sources attribute set.
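+
+		For example (assuming the card is registered as ocp0):
+
+		  # cat /sys/class/timecard/ocp0/clock_source
+		  PPS
+		  # echo TOD > /sys/class/timecard/ocp0/clock_source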
+
+What: /sys/class/timecard/ocpN/clock_status_drift
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Contains the current drift value used by the firmware
+ for internal disciplining of the atomic clock.
+
+What: /sys/class/timecard/ocpN/clock_status_offset
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Contains the current offset value used by the firmware
+ for internal disciplining of the atomic clock.
+
+What: /sys/class/timecard/ocpN/freqX
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Optional directory containing the sysfs nodes for
+ frequency counter <X>.
+
+What: /sys/class/timecard/ocpN/freqX/frequency
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Contains the measured frequency over the specified
+ measurement period.
+
+What: /sys/class/timecard/ocpN/freqX/seconds
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RW) Specifies the number of seconds from 0-255 that the
+ frequency should be measured over. Write 0 to disable.
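+
+		For example, a hypothetical 10-second measurement on
+		frequency counter 1 of card ocp0 (the reading shown is
+		illustrative):
+
+		  # echo 10 > /sys/class/timecard/ocp0/freq1/seconds
+		  # cat /sys/class/timecard/ocp0/freq1/frequency
+		  10000000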
+
+What: /sys/class/timecard/ocpN/genX
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Optional directory containing the sysfs nodes for
+ frequency generator <X>.
+
+What: /sys/class/timecard/ocpN/genX/duty
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Specifies the signal duty cycle as a percentage from 1-99.
+
+What: /sys/class/timecard/ocpN/genX/period
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Specifies the signal period in nanoseconds.
+
+What: /sys/class/timecard/ocpN/genX/phase
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Specifies the signal phase offset in nanoseconds.
+
+What: /sys/class/timecard/ocpN/genX/polarity
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Specifies the signal polarity, either 1 or 0.
+
+What: /sys/class/timecard/ocpN/genX/running
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Either 0 or 1, showing if the signal generator is running.
+
+What: /sys/class/timecard/ocpN/genX/start
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Shows the time in <sec>.<nsec> that the signal generator
+ started running.
+
+What: /sys/class/timecard/ocpN/genX/signal
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RW) Used to start the signal generator, and summarize
+ the current status.
+
+ The signal generator may be started by writing the signal
+ period, followed by the optional signal values. If the
+ optional values are not provided, they default to the current
+ settings, which may be obtained from the other sysfs nodes.
+
+ period [duty [phase [polarity]]]
+
+ echo 500000000 > signal # 1/2 second period
+ echo 1000000 40 100 > signal
+ echo 0 > signal # turn off generator
+
+ Period and phase are specified in nanoseconds. Duty cycle is
+ a percentage from 1-99. Polarity is 1 or 0.
+
+ Reading this node will return:
+
+ period duty phase polarity start_time
+
+What: /sys/class/timecard/ocpN/gnss_sync
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Indicates whether a valid GNSS signal is received,
+ or when the signal was lost.
+
+What: /sys/class/timecard/ocpN/i2c
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: This optional attribute links to the associated i2c device.
+
+What: /sys/class/timecard/ocpN/irig_b_mode
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RW) An integer from 0-7 indicating the timecode format
+ of the IRIG-B output signal: B00<n>
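+
+		For example, writing 3 selects the B003 format (assuming
+		the card is registered as ocp0):
+
+		  # echo 3 > /sys/class/timecard/ocp0/irig_b_mode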
+
+What: /sys/class/timecard/ocpN/pps
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: This optional attribute links to the associated PPS device.
+
+What: /sys/class/timecard/ocpN/ptp
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: This attribute links to the associated PTP device.
+
+What: /sys/class/timecard/ocpN/serialnum
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RO) Provides the serial number of the timecard.
+
+What: /sys/class/timecard/ocpN/sma1
+What: /sys/class/timecard/ocpN/sma2
+What: /sys/class/timecard/ocpN/sma3
+What: /sys/class/timecard/ocpN/sma4
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RW) These attributes specify the direction of the signal
+ on the associated SMA connectors, and also the signal sink
+ or source.
+
+		The display format of the attribute is a space-separated
+		list of signals, prefixed by the input/output direction.
+
+ The signal direction may be changed (if supported) by
+ prefixing the signal list with either "in:" or "out:".
+ If neither prefix is present, then the direction is unchanged.
+
+ The output signal may be changed by writing one of the listed
+ values from the available_sma_outputs attribute set.
+
+ The input destinations may be changed by writing multiple
+ values from the available_sma_inputs attribute set,
+ separated by spaces. If there are duplicated input
+ destinations between connectors, the lowest numbered SMA
+ connector is given priority.
+
+ Note that not all input combinations may make sense.
+
+ The 10Mhz reference clock input is currently only valid
+ on SMA1 and may not be combined with other destination sinks.
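+
+		For example (assuming the card is registered as ocp0;
+		the available signals depend on the installed firmware):
+
+		  # cat /sys/class/timecard/ocp0/sma1
+		  in: 10Mhz
+		  # echo "in: TS1 PPS1" > /sys/class/timecard/ocp0/sma2
+		  # echo "out: MAC" > /sys/class/timecard/ocp0/sma3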
+
+What: /sys/class/timecard/ocpN/tod_correction
+Date: March 2022
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RW) The incoming GNSS signal is in UTC time, and the NMEA
+ format messages do not provide a TAI offset. This sets the
+ correction value for the incoming time.
+
+ If UBX_LS is enabled, this should be 0, and the offset is
+ taken from the UBX-NAV-TIMELS message.
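+
+		For example, when UBX_LS is enabled (assuming the card is
+		registered as ocp0):
+
+		  # echo 0 > /sys/class/timecard/ocp0/tod_correction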
+
+What: /sys/class/timecard/ocpN/ts_window_adjust
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RW) When retrieving the PHC with the PTP SYS_OFFSET_EXTENDED
+ ioctl, a system timestamp is made before and after the PHC
+ time is retrieved. The midpoint between the two system
+ timestamps is usually taken to be the SYS time associated
+		with the PHC time. This estimate may be wrong, as it depends
+		on PCI latencies and on when the PHC time was latched.
+
+ The attribute value reduces the end timestamp by the given
+ number of nanoseconds, so the computed midpoint matches the
+ retrieved PHC time.
+
+ The initial value is set based on measured PCI latency and
+ the estimated point where the FPGA latches the PHC time. This
+ value may be changed by writing an unsigned integer.
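+
+		For example, to shorten the window by a hypothetical
+		1200 ns (assuming the card is registered as ocp0):
+
+		  # echo 1200 > /sys/class/timecard/ocp0/ts_window_adjust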
+
+What: /sys/class/timecard/ocpN/ttyGNSS
+What: /sys/class/timecard/ocpN/ttyGNSS2
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: These optional attributes link to the TTY serial ports
+ associated with the GNSS devices.
+
+What: /sys/class/timecard/ocpN/ttyMAC
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: This optional attribute links to the TTY serial port
+ associated with the Miniature Atomic Clock.
+
+What: /sys/class/timecard/ocpN/ttyNMEA
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: This optional attribute links to the TTY serial port
+ which outputs the PHC time in NMEA ZDA format.
+
+What: /sys/class/timecard/ocpN/utc_tai_offset
+Date: September 2021
+Contact: Jonathan Lemon <jonathan.lemon@gmail.com>
+Description: (RW) The DCF and IRIG output signals are in UTC, while the
+ TimeCard operates on TAI. This attribute allows setting the
+ offset in seconds, which is added to the TAI timebase for
+ these formats.
+
+ The offset may be changed by writing an unsigned integer.
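+
+		For example, to set the 37-second UTC-TAI offset in effect
+		since 2017 (assuming the card is registered as ocp0):
+
+		  # echo 37 > /sys/class/timecard/ocp0/utc_tai_offset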
diff --git a/Documentation/ABI/testing/sysfs-tty b/Documentation/ABI/testing/sysfs-tty
index e157130a6792..820e412d38a8 100644
--- a/Documentation/ABI/testing/sysfs-tty
+++ b/Documentation/ABI/testing/sysfs-tty
@@ -9,7 +9,7 @@ Description:
The file supports poll() to detect virtual
console switches.
-What: /sys/class/tty/tty0/active
+What: /sys/class/tty/tty<x>/active
Date: Nov 2010
Contact: Kay Sievers <kay.sievers@vrfy.org>
Description:
@@ -18,7 +18,7 @@ Description:
The file supports poll() to detect virtual
console switches.
-What: /sys/class/tty/ttyS0/uartclk
+What: /sys/class/tty/ttyS<x>/uartclk
Date: Sep 2012
Contact: Tomas Hlavacek <tmshlvck@gmail.com>
Description:
@@ -29,7 +29,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/type
+What: /sys/class/tty/ttyS<x>/type
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -38,7 +38,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/line
+What: /sys/class/tty/ttyS<x>/line
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -47,7 +47,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/port
+What: /sys/class/tty/ttyS<x>/port
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -56,7 +56,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/irq
+What: /sys/class/tty/ttyS<x>/irq
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -65,7 +65,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/flags
+What: /sys/class/tty/ttyS<x>/flags
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -74,7 +74,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/xmit_fifo_size
+What: /sys/class/tty/ttyS<x>/xmit_fifo_size
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -83,7 +83,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/close_delay
+What: /sys/class/tty/ttyS<x>/close_delay
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -92,7 +92,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/closing_wait
+What: /sys/class/tty/ttyS<x>/closing_wait
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -101,7 +101,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/custom_divisor
+What: /sys/class/tty/ttyS<x>/custom_divisor
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -110,7 +110,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/io_type
+What: /sys/class/tty/ttyS<x>/io_type
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -120,7 +120,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/iomem_base
+What: /sys/class/tty/ttyS<x>/iomem_base
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -129,7 +129,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/iomem_reg_shift
+What: /sys/class/tty/ttyS<x>/iomem_reg_shift
Date: October 2012
Contact: Alan Cox <alan@linux.intel.com>
Description:
@@ -139,7 +139,7 @@ Description:
These sysfs values expose the TIOCGSERIAL interface via
sysfs rather than via ioctls.
-What: /sys/class/tty/ttyS0/rx_trig_bytes
+What: /sys/class/tty/ttyS<x>/rx_trig_bytes
Date: May 2014
Contact: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Description:
@@ -155,7 +155,7 @@ Description:
16550A, which has 1/4/8/14 bytes trigger, the RX trigger is
automatically changed to 4 bytes.
-What: /sys/class/tty/ttyS0/console
+What: /sys/class/tty/ttyS<x>/console
Date: February 2020
Contact: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Description:
diff --git a/Documentation/COPYING-logo b/Documentation/COPYING-logo
deleted file mode 100644
index b21c7cf7d9f6..000000000000
--- a/Documentation/COPYING-logo
+++ /dev/null
@@ -1,13 +0,0 @@
-This is the full-colour version of the currently unofficial Linux logo
-("currently unofficial" just means that there has been no paperwork and
-that I have not really announced it yet). It was created by Larry Ewing,
-and is freely usable as long as you acknowledge Larry as the original
-artist.
-
-Note that there are black-and-white versions of this available that
-scale down to smaller sizes and are better for letterheads or whatever
-you want to use it for: for the full range of logos take a look at
-Larry's web-page:
-
- https://www.isc.tamu.edu/~lewing/linux/
-
diff --git a/Documentation/Kconfig b/Documentation/Kconfig
index e549a61f4d96..3a0e7ac0c4e3 100644
--- a/Documentation/Kconfig
+++ b/Documentation/Kconfig
@@ -1,23 +1,28 @@
-config WARN_MISSING_DOCUMENTS
+if COMPILE_TEST
+
+menu "Documentation"
+config WARN_MISSING_DOCUMENTS
bool "Warn if there's a missing documentation file"
- depends on COMPILE_TEST
help
- It is not uncommon that a document gets renamed.
- This option makes the Kernel to check for missing dependencies,
- warning when something is missing. Works only if the Kernel
- is built from a git tree.
+ It is not uncommon that a document gets renamed.
+	  This option makes the Kernel check for missing dependencies,
+ warning when something is missing. Works only if the Kernel
+ is built from a git tree.
- If unsure, select 'N'.
+ If unsure, select 'N'.
config WARN_ABI_ERRORS
bool "Warn if there are errors at ABI files"
- depends on COMPILE_TEST
help
- The files under Documentation/ABI should follow what's
- described at Documentation/ABI/README. Yet, as they're manually
- written, it would be possible that some of those files would
- have errors that would break them for being parsed by
- scripts/get_abi.pl. Add a check to verify them.
+ The files under Documentation/ABI should follow what's
+ described at Documentation/ABI/README. Yet, as they're manually
+	  written, it is possible that some of those files have
+	  errors that would break them when parsed by
+ scripts/get_abi.pl. Add a check to verify them.
+
+ If unsure, select 'N'.
+
+endmenu
- If unsure, select 'N'.
+endif
diff --git a/Documentation/Makefile b/Documentation/Makefile
index c3feb657b654..023fa658a0a8 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -19,14 +19,16 @@ endif
SPHINXBUILD = sphinx-build
SPHINXOPTS =
SPHINXDIRS = .
+DOCS_THEME =
+DOCS_CSS =
_SPHINXDIRS = $(sort $(patsubst $(srctree)/Documentation/%/index.rst,%,$(wildcard $(srctree)/Documentation/*/index.rst)))
SPHINX_CONF = conf.py
PAPER =
BUILDDIR = $(obj)/output
PDFLATEX = xelatex
-LATEXOPTS = -interaction=batchmode
+LATEXOPTS = -interaction=batchmode -no-shell-escape
-ifeq ($(KBUILD_VERBOSE),0)
+ifeq ($(findstring 1, $(KBUILD_VERBOSE)),)
SPHINXOPTS += "-q"
endif
@@ -84,12 +86,24 @@ quiet_cmd_sphinx = SPHINX $@ --> file://$(abspath $(BUILDDIR)/$3/$4)
-D version=$(KERNELVERSION) -D release=$(KERNELRELEASE) \
$(ALLSPHINXOPTS) \
$(abspath $(srctree)/$(src)/$5) \
- $(abspath $(BUILDDIR)/$3/$4)
+ $(abspath $(BUILDDIR)/$3/$4) && \
+ if [ "x$(DOCS_CSS)" != "x" ]; then \
+ cp $(if $(patsubst /%,,$(DOCS_CSS)),$(abspath $(srctree)/$(DOCS_CSS)),$(DOCS_CSS)) $(BUILDDIR)/$3/_static/; \
+ fi
htmldocs:
@$(srctree)/scripts/sphinx-pre-install --version-check
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var)))
+texinfodocs:
+ @$(srctree)/scripts/sphinx-pre-install --version-check
+ @+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,texinfo,$(var),texinfo,$(var)))
+
+# Note: the 'info' Make target is generated by Sphinx itself when
+# running the texinfodocs target defined above.
+infodocs: texinfodocs
+ $(MAKE) -C $(BUILDDIR)/texinfo info
+
linkcheckdocs:
@$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,linkcheck,$(var),,$(var)))
@@ -138,6 +152,8 @@ cleandocs:
dochelp:
@echo ' Linux kernel internal documentation in different formats from ReST:'
@echo ' htmldocs - HTML'
+ @echo ' texinfodocs - Texinfo'
+ @echo ' infodocs - Info'
@echo ' latexdocs - LaTeX'
@echo ' pdfdocs - PDF'
@echo ' epubdocs - EPUB'
@@ -154,4 +170,8 @@ dochelp:
@echo ' make SPHINX_CONF={conf-file} [target] use *additional* sphinx-build'
@echo ' configuration. This is e.g. useful to build with nit-picking config.'
@echo
+ @echo ' make DOCS_THEME={sphinx-theme} selects a different Sphinx theme.'
+ @echo
+ @echo ' make DOCS_CSS={a .css file} adds a DOCS_CSS override file for html/epub output.'
+ @echo
@echo ' Default location for the generated documents is Documentation/output'
diff --git a/Documentation/PCI/endpoint/index.rst b/Documentation/PCI/endpoint/index.rst
index 38ea1f604b6d..4d2333e7ae06 100644
--- a/Documentation/PCI/endpoint/index.rst
+++ b/Documentation/PCI/endpoint/index.rst
@@ -13,6 +13,8 @@ PCI Endpoint Framework
pci-test-howto
pci-ntb-function
pci-ntb-howto
+ pci-vntb-function
+ pci-vntb-howto
function/binding/pci-test
function/binding/pci-ntb
diff --git a/Documentation/PCI/endpoint/pci-vntb-function.rst b/Documentation/PCI/endpoint/pci-vntb-function.rst
new file mode 100644
index 000000000000..0c51f53ab972
--- /dev/null
+++ b/Documentation/PCI/endpoint/pci-vntb-function.rst
@@ -0,0 +1,129 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+PCI vNTB Function
+=================
+
+:Author: Frank Li <Frank.Li@nxp.com>
+
+The difference between the PCI NTB function and the PCI vNTB function is:
+
+The PCI NTB function requires two endpoint instances and connects HOST1
+and HOST2.
+
+The PCI vNTB function uses only one host and one endpoint (EP), with the
+NTB connecting the EP and the PCI host.
+
+.. code-block:: text
+
+
+ +------------+ +---------------------------------------+
+ | | | |
+ +------------+ | +--------------+
+ | NTB | | | NTB |
+ | NetDev | | | NetDev |
+ +------------+ | +--------------+
+ | NTB | | | NTB |
+ | Transfer | | | Transfer |
+ +------------+ | +--------------+
+ | | | | |
+ | PCI NTB | | | |
+ | EPF | | | |
+ | Driver | | | PCI Virtual |
+ | | +---------------+ | NTB Driver |
+ | | | PCI EP NTB |<------>| |
+ | | | FN Driver | | |
+ +------------+ +---------------+ +--------------+
+ | | | | | |
+ | PCI BUS | <-----> | PCI EP BUS | | Virtual PCI |
+ | | PCI | | | BUS |
+ +------------+ +---------------+--------+--------------+
+ PCI RC PCI EP
+
+Constructs used for Implementing vNTB
+=====================================
+
+ 1) Config Region
+ 2) Self Scratchpad Registers
+ 3) Peer Scratchpad Registers
+ 4) Doorbell (DB) Registers
+ 5) Memory Window (MW)
+
+
+Config Region:
+--------------
+
+It is the same as in the PCI NTB function driver.
+
+Scratchpad Registers:
+---------------------
+
+It is appended after the Config region.
+
+.. code-block:: text
+
+
+ +--------------------------------------------------+ Base
+ | |
+ | |
+ | |
+ | Common Config Register |
+ | |
+ | |
+ | |
+ +-----------------------+--------------------------+ Base + span_offset
+ | | |
+ | Peer Span Space | Span Space |
+ | | |
+ | | |
+ +-----------------------+--------------------------+ Base + span_offset
+ | | | + span_count * 4
+ | | |
+ | Span Space | Peer Span Space |
+ | | |
+ +-----------------------+--------------------------+
+       Virtual PCI             PCIe Endpoint
+ NTB Driver NTB Driver
+
+
+Doorbell Registers:
+-------------------
+
+ Doorbell Registers are used by the hosts to interrupt each other.
+
+Memory Window:
+--------------
+
+ Actual transfer of data between the two hosts will happen using the
+ memory window.
+
+Modeling Constructs:
+====================
+
+32-bit BARs.
+
+====== ===============
+BAR NO CONSTRUCTS USED
+====== ===============
+BAR0 Config Region
+BAR1 Doorbell
+BAR2 Memory Window 1
+BAR3 Memory Window 2
+BAR4 Memory Window 3
+BAR5 Memory Window 4
+====== ===============
+
+64-bit BARs.
+
+====== ===============================
+BAR NO CONSTRUCTS USED
+====== ===============================
+BAR0 Config Region + Scratchpad
+BAR1
+BAR2 Doorbell
+BAR3
+BAR4 Memory Window 1
+BAR5
+====== ===============================
+
+
diff --git a/Documentation/PCI/endpoint/pci-vntb-howto.rst b/Documentation/PCI/endpoint/pci-vntb-howto.rst
new file mode 100644
index 000000000000..4ab8e4a26d4b
--- /dev/null
+++ b/Documentation/PCI/endpoint/pci-vntb-howto.rst
@@ -0,0 +1,167 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================================================
+PCI Non-Transparent Bridge (NTB) Endpoint Function (EPF) User Guide
+===================================================================
+
+:Author: Frank Li <Frank.Li@nxp.com>
+
+This document is a guide to help users use the pci-epf-vntb function driver
+and the ntb_hw_epf host driver for NTB functionality. The list of steps to
+be followed on the host side and the EP side is given below. For the hardware
+configuration and internals of NTB using configurable endpoints, see
+Documentation/PCI/endpoint/pci-vntb-function.rst.
+
+Endpoint Device
+===============
+
+Endpoint Controller Devices
+---------------------------
+
+To find the list of endpoint controller devices in the system::
+
+ # ls /sys/class/pci_epc/
+ 5f010000.pcie_ep
+
+If PCI_ENDPOINT_CONFIGFS is enabled::
+
+ # ls /sys/kernel/config/pci_ep/controllers
+ 5f010000.pcie_ep
+
+Endpoint Function Drivers
+-------------------------
+
+To find the list of endpoint function drivers in the system::
+
+ # ls /sys/bus/pci-epf/drivers
+ pci_epf_ntb pci_epf_test pci_epf_vntb
+
+If PCI_ENDPOINT_CONFIGFS is enabled::
+
+ # ls /sys/kernel/config/pci_ep/functions
+ pci_epf_ntb pci_epf_test pci_epf_vntb
+
+
+Creating pci-epf-vntb Device
+----------------------------
+
+A PCI endpoint function device can be created using configfs. To create a
+pci-epf-vntb device, the following commands can be used::
+
+ # mount -t configfs none /sys/kernel/config
+ # cd /sys/kernel/config/pci_ep/
+ # mkdir functions/pci_epf_vntb/func1
+
+The "mkdir func1" above creates the pci-epf-ntb function device that will
+be probed by pci_epf_vntb driver.
+
+The PCI endpoint framework populates the directory with the following
+configurable fields::
+
+ # ls functions/pci_epf_vntb/func1
+ baseclass_code deviceid msi_interrupts pci_epf_vntb.0
+ progif_code secondary subsys_id vendorid
+ cache_line_size interrupt_pin msix_interrupts primary
+ revid subclass_code subsys_vendor_id
+
+The PCI endpoint function driver populates these entries with default values
+when the device is bound to the driver. The pci-epf-vntb driver populates
+vendorid with 0xffff and interrupt_pin with 0x0001::
+
+ # cat functions/pci_epf_vntb/func1/vendorid
+ 0xffff
+ # cat functions/pci_epf_vntb/func1/interrupt_pin
+ 0x0001
+
+
+Configuring pci-epf-vntb Device
+-------------------------------
+
+The user can configure the pci-epf-vntb device using its configfs entry. In order
+to change the vendorid and the deviceid, the following
+commands can be used::
+
+ # echo 0x1957 > functions/pci_epf_vntb/func1/vendorid
+ # echo 0x0809 > functions/pci_epf_vntb/func1/deviceid
+
+In order to configure NTB-specific attributes, a new sub-directory under
+func1 should be created::
+
+ # mkdir functions/pci_epf_vntb/func1/pci_epf_vntb.0/
+
+The NTB function driver will populate this directory with various attributes
+that can be configured by the user::
+
+ # ls functions/pci_epf_vntb/func1/pci_epf_vntb.0/
+ db_count mw1 mw2 mw3 mw4 num_mws
+ spad_count
+
+A sample configuration for NTB function is given below::
+
+ # echo 4 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/db_count
+ # echo 128 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/spad_count
+ # echo 1 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/num_mws
+ # echo 0x100000 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/mw1
+
+A sample configuration for the virtual NTB driver on the virtual PCI bus::
+
+ # echo 0x1957 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_vid
+ # echo 0x080A > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vntb_pid
+ # echo 0x10 > functions/pci_epf_vntb/func1/pci_epf_vntb.0/vbus_number
+
+Binding pci-epf-vntb Device to EP Controller
+--------------------------------------------
+
+The vNTB function device should be attached to the PCI endpoint controller
+connected to the host::
+
+ # ln -s controllers/5f010000.pcie_ep functions/pci_epf_vntb/func1/primary
+
+Once the above step is completed, the PCI endpoint controllers are ready to
+establish a link with the host.
+
+
+Start the Link
+--------------
+
+In order for the endpoint device to establish a link with the host, the _start_
+field should be populated with '1'. For NTB, the PCI endpoint controller
+should establish a link with the host (imx8 does not need this step)::
+
+ # echo 1 > controllers/5f010000.pcie_ep/start
+
+Root Complex Device
+===================
+
+lspci Output at Host side
+-------------------------
+
+Note that the devices listed here correspond to the values populated in
+the "Creating pci-epf-vntb Device" section above::
+
+ # lspci
+ 00:00.0 PCI bridge: Freescale Semiconductor Inc Device 0000 (rev 01)
+ 01:00.0 RAM memory: Freescale Semiconductor Inc Device 0809
+
+Endpoint Device / Virtual PCI bus
+=================================
+
+lspci Output at EP Side / Virtual PCI bus
+-----------------------------------------
+
+Note that the devices listed here correspond to the values populated in
+the "Creating pci-epf-vntb Device" section above::
+
+ # lspci
+ 10:00.0 Unassigned class [ffff]: Dawicontrol Computersysteme GmbH Device 1234 (rev ff)
+
+Using ntb_hw_epf Device
+-----------------------
+
+The host side software follows the standard NTB software architecture in Linux.
+All the existing client side NTB utilities like NTB Transport Client and NTB
+Netdev, NTB Ping Pong Test Client and NTB Tool Test Client can be used with
+the NTB function device.
+
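+A hypothetical host-side bring-up of an NTB network interface built on these
+utilities (the module names are the standard in-tree NTB modules; the
+interface name and address are illustrative)::
+
+    # modprobe ntb_hw_epf
+    # modprobe ntb_transport
+    # modprobe ntb_netdev
+    # ip addr add 192.168.1.1/24 dev eth1
+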
+For more information on NTB see
+:doc:`Non-Transparent Bridge <../../driver-api/ntb>`
diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
index c17c87af1968..e73f84aebde3 100644
--- a/Documentation/PCI/index.rst
+++ b/Documentation/PCI/index.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-=======================
-Linux PCI Bus Subsystem
-=======================
+=================
+PCI Bus Subsystem
+=================
.. toctree::
:maxdepth: 2
diff --git a/Documentation/PCI/msi-howto.rst b/Documentation/PCI/msi-howto.rst
index aa2046af69f7..8ae461e97c54 100644
--- a/Documentation/PCI/msi-howto.rst
+++ b/Documentation/PCI/msi-howto.rst
@@ -285,3 +285,13 @@ to bridges between the PCI root and the device, MSIs are disabled.
It is also worth checking the device driver to see whether it supports MSIs.
For example, it may contain calls to pci_alloc_irq_vectors() with the
PCI_IRQ_MSI or PCI_IRQ_MSIX flags.
+
+
+List of device drivers MSI(-X) APIs
+===================================
+
+The PCI/MSI subsystem has a dedicated C file for its exported device driver
+APIs: `drivers/pci/msi/api.c`. The following functions are exported:
+
+.. kernel-doc:: drivers/pci/msi/api.c
+ :export:
diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
index 187f43a03200..9981d330da8f 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -83,6 +83,7 @@ This structure has the form::
int (*mmio_enabled)(struct pci_dev *dev);
int (*slot_reset)(struct pci_dev *dev);
void (*resume)(struct pci_dev *dev);
+ void (*cor_error_detected)(struct pci_dev *dev);
};
The possible channel states are::
@@ -417,10 +418,15 @@ That is, the recovery API only requires that:
- drivers/next/e100.c
- drivers/net/e1000
- drivers/net/e1000e
- - drivers/net/ixgb
- drivers/net/ixgbe
- drivers/net/cxgb3
- drivers/net/s2io.c
+ The cor_error_detected() callback is invoked in handle_error_source() when
+ the error severity is "correctable". The callback is optional and allows
+ additional logging to be done if desired. See example:
+
+ - drivers/cxl/pci.c
+
The End
-------
diff --git a/Documentation/PCI/pci-iov-howto.rst b/Documentation/PCI/pci-iov-howto.rst
index b9fd003206f1..27d35933cea2 100644
--- a/Documentation/PCI/pci-iov-howto.rst
+++ b/Documentation/PCI/pci-iov-howto.rst
@@ -125,14 +125,14 @@ Following piece of code illustrates the usage of the SR-IOV API.
...
}
- static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+ static int dev_suspend(struct device *dev)
{
...
return 0;
}
- static int dev_resume(struct pci_dev *dev)
+ static int dev_resume(struct device *dev)
{
...
@@ -165,8 +165,7 @@ Following piece of code illustrates the usage of the SR-IOV API.
.id_table = dev_id_table,
.probe = dev_probe,
.remove = dev_remove,
- .suspend = dev_suspend,
- .resume = dev_resume,
+ .driver.pm = &dev_pm_ops,
.shutdown = dev_shutdown,
.sriov_configure = dev_sriov_configure,
};
diff --git a/Documentation/PCI/pci.rst b/Documentation/PCI/pci.rst
index 87c6f4a6ca32..cced568d78e9 100644
--- a/Documentation/PCI/pci.rst
+++ b/Documentation/PCI/pci.rst
@@ -273,25 +273,25 @@ Set the DMA mask size
While all drivers should explicitly indicate the DMA capability
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
32-bit bus master capability for streaming data need the driver
-to "register" this capability by calling pci_set_dma_mask() with
+to "register" this capability by calling dma_set_mask() with
appropriate parameters. In general this allows more efficient DMA
on systems where System RAM exists above 4G _physical_ address.
Drivers for all PCI-X and PCIe compliant devices must call
-pci_set_dma_mask() as they are 64-bit DMA devices.
+dma_set_mask() as they are 64-bit DMA devices.
Similarly, drivers must also "register" this capability if the device
-can directly address "consistent memory" in System RAM above 4G physical
-address by calling pci_set_consistent_dma_mask().
+can directly address "coherent memory" in System RAM above 4G physical
+address by calling dma_set_coherent_mask().
Again, this includes drivers for all PCI-X and PCIe compliant devices.
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
64-bit DMA capable for payload ("streaming") data but not control
-("consistent") data.
+("coherent") data.
Setup shared control data
-------------------------
-Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
+Once the DMA masks are set, the driver can allocate "coherent" (a.k.a. shared)
memory. See Documentation/core-api/dma-api.rst for a full description of
the DMA APIs. This section is just a reminder that it needs to be done
before enabling DMA on the device.
@@ -367,7 +367,7 @@ steps need to be performed:
- Disable the device from generating IRQs
- Release the IRQ (free_irq())
- Stop all DMA activity
- - Release DMA buffers (both streaming and consistent)
+ - Release DMA buffers (both streaming and coherent)
- Unregister from other subsystems (e.g. scsi or netdev)
- Disable device from responding to MMIO/IO Port addresses
- Release MMIO/IO Port resource(s)
@@ -420,7 +420,7 @@ Once DMA is stopped, clean up streaming DMA first.
I.e. unmap data buffers and return buffers to "upstream"
owners if there is one.
-Then clean up "consistent" buffers which contain the control data.
+Then clean up "coherent" buffers which contain the control data.
See Documentation/core-api/dma-api.rst for details on unmapping interfaces.
diff --git a/Documentation/PCI/sysfs-pci.rst b/Documentation/PCI/sysfs-pci.rst
index 742fbd21dc1f..f495185aa88a 100644
--- a/Documentation/PCI/sysfs-pci.rst
+++ b/Documentation/PCI/sysfs-pci.rst
@@ -125,7 +125,7 @@ implementation of that functionality. To support the historical interface of
mmap() through files in /proc/bus/pci, platforms may also set HAVE_PCI_MMAP.
Alternatively, platforms which set HAVE_PCI_MMAP may provide their own
-implementation of pci_mmap_page_range() instead of defining
+implementation of pci_mmap_resource_range() instead of defining
ARCH_GENERIC_PCI_MMAP_RESOURCE.
Platforms which support write-combining maps of PCI resources must define
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
index f4efd6897b09..b34990c7c377 100644
--- a/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
+++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.rst
@@ -973,7 +973,7 @@ The ``->dynticks`` field counts the corresponding CPU's transitions to
and from either dyntick-idle or user mode, so that this counter has an
even value when the CPU is in dyntick-idle mode or user mode and an odd
value otherwise. The transitions to/from user mode need to be counted
-for user mode adaptive-ticks support (see timers/NO_HZ.txt).
+for user mode adaptive-ticks support (see Documentation/timers/no_hz.rst).
The ``->rcu_need_heavy_qs`` field is used to record the fact that the
RCU core code would really like to see a quiescent state from the
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst
index 6f89cf1e567d..93d899d53258 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst
@@ -277,7 +277,7 @@ the following access functions:
Again, only one request in a given batch need actually carry out a
grace-period operation, which means there must be an efficient way to
-identify which of many concurrent reqeusts will initiate the grace
+identify which of many concurrent requests will initiate the grace
period, and that there be an efficient way for the remaining requests to
wait for that grace period to complete. However, that is the topic of
the next section.
@@ -405,8 +405,8 @@ Use of Workqueues
In earlier implementations, the task requesting the expedited grace
period also drove it to completion. This straightforward approach had
the disadvantage of needing to account for POSIX signals sent to user
-tasks, so more recent implemementations use the Linux kernel's
-`workqueues <https://www.kernel.org/doc/Documentation/core-api/workqueue.rst>`__.
+tasks, so more recent implementations use the Linux kernel's
+workqueues (see Documentation/core-api/workqueue.rst).
The requesting task still does counter snapshotting and funnel-lock
processing, but the task reaching the top of the funnel lock does a
@@ -465,7 +465,7 @@ corresponding disadvantage that workqueues cannot be used until they are
initialized, which does not happen until some time after the scheduler
spawns the first task. Given that there are parts of the kernel that
really do want to execute grace periods during this mid-boot “dead
-zone”, expedited grace periods must do something else during thie time.
+zone”, expedited grace periods must do something else during this time.
What they do is to fall back to the old practice of requiring that the
requesting task drive the expedited grace period, as was the case before
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg
index 98af66557908..16b1ff0ad38c 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel0.svg
@@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg
index e0184a37aec7..684a4b969725 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel1.svg
@@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg
index 1bc3fed54d58..8fb2454d9544 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel2.svg
@@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg
index 6d8a1bffb3e4..5d4f22d5662c 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel3.svg
@@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg
index 44018fd6342b..b89b02869914 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel4.svg
@@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg
index e5eef50454fb..90f1c77bea2f 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel5.svg
@@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg
index fbd2c1892886..3e5651da031a 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel6.svg
@@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg
index 502e159ed278..9483f08d345e 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel7.svg
@@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"
diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg
index 677401551c7d..1101ec30e604 100644
--- a/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg
+++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Funnel8.svg
@@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@@ -125,7 +125,7 @@
y="492.36218" /></flowRegion><flowPara
id="flowPara2991" /></flowRoot> <text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="362.371"
y="262.51819"
id="text4441"
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
index eeb351296df1..5750f125361b 100644
--- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst
@@ -168,7 +168,7 @@ an ``atomic_add_return()`` of zero) to detect idle CPUs.
+-----------------------------------------------------------------------+
The approach must be extended to handle one final case, that of waking a
-task blocked in ``synchronize_rcu()``. This task might be affinitied to
+task blocked in ``synchronize_rcu()``. This task might be affined to
a CPU that is not yet aware that the grace period has ended, and thus
might not yet be subject to the grace period's memory ordering.
Therefore, there is an ``smp_mb()`` after the return from
@@ -202,49 +202,44 @@ newly arrived RCU callbacks against future grace periods:
1 static void rcu_prepare_for_idle(void)
2 {
3 bool needwake;
- 4 struct rcu_data *rdp;
- 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
- 6 struct rcu_node *rnp;
- 7 struct rcu_state *rsp;
- 8 int tne;
- 9
- 10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
- 11 rcu_is_nocb_cpu(smp_processor_id()))
- 12 return;
+ 4 struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+ 5 struct rcu_node *rnp;
+ 6 int tne;
+ 7
+ 8 lockdep_assert_irqs_disabled();
+ 9 if (rcu_rdp_is_offloaded(rdp))
+ 10 return;
+ 11
+ 12 /* Handle nohz enablement switches conservatively. */
13 tne = READ_ONCE(tick_nohz_active);
- 14 if (tne != rdtp->tick_nohz_enabled_snap) {
- 15 if (rcu_cpu_has_callbacks(NULL))
- 16 invoke_rcu_core();
- 17 rdtp->tick_nohz_enabled_snap = tne;
+ 14 if (tne != rdp->tick_nohz_enabled_snap) {
+ 15 if (!rcu_segcblist_empty(&rdp->cblist))
+ 16 invoke_rcu_core(); /* force nohz to see update. */
+ 17 rdp->tick_nohz_enabled_snap = tne;
18 return;
- 19 }
+ 19 }
20 if (!tne)
21 return;
- 22 if (rdtp->all_lazy &&
- 23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
- 24 rdtp->all_lazy = false;
- 25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
- 26 invoke_rcu_core();
- 27 return;
- 28 }
- 29 if (rdtp->last_accelerate == jiffies)
- 30 return;
- 31 rdtp->last_accelerate = jiffies;
- 32 for_each_rcu_flavor(rsp) {
- 33 rdp = this_cpu_ptr(rsp->rda);
- 34 if (rcu_segcblist_pend_cbs(&rdp->cblist))
- 35 continue;
- 36 rnp = rdp->mynode;
- 37 raw_spin_lock_rcu_node(rnp);
- 38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
- 39 raw_spin_unlock_rcu_node(rnp);
- 40 if (needwake)
- 41 rcu_gp_kthread_wake(rsp);
- 42 }
- 43 }
+ 22
+ 23 /*
+ 24 * If we have not yet accelerated this jiffy, accelerate all
+ 25 * callbacks on this CPU.
+ 26 */
+ 27 if (rdp->last_accelerate == jiffies)
+ 28 return;
+ 29 rdp->last_accelerate = jiffies;
+ 30 if (rcu_segcblist_pend_cbs(&rdp->cblist)) {
+ 31 rnp = rdp->mynode;
+ 32 raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
+ 33 needwake = rcu_accelerate_cbs(rnp, rdp);
+ 34 raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
+ 35 if (needwake)
+ 36 rcu_gp_kthread_wake();
+ 37 }
+ 38 }
But the only part of ``rcu_prepare_for_idle()`` that really matters for
-this discussion are lines 37–39. We will therefore abbreviate this
+this discussion are lines 32–34. We will therefore abbreviate this
function as follows:
.. kernel-figure:: rcu_node-lock.svg
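(Editor's note for readers viewing this patch as plain text, where the
figure is not rendered: the three lines the updated prose points at,
32–34 in the new listing above, are simply the ``rcu_node``-lock
critical section::

    raw_spin_lock_rcu_node(rnp);          /* irqs already disabled. */
    needwake = rcu_accelerate_cbs(rnp, rdp);
    raw_spin_unlock_rcu_node(rnp);        /* irqs remain disabled. */

Recall that raw_spin_lock_rcu_node() also supplies the
smp_mb__after_unlock_lock() ordering discussed earlier in this
document.)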
diff --git a/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg b/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg
index 4b4014fda770..87851a8fac1e 100644
--- a/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg
+++ b/Documentation/RCU/Design/Requirements/GPpartitionReaders1.svg
@@ -88,7 +88,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@@ -103,7 +103,7 @@
id="text2993"
y="-261.66608"
x="412.12299"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
xml:space="preserve"
transform="matrix(0,1,-1,0,0,0)"><tspan
y="-261.66608"
@@ -135,7 +135,7 @@
</g>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.04738"
y="268.18076"
id="text4429"
@@ -146,7 +146,7 @@
y="268.18076">WRITE_ONCE(a, 1);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.04738"
y="439.13766"
id="text4441"
@@ -157,7 +157,7 @@
y="439.13766">WRITE_ONCE(b, 1);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="255.60869"
y="309.29346"
id="text4445"
@@ -168,7 +168,7 @@
y="309.29346">r1 = READ_ONCE(a);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="255.14423"
y="520.61786"
id="text4449"
@@ -179,7 +179,7 @@
y="520.61786">WRITE_ONCE(c, 1);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="384.71124"
id="text4453"
@@ -190,7 +190,7 @@
y="384.71124">r2 = READ_ONCE(b);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="582.13617"
id="text4457"
@@ -201,7 +201,7 @@
y="582.13617">r3 = READ_ONCE(c);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.08231"
y="213.91006"
id="text4461"
@@ -212,7 +212,7 @@
y="213.91006">thread0()</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="252.34512"
y="213.91006"
id="text4461-6"
@@ -223,7 +223,7 @@
y="213.91006">thread1()</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.42557"
y="213.91006"
id="text4461-2"
@@ -251,7 +251,7 @@
inkscape:connector-curvature="0" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="111.75929"
y="251.53981"
id="text4429-8"
@@ -262,7 +262,7 @@
y="251.53981">rcu_read_lock();</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="367.91556"
id="text4429-8-9"
@@ -273,7 +273,7 @@
y="367.91556">rcu_read_lock();</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="597.40289"
id="text4429-8-9-3"
@@ -284,7 +284,7 @@
y="597.40289">rcu_read_unlock();</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="111.75929"
y="453.15311"
id="text4429-8-9-3-1"
@@ -300,7 +300,7 @@
inkscape:connector-curvature="0" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="394.94427"
y="345.66351"
id="text4648"
@@ -324,7 +324,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.11968"
y="475.77856"
id="text4648-4"
@@ -361,7 +361,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="254.85066"
y="348.96619"
id="text4648-4-3"
diff --git a/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg b/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg
index 48cd1623d4d4..e2a8af592bab 100644
--- a/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg
+++ b/Documentation/RCU/Design/Requirements/ReadersPartitionGP1.svg
@@ -116,7 +116,7 @@
<flowRoot
xml:space="preserve"
id="flowRoot2985"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"><flowRegion
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"><flowRegion
id="flowRegion2987"><rect
id="rect2989"
width="82.85714"
@@ -131,7 +131,7 @@
id="text2993"
y="-261.66608"
x="436.12299"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
xml:space="preserve"
transform="matrix(0,1,-1,0,0,0)"><tspan
y="-261.66608"
@@ -163,7 +163,7 @@
</g>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.04738"
y="268.18076"
id="text4429"
@@ -174,7 +174,7 @@
y="268.18076">WRITE_ONCE(a, 1);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.04738"
y="487.13766"
id="text4441"
@@ -185,7 +185,7 @@
y="487.13766">WRITE_ONCE(b, 1);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="255.60869"
y="297.29346"
id="text4445"
@@ -196,7 +196,7 @@
y="297.29346">r1 = READ_ONCE(a);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="255.14423"
y="554.61786"
id="text4449"
@@ -207,7 +207,7 @@
y="554.61786">WRITE_ONCE(c, 1);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="370.71124"
id="text4453"
@@ -218,7 +218,7 @@
y="370.71124">WRITE_ONCE(d, 1);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="572.13617"
id="text4457"
@@ -229,7 +229,7 @@
y="572.13617">r2 = READ_ONCE(c);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.08231"
y="213.91006"
id="text4461"
@@ -240,7 +240,7 @@
y="213.91006">thread0()</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="252.34512"
y="213.91006"
id="text4461-6"
@@ -251,7 +251,7 @@
y="213.91006">thread1()</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.42557"
y="213.91006"
id="text4461-2"
@@ -281,7 +281,7 @@
sodipodi:nodetypes="cc" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="111.75929"
y="251.53981"
id="text4429-8"
@@ -292,7 +292,7 @@
y="251.53981">rcu_read_lock();</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="353.91556"
id="text4429-8-9"
@@ -303,7 +303,7 @@
y="353.91556">rcu_read_lock();</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="396.10254"
y="587.40289"
id="text4429-8-9-3"
@@ -314,7 +314,7 @@
y="587.40289">rcu_read_unlock();</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="111.75929"
y="501.15311"
id="text4429-8-9-3-1"
@@ -331,7 +331,7 @@
sodipodi:nodetypes="cc" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="394.94427"
y="331.66351"
id="text4648"
@@ -355,7 +355,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="112.11968"
y="523.77856"
id="text4648-4"
@@ -392,7 +392,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="254.85066"
y="336.96619"
id="text4648-4-3"
@@ -421,7 +421,7 @@
id="text2993-7"
y="-261.66608"
x="440.12299"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
xml:space="preserve"
transform="matrix(0,1,-1,0,0,0)"><tspan
y="-261.66608"
@@ -453,7 +453,7 @@
</g>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="541.70508"
y="387.6217"
id="text4445-0"
@@ -464,7 +464,7 @@
y="387.6217">r3 = READ_ONCE(d);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="541.2406"
y="646.94611"
id="text4449-6"
@@ -488,7 +488,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="540.94702"
y="427.29443"
id="text4648-4-3-1"
@@ -499,7 +499,7 @@
y="427.29443">QS</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="686.27747"
y="461.83929"
id="text4453-7"
@@ -510,7 +510,7 @@
y="461.83929">r4 = READ_ONCE(b);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="686.27747"
y="669.26422"
id="text4457-9"
@@ -521,7 +521,7 @@
y="669.26422">r5 = READ_ONCE(e);</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="686.27747"
y="445.04358"
id="text4429-8-9-33"
@@ -532,7 +532,7 @@
y="445.04358">rcu_read_lock();</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="686.27747"
y="684.53094"
id="text4429-8-9-3-8"
@@ -543,7 +543,7 @@
y="684.53094">rcu_read_unlock();</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="685.11914"
y="422.79153"
id="text4648-9"
@@ -567,7 +567,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="397.85934"
y="609.59003"
id="text4648-5"
@@ -591,7 +591,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="256.75986"
y="586.99133"
id="text4648-5-2"
@@ -615,7 +615,7 @@
sodipodi:open="true" />
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="546.22791"
y="213.91006"
id="text4461-2-5"
@@ -626,7 +626,7 @@
y="213.91006">thread3()</tspan></text>
<text
xml:space="preserve"
- style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:Symbol;-inkscape-font-specification:Symbol"
+ style="font-size:10px;font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;text-align:center;line-height:125%;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;font-family:monospace;-inkscape-font-specification:monospace"
x="684.00067"
y="213.91006"
id="text4461-2-1"
diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index 45278e2974c0..49387d823619 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -370,8 +370,8 @@ pointer fetched by rcu_dereference() may not be used outside of the
outermost RCU read-side critical section containing that
rcu_dereference(), unless protection of the corresponding data
element has been passed from RCU to some other synchronization
-mechanism, most commonly locking or `reference
-counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__.
+mechanism, most commonly locking or reference counting
+(see ../../rcuref.rst).
.. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implementation of C11 ``memory_order_consume`` [PDF]
.. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf
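(Editor's sketch, not part of the patch: the RCU-to-reference-count
handoff that the reworded sentence describes commonly looks like the
following, where ``gp``, ``->refcnt``, ``do_something_with()``, and
``release_obj()`` are purely illustrative names::

    rcu_read_lock();
    p = rcu_dereference(gp);
    if (p && !kref_get_unless_zero(&p->refcnt))
            p = NULL;              /* Object already being freed. */
    rcu_read_unlock();

    if (p) {
            do_something_with(p);  /* Legal outside the critical section. */
            kref_put(&p->refcnt, release_obj);
    }

Once the reference is held, the pointer may outlive the enclosing RCU
read-side critical section.)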
@@ -1844,10 +1844,10 @@ that meets this requirement.
Furthermore, NMI handlers can be interrupted by what appear to RCU to be
normal interrupts. One way that this can happen is for code that
-directly invokes rcu_irq_enter() and rcu_irq_exit() to be called
+directly invokes ct_irq_enter() and ct_irq_exit() to be called
from an NMI handler. This astonishing fact of life prompted the current
-code structure, which has rcu_irq_enter() invoking
-rcu_nmi_enter() and rcu_irq_exit() invoking rcu_nmi_exit().
+code structure, which has ct_irq_enter() invoking
+ct_nmi_enter() and ct_irq_exit() invoking ct_nmi_exit().
And yes, I also learned of this requirement the hard way.
Loadable Modules
@@ -1858,7 +1858,7 @@ unloaded. After a given module has been unloaded, any attempt to call
one of its functions results in a segmentation fault. The module-unload
functions must therefore cancel any delayed calls to loadable-module
functions, for example, any outstanding mod_timer() must be dealt
-with via del_timer_sync() or similar.
+with via timer_shutdown_sync() or similar.
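(Editor's sketch, not from the patch: a module-exit path might first
shut down its timer and then wait for any already-posted callbacks.
As the next paragraph explains, those callbacks cannot be cancelled,
but rcu_barrier() can wait for them; ``mymod_exit()`` and
``mymod_timer`` are illustrative names::

    static void __exit mymod_exit(void)
    {
            timer_shutdown_sync(&mymod_timer); /* No further re-arming. */
            rcu_barrier();  /* Wait out posted call_rcu() callbacks. */
    }

)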
Unfortunately, there is no way to cancel an RCU callback; once you
invoke call_rcu(), the callback function is eventually going to be
@@ -2195,7 +2195,7 @@ scheduling-clock interrupt be enabled when RCU needs it to be:
sections, and RCU believes this CPU to be idle, no problem. This
sort of thing is used by some architectures for light-weight
exception handlers, which can then avoid the overhead of
- rcu_irq_enter() and rcu_irq_exit() at exception entry and
+ ct_irq_enter() and ct_irq_exit() at exception entry and
exit, respectively. Some go further and avoid the entireties of
irq_enter() and irq_exit().
Just make very sure you are running some of your tests with
@@ -2226,7 +2226,7 @@ scheduling-clock interrupt be enabled when RCU needs it to be:
+-----------------------------------------------------------------------+
| **Answer**: |
+-----------------------------------------------------------------------+
-| One approach is to do ``rcu_irq_exit();rcu_irq_enter();`` every so |
+| One approach is to do ``ct_irq_exit();ct_irq_enter();`` every so |
| often. But given that long-running interrupt handlers can cause other |
| problems, not least for response time, shouldn't you work to keep |
| your interrupt handler's runtime within reasonable bounds? |
@@ -2654,6 +2654,38 @@ synchronize_rcu(), and rcu_barrier(), respectively. In
three APIs are therefore implemented by separate functions that check
for voluntary context switches.
+Tasks Rude RCU
+~~~~~~~~~~~~~~
+
+Some forms of tracing need to wait for all preemption-disabled regions
+of code running on any online CPU, including those executed when RCU is
+not watching. This means that synchronize_rcu() is insufficient, and
+Tasks Rude RCU must be used instead. This flavor of RCU does its work by
+forcing a workqueue to be scheduled on each online CPU, hence the "Rude"
+moniker. And this operation is considered to be quite rude by real-time
+workloads that don't want their ``nohz_full`` CPUs receiving IPIs and
+by battery-powered systems that don't want their idle CPUs to be awakened.
+
+The tasks-rude-RCU API is also reader-marking-free and thus quite compact,
+consisting of call_rcu_tasks_rude(), synchronize_rcu_tasks_rude(),
+and rcu_barrier_tasks_rude().
+
+Tasks Trace RCU
+~~~~~~~~~~~~~~~
+
+Some forms of tracing need to sleep in readers, but cannot tolerate
+SRCU's read-side overhead, which includes a full memory barrier in both
+srcu_read_lock() and srcu_read_unlock(). This need is handled by a
+Tasks Trace RCU that uses scheduler locking and IPIs to synchronize with
+readers. Real-time systems that cannot tolerate IPIs may build their
+kernels with ``CONFIG_TASKS_TRACE_RCU_READ_MB=y``, which avoids the IPIs at
+the expense of adding full memory barriers to the read-side primitives.
+
+The tasks-trace-RCU API is also reasonably compact,
+consisting of rcu_read_lock_trace(), rcu_read_unlock_trace(),
+rcu_read_lock_trace_held(), call_rcu_tasks_trace(),
+synchronize_rcu_tasks_trace(), and rcu_barrier_tasks_trace().
+
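(Editor's sketch, not part of the patch: a rough pairing of the
reader- and updater-side Tasks Trace RCU primitives, with
``trace_hook`` and its ``->func``/``->arg`` fields hypothetical::

    /* Reader, which is permitted to block: */
    rcu_read_lock_trace();
    hook = rcu_dereference_check(trace_hook, rcu_read_lock_trace_held());
    if (hook)
            hook->func(hook->arg);         /* May sleep. */
    rcu_read_unlock_trace();

    /* Updater: */
    old = rcu_replace_pointer(trace_hook, NULL, true);
    synchronize_rcu_tasks_trace();         /* Wait for all trace readers. */
    kfree(old);

)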
Possible Future Changes
-----------------------
diff --git a/Documentation/RCU/NMI-RCU.rst b/Documentation/RCU/NMI-RCU.rst
index 2a92bc685ef1..dff60a80b386 100644
--- a/Documentation/RCU/NMI-RCU.rst
+++ b/Documentation/RCU/NMI-RCU.rst
@@ -8,7 +8,7 @@ Although RCU is usually used to protect read-mostly data structures,
it is possible to use RCU to provide dynamic non-maskable interrupt
handlers, as well as dynamic irq handlers. This document describes
how to do this, drawing loosely from Zwane Mwaikambo's NMI-timer
-work in "arch/x86/kernel/traps.c".
+work in an old version of "arch/x86/kernel/traps.c".
The relevant pieces of code are listed below, each followed by a
brief explanation::
@@ -116,7 +116,7 @@ Answer to Quick Quiz:
This same sad story can happen on other CPUs when using
a compiler with aggressive pointer-value speculation
- optimizations.
+ optimizations. (But please don't!)
More important, the rcu_dereference_sched() makes it
clear to someone reading the code that the pointer is
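(Editor's sketch: the pattern this document discusses reduces to
something like the following, with the ``nmi_handler`` pointer name
illustrative::

    /* NMI handler: */
    fn = rcu_dereference_sched(nmi_handler);
    if (fn)
            fn(regs);

    /* Unregistering a dynamic handler: */
    rcu_assign_pointer(nmi_handler, NULL);
    synchronize_rcu();  /* Old handler no longer referenced afterward. */

)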
diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt
index 588d97366a46..db8f16b392aa 100644
--- a/Documentation/RCU/RTFP.txt
+++ b/Documentation/RCU/RTFP.txt
@@ -201,7 +201,7 @@ work looked at debugging uses of RCU [Seyster:2011:RFA:2075416.2075425].
In 2012, Josh Triplett received his Ph.D. with his dissertation
covering RCU-protected resizable hash tables and the relationship
between memory barriers and read-side traversal order: If the updater
-is making changes in the opposite direction from the read-side traveral
+is making changes in the opposite direction from the read-side traversal
order, the updater need only execute a memory-barrier instruction,
but if in the same direction, the updater needs to wait for a grace
period between the individual updates [JoshTriplettPhD]. Also in 2012,
@@ -1245,7 +1245,7 @@ Oregon Health and Sciences University"
[Viewed September 5, 2005]"
,annotation={
First posting showing how RCU can be safely adapted for
- preemptable RCU read side critical sections.
+ preemptible RCU read side critical sections.
}
}
@@ -1888,7 +1888,7 @@ Revised:
\url{https://lore.kernel.org/r/20070910183004.GA3299@linux.vnet.ibm.com}
[Viewed October 25, 2007]"
,annotation={
- Final patch for preemptable RCU to -rt. (Later patches were
+ Final patch for preemptible RCU to -rt. (Later patches were
to mainline, eventually incorporated.)
}
}
@@ -2275,7 +2275,7 @@ lot of {Linux} into your technology!!!"
\url{https://lore.kernel.org/r/20090724001429.GA17374@linux.vnet.ibm.com}
[Viewed August 15, 2009]"
,annotation={
- First posting of simple and fast preemptable RCU.
+ First posting of simple and fast preemptible RCU.
}
}
@@ -2639,7 +2639,7 @@ lot of {Linux} into your technology!!!"
RCU-protected hash tables, barriers vs. read-side traversal order.
.
If the updater is making changes in the opposite direction from
- the read-side traveral order, the updater need only execute a
+ the read-side traversal order, the updater need only execute a
memory-barrier instruction, but if in the same direction, the
updater needs to wait for a grace period between the individual
updates.
diff --git a/Documentation/RCU/UP.rst b/Documentation/RCU/UP.rst
index e26dda27430c..4060d7a2f62a 100644
--- a/Documentation/RCU/UP.rst
+++ b/Documentation/RCU/UP.rst
@@ -38,7 +38,7 @@ by having call_rcu() directly invoke its arguments only if it was called
from process context. However, this can fail in a similar manner.
Suppose that an RCU-based algorithm again scans a linked list containing
-elements A, B, and C in process contexts, but that it invokes a function
+elements A, B, and C in process context, but that it invokes a function
on each element as it is scanned. Suppose further that this function
deletes element B from the list, then passes it to call_rcu() for deferred
freeing. This may be a bit unconventional, but it is perfectly legal
@@ -59,7 +59,8 @@ Example 3: Death by Deadlock
Suppose that call_rcu() is invoked while holding a lock, and that the
callback function must acquire this same lock. In this case, if
call_rcu() were to directly invoke the callback, the result would
-be self-deadlock.
+be self-deadlock *even if* this invocation occurred from a later
+call_rcu() invocation a full grace period later.
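(Editor's sketch of the trap, with ``my_lock``, ``my_cb()``, and
``objp`` illustrative::

    static DEFINE_SPINLOCK(my_lock);

    static void my_cb(struct rcu_head *rhp)
    {
            spin_lock(&my_lock);   /* Deadlocks if invoked synchronously. */
            /* ... clean up ... */
            spin_unlock(&my_lock);
    }

    spin_lock(&my_lock);
    call_rcu(&objp->rh, my_cb);    /* Must defer, not invoke my_cb() now. */
    spin_unlock(&my_lock);

)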
In some cases, it would be possible to restructure the code so that
the call_rcu() is delayed until after the lock is released. However,
@@ -85,6 +86,14 @@ Quick Quiz #2:
:ref:`Answers to Quick Quiz <answer_quick_quiz_up>`
+It is important to note that userspace RCU implementations *do*
+permit call_rcu() to directly invoke callbacks, but only if a full
+grace period has elapsed since those callbacks were queued. This is
+the case because some userspace environments are extremely constrained.
+Nevertheless, people writing userspace RCU implementations are strongly
+encouraged to avoid invoking callbacks from call_rcu(), thus obtaining
+the deadlock-avoidance benefits called out above.
+
Summary
-------
@@ -98,7 +107,7 @@ UP systems, including PREEMPT SMP builds running on UP systems.
Quick Quiz #3:
Why can't synchronize_rcu() return immediately on UP systems running
- preemptable RCU?
+ preemptible RCU?
.. _answer_quick_quiz_up:
@@ -134,7 +143,7 @@ Answer to Quick Quiz #2:
Answer to Quick Quiz #3:
Why can't synchronize_rcu() return immediately on UP systems
- running preemptable RCU?
+ running preemptible RCU?
Because some other task might have been preempted in the middle
of an RCU read-side critical section. If synchronize_rcu()
diff --git a/Documentation/RCU/arrayRCU.rst b/Documentation/RCU/arrayRCU.rst
deleted file mode 100644
index 4051ea3871ef..000000000000
--- a/Documentation/RCU/arrayRCU.rst
+++ /dev/null
@@ -1,165 +0,0 @@
-.. _array_rcu_doc:
-
-Using RCU to Protect Read-Mostly Arrays
-=======================================
-
-Although RCU is more commonly used to protect linked lists, it can
-also be used to protect arrays. Three situations are as follows:
-
-1. :ref:`Hash Tables <hash_tables>`
-
-2. :ref:`Static Arrays <static_arrays>`
-
-3. :ref:`Resizable Arrays <resizable_arrays>`
-
-Each of these three situations involves an RCU-protected pointer to an
-array that is separately indexed. It might be tempting to consider use
-of RCU to instead protect the index into an array, however, this use
-case is **not** supported. The problem with RCU-protected indexes into
-arrays is that compilers can play way too many optimization games with
-integers, which means that the rules governing handling of these indexes
-are far more trouble than they are worth. If RCU-protected indexes into
-arrays prove to be particularly valuable (which they have not thus far),
-explicit cooperation from the compiler will be required to permit them
-to be safely used.
-
-That aside, each of the three RCU-protected pointer situations are
-described in the following sections.
-
-.. _hash_tables:
-
-Situation 1: Hash Tables
-------------------------
-
-Hash tables are often implemented as an array, where each array entry
-has a linked-list hash chain. Each hash chain can be protected by RCU
-as described in the listRCU.txt document. This approach also applies
-to other array-of-list situations, such as radix trees.
-
-.. _static_arrays:
-
-Situation 2: Static Arrays
---------------------------
-
-Static arrays, where the data (rather than a pointer to the data) is
-located in each array element, and where the array is never resized,
-have not been used with RCU. Rik van Riel recommends using seqlock in
-this situation, which would also have minimal read-side overhead as long
-as updates are rare.
-
-Quick Quiz:
- Why is it so important that updates be rare when using seqlock?
-
-:ref:`Answer to Quick Quiz <answer_quick_quiz_seqlock>`
-
-.. _resizable_arrays:
-
-Situation 3: Resizable Arrays
-------------------------------
-
-Use of RCU for resizable arrays is demonstrated by the grow_ary()
-function formerly used by the System V IPC code. The array is used
-to map from semaphore, message-queue, and shared-memory IDs to the data
-structure that represents the corresponding IPC construct. The grow_ary()
-function does not acquire any locks; instead its caller must hold the
-ids->sem semaphore.
-
-The grow_ary() function, shown below, does some limit checks, allocates a
-new ipc_id_ary, copies the old to the new portion of the new, initializes
-the remainder of the new, updates the ids->entries pointer to point to
-the new array, and invokes ipc_rcu_putref() to free up the old array.
-Note that rcu_assign_pointer() is used to update the ids->entries pointer,
-which includes any memory barriers required on whatever architecture
-you are running on::
-
- static int grow_ary(struct ipc_ids* ids, int newsize)
- {
- struct ipc_id_ary* new;
- struct ipc_id_ary* old;
- int i;
- int size = ids->entries->size;
-
- if(newsize > IPCMNI)
- newsize = IPCMNI;
- if(newsize <= size)
- return newsize;
-
- new = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *)*newsize +
- sizeof(struct ipc_id_ary));
- if(new == NULL)
- return size;
- new->size = newsize;
- memcpy(new->p, ids->entries->p,
- sizeof(struct kern_ipc_perm *)*size +
- sizeof(struct ipc_id_ary));
- for(i=size;i<newsize;i++) {
- new->p[i] = NULL;
- }
- old = ids->entries;
-
- /*
- * Use rcu_assign_pointer() to make sure the memcpyed
- * contents of the new array are visible before the new
- * array becomes visible.
- */
- rcu_assign_pointer(ids->entries, new);
-
- ipc_rcu_putref(old);
- return newsize;
- }
-
-The ipc_rcu_putref() function decrements the array's reference count
-and then, if the reference count has dropped to zero, uses call_rcu()
-to free the array after a grace period has elapsed.
-
-The array is traversed by the ipc_lock() function. This function
-indexes into the array under the protection of rcu_read_lock(),
-using rcu_dereference() to pick up the pointer to the array so
-that it may later safely be dereferenced -- memory barriers are
-required on the Alpha CPU. Since the size of the array is stored
-with the array itself, there can be no array-size mismatches, so
-a simple check suffices. The pointer to the structure corresponding
-to the desired IPC object is placed in "out", with NULL indicating
-a non-existent entry. After acquiring "out->lock", the "out->deleted"
-flag indicates whether the IPC object is in the process of being
-deleted, and, if not, the pointer is returned::
-
- struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id)
- {
- struct kern_ipc_perm* out;
- int lid = id % SEQ_MULTIPLIER;
- struct ipc_id_ary* entries;
-
- rcu_read_lock();
- entries = rcu_dereference(ids->entries);
- if(lid >= entries->size) {
- rcu_read_unlock();
- return NULL;
- }
- out = entries->p[lid];
- if(out == NULL) {
- rcu_read_unlock();
- return NULL;
- }
- spin_lock(&out->lock);
-
- /* ipc_rmid() may have already freed the ID while ipc_lock
- * was spinning: here verify that the structure is still valid
- */
- if (out->deleted) {
- spin_unlock(&out->lock);
- rcu_read_unlock();
- return NULL;
- }
- return out;
- }
-
-.. _answer_quick_quiz_seqlock:
-
-Answer to Quick Quiz:
- Why is it so important that updates be rare when using seqlock?
-
- The reason that it is important that updates be rare when
- using seqlock is that frequent updates can livelock readers.
- One way to avoid this problem is to assign a seqlock for
- each array entry rather than to the entire array.
diff --git a/Documentation/RCU/checklist.rst b/Documentation/RCU/checklist.rst
index f4545b7c9a63..bd3c58c44bef 100644
--- a/Documentation/RCU/checklist.rst
+++ b/Documentation/RCU/checklist.rst
@@ -32,8 +32,8 @@ over a rather long period of time, but improvements are always welcome!
for lockless updates. This does result in the mildly
counter-intuitive situation where rcu_read_lock() and
rcu_read_unlock() are used to protect updates, however, this
- approach provides the same potential simplifications that garbage
- collectors do.
+ approach can provide the same simplifications to certain types
+ of lockless algorithms that garbage collectors do.
1. Does the update code have proper mutual exclusion?
@@ -49,12 +49,12 @@ over a rather long period of time, but improvements are always welcome!
them -- even x86 allows later loads to be reordered to precede
earlier stores), and be prepared to explain why this added
complexity is worthwhile. If you choose #c, be prepared to
- explain how this single task does not become a major bottleneck on
- big multiprocessor machines (for example, if the task is updating
- information relating to itself that other tasks can read, there
- by definition can be no bottleneck). Note that the definition
- of "large" has changed significantly: Eight CPUs was "large"
- in the year 2000, but a hundred CPUs was unremarkable in 2017.
+ explain how this single task does not become a major bottleneck
+ on large systems (for example, if the task is updating information
+ relating to itself that other tasks can read, there by definition
+ can be no bottleneck). Note that the definition of "large" has
+ changed significantly: Eight CPUs was "large" in the year 2000,
+ but a hundred CPUs was unremarkable in 2017.
2. Do the RCU read-side critical sections make proper use of
rcu_read_lock() and friends? These primitives are needed
@@ -66,8 +66,13 @@ over a rather long period of time, but improvements are always welcome!
As a rough rule of thumb, any dereference of an RCU-protected
pointer must be covered by rcu_read_lock(), rcu_read_lock_bh(),
rcu_read_lock_sched(), or by the appropriate update-side lock.
- Disabling of preemption can serve as rcu_read_lock_sched(), but
- is less readable and prevents lockdep from detecting locking issues.
+ Explicit disabling of preemption (preempt_disable(), for example)
+ can serve as rcu_read_lock_sched(), but is less readable and
+ prevents lockdep from detecting locking issues.
+
+ Please note that you *cannot* rely on code known to be built
+ only in non-preemptible kernels. Such code can and will break,
+ especially in kernels built with CONFIG_PREEMPT_COUNT=y.
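(Editor's sketch: both of the following readers are legal, but only
the second documents its intent and gives lockdep something to check;
``gp`` and ``do_something()`` are illustrative::

    /* Works, but opaque: */
    preempt_disable();
    p = rcu_dereference_sched(gp);
    do_something(p);
    preempt_enable();

    /* Preferred: */
    rcu_read_lock_sched();
    p = rcu_dereference_sched(gp);
    do_something(p);
    rcu_read_unlock_sched();

)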
Letting RCU-protected pointers "leak" out of an RCU read-side
critical section is every bit as bad as letting them leak out
@@ -92,33 +97,38 @@ over a rather long period of time, but improvements are always welcome!
b. Proceed as in (a) above, but also maintain per-element
locks (that are acquired by both readers and writers)
- that guard per-element state. Of course, fields that
- the readers refrain from accessing can be guarded by
- some other lock acquired only by updaters, if desired.
+ that guard per-element state. Fields that the readers
+ refrain from accessing can be guarded by some other lock
+ acquired only by updaters, if desired.
- This works quite well, also.
+ This also works quite well.
c. Make updates appear atomic to readers. For example,
pointer updates to properly aligned fields will
appear atomic, as will individual atomic primitives.
Sequences of operations performed under a lock will *not*
appear to be atomic to RCU readers, nor will sequences
- of multiple atomic primitives.
+ of multiple atomic primitives. One alternative is to
+ move multiple individual fields to a separate structure,
+ thus solving the multiple-field problem by imposing an
+ additional level of indirection.
This can work, but is starting to get a bit tricky.
- d. Carefully order the updates and the reads so that
- readers see valid data at all phases of the update.
- This is often more difficult than it sounds, especially
- given modern CPUs' tendency to reorder memory references.
- One must usually liberally sprinkle memory barriers
- (smp_wmb(), smp_rmb(), smp_mb()) through the code,
- making it difficult to understand and to test.
-
- It is usually better to group the changing data into
- a separate structure, so that the change may be made
- to appear atomic by updating a pointer to reference
- a new structure containing updated values.
+ d. Carefully order the updates and the reads so that readers
+ see valid data at all phases of the update. This is often
+ more difficult than it sounds, especially given modern
+ CPUs' tendency to reorder memory references. One must
+ usually liberally sprinkle memory-ordering operations
+ through the code, making it difficult to understand and
+ to test. Where it works, it is better to use things
+ like smp_store_release() and smp_load_acquire(), but in
+ some cases the smp_mb() full memory barrier is required.
+
+ As noted earlier, it is usually better to group the
+ changing data into a separate structure, so that the
+ change may be made to appear atomic by updating a pointer
+ to reference a new structure containing updated values.
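For illustration, here is a minimal sketch of that grouping approach. The ``struct foo``, ``struct foo_vals``, and foo_lock names are made up, but the pattern is the usual copy-then-publish update::

        struct foo_vals {
                int a;
                int b;          /* a and b must change together. */
        };

        struct foo {
                struct foo_vals __rcu *vals;
        };

        static DEFINE_SPINLOCK(foo_lock);

        /* Readers see either the old (a, b) pair or the new one. */
        static int foo_update(struct foo *fp, int a, int b)
        {
                struct foo_vals *newv = kmalloc(sizeof(*newv), GFP_KERNEL);
                struct foo_vals *oldv;

                if (!newv)
                        return -ENOMEM;
                newv->a = a;
                newv->b = b;
                spin_lock(&foo_lock);
                oldv = rcu_dereference_protected(fp->vals,
                                                 lockdep_is_held(&foo_lock));
                rcu_assign_pointer(fp->vals, newv);
                spin_unlock(&foo_lock);
                synchronize_rcu();      /* Wait for pre-existing readers. */
                kfree(oldv);
                return 0;
        }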
4. Weakly ordered CPUs pose special challenges. Almost all CPUs
are weakly ordered -- even x86 CPUs allow later loads to be
@@ -140,8 +150,7 @@ over a rather long period of time, but improvements are always welcome!
prevents destructive compiler optimizations. However,
with a bit of devious creativity, it is possible to
mishandle the return value from rcu_dereference().
- Please see rcu_dereference.txt in this directory for
- more information.
+ Please see rcu_dereference.rst for more information.
The rcu_dereference() primitive is used by the
various "_rcu()" list-traversal primitives, such
@@ -151,7 +160,7 @@ over a rather long period of time, but improvements are always welcome!
primitives. This is particularly useful in code that
is common to readers and updaters. However, lockdep
will complain if you access rcu_dereference() outside
- of an RCU read-side critical section. See lockdep.txt
+ of an RCU read-side critical section. See lockdep.rst
to learn what to do about this.
Of course, neither rcu_dereference() nor the "_rcu()"
@@ -184,23 +193,29 @@ over a rather long period of time, but improvements are always welcome!
when publicizing a pointer to a structure that can
be traversed by an RCU read-side critical section.
-5. If call_rcu() or call_srcu() is used, the callback function will
- be called from softirq context. In particular, it cannot block.
+5. If any of call_rcu(), call_srcu(), call_rcu_tasks(),
+ call_rcu_tasks_rude(), or call_rcu_tasks_trace() is used,
+ the callback function may be invoked from softirq context,
+ and in any case with bottom halves disabled. In particular,
+ this callback function cannot block. If you need the callback
+ to block, run that code in a workqueue handler scheduled from
+ the callback. The queue_rcu_work() function does this for you
+ in the case of call_rcu().
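For concreteness, here is a minimal sketch of that workqueue hand-off. The ``struct foo``, foo_reclaim_workfn(), and foo_mutex names are made up::

        struct foo {
                struct rcu_work rwork;
                /* ... */
        };

        static void foo_reclaim_workfn(struct work_struct *work)
        {
                struct foo *fp = container_of(to_rcu_work(work),
                                              struct foo, rwork);

                mutex_lock(&foo_mutex); /* Blocking is legal here. */
                /* ... blocking cleanup ... */
                mutex_unlock(&foo_mutex);
                kfree(fp);
        }

        /* In place of call_rcu() with a blocking callback: */
        INIT_RCU_WORK(&fp->rwork, foo_reclaim_workfn);
        queue_rcu_work(system_wq, &fp->rwork);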
6. Since synchronize_rcu() can block, it cannot be called
from any sort of irq context. The same rule applies
- for synchronize_srcu(), synchronize_rcu_expedited(), and
- synchronize_srcu_expedited().
+ for synchronize_srcu(), synchronize_rcu_expedited(),
+ synchronize_srcu_expedited(), synchronize_rcu_tasks(),
+ synchronize_rcu_tasks_rude(), and synchronize_rcu_tasks_trace().
The expedited forms of these primitives have the same semantics
- as the non-expedited forms, but expediting is both expensive and
- (with the exception of synchronize_srcu_expedited()) unfriendly
- to real-time workloads. Use of the expedited primitives should
- be restricted to rare configuration-change operations that would
- not normally be undertaken while a real-time workload is running.
- However, real-time workloads can use rcupdate.rcu_normal kernel
- boot parameter to completely disable expedited grace periods,
- though this might have performance implications.
+ as the non-expedited forms, but expediting is more CPU intensive.
+ Use of the expedited primitives should be restricted to rare
+ configuration-change operations that would not normally be
+ undertaken while a real-time workload is running. Note that
+ IPI-sensitive real-time workloads can use the rcupdate.rcu_normal
+ kernel boot parameter to completely disable expedited grace
+ periods, though this might have performance implications.
In particular, if you find yourself invoking one of the expedited
primitives repeatedly in a loop, please do everyone a favor:
@@ -208,8 +223,9 @@ over a rather long period of time, but improvements are always welcome!
a single non-expedited primitive to cover the entire batch.
This will very likely be faster than the loop containing the
expedited primitive, and will be much much easier on the rest
- of the system, especially to real-time workloads running on
- the rest of the system.
+ of the system, especially to real-time workloads running on the
+	of the system, especially to real-time workloads running on the
+	rest of the system. Alternatively, use asynchronous
+	primitives such as call_rcu().
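For example, assuming a made-up update_one_element() helper, this advice amounts to the following restructuring::

        /* Antipattern: one expedited grace period per update. */
        for (i = 0; i < n; i++) {
                update_one_element(i);
                synchronize_rcu_expedited();
        }

        /* Preferred: batch the updates, then wait just once. */
        for (i = 0; i < n; i++)
                update_one_element(i);
        synchronize_rcu();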
7. As of v4.20, a given kernel implements only one RCU flavor, which
is RCU-sched for PREEMPTION=n and RCU-preempt for PREEMPTION=y.
@@ -232,7 +248,8 @@ over a rather long period of time, but improvements are always welcome!
the corresponding readers must use rcu_read_lock_trace() and
rcu_read_unlock_trace(). If an updater uses call_rcu_tasks_rude()
or synchronize_rcu_tasks_rude(), then the corresponding readers
- must use anything that disables interrupts.
+ must use anything that disables preemption, for example,
+ preempt_disable() and preempt_enable().
Mixing things up will result in confusion and broken kernels, and
has even resulted in an exploitable security issue. Therefore,
@@ -246,15 +263,16 @@ over a rather long period of time, but improvements are always welcome!
that this usage is safe is that readers can use anything that
disables BH when updaters use call_rcu() or synchronize_rcu().
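To make the flavor-pairing rule concrete, here is a minimal sketch for RCU Tasks Trace, in which ``gp``, gp_lock, ``newp``, and free_old_foo() are made-up names::

        /* Reader: must use the Tasks Trace flavor throughout. */
        rcu_read_lock_trace();
        p = rcu_dereference_check(gp, rcu_read_lock_trace_held());
        if (p)
                do_something_with(p);
        rcu_read_unlock_trace();

        /* Updater: must use the matching flavor, not call_rcu(). */
        spin_lock(&gp_lock);
        old = rcu_replace_pointer(gp, newp, lockdep_is_held(&gp_lock));
        spin_unlock(&gp_lock);
        call_rcu_tasks_trace(&old->rh, free_old_foo);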
-8. Although synchronize_rcu() is slower than is call_rcu(), it
- usually results in simpler code. So, unless update performance is
- critically important, the updaters cannot block, or the latency of
- synchronize_rcu() is visible from userspace, synchronize_rcu()
- should be used in preference to call_rcu(). Furthermore,
- kfree_rcu() usually results in even simpler code than does
- synchronize_rcu() without synchronize_rcu()'s multi-millisecond
- latency. So please take advantage of kfree_rcu()'s "fire and
- forget" memory-freeing capabilities where it applies.
+8. Although synchronize_rcu() is slower than is call_rcu(),
+ it usually results in simpler code. So, unless update
+ performance is critically important, the updaters cannot block,
+ or the latency of synchronize_rcu() is visible from userspace,
+ synchronize_rcu() should be used in preference to call_rcu().
+ Furthermore, kfree_rcu() and kvfree_rcu() usually result
+ in even simpler code than does synchronize_rcu() without
+ synchronize_rcu()'s multi-millisecond latency. So please take
+ advantage of kfree_rcu()'s and kvfree_rcu()'s "fire and forget"
+	memory-freeing capabilities where they apply.
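As a minimal sketch, assuming a made-up ``struct foo`` that embeds both a list_head and an rcu_head::

        struct foo {
                struct list_head list;
                struct rcu_head rh;
        };

        /* Updater, holding the update-side lock: */
        list_del_rcu(&fp->list);
        kfree_rcu(fp, rh);  /* Free after a grace period, without blocking. */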
An especially important property of the synchronize_rcu()
primitive is that it automatically self-limits: if grace periods
@@ -264,8 +282,8 @@ over a rather long period of time, but improvements are always welcome!
cases where grace periods are delayed, as failing to do so can
result in excessive realtime latencies or even OOM conditions.
- Ways of gaining this self-limiting property when using call_rcu()
- include:
+ Ways of gaining this self-limiting property when using call_rcu(),
+ kfree_rcu(), or kvfree_rcu() include:
a. Keeping a count of the number of data-structure elements
used by the RCU-protected data structure, including
@@ -297,17 +315,21 @@ over a rather long period of time, but improvements are always welcome!
here is that superuser already has lots of ways to crash
the machine.
- d. Periodically invoke synchronize_rcu(), permitting a limited
+ d. Periodically invoke rcu_barrier(), permitting a limited
number of updates per grace period.
- The same cautions apply to call_srcu() and kfree_rcu().
+ The same cautions apply to call_srcu(), call_rcu_tasks(),
+ call_rcu_tasks_rude(), and call_rcu_tasks_trace(). This is
+ why there is an srcu_barrier(), rcu_barrier_tasks(),
+	rcu_barrier_tasks_rude(), and rcu_barrier_tasks_trace(),
+ respectively.
- Note that although these primitives do take action to avoid memory
- exhaustion when any given CPU has too many callbacks, a determined
- user could still exhaust memory. This is especially the case
- if a system with a large number of CPUs has been configured to
- offload all of its RCU callbacks onto a single CPU, or if the
- system has relatively little free memory.
+ Note that although these primitives do take action to avoid
+ memory exhaustion when any given CPU has too many callbacks,
+ a determined user or administrator can still exhaust memory.
+ This is especially the case if a system with a large number of
+ CPUs has been configured to offload all of its RCU callbacks onto
+ a single CPU, or if the system has relatively little free memory.
9. All RCU list-traversal primitives, which include
rcu_dereference(), list_for_each_entry_rcu(), and
@@ -323,7 +345,7 @@ over a rather long period of time, but improvements are always welcome!
primitives when the update-side lock is held is that doing so
can be quite helpful in reducing code bloat when common code is
shared between readers and updaters. Additional primitives
- are provided for this case, as discussed in lockdep.txt.
+ are provided for this case, as discussed in lockdep.rst.
One exception to this rule is when data is only ever added to
the linked data structure, and is never removed during any
@@ -336,14 +358,14 @@ over a rather long period of time, but improvements are always welcome!
and you don't hold the appropriate update-side lock, you *must*
use the "_rcu()" variants of the list macros. Failing to do so
will break Alpha, cause aggressive compilers to generate bad code,
- and confuse people trying to read your code.
+ and confuse people trying to understand your code.
11. Any lock acquired by an RCU callback must be acquired elsewhere
- with softirq disabled, e.g., via spin_lock_irqsave(),
- spin_lock_bh(), etc. Failing to disable softirq on a given
- acquisition of that lock will result in deadlock as soon as
- the RCU softirq handler happens to run your RCU callback while
- interrupting that acquisition's critical section.
+ with softirq disabled, e.g., via spin_lock_bh(). Failing to
+ disable softirq on a given acquisition of that lock will result
+ in deadlock as soon as the RCU softirq handler happens to run
+ your RCU callback while interrupting that acquisition's critical
+ section.
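A minimal sketch of this rule, with my_lock and my_rcu_callback() being made-up names::

        static DEFINE_SPINLOCK(my_lock);

        /* Process-context acquisitions must disable softirq: */
        spin_lock_bh(&my_lock);
        /* ... critical section ... */
        spin_unlock_bh(&my_lock);

        /* The RCU callback may run from softirq context: */
        static void my_rcu_callback(struct rcu_head *rhp)
        {
                spin_lock(&my_lock);    /* Cannot deadlock against the above. */
                /* ... */
                spin_unlock(&my_lock);
        }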
12. RCU callbacks can be and are executed in parallel. In many cases,
the callback code simply wrappers around kfree(), so that this
@@ -364,7 +386,17 @@ over a rather long period of time, but improvements are always welcome!
for some real-time workloads, this is the whole point of using
the rcu_nocbs= kernel boot parameter.
-13. Unlike other forms of RCU, it *is* permissible to block in an
+ In addition, do not assume that callbacks queued in a given order
+ will be invoked in that order, even if they all are queued on the
+ same CPU. Furthermore, do not assume that same-CPU callbacks will
+ be invoked serially. For example, in recent kernels, CPUs can be
+ switched between offloaded and de-offloaded callback invocation,
+ and while a given CPU is undergoing such a switch, its callbacks
+ might be concurrently invoked by that CPU's softirq handler and
+ that CPU's rcuo kthread. At such times, that CPU's callbacks
+ might be executed both concurrently and out of order.
+
+13. Unlike most flavors of RCU, it *is* permissible to block in an
SRCU read-side critical section (demarked by srcu_read_lock()
and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
Please note that if you don't need to sleep in read-side critical
@@ -404,6 +436,12 @@ over a rather long period of time, but improvements are always welcome!
never sends IPIs to other CPUs, so it is easier on
real-time workloads than is synchronize_rcu_expedited().
+ It is also permissible to sleep in RCU Tasks Trace read-side
+	critical sections, which are delimited by rcu_read_lock_trace() and
+ rcu_read_unlock_trace(). However, this is a specialized flavor
+ of RCU, and you should not use it without first checking with
+ its current users. In most cases, you should instead use SRCU.
+
Note that rcu_assign_pointer() relates to SRCU just as it does to
other forms of RCU, but instead of rcu_dereference() you should
use srcu_dereference() in order to avoid lockdep splats.
@@ -434,50 +472,62 @@ over a rather long period of time, but improvements are always welcome!
find problems as follows:
CONFIG_PROVE_LOCKING:
- check that accesses to RCU-protected data
- structures are carried out under the proper RCU
- read-side critical section, while holding the right
- combination of locks, or whatever other conditions
- are appropriate.
+ check that accesses to RCU-protected data structures
+ are carried out under the proper RCU read-side critical
+ section, while holding the right combination of locks,
+ or whatever other conditions are appropriate.
CONFIG_DEBUG_OBJECTS_RCU_HEAD:
- check that you don't pass the
- same object to call_rcu() (or friends) before an RCU
- grace period has elapsed since the last time that you
- passed that same object to call_rcu() (or friends).
+ check that you don't pass the same object to call_rcu()
+ (or friends) before an RCU grace period has elapsed
+ since the last time that you passed that same object to
+ call_rcu() (or friends).
__rcu sparse checks:
- tag the pointer to the RCU-protected data
- structure with __rcu, and sparse will warn you if you
- access that pointer without the services of one of the
- variants of rcu_dereference().
+ tag the pointer to the RCU-protected data structure
+ with __rcu, and sparse will warn you if you access that
+ pointer without the services of one of the variants
+ of rcu_dereference().
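For example, in this made-up fragment, sparse will complain about the direct assignment but not about the rcu_dereference()::

        struct foo __rcu *gp;           /* Tagged as RCU-protected. */
        struct foo *p;

        rcu_read_lock();
        p = rcu_dereference(gp);        /* OK. */
        rcu_read_unlock();

        p = gp;         /* Sparse warning: different address spaces. */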
These debugging aids can help you find problems that are
otherwise extremely difficult to spot.
-17. If you register a callback using call_rcu() or call_srcu(), and
- pass in a function defined within a loadable module, then it in
- necessary to wait for all pending callbacks to be invoked after
- the last invocation and before unloading that module. Note that
- it is absolutely *not* sufficient to wait for a grace period!
- The current (say) synchronize_rcu() implementation is *not*
- guaranteed to wait for callbacks registered on other CPUs.
- Or even on the current CPU if that CPU recently went offline
- and came back online.
+17. If you pass a callback function defined within a module to one of
+ call_rcu(), call_srcu(), call_rcu_tasks(), call_rcu_tasks_rude(),
+ or call_rcu_tasks_trace(), then it is necessary to wait for all
+ pending callbacks to be invoked before unloading that module.
+ Note that it is absolutely *not* sufficient to wait for a grace
+	period!  For example, the synchronize_rcu() implementation is *not*
+ guaranteed to wait for callbacks registered on other CPUs via
+ call_rcu(). Or even on the current CPU if that CPU recently
+ went offline and came back online.
You instead need to use one of the barrier functions:
- call_rcu() -> rcu_barrier()
- call_srcu() -> srcu_barrier()
+ - call_rcu_tasks() -> rcu_barrier_tasks()
+ - call_rcu_tasks_rude() -> rcu_barrier_tasks_rude()
+ - call_rcu_tasks_trace() -> rcu_barrier_tasks_trace()
However, these barrier functions are absolutely *not* guaranteed
- to wait for a grace period. In fact, if there are no call_rcu()
- callbacks waiting anywhere in the system, rcu_barrier() is within
- its rights to return immediately.
-
- So if you need to wait for both an RCU grace period and for
- all pre-existing call_rcu() callbacks, you will need to execute
- both rcu_barrier() and synchronize_rcu(), if necessary, using
- something like workqueues to to execute them concurrently.
-
- See rcubarrier.txt for more information.
+ to wait for a grace period. For example, if there are no
+ call_rcu() callbacks queued anywhere in the system, rcu_barrier()
+ can and will return immediately.
+
+ So if you need to wait for both a grace period and for all
+ pre-existing callbacks, you will need to invoke both functions,
+ with the pair depending on the flavor of RCU:
+
+ - Either synchronize_rcu() or synchronize_rcu_expedited(),
+ together with rcu_barrier()
+ - Either synchronize_srcu() or synchronize_srcu_expedited(),
+	  together with srcu_barrier()
+	- synchronize_rcu_tasks() and rcu_barrier_tasks()
+	- synchronize_rcu_tasks_rude() and rcu_barrier_tasks_rude()
+	- synchronize_rcu_tasks_trace() and rcu_barrier_tasks_trace()
+
+ If necessary, you can use something like workqueues to execute
+ the requisite pair of functions concurrently.
+
+ See rcubarrier.rst for more information.
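Here is a minimal sketch of such a concurrent invocation for the vanilla-RCU pairing, using a made-up work item::

        static void sync_rcu_workfn(struct work_struct *unused)
        {
                synchronize_rcu();
        }
        static DECLARE_WORK(sync_rcu_work, sync_rcu_workfn);

        static void wait_for_gp_and_callbacks(void)
        {
                schedule_work(&sync_rcu_work);  /* Grace period, in parallel... */
                rcu_barrier();                  /* ...with callback completion. */
                flush_work(&sync_rcu_work);
        }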
diff --git a/Documentation/RCU/index.rst b/Documentation/RCU/index.rst
index e703d3dbe60c..84a79903f6a8 100644
--- a/Documentation/RCU/index.rst
+++ b/Documentation/RCU/index.rst
@@ -9,7 +9,6 @@ RCU concepts
.. toctree::
:maxdepth: 3
- arrayRCU
checklist
lockdep
lockdep-splat
diff --git a/Documentation/RCU/listRCU.rst b/Documentation/RCU/listRCU.rst
index 2a643e293fb4..bdc4bcc5289f 100644
--- a/Documentation/RCU/listRCU.rst
+++ b/Documentation/RCU/listRCU.rst
@@ -3,11 +3,10 @@
Using RCU to Protect Read-Mostly Linked Lists
=============================================
-One of the best applications of RCU is to protect read-mostly linked lists
-(``struct list_head`` in list.h). One big advantage of this approach
-is that all of the required memory barriers are included for you in
-the list macros. This document describes several applications of RCU,
-with the best fits first.
+One of the most common uses of RCU is protecting read-mostly linked lists
+(``struct list_head`` in list.h). One big advantage of this approach is
+that all of the required memory ordering is provided by the list macros.
+This document describes several list-based RCU use cases.
Example 1: Read-mostly list: Deferred Destruction
@@ -35,7 +34,8 @@ The code traversing the list of all processes typically looks like::
}
rcu_read_unlock();
-The simplified code for removing a process from a task list is::
+The simplified and heavily inlined code for removing a process from a
+task list is::
void release_task(struct task_struct *p)
{
@@ -45,39 +45,48 @@ The simplified code for removing a process from a task list is::
call_rcu(&p->rcu, delayed_put_task_struct);
}
-When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)`` under
-``tasklist_lock`` writer lock protection, to remove the task from the list of
-all tasks. The ``tasklist_lock`` prevents concurrent list additions/removals
-from corrupting the list. Readers using ``for_each_process()`` are not protected
-with the ``tasklist_lock``. To prevent readers from noticing changes in the list
-pointers, the ``task_struct`` object is freed only after one or more grace
-periods elapse (with the help of call_rcu()). This deferring of destruction
-ensures that any readers traversing the list will see valid ``p->tasks.next``
-pointers and deletion/freeing can happen in parallel with traversal of the list.
-This pattern is also called an **existence lock**, since RCU pins the object in
-memory until all existing readers finish.
+When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)``
+via __exit_signal() and __unhash_process() under ``tasklist_lock``
+writer lock protection. The list_del_rcu() invocation removes
+the task from the list of all tasks. The ``tasklist_lock``
+prevents concurrent list additions/removals from corrupting the
+list. Readers using ``for_each_process()`` are not protected with the
+``tasklist_lock``. To prevent readers from noticing changes in the list
+pointers, the ``task_struct`` object is freed only after one or more
+grace periods elapse, with the help of call_rcu(), which is invoked via
+put_task_struct_rcu_user(). This deferring of destruction ensures that
+any readers traversing the list will see valid ``p->tasks.next`` pointers
+and deletion/freeing can happen in parallel with traversal of the list.
+This pattern is also called an **existence lock**, since RCU refrains
+from invoking the delayed_put_task_struct() callback function until
+all existing readers finish, which guarantees that the ``task_struct``
+object in question will remain in existence until after the completion
+of all RCU readers that might possibly have a reference to that object.
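The callback itself is small; a simplified rendition of kernel/exit.c's delayed_put_task_struct(), omitting its bookkeeping, might look like this::

        static void delayed_put_task_struct(struct rcu_head *rhp)
        {
                struct task_struct *tsk;

                tsk = container_of(rhp, struct task_struct, rcu);
                put_task_struct(tsk);   /* Drop the reference; may free tsk. */
        }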
Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
----------------------------------------------------------------------
-The best applications are cases where, if reader-writer locking were
-used, the read-side lock would be dropped before taking any action
-based on the results of the search. The most celebrated example is
-the routing table. Because the routing table is tracking the state of
-equipment outside of the computer, it will at times contain stale data.
-Therefore, once the route has been computed, there is no need to hold
-the routing table static during transmission of the packet. After all,
-you can hold the routing table static all you want, but that won't keep
-the external Internet from changing, and it is the state of the external
-Internet that really matters. In addition, routing entries are typically
-added or deleted, rather than being modified in place.
-
-A straightforward example of this use of RCU may be found in the
-system-call auditing support. For example, a reader-writer locked
+Some reader-writer locking use cases compute a value while holding
+the read-side lock, but continue to use that value after that lock is
+released. These use cases are often good candidates for conversion
+to RCU. One prominent example involves network packet routing.
+Because the packet-routing data tracks the state of equipment outside
+of the computer, it will at times contain stale data. Therefore, once
+the route has been computed, there is no need to hold the routing table
+static during transmission of the packet. After all, you can hold the
+routing table static all you want, but that won't keep the external
+Internet from changing, and it is the state of the external Internet
+that really matters. In addition, routing entries are typically added
+or deleted, rather than being modified in place. This is a rare example
+of the finite speed of light and the non-zero size of atoms actually
+helping to make synchronization lighter weight.
+
+A straightforward example of this type of RCU use case may be found in
+the system-call auditing support. For example, a reader-writer locked
implementation of ``audit_filter_task()`` might be as follows::
- static enum audit_state audit_filter_task(struct task_struct *tsk)
+ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
{
struct audit_entry *e;
enum audit_state state;
@@ -86,6 +95,8 @@ implementation of ``audit_filter_task()`` might be as follows::
/* Note: audit_filter_mutex held by caller. */
list_for_each_entry(e, &audit_tsklist, list) {
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
+ if (state == AUDIT_STATE_RECORD)
+ *key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
read_unlock(&auditsc_lock);
return state;
}
@@ -101,7 +112,7 @@ you are turning auditing off, it is OK to audit a few extra system calls.
This means that RCU can be easily applied to the read side, as follows::
- static enum audit_state audit_filter_task(struct task_struct *tsk)
+ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
{
struct audit_entry *e;
enum audit_state state;
@@ -110,6 +121,8 @@ This means that RCU can be easily applied to the read side, as follows::
/* Note: audit_filter_mutex held by caller. */
list_for_each_entry_rcu(e, &audit_tsklist, list) {
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
+ if (state == AUDIT_STATE_RECORD)
+ *key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
rcu_read_unlock();
return state;
}
@@ -118,13 +131,15 @@ This means that RCU can be easily applied to the read side, as follows::
return AUDIT_BUILD_CONTEXT;
}
-The ``read_lock()`` and ``read_unlock()`` calls have become rcu_read_lock()
-and rcu_read_unlock(), respectively, and the list_for_each_entry() has
-become list_for_each_entry_rcu(). The **_rcu()** list-traversal primitives
-insert the read-side memory barriers that are required on DEC Alpha CPUs.
+The read_lock() and read_unlock() calls have become rcu_read_lock()
+and rcu_read_unlock(), respectively, and the list_for_each_entry()
+has become list_for_each_entry_rcu(). The **_rcu()** list-traversal
+primitives add READ_ONCE() and diagnostic checks for incorrect use
+outside of an RCU read-side critical section.
The changes to the update side are also straightforward. A reader-writer lock
-might be used as follows for deletion and insertion::
+might be used as follows for deletion and insertion in these simplified
+versions of audit_del_rule() and audit_add_rule()::
static inline int audit_del_rule(struct audit_rule *rule,
struct list_head *list)
@@ -188,16 +203,16 @@ Following are the RCU equivalents for these two functions::
return 0;
}
-Normally, the ``write_lock()`` and ``write_unlock()`` would be replaced by a
+Normally, the write_lock() and write_unlock() would be replaced by a
spin_lock() and a spin_unlock(). But in this case, all callers hold
``audit_filter_mutex``, so no additional locking is required. The
-``auditsc_lock`` can therefore be eliminated, since use of RCU eliminates the
+auditsc_lock can therefore be eliminated, since use of RCU eliminates the
need for writers to exclude readers.
The list_del(), list_add(), and list_add_tail() primitives have been
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
-The **_rcu()** list-manipulation primitives add memory barriers that are needed on
-weakly ordered CPUs (most of them!). The list_del_rcu() primitive omits the
+The **_rcu()** list-manipulation primitives add memory barriers that are
+needed on weakly ordered CPUs. The list_del_rcu() primitive omits the
pointer poisoning debug-assist code that would otherwise cause concurrent
readers to fail spectacularly.
@@ -238,7 +253,9 @@ need to be filled in)::
The RCU version creates a copy, updates the copy, then replaces the old
entry with the newly updated entry. This sequence of actions, allowing
concurrent reads while making a copy to perform an update, is what gives
-RCU (*read-copy update*) its name. The RCU code is as follows::
+RCU (*read-copy update*) its name.
+
+The RCU version of audit_upd_rule() is as follows::
static inline int audit_upd_rule(struct audit_rule *rule,
struct list_head *list,
@@ -267,6 +284,9 @@ RCU (*read-copy update*) its name. The RCU code is as follows::
Again, this assumes that the caller holds ``audit_filter_mutex``. Normally, the
writer lock would become a spinlock in this sort of code.
+The update_lsm_rule() function does something very similar, for those
+who would prefer to look at real Linux-kernel code.
+
Another use of this pattern can be found in the openswitch driver's *connection
tracking table* code in ``ct_limit_set()``. The table holds connection tracking
entries and has a limit on the maximum entries. There is one such table
@@ -281,9 +301,10 @@ Example 4: Eliminating Stale Data
---------------------------------
The auditing example above tolerates stale data, as do most algorithms
-that are tracking external state. Because there is a delay from the
-time the external state changes before Linux becomes aware of the change,
-additional RCU-induced staleness is generally not a problem.
+that are tracking external state.  After all, there is already a delay
+from the time the external state changes to the time Linux becomes
+aware of the change, so as noted earlier, a small quantity of additional
+RCU-induced staleness is generally not a problem.
However, there are many examples where stale data cannot be tolerated.
One example in the Linux kernel is the System V IPC (see the shm_lock()
@@ -302,7 +323,7 @@ Quick Quiz:
If the system-call audit module were to ever need to reject stale data, one way
to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
-audit_entry structure, and modify ``audit_filter_task()`` as follows::
+``audit_entry`` structure, and modify audit_filter_task() as follows::
static enum audit_state audit_filter_task(struct task_struct *tsk)
{
@@ -319,6 +340,8 @@ audit_entry structure, and modify ``audit_filter_task()`` as follows::
return AUDIT_BUILD_CONTEXT;
}
rcu_read_unlock();
+ if (state == AUDIT_STATE_RECORD)
+ *key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
return state;
}
}
@@ -326,12 +349,6 @@ audit_entry structure, and modify ``audit_filter_task()`` as follows::
return AUDIT_BUILD_CONTEXT;
}
-Note that this example assumes that entries are only added and deleted.
-Additional mechanism is required to deal correctly with the update-in-place
-performed by ``audit_upd_rule()``. For one thing, ``audit_upd_rule()`` would
-need additional memory barriers to ensure that the list_add_rcu() was really
-executed before the list_del_rcu().
-
The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
spinlock as follows::
@@ -357,24 +374,32 @@ spinlock as follows::
This too assumes that the caller holds ``audit_filter_mutex``.
+Note that this example assumes that entries are only added and deleted.
+Additional mechanism is required to deal correctly with the update-in-place
+performed by audit_upd_rule(). For one thing, audit_upd_rule() would
+need to hold the locks of both the old ``audit_entry`` and its replacement
+while executing the list_replace_rcu().
+
Example 5: Skipping Stale Objects
---------------------------------
-For some usecases, reader performance can be improved by skipping stale objects
-during read-side list traversal if the object in concern is pending destruction
-after one or more grace periods. One such example can be found in the timerfd
-subsystem. When a ``CLOCK_REALTIME`` clock is reprogrammed - for example due to
-setting of the system time, then all programmed timerfds that depend on this
-clock get triggered and processes waiting on them to expire are woken up in
-advance of their scheduled expiry. To facilitate this, all such timers are added
-to an RCU-managed ``cancel_list`` when they are setup in
+For some use cases, reader performance can be improved by skipping
+stale objects during read-side list traversal, where stale objects
+are those that will be removed and destroyed after one or more grace
+periods. One such example can be found in the timerfd subsystem. When a
+``CLOCK_REALTIME`` clock is reprogrammed (for example, due to setting
+of the system time), then all programmed ``timerfds`` that depend on
+this clock get triggered and processes waiting on them are awakened in
+advance of their scheduled expiry.  To facilitate this, all such timers
+are added to an RCU-managed ``cancel_list`` when they are set up in
``timerfd_setup_cancel()``::
static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
{
spin_lock(&ctx->cancel_lock);
- if ((ctx->clockid == CLOCK_REALTIME &&
+ if ((ctx->clockid == CLOCK_REALTIME ||
+ ctx->clockid == CLOCK_REALTIME_ALARM) &&
(flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
if (!ctx->might_cancel) {
ctx->might_cancel = true;
@@ -382,13 +407,16 @@ to an RCU-managed ``cancel_list`` when they are setup in
list_add_rcu(&ctx->clist, &cancel_list);
spin_unlock(&cancel_lock);
}
+ } else {
+ __timerfd_remove_cancel(ctx);
}
spin_unlock(&ctx->cancel_lock);
}
-When a timerfd is freed (fd is closed), then the ``might_cancel`` flag of the
-timerfd object is cleared, the object removed from the ``cancel_list`` and
-destroyed::
+When a timerfd is freed (fd is closed), then the ``might_cancel``
+flag of the timerfd object is cleared, the object removed from the
+``cancel_list`` and destroyed, as shown in this simplified and inlined
+version of timerfd_release()::
int timerfd_release(struct inode *inode, struct file *file)
{
@@ -403,7 +431,10 @@ destroyed::
}
spin_unlock(&ctx->cancel_lock);
- hrtimer_cancel(&ctx->t.tmr);
+ if (isalarm(ctx))
+ alarm_cancel(&ctx->t.alarm);
+ else
+ hrtimer_cancel(&ctx->t.tmr);
kfree_rcu(ctx, rcu);
return 0;
}
@@ -416,6 +447,7 @@ objects::
void timerfd_clock_was_set(void)
{
+ ktime_t moffs = ktime_mono_to_real(0);
struct timerfd_ctx *ctx;
unsigned long flags;
@@ -424,7 +456,7 @@ objects::
if (!ctx->might_cancel)
continue;
spin_lock_irqsave(&ctx->wqh.lock, flags);
- if (ctx->moffs != ktime_mono_to_real(0)) {
+ if (ctx->moffs != moffs) {
ctx->moffs = KTIME_MAX;
ctx->ticks++;
wake_up_locked_poll(&ctx->wqh, EPOLLIN);
@@ -434,10 +466,10 @@ objects::
rcu_read_unlock();
}
-The key point here is, because RCU-traversal of the ``cancel_list`` happens
-while objects are being added and removed to the list, sometimes the traversal
-can step on an object that has been removed from the list. In this example, it
-is seen that it is better to skip such objects using a flag.
+The key point is that because RCU-protected traversal of the
+``cancel_list`` happens concurrently with object addition and removal,
+sometimes the traversal can access an object that has been removed from
+the list. In this example, a flag is used to skip such objects.
Summary
diff --git a/Documentation/RCU/lockdep.rst b/Documentation/RCU/lockdep.rst
index cc860a0c296b..69e73a39bd11 100644
--- a/Documentation/RCU/lockdep.rst
+++ b/Documentation/RCU/lockdep.rst
@@ -17,7 +17,9 @@ state::
rcu_read_lock_held() for normal RCU.
rcu_read_lock_bh_held() for RCU-bh.
rcu_read_lock_sched_held() for RCU-sched.
+ rcu_read_lock_any_held() for any of normal RCU, RCU-bh, and RCU-sched.
srcu_read_lock_held() for SRCU.
+ rcu_read_lock_trace_held() for RCU Tasks Trace.
These functions are conservative, and will therefore return 1 if they
aren't certain (for example, if CONFIG_DEBUG_LOCK_ALLOC is not set).
@@ -53,6 +55,8 @@ checking of rcu_dereference() primitives:
is invoked by both SRCU readers and updaters.
rcu_dereference_raw(p):
Don't check. (Use sparingly, if at all.)
+ rcu_dereference_raw_check(p):
+ Don't do lockdep at all. (Use sparingly, if at all.)
rcu_dereference_protected(p, c):
Use explicit check expression "c", and omit all barriers
and compiler constraints. This is useful when the data
@@ -61,13 +65,12 @@ checking of rcu_dereference() primitives:
rcu_access_pointer(p):
Return the value of the pointer and omit all barriers,
but retain the compiler constraints that prevent duplicating
- or coalescsing. This is useful when when testing the
+ or coalescing. This is useful when testing the
value of the pointer itself, for example, against NULL.
The rcu_dereference_check() check expression can be any boolean
-expression, but would normally include a lockdep expression. However,
-any boolean expression can be used. For a moderately ornate example,
-consider the following::
+expression, but would normally include a lockdep expression. For a
+moderately ornate example, consider the following::
file = rcu_dereference_check(fdt->fd[fd],
lockdep_is_held(&files->file_lock) ||
@@ -93,10 +96,10 @@ code, it could instead be written as follows::
atomic_read(&files->count) == 1);
This would verify cases #2 and #3 above, and furthermore lockdep would
-complain if this was used in an RCU read-side critical section unless one
-of these two cases held. Because rcu_dereference_protected() omits all
-barriers and compiler constraints, it generates better code than do the
-other flavors of rcu_dereference(). On the other hand, it is illegal
+complain even if this was used in an RCU read-side critical section unless
+one of these two cases held. Because rcu_dereference_protected() omits
+all barriers and compiler constraints, it generates better code than do
+the other flavors of rcu_dereference(). On the other hand, it is illegal
to use rcu_dereference_protected() if either the RCU-protected pointer
or the RCU-protected data that it points to can change concurrently.
diff --git a/Documentation/RCU/rcu.rst b/Documentation/RCU/rcu.rst
index 0e03c6ef3147..bf6617b330a7 100644
--- a/Documentation/RCU/rcu.rst
+++ b/Documentation/RCU/rcu.rst
@@ -10,9 +10,8 @@ A "grace period" must elapse between the two parts, and this grace period
must be long enough that any readers accessing the item being deleted have
since dropped their references. For example, an RCU-protected deletion
from a linked list would first remove the item from the list, wait for
-a grace period to elapse, then free the element. See the
-:ref:`Documentation/RCU/listRCU.rst <list_rcu_doc>` for more information on
-using RCU with linked lists.
+a grace period to elapse, then free the element. See listRCU.rst for more
+information on using RCU with linked lists.
Frequently Asked Questions
--------------------------
@@ -50,7 +49,7 @@ Frequently Asked Questions
- If I am running on a uniprocessor kernel, which can only do one
thing at a time, why should I wait for a grace period?
- See :ref:`Documentation/RCU/UP.rst <up_doc>` for more information.
+ See UP.rst for more information.
- How can I see where RCU is currently used in the Linux kernel?
@@ -64,13 +63,13 @@ Frequently Asked Questions
- What guidelines should I follow when writing code that uses RCU?
- See the checklist.txt file in this directory.
+ See checklist.rst.
- Why the name "RCU"?
"RCU" stands for "read-copy update".
- :ref:`Documentation/RCU/listRCU.rst <list_rcu_doc>` has more information on where
- this name came from, search for "read-copy update" to find it.
+	listRCU.rst has more information on where this name came from; search
+	for "read-copy update" to find it.
- I hear that RCU is patented? What is with that?
@@ -78,15 +77,17 @@ Frequently Asked Questions
search for the string "Patent" in Documentation/RCU/RTFP.txt to find them.
Of these, one was allowed to lapse by the assignee, and the
others have been contributed to the Linux kernel under GPL.
+ Many (but not all) have long since expired.
There are now also LGPL implementations of user-level RCU
available (https://liburcu.org/).
- I hear that RCU needs work in order to support realtime kernels?
- Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU
+	Realtime-friendly RCU is enabled via the CONFIG_PREEMPTION
kernel configuration parameter.
- Where can I find more information on RCU?
See the Documentation/RCU/RTFP.txt file.
- Or point your browser at (http://www.rdrop.com/users/paulmck/RCU/).
+ Or point your browser at (https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit)
+ or (https://docs.google.com/document/d/1GCdQC8SDbb54W1shjEXqGZ0Rq8a6kIeYutdSIajfpLA/edit?usp=sharing).
diff --git a/Documentation/RCU/rcu_dereference.rst b/Documentation/RCU/rcu_dereference.rst
index 0b418a5b243c..3b739f6243c8 100644
--- a/Documentation/RCU/rcu_dereference.rst
+++ b/Documentation/RCU/rcu_dereference.rst
@@ -19,8 +19,9 @@ Follow these rules to keep your RCU code working properly:
can reload the value, and won't your code have fun with two
different values for a single pointer! Without rcu_dereference(),
DEC Alpha can load a pointer, dereference that pointer, and
- return data preceding initialization that preceded the store of
- the pointer.
+ return data preceding initialization that preceded the store
+ of the pointer. (As noted later, in recent kernels READ_ONCE()
+ also prevents DEC Alpha from playing these tricks.)
In addition, the volatile cast in rcu_dereference() prevents the
compiler from deducing the resulting pointer value. Please see
@@ -34,7 +35,7 @@ Follow these rules to keep your RCU code working properly:
takes on the role of the lockless_dereference() primitive that
was removed in v4.15.
-- You are only permitted to use rcu_dereference on pointer values.
+- You are only permitted to use rcu_dereference() on pointer values.
The compiler simply knows too much about integral values to
trust it to carry dependencies through integer operations.
There are a very few exceptions, namely that you can temporarily
@@ -128,10 +129,16 @@ Follow these rules to keep your RCU code working properly:
This sort of comparison occurs frequently when scanning
RCU-protected circular linked lists.
- Note that if checks for being within an RCU read-side
- critical section are not required and the pointer is never
- dereferenced, rcu_access_pointer() should be used in place
- of rcu_dereference().
+ Note that if the pointer comparison is done outside
+ of an RCU read-side critical section, and the pointer
+ is never dereferenced, rcu_access_pointer() should be
+ used in place of rcu_dereference(). In most cases,
+ it is best to avoid accidental dereferences by testing
+ the rcu_access_pointer() return value directly, without
+ assigning it to a variable.
+
+ Within an RCU read-side critical section, there is little
+ reason to use rcu_access_pointer().
- The comparison is against a pointer that references memory
that was initialized "a long time ago." The reason
@@ -234,6 +241,7 @@ precautions. To see this, consider the following code fragment::
struct foo *q;
int r1, r2;
+ rcu_read_lock();
p = rcu_dereference(gp2);
if (p == NULL)
return;
@@ -242,7 +250,10 @@ precautions. To see this, consider the following code fragment::
if (p == q) {
/* The compiler decides that q->c is same as p->c. */
r2 = p->c; /* Could get 44 on weakly order system. */
+ } else {
+ r2 = p->c - r1; /* Unconditional access to p->c. */
}
+ rcu_read_unlock();
do_something_with(r1, r2);
}
@@ -291,6 +302,7 @@ Then one approach is to use locking, for example, as follows::
struct foo *q;
int r1, r2;
+ rcu_read_lock();
p = rcu_dereference(gp2);
if (p == NULL)
return;
@@ -300,7 +312,12 @@ Then one approach is to use locking, for example, as follows::
if (p == q) {
/* The compiler decides that q->c is same as p->c. */
r2 = p->c; /* Locking guarantees r2 == 144. */
+ } else {
+ spin_lock(&q->lock);
+ r2 = q->c - r1;
+ spin_unlock(&q->lock);
}
+ rcu_read_unlock();
spin_unlock(&p->lock);
do_something_with(r1, r2);
}
@@ -358,7 +375,7 @@ the exact value of "p" even in the not-equals case. This allows the
compiler to make the return values independent of the load from "gp",
in turn destroying the ordering between this load and the loads of the
return values. This can result in "p->b" returning pre-initialization
-garbage values.
+garbage values on weakly ordered systems.
In short, rcu_dereference() is *not* optional when you are going to
dereference the resulting pointer.
@@ -424,7 +441,7 @@ member of the rcu_dereference() to use in various situations:
SPARSE CHECKING OF RCU-PROTECTED POINTERS
-----------------------------------------
-The sparse static-analysis tool checks for direct access to RCU-protected
+The sparse static-analysis tool checks for non-RCU access to RCU-protected
pointers, which can result in "interesting" bugs due to compiler
optimizations involving invented loads and perhaps also load tearing.
For example, suppose someone mistakenly does something like this::
diff --git a/Documentation/RCU/rcubarrier.rst b/Documentation/RCU/rcubarrier.rst
index 3b4a24877496..6da7f66da2a8 100644
--- a/Documentation/RCU/rcubarrier.rst
+++ b/Documentation/RCU/rcubarrier.rst
@@ -5,37 +5,12 @@ RCU and Unloadable Modules
[Originally published in LWN Jan. 14, 2007: http://lwn.net/Articles/217484/]
-RCU (read-copy update) is a synchronization mechanism that can be thought
-of as a replacement for read-writer locking (among other things), but with
-very low-overhead readers that are immune to deadlock, priority inversion,
-and unbounded latency. RCU read-side critical sections are delimited
-by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPTION
-kernels, generate no code whatsoever.
-
-This means that RCU writers are unaware of the presence of concurrent
-readers, so that RCU updates to shared data must be undertaken quite
-carefully, leaving an old version of the data structure in place until all
-pre-existing readers have finished. These old versions are needed because
-such readers might hold a reference to them. RCU updates can therefore be
-rather expensive, and RCU is thus best suited for read-mostly situations.
-
-How can an RCU writer possibly determine when all readers are finished,
-given that readers might well leave absolutely no trace of their
-presence? There is a synchronize_rcu() primitive that blocks until all
-pre-existing readers have completed. An updater wishing to delete an
-element p from a linked list might do the following, while holding an
-appropriate lock, of course::
-
- list_del_rcu(p);
- synchronize_rcu();
- kfree(p);
-
-But the above code cannot be used in IRQ context -- the call_rcu()
-primitive must be used instead. This primitive takes a pointer to an
-rcu_head struct placed within the RCU-protected data structure and
-another pointer to a function that may be invoked later to free that
-structure. Code to delete an element p from the linked list from IRQ
-context might then be as follows::
+RCU updaters sometimes use call_rcu() to initiate an asynchronous wait for
+a grace period to elapse. This primitive takes a pointer to an rcu_head
+struct placed within the RCU-protected data structure and another pointer
+to a function that may be invoked later to free that structure. Code to
+delete an element p from the linked list from IRQ context might then be
+as follows::
list_del_rcu(p);
call_rcu(&p->rcu, p_callback);
@@ -54,7 +29,7 @@ IRQ context. The function p_callback() might be defined as follows::
Unloading Modules That Use call_rcu()
-------------------------------------
-But what if p_callback is defined in an unloadable module?
+But what if the p_callback() function is defined in an unloadable module?
If we unload the module while some RCU callbacks are pending,
the CPUs executing these callbacks are going to be severely
@@ -67,20 +42,21 @@ grace period to elapse, it does not wait for the callbacks to complete.
One might be tempted to try several back-to-back synchronize_rcu()
calls, but this is still not guaranteed to work. If there is a very
-heavy RCU-callback load, then some of the callbacks might be deferred
-in order to allow other processing to proceed. Such deferral is required
-in realtime kernels in order to avoid excessive scheduling latencies.
+heavy RCU-callback load, then some of the callbacks might be deferred in
+order to allow other processing to proceed. For but one example, such
+deferral is required in realtime kernels in order to avoid excessive
+scheduling latencies.
rcu_barrier()
-------------
-We instead need the rcu_barrier() primitive. Rather than waiting for
-a grace period to elapse, rcu_barrier() waits for all outstanding RCU
-callbacks to complete. Please note that rcu_barrier() does **not** imply
-synchronize_rcu(), in particular, if there are no RCU callbacks queued
-anywhere, rcu_barrier() is within its rights to return immediately,
-without waiting for a grace period to elapse.
+This situation can be handled by the rcu_barrier() primitive. Rather
+than waiting for a grace period to elapse, rcu_barrier() waits for all
+outstanding RCU callbacks to complete. Please note that rcu_barrier()
+does **not** imply synchronize_rcu(), in particular, if there are no RCU
+callbacks queued anywhere, rcu_barrier() is within its rights to return
+immediately, without waiting for anything, let alone a grace period.
Pseudo-code using rcu_barrier() is as follows:
@@ -89,83 +65,86 @@ Pseudo-code using rcu_barrier() is as follows:
3. Allow the module to be unloaded.
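A minimal module-exit sketch following these steps, where stop_queueing_callbacks() stands in for whatever step-1 mechanism prevents new callbacks from being posted::

        static void __exit example_exit(void)
        {
                stop_queueing_callbacks();  /* 1. No new call_rcu() invocations. */
                rcu_barrier();              /* 2. Wait for pending callbacks. */
        }                                   /* 3. Module may now be unloaded. */
        module_exit(example_exit);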
There is also an srcu_barrier() function for SRCU, and you of course
-must match the flavor of rcu_barrier() with that of call_rcu(). If your
-module uses multiple flavors of call_rcu(), then it must also use multiple
-flavors of rcu_barrier() when unloading that module. For example, if
-it uses call_rcu(), call_srcu() on srcu_struct_1, and call_srcu() on
-srcu_struct_2, then the following three lines of code will be required
-when unloading::
-
- 1 rcu_barrier();
- 2 srcu_barrier(&srcu_struct_1);
- 3 srcu_barrier(&srcu_struct_2);
-
-The rcutorture module makes use of rcu_barrier() in its exit function
-as follows::
-
- 1 static void
- 2 rcu_torture_cleanup(void)
- 3 {
- 4 int i;
- 5
- 6 fullstop = 1;
- 7 if (shuffler_task != NULL) {
- 8 VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task");
- 9 kthread_stop(shuffler_task);
- 10 }
- 11 shuffler_task = NULL;
+must match the flavor of srcu_barrier() with that of call_srcu().
+If your module uses multiple srcu_struct structures, then it must also
+use multiple invocations of srcu_barrier() when unloading that module.
+For example, if it uses call_rcu(), call_srcu() on srcu_struct_1, and
+call_srcu() on srcu_struct_2, then the following three lines of code
+will be required when unloading::
+
+ 1 rcu_barrier();
+ 2 srcu_barrier(&srcu_struct_1);
+ 3 srcu_barrier(&srcu_struct_2);
+
+If latency is of the essence, workqueues could be used to run these
+three functions concurrently.
+
+An ancient version of the rcutorture module makes use of rcu_barrier()
+in its exit function as follows::
+
+ 1 static void
+ 2 rcu_torture_cleanup(void)
+ 3 {
+ 4 int i;
+ 5
+ 6 fullstop = 1;
+ 7 if (shuffler_task != NULL) {
+ 8 VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task");
+ 9 kthread_stop(shuffler_task);
+ 10 }
+ 11 shuffler_task = NULL;
12
- 13 if (writer_task != NULL) {
- 14 VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task");
- 15 kthread_stop(writer_task);
- 16 }
- 17 writer_task = NULL;
+ 13 if (writer_task != NULL) {
+ 14 VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task");
+ 15 kthread_stop(writer_task);
+ 16 }
+ 17 writer_task = NULL;
18
- 19 if (reader_tasks != NULL) {
- 20 for (i = 0; i < nrealreaders; i++) {
- 21 if (reader_tasks[i] != NULL) {
- 22 VERBOSE_PRINTK_STRING(
- 23 "Stopping rcu_torture_reader task");
- 24 kthread_stop(reader_tasks[i]);
- 25 }
- 26 reader_tasks[i] = NULL;
- 27 }
- 28 kfree(reader_tasks);
- 29 reader_tasks = NULL;
- 30 }
- 31 rcu_torture_current = NULL;
+ 19 if (reader_tasks != NULL) {
+ 20 for (i = 0; i < nrealreaders; i++) {
+ 21 if (reader_tasks[i] != NULL) {
+ 22 VERBOSE_PRINTK_STRING(
+ 23 "Stopping rcu_torture_reader task");
+ 24 kthread_stop(reader_tasks[i]);
+ 25 }
+ 26 reader_tasks[i] = NULL;
+ 27 }
+ 28 kfree(reader_tasks);
+ 29 reader_tasks = NULL;
+ 30 }
+ 31 rcu_torture_current = NULL;
32
- 33 if (fakewriter_tasks != NULL) {
- 34 for (i = 0; i < nfakewriters; i++) {
- 35 if (fakewriter_tasks[i] != NULL) {
- 36 VERBOSE_PRINTK_STRING(
- 37 "Stopping rcu_torture_fakewriter task");
- 38 kthread_stop(fakewriter_tasks[i]);
- 39 }
- 40 fakewriter_tasks[i] = NULL;
- 41 }
- 42 kfree(fakewriter_tasks);
- 43 fakewriter_tasks = NULL;
- 44 }
+ 33 if (fakewriter_tasks != NULL) {
+ 34 for (i = 0; i < nfakewriters; i++) {
+ 35 if (fakewriter_tasks[i] != NULL) {
+ 36 VERBOSE_PRINTK_STRING(
+ 37 "Stopping rcu_torture_fakewriter task");
+ 38 kthread_stop(fakewriter_tasks[i]);
+ 39 }
+ 40 fakewriter_tasks[i] = NULL;
+ 41 }
+ 42 kfree(fakewriter_tasks);
+ 43 fakewriter_tasks = NULL;
+ 44 }
45
- 46 if (stats_task != NULL) {
- 47 VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task");
- 48 kthread_stop(stats_task);
- 49 }
- 50 stats_task = NULL;
+ 46 if (stats_task != NULL) {
+ 47 VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task");
+ 48 kthread_stop(stats_task);
+ 49 }
+ 50 stats_task = NULL;
51
- 52 /* Wait for all RCU callbacks to fire. */
- 53 rcu_barrier();
+ 52 /* Wait for all RCU callbacks to fire. */
+ 53 rcu_barrier();
54
- 55 rcu_torture_stats_print(); /* -After- the stats thread is stopped! */
+ 55 rcu_torture_stats_print(); /* -After- the stats thread is stopped! */
56
- 57 if (cur_ops->cleanup != NULL)
- 58 cur_ops->cleanup();
- 59 if (atomic_read(&n_rcu_torture_error))
- 60 rcu_torture_print_module_parms("End of test: FAILURE");
- 61 else
- 62 rcu_torture_print_module_parms("End of test: SUCCESS");
- 63 }
+ 57 if (cur_ops->cleanup != NULL)
+ 58 cur_ops->cleanup();
+ 59 if (atomic_read(&n_rcu_torture_error))
+ 60 rcu_torture_print_module_parms("End of test: FAILURE");
+ 61 else
+ 62 rcu_torture_print_module_parms("End of test: SUCCESS");
+ 63 }
Line 6 sets a global variable that prevents any RCU callbacks from
re-posting themselves. This will not be necessary in most cases, since
@@ -190,16 +169,17 @@ Quick Quiz #1:
:ref:`Answer to Quick Quiz #1 <answer_rcubarrier_quiz_1>`
Your module might have additional complications. For example, if your
-module invokes call_rcu() from timers, you will need to first cancel all
-the timers, and only then invoke rcu_barrier() to wait for any remaining
+module invokes call_rcu() from timers, you will need to first refrain
+from posting new timers, cancel (or wait for) all the already-posted
+timers, and only then invoke rcu_barrier() to wait for any remaining
RCU callbacks to complete.
-Of course, if you module uses call_rcu(), you will need to invoke
+Of course, if your module uses call_rcu(), you will need to invoke
rcu_barrier() before unloading. Similarly, if your module uses
call_srcu(), you will need to invoke srcu_barrier() before unloading,
and on the same srcu_struct structure. If your module uses call_rcu()
-**and** call_srcu(), then you will need to invoke rcu_barrier() **and**
-srcu_barrier().
+**and** call_srcu(), then (as noted above) you will need to invoke
+rcu_barrier() **and** srcu_barrier().
Implementing rcu_barrier()
@@ -211,27 +191,40 @@ queues. His implementation queues an RCU callback on each of the per-CPU
callback queues, and then waits until they have all started executing, at
which point, all earlier RCU callbacks are guaranteed to have completed.
-The original code for rcu_barrier() was as follows::
-
- 1 void rcu_barrier(void)
- 2 {
- 3 BUG_ON(in_interrupt());
- 4 /* Take cpucontrol mutex to protect against CPU hotplug */
- 5 mutex_lock(&rcu_barrier_mutex);
- 6 init_completion(&rcu_barrier_completion);
- 7 atomic_set(&rcu_barrier_cpu_count, 0);
- 8 on_each_cpu(rcu_barrier_func, NULL, 0, 1);
- 9 wait_for_completion(&rcu_barrier_completion);
- 10 mutex_unlock(&rcu_barrier_mutex);
- 11 }
-
-Line 3 verifies that the caller is in process context, and lines 5 and 10
+The original code for rcu_barrier() was roughly as follows::
+
+ 1 void rcu_barrier(void)
+ 2 {
+ 3 BUG_ON(in_interrupt());
+ 4 /* Take cpucontrol mutex to protect against CPU hotplug */
+ 5 mutex_lock(&rcu_barrier_mutex);
+ 6 init_completion(&rcu_barrier_completion);
+ 7 atomic_set(&rcu_barrier_cpu_count, 1);
+ 8 on_each_cpu(rcu_barrier_func, NULL, 0, 1);
+ 9 if (atomic_dec_and_test(&rcu_barrier_cpu_count))
+ 10 complete(&rcu_barrier_completion);
+ 11 wait_for_completion(&rcu_barrier_completion);
+ 12 mutex_unlock(&rcu_barrier_mutex);
+ 13 }
+
+Line 3 verifies that the caller is in process context, and lines 5 and 12
use rcu_barrier_mutex to ensure that only one rcu_barrier() is using the
global completion and counters at a time, which are initialized on lines
6 and 7. Line 8 causes each CPU to invoke rcu_barrier_func(), which is
shown below. Note that the final "1" in on_each_cpu()'s argument list
ensures that all the calls to rcu_barrier_func() will have completed
-before on_each_cpu() returns. Line 9 then waits for the completion.
+before on_each_cpu() returns. Line 9 removes the initial count from
+rcu_barrier_cpu_count, and if this count is now zero, line 10 finalizes
+the completion, which prevents line 11 from blocking. Either way,
+line 11 then waits (if needed) for the completion.
+
+.. _rcubarrier_quiz_2:
+
+Quick Quiz #2:
+	Why doesn't line 7 initialize rcu_barrier_cpu_count to zero,
+ thereby avoiding the need for lines 9 and 10?
+
+:ref:`Answer to Quick Quiz #2 <answer_rcubarrier_quiz_2>`
This code was rewritten in 2008 and several times thereafter, but this
still gives the general idea.
@@ -239,21 +232,21 @@ still gives the general idea.
The rcu_barrier_func() runs on each CPU, where it invokes call_rcu()
to post an RCU callback, as follows::
- 1 static void rcu_barrier_func(void *notused)
- 2 {
- 3 int cpu = smp_processor_id();
- 4 struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- 5 struct rcu_head *head;
- 6
- 7 head = &rdp->barrier;
- 8 atomic_inc(&rcu_barrier_cpu_count);
- 9 call_rcu(head, rcu_barrier_callback);
- 10 }
+ 1 static void rcu_barrier_func(void *notused)
+ 2 {
+ 3 int cpu = smp_processor_id();
+ 4 struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+ 5 struct rcu_head *head;
+ 6
+ 7 head = &rdp->barrier;
+ 8 atomic_inc(&rcu_barrier_cpu_count);
+ 9 call_rcu(head, rcu_barrier_callback);
+ 10 }
Lines 3 and 4 locate RCU's internal per-CPU rcu_data structure,
which contains the struct rcu_head needed for the later call to
call_rcu(). Line 7 picks up a pointer to this struct rcu_head, and line
-8 increments a global counter. This counter will later be decremented
+8 increments the global counter. This counter will later be decremented
by the callback. Line 9 then registers the rcu_barrier_callback() on
the current CPU's queue.
@@ -261,33 +254,34 @@ The rcu_barrier_callback() function simply atomically decrements the
rcu_barrier_cpu_count variable and finalizes the completion when it
reaches zero, as follows::
- 1 static void rcu_barrier_callback(struct rcu_head *notused)
- 2 {
- 3 if (atomic_dec_and_test(&rcu_barrier_cpu_count))
- 4 complete(&rcu_barrier_completion);
- 5 }
+ 1 static void rcu_barrier_callback(struct rcu_head *notused)
+ 2 {
+ 3 if (atomic_dec_and_test(&rcu_barrier_cpu_count))
+ 4 complete(&rcu_barrier_completion);
+ 5 }
-.. _rcubarrier_quiz_2:
+.. _rcubarrier_quiz_3:
-Quick Quiz #2:
+Quick Quiz #3:
What happens if CPU 0's rcu_barrier_func() executes
immediately (thus incrementing rcu_barrier_cpu_count to the
	value two), but the other CPUs' rcu_barrier_func() invocations
are delayed for a full grace period? Couldn't this result in
rcu_barrier() returning prematurely?
-:ref:`Answer to Quick Quiz #2 <answer_rcubarrier_quiz_2>`
+:ref:`Answer to Quick Quiz #3 <answer_rcubarrier_quiz_3>`
The current rcu_barrier() implementation is more complex, due to the need
to avoid disturbing idle CPUs (especially on battery-powered systems)
and the need to minimally disturb non-idle CPUs in real-time systems.
-However, the code above illustrates the concepts.
+In addition, a great many optimizations have been applied. However,
+the code above illustrates the concepts.
rcu_barrier() Summary
---------------------
-The rcu_barrier() primitive has seen relatively little use, since most
+The rcu_barrier() primitive is used relatively infrequently, since most
code using RCU is in the core kernel rather than in modules. However, if
you are using RCU from an unloadable module, you need to use rcu_barrier()
so that your module may be safely unloaded.
@@ -302,7 +296,8 @@ Quick Quiz #1:
Is there any other situation where rcu_barrier() might
be required?
-Answer: Interestingly enough, rcu_barrier() was not originally
+Answer:
+ Interestingly enough, rcu_barrier() was not originally
implemented for module unloading. Nikita Danilov was using
RCU in a filesystem, which resulted in a similar situation at
filesystem-unmount time. Dipankar Sarma coded up rcu_barrier()
@@ -318,13 +313,48 @@ Answer: Interestingly enough, rcu_barrier() was not originally
.. _answer_rcubarrier_quiz_2:
Quick Quiz #2:
+	Why doesn't line 7 initialize rcu_barrier_cpu_count to zero,
+ thereby avoiding the need for lines 9 and 10?
+
+Answer:
+ Suppose that the on_each_cpu() function shown on line 8 was
+ delayed, so that CPU 0's rcu_barrier_func() executed and
+ the corresponding grace period elapsed, all before CPU 1's
+ rcu_barrier_func() started executing. This would result in
+ rcu_barrier_cpu_count being decremented to zero, so that line
+ 11's wait_for_completion() would return immediately, failing to
+ wait for CPU 1's callbacks to be invoked.
+
+ Note that this was not a problem when the rcu_barrier() code
+ was first added back in 2005. This is because on_each_cpu()
+ disables preemption, which acted as an RCU read-side critical
+ section, thus preventing CPU 0's grace period from completing
+ until on_each_cpu() had dealt with all of the CPUs. However,
+ with the advent of preemptible RCU, rcu_barrier() no longer
+ waited on nonpreemptible regions of code in preemptible kernels,
+ that being the job of the new rcu_barrier_sched() function.
+
+ However, with the RCU flavor consolidation around v4.20, this
+ possibility was once again ruled out, because the consolidated
+ RCU once again waits on nonpreemptible regions of code.
+
+ Nevertheless, that extra count might still be a good idea.
+	Relying on these sorts of accidents of implementation can result
+ in later surprise bugs when the implementation changes.
+
+:ref:`Back to Quick Quiz #2 <rcubarrier_quiz_2>`
+
+.. _answer_rcubarrier_quiz_3:
+
+Quick Quiz #3:
What happens if CPU 0's rcu_barrier_func() executes
immediately (thus incrementing rcu_barrier_cpu_count to the
	value two), but the other CPUs' rcu_barrier_func() invocations
are delayed for a full grace period? Couldn't this result in
rcu_barrier() returning prematurely?
-Answer: This cannot happen. The reason is that on_each_cpu() has its last
+Answer:
+ This cannot happen. The reason is that on_each_cpu() has its last
argument, the wait flag, set to "1". This flag is passed through
to smp_call_function() and further to smp_call_function_on_cpu(),
	causing the latter to spin until the cross-CPU invocation of
@@ -336,18 +366,15 @@ Answer: This cannot happen. The reason is that on_each_cpu() has its last
Therefore, on_each_cpu() disables preemption across its call
to smp_call_function() and also across the local call to
- rcu_barrier_func(). This prevents the local CPU from context
- switching, again preventing grace periods from completing. This
+ rcu_barrier_func(). Because recent RCU implementations treat
+ preemption-disabled regions of code as RCU read-side critical
+ sections, this prevents grace periods from completing. This
means that all CPUs have executed rcu_barrier_func() before
the first rcu_barrier_callback() can possibly execute, in turn
preventing rcu_barrier_cpu_count from prematurely reaching zero.
- Currently, -rt implementations of RCU keep but a single global
- queue for RCU callbacks, and thus do not suffer from this
- problem. However, when the -rt RCU eventually does have per-CPU
- callback queues, things will have to change. One simple change
- is to add an rcu_read_lock() before line 8 of rcu_barrier()
- and an rcu_read_unlock() after line 8 of this same function. If
- you can think of a better change, please let me know!
+ But if on_each_cpu() ever decides to forgo disabling preemption,
+ as might well happen due to real-time latency considerations,
+ initializing rcu_barrier_cpu_count to one will save the day.
-:ref:`Back to Quick Quiz #2 <rcubarrier_quiz_2>`
+:ref:`Back to Quick Quiz #3 <rcubarrier_quiz_3>`
diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst
index a9fc774bc400..9a734bf54b76 100644
--- a/Documentation/RCU/rculist_nulls.rst
+++ b/Documentation/RCU/rculist_nulls.rst
@@ -8,25 +8,25 @@ This section describes how to use hlist_nulls to
protect read-mostly linked lists and
objects using SLAB_TYPESAFE_BY_RCU allocations.
-Please read the basics in Documentation/RCU/listRCU.rst
+Please read the basics in listRCU.rst.
Using 'nulls'
=============
Using special markers (called 'nulls') is a convenient way
-to solve following problem :
+to solve the following problem.
-A typical RCU linked list managing objects which are
-allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can
-use following algos :
+Without 'nulls', a typical RCU linked list managing objects which are
+allocated with SLAB_TYPESAFE_BY_RCU kmem_cache can use the following
+algorithms:
-1) Lookup algo
---------------
+1) Lookup algorithm
+-------------------
::
- rcu_read_lock()
begin:
+ rcu_read_lock()
obj = lockless_lookup(key);
if (obj) {
if (!try_get_ref(obj)) // might fail for free objects
@@ -38,6 +38,7 @@ use following algos :
*/
if (obj->key != key) { // not the object we expected
put_ref(obj);
+ rcu_read_unlock();
goto begin;
}
}
@@ -52,9 +53,9 @@ but a version with an additional memory barrier (smp_rmb())
{
struct hlist_node *node, *next;
for (pos = rcu_dereference((head)->first);
- pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
- ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
- pos = rcu_dereference(next))
+ pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
+ ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
+ pos = rcu_dereference(next))
if (obj->key == key)
return obj;
return NULL;
@@ -64,9 +65,9 @@ And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb()::
struct hlist_node *node;
for (pos = rcu_dereference((head)->first);
- pos && ({ prefetch(pos->next); 1; }) &&
- ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
- pos = rcu_dereference(pos->next))
+ pos && ({ prefetch(pos->next); 1; }) &&
+ ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
+ pos = rcu_dereference(pos->next))
if (obj->key == key)
return obj;
return NULL;
@@ -82,36 +83,32 @@ Quoting Corey Minyard::
solved by pre-fetching the "next" field (with proper barriers) before
checking the key."
-2) Insert algo
---------------
+2) Insertion algorithm
+----------------------
We need to make sure a reader cannot read the new 'obj->obj_next' value
-and previous value of 'obj->key'. Or else, an item could be deleted
+and previous value of 'obj->key'. Otherwise, an item could be deleted
from a chain, and inserted into another chain. If the new chain was empty
-before the move, 'next' pointer is NULL, and lockless reader can
-not detect it missed following items in original chain.
+before the move, the 'next' pointer is NULL, and the lockless reader
+cannot detect that it missed the following items in the original chain.
::
/*
- * Please note that new inserts are done at the head of list,
- * not in the middle or end.
- */
+ * Please note that new inserts are done at the head of list,
+ * not in the middle or end.
+ */
obj = kmem_cache_alloc(...);
lock_chain(); // typically a spin_lock()
obj->key = key;
- /*
- * we need to make sure obj->key is updated before obj->next
- * or obj->refcnt
- */
- smp_wmb();
- atomic_set(&obj->refcnt, 1);
+ atomic_set_release(&obj->refcnt, 1); // key before refcnt
hlist_add_head_rcu(&obj->obj_node, list);
unlock_chain(); // typically a spin_unlock()
-3) Remove algo
---------------
+3) Removal algorithm
+--------------------
+
Nothing special here, we can use a standard RCU hlist deletion.
But thanks to SLAB_TYPESAFE_BY_RCU, beware a deleted object can be reused
very very fast (before the end of RCU grace period)
@@ -133,7 +130,7 @@ Avoiding extra smp_rmb()
========================
With hlist_nulls we can avoid an extra smp_rmb() in lockless_lookup()
-and extra smp_wmb() in insert function.
+and an extra _release() in the insert function.
For example, if we choose to store the slot number as the 'nulls'
end-of-list marker for each slot of the hash table, we can detect
@@ -142,59 +139,61 @@ to another chain) checking the final 'nulls' value if
the lookup met the end of the chain. If the final 'nulls' value
is not the slot number, then we must restart the lookup at
the beginning. If the object was moved to the same chain,
-then the reader doesn't care : It might eventually
+then the reader doesn't care: it might occasionally
scan the list again without harm.
-1) lookup algo
---------------
+1) lookup algorithm
+-------------------
::
head = &table[slot];
- rcu_read_lock();
begin:
+ rcu_read_lock();
hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
if (obj->key == key) {
- if (!try_get_ref(obj)) // might fail for free objects
+ if (!try_get_ref(obj)) { // might fail for free objects
+ rcu_read_unlock();
goto begin;
+ }
if (obj->key != key) { // not the object we expected
put_ref(obj);
+ rcu_read_unlock();
goto begin;
}
- goto out;
+ goto out;
+ }
+ }
+
+ // If the nulls value we got at the end of this lookup is
+ // not the expected one, we must restart lookup.
+ // We probably met an item that was moved to another chain.
+  if (get_nulls_value(node) != slot) {
+    // No reference was taken on this code path, so there
+    // is nothing to release before restarting the lookup.
+    rcu_read_unlock();
+    goto begin;
+  }
- /*
- * if the nulls value we got at the end of this lookup is
- * not the expected one, we must restart lookup.
- * We probably met an item that was moved to another chain.
- */
- if (get_nulls_value(node) != slot)
- goto begin;
obj = NULL;
out:
rcu_read_unlock();
-2) Insert function
-------------------
+2) Insert algorithm
+-------------------
::
/*
- * Please note that new inserts are done at the head of list,
- * not in the middle or end.
- */
+ * Please note that new inserts are done at the head of list,
+ * not in the middle or end.
+ */
obj = kmem_cache_alloc(cachep);
lock_chain(); // typically a spin_lock()
obj->key = key;
+ atomic_set_release(&obj->refcnt, 1); // key before refcnt
/*
- * changes to obj->key must be visible before refcnt one
- */
- smp_wmb();
- atomic_set(&obj->refcnt, 1);
- /*
- * insert obj in RCU way (readers might be traversing chain)
- */
+ * insert obj in RCU way (readers might be traversing chain)
+ */
hlist_nulls_add_head_rcu(&obj->obj_node, list);
unlock_chain(); // typically a spin_unlock()
diff --git a/Documentation/RCU/stallwarn.rst b/Documentation/RCU/stallwarn.rst
index 5036df24ae61..ca7b7cd806a1 100644
--- a/Documentation/RCU/stallwarn.rst
+++ b/Documentation/RCU/stallwarn.rst
@@ -25,10 +25,10 @@ warnings:
- A CPU looping with bottom halves disabled.
-- For !CONFIG_PREEMPTION kernels, a CPU looping anywhere in the kernel
- without invoking schedule(). If the looping in the kernel is
- really expected and desirable behavior, you might need to add
- some calls to cond_resched().
+- For !CONFIG_PREEMPTION kernels, a CPU looping anywhere in the
+ kernel without potentially invoking schedule(). If the looping
+ in the kernel is really expected and desirable behavior, you
+ might need to add some calls to cond_resched().
- Booting Linux using a console connection that is too slow to
keep up with the boot-time console-message rate. For example,
@@ -96,18 +96,29 @@ warnings:
the ``rcu_.*timer wakeup didn't happen for`` console-log message,
which will include additional debugging information.
+- A low-level kernel issue that either fails to invoke one of the
+ variants of rcu_eqs_enter(true), rcu_eqs_exit(true), ct_idle_enter(),
+ ct_idle_exit(), ct_irq_enter(), or ct_irq_exit() on the one
+ hand, or that invokes one of them too many times on the other.
+ Historically, the most frequent issue has been an omission
+ of either irq_enter() or irq_exit(), which in turn invoke
+ ct_irq_enter() or ct_irq_exit(), respectively. Building your
+ kernel with CONFIG_RCU_EQS_DEBUG=y can help track down these types
+ of issues, which sometimes arise in architecture-specific code.
+
- A bug in the RCU implementation.
-- A hardware failure. This is quite unlikely, but has occurred
- at least once in real life. A CPU failed in a running system,
- becoming unresponsive, but not causing an immediate crash.
- This resulted in a series of RCU CPU stall warnings, eventually
- leading the realization that the CPU had failed.
+- A hardware failure. This is quite unlikely for any given CPU, but
+  is not at all uncommon across a large datacenter. In one memorable
+  case some decades back, a CPU failed in a running system, becoming
+  unresponsive, but not causing an immediate crash. This resulted
+  in a series of RCU CPU stall warnings, eventually leading to the
+  realization that the CPU had failed.
-The RCU, RCU-sched, and RCU-tasks implementations have CPU stall warning.
-Note that SRCU does *not* have CPU stall warnings. Please note that
-RCU only detects CPU stalls when there is a grace period in progress.
-No grace period, no CPU stall warnings.
+The RCU, RCU-sched, RCU-tasks, and RCU-tasks-trace implementations have
+CPU stall warnings. Note that SRCU does *not* have CPU stall warnings.
+Please note that RCU only detects CPU stalls when there is a grace period
+in progress. No grace period, no CPU stall warnings.
To diagnose the cause of the stall, inspect the stack traces.
The offending function will usually be near the top of the stack.
@@ -152,6 +163,26 @@ CONFIG_RCU_CPU_STALL_TIMEOUT
Stall-warning messages may be enabled and disabled completely via
/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress.
+CONFIG_RCU_EXP_CPU_STALL_TIMEOUT
+--------------------------------
+
+ Same as the CONFIG_RCU_CPU_STALL_TIMEOUT parameter but only for
+ the expedited grace period. This parameter defines the period
+ of time that RCU will wait from the beginning of an expedited
+ grace period until it issues an RCU CPU stall warning. This time
+ period is normally 20 milliseconds on Android devices. A zero
+ value causes the CONFIG_RCU_CPU_STALL_TIMEOUT value to be used,
+ after conversion to milliseconds.
+
+ This configuration parameter may be changed at runtime via the
+	/sys/module/rcupdate/parameters/rcu_exp_cpu_stall_timeout; however,
+ this parameter is checked only at the beginning of a cycle. If you
+ are in a current stall cycle, setting it to a new value will change
+ the timeout for the -next- stall.
+
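+	For example, the following command (run as root) would set the
+	expedited stall timeout to 30 milliseconds, the value 30 being
+	chosen purely for illustration::
+
+	  echo 30 > /sys/module/rcupdate/parameters/rcu_exp_cpu_stall_timeout
+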
+ Stall-warning messages may be enabled and disabled completely via
+ /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress.
+
RCU_STALL_DELAY_DELTA
---------------------
@@ -175,16 +206,21 @@ RCU_STALL_RAT_DELAY
rcupdate.rcu_task_stall_timeout
-------------------------------
- This boot/sysfs parameter controls the RCU-tasks stall warning
- interval. A value of zero or less suppresses RCU-tasks stall
- warnings. A positive value sets the stall-warning interval
- in seconds. An RCU-tasks stall warning starts with the line:
+ This boot/sysfs parameter controls the RCU-tasks and
+ RCU-tasks-trace stall warning intervals. A value of zero or less
+ suppresses RCU-tasks stall warnings. A positive value sets the
+ stall-warning interval in seconds. An RCU-tasks stall warning
+ starts with the line:
INFO: rcu_tasks detected stalls on tasks:
And continues with the output of sched_show_task() for each
task stalling the current RCU-tasks grace period.
+ An RCU-tasks-trace stall warning starts (and continues) similarly:
+
+ INFO: rcu_tasks_trace detected stalls on tasks
+
Interpreting RCU's CPU Stall-Detector "Splats"
==============================================
@@ -218,7 +254,8 @@ dynticks counter, which will have an even-numbered value if the CPU
is in dyntick-idle mode and an odd-numbered value otherwise. The hex
number between the two "/"s is the value of the nesting, which will be
a small non-negative number if in the idle loop (as shown above) and a
-very large positive number otherwise.
+very large positive number otherwise. The number following the final
+"/" is the NMI nesting, which will be a small non-negative number.
The "softirq=" portion of the message tracks the number of RCU softirq
handlers that the stalled CPU has executed. The number before the "/"
@@ -244,17 +281,6 @@ period (in this case 2603), the grace-period sequence number (7075), and
an estimate of the total number of RCU callbacks queued across all CPUs
(625 in this case).
-In kernels with CONFIG_RCU_FAST_NO_HZ, more information is printed
-for each CPU::
-
- 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 dyntick_enabled: 1
-
-The "last_accelerate:" prints the low-order 16 bits (in hex) of the
-jiffies counter when this CPU last invoked rcu_try_advance_all_cbs()
-from rcu_needs_cpu() or last invoked rcu_accelerate_cbs() from
-rcu_prepare_for_idle(). "dyntick_enabled: 1" indicates that dyntick-idle
-processing is enabled.
-
If the grace period ends just as the stall warning starts printing,
there will be a spurious stall-warning message, which will include
the following::
@@ -364,3 +390,95 @@ for example, "P3421".
It is entirely possible to see stall warnings from normal and from
expedited grace periods at about the same time during the same run.
+
+RCU_CPU_STALL_CPUTIME
+=====================
+
+In kernels built with CONFIG_RCU_CPU_STALL_CPUTIME=y or booted with
+rcupdate.rcu_cpu_stall_cputime=1, the following additional information
+is supplied with each RCU CPU stall warning::
+
+ rcu: hardirqs softirqs csw/system
+ rcu: number: 624 45 0
+ rcu: cputime: 69 1 2425 ==> 2500(ms)
+
+These statistics are collected during the sampling period. The values
+in row "number:" are the number of hard interrupts, number of soft
+interrupts, and number of context switches on the stalled CPU. The
+first three values in row "cputime:" indicate the CPU time in
+milliseconds consumed by hard interrupts, soft interrupts, and tasks
+on the stalled CPU. The last number is the measurement interval, again
+in milliseconds. Because user-mode tasks normally do not cause RCU CPU
+stalls, these tasks are typically kernel tasks, which is why only the
+system CPU time is considered.
+
+The sampling period is shown as follows::
+
+ |<------------first timeout---------->|<-----second timeout----->|
+ |<--half timeout-->|<--half timeout-->| |
+ | |<--first period-->| |
+ | |<-----------second sampling period---------->|
+ | | | |
+ snapshot time point 1st-stall 2nd-stall
+
+The following describes four typical scenarios:
+
+1. A CPU looping with interrupts disabled.
+
+ ::
+
+ rcu: hardirqs softirqs csw/system
+ rcu: number: 0 0 0
+ rcu: cputime: 0 0 0 ==> 2500(ms)
+
+ Because interrupts have been disabled throughout the measurement
+ interval, there are no interrupts and no context switches.
+ Furthermore, because CPU time consumption was measured using interrupt
+ handlers, the system CPU consumption is misleadingly measured as zero.
+ This scenario will normally also have "(0 ticks this GP)" printed on
+ this CPU's summary line.
+
+2. A CPU looping with bottom halves disabled.
+
+   This is similar to the previous example, but with a non-zero number
+   of hard interrupts, non-zero CPU time consumed by them, and non-zero
+   CPU time consumed by in-kernel execution::
+
+ rcu: hardirqs softirqs csw/system
+ rcu: number: 624 0 0
+ rcu: cputime: 49 0 2446 ==> 2500(ms)
+
+ The fact that there are zero softirqs gives a hint that these were
+ disabled, perhaps via local_bh_disable(). It is of course possible
+ that there were no softirqs, perhaps because all events that would
+ result in softirq execution are confined to other CPUs. In this case,
+ the diagnosis should continue as shown in the next example.
+
+3. A CPU looping with preemption disabled.
+
+ Here, only the number of context switches is zero::
+
+ rcu: hardirqs softirqs csw/system
+ rcu: number: 624 45 0
+ rcu: cputime: 69 1 2425 ==> 2500(ms)
+
+ This situation hints that the stalled CPU was looping with preemption
+ disabled.
+
+4. No looping, but massive hard and soft interrupts.
+
+ ::
+
+ rcu: hardirqs softirqs csw/system
+ rcu: number: xx xx 0
+ rcu: cputime: xx xx 0 ==> 2500(ms)
+
+   Here, the number and CPU time of hard interrupts are both non-zero,
+ but the number of context switches and the in-kernel CPU time consumed
+ are zero. The number and cputime of soft interrupts will usually be
+ non-zero, but could be zero, for example, if the CPU was spinning
+ within a single hard interrupt handler.
+
+ If this type of RCU CPU stall warning can be reproduced, you can
+ narrow it down by looking at /proc/interrupts or by writing code to
+ trace each interrupt, for example, by referring to show_interrupts().
diff --git a/Documentation/RCU/torture.rst b/Documentation/RCU/torture.rst
index a90147713062..b3b6dfa85248 100644
--- a/Documentation/RCU/torture.rst
+++ b/Documentation/RCU/torture.rst
@@ -206,23 +206,34 @@ values for memory may require disabling the callback-flooding tests
using the --bootargs parameter discussed below.
Sometimes additional debugging is useful, and in such cases the --kconfig
-parameter to kvm.sh may be used, for example, ``--kconfig 'CONFIG_KASAN=y'``.
+parameter to kvm.sh may be used, for example, ``--kconfig 'CONFIG_RCU_EQS_DEBUG=y'``.
+In addition, there are the --gdb, --kasan, and --kcsan parameters.
+Note that --gdb limits you to one scenario per kvm.sh run and requires
+that you have another window open from which to run ``gdb`` as instructed
+by the script.
Kernel boot arguments can also be supplied, for example, to control
rcutorture's module parameters. For example, to test a change to RCU's
CPU stall-warning code, use "--bootargs 'rcutorture.stall_cpu=30'".
This will of course result in the scripting reporting a failure, namely
-the resuling RCU CPU stall warning. As noted above, reducing memory may
+the resulting RCU CPU stall warning. As noted above, reducing memory may
require disabling rcutorture's callback-flooding tests::
kvm.sh --cpus 448 --configs '56*TREE04' --memory 128M \
--bootargs 'rcutorture.fwd_progress=0'
Sometimes all that is needed is a full set of kernel builds. This is
-what the --buildonly argument does.
+what the --buildonly parameter does.
-Finally, the --trust-make argument allows each kernel build to reuse what
-it can from the previous kernel build.
+The --duration parameter can override the default run time of 30 minutes.
+For example, ``--duration 2d`` would run for two days, ``--duration 3h``
+would run for three hours, ``--duration 5m`` would run for five minutes,
+and ``--duration 45s`` would run for 45 seconds. This last can be useful
+for tracking down rare boot-time failures.
+
+Finally, the --trust-make parameter allows each kernel build to reuse what
+it can from the previous kernel build. Please note that without the
+--trust-make parameter, your tags files may be demolished.
There are additional more arcane arguments that are documented in the
source code of the kvm.sh script.
@@ -291,3 +302,73 @@ the following summary at the end of the run on a 12-CPU system::
TREE07 ------- 167347 GPs (30.9902/s) [rcu: g1079021 f0x0 ] n_max_cbs: 478732
CPU count limited from 16 to 12
TREE09 ------- 752238 GPs (139.303/s) [rcu: g13075057 f0x0 ] n_max_cbs: 99011
+
+
+Repeated Runs
+=============
+
+Suppose that you are chasing down a rare boot-time failure. Although you
+could use kvm.sh, doing so will rebuild the kernel on each run. If you
+need (say) 1,000 runs to have confidence that you have fixed the bug,
+these pointless rebuilds can become extremely annoying.
+
+This is why kvm-again.sh exists.
+
+Suppose that a previous kvm.sh run left its output in this directory::
+
+ tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28
+
+Then this run can be re-run without rebuilding as follows::
+
+ kvm-again.sh tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28
+
+A few of the original run's kvm.sh parameters may be overridden, perhaps
+most notably --duration and --bootargs. For example::
+
+ kvm-again.sh tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28 \
+ --duration 45s
+
+would re-run the previous test, but for only 45 seconds, thus facilitating
+tracking down the aforementioned rare boot-time failure.
+
+
+Distributed Runs
+================
+
+Although kvm.sh is quite useful, its testing is confined to a single
+system. It is not all that hard to use your favorite framework to cause
+(say) 5 instances of kvm.sh to run on your 5 systems, but this will very
+likely unnecessarily rebuild kernels. In addition, manually distributing
+the desired rcutorture scenarios across the available systems can be
+painstaking and error-prone.
+
+And this is why the kvm-remote.sh script exists.
+
+If the following command works::
+
+ ssh system0 date
+
+and if it also works for system1, system2, system3, system4, and system5,
+and all of these systems have 64 CPUs, you can type::
+
+ kvm-remote.sh "system0 system1 system2 system3 system4 system5" \
+ --cpus 64 --duration 8h --configs "5*CFLIST"
+
+This will build each default scenario's kernel on the local system, then
+spread each of five instances of each scenario over the systems listed,
+running each scenario for eight hours. At the end of the runs, the
+results will be gathered, recorded, and printed. Most of the parameters
+that kvm.sh will accept can be passed to kvm-remote.sh, but the list of
+systems must come first.
+
+The kvm.sh ``--dryrun scenarios`` argument is useful for working out
+how many scenarios may be run in one batch across a group of systems.
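+
+For example, the following invocation (CPU count and scenario list
+chosen purely for illustration) shows how five instances of the default
+scenarios would be batched across 64-CPU systems, without actually
+running the tests::
+
+  kvm.sh --cpus 64 --configs "5*CFLIST" --dryrun scenarios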
+
+You can also re-run a previous remote run in a manner similar to kvm.sh::
+
+ kvm-remote.sh "system0 system1 system2 system3 system4 system5" \
+ tools/testing/selftests/rcutorture/res/2022.11.03-11.26.28-remote \
+ --duration 24h
+
+In this case, most of the kvm-again.sh parameters may be supplied following
+the pathname of the old run-results directory.
diff --git a/Documentation/RCU/whatisRCU.rst b/Documentation/RCU/whatisRCU.rst
index 17e95ab2a201..8eddef28d3a1 100644
--- a/Documentation/RCU/whatisRCU.rst
+++ b/Documentation/RCU/whatisRCU.rst
@@ -6,26 +6,33 @@ What is RCU? -- "Read, Copy, Update"
Please note that the "What is RCU?" LWN series is an excellent place
to start learning about RCU:
-| 1. What is RCU, Fundamentally? http://lwn.net/Articles/262464/
-| 2. What is RCU? Part 2: Usage http://lwn.net/Articles/263130/
-| 3. RCU part 3: the RCU API http://lwn.net/Articles/264090/
-| 4. The RCU API, 2010 Edition http://lwn.net/Articles/418853/
-| 2010 Big API Table http://lwn.net/Articles/419086/
-| 5. The RCU API, 2014 Edition http://lwn.net/Articles/609904/
-| 2014 Big API Table http://lwn.net/Articles/609973/
+| 1. What is RCU, Fundamentally? https://lwn.net/Articles/262464/
+| 2. What is RCU? Part 2: Usage https://lwn.net/Articles/263130/
+| 3. RCU part 3: the RCU API https://lwn.net/Articles/264090/
+| 4. The RCU API, 2010 Edition https://lwn.net/Articles/418853/
+| 2010 Big API Table https://lwn.net/Articles/419086/
+| 5. The RCU API, 2014 Edition https://lwn.net/Articles/609904/
+| 2014 Big API Table https://lwn.net/Articles/609973/
+| 6. The RCU API, 2019 Edition https://lwn.net/Articles/777036/
+| 2019 Big API Table https://lwn.net/Articles/777165/
+
+For those preferring video:
+
+| 1. Unraveling RCU Mysteries: Fundamentals https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries
+| 2. Unraveling RCU Mysteries: Additional Use Cases https://www.linuxfoundation.org/webinars/unraveling-rcu-usage-mysteries-additional-use-cases
What is RCU?
RCU is a synchronization mechanism that was added to the Linux kernel
during the 2.5 development effort that is optimized for read-mostly
-situations. Although RCU is actually quite simple once you understand it,
-getting there can sometimes be a challenge. Part of the problem is that
-most of the past descriptions of RCU have been written with the mistaken
-assumption that there is "one true way" to describe RCU. Instead,
-the experience has been that different people must take different paths
-to arrive at an understanding of RCU. This document provides several
-different paths, as follows:
+situations. Although RCU is actually quite simple, making effective use
+of it requires you to think differently about your code. Part of the
+problem is the mistaken assumption that there is "one true way" to
+describe and to use RCU. Instead, the experience has been that different
+people must take different paths to arrive at an understanding of RCU,
+depending on their experiences and use cases. This document provides
+several different paths, as follows:
:ref:`1. RCU OVERVIEW <1_whatisRCU>`
@@ -39,9 +46,11 @@ different paths, as follows:
:ref:`6. ANALOGY WITH READER-WRITER LOCKING <6_whatisRCU>`
-:ref:`7. FULL LIST OF RCU APIs <7_whatisRCU>`
+:ref:`7. ANALOGY WITH REFERENCE COUNTING <7_whatisRCU>`
+
+:ref:`8. FULL LIST OF RCU APIs <8_whatisRCU>`
-:ref:`8. ANSWERS TO QUICK QUIZZES <8_whatisRCU>`
+:ref:`9. ANSWERS TO QUICK QUIZZES <9_whatisRCU>`
People who prefer starting with a conceptual overview should focus on
Section 1, though most readers will profit by reading this section at
@@ -153,34 +162,36 @@ rcu_read_lock()
^^^^^^^^^^^^^^^
void rcu_read_lock(void);
- Used by a reader to inform the reclaimer that the reader is
- entering an RCU read-side critical section. It is illegal
- to block while in an RCU read-side critical section, though
- kernels built with CONFIG_PREEMPT_RCU can preempt RCU
- read-side critical sections. Any RCU-protected data structure
- accessed during an RCU read-side critical section is guaranteed to
- remain unreclaimed for the full duration of that critical section.
- Reference counts may be used in conjunction with RCU to maintain
- longer-term references to data structures.
+ This temporal primitive is used by a reader to inform the
+ reclaimer that the reader is entering an RCU read-side critical
+ section. It is illegal to block while in an RCU read-side
+ critical section, though kernels built with CONFIG_PREEMPT_RCU
+ can preempt RCU read-side critical sections. Any RCU-protected
+ data structure accessed during an RCU read-side critical section
+ is guaranteed to remain unreclaimed for the full duration of that
+ critical section. Reference counts may be used in conjunction
+ with RCU to maintain longer-term references to data structures.
rcu_read_unlock()
^^^^^^^^^^^^^^^^^
void rcu_read_unlock(void);
- Used by a reader to inform the reclaimer that the reader is
- exiting an RCU read-side critical section. Note that RCU
- read-side critical sections may be nested and/or overlapping.
+	This temporal primitive is used by a reader to inform the
+ reclaimer that the reader is exiting an RCU read-side critical
+ section. Note that RCU read-side critical sections may be nested
+ and/or overlapping.
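
	For example, read-side critical sections may nest as follows,
	with each rcu_read_unlock() matching its rcu_read_lock()::

		rcu_read_lock();   /* Outer critical section. */
		rcu_read_lock();   /* Nested critical section... */
		rcu_read_unlock(); /* ...ends here... */
		rcu_read_unlock(); /* ...and the outer one here. */
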
synchronize_rcu()
^^^^^^^^^^^^^^^^^
void synchronize_rcu(void);
- Marks the end of updater code and the beginning of reclaimer
- code. It does this by blocking until all pre-existing RCU
- read-side critical sections on all CPUs have completed.
- Note that synchronize_rcu() will **not** necessarily wait for
- any subsequent RCU read-side critical sections to complete.
- For example, consider the following sequence of events::
+ This temporal primitive marks the end of updater code and the
+ beginning of reclaimer code. It does this by blocking until
+ all pre-existing RCU read-side critical sections on all CPUs
+ have completed. Note that synchronize_rcu() will **not**
+ necessarily wait for any subsequent RCU read-side critical
+ sections to complete. For example, consider the following
+ sequence of events::
CPU 0 CPU 1 CPU 2
----------------- ------------------------- ---------------
@@ -207,13 +218,13 @@ synchronize_rcu()
to be useful in all but the most read-intensive situations,
synchronize_rcu()'s overhead must also be quite small.
- The call_rcu() API is a callback form of synchronize_rcu(),
- and is described in more detail in a later section. Instead of
- blocking, it registers a function and argument which are invoked
- after all ongoing RCU read-side critical sections have completed.
- This callback variant is particularly useful in situations where
- it is illegal to block or where update-side performance is
- critically important.
+ The call_rcu() API is an asynchronous callback form of
+ synchronize_rcu(), and is described in more detail in a later
+ section. Instead of blocking, it registers a function and
+ argument which are invoked after all ongoing RCU read-side
+ critical sections have completed. This callback variant is
+ particularly useful in situations where it is illegal to block
+ or where update-side performance is critically important.
However, the call_rcu() API should not be used lightly, as use
of the synchronize_rcu() API generally results in simpler code.
@@ -222,7 +233,7 @@ synchronize_rcu()
be delayed. This property results in system resilience in face
of denial-of-service attacks. Code using call_rcu() should limit
update rate in order to gain this same sort of resilience. See
- checklist.txt for some approaches to limiting the update rate.
+ checklist.rst for some approaches to limiting the update rate.
rcu_assign_pointer()
^^^^^^^^^^^^^^^^^^^^
@@ -232,11 +243,13 @@ rcu_assign_pointer()
would be cool to be able to declare a function in this manner.
(Compiler experts will no doubt disagree.)
- The updater uses this function to assign a new value to an
+ The updater uses this spatial macro to assign a new value to an
RCU-protected pointer, in order to safely communicate the change
- in value from the updater to the reader. This macro does not
- evaluate to an rvalue, but it does execute any memory-barrier
- instructions required for a given CPU architecture.
+ in value from the updater to the reader. This is a spatial (as
+ opposed to temporal) macro. It does not evaluate to an rvalue,
+ but it does execute any memory-barrier instructions required
+	for a given CPU architecture. Its ordering properties are those
+ of a store-release operation.
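
	In other words, leaving aside rcu_assign_pointer()'s
	type-checking and self-documentation benefits, and leaving aside
	special cases such as assignment of NULL, the following (in
	which gp and new_gp are hypothetical)::

		rcu_assign_pointer(gp, new_gp);

	is roughly equivalent to this store-release operation::

		smp_store_release(&gp, new_gp);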
Perhaps just as important, it serves to document (1) which
pointers are protected by RCU and (2) the point at which a
@@ -251,14 +264,15 @@ rcu_dereference()
Like rcu_assign_pointer(), rcu_dereference() must be implemented
as a macro.
- The reader uses rcu_dereference() to fetch an RCU-protected
- pointer, which returns a value that may then be safely
- dereferenced. Note that rcu_dereference() does not actually
- dereference the pointer, instead, it protects the pointer for
- later dereferencing. It also executes any needed memory-barrier
- instructions for a given CPU architecture. Currently, only Alpha
- needs memory barriers within rcu_dereference() -- on other CPUs,
- it compiles to nothing, not even a compiler directive.
+ The reader uses the spatial rcu_dereference() macro to fetch
+ an RCU-protected pointer, which returns a value that may
+ then be safely dereferenced. Note that rcu_dereference()
+	does not actually dereference the pointer; instead, it
+ protects the pointer for later dereferencing. It also
+ executes any needed memory-barrier instructions for a given
+ CPU architecture. Currently, only Alpha needs memory barriers
+ within rcu_dereference() -- on other CPUs, it compiles to a
+ volatile load.
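
	For example, a reader might do the following, in which gp,
	the ->a field, and do_something_with() are hypothetical, and
	in which rcu_read_lock() and rcu_read_unlock() delimit the RCU
	read-side critical section::

		rcu_read_lock();
		p = rcu_dereference(gp);
		if (p)
			do_something_with(p->a); /* p remains valid here. */
		rcu_read_unlock();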
Common coding practice uses rcu_dereference() to copy an
RCU-protected pointer to a local variable, then dereferences
@@ -316,7 +330,7 @@ rcu_dereference()
must prohibit. The rcu_dereference_protected() variant takes
a lockdep expression to indicate which locks must be acquired
by the caller. If the indicated protection is not provided,
- a lockdep splat is emitted. See Documentation/RCU/Design/Requirements/Requirements.rst
+ a lockdep splat is emitted. See Design/Requirements/Requirements.rst
and the API's code comments for more details and example usage.
.. [2] If the list_for_each_entry_rcu() instance might be used by
@@ -351,12 +365,15 @@ reader, updater, and reclaimer.
synchronize_rcu() & call_rcu()
-The RCU infrastructure observes the time sequence of rcu_read_lock(),
+The RCU infrastructure observes the temporal sequence of rcu_read_lock(),
rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
order to determine when (1) synchronize_rcu() invocations may return
to their callers and (2) call_rcu() callbacks may be invoked. Efficient
implementations of the RCU infrastructure make heavy use of batching in
order to amortize their overhead over many uses of the corresponding APIs.
+The rcu_assign_pointer() and rcu_dereference() invocations communicate
+spatial changes via stores to and loads from the RCU-protected pointer in
+question.
There are at least three flavors of RCU usage in the Linux kernel. The diagram
above shows the most common one. On the updater side, the rcu_assign_pointer(),
@@ -388,7 +405,9 @@ b. RCU applied to networking data structures that may be subjected
c. RCU applied to scheduler and interrupt/NMI-handler tasks.
Again, most uses will be of (a). The (b) and (c) cases are important
-for specialized uses, but are relatively uncommon.
+for specialized uses, but are relatively uncommon. The SRCU, RCU-Tasks,
+RCU-Tasks-Rude, and RCU-Tasks-Trace flavors have similar relationships
+among their assorted primitives.
.. _3_whatisRCU:
@@ -397,8 +416,7 @@ for specialized uses, but are relatively uncommon.
This section shows a simple use of the core RCU API to protect a
global pointer to a dynamically allocated structure. More-typical
-uses of RCU may be found in :ref:`listRCU.rst <list_rcu_doc>`,
-:ref:`arrayRCU.rst <array_rcu_doc>`, and :ref:`NMI-RCU.rst <NMI_rcu_doc>`.
+uses of RCU may be found in listRCU.rst, arrayRCU.rst, and NMI-RCU.rst.
::
struct foo {
@@ -465,7 +483,7 @@ So, to sum up:
- Within an RCU read-side critical section, use rcu_dereference()
to dereference RCU-protected pointers.
-- Use some solid scheme (such as locks or semaphores) to
+- Use some solid design (such as locks or semaphores) to
keep concurrent updates from interfering with each other.
- Use rcu_assign_pointer() to update an RCU-protected pointer.
@@ -480,10 +498,9 @@ So, to sum up:
RCU read-side critical sections that might be referencing that
data item.
-See checklist.txt for additional rules to follow when using RCU.
-And again, more-typical uses of RCU may be found in :ref:`listRCU.rst
-<list_rcu_doc>`, :ref:`arrayRCU.rst <array_rcu_doc>`, and :ref:`NMI-RCU.rst
-<NMI_rcu_doc>`.
+See checklist.rst for additional rules to follow when using RCU.
+And again, more-typical uses of RCU may be found in listRCU.rst,
+arrayRCU.rst, and NMI-RCU.rst.
.. _4_whatisRCU:
@@ -577,7 +594,15 @@ to avoid having to write your own callback::
kfree_rcu(old_fp, rcu);
-Again, see checklist.txt for additional rules governing the use of RCU.
+If the occasional sleep is permitted, the single-argument form may
+be used, omitting the rcu_head structure from struct foo::
+
+ kfree_rcu_mightsleep(old_fp);
+
+This variant almost never blocks, but might do so by invoking
+synchronize_rcu() in response to memory-allocation failure.
+
+Again, see checklist.rst for additional rules governing the use of RCU.
.. _5_whatisRCU:
@@ -594,7 +619,7 @@ lacking both functionality and performance. However, they are useful
in getting a feel for how RCU works. See kernel/rcu/update.c for a
production-quality implementation, and see:
- http://www.rdrop.com/users/paulmck/RCU
+ https://docs.google.com/document/d/1X0lThx8OK0ZgLMqVoXiR4ZrGURHrXK6NyLRbeXe3Xac/edit
for papers describing the Linux kernel RCU implementation. The OLS'01
and OLS'02 papers are a good introduction, and the dissertation provides
@@ -661,7 +686,7 @@ been able to write-acquire the lock otherwise. The smp_mb__after_spinlock()
promotes synchronize_rcu() to a full memory barrier in compliance with
the "Memory-Barrier Guarantees" listed in:
- Documentation/RCU/Design/Requirements/Requirements.rst
+ Design/Requirements/Requirements.rst
It is possible to nest rcu_read_lock(), since reader-writer locks may
be recursively acquired. Note also that rcu_read_lock() is immune
@@ -677,7 +702,7 @@ Quick Quiz #1:
occur when using this algorithm in a real-world Linux
kernel? How could this deadlock be avoided?
-:ref:`Answers to Quick Quiz <8_whatisRCU>`
+:ref:`Answers to Quick Quiz <9_whatisRCU>`
5B. "TOY" EXAMPLE #2: CLASSIC RCU
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -732,7 +757,7 @@ Quick Quiz #2:
Give an example where Classic RCU's read-side
overhead is **negative**.
-:ref:`Answers to Quick Quiz <8_whatisRCU>`
+:ref:`Answers to Quick Quiz <9_whatisRCU>`
.. _quiz_3:
@@ -741,7 +766,7 @@ Quick Quiz #3:
critical section, what the heck do you do in
CONFIG_PREEMPT_RT, where normal spinlocks can block???
-:ref:`Answers to Quick Quiz <8_whatisRCU>`
+:ref:`Answers to Quick Quiz <9_whatisRCU>`
.. _6_whatisRCU:
@@ -872,7 +897,86 @@ be used in place of synchronize_rcu().
.. _7_whatisRCU:
-7. FULL LIST OF RCU APIs
+7. ANALOGY WITH REFERENCE COUNTING
+-----------------------------------
+
+The reader-writer analogy (illustrated by the previous section) is not
+always the best way to think about using RCU. Another helpful analogy
+considers RCU an effective reference count on everything which is
+protected by RCU.
+
+A reference count typically does not prevent the referenced object's
+values from changing, but does prevent changes to type -- particularly the
+gross change of type that happens when that object's memory is freed and
+re-allocated for some other purpose. Once a type-safe reference to the
+object is obtained, some other mechanism is needed to ensure consistent
+access to the data in the object. This could involve taking a spinlock,
+but with RCU the typical approach is to perform reads with SMP-aware
+operations such as smp_load_acquire(), to perform updates with atomic
+read-modify-write operations, and to provide the necessary ordering.
+RCU provides a number of support functions that embed the required
+operations and ordering, such as the list_for_each_entry_rcu() macro
+used in the previous section.
+
+A more focused view of the reference counting behavior is that,
+between rcu_read_lock() and rcu_read_unlock(), any reference taken with
+rcu_dereference() on a pointer marked as ``__rcu`` can be treated as
+though a reference-count on that object has been temporarily increased.
+This prevents the object from changing type. Exactly what this means
+will depend on normal expectations of objects of that type, but it
+typically includes that spinlocks can still be safely locked, normal
+reference counters can be safely manipulated, and ``__rcu`` pointers
+can be safely dereferenced.
+
+Some operations that one might expect to see on an object for
+which an RCU reference is held include:
+
+ - Copying out data that is guaranteed to be stable by the object's type.
+ - Using kref_get_unless_zero() or similar to get a longer-term
+ reference. This may fail of course.
+ - Acquiring a spinlock in the object, and checking if the object still
+ is the expected object and if so, manipulating it freely.
+
+The understanding that RCU provides a reference that only prevents a
+change of type is particularly visible with objects allocated from a
+slab cache marked ``SLAB_TYPESAFE_BY_RCU``. RCU operations may yield a
+reference to an object from such a cache that has been concurrently freed
+and the memory reallocated to a completely different object, though of
+the same type. In this case RCU doesn't even protect the identity of the
+object from changing, only its type. So the object found may not be the
+one expected, but it will be one where it is safe to take a reference
+(and then potentially acquire a spinlock), allowing subsequent code
+to check whether the identity matches expectations. It is tempting
+to simply acquire the spinlock without first taking the reference, but
+unfortunately any spinlock in a ``SLAB_TYPESAFE_BY_RCU`` object must be
+initialized after each and every call to kmem_cache_alloc(), which renders
+reference-free spinlock acquisition completely unsafe. Therefore, when
+using ``SLAB_TYPESAFE_BY_RCU``, make proper use of a reference counter.
+(Those willing to use a kmem_cache constructor may also use locking,
+including cache-friendly sequence locking.)
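+
+The resulting pattern might look as follows. This is only a sketch
+with hypothetical field and function names, but it shows the reference
+being taken under rcu_read_lock() and the object's identity being
+re-checked only after its lock is held::
+
+	rcu_read_lock();
+	obj = rcu_dereference(hash_table[slot]); /* hypothetical lookup */
+	if (obj && !kref_get_unless_zero(&obj->ref))
+		obj = NULL; /* Object is being freed. */
+	rcu_read_unlock();
+	if (obj) {
+		spin_lock(&obj->lock);
+		if (obj->key == key) {
+			/* Identity confirmed: use obj freely. */
+		}
+		spin_unlock(&obj->lock);
+		kref_put(&obj->ref, obj_release); /* hypothetical release */
+	}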
+
+With traditional reference counting -- such as that implemented by the
+kref library in Linux -- there is typically code that runs when the last
+reference to an object is dropped. With kref, this is the function
+passed to kref_put(). When RCU is being used, such finalization code
+must not be run until all ``__rcu`` pointers referencing the object have
+been updated, and then a grace period has passed. Every remaining
+globally visible pointer to the object must be considered to be a
+potential counted reference, and the finalization code is typically run
+using call_rcu() only after all those pointers have been changed.
+
+To see how to choose between these two analogies -- of RCU as a
+reader-writer lock and RCU as a reference counting system -- it is useful
+to reflect on the scale of the thing being protected. The reader-writer
+lock analogy looks at larger multi-part objects such as a linked list
+and shows how RCU can facilitate concurrency while elements are added
+to, and removed from, the list. The reference-count analogy looks at
+the individual objects, showing how they can be accessed safely within
+whatever whole they are a part of.
+
+.. _8_whatisRCU:
+
+8. FULL LIST OF RCU APIs
-------------------------
The RCU APIs are documented in docbook-format header comments in the
@@ -968,6 +1072,30 @@ sched::
rcu_read_lock_sched_held
+RCU-Tasks::
+
+ Critical sections Grace period Barrier
+
+ N/A call_rcu_tasks rcu_barrier_tasks
+ synchronize_rcu_tasks
+
+
+RCU-Tasks-Rude::
+
+ Critical sections Grace period Barrier
+
+ N/A call_rcu_tasks_rude rcu_barrier_tasks_rude
+ synchronize_rcu_tasks_rude
+
+
+RCU-Tasks-Trace::
+
+ Critical sections Grace period Barrier
+
+ rcu_read_lock_trace call_rcu_tasks_trace rcu_barrier_tasks_trace
+ rcu_read_unlock_trace synchronize_rcu_tasks_trace
+
+
SRCU::
Critical sections Grace period Barrier
@@ -985,14 +1113,20 @@ SRCU: Initialization/cleanup::
init_srcu_struct
cleanup_srcu_struct
-All: lockdep-checked RCU-protected pointer access::
+All: lockdep-checked RCU utility APIs::
- rcu_access_pointer
- rcu_dereference_raw
RCU_LOCKDEP_WARN
rcu_sleep_check
RCU_NONIDLE
+All: Unchecked RCU-protected pointer access::
+
+ rcu_dereference_raw
+
+All: Unchecked RCU-protected pointer access with dereferencing prohibited::
+
+ rcu_access_pointer
+
See the comment headers in the source code (or the docbook generated
from them) for more information.
@@ -1002,42 +1136,50 @@ list can be helpful:
a. Will readers need to block? If so, you need SRCU.
-b. What about the -rt patchset? If readers would need to block
- in an non-rt kernel, you need SRCU. If readers would block
- in a -rt kernel, but not in a non-rt kernel, SRCU is not
- necessary. (The -rt patchset turns spinlocks into sleeplocks,
- hence this distinction.)
+b. Will readers need to block and are you doing tracing, for
+ example, ftrace or BPF? If so, you need RCU-tasks,
+ RCU-tasks-rude, and/or RCU-tasks-trace.
-c. Do you need to treat NMI handlers, hardirq handlers,
+c. What about the -rt patchset? If readers would need to block in
+   a non-rt kernel, you need SRCU. If readers would block when
+ acquiring spinlocks in a -rt kernel, but not in a non-rt kernel,
+ SRCU is not necessary. (The -rt patchset turns spinlocks into
+ sleeplocks, hence this distinction.)
+
+d. Do you need to treat NMI handlers, hardirq handlers,
and code segments with preemption disabled (whether
via preempt_disable(), local_irq_save(), local_bh_disable(),
or some other mechanism) as if they were explicit RCU readers?
- If so, RCU-sched is the only choice that will work for you.
-
-d. Do you need RCU grace periods to complete even in the face
- of softirq monopolization of one or more of the CPUs? For
- example, is your code subject to network-based denial-of-service
- attacks? If so, you should disable softirq across your readers,
- for example, by using rcu_read_lock_bh().
-
-e. Is your workload too update-intensive for normal use of
+ If so, RCU-sched readers are the only choice that will work
+   for you, but since about v4.20 you can use the vanilla RCU
+ update primitives.
+
+e. Do you need RCU grace periods to complete even in the face of
+ softirq monopolization of one or more of the CPUs? For example,
+ is your code subject to network-based denial-of-service attacks?
+ If so, you should disable softirq across your readers, for
+   example, by using rcu_read_lock_bh(). Since about v4.20 you
+   can use the vanilla RCU update primitives.
+
+f. Is your workload too update-intensive for normal use of
RCU, but inappropriate for other synchronization mechanisms?
If so, consider SLAB_TYPESAFE_BY_RCU (which was originally
named SLAB_DESTROY_BY_RCU). But please be careful!
-f. Do you need read-side critical sections that are respected
- even though they are in the middle of the idle loop, during
- user-mode execution, or on an offlined CPU? If so, SRCU is the
- only choice that will work for you.
+g. Do you need read-side critical sections that are respected even
+ on CPUs that are deep in the idle loop, during entry to or exit
+ from user-mode execution, or on an offlined CPU? If so, SRCU
+ and RCU Tasks Trace are the only choices that will work for you,
+ with SRCU being strongly preferred in almost all cases.
-g. Otherwise, use RCU.
+h. Otherwise, use RCU.
Of course, this all assumes that you have determined that RCU is in fact
the right tool for your job.
-.. _8_whatisRCU:
+.. _9_whatisRCU:
-8. ANSWERS TO QUICK QUIZZES
+9. ANSWERS TO QUICK QUIZZES
----------------------------
Quick Quiz #1:
diff --git a/Documentation/accel/index.rst b/Documentation/accel/index.rst
new file mode 100644
index 000000000000..e94a0160b6a0
--- /dev/null
+++ b/Documentation/accel/index.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Compute Accelerators
+====================
+
+.. toctree::
+ :maxdepth: 1
+
+ introduction
+ qaic/index
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/accel/introduction.rst b/Documentation/accel/introduction.rst
new file mode 100644
index 000000000000..89984dfececf
--- /dev/null
+++ b/Documentation/accel/introduction.rst
@@ -0,0 +1,110 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Introduction
+============
+
+The Linux compute accelerators subsystem is designed to expose compute
+accelerators in a common way to user-space and provide a common set of
+functionality.
+
+These devices can be either stand-alone ASICs or IP blocks inside an SoC/GPU.
+Although these devices are typically designed to accelerate
+Machine-Learning (ML) and/or Deep-Learning (DL) computations, the accel layer
+is not limited to handling these types of accelerators.
+
+Typically, a compute accelerator will belong to one of the following
+categories:
+
+- Edge AI - doing inference at an edge device. It can be an embedded ASIC/FPGA,
+ or an IP inside a SoC (e.g. laptop web camera). These devices
+ are typically configured using registers and can work with or without DMA.
+
+- Inference data-center - single/multi user devices in a large server. This
+ type of device can be stand-alone or an IP inside a SoC or a GPU. It will
+ have on-board DRAM (to hold the DL topology), DMA engines and
+ command submission queues (either kernel or user-space queues).
+ It might also have an MMU to manage multiple users and might also enable
+ virtualization (SR-IOV) to support multiple VMs on the same device. In
+  addition, these devices will usually have some tools, such as a profiler
+  and a debugger.
+
+- Training data-center - Similar to Inference data-center cards, but typically
+ have more computational power and memory b/w (e.g. HBM) and will likely have
+ a method of scaling-up/out, i.e. connecting to other training cards inside
+ the server or in other servers, respectively.
+
+All these devices typically have different runtime user-space software
+stacks that are tailor-made to their hardware. In addition, they will
+also probably include a compiler to generate programs for their
+custom-made computational engines. Typically, the common layer in
+user-space will be the DL frameworks, such as PyTorch and TensorFlow.
+
+Sharing code with DRM
+=====================
+
+Because this type of device can be an IP block inside a GPU or can have
+characteristics similar to those of GPUs, the accel subsystem will use the
+DRM subsystem's code and functionality. That is, the accel core code will
+be part of the DRM subsystem and an accel device will be a new type of DRM
+device.
+
+This will allow us to leverage the extensive DRM code-base and
+collaborate with DRM developers who have experience with this type of
+device. In addition, new features that will be added for the accelerator
+drivers can be of use to GPU drivers as well.
+
+Differentiation from GPUs
+=========================
+
+Because we want to prevent the extensive user-space graphic software stack
+from trying to use an accelerator as a GPU, the compute accelerators will be
+differentiated from GPUs by using a new major number and new device char files.
+
+Furthermore, the drivers will be located in a separate place in the kernel
+tree - drivers/accel/.
+
+The accelerator devices will be exposed to user space with the dedicated
+261 major number and will use the following convention:
+
+- device char files - /dev/accel/accel\*
+- sysfs - /sys/class/accel/accel\*/
+- debugfs - /sys/kernel/debug/accel/\*/
+
+Getting Started
+===============
+
+First, read the DRM documentation at Documentation/gpu/index.rst.
+Not only will it explain how to write a new DRM driver, it also contains
+all the information on how to contribute, the Code of Conduct, and the
+expected coding style and documentation. All of that is the same for the
+accel subsystem.
+
+Second, make sure the kernel is configured with CONFIG_DRM_ACCEL.
+
+To expose your device as an accelerator, two changes need to be made in
+your driver (as opposed to a standard DRM driver):
+
+- Add the DRIVER_COMPUTE_ACCEL feature flag in your drm_driver's
+ driver_features field. It is important to note that this driver feature is
+ mutually exclusive with DRIVER_RENDER and DRIVER_MODESET. Devices that want
+ to expose both graphics and compute device char files should be handled by
+ two drivers that are connected using the auxiliary bus framework.
+
+- Change the open callback in your driver fops structure to accel_open().
+  Alternatively, your driver can use the DEFINE_DRM_ACCEL_FOPS macro to easily
+  set up the correct file operations structure, as sketched below.
+
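+A minimal sketch of these two changes for a hypothetical "foo" driver follows.
+DRIVER_GEM is shown only as a typical companion flag, and the name, desc and
+version fields are placeholders:
+
+.. code-block:: c
+
+    #include <drm/drm_accel.h>
+    #include <drm/drm_drv.h>
+
+    DEFINE_DRM_ACCEL_FOPS(foo_accel_fops);
+
+    static const struct drm_driver foo_accel_driver = {
+        .driver_features = DRIVER_COMPUTE_ACCEL | DRIVER_GEM,
+        .fops = &foo_accel_fops,
+        .name = "foo_accel",
+        .desc = "Example compute accelerator",
+        .major = 1,
+        .minor = 0,
+    };
+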
+External References
+===================
+
+email threads
+-------------
+
+* `Initial discussion on the New subsystem for acceleration devices <https://lkml.org/lkml/2022/7/31/83>`_ - Oded Gabbay (2022)
+* `patch-set to add the new subsystem <https://lkml.org/lkml/2022/10/22/544>`_ - Oded Gabbay (2022)
+
+Conference talks
+----------------
+
+* `LPC 2022 Accelerators BOF outcomes summary <https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html>`_ - Dave Airlie (2022)
diff --git a/Documentation/accel/qaic/aic100.rst b/Documentation/accel/qaic/aic100.rst
new file mode 100644
index 000000000000..c80d0f1307db
--- /dev/null
+++ b/Documentation/accel/qaic/aic100.rst
@@ -0,0 +1,510 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+===============================
+ Qualcomm Cloud AI 100 (AIC100)
+===============================
+
+Overview
+========
+
+The Qualcomm Cloud AI 100/AIC100 family of products (including SA9000P - part of
+Snapdragon Ride) are PCIe adapter cards which contain a dedicated SoC ASIC for
+the purpose of efficiently running Artificial Intelligence (AI) Deep Learning
+inference workloads. They are AI accelerators.
+
+The PCIe interface of AIC100 is capable of PCIe Gen4 speeds over eight lanes
+(x8). An individual SoC on a card can have up to 16 NSPs for running workloads.
+Each SoC has an A53 management CPU. On card, there can be up to 32 GB of DDR.
+
+Multiple AIC100 cards can be hosted in a single system to scale overall
+performance. AIC100 cards are multi-user capable and able to execute workloads
+from multiple users in a concurrent manner.
+
+Hardware Description
+====================
+
+An AIC100 card consists of an AIC100 SoC, on-card DDR, and a set of misc
+peripherals (PMICs, etc).
+
+An AIC100 card can either be a PCIe HHHL form factor (a traditional PCIe card),
+or a Dual M.2 card. Both use PCIe to connect to the host system.
+
+As a PCIe endpoint/adapter, AIC100 uses the standard VendorID(VID)/
+DeviceID(DID) combination to uniquely identify itself to the host. AIC100
+uses the standard Qualcomm VID (0x17cb). All AIC100 SKUs use the same
+AIC100 DID (0xa100).
+
+AIC100 does not implement FLR (function level reset).
+
+AIC100 implements MSI but does not implement MSI-X. AIC100 requires 17 MSIs to
+operate (1 for MHI, 16 for the DMA Bridge).
+
+As a PCIe device, AIC100 utilizes BARs to provide host interfaces to the device
+hardware. AIC100 provides three 64-bit BARs.
+
+* The first BAR is 4K in size, and exposes the MHI interface to the host.
+
+* The second BAR is 2M in size, and exposes the DMA Bridge interface to the
+ host.
+
+* The third BAR is variable in size based on an individual AIC100's
+ configuration, but defaults to 64K. This BAR currently has no purpose.
+
+From the host perspective, AIC100 has several key hardware components -
+
+* MHI (Modem Host Interface)
+* QSM (QAIC Service Manager)
+* NSPs (Neural Signal Processor)
+* DMA Bridge
+* DDR
+
+MHI
+---
+
+AIC100 has one MHI interface over PCIe. MHI itself is documented at
+Documentation/mhi/index.rst. MHI is the mechanism the host uses to communicate
+with the QSM. Except for workload data via the DMA Bridge, all interaction with
+the device occurs via MHI.
+
+QSM
+---
+
+QAIC Service Manager. This is an ARM A53 CPU that runs the primary
+firmware of the card and performs on-card management tasks. It also
+communicates with the host via MHI. Each AIC100 has one of
+these.
+
+NSP
+---
+
+Neural Signal Processor. Each AIC100 has up to 16 of these. These are
+the processors that run the workloads on AIC100. Each NSP is a Qualcomm Hexagon
+(Q6) DSP with HVX and HMX. Each NSP can only run one workload at a time, but
+multiple NSPs may be assigned to a single workload. Since each NSP can only run
+one workload, AIC100 is limited to 16 concurrent workloads. Workload
+"scheduling" is under the purview of the host. AIC100 does not automatically
+timeslice.
+
+DMA Bridge
+----------
+
+The DMA Bridge is a custom DMA engine that manages the flow of data
+in and out of workloads. AIC100 has one of these. The DMA Bridge has 16
+channels, each consisting of a set of request/response FIFOs. Each active
+workload is assigned a single DMA Bridge channel. The DMA Bridge exposes
+hardware registers to manage the FIFOs (head/tail pointers), but requires host
+memory to store the FIFOs.
+
+DDR
+---
+
+AIC100 has on-card DDR. In total, an AIC100 can have up to 32 GB of DDR.
+This DDR is used to store workloads, data for the workloads, and is used by the
+QSM for managing the device. NSPs are granted access to sections of the DDR by
+the QSM. The host does not have direct access to the DDR, and must make
+requests to the QSM to transfer data to the DDR.
+
+High-level Use Flow
+===================
+
+AIC100 is a multi-user, programmable accelerator typically used for running
+neural networks in inferencing mode to efficiently perform AI operations.
+AIC100 is not intended for training neural networks. AIC100 can, however, be
+utilized for generic compute workloads.
+
+Assuming a user wants to utilize AIC100, they would follow these steps:
+
+1. Compile the workload into an ELF targeting the NSP(s)
+2. Make requests to the QSM to load the workload and related artifacts into the
+ device DDR
+3. Make a request to the QSM to activate the workload onto a set of idle NSPs
+4. Make requests to the DMA Bridge to send input data to the workload to be
+ processed, and other requests to receive processed output data from the
+ workload.
+5. Once the workload is no longer required, make a request to the QSM to
+ deactivate the workload, thus putting the NSPs back into an idle state.
+6. Once the workload and related artifacts are no longer needed for future
+ sessions, make requests to the QSM to unload the data from DDR. This frees
+ the DDR to be used by other users.
+
+
+Boot Flow
+=========
+
+AIC100 uses a flashless boot flow, derived from Qualcomm MSMs.
+
+When AIC100 is first powered on, it begins executing PBL (Primary Bootloader)
+from ROM. PBL enumerates the PCIe link, and initializes the BHI (Boot Host
+Interface) component of MHI.
+
+Using BHI, the host points PBL to the location of the SBL (Secondary Bootloader)
+image. The PBL pulls the image from the host, validates it, and begins
+execution of SBL.
+
+SBL initializes MHI, and uses MHI to notify the host that the device has entered
+the SBL stage. SBL performs a number of operations:
+
+* SBL initializes the majority of hardware (anything PBL left uninitialized),
+ including DDR.
+* SBL offloads the bootlog to the host.
+* SBL synchronizes timestamps with the host for future logging.
+* SBL uses the Sahara protocol to obtain the runtime firmware images from the
+ host.
+
+Once SBL has obtained and validated the runtime firmware, it brings the NSPs out
+of reset, and jumps into the QSM.
+
+The QSM uses MHI to notify the host that the device has entered the QSM stage
+(AMSS in MHI terms). At this point, the AIC100 device is fully functional, and
+ready to process workloads.
+
+Userspace components
+====================
+
+Compiler
+--------
+
+An open compiler for AIC100 based on upstream LLVM can be found at:
+https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100-cc
+
+Usermode Driver (UMD)
+---------------------
+
+An open UMD that interfaces with the qaic kernel driver can be found at:
+https://github.com/quic/software-kit-for-qualcomm-cloud-ai-100
+
+Sahara loader
+-------------
+
+An open implementation of the Sahara protocol called kickstart can be found at:
+https://github.com/andersson/qdl
+
+MHI Channels
+============
+
+AIC100 defines a number of MHI channels for different purposes. This is a list
+of the defined channels, and their uses.
+
++----------------+---------+----------+----------------------------------------+
+| Channel name | IDs | EEs | Purpose |
++================+=========+==========+========================================+
+| QAIC_LOOPBACK | 0 & 1 | AMSS | Any data sent to the device on this |
+| | | | channel is sent back to the host. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_SAHARA | 2 & 3 | SBL | Used by SBL to obtain the runtime |
+| | | | firmware from the host. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_DIAG | 4 & 5 | AMSS | Used to communicate with QSM via the |
+| | | | DIAG protocol. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_SSR | 6 & 7 | AMSS | Used to notify the host of subsystem |
+| | | | restart events, and to offload SSR |
+| | | | crashdumps. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_QDSS | 8 & 9 | AMSS | Used for the Qualcomm Debug Subsystem. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_CONTROL | 10 & 11 | AMSS | Used for the Neural Network Control |
+| | | | (NNC) protocol. This is the primary |
+| | | | channel between host and QSM for |
+| | | | managing workloads. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_LOGGING | 12 & 13 | SBL | Used by the SBL to send the bootlog to |
+| | | | the host. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_STATUS | 14 & 15 | AMSS | Used to notify the host of Reliability,|
+| | | | Availability, Serviceability (RAS) |
+| | | | events. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_TELEMETRY | 16 & 17 | AMSS | Used to get/set power/thermal/etc |
+| | | | attributes. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_DEBUG | 18 & 19 | AMSS | Not used. |
++----------------+---------+----------+----------------------------------------+
+| QAIC_TIMESYNC | 20 & 21 | SBL/AMSS | Used to synchronize timestamps in the |
+| | | | device side logs with the host time |
+| | | | source. |
++----------------+---------+----------+----------------------------------------+
+
+DMA Bridge
+==========
+
+Overview
+--------
+
+The DMA Bridge is one of the main interfaces to the host from the device
+(the other being MHI). As part of activating a workload to run on NSPs, the QSM
+assigns that network a DMA Bridge channel. A workload's DMA Bridge channel
+(DBC for short) is solely for the use of that workload and is not shared with
+other workloads.
+
+Each DBC is a pair of FIFOs that manage data in and out of the workload. One
+FIFO is the request FIFO. The other FIFO is the response FIFO.
+
+Each DBC contains 4 registers in hardware:
+
+* Request FIFO head pointer (offset 0x0). Read only by the host. Indicates the
+ latest item in the FIFO the device has consumed.
+* Request FIFO tail pointer (offset 0x4). Read/write by the host. Host
+ increments this register to add new items to the FIFO.
+* Response FIFO head pointer (offset 0x8). Read/write by the host. Indicates
+ the latest item in the FIFO the host has consumed.
+* Response FIFO tail pointer (offset 0xc). Read only by the host. Device
+ increments this register to add new items to the FIFO.
+
+The values in each register are indexes in the FIFO. To get the location of the
+FIFO element pointed to by the register: FIFO base address + register * element
+size.
+
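+As an illustrative helper (not actual driver code), that computation could be:
+
+.. code-block:: c
+
+    /* Resolve a FIFO index from a DBC register to a host virtual address. */
+    static inline void *dbc_fifo_entry(void *fifo_base, u32 index,
+                                       size_t elem_size)
+    {
+        return (char *)fifo_base + (size_t)index * elem_size;
+    }
+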
+DBC registers are exposed to the host via the second BAR. Each DBC consumes
+4KB of space in the BAR.
+
+The actual FIFOs are backed by host memory. When sending a request to the QSM
+to activate a network, the host must donate memory to be used for the FIFOs.
+Due to internal mapping limitations of the device, a single contiguous chunk of
+memory must be provided per DBC, which hosts both FIFOs. The request FIFO will
+consume the beginning of the memory chunk, and the response FIFO will consume
+the end of the memory chunk.
+
+Request FIFO
+------------
+
+A request FIFO element has the following structure:
+
+.. code-block:: c
+
+ struct request_elem {
+ u16 req_id;
+ u8 seq_id;
+ u8 pcie_dma_cmd;
+ u32 reserved1;
+ u64 pcie_dma_source_addr;
+ u64 pcie_dma_dest_addr;
+ u32 pcie_dma_len;
+ u32 reserved2;
+ u64 doorbell_addr;
+ u8 doorbell_attr;
+ u8 reserved3;
+ u16 reserved4;
+ u32 doorbell_data;
+ u32 sem_cmd0;
+ u32 sem_cmd1;
+ u32 sem_cmd2;
+ u32 sem_cmd3;
+ };
+
+Request field descriptions:
+
+req_id
+ request ID. A request FIFO element and a response FIFO element with
+ the same request ID refer to the same command.
+
+seq_id
+ sequence ID within a request. Ignored by the DMA Bridge.
+
+pcie_dma_cmd
+ describes the DMA element of this request.
+
+ * Bit(7) is the force MSI flag, which overrides the DMA Bridge MSI logic
+ and generates an MSI when this request is complete, provided QSM has
+ configured the DMA Bridge to look at this bit.
+ * Bits(6:5) are reserved.
+ * Bit(4) is the completion code flag, and indicates that the DMA Bridge
+ shall generate a response FIFO element when this request is
+ complete.
+ * Bit(3) indicates if this request is a linked list transfer(0) or a bulk
+ transfer(1).
+ * Bit(2) is reserved.
+ * Bits(1:0) indicate the type of transfer. No transfer(0), to device(1),
+ from device(2). Value 3 is illegal.
+
+pcie_dma_source_addr
+ source address for a bulk transfer, or the address of the linked list.
+
+pcie_dma_dest_addr
+ destination address for a bulk transfer.
+
+pcie_dma_len
+ length of the bulk transfer. Note that the size of this field
+ limits transfers to 4G in size.
+
+doorbell_addr
+ address of the doorbell to ring when this request is complete.
+
+doorbell_attr
+ doorbell attributes.
+
+ * Bit(7) indicates if a write to a doorbell is to occur.
+ * Bits(6:2) are reserved.
+ * Bits(1:0) contain the encoding of the doorbell length. 0 is 32-bit,
+ 1 is 16-bit, 2 is 8-bit, 3 is reserved. The doorbell address
+ must be naturally aligned to the specified length.
+
+doorbell_data
+ data to write to the doorbell. Only the bits corresponding to
+ the doorbell length are valid.
+
+sem_cmdN
+ semaphore command.
+
+ * Bit(31) indicates this semaphore command is enabled.
+ * Bit(30) is the to-device DMA fence. Block this request until all
+ to-device DMA transfers are complete.
+ * Bit(29) is the from-device DMA fence. Block this request until all
+ from-device DMA transfers are complete.
+ * Bits(28:27) are reserved.
+ * Bits(26:24) are the semaphore command. 0 is NOP. 1 is init with the
+ specified value. 2 is increment. 3 is decrement. 4 is wait
+ until the semaphore is equal to the specified value. 5 is wait
+ until the semaphore is greater than or equal to the specified value.
+ 6 is "P", wait until semaphore is greater than 0, then
+ decrement by 1. 7 is reserved.
+ * Bit(23) is reserved.
+ * Bit(22) is the semaphore sync. 0 is post sync, which means that the
+ semaphore operation is done after the DMA transfer. 1 is
+ presync, which gates the DMA transfer. Only one presync is
+ allowed per request.
+ * Bit(21) is reserved.
+ * Bits(20:16) contain the index of the semaphore to operate on.
+ * Bits(15:12) are reserved.
+ * Bits(11:0) are the semaphore value to use in operations.
+
+Overall, a request is processed in 4 steps:
+
+1. If specified, the presync semaphore condition must be true
+2. If enabled, the DMA transfer occurs
+3. If specified, the postsync semaphore conditions must be true
+4. If enabled, the doorbell is written
+
+By using the semaphores in conjunction with the workload running on the NSPs,
+the data pipeline can be synchronized such that the host can queue multiple
+requests of data for the workload to process, but the DMA Bridge will only copy
+the data into the memory of the workload when the workload is ready to process
+the next input.
+
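+As an illustration, a bulk host-to-device transfer that generates a response
+FIFO element could encode pcie_dma_cmd as follows. The macro names are
+hypothetical; only the bit positions come from the layout above:
+
+.. code-block:: c
+
+    #define DMA_CMD_COMPLETION (1u << 4) /* generate a response element */
+    #define DMA_CMD_BULK       (1u << 3) /* bulk, not linked list */
+    #define DMA_CMD_TO_DEVICE  (1u << 0) /* transfer type in bits 1:0 */
+
+    /* req points to a struct request_elem being built by the host. */
+    req->pcie_dma_cmd = DMA_CMD_BULK | DMA_CMD_COMPLETION | DMA_CMD_TO_DEVICE;
+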
+Response FIFO
+-------------
+
+Once a request is fully processed, a response FIFO element is generated if
+specified in pcie_dma_cmd. The structure of a response FIFO element:
+
+.. code-block:: c
+
+ struct response_elem {
+ u16 req_id;
+ u16 completion_code;
+ };
+
+req_id
+ matches the req_id of the request that generated this element.
+
+completion_code
+ status of this request. 0 is success. Non-zero is an error.
+
+The DMA Bridge will generate an MSI to the host in reaction to activity in the
+response FIFO of a DBC. The DMA Bridge hardware has an IRQ storm mitigation
+algorithm, where it will only generate an MSI when the response FIFO transitions
+from empty to non-empty (unless force MSI is enabled and triggered). In
+response to this MSI, the host is expected to drain the response FIFO, and must
+take care to handle any race conditions between draining the FIFO, and the
+device inserting elements into the FIFO.
+
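+A sketch of such a drain loop, using the register offsets listed earlier.
+The helper names, variables and synchronization strategy are hypothetical;
+real code must also handle concurrency with the device carefully:
+
+.. code-block:: c
+
+    /* dbc_bar: __iomem mapping of this DBC's registers; resp_fifo: host
+     * memory backing the response FIFO; fifo_depth: number of elements. */
+    u32 head = readl(dbc_bar + 0x8); /* response FIFO head (host owned) */
+    u32 tail = readl(dbc_bar + 0xc); /* response FIFO tail (device owned) */
+
+    while (head != tail) {
+        struct response_elem *elem = &resp_fifo[head];
+
+        complete_request(elem->req_id, elem->completion_code);
+        head = (head + 1) % fifo_depth;
+        tail = readl(dbc_bar + 0xc); /* the device may have added more */
+    }
+    writel(head, dbc_bar + 0x8); /* publish the consumed index */
+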
+Neural Network Control (NNC) Protocol
+=====================================
+
+The NNC protocol is how the host makes requests to the QSM to manage workloads.
+It uses the QAIC_CONTROL MHI channel.
+
+Each NNC request is packaged into a message. Each message is a series of
+transactions. A passthrough type transaction can contain elements known as
+commands.
+
+QSM requires that NNC messages be little endian encoded and that the fields be
+naturally aligned. Since there are 64-bit elements in some NNC messages,
+64-bit alignment must be maintained.
+
+A message contains a header and then a series of transactions. A message may be
+at most 4K in size from QSM to the host. From the host to the QSM, a message
+can be at most 64K (maximum size of a single MHI packet), but there is a
+continuation feature where message N+1 can be marked as a continuation of
+message N. This is used for exceedingly large DMA xfer transactions.
+
+Transaction descriptions
+------------------------
+
+passthrough
+ Allows userspace to send an opaque payload directly to the QSM.
+ This is used for NNC commands. Userspace is responsible for managing
+ the QSM message requirements in the payload.
+
+dma_xfer
+ DMA transfer. Describes an object that the QSM should DMA into the
+ device via address and size tuples.
+
+activate
+ Activate a workload onto NSPs. The host must provide memory to be
+ used by the DBC.
+
+deactivate
+ Deactivate an active workload and return the NSPs to idle.
+
+status
+ Query the QSM about its NNC implementation. Returns the NNC version
+ and whether CRC is used.
+
+terminate
+ Release a user's resources.
+
+dma_xfer_cont
+ Continuation of a previous DMA transfer. If a DMA transfer
+ cannot be specified in a single message (highly fragmented), this
+ transaction can be used to specify more ranges.
+
+validate_partition
+ Query to QSM to determine if a partition identifier is valid.
+
+Each message is tagged with a user id, and a partition id. The user id allows
+QSM to track resources, and release them when the user goes away (e.g. the
+process crashes). A partition id identifies the QSM-managed resource partition
+to which this message applies.
+
+Messages may have CRCs. Messages should have CRCs applied until the QSM
+reports via the status transaction that CRCs are not needed. The QSM on the
+SA9000P requires CRCs for black channel safing.
+
+Subsystem Restart (SSR)
+=======================
+
+SSR is the concept of limiting the impact of an error. An AIC100 device may
+have multiple users, each with their own workload running. If the workload of
+one user crashes, the fallout of that should be limited to that workload and not
+impact other workloads. SSR accomplishes this.
+
+If a particular workload crashes, QSM notifies the host via the QAIC_SSR MHI
+channel. This notification identifies the workload by its assigned DBC. A
+multi-stage recovery process is then used to cleanup both sides, and get the
+DBC/NSPs into a working state.
+
+When SSR occurs, any state in the workload is lost. Any inputs that were in
+process, or queued but not yet serviced, are lost. The loaded artifacts will
+remain in on-card DDR, but the host will need to re-activate the workload if
+it desires to recover the workload.
+
+Reliability, Availability, Serviceability (RAS)
+================================================
+
+AIC100 is expected to be deployed in server systems where RAS ideology is
+applied. Simply put, RAS is the concept of detecting, classifying, and
+reporting errors. While PCIe has AER (Advanced Error Reporting) which factors
+into RAS, AER does not allow for a device to report details about internal
+errors. Therefore, AIC100 implements a custom RAS mechanism. When a RAS event
+occurs, QSM will report the event with appropriate details via the QAIC_STATUS
+MHI channel. A sysadmin may determine that a particular device needs
+additional service based on RAS reports.
+
+Telemetry
+=========
+
+QSM has the ability to report various physical attributes of the device, and in
+some cases, to allow the host to control them. Examples include thermal limits,
+thermal readings, and power readings. These items are communicated via the
+QAIC_TELEMETRY MHI channel.
diff --git a/Documentation/accel/qaic/index.rst b/Documentation/accel/qaic/index.rst
new file mode 100644
index 000000000000..ad19b88d1a66
--- /dev/null
+++ b/Documentation/accel/qaic/index.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+=====================================
+ accel/qaic Qualcomm Cloud AI driver
+=====================================
+
+The accel/qaic driver supports the Qualcomm Cloud AI machine learning
+accelerator cards.
+
+.. toctree::
+
+ qaic
+ aic100
diff --git a/Documentation/accel/qaic/qaic.rst b/Documentation/accel/qaic/qaic.rst
new file mode 100644
index 000000000000..72a70ab6e3a8
--- /dev/null
+++ b/Documentation/accel/qaic/qaic.rst
@@ -0,0 +1,170 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+=============
+ QAIC driver
+=============
+
+The QAIC driver is the Kernel Mode Driver (KMD) for the AIC100 family of AI
+accelerator products.
+
+Interrupts
+==========
+
+While the AIC100 DMA Bridge hardware implements an IRQ storm mitigation
+mechanism, it is still possible for an IRQ storm to occur. A storm can happen
+if the workload is particularly quick, and the host is responsive. If the host
+can drain the response FIFO as quickly as the device can insert elements into
+it, then the device will frequently transition the response FIFO from empty to
+non-empty and generate MSIs at a rate matching the speed at which the
+workload processes inputs. The lprnet (license plate reader network)
+workload is known to trigger this condition, and can generate in excess of 100k
+MSIs per second. It has been observed that most systems cannot tolerate this
+for long, and will crash due to some form of watchdog triggered by the
+overhead of the interrupt controller interrupting the host CPU.
+
+To mitigate this issue, the QAIC driver implements specific IRQ handling. When
+QAIC receives an IRQ, it disables that line. This prevents the interrupt
+controller from interrupting the CPU. Then QAIC drains the FIFO. Once the FIFO
+is drained, QAIC implements a "last chance" polling algorithm where QAIC will
+sleep for a time to see if the workload will generate more activity. The IRQ
+line remains disabled during this time. If no activity is detected, QAIC exits
+polling mode and reenables the IRQ line.
+
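+A rough sketch of that handling follows; the helper names and the dbc type
+are hypothetical, and the real driver logic is more involved:
+
+.. code-block:: c
+
+    #include <linux/delay.h>
+    #include <linux/interrupt.h>
+
+    static unsigned int poll_interval_us = 100; /* hypothetical knob */
+
+    static irqreturn_t dbc_irq_thread(int irq, void *data)
+    {
+        struct dbc *dbc = data;
+
+        disable_irq_nosync(irq); /* stop the storm at its source */
+        do {
+            drain_response_fifo(dbc);
+            /* "last chance" wait for more workload activity */
+            usleep_range(poll_interval_us, 2 * poll_interval_us);
+        } while (response_fifo_active(dbc));
+        enable_irq(irq);
+        return IRQ_HANDLED;
+    }
+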
+This mitigation in QAIC is very effective. The same lprnet usecase that
+generates 100k IRQs per second (per /proc/interrupts) is reduced to roughly 64
+IRQs over 5 minutes while keeping the host system stable, and having the same
+workload throughput performance (within run to run noise variation).
+
+
+Neural Network Control (NNC) Protocol
+=====================================
+
+The implementation of NNC is split between the KMD (QAIC) and UMD. In general,
+QAIC understands how to encode/decode the NNC wire protocol and handles those
+elements of the protocol which require kernel space knowledge to process (for
+example, mapping host memory to device IOVAs). QAIC understands the structure
+of a message, and
+all of the transactions. QAIC does not understand commands (the payload of a
+passthrough transaction).
+
+QAIC handles and enforces the required little endianness and 64-bit alignment,
+to the degree that it can. Since QAIC does not know the contents of a
+passthrough transaction, it relies on the UMD to satisfy the requirements.
+
+The terminate transaction is of particular use to QAIC. QAIC is not aware of
+the resources that are loaded onto a device since the majority of that activity
+occurs within NNC commands. As a result, QAIC does not have the means to
+roll back userspace activity. To ensure that a userspace client's resources
+are fully released in the case of a process crash or a bug, QAIC uses the
+terminate transaction to let QSM know when a user has gone away and its
+resources can be released.
+
+QSM can report a version number of the NNC protocol it supports. This is in the
+form of a Major number and a Minor number.
+
+Major number updates indicate changes to the NNC protocol which impact the
+message format, or transactions (impacts QAIC).
+
+Minor number updates indicate changes to the NNC protocol which impact the
+commands (does not impact QAIC).
+
+uAPI
+====
+
+QAIC defines a number of driver specific IOCTLs as part of the userspace API.
+This section describes those APIs.
+
+DRM_IOCTL_QAIC_MANAGE
+ This IOCTL allows userspace to send a NNC request to the QSM. The call will
+ block until a response is received, or the request has timed out.
+
+DRM_IOCTL_QAIC_CREATE_BO
+ This IOCTL allows userspace to allocate a buffer object (BO) which can send
+ or receive data from a workload. The call will return a GEM handle that
+ represents the allocated buffer. The BO is not usable until it has been
+ sliced (see DRM_IOCTL_QAIC_ATTACH_SLICE_BO).
+
+DRM_IOCTL_QAIC_MMAP_BO
+ This IOCTL allows userspace to prepare an allocated BO to be mmap'd into the
+ userspace process.
+
+DRM_IOCTL_QAIC_ATTACH_SLICE_BO
+ This IOCTL allows userspace to slice a BO in preparation for sending the BO
+ to the device. Slicing is the operation of describing what portions of a BO
+ get sent where to a workload. This requires a set of DMA transfers for the
+ DMA Bridge, and as such, locks the BO to a specific DBC.
+
+DRM_IOCTL_QAIC_EXECUTE_BO
+ This IOCTL allows userspace to submit a set of sliced BOs to the device. The
+ call is non-blocking. Success only indicates that the BOs have been queued
+ to the device, but does not guarantee they have been executed.
+
+DRM_IOCTL_QAIC_PARTIAL_EXECUTE_BO
+ This IOCTL operates like DRM_IOCTL_QAIC_EXECUTE_BO, but it allows userspace
+ to shrink the BOs sent to the device for this specific call. If a BO
+ typically has N inputs, but only a subset of those is available, this IOCTL
+ allows userspace to indicate that only the first M bytes of the BO should be
+ sent to the device to minimize data transfer overhead. This IOCTL dynamically
+ recomputes the slicing, and therefore has some processing overhead before the
+ BOs can be queued to the device.
+
+DRM_IOCTL_QAIC_WAIT_BO
+ This IOCTL allows userspace to determine when a particular BO has been
+ processed by the device. The call will block until either the BO has been
+ processed and can be re-queued to the device, or a timeout occurs.
+
+DRM_IOCTL_QAIC_PERF_STATS_BO
+ This IOCTL allows userspace to collect performance statistics on the most
+ recent execution of a BO. This allows userspace to construct an end to end
+ timeline of the BO processing for a performance analysis.
+
+DRM_IOCTL_QAIC_PART_DEV
+ This IOCTL allows userspace to request a duplicate "shadow device". This extra
+ accelN device is associated with a specific partition of resources on the
+ AIC100 device and can be used for limiting a process to some subset of
+ resources.
+
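+A hypothetical userspace sketch of the start of this flow (consult
+include/uapi/drm/qaic_accel.h for the authoritative structure layouts):
+
+.. code-block:: c
+
+    #include <fcntl.h>
+    #include <sys/ioctl.h>
+    #include <drm/qaic_accel.h>
+
+    int main(void)
+    {
+        int fd = open("/dev/accel/accel0", O_RDWR);
+        struct qaic_create_bo create = { .size = 4096 };
+
+        if (fd < 0 || ioctl(fd, DRM_IOCTL_QAIC_CREATE_BO, &create))
+            return 1;
+        /* create.handle now holds the GEM handle; prepare the BO with
+         * DRM_IOCTL_QAIC_MMAP_BO and DRM_IOCTL_QAIC_ATTACH_SLICE_BO
+         * before DRM_IOCTL_QAIC_EXECUTE_BO / DRM_IOCTL_QAIC_WAIT_BO. */
+        return 0;
+    }
+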
+Userspace Client Isolation
+==========================
+
+AIC100 supports multiple clients. Multiple DBCs can be consumed by a single
+client, and multiple clients can each consume one or more DBCs. Workloads
+may contain sensitive information; therefore, only the client that owns the
+workload should be allowed to interface with the DBC.
+
+Clients are identified by the instance associated with their open(). A client
+may only use memory they allocate, and DBCs that are assigned to their
+workloads. Attempts to access resources assigned to other clients will be
+rejected.
+
+Module parameters
+=================
+
+QAIC supports the following module parameters:
+
+**datapath_polling (bool)**
+
+Configures QAIC to use a polling thread for datapath events instead of relying
+on the device interrupts. Useful for platforms with broken multiMSI. Must be
+set at QAIC driver initialization. Default is 0 (off).
+
+**mhi_timeout_ms (unsigned int)**
+
+Sets the timeout value for MHI operations in milliseconds (ms). Must be set
+at the time the driver detects a device. Default is 2000 (2 seconds).
+
+**control_resp_timeout_s (unsigned int)**
+
+Sets the timeout value for QSM responses to NNC messages in seconds (s). Must
+be set at the time the driver is sending a request to QSM. Default is 60 (one
+minute).
+
+**wait_exec_default_timeout_ms (unsigned int)**
+
+Sets the default timeout for the wait_exec ioctl in milliseconds (ms). Must be
+set prior to the wait_exec ioctl call. A value specified in the ioctl call
+overrides this for that call. Default is 5000 (5 seconds).
+
+**datapath_poll_interval_us (unsigned int)**
+
+Sets the polling interval in microseconds (us) when datapath polling is active.
+Takes effect at the next polling interval. Default is 100 (100 us).
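+
+For example, to load the driver with datapath polling enabled and a 50 us
+polling interval (values illustrative)::
+
+    # modprobe qaic datapath_polling=1 datapath_poll_interval_us=50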
diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst
index 1b8b46deeb29..f61c01fc376e 100644
--- a/Documentation/accounting/delay-accounting.rst
+++ b/Documentation/accounting/delay-accounting.rst
@@ -13,6 +13,10 @@ a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages
d) memory reclaim
+e) thrashing
+f) direct compact
+g) write-protect copy
+h) IRQ/SOFTIRQ
and makes these statistics available to userspace through
the taskstats interface.
@@ -41,11 +45,12 @@ generic data structure to userspace corresponding to per-pid and per-tgid
statistics. The delay accounting functionality populates specific fields of
this structure. See
- include/linux/taskstats.h
+ include/uapi/linux/taskstats.h
for a description of the fields pertaining to delay accounting.
It will generally be in the form of counters returning the cumulative
-delay seen for cpu, sync block I/O, swapin, memory reclaim etc.
+delay seen for cpu, sync block I/O, swapin, memory reclaim, thrash page
+cache, direct compact, write-protect copy, IRQ/SOFTIRQ etc.
Taking the difference of two successive readings of a given
counter (say cpu_delay_total) for a task will give the delay
@@ -88,41 +93,41 @@ seen.
General format of the getdelays command::
- getdelays [-t tgid] [-p pid] [-c cmd...]
-
+ getdelays [-dilv] [-t tgid] [-p pid]
Get delays, since system boot, for pid 10::
- # ./getdelays -p 10
+ # ./getdelays -d -p 10
(output similar to next case)
Get sum of delays, since system boot, for all pids with tgid 5::
- # ./getdelays -t 5
-
-
- CPU count real total virtual total delay total
- 7876 92005750 100000000 24001500
- IO count delay total
- 0 0
- SWAP count delay total
- 0 0
- RECLAIM count delay total
- 0 0
-
-Get delays seen in executing a given simple command::
-
- # ./getdelays -c ls /
-
- bin data1 data3 data5 dev home media opt root srv sys usr
- boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var
-
-
- CPU count real total virtual total delay total
- 6 4000250 4000000 0
- IO count delay total
- 0 0
- SWAP count delay total
- 0 0
- RECLAIM count delay total
- 0 0
+ # ./getdelays -d -t 5
+ print delayacct stats ON
+ TGID 5
+
+
+ CPU count real total virtual total delay total delay average
+ 8 7000000 6872122 3382277 0.423ms
+ IO count delay total delay average
+ 0 0 0.000ms
+ SWAP count delay total delay average
+ 0 0 0.000ms
+ RECLAIM count delay total delay average
+ 0 0 0.000ms
+ THRASHING count delay total delay average
+ 0 0 0.000ms
+ COMPACT count delay total delay average
+ 0 0 0.000ms
+ WPCOPY count delay total delay average
+ 0 0 0.000ms
+ IRQ count delay total delay average
+ 0 0 0.000ms
+
+Get IO accounting for pid 1, which works only with -p::
+
+ # ./getdelays -i -p 1
+ printing IO accounting
+ linuxrc: read=65536, write=0, cancelled_write=0
+
+The above command can be used with -v to get more debug information.
diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst
index f2b3439edcc2..df6062eb3abb 100644
--- a/Documentation/accounting/psi.rst
+++ b/Documentation/accounting/psi.rst
@@ -37,11 +37,7 @@ Pressure interface
Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.
-The format for CPU is as such::
-
- some avg10=0.00 avg60=0.00 avg300=0.00 total=0
-
-and for memory and IO::
+The format is as such::
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
@@ -58,6 +54,9 @@ situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.
+CPU full is undefined at the system level, but has been reported
+since 5.13, so it is set to zero for backward compatibility.
+
The ratios (in %) are tracked as recent trends over ten, sixty, and
three hundred second windows, which gives insight into short term events
as well as medium and long term trends. The total absolute stall time
@@ -92,7 +91,8 @@ Triggers can be set on more than one psi metric and more than one trigger
for the same psi metric can be specified. However for each trigger a separate
file descriptor is required to be able to poll it separately from others,
therefore for each trigger a separate open() syscall should be made even
-when opening the same psi interface file.
+when opening the same psi interface file. Write operations to a file descriptor
+with an already existing psi trigger will fail with EBUSY.
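+
+A minimal C sketch of creating and waiting on a trigger; the threshold and
+window values are illustrative, and includes/error handling are omitted::
+
+    int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
+    const char trig[] = "some 150000 1000000"; /* 150ms stall per 1s window */
+    struct pollfd fds = { .fd = fd, .events = POLLPRI };
+
+    write(fd, trig, strlen(trig) + 1);
+    poll(&fds, 1, -1); /* returns when the trigger fires */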
Monitors activate only when system enters stall state for the monitored
psi metric and deactivates upon exit from the stall state. While system is
@@ -105,6 +105,10 @@ prevent overly frequent polling. Max limit is chosen as a high enough number
after which monitors are most likely not needed and psi averages can be used
instead.
+Unprivileged users can also create monitors, with the only limitation that the
+window size must be a multiple of 2s, in order to prevent excessive resource
+usage.
+
When activated, psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when system is
bouncing in and out of the stall state.
diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst
index caa3c09a5c3f..9a969c0157f1 100644
--- a/Documentation/admin-guide/README.rst
+++ b/Documentation/admin-guide/README.rst
@@ -1,9 +1,9 @@
.. _readme:
-Linux kernel release 5.x <http://kernel.org/>
+Linux kernel release 6.x <http://kernel.org/>
=============================================
-These are the release notes for Linux version 5. Read them carefully,
+These are the release notes for Linux version 6. Read them carefully,
as they tell you what this is all about, explain how to install the
kernel, and what to do if something goes wrong.
@@ -63,7 +63,7 @@ Installing the kernel source
directory where you have permissions (e.g. your home directory) and
unpack it::
- xz -cd linux-5.x.tar.xz | tar xvf -
+ xz -cd linux-6.x.tar.xz | tar xvf -
Replace "X" with the version number of the latest kernel.
@@ -72,12 +72,12 @@ Installing the kernel source
files. They should match the library, and not get messed up by
whatever the kernel-du-jour happens to be.
- - You can also upgrade between 5.x releases by patching. Patches are
+ - You can also upgrade between 6.x releases by patching. Patches are
distributed in the xz format. To install by patching, get all the
newer patch files, enter the top level directory of the kernel source
- (linux-5.x) and execute::
+ (linux-6.x) and execute::
- xz -cd ../patch-5.x.xz | patch -p1
+ xz -cd ../patch-6.x.xz | patch -p1
Replace "x" for all versions bigger than the version "x" of your current
source tree, **in_order**, and you should be ok. You may want to remove
@@ -85,13 +85,13 @@ Installing the kernel source
that there are no failed patches (some-file-name# or some-file-name.rej).
If there are, either you or I have made a mistake.
- Unlike patches for the 5.x kernels, patches for the 5.x.y kernels
+ Unlike patches for the 6.x kernels, patches for the 6.x.y kernels
(also known as the -stable kernels) are not incremental but instead apply
- directly to the base 5.x kernel. For example, if your base kernel is 5.0
- and you want to apply the 5.0.3 patch, you must not first apply the 5.0.1
- and 5.0.2 patches. Similarly, if you are running kernel version 5.0.2 and
- want to jump to 5.0.3, you must first reverse the 5.0.2 patch (that is,
- patch -R) **before** applying the 5.0.3 patch. You can read more on this in
+ directly to the base 6.x kernel. For example, if your base kernel is 6.0
+ and you want to apply the 6.0.3 patch, you must not first apply the 6.0.1
+ and 6.0.2 patches. Similarly, if you are running kernel version 6.0.2 and
+ want to jump to 6.0.3, you must first reverse the 6.0.2 patch (that is,
+ patch -R) **before** applying the 6.0.3 patch. You can read more on this in
:ref:`Documentation/process/applying-patches.rst <applying_patches>`.
Alternatively, the script patch-kernel can be used to automate this
@@ -114,7 +114,7 @@ Installing the kernel source
Software requirements
---------------------
- Compiling and running the 5.x kernels requires up-to-date
+ Compiling and running the 6.x kernels requires up-to-date
versions of various software packages. Consult
:ref:`Documentation/process/changes.rst <changes>` for the minimum version numbers
required and how to get updates for these packages. Beware that using
@@ -132,12 +132,12 @@ Build directory for the kernel
place for the output files (including .config).
Example::
- kernel source code: /usr/src/linux-5.x
+ kernel source code: /usr/src/linux-6.x
build directory: /home/name/build/kernel
To configure and build the kernel, use::
- cd /usr/src/linux-5.x
+ cd /usr/src/linux-6.x
make O=/home/name/build/kernel menuconfig
make O=/home/name/build/kernel
sudo make O=/home/name/build/kernel modules_install install
@@ -262,8 +262,6 @@ Compiling the kernel
- Make sure you have at least gcc 5.1 available.
For more information, refer to :ref:`Documentation/process/changes.rst <changes>`.
- Please note that you can still run a.out user programs with this kernel.
-
- Do a ``make`` to create a compressed kernel image. It is also
possible to do ``make install`` if you have lilo installed to suit the
kernel makefiles, but you may want to check your particular lilo setup first.
@@ -332,85 +330,10 @@ Compiling the kernel
If something goes wrong
-----------------------
- - If you have problems that seem to be due to kernel bugs, please check
- the file MAINTAINERS to see if there is a particular person associated
- with the part of the kernel that you are having trouble with. If there
- isn't anyone listed there, then the second best thing is to mail
- them to me (torvalds@linux-foundation.org), and possibly to any other
- relevant mailing-list or to the newsgroup.
-
- - In all bug-reports, *please* tell what kernel you are talking about,
- how to duplicate the problem, and what your setup is (use your common
- sense). If the problem is new, tell me so, and if the problem is
- old, please try to tell me when you first noticed it.
-
- - If the bug results in a message like::
-
- unable to handle kernel paging request at address C0000010
- Oops: 0002
- EIP: 0010:XXXXXXXX
- eax: xxxxxxxx ebx: xxxxxxxx ecx: xxxxxxxx edx: xxxxxxxx
- esi: xxxxxxxx edi: xxxxxxxx ebp: xxxxxxxx
- ds: xxxx es: xxxx fs: xxxx gs: xxxx
- Pid: xx, process nr: xx
- xx xx xx xx xx xx xx xx xx xx
-
- or similar kernel debugging information on your screen or in your
- system log, please duplicate it *exactly*. The dump may look
- incomprehensible to you, but it does contain information that may
- help debugging the problem. The text above the dump is also
- important: it tells something about why the kernel dumped code (in
- the above example, it's due to a bad kernel pointer). More information
- on making sense of the dump is in Documentation/admin-guide/bug-hunting.rst
-
- - If you compiled the kernel with CONFIG_KALLSYMS you can send the dump
- as is, otherwise you will have to use the ``ksymoops`` program to make
- sense of the dump (but compiling with CONFIG_KALLSYMS is usually preferred).
- This utility can be downloaded from
- https://www.kernel.org/pub/linux/utils/kernel/ksymoops/ .
- Alternatively, you can do the dump lookup by hand:
-
- - In debugging dumps like the above, it helps enormously if you can
- look up what the EIP value means. The hex value as such doesn't help
- me or anybody else very much: it will depend on your particular
- kernel setup. What you should do is take the hex value from the EIP
- line (ignore the ``0010:``), and look it up in the kernel namelist to
- see which kernel function contains the offending address.
-
- To find out the kernel function name, you'll need to find the system
- binary associated with the kernel that exhibited the symptom. This is
- the file 'linux/vmlinux'. To extract the namelist and match it against
- the EIP from the kernel crash, do::
-
- nm vmlinux | sort | less
-
- This will give you a list of kernel addresses sorted in ascending
- order, from which it is simple to find the function that contains the
- offending address. Note that the address given by the kernel
- debugging messages will not necessarily match exactly with the
- function addresses (in fact, that is very unlikely), so you can't
- just 'grep' the list: the list will, however, give you the starting
- point of each kernel function, so by looking for the function that
- has a starting address lower than the one you are searching for but
- is followed by a function with a higher address you will find the one
- you want. In fact, it may be a good idea to include a bit of
- "context" in your problem report, giving a few lines around the
- interesting one.
-
- If you for some reason cannot do the above (you have a pre-compiled
- kernel image or similar), telling me as much about your setup as
- possible will help. Please read
- 'Documentation/admin-guide/reporting-issues.rst' for details.
-
- - Alternatively, you can use gdb on a running kernel. (read-only; i.e. you
- cannot change values or set break points.) To do this, first compile the
- kernel with -g; edit arch/x86/Makefile appropriately, then do a ``make
- clean``. You'll also need to enable CONFIG_PROC_FS (via ``make config``).
-
- After you've rebooted with the new kernel, do ``gdb vmlinux /proc/kcore``.
- You can now use all the usual gdb commands. The command to look up the
- point where your system crashed is ``l *0xXXXXXXXX``. (Replace the XXXes
- with the EIP value.)
-
- gdb'ing a non-running kernel currently fails because ``gdb`` (wrongly)
- disregards the starting offset for which the kernel is compiled.
+If you have problems that seem to be due to kernel bugs, please follow the
+instructions at 'Documentation/admin-guide/reporting-issues.rst'.
+
+Hints on understanding kernel bug reports are in
+'Documentation/admin-guide/bug-hunting.rst'. More on debugging the kernel
+with gdb is in 'Documentation/dev-tools/gdb-kernel-debugging.rst' and
+'Documentation/dev-tools/kgdb.rst'.
diff --git a/Documentation/admin-guide/acpi/cppc_sysfs.rst b/Documentation/admin-guide/acpi/cppc_sysfs.rst
index fccf22114e85..e53d76365aa7 100644
--- a/Documentation/admin-guide/acpi/cppc_sysfs.rst
+++ b/Documentation/admin-guide/acpi/cppc_sysfs.rst
@@ -4,6 +4,8 @@
Collaborative Processor Performance Control (CPPC)
==================================================
+.. _cppc_sysfs:
+
CPPC
====
diff --git a/Documentation/admin-guide/acpi/dsdt-override.rst b/Documentation/admin-guide/acpi/dsdt-override.rst
deleted file mode 100644
index 50bd7f194bf4..000000000000
--- a/Documentation/admin-guide/acpi/dsdt-override.rst
+++ /dev/null
@@ -1,13 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===============
-Overriding DSDT
-===============
-
-Linux supports a method of overriding the BIOS DSDT:
-
-CONFIG_ACPI_CUSTOM_DSDT - builds the image into the kernel.
-
-When to use this method is described in detail on the
-Linux/ACPI home page:
-https://01.org/linux-acpi/documentation/overriding-dsdt
diff --git a/Documentation/admin-guide/acpi/fan_performance_states.rst b/Documentation/admin-guide/acpi/fan_performance_states.rst
index 98fe5c333121..b9e4b4d146c1 100644
--- a/Documentation/admin-guide/acpi/fan_performance_states.rst
+++ b/Documentation/admin-guide/acpi/fan_performance_states.rst
@@ -60,3 +60,31 @@ For example::
When a given field is not populated or its value provided by the platform
firmware is invalid, the "not-defined" string is shown instead of the value.
+
+ACPI Fan Fine Grain Control
+=============================
+
+When the _FIF object specifies support for fine grain control, the fan speed
+can be set from 0 to 100% in increments of the recommended minimum "step size"
+via the _FSL object. The user can adjust the fan speed using the thermal sysfs
+cooling device.
+
+Here the user can look at the fan performance states for a reference speed
+(speed_rpm) and set it by changing the cooling device cur_state. If fine grain
+control is supported, the user can also select speeds that are not defined in
+the performance states.
+
+Support for fine grain control is presented via the sysfs attribute
+"fine_grain_control". If fine grain control is present, this attribute
+will show "1", otherwise "0".
+
+This sysfs attribute is presented in the same directory as performance states.
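+
+For example, assuming the fan is exposed as cooling device 3 (the index is
+illustrative)::
+
+    # cat /sys/class/thermal/cooling_device3/max_state
+    # echo 5 > /sys/class/thermal/cooling_device3/cur_state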
+
+ACPI Fan Performance Feedback
+=============================
+
+The optional _FST object provides status information for the fan device.
+This includes a field reporting the current speed, in revolutions per minute,
+at which the fan is rotating.
+
+This speed is presented in the sysfs using the attribute "fan_speed_rpm",
+in the same directory as performance states.
diff --git a/Documentation/admin-guide/acpi/index.rst b/Documentation/admin-guide/acpi/index.rst
index 71277689ad97..b078fdb8f4c9 100644
--- a/Documentation/admin-guide/acpi/index.rst
+++ b/Documentation/admin-guide/acpi/index.rst
@@ -9,7 +9,6 @@ the Linux ACPI support.
:maxdepth: 1
initrd_table_override
- dsdt-override
ssdt-overlays
cppc_sysfs
fan_performance_states
diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst
index 8d3a2d045c0a..bb5032a99234 100644
--- a/Documentation/admin-guide/bcache.rst
+++ b/Documentation/admin-guide/bcache.rst
@@ -204,7 +204,7 @@ For example::
This should present your unmodified backing device data in /dev/loop0
If your cache is in writethrough mode, then you can safely discard the
-cache device without loosing data.
+cache device without losing data.
E) Wiping a cache device
diff --git a/Documentation/admin-guide/blockdev/drbd/figures.rst b/Documentation/admin-guide/blockdev/drbd/figures.rst
index bd9a4901fe46..9f73253ea353 100644
--- a/Documentation/admin-guide/blockdev/drbd/figures.rst
+++ b/Documentation/admin-guide/blockdev/drbd/figures.rst
@@ -25,6 +25,6 @@ Sub graphs of DRBD's state transitions
:alt: disk-states-8.dot
:align: center
-.. kernel-figure:: node-states-8.dot
- :alt: node-states-8.dot
+.. kernel-figure:: peer-states-8.dot
+ :alt: peer-states-8.dot
:align: center
diff --git a/Documentation/admin-guide/blockdev/drbd/node-states-8.dot b/Documentation/admin-guide/blockdev/drbd/node-states-8.dot
deleted file mode 100644
index bfa54e1f8016..000000000000
--- a/Documentation/admin-guide/blockdev/drbd/node-states-8.dot
+++ /dev/null
@@ -1,13 +0,0 @@
-digraph node_states {
- Secondary -> Primary [ label = "ioctl_set_state()" ]
- Primary -> Secondary [ label = "ioctl_set_state()" ]
-}
-
-digraph peer_states {
- Secondary -> Primary [ label = "recv state packet" ]
- Primary -> Secondary [ label = "recv state packet" ]
- Primary -> Unknown [ label = "connection lost" ]
- Secondary -> Unknown [ label = "connection lost" ]
- Unknown -> Primary [ label = "connected" ]
- Unknown -> Secondary [ label = "connected" ]
-}
diff --git a/Documentation/admin-guide/blockdev/drbd/peer-states-8.dot b/Documentation/admin-guide/blockdev/drbd/peer-states-8.dot
new file mode 100644
index 000000000000..6dc3954954d6
--- /dev/null
+++ b/Documentation/admin-guide/blockdev/drbd/peer-states-8.dot
@@ -0,0 +1,8 @@
+digraph peer_states {
+ Secondary -> Primary [ label = "recv state packet" ]
+ Primary -> Secondary [ label = "recv state packet" ]
+ Primary -> Unknown [ label = "connection lost" ]
+ Secondary -> Unknown [ label = "connection lost" ]
+ Unknown -> Primary [ label = "connected" ]
+ Unknown -> Secondary [ label = "connected" ]
+}
diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst
index b903cf152091..957ccf617797 100644
--- a/Documentation/admin-guide/blockdev/index.rst
+++ b/Documentation/admin-guide/blockdev/index.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-===========================
-The Linux RapidIO Subsystem
-===========================
+=============
+Block Devices
+=============
.. toctree::
:maxdepth: 1
diff --git a/Documentation/admin-guide/blockdev/nbd.rst b/Documentation/admin-guide/blockdev/nbd.rst
index d78dfe559dcf..faf2ac4b1509 100644
--- a/Documentation/admin-guide/blockdev/nbd.rst
+++ b/Documentation/admin-guide/blockdev/nbd.rst
@@ -14,7 +14,7 @@ to borrow disk space from another computer.
Unlike NFS, it is possible to put any filesystem on it, etc.
For more information, or to download the nbd-client and nbd-server
-tools, go to http://nbd.sf.net/.
+tools, go to https://github.com/NetworkBlockDevice/nbd.
The nbd kernel module need only be installed on the client
system, as the nbd-server is completely in userspace. In fact,
diff --git a/Documentation/admin-guide/blockdev/paride.rst b/Documentation/admin-guide/blockdev/paride.rst
index e1ce90af602a..e85ad37cc0e5 100644
--- a/Documentation/admin-guide/blockdev/paride.rst
+++ b/Documentation/admin-guide/blockdev/paride.rst
@@ -3,6 +3,7 @@ Linux and parallel port IDE devices
===================================
PARIDE v1.03 (c) 1997-8 Grant Guenther <grant@torque.net>
+PATA_PARPORT (c) 2023 Ondrej Zary
1. Introduction
===============
@@ -51,27 +52,15 @@ parallel port IDE subsystem, including:
as well as most of the clone and no-name products on the market.
-To support such a wide range of devices, PARIDE, the parallel port IDE
-subsystem, is actually structured in three parts. There is a base
-paride module which provides a registry and some common methods for
-accessing the parallel ports. The second component is a set of
-high-level drivers for each of the different types of supported devices:
+To support such a wide range of devices, pata_parport is actually structured
+in two parts. There is a base pata_parport module which provides an interface
+to the kernel libata subsystem, a registry, and some common methods for
+accessing the parallel ports.
- === =============
- pd IDE disk
- pcd ATAPI CD-ROM
- pf ATAPI disk
- pt ATAPI tape
- pg ATAPI generic
- === =============
-
-(Currently, the pg driver is only used with CD-R drives).
-
-The high-level drivers function according to the relevant standards.
-The third component of PARIDE is a set of low-level protocol drivers
-for each of the parallel port IDE adapter chips. Thanks to the interest
-and encouragement of Linux users from many parts of the world,
-support is available for almost all known adapter protocols:
+The second component is a set of low-level protocol drivers for each of the
+parallel port IDE adapter chips. Thanks to the interest and encouragement of
+Linux users from many parts of the world, support is available for almost all
+known adapter protocols:
==== ====================================== ====
aten ATEN EH-100 (HK)
@@ -91,251 +80,87 @@ support is available for almost all known adapter protocols:
==== ====================================== ====
-2. Using the PARIDE subsystem
-=============================
+2. Using pata_parport subsystem
+===============================
While configuring the Linux kernel, you may choose either to build
-the PARIDE drivers into your kernel, or to build them as modules.
+the pata_parport drivers into your kernel, or to build them as modules.
In either case, you will need to select "Parallel port IDE device support"
-as well as at least one of the high-level drivers and at least one
-of the parallel port communication protocols. If you do not know
-what kind of parallel port adapter is used in your drive, you could
-begin by checking the file names and any text files on your DOS
+and at least one of the parallel port communication protocols.
+If you do not know what kind of parallel port adapter is used in your drive,
+you could begin by checking the file names and any text files on your DOS
installation floppy. Alternatively, you can look at the markings on
the adapter chip itself. That's usually sufficient to identify the
correct device.
-You can actually select all the protocol modules, and allow the PARIDE
+You can actually select all the protocol modules, and allow the pata_parport
subsystem to try them all for you.
For the "brand-name" products listed above, here are the protocol
and high-level drivers that you would use:
- ================ ============ ====== ========
- Manufacturer Model Driver Protocol
- ================ ============ ====== ========
- MicroSolutions CD-ROM pcd bpck
- MicroSolutions PD drive pf bpck
- MicroSolutions hard-drive pd bpck
- MicroSolutions 8000t tape pt bpck
- SyQuest EZ, SparQ pd epat
- Imation Superdisk pf epat
- Maxell Superdisk pf friq
- Avatar Shark pd epat
- FreeCom CD-ROM pcd frpw
- Hewlett-Packard 5GB Tape pt epat
- Hewlett-Packard 7200e (CD) pcd epat
- Hewlett-Packard 7200e (CD-R) pg epat
- ================ ============ ====== ========
-
-2.1 Configuring built-in drivers
----------------------------------
-
-We recommend that you get to know how the drivers work and how to
-configure them as loadable modules, before attempting to compile a
-kernel with the drivers built-in.
-
-If you built all of your PARIDE support directly into your kernel,
-and you have just a single parallel port IDE device, your kernel should
-locate it automatically for you. If you have more than one device,
-you may need to give some command line options to your bootloader
-(eg: LILO), how to do that is beyond the scope of this document.
-
-The high-level drivers accept a number of command line parameters, all
-of which are documented in the source files in linux/drivers/block/paride.
-By default, each driver will automatically try all parallel ports it
-can find, and all protocol types that have been installed, until it finds
-a parallel port IDE adapter. Once it finds one, the probe stops. So,
-if you have more than one device, you will need to tell the drivers
-how to identify them. This requires specifying the port address, the
-protocol identification number and, for some devices, the drive's
-chain ID. While your system is booting, a number of messages are
-displayed on the console. Like all such messages, they can be
-reviewed with the 'dmesg' command. Among those messages will be
-some lines like::
-
- paride: bpck registered as protocol 0
- paride: epat registered as protocol 1
-
-The numbers will always be the same until you build a new kernel with
-different protocol selections. You should note these numbers as you
-will need them to identify the devices.
+ ================ ============ ========
+ Manufacturer Model Protocol
+ ================ ============ ========
+ MicroSolutions CD-ROM bpck
+ MicroSolutions PD drive bpck
+ MicroSolutions hard-drive bpck
+ MicroSolutions 8000t tape bpck
+ SyQuest EZ, SparQ epat
+ Imation Superdisk epat
+ Maxell Superdisk friq
+ Avatar Shark epat
+ FreeCom CD-ROM frpw
+ Hewlett-Packard 5GB Tape epat
+ Hewlett-Packard 7200e (CD) epat
+ Hewlett-Packard 7200e (CD-R) epat
+ ================ ============ ========
+
+All parports and all protocol drivers are probed automatically unless the
+probe=0 parameter is used. So just "modprobe epat" is enough for an Imation
+SuperDisk drive to work.
+
+Manual device creation::
+
+ # echo "port protocol mode unit delay" >/sys/bus/pata_parport/new_device
+
+where:
+
+ ======== ================================================
+ port parport name (or "auto" for all parports)
+ protocol protocol name (or "auto" for all protocols)
+ mode mode number (protocol-specific) or -1 for probe
+ unit unit number (for backpack only, see below)
+ delay I/O delay (see troubleshooting section below)
+ ======== ================================================
If you happen to be using a MicroSolutions backpack device, you will
also need to know the unit ID number for each drive. This is usually
the last two digits of the drive's serial number (but read MicroSolutions'
documentation about this).
-As an example, let's assume that you have a MicroSolutions PD/CD drive
-with unit ID number 36 connected to the parallel port at 0x378, a SyQuest
-EZ-135 connected to the chained port on the PD/CD drive and also an
-Imation Superdisk connected to port 0x278. You could give the following
-options on your boot command::
-
- pd.drive0=0x378,1 pf.drive0=0x278,1 pf.drive1=0x378,0,36
-
-In the last option, pf.drive1 configures device /dev/pf1, the 0x378
-is the parallel port base address, the 0 is the protocol registration
-number and 36 is the chain ID.
-
-Please note: while PARIDE will work both with and without the
-PARPORT parallel port sharing system that is included by the
-"Parallel port support" option, PARPORT must be included and enabled
-if you want to use chains of devices on the same parallel port.
-
-2.2 Loading and configuring PARIDE as modules
-----------------------------------------------
-
-It is much faster and simpler to get to understand the PARIDE drivers
-if you use them as loadable kernel modules.
-
-Note 1:
- using these drivers with the "kerneld" automatic module loading
- system is not recommended for beginners, and is not documented here.
-
-Note 2:
- if you build PARPORT support as a loadable module, PARIDE must
- also be built as loadable modules, and PARPORT must be loaded before
- the PARIDE modules.
-
-To use PARIDE, you must begin by::
-
- insmod paride
-
-this loads a base module which provides a registry for the protocols,
-among other tasks.
-
-Then, load as many of the protocol modules as you think you might need.
-As you load each module, it will register the protocols that it supports,
-and print a log message to your kernel log file and your console. For
-example::
-
- # insmod epat
- paride: epat registered as protocol 0
- # insmod kbic
- paride: k951 registered as protocol 1
- paride: k971 registered as protocol 2
-
-Finally, you can load high-level drivers for each kind of device that
-you have connected. By default, each driver will autoprobe for a single
-device, but you can support up to four similar devices by giving their
-individual coordinates when you load the driver.
-
-For example, if you had two no-name CD-ROM drives both using the
-KingByte KBIC-951A adapter, one on port 0x378 and the other on 0x3bc
-you could give the following command::
-
- # insmod pcd drive0=0x378,1 drive1=0x3bc,1
-
-For most adapters, giving a port address and protocol number is sufficient,
-but check the source files in linux/drivers/block/paride for more
-information. (Hopefully someone will write some man pages one day !).
-
-As another example, here's what happens when PARPORT is installed, and
-a SyQuest EZ-135 is attached to port 0x378::
-
- # insmod paride
- paride: version 1.0 installed
- # insmod epat
- paride: epat registered as protocol 0
- # insmod pd
- pd: pd version 1.0, major 45, cluster 64, nice 0
- pda: Sharing parport1 at 0x378
- pda: epat 1.0, Shuttle EPAT chip c3 at 0x378, mode 5 (EPP-32), delay 1
- pda: SyQuest EZ135A, 262144 blocks [128M], (512/16/32), removable media
- pda: pda1
-
-Note that the last line is the output from the generic partition table
-scanner - in this case it reports that it has found a disk with one partition.
-
-2.3 Using a PARIDE device
---------------------------
-
-Once the drivers have been loaded, you can access PARIDE devices in the
-same way as their traditional counterparts. You will probably need to
-create the device "special files". Here is a simple script that you can
-cut to a file and execute::
-
- #!/bin/bash
- #
- # mkd -- a script to create the device special files for the PARIDE subsystem
- #
- function mkdev {
- mknod $1 $2 $3 $4 ; chmod 0660 $1 ; chown root:disk $1
- }
- #
- function pd {
- D=$( printf \\$( printf "x%03x" $[ $1 + 97 ] ) )
- mkdev pd$D b 45 $[ $1 * 16 ]
- for P in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
- do mkdev pd$D$P b 45 $[ $1 * 16 + $P ]
- done
- }
- #
- cd /dev
- #
- for u in 0 1 2 3 ; do pd $u ; done
- for u in 0 1 2 3 ; do mkdev pcd$u b 46 $u ; done
- for u in 0 1 2 3 ; do mkdev pf$u b 47 $u ; done
- for u in 0 1 2 3 ; do mkdev pt$u c 96 $u ; done
- for u in 0 1 2 3 ; do mkdev npt$u c 96 $[ $u + 128 ] ; done
- for u in 0 1 2 3 ; do mkdev pg$u c 97 $u ; done
- #
- # end of mkd
-
-With the device files and drivers in place, you can access PARIDE devices
-like any other Linux device. For example, to mount a CD-ROM in pcd0, use::
-
- mount /dev/pcd0 /cdrom
-
-If you have a fresh Avatar Shark cartridge, and the drive is pda, you
-might do something like::
-
- fdisk /dev/pda -- make a new partition table with
- partition 1 of type 83
-
- mke2fs /dev/pda1 -- to build the file system
-
- mkdir /shark -- make a place to mount the disk
-
- mount /dev/pda1 /shark
-
-Devices like the Imation superdisk work in the same way, except that
-they do not have a partition table. For example to make a 120MB
-floppy that you could share with a DOS system::
-
- mkdosfs /dev/pf0
- mount /dev/pf0 /mnt
-
-
-2.4 The pf driver
-------------------
-
-The pf driver is intended for use with parallel port ATAPI disk
-devices. The most common devices in this category are PD drives
-and LS-120 drives. Traditionally, media for these devices are not
-partitioned. Consequently, the pf driver does not support partitioned
-media. This may be changed in a future version of the driver.
-
-2.5 Using the pt driver
-------------------------
-
-The pt driver for parallel port ATAPI tape drives is a minimal driver.
-It does not yet support many of the standard tape ioctl operations.
-For best performance, a block size of 32KB should be used. You will
-probably want to set the parallel port delay to 0, if you can.
-
-2.6 Using the pg driver
-------------------------
-
-The pg driver can be used in conjunction with the cdrecord program
-to create CD-ROMs. Please get cdrecord version 1.6.1 or later
-from ftp://ftp.fokus.gmd.de/pub/unix/cdrecord/ . To record CD-R media
-your parallel port should ideally be set to EPP mode, and the "port delay"
-should be set to 0. With those settings it is possible to record at 2x
-speed without any buffer underruns. If you cannot get the driver to work
-in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.
+If you omit the parameters from the end, defaults will be used, e.g.:
+
+Probe all parports with all protocols::
+
+ # echo auto >/sys/bus/pata_parport/new_device
+
+Probe parport0 using protocol epat and mode 4 (EPP-16)::
+
+ # echo "parport0 epat 4" >/sys/bus/pata_parport/new_device
+
+Probe parport0 using all protocols::
+
+ # echo "parport0 auto" >/sys/bus/pata_parport/new_device
+
+Probe all parports using protocol epat::
+
+ # echo "auto epat" >/sys/bus/pata_parport/new_device
+
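+Probe parport0 using the bpck protocol with a backpack unit ID of 36 (a
+sketch for a MicroSolutions backpack drive; -1 requests mode probing and
+the unit number is the drive's ID described above)::
+
+	# echo "parport0 bpck -1 36" >/sys/bus/pata_parport/new_device
+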
+Deleting devices::
+
+ # echo pata_parport.0 >/sys/bus/pata_parport/delete_device
3. Troubleshooting
@@ -344,9 +169,9 @@ in EPP mode, try to use "bidirectional" or "PS/2" mode and 1x speeds only.
3.1 Use EPP mode if you can
----------------------------
-The most common problems that people report with the PARIDE drivers
+The most common problems that people report with the pata_parport drivers
concern the parallel port CMOS settings. At this time, none of the
-PARIDE protocol modules support ECP mode, or any ECP combination modes.
+protocol modules support ECP mode, or any ECP combination modes.
If you are able to do so, please set your parallel port into EPP mode
using your CMOS setup procedure.
@@ -354,17 +179,14 @@ using your CMOS setup procedure.
-------------------------
Some parallel ports cannot reliably transfer data at full speed. To
-offset the errors, the PARIDE protocol modules introduce a "port
+offset the errors, the protocol modules introduce a "port
delay" between each access to the i/o ports. Each protocol sets
a default value for this delay. In most cases, the user can override
the default and set it to 0 - resulting in somewhat higher transfer
rates. In some rare cases (especially with older 486 systems) the
default delays are not long enough. If you experience corrupt data
transfers, or unexpected failures, you may wish to increase the
-port delay. The delay can be programmed using the "driveN" parameters
-to each of the high-level drivers. Please see the notes above, or
-read the comments at the beginning of the driver source files in
-linux/drivers/block/paride.
+port delay.
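+
+With pata_parport, the delay is the fifth field written to new_device, so
+one way to raise it is at device creation time. A hypothetical sketch,
+assuming mode -1 (probe) and a don't-care unit value of -1 for a
+non-backpack adapter::
+
+	# echo "parport0 epat -1 -1 5" >/sys/bus/pata_parport/new_device
+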
3.3 Some drives need a printer reset
-------------------------------------
@@ -374,66 +196,12 @@ that do not always power up correctly. We have noticed this with some
drives based on OnSpec and older Freecom adapters. In these rare cases,
the adapter can often be reinitialised by issuing a "printer reset" on
the parallel port. As the reset operation is potentially disruptive in
-multiple device environments, the PARIDE drivers will not do it
+multiple device environments, the pata_parport drivers will not do it
automatically. You can, however, force a printer reset by doing::
insmod lp reset=1
rmmod lp
If you have one of these marginal cases, you should probably build
-your paride drivers as modules, and arrange to do the printer reset
-before loading the PARIDE drivers.
-
-3.4 Use the verbose option and dmesg if you need help
-------------------------------------------------------
-
-While a lot of testing has gone into these drivers to make them work
-as smoothly as possible, problems will arise. If you do have problems,
-please check all the obvious things first: does the drive work in
-DOS with the manufacturer's drivers ? If that doesn't yield any useful
-clues, then please make sure that only one drive is hooked to your system,
-and that either (a) PARPORT is enabled or (b) no other device driver
-is using your parallel port (check in /proc/ioports). Then, load the
-appropriate drivers (you can load several protocol modules if you want)
-as in::
-
- # insmod paride
- # insmod epat
- # insmod bpck
- # insmod kbic
- ...
- # insmod pd verbose=1
-
-(using the correct driver for the type of device you have, of course).
-The verbose=1 parameter will cause the drivers to log a trace of their
-activity as they attempt to locate your drive.
-
-Use 'dmesg' to capture a log of all the PARIDE messages (any messages
-beginning with paride:, a protocol module's name or a driver's name) and
-include that with your bug report. You can submit a bug report in one
-of two ways. Either send it directly to the author of the PARIDE suite,
-by e-mail to grant@torque.net, or join the linux-parport mailing list
-and post your report there.
-
-3.5 For more information or help
----------------------------------
-
-You can join the linux-parport mailing list by sending a mail message
-to:
-
- linux-parport-request@torque.net
-
-with the single word::
-
- subscribe
-
-in the body of the mail message (not in the subject line). Please be
-sure that your mail program is correctly set up when you do this, as
-the list manager is a robot that will subscribe you using the reply
-address in your mail headers. REMOVE any anti-spam gimmicks you may
-have in your mail headers, when sending mail to the list server.
-
-You might also find some useful information on the linux-parport
-web pages (although they are not always up to date) at
-
- http://web.archive.org/web/%2E/http://www.torque.net/parport/
+your pata_parport drivers as modules, and arrange to do the printer reset
+before loading the pata_parport drivers.
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index 700329d25f57..e4551579cb12 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -315,8 +315,8 @@ To use the feature, admin should set up backing device via::
echo /dev/sda5 > /sys/block/zramX/backing_dev
-before disksize setting. It supports only partition at this moment.
-If admin wants to use incompressible page writeback, they could do via::
+before setting the disksize. It supports only partitions at this moment.
+If an admin wants to use incompressible page writeback, they could do it via::
echo huge > /sys/block/zramX/writeback
@@ -328,15 +328,33 @@ as idle::
From now on, any pages on zram are idle pages. The idle mark
will be removed when someone requests access to the block.
IOW, unless there is an access request, those pages remain idle pages.
+Additionally, when CONFIG_ZRAM_MEMORY_TRACKING is enabled, pages can be
+marked as idle based on how long (in seconds) it's been since they were
+last accessed::
+
+ echo 86400 > /sys/block/zramX/idle
+
+In this example, all pages which haven't been accessed in more than 86400
+seconds (one day) will be marked idle.
Admin can request writeback of those idle pages at the right time via::
echo idle > /sys/block/zramX/writeback
-With the command, zram writeback idle pages from memory to the storage.
+With the command, zram will write back idle pages from memory to the storage.
+
+Additionally, if a user chooses to write back only huge and idle pages,
+this can be accomplished with::
+
+ echo huge_idle > /sys/block/zramX/writeback
-If admin want to write a specific page in zram device to backing device,
-they could write a page index into the interface.
+If a user chooses to write back only incompressible pages (pages that none
+of the algorithms can compress), this can be accomplished with::
+
+ echo incompressible > /sys/block/zramX/writeback
+
+If an admin wants to write a specific page in the zram device to the backing
+device, they could write a page index into the interface::
echo "page_index=1251" > /sys/block/zramX/writeback
@@ -346,7 +364,7 @@ to guarantee storage health for entire product life.
To overcome the concern, zram supports "writeback_limit" feature.
The "writeback_limit_enable"'s default value is 0 so that it doesn't limit
-any writeback. IOW, if admin wants to apply writeback budget, he should
+any writeback. IOW, if an admin wants to apply a writeback budget, they should
enable writeback_limit_enable via::
$ echo 1 > /sys/block/zramX/writeback_limit_enable
@@ -357,7 +375,7 @@ until admin sets the budget via /sys/block/zramX/writeback_limit.
(If admin doesn't enable writeback_limit_enable, writeback_limit's value
assigned via /sys/block/zramX/writeback_limit is meaningless.)
-If admin want to limit writeback as per-day 400M, he could do it
+If an admin wants to limit writeback to 400M per day, they could do it
like below::
$ MB_SHIFT=20
@@ -367,16 +385,16 @@ like below::
$ echo 1 > /sys/block/zram0/writeback_limit_enable
If admins want to allow further writeback again once the budget is exhausted,
-he could do it like below::
+they could do it like below::
$ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit
-If admin wants to see remaining writeback budget since last set::
+If an admin wants to see the remaining writeback budget since it was last set::
$ cat /sys/block/zramX/writeback_limit
-If admin want to disable writeback limit, he could do::
+If an admin wants to disable the writeback limit, they could do::
$ echo 0 > /sys/block/zramX/writeback_limit_enable
@@ -385,9 +403,90 @@ system reboot, echo 1 > /sys/block/zramX/reset) so keeping track of how much
writeback has happened until you reset the zram, in order to allocate extra
writeback budget in the next setting, is the user's job.
-If admin wants to measure writeback count in a certain period, he could
+If an admin wants to measure the writeback count in a certain period, they could
know it via /sys/block/zram0/bd_stat's 3rd column.
+recompression
+-------------
+
+With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative
+(secondary) compression algorithms. The basic idea is that an alternative
+compression algorithm can provide a better compression ratio at the price of
+(potentially) slower compression/decompression speeds. An alternative
+compression algorithm can, for example, be more successful at compressing huge
+pages (those that the default algorithm failed to compress). Another
+application is idle page recompression - pages that are cold and sit in memory
+can be recompressed using a more effective algorithm and, hence, reduce
+zsmalloc memory usage.
+
+With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms:
+one primary and up to 3 secondary ones. The primary zram compressor is
+explained in "3) Select compression algorithm"; secondary algorithms are
+configured using the recomp_algorithm device attribute.
+
+Example:::
+
+ #show supported recompression algorithms
+ cat /sys/block/zramX/recomp_algorithm
+ #1: lzo lzo-rle lz4 lz4hc [zstd]
+ #2: lzo lzo-rle lz4 [lz4hc] zstd
+
+Alternative compression algorithms are sorted by priority. In the example
+above, zstd is used as the first alternative algorithm, which has a priority
+of 1, while lz4hc is configured as a compression algorithm with priority 2.
+An alternative compression algorithm's priority is provided during algorithm
+configuration:::
+
+ #select zstd recompression algorithm, priority 1
+ echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm
+
+ #select deflate recompression algorithm, priority 2
+ echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm
+
+Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress,
+which controls recompression.
+
+Examples:::
+
+ #IDLE pages recompression is activated by `idle` mode
+ echo "type=idle" > /sys/block/zramX/recompress
+
+ #HUGE pages recompression is activated by `huge` mode
+	echo "type=huge" > /sys/block/zramX/recompress
+
+ #HUGE_IDLE pages recompression is activated by `huge_idle` mode
+ echo "type=huge_idle" > /sys/block/zramX/recompress
+
+The number of idle pages can be significant, so user-space can pass a size
+threshold (in bytes) to the recompress knob: zram will recompress only pages
+of equal or greater size:::
+
+ #recompress all pages larger than 3000 bytes
+ echo "threshold=3000" > /sys/block/zramX/recompress
+
+ #recompress idle pages larger than 2000 bytes
+ echo "type=idle threshold=2000" > /sys/block/zramX/recompress
+
+Recompression of idle pages requires memory tracking.
+
+During re-compression, for every page that matches the re-compression
+criteria, ZRAM iterates the list of registered alternative compression
+algorithms in order of their priorities. ZRAM stops either when re-compression
+is successful (the re-compressed object is smaller than the original one and
+matches the re-compression criteria, e.g. the size threshold) or when there
+are no secondary algorithms left to try. If none of the secondary algorithms
+can successfully re-compress the page, the page is marked as incompressible,
+so ZRAM will not attempt to re-compress it in the future.
+
+This re-compression behaviour, iterating through the list of registered
+compression algorithms, increases the chances of finding an algorithm that
+successfully compresses a particular page. Sometimes, however, it is
+convenient (and sometimes even necessary) to limit recompression to only one
+particular algorithm so that it will not try any other algorithms.
+This can be achieved by providing an algo=NAME parameter:::
+
+ #use zstd algorithm only (if registered)
+ echo "type=huge algo=zstd" > /sys/block/zramX/recompress
+
memory tracking
===============
@@ -398,9 +497,11 @@ pages of the process with*pagemap.
If you enable the feature, you could see block state via
/sys/kernel/debug/zram/zram0/block_state. The output is as follows::
- 300 75.033841 .wh.
- 301 63.806904 s...
- 302 63.806919 ..hi
+ 300 75.033841 .wh...
+ 301 63.806904 s.....
+ 302 63.806919 ..hi..
+ 303 62.801919 ....r.
+ 304 146.781902 ..hi.n
First column
zram's block index.
@@ -417,6 +518,10 @@ Third column
huge page
i:
idle page
+ r:
+ recompressed page (secondary compression algorithm)
+ n:
+    none of the algorithms (including secondary ones) could compress it
First line of above example says 300th block is accessed at 75.033841sec
and the block's state is huge so it is written back to the backing
diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst
index a1860fc0ca88..91339efdcb54 100644
--- a/Documentation/admin-guide/bootconfig.rst
+++ b/Documentation/admin-guide/bootconfig.rst
@@ -158,9 +158,15 @@ Each key-value pair is shown in each line with following style::
Boot Kernel With a Boot Config
==============================
-Since the boot configuration file is loaded with initrd, it will be added
-to the end of the initrd (initramfs) image file with padding, size,
-checksum and 12-byte magic word as below.
+There are two options to boot the kernel with bootconfig: attaching the
+bootconfig to the initrd image or embedding it in the kernel itself.
+
+Attaching a Boot Config to Initrd
+---------------------------------
+
+Since the boot configuration file is loaded with the initrd by default,
+it will be added to the end of the initrd (initramfs) image file with
+padding, size, checksum and a 12-byte magic word, as below.
[initrd][bootconfig][padding][size(le32)][checksum(le32)][#BOOTCONFIG\n]
@@ -195,7 +201,30 @@ To remove the config from the image, you can use -d option as below::
Then add "bootconfig" on the normal kernel command line to tell the
kernel to look for the bootconfig at the end of the initrd file.
+Alternatively, build your kernel with the ``CONFIG_BOOT_CONFIG_FORCE``
+Kconfig option selected.
+
+Embedding a Boot Config into Kernel
+-----------------------------------
+
+If you cannot use initrd, you can also embed the bootconfig file in the
+kernel via Kconfig options. In this case, you need to recompile the kernel
+with the following configs::
+
+ CONFIG_BOOT_CONFIG_EMBED=y
+ CONFIG_BOOT_CONFIG_EMBED_FILE="/PATH/TO/BOOTCONFIG/FILE"
+
+``CONFIG_BOOT_CONFIG_EMBED_FILE`` requires an absolute path, or a path to
+the bootconfig file relative to the source tree or object tree.
+The kernel will embed it as the default bootconfig.
+
+Just as when attaching the bootconfig to the initrd, you need the
+``bootconfig`` option on the kernel command line to enable the embedded
+bootconfig, or, alternatively, build your kernel with the
+``CONFIG_BOOT_CONFIG_FORCE`` Kconfig option selected.
+Note that even if you set this option, you can override the embedded
+bootconfig with another bootconfig attached to the initrd.
Kernel parameters via Boot Config
=================================
@@ -204,7 +233,7 @@ In addition to the kernel command line, the boot config can be used for
passing the kernel parameters. All the key-value pairs under ``kernel``
key will be passed to kernel cmdline directly. Moreover, the key-value
pairs under ``init`` will be passed to init process via the cmdline.
-The parameters are concatinated with user-given kernel cmdline string
+The parameters are concatenated with user-given kernel cmdline string
as the following order, so that the command line parameter can override
bootconfig parameters (this depends on how the subsystem handles parameters
but in general, earlier parameter will be overwritten by later one.)::
diff --git a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst
index 16253eda192e..dabb80cdd25a 100644
--- a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst
+++ b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst
@@ -106,7 +106,7 @@ Proportional weight policy files
see Documentation/block/bfq-iosched.rst.
blkio.bfq.weight_device
- Specifes per cgroup per device weights, overriding the default group
+ Specifies per cgroup per device weights, overriding the default group
weight. For more details, see Documentation/block/bfq-iosched.rst.
Following is the format::
diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst
index b0688011ed06..9343148ee993 100644
--- a/Documentation/admin-guide/cgroup-v1/cgroups.rst
+++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst
@@ -80,6 +80,8 @@ access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rs
you to associate a set of CPUs and a set of memory nodes with the
tasks in each cgroup.
+.. _cgroups-why-needed:
+
1.2 Why are cgroups needed ?
----------------------------
diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 5d844ed4df69..ae646d621a8a 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -719,7 +719,7 @@ There are ways to query or modify cpusets:
cat, rmdir commands from the shell, or their equivalent from C.
- via the C library libcpuset.
- via the C library libcgroup.
- (http://sourceforge.net/projects/libcg/)
+ (https://github.com/libcgroup/libcgroup/)
- via the python application cset.
(http://code.google.com/p/cpuset/)
diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index 338f2c7d7a1c..0fa724d82abb 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -29,12 +29,14 @@ Brief summary of control files::
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB usage limit
+ hugetlb.<hugepagesize>.numa_stat # show the numa information of the hugetlb memory charged to this cgroup
For a system supporting three hugepage sizes (64k, 32M and 1G), the control
files include::
hugetlb.1GB.limit_in_bytes
hugetlb.1GB.max_usage_in_bytes
+ hugetlb.1GB.numa_stat
hugetlb.1GB.usage_in_bytes
hugetlb.1GB.failcnt
hugetlb.1GB.rsvd.limit_in_bytes
@@ -43,6 +45,7 @@ files include::
hugetlb.1GB.rsvd.failcnt
hugetlb.64KB.limit_in_bytes
hugetlb.64KB.max_usage_in_bytes
+ hugetlb.64KB.numa_stat
hugetlb.64KB.usage_in_bytes
hugetlb.64KB.failcnt
hugetlb.64KB.rsvd.limit_in_bytes
@@ -51,6 +54,7 @@ files include::
hugetlb.64KB.rsvd.failcnt
hugetlb.32MB.limit_in_bytes
hugetlb.32MB.max_usage_in_bytes
+ hugetlb.32MB.numa_stat
hugetlb.32MB.usage_in_bytes
hugetlb.32MB.failcnt
hugetlb.32MB.rsvd.limit_in_bytes
diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 45b94f7b3beb..a402359abb99 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -97,7 +97,7 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
=============
Page Cache is charged at
- - add_to_page_cache_locked().
+ - filemap_add_folio().
The logic is very clear. (About migration, see below)
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 41191b5fb69d..47d1d7d932a8 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -2,18 +2,18 @@
Memory Resource Controller
==========================
-NOTE:
+.. caution::
This document is hopelessly outdated and it asks for a complete
rewrite. It still contains useful information so we are keeping it
here but make sure to check the current code if you need a deeper
understanding.
-NOTE:
+.. note::
The Memory Resource Controller has generically been referred to as the
memory controller in this document. Do not confuse memory controller
used here with the memory controller that is used in hardware.
-(For editors) In this document:
+.. hint::
When we mention a cgroup (cgroupfs's directory) with memory controller,
we call it "memory cgroup". When you see git-log and source code, you'll
see patch's title and function names tend to use "memcg".
@@ -23,7 +23,7 @@ Benefits and Purpose of the memory controller
=============================================
The memory controller isolates the memory behaviour of a group of tasks
-from the rest of the system. The article on LWN [12] mentions some probable
+from the rest of the system. The article on LWN [12]_ mentions some probable
uses of the memory controller. The memory controller can be used to
a. Isolate an application or a group of applications
@@ -55,7 +55,8 @@ Features:
- Root cgroup has no limit controls.
Kernel memory support is a work in progress, and the current version provides
- basically functionality. (See Section 2.7)
+  basic functionality. (See :ref:`section 2.7
+ <cgroup-v1-memory-kernel-extension>`)
Brief summary of control files.
@@ -64,6 +65,7 @@ Brief summary of control files.
threads
cgroup.procs show list of processes
cgroup.event_control an interface for event_fd()
+ This knob is not available on CONFIG_PREEMPT_RT systems.
memory.usage_in_bytes show current usage for memory
(See 5.5 for details)
memory.memsw.usage_in_bytes show current usage for memory+Swap
@@ -75,6 +77,7 @@ Brief summary of control files.
memory.max_usage_in_bytes show max memory usage recorded
memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded
memory.soft_limit_in_bytes set/show soft limit of memory usage
+ This knob is not available on CONFIG_PREEMPT_RT systems.
memory.stat show various statistics
memory.use_hierarchy set/show hierarchical account enabled
This knob is deprecated and shouldn't be
@@ -84,13 +87,13 @@ Brief summary of control files.
memory.swappiness set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate set/show controls of moving charges
+ This knob is deprecated and shouldn't be
+ used.
memory.oom_control set/show oom controls.
memory.numa_stat show the number of memory usage per numa
node
- memory.kmem.limit_in_bytes set/show hard limit for kernel memory
- This knob is deprecated and shouldn't be
- used. It is planned that this be removed in
- the foreseeable future.
+ memory.kmem.limit_in_bytes This knob is deprecated and writing to
+ it will return -ENOTSUPP.
memory.kmem.usage_in_bytes show current kernel memory allocation
memory.kmem.failcnt show the number of kernel memory usage
hits limits
@@ -107,16 +110,16 @@ Brief summary of control files.
==========
The memory controller has a long history. A request for comments for the memory
-controller was posted by Balbir Singh [1]. At the time the RFC was posted
+controller was posted by Balbir Singh [1]_. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
-for memory control. The first RSS controller was posted by Balbir Singh[2]
-in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
-RSS controller. At OLS, at the resource management BoF, everyone suggested
-that we handle both page cache and RSS together. Another request was raised
-to allow user space handling of OOM. The current memory controller is
+for memory control. The first RSS controller was posted by Balbir Singh [2]_
+in Feb 2007. Pavel Emelianov [3]_ [4]_ [5]_ has since posted three versions
+of the RSS controller. At OLS, at the resource management BoF, everyone
+suggested that we handle both page cache and RSS together. Another request was
+raised to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
-Cache Control [11].
+Cache Control [11]_.
2. Memory Control
=================
@@ -147,7 +150,8 @@ specific data structure (mem_cgroup) associated with it.
2.2. Accounting
---------------
-::
+.. code-block::
+ :caption: Figure 1: Hierarchy of Accounting
+--------------------+
| mem_cgroup |
@@ -167,7 +171,6 @@ specific data structure (mem_cgroup) associated with it.
| | | |
+---------------+ +---------------+
- (Figure 1: Hierarchy of Accounting)
Figure 1 shows the important aspects of the controller
@@ -221,8 +224,9 @@ behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).
-But see section 8.2: when moving a task to another cgroup, its pages may
-be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
+But see :ref:`section 8.2 <cgroup-v1-memory-movable-charges>`: when moving a
+task to another cgroup, its pages may be recharged to the new cgroup, if
+move_charge_at_immigrate has been chosen.
2.4 Swap Extension
--------------------------------------
@@ -244,7 +248,8 @@ In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.
-**why 'memory+swap' rather than swap**
+2.4.1 why 'memory+swap' rather than swap
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
to move account from memory to swap...there is no change in usage of
@@ -252,7 +257,8 @@ memory+swap. In other words, when we want to limit the usage of swap without
affecting global LRU, memory+swap limit is better than just limiting swap from
an OS point of view.
-**What happens when a cgroup hits memory.memsw.limit_in_bytes**
+2.4.2. What happens when a cgroup hits memory.memsw.limit_in_bytes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by cgroup routine and file
@@ -268,26 +274,26 @@ global VM. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
-cgroup. (See 10. OOM Control below.)
+cgroup. (See :ref:`10. OOM Control <cgroup-v1-memory-oom-control>` below.)
The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.
-NOTE:
- Reclaim does not work for the root cgroup, since we cannot set any
- limits on the root cgroup.
+.. note::
+ Reclaim does not work for the root cgroup, since we cannot set any
+ limits on the root cgroup.
-Note2:
- When panic_on_oom is set to "2", the whole system will panic.
+.. note::
+ When panic_on_oom is set to "2", the whole system will panic.
When oom event notifier is registered, event will be delivered.
-(See oom_control section)
+(See :ref:`oom_control <cgroup-v1-memory-oom-control>` section)
2.6 Locking
-----------
-Lock order is as follows:
+Lock order is as follows::
Page lock (PG_locked bit of page->flags)
mm->page_table_lock or split pte_lock
@@ -299,7 +305,9 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; PG_lru bit of page->flags is cleared before
isolating a page from its LRU under lruvec->lru_lock.
-2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
+.. _cgroup-v1-memory-kernel-extension:
+
+2.7 Kernel Memory Extension
-----------------------------------------------
With the Kernel memory extension, the Memory Controller is able to limit
@@ -367,10 +375,10 @@ U != 0, K < U:
never greater than the total memory, and freely set U at the cost of his
QoS.
-WARNING:
- In the current implementation, memory reclaim will NOT be
- triggered for a cgroup when it hits K while staying below U, which makes
- this setup impractical.
+ .. warning::
+ In the current implementation, memory reclaim will NOT be triggered for
+ a cgroup when it hits K while staying below U, which makes this setup
+ impractical.
U != 0, K >= U:
Since kmem charges will also be fed to the user counter and reclaim will be
@@ -381,47 +389,41 @@ U != 0, K >= U:
3. User Interface
=================
-3.0. Configuration
-------------------
-
-a. Enable CONFIG_CGROUPS
-b. Enable CONFIG_MEMCG
-c. Enable CONFIG_MEMCG_SWAP (to use swap extension)
-d. Enable CONFIG_MEMCG_KMEM (to use kmem extension)
+To use the user interface:
-3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
--------------------------------------------------------------------
-
-::
+1. Enable CONFIG_CGROUPS and CONFIG_MEMCG options
+2. Prepare the cgroups (see :ref:`Why are cgroups needed?
+ <cgroups-why-needed>` for the background information)::
# mount -t tmpfs none /sys/fs/cgroup
# mkdir /sys/fs/cgroup/memory
# mount -t cgroup none /sys/fs/cgroup/memory -o memory
-3.2. Make the new group and move bash into it::
+3. Make the new group and move bash into it::
# mkdir /sys/fs/cgroup/memory/0
# echo $$ > /sys/fs/cgroup/memory/0/tasks
-Since now we're in the 0 cgroup, we can alter the memory limit::
+4. Since now we're in the 0 cgroup, we can alter the memory limit::
# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
-NOTE:
- We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
- mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
- Gibibytes.)
+ The limit can now be queried::
+
+ # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
+ 4194304
-NOTE:
- We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``.
+.. note::
+ We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
+ mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
+ Gibibytes.)
-NOTE:
- We cannot set limits on the root cgroup any more.
+.. note::
+ We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``.
-::
+.. note::
+ We cannot set limits on the root cgroup any more.
- # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
- 4194304
We can check the usage::
@@ -460,6 +462,8 @@ test because it has noise of shared objects/status.
But the above two are testing extreme situations.
Trying usual test under memory controller is always helpful.
+.. _cgroup-v1-memory-test-troubleshoot:
+
4.1 Troubleshooting
-------------------
@@ -472,8 +476,11 @@ terminated by the OOM killer. There are several causes for this:
A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
some of the pages cached in the cgroup (page cache pages).
-To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and
-seeing what happens will be helpful.
+To know what happens, disabling OOM_Kill as per :ref:`"10. OOM Control"
+<cgroup-v1-memory-oom-control>` (below) and seeing what happens will be
+helpful.
+
+.. _cgroup-v1-memory-test-task-migration:
4.2 Task migration
------------------
@@ -484,15 +491,16 @@ remain charged to it, the charge is dropped when the page is freed or
reclaimed.
You can move charges of a task along with task migration.
-See 8. "Move charges at task migration"
+See :ref:`8. "Move charges at task migration" <cgroup-v1-memory-move-charges>`
4.3 Removing a cgroup
---------------------
-A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
-cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. (because we charge against pages, not
-against tasks.)
+A cgroup can be removed by rmdir, but as discussed in :ref:`sections 4.1
+<cgroup-v1-memory-test-troubleshoot>` and :ref:`4.2
+<cgroup-v1-memory-test-task-migration>`, a cgroup might have some charge
+associated with it, even though all tasks have migrated away from it. (because
+we charge against pages, not against tasks.)
We move the stats to the parent, and there is no change in the charge except
uncharging from the child.
@@ -518,74 +526,69 @@ will be charged as a new owner of it.
charged file caches. Some out-of-use page caches may keep charged until
memory pressure happens. If you want to avoid that, force_empty will be useful.
- Also, note that when memory.kmem.limit_in_bytes is set the charges due to
- kernel pages will still be seen. This is not considered a failure and the
- write will still return success. In this case, it is expected that
- memory.kmem.usage_in_bytes == memory.usage_in_bytes.
-
5.2 stat file
-------------
-memory.stat file includes following statistics
-
-per-memory cgroup local status
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-=============== ===============================================================
-cache # of bytes of page cache memory.
-rss # of bytes of anonymous and swap cache memory (includes
- transparent hugepages).
-rss_huge # of bytes of anonymous transparent hugepages.
-mapped_file # of bytes of mapped file (includes tmpfs/shmem)
-pgpgin # of charging events to the memory cgroup. The charging
- event happens each time a page is accounted as either mapped
- anon page(RSS) or cache page(Page Cache) to the cgroup.
-pgpgout # of uncharging events to the memory cgroup. The uncharging
- event happens each time a page is unaccounted from the cgroup.
-swap # of bytes of swap usage
-dirty # of bytes that are waiting to get written back to the disk.
-writeback # of bytes of file/anon cache that are queued for syncing to
- disk.
-inactive_anon # of bytes of anonymous and swap cache memory on inactive
- LRU list.
-active_anon # of bytes of anonymous and swap cache memory on active
- LRU list.
-inactive_file # of bytes of file-backed memory on inactive LRU list.
-active_file # of bytes of file-backed memory on active LRU list.
-unevictable # of bytes of memory that cannot be reclaimed (mlocked etc).
-=============== ===============================================================
-
-status considering hierarchy (see memory.use_hierarchy settings)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-========================= ===================================================
-hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy
- under which the memory cgroup is
-hierarchical_memsw_limit # of bytes of memory+swap limit with regard to
- hierarchy under which memory cgroup is.
-
-total_<counter> # hierarchical version of <counter>, which in
- addition to the cgroup's own value includes the
- sum of all hierarchical children's values of
- <counter>, i.e. total_cache
-========================= ===================================================
-
-The following additional stats are dependent on CONFIG_DEBUG_VM
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-========================= ========================================
-recent_rotated_anon VM internal parameter. (see mm/vmscan.c)
-recent_rotated_file VM internal parameter. (see mm/vmscan.c)
-recent_scanned_anon VM internal parameter. (see mm/vmscan.c)
-recent_scanned_file VM internal parameter. (see mm/vmscan.c)
-========================= ========================================
-
-Memo:
+memory.stat file includes following statistics:
+
+ * per-memory cgroup local status
+
+ =============== ===============================================================
+ cache # of bytes of page cache memory.
+ rss # of bytes of anonymous and swap cache memory (includes
+ transparent hugepages).
+ rss_huge # of bytes of anonymous transparent hugepages.
+ mapped_file # of bytes of mapped file (includes tmpfs/shmem)
+ pgpgin # of charging events to the memory cgroup. The charging
+ event happens each time a page is accounted as either mapped
+ anon page(RSS) or cache page(Page Cache) to the cgroup.
+ pgpgout # of uncharging events to the memory cgroup. The uncharging
+ event happens each time a page is unaccounted from the
+ cgroup.
+ swap # of bytes of swap usage
+ dirty # of bytes that are waiting to get written back to the disk.
+ writeback # of bytes of file/anon cache that are queued for syncing to
+ disk.
+ inactive_anon # of bytes of anonymous and swap cache memory on inactive
+ LRU list.
+ active_anon # of bytes of anonymous and swap cache memory on active
+ LRU list.
+ inactive_file # of bytes of file-backed memory and MADV_FREE anonymous
+ memory (LazyFree pages) on inactive LRU list.
+ active_file # of bytes of file-backed memory on active LRU list.
+ unevictable # of bytes of memory that cannot be reclaimed (mlocked etc).
+ =============== ===============================================================
+
+ * status considering hierarchy (see memory.use_hierarchy settings):
+
+ ========================= ===================================================
+  hierarchical_memory_limit # of bytes of memory limit with regard to
+                            hierarchy under which the memory cgroup is
+ hierarchical_memsw_limit # of bytes of memory+swap limit with regard to
+ hierarchy under which memory cgroup is.
+
+ total_<counter> # hierarchical version of <counter>, which in
+ addition to the cgroup's own value includes the
+ sum of all hierarchical children's values of
+ <counter>, i.e. total_cache
+ ========================= ===================================================
+
+ * additional vm parameters (depends on CONFIG_DEBUG_VM):
+
+ ========================= ========================================
+ recent_rotated_anon VM internal parameter. (see mm/vmscan.c)
+ recent_rotated_file VM internal parameter. (see mm/vmscan.c)
+ recent_scanned_anon VM internal parameter. (see mm/vmscan.c)
+ recent_scanned_file VM internal parameter. (see mm/vmscan.c)
+ ========================= ========================================
+
+.. hint::
recent_rotated means recent frequency of LRU rotation.
recent_scanned means recent # of scans to LRU.
shown for better debugging. Please see the code for the exact meanings.
-Note:
+.. note::
Only anonymous and swap cache memory is listed as part of 'rss' stat.
This should not be confused with the true 'resident set size' or the
amount of physical memory used by the cgroup.
@@ -716,15 +719,25 @@ If we want to change this to 1G, we can at any time use::
# echo 1G > memory.soft_limit_in_bytes
-NOTE1:
+.. note::
Soft limits take effect over a long period of time, since they involve
reclaiming memory for balancing between memory cgroups
-NOTE2:
+
+.. note::
It is recommended to set the soft limit always below the hard limit,
otherwise the hard limit will take precedence.
-8. Move charges at task migration
-=================================
+.. _cgroup-v1-memory-move-charges:
+
+8. Move charges at task migration (DEPRECATED!)
+===============================================
+
+THIS IS DEPRECATED!
+
+It's expensive and unreliable! It's better practice to launch workload
+tasks directly from inside their target cgroup. Use dedicated workload
+cgroups to allow fine-grained policy adjustments without having to
+move physical pages between control domains.
Users can move charges associated with a task along with task migration, that
is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
@@ -741,23 +754,29 @@ If you want to enable it::
# echo (some positive value) > memory.move_charge_at_immigrate
-Note:
+.. note::
Each bits of move_charge_at_immigrate has its own meaning about what type
- of charges should be moved. See 8.2 for details.
-Note:
+ of charges should be moved. See :ref:`section 8.2
+ <cgroup-v1-memory-movable-charges>` for details.
+
+.. note::
Charges are moved only when you move mm->owner, in other words,
a leader of a thread group.
-Note:
+
+.. note::
If we cannot find enough space for the task in the destination cgroup, we
try to make space by reclaiming memory. Task migration may fail if we
cannot make enough space.
-Note:
+
+.. note::
It can take several seconds if you move many charges.
And if you want to disable it again::
# echo 0 > memory.move_charge_at_immigrate
+.. _cgroup-v1-memory-movable-charges:
+
8.2 Type of charges which can be moved
--------------------------------------
@@ -807,6 +826,8 @@ threshold in any direction.
It's applicable for root and non-root cgroup.
+.. _cgroup-v1-memory-oom-control:
+
10. OOM Control
===============
@@ -962,15 +983,16 @@ commented and discussed quite extensively in the community.
References
==========
-1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
-2. Singh, Balbir. Memory Controller (RSS Control),
+.. [1] Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
+.. [2] Singh, Balbir. Memory Controller (RSS Control),
http://lwn.net/Articles/222762/
-3. Emelianov, Pavel. Resource controllers based on process cgroups
+.. [3] Emelianov, Pavel. Resource controllers based on process cgroups
https://lore.kernel.org/r/45ED7DEC.7010403@sw.ru
-4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
+.. [4] Emelianov, Pavel. RSS controller based on process cgroups (v2)
https://lore.kernel.org/r/461A3010.90403@sw.ru
-5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
+.. [5] Emelianov, Pavel. RSS controller based on process cgroups (v3)
https://lore.kernel.org/r/465D9739.8070209@openvz.org
+
6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control
subsystem (v3), http://lwn.net/Articles/235534/
@@ -980,7 +1002,8 @@ References
https://lore.kernel.org/r/464D267A.50107@linux.vnet.ibm.com
10. Singh, Balbir. Memory controller v6 test results,
https://lore.kernel.org/r/20070819094658.654.84837.sendpatchset@balbir-laptop
-11. Singh, Balbir. Memory controller introduction (v6),
- https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop
-12. Corbet, Jonathan, Controlling memory use in cgroups,
- http://lwn.net/Articles/243795/
+
+.. [11] Singh, Balbir. Memory controller introduction (v6),
+ https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop
+.. [12] Corbet, Jonathan, Controlling memory use in cgroups,
+ http://lwn.net/Articles/243795/
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 4d8c27eca96b..f67c0829350b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -184,6 +184,14 @@ cgroup v2 currently supports the following mount options.
ignored on non-init namespace mounts. Please refer to the
Delegation section for details.
+ favordynmods
+ Reduce the latencies of dynamic cgroup modifications such as
+ task migrations and controller on/offs at the cost of making
+ hot path operations such as forks and exits more expensive.
+ The static usage pattern of creating a cgroup, enabling
+ controllers, and then seeding it with CLONE_INTO_CGROUP is
+ not affected by this option.
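+
+		A usage sketch (mount point and source name are illustrative)::
+
+		  # mount -t cgroup2 -o favordynmods none /sys/fs/cgroup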
+
memory_localevents
Only populate memory.events with data for the current cgroup,
and not any subtrees. This is legacy behaviour, the default
@@ -611,10 +619,12 @@ process migrations.
and is an example of this type.
+.. _cgroupv2-limits-distributor:
+
Limits
------
-A child can only consume upto the configured amount of the resource.
+A child can only consume up to the configured amount of the resource.
Limits can be over-committed - the sum of the limits of children can
exceed the amount of resource available to the parent.
@@ -627,15 +637,16 @@ process migrations.
"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
on an IO device and is an example of this type.
+.. _cgroupv2-protections-distributor:
Protections
-----------
-A cgroup is protected upto the configured amount of the resource
+A cgroup is protected up to the configured amount of the resource
as long as the usages of all its ancestors are under their
protected levels. Protections can be hard guarantees or best effort
soft boundaries. Protections can also be over-committed in which case
-only upto the amount available to the parent is protected among
+only up to the amount available to the parent is protected among
children.
Protections are in the range [0, max] and defaults to 0, which is
@@ -968,6 +979,29 @@ All cgroup core files are prefixed with "cgroup."
killing cgroups is a process directed operation, i.e. it affects
the whole thread-group.
+ cgroup.pressure
+	A read-write single value file whose allowed values are "0" and "1".
+ The default is "1".
+
+ Writing "0" to the file will disable the cgroup PSI accounting.
+ Writing "1" to the file will re-enable the cgroup PSI accounting.
+
+	This control attribute is not hierarchical, so disabling or enabling
+	PSI accounting in a cgroup does not affect PSI accounting in its
+	descendants, and enablement does not need to be passed down from the
+	root via the ancestors.
+
+	The reason this control attribute exists is that PSI accounts stalls
+	for each cgroup separately and aggregates them at each level of the
+	hierarchy. This may cause non-negligible overhead for some workloads
+	deep in the hierarchy, in which case this control attribute can be
+	used to disable PSI accounting in the non-leaf cgroups.
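+
+	For example, PSI accounting can be turned off for one cgroup from
+	within its directory (a usage sketch)::
+
+	  # echo 0 > cgroup.pressure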
+
+ irq.pressure
+ A read-write nested-keyed file.
+
+ Shows pressure stall information for IRQ/SOFTIRQ. See
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
+
Controllers
===========
@@ -1016,6 +1050,8 @@ All time durations are in microseconds.
- nr_periods
- nr_throttled
- throttled_usec
+ - nr_bursts
+ - burst_usec
cpu.weight
A read-write single value file which exists on non-root
@@ -1043,10 +1079,16 @@ All time durations are in microseconds.
$MAX $PERIOD
- which indicates that the group may consume upto $MAX in each
+ which indicates that the group may consume up to $MAX in each
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
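+
+	For example, to allow 50ms of CPU time every 100ms period
+	(illustrative values, in microseconds)::
+
+	  # echo "50000 100000" > cpu.max
+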
+ cpu.max.burst
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+	The burst in the range [0, $MAX]. It allows a group to accumulate
+	quota left unused in earlier periods and spend it, on top of the
+	normal quota, in a later period.
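+
+	For example, to allow the group to accumulate and burst up to an
+	extra 50ms (an illustrative value, in microseconds; it must lie in
+	[0, $MAX])::
+
+	  # echo 50000 > cpu.max.burst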
+
cpu.pressure
A read-write nested-keyed file.
@@ -1200,6 +1242,41 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
+ memory.reclaim
+ A write-only nested-keyed file which exists for all cgroups.
+
+ This is a simple interface to trigger memory reclaim in the
+ target cgroup.
+
+ This file accepts a single key, the number of bytes to reclaim.
+ No nested keys are currently supported.
+
+ Example::
+
+ echo "1G" > memory.reclaim
+
+ The interface can be later extended with nested keys to
+ configure the reclaim behavior. For example, specify the
+ type of memory to reclaim from (anon, file, ..).
+
+	Please note that the kernel can over- or under-reclaim from
+	the target cgroup. If fewer bytes are reclaimed than the
+	specified amount, -EAGAIN is returned.
+
+ Please note that the proactive reclaim (triggered by this
+ interface) is not meant to indicate memory pressure on the
+	memory cgroup. Therefore, socket memory balancing triggered by
+	the memory reclaim is normally not exercised in this case.
+ This means that the networking layer will not adapt based on
+ reclaim induced by memory.reclaim.
+
+ memory.peak
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The max memory usage recorded for the cgroup and its
+ descendants since the creation of the cgroup.
+
memory.oom.group
A read-write single value file which exists on non-root
cgroups. The default value is "0".
@@ -1260,6 +1337,9 @@ PAGE_SIZE multiple when read back.
The number of processes belonging to this cgroup
killed by any kind of OOM killer.
+ oom_group_kill
+ The number of times a group OOM has occurred.
+
memory.events.local
Similar to memory.events but the fields in the file are local
to the cgroup i.e. not hierarchical. The file modified event
@@ -1290,12 +1370,22 @@ PAGE_SIZE multiple when read back.
Amount of memory used to cache filesystem data,
including tmpfs and shared memory.
+ kernel (npn)
+	   Amount of total kernel memory, including kernel_stack,
+	   pagetables, percpu, vmalloc and slab, in addition to
+	   other kernel memory use cases.
+
kernel_stack
Amount of memory allocated to kernel stacks.
pagetables
Amount of memory allocated for page tables.
+ sec_pagetables
+	   Amount of memory allocated for secondary page tables;
+	   this currently includes KVM MMU allocations on x86
+	   and arm64.
+
percpu (npn)
Amount of memory used for storing per-cpu kernel
data structures.
@@ -1303,10 +1393,19 @@ PAGE_SIZE multiple when read back.
sock (npn)
Amount of memory used in network transmission buffers
+ vmalloc (npn)
+ Amount of memory used for vmap backed memory.
+
shmem
Amount of cached filesystem data that is swap-backed,
such as tmpfs, shm segments, shared anonymous mmap()s
+ zswap
+ Amount of memory consumed by the zswap compression backend.
+
+ zswapped
+ Amount of application memory swapped out to zswap.
+
file_mapped
Amount of cached filesystem data mapped with mmap()
@@ -1380,6 +1479,30 @@ PAGE_SIZE multiple when read back.
workingset_nodereclaim
Number of times a shadow node has been reclaimed
+ pgscan (npn)
+ Amount of scanned pages (in an inactive LRU list)
+
+ pgsteal (npn)
+ Amount of reclaimed pages
+
+ pgscan_kswapd (npn)
+ Amount of scanned pages by kswapd (in an inactive LRU list)
+
+ pgscan_direct (npn)
+ Amount of scanned pages directly (in an inactive LRU list)
+
+ pgscan_khugepaged (npn)
+ Amount of scanned pages by khugepaged (in an inactive LRU list)
+
+ pgsteal_kswapd (npn)
+ Amount of reclaimed pages by kswapd
+
+ pgsteal_direct (npn)
+ Amount of reclaimed pages directly
+
+ pgsteal_khugepaged (npn)
+ Amount of reclaimed pages by khugepaged
+
pgfault (npn)
Total number of page faults incurred
@@ -1389,12 +1512,6 @@ PAGE_SIZE multiple when read back.
pgrefill (npn)
Amount of scanned pages (in an active LRU list)
- pgscan (npn)
- Amount of scanned pages (in an inactive LRU list)
-
- pgsteal (npn)
- Amount of reclaimed pages
-
pgactivate (npn)
Amount of pages moved to the active LRU list
@@ -1497,6 +1614,21 @@ PAGE_SIZE multiple when read back.
higher than the limit for an extended period of time. This
reduces the impact on the workload and memory management.
+ memory.zswap.current
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The total amount of memory consumed by the zswap compression
+ backend.
+
+ memory.zswap.max
+ A read-write single value file which exists on non-root
+ cgroups. The default is "max".
+
+	 Zswap usage hard limit. If a cgroup's zswap pool reaches this
+	 limit, it will refuse to take any more stores until existing
+	 entries fault back in or are written out to disk.
+
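+	 For example (the limit value is illustrative), capping a
+	 cgroup's zswap pool at 512 megabytes::
+
+	   echo "512M" > memory.zswap.max
+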
memory.pressure
A read-only nested-keyed file.
@@ -1862,7 +1994,7 @@ IO Latency Interface Files
io.latency
This takes a similar format as the other controllers.
- "MAJOR:MINOR target=<target time in microseconds"
+ "MAJOR:MINOR target=<target time in microseconds>"
io.stat
If the controller is enabled you will see extra stats in io.stat in
@@ -2090,75 +2222,93 @@ Cpuset Interface Files
It accepts only the following input values when written to.
- ======== ================================
- "root" a partition root
- "member" a non-root member of a partition
- ======== ================================
-
- When set to be a partition root, the current cgroup is the
- root of a new partition or scheduling domain that comprises
- itself and all its descendants except those that are separate
- partition roots themselves and their descendants. The root
- cgroup is always a partition root.
-
- There are constraints on where a partition root can be set.
- It can only be set in a cgroup if all the following conditions
- are true.
-
- 1) The "cpuset.cpus" is not empty and the list of CPUs are
- exclusive, i.e. they are not shared by any of its siblings.
- 2) The parent cgroup is a partition root.
- 3) The "cpuset.cpus" is also a proper subset of the parent's
- "cpuset.cpus.effective".
- 4) There is no child cgroups with cpuset enabled. This is for
- eliminating corner cases that have to be handled if such a
- condition is allowed.
-
- Setting it to partition root will take the CPUs away from the
- effective CPUs of the parent cgroup. Once it is set, this
- file cannot be reverted back to "member" if there are any child
- cgroups with cpuset enabled.
-
- A parent partition cannot distribute all its CPUs to its
- child partitions. There must be at least one cpu left in the
- parent partition.
-
- Once becoming a partition root, changes to "cpuset.cpus" is
- generally allowed as long as the first condition above is true,
- the change will not take away all the CPUs from the parent
- partition and the new "cpuset.cpus" value is a superset of its
- children's "cpuset.cpus" values.
-
- Sometimes, external factors like changes to ancestors'
- "cpuset.cpus" or cpu hotplug can cause the state of the partition
- root to change. On read, the "cpuset.sched.partition" file
- can show the following values.
-
- ============== ==============================
- "member" Non-root member of a partition
- "root" Partition root
- "root invalid" Invalid partition root
- ============== ==============================
-
- It is a partition root if the first 2 partition root conditions
- above are true and at least one CPU from "cpuset.cpus" is
- granted by the parent cgroup.
-
- A partition root can become invalid if none of CPUs requested
- in "cpuset.cpus" can be granted by the parent cgroup or the
- parent cgroup is no longer a partition root itself. In this
- case, it is not a real partition even though the restriction
- of the first partition root condition above will still apply.
- The cpu affinity of all the tasks in the cgroup will then be
- associated with CPUs in the nearest ancestor partition.
-
- An invalid partition root can be transitioned back to a
- real partition root if at least one of the requested CPUs
- can now be granted by its parent. In this case, the cpu
- affinity of all the tasks in the formerly invalid partition
- will be associated to the CPUs of the newly formed partition.
- Changing the partition state of an invalid partition root to
- "member" is always allowed even if child cpusets are present.
+ ========== =====================================
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
+ ========== =====================================
+
+ The root cgroup is always a partition root and its state
+ cannot be changed. All other non-root cgroups start out as
+ "member".
+
+ When set to "root", the current cgroup is the root of a new
+ partition or scheduling domain that comprises itself and all
+ its descendants except those that are separate partition roots
+ themselves and their descendants.
+
+ When set to "isolated", the CPUs in that partition root will
+ be in an isolated state without any load balancing from the
+ scheduler. Tasks placed in such a partition with multiple
+ CPUs should be carefully distributed and bound to each of the
+ individual CPUs for optimal performance.
+
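+	 As a sketch (paths and CPU numbers are hypothetical, and the
+	 cpuset controller must already be enabled in the parent's
+	 "cgroup.subtree_control"), an isolated partition could be set
+	 up like so::
+
+	   mkdir /sys/fs/cgroup/rt-part
+	   echo "2-3" > /sys/fs/cgroup/rt-part/cpuset.cpus
+	   echo isolated > /sys/fs/cgroup/rt-part/cpuset.cpus.partition
+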
+ The value shown in "cpuset.cpus.effective" of a partition root
+ is the CPUs that the partition root can dedicate to a potential
+	 new child partition root. A new child partition takes the CPUs
+	 it is granted away from its parent's "cpuset.cpus.effective".
+
+ A partition root ("root" or "isolated") can be in one of the
+ two possible states - valid or invalid. An invalid partition
+	 root is in a degraded state where some state information may
+	 be retained, but it otherwise behaves more like a "member".
+
+ All possible state transitions among "member", "root" and
+ "isolated" are allowed.
+
+ On read, the "cpuset.cpus.partition" file can show the following
+ values.
+
+ ============================= =====================================
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
+ "root invalid (<reason>)" Invalid partition root
+ "isolated invalid (<reason>)" Invalid isolated partition root
+ ============================= =====================================
+
+	 In the case of an invalid partition root, a descriptive string
+	 explaining why the partition is invalid is included within parentheses.
+
+ For a partition root to become valid, the following conditions
+ must be met.
+
+ 1) The "cpuset.cpus" is exclusive with its siblings , i.e. they
+ are not shared by any of its siblings (exclusivity rule).
+ 2) The parent cgroup is a valid partition root.
+ 3) The "cpuset.cpus" is not empty and must contain at least
+ one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
+ 4) The "cpuset.cpus.effective" cannot be empty unless there is
+ no task associated with this partition.
+
+ External events like hotplug or changes to "cpuset.cpus" can
+ cause a valid partition root to become invalid and vice versa.
+ Note that a task cannot be moved to a cgroup with empty
+ "cpuset.cpus.effective".
+
+ For a valid partition root with the sibling cpu exclusivity
+ rule enabled, changes made to "cpuset.cpus" that violate the
+ exclusivity rule will invalidate the partition as well as its
+	 sibling partitions with conflicting cpuset.cpus values. So
+	 care must be taken when changing "cpuset.cpus".
+
+	 A valid non-root parent partition may distribute all its CPUs
+	 to its child partitions when there is no task associated with it.
+
+	 Care must be taken when changing a valid partition root to
+	 "member", as all its child partitions, if present, will become
+	 invalid, causing disruption to tasks running in those child
+	 partitions. These inactivated partitions could be recovered if
+ their parent is switched back to a partition root with a proper
+ set of "cpuset.cpus".
+
+ Poll and inotify events are triggered whenever the state of
+ "cpuset.cpus.partition" changes. That includes changes caused
+ by write to "cpuset.cpus.partition", cpu hotplug or other
+ changes that modify the validity status of the partition.
+ This will allow user space agents to monitor unexpected changes
+ to "cpuset.cpus.partition" without the need to do continuous
+ polling.
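+
+	 As a sketch (requires the inotify-tools package; the cgroup
+	 path is hypothetical), such an agent could watch for state
+	 changes with::
+
+	   inotifywait -m -e modify /sys/fs/cgroup/rt-part/cpuset.cpus.partition
+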
Device controller
@@ -2252,6 +2402,11 @@ HugeTLB Interface Files
are local to the cgroup i.e. not hierarchical. The file modified event
generated on this file reflects only the local events.
+ hugetlb.<hugepagesize>.numa_stat
+	Similar to memory.numa_stat, it shows the numa information of the
+	hugetlb pages of <hugepagesize> in this cgroup. Only actively
+	in-use hugetlb pages are included. The per-node values are in bytes.
+
Misc
----
@@ -2310,6 +2465,16 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_
Limits can be set higher than the capacity value in the misc.capacity
file.
+ misc.events
+ A read-only flat-keyed file which exists on non-root cgroups. The
+ following entries are defined. Unless specified otherwise, a value
+ change in this file generates a file modified event. All fields in
+ this file are hierarchical.
+
+ max
+ The number of times the cgroup's resource usage was
+ about to go over the max boundary.
+
Migration and Ownership
~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst
index f170d8820258..2e151cd8c2e4 100644
--- a/Documentation/admin-guide/cifs/usage.rst
+++ b/Documentation/admin-guide/cifs/usage.rst
@@ -399,7 +399,7 @@ A partial list of the supported mount options follows:
sep
if first mount option (after the -o), overrides
the comma as the separator between the mount
- parms. e.g.::
+ parameters. e.g.::
-o user=myname,password=mypassword,domain=mydom
@@ -734,10 +734,9 @@ SecurityFlags Flags which control security negotiation and
using weaker password hashes is 0x37037 (lanman,
plaintext, ntlm, ntlmv2, signing allowed). Some
SecurityFlags require the corresponding menuconfig
- options to be enabled (lanman and plaintext require
- CONFIG_CIFS_WEAK_PW_HASH for example). Enabling
- plaintext authentication currently requires also
- enabling lanman authentication in the security flags
+ options to be enabled. Enabling plaintext
+ authentication currently requires also enabling
+ lanman authentication in the security flags
because the cifs module only supports sending
 			plaintext passwords using the older lanman dialect
form of the session setup SMB. (e.g. for authentication
@@ -766,7 +765,7 @@ cifsFYI If set to non-zero value, additional debug information
Some debugging statements are not compiled into the
cifs kernel unless CONFIG_CIFS_DEBUG2 is enabled in the
kernel configuration. cifsFYI may be set to one or
- nore of the following flags (7 sets them all)::
+ more of the following flags (7 sets them all)::
+-----------------------------------------------+------+
| log cifs informational messages | 0x01 |
@@ -859,7 +858,7 @@ CIFS kernel module parameters
These module parameters can be specified or modified either during the time of
module loading or during the runtime by using the interface::
- /proc/module/cifs/parameters/<param>
+ /sys/module/cifs/parameters/<param>
i.e.::
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index b085dbac60a5..d29cacc9b3c3 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -8,22 +8,22 @@ to /proc/cpuinfo output of some architectures. They reside in
Documentation/ABI/stable/sysfs-devices-system-cpu.
Architecture-neutral, drivers/base/topology.c, exports these attributes.
-However, the book and drawer related sysfs files will only be created if
-CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are selected, respectively.
-
-CONFIG_SCHED_BOOK and CONFIG_SCHED_DRAWER are currently only used on s390,
-where they reflect the cpu and cache hierarchy.
+However the die, cluster, book, and drawer hierarchy related sysfs files will
+only be created if an architecture provides the related macros as described
+below.
For an architecture to support this feature, it must define some of
these macros in include/asm-XXX/topology.h::
#define topology_physical_package_id(cpu)
#define topology_die_id(cpu)
+ #define topology_cluster_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_drawer_id(cpu)
#define topology_sibling_cpumask(cpu)
#define topology_core_cpumask(cpu)
+ #define topology_cluster_cpumask(cpu)
#define topology_die_cpumask(cpu)
#define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
@@ -39,15 +39,16 @@ not defined by include/asm-XXX/topology.h:
1) topology_physical_package_id: -1
2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
-
-For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
-default definitions for topology_book_id() and topology_book_cpumask().
-For architectures that don't support drawers (CONFIG_SCHED_DRAWER) there are
-no default definitions for topology_drawer_id() and topology_drawer_cpumask().
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_book_id: -1
+6) topology_drawer_id: -1
+7) topology_sibling_cpumask: just the given CPU
+8) topology_core_cpumask: just the given CPU
+9) topology_cluster_cpumask: just the given CPU
+10) topology_die_cpumask: just the given CPU
+11) topology_book_cpumask: just the given CPU
+12) topology_drawer_cpumask: just the given CPU
Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files. The internal
diff --git a/Documentation/admin-guide/device-mapper/cache-policies.rst b/Documentation/admin-guide/device-mapper/cache-policies.rst
index b17fe352fc41..13da4d831d46 100644
--- a/Documentation/admin-guide/device-mapper/cache-policies.rst
+++ b/Documentation/admin-guide/device-mapper/cache-policies.rst
@@ -70,7 +70,7 @@ the entries (each hotspot block covers a larger area than a single
cache block).
All this means smq uses ~25bytes per cache block. Still a lot of
-memory, but a substantial improvement nontheless.
+memory, but a substantial improvement nonetheless.
Level balancing
^^^^^^^^^^^^^^^
diff --git a/Documentation/admin-guide/device-mapper/dm-ebs.rst b/Documentation/admin-guide/device-mapper/dm-ebs.rst
index 534fa38e8862..c09f66db5621 100644
--- a/Documentation/admin-guide/device-mapper/dm-ebs.rst
+++ b/Documentation/admin-guide/device-mapper/dm-ebs.rst
@@ -31,7 +31,7 @@ Mandatory parameters:
Optional parameter:
- <underyling sectors>:
+ <underlying sectors>:
Number of sectors defining the logical block size of <dev path>.
2^N supported, e.g. 8 = emulate 8 sectors of 512 bytes = 4KiB.
If not provided, the logical block size of <dev path> will be used.
diff --git a/Documentation/admin-guide/device-mapper/dm-flakey.rst b/Documentation/admin-guide/device-mapper/dm-flakey.rst
index 86138735879d..f7104c01b0f7 100644
--- a/Documentation/admin-guide/device-mapper/dm-flakey.rst
+++ b/Documentation/admin-guide/device-mapper/dm-flakey.rst
@@ -39,6 +39,10 @@ Optional feature parameters:
If no feature parameters are present, during the periods of
unreliability, all I/O returns errors.
+ error_reads:
+ All read I/O is failed with an error signalled.
+ Write I/O is handled correctly.
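+
+	A hypothetical dmsetup invocation using the error_reads feature
+	(device path and up/down intervals are illustrative)::
+
+	  dmsetup create flakey-test --table \
+	    "0 $(blockdev --getsz /dev/sdb1) flakey /dev/sdb1 0 5 5 1 error_reads"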
+
drop_writes:
All write I/O is silently ignored.
Read I/O is handled correctly.
diff --git a/Documentation/admin-guide/device-mapper/dm-init.rst b/Documentation/admin-guide/device-mapper/dm-init.rst
index e5242ff17e9b..981d6a907699 100644
--- a/Documentation/admin-guide/device-mapper/dm-init.rst
+++ b/Documentation/admin-guide/device-mapper/dm-init.rst
@@ -123,3 +123,11 @@ Other examples (per target):
0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
fb1a5a0f00deb908d8b53cb270858975e76cf64105d412ce764225d53b8f3cfd
51934789604d1b92399c52e7cb149d1b3a1b74bbbcb103b2a0aaacbed5c08584
+
+For setups using device-mapper on top of asynchronously probed block
+devices (MMC, USB, ..), it may be necessary to tell dm-init to
+explicitly wait for them to become available before setting up the
+device-mapper tables. This can be done with the "dm-mod.waitfor="
+module parameter, which takes a list of devices to wait for::
+
+ dm-mod.waitfor=<device1>[,..,<deviceN>]
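+
+For example (device names are illustrative), waiting for two asynchronously
+probed devices before the tables are created::
+
+  dm-mod.waitfor="/dev/mmcblk0p1,/dev/sda2"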
diff --git a/Documentation/admin-guide/device-mapper/dm-zoned.rst b/Documentation/admin-guide/device-mapper/dm-zoned.rst
index 0fac051caeac..932383fe6e88 100644
--- a/Documentation/admin-guide/device-mapper/dm-zoned.rst
+++ b/Documentation/admin-guide/device-mapper/dm-zoned.rst
@@ -46,7 +46,7 @@ just like conventional zones.
The zones of the device(s) are separated into 2 types:
1) Metadata zones: these are conventional zones used to store metadata.
-Metadata zones are not reported as useable capacity to the user.
+Metadata zones are not reported as usable capacity to the user.
2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
diff --git a/Documentation/admin-guide/device-mapper/unstriped.rst b/Documentation/admin-guide/device-mapper/unstriped.rst
index 0a8d3eb3f072..5772ccdd1f5f 100644
--- a/Documentation/admin-guide/device-mapper/unstriped.rst
+++ b/Documentation/admin-guide/device-mapper/unstriped.rst
@@ -35,7 +35,7 @@ An example of undoing an existing dm-stripe
This small bash script will setup 4 loop devices and use the existing
striped target to combine the 4 devices into one. It then will use
-the unstriped target ontop of the striped device to access the
+the unstriped target on top of the striped device to access the
individual backing loop devices. We write data to the newly exposed
unstriped devices and verify the data written matches the correct
underlying device on the striped array::
@@ -110,8 +110,8 @@ to get a 92% reduction in read latency using this device mapper target.
Example dmsetup usage
=====================
-unstriped ontop of Intel NVMe device that has 2 cores
------------------------------------------------------
+unstriped on top of Intel NVMe device that has 2 cores
+------------------------------------------------------
::
@@ -124,8 +124,8 @@ respectively::
/dev/mapper/nvmset0
/dev/mapper/nvmset1
-unstriped ontop of striped with 4 drives using 128K chunk size
---------------------------------------------------------------
+unstriped on top of striped with 4 drives using 128K chunk size
+---------------------------------------------------------------
::
diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst
index 1a6b91368e59..a65c1602cb23 100644
--- a/Documentation/admin-guide/device-mapper/verity.rst
+++ b/Documentation/admin-guide/device-mapper/verity.rst
@@ -141,6 +141,10 @@ root_hash_sig_key_desc <key_description>
also gain new certificates at run time if they are signed by a certificate
already in the secondary trusted keyring.
+try_verify_in_tasklet
+    If verity hashes are in cache, verify data blocks in a kernel tasklet
+    instead of a workqueue. This option can reduce IO latency.
+
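+A hypothetical table line enabling this option (all other parameters are
+illustrative, and <root hash>/<salt> are placeholders) appends it as an
+optional argument::
+
+    0 1638400 verity 1 8:1 8:2 4096 4096 204800 1 sha256
+    <root hash> <salt> 1 try_verify_in_tasklet
+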
Theory of operation
===================
diff --git a/Documentation/admin-guide/device-mapper/writecache.rst b/Documentation/admin-guide/device-mapper/writecache.rst
index 10429779a91a..60c16b7fd5ac 100644
--- a/Documentation/admin-guide/device-mapper/writecache.rst
+++ b/Documentation/admin-guide/device-mapper/writecache.rst
@@ -20,6 +20,7 @@ Constructor parameters:
size)
5. the number of optional parameters (the parameters with an argument
count as two)
+
start_sector n (default: 0)
offset from the start of cache device in 512-byte sectors
high_watermark n (default: 50)
@@ -74,20 +75,21 @@ Constructor parameters:
the origin volume in the last n milliseconds
Status:
+
1. error indicator - 0 if there was no error, otherwise error number
2. the number of blocks
3. the number of free blocks
4. the number of blocks under writeback
-5. the number of read requests
-6. the number of read requests that hit the cache
-7. the number of write requests
-8. the number of write requests that hit uncommitted block
-9. the number of write requests that hit committed block
-10. the number of write requests that bypass the cache
-11. the number of write requests that are allocated in the cache
+5. the number of read blocks
+6. the number of read blocks that hit the cache
+7. the number of write blocks
+8. the number of write blocks that hit uncommitted block
+9. the number of write blocks that hit committed block
+10. the number of write blocks that bypass the cache
+11. the number of write blocks that are allocated in the cache
12. the number of write requests that are blocked on the freelist
13. the number of flush requests
-14. the number of discard requests
+14. the number of discarded blocks
Messages:
flush
diff --git a/Documentation/admin-guide/devices.rst b/Documentation/admin-guide/devices.rst
index 035275fedbdd..e3776d77374b 100644
--- a/Documentation/admin-guide/devices.rst
+++ b/Documentation/admin-guide/devices.rst
@@ -7,10 +7,9 @@ This list is the Linux Device List, the official registry of allocated
device numbers and ``/dev`` directory nodes for the Linux operating
system.
-The LaTeX version of this document is no longer maintained, nor is
-the document that used to reside at lanana.org. This version in the
-mainline Linux kernel is the master document. Updates shall be sent
-as patches to the kernel maintainers (see the
+The version of this document at lanana.org is no longer maintained. This
+version in the mainline Linux kernel is the master document. Updates
+shall be sent as patches to the kernel maintainers (see the
:ref:`Documentation/process/submitting-patches.rst <submittingpatches>` document).
Specifically explore the sections titled "CHAR and MISC DRIVERS", and
"BLOCK LAYER" in the MAINTAINERS file to find the right maintainers
diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt
index 922c23bb4372..06c525e01ea5 100644
--- a/Documentation/admin-guide/devices.txt
+++ b/Documentation/admin-guide/devices.txt
@@ -1933,7 +1933,7 @@
...
255= /dev/umem/d15p15 15th partition of 16th board.
- 117 char COSA/SRP synchronous serial card
+ 117 char [REMOVED] COSA/SRP synchronous serial card
0 = /dev/cosa0c0 1st board, 1st channel
1 = /dev/cosa0c1 1st board, 2nd channel
...
@@ -2339,13 +2339,7 @@
disks (see major number 3) except that the limit on
partitions is 31.
- 162 char Raw block device interface
- 0 = /dev/rawctl Raw I/O control device
- 1 = /dev/raw/raw1 First raw I/O device
- 2 = /dev/raw/raw2 Second raw I/O device
- ...
- max minor number of raw device is set by kernel config
- MAX_RAW_DEVS or raw module parameter 'max_raw_devs'
+ 162 char Used for (now removed) raw block device interface
163 char
@@ -3086,6 +3080,11 @@
...
255 = /dev/osd255 256th OSD Device
+ 261 char Compute Acceleration Devices
+ 0 = /dev/accel/accel0 First acceleration device
+ 1 = /dev/accel/accel1 Second acceleration device
+ ...
+
384-511 char RESERVED FOR DYNAMIC ASSIGNMENT
Character devices that request a dynamic allocation of major
number will take numbers starting from 511 and downward,
diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst
index b119b8277b3e..8dc668cc1216 100644
--- a/Documentation/admin-guide/dynamic-debug-howto.rst
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -5,143 +5,115 @@ Dynamic debug
Introduction
============
-This document describes how to use the dynamic debug (dyndbg) feature.
+Dynamic debug allows you to dynamically enable/disable kernel
+debug-print code to obtain additional kernel information.
-Dynamic debug is designed to allow you to dynamically enable/disable
-kernel code to obtain additional kernel information. Currently, if
-``CONFIG_DYNAMIC_DEBUG`` is set, then all ``pr_debug()``/``dev_dbg()`` and
-``print_hex_dump_debug()``/``print_hex_dump_bytes()`` calls can be dynamically
-enabled per-callsite.
+If ``/proc/dynamic_debug/control`` exists, your kernel has dynamic
+debug. You'll need root access (sudo su) to use this.
-If you do not want to enable dynamic debug globally (i.e. in some embedded
-system), you may set ``CONFIG_DYNAMIC_DEBUG_CORE`` as basic support of dynamic
-debug and add ``ccflags := -DDYNAMIC_DEBUG_MODULE`` into the Makefile of any
-modules which you'd like to dynamically debug later.
-
-If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is just
-shortcut for ``print_hex_dump(KERN_DEBUG)``.
-
-For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, format string is
-its ``prefix_str`` argument, if it is constant string; or ``hexdump``
-in case ``prefix_str`` is built dynamically.
+Dynamic debug provides:
-Dynamic debug has even more useful features:
+ * a Catalog of all *prdbgs* in your kernel.
+ ``cat /proc/dynamic_debug/control`` to see them.
- * Simple query language allows turning on and off debugging
- statements by matching any combination of 0 or 1 of:
+ * a Simple query/command language to alter *prdbgs* by selecting on
+ any combination of 0 or 1 of:
- source filename
- function name
- line number (including ranges of line numbers)
- module name
- format string
-
- * Provides a debugfs control file: ``<debugfs>/dynamic_debug/control``
- which can be read to display the complete list of known debug
- statements, to help guide you
-
-Controlling dynamic debug Behaviour
-===================================
-
-The behaviour of ``pr_debug()``/``dev_dbg()`` are controlled via writing to a
-control file in the 'debugfs' filesystem. Thus, you must first mount
-the debugfs filesystem, in order to make use of this feature.
-Subsequently, we refer to the control file as:
-``<debugfs>/dynamic_debug/control``. For example, if you want to enable
-printing from source file ``svcsock.c``, line 1603 you simply do::
-
- nullarbor:~ # echo 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
-
-If you make a mistake with the syntax, the write will fail thus::
-
- nullarbor:~ # echo 'file svcsock.c wtf 1 +p' >
- <debugfs>/dynamic_debug/control
- -bash: echo: write error: Invalid argument
-
-Note, for systems without 'debugfs' enabled, the control file can be
-found in ``/proc/dynamic_debug/control``.
+ - class name (as known/declared by each module)
Viewing Dynamic Debug Behaviour
===============================
-You can view the currently configured behaviour of all the debug
-statements via::
+You can view the currently configured behaviour in the *prdbg* catalog::
- nullarbor:~ # cat <debugfs>/dynamic_debug/control
+ :#> head -n7 /proc/dynamic_debug/control
# filename:lineno [module]function flags format
- net/sunrpc/svc_rdma.c:323 [svcxprt_rdma]svc_rdma_cleanup =_ "SVCRDMA Module Removed, deregister RPC RDMA transport\012"
- net/sunrpc/svc_rdma.c:341 [svcxprt_rdma]svc_rdma_init =_ "\011max_inline : %d\012"
- net/sunrpc/svc_rdma.c:340 [svcxprt_rdma]svc_rdma_init =_ "\011sq_depth : %d\012"
- net/sunrpc/svc_rdma.c:338 [svcxprt_rdma]svc_rdma_init =_ "\011max_requests : %d\012"
- ...
+  init/main.c:1179 [main]initcall_blacklist =_ "blacklisting initcall %s\012"
+ init/main.c:1218 [main]initcall_blacklisted =_ "initcall %s blacklisted\012"
+ init/main.c:1424 [main]run_init_process =_ " with arguments:\012"
+ init/main.c:1426 [main]run_init_process =_ " %s\012"
+ init/main.c:1427 [main]run_init_process =_ " with environment:\012"
+ init/main.c:1429 [main]run_init_process =_ " %s\012"
+The 3rd space-delimited column shows the current flags, preceded by
+a ``=`` for easy use with grep/cut. ``=p`` shows enabled callsites.
-You can also apply standard Unix text manipulation filters to this
-data, e.g.::
+Controlling dynamic debug Behaviour
+===================================
- nullarbor:~ # grep -i rdma <debugfs>/dynamic_debug/control | wc -l
- 62
+The behaviour of *prdbg* sites is controlled by writing
+query/commands to the control file. Example::
- nullarbor:~ # grep -i tcp <debugfs>/dynamic_debug/control | wc -l
- 42
+ # grease the interface
+ :#> alias ddcmd='echo $* > /proc/dynamic_debug/control'
-The third column shows the currently enabled flags for each debug
-statement callsite (see below for definitions of the flags). The
-default value, with no flags enabled, is ``=_``. So you can view all
-the debug statement callsites with any non-default flags::
+ :#> ddcmd '-p; module main func run* +p'
+ :#> grep =p /proc/dynamic_debug/control
+ init/main.c:1424 [main]run_init_process =p " with arguments:\012"
+ init/main.c:1426 [main]run_init_process =p " %s\012"
+ init/main.c:1427 [main]run_init_process =p " with environment:\012"
+ init/main.c:1429 [main]run_init_process =p " %s\012"
- nullarbor:~ # awk '$3 != "=_"' <debugfs>/dynamic_debug/control
- # filename:lineno [module]function flags format
- net/sunrpc/svcsock.c:1603 [sunrpc]svc_send p "svc_process: st_sendto returned %d\012"
+Error messages go to console/syslog::
+
+ :#> ddcmd mode foo +p
+ dyndbg: unknown keyword "mode"
+ dyndbg: query parse failed
+ bash: echo: write error: Invalid argument
+
+If debugfs is also enabled and mounted, ``dynamic_debug/control`` is
+also under the mount-dir, typically ``/sys/kernel/debug/``.
Command Language Reference
==========================
-At the lexical level, a command comprises a sequence of words separated
+At the basic lexical level, a command is a sequence of words separated
by spaces or tabs. So these are all equivalent::
- nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
- nullarbor:~ # echo -n ' file svcsock.c line 1603 +p ' >
- <debugfs>/dynamic_debug/control
- nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd file svcsock.c line 1603 +p
+ :#> ddcmd "file svcsock.c line 1603 +p"
+ :#> ddcmd ' file svcsock.c line 1603 +p '
Command submissions are bounded by a write() system call.
Multiple commands can be written together, separated by ``;`` or ``\n``::
- ~# echo "func pnpacpi_get_resources +p; func pnp_assign_mem +p" \
- > <debugfs>/dynamic_debug/control
+ :#> ddcmd "func pnpacpi_get_resources +p; func pnp_assign_mem +p"
+ :#> ddcmd <<"EOC"
+ func pnpacpi_get_resources +p
+ func pnp_assign_mem +p
+ EOC
+ :#> cat query-batch-file > /proc/dynamic_debug/control
-If your query set is big, you can batch them too::
+You can also use wildcards in each query term. The match rule supports
+``*`` (matches zero or more characters) and ``?`` (matches exactly one
+character). For example, you can match all usb drivers::
- ~# cat query-batch-file > <debugfs>/dynamic_debug/control
+ :#> ddcmd file "drivers/usb/*" +p # "" to suppress shell expansion
-Another way is to use wildcards. The match rule supports ``*`` (matches
-zero or more characters) and ``?`` (matches exactly one character). For
-example, you can match all usb drivers::
-
- ~# echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control
-
-At the syntactical level, a command comprises a sequence of match
-specifications, followed by a flags change specification::
+Syntactically, a command is a sequence of keyword/value pairs, followed
+by a flags change or setting::
command ::= match-spec* flags-spec
-The match-spec's are used to choose a subset of the known pr_debug()
-callsites to which to apply the flags-spec. Think of them as a query
-with implicit ANDs between each pair. Note that an empty list of
-match-specs will select all debug statement callsites.
+The match-spec's select *prdbgs* from the catalog, upon which to apply
+the flags-spec; all constraints are ANDed together. An absent keyword
+is the same as keyword "*".
+
-A match specification comprises a keyword, which controls the
-attribute of the callsite to be compared, and a value to compare
-against. Possible keywords are:::
+A match specification is a keyword, which selects the attribute of
+the callsite to be compared, and a value to compare against. Possible
+keywords are:::
match-spec ::= 'func' string |
'file' string |
'module' string |
'format' string |
+ 'class' string |
'line' line-range
line-range ::= lineno |
@@ -203,6 +175,16 @@ format
format "nfsd: SETATTR" // a neater way to match a format with whitespace
format 'nfsd: SETATTR' // yet another way to match a format with whitespace
+class
+ The given class_name is validated against each module, which may
+ have declared a list of known class_names. If the class_name is
+ found for a module, callsite & class matching and adjustment
+ proceeds. Examples::
+
+ class DRM_UT_KMS # a DRM.debug category
+ class JUNK # silent non-match
+ // class TLD_* # NOTICE: no wildcard in class names
+
line
The given line number or range of line numbers is compared
against the line number of each ``pr_debug()`` callsite. A single
@@ -228,17 +210,16 @@ of the characters::
The flags are::
p enables the pr_debug() callsite.
- f Include the function name in the printed message
- l Include line number in the printed message
- m Include module name in the printed message
- t Include thread ID in messages not generated from interrupt context
- _ No flags are set. (Or'd with others on input)
+ _ enables no flags.
-For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only ``p`` flag
-have meaning, other flags ignored.
+ Decorator flags add to the message-prefix, in order:
+ t Include thread ID, or <intr>
+ m Include module name
+ f Include the function name
+ l Include line number
-For display, the flags are preceded by ``=``
-(mnemonic: what the flags are currently equal to).
+For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only
+the ``p`` flag has meaning, other flags are ignored.
Note the regexp ``^[-+=][flmpt_]+$`` matches a flags specification.
To clear all flags at once, use ``=_`` or ``-flmpt``.
@@ -249,8 +230,7 @@ Debug messages during Boot Process
To activate debug messages for core code and built-in modules during
the boot process, even before userspace and debugfs exists, use
-``dyndbg="QUERY"``, ``module.dyndbg="QUERY"``, or ``ddebug_query="QUERY"``
-(``ddebug_query`` is obsoleted by ``dyndbg``, and deprecated). QUERY follows
+``dyndbg="QUERY"`` or ``module.dyndbg="QUERY"``. QUERY follows
the syntax described above, but must not exceed 1023 characters. Your
bootloader may impose lower limits.
@@ -270,8 +250,7 @@ this boot parameter for debugging purposes.
If ``foo`` module is not built-in, ``foo.dyndbg`` will still be processed at
boot time, without effect, but will be reprocessed when module is
-loaded later. ``ddebug_query=`` and bare ``dyndbg=`` are only processed at
-boot.
+loaded later. Bare ``dyndbg=`` is only processed at boot.
Debug Messages at Module Initialization Time
@@ -315,7 +294,7 @@ For ``CONFIG_DYNAMIC_DEBUG`` kernels, any settings given at boot-time (or
enabled by ``-DDEBUG`` flag during compilation) can be disabled later via
the debugfs interface if the debug messages are no longer needed::
- echo "module module_name -p" > <debugfs>/dynamic_debug/control
+ echo "module module_name -p" > /proc/dynamic_debug/control
Examples
========
@@ -323,43 +302,75 @@ Examples
::
// enable the message at line 1603 of file svcsock.c
- nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'file svcsock.c line 1603 +p'
// enable all the messages in file svcsock.c
- nullarbor:~ # echo -n 'file svcsock.c +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'file svcsock.c +p'
// enable all the messages in the NFS server module
- nullarbor:~ # echo -n 'module nfsd +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'module nfsd +p'
// enable all 12 messages in the function svc_process()
- nullarbor:~ # echo -n 'func svc_process +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'func svc_process +p'
// disable all 12 messages in the function svc_process()
- nullarbor:~ # echo -n 'func svc_process -p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'func svc_process -p'
// enable messages for NFS calls READ, READLINK, READDIR and READDIR+.
- nullarbor:~ # echo -n 'format "nfsd: READ" +p' >
- <debugfs>/dynamic_debug/control
+ :#> ddcmd 'format "nfsd: READ" +p'
// enable messages in files of which the paths include string "usb"
- nullarbor:~ # echo -n 'file *usb* +p' > <debugfs>/dynamic_debug/control
+  :#> ddcmd 'file *usb* +p'
// enable all messages
- nullarbor:~ # echo -n '+p' > <debugfs>/dynamic_debug/control
+  :#> ddcmd '+p'
// add module, function to all enabled messages
- nullarbor:~ # echo -n '+mf' > <debugfs>/dynamic_debug/control
+  :#> ddcmd '+mf'
// boot-args example, with newlines and comments for readability
Kernel command line: ...
- // see whats going on in dyndbg=value processing
- dynamic_debug.verbose=1
- // enable pr_debugs in 2 builtins, #cmt is stripped
- dyndbg="module params +p #cmt ; module sys +p"
+ // see what's going on in dyndbg=value processing
+ dynamic_debug.verbose=3
+ // enable pr_debugs in the btrfs module (can be builtin or loadable)
+ btrfs.dyndbg="+p"
+ // enable pr_debugs in all files under init/
+ // and the function parse_one, #cmt is stripped
+ dyndbg="file init/* +p #cmt ; func parse_one +p"
// enable pr_debugs in 2 functions in a module loaded later
pc87360.dyndbg="func pc87360_init_device +p; func pc87360_find +p"
+
+Kernel Configuration
+====================
+
+Dynamic Debug is enabled via kernel config items::
+
+ CONFIG_DYNAMIC_DEBUG=y # build catalog, enables CORE
+ CONFIG_DYNAMIC_DEBUG_CORE=y # enable mechanics only, skip catalog
+
+If you do not want to enable dynamic debug globally (i.e. in some embedded
+system), you may set ``CONFIG_DYNAMIC_DEBUG_CORE`` as basic support of dynamic
+debug and add ``ccflags := -DDYNAMIC_DEBUG_MODULE`` into the Makefile of any
+modules which you'd like to dynamically debug later.
+
+
+Kernel *prdbg* API
+==================
+
+The following functions are cataloged and controllable when dynamic
+debug is enabled::
+
+ pr_debug()
+ dev_dbg()
+ print_hex_dump_debug()
+ print_hex_dump_bytes()
+
+Otherwise, they are off by default; ``ccflags += -DDEBUG`` or
+``#define DEBUG`` in a source file will enable them appropriately.
+
+If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is
+just a shortcut for ``print_hex_dump(KERN_DEBUG)``.
+
+For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, the format string
+is its ``prefix_str`` argument, if it is a constant string; or ``hexdump``
+in case ``prefix_str`` is built dynamically.
diff --git a/Documentation/admin-guide/efi-stub.rst b/Documentation/admin-guide/efi-stub.rst
index 833edb0d0bc4..b24e7c40d832 100644
--- a/Documentation/admin-guide/efi-stub.rst
+++ b/Documentation/admin-guide/efi-stub.rst
@@ -7,10 +7,10 @@ as a PE/COFF image, thereby convincing EFI firmware loaders to load
it as an EFI executable. The code that modifies the bzImage header,
along with the EFI-specific entry point that the firmware loader
jumps to are collectively known as the "EFI boot stub", and live in
-arch/x86/boot/header.S and arch/x86/boot/compressed/eboot.c,
+arch/x86/boot/header.S and drivers/firmware/efi/libstub/x86-stub.c,
respectively. For ARM the EFI stub is implemented in
arch/arm/boot/compressed/efi-header.S and
-arch/arm/boot/compressed/efi-stub.c. EFI stub code that is shared
+drivers/firmware/efi/libstub/arm32-stub.c. EFI stub code that is shared
between architectures is in drivers/firmware/efi/libstub.
For arm64, there is no compressed kernel support, so the Image itself
diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
index 4c559e08d11e..5740d85439ff 100644
--- a/Documentation/admin-guide/ext4.rst
+++ b/Documentation/admin-guide/ext4.rst
@@ -489,9 +489,6 @@ Files in /sys/fs/ext4/<devname>:
multiple of this tuning parameter if the stripe size is not set in the
ext4 superblock
- mb_max_inode_prealloc
- The maximum length of per-inode ext4_prealloc_space list.
-
mb_max_to_scan
The maximum number of extents the multiblock allocator will search to
find the best extent.
diff --git a/Documentation/admin-guide/filesystem-monitoring.rst b/Documentation/admin-guide/filesystem-monitoring.rst
new file mode 100644
index 000000000000..ab8dba76283c
--- /dev/null
+++ b/Documentation/admin-guide/filesystem-monitoring.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
+File system Monitoring with fanotify
+====================================
+
+File system Error Reporting
+===========================
+
+Fanotify supports the FAN_FS_ERROR event type for file system-wide error
+reporting. It is meant to be used by file system health monitoring
+daemons, which listen for these events and take actions (notify
+sysadmin, start recovery) when a file system problem is detected.
+
+By design, a FAN_FS_ERROR notification exposes sufficient information
+for a monitoring tool to know a problem in the file system has happened.
+It doesn't necessarily provide a user space application with semantics
+to verify an IO operation was successfully executed. That is out of
+scope for this feature. Instead, it is only meant as a framework for
+early file system problem detection and reporting recovery tools.
+
+When a file system operation fails, it is common for dozens of kernel
+errors to cascade after the initial failure, hiding the original failure
+log, which is usually the most useful debug data to troubleshoot the
+problem. For this reason, FAN_FS_ERROR tries to report only the first
+error that occurred for a file system since the last notification, and
+it simply counts additional errors. This ensures that the most
+important pieces of information are never lost.
+
+FAN_FS_ERROR requires the fanotify group to be setup with the
+FAN_REPORT_FID flag.
+
+At the time of this writing, the only file system that emits FAN_FS_ERROR
+notifications is Ext4.
+
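+A minimal sketch of setting up such a listener (the watched mount point
+"/mnt" is hypothetical and error handling is abbreviated)::
+
+   #include <fcntl.h>
+   #include <stdio.h>
+   #include <sys/fanotify.h>
+
+   int main(void)
+   {
+           int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, 0);
+
+           if (fd < 0) {
+                   perror("fanotify_init");
+                   return 1;
+           }
+
+           /* Watch the whole filesystem backing /mnt for errors. */
+           if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
+                             FAN_FS_ERROR, AT_FDCWD, "/mnt") < 0) {
+                   perror("fanotify_mark");
+                   return 1;
+           }
+
+           /* read(fd, ...) would now return FAN_FS_ERROR events; see
+            * samples/fanotify/fs-monitor.c for event parsing. */
+           return 0;
+   }
+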
+A FAN_FS_ERROR Notification has the following format::
+
+ [ Notification Metadata (Mandatory) ]
+ [ Generic Error Record (Mandatory) ]
+ [ FID record (Mandatory) ]
+
+The order of records is not guaranteed, and new records might be added
+in the future. Therefore, applications must not rely on the order and
+must be prepared to skip over unknown records. Please refer to
+``samples/fanotify/fs-monitor.c`` for an example parser.
+
+Generic error record
+--------------------
+
+The generic error record provides enough information for a file system
+agnostic tool to learn about a problem in the file system, without
+providing any additional details about the problem. This record is
+identified by ``struct fanotify_event_info_header.info_type`` being set
+to FAN_EVENT_INFO_TYPE_ERROR.
+
+ ::
+
+ struct fanotify_event_info_error {
+ struct fanotify_event_info_header hdr;
+ __s32 error;
+ __u32 error_count;
+ };
+
+The `error` field identifies the type of error using errno values.
+`error_count` tracks the number of errors that have occurred since the
+last notification and were suppressed to preserve the original error
+information.
+
+FID record
+----------
+
+The FID record can be used to uniquely identify the inode that triggered
+the error through the combination of fsid and file handle. A file system
+specific application can use that information to attempt a recovery
+procedure. Errors that are not related to an inode are reported with an
+empty file handle of type FILEID_INVALID.
diff --git a/Documentation/admin-guide/gpio/gpio-sim.rst b/Documentation/admin-guide/gpio/gpio-sim.rst
new file mode 100644
index 000000000000..1cc5567a4bbe
--- /dev/null
+++ b/Documentation/admin-guide/gpio/gpio-sim.rst
@@ -0,0 +1,134 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+Configfs GPIO Simulator
+=======================
+
+The configfs GPIO Simulator (gpio-sim) provides a way to create simulated GPIO
+chips for testing purposes. The lines exposed by these chips can be accessed
+using the standard GPIO character device interface as well as manipulated
+using sysfs attributes.
+
+Creating simulated chips
+------------------------
+
+The gpio-sim module registers a configfs subsystem called ``'gpio-sim'``. For
+details of the configfs filesystem, please refer to the configfs documentation.
+
+The user can create a hierarchy of configfs groups and items as well as modify
+values of exposed attributes. Once the chip is instantiated, this hierarchy
+will be translated to appropriate device properties. The general structure is:
+
+**Group:** ``/config/gpio-sim``
+
+This is the top directory of the gpio-sim configfs tree.
+
+**Group:** ``/config/gpio-sim/gpio-device``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/dev_name``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/live``
+
+This is a directory representing a GPIO platform device. The ``'dev_name'``
+attribute is read-only and allows the user-space to read the platform device
+name (e.g. ``'gpio-sim.0'``). The ``'live'`` attribute allows to trigger the
+actual creation of the device once it's fully configured. The accepted values
+are: ``'1'`` to enable the simulated device and ``'0'`` to disable and tear
+it down.
+
+**Group:** ``/config/gpio-sim/gpio-device/gpio-bankX``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/chip_name``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/num_lines``
+
+This group represents a bank of GPIOs under the top platform device. The
+``'chip_name'`` attribute is read-only and allows user-space to read the
+device name of the bank device. The ``'num_lines'`` attribute is used to
+specify the number of lines exposed by this bank.
+
+**Group:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/name``
+
+This group represents a single line at the offset Y. The 'name' attribute
+is used to set the line name as represented by the 'gpio-line-names' property.
+
+**Item:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog/name``
+
+**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog/direction``
+
+This item makes the gpio-sim module hog the associated line. The ``'name'``
+attribute specifies the in-kernel consumer name to use. The ``'direction'``
+attribute specifies the hog direction and must be one of: ``'input'``,
+``'output-high'`` or ``'output-low'``.
+
+Inside each bank directory, there's a set of attributes that can be used to
+configure the new chip. Additionally the user can ``mkdir()`` subdirectories
+inside the chip's directory that allow passing additional configuration for
+specific lines. The name of those subdirectories must take the form of:
+``'line<offset>'`` (e.g. ``'line0'``, ``'line20'``, etc.) as the name will be
+used by the module to assign the config to the specific line at given offset.
+
+Once the configuration is complete, the ``'live'`` attribute must be set to 1 in
+order to instantiate the chip. It can be set back to 0 to destroy the simulated
+chip. The module will synchronously wait for the new simulated device to be
+successfully probed and if this doesn't happen, writing to ``'live'`` will
+result in an error.
+
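+For example (the device directory name "dev0" is arbitrary), a single
+8-line bank can be created and instantiated like this::
+
+  mkdir -p /config/gpio-sim/dev0/gpio-bank0
+  echo 8 > /config/gpio-sim/dev0/gpio-bank0/num_lines
+  echo 1 > /config/gpio-sim/dev0/live
+  cat /config/gpio-sim/dev0/gpio-bank0/chip_name
+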
+Simulated GPIO chips can also be defined in device-tree. The compatible string
+must be: ``"gpio-simulator"``. Supported properties are:
+
+ ``"gpio-sim,label"`` - chip label
+
+Other standard GPIO properties (like ``"gpio-line-names"``, ``"ngpios"`` or
+``"gpio-hog"``) are also supported. Please refer to the GPIO documentation for
+details.
+
+An example device-tree code defining a GPIO simulator:
+
+.. code-block :: none
+
+ gpio-sim {
+ compatible = "gpio-simulator";
+
+ bank0 {
+ gpio-controller;
+ #gpio-cells = <2>;
+ ngpios = <16>;
+ gpio-sim,label = "dt-bank0";
+ gpio-line-names = "", "sim-foo", "", "sim-bar";
+ };
+
+ bank1 {
+ gpio-controller;
+ #gpio-cells = <2>;
+ ngpios = <8>;
+ gpio-sim,label = "dt-bank1";
+
+ line3 {
+ gpio-hog;
+ gpios = <3 0>;
+ output-high;
+ line-name = "sim-hog-from-dt";
+ };
+ };
+ };
+
+Manipulating simulated lines
+----------------------------
+
+Each simulated GPIO chip creates a separate sysfs group under its device
+directory for each exposed line
+(e.g. ``/sys/devices/platform/gpio-sim.X/gpiochipY/``). The name of each group
+is of the form: ``'sim_gpioX'`` where X is the offset of the line. Inside each
+group there are two attributes:
+
+  ``pull`` - allows reading and setting the current simulated pull setting
+            for every line; when writing, the value must be one of: ``'pull-up'``,
+ ``'pull-down'``
+
+  ``value`` - allows reading the current value of the line, which may be
+ different from the pull if the line is being driven from
+ user-space
diff --git a/Documentation/admin-guide/gpio/index.rst b/Documentation/admin-guide/gpio/index.rst
index 7db367572f30..f6861ca16ffe 100644
--- a/Documentation/admin-guide/gpio/index.rst
+++ b/Documentation/admin-guide/gpio/index.rst
@@ -10,6 +10,7 @@ gpio
gpio-aggregator
sysfs
gpio-mockup
+ gpio-sim
.. only:: subproject and html
diff --git a/Documentation/admin-guide/gpio/sysfs.rst b/Documentation/admin-guide/gpio/sysfs.rst
index ec09ffd983e7..35171d15f78d 100644
--- a/Documentation/admin-guide/gpio/sysfs.rst
+++ b/Documentation/admin-guide/gpio/sysfs.rst
@@ -145,7 +145,7 @@ requested using gpio_request()::
/* export the GPIO to userspace */
int gpiod_export(struct gpio_desc *desc, bool direction_may_change);
- /* reverse gpio_export() */
+ /* reverse gpiod_export() */
void gpiod_unexport(struct gpio_desc *desc);
/* create a sysfs link to an exported GPIO node */
diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
index 0febe458597c..cf1eeefdfc32 100644
--- a/Documentation/admin-guide/hw-vuln/core-scheduling.rst
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -61,8 +61,9 @@ arg3:
``pid`` of the task for which the operation applies.
arg4:
- ``pid_type`` for which the operation applies. It is of type ``enum pid_type``.
- For example, if arg4 is ``PIDTYPE_TGID``, then the operation of this command
+ ``pid_type`` for which the operation applies. It is one of
+ ``PR_SCHED_CORE_SCOPE_``-prefixed macro constants. For example, if arg4
+ is ``PR_SCHED_CORE_SCOPE_THREAD_GROUP``, then the operation of this command
will be performed for all tasks in the task group of ``pid``.
arg5:
diff --git a/Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst b/Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst
new file mode 100644
index 000000000000..875616d675fe
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/cross-thread-rsb.rst
@@ -0,0 +1,91 @@
+
+.. SPDX-License-Identifier: GPL-2.0
+
+Cross-Thread Return Address Predictions
+=======================================
+
+Certain AMD and Hygon processors are subject to a cross-thread return address
+predictions vulnerability. When running in SMT mode and one sibling thread
+transitions out of C0 state, the other sibling thread could use return target
+predictions from the sibling thread that transitioned out of C0.
+
+The Spectre v2 mitigations protect the Linux kernel, as it fills the return
+address prediction entries with safe targets when context switching to the idle
+thread. However, KVM does allow a VMM to prevent exiting guest mode when
+transitioning out of C0. This could result in a guest-controlled return target
+being consumed by the sibling thread.
+
+Affected processors
+-------------------
+
+The following CPUs are vulnerable:
+
+ - AMD Family 17h processors
+ - Hygon Family 18h processors
+
+Related CVEs
+------------
+
+The following CVE entry is related to this issue:
+
+ ============== =======================================
+ CVE-2022-27672 Cross-Thread Return Address Predictions
+ ============== =======================================
+
+Problem
+-------
+
+Affected SMT-capable processors support 1T and 2T modes of execution when SMT
+is enabled. In 2T mode, both threads in a core are executing code. For the
+processor core to enter 1T mode, it is required that one of the threads
+requests to transition out of the C0 state. This can be communicated with the
+HLT instruction or with an MWAIT instruction that requests non-C0.
+When the thread re-enters the C0 state, the processor transitions back
+to 2T mode, assuming the other thread is also still in C0 state.
+
+In affected processors, the return address predictor (RAP) is partitioned
+depending on the SMT mode. For instance, in 2T mode each thread uses a private
+16-entry RAP, but in 1T mode, the active thread uses a 32-entry RAP. Upon
+transition between 1T/2T mode, the RAP contents are not modified but the RAP
+pointers (which control the next return target to use for predictions) may
+change. This behavior may result in return targets from one SMT thread being
+used by RET predictions in the sibling thread following a 1T/2T switch. In
+particular, a RET instruction executed immediately after a transition to 1T may
+use a return target from the thread that just became idle. In theory, this
+could lead to information disclosure if the return targets used do not come
+from trustworthy code.
+
+Attack scenarios
+----------------
+
+An attack can be mounted on affected processors by performing a series of CALL
+instructions with targeted return locations and then transitioning out of C0
+state.
+
+Mitigation mechanism
+--------------------
+
+Before entering idle state, the kernel context switches to the idle thread. The
+context switch fills the RAP entries (referred to as the RSB in Linux) with safe
+targets by performing a sequence of CALL instructions.
+
+The second mitigation prevents a guest VM from directly putting the processor
+into an idle state by intercepting HLT and MWAIT instructions.
+
+Both mitigations are required to fully address this issue.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+
+Use existing Spectre v2 mitigations that will fill the RSB on context switch.
+
+Mitigation control for KVM - module parameter
+---------------------------------------------
+
+By default, the KVM hypervisor mitigates this issue by intercepting guest
+attempts to transition out of C0. A VMM can use the KVM_CAP_X86_DISABLE_EXITS
+capability to override those interceptions, but since this is not common, the
+mitigation that covers this path is not enabled by default.
+
+The mitigation for the KVM_CAP_X86_DISABLE_EXITS capability can be turned on
+using the boolean module parameter mitigate_smt_rsb, e.g. ``kvm.mitigate_smt_rsb=1``.
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index 8cbc711cda93..e0614760a99e 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -17,3 +17,5 @@ are configurable at compile, boot or run time.
special-register-buffer-data-sampling.rst
core-scheduling.rst
l1d_flush.rst
+ processor_mmio_stale_data.rst
+ cross-thread-rsb.rst
diff --git a/Documentation/admin-guide/hw-vuln/mds.rst b/Documentation/admin-guide/hw-vuln/mds.rst
index 2d19c9f4c1fe..48ca0bd85604 100644
--- a/Documentation/admin-guide/hw-vuln/mds.rst
+++ b/Documentation/admin-guide/hw-vuln/mds.rst
@@ -58,14 +58,14 @@ Because the buffers are potentially shared between Hyper-Threads cross
Hyper-Thread attacks are possible.
Deeper technical information is available in the MDS specific x86
-architecture section: :ref:`Documentation/x86/mds.rst <mds>`.
+architecture section: :ref:`Documentation/arch/x86/mds.rst <mds>`.
Attack scenarios
----------------
-Attacks against the MDS vulnerabilities can be mounted from malicious non
-priviledged user space applications running on hosts or guest. Malicious
+Attacks against the MDS vulnerabilities can be mounted from malicious non-
+privileged user space applications running on hosts or guests. Malicious
guest OSes can obviously mount attacks as well.
Contrary to other speculation based vulnerabilities the MDS vulnerability
diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
new file mode 100644
index 000000000000..c98fd11907cc
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
@@ -0,0 +1,260 @@
+=========================================
+Processor MMIO Stale Data Vulnerabilities
+=========================================
+
+Processor MMIO Stale Data Vulnerabilities are a class of memory-mapped I/O
+(MMIO) vulnerabilities that can expose data. The sequences of operations for
+exposing data range from simple to very complex. Because most of the
+vulnerabilities require the attacker to have access to MMIO, many environments
+are not affected. System environments using virtualization where MMIO access is
+provided to untrusted guests may need mitigation. These vulnerabilities are
+not transient execution attacks. However, these vulnerabilities may propagate
+stale data into core fill buffers where the data can subsequently be inferred
+by an unmitigated transient execution attack. Mitigation for these
+vulnerabilities includes a combination of microcode update and software
+changes, depending on the platform and usage model. Some of these mitigations
+are similar to those used to mitigate Microarchitectural Data Sampling (MDS) or
+those used to mitigate Special Register Buffer Data Sampling (SRBDS).
+
+Data Propagators
+================
+Propagators are operations that result in stale data being copied or moved from
+one microarchitectural buffer or register to another. Processor MMIO Stale Data
+Vulnerabilities are operations that may result in stale data being directly
+read into an architectural, software-visible state or sampled from a buffer or
+register.
+
+Fill Buffer Stale Data Propagator (FBSDP)
+-----------------------------------------
+Stale data may propagate from fill buffers (FB) into the non-coherent portion
+of the uncore on some non-coherent writes. Fill buffer propagation by itself
+does not make stale data architecturally visible. Stale data must be propagated
+to a location where it is subject to reading or sampling.
+
+Sideband Stale Data Propagator (SSDP)
+-------------------------------------
+The sideband stale data propagator (SSDP) is limited to the client (including
+Intel Xeon server E3) uncore implementation. The sideband response buffer is
+shared by all client cores. For non-coherent reads that go to sideband
+destinations, the uncore logic returns 64 bytes of data to the core, including
+both requested data and unrequested stale data, from a transaction buffer and
+the sideband response buffer. As a result, stale data from the sideband
+response and transaction buffers may now reside in a core fill buffer.
+
+Primary Stale Data Propagator (PSDP)
+------------------------------------
+The primary stale data propagator (PSDP) is limited to the client (including
+Intel Xeon server E3) uncore implementation. Similar to the sideband response
+buffer, the primary response buffer is shared by all client cores. For some
+processors, MMIO primary reads will return 64 bytes of data to the core fill
+buffer including both requested data and unrequested stale data. This is
+similar to the sideband stale data propagator.
+
+Vulnerabilities
+===============
+Device Register Partial Write (DRPW) (CVE-2022-21166)
+-----------------------------------------------------
+Some endpoint MMIO registers incorrectly handle writes that are smaller than
+the register size. Instead of aborting the write or only copying the correct
+subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than
+specified by the write transaction may be written to the register. On
+processors affected by FBSDP, this may expose stale data from the fill buffers
+of the core that created the write transaction.
+
+Shared Buffers Data Sampling (SBDS) (CVE-2022-21125)
+----------------------------------------------------
+After propagators may have moved data around the uncore and copied stale data
+into client core fill buffers, processors affected by MFBDS (Microarchitectural
+Fill Buffer Data Sampling, an MDS variant) can leak data from the fill buffer.
+It is limited to the client (including Intel Xeon server E3)
+uncore implementation.
+
+Shared Buffers Data Read (SBDR) (CVE-2022-21123)
+------------------------------------------------
+It is similar to Shared Buffer Data Sampling (SBDS) except that the data is
+directly read into the architectural software-visible state. It is limited to
+the client (including Intel Xeon server E3) uncore implementation.
+
+Affected Processors
+===================
+Not all CPUs are affected by all variants. For instance, most
+processors for the server market (excluding Intel Xeon E3 processors) are
+impacted by only Device Register Partial Write (DRPW).
+
+Below is the list of affected Intel processors [#f1]_:
+
+ =================== ============ =========
+ Common name Family_Model Steppings
+ =================== ============ =========
+ HASWELL_X 06_3FH 2,4
+ SKYLAKE_L 06_4EH 3
+ BROADWELL_X 06_4FH All
+ SKYLAKE_X 06_55H 3,4,6,7,11
+ BROADWELL_D 06_56H 3,4,5
+ SKYLAKE 06_5EH 3
+ ICELAKE_X 06_6AH 4,5,6
+ ICELAKE_D 06_6CH 1
+ ICELAKE_L 06_7EH 5
+ ATOM_TREMONT_D 06_86H All
+ LAKEFIELD 06_8AH 1
+ KABYLAKE_L 06_8EH 9 to 12
+ ATOM_TREMONT 06_96H 1
+ ATOM_TREMONT_L 06_9CH 0
+ KABYLAKE 06_9EH 9 to 13
+ COMETLAKE 06_A5H 2,3,5
+ COMETLAKE_L 06_A6H 0,1
+ ROCKETLAKE 06_A7H 1
+ =================== ============ =========
+
+If a CPU is in the affected processor list, but not affected by a variant, this
+is indicated by new bits in MSR IA32_ARCH_CAPABILITIES. As described in a later
+section, the mitigation largely remains the same for all the variants, i.e. to
+clear the CPU fill buffers via the VERW instruction.
+
+New bits in MSRs
+================
+Newer processors, and microcode updates on existing affected processors, add
+new bits to the IA32_ARCH_CAPABILITIES MSR. These bits can be used to enumerate
+specific variants of Processor MMIO Stale Data vulnerabilities and mitigation
+capability.
+
+MSR IA32_ARCH_CAPABILITIES
+--------------------------
+Bit 13 - SBDR_SSDP_NO - When set, processor is not affected by either the
+ Shared Buffers Data Read (SBDR) vulnerability or the sideband stale
+ data propagator (SSDP).
+Bit 14 - FBSDP_NO - When set, processor is not affected by the Fill Buffer
+ Stale Data Propagator (FBSDP).
+Bit 15 - PSDP_NO - When set, processor is not affected by Primary Stale Data
+ Propagator (PSDP).
+Bit 17 - FB_CLEAR - When set, VERW instruction will overwrite CPU fill buffer
+ values as part of MD_CLEAR operations. Processors that do not
+ enumerate MDS_NO (meaning they are affected by MDS) but that do
+ enumerate support for both L1D_FLUSH and MD_CLEAR implicitly enumerate
+ FB_CLEAR as part of their MD_CLEAR support.
+Bit 18 - FB_CLEAR_CTRL - Processor supports read and write to MSR
+ IA32_MCU_OPT_CTRL[FB_CLEAR_DIS]. On such processors, the FB_CLEAR_DIS
+ bit can be set to cause the VERW instruction to not perform the
+ FB_CLEAR action. Not all processors that support FB_CLEAR will support
+ FB_CLEAR_CTRL.
+
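+These bits can be inspected on a running system with the rdmsr utility from
+msr-tools (an illustration, assuming the msr kernel module is available;
+0x10a is the IA32_ARCH_CAPABILITIES MSR address)::
+
+  modprobe msr
+  rdmsr 0x10a
+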
+MSR IA32_MCU_OPT_CTRL
+---------------------
+Bit 3 - FB_CLEAR_DIS - When set, VERW instruction does not perform the FB_CLEAR
+action. This may be useful to reduce the performance impact of FB_CLEAR in
+cases where system software deems it warranted (for example, when performance
+is more critical, or the untrusted software has no MMIO access). Note that
+FB_CLEAR_DIS has no impact on enumeration (for example, it does not change
+FB_CLEAR or MD_CLEAR enumeration) and it may not be supported on all processors
+that enumerate FB_CLEAR.
+
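+The current value of this MSR can likewise be read with msr-tools (0x123 is
+the IA32_MCU_OPT_CTRL MSR address; changing FB_CLEAR_DIS is normally left to
+the kernel)::
+
+  rdmsr 0x123
+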
+Mitigation
+==========
+Like MDS, all variants of Processor MMIO Stale Data vulnerabilities have the
+same mitigation strategy to force the CPU to clear the affected buffers before
+an attacker can extract the secrets.
+
+This is achieved by using the otherwise unused and obsolete VERW instruction in
+combination with a microcode update. The microcode clears the affected CPU
+buffers when the VERW instruction is executed.
+
+The kernel reuses the MDS function to invoke the buffer clearing:
+
+ mds_clear_cpu_buffers()
+
+On MDS affected CPUs, the kernel already invokes CPU buffer clear on
+kernel/userspace, hypervisor/guest and C-state (idle) transitions. No
+additional mitigation is needed on such CPUs.
+
+For CPUs not affected by MDS or TAA, mitigation is needed only against
+attackers with MMIO capability. Therefore, VERW is not required on
+kernel/userspace transitions. For the virtualization case, VERW is only needed
+at VMENTER for a guest with MMIO capability.
+
+Mitigation points
+-----------------
+Return to user space
+^^^^^^^^^^^^^^^^^^^^
+Same mitigation as MDS when affected by MDS/TAA, otherwise no mitigation
+needed.
+
+C-State transition
+^^^^^^^^^^^^^^^^^^
+Control register writes by CPU during C-state transition can propagate data
+from fill buffer to uncore buffers. Execute VERW before C-state transition to
+clear CPU fill buffers.
+
+Guest entry point
+^^^^^^^^^^^^^^^^^
+Same mitigation as MDS when processor is also affected by MDS/TAA, otherwise
+execute VERW at VMENTER only for MMIO capable guests. On CPUs not affected by
+MDS/TAA, guest without MMIO access cannot extract secrets using Processor MMIO
+Stale Data vulnerabilities, so there is no need to execute VERW for such guests.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+The kernel command line allows controlling the Processor MMIO Stale Data
+mitigations at boot time with the option "mmio_stale_data=". The valid
+arguments for this option are:
+
+ ========== =================================================================
+ full If the CPU is vulnerable, enable mitigation; CPU buffer clearing
+ on exit to userspace and when entering a VM. Idle transitions are
+ protected as well. It does not automatically disable SMT.
+ full,nosmt Same as full, with SMT disabled on vulnerable CPUs. This is the
+ complete mitigation.
+ off Disables mitigation completely.
+ ========== =================================================================
+
+If the CPU is affected and mmio_stale_data=off is not supplied on the kernel
+command line, then the kernel selects the appropriate mitigation.
+
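+For example, to select the complete mitigation, append the option to the boot
+loader configuration (shown for a GRUB-based setup; file locations and update
+commands vary by distribution) and verify it at runtime::
+
+  # e.g. in /etc/default/grub
+  GRUB_CMDLINE_LINUX="... mmio_stale_data=full,nosmt"
+
+  # after reboot
+  cat /proc/cmdline
+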
+Mitigation status information
+-----------------------------
+The Linux kernel provides a sysfs interface to enumerate the current
+vulnerability status of the system: whether the system is vulnerable, and
+which mitigations are active. The relevant sysfs file is:
+
+ /sys/devices/system/cpu/vulnerabilities/mmio_stale_data
+
+The possible values in this file are:
+
+ .. list-table::
+
+ * - 'Not affected'
+ - The processor is not vulnerable
+ * - 'Vulnerable'
+ - The processor is vulnerable, but no mitigation enabled
+ * - 'Vulnerable: Clear CPU buffers attempted, no microcode'
+ - The processor is vulnerable, but microcode is not updated. The
+ mitigation is enabled on a best effort basis.
+ * - 'Mitigation: Clear CPU buffers'
+ - The processor is vulnerable and the CPU buffer clearing mitigation is
+ enabled.
+ * - 'Unknown: No mitigations'
+ - The processor vulnerability status is unknown because it is
+       out of its Servicing period. Mitigation is not attempted.
+
+Definitions:
+------------
+
+Servicing period: The process of providing functional and security updates to
+Intel processors or platforms, utilizing the Intel Platform Update (IPU)
+process or other similar mechanisms.
+
+End of Servicing Updates (ESU): ESU is the date at which Intel will no
+longer provide Servicing, such as through IPU or other similar update
+processes. ESU dates will typically be aligned to end of quarter.
+
+If the processor is vulnerable then the following information is appended to
+the above information:
+
+ ======================== ===========================================
+ 'SMT vulnerable' SMT is enabled
+ 'SMT disabled' SMT is disabled
+ 'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown
+ ======================== ===========================================
+
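+For example (illustrative output only; the actual string depends on the CPU,
+microcode and selected mitigation)::
+
+  $ cat /sys/devices/system/cpu/vulnerabilities/mmio_stale_data
+  Mitigation: Clear CPU buffers; SMT vulnerable
+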
+References
+----------
+.. [#f1] Affected Processors
+ https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html
diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst
index e05e581af5cf..4d186f599d90 100644
--- a/Documentation/admin-guide/hw-vuln/spectre.rst
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -60,8 +60,8 @@ privileged data touched during the speculative execution.
Spectre variant 1 attacks take advantage of speculative execution of
conditional branches, while Spectre variant 2 attacks use speculative
execution of indirect branches to leak privileged memory.
-See :ref:`[1] <spec_ref1>` :ref:`[5] <spec_ref5>` :ref:`[7] <spec_ref7>`
-:ref:`[10] <spec_ref10>` :ref:`[11] <spec_ref11>`.
+See :ref:`[1] <spec_ref1>` :ref:`[5] <spec_ref5>` :ref:`[6] <spec_ref6>`
+:ref:`[7] <spec_ref7>` :ref:`[10] <spec_ref10>` :ref:`[11] <spec_ref11>`.
Spectre variant 1 (Bounds Check Bypass)
---------------------------------------
@@ -131,6 +131,19 @@ steer its indirect branch speculations to gadget code, and measure the
speculative execution's side effects left in level 1 cache to infer the
victim's data.
+Yet another variant 2 attack vector is for the attacker to poison the
+Branch History Buffer (BHB) to speculatively steer an indirect branch
+to a specific Branch Target Buffer (BTB) entry, even if the entry isn't
+associated with the source address of the indirect branch. Specifically,
+the BHB might be shared across privilege levels even in the presence of
+Enhanced IBRS.
+
+Currently the only known real-world BHB attack vector is via
+unprivileged eBPF. Therefore, it's highly recommended to not enable
+unprivileged eBPF, especially when eIBRS is used (without retpolines).
+For a full mitigation against BHB attacks, it's recommended to use
+retpolines (or eIBRS combined with retpolines).
+
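+Whether unprivileged eBPF is enabled can be checked, and changed until the
+next boot, via sysctl (0 means enabled; 1 disables it until reboot, 2
+disables it but allows re-enabling)::
+
+  # check
+  sysctl kernel.unprivileged_bpf_disabled
+  # disable
+  sysctl -w kernel.unprivileged_bpf_disabled=1
+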
Attack scenarios
----------------
@@ -364,13 +377,15 @@ The possible values in this file are:
- Kernel status:
- ==================================== =================================
- 'Not affected' The processor is not vulnerable
- 'Vulnerable' Vulnerable, no mitigation
- 'Mitigation: Full generic retpoline' Software-focused mitigation
- 'Mitigation: Full AMD retpoline' AMD-specific software mitigation
- 'Mitigation: Enhanced IBRS' Hardware-focused mitigation
- ==================================== =================================
+ ======================================== =================================
+ 'Not affected' The processor is not vulnerable
+ 'Mitigation: None' Vulnerable, no mitigation
+ 'Mitigation: Retpolines' Use Retpoline thunks
+ 'Mitigation: LFENCE' Use LFENCE instructions
+ 'Mitigation: Enhanced IBRS' Hardware-focused mitigation
+ 'Mitigation: Enhanced IBRS + Retpolines' Hardware-focused + Retpolines
+ 'Mitigation: Enhanced IBRS + LFENCE' Hardware-focused + LFENCE
+ ======================================== =================================
- Firmware status: Show if Indirect Branch Restricted Speculation (IBRS) is
used to protect against Spectre variant 2 attacks when calling firmware (x86 only).
@@ -407,6 +422,14 @@ The possible values in this file are:
'RSB filling' Protection of RSB on context switch enabled
============= ===========================================
+ - EIBRS Post-barrier Return Stack Buffer (PBRSB) protection status:
+
+ =========================== =======================================================
+ 'PBRSB-eIBRS: SW sequence' CPU is affected and protection of RSB on VMEXIT enabled
+ 'PBRSB-eIBRS: Vulnerable' CPU is vulnerable
+ 'PBRSB-eIBRS: Not affected' CPU is not affected by PBRSB
+ =========================== =======================================================
+
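+   A complete status line combines several of these fields, for example
+   (illustrative output; the actual line depends on CPU and configuration)::
+
+     $ cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
+     Mitigation: Retpolines; IBPB: conditional; IBRS_FW; STIBP: conditional; RSB filling; PBRSB-eIBRS: Not affected
+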
Full mitigation might require a microcode update from the CPU
vendor. When the necessary microcode is not available, the kernel will
report vulnerability.
@@ -456,8 +479,16 @@ Spectre variant 2
On Intel Skylake-era systems the mitigation covers most, but not all,
cases. See :ref:`[3] <spec_ref3>` for more details.
- On CPUs with hardware mitigation for Spectre variant 2 (e.g. Enhanced
- IBRS on x86), retpoline is automatically disabled at run time.
+ On CPUs with hardware mitigation for Spectre variant 2 (e.g. IBRS
+ or enhanced IBRS on x86), retpoline is automatically disabled at run time.
+
+ Systems which support enhanced IBRS (eIBRS) enable IBRS protection once at
+ boot, by setting the IBRS bit, and they're automatically protected against
+ Spectre v2 variant attacks, including cross-thread branch target injections
+ on SMT systems (STIBP). In other words, eIBRS enables STIBP too.
+
+   Legacy IBRS systems clear the IBRS bit on exit to userspace, which also
+   disables the implicit STIBP protection there; STIBP then has to be enabled
+   explicitly (e.g. via prctl()) for user programs that need it.
The retpoline mitigation is turned on by default on vulnerable
CPUs. It can be forced on or off by the administrator
@@ -468,7 +499,7 @@ Spectre variant 2
before invoking any firmware code to prevent Spectre variant 2 exploits
using the firmware.
- Using kernel address space randomization (CONFIG_RANDOMIZE_SLAB=y
+ Using kernel address space randomization (CONFIG_RANDOMIZE_BASE=y
and CONFIG_SLAB_FREELIST_RANDOM=y in the kernel configuration) makes
attacks on the kernel generally more difficult.
@@ -481,18 +512,20 @@ Spectre variant 2
For Spectre variant 2 mitigation, individual user programs
can be compiled with return trampolines for indirect branches.
This protects them from consuming poisoned entries in the branch
- target buffer left by malicious software. Alternatively, the
- programs can disable their indirect branch speculation via prctl()
- (See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
+ target buffer left by malicious software.
+
+ On legacy IBRS systems, at return to userspace, implicit STIBP is disabled
+ because the kernel clears the IBRS bit. In this case, the userspace programs
+ can disable indirect branch speculation via prctl() (See
+ :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
On x86, this will turn on STIBP to guard against attacks from the
sibling thread when the user program is running, and use IBPB to
flush the branch target buffer when switching to/from the program.
Restricting indirect branch speculation on a user program will
also prevent the program from launching a variant 2 attack
- on x86. All sand-boxed SECCOMP programs have indirect branch
- speculation restricted by default. Administrators can change
- that behavior via the kernel command line and sysfs control files.
+ on x86. Administrators can change that behavior via the kernel
+ command line and sysfs control files.
See :ref:`spectre_mitigation_control_command_line`.
Programs that disable their indirect branch speculation will have
@@ -584,71 +617,26 @@ kernel command line.
Specific mitigations can also be selected manually:
- retpoline
- replace indirect branches
- retpoline,generic
- google's original retpoline
- retpoline,amd
- AMD-specific minimal thunk
+ retpoline auto pick between generic,lfence
+ retpoline,generic Retpolines
+ retpoline,lfence LFENCE; indirect branch
+ retpoline,amd alias for retpoline,lfence
+ eibrs Enhanced/Auto IBRS
+ eibrs,retpoline Enhanced/Auto IBRS + Retpolines
+ eibrs,lfence Enhanced/Auto IBRS + LFENCE
+ ibrs use IBRS to protect kernel
Not specifying this option is equivalent to
spectre_v2=auto.
-For user space mitigation:
-
- spectre_v2_user=
-
- [X86] Control mitigation of Spectre variant 2
- (indirect branch speculation) vulnerability between
- user space tasks
-
- on
- Unconditionally enable mitigations. Is
- enforced by spectre_v2=on
-
- off
- Unconditionally disable mitigations. Is
- enforced by spectre_v2=off
-
- prctl
- Indirect branch speculation is enabled,
- but mitigation can be enabled via prctl
- per thread. The mitigation control state
- is inherited on fork.
-
- prctl,ibpb
- Like "prctl" above, but only STIBP is
- controlled per thread. IBPB is issued
- always when switching between different user
- space processes.
-
- seccomp
- Same as "prctl" above, but all seccomp
- threads will enable the mitigation unless
- they explicitly opt out.
-
- seccomp,ibpb
- Like "seccomp" above, but only STIBP is
- controlled per thread. IBPB is issued
- always when switching between different
- user space processes.
-
- auto
- Kernel selects the mitigation depending on
- the available CPU features and vulnerability.
-
- Default mitigation:
- If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl"
-
- Not specifying this option is equivalent to
- spectre_v2_user=auto.
-
In general the kernel by default selects
reasonable mitigations for the current CPU. To
disable Spectre variant 2 mitigations, boot with
spectre_v2=off. Spectre variant 1 mitigations
cannot be disabled.
+For spectre_v2_user see Documentation/admin-guide/kernel-parameters.txt
+
Mitigation selection guide
--------------------------
@@ -674,9 +662,8 @@ Mitigation selection guide
off by disabling their indirect branch speculation when they are run
(See :ref:`Documentation/userspace-api/spec_ctrl.rst <set_spec_ctrl>`).
This prevents untrusted programs from polluting the branch target
- buffer. All programs running in SECCOMP sandboxes have indirect
- branch speculation restricted by default. This behavior can be
- changed via the kernel command line and sysfs control files. See
+ buffer. This behavior can be changed via the kernel command line
+ and sysfs control files. See
:ref:`spectre_mitigation_control_command_line`.
3. High security mode
@@ -730,7 +717,7 @@ AMD white papers:
.. _spec_ref6:
-[6] `Software techniques for managing speculation on AMD processors <https://developer.amd.com/wp-content/resources/90343-B_SoftwareTechniquesforManagingSpeculation_WP_7-18Update_FNL.pdf>`_.
+[6] `Software techniques for managing speculation on AMD processors <https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf>`_.
ARM white papers:
diff --git a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
index 76673affd917..014167ef8dd1 100644
--- a/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
+++ b/Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
@@ -63,7 +63,7 @@ attacker needs to begin a TSX transaction and raise an asynchronous abort
which in turn potentially leaks data stored in the buffers.
More detailed technical information is available in the TAA specific x86
-architecture section: :ref:`Documentation/x86/tsx_async_abort.rst <tsx_async_abort>`.
+architecture section: :ref:`Documentation/arch/x86/tsx_async_abort.rst <tsx_async_abort>`.
Attack scenarios
diff --git a/Documentation/admin-guide/hw_random.rst b/Documentation/admin-guide/hw_random.rst
index 121de96e395e..d494601717f1 100644
--- a/Documentation/admin-guide/hw_random.rst
+++ b/Documentation/admin-guide/hw_random.rst
@@ -1,6 +1,6 @@
-==========================================================
-Linux support for random number generator in i8xx chipsets
-==========================================================
+=================================
+Hardware random number generators
+=================================
Introduction
============
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index dc00afcabb95..43ea35613dfc 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -35,7 +35,8 @@ problems and bugs in particular.
:maxdepth: 1
reporting-issues
- security-bugs
+ reporting-regressions
+ quickly-build-trimmed-linux
bug-hunting
bug-bisect
tainted-kernels
@@ -55,6 +56,17 @@ ABI will be found here.
sysfs-rules
+This is the beginning of a section with information of interest to
+application developers and system integrators doing analysis of the
+Linux kernel for safety-critical applications. Documents supporting
+analysis of kernel interactions with applications, and of key kernel
+subsystems' expectations, will be found here.
+
+.. toctree::
+ :maxdepth: 1
+
+ workload-tracing
+
The rest of this manual consists of various unordered guides on how to
configure specific aspects of kernel behavior to your liking.
@@ -82,6 +94,7 @@ configure specific aspects of kernel behavior to your liking.
edid
efi-stub
ext4
+ filesystem-monitoring
nfs/index
gpio/index
highuid
@@ -114,6 +127,7 @@ configure specific aspects of kernel behavior to your liking.
svga
syscall-user-dispatch
sysrq
+ thermal/index
thunderbolt
ufs
unicode
diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst
index 9b14b0c2c9c4..609a3201fd4e 100644
--- a/Documentation/admin-guide/iostats.rst
+++ b/Documentation/admin-guide/iostats.rst
@@ -76,7 +76,7 @@ Field 3 -- # of sectors read (unsigned long)
Field 4 -- # of milliseconds spent reading (unsigned int)
This is the total number of milliseconds spent by all reads (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 5 -- # of writes completed (unsigned long)
This is the total number of writes completed successfully.
@@ -89,7 +89,7 @@ Field 7 -- # of sectors written (unsigned long)
Field 8 -- # of milliseconds spent writing (unsigned int)
This is the total number of milliseconds spent by all writes (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 9 -- # of I/Os currently in progress (unsigned int)
The only field that should go to zero. Incremented as requests are
@@ -120,7 +120,7 @@ Field 14 -- # of sectors discarded (unsigned long)
Field 15 -- # of milliseconds spent discarding (unsigned int)
This is the total number of milliseconds spent by all discards (as
- measured from __make_request() to end_that_request_last()).
+ measured from blk_mq_alloc_request() to __blk_mq_end_request()).
Field 16 -- # of flush requests completed
This is the total number of flush requests completed successfully.
diff --git a/Documentation/admin-guide/kdump/gdbmacros.txt b/Documentation/admin-guide/kdump/gdbmacros.txt
index 82aecdcae8a6..030de95e3e6b 100644
--- a/Documentation/admin-guide/kdump/gdbmacros.txt
+++ b/Documentation/admin-guide/kdump/gdbmacros.txt
@@ -312,10 +312,10 @@ define dmesg
set var $prev_flags = $info->flags
end
- set var $id = ($id + 1) & $id_mask
if ($id == $end_id)
loop_break
end
+ set var $id = ($id + 1) & $id_mask
end
end
document dmesg
diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index cb30ca3df27c..a748e7eb4429 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -146,9 +146,9 @@ System kernel config options
CONFIG_SYSFS=y
Note that "sysfs file system support" might not appear in the "Pseudo
- filesystems" menu if "Configure standard kernel features (for small
- systems)" is not enabled in "General Setup." In this case, check the
- .config file itself to ensure that sysfs is turned on, as follows::
+ filesystems" menu if "Configure standard kernel features (expert users)"
+ is not enabled in "General Setup." In this case, check the .config file
+ itself to ensure that sysfs is turned on, as follows::
grep 'CONFIG_SYSFS' .config
@@ -533,6 +533,10 @@ the following command::
cp /proc/vmcore <dump-file>
+or use scp to write out the dump file between hosts on a network, e.g::
+
+ scp /proc/vmcore remote_username@remote_ip:<dump-file>
+
You can also use makedumpfile utility to write out the dump file
with specified options to filter out unwanted contents, e.g::
diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 3861a25faae1..c18d94fa6470 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -172,7 +172,7 @@ variables.
Offset of the free_list's member. This value is used to compute the number
of free pages.
-Each zone has a free_area structure array called free_area[MAX_ORDER].
+Each zone has a free_area structure array called free_area[MAX_ORDER + 1].
The free_list represents a linked list of free page blocks.
(list_head, next|prev)
@@ -189,8 +189,8 @@ Offsets of the vmap_area's members. They carry vmalloc-specific
information. Makedumpfile gets the start address of the vmalloc region
from this.
-(zone.free_area, MAX_ORDER)
----------------------------
+(zone.free_area, MAX_ORDER + 1)
+-------------------------------
Free areas descriptor. User-space tools use this value to iterate the
free_area ranges. MAX_ORDER is used by the zone buddy allocator.
@@ -200,7 +200,7 @@ prb
A pointer to the printk ringbuffer (struct printk_ringbuffer). This
may be pointing to the static boot ringbuffer or the dynamically
-allocated ringbuffer, depending on when the the core dump occurred.
+allocated ringbuffer, depending on when the core dump occurred.
Used by user-space tools to read the active kernel log buffer.
printk_rb_static
@@ -494,6 +494,14 @@ architecture which is used to lookup the page-tables for the Virtual
addresses in the higher VA range (refer to ARMv8 ARM document for
more details).
+MODULES_VADDR|MODULES_END|VMALLOC_START|VMALLOC_END|VMEMMAP_START|VMEMMAP_END
+-----------------------------------------------------------------------------
+
+Used to get the correct ranges:
+ MODULES_VADDR ~ MODULES_END-1 : Kernel module space.
+ VMALLOC_START ~ VMALLOC_END-1 : vmalloc() / ioremap() space.
+ VMEMMAP_START ~ VMEMMAP_END-1 : vmemmap region, used for struct page array.
+
arm
===
@@ -587,3 +595,32 @@ X2TLB
-----
Indicates whether the crashed kernel enabled SH extended mode.
+
+RISCV64
+=======
+
+VA_BITS
+-------
+
+The maximum number of bits for virtual addresses. Used to compute the
+virtual memory ranges.
+
+PAGE_OFFSET
+-----------
+
+Indicates the virtual kernel start address of the direct-mapped RAM region.
+
+phys_ram_base
+-------------
+
+Indicates the start physical RAM address.
+
+MODULES_VADDR|MODULES_END|VMALLOC_START|VMALLOC_END|VMEMMAP_START|VMEMMAP_END|KERNEL_LINK_ADDR
+----------------------------------------------------------------------------------------------
+
+Used to get the correct ranges:
+
+ * MODULES_VADDR ~ MODULES_END : Kernel module space.
+ * VMALLOC_START ~ VMALLOC_END : vmalloc() / ioremap() space.
+ * VMEMMAP_START ~ VMEMMAP_END : vmemmap space, used for struct page array.
+ * KERNEL_LINK_ADDR : start address of Kernel link and BPF
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index 01ba293a2d70..1ba8f2a44aac 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -99,6 +99,7 @@ parameter is applicable::
ALSA ALSA sound support is enabled.
APIC APIC support is enabled.
APM Advanced Power Management support is enabled.
+ APPARMOR AppArmor support is enabled.
ARM ARM architecture is enabled.
ARM64 ARM64 architecture is enabled.
AX25 Appropriate AX.25 support is enabled.
@@ -108,15 +109,15 @@ parameter is applicable::
DYNAMIC_DEBUG Build in debug messages and enable them at runtime
EDD BIOS Enhanced Disk Drive Services (EDD) is enabled
EFI EFI Partitioning (GPT) is enabled
- EIDE EIDE/ATAPI support is enabled.
EVM Extended Verification Module
FB The frame buffer device is enabled.
FTRACE Function tracing enabled.
GCOV GCOV profiling is enabled.
+ HIBERNATION HIBERNATION is enabled.
HW Appropriate hardware is enabled.
+ HYPER_V HYPERV support is enabled.
IA-64 IA-64 architecture is enabled.
IMA Integrity measurement architecture is enabled.
- IOSCHED More than one I/O scheduler is enabled.
IP_PNP IP DHCP, BOOTP, or RARP is enabled.
IPV6 IPv6 support is enabled.
ISAPNP ISA PnP code is enabled.
@@ -127,10 +128,11 @@ parameter is applicable::
KVM Kernel Virtual Machine support is enabled.
LIBATA Libata driver is enabled
LP Printer support is enabled.
+ LOONGARCH LoongArch architecture is enabled.
LOOP Loopback device support is enabled.
M68k M68k architecture is enabled.
These options have more detailed description inside of
- Documentation/m68k/kernel-options.rst.
+ Documentation/arch/m68k/kernel-options.rst.
MDA MDA console support is enabled.
MIPS MIPS architecture is enabled.
MOUSE Appropriate mouse support is enabled.
@@ -140,9 +142,7 @@ parameter is applicable::
NUMA NUMA support is enabled.
NFS Appropriate NFS support is enabled.
OF Devicetree is enabled.
- OSS OSS sound support is enabled.
PV_OPS A paravirtualized kernel is enabled.
- PARIDE The ParIDE (parallel port IDE) subsystem is enabled.
PARISC The PA-RISC architecture is enabled.
PCI PCI bus support is enabled.
PCIE PCI Express support is enabled.
@@ -160,7 +160,6 @@ parameter is applicable::
the Documentation/scsi/ sub-directory.
SECURITY Different security models are enabled.
SELINUX SELinux support is enabled.
- APPARMOR AppArmor support is enabled.
SERIAL Serial support is enabled.
SH SuperH architecture is enabled.
SMP The kernel is an SMP kernel.
@@ -168,7 +167,6 @@ parameter is applicable::
SWSUSP Software suspend (hibernation) is enabled.
SUSPEND System suspend states are enabled.
TPM TPM drivers are enabled.
- TS Appropriate touchscreen support is enabled.
UMS USB Mass Storage support is enabled.
USB USB support is enabled.
USBHID USB Human Interface Device support is enabled.
@@ -177,11 +175,10 @@ parameter is applicable::
VGA The VGA console has been enabled.
VT Virtual terminal support is enabled.
WDT Watchdog support is enabled.
- XT IBM PC/XT MFM hard disk support is enabled.
X86-32 X86-32, aka i386 architecture is enabled.
X86-64 X86-64 architecture is enabled.
More X86-64 boot options can be found in
- Documentation/x86/x86_64/boot-options.rst.
+ Documentation/arch/x86/x86_64/boot-options.rst.
X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64)
X86_UV SGI UV support is enabled.
XEN Xen support is enabled
@@ -196,10 +193,10 @@ In addition, the following text indicates that the option::
Parameters denoted with BOOT are actually interpreted by the boot
loader, and have no meaning to the kernel directly.
Do not modify the syntax of boot loader parameters without extreme
-need or coordination with <Documentation/x86/boot.rst>.
+need or coordination with <Documentation/arch/x86/boot.rst>.
There are also arch-specific kernel-parameters not documented here.
-See for example <Documentation/x86/x86_64/boot-options.rst>.
+See for example <Documentation/arch/x86/x86_64/boot-options.rst>.
Note that ALL kernel parameters listed below are CASE SENSITIVE, and that
a trailing = on the name of any parameter states that that parameter will
@@ -211,7 +208,7 @@ The number of kernel parameters is not limited, but the length of the
complete command line (parameters including spaces etc.) is limited to
a fixed number of characters. This limit depends on the architecture
and is between 256 and 4096 characters. It is defined in the file
-./include/asm/setup.h as COMMAND_LINE_SIZE.
+./include/uapi/asm-generic/setup.h as COMMAND_LINE_SIZE.
Finally, the [KMG] suffix is commonly described after a number of kernel
parameter values. These 'K', 'M', and 'G' letters represent the _binary_
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 43dc35fe5bc0..9e5bab29685f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -225,14 +225,23 @@
For broken nForce2 BIOS resulting in XT-PIC timer.
acpi_sleep= [HW,ACPI] Sleep options
- Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig,
- old_ordering, nonvs, sci_force_enable, nobl }
+ Format: { s3_bios, s3_mode, s3_beep, s4_hwsig,
+ s4_nohwsig, old_ordering, nonvs,
+ sci_force_enable, nobl }
See Documentation/power/video.rst for information on
s3_bios and s3_mode.
s3_beep is for debugging; it makes the PC's speaker beep
as soon as the kernel's real-mode entry point is called.
+ s4_hwsig causes the kernel to check the ACPI hardware
+ signature during resume from hibernation, and gracefully
+ refuse to resume if it has changed. This complies with
+ the ACPI specification but not with reality, since
+ Windows does not do this and many laptops do change it
+ on docking. So the default behaviour is to allow resume
+ and simply warn when the signature changes, unless the
+ s4_hwsig option is enabled.
s4_nohwsig prevents ACPI hardware signature from being
- used during resume from hibernation.
+ used (or even warned about) during resume.
old_ordering causes the ACPI 1.0 ordering of the _PTS
control method, with respect to putting devices into
low power states, to be enforced (the ACPI 2.0 ordering
@@ -312,6 +321,8 @@
force_enable - Force enable the IOMMU on platforms known
to be buggy with IOMMU enabled. Use this
option with care.
+ pgtbl_v1 - Use v1 page table for DMA-API (Default).
+ pgtbl_v2 - Use v2 page table for DMA-API.
amd_iommu_dump= [HW,X86-64]
Enable AMD IOMMU driver option to dump the ACPI table
@@ -328,6 +339,29 @@
This mode requires kvm-amd.avic=1.
(Default when IOMMU HW support is present.)
+ amd_pstate= [X86]
+ disable
+ Do not enable amd_pstate as the default
+ scaling driver for the supported processors
+ passive
+ Use amd_pstate with passive mode as a scaling driver.
+ In this mode autonomous selection is disabled.
+			The driver requests a desired performance level, and the
+			platform tries to match it if the request can be satisfied
+			by the guaranteed performance level.
+			active
+			Use the amd_pstate_epp driver instance as the scaling
+			driver. The driver provides a hint to the CPPC firmware
+			indicating whether software wants to bias toward
+			performance (0x0) or energy efficiency (0xff). The CPPC
+			power algorithm will then calculate the runtime workload
+			and adjust the realtime cores' frequency.
+ guided
+ Activate guided autonomous mode. Driver requests minimum and
+ maximum performance level and the platform autonomously
+ selects a performance level in this range and appropriate
+ to the current workload.
+
amijoy.map= [HW,JOY] Amiga joystick support
Map of devices attached to JOY0DAT and JOY1DAT
Format: <a>,<b>
@@ -367,18 +401,16 @@
autoconf= [IPV6]
See Documentation/networking/ipv6.rst.
- show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller
- Limit apic dumping. The parameter defines the maximal
- number of local apics being dumped. Also it is possible
- to set it to "all" by meaning -- no limit here.
- Format: { 1 (default) | 2 | ... | all }.
- The parameter valid if only apic=debug or
- apic=verbose is specified.
- Example: apic=debug show_lapic=all
-
apm= [APM] Advanced Power Management
See header of arch/x86/kernel/apm_32.c.
+ apparmor= [APPARMOR] Disable or enable AppArmor at boot time
+ Format: { "0" | "1" }
+ See security/apparmor/Kconfig help text
+ 0 -- disable.
+ 1 -- enable.
+ Default value is set via kernel config option.
+
arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards
Format: <io>,<irq>,<nodeID>
@@ -391,6 +423,12 @@
arm64.nomte [ARM64] Unconditionally disable Memory Tagging Extension
support
+ arm64.nosve [ARM64] Unconditionally disable Scalable Vector
+ Extension support
+
+ arm64.nosme [ARM64] Unconditionally disable Scalable Matrix
+ Extension support
+
ataflop= [HW,M68k]
atarimouse= [HW,MOUSE] Atari Mouse
@@ -452,13 +490,21 @@
Format: <io>,<irq>,<mode>
See header of drivers/net/hamradio/baycom_ser_hdx.c.
+ bert_disable [ACPI]
+ Disable BERT OS support on buggy BIOSes.
+
+ bgrt_disable [ACPI][X86]
+ Disable BGRT to avoid flickering OEM logo.
+
blkdevparts= Manual partition parsing of block device(s) for
embedded devices based on command line input.
See Documentation/block/cmdline-partition.rst
boot_delay= Milliseconds to delay each printk during boot.
- Values larger than 10 seconds (10000) are changed to
- no delay (0).
+ Only works if CONFIG_BOOT_PRINTK_DELAY is enabled,
+ and you may also have to specify "lpj=". Boot_delay
+ values larger than 10 seconds (10000) are assumed
+ erroneous and ignored.
Format: integer
bootconfig [KNL]
@@ -467,12 +513,6 @@
See Documentation/admin-guide/bootconfig.rst
- bert_disable [ACPI]
- Disable BERT OS support on buggy BIOSes.
-
- bgrt_disable [ACPI][X86]
- Disable BGRT to avoid flickering OEM logo.
-
bttv.card= [HW,V4L] bttv (bt848 + bt878 based grabber cards)
bttv.radio= Most important insmod options are available as
kernel args too.
@@ -540,8 +580,9 @@
Format: <string>
nosocket -- Disable socket memory accounting.
nokmem -- Disable kernel memory accounting.
+ nobpf -- Disable BPF memory accounting.
- checkreqprot [SELINUX] Set initial checkreqprot flag value.
+ checkreqprot= [SELINUX] Set initial checkreqprot flag value.
Format: { "0" | "1" }
See security/selinux/Kconfig help text.
0 -- check protection applied by kernel (includes
@@ -554,6 +595,25 @@
cio_ignore= [S390]
See Documentation/s390/common_io.rst for details.
+
+ clearcpuid=X[,X...] [X86]
+ Disable CPUID feature X for the kernel. See
+ arch/x86/include/asm/cpufeatures.h for the valid bit
+ numbers X. Note the Linux-specific bits are not necessarily
+			stable across kernel versions, but the vendor-specific
+ ones should be.
+ X can also be a string as appearing in the flags: line
+ in /proc/cpuinfo which does not have the above
+ instability issue. However, not all features have names
+ in /proc/cpuinfo.
+ Note that using this option will taint your kernel.
+ Also note that user programs calling CPUID directly
+ or using the feature without checking anything
+ will still see it. This just prevents it from
+ being used by the kernel or shown in /proc/cpuinfo.
+ Also note the kernel might malfunction if you disable
+ some critical bits.
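+			Example: clearcpuid=avx2 (using a flag name as it
+			appears in /proc/cpuinfo; shown for illustration only).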
+
clk_ignore_unused
[CLK]
Prevents the clock framework from automatically gating
@@ -603,8 +663,8 @@
clocksource.max_cswd_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
- unstable. Defaults to three retries, that is,
- four attempts to read the clock under test.
+ unstable. Defaults to two retries, that is,
+ three attempts to read the clock under test.
clocksource.verify_n_cpus= [KNL]
Limit the number of CPUs checked for clocksources
@@ -622,19 +682,6 @@
Defaults to zero when built as a module and to
10 seconds when built into the kernel.
- clearcpuid=BITNUM[,BITNUM...] [X86]
- Disable CPUID feature X for the kernel. See
- arch/x86/include/asm/cpufeatures.h for the valid bit
- numbers. Note the Linux specific bits are not necessarily
- stable over kernel options, but the vendor specific
- ones should be.
- Also note that user programs calling CPUID directly
- or using the feature without checking anything
- will still see it. This just prevents it from
- being used by the kernel or shown in /proc/cpuinfo.
- Also note the kernel might malfunction if you disable
- some critical bits.
-
cma=nn[MG]@[start[MG][-end[MG]]]
[KNL,CMA]
Sets the size of kernel global memory area for
@@ -649,7 +696,7 @@
Sets the size of kernel per-numa memory area for
contiguous memory allocations. A value of 0 disables
per-numa CMA altogether. And If this option is not
- specificed, the default value is 0.
+ specified, the default value is 0.
With per-numa CMA enabled, DMA users on node nid will
first try to allocate buffer from the pernuma area
which is located in node nid, if the allocation fails,
@@ -680,6 +727,17 @@
condev= [HW,S390] console device
conmode=
+ con3215_drop= [S390] 3215 console drop mode.
+ Format: y|n|Y|N|1|0
+ When set to true, drop data on the 3215 console when
+ the console buffer is full. In this case the
+ operator using a 3270 terminal emulator (for example
+ x3270) does not have to enter the clear key for the
+ console output to advance and the kernel to continue.
+ This leads to a much faster boot time when a 3270
+ terminal emulator is active. If no 3270 terminal
+ emulator is used, this parameter has no effect.
+
console= [KNL] Output console device and options.
tty<n> Use the virtual console device <n>.
@@ -715,6 +773,12 @@
hvc<n> Use the hypervisor console device <n>. This is for
both Xen and PowerPC hypervisors.
+ { null | "" }
+ Use to disable console output, i.e., to have kernel
+ console messages discarded.
+ This must be the only console= parameter used on the
+ kernel command line.
+
If the device connected to the port is not a TTY but a braille
device, prepend "brl," before the device type, for instance
console=brl,ttyS0
@@ -750,6 +814,24 @@
0: default value, disable debugging
1: enable debugging at boot time
+ cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
+ Format:
+ <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
+
+ cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
+ CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
+ Some features depend on CPU0. Known dependencies are:
+ 1. Resume from suspend/hibernate depends on CPU0.
+ Suspend/hibernate will fail if CPU0 is offline and you
+ need to online CPU0 before suspend/hibernate.
+ 2. PIC interrupts also depend on CPU0. CPU0 can't be
+ removed if a PIC interrupt is detected.
+			Poweroff/reboot is reported to depend on CPU0 on some
+			machines, although such issues have not been observed
+			so far on a few tested machines with CPU0 offline.
+ If the dependencies are under your control, you can
+ turn on cpu0_hotplug.
+
cpuidle.off=1 [CPU_IDLE]
disable the cpuidle sub-system
@@ -770,9 +852,13 @@
on every CPU online, such as boot, and resume from suspend.
Default: 10000
- cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
- Format:
- <first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
+ crash_kexec_post_notifiers
+ Run kdump after running panic-notifiers and dumping
+			kmsg. This is only for users who doubt that kdump always
+			succeeds in any situation.
+			Note that this also increases the risk of kdump failure,
+ because some panic notifiers can make the crashed
+ kernel more unstable.
crashkernel=size[KMG][@offset[KMG]]
[KNL] Using kexec, Linux can switch to a 'crash kernel'
@@ -780,7 +866,7 @@
memory region [offset, offset + size] for that kernel
image. If '@offset' is omitted, then a suitable offset
is selected automatically.
- [KNL, X86-64] Select a region under 4G first, and
+ [KNL, X86-64, ARM64] Select a region under 4G first, and
fall back to reserve region above 4G when '@offset'
hasn't been specified.
See Documentation/admin-guide/kdump/kdump.rst for further details.
@@ -793,22 +879,25 @@
Documentation/admin-guide/kdump/kdump.rst for an example.
crashkernel=size[KMG],high
- [KNL, X86-64] range could be above 4G. Allow kernel
+ [KNL, X86-64, ARM64] range could be above 4G. Allow kernel
to allocate physical memory region from top, so could
be above 4G if system have more than 4G ram installed.
Otherwise memory region will be allocated below 4G, if
available.
It will be ignored if crashkernel=X is specified.
crashkernel=size[KMG],low
- [KNL, X86-64] range under 4G. When crashkernel=X,high
+ [KNL, X86-64, ARM64] range under 4G. When crashkernel=X,high
is passed, kernel could allocate physical memory region
above 4G, that cause second kernel crash on system
that require some amount of low memory, e.g. swiotlb
requires at least 64M+32K low memory, also enough extra
low memory is needed to make sure DMA buffers for 32-bit
- devices won't run out. Kernel would try to allocate at
- at least 256M below 4G automatically.
- This one let user to specify own low range under 4G
+			devices won't run out. The kernel will try to allocate a
+			default amount of memory below 4G automatically. The default
+ size is platform dependent.
+ --> x86: max(swiotlb_size_or_default() + 8MiB, 256MiB)
+ --> arm64: 128MiB
+ This one lets the user specify own low range under 4G
for second kernel instead.
0: to disable low allocation.
It will be ignored when crashkernel=X,high is not used
@@ -823,15 +912,14 @@
cs89x0_media= [HW,NET]
Format: { rj45 | aui | bnc }
- csdlock_debug= [KNL] Enable debug add-ons of cross-CPU function call
- handling. When switched on, additional debug data is
- printed to the console in case a hanging CPU is
- detected, and that CPU is pinged again in order to try
- to resolve the hang situation.
- 0: disable csdlock debugging (default)
- 1: enable basic csdlock debugging (minor impact)
- ext: enable extended csdlock debugging (more impact,
- but more data)
+ csdlock_debug= [KNL] Enable or disable debug add-ons of cross-CPU
+ function call handling. When switched on,
+ additional debug data is printed to the console
+ in case a hanging CPU is detected, and that
+ CPU is pinged again in order to try to resolve
+ the hang situation. The default value of this
+ option depends on the CSD_LOCK_WAIT_DEBUG_DEFAULT
+ Kconfig option.
dasd= [HW,NET]
See header of drivers/s390/block/dasd_devmap.c.
@@ -841,11 +929,6 @@
Format: <port#>,<type>
See also Documentation/input/devices/joystick-parport.rst
- ddebug_query= [KNL,DYNAMIC_DEBUG] Enable debug messages at early boot
- time. See
- Documentation/admin-guide/dynamic-debug-howto.rst for
- details. Deprecated, see dyndbg.
-
debug [KNL] Enable kernel debugging (events log level).
debug_boot_weak_hash
@@ -868,9 +951,6 @@
debug_objects [KNL] Enable object debugging
- no_debug_objects
- [KNL] Disable object debugging
-
debug_guardpage_minorder=
[KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
parameter allows control of the order of pages that will
@@ -884,7 +964,7 @@
driver code when a CPU writes to (or reads from) a
random memory location. Note that there exists a class
of memory corruptions problems caused by buggy H/W or
- F/W or by drivers badly programing DMA (basically when
+ F/W or by drivers badly programming DMA (basically when
memory is written at bus level and the CPU MMU is
bypassed) which are not detectable by
CONFIG_DEBUG_PAGEALLOC, hence this option will not help
@@ -916,10 +996,6 @@
debugpat [X86] Enable PAT debugging
- decnet.addr= [HW,NET]
- Format: <area>[,<node>]
- See also Documentation/networking/decnet.rst.
-
default_hugepagesz=
[HW] The size of the default HugeTLB page. This is
the size represented by the legacy /proc/ hugepages
@@ -935,11 +1011,39 @@
[KNL] Debugging option to set a timeout in seconds for
deferred probe to give up waiting on dependencies to
probe. Only specific dependencies (subsystems or
- drivers) that have opted in will be ignored. A timeout of 0
- will timeout at the end of initcalls. This option will also
+			drivers) that have opted in will be ignored. A timeout
+			of 0 expires at the end of initcalls. If the timeout
+			hasn't expired, it is restarted by each
+			successful driver registration. This option will also
dump out devices still on the deferred probe list after
retrying.
+ delayacct [KNL] Enable per-task delay accounting
+
+ dell_smm_hwmon.ignore_dmi=
+ [HW] Continue probing hardware even if DMI data
+ indicates that the driver is running on unsupported
+ hardware.
+
+ dell_smm_hwmon.force=
+ [HW] Activate driver even if SMM BIOS signature does
+ not match list of supported models and enable otherwise
+ blacklisted features.
+
+ dell_smm_hwmon.power_status=
+ [HW] Report power status in /proc/i8k
+ (disabled by default).
+
+ dell_smm_hwmon.restricted=
+ [HW] Allow controlling fans only if SYS_ADMIN
+ capability is set.
+
+ dell_smm_hwmon.fan_mult=
+ [HW] Factor to multiply fan speed with.
+
+ dell_smm_hwmon.fan_max=
+ [HW] Maximum configurable fan speed.
+
dfltcc= [HW,S390]
Format: { on | off | def_only | inf_only | always }
on: s390 zlib hardware support for compression on
@@ -961,32 +1065,12 @@
can be useful when debugging issues that require an SLB
miss to occur.
- stress_slb [PPC]
- Limits the number of kernel SLB entries, and flushes
- them frequently to increase the rate of SLB faults
- on kernel addresses.
-
disable= [IPV6]
See Documentation/networking/ipv6.rst.
- hardened_usercopy=
- [KNL] Under CONFIG_HARDENED_USERCOPY, whether
- hardening is enabled for this boot. Hardened
- usercopy checking is used to protect the kernel
- from reading or writing beyond known memory
- allocation boundaries as a proactive defense
- against bounds-checking flaws in the kernel's
- copy_to_user()/copy_from_user() interface.
- on Perform hardened usercopy checks (default).
- off Disable hardened usercopy checks.
-
disable_radix [PPC]
Disable RADIX MMU mode on POWER9
- radix_hcall_invalidate=on [PPC/PSERIES]
- Disable RADIX GTSE feature and use hcall for TLB
- invalidate.
-
disable_tlbie [PPC]
Disable TLBIE instruction. Currently does not work
with KVM, with HASH MMU, or with coherent accelerators.
@@ -1042,7 +1126,10 @@
driver later using sysfs.
driver_async_probe= [KNL]
- List of driver names to be probed asynchronously.
+			List of driver names to be probed asynchronously. A '*'
+			matches all driver names. If '*' is specified, the rest
+			of the listed driver names are exceptions that will NOT
+			be probed asynchronously.
Format: <driver_name1>,<driver_name2>...
drm.edid_firmware=[<connector>:]<file>[,[<connector>:]<file>]
@@ -1085,12 +1172,6 @@
Documentation/admin-guide/dynamic-debug-howto.rst
for details.
- nopku [X86] Disable Memory Protection Keys CPU feature found
- in some Intel CPUs.
-
- <module>.async_probe [KNL]
- Enable asynchronous probe on this module.
-
early_ioremap_debug [KNL]
Enable debug messages in early_ioremap support. This
is useful for tracking down temporary early mappings
@@ -1110,10 +1191,10 @@
specified, the serial port must already be setup and
configured.
- uart[8250],io,<addr>[,options]
- uart[8250],mmio,<addr>[,options]
- uart[8250],mmio32,<addr>[,options]
- uart[8250],mmio32be,<addr>[,options]
+ uart[8250],io,<addr>[,options[,uartclk]]
+ uart[8250],mmio,<addr>[,options[,uartclk]]
+ uart[8250],mmio32,<addr>[,options[,uartclk]]
+ uart[8250],mmio32be,<addr>[,options[,uartclk]]
uart[8250],0x<addr>[,options]
Start an early, polled-mode console on the 8250/16550
UART at the specified I/O port or MMIO address.
@@ -1122,7 +1203,9 @@
If none of [io|mmio|mmio32|mmio32be], <addr> is assumed
to be equivalent to 'mmio'. 'options' are specified
in the same format described for "console=ttyS<n>"; if
- unspecified, the h/w is not initialized.
+ unspecified, the h/w is not initialized. 'uartclk' is
+ the uart clock frequency; if unspecified, it is set
+ to 'BASE_BAUD' * 16.
pl011,<addr>
pl011,mmio32,<addr>
@@ -1248,7 +1331,7 @@
Append ",keep" to not disable it when the real console
takes over.
- Only one of vga, efi, serial, or usb debug port can
+ Only one of vga, serial, or usb debug port can
be used at a time.
Currently only ttyS0 and ttyS1 may be specified by
@@ -1263,7 +1346,7 @@
Interaction with the standard serial driver is not
very good.
- The VGA and EFI output is eventually overwritten by
+ The VGA output is eventually overwritten by
the real console.
The xen option can only be used in Xen domains.
@@ -1282,17 +1365,6 @@
force: enforce the use of EDAC to report H/W event.
default: on.
- ekgdboc= [X86,KGDB] Allow early kernel console debugging
- ekgdboc=kbd
-
- This is designed to be used in conjunction with
- the boot argument: earlyprintk=vga
-
- This parameter works in place of the kgdboc parameter
- but can only be used if the backing tty is available
- very early in the boot process. For early debugging
- via a serial port see kgdboc_earlycon instead.
-
edd= [EDD]
Format: {"off" | "on" | "skip[mbr]"}
@@ -1354,6 +1426,17 @@
eisa_irq_edge= [PARISC,HW]
See header of drivers/parisc/eisa.c.
+ ekgdboc= [X86,KGDB] Allow early kernel console debugging
+ Format: ekgdboc=kbd
+
+ This is designed to be used in conjunction with
+ the boot argument: earlyprintk=vga
+
+ This parameter works in place of the kgdboc parameter
+ but can only be used if the backing tty is available
+ very early in the boot process. For early debugging
+ via a serial port see kgdboc_earlycon instead.
+
elanfreq= [X86-32]
See comment before function elanfreq_setup() in
arch/x86/kernel/cpu/cpufreq/elanfreq.c.
@@ -1375,7 +1458,7 @@
(in particular on some ATI chipsets).
The kernel tries to set a reasonable default.
- enforcing [SELINUX] Set initial enforcing status.
+ enforcing= [SELINUX] Set initial enforcing status.
Format: {"0" | "1"}
See security/selinux/Kconfig help text.
0 -- permissive (log only, no denials).
@@ -1397,6 +1480,14 @@
Permit 'security.evm' to be updated regardless of
current integrity status.
+ early_page_ext [KNL] Enforces page_ext initialization to earlier
+ stages so that it covers more early boot allocations.
+ Please note that as side effect some optimizations
+ might be disabled to achieve that (e.g. parallelized
+ memory initialization is disabled) so the boot process
+ might take longer, especially on systems with a lot of
+ memory. Available with CONFIG_PAGE_EXTENSION=y.
+
failslab=
fail_usercopy=
fail_page_alloc=
@@ -1431,6 +1522,23 @@
as early as possible in order to facilitate early
boot debugging.
+ ftrace_boot_snapshot
+ [FTRACE] On boot up, a snapshot will be taken of the
+ ftrace ring buffer that can be read at:
+ /sys/kernel/tracing/snapshot.
+ This is useful if you need tracing information from kernel
+ boot up that is likely to be overwritten by user space
+ start up functionality.
+
+ Optionally, the snapshot can also be defined for a tracing
+ instance that was created by the trace_instance= command
+ line parameter.
+
+ trace_instance=foo,sched_switch ftrace_boot_snapshot=foo
+
+ The above will cause the "foo" tracing instance to trigger
+ a snapshot at the end of boot up.
+
ftrace_dump_on_oops[=orig_cpu]
[FTRACE] will dump the trace buffers on oops.
If no parameter is passed, ftrace will dump
@@ -1493,6 +1601,20 @@
dependencies. This only applies for fw_devlink=on|rpm.
Format: <bool>
+ fw_devlink.sync_state=
+ [KNL] When all devices that could probe have finished
+ probing, this parameter controls what to do with
+ devices that haven't yet received their sync_state()
+ calls.
+ Format: { strict | timeout }
+ strict -- Default. Continue waiting on consumers to
+ probe successfully.
+ timeout -- Give up waiting on consumers and call
+ sync_state() on any devices that haven't yet
+ received their sync_state() calls after
+ deferred_probe_timeout has expired or by
+ late_initcall() if !CONFIG_MODULES.
+
gamecon.map[2|3]=
[HW,JOY] Multisystem joystick and NES/SNES/PSX pad
support via parallel port (up to 5 devices per port)
@@ -1544,6 +1666,17 @@
Format: <unsigned int> such that (rxsize & ~0x1fffc0) == 0.
Default: 1024
+ hardened_usercopy=
+ [KNL] Under CONFIG_HARDENED_USERCOPY, whether
+ hardening is enabled for this boot. Hardened
+ usercopy checking is used to protect the kernel
+ from reading or writing beyond known memory
+ allocation boundaries as a proactive defense
+ against bounds-checking flaws in the kernel's
+ copy_to_user()/copy_from_user() interface.
+ on Perform hardened usercopy checks (default).
+ off Disable hardened usercopy checks.
+
hardlockup_all_cpu_backtrace=
[KNL] Should the hard-lockup detector generate
backtraces on all cpus.
@@ -1564,6 +1697,15 @@
corresponding firmware-first mode error processing
logic will be disabled.
+ hibernate= [HIBERNATION]
+ noresume Don't check if there's a hibernation image
+ present during boot.
+ nocompress Don't compress/decompress hibernation images.
+ no Disable hibernation and resume.
+ protect_image Turn on image protection during restoration
+ (that will set all pages holding image data
+ during restoration read-only).
+
highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact
size of <nn>. This works even on boxes that have no
highmem otherwise. This also works to reduce highmem
@@ -1575,6 +1717,19 @@
hlt [BUGS=ARM,SH]
+ hostname= [KNL] Set the hostname (aka UTS nodename).
+ Format: <string>
+ This allows setting the system's hostname during early
+ startup. This sets the name returned by gethostname.
+ Using this parameter to set the hostname makes it
+ possible to ensure the hostname is correctly set before
+ any userspace processes run, avoiding the possibility
+ that a process may call gethostname before the hostname
+ has been explicitly set, resulting in the calling
+ process getting an incorrect result. The string must
+ not exceed the maximum allowed hostname length (usually
+ 64 characters) and will be truncated otherwise.
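+
+ Example (name chosen for illustration):
+ hostname=buildhost1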
+
hpet= [X86-32,HPET] option to control HPET usage
Format: { enable (default) | disable | force |
verbose }
@@ -1586,22 +1741,16 @@
hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET
registers. Default set by CONFIG_HPET_MMAP_DEFAULT.
- hugetlb_cma= [HW,CMA] The size of a CMA area used for allocation
- of gigantic hugepages.
- Format: nn[KMGTPE]
-
- Reserve a CMA area of given size and allocate gigantic
- hugepages using the CMA allocator. If enabled, the
- boot-time allocation of gigantic hugepages is skipped.
-
hugepages= [HW] Number of HugeTLB pages to allocate at boot.
If this follows hugepagesz (below), it specifies
the number of pages of hugepagesz to be allocated.
If this is the first HugeTLB parameter on the command
line, it specifies the number of pages to allocate for
- the default huge page size. See also
- Documentation/admin-guide/mm/hugetlbpage.rst.
- Format: <integer>
+ the default huge page size. If using node format, the
+ number of pages to allocate per-node can be specified.
+ See also Documentation/admin-guide/mm/hugetlbpage.rst.
+ Format: <integer> or (node format)
+ <node>:<integer>[,<node>:<integer>]
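+
+ For example (illustrative counts), hugepages=0:2,1:4
+ allocates 2 pages on node 0 and 4 pages on node 1.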
hugepagesz=
[HW] The size of the HugeTLB pages. This is used in
@@ -1613,22 +1762,35 @@
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
+ hugetlb_cma= [HW,CMA] The size of a CMA area used for allocation
+ of gigantic hugepages. Or using node format, the size
+ of a CMA area per node can be specified.
+ Format: nn[KMGTPE] or (node format)
+ <node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
+
+ Reserve a CMA area of given size and allocate gigantic
+ hugepages using the CMA allocator. If enabled, the
+ boot-time allocation of gigantic hugepages is skipped.
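+
+ For example (illustrative sizes), hugetlb_cma=0:1G,1:2G
+ reserves a 1G CMA area on node 0 and a 2G CMA area on
+ node 1.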
+
hugetlb_free_vmemmap=
- [KNL] Reguires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
+ [KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
enabled.
+ Control if HugeTLB Vmemmap Optimization (HVO) is enabled.
Allows heavy hugetlb users to free up some more
- memory (6 * PAGE_SIZE for each 2MB hugetlb page).
+ memory (7 * PAGE_SIZE for each 2MB hugetlb page).
Format: { on | off (default) }
- on: enable the feature
- off: disable the feature
+ on: enable HVO
+ off: disable HVO
- Built with CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON=y,
+ Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y,
the default is on.
- This is not compatible with memory_hotplug.memmap_on_memory.
- If both parameters are enabled, hugetlb_free_vmemmap takes
- precedence over memory_hotplug.memmap_on_memory.
+ Note that the vmemmap pages may be allocated from the added
+ memory block itself when memory_hotplug.memmap_on_memory is
+ enabled; those vmemmap pages cannot be optimized even if this
+ feature is enabled. Other vmemmap pages not allocated from
+ the added memory block itself are not affected.
hung_task_panic=
[KNL] Should the hung task detector generate panics.
@@ -1650,12 +1812,6 @@
which allow the hypervisor to 'idle' the
guest on lock contention.
- keep_bootcon [KNL]
- Do not unregister boot console at start. This is only
- useful for debugging when something happens in the window
- between unregistering the boot console and initializing
- the real console.
-
i2c_bus= [HW] Override the default board specific I2C bus speed
or register an additional I2C bus that is not
registered from board initialization code.
@@ -1690,20 +1846,11 @@
architectures force reset to be always executed
i8042.unlock [HW] Unlock (ignore) the keylock
i8042.kbdreset [HW] Reset device connected to KBD port
+ i8042.probe_defer
+ [HW] Allow deferred probing upon i8042 probe errors
i810= [HW,DRM]
- i8k.ignore_dmi [HW] Continue probing hardware even if DMI data
- indicates that the driver is running on unsupported
- hardware.
- i8k.force [HW] Activate i8k driver even if SMM BIOS signature
- does not match list of supported models.
- i8k.power_status
- [HW] Report power status in /proc/i8k
- (disabled by default)
- i8k.restricted [HW] Allow controlling fans only if SYS_ADMIN
- capability is set.
-
i915.invert_brightness=
[DRM] Invert the sense of the variable that is used to
set the brightness of the panel backlight. Normally a
@@ -1721,26 +1868,6 @@
icn= [HW,ISDN]
Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]]
- ide-core.nodma= [HW] (E)IDE subsystem
- Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc
- .vlb_clock .pci_clock .noflush .nohpa .noprobe .nowerr
- .cdrom .chs .ignore_cable are additional options
- See Documentation/ide/ide.rst.
-
- ide-generic.probe-mask= [HW] (E)IDE subsystem
- Format: <int>
- Probe mask for legacy ISA IDE ports. Depending on
- platform up to 6 ports are supported, enabled by
- setting corresponding bits in the mask to 1. The
- default value is 0x0, which has a special meaning.
- On systems that have PCI, it triggers scanning the
- PCI bus for the first and the second port, which
- are then probed. On systems without PCI the value
- of 0x0 enables probing the two first ports as if it
- was 0x3.
-
- ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem
- Claim all unknown PCI IDE storage controllers.
idle= [X86]
Format: idle=poll, idle=halt, idle=nomwait
@@ -1866,7 +1993,8 @@
ima_template= [IMA]
Select one of defined IMA measurements template formats.
- Formats: { "ima" | "ima-ng" | "ima-sig" }
+ Formats: { "ima" | "ima-ng" | "ima-ngv2" | "ima-sig" |
+ "ima-sigv2" }
Default: "ima-ng"
ima_template_fmt=
@@ -2198,39 +2326,73 @@
ivrs_ioapic [HW,X86-64]
Provide an override to the IOAPIC-ID<->DEVICE-ID
- mapping provided in the IVRS ACPI table. For
- example, to map IOAPIC-ID decimal 10 to
- PCI device 00:14.0 write the parameter as:
+ mapping provided in the IVRS ACPI table.
+ By default, PCI segment is 0, and can be omitted.
+
+ For example, to map IOAPIC-ID decimal 10 to
+ PCI segment 0x1 and PCI device 00:14.0,
+ write the parameter as:
+ ivrs_ioapic=10@0001:00:14.0
+
+ Deprecated formats:
+ * To map IOAPIC-ID decimal 10 to PCI device 00:14.0
+ write the parameter as:
ivrs_ioapic[10]=00:14.0
+ * To map IOAPIC-ID decimal 10 to PCI segment 0x1 and
+ PCI device 00:14.0 write the parameter as:
+ ivrs_ioapic[10]=0001:00:14.0
ivrs_hpet [HW,X86-64]
Provide an override to the HPET-ID<->DEVICE-ID
- mapping provided in the IVRS ACPI table. For
- example, to map HPET-ID decimal 0 to
- PCI device 00:14.0 write the parameter as:
+ mapping provided in the IVRS ACPI table.
+ By default, PCI segment is 0, and can be omitted.
+
+ For example, to map HPET-ID decimal 10 to
+ PCI segment 0x1 and PCI device 00:14.0,
+ write the parameter as:
+ ivrs_hpet=10@0001:00:14.0
+
+ Deprecated formats:
+ * To map HPET-ID decimal 0 to PCI device 00:14.0
+ write the parameter as:
ivrs_hpet[0]=00:14.0
+ * To map HPET-ID decimal 10 to PCI segment 0x1 and
+ PCI device 00:14.0 write the parameter as:
+ ivrs_hpet[10]=0001:00:14.0
ivrs_acpihid [HW,X86-64]
Provide an override to the ACPI-HID:UID<->DEVICE-ID
- mapping provided in the IVRS ACPI table. For
- example, to map UART-HID:UID AMD0020:0 to
- PCI device 00:14.5 write the parameter as:
+ mapping provided in the IVRS ACPI table.
+ By default, PCI segment is 0, and can be omitted.
+
+ For example, to map UART-HID:UID AMD0020:0 to
+ PCI segment 0x1 and PCI device ID 00:14.5,
+ write the parameter as:
+ ivrs_acpihid=AMD0020:0@0001:00:14.5
+
+ Deprecated formats:
+ * To map UART-HID:UID AMD0020:0 to PCI segment 0 and
+ PCI device ID 00:14.5, write the parameter as:
ivrs_acpihid[00:14.5]=AMD0020:0
+ * To map UART-HID:UID AMD0020:0 to PCI segment 0x1 and
+ PCI device ID 00:14.5, write the parameter as:
+ ivrs_acpihid[0001:00:14.5]=AMD0020:0
js= [HW,JOY] Analog joystick
See Documentation/input/joydev/joystick.rst.
- nokaslr [KNL]
- When CONFIG_RANDOMIZE_BASE is set, this disables
- kernel and module base offset ASLR (Address Space
- Layout Randomization).
-
kasan_multi_shot
[KNL] Enforce KASAN (Kernel Address Sanitizer) to print
report on every invalid memory access. Without this
parameter KASAN will print report only for the first
invalid access.
+ keep_bootcon [KNL]
+ Do not unregister boot console at start. This is only
+ useful for debugging when something happens in the window
+ between unregistering the boot console and initializing
+ the real console.
+
keepinitrd [HW,ARM]
kernelcore= [KNL,X86,IA-64,PPC]
@@ -2326,16 +2488,43 @@
0: force disabled
1: force enabled
+ kunit.enable= [KUNIT] Enable executing KUnit tests. Requires
+ CONFIG_KUNIT to be set to be fully enabled. The
+ default value can be overridden via
+ KUNIT_DEFAULT_ENABLED.
+ Default is 1 (enabled)
+
kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
Default is 0 (don't ignore, but inject #GP)
+ kvm.eager_page_split=
+ [KVM,X86] Controls whether or not KVM will try to
+ proactively split all huge pages during dirty logging.
+ Eager page splitting reduces interruptions to vCPU
+ execution by eliminating the write-protection faults
+ and MMU lock contention that would otherwise be
+ required to split huge pages lazily.
+
+ VM workloads that rarely perform writes or that write
+ only to a small region of VM memory may benefit from
+ disabling eager page splitting to allow huge pages to
+ still be used for reads.
+
+ The behavior of eager page splitting depends on whether
+ KVM_DIRTY_LOG_INITIALLY_SET is enabled or disabled. If
+ disabled, all huge pages in a memslot will be eagerly
+ split when dirty logging is enabled on that memslot. If
+ enabled, eager page splitting will be performed during
+ the KVM_CLEAR_DIRTY ioctl, and only for the pages being
+ cleared.
+
+ Eager page splitting is only supported when kvm.tdp_mmu=Y.
+
+ Default is Y (on).
+
kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
Default is false (don't support).
- kvm.mmu_audit= [KVM] This is a R/W parameter which allows audit
- KVM MMU at runtime.
- Default is 0 (off)
-
kvm.nx_huge_pages=
[KVM] Controls the software workaround for the
X86_BUG_ITLB_MULTIHIT bug.
@@ -2353,7 +2542,14 @@
[KVM] Controls how many 4KiB pages are periodically zapped
back to huge pages. 0 disables the recovery, otherwise if
the value is N KVM will zap 1/Nth of the 4KiB pages every
- minute. The default is 60.
+ period (see below). The default is 60.
+
+ kvm.nx_huge_pages_recovery_period_ms=
+ [KVM] Controls the time period at which KVM zaps 4KiB pages
+ back to huge pages. If the value is a non-zero N, KVM will
+ zap a portion (see ratio above) of the pages every N msecs.
+ If the value is 0 (the default), KVM will pick a period based
+ on the ratio, such that a page is zapped after 1 hour on average.
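+
+ As an illustrative combination with the ratio parameter
+ above, kvm.nx_huge_pages_recovery_ratio=10
+ kvm.nx_huge_pages_recovery_period_ms=1000 would zap
+ 1/10th of the 4KiB pages every second.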
kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM.
Default is 1 (enabled)
@@ -2365,14 +2561,22 @@
kvm-arm.mode=
[KVM,ARM] Select one of KVM/arm64's modes of operation.
+ none: Forcefully disable KVM.
+
nvhe: Standard nVHE-based mode, without support for
protected guests.
protected: nVHE-based mode with support for guests whose
state is kept private from the host.
- Not valid if the kernel is running in EL2.
- Defaults to VHE/nVHE based on hardware support.
+ nested: VHE-based mode with support for nested
+ virtualization. Requires at least ARMv8.3
+ hardware.
+
+ Defaults to VHE/nVHE based on hardware support. Setting
+ mode to "protected" will disable kexec and hibernation
+ for the host. "nested" is experimental and should be
+ used with extreme caution.
kvm-arm.vgic_v3_group0_trap=
[KVM,ARM] Trap guest accesses to GICv3 group-0
@@ -2403,8 +2607,12 @@
Default is 1 (enabled)
kvm-intel.emulate_invalid_guest_state=
- [KVM,Intel] Enable emulation of invalid guest states
- Default is 0 (disabled)
+ [KVM,Intel] Control whether to emulate invalid guest state.
+ Ignored if kvm-intel.enable_unrestricted_guest=1, as
+ guest state is never invalid for unrestricted guests.
+ This param doesn't apply to nested guests (L2), as KVM
+ never emulates invalid L2 guest state.
+ Default is 1 (enabled)
kvm-intel.flexpriority=
[KVM,Intel] Disable FlexPriority feature (TPR shadow).
@@ -2548,14 +2756,14 @@
when set.
Format: <int>
- libata.force= [LIBATA] Force configurations. The format is comma-
- separated list of "[ID:]VAL" where ID is
- PORT[.DEVICE]. PORT and DEVICE are decimal numbers
- matching port, link or device. Basically, it matches
- the ATA ID string printed on console by libata. If
- the whole ID part is omitted, the last PORT and DEVICE
- values are used. If ID hasn't been specified yet, the
- configuration applies to all ports, links and devices.
+ libata.force= [LIBATA] Force configurations. The format is a comma-
+ separated list of "[ID:]VAL" where ID is PORT[.DEVICE].
+ PORT and DEVICE are decimal numbers matching port, link
+ or device. Basically, it matches the ATA ID string
+ printed on console by libata. If the whole ID part is
+ omitted, the last PORT and DEVICE values are used. If
+ ID hasn't been specified yet, the configuration applies
+ to all ports, links and devices.
If only DEVICE is omitted, the parameter applies to
the port and all links and devices behind it. DEVICE
@@ -2565,7 +2773,7 @@
host link and device attached to it.
The VAL specifies the configuration to force. As long
- as there's no ambiguity shortcut notation is allowed.
+ as there is no ambiguity, shortcut notation is allowed.
For example, both 1.5 and 1.5G would work for 1.5Gbps.
The following configurations can be forced.
@@ -2578,27 +2786,67 @@
udma[/][16,25,33,44,66,100,133] notation is also
allowed.
+ * nohrst, nosrst, norst: suppress hard, soft and both
+ resets.
+
+ * rstonce: only attempt one reset during hot-unplug
+ link recovery.
+
+ * [no]dbdelay: Enable or disable the extra 200ms delay
+ before debouncing a link PHY and device presence
+ detection.
+
* [no]ncq: Turn on or off NCQ.
- * [no]ncqtrim: Turn off queued DSM TRIM.
+ * [no]ncqtrim: Enable or disable queued DSM TRIM.
+
+ * [no]ncqati: Enable or disable NCQ trim on ATI chipset.
+
+ * [no]trim: Enable or disable (unqueued) TRIM.
+
+ * trim_zero: Indicate that TRIM command zeroes data.
+
+ * max_trim_128m: Set 128M maximum trim size limit.
+
+ * [no]dma: Turn on or off DMA transfers.
+
+ * atapi_dmadir: Enable ATAPI DMADIR bridge support.
- * nohrst, nosrst, norst: suppress hard, soft
- and both resets.
+ * atapi_mod16_dma: Enable the use of ATAPI DMA for
+ commands that are not a multiple of 16 bytes.
- * rstonce: only attempt one reset during
- hot-unplug link recovery
+ * [no]dmalog: Enable or disable the use of the
+ READ LOG DMA EXT command to access logs.
- * dump_id: dump IDENTIFY data.
+ * [no]iddevlog: Enable or disable access to the
+ identify device data log.
- * atapi_dmadir: Enable ATAPI DMADIR bridge support
+ * [no]logdir: Enable or disable access to the general
+ purpose log directory.
+
+ * max_sec_128: Set transfer size limit to 128 sectors.
+
+ * max_sec_1024: Set or clear transfer size limit to
+ 1024 sectors.
+
+ * max_sec_lba48: Set or clear transfer size limit to
+ 65535 sectors.
+
+ * [no]lpm: Enable or disable link power management.
+
+ * [no]setxfer: Indicate if transfer speed mode setting
+ should be skipped.
+
+ * [no]fua: Disable or enable FUA (Force Unit Access)
+ support for devices supporting this feature.
+
+ * dump_id: Dump IDENTIFY data.
* disable: Disable this device.
If there are multiple matching configurations changing
the same attribute, the last one is used.
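As an illustrative (hypothetical) combination,
libata.force=1.5G,2:noncq would limit all devices to
1.5Gbps and additionally disable NCQ on port 2 only.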
- memblock=debug [KNL] Enable memblock debug messages.
-
load_ramdisk= [RAM] [Deprecated]
lockd.nlm_grace_period=P [NFS] Assign grace period.
@@ -2740,7 +2988,7 @@
different yeeloong laptops.
Example: machtype=lemote-yeeloong-2f-7inch
- max_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory greater
+ max_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory greater
than or equal to this physical address is ignored.
maxcpus= [SMP] Maximum number of processors that an SMP kernel
@@ -2761,7 +3009,7 @@
mce [X86-32] Machine Check Exception
- mce=option [X86-64] See Documentation/x86/x86_64/boot-options.rst
+ mce=option [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
md= [HW] RAID subsystems devices and level
See Documentation/admin-guide/md.rst.
@@ -2802,6 +3050,9 @@
For details see: Documentation/admin-guide/hw-vuln/mds.rst
+ mem=nn[KMG] [HEXAGON] Set the memory size.
+ Must be specified, otherwise memory size will be 0.
+
mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory
Amount of memory to be used in cases as follows:
@@ -2809,6 +3060,13 @@
2 when the kernel is not able to see the whole system memory;
3 memory that lies after 'mem=' boundary is excluded from
the hypervisor, then assigned to KVM guests.
+ 4 to limit the memory available for kdump kernel.
+
+ [ARC,MICROBLAZE] - the limit applies only to low memory,
+ high memory is not affected.
+
+ [ARM64] - only limits memory covered by the linear
+ mapping. The NOMAP regions are not affected.
[X86] Work as limiting max address. Use together
with memmap= to avoid physical address space collisions.
@@ -2819,9 +3077,19 @@
in above case 3, memory may need be hot added after boot
if system memory of hypervisor is not sufficient.
+ mem=nn[KMG]@ss[KMG]
+ [ARM,MIPS] - override the memory layout reported by
+ firmware.
+ Define a memory region of size nn[KMG] starting at
+ ss[KMG].
+ Multiple different regions can be specified with
+ multiple mem= parameters on the command line.
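+
+ For example (illustrative layout), mem=64M@256M
+ mem=128M@512M defines a 64M region starting at 256M
+ and a 128M region starting at 512M.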
+
mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel
memory.
+ memblock=debug [KNL] Enable memblock debug messages.
+
memchunk=nn[KMG]
[KNL,SH] Allow user to override the default size for
per-device physically contiguous DMA buffers.
@@ -2907,10 +3175,12 @@
[KNL,X86,ARM] Boolean flag to enable this feature.
Format: {on | off (default)}
When enabled, runtime hotplugged memory will
- allocate its internal metadata (struct pages)
- from the hotadded memory which will allow to
- hotadd a lot of memory without requiring
- additional memory to do so.
+ allocate its internal metadata (struct pages;
+ those vmemmap pages cannot be optimized even
+ if hugetlb_free_vmemmap is enabled) from the
+ hotadded memory, which allows hotadding a
+ lot of memory without requiring additional
+ memory to do so.
This feature is disabled by default because it
has some implication on large (e.g. GB)
allocations in some configurations (e.g. small
@@ -2920,11 +3190,7 @@
Note that even when enabled, there are a few cases where
the feature is not effective.
- This is not compatible with hugetlb_free_vmemmap. If
- both parameters are enabled, hugetlb_free_vmemmap takes
- precedence over memory_hotplug.memmap_on_memory.
-
- memtest= [KNL,X86,ARM,PPC,RISCV] Enable memtest
+ memtest= [KNL,X86,ARM,M68K,PPC,RISCV] Enable memtest
Format: <integer>
default : 0 <disable>
Specifies the number of memtest passes to be
@@ -2942,7 +3208,7 @@
mem_encrypt=on: Activate SME
mem_encrypt=off: Do not activate SME
- Refer to Documentation/virt/kvm/amd-memory-encryption.rst
+ Refer to Documentation/virt/kvm/x86/amd-memory-encryption.rst
for details on when memory encryption can be activated.
mem_sleep_default= [SUSPEND] Default system suspend mode:
@@ -2951,9 +3217,6 @@
deep - Suspend-To-RAM or equivalent (if supported)
See Documentation/admin-guide/pm/sleep-states.rst.
- meye.*= [HW] Set MotionEye Camera parameters
- See Documentation/admin-guide/media/meye.rst.
-
mfgpt_irq= [IA-32] Specify the IRQ to use for the
Multi-Function General Purpose Timers on AMD Geode
platforms.
@@ -2965,7 +3228,7 @@
mga= [HW,DRM]
- min_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory below this
+ min_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory below this
physical address is ignored.
mini2440= [ARM,HW,KNL]
@@ -3000,19 +3263,23 @@
improves system performance, but it may also
expose users to several CPU vulnerabilities.
Equivalent to: nopti [X86,PPC]
- kpti=0 [ARM64]
+ if nokaslr then kpti=0 [ARM64]
nospectre_v1 [X86,PPC]
nobp=0 [S390]
nospectre_v2 [X86,PPC,S390,ARM64]
spectre_v2_user=off [X86]
spec_store_bypass_disable=off [X86,PPC]
ssbd=force-off [ARM64]
+ nospectre_bhb [ARM64]
l1tf=off [X86]
mds=off [X86]
tsx_async_abort=off [X86]
kvm.nx_huge_pages=off [X86]
+ srbds=off [X86,INTEL]
no_entry_flush [PPC]
no_uaccess_flush [PPC]
+ mmio_stale_data=off [X86]
+ retbleed=off [X86]
Exceptions:
This does not have any effect on
@@ -3034,6 +3301,8 @@
Equivalent to: l1tf=flush,nosmt [X86]
mds=full,nosmt [X86]
tsx_async_abort=full,nosmt [X86]
+ mmio_stale_data=full,nosmt [X86]
+ retbleed=auto,nosmt [X86]
mminit_loglevel=
[KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
@@ -3043,6 +3312,62 @@
log everything. Information is printed at KERN_DEBUG
so loglevel=8 may also need to be specified.
+ mmio_stale_data=
+ [X86,INTEL] Control mitigation for the Processor
+ MMIO Stale Data vulnerabilities.
+
+ Processor MMIO Stale Data is a class of
+ vulnerabilities that may expose data after an MMIO
+ operation. Exposed data could originate or end in
+ the same CPU buffers as affected by MDS and TAA.
+ Therefore, similar to MDS and TAA, the mitigation
+ is to clear the affected CPU buffers.
+
+ This parameter controls the mitigation. The
+ options are:
+
+ full - Enable mitigation on vulnerable CPUs
+
+ full,nosmt - Enable mitigation and disable SMT on
+ vulnerable CPUs.
+
+ off - Unconditionally disable mitigation
+
+ On MDS or TAA affected machines,
+ mmio_stale_data=off can be prevented by an active
+ MDS or TAA mitigation as these vulnerabilities are
+ mitigated with the same mechanism so in order to
+ disable this mitigation, you need to specify
+ mds=off and tsx_async_abort=off too.
+
+ Not specifying this option is equivalent to
+ mmio_stale_data=full.
+
+ For details see:
+ Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst
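+
+ For example, fully disabling this mitigation on a
+ machine also affected by MDS and TAA would require:
+ mmio_stale_data=off mds=off tsx_async_abort=off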
+
+ <module>.async_probe[=<bool>] [KNL]
+ If no <bool> value is specified or if the value
+ specified is not a valid <bool>, enable asynchronous
+ probe on this module. Otherwise, enable/disable
+ asynchronous probe on this module as indicated by the
+ <bool> value. See also: module.async_probe
+
+ module.async_probe=<bool>
+ [KNL] When set to true, modules will use async probing
+ by default. To enable/disable async probing for a
+ specific module, use the module specific control that
+ is documented under <module>.async_probe. When both
+ module.async_probe and <module>.async_probe are
+ specified, <module>.async_probe takes precedence for
+ the specific module.
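+
+ For example (module names hypothetical):
+ module.async_probe=true foo.async_probe=false
+ enables asynchronous probing globally while keeping
+ it disabled for the foo module.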
+
+ module.enable_dups_trace
+ [KNL] When CONFIG_MODULE_DEBUG_AUTOLOAD_DUPS is set,
+ this means that duplicate request_module() calls will
+ trigger a WARN_ON() instead of a pr_warn(). Note that
+ if MODULE_DEBUG_AUTOLOAD_DUPS_TRACE is set, WARN_ON()s
+ will always be issued and this option does nothing.
module.sig_enforce
[KNL] When CONFIG_MODULE_SIG is set, this means that
modules without (valid) signatures will fail to load.
@@ -3089,20 +3414,6 @@
mtdparts= [MTD]
See drivers/mtd/parsers/cmdlinepart.c
- multitce=off [PPC] This parameter disables the use of the pSeries
- firmware feature for updating multiple TCE entries
- at a time.
-
- onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration
-
- Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]
-
- boundary - index of last SLC block on Flex-OneNAND.
- The remaining blocks are configured as MLC blocks.
- lock - Configure if Flex-OneNAND boundary should be locked.
- Once locked, the boundary cannot be changed.
- 1 indicates lock status, 0 indicates unlock status.
-
mtdset= [ARM]
ARM/S3C2412 JIVE boot control
@@ -3129,6 +3440,10 @@
Used for mtrr cleanup. It is spare mtrr entries number.
Set to 2 or more if your graphical card needs more.
+ multitce=off [PPC] This parameter disables the use of the pSeries
+ firmware feature for updating multiple TCE entries
+ at a time.
+
n2= [NET] SDL Inc. RISCom/N2 synchronous serial card
netdev= [NET] Network devices parameters
@@ -3138,20 +3453,24 @@
This usage is only documented in each driver source
file if at all.
+ netpoll.carrier_timeout=
+ [NET] Specifies amount of time (in seconds) that
+ netpoll should wait for a carrier. By default netpoll
+ waits 4 seconds.
+
nf_conntrack.acct=
[NETFILTER] Enable connection tracking flow accounting
0 to disable accounting
1 to enable accounting
Default value is 0.
- nfsaddrs= [NFS] Deprecated. Use ip= instead.
- See Documentation/admin-guide/nfs/nfsroot.rst.
-
- nfsroot= [NFS] nfs root filesystem for disk-less boxes.
- See Documentation/admin-guide/nfs/nfsroot.rst.
+ nfs.cache_getent=
+ [NFS] sets the pathname to the program which is used
+ to update the NFS client cache entries.
- nfsrootdebug [NFS] enable nfsroot debugging messages.
- See Documentation/admin-guide/nfs/nfsroot.rst.
+ nfs.cache_getent_timeout=
+ [NFS] sets the timeout after which an attempt to
+ update a cache entry is deemed to have failed.
nfs.callback_nr_threads=
[NFSv4] set the total number of threads that the
@@ -3162,18 +3481,6 @@
[NFS] set the TCP port on which the NFSv4 callback
channel should listen.
- nfs.cache_getent=
- [NFS] sets the pathname to the program which is used
- to update the NFS client cache entries.
-
- nfs.cache_getent_timeout=
- [NFS] sets the timeout after which an attempt to
- update a cache entry is deemed to have failed.
-
- nfs.idmap_cache_timeout=
- [NFS] set the maximum lifetime for idmapper cache
- entries.
-
nfs.enable_ino64=
[NFS] enable 64-bit inode numbers.
If zero, the NFS client will fake up a 32-bit inode
@@ -3181,6 +3488,10 @@
of returning the full 64-bit number.
The default is to return 64-bit inode numbers.
+ nfs.idmap_cache_timeout=
+ [NFS] set the maximum lifetime for idmapper cache
+ entries.
+
nfs.max_session_cb_slots=
[NFSv4.1] Sets the maximum number of session
slots the client will assign to the callback
@@ -3208,21 +3519,14 @@
will be autodetected by the client, and it will fall
back to using the idmapper.
To turn off this behaviour, set the value to '0'.
+
nfs.nfs4_unique_id=
[NFS4] Specify an additional fixed unique ident-
ification string that NFSv4 clients can insert into
their nfs_client_id4 string. This is typically a
UUID that is generated at system install time.
- nfs.send_implementation_id =
- [NFSv4.1] Send client implementation identification
- information in exchange_id requests.
- If zero, no implementation identification information
- will be sent.
- The default is to send the implementation identification
- information.
-
- nfs.recover_lost_locks =
+ nfs.recover_lost_locks=
[NFSv4] Attempt to recover locks that were lost due
to a lease timeout on the server. Please note that
doing this risks data corruption, since there are
@@ -3234,7 +3538,15 @@
The default parameter value of '0' causes the kernel
not to attempt recovery of lost locks.
- nfs4.layoutstats_timer =
+ nfs.send_implementation_id=
+ [NFSv4.1] Send client implementation identification
+ information in exchange_id requests.
+ If zero, no implementation identification information
+ will be sent.
+ The default is to send the implementation identification
+ information.
+
+ nfs4.layoutstats_timer=
[NFSv4.2] Change the rate at which the kernel sends
layoutstats to the pNFS metadata server.
@@ -3243,6 +3555,11 @@
driver. A non-zero value sets the minimum interval
in seconds between layoutstats transmissions.
+ nfsd.inter_copy_offload_enable=
+ [NFSv4.2] When set to 1, the server will support
+ server-to-server copies for which this server is
+ the destination of the copy.
+
nfsd.nfs4_disable_idmapping=
[NFSv4] When set to the default of '1', the NFSv4
server will return only numeric uids and gids to
@@ -3250,6 +3567,23 @@
and gids from such clients. This is intended to ease
migration from NFSv2/v3.
+ nfsd.nfsd4_ssc_umount_timeout=
+ [NFSv4.2] When used as the destination of a
+ server-to-server copy, knfsd temporarily mounts
+ the source server. It caches the mount in case
+ it will be needed again, and discards it if not
+ used for the number of milliseconds specified by
+ this parameter.
+
+ nfsaddrs= [NFS] Deprecated. Use ip= instead.
+ See Documentation/admin-guide/nfs/nfsroot.rst.
+
+ nfsroot= [NFS] nfs root filesystem for disk-less boxes.
+ See Documentation/admin-guide/nfs/nfsroot.rst.
+
+ nfsrootdebug [NFS] enable nfsroot debugging messages.
+ See Documentation/admin-guide/nfs/nfsroot.rst.
+
nmi_backtrace.backtrace_idle [KNL]
Dump stacks even of idle CPUs in response to an
NMI stack-backtrace request.
@@ -3274,45 +3608,15 @@
These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls.
- netpoll.carrier_timeout=
- [NET] Specifies amount of time (in seconds) that
- netpoll should wait for a carrier. By default netpoll
- waits 4 seconds.
-
no387 [BUGS=X86-32] Tells the kernel to use the 387 maths
emulation library even if a 387 maths coprocessor
is present.
- no5lvl [X86-64] Disable 5-level paging mode. Forces
- kernel to use 4-level paging instead.
-
- nofsgsbase [X86] Disables FSGSBASE instructions.
+ no4lvl [RISCV] Disable 4-level and 5-level paging modes. Forces
+ kernel to use 3-level paging instead.
- no_console_suspend
- [HW] Never suspend the console
- Disable suspending of consoles during suspend and
- hibernate operations. Once disabled, debugging
- messages can reach various consoles while the rest
- of the system is being put to sleep (ie, while
- debugging driver suspend/resume hooks). This may
- not work reliably with all consoles, but is known
- to work with serial and VGA consoles.
- To facilitate more flexible debugging, we also add
- console_suspend, a printk module parameter to control
- it. Users could use console_suspend (usually
- /sys/module/printk/parameters/console_suspend) to
- turn on/off it dynamically.
-
- novmcoredd [KNL,KDUMP]
- Disable device dump. Device dump allows drivers to
- append dump data to vmcore so you can collect driver
- specified debug info. Drivers can append the data
- without any limit and this data is stored in memory,
- so this may cause significant memory stress. Disabling
- device dump can help save memory but the driver debug
- data will be no longer available. This parameter
- is only available when CONFIG_PROC_VMCORE_DEVICE_DUMP
- is set.
+ no5lvl [X86-64,RISCV] Disable 5-level paging mode. Forces
+ kernel to use 4-level paging instead.
noaliencache [MM, NUMA, SLAB] Disables the allocation of alien
caches in the slab allocator. Saves per-node memory,
@@ -3328,14 +3632,25 @@
noautogroup Disable scheduler automatic task group creation.
- nobats [PPC] Do not use BATs for mapping kernel lowmem
- on "Classic" PPC cores.
-
nocache [ARM]
- noclflush [BUGS=X86] Don't use the CLFLUSH instruction
+ no_console_suspend
+ [HW] Never suspend the console
+ Disable suspending of consoles during suspend and
+ hibernate operations. Once disabled, debugging
+ messages can reach various consoles while the rest
+ of the system is being put to sleep (ie, while
+ debugging driver suspend/resume hooks). This may
+ not work reliably with all consoles, but is known
+ to work with serial and VGA consoles.
+ To facilitate more flexible debugging, we also add
+ console_suspend, a printk module parameter to control
+ it. Users could use console_suspend (usually
+ /sys/module/printk/parameters/console_suspend) to
+ turn it on/off dynamically.
- delayacct [KNL] Enable per-task delay accounting
+ no_debug_objects
+ [KNL] Disable object debugging
nodsp [SH] Disable hardware DSP at boot time.
@@ -3345,19 +3660,6 @@
noexec [IA-64]
- noexec [X86]
- On X86-32 available only on PAE configured kernels.
- noexec=on: enable non-executable mappings (default)
- noexec=off: disable non-executable mappings
-
- nosmap [X86,PPC]
- Disable SMAP (Supervisor Mode Access Prevention)
- even if it is supported by processor.
-
- nosmep [X86,PPC]
- Disable SMEP (Supervisor Mode Execution Prevention)
- even if it is supported by processor.
-
noexec32 [X86-64]
This affects only 32-bit executables.
noexec32=on: enable non-executable mappings (default)
@@ -3365,70 +3667,18 @@
noexec32=off: disable non-executable mappings
read implies executable mappings
+ no_file_caps Tells the kernel not to honor file capabilities. The
+ only way then for a file to be executed with privilege
+ is to be setuid root or executed by root.
+
nofpu [MIPS,SH] Disable hardware FPU at boot time.
+ nofsgsbase [X86] Disables FSGSBASE instructions.
+
nofxsr [BUGS=X86-32] Disables x86 floating point extended
register save and restore. The kernel will only save
legacy floating-point registers on task switch.
- nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings.
-
- nohugevmalloc [PPC] Disable kernel huge vmalloc mappings.
-
- nosmt [KNL,S390] Disable symmetric multithreading (SMT).
- Equivalent to smt=1.
-
- [KNL,X86] Disable symmetric multithreading (SMT).
- nosmt=force: Force disable SMT, cannot be undone
- via the sysfs control file.
-
- nospectre_v1 [X86,PPC] Disable mitigations for Spectre Variant 1
- (bounds check bypass). With this option data leaks are
- possible in the system.
-
- nospectre_v2 [X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for
- the Spectre variant 2 (indirect branch prediction)
- vulnerability. System may allow data leaks with this
- option.
-
- nospec_store_bypass_disable
- [HW] Disable all mitigations for the Speculative Store Bypass vulnerability
-
- no_uaccess_flush
- [PPC] Don't flush the L1-D cache after accessing user data.
-
- noxsave [BUGS=X86] Disables x86 extended register state save
- and restore using xsave. The kernel will fallback to
- enabling legacy floating-point and sse state.
-
- noxsaveopt [X86] Disables xsaveopt used in saving x86 extended
- register states. The kernel will fall back to use
- xsave to save the states. By using this parameter,
- performance of saving the states is degraded because
- xsave doesn't support modified optimization while
- xsaveopt supports it on xsaveopt enabled systems.
-
- noxsaves [X86] Disables xsaves and xrstors used in saving and
- restoring x86 extended register state in compacted
- form of xsave area. The kernel will fall back to use
- xsaveopt and xrstor to save and restore the states
- in standard form of xsave area. By using this
- parameter, xsave area per process might occupy more
- memory on xsaves enabled systems.
-
- nohlt [ARM,ARM64,MICROBLAZE,SH] Forces the kernel to busy wait
- in do_idle() and not use the arch_cpu_idle()
- implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP
- to be effective. This is useful on platforms where the
- sleep(SH) or wfi(ARM,ARM64) instructions do not work
- correctly or when doing power measurements to evalute
- the impact of the sleep instructions. This is also
- useful when using JTAG debugger.
-
- no_file_caps Tells the kernel not to honor file capabilities. The
- only way then for a file to be executed with privilege
- is to be setuid root or executed by root.
-
nohalt [IA-64] Tells the kernel not to use the power saving
function PAL_HALT_LIGHT when idle. This increases
power-consumption. On the positive side, it reduces
@@ -3446,13 +3696,25 @@
difficult since unequal pointers can no longer be
compared. However, if this command-line option is
specified, then all normal pointers will have their true
- value printed. Pointers printed via %pK may still be
- hashed. This option should only be specified when
+ value printed. This option should only be specified when
debugging the kernel. Please do not use on production
kernels.
nohibernate [HIBERNATION] Disable hibernation and resume.
+ nohlt [ARM,ARM64,MICROBLAZE,SH] Forces the kernel to busy wait
+ in do_idle() and not use the arch_cpu_idle()
+ implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP
+ to be effective. This is useful on platforms where the
+ sleep(SH) or wfi(ARM,ARM64) instructions do not work
+ correctly or when doing power measurements to evaluate
+ the impact of the sleep instructions. This is also
+ useful when using JTAG debugger.
+
+ nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings.
+
+ nohugevmalloc [KNL,X86,PPC,ARM64] Disable kernel huge vmalloc mappings.
+
nohz= [KNL] Boottime enable/disable dynamic ticks
Valid arguments: on, off
Default: on
@@ -3467,15 +3729,8 @@
just as if they had also been called out in the
rcu_nocbs= boot parameter.
- noiotrap [SH] Disables trapped I/O port accesses.
-
- noirqdebug [X86-32] Disables the code which attempts to detect and
- disable unhandled interrupt sources.
-
- no_timer_check [X86,APIC] Disables the code which tests for
- broken timer IRQ sources.
-
- noisapnp [ISAPNP] Disables ISA PnP code.
+ Note that this argument takes precedence over
+ the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
noinitrd [RAM] Tells the kernel not to load any configured
initial RAM disk.
@@ -3488,28 +3743,29 @@
noinvpcid [X86] Disable the INVPCID cpu feature.
+ noiotrap [SH] Disables trapped I/O port accesses.
+
+ noirqdebug [X86-32] Disables the code which attempts to detect and
+ disable unhandled interrupt sources.
+
+ noisapnp [ISAPNP] Disables ISA PnP code.
+
nojitter [IA-64] Disables jitter checking for ITC timers.
- no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver
+ nokaslr [KNL]
+ When CONFIG_RANDOMIZE_BASE is set, this disables
+ kernel and module base offset ASLR (Address Space
+ Layout Randomization).
no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page
fault handling.
- no-vmw-sched-clock
- [X86,PV_OPS] Disable paravirtualized VMware scheduler
- clock and use the default one.
-
- no-steal-acc [X86,PV_OPS,ARM64] Disable paravirtualized steal time
- accounting. steal time is computed, but won't
- influence scheduler behaviour
+ no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver
nolapic [X86-32,APIC] Do not enable or use the local APIC.
nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
- noltlbs [PPC] Do not use large page/tlb entries for kernel
- lowmem mapping on PPC40x and PPC8xx
-
nomca [IA-64] Disable machine check abort handling
nomce [X86-32] Disable Machine Check Exception
@@ -3517,48 +3773,123 @@
nomfgpt [X86-32] Disable Multi-Function General Purpose
Timer usage (for AMD Geode machines).
+ nomodeset Disable kernel modesetting. Most systems' firmware
+ sets up a display mode and provides framebuffer memory
+ for output. With nomodeset, DRM and fbdev drivers will
+ not load if they could possibly displace the pre-
+ initialized output. Only the system framebuffer will
+ be available for use. The respective drivers will not
+ perform display-mode changes or accelerated rendering.
+
+ Useful as error fallback, or for testing and debugging.
+
+ nomodule Disable module load
+
nonmi_ipi [X86] Disable using NMI IPIs during panic/reboot to
shutdown the other cpus. Instead use the REBOOT_VECTOR
irq.
- nomodule Disable module load
-
nopat [X86] Disable PAT (page attribute table extension of
pagetables) support.
nopcid [X86-64] Disable the PCID cpu feature.
+ nopku [X86] Disable Memory Protection Keys CPU feature found
+ in some Intel CPUs.
+
+ nopti [X86-64]
+ Equivalent to pti=off
+
+ nopv= [X86,XEN,KVM,HYPER_V,VMWARE]
+ Disables the PV optimizations forcing the guest to run
+ as generic guest with no PV drivers. Currently support
+ XEN HVM, KVM, HYPER_V and VMWARE guest.
+
+ nopvspin [X86,XEN,KVM]
+ Disables the qspinlock slow path using PV optimizations
+ which allow the hypervisor to 'idle' the guest on lock
+ contention.
+
norandmaps Don't use address space randomization. Equivalent to
echo 0 > /proc/sys/kernel/randomize_va_space
noreplace-smp [X86-32,SMP] Don't replace SMP instructions
with UP alternatives
- nordrand [X86] Disable kernel use of the RDRAND and
- RDSEED instructions even if they are supported
- by the processor. RDRAND and RDSEED are still
- available to user space applications.
-
noresume [SWSUSP] Disables resume and restores original swap
space.
+ nosbagart [IA-64]
+
no-scroll [VGA] Disables scrollback.
This is required for the Braillex ib80-piezo Braille
reader made by F.H. Papenmeier (Germany).
- nosbagart [IA-64]
+ nosgx [X86-64,SGX] Disables Intel SGX kernel support.
- nosep [BUGS=X86-32] Disables x86 SYSENTER/SYSEXIT support.
+ nosmap [PPC]
+ Disable SMAP (Supervisor Mode Access Prevention)
+ even if it is supported by processor.
- nosgx [X86-64,SGX] Disables Intel SGX kernel support.
+ nosmep [PPC64s]
+ Disable SMEP (Supervisor Mode Execution Prevention)
+ even if it is supported by processor.
nosmp [SMP] Tells an SMP kernel to act as a UP kernel,
and disable the IO APIC. legacy for "maxcpus=0".
+ nosmt [KNL,S390] Disable symmetric multithreading (SMT).
+ Equivalent to smt=1.
+
+ [KNL,X86] Disable symmetric multithreading (SMT).
+ nosmt=force: Force disable SMT, cannot be undone
+ via the sysfs control file.
+
nosoftlockup [KNL] Disable the soft-lockup detector.
+ nospec_store_bypass_disable
+ [HW] Disable all mitigations for the Speculative Store Bypass vulnerability
+
+ nospectre_bhb [ARM64] Disable all mitigations for Spectre-BHB (branch
+ history injection) vulnerability. System may allow data leaks
+ with this option.
+
+ nospectre_v1 [X86,PPC] Disable mitigations for Spectre Variant 1
+ (bounds check bypass). With this option data leaks are
+ possible in the system.
+
+ nospectre_v2 [X86,PPC_E500,ARM64] Disable all mitigations for
+ the Spectre variant 2 (indirect branch prediction)
+ vulnerability. System may allow data leaks with this
+ option.
+
+ no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES] Disable paravirtualized
+ steal time accounting. steal time is computed, but
+ won't influence scheduler behaviour
+
nosync [HW,M68K] Disables sync negotiation for all devices.
+ no_timer_check [X86,APIC] Disables the code which tests for
+ broken timer IRQ sources.
+
+ no_uaccess_flush
+ [PPC] Don't flush the L1-D cache after accessing user data.
+
+ novmcoredd [KNL,KDUMP]
+ Disable device dump. Device dump allows drivers to
+ append dump data to vmcore so you can collect driver
+ specified debug info. Drivers can append the data
+ without any limit and this data is stored in memory,
+ so this may cause significant memory stress. Disabling
+ device dump can help save memory but the driver debug
+ data will be no longer available. This parameter
+ is only available when CONFIG_PROC_VMCORE_DEVICE_DUMP
+ is set.
+
+ no-vmw-sched-clock
+ [X86,PV_OPS] Disable paravirtualized VMware scheduler
+ clock and use the default one.
+
nowatchdog [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).
@@ -3566,19 +3897,28 @@
nox2apic [X86-64,APIC] Do not enable x2APIC mode.
- cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
- CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
- Some features depend on CPU0. Known dependencies are:
- 1. Resume from suspend/hibernate depends on CPU0.
- Suspend/hibernate will fail if CPU0 is offline and you
- need to online CPU0 before suspend/hibernate.
- 2. PIC interrupts also depend on CPU0. CPU0 can't be
- removed if a PIC interrupt is detected.
- It's said poweroff/reboot may depend on CPU0 on some
- machines although I haven't seen such issues so far
- after CPU0 is offline on a few tested machines.
- If the dependencies are under your control, you can
- turn on cpu0_hotplug.
+ NOTE: this parameter will be ignored on systems with the
+ LEGACY_XAPIC_DISABLED bit set in the
+ IA32_XAPIC_DISABLE_STATUS MSR.
+
+ noxsave [BUGS=X86] Disables x86 extended register state save
+ and restore using xsave. The kernel will fallback to
+ enabling legacy floating-point and sse state.
+
+ noxsaveopt [X86] Disables xsaveopt used in saving x86 extended
+ register states. The kernel will fall back to use
+ xsave to save the states. By using this parameter,
+ performance of saving the states is degraded because
+ xsave doesn't support modified optimization while
+ xsaveopt supports it on xsaveopt enabled systems.
+
+ noxsaves [X86] Disables xsaves and xrstors used in saving and
+ restoring x86 extended register state in compacted
+ form of xsave area. The kernel will fall back to use
+ xsaveopt and xrstor to save and restore the states
+ in standard form of xsave area. By using this
+ parameter, xsave area per process might occupy more
+ memory on xsaves enabled systems.
nps_mtm_hs_ctr= [KNL,ARC]
This parameter sets the maximum duration, in
@@ -3632,6 +3972,16 @@
For example, to override I2C bus2:
omap_mux=i2c2_scl.i2c2_scl=0x100,i2c2_sda.i2c2_sda=0x100
+ onenand.bdry= [HW,MTD] Flex-OneNAND Boundary Configuration
+
+ Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]
+
+ boundary - index of last SLC block on Flex-OneNAND.
+ The remaining blocks are configured as MLC blocks.
+ lock - Configure if Flex-OneNAND boundary should be locked.
+ Once locked, the boundary cannot be changed.
+ 1 indicates lock status, 0 indicates unlock status.
+
oops=panic Always panic on oopses. Default is to just kill the
process, but there is a small probability of
deadlocking the machine.
@@ -3664,7 +4014,7 @@
[KNL] Minimal page reporting order
Format: <integer>
Adjust the minimal page reporting order. The page
- reporting is disabled when it exceeds (MAX_ORDER-1).
+ reporting is disabled when it exceeds MAX_ORDER.
panic= [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting
@@ -3680,6 +4030,11 @@
bit 3: print locks info if CONFIG_LOCKDEP is on
bit 4: print ftrace buffer
bit 5: print all printk messages in buffer
+ bit 6: print all CPUs backtrace (if available in the arch)
+ *Be aware* that this option may print a _lot_ of lines,
+ so there are risks of losing older messages in the log.
+ Use this option carefully; it may be worth setting up a
+ bigger log buffer with "log_buf_len" along with this.
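+
+ For example, panic_print=0x40 sets bit 6 only,
+ requesting backtraces from all CPUs on panic.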
panic_on_taint= Bitmask for conditionally calling panic() in add_taint()
Format: <hex>[,nousertaint]
@@ -3697,14 +4052,6 @@
panic_on_warn panic() instead of WARN(). Useful to cause kdump
on a WARN().
- crash_kexec_post_notifiers
- Run kdump after running panic-notifiers and dumping
- kmsg. This only for the users who doubt kdump always
- succeeds in any situation.
- Note that this also increases risks of kdump failure,
- because some panic notifiers can make the crashed
- kernel more unstable.
-
parkbd.port= [HW] Parallel port number the keyboard adapter is
connected to, default is 0.
Format: <parport#>
@@ -3831,10 +4178,6 @@
pcbit= [HW,ISDN]
- pcd. [PARIDE]
- See header of drivers/block/paride/pcd.c.
- See also Documentation/admin-guide/blockdev/paride.rst.
-
pci=option[,option...] [PCI] various PCI subsystem options.
Some options herein operate on a specific device
@@ -3949,6 +4292,15 @@
please report a bug.
nocrs [X86] Ignore PCI host bridge windows from ACPI.
If you need to use this, please report a bug.
+ use_e820 [X86] Use E820 reservations to exclude parts of
+ PCI host bridge windows. This is a workaround
+ for BIOS defects in host bridge _CRS methods.
+ If you need to use this, please report a bug to
+ <linux-pci@vger.kernel.org>.
+ no_e820 [X86] Ignore E820 reservations for PCI host
+ bridge windows. This is the default on modern
+ hardware. If you need to use this, please report
+ a bug to <linux-pci@vger.kernel.org>.
routeirq Do IRQ routing for all PCI devices.
This is normally done in pci_enable_device(),
so this option is a temporary workaround
@@ -4001,7 +4353,9 @@
specified, e.g., 12@pci:8086:9c22:103c:198f
for 4096-byte alignment.
ecrc= Enable/disable PCIe ECRC (transaction layer
- end-to-end CRC checking).
+ end-to-end CRC checking). Only effective if
+ the OS has native AER control (either granted by
+ ACPI _OSC or forced via "pcie_ports=native")
bios: Use BIOS/firmware settings. This is
the default.
off: Turn ECRC off
@@ -4088,9 +4442,6 @@
for debug and development, but should not be
needed on a platform with proper driver support.
- pd. [PARIDE]
- See Documentation/admin-guide/blockdev/paride.rst.
-
pdcchassis= [PARISC,HW] Disable/Enable PDC Chassis Status codes at
boot time.
Format: { 0 | 1 }
@@ -4103,14 +4454,8 @@
allocator. This parameter is primarily for debugging
and performance comparison.
- pf. [PARIDE]
- See Documentation/admin-guide/blockdev/paride.rst.
-
- pg. [PARIDE]
- See Documentation/admin-guide/blockdev/paride.rst.
-
pirq= [SMP,APIC] Manual mp-table setup
- See Documentation/x86/i386/IO-APIC.rst.
+ See Documentation/arch/x86/i386/IO-APIC.rst.
plip= [PPT,NET] Parallel port network link
Format: { parport<nr> | timid | 0 }
@@ -4120,6 +4465,14 @@
Override pmtimer IOPort with a hex value.
e.g. pmtmr=0x508
+ pmu_override= [PPC] Override the PMU.
+ This option takes over the PMU facility, so it is no
+ longer usable by perf. Setting this option starts the
+ PMU counters by setting MMCR0 to 0 (the FC bit is
+ cleared). If a number is given, then MMCR1 is set to
+ that number, otherwise (e.g., 'pmu_override=on'), MMCR1
+ remains 0.
+
pm_debug_messages [SUSPEND,KNL]
Enable suspend/resume debug messages during boot up.
@@ -4262,9 +4615,6 @@
pstore.backend= Specify the name of the pstore backend to use
- pt. [PARIDE]
- See Documentation/admin-guide/blockdev/paride.rst.
-
pti= [X86-64] Control Page Table Isolation of user and
kernel address spaces. Disabling this feature
removes hardening, but improves performance of
@@ -4277,9 +4627,6 @@
Not specifying this option is equivalent to pti=auto.
- nopti [X86-64]
- Equivalent to pti=off
-
pty.legacy_count=
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
@@ -4288,6 +4635,10 @@
r128= [HW,DRM]
+ radix_hcall_invalidate=on [PPC/PSERIES]
+ Disable RADIX GTSE feature and use hcall for TLB
+ invalidate.
+
raid= [HW,RAID]
See Documentation/admin-guide/md.rst.
@@ -4296,11 +4647,15 @@
ramdisk_start= [RAM] RAM disk image start address
- random.trust_cpu={on,off}
- [KNL] Enable or disable trusting the use of the
- CPU's random number generator (if available) to
- fully seed the kernel's CRNG. Default is controlled
- by CONFIG_RANDOM_TRUST_CPU.
+ random.trust_cpu=off
+ [KNL] Disable trusting the use of the CPU's
+ random number generator (if available) to
+ initialize the kernel's RNG.
+
+ random.trust_bootloader=off
+ [KNL] Disable trusting the use of a seed
+ passed by the bootloader (if available) to
+ initialize the kernel's RNG.
randomize_kstack_offset=
[KNL] Enable or disable kernel stack offset
@@ -4319,19 +4674,33 @@
Disable the Correctable Errors Collector,
see CONFIG_RAS_CEC help text.
- rcu_nocbs= [KNL]
- The argument is a cpu list, as described above.
-
- In kernels built with CONFIG_RCU_NOCB_CPU=y, set
- the specified list of CPUs to be no-callback CPUs.
- Invocation of these CPUs' RCU callbacks will be
- offloaded to "rcuox/N" kthreads created for that
- purpose, where "x" is "p" for RCU-preempt, and
- "s" for RCU-sched, and "N" is the CPU number.
- This reduces OS jitter on the offloaded CPUs,
- which can be useful for HPC and real-time
- workloads. It can also improve energy efficiency
- for asymmetric multiprocessors.
+ rcu_nocbs[=cpu-list]
+ [KNL] The optional argument is a cpu list,
+ as described above.
+
+ In kernels built with CONFIG_RCU_NOCB_CPU=y,
+ enable the no-callback CPU mode, which prevents
+ such CPUs' callbacks from being invoked in
+ softirq context. Invocation of such CPUs' RCU
+ callbacks will instead be offloaded to "rcuox/N"
+ kthreads created for that purpose, where "x" is
+ "p" for RCU-preempt, "s" for RCU-sched, and "g"
+ for the kthreads that mediate grace periods; and
+ "N" is the CPU number. This reduces OS jitter on
+ the offloaded CPUs, which can be useful for HPC
+ and real-time workloads. It can also improve
+ energy efficiency for asymmetric multiprocessors.
+
+ If a cpulist is passed as an argument, the specified
+ list of CPUs is set to no-callback mode from boot.
+
+ Otherwise, if the '=' sign and the cpulist
+ arguments are omitted, no CPU will be set to
+ no-callback mode from boot but the mode may be
+ toggled at runtime via cpusets.
+
+ Note that this argument takes precedence over
+ the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
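+
+ For example, rcu_nocbs=4-7 sets CPUs 4 through 7 to
+ no-callback mode from boot.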
rcu_nocb_poll [KNL]
Rather than requiring that offloaded CPUs
@@ -4439,6 +4808,36 @@
(the least-favored priority). Otherwise, when
RCU_BOOST is not set, valid values are 0-99 and
the default is zero (non-realtime operation).
+ When RCU_NOCB_CPU is set, also adjust the
+ priority of NOCB callback kthreads.
+
+ rcutree.rcu_divisor= [KNL]
+ Set the shift-right count to use to compute
+ the callback-invocation batch limit bl from
+ the number of callbacks queued on this CPU.
+ The result will be bounded below by the value of
+ the rcutree.blimit kernel parameter. Every bl
+ callbacks, the softirq handler will exit in
+ order to allow the CPU to do other work.
+
+ Please note that this callback-invocation batch
+ limit applies only to non-offloaded callback
+ invocation. Offloaded callbacks are instead
+ invoked in the context of an rcuoc kthread, which
+ the scheduler will preempt as it does any other task.
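+
+ A worked example with hypothetical values: given
+ rcutree.rcu_divisor=7 and 10000 queued callbacks,
+ bl = 10000 >> 7 = 78, but never less than the
+ value of rcutree.blimit.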
+
+ rcutree.nocb_nobypass_lim_per_jiffy= [KNL]
+ On callback-offloaded (rcu_nocbs) CPUs,
+ RCU reduces the lock contention that would
+ otherwise be caused by callback floods through
+ use of the ->nocb_bypass list. However, in the
+ common non-flooded case, RCU queues directly to
+ the main ->cblist in order to avoid the extra
+ overhead of the ->nocb_bypass list and its lock.
+ But if there are too many callbacks queued during
+ a single jiffy, RCU pre-queues the callbacks into
+ the ->nocb_bypass queue. The definition of "too
+ many" is supplied by this kernel boot parameter.
rcutree.rcu_nocb_gp_stride= [KNL]
Set the number of NOCB callback kthreads in
@@ -4465,10 +4864,6 @@
on rcutree.qhimark at boot time and to zero to
disable more aggressive help enlistment.
- rcutree.rcu_idle_gp_delay= [KNL]
- Set wakeup interval for idle CPUs that have
- RCU callbacks (RCU_FAST_NO_HZ=y).
-
rcutree.rcu_kick_kthreads= [KNL]
Cause the grace-period kthread to get an extra
wake_up() if it sleeps three times longer than
@@ -4579,8 +4974,12 @@
in seconds.
rcutorture.fwd_progress= [KNL]
- Enable RCU grace-period forward-progress testing
+ Specifies the number of kthreads to be used
+ for RCU grace-period forward-progress testing
for the types of RCU supporting this notion.
+ Defaults to 1 kthread; values less than zero or
+ greater than the number of CPUs cause the number
+ of CPUs to be used.
rcutorture.fwd_progress_div= [KNL]
Specify the fraction of a CPU-stall-warning
@@ -4749,6 +5148,29 @@
rcupdate.rcu_cpu_stall_timeout= [KNL]
Set timeout for RCU CPU stall warning messages.
+ The value is in seconds and the maximum allowed
+ value is 300 seconds.
+
+ rcupdate.rcu_exp_cpu_stall_timeout= [KNL]
+ Set timeout for expedited RCU CPU stall warning
+ messages. The value is in milliseconds
+ and the maximum allowed value is 21000
+ milliseconds. Please note that this value is
+ adjusted to an arch timer tick resolution.
+ Setting this to zero causes the value from
+ rcupdate.rcu_cpu_stall_timeout to be used (after
+ conversion from seconds to milliseconds).
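+
+ As a hypothetical example, to warn after 60 seconds
+ for normal grace periods and after 5000 milliseconds
+ for expedited ones:
+
+ Example: rcupdate.rcu_cpu_stall_timeout=60
+ rcupdate.rcu_exp_cpu_stall_timeout=5000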
+
+ rcupdate.rcu_cpu_stall_cputime= [KNL]
+ Provide statistics on the cputime and count of
+ interrupts and tasks during the sampling period. For
+ multiple continuous RCU stalls, all sampling periods
+ begin at half of the first RCU stall timeout.
+
+ rcupdate.rcu_exp_stall_task_details= [KNL]
+ Print stack dumps of any tasks blocking the
+ current expedited RCU grace period during an
+ expedited RCU CPU stall warning.
rcupdate.rcu_expedited= [KNL]
Use expedited grace-period primitives, for
@@ -4781,6 +5203,29 @@
period to instead use normal non-expedited
grace-period processing.
+ rcupdate.rcu_task_collapse_lim= [KNL]
+ Set the maximum number of callbacks present
+ at the beginning of a grace period that allows
+ the RCU Tasks flavors to collapse back to using
+ a single callback queue. This switching only
+ occurs when rcupdate.rcu_task_enqueue_lim is
+ set to the default value of -1.
+
+ rcupdate.rcu_task_contend_lim= [KNL]
+ Set the minimum number of callback-queuing-time
+ lock-contention events per jiffy required to
+ cause the RCU Tasks flavors to switch to per-CPU
+ callback queuing. This switching only occurs
+ when rcupdate.rcu_task_enqueue_lim is set to
+ the default value of -1.
+
+ rcupdate.rcu_task_enqueue_lim= [KNL]
+ Set the number of callback queues to use for the
+ RCU Tasks family of RCU flavors. The default
+ of -1 allows this to be automatically (and
+ dynamically) adjusted. This parameter is intended
+ for use in testing.
+
rcupdate.rcu_task_ipi_delay= [KNL]
Set time in jiffies during which RCU tasks will
avoid sending IPIs, starting with the beginning
@@ -4788,10 +5233,34 @@
number avoids disturbing real-time workloads,
but lengthens grace periods.
+ rcupdate.rcu_task_stall_info= [KNL]
+ Set initial timeout in jiffies for RCU task stall
+ informational messages, which give some indication
+ of the problem for those not patient enough to
+ wait for ten minutes. Informational messages are
+ only printed prior to the stall-warning message
+ for a given grace period. Disable with a value
+ less than or equal to zero. Defaults to ten
+ seconds. A change in value does not take effect
+ until the beginning of the next grace period.
+
+ rcupdate.rcu_task_stall_info_mult= [KNL]
+ Multiplier for time interval between successive
+ RCU task stall informational messages for a given
+ RCU tasks grace period. This value is clamped
+ to one through ten, inclusive. It defaults to
+ the value three, so that the first informational
+ message is printed 10 seconds into the grace
+ period, the second at 40 seconds, the third at
+ 160 seconds, and then the stall warning at 600
+ seconds would prevent a fourth at 640 seconds.
+
rcupdate.rcu_task_stall_timeout= [KNL]
- Set timeout in jiffies for RCU task stall warning
- messages. Disable with a value less than or equal
- to zero.
+ Set timeout in jiffies for RCU task stall
+ warning messages. Disable with a value less
+ than or equal to zero. Defaults to ten minutes.
+ A change in value does not take effect until
+ the beginning of the next grace period.
rcupdate.rcu_self_test= [KNL]
Run the RCU early boot self tests
@@ -4811,7 +5280,7 @@
rdt= [HW,X86,RDT]
Turn on/off individual RDT features. List is:
cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, l2cdp,
- mba.
+ mba, smba, bmec.
E.g. to turn on cmt and turn off mba use:
rdt=cmt,!mba
@@ -4910,17 +5379,45 @@
Useful for devices that are detected asynchronously
(e.g. USB and MMC devices).
- hibernate= [HIBERNATION]
- noresume Don't check if there's a hibernation image
- present during boot.
- nocompress Don't compress/decompress hibernation images.
- no Disable hibernation and resume.
- protect_image Turn on image protection during restoration
- (that will set all pages holding image data
- during restoration read-only).
-
retain_initrd [RAM] Keep initrd memory after extraction
+ retbleed= [X86] Control mitigation of RETBleed (Arbitrary
+ Speculative Code Execution with Return Instructions)
+ vulnerability.
+
+ AMD-based UNRET and IBPB mitigations alone do not stop
+ sibling threads from influencing the predictions of
+ other sibling threads. For that reason, STIBP is used
+ on processors that support it, and SMT is mitigated on
+ processors that don't.
+
+ off - no mitigation
+ auto - automatically select a mitigation
+ auto,nosmt - automatically select a mitigation,
+ disabling SMT if necessary for
+ the full mitigation (only on Zen1
+ and older without STIBP).
+ ibpb - On AMD, mitigate short speculation
+ windows on basic block boundaries too.
+ Safe, highest perf impact. It also
+ enables STIBP if present. Not suitable
+ on Intel.
+ ibpb,nosmt - Like "ibpb" above but will disable SMT
+ when STIBP is not available. This is
+ the alternative for systems which do not
+ have STIBP.
+ unret - Force enable untrained return thunks,
+ only effective on AMD f15h-f17h based
+ systems.
+ unret,nosmt - Like unret, but will disable SMT when STIBP
+ is not available. This is the alternative for
+ systems which do not have STIBP.
+
+ Selecting 'auto' will choose a mitigation method at run
+ time according to the CPU.
+
+ Not specifying this option is equivalent to retbleed=auto.
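+
+ For example, to force the IBPB-based mitigation and
+ disable SMT when STIBP is unavailable:
+
+ Example: retbleed=ibpb,nosmt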
+
rfkill.default_state=
0 "airplane mode". All wifi, bluetooth, wimax, gps, fm,
etc. communication is blocked by default.
@@ -4945,6 +5442,8 @@
rodata= [KNL]
on Mark read-only kernel memory as read-only (default).
off Leave read-only kernel memory writable for debugging.
+ full Mark read-only kernel memory and aliases as read-only
+ [arm64]
rockchip.usb_uart
Enable the uart passthrough on the designated usb port
@@ -4982,6 +5481,18 @@
an IOTLB flush. Default is lazy flushing before reuse,
which is faster.
+ s390_iommu_aperture= [KNL,S390]
+ Specifies the size of the per device DMA address space
+ accessible through the DMA and IOMMU APIs as a decimal
+ factor of the size of main memory.
+ The default is 1, meaning that one can concurrently
+ use as many DMA addresses as there is physical memory
+ installed, if supported by hardware, and thus map all
+ of memory once. With a value of 2 one can map all of
+ memory twice, and so on. As a special case, a factor
+ of 0 imposes no restrictions other than those given
+ by hardware, at the cost of significant additional
+ memory use for tables.
+
sa1100ir [NET]
See drivers/net/irda/sa1100_ir.c.
@@ -5120,18 +5631,22 @@
1 -- enable.
Default value is 1.
- apparmor= [APPARMOR] Disable or enable AppArmor at boot time
- Format: { "0" | "1" }
- See security/apparmor/Kconfig help text
- 0 -- disable.
- 1 -- enable.
- Default value is set via kernel config option.
-
serialnumber [BUGS=X86-32]
+ sev=option[,option...] [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
+
shapers= [NET]
Maximal number of shapers.
+ show_lapic= [APIC,X86] Advanced Programmable Interrupt Controller
+ Limit apic dumping. The parameter defines the maximal
+ number of local apics being dumped. It can also be
+ set to "all", meaning no limit.
+ Format: { 1 (default) | 2 | ... | all }.
+ The parameter is only valid if apic=debug or
+ apic=verbose is specified.
+ Example: apic=debug show_lapic=all
+
simeth= [IA-64]
simscsi=
@@ -5152,7 +5667,7 @@
cache (risks via metadata attacks are mostly
unchanged). Debug options disable merging on their
own.
- For more information see Documentation/vm/slub.rst.
+ For more information see Documentation/mm/slub.rst.
slab_max_order= [MM, SLAB]
Determines the maximum allowed order for slabs.
@@ -5166,13 +5681,13 @@
slub_debug can create guard zones around objects and
may poison objects when not in use. Also tracks the
last alloc / free. For more information see
- Documentation/vm/slub.rst.
+ Documentation/mm/slub.rst.
slub_max_order= [MM, SLUB]
Determines the maximum allowed order for slabs.
A high setting may cause OOMs due to memory
fragmentation. For more information see
- Documentation/vm/slub.rst.
+ Documentation/mm/slub.rst.
slub_min_objects= [MM, SLUB]
The minimum number of objects per slab. SLUB will
@@ -5181,12 +5696,12 @@
the number of objects indicated. The higher the number
of objects the smaller the overhead of tracking slabs
and the less frequently locks need to be acquired.
- For more information see Documentation/vm/slub.rst.
+ For more information see Documentation/mm/slub.rst.
slub_min_order= [MM, SLUB]
Determines the minimum page order for slabs. Must be
lower than slub_max_order.
- For more information see Documentation/vm/slub.rst.
+ For more information see Documentation/mm/slub.rst.
slub_merge [MM, SLUB]
Same as slab_merge.
@@ -5198,6 +5713,17 @@
smart2= [HW]
Format: <io1>[,<io2>[,...,<io8>]]
+ smp.csd_lock_timeout= [KNL]
+ Specify the period of time in milliseconds
+ that smp_call_function() and friends will wait
+ for a CPU to release the CSD lock. This is
+ useful when diagnosing bugs involving CPUs
+ disabling interrupts for extended periods
+ of time. Defaults to 5,000 milliseconds, and
+ setting a value of zero disables this feature.
+ This feature may be more efficiently disabled
+ using the csdlock_debug- kernel parameter.
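+
+ For example (an arbitrary value), to allow CPUs 30
+ seconds to release the CSD lock before warning:
+
+ Example: smp.csd_lock_timeout=30000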
+
smsc-ircc2.nopnp [HW] Don't use PNP to discover SMC devices
smsc-ircc2.ircc_cfg= [HW] Device configuration I/O port
smsc-ircc2.ircc_sir= [HW] SIR base I/O port
@@ -5209,7 +5735,7 @@
1: Fast pin select (default)
2: ATC IRMode
- smt [KNL,S390] Set the maximum number of threads (logical
+ smt= [KNL,S390] Set the maximum number of threads (logical
CPUs) to use per physical CPU on systems capable of
symmetric multithreading (SMT). Will be capped to the
actual hardware limit.
@@ -5261,8 +5787,13 @@
Specific mitigations can also be selected manually:
retpoline - replace indirect branches
- retpoline,generic - google's original retpoline
- retpoline,amd - AMD-specific minimal thunk
+ retpoline,generic - Retpolines
+ retpoline,lfence - LFENCE; indirect branch
+ retpoline,amd - alias for retpoline,lfence
+ eibrs - Enhanced/Auto IBRS
+ eibrs,retpoline - Enhanced/Auto IBRS + Retpolines
+ eibrs,lfence - Enhanced/Auto IBRS + LFENCE
+ ibrs - use IBRS to protect kernel
Not specifying this option is equivalent to
spectre_v2=auto.
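
For example, on hardware with Enhanced IBRS, one might
explicitly select (an illustrative choice, not a
recommendation):

Example: spectre_v2=eibrs,retpoline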
@@ -5303,8 +5834,7 @@
auto - Kernel selects the mitigation depending on
the available CPU features and vulnerability.
- Default mitigation:
- If CONFIG_SECCOMP=y then "seccomp", otherwise "prctl"
+ Default mitigation: "prctl"
Not specifying this option is equivalent to
spectre_v2_user=auto.
@@ -5348,7 +5878,7 @@
will disable SSB unless they explicitly opt out.
Default mitigations:
- X86: If CONFIG_SECCOMP=y "seccomp", otherwise "prctl"
+ X86: "prctl"
On powerpc the options are:
@@ -5426,6 +5956,30 @@
off: Disable mitigation and remove
performance impact to RDRAND and RDSEED
+ srcutree.big_cpu_lim [KNL]
+ Specifies the number of CPUs constituting a
+ large system, such that srcu_struct structures
+ should immediately allocate an srcu_node array.
+ This kernel-boot parameter defaults to 128,
+ but takes effect only when the low-order four
+ bits of srcutree.convert_to_big are equal to 3
+ (decide at boot).
+
+ srcutree.convert_to_big [KNL]
+ Specifies under what conditions an SRCU tree
+ srcu_struct structure will be converted to big
+ form, that is, with an srcu_node tree:
+
+ 0: Never.
+ 1: At init_srcu_struct() time.
+ 2: When rcutorture decides to.
+ 3: Decide at boot time (default).
+ 0x1X: Above plus if high contention.
+
+ Either way, the srcu_node tree will be sized based
+ on the actual runtime number of CPUs (nr_cpu_ids)
+ instead of the compile-time CONFIG_NR_CPUS.
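+
+ As a hypothetical example, 0x13 selects the
+ boot-time decision plus contention-based
+ conversion (the 0x10 bit):
+
+ Example: srcutree.convert_to_big=0x13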
+
srcutree.counter_wrap_check [KNL]
Specifies how frequently to check for
grace-period sequence counter wrap for the
@@ -5443,6 +5997,32 @@
expediting. Set to zero to disable automatic
expediting.
+ srcutree.srcu_max_nodelay [KNL]
+ Specifies the number of no-delay instances
+ per jiffy for which the SRCU grace period
+ worker thread will be rescheduled with zero
+ delay. Beyond this limit, the worker thread will
+ be rescheduled with a sleep delay of one jiffy.
+
+ srcutree.srcu_max_nodelay_phase [KNL]
+ Specifies the maximum number of non-sleeping
+ polls of readers per grace-period phase. Beyond
+ this limit, the grace-period worker thread will
+ be rescheduled with a sleep delay of one jiffy
+ between each rescan of the readers for a given
+ grace-period phase.
+
+ srcutree.srcu_retry_check_delay [KNL]
+ Specifies the number of microseconds of non-sleeping
+ delay between each non-sleeping poll of readers.
+
+ srcutree.small_contention_lim [KNL]
+ Specifies the number of update-side contention
+ events per jiffy that will be tolerated before
+ initiating a conversion of an srcu_struct
+ structure to big form. Note that the value of
+ srcutree.convert_to_big must have the 0x10 bit
+ set for contention-based conversions to occur.
+
ssbd= [ARM64,HW]
Speculative Store Bypass Disable control
@@ -5497,6 +6077,25 @@
stifb= [HW]
Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]]
+ strict_sas_size=
+ [X86]
+ Format: <bool>
+ Enable or disable strict sigaltstack size checks
+ against the required signal frame size which
+ depends on the supported FPU features. This can
+ be used to filter out binaries which have
+ not yet been made aware of AT_MINSIGSTKSZ.
+
+ stress_hpt [PPC]
+ Limits the number of kernel HPT entries in the hash
+ page table to increase the rate of hash page table
+ faults on kernel addresses.
+
+ stress_slb [PPC]
+ Limits the number of kernel SLB entries, and flushes
+ them frequently to increase the rate of SLB faults
+ on kernel addresses.
+
sunrpc.min_resvport=
sunrpc.max_resvport=
[NFS,SUNRPC]
@@ -5552,14 +6151,12 @@
This parameter controls use of the Protected
Execution Facility on pSeries.
- swapaccount=[0|1]
- [KNL] Enable accounting of swap in memory resource
- controller if no parameter or 1 is given or disable
- it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst)
-
swiotlb= [ARM,IA-64,PPC,MIPS,X86]
- Format: { <int> | force | noforce }
+ Format: { <int> [,<int>] | force | noforce }
<int> -- Number of I/O TLB slabs
+ <int> -- Second integer after comma. Number of swiotlb
+ areas with their own lock. Will be rounded up
+ to a power of 2.
force -- force using of bounce buffers even if they
wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)
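
A hypothetical example: 65536 I/O TLB slabs split into 8
independently locked areas (the second value would be
rounded up to a power of 2 if needed):

Example: swiotlb=65536,8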
@@ -5575,15 +6172,6 @@
later by a loaded module cannot be set this way.
Example: sysctl.vm.swappiness=40
- sysfs.deprecated=0|1 [KNL]
- Enable/disable old style sysfs layout for old udev
- on older distributions. When this option is enabled
- very new udev will not work anymore. When this option
- is disabled (or CONFIG_SYSFS_DEPRECATED not compiled)
- in older udev will not work anymore.
- Default depends on CONFIG_SYSFS_DEPRECATED_V2 set in
- the kernel configuration.
-
sysrq_always_enabled
[KNL]
Ignore sysrq setting - this boot parameter will
@@ -5599,7 +6187,8 @@
tdfx= [HW,DRM]
- test_suspend= [SUSPEND][,N]
+ test_suspend= [SUSPEND]
+ Format: { "mem" | "standby" | "freeze" }[,N]
Specify "mem" (for Suspend-to-RAM) or "standby" (for
standby suspend) or "freeze" (for suspend type freeze)
as the system sleep state during system startup with
@@ -5683,22 +6272,95 @@
This will guarantee that all the other pcrs
are saved.
+ tp_printk [FTRACE]
+ Have the tracepoints sent to printk as well as the
+ tracing ring buffer. This is useful for early boot up
+ where the system hangs or reboots and does not give the
+ option for reading the tracing buffer or performing a
+ ftrace_dump_on_oops.
+
+ To turn off having tracepoints sent to printk,
+ echo 0 > /proc/sys/kernel/tracepoint_printk
+ Note, echoing 1 into this file without the
+ tracepoint_printk kernel cmdline option has no effect.
+
+ The tp_printk_stop_on_boot (see below) can also be used
+ to stop the printing of events to console at
+ late_initcall_sync.
+
+ ** CAUTION **
+
+ Having tracepoints sent to printk() and activating
+ high-frequency tracepoints such as irq or sched can
+ cause the system to live lock.
+
+ tp_printk_stop_on_boot [FTRACE]
+ When tp_printk (above) is set, it can cause a lot of noise
+ on the console. It may be useful to only include the
+ printing of events during boot up, as user space may
+ make the system inoperable.
+
+ This command line option will stop the printing of events
+ to console at the late_initcall_sync() time frame.
+
trace_buf_size=nn[KMG]
[FTRACE] will set tracing buffer size on each cpu.
+ trace_clock= [FTRACE] Set the clock used for tracing events
+ at boot up.
+ local - Use the per CPU time stamp counter
+ (converted into nanoseconds). Fast, but
+ depending on the architecture, may not be
+ in sync between CPUs.
+ global - Event time stamps are synchronized across
+ CPUs. May be slower than the local clock,
+ but better for some race conditions.
+ counter - Simple counting of events (1, 2, ...)
+ note, some counts may be skipped due to the
+ infrastructure grabbing the clock more than
+ once per event.
+ uptime - Use jiffies as the time stamp.
+ perf - Use the same clock that perf uses.
+ mono - Use ktime_get_mono_fast_ns() for time stamps.
+ mono_raw - Use ktime_get_raw_fast_ns() for time
+ stamps.
+ boot - Use ktime_get_boot_fast_ns() for time stamps.
+ Architectures may add more clocks. See
+ Documentation/trace/ftrace.rst for more details.
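+
+ For example, to select CPU-synchronized time
+ stamps from early boot:
+
+ Example: trace_clock=global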
+
trace_event=[event-list]
[FTRACE] Set and start specified trace events in order
to facilitate early boot debugging. The event-list is a
comma-separated list of trace events to enable. See
also Documentation/trace/events.rst
+ trace_instance=[instance-info]
+ [FTRACE] Create a ring buffer instance early in boot up.
+ This will be listed in:
+
+ /sys/kernel/tracing/instances
+
+ Events can be enabled at the time the instance is created
+ via:
+
+ trace_instance=<name>,<system1>:<event1>,<system2>:<event2>
+
+ Note, the "<system*>:" portion is optional if the event is
+ unique.
+
+ trace_instance=foo,sched:sched_switch,irq_handler_entry,initcall
+
+ will enable the "sched_switch" event (note, the "sched:" is optional,
+ and the same thing would happen if it was left off), the
+ irq_handler_entry event, and all events under the "initcall" system.
+
trace_options=[option-list]
[FTRACE] Enable or disable tracer options at boot.
The option-list is a comma delimited list of options
that can be enabled or disabled just as if you were
to echo the option name into
- /sys/kernel/debug/tracing/trace_options
+ /sys/kernel/tracing/trace_options
For example, to enable stacktrace option (to dump the
stack trace of each event), add to the command line:
@@ -5708,42 +6370,30 @@
See also Documentation/trace/ftrace.rst "trace options"
section.
- tp_printk[FTRACE]
- Have the tracepoints sent to printk as well as the
- tracing ring buffer. This is useful for early boot up
- where the system hangs or reboots and does not give the
- option for reading the tracing buffer or performing a
- ftrace_dump_on_oops.
+ trace_trigger=[trigger-list]
+ [FTRACE] Add an event trigger to specific events.
+ Set a trigger on top of a specific event, with an optional
+ filter.
- To turn off having tracepoints sent to printk,
- echo 0 > /proc/sys/kernel/tracepoint_printk
- Note, echoing 1 into this file without the
- tracepoint_printk kernel cmdline option has no effect.
+ The format is "trace_trigger=<event>.<trigger>[ if <filter>],..."
+ where more than one comma-delimited trigger may be specified.
- The tp_printk_stop_on_boot (see below) can also be used
- to stop the printing of events to console at
- late_initcall_sync.
+ For example:
- ** CAUTION **
+ trace_trigger="sched_switch.stacktrace if prev_state == 2"
- Having tracepoints sent to printk() and activating high
- frequency tracepoints such as irq or sched, can cause
- the system to live lock.
+ The above will enable the "stacktrace" trigger on the "sched_switch"
+ event but only trigger it if the "prev_state" of the "sched_switch"
+ event is "2" (TASK_UNINTERUPTIBLE).
- tp_printk_stop_on_boot[FTRACE]
- When tp_printk (above) is set, it can cause a lot of noise
- on the console. It may be useful to only include the
- printing of events during boot up, as user space may
- make the system inoperable.
+ See also "Event triggers" in Documentation/trace/events.rst
- This command line option will stop the printing of events
- to console at the late_initcall_sync() time frame.
traceoff_on_warning
[FTRACE] enable this option to disable tracing when a
warning is hit. This turns off "tracing_on". Tracing can
be enabled again by echoing '1' into the "tracing_on"
- file located in /sys/kernel/debug/tracing/
+ file located in /sys/kernel/tracing/
This option is useful, as it disables the trace before
the WARNING dump is called, which prevents the trace to
@@ -5767,11 +6417,22 @@
sources:
- "tpm"
- "tee"
+ - "caam"
If not specified, the default is to iterate through
the trust source list starting with TPM and assign
as backend the first trust source that initializes
successfully.
+ trusted.rng= [KEYS]
+ Format: <string>
+ The RNG used to generate key material for trusted keys.
+ Can be one of:
+ - "kernel"
+ - the same value as trusted.source: "tpm" or "tee"
+ - "default"
+ If not specified, "default" is used. In this case,
+ the choice of RNG is left to each individual trust source.
+
tsc= Disable clocksource stability checks for TSC.
Format: <string>
[x86] reliable: mark tsc clocksource as reliable, this
@@ -5790,6 +6451,16 @@
in situations with strict latency requirements (where
interruptions from clocksource watchdog are not
acceptable).
+ [x86] recalibrate: force recalibration against a HW timer
+ (HPET or PM timer) on systems whose TSC frequency was
+ obtained from HW or FW using either an MSR or CPUID(0x15).
+ Warn if the difference is more than 500 ppm.
+ [x86] watchdog: Use TSC as the watchdog clocksource with
+ which to check other HW timers (HPET or PM timer), but
+ only on systems where TSC has been deemed trustworthy.
+ This will be suppressed by an earlier tsc=nowatchdog and
+ can be overridden by a later tsc=nowatchdog. A console
+ message will flag any such suppression or overriding.
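+
+ Example: tsc=nowatchdog tsc=watchdog
+ Here the later tsc=watchdog overrides the earlier
+ tsc=nowatchdog, and a console message flags the
+ override.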
tsc_early_khz= [X86] Skip early TSC calibration and use the given
value instead. Useful when the early TSC frequency discovery
@@ -6079,7 +6750,7 @@
HIGHMEM regardless of setting
of CONFIG_HIGHPTE.
- vdso= [X86,SH]
+ vdso= [X86,SH,SPARC]
On X86_32, this is an alias for vdso32=. Otherwise:
vdso=1: enable VDSO (the default)
@@ -6105,11 +6776,12 @@
video= [FB] Frame buffer configuration
See Documentation/fb/modedb.rst.
- video.brightness_switch_enabled= [0,1]
+ video.brightness_switch_enabled= [ACPI]
+ Format: [0|1]
If set to 1, on receiving an ACPI notify event
generated by hotkey, video driver will adjust brightness
level and then send out the event to user space through
- the allocated input device; If set to 0, video driver
+ the allocated input device. If set to 0, video driver
will only send out the event without touching backlight
brightness level.
default: 1
@@ -6131,7 +6803,7 @@
Can be used multiple times for multiple devices.
vga= [BOOT,X86-32] Select a particular video mode
- See Documentation/x86/boot.rst and
+ See Documentation/arch/x86/boot.rst and
Documentation/admin-guide/svga.rst.
Use vga=ask for menu.
This is actually a boot loader parameter; the value is
@@ -6176,11 +6848,11 @@
functions are at fixed addresses, they make nice
targets for exploits that can control RIP.
- emulate [default] Vsyscalls turn into traps and are
- emulated reasonably safely. The vsyscall
- page is readable.
+ emulate Vsyscalls turn into traps and are emulated
+ reasonably safely. The vsyscall page is
+ readable.
- xonly Vsyscalls turn into traps and are
+ xonly [default] Vsyscalls turn into traps and are
emulated reasonably safely. The vsyscall
page is not readable.
@@ -6294,6 +6966,12 @@
When enabled, memory and cache locality will be
impacted.
+ writecombine= [LOONGARCH] Control the MAT (Memory Access Type) of
+ ioremap_wc().
+
+ on - Enable writecombine, use WUC for ioremap_wc()
+ off - Disable writecombine, use SUC for ioremap_wc()
+
x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
@@ -6321,6 +6999,12 @@
Crash from Xen panic notifier, without executing late
panic() code such as dumping handler.
+ xen_msr_safe= [X86,XEN]
+ Format: <bool>
+ Select whether to always use non-faulting (safe) MSR
+ access functions when running as Xen PV guest. The
+ default value is controlled by CONFIG_XEN_PV_MSR_SAFE.
+
xen_nopvspin [X86,XEN]
Disables the qspinlock slowpath using Xen PV optimizations.
This parameter is obsoleted by "nopvspin" parameter, which
@@ -6349,6 +7033,13 @@
improve timer resolution at the expense of processing
more timer interrupts.
+ xen.balloon_boot_timeout= [XEN]
+ The time (in seconds) to wait before giving up on booting
+ in case initial ballooning fails to free enough memory.
+ Applies only when running as HVM or PVH guest and
+ started with less memory configured than allowed at
+ max. Default is 180.
+
xen.event_eoi_delay= [XEN]
How long to delay EOI handling in case of event
storms (jiffies). Default is 10.
@@ -6364,16 +7055,6 @@
fairer and the number of possible event channels is
much higher. Default is on (use fifo events).
- nopv= [X86,XEN,KVM,HYPER_V,VMWARE]
- Disables the PV optimizations forcing the guest to run
- as generic guest with no PV drivers. Currently support
- XEN HVM, KVM, HYPER_V and VMWARE guest.
-
- nopvspin [X86,XEN,KVM]
- Disables the qspinlock slow path using PV optimizations
- which allow the hypervisor to 'idle' the guest on lock
- contention.
-
xirc2ps_cs= [NET,PCMCIA]
Format:
<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]
@@ -6387,6 +7068,12 @@
controller on both pseries and powernv
platforms. Only useful on POWER9 and above.
+ xive.store-eoi=off [PPC]
+ By default on POWER10 and above, the kernel will use
+ stores for EOI handling when the XIVE interrupt mode
+ is active. This option allows the XIVE driver to use
+ loads instead, as on POWER9.
+
xhci-hcd.quirks [USB,KNL]
A hex value specifying bitmask with supplemental xhci
host controller quirks. Meaning of each bit can be
@@ -6410,3 +7097,4 @@
memory, and other data can't be written using
xmon commands.
off xmon is disabled.
+
diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
index 5e51ee5b0358..993c2a05f5ee 100644
--- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
+++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
@@ -25,7 +25,7 @@ References
- In order to locate kernel-generated OS jitter on CPU N:
- cd /sys/kernel/debug/tracing
+ cd /sys/kernel/tracing
echo 1 > max_graph_depth # Increase the "1" for more detail
echo function_graph > current_tracer
# run workload
@@ -208,7 +208,7 @@ Do at least one of the following:
2. Enable RCU to do its processing remotely via dyntick-idle by
doing all of the following:
- a. Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y.
+ a. Build with CONFIG_NO_HZ=y.
b. Ensure that the CPU goes idle frequently, allowing other
CPUs to detect that it has passed through an RCU quiescent
state. If the kernel is built with CONFIG_NO_HZ_FULL=y,
diff --git a/Documentation/admin-guide/laptops/lg-laptop.rst b/Documentation/admin-guide/laptops/lg-laptop.rst
index 6fbe165dcd27..67fd6932cef4 100644
--- a/Documentation/admin-guide/laptops/lg-laptop.rst
+++ b/Documentation/admin-guide/laptops/lg-laptop.rst
@@ -38,7 +38,7 @@ FN lock.
Battery care limit
------------------
-Writing 80/100 to /sys/devices/platform/lg-laptop/battery_care_limit
+Writing 80/100 to /sys/class/power_supply/CMB0/charge_control_end_threshold
sets the maximum capacity to charge the battery. Limiting the charge
reduces battery capacity loss over time.
diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
index 6721a80a2d4f..e27a1c3f634e 100644
--- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst
+++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
@@ -1488,7 +1488,7 @@ Example of command to set keyboard language is mentioned below::
Text corresponding to keyboard layout to be set in sysfs are: be(Belgian),
cz(Czech), da(Danish), de(German), en(English), es(Spain), et(Estonian),
fr(French), fr-ch(French(Switzerland)), hu(Hungarian), it(Italy), jp (Japan),
-nl(Dutch), nn(Norway), pl(Polish), pt(portugese), sl(Slovenian), sv(Sweden),
+nl(Dutch), nn(Norway), pl(Polish), pt(portuguese), sl(Slovenian), sv(Sweden),
tr(Turkey)
WWAN Antenna type
@@ -1520,15 +1520,15 @@ This sysfs attribute controls the keyboard "face" that will be shown on the
Lenovo X1 Carbon 2nd gen (2014)'s adaptive keyboard. The value can be read
and set.
-- 1 = Home mode
-- 2 = Web-browser mode
-- 3 = Web-conference mode
-- 4 = Function mode
-- 5 = Layflat mode
+- 0 = Home mode
+- 1 = Web-browser mode
+- 2 = Web-conference mode
+- 3 = Function mode
+- 4 = Layflat mode
For more details about which buttons will appear depending on the mode, please
review the laptop's user guide:
-http://www.lenovo.com/shop/americas/content/user_guides/x1carbon_2_ug_en.pdf
+https://download.lenovo.com/ibmdl/pub/pc/pccbbs/mobiles_pdf/x1carbon_2_ug_en.pdf
Battery charge control
----------------------
diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index d8fc9a59c086..4ff2cc291d18 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -317,7 +317,7 @@ All md devices contain:
suspended (not supported yet)
All IO requests will block. The array can be reconfigured.
- Writing this, if accepted, will block until array is quiessent
+ Writing this, if accepted, will block until array is quiescent
readonly
no resync can happen. no superblocks get written.
diff --git a/Documentation/admin-guide/media/bttv.rst b/Documentation/admin-guide/media/bttv.rst
index 125f6f47123d..58cbaf6df694 100644
--- a/Documentation/admin-guide/media/bttv.rst
+++ b/Documentation/admin-guide/media/bttv.rst
@@ -909,7 +909,7 @@ DE hat diverse Treiber fuer diese Modelle (Stand 09/2002):
- TVPhone98 (Bt878)
- AVerTV und TVCapture98 w/VCR (Bt 878)
- AVerTVStudio und TVPhone98 w/VCR (Bt878)
- - AVerTV GO Serie (Kein SVideo Input)
+ - AVerTV GO Series (Kein SVideo Input)
- AVerTV98 (BT-878 chip)
- AVerTV98 mit Fernbedienung (BT-878 chip)
- AVerTV/FM98 (BT-878 chip)
diff --git a/Documentation/admin-guide/media/building.rst b/Documentation/admin-guide/media/building.rst
index 2d660b76caea..a06473429916 100644
--- a/Documentation/admin-guide/media/building.rst
+++ b/Documentation/admin-guide/media/building.rst
@@ -137,7 +137,7 @@ The ``LIRC user interface`` option adds enhanced functionality when using the
from remote controllers.
The ``Support for eBPF programs attached to lirc devices`` option allows
-the usage of special programs (called eBPF) that would allow aplications
+the usage of special programs (called eBPF) that would allow applications
to add extra remote controller decoding functionality to the Linux Kernel.
The ``Remote controller decoders`` option allows selecting the
diff --git a/Documentation/admin-guide/media/cec-drivers.rst b/Documentation/admin-guide/media/cec-drivers.rst
deleted file mode 100644
index 8d9686c08df9..000000000000
--- a/Documentation/admin-guide/media/cec-drivers.rst
+++ /dev/null
@@ -1,10 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=================================
-CEC driver-specific documentation
-=================================
-
-.. toctree::
- :maxdepth: 2
-
- pulse8-cec
diff --git a/Documentation/admin-guide/media/cec.rst b/Documentation/admin-guide/media/cec.rst
new file mode 100644
index 000000000000..6b30e355cf23
--- /dev/null
+++ b/Documentation/admin-guide/media/cec.rst
@@ -0,0 +1,380 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+HDMI CEC
+========
+
+Supported hardware in mainline
+==============================
+
+HDMI Transmitters:
+
+- Exynos4
+- Exynos5
+- STIH4xx HDMI CEC
+- V4L2 adv7511 (same HW, but a different driver from the drm adv7511)
+- stm32
+- Allwinner A10 (sun4i)
+- Raspberry Pi
+- dw-hdmi (Synopsys IP)
+- amlogic (meson ao-cec and ao-cec-g12a)
+- drm adv7511/adv7533
+- omap4
+- tegra
+- rk3288, rk3399
+- tda998x
+- DisplayPort CEC-Tunneling-over-AUX on i915, nouveau and amdgpu
+- ChromeOS EC CEC
+- CEC for SECO boards (UDOO x86).
+- Chrontel CH7322
+
+
+HDMI Receivers:
+
+- adv7604/11/12
+- adv7842
+- tc358743
+
+USB Dongles (see below for additional information on how to use these
+dongles):
+
+- Pulse-Eight: the pulse8-cec driver implements the following module option:
+ ``persistent_config``: by default this is off, but when set to 1 the driver
+ will store the current settings to the device's internal eeprom and restore
+ them the next time the device is connected to the USB port.
+- RainShadow Tech. Note: this driver does not support the persistent_config
+ module option of the Pulse-Eight driver. The hardware supports it, but I
+ have no plans to add this feature. But I accept patches :-)
+
+Miscellaneous:
+
+- vivid: emulates a CEC receiver and CEC transmitter.
+ Can be used to test CEC applications without actual CEC hardware.
+
+- cec-gpio. If the CEC pin is hooked up to a GPIO pin then
+ you can control the CEC line through this driver. This supports error
+ injection as well.
+
+- cec-gpio and Allwinner A10 (or any other driver that uses the CEC pin
+ framework to drive the CEC pin directly): the CEC pin framework uses
+ high-resolution timers. These timers are affected by NTP daemons that
+ speed up or slow down the clock to sync with the official time. The
+ chronyd server will by default increase or decrease the clock by
+ 1/12th. This will cause the CEC timings to go out of spec. To fix this,
+ add a 'maxslewrate 40000' line to chronyd.conf. This limits the clock
+ frequency change to 1/25th, which keeps the CEC timings within spec.
+
+
+Utilities
+=========
+
+Utilities are available here: https://git.linuxtv.org/v4l-utils.git
+
+``utils/cec-ctl``: control a CEC device
+
+``utils/cec-compliance``: test compliance of a remote CEC device
+
+``utils/cec-follower``: emulate a CEC follower device
+
+Note that ``cec-ctl`` has support for the CEC Hospitality Profile as is
+used in some hotel displays. See http://www.htng.org.
+
+Note that the libcec library (https://github.com/Pulse-Eight/libcec) supports
+the linux CEC framework.
+
+If you want to get the CEC specification, then look at the References of
+the HDMI wikipedia page: https://en.wikipedia.org/wiki/HDMI. CEC is part
+of the HDMI specification. HDMI 1.3 is freely available (very similar to
+HDMI 1.4 w.r.t. CEC) and should be good enough for most things.
+
+
+DisplayPort to HDMI Adapters with working CEC
+=============================================
+
+Background: most adapters do not support the CEC Tunneling feature,
+and of those that do, many do not actually connect the CEC pin.
+Unfortunately, this means that while a CEC device is created, it
+is actually all alone in the world and will never be able to see other
+CEC devices.
+
+This is a list of known working adapters that have CEC Tunneling AND
+that properly connected the CEC pin. If you find adapters that work
+but are not in this list, then drop me a note.
+
+To test: hook up your DP-to-HDMI adapter to a CEC capable device
+(typically a TV), then run::
+
+ cec-ctl --playback # Configure the PC as a CEC Playback device
+ cec-ctl -S # Show the CEC topology
+
+The ``cec-ctl -S`` command should show at least two CEC devices,
+ourselves and the CEC device you are connected to (i.e. typically the TV).
+
+General note: I have only seen this work with the Parade PS175, PS176 and
+PS186 chipsets and the MegaChips 2900. While MegaChips 28x0 claims CEC support,
+I have never seen it work.
+
+USB-C to HDMI
+-------------
+
+Samsung Multiport Adapter EE-PW700: https://www.samsung.com/ie/support/model/EE-PW700BBEGWW/
+
+Kramer ADC-U31C/HF: https://www.kramerav.com/product/ADC-U31C/HF
+
+Club3D CAC-2504: https://www.club-3d.com/en/detail/2449/usb_3.1_type_c_to_hdmi_2.0_uhd_4k_60hz_active_adapter/
+
+DisplayPort to HDMI
+-------------------
+
+Club3D CAC-1080: https://www.club-3d.com/en/detail/2442/displayport_1.4_to_hdmi_2.0b_hdr/
+
+CableCreation (SKU: CD0712): https://www.cablecreation.com/products/active-displayport-to-hdmi-adapter-4k-hdr
+
+HP DisplayPort to HDMI True 4k Adapter (P/N 2JA63AA): https://www.hp.com/us-en/shop/pdp/hp-displayport-to-hdmi-true-4k-adapter
+
+Mini-DisplayPort to HDMI
+------------------------
+
+Club3D CAC-1180: https://www.club-3d.com/en/detail/2443/mini_displayport_1.4_to_hdmi_2.0b_hdr/
+
+Note that passive adapters will never work; you need an active adapter.
+
+The Club3D adapters in this list are all MegaChips 2900 based. Other Club3D adapters
+are PS176 based and do NOT have the CEC pin hooked up, so only the three Club3D
+adapters above are known to work.
+
+I suspect that MegaChips 2900 based designs in general are likely to work
+whereas with the PS176 it is more hit-and-miss (mostly miss). The PS186 is
+likely to have the CEC pin hooked up, it looks like they changed the reference
+design for that chipset.
+
+
+USB CEC Dongles
+===============
+
+These dongles appear as ``/dev/ttyACMX`` devices and need the ``inputattach``
+utility to create the ``/dev/cecX`` devices. Support for the Pulse-Eight
+has been added to ``inputattach`` 1.6.0. Support for the Rainshadow Tech has
+been added to ``inputattach`` 1.6.1.
+
+You also need udev rules to automatically start systemd services::
+
+ SUBSYSTEM=="tty", KERNEL=="ttyACM[0-9]*", ATTRS{idVendor}=="2548", ATTRS{idProduct}=="1002", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}+="pulse8-cec-inputattach@%k.service"
+ SUBSYSTEM=="tty", KERNEL=="ttyACM[0-9]*", ATTRS{idVendor}=="2548", ATTRS{idProduct}=="1001", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}+="pulse8-cec-inputattach@%k.service"
+ SUBSYSTEM=="tty", KERNEL=="ttyACM[0-9]*", ATTRS{idVendor}=="04d8", ATTRS{idProduct}=="ff59", ACTION=="add", TAG+="systemd", ENV{SYSTEMD_WANTS}+="rainshadow-cec-inputattach@%k.service"
+
+and these systemd services:
+
+For Pulse-Eight make /lib/systemd/system/pulse8-cec-inputattach@.service::
+
+ [Unit]
+ Description=inputattach for pulse8-cec device on %I
+
+ [Service]
+ Type=simple
+ ExecStart=/usr/bin/inputattach --pulse8-cec /dev/%I
+
+For the RainShadow Tech make /lib/systemd/system/rainshadow-cec-inputattach@.service::
+
+ [Unit]
+ Description=inputattach for rainshadow-cec device on %I
+
+ [Service]
+ Type=simple
+ ExecStart=/usr/bin/inputattach --rainshadow-cec /dev/%I
+
+
+For proper suspend/resume support create: /lib/systemd/system/restart-cec-inputattach.service::
+
+ [Unit]
+ Description=restart inputattach for cec devices
+ After=suspend.target
+
+ [Service]
+ Type=forking
+ ExecStart=/bin/bash -c 'for d in /dev/serial/by-id/usb-Pulse-Eight*; do /usr/bin/inputattach --daemon --pulse8-cec $d; done; for d in /dev/serial/by-id/usb-RainShadow_Tech*; do /usr/bin/inputattach --daemon --rainshadow-cec $d; done'
+
+ [Install]
+ WantedBy=suspend.target
+
+And run ``systemctl enable restart-cec-inputattach``.
+
+To automatically set the physical address of the CEC device whenever the
+EDID changes, you can use ``cec-ctl`` with the ``-E`` option::
+
+ cec-ctl -E /sys/class/drm/card0-DP-1/edid
+
+This assumes the dongle is connected to the card0-DP-1 output (``xrandr`` will tell
+you which output is used) and it will poll for changes to the EDID and update
+the Physical Address whenever they occur.
+
+To automatically run this command you can use cron. Edit crontab with
+``crontab -e`` and add this line::
+
+ @reboot /usr/local/bin/cec-ctl -E /sys/class/drm/card0-DP-1/edid
+
+This only works for display drivers that expose the EDID in ``/sys/class/drm``,
+such as the i915 driver.
+
+
+CEC Without HPD
+===============
+
+Some displays when in standby mode have no HDMI Hotplug Detect signal, but
+CEC is still enabled so connected devices can send an <Image View On> CEC
+message in order to wake up such displays. Unfortunately, not all CEC
+adapters can support this. An example is the Odroid-U3 SBC that has a
+level-shifter that is powered off when the HPD signal is low, thus
+blocking the CEC pin. Even though the SoC can use CEC without a HPD,
+the level-shifter will prevent this from functioning.
+
+There is a CEC capability flag to signal this: ``CEC_CAP_NEEDS_HPD``.
+If set, then the hardware cannot wake up displays with this behavior.
+
+Note for CEC application implementers: the <Image View On> message must
+be the first message you send; don't send any other messages before it.
+Certain very bad but unfortunately not uncommon CEC implementations
+get very confused if they receive anything else but this message and
+they won't wake up.
+
+When writing a driver it can be tricky to test this. There are two
+ways to do this:
+
+1) Get a Pulse-Eight USB CEC dongle, connect an HDMI cable from your
+ device to the Pulse-Eight, but do not connect the Pulse-Eight to
+ the display.
+
+ Now configure the Pulse-Eight dongle::
+
+ cec-ctl -p0.0.0.0 --tv
+
+ and start monitoring::
+
+ sudo cec-ctl -M
+
+ On the device you are testing run::
+
+ cec-ctl --playback
+
+ It should report a physical address of f.f.f.f. Now run this
+ command::
+
+ cec-ctl -t0 --image-view-on
+
+ The Pulse-Eight should see the <Image View On> message. If not,
+ then something (hardware and/or software) is preventing the CEC
+ message from going out.
+
+ To make sure you have the wiring correct just connect the
+ Pulse-Eight to a CEC-enabled display and run the same command
+ on your device: now there is a HPD, so you should see the command
+ arriving at the Pulse-Eight.
+
+2) If you have another linux device supporting CEC without HPD, then
+ you can just connect your device to that device. Yes, you can connect
+ two HDMI outputs together. You won't have a HPD (which is what we
+ want for this test), but the second device can monitor the CEC pin.
+
+ Otherwise use the same commands as in 1.
+
+If CEC messages do not come through when there is no HPD, then you
+need to figure out why. Typically it is either a hardware restriction
+or the software powers off the CEC core when the HPD goes low. The
+first cannot be corrected of course; the second will likely require
+driver changes.
+
+
+Microcontrollers & CEC
+======================
+
+We have seen some CEC implementations in displays that use a microcontroller
+to sample the bus. This does not have to be a problem, but some implementations
+have timing issues. This is hard to discover unless you can hook up a low-level
+CEC debugger (see the next section).
+
+You will see cases where the CEC transmitter holds the CEC line high or low for
+a longer time than is allowed. For directed messages this is not a problem since
+if that happens the message will not be Acked and it will be retransmitted.
+For broadcast messages no such mechanism exists.
+
+It's not clear what to do about this. It is probably wise to transmit some
+broadcast messages twice to reduce the chance of them being lost. Specifically
+<Standby> and <Active Source> are candidates for that.
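+
+As a sketch, assuming ``cec-ctl`` auto-generates a ``--standby`` message
+option analogous to the ``--image-view-on`` option used earlier in this
+document (with 15 as the CEC broadcast destination), a broadcast <Standby>
+could simply be sent twice::
+
+    cec-ctl -t15 --standby
+    cec-ctl -t15 --standby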
+
+
+Making a CEC debugger
+=====================
+
+By using a Raspberry Pi 4B and some cheap components you can make
+your own low-level CEC debugger.
+
+The critical component is one of these HDMI female-female passthrough connectors
+(full soldering type 1):
+
+https://elabbay.myshopify.com/collections/camera/products/hdmi-af-af-v1a-hdmi-type-a-female-to-hdmi-type-a-female-pass-through-adapter-breakout-board?variant=45533926147
+
+The video quality is variable and certainly not enough to pass through 4kp60
+(594 MHz) video. You might be able to support 4kp30, but more likely you will
+be limited to 1080p60 (148.5 MHz). But for CEC testing that is fine.
+
+You need a breadboard and some breadboard wires:
+
+http://www.dx.com/p/diy-40p-male-to-female-male-to-male-female-to-female-dupont-line-wire-3pcs-356089#.WYLOOXWGN7I
+
+If you want to monitor the HPD and/or 5V lines as well, then you need one of
+these 5V to 3.3V level shifters:
+
+https://www.adafruit.com/product/757
+
+(This is just where I got these components, there are many other places you
+can get similar things).
+
+The ground pin of the HDMI connector needs to be connected to a ground
+pin of the Raspberry Pi, of course.
+
+The CEC pin of the HDMI connector needs to be connected to these pins:
+GPIO 6 and GPIO 7. The optional HPD pin of the HDMI connector should
+be connected via the level shifter to these pins: GPIO 23 and GPIO 12.
+The optional 5V pin of the HDMI connector should be connected via the
+level shifter to these pins: GPIO 25 and GPIO 22. Monitoring the HPD and
+5V lines is not necessary, but it is helpful.
+
+This device tree addition in ``arch/arm/boot/dts/bcm2711-rpi-4-b.dts``
+will hook up the cec-gpio driver correctly::
+
+ cec@6 {
+ compatible = "cec-gpio";
+ cec-gpios = <&gpio 6 (GPIO_ACTIVE_HIGH|GPIO_OPEN_DRAIN)>;
+ hpd-gpios = <&gpio 23 GPIO_ACTIVE_HIGH>;
+ v5-gpios = <&gpio 25 GPIO_ACTIVE_HIGH>;
+ };
+
+ cec@7 {
+ compatible = "cec-gpio";
+ cec-gpios = <&gpio 7 (GPIO_ACTIVE_HIGH|GPIO_OPEN_DRAIN)>;
+ hpd-gpios = <&gpio 12 GPIO_ACTIVE_HIGH>;
+ v5-gpios = <&gpio 22 GPIO_ACTIVE_HIGH>;
+ };
+
+If you haven't hooked up the HPD and/or 5V lines, then just delete those
+lines.
+
+This dts change will enable two cec GPIO devices: I typically use one to
+send/receive CEC commands and the other to monitor. If you monitor using
+an unconfigured CEC adapter then it will use GPIO interrupts which makes
+monitoring very accurate.
+
+If you just want to monitor traffic, then a single instance is sufficient.
+The minimum configuration is one HDMI female-female passthrough connector
+and two female-female breadboard wires: one for connecting the HDMI ground
+pin to a ground pin on the Raspberry Pi, and the other to connect the HDMI
+CEC pin to GPIO 6 on the Raspberry Pi.
+
+The documentation on how to use the error injection is here: :ref:`cec_pin_error_inj`.
+
+``cec-ctl --monitor-pin`` will do low-level CEC bus sniffing and analysis.
+You can also store the CEC traffic to file using ``--store-pin`` and analyze
+it later using ``--analyze-pin``.
+
+You can also use this as a full-fledged CEC device by configuring it
+using ``cec-ctl --tv -p0.0.0.0`` or ``cec-ctl --playback -p1.0.0.0``.
diff --git a/Documentation/admin-guide/media/cpia2.rst b/Documentation/admin-guide/media/cpia2.rst
deleted file mode 100644
index f6ffef686462..000000000000
--- a/Documentation/admin-guide/media/cpia2.rst
+++ /dev/null
@@ -1,145 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-The cpia2 driver
-================
-
-Authors: Peter Pregler <Peter_Pregler@email.com>,
-Scott J. Bertin <scottbertin@yahoo.com>, and
-Jarl Totland <Jarl.Totland@bdc.no> for the original cpia driver, which
-this one was modelled from.
-
-Introduction
-------------
-
-This is a driver for STMicroelectronics's CPiA2 (second generation
-Colour Processor Interface ASIC) based cameras. This camera outputs an MJPEG
-stream at up to vga size. It implements the Video4Linux interface as much as
-possible. Since the V4L interface does not support compressed formats, only
-an mjpeg enabled application can be used with the camera. We have modified the
-gqcam application to view this stream.
-
-The driver is implemented as two kernel modules. The cpia2 module
-contains the camera functions and the V4L interface. The cpia2_usb module
-contains usb specific functions. The main reason for this was the size of the
-module was getting out of hand, so I separated them. It is not likely that
-there will be a parallel port version.
-
-Features
---------
-
-- Supports cameras with the Vision stv6410 (CIF) and stv6500 (VGA) cmos
- sensors. I only have the vga sensor, so can't test the other.
-- Image formats: VGA, QVGA, CIF, QCIF, and a number of sizes in between.
- VGA and QVGA are the native image sizes for the VGA camera. CIF is done
- in the coprocessor by scaling QVGA. All other sizes are done by clipping.
-- Palette: YCrCb, compressed with MJPEG.
-- Some compression parameters are settable.
-- Sensor framerate is adjustable (up to 30 fps CIF, 15 fps VGA).
-- Adjust brightness, color, contrast while streaming.
-- Flicker control settable for 50 or 60 Hz mains frequency.
-
-Making and installing the stv672 driver modules
------------------------------------------------
-
-Requirements
-~~~~~~~~~~~~
-
-Video4Linux must be either compiled into the kernel or
-available as a module. Video4Linux2 is automatically detected and made
-available at compile time.
-
-Setup
-~~~~~
-
-Use ``modprobe cpia2`` to load and ``modprobe -r cpia2`` to unload. This
-may be done automatically by your distribution.
-
-Driver options
-~~~~~~~~~~~~~~
-
-.. tabularcolumns:: |p{13ex}|L|
-
-
-============== ========================================================
-Option Description
-============== ========================================================
-video_nr video device to register (0=/dev/video0, etc)
- range -1 to 64. default is -1 (first available)
- If you have more than 1 camera, this MUST be -1.
-buffer_size Size for each frame buffer in bytes (default 68k)
-num_buffers Number of frame buffers (1-32, default 3)
-alternate USB Alternate (2-7, default 7)
-flicker_freq Frequency for flicker reduction(50 or 60, default 60)
-flicker_mode 0 to disable, or 1 to enable flicker reduction.
- (default 0). This is only effective if the camera
- uses a stv0672 coprocessor.
-============== ========================================================
-
-Setting the options
-~~~~~~~~~~~~~~~~~~~
-
-If you are using modules, edit /etc/modules.conf and add an options
-line like this::
-
- options cpia2 num_buffers=3 buffer_size=65535
-
-If the driver is compiled into the kernel, at boot time specify them
-like this::
-
- cpia2.num_buffers=3 cpia2.buffer_size=65535
-
-What buffer size should I use?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The maximum image size depends on the alternate you choose, and the
-frame rate achieved by the camera. If the compression engine is able to
-keep up with the frame rate, the maximum image size is given by the table
-below.
-
-The compression engine starts out at maximum compression, and will
-increase image quality until it is close to the size in the table. As long
-as the compression engine can keep up with the frame rate, after a short time
-the images will all be about the size in the table, regardless of resolution.
-
-At low alternate settings, the compression engine may not be able to
-compress the image enough and will reduce the frame rate by producing larger
-images.
-
-The default of 68k should be good for most users. This will handle
-any alternate at frame rates down to 15fps. For lower frame rates, it may
-be necessary to increase the buffer size to avoid having frames dropped due
-to insufficient space.
-
-========== ========== ======== =====
-Alternate bytes/ms 15fps 30fps
-========== ========== ======== =====
- 2 128 8533 4267
- 3 384 25600 12800
- 4 640 42667 21333
- 5 768 51200 25600
- 6 896 59733 29867
- 7 1023 68200 34100
-========== ========== ======== =====
-
-Table: Image size(bytes)
-
-
-How many buffers should I use?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-For normal streaming, 3 should give the best results. With only 2,
-it is possible for the camera to finish sending one image just after a
-program has started reading the other. If this happens, the driver must drop
-a frame. The exception to this is if you have a heavily loaded machine. In
-this case use 2 buffers. You are probably not reading at the full frame rate.
-If the camera can send multiple images before a read finishes, it could
-overwrite the third buffer before the read finishes, leading to a corrupt
-image. Single and double buffering have extra checks to avoid overwriting.
-
-Using the camera
-~~~~~~~~~~~~~~~~
-
-We are providing a modified gqcam application to view the output. In
-order to avoid confusion, here it is called mview. There is also the qx5view
-program which can also control the lights on the qx5 microscope. MJPEG Tools
-(http://mjpeg.sourceforge.net) can also be used to record from the camera.
diff --git a/Documentation/admin-guide/media/davinci-vpbe.rst b/Documentation/admin-guide/media/davinci-vpbe.rst
deleted file mode 100644
index 9e6360fd02db..000000000000
--- a/Documentation/admin-guide/media/davinci-vpbe.rst
+++ /dev/null
@@ -1,65 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-The VPBE V4L2 driver design
-===========================
-
-Functional partitioning
------------------------
-
-Consists of the following:
-
- 1. V4L2 display driver
-
- Implements creation of video2 and video3 device nodes and
- provides v4l2 device interface to manage VID0 and VID1 layers.
-
- 2. Display controller
-
- Loads up VENC, OSD and external encoders such as ths8200. It provides
- a set of API calls to V4L2 drivers to set the output/standards
- in the VENC or external sub devices. It also provides
- a device object to access the services from OSD subdevice
- using sub device ops. The connection of external encoders to VENC LCD
- controller port is done at init time based on default output and standard
- selection or at run time when application change the output through
- V4L2 IOCTLs.
-
- When connected to an external encoder, vpbe controller is also responsible
- for setting up the interface between VENC and external encoders based on
- board specific settings (specified in board-xxx-evm.c). This allows
- interfacing external encoders such as ths8200. The setup_if_config()
- is implemented for this as well as configure_venc() (part of the next patch)
- API to set timings in VENC for a specific display resolution. As of this
- patch series, the interconnection and enabling and setting of the external
- encoders is not present, and would be a part of the next patch series.
-
- 3. VENC subdevice module
-
- Responsible for setting outputs provided through internal DACs and also
- setting timings at LCD controller port when external encoders are connected
- at the port or LCD panel timings required. When external encoder/LCD panel
- is connected, the timings for a specific standard/preset is retrieved from
- the board specific table and the values are used to set the timings in
- venc using non-standard timing mode.
-
- Support LCD Panel displays using the VENC. For example to support a Logic
- PD display, it requires setting up the LCD controller port with a set of
- timings for the resolution supported and setting the dot clock. So we could
- add the available outputs as a board specific entry (i.e add the "LogicPD"
- output name to board-xxx-evm.c). A table of timings for various LCDs
- supported can be maintained in the board specific setup file to support
- various LCD displays.As of this patch a basic driver is present, and this
- support for external encoders and displays forms a part of the next
- patch series.
-
- 4. OSD module
-
- OSD module implements all OSD layer management and hardware specific
- features. The VPBE module interacts with the OSD for enabling and
- disabling appropriate features of the OSD.
-
-Current status
---------------
-
-A fully functional working version of the V4L2 driver is available. This
-driver has been tested with NTSC and PAL standards and buffer streaming.
diff --git a/Documentation/admin-guide/media/dvb-drivers.rst b/Documentation/admin-guide/media/dvb-drivers.rst
index 8df637c375f9..66fa4edd0606 100644
--- a/Documentation/admin-guide/media/dvb-drivers.rst
+++ b/Documentation/admin-guide/media/dvb-drivers.rst
@@ -13,4 +13,3 @@ Digital TV driver-specific documentation
opera-firmware
technisat
ttusb-dec
- zr364xx
diff --git a/Documentation/admin-guide/media/fimc.rst b/Documentation/admin-guide/media/fimc.rst
index 56b149d9a527..267ef52fe387 100644
--- a/Documentation/admin-guide/media/fimc.rst
+++ b/Documentation/admin-guide/media/fimc.rst
@@ -14,7 +14,7 @@ data from LCD controller (FIMD) through the SoC internal writeback data
path. There are multiple FIMC instances in the SoCs (up to 4), having
slightly different capabilities, like pixel alignment constraints, rotator
availability, LCD writeback support, etc. The driver is located at
-drivers/media/platform/exynos4-is directory.
+drivers/media/platform/samsung/exynos4-is directory.
Supported SoCs
--------------
diff --git a/Documentation/admin-guide/media/i2c-cardlist.rst b/Documentation/admin-guide/media/i2c-cardlist.rst
index e60d459d18a9..1825a0bb47bd 100644
--- a/Documentation/admin-guide/media/i2c-cardlist.rst
+++ b/Documentation/admin-guide/media/i2c-cardlist.rst
@@ -58,27 +58,29 @@ Camera sensor devices
============ ==========================================================
Driver Name
============ ==========================================================
+ccs MIPI CCS compliant camera sensors (also SMIA++ and SMIA)
et8ek8 ET8EK8 camera sensor
hi556 Hynix Hi-556 sensor
+hi846 Hynix Hi-846 sensor
+imx208 Sony IMX208 sensor
imx214 Sony IMX214 sensor
imx219 Sony IMX219 sensor
imx258 Sony IMX258 sensor
imx274 Sony IMX274 sensor
imx290 Sony IMX290 sensor
imx319 Sony IMX319 sensor
+imx334 Sony IMX334 sensor
imx355 Sony IMX355 sensor
-m5mols Fujitsu M-5MOLS 8MP sensor
+imx412 Sony IMX412 sensor
mt9m001 mt9m001
-mt9m032 MT9M032 camera sensor
mt9m111 mt9m111, mt9m112 and mt9m131
mt9p031 Aptina MT9P031
-mt9t001 Aptina MT9T001
mt9t112 Aptina MT9T111/MT9T112
mt9v011 Micron mt9v011 sensor
mt9v032 Micron MT9V032 sensor
mt9v111 Aptina MT9V111 sensor
-noon010pc30 Siliconfile NOON010PC30 sensor
ov13858 OmniVision OV13858 sensor
+ov13b10 OmniVision OV13B10 sensor
ov2640 OmniVision OV2640 sensor
ov2659 OmniVision OV2659 sensor
ov2680 OmniVision OV2680 sensor
@@ -103,10 +105,6 @@ s5c73m3 Samsung S5C73M3 sensor
s5k4ecgx Samsung S5K4ECGX sensor
s5k5baf Samsung S5K5BAF sensor
s5k6a3 Samsung S5K6A3 sensor
-s5k6aa Samsung S5K6AAFX sensor
-smiapp SMIA++/SMIA sensor
-sr030pc30 Siliconfile SR030PC30 sensor
-vs6624 ST VS6624 sensor
============ ==========================================================
Flash devices
@@ -138,6 +136,7 @@ Driver Name
ad5820 AD5820 lens voice coil
ak7375 AK7375 lens voice coil
dw9714 DW9714 lens voice coil
+dw9768 DW9768 lens voice coil
dw9807-vcm DW9807 lens voice coil
============ ==========================================================
@@ -216,7 +215,6 @@ Video encoders
============ ==========================================================
Driver Name
============ ==========================================================
-ad9389b Analog Devices AD9389B encoder
adv7170 Analog Devices ADV7170 video encoder
adv7175 Analog Devices ADV7175 video encoder
adv7343 ADV7343 video encoder
@@ -278,7 +276,7 @@ tda9887 TDA 9885/6/7 analog IF demodulator
tea5761 TEA 5761 radio tuner
tea5767 TEA 5767 radio tuner
tua9001 Infineon TUA9001 silicon tuner
-tuner-xc2028 XCeive xc2028/xc3028 tuners
+xc2028 XCeive xc2028/xc3028 tuners
xc4000 Xceive XC4000 silicon tuner
xc5000 Xceive XC5000 silicon tuner
============ ==================================================
diff --git a/Documentation/admin-guide/media/imx7.rst b/Documentation/admin-guide/media/imx7.rst
index 1e442c97da47..2fa27718f52a 100644
--- a/Documentation/admin-guide/media/imx7.rst
+++ b/Documentation/admin-guide/media/imx7.rst
@@ -33,7 +33,7 @@ reference manual [#f1]_.
Entities
--------
-imx7-mipi-csi2
+imx-mipi-csi2
--------------
This is the MIPI CSI-2 receiver entity. It has one sink pad to receive the pixel
@@ -155,6 +155,66 @@ the resolutions supported by the sensor.
[fmt:SBGGR10_1X10/800x600@1/30 field:none colorspace:srgb]
-> "imx7-mipi-csis.0":0 [ENABLED]
+i.MX6ULL-EVK with OV5640
+------------------------
+
+On this platform a parallel OV5640 sensor is connected to the CSI port.
+The following example configures a video capture pipeline with an output
+of 640x480 and UYVY8_2X8 format:
+
+.. code-block:: none
+
+ # Setup links
+ media-ctl -l "'ov5640 1-003c':0 -> 'csi':0[1]"
+ media-ctl -l "'csi':1 -> 'csi capture':0[1]"
+
+ # Configure pads for pipeline
+ media-ctl -v -V "'ov5640 1-003c':0 [fmt:UYVY8_2X8/640x480 field:none]"
+
+After this streaming can start:
+
+.. code-block:: none
+
+ gst-launch-1.0 -v v4l2src device=/dev/video1 ! video/x-raw,format=UYVY,width=640,height=480 ! v4l2convert ! fbdevsink
+
+.. code-block:: none
+
+ # media-ctl -p
+ Media controller API version 5.14.0
+
+ Media device information
+ ------------------------
+ driver imx7-csi
+ model imx-media
+ serial
+ bus info
+ hw revision 0x0
+ driver version 5.14.0
+
+ Device topology
+ - entity 1: csi (2 pads, 2 links)
+ type V4L2 subdev subtype Unknown flags 0
+ device node name /dev/v4l-subdev0
+ pad0: Sink
+ [fmt:UYVY8_2X8/640x480 field:none colorspace:srgb xfer:srgb ycbcr:601 quantization:full-range]
+ <- "ov5640 1-003c":0 [ENABLED,IMMUTABLE]
+ pad1: Source
+ [fmt:UYVY8_2X8/640x480 field:none colorspace:srgb xfer:srgb ycbcr:601 quantization:full-range]
+ -> "csi capture":0 [ENABLED,IMMUTABLE]
+
+ - entity 4: csi capture (1 pad, 1 link)
+ type Node subtype V4L flags 0
+ device node name /dev/video1
+ pad0: Sink
+ <- "csi":1 [ENABLED,IMMUTABLE]
+
+ - entity 10: ov5640 1-003c (1 pad, 1 link)
+ type V4L2 subdev subtype Sensor flags 0
+ device node name /dev/v4l-subdev1
+ pad0: Source
+ [fmt:UYVY8_2X8/640x480@1/30 field:none colorspace:srgb xfer:srgb ycbcr:601 quantization:full-range]
+ -> "csi":0 [ENABLED,IMMUTABLE]
+
References
----------
diff --git a/Documentation/admin-guide/media/index.rst b/Documentation/admin-guide/media/index.rst
index c676af665111..43f4a292b245 100644
--- a/Documentation/admin-guide/media/index.rst
+++ b/Documentation/admin-guide/media/index.rst
@@ -38,13 +38,14 @@ The media subsystem
remote-controller
+ cec
+
dvb
cardlist
v4l-drivers
dvb-drivers
- cec-drivers
**Copyright** |copy| 1999-2020 : LinuxTV Developers
diff --git a/Documentation/admin-guide/media/ipu3.rst b/Documentation/admin-guide/media/ipu3.rst
index 52c1c04173da..83b3cd03b35c 100644
--- a/Documentation/admin-guide/media/ipu3.rst
+++ b/Documentation/admin-guide/media/ipu3.rst
@@ -51,10 +51,11 @@ to userspace as a V4L2 sub-device node and has two pads:
.. tabularcolumns:: |p{0.8cm}|p{4.0cm}|p{4.0cm}|
.. flat-table::
+ :header-rows: 1
- * - pad
- - direction
- - purpose
+ * - Pad
+ - Direction
+ - Purpose
* - 0
- sink
@@ -148,10 +149,11 @@ Each pipe has two sink pads and three source pads for the following purpose:
.. tabularcolumns:: |p{0.8cm}|p{4.0cm}|p{4.0cm}|
.. flat-table::
+ :header-rows: 1
- * - pad
- - direction
- - purpose
+ * - Pad
+ - Direction
+ - Purpose
* - 0
- sink
diff --git a/Documentation/admin-guide/media/ivtv.rst b/Documentation/admin-guide/media/ivtv.rst
index 7b8775d20214..101f16d0263e 100644
--- a/Documentation/admin-guide/media/ivtv.rst
+++ b/Documentation/admin-guide/media/ivtv.rst
@@ -159,7 +159,7 @@ whatever). Otherwise the device numbers can get confusing. The ivtv
Read-only
The raw YUV video output from the current video input. The YUV format
- is non-standard (V4L2_PIX_FMT_HM12).
+   is a 16x16 linear tiled NV12 format (V4L2_PIX_FMT_NV12_16L16).
Note that the YUV and PCM streams are not synchronized, so they are of
limited use.
diff --git a/Documentation/admin-guide/media/meye.rst b/Documentation/admin-guide/media/meye.rst
deleted file mode 100644
index 9098a1e65f8b..000000000000
--- a/Documentation/admin-guide/media/meye.rst
+++ /dev/null
@@ -1,93 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-.. include:: <isonum.txt>
-
-Vaio Picturebook Motion Eye Camera Driver
-=========================================
-
-Copyright |copy| 2001-2004 Stelian Pop <stelian@popies.net>
-
-Copyright |copy| 2001-2002 Alcôve <www.alcove.com>
-
-Copyright |copy| 2000 Andrew Tridgell <tridge@samba.org>
-
-This driver enable the use of video4linux compatible applications with the
-Motion Eye camera. This driver requires the "Sony Laptop Extras" driver (which
-can be found in the "Misc devices" section of the kernel configuration utility)
-to be compiled and installed (using its "camera=1" parameter).
-
-It can do at maximum 30 fps @ 320x240 or 15 fps @ 640x480.
-
-Grabbing is supported in packed YUV colorspace only.
-
-MJPEG hardware grabbing is supported via a private API (see below).
-
-Hardware supported
-------------------
-
-This driver supports the 'second' version of the MotionEye camera :)
-
-The first version was connected directly on the video bus of the Neomagic
-video card and is unsupported.
-
-The second one, made by Kawasaki Steel is fully supported by this
-driver (PCI vendor/device is 0x136b/0xff01)
-
-The third one, present in recent (more or less last year) Picturebooks
-(C1M* models), is not supported. The manufacturer has given the specs
-to the developers under a NDA (which allows the development of a GPL
-driver however), but things are not moving very fast (see
-http://r-engine.sourceforge.net/) (PCI vendor/device is 0x10cf/0x2011).
-
-There is a forth model connected on the USB bus in TR1* Vaio laptops.
-This camera is not supported at all by the current driver, in fact
-little information if any is available for this camera
-(USB vendor/device is 0x054c/0x0107).
-
-Driver options
---------------
-
-Several options can be passed to the meye driver using the standard
-module argument syntax (<param>=<value> when passing the option to the
-module or meye.<param>=<value> on the kernel boot line when meye is
-statically linked into the kernel). Those options are:
-
-.. code-block:: none
-
- gbuffers: number of capture buffers, default is 2 (32 max)
-
- gbufsize: size of each capture buffer, default is 614400
-
- video_nr: video device to register (0 = /dev/video0, etc)
-
-Module use
-----------
-
-In order to automatically load the meye module on use, you can put those lines
-in your /etc/modprobe.d/meye.conf file:
-
-.. code-block:: none
-
- alias char-major-81 videodev
- alias char-major-81-0 meye
- options meye gbuffers=32
-
-Usage:
-------
-
-.. code-block:: none
-
- xawtv >= 3.49 (<http://bytesex.org/xawtv/>)
- for display and uncompressed video capture:
-
- xawtv -c /dev/video0 -geometry 640x480
- or
- xawtv -c /dev/video0 -geometry 320x240
-
- motioneye (<http://popies.net/meye/>)
- for getting ppm or jpg snapshots, mjpeg video
-
-Bugs / Todo
------------
-
-- 'motioneye' still uses the meye private v4l1 API extensions.
diff --git a/Documentation/admin-guide/media/omap3isp.rst b/Documentation/admin-guide/media/omap3isp.rst
index bc447bbec7ce..f32e7375a1a2 100644
--- a/Documentation/admin-guide/media/omap3isp.rst
+++ b/Documentation/admin-guide/media/omap3isp.rst
@@ -17,7 +17,7 @@ Introduction
------------
This file documents the Texas Instruments OMAP 3 Image Signal Processor (ISP)
-driver located under drivers/media/platform/omap3isp. The original driver was
+driver located under drivers/media/platform/ti/omap3isp. The original driver was
written by Texas Instruments but since that it has been rewritten (twice) at
Nokia.
diff --git a/Documentation/admin-guide/media/omap4_camera.rst b/Documentation/admin-guide/media/omap4_camera.rst
index 24db4222d36d..2ada9b1e6897 100644
--- a/Documentation/admin-guide/media/omap4_camera.rst
+++ b/Documentation/admin-guide/media/omap4_camera.rst
@@ -25,7 +25,7 @@ As of Revision AB, the ISS is described in detail in section 8.
This driver is supporting **only** the CSI2-A/B interfaces for now.
It makes use of the Media Controller framework [#f2]_, and inherited most of the
-code from OMAP3 ISP driver (found under drivers/media/platform/omap3isp/\*),
+code from OMAP3 ISP driver (found under drivers/media/platform/ti/omap3isp/\*),
except that it doesn't need an IOMMU now for ISS buffers memory mapping.
Supports usage of MMAP buffers only (for now).
diff --git a/Documentation/admin-guide/media/other-usb-cardlist.rst b/Documentation/admin-guide/media/other-usb-cardlist.rst
index bbfdb1389c18..fb88db50e861 100644
--- a/Documentation/admin-guide/media/other-usb-cardlist.rst
+++ b/Documentation/admin-guide/media/other-usb-cardlist.rst
@@ -14,8 +14,6 @@ dvb-as102 nBox DVB-T Dongle 0b89:0007
dvb-as102 Sky IT Digital Key (green led) 2137:0001
b2c2-flexcop-usb Technisat/B2C2 FlexCop II/IIb/III 0af7:0101
Digital TV
-cpia2 Vision's CPiA2 cameras 0553:0100, 0553:0140,
- such as the Digital Blue QX5 0553:0151
go7007 WIS GO7007 MPEG encoder 1943:a250, 093b:a002,
093b:a004, 0eb1:6666,
0eb1:6668
@@ -66,7 +64,6 @@ pwc Visionite VCS-UC300 0d81:1900
pwc Visionite VCS-UM100 0d81:1910
s2255drv Sensoray 2255 1943:2255, 1943:2257
stk1160 STK1160 USB video capture dongle 05e1:0408
-stkwebcam Syntek DC1125 174f:a311, 05e1:0501
dvb-ttusb-budget Technotrend/Hauppauge Nova-USB devices 0b48:1003, 0b48:1004,
0b48:1005
dvb-ttusb_dec Technotrend/Hauppauge MPEG decoder 0b48:1006
@@ -78,15 +75,4 @@ dvb-ttusb_dec Technotrend/Hauppauge MPEG decoder
DEC2540-t 0b48:1009
usbtv Fushicai USBTV007 Audio-Video Grabber 1b71:3002, 1f71:3301,
1f71:3306
-zr364xx USB ZR364XX Camera 08ca:0109, 041e:4024,
- 0d64:0108, 0546:3187,
- 0d64:3108, 0595:4343,
- 0bb0:500d, 0feb:2004,
- 055f:b500, 08ca:2062,
- 052b:1a18, 04c8:0729,
- 04f2:a208, 0784:0040,
- 06d6:0034, 0a17:0062,
- 06d6:003b, 0a17:004e,
- 041e:405d, 08ca:2102,
- 06d6:003d
================ ====================================== =====================
diff --git a/Documentation/admin-guide/media/pci-cardlist.rst b/Documentation/admin-guide/media/pci-cardlist.rst
index f4d670e632f8..42528795d4da 100644
--- a/Documentation/admin-guide/media/pci-cardlist.rst
+++ b/Documentation/admin-guide/media/pci-cardlist.rst
@@ -77,7 +77,6 @@ ipu3-cio2 Intel ipu3-cio2 driver
ivtv Conexant cx23416/cx23415 MPEG encoder/decoder
ivtvfb Conexant cx23415 framebuffer
mantis MANTIS based cards
-meye Sony Vaio Picturebook Motion Eye
mxb Siemens-Nixdorf 'Multimedia eXtension Board'
netup-unidvb NetUP Universal DVB card
ngene Micronas nGene
diff --git a/Documentation/admin-guide/media/platform-cardlist.rst b/Documentation/admin-guide/media/platform-cardlist.rst
index 261e7772eb3e..1230ae4037ad 100644
--- a/Documentation/admin-guide/media/platform-cardlist.rst
+++ b/Documentation/admin-guide/media/platform-cardlist.rst
@@ -30,7 +30,6 @@ exynos-fimc-is EXYNOS4x12 FIMC-IS (Imaging Subsystem)
exynos-fimc-lite EXYNOS FIMC-LITE camera interface
exynos-gsc Samsung Exynos G-Scaler
exy Samsung S5P/EXYNOS4 SoC series Camera Subsystem
-fsl-viu Freescale VIU
imx-pxp i.MX Pixel Pipeline (PXP)
isdf TI DM365 ISIF video capture
mmp_camera Marvell Armada 610 integrated camera controller
@@ -60,6 +59,7 @@ s5p-mfc Samsung S5P MFC Video Codec
sh_veu SuperH VEU mem2mem video processing
sh_vou SuperH VOU video output
stm32-dcmi STM32 Digital Camera Memory Interface (DCMI)
+stm32-dma2d STM32 Chrom-Art Accelerator Unit
sun4i-csi Allwinner A10 CMOS Sensor Interface Support
sun6i-csi Allwinner V3s Camera Sensor Interface
sun8i-di Allwinner Deinterlace
@@ -72,7 +72,6 @@ via-camera VIAFB camera controller
video-mux Video Multiplexer
vpif_display TI DaVinci VPIF V4L2-Display
vpif_capture TI DaVinci VPIF video capture
-vpss TI DaVinci VPBE V4L2-Display
vsp1 Renesas VSP1 Video Processing Engine
xilinx-tpg Xilinx Video Test Pattern Generator
xilinx-video Xilinx Video IP (EXPERIMENTAL)
diff --git a/Documentation/admin-guide/media/pulse8-cec.rst b/Documentation/admin-guide/media/pulse8-cec.rst
deleted file mode 100644
index 356d08b519f3..000000000000
--- a/Documentation/admin-guide/media/pulse8-cec.rst
+++ /dev/null
@@ -1,13 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Pulse-Eight CEC Adapter driver
-==============================
-
-The pulse8-cec driver implements the following module option:
-
-``persistent_config``
----------------------
-
-By default this is off, but when set to 1 the driver will store the current
-settings to the device's internal eeprom and restore it the next time the
-device is connected to the USB port.
diff --git a/Documentation/admin-guide/media/si476x.rst b/Documentation/admin-guide/media/si476x.rst
index 87062301d6a1..c8882ee9f208 100644
--- a/Documentation/admin-guide/media/si476x.rst
+++ b/Documentation/admin-guide/media/si476x.rst
@@ -142,7 +142,7 @@ The drivers exposes following files:
indicator
0x18 lassi Signed Low side adjacent Channel
Strength indicator
- 0x19 hassi ditto fpr High side
+ 0x19 hassi ditto for High side
0x20 mult Multipath indicator
0x21 dev Frequency deviation
0x24 assi Adjacent channel SSI
diff --git a/Documentation/admin-guide/media/tm6000-cardlist.rst b/Documentation/admin-guide/media/tm6000-cardlist.rst
deleted file mode 100644
index 6d2769c0f4d8..000000000000
--- a/Documentation/admin-guide/media/tm6000-cardlist.rst
+++ /dev/null
@@ -1,83 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-TM6000 cards list
-=================
-
-.. tabularcolumns:: |p{1.4cm}|p{11.1cm}|p{4.2cm}|
-
-.. flat-table::
- :header-rows: 1
- :widths: 2 19 18
- :stub-columns: 0
-
- * - Card number
- - Card name
- - USB IDs
-
- * - 0
- - Unknown tm6000 video grabber
- -
-
- * - 1
- - Generic tm5600 board
- - 6000:0001
-
- * - 2
- - Generic tm6000 board
- -
-
- * - 3
- - Generic tm6010 board
- - 6000:0002
-
- * - 4
- - 10Moons UT 821
- -
-
- * - 5
- - 10Moons UT 330
- -
-
- * - 6
- - ADSTECH Dual TV USB
- - 06e1:f332
-
- * - 7
- - Freecom Hybrid Stick / Moka DVB-T Receiver Dual
- - 14aa:0620
-
- * - 8
- - ADSTECH Mini Dual TV USB
- - 06e1:b339
-
- * - 9
- - Hauppauge WinTV HVR-900H / WinTV USB2-Stick
- - 2040:6600, 2040:6601, 2040:6610, 2040:6611
-
- * - 10
- - Beholder Wander DVB-T/TV/FM USB2.0
- - 6000:dec0
-
- * - 11
- - Beholder Voyager TV/FM USB2.0
- - 6000:dec1
-
- * - 12
- - Terratec Cinergy Hybrid XE / Cinergy Hybrid-Stick
- - 0ccd:0086, 0ccd:00A5
-
- * - 13
- - Twinhan TU501(704D1)
- - 13d3:3240, 13d3:3241, 13d3:3243, 13d3:3264
-
- * - 14
- - Beholder Wander Lite DVB-T/TV/FM USB2.0
- - 6000:dec2
-
- * - 15
- - Beholder Voyager Lite TV/FM USB2.0
- - 6000:dec3
-
- * - 16
- - Terratec Grabster AV 150/250 MX
- - 0ccd:0079
diff --git a/Documentation/admin-guide/media/usb-cardlist.rst b/Documentation/admin-guide/media/usb-cardlist.rst
index 1e96f928e0af..5f5ab0723e48 100644
--- a/Documentation/admin-guide/media/usb-cardlist.rst
+++ b/Documentation/admin-guide/media/usb-cardlist.rst
@@ -43,7 +43,6 @@ Driver Name
airspy AirSpy
au0828 Auvitek AU0828
b2c2-flexcop-usb Technisat/B2C2 Air/Sky/Cable2PC USB
-cpia2 CPiA2 Video For Linux
cx231xx Conexant cx231xx USB video capture
dvb-as102 Abilis AS102 DVB receiver
dvb-ttusb-budget Technotrend/Hauppauge Nova - USB devices
@@ -93,15 +92,10 @@ pwc USB Philips Cameras
s2250 Sensoray 2250/2251
s2255drv USB Sensoray 2255 video capture device
smsusb Siano SMS1xxx based MDTV receiver
-stkwebcam USB Syntek DC1125 Camera
-tm6000-alsa TV Master TM5600/6000/6010 audio
-tm6000-dvb DVB Support for tm6000 based TV cards
-tm6000 TV Master TM5600/6000/6010 driver
ttusb_dec Technotrend/Hauppauge USB DEC devices
usbtv USBTV007 video capture
uvcvideo USB Video Class (UVC)
zd1301 ZyDAS ZD1301
-zr364xx USB ZR364XX Camera
====================== =========================================================
.. toctree::
@@ -110,7 +104,6 @@ zr364xx USB ZR364XX Camera
au0828-cardlist
cx231xx-cardlist
em28xx-cardlist
- tm6000-cardlist
siano-cardlist
gspca-cardlist
diff --git a/Documentation/admin-guide/media/v4l-drivers.rst b/Documentation/admin-guide/media/v4l-drivers.rst
index 9c7ebe2ca3bd..1c41f87c3917 100644
--- a/Documentation/admin-guide/media/v4l-drivers.rst
+++ b/Documentation/admin-guide/media/v4l-drivers.rst
@@ -11,15 +11,12 @@ Video4Linux (V4L) driver-specific documentation
bttv
cafe_ccic
- cpia2
cx88
- davinci-vpbe
fimc
imx
imx7
ipu3
ivtv
- meye
omap3isp
omap4_camera
philips
@@ -31,4 +28,5 @@ Video4Linux (V4L) driver-specific documentation
si4713
si476x
vimc
+ visl
vivid
diff --git a/Documentation/admin-guide/media/vimc.dot b/Documentation/admin-guide/media/vimc.dot
index 57863a13fa39..92a5bb631235 100644
--- a/Documentation/admin-guide/media/vimc.dot
+++ b/Documentation/admin-guide/media/vimc.dot
@@ -5,18 +5,22 @@ digraph board {
n00000001 [label="{{} | Sensor A\n/dev/v4l-subdev0 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
n00000001:port0 -> n00000005:port0 [style=bold]
n00000001:port0 -> n0000000b [style=bold]
+ n00000001 -> n00000002
+ n00000002 [label="{{} | Lens A\n/dev/v4l-subdev5 | {<port0>}}", shape=Mrecord, style=filled, fillcolor=green]
n00000003 [label="{{} | Sensor B\n/dev/v4l-subdev1 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
n00000003:port0 -> n00000008:port0 [style=bold]
n00000003:port0 -> n0000000f [style=bold]
+ n00000003 -> n00000004
+ n00000004 [label="{{} | Lens B\n/dev/v4l-subdev6 | {<port0>}}", shape=Mrecord, style=filled, fillcolor=green]
n00000005 [label="{{<port0> 0} | Debayer A\n/dev/v4l-subdev2 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
- n00000005:port1 -> n00000017:port0
+ n00000005:port1 -> n00000015:port0
n00000008 [label="{{<port0> 0} | Debayer B\n/dev/v4l-subdev3 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
- n00000008:port1 -> n00000017:port0 [style=dashed]
+ n00000008:port1 -> n00000015:port0 [style=dashed]
n0000000b [label="Raw Capture 0\n/dev/video0", shape=box, style=filled, fillcolor=yellow]
n0000000f [label="Raw Capture 1\n/dev/video1", shape=box, style=filled, fillcolor=yellow]
- n00000013 [label="RGB/YUV Input\n/dev/video2", shape=box, style=filled, fillcolor=yellow]
- n00000013 -> n00000017:port0 [style=dashed]
- n00000017 [label="{{<port0> 0} | Scaler\n/dev/v4l-subdev4 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
- n00000017:port1 -> n0000001a [style=bold]
- n0000001a [label="RGB/YUV Capture\n/dev/video3", shape=box, style=filled, fillcolor=yellow]
+ n00000013 [label="{{} | RGB/YUV Input\n/dev/v4l-subdev4 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000013:port0 -> n00000015:port0 [style=dashed]
+ n00000015 [label="{{<port0> 0} | Scaler\n/dev/v4l-subdev5 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000015:port1 -> n00000018 [style=bold]
+ n00000018 [label="RGB/YUV Capture\n/dev/video2", shape=box, style=filled, fillcolor=yellow]
}
diff --git a/Documentation/admin-guide/media/vimc.rst b/Documentation/admin-guide/media/vimc.rst
index 211cc8972410..29d843a8ddb1 100644
--- a/Documentation/admin-guide/media/vimc.rst
+++ b/Documentation/admin-guide/media/vimc.rst
@@ -35,11 +35,11 @@ of commands fits for the default topology:
media-ctl -d platform:vimc -V '"Sensor A":0[fmt:SBGGR8_1X8/640x480]'
media-ctl -d platform:vimc -V '"Debayer A":0[fmt:SBGGR8_1X8/640x480]'
- media-ctl -d platform:vimc -V '"Sensor B":0[fmt:SBGGR8_1X8/640x480]'
- media-ctl -d platform:vimc -V '"Debayer B":0[fmt:SBGGR8_1X8/640x480]'
- v4l2-ctl -z platform:vimc -d "RGB/YUV Capture" -v width=1920,height=1440
+ media-ctl -d platform:vimc -V '"Scaler":0[fmt:RGB888_1X24/640x480]'
+ media-ctl -d platform:vimc -V '"Scaler":0[crop:(100,50)/400x150]'
+ media-ctl -d platform:vimc -V '"Scaler":1[fmt:RGB888_1X24/300x700]'
+ v4l2-ctl -z platform:vimc -d "RGB/YUV Capture" -v width=300,height=700
v4l2-ctl -z platform:vimc -d "Raw Capture 0" -v pixelformat=BA81
- v4l2-ctl -z platform:vimc -d "Raw Capture 1" -v pixelformat=BA81
Subdevices
----------
@@ -53,6 +53,25 @@ vimc-sensor:
* 1 Pad source
+vimc-lens:
+  Ancillary lens for a sensor. Linked to a vimc-sensor using an ancillary
+  link. The lens supports the auto focus FOCUS_ABSOLUTE control.
+
+.. code-block:: bash
+
+ media-ctl -p
+ ...
+ - entity 28: Lens A (0 pad, 0 link)
+ type V4L2 subdev subtype Lens flags 0
+ device node name /dev/v4l-subdev6
+ - entity 29: Lens B (0 pad, 0 link)
+ type V4L2 subdev subtype Lens flags 0
+ device node name /dev/v4l-subdev7
+ v4l2-ctl -d /dev/v4l-subdev7 -C focus_absolute
+ focus_absolute: 0
+
+
vimc-debayer:
Transforms images in bayer format into a non-bayer format.
Exposes:
@@ -61,9 +80,10 @@ vimc-debayer:
* 1 Pad source
vimc-scaler:
- Scale up the image by a factor of 3. E.g.: a 640x480 image becomes a
- 1920x1440 image. (this value can be configured, see at
- `Module options`_).
+  Resizes the image to match the source pad resolution. E.g.: if the sink
+  pad is configured to 360x480 and the source to 1280x720, the image will
+  be stretched to fit the source resolution. Works for any resolution
+  within the vimc limitations (even shrinking the image if necessary).
Exposes:
* 1 Pad sink
@@ -76,15 +96,15 @@ vimc-capture:
* 1 Pad sink
* 1 Pad source
-
Module options
--------------
Vimc has a module parameter to configure the driver.
-* ``sca_mult=<unsigned int>``
+* ``allocator=<unsigned int>``
+
+  memory allocator selection, default is 0. It specifies the way buffers
+  will be allocated (see the example below).
- Image size multiplier factor to be used to multiply both width and
- height, so the image size will be ``sca_mult^2`` bigger than the
- original one. Currently, only supports scaling up (the default value
- is 3).
+ - 0: vmalloc
+ - 1: dma-contig
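+
+For example, loading vimc with buffers backed by the DMA contiguous allocator:
+
+.. code-block:: bash
+
+   modprobe vimc allocator=1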
diff --git a/Documentation/admin-guide/media/visl.rst b/Documentation/admin-guide/media/visl.rst
new file mode 100644
index 000000000000..7d2dc78341c9
--- /dev/null
+++ b/Documentation/admin-guide/media/visl.rst
@@ -0,0 +1,175 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+The Virtual Stateless Decoder Driver (visl)
+===========================================
+
+A virtual stateless decoder device for stateless uAPI development
+purposes.
+
+This tool's objective is to help the development and testing of
+userspace applications that use the V4L2 stateless API to decode media.
+
+A userspace implementation can use visl to run a decoding loop even when
+no hardware is available or when the kernel uAPI for the codec has not
+been upstreamed yet. This can reveal bugs at an early stage.
+
+This driver can also trace the contents of the V4L2 controls submitted
+to it. It can also dump the contents of the vb2 buffers through a
+debugfs interface. This is in many ways similar to the tracing
+infrastructure available for other popular encode/decode APIs out there
+and can help develop a userspace application by using another (working)
+one as a reference.
+
+.. note::
+
+ No actual decoding of video frames is performed by visl. The
+ V4L2 test pattern generator is used to write various debug information
+ to the capture buffers instead.
+
+Module parameters
+-----------------
+
+- visl_debug: Activates debug info, printing various debug messages through
+ dprintk. Also controls whether per-frame debug info is shown. Defaults to off.
+ Note that enabling this feature can result in slow performance through serial.
+
+- visl_transtime_ms: Simulated process time in milliseconds. Slowing down the
+ decoding speed can be useful for debugging.
+
+- visl_dprintk_frame_start, visl_dprintk_nframes: Dictates a range of
+ frames where dprintk is activated. This only controls the dprintk tracing on a
+ per-frame basis. Note that printing a lot of data can be slow through serial.
+
+- keep_bitstream_buffers: Controls whether bitstream (i.e. OUTPUT) buffers are
+ kept after a decoding session. Defaults to false so as to reduce the amount of
+ clutter. keep_bitstream_buffers == false works well when live debugging the
+ client program with GDB.
+
+- bitstream_trace_frame_start, bitstream_trace_nframes: Similar to
+  visl_dprintk_frame_start, visl_dprintk_nframes, but controls the dumping of
+  buffer data through debugfs instead (see the example below).
+
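+As a quick illustration, these parameters can be combined when loading the
+module. This is only a sketch, and the values used here are arbitrary:
+
+.. code-block:: bash
+
+   # trace the first 32 frames and simulate a 33 ms decode time per frame
+   modprobe visl visl_debug=1 visl_transtime_ms=33 \
+           visl_dprintk_frame_start=0 visl_dprintk_nframes=32
+
+   # the parameters are also visible under sysfs:
+   cat /sys/module/visl/parameters/keep_bitstream_buffers
+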
+What is the default use case for this driver?
+---------------------------------------------
+
+This driver can be used as a way to compare different userspace implementations.
+This assumes that a working client is run against visl and that the ftrace and
+OUTPUT buffer data is subsequently used to debug a work-in-progress
+implementation.
+
+Information on reference frames, their timestamps, the status of the OUTPUT and
+CAPTURE queues and more can be read directly from the CAPTURE buffers.
+
+Supported codecs
+----------------
+
+The following codecs are supported:
+
+- FWHT
+- MPEG2
+- VP8
+- VP9
+- H.264
+- HEVC
+
+visl trace events
+-----------------
+The trace events are defined on a per-codec basis, e.g.:
+
+.. code-block:: bash
+
+ $ ls /sys/kernel/debug/tracing/events/ | grep visl
+ visl_fwht_controls
+ visl_h264_controls
+ visl_hevc_controls
+ visl_mpeg2_controls
+ visl_vp8_controls
+ visl_vp9_controls
+
+For example, in order to dump HEVC SPS data:
+
+.. code-block:: bash
+
+ $ echo 1 > /sys/kernel/debug/tracing/events/visl_hevc_controls/v4l2_ctrl_hevc_sps/enable
+
+The SPS data will be dumped to the trace buffer, i.e.:
+
+.. code-block:: bash
+
+ $ cat /sys/kernel/debug/tracing/trace
+ video_parameter_set_id 0
+ seq_parameter_set_id 0
+ pic_width_in_luma_samples 1920
+ pic_height_in_luma_samples 1080
+ bit_depth_luma_minus8 0
+ bit_depth_chroma_minus8 0
+ log2_max_pic_order_cnt_lsb_minus4 4
+ sps_max_dec_pic_buffering_minus1 6
+ sps_max_num_reorder_pics 2
+ sps_max_latency_increase_plus1 0
+ log2_min_luma_coding_block_size_minus3 0
+ log2_diff_max_min_luma_coding_block_size 3
+ log2_min_luma_transform_block_size_minus2 0
+ log2_diff_max_min_luma_transform_block_size 3
+ max_transform_hierarchy_depth_inter 2
+ max_transform_hierarchy_depth_intra 2
+ pcm_sample_bit_depth_luma_minus1 0
+ pcm_sample_bit_depth_chroma_minus1 0
+ log2_min_pcm_luma_coding_block_size_minus3 0
+ log2_diff_max_min_pcm_luma_coding_block_size 0
+ num_short_term_ref_pic_sets 0
+ num_long_term_ref_pics_sps 0
+ chroma_format_idc 1
+ sps_max_sub_layers_minus1 0
+ flags AMP_ENABLED|SAMPLE_ADAPTIVE_OFFSET|TEMPORAL_MVP_ENABLED|STRONG_INTRA_SMOOTHING_ENABLED
+
+
+Dumping OUTPUT buffer data through debugfs
+------------------------------------------
+
+If the **VISL_DEBUGFS** Kconfig is enabled, visl will populate
+**/sys/kernel/debug/visl/bitstream** with OUTPUT buffer data according to the
+values of bitstream_trace_frame_start and bitstream_trace_nframes. This can
+highlight errors as broken clients may fail to fill the buffers properly.
+
+A single file is created for each processed OUTPUT buffer. Its name contains an
+integer that denotes the buffer sequence, i.e.:
+
+.. code-block:: c
+
+ snprintf(name, 32, "bitstream%d", run->src->sequence);
+
+Dumping the values is simply a matter of reading from the file, i.e.:
+
+For the buffer with sequence == 0:
+
+.. code-block:: bash
+
+ $ xxd /sys/kernel/debug/visl/bitstream/bitstream0
+ 00000000: 2601 af04 d088 bc25 a173 0e41 a4f2 3274 &......%.s.A..2t
+ 00000010: c668 cb28 e775 b4ac f53a ba60 f8fd 3aa1 .h.(.u...:.`..:.
+ 00000020: 46b4 bcfc 506c e227 2372 e5f5 d7ea 579f F...Pl.'#r....W.
+ 00000030: 6371 5eb5 0eb8 23b5 ca6a 5de5 983a 19e4 cq^...#..j]..:..
+ 00000040: e8c3 4320 b4ba a226 cbc1 4138 3a12 32d6 ..C ...&..A8:.2.
+ 00000050: fef3 247b 3523 4e90 9682 ac8e eb0c a389 ..${5#N.........
+ 00000060: ddd0 6cfc 0187 0e20 7aae b15b 1812 3d33 ..l.... z..[..=3
+ 00000070: e1c5 f425 a83a 00b7 4f18 8127 3c4c aefb ...%.:..O..'<L..
+
+For the buffer with sequence == 1:
+
+.. code-block:: bash
+
+ $ xxd /sys/kernel/debug/visl/bitstream/bitstream1
+ 00000000: 0201 d021 49e1 0c40 aa11 1449 14a6 01dc ...!I..@...I....
+ 00000010: 7023 889a c8cd 2cd0 13b4 dab0 e8ca 21fe p#....,.......!.
+ 00000020: c4c8 ab4c 486e 4e2f b0df 96cc c74e 8dde ...LHnN/.....N..
+ 00000030: 8ce7 ee36 d880 4095 4d64 30a0 ff4f 0c5e ...6..@.Md0..O.^
+ 00000040: f16b a6a1 d806 ca2a 0ece a673 7bea 1f37 .k.....*...s{..7
+ 00000050: 370f 5bb9 1dc4 ba21 6434 bc53 0173 cba0 7.[....!d4.S.s..
+ 00000060: dfe6 bc99 01ea b6e0 346b 92b5 c8de 9f5d ........4k.....]
+ 00000070: e7cc 3484 1769 fef2 a693 a945 2c8b 31da ..4..i.....E,.1.
+
+And so on.
+
+By default, the files are removed during STREAMOFF. This is to reduce the amount
+of clutter.
diff --git a/Documentation/admin-guide/media/vivid.rst b/Documentation/admin-guide/media/vivid.rst
index 6d7175f96f74..58ac25b2c385 100644
--- a/Documentation/admin-guide/media/vivid.rst
+++ b/Documentation/admin-guide/media/vivid.rst
@@ -392,7 +392,7 @@ Which one is returned depends on the chosen channel, each next valid channel
will cycle through the possible audio subchannel combinations. This allows
you to test the various combinations by just switching channels..
-Finally, for these inputs the v4l2_timecode struct is filled in in the
+Finally, for these inputs the v4l2_timecode struct is filled in the
dequeued v4l2_buffer struct.
@@ -580,7 +580,7 @@ Metadata Capture
----------------
The Metadata capture generates UVC format metadata. The PTS and SCR are
-transmitted based on the values set in vivid contols.
+transmitted based on the values set in vivid controls.
The Metadata device will only work for the Webcam input, it will give
back an error for all other inputs.
@@ -714,6 +714,20 @@ The Test Pattern Controls are all specific to video capture.
does the same for the EAV (End of Active Video) code.
+- Insert Video Guard Band
+
+ adds 4 columns of pixels with the HDMI Video Guard Band code at the
+ left hand side of the image. This only works with 3 or 4 byte RGB pixel
+ formats. The RGB pixel value 0xab/0x55/0xab turns out to be equivalent
+ to the HDMI Video Guard Band code that precedes each active video line
+ (see section 5.2.2.1 in the HDMI 1.3 Specification). To test if a video
+ receiver has correct HDMI Video Guard Band processing, enable this
+ control and then move the image to the left hand side of the screen.
+ That will result in video lines that start with multiple pixels that
+ have the same value as the Video Guard Band that precedes them.
+  Receivers that just keep skipping Video Guard Band values will now
+  fail and either lose sync or these video lines will shift.
+
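+  A hypothetical test sequence could look like the following (assuming the
+  control is exposed as ``insert_video_guard_band``; check the output of
+  ``v4l2-ctl -L`` for the exact name on your system):
+
+  .. code-block:: bash
+
+     # list the controls to find the guard band control name
+     v4l2-ctl -d /dev/video0 -L | grep -i guard
+
+     # enable insertion of the video guard band columns
+     v4l2-ctl -d /dev/video0 -c insert_video_guard_band=1
+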
Capture Feature Selection Controls
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -1304,7 +1318,7 @@ instance. This setup would require the following commands:
$ v4l2-ctl -d2 -i2
$ v4l2-ctl -d2 -c horizontal_movement=4
$ v4l2-ctl -d1 --overlay=1
- $ v4l2-ctl -d1 -c loop_video=1
+ $ v4l2-ctl -d0 -c loop_video=1
$ v4l2-ctl -d2 --stream-mmap --overlay=1
And from another console:
diff --git a/Documentation/admin-guide/media/zr364xx.rst b/Documentation/admin-guide/media/zr364xx.rst
deleted file mode 100644
index 7291e54b8be3..000000000000
--- a/Documentation/admin-guide/media/zr364xx.rst
+++ /dev/null
@@ -1,102 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Zoran 364xx based USB webcam module
-===================================
-
-site: http://royale.zerezo.com/zr364xx/
-
-mail: royale@zerezo.com
-
-
-Introduction
-------------
-
-
-This brings support under Linux for the Aiptek PocketDV 3300 and similar
-devices in webcam mode. If you just want to get on your PC the pictures
-and movies on the camera, you should use the usb-storage module instead.
-
-The driver works with several other cameras in webcam mode (see the list
-below).
-
-Possible chipsets are : ZR36430 (ZR36430BGC) and
-maybe ZR36431, ZR36440, ZR36442...
-
-You can try the experience changing the vendor/product ID values (look
-at the source code).
-
-You can get these values by looking at /var/log/messages when you plug
-your camera, or by typing : cat /sys/kernel/debug/usb/devices.
-
-
-Install
--------
-
-In order to use this driver, you must compile it with your kernel,
-with the following config options::
-
- ./scripts/config -e USB
- ./scripts/config -m MEDIA_SUPPORT
- ./scripts/config -e MEDIA_USB_SUPPORT
- ./scripts/config -e MEDIA_CAMERA_SUPPORT
- ./scripts/config -m USB_ZR364XX
-
-Usage
------
-
-modprobe zr364xx debug=X mode=Y
-
-- debug : set to 1 to enable verbose debug messages
-- mode : 0 = 320x240, 1 = 160x120, 2 = 640x480
-
-You can then use the camera with V4L2 compatible applications, for
-example Ekiga.
-
-To capture a single image, try this: dd if=/dev/video0 of=test.jpg bs=1M
-count=1
-
-links
------
-
-http://mxhaard.free.fr/ (support for many others cams including some Aiptek PocketDV)
-http://www.harmwal.nl/pccam880/ (this project also supports cameras based on this chipset)
-
-Supported devices
------------------
-
-====== ======= ============== ====================
-Vendor Product Distributor Model
-====== ======= ============== ====================
-0x08ca 0x0109 Aiptek PocketDV 3300
-0x08ca 0x0109 Maxell Maxcam PRO DV3
-0x041e 0x4024 Creative PC-CAM 880
-0x0d64 0x0108 Aiptek Fidelity 3200
-0x0d64 0x0108 Praktica DCZ 1.3 S
-0x0d64 0x0108 Genius Digital Camera (?)
-0x0d64 0x0108 DXG Technology Fashion Cam
-0x0546 0x3187 Polaroid iON 230
-0x0d64 0x3108 Praktica Exakta DC 2200
-0x0d64 0x3108 Genius G-Shot D211
-0x0595 0x4343 Concord Eye-Q Duo 1300
-0x0595 0x4343 Concord Eye-Q Duo 2000
-0x0595 0x4343 Fujifilm EX-10
-0x0595 0x4343 Ricoh RDC-6000
-0x0595 0x4343 Digitrex DSC 1300
-0x0595 0x4343 Firstline FDC 2000
-0x0bb0 0x500d Concord EyeQ Go Wireless
-0x0feb 0x2004 CRS Electronic 3.3 Digital Camera
-0x0feb 0x2004 Packard Bell DSC-300
-0x055f 0xb500 Mustek MDC 3000
-0x08ca 0x2062 Aiptek PocketDV 5700
-0x052b 0x1a18 Chiphead Megapix V12
-0x04c8 0x0729 Konica Revio 2
-0x04f2 0xa208 Creative PC-CAM 850
-0x0784 0x0040 Traveler Slimline X5
-0x06d6 0x0034 Trust Powerc@m 750
-0x0a17 0x0062 Pentax Optio 50L
-0x06d6 0x003b Trust Powerc@m 970Z
-0x0a17 0x004e Pentax Optio 50
-0x041e 0x405d Creative DiVi CAM 516
-0x08ca 0x2102 Aiptek DV T300
-0x06d6 0x003d Trust Powerc@m 910Z
-====== ======= ============== ====================
diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst
index 4e06ffabd78a..7367e6294ef6 100644
--- a/Documentation/admin-guide/mm/cma_debugfs.rst
+++ b/Documentation/admin-guide/mm/cma_debugfs.rst
@@ -5,10 +5,10 @@ CMA Debugfs Interface
The CMA debugfs interface is useful to retrieve basic information out of the
different CMA areas and to test allocation/release in each of the areas.
-Each CMA zone represents a directory under <debugfs>/cma/, indexed by the
-kernel's CMA index. So the first CMA zone would be:
+Each CMA area is represented by a directory under <debugfs>/cma/, named
+after the CMA area, like below:
- <debugfs>/cma/cma-0
+ <debugfs>/cma/<cma_name>
The structure of the files created under that directory is as follows:
@@ -18,8 +18,8 @@ The structure of the files created under that directory is as follows:
- [RO] bitmap: The bitmap of page states in the zone.
- [WO] alloc: Allocate N pages from that CMA area. For example::
- echo 5 > <debugfs>/cma/cma-2/alloc
+ echo 5 > <debugfs>/cma/<cma_name>/alloc
-would try to allocate 5 pages from the cma-2 area.
+would try to allocate 5 pages from the 'cma_name' area.
- [WO] free: Free N pages from that CMA area, similar to the above.
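+
+  For example::
+
+    echo 5 > <debugfs>/cma/<cma_name>/free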
diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst
index b966fcff993b..e796b0a7e4a5 100644
--- a/Documentation/admin-guide/mm/concepts.rst
+++ b/Documentation/admin-guide/mm/concepts.rst
@@ -1,5 +1,3 @@
-.. _mm_concepts:
-
=================
Concepts overview
=================
@@ -86,16 +84,15 @@ memory with the huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
the memory and mapped using huge pages. The hugetlbfs is described at
-:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
+Documentation/admin-guide/mm/hugetlbpage.rst.
Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user and hence the
-name. See
-:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
-for more details about THP.
+name. See Documentation/admin-guide/mm/transhuge.rst for more details
+about THP.
Zones
=====
@@ -125,8 +122,8 @@ processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
-:ref:`Documentation/vm/numa.rst <numa>` and in
-:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
+Documentation/mm/numa.rst and in
+Documentation/admin-guide/mm/numa_memory_policy.rst.
Page cache
==========
diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst
index 8c5dde3a5754..33d37bb2fb4e 100644
--- a/Documentation/admin-guide/mm/damon/index.rst
+++ b/Documentation/admin-guide/mm/damon/index.rst
@@ -1,10 +1,10 @@
.. SPDX-License-Identifier: GPL-2.0
-========================
-Monitoring Data Accesses
-========================
+==========================
+DAMON: Data Access MONitor
+==========================
-:doc:`DAMON </vm/damon/index>` allows light-weight data access monitoring.
+:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring.
Using DAMON, users can analyze the memory access patterns of their systems and
optimize those.
@@ -13,3 +13,5 @@ optimize those.
start
usage
+ reclaim
+ lru_sort
diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst
new file mode 100644
index 000000000000..7b0775d281b4
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/lru_sort.rst
@@ -0,0 +1,294 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+DAMON-based LRU-lists Sorting
+=============================
+
+DAMON-based LRU-lists Sorting (DAMON_LRU_SORT) is a static kernel module aimed
+at proactive and lightweight data access pattern based (de)prioritization of
+pages on their LRU-lists, to make LRU-lists a more trustworthy data access
+pattern source.
+
+Where Is Proactive LRU-lists Sorting Required?
+==============================================
+
+As the overhead of page-granularity access checking could be significant on
+huge systems, LRU lists are normally not proactively sorted, but only
+partially and reactively sorted on special events, including specific user
+requests, system calls, and memory pressure. As a result, LRU lists are
+sometimes not perfectly prepared to serve as a trustworthy access pattern
+source in some situations, including reclamation target page selection under
+sudden memory pressure.
+
+Because DAMON can identify access patterns with best-effort accuracy while
+inducing only a user-specified range of overhead, proactively running
+DAMON_LRU_SORT could help make LRU lists a more trustworthy access pattern
+source with low and controlled overhead.
+
+How Does It Work?
+=================
+
+DAMON_LRU_SORT finds hot pages (pages of memory regions that show access rates
+higher than a user-specified threshold) and cold pages (pages of memory
+regions that show no access for longer than a user-specified time threshold)
+using DAMON, and prioritizes hot pages while deprioritizing cold pages on
+their LRU-lists. To keep it from consuming too much CPU for the
+prioritizations, a CPU time usage limit can be configured. Under the limit,
+it prioritizes hotter pages and deprioritizes colder pages first. System
+administrators can also configure under what situation this scheme should be
+automatically activated and deactivated with three memory pressure watermarks.
+
+Its default parameters for hotness/coldness thresholds and the CPU quota limit
+are conservatively chosen. That is, the module under its default parameters
+could be widely used without harm in common situations, while providing a
+level of benefit for systems having clear hot/cold access patterns under
+memory pressure, all while consuming only a small portion of CPU time.
+
+Interface: Module Parameters
+============================
+
+To use this feature, you should first ensure your system is running on a kernel
+that is built with ``CONFIG_DAMON_LRU_SORT=y``.
+
+To let sysadmins enable or disable it and tune for the given system,
+DAMON_LRU_SORT utilizes module parameters. That is, you can put
+``damon_lru_sort.<parameter>=<value>`` on the kernel boot command line or write
+proper values to ``/sys/module/damon_lru_sort/parameters/<parameter>`` files.
+
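+For example, a parameter can be set at boot or at runtime as below (the value
+is only illustrative)::
+
+    # on the kernel boot command line:
+    damon_lru_sort.hot_thres_access_freq=500
+
+    # or at runtime, through sysfs:
+    echo 500 > /sys/module/damon_lru_sort/parameters/hot_thres_access_freq
+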
+Below is a description of each parameter.
+
+enabled
+-------
+
+Enable or disable DAMON_LRU_SORT.
+
+You can enable DAMON_LRU_SORT by setting the value of this parameter as ``Y``.
+Setting it as ``N`` disables DAMON_LRU_SORT. Note that DAMON_LRU_SORT could do
+no real monitoring and LRU-lists sorting due to the watermarks-based activation
+condition. Refer to the descriptions of the watermarks parameters below.
+
+commit_inputs
+-------------
+
+Make DAMON_LRU_SORT read the input parameters again, except ``enabled``.
+
+Input parameters that are updated while DAMON_LRU_SORT is running are not
+applied by default. Once this parameter is set as ``Y``, DAMON_LRU_SORT reads
+the values of the parameters except ``enabled`` again. Once the re-reading is
+done, this parameter is set as ``N``. If invalid parameters are found during
+the re-reading, DAMON_LRU_SORT will be disabled.
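+
+For example, with an illustrative value::
+
+    # update a parameter of the running module, then commit it
+    echo 30000000 > /sys/module/damon_lru_sort/parameters/cold_min_age
+    echo Y > /sys/module/damon_lru_sort/parameters/commit_inputs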
+
+hot_thres_access_freq
+---------------------
+
+Access frequency threshold for hot memory regions identification in permil.
+
+If a memory region is accessed at this frequency or higher, DAMON_LRU_SORT
+identifies the region as hot and marks it as accessed on the LRU list, so
+that it will not be reclaimed under memory pressure. 50% by default.
+
+cold_min_age
+------------
+
+Time threshold for cold memory regions identification in microseconds.
+
+If a memory region is not accessed for this or a longer time, DAMON_LRU_SORT
+identifies the region as cold and marks it as unaccessed on the LRU list, so
+that it will be reclaimed first under memory pressure. 120 seconds by
+default.
+
+quota_ms
+--------
+
+Limit of time for trying the LRU lists sorting in milliseconds.
+
+DAMON_LRU_SORT uses up to this much time within a time window
+(quota_reset_interval_ms) for LRU-lists sorting. This can be used for
+limiting the CPU consumption of DAMON_LRU_SORT; for example, quota_ms=10
+with quota_reset_interval_ms=1000 caps the sorting work at roughly 1% of one
+CPU's time. If the value is zero, the limit is disabled.
+
+10 ms by default.
+
+quota_reset_interval_ms
+-----------------------
+
+The time quota charge reset interval in milliseconds.
+
+The charge reset interval for the quota of time (quota_ms). That is,
+DAMON_LRU_SORT does not try LRU-lists sorting for more than quota_ms
+milliseconds within quota_reset_interval_ms milliseconds.
+
+1 second by default.
+
+wmarks_interval
+---------------
+
+The watermarks check time interval in microseconds.
+
+Minimal time to wait before checking the watermarks, when DAMON_LRU_SORT is
+enabled but inactive due to its watermarks rule. 5 seconds by default.
+
+wmarks_high
+-----------
+
+Free memory rate (per thousand) for the high watermark.
+
+If free memory of the system in bytes per thousand bytes is higher than this,
+DAMON_LRU_SORT becomes inactive, so it does nothing but periodically check
+the watermarks. 200 (20%) by default.
+
+wmarks_mid
+----------
+
+Free memory rate (per thousand) for the middle watermark.
+
+If free memory of the system in bytes per thousand bytes is between this and
+the low watermark, DAMON_LRU_SORT becomes active, so it starts the monitoring
+and the LRU-lists sorting. 150 (15%) by default.
+
+wmarks_low
+----------
+
+Free memory rate (per thousand) for the low watermark.
+
+If free memory of the system in bytes per thousand bytes is lower than this,
+DAMON_LRU_SORT becomes inactive, so it does nothing but periodically check
+the watermarks. 50 (5%) by default.
+
+sample_interval
+---------------
+
+Sampling interval for the monitoring in microseconds.
+
+The sampling interval of DAMON for the cold memory monitoring. Please refer to
+the DAMON documentation (:doc:`usage`) for more detail. 5ms by default.
+
+aggr_interval
+-------------
+
+Aggregation interval for the monitoring in microseconds.
+
+The aggregation interval of DAMON for the cold memory monitoring. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail. 100ms by
+default.
+
+min_nr_regions
+--------------
+
+Minimum number of monitoring regions.
+
+The minimal number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set a lower bound on the monitoring quality.
+But setting this too high could result in increased monitoring overhead.
+Please refer to the DAMON documentation (:doc:`usage`) for more detail. 10 by
+default.
+
+max_nr_regions
+--------------
+
+Maximum number of monitoring regions.
+
+The maximum number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set an upper bound on the monitoring overhead.
+However, setting this too low could result in bad monitoring quality. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail. 1000 by
+default.
+
+monitor_region_start
+--------------------
+
+Start of target memory region in physical address.
+
+The start physical address of the memory region that DAMON_LRU_SORT will do
+work against. By default, the biggest System RAM region is used.
+
+monitor_region_end
+------------------
+
+End of target memory region in physical address.
+
+The end physical address of the memory region that DAMON_LRU_SORT will do
+work against. By default, the biggest System RAM region is used.
+
+kdamond_pid
+-----------
+
+PID of the DAMON thread.
+
+If DAMON_LRU_SORT is enabled, this becomes the PID of the worker thread. Else,
+-1.
+
+nr_lru_sort_tried_hot_regions
+-----------------------------
+
+Number of hot memory regions that DAMON_LRU_SORT tried to LRU-sort.
+
+bytes_lru_sort_tried_hot_regions
+--------------------------------
+
+Total bytes of hot memory regions that DAMON_LRU_SORT tried to LRU-sort.
+
+nr_lru_sorted_hot_regions
+-------------------------
+
+Number of hot memory regions that were successfully LRU-sorted.
+
+bytes_lru_sorted_hot_regions
+----------------------------
+
+Total bytes of hot memory regions that were successfully LRU-sorted.
+
+nr_hot_quota_exceeds
+--------------------
+
+Number of times that the time quota limit for hot regions has been exceeded.
+
+nr_lru_sort_tried_cold_regions
+------------------------------
+
+Number of cold memory regions that DAMON_LRU_SORT tried to LRU-sort.
+
+bytes_lru_sort_tried_cold_regions
+---------------------------------
+
+Total bytes of cold memory regions that DAMON_LRU_SORT tried to LRU-sort.
+
+nr_lru_sorted_cold_regions
+--------------------------
+
+Number of cold memory regions that were successfully LRU-sorted.
+
+bytes_lru_sorted_cold_regions
+-----------------------------
+
+Total bytes of cold memory regions that were successfully LRU-sorted.
+
+nr_cold_quota_exceeds
+---------------------
+
+Number of times that the time quota limit for cold regions has been exceeded.
+
+Example
+=======
+
+Below runtime example commands make DAMON_LRU_SORT find memory regions having
+>=50% access frequency and LRU-prioritize them, while LRU-deprioritizing
+memory regions that are not accessed for 120 seconds. The prioritization and
+deprioritization are limited to using only up to 1% of the CPU time, to avoid
+DAMON_LRU_SORT consuming too much CPU time for the (de)prioritization. It
+also asks DAMON_LRU_SORT to do nothing if the system's free memory rate is
+more than 50%, but to start the real work if it becomes lower than 40%. If
+DAMON_LRU_SORT doesn't make progress and therefore the free memory rate
+becomes lower than 20%, it asks DAMON_LRU_SORT to do nothing again, so that
+we can fall back to the LRU-list based page granularity reclamation. ::
+
+ # cd /sys/module/damon_lru_sort/parameters
+ # echo 500 > hot_thres_access_freq
+ # echo 120000000 > cold_min_age
+ # echo 10 > quota_ms
+ # echo 1000 > quota_reset_interval_ms
+ # echo 500 > wmarks_high
+ # echo 400 > wmarks_mid
+ # echo 200 > wmarks_low
+ # echo Y > enabled
diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst
new file mode 100644
index 000000000000..343e25b252f4
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/reclaim.rst
@@ -0,0 +1,274 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+DAMON-based Reclamation
+=======================
+
+DAMON-based Reclamation (DAMON_RECLAIM) is a static kernel module aimed to be
+used for proactive and lightweight reclamation under light memory pressure.
+It doesn't aim to replace the LRU-list based page-granularity reclamation, but
+to be selectively used for different levels of memory pressure and requirements.
+
+Where Is Proactive Reclamation Required?
+========================================
+
+On general memory over-committed systems, proactively reclaiming cold pages
+helps save memory and reduce latency spikes incurred by the direct reclaim
+of processes or the CPU consumption of kswapd, while incurring only minimal
+performance degradation [1]_ [2]_ .
+
+Free Pages Reporting [3]_ based memory over-commit virtualization systems are
+a good example of such cases. In those systems, the guest VMs report their
+free memory to the host, and the host reallocates the reported memory to
+other guests. As a result, the memory of the systems is fully utilized.
+However, the guests might not be so memory-frugal, mainly because some kernel
+subsystems and user-space applications are designed to use as much memory as
+available. Then, guests report only a small amount of memory as free to the
+host, resulting in a drop of the memory utilization of the systems. Running
+the proactive reclamation in guests could mitigate this problem.
+
+How Does It Work?
+=================
+
+DAMON_RECLAIM finds memory regions that haven't been accessed for a specific
+time duration and pages them out. To avoid consuming too much CPU for the
+paging out operation, a speed limit can be configured. Under the speed limit,
+it pages out the memory regions that haven't been accessed for the longest
+time first. System administrators can also configure under what situation
+this scheme should be automatically activated and deactivated with three
+memory pressure watermarks.
+
+Interface: Module Parameters
+============================
+
+To use this feature, you should first ensure your system is running on a kernel
+that is built with ``CONFIG_DAMON_RECLAIM=y``.
+
+To let sysadmins enable or disable it and tune for the given system,
+DAMON_RECLAIM utilizes module parameters. That is, you can put
+``damon_reclaim.<parameter>=<value>`` on the kernel boot command line or write
+proper values to ``/sys/module/damon_reclaim/parameters/<parameter>`` files.
+
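+For instance, one could check and tune the module at runtime as below (the
+quota value is only illustrative)::
+
+    # check whether DAMON_RECLAIM is currently enabled
+    cat /sys/module/damon_reclaim/parameters/enabled
+
+    # allow up to 10 ms of reclamation work per quota window
+    echo 10 > /sys/module/damon_reclaim/parameters/quota_ms
+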
+Below is a description of each parameter.
+
+enabled
+-------
+
+Enable or disable DAMON_RECLAIM.
+
+You can enable DAMON_RECLAIM by setting the value of this parameter as ``Y``.
+Setting it as ``N`` disables DAMON_RECLAIM. Note that DAMON_RECLAIM could do
+no real monitoring and reclamation due to the watermarks-based activation
+condition. Refer to the descriptions of the watermarks parameters below.
+
+commit_inputs
+-------------
+
+Make DAMON_RECLAIM read the input parameters again, except ``enabled``.
+
+Input parameters that are updated while DAMON_RECLAIM is running are not
+applied by default. Once this parameter is set as ``Y``, DAMON_RECLAIM reads
+the values of the parameters except ``enabled`` again. Once the re-reading is
+done, this parameter is set as ``N``. If invalid parameters are found during
+the re-reading, DAMON_RECLAIM will be disabled.
+
+min_age
+-------
+
+Time threshold for cold memory regions identification in microseconds.
+
+If a memory region is not accessed for this or a longer time, DAMON_RECLAIM
+identifies the region as cold and reclaims it.
+
+120 seconds by default.
+
+quota_ms
+--------
+
+Limit of time for the reclamation in milliseconds.
+
+DAMON_RECLAIM tries to use only up to this time within a time window
+(quota_reset_interval_ms) for trying reclamation of cold pages. This can be
+used for limiting CPU consumption of DAMON_RECLAIM. If the value is zero, the
+limit is disabled.
+
+10 ms by default.
+
+quota_sz
+--------
+
+Limit of size of memory for the reclamation in bytes.
+
+DAMON_RECLAIM charges the amount of memory which it tried to reclaim within a
+time window (quota_reset_interval_ms) and ensures that no more than this
+limit is tried. This can be used for limiting consumption of CPU and IO. If
+this value is zero, the limit is disabled.
+
+128 MiB by default.
+
+quota_reset_interval_ms
+-----------------------
+
+The time/size quota charge reset interval in milliseconds.
+
+The charge reset interval for the quota of time (quota_ms) and size
+(quota_sz). That is, DAMON_RECLAIM does not try reclamation for more than
+quota_ms milliseconds or quota_sz bytes within quota_reset_interval_ms
+milliseconds.
+
+1 second by default.
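+
+For example, with the defaults above (``quota_ms`` of 10, ``quota_sz`` of 128
+MiB, and ``quota_reset_interval_ms`` of 1000), DAMON_RECLAIM spends at most
+10 milliseconds of time and pages out at most 128 MiB within every one second
+window, so its paging out CPU consumption is roughly capped at 1% of a single
+CPU's time.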
+
+wmarks_interval
+---------------
+
+Minimal time to wait before checking the watermarks, when DAMON_RECLAIM is
+enabled but inactive due to its watermarks rule.
+
+wmarks_high
+-----------
+
+Free memory rate (per thousand) for the high watermark.
+
+If the system's free memory in bytes per thousand bytes is higher than this,
+DAMON_RECLAIM becomes inactive, so it does nothing but periodically check the
+watermarks.
+
+wmarks_mid
+----------
+
+Free memory rate (per thousand) for the middle watermark.
+
+If the system's free memory in bytes per thousand bytes is between this and
+the low watermark, DAMON_RECLAIM becomes active, so it starts the monitoring
+and the reclaiming.
+
+wmarks_low
+----------
+
+Free memory rate (per thousand) for the low watermark.
+
+If the system's free memory in bytes per thousand bytes is lower than this,
+DAMON_RECLAIM becomes inactive, so it does nothing but periodically check the
+watermarks. In this case, the system falls back to the LRU-list based page
+granularity reclamation logic.
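+
+For reference, the below rough sketch approximates the free memory rate (in
+per-thousand) that these watermarks are compared against::
+
+ # awk '/MemTotal/ {t=$2} /MemFree/ {f=$2} END {print int(f * 1000 / t)}' /proc/meminfo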
+
+sample_interval
+---------------
+
+Sampling interval for the monitoring in microseconds.
+
+The sampling interval of DAMON for the cold memory monitoring. Please refer to
+the DAMON documentation (:doc:`usage`) for more detail.
+
+aggr_interval
+-------------
+
+Aggregation interval for the monitoring in microseconds.
+
+The aggregation interval of DAMON for the cold memory monitoring. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+min_nr_regions
+--------------
+
+Minimum number of monitoring regions.
+
+The minimal number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set a lower-bound of the monitoring quality.
+But, setting this too high could result in increased monitoring overhead.
+Please refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+max_nr_regions
+--------------
+
+Maximum number of monitoring regions.
+
+The maximum number of monitoring regions of DAMON for the cold memory
+monitoring. This can be used to set an upper-bound of the monitoring overhead.
+However, setting this too low could result in bad monitoring quality. Please
+refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+monitor_region_start
+--------------------
+
+Start of the target memory region, as a physical address.
+
+The start physical address of the memory region that DAMON_RECLAIM will work
+against. That is, DAMON_RECLAIM will find cold memory regions in this region
+and reclaim them. By default, the biggest System RAM region is used.
+
+monitor_region_end
+------------------
+
+End of the target memory region, as a physical address.
+
+The end physical address of the memory region that DAMON_RECLAIM will work
+against. That is, DAMON_RECLAIM will find cold memory regions in this region
+and reclaim them. By default, the biggest System RAM region is used.
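+
+For example, to restrict DAMON_RECLAIM to a specific physical address range
+(say, the memory of one NUMA node), you could find candidate ranges as below
+and write the boundaries to the two files; the addresses here are
+hypothetical::
+
+ # grep 'System RAM' /proc/iomem
+ 100000000-41fffffff : System RAM
+ # echo 0x100000000 > /sys/module/damon_reclaim/parameters/monitor_region_start
+ # echo 0x420000000 > /sys/module/damon_reclaim/parameters/monitor_region_end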
+
+skip_anon
+---------
+
+Skip anonymous pages reclamation.
+
+If this parameter is set as ``Y``, DAMON_RECLAIM does not reclaim anonymous
+pages. By default, ``N``.
+
+
+kdamond_pid
+-----------
+
+PID of the DAMON thread.
+
+If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread. Else,
+-1.
+
+nr_reclaim_tried_regions
+------------------------
+
+Number of memory regions that DAMON_RECLAIM has tried to reclaim.
+
+bytes_reclaim_tried_regions
+---------------------------
+
+Total bytes of memory regions that DAMON_RECLAIM has tried to reclaim.
+
+nr_reclaimed_regions
+--------------------
+
+Number of memory regions that DAMON_RECLAIM has successfully reclaimed.
+
+bytes_reclaimed_regions
+-----------------------
+
+Total bytes of memory regions that DAMON_RECLAIM has successfully reclaimed.
+
+nr_quota_exceeds
+----------------
+
+Number of times that the time/space quota limits have been exceeded.
+
+Example
+=======
+
+The below runtime example commands make DAMON_RECLAIM find memory regions
+that have not been accessed for 30 seconds or more and page them out. The
+reclamation is limited to be done only up to 1 GiB per second to avoid
+DAMON_RECLAIM consuming too much CPU time for the paging out operation. It
+also asks DAMON_RECLAIM to do nothing if the system's free memory rate is
+more than 50%, but to start the real work if it becomes lower than 40%. If
+DAMON_RECLAIM doesn't make progress and therefore the free memory rate
+becomes lower than 20%, it asks DAMON_RECLAIM to do nothing again, so that we
+can fall back to the LRU-list based page granularity reclamation. ::
+
+ # cd /sys/module/damon_reclaim/parameters
+ # echo 30000000 > min_age
+ # echo $((1 * 1024 * 1024 * 1024)) > quota_sz
+ # echo 1000 > quota_reset_interval_ms
+ # echo 500 > wmarks_high
+ # echo 400 > wmarks_mid
+ # echo 200 > wmarks_low
+ # echo Y > enabled
+
+.. [1] https://research.google/pubs/pub48551/
+.. [2] https://lwn.net/Articles/787611/
+.. [3] https://www.kernel.org/doc/html/latest/mm/free_page_reporting.html
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst
index d5eb89a8fc38..9f88afc734da 100644
--- a/Documentation/admin-guide/mm/damon/start.rst
+++ b/Documentation/admin-guide/mm/damon/start.rst
@@ -6,39 +6,9 @@ Getting Started
This document briefly describes how you can use DAMON by demonstrating its
default user space tool. Please note that this document describes only a part
-of its features for brevity. Please refer to :doc:`usage` for more details.
-
-
-TL; DR
-======
-
-Follow the commands below to monitor and visualize the memory access pattern of
-your workload. ::
-
- # # build the kernel with CONFIG_DAMON_*=y, install it, and reboot
- # mount -t debugfs none /sys/kernel/debug/
- # git clone https://github.com/awslabs/damo
- # ./damo/damo record $(pidof <your workload>)
- # ./damo/damo report heat --plot_ascii
-
-The final command draws the access heatmap of ``<your workload>``. The heatmap
-shows which memory region (x-axis) is accessed when (y-axis) and how frequently
-(number; the higher the more accesses have been observed). ::
-
- 111111111111111111111111111111111111111111111111111111110000
- 111121111111111111111111111111211111111111111111111111110000
- 000000000000000000000000000000000000000000000000001555552000
- 000000000000000000000000000000000000000000000222223555552000
- 000000000000000000000000000000000000000011111677775000000000
- 000000000000000000000000000000000000000488888000000000000000
- 000000000000000000000000000000000177888400000000000000000000
- 000000000000000000000000000046666522222100000000000000000000
- 000000000000000000000014444344444300000000000000000000000000
- 000000000000000002222245555510000000000000000000000000000000
- # access_frequency: 0 1 2 3 4 5 6 7 8 9
- # x-axis: space (140286319947776-140286426374096: 101.496 MiB)
- # y-axis: time (605442256436361-605479951866441: 37.695430s)
- # resolution: 60x10 (1.692 MiB and 3.770s for each character)
+of its features for brevity. Please refer to the usage `doc
+<https://github.com/awslabs/damo/blob/next/USAGE.md>`_ of the tool for more
+details.
Prerequisites
@@ -59,16 +29,9 @@ called DAMON Operator (DAMO). It is available at
https://github.com/awslabs/damo. The examples below assume that ``damo`` is on
your ``$PATH``. It's not mandatory, though.
-Because DAMO is using the debugfs interface (refer to :doc:`usage` for the
-detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as
-below::
-
- # mount -t debugfs none /sys/kernel/debug/
-
-or append the following line to your ``/etc/fstab`` file so that your system
-can automatically mount debugfs upon booting::
-
- debugfs /sys/kernel/debug debugfs defaults 0 0
+Because DAMO is using the sysfs interface (refer to :doc:`usage` for the
+detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is
+mounted.
Recording Data Access Patterns
@@ -91,24 +54,74 @@ pattern in the ``damon.data`` file.
Visualizing Recorded Patterns
=============================
-The following three commands visualize the recorded access patterns and save
-the results as separate image files. ::
-
- $ damo report heats --heatmap access_pattern_heatmap.png
- $ damo report wss --range 0 101 1 --plot wss_dist.png
- $ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
-
-- ``access_pattern_heatmap.png`` will visualize the data access pattern in a
- heatmap, showing which memory region (y-axis) got accessed when (x-axis)
- and how frequently (color).
-- ``wss_dist.png`` will show the distribution of the working set size.
-- ``wss_chron_change.png`` will show how the working set size has
- chronologically changed.
-
-You can view the visualizations of this example workload at [1]_.
-Visualizations of other realistic workloads are available at [2]_ [3]_ [4]_.
-
-.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
-.. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
-.. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
-.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
+You can visualize the pattern in a heatmap, showing which memory region
+(x-axis) got accessed when (y-axis) and how frequently (number)::
+
+ $ sudo damo report heats --heatmap stdout
+ 22222222222222222222222222222222222222211111111111111111111111111111111111111100
+ 44444444444444444444444444444444444444434444444444444444444444444444444444443200
+ 44444444444444444444444444444444444444433444444444444444444444444444444444444200
+ 33333333333333333333333333333333333333344555555555555555555555555555555555555200
+ 33333333333333333333333333333333333344444444444444444444444444444444444444444200
+ 22222222222222222222222222222222222223355555555555555555555555555555555555555200
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ 33333333333333333333333333333333333333355555555555555555555555555555555555555200
+ 88888888888888888888888888888888888888600000000000000000000000000000000000000000
+ 88888888888888888888888888888888888888600000000000000000000000000000000000000000
+ 33333333333333333333333333333333333333444444444444444444444444444444444444443200
+ 00000000000000000000000000000000000000288888888888888888888888888888888888888400
+ [...]
+ # access_frequency: 0 1 2 3 4 5 6 7 8 9
+ # x-axis: space (139728247021568-139728453431248: 196.848 MiB)
+ # y-axis: time (15256597248362-15326899978162: 1 m 10.303 s)
+ # resolution: 80x40 (2.461 MiB and 1.758 s for each character)
+
+You can also visualize the distribution of the working set size, sorted by
+the size::
+
+ $ sudo damo report wss --range 0 101 10
+ # <percentile> <wss>
+ # target_id 18446632103789443072
+ # avr: 107.708 MiB
+ 0 0 B | |
+ 10 95.328 MiB |**************************** |
+ 20 95.332 MiB |**************************** |
+ 30 95.340 MiB |**************************** |
+ 40 95.387 MiB |**************************** |
+ 50 95.387 MiB |**************************** |
+ 60 95.398 MiB |**************************** |
+ 70 95.398 MiB |**************************** |
+ 80 95.504 MiB |**************************** |
+ 90 190.703 MiB |********************************************************* |
+ 100 196.875 MiB |***********************************************************|
+
+Using the ``--sortby`` option with the above command, you can show how the
+working set size has chronologically changed::
+
+ $ sudo damo report wss --range 0 101 10 --sortby time
+ # <percentile> <wss>
+ # target_id 18446632103789443072
+ # avr: 107.708 MiB
+ 0 3.051 MiB | |
+ 10 190.703 MiB |***********************************************************|
+ 20 95.336 MiB |***************************** |
+ 30 95.328 MiB |***************************** |
+ 40 95.387 MiB |***************************** |
+ 50 95.332 MiB |***************************** |
+ 60 95.320 MiB |***************************** |
+ 70 95.398 MiB |***************************** |
+ 80 95.398 MiB |***************************** |
+ 90 95.340 MiB |***************************** |
+ 100 95.398 MiB |***************************** |
+
+
+Data Access Pattern Aware Memory Management
+===========================================
+
+The below three commands make every memory region of size >=4K that hasn't
+been accessed for >=60 seconds in your workload be swapped out. ::
+
+ $ echo "#min-size max-size min-acc max-acc min-age max-age action" > test_scheme
+ $ echo "4K max 0 0 60s max pageout" >> test_scheme
+ $ damo schemes -c test_scheme <pid of your workload>
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index a72cda374aba..9b823fec974d 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -4,47 +4,515 @@
Detailed Usages
===============
-DAMON provides below three interfaces for different users.
+DAMON provides the below interfaces for different users.
- *DAMON user space tool.*
- This is for privileged people such as system administrators who want a
- just-working human-friendly interface. Using this, users can use the DAMON’s
- major features in a human-friendly way. It may not be highly tuned for
- special cases, though. It supports only virtual address spaces monitoring.
-- *debugfs interface.*
- This is for privileged user space programmers who want more optimized use of
- DAMON. Using this, users can use DAMON’s major features by reading
- from and writing to special debugfs files. Therefore, you can write and use
- your personalized DAMON debugfs wrapper programs that reads/writes the
- debugfs files instead of you. The DAMON user space tool is also a reference
- implementation of such programs. It supports only virtual address spaces
- monitoring.
+ `This <https://github.com/awslabs/damo>`_ is for privileged people such as
+ system administrators who want a just-working human-friendly interface.
+ Using this, users can use DAMON’s major features in a human-friendly way.
+ It may not be highly tuned for special cases, though. It supports both
+ virtual and physical address spaces monitoring. For more detail, please
+ refer to its `usage document
+ <https://github.com/awslabs/damo/blob/next/USAGE.md>`_.
+- *sysfs interface.*
+ :ref:`This <sysfs_interface>` is for privileged user space programmers who
+ want more optimized use of DAMON. Using this, users can use DAMON’s major
+ features by reading from and writing to special sysfs files. Therefore,
+ you can write and use your personalized DAMON sysfs wrapper programs that
+ read/write the sysfs files instead of you. The `DAMON user space tool
+ <https://github.com/awslabs/damo>`_ is one example of such programs. It
+ supports both virtual and physical address spaces monitoring. Note that this
+ interface provides only simple :ref:`statistics <damos_stats>` for the
+ monitoring results. For detailed monitoring results, DAMON provides a
+ :ref:`tracepoint <tracepoint>`.
+- *debugfs interface. (DEPRECATED!)*
+ :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
+ <sysfs_interface>`. This is deprecated, so users should move to the
+ :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
+ move, please report your use case to damon@lists.linux.dev and
+ linux-mm@kvack.org.
- *Kernel Space Programming Interface.*
- This is for kernel space programmers. Using this, users can utilize every
- feature of DAMON most flexibly and efficiently by writing kernel space
- DAMON application programs for you. You can even extend DAMON for various
- address spaces.
+ :doc:`This </mm/damon/api>` is for kernel space programmers. Using this,
+ users can utilize every feature of DAMON most flexibly and efficiently by
+ writing kernel space DAMON application programs for themselves. You can even extend
+ DAMON for various address spaces. For detail, please refer to the interface
+ :doc:`document </mm/damon/api>`.
-Nevertheless, you could write your own user space tool using the debugfs
-interface. A reference implementation is available at
-https://github.com/awslabs/damo. If you are a kernel programmer, you could
-refer to :doc:`/vm/damon/api` for the kernel space programming interface. For
-the reason, this document describes only the debugfs interface
+.. _sysfs_interface:
-debugfs Interface
-=================
+sysfs Interface
+===============
+
+DAMON sysfs interface is built when ``CONFIG_DAMON_SYSFS`` is defined. It
+creates multiple directories and files under its sysfs directory,
+``<sysfs>/kernel/mm/damon/``. You can control DAMON by writing to and reading
+from the files under the directory.
+
+For a short example, users can monitor the virtual address space of a given
+workload as below. ::
+
+ # cd /sys/kernel/mm/damon/admin/
+ # echo 1 > kdamonds/nr_kdamonds && echo 1 > kdamonds/0/contexts/nr_contexts
+ # echo vaddr > kdamonds/0/contexts/0/operations
+ # echo 1 > kdamonds/0/contexts/0/targets/nr_targets
+ # echo $(pidof <workload>) > kdamonds/0/contexts/0/targets/0/pid_target
+ # echo on > kdamonds/0/state
+
+Files Hierarchy
+---------------
+
+The files hierarchy of the DAMON sysfs interface is shown below. In the
+below figure, parent-child relations are represented with indentations, each
+directory has a ``/`` suffix, and files in each directory are separated by
+commas (","). ::
+
+ /sys/kernel/mm/damon/admin
+ │ kdamonds/nr_kdamonds
+ │ │ 0/state,pid
+ │ │ │ contexts/nr_contexts
+ │ │ │ │ 0/avail_operations,operations
+ │ │ │ │ │ monitoring_attrs/
+ │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us
+ │ │ │ │ │ │ nr_regions/min,max
+ │ │ │ │ │ targets/nr_targets
+ │ │ │ │ │ │ 0/pid_target
+ │ │ │ │ │ │ │ regions/nr_regions
+ │ │ │ │ │ │ │ │ 0/start,end
+ │ │ │ │ │ │ │ │ ...
+ │ │ │ │ │ │ ...
+ │ │ │ │ │ schemes/nr_schemes
+ │ │ │ │ │ │ 0/action
+ │ │ │ │ │ │ │ access_pattern/
+ │ │ │ │ │ │ │ │ sz/min,max
+ │ │ │ │ │ │ │ │ nr_accesses/min,max
+ │ │ │ │ │ │ │ │ age/min,max
+ │ │ │ │ │ │ │ quotas/ms,bytes,reset_interval_ms
+ │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
+ │ │ │ │ │ │ │ watermarks/metric,interval_us,high,mid,low
+ │ │ │ │ │ │ │ filters/nr_filters
+ │ │ │ │ │ │ │ │ 0/type,matching,memcg_path
+ │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
+ │ │ │ │ │ │ │ tried_regions/
+ │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
+ │ │ │ │ │ │ │ │ ...
+ │ │ │ │ │ │ ...
+ │ │ │ │ ...
+ │ │ ...
+
+Root
+----
+
+The root of the DAMON sysfs interface is ``<sysfs>/kernel/mm/damon/``, and it
+has one directory named ``admin``. The directory contains the files for
+privileged user space programs' control of DAMON. User space tools or daemons
+having the root permission could use this directory.
+
+kdamonds/
+---------
+
+The monitoring-related information including request specifications and
+results is called a DAMON context. DAMON executes each context with a kernel
+thread called kdamond, and multiple kdamonds could run in parallel.
+
+Under the ``admin`` directory, one directory, ``kdamonds``, which has files
+for controlling the kdamonds, exists. In the beginning, this directory has
+only one file, ``nr_kdamonds``. Writing a number (``N``) to the file creates
+that number of child directories named ``0`` to ``N-1``. Each directory
+represents a kdamond.
+
+kdamonds/<N>/
+-------------
+
+In each kdamond directory, two files (``state`` and ``pid``) and one directory
+(``contexts``) exist.
+
+Reading ``state`` returns ``on`` if the kdamond is currently running, or
+``off`` if it is not running. Writing ``on`` or ``off`` makes the kdamond be
+in the state. Writing ``commit`` to the ``state`` file makes the kdamond
+read the user inputs in the sysfs files except the ``state`` file again.
+Writing ``update_schemes_stats`` to the ``state`` file updates the contents
+of stats files for each DAMON-based operation scheme of the kdamond. For
+details of the stats, please refer to the :ref:`stats section
+<sysfs_schemes_stats>`. Writing ``update_schemes_tried_regions`` to the
+``state`` file updates the DAMON-based operation scheme action tried regions
+directory for each DAMON-based operation scheme of the kdamond. Writing
+``clear_schemes_tried_regions`` to the ``state`` file clears the DAMON-based
+operation scheme action tried regions directory for each DAMON-based
+operation scheme of the kdamond. For details of the DAMON-based operation
+scheme action tried regions directory, please refer to the
+:ref:`tried_regions section <sysfs_schemes_tried_regions>`.
+
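+For example, the below (run from the ``admin`` directory, and assuming a
+kdamond directory ``0`` already exists) turns the kdamond on and later asks
+it to re-read updated inputs::
+
+ # echo on > kdamonds/0/state
+ # echo commit > kdamonds/0/state
+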
+If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
+
+``contexts`` directory contains files for controlling the monitoring contexts
+that this kdamond will execute.
+
+kdamonds/<N>/contexts/
+----------------------
+
+In the beginning, this directory has only one file, ``nr_contexts``. Writing
+a number (``N``) to the file creates that number of child directories named
+``0`` to ``N-1``. Each directory represents a monitoring context. At the
+moment, only one context per kdamond is supported, so only ``0`` or ``1`` can
+be written to the file.
+
+.. _sysfs_contexts:
+
+contexts/<N>/
+-------------
+
+In each context directory, two files (``avail_operations`` and ``operations``)
+and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
+exist.
+
+DAMON supports multiple types of monitoring operations, including those for
+the virtual address space and the physical address space. You can get the
+list of available monitoring operations sets on the currently running kernel
+by reading the ``avail_operations`` file. Based on the kernel configuration,
+the file will list some or all of the below keywords.
+
+ - vaddr: Monitor virtual address spaces of specific processes
+ - fvaddr: Monitor fixed virtual address ranges
+ - paddr: Monitor the physical address space of the system
+
+Please refer to :ref:`regions sysfs directory <sysfs_regions>` for detailed
+differences between the operations sets in terms of the monitoring target
+regions.
+
+You can set what type of monitoring operations DAMON will use for the context
+by writing one of the keywords listed in the ``avail_operations`` file to the
+``operations`` file, and get the current setting by reading from the
+``operations`` file.
+
+.. _sysfs_monitoring_attrs:
+
+contexts/<N>/monitoring_attrs/
+------------------------------
+
+Files for specifying attributes of the monitoring, including the required
+quality and efficiency of the monitoring, are in the ``monitoring_attrs``
+directory. Specifically, two directories, ``intervals`` and ``nr_regions``,
+exist in this directory.
+
+Under ``intervals`` directory, three files for DAMON's sampling interval
+(``sample_us``), aggregation interval (``aggr_us``), and update interval
+(``update_us``) exist. You can set and get the values in microseconds by
+writing to and reading from the files.
+
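+For example, the below (run from a context directory) sets a 5 ms sampling
+interval and a 100 ms aggregation interval; the values are hypothetical, for
+illustration only::
+
+ # echo 5000 > monitoring_attrs/intervals/sample_us
+ # echo 100000 > monitoring_attrs/intervals/aggr_us
+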
+Under the ``nr_regions`` directory, two files for the lower-bound and
+upper-bound of DAMON's monitoring regions (``min`` and ``max``,
+respectively), which control the monitoring overhead, exist. You can set and
+get the values by writing to and reading from the files.
+
+For more details about the intervals and monitoring regions range, please refer
+to the Design document (:doc:`/mm/damon/design`).
+
+contexts/<N>/targets/
+---------------------
+
+In the beginning, this directory has only one file, ``nr_targets``. Writing a
+number (``N``) to the file creates that number of child directories named
+``0`` to ``N-1``. Each directory represents a monitoring target.
+
+targets/<N>/
+------------
+
+In each target directory, one file (``pid_target``) and one directory
+(``regions``) exist.
+
+If you wrote ``vaddr`` to the ``contexts/<N>/operations`` file, each target
+should be a process. You can specify the process to DAMON by writing the pid
+of the process to the ``pid_target`` file.
+
+.. _sysfs_regions:
+
+targets/<N>/regions
+-------------------
+
+When the ``vaddr`` monitoring operations set is being used (``vaddr`` is
+written to the ``contexts/<N>/operations`` file), DAMON automatically sets
+and updates the monitoring target regions so that entire memory mappings of
+the target processes can be covered. However, users may want to set the
+initial monitoring region to specific address ranges.
+
+In contrast, DAMON does not automatically set and update the monitoring
+target regions when the ``fvaddr`` or ``paddr`` monitoring operations sets
+are being used (``fvaddr`` or ``paddr`` have been written to the
+``contexts/<N>/operations`` file). Therefore, users should set the monitoring
+target regions by themselves in these cases.
+
+For such cases, users can explicitly set the initial monitoring target regions
+as they want, by writing proper values to the files under this directory.
+
+In the beginning, this directory has only one file, ``nr_regions``. Writing a
+number (``N``) to the file creates that number of child directories named
+``0`` to ``N-1``. Each directory represents an initial monitoring target
+region.
+
+regions/<N>/
+------------
+
+In each region directory, you will find two files (``start`` and ``end``). You
+can set and get the start and end addresses of the initial monitoring target
+region by writing to and reading from the files, respectively.
+
+Each region should not overlap with others. ``end`` of directory ``N`` should
+be equal to or smaller than ``start`` of directory ``N+1``.
+
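+For example, the below sketch (run from the ``admin`` directory, with
+hypothetical addresses) sets a single initial monitoring target region
+covering 1 GiB of the address space::
+
+ # cd kdamonds/0/contexts/0/targets/0/regions
+ # echo 1 > nr_regions
+ # echo $((1024*1024*1024)) > 0/start
+ # echo $((2*1024*1024*1024)) > 0/end
+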
+contexts/<N>/schemes/
+---------------------
+
+For usual DAMON-based data access aware memory management optimizations, users
+would normally want the system to apply a memory management action to a memory
+region of a specific access pattern. DAMON receives such formalized operation
+schemes from the user and applies those to the target memory regions. Users
+can get and set the schemes by reading from and writing to files under this
+directory.
+
+In the beginning, this directory has only one file, ``nr_schemes``. Writing a
+number (``N``) to the file creates that number of child directories named
+``0`` to ``N-1``. Each directory represents a DAMON-based operation scheme.
+
+schemes/<N>/
+------------
+
+In each scheme directory, six directories (``access_pattern``, ``quotas``,
+``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and one file
+(``action``) exist.
+
+The ``action`` file is for setting and getting what action you want to apply
+to memory regions having the specific access pattern of interest. The
+keywords that can be written to and read from the file and their meaning are
+as below.
+
+Note that support of each action depends on the running DAMON operations set
+:ref:`implementation <sysfs_contexts>`.
+
+ - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
+ Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``.
+ Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
+ Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
+ - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``.
+ Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``.
+ Supported by ``vaddr`` and ``fvaddr`` operations set.
+ - ``lru_prio``: Prioritize the region on its LRU lists.
+ Supported by ``paddr`` operations set.
+ - ``lru_deprio``: Deprioritize the region on its LRU lists.
+ Supported by ``paddr`` operations set.
+ - ``stat``: Do nothing but count the statistics.
+ Supported by all operations sets.
+
+schemes/<N>/access_pattern/
+---------------------------
+
+The target access pattern of each DAMON-based operation scheme is constructed
+with three ranges including the size of the region in bytes, number of
+monitored accesses per aggregate interval, and number of aggregated intervals
+for the age of the region.
+
+Under the ``access_pattern`` directory, three directories (``sz``,
+``nr_accesses``, and ``age``) each having two files (``min`` and ``max``)
+exist. You can set and get the access pattern for the given scheme by writing
+to and reading from the ``min`` and ``max`` files under ``sz``,
+``nr_accesses``, and ``age`` directories, respectively.
+
+schemes/<N>/quotas/
+-------------------
+
+The optimal ``target access pattern`` for each ``action`` is workload
+dependent, so it is not easy to find. Worse yet, setting a scheme of some
+action too aggressively can cause severe overhead. To avoid such overhead,
+users can limit the time and
+size quota for each scheme. In detail, users can ask DAMON to try to use only
+up to specific time (``time quota``) for applying the action, and to apply the
+action to only up to specific amount (``size quota``) of memory regions having
+the target access pattern within a given time interval (``reset interval``).
+
+When the quota limit is expected to be exceeded, DAMON prioritizes found memory
+regions of the ``target access pattern`` based on their size, access frequency,
+and age. For personalized prioritization, users can set the weights for the
+three properties.
+
+Under the ``quotas`` directory, three files (``ms``, ``bytes``,
+``reset_interval_ms``) and one directory (``weights``), which has three files
+(``sz_permil``, ``nr_accesses_permil``, and ``age_permil``) in it, exist.
+
+You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
+``reset interval`` in milliseconds by writing the values to the three files,
+respectively. You can also set the prioritization weights for size, access
+frequency, and age in per-thousand unit by writing the values to the three
+files under the ``weights`` directory.
+
+schemes/<N>/watermarks/
+-----------------------
+
+To allow easy activation and deactivation of each scheme based on system
+status, DAMON provides a feature called watermarks. The feature receives five
+values called ``metric``, ``interval``, ``high``, ``mid``, and ``low``. The
+``metric`` is the system metric such as free memory ratio that can be measured.
+If the metric value of the system is higher than the value in ``high`` or lower
+than ``low`` at the moment, the scheme is deactivated. If the value is lower
+than ``mid``, the scheme is activated.
+
+Under the watermarks directory, five files (``metric``, ``interval_us``,
+``high``, ``mid``, and ``low``) for setting each value exist. You can set
+and get the five values by writing to and reading from the files,
+respectively.
+
+Keywords and meanings of those that can be written to the ``metric`` file are
+as below.
+
+ - none: Ignore the watermarks
+ - free_mem_rate: System's free memory rate (per thousand)
+
+The ``interval`` should be written in microseconds.
+
+schemes/<N>/filters/
+--------------------
+
+Users could know something more than the kernel about specific types of
+memory. In that case, users could do their own management of the memory and
+hence don't want DAMOS to bother it. Users could limit DAMOS by setting the
+access pattern of the scheme and/or the monitoring regions for the purpose,
+but that can be inefficient in some cases. In such cases, users could set
+non-access pattern driven filters using files in this directory.
+
+In the beginning, this directory has only one file, ``nr_filters``. Writing a
+number (``N``) to the file creates that number of child directories named
+``0`` to ``N-1``. Each directory represents a filter. The filters are
+evaluated in the numeric order.
+
+Each filter directory contains three files, namely ``type``, ``matching``,
+and ``memcg_path``. You can write one of two special keywords to the
+``type`` file: ``anon`` for anonymous pages, or ``memcg`` for specific memory
+cgroup filtering. In case of the memory cgroup filtering, you can specify
+the memory cgroup of interest by writing the path of the memory cgroup from
+the cgroups mount point to the ``memcg_path`` file. You can write ``Y`` or
+``N`` to the ``matching`` file to filter out pages that do or do not match
+the type, respectively. Then, the scheme's action will not be applied to the
+pages that are specified to be filtered out.
+
+For example, the below restricts a DAMOS action to be applied only to
+non-anonymous pages of all memory cgroups except ``/having_care_already``::
+
+ # echo 2 > nr_filters
+ # # filter out anonymous pages
+ # echo anon > 0/type
+ # echo Y > 0/matching
+ # # further filter out all cgroups except one at '/having_care_already'
+ # echo memcg > 1/type
+ # echo /having_care_already > 1/memcg_path
+ # echo N > 1/matching
+
+Note that filters are currently supported only when the ``paddr``
+:ref:`implementation <sysfs_contexts>` is being used.
+
+.. _sysfs_schemes_stats:
+
+schemes/<N>/stats/
+------------------
+
+DAMON counts the total number and bytes of regions that each scheme has tried
+to be applied to, the two numbers for the regions that each scheme has
+successfully been applied to, and the total number of quota limit exceeds.
+These statistics can be used for online analysis or tuning of the schemes.
+
+The statistics can be retrieved by reading the files under the ``stats``
+directory (``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, and
+``qt_exceeds``), respectively. The files are not updated in real time, so
+you should ask the DAMON sysfs interface to update the content of the files
+for the stats by writing a special keyword, ``update_schemes_stats``, to the
+relevant ``kdamonds/<N>/state`` file.
+
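+For example, the below (run from the ``admin`` directory, and assuming the
+scheme directory ``0`` exists) updates and reads the number of regions the
+first scheme has tried to be applied to::
+
+ # echo update_schemes_stats > kdamonds/0/state
+ # cat kdamonds/0/contexts/0/schemes/0/stats/nr_tried
+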
+.. _sysfs_schemes_tried_regions:
+
+schemes/<N>/tried_regions/
+--------------------------
-DAMON exports three files, ``attrs``, ``target_ids``, and ``monitor_on`` under
-its debugfs directory, ``<debugfs>/damon/``.
+When a special keyword, ``update_schemes_tried_regions``, is written to the
+relevant ``kdamonds/<N>/state`` file, DAMON creates directories named with
+integers starting from ``0`` under this directory. Each directory contains
+files exposing detailed information about each of the memory regions that the
+corresponding scheme's ``action`` has tried to be applied to during the next
+:ref:`aggregation interval <sysfs_monitoring_attrs>`. The information
+includes the address range, ``nr_accesses``, and ``age`` of the region.
+
+The directories will be removed when another special keyword,
+``clear_schemes_tried_regions``, is written to the relevant
+``kdamonds/<N>/state`` file.
+
+tried_regions/<N>/
+------------------
+
+In each region directory, you will find four files (``start``, ``end``,
+``nr_accesses``, and ``age``). Reading the files will show the start and end
+addresses, ``nr_accesses``, and ``age`` of the region that the corresponding
+DAMON-based operation scheme ``action`` has tried to be applied to.
+
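+For example, the below (run from the ``admin`` directory, and assuming the
+directories already exist) prints the address range of the first tried region
+of the first scheme::
+
+ # echo update_schemes_tried_regions > kdamonds/0/state
+ # cat kdamonds/0/contexts/0/schemes/0/tried_regions/0/start
+ # cat kdamonds/0/contexts/0/schemes/0/tried_regions/0/end
+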
+Example
+~~~~~~~
+
+The below commands apply a scheme saying "If a memory region of size in [4KiB,
+8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
+interval in [10, 20], page out the region. For the paging out, use only up to
+10ms per second, and also don't page out more than 1GiB per second. Under the
+limitation, page out memory regions having longer age first. Also, check the
+free memory rate of the system every 5 seconds, start the monitoring and paging
+out when the free memory rate becomes lower than 50%, but stop it if the free
+memory rate becomes larger than 60%, or lower than 30%". ::
+
+ # cd <sysfs>/kernel/mm/damon/admin
+ # # populate directories
+ # echo 1 > kdamonds/nr_kdamonds; echo 1 > kdamonds/0/contexts/nr_contexts;
+ # echo 1 > kdamonds/0/contexts/0/schemes/nr_schemes
+ # cd kdamonds/0/contexts/0/schemes/0
+ # # set the basic access pattern and the action
+ # echo 4096 > access_pattern/sz/min
+ # echo 8192 > access_pattern/sz/max
+ # echo 0 > access_pattern/nr_accesses/min
+ # echo 5 > access_pattern/nr_accesses/max
+ # echo 10 > access_pattern/age/min
+ # echo 20 > access_pattern/age/max
+ # echo pageout > action
+ # # set quotas
+ # echo 10 > quotas/ms
+ # echo $((1024*1024*1024)) > quotas/bytes
+ # echo 1000 > quotas/reset_interval_ms
+ # # set watermark
+ # echo free_mem_rate > watermarks/metric
+ # echo 5000000 > watermarks/interval_us
+ # echo 600 > watermarks/high
+ # echo 500 > watermarks/mid
+ # echo 300 > watermarks/low
+
+Please note that it's highly recommended to use user space tools like `damo
+<https://github.com/awslabs/damo>`_ rather than manually reading and writing
+the files as above. The above is only an example.
+
+.. _debugfs_interface:
+
+debugfs Interface (DEPRECATED!)
+===============================
+
+.. note::
+
+ THIS IS DEPRECATED!
+
+ DAMON debugfs interface is deprecated, so users should move to the
+ :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
+ move, please report your use case to damon@lists.linux.dev and
+ linux-mm@kvack.org.
+
+DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
+``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
+``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.
Attributes
----------
Users can get and set the ``sampling interval``, ``aggregation interval``,
-``regions update interval``, and min/max number of monitoring target regions by
+``update interval``, and min/max number of monitoring target regions by
reading from and writing to the ``attrs`` file. To know about the monitoring
-attributes in detail, please refer to the :doc:`/vm/damon/design`. For
+attributes in detail, please refer to the :doc:`/mm/damon/design`. For
example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and
1000, and then check it again::
@@ -71,9 +539,191 @@ check it again::
# cat target_ids
42 4242
+Users can also monitor the physical memory address space of the system by
+writing a special keyword, "``paddr\n``" to the file. Because physical address
+space monitoring doesn't support multiple targets, reading the file will show a
+fake value, ``42``, as below::
+
+ # cd <debugfs>/damon
+ # echo paddr > target_ids
+ # cat target_ids
+ 42
+
Note that setting the target ids doesn't start the monitoring.
+Initial Monitoring Target Regions
+---------------------------------
+
+In case of the virtual address space monitoring, DAMON automatically sets and
+updates the monitoring target regions so that entire memory mappings of the
+target processes can be covered. However, users may want to limit the
+monitoring region to specific address ranges, such as the heap, the stack, or
+specific file-mapped areas. Or, some users may know the initial access
+pattern of their workloads and therefore want to set optimal initial regions
+for the 'adaptive regions adjustment'.
+
+In contrast, DAMON does not automatically set and update the monitoring
+target regions in case of physical memory monitoring. Therefore, users
+should set the monitoring target regions by themselves.
+
+In such cases, users can explicitly set the initial monitoring target regions
+as they want, by writing proper values to the ``init_regions`` file. The
+input should be a sequence of three integers separated by white spaces that
+represent one region in the below form::
+
+ <target idx> <start address> <end address>
+
+The ``target idx`` should be the index of the target in the ``target_ids``
+file, starting from ``0``, and the regions should be passed in address order.
+For example, the below commands will set a couple of address ranges,
+``1-100`` and ``100-200``, as the initial monitoring target regions of pid
+42, which is the first one (index ``0``) in ``target_ids``, and another
+couple of address ranges, ``20-40`` and ``50-100``, as those of pid 4242,
+which is the second one (index ``1``) in ``target_ids``::
+
+ # cd <debugfs>/damon
+ # cat target_ids
+ 42 4242
+ # echo "0 1 100 \
+ 0 100 200 \
+ 1 20 40 \
+ 1 50 100" > init_regions
+
+Note that this sets the initial monitoring target regions only. In case of
+virtual memory monitoring, DAMON will automatically update the boundary of
+the regions after one ``update interval``. Therefore, users should set the
+``update interval`` large enough in this case, if they don't want the
+update.
+
+
+Schemes
+-------
+
+For usual DAMON-based data access aware memory management optimizations, users
+would simply want the system to apply a memory management action to a memory
+region of a specific access pattern. DAMON receives such formalized operation
+schemes from the user and applies those to the target processes.
+
+Users can get and set the schemes by reading from and writing to the
+``schemes`` debugfs file. Reading the file also shows the statistics of each
+scheme. To the file, each of the schemes should be represented in each line
+in the below form::
+
+ <target access pattern> <action> <quota> <watermarks>
+
+You can disable schemes by simply writing an empty string to the file.
+
+Target Access Pattern
+~~~~~~~~~~~~~~~~~~~~~
+
+The ``<target access pattern>`` is constructed with three ranges in below
+form::
+
+ min-size max-size min-acc max-acc min-age max-age
+
+Specifically, bytes for the size of regions (``min-size`` and ``max-size``),
+number of monitored accesses per aggregate interval for access frequency
+(``min-acc`` and ``max-acc``), number of aggregate intervals for the age of
+regions (``min-age`` and ``max-age``) are specified. Note that the ranges
+are closed intervals.
+
+Action
+~~~~~~
+
+The ``<action>`` is a predefined integer for memory management actions, which
+DAMON will apply to the regions having the target access pattern. The
+supported numbers and their meanings are as below.
+
+ - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``. Ignored if
+ ``target`` is ``paddr``.
+ - 1: Call ``madvise()`` for the region with ``MADV_COLD``. Ignored if
+ ``target`` is ``paddr``.
+ - 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
+ - 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. Ignored if
+ ``target`` is ``paddr``.
+ - 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. Ignored if
+ ``target`` is ``paddr``.
+ - 5: Do nothing but count the statistics
+
+Quota
+~~~~~
+
+The optimal ``target access pattern`` for each ``action`` is workload
+dependent, so it is not easy to find. Worse yet, setting a scheme of some
+action too aggressively can cause severe overhead. To avoid such overhead,
+users can limit the time and size quota for the scheme via the ``<quota>`` in
+the below form::
+
+ <ms> <sz> <reset interval> <priority weights>
+
+This makes DAMON try to use only up to ``<ms>`` milliseconds for applying
+the action to memory regions of the ``target access pattern`` within the
+``<reset interval>`` milliseconds, and to apply the action to only up to
+``<sz>`` bytes of memory regions within the ``<reset interval>``. Setting
+both ``<ms>`` and ``<sz>`` to zero disables the quota limits.
+
+When the quota limit is expected to be exceeded, DAMON prioritizes found memory
+regions of the ``target access pattern`` based on their size, access frequency,
+and age. For personalized prioritization, users can set the weights for the
+three properties in ``<priority weights>`` in below form::
+
+ <size weight> <access frequency weight> <age weight>
+
+Watermarks
+~~~~~~~~~~
+
+Some schemes would need to run based on the current value of the system's
+specific metrics like the free memory ratio. For such cases, users can
+specify watermarks for the condition::
+
+ <metric> <check interval> <high mark> <middle mark> <low mark>
+
+``<metric>`` is a predefined integer for the metric to be checked. The
+supported numbers and their meanings are as below.
+
+ - 0: Ignore the watermarks
+ - 1: System's free memory rate (per thousand)
+
+The value of the metric is checked every ``<check interval>`` microseconds.
+
+If the value is higher than ``<high mark>`` or lower than ``<low mark>``, the
+scheme is deactivated. If the value is lower than ``<middle mark>``, the scheme
+is activated.
+
+.. _damos_stats:
+
+Statistics
+~~~~~~~~~~
+
+It also counts the total number and bytes of regions that each scheme has
+tried to be applied to, the two numbers for the regions that each scheme has
+successfully been applied to, and the total number of quota limit exceeds.
+These statistics can be used for online analysis or tuning of the schemes.
+
+The statistics can be shown by reading the ``schemes`` file. Reading the file
+will show each scheme you entered in each line, and the five numbers for the
+statistics will be added at the end of each line.
+
+Example
+~~~~~~~
+
+The below commands apply a scheme saying "If a memory region of size in [4KiB,
+8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
+interval in [10, 20], page out the region. For the paging out, use only up to
+10ms per second, and also don't page out more than 1GiB per second. Under the
+limitation, page out memory regions having longer age first. Also, check the
+free memory rate of the system every 5 seconds, start the monitoring and paging
+out when the free memory rate becomes lower than 50%, but stop it if the free
+memory rate becomes larger than 60%, or lower than 30%". ::
+
+ # cd <debugfs>/damon
+ # scheme="4096 8192 0 5 10 20 2" # target access pattern and action
+ # scheme+=" 10 $((1024*1024*1024)) 1000" # quotas
+ # scheme+=" 0 0 100" # prioritization weights
+ # scheme+=" 1 5000000 600 500 300" # watermarks
+ # echo "$scheme" > schemes
+
+
Turning On/Off
--------------
@@ -96,6 +746,54 @@ the monitoring is turned on. If you write to the files while DAMON is running,
an error code such as ``-EBUSY`` will be returned.
+Monitoring Thread PID
+---------------------
+
+DAMON does requested monitoring with a kernel thread called ``kdamond``. You
+can get the pid of the thread by reading the ``kdamond_pid`` file. When the
+monitoring is turned off, reading the file returns ``none``. ::
+
+ # cd <debugfs>/damon
+ # cat monitor_on
+ off
+ # cat kdamond_pid
+ none
+ # echo on > monitor_on
+ # cat kdamond_pid
+ 18594
+
+
+Using Multiple Monitoring Threads
+---------------------------------
+
+One ``kdamond`` thread is created for each monitoring context. You can create
+and remove monitoring contexts for use cases requiring multiple ``kdamond``
+threads using the ``mk_contexts`` and ``rm_contexts`` files.
+
+Writing the name of the new context to the ``mk_contexts`` file creates a
+directory of the name on the DAMON debugfs directory. The directory will have
+DAMON debugfs files for the context. ::
+
+ # cd <debugfs>/damon
+ # ls foo
+ # ls: cannot access 'foo': No such file or directory
+ # echo foo > mk_contexts
+ # ls foo
+ # attrs init_regions kdamond_pid schemes target_ids
+
+If the context is not needed anymore, you can remove it and the corresponding
+directory by putting the name of the context to the ``rm_contexts`` file. ::
+
+ # echo foo > rm_contexts
+ # ls foo
+ # ls: cannot access 'foo': No such file or directory
+
+Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on`` files are in the
+root directory only.
+
+
+.. _tracepoint:
+
Tracepoint for Monitoring Results
=================================
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index 8abaeb144e44..e4d4b4a8dc97 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -1,5 +1,3 @@
-.. _hugetlbpage:
-
=============
HugeTLB Pages
=============
@@ -65,7 +63,7 @@ HugePages_Surp
may be temporarily larger than the maximum number of surplus huge
pages when the system is under memory pressure.
Hugepagesize
- is the default hugepage size (in Kb).
+ is the default hugepage size (in kB).
Hugetlb
is the total amount of memory (in kB), consumed by huge
pages of all sizes.
@@ -86,7 +84,7 @@ by increasing or decreasing the value of ``nr_hugepages``.
Note: When the feature of freeing unused vmemmap pages associated with each
hugetlb page is enabled, we can fail to free the huge pages triggered by
-the user when ths system is under memory pressure. Please try again later.
+the user when the system is under memory pressure. Please try again later.
Pages that are used as huge pages are reserved inside the kernel and cannot
be used for other purposes. Huge pages cannot be swapped out under
@@ -128,7 +126,9 @@ hugepages
implicitly specifies the number of huge pages of default size to
allocate. If the number of huge pages of default size is implicitly
specified, it can not be overwritten by a hugepagesz,hugepages
- parameter pair for the default size.
+ parameter pair for the default size. This parameter also has a
+ node format. The node format specifies the number of huge pages
+ to allocate on specific nodes.
For example, on an architecture with 2M default huge page size::
@@ -138,6 +138,14 @@ hugepages
indicating that the hugepages=512 parameter is ignored. If a hugepages
parameter is preceded by an invalid hugepagesz parameter, it will
be ignored.
+
+ Node format example::
+
+ hugepagesz=2M hugepages=0:1,1:2
+
+ It will allocate one 2M hugepage on node0 and two 2M hugepages on node1.
+ If the node number is invalid, the parameter will be ignored.
+
default_hugepagesz
Specify the default huge page size. This parameter can
only be specified once on the command line. default_hugepagesz can
@@ -154,8 +162,8 @@ default_hugepagesz
will all result in 256 2M huge pages being allocated. Valid default
huge page size is architecture dependent.
hugetlb_free_vmemmap
- When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this enables freeing
- unused vmemmap pages associated with each HugeTLB page.
+ When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HugeTLB
+ Vmemmap Optimization (HVO).
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
@@ -234,8 +242,12 @@ will exist, of the form::
hugepages-${size}kB
-Inside each of these directories, the same set of files will exist::
+Inside each of these directories, the same set of files as contained in
+``/proc`` will exist. In addition, two interfaces for demoting huge
+pages may exist::
+ demote
+ demote_size
nr_hugepages
nr_hugepages_mempolicy
nr_overcommit_hugepages
@@ -243,7 +255,29 @@ Inside each of these directories, the same set of files will exist::
resv_hugepages
surplus_hugepages
-which function as described above for the default huge page-sized case.
+The demote interfaces provide the ability to split a huge page into
+smaller huge pages. For example, the x86 architecture supports both
+1GB and 2MB huge page sizes. A 1GB huge page can be split into 512
+2MB huge pages. Demote interfaces are not available for the smallest
+huge page size. The demote interfaces are:
+
+demote_size
+ is the size of demoted pages. When a page is demoted, a corresponding
+ number of huge pages of demote_size will be created. By default,
+ demote_size is set to the next smaller huge page size. If there are
+ multiple smaller huge page sizes, demote_size can be set to any of
+ these smaller sizes. Only huge page sizes less than the current huge
+ pages size are allowed.
+
+demote
+ is used to demote a number of huge pages. A user with root privileges
+ can write to this file. It may not be possible to demote the
+ requested number of huge pages. To determine how many pages were
+ actually demoted, compare the value of nr_hugepages before and after
+ writing to the demote interface. demote is a write-only interface.
+
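+For example, the below sketch (assuming 1GB and 2MB huge page sizes are
+configured, as on many x86 systems) demotes two 1GB huge pages into 2MB huge
+pages::
+
+ # cd /sys/kernel/mm/hugepages/hugepages-1048576kB
+ # cat demote_size
+ 2048kB
+ # echo 2 > demote
+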
+The interfaces which are the same as in ``/proc`` (all except demote and
+demote_size) function as described above for the default huge page-sized case.
.. _mem_policy_and_hp_alloc:
@@ -277,7 +311,7 @@ memory policy mode--bind, preferred, local or interleave--may be used. The
resulting effect on persistent huge page allocation is as follows:
#. Regardless of mempolicy mode [see
- :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`],
+ Documentation/admin-guide/mm/numa_memory_policy.rst],
persistent huge pages will be distributed across the node or nodes
specified in the mempolicy as if "interleave" had been specified.
However, if a node in the policy does not contain sufficient contiguous
@@ -425,13 +459,13 @@ Examples
.. _map_hugetlb:
``map_hugetlb``
- see tools/testing/selftests/vm/map_hugetlb.c
+ see tools/testing/selftests/mm/map_hugetlb.c
``hugepage-shm``
- see tools/testing/selftests/vm/hugepage-shm.c
+ see tools/testing/selftests/mm/hugepage-shm.c
``hugepage-mmap``
- see tools/testing/selftests/vm/hugepage-mmap.c
+ see tools/testing/selftests/mm/hugepage-mmap.c
The `libhugetlbfs`_ library provides a wide range of userspace tools
to help with huge page usability, environment setup, and control.
diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
index df9394fb39c2..16fcf38dac56 100644
--- a/Documentation/admin-guide/mm/idle_page_tracking.rst
+++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
@@ -1,5 +1,3 @@
-.. _idle_page_tracking:
-
==================
Idle Page Tracking
==================
@@ -65,14 +63,13 @@ workload one should:
are not reclaimable, he or she can filter them out using
``/proc/kpageflags``.
-The page-types tool in the tools/vm directory can be used to assist in this.
+The page-types tool in the tools/mm directory can be used to assist in this.
If the tool is run initially with the appropriate option, it will mark all the
queried pages as idle. Subsequent runs of the tool can then show which pages have
their idle flag cleared in the interim.
-See :ref:`Documentation/admin-guide/mm/pagemap.rst <pagemap>` for more
-information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
-``/proc/kpagecgroup``.
+See Documentation/admin-guide/mm/pagemap.rst for more information about
+``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``.
.. _impl_details:
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index cbd19d5e625f..1f883abf3f00 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -16,8 +16,7 @@ are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_.
.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
Linux memory management has its own jargon and if you are not yet
-familiar with it, consider reading
-:ref:`Documentation/admin-guide/mm/concepts.rst <mm_concepts>`.
+familiar with it, consider reading Documentation/admin-guide/mm/concepts.rst.
Here we document in detail how to interact with various mechanisms in
the Linux memory management.
@@ -32,10 +31,14 @@ the Linux memory management.
idle_page_tracking
ksm
memory-hotplug
+ multigen_lru
nommu-mmap
numa_memory_policy
numaperf
pagemap
+ shrinker_debugfs
soft-dirty
+ swap_numa
transhuge
userfaultfd
+ zswap
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index 97d816791aca..7626392fe82c 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -1,5 +1,3 @@
-.. _admin_guide_ksm:
-
=======================
Kernel Samepage Merging
=======================
@@ -22,7 +20,7 @@ content which can be replaced by a single write-protected page (which
is automatically copied if a process later wants to update its
content). The amount of pages that KSM daemon scans in a single pass
and the time between the passes are configured using :ref:`sysfs
-intraface <ksm_sysfs>`
+interface <ksm_sysfs>`
KSM only merges anonymous (private) pages, never pagecache (file) pages.
KSM's merged pages were originally locked into kernel memory, but can now
@@ -159,6 +157,8 @@ stable_node_chains_prune_millisecs
The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:
+general_profit
+ how effective is KSM. The calculation is explained below.
pages_shared
how many shared pages are being used
pages_sharing
@@ -184,6 +184,61 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.
+Monitoring KSM profit
+=====================
+
+KSM can save memory by merging identical pages, but also can consume
+additional memory, because it needs to generate a number of rmap_items to
+save each scanned page's brief rmap information. Some of these pages may
+be merged, but some may never be merged even after being checked several
+times; such pages consume memory unprofitably.
+
+1) How to determine whether KSM saves or consumes memory system-wide?
+   Here is a simple approximate calculation for reference::
+
+ general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
+ sizeof(rmap_item);
+
+ where all_rmap_items can be easily obtained by summing ``pages_sharing``,
+ ``pages_shared``, ``pages_unshared`` and ``pages_volatile``.
+
+2) The KSM profit within a single process can be similarly obtained by the
+ following approximate calculation::
+
+ process_profit =~ ksm_merging_pages * sizeof(page) -
+ ksm_rmap_items * sizeof(rmap_item).
+
+ where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
+ and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. The process profit
+ is also shown in ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
+
+From the perspective of an application, a high ratio of ``ksm_rmap_items`` to
+``ksm_merging_pages`` means a badly chosen madvise policy, so developers or
+administrators have to rethink how to change the madvise policy. As an example
+for reference, a page's size is usually 4K, and the rmap_item's size is
+32B on a 32-bit CPU architecture and 64B on a 64-bit CPU architecture;
+so if the ``ksm_rmap_items/ksm_merging_pages`` ratio exceeds 64 on a 64-bit
+CPU or exceeds 128 on a 32-bit CPU, then the app's madvise policy should be
+dropped, because the KSM profit is approximately zero or negative.
+
+Monitoring KSM events
+=====================
+
+There are some counters in /proc/vmstat that may be used to monitor KSM events.
+KSM might help save memory, but the tradeoff is that it may suffer delays on
+KSM COW or on swap-in of a KSM copy. Those events could help users evaluate
+whether or how to use KSM. For example, if cow_ksm increases too fast, the
+user may decrease the range of madvise(, , MADV_MERGEABLE).
+
+cow_ksm
+	is incremented every time a KSM page triggers copy on write (COW);
+	when users try to write to a KSM page, we have to make a copy.
+
+ksm_swpin_copy
+	is incremented every time a KSM page is copied when swapping in;
+	note that a KSM page might be copied when swapping in because
+	do_swap_page() cannot do all the locking needed to reconstitute a
+	cross-anon_vma KSM page.
+
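+Both counters can be read from ``/proc/vmstat``, e.g.::
+
+	grep -e cow_ksm -e ksm_swpin_copy /proc/vmstat
+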
--
Izik Eidus,
Hugh Dickins, 17 Nov 2009
diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index 03dfbc925252..1b02fe5807cc 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -1,5 +1,3 @@
-.. _admin_guide_memory_hotplug:
-
==================
Memory Hot(Un)Plug
==================
@@ -165,9 +163,8 @@ Or alternatively::
% echo 1 > /sys/devices/system/memory/memoryXXX/online
-The kernel will select the target zone automatically, usually defaulting to
-``ZONE_NORMAL`` unless ``movablecore=1`` has been specified on the kernel
-command line or if the memory block would intersect the ZONE_MOVABLE already.
+The kernel will select the target zone automatically, depending on the
+configured ``online_policy``.
One can explicitly request to associate an offline memory block with
ZONE_MOVABLE by::
@@ -198,6 +195,9 @@ Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
% echo online > /sys/devices/system/memory/auto_online_blocks
+Similarly to manual onlining, with ``online`` the kernel will select the
+target zone automatically, depending on the configured ``online_policy``.
+
Modifying the auto-online behavior will only affect subsequently added
memory blocks.
@@ -393,11 +393,16 @@ command line parameters are relevant:
======================== =======================================================
``memhp_default_state`` configure auto-onlining by essentially setting
``/sys/devices/system/memory/auto_online_blocks``.
-``movablecore`` configure automatic zone selection of the kernel. When
- set, the kernel will default to ZONE_MOVABLE, unless
- other zones can be kept contiguous.
+``movable_node`` configure automatic zone selection in the kernel when
+ using the ``contig-zones`` online policy. When
+ set, the kernel will default to ZONE_MOVABLE when
+ onlining a memory block, unless other zones can be kept
+ contiguous.
======================== =======================================================
+See Documentation/admin-guide/kernel-parameters.txt for a more generic
+description of these command line parameters.
+
Module Parameters
------------------
@@ -410,24 +415,118 @@ them with ``memory_hotplug.`` such as::
and they can be observed (and some even modified at runtime) via::
- /sys/modules/memory_hotplug/parameters/
+ /sys/module/memory_hotplug/parameters/
The following module parameters are currently defined:
-======================== =======================================================
-``memmap_on_memory`` read-write: Allocate memory for the memmap from the
- added memory block itself. Even if enabled, actual
- support depends on various other system properties and
- should only be regarded as a hint whether the behavior
- would be desired.
-
- While allocating the memmap from the memory block
- itself makes memory hotplug less likely to fail and
- keeps the memmap on the same NUMA node in any case, it
- can fragment physical memory in a way that huge pages
- in bigger granularity cannot be formed on hotplugged
- memory.
-======================== =======================================================
+================================ ===============================================
+``memmap_on_memory`` read-write: Allocate memory for the memmap from
+ the added memory block itself. Even if enabled,
+ actual support depends on various other system
+ properties and should only be regarded as a
+ hint whether the behavior would be desired.
+
+ While allocating the memmap from the memory
+ block itself makes memory hotplug less likely
+ to fail and keeps the memmap on the same NUMA
+ node in any case, it can fragment physical
+ memory in a way that huge pages in bigger
+ granularity cannot be formed on hotplugged
+ memory.
+``online_policy`` read-write: Set the basic policy used for
+ automatic zone selection when onlining memory
+ blocks without specifying a target zone.
+ ``contig-zones`` has been the kernel default
+ before this parameter was added. After an
+ online policy was configured and memory was
+ online, the policy should not be changed
+ anymore.
+
+ When set to ``contig-zones``, the kernel will
+ try keeping zones contiguous. If a memory block
+ intersects multiple zones or no zone, the
+ behavior depends on the ``movable_node`` kernel
+ command line parameter: default to ZONE_MOVABLE
+ if set, default to the applicable kernel zone
+ (usually ZONE_NORMAL) if not set.
+
+ When set to ``auto-movable``, the kernel will
+ try onlining memory blocks to ZONE_MOVABLE if
+ possible according to the configuration and
+ memory device details. With this policy, one
+ can avoid zone imbalances when eventually
+ hotplugging a lot of memory later and still
+ wanting to be able to hotunplug as much as
+ possible reliably, very desirable in
+ virtualized environments. This policy ignores
+ the ``movable_node`` kernel command line
+ parameter and isn't really applicable in
+ environments that require it (e.g., bare metal
+ with hotunpluggable nodes) where hotplugged
+ memory might be exposed via the
+ firmware-provided memory map early during boot
+ to the system instead of getting detected,
+ added and onlined later during boot (such as
+ done by virtio-mem or by some hypervisors
+ implementing emulated DIMMs). As one example, a
+ hotplugged DIMM will be onlined either
+ completely to ZONE_MOVABLE or completely to
+ ZONE_NORMAL, not a mixture.
+ As another example, as many memory blocks
+ belonging to a virtio-mem device will be
+ onlined to ZONE_MOVABLE as possible,
+ special-casing units of memory blocks that can
+ only get hotunplugged together. *This policy
+ does not protect from setups that are
+ problematic with ZONE_MOVABLE and does not
+ change the zone of memory blocks dynamically
+ after they were onlined.*
+``auto_movable_ratio`` read-write: Set the maximum MOVABLE:KERNEL
+ memory ratio in % for the ``auto-movable``
+ online policy. Whether the ratio applies only
+ for the system across all NUMA nodes or also
+ per NUMA nodes depends on the
+ ``auto_movable_numa_aware`` configuration.
+
+ All accounting is based on present memory pages
+ in the zones combined with accounting per
+ memory device. Memory dedicated to the CMA
+ allocator is accounted as MOVABLE, although
+ residing on one of the kernel zones. The
+ possible ratio depends on the actual workload.
+ The kernel default is "301" %, for example,
+                                 allowing for hotplugging 24 GiB to an 8 GiB VM
+ and automatically onlining all hotplugged
+ memory to ZONE_MOVABLE in many setups. The
+ additional 1% deals with some pages being not
+ present, for example, because of some firmware
+ allocations.
+
+ Note that ZONE_NORMAL memory provided by one
+ memory device does not allow for more
+ ZONE_MOVABLE memory for a different memory
+ device. As one example, onlining memory of a
+ hotplugged DIMM to ZONE_NORMAL will not allow
+ for another hotplugged DIMM to get onlined to
+ ZONE_MOVABLE automatically. In contrast, memory
+ hotplugged by a virtio-mem device that got
+ onlined to ZONE_NORMAL will allow for more
+ ZONE_MOVABLE memory within *the same*
+ virtio-mem device.
+``auto_movable_numa_aware`` read-write: Configure whether the
+ ``auto_movable_ratio`` in the ``auto-movable``
+ online policy also applies per NUMA
+ node in addition to the whole system across all
+ NUMA nodes. The kernel default is "Y".
+
+ Disabling NUMA awareness can be helpful when
+ dealing with NUMA nodes that should be
+ completely hotunpluggable, onlining the memory
+ completely to ZONE_MOVABLE automatically if
+ possible.
+
+ Parameter availability depends on CONFIG_NUMA.
+================================ ===============================================
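+
+As a sketch combining the parameters above (using the module parameter and
+sysfs paths described in this section; the values are only an example)::
+
+	echo auto-movable > /sys/module/memory_hotplug/parameters/online_policy
+	echo 301 > /sys/module/memory_hotplug/parameters/auto_movable_ratio
+	echo online > /sys/devices/system/memory/auto_online_blocks
+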
ZONE_MOVABLE
============
@@ -552,8 +651,8 @@ block might fail:
- Concurrent activity that operates on the same physical memory area, such as
allocating gigantic pages, can result in temporary offlining failures.
-- Out of memory when dissolving huge pages, especially when freeing unused
- vmemmap pages associated with each hugetlb page is enabled.
+- Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap
+ Optimization (HVO) is enabled.
Offlining code may be able to migrate huge page contents, but may not be able
to dissolve the source huge page because it fails allocating (unmovable) pages
diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst
new file mode 100644
index 000000000000..33e068830497
--- /dev/null
+++ b/Documentation/admin-guide/mm/multigen_lru.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Multi-Gen LRU
+=============
+The multi-gen LRU is an alternative LRU implementation that optimizes
+page reclaim and improves performance under memory pressure. Page
+reclaim decides the kernel's caching policy and ability to overcommit
+memory. It directly impacts the kswapd CPU usage and RAM efficiency.
+
+Quick start
+===========
+Build the kernel with the following configurations.
+
+* ``CONFIG_LRU_GEN=y``
+* ``CONFIG_LRU_GEN_ENABLED=y``
+
+All set!
+
+Runtime options
+===============
+``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
+following subsections.
+
+Kill switch
+-----------
+``enabled`` accepts different values to enable or disable the
+following components. Its default value depends on
+``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
+unless some of them have unforeseen side effects. Writing to
+``enabled`` has no effect when a component is not supported by the
+hardware, and valid values will be accepted even when the main switch
+is off.
+
+====== ===============================================================
+Values Components
+====== ===============================================================
+0x0001 The main switch for the multi-gen LRU.
+0x0002 Clearing the accessed bit in leaf page table entries in large
+ batches, when MMU sets it (e.g., on x86). This behavior can
+ theoretically worsen lock contention (mmap_lock). If it is
+ disabled, the multi-gen LRU will suffer a minor performance
+ degradation for workloads that contiguously map hot pages,
+ whose accessed bits can be otherwise cleared by fewer larger
+ batches.
+0x0004 Clearing the accessed bit in non-leaf page table entries as
+ well, when MMU sets it (e.g., on x86). This behavior was not
+ verified on x86 varieties other than Intel and AMD. If it is
+ disabled, the multi-gen LRU will suffer a negligible
+ performance degradation.
+[yYnN] Apply to all the components above.
+====== ===============================================================
+
+E.g.,
+::
+
+ echo y >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0007
+ echo 5 >/sys/kernel/mm/lru_gen/enabled
+ cat /sys/kernel/mm/lru_gen/enabled
+ 0x0005
+
+Thrashing prevention
+--------------------
+Personal computers are more sensitive to thrashing because it can
+cause janks (lags when rendering UI) and negatively impact user
+experience. The multi-gen LRU offers thrashing prevention to the
+majority of laptop and desktop users who do not have ``oomd``.
+
+Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
+``N`` milliseconds from getting evicted. The OOM killer is triggered
+if this working set cannot be kept in memory. In other words, this
+option works as an adjustable pressure relief valve, and when open, it
+terminates applications that are hopefully not being used.
+
+Based on the average human detectable lag (~100ms), ``N=1000`` usually
+eliminates intolerable janks due to thrashing. Larger values like
+``N=3000`` make janks less noticeable at the risk of premature OOM
+kills.
+
+The default value ``0`` means disabled.
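+
+E.g., to protect the most recent second of the working set, as suggested
+above::
+
+    echo 1000 >/sys/kernel/mm/lru_gen/min_ttl_ms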
+
+Experimental features
+=====================
+``/sys/kernel/debug/lru_gen`` accepts commands described in the
+following subsections. Multiple command lines are supported, as is
+concatenation with the delimiters ``,`` and ``;``.
+
+``/sys/kernel/debug/lru_gen_full`` provides additional stats for
+debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
+evicted generations in this file.
+
+Working set estimation
+----------------------
+Working set estimation measures how much memory an application needs
+in a given time interval, and it is usually done with little impact on
+the performance of the application. E.g., data centers want to
+optimize job scheduling (bin packing) to improve memory utilization.
+When a new job comes in, the job scheduler needs to find out whether
+each server it manages can allocate a certain amount of memory for
+this new job before it can pick a candidate. To do so, the job
+scheduler needs to estimate the working sets of the existing jobs.
+
+When it is read, ``lru_gen`` returns a histogram of numbers of pages
+accessed over different time intervals for each memcg and node.
+``MAX_NR_GENS`` decides the number of bins for each histogram. The
+histograms are noncumulative.
+::
+
+ memcg memcg_id memcg_path
+ node node_id
+ min_gen_nr age_in_ms nr_anon_pages nr_file_pages
+ ...
+ max_gen_nr age_in_ms nr_anon_pages nr_file_pages
+
+Each bin contains an estimated number of pages that have been accessed
+within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
+and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
+the former is the largest and that of the latter is the smallest.
+
+Users can write the following command to ``lru_gen`` to create a new
+generation ``max_gen_nr+1``:
+
+ ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``
+
+``can_swap`` defaults to the swap setting and, if it is set to ``1``,
+it forces the scan of anon pages when swap is off, and vice versa.
+``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
+employs heuristics to reduce the overhead, which is likely to reduce
+the coverage as well.
+
+A typical use case is that a job scheduler runs this command at a
+certain time interval to create new generations, and it ranks the
+servers it manages based on the sizes of their cold pages defined by
+this time interval.
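+
+E.g., to create a new generation for memcg 1 on node 0, forcing an anon scan
+with full coverage (the memcg id and the current ``max_gen_nr`` of 5 here are
+illustrative; read the real values from ``lru_gen`` first)::
+
+    echo '+ 1 0 5 1 1' >/sys/kernel/debug/lru_gen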
+
+Proactive reclaim
+-----------------
+Proactive reclaim induces page reclaim when there is no memory
+pressure. It usually targets cold pages only. E.g., when a new job
+comes in, the job scheduler wants to proactively reclaim cold pages on
+the server it selected, to improve the chance of successfully landing
+this new job.
+
+Users can write the following command to ``lru_gen`` to evict
+generations less than or equal to ``min_gen_nr``.
+
+ ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``
+
+``min_gen_nr`` should be less than ``max_gen_nr-1``, since
+``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
+the active list) and therefore cannot be evicted. ``swappiness``
+overrides the default value in ``/proc/sys/vm/swappiness``.
+``nr_to_reclaim`` limits the number of pages to evict.
+
+A typical use case is that a job scheduler runs this command before it
+tries to land a new job on a server. If it fails to materialize enough
+cold pages because of the overestimation, it retries on the next
+server according to the ranking result obtained from the working set
+estimation step. This less forceful approach limits the impacts on the
+existing jobs.
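+
+E.g., to evict up to 1000 cold pages from generations up to ``min_gen_nr=3``
+for memcg 1 on node 0, with a swappiness of 20 (all values illustrative)::
+
+    echo '- 1 0 3 20 1000' >/sys/kernel/debug/lru_gen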
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 64fd0ba0d057..46515ad2337f 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -1,5 +1,3 @@
-.. _numa_memory_policy:
-
==================
NUMA Memory Policy
==================
@@ -246,7 +244,7 @@ MPOL_INTERLEAVED
interleaved system default policy works in this mode.
MPOL_PREFERRED_MANY
- This mode specifices that the allocation should be preferrably
+ This mode specifies that the allocation should be preferably
satisfied from the nodemask specified in the policy. If there is
a memory pressure on all nodes in the nodemask, the allocation
can fall back to all existing numa nodes. This is effectively
@@ -360,7 +358,7 @@ and NUMA nodes. "Usage" here means one of the following:
2) examination of the policy to determine the policy mode and associated node
or node lists, if any, for page allocation. This is considered a "hot
path". Note that for MPOL_BIND, the "usage" extends across the entire
- allocation process, which may sleep during page reclaimation, because the
+ allocation process, which may sleep during page reclamation, because the
BIND policy nodemask is used, by reference, to filter ineligible nodes.
We can avoid taking an extra reference during the usages listed above as
@@ -408,7 +406,7 @@ follows:
Memory Policy APIs
==================
-Linux supports 3 system calls for controlling memory policy. These APIS
+Linux supports 4 system calls for controlling memory policy. These APIS
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.
@@ -460,6 +458,20 @@ requested via the 'flags' argument.
See the mbind(2) man page for more details.
+Set home node for a Range of Task's Address Space::
+
+ long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
+ unsigned long home_node,
+ unsigned long flags);
+
+sys_set_mempolicy_home_node sets the home node for a VMA policy present in the
+task's address range. The system call updates the home node only for the existing
+mempolicy range; other address ranges are ignored. The home node is the NUMA node
+from which page allocations will preferably be satisfied. Specifying the home
+node overrides the default policy of allocating memory close to the node of the
+executing CPU.
+
+
Memory Policy Command Line Interface
====================================
diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
index 166697325947..90a12b6a8bfc 100644
--- a/Documentation/admin-guide/mm/numaperf.rst
+++ b/Documentation/admin-guide/mm/numaperf.rst
@@ -1,6 +1,7 @@
-.. _numaperf:
+=======================
+NUMA Memory Performance
+=======================
-=============
NUMA Locality
=============
@@ -61,7 +62,6 @@ that are CPUs and hence suitable for generic task scheduling, and
IO initiators such as GPUs and NICs. Unlike access class 0, only
nodes containing CPUs are considered.
-================
NUMA Performance
================
@@ -96,7 +96,6 @@ for the platform.
Access class 1 takes the same form but only includes values for CPU to
memory activity.
-==========
NUMA Cache
==========
@@ -170,7 +169,6 @@ The "size" is the number of bytes provided by this cache level.
The "write_policy" will be 0 for write-back, and non-zero for
write-through caching.
-========
See Also
========
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index fb578fbbb76c..c8f380271cad 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -1,5 +1,3 @@
-.. _pagemap:
-
=============================
Examining Process Page Tables
=============================
@@ -19,11 +17,11 @@ There are four components to pagemap:
* Bits 0-4 swap type if swapped
* Bits 5-54 swap offset if swapped
* Bit 55 pte is soft-dirty (see
- :ref:`Documentation/admin-guide/mm/soft-dirty.rst <soft_dirty>`)
+ Documentation/admin-guide/mm/soft-dirty.rst)
* Bit 56 page exclusively mapped (since 4.2)
* Bit 57 pte is uffd-wp write-protected (since 5.13) (see
- :ref:`Documentation/admin-guide/mm/userfaultfd.rst <userfaultfd>`)
- * Bits 57-60 zero
+ Documentation/admin-guide/mm/userfaultfd.rst)
+ * Bits 58-60 zero
* Bit 61 page is file-page or shared-anon (since 3.5)
* Bit 62 page swapped
* Bit 63 page present
@@ -46,7 +44,7 @@ There are four components to pagemap:
* ``/proc/kpagecount``. This file contains a 64-bit count of the number of
times each page is mapped, indexed by PFN.
-The page-types tool in the tools/vm directory can be used to query the
+The page-types tool in the tools/mm directory can be used to query the
number of times a page is mapped.
* ``/proc/kpageflags``. This file contains a 64-bit set of flags for each
@@ -90,13 +88,14 @@ Short descriptions to the page flags
====================================
0 - LOCKED
- page is being locked for exclusive access, e.g. by undergoing read/write IO
+ The page is being locked for exclusive access, e.g. by undergoing read/write
+ IO.
7 - SLAB
- page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
- When compound page is used, SLUB/SLQB will only set this flag on the head
- page; SLOB will not flag it at all.
+ The page is managed by the SLAB/SLUB kernel memory allocator.
+    When a compound page is used, either allocator will set this flag only on
+    the head page.
10 - BUDDY
- a free memory block managed by the buddy system allocator
+ A free memory block managed by the buddy system allocator.
The buddy system organizes free memory in blocks of various orders.
An order N block has 2^N physically contiguous pages, with the BUDDY flag
set for and _only_ for the first page.
@@ -104,75 +103,74 @@ Short descriptions to the page flags
A compound page with order N consists of 2^N physically contiguous pages.
A compound page with order 2 takes the form of "HTTT", where H donates its
head page and T donates its tail page(s). The major consumers of compound
- pages are hugeTLB pages
- (:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`),
+ pages are hugeTLB pages (Documentation/admin-guide/mm/hugetlbpage.rst),
the SLUB etc. memory allocators and various device drivers.
However in this interface, only huge/giga pages are made visible
to end users.
16 - COMPOUND_TAIL
A compound page tail (see description above).
17 - HUGE
- this is an integral part of a HugeTLB page
+ This is an integral part of a HugeTLB page.
19 - HWPOISON
- hardware detected memory corruption on this page: don't touch the data!
+ Hardware detected memory corruption on this page: don't touch the data!
20 - NOPAGE
- no page frame exists at the requested address
+ No page frame exists at the requested address.
21 - KSM
- identical memory pages dynamically shared between one or more processes
+ Identical memory pages dynamically shared between one or more processes.
22 - THP
- contiguous pages which construct transparent hugepages
+ Contiguous pages which construct transparent hugepages.
23 - OFFLINE
- page is logically offline
+ The page is logically offline.
24 - ZERO_PAGE
- zero page for pfn_zero or huge_zero page
+ Zero page for pfn_zero or huge_zero page.
25 - IDLE
- page has not been accessed since it was marked idle (see
- :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
+ The page has not been accessed since it was marked idle (see
+ Documentation/admin-guide/mm/idle_page_tracking.rst).
Note that this flag may be stale in case the page was accessed via
a PTE. To make sure the flag is up-to-date one has to read
``/sys/kernel/mm/page_idle/bitmap`` first.
26 - PGTABLE
- page is in use as a page table
+ The page is in use as a page table.
IO related page flags
---------------------
1 - ERROR
- IO error occurred
+ IO error occurred.
3 - UPTODATE
- page has up-to-date data
+ The page has up-to-date data.
     i.e. for file backed page: (in-memory data revision >= on-disk one)
4 - DIRTY
- page has been written to, hence contains new data
+ The page has been written to, hence contains new data.
i.e. for file backed page: (in-memory data revision > on-disk one)
8 - WRITEBACK
- page is being synced to disk
+ The page is being synced to disk.
LRU related page flags
----------------------
5 - LRU
- page is in one of the LRU lists
+ The page is in one of the LRU lists.
6 - ACTIVE
- page is in the active LRU list
+ The page is in the active LRU list.
18 - UNEVICTABLE
- page is in the unevictable (non-)LRU list It is somehow pinned and
+    The page is in the unevictable (non-)LRU list. It is somehow pinned and
not a candidate for LRU page reclaims, e.g. ramfs pages,
- shmctl(SHM_LOCK) and mlock() memory segments
+ shmctl(SHM_LOCK) and mlock() memory segments.
2 - REFERENCED
- page has been referenced since last LRU list enqueue/requeue
+ The page has been referenced since last LRU list enqueue/requeue.
9 - RECLAIM
- page will be reclaimed soon after its pageout IO completed
+ The page will be reclaimed soon after its pageout IO completed.
11 - MMAP
- a memory mapped page
+ A memory mapped page.
12 - ANON
- a memory mapped page that is not part of a file
+ A memory mapped page that is not part of a file.
13 - SWAPCACHE
- page is mapped to swap space, i.e. has an associated swap entry
+ The page is mapped to swap space, i.e. has an associated swap entry.
14 - SWAPBACKED
- page is backed by swap/RAM
+ The page is backed by swap/RAM.
-The page-types tool in the tools/vm directory can be used to query the
+The page-types tool in the tools/mm directory can be used to query the
above flags.
Using pagemap to do something useful
@@ -196,6 +194,28 @@ you can go through every map in the process, find the PFNs, look those up
in kpagecount, and tally up the number of pages that are only referenced
once.
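+
+A hedged sketch of reading a single pagemap entry from the shell (assuming a
+4 KiB page size; ``$pid`` and ``$vaddr`` are placeholders, and bit 63 of the
+printed value is the "present" bit described above)::
+
+    dd if=/proc/$pid/pagemap bs=8 skip=$(( $vaddr / 4096 )) count=1 \
+        2>/dev/null | od -An -tx8
+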
+Exceptions for Shared Memory
+============================
+
+Page table entries for shared pages are cleared when the pages are zapped or
+swapped out. This makes swapped out pages indistinguishable from never-allocated
+ones.
+
+In kernel space, the swap location can still be retrieved from the page cache.
+However, values stored only on the normal PTE get lost irretrievably when the
+page is swapped out (i.e. SOFT_DIRTY).
+
+In user space, whether the page is present, swapped or none can be deduced
+with the help of the lseek() and/or mincore() system calls.
+
+lseek() can differentiate between accessed pages (present or swapped out) and
+holes (none/non-allocated) by specifying the SEEK_DATA flag on the file where
+the pages are backed. For anonymous shared pages, the file can be found in
+``/proc/pid/map_files/``.
+
+mincore() can differentiate between pages in memory (present, including swap
+cache) and out of memory (swapped out or none/non-allocated).
+
Other notes
===========
diff --git a/Documentation/admin-guide/mm/shrinker_debugfs.rst b/Documentation/admin-guide/mm/shrinker_debugfs.rst
new file mode 100644
index 000000000000..c582033bd113
--- /dev/null
+++ b/Documentation/admin-guide/mm/shrinker_debugfs.rst
@@ -0,0 +1,133 @@
+==========================
+Shrinker Debugfs Interface
+==========================
+
+The shrinker debugfs interface provides visibility into the kernel memory
+shrinker subsystem and allows getting information about individual shrinkers
+and interacting with them.
+
+For each shrinker registered in the system, a directory in **<debugfs>/shrinker/**
+is created. The directory's name is composed of the shrinker's name and a
+unique id: e.g. *kfree_rcu-0* or *sb-xfs:vda1-36*.
+
+Each shrinker directory contains **count** and **scan** files, which allow
+triggering the *count_objects()* and *scan_objects()* callbacks for each memcg
+and numa node (if applicable).
+
+Usage:
+------
+
+1. *List registered shrinkers*
+
+ ::
+
+ $ cd /sys/kernel/debug/shrinker/
+ $ ls
+ dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
+ mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
+ mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
+ rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
+ sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
+ sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
+ sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
+ sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
+ sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
+ sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
+ sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
+ sb-dax-11 sb-proc-45 sb-tmpfs-35
+ sb-debugfs-7 sb-proc-46 sb-tmpfs-40
+
+2. *Get information about a specific shrinker*
+
+ ::
+
+ $ cd sb-btrfs\:vda2-24/
+ $ ls
+ count scan
+
+3. *Count objects*
+
+ Each line in the output has the following format::
+
+ <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1> ...
+ <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1> ...
+ ...
+
+   If there are no objects on any of the numa nodes, the line is omitted. If
+   there are no objects at all, the output might be empty.
+
+ If the shrinker is not memcg-aware or CONFIG_MEMCG is off, 0 is printed
+ as cgroup inode id. If the shrinker is not numa-aware, 0's are printed
+ for all nodes except the first one.
+ ::
+
+ $ cat count
+ 1 224 2
+ 21 98 0
+ 55 818 10
+ 2367 2 0
+ 2401 30 0
+ 225 13 0
+ 599 35 0
+ 939 124 0
+ 1041 3 0
+ 1075 1 0
+ 1109 1 0
+ 1279 60 0
+ 1313 7 0
+ 1347 39 0
+ 1381 3 0
+ 1449 14 0
+ 1483 63 0
+ 1517 53 0
+ 1551 6 0
+ 1585 1 0
+ 1619 6 0
+ 1653 40 0
+ 1687 11 0
+ 1721 8 0
+ 1755 4 0
+ 1789 52 0
+ 1823 888 0
+ 1857 1 0
+ 1925 2 0
+ 1959 32 0
+ 2027 22 0
+ 2061 9 0
+ 2469 799 0
+ 2537 861 0
+ 2639 1 0
+ 2707 70 0
+ 2775 4 0
+ 2877 84 0
+ 293 1 0
+ 735 8 0
+
+4. *Scan objects*
+
+ The expected input format::
+
+ <cgroup inode id> <numa id> <number of objects to scan>
+
+   For a non-memcg-aware shrinker or on a system with no memory
+   cgroups, **0** should be passed as the cgroup id.
+ ::
+
+ $ cd /sys/kernel/debug/shrinker/
+ $ cd sb-btrfs\:vda2-24/
+
+ $ cat count | head -n 5
+ 1 212 0
+ 21 97 0
+ 55 802 5
+ 2367 2 0
+ 225 13 0
+
+ $ echo "55 0 200" > scan
+
+ $ cat count | head -n 5
+ 1 212 0
+ 21 96 0
+ 55 752 5
+ 2367 2 0
+ 225 13 0
diff --git a/Documentation/admin-guide/mm/soft-dirty.rst b/Documentation/admin-guide/mm/soft-dirty.rst
index cb0cfd6672fa..aeea936caa44 100644
--- a/Documentation/admin-guide/mm/soft-dirty.rst
+++ b/Documentation/admin-guide/mm/soft-dirty.rst
@@ -1,5 +1,3 @@
-.. _soft_dirty:
-
===============
Soft-Dirty PTEs
===============
diff --git a/Documentation/vm/swap_numa.rst b/Documentation/admin-guide/mm/swap_numa.rst
index e0466f2db8fa..2e630627bcee 100644
--- a/Documentation/vm/swap_numa.rst
+++ b/Documentation/admin-guide/mm/swap_numa.rst
@@ -1,5 +1,3 @@
-.. _swap_numa:
-
===========================================
Automatically bind swap device to numa node
===========================================
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index c9c37f16eef8..b0cc8243e093 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -1,5 +1,3 @@
-.. _admin_guide_transhuge:
-
============================
Transparent Hugepage Support
============================
@@ -191,7 +189,14 @@ allocation failure to throttle the next allocation attempt::
/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
-The khugepaged progress can be seen in the number of pages collapsed::
+The khugepaged progress can be seen in the number of pages collapsed (note
+that this counter may not be an exact count of the number of pages
+collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
+being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
+one 2M hugepage. Each may happen independently, or together, depending on
+the type of memory and the failures that occur. As such, this value should
+be interpreted roughly as a sign of progress, and counters in /proc/vmstat
+consulted for more accurate accounting)::
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
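+
+For example, the more precise /proc/vmstat counters mentioned above can be
+inspected with::
+
+	grep thp_collapse /proc/vmstat
+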
@@ -366,10 +371,9 @@ thp_split_pmd
page table entry.
thp_zero_page_alloc
- is incremented every time a huge zero page is
- successfully allocated. It includes allocations which where
- dropped due race with other allocation. Note, it doesn't count
- every map of the huge zero page, only its allocation.
+ is incremented every time a huge zero page used for thp is
+ successfully allocated. Note, it doesn't count every map of
+ the huge zero page, only its allocation.
thp_zero_page_alloc_failed
is incremented if kernel fails to allocate
diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 6528036093e1..7c304e432205 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -1,5 +1,3 @@
-.. _userfaultfd:
-
===========
Userfaultfd
===========
@@ -17,7 +15,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
Design
======
-Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
+Userspace creates a new userfaultfd, initializes it, and registers one or more
+regions of virtual memory with it. Then, any page faults which occur within the
+region(s) result in a message being delivered to the userfaultfd, notifying
+userspace of the fault.
The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:
@@ -34,12 +35,11 @@ The real advantage of userfaults if compared to regular virtual memory
management of mremap/mprotect is that the userfaults in all their
operations never involve heavyweight structures like vmas (in fact the
``userfaultfd`` runtime load never takes the mmap_lock for writing).
-
Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that.
-The ``userfaultfd`` once opened by invoking the syscall, can also be
+The ``userfaultfd``, once created, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on
@@ -50,6 +50,39 @@ is a corner case that would currently return ``-EBUSY``).
API
===
+Creating a userfaultfd
+----------------------
+
+There are two ways to create a new userfaultfd, each of which provides ways to
+restrict access to this functionality (since historically userfaultfds which
+handle kernel page faults have been a useful tool for exploiting the kernel).
+
+The first way, supported since userfaultfd was introduced, is the
+userfaultfd(2) syscall. Access to this is controlled in several ways:
+
+- Any user can always create a userfaultfd which traps userspace page faults
+ only. Such a userfaultfd can be created using the userfaultfd(2) syscall
+ with the flag UFFD_USER_MODE_ONLY.
+
+- In order to also trap kernel page faults for the address space, either the
+ process needs the CAP_SYS_PTRACE capability, or the system must have
+ vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd
+ is set to 0.
+
+The second way, added to the kernel more recently, is by opening
+/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method
+yields equivalent userfaultfds to the userfaultfd(2) syscall.
+
+Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal
+filesystem permissions (user/group/mode), which gives fine grained access to
+userfaultfd specifically, without also granting other unrelated privileges at
+the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access
+to /dev/userfaultfd can always create userfaultfds that trap kernel page faults;
+vm.unprivileged_userfaultfd is not considered.
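+
+For example, an administrator could grant a dedicated group access to
+/dev/userfaultfd with ordinary file permissions (the group name is purely
+illustrative)::
+
+    # chgrp uffd /dev/userfaultfd
+    # chmod 0660 /dev/userfaultfd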
+
+Initializing a userfaultfd
+--------------------------
+
When first opened the ``userfaultfd`` must be enabled invoking the
``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
a later API version) which will specify the ``read/POLLIN`` protocol
@@ -186,6 +219,31 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
used.
+Userfaultfd write-protect mode currently behaves differently on none ptes
+(e.g. when the page is missing) across different types of memory.
+
+For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes
+(e.g. when pages are missing and not populated). For file-backed memory
+like shmem and hugetlbfs, none ptes will be write protected just like a
+present pte. In other words, a userfaultfd write fault message will be
+generated when writing to a missing page on file-backed memory,
+as long as the page range was write-protected before. Such a message will
+not be generated on anonymous memory by default.
+
+If the application wants to be able to write protect none ptes on anonymous
+memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On
+newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED
+and set the feature bit in advance to make sure none ptes will also be
+write protected even on anonymous memory.
+
+When using ``UFFDIO_REGISTER_MODE_WP`` in combination with either
+``UFFDIO_REGISTER_MODE_MISSING`` or ``UFFDIO_REGISTER_MODE_MINOR``, when
+resolving missing / minor faults with ``UFFDIO_COPY`` or ``UFFDIO_CONTINUE``
+respectively, it may be desirable for the new page / mapping to be
+write-protected (so future writes will also result in a WP fault). These ioctls
+support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
+respectively) to configure the mapping this way.
+
QEMU/KVM
========
diff --git a/Documentation/vm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index 8edb8d578caf..c5c2c7dbb155 100644
--- a/Documentation/vm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -1,5 +1,3 @@
-.. _zswap:
-
=====
zswap
=====
@@ -14,13 +12,7 @@ for potentially reduced swap I/O. This trade-off can also result in a
significant performance improvement if reads from the compressed cache are
faster than reads from a swap device.
-.. note::
- Zswap is a new feature as of v3.11 and interacts heavily with memory
- reclaim. This interaction has not been fully explored on the large set of
- potential configurations and workloads that exist. For this reason, zswap
- is a work in progress and should be considered experimental.
-
- Some potential benefits:
+Some potential benefits:
* Desktop/laptop users with limited RAM capacities can mitigate the
performance impact of swapping.
@@ -76,9 +68,7 @@ e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs
The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which
means the compression ratio will always be 2:1 or worse (because of half-full
zbud pages). The zsmalloc type zpool has a more complex compressed page
-storage method, and it can achieve greater storage densities. However,
-zsmalloc does not implement compressed page eviction, so once zswap fills it
-cannot evict the oldest page, it can only reject new pages.
+storage method, and it can achieve greater storage densities.
When a swap page is passed from frontswap to zswap, zswap maintains a mapping
of the swap entry, a combination of the swap type and swap offset, to the zpool
@@ -130,9 +120,25 @@ attribute, e.g.::
echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled
When zswap same-filled page identification is disabled at runtime, it will stop
-checking for the same-value filled pages during store operation. However, the
-existing pages which are marked as same-value filled pages remain stored
-unchanged in zswap until they are either loaded or invalidated.
+checking for the same-value filled pages during store operation.
+In other words, every page will then be considered non-same-value filled.
+However, the existing pages which are marked as same-value filled pages remain
+stored unchanged in zswap until they are either loaded or invalidated.
+
+In some circumstances it might be advantageous to make use of just the zswap
+ability to efficiently store same-filled pages without enabling the whole
+compressed page storage.
+In this case the handling of non-same-value pages by zswap (enabled by default)
+can be disabled by setting the ``non_same_filled_pages_enabled`` attribute
+to 0, e.g. ``zswap.non_same_filled_pages_enabled=0``.
+It can also be enabled and disabled at runtime using the sysfs
+``non_same_filled_pages_enabled`` attribute, e.g.::
+
+ echo 1 > /sys/module/zswap/parameters/non_same_filled_pages_enabled
+
+Disabling both ``zswap.same_filled_pages_enabled`` and
+``zswap.non_same_filled_pages_enabled`` effectively disables accepting any new
+pages by zswap.
To prevent zswap from shrinking pool when zswap is full and there's a high
pressure on swap (this will result in flipping pages in and out zswap pool
diff --git a/Documentation/admin-guide/nfs/nfs-client.rst b/Documentation/admin-guide/nfs/nfs-client.rst
index 6adb6457bc69..36760685dd34 100644
--- a/Documentation/admin-guide/nfs/nfs-client.rst
+++ b/Documentation/admin-guide/nfs/nfs-client.rst
@@ -36,10 +36,9 @@ administrative requirements that require particular behavior that does not
work well as part of an nfs_client_id4 string.
The nfs.nfs4_unique_id boot parameter specifies a unique string that can be
-used instead of a system's node name when an NFS client identifies itself to
-a server. Thus, if the system's node name is not unique, or it changes, its
-nfs.nfs4_unique_id stays the same, preventing collision with other clients
-or loss of state during NFS reboot recovery or transparent state migration.
+used together with a system's node name when an NFS client identifies itself to
+a server. Thus, if the system's node name is not unique, its
+nfs.nfs4_unique_id can help prevent collisions with other clients.
The nfs.nfs4_unique_id string is typically a UUID, though it can contain
anything that is believed to be unique across all NFS clients. An
@@ -53,8 +52,12 @@ outstanding NFSv4 state has expired, to prevent loss of NFSv4 state.
This string can be stored in an NFS client's grub.conf, or it can be provided
via a net boot facility such as PXE. It may also be specified as an nfs.ko
-module parameter. Specifying a uniquifier string is not support for NFS
-clients running in containers.
+module parameter.
+
+This uniquifier string will be the same for all NFS clients running in
+containers unless it is overridden by a value written to
+/sys/fs/nfs/net/nfs_client/identifier, which will be local to the network
+namespace of the process that writes it.
The DNS resolver
diff --git a/Documentation/admin-guide/perf/alibaba_pmu.rst b/Documentation/admin-guide/perf/alibaba_pmu.rst
new file mode 100644
index 000000000000..11de998bb480
--- /dev/null
+++ b/Documentation/admin-guide/perf/alibaba_pmu.rst
@@ -0,0 +1,100 @@
+=============================================================
+Alibaba's T-Head SoC Uncore Performance Monitoring Unit (PMU)
+=============================================================
+
+The Yitian 710, custom-built by Alibaba Group's chip development business,
+T-Head, implements uncore PMU for performance and functional debugging to
+facilitate system maintenance.
+
+DDR Sub-System Driveway (DRW) PMU Driver
+=========================================
+
+Yitian 710 employs eight DDR5/4 channels, four on each die. Each DDR5 channel
+is independent of others to service system memory requests. And one DDR5
+channel is split into two independent sub-channels. The DDR Sub-System Driveway
+implements separate PMUs for each sub-channel to monitor various performance
+metrics.
+
+The Driveway PMU devices are named ali_drw_<sys_base_addr> in perf.
+For example, ali_drw_21000 and ali_drw_21080 are the two PMU devices for the
+two sub-channels of the same channel in die 0, and the PMU devices of die 1
+are prefixed with ali_drw_400XXXXX, e.g. ali_drw_40021000.
+
+Each sub-channel has 36 PMU counters in total, which are classified into
+four groups:
+
+- Group 0: PMU Cycle Counter. This group has one pair of counters,
+  pmu_cycle_cnt_low and pmu_cycle_cnt_high, which together provide the cycle
+  count based on the DDRC core clock.
+
+- Group 1: PMU Bandwidth Counters. This group has 8 counters that are used
+ to count the total access number of either the eight bank groups in a
+ selected rank, or four ranks separately in the first 4 counters. The base
+ transfer unit is 64B.
+
+- Group 2: PMU Retry Counters. This group has 10 counters, which count the
+  total number of retries for each type of uncorrectable error.
+
+- Group 3: PMU Common Counters. This group has 16 counters, which are used
+  to count the common events.
+
+For now, the Driveway PMU driver only uses counters in group 0 and group 3.
+
+The DDR Controller (DDRCTL) and DDR PHY combine to create a complete solution
+for connecting an SoC application bus to DDR memory devices. The DDRCTL
+receives transactions on the Host Interface (HIF), which is custom-defined by
+Synopsys. These transactions are queued internally and scheduled for access while
+satisfying the SDRAM protocol timing requirements, transaction priorities, and
+dependencies between the transactions. The DDRCTL in turn issues commands on
+the DDR PHY Interface (DFI) to the PHY module, which launches and captures data
+to and from the SDRAM. The driveway PMUs have hardware logic to gather
+statistics and performance logging signals on HIF, DFI, etc.
+
+By counting the READ, WRITE and RMW commands sent to the DDRC through the HIF
+interface, we could calculate the bandwidth. Example usage of counting memory
+data bandwidth::
+
+ perf stat \
+ -e ali_drw_21000/hif_wr/ \
+ -e ali_drw_21000/hif_rd/ \
+ -e ali_drw_21000/hif_rmw/ \
+ -e ali_drw_21000/cycle/ \
+ -e ali_drw_21080/hif_wr/ \
+ -e ali_drw_21080/hif_rd/ \
+ -e ali_drw_21080/hif_rmw/ \
+ -e ali_drw_21080/cycle/ \
+ -e ali_drw_23000/hif_wr/ \
+ -e ali_drw_23000/hif_rd/ \
+ -e ali_drw_23000/hif_rmw/ \
+ -e ali_drw_23000/cycle/ \
+ -e ali_drw_23080/hif_wr/ \
+ -e ali_drw_23080/hif_rd/ \
+ -e ali_drw_23080/hif_rmw/ \
+ -e ali_drw_23080/cycle/ \
+ -e ali_drw_25000/hif_wr/ \
+ -e ali_drw_25000/hif_rd/ \
+ -e ali_drw_25000/hif_rmw/ \
+ -e ali_drw_25000/cycle/ \
+ -e ali_drw_25080/hif_wr/ \
+ -e ali_drw_25080/hif_rd/ \
+ -e ali_drw_25080/hif_rmw/ \
+ -e ali_drw_25080/cycle/ \
+ -e ali_drw_27000/hif_wr/ \
+ -e ali_drw_27000/hif_rd/ \
+ -e ali_drw_27000/hif_rmw/ \
+ -e ali_drw_27000/cycle/ \
+ -e ali_drw_27080/hif_wr/ \
+ -e ali_drw_27080/hif_rd/ \
+ -e ali_drw_27080/hif_rmw/ \
+ -e ali_drw_27080/cycle/ -- sleep 10
+
+The average DRAM bandwidth can be calculated as follows:
+
+- Read Bandwidth = perf_hif_rd * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle
+- Write Bandwidth = (perf_hif_wr + perf_hif_rmw) * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle
+
+Here, DDRC_WIDTH = 64 bytes.
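+
+A hedged shell sketch of the read-bandwidth formula above, where the counter
+values and the DDRC frequency (``$hif_rd``, ``$cycle``, ``$ddrc_freq``) are
+placeholders taken from the perf output and the platform::
+
+	echo "scale=2; $hif_rd * 64 * $ddrc_freq / $cycle" | bc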
+
+The current driver does not support sampling, so "perf record" is
+unsupported. Attaching to a task is also unsupported, as the events are all
+uncore.
diff --git a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst
new file mode 100644
index 000000000000..7e863662e2d4
--- /dev/null
+++ b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst
@@ -0,0 +1,130 @@
+================================================
+HiSilicon PCIe Performance Monitoring Unit (PMU)
+================================================
+
+On Hip09, HiSilicon PCIe Performance Monitoring Unit (PMU) could monitor
+bandwidth, latency, bus utilization and buffer occupancy data of PCIe.
+
+Each PCIe Core has a PMU to monitor the multiple Root Ports of this PCIe Core
+and all Endpoints downstream of these Root Ports.
+
+
+HiSilicon PCIe PMU driver
+=========================
+
+The PCIe PMU driver registers a perf PMU with the name of its sicl id and PCIe
+Core id::
+
+ /sys/bus/event_source/hisi_pcie<sicl>_core<core>
+
+The PMU driver provides a description of available events and filter options
+in sysfs, see /sys/bus/event_source/devices/hisi_pcie<sicl>_core<core>.
+
+The "format" directory describes all formats of the config (events) and config1
+(filter options) fields of the perf_event_attr structure. The "events" directory
+describes all documented events shown in perf list.
+
+The "identifier" sysfs file allows users to identify the version of the
+PMU hardware device.
+
+The "bus" sysfs file allows users to get the bus number of Root Ports
+monitored by PMU.
+
+Example usage of perf::
+
+ $# perf list
+ hisi_pcie0_core0/rx_mwr_latency/ [kernel PMU event]
+ hisi_pcie0_core0/rx_mwr_cnt/ [kernel PMU event]
+ ------------------------------------------
+
+ $# perf stat -e hisi_pcie0_core0/rx_mwr_latency/
+ $# perf stat -e hisi_pcie0_core0/rx_mwr_cnt/
+ $# perf stat -g -e hisi_pcie0_core0/rx_mwr_latency/ -e hisi_pcie0_core0/rx_mwr_cnt/
+
+The current driver does not support sampling, so "perf record" is unsupported.
+Attaching to a task is also unsupported for the PCIe PMU.
+
+Filter options
+--------------
+
+1. Target filter
+
+   The PMU can only monitor the performance of traffic downstream of the
+   target Root Ports or of a downstream target Endpoint. The PCIe PMU driver
+   supports "port" and "bdf" interfaces for users, but these two interfaces
+   cannot be used at the same time.
+
+ - port
+
+     The "port" filter can be used with all PCIe PMU events. The target Root
+     Port can be selected by configuring the 16-bit bitmap "port". Multiple
+     ports can be selected for AP-layer events, and only one port can be
+     selected for TL/DL-layer events.
+
+ For example, if target Root Port is 0000:00:00.0 (x8 lanes), bit0 of
+ bitmap should be set, port=0x1; if target Root Port is 0000:00:04.0 (x4
+ lanes), bit8 is set, port=0x100; if these two Root Ports are both
+ monitored, port=0x101.
+
+ Example usage of perf::
+
+ $# perf stat -e hisi_pcie0_core0/rx_mwr_latency,port=0x1/ sleep 5
+
+ - bdf
+
+     The "bdf" filter can only be used with bandwidth events. The target
+     Endpoint is selected by configuring its BDF into "bdf", and the counter
+     only counts the bandwidth of messages requested by the target Endpoint.
+
+ For example, "bdf=0x3900" means BDF of target Endpoint is 0000:39:00.0.
+
+ Example usage of perf::
+
+ $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,bdf=0x3900/ sleep 5
+
+2. Trigger filter
+
+   Event statistics start the first time the TLP length is greater/smaller
+   than the trigger condition. You can set the trigger condition by writing
+   "trig_len", and set the trigger mode by writing "trig_mode". This filter
+   can only be used with bandwidth events.
+
+ For example, "trig_len=4" means trigger condition is 2^4 DW, "trig_mode=0"
+ means statistics start when TLP length > trigger condition, "trig_mode=1"
+ means start when TLP length < condition.
+
+ Example usage of perf::
+
+ $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5
+
+3. Threshold filter
+
+   The counter counts when the TLP length is within the specified range. You
+   can set the threshold by writing "thr_len", and set the threshold mode by
+   writing "thr_mode". This filter can only be used with bandwidth events.
+
+ For example, "thr_len=4" means threshold is 2^4 DW, "thr_mode=0" means
+ counter counts when TLP length >= threshold, and "thr_mode=1" means counts
+ when TLP length < threshold.
+
+ Example usage of perf::
+
+ $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5
+
+4. TLP Length filter
+
+ When counting bandwidth, the data can be composed of certain parts of TLP
+ packets. You can specify it through "len_mode":
+
+ - 2'b00: Reserved (Do not use this since the behaviour is undefined)
+ - 2'b01: Bandwidth of TLP payloads
+ - 2'b10: Bandwidth of TLP headers
+ - 2'b11: Bandwidth of both TLP payloads and headers
+
+ For example, "len_mode=2" means only counting the bandwidth of TLP headers
+ and "len_mode=3" means the final bandwidth data is composed of both TLP
+ headers and payloads. Default value if not specified is 2'b11.
+
+ Example usage of perf::
+
+ $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,len_mode=0x1/ sleep 5
diff --git a/Documentation/admin-guide/perf/hns3-pmu.rst b/Documentation/admin-guide/perf/hns3-pmu.rst
new file mode 100644
index 000000000000..75a40846d47f
--- /dev/null
+++ b/Documentation/admin-guide/perf/hns3-pmu.rst
@@ -0,0 +1,136 @@
+======================================
+HNS3 Performance Monitoring Unit (PMU)
+======================================
+
+HNS3 (HiSilicon network system 3) Performance Monitoring Unit (PMU) is an
+End Point device that collects performance statistics of the HiSilicon SoC NIC.
+On Hip09, each SICL (Super I/O cluster) has one PMU device.
+
+HNS3 PMU supports collection of performance statistics such as bandwidth,
+latency, packet rate and interrupt rate.
+
+Each HNS3 PMU supports 8 hardware events.
+
+HNS3 PMU driver
+===============
+
+The HNS3 PMU driver registers a perf PMU with the name of its sicl id::
+
+ /sys/devices/hns3_pmu_sicl_<sicl_id>
+
+The PMU driver provides descriptions of the available events, filter modes,
+format, identifier and cpumask in sysfs.
+
+The "events" directory describes the event code of all supported events
+shown in perf list.
+
+The "filtermode" directory describes the supported filter modes of each
+event.
+
+The "format" directory describes all formats of the config (events) and
+config1 (filter options) fields of the perf_event_attr structure.
+
+The "identifier" file shows version of PMU hardware device.
+
+The "bdf_min" and "bdf_max" files show the supported bdf range of each
+pmu device.
+
+The "hw_clk_freq" file shows the hardware clock frequency of each pmu
+device.
+
+Example usage of checking event code and subevent code::
+
+ $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_time
+ config=0x00204
+ $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_packet_num
+ config=0x10204
+
+Each performance statistic has a pair of events to get two values to
+calculate real performance data in userspace.
+
+Bits 0~15 of config (here 0x0204) are the true hardware event code. If
+two events have the same value in bits 0~15 of config, they form an event
+pair. Bit 16 of config indicates whether counter 0 or counter 1 of the
+hardware event is read.
+
+After getting the two values of an event pair in userspace, the formula to
+calculate the real performance data is::
+
+ counter 0 / counter 1
+
+Example usage of checking supported filter mode::
+
+ $# cat /sys/devices/hns3_pmu_sicl_0/filtermode/bw_ssu_rpu_byte_num
+ filter mode supported: global/port/port-tc/func/func-queue/
+
+Example usage of perf::
+
+ $# perf list
+ hns3_pmu_sicl_0/bw_ssu_rpu_byte_num/ [kernel PMU event]
+ hns3_pmu_sicl_0/bw_ssu_rpu_time/ [kernel PMU event]
+ ------------------------------------------
+
+ $# perf stat -g -e hns3_pmu_sicl_0/bw_ssu_rpu_byte_num,global=1/ -e hns3_pmu_sicl_0/bw_ssu_rpu_time,global=1/ -I 1000
+ or
+ $# perf stat -g -e hns3_pmu_sicl_0/config=0x00002,global=1/ -e hns3_pmu_sicl_0/config=0x10002,global=1/ -I 1000
+
+
+Filter modes
+--------------
+
+1. global mode
+The PMU collects performance statistics for all HNS3 PCIe functions of the
+IO die. Setting the "global" filter option to 1 enables this mode.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,global=1/ -I 1000
+
+2. port mode
+The PMU collects performance statistics of one whole physical port. The
+port id is the same as the MAC id. The "tc" filter option must be set to
+0xF in this mode; here "tc" stands for traffic class.
+
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,port=0,tc=0xF/ -I 1000
+
+3. port-tc mode
+The PMU collects performance statistics of one traffic class of a physical
+port. The port id is the same as the MAC id. The "tc" filter option must be
+set to 0 ~ 7 in this mode.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,port=0,tc=0/ -I 1000
+
+4. func mode
+The PMU collects performance statistics of one PF/VF. The function id is
+the BDF of the PF/VF; its conversion formula is::
+
+ func = (bus << 8) + (device << 3) + (function)
+
+for example::
+
+    BDF          func
+    35:00.0      0x3500
+    35:00.1      0x3501
+    35:01.0      0x3508
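+
+As a quick sanity check, the conversion can be reproduced with shell
+arithmetic (an illustrative sketch using the 35:00.1 entry from the example
+above)::
+
+  $# bus=0x35; dev=0x00; fn=0x1
+  $# printf "func=0x%x\n" $(( (bus << 8) + (dev << 3) + fn ))
+  func=0x3501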
+
+In this mode, the "queue" filter option must be set to 0xFFFF.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,bdf=0x3500,queue=0xFFFF/ -I 1000
+
+5. func-queue mode
+The PMU collects performance statistics of one queue of a PF/VF. The
+function id is the BDF of the PF/VF, and the "queue" filter option must be
+set to the exact queue id of the function.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x1020F,bdf=0x3500,queue=0/ -I 1000
+
+6. func-intr mode
+The PMU collects performance statistics of one interrupt of a PF/VF. The
+function id is the BDF of the PF/VF, and the "intr" filter option must be
+set to the exact interrupt id of the function.
+Example usage of perf::
+
+ $# perf stat -a -e hns3_pmu_sicl_0/config=0x00301,bdf=0x3500,intr=0/ -I 1000
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index 5a8f2529a033..9de64a40adab 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -8,6 +8,8 @@ Performance monitor support
:maxdepth: 1
hisi-pmu
+ hisi-pcie-pmu
+ hns3-pmu
imx-ddr
qcom_l2_pmu
qcom_l3_pmu
@@ -16,3 +18,6 @@ Performance monitor support
xgene-pmu
arm_dsu_pmu
thunderx2-pmu
+ alibaba_pmu
+ nvidia-pmu
+ meson-ddr-pmu
diff --git a/Documentation/admin-guide/perf/meson-ddr-pmu.rst b/Documentation/admin-guide/perf/meson-ddr-pmu.rst
new file mode 100644
index 000000000000..8e71be1d6346
--- /dev/null
+++ b/Documentation/admin-guide/perf/meson-ddr-pmu.rst
@@ -0,0 +1,70 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================================
+Amlogic SoC DDR Bandwidth Performance Monitoring Unit (PMU)
+===========================================================
+
+The Amlogic Meson G12 SoC contains a bandwidth monitor inside the DRAM
+controller. The monitor includes 4 channels. Each channel can count the
+requests accessing DRAM, and can monitor up to 3 AXI ports simultaneously.
+It can be helpful in showing whether the performance bottleneck is DDR
+bandwidth.
+
+Currently, this driver supports the following 5 perf events:
+
++ meson_ddr_bw/total_rw_bytes/
++ meson_ddr_bw/chan_1_rw_bytes/
++ meson_ddr_bw/chan_2_rw_bytes/
++ meson_ddr_bw/chan_3_rw_bytes/
++ meson_ddr_bw/chan_4_rw_bytes/
+
+meson_ddr_bw/chan_{1,2,3,4}_rw_bytes/ events are channel-specific events.
+Each channel supports filtering, which lets the channel monitor an
+individual IP module in the SoC.
+
+Below are DDR access request event filter keywords:
+
++ arm - from CPU
++ vpu_read1 - from OSD + VPP read
++ gpu - from 3D GPU
++ pcie - from PCIe controller
++ hdcp - from HDCP controller
++ hevc_front - from HEVC codec front end
++ usb3_0 - from USB3.0 controller
++ hevc_back - from HEVC codec back end
++ h265enc - from HEVC encoder
++ vpu_read2 - from DI read
++ vpu_write1 - from VDIN write
++ vpu_write2 - from di write
++ vdec - from legacy codec video decoder
++ hcodec - from H264 encoder
++ ge2d - from ge2d
++ spicc1 - from SPI controller 1
++ usb0 - from USB2.0 controller 0
++ dma - from system DMA controller 1
++ arb0 - from arb0
++ sd_emmc_b - from SD eMMC b controller
++ usb1 - from USB2.0 controller 1
++ audio - from Audio module
++ sd_emmc_c - from SD eMMC c controller
++ spicc2 - from SPI controller 2
++ ethernet - from Ethernet controller
+
+
+Examples:
+
+  + Show the total DDR bandwidth per second:
+
+ .. code-block:: bash
+
+ perf stat -a -e meson_ddr_bw/total_rw_bytes/ -I 1000 sleep 10
+
+
+  + Show the individual DDR bandwidth from the CPU and GPU respectively, as
+    well as the sum of the two:
+
+ .. code-block:: bash
+
+ perf stat -a -e meson_ddr_bw/chan_1_rw_bytes,arm=1/ -I 1000 sleep 10
+ perf stat -a -e meson_ddr_bw/chan_2_rw_bytes,gpu=1/ -I 1000 sleep 10
+ perf stat -a -e meson_ddr_bw/chan_3_rw_bytes,arm=1,gpu=1/ -I 1000 sleep 10
+
diff --git a/Documentation/admin-guide/perf/nvidia-pmu.rst b/Documentation/admin-guide/perf/nvidia-pmu.rst
new file mode 100644
index 000000000000..2e0d47cfe7ea
--- /dev/null
+++ b/Documentation/admin-guide/perf/nvidia-pmu.rst
@@ -0,0 +1,299 @@
+=========================================================
+NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
+=========================================================
+
+The NVIDIA Tegra SoC includes various system PMUs to measure key performance
+metrics like memory bandwidth, latency, and utilization:
+
+* Scalable Coherency Fabric (SCF)
+* NVLink-C2C0
+* NVLink-C2C1
+* CNVLink
+* PCIE
+
+PMU Driver
+----------
+
+The PMUs in this document are based on the ARM CoreSight PMU Architecture as
+described in the document ARM IHI 0091. Since this is a standard architecture,
+the PMUs are managed by a common driver, "arm-cs-arch-pmu". This driver
+describes the available events and configuration of each PMU in sysfs. Please
+see the sections below to get the sysfs path of each PMU. Like other uncore
+PMU drivers, the driver provides a "cpumask" sysfs attribute to show the CPU
+id used to handle the PMU event. There is also an "associated_cpus" sysfs
+attribute, which contains a list of CPUs associated with the PMU instance.
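+
+For example, the attributes of the SCF PMU on socket 0 (described in the next
+section) could be inspected as follows; the output is system dependent, so
+none is shown here::
+
+  cat /sys/bus/event_sources/devices/nvidia_scf_pmu_0/cpumask
+  cat /sys/bus/event_sources/devices/nvidia_scf_pmu_0/associated_cpus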
+
+.. _SCF_PMU_Section:
+
+SCF PMU
+-------
+
+The SCF PMU monitors system level cache events, CPU traffic, and
+strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see
+:ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU
+traffic coverage.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_sources/devices/nvidia_scf_pmu_<socket-id>.
+
+Example usage:
+
+* Count event id 0x0 in socket 0::
+
+ perf stat -a -e nvidia_scf_pmu_0/event=0x0/
+
+* Count event id 0x0 in socket 1::
+
+ perf stat -a -e nvidia_scf_pmu_1/event=0x0/
+
+NVLink-C2C0 PMU
+--------------------
+
+The NVLink-C2C0 PMU monitors incoming traffic from a GPU/CPU connected with
+NVLink-C2C (Chip-2-Chip) interconnect. The type of traffic captured by this PMU
+varies depending on the chip configuration:
+
+* NVIDIA Grace Hopper Superchip: Hopper GPU is connected with Grace SoC.
+
+ In this config, the PMU captures GPU ATS translated or EGM traffic from the GPU.
+
+* NVIDIA Grace CPU Superchip: two Grace CPU SoCs are connected.
+
+ In this config, the PMU captures read and relaxed ordered (RO) writes from
+ PCIE device of the remote SoC.
+
+Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
+the PMU traffic coverage.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.
+
+Example usage:
+
+* Count event id 0x0 from the GPU/CPU connected with socket 0::
+
+ perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0/
+
+* Count event id 0x0 from the GPU/CPU connected with socket 1::
+
+ perf stat -a -e nvidia_nvlink_c2c0_pmu_1/event=0x0/
+
+* Count event id 0x0 from the GPU/CPU connected with socket 2::
+
+ perf stat -a -e nvidia_nvlink_c2c0_pmu_2/event=0x0/
+
+* Count event id 0x0 from the GPU/CPU connected with socket 3::
+
+ perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/
+
+NVLink-C2C1 PMU
+-------------------
+
+The NVLink-C2C1 PMU monitors incoming traffic from a GPU connected with
+NVLink-C2C (Chip-2-Chip) interconnect. This PMU captures untranslated GPU
+traffic, in contrast with the NVLink-C2C0 PMU, which captures ATS translated
+traffic.
+Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
+the PMU traffic coverage.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.
+
+Example usage:
+
+* Count event id 0x0 from the GPU connected with socket 0::
+
+ perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0/
+
+* Count event id 0x0 from the GPU connected with socket 1::
+
+ perf stat -a -e nvidia_nvlink_c2c1_pmu_1/event=0x0/
+
+* Count event id 0x0 from the GPU connected with socket 2::
+
+ perf stat -a -e nvidia_nvlink_c2c1_pmu_2/event=0x0/
+
+* Count event id 0x0 from the GPU connected with socket 3::
+
+ perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/
+
+CNVLink PMU
+---------------
+
+The CNVLink PMU monitors traffic from GPU and PCIE device on remote sockets
+to local memory. For PCIE traffic, this PMU captures read and relaxed ordered
+(RO) write traffic. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
+for more info about the PMU traffic coverage.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>.
+
+Each SoC socket can be connected to one or more sockets via CNVLink. The
+user can use the "rem_socket" bitmap parameter to select the remote socket(s)
+to monitor. Each bit represents a socket number, e.g. "rem_socket=0xE"
+corresponds to sockets 1 to 3.
+/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
+shows the valid bits that can be set in the "rem_socket" parameter.
+
+The PMU cannot distinguish the initiator of remote traffic, therefore it does
+not provide a filter to select the traffic source to monitor. It reports
+combined traffic from remote GPU and PCIE devices.
+
+Example usage:
+
+* Count event id 0x0 for the traffic from remote socket 1, 2, and 3 to socket 0::
+
+ perf stat -a -e nvidia_cnvlink_pmu_0/event=0x0,rem_socket=0xE/
+
+* Count event id 0x0 for the traffic from remote socket 0, 2, and 3 to socket 1::
+
+ perf stat -a -e nvidia_cnvlink_pmu_1/event=0x0,rem_socket=0xD/
+
+* Count event id 0x0 for the traffic from remote socket 0, 1, and 3 to socket 2::
+
+ perf stat -a -e nvidia_cnvlink_pmu_2/event=0x0,rem_socket=0xB/
+
+* Count event id 0x0 for the traffic from remote socket 0, 1, and 2 to socket 3::
+
+ perf stat -a -e nvidia_cnvlink_pmu_3/event=0x0,rem_socket=0x7/
+
+
+PCIE PMU
+------------
+
+The PCIE PMU monitors all read/write traffic from PCIE root ports to
+local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section`
+for more info about the PMU traffic coverage.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>.
+
+Each SoC socket can support multiple root ports. The user can use the
+"root_port" bitmap parameter to select the port(s) to monitor, e.g.
+"root_port=0xF" corresponds to root ports 0 to 3.
+/sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
+shows the valid bits that can be set in the "root_port" parameter.
+
+Example usage:
+
+* Count event id 0x0 from root port 0 and 1 of socket 0::
+
+ perf stat -a -e nvidia_pcie_pmu_0/event=0x0,root_port=0x3/
+
+* Count event id 0x0 from root port 0 and 1 of socket 1::
+
+ perf stat -a -e nvidia_pcie_pmu_1/event=0x0,root_port=0x3/
+
+.. _NVIDIA_Uncore_PMU_Traffic_Coverage_Section:
+
+Traffic Coverage
+----------------
+
+The PMU traffic coverage may vary depending on the chip configuration:
+
+* **NVIDIA Grace Hopper Superchip**: Hopper GPU is connected with Grace SoC.
+
+ Example configuration with two Grace SoCs::
+
+ ********************************* *********************************
+ * SOCKET-A * * SOCKET-B *
+ * * * *
+ * :::::::: * * :::::::: *
+ * : PCIE : * * : PCIE : *
+ * :::::::: * * :::::::: *
+ * | * * | *
+ * | * * | *
+ * ::::::: ::::::::: * * ::::::::: ::::::: *
+ * : : : : * * : : : : *
+ * : GPU :<--NVLink-->: Grace :<---CNVLink--->: Grace :<--NVLink-->: GPU : *
+ * : : C2C : SoC : * * : SoC : C2C : : *
+ * ::::::: ::::::::: * * ::::::::: ::::::: *
+ * | | * * | | *
+ * | | * * | | *
+ * &&&&&&&& &&&&&&&& * * &&&&&&&& &&&&&&&& *
+ * & GMEM & & CMEM & * * & CMEM & & GMEM & *
+ * &&&&&&&& &&&&&&&& * * &&&&&&&& &&&&&&&& *
+ * * * *
+ ********************************* *********************************
+
+ GMEM = GPU Memory (e.g. HBM)
+ CMEM = CPU Memory (e.g. LPDDR5X)
+
+ |
+  | The following table shows the traffic coverage of the Grace SoC PMU in socket-A:
+
+ ::
+
+ +--------------+-------+-----------+-----------+-----+----------+----------+
+ | | Source |
+ + +-------+-----------+-----------+-----+----------+----------+
+ | Destination | |GPU ATS |GPU Not-ATS| | Socket-B | Socket-B |
+ | |PCI R/W|Translated,|Translated | CPU | CPU/PCIE1| GPU/PCIE2|
+ | | |EGM | | | | |
+ +==============+=======+===========+===========+=====+==========+==========+
+ | Local | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | SCF PMU | CNVLink |
+ | SYSRAM/CMEM | PMU |PMU |PMU | PMU | | PMU |
+ +--------------+-------+-----------+-----------+-----+----------+----------+
+ | Local GMEM | PCIE | N/A |NVLink-C2C1| SCF | SCF PMU | CNVLink |
+ | | PMU | |PMU | PMU | | PMU |
+ +--------------+-------+-----------+-----------+-----+----------+----------+
+ | Remote | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | | |
+ | SYSRAM/CMEM | PMU |PMU |PMU | PMU | N/A | N/A |
+ | over CNVLink | | | | | | |
+ +--------------+-------+-----------+-----------+-----+----------+----------+
+ | Remote GMEM | PCIE |NVLink-C2C0|NVLink-C2C1| SCF | | |
+ | over CNVLink | PMU |PMU |PMU | PMU | N/A | N/A |
+ +--------------+-------+-----------+-----------+-----+----------+----------+
+
+ PCIE1 traffic represents strongly ordered (SO) writes.
+ PCIE2 traffic represents reads and relaxed ordered (RO) writes.
+
+* **NVIDIA Grace CPU Superchip**: two Grace CPU SoCs are connected.
+
+ Example configuration with two Grace SoCs::
+
+ ******************* *******************
+ * SOCKET-A * * SOCKET-B *
+ * * * *
+ * :::::::: * * :::::::: *
+ * : PCIE : * * : PCIE : *
+ * :::::::: * * :::::::: *
+ * | * * | *
+ * | * * | *
+ * ::::::::: * * ::::::::: *
+ * : : * * : : *
+ * : Grace :<--------NVLink------->: Grace : *
+ * : SoC : * C2C * : SoC : *
+ * ::::::::: * * ::::::::: *
+ * | * * | *
+ * | * * | *
+ * &&&&&&&& * * &&&&&&&& *
+ * & CMEM & * * & CMEM & *
+ * &&&&&&&& * * &&&&&&&& *
+ * * * *
+ ******************* *******************
+
+ GMEM = GPU Memory (e.g. HBM)
+ CMEM = CPU Memory (e.g. LPDDR5X)
+
+ |
+  | The following table shows the traffic coverage of the Grace SoC PMU in socket-A:
+
+ ::
+
+ +-----------------+-----------+---------+----------+-------------+
+ | | Source |
+ + +-----------+---------+----------+-------------+
+ | Destination | | | Socket-B | Socket-B |
+ | | PCI R/W | CPU | CPU/PCIE1| PCIE2 |
+ | | | | | |
+ +=================+===========+=========+==========+=============+
+ | Local | PCIE PMU | SCF PMU | SCF PMU | NVLink-C2C0 |
+ | SYSRAM/CMEM | | | | PMU |
+ +-----------------+-----------+---------+----------+-------------+
+ | Remote | | | | |
+ | SYSRAM/CMEM | PCIE PMU | SCF PMU | N/A | N/A |
+ | over NVLink-C2C | | | | |
+ +-----------------+-----------+---------+----------+-------------+
+
+ PCIE1 traffic represents strongly ordered (SO) writes.
+ PCIE2 traffic represents reads and relaxed ordered (RO) writes.
diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst
new file mode 100644
index 000000000000..1cf40f69278c
--- /dev/null
+++ b/Documentation/admin-guide/pm/amd-pstate.rst
@@ -0,0 +1,720 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============================================
+``amd-pstate`` CPU Performance Scaling Driver
+===============================================
+
+:Copyright: |copy| 2021 Advanced Micro Devices, Inc.
+
+:Author: Huang Rui <ray.huang@amd.com>
+
+
+Introduction
+===================
+
+``amd-pstate`` is the AMD CPU performance scaling driver that introduces a
+new CPU frequency control mechanism on modern AMD APU and CPU series in
+the Linux kernel. The new mechanism is based on Collaborative Processor
+Performance Control (CPPC), which provides finer-grained frequency management
+than legacy ACPI hardware P-States. Current AMD CPU/APU platforms use the
+ACPI P-states driver to manage CPU frequency and clocks, switching among
+only 3 P-states. CPPC replaces the ACPI P-states controls and allows a
+flexible, low-latency interface for the Linux kernel to directly
+communicate performance hints to the hardware.
+
+``amd-pstate`` leverages the Linux kernel governors such as ``schedutil``,
+``ondemand``, etc. to manage the performance hints which are provided by
+CPPC hardware functionality that internally follows the hardware
+specification (for details refer to AMD64 Architecture Programmer's Manual
+Volume 2: System Programming [1]_). Currently, ``amd-pstate`` supports basic
+frequency control function according to kernel governors on some of the
+Zen2 and Zen3 processors, and we will implement more AMD specific functions
+in future after we verify them on the hardware and SBIOS.
+
+
+AMD CPPC Overview
+=======================
+
+The Collaborative Processor Performance Control (CPPC) interface enumerates
+a continuous, abstract, and unit-less performance value on a scale that is
+not tied to a specific performance state / frequency. It is an ACPI
+standard [2]_ through which software can specify application performance
+goals and hints as a relative target to the infrastructure limits. AMD
+processors provide a low-latency register model (MSR) instead of an AML
+code interpreter for performance adjustments. ``amd-pstate`` will initialize
+a ``struct cpufreq_driver`` instance, ``amd_pstate_driver``, with the
+callbacks to manage each performance update behavior. ::
+
+ Highest Perf ------>+-----------------------+ +-----------------------+
+ | | | |
+ | | | |
+ | | Max Perf ---->| |
+ | | | |
+ | | | |
+ Nominal Perf ------>+-----------------------+ +-----------------------+
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | Desired Perf ---->| |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ | | | |
+ Lowest non- | | | |
+ linear perf ------>+-----------------------+ +-----------------------+
+ | | | |
+ | | Lowest perf ---->| |
+ | | | |
+ Lowest perf ------>+-----------------------+ +-----------------------+
+ | | | |
+ | | | |
+ | | | |
+ 0 ------>+-----------------------+ +-----------------------+
+
+ AMD P-States Performance Scale
+
+
+.. _perf_cap:
+
+AMD CPPC Performance Capability
+--------------------------------
+
+Highest Performance (RO)
+.........................
+
+This is the absolute maximum performance an individual processor may reach,
+assuming ideal conditions. This performance level may not be sustainable
+for long durations and may only be achievable if other platform components
+are in a specific state; for example, it may require other processors to be in
+an idle state. This would be equivalent to the highest frequencies
+supported by the processor.
+
+Nominal (Guaranteed) Performance (RO)
+......................................
+
+This is the maximum sustained performance level of the processor, assuming
+ideal operating conditions. In the absence of an external constraint (power,
+thermal, etc.), this is the performance level the processor is expected to
+be able to maintain continuously. All cores/processors are expected to be
+able to sustain their nominal performance state simultaneously.
+
+Lowest non-linear Performance (RO)
+...................................
+
+This is the lowest performance level at which nonlinear power savings are
+achieved, for example, due to the combined effects of voltage and frequency
+scaling. Above this threshold, lower performance levels should be generally
+more energy efficient than higher performance levels. This register
+effectively conveys the most efficient performance level to ``amd-pstate``.
+
+Lowest Performance (RO)
+........................
+
+This is the absolute lowest performance level of the processor. Selecting a
+performance level lower than the lowest nonlinear performance level may
+cause an efficiency penalty but should reduce the instantaneous power
+consumption of the processor.
+
+AMD CPPC Performance Control
+------------------------------
+
+``amd-pstate`` passes performance goals through these registers. The
+registers drive the behavior of the desired performance target.
+
+Minimum requested performance (RW)
+...................................
+
+``amd-pstate`` specifies the minimum allowed performance level.
+
+Maximum requested performance (RW)
+...................................
+
+``amd-pstate`` specifies a limit on the maximum performance that is expected
+to be supplied by the hardware.
+
+Desired performance target (RW)
+...................................
+
+``amd-pstate`` specifies a desired target in the CPPC performance scale as
+a relative number. This can be expressed as a percentage of nominal
+performance (infrastructure max). Below the nominal sustained performance
+level, desired performance expresses the average performance level of the
+processor subject to hardware limits. Above the nominal performance level,
+the processor must provide at least the nominal performance requested and
+go higher if current operating conditions allow.
+
+Energy Performance Preference (EPP) (RW)
+.........................................
+
+This attribute provides a hint to the hardware indicating whether software
+wants to bias toward performance (0x0) or energy efficiency (0xff).
+
+
+Key Governors Support
+=======================
+
+``amd-pstate`` can be used with all the (generic) scaling governors listed
+by the ``scaling_available_governors`` policy attribute in ``sysfs``. Then,
+it is responsible for the configuration of policy objects corresponding to
+CPUs and provides the ``CPUFreq`` core (and the scaling governors attached
+to the policy objects) with accurate information on the maximum and minimum
+operating frequencies supported by the hardware. Users can check the
+``scaling_cur_freq`` information, which comes from the ``CPUFreq`` core.
+
+``amd-pstate`` mainly supports ``schedutil`` and ``ondemand`` for dynamic
+frequency control. The ``schedutil`` governor, together with the CPU CFS
+scheduler, is used to fine-tune the processor configuration on
+``amd-pstate``. ``amd-pstate`` registers the adjust_perf callback to
+implement performance update behavior similar to CPPC. The callback is
+initialized by ``sugov_start``, which populates the CPU's update_util_data
+pointer to assign ``sugov_update_single_perf`` as the utilization update
+callback function in the CPU scheduler. The CPU scheduler will call
+``cpufreq_update_util`` and assign the target performance according to the
+``struct sugov_cpu`` that the utilization update belongs to. Then,
+``amd-pstate`` updates the desired performance according to the target
+performance assigned by the CPU scheduler.
+
+.. _processor_support:
+
+Processor Support
+=======================
+
+The ``amd-pstate`` initialization will fail if the ``_CPC`` entry in the ACPI
+SBIOS does not exist for the detected processor. It uses ``acpi_cpc_valid``
+to check the existence of ``_CPC``. All Zen based processors support the legacy
+ACPI hardware P-States function, so when ``amd-pstate`` fails initialization,
+the kernel will fall back to initialize the ``acpi-cpufreq`` driver.
+
+There are two types of hardware implementations for ``amd-pstate``: one is
+`Full MSR Support <perf_cap_>`_ and another is `Shared Memory Support
+<perf_cap_>`_. It can use the :c:macro:`X86_FEATURE_CPPC` feature flag to
+indicate the different types. (For details, refer to the Processor Programming
+Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors [3]_.)
+``amd-pstate`` registers different ``static_call`` instances for the
+different hardware implementations.
+
+Currently, some of the Zen2 and Zen3 processors support ``amd-pstate``. In the
+future, it will be supported on more and more AMD processors.
+
+Full MSR Support
+-----------------
+
+Some new Zen3 processors such as Cezanne provide the MSR registers directly
+while the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is set.
+``amd-pstate`` can handle the MSR register to implement the fast switch
+function in ``CPUFreq`` that can reduce the latency of frequency control in
+interrupt context. The functions with a ``pstate_xxx`` prefix represent the
+operations on MSR registers.
+
+Shared Memory Support
+----------------------
+
+If the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is not set, the
+processor supports the shared memory solution. In this case, ``amd-pstate``
+uses the ``cppc_acpi`` helper methods to implement the callback functions
+that are defined on ``static_call``. The functions with the ``cppc_xxx`` prefix
+represent the operations of ACPI CPPC helpers for the shared memory solution.
+
+
+A processor can support both AMD P-States and ACPI hardware P-States at the
+same time, but AMD P-States has the higher priority: if it is enabled with
+:c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, the processor will
+respond to requests from AMD P-States.
+
+
+User Space Interface in ``sysfs`` - Per-policy control
+======================================================
+
+``amd-pstate`` exposes several attributes (files) in ``sysfs`` to control
+its functionality per CPU policy. They are located in the
+``/sys/devices/system/cpu/cpufreq/policyX/`` directory of each policy. ::
+
+ root@hr-test1:/home/ray# ls /sys/devices/system/cpu/cpufreq/policy0/*amd*
+ /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_highest_perf
+ /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_lowest_nonlinear_freq
+ /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_max_freq
+
+
+``amd_pstate_highest_perf / amd_pstate_max_freq``
+
+Maximum CPPC performance and CPU frequency that the driver is allowed to
+set, in percent of the maximum supported CPPC performance level (the highest
+performance supported in `AMD CPPC Performance Capability <perf_cap_>`_).
+In some ASICs, the highest CPPC performance is not the one in the ``_CPC``
+table, so we need to expose it to sysfs. If boost is not active, but
+still supported, this maximum frequency will be larger than the one in
+``cpuinfo``.
+This attribute is read-only.
+
+``amd_pstate_lowest_nonlinear_freq``
+
+The lowest non-linear CPPC CPU frequency that the driver is allowed to set,
+in percent of the maximum supported CPPC performance level. (Please see the
+lowest non-linear performance in `AMD CPPC Performance Capability
+<perf_cap_>`_.)
+This attribute is read-only.
+
+``energy_performance_available_preferences``
+
+A list of all the supported EPP preferences that could be used for
+``energy_performance_preference`` on this system.
+These profiles represent different hints that are provided to the low-level
+firmware about the user's desired energy vs performance tradeoff.
+``default`` means the EPP value is set by the platform firmware. This
+attribute is read-only.
+
+``energy_performance_preference``
+
+The current energy performance preference can be read from this attribute,
+and the user can change the current preference according to energy or
+performance needs. Please get the list of all supported profiles from the
+``energy_performance_available_preferences`` attribute. All the profiles are
+integer values defined between 0 and 255 when the EPP feature is enabled by
+platform firmware; if the EPP feature is disabled, the driver will ignore
+the written value.
+This attribute is read-write.
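+
+A hedged example of reading and setting the preference (``policy0`` and the
+profile name are illustrative; check the available preferences on the target
+system first)::
+
+  $ cat /sys/devices/system/cpu/cpufreq/policy0/energy_performance_available_preferences
+  $ echo power | sudo tee /sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference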
+
+Other performance and frequency values can be read back from
+``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`.
+
+
+``amd-pstate`` vs ``acpi-cpufreq``
+======================================
+
+On the majority of AMD platforms supported by ``acpi-cpufreq``, the ACPI
+tables provided by the platform firmware are used for CPU performance
+scaling, but they only provide 3 P-states on AMD processors. However, on
+modern AMD APU and CPU series, the hardware provides Collaborative Processor
+Performance Control according to the ACPI protocol, customized for AMD
+platforms: fine-grained and continuous frequency ranges instead of the
+legacy hardware P-states. ``amd-pstate`` is the kernel module which supports
+the new AMD P-States mechanism on most future AMD platforms. The AMD
+P-States mechanism provides better performance and energy efficiency in
+frequency management on AMD processors.
+
+
+AMD Pstate Driver Operation Modes
+=================================
+
+``amd_pstate`` CPPC has 3 operation modes: autonomous (active) mode,
+non-autonomous (passive) mode and guided autonomous (guided) mode.
+The active, passive, or guided mode is selected by passing the
+corresponding kernel command line parameter, as described below.
+
+- In autonomous mode, the platform ignores the desired performance level
+  request and takes into account only the values set in the minimum, maximum
+  and energy performance preference registers.
+- In non-autonomous mode, the platform gets the desired performance level
+  from the OS directly through the Desired Performance Register.
+- In guided-autonomous mode, the platform sets the operating performance
+  level autonomously according to the current workload and within the limits
+  set by the OS through the min and max performance registers.
+
+Active Mode
+------------
+
+``amd_pstate=active``
+
+This is the low-level firmware control mode, implemented by the
+``amd_pstate_epp`` driver when ``amd_pstate=active`` is passed to the kernel
+on the command line. In this mode, the ``amd_pstate_epp`` driver provides a
+hint to the CPPC firmware indicating whether software wants to bias toward
+performance (0x0) or energy efficiency (0xff). The CPPC power algorithm then
+calculates the runtime workload and adjusts the realtime core frequency
+according to the power supply, thermal, core voltage and some other hardware
+conditions.
+
+Passive Mode
+------------
+
+``amd_pstate=passive``
+
+This mode is enabled when ``amd_pstate=passive`` is passed to the kernel on
+the command line. In this mode, the ``amd_pstate`` driver software specifies
+a desired QoS target in the CPPC performance scale as a relative number. This
+can be expressed as a percentage of nominal performance (infrastructure max).
+Below the nominal sustained performance level, desired performance expresses
+the average performance level of the processor subject to the Performance
+Reduction Tolerance register. Above the nominal performance level, the
+processor must provide at least the nominal performance requested and go
+higher if current operating conditions allow.
+
+Guided Mode
+-----------
+
+``amd_pstate=guided``
+
+If ``amd_pstate=guided`` is passed on the kernel command line, this mode is
+activated. In this mode, the driver requests minimum and maximum performance
+levels, and the platform autonomously selects a performance level within this
+range that is appropriate to the current workload.
+
+User Space Interface in ``sysfs`` - General
+===========================================
+
+Global Attributes
+-----------------
+
+``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
+control its functionality at the system level. They are located in the
+``/sys/devices/system/cpu/amd-pstate/`` directory and affect all CPUs.
+
+``status``
+        Operation mode of the driver: "active", "passive", "guided" or "disable".
+
+ "active"
+ The driver is functional and in the ``active mode``
+
+ "passive"
+ The driver is functional and in the ``passive mode``
+
+ "guided"
+ The driver is functional and in the ``guided mode``
+
+ "disable"
+ The driver is unregistered and not functional now.
+
+        This attribute can be written to in order to change the driver's
+        operation mode or to unregister it. The string written to it must be
+        one of the possible values of this attribute and, if successful,
+        writing one of these values to the sysfs file will cause the driver
+        to switch over to the operation mode represented by that string - or
+        to be unregistered in the "disable" case.
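+
+        For instance, a mode switch could be driven from the shell (a sketch;
+        it requires root, and "passive" is just an example target mode)::
+
+            $ cat /sys/devices/system/cpu/amd-pstate/status
+            $ echo passive | sudo tee /sys/devices/system/cpu/amd-pstate/status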
+
+``cpupower`` tool support for ``amd-pstate``
+===============================================
+
+``amd-pstate`` is supported by the ``cpupower`` tool, which can be used to dump
+frequency information. Development is in progress to support more and more
+operations for the new ``amd-pstate`` module with this tool. ::
+
+ root@hr-test1:/home/ray# cpupower frequency-info
+ analyzing CPU 0:
+ driver: amd-pstate
+ CPUs which run at the same hardware frequency: 0
+ CPUs which need to have their frequency coordinated by software: 0
+ maximum transition latency: 131 us
+ hardware limits: 400 MHz - 4.68 GHz
+ available cpufreq governors: ondemand conservative powersave userspace performance schedutil
+ current policy: frequency should be within 400 MHz and 4.68 GHz.
+ The governor "schedutil" may decide which speed to use
+ within this range.
+ current CPU frequency: Unable to call hardware
+ current CPU frequency: 4.02 GHz (asserted by call to kernel)
+ boost state support:
+ Supported: yes
+ Active: yes
+ AMD PSTATE Highest Performance: 166. Maximum Frequency: 4.68 GHz.
+ AMD PSTATE Nominal Performance: 117. Nominal Frequency: 3.30 GHz.
+ AMD PSTATE Lowest Non-linear Performance: 39. Lowest Non-linear Frequency: 1.10 GHz.
+ AMD PSTATE Lowest Performance: 15. Lowest Frequency: 400 MHz.
+
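+The performance numbers and frequencies above scale together: dividing a
+performance level by the highest performance level and multiplying by the
+maximum frequency approximates that level's frequency. A worked check of the
+nominal figures from this output (a sketch; the exact mapping is defined by
+the hardware)::
+
+  $ awk 'BEGIN { printf "%.2f GHz\n", 117 / 166 * 4.68 }'
+  3.30 GHz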
+
+Diagnostics and Tuning
+=======================
+
+Trace Events
+--------------
+
+There are two static trace events that can be used for ``amd-pstate``
+diagnostics. One of them is the ``cpu_frequency`` trace event generally used
+by ``CPUFreq``, and the other one is the ``amd_pstate_perf`` trace event
+specific to ``amd-pstate``. The following sequence of shell commands can
+be used to enable them and see their output (if the kernel is
+configured to support event tracing). ::
+
+ root@hr-test1:/home/ray# cd /sys/kernel/tracing/
+ root@hr-test1:/sys/kernel/tracing# echo 1 > events/amd_cpu/enable
+ root@hr-test1:/sys/kernel/tracing# cat trace
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 47827/42233061 #P:2
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / delay
+ # TASK-PID CPU# |||| TIMESTAMP FUNCTION
+ # | | | |||| | |
+ <idle>-0 [015] dN... 4995.979886: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=15 changed=false fast_switch=true
+ <idle>-0 [007] d.h.. 4995.979893: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
+ cat-2161 [000] d.... 4995.980841: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=0 changed=false fast_switch=true
+ sshd-2125 [004] d.s.. 4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=4 changed=false fast_switch=true
+ <idle>-0 [007] d.s.. 4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
+ <idle>-0 [003] d.s.. 4995.980971: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=3 changed=false fast_switch=true
+ <idle>-0 [011] d.s.. 4995.980996: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=11 changed=false fast_switch=true
+
+The ``cpu_frequency`` trace event will be triggered either by the ``schedutil`` scaling
+governor (for the policies it is attached to), or by the ``CPUFreq`` core (for the
+policies with other scaling governors).
+
+
+Tracer Tool
+-------------
+
+``amd_pstate_tracer.py`` can record and parse the ``amd-pstate`` trace log,
+then generate performance plots. This utility can be used to debug and tune
+the performance of the ``amd-pstate`` driver. The tracer tool needs to import
+the intel pstate tracer.
+
+The tracer tool is located in ``linux/tools/power/x86/amd_pstate_tracer``. It
+can be used in two ways. If a trace file is available, parse the file
+directly with the command ::
+
+ ./amd_pstate_trace.py [-c cpus] -t <trace_file> -n <test_name>
+
+Or generate the trace file with root privilege, then parse and plot with
+the command ::
+
+ sudo ./amd_pstate_trace.py [-c cpus] -n <test_name> -i <interval> [-m kbytes]
+
+The test results can be found in ``results/test_name``. The following is an
+example of part of the output. ::
+
+ common_cpu common_secs common_usecs min_perf des_perf max_perf freq mperf apef tsc load duration_ms sample_num elapsed_time common_comm
+ CPU_005 712 116384 39 49 166 0.7565 9645075 2214891 38431470 25.1 11.646 469 2.496 kworker/5:0-40
+ CPU_006 712 116408 39 49 166 0.6769 8950227 1839034 37192089 24.06 11.272 470 2.496 kworker/6:0-1264
+
+Unit Tests for amd-pstate
+-------------------------
+
+``amd-pstate-ut`` is a test module for testing the ``amd-pstate`` driver.
+
+ * It can help all users to verify their processor support (SBIOS/Firmware or Hardware).
+
+ * The kernel can include a basic function test to avoid kernel regressions during updates.
+
+ * More functional or performance tests can be introduced to align the results; this will benefit power and performance scale optimization.
+
+1. Test case descriptions
+
+ 1). Basic tests
+
+   Test the prerequisites and basic functions of the ``amd-pstate`` driver.
+
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | Index | Functions | Description |
+ +=========+================================+====================================================================================+
+ | 1 | amd_pstate_ut_acpi_cpc_valid || Check whether the _CPC object is present in SBIOS. |
+ | | || |
+ | | || The detail refer to `Processor Support <processor_support_>`_. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | 2 | amd_pstate_ut_check_enabled || Check whether AMD P-State is enabled. |
+ | | || |
+ | | || AMD P-States and ACPI hardware P-States always can be supported in one processor. |
+ | | | But AMD P-States has the higher priority and if it is enabled with |
+ | | | :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, it will respond to the |
+ | | | request from AMD P-States. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | 3 | amd_pstate_ut_check_perf || Check if the each performance values are reasonable. |
+ | | || highest_perf >= nominal_perf > lowest_nonlinear_perf > lowest_perf > 0. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+ | 4 | amd_pstate_ut_check_freq || Check if the each frequency values and max freq when set support boost mode |
+ | | | are reasonable. |
+ | | || max_freq >= nominal_freq > lowest_nonlinear_freq > min_freq > 0 |
+ | | || If boost is not active but supported, this maximum frequency will be larger than |
+ | | | the one in ``cpuinfo``. |
+ +---------+--------------------------------+------------------------------------------------------------------------------------+
+
+ 2). Tbench test
+
+   Test and monitor the CPU changes when running the tbench benchmark under
+   the specified governor. These changes include desired performance,
+   frequency, load, performance, energy, etc. The specified governor is
+   ondemand or schedutil. Tbench can also be tested with the ``acpi-cpufreq``
+   kernel driver for comparison.
+
+ 3). Gitsource test
+
+   Test and monitor the CPU changes when running the gitsource benchmark
+   under the specified governor. These changes include desired performance,
+   frequency, load, time, energy, etc. The specified governor is ondemand or
+   schedutil. Gitsource can also be tested with the ``acpi-cpufreq`` kernel
+   driver for comparison.
+
+#. How to execute the tests
+
+   We use a test module in the kselftest framework to implement it. We
+   create the ``amd-pstate-ut`` module and tie it into kselftest (for
+   details refer to Linux Kernel Selftests [4]_).
+
+ 1). Build
+
+   + enable the :c:macro:`CONFIG_X86_AMD_PSTATE` configuration option.
+ + set the :c:macro:`CONFIG_X86_AMD_PSTATE_UT` configuration option to M.
+ + make project
+ + make selftest ::
+
+ $ cd linux
+ $ make -C tools/testing/selftests
+
+ + make perf ::
+
+ $ cd tools/perf/
+ $ make
+
+
+ 2). Installation & Steps ::
+
+ $ make -C tools/testing/selftests install INSTALL_PATH=~/kselftest
+ $ cp tools/perf/perf /usr/bin/perf
+ $ sudo ./kselftest/run_kselftest.sh -c amd-pstate
+
+ 3). Specified test case ::
+
+ $ cd ~/kselftest/amd-pstate
+ $ sudo ./run.sh -t basic
+ $ sudo ./run.sh -t tbench
+ $ sudo ./run.sh -t tbench -m acpi-cpufreq
+ $ sudo ./run.sh -t gitsource
+ $ sudo ./run.sh -t gitsource -m acpi-cpufreq
+ $ ./run.sh --help
+ ./run.sh: illegal option -- -
+ Usage: ./run.sh [OPTION...]
+ [-h <help>]
+ [-o <output-file-for-dump>]
+ [-c <all: All testing,
+ basic: Basic testing,
+ tbench: Tbench testing,
+ gitsource: Gitsource testing.>]
+ [-t <tbench time limit>]
+ [-p <tbench process number>]
+ [-l <loop times for tbench>]
+ [-i <amd tracer interval>]
+ [-m <comparative test: acpi-cpufreq>]
+
+
+ 4). Results
+
+ + basic
+
+    When the test finishes, you will get the following log info ::
+
+ $ dmesg | grep "amd_pstate_ut" | tee log.txt
+ [12977.570663] amd_pstate_ut: 1 amd_pstate_ut_acpi_cpc_valid success!
+ [12977.570673] amd_pstate_ut: 2 amd_pstate_ut_check_enabled success!
+ [12977.571207] amd_pstate_ut: 3 amd_pstate_ut_check_perf success!
+ [12977.571212] amd_pstate_ut: 4 amd_pstate_ut_check_freq success!
+
+ + tbench
+
+    When the test finishes, you will get selftest.tbench.csv and png images.
+    The selftest.tbench.csv file contains the raw data and the drop of the
+    comparative test. The png images show the performance, energy and
+    performance per watt of each test.
+    Open selftest.tbench.csv:
+
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + Governor | Round | Des-perf | Freq | Load | Performance | Energy | Performance Per Watt |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + Unit | | | GHz | | MB/s | J | MB/J |
+ +=================================================+==============+==========+=========+==========+=============+=========+======================+
+ + amd-pstate-ondemand | 1 | | | | 2504.05 | 1563.67 | 158.5378 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | 2 | | | | 2243.64 | 1430.32 | 155.2941 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | 3 | | | | 2183.88 | 1401.32 | 154.2860 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | Average | | | | 2310.52 | 1465.1 | 156.1268 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 1 | 165.329 | 1.62257 | 99.798 | 2136.54 | 1395.26 | 151.5971 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 2 | 166 | 1.49761 | 99.9993 | 2100.56 | 1380.5 | 150.6377 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 3 | 166 | 1.47806 | 99.9993 | 2084.12 | 1375.76 | 149.9737 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | Average | 165.776 | 1.53275 | 99.9322 | 2107.07 | 1383.84 | 150.7399 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 1 | | | | 2529.9 | 1564.4 | 160.0997 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 2 | | | | 2249.76 | 1432.97 | 155.4297 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 3 | | | | 2181.46 | 1406.88 | 153.5060 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | Average | | | | 2320.37 | 1468.08 | 156.4741 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 1 | | | | 2137.64 | 1385.24 | 152.7723 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 2 | | | | 2107.05 | 1372.23 | 152.0138 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 3 | | | | 2085.86 | 1365.35 | 151.2433 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | Average | | | | 2110.18 | 1374.27 | 152.0136 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand VS acpi-cpufreq-schedutil | Comprison(%) | | | | -9.0584 | -6.3899 | -2.8506 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand VS amd-pstate-schedutil | Comprison(%) | | | | 8.8053 | -5.5463 | -3.4503 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand VS amd-pstate-ondemand | Comprison(%) | | | | -0.4245 | -0.2029 | -0.2219 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil VS amd-pstate-schedutil | Comprison(%) | | | | -0.1473 | 0.6963 | -0.8378 |
+ +-------------------------------------------------+--------------+----------+---------+----------+-------------+---------+----------------------+
+
+ + gitsource
+
+    When the test finishes, you will get selftest.gitsource.csv and png
+    images. The selftest.gitsource.csv file contains the raw data and the
+    drop of the comparative test. The png images show the performance,
+    energy and performance per watt of each test.
+    Open selftest.gitsource.csv:
+
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + Governor | Round | Des-perf | Freq | Load | Time | Energy | Performance Per Watt |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + Unit | | | GHz | | s | J | 1/J |
+ +=================================================+==============+==========+==========+==========+=============+=========+======================+
+ + amd-pstate-ondemand | 1 | 50.119 | 2.10509 | 23.3076 | 475.69 | 865.78 | 0.001155027 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | 2 | 94.8006 | 1.98771 | 56.6533 | 467.1 | 839.67 | 0.001190944 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | 3 | 76.6091 | 2.53251 | 43.7791 | 467.69 | 855.85 | 0.001168429 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand | Average | 73.8429 | 2.20844 | 41.2467 | 470.16 | 853.767 | 0.001171279 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 1 | 165.919 | 1.62319 | 98.3868 | 464.17 | 866.8 | 0.001153668 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 2 | 165.97 | 1.31309 | 99.5712 | 480.15 | 880.4 | 0.001135847 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | 3 | 165.973 | 1.28448 | 99.9252 | 481.79 | 867.02 | 0.001153375 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-schedutil | Average | 165.954 | 1.40692 | 99.2944 | 475.37 | 871.407 | 0.001147569 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 1 | | | | 2379.62 | 742.96 | 0.001345967 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 2 | | | | 441.74 | 817.49 | 0.001223256 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | 3 | | | | 455.48 | 820.01 | 0.001219497 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand | Average | | | | 425.613 | 793.487 | 0.001260260 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 1 | | | | 459.69 | 838.54 | 0.001192548 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 2 | | | | 466.55 | 830.89 | 0.001203528 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | 3 | | | | 470.38 | 837.32 | 0.001194286 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil | Average | | | | 465.54 | 835.583 | 0.001196769 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand VS acpi-cpufreq-schedutil | Comprison(%) | | | | 9.3810 | 5.3051 | -5.0379 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + amd-pstate-ondemand VS amd-pstate-schedutil | Comprison(%) | 124.7392 | -36.2934 | 140.7329 | 1.1081 | 2.0661 | -2.0242 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-ondemand VS amd-pstate-ondemand | Comprison(%) | | | | 10.4665 | 7.5968 | -7.0605 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+ + acpi-cpufreq-schedutil VS amd-pstate-schedutil | Comprison(%) | | | | 2.1115 | 4.2873 | -4.1110 |
+ +-------------------------------------------------+--------------+----------+----------+----------+-------------+---------+----------------------+
+
+Reference
+===========
+
+.. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming,
+ https://www.amd.com/system/files/TechDocs/24593.pdf
+
+.. [2] Advanced Configuration and Power Interface Specification,
+ https://uefi.org/sites/default/files/resources/ACPI_Spec_6_4_Jan22.pdf
+
+.. [3] Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors
+ https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip
+
+.. [4] Linux Kernel Selftests,
+ https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst
index aec2cd2aaea7..19754beb5a4e 100644
--- a/Documentation/admin-guide/pm/cpuidle.rst
+++ b/Documentation/admin-guide/pm/cpuidle.rst
@@ -612,8 +612,8 @@ the ``menu`` governor to be used on the systems that use the ``ladder`` governor
by default this way, for example.
The other kernel command line parameters controlling CPU idle time management
-described below are only relevant for the *x86* architecture and some of
-them affect Intel processors only.
+described below are only relevant for the *x86* architecture and references
+to ``intel_idle`` affect Intel processors only.
The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
@@ -635,10 +635,13 @@ idle, so it very well may hurt single-thread computations performance as well as
energy-efficiency. Thus using it for performance reasons may not be a good idea
at all.]
-The ``idle=nomwait`` option disables the ``intel_idle`` driver and causes
-``acpi_idle`` to be used (as long as all of the information needed by it is
-there in the system's ACPI tables), but it is not allowed to use the
-``MWAIT`` instruction of the CPUs to ask the hardware to enter idle states.
+The ``idle=nomwait`` option prevents the use of ``MWAIT`` instruction of
+the CPU to enter idle states. When this option is used, the ``acpi_idle``
+driver will use the ``HLT`` instruction instead of ``MWAIT``. On systems
+running Intel processors, this option disables the ``intel_idle`` driver
+and forces the use of the ``acpi_idle`` driver instead. Note that in either
+case, ``acpi_idle`` driver will function only if all the information needed
+by it is in the system's ACPI tables.
In addition to the architecture-level kernel command line options affecting CPU
idle time management, there are parameters affecting individual ``CPUIdle``
diff --git a/Documentation/admin-guide/pm/intel-speed-select.rst b/Documentation/admin-guide/pm/intel-speed-select.rst
index 0a1fbdb54bfe..a2bfb971654f 100644
--- a/Documentation/admin-guide/pm/intel-speed-select.rst
+++ b/Documentation/admin-guide/pm/intel-speed-select.rst
@@ -262,6 +262,28 @@ Which shows that the base frequency now increased from 2600 MHz at performance
level 0 to 2800 MHz at performance level 4. As a result, any workload, which can
use fewer CPUs, can see a boost of 200 MHz compared to performance level 0.
+Changing performance level via BMC Interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is possible to change the SST-PP level using an out of band (OOB) agent
+(via some remote management console, through the BMC "Baseboard Management
+Controller" interface). This mode is supported from the Sapphire Rapids
+processor generation onwards. The kernel and tool changes to support this
+mode were added in Linux kernel version 5.18. To enable this feature, the
+kernel config "CONFIG_INTEL_HFI_THERMAL" is required. The minimum tool
+version to support this feature is "v1.12", which is part of Linux kernel
+version 5.18.
+
+To support such configurations, this tool can be used as a daemon. Add the
+command line option ``--oob``::
+
+ # intel-speed-select --oob
+ Intel(R) Speed Select Technology
+ Executing on CPU model:143[0x8f]
+ OOB mode is enabled and will run as daemon
+
+In this mode the tool will online/offline CPUs based on the new performance
+level.
+
Check presence of other Intel(R) SST features
---------------------------------------------
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index d5043cd8d2f5..bf13ad25a32f 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -712,7 +712,7 @@ it works in the `active mode <Active Mode_>`_.
The following sequence of shell commands can be used to enable them and see
their output (if the kernel is generally configured to support event tracing)::
- # cd /sys/kernel/debug/tracing/
+ # cd /sys/kernel/tracing/
# echo 1 > events/power/pstate_sample/enable
# echo 1 > events/power/cpu_frequency/enable
# cat trace
@@ -732,7 +732,7 @@ The ``ftrace`` interface can be used for low-level diagnostics of
P-state is called, the ``ftrace`` filter can be set to
:c:func:`intel_pstate_set_pstate`::
- # cd /sys/kernel/debug/tracing/
+ # cd /sys/kernel/tracing/
# cat available_filter_functions | grep -i pstate
intel_pstate_set_pstate
intel_pstate_cpu_init
diff --git a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
new file mode 100644
index 000000000000..09169d935835
--- /dev/null
+++ b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
@@ -0,0 +1,60 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+==============================
+Intel Uncore Frequency Scaling
+==============================
+
+:Copyright: |copy| 2022 Intel Corporation
+
+:Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
+
+Introduction
+------------
+
+Depending on the workload characteristics, the uncore can consume a
+significant amount of power in Intel's Xeon servers. To optimize the total
+power and improve overall performance, SoCs have internal algorithms for
+scaling the uncore frequency. These algorithms monitor the workload usage of
+the uncore and set a desirable frequency.
+
+It is possible that users have different expectations of uncore performance and
+want to have control over it. The objective is similar to allowing users to set
+the scaling min/max frequencies via cpufreq sysfs to improve CPU performance.
+Users may have some latency sensitive workloads where they do not want any
+change to uncore frequency. Also, users may have workloads which require
+different core and uncore performance at distinct phases and they may want to
+use both cpufreq and the uncore scaling interface to distribute power and
+improve overall performance.
+
+Sysfs Interface
+---------------
+
+To control uncore frequency, a sysfs interface is provided in the directory:
+`/sys/devices/system/cpu/intel_uncore_frequency/`.
+
+There is one directory for each package/die combination, as the scope of
+uncore scaling control is per die in SoCs with multiple dies per package, or
+per package in SoCs with a single die per package. The directory name
+represents the scope of control. For example: 'package_00_die_00' is for
+package id 0 and die 0.
+
+Each package_*_die_* contains the following attributes:
+
+``initial_max_freq_khz``
+ Out of reset, this attribute represents the maximum possible frequency.
+ This is a read-only attribute. If users adjust max_freq_khz,
+ they can always go back to maximum using the value from this attribute.
+
+``initial_min_freq_khz``
+ Out of reset, this attribute represents the minimum possible frequency.
+ This is a read-only attribute. If users adjust min_freq_khz,
+ they can always go back to minimum using the value from this attribute.
+
+``max_freq_khz``
+ This attribute is used to set the maximum uncore frequency.
+
+``min_freq_khz``
+ This attribute is used to set the minimum uncore frequency.
+
+``current_freq_khz``
+ This attribute is used to get the current uncore frequency.
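+
+For example, a session like the following caps the maximum uncore frequency
+of package 0, die 0 (a sketch: the directory name and all values shown are
+illustrative and depend on the system)::
+
+ # cd /sys/devices/system/cpu/intel_uncore_frequency/package_00_die_00
+ # cat initial_max_freq_khz
+ 2400000
+ # echo 2000000 > max_freq_khz
+ # cat current_freq_khz
+ 2000000
+
+Writing the value of ``initial_max_freq_khz`` back to ``max_freq_khz``
+restores the default limit.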
diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst
index f40994c422dc..ee45887811ff 100644
--- a/Documentation/admin-guide/pm/working-state.rst
+++ b/Documentation/admin-guide/pm/working-state.rst
@@ -11,6 +11,8 @@ Working-State Power Management
intel_idle
cpufreq
intel_pstate
+ amd-pstate
cpufreq_drivers
intel_epb
intel-speed-select
+ intel_uncore_frequency_scaling
diff --git a/Documentation/admin-guide/quickly-build-trimmed-linux.rst b/Documentation/admin-guide/quickly-build-trimmed-linux.rst
new file mode 100644
index 000000000000..ff4f4cc8522b
--- /dev/null
+++ b/Documentation/admin-guide/quickly-build-trimmed-linux.rst
@@ -0,0 +1,1092 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+.. [see the bottom of this file for redistribution information]
+
+===========================================
+How to quickly build a trimmed Linux kernel
+===========================================
+
+This guide explains how to swiftly build Linux kernels that are ideal for
+testing purposes, but perfectly fine for day-to-day use, too.
+
+The essence of the process (aka 'TL;DR')
+========================================
+
+*[If you are new to compiling Linux, ignore this TLDR and head over to the next
+section below: it contains a step-by-step guide, which is more detailed, but
+still brief and easy to follow; that guide and its accompanying reference
+section also mention alternatives, pitfalls, and additional aspects, all of
+which might be relevant for you.]*
+
+If your system uses techniques like Secure Boot, prepare it to permit starting
+self-compiled Linux kernels; install compilers and everything else needed for
+building Linux; make sure to have 12 Gigabyte free space in your home directory.
+Now run the following commands to download fresh Linux mainline sources, which
+you then use to configure, build and install your own kernel::
+
+ git clone --depth 1 -b master \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git ~/linux/
+ cd ~/linux/
+ # Hint: if you want to apply patches, do it at this point. See below for details.
+ # Hint: it's recommended to tag your build at this point. See below for details.
+ yes "" | make localmodconfig
+ # Hint: at this point you might want to adjust the build configuration; you'll
+ # have to, if you are running Debian. See below for details.
+ make -j $(nproc --all)
+ # Note: on many commodity distributions the next command suffices, but on Arch
+ # Linux, its derivatives, and some others it does not. See below for details.
+ command -v installkernel && sudo make modules_install install
+ reboot
+
+If you later want to build a newer mainline snapshot, use these commands::
+
+ cd ~/linux/
+ git fetch --depth 1 origin
+ # Note: the next command will discard any changes you did to the code:
+ git checkout --force --detach origin/master
+ # Reminder: if you want to (re)apply patches, do it at this point.
+ # Reminder: you might want to add or modify a build tag at this point.
+ make olddefconfig
+ make -j $(nproc --all)
+ # Reminder: the next command on some distributions does not suffice.
+ command -v installkernel && sudo make modules_install install
+ reboot
+
+Step-by-step guide
+==================
+
+Compiling your own Linux kernel is easy in principle. There are various ways to
+do it. Which of them actually works and is the best choice depends on the
+circumstances.
+
+This guide describes a way perfectly suited for those who want to quickly
+install Linux from sources without being bothered by complicated details; the
+goal is to cover everything typically needed on mainstream Linux distributions
+running on commodity PC or server hardware.
+
+The described approach is great for testing purposes, for example to try a
+proposed fix or to check if a problem was already fixed in the latest codebase.
+Nonetheless, kernels built this way are also totally fine for day-to-day use
+while at the same time being easy to keep up to date.
+
+The following steps describe the important aspects of the process; a
+comprehensive reference section later explains each of them in more detail. It
+sometimes also describes alternative approaches, pitfalls, as well as errors
+that might occur at a particular point -- and how to then get things rolling
+again.
+
+..
+ Note: if you see this note, you are reading the text's source file. You
+ might want to switch to a rendered version, as it makes it a lot easier to
+ quickly look something up in the reference section and afterwards jump back
+ to where you left off. Find the latest rendered version here:
+ https://docs.kernel.org/admin-guide/quickly-build-trimmed-linux.html
+
+.. _backup_sbs:
+
+ * Create a fresh backup and put system repair and restore tools at hand, just
+ to be prepared for the unlikely case of something going sideways.
+
+ [:ref:`details<backup>`]
+
+.. _secureboot_sbs:
+
+ * On platforms with 'Secure Boot' or similar techniques, prepare everything to
+ ensure the system will permit your self-compiled kernel to boot later. The
+ quickest and easiest way to achieve this on commodity x86 systems is to
+ disable such techniques in the BIOS setup utility; alternatively, remove
+ their restrictions through a process initiated by
+ ``mokutil --disable-validation``.
+
+ [:ref:`details<secureboot>`]
+
+.. _buildrequires_sbs:
+
+ * Install all software required to build a Linux kernel. Often you will need:
+ 'bc', 'binutils' ('ld' et al.), 'bison', 'flex', 'gcc', 'git', 'openssl',
+ 'pahole', 'perl', and the development headers for 'libelf' and 'openssl'. The
+ reference section shows how to quickly install those on various popular Linux
+ distributions.
+
+ [:ref:`details<buildrequires>`]
+
+.. _diskspace_sbs:
+
+ * Ensure to have enough free space for building and installing Linux. For the
+ latter 150 Megabyte in /lib/ and 100 in /boot/ are a safe bet. For storing
+ sources and build artifacts 12 Gigabyte in your home directory should
+ typically suffice. If you have less available, be sure to check the reference
+ section for the step that explains adjusting your kernel's build
+ configuration: it mentions a trick that reduces the amount of required space
+ in /home/ to around 4 Gigabyte.
+
+ [:ref:`details<diskspace>`]
+
+.. _sources_sbs:
+
+ * Retrieve the sources of the Linux version you intend to build; then change
+ into the directory holding them, as all further commands in this guide are
+ meant to be executed from there.
+
+ *[Note: the following paragraphs describe how to retrieve the sources by
+ partially cloning the Linux stable git repository. This is called a shallow
+ clone. The reference section explains two alternatives:* :ref:`packaged
+ archives<sources_archive>` *and* :ref:`a full git clone<sources_full>` *;
+ prefer the latter, if downloading a lot of data does not bother you, as that
+ will avoid some* :ref:`peculiar characteristics of shallow clones the
+ reference section explains<sources_shallow>` *.]*
+
+ First, execute the following command to retrieve a fresh mainline codebase::
+
+ git clone --no-checkout --depth 1 -b master \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git ~/linux/
+ cd ~/linux/
+
+ If you want to access recent mainline releases and pre-releases, deepen your
+ clone's history to the oldest mainline version you are interested in::
+
+ git fetch --shallow-exclude=v6.0 origin
+
+ In case you want to access a stable/longterm release (say v6.1.5), simply add
+ the branch holding that series; afterwards fetch the history at least up to
+ the mainline version that started the series (v6.1)::
+
+ git remote set-branches --add origin linux-6.1.y
+ git fetch --shallow-exclude=v6.0 origin
+
+ Now checkout the code you are interested in. If you just performed the
+ initial clone, you will be able to check out a fresh mainline codebase, which
+ is ideal for checking whether developers already fixed an issue::
+
+ git checkout --detach origin/master
+
+ If you deepened your clone, you can specify the version you deepened to
+ (``v6.0`` above) instead of ``origin/master``; later releases like ``v6.1``
+ and pre-releases like ``v6.2-rc1`` will work, too. Stable or longterm
+ versions like ``v6.1.5`` work just the same, if you added the appropriate
+ stable/longterm branch as described.
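+
+ For example, after adding the linux-6.1.y branch as just described, the
+ following will check out the 6.1.5 sources (a sketch; any other fetched tag
+ works the same way)::
+
+ git checkout --detach v6.1.5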
+
+ [:ref:`details<sources>`]
+
+.. _patching_sbs:
+
+ * In case you want to apply a kernel patch, do so now. Often a command like
+ this will do the trick::
+
+ patch -p1 < ../proposed-fix.patch
+
+ Whether the ``-p1`` is actually needed depends on how the patch was created;
+ if the patch does not apply with it, try again without it.
+
+ If you cloned the sources with git and anything goes sideways, run ``git
+ reset --hard`` to undo any changes to the sources.
+
+ [:ref:`details<patching>`]
+
+.. _tagging_sbs:
+
+ * If you patched your kernel or have one of the same version installed already,
+ better add a unique tag to the one you are about to build::
+
+ echo "-proposed_fix" > localversion
+
+ Running ``uname -r`` under your kernel later will then print something like
+ '6.1-rc4-proposed_fix'.
+
+ [:ref:`details<tagging>`]
+
+.. _configuration_sbs:
+
+ * Create the build configuration for your kernel based on an existing
+ configuration.
+
+ If you already prepared such a '.config' file yourself, copy it to
+ ~/linux/ and run ``make olddefconfig``.
+
+ Use the same command, if your distribution or somebody else already tailored
+ your running kernel to your or your hardware's needs: the make target
+ 'olddefconfig' will then try to use that kernel's .config as base.
+
+ Using this make target is fine for everybody else, too -- but you often can
+ save a lot of time by using this command instead::
+
+ yes "" | make localmodconfig
+
+ This will try to pick your distribution's kernel as base, but then disable
+ modules for any features apparently superfluous for your setup. This will
+ reduce the compile time enormously, especially if you are running a
+ universal kernel from a commodity Linux distribution.
+
+ There is a catch: the make target 'localmodconfig' will disable kernel
+ features you have not utilized, directly or indirectly through some program,
+ since you booted the system. You can reduce or nearly eliminate that risk by
+ using tricks outlined in the reference section; for quick testing purposes
+ that risk is often negligible, but it is an aspect you want to keep in mind
+ in case your kernel behaves oddly.
+
+ [:ref:`details<configuration>`]
+
+.. _configmods_sbs:
+
+ * Check if you might want to or have to adjust some kernel configuration
+ options:
+
+ * Evaluate how you want to handle debug symbols. Enable them, if you later
+ might need to decode a stack trace found for example in a 'panic', 'Oops',
+ 'warning', or 'BUG'; on the other hand disable them, if you are short on
+ storage space or prefer a smaller kernel binary. See the reference section
+ for details on how to do either. If neither applies, it will likely be fine
+ to simply not bother with this. [:ref:`details<configmods_debugsymbols>`]
+
+ * Are you running Debian? Then, to avoid known problems, perform the
+ additional adjustments explained in the reference section.
+ [:ref:`details<configmods_distros>`].
+
+ * If you want to influence the other aspects of the configuration, do so now
+ by using make targets like 'menuconfig' or 'xconfig'.
+ [:ref:`details<configmods_individual>`].
+
+.. _build_sbs:
+
+ * Build the image and the modules of your kernel::
+
+ make -j $(nproc --all)
+
+ If you want your kernel packaged up as deb, rpm, or tar file, see the
+ reference section for alternatives.
+
+ [:ref:`details<build>`]
+
+.. _install_sbs:
+
+ * Now install your kernel::
+
+ command -v installkernel && sudo make modules_install install
+
+ Often all left for you to do afterwards is a ``reboot``, as many commodity
+ Linux distributions will then create an initramfs (also known as initrd) and
+ an entry for your kernel in your bootloader's configuration; but on some
+ distributions you have to take care of these two steps manually for reasons
+ the reference section explains.
+
+ On a few distributions like Arch Linux and its derivatives the above command
+ does nothing at all; in that case you have to manually install your kernel,
+ as outlined in the reference section.
+
+ [:ref:`details<install>`]
+
+.. _another_sbs:
+
+ * To later build another kernel you need similar steps, but sometimes slightly
+ different commands.
+
+ First, switch back into the sources tree::
+
+ cd ~/linux/
+
+ In case you want to build a version from a stable/longterm series you have
+ not used yet (say 6.2.y), tell git to track it::
+
+ git remote set-branches --add origin linux-6.2.y
+
+ Now fetch the latest upstream changes; you again need to specify the earliest
+ version you care about, as git otherwise might retrieve the entire commit
+ history::
+
+ git fetch --shallow-exclude=v6.1 origin
+
+ If you modified the sources (for example by applying a patch), you now need
+ to discard those modifications; that's because git otherwise will not be able
+ to switch to the sources of another version due to potential conflicting
+ changes::
+
+ git reset --hard
+
+ Now checkout the version you are interested in, as explained above::
+
+ git checkout --detach origin/master
+
+ At this point you might want to patch the sources again or set/modify a build
+ tag, as explained earlier; afterwards adjust the build configuration to the
+ new codebase and build your next kernel::
+
+ # reminder: if you want to apply patches, do it at this point
+ # reminder: you might want to update your build tag at this point
+ make olddefconfig
+ make -j $(nproc --all)
+
+ Install the kernel as outlined above::
+
+ command -v installkernel && sudo make modules_install install
+
+ [:ref:`details<another>`]
+
+.. _uninstall_sbs:
+
+ * Your kernel is easy to remove later, as its parts are only stored in two
+ places and clearly identifiable by the kernel's release name. Just ensure to
+ not delete the kernel you are running, as that might render your system
+ unbootable.
+
+ Start by deleting the directory holding your kernel's modules, which is named
+ after its release name -- '6.0.1-foobar' in the following example::
+
+ sudo rm -rf /lib/modules/6.0.1-foobar
+
+ Now try the following command, which on some distributions will delete all
+ other kernel files installed while also removing the kernel's entry from the
+ bootloader configuration::
+
+ command -v kernel-install && sudo kernel-install -v remove 6.0.1-foobar
+
+ If that command does not output anything or fails, see the reference section;
+ do the same if any files named '*6.0.1-foobar*' remain in /boot/.
+
+ [:ref:`details<uninstall>`]
+
+.. _submit_improvements:
+
+Did you run into trouble following any of the above steps that is not cleared up
+by the reference section below? Or do you have ideas how to improve the text?
+Then please take a moment of your time and let the maintainer of this document
+know by email (Thorsten Leemhuis <linux@leemhuis.info>), ideally while CCing the
+Linux docs mailing list (linux-doc@vger.kernel.org). Such feedback is vital to
+improve this document further, which is in everybody's interest, as it will
+enable more people to master the task described here.
+
+Reference section for the step-by-step guide
+============================================
+
+This section holds additional information for each of the steps in the above
+guide.
+
+.. _backup:
+
+Prepare for emergencies
+-----------------------
+
+ *Create a fresh backup and put system repair and restore tools at hand*
+ [:ref:`... <backup_sbs>`]
+
+Remember, you are dealing with computers, which sometimes do unexpected things
+-- especially if you fiddle with crucial parts like the kernel of an operating
+system. That's what you are about to do in this process. Hence, better prepare
+for something going sideways, even if that should not happen.
+
+[:ref:`back to step-by-step guide <backup_sbs>`]
+
+.. _secureboot:
+
+Dealing with techniques like Secure Boot
+----------------------------------------
+
+ *On platforms with 'Secure Boot' or similar techniques, prepare everything to
+ ensure the system will permit your self-compiled kernel to boot later.*
+ [:ref:`... <secureboot_sbs>`]
+
+Many modern systems allow only certain operating systems to start; they thus by
+default will reject booting self-compiled kernels.
+
+You ideally deal with this by making your platform trust your self-built kernels
+with the help of a certificate and signing. How to do that is not described
+here, as it requires various steps that would take the text too far away from
+its purpose; 'Documentation/admin-guide/module-signing.rst' and various web
+sites already explain this in more detail.
+
+Temporarily disabling solutions like Secure Boot is another way to make your own
+Linux boot. On commodity x86 systems it is possible to do this in the BIOS Setup
+utility; the steps to do so are not described here, as they greatly vary between
+machines.
+
+On mainstream x86 Linux distributions there is a third and universal option:
+disable all Secure Boot restrictions for your Linux environment. You can
+initiate this process by running ``mokutil --disable-validation``; this will
+tell you to create a one-time password, which is safe to write down. Now
+restart; right after your BIOS performed all self-tests the bootloader Shim will
+show a blue box with a message 'Press any key to perform MOK management'. Hit
+some key before the countdown expires. This will open a menu; choose 'Change
+Secure Boot state' there. Shim's 'MokManager' will now ask you to enter three
+randomly chosen characters from the one-time password specified earlier. Once
+you provided them, confirm that you really want to disable the validation.
+Afterwards, permit MokManager to reboot the machine.
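+
+Condensed into commands, the preparation before the reboot boils down to this
+sketch (mokutil will interactively ask you to choose the one-time password)::
+
+ sudo mokutil --disable-validation
+ reboot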
+
+[:ref:`back to step-by-step guide <secureboot_sbs>`]
+
+.. _buildrequires:
+
+Install build requirements
+--------------------------
+
+ *Install all software required to build a Linux kernel.*
+ [:ref:`...<buildrequires_sbs>`]
+
+The kernel is pretty stand-alone, but besides tools like the compiler you will
+sometimes need a few libraries to build one. How to install everything needed
+depends on your Linux distribution and the configuration of the kernel you are
+about to build.
+
+Here are a few examples of what you typically need on some mainstream
+distributions:
+
+ * Debian, Ubuntu, and derivatives::
+
+ sudo apt install bc binutils bison dwarves flex gcc git make openssl \
+ perl-base libssl-dev libelf-dev
+
+ * Fedora and derivatives::
+
+ sudo dnf install binutils /usr/include/{libelf.h,openssl/pkcs7.h} \
+ /usr/bin/{bc,bison,flex,gcc,git,openssl,make,perl,pahole}
+
+ * openSUSE and derivatives::
+
+ sudo zypper install bc binutils bison dwarves flex gcc git make perl-base \
+ openssl openssl-devel libelf-devel
+
+In case you wonder why these lists include openssl and its development headers:
+they are needed for the Secure Boot support, which many distributions enable in
+their kernel configuration for x86 machines.
+
+Sometimes you will need tools for compression formats like bzip2, gzip, lz4,
+lzma, lzo, xz, or zstd as well.
+
+You might need additional libraries and their development headers in case you
+perform tasks not covered in this guide. For example, zlib will be needed when
+building kernel tools from the tools/ directory; adjusting the build
+configuration with make targets like 'menuconfig' or 'xconfig' will require
+development headers for ncurses or Qt5.
+
+[:ref:`back to step-by-step guide <buildrequires_sbs>`]
+
+.. _diskspace:
+
+Space requirements
+------------------
+
+ *Ensure to have enough free space for building and installing Linux.*
+ [:ref:`... <diskspace_sbs>`]
+
+The numbers mentioned are rough estimates with a big safety margin, so often
+you will need less.
+
+If you have space constraints, remember to read the reference section when you
+reach the :ref:`section about configuration adjustments <configmods>`, as
+ensuring debug symbols are disabled will reduce the consumed disk space by quite
+a few gigabytes.
+
+[:ref:`back to step-by-step guide <diskspace_sbs>`]
+
+
+.. _sources:
+
+Download the sources
+--------------------
+
+ *Retrieve the sources of the Linux version you intend to build.*
+ [:ref:`...<sources_sbs>`]
+
+The step-by-step guide outlines how to retrieve Linux' sources using a shallow
+git clone. There is :ref:`more to tell about this method<sources_shallow>` and
+two alternate ways worth describing: :ref:`packaged archives<sources_archive>`
+and :ref:`a full git clone<sources_full>`. And the aspects ':ref:`wouldn't it
+be wiser to use a proper pre-release than the latest mainline code
+<sources_snapshot>`' and ':ref:`how to get an even fresher mainline codebase
+<sources_fresher>`' need elaboration, too.
+
+Note, to keep things simple the commands used in this guide store the build
+artifacts in the source tree. If you prefer to separate them, simply add
+something like ``O=~/linux-builddir/`` to all make calls; also adjust the path
+in all commands that add files or modify any generated ones (like your
+'.config').
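+
+For example (a sketch)::
+
+ make O=~/linux-builddir/ -j $(nproc --all)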
+
+[:ref:`back to step-by-step guide <sources_sbs>`]
+
+.. _sources_shallow:
+
+Noteworthy characteristics of shallow clones
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The step-by-step guide uses a shallow clone, as it is the best solution for most
+of this document's target audience. There are a few aspects of this approach
+worth mentioning:
+
+ * This document in most places uses ``git fetch`` with ``--shallow-exclude=``
+ to specify the earliest version you care about (or to be precise: its git
+ tag). Alternatively, you can use the parameter ``--shallow-since=`` to
+ specify an absolute (say ``'2023-07-15'``) or relative (``'12 months'``)
+ date to define the depth of the history you want to download. As a second
+ alternative, you can also specify a certain depth explicitly with a
+ parameter like ``--depth=1``, unless you add branches for stable/longterm
+ kernels; a sketch after this list illustrates these alternatives.
+
+ * When running ``git fetch``, remember to always specify the oldest version,
+ the time you care about, or an explicit depth as shown in the step-by-step
+ guide. Otherwise you will risk downloading nearly the entire git history,
+ which will consume quite a bit of time and bandwidth while also stressing the
+ servers.
+
+ Note, you do not have to use the same version or date all the time. But when
+ you change it over time, git will deepen or flatten the history to the
+ specified point. That allows you to retrieve versions you initially thought
+ you did not need -- or it will discard the sources of older versions, for
+ example in case you want to free up some disk space. The latter will happen
+ automatically when using ``--shallow-since=`` or
+ ``--depth=``.
+
+ * Be warned, when deepening your clone you might encounter an error like
+ 'fatal: error in object: unshallow cafecaca0c0dacafecaca0c0dacafecaca0c0da'.
+ In that case run ``git repack -d`` and try again.
+
+ * In case you want to revert changes from a certain version (say Linux 6.3) or
+ perform a bisection (v6.2..v6.3), better tell ``git fetch`` to retrieve
+ objects up to three versions earlier (e.g. 6.0): ``git describe`` will then
+ be able to describe most commits just like it would in a full git clone.
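+
+For illustration, here are the alternative fetch calls mentioned in the first
+point above; they deepen the history by a relative date or to a fixed depth
+(the values are just examples)::
+
+ git fetch --shallow-since='12 months' origin
+ git fetch --depth=100 origin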
+
+[:ref:`back to step-by-step guide <sources_sbs>`] [:ref:`back to section intro <sources>`]
+
+.. _sources_archive:
+
+Downloading the sources using a packages archive
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+People new to compiling Linux often assume downloading an archive via the
+front-page of https://kernel.org is the best approach to retrieve Linux'
+sources. It actually can be, if you are certain to build just one particular
+kernel version without changing any code. Thing is: you might be sure this will
+be the case, but in practice it often will turn out to be a wrong assumption.
+
+That's because when reporting or debugging an issue developers will often ask to
+give another version a try. They also might suggest temporarily undoing a commit
+with ``git revert`` or might provide various patches to try. Sometimes reporters
+will also be asked to use ``git bisect`` to find the change causing a problem.
+These things rely on git or are a lot easier and quicker to handle with it.
+
+A shallow clone also does not add any significant overhead. For example, when
+you use ``git clone --depth=1`` to create a shallow clone of the latest mainline
+codebase git will only retrieve a little more data than downloading the latest
+mainline pre-release (aka 'rc') via the front-page of kernel.org would.
+
+A shallow clone therefore is often the better choice. If you nevertheless want
+to use a packaged source archive, download one via kernel.org; afterwards
+extract its content to some directory and change to the subdirectory created
+during extraction. The rest of the step-by-step guide will work just fine, apart
+from things that rely on git -- but this mainly concerns the section on
+successive builds of other versions.
+
+[:ref:`back to step-by-step guide <sources_sbs>`] [:ref:`back to section intro <sources>`]
+
+.. _sources_full:
+
+Downloading the sources using a full git clone
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If downloading and storing a lot of data (~4.4 Gigabyte as of early 2023) is
+nothing that bothers you, perform a full git clone instead of a shallow one.
+You then will avoid the specialties mentioned above and will have all
+versions and individual commits at hand at any time::
+
+ curl -L \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/clone.bundle \
+ -o linux-stable.git.bundle
+ git clone linux-stable.git.bundle ~/linux/
+ rm linux-stable.git.bundle
+ cd ~/linux/
+ git remote set-url origin \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+ git fetch origin
+ git checkout --detach origin/master
+
+[:ref:`back to step-by-step guide <sources_sbs>`] [:ref:`back to section intro <sources>`]
+
+.. _sources_snapshot:
+
+Proper pre-releases (RCs) vs. latest mainline
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When cloning the sources using git and checking out origin/master, you often
+will retrieve a codebase that is somewhere between the latest and the next
+release or pre-release. This almost always is the code you want when giving
+mainline a shot: pre-releases like v6.1-rc5 are in no way special, as they do
+not get any significant extra testing before being published.
+
+There is one exception: you might want to stick to the latest mainline release
+(say v6.1) before its successor's first pre-release (v6.2-rc1) is out. That is
+because compiler errors and other problems are more likely to occur during this
+time, as mainline then is in its 'merge window': a usually two week long phase,
+in which the bulk of the changes for the next release is merged.
+
+[:ref:`back to step-by-step guide <sources_sbs>`] [:ref:`back to section intro <sources>`]
+
+.. _sources_fresher:
+
+Avoiding the mainline lag
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The explanations for both the shallow clone and the full clone retrieve the
+code from the Linux stable git repository. That makes things simpler for this
+document's audience, as it allows easy access to both mainline and
+stable/longterm releases. This approach has just one downside:
+
+Changes merged into the mainline repository are only synced to the master branch
+of the Linux stable repository every few hours. This lag most of the time is
+not something to worry about; but in case you really need the latest code, just
+add the mainline repo as additional remote and checkout the code from there::
+
+ git remote add mainline \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
+ git fetch mainline
+ git checkout --detach mainline/master
+
+When doing this with a shallow clone, remember to call ``git fetch`` with one
+of the parameters described earlier to limit the depth.
+
+[:ref:`back to step-by-step guide <sources_sbs>`] [:ref:`back to section intro <sources>`]
+
+.. _patching:
+
+Patch the sources (optional)
+----------------------------
+
+ *In case you want to apply a kernel patch, do so now.*
+ [:ref:`...<patching_sbs>`]
+
+This is the point where you might want to patch your kernel -- for example when
+a developer proposed a fix and asked you to check if it helps. The step-by-step
+guide already explains everything crucial here.
+
+[:ref:`back to step-by-step guide <patching_sbs>`]
+
+.. _tagging:
+
+Tagging this kernel build (optional, often wise)
+------------------------------------------------
+
+ *If you patched your kernel or already have that kernel version installed,
+ better tag your kernel by extending its release name:*
+ [:ref:`...<tagging_sbs>`]
+
+Tagging your kernel will help avoid confusion later, especially when you patched
+your kernel. Adding an individual tag will also ensure the kernel's image and
+its modules are installed in parallel to any existing kernels.
+
+There are various ways to add such a tag. The step-by-step guide realizes one by
+creating a 'localversion' file in your build directory from which the kernel
+build scripts will automatically pick up the tag. You can later change that file
+to use a different tag in subsequent builds or simply remove that file to dump
+the tag.
+
+[:ref:`back to step-by-step guide <tagging_sbs>`]
+
+.. _configuration:
+
+Define the build configuration for your kernel
+----------------------------------------------
+
+ *Create the build configuration for your kernel based on an existing
+ configuration.* [:ref:`... <configuration_sbs>`]
+
+There are various aspects of this step that require a more careful
+explanation:
+
+Pitfalls when using another configuration file as base
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Make targets like localmodconfig and olddefconfig share a few common snares you
+want to be aware of:
+
+ * These targets will reuse a kernel build configuration in your build directory
+ (e.g. '~/linux/.config'), if one exists. In case you want to start from
+ scratch you thus need to delete it.
+
+ * The make targets try to find the configuration for your running kernel
+ automatically, but might choose poorly. A line like '# using defaults found
+ in /boot/config-6.0.7-250.fc36.x86_64' or 'using config:
+ '/boot/config-6.0.7-250.fc36.x86_64' tells you which file they picked. If
+ that is not the intended one, simply store it as '~/linux/.config'
+ before using these make targets (see the sketch after this list).
+
+ * Unexpected things might happen if you try to use a config file prepared for
+ one kernel (say v6.0) on an older generation (say v5.15). In that case you
+ might want to use as base a configuration which your distribution utilized
+ when it shipped that or a slightly older kernel version.
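+
+For example, to make sure these make targets pick up the configuration file
+you intend as base, put it in place yourself first (a sketch; some
+distributions instead expose the running kernel's configuration in
+/proc/config.gz)::
+
+ cp /boot/config-$(uname -r) ~/linux/.config
+ make olddefconfig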
+
+Influencing the configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The make target olddefconfig and the ``yes "" |`` used when utilizing
+localmodconfig will set any undefined build options to their default value. This
+among others will disable many kernel features that were introduced after your
+base kernel was released.
+
+If you want to set these configuration options manually, use ``oldconfig``
+instead of ``olddefconfig``, or omit the ``yes "" |`` when utilizing
+localmodconfig. Then for each undefined configuration option you will be asked
+how to proceed. In case you are unsure what to answer, simply hit 'enter' to
+apply the default value.
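+
+As a sketch, the interactive variants of the two approaches thus are::
+
+ make oldconfig
+
+and::
+
+ make localmodconfig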
+
+Big pitfall when using localmodconfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As explained briefly in the step-by-step guide already: with localmodconfig it
+can easily happen that your self-built kernel will lack modules for tasks you
+did not perform before utilizing this make target. That's because those tasks
+require kernel modules that are normally autoloaded when you perform that task
+for the first time; if you did not perform that task at least once before
+using localmodconfig, the latter will thus assume these modules are
+superfluous and disable them.
+
+You can try to avoid this by performing typical tasks that often will autoload
+additional kernel modules: start a VM, establish VPN connections, loop-mount a
+CD/DVD ISO, mount network shares (CIFS, NFS, ...), and connect all external
+devices (2FA keys, headsets, webcams, ...) as well as storage devices with file
+systems you otherwise do not utilize (btrfs, ext4, FAT, NTFS, XFS, ...). But it
+is hard to think of everything that might be needed -- even kernel developers
+often forget one thing or another at this point.
+
+Do not let that risk bother you, especially when compiling a kernel only for
+testing purposes: everything typically crucial will be there. And if you forget
+something important you can turn on a missing feature later and quickly run the
+commands to compile and install a better kernel.
+
+But if you plan to build and use self-built kernels regularly, you might want to
+reduce the risk by recording which modules your system loads over the course of
+a few weeks. You can automate this with `modprobed-db
+<https://github.com/graysky2/modprobed-db>`_. Afterwards use ``LSMOD=<path>`` to
+point localmodconfig to the list of modules modprobed-db noticed being used::
+
+ yes "" | make LSMOD="${HOME}"/.config/modprobed.db localmodconfig
+
+Remote building with localmodconfig
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you want to use localmodconfig to build a kernel for another machine, run
+``lsmod > lsmod_foo-machine`` on it and transfer that file to your build host.
+Now point the build scripts to the file like this: ``yes "" | make
+LSMOD=~/lsmod_foo-machine localmodconfig``. Note, in this case
+you likely want to copy a base kernel configuration from the other machine over
+as well and place it as .config in your build directory.
+
+[:ref:`back to step-by-step guide <configuration_sbs>`]
+
+.. _configmods:
+
+Adjust build configuration
+--------------------------
+
+ *Check if you might want to or have to adjust some kernel configuration
+ options:*
+
+Depending on your needs you at this point might want or have to adjust some
+kernel configuration options.
+
+.. _configmods_debugsymbols:
+
+Debug symbols
+~~~~~~~~~~~~~
+
+ *Evaluate how you want to handle debug symbols.*
+ [:ref:`...<configmods_sbs>`]
+
+Most users do not need to care about this: it's often fine to leave everything
+as it is; but you should take a closer look at this, if you might need to decode
+a stack trace or want to reduce space consumption.
+
+Having debug symbols available can be important when your kernel throws a
+'panic', 'Oops', 'warning', or 'BUG' later when running, as then you will be
+able to find the exact place where the problem occurred in the code. But
+collecting and embedding the needed debug information takes time and consumes
+quite a bit of space: in late 2022 the build artifacts for a typical x86 kernel
+configured with localmodconfig consumed around 5 Gigabyte of space with debug
+symbols, but less than 1 Gigabyte when they were disabled. The resulting
+kernel image and the modules are bigger as well, which increases load times.
+
+Hence, if you want a small kernel and are unlikely to decode a stack trace
+later, you might want to disable debug symbols to avoid above downsides::
+
+ ./scripts/config --file .config -d DEBUG_INFO \
+ -d DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT -d DEBUG_INFO_DWARF4 \
+ -d DEBUG_INFO_DWARF5 -e DEBUG_INFO_NONE
+ make olddefconfig
+
+You on the other hand definitely want to enable them, if there is a decent
+chance that you need to decode a stack trace later (as explained by 'Decode
+failure messages' in Documentation/admin-guide/tainted-kernels.rst in more
+detail)::
+
+ ./scripts/config --file .config -d DEBUG_INFO_NONE -e DEBUG_KERNEL \
+ -e DEBUG_INFO -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT -e KALLSYMS -e KALLSYMS_ALL
+ make olddefconfig
+
+Note, many mainstream distributions enable debug symbols in their kernel
+configurations -- make targets like localmodconfig and olddefconfig thus will
+often pick that setting up.
+
+[:ref:`back to step-by-step guide <configmods_sbs>`]
+
+.. _configmods_distros:
+
+Distro specific adjustments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Are you running* [:ref:`... <configmods_sbs>`]
+
+The following sections help you to avoid build problems that are known to occur
+when following this guide on a few commodity distributions.
+
+**Debian:**
+
+ * Remove a stale reference to a certificate file that would cause your build to
+ fail::
+
+ ./scripts/config --file .config --set-str SYSTEM_TRUSTED_KEYS ''
+
+ Alternatively, download the needed certificate and make that configuration
+ option point to it, as `the Debian handbook explains in more detail
+ <https://debian-handbook.info/browse/stable/sect.kernel-compilation.html>`_
+ -- or generate your own, as explained in
+ Documentation/admin-guide/module-signing.rst.
+
+[:ref:`back to step-by-step guide <configmods_sbs>`]
+
+.. _configmods_individual:
+
+Individual adjustments
+~~~~~~~~~~~~~~~~~~~~~~
+
+ *If you want to influence the other aspects of the configuration, do so
+ now* [:ref:`... <configmods_sbs>`]
+
+You at this point can use a command like ``make menuconfig`` to enable or
+disable certain features using a text-based user interface; to use a graphical
+configuration utility, use the make target ``xconfig`` or ``gconfig`` instead.
+All of them require development libraries from toolkits they are based on
+(ncurses, Qt5, Gtk2); an error message will tell you if something required is
+missing.
+
+[:ref:`back to step-by-step guide <configmods_sbs>`]
+
+.. _build:
+
+Build your kernel
+-----------------
+
+ *Build the image and the modules of your kernel* [:ref:`... <build_sbs>`]
+
+A lot can go wrong at this stage, but the instructions below will help you help
+yourself. Another subsection explains how to directly package your kernel up as
+deb, rpm or tar file.
+
+Dealing with build errors
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When a build error occurs, it might be caused by some aspect of your machine's
+setup that often can be fixed quickly; other times though the problem lies in
+the code and can only be fixed by a developer. A close examination of the
+failure messages coupled with some research on the internet will often tell you
+which of the two it is. To perform such an investigation, restart the build
+process like this::
+
+ make V=1
+
+The ``V=1`` activates verbose output, which might be needed to see the actual
+error. To make it easier to spot, this command also omits the ``-j $(nproc
+--all)`` used earlier to utilize every CPU core in the system for the job -- but
+this parallelism also results in some clutter when failures occur.
+
+After a few seconds the build process should run into the error again. Now try
+to find the most crucial line describing the problem. Then search the internet
+for the most important and non-generic section of that line (say 4 to 8 words);
+avoid or remove anything that looks remotely system-specific, like your username
+or local path names like ``/home/username/linux/``. First try your regular
+internet search engine with that string, afterwards search Linux kernel mailing
+lists via `lore.kernel.org/all/ <https://lore.kernel.org/all/>`_.
+
+This most of the time will find something that will explain what is wrong; quite
+often one of the hits will provide a solution for your problem, too. If you
+do not find anything that matches your problem, try again from a different angle
+by modifying your search terms or using another line from the error messages.
+
+In the end, most trouble you run into has likely been encountered and reported
+by others already. That includes issues where the cause is not your system, but
+lies in the code. If you run into one of those, you might thus find a
+solution (e.g. a patch) or workaround for your problem, too.
+
+Package your kernel up
+~~~~~~~~~~~~~~~~~~~~~~
+
+The step-by-step guide uses the default make targets (e.g. 'bzImage' and
+'modules' on x86) to build the image and the modules of your kernel, which later
+steps of the guide then install. You can also build everything and directly
+package it up by using one of the following targets:
+
+ * ``make -j $(nproc --all) bindeb-pkg`` to generate a deb package
+
+ * ``make -j $(nproc --all) binrpm-pkg`` to generate a rpm package
+
+ * ``make -j $(nproc --all) tarbz2-pkg`` to generate a bz2 compressed tarball
+
+This is just a selection of available make targets for this purpose; see
+``make help`` for others. You can also use these targets after running
+``make -j $(nproc --all)``, as they will pick up everything already built.
+
+If you employ the targets to generate deb or rpm packages, ignore the
+step-by-step guide's instructions on installing and removing your kernel;
+instead install and remove the packages using the package utility for the format
+(e.g. dpkg and rpm) or a package management utility built on top of them (apt,
+aptitude, dnf/yum, zypper, ...). Be aware that the packages generated using
+these two make targets are designed to work on various distributions utilizing
+those formats; they thus will sometimes behave differently than your
+distribution's kernel packages.
+
+[:ref:`back to step-by-step guide <build_sbs>`]
+
+.. _install:
+
+Install your kernel
+-------------------
+
+ *Now install your kernel* [:ref:`... <install_sbs>`]
+
+What you need to do after executing the command in the step-by-step guide
+depends on the existence and the implementation of an ``installkernel``
+executable. Many commodity Linux distributions ship such a kernel installer in
+``/sbin/`` that does everything needed, hence there is nothing left for you
+except rebooting. But some distributions contain an installkernel that does
+only part of the job -- and a few lack it completely and leave all the work to
+you.
+
+If ``installkernel`` is found, the kernel's build system will delegate the
+actual installation of your kernel's image and related files to this executable.
+On almost all Linux distributions it will store the image as '/boot/vmlinuz-
+<your kernel's release name>' and put a 'System.map-<your kernel's release
+name>' alongside it. Your kernel will thus be installed in parallel to any
+existing ones, unless you already have one with exactly the same release name.
+
+Installkernel on many distributions will afterwards generate an 'initramfs'
+(often also called 'initrd'), which commodity distributions rely on for booting;
+hence be sure to keep the order of the two make targets used in the step-by-step
+guide, as things will go sideways if you install your kernel's image before its
+modules. Often installkernel will then add your kernel to the bootloader
+configuration, too. You have to take care of one or both of these tasks
+yourself, if your distribution's installkernel doesn't handle them.
+
+A few distributions like Arch Linux and its derivatives totally lack an
+installkernel executable. On those just install the modules using the kernel's
+build system and then install the image and the System.map file manually::
+
+ sudo make modules_install
+ sudo install -m 0600 $(make -s image_name) /boot/vmlinuz-$(make -s kernelrelease)
+ sudo install -m 0600 System.map /boot/System.map-$(make -s kernelrelease)
+
+If your distribution boots with the help of an initramfs, now generate one for
+your kernel using the tools your distribution provides for this process.
+Afterwards add your kernel to your bootloader configuration and reboot.
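+
+On Arch Linux, for example, that could look like this (a sketch; other
+distributions provide tools like dracut or update-initramfs instead)::
+
+ sudo mkinitcpio -k $(make -s kernelrelease) \
+ -g /boot/initramfs-$(make -s kernelrelease).img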
+
+[:ref:`back to step-by-step guide <install_sbs>`]
+
+.. _another:
+
+Another round later
+-------------------
+
+ *To later build another kernel you need similar, but sometimes slightly
+ different commands* [:ref:`... <another_sbs>`]
+
+The process to build later kernels is similar, but at some points slightly
+different. You for example do not want to use 'localmodconfig' for subsequent
+kernel builds, as you already created a trimmed down configuration you want to
+use from now on. Hence instead just use ``oldconfig`` or ``olddefconfig`` to
+adjust your build configurations to the needs of the kernel version you are
+about to build.
+
+If you created a shallow clone with git, remember what the :ref:`section that
+explained the setup <sources>` described in more detail: you need to use a
+slightly different ``git fetch`` command, and when switching to another series
+you need to add an additional remote branch.
+
+[:ref:`back to step-by-step guide <another_sbs>`]
+
+.. _uninstall:
+
+Uninstall the kernel later
+--------------------------
+
+ *All parts of your installed kernel are identifiable by its release name and
+ thus easy to remove later.* [:ref:`... <uninstall_sbs>`]
+
+Do not worry that installing your kernel manually and thus bypassing your
+distribution's packaging system will totally mess up your machine: all parts of
+your kernel are easy to remove later, as files are stored in two places only and
+normally identifiable by the kernel's release name.
+
+One of the two places is a directory in /lib/modules/, which holds the modules
+for each installed kernel. This directory is named after the kernel's release
+name; hence, to remove all modules for one of your kernels, simply remove its
+modules directory in /lib/modules/.
+
+The other place is /boot/, where typically one to five files will be placed
+during installation of a kernel. All of them usually contain the release name in
+their file name, but how many files and their name depends somewhat on your
+distribution's installkernel executable (:ref:`see above <install>`) and its
+initramfs generator. On some distributions the ``kernel-install`` command
+mentioned in the step-by-step guide will remove all of these files for you --
+and the entry for your kernel in the bootloader configuration at the same time,
+too. On others you have to take care of these steps yourself. The following
+command should interactively remove the two main files of a kernel with the
+release name '6.0.1-foobar'::
+
+ rm -i /boot/{System.map,vmlinuz}-6.0.1-foobar
+
+Now remove the corresponding initramfs, which often will be called something like
+``/boot/initramfs-6.0.1-foobar.img`` or ``/boot/initrd.img-6.0.1-foobar``.
+Afterwards check for other files in /boot/ that have '6.0.1-foobar' in their
+name and delete them as well. Now remove the kernel from your bootloader's
+configuration.
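+
+The removal of the initramfs mentioned above might for example look like this
+(a sketch, as the exact file name depends on your initramfs generator)::
+
+ rm -i /boot/initramfs-6.0.1-foobar.img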
+
+Note, be very careful with wildcards like '*' when deleting files or directories
+for kernels manually: you might accidentally remove files of a 6.0.11 kernel
+when all you want is to remove 6.0 or 6.0.1.
+
+[:ref:`back to step-by-step guide <uninstall_sbs>`]
+
+.. _faq:
+
+FAQ
+===
+
+Why does this 'how-to' not work on my system?
+---------------------------------------------
+
+As initially stated, this guide is 'designed to cover everything typically
+needed [to build a kernel] on mainstream Linux distributions running on
+commodity PC or server hardware'. The outlined approach despite this should work
+on many other setups as well. But trying to cover every possible use-case in one
+guide would defeat its purpose, as without such a focus you would need dozens or
+hundreds of constructs along the lines of 'in case you are having <insert
+machine or distro>, you at this point have to do <this and that>
+<instead|additionally>'. Each of which would make the text longer, more
+complicated, and harder to follow.
+
+That being said: this of course is a balancing act. Hence, if you think an
+additional use-case is worth describing, suggest it to the maintainers of this
+document, as :ref:`described above <submit_improvements>`.
+
+
+..
+ end-of-content
+..
+ This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If
+ you spot a typo or small mistake, feel free to let him know directly and
+ he'll fix it. You are free to do the same in a mostly informal way if you
+ want to contribute changes to the text -- but for copyright reasons please CC
+ linux-doc@vger.kernel.org and 'sign-off' your contribution as
+ Documentation/process/submitting-patches.rst explains in the section 'Sign
+ your work - the Developer's Certificate of Origin'.
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use 'The Linux kernel development community' for author attribution
+ and link this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/quickly-build-trimmed-linux.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
+
diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst
index 8f107d8c9261..e9f85142182d 100644
--- a/Documentation/admin-guide/ramoops.rst
+++ b/Documentation/admin-guide/ramoops.rst
@@ -69,7 +69,7 @@ Setting the ramoops parameters can be done in several different manners:
mem=128M ramoops.mem_address=0x8000000 ramoops.ecc=1
B. Use Device Tree bindings, as described in
- ``Documentation/devicetree/bindings/reserved-memory/ramoops.txt``.
+ ``Documentation/devicetree/bindings/reserved-memory/ramoops.yaml``.
For example::
reserved-memory {
diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/ras.rst
index 7b481b2a368e..8e03751d126d 100644
--- a/Documentation/admin-guide/ras.rst
+++ b/Documentation/admin-guide/ras.rst
@@ -199,7 +199,7 @@ Architecture (MCA)\ [#f3]_.
mode).
.. [#f3] For more details about the Machine Check Architecture (MCA),
- please read Documentation/x86/x86_64/machinecheck.rst at the Kernel tree.
+ please read Documentation/arch/x86/x86_64/machinecheck.rst at the Kernel tree.
EDAC - Error Detection And Correction
*************************************
diff --git a/Documentation/admin-guide/reporting-issues.rst b/Documentation/admin-guide/reporting-issues.rst
index d7ac13f789cc..2fd5a030235a 100644
--- a/Documentation/admin-guide/reporting-issues.rst
+++ b/Documentation/admin-guide/reporting-issues.rst
@@ -1,14 +1,5 @@
.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
-..
- If you want to distribute this text under CC-BY-4.0 only, please use 'The
- Linux kernel developers' for author attribution and link this as source:
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-issues.rst
-..
- Note: Only the content of this RST file as found in the Linux kernel sources
- is available under CC-BY-4.0, as versions of this text that were processed
- (for example by the kernel's build system) might contain content taken from
- files which use a more restrictive license.
-
+.. See the bottom of this file for additional redistribution information.
Reporting issues
++++++++++++++++
@@ -395,22 +386,16 @@ fixed as soon as possible, hence there are 'issues of high priority' that get
handled slightly differently in the reporting process. Three type of cases
qualify: regressions, security issues, and really severe problems.
-You deal with a 'regression' if something that worked with an older version of
-the Linux kernel does not work with a newer one or somehow works worse with it.
-It thus is a regression when a WiFi driver that did a fine job with Linux 5.7
-somehow misbehaves with 5.8 or doesn't work at all. It's also a regression if
-an application shows erratic behavior with a newer kernel, which might happen
-due to incompatible changes in the interface between the kernel and the
-userland (like procfs and sysfs). Significantly reduced performance or
-increased power consumption also qualify as regression. But keep in mind: the
-new kernel needs to be built with a configuration that is similar to the one
-from the old kernel (see below how to achieve that). That's because the kernel
-developers sometimes can not avoid incompatibilities when implementing new
-features; but to avoid regressions such features have to be enabled explicitly
-during build time configuration.
+You deal with a regression if some application or practical use case running
+fine with one Linux kernel works worse or not at all with a newer version
+compiled using a similar configuration. The document
+Documentation/admin-guide/reporting-regressions.rst explains this in more
+detail. It also provides a good deal of other information about regressions you
+might want to be aware of; for example, it explains how to add your issue to the
+list of tracked regressions, to ensure it won't fall through the cracks.
What qualifies as security issue is left to your judgment. Consider reading
-'Documentation/admin-guide/security-bugs.rst' before proceeding, as it
+Documentation/process/security-bugs.rst before proceeding, as it
provides additional details how to best handle security issues.
An issue is a 'really severe problem' when something totally unacceptably bad
@@ -517,7 +502,7 @@ line starting with 'CPU:'. It should end with 'Not tainted' if the kernel was
not tainted when it noticed the problem; it was tainted if you see 'Tainted:'
followed by a few spaces and some letters.
-If your kernel is tainted, study 'Documentation/admin-guide/tainted-kernels.rst'
+If your kernel is tainted, study Documentation/admin-guide/tainted-kernels.rst
to find out why. Try to eliminate the reason. Often it's caused by one these
three things:
@@ -1043,7 +1028,7 @@ down the culprit, as maintainers often won't have the time or setup at hand to
reproduce it themselves.
To find the change there is a process called 'bisection' which the document
-'Documentation/admin-guide/bug-bisect.rst' describes in detail. That process
+Documentation/admin-guide/bug-bisect.rst describes in detail. That process
will often require you to build about ten to twenty kernel images, trying to
reproduce the issue with each of them before building the next. Yes, that takes
some time, but don't worry, it works a lot quicker than most people assume.
@@ -1073,10 +1058,11 @@ When dealing with regressions make sure the issue you face is really caused by
the kernel and not by something else, as outlined above already.
In the whole process keep in mind: an issue only qualifies as regression if the
-older and the newer kernel got built with a similar configuration. The best way
-to archive this: copy the configuration file (``.config``) from the old working
-kernel freshly to each newer kernel version you try. Afterwards run ``make
-olddefconfig`` to adjust it for the needs of the new version.
+older and the newer kernel got built with a similar configuration. This can be
+achieved by using ``make olddefconfig``, as explained in more detail by
+Documentation/admin-guide/reporting-regressions.rst; that document also
+provides a good deal of other information about regressions you might want to be
+aware of.
Write and send the report
@@ -1283,7 +1269,7 @@ them when sending the report by mail. If you filed it in a bug tracker, forward
the report's text to these addresses; but on top of it put a small note where
you mention that you filed it with a link to the ticket.
-See 'Documentation/admin-guide/security-bugs.rst' for more information.
+See Documentation/process/security-bugs.rst for more information.
Duties after the report went out
@@ -1571,7 +1557,7 @@ Once your report is out your might get asked to do a proper one, as it allows to
pinpoint the exact change that causes the issue (which then can easily get
reverted to fix the issue quickly). Hence consider to do a proper bisection
right away if time permits. See the section 'Special care for regressions' and
-the document 'Documentation/admin-guide/bug-bisect.rst' for details how to
+the document Documentation/admin-guide/bug-bisect.rst for details how to
perform one. In case of a successful bisection add the author of the culprit to
the recipients; also CC everyone in the signed-off-by chain, which you find at
the end of its commit message.
@@ -1594,7 +1580,7 @@ Some fixes are too complex
Even small and seemingly obvious code-changes sometimes introduce new and
totally unexpected problems. The maintainers of the stable and longterm kernels
are very aware of that and thus only apply changes to these kernels that are
-within rules outlined in 'Documentation/process/stable-kernel-rules.rst'.
+within rules outlined in Documentation/process/stable-kernel-rules.rst.
Complex or risky changes for example do not qualify and thus only get applied
to mainline. Other fixes are easy to get backported to the newest stable and
@@ -1756,10 +1742,23 @@ art will lay some groundwork to improve the situation over time.
..
- This text is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If you
- spot a typo or small mistake, feel free to let him know directly and he'll
- fix it. You are free to do the same in a mostly informal way if you want
- to contribute changes to the text, but for copyright reasons please CC
+ end-of-content
+..
+ This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If
+ you spot a typo or small mistake, feel free to let him know directly and
+ he'll fix it. You are free to do the same in a mostly informal way if you
+ want to contribute changes to the text, but for copyright reasons please CC
linux-doc@vger.kernel.org and "sign-off" your contribution as
Documentation/process/submitting-patches.rst outlines in the section "Sign
your work - the Developer's Certificate of Origin".
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use "The Linux kernel developers" for author attribution and link
+ this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-issues.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/admin-guide/reporting-regressions.rst b/Documentation/admin-guide/reporting-regressions.rst
new file mode 100644
index 000000000000..d8adccdae23f
--- /dev/null
+++ b/Documentation/admin-guide/reporting-regressions.rst
@@ -0,0 +1,451 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+.. [see the bottom of this file for redistribution information]
+
+Reporting regressions
++++++++++++++++++++++
+
+"*We don't cause regressions*" is the first rule of Linux kernel development;
+Linux founder and lead developer Linus Torvalds established it himself and
+ensures it's obeyed.
+
+This document describes what the rule means for users and how the Linux kernel's
+development model ensures to address all reported regressions; aspects relevant
+for kernel developers are left to Documentation/process/handling-regressions.rst.
+
+
+The important bits (aka "TL;DR")
+================================
+
+#. It's a regression if something running fine with one Linux kernel works worse
+ or not at all with a newer version. Note, the newer kernel has to be compiled
+ using a similar configuration; the explanations below describe this and
+ other fine print in more detail.
+
+#. Report your issue as outlined in Documentation/admin-guide/reporting-issues.rst;
+ it already covers all aspects important for regressions, which are repeated
+ below for convenience. Two of them are important: start your report's subject
+ with "[REGRESSION]" and CC or forward it to `the regression mailing list
+ <https://lore.kernel.org/regressions/>`_ (regressions@lists.linux.dev).
+
+#. Optional, but recommended: when sending or forwarding your report, make the
+ Linux kernel regression tracking bot "regzbot" track the issue by specifying
+ when the regression started like this::
+
+ #regzbot introduced v5.13..v5.14-rc1
+
+
+All the details on Linux kernel regressions relevant for users
+==============================================================
+
+
+The important basics
+--------------------
+
+
+What is a "regression" and what is the "no regressions rule"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's a regression if some application or practical use case running fine with
+one Linux kernel works worse or not at all with a newer version compiled using a
+similar configuration. The "no regressions rule" forbids this from happening; if
+it happens by accident, the developers that caused it are expected to quickly
+fix the issue.
+
+It thus is a regression when a WiFi driver from Linux 5.13 works fine, but with
+5.14 doesn't work at all, works significantly slower, or misbehaves somehow.
+It's also a regression if a perfectly working application suddenly shows erratic
+behavior with a newer kernel version; such issues can be caused by changes in
+procfs, sysfs, or one of the many other interfaces Linux provides to userland
+software. But keep in mind, as mentioned earlier: 5.14 in this example needs to
+be built from a configuration similar to the one from 5.13. This can be achieved
+using ``make olddefconfig``, as explained in more detail below.
+
+Note the "practical use case" in the first sentence of this section: developers
+despite the "no regressions" rule are free to change any aspect of the kernel
+and even APIs or ABIs to userland, as long as no existing application or use
+case breaks.
+
+Also be aware that the "no regressions" rule covers only interfaces the kernel
+provides to the userland. It thus does not apply to kernel-internal interfaces
+like the module API, which some externally developed drivers use to hook into
+the kernel.
+
+How do I report a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Just report the issue as outlined in
+Documentation/admin-guide/reporting-issues.rst; it already describes the
+important points. The following aspects outlined there are especially relevant
+for regressions:
+
+ * When checking for existing reports to join, also search the `archives of the
+ Linux regressions mailing list <https://lore.kernel.org/regressions/>`_ and
+ `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
+
+ * Start your report's subject with "[REGRESSION]".
+
+ * In your report, clearly mention the last kernel version that worked fine and
+ the first broken one. Ideally try to find the exact change causing the
+ regression using a bisection, as explained below in more detail.
+
+ * Remember to let the Linux regressions mailing list
+ (regressions@lists.linux.dev) know about your report:
+
+ * If you report the regression by mail, CC the regressions list.
+
+ * If you report your regression to some bug tracker, forward the submitted
+ report by mail to the regressions list while CCing the maintainer and the
+ mailing list for the subsystem in question.
+
+ If it's a regression within a stable or longterm series (e.g.
+ v5.15.3..v5.15.5), remember to CC the `Linux stable mailing list
+ <https://lore.kernel.org/stable/>`_ (stable@vger.kernel.org).
+
+ In case you performed a successful bisection, CC everyone the culprit's
+ commit message mentions in lines starting with "Signed-off-by:".
+
+When CCing or forwarding your report to the list, consider directly telling the
+aforementioned Linux kernel regression tracking bot about your report. To do
+that, include a paragraph like this in your mail::
+
+ #regzbot introduced: v5.13..v5.14-rc1
+
+Regzbot will then consider your mail a report for a regression introduced in the
+specified version range. In the above case Linux v5.13 still worked fine and Linux
+v5.14-rc1 was the first version where you encountered the issue. If you
+performed a bisection to find the commit that caused the regression, specify the
+culprit's commit-id instead::
+
+ #regzbot introduced: 1f2e3d4c5d
+
+Placing such a "regzbot command" is in your interest, as it will ensure the
+report won't fall through the cracks unnoticed. If you omit this, the Linux
+kernel's regressions tracker will take care of telling regzbot about your
+regression, as long as you send a copy to the regressions mailing lists. But the
+regression tracker is just one human who sometimes has to rest or occasionally
+might even enjoy some time away from computers (as crazy as that might sound).
+Relying on this person thus will result in an unnecessary delay before the
+regression becomes mentioned `on the list of tracked and unresolved Linux
+kernel regressions <https://linux-regtracking.leemhuis.info/regzbot/>`_ and the
+weekly regression reports sent by regzbot. Such delays can result in Linus
+Torvalds being unaware of important regressions when deciding between "continue
+development or call this finished and release the final?".
+
+Are really all regressions fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Nearly all of them are, as long as the change causing the regression (the
+"culprit commit") is reliably identified. Some regressions can be fixed without
+this, but often it's required.
+
+Who needs to find the root cause of a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Developers of the affected code area should try to locate the culprit on their
+own. But doing so with reasonable effort is often impossible for them, as quite
+a lot of issues only occur in a particular environment outside the developer's
+reach -- for example, a specific hardware platform, firmware, Linux distro,
+system's configuration, or application. That's why in the end it's often up to
+the reporter to locate the culprit commit; sometimes users might even need to
+run additional tests afterwards to pinpoint the exact root cause. Developers
+should offer advice and reasonably help where they can, to make this process
+relatively easy and achievable for typical users.
+
+How can I find the culprit?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Perform a bisection, as roughly outlined in
+Documentation/admin-guide/reporting-issues.rst and described in more detail by
+Documentation/admin-guide/bug-bisect.rst. It might sound like a lot of work, but
+in many cases it finds the culprit relatively quickly. If it's hard or
+time-consuming to reliably reproduce the issue, consider teaming up with other
+affected users to narrow down the search range together.
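+
+The rough shape of such a bisection is sketched below; note that the version
+tags and the number of rounds are only examples, yours will differ::
+
+    git bisect start
+    git bisect good v5.13       # last kernel version known to work
+    git bisect bad v5.14-rc1    # first version known to be broken
+    # now build, boot, and test the checked-out kernel, then tell git
+    # the outcome with one of the following two commands:
+    git bisect good
+    git bisect bad
+    # repeat building and testing until git prints the culprit commit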
+
+Who can I ask for advice when it comes to regressions?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Send a mail to the regressions mailing list (regressions@lists.linux.dev) while
+CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the
+issue might better be dealt with in private, feel free to omit the list.
+
+
+Additional details about regressions
+------------------------------------
+
+
+What is the goal of the "no regressions rule"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Users should feel safe when updating kernel versions and not have to worry
+something might break. It is in the kernel developers' interest to make
+updating attractive: they don't want users to stay on stable or longterm Linux
+series that are either abandoned or more than one and a half years old. That's
+in everybody's interest, as `those series might have known bugs, security
+issues, or other problematic aspects already fixed in later versions
+<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
+Additionally, the kernel developers want to make it simple and appealing for
+users to test the latest pre-release or regular release. That's also in
+everybody's interest, as it's a lot easier to track down and fix problems, if
+they are reported shortly after being introduced.
+
+Is the "no regressions" rule really adhered in practice?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's taken really seriously, as can be seen by many mailing list posts from
+Linux creator and lead developer Linus Torvalds, some of which are quoted in
+Documentation/process/handling-regressions.rst.
+
+Exceptions to this rule are extremely rare; in the past developers almost always
+turned out to be wrong when they assumed a particular situation warranted an
+exception.
+
+Who ensures the "no regressions" rule is actually followed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The subsystem maintainers should take care of that; they are watched and
+supported by the tree maintainers -- e.g. Linus Torvalds for mainline and
+Greg Kroah-Hartman et al. for various stable/longterm series.
+
+All of them are helped by people trying to ensure no regression report falls
+through the cracks. One of them is Thorsten Leemhuis, who's currently acting as
+the Linux kernel's "regressions tracker"; to facilitate this work he relies on
+regzbot, the Linux kernel regression tracking bot. That's why you want to bring
+your report on the radar of these people by CCing or forwarding each report to
+the regressions mailing list, ideally with a "regzbot command" in your mail to
+get it tracked immediately.
+
+How quickly are regressions normally fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Developers should fix any reported regression as quickly as possible, to provide
+affected users with a solution in a timely manner and prevent more users from
+running into the issue; nevertheless developers need to take enough time and
+care to ensure regression fixes do not cause additional damage.
+
+The answer thus depends on various factors like the impact of a regression, its
+age, or the Linux series in which it occurs. In the end though, most regressions
+should be fixed within two weeks.
+
+Is it a regression, if the issue can be avoided by updating some software?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Almost always: yes. If a developer tells you otherwise, ask the regression
+tracker for advice as outlined above.
+
+Is it a regression, if a newer kernel works slower or consumes more energy?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Yes, but the difference has to be significant. A five percent slow-down in a
+micro-benchmark thus is unlikely to qualify as a regression, unless it also
+influences the results of a broad benchmark by more than one percent. If in
+doubt, ask for advice.
+
+Is it a regression, if an external kernel module breaks when updating Linux?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+No, as the "no regression" rule is about interfaces and services the Linux
+kernel provides to the userland. It thus does not cover building or running
+externally developed kernel modules, as they run in kernel-space and hook into
+the kernel using internal interfaces that change occasionally.
+
+How are regressions handled that are caused by security fixes?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In extremely rare situations security issues can't be fixed without causing
+regressions; such fixes are accepted anyway, as they are the lesser evil in the
+end. Luckily this dilemma almost always can be avoided, as key developers for the
+affected area and often Linus Torvalds himself try very hard to fix security
+issues without causing regressions.
+
+If you nevertheless face such a case, check the mailing list archives to see if people
+tried their best to avoid the regression. If not, report it; if in doubt, ask
+for advice as outlined above.
+
+What happens if fixing a regression is impossible without causing another?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Sadly these things happen, but luckily not very often; if they occur, expert
+developers of the affected code area should look into the issue to find a fix
+that avoids regressions or at least their impact. If you run into such a
+situation, do what was outlined already for regressions caused by security
+fixes: check earlier discussions if people already tried their best and ask for
+advice if in doubt.
+
+A quick note while at it: these situations could be avoided if people regularly
+gave mainline pre-releases (say v5.15-rc1 or -rc3) from each development cycle
+a test run. This is best explained by imagining a change
+integrated between Linux v5.14 and v5.15-rc1 which causes a regression, but at
+the same time is a hard requirement for some other improvement applied for
+5.15-rc1. All these changes often can simply be reverted and the regression thus
+solved, if someone finds and reports it before 5.15 is released. A few days or
+weeks later this solution can become impossible, as some software might have
+started to rely on aspects introduced by one of the follow-up changes: reverting
+all changes would then cause a regression for users of said software and thus is
+out of the question.
+
+Is it a regression, if some feature I relied on was removed months ago?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is, but often it's hard to fix such regressions due to the aspects outlined
+in the previous section. They hence need to be dealt with on a case-by-case
+basis. This is another reason why it's in everybody's interest to regularly test
+mainline pre-releases.
+
+Does the "no regression" rule apply if I seem to be the only affected person?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It does, but only for practical usage: the Linux developers want to be free to
+remove support for hardware that nowadays can only be found in attics and museums.
+
+Note, sometimes regressions can't be avoided to make progress -- and the latter
+is needed to prevent Linux from stagnating. Hence, if only very few users seem
+to be affected by a regression, it might for the greater good be in their and
+everyone else's interest to let things pass. Especially if there is an
+easy way to circumvent the regression somehow, for example by updating some
+software or using a kernel parameter created just for this purpose.
+
+Does the regression rule apply for code in the staging tree as well?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Not according to the `help text for the configuration option covering all
+staging code <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
+which since its early days states::
+
+ Please note that these drivers are under heavy development, may or
+ may not work, and may contain userspace interfaces that most likely
+ will be changed in the near future.
+
+The staging developers nevertheless often adhere to the "no regressions" rule,
+but sometimes bend it to make progress. That's for example why some users had to
+deal with (often negligible) regressions when a WiFi driver from the staging
+tree was replaced by a totally different one written from scratch.
+
+Why do later versions have to be "compiled with a similar configuration"?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because the Linux kernel developers sometimes integrate changes known to cause
+regressions, but make them optional and disable them in the kernel's default
+configuration. This trick allows progress, as the "no regressions" rule
+otherwise would lead to stagnation.
+
+Consider for example a new security feature blocking access to some kernel
+interfaces often abused by malware, which at the same time are required to run a
+few rarely used applications. The outlined approach makes both camps happy:
+people using these applications can leave the new security feature off, while
+everyone else can enable it without running into trouble.
+
+How to create a configuration similar to the one of an older kernel?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Start your machine with a known-good kernel and configure the newer Linux
+version with ``make olddefconfig``. This makes the kernel's build scripts pick
+up the configuration file (the ".config" file) from the running kernel as the
+base for the new one you are about to compile; afterwards they set all new
+configuration options to their default value, which should disable new features
+that might cause regressions.
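+
+A minimal sketch of this procedure; it assumes your distribution ships the
+running kernel's configuration as ``/boot/config-$(uname -r)`` or exposes it
+via ``/proc/config.gz`` (which requires CONFIG_IKCONFIG_PROC), as many do::
+
+    # run this in the source tree of the newer kernel version
+    cp "/boot/config-$(uname -r)" .config
+    # alternatively: zcat /proc/config.gz > .config
+    make olddefconfig    # set all new options to their default values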
+
+Can I report a regression I found with pre-compiled vanilla kernels?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You need to ensure the newer kernel was compiled with a similar configuration
+file as the older one (see above), as those that built them might have enabled
+some known-to-be-incompatible feature for the newer kernel. If in doubt, report
+the matter to the kernel's provider and ask for advice.
+
+
+More about regression tracking with "regzbot"
+---------------------------------------------
+
+What is regression tracking and why should I care about it?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Rules like "no regressions" need someone to ensure they are followed, otherwise
+they are broken either accidentally or on purpose. History has shown this to be
+true for Linux kernel development as well. That's why Thorsten Leemhuis, the
+Linux Kernel's regression tracker, and a few other people try to ensure all
+regressions are fixed by keeping an eye on them until they are resolved. None of
+them are paid for this, which is why the work is done on a best-effort basis.
+
+Why and how are Linux kernel regressions tracked using a bot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Tracking regressions completely manually has proven to be quite hard due to the
+distributed and loosely structured nature of the Linux kernel development process.
+That's why the Linux kernel's regression tracker developed regzbot to facilitate
+the work, with the long term goal to automate regression tracking as much as
+possible for everyone involved.
+
+Regzbot works by watching for replies to reports of tracked regressions.
+Additionally, it's looking out for posted or committed patches referencing such
+reports with "Link:" tags; replies to such patch postings are tracked as well.
+Combined, this data provides good insights into the current state of the fixing
+process.
+
+How to see which regressions regzbot tracks currently?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Check out `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
+
+What kind of issues are supposed to be tracked by regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bot is meant to track regressions, hence please don't involve regzbot for
+regular issues. But the Linux kernel's regression tracker is fine with you
+involving regzbot to track severe issues, like reports about hangs, corrupted
+data, or internal errors (Panic, Oops, BUG(), warning, ...).
+
+How to change aspects of a tracked regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By using a 'regzbot command' in a direct or indirect reply to the mail with the
+report. The easiest way to do that: find the report in your "Sent" folder or the
+mailing list archive and reply to it using your mailer's "Reply-all" function.
+In that mail, use one of the following commands in a stand-alone paragraph (IOW:
+use blank lines to separate one or multiple of these commands from the rest of
+the mail's text).
+
+ * Update when the regression started to happen, for example after performing a
+ bisection::
+
+ #regzbot introduced: 1f2e3d4c5d
+
+ * Set or update the title::
+
+ #regzbot title: foo
+
+ * Monitor a discussion or bugzilla.kernel.org ticket where additional aspects of
+ the issue or a fix are discussed::
+
+ #regzbot monitor: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
+ #regzbot monitor: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
+
+ * Point to a place with further details of interest, like a mailing list post
+ or a ticket in a bug tracker that is slightly related, but about a different
+ topic::
+
+ #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
+
+ * Mark a regression as invalid::
+
+ #regzbot invalid: wasn't a regression, problem has always existed
+
+Regzbot supports a few other commands primarily used by developers or people
+tracking regressions. They and more details about the aforementioned regzbot
+commands can be found in the `getting started guide
+<https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_ and
+the `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_
+for regzbot.
+
+..
+ end-of-content
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use "The Linux kernel developers" for author attribution and link
+ this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/reporting-regressions.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/admin-guide/serial-console.rst b/Documentation/admin-guide/serial-console.rst
index 58b32832e50a..8c8b94e54e26 100644
--- a/Documentation/admin-guide/serial-console.rst
+++ b/Documentation/admin-guide/serial-console.rst
@@ -33,8 +33,11 @@ The format of this option is::
9600n8. The maximum baudrate is 115200.
You can specify multiple console= options on the kernel command line.
-Output will appear on all of them. The last device will be used when
-you open ``/dev/console``. So, for example::
+
+The behavior is well defined when each device type is mentioned only once.
+In this case, the output will appear on all requested consoles, and
+the last device will be used when you open ``/dev/console``.
+So, for example::
console=ttyS1,9600 console=tty0
@@ -42,7 +45,34 @@ defines that opening ``/dev/console`` will get you the current foreground
virtual console, and kernel messages will appear on both the VGA
console and the 2nd serial port (ttyS1 or COM2) at 9600 baud.
-Note that you can only define one console per device type (serial, video).
+The behavior is more complicated when the same device type is defined multiple
+times. In this case, there are the following two rules:
+
+1. The output will appear only on the first device of each defined type.
+
+2. ``/dev/console`` will be associated with the first registered device.
+ The registration order depends on how the kernel initializes the various
+ subsystems.
+
+ This rule also applies when the last console= parameter is not used
+ for other reasons, for example because of a typo or because
+ the hardware is not available.
+
+The result might be surprising. For example, the following two command
+lines have the same result::
+
+ console=ttyS1,9600 console=tty0 console=tty1
+ console=tty0 console=ttyS1,9600 console=tty1
+
+The kernel messages are printed only on ``tty0`` and ``ttyS1``, and
+``/dev/console`` gets associated with ``tty0``. This is because the kernel
+tries to register graphical consoles before serial ones, matching the
+default behavior used when no console device is specified,
+see below.
+
+Note that the last ``console=tty1`` parameter still makes a difference:
+the kernel command line is also used by systemd, which would use the last
+defined ``tty1`` as the login console.
If no console device is specified, the first device found capable of
acting as a system console will be used. At this time, the system
diff --git a/Documentation/admin-guide/spkguide.txt b/Documentation/admin-guide/spkguide.txt
index 977ab3f5a0a8..74ea7f391942 100644
--- a/Documentation/admin-guide/spkguide.txt
+++ b/Documentation/admin-guide/spkguide.txt
@@ -543,7 +543,7 @@ As mentioned earlier, Speakup can either be completely compiled into the
kernel, with the exception of the help module, or it can be compiled as
a series of modules. When compiled as modules, Speakup will only be
able to speak some of the bootup messages if your system administrator
-has configured the system to load the modules at boo time. The modules
+has configured the system to load the modules at boot time. The modules
can be loaded after the file systems have been checked and mounted, or
from an initrd. There is a third possibility. Speakup can be compiled
with some components built into the kernel, and others as modules. As
@@ -1105,8 +1105,8 @@ speakup load
Alternatively, you can add the above line to your file
~/.bashrc or ~/.bash_profile.
-If your system administrator ran himself the script, all the users will be able
-to change from English to the language choosed by root and do directly
+If your system administrator himself ran the script, all the users will be able
+to change from English to the language chosen by root and do directly
speakupconf load (or add this to the ~/.bashrc or
~/.bash_profile file). If there are several languages to handle, the
administrator (or every user) will have to run the first steps until speakupconf
diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
index 60314953c728..e3cfffef5a63 100644
--- a/Documentation/admin-guide/syscall-user-dispatch.rst
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -73,6 +73,10 @@ thread-wide, without the need to invoke the kernel directly. selector
can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK.
Any other value should terminate the program with a SIGSYS.
+Additionally, a task's syscall user dispatch configuration can be peeked
+and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
+requests. This is useful for checkpoint/restart software.
+
Security Notes
--------------
diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 2a501c9ddc55..a321b84eccaa 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -2,8 +2,6 @@
Documentation for /proc/sys/fs/
===============================
-kernel version 2.2.10
-
Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
Copyright (c) 2009, Shen Feng<shen@cn.fujitsu.com>
@@ -12,58 +10,40 @@ For general info and legal blurb, please look in intro.rst.
------------------------------------------------------------------------------
-This file contains documentation for the sysctl files in
-/proc/sys/fs/ and is valid for Linux kernel version 2.2.
+This file contains documentation for the sysctl files and directories
+in ``/proc/sys/fs/``.
The files in this directory can be used to tune and monitor
miscellaneous and general things in the operation of the Linux
-kernel. Since some of the files _can_ be used to screw up your
+kernel. Since some of the files *can* be used to screw up your
system, it is advisable to read both documentation and source
before actually making adjustments.
1. /proc/sys/fs
===============
-Currently, these files are in /proc/sys/fs:
-
-- aio-max-nr
-- aio-nr
-- dentry-state
-- dquot-max
-- dquot-nr
-- file-max
-- file-nr
-- inode-max
-- inode-nr
-- inode-state
-- nr_open
-- overflowuid
-- overflowgid
-- pipe-user-pages-hard
-- pipe-user-pages-soft
-- protected_fifos
-- protected_hardlinks
-- protected_regular
-- protected_symlinks
-- suid_dumpable
-- super-max
-- super-nr
+Currently, these files might (depending on your configuration)
+show up in ``/proc/sys/fs``:
+
+.. contents:: :local:
aio-nr & aio-max-nr
-------------------
-aio-nr is the running total of the number of events specified on the
-io_setup system call for all currently active aio contexts. If aio-nr
-reaches aio-max-nr then io_setup will fail with EAGAIN. Note that
-raising aio-max-nr does not result in the pre-allocation or re-sizing
-of any kernel data structures.
+``aio-nr`` shows the current system-wide number of asynchronous io
+requests. ``aio-max-nr`` allows you to change the maximum value
+``aio-nr`` can grow to. If ``aio-nr`` reaches ``aio-max-nr`` then
+``io_setup`` will fail with ``EAGAIN``. Note that raising
+``aio-max-nr`` does not result in the
+pre-allocation or re-sizing of any kernel data structures.
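+
+For example, to check the current usage and raise the ceiling (the value below
+is only an illustration, not a recommendation)::
+
+    cat /proc/sys/fs/aio-nr
+    sysctl -w fs.aio-max-nr=1048576
+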
dentry-state
------------
-From linux/include/linux/dcache.h::
+This file shows the values in ``struct dentry_stat``, as defined in
+``linux/include/linux/dcache.h``::
struct dentry_stat_t dentry_stat {
int nr_dentry;
@@ -76,95 +56,84 @@ From linux/include/linux/dcache.h::
Dentries are dynamically allocated and deallocated.
-nr_dentry shows the total number of dentries allocated (active
-+ unused). nr_unused shows the number of dentries that are not
+``nr_dentry`` shows the total number of dentries allocated (active
++ unused). ``nr_unused`` shows the number of dentries that are not
actively used, but are saved in the LRU list for future reuse.
-Age_limit is the age in seconds after which dcache entries
-can be reclaimed when memory is short and want_pages is
-nonzero when shrink_dcache_pages() has been called and the
+``age_limit`` is the age in seconds after which dcache entries
+can be reclaimed when memory is short and ``want_pages`` is
+nonzero when ``shrink_dcache_pages()`` has been called and the
dcache isn't pruned yet.
-nr_negative shows the number of unused dentries that are also
+``nr_negative`` shows the number of unused dentries that are also
negative dentries which do not map to any files. Instead,
they help speeding up rejection of non-existing files provided
by the users.
-dquot-max & dquot-nr
---------------------
-
-The file dquot-max shows the maximum number of cached disk
-quota entries.
-
-The file dquot-nr shows the number of allocated disk quota
-entries and the number of free disk quota entries.
-
-If the number of free cached disk quotas is very low and
-you have some awesome number of simultaneous system users,
-you might want to raise the limit.
-
-
file-max & file-nr
------------------
-The value in file-max denotes the maximum number of file-
+The value in ``file-max`` denotes the maximum number of file-
handles that the Linux kernel will allocate. When you get lots
of error messages about running out of file handles, you might
want to increase this limit.
Historically,the kernel was able to allocate file handles
dynamically, but not to free them again. The three values in
-file-nr denote the number of allocated file handles, the number
+``file-nr`` denote the number of allocated file handles, the number
of allocated but unused file handles, and the maximum number of
-file handles. Linux 2.6 always reports 0 as the number of free
+file handles. Linux 2.6 and later always reports 0 as the number of free
file handles -- this is not an error, it just means that the
number of allocated file handles exactly matches the number of
used file handles.
-Attempts to allocate more file descriptors than file-max are
-reported with printk, look for "VFS: file-max limit <number>
-reached".
+Attempts to allocate more file descriptors than ``file-max`` are
+reported with ``printk``, look for::
+ VFS: file-max limit <number> reached
-nr_open
--------
-
-This denotes the maximum number of file-handles a process can
-allocate. Default value is 1024*1024 (1048576) which should be
-enough for most machines. Actual limit depends on RLIMIT_NOFILE
-resource limit.
+in the kernel logs.
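+
+A quick way to inspect the three ``file-nr`` values together with the
+``file-max`` limit::
+
+    cat /proc/sys/fs/file-nr
+    cat /proc/sys/fs/file-max
+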
-inode-max, inode-nr & inode-state
----------------------------------
+inode-nr & inode-state
+----------------------
As with file handles, the kernel allocates the inode structures
dynamically, but can't free them yet.
-The value in inode-max denotes the maximum number of inode
-handlers. This value should be 3-4 times larger than the value
-in file-max, since stdin, stdout and network sockets also
-need an inode struct to handle them. When you regularly run
-out of inodes, you need to increase this value.
-
-The file inode-nr contains the first two items from
-inode-state, so we'll skip to that file...
+The file ``inode-nr`` contains the first two items from
+``inode-state``, so we'll skip to that file...
-Inode-state contains three actual numbers and four dummies.
-The actual numbers are, in order of appearance, nr_inodes,
-nr_free_inodes and preshrink.
+``inode-state`` contains three actual numbers and four dummies.
+The actual numbers are, in order of appearance, ``nr_inodes``,
+``nr_free_inodes`` and ``preshrink``.
-Nr_inodes stands for the number of inodes the system has
-allocated, this can be slightly more than inode-max because
-Linux allocates them one pageful at a time.
+``nr_inodes`` stands for the number of inodes the system has
+allocated.
-Nr_free_inodes represents the number of free inodes (?) and
-preshrink is nonzero when the nr_inodes > inode-max and the
+``nr_free_inodes`` represents the number of free inodes (?) and
+``preshrink`` is nonzero when the
system needs to prune the inode list instead of allocating
more.
+mount-max
+---------
+
+This denotes the maximum number of mounts that may exist
+in a mount namespace.
+
+
+nr_open
+-------
+
+This denotes the maximum number of file-handles a process can
+allocate. The default value is 1024*1024 (1048576), which should be
+enough for most machines. The actual limit depends on the ``RLIMIT_NOFILE``
+resource limit.
+
+
overflowgid & overflowuid
-------------------------
@@ -192,7 +161,7 @@ pipe-user-pages-soft
Maximum total number of pages a non-privileged user may allocate for pipes
before the pipe size gets limited to a single page. Once this limit is reached,
new pipes will be limited to a single page in size for this user in order to
-limit total memory usage, and trying to increase them using fcntl() will be
+limit total memory usage, and trying to increase them using ``fcntl()`` will be
denied until usage goes below the limit again. The default value allows to
allocate up to 1024 pipes at their default size. When set to 0, no limit is
applied.
@@ -207,7 +176,7 @@ file.
When set to "0", writing to FIFOs is unrestricted.
-When set to "1" don't allow O_CREAT open on FIFOs that we don't own
+When set to "1" don't allow ``O_CREAT`` open on FIFOs that we don't own
in world writable sticky directories, unless they are owned by the
owner of the directory.
@@ -221,7 +190,7 @@ protected_hardlinks
A long-standing class of security issues is the hardlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
-directories like /tmp. The common method of exploitation of this flaw
+directories like ``/tmp``. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given hardlink (i.e. a
root process follows a hardlink created by another user). Additionally,
on systems without separated partitions, this stops unauthorized users
@@ -239,13 +208,13 @@ This protection is based on the restrictions in Openwall and grsecurity.
protected_regular
-----------------
-This protection is similar to protected_fifos, but it
+This protection is similar to `protected_fifos`_, but it
avoids writes to an attacker-controlled regular file, where a program
expected to create one.
When set to "0", writing to regular files is unrestricted.
-When set to "1" don't allow O_CREAT open on regular files that we
+When set to "1" don't allow ``O_CREAT`` open on regular files that we
don't own in world writable sticky directories, unless they are
owned by the owner of the directory.
@@ -257,7 +226,7 @@ protected_symlinks
A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
-directories like /tmp. The common method of exploitation of this flaw
+directories like ``/tmp``. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
@@ -272,23 +241,25 @@ follower match, or when the directory owner matches the symlink's owner.
This protection is based on the restrictions in Openwall and grsecurity.
-suid_dumpable:
---------------
+suid_dumpable
+-------------
This value can be used to query and set the core dump mode for setuid
or otherwise protected/tainted binaries. The modes are
= ========== ===============================================================
-0 (default) traditional behaviour. Any process which has changed
+0 (default) Traditional behaviour. Any process which has changed
privilege levels or is execute only will not be dumped.
-1 (debug) all processes dump core when possible. The core dump is
+1 (debug) All processes dump core when possible. The core dump is
owned by the current user and no security is applied. This is
intended for system debugging situations only.
Ptrace is unchecked.
This is insecure as it allows regular users to examine the
memory contents of privileged processes.
-2 (suidsafe) any binary which normally would not be dumped is dumped
- anyway, but only if the "core_pattern" kernel sysctl is set to
+2 (suidsafe) Any binary which normally would not be dumped is dumped
+ anyway, but only if the ``core_pattern`` kernel sysctl (see
+ :ref:`Documentation/admin-guide/sysctl/kernel.rst <core_pattern>`)
+ is set to
either a pipe handler or a fully qualified path. (For more
details on this limitation, see CVE-2006-2451.) This mode is
appropriate when administrators are attempting to debug
@@ -301,36 +272,11 @@ or otherwise protected/tainted binaries. The modes are
= ========== ===============================================================
-super-max & super-nr
---------------------
-
-These numbers control the maximum number of superblocks, and
-thus the maximum number of mounted filesystems the kernel
-can have. You only need to increase super-max if you need to
-mount more filesystems than the current value in super-max
-allows you to.
-
-
-aio-nr & aio-max-nr
--------------------
-
-aio-nr shows the current system-wide number of asynchronous io
-requests. aio-max-nr allows you to change the maximum value
-aio-nr can grow to.
-
-
-mount-max
----------
-
-This denotes the maximum number of mounts that may exist
-in a mount namespace.
-
-
2. /proc/sys/fs/binfmt_misc
===========================
-Documentation for the files in /proc/sys/fs/binfmt_misc is
+Documentation for the files in ``/proc/sys/fs/binfmt_misc`` is
in Documentation/admin-guide/binfmt-misc.rst.
@@ -343,28 +289,32 @@ creation of a user space library that implements the POSIX message queues
API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System
Interfaces specification.)
-The "mqueue" filesystem contains values for determining/setting the amount of
-resources used by the file system.
+The "mqueue" filesystem contains values for determining/setting the
+amount of resources used by the file system.
-/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the
-maximum number of message queues allowed on the system.
+``/proc/sys/fs/mqueue/queues_max`` is a read/write file for
+setting/getting the maximum number of message queues allowed on the
+system.
-/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the
-maximum number of messages in a queue value. In fact it is the limiting value
-for another (user) limit which is set in mq_open invocation. This attribute of
-a queue must be less or equal then msg_max.
+``/proc/sys/fs/mqueue/msg_max`` is a read/write file for
+setting/getting the maximum number of messages in a queue value. In
+fact it is the limiting value for another (user) limit which is set in
+the ``mq_open`` invocation. This attribute of a queue must be less than
+or equal to ``msg_max``.
-/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the
-maximum message size value (it is every message queue's attribute set during
-its creation).
+``/proc/sys/fs/mqueue/msgsize_max`` is a read/write file for
+setting/getting the maximum message size value (it is an attribute of
+every message queue, set during its creation).
-/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the
-default number of messages in a queue value if attr parameter of mq_open(2) is
-NULL. If it exceed msg_max, the default value is initialized msg_max.
+``/proc/sys/fs/mqueue/msg_default`` is a read/write file for
+setting/getting the default number of messages in a queue value if the
+``attr`` parameter of ``mq_open(2)`` is ``NULL``. If it exceeds
+``msg_max``, the default value is initialized to ``msg_max``.
-/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting
-the default message size value if attr parameter of mq_open(2) is NULL. If it
-exceed msgsize_max, the default value is initialized msgsize_max.
+``/proc/sys/fs/mqueue/msgsize_default`` is a read/write file for
+setting/getting the default message size value if the ``attr``
+parameter of ``mq_open(2)`` is ``NULL``. If it exceeds
+``msgsize_max``, the default value is initialized to ``msgsize_max``.
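+
+To inspect the current values of all these files at once, something like this
+works (the exact set of files depends on your kernel)::
+
+    grep -r . /proc/sys/fs/mqueue/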
4. /proc/sys/fs/epoll - Configuration options for the epoll interface
=====================================================================
@@ -378,7 +328,7 @@ Every epoll file descriptor can store a number of files to be monitored
for event readiness. Each one of these monitored files constitutes a "watch".
This configuration option sets the maximum number of "watches" that are
allowed for each user.
-Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes
-on a 64bit one.
-The current default value for max_user_watches is the 1/25 (4%) of the
-available low memory, divided for the "watch" cost in bytes.
+Each "watch" costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes
+on a 64-bit one.
+The current default value for ``max_user_watches`` is 4% of the
+available low memory, divided by the "watch" cost in bytes.
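+
+For instance, to read the computed default and raise it (the new value below is
+arbitrary, purely for illustration)::
+
+    cat /proc/sys/fs/epoll/max_user_watches
+    sysctl -w fs.epoll.max_user_watches=1000000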
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 426162009ce9..d85d90f5d000 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -38,8 +38,8 @@ acct
If BSD-style process accounting is enabled these values control
its behaviour. If free space on filesystem where the log lives
-goes below ``lowwater``% accounting suspends. If free space gets
-above ``highwater``% accounting resumes. ``frequency`` determines
+goes below ``lowwater``\ % accounting suspends. If free space gets
+above ``highwater``\ % accounting resumes. ``frequency`` determines
how often do we check the amount of free space (value is in
seconds). Default:
@@ -65,6 +65,11 @@ combining the following values:
4 s3_beep
= =======
+arch
+====
+
+The machine hardware name, the same output as ``uname -m``
+(e.g. ``x86_64`` or ``aarch64``).
auto_msgmni
===========
@@ -90,7 +95,7 @@ is 0x15 and the full version number is 0x234, this file will contain
the value 340 = 0x154.
See the ``type_of_loader`` and ``ext_loader_type`` fields in
-Documentation/x86/boot.rst for additional information.
+Documentation/arch/x86/boot.rst for additional information.
bootloader_version (x86 only)
@@ -100,7 +105,7 @@ The complete bootloader version number. In the example above, this
file will contain the value 564 = 0x234.
See the ``type_of_loader`` and ``ext_loader_ver`` fields in
-Documentation/x86/boot.rst for additional information.
+Documentation/arch/x86/boot.rst for additional information.
bpf_stats_enabled
@@ -134,6 +139,8 @@ Highest valid capability of the running kernel. Exports
``CAP_LAST_CAP`` from the kernel.
+.. _core_pattern:
+
core_pattern
============
@@ -169,6 +176,7 @@ core_pattern
%f executable filename
%E executable path
%c maximum size of core file by resource limit RLIMIT_CORE
+ %C CPU the task ran on
%<OTHER> both are dropped
======== ==========================================
@@ -428,8 +436,8 @@ ignore-unaligned-usertrap
On architectures where unaligned accesses cause traps, and where this
feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN``;
-currently, ``arc`` and ``ia64``), controls whether all unaligned traps
-are logged.
+currently, ``arc``, ``ia64`` and ``loongarch``), controls whether all
+unaligned traps are logged.
= =============================================================
0 Log all unaligned accesses.
@@ -445,9 +453,10 @@ this allows system administrators to override the
kexec_load_disabled
===================
-A toggle indicating if the ``kexec_load`` syscall has been disabled.
-This value defaults to 0 (false: ``kexec_load`` enabled), but can be
-set to 1 (true: ``kexec_load`` disabled).
+A toggle indicating if the syscalls ``kexec_load`` and
+``kexec_file_load`` have been disabled.
+This value defaults to 0 (false: ``kexec_*load`` enabled), but can be
+set to 1 (true: ``kexec_*load`` disabled).
Once true, kexec can no longer be used, and the toggle cannot be set
back to false.
This allows a kexec image to be loaded before disabling the syscall,
@@ -455,6 +464,24 @@ allowing a system to set up (and later use) an image without it being
altered.
Generally used together with the `modules_disabled`_ sysctl.
+kexec_load_limit_panic
+======================
+
+This parameter specifies a limit to the number of times the syscalls
+``kexec_load`` and ``kexec_file_load`` can be called with a crash
+image. It can only be set with a more restrictive value than the
+current one.
+
+== ======================================================
+-1 Unlimited calls to kexec. This is the default setting.
+N Number of calls left.
+== ======================================================
+
+kexec_load_limit_reboot
+=======================
+
+Similar functionality as ``kexec_load_limit_panic``, but for a normal
+image.
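+
+For example, to allow the crash image to be loaded two more times before the
+syscalls get locked down (the value is only an illustration)::
+
+    sysctl -w kernel.kexec_load_limit_panic=2
+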
kptr_restrict
=============
@@ -592,68 +619,59 @@ to the guest kernel command line (see
Documentation/admin-guide/kernel-parameters.rst).
-numa_balancing
-==============
+nmi_wd_lpm_factor (PPC only)
+============================
-Enables/disables automatic page fault based NUMA memory
-balancing. Memory is moved automatically to nodes
-that access it often.
+Factor to apply to the NMI watchdog timeout (only when ``nmi_watchdog`` is
+set to 1). This factor represents the percentage added to
+``watchdog_thresh`` when calculating the NMI watchdog timeout during an
+LPM (Live Partition Migration). The soft lockup timeout is not impacted.
-Enables/disables automatic NUMA memory balancing. On NUMA machines, there
-is a performance penalty if remote memory is accessed by a CPU. When this
-feature is enabled the kernel samples what task thread is accessing memory
-by periodically unmapping pages and later trapping a page fault. At the
-time of the page fault, it is determined if the data being accessed should
-be migrated to a local memory node.
+A value of 0 means no change. The default value is 200, meaning the NMI
+watchdog is set to 30s (based on ``watchdog_thresh`` equal to 10).
-The unmapping of pages and trapping faults incur additional overhead that
-ideally is offset by improved memory locality but there is no universal
-guarantee. If the target workload is already bound to NUMA nodes then this
-feature should be disabled. Otherwise, if the system overhead from the
-feature is too high then the rate the kernel samples for NUMA hinting
-faults may be controlled by the `numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
-numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls.
-
-
-numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
-===============================================================================================================================
+numa_balancing
+==============
-Automatic NUMA balancing scans tasks address space and unmaps pages to
-detect if pages are properly placed or if the data should be migrated to a
-memory node local to where the task is running. Every "scan delay" the task
-scans the next "scan size" number of pages in its address space. When the
-end of the address space is reached the scanner restarts from the beginning.
+Enables/disables and configures automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes that access it often.
+The value to set can be the result of ORing the following:
-In combination, the "scan delay" and "scan size" determine the scan rate.
-When "scan delay" decreases, the scan rate increases. The scan delay and
-hence the scan rate of every task is adaptive and depends on historical
-behaviour. If pages are properly placed then the scan delay increases,
-otherwise the scan delay decreases. The "scan size" is not adaptive but
-the higher the "scan size", the higher the scan rate.
+= =================================
+0 NUMA_BALANCING_DISABLED
+1 NUMA_BALANCING_NORMAL
+2 NUMA_BALANCING_MEMORY_TIERING
+= =================================
-Higher scan rates incur higher system overhead as page faults must be
-trapped and potentially data must be migrated. However, the higher the scan
-rate, the more quickly a tasks memory is migrated to a local node if the
-workload pattern changes and minimises performance impact due to remote
-memory accesses. These sysctls control the thresholds for scan delays and
-the number of pages scanned.
+Or NUMA_BALANCING_NORMAL to optimize page placement among different
+NUMA nodes and reduce remote memory accesses. On NUMA machines, there
+is a performance penalty when a CPU accesses remote memory. When this
+feature is enabled, the kernel samples which task or thread is accessing
+memory by periodically unmapping pages and later trapping the resulting
+page fault. At the time of the page fault, it is determined whether the
+data being accessed should be migrated to a local memory node.
-``numa_balancing_scan_period_min_ms`` is the minimum time in milliseconds to
-scan a tasks virtual memory. It effectively controls the maximum scanning
-rate for each task.
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled.
-``numa_balancing_scan_delay_ms`` is the starting "scan delay" used for a task
-when it initially forks.
+Or NUMA_BALANCING_MEMORY_TIERING to optimize page placement among
+different types of memory (represented as different NUMA nodes) so
+that hot pages end up in the fast memory. This, too, is implemented
+via unmapping and page faults.
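+
+For instance, a sketch enabling both modes at once (the value is the OR
+of the flags above, so NUMA_BALANCING_NORMAL | NUMA_BALANCING_MEMORY_TIERING
+gives 3)::
+
+    sysctl -w kernel.numa_balancing=3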
-``numa_balancing_scan_period_max_ms`` is the maximum time in milliseconds to
-scan a tasks virtual memory. It effectively controls the minimum scanning
-rate for each task.
+numa_balancing_promote_rate_limit_MBps
+======================================
-``numa_balancing_scan_size_mb`` is how many megabytes worth of pages are
-scanned for a given scan.
+Excessively high promotion/demotion throughput between different memory
+types can hurt application latency. This knob can be used to rate limit
+promotion: the per-node maximum promotion throughput in MB/s will be
+capped at the set value.
+
+A rule of thumb is to set this to less than 1/10 of the PMEM node's
+write bandwidth.
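+
+As an illustration only (the bandwidth figure is assumed): if a PMEM node
+sustains roughly 20,000 MB/s of writes, the rule of thumb above suggests a
+limit of about 2,000 MB/s::
+
+    sysctl -w kernel.numa_balancing_promote_rate_limit_MBps=2000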
oops_all_cpu_backtrace
======================
@@ -671,6 +689,15 @@ This is the default behavior.
an oops event is detected.
+oops_limit
+==========
+
+Number of kernel oopses after which the kernel should panic when
+``panic_on_oops`` is not set. Setting this to 0 disables checking
+the count. Setting this to 1 has the same effect as setting
+``panic_on_oops=1``. The default value is 10000.
+
+
osrelease, ostype & version
===========================
@@ -795,6 +822,8 @@ bit 1 print system memory info
bit 2 print timer info
bit 3 print locks info if ``CONFIG_LOCKDEP`` is on
bit 4 print ftrace buffer
+bit 5 print all printk messages in buffer
+bit 6 print all CPUs backtrace (if available in the arch)
===== ============================================
So for example to print tasks and memory info on panic, user can::
@@ -813,6 +842,13 @@ is useful to define the root cause of RCU stalls using a vmcore.
1 panic() after printing RCU stall messages.
= ============================================================
+max_rcu_stall_to_panic
+======================
+
+When ``panic_on_rcu_stall`` is set to 1, this value determines the
+number of times that RCU can stall before panic() is called.
+
+When ``panic_on_rcu_stall`` is set to 0, this value has no effect.
perf_cpu_time_max_percent
=========================
@@ -905,6 +941,17 @@ enabled, otherwise writing to this file will return ``-EBUSY``.
The default value is 8.
+perf_user_access (arm64 only)
+=============================
+
+Controls user space access for reading perf event counters. When set to 1,
+user space can read performance monitor counter registers directly.
+
+The default value is 0 (access disabled).
+
+See Documentation/arm64/perf.rst for more information.
+
+
pid_max
=======
@@ -1013,28 +1060,22 @@ This is a directory, with the following entries:
* ``boot_id``: a UUID generated the first time this is retrieved, and
unvarying after that;
+* ``uuid``: a UUID generated every time this is retrieved (this can
+ thus be used to generate UUIDs at will);
+
* ``entropy_avail``: the pool's entropy count, in bits;
* ``poolsize``: the entropy pool size, in bits;
* ``urandom_min_reseed_secs``: obsolete (used to determine the minimum
- number of seconds between urandom pool reseeding).
-
-* ``uuid``: a UUID generated every time this is retrieved (this can
- thus be used to generate UUIDs at will);
+ number of seconds between urandom pool reseeding). This file is
+ writable for compatibility purposes, but writing to it has no effect
+ on any RNG behavior;
* ``write_wakeup_threshold``: when the entropy count drops below this
(as a number of bits), processes waiting to write to ``/dev/random``
- are woken up.
-
-If ``drivers/char/random.c`` is built with ``ADD_INTERRUPT_BENCH``
-defined, these additional entries are present:
-
-* ``add_interrupt_avg_cycles``: the average number of cycles between
- interrupts used to feed the pool;
-
-* ``add_interrupt_avg_deviation``: the standard deviation seen on the
- number of cycles between interrupts used to feed the pool.
+ are woken up. This file is writable for compatibility purposes, but
+ writing to it has no effect on any RNG behavior.
randomize_va_space
@@ -1099,7 +1140,7 @@ task_delayacct
===============
Enables/disables task delay accounting (see
-:doc:`accounting/delay-accounting.rst`). Enabling this feature incurs
+Documentation/accounting/delay-accounting.rst. Enabling this feature incurs
a small amount of overhead in the scheduler but is useful for debugging
and performance tuning. It is required by some tools such as iotop.
@@ -1304,6 +1345,29 @@ watchdog work to be queued by the watchdog timer function, otherwise the NMI
watchdog — if enabled — can detect a hard lockup condition.
+split_lock_mitigate (x86 only)
+==============================
+
+On x86, each "split lock" imposes a system-wide performance penalty. On larger
+systems, large numbers of split locks from unprivileged users can result in
+denials of service to well-behaved and potentially more important users.
+
+The kernel mitigates these bad users by detecting split locks and imposing
+penalties: forcing them to wait and only allowing one core to execute split
+locks at a time.
+
+These mitigations can make those bad applications unbearably slow. Setting
+split_lock_mitigate=0 may restore some application performance, but will also
+increase system exposure to denial of service attacks from split lock users.
+
+= ===================================================================
+0 Disable the mitigation mode - just warn about split locks in the
+  kernel log and expose the system to denial of service from split-lock
+  users.
+1 Enable the mitigation mode (this is the default) - penalize split-lock
+  users with intentional performance degradation.
+= ===================================================================
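+
+For example, disabling the mitigation on a trusted, single-user machine
+(a sketch, not a recommendation)::
+
+    sysctl -w kernel.split_lock_mitigate=0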
+
+
stack_erasing
=============
@@ -1447,8 +1511,8 @@ unaligned-trap
On architectures where unaligned accesses cause traps, and where this
feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_ALLOW``; currently,
-``arc`` and ``parisc``), controls whether unaligned traps are caught
-and emulated (instead of failing).
+``arc``, ``parisc`` and ``loongarch``), controls whether unaligned traps
+are caught and emulated (instead of failing).
= ========================================================
0 Do not emulate unaligned accesses.
@@ -1490,6 +1554,16 @@ entry will default to 2 instead of 0.
2 Unprivileged calls to ``bpf()`` are disabled
= =============================================================
+
+warn_limit
+==========
+
+Number of kernel warnings after which the kernel should panic when
+``panic_on_warn`` is not set. Setting this to 0 disables checking
+the warning count. Setting this to 1 has the same effect as setting
+``panic_on_warn=1``. The default value is 0.
+
+
watchdog
========
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 4150f74c521a..466c560b0c30 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -31,17 +31,18 @@ see only some of them, depending on your kernel's configuration.
Table : Subdirectories in /proc/sys/net
- ========= =================== = ========== ==================
+ ========= =================== = ========== ===================
Directory Content Directory Content
- ========= =================== = ========== ==================
- core General parameter appletalk Appletalk protocol
- unix Unix domain sockets netrom NET/ROM
- 802 E802 protocol ax25 AX25
- ethernet Ethernet protocol rose X.25 PLP layer
+ ========= =================== = ========== ===================
+ 802 E802 protocol mptcp Multipath TCP
+ appletalk Appletalk protocol netfilter Network Filter
+ ax25 AX25 netrom NET/ROM
+ bridge Bridging rose X.25 PLP layer
+ core General parameter tipc TIPC
+ ethernet Ethernet protocol unix Unix domain sockets
ipv4 IP version 4 x25 X.25 protocol
- bridge Bridging decnet DEC net
- ipv6 IP version 6 tipc TIPC
- ========= =================== = ========== ==================
+ ipv6 IP version 6
+ ========= =================== = ========== ===================
1. /proc/sys/net/core - Network core options
============================================
@@ -101,6 +102,9 @@ Values:
- 1 - enable JIT hardening for unprivileged users only
- 2 - enable JIT hardening for all users
+where "privileged user" in this context means a process having
+CAP_BPF or CAP_SYS_ADMIN in the root user namespace.
+
bpf_jit_kallsyms
----------------
@@ -211,6 +215,12 @@ rmem_max
The maximum receive socket buffer size in bytes.
+rps_default_mask
+----------------
+
+The default RPS CPU mask used on newly created network devices. An empty
+mask means RPS is disabled by default.
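+
+As a sketch, to let packet processing for newly created devices use CPUs
+0-3 (mask 0xf; root assumed)::
+
+    echo f > /proc/sys/net/core/rps_default_mask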
+
tstamp_allow_data
-----------------
Allow processes to receive tx timestamps looped together with the original
@@ -271,7 +281,7 @@ poll cycle or the number of packets processed reaches netdev_budget.
netdev_max_backlog
------------------
-Maximum number of packets, queued on the INPUT side, when the interface
+Maximum number of packets, queued on the INPUT side, when the interface
receives packets faster than kernel can process them.
netdev_rss_key
@@ -322,6 +332,14 @@ a leaked reference faster. A larger value may be useful to prevent false
warnings on slow/loaded systems.
Default value is 10, minimum 1, maximum 3600.
+skb_defer_max
+-------------
+
+Max size (in skbs) of the per-cpu list of skbs being freed
+by the cpu which allocated them. Used by the TCP stack so far.
+
+Default: 64
+
optmem_max
----------
@@ -365,6 +383,36 @@ new netns has been created.
Default : 0 (for compatibility reasons)
+txrehash
+--------
+
+Controls default hash rethink behaviour on listening socket when SO_TXREHASH
+option is set to SOCK_TXREHASH_DEFAULT (i.e. not overridden by setsockopt).
+
+If set to 1 (default), hash rethink is performed on listening socket.
+If set to 0, hash rethink is not performed.
+
+gro_normal_batch
+----------------
+
+Maximum number of segments to batch up on output of GRO. When a packet
+exits GRO, either as a coalesced superframe or as an original packet which
+GRO has decided not to coalesce, it is placed on a per-NAPI list. This
+list is then passed to the stack when the number of segments reaches the
+gro_normal_batch limit.
+
+high_order_alloc_disable
+------------------------
+
+By default the allocator for page frags tries to use high-order pages (order-3
+on x86). While the default behavior gives good results in most cases, some users
+might have hit contention in page allocation/freeing. This was especially
+true on older kernels (< 5.14) when high-order pages were not stored on per-cpu
+lists. This knob allows opting in to order-0 allocations instead, but is now
+mostly of historical interest.
+
+Default: 0
+
2. /proc/sys/net/unix - Parameters for Unix domain sockets
----------------------------------------------------------
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 5e795202111f..45ba1f4dc004 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -62,6 +62,7 @@ Currently, these files are in /proc/sys/vm:
- overcommit_memory
- overcommit_ratio
- page-cluster
+- page_lock_unfairness
- panic_on_oom
- percpu_pagelist_high_fraction
- stat_interval
@@ -355,7 +356,7 @@ The lowmem_reserve_ratio is an array. You can see them by reading this file::
But, these values are not used directly. The kernel calculates # of protection
pages for each zones from them. These are shown as array of protection pages
-in /proc/zoneinfo like followings. (This is an example of x86-64 box).
+in /proc/zoneinfo like the following. (This is an example of x86-64 box).
Each zone has an array of protection pages like this::
Node 0, zone DMA
@@ -432,7 +433,7 @@ a 2bit error in a memory module) is detected in the background by hardware
that cannot be handled by the kernel. In some cases (like the page
still having a valid copy on disk) the kernel will handle the failure
transparently without affecting any applications. But if there is
-no other uptodate copy of the data it will kill to prevent any data
+no other up-to-date copy of the data it will kill to prevent any data
corruptions from propagating.
1: Kill all processes that have the corrupted and not reloadable page mapped
@@ -561,6 +562,43 @@ Change the minimum size of the hugepage pool.
See Documentation/admin-guide/mm/hugetlbpage.rst
+hugetlb_optimize_vmemmap
+========================
+
+This knob is not available when the size of 'struct page' (a structure defined
+in include/linux/mm_types.h) is not a power of two (an unusual system
+configuration could result in this).
+
+Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO).
+
+Once enabled, the vmemmap pages of subsequently allocated HugeTLB pages from
+the buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095
+pages per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not
+be optimized. When those optimized HugeTLB pages are freed from the HugeTLB
+pool back to the buddy allocator, the vmemmap pages representing that range
+need to be remapped again and the vmemmap pages discarded earlier need to be
+reallocated again. If your use case is that HugeTLB pages are allocated 'on
+the fly' (e.g. never explicitly allocating HugeTLB pages with 'nr_hugepages'
+but only setting 'nr_overcommit_hugepages', so that the overcommitted HugeTLB
+pages are allocated 'on the fly') instead of being pulled from the HugeTLB
+pool, you should weigh the benefits of memory savings against the increased
+overhead (~2x slower than before) of allocating or freeing HugeTLB pages
+between the HugeTLB pool and the buddy allocator. Another behavior to note is
+that if the system is under heavy memory pressure, the user could be prevented
+from freeing HugeTLB pages from the HugeTLB pool to the buddy allocator,
+since the allocation of vmemmap pages could fail; you have to retry later if
+your system encounters this situation.
+
+Once disabled, the vmemmap pages of subsequently allocated HugeTLB pages from
+the buddy allocator will not be optimized, meaning the extra overhead at
+allocation time from the buddy allocator disappears, whereas already optimized
+HugeTLB pages will not be affected. If you want to make sure there are no
+optimized HugeTLB pages, you can set "nr_hugepages" to 0 first and then
+disable this. Note that
+writing 0 to nr_hugepages will make any "in use" HugeTLB pages become surplus
+pages. So, those surplus pages are still optimized until they are no longer
+in use. You would need to wait for those surplus pages to be released before
+there are no optimized pages in the system.
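+
+A sketch of that safe disable sequence (root assumed; paths are the
+standard sysctl files under /proc/sys/vm)::
+
+    echo 0 > /proc/sys/vm/nr_hugepages
+    echo 0 > /proc/sys/vm/hugetlb_optimize_vmemmap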
+
+
nr_hugepages_mempolicy
======================
@@ -720,7 +758,7 @@ and don't use much of it.
The default value is 0.
-See Documentation/vm/overcommit-accounting.rst and
+See Documentation/mm/overcommit-accounting.rst and
mm/util.c::__vm_enough_memory() for more information.
@@ -754,6 +792,14 @@ extra faults and I/O delays for following faults if they would have been part of
that consecutive pages readahead would have brought in.
+page_lock_unfairness
+====================
+
+This value determines the number of times that the page lock can be
+stolen from under a waiter. After the lock is stolen the number of times
+specified in this file (default is 5), the "fair lock handoff" semantics
+will apply, and the waiter will only be awakened if the lock can be taken.
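+
+For example, to disable lock stealing entirely and always use the fair
+handoff path (an illustrative setting, not a tuning recommendation)::
+
+    sysctl -w vm.page_lock_unfairness=0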
+
panic_on_oom
============
@@ -880,6 +926,9 @@ calls without any restrictions.
The default value is 0.
+Another way to control permissions for userfaultfd is to use
+/dev/userfaultfd instead of userfaultfd(2). See
+Documentation/admin-guide/mm/userfaultfd.rst.
user_reserve_kbytes
===================
@@ -948,7 +997,7 @@ how much memory needs to be free before kswapd goes back to sleep.
The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
-node/system. The maximum value is 1000, or 10% of memory.
+node/system. The maximum value is 3000, or 30% of memory.
A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst
index 0a178ef0111d..51906e47327b 100644
--- a/Documentation/admin-guide/sysrq.rst
+++ b/Documentation/admin-guide/sysrq.rst
@@ -138,7 +138,7 @@ Command Function
``v`` Forcefully restores framebuffer console
``v`` Causes ETM buffer dump [ARM-specific]
-``w`` Dumps tasks that are in uninterruptable (blocked) state.
+``w`` Dumps tasks that are in uninterruptible (blocked) state.
``x`` Used by xmon interface on ppc/powerpc platforms.
Show global PMU Registers on sparc64.
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index ceeed7b0798d..92a8a07f5c43 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -100,6 +100,7 @@ Bit Log Number Reason that got the kernel tainted
15 _/K 32768 kernel has been live patched
16 _/X 65536 auxiliary taint, defined for and used by distros
17 _/T 131072 kernel was built with the struct randomization plugin
+ 18 _/N 262144 an in-kernel test has been run
=== === ====== ========================================================
Note: The character ``_`` is representing a blank in this table to make reading
@@ -133,6 +134,12 @@ More detailed explanation for tainting
scsi/snic on something else than x86_64, scsi/ips on non
x86/x86_64/itanium, have broken firmware settings for the
irqchip/irq-gic on arm64 ...).
+ - x86/x86_64: Microcode late loading is dangerous and will result in
+ tainting the kernel. It requires that all CPUs rendezvous to make sure
+ the update happens when the system is as quiescent as possible. However,
+ a higher priority MCE/SMI/NMI can move control flow away from that
+ rendezvous and interrupt the update, which can be detrimental to the
+ machine.
3) ``R`` if a module was force unloaded by ``rmmod -f``, ``' '`` if all
modules were unloaded normally.
diff --git a/Documentation/admin-guide/thermal/index.rst b/Documentation/admin-guide/thermal/index.rst
new file mode 100644
index 000000000000..193b7b01a87d
--- /dev/null
+++ b/Documentation/admin-guide/thermal/index.rst
@@ -0,0 +1,8 @@
+=================
+Thermal Subsystem
+=================
+
+.. toctree::
+ :maxdepth: 1
+
+ intel_powerclamp
diff --git a/Documentation/driver-api/thermal/intel_powerclamp.rst b/Documentation/admin-guide/thermal/intel_powerclamp.rst
index 3f6dfb0b3ea6..08509b978af4 100644
--- a/Documentation/driver-api/thermal/intel_powerclamp.rst
+++ b/Documentation/admin-guide/thermal/intel_powerclamp.rst
@@ -26,6 +26,8 @@ By:
- Generic Thermal Layer (sysfs)
- Kernel APIs (TBD)
+ (*) Module Parameters
+
INTRODUCTION
============
@@ -85,7 +87,7 @@ migrated, unless the CPU is taken offline. In this case, threads
belong to the offlined CPUs will be terminated immediately.
Running as SCHED_FIFO and relatively high priority, also allows such
-scheme to work for both preemptable and non-preemptable kernels.
+scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a Perf timechart.
The following diagram shows the behavior of kernel thread
@@ -153,13 +155,15 @@ b) determine the amount of compensation needed at each target ratio
Compensation to each target ratio consists of two parts:
a) steady state error compensation
- This is to offset the error occurring when the system can
- enter idle without extra wakeups (such as external interrupts).
+
+ This is to offset the error occurring when the system can
+ enter idle without extra wakeups (such as external interrupts).
b) dynamic error compensation
- When an excessive amount of wakeups occurs during idle, an
- additional idle ratio can be added to quiet interrupts, by
- slowing down CPU activities.
+
+ When an excessive amount of wakeups occurs during idle, an
+ additional idle ratio can be added to quiet interrupts, by
+ slowing down CPU activities.
A debugfs file is provided for the user to examine compensation
progress and results, such as on a Westmere system::
@@ -281,6 +285,7 @@ cur_state returns value -1 instead of 0 which is to avoid confusing
100% busy state with the disabled state.
Example usage:
+
- To inject 25% idle time::
$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state
@@ -318,3 +323,23 @@ device, a PID based userspace thermal controller can manage to
control CPU temperature effectively, when no other thermal influence
is added. For example, a UltraBook user can compile the kernel under
certain temperature (below most active trip points).
+
+Module Parameters
+=================
+
+``cpumask`` (RW)
+  A bit mask of CPUs to inject idle into. The format of the bitmask is the
+  same as used in other subsystems, for example /proc/irq/\*/smp_affinity.
+  The mask consists of comma-separated 32-bit groups, with one bit per CPU.
+  For example, for a 256 CPU system the full mask is:
+  ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
+
+  The rightmost group covers CPUs 0-31.
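+
+  For example, a sketch restricting injection to CPUs 0-3 (assuming the
+  module exposes its parameters under /sys/module/intel_powerclamp)::
+
+    echo f > /sys/module/intel_powerclamp/parameters/cpumask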
+
+``max_idle`` (RW)
+  Maximum ratio of injected idle time to total CPU time, as a percentage
+  from 1 to 100. Even though the cooling device max_state is always 100
+  (100%), this parameter allows adding a maximum idle percent limit. The
+  default is 50, to match the current implementation of the powerclamp
+  driver. Values above 75 are not allowed if the cpumask includes every
+  CPU present in the system.
diff --git a/Documentation/admin-guide/unicode.rst b/Documentation/admin-guide/unicode.rst
index 290fe83ebe82..cba7e5017d36 100644
--- a/Documentation/admin-guide/unicode.rst
+++ b/Documentation/admin-guide/unicode.rst
@@ -3,11 +3,10 @@ Unicode support
Last update: 2005-01-17, version 1.4
-This file is maintained by H. Peter Anvin <unicode@lanana.org> as part
-of the Linux Assigned Names And Numbers Authority (LANANA) project.
-The current version can be found at:
-
- http://www.lanana.org/docs/unicode/admin-guide/unicode.rst
+Note: The original version of this document, which was maintained at
+lanana.org as part of the Linux Assigned Names And Numbers Authority
+(LANANA) project, no longer exists. This version in the mainline Linux
+kernel is now the maintained, authoritative document.
Introduction
------------
diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst
new file mode 100644
index 000000000000..b2e254ec8ee8
--- /dev/null
+++ b/Documentation/admin-guide/workload-tracing.rst
@@ -0,0 +1,606 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+
+======================================================
+Discovering Linux kernel subsystems used by a workload
+======================================================
+
+:Authors: - Shuah Khan <skhan@linuxfoundation.org>
+ - Shefali Sharma <sshefali021@gmail.com>
+:maintained-by: Shuah Khan <skhan@linuxfoundation.org>
+
+Key Points
+==========
+
+ * Understanding system resources necessary to build and run a workload
+ is important.
+ * Linux tracing and strace can be used to discover the system resources
+ in use by a workload. The completeness of the system usage information
+ depends on the completeness of coverage of a workload.
+ * Performance and security of the operating system can be analyzed with
+ the help of tools such as:
+ `perf <https://man7.org/linux/man-pages/man1/perf.1.html>`_,
+ `stress-ng <https://www.mankier.com/1/stress-ng>`_,
+ `paxtest <https://github.com/opntr/paxtest-freebsd>`_.
+ * Once we discover and understand the workload needs, we can focus on them
+   to avoid regressions and use this knowledge to evaluate safety
+   considerations.
+
+Methodology
+===========
+
+`strace <https://man7.org/linux/man-pages/man1/strace.1.html>`_ is a
+diagnostic, instructional, and debugging tool and can be used to discover
+the system resources in use by a workload. Once we discover and understand
+the workload needs, we can focus on them to avoid regressions and use this
+knowledge to evaluate safety considerations. We use the strace tool to
+trace workloads.
+
+Tracing with strace shows the system calls actually invoked by the
+workload, not all the system calls it could invoke. In addition, this
+tracing method shows only the code paths within these system calls that
+are actually exercised. As an example, if a workload opens a file and
+reads from it successfully, then the success path is the one that is
+traced. Any error paths in that system call will not be traced. If a
+test run provides full coverage of a workload, then the method outlined
+here will trace and find all possible code paths. The completeness of
+the system usage information depends on the completeness of coverage of
+a workload.
+
+The goal is tracing a workload on a system running a default kernel without
+requiring custom kernel installs.
+
+How do we gather fine-grained system information?
+=================================================
+
+The strace tool can be used to trace system calls made by a process and signals
+it receives. System calls are the fundamental interface between an
+application and the operating system kernel. They enable a program to
+request services from the kernel. For instance, the open() system call in
+Linux is used to provide access to a file in the file system. strace enables
+us to track all the system calls made by an application. It lists all the
+system calls made by a process and their resulting output.
+
+You can generate profiling data combining strace and perf record tools to
+record the events and information associated with a process. This provides
+insight into the process. "perf annotate" tool generates the statistics of
+each instruction of the program. This document goes over the details of how
+to gather fine-grained information on a workload's usage of system resources.
+
+We used strace to trace the perf, stress-ng, paxtest workloads to illustrate
+our methodology to discover resources used by a workload. This process can
+be applied to trace other workloads.
+
+Getting the system ready for tracing
+====================================
+
+Before we can get started we will show you how to get your system ready.
+We assume that you have a Linux distribution running on a physical system
+or a virtual machine. Most distributions include the strace command. Let’s
+install the other tools that aren’t usually included but are needed to
+build the Linux kernel. Please note that the following works on Debian
+based distributions. You might have to find equivalent packages on other
+Linux distributions.
+
+Install the tools needed to build the Linux kernel and the tools in the
+kernel repository.
+scripts/ver_linux is a good way to check if your system already has
+the necessary tools::
+
+  sudo apt-get install build-essential flex bison
+ sudo apt install libelf-dev systemtap-sdt-dev libaudit-dev libslang2-dev libperl-dev libdw-dev
+
+cscope is a good tool to browse kernel sources. Let's install it now::
+
+ sudo apt-get install cscope
+
+Install stress-ng and paxtest::
+
+ apt-get install stress-ng
+ apt-get install paxtest
+
+Workload overview
+=================
+
+As mentioned earlier, we used strace to trace perf bench, stress-ng and
+paxtest workloads to show how to analyze a workload and identify Linux
+subsystems used by these workloads. Let's start with an overview of these
+three workloads to get a better understanding of what they do and how to
+use them.
+
+perf bench (all) workload
+-------------------------
+
+The perf bench command contains multiple multi-threaded microkernel
+benchmarks that exercise different subsystems of the Linux kernel and
+system calls. This allows us to easily measure the impact of changes,
+which can help mitigate performance regressions. It also acts as a common
+benchmarking framework, enabling developers to easily create test cases,
+integrate transparently, and use performance-rich tooling subsystems.
+
+Stress-ng netdev stressor workload
+----------------------------------
+
+stress-ng is used for performing stress testing on the kernel. It allows
+you to exercise various physical subsystems of the computer, as well as
+interfaces of the OS kernel, using "stressors". They are available for
+CPU, CPU cache, devices, I/O, interrupts, file system, memory, network,
+operating system, pipelines, schedulers, and virtual machines. Please refer
+to the `stress-ng man-page <https://www.mankier.com/1/stress-ng>`_ to
+find the description of all the available stressors. The netdev stressor
+starts a specified number (N) of workers that exercise various netdevice
+ioctl commands across all the available network devices.
+
+paxtest kiddie workload
+-----------------------
+
+paxtest is a program that tests buffer overflows in the kernel. It tests
+kernel enforcements over memory usage. Generally, execution in some memory
+segments makes buffer overflows possible. It runs a set of programs that
+attempt to subvert memory usage. It is used as a regression test suite for
+PaX, but might be useful to test other memory protection patches for the
+kernel. We used paxtest kiddie mode which looks for simple vulnerabilities.
+
+What is strace and how do we use it?
+====================================
+
+As mentioned earlier, strace is a useful diagnostic, instructional,
+and debugging tool that can be used to discover the system resources in
+use by a workload. It can be used:
+
+ * To see how a process interacts with the kernel.
+ * To see why a process is failing or hanging.
+ * For reverse engineering a process.
+ * To find the files on which a program depends.
+ * For analyzing the performance of an application.
+ * For troubleshooting various problems related to the operating system.
+
+In addition, strace can generate run-time statistics on times, calls, and
+errors for each system call and report a summary when the program exits,
+suppressing the regular output. This attempts to show system time (CPU time
+spent running in the kernel) independent of wall clock time. We plan to use
+these features to get information on workload system usage.
+
+The strace command supports basic, verbose, and stats modes. When run in
+verbose mode, strace gives more detailed information about the system
+calls invoked by a process.
+
+Running strace -c generates a report of the percentage of time spent in each
+system call, the total time in seconds, the microseconds per call, the total
+number of calls, the count of each system call that has failed with an error
+and the type of system call made.
+
+ * Usage: strace <command we want to trace>
+ * Verbose mode usage: strace -v <command>
+ * Gather statistics: strace -c <command>
+
+We used the “-c” option to gather fine-grained run-time statistics for
+the three workloads we chose for this analysis:
+
+ * perf
+ * stress-ng
+ * paxtest
+
+What is cscope and how do we use it?
+====================================
+
+Now let’s look at `cscope <https://cscope.sourceforge.net/>`_, a command
+line tool for browsing C, C++ or Java code-bases. We can use it to find
+all the references to a symbol, global definitions, functions called by a
+function, functions calling a function, text strings, regular expression
+patterns, files including a file.
+
+We can use cscope to find which system call belongs to which subsystem.
+This way we can find the kernel subsystems used by a process when it is
+executed.
+
+Let’s check out the latest Linux repository and build the cscope database::
+
+ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux
+ cd linux
+ cscope -R -p10 # builds cscope.out database before starting browse session
+ cscope -d -p10 # starts browse session on cscope.out database
+
+Note: Run "cscope -R -p10" to build the database and "cscope -d -p10" to
+enter the browsing session. cscope uses the cscope.out database by
+default. To get out of this mode press ctrl+d. The -p option is used to
+specify the number of file path components to display; -p10 is optimal
+for browsing kernel sources.
+
+What is perf and how do we use it?
+==================================
+
+Perf is an analysis tool for Linux 2.6+ systems, which abstracts away CPU
+hardware differences in performance measurement and provides a simple
+command line interface. Perf is based on the perf_events interface
+exported by the kernel. It is very useful for profiling the system and
+finding performance bottlenecks in an application.
+
+If you haven't already checked out the Linux mainline repository, you can do
+so and then build the kernel and perf tool::
+
+ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux
+ cd linux
+ make -j3 all
+ cd tools/perf
+ make
+
+Note: The perf command can be built without building the kernel in the
+repository and can be run on older kernels. However, matching the kernel
+and perf revisions gives more accurate information on the subsystem usage.
+
+We used the "perf stat" and "perf bench" options. For detailed information
+on the perf tool, run "perf -h".
+
+perf stat
+---------
+The perf stat command generates a report of various hardware and software
+events. It does so with the help of hardware counter registers found in
+modern CPUs that keep the count of these activities. For example, "perf
+stat cal" shows stats for the cal command.
+
+Perf bench
+----------
+The perf bench command contains multiple multi-threaded microkernel
+benchmarks that exercise different subsystems of the Linux kernel and
+system calls. This allows us to easily measure the impact of changes,
+which can help mitigate performance regressions. It also acts as a common
+benchmarking framework, enabling developers to easily create test cases,
+integrate transparently, and use performance-rich tooling.
+
+The "perf bench all" command runs the following benchmarks:
+
+ * sched/messaging
+ * sched/pipe
+ * syscall/basic
+ * mem/memcpy
+ * mem/memset
+
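+For example, to run just one of these benchmarks rather than the whole
+suite (perf bench takes the collection and benchmark names as arguments)::
+
+    perf bench sched pipe
+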
+What is stress-ng and how do we use it?
+=======================================
+
+As mentioned earlier, stress-ng is used for performing stress testing on
+the kernel. It allows you to exercise various physical subsystems of the
+computer, as well as interfaces of the OS kernel, using stressors. They
+are available for CPU, CPU cache, devices, I/O, interrupts, file system,
+memory, network, operating system, pipelines, schedulers, and virtual
+machines.
+
+The netdev stressor starts N workers that exercise various netdevice ioctl
+commands across all the available network devices. The following ioctls are
+exercised:
+
+ * SIOCGIFCONF, SIOCGIFINDEX, SIOCGIFNAME, SIOCGIFFLAGS
+ * SIOCGIFADDR, SIOCGIFNETMASK, SIOCGIFMETRIC, SIOCGIFMTU
+ * SIOCGIFHWADDR, SIOCGIFMAP, SIOCGIFTXQLEN
+
+The following command runs the stressor::
+
+  stress-ng --netdev 1 -t 60 --metrics
+
+We can use the perf record command to record the events and information
+associated with a process. This command records the profiling data in the
+perf.data file in the same directory.
+
+Using the following commands you can record the events associated with the
+netdev stressor, view the report generated in perf.data, and annotate it to
+view the statistics of each instruction of the program::
+
+  perf record stress-ng --netdev 1 -t 60 --metrics
+ perf report
+ perf annotate
+
+What is paxtest and how do we use it?
+=====================================
+
+paxtest is a program that tests buffer overflows in the kernel. It tests
+kernel enforcements over memory usage. Generally, execution in some memory
+segments makes buffer overflows possible. It runs a set of programs that
+attempt to subvert memory usage. It is used as a regression test suite for
+PaX, and will be useful to test other memory protection patches for the
+kernel.
+
+paxtest provides kiddie and blackhat modes. The paxtest kiddie mode runs
+in normal mode, whereas the blackhat mode tries to get around the protections
+of the kernel while testing for vulnerabilities. We focus on the kiddie mode here
+and combine "paxtest kiddie" run with "perf record" to collect CPU stack
+traces for the paxtest kiddie run to see which function is calling other
+functions in the performance profile. Then the "dwarf" (DWARF's Call Frame
+Information) mode can be used to unwind the stack.
+
+The following commands can be used to record and view the resulting report
+in call-graph format::
+
+ perf record --call-graph dwarf paxtest kiddie
+ perf report --stdio
+
+Tracing workloads
+=================
+
+Now that we understand the workloads, let's start tracing them.
+
+Tracing perf bench all workload
+-------------------------------
+
+Run the following command to trace perf bench all workload::
+
+ strace -c perf bench all
+
+**System Calls made by the workload**
+
+The below table shows the system calls invoked by the workload, number of
+times each system call is invoked, and the corresponding Linux subsystem.
+
++-------------------+-----------+-----------------+-------------------------+
+| System Call | # calls | Linux Subsystem | System Call (API) |
++===================+===========+=================+=========================+
+| getppid | 10000001 | Process Mgmt. | sys_getppid() |
++-------------------+-----------+-----------------+-------------------------+
+| clone | 1077 | Process Mgmt. | sys_clone() |
++-------------------+-----------+-----------------+-------------------------+
+| prctl | 23 | Process Mgmt. | sys_prctl() |
++-------------------+-----------+-----------------+-------------------------+
+| prlimit64 | 7 | Process Mgmt. | sys_prlimit64() |
++-------------------+-----------+-----------------+-------------------------+
+| getpid | 10 | Process Mgmt. | sys_getpid() |
++-------------------+-----------+-----------------+-------------------------+
+| uname | 3 | Process Mgmt. | sys_uname() |
++-------------------+-----------+-----------------+-------------------------+
+| sysinfo | 1 | Process Mgmt. | sys_sysinfo() |
++-------------------+-----------+-----------------+-------------------------+
+| getuid | 1 | Process Mgmt. | sys_getuid() |
++-------------------+-----------+-----------------+-------------------------+
+| getgid | 1 | Process Mgmt. | sys_getgid() |
++-------------------+-----------+-----------------+-------------------------+
+| geteuid | 1 | Process Mgmt. | sys_geteuid() |
++-------------------+-----------+-----------------+-------------------------+
+| getegid | 1 | Process Mgmt. | sys_getegid() |
++-------------------+-----------+-----------------+-------------------------+
+| close | 49951 | Filesystem | sys_close() |
++-------------------+-----------+-----------------+-------------------------+
+| pipe | 604 | Filesystem | sys_pipe() |
++-------------------+-----------+-----------------+-------------------------+
+| openat | 48560 | Filesystem | sys_openat() |
++-------------------+-----------+-----------------+-------------------------+
+| fstat | 8338 | Filesystem | sys_fstat() |
++-------------------+-----------+-----------------+-------------------------+
+| stat | 1573 | Filesystem | sys_stat() |
++-------------------+-----------+-----------------+-------------------------+
+| pread64 | 9646 | Filesystem | sys_pread64() |
++-------------------+-----------+-----------------+-------------------------+
+| getdents64 | 1873 | Filesystem | sys_getdents64() |
++-------------------+-----------+-----------------+-------------------------+
+| access | 3 | Filesystem | sys_access() |
++-------------------+-----------+-----------------+-------------------------+
+| lstat | 1880 | Filesystem | sys_lstat() |
++-------------------+-----------+-----------------+-------------------------+
+| lseek | 6 | Filesystem | sys_lseek() |
++-------------------+-----------+-----------------+-------------------------+
+| ioctl | 3 | Filesystem | sys_ioctl() |
++-------------------+-----------+-----------------+-------------------------+
+| dup2 | 1 | Filesystem | sys_dup2() |
++-------------------+-----------+-----------------+-------------------------+
+| execve | 2 | Filesystem | sys_execve() |
++-------------------+-----------+-----------------+-------------------------+
+| fcntl | 8779 | Filesystem | sys_fcntl() |
++-------------------+-----------+-----------------+-------------------------+
+| statfs | 1 | Filesystem | sys_statfs() |
++-------------------+-----------+-----------------+-------------------------+
+| epoll_create | 2 | Filesystem | sys_epoll_create() |
++-------------------+-----------+-----------------+-------------------------+
+| epoll_ctl | 64 | Filesystem | sys_epoll_ctl() |
++-------------------+-----------+-----------------+-------------------------+
+| newfstatat | 8318 | Filesystem | sys_newfstatat() |
++-------------------+-----------+-----------------+-------------------------+
+| eventfd2 | 192 | Filesystem | sys_eventfd2() |
++-------------------+-----------+-----------------+-------------------------+
+| mmap | 243 | Memory Mgmt. | sys_mmap() |
++-------------------+-----------+-----------------+-------------------------+
+| mprotect | 32 | Memory Mgmt. | sys_mprotect() |
++-------------------+-----------+-----------------+-------------------------+
+| brk | 21 | Memory Mgmt. | sys_brk() |
++-------------------+-----------+-----------------+-------------------------+
+| munmap | 128 | Memory Mgmt. | sys_munmap() |
++-------------------+-----------+-----------------+-------------------------+
+| set_mempolicy | 156 | Memory Mgmt. | sys_set_mempolicy() |
++-------------------+-----------+-----------------+-------------------------+
+| set_tid_address | 1 | Process Mgmt. | sys_set_tid_address() |
++-------------------+-----------+-----------------+-------------------------+
+| set_robust_list | 1 | Futex | sys_set_robust_list() |
++-------------------+-----------+-----------------+-------------------------+
+| futex | 341 | Futex | sys_futex() |
++-------------------+-----------+-----------------+-------------------------+
+| sched_getaffinity | 79 | Scheduler | sys_sched_getaffinity() |
++-------------------+-----------+-----------------+-------------------------+
+| sched_setaffinity | 223 | Scheduler | sys_sched_setaffinity() |
++-------------------+-----------+-----------------+-------------------------+
+| socketpair | 202 | Network | sys_socketpair() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigprocmask | 21 | Signal | sys_rt_sigprocmask() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigaction | 36 | Signal | sys_rt_sigaction() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigreturn | 2 | Signal | sys_rt_sigreturn() |
++-------------------+-----------+-----------------+-------------------------+
+| wait4 | 889 | Time | sys_wait4() |
++-------------------+-----------+-----------------+-------------------------+
+| clock_nanosleep | 37 | Time | sys_clock_nanosleep() |
++-------------------+-----------+-----------------+-------------------------+
+| capget | 4 | Capability | sys_capget() |
++-------------------+-----------+-----------------+-------------------------+
+
+Tracing stress-ng netdev stressor workload
+------------------------------------------
+
+Run the following command to trace stress-ng netdev stressor workload::
+
+ strace -c stress-ng --netdev 1 -t 60 --metrics
+
+**System Calls made by the workload**
+
+The below table shows the system calls invoked by the workload, number of
+times each system call is invoked, and the corresponding Linux subsystem.
+
++-------------------+-----------+-----------------+-------------------------+
+| System Call | # calls | Linux Subsystem | System Call (API) |
++===================+===========+=================+=========================+
+| openat | 74 | Filesystem | sys_openat() |
++-------------------+-----------+-----------------+-------------------------+
+| close | 75 | Filesystem | sys_close() |
++-------------------+-----------+-----------------+-------------------------+
+| read | 58 | Filesystem | sys_read() |
++-------------------+-----------+-----------------+-------------------------+
+| fstat | 20 | Filesystem | sys_fstat() |
++-------------------+-----------+-----------------+-------------------------+
+| flock | 10 | Filesystem | sys_flock() |
++-------------------+-----------+-----------------+-------------------------+
+| write | 7 | Filesystem | sys_write() |
++-------------------+-----------+-----------------+-------------------------+
+| getdents64 | 8 | Filesystem | sys_getdents64() |
++-------------------+-----------+-----------------+-------------------------+
+| pread64 | 8 | Filesystem | sys_pread64() |
++-------------------+-----------+-----------------+-------------------------+
+| lseek | 1 | Filesystem | sys_lseek() |
++-------------------+-----------+-----------------+-------------------------+
+| access | 2 | Filesystem | sys_access() |
++-------------------+-----------+-----------------+-------------------------+
+| getcwd | 1 | Filesystem | sys_getcwd() |
++-------------------+-----------+-----------------+-------------------------+
+| execve | 1 | Filesystem | sys_execve() |
++-------------------+-----------+-----------------+-------------------------+
+| mmap | 61 | Memory Mgmt. | sys_mmap() |
++-------------------+-----------+-----------------+-------------------------+
+| munmap | 3 | Memory Mgmt. | sys_munmap() |
++-------------------+-----------+-----------------+-------------------------+
+| mprotect | 20 | Memory Mgmt. | sys_mprotect() |
++-------------------+-----------+-----------------+-------------------------+
+| mlock | 2 | Memory Mgmt. | sys_mlock() |
++-------------------+-----------+-----------------+-------------------------+
+| brk | 3 | Memory Mgmt. | sys_brk() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigaction | 21 | Signal | sys_rt_sigaction() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigprocmask | 1 | Signal | sys_rt_sigprocmask() |
++-------------------+-----------+-----------------+-------------------------+
+| sigaltstack | 1 | Signal | sys_sigaltstack() |
++-------------------+-----------+-----------------+-------------------------+
+| rt_sigreturn | 1 | Signal | sys_rt_sigreturn() |
++-------------------+-----------+-----------------+-------------------------+
+| getpid | 8 | Process Mgmt. | sys_getpid() |
++-------------------+-----------+-----------------+-------------------------+
+| prlimit64 | 5 | Process Mgmt. | sys_prlimit64() |
++-------------------+-----------+-----------------+-------------------------+
+| arch_prctl | 2 | Process Mgmt. | sys_arch_prctl() |
++-------------------+-----------+-----------------+-------------------------+
+| sysinfo | 2 | Process Mgmt. | sys_sysinfo() |
++-------------------+-----------+-----------------+-------------------------+
+| getuid | 2 | Process Mgmt. | sys_getuid() |
++-------------------+-----------+-----------------+-------------------------+
+| uname | 1 | Process Mgmt. | sys_uname() |
++-------------------+-----------+-----------------+-------------------------+
+| setpgid | 1 | Process Mgmt. | sys_setpgid() |
++-------------------+-----------+-----------------+-------------------------+
+| getrusage | 1 | Process Mgmt. | sys_getrusage() |
++-------------------+-----------+-----------------+-------------------------+
+| geteuid | 1 | Process Mgmt. | sys_geteuid() |
++-------------------+-----------+-----------------+-------------------------+
+| getppid | 1 | Process Mgmt. | sys_getppid() |
++-------------------+-----------+-----------------+-------------------------+
+| sendto | 3 | Network | sys_sendto() |
++-------------------+-----------+-----------------+-------------------------+
+| connect | 1 | Network | sys_connect() |
++-------------------+-----------+-----------------+-------------------------+
+| socket | 1 | Network | sys_socket() |
++-------------------+-----------+-----------------+-------------------------+
+| clone | 1 | Process Mgmt. | sys_clone() |
++-------------------+-----------+-----------------+-------------------------+
+| set_tid_address | 1 | Process Mgmt. | sys_set_tid_address() |
++-------------------+-----------+-----------------+-------------------------+
+| wait4 | 2 | Time | sys_wait4() |
++-------------------+-----------+-----------------+-------------------------+
+| alarm | 1 | Time | sys_alarm() |
++-------------------+-----------+-----------------+-------------------------+
+| set_robust_list | 1 | Futex | sys_set_robust_list() |
++-------------------+-----------+-----------------+-------------------------+
+
+Tracing paxtest kiddie workload
+-------------------------------
+
+Run the following command to trace paxtest kiddie workload::
+
+ strace -c paxtest kiddie
+
+**System Calls made by the workload**
+
+The below table shows the system calls invoked by the workload, number of
+times each system call is invoked, and the corresponding Linux subsystem.
+
++-------------------+-----------+-----------------+----------------------+
+| System Call | # calls | Linux Subsystem | System Call (API) |
++===================+===========+=================+======================+
+| read | 3 | Filesystem | sys_read() |
++-------------------+-----------+-----------------+----------------------+
+| write | 11 | Filesystem | sys_write() |
++-------------------+-----------+-----------------+----------------------+
+| close | 41 | Filesystem | sys_close() |
++-------------------+-----------+-----------------+----------------------+
+| stat | 24 | Filesystem | sys_stat() |
++-------------------+-----------+-----------------+----------------------+
+| fstat | 2 | Filesystem | sys_fstat() |
++-------------------+-----------+-----------------+----------------------+
+| pread64 | 6 | Filesystem | sys_pread64() |
++-------------------+-----------+-----------------+----------------------+
+| access | 1 | Filesystem | sys_access() |
++-------------------+-----------+-----------------+----------------------+
+| pipe | 1 | Filesystem | sys_pipe() |
++-------------------+-----------+-----------------+----------------------+
+| dup2 | 24 | Filesystem | sys_dup2() |
++-------------------+-----------+-----------------+----------------------+
+| execve | 1 | Filesystem | sys_execve() |
++-------------------+-----------+-----------------+----------------------+
+| fcntl | 26 | Filesystem | sys_fcntl() |
++-------------------+-----------+-----------------+----------------------+
+| openat | 14 | Filesystem | sys_openat() |
++-------------------+-----------+-----------------+----------------------+
+| rt_sigaction | 7 | Signal | sys_rt_sigaction() |
++-------------------+-----------+-----------------+----------------------+
+| rt_sigreturn | 38 | Signal | sys_rt_sigreturn() |
++-------------------+-----------+-----------------+----------------------+
+| clone | 38 | Process Mgmt. | sys_clone() |
++-------------------+-----------+-----------------+----------------------+
+| wait4 | 44 | Time | sys_wait4() |
++-------------------+-----------+-----------------+----------------------+
+| mmap | 7 | Memory Mgmt. | sys_mmap() |
++-------------------+-----------+-----------------+----------------------+
+| mprotect | 3 | Memory Mgmt. | sys_mprotect() |
++-------------------+-----------+-----------------+----------------------+
+| munmap | 1 | Memory Mgmt. | sys_munmap() |
++-------------------+-----------+-----------------+----------------------+
+| brk | 3 | Memory Mgmt. | sys_brk() |
++-------------------+-----------+-----------------+----------------------+
+| getpid | 1 | Process Mgmt. | sys_getpid() |
++-------------------+-----------+-----------------+----------------------+
+| getuid | 1 | Process Mgmt. | sys_getuid() |
++-------------------+-----------+-----------------+----------------------+
+| getgid | 1 | Process Mgmt. | sys_getgid() |
++-------------------+-----------+-----------------+----------------------+
+| geteuid | 2 | Process Mgmt. | sys_geteuid() |
++-------------------+-----------+-----------------+----------------------+
+| getegid | 1 | Process Mgmt. | sys_getegid() |
++-------------------+-----------+-----------------+----------------------+
+| getppid | 1 | Process Mgmt. | sys_getppid() |
++-------------------+-----------+-----------------+----------------------+
+| arch_prctl | 2 | Process Mgmt. | sys_arch_prctl() |
++-------------------+-----------+-----------------+----------------------+
+
+Conclusion
+==========
+
+This document is intended to be used as a guide on how to gather fine-grained
+information on the resources in use by workloads using strace.
+
+References
+==========
+
+ * `Discovery Linux Kernel Subsystems used by OpenAPS <https://elisa.tech/blog/2022/02/02/discovery-linux-kernel-subsystems-used-by-openaps>`_
+ * `ELISA-White-Papers-Discovering Linux kernel subsystems used by a workload <https://github.com/elisa-tech/ELISA-White-Papers/blob/master/Processes/Discovering_Linux_kernel_subsystems_used_by_a_workload.md>`_
+ * `strace <https://man7.org/linux/man-pages/man1/strace.1.html>`_
+ * `perf <https://man7.org/linux/man-pages/man1/perf.1.html>`_
+ * `paxtest README <https://github.com/opntr/paxtest-freebsd/blob/hardenedbsd/0.9.14-hbsd/README>`_
+ * `stress-ng <https://www.mankier.com/1/stress-ng>`_
+ * `Monitoring and managing system status and performance <https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance/index>`_
diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst
index 8de008c0c5ad..3a9c041d7f6c 100644
--- a/Documentation/admin-guide/xfs.rst
+++ b/Documentation/admin-guide/xfs.rst
@@ -236,13 +236,14 @@ the dates listed above.
Deprecated Mount Options
========================
-=========================== ================
+============================ ================
Name Removal Schedule
-=========================== ================
+============================ ================
Mounting with V4 filesystem September 2030
+Mounting ascii-ci filesystem September 2030
ikeep/noikeep September 2025
attr2/noattr2 September 2025
-=========================== ================
+============================ ================
Removed Mount Options
@@ -296,7 +297,7 @@ The following sysctls are available for the XFS filesystem:
XFS_ERRLEVEL_LOW: 1
XFS_ERRLEVEL_HIGH: 5
- fs.xfs.panic_mask (Min: 0 Default: 0 Max: 256)
+ fs.xfs.panic_mask (Min: 0 Default: 0 Max: 511)
Causes certain error conditions to call BUG(). Value is a bitmask;
OR together the tags which represent errors which should cause panics:
diff --git a/Documentation/arch.rst b/Documentation/arch.rst
deleted file mode 100644
index f10bd32a5972..000000000000
--- a/Documentation/arch.rst
+++ /dev/null
@@ -1,26 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-CPU Architectures
-=================
-
-These books provide programming details about architecture-specific
-implementation.
-
-.. toctree::
- :maxdepth: 2
-
- arm/index
- arm64/index
- ia64/index
- m68k/index
- mips/index
- nios2/index
- openrisc/index
- parisc/index
- powerpc/index
- riscv/index
- s390/index
- sh/index
- sparc/index
- x86/index
- xtensa/index
diff --git a/Documentation/arch/arc/arc.rst b/Documentation/arch/arc/arc.rst
new file mode 100644
index 000000000000..6c4d978f3f4e
--- /dev/null
+++ b/Documentation/arch/arc/arc.rst
@@ -0,0 +1,85 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Linux kernel for ARC processors
+*******************************
+
+Other sources of information
+############################
+
+Below are some resources where more information can be found on
+ARC processors and relevant open source projects.
+
+- `<https://embarc.org>`_ - Community portal for open source on ARC.
+ Good place to start to find relevant FOSS projects, toolchain releases,
+ news items and more.
+
+- `<https://github.com/foss-for-synopsys-dwc-arc-processors>`_ -
+ Home for all development activities regarding open source projects for
+ ARC processors. Some of the projects are forks of various upstream projects,
+ where "work in progress" is hosted prior to submission to upstream projects.
+  Other projects are developed by Synopsys and made available to the
+  community as open source for use on ARC Processors.
+
+- `Official Synopsys ARC Processors website
+ <https://www.synopsys.com/designware-ip/processor-solutions.html>`_ -
+  the official product page, with access to some IP documentation (`Programmer's Reference
+ Manual, AKA PRM for ARC HS processors
+ <https://www.synopsys.com/dw/doc.php/ds/cc/programmers-reference-manual-ARC-HS.pdf>`_)
+ and free versions of some commercial tools (`Free nSIM
+ <https://www.synopsys.com/cgi-bin/dwarcnsim/req1.cgi>`_ and
+ `MetaWare Light Edition <https://www.synopsys.com/cgi-bin/arcmwtk_lite/reg1.cgi>`_).
+ Please note though, registration is required to access both the documentation and
+ the tools.
+
+Important note on ARC processors configurability
+################################################
+
+ARC processors are highly configurable and several configurable options
+are supported in Linux. Some options are transparent to software
+(e.g. cache geometries), some can be detected at runtime and configured
+and used accordingly, while some need to be explicitly selected or configured
+in the kernel's configuration utility (AKA "make menuconfig").
+
+However not all configurable options are supported when an ARC processor
+is to run Linux. SoC design teams should refer to "Appendix E:
+Configuration for ARC Linux" in the ARC HS Databook for configurability
+guidelines.
+
+Following these guidelines and selecting valid configuration options
+up front is critical to help prevent any unwanted issues during
+SoC bringup and software development in general.
+
+Building the Linux kernel for ARC processors
+############################################
+
+The process of kernel building for ARC processors is the same as for any other
+architecture and can be done in two ways:
+
+- Cross-compilation: process of compiling for ARC targets on a development
+ host with a different processor architecture (generally x86_64/amd64).
+- Native compilation: process of compiling for ARC on an ARC platform
+ (hardware board or a simulator like QEMU) with complete development environment
+ (GNU toolchain, dtc, make etc) installed on the platform.
+
+In both cases, an up-to-date GNU toolchain for ARC is needed on the host.
+Prebuilt toolchain releases which can be used for this purpose are
+available from:
+
+- Synopsys GNU toolchain releases:
+ `<https://github.com/foss-for-synopsys-dwc-arc-processors/toolchain/releases>`_
+
+- Linux kernel compilers collection:
+ `<https://mirrors.edge.kernel.org/pub/tools/crosstool>`_
+
+- Bootlin's toolchain collection: `<https://toolchains.bootlin.com>`_
+
+Once the toolchain is installed in the system, make sure its "bin" folder
+is added to your ``PATH`` environment variable. Then set ``ARCH=arc`` and
+``CROSS_COMPILE=arc-linux`` (or whatever matches the installed ARC toolchain
+prefix) and run the usual ``make defconfig && make``.
+
+This will produce a "vmlinux" file in the root of the kernel source tree,
+usable for loading on the target system via JTAG.
+If you need an image usable with the U-Boot bootloader,
+type ``make uImage`` and ``uImage`` will be produced in the ``arch/arc/boot``
+folder.
diff --git a/Documentation/arch/arc/features.rst b/Documentation/arch/arc/features.rst
new file mode 100644
index 000000000000..b793583d688a
--- /dev/null
+++ b/Documentation/arch/arc/features.rst
@@ -0,0 +1,3 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. kernel-feat:: $srctree/Documentation/features arc
diff --git a/Documentation/arch/arc/index.rst b/Documentation/arch/arc/index.rst
new file mode 100644
index 000000000000..7b098d4a5e3e
--- /dev/null
+++ b/Documentation/arch/arc/index.rst
@@ -0,0 +1,17 @@
+===================
+ARC architecture
+===================
+
+.. toctree::
+ :maxdepth: 1
+
+ arc
+
+ features
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/ia64/aliasing.rst b/Documentation/arch/ia64/aliasing.rst
index a08b36aba015..36a1e1d4842b 100644
--- a/Documentation/ia64/aliasing.rst
+++ b/Documentation/arch/ia64/aliasing.rst
@@ -61,7 +61,7 @@ Memory Map
The efi_memmap table is preserved unmodified because the original
boot-time information is required for kexec.
-Kernel Identify Mappings
+Kernel Identity Mappings
========================
Linux/ia64 identity mappings are done with large pages, currently
diff --git a/Documentation/ia64/efirtc.rst b/Documentation/arch/ia64/efirtc.rst
index fd8328408301..fd8328408301 100644
--- a/Documentation/ia64/efirtc.rst
+++ b/Documentation/arch/ia64/efirtc.rst
diff --git a/Documentation/ia64/err_inject.rst b/Documentation/arch/ia64/err_inject.rst
index 900f71e93a29..900f71e93a29 100644
--- a/Documentation/ia64/err_inject.rst
+++ b/Documentation/arch/ia64/err_inject.rst
diff --git a/Documentation/ia64/features.rst b/Documentation/arch/ia64/features.rst
index d7226fdcf5f8..d7226fdcf5f8 100644
--- a/Documentation/ia64/features.rst
+++ b/Documentation/arch/ia64/features.rst
diff --git a/Documentation/ia64/fsys.rst b/Documentation/arch/ia64/fsys.rst
index a702d2cc94b6..a702d2cc94b6 100644
--- a/Documentation/ia64/fsys.rst
+++ b/Documentation/arch/ia64/fsys.rst
diff --git a/Documentation/ia64/ia64.rst b/Documentation/arch/ia64/ia64.rst
index b725019a9492..b725019a9492 100644
--- a/Documentation/ia64/ia64.rst
+++ b/Documentation/arch/ia64/ia64.rst
diff --git a/Documentation/ia64/index.rst b/Documentation/arch/ia64/index.rst
index 761f2154dfa2..761f2154dfa2 100644
--- a/Documentation/ia64/index.rst
+++ b/Documentation/arch/ia64/index.rst
diff --git a/Documentation/ia64/irq-redir.rst b/Documentation/arch/ia64/irq-redir.rst
index 6bbbbe4f73ef..6bbbbe4f73ef 100644
--- a/Documentation/ia64/irq-redir.rst
+++ b/Documentation/arch/ia64/irq-redir.rst
diff --git a/Documentation/ia64/mca.rst b/Documentation/arch/ia64/mca.rst
index 08270bba44a4..08270bba44a4 100644
--- a/Documentation/ia64/mca.rst
+++ b/Documentation/arch/ia64/mca.rst
diff --git a/Documentation/ia64/serial.rst b/Documentation/arch/ia64/serial.rst
index 1de70c305a79..1de70c305a79 100644
--- a/Documentation/ia64/serial.rst
+++ b/Documentation/arch/ia64/serial.rst
diff --git a/Documentation/arch/index.rst b/Documentation/arch/index.rst
new file mode 100644
index 000000000000..80ee31016584
--- /dev/null
+++ b/Documentation/arch/index.rst
@@ -0,0 +1,28 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+CPU Architectures
+=================
+
+These books provide programming details about architecture-specific
+implementation.
+
+.. toctree::
+ :maxdepth: 2
+
+ arc/index
+ ../arm/index
+ ../arm64/index
+ ia64/index
+ ../loongarch/index
+ m68k/index
+ ../mips/index
+ nios2/index
+ openrisc/index
+ parisc/index
+ ../powerpc/index
+ ../riscv/index
+ ../s390/index
+ sh/index
+ sparc/index
+ x86/index
+ xtensa/index
diff --git a/Documentation/m68k/buddha-driver.rst b/Documentation/arch/m68k/buddha-driver.rst
index 20e401413991..20e401413991 100644
--- a/Documentation/m68k/buddha-driver.rst
+++ b/Documentation/arch/m68k/buddha-driver.rst
diff --git a/Documentation/m68k/features.rst b/Documentation/arch/m68k/features.rst
index 5107a2119472..5107a2119472 100644
--- a/Documentation/m68k/features.rst
+++ b/Documentation/arch/m68k/features.rst
diff --git a/Documentation/m68k/index.rst b/Documentation/arch/m68k/index.rst
index 0f890dbb5fe2..0f890dbb5fe2 100644
--- a/Documentation/m68k/index.rst
+++ b/Documentation/arch/m68k/index.rst
diff --git a/Documentation/m68k/kernel-options.rst b/Documentation/arch/m68k/kernel-options.rst
index cabd9419740d..2008a20b4329 100644
--- a/Documentation/m68k/kernel-options.rst
+++ b/Documentation/arch/m68k/kernel-options.rst
@@ -367,8 +367,8 @@ activated by a "external:" sub-option.
4.1.2) inverse
--------------
-Invert the display. This affects both, text (consoles) and graphics
-(X) display. Usually, the background is chosen to be black. With this
+Invert the display. This affects only text consoles.
+Usually, the background is chosen to be black. With this
option, you can make the background white.
4.1.3) font
diff --git a/Documentation/nios2/features.rst b/Documentation/arch/nios2/features.rst
index 8449e63f69b2..8449e63f69b2 100644
--- a/Documentation/nios2/features.rst
+++ b/Documentation/arch/nios2/features.rst
diff --git a/Documentation/nios2/index.rst b/Documentation/arch/nios2/index.rst
index 4468fe1a1037..4468fe1a1037 100644
--- a/Documentation/nios2/index.rst
+++ b/Documentation/arch/nios2/index.rst
diff --git a/Documentation/nios2/nios2.rst b/Documentation/arch/nios2/nios2.rst
index 43da3f7cee76..43da3f7cee76 100644
--- a/Documentation/nios2/nios2.rst
+++ b/Documentation/arch/nios2/nios2.rst
diff --git a/Documentation/openrisc/features.rst b/Documentation/arch/openrisc/features.rst
index 3f7c40d219f2..3f7c40d219f2 100644
--- a/Documentation/openrisc/features.rst
+++ b/Documentation/arch/openrisc/features.rst
diff --git a/Documentation/openrisc/index.rst b/Documentation/arch/openrisc/index.rst
index 6879f998b87a..6879f998b87a 100644
--- a/Documentation/openrisc/index.rst
+++ b/Documentation/arch/openrisc/index.rst
diff --git a/Documentation/openrisc/openrisc_port.rst b/Documentation/arch/openrisc/openrisc_port.rst
index 657ac4af7be6..657ac4af7be6 100644
--- a/Documentation/openrisc/openrisc_port.rst
+++ b/Documentation/arch/openrisc/openrisc_port.rst
diff --git a/Documentation/openrisc/todo.rst b/Documentation/arch/openrisc/todo.rst
index 420b18b87eda..420b18b87eda 100644
--- a/Documentation/openrisc/todo.rst
+++ b/Documentation/arch/openrisc/todo.rst
diff --git a/Documentation/parisc/debugging.rst b/Documentation/arch/parisc/debugging.rst
index de1b60402c5b..de1b60402c5b 100644
--- a/Documentation/parisc/debugging.rst
+++ b/Documentation/arch/parisc/debugging.rst
diff --git a/Documentation/parisc/features.rst b/Documentation/arch/parisc/features.rst
index 501d7c450037..501d7c450037 100644
--- a/Documentation/parisc/features.rst
+++ b/Documentation/arch/parisc/features.rst
diff --git a/Documentation/parisc/index.rst b/Documentation/arch/parisc/index.rst
index 240685751825..240685751825 100644
--- a/Documentation/parisc/index.rst
+++ b/Documentation/arch/parisc/index.rst
diff --git a/Documentation/parisc/registers.rst b/Documentation/arch/parisc/registers.rst
index 59c8ecf3e856..59c8ecf3e856 100644
--- a/Documentation/parisc/registers.rst
+++ b/Documentation/arch/parisc/registers.rst
diff --git a/Documentation/sh/booting.rst b/Documentation/arch/sh/booting.rst
index d851c49a01bf..d851c49a01bf 100644
--- a/Documentation/sh/booting.rst
+++ b/Documentation/arch/sh/booting.rst
diff --git a/Documentation/sh/features.rst b/Documentation/arch/sh/features.rst
index f722af3b6c99..f722af3b6c99 100644
--- a/Documentation/sh/features.rst
+++ b/Documentation/arch/sh/features.rst
diff --git a/Documentation/sh/index.rst b/Documentation/arch/sh/index.rst
index c64776738cf6..c64776738cf6 100644
--- a/Documentation/sh/index.rst
+++ b/Documentation/arch/sh/index.rst
diff --git a/Documentation/sh/new-machine.rst b/Documentation/arch/sh/new-machine.rst
index e501c52b3b30..e501c52b3b30 100644
--- a/Documentation/sh/new-machine.rst
+++ b/Documentation/arch/sh/new-machine.rst
diff --git a/Documentation/sh/register-banks.rst b/Documentation/arch/sh/register-banks.rst
index 2bef5c8fcbbc..2bef5c8fcbbc 100644
--- a/Documentation/sh/register-banks.rst
+++ b/Documentation/arch/sh/register-banks.rst
diff --git a/Documentation/sparc/adi.rst b/Documentation/arch/sparc/adi.rst
index 857ad30f9569..dbcd8b6e7bc3 100644
--- a/Documentation/sparc/adi.rst
+++ b/Documentation/arch/sparc/adi.rst
@@ -38,7 +38,7 @@ virtual addresses that contain 0xa in bits 63-60.
ADI is enabled on a set of pages using mprotect() with PROT_ADI flag.
When ADI is enabled on a set of pages by a task for the first time,
-kernel sets the PSTATE.mcde bit fot the task. Version tags for memory
+kernel sets the PSTATE.mcde bit for the task. Version tags for memory
addresses are set with an stxa instruction on the addresses using
ASI_MCD_PRIMARY or ASI_MCD_ST_BLKINIT_PRIMARY. ADI block size is
provided by the hypervisor to the kernel. Kernel returns the value of
@@ -97,7 +97,7 @@ With ADI enabled, following new traps may occur:
Disrupting memory corruption
----------------------------
- When a store accesses a memory localtion that has TTE.mcd=1,
+ When a store accesses a memory location that has TTE.mcd=1,
the task is running with ADI enabled (PSTATE.mcde=1), and the ADI
tag in the address used (bits 63:60) does not match the tag set on
the corresponding cacheline, a memory corruption trap occurs. By
diff --git a/Documentation/sparc/console.rst b/Documentation/arch/sparc/console.rst
index 73132db83ece..73132db83ece 100644
--- a/Documentation/sparc/console.rst
+++ b/Documentation/arch/sparc/console.rst
diff --git a/Documentation/sparc/features.rst b/Documentation/arch/sparc/features.rst
index c0c92468b0fe..c0c92468b0fe 100644
--- a/Documentation/sparc/features.rst
+++ b/Documentation/arch/sparc/features.rst
diff --git a/Documentation/sparc/index.rst b/Documentation/arch/sparc/index.rst
index ae884224eec2..ae884224eec2 100644
--- a/Documentation/sparc/index.rst
+++ b/Documentation/arch/sparc/index.rst
diff --git a/Documentation/sparc/oradax/dax-hv-api.txt b/Documentation/arch/sparc/oradax/dax-hv-api.txt
index 73e8d506cf64..7ecd0bf4957b 100644
--- a/Documentation/sparc/oradax/dax-hv-api.txt
+++ b/Documentation/arch/sparc/oradax/dax-hv-api.txt
@@ -22,7 +22,7 @@ Chapter 36. Coprocessor services
functionality offered may vary by virtual machine implementation.
The DAX is a virtual device to sun4v guests, with supported data operations indicated by the virtual device
- compatibilty property. Functionality is accessed through the submission of Command Control Blocks
+ compatibility property. Functionality is accessed through the submission of Command Control Blocks
(CCBs) via the ccb_submit API function. The operations are processed asynchronously, with the status
of the submitted operations reported through a Completion Area linked to each CCB. Each CCB has a
separate Completion Area and, unless execution order is specifically restricted through the use of serial-
@@ -313,7 +313,7 @@ bits set, and terminate at a CCB that has the Conditional bit set, but not the P
Secondary Input Description
Format Code
- 0 Element is stored as value minus 1 (0 evalutes to 1, 1 evalutes
+ 0 Element is stored as value minus 1 (0 evaluates to 1, 1 evaluates
to 2, etc)
1 Element is stored as value
@@ -659,7 +659,7 @@ Offset Size Field Description
“Secondary Input Element Size”
[13:10] Output Format (see Section 36.2.1.1.6, “Output Format”)
[9:5] Operand size for first scan criteria value. In a scan value
- operation, this is one of two potential extact match values.
+ operation, this is one of two potential exact match values.
In a scan range operation, this is the size of the upper range
@@ -673,7 +673,7 @@ Offset Size Field Description
operand, minus 1. Values 0xF-0x1E are reserved. A value of
0x1F indicates this operand is not in use for this scan operation.
[4:0] Operand size for second scan criteria value. In a scan value
- operation, this is one of two potential extact match values.
+ operation, this is one of two potential exact match values.
In a scan range operation, this is the size of the lower range
boundary. The value of this field is the number of bytes in the
operand, minus 1. Values 0xF-0x1E are reserved. A value of
@@ -690,24 +690,24 @@ Offset Size Field Description
48 8 Output (same fields as Primary Input)
56 8 Symbol Table (if used by Primary Input). Same fields as Section 36.2.1.2,
“Extract command”
-64 4 Next 4 most significant bytes of first scan criteria operand occuring after the
+64 4 Next 4 most significant bytes of first scan criteria operand occurring after the
bytes specified at offset 40, if needed by the operand size. If first operand
is less than 8 bytes, the valid bytes are left-aligned to the lowest address.
-68 4 Next 4 most significant bytes of second scan criteria operand occuring after
+68 4 Next 4 most significant bytes of second scan criteria operand occurring after
the bytes specified at offset 44, if needed by the operand size. If second
operand is less than 8 bytes, the valid bytes are left-aligned to the lowest
address.
-72 4 Next 4 most significant bytes of first scan criteria operand occuring after the
+72 4 Next 4 most significant bytes of first scan criteria operand occurring after the
bytes specified at offset 64, if needed by the operand size. If first operand
is less than 12 bytes, the valid bytes are left-aligned to the lowest address.
-76 4 Next 4 most significant bytes of second scan criteria operand occuring after
+76 4 Next 4 most significant bytes of second scan criteria operand occurring after
the bytes specified at offset 68, if needed by the operand size. If second
operand is less than 12 bytes, the valid bytes are left-aligned to the lowest
address.
-80 4 Next 4 most significant bytes of first scan criteria operand occuring after the
+80 4 Next 4 most significant bytes of first scan criteria operand occurring after the
bytes specified at offset 72, if needed by the operand size. If first operand
is less than 16 bytes, the valid bytes are left-aligned to the lowest address.
-84 4 Next 4 most significant bytes of second scan criteria operand occuring after
+84 4 Next 4 most significant bytes of second scan criteria operand occurring after
the bytes specified at offset 76, if needed by the operand size. If second
operand is less than 16 bytes, the valid bytes are left-aligned to the lowest
address.
@@ -721,10 +721,10 @@ Offset Size Field Description
36.2.1.4. Translate commands
- The translate commands takes an input array of indicies, and a table of single bit values indexed by those
- indicies, and outputs a bit vector or index array created by reading the tables bit value at each index in
+ The translate commands takes an input array of indices, and a table of single bit values indexed by those
+ indices, and outputs a bit vector or index array created by reading the tables bit value at each index in
the input array. The output should therefore contain exactly one bit per index in the input data stream,
- when outputing as a bit vector. When outputing as an index array, the number of elements depends on the
+ when outputting as a bit vector. When outputting as an index array, the number of elements depends on the
values read in the bit table, but will always be less than, or equal to, the number of input elements. Only
a restricted subset of the possible input format types are supported. No variable width or Huffman/OZIP
encoded input streams are allowed. The primary input data element size must be 3 bytes or less.
@@ -742,7 +742,7 @@ Offset Size Field Description
code in the CCB header.
There are two supported formats for the output stream: the bit vector and index array formats (codes 0x8,
- 0xD, and 0xE). The index array format is an array of indicies of bits which would have been set if the
+ 0xD, and 0xE). The index array format is an array of indices of bits which would have been set if the
output format was a bit array.
The return value of the CCB completion area contains the number of bits set in the output bit vector,
@@ -1254,7 +1254,7 @@ EUNAVAILABLE The requested CCB operation could not be performed at this time.
submitted CCB, or may apply to a larger scope. The status should not be
interpreted as permanent, and the guest should attempt to submit CCBs in
the future which had previously been unable to be performed. The status
- data provides additional information about scope of the retricted availability
+ data provides additional information about scope of the restricted availability
as follows:
Value Description
0 Processing for the exact CCB instance submitted was unavailable,
@@ -1330,20 +1330,20 @@ EUNAVAILABLE The requested CCB operation could not be performed at this time.
of other CCBs ahead of the requested CCB, to provide a relative estimate of when the CCB may execute.
The dax return value is only valid when the state is ENQUEUED. The value returned is the DAX unit
- instance indentifier for the DAX unit processing the queue where the requested CCB is located. The value
+ instance identifier for the DAX unit processing the queue where the requested CCB is located. The value
matches the value that would have been, or was, returned by ccb_submit using the queue info flag.
The queue return value is only valid when the state is ENQUEUED. The value returned is the DAX
- queue instance indentifier for the DAX unit processing the queue where the requested CCB is located. The
+ queue instance identifier for the DAX unit processing the queue where the requested CCB is located. The
value matches the value that would have been, or was, returned by ccb_submit using the queue info flag.
36.3.2.1. Errors
- EOK The request was proccessed and the CCB state is valid.
+ EOK The request was processed and the CCB state is valid.
EBADALIGN address is not on a 64-byte aligned.
ENORADDR The real address provided for address is not valid.
EINVAL The CCB completion area contents are not valid.
- EWOULDBLOCK Internal resource contraints prevented the CCB state from being queried at this
+ EWOULDBLOCK Internal resource constraints prevented the CCB state from being queried at this
time. The guest should retry the request.
ENOACCESS The guest does not have permission to access the coprocessor virtual device
functionality.
@@ -1401,11 +1401,11 @@ EUNAVAILABLE The requested CCB operation could not be performed at this time.
36.3.3.2. Errors
- EOK The request was proccessed and the result is valid.
+ EOK The request was processed and the result is valid.
EBADALIGN address is not on a 64-byte aligned.
ENORADDR The real address provided for address is not valid.
EINVAL The CCB completion area contents are not valid.
- EWOULDBLOCK Internal resource contraints prevented the CCB from being killed at this time.
+ EWOULDBLOCK Internal resource constraints prevented the CCB from being killed at this time.
The guest should retry the request.
ENOACCESS The guest does not have permission to access the coprocessor virtual device
functionality.
@@ -1423,7 +1423,7 @@ EUNAVAILABLE The requested CCB operation could not be performed at this time.
36.3.4.1. Errors
- EOK The request was proccessed and the number of enabled/disabled DAX units
+ EOK The request was processed and the number of enabled/disabled DAX units
are valid.
diff --git a/Documentation/sparc/oradax/oracle-dax.rst b/Documentation/arch/sparc/oradax/oracle-dax.rst
index d1e14d572918..d1e14d572918 100644
--- a/Documentation/sparc/oradax/oracle-dax.rst
+++ b/Documentation/arch/sparc/oradax/oracle-dax.rst
diff --git a/Documentation/arch/x86/amd-memory-encryption.rst b/Documentation/arch/x86/amd-memory-encryption.rst
new file mode 100644
index 000000000000..934310ce7258
--- /dev/null
+++ b/Documentation/arch/x86/amd-memory-encryption.rst
@@ -0,0 +1,133 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+AMD Memory Encryption
+=====================
+
+Secure Memory Encryption (SME) and Secure Encrypted Virtualization (SEV) are
+features found on AMD processors.
+
+SME provides the ability to mark individual pages of memory as encrypted using
+the standard x86 page tables. A page that is marked encrypted will be
+automatically decrypted when read from DRAM and encrypted when written to
+DRAM. SME can therefore be used to protect the contents of DRAM from physical
+attacks on the system.
+
+SEV enables running encrypted virtual machines (VMs) in which the code and data
+of the guest VM are secured so that a decrypted version is available only
+within the VM itself. SEV guest VMs have the concept of private and shared
+memory. Private memory is encrypted with the guest-specific key, while shared
+memory may be encrypted with the hypervisor key. When SME is enabled, the
+hypervisor key is the same key which is used in SME.
+
+A page is encrypted when a page table entry has the encryption bit set (see
+below on how to determine its position). The encryption bit can also be
+specified in the cr3 register, allowing the PGD table to be encrypted. Each
+successive level of page tables can also be encrypted by setting the encryption
+bit in the page table entry that points to the next table. This allows the full
+page table hierarchy to be encrypted. Note that having the encryption bit
+set in cr3 does not imply the full hierarchy is encrypted; each page table
+entry in the hierarchy needs to have the encryption bit set to achieve that.
+So, theoretically, you could have the encryption bit set in cr3 so that the
+PGD is encrypted, but not set the encryption bit in the PGD entry for a PUD,
+which results in the PUD pointed to by that entry not being encrypted.
+
+When SEV is enabled, instruction pages and guest page tables are always treated
+as private. All the DMA operations inside the guest must be performed on shared
+memory. Since the memory encryption bit is controlled by the guest OS when it
+is operating in 64-bit or 32-bit PAE mode, in all other modes the SEV hardware
+forces the memory encryption bit to 1.
+
+Support for SME and SEV can be determined through the CPUID instruction. The
+CPUID function 0x8000001f reports information related to SME::
+
+ 0x8000001f[eax]:
+ Bit[0] indicates support for SME
+ Bit[1] indicates support for SEV
+ 0x8000001f[ebx]:
+ Bits[5:0] pagetable bit number used to activate memory
+ encryption
+ Bits[11:6] reduction in physical address space, in bits, when
+ memory encryption is enabled (this only affects
+ system physical addresses, not guest physical
+ addresses)
+
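+A minimal user-space sketch of this check, using GCC's ``<cpuid.h>`` helper
+(the bit positions follow the layout above)::
+
+    #include <cpuid.h>
+    #include <stdio.h>
+
+    int main(void)
+    {
+        unsigned int eax, ebx, ecx, edx;
+
+        /* CPUID leaf 0x8000001f reports SME/SEV capabilities */
+        if (!__get_cpuid(0x8000001f, &eax, &ebx, &ecx, &edx))
+            return 1;
+
+        printf("SME supported: %s\n", (eax & (1u << 0)) ? "yes" : "no");
+        printf("SEV supported: %s\n", (eax & (1u << 1)) ? "yes" : "no");
+        /* EBX bits 5:0: page table bit used to mark pages encrypted */
+        printf("C-bit position: %u\n", ebx & 0x3f);
+        return 0;
+    }
+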
+If support for SME is present, MSR 0xc0010010 (MSR_AMD64_SYSCFG) can be used to
+determine if SME is enabled and/or to enable memory encryption::
+
+ 0xc0010010:
+ Bit[23] 0 = memory encryption features are disabled
+ 1 = memory encryption features are enabled
+
+If SEV is supported, MSR 0xc0010131 (MSR_AMD64_SEV) can be used to determine if
+SEV is active::
+
+ 0xc0010131:
+ Bit[0] 0 = memory encryption is not active
+ 1 = memory encryption is active
+
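+As a hedged illustration, this MSR can be read from user space through the
+``msr`` driver, which exposes MSRs at /dev/cpu/*/msr with the MSR number as
+the file offset (requires the msr module to be loaded and root privileges)::
+
+    #include <fcntl.h>
+    #include <stdint.h>
+    #include <stdio.h>
+    #include <unistd.h>
+
+    int main(void)
+    {
+        uint64_t val;
+        int fd = open("/dev/cpu/0/msr", O_RDONLY);
+
+        if (fd < 0)
+            return 1;
+        /* 0xc0010131 is MSR_AMD64_SEV; bit 0 = SEV active */
+        if (pread(fd, &val, sizeof(val), 0xc0010131) != sizeof(val))
+            return 1;
+        printf("SEV active: %s\n", (val & 1) ? "yes" : "no");
+        close(fd);
+        return 0;
+    }
+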
+Linux relies on BIOS to set this bit if BIOS has determined that the reduction
+in the physical address space as a result of enabling memory encryption (see
+CPUID information above) will not conflict with the address space resource
+requirements for the system. If this bit is not set upon Linux startup then
+Linux itself will not set it and memory encryption will not be possible.
+
+The state of SME in the Linux kernel can be described as follows:
+
+ - Supported:
+ The CPU supports SME (determined through CPUID instruction).
+
+ - Enabled:
+ Supported and bit 23 of MSR_AMD64_SYSCFG is set.
+
+ - Active:
+ Supported, Enabled and the Linux kernel is actively applying
+ the encryption bit to page table entries (the SME mask in the
+ kernel is non-zero).
+
+SME can also be enabled and activated in the BIOS. If SME is enabled and
+activated in the BIOS, then all memory accesses will be encrypted and it will
+not be necessary to activate the Linux memory encryption support. If the BIOS
+merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG), then Linux can activate
+memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or
+by supplying mem_encrypt=on on the kernel command line. However, if the BIOS
+does not enable SME, then Linux will not be able to activate memory
+encryption, even if configured to do so by default or if the mem_encrypt=on
+command line parameter is specified.
+
+Secure Nested Paging (SNP)
+==========================
+
+SEV-SNP introduces new features (SEV_FEATURES[1:63]) which can be enabled
+by the hypervisor for security enhancements. Some of these features need a
+guest-side implementation to function correctly. The table below lists the
+expected guest behavior with various possible scenarios of guest/hypervisor
+SNP feature support.
+
++-----------------+---------------+---------------+------------------+
+| Feature Enabled | Guest needs | Guest has | Guest boot |
+| by the HV | implementation| implementation| behaviour |
++=================+===============+===============+==================+
+| No | No | No | Boot |
+| | | | |
++-----------------+---------------+---------------+------------------+
+| No | Yes | No | Boot |
+| | | | |
++-----------------+---------------+---------------+------------------+
+| No | Yes | Yes | Boot |
+| | | | |
++-----------------+---------------+---------------+------------------+
+| Yes | No | No | Boot with |
+| | | | feature enabled |
++-----------------+---------------+---------------+------------------+
+| Yes | Yes | No | Graceful boot |
+| | | | failure |
++-----------------+---------------+---------------+------------------+
+| Yes | Yes | Yes | Boot with |
+| | | | feature enabled |
++-----------------+---------------+---------------+------------------+
+
+More details can be found in the AMD64 APM [1], Vol 2: 15.34.10, SEV_STATUS MSR.
+
+[1] https://www.amd.com/system/files/TechDocs/40332.pdf
diff --git a/Documentation/arch/x86/amd_hsmp.rst b/Documentation/arch/x86/amd_hsmp.rst
new file mode 100644
index 000000000000..440e4b645a1c
--- /dev/null
+++ b/Documentation/arch/x86/amd_hsmp.rst
@@ -0,0 +1,86 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================
+AMD HSMP interface
+============================================
+
+The newer Fam19h EPYC server line of processors from AMD supports system
+management functionality via HSMP (Host System Management Port).
+
+The Host System Management Port (HSMP) is an interface to provide
+OS-level software with access to system management functions via a
+set of mailbox registers.
+
+More details on the interface can be found in chapter
+"7 Host System Management Port (HSMP)" of the family/model PPR
+Eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
+
+The HSMP interface is supported on EPYC server CPU models only.
+
+
+HSMP device
+============================================
+
+The amd_hsmp driver under drivers/platform/x86/ creates the miscdevice
+/dev/hsmp to let user space programs run HSMP mailbox commands::
+
+    $ ls -al /dev/hsmp
+    crw-r--r-- 1 root root 10, 123 Jan 21 21:41 /dev/hsmp
+
+Characteristics of the dev node:
+ * Write mode is used for running set/configure commands
+ * Read mode is used for running get/status monitor commands
+
+Access restrictions:
+ * Only root user is allowed to open the file in write mode.
+ * The file can be opened in read mode by all users.
+
+In-kernel integration:
+ * Other subsystems in the kernel can use the exported transport
+ function hsmp_send_message().
+ * Locking across callers is taken care of by the driver.
+
+
+An example
+==========
+
+To access the HSMP device from a C program,
+first include the header::
+
+ #include <linux/amd_hsmp.h>
+
+which defines the supported messages/message IDs.
+
+Next, open the device file, as follows::
+
+ int file;
+
+ file = open("/dev/hsmp", O_RDWR);
+ if (file < 0) {
+ /* ERROR HANDLING; you can check errno to see what went wrong */
+ exit(1);
+ }
+
+The following IOCTL is defined:
+
+``ioctl(file, HSMP_IOCTL_CMD, struct hsmp_message *msg)``
+ The argument is a pointer to a::
+
+ struct hsmp_message {
+ __u32 msg_id; /* Message ID */
+ __u16 num_args; /* Number of input argument words in message */
+ __u16 response_sz; /* Number of expected output/response words */
+ __u32 args[HSMP_MAX_MSG_LEN]; /* argument/response buffer */
+ __u16 sock_ind; /* socket number */
+ };
+
+The ioctl returns non-zero on failure; you can read errno to see
+what happened. The transaction returns 0 on success.
+
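+A hedged usage sketch, continuing the example above (add ``<sys/ioctl.h>``
+and ``<stdio.h>`` to the includes). It assumes the HSMP_TEST message, which
+is expected to echo back its input argument incremented by one; consult the
+PPR for the messages your platform actually supports::
+
+    struct hsmp_message msg = { 0 };
+    int ret;
+
+    msg.msg_id = HSMP_TEST;   /* echo-style test message (illustrative choice) */
+    msg.num_args = 1;         /* one input argument word */
+    msg.response_sz = 1;      /* one expected response word */
+    msg.args[0] = 0xDEAD;     /* arbitrary test value */
+    msg.sock_ind = 0;         /* target socket 0 */
+
+    ret = ioctl(file, HSMP_IOCTL_CMD, &msg);
+    if (ret)
+        perror("HSMP ioctl");
+    else
+        printf("response: 0x%x\n", msg.args[0]);
+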
+More details on the interface and message definitions can be found in chapter
+"7 Host System Management Port (HSMP)" of the respective family/model PPR
+eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
+
+User space C-APIs are made available by linking against the esmi library,
+which is provided by the E-SMS project https://developer.amd.com/e-sms/.
+See: https://github.com/amd/esmi_ib_library
diff --git a/Documentation/x86/boot.rst b/Documentation/arch/x86/boot.rst
index 894a19897005..33520ecdb37a 100644
--- a/Documentation/x86/boot.rst
+++ b/Documentation/arch/x86/boot.rst
@@ -455,6 +455,7 @@ Protocol: 2.00+
11 Minimal Linux Bootloader
<http://sebastian-plotz.blogspot.de>
12 OVMF UEFI virtualization stack
+ 13 barebox
== =======================================
Please contact <hpa@zytor.com> if you need a bootloader ID value assigned.
@@ -1343,7 +1344,7 @@ follow::
In addition to read/modify/write the setup header of the struct
boot_params as that of 16-bit boot protocol, the boot loader should
also fill the additional fields of the struct boot_params as
-described in chapter Documentation/x86/zero-page.rst.
+described in chapter Documentation/arch/x86/zero-page.rst.
After setting up the struct boot_params, the boot loader can load the
32/64-bit kernel in the same way as that of 16-bit boot protocol.
@@ -1379,7 +1380,7 @@ can be calculated as follows::
In addition to read/modify/write the setup header of the struct
boot_params as that of 16-bit boot protocol, the boot loader should
also fill the additional fields of the struct boot_params as described
-in chapter Documentation/x86/zero-page.rst.
+in chapter Documentation/arch/x86/zero-page.rst.
After setting up the struct boot_params, the boot loader can load
64-bit kernel in the same way as that of 16-bit boot protocol, but
diff --git a/Documentation/x86/booting-dt.rst b/Documentation/arch/x86/booting-dt.rst
index 965a374071ab..b089ffd56e6e 100644
--- a/Documentation/x86/booting-dt.rst
+++ b/Documentation/arch/x86/booting-dt.rst
@@ -7,7 +7,7 @@ DeviceTree Booting
the decompressor (the real mode entry point goes to the same 32bit
entry point once it switched into protected mode). That entry point
supports one calling convention which is documented in
- Documentation/x86/boot.rst
+ Documentation/arch/x86/boot.rst
The physical pointer to the device-tree block is passed via setup_data
which requires at least boot protocol 2.09.
The type filed is defined as
diff --git a/Documentation/x86/buslock.rst b/Documentation/arch/x86/buslock.rst
index 7c051e714943..31ec0ef78086 100644
--- a/Documentation/x86/buslock.rst
+++ b/Documentation/arch/x86/buslock.rst
@@ -53,8 +53,14 @@ parameter "split_lock_detect". Here is a summary of different options:
|off |Do nothing |Do nothing |
+------------------+----------------------------+-----------------------+
|warn |Kernel OOPs |Warn once per task and |
-|(default) |Warn once per task and |and continues to run. |
-| |disable future checking | |
+|(default) |Warn once per task, add a |and continues to run. |
+| |delay, add synchronization | |
+| |to prevent more than one | |
+| |core from executing a | |
+| |split lock in parallel. | |
+| |sysctl split_lock_mitigate | |
+| |can be used to avoid the | |
+| |delay and synchronization | |
| |When both features are | |
| |supported, warn in #AC | |
+------------------+----------------------------+-----------------------+
diff --git a/Documentation/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst
index 5d54c39a063f..08246e8ac835 100644
--- a/Documentation/x86/cpuinfo.rst
+++ b/Documentation/arch/x86/cpuinfo.rst
@@ -140,9 +140,8 @@ from #define X86_FEATURE_UMIP (16*32 + 2).
In addition, there exists a variety of custom command-line parameters that
disable specific features. The list of parameters includes, but is not limited
-to, nofsgsbase, nosmap, and nosmep. 5-level paging can also be disabled using
-"no5lvl". SMAP and SMEP are disabled with the aforementioned parameters,
-respectively.
+to, nofsgsbase, nosgx, noxsave, etc. 5-level paging can also be disabled using
+"no5lvl".
e: The feature was known to be non-functional.
----------------------------------------------
diff --git a/Documentation/x86/earlyprintk.rst b/Documentation/arch/x86/earlyprintk.rst
index 51ef11e8f725..51ef11e8f725 100644
--- a/Documentation/x86/earlyprintk.rst
+++ b/Documentation/arch/x86/earlyprintk.rst
diff --git a/Documentation/x86/elf_auxvec.rst b/Documentation/arch/x86/elf_auxvec.rst
index 18e4744717f9..18e4744717f9 100644
--- a/Documentation/x86/elf_auxvec.rst
+++ b/Documentation/arch/x86/elf_auxvec.rst
diff --git a/Documentation/x86/entry_64.rst b/Documentation/arch/x86/entry_64.rst
index a48b3f6ebbe8..0afdce3c06f4 100644
--- a/Documentation/x86/entry_64.rst
+++ b/Documentation/arch/x86/entry_64.rst
@@ -8,7 +8,7 @@ This file documents some of the kernel entries in
arch/x86/entry/entry_64.S. A lot of this explanation is adapted from
an email from Ingo Molnar:
-http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu>
+https://lore.kernel.org/r/20110529191055.GC9835%40elte.hu
The x86 architecture has quite a few different ways to jump into
kernel code. Most of these entry points are registered in
@@ -33,8 +33,8 @@ Some of these entries are:
- interrupt: An array of entries. Every IDT vector that doesn't
explicitly point somewhere else gets set to the corresponding
value in interrupts. These point to a whole array of
- magically-generated functions that make their way to do_IRQ with
- the interrupt number as a parameter.
+ magically-generated functions that make their way to common_interrupt()
+ with the interrupt number as a parameter.
- APIC interrupts: Various special-purpose interrupts for things
like TLB shootdown.
diff --git a/Documentation/x86/exception-tables.rst b/Documentation/arch/x86/exception-tables.rst
index de58110c5ffd..efde1fef4fbd 100644
--- a/Documentation/x86/exception-tables.rst
+++ b/Documentation/arch/x86/exception-tables.rst
@@ -32,14 +32,14 @@ Whenever the kernel tries to access an address that is currently not
accessible, the CPU generates a page fault exception and calls the
page fault handler::
- void do_page_fault(struct pt_regs *regs, unsigned long error_code)
+ void exc_page_fault(struct pt_regs *regs, unsigned long error_code)
in arch/x86/mm/fault.c. The parameters on the stack are set up by
the low level assembly glue in arch/x86/entry/entry_32.S. The parameter
regs is a pointer to the saved registers on the stack, error_code
contains a reason code for the exception.
-do_page_fault first obtains the unaccessible address from the CPU
+exc_page_fault() first obtains the inaccessible address from the CPU
control register CR2. If the address is within the virtual address
space of the process, the fault probably occurred, because the page
was not swapped in, write protected or something similar. However,
@@ -57,10 +57,10 @@ Where does fixup point to?
Since we jump to the contents of fixup, fixup obviously points
to executable code. This code is hidden inside the user access macros.
-I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
+I have picked the get_user() macro defined in arch/x86/include/asm/uaccess.h
as an example. The definition is somewhat hard to follow, so let's peek at
the code generated by the preprocessor and the compiler. I selected
-the get_user call in drivers/char/sysrq.c for a detailed examination.
+the get_user() call in drivers/char/sysrq.c for a detailed examination.
The original code in sysrq.c line 587::
@@ -281,12 +281,15 @@ vma occurs?
> c017e7a5 <do_con_write+e1> movb (%ebx),%dl
#. MMU generates exception
-#. CPU calls do_page_fault
-#. do page fault calls search_exception_table (regs->eip == c017e7a5);
-#. search_exception_table looks up the address c017e7a5 in the
+#. CPU calls exc_page_fault()
+#. exc_page_fault() calls do_user_addr_fault()
+#. do_user_addr_fault() calls kernelmode_fixup_or_oops()
+#. kernelmode_fixup_or_oops() calls fixup_exception() (regs->eip == c017e7a5);
+#. fixup_exception() calls search_exception_tables()
+#. search_exception_tables() looks up the address c017e7a5 in the
exception table (i.e. the contents of the ELF section __ex_table)
and returns the address of the associated fault handle code c0199ff5.
-#. do_page_fault modifies its own return address to point to the fault
+#. fixup_exception() modifies its own return address to point to the fault
handle code and returns.
#. execution continues in the fault handling code.
#. a) EAX becomes -EFAULT (== -14)
@@ -298,9 +301,9 @@ The steps 8a to 8c in a certain way emulate the faulting instruction.
That's it, mostly. If you look at our example, you might ask why
we set EAX to -EFAULT in the exception handler code. Well, the
-get_user macro actually returns a value: 0, if the user access was
+get_user() macro actually returns a value: 0, if the user access was
successful, -EFAULT on failure. Our original code did not test this
-return value, however the inline assembly code in get_user tries to
+return value, however the inline assembly code in get_user() tries to
return -EFAULT. GCC selected EAX to return this value.
NOTE:
diff --git a/Documentation/x86/features.rst b/Documentation/arch/x86/features.rst
index b663f15053ce..b663f15053ce 100644
--- a/Documentation/x86/features.rst
+++ b/Documentation/arch/x86/features.rst
diff --git a/Documentation/x86/i386/IO-APIC.rst b/Documentation/arch/x86/i386/IO-APIC.rst
index ce4d8df15e7c..ce4d8df15e7c 100644
--- a/Documentation/x86/i386/IO-APIC.rst
+++ b/Documentation/arch/x86/i386/IO-APIC.rst
diff --git a/Documentation/x86/i386/index.rst b/Documentation/arch/x86/i386/index.rst
index 8747cf5bbd49..8747cf5bbd49 100644
--- a/Documentation/x86/i386/index.rst
+++ b/Documentation/arch/x86/i386/index.rst
diff --git a/Documentation/arch/x86/ifs.rst b/Documentation/arch/x86/ifs.rst
new file mode 100644
index 000000000000..97abb696a680
--- /dev/null
+++ b/Documentation/arch/x86/ifs.rst
@@ -0,0 +1,2 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. kernel-doc:: drivers/platform/x86/intel/ifs/ifs.h
diff --git a/Documentation/arch/x86/index.rst b/Documentation/arch/x86/index.rst
new file mode 100644
index 000000000000..c73d133fd37c
--- /dev/null
+++ b/Documentation/arch/x86/index.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+x86-specific Documentation
+==========================
+
+.. toctree::
+ :maxdepth: 2
+ :numbered:
+
+ boot
+ booting-dt
+ cpuinfo
+ topology
+ exception-tables
+ kernel-stacks
+ entry_64
+ earlyprintk
+ orc-unwinder
+ zero-page
+ tlb
+ mtrr
+ pat
+ intel-hfi
+ iommu
+ intel_txt
+ amd-memory-encryption
+ amd_hsmp
+ tdx
+ pti
+ mds
+ microcode
+ resctrl
+ tsx_async_abort
+ buslock
+ usb-legacy-support
+ i386/index
+ x86_64/index
+ ifs
+ sva
+ sgx
+ features
+ elf_auxvec
+ xstate
diff --git a/Documentation/arch/x86/intel-hfi.rst b/Documentation/arch/x86/intel-hfi.rst
new file mode 100644
index 000000000000..49dea58ea4fb
--- /dev/null
+++ b/Documentation/arch/x86/intel-hfi.rst
@@ -0,0 +1,72 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================================
+Hardware-Feedback Interface for scheduling on Intel Hardware
+============================================================
+
+Overview
+--------
+
+Intel has described the Hardware Feedback Interface (HFI) in the Intel 64 and
+IA-32 Architectures Software Developer's Manual (Intel SDM) Volume 3 Section
+14.6 [1]_.
+
+The HFI gives the operating system performance and energy efficiency
+capability data for each CPU in the system. Linux can use the information from
+the HFI to influence task placement decisions.
+
+The Hardware Feedback Interface
+-------------------------------
+
+The Hardware Feedback Interface provides the operating system with information
+about the performance and energy efficiency of each CPU in the system. Each
+capability is given as a unit-less quantity in the range [0-255]. Higher values
+indicate higher capability. Energy efficiency and performance are reported in
+separate capabilities. Even though on some systems these two metrics may be
+related, they are specified as independent capabilities in the Intel SDM.
+
+These capabilities may change at runtime as a result of changes in the
+operating conditions of the system or the action of external factors. The rate
+at which these capabilities are updated is specific to each processor model. On
+some models, capabilities are set at boot time and never change. On others,
+capabilities may change every tens of milliseconds. For instance, a remote
+mechanism may be used to lower Thermal Design Power. Such change can be
+reflected in the HFI. Likewise, if the system needs to be throttled due to
+excessive heat, the HFI may reflect reduced performance on specific CPUs.
+
+The kernel or a userspace policy daemon can use these capabilities to modify
+task placement decisions. For instance, if either the performance or energy
+capabilities of a given logical processor becomes zero, it is an indication that
+the hardware recommends to the operating system to not schedule any tasks on
+that processor for performance or energy efficiency reasons, respectively.
+
+Implementation details for Linux
+--------------------------------
+
+The infrastructure to handle thermal event interrupts has two parts. In the
+Local Vector Table of a CPU's local APIC, there exists an entry for the
+Thermal Monitor Register. This register controls how interrupts are delivered
+to a CPU when the thermal monitor generates an interrupt. Further details
+can be found in the Intel SDM Vol. 3 Section 10.5 [1]_.
+
+The thermal monitor may generate interrupts per CPU or per package. The HFI
+generates package-level interrupts. This monitor is configured and initialized
+via a set of model-specific registers. Specifically, the HFI interrupt and
+status are controlled via designated bits in the IA32_PACKAGE_THERM_INTERRUPT
+and IA32_PACKAGE_THERM_STATUS registers, respectively. There exists one HFI
+table per package. Further details can be found in the Intel SDM Vol. 3
+Section 14.9 [1]_.
+
+The hardware issues an HFI interrupt after updating the HFI table, once it
+is ready for the operating system to consume. CPUs receive such an interrupt
+via the thermal entry in the Local APIC's Local Vector Table.
+
+When servicing such an interrupt, the HFI driver parses the updated table and
+relays the update to userspace using the thermal notification framework. Given
+that there may be many HFI updates every second, the updates relayed to
+userspace are throttled to one every CONFIG_HZ jiffies (i.e. roughly one per
+second).
+
+References
+----------
+
+.. [1] https://www.intel.com/sdm
diff --git a/Documentation/x86/intel_txt.rst b/Documentation/arch/x86/intel_txt.rst
index d83c1a2122c9..d83c1a2122c9 100644
--- a/Documentation/x86/intel_txt.rst
+++ b/Documentation/arch/x86/intel_txt.rst
diff --git a/Documentation/arch/x86/iommu.rst b/Documentation/arch/x86/iommu.rst
new file mode 100644
index 000000000000..42c7a6faa39a
--- /dev/null
+++ b/Documentation/arch/x86/iommu.rst
@@ -0,0 +1,151 @@
+=================
+x86 IOMMU Support
+=================
+
+The architecture specs can be obtained from the below locations.
+
+- Intel: http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/vt-directed-io-spec.pdf
+- AMD: https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
+
+This guide gives a quick cheat sheet for some basic understanding.
+
+Basic stuff
+-----------
+
+ACPI enumerates and lists the different IOMMUs on the platform, and
+device scope relationships between devices and which IOMMU controls
+them.
+
+Some ACPI Keywords:
+
+- DMAR - Intel DMA Remapping table
+- DRHD - Intel DMA Remapping Hardware Unit Definition
+- RMRR - Intel Reserved Memory Region Reporting Structure
+- IVRS - AMD I/O Virtualization Reporting Structure
+- IVDB - AMD I/O Virtualization Definition Block
+- IVHD - AMD I/O Virtualization Hardware Definition
+
+What is Intel RMRR?
+^^^^^^^^^^^^^^^^^^^
+
+There are some devices the BIOS controls, e.g. USB devices used to perform
+PS/2 emulation. The regions of memory used for these devices are marked
+reserved in the e820 map. When we turn on DMA translation, DMA to those
+regions will fail. Hence the BIOS uses RMRRs to specify these regions along
+with the devices that need to access them. The OS is expected to set up
+unity mappings for these regions so these devices can access them.
+
+What is AMD IVRS?
+^^^^^^^^^^^^^^^^^
+
+The architecture defines an ACPI-compatible data structure called an I/O
+Virtualization Reporting Structure (IVRS) that is used to convey information
+related to I/O virtualization to system software. The IVRS describes the
+configuration and capabilities of the IOMMUs contained in the platform as
+well as information about the devices that each IOMMU virtualizes.
+
+The IVRS provides information about the following:
+
+- IOMMUs present in the platform including their capabilities and proper configuration
+- System I/O topology relevant to each IOMMU
+- Peripheral devices that cannot be otherwise enumerated
+- Memory regions used by SMI/SMM, platform firmware, and platform hardware. These are generally exclusion ranges to be configured by system software.
+
+How is an I/O Virtual Address (IOVA) generated?
+-----------------------------------------------
+
+Well-behaved drivers call dma_map_*() before sending a command to a device
+that needs to perform DMA. Once the DMA is completed and the mapping is no
+longer required, the driver calls dma_unmap_*() to unmap the region.
+
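+A minimal sketch of that pattern in a hypothetical driver, using the
+streaming DMA API (``issue_command()`` and its arguments are illustrative,
+not a real API; the device-programming step is elided)::
+
+    #include <linux/dma-mapping.h>
+
+    static int issue_command(struct device *dev, void *buf, size_t size)
+    {
+        dma_addr_t handle;
+
+        /* Map the buffer for device reads before issuing the command */
+        handle = dma_map_single(dev, buf, size, DMA_TO_DEVICE);
+        if (dma_mapping_error(dev, handle))
+            return -ENOMEM;
+
+        /* ... hand 'handle' to the device and wait for completion ... */
+
+        /* Unmap once the DMA is complete and the mapping is unneeded */
+        dma_unmap_single(dev, handle, size, DMA_TO_DEVICE);
+        return 0;
+    }
+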
+Intel Specific Notes
+--------------------
+
+Graphics Problems?
+^^^^^^^^^^^^^^^^^^
+
+If you encounter issues with graphics devices, you can try adding the
+option intel_iommu=igfx_off to turn off the integrated graphics engine.
+If this fixes anything, please ensure you file a bug reporting the problem.
+
+Some exceptions to IOVA
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Interrupt ranges (0xfee00000 - 0xfeefffff) are not address translated.
+The same is true for peer-to-peer transactions. Hence we reserve these
+addresses from PCI MMIO ranges so they are not allocated as IOVA addresses.
+
+AMD Specific Notes
+------------------
+
+Graphics Problems?
+^^^^^^^^^^^^^^^^^^
+
+If you encounter issues with integrated graphics devices, you can try adding
+the option iommu=pt to the kernel command line to use a 1:1 mapping for the
+IOMMU. If this fixes anything, please ensure you file a bug reporting the
+problem.
+
+Fault reporting
+---------------
+
+When errors are reported, the IOMMU signals via an interrupt. The fault
+reason and the device that caused it are printed on the console.
+
+
+Kernel Log Samples
+------------------
+
+Intel Boot Messages
+^^^^^^^^^^^^^^^^^^^
+
+Something like this gets printed indicating presence of DMAR tables
+in ACPI:
+
+::
+
+ ACPI: DMAR (v001 A M I OEMDMAR 0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0
+
+When the DMAR table is being processed and initialized by ACPI, the kernel
+prints the DMAR locations and any RMRRs processed:
+
+::
+
+ ACPI DMAR:Host address width 36
+ ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
+ ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
+ ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
+ ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
+ ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
+
+When DMAR is enabled for use, you will notice:
+
+::
+
+ PCI-DMA: Using DMAR IOMMU
+
+Intel Fault reporting
+^^^^^^^^^^^^^^^^^^^^^
+
+::
+
+ DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+ DMAR:[fault reason 05] PTE Write access is not set
+ DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+ DMAR:[fault reason 05] PTE Write access is not set
+
+AMD Boot Messages
+^^^^^^^^^^^^^^^^^
+
+Something like this gets printed indicating presence of the IOMMU:
+
+::
+
+ iommu: Default domain type: Translated
+ iommu: DMA domain TLB invalidation policy: lazy mode
+
+AMD Fault reporting
+^^^^^^^^^^^^^^^^^^^
+
+::
+
+ AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0007 address=0xffffc02000 flags=0x0000]
+ AMD-Vi: Event logged [IO_PAGE_FAULT device=07:00.0 domain=0x0007 address=0xffffc02000 flags=0x0000]
diff --git a/Documentation/x86/kernel-stacks.rst b/Documentation/arch/x86/kernel-stacks.rst
index 6b0bcf027ff1..738671a4070b 100644
--- a/Documentation/x86/kernel-stacks.rst
+++ b/Documentation/arch/x86/kernel-stacks.rst
@@ -12,7 +12,7 @@ Most of the text from Keith Owens, hacked by AK
x86_64 page size (PAGE_SIZE) is 4K.
Like all other architectures, x86_64 has a kernel stack for every
-active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big.
+active thread. These thread stacks are THREAD_SIZE (4*PAGE_SIZE) big.
These stacks contain useful data as long as a thread is alive or a
zombie. While the thread is in user space the kernel stack is empty
except for the thread_info structure at the bottom.
diff --git a/Documentation/x86/mds.rst b/Documentation/arch/x86/mds.rst
index 5d4330be200f..5d4330be200f 100644
--- a/Documentation/x86/mds.rst
+++ b/Documentation/arch/x86/mds.rst
diff --git a/Documentation/arch/x86/microcode.rst b/Documentation/arch/x86/microcode.rst
new file mode 100644
index 000000000000..b627c6f36bcf
--- /dev/null
+++ b/Documentation/arch/x86/microcode.rst
@@ -0,0 +1,240 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+The Linux Microcode Loader
+==========================
+
+:Authors: - Fenghua Yu <fenghua.yu@intel.com>
+ - Borislav Petkov <bp@suse.de>
+ - Ashok Raj <ashok.raj@intel.com>
+
+The kernel has an x86 microcode loading facility which is supposed to
+provide microcode loading methods in the OS. Potential use cases are
+updating the microcode on platforms beyond the OEM End-Of-Life support,
+and updating the microcode on long-running systems without rebooting.
+
+The loader supports three loading methods:
+
+Early load microcode
+====================
+
+The kernel can update microcode very early during boot. Loading
+microcode early can fix CPU issues before they are observed during
+kernel boot time.
+
+The microcode is stored in an initrd file. During boot, it is read from
+the initrd and loaded into the CPU cores.
+
+The format of the combined initrd image is microcode in (uncompressed)
+cpio format followed by the (possibly compressed) initrd image. The
+loader parses the combined initrd image during boot.
+
+The microcode files in cpio name space are:
+
+on Intel:
+ kernel/x86/microcode/GenuineIntel.bin
+on AMD :
+ kernel/x86/microcode/AuthenticAMD.bin
+
+During BSP (BootStrapping Processor) boot (pre-SMP), the kernel
+scans the microcode file in the initrd. If microcode matching the
+CPU is found, it will be applied in the BSP and later on in all APs
+(Application Processors).
+
+The loader also saves the matching microcode for the CPU in memory.
+Thus, the cached microcode patch is applied when CPUs resume from a
+sleep state.
+
+Here's a crude example of how to prepare an initrd with microcode. (This is
+normally done automatically by the distribution when recreating the
+initrd, so you don't really have to do it yourself. It is documented
+here for future reference only.)
+::
+
+ #!/bin/bash
+
+ if [ -z "$1" ]; then
+ echo "You need to supply an initrd file"
+ exit 1
+ fi
+
+ INITRD="$1"
+
+ DSTDIR=kernel/x86/microcode
+ TMPDIR=/tmp/initrd
+
+ rm -rf $TMPDIR
+
+ mkdir $TMPDIR
+ cd $TMPDIR
+ mkdir -p $DSTDIR
+
+ if [ -d /lib/firmware/amd-ucode ]; then
+ cat /lib/firmware/amd-ucode/microcode_amd*.bin > $DSTDIR/AuthenticAMD.bin
+ fi
+
+ if [ -d /lib/firmware/intel-ucode ]; then
+ cat /lib/firmware/intel-ucode/* > $DSTDIR/GenuineIntel.bin
+ fi
+
+ find . | cpio -o -H newc >../ucode.cpio
+ cd ..
+ mv $INITRD $INITRD.orig
+ cat ucode.cpio $INITRD.orig > $INITRD
+
+ rm -rf $TMPDIR
+
+
+The system needs to have the microcode packages installed into
+/lib/firmware, or you need to fix up the paths above if yours are
+somewhere else and/or you've downloaded them directly from the processor
+vendor's site.
+
+Late loading
+============
+
+You simply install the microcode packages your distro supplies and
+run::
+
+ # echo 1 > /sys/devices/system/cpu/microcode/reload
+
+as root.
+
+The loading mechanism looks for microcode blobs in
+/lib/firmware/{intel-ucode,amd-ucode}. The default distro installation
+packages already put them there.
+
+Since kernel 5.19, late loading is not enabled by default.
+
+The /dev/cpu/microcode method has been removed in 5.19.
+
+Why is late loading dangerous?
+==============================
+
+Synchronizing all CPUs
+----------------------
+
+The microcode engine which receives the microcode update is shared
+between the two logical threads in an SMT system. Therefore, when
+the update is executed on one SMT thread of the core, the sibling
+"automatically" gets the update.
+
+Since the microcode can "simulate" MSRs too, while the microcode update
+is in progress, those simulated MSRs transiently cease to exist. This
+can result in unpredictable results if the SMT sibling thread happens to
+be in the middle of an access to such an MSR. The usual observation is
+that such MSR accesses cause #GPs to be raised to signal that the former
+are not present.
+
+The disappearing MSRs are just one commonly observed issue. Any other
+instruction that's being patched and gets concurrently executed by the
+other SMT sibling can also result in similar, unpredictable behavior.
+
+To eliminate this case, a stop_machine()-based CPU synchronization was
+introduced as a way to guarantee that all logical CPUs will not execute
+any code but just wait in a spin loop, polling an atomic variable.
+
+While this took care of device and external interrupts, and IPIs
+including LVT ones such as CMCI, it cannot address other special
+interrupts that can't be shut off. Those are Machine Check (#MC),
+System Management (#SMI) and Non-Maskable interrupts (#NMI).
+
+Machine Checks
+--------------
+
+Machine Checks (#MC) are non-maskable. There are two kinds of MCEs:
+fatal, un-recoverable MCEs and recoverable MCEs. While un-recoverable
+errors are fatal by definition, recoverable errors that happen in kernel
+context are also treated as fatal by the kernel.
+
+On certain Intel machines, MCEs are also broadcast to all threads in a
+system. If one thread is in the middle of executing WRMSR, an MCE will
+be taken at the end of the flow. Either way, all threads will wait for
+the thread performing the wrmsr(0x79) to rendezvous in the MCE handler,
+and the system will eventually shut down if any of the threads fail to
+check in to the MCE rendezvous.
+
+To be paranoid and get predictable behavior, the OS can choose to set
+MCG_STATUS.MCIP. Since at most one MCE can be in progress in a system
+at a time, a second MCE signaled while this bit is set will promote to
+a system reset automatically. The OS can clear MCIP at the end of the
+update for that core.
+
+System Management Interrupt
+---------------------------
+
+SMIs are also broadcast to all CPUs in the platform. The microcode
+update requests exclusive access to the core before writing to MSR 0x79.
+So if it does happen that one thread is in the WRMSR flow and the second
+one receives an SMI, the latter will be stopped at the first instruction
+of the SMI handler.
+
+Since the secondary thread is stopped at the first instruction of the
+SMI handler, there is very little chance that it would be in the middle
+of executing an instruction being patched. Plus, the OS has no way to
+stop SMIs from happening.
+
+Non-Maskable Interrupts
+-----------------------
+
+When thread0 of a core is doing the microcode update and thread1 is
+pulled into an NMI handler, that can cause unpredictable behavior due
+to the reasons above.
+
+The OS can choose among a variety of methods to avoid running into this
+situation.
+
+
+Is the microcode suitable for late loading?
+-------------------------------------------
+
+Late loading is done when the system is fully operational and running
+real workloads. Late loading behavior depends on what the base patch on
+the CPU is before upgrading to the new patch.
+
+This is true for Intel CPUs.
+
+Consider, for example, a CPU has patch level 1 and the update is to
+patch level 3.
+
+Between patch1 and patch3, patch2 might have deprecated a software-visible
+feature.
+
+This is unacceptable if software is even potentially using that feature.
+For instance, say MSR_X is no longer available after an update;
+accessing that MSR will then cause a #GP fault.
+
+Basically there is no way to declare a new microcode update suitable
+for late-loading. This is another one of the problems that caused late
+loading not to be enabled by default.
+
+Builtin microcode
+=================
+
+The loader also supports loading builtin microcode supplied through
+the regular builtin firmware method CONFIG_EXTRA_FIRMWARE. Only 64-bit
+is currently supported.
+
+Here's an example::
+
+ CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin"
+ CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
+
+This basically means that you have the following tree structure locally::
+
+ /lib/firmware/
+ |-- amd-ucode
+ ...
+ | |-- microcode_amd_fam15h.bin
+ ...
+ |-- intel-ucode
+ ...
+ | |-- 06-3a-09
+ ...
+
+so that the build system can find those files and integrate them into
+the final kernel image. The early loader finds them and applies them.
+
+Needless to say, this method is not the most flexible one because it
+requires rebuilding the kernel each time updated microcode from the CPU
+vendor is available.
diff --git a/Documentation/x86/mtrr.rst b/Documentation/arch/x86/mtrr.rst
index 9f0b1851771a..f65ef034da7a 100644
--- a/Documentation/x86/mtrr.rst
+++ b/Documentation/arch/x86/mtrr.rst
@@ -28,7 +28,7 @@ are aligned with platform MTRR setup. If MTRRs are only set up by the platform
firmware code though and the OS does not make any specific MTRR mapping
requests mtrr_type_lookup() should always return MTRR_TYPE_INVALID.
-For details refer to Documentation/x86/pat.rst.
+For details refer to Documentation/arch/x86/pat.rst.
.. tip::
On Intel P6 family processors (Pentium Pro, Pentium II and later)
diff --git a/Documentation/x86/orc-unwinder.rst b/Documentation/arch/x86/orc-unwinder.rst
index d811576c1f3e..cdb257015bd9 100644
--- a/Documentation/x86/orc-unwinder.rst
+++ b/Documentation/arch/x86/orc-unwinder.rst
@@ -140,7 +140,7 @@ Unwinder implementation details
Objtool generates the ORC data by integrating with the compile-time
stack metadata validation feature, which is described in detail in
-tools/objtool/Documentation/stack-validation.txt. After analyzing all
+tools/objtool/Documentation/objtool.txt. After analyzing all
the code paths of a .o file, it creates an array of orc_entry structs,
and a parallel array of instruction addresses associated with those
structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
@@ -177,6 +177,6 @@ brutal, unyielding efficiency.
ORC stands for Oops Rewind Capability.
-.. [1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
-.. [2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
+.. [1] https://lore.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
+.. [2] https://lore.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
.. [3] http://dustin.wikidot.com/half-orcs-and-orcs
diff --git a/Documentation/x86/pat.rst b/Documentation/arch/x86/pat.rst
index 5d901771016d..5d901771016d 100644
--- a/Documentation/x86/pat.rst
+++ b/Documentation/arch/x86/pat.rst
diff --git a/Documentation/x86/pti.rst b/Documentation/arch/x86/pti.rst
index 4b858a9bad8d..4b858a9bad8d 100644
--- a/Documentation/x86/pti.rst
+++ b/Documentation/arch/x86/pti.rst
diff --git a/Documentation/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 71a531061e4e..387ccbcb558f 100644
--- a/Documentation/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -17,14 +17,21 @@ AMD refers to this feature as AMD Platform Quality of Service(AMD QoS).
This feature is enabled by the CONFIG_X86_CPU_RESCTRL and the x86 /proc/cpuinfo
flag bits:
-============================================= ================================
+=============================================== ================================
RDT (Resource Director Technology) Allocation "rdt_a"
CAT (Cache Allocation Technology) "cat_l3", "cat_l2"
CDP (Code and Data Prioritization) "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring) "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring) "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) "mba"
-============================================= ================================
+SMBA (Slow Memory Bandwidth Allocation) ""
+BMEC (Bandwidth Monitoring Event Configuration) ""
+=============================================== ================================
+
+Historically, new features were made visible by default in /proc/cpuinfo. This
+resulted in the feature flags becoming hard to parse by humans. Adding a new
+flag to /proc/cpuinfo should be avoided if user space can obtain information
+about the feature from resctrl's info directory.
To use the feature mount the file system::
@@ -161,6 +168,83 @@ with the following files:
"mon_features":
Lists the monitoring events if
monitoring is enabled for the resource.
+ Example::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_local_bytes
+
+ If the system supports Bandwidth Monitoring Event
+ Configuration (BMEC), then the bandwidth events will
+ be configurable. The output will be::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mon_features
+ llc_occupancy
+ mbm_total_bytes
+ mbm_total_bytes_config
+ mbm_local_bytes
+ mbm_local_bytes_config
+
+"mbm_total_bytes_config", "mbm_local_bytes_config":
+ Read/write files containing the configuration for the mbm_total_bytes
+ and mbm_local_bytes events, respectively, when the Bandwidth
+ Monitoring Event Configuration (BMEC) feature is supported.
+ The event configuration settings are domain specific and affect
+ all the CPUs in the domain. When either event configuration is
+ changed, the bandwidth counters for all RMIDs of both events
+ (mbm_total_bytes as well as mbm_local_bytes) are cleared for that
+ domain. The next read for every RMID will report "Unavailable"
+ and subsequent reads will report the valid value.
+
+ Following are the types of events supported:
+
+ ==== ========================================================
+ Bits Description
+ ==== ========================================================
+ 6 Dirty Victims from the QOS domain to all types of memory
+ 5 Reads to slow memory in the non-local NUMA domain
+ 4 Reads to slow memory in the local NUMA domain
+ 3 Non-temporal writes to non-local NUMA domain
+ 2 Non-temporal writes to local NUMA domain
+ 1 Reads to memory in the non-local NUMA domain
+ 0 Reads to memory in the local NUMA domain
+ ==== ========================================================
+
+ By default, the mbm_total_bytes configuration is set to 0x7f to count
+ all the event types and the mbm_local_bytes configuration is set to
+ 0x15 to count all the local memory events.
+
+ Examples:
+
+	  * To view the current configuration::
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+ 0=0x7f;1=0x7f;2=0x7f;3=0x7f
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+ 0=0x15;1=0x15;3=0x15;4=0x15
+
+ * To change the mbm_total_bytes to count only reads on domain 0,
+	    the bits 0, 1, 4 and 5 need to be set, which is 110011b in binary
+ (in hexadecimal 0x33):
+ ::
+
+ # echo "0=0x33" > /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_total_bytes_config
+ 0=0x33;1=0x7f;2=0x7f;3=0x7f
+
+ * To change the mbm_local_bytes to count all the slow memory reads on
+	    domain 0 and 1, the bits 4 and 5 need to be set, which is 110000b
+ in binary (in hexadecimal 0x30):
+ ::
+
+ # echo "0=0x30;1=0x30" > /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+
+ # cat /sys/fs/resctrl/info/L3_MON/mbm_local_bytes_config
+ 0=0x30;1=0x30;3=0x15;4=0x15
"max_threshold_occupancy":
Read/write file provides the largest value (in
@@ -464,6 +548,25 @@ Memory bandwidth domain is L3 cache.
MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
+Slow Memory Bandwidth Allocation (SMBA)
+---------------------------------------
+AMD hardware supports Slow Memory Bandwidth Allocation (SMBA).
+CXL.memory is the only supported "slow" memory device. With the
+support of SMBA, the hardware enables bandwidth allocation on
+the slow memory devices. If there are multiple such devices in
+the system, the throttling logic groups all the slow sources
+together and applies the limit on them as a whole.
+
+The presence of SMBA (with CXL.memory) is independent of the presence
+of slow memory devices. If there are no such devices on the system,
+then configuring SMBA will have no impact on the performance of the
+system.
+
+The bandwidth domain for slow memory is L3 cache. Its schemata file
+is formatted as:
+::
+
+ SMBA:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
+
Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
@@ -479,6 +582,46 @@ which you wish to change. E.g.
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+Reading/writing the schemata file (on AMD systems)
+--------------------------------------------------
+Reading the schemata file will show the current bandwidth limit on all
+domains. The allocated resources are in multiples of one eighth GB/s.
+When writing to the file, you need to specify the cache id for which
+you wish to configure the bandwidth limit.
+
+For example, to allocate 2GB/s limit on the first cache id:
+
+::
+
+ # cat schemata
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "MB:1=16" > schemata
+ # cat schemata
+ MB:0=2048;1= 16;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+Reading/writing the schemata file (on AMD systems) with SMBA feature
+--------------------------------------------------------------------
+Reading and writing the schemata file is the same as described in the
+section above for systems without SMBA.
+
+For example, to allocate 8GB/s limit on the first cache id:
+
+::
+
+ # cat schemata
+ SMBA:0=2048;1=2048;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
+ # echo "SMBA:1=64" > schemata
+ # cat schemata
+ SMBA:0=2048;1= 64;2=2048;3=2048
+ MB:0=2048;1=2048;2=2048;3=2048
+ L3:0=ffff;1=ffff;2=ffff;3=ffff
+
Cache Pseudo-Locking
====================
CAT enables a user to specify the amount of cache space that an
@@ -608,12 +751,12 @@ how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set::
- # :> /sys/kernel/debug/tracing/trace
- # echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
- # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+ # :> /sys/kernel/tracing/trace
+ # echo 'hist:keys=latency' > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
+ # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
# echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
- # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
- # cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist
+ # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/enable
+ # cat /sys/kernel/tracing/events/resctrl/pseudo_lock_mem_latency/hist
# event histogram
#
@@ -642,11 +785,11 @@ cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.
::
- # :> /sys/kernel/debug/tracing/trace
- # echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
+ # :> /sys/kernel/tracing/trace
+ # echo 1 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
# echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
- # echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
- # cat /sys/kernel/debug/tracing/trace
+ # echo 0 > /sys/kernel/tracing/events/resctrl/pseudo_lock_l2/enable
+ # cat /sys/kernel/tracing/trace
# tracer: nop
#
diff --git a/Documentation/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index dd0ac96ff9ef..2bcbffacbed5 100644
--- a/Documentation/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -10,7 +10,7 @@ Overview
Software Guard eXtensions (SGX) hardware enables user space applications
to set aside private memory regions of code and data:
-* Privileged (ring-0) ENCLS functions orchestrate the construction of the.
+* Privileged (ring-0) ENCLS functions orchestrate the construction of the
regions.
* Unprivileged (ring-3) ENCLU functions allow an application to enter and
execute inside the regions.
@@ -91,7 +91,7 @@ In addition to the traditional compiler and linker build process, SGX has a
separate enclave “build” process. Enclaves must be built before they can be
executed (entered). The first step in building an enclave is opening the
**/dev/sgx_enclave** device. Since enclave memory is protected from direct
-access, special privileged instructions are Then used to copy data into enclave
+access, special privileged instructions are then used to copy data into enclave
pages and establish enclave page permissions.
.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
@@ -100,6 +100,21 @@ pages and establish enclave page permissions.
sgx_ioc_enclave_init
sgx_ioc_enclave_provision
+Enclave runtime management
+--------------------------
+
+Systems supporting SGX2 additionally support changes to initialized
+enclaves: modifying enclave page permissions and type, and dynamically
+adding and removing enclave pages. When an enclave accesses an address
+within its address range that does not have a backing page then a new
+regular page will be dynamically added to the enclave. The enclave is
+still required to run EACCEPT on the new page before it can be used.
+
+.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
+ :functions: sgx_ioc_enclave_restrict_permissions
+ sgx_ioc_enclave_modify_types
+ sgx_ioc_enclave_remove_pages
+
Enclave vDSO
------------
@@ -126,13 +141,13 @@ the need to juggle signal handlers.
ksgxd
=====
-SGX support includes a kernel thread called *ksgxwapd*.
+SGX support includes a kernel thread called *ksgxd*.
EPC sanitization
----------------
ksgxd is started when SGX initializes. Enclave memory is typically ready
-For use when the processor powers on or resets. However, if SGX has been in
+for use when the processor powers on or resets. However, if SGX has been in
use since the reset, enclave pages may be in an inconsistent state. This might
occur after a crash and kexec() cycle, for instance. At boot, ksgxd
reinitializes all enclave pages so that they can be allocated and re-used.
@@ -147,7 +162,7 @@ Page reclaimer
Similar to the core kswapd, ksgxd is responsible for managing the
overcommitment of enclave memory. If the system runs out of enclave memory,
-*ksgxwapd* “swaps” enclave memory to normal memory.
+*ksgxd* “swaps” enclave memory to normal memory.
Launch Control
==============
@@ -156,7 +171,7 @@ SGX provides a launch control mechanism. After all enclave pages have been
copied, the kernel executes the EINIT function, which initializes the
enclave. Only after this can the CPU execute inside the enclave.
-ENIT function takes an RSA-3072 signature of the enclave measurement. The function
+EINIT function takes an RSA-3072 signature of the enclave measurement. The function
checks that the measurement is correct and that the signature is signed with the key
hashed to the four **IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs representing the
SHA256 of a public key.
@@ -184,7 +199,7 @@ CPUs starting from Icelake use Total Memory Encryption (TME) in the place of
MEE. TME-based SGX implementations do not have an integrity Merkle tree, which
means integrity and replay-attacks are not mitigated. But, it includes
additional changes to prevent cipher text from being returned and SW memory
-aliases from being Created.
+aliases from being created.
DMA to enclave memory is blocked by range registers on both MEE and TME systems
(SDM section 41.10).
@@ -250,3 +265,38 @@ user wants to deploy SGX applications both on the host and in guests
on the same machine, the user should reserve enough EPC (by taking out
total virtual EPC size of all SGX VMs from the physical EPC size) for
host SGX applications so they can run with acceptable performance.
+
+Architectural behavior is to restore all EPC pages to an uninitialized
+state also after a guest reboot. Because this state can be reached only
+through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc``
+provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction
+on all pages in the virtual EPC.
+
+``EREMOVE`` can fail for three reasons. Userspace must pay attention
+to expected failures and handle them as follows:
+
+1. Page removal will always fail when any thread is running in the
+ enclave to which the page belongs. In this case the ioctl will
+ return ``EBUSY`` independent of whether it has successfully removed
+ some pages; userspace can avoid these failures by preventing execution
+ of any vcpu which maps the virtual EPC.
+
+2. Page removal will cause a general protection fault if two calls to
+ ``EREMOVE`` happen concurrently for pages that refer to the same
+ "SECS" metadata pages. This can happen if there are concurrent
+ invocations to ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc``
+ file descriptor in the guest is closed at the same time as
+ ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``.
+ This can be avoided in userspace by serializing calls to the ioctl()
+ and to close(), but in general it should not be a problem.
+
+3. Finally, page removal will fail for SECS metadata pages which still
+ have child pages. Child pages can be removed by executing
+ ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors
+ mapped into the guest. This means that the ioctl() must be called
+ twice: an initial set of calls to remove child pages and a subsequent
+ set of calls to remove SECS pages. The second set of calls is only
+ required for those mappings that returned a nonzero value from the
+ first call. It indicates a bug in the kernel or the userspace client
+ if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
+ a return code other than 0.
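+
+Below is a simplified, illustrative sketch of the two-round removal
+described above; the helper name, fd bookkeeping and error handling are
+assumptions for the example, and only ``SGX_IOC_VEPC_REMOVE_ALL`` is the
+documented interface::
+
+	#include <errno.h>
+	#include <sys/ioctl.h>
+	#include <asm/sgx.h>	/* SGX_IOC_VEPC_REMOVE_ALL */
+
+	/*
+	 * Reset the virtual EPC of a guest across all of its
+	 * /dev/sgx_vepc file descriptors. Assumes all vCPUs mapping
+	 * the vEPC are already stopped, so EBUSY from running enclave
+	 * threads cannot occur.
+	 */
+	static int vepc_reset(int *fds, int nr_fds)
+	{
+		int i, ret;
+
+		/*
+		 * Round 1: removes regular and child pages; SECS pages
+		 * which still have children are skipped and counted in
+		 * a nonzero return value.
+		 */
+		for (i = 0; i < nr_fds; i++)
+			if (ioctl(fds[i], SGX_IOC_VEPC_REMOVE_ALL) < 0)
+				return -errno;
+
+		/*
+		 * Round 2: all children are gone by now, so the
+		 * remaining SECS pages can be removed. Strictly only
+		 * needed for fds whose first call returned nonzero;
+		 * a failure here indicates a kernel or userspace bug.
+		 */
+		for (i = 0; i < nr_fds; i++) {
+			ret = ioctl(fds[i], SGX_IOC_VEPC_REMOVE_ALL);
+			if (ret)
+				return ret < 0 ? -errno : -EAGAIN;
+		}
+		return 0;
+	}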
diff --git a/Documentation/x86/sva.rst b/Documentation/arch/x86/sva.rst
index 076efd51ef1f..33cb05005982 100644
--- a/Documentation/x86/sva.rst
+++ b/Documentation/arch/x86/sva.rst
@@ -104,18 +104,47 @@ The MSR must be configured on each logical CPU before any application
thread can interact with a device. Threads that belong to the same
process share the same page tables, thus the same MSR value.
-PASID is cleared when a process is created. The PASID allocation and MSR
-programming may occur long after a process and its threads have been created.
-One thread must call iommu_sva_bind_device() to allocate the PASID for the
-process. If a thread uses ENQCMD without the MSR first being populated, a #GP
-will be raised. The kernel will update the PASID MSR with the PASID for all
-threads in the process. A single process PASID can be used simultaneously
-with multiple devices since they all share the same address space.
-
-One thread can call iommu_sva_unbind_device() to free the allocated PASID.
-The kernel will clear the PASID MSR for all threads belonging to the process.
-
-New threads inherit the MSR value from the parent.
+PASID Life Cycle Management
+===========================
+
+PASID is initialized as IOMMU_PASID_INVALID (-1) when a process is created.
+
+Only processes that access SVA-capable devices need to have a PASID
+allocated. This allocation happens when a process opens/binds an SVA-capable
+device but finds no PASID for this process. Subsequent binds of the same, or
+other devices will share the same PASID.
+
+Although the PASID is allocated to the process by opening a device,
+it is not active in any of the threads of that process. It's loaded to the
+IA32_PASID MSR lazily when a thread tries to submit a work descriptor
+to a device using the ENQCMD instruction.
+
+That first access will trigger a #GP fault because the IA32_PASID MSR
+has not been initialized with the PASID value assigned to the process
+when the device was opened. The Linux #GP handler notes that a PASID has
+been allocated for the process, and so initializes the IA32_PASID MSR
+and returns so that the ENQCMD instruction is re-executed.
+
+On fork(2) or exec(2) the PASID is removed from the process as it no
+longer has the same address space that it had when the device was opened.
+
+On clone(2) the new task shares the same address space, so it will be
+able to use the PASID allocated to the process. The IA32_PASID MSR is
+not preemptively initialized as the PASID value might not be allocated
+yet or the kernel does not know whether this thread is going to access
+the device, and the cleared IA32_PASID MSR reduces context switch
+overhead by the xstate init optimization. Since #GP faults have to be
+handled on any threads that were created before the PASID was assigned
+to the mm of the process, newly created threads might as well be treated
+in a consistent way.
+
+Due to the complexity of freeing the PASID and clearing all IA32_PASID
+MSRs in all threads on unbind, the PASID is freed lazily only on mm
+exit.
+
+If a process does a close(2) of the device file descriptor and munmap(2)
+of the device MMIO portal, then the driver will unbind the device. The
+PASID is still marked VALID in the PASID_MSR for any threads in the
+process that accessed the device. But this is harmless as without the
+MMIO portal they cannot submit new work to the device.
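+
+For illustration, here is a hedged userspace sketch of this lazy PASID
+flow, assuming an idxd-style accelerator exposing a shared work queue at
+/dev/dsa/wq0.0 (a hypothetical path) and a compiler providing the
+_enqcmd() intrinsic (GCC/Clang with -menqcmd); the 64-byte descriptor
+layout is left schematic::
+
+	#include <fcntl.h>
+	#include <stdint.h>
+	#include <sys/mman.h>
+	#include <unistd.h>
+	#include <immintrin.h>	/* _enqcmd(), needs -menqcmd */
+
+	int main(void)
+	{
+		/* Opening/binding the SVA-capable device allocates a
+		 * PASID for the process if it has none yet. */
+		int fd = open("/dev/dsa/wq0.0", O_RDWR);
+
+		/* Map the device's MMIO submission portal. */
+		void *portal = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED,
+				    fd, 0);
+
+		/* 64-byte, 64-byte-aligned command descriptor; the
+		 * real layout is device specific. */
+		uint8_t desc[64] __attribute__((aligned(64))) = { 0 };
+
+		/* The first ENQCMD in this thread takes a #GP; the
+		 * kernel fills the IA32_PASID MSR and restarts the
+		 * instruction transparently. A nonzero return means
+		 * the device queue did not accept the command; retry. */
+		while (_enqcmd(portal, desc))
+			;
+
+		munmap(portal, 4096);
+		close(fd);
+		return 0;
+	}
+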
Relationships
=============
diff --git a/Documentation/arch/x86/tdx.rst b/Documentation/arch/x86/tdx.rst
new file mode 100644
index 000000000000..dc8d9fd2c3f7
--- /dev/null
+++ b/Documentation/arch/x86/tdx.rst
@@ -0,0 +1,261 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+Intel Trust Domain Extensions (TDX)
+=====================================
+
+Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
+the host and physical attacks by isolating the guest register state and by
+encrypting the guest memory. In TDX, a special module running in a special
+mode sits between the host and the guest and manages the guest/host
+separation.
+
+Since the host cannot directly access guest registers or memory, much
+normal functionality of a hypervisor must be moved into the guest. This is
+implemented using a Virtualization Exception (#VE) that is handled by the
+guest kernel. Some #VEs are handled entirely inside the guest kernel,
+but some require the hypervisor to be consulted.
+
+TDX includes new hypercall-like mechanisms for communicating from the
+guest to the hypervisor or the TDX module.
+
+New TDX Exceptions
+==================
+
+TDX guests behave differently from bare-metal and traditional VMX guests.
+In TDX guests, otherwise normal instructions or memory accesses can cause
+#VE or #GP exceptions.
+
+Instructions marked with an '*' conditionally cause exceptions. The
+details for these instructions are discussed below.
+
+Instruction-based #VE
+---------------------
+
+- Port I/O (INS, OUTS, IN, OUT)
+- HLT
+- MONITOR, MWAIT
+- WBINVD, INVD
+- VMCALL
+- RDMSR*,WRMSR*
+- CPUID*
+
+Instruction-based #GP
+---------------------
+
+- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
+ VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
+- ENCLS, ENCLU
+- GETSEC
+- RSM
+- ENQCMD
+- RDMSR*,WRMSR*
+
+RDMSR/WRMSR Behavior
+--------------------
+
+MSR access behavior falls into three categories:
+
+- #GP generated
+- #VE generated
+- "Just works"
+
+In general, the #GP MSRs should not be used in guests. Their use likely
+indicates a bug in the guest. The guest may try to handle the #GP with a
+hypercall but it is unlikely to succeed.
+
+The #VE MSRs can typically be handled by the hypervisor. Guests
+can make a hypercall to the hypervisor to handle the #VE.
+
+The "just works" MSRs do not need any special guest handling. They might
+be implemented by directly passing through the MSR to the hardware or by
+trapping and handling in the TDX module. Other than possibly being slow,
+these MSRs appear to function just as they would on bare metal.
+
+CPUID Behavior
+--------------
+
+For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
+return values (in guest EAX/EBX/ECX/EDX) are configurable by the
+hypervisor. For such cases, the Intel TDX module architecture defines two
+virtualization types:
+
+- Bit fields for which the hypervisor controls the value seen by the guest
+ TD.
+
+- Bit fields for which the hypervisor configures the value such that the
+ guest TD either sees their native value or a value of 0. For these bit
+  fields, the hypervisor can mask off the native values, but it cannot
+  turn *on* values.
+
+A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
+not know how to handle. The guest kernel may ask the hypervisor for the
+value with a hypercall.
+
+#VE on Memory Accesses
+======================
+
+There are essentially two classes of TDX memory: private and shared.
+Private memory receives full TDX protections. Its content is protected
+against access from the hypervisor. Shared memory is expected to be
+shared between guest and hypervisor and does not receive full TDX
+protections.
+
+A TD guest is in control of whether its memory accesses are treated as
+private or shared. It selects the behavior with a bit in its page table
+entries. This helps ensure that a guest does not place sensitive
+information in shared memory, exposing it to the untrusted hypervisor.
+
+#VE on Shared Memory
+--------------------
+
+Access to shared mappings can cause a #VE. The hypervisor ultimately
+controls whether a shared memory access causes a #VE, so the guest must
+be careful to only reference shared pages for which it can safely handle
+a #VE. For instance, the guest should be careful not to access shared
+memory in the #VE handler before it reads the #VE info structure
+(TDG.VP.VEINFO.GET).
+
+Shared mapping content is entirely controlled by the hypervisor. The guest
+should only use shared mappings for communicating with the hypervisor.
+Shared mappings must never be used for sensitive memory content like kernel
+stacks. A good rule of thumb is that hypervisor-shared memory should be
+treated the same as memory mapped to userspace. Both the hypervisor and
+userspace are completely untrusted.
+
+MMIO for virtual devices is implemented as shared memory. The guest must
+be careful not to access device MMIO regions unless it is also prepared to
+handle a #VE.
+
+#VE on Private Pages
+--------------------
+
+An access to private mappings can also cause a #VE. Since all kernel
+memory is also private memory, the kernel might theoretically need to
+handle a #VE on arbitrary kernel memory accesses. This is not feasible, so
+TDX guests ensure that all guest memory has been "accepted" before memory
+is used by the kernel.
+
+A modest amount of memory (typically 512M) is pre-accepted by the firmware
+before the kernel runs to ensure that the kernel can start up without
+being subjected to a #VE.
+
+The hypervisor is permitted to unilaterally move accepted pages to a
+"blocked" state. However, if it does this, page access will not generate a
+#VE. It will, instead, cause a "TD Exit" where the hypervisor is required
+to handle the exception.
+
+Linux #VE handler
+=================
+
+Just like page faults or #GP's, #VE exceptions can either be handled or
+be fatal. Typically, an unhandled userspace #VE results in a SIGSEGV.
+An unhandled kernel #VE results in an oops.
+
+Handling nested exceptions on x86 is typically nasty business. A #VE
+could be interrupted by an NMI which triggers another #VE and hilarity
+ensues. The TDX #VE architecture anticipated this scenario and includes a
+feature to make it slightly less nasty.
+
+During #VE handling, the TDX module ensures that all interrupts (including
+NMIs) are blocked. The block remains in place until the guest makes a
+TDG.VP.VEINFO.GET TDCALL. This allows the guest to control when interrupts
+or a new #VE can be delivered.
+
+However, the guest kernel must still be careful to avoid potential
+#VE-triggering actions (discussed above) while this block is in place.
+While the block is in place, any #VE is elevated to a double fault (#DF)
+which is not recoverable.
+
+MMIO handling
+=============
+
+In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
+mapping which will cause a VMEXIT on access, and then the hypervisor
+emulates the access. That is not possible in TDX guests because VMEXIT
+will expose the register state to the host. TDX guests don't trust the host
+and can't have their state exposed to the host.
+
+In TDX, MMIO regions typically trigger a #VE exception in the guest. The
+guest #VE handler then emulates the MMIO instruction inside the guest and
+converts it into a controlled TDCALL to the host, rather than exposing
+guest state to the host.
+
+MMIO addresses on x86 are just special physical addresses. They can
+theoretically be accessed with any instruction that accesses memory.
+However, the kernel instruction decoding method is limited. It is only
+designed to decode instructions like those generated by io.h macros.
+
+MMIO access via other means (like structure overlays) may result in an
+oops.
+
+Shared Memory Conversions
+=========================
+
+All TDX guest memory starts out as private at boot. This memory cannot
+be accessed by the hypervisor. However, some kernel users like device
+drivers might have a need to share data with the hypervisor. To do this,
+memory must be converted between shared and private. This can be
+accomplished using some existing memory encryption helpers:
+
+ * set_memory_decrypted() converts a range of pages to shared.
+ * set_memory_encrypted() converts memory back to private.
+
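+As a hedged illustration of these helpers (the surrounding functions are
+assumptions for the example; set_memory_decrypted() and
+set_memory_encrypted() are the real calls), a driver could share a
+single page like this::
+
+	#include <linux/gfp.h>
+	#include <linux/mm.h>
+	#include <linux/set_memory.h>
+
+	/* Share one zeroed page with the hypervisor. */
+	static struct page *share_page_with_host(void)
+	{
+		struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+
+		if (!page)
+			return NULL;
+
+		/* Flip the share bit in the guest page tables and
+		 * notify the hypervisor/TDX module. */
+		if (set_memory_decrypted((unsigned long)page_address(page), 1))
+			return NULL;	/* page intentionally leaked */
+
+		return page;
+	}
+
+	/* Convert back to private before freeing, so the allocator
+	 * never hands out a still-shared page. */
+	static void unshare_page(struct page *page)
+	{
+		if (!set_memory_encrypted((unsigned long)page_address(page), 1))
+			__free_page(page);
+	}
+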
+Device drivers are the primary user of shared memory, but there's no need
+to touch every driver. DMA buffers and ioremap() do the conversions
+automatically.
+
+TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
+converted to shared on boot.
+
+For coherent DMA allocation, the DMA buffer gets converted on the
+allocation. Check force_dma_unencrypted() for details.
+
+Attestation
+===========
+
+Attestation is used to verify the TDX guest's trustworthiness to other
+entities before provisioning secrets to the guest. For example, a key
+server may want to use attestation to verify that the guest is the
+desired one before releasing the encryption keys to mount the encrypted
+rootfs or a secondary drive.
+
+The TDX module records the state of the TDX guest in various stages of
+the guest boot process using the build time measurement register (MRTD)
+and runtime measurement registers (RTMR). Measurements related to the
+guest initial configuration and firmware image are recorded in the MRTD
+register. Measurements related to initial state, kernel image, firmware
+image, command line options, initrd, ACPI tables, etc. are recorded in
+RTMR registers. For more details, as an example, please refer to TDX
+Virtual Firmware design specification, section titled "TD Measurement".
+At TDX guest runtime, the attestation process is used to attest to these
+measurements.
+
+The attestation process consists of two steps: TDREPORT generation and
+Quote generation.
+
+TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT (TDREPORT_STRUCT)
+from the TDX module. TDREPORT is a fixed-size data structure generated by
+the TDX module which contains guest-specific information (such as build
+and boot measurements), platform security version, and the MAC to protect
+the integrity of the TDREPORT. A user-provided 64-byte REPORTDATA is
+used as input and included in the TDREPORT. Typically it is a nonce
+provided by the attestation service so the TDREPORT can be verified
+uniquely.
+More details about the TDREPORT can be found in Intel TDX Module
+specification, section titled "TDG.MR.REPORT Leaf".
+
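+For illustration, here is a guest-side sketch of fetching a TDREPORT,
+assuming the ``tdx-guest`` driver's /dev/tdx_guest device and its
+``TDX_CMD_GET_REPORT0`` ioctl (names per <uapi/linux/tdx-guest.h>; the
+helper itself is an assumption)::
+
+	#include <fcntl.h>
+	#include <stdint.h>
+	#include <string.h>
+	#include <sys/ioctl.h>
+	#include <unistd.h>
+	#include <linux/tdx-guest.h>
+
+	/* Fetch a TDREPORT for a caller-chosen 64-byte REPORTDATA,
+	 * e.g. a nonce from the attestation service. */
+	static int get_tdreport(const uint8_t *nonce, uint8_t *report)
+	{
+		struct tdx_report_req req = { 0 };
+		int fd, ret;
+
+		memcpy(req.reportdata, nonce, sizeof(req.reportdata));
+
+		fd = open("/dev/tdx_guest", O_RDWR);
+		if (fd < 0)
+			return -1;
+
+		ret = ioctl(fd, TDX_CMD_GET_REPORT0, &req);
+		if (!ret)
+			memcpy(report, req.tdreport, sizeof(req.tdreport));
+
+		close(fd);
+		return ret;
+	}
+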
+After getting the TDREPORT, the second step of the attestation process
+is to send it to the Quoting Enclave (QE) to generate the Quote. TDREPORT
+by design can only be verified on the local platform as the MAC key is
+bound to the platform. To support remote verification of the TDREPORT,
+TDX leverages Intel SGX Quoting Enclave to verify the TDREPORT locally
+and convert it to a remotely verifiable Quote. Method of sending TDREPORT
+to QE is implementation specific. Attestation software can choose
+whatever communication channel available (i.e. vsock or TCP/IP) to
+send the TDREPORT to QE and receive the Quote.
+
+References
+==========
+
+TDX reference material is collected here:
+
+https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
diff --git a/Documentation/x86/tlb.rst b/Documentation/arch/x86/tlb.rst
index 82ec58ae63a8..82ec58ae63a8 100644
--- a/Documentation/x86/tlb.rst
+++ b/Documentation/arch/x86/tlb.rst
diff --git a/Documentation/x86/topology.rst b/Documentation/arch/x86/topology.rst
index 7f58010ea86a..7f58010ea86a 100644
--- a/Documentation/x86/topology.rst
+++ b/Documentation/arch/x86/topology.rst
diff --git a/Documentation/x86/tsx_async_abort.rst b/Documentation/arch/x86/tsx_async_abort.rst
index 583ddc185ba2..583ddc185ba2 100644
--- a/Documentation/x86/tsx_async_abort.rst
+++ b/Documentation/arch/x86/tsx_async_abort.rst
diff --git a/Documentation/x86/usb-legacy-support.rst b/Documentation/arch/x86/usb-legacy-support.rst
index e01c08b7c981..e01c08b7c981 100644
--- a/Documentation/x86/usb-legacy-support.rst
+++ b/Documentation/arch/x86/usb-legacy-support.rst
diff --git a/Documentation/x86/x86_64/5level-paging.rst b/Documentation/arch/x86/x86_64/5level-paging.rst
index b792bbdc0b01..71f882f4a173 100644
--- a/Documentation/x86/x86_64/5level-paging.rst
+++ b/Documentation/arch/x86/x86_64/5level-paging.rst
@@ -20,7 +20,7 @@ physical address space. This "ought to be enough for anybody" ©.
QEMU 2.9 and later support 5-level paging.
Virtual memory layout for 5-level paging is described in
-Documentation/x86/x86_64/mm.rst
+Documentation/arch/x86/x86_64/mm.rst
Enabling 5-level paging
diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst
index ccb7e86bf8d9..137432d34109 100644
--- a/Documentation/x86/x86_64/boot-options.rst
+++ b/Documentation/arch/x86/x86_64/boot-options.rst
@@ -9,7 +9,7 @@ only the AMD64 specific ones are listed here.
Machine check
=============
-Please see Documentation/x86/x86_64/machinecheck.rst for sysfs runtime tunables.
+Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables.
mce=off
Disable machine check
@@ -47,14 +47,7 @@ Please see Documentation/x86/x86_64/machinecheck.rst for sysfs runtime tunables.
in a reboot. On Intel systems it is enabled by default.
mce=nobootlog
Disable boot machine check logging.
- mce=tolerancelevel[,monarchtimeout] (number,number)
- tolerance levels:
- 0: always panic on uncorrected errors, log corrected errors
- 1: panic or SIGBUS on uncorrected errors, log corrected errors
- 2: SIGBUS or log uncorrected errors, log corrected errors
- 3: never panic or SIGBUS, log all errors (for testing only)
- Default is 1
- Can be also set using sysfs which is preferable.
+ mce=monarchtimeout (number)
monarchtimeout:
Sets the time in us to wait for other CPUs on machine checks. 0
to disable.
@@ -89,7 +82,7 @@ APICs
Don't use the local APIC (alias for i386 compatibility)
pirq=...
- See Documentation/x86/i386/IO-APIC.rst
+ See Documentation/arch/x86/i386/IO-APIC.rst
noapictimer
Don't set up the APIC timer
@@ -164,15 +157,6 @@ Rebooting
newer BIOS, or newer board) using this option will ignore the built-in
quirk table, and use the generic default reboot actions.
-Non Executable Mappings
-=======================
-
- noexec=on|off
- on
- Enable(default)
- off
- Disable
-
NUMA
====
@@ -303,11 +287,13 @@ iommu options only relevant to the AMD GART hardware IOMMU:
iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
implementation:
- swiotlb=<pages>[,force]
- <pages>
- Prereserve that many 128K pages for the software IO bounce buffering.
+ swiotlb=<slots>[,force,noforce]
+ <slots>
+ Prereserve that many 2K slots for the software IO bounce buffering.
force
Force all IO through the software TLB.
+ noforce
+ Do not initialize the software TLB.
Miscellaneous
@@ -317,3 +303,17 @@ Miscellaneous
Do not use GB pages for kernel direct mappings.
gbpages
Use GB pages for kernel direct mappings.
+
+
+AMD SEV (Secure Encrypted Virtualization)
+=========================================
+Options relating to AMD SEV, specified via the following format:
+
+::
+
+ sev=option1[,option2]
+
+The available options are:
+
+ debug
+ Enable debug messages.
diff --git a/Documentation/x86/x86_64/cpu-hotplug-spec.rst b/Documentation/arch/x86/x86_64/cpu-hotplug-spec.rst
index 8d1c91f0c880..8d1c91f0c880 100644
--- a/Documentation/x86/x86_64/cpu-hotplug-spec.rst
+++ b/Documentation/arch/x86/x86_64/cpu-hotplug-spec.rst
diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst
index ff9bcfd2cc14..ba74617d4999 100644
--- a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst
+++ b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst
@@ -18,7 +18,7 @@ For more information on the features of cpusets, see
Documentation/admin-guide/cgroup-v1/cpusets.rst.
There are a number of different configurations you can use for your needs. For
more information on the numa=fake command line option and its various ways of
-configuring fake nodes, see Documentation/x86/x86_64/boot-options.rst.
+configuring fake nodes, see Documentation/arch/x86/x86_64/boot-options.rst.
For the purposes of this introduction, we'll assume a very primitive NUMA
emulation setup of "numa=fake=4*512,". This will split our system memory into
diff --git a/Documentation/x86/x86_64/fsgs.rst b/Documentation/arch/x86/x86_64/fsgs.rst
index 50960e09e1f6..50960e09e1f6 100644
--- a/Documentation/x86/x86_64/fsgs.rst
+++ b/Documentation/arch/x86/x86_64/fsgs.rst
diff --git a/Documentation/x86/x86_64/index.rst b/Documentation/arch/x86/x86_64/index.rst
index a56070fc8e77..a56070fc8e77 100644
--- a/Documentation/x86/x86_64/index.rst
+++ b/Documentation/arch/x86/x86_64/index.rst
diff --git a/Documentation/arch/x86/x86_64/machinecheck.rst b/Documentation/arch/x86/x86_64/machinecheck.rst
new file mode 100644
index 000000000000..cea12ee97200
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/machinecheck.rst
@@ -0,0 +1,33 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================================================
+Configurable sysfs parameters for the x86-64 machine check code
+===============================================================
+
+Machine checks report internal hardware error conditions detected
+by the CPU. Uncorrected errors typically cause a machine check
+(often with panic), corrected ones cause a machine check log entry.
+
+Machine checks are organized in banks (normally associated with
+a hardware subsystem) and subevents in a bank. The exact meaning
+of the banks and subevents is CPU specific.
+
+mcelog knows how to decode them.
+
+When you see the "Machine check errors logged" message in the system
+log then mcelog should run to collect and decode machine check entries
+from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
+
+Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
+(N = CPU number).
+
+The directory contains some configurable entries. See
+Documentation/ABI/testing/sysfs-mce for more details.
+
+TBD document entries for AMD threshold interrupt configuration
+
+For more details about the x86 machine check architecture
+see the Intel and AMD architecture manuals from their developer websites.
+
+For more details about the architecture
+see http://one.firstfloor.org/~andi/mce.pdf
diff --git a/Documentation/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst
index 9798676bb0bf..35e5e18c83d0 100644
--- a/Documentation/x86/x86_64/mm.rst
+++ b/Documentation/arch/x86/x86_64/mm.rst
@@ -140,7 +140,7 @@ The direct mapping covers all memory in the system up to the highest
memory address (this means in some cases it can also include PCI memory
holes).
-We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual
+We map EFI runtime services in the 'efi_pgd' PGD in a 64GB large virtual
memory window (this size is arbitrary, it can be raised later if needed).
The mappings are not part of any other kernel PGD and are only available
during EFI runtime calls.
diff --git a/Documentation/x86/x86_64/uefi.rst b/Documentation/arch/x86/x86_64/uefi.rst
index 3b894103a734..fbc30c9a071d 100644
--- a/Documentation/x86/x86_64/uefi.rst
+++ b/Documentation/arch/x86/x86_64/uefi.rst
@@ -29,7 +29,7 @@ Mechanics
be selected::
CONFIG_EFI=y
- CONFIG_EFI_VARS=y or m # optional
+ CONFIG_EFIVAR_FS=y or m # optional
- Create a VFAT partition on the disk
- Copy the following to the VFAT partition:
diff --git a/Documentation/arch/x86/xstate.rst b/Documentation/arch/x86/xstate.rst
new file mode 100644
index 000000000000..ae5c69e48b11
--- /dev/null
+++ b/Documentation/arch/x86/xstate.rst
@@ -0,0 +1,174 @@
+Using XSTATE features in user space applications
+================================================
+
+The x86 architecture supports floating-point extensions which are
+enumerated via CPUID. Applications consult CPUID and use XGETBV to
+evaluate which features have been enabled by the kernel XCR0.
+
+Up to AVX-512 and PKRU states, these features are automatically enabled by
+the kernel if available. Features like AMX TILE_DATA (XSTATE component 18)
+are enabled by XCR0 as well, but the first use of a related instruction is
+trapped by the kernel because by default the required large XSTATE buffers
+are not allocated automatically.
+
+The purpose for dynamic features
+--------------------------------
+
+Legacy userspace libraries often have hard-coded, static sizes for
+alternate signal stacks, commonly using MINSIGSTKSZ, which is typically
+2KB. That stack must be able to store at *least* the signal frame that
+the kernel sets up before jumping into the signal handler. That signal
+frame must include an XSAVE buffer defined by the CPU.
+
+However, that means that the size of signal stacks is dynamic, not
+static, because different CPUs have differently-sized XSAVE buffers. A
+compiled-in size of 2KB in existing applications is too small for new
+CPU features like AMX. Instead of universally requiring larger stacks,
+with dynamic enabling the kernel can require userspace applications to
+have properly-sized altstacks.
+
+Using dynamically enabled XSTATE features in user space applications
+--------------------------------------------------------------------
+
+The kernel provides an arch_prctl(2) based mechanism for applications to
+request the usage of such features. The arch_prctl(2) options related to
+this are:
+
+-ARCH_GET_XCOMP_SUPP
+
+ arch_prctl(ARCH_GET_XCOMP_SUPP, &features);
+
+ ARCH_GET_XCOMP_SUPP stores the supported features in userspace storage of
+ type uint64_t. The second argument is a pointer to that storage.
+
+-ARCH_GET_XCOMP_PERM
+
+ arch_prctl(ARCH_GET_XCOMP_PERM, &features);
+
+ ARCH_GET_XCOMP_PERM stores the features for which the userspace process
+ has permission in userspace storage of type uint64_t. The second argument
+ is a pointer to that storage.
+
+-ARCH_REQ_XCOMP_PERM
+
+ arch_prctl(ARCH_REQ_XCOMP_PERM, feature_nr);
+
+  ARCH_REQ_XCOMP_PERM allows an application to request permission for a
+  dynamically enabled feature or a feature set. A feature set can be
+  mapped to a facility, e.g.
+ AMX, and can require one or more XSTATE components to be enabled.
+
+ The feature argument is the number of the highest XSTATE component which
+ is required for a facility to work.
+
+When requesting permission for a feature, the kernel checks its
+availability. The kernel ensures that sigaltstacks in the process's
+tasks are large enough to accommodate the resulting large signal frame.
+It enforces this both during ARCH_REQ_XCOMP_PERM and during any
+subsequent sigaltstack(2) calls. If an installed sigaltstack is smaller
+than the resulting sigframe size, ARCH_REQ_XCOMP_PERM results in
+-ENOSUPP. Also, sigaltstack(2) results in -ENOMEM if the requested
+altstack is too small for the permitted features.
+
+Permission, when granted, is valid per process. Permissions are inherited
+on fork(2) and cleared on exec(3).
+
+The first use of an instruction related to a dynamically enabled feature is
+trapped by the kernel. The trap handler checks whether the process has
+permission to use the feature. If the process has no permission then the
+kernel sends SIGILL to the application. If the process has permission then
+the handler allocates a larger xstate buffer for the task so the large
+state can be context switched. In the unlikely case that the allocation
+fails, the kernel sends SIGSEGV.
+
+AMX TILE_DATA enabling example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Below is an example of how userspace applications enable
+TILE_DATA dynamically:
+
+ 1. The application first needs to query the kernel for AMX
+ support::
+
+ #include <asm/prctl.h>
+ #include <sys/syscall.h>
+ #include <stdio.h>
+ #include <unistd.h>
+
+ #ifndef ARCH_GET_XCOMP_SUPP
+ #define ARCH_GET_XCOMP_SUPP 0x1021
+ #endif
+
+ #ifndef ARCH_XCOMP_TILECFG
+ #define ARCH_XCOMP_TILECFG 17
+ #endif
+
+ #ifndef ARCH_XCOMP_TILEDATA
+ #define ARCH_XCOMP_TILEDATA 18
+ #endif
+
+ #define MASK_XCOMP_TILE ((1 << ARCH_XCOMP_TILECFG) | \
+ (1 << ARCH_XCOMP_TILEDATA))
+
+ unsigned long features;
+ long rc;
+
+ ...
+
+ rc = syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &features);
+
+ if (!rc && (features & MASK_XCOMP_TILE) == MASK_XCOMP_TILE)
+ printf("AMX is available.\n");
+
+  2. After determining support for AMX, an application must
+     explicitly ask permission to use it::
+
+ #ifndef ARCH_REQ_XCOMP_PERM
+ #define ARCH_REQ_XCOMP_PERM 0x1023
+ #endif
+
+ ...
+
+ rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, ARCH_XCOMP_TILEDATA);
+
+ if (!rc)
+ printf("AMX is ready for use.\n");
+
+Note this example does not include the sigaltstack preparation.
+
+Dynamic features in signal frames
+---------------------------------
+
+Dynamically enabled features are not written to the signal frame upon
+signal entry if the feature is in its initial configuration. This
+differs from non-dynamic features, which are always written regardless
+of their configuration. Signal handlers can examine the XSAVE buffer's
+XSTATE_BV field to determine if a feature was written.
+
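+A hedged sketch of such a check, assuming the standard x86-64 layout in
+which the 64-byte XSAVE header (whose first 8 bytes are XSTATE_BV)
+starts at offset 512 of the XSAVE area reachable via
+uc_mcontext.fpregs::
+
+	#include <signal.h>
+	#include <stdint.h>
+	#include <ucontext.h>
+
+	#define XFEATURE_MASK_XTILEDATA	(1ULL << 18)	/* AMX TILE_DATA */
+
+	static void handler(int sig, siginfo_t *si, void *uc_void)
+	{
+		ucontext_t *uc = uc_void;
+		/* fpregs points at the FXSAVE/XSAVE area; XSTATE_BV is
+		 * the first 8 bytes of the XSAVE header at offset 512. */
+		uint64_t xstate_bv =
+			*(uint64_t *)((char *)uc->uc_mcontext.fpregs + 512);
+
+		if (xstate_bv & XFEATURE_MASK_XTILEDATA) {
+			/* TILE_DATA was in use and is part of this frame. */
+		}
+	}
+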
+Dynamic features for virtual machines
+-------------------------------------
+
+The permission for the guest state component needs to be managed
+separately from the host, as they are exclusive to each other. A couple
+of options are extended to control the guest permission:
+
+-ARCH_GET_XCOMP_GUEST_PERM
+
+ arch_prctl(ARCH_GET_XCOMP_GUEST_PERM, &features);
+
+ ARCH_GET_XCOMP_GUEST_PERM is a variant of ARCH_GET_XCOMP_PERM. So it
+ provides the same semantics and functionality but for the guest
+ components.
+
+-ARCH_REQ_XCOMP_GUEST_PERM
+
+ arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM, feature_nr);
+
+ ARCH_REQ_XCOMP_GUEST_PERM is a variant of ARCH_REQ_XCOMP_PERM. It has the
+ same semantics for the guest permission. While providing a similar
+ functionality, this comes with a constraint. Permission is frozen when the
+ first VCPU is created. Any attempt to change permission after that point
+ is going to be rejected. So, the permission has to be requested before the
+ first VCPU creation.
+
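+For illustration, a minimal sketch of a VMM requesting guest AMX
+permission before creating its first VCPU (the prctl numbers are guarded
+with #ifndef as in the example above)::
+
+	#include <sys/syscall.h>
+	#include <unistd.h>
+
+	#ifndef ARCH_REQ_XCOMP_GUEST_PERM
+	#define ARCH_REQ_XCOMP_GUEST_PERM 0x1025
+	#endif
+
+	#ifndef ARCH_XCOMP_TILEDATA
+	#define ARCH_XCOMP_TILEDATA 18
+	#endif
+
+	...
+
+	/* Must happen before KVM_CREATE_VCPU; afterwards the guest
+	 * permission mask is frozen and this call will be rejected. */
+	rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_GUEST_PERM,
+		     ARCH_XCOMP_TILEDATA);
+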
+Note that some VMMs may have already established a set of supported state
+components. These options are not presumed to support any particular VMM.
diff --git a/Documentation/x86/zero-page.rst b/Documentation/arch/x86/zero-page.rst
index f088f5881666..45aa9cceb4f1 100644
--- a/Documentation/x86/zero-page.rst
+++ b/Documentation/arch/x86/zero-page.rst
@@ -19,6 +19,7 @@ Offset/Size Proto Name Meaning
058/008 ALL tboot_addr Physical address of tboot shared page
060/010 ALL ist_info Intel SpeedStep (IST) BIOS support information
(struct ist_info)
+070/008 ALL acpi_rsdp_addr Physical address of ACPI RSDP table
080/010 ALL hd0_info hd0 disk parameter, OBSOLETE!!
090/010 ALL hd1_info hd1 disk parameter, OBSOLETE!!
0A0/010 ALL sys_desc_table System description table (struct sys_desc_table),
@@ -27,6 +28,7 @@ Offset/Size Proto Name Meaning
0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits
0C4/004 ALL ext_ramdisk_size ramdisk_size high 32bits
0C8/004 ALL ext_cmd_line_ptr cmd_line_ptr high 32bits
+13C/004 ALL cc_blob_address Physical address of Confidential Computing blob
140/080 ALL edid_info Video mode setup (struct edid_info)
1C0/020 ALL efi_info EFI 32 information (struct efi_info)
1E0/004 ALL alt_mem_k Alternative mem check, in KB
diff --git a/Documentation/xtensa/atomctl.rst b/Documentation/arch/xtensa/atomctl.rst
index 1ecbd0ba9a2e..1ecbd0ba9a2e 100644
--- a/Documentation/xtensa/atomctl.rst
+++ b/Documentation/arch/xtensa/atomctl.rst
diff --git a/Documentation/xtensa/booting.rst b/Documentation/arch/xtensa/booting.rst
index e1b83707e5b6..e1b83707e5b6 100644
--- a/Documentation/xtensa/booting.rst
+++ b/Documentation/arch/xtensa/booting.rst
diff --git a/Documentation/xtensa/features.rst b/Documentation/arch/xtensa/features.rst
index 6b92c7bfa19d..6b92c7bfa19d 100644
--- a/Documentation/xtensa/features.rst
+++ b/Documentation/arch/xtensa/features.rst
diff --git a/Documentation/xtensa/index.rst b/Documentation/arch/xtensa/index.rst
index 69952446a9be..69952446a9be 100644
--- a/Documentation/xtensa/index.rst
+++ b/Documentation/arch/xtensa/index.rst
diff --git a/Documentation/xtensa/mmu.rst b/Documentation/arch/xtensa/mmu.rst
index 450573afa31a..450573afa31a 100644
--- a/Documentation/xtensa/mmu.rst
+++ b/Documentation/arch/xtensa/mmu.rst
diff --git a/Documentation/arm/google/chromebook-boot-flow.rst b/Documentation/arm/google/chromebook-boot-flow.rst
new file mode 100644
index 000000000000..36da77684bba
--- /dev/null
+++ b/Documentation/arm/google/chromebook-boot-flow.rst
@@ -0,0 +1,69 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+Chromebook Boot Flow
+======================================
+
+Most recent Chromebooks that use device tree are using the open source
+depthcharge_ bootloader. Depthcharge_ expects the OS to be packaged as a `FIT
+Image`_ which contains an OS image as well as a collection of device trees. It
+is up to depthcharge_ to pick the right device tree from the `FIT Image`_ and
+provide it to the OS.
+
+The scheme that depthcharge_ uses to pick the device tree takes into account
+three variables:
+
+- Board name, specified at depthcharge_ compile time. This is $(BOARD) below.
+- Board revision number, determined at runtime (perhaps by reading GPIO
+ strappings, perhaps via some other method). This is $(REV) below.
+- SKU number, read from GPIO strappings at boot time. This is $(SKU) below.
+
+For recent Chromebooks, depthcharge_ creates a match list that looks like this:
+
+- google,$(BOARD)-rev$(REV)-sku$(SKU)
+- google,$(BOARD)-rev$(REV)
+- google,$(BOARD)-sku$(SKU)
+- google,$(BOARD)
+
+Note that some older Chromebooks use a slightly different list that may
+not include SKU matching or may prioritize SKU/rev differently.
+
+Note that for some boards there may be extra board-specific logic to inject
+extra compatibles into the list, but this is uncommon.
+
+Depthcharge_ will look through all device trees in the `FIT Image`_ trying to
+find one that matches the most specific compatible. It will then look
+through all device trees in the `FIT Image`_ trying to find the one that
+matches the *second most* specific compatible, etc.
+
+When searching for a device tree, depthcharge_ doesn't care where the
+compatible string falls within a device tree's root compatible string array.
+As an example, if we're on board "lazor", rev 4, SKU 0 and we have two device
+trees:
+
+- "google,lazor-rev5-sku0", "google,lazor-rev4-sku0", "qcom,sc7180"
+- "google,lazor", "qcom,sc7180"
+
+Then depthcharge_ will pick the first device tree even though
+"google,lazor-rev4-sku0" was the second compatible listed in that device tree.
+This is because it is a more specific compatible than "google,lazor".
+
+It should be noted that depthcharge_ does not have any smarts to try to
+match board or SKU revisions that are "close by". That is to say that
+if depthcharge_ knows it's on "rev4" of a board but there is no "rev4"
+device tree then depthcharge_ *won't* look for a "rev3" device tree.
+
+In general when any significant changes are made to a board the board
+revision number is increased even if none of those changes need to
+be reflected in the device tree. Thus it's fairly common to see a single
+device tree listing multiple revisions in its compatible array.
+
+Given the above scheme, the most flexibility is achieved if the device
+tree supporting the newest revision(s) of a board omits the "-rev{REV}"
+compatible strings. That way, if a new board revision shows up and old
+software is run on it, depthcharge_ will at least pick the newest device
+tree it knows about.
+
+.. _depthcharge: https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/platform/depthcharge/
+.. _`FIT Image`: https://doc.coreboot.org/lib/payloads/fit.html
diff --git a/Documentation/arm/index.rst b/Documentation/arm/index.rst
index d4f34ae9e6f4..fd43502ae924 100644
--- a/Documentation/arm/index.rst
+++ b/Documentation/arm/index.rst
@@ -31,6 +31,8 @@ SoC-specific documents
.. toctree::
:maxdepth: 1
+ google/chromebook-boot-flow
+
ixp4xx
marvell
@@ -55,22 +57,22 @@ SoC-specific documents
stm32/stm32h750-overview
stm32/stm32f769-overview
stm32/stm32f429-overview
+ stm32/stm32mp13-overview
+ stm32/stm32mp151-overview
stm32/stm32mp157-overview
+ stm32/stm32-dma-mdma-chaining
sunxi
samsung/index
- samsung-s3c24xx/index
sunxi/clocks
spear/overview
- sti/stih416-overview
sti/stih407-overview
sti/stih418-overview
sti/overview
- sti/stih415-overview
vfp/release-notes
diff --git a/Documentation/arm/marvell.rst b/Documentation/arm/marvell.rst
index 56bb592dbd0c..3d369a566038 100644
--- a/Documentation/arm/marvell.rst
+++ b/Documentation/arm/marvell.rst
@@ -14,17 +14,20 @@ Orion family
Flavors:
- 88F5082
- - 88F5181
- - 88F5181L
- - 88F5182
+ - 88F5181 a.k.a. Orion-1
+ - 88F5181L a.k.a. Orion-VoIP
+ - 88F5182 a.k.a. Orion-NAS
- Datasheet: https://web.archive.org/web/20210124231420/http://csclub.uwaterloo.ca/~board/ts7800/MV88F5182-datasheet.pdf
- Programmer's User Guide: https://web.archive.org/web/20210124231536/http://csclub.uwaterloo.ca/~board/ts7800/MV88F5182-opensource-manual.pdf
- User Manual: https://web.archive.org/web/20210124231631/http://csclub.uwaterloo.ca/~board/ts7800/MV88F5182-usermanual.pdf
- - 88F5281
+ - Functional Errata: https://web.archive.org/web/20210704165540/https://www.digriz.org.uk/ts78xx/88F5182_Functional_Errata.pdf
+ - 88F5281 a.k.a. Orion-2
- Datasheet: https://web.archive.org/web/20131028144728/http://www.ocmodshop.com/images/reviews/networking/qnap_ts409u/marvel_88f5281_data_sheet.pdf
- - 88F6183
+ - 88F6183 a.k.a. Orion-1-90
+ Homepage:
+ https://web.archive.org/web/20080607215437/http://www.marvell.com/products/media/index.jsp
Core:
Feroceon 88fr331 (88f51xx) or 88fr531-vd (88f52xx) ARMv5 compatible
Linux kernel mach directory:
@@ -103,6 +106,8 @@ Discovery family
Not supported by the Linux kernel.
+ Homepage:
+ https://web.archive.org/web/20110924171043/http://www.marvell.com/embedded-processors/discovery-innovation/
Core:
Feroceon 88fr571-vd ARMv5 compatible
@@ -119,6 +124,7 @@ EBU Armada family
- 88F6707
- 88F6W11
+ - Product info: https://web.archive.org/web/20141002083258/http://www.marvell.com/embedded-processors/armada-370/
- Product Brief: https://web.archive.org/web/20121115063038/http://www.marvell.com/embedded-processors/armada-300/assets/Marvell_ARMADA_370_SoC.pdf
- Hardware Spec: https://web.archive.org/web/20140617183747/http://www.marvell.com/embedded-processors/armada-300/assets/ARMADA370-datasheet.pdf
- Functional Spec: https://web.archive.org/web/20140617183701/http://www.marvell.com/embedded-processors/armada-300/assets/ARMADA370-FunctionalSpec-datasheet.pdf
@@ -126,9 +132,29 @@ EBU Armada family
Core:
Sheeva ARMv7 compatible PJ4B
+ Armada XP Flavors:
+ - MV78230
+ - MV78260
+ - MV78460
+
+ NOTE:
+ not to be confused with the non-SMP 78xx0 SoCs
+
+ - Product info: https://web.archive.org/web/20150101215721/http://www.marvell.com/embedded-processors/armada-xp/
+ - Product Brief: https://web.archive.org/web/20121021173528/http://www.marvell.com/embedded-processors/armada-xp/assets/Marvell-ArmadaXP-SoC-product%20brief.pdf
+ - Functional Spec: https://web.archive.org/web/20180829171131/http://www.marvell.com/embedded-processors/armada-xp/assets/ARMADA-XP-Functional-SpecDatasheet.pdf
+ - Hardware Specs:
+ - https://web.archive.org/web/20141127013651/http://www.marvell.com/embedded-processors/armada-xp/assets/HW_MV78230_OS.PDF
+ - https://web.archive.org/web/20141222000224/http://www.marvell.com/embedded-processors/armada-xp/assets/HW_MV78260_OS.PDF
+ - https://web.archive.org/web/20141222000230/http://www.marvell.com/embedded-processors/armada-xp/assets/HW_MV78460_OS.PDF
+
+ Core:
+ Sheeva ARMv7 compatible Dual-core or Quad-core PJ4B-MP
+
Armada 375 Flavors:
- 88F6720
+ - Product info: https://web.archive.org/web/20140108032402/http://www.marvell.com/embedded-processors/armada-375/
- Product Brief: https://web.archive.org/web/20131216023516/http://www.marvell.com/embedded-processors/armada-300/assets/ARMADA_375_SoC-01_product_brief.pdf
Core:
@@ -161,29 +187,6 @@ EBU Armada family
Core:
ARM Cortex-A9
- Armada XP Flavors:
- - MV78230
- - MV78260
- - MV78460
-
- NOTE:
- not to be confused with the non-SMP 78xx0 SoCs
-
- Product Brief:
- https://web.archive.org/web/20121021173528/http://www.marvell.com/embedded-processors/armada-xp/assets/Marvell-ArmadaXP-SoC-product%20brief.pdf
-
- Functional Spec:
- https://web.archive.org/web/20180829171131/http://www.marvell.com/embedded-processors/armada-xp/assets/ARMADA-XP-Functional-SpecDatasheet.pdf
-
- - Hardware Specs:
-
- - https://web.archive.org/web/20141127013651/http://www.marvell.com/embedded-processors/armada-xp/assets/HW_MV78230_OS.PDF
- - https://web.archive.org/web/20141222000224/http://www.marvell.com/embedded-processors/armada-xp/assets/HW_MV78260_OS.PDF
- - https://web.archive.org/web/20141222000230/http://www.marvell.com/embedded-processors/armada-xp/assets/HW_MV78460_OS.PDF
-
- Core:
- Sheeva ARMv7 compatible Dual-core or Quad-core PJ4B-MP
-
Linux kernel mach directory:
arch/arm/mach-mvebu
Linux kernel plat directory:
@@ -212,6 +215,7 @@ EBU Armada family ARMv8
arch/arm64/boot/dts/marvell/armada-37*
Armada 7K Flavors:
+ - 88F6040 (AP806 Quad 600 MHz + one CP110)
- 88F7020 (AP806 Dual + one CP110)
- 88F7040 (AP806 Quad + one CP110)
@@ -243,14 +247,33 @@ EBU Armada family ARMv8
Device tree files:
arch/arm64/boot/dts/marvell/armada-80*
+ Octeon TX2 CN913x Flavors:
+ - CN9130 (AP807 Quad + one internal CP115)
+ - CN9131 (AP807 Quad + one internal CP115 + one external CP115 / 88F8215)
+ - CN9132 (AP807 Quad + one internal CP115 + two external CP115 / 88F8215)
+
+ Core:
+ ARM Cortex A72
+
+ Homepage:
+ https://web.archive.org/web/20200803150818/https://www.marvell.com/products/infrastructure-processors/multi-core-processors/octeon-tx2/octeon-tx2-cn9130.html
+
+ Product Brief:
+ https://web.archive.org/web/20200803150818/https://www.marvell.com/content/dam/marvell/en/public-collateral/embedded-processors/marvell-infrastructure-processors-octeon-tx2-cn913x-product-brief-2020-02.pdf
+
+ Device tree files:
+ arch/arm64/boot/dts/marvell/cn913*
+
Avanta family
-------------
Flavors:
+ - 88F6500
- 88F6510
- 88F6530P
- 88F6550
- 88F6560
+ - 88F6601
Homepage:
https://web.archive.org/web/20181005145041/http://www.marvell.com/broadband/
@@ -353,8 +376,6 @@ PXA 2xx/3xx/93x/95x family
Linux kernel mach directory:
arch/arm/mach-pxa
- Linux kernel plat directory:
- arch/arm/plat-pxa
MMP/MMP2/MMP3 family (communication processor)
----------------------------------------------
@@ -408,8 +429,6 @@ MMP/MMP2/MMP3 family (communication processor)
Linux kernel mach directory:
arch/arm/mach-mmp
- Linux kernel plat directory:
- arch/arm/plat-pxa
Berlin family (Multimedia Solutions)
-------------------------------------
@@ -417,7 +436,7 @@ Berlin family (Multimedia Solutions)
- Flavors:
- 88DE3010, Armada 1000 (no Linux support)
- Core: Marvell PJ1 (ARMv5TE), Dual-core
- - Product Brief: http://www.marvell.com.cn/digital-entertainment/assets/armada_1000_pb.pdf
+ - Product Brief: https://web.archive.org/web/20131103162620/http://www.marvell.com/digital-entertainment/assets/armada_1000_pb.pdf
- 88DE3005, Armada 1500 Mini
- Design name: BG2CD
- Core: ARM Cortex-A9, PL310 L2CC
@@ -497,9 +516,6 @@ Long-term plans
Business Unit) in a single mach-<foo> directory. The plat-orion/
would therefore disappear.
- * Unify the mach-mmp/ and mach-pxa/ into the same mach-pxa
- directory. The plat-pxa/ would therefore disappear.
-
Credits
-------
diff --git a/Documentation/arm/microchip.rst b/Documentation/arm/microchip.rst
index 9c013299fd3b..e721d855f2c9 100644
--- a/Documentation/arm/microchip.rst
+++ b/Documentation/arm/microchip.rst
@@ -137,6 +137,26 @@ the Microchip website: http://www.microchip.com.
http://ww1.microchip.com/downloads/en/DeviceDoc/DS60001476B.pdf
+ * ARM Cortex-A7 based SoCs
+ - sama7g5 family
+
+ - sama7g51
+ - sama7g52
+ - sama7g53
+ - sama7g54 (device superset)
+
+ * Datasheet
+
+ Coming soon
+
+ - lan966 family
+ - lan9662
+ - lan9668
+
+ * Datasheet
+
+ Coming soon
+
* ARM Cortex-M7 MCUs
- sams70 family
diff --git a/Documentation/arm/samsung-s3c24xx/cpufreq.rst b/Documentation/arm/samsung-s3c24xx/cpufreq.rst
deleted file mode 100644
index 2ddc26c03b1f..000000000000
--- a/Documentation/arm/samsung-s3c24xx/cpufreq.rst
+++ /dev/null
@@ -1,76 +0,0 @@
-=======================
-S3C24XX CPUfreq support
-=======================
-
-Introduction
-------------
-
- The S3C24XX series support a number of power saving systems, such as
- the ability to change the core, memory and peripheral operating
- frequencies. The core control is exported via the CPUFreq driver
- which has a number of different manual or automatic controls over the
- rate the core is running at.
-
- There are two forms of the driver depending on the specific CPU and
- how the clocks are arranged. The first implementation used as single
- PLL to feed the ARM, memory and peripherals via a series of dividers
- and muxes and this is the implementation that is documented here. A
- newer version where there is a separate PLL and clock divider for the
- ARM core is available as a separate driver.
-
-
-Layout
-------
-
- The core code manages the CPU specific drivers, any data that they
- need to register and the interface to the generic drivers/cpufreq
- system. Each CPU registers a driver to control the PLL, clock dividers
- and anything else associated with it. Any board that wants to use this
- framework needs to supply at least basic details of what is required.
-
- The core registers with drivers/cpufreq at init time if all the data
- necessary has been supplied.
-
-
-CPU support
------------
-
- The support for each CPU depends on the facilities provided by the
- SoC and the driver as each device has different PLL and clock chains
- associated with it.
-
-
-Slow Mode
----------
-
- The SLOW mode where the PLL is turned off altogether and the
- system is fed by the external crystal input is currently not
- supported.
-
-
-sysfs
------
-
- The core code exports extra information via sysfs in the directory
- devices/system/cpu/cpu0/arch-freq.
-
-
-Board Support
--------------
-
- Each board that wants to use the cpufreq code must register some basic
- information with the core driver to provide information about what the
- board requires and any restrictions being placed on it.
-
- The board needs to supply information about whether it needs the IO bank
- timings changing, any maximum frequency limits and information about the
- SDRAM refresh rate.
-
-
-
-
-Document Author
----------------
-
-Ben Dooks, Copyright 2009 Simtec Electronics
-Licensed under GPLv2
diff --git a/Documentation/arm/samsung-s3c24xx/eb2410itx.rst b/Documentation/arm/samsung-s3c24xx/eb2410itx.rst
deleted file mode 100644
index 7863c93652f8..000000000000
--- a/Documentation/arm/samsung-s3c24xx/eb2410itx.rst
+++ /dev/null
@@ -1,59 +0,0 @@
-===================================
-Simtec Electronics EB2410ITX (BAST)
-===================================
-
- http://www.simtec.co.uk/products/EB2410ITX/
-
-Introduction
-------------
-
- The EB2410ITX is a S3C2410 based development board with a variety of
- peripherals and expansion connectors. This board is also known by
- the shortened name of Bast.
-
-
-Configuration
--------------
-
- To set the default configuration, use `make bast_defconfig` which
- supports the commonly used features of this board.
-
-
-Support
--------
-
- Official support information can be found on the Simtec Electronics
- website, at the product page http://www.simtec.co.uk/products/EB2410ITX/
-
- Useful links:
-
- - Resources Page http://www.simtec.co.uk/products/EB2410ITX/resources.html
-
- - Board FAQ at http://www.simtec.co.uk/products/EB2410ITX/faq.html
-
- - Bootloader info http://www.simtec.co.uk/products/SWABLE/resources.html
- and FAQ http://www.simtec.co.uk/products/SWABLE/faq.html
-
-
-MTD
----
-
- The NAND and NOR support has been merged from the linux-mtd project.
- Any problems, see http://www.linux-mtd.infradead.org/ for more
- information or up-to-date versions of linux-mtd.
-
-
-IDE
----
-
- Both onboard IDE ports are supported, however there is no support for
- changing speed of devices, PIO Mode 4 capable drives should be used.
-
-
-Maintainers
------------
-
- This board is maintained by Simtec Electronics.
-
-
-Copyright 2004 Ben Dooks, Simtec Electronics
diff --git a/Documentation/arm/samsung-s3c24xx/gpio.rst b/Documentation/arm/samsung-s3c24xx/gpio.rst
deleted file mode 100644
index f4a8c800a457..000000000000
--- a/Documentation/arm/samsung-s3c24xx/gpio.rst
+++ /dev/null
@@ -1,172 +0,0 @@
-====================
-S3C24XX GPIO Control
-====================
-
-Introduction
-------------
-
- The s3c2410 kernel provides an interface to configure and
- manipulate the state of the GPIO pins, and find out other
- information about them.
-
- There are a number of conditions attached to the configuration
- of the s3c2410 GPIO system, please read the Samsung provided
- data-sheet/users manual to find out the complete list.
-
- See Documentation/arm/samsung/gpio.rst for the core implementation.
-
-
-GPIOLIB
--------
-
- With the event of the GPIOLIB in drivers/gpio, support for some
- of the GPIO functions such as reading and writing a pin will
- be removed in favour of this common access method.
-
- Once all the extant drivers have been converted, the functions
- listed below will be removed (they may be marked as __deprecated
- in the near future).
-
- The following functions now either have a `s3c_` specific variant
- or are merged into gpiolib. See the definitions in
- arch/arm/mach-s3c/gpio-cfg.h:
-
- - s3c2410_gpio_setpin() gpio_set_value() or gpio_direction_output()
- - s3c2410_gpio_getpin() gpio_get_value() or gpio_direction_input()
- - s3c2410_gpio_getirq() gpio_to_irq()
- - s3c2410_gpio_cfgpin() s3c_gpio_cfgpin()
- - s3c2410_gpio_getcfg() s3c_gpio_getcfg()
- - s3c2410_gpio_pullup() s3c_gpio_setpull()
-
-
-GPIOLIB conversion
-------------------
-
-If you need to convert your board or driver to use gpiolib from the phased
-out s3c2410 API, then here are some notes on the process.
-
-1) If your board is exclusively using an GPIO, say to control peripheral
- power, then it will require to claim the gpio with gpio_request() before
- it can use it.
-
- It is recommended to check the return value, with at least WARN_ON()
- during initialisation.
-
-2) The s3c2410_gpio_cfgpin() can be directly replaced with s3c_gpio_cfgpin()
- as they have the same arguments, and can either take the pin specific
- values, or the more generic special-function-number arguments.
-
-3) s3c2410_gpio_pullup() changes have the problem that while the
- s3c2410_gpio_pullup(x, 1) can be easily translated to the
- s3c_gpio_setpull(x, S3C_GPIO_PULL_NONE), the s3c2410_gpio_pullup(x, 0)
- are not so easy.
-
- The s3c2410_gpio_pullup(x, 0) case enables the pull-up (or in the case
- of some of the devices, a pull-down) and as such the new API distinguishes
- between the UP and DOWN case. There is currently no 'just turn on' setting
- which may be required if this becomes a problem.
-
-4) s3c2410_gpio_setpin() can be replaced by gpio_set_value(), the old call
- does not implicitly configure the relevant gpio to output. The gpio
- direction should be changed before using gpio_set_value().
-
-5) s3c2410_gpio_getpin() is replaceable by gpio_get_value() if the pin
- has been set to input. It is currently unknown what the behaviour is
- when using gpio_get_value() on an output pin (s3c2410_gpio_getpin
- would return the value the pin is supposed to be outputting).
-
-6) s3c2410_gpio_getirq() should be directly replaceable with the
- gpio_to_irq() call.
-
-The s3c2410_gpio and `gpio_` calls have always operated on the same gpio
-numberspace, so there is no problem with converting the gpio numbering
-between the calls.
-
-
-Headers
--------
-
- See arch/arm/mach-s3c/regs-gpio-s3c24xx.h for the list
- of GPIO pins, and the configuration values for them. This
- is included by using #include <mach/regs-gpio.h>
-
-
-PIN Numbers
------------
-
- Each pin has an unique number associated with it in regs-gpio.h,
- e.g. S3C2410_GPA(0) or S3C2410_GPF(1). These defines are used to tell
- the GPIO functions which pin is to be used.
-
- With the conversion to gpiolib, there is no longer a direct conversion
- from gpio pin number to register base address as in earlier kernels. This
- is due to the number space required for newer SoCs where the later
- GPIOs are not contiguous.
-
-
-Configuring a pin
------------------
-
- The following function allows the configuration of a given pin to
- be changed.
-
- void s3c_gpio_cfgpin(unsigned int pin, unsigned int function);
-
- e.g.:
-
- s3c_gpio_cfgpin(S3C2410_GPA(0), S3C_GPIO_SFN(1));
- s3c_gpio_cfgpin(S3C2410_GPE(8), S3C_GPIO_SFN(2));
-
- which would turn GPA(0) into the lowest Address line A0, and set
- GPE(8) to be connected to the SDIO/MMC controller's SDDAT1 line.
-
-
-Reading the current configuration
----------------------------------
-
- The current configuration of a pin can be read by using standard
- gpiolib function:
-
- s3c_gpio_getcfg(unsigned int pin);
-
- The return value will be from the same set of values which can be
- passed to s3c_gpio_cfgpin().
-
-
-Configuring a pull-up resistor
-------------------------------
-
- A large proportion of the GPIO pins on the S3C2410 can have weak
- pull-up resistors enabled. This can be configured by the following
- function:
-
- void s3c_gpio_setpull(unsigned int pin, unsigned int to);
-
- Where the to value is S3C_GPIO_PULL_NONE to set the pull-up off,
- and S3C_GPIO_PULL_UP to enable the specified pull-up. Any other
- values are currently undefined.
-
-
-Getting and setting the state of a PIN
---------------------------------------
-
- These calls are now implemented by the relevant gpiolib calls, convert
- your board or driver to use gpiolib.
-
-
-Getting the IRQ number associated with a PIN
---------------------------------------------
-
- A standard gpiolib function can map the given pin number to an IRQ
- number to pass to the IRQ system.
-
- int gpio_to_irq(unsigned int pin);
-
- Note, not all pins have an IRQ.
-
-
-Author
--------
-
-Ben Dooks, 03 October 2004
-Copyright 2004 Ben Dooks, Simtec Electronics
diff --git a/Documentation/arm/samsung-s3c24xx/h1940.rst b/Documentation/arm/samsung-s3c24xx/h1940.rst
deleted file mode 100644
index 62a562c178e3..000000000000
--- a/Documentation/arm/samsung-s3c24xx/h1940.rst
+++ /dev/null
@@ -1,41 +0,0 @@
-=============
-HP IPAQ H1940
-=============
-
-http://www.handhelds.org/projects/h1940.html
-
-Introduction
-------------
-
- The HP H1940 is a S3C2410 based handheld device, with
- bluetooth connectivity.
-
-
-Support
--------
-
- A variety of information is available
-
- handhelds.org project page:
-
- http://www.handhelds.org/projects/h1940.html
-
- handhelds.org wiki page:
-
- http://handhelds.org/moin/moin.cgi/HpIpaqH1940
-
- Herbert Pötzl pages:
-
- http://vserver.13thfloor.at/H1940/
-
-
-Maintainers
------------
-
- This project is being maintained and developed by a variety
- of people, including Ben Dooks, Arnaud Patard, and Herbert Pötzl.
-
- Thanks to the many others who have also provided support.
-
-
-(c) 2005 Ben Dooks
diff --git a/Documentation/arm/samsung-s3c24xx/index.rst b/Documentation/arm/samsung-s3c24xx/index.rst
deleted file mode 100644
index ccb951a0bedb..000000000000
--- a/Documentation/arm/samsung-s3c24xx/index.rst
+++ /dev/null
@@ -1,20 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==========================
-Samsung S3C24XX SoC Family
-==========================
-
-.. toctree::
- :maxdepth: 1
-
- h1940
- gpio
- cpufreq
- suspend
- usb-host
- s3c2412
- eb2410itx
- nand
- smdk2440
- s3c2413
- overview
diff --git a/Documentation/arm/samsung-s3c24xx/nand.rst b/Documentation/arm/samsung-s3c24xx/nand.rst
deleted file mode 100644
index 938995694ee7..000000000000
--- a/Documentation/arm/samsung-s3c24xx/nand.rst
+++ /dev/null
@@ -1,30 +0,0 @@
-====================
-S3C24XX NAND Support
-====================
-
-Introduction
-------------
-
-Small Page NAND
----------------
-
-The driver uses a 512 byte (1 page) ECC code for this setup. The
-ECC code is not directly compatible with the default kernel ECC
-code, so the driver enforces its own OOB layout and ECC parameters
-
-Large Page NAND
----------------
-
-The driver is capable of handling NAND flash with a 2KiB page
-size, with support for hardware ECC generation and correction.
-
-Unlike the 512byte page mode, the driver generates ECC data for
-each 256 byte block in an 2KiB page. This means that more than
-one error in a page can be rectified. It also means that the
-OOB layout remains the default kernel layout for these flashes.
-
-
-Document Author
----------------
-
-Ben Dooks, Copyright 2007 Simtec Electronics
diff --git a/Documentation/arm/samsung-s3c24xx/overview.rst b/Documentation/arm/samsung-s3c24xx/overview.rst
deleted file mode 100644
index 14535e5cffb7..000000000000
--- a/Documentation/arm/samsung-s3c24xx/overview.rst
+++ /dev/null
@@ -1,311 +0,0 @@
-==========================
-S3C24XX ARM Linux Overview
-==========================
-
-
-
-Introduction
-------------
-
- The Samsung S3C24XX range of ARM9 System-on-Chip CPUs are supported
- by the 's3c2410' architecture of ARM Linux. Currently the S3C2410,
- S3C2412, S3C2413, S3C2416, S3C2440, S3C2442, S3C2443 and S3C2450 devices
- are supported.
-
- Support for the S3C2400 and S3C24A0 series was never completed and the
- corresponding code has been removed after a while. If someone wishes to
- revive this effort, partial support can be retrieved from earlier Linux
- versions.
-
- The S3C2416 and S3C2450 devices are very similar and S3C2450 support is
- included under the arch/arm/mach-s3c directory. Note, while core
- support for these SoCs is in, work on some of the extra peripherals
- and extra interrupts is still ongoing.
-
-
-Configuration
--------------
-
- A generic S3C2410 configuration is provided, and can be used as the
- default by `make s3c2410_defconfig`. This configuration has support
- for all the machines, and the commonly used features on them.
-
- Certain machines may have their own default configurations as well,
- please check the machine specific documentation.
-
-
-Layout
-------
-
- The core support files, registers, kernel and platform data are located in the
- platform code contained in arch/arm/mach-s3c with headers in
- arch/arm/mach-s3c/include
-
-arch/arm/mach-s3c:
-
- Files in here are either common to all the s3c24xx family,
- or are common to only some of them with names to indicate this
- status. The files that are not common to all are generally named
- with the initial cpu they support in the series to ensure a short
- name without any possibility of confusion with newer devices.
-
- As an example, initially s3c244x would cover s3c2440 and s3c2442, but
- with the s3c2443 which does not share many of the same drivers in
- this directory, the name becomes invalid. We stick to s3c2440-<x>
- to indicate a driver that is s3c2440 and s3c2442 compatible.
-
- This does mean that to find the status of any given SoC, a number
- of directories may need to be searched.
-
-
-Machines
---------
-
- The currently supported machines are as follows:
-
- Simtec Electronics EB2410ITX (BAST)
-
- A general purpose development board, see EB2410ITX.txt for further
- details
-
- Simtec Electronics IM2440D20 (Osiris)
-
- CPU Module from Simtec Electronics, with a S3C2440A CPU, nand flash
- and a PCMCIA controller.
-
- Samsung SMDK2410
-
- Samsung's own development board, geared for PDA work.
-
- Samsung/Aiji SMDK2412
-
- The S3C2412 version of the SMDK2440.
-
- Samsung/Aiji SMDK2413
-
- The S3C2412 version of the SMDK2440.
-
- Samsung/Meritech SMDK2440
-
- The S3C2440 compatible version of the SMDK2440, which has the
- option of an S3C2440 or S3C2442 CPU module.
-
- Thorcom VR1000
-
- Custom embedded board
-
- HP IPAQ 1940
-
- Handheld (IPAQ), available in several varieties
-
- HP iPAQ rx3715
-
- S3C2440 based IPAQ, with a number of variations depending on
- features shipped.
-
- Acer N30
-
- A S3C2410 based PDA from Acer. There is a Wiki page at
- http://handhelds.org/moin/moin.cgi/AcerN30Documentation .
-
- AML M5900
-
- American Microsystems' M5900
-
- Nex Vision Nexcoder
- Nex Vision Otom
-
- Two machines by Nex Vision
-
-
-Adding New Machines
--------------------
-
- The architecture has been designed to support as many machines as can
- be configured for it in one kernel build, and any future additions
- should keep this in mind before altering items outside of their own
- machine files.
-
- Machine definitions should be kept in arch/arm/mach-s3c,
- and there are a number of examples that can be looked at.
-
- Read the kernel patch submission policies as well as the
- Documentation/arm directory before submitting patches. The
- ARM kernel series is managed by Russell King, and has a patch system
- located at http://www.arm.linux.org.uk/developer/patches/
- as well as mailing lists that can be found from the same site.
-
- As a courtesy, please notify <ben-linux@fluff.org> of any new
- machines or other modifications.
-
- Any large scale modifications, or new drivers should be discussed
- on the ARM kernel mailing list (linux-arm-kernel) before being
- attempted. See http://www.arm.linux.org.uk/mailinglists/ for the
- mailing list information.
-
-
-I2C
----
-
- The hardware I2C core in the CPU is supported in single master
- mode, and can be configured via platform data.
-
-
-RTC
----
-
- Support for the onboard RTC unit, including alarm function.
-
- This has recently been upgraded to use the new RTC core,
- and the module has been renamed to rtc-s3c to fit in with
- the new rtc naming scheme.
-
-
-Watchdog
---------
-
- The onchip watchdog is available via the standard watchdog
- interface.
-
-
-NAND
-----
-
- The current kernels now have support for the s3c2410 NAND
- controller. If there are any problems the latest linux-mtd
- code can be found from http://www.linux-mtd.infradead.org/
-
- For more information see Documentation/arm/samsung-s3c24xx/nand.rst
-
-
-SD/MMC
-------
-
- The SD/MMC hardware pre S3C2443 is supported in the current
- kernel, the driver is drivers/mmc/host/s3cmci.c and supports
- 1 and 4 bit SD or MMC cards.
-
- The SDIO behaviour of this driver has not been fully tested. There is no
- current support for hardware SDIO interrupts.
-
-
-Serial
-------
-
- The s3c2410 serial driver provides support for the internal
- serial ports. These devices appear as /dev/ttySAC0 through 3.
-
- To create device nodes for these, use the following commands
-
- mknod ttySAC0 c 204 64
- mknod ttySAC1 c 204 65
- mknod ttySAC2 c 204 66
-
-
-GPIO
-----
-
- The core contains support for manipulating the GPIO, see the
- documentation in GPIO.txt in the same directory as this file.
-
- Newer kernels carry GPIOLIB, and support is being moved towards
- this with some of the older support in line to be removed.
-
- As of v2.6.34, the move towards using gpiolib support is almost
- complete, and very little of the old calls are left.
-
- See Documentation/arm/samsung-s3c24xx/gpio.rst for the S3C24XX specific
- support and Documentation/arm/samsung/gpio.rst for the core Samsung
- implementation.
-
-
-Clock Management
-----------------
-
- The core provides the interface defined in the header file
- include/asm-arm/hardware/clock.h, to allow control over the
- various clock units
-
-
-Suspend to RAM
---------------
-
- For boards that provide support for suspend to RAM, the
- system can be placed into low power suspend.
-
- See Suspend.txt for more information.
-
-
-SPI
----
-
- SPI drivers are available for both the in-built hardware
- (although there is no DMA support yet) and a generic
- GPIO based solution.
-
-
-LEDs
-----
-
- There is support for GPIO based LEDs via a platform driver
- in the LED subsystem.
-
-
-Platform Data
--------------
-
- Whenever a device has platform specific data that is specified
- on a per-machine basis, care should be taken to ensure the
- following:
-
- 1) that default data is not left in the device to confuse the
- driver if a machine does not set it at startup
-
- 2) the data should (if possible) be marked as __initdata,
- to ensure that the data is thrown away if the machine is
- not the one currently in use.
-
- The best way of doing this is to make a function that
- kmalloc()s an area of memory, and copies the __initdata
- and then sets the relevant device's platform data. Making
- the function `__init` takes care of ensuring it is discarded
- with the rest of the initialisation code::
-
- static __init void s3c24xx_xxx_set_platdata(struct xxx_data *pd)
- {
- struct s3c2410_xxx_mach_info *npd;
-
- npd = kmalloc(sizeof(struct s3c2410_xxx_mach_info), GFP_KERNEL);
- if (npd) {
- memcpy(npd, pd, sizeof(struct s3c2410_xxx_mach_info));
- s3c_device_xxx.dev.platform_data = npd;
- } else {
- printk(KERN_ERR "no memory for xxx platform data\n");
- }
- }
-
- Note, since the code is marked as __init, it should not be
- exported outside arch/arm/mach-s3c/, or exported to
- modules via EXPORT_SYMBOL() and related functions.
-
-
-Port Contributors
------------------
-
- Ben Dooks (BJD)
- Vincent Sanders
- Herbert Potzl
- Arnaud Patard (RTP)
- Roc Wu
- Klaus Fetscher
- Dimitry Andric
- Shannon Holland
- Guillaume Gourat (NexVision)
- Christer Weinigel (wingel) (Acer N30)
- Lucas Correia Villa Real (S3C2400 port)
-
-
-Document Author
----------------
-
-Ben Dooks, Copyright 2004-2006 Simtec Electronics
diff --git a/Documentation/arm/samsung-s3c24xx/s3c2412.rst b/Documentation/arm/samsung-s3c24xx/s3c2412.rst
deleted file mode 100644
index 68b985fc6bf4..000000000000
--- a/Documentation/arm/samsung-s3c24xx/s3c2412.rst
+++ /dev/null
@@ -1,121 +0,0 @@
-==========================
-S3C2412 ARM Linux Overview
-==========================
-
-Introduction
-------------
-
- The S3C2412 is part of the S3C24XX range of ARM9 System-on-Chip CPUs
- from Samsung. This part has an ARM926-EJS core, capable of running up
- to 266MHz (see data-sheet for more information)
-
-
-Clock
------
-
- The core clock code provides a set of clocks to the drivers, and allows
- for source selection and a number of other features.
-
-
-Power
------
-
- No support for suspend/resume to RAM in the current system.
-
-
-DMA
----
-
- No current support for DMA.
-
-
-GPIO
-----
-
- There is support for setting the GPIO to input/output/special function
- and reading or writing to them.
-
-
-UART
-----
-
- The UART hardware is similar to the S3C2440, and is supported by the
- s3c2410 driver in the drivers/serial directory.
-
-
-NAND
-----
-
- The NAND hardware is similar to the S3C2440, and is supported by the
- s3c2410 driver in the drivers/mtd/nand/raw directory.
-
-
-USB Host
---------
-
- The USB hardware is similar to the S3C2410, with extended clock source
- control. The OHCI portion is supported by the ohci-s3c2410 driver, and
- the clock control selection is supported by the core clock code.
-
-
-USB Device
-----------
-
- No current support in the kernel
-
-
-IRQs
-----
-
- All the standard, and external interrupt sources are supported. The
- extra sub-sources are not yet supported.
-
-
-RTC
----
-
- The RTC hardware is similar to the S3C2410, and is supported by the
- s3c2410-rtc driver.
-
-
-Watchdog
---------
-
- The watchdog hardware is the same as the S3C2410, and is supported by
- the s3c2410_wdt driver.
-
-
-MMC/SD/SDIO
------------
-
- No current support for the MMC/SD/SDIO block.
-
-IIC
----
-
- The IIC hardware is the same as the S3C2410, and is supported by the
- i2c-s3c24xx driver.
-
-
-IIS
----
-
- No current support for the IIS interface.
-
-
-SPI
----
-
- No current support for the SPI interfaces.
-
-
-ATA
----
-
- No current support for the on-board ATA block.
-
-
-Document Author
----------------
-
-Ben Dooks, Copyright 2006 Simtec Electronics
diff --git a/Documentation/arm/samsung-s3c24xx/s3c2413.rst b/Documentation/arm/samsung-s3c24xx/s3c2413.rst
deleted file mode 100644
index 1f51e207fc46..000000000000
--- a/Documentation/arm/samsung-s3c24xx/s3c2413.rst
+++ /dev/null
@@ -1,22 +0,0 @@
-==========================
-S3C2413 ARM Linux Overview
-==========================
-
-Introduction
-------------
-
- The S3C2413 is an extended version of the S3C2412, with a camera
- interface and mobile DDR memory support. See the S3C2412 support
- documentation for more information.
-
-
-Camera Interface
-----------------
-
- This block is currently not supported.
-
-
-Document Author
----------------
-
-Ben Dooks, Copyright 2006 Simtec Electronics
diff --git a/Documentation/arm/samsung-s3c24xx/smdk2440.rst b/Documentation/arm/samsung-s3c24xx/smdk2440.rst
deleted file mode 100644
index 524fd0b4afaf..000000000000
--- a/Documentation/arm/samsung-s3c24xx/smdk2440.rst
+++ /dev/null
@@ -1,57 +0,0 @@
-=========================
-Samsung/Meritech SMDK2440
-=========================
-
-Introduction
-------------
-
- The SMDK2440 is a two part evaluation board for the Samsung S3C2440
- processor. It includes support for LCD, SmartMedia, Audio, SD and
- 10MBit Ethernet, and expansion headers for various signals, including
- the camera and unused GPIO.
-
-
-Configuration
--------------
-
- To set the default configuration, use `make smdk2440_defconfig` which
- will configure the common features of this board, or use
- `make s3c2410_config` to include support for all s3c2410/s3c2440 machines
-
-
-Support
--------
-
- Ben Dooks' SMDK2440 site at http://www.fluff.org/ben/smdk2440/ which
- includes linux based USB download tools.
-
- Some of the h1940 patches that can be found from the H1940 project
- site at http://www.handhelds.org/projects/h1940.html can also be
- applied to this board.
-
-
-Peripherals
------------
-
- There is no current support for any of the extra peripherals on the
- base-board itself.
-
-
-MTD
----
-
- The NAND flash should be supported by the in kernel MTD NAND support,
- NOR flash will be added later.
-
-
-Maintainers
------------
-
- This board is being maintained by Ben Dooks, for more info, see
- http://www.fluff.org/ben/smdk2440/
-
- Many thanks to Dimitry Andric of TomTom for the loan of the SMDK2440,
- and to Simtec Electronics for allowing me time to work on this.
-
-
-(c) 2004 Ben Dooks
diff --git a/Documentation/arm/samsung-s3c24xx/suspend.rst b/Documentation/arm/samsung-s3c24xx/suspend.rst
deleted file mode 100644
index b4f3ae9fe76e..000000000000
--- a/Documentation/arm/samsung-s3c24xx/suspend.rst
+++ /dev/null
@@ -1,137 +0,0 @@
-=======================
-S3C24XX Suspend Support
-=======================
-
-
-Introduction
-------------
-
- The S3C24XX supports a low-power suspend mode, where the SDRAM is kept
- in Self-Refresh mode, and all but the essential peripheral blocks are
- powered down. For more information on how this works, please look
- at the relevant CPU datasheet from Samsung.
-
-
-Requirements
-------------
-
- 1) A bootloader that can support the necessary resume operation
-
- 2) Support for at least 1 source for resume
-
- 3) CONFIG_PM enabled in the kernel
-
- 4) Any peripherals that are going to be powered down at the same
- time require suspend/resume support.
-
-
-Resuming
---------
-
- The S3C2410 user manual defines the process of sending the CPU to
- sleep and how it resumes. The default behaviour of the Linux code
- is to set the GSTATUS3 register to the physical address of the
- code to resume Linux operation.
-
- GSTATUS4 is currently left alone by the sleep code, and is free to
- use for any other purposes (for example, the EB2410ITX uses this to
- save memory configuration in).
-
-
-Machine Support
----------------
-
- The machine specific functions must call the s3c_pm_init() function
- to say that its bootloader is capable of resuming. This can be as
- simple as adding the following to the machine's definition:
-
- INITMACHINE(s3c_pm_init)
-
- A board can do its own setup before calling s3c_pm_init, if it
- needs to setup anything else for power management support.
-
- There is currently no support for over-riding the default method of
- saving the resume address, if your board requires it, then contact
- the maintainer and discuss what is required.
-
- Note, the original method of adding a late_initcall() is wrong,
- and will end up initialising all compiled machines' pm init!
-
- The following is an example of code used for testing wakeup from
- a falling edge on IRQ_EINT0::
-
-
- static irqreturn_t button_irq(int irq, void *pw)
- {
- return IRQ_HANDLED;
- }
-
- static void __init machine_init(void)
- {
- ...
-
- request_irq(IRQ_EINT0, button_irq, IRQF_TRIGGER_FALLING,
- "button-irq-eint0", NULL);
-
- enable_irq_wake(IRQ_EINT0);
-
- s3c_pm_init();
- }
-
-
-Debugging
----------
-
- There are several important things to remember when using PM suspend:
-
- 1) The uart drivers will disable the clocks to the UART blocks when
- suspending, which means that use of printascii() or similar direct
- access to the UARTs will cause the debug to stop.
-
- 2) While the pm code itself will attempt to re-enable the UART clocks,
- care should be taken that any external clock sources that the UARTs
- rely on are still enabled at that point.
-
- 3) If any debugging is placed in the resume path, then it must have the
- relevant clocks and peripherals setup before use (ie, bootloader).
-
- For example, if you transmit a character from the UART, the baud
- rate and uart controls must be setup beforehand.
-
-
-Configuration
--------------
-
- The S3C2410 specific configuration in `System Type` defines various
- aspects of how the S3C2410 suspend and resume support is configured
-
- `S3C2410 PM Suspend debug`
-
- This option prints messages to the serial console before and after
- the actual suspend, giving detailed information on what is
- happening
-
-
- `S3C2410 PM Suspend Memory CRC`
-
- Allows the entire memory to be checksummed before and after the
- suspend to see if there has been any corruption of the contents.
-
- Note, the time to calculate the CRC is dependent on the CPU speed
- and the size of memory. For an 64Mbyte RAM area on an 200MHz
- S3C2410, this can take approximately 4 seconds to complete.
-
- This support requires the CRC32 function to be enabled.
-
-
- `S3C2410 PM Suspend CRC Chunksize (KiB)`
-
- Defines the size of memory each CRC chunk covers. A smaller value
- will mean that the CRC data block will take more memory, but will
- identify any faults with better precision
-
-
-Document Author
----------------
-
-Ben Dooks, Copyright 2004 Simtec Electronics
diff --git a/Documentation/arm/samsung-s3c24xx/usb-host.rst b/Documentation/arm/samsung-s3c24xx/usb-host.rst
deleted file mode 100644
index 7aaffac89e04..000000000000
--- a/Documentation/arm/samsung-s3c24xx/usb-host.rst
+++ /dev/null
@@ -1,91 +0,0 @@
-========================
-S3C24XX USB Host support
-========================
-
-
-
-Introduction
-------------
-
- This document details the S3C2410/S3C2440 in-built OHCI USB host support.
-
-Configuration
--------------
-
- Enable at least the following kernel options:
-
- menuconfig::
-
- Device Drivers --->
- USB support --->
- <*> Support for Host-side USB
- <*> OHCI HCD support
-
-
- .config:
-
- - CONFIG_USB
- - CONFIG_USB_OHCI_HCD
-
-
- Once these options are configured, the standard set of USB device
- drivers can be configured and used.
-
-
-Board Support
--------------
-
- The driver attaches to a platform device, which will need to be
- added by the board specific support file in arch/arm/mach-s3c,
- such as mach-bast.c or mach-smdk2410.c
-
- The platform device's platform_data field is only needed if the
- board implements extra power control or over-current monitoring.
-
- The OHCI driver does not ensure the state of the S3C2410's MISCCTRL
- register, so if both ports are to be used for the host, then it is
- the board support file's responsibility to ensure that the second
- port is configured to be connected to the OHCI core.
-
-
-Platform Data
--------------
-
- See include/linux/platform_data/usb-ohci-s3c2410.h for the
- descriptions of the platform device data. An implementation
- can be found in arch/arm/mach-s3c/simtec-usb.c .
-
- The `struct s3c2410_hcd_info` contains a pair of functions
- that get called to enable over-current detection, and to
- control the port power status.
-
- The ports are numbered 0 and 1.
-
- power_control:
- Called to enable or disable the power on the port.
-
- enable_oc:
- Called to enable or disable the over-current monitoring.
- This should claim or release the resources being used to
- check the power condition on the port, such as an IRQ.
-
- report_oc:
- The OHCI driver fills this field in for the over-current code
- to call when there is a change to the over-current state on
- an port. The ports argument is a bitmask of 1 bit per port,
- with bit X being 1 for an over-current on port X.
-
- The function s3c2410_usb_report_oc() has been provided to
- ensure this is called correctly.
-
- port[x]:
- This struct describes each port, 0 or 1. The platform driver
- should set the flags field of each port to S3C_HCDFLG_USED if
- the port is enabled.
-
-
-
-Document Author
----------------
-
-Ben Dooks, Copyright 2005 Simtec Electronics
diff --git a/Documentation/arm/samsung/gpio.rst b/Documentation/arm/samsung/gpio.rst
index f6e27b07c993..27fae0d50361 100644
--- a/Documentation/arm/samsung/gpio.rst
+++ b/Documentation/arm/samsung/gpio.rst
@@ -9,14 +9,6 @@ This outlines the Samsung GPIO implementation and the architecture
specific calls provided alongside the drivers/gpio core.
-S3C24XX (Legacy)
-----------------
-
-See Documentation/arm/samsung-s3c24xx/gpio.rst for more information
-about these devices. Their implementation has been brought into line
-with the core samsung implementation described in this document.
-
-
GPIOLIB integration
-------------------
diff --git a/Documentation/arm/samsung/overview.rst b/Documentation/arm/samsung/overview.rst
index e74307897416..8b15a190169b 100644
--- a/Documentation/arm/samsung/overview.rst
+++ b/Documentation/arm/samsung/overview.rst
@@ -12,21 +12,10 @@ Introduction
The currently supported SoCs are:
- - S3C24XX: See Documentation/arm/samsung-s3c24xx/overview.rst for full list
- S3C64XX: S3C6400 and S3C6410
- S5PC110 / S5PV210
-S3C24XX Systems
----------------
-
- There is still documentation in Documentation/arm/Samsung-S3C24XX/ which
- deals with the architecture and drivers specific to these devices.
-
- See Documentation/arm/samsung-s3c24xx/overview.rst for more information
- on the implementation details and specific support.
-
-
Configuration
-------------
@@ -51,8 +40,6 @@ Layout
specific information. It contains the base clock, GPIO and device definitions
to get the system running.
- plat-s3c24xx is for s3c24xx specific builds, see the S3C24XX docs.
-
plat-s5p is for s5p specific builds, and contains common support for the
S5P specific systems. Not all S5Ps use all the features in this directory
due to differences in the hardware.
diff --git a/Documentation/arm/sti/overview.rst b/Documentation/arm/sti/overview.rst
index 70743617a74f..ae16aced800f 100644
--- a/Documentation/arm/sti/overview.rst
+++ b/Documentation/arm/sti/overview.rst
@@ -7,22 +7,18 @@ Introduction
The ST Microelectronics Multimedia and Application Processors range of
CortexA9 System-on-Chip are supported by the 'STi' platform of
- ARM Linux. Currently STiH415, STiH416 SOCs are supported with both
- B2000 and B2020 Reference boards.
+ ARM Linux. Currently STiH407, STiH410 and STiH418 are supported.
configuration
-------------
- A generic configuration is provided for both STiH415/416, and can be used as the
- default by::
-
- make stih41x_defconfig
+ The configuration for the STi platform is supported via the multi_v7_defconfig.
Layout
------
- All the files for multiple machine families (STiH415, STiH416, and STiG125)
+ All the files for multiple machine families (STiH407, STiH410, and STiH418)
are located in the platform code contained in arch/arm/mach-sti
There is a generic board board-dt.c in the mach folder which support
diff --git a/Documentation/arm/sti/stih415-overview.rst b/Documentation/arm/sti/stih415-overview.rst
deleted file mode 100644
index b67452d610c4..000000000000
--- a/Documentation/arm/sti/stih415-overview.rst
+++ /dev/null
@@ -1,14 +0,0 @@
-================
-STiH415 Overview
-================
-
-Introduction
-------------
-
- The STiH415 is the next generation of HD, AVC set-top box processors
- for satellite, cable, terrestrial and IP-STB markets.
-
- Features:
-
- - ARM Cortex-A9 1.0 GHz, dual-core CPU
- - SATA2x2,USB 2.0x3, PCIe, Gbit Ethernet MACx2
diff --git a/Documentation/arm/sti/stih416-overview.rst b/Documentation/arm/sti/stih416-overview.rst
deleted file mode 100644
index 93f17d74d8db..000000000000
--- a/Documentation/arm/sti/stih416-overview.rst
+++ /dev/null
@@ -1,13 +0,0 @@
-================
-STiH416 Overview
-================
-
-Introduction
-------------
-
- The STiH416 is the next generation of HD, AVC set-top box processors
- for satellite, cable, terrestrial and IP-STB markets.
-
- Features
- - ARM Cortex-A9 1.2 GHz dual core CPU
- - SATA2x2,USB 2.0x3, PCIe, Gbit Ethernet MACx2
diff --git a/Documentation/arm/stm32/stm32-dma-mdma-chaining.rst b/Documentation/arm/stm32/stm32-dma-mdma-chaining.rst
new file mode 100644
index 000000000000..2945e0e33104
--- /dev/null
+++ b/Documentation/arm/stm32/stm32-dma-mdma-chaining.rst
@@ -0,0 +1,415 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+STM32 DMA-MDMA chaining
+=======================
+
+
+Introduction
+------------
+
+ This document describes the STM32 DMA-MDMA chaining feature. Before going
+ further, let's introduce the peripherals involved.
+
+ To offload data transfers from the CPU, STM32 microprocessors (MPUs) embed
+ direct memory access controllers (DMA).
+
+ STM32MP1 SoCs embed both STM32 DMA and STM32 MDMA controllers. STM32 DMA
+ request routing capabilities are enhanced by a DMA request multiplexer
+ (STM32 DMAMUX).
+
+ **STM32 DMAMUX**
+
+ STM32 DMAMUX routes any DMA request from a given peripheral to any channel of
+ the STM32 DMA controllers (STM32MP1 has two STM32 DMA controllers).
+
+ **STM32 DMA**
+
+ STM32 DMA is mainly used to implement central data buffer storage (usually in
+ the system SRAM) for different peripherals. It can access external RAM, but
+ it cannot generate the convenient burst transfers that ensure the best use of
+ the AXI bus.
+
+ **STM32 MDMA**
+
+ STM32 MDMA (Master DMA) is mainly used to manage direct data transfers between
+ RAM data buffers without CPU intervention. It can also be used in a
+ hierarchical structure that uses STM32 DMA as first level data buffer
+ interfaces for AHB peripherals, while the STM32 MDMA acts as a second level
+ DMA with better performance. As an AXI/AHB master, STM32 MDMA can take control
+ of the AXI/AHB bus.
+
+
+Principles
+----------
+
+ The STM32 DMA-MDMA chaining feature relies on the strengths of STM32 DMA and
+ STM32 MDMA controllers.
+
+ STM32 DMA has a circular Double Buffer Mode (DBM). At each end of transaction
+ (when DMA data counter - DMA_SxNDTR - reaches 0), the memory pointers
+ (configured with DMA_SxM0AR and DMA_SxM1AR) are swapped and the DMA data
+ counter is automatically reloaded. This allows the SW or the STM32 MDMA to
+ process one memory area while the second memory area is being filled/used by
+ the STM32 DMA transfer.
+
+ With the STM32 MDMA linked-list mode, a single request initiates the transfer
+ of the data array (a collection of nodes) until the linked-list pointer for
+ the channel is null. The transfer complete of the last node marks the end of
+ transfer, unless the first and last nodes are linked to each other, in which
+ case the linked-list loops to create a circular MDMA transfer.
+
+ STM32 MDMA has direct connections with STM32 DMA. This enables autonomous
+ communication and synchronization between peripherals, thus saving CPU
+ resources and bus congestion. The Transfer Complete signal of an STM32 DMA
+ channel can trigger an STM32 MDMA transfer. STM32 MDMA can clear the request
+ generated
+ by the STM32 DMA by writing to its Interrupt Clear register (whose address is
+ stored in MDMA_CxMAR, and bit mask in MDMA_CxMDR).
+
+ .. table:: STM32 MDMA interconnect table with STM32 DMA
+
+ +--------------+----------------+-----------+------------+
+ | STM32 DMAMUX | STM32 DMA | STM32 DMA | STM32 MDMA |
+ | channels | channels | Transfer | request |
+ | | | complete | |
+ | | | signal | |
+ +==============+================+===========+============+
+ | Channel *0* | DMA1 channel 0 | dma1_tcf0 | *0x00* |
+ +--------------+----------------+-----------+------------+
+ | Channel *1* | DMA1 channel 1 | dma1_tcf1 | *0x01* |
+ +--------------+----------------+-----------+------------+
+ | Channel *2* | DMA1 channel 2 | dma1_tcf2 | *0x02* |
+ +--------------+----------------+-----------+------------+
+ | Channel *3* | DMA1 channel 3 | dma1_tcf3 | *0x03* |
+ +--------------+----------------+-----------+------------+
+ | Channel *4* | DMA1 channel 4 | dma1_tcf4 | *0x04* |
+ +--------------+----------------+-----------+------------+
+ | Channel *5* | DMA1 channel 5 | dma1_tcf5 | *0x05* |
+ +--------------+----------------+-----------+------------+
+ | Channel *6* | DMA1 channel 6 | dma1_tcf6 | *0x06* |
+ +--------------+----------------+-----------+------------+
+ | Channel *7* | DMA1 channel 7 | dma1_tcf7 | *0x07* |
+ +--------------+----------------+-----------+------------+
+ | Channel *8* | DMA2 channel 0 | dma2_tcf0 | *0x08* |
+ +--------------+----------------+-----------+------------+
+ | Channel *9* | DMA2 channel 1 | dma2_tcf1 | *0x09* |
+ +--------------+----------------+-----------+------------+
+ | Channel *10* | DMA2 channel 2 | dma2_tcf2 | *0x0A* |
+ +--------------+----------------+-----------+------------+
+ | Channel *11* | DMA2 channel 3 | dma2_tcf3 | *0x0B* |
+ +--------------+----------------+-----------+------------+
+ | Channel *12* | DMA2 channel 4 | dma2_tcf4 | *0x0C* |
+ +--------------+----------------+-----------+------------+
+ | Channel *13* | DMA2 channel 5 | dma2_tcf5 | *0x0D* |
+ +--------------+----------------+-----------+------------+
+ | Channel *14* | DMA2 channel 6 | dma2_tcf6 | *0x0E* |
+ +--------------+----------------+-----------+------------+
+ | Channel *15* | DMA2 channel 7 | dma2_tcf7 | *0x0F* |
+ +--------------+----------------+-----------+------------+
+
+ The STM32 DMA-MDMA chaining feature then uses an SRAM buffer. STM32MP1 SoCs
+ embed three fast-access static internal RAMs of various sizes, used for data
+ storage. Because STM32 DMA is a legacy design inherited from
+ microcontrollers, its performance is poor with DDR but optimal with SRAM.
+ Hence the SRAM buffer used between STM32 DMA and STM32 MDMA. This buffer is
+ split into two equal periods: STM32 DMA uses one period while STM32 MDMA
+ simultaneously uses the other.
+ ::
+
+ dma[1:2]-tcf[0:7]
+ .----------------.
+ ____________ ' _________ V____________
+ | STM32 DMA | / __|>_ \ | STM32 MDMA |
+ |------------| | / \ | |------------|
+ | DMA_SxM0AR |<=>| | SRAM | |<=>| []-[]...[] |
+ | DMA_SxM1AR | | \_____/ | | |
+ |____________| \___<|____/ |____________|
+
+ STM32 DMA-MDMA chaining uses (struct dma_slave_config).peripheral_config to
+ exchange the parameters needed to configure MDMA. These parameters are
+ gathered into a u32 array with three values:
+
+ * the STM32 MDMA request (which is actually the DMAMUX channel ID),
+ * the address of the STM32 DMA register to clear the Transfer Complete
+ interrupt flag,
+ * the mask of the Transfer Complete interrupt flag of the STM32 DMA channel.
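+
+ Conceptually, the array filled in by the STM32 DMA driver and consumed by the
+ STM32 MDMA driver can be pictured as follows (the variable names here are
+ illustrative only)::
+
+    u32 peripheral_config[3];
+
+    peripheral_config[0] = dmamux_chan_id; /* STM32 MDMA request line */
+    peripheral_config[1] = dma_ifcr_addr;  /* DMA interrupt flag clear register */
+    peripheral_config[2] = dma_tcf_mask;   /* DMA Transfer Complete flag mask */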
+
+Device Tree updates for STM32 DMA-MDMA chaining support
+-------------------------------------------------------
+
+ **1. Allocate an SRAM buffer**
+
+ The SRAM device tree node is defined in the SoC device tree. You can refer to
+ it in your board device tree to define your SRAM pool.
+ ::
+
+ &sram {
+ my_foo_device_dma_pool: dma-sram@0 {
+ reg = <0x0 0x1000>;
+ };
+ };
+
+ Be careful of the start index, in case there are other SRAM consumers.
+ Define your pool size strategically: for optimal chaining, STM32 DMA and
+ STM32 MDMA should be able to work simultaneously, each on its own period of
+ the SRAM buffer.
+ If the SRAM period is greater than the expected DMA transfer, then STM32 DMA
+ and STM32 MDMA will work sequentially instead of simultaneously. This is not
+ a functional issue, but it is not optimal.
+
+ Don't forget to refer to your SRAM pool in your device node. You need to
+ define a new property.
+ ::
+
+ &my_foo_device {
+ ...
+ my_dma_pool = &my_foo_device_dma_pool;
+ };
+
+ Then get this SRAM pool in your foo driver and allocate your SRAM buffer.
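+
+ A minimal sketch of this step, assuming the my_dma_pool property and the
+ 0x1000-byte pool from the examples above (error handling reduced to the
+ essentials)::
+
+    #include <linux/genalloc.h>
+
+    struct gen_pool *sram_pool;
+    void *sram_buf;          /* CPU address of the SRAM buffer */
+    dma_addr_t sram_dma_buf; /* DMA address of the SRAM buffer */
+
+    /* Look up the pool referenced by the "my_dma_pool" property */
+    sram_pool = of_gen_pool_get(dev->of_node, "my_dma_pool", 0);
+    if (!sram_pool)
+            return -EPROBE_DEFER;
+
+    /* Allocate the whole pool; each period is half of this buffer */
+    sram_buf = gen_pool_dma_alloc(sram_pool, 0x1000, &sram_dma_buf);
+    if (!sram_buf)
+            return -ENOMEM;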
+
+ **2. Allocate an STM32 DMA channel and an STM32 MDMA channel**
+
+ You need to define an extra channel in your device tree node, in addition to
+ the one you should already have for "classic" DMA operation.
+
+ This new channel must be taken from the STM32 MDMA channels, so the phandle
+ of the DMA controller to use is the MDMA controller's.
+ ::
+
+ &my_foo_device {
+ [...]
+ my_dma_pool = &my_foo_device_dma_pool;
+ dmas = <&dmamux1 ...>, // STM32 DMA channel
+ <&mdma1 0 0x3 0x1200000a 0 0>; // + STM32 MDMA channel
+ };
+
+ Concerning STM32 MDMA bindings:
+
+ 1. The request line number: whatever value is given here, it will be
+ overwritten by the MDMA driver with the STM32 DMAMUX channel ID passed
+ through (struct dma_slave_config).peripheral_config
+
+ 2. The priority level: choose Very High (0x3) so that your channel takes
+ priority over the others during request arbitration
+
+ 3. A 32-bit mask specifying the DMA channel configuration: source and
+ destination address increment, block transfer with 128 bytes per single
+ transfer
+
+ 4. The 32-bit value specifying the register to be used to acknowledge the
+ request: it will be overwritten by the MDMA driver, with the DMA channel
+ interrupt flag clear register address passed through
+ (struct dma_slave_config).peripheral_config
+
+ 5. The 32-bit mask specifying the value to be written to acknowledge the
+ request: it will be overwritten by the MDMA driver, with the DMA channel
+ Transfer Complete flag mask passed through
+ (struct dma_slave_config).peripheral_config
+
+Driver updates for STM32 DMA-MDMA chaining support in foo driver
+----------------------------------------------------------------
+
+ **0. (optional) Refactor the original sg_table if using dmaengine_prep_slave_sg()**
+
+ When using dmaengine_prep_slave_sg(), the original sg_table can't be used as
+ is. Two new sg_tables must be created from the original one: one for the
+ STM32 DMA transfer (where the memory address now targets the SRAM buffer
+ instead of the DDR buffer) and one for the STM32 MDMA transfer (where the
+ memory address targets the DDR buffer).
+
+ The new sg_list items must fit within the SRAM period length. Here is an
+ example for DMA_DEV_TO_MEM:
+ ::
+
+ /*
+ * Assuming sgl and nents, respectively the initial scatterlist and its
+ * length.
+ * Assuming sram_dma_buf and sram_period, respectively the DMA address of
+ * the memory allocated from the pool for DMA usage, and the length of the
+ * period, which is half of the SRAM buffer size.
+ */
+ struct sg_table new_dma_sgt, new_mdma_sgt;
+ struct scatterlist *s, *_sgl;
+ dma_addr_t ddr_dma_buf;
+ u32 new_nents = 0, len;
+ int i, ret;
+
+ /* Count the number of entries needed */
+ for_each_sg(sgl, s, nents, i)
+ if (sg_dma_len(s) > sram_period)
+ new_nents += DIV_ROUND_UP(sg_dma_len(s), sram_period);
+ else
+ new_nents++;
+
+ /* Create sg table for STM32 DMA channel */
+ ret = sg_alloc_table(&new_dma_sgt, new_nents, GFP_ATOMIC);
+ if (ret)
+ dev_err(dev, "DMA sg table alloc failed\n");
+
+    _sgl = sgl;
+    len = sg_dma_len(sgl);
+    for_each_sg(new_dma_sgt.sgl, s, new_dma_sgt.nents, i) {
+            sg_dma_len(s) = min_t(u32, len, sram_period);
+            /* Even items target the first half of the SRAM buffer */
+            s->dma_address = sram_dma_buf;
+            /* Odd items target the second half of the SRAM buffer */
+            if (i & 1)
+                    s->dma_address += sram_period;
+            len -= sg_dma_len(s);
+            /* Move to the next original entry once it is fully split */
+            if (!len && sg_next(_sgl)) {
+                    _sgl = sg_next(_sgl);
+                    len = sg_dma_len(_sgl);
+            }
+    }
+
+ /* Create sg table for STM32 MDMA channel */
+ ret = sg_alloc_table(&new_mdma_sgt, new_nents, GFP_ATOMIC);
+ if (ret)
+ dev_err(dev, "MDMA sg_table alloc failed\n");
+
+ _sgl = sgl;
+ len = sg_dma_len(sgl);
+ ddr_dma_buf = sg_dma_address(sgl);
+    for_each_sg(new_mdma_sgt.sgl, s, new_mdma_sgt.nents, i) {
+ size_t bytes = min_t(size_t, len, sram_period);
+
+ sg_dma_len(s) = bytes;
+ sg_dma_address(s) = ddr_dma_buf;
+ len -= bytes;
+
+ if (!len && sg_next(_sgl)) {
+ _sgl = sg_next(_sgl);
+ len = sg_dma_len(_sgl);
+ ddr_dma_buf = sg_dma_address(_sgl);
+ } else {
+ ddr_dma_buf += bytes;
+ }
+ }
+
+ Don't forget to release these new sg_tables after getting the descriptors
+ with dmaengine_prep_slave_sg().
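+
+ A sketch of how this could look, anticipating steps 2 and 3 below for the
+ scatter-gather case (dma_desc and mdma_desc are assumed to be
+ struct dma_async_tx_descriptor pointers):
+ ::
+
+    dma_desc = dmaengine_prep_slave_sg(dma_chan, new_dma_sgt.sgl,
+                                       new_dma_sgt.nents, DMA_DEV_TO_MEM,
+                                       DMA_PREP_INTERRUPT);
+    mdma_desc = dmaengine_prep_slave_sg(mdma_chan, new_mdma_sgt.sgl,
+                                        new_mdma_sgt.nents, DMA_DEV_TO_MEM,
+                                        DMA_PREP_INTERRUPT);
+    sg_free_table(&new_dma_sgt);
+    sg_free_table(&new_mdma_sgt);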
+
+ **1. Set controller specific parameters**
+
+ First, use dmaengine_slave_config() with a struct dma_slave_config to
+ configure the STM32 DMA channel. You just have to take care of the DMA
+ addresses: the memory address (depending on the transfer direction) must
+ point to your SRAM buffer, and (struct dma_slave_config).peripheral_size
+ must be set to a nonzero value.
+
+ The STM32 DMA driver checks (struct dma_slave_config).peripheral_size to
+ determine whether chaining is being used. If it is, the STM32 DMA driver
+ fills (struct dma_slave_config).peripheral_config with an array of three
+ u32: the first containing the STM32 DMAMUX channel ID, the second the
+ channel interrupt flag clear register address, and the third the channel
+ Transfer Complete flag mask.
+
+ Then, use dmaengine_slave_config() with another struct dma_slave_config to
+ configure the STM32 MDMA channel. Take care of the DMA addresses: the device
+ address (depending on the transfer direction) must point to your SRAM
+ buffer, and the memory address must point to the buffer originally used for
+ "classic" DMA operation. Reuse the previous
+ (struct dma_slave_config).peripheral_size and .peripheral_config, which have
+ been updated by the STM32 DMA driver, to set the corresponding fields of the
+ struct dma_slave_config used to configure the STM32 MDMA channel.
+ ::
+
+ struct dma_slave_config dma_conf;
+ struct dma_slave_config mdma_conf;
+
+ memset(&dma_conf, 0, sizeof(dma_conf));
+ [...]
+    dma_conf.direction = DMA_DEV_TO_MEM;
+    dma_conf.dst_addr = sram_dma_buf;          // SRAM buffer
+    dma_conf.peripheral_size = 1;              // peripheral_size != 0 => chaining
+
+    dmaengine_slave_config(dma_chan, &dma_conf);
+
+    memset(&mdma_conf, 0, sizeof(mdma_conf));
+    mdma_conf.direction = DMA_DEV_TO_MEM;
+    mdma_conf.src_addr = sram_dma_buf;         // SRAM buffer
+    mdma_conf.dst_addr = rx_dma_buf;           // original memory buffer
+    mdma_conf.peripheral_size = dma_conf.peripheral_size;     // <- dma_conf
+    mdma_conf.peripheral_config = dma_conf.peripheral_config; // <- dma_conf
+
+    dmaengine_slave_config(mdma_chan, &mdma_conf);
+
+ **2. Get a descriptor for STM32 DMA channel transaction**
+
+ In the same way as you get your descriptor for your "classic" DMA operation,
+ you just have to replace the original sg_list (in case of
+ dmaengine_prep_slave_sg()) with the new sg_list using the SRAM buffer, or
+ replace the original buffer address, length and period (in case of
+ dmaengine_prep_dma_cyclic()) with the new SRAM buffer.
+
+ **3. Get a descriptor for STM32 MDMA channel transaction**
+
+ If you previously got the descriptor (for STM32 DMA) with
+
+ * dmaengine_prep_slave_sg(), then use dmaengine_prep_slave_sg() for
+ STM32 MDMA;
+ * dmaengine_prep_dma_cyclic(), then use dmaengine_prep_dma_cyclic() for
+ STM32 MDMA.
+
+ Use the new sg_list using the SRAM buffer (in case of
+ dmaengine_prep_slave_sg()) or, depending on the transfer direction, either
+ the original DDR buffer (in case of DMA_DEV_TO_MEM) or the SRAM buffer (in
+ case of DMA_MEM_TO_DEV), the source address having been previously set with
+ dmaengine_slave_config().
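+
+ A sketch of the cyclic DMA_DEV_TO_MEM case (the scatter-gather case is shown
+ in step 0 above; rx_dma_buf and rx_buf_len, the original DDR buffer and its
+ length, are assumed names):
+ ::
+
+    struct dma_async_tx_descriptor *dma_desc, *mdma_desc;
+
+    /* STM32 DMA ping-pongs between the two halves of the SRAM buffer */
+    dma_desc = dmaengine_prep_dma_cyclic(dma_chan, sram_dma_buf,
+                                         2 * sram_period, sram_period,
+                                         DMA_DEV_TO_MEM, DMA_PREP_INTERRUPT);
+
+    /* STM32 MDMA cycles over the original DDR buffer */
+    mdma_desc = dmaengine_prep_dma_cyclic(mdma_chan, rx_dma_buf,
+                                          rx_buf_len, sram_period,
+                                          DMA_DEV_TO_MEM, DMA_PREP_INTERRUPT);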
+
+ **4. Submit both transactions**
+
+ Before submitting your transactions, you may need to define on which
+ descriptor you want a callback to be called at the end of the transfer
+ (dmaengine_prep_slave_sg()) or at the end of each period
+ (dmaengine_prep_dma_cyclic()).
+ Depending on the direction, set the callback on the descriptor that finishes
+ the overall transfer:
+
+ * DMA_DEV_TO_MEM: set the callback on the "MDMA" descriptor
+ * DMA_MEM_TO_DEV: set the callback on the "DMA" descriptor
+
+ Then, submit the descriptors, in any order, with dmaengine_submit().
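+
+ For the DMA_DEV_TO_MEM example used so far (my_dma_callback and my_dev_data
+ are hypothetical names):
+ ::
+
+    dma_cookie_t cookie;
+
+    /* DMA_DEV_TO_MEM: the MDMA descriptor finishes the overall transfer */
+    mdma_desc->callback = my_dma_callback;
+    mdma_desc->callback_param = my_dev_data;
+
+    cookie = dmaengine_submit(dma_desc);
+    cookie = dmaengine_submit(mdma_desc);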
+
+ **5. Issue pending requests (and wait for callback notification)**
+
+ As the STM32 MDMA channel transfer is triggered by the STM32 DMA one, you
+ must issue the pending requests on the STM32 MDMA channel before the
+ STM32 DMA channel.
+
+ If set, your callback will be called to notify you of the end of the overall
+ transfer or of the period completion.
+
+ Don't forget to terminate both channels. The STM32 DMA channel is configured
+ in cyclic Double-Buffer mode, so it won't be disabled by hardware: you need
+ to terminate it. The STM32 MDMA channel will be stopped by hardware in case
+ of sg transfer, but not in case of cyclic transfer. You can terminate it
+ regardless of the kind of transfer.
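+
+ In code, for the running example (a sketch):
+ ::
+
+    /* MDMA channel first: its transfer is triggered by the STM32 DMA */
+    dma_async_issue_pending(mdma_chan);
+    dma_async_issue_pending(dma_chan);
+
+    /* ... transfer runs, callback fires ... */
+
+    /* On stop, both channels have to be terminated */
+    dmaengine_terminate_sync(dma_chan);
+    dmaengine_terminate_sync(mdma_chan);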
+
+ **STM32 DMA-MDMA chaining DMA_MEM_TO_DEV special case**
+
+ STM32 DMA-MDMA chaining in DMA_MEM_TO_DEV is a special case. Indeed, the
+ STM32 MDMA feeds the SRAM buffer with the DDR data, and the STM32 DMA reads
+ data from the SRAM buffer. So some data (the first period) has to be copied
+ into the SRAM buffer before the STM32 DMA starts to read.
+
+ A trick could be to pause the STM32 DMA channel (which raises a Transfer
+ Complete signal, triggering the STM32 MDMA channel), but the first data read
+ by the STM32 DMA could be "wrong". The proper way is to prepare the first
+ SRAM period with dmaengine_prep_dma_memcpy(). This first period should then
+ be "removed" from the sg or the cyclic transfer.
+
+ Due to this complexity, it is better to use STM32 DMA-MDMA chaining for
+ DMA_DEV_TO_MEM and to keep the "classic" DMA usage for DMA_MEM_TO_DEV,
+ unless you're not afraid.
+
+Resources
+---------
+
+ Application note, datasheet and reference manual are available on ST website
+ (STM32MP1_).
+
+ Dedicated focus on three application notes (AN5224_, AN4031_ & AN5001_)
+ dealing with STM32 DMAMUX, STM32 DMA and STM32 MDMA.
+
+.. _STM32MP1: https://www.st.com/en/microcontrollers-microprocessors/stm32mp1-series.html
+.. _AN5224: https://www.st.com/resource/en/application_note/an5224-stm32-dmamux-the-dma-request-router-stmicroelectronics.pdf
+.. _AN4031: https://www.st.com/resource/en/application_note/dm00046011-using-the-stm32f2-stm32f4-and-stm32f7-series-dma-controller-stmicroelectronics.pdf
+.. _AN5001: https://www.st.com/resource/en/application_note/an5001-stm32cube-expansion-package-for-stm32h7-series-mdma-stmicroelectronics.pdf
+
+:Authors:
+
+- Amelie Delaunay <amelie.delaunay@foss.st.com> \ No newline at end of file
diff --git a/Documentation/arm/stm32/stm32mp13-overview.rst b/Documentation/arm/stm32/stm32mp13-overview.rst
new file mode 100644
index 000000000000..3bb9492dad49
--- /dev/null
+++ b/Documentation/arm/stm32/stm32mp13-overview.rst
@@ -0,0 +1,37 @@
+===================
+STM32MP13 Overview
+===================
+
+Introduction
+------------
+
+The STM32MP131/STM32MP133/STM32MP135 are Cortex-A MPUs aimed at various applications.
+They feature:
+
+- One Cortex-A7 application core
+- Standard memories interface support
+- Standard connectivity, widely inherited from the STM32 MCU family
+- Comprehensive security support
+
+More details:
+
+- Cortex-A7 core running up to @900MHz
+- FMC controller to connect SDRAM, NOR and NAND memories
+- QSPI
+- SD/MMC/SDIO support
+- 2*Ethernet controller
+- CAN
+- ADC/DAC
+- USB EHCI/OHCI controllers
+- USB OTG
+- I2C, SPI, CAN busses support
+- Several general purpose timers
+- Serial Audio interface
+- LCD controller
+- DCMIPP
+- SPDIFRX
+- DFSDM
+
+:Authors:
+
+- Alexandre Torgue <alexandre.torgue@foss.st.com>
diff --git a/Documentation/arm/stm32/stm32mp151-overview.rst b/Documentation/arm/stm32/stm32mp151-overview.rst
new file mode 100644
index 000000000000..f42a2ac309c0
--- /dev/null
+++ b/Documentation/arm/stm32/stm32mp151-overview.rst
@@ -0,0 +1,36 @@
+===================
+STM32MP151 Overview
+===================
+
+Introduction
+------------
+
+The STM32MP151 is a Cortex-A MPU aimed at various applications.
+It features:
+
+- Single Cortex-A7 application core
+- Standard memories interface support
+- Standard connectivity, widely inherited from the STM32 MCU family
+- Comprehensive security support
+
+More details:
+
+- Cortex-A7 core running up to @800MHz
+- FMC controller to connect SDRAM, NOR and NAND memories
+- QSPI
+- SD/MMC/SDIO support
+- Ethernet controller
+- ADC/DAC
+- USB EHCI/OHCI controllers
+- USB OTG
+- I2C, SPI busses support
+- Several general purpose timers
+- Serial Audio interface
+- LCD-TFT controller
+- DCMIPP
+- SPDIFRX
+- DFSDM
+
+:Authors:
+
+- Roan van Dijk <roan@protonic.nl>
diff --git a/Documentation/arm/tcm.rst b/Documentation/arm/tcm.rst
index b256f9783883..1dc6c39220f9 100644
--- a/Documentation/arm/tcm.rst
+++ b/Documentation/arm/tcm.rst
@@ -34,7 +34,7 @@ CPU so it is usually wise not to overlap any physical RAM with
the TCM.
The TCM memory can then be remapped to another address again using
-the MMU, but notice that the TCM if often used in situations where
+the MMU, but notice that the TCM is often used in situations where
the MMU is turned off. To avoid confusion the current Linux
implementation will map the TCM 1 to 1 from physical to virtual
memory in the location specified by the kernel. Currently Linux
diff --git a/Documentation/arm/uefi.rst b/Documentation/arm/uefi.rst
index 9b0b5e458a1e..baebe688a006 100644
--- a/Documentation/arm/uefi.rst
+++ b/Documentation/arm/uefi.rst
@@ -65,10 +65,6 @@ linux,uefi-mmap-desc-size 32-bit Size in bytes of each entry in the UEFI
linux,uefi-mmap-desc-ver 32-bit Version of the mmap descriptor format.
-linux,initrd-start 64-bit Physical start address of an initrd
-
-linux,initrd-end 64-bit Physical end address of an initrd
-
kaslr-seed 64-bit Entropy used to randomize the kernel image
base address location.
========================== ====== ===========================================
diff --git a/Documentation/arm64/acpi_object_usage.rst b/Documentation/arm64/acpi_object_usage.rst
index 0609da73970b..484ef9676653 100644
--- a/Documentation/arm64/acpi_object_usage.rst
+++ b/Documentation/arm64/acpi_object_usage.rst
@@ -163,7 +163,7 @@ FPDT Section 5.2.23 (signature == "FPDT")
**Firmware Performance Data Table**
- Optional, not currently supported.
+ Optional, useful for boot performance profiling.
GTDT Section 5.2.24 (signature == "GTDT")
diff --git a/Documentation/arm64/booting.rst b/Documentation/arm64/booting.rst
index 3f9d86557c5e..ffeccdd6bdac 100644
--- a/Documentation/arm64/booting.rst
+++ b/Documentation/arm64/booting.rst
@@ -10,9 +10,9 @@ This document is based on the ARM booting document by Russell King and
is relevant to all public releases of the AArch64 Linux kernel.
The AArch64 exception model is made up of a number of exception levels
-(EL0 - EL3), with EL0 and EL1 having a secure and a non-secure
-counterpart. EL2 is the hypervisor level and exists only in non-secure
-mode. EL3 is the highest priority level and exists only in secure mode.
+(EL0 - EL3), with EL0, EL1 and EL2 having a secure and a non-secure
+counterpart. EL2 is the hypervisor level, EL3 is the highest priority
+level and exists only in secure mode. Both are architecturally optional.
For the purposes of this document, we will use the term `boot loader`
simply to define all software that executes on the CPU(s) before control
@@ -121,8 +121,9 @@ Header notes:
to the base of DRAM, since memory below it is not
accessible via the linear mapping
1
- 2MB aligned base may be anywhere in physical
- memory
+ 2MB aligned base such that all image_size bytes
+ counted from the start of the image are within
+ the 48-bit addressable range of physical memory
Bits 4-63 Reserved.
============= ===============================================================
@@ -167,8 +168,8 @@ Before jumping into the kernel, the following conditions must be met:
All forms of interrupts must be masked in PSTATE.DAIF (Debug, SError,
IRQ and FIQ).
- The CPU must be in either EL2 (RECOMMENDED in order to have access to
- the virtualisation extensions) or non-secure EL1.
+ The CPU must be in non-secure state, either in EL2 (RECOMMENDED in order
+ to have access to the virtualisation extensions), or in EL1.
- Caches, MMUs
@@ -222,7 +223,7 @@ Before jumping into the kernel, the following conditions must be met:
For systems with a GICv3 interrupt controller to be used in v3 mode:
- If EL3 is present:
- - ICC_SRE_EL3.Enable (bit 3) must be initialiased to 0b1.
+ - ICC_SRE_EL3.Enable (bit 3) must be initialised to 0b1.
- ICC_SRE_EL3.SRE (bit 0) must be initialised to 0b1.
- ICC_CTLR_EL3.PMHE (bit 6) must be set to the same value across
all CPUs the kernel is executing on, and must stay constant
@@ -340,6 +341,44 @@ Before jumping into the kernel, the following conditions must be met:
- SMCR_EL2.LEN must be initialised to the same value for all CPUs the
kernel will execute on.
+ - HFGRTR_EL2.nTPIDR2_EL0 (bit 55) must be initialised to 0b1.
+
+ - HFGWTR_EL2.nTPIDR2_EL0 (bit 55) must be initialised to 0b1.
+
+ - HFGRTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b1.
+
+ - HFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b1.
+
+ For CPUs with the Scalable Matrix Extension FA64 feature (FEAT_SME_FA64):
+
+ - If EL3 is present:
+
+ - SMCR_EL3.FA64 (bit 31) must be initialised to 0b1.
+
+ - If the kernel is entered at EL1 and EL2 is present:
+
+ - SMCR_EL2.FA64 (bit 31) must be initialised to 0b1.
+
+ For CPUs with the Memory Tagging Extension feature (FEAT_MTE2):
+
+ - If EL3 is present:
+
+ - SCR_EL3.ATA (bit 26) must be initialised to 0b1.
+
+ - If the kernel is entered at EL1 and EL2 is present:
+
+ - HCR_EL2.ATA (bit 56) must be initialised to 0b1.
+
+ For CPUs with the Scalable Matrix Extension version 2 (FEAT_SME2):
+
+ - If EL3 is present:
+
+ - SMCR_EL3.EZT0 (bit 30) must be initialised to 0b1.
+
+ - If the kernel is entered at EL1 and EL2 is present:
+
+ - SMCR_EL2.EZT0 (bit 30) must be initialised to 0b1.
+
The requirements described above for CPU mode, caches, MMUs, architected
timers, coherency and system registers apply to all CPUs. All CPUs must
enter the kernel in the same exception level. Where the values documented
diff --git a/Documentation/arm64/cpu-feature-registers.rst b/Documentation/arm64/cpu-feature-registers.rst
index 328e0c454fbd..c7adc7897df6 100644
--- a/Documentation/arm64/cpu-feature-registers.rst
+++ b/Documentation/arm64/cpu-feature-registers.rst
@@ -92,7 +92,7 @@ operation if the source belongs to the supported system register space.
The infrastructure emulates only the following system register space::
- Op0=3, Op1=0, CRn=0, CRm=0,4,5,6,7
+ Op0=3, Op1=0, CRn=0, CRm=0,2,3,4,5,6,7
(See Table C5-6 'System instruction encodings for non-Debug System
register accesses' in ARMv8 ARM DDI 0487A.h, for the list of
@@ -235,7 +235,15 @@ infrastructure:
| DPB | [3-0] | y |
+------------------------------+---------+---------+
- 6) ID_AA64MMFR2_EL1 - Memory model feature register 2
+ 6) ID_AA64MMFR0_EL1 - Memory model feature register 0
+
+ +------------------------------+---------+---------+
+ | Name | bits | visible |
+ +------------------------------+---------+---------+
+ | ECV | [63-60] | y |
+ +------------------------------+---------+---------+
+
+ 7) ID_AA64MMFR2_EL1 - Memory model feature register 2
+------------------------------+---------+---------+
| Name | bits | visible |
@@ -243,7 +251,7 @@ infrastructure:
| AT | [35-32] | y |
+------------------------------+---------+---------+
- 7) ID_AA64ZFR0_EL1 - SVE feature ID register 0
+ 8) ID_AA64ZFR0_EL1 - SVE feature ID register 0
+------------------------------+---------+---------+
| Name | bits | visible |
@@ -267,6 +275,61 @@ infrastructure:
| SVEVer | [3-0] | y |
+------------------------------+---------+---------+
+ 9) ID_AA64MMFR1_EL1 - Memory model feature register 1
+
+ +------------------------------+---------+---------+
+ | Name | bits | visible |
+ +------------------------------+---------+---------+
+ | AFP | [47-44] | y |
+ +------------------------------+---------+---------+
+
+ 10) ID_AA64ISAR2_EL1 - Instruction set attribute register 2
+
+ +------------------------------+---------+---------+
+ | Name | bits | visible |
+ +------------------------------+---------+---------+
+ | RPRES | [7-4] | y |
+ +------------------------------+---------+---------+
+ | WFXT | [3-0] | y |
+ +------------------------------+---------+---------+
+
+ 11) MVFR0_EL1 - AArch32 Media and VFP Feature Register 0
+
+ +------------------------------+---------+---------+
+ | Name | bits | visible |
+ +------------------------------+---------+---------+
+ | FPDP | [11-8] | y |
+ +------------------------------+---------+---------+
+
+ 12) MVFR1_EL1 - AArch32 Media and VFP Feature Register 1
+
+ +------------------------------+---------+---------+
+ | Name | bits | visible |
+ +------------------------------+---------+---------+
+ | SIMDFMAC | [31-28] | y |
+ +------------------------------+---------+---------+
+ | SIMDSP | [19-16] | y |
+ +------------------------------+---------+---------+
+ | SIMDInt | [15-12] | y |
+ +------------------------------+---------+---------+
+ | SIMDLS | [11-8] | y |
+ +------------------------------+---------+---------+
+
+ 13) ID_ISAR5_EL1 - AArch32 Instruction Set Attribute Register 5
+
+ +------------------------------+---------+---------+
+ | Name | bits | visible |
+ +------------------------------+---------+---------+
+ | CRC32 | [19-16] | y |
+ +------------------------------+---------+---------+
+ | SHA2 | [15-12] | y |
+ +------------------------------+---------+---------+
+ | SHA1 | [11-8] | y |
+ +------------------------------+---------+---------+
+ | AES | [7-4] | y |
+ +------------------------------+---------+---------+
+
+
Appendix I: Example
-------------------
diff --git a/Documentation/arm64/elf_hwcaps.rst b/Documentation/arm64/elf_hwcaps.rst
index ec1a5a63c1d0..83e57e4d38e2 100644
--- a/Documentation/arm64/elf_hwcaps.rst
+++ b/Documentation/arm64/elf_hwcaps.rst
@@ -14,7 +14,7 @@ Some hardware or software features are only available on some CPU
implementations, and/or with certain kernel configurations, but have no
architected discovery mechanism available to userspace code at EL0. The
kernel exposes the presence of these features to userspace through a set
-of flags called hwcaps, exposed in the auxilliary vector.
+of flags called hwcaps, exposed in the auxiliary vector.
Userspace software can test for features by acquiring the AT_HWCAP or
AT_HWCAP2 entry of the auxiliary vector, and testing whether the relevant
@@ -171,82 +171,137 @@ HWCAP_PACG
Documentation/arm64/pointer-authentication.rst.
HWCAP2_DCPODP
-
Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0010.
HWCAP2_SVE2
-
Functionality implied by ID_AA64ZFR0_EL1.SVEVer == 0b0001.
HWCAP2_SVEAES
-
Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0001.
HWCAP2_SVEPMULL
-
Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0010.
HWCAP2_SVEBITPERM
-
Functionality implied by ID_AA64ZFR0_EL1.BitPerm == 0b0001.
HWCAP2_SVESHA3
-
Functionality implied by ID_AA64ZFR0_EL1.SHA3 == 0b0001.
HWCAP2_SVESM4
-
Functionality implied by ID_AA64ZFR0_EL1.SM4 == 0b0001.
HWCAP2_FLAGM2
-
Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0010.
HWCAP2_FRINT
-
Functionality implied by ID_AA64ISAR1_EL1.FRINTTS == 0b0001.
HWCAP2_SVEI8MM
-
Functionality implied by ID_AA64ZFR0_EL1.I8MM == 0b0001.
HWCAP2_SVEF32MM
-
Functionality implied by ID_AA64ZFR0_EL1.F32MM == 0b0001.
HWCAP2_SVEF64MM
-
Functionality implied by ID_AA64ZFR0_EL1.F64MM == 0b0001.
HWCAP2_SVEBF16
-
Functionality implied by ID_AA64ZFR0_EL1.BF16 == 0b0001.
HWCAP2_I8MM
-
Functionality implied by ID_AA64ISAR1_EL1.I8MM == 0b0001.
HWCAP2_BF16
-
Functionality implied by ID_AA64ISAR1_EL1.BF16 == 0b0001.
HWCAP2_DGH
-
Functionality implied by ID_AA64ISAR1_EL1.DGH == 0b0001.
HWCAP2_RNG
-
Functionality implied by ID_AA64ISAR0_EL1.RNDR == 0b0001.
HWCAP2_BTI
-
Functionality implied by ID_AA64PFR0_EL1.BT == 0b0001.
HWCAP2_MTE
-
Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0010, as described
by Documentation/arm64/memory-tagging-extension.rst.
+HWCAP2_ECV
+ Functionality implied by ID_AA64MMFR0_EL1.ECV == 0b0001.
+
+HWCAP2_AFP
+ Functionality implied by ID_AA64MMFR1_EL1.AFP == 0b0001.
+
+HWCAP2_RPRES
+ Functionality implied by ID_AA64ISAR2_EL1.RPRES == 0b0001.
+
+HWCAP2_MTE3
+ Functionality implied by ID_AA64PFR1_EL1.MTE == 0b0011, as described
+ by Documentation/arm64/memory-tagging-extension.rst.
+
+HWCAP2_SME
+ Functionality implied by ID_AA64PFR1_EL1.SME == 0b0001, as described
+ by Documentation/arm64/sme.rst.
+
+HWCAP2_SME_I16I64
+ Functionality implied by ID_AA64SMFR0_EL1.I16I64 == 0b1111.
+
+HWCAP2_SME_F64F64
+ Functionality implied by ID_AA64SMFR0_EL1.F64F64 == 0b1.
+
+HWCAP2_SME_I8I32
+ Functionality implied by ID_AA64SMFR0_EL1.I8I32 == 0b1111.
+
+HWCAP2_SME_F16F32
+ Functionality implied by ID_AA64SMFR0_EL1.F16F32 == 0b1.
+
+HWCAP2_SME_B16F32
+ Functionality implied by ID_AA64SMFR0_EL1.B16F32 == 0b1.
+
+HWCAP2_SME_F32F32
+ Functionality implied by ID_AA64SMFR0_EL1.F32F32 == 0b1.
+
+HWCAP2_SME_FA64
+ Functionality implied by ID_AA64SMFR0_EL1.FA64 == 0b1.
+
+HWCAP2_WFXT
+ Functionality implied by ID_AA64ISAR2_EL1.WFXT == 0b0010.
+
+HWCAP2_EBF16
+ Functionality implied by ID_AA64ISAR1_EL1.BF16 == 0b0010.
+
+HWCAP2_SVE_EBF16
+ Functionality implied by ID_AA64ZFR0_EL1.BF16 == 0b0010.
+
+HWCAP2_CSSC
+ Functionality implied by ID_AA64ISAR2_EL1.CSSC == 0b0001.
+
+HWCAP2_RPRFM
+ Functionality implied by ID_AA64ISAR2_EL1.RPRFM == 0b0001.
+
+HWCAP2_SVE2P1
+ Functionality implied by ID_AA64ZFR0_EL1.SVEver == 0b0010.
+
+HWCAP2_SME2
+ Functionality implied by ID_AA64SMFR0_EL1.SMEver == 0b0001.
+
+HWCAP2_SME2P1
+ Functionality implied by ID_AA64SMFR0_EL1.SMEver == 0b0010.
+
+HWCAP2_SMEI16I32
+ Functionality implied by ID_AA64SMFR0_EL1.I16I32 == 0b0101
+
+HWCAP2_SMEBI32I32
+ Functionality implied by ID_AA64SMFR0_EL1.BI32I32 == 0b1
+
+HWCAP2_SMEB16B16
+ Functionality implied by ID_AA64SMFR0_EL1.B16B16 == 0b1
+
+HWCAP2_SMEF16F16
+ Functionality implied by ID_AA64SMFR0_EL1.F16F16 == 0b1
+
4. Unused AT_HWCAP bits
-----------------------
diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
index 4f840bac083e..ae21f8118830 100644
--- a/Documentation/arm64/index.rst
+++ b/Documentation/arm64/index.rst
@@ -21,6 +21,7 @@ ARM64 Architecture
perf
pointer-authentication
silicon-errata
+ sme
sve
tagged-address-abi
tagged-pointers
diff --git a/Documentation/arm64/memory-tagging-extension.rst b/Documentation/arm64/memory-tagging-extension.rst
index 7b99c8f428eb..dbae47bba25e 100644
--- a/Documentation/arm64/memory-tagging-extension.rst
+++ b/Documentation/arm64/memory-tagging-extension.rst
@@ -76,6 +76,9 @@ configurable behaviours:
with ``.si_code = SEGV_MTEAERR`` and ``.si_addr = 0`` (the faulting
address is unknown).
+- *Asymmetric* - Reads are handled as for synchronous mode while writes
+ are handled as for asynchronous mode.
+
The user can select the above modes, per thread, using the
``prctl(PR_SET_TAGGED_ADDR_CTRL, flags, 0, 0, 0)`` system call where ``flags``
contains any number of the following values in the ``PR_MTE_TCF_MASK``
@@ -91,8 +94,9 @@ mode is specified, the program will run in that mode. If multiple
modes are specified, the mode is selected as described in the "Per-CPU
preferred tag checking modes" section below.
-The current tag check fault mode can be read using the
-``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call.
+The current tag check fault configuration can be read using the
+``prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0)`` system call. If
+multiple modes were requested then all will be reported.
Tag checking can also be disabled for a user thread by setting the
``PSTATE.TCO`` bit with ``MSR TCO, #1``.
@@ -139,18 +143,25 @@ tag checking mode as the CPU's preferred tag checking mode.
The preferred tag checking mode for each CPU is controlled by
``/sys/devices/system/cpu/cpu<N>/mte_tcf_preferred``, to which a
-privileged user may write the value ``async`` or ``sync``. The default
-preferred mode for each CPU is ``async``.
+privileged user may write the value ``async``, ``sync`` or ``asymm``. The
+default preferred mode for each CPU is ``async``.
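+
+For example, to make asymmetric mode the preferred mode on CPU 0::
+
+  echo asymm > /sys/devices/system/cpu/cpu0/mte_tcf_preferred
+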
To allow a program to potentially run in the CPU's preferred tag
checking mode, the user program may set multiple tag check fault mode
bits in the ``flags`` argument to the ``prctl(PR_SET_TAGGED_ADDR_CTRL,
-flags, 0, 0, 0)`` system call. If the CPU's preferred tag checking
-mode is in the task's set of provided tag checking modes (this will
-always be the case at present because the kernel only supports two
-tag checking modes, but future kernels may support more modes), that
-mode will be selected. Otherwise, one of the modes in the task's mode
-set will be selected in a currently unspecified manner.
+flags, 0, 0, 0)`` system call. If both synchronous and asynchronous
+modes are requested then asymmetric mode may also be selected by the
+kernel. If the CPU's preferred tag checking mode is in the task's set
+of provided tag checking modes, that mode will be selected. Otherwise,
+one of the modes in the task's mode set will be selected by the
+kernel using the preference order:
+
+ 1. Asynchronous
+ 2. Asymmetric
+ 3. Synchronous
+
+Note that there is no way for userspace to request multiple modes and
+also disable asymmetric mode.
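+
+For example, a thread that can tolerate any of the modes can request both
+synchronous and asynchronous tag check faults and let the kernel choose,
+possibly asymmetric, following the preference order above (an illustrative
+sketch, error handling omitted)::
+
+  #include <sys/prctl.h>
+
+  prctl(PR_SET_TAGGED_ADDR_CTRL,
+        PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | PR_MTE_TCF_ASYNC,
+        0, 0, 0);
+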
Initial process state
---------------------
@@ -213,6 +224,29 @@ address ABI control and MTE configuration of a process as per the
Documentation/arm64/tagged-address-abi.rst and above. The corresponding
``regset`` is 1 element of 8 bytes (``sizeof(long))``).
+Core dump support
+-----------------
+
+The allocation tags for user memory mapped with ``PROT_MTE`` are dumped
+in the core file as additional ``PT_AARCH64_MEMTAG_MTE`` segments. The
+program header for such segment is defined as:
+
+:``p_type``: ``PT_AARCH64_MEMTAG_MTE``
+:``p_flags``: 0
+:``p_offset``: segment file offset
+:``p_vaddr``: segment virtual address, same as the corresponding
+ ``PT_LOAD`` segment
+:``p_paddr``: 0
+:``p_filesz``: segment size in file, calculated as ``p_mem_sz / 32``
+ (two 4-bit tags cover 32 bytes of memory)
+:``p_memsz``: segment size in memory, same as the corresponding
+ ``PT_LOAD`` segment
+:``p_align``: 0
+
+The tags are stored in the core file at ``p_offset`` as two 4-bit tags
+in a byte. With the tag granule of 16 bytes, a 4K page requires 128
+bytes in the core file.
+
Example of correct usage
========================
diff --git a/Documentation/arm64/memory.rst b/Documentation/arm64/memory.rst
index 901cd094f4ec..2a641ba7be3b 100644
--- a/Documentation/arm64/memory.rst
+++ b/Documentation/arm64/memory.rst
@@ -33,9 +33,8 @@ AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::
0000000000000000 0000ffffffffffff 256TB user
ffff000000000000 ffff7fffffffffff 128TB kernel logical memory map
[ffff600000000000 ffff7fffffffffff] 32TB [kasan shadow region]
- ffff800000000000 ffff800007ffffff 128MB bpf jit region
- ffff800008000000 ffff80000fffffff 128MB modules
- ffff800010000000 fffffbffefffffff 124TB vmalloc
+ ffff800000000000 ffff800007ffffff 128MB modules
+ ffff800008000000 fffffbffefffffff 124TB vmalloc
fffffbfff0000000 fffffbfffdffffff 224MB fixed mappings (top down)
fffffbfffe000000 fffffbfffe7fffff 8MB [guard region]
fffffbfffe800000 fffffbffff7fffff 16MB PCI I/O space
@@ -51,9 +50,8 @@ AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support):
0000000000000000 000fffffffffffff 4PB user
fff0000000000000 ffff7fffffffffff ~4PB kernel logical memory map
[fffd800000000000 ffff7fffffffffff] 512TB [kasan shadow region]
- ffff800000000000 ffff800007ffffff 128MB bpf jit region
- ffff800008000000 ffff80000fffffff 128MB modules
- ffff800010000000 fffffbffefffffff 124TB vmalloc
+ ffff800000000000 ffff800007ffffff 128MB modules
+ ffff800008000000 fffffbffefffffff 124TB vmalloc
fffffbfff0000000 fffffbfffdffffff 224MB fixed mappings (top down)
fffffbfffe000000 fffffbfffe7fffff 8MB [guard region]
fffffbfffe800000 fffffbffff7fffff 16MB PCI I/O space
diff --git a/Documentation/arm64/perf.rst b/Documentation/arm64/perf.rst
index b567f177d385..1f87b57c2332 100644
--- a/Documentation/arm64/perf.rst
+++ b/Documentation/arm64/perf.rst
@@ -2,7 +2,10 @@
.. _perf_index:
-=====================
+====
+Perf
+====
+
Perf Event Attributes
=====================
@@ -88,3 +91,76 @@ exclude_host. However when using !exclude_hv there is a small blackout
window at the guest entry/exit where host events are not captured.
On VHE systems there are no blackout windows.
+
+Perf Userspace PMU Hardware Counter Access
+==========================================
+
+Overview
+--------
+The perf userspace tool relies on the PMU to monitor events. It offers an
+abstraction layer over the hardware counters since the underlying
+implementation is cpu-dependent.
+Arm64 allows userspace tools to have access to the registers storing the
+hardware counters' values directly.
+
+This targets specifically self-monitoring tasks in order to reduce the overhead
+by directly accessing the registers without having to go through the kernel.
+
+How-to
+------
+The focus is on the ARMv8 PMUv3, which makes sure that access to the PMU
+registers is enabled and that userspace has access to the relevant
+information in order to use them.
+
+In order to have access to the hardware counters, the global sysctl
+kernel/perf_user_access must first be enabled:
+
+.. code-block:: sh
+
+ echo 1 > /proc/sys/kernel/perf_user_access
+
+It is necessary to open the event using the perf tool interface with config1:1
+attr bit set: the sys_perf_event_open syscall returns a fd which can
+subsequently be used with the mmap syscall in order to retrieve a page of memory
+containing information about the event. The PMU driver uses this page to expose
+to the user the hardware counter's index and other necessary data. Using this
+index enables the user to access the PMU registers using the `mrs` instruction.
+Access to the PMU registers is only valid while the sequence lock is unchanged.
+In particular, the PMSELR_EL0 register is zeroed each time the sequence lock is
+changed.
+
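+As an illustrative sketch (not the reference implementation; see the libperf
+functions referenced below for that), a self-monitoring read based on the
+mmapped event page could look as follows, assuming ``pc`` points to the
+event's ``struct perf_event_mmap_page`` and a programmable event counter is
+used (the cycle counter would instead be read through PMCCNTR_EL0):
+
+.. code-block:: c
+
+  #include <linux/perf_event.h>
+
+  static __u64 read_pmu_counter(__u32 idx)
+  {
+          __u64 val;
+
+          /* Select the hardware counter, then read its value */
+          asm volatile("msr pmselr_el0, %0" : : "r" ((__u64)(idx - 1)));
+          asm volatile("isb; mrs %0, pmxevcntr_el0" : "=r" (val));
+
+          return val;
+  }
+
+  static __u64 read_event_count(struct perf_event_mmap_page *pc)
+  {
+          __u32 seq, idx, width;
+          __u64 count, pmc;
+
+          /* Retry while the sequence lock changes under our feet */
+          do {
+                  seq = pc->lock;
+                  asm volatile("" : : : "memory");
+                  idx = pc->index;
+                  count = pc->offset;
+                  if (pc->cap_user_rdpmc && idx) {
+                          width = pc->pmc_width;
+                          pmc = read_pmu_counter(idx);
+                          /* Mask off the unimplemented upper counter bits */
+                          pmc <<= 64 - width;
+                          pmc >>= 64 - width;
+                          count += pmc;
+                  }
+                  asm volatile("" : : : "memory");
+          } while (pc->lock != seq);
+
+          return count;
+  }
+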
+The userspace access is supported in libperf using the perf_evsel__mmap()
+and perf_evsel__read() functions. See `tools/lib/perf/tests/test-evsel.c`_ for
+an example.
+
+About heterogeneous systems
+---------------------------
+On heterogeneous systems such as big.LITTLE, userspace PMU counter access can
+only be enabled when the tasks are pinned to a homogeneous subset of cores and
+the corresponding PMU instance is opened by specifying the 'type' attribute.
+The use of generic event types is not supported in this case.
+
+Have a look at `tools/perf/arch/arm64/tests/user-events.c`_ for an example. It
+can be run using the perf tool to check that the access to the registers works
+correctly from userspace:
+
+.. code-block:: sh
+
+ perf test -v user
+
+About chained events and counter sizes
+--------------------------------------
+The user can request either a 32-bit (config1:0 == 0) or 64-bit (config1:0 == 1)
+counter along with userspace access. The sys_perf_event_open syscall will fail
+if a 64-bit counter is requested and the hardware doesn't support 64-bit
+counters. Chained events are not supported in conjunction with userspace counter
+access. If a 32-bit counter is requested on hardware with 64-bit counters, then
+userspace must treat the upper 32-bits read from the counter as UNKNOWN. The
+'pmc_width' field in the user page will indicate the valid width of the counter
+and should be used to mask the upper bits as needed.
+
+.. Links
+.. _tools/perf/arch/arm64/tests/user-events.c:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/arch/arm64/tests/user-events.c
+.. _tools/lib/perf/tests/test-evsel.c:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/perf/tests/test-evsel.c
diff --git a/Documentation/arm64/pointer-authentication.rst b/Documentation/arm64/pointer-authentication.rst
index f127666ea3a8..e5dad2e40aa8 100644
--- a/Documentation/arm64/pointer-authentication.rst
+++ b/Documentation/arm64/pointer-authentication.rst
@@ -53,11 +53,10 @@ The number of bits that the PAC occupies in a pointer is 55 minus the
virtual address size configured by the kernel. For example, with a
virtual address size of 48, the PAC is 7 bits wide.
-Recent versions of GCC can compile code with APIAKey-based return
-address protection when passed the -msign-return-address option. This
-uses instructions in the HINT space (unless -march=armv8.3-a or higher
-is also passed), and such code can run on systems without the pointer
-authentication extension.
+When ARM64_PTR_AUTH_KERNEL is selected, the kernel will be compiled
+with HINT space pointer authentication instructions protecting
+function returns. Kernels built with this option will work on hardware
+with or without pointer authentication support.
In addition to exec(), keys can also be reinitialized to random values
using the PR_PAC_RESET_KEYS prctl. A bitmask of PR_PAC_APIAKEY,
diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst
index d410a47ffa57..9e311bc43e05 100644
--- a/Documentation/arm64/silicon-errata.rst
+++ b/Documentation/arm64/silicon-errata.rst
@@ -52,6 +52,14 @@ stable kernels.
| Allwinner | A64/R18 | UNKNOWN1 | SUN50I_ERRATUM_UNKNOWN1 |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A510 | #2457168 | ARM64_ERRATUM_2457168 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A510 | #2064142 | ARM64_ERRATUM_2064142 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A510 | #2038923 | ARM64_ERRATUM_2038923 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A510 | #1902691 | ARM64_ERRATUM_1902691 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A53 | #826319 | ARM64_ERRATUM_826319 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A53 | #827319 | ARM64_ERRATUM_827319 |
@@ -68,6 +76,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A55 | #1530923 | ARM64_ERRATUM_1530923 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A55 | #2441007 | ARM64_ERRATUM_2441007 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #832075 | ARM64_ERRATUM_832075 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #852523 | N/A |
@@ -76,10 +86,14 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A57 | #1319537 | ARM64_ERRATUM_1319367 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A57 | #1742098 | ARM64_ERRATUM_1742098 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A72 | #853709 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A72 | #1319367 | ARM64_ERRATUM_1319367 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A72 | #1655431 | ARM64_ERRATUM_1742098 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A73 | #858921 | ARM64_ERRATUM_858921 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A76 | #1188873,1418040| ARM64_ERRATUM_1418040 |
@@ -92,12 +106,38 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Cortex-A77 | #1508412 | ARM64_ERRATUM_1508412 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A510 | #2051678 | ARM64_ERRATUM_2051678 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A510 | #2077057 | ARM64_ERRATUM_2077057 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A510 | #2441009 | ARM64_ERRATUM_2441009 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A510 | #2658417 | ARM64_ERRATUM_2658417 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A710 | #2119858 | ARM64_ERRATUM_2119858 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A710 | #2054223 | ARM64_ERRATUM_2054223 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A710 | #2224489 | ARM64_ERRATUM_2224489 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-A715 | #2645198 | ARM64_ERRATUM_2645198 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-X2 | #2119858 | ARM64_ERRATUM_2119858 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Cortex-X2 | #2224489 | ARM64_ERRATUM_2224489 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N1 | #1188873,1418040| ARM64_ERRATUM_1418040 |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N1 | #1349291 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
| ARM | Neoverse-N1 | #1542419 | ARM64_ERRATUM_1542419 |
+----------------+-----------------+-----------------+-----------------------------+
+| ARM | Neoverse-N2 | #2139208 | ARM64_ERRATUM_2139208 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Neoverse-N2 | #2067961 | ARM64_ERRATUM_2067961 |
++----------------+-----------------+-----------------+-----------------------------+
+| ARM | Neoverse-N2 | #2253138 | ARM64_ERRATUM_2253138 |
++----------------+-----------------+-----------------+-----------------------------+
| ARM | MMU-500 | #841119,826419 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
@@ -110,7 +150,7 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| Cavium | ThunderX ITS | #23144 | CAVIUM_ERRATUM_23144 |
+----------------+-----------------+-----------------+-----------------------------+
-| Cavium | ThunderX GICv3 | #23154 | CAVIUM_ERRATUM_23154 |
+| Cavium | ThunderX GICv3 | #23154,38545 | CAVIUM_ERRATUM_23154 |
+----------------+-----------------+-----------------+-----------------------------+
| Cavium | ThunderX GICv3 | #38539 | N/A |
+----------------+-----------------+-----------------+-----------------------------+
@@ -132,6 +172,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| NVIDIA | Carmel Core | N/A | NVIDIA_CARMEL_CNP_ERRATUM |
+----------------+-----------------+-----------------+-----------------------------+
+| NVIDIA | T241 GICv3/4.x | T241-FABRIC-4 | N/A |
++----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| Freescale/NXP | LS2080A/LS1043A | A-008585 | FSL_ERRATUM_A008585 |
+----------------+-----------------+-----------------+-----------------------------+
@@ -163,6 +205,12 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| Qualcomm Tech. | Kryo4xx Silver | N/A | ARM64_ERRATUM_1024718 |
+----------------+-----------------+-----------------+-----------------------------+
+| Qualcomm Tech. | Kryo4xx Gold | N/A | ARM64_ERRATUM_1286807 |
++----------------+-----------------+-----------------+-----------------------------+
++----------------+-----------------+-----------------+-----------------------------+
+| Rockchip | RK3588 | #3588001 | ROCKCHIP_ERRATUM_3588001 |
++----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| Fujitsu | A64FX | E#010001 | FUJITSU_ERRATUM_010001 |
+----------------+-----------------+-----------------+-----------------------------+
diff --git a/Documentation/arm64/sme.rst b/Documentation/arm64/sme.rst
new file mode 100644
index 000000000000..1c43ea12eb4f
--- /dev/null
+++ b/Documentation/arm64/sme.rst
@@ -0,0 +1,468 @@
+===================================================
+Scalable Matrix Extension support for AArch64 Linux
+===================================================
+
+This document outlines briefly the interface provided to userspace by Linux in
+order to support use of the ARM Scalable Matrix Extension (SME).
+
+This is an outline of the most important features and issues only and not
+intended to be exhaustive. It should be read in conjunction with the SVE
+documentation in sve.rst which provides details on the Streaming SVE mode
+included in SME.
+
+This document does not aim to describe the SME architecture or programmer's
+model. To aid understanding, a minimal description of relevant programmer's
+model features for SME is included in Appendix A.
+
+
+1. General
+-----------
+
+* PSTATE.SM, PSTATE.ZA, the streaming mode vector length, the ZA and (when
+ present) ZTn register state and TPIDR2_EL0 are tracked per thread.
+
+* The presence of SME is reported to userspace via HWCAP2_SME in the aux vector
+ AT_HWCAP2 entry. Presence of this flag implies the presence of the SME
+ instructions and registers, and the Linux-specific system interfaces
+ described in this document. SME is reported in /proc/cpuinfo as "sme".
+
+* The presence of SME2 is reported to userspace via HWCAP2_SME2 in the
+ aux vector AT_HWCAP2 entry. Presence of this flag implies the presence of
+ the SME2 instructions and ZT0, and the Linux-specific system interfaces
+ described in this document. SME2 is reported in /proc/cpuinfo as "sme2".
+
+* Support for the execution of SME instructions in userspace can also be
+ detected by reading the CPU ID register ID_AA64PFR1_EL1 using an MRS
+ instruction, and checking that the value of the SME field is nonzero. [3]
+
+ It does not guarantee the presence of the system interfaces described in the
+ following sections: software that needs to verify that those interfaces are
+ present must check for HWCAP2_SME instead.
+
+* There are a number of optional SME features; the presence of these is
+  reported through AT_HWCAP2 via the following flags:
+
+ HWCAP2_SME_I16I64
+ HWCAP2_SME_F64F64
+ HWCAP2_SME_I8I32
+ HWCAP2_SME_F16F32
+ HWCAP2_SME_B16F32
+ HWCAP2_SME_F32F32
+ HWCAP2_SME_FA64
+ HWCAP2_SME2
+
+ This list may be extended over time as the SME architecture evolves.
+
+ These extensions are also reported via the CPU ID register ID_AA64SMFR0_EL1,
+ which userspace can read using an MRS instruction. See elf_hwcaps.txt and
+ cpu-feature-registers.txt for details.
+
+* Debuggers should restrict themselves to interacting with the target via the
+ NT_ARM_SVE, NT_ARM_SSVE, NT_ARM_ZA and NT_ARM_ZT regsets. The recommended
+ way of detecting support for these regsets is to connect to a target process
+ first and then attempt a
+
+ ptrace(PTRACE_GETREGSET, pid, NT_ARM_<regset>, &iov).
+
+* Whenever ZA register values are exchanged in memory between userspace and
+ the kernel, the register value is encoded in memory as a series of horizontal
+ vectors from 0 to VL/8-1 stored in the same endianness invariant format as is
+ used for SVE vectors.
+
+* On thread creation TPIDR2_EL0 is preserved unless CLONE_SETTLS is specified,
+ in which case it is set to 0.
+
+2. Vector lengths
+------------------
+
+SME defines a second vector length, similar to the SVE vector length, which
+controls the size of the streaming mode SVE vectors and the ZA matrix array.
+The ZA matrix is square with each side having as many bytes as a streaming
+mode SVE vector.
+
+
+3. Sharing of streaming and non-streaming mode SVE state
+---------------------------------------------------------
+
+It is implementation defined which, if any, parts of the SVE state are shared
+between streaming and non-streaming modes. When switching between modes via
+software interfaces such as ptrace, if no register content is provided as
+part of switching, no state will be assumed to be shared and everything will
+be zeroed.
+
+
+4. System call behaviour
+-------------------------
+
+* On syscall PSTATE.ZA is preserved; if PSTATE.ZA==1 then the contents of the
+ ZA matrix and ZTn (if present) are preserved.
+
+* On syscall PSTATE.SM will be cleared and the SVE registers will be handled
+ as per the standard SVE ABI.
+
+* None of the SVE registers, ZA or ZTn are used to pass arguments to
+ or receive results from any syscall.
+
+* On process creation (eg, clone()) the newly created process will have
+ PSTATE.SM cleared.
+
+* All other SME state of a thread, including the currently configured vector
+ length, the state of the PR_SME_VL_INHERIT flag, and the deferred vector
+ length (if any), is preserved across all syscalls, subject to the specific
+ exceptions for execve() described in section 7.
+
+
+5. Signal handling
+-------------------
+
+* Signal handlers are invoked with streaming mode and ZA disabled.
+
+* A new signal frame record TPIDR2_MAGIC is added formatted as a struct
+ tpidr2_context to allow access to TPIDR2_EL0 from signal handlers.
+
+* A new signal frame record za_context encodes the ZA register contents on
+ signal delivery. [1]
+
+* The signal frame record for ZA always contains basic metadata, in particular
+ the thread's vector length (in za_context.vl).
+
+* The ZA matrix may or may not be included in the record, depending on
+ the value of PSTATE.ZA. The registers are present if and only if:
+ za_context.head.size >= ZA_SIG_CONTEXT_SIZE(sve_vq_from_vl(za_context.vl))
+ in which case PSTATE.ZA == 1.
+
+* If matrix data is present, the remainder of the record has a vl-dependent
+ size and layout. Macros ZA_SIG_* are defined [1] to facilitate access to
+ them.
+
+* The matrix is stored as a series of horizontal vectors in the same format as
+ is used for SVE vectors.
+
+* If the ZA context is too big to fit in sigcontext.__reserved[], then extra
+ space is allocated on the stack, an extra_context record is written in
+ __reserved[] referencing this space. za_context is then written in the
+ extra space. Refer to [1] for further details about this mechanism.
+
+* If ZTn is supported and PSTATE.ZA==1 then a signal frame record for ZTn will
+ be generated.
+
+* The signal record for ZTn has magic ZT_MAGIC (0x5a544e01) and consists of a
+ standard signal frame header followed by a struct zt_context specifying
+ the number of ZTn registers supported by the system, then zt_context.nregs
+ blocks of 64 bytes of data per register.
+
+
+6. Signal return
+-----------------
+
+When returning from a signal handler:
+
+* If there is no za_context record in the signal frame, or if the record is
+ present but contains no register data as described in the previous section,
+ then ZA is disabled.
+
+* If za_context is present in the signal frame and contains matrix data then
+ PSTATE.ZA is set to 1 and ZA is populated with the specified data.
+
+* The vector length cannot be changed via signal return. If za_context.vl in
+ the signal frame does not match the current vector length, the signal return
+ attempt is treated as illegal, resulting in a forced SIGSEGV.
+
+* If ZTn is not supported or PSTATE.ZA==0 then it is illegal to have a
+ signal frame record for ZTn, resulting in a forced SIGSEGV.
+
+
+7. prctl extensions
+--------------------
+
+Some new prctl() calls are added to allow programs to manage the SME vector
+length:
+
+prctl(PR_SME_SET_VL, unsigned long arg)
+
+ Sets the vector length of the calling thread and related flags, where
+ arg == vl | flags. Other threads of the calling process are unaffected.
+
+ vl is the desired vector length, where sve_vl_valid(vl) must be true.
+
+ flags:
+
+ PR_SME_VL_INHERIT
+
+ Inherit the current vector length across execve(). Otherwise, the
+ vector length is reset to the system default at execve(). (See
+ Section 10.)
+
+ PR_SME_SET_VL_ONEXEC
+
+ Defer the requested vector length change until the next execve()
+ performed by this thread.
+
+ The effect is equivalent to implicit execution of the following
+ call immediately after the next execve() (if any) by the thread:
+
+ prctl(PR_SME_SET_VL, arg & ~PR_SME_SET_VL_ONEXEC)
+
+ This allows launching of a new program with a different vector
+ length, while avoiding runtime side effects in the caller.
+
+ Without PR_SME_SET_VL_ONEXEC, the requested change takes effect
+ immediately.
+
+
+ Return value: a nonnegative value on success, or a negative value on error:
+ EINVAL: SME not supported, invalid vector length requested, or
+ invalid flags.
+
+
+ On success:
+
+ * Either the calling thread's vector length or the deferred vector length
+ to be applied at the next execve() by the thread (dependent on whether
+ PR_SME_SET_VL_ONEXEC is present in arg), is set to the largest value
+ supported by the system that is less than or equal to vl. If vl ==
+ SVE_VL_MAX, the value set will be the largest value supported by the
+ system.
+
+ * Any previously outstanding deferred vector length change in the calling
+ thread is cancelled.
+
+ * The returned value describes the resulting configuration, encoded as for
+ PR_SME_GET_VL. The vector length reported in this value is the new
+ current vector length for this thread if PR_SME_SET_VL_ONEXEC was not
+ present in arg; otherwise, the reported vector length is the deferred
+ vector length that will be applied at the next execve() by the calling
+ thread.
+
+ * Changing the vector length causes all of ZA, ZTn, P0..P15, FFR and all
+ bits of Z0..Z31 except for Z0 bits [127:0] .. Z31 bits [127:0] to become
+ unspecified, including both streaming and non-streaming SVE state.
+ Calling PR_SME_SET_VL with vl equal to the thread's current vector
+ length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
+ does not constitute a change to the vector length for this purpose.
+
+ * Changing the vector length causes PSTATE.ZA and PSTATE.SM to be cleared.
+ Calling PR_SME_SET_VL with vl equal to the thread's current vector
+ length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag,
+ does not constitute a change to the vector length for this purpose.
+
+
+prctl(PR_SME_GET_VL)
+
+ Gets the vector length of the calling thread.
+
+ The following flag may be OR-ed into the result:
+
+ PR_SME_VL_INHERIT
+
+ Vector length will be inherited across execve().
+
+ There is no way to determine whether there is an outstanding deferred
+ vector length change (which would only normally be the case between a
+ fork() or vfork() and the corresponding execve() in typical use).
+
+ To extract the vector length from the result, bitwise and it with
+ PR_SME_VL_LEN_MASK.
+
+ Return value: a nonnegative value on success, or a negative value on error:
+ EINVAL: SME not supported.
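+
+ For example, to request a 64 byte (512 bit) streaming vector length and read
+ back the configuration actually set (an illustrative sketch; error handling
+ omitted)::
+
+    #include <sys/prctl.h>
+
+    int ret = prctl(PR_SME_SET_VL, 64);
+    unsigned long svl;
+
+    if (ret >= 0)
+            svl = ret & PR_SME_VL_LEN_MASK;  /* vector length now in effect */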
+
+
+8. ptrace extensions
+---------------------
+
+* A new regset NT_ARM_SSVE is defined for access to streaming mode SVE
+ state via PTRACE_GETREGSET and PTRACE_SETREGSET, this is documented in
+ sve.rst.
+
+* A new regset NT_ARM_ZA is defined for ZA state for access to ZA state via
+ PTRACE_GETREGSET and PTRACE_SETREGSET.
+
+ Refer to [2] for definitions.
+
+The regset data starts with struct user_za_header, containing:
+
+ size
+
+ Size of the complete regset, in bytes.
+ This depends on vl and possibly on other things in the future.
+
+ If a call to PTRACE_GETREGSET requests less data than the value of
+ size, the caller can allocate a larger buffer and retry in order to
+ read the complete regset.
+
+ max_size
+
+ Maximum size in bytes that the regset can grow to for the target
+ thread. The regset won't grow bigger than this even if the target
+ thread changes its vector length etc.
+
+ vl
+
+ Target thread's current streaming vector length, in bytes.
+
+ max_vl
+
+ Maximum possible streaming vector length for the target thread.
+
+ flags
+
+ Zero or more of the following flags, which have the same
+ meaning and behaviour as the corresponding PR_SET_VL_* flags:
+
+ SME_PT_VL_INHERIT
+
+ SME_PT_VL_ONEXEC (SETREGSET only).
+
+* The effects of changing the vector length and/or flags are equivalent to
+ those documented for PR_SME_SET_VL.
+
+ The caller must make a further GETREGSET call if it needs to know what VL is
+ actually set by SETREGSET, unless it is known in advance that the requested
+ VL is supported.
+
+* The size and layout of the payload depends on the header fields. The
+ SME_PT_ZA_*() macros are provided to facilitate access to the data.
+
+* In either case, for SETREGSET it is permissible to omit the payload, in which
+ case the vector length and flags are changed and PSTATE.ZA is set to 0
+ (along with any consequences of those changes). If a payload is provided
+ then PSTATE.ZA will be set to 1.
+
+* For SETREGSET, if the requested VL is not supported, the effect will be the
+ same as if the payload were omitted, except that an EIO error is reported.
+ No attempt is made to translate the payload data to the correct layout
+ for the vector length actually set. It is up to the caller to translate the
+ payload layout for the actual VL and retry.
+
+* The effect of writing a partial, incomplete payload is unspecified.
+
+* A new regset NT_ARM_ZT is defined for access to ZTn state via
+ PTRACE_GETREGSET and PTRACE_SETREGSET.
+
+* The NT_ARM_ZT regset consists of a single 512 bit register.
+
+* When PSTATE.ZA==0 reads of NT_ARM_ZT will report all bits of ZTn as 0.
+
+* Writes to NT_ARM_ZT will set PSTATE.ZA to 1.
+
+
+9. ELF coredump extensions
+---------------------------
+
+* NT_ARM_SSVE notes will be added to each coredump for
+ each thread of the dumped process. The contents will be equivalent to the
+ data that would have been read if a PTRACE_GETREGSET of the corresponding
+ type were executed for each thread when the coredump was generated.
+
+* A NT_ARM_ZA note will be added to each coredump for each thread of the
+ dumped process. The contents will be equivalent to the data that would have
+ been read if a PTRACE_GETREGSET of NT_ARM_ZA were executed for each thread
+ when the coredump was generated.
+
+* A NT_ARM_ZT note will be added to each coredump for each thread of the
+ dumped process. The contents will be equivalent to the data that would have
+ been read if a PTRACE_GETREGSET of NT_ARM_ZT were executed for each thread
+ when the coredump was generated.
+
+* The NT_ARM_TLS note will be extended to two registers; the second register
+  will contain TPIDR2_EL0 on systems that support SME and will be read as
+  zero with writes ignored otherwise.
+
+10. System runtime configuration
+---------------------------------
+
+* To mitigate the ABI impact of expansion of the signal frame, a policy
+ mechanism is provided for administrators, distro maintainers and developers
+ to set the default vector length for userspace processes:
+
+/proc/sys/abi/sme_default_vector_length
+
+ Writing the text representation of an integer to this file sets the system
+ default vector length to the specified value, unless the value is greater
+ than the maximum vector length supported by the system in which case the
+ default vector length is set to that maximum.
+
+ The result can be determined by reopening the file and reading its
+ contents.
+
+ At boot, the default vector length is initially set to 32 or the maximum
+ supported vector length, whichever is smaller and supported. This
+ determines the initial vector length of the init process (PID 1).
+
+ Reading this file returns the current system default vector length.
+
+* At every execve() call, the vector length of the new process is set to
+ the system default vector length, unless
+
+ * PR_SME_VL_INHERIT (or equivalently SME_PT_VL_INHERIT) is set for the
+ calling thread, or
+
+ * a deferred vector length change is pending, established via the
+ PR_SME_SET_VL_ONEXEC flag (or SME_PT_VL_ONEXEC).
+
+* Modifying the system default vector length does not affect the vector length
+ of any existing process or thread that does not make an execve() call.
+
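+The sketch below illustrates the administration interface described in this
+section: it requests a default streaming vector length and reads the file
+back to learn the value the kernel actually chose (informative only)::
+
+  #include <stdio.h>
+
+  /* Returns the default VL actually chosen by the kernel, or 0 on error. */
+  unsigned int set_default_sme_vl(unsigned int vl)
+  {
+          const char *path = "/proc/sys/abi/sme_default_vector_length";
+          unsigned int chosen = 0;
+          FILE *f = fopen(path, "w");
+
+          if (!f)
+                  return 0;
+          /* An over-large request is clamped to the maximum supported VL. */
+          fprintf(f, "%u\n", vl);
+          fclose(f);
+
+          f = fopen(path, "r");
+          if (!f)
+                  return 0;
+          if (fscanf(f, "%u", &chosen) != 1)
+                  chosen = 0;
+          fclose(f);
+          return chosen;
+  }
+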
+
+Appendix A. SME programmer's model (informative)
+=================================================
+
+This section provides a minimal description of the additions made by SME to the
+ARMv8-A programmer's model that are relevant to this document.
+
+Note: This section is for information only and not intended to be complete or
+to replace any architectural specification.
+
+A.1. Registers
+---------------
+
+In A64 state, SME adds the following:
+
+* A new mode, streaming mode, in which a subset of the normal FPSIMD and SVE
+ features is available. When supported, EL0 software may enter and leave
+ streaming mode at any time.
+
+ For best system performance, software is strongly encouraged to enable
+ streaming mode only while it is actively being used.
+
+* A new vector length controlling the size of ZA and the Z registers when in
+ streaming mode, separately from the vector length used for SVE when not in
+ streaming mode. There is no requirement that either the currently selected
+ vector length or the set of vector lengths supported for the two modes in
+ a given system have any relationship. The streaming mode vector length
+ is referred to as SVL.
+
+* A new ZA matrix register. This is a square matrix of SVLxSVL bits. Most
+ operations on ZA require that streaming mode be enabled but ZA can be
+ enabled without streaming mode in order to load, save and retain data.
+
+ For best system performance, software is strongly encouraged to enable
+ ZA only while it is actively being used.
+
+* A new ZT0 register is introduced when SME2 is present. This is a 512 bit
+ register which is accessible when PSTATE.ZA is set, as ZA itself is.
+
+* Two new 1 bit fields in PSTATE which may be controlled via the SMSTART and
+ SMSTOP instructions or by access to the SVCR system register (an
+ illustrative read sequence follows this list):
+
+ * PSTATE.ZA: if this is 1 then the ZA matrix is accessible and has valid
+ data, while if it is 0 then ZA cannot be accessed. When PSTATE.ZA is
+ changed from 0 to 1 all bits in ZA are cleared.
+
+ * PSTATE.SM: if this is 1 then the PE is in streaming mode. When the value
+ of PSTATE.SM is changed, it is implementation defined whether the subset
+ of the floating point register bits valid in both modes is retained.
+ Any other bits will be cleared.
+
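+As an informative illustration only, the two PSTATE fields can be observed
+from EL0 by reading SVCR. The sketch below uses the generic system register
+name so that no SME support is needed in the compiler, and must only be run
+after confirming that SME is present (e.g. via HWCAP2_SME)::
+
+  #include <stdbool.h>
+
+  /* SVCR is op0=3, op1=3, CRn=4, CRm=2, op2=2; bit 0 is SM, bit 1 is ZA. */
+  static inline unsigned long read_svcr(void)
+  {
+          unsigned long svcr;
+
+          asm volatile("mrs %0, S3_3_C4_C2_2" : "=r"(svcr));
+          return svcr;
+  }
+
+  static inline bool in_streaming_mode(void)
+  {
+          return read_svcr() & 1;  /* PSTATE.SM */
+  }
+
+  static inline bool za_enabled(void)
+  {
+          return read_svcr() & 2;  /* PSTATE.ZA */
+  }
+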
+
+References
+==========
+
+[1] arch/arm64/include/uapi/asm/sigcontext.h
+ AArch64 Linux signal ABI definitions
+
+[2] arch/arm64/include/uapi/asm/ptrace.h
+ AArch64 Linux ptrace ABI definitions
+
+[3] Documentation/arm64/cpu-feature-registers.rst
diff --git a/Documentation/arm64/sve.rst b/Documentation/arm64/sve.rst
index 03137154299e..1b90a30382ac 100644
--- a/Documentation/arm64/sve.rst
+++ b/Documentation/arm64/sve.rst
@@ -7,7 +7,9 @@ Author: Dave Martin <Dave.Martin@arm.com>
Date: 4 August 2017
This document outlines briefly the interface provided to userspace by Linux in
-order to support use of the ARM Scalable Vector Extension (SVE).
+order to support use of the ARM Scalable Vector Extension (SVE), including
+interactions with Streaming SVE mode added by the Scalable Matrix Extension
+(SME).
This is an outline of the most important features and issues only and not
intended to be exhaustive.
@@ -23,6 +25,10 @@ model features for SVE is included in Appendix A.
* SVE registers Z0..Z31, P0..P15 and FFR and the current vector length VL, are
tracked per-thread.
+* In streaming mode FFR is not accessible unless HWCAP2_SME_FA64 is present
+ in the system; when it is not supported and these interfaces are used to
+ access streaming mode, FFR is read and written as zero.
+
* The presence of SVE is reported to userspace via HWCAP_SVE in the aux vector
AT_HWCAP entry. Presence of this flag implies the presence of the SVE
instructions and registers, and the Linux-specific system interfaces
@@ -46,6 +52,7 @@ model features for SVE is included in Appendix A.
HWCAP2_SVEBITPERM
HWCAP2_SVESHA3
HWCAP2_SVESM4
+ HWCAP2_SVE2P1
This list may be extended over time as the SVE architecture evolves.
@@ -53,10 +60,19 @@ model features for SVE is included in Appendix A.
which userspace can read using an MRS instruction. See elf_hwcaps.txt and
cpu-feature-registers.txt for details.
+* On hardware that supports the SME extensions, HWCAP2_SME will also be
+ reported in the AT_HWCAP2 aux vector entry. Among other things SME adds
+ streaming mode which provides a subset of the SVE feature set using a
+ separate SME vector length and the same Z/V registers. See sme.rst
+ for more details.
+
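+  A minimal sketch of this detection, assuming headers that define
+  HWCAP2_SME::
+
+    #include <sys/auxv.h>
+    #include <asm/hwcap.h>  /* HWCAP2_SME */
+
+    int have_sme(void)
+    {
+            return (getauxval(AT_HWCAP2) & HWCAP2_SME) != 0;
+    }
+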
* Debuggers should restrict themselves to interacting with the target via the
NT_ARM_SVE regset. The recommended way of detecting support for this regset
is to connect to a target process first and then attempt a
- ptrace(PTRACE_GETREGSET, pid, NT_ARM_SVE, &iov).
+ ptrace(PTRACE_GETREGSET, pid, NT_ARM_SVE, &iov). Note that when SME is
+ present and streaming SVE mode is in use the FPSIMD subset of registers
+ will be read via NT_ARM_SVE and NT_ARM_SVE writes will exit streaming mode
+ in the target.
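+  An informative sketch of the recommended probe, assuming the target is
+  already attached and stopped::
+
+    #include <sys/ptrace.h>
+    #include <sys/types.h>
+    #include <sys/uio.h>
+    #include <linux/elf.h>   /* NT_ARM_SVE */
+    #include <asm/ptrace.h>  /* struct user_sve_header */
+
+    int target_has_sve(pid_t pid)
+    {
+            struct user_sve_header hdr;
+            struct iovec iov = { .iov_base = &hdr, .iov_len = sizeof(hdr) };
+
+            /* Fails (EINVAL) on kernels without the NT_ARM_SVE regset. */
+            return ptrace(PTRACE_GETREGSET, pid, NT_ARM_SVE, &iov) == 0;
+    }
+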
* Whenever SVE scalable register values (Zn, Pn, FFR) are exchanged in memory
between userspace and the kernel, the register value is encoded in memory in
@@ -96,7 +112,7 @@ the SVE instruction set architecture.
* On syscall, V0..V31 are preserved (as without SVE). Thus, bits [127:0] of
Z0..Z31 are preserved. All other bits of Z0..Z31, and all of P0..P15 and FFR
- become unspecified on return from a syscall.
+ become zero on return from a syscall.
* The SVE registers are not used to pass arguments to or receive results from
any syscall.
@@ -126,6 +142,11 @@ the SVE instruction set architecture.
are only present in fpsimd_context. For convenience, the content of V0..V31
is duplicated between sve_context and fpsimd_context.
+* The record contains a flags field which includes a flag, SVE_SIG_FLAG_SM,
+ which, if set, indicates that the thread is in streaming mode and that the
+ vector length and register data (if present) describe the streaming SVE
+ data and vector length, as sketched below.
+
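+  An informative sketch of testing this flag from a signal handler, using
+  the <asm/sigcontext.h> definitions (for brevity it does not follow an
+  extra_context record, which a complete implementation would)::
+
+    #include <ucontext.h>
+    #include <asm/sigcontext.h>
+
+    static int frame_is_streaming(ucontext_t *uc)
+    {
+            struct _aarch64_ctx *head =
+                    (struct _aarch64_ctx *)uc->uc_mcontext.__reserved;
+
+            /* Walk the __reserved records until the zero terminator. */
+            for (; head->magic;
+                 head = (struct _aarch64_ctx *)((char *)head + head->size))
+                    if (head->magic == SVE_MAGIC)
+                            return !!(((struct sve_context *)head)->flags &
+                                      SVE_SIG_FLAG_SM);
+
+            return 0;  /* no sve_context record */
+    }
+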
* The signal frame record for SVE always contains basic metadata, in particular
the thread's vector length (in sve_context.vl).
@@ -154,7 +175,7 @@ the SVE instruction set architecture.
When returning from a signal handler:
* If there is no sve_context record in the signal frame, or if the record is
- present but contains no register data as desribed in the previous section,
+ present but contains no register data as described in the previous section,
then the SVE registers/bits become non-live and take unspecified values.
* If sve_context is present in the signal frame and contains full register
@@ -170,6 +191,11 @@ When returning from a signal handler:
the signal frame does not match the current vector length, the signal return
attempt is treated as illegal, resulting in a forced SIGSEGV.
+* It is permitted to enter or leave streaming mode by setting or clearing
+ the SVE_SIG_FLAG_SM flag but applications should take care to ensure that
+ when doing so sve_context.vl and any register data are appropriate for the
+ vector length in the new mode.
+
6. prctl extensions
--------------------
@@ -197,7 +223,7 @@ prctl(PR_SVE_SET_VL, unsigned long arg)
Defer the requested vector length change until the next execve()
performed by this thread.
- The effect is equivalent to implicit exceution of the following
+ The effect is equivalent to implicit execution of the following
call immediately after the next execve() (if any) by the thread:
prctl(PR_SVE_SET_VL, arg & ~PR_SVE_SET_VL_ONEXEC)
@@ -255,7 +281,7 @@ prctl(PR_SVE_GET_VL)
vector length change (which would only normally be the case between a
fork() or vfork() and the corresponding execve() in typical use).
- To extract the vector length from the result, and it with
+ To extract the vector length from the result, bitwise and it with
PR_SVE_VL_LEN_MASK.
Return value: a nonnegative value on success, or a negative value on error:
@@ -265,8 +291,14 @@ prctl(PR_SVE_GET_VL)
7. ptrace extensions
---------------------
-* A new regset NT_ARM_SVE is defined for use with PTRACE_GETREGSET and
- PTRACE_SETREGSET.
+* New regsets NT_ARM_SVE and NT_ARM_SSVE are defined for use with
+ PTRACE_GETREGSET and PTRACE_SETREGSET. NT_ARM_SSVE describes the
+ streaming mode SVE registers and NT_ARM_SVE describes the
+ non-streaming mode SVE registers.
+
+ In this description a register set is referred to as being "live" when
+ the target is in the appropriate streaming or non-streaming mode and is
+ using data beyond the subset shared with the FPSIMD Vn registers.
Refer to [2] for definitions.
@@ -297,7 +329,7 @@ The regset data starts with struct user_sve_header, containing:
flags
- either
+ at most one of
SVE_PT_REGS_FPSIMD
@@ -331,6 +363,10 @@ The regset data starts with struct user_sve_header, containing:
SVE_PT_VL_ONEXEC (SETREGSET only).
+ If neither the FPSIMD nor the SVE flag is provided then no register
+ payload is available; this is only possible when SME is implemented.
+
+
* The effects of changing the vector length and/or flags are equivalent to
those documented for PR_SVE_SET_VL.
@@ -346,6 +382,13 @@ The regset data starts with struct user_sve_header, containing:
case only the vector length and flags are changed (along with any
consequences of those changes).
+* In systems supporting SME, when in streaming mode a GETREGSET for
+ NT_ARM_SVE will return only the user_sve_header with no register data;
+ similarly, a GETREGSET for NT_ARM_SSVE will not return any register data
+ when not in streaming mode.
+
+* A GETREGSET for NT_ARM_SSVE will never return SVE_PT_REGS_FPSIMD.
+
* For SETREGSET, if an SVE_PT_REGS_SVE payload is present and the
requested VL is not supported, the effect will be the same as if the
payload were omitted, except that an EIO error is reported. No
@@ -355,17 +398,25 @@ The regset data starts with struct user_sve_header, containing:
unspecified. It is up to the caller to translate the payload layout
for the actual VL and retry.
+* Where SME is implemented it is not possible to GETREGSET the register
+ state for normal SVE when in streaming mode, nor the streaming mode
+ register state when in normal mode, regardless of the implementation defined
+ behaviour of the hardware for sharing data between the two modes.
+
+* Any SETREGSET of NT_ARM_SVE will exit streaming mode if the target was in
+ streaming mode and any SETREGSET of NT_ARM_SSVE will enter streaming mode
+ if the target was not in streaming mode.
+
* The effect of writing a partial, incomplete payload is unspecified.
8. ELF coredump extensions
---------------------------
-* A NT_ARM_SVE note will be added to each coredump for each thread of the
- dumped process. The contents will be equivalent to the data that would have
- been read if a PTRACE_GETREGSET of NT_ARM_SVE were executed for each thread
- when the coredump was generated.
-
+* NT_ARM_SVE and NT_ARM_SSVE notes will be added to each coredump for
+ each thread of the dumped process. The contents will be equivalent to the
+ data that would have been read if a PTRACE_GETREGSET of the corresponding
+ type were executed for each thread when the coredump was generated.
9. System runtime configuration
--------------------------------
@@ -402,6 +453,24 @@ The regset data starts with struct user_sve_header, containing:
* Modifying the system default vector length does not affect the vector length
of any existing process or thread that does not make an execve() call.
+10. Perf extensions
+--------------------------------
+
+* The arm64 specific DWARF standard [5] added the VG (Vector Granule) register
+ at index 46. This register is used for DWARF unwinding when variable length
+ SVE registers are pushed onto the stack.
+
+* Its value is equivalent to the current SVE vector length (VL) in bits divided
+ by 64.
+
+* The value is included in Perf samples in the regs[46] field if
+ PERF_SAMPLE_REGS_USER is set and the sample_regs_user mask has bit 46 set.
+
+* The value is the current value at the time the sample was taken, and it can
+ change over time.
+
+* If the system doesn't support SVE when perf_event_open is called with these
+ settings, the event will fail to open (an example follows this list).
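+
+An informative sketch of requesting VG via the perf_event_open() syscall,
+with error handling elided::
+
+  #include <linux/perf_event.h>
+  #include <string.h>
+  #include <sys/syscall.h>
+  #include <unistd.h>
+
+  int open_event_with_vg(void)
+  {
+          struct perf_event_attr attr;
+
+          memset(&attr, 0, sizeof(attr));
+          attr.size = sizeof(attr);
+          attr.type = PERF_TYPE_HARDWARE;
+          attr.config = PERF_COUNT_HW_CPU_CYCLES;
+          attr.sample_period = 100000;
+          attr.sample_type = PERF_SAMPLE_REGS_USER;
+          attr.sample_regs_user = 1ULL << 46;  /* VG */
+
+          /* Fails on systems without SVE, as noted above. */
+          return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
+  }
+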
Appendix A. SVE programmer's model (informative)
=================================================
@@ -543,3 +612,5 @@ References
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055c/IHI0055C_beta_aapcs64.pdf
http://infocenter.arm.com/help/topic/com.arm.doc.subset.swdev.abi/index.html
Procedure Call Standard for the ARM 64-bit Architecture (AArch64)
+
+[5] https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst
diff --git a/Documentation/arm64/tagged-address-abi.rst b/Documentation/arm64/tagged-address-abi.rst
index 0c9120ec58ae..540a1d4fc6c9 100644
--- a/Documentation/arm64/tagged-address-abi.rst
+++ b/Documentation/arm64/tagged-address-abi.rst
@@ -49,7 +49,7 @@ how the user addresses are used by the kernel:
- ``brk()``, ``mmap()`` and the ``new_address`` argument to
``mremap()`` as these have the potential to alias with existing
- user addresses.
+ user addresses.
NOTE: This behaviour changed in v5.6 and so some earlier kernels may
incorrectly accept valid tagged pointers for the ``brk()``,
diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt
index 093cdaefdb37..edea4656c5c0 100644
--- a/Documentation/atomic_bitops.txt
+++ b/Documentation/atomic_bitops.txt
@@ -58,13 +58,11 @@ Like with atomic_t, the rule of thumb is:
- RMW operations that have a return value are fully ordered.
- - RMW operations that are conditional are unordered on FAILURE,
- otherwise the above rules apply. In the case of test_and_{}_bit() operations,
- if the bit in memory is unchanged by the operation then it is deemed to have
- failed.
+ - RMW operations that are conditional are fully ordered.
-Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics and
-clear_bit_unlock() which has RELEASE semantics.
+Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics,
+clear_bit_unlock() which has RELEASE semantics and test_bit_acquire() which
+has ACQUIRE semantics.
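+
+For example, clear_bit_unlock() and test_bit_acquire() can pair to publish
+data guarded by a bit (an illustrative sketch only; BUSY names a bit chosen
+by the caller and set before the update began):
+
+  /* publisher */                        /* consumer */
+  obj->data = compute();                 if (!test_bit_acquire(BUSY, &obj->flags))
+  clear_bit_unlock(BUSY, &obj->flags);           use(obj->data);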
Since a platform only has a single means of achieving atomic operations
the same barriers as for atomic_t are used, see atomic_t.txt.
diff --git a/Documentation/atomic_t.txt b/Documentation/atomic_t.txt
index 0f1ffa03db09..d7adc6d543db 100644
--- a/Documentation/atomic_t.txt
+++ b/Documentation/atomic_t.txt
@@ -324,7 +324,7 @@ atomic operations.
Specifically 'simple' cmpxchg() loops are expected to not starve one another
indefinitely. However, this is not evident on LL/SC architectures, because
-while an LL/SC architecure 'can/should/must' provide forward progress
+while an LL/SC architecture 'can/should/must' provide forward progress
guarantees between competing LL/SC sections, such a guarantee does not
transfer to cmpxchg() implemented using LL/SC. Consider:
diff --git a/Documentation/block/biodoc.rst b/Documentation/block/biodoc.rst
deleted file mode 100644
index 2098477851a4..000000000000
--- a/Documentation/block/biodoc.rst
+++ /dev/null
@@ -1,1164 +0,0 @@
-=====================================================
-Notes on the Generic Block Layer Rewrite in Linux 2.5
-=====================================================
-
-.. note::
-
- It seems that there is a lot of outdated stuff here. This seems
- to be written somewhat as a task list. Yet, eventually, something
- here might still be useful.
-
-Notes Written on Jan 15, 2002:
-
- - Jens Axboe <jens.axboe@oracle.com>
- - Suparna Bhattacharya <suparna@in.ibm.com>
-
-Last Updated May 2, 2002
-
-September 2003: Updated I/O Scheduler portions
- - Nick Piggin <npiggin@kernel.dk>
-
-Introduction
-============
-
-These are some notes describing some aspects of the 2.5 block layer in the
-context of the bio rewrite. The idea is to bring out some of the key
-changes and a glimpse of the rationale behind those changes.
-
-Please mail corrections & suggestions to suparna@in.ibm.com.
-
-Credits
-=======
-
-2.5 bio rewrite:
- - Jens Axboe <jens.axboe@oracle.com>
-
-Many aspects of the generic block layer redesign were driven by and evolved
-over discussions, prior patches and the collective experience of several
-people. See sections 8 and 9 for a list of some related references.
-
-The following people helped with review comments and inputs for this
-document:
-
- - Christoph Hellwig <hch@infradead.org>
- - Arjan van de Ven <arjanv@redhat.com>
- - Randy Dunlap <rdunlap@xenotime.net>
- - Andre Hedrick <andre@linux-ide.org>
-
-The following people helped with fixes/contributions to the bio patches
-while it was still work-in-progress:
-
- - David S. Miller <davem@redhat.com>
-
-
-.. Description of Contents:
-
- 1. Scope for tuning of logic to various needs
- 1.1 Tuning based on device or low level driver capabilities
- - Per-queue parameters
- - Highmem I/O support
- - I/O scheduler modularization
- 1.2 Tuning based on high level requirements/capabilities
- 1.2.1 Request Priority/Latency
- 1.3 Direct access/bypass to lower layers for diagnostics and special
- device operations
- 1.3.1 Pre-built commands
- 2. New flexible and generic but minimalist i/o structure or descriptor
- (instead of using buffer heads at the i/o layer)
- 2.1 Requirements/Goals addressed
- 2.2 The bio struct in detail (multi-page io unit)
- 2.3 Changes in the request structure
- 3. Using bios
- 3.1 Setup/teardown (allocation, splitting)
- 3.2 Generic bio helper routines
- 3.2.1 Traversing segments and completion units in a request
- 3.2.2 Setting up DMA scatterlists
- 3.2.3 I/O completion
- 3.2.4 Implications for drivers that do not interpret bios (don't handle
- multiple segments)
- 3.3 I/O submission
- 4. The I/O scheduler
- 5. Scalability related changes
- 5.1 Granular locking: Removal of io_request_lock
- 5.2 Prepare for transition to 64 bit sector_t
- 6. Other Changes/Implications
- 6.1 Partition re-mapping handled by the generic block layer
- 7. A few tips on migration of older drivers
- 8. A list of prior/related/impacted patches/ideas
- 9. Other References/Discussion Threads
-
-
-Bio Notes
-=========
-
-Let us discuss the changes in the context of how some overall goals for the
-block layer are addressed.
-
-1. Scope for tuning the generic logic to satisfy various requirements
-=====================================================================
-
-The block layer design supports adaptable abstractions to handle common
-processing with the ability to tune the logic to an appropriate extent
-depending on the nature of the device and the requirements of the caller.
-One of the objectives of the rewrite was to increase the degree of tunability
-and to enable higher level code to utilize underlying device/driver
-capabilities to the maximum extent for better i/o performance. This is
-important especially in the light of ever improving hardware capabilities
-and application/middleware software designed to take advantage of these
-capabilities.
-
-1.1 Tuning based on low level device / driver capabilities
-----------------------------------------------------------
-
-Sophisticated devices with large built-in caches, intelligent i/o scheduling
-optimizations, high memory DMA support, etc may find some of the
-generic processing an overhead, while for less capable devices the
-generic functionality is essential for performance or correctness reasons.
-Knowledge of some of the capabilities or parameters of the device should be
-used at the generic block layer to take the right decisions on
-behalf of the driver.
-
-How is this achieved?
-
-Tuning at a per-queue level:
-
-i. Per-queue limits/values exported to the generic layer by the driver
-
-Various parameters that the generic i/o scheduler logic uses are set at
-a per-queue level (e.g maximum request size, maximum number of segments in
-a scatter-gather list, logical block size)
-
-Some parameters that were earlier available as global arrays indexed by
-major/minor are now directly associated with the queue. Some of these may
-move into the block device structure in the future. Some characteristics
-have been incorporated into a queue flags field rather than separate fields
-in themselves. There are blk_queue_xxx functions to set the parameters,
-rather than update the fields directly.
-
-Some new queue property settings:
-
- blk_queue_bounce_limit(q, u64 dma_address)
- Enable I/O to highmem pages, dma_address being the
- limit. No highmem default.
-
- blk_queue_max_sectors(q, max_sectors)
- Sets two variables that limit the size of the request.
-
- - The request queue's max_sectors, which is a soft size in
- units of 512 byte sectors, and could be dynamically varied
- by the core kernel.
-
- - The request queue's max_hw_sectors, which is a hard limit
- and reflects the maximum size request a driver can handle
- in units of 512 byte sectors.
-
- The default for both max_sectors and max_hw_sectors is
- 255. The upper limit of max_sectors is 1024.
-
- blk_queue_max_phys_segments(q, max_segments)
- Maximum physical segments you can handle in a request. 128
- default (driver limit). (See 3.2.2)
-
- blk_queue_max_hw_segments(q, max_segments)
- Maximum dma segments the hardware can handle in a request. 128
- default (host adapter limit, after dma remapping).
- (See 3.2.2)
-
- blk_queue_max_segment_size(q, max_seg_size)
- Maximum size of a clustered segment, 64kB default.
-
- blk_queue_logical_block_size(q, logical_block_size)
- Lowest possible sector size that the hardware can operate
- on, 512 bytes default.
-
-New queue flags:
-
- - QUEUE_FLAG_CLUSTER (see 3.2.2)
- - QUEUE_FLAG_QUEUED (see 3.2.4)
-
-
-ii. High-mem i/o capabilities are now considered the default
-
-The generic bounce buffer logic, present in 2.4, where the block layer would
-by default copyin/out i/o requests on high-memory buffers to low-memory buffers
-assuming that the driver wouldn't be able to handle it directly, has been
-changed in 2.5. The bounce logic is now applied only for memory ranges
-for which the device cannot handle i/o. A driver can specify this by
-setting the queue bounce limit for the request queue for the device
-(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
-where a device is capable of handling high memory i/o.
-
-In order to enable high-memory i/o where the device is capable of supporting
-it, the pci dma mapping routines and associated data structures have now been
-modified to accomplish a direct page -> bus translation, without requiring
-a virtual address mapping (unlike the earlier scheme of virtual address
--> bus translation). So this works uniformly for high-memory pages (which
-do not have a corresponding kernel virtual address space mapping) and
-low-memory pages.
-
-Note: Please refer to Documentation/core-api/dma-api-howto.rst for a discussion
-on PCI high mem DMA aspects and mapping of scatter gather lists, and support
-for 64 bit PCI.
-
-Special handling is required only for cases where i/o needs to happen on
-pages at physical memory addresses beyond what the device can support. In these
-cases, a bounce bio representing a buffer from the supported memory range
-is used for performing the i/o with copyin/copyout as needed depending on
-the type of the operation. For example, in case of a read operation, the
-data read has to be copied to the original buffer on i/o completion, so a
-callback routine is set up to do this, while for write, the data is copied
-from the original buffer to the bounce buffer prior to issuing the
-operation. Since an original buffer may be in a high memory area that's not
-mapped in kernel virtual addr, a kmap operation may be required for
-performing the copy, and special care may be needed in the completion path
-as it may not be in irq context. Special care is also required (by way of
-GFP flags) when allocating bounce buffers, to avoid certain highmem
-deadlock possibilities.
-
-It is also possible that a bounce buffer may be allocated from high-memory
-area that's not mapped in kernel virtual addr, but within the range that the
-device can use directly; so the bounce page may need to be kmapped during
-copy operations. [Note: This does not hold in the current implementation,
-though]
-
-There are some situations when pages from high memory may need to
-be kmapped, even if bounce buffers are not necessary. For example a device
-may need to abort DMA operations and revert to PIO for the transfer, in
-which case a virtual mapping of the page is required. For SCSI it is also
-done in some scenarios where the low level driver cannot be trusted to
-handle a single sg entry correctly. The driver is expected to perform the
-kmaps as needed on such occasions as appropriate. A driver could also use
-the blk_queue_bounce() routine on its own to bounce highmem i/o to low
-memory for specific requests if so desired.
-
-iii. The i/o scheduler algorithm itself can be replaced/set as appropriate
-
-As in 2.4, it is possible to plugin a brand new i/o scheduler for a particular
-queue or pick from (copy) existing generic schedulers and replace/override
-certain portions of it. The 2.5 rewrite provides improved modularization
-of the i/o scheduler. There are more pluggable callbacks, e.g for init,
-add request, extract request, which makes it possible to abstract specific
-i/o scheduling algorithm aspects and details outside of the generic loop.
-It also makes it possible to completely hide the implementation details of
-the i/o scheduler from block drivers.
-
-I/O scheduler wrappers are to be used instead of accessing the queue directly.
-See section 4. The I/O scheduler for details.
-
-1.2 Tuning Based on High level code capabilities
-------------------------------------------------
-
-i. Application capabilities for raw i/o
-
-This comes from some of the high-performance database/middleware
-requirements where an application prefers to make its own i/o scheduling
-decisions based on an understanding of the access patterns and i/o
-characteristics
-
-ii. High performance filesystems or other higher level kernel code's
-capabilities
-
-Kernel components like filesystems could also take their own i/o scheduling
-decisions for optimizing performance. Journalling filesystems may need
-some control over i/o ordering.
-
-What kind of support exists at the generic block layer for this?
-
-The flags and rw fields in the bio structure can be used for some tuning
-from above e.g indicating that an i/o is just a readahead request, or priority
-settings (currently unused). As far as user applications are concerned they
-would need an additional mechanism either via open flags or ioctls, or some
-other upper level mechanism to communicate such settings to block.
-
-1.2.1 Request Priority/Latency
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Todo/Under discussion::
-
- Arjan's proposed request priority scheme allows higher levels some broad
- control (high/med/low) over the priority of an i/o request vs other pending
- requests in the queue. For example it allows reads for bringing in an
- executable page on demand to be given a higher priority over pending write
- requests which haven't aged too much on the queue. Potentially this priority
- could even be exposed to applications in some manner, providing higher level
- tunability. Time based aging avoids starvation of lower priority
- requests. Some bits in the bi_opf flags field in the bio structure are
- intended to be used for this priority information.
-
-
-1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode)
------------------------------------------------------------------------
-
-(e.g Diagnostics, Systems Management)
-
-There are situations where high-level code needs to have direct access to
-the low level device capabilities or requires the ability to issue commands
-to the device bypassing some of the intermediate i/o layers.
-These could, for example, be special control commands issued through ioctl
-interfaces, or could be raw read/write commands that stress the drive's
-capabilities for certain kinds of fitness tests. Having direct interfaces at
-multiple levels without having to pass through upper layers makes
-it possible to perform bottom up validation of the i/o path, layer by
-layer, starting from the media.
-
-The normal i/o submission interfaces, e.g submit_bio, could be bypassed
-for specially crafted requests which such ioctl or diagnostics
-interfaces would typically use, and the elevator add_request routine
-can instead be used to directly insert such requests in the queue or preferably
-the blk_do_rq routine can be used to place the request on the queue and
-wait for completion. Alternatively, sometimes the caller might just
-invoke a lower level driver specific interface with the request as a
-parameter.
-
-If the request is a means for passing on special information associated with
-the command, then such information is associated with the request->special
-field (rather than misuse the request->buffer field which is meant for the
-request data buffer's virtual mapping).
-
-For passing request data, the caller must build up a bio descriptor
-representing the concerned memory buffer if the underlying driver interprets
-bio segments or uses the block layer end*request* functions for i/o
-completion. Alternatively one could directly use the request->buffer field to
-specify the virtual address of the buffer, if the driver expects buffer
-addresses passed in this way and ignores bio entries for the request type
-involved. In the latter case, the driver would modify and manage the
-request->buffer, request->sector and request->nr_sectors or
-request->current_nr_sectors fields itself rather than using the block layer
-end_request or end_that_request_first completion interfaces.
-(See 2.3 or Documentation/block/request.rst for a brief explanation of
-the request structure fields)
-
-::
-
- [TBD: end_that_request_last should be usable even in this case;
- Perhaps an end_that_direct_request_first routine could be implemented to make
- handling direct requests easier for such drivers; Also for drivers that
- expect bios, a helper function could be provided for setting up a bio
- corresponding to a data buffer]
-
- <JENS: I dont understand the above, why is end_that_request_first() not
- usable? Or _last for that matter. I must be missing something>
-
- <SUP: What I meant here was that if the request doesn't have a bio, then
- end_that_request_first doesn't modify nr_sectors or current_nr_sectors,
- and hence can't be used for advancing request state settings on the
- completion of partial transfers. The driver has to modify these fields
- directly by hand.
- This is because end_that_request_first only iterates over the bio list,
- and always returns 0 if there are none associated with the request.
- _last works OK in this case, and is not a problem, as I mentioned earlier
- >
-
-1.3.1 Pre-built Commands
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-A request can be created with a pre-built custom command to be sent directly
-to the device. The cmd block in the request structure has room for filling
-in the command bytes. (i.e rq->cmd is now 16 bytes in size, and meant for
-command pre-building, and the type of the request is now indicated
-through rq->flags instead of via rq->cmd)
-
-The request structure flags can be set up to indicate the type of request
-in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC:
-packet command issued via blk_do_rq, REQ_SPECIAL: special request).
-
-It can help to pre-build device commands for requests in advance.
-Drivers can now specify a request prepare function (q->prep_rq_fn) that the
-block layer would invoke to pre-build device commands for a given request,
-or perform other preparatory processing for the request. This routine is
-called by elv_next_request(), i.e. typically just before servicing a request.
-(The prepare function would not be called for requests that have RQF_DONTPREP
-enabled)
-
-Aside:
- Pre-building could possibly even be done early, i.e before placing the
- request on the queue, rather than construct the command on the fly in the
- driver while servicing the request queue when it may affect latencies in
- interrupt context or responsiveness in general. One way to add early
- pre-building would be to do it whenever we fail to merge on a request.
- Now REQ_NOMERGE is set in the request flags to skip this one in the future,
- which means that it will not change before we feed it to the device. So
- the pre-builder hook can be invoked there.
-
-
-2. Flexible and generic but minimalist i/o structure/descriptor
-===============================================================
-
-2.1 Reason for a new structure and requirements addressed
----------------------------------------------------------
-
-Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
-layer, and the low level request structure was associated with a chain of
-buffer heads for a contiguous i/o request. This led to certain inefficiencies
-when it came to large i/o requests and readv/writev style operations, as it
-forced such requests to be broken up into small chunks before being passed
-on to the generic block layer, only to be merged by the i/o scheduler
-when the underlying device was capable of handling the i/o in one shot.
-Also, using the buffer head as an i/o structure for i/os that didn't originate
-from the buffer cache unnecessarily added to the weight of the descriptors
-which were generated for each such chunk.
-
-The following were some of the goals and expectations considered in the
-redesign of the block i/o data structure in 2.5.
-
-1. Should be appropriate as a descriptor for both raw and buffered i/o -
- avoid cache related fields which are irrelevant in the direct/page i/o path,
- or filesystem block size alignment restrictions which may not be relevant
- for raw i/o.
-2. Ability to represent high-memory buffers (which do not have a virtual
- address mapping in kernel address space).
-3. Ability to represent large i/os w/o unnecessarily breaking them up (i.e
- greater than PAGE_SIZE chunks in one shot)
-4. At the same time, ability to retain independent identity of i/os from
- different sources or i/o units requiring individual completion (e.g. for
- latency reasons)
-5. Ability to represent an i/o involving multiple physical memory segments
- (including non-page aligned page fragments, as specified via readv/writev)
- without unnecessarily breaking it up, if the underlying device is capable of
- handling it.
-6. Preferably should be based on a memory descriptor structure that can be
- passed around different types of subsystems or layers, maybe even
- networking, without duplication or extra copies of data/descriptor fields
- themselves in the process
-7. Ability to handle the possibility of splits/merges as the structure passes
- through layered drivers (lvm, md, evms), with minimal overhead.
-
-The solution was to define a new structure (bio) for the block layer,
-instead of using the buffer head structure (bh) directly, the idea being
-avoidance of some associated baggage and limitations. The bio structure
-is uniformly used for all i/o at the block layer; it forms a part of the
-bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
-mapped to bio structures.
-
-2.2 The bio struct
-------------------
-
-The bio structure uses a vector representation pointing to an array of tuples
-of <page, offset, len> to describe the i/o buffer, and has various other
-fields describing i/o parameters and state that needs to be maintained for
-performing the i/o.
-
-Notice that this representation means that a bio has no virtual address
-mapping at all (unlike buffer heads).
-
-::
-
- struct bio_vec {
- struct page *bv_page;
- unsigned short bv_len;
- unsigned short bv_offset;
- };
-
- /*
- * main unit of I/O for the block layer and lower layers (ie drivers)
- */
- struct bio {
- struct bio *bi_next; /* request queue link */
- struct block_device *bi_bdev; /* target device */
- unsigned long bi_flags; /* status, command, etc */
- unsigned long bi_opf; /* low bits: r/w, high: priority */
-
- unsigned int bi_vcnt; /* how many bio_vec's */
- struct bvec_iter bi_iter; /* current index into bio_vec array */
-
- unsigned int bi_size; /* total size in bytes */
- unsigned short bi_hw_segments; /* segments after DMA remapping */
- unsigned int bi_max; /* max bio_vecs we can hold
- used as index into pool */
- struct bio_vec *bi_io_vec; /* the actual vec list */
- bio_end_io_t *bi_end_io; /* bi_end_io (bio) */
- atomic_t bi_cnt; /* pin count: free when it hits zero */
- void *bi_private;
- };
-
-With this multipage bio design:
-
-- Large i/os can be sent down in one go using a bio_vec list consisting
- of an array of <page, offset, len> fragments (similar to the way fragments
- are represented in the zero-copy network code)
-- Splitting of an i/o request across multiple devices (as in the case of
- lvm or raid) is achieved by cloning the bio (where the clone points to
- the same bi_io_vec array, but with the index and size accordingly modified)
-- A linked list of bios is used as before for unrelated merges [#]_ - this
- avoids reallocs and makes independent completions easier to handle.
-- Code that traverses the req list can find all the segments of a bio
- by using rq_for_each_segment. This handles the fact that a request
- has multiple bios, each of which can have multiple segments.
-- Drivers which can't process a large bio in one shot can use the bi_iter
- field to keep track of the next bio_vec entry to process.
- (e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
- [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
- bi_offset and len fields]
-
-.. [#]
-
- unrelated merges -- a request ends up containing two or more bios that
- didn't originate from the same place.
-
-bi_end_io() i/o callback gets called on i/o completion of the entire bio.
-
-At a lower level, drivers build a scatter gather list from the merged bios.
-The scatter gather list is in the form of an array of <page, offset, len>
-entries with their corresponding dma address mappings filled in at the
-appropriate time. As an optimization, contiguous physical pages can be
-covered by a single entry where <page> refers to the first page and <len>
-covers the range of pages (up to 16 contiguous pages could be covered this
-way). There is a helper routine (blk_rq_map_sg) which drivers can use to build
-the sg list.
-
-Note: Right now the only user of bios with more than one page is ll_rw_kio,
-which in turn means that only raw I/O uses it (direct i/o may not work
-right now). The intent however is to enable clustering of pages etc to
-become possible. The pagebuf abstraction layer from SGI also uses multi-page
-bios, but that is currently not included in the stock development kernels.
-The same is true of Andrew Morton's work-in-progress multipage bio writeout
-and readahead patches.
-
-2.3 Changes in the Request Structure
-------------------------------------
-
-The request structure is the structure that gets passed down to low level
-drivers. The block layer make_request function builds up a request structure,
-places it on the queue and invokes the drivers request_fn. The driver makes
-use of block layer helper routine elv_next_request to pull the next request
-off the queue. Control or diagnostic functions might bypass block and directly
-invoke underlying driver entry points passing in a specially constructed
-request structure.
-
-Only some relevant fields (mainly those which changed or may be referred
-to in some of the discussion here) are listed below, not necessarily in
-the order in which they occur in the structure (see include/linux/blkdev.h)
-Refer to Documentation/block/request.rst for details about all the request
-structure fields and a quick reference about the layers which are
-supposed to use or modify those fields::
-
- struct request {
- struct list_head queuelist; /* Not meant to be directly accessed by
- the driver.
- Used by q->elv_next_request_fn
- rq->queue is gone
- */
- .
- .
- unsigned char cmd[16]; /* prebuilt command data block */
- unsigned long flags; /* also includes earlier rq->cmd settings */
- .
- .
- sector_t sector; /* this field is now of type sector_t instead of int
- preparation for 64 bit sectors */
- .
- .
-
- /* Number of scatter-gather DMA addr+len pairs after
- * physical address coalescing is performed.
- */
- unsigned short nr_phys_segments;
-
- /* Number of scatter-gather addr+len pairs after
- * physical and DMA remapping hardware coalescing is performed.
- * This is the number of scatter-gather entries the driver
- * will actually have to deal with after DMA mapping is done.
- */
- unsigned short nr_hw_segments;
-
- /* Various sector counts */
- unsigned long nr_sectors; /* no. of sectors left: driver modifiable */
- unsigned long hard_nr_sectors; /* block internal copy of above */
- unsigned int current_nr_sectors; /* no. of sectors left in the
- current segment:driver modifiable */
- unsigned long hard_cur_sectors; /* block internal copy of the above */
- .
- .
- int tag; /* command tag associated with request */
- void *special; /* same as before */
- char *buffer; /* valid only for low memory buffers up to
- current_nr_sectors */
- .
- .
- struct bio *bio, *biotail; /* bio list instead of bh */
- struct request_list *rl;
- }
-
-See the req_ops and req_flag_bits definitions for an explanation of the various
-flags available. Some bits are used by the block layer or i/o scheduler.
-
-The behaviour of the various sector counts are almost the same as before,
-except that since we have multi-segment bios, current_nr_sectors refers
-to the numbers of sectors in the current segment being processed which could
-be one of the many segments in the current bio (i.e i/o completion unit).
-The nr_sectors value refers to the total number of sectors in the whole
-request that remain to be transferred (no change). The purpose of the
-hard_xxx values is for block to remember these counts every time it hands
-over the request to the driver. These values are updated by block on
-end_that_request_first, i.e. every time the driver completes a part of the
-transfer and invokes block end*request helpers to mark this. The
-driver should not modify these values. The block layer sets up the
-nr_sectors and current_nr_sectors fields (based on the corresponding
-hard_xxx values and the number of bytes transferred) and updates it on
-every transfer that invokes end_that_request_first. It does the same for the
-buffer, bio, bio->bi_iter fields too.
-
-The buffer field is just a virtual address mapping of the current segment
-of the i/o buffer in cases where the buffer resides in low-memory. For high
-memory i/o, this field is not valid and must not be used by drivers.
-
-Code that sets up its own request structures and passes them down to
-a driver needs to be careful about interoperation with the block layer helper
-functions which the driver uses. (Section 1.3)
-
-3. Using bios
-=============
-
-3.1 Setup/Teardown
-------------------
-
-There are routines for managing the allocation, reference counting, and
-freeing of bios (bio_alloc, bio_get, bio_put).
-
-This makes use of Ingo Molnar's mempool implementation, which enables
-subsystems like bio to maintain their own reserve memory pools for guaranteed
-deadlock-free allocations during extreme VM load. For example, the VM
-subsystem makes use of the block layer to writeout dirty pages in order to be
-able to free up memory space, a case which needs careful handling. The
-allocation logic draws from the preallocated emergency reserve in situations
-where it cannot allocate through normal means. If the pool is empty and it
-can wait, then it would trigger action that would help free up memory or
-replenish the pool (without deadlocking) and wait for availability in the pool.
-If it is in IRQ context, and hence not in a position to do this, allocation
-could fail if the pool is empty. In general mempool always first tries to
-perform allocation without having to wait, even if it means digging into the
-pool as long as it is not less than 50% full.
-
-On a free, memory is released to the pool or directly freed depending on
-the current availability in the pool. The mempool interface lets the
-subsystem specify the routines to be used for normal alloc and free. In the
-case of bio, these routines make use of the standard slab allocator.
-
-The caller of bio_alloc is expected to take certain steps to avoid
-deadlocks, e.g. avoid trying to allocate more memory from the pool while
-already holding memory obtained from the pool.
-
-::
-
- [TBD: This is a potential issue, though a rare possibility
- in the bounce bio allocation that happens in the current code, since
- it ends up allocating a second bio from the same pool while
- holding the original bio ]
-
-Memory allocated from the pool should be released back within a limited
-amount of time (in the case of bio, that would be after the i/o is completed).
-This ensures that if part of the pool has been used up, some work (in this
-case i/o) must already be in progress and memory would be available when it
-is over. If allocating from multiple pools in the same code path, the order
-or hierarchy of allocation needs to be consistent, just the way one deals
-with multiple locks.
-
-The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc())
-for a non-clone bio. There are 6 pools set up for different size biovecs,
-so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
-given size from these slabs.
-
-The bio_get() routine may be used to hold an extra reference on a bio prior
-to i/o submission, if the bio fields are likely to be accessed after the
-i/o is issued (since the bio may otherwise get freed in case i/o completion
-happens in the meantime).
-
-The bio_clone_fast() routine may be used to duplicate a bio, where the clone
-shares the bio_vec_list with the original bio (i.e. both point to the
-same bio_vec_list). This would typically be used for splitting i/o requests
-in lvm or md.
-
-3.2 Generic bio helper Routines
--------------------------------
-
-3.2.1 Traversing segments and completion units in a request
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The macro rq_for_each_segment() should be used for traversing the bios
-in the request list (drivers should avoid directly trying to do it
-themselves). Using these helpers should also make it easier to cope
-with block changes in the future.
-
-::
-
- struct bio_vec bvec;
- struct req_iterator iter;
-
- rq_for_each_segment(bvec, rq, iter)
- /* bvec now describes the current segment */
-
-I/O completion callbacks are per-bio rather than per-segment, so drivers
-that traverse bio chains on completion need to keep that in mind. Drivers
-which don't make a distinction between segments and completion units would
-need to be reorganized to support multi-segment bios.
-
-3.2.2 Setting up DMA scatterlists
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The blk_rq_map_sg() helper routine would be used for setting up scatter
-gather lists from a request, so a driver need not do it on its own.
-
- nr_segments = blk_rq_map_sg(q, rq, scatterlist);
-
-The helper routine provides a level of abstraction which makes it easier
-to modify the internals of request to scatterlist conversion down the line
-without breaking drivers. The blk_rq_map_sg routine takes care of several
-things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER
-is set) and correct segment accounting to avoid exceeding the limits which
-the i/o hardware can handle, based on various queue properties.
-
-- Prevents a clustered segment from crossing a 4GB mem boundary
-- Avoids building segments that would exceed the number of physical
- memory segments that the driver can handle (phys_segments) and the
- number that the underlying hardware can handle at once, accounting for
- DMA remapping (hw_segments) (i.e. IOMMU aware limits).
-
-Routines which the low level driver can use to set up the segment limits:
-
-blk_queue_max_hw_segments() : Sets an upper limit of the maximum number of
-hw data segments in a request (i.e. the maximum number of address/length
-pairs the host adapter can actually hand to the device at once)
-
-blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
-of physical data segments in a request (i.e. the largest sized scatter list
-a driver could handle)
-
-3.2.3 I/O completion
-^^^^^^^^^^^^^^^^^^^^
-
-The existing generic block layer helper routines end_request,
-end_that_request_first and end_that_request_last can be used for i/o
-completion (and setting things up so the rest of the i/o or the next
-request can be kicked off) as before. With the introduction of multi-page
-bio support, end_that_request_first requires an additional argument indicating
-the number of sectors completed.
-
-3.2.4 Implications for drivers that do not interpret bios
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-(don't handle multiple segments)
-
-Drivers that do not interpret bios e.g those which do not handle multiple
-segments and do not support i/o into high memory addresses (require bounce
-buffers) and expect only virtually mapped buffers, can access the rq->buffer
-field. As before the driver should use current_nr_sectors to determine the
-size of remaining data in the current segment (that is the maximum it can
-transfer in one go unless it interprets segments), and rely on the block layer
-end_request, or end_that_request_first/last to take care of all accounting
-and transparent mapping of the next bio segment when a segment boundary
-is crossed on completion of a transfer. (The end*request* functions should
-be used only if the request has come down the block/bio path, not for
-direct access requests which only specify rq->buffer without a valid rq->bio)
-
-3.3 I/O Submission
-------------------
-
-The routine submit_bio() is used to submit a single io. Higher level i/o
-routines make use of this:
-
-(a) Buffered i/o:
-
-The routine submit_bh() invokes submit_bio() on a bio corresponding to the
-bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.
-
-(b) Kiobuf i/o (for raw/direct i/o):
-
-The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and
-maps the array to one or more multi-page bios, issuing submit_bio() to
-perform the i/o on each of these.
-
-The embedded bh array in the kiobuf structure has been removed and no
-preallocation of bios is done for kiobufs. [The intent is to remove the
-blocks array as well, but it's currently in there to kludge around direct i/o.]
-Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc.
-
-Todo/Observation:
-
- A single kiobuf structure is assumed to correspond to a contiguous range
- of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
- So right now it wouldn't work for direct i/o on non-contiguous blocks.
- This is to be resolved. The eventual direction is to replace kiobuf
- by kvec's.
-
- Badari Pulavarty has a patch to implement direct i/o correctly using
- bio and kvec.
-
-
-(c) Page i/o:
-
-Todo/Under discussion:
-
- Andrew Morton's multi-page bio patches attempt to issue multi-page
- writeouts (and reads) from the page cache, by directly building up
- large bios for submission completely bypassing the usage of buffer
- heads. This work is still in progress.
-
- Christoph Hellwig had some code that uses bios for page-io (rather than
- bh). This isn't included in bio as yet. Christoph was also working on a
- design for representing virtual/real extents as an entity and modifying
- some of the address space ops interfaces to utilize this abstraction rather
- than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
- abstraction, but intended to be as lightweight as possible).
-
-(d) Direct access i/o:
-
-Direct access requests that do not contain bios would be submitted differently
-as discussed earlier in section 1.3.
-
-Aside:
-
- Kvec i/o:
-
- Ben LaHaise's aio code uses a slightly different structure instead
- of kiobufs, called a kvec_cb. This contains an array of <page, offset, len>
- tuples (very much like the networking code), together with a callback function
- and data pointer. This is embedded into a brw_cb structure when passed
- to brw_kvec_async().
-
- Now it should be possible to directly map these kvecs to a bio. Just as while
- cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec
- array pointer to point to the veclet array in kvecs.
-
- TBD: In order for this to work, some changes are needed in the way multi-page
- bios are handled today. The values of the tuples in such a vector passed in
- from higher level code should not be modified by the block layer in the course
- of its request processing, since that would make it hard for the higher layer
- to continue to use the vector descriptor (kvec) after i/o completes. Instead,
- all such transient state should either be maintained in the request structure,
- and passed on in some way to the endio completion routine.
-
-
-4. The I/O scheduler
-====================
-
-I/O scheduler, a.k.a. elevator, is implemented in two layers. Generic dispatch
-queue and specific I/O schedulers. Unless stated otherwise, elevator is used
-to refer to both parts and I/O scheduler to specific I/O schedulers.
-
-Block layer implements generic dispatch queue in `block/*.c`.
-The generic dispatch queue is responsible for requeueing, handling non-fs
-requests and all other subtleties.
-
-Specific I/O schedulers are responsible for ordering normal filesystem
-requests. They can also choose to delay certain requests to improve
-throughput or whatever purpose. As the plural form indicates, there are
-multiple I/O schedulers. They can be built as modules but at least one should
-be built inside the kernel. Each queue can choose a different one and can also
-change to another one dynamically.
-
-A block layer call to the i/o scheduler follows the convention elv_xxx(). This
-calls elevator_xxx_fn in the elevator switch (block/elevator.c). Oh, xxx
-and xxx might not match exactly, but use your imagination. If an elevator
-doesn't implement a function, the switch does nothing or some minimal house
-keeping work.
-
-4.1. I/O scheduler API
-----------------------
-
-The functions an elevator may implement are: (* are mandatory)
-
-=============================== ================================================
-elevator_merge_fn called to query requests for merge with a bio
-
-elevator_merge_req_fn called when two requests get merged. The one
- which gets merged into the other one will never
- be seen by the I/O scheduler again. IOW, after
- being merged, the request is gone.
-
-elevator_merged_fn called when a request in the scheduler has been
- involved in a merge. It is used in the deadline
- scheduler for example, to reposition the request
- if its sorting order has changed.
-
-elevator_allow_merge_fn called whenever the block layer determines
- that a bio can be merged into an existing
- request safely. The io scheduler may still
- want to stop a merge at this point if it
- results in some sort of conflict internally,
- this hook allows it to do that. Note however
- that two *requests* can still be merged at later
- time. Currently the io scheduler has no way to
- prevent that. It can only learn about the fact
- from elevator_merge_req_fn callback.
-
-elevator_dispatch_fn* fills the dispatch queue with ready requests.
- I/O schedulers are free to postpone requests by
- not filling the dispatch queue unless @force
- is non-zero. Once dispatched, I/O schedulers
- are not allowed to manipulate the requests -
- they belong to generic dispatch queue.
-
-elevator_add_req_fn* called to add a new request into the scheduler
-
-elevator_former_req_fn
-elevator_latter_req_fn These return the request before or after the
- one specified in disk sort order. Used by the
- block layer to find merge possibilities.
-
-elevator_completed_req_fn called when a request is completed.
-
-elevator_set_req_fn
-elevator_put_req_fn Must be used to allocate and free any elevator
- specific storage for a request.
-
-elevator_activate_req_fn Called when device driver first sees a request.
- I/O schedulers can use this callback to
- determine when actual execution of a request
- starts.
-elevator_deactivate_req_fn Called when device driver decides to delay
- a request by requeueing it.
-
-elevator_init_fn*
-elevator_exit_fn Allocate and free any elevator specific storage
- for a queue.
-=============================== ================================================
-
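-As a rough illustration, an i/o scheduler of this era hooks itself up by
-filling in an elevator type structure and registering it (a hedged sketch
-only; the "example_" names are hypothetical, and the exact fields and return
-types varied across kernel versions)::
-
-	static struct elevator_type iosched_example = {
-		.ops = {
-			.elevator_merge_fn	= example_merge,
-			.elevator_dispatch_fn	= example_dispatch,
-			.elevator_add_req_fn	= example_add_request,
-			.elevator_init_fn	= example_init_queue,
-			.elevator_exit_fn	= example_exit_queue,
-		},
-		.elevator_name	= "example",
-		.elevator_owner	= THIS_MODULE,
-	};
-
-	static int __init example_init(void)
-	{
-		return elv_register(&iosched_example);
-	}
-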
-4.2 Request flows seen by I/O schedulers
-----------------------------------------
-
-All requests seen by I/O schedulers strictly follow one of the following three
-flows.
-
- set_req_fn ->
-
- i. add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
- (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
- ii. add_req_fn -> (merged_fn ->)* -> merge_req_fn
- iii. [none]
-
- -> put_req_fn
-
-4.3 I/O scheduler implementation
---------------------------------
-
-The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
-optimal disk scan and request servicing performance (based on generic
-principles and device capabilities), optimized for:
-
-i. improved throughput
-ii. improved latency
-iii. better utilization of h/w & CPU time
-
-Characteristics:
-
-i. Binary tree
-AS and deadline i/o schedulers use red black binary trees for disk position
-sorting and searching, and a fifo linked list for time-based searching. This
-gives good scalability and good availability of information. Requests are
-almost always dispatched in disk sort order, so a cache of the next request
-in sort order is kept to avoid binary tree lookups.
-
-This arrangement is not a generic block layer characteristic, however, so
-elevators may implement queues as they please.
-
-ii. Merge hash
-AS and deadline use a hash table indexed by the last sector of a request. This
-enables merging code to quickly look up "back merge" candidates, even when
-multiple I/O streams are being performed at once on one disk.
-
-"Front merges", a new request being merged at the front of an existing request,
-are far less common than "back merges" due to the nature of most I/O patterns.
-Front merges are handled by the binary trees in AS and deadline schedulers.
-
-iii. Plugging the queue to batch requests in anticipation of opportunities for
- merge/sort optimizations
-
-Plugging is an approach that the current i/o scheduling algorithm resorts to so
-that it collects up enough requests in the queue to be able to take
-advantage of the sorting/merging logic in the elevator. If the
-queue is empty when a request comes in, then it plugs the request queue
-(sort of like plugging the bath tub of a vessel to get fluid to build up)
-till it fills up with a few more requests, before starting to service
-the requests. This provides an opportunity to merge/sort the requests before
-passing them down to the device. There are various conditions under which the
-queue is unplugged (to open up the flow again), either through a scheduled
-task or on demand. For example, wait_on_buffer sets the unplugging going
-through sync_buffer() running blk_run_address_space(mapping). Or the caller
-can do it explicitly through blk_unplug(bdev). So in the read case,
-the queue gets explicitly unplugged as part of waiting for completion on that
-buffer.
-
-Aside:
- This is kind of controversial territory, as it's not clear if plugging is
- always the right thing to do. Devices typically have their own queues,
- and allowing a big queue to build up in software, while letting the device be
- idle for a while may not always make sense. The trick is to handle the fine
- balance between when to plug and when to open up. Also now that we have
- multi-page bios being queued in one shot, we may not need to wait to merge
- a big request from the broken up pieces coming by.
-
-4.4 I/O contexts
-----------------
-
-I/O contexts provide a dynamically allocated per-process data area. They may
-be used by I/O schedulers, and by the block layer (e.g. for I/O statistics or
-priorities). See `*io_context` in block/ll_rw_blk.c, and as-iosched.c for an
-example of usage in an i/o scheduler.
-
-
-5. Scalability related changes
-==============================
-
-5.1 Granular Locking: io_request_lock replaced by a per-queue lock
-------------------------------------------------------------------
-
-The global io_request_lock has been removed as of 2.5, to avoid
-the scalability bottleneck it was causing, and has been replaced by more
-granular locking. The request queue structure has a pointer to the
-lock to be used for that queue. As a result, locking can now be
-per-queue, with a provision for sharing a lock across queues if
-necessary (e.g the scsi layer sets the queue lock pointers to the
-corresponding adapter lock, which results in a per host locking
-granularity). The locking semantics are the same, i.e. locking is
-still imposed by the block layer, grabbing the lock before
-request_fn execution, which means that lots of older drivers
-should still be SMP safe. Drivers are free to drop the queue
-lock themselves, if required. Drivers that explicitly used the
-io_request_lock for serialization need to be modified accordingly.
-Usually it's as easy as adding a global lock::
-
- static DEFINE_SPINLOCK(my_driver_lock);
-
-and passing the address to that lock to blk_init_queue().
-
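-For example, a driver built around this old-style interface then passes the
-lock when setting up its queue (a sketch only; my_request_fn is a
-hypothetical driver request function)::
-
-	q = blk_init_queue(my_request_fn, &my_driver_lock);
-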
-5.2 64 bit sector numbers (sector_t prepares for 64 bit support)
-----------------------------------------------------------------
-
-The sector number used in the bio structure has been changed to sector_t,
-which could be defined as 64 bit in preparation for 64 bit sector support.
-
-6. Other Changes/Implications
-=============================
-
-6.1 Partition re-mapping handled by the generic block layer
------------------------------------------------------------
-
-In 2.5 some of the gendisk/partition related code has been reorganized.
-Now the generic block layer performs partition-remapping early and thus
-provides drivers with a sector number relative to the whole device, rather
-than having to take the partition number into account in order to arrive at
-the true sector number. The routine blk_partition_remap() is invoked by
-submit_bio_noacct even before invoking the queue specific ->submit_bio,
-so the i/o scheduler also gets to operate on whole disk sector numbers. This
-should typically not require changes to block drivers; a driver simply never
-gets to invoke its own partition sector offset calculations, since all bios
-sent down are offset from the beginning of the device.
-
-
-7. A Few Tips on Migration of older drivers
-===========================================
-
-Old-style drivers that just use CURRENT and ignore clustered requests
-may not need much change. The generic layer will automatically handle
-clustered requests, multi-page bios, etc. for the driver.
-
-For a low-performance driver, or hardware that is PIO driven or just doesn't
-support scatter-gather, changes should be minimal too.
-
-The following are some points to keep in mind when converting old drivers
-to bio.
-
-Drivers should use elv_next_request to pick up requests and are no longer
-supposed to handle looping directly over the request list.
-(struct request->queue has been removed)
-
-Now end_that_request_first takes an additional number_of_sectors argument.
-It used to always handle just the first buffer_head in a request; now
-it will loop and handle as many sectors (on a bio-segment granularity)
-as specified.
-
-Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
-right thing to use is bio_endio(bio) instead.
-
-If the driver is dropping the io_request_lock from its request_fn strategy,
-then it just needs to replace that with q->queue_lock instead.
-
-As described in Sec 1.1, drivers can set max sector size, max segment size,
-etc. per queue now. Drivers that used to define their own merge functions
-to handle things like this can now just use the blk_queue_* functions at
-blk_init_queue time.
-
-Drivers no longer have to map a {partition, sector offset} into the
-correct absolute location; this is done by the block layer, so where
-a driver previously received a request like this::
-
- rq->rq_dev = mk_kdev(3, 5); /* /dev/hda5 */
- rq->sector = 0; /* first sector on hda5 */
-
-it will now see::
-
- rq->rq_dev = mk_kdev(3, 0); /* /dev/hda */
- rq->sector = 123128; /* offset from start of disk */
-
-As mentioned, there is no virtual mapping of a bio. For DMA, this is
-not a problem as the driver probably never will need a virtual mapping.
-Instead it needs a bus mapping (dma_map_page for a single segment or
-dma_map_sg for scatter-gather) to be able to ship the data to the device. For
-PIO drivers (or drivers that need to revert to PIO transfer once in a
-while (IDE for example)), where the CPU is doing the actual data
-transfer, a virtual mapping is needed. If the driver supports highmem I/O
-(Sec 1.1, (ii)) it needs to use kmap_atomic or similar to temporarily map
-a bio into the virtual address space.
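-
-As a hedged sketch in the 2.5/2.6 idiom this document describes (the KM_USER0
-kmap slot argument was dropped in later kernels), a PIO driver might map and
-transfer the current bio segment roughly like this::
-
-	struct bio_vec *bvec = bio_iovec_idx(bio, bio->bi_idx);
-	char *buf = kmap_atomic(bvec->bv_page, KM_USER0) + bvec->bv_offset;
-
-	/* CPU-driven (PIO) transfer of bvec->bv_len bytes to/from buf */
-
-	kunmap_atomic(buf - bvec->bv_offset, KM_USER0);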
-
-
-8. Prior/Related/Impacted patches
-=================================
-
-8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
------------------------------------------------------
-
-- orig kiobuf & raw i/o patches (now in 2.4 tree)
-- direct kiobuf based i/o to devices (no intermediate bh's)
-- page i/o using kiobuf
-- kiobuf splitting for lvm (mkp)
-- elevator support for kiobuf request merging (axboe)
-
-8.2. Zero-copy networking (Dave Miller)
----------------------------------------
-
-8.3. SGI XFS - pagebuf patches - use of kiobufs
------------------------------------------------
-8.4. Multi-page pioent patch for bio (Christoph Hellwig)
---------------------------------------------------------
-8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
---------------------------------------------------------------------
-8.6. Async i/o implementation patch (Ben LaHaise)
--------------------------------------------------
-8.7. EVMS layering design (IBM EVMS team)
------------------------------------------
-8.8. Larger page cache size patch (Ben LaHaise) and Large page size (Daniel Phillips)
--------------------------------------------------------------------------------------
-
- => larger contiguous physical memory buffers
-
-8.9. VM reservations patch (Ben LaHaise)
-----------------------------------------
-8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
-----------------------------------------------------------
-8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
----------------------------------------------------------------------------
-8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar, Badari)
--------------------------------------------------------------------------------
-8.13 Priority based i/o scheduler - prepatches (Arjan van de Ven)
-------------------------------------------------------------------
-8.14 IDE Taskfile i/o patch (Andre Hedrick)
---------------------------------------------
-8.15 Multi-page writeout and readahead patches (Andrew Morton)
----------------------------------------------------------------
-8.16 Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarty)
------------------------------------------------------------------------
-
-9. Other References
-===================
-
-9.1 The Splice I/O Model
-------------------------
-
-Larry McVoy (and subsequent discussions on lkml, and Linus' comments) - Jan 2001
-
-9.2 Discussions about kiobuf and bh design
-------------------------------------------
-
-On lkml between sct, linus, alan et al - Feb-March 2001 (many of the
-initial thoughts that led to bio were brought up in this discussion thread)
-
-9.3 Discussions on mempool on lkml - Dec 2001.
-----------------------------------------------
diff --git a/Documentation/block/capability.rst b/Documentation/block/capability.rst
deleted file mode 100644
index 160a5148b915..000000000000
--- a/Documentation/block/capability.rst
+++ /dev/null
@@ -1,10 +0,0 @@
-===============================
-Generic Block Device Capability
-===============================
-
-This file documents the sysfs file ``block/<disk>/capability``.
-
-``capability`` is a bitfield, printed in hexadecimal, indicating which
-capabilities a specific block device supports:
-
-.. kernel-doc:: include/linux/genhd.h
diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst
index 86dcf7159f99..9fea696f9daa 100644
--- a/Documentation/block/index.rst
+++ b/Documentation/block/index.rst
@@ -8,10 +8,8 @@ Block
:maxdepth: 1
bfq-iosched
- biodoc
biovecs
blk-mq
- capability
cmdline-partition
data-integrity
deadline-iosched
@@ -20,8 +18,7 @@ Block
kyber-iosched
null_blk
pr
- queue-sysfs
- request
stat
switching-sched
writeback_cache_control
+ ublk
diff --git a/Documentation/block/inline-encryption.rst b/Documentation/block/inline-encryption.rst
index 7f9b40d6b416..90b733422ed4 100644
--- a/Documentation/block/inline-encryption.rst
+++ b/Documentation/block/inline-encryption.rst
@@ -1,5 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
+.. _inline_encryption:
+
=================
Inline Encryption
=================
@@ -7,230 +9,268 @@ Inline Encryption
Background
==========
-Inline encryption hardware sits logically between memory and the disk, and can
-en/decrypt data as it goes in/out of the disk. Inline encryption hardware has a
-fixed number of "keyslots" - slots into which encryption contexts (i.e. the
-encryption key, encryption algorithm, data unit size) can be programmed by the
-kernel at any time. Each request sent to the disk can be tagged with the index
-of a keyslot (and also a data unit number to act as an encryption tweak), and
-the inline encryption hardware will en/decrypt the data in the request with the
-encryption context programmed into that keyslot. This is very different from
-full disk encryption solutions like self encrypting drives/TCG OPAL/ATA
-Security standards, since with inline encryption, any block on disk could be
-encrypted with any encryption context the kernel chooses.
-
+Inline encryption hardware sits logically between memory and disk, and can
+en/decrypt data as it goes in/out of the disk. For each I/O request, software
+can control exactly how the inline encryption hardware will en/decrypt the data
+in terms of key, algorithm, data unit size (the granularity of en/decryption),
+and data unit number (a value that determines the initialization vector(s)).
+
+Some inline encryption hardware accepts all encryption parameters including raw
+keys directly in low-level I/O requests. However, most inline encryption
+hardware instead has a fixed number of "keyslots" and requires that the key,
+algorithm, and data unit size first be programmed into a keyslot. Each
+low-level I/O request then just contains a keyslot index and data unit number.
+
+Note that inline encryption hardware is very different from traditional crypto
+accelerators, which are supported through the kernel crypto API. Traditional
+crypto accelerators operate on memory regions, whereas inline encryption
+hardware operates on I/O requests. Thus, inline encryption hardware needs to be
+managed by the block layer, not the kernel crypto API.
+
+Inline encryption hardware is also very different from "self-encrypting drives",
+such as those based on the TCG Opal or ATA Security standards. Self-encrypting
+drives don't provide fine-grained control of encryption and provide no way to
+verify the correctness of the resulting ciphertext. Inline encryption hardware
+provides fine-grained control of encryption, including the choice of key and
+initialization vector for each sector, and can be tested for correctness.
Objective
=========
-We want to support inline encryption (IE) in the kernel.
-To allow for testing, we also want a crypto API fallback when actual
-IE hardware is absent. We also want IE to work with layered devices
-like dm and loopback (i.e. we want to be able to use the IE hardware
-of the underlying devices if present, or else fall back to crypto API
-en/decryption).
-
+We want to support inline encryption in the kernel. To make testing easier, we
+also want support for falling back to the kernel crypto API when actual inline
+encryption hardware is absent. We also want inline encryption to work with
+layered devices like device-mapper and loopback (i.e. we want to be able to use
+the inline encryption hardware of the underlying devices if present, or else
+fall back to crypto API en/decryption).
Constraints and notes
=====================
-- IE hardware has a limited number of "keyslots" that can be programmed
- with an encryption context (key, algorithm, data unit size, etc.) at any time.
- One can specify a keyslot in a data request made to the device, and the
- device will en/decrypt the data using the encryption context programmed into
- that specified keyslot. When possible, we want to make multiple requests with
- the same encryption context share the same keyslot.
-
-- We need a way for upper layers like filesystems to specify an encryption
- context to use for en/decrypting a struct bio, and a device driver (like UFS)
- needs to be able to use that encryption context when it processes the bio.
-
-- We need a way for device drivers to expose their inline encryption
- capabilities in a unified way to the upper layers.
-
-
-Design
-======
-
-We add a struct bio_crypt_ctx to struct bio that can
-represent an encryption context, because we need to be able to pass this
-encryption context from the upper layers (like the fs layer) to the
-device driver to act upon.
-
-While IE hardware works on the notion of keyslots, the FS layer has no
-knowledge of keyslots - it simply wants to specify an encryption context to
-use while en/decrypting a bio.
-
-We introduce a keyslot manager (KSM) that handles the translation from
-encryption contexts specified by the FS to keyslots on the IE hardware.
-This KSM also serves as the way IE hardware can expose its capabilities to
-upper layers. The generic mode of operation is: each device driver that wants
-to support IE will construct a KSM and set it up in its struct request_queue.
-Upper layers that want to use IE on this device can then use this KSM in
-the device's struct request_queue to translate an encryption context into
-a keyslot. The presence of the KSM in the request queue shall be used to mean
-that the device supports IE.
-
-The KSM uses refcounts to track which keyslots are idle (either they have no
-encryption context programmed, or there are no in-flight struct bios
-referencing that keyslot). When a new encryption context needs a keyslot, it
-tries to find a keyslot that has already been programmed with the same
-encryption context, and if there is no such keyslot, it evicts the least
-recently used idle keyslot and programs the new encryption context into that
-one. If no idle keyslots are available, then the caller will sleep until there
-is at least one.
-
-
-blk-mq changes, other block layer changes and blk-crypto-fallback
-=================================================================
-
-We add a pointer to a ``bi_crypt_context`` and ``keyslot`` to
-struct request. These will be referred to as the ``crypto fields``
-for the request. This ``keyslot`` is the keyslot into which the
-``bi_crypt_context`` has been programmed in the KSM of the ``request_queue``
-that this request is being sent to.
-
-We introduce ``block/blk-crypto-fallback.c``, which allows upper layers to remain
-blissfully unaware of whether or not real inline encryption hardware is present
-underneath. When a bio is submitted with a target ``request_queue`` that doesn't
-support the encryption context specified with the bio, the block layer will
-en/decrypt the bio with the blk-crypto-fallback.
-
-If the bio is a ``WRITE`` bio, a bounce bio is allocated, and the data in the bio
-is encrypted and stored in the bounce bio - blk-mq will then proceed to process the
-bounce bio as if it were not encrypted at all (except when blk-integrity is
-concerned). ``blk-crypto-fallback`` sets the bounce bio's ``bi_end_io`` to an
-internal function that cleans up the bounce bio and ends the original bio.
-
-If the bio is a ``READ`` bio, the bio's ``bi_end_io`` (and also ``bi_private``)
-is saved and overwritten by ``blk-crypto-fallback`` to
-``bio_crypto_fallback_decrypt_bio``. The bio's ``bi_crypt_context`` is also
-overwritten with ``NULL``, so that to the rest of the stack, the bio looks
-as if it was a regular bio that never had an encryption context specified.
-``bio_crypto_fallback_decrypt_bio`` will decrypt the bio, restore the original
-``bi_end_io`` (and also ``bi_private``) and end the bio again.
-
-Regardless of whether real inline encryption hardware is used or the
+- We need a way for upper layers (e.g. filesystems) to specify an encryption
+ context to use for en/decrypting a bio, and device drivers (e.g. UFSHCD) need
+ to be able to use that encryption context when they process the request.
+ Encryption contexts also introduce constraints on bio merging; the block layer
+ needs to be aware of these constraints.
+
+- Different inline encryption hardware has different supported algorithms,
+ supported data unit sizes, maximum data unit numbers, etc. We call these
+ properties the "crypto capabilities". We need a way for device drivers to
+ advertise crypto capabilities to upper layers in a generic way.
+
+- Inline encryption hardware usually (but not always) requires that keys be
+ programmed into keyslots before being used. Since programming keyslots may be
+ slow and there may not be very many keyslots, we shouldn't just program the
+ key for every I/O request, but rather keep track of which keys are in the
+ keyslots and reuse an already-programmed keyslot when possible.
+
+- Upper layers typically define a specific end-of-life for crypto keys, e.g.
+ when an encrypted directory is locked or when a crypto mapping is torn down.
+ At these times, keys are wiped from memory. We must provide a way for upper
+ layers to also evict keys from any keyslots they are present in.
+
+- When possible, device-mapper devices must be able to pass through the inline
+ encryption support of their underlying devices. However, it doesn't make
+ sense for device-mapper devices to have keyslots themselves.
+
+Basic design
+============
+
+We introduce ``struct blk_crypto_key`` to represent an inline encryption key and
+how it will be used. This includes the actual bytes of the key; the size of the
+key; the algorithm and data unit size the key will be used with; and the number
+of bytes needed to represent the maximum data unit number the key will be used
+with.
+
+We introduce ``struct bio_crypt_ctx`` to represent an encryption context. It
+contains a data unit number and a pointer to a blk_crypto_key. We add pointers
+to a bio_crypt_ctx to ``struct bio`` and ``struct request``; this allows users
+of the block layer (e.g. filesystems) to provide an encryption context when
+creating a bio and have it be passed down the stack for processing by the block
+layer and device drivers. Note that the encryption context doesn't explicitly
+say whether to encrypt or decrypt, as that is implicit from the direction of the
+bio; WRITE means encrypt, and READ means decrypt.
+
+We also introduce ``struct blk_crypto_profile`` to contain all generic inline
+encryption-related state for a particular inline encryption device. The
+blk_crypto_profile serves as the way that drivers for inline encryption hardware
+advertise their crypto capabilities and provide certain functions (e.g.,
+functions to program and evict keys) to upper layers. Each device driver that
+wants to support inline encryption will construct a blk_crypto_profile, then
+associate it with the disk's request_queue.
+
+The blk_crypto_profile also manages the hardware's keyslots, when applicable.
+This happens in the block layer, so that users of the block layer can just
+specify encryption contexts and don't need to know about keyslots at all, nor do
+device drivers need to care about most details of keyslot management.
+
+Specifically, for each keyslot, the block layer (via the blk_crypto_profile)
+keeps track of which blk_crypto_key that keyslot contains (if any), and how many
+in-flight I/O requests are using it. When the block layer creates a
+``struct request`` for a bio that has an encryption context, it grabs a keyslot
+that already contains the key if possible. Otherwise it waits for an idle
+keyslot (a keyslot that isn't in-use by any I/O), then programs the key into the
+least-recently-used idle keyslot using the function the device driver provided.
+In both cases, the resulting keyslot is stored in the ``crypt_keyslot`` field of
+the request, where it is then accessible to device drivers and is released after
+the request completes.
+
+``struct request`` also contains a pointer to the original bio_crypt_ctx.
+Requests can be built from multiple bios, and the block layer must take the
+encryption context into account when trying to merge bios and requests. For two
+bios/requests to be merged, they must have compatible encryption contexts: both
+unencrypted, or both encrypted with the same key and contiguous data unit
+numbers. Only the encryption context for the first bio in a request is
+retained, since the remaining bios have been verified to be merge-compatible
+with the first bio.
+
+To make it possible for inline encryption to work with request_queue based
+layered devices, when a request is cloned, its encryption context is cloned as
+well. When the cloned request is submitted, it is then processed as usual; this
+includes getting a keyslot from the clone's target device if needed.
+
+blk-crypto-fallback
+===================
+
+It is desirable for the inline encryption support of upper layers (e.g.
+filesystems) to be testable without real inline encryption hardware, and
+likewise for the block layer's keyslot management logic. It is also desirable
+to allow upper layers to just always use inline encryption rather than have to
+implement encryption in multiple ways.
+
+Therefore, we also introduce *blk-crypto-fallback*, which is an implementation
+of inline encryption using the kernel crypto API. blk-crypto-fallback is built
+into the block layer, so it works on any block device without any special setup.
+Essentially, when a bio with an encryption context is submitted to a
+block_device that doesn't support that encryption context, the block layer will
+handle en/decryption of the bio using blk-crypto-fallback.
+
+For encryption, the data cannot be encrypted in-place, as callers usually rely
+on it being unmodified. Instead, blk-crypto-fallback allocates bounce pages,
+fills a new bio with those bounce pages, encrypts the data into those bounce
+pages, and submits that "bounce" bio. When the bounce bio completes,
+blk-crypto-fallback completes the original bio. If the original bio is too
+large, multiple bounce bios may be required; see the code for details.
+
+For decryption, blk-crypto-fallback "wraps" the bio's completion callback
+(``bi_end_io``) and private data (``bi_private``) with its own, unsets the
+bio's encryption context, then submits the bio. If the read completes
+successfully, blk-crypto-fallback restores the bio's original completion
+callback and private data, then decrypts the bio's data in-place using the
+kernel crypto API. Decryption happens from a workqueue, as it may sleep.
+Afterwards, blk-crypto-fallback completes the bio.
+
+In both cases, the bios that blk-crypto-fallback submits no longer have an
+encryption context. Therefore, lower layers only see standard unencrypted I/O.
+
+blk-crypto-fallback also defines its own blk_crypto_profile and has its own
+"keyslots"; its keyslots contain ``struct crypto_skcipher`` objects. The reason
+for this is twofold. First, it allows the keyslot management logic to be tested
+without actual inline encryption hardware. Second, similar to actual inline
+encryption hardware, the crypto API doesn't accept keys directly in requests but
+rather requires that keys be set ahead of time, and setting keys can be
+expensive; moreover, allocating a crypto_skcipher can't happen on the I/O path
+at all due to the locks it takes. Therefore, the concept of keyslots still
+makes sense for blk-crypto-fallback.
+
+Note that regardless of whether real inline encryption hardware or
blk-crypto-fallback is used, the ciphertext written to disk (and hence the
-on-disk format of data) will be the same (assuming the hardware's implementation
-of the algorithm being used adheres to spec and functions correctly).
-
-If a ``request queue``'s inline encryption hardware claimed to support the
-encryption context specified with a bio, then it will not be handled by the
-``blk-crypto-fallback``. We will eventually reach a point in blk-mq when a
-struct request needs to be allocated for that bio. At that point,
-blk-mq tries to program the encryption context into the ``request_queue``'s
-keyslot_manager, and obtain a keyslot, which it stores in its newly added
-``keyslot`` field. This keyslot is released when the request is completed.
-
-When the first bio is added to a request, ``blk_crypto_rq_bio_prep`` is called,
-which sets the request's ``crypt_ctx`` to a copy of the bio's
-``bi_crypt_context``. bio_crypt_do_front_merge is called whenever a subsequent
-bio is merged to the front of the request, which updates the ``crypt_ctx`` of
-the request so that it matches the newly merged bio's ``bi_crypt_context``. In
-particular, the request keeps a copy of the ``bi_crypt_context`` of the first
-bio in its bio-list (blk-mq needs to be careful to maintain this invariant
-during bio and request merges).
-
-To make it possible for inline encryption to work with request queue based
-layered devices, when a request is cloned, its ``crypto fields`` are cloned as
-well. When the cloned request is submitted, blk-mq programs the
-``bi_crypt_context`` of the request into the clone's request_queue's keyslot
-manager, and stores the returned keyslot in the clone's ``keyslot``.
+on-disk format of data) will be the same (assuming that both the inline
+encryption hardware's implementation and the kernel crypto API's implementation
+of the algorithm being used adhere to spec and function correctly).
+blk-crypto-fallback is optional and is controlled by the
+``CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK`` kernel configuration option.
API presented to users of the block layer
=========================================
-``struct blk_crypto_key`` represents a crypto key (the raw key, size of the
-key, the crypto algorithm to use, the data unit size to use, and the number of
-bytes required to represent data unit numbers that will be specified with the
-``bi_crypt_context``).
-
-``blk_crypto_init_key`` allows upper layers to initialize such a
-``blk_crypto_key``.
-
-``bio_crypt_set_ctx`` should be called on any bio that a user of
-the block layer wants en/decrypted via inline encryption (or the
-blk-crypto-fallback, if hardware support isn't available for the desired
-crypto configuration). This function takes the ``blk_crypto_key`` and the
-data unit number (DUN) to use when en/decrypting the bio.
-
-``blk_crypto_config_supported`` allows upper layers to query whether or not
-an encryption context passed to a request queue can be handled by blk-crypto
-(either by real inline encryption hardware, or by the blk-crypto-fallback).
-This is useful e.g. when blk-crypto-fallback is disabled, and the upper layer
-wants to use an algorithm that may not be supported by hardware - this
-function lets the upper layer know ahead of time that the algorithm isn't
-supported, and the upper layer can fall back to something else if appropriate.
-
-``blk_crypto_start_using_key`` - Upper layers must call this function on
-``blk_crypto_key`` and a ``request_queue`` before using the key with any bio
-headed for that ``request_queue``. This function ensures that either the
-hardware supports the key's crypto settings, or the crypto API fallback has
-transforms for the needed mode allocated and ready to go. Note that this
-function may allocate an ``skcipher``, and must not be called from the data
-path, since allocating ``skciphers`` from the data path can deadlock.
-
-``blk_crypto_evict_key`` *must* be called by upper layers before a
-``blk_crypto_key`` is freed. Further, it *must* be called only once
-there are no more in-flight requests that use that ``blk_crypto_key``.
-``blk_crypto_evict_key`` will ensure that a key is removed from any keyslots
-in inline encryption hardware that the key might have been programmed into
-(or from the blk-crypto-fallback).
+``blk_crypto_config_supported()`` allows users to check ahead of time whether
+inline encryption with particular crypto settings will work on a particular
+block_device -- either via hardware or via blk-crypto-fallback. This function
+takes in a ``struct blk_crypto_config`` which is like blk_crypto_key, but omits
+the actual bytes of the key and instead just contains the algorithm, data unit
+size, etc. This function can be useful if blk-crypto-fallback is disabled.
+
+``blk_crypto_init_key()`` allows users to initialize a blk_crypto_key.
+
+Users must call ``blk_crypto_start_using_key()`` before actually starting to use
+a blk_crypto_key on a block_device (even if ``blk_crypto_config_supported()``
+was called earlier). This is needed to initialize blk-crypto-fallback if it
+will be needed. This must not be called from the data path, as this may have to
+allocate resources, which may deadlock in that case.
+
+Next, to attach an encryption context to a bio, users should call
+``bio_crypt_set_ctx()``. This function allocates a bio_crypt_ctx and attaches
+it to a bio, given the blk_crypto_key and the data unit number that will be used
+for en/decryption. Users don't need to worry about freeing the bio_crypt_ctx
+later, as that happens automatically when the bio is freed or reset.
+
+Finally, when done using inline encryption with a blk_crypto_key on a
+block_device, users must call ``blk_crypto_evict_key()``. This ensures that
+the key is evicted from all keyslots it may be programmed into and unlinked from
+any kernel data structures it may be linked into.
+
+In summary, for users of the block layer, the lifecycle of a blk_crypto_key is
+as follows:
+
+1. ``blk_crypto_config_supported()`` (optional)
+2. ``blk_crypto_init_key()``
+3. ``blk_crypto_start_using_key()``
+4. ``bio_crypt_set_ctx()`` (potentially many times)
+5. ``blk_crypto_evict_key()`` (after all I/O has completed)
+6. Zeroize the blk_crypto_key (this has no dedicated function)
+
+If a blk_crypto_key is being used on multiple block_devices, then
+``blk_crypto_config_supported()`` (if used), ``blk_crypto_start_using_key()``,
+and ``blk_crypto_evict_key()`` must be called on each block_device.
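+
+As a rough illustration of this lifecycle (a sketch only; error handling is
+abbreviated, ``raw_key``, ``bdev`` and ``bio`` are assumed to come from the
+surrounding context, and exact signatures vary across kernel versions)::
+
+	struct blk_crypto_key key;
+	u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE] = { 0 };
+	int err;
+
+	err = blk_crypto_init_key(&key, raw_key,
+				  BLK_ENCRYPTION_MODE_AES_256_XTS,
+				  8, 4096);
+	if (!err)
+		err = blk_crypto_start_using_key(bdev, &key);
+	if (err)
+		return err;
+
+	bio_crypt_set_ctx(bio, &key, dun, GFP_NOIO);
+	submit_bio(bio);
+
+	/* ... after all I/O using the key has completed ... */
+	blk_crypto_evict_key(bdev, &key);
+	memzero_explicit(&key, sizeof(key));
+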
API presented to device drivers
===============================
-A :c:type:``struct blk_keyslot_manager`` should be set up by device drivers in
-the ``request_queue`` of the device. The device driver needs to call
-``blk_ksm_init`` (or its resource-managed variant ``devm_blk_ksm_init``) on the
-``blk_keyslot_manager``, while specifying the number of keyslots supported by
-the hardware.
-
-The device driver also needs to tell the KSM how to actually manipulate the
-IE hardware in the device to do things like programming a crypto key into
-a particular keyslot. All this is achieved through the
-struct blk_ksm_ll_ops field in the KSM that the device driver
-must fill in after initializing the ``blk_keyslot_manager``.
-
-The KSM also handles runtime power management for the device when applicable
-(e.g. when it wants to program a crypto key into the IE hardware, the device
-must be runtime powered on) - so the device driver must also set the ``dev``
-field in the ksm to point to the `struct device` for the KSM to use for runtime
-power management.
-
-``blk_ksm_reprogram_all_keys`` can be called by device drivers if the device
-needs each and every one of its keyslots to be reprogrammed with the key it
-"should have" at the point in time when the function is called. This is useful
-e.g. if a device loses all its keys on runtime power down/up.
-
-If the driver used ``blk_ksm_init`` instead of ``devm_blk_ksm_init``, then
-``blk_ksm_destroy`` should be called to free up all resources used by a
-``blk_keyslot_manager`` once it is no longer needed.
+A device driver that wants to support inline encryption must set up a
+blk_crypto_profile in the request_queue of its device. To do this, it first
+must call ``blk_crypto_profile_init()`` (or its resource-managed variant
+``devm_blk_crypto_profile_init()``), providing the number of keyslots.
+
+Next, it must advertise its crypto capabilities by setting fields in the
+blk_crypto_profile, e.g. ``modes_supported`` and ``max_dun_bytes_supported``.
+
+It then must set function pointers in the ``ll_ops`` field of the
+blk_crypto_profile to tell upper layers how to control the inline encryption
+hardware, e.g. how to program and evict keyslots. Most drivers will need to
+implement ``keyslot_program`` and ``keyslot_evict``. For details, see the
+comments for ``struct blk_crypto_ll_ops``.
+
+Once the driver registers a blk_crypto_profile with a request_queue, I/O
+requests the driver receives via that queue may have an encryption context. All
+encryption contexts will be compatible with the crypto capabilities declared in
+the blk_crypto_profile, so drivers don't need to worry about handling
+unsupported requests. Also, if a nonzero number of keyslots was declared in the
+blk_crypto_profile, then all I/O requests that have an encryption context will
+also have a keyslot which was already programmed with the appropriate key.
+
+If the driver implements runtime suspend and its blk_crypto_ll_ops don't work
+while the device is runtime-suspended, then the driver must also set the ``dev``
+field of the blk_crypto_profile to point to the ``struct device`` that will be
+resumed before any of the low-level operations are called.
+
+If there are situations where the inline encryption hardware loses the contents
+of its keyslots, e.g. device resets, the driver must handle reprogramming the
+keyslots. To do this, the driver may call ``blk_crypto_reprogram_all_keys()``.
+
+Finally, if the driver used ``blk_crypto_profile_init()`` instead of
+``devm_blk_crypto_profile_init()``, then it is responsible for calling
+``blk_crypto_profile_destroy()`` when the crypto profile is no longer needed.
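+
+Putting this together, a driver's setup might look roughly like the following
+(a hedged sketch; the ``my_`` names are hypothetical, and the exact fields
+and helpers vary across kernel versions)::
+
+	static int my_keyslot_program(struct blk_crypto_profile *profile,
+				      const struct blk_crypto_key *key,
+				      unsigned int slot)
+	{
+		/* program the raw key bytes into hardware keyslot @slot */
+		return 0;
+	}
+
+	static int my_keyslot_evict(struct blk_crypto_profile *profile,
+				    const struct blk_crypto_key *key,
+				    unsigned int slot)
+	{
+		/* wipe hardware keyslot @slot */
+		return 0;
+	}
+
+	static const struct blk_crypto_ll_ops my_crypto_ll_ops = {
+		.keyslot_program	= my_keyslot_program,
+		.keyslot_evict		= my_keyslot_evict,
+	};
+
+	/* in the driver's probe routine: */
+	err = devm_blk_crypto_profile_init(dev, &my_profile, num_keyslots);
+	if (err)
+		return err;
+	my_profile.ll_ops = my_crypto_ll_ops;
+	my_profile.max_dun_bytes_supported = 8;
+	/* advertise AES-256-XTS with 4096-byte data units */
+	my_profile.modes_supported[BLK_ENCRYPTION_MODE_AES_256_XTS] |= 4096;
+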
Layered Devices
===============
-Request queue based layered devices like dm-rq that wish to support IE need to
-create their own keyslot manager for their request queue, and expose whatever
-functionality they choose. When a layered device wants to pass a clone of that
-request to another ``request_queue``, blk-crypto will initialize and prepare the
-clone as necessary - see ``blk_crypto_insert_cloned_request`` in
-``blk-crypto.c``.
-
-
-Future Optimizations for layered devices
-========================================
-
-Creating a keyslot manager for a layered device uses up memory for each
-keyslot, and in general, a layered device merely passes the request on to a
-"child" device, so the keyslots in the layered device itself are completely
-unused, and don't need any refcounting or keyslot programming. We can instead
-define a new type of KSM; the "passthrough KSM", that layered devices can use
-to advertise an unlimited number of keyslots, and support for any encryption
-algorithms they choose, while not actually using any memory for each keyslot.
-Another use case for the "passthrough KSM" is for IE devices that do not have a
-limited number of keyslots.
-
+Request queue based layered devices like dm-rq that wish to support inline
+encryption need to create their own blk_crypto_profile for their request_queue,
+and expose whatever functionality they choose. When a layered device wants to
+pass a clone of that request to another request_queue, blk-crypto will
+initialize and prepare the clone as necessary.
Interaction between inline encryption and blk integrity
=======================================================
@@ -257,7 +297,7 @@ Because there isn't any real hardware yet, it seems prudent to assume that
hardware implementations might not implement both features together correctly,
and disallow the combination for now. Whenever a device supports integrity, the
kernel will pretend that the device does not support hardware inline encryption
-(by essentially setting the keyslot manager in the request_queue of the device
-to NULL). When the crypto API fallback is enabled, this means that all bios with
-and encryption context will use the fallback, and IO will complete as usual.
-When the fallback is disabled, a bio with an encryption context will be failed.
+(by setting the blk_crypto_profile in the request_queue of the device to NULL).
+When the crypto API fallback is enabled, this means that all bios with an
+encryption context will use the fallback, and IO will complete as usual. When
+the fallback is disabled, a bio with an encryption context will be failed.
diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst
index edbbab2f12f8..4dd78f24d10a 100644
--- a/Documentation/block/null_blk.rst
+++ b/Documentation/block/null_blk.rst
@@ -72,6 +72,28 @@ submit_queues=[1..nr_cpus]: Default: 1
hw_queue_depth=[0..qdepth]: Default: 64
The hardware queue depth of the device.
+memory_backed=[0/1]: Default: 0
+ Whether or not to use a memory buffer to respond to IO requests
+
+ = =============================================
+ 0 Transfer no data in response to IO requests
+ 1 Use a memory buffer to respond to IO requests
+ = =============================================
+
+discard=[0/1]: Default: 0
+ Support discard operations (requires memory-backed null_blk device).
+
+ = =====================================
+ 0 Do not support discard operations
+ 1 Enable support for discard operations
+ = =====================================
+
+cache_size=[Size in MB]: Default: 0
+ Cache size in MB for memory-backed device.
+
+mbps=[Maximum bandwidth in MB/s]: Default: 0 (no limit)
+ Bandwidth limit for device performance.
+
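+For example, a throttled, memory-backed null_blk device with discard support
+might be created with (an illustrative invocation using only the parameters
+documented above)::
+
+	modprobe null_blk memory_backed=1 discard=1 cache_size=64 mbps=100
+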
Multi-queue specific parameters
-------------------------------
diff --git a/Documentation/block/queue-sysfs.rst b/Documentation/block/queue-sysfs.rst
deleted file mode 100644
index 4dc7f0d499a8..000000000000
--- a/Documentation/block/queue-sysfs.rst
+++ /dev/null
@@ -1,289 +0,0 @@
-=================
-Queue sysfs files
-=================
-
-This text file will detail the queue files that are located in the sysfs tree
-for each block device. Note that stacked devices typically do not export
-any settings, since their queue merely functions as a remapping target.
-These files are the ones found in the /sys/block/xxx/queue/ directory.
-
-Files denoted with a RO postfix are readonly and the RW postfix means
-read-write.
-
-add_random (RW)
----------------
-This file allows one to turn off the disk entropy contribution. The default
-value of this file is '1' (on).
-
-chunk_sectors (RO)
-------------------
-This has different meaning depending on the type of the block device.
-For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors
-of the RAID volume stripe segment. For a zoned block device, either host-aware
-or host-managed, chunk_sectors indicates the size in 512B sectors of the zones
-of the device, with the eventual exception of the last zone of the device which
-may be smaller.
-
-dax (RO)
---------
-This file indicates whether the device supports Direct Access (DAX),
-used by CPU-addressable storage to bypass the pagecache. It shows '1'
-if true, '0' if not.
-
-discard_granularity (RO)
-------------------------
-This shows the size of the internal allocation unit of the device in bytes,
-if reported by the device. A value of '0' means the device does not support
-the discard functionality.
-
-discard_max_hw_bytes (RO)
--------------------------
-Devices that support discard functionality may have internal limits on
-the number of bytes that can be trimmed or unmapped in a single operation.
-The discard_max_hw_bytes parameter is set by the device driver to the
-maximum number of bytes that can be discarded in a single operation.
-Discard requests issued to the device must not exceed this limit. A
-discard_max_hw_bytes value of 0 means that the device does not support
-discard functionality.
-
-discard_max_bytes (RW)
-----------------------
-While discard_max_hw_bytes is the hardware limit for the device, this
-setting is the software limit. Some devices exhibit large latencies when
-large discards are issued; setting this value lower will make Linux issue
-smaller discards and potentially help reduce latencies induced by large
-discard operations.
-
-discard_zeroes_data (RO)
-------------------------
-Obsolete. Always zero.
-
-fua (RO)
---------
-Whether or not the block driver supports the FUA flag for write requests.
-FUA stands for Force Unit Access. If the FUA flag is set that means that
-write requests must bypass the volatile cache of the storage device.
-
-hw_sector_size (RO)
--------------------
-This is the hardware sector size of the device, in bytes.
-
-io_poll (RW)
-------------
-When read, this file shows whether polling is enabled (1) or disabled
-(0). Writing '0' to this file will disable polling for this device.
-Writing any non-zero value will enable this feature.
-
-io_poll_delay (RW)
-------------------
-If polling is enabled, this controls what kind of polling will be
-performed. It defaults to -1, which is classic polling. In this mode,
-the CPU will repeatedly ask for completions without giving up any time.
-If set to 0, a hybrid polling mode is used, where the kernel will attempt
-to make an educated guess at when the IO will complete. Based on this
-guess, the kernel will put the process issuing IO to sleep for an amount
-of time, before entering a classic poll loop. This mode might be a
-little slower than pure classic polling, but it will be more efficient.
-If set to a value larger than 0, the kernel will put the process issuing
-IO to sleep for this amount of microseconds before entering classic
-polling.
-
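-For example, to switch a device to hybrid polling (the device name is
-illustrative)::
-
-	echo 0 > /sys/block/nvme0n1/queue/io_poll_delay
-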
-io_timeout (RW)
----------------
-io_timeout is the request timeout in milliseconds. If a request does not
-complete in this time then the block driver timeout handler is invoked.
-That timeout handler can decide to retry the request, to fail it or to start
-a device recovery strategy.
-
-iostats (RW)
--------------
-This file is used to control (on/off) the iostats accounting of the
-disk.
-
-logical_block_size (RO)
------------------------
-This is the logical block size of the device, in bytes.
-
-max_discard_segments (RO)
--------------------------
-The maximum number of DMA scatter/gather entries in a discard request.
-
-max_hw_sectors_kb (RO)
-----------------------
-This is the maximum number of kilobytes supported in a single data transfer.
-
-max_integrity_segments (RO)
----------------------------
-Maximum number of elements in a DMA scatter/gather list with integrity
-data that will be submitted by the block layer core to the associated
-block driver.
-
-max_active_zones (RO)
----------------------
-For zoned block devices (zoned attribute indicating "host-managed" or
-"host-aware"), the sum of zones belonging to any of the zone states:
-EXPLICIT OPEN, IMPLICIT OPEN or CLOSED, is limited by this value.
-If this value is 0, there is no limit.
-
-If the host attempts to exceed this limit, the driver should report this error
-with BLK_STS_ZONE_ACTIVE_RESOURCE, which user space may see as the EOVERFLOW
-errno.
-
-max_open_zones (RO)
--------------------
-For zoned block devices (zoned attribute indicating "host-managed" or
-"host-aware"), the sum of zones belonging to any of the zone states:
-EXPLICIT OPEN or IMPLICIT OPEN, is limited by this value.
-If this value is 0, there is no limit.
-
-If the host attempts to exceed this limit, the driver should report this error
-with BLK_STS_ZONE_OPEN_RESOURCE, which user space may see as the ETOOMANYREFS
-errno.
-
-max_sectors_kb (RW)
--------------------
-This is the maximum number of kilobytes that the block layer will allow
-for a filesystem request. Must be smaller than or equal to the maximum
-size allowed by the hardware.
-
-max_segments (RO)
------------------
-Maximum number of elements in a DMA scatter/gather list that is submitted
-to the associated block driver.
-
-max_segment_size (RO)
----------------------
-Maximum size in bytes of a single element in a DMA scatter/gather list.
-
-minimum_io_size (RO)
---------------------
-This is the smallest preferred IO size reported by the device.
-
-nomerges (RW)
--------------
-This enables the user to disable the lookup logic involved with IO
-merging requests in the block layer. By default (0) all merges are
-enabled. When set to 1 only simple one-hit merges will be tried. When
-set to 2 no merge algorithms will be tried (including one-hit or more
-complex tree/hash lookups).
-
-nr_requests (RW)
-----------------
-This controls how many requests may be allocated in the block layer for
-read or write requests. Note that the total allocated number may be twice
-this amount, since it applies only to reads or writes (not the accumulated
-sum).
-
-To avoid priority inversion through request starvation, a request
-queue maintains a separate request pool per each cgroup when
-CONFIG_BLK_CGROUP is enabled, and this parameter applies to each such
-per-block-cgroup request pool. IOW, if there are N block cgroups,
-each request queue may have up to N request pools, each independently
-regulated by nr_requests.
-
-nr_zones (RO)
--------------
-For zoned block devices (zoned attribute indicating "host-managed" or
-"host-aware"), this indicates the total number of zones of the device.
-This is always 0 for regular block devices.
-
-optimal_io_size (RO)
---------------------
-This is the optimal IO size reported by the device.
-
-physical_block_size (RO)
-------------------------
-This is the physical block size of device, in bytes.
-
-read_ahead_kb (RW)
-------------------
-Maximum number of kilobytes to read-ahead for filesystems on this block
-device.
-
-rotational (RW)
----------------
-This file is used to state whether the device is of rotational or
-non-rotational type.
-
-rq_affinity (RW)
-----------------
-If this option is '1', the block layer will migrate request completions to the
-cpu "group" that originally submitted the request. For some workloads this
-provides a significant reduction in CPU cycles due to caching effects.
-
-For storage configurations that need to maximize the distribution of
-completion processing, setting this option to '2' forces the completion to
-run on the requesting cpu (bypassing the "group" aggregation logic).
-
-scheduler (RW)
---------------
-When read, this file will display the current and available IO schedulers
-for this block device. The currently active IO scheduler will be enclosed
-in [] brackets. Writing an IO scheduler name to this file will switch
-control of this block device to that new IO scheduler. Note that writing
-an IO scheduler name to this file will attempt to load that IO scheduler
-module, if it isn't already present in the system.
-
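-For example (device and scheduler names are illustrative)::
-
-	# cat /sys/block/sda/queue/scheduler
-	[mq-deadline] kyber bfq none
-	# echo bfq > /sys/block/sda/queue/scheduler
-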
-write_cache (RW)
-----------------
-When read, this file will display whether the device has write back
-caching enabled or not. It will return "write back" for the former
-case, and "write through" for the latter. Writing to this file can
-change the kernel's view of the device, but it doesn't alter the
-device state. This means that it might not be safe to toggle the
-setting from "write back" to "write through", since that will also
-eliminate cache flushes issued by the kernel.
-
-write_same_max_bytes (RO)
--------------------------
-This is the number of bytes the device can write in a single write-same
-command. A value of '0' means write-same is not supported by this
-device.
-
-wbt_lat_usec (RW)
------------------
-If the device is registered for writeback throttling, then this file shows
-the target minimum read latency. If this latency is exceeded in a given
-window of time (see wb_window_usec), then the writeback throttling will start
-scaling back writes. Writing a value of '0' to this file disables the
-feature. Writing a value of '-1' to this file resets the value to the
-default setting.
-
-throttle_sample_time (RW)
--------------------------
-This is the time window over which blk-throttle samples data, in
-milliseconds. blk-throttle makes decisions based on the samplings. A lower
-time means cgroups have smoother throughput, but higher CPU overhead. This
-exists only when CONFIG_BLK_DEV_THROTTLING_LOW is enabled.
-
-write_zeroes_max_bytes (RO)
----------------------------
-For block drivers that support REQ_OP_WRITE_ZEROES, the maximum number of
-bytes that can be zeroed at once. The value 0 means that REQ_OP_WRITE_ZEROES
-is not supported.
-
-zone_append_max_bytes (RO)
---------------------------
-This is the maximum number of bytes that can be written to a sequential
-zone of a zoned block device using a zone append write operation
-(REQ_OP_ZONE_APPEND). This value is always 0 for regular block devices.
-
-zoned (RO)
-----------
-This indicates if the device is a zoned block device and the zone model of the
-device if it is indeed zoned. The possible values indicated by zoned are
-"none" for regular block devices and "host-aware" or "host-managed" for zoned
-block devices. The characteristics of host-aware and host-managed zoned block
-devices are described in the ZBC (Zoned Block Commands) and ZAC
-(Zoned Device ATA Command Set) standards. These standards also define the
-"drive-managed" zone model. However, since drive-managed zoned block devices
-do not support zone commands, they will be treated as regular block devices
-and zoned will report "none".
-
-zone_write_granularity (RO)
----------------------------
-This indicates the alignment constraint, in bytes, for write operations in
-sequential zones of zoned block devices (devices with a zoned attribute
-that reports "host-managed" or "host-aware"). This value is always 0 for
-regular block devices.
-
-Jens Axboe <jens.axboe@oracle.com>, February 2009
diff --git a/Documentation/block/request.rst b/Documentation/block/request.rst
deleted file mode 100644
index 747021e1ffdb..000000000000
--- a/Documentation/block/request.rst
+++ /dev/null
@@ -1,99 +0,0 @@
-============================
-struct request documentation
-============================
-
-Jens Axboe <jens.axboe@oracle.com> 27/05/02
-
-
-.. FIXME:
-   No idea what this means - seems to be just noise, so it is commented out
-
- 1.0
- Index
-
- 2.0 Struct request members classification
-
- 2.1 struct request members explanation
-
- 3.0
-
-
- 2.0
-
-
-
-Short explanation of request members
-====================================
-
-Classification flags:
-
- = ====================
- D driver member
- B block layer member
- I I/O scheduler member
- = ====================
-
-Unless an entry contains a D classification, a device driver must not access
-this member. Some members may contain D classifications, but should only be
-accessed through certain macros or functions (e.g. ->flags).
-
-<linux/blkdev.h>
-
-=============================== ======= =======================================
-Member Flag Comment
-=============================== ======= =======================================
-struct list_head queuelist BI Organization on various internal
- queues
-
-``void *elevator_private`` I I/O scheduler private data
-
-unsigned char cmd[16] D Driver can use this for setting up
- a cdb before execution, see
- blk_queue_prep_rq
-
-unsigned long flags DBI Contains info about data direction,
- request type, etc.
-
-int rq_status D Request status bits
-
-kdev_t rq_dev DBI Target device
-
-int errors DB Error counts
-
-sector_t sector DBI Target location
-
-sector_t hard_sector            B       Used to keep sector sane
-
-unsigned long nr_sectors DBI Total number of sectors in request
-
-unsigned long hard_nr_sectors B Used to keep nr_sectors sane
-
-unsigned short nr_phys_segments DB Number of physical scatter gather
- segments in a request
-
-unsigned short nr_hw_segments DB Number of hardware scatter gather
- segments in a request
-
-unsigned int current_nr_sectors DB Number of sectors in first segment
- of request
-
-unsigned int hard_cur_sectors B Used to keep current_nr_sectors sane
-
-int tag DB TCQ tag, if assigned
-
-``void *special`` D Free to be used by driver
-
-``char *buffer`` D Map of first segment, also see
- section on bouncing SECTION
-
-``struct completion *waiting`` D Can be used by driver to get signalled
- on request completion
-
-``struct bio *bio`` DBI First bio in request
-
-``struct bio *biotail`` DBI Last bio in request
-
-``struct request_queue *q`` DB Request queue this request belongs to
-
-``struct request_list *rl`` B Request list this request came from
-=============================== ======= =======================================
diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
new file mode 100644
index 000000000000..1713b2890abb
--- /dev/null
+++ b/Documentation/block/ublk.rst
@@ -0,0 +1,326 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================================
+Userspace block device driver (ublk driver)
+===========================================
+
+Overview
+========
+
+ublk is a generic framework for implementing block device logic from userspace.
+The motivation behind it is that moving virtual block drivers such as loop and
+nbd into userspace can be very helpful. It also eases implementing new virtual
+block devices such as ublk-qcow2 (there have been several attempts at
+implementing a qcow2 driver in the kernel).
+
+Userspace block devices are attractive because:
+
+- They can be written in many programming languages.
+- They can use libraries that are not available in the kernel.
+- They can be debugged with tools familiar to application developers.
+- Crashes do not cause a kernel panic on the machine.
+- Bugs are likely to have a lower security impact than bugs in kernel
+ code.
+- They can be installed and updated independently of the kernel.
+- They can be used to simulate block devices easily with user-specified
+ parameters/settings for test/debug purposes.
+
+A ublk block device (``/dev/ublkb*``) is added by the ublk driver. Any IO
+request on the device will be forwarded to the ublk userspace program. For
+convenience, in this document, ``ublk server`` refers to a generic ublk
+userspace program. ``ublksrv`` [#userspace]_ is one such implementation. It
+provides the ``libublksrv`` [#userspace_lib]_ library for conveniently
+developing specific userspace block devices, while generic block device types,
+such as loop and null, are also included. Richard W.M. Jones wrote the userspace
+nbd device ``nbdublk`` [#userspace_nbdublk]_ based on ``libublksrv`` [#userspace_lib]_.
+
+After the IO is handled by userspace, the result is committed back to the
+driver, thus completing the request cycle. This way, any specific IO handling
+logic is totally done by userspace, such as loop's IO handling, NBD's IO
+communication, or qcow2's IO mapping.
+
+``/dev/ublkb*`` is driven by a blk-mq request-based driver. Each request is
+assigned a queue-wide unique tag. The ublk server assigns a unique tag to each
+IO too, which is 1:1 mapped with the IO of ``/dev/ublkb*``.
+
+Both forwarding the IO request and committing the IO handling result are done
+via ``io_uring`` passthrough commands; that is why ublk is also an io_uring
+based block driver. It has been observed that io_uring passthrough commands can
+give better IOPS than block IO; this is one reason why ublk is a
+high-performance implementation of a userspace block device: not only is the IO
+request communication done via io_uring, but the preferred IO handling approach
+in the ublk server is io_uring based too.
+
+ublk provides a control interface to set/get ublk block device parameters.
+The interface is extendable and kabi compatible: basically any ublk request
+queue parameter or ublk generic feature parameter can be set/get via the
+interface. Thus, ublk is a generic userspace block device framework.
+For example, it is easy to set up a ublk device with specified block
+parameters from userspace.
+
+Using ublk
+==========
+
+ublk requires a userspace ublk server to handle the real block device logic.
+
+Below is an example of using ``ublksrv`` to provide a ublk-based loop device.
+
+- add a device::
+
+ ublk add -t loop -f ublk-loop.img
+
+- format with xfs, then use it::
+
+ mkfs.xfs /dev/ublkb0
+ mount /dev/ublkb0 /mnt
+ # do anything. all IOs are handled by io_uring
+ ...
+ umount /mnt
+
+- list the devices with their info::
+
+ ublk list
+
+- delete the device::
+
+ ublk del -a
+ ublk del -n $ublk_dev_id
+
+See usage details in README of ``ublksrv`` [#userspace_readme]_.
+
+Design
+======
+
+Control plane
+-------------
+
+The ublk driver provides a global misc device node (``/dev/ublk-control``) for
+managing and controlling ublk devices with the help of several control commands:
+
+- ``UBLK_CMD_ADD_DEV``
+
+ Add a ublk char device (``/dev/ublkc*``) which the ublk server talks to for
+ IO command communication. Basic device info is sent together with this
+ command via the UAPI structure ``ublksrv_ctrl_dev_info``, which includes
+ ``nr_hw_queues``, ``queue_depth``, and the max IO request buffer size. The
+ info is negotiated with the driver and sent back to the server.
+ When this command is completed, the basic device info is immutable.
+
+- ``UBLK_CMD_SET_PARAMS`` / ``UBLK_CMD_GET_PARAMS``
+
+ Set or get parameters of the device, which can be either generic feature
+ related, or request queue limit related, but can't be IO logic specific,
+ because the driver does not handle any IO logic. This command has to be
+ sent before sending ``UBLK_CMD_START_DEV``.
+
+- ``UBLK_CMD_START_DEV``
+
+ After the server prepares userspace resources (such as creating per-queue
+ pthread & io_uring for handling ublk IO), this command is sent to the
+ driver for allocating & exposing ``/dev/ublkb*``. Parameters set via
+ ``UBLK_CMD_SET_PARAMS`` are applied for creating the device.
+
+- ``UBLK_CMD_STOP_DEV``
+
+ Halt IO on ``/dev/ublkb*`` and remove the device. When this command returns,
+ the ublk server will release resources (such as destroying the per-queue
+ pthread & io_uring).
+
+- ``UBLK_CMD_DEL_DEV``
+
+ Remove ``/dev/ublkc*``. When this command returns, the allocated ublk device
+ number can be reused.
+
+- ``UBLK_CMD_GET_QUEUE_AFFINITY``
+
+ When ``/dev/ublkc*`` is added, the driver creates the block layer tagset, so
+ that each queue's affinity info is available. The server sends
+ ``UBLK_CMD_GET_QUEUE_AFFINITY`` to retrieve the queue affinity info, with
+ which it can set up the per-queue context efficiently, such as binding
+ affine CPUs to the IO pthread and trying to allocate buffers in the IO
+ thread context.
+
+- ``UBLK_CMD_GET_DEV_INFO``
+
+ For retrieving device info via ``ublksrv_ctrl_dev_info``. It is the server's
+ responsibility to save IO target specific info in userspace.
+
+- ``UBLK_CMD_GET_DEV_INFO2``
+
+ Serves the same purpose as ``UBLK_CMD_GET_DEV_INFO``, but the ublk server
+ has to provide the path of the ``/dev/ublkc*`` char device so the kernel can
+ run a permission check. This command was added to support unprivileged ublk
+ devices and was introduced together with ``UBLK_F_UNPRIVILEGED_DEV``.
+ Only the user owning the requested device can retrieve the device info.
+
+ How to deal with userspace/kernel compatibility:
+
+ 1) if the kernel is capable of handling ``UBLK_F_UNPRIVILEGED_DEV``
+
+    If the ublk server supports ``UBLK_F_UNPRIVILEGED_DEV``:
+
+    The ublk server should send ``UBLK_CMD_GET_DEV_INFO2`` whenever an
+    unprivileged application needs to query devices that the current user
+    owns. Since the capability information is stateless, the application
+    cannot know whether ``UBLK_F_UNPRIVILEGED_DEV`` is set, so it should
+    always retrieve device info via ``UBLK_CMD_GET_DEV_INFO2``.
+
+    If the ublk server doesn't support ``UBLK_F_UNPRIVILEGED_DEV``:
+
+    ``UBLK_CMD_GET_DEV_INFO`` is always sent to the kernel, and the
+    ``UBLK_F_UNPRIVILEGED_DEV`` feature isn't available to the user.
+
+ 2) if the kernel isn't capable of handling ``UBLK_F_UNPRIVILEGED_DEV``
+
+    If the ublk server supports ``UBLK_F_UNPRIVILEGED_DEV``:
+
+    ``UBLK_CMD_GET_DEV_INFO2`` is tried first and will fail; then
+    ``UBLK_CMD_GET_DEV_INFO`` needs to be retried, since
+    ``UBLK_F_UNPRIVILEGED_DEV`` can't be set.
+
+    If the ublk server doesn't support ``UBLK_F_UNPRIVILEGED_DEV``:
+
+    ``UBLK_CMD_GET_DEV_INFO`` is always sent to the kernel, and the
+    ``UBLK_F_UNPRIVILEGED_DEV`` feature isn't available to the user.
+
+- ``UBLK_CMD_START_USER_RECOVERY``
+
+ This command is valid if the ``UBLK_F_USER_RECOVERY`` feature is enabled. This
+ command is accepted after the old process has exited, the ublk device is
+ quiesced and ``/dev/ublkc*`` is released. The user should send this command
+ before starting a new process which re-opens ``/dev/ublkc*``. When this
+ command returns, the ublk device is ready for the new process.
+
+- ``UBLK_CMD_END_USER_RECOVERY``
+
+ This command is valid if the ``UBLK_F_USER_RECOVERY`` feature is enabled. This
+ command is accepted after the ublk device is quiesced and a new process has
+ opened ``/dev/ublkc*`` and got all ublk queues ready. When this command
+ returns, the ublk device is unquiesced and new I/O requests are passed to the
+ new process.
+
+- user recovery feature description
+
+ Two new features are added for user recovery: ``UBLK_F_USER_RECOVERY`` and
+ ``UBLK_F_USER_RECOVERY_REISSUE``.
+
+ With ``UBLK_F_USER_RECOVERY`` set, after one ubq_daemon (the ublk server's IO
+ handler) dies, ublk does not delete ``/dev/ublkb*`` during the whole
+ recovery stage, and the ublk device ID is kept. It is the ublk server's
+ responsibility to recover the device context using its own knowledge.
+ Requests which have not been issued to userspace are requeued. Requests
+ which have been issued to userspace are aborted.
+
+ With ``UBLK_F_USER_RECOVERY_REISSUE`` set, after one ubq_daemon (the ublk
+ server's IO handler) dies, contrary to ``UBLK_F_USER_RECOVERY``,
+ requests which have been issued to userspace are requeued and will be
+ re-issued to the new process after handling ``UBLK_CMD_END_USER_RECOVERY``.
+ ``UBLK_F_USER_RECOVERY_REISSUE`` is designed for backends that tolerate
+ double writes, since the driver may issue the same I/O request twice. It
+ might be useful for a read-only FS or a VM backend.
+
+Unprivileged ublk devices are supported by passing ``UBLK_F_UNPRIVILEGED_DEV``.
+Once the flag is set, all control commands can be sent by an unprivileged
+user. Except for the ``UBLK_CMD_ADD_DEV`` command, a permission check on
+the specified char device (``/dev/ublkc*``) is done by the ublk driver for all
+other control commands; to make this possible, the path of the char device has
+to be provided in these commands' payload by the ublk server. This way, ublk
+devices become container-aware, and a device created in one container can be
+controlled/accessed only inside that container.
+
+Data plane
+----------
+
+The ublk server needs to create a per-queue IO pthread & io_uring for handling
+IO commands via io_uring passthrough. The per-queue IO pthread
+focuses on IO handling and shouldn't handle any control & management
+tasks.
+
+Each IO command is assigned a unique tag, which is 1:1 mapped with the IO
+request of ``/dev/ublkb*``.
+
+The UAPI structure ``ublksrv_io_desc`` is defined for describing each IO from
+the driver. A fixed mmapped area (array) on ``/dev/ublkc*`` is provided for
+exporting IO info to the server, such as the IO offset, length, OP/flags and
+buffer address. Each ``ublksrv_io_desc`` instance can be indexed via queue id
+and IO tag directly.
+
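+The following is a rough C sketch of mapping and indexing this area. It is
+illustrative only: the per-queue mmap offset and size calculations below are
+assumptions, and the authoritative layout is defined by the UAPI header
+``<linux/ublk_cmd.h>`` and the ``ublksrv`` sources.
+
+::
+
+  #include <unistd.h>
+  #include <sys/mman.h>
+  #include <linux/ublk_cmd.h>   /* struct ublksrv_io_desc */
+
+  /* Assumed layout: page-aligned, back-to-back per-queue descriptor arrays */
+  static const struct ublksrv_io_desc *
+  map_io_descs(int ublkc_fd, unsigned int q_id, unsigned int queue_depth)
+  {
+      size_t psz = sysconf(_SC_PAGESIZE);
+      size_t len = queue_depth * sizeof(struct ublksrv_io_desc);
+      size_t maplen = (len + psz - 1) & ~(psz - 1);   /* round up to pages */
+      off_t off = (off_t)q_id * maplen;               /* assumed offset */
+      void *p = mmap(NULL, maplen, PROT_READ, MAP_SHARED, ublkc_fd, off);
+
+      return p == MAP_FAILED ? NULL : p;
+  }
+
+  /* The descriptor of one IO is then simply descs[tag]: it carries the
+   * op/flags, start sector, length and buffer address for that IO. */
+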
+The following IO commands are communicated via the io_uring passthrough
+command, and each command is only for forwarding the IO and committing the
+result with the specified IO tag in the command data:
+
+- ``UBLK_IO_FETCH_REQ``
+
+ Sent from the server IO pthread for fetching future incoming IO requests
+ destined to ``/dev/ublkb*``. This command is sent only once from the server
+ IO pthread for the ublk driver to set up the IO forwarding environment.
+
+- ``UBLK_IO_COMMIT_AND_FETCH_REQ``
+
+ When an IO request is destined to ``/dev/ublkb*``, the driver stores
+ the IO's ``ublksrv_io_desc`` in the specified mmapped area; then the
+ previously received IO command with this IO tag (either ``UBLK_IO_FETCH_REQ``
+ or ``UBLK_IO_COMMIT_AND_FETCH_REQ``) is completed, so the server gets
+ the IO notification via io_uring.
+
+ After the server handles the IO, its result is committed back to the
+ driver by sending ``UBLK_IO_COMMIT_AND_FETCH_REQ`` back. Once ublkdrv
+ receives this command, it parses the result, completes the request to
+ ``/dev/ublkb*``, and in the meantime sets up the environment for fetching
+ future requests with the same IO tag. That is, ``UBLK_IO_COMMIT_AND_FETCH_REQ``
+ is reused both for fetching a request and for committing back an IO result
+ (a sketch of queueing this command follows this list of commands).
+
+- ``UBLK_IO_NEED_GET_DATA``
+
+ With ``UBLK_F_NEED_GET_DATA`` enabled, a WRITE request is first
+ issued to the ublk server without data copy. Then the IO backend of the ublk
+ server receives the request, and it can allocate a data buffer and embed its
+ address inside this new io command. After the kernel driver gets the command,
+ the data is copied from the request pages to this backend's buffer. Finally,
+ the backend receives the request again with the data to be written, and it
+ can truly handle the request.
+
+ ``UBLK_IO_NEED_GET_DATA`` adds one additional round-trip and one
+ io_uring_enter() syscall. Any user who thinks that it may lower performance
+ should not enable UBLK_F_NEED_GET_DATA. The ublk server pre-allocates an IO
+ buffer for each IO by default. Any new project should try to use this
+ buffer to communicate with the ublk driver. However, existing projects may
+ break or be unable to consume the new buffer interface; that's why this
+ command was added for backwards compatibility, so that existing projects
+ can still consume existing buffers.
+
+- data copy between ublk server IO buffer and ublk block IO request
+
+ For a WRITE request, the driver needs to copy the block IO request pages
+ into the server buffer (pages) first, before notifying the server of the
+ incoming IO, so that the server can handle the WRITE request.
+
+ When the server handles a READ request and sends
+ ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the driver, ublkdrv needs to copy
+ the server buffer (pages) that were read into the IO request pages.
+
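+As a concrete but non-authoritative illustration, the sketch below shows
+roughly how a server might queue one ``UBLK_IO_COMMIT_AND_FETCH_REQ``
+passthrough command. It assumes a ring created with ``IORING_SETUP_SQE128``
+and omits error handling; the exact SQE encoding should be taken from
+``ublksrv`` rather than from here.
+
+::
+
+  #include <string.h>
+  #include <liburing.h>
+  #include <linux/ublk_cmd.h>   /* UBLK_IO_*, struct ublksrv_io_cmd */
+
+  static void queue_commit_and_fetch(struct io_uring *ring, int ublkc_fd,
+                                     __u16 q_id, __u16 tag, __s32 result,
+                                     __u64 buf_addr)
+  {
+      struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
+      struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
+
+      memset(sqe, 0, 2 * sizeof(*sqe));   /* one 128-byte SQE slot */
+      sqe->opcode = IORING_OP_URING_CMD;
+      sqe->fd = ublkc_fd;                 /* /dev/ublkc* */
+      sqe->cmd_op = UBLK_IO_COMMIT_AND_FETCH_REQ;
+      sqe->user_data = tag;               /* identify the IO on completion */
+      cmd->q_id = q_id;
+      cmd->tag = tag;
+      cmd->result = result;               /* bytes handled, or -errno */
+      cmd->addr = buf_addr;               /* server IO buffer for this tag */
+  }
+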
+Future development
+==================
+
+Zero copy
+---------
+
+Zero copy is a generic requirement for nbd, fuse and similar drivers. A
+problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
+can't be remapped any more in the kernel with existing mm interfaces. This can
+occur when sending direct IO to ``/dev/ublkb*``. Also, he reported that
+big requests (IO size >= 256 KB) may benefit a lot from zero copy.
+
+
+References
+==========
+
+.. [#userspace] https://github.com/ming1/ubdsrv
+
+.. [#userspace_lib] https://github.com/ming1/ubdsrv/tree/master/lib
+
+.. [#userspace_nbdublk] https://gitlab.com/rwmjones/libnbd/-/tree/nbdublk
+
+.. [#userspace_readme] https://github.com/ming1/ubdsrv/blob/master/README
+
+.. [#stefan] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
+
+.. [#xiaoguang] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst
index 437de2a7a5de..38372a956d65 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -208,12 +208,22 @@ data structures and compile with kernel internal headers. Both of these
kernel internals are subject to change and can break with newer kernels
such that the program needs to be adapted accordingly.
+New BPF functionality is generally added through the use of kfuncs instead of
+new helpers. Kfuncs are not considered part of the stable API, and have their own
+lifecycle expectations as described in :ref:`BPF_kfunc_lifecycle_expectations`.
+
Q: Are tracepoints part of the stable ABI?
------------------------------------------
A: NO. Tracepoints are tied to internal implementation details hence they are
subject to change and can break with newer kernels. BPF programs need to change
accordingly when this happens.
+Q: Are places where kprobes can attach part of the stable ABI?
+--------------------------------------------------------------
+A: NO. The places to which kprobes can attach are internal implementation
+details, which means that they are subject to change and can break with
+newer kernels. BPF programs need to change accordingly when this happens.
+
Q: How much stack space a BPF program uses?
-------------------------------------------
A: Currently all program types are limited to 512 bytes of stack
A: NO. Classic BPF programs are converted into extended BPF instructions.
Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
-A: NO. BPF programs can only call a set of helper functions which
-is defined for every program type.
+A: NO. BPF programs can only call specific functions exposed as BPF helpers or
+kfuncs. The set of available functions is defined for every program type.
Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
@@ -257,7 +267,12 @@ Q: New functionality via kernel modules?
Q: Can BPF functionality such as new program or map types, new
helpers, etc be added out of kernel module code?
-A: NO.
+A: Yes, through kfuncs and kptrs.
+
+The core BPF functionality such as program types, maps and helpers cannot be
+added to by modules. However, modules can expose functionality to BPF programs
+by exporting kfuncs (which may return pointers to module-internal data
+structures as kptrs).
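+
+As a rough sketch (the ``mymod`` names are hypothetical; see
+Documentation/bpf/kfuncs.rst for the authoritative interface and required
+headers), a module might expose a kfunc like this:
+
+::
+
+  /* kernel module code; needs <linux/module.h>, <linux/btf.h> and
+   * <linux/btf_ids.h> */
+  __bpf_kfunc int bpf_mymod_get_answer(void)
+  {
+      return 42;
+  }
+
+  BTF_SET8_START(mymod_kfunc_ids)
+  BTF_ID_FLAGS(func, bpf_mymod_get_answer)
+  BTF_SET8_END(mymod_kfunc_ids)
+
+  static const struct btf_kfunc_id_set mymod_kfunc_set = {
+      .owner = THIS_MODULE,
+      .set   = &mymod_kfunc_ids,
+  };
+
+  static int __init mymod_init(void)
+  {
+      return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+                                       &mymod_kfunc_set);
+  }
+  module_init(mymod_init);
+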
Q: Directly calling kernel function is an ABI?
----------------------------------------------
@@ -272,4 +287,70 @@ kernel functions have already been used by other kernel tcp
cc (congestion-control) implementations. If any of these kernel
functions has changed, both the in-tree and out-of-tree kernel tcp cc
implementations have to be changed. The same goes for the bpf
-programs and they have to be adjusted accordingly.
+programs and they have to be adjusted accordingly. See
+:ref:`BPF_kfunc_lifecycle_expectations` for details.
+
+Q: Attaching to arbitrary kernel functions is an ABI?
+-----------------------------------------------------
+Q: BPF programs can be attached to many kernel functions. Do these
+kernel functions become part of the ABI?
+
+A: NO.
+
+The kernel function prototypes will change, and BPF programs attaching to
+them will need to change. The BPF compile-once-run-everywhere (CO-RE)
+approach should be used in order to make it easier to adapt your BPF programs
+to different versions of the kernel.
+
+Q: Marking a function with BTF_ID makes that function an ABI?
+-------------------------------------------------------------
+A: NO.
+
+The BTF_ID macro does not cause a function to become part of the ABI
+any more than does the EXPORT_SYMBOL_GPL macro.
+
+Q: What is the compatibility story for special BPF types in map values?
+-----------------------------------------------------------------------
+Q: Users are allowed to embed bpf_spin_lock, bpf_timer fields in their BPF map
+values (when using BTF support for BPF maps). This allows the use of helpers for
+such objects on these fields inside map values. Users are also allowed to embed
+pointers to some kernel types (with __kptr_untrusted and __kptr BTF tags). Will the
+kernel preserve backwards compatibility for these features?
+
+A: It depends. For bpf_spin_lock, bpf_timer: YES, for kptr and everything else:
+NO, but see below.
+
+For struct types that have been added already, like bpf_spin_lock and bpf_timer,
+the kernel will preserve backwards compatibility, as they are part of UAPI.
+
+For kptrs, they are also part of UAPI, but only with respect to the kptr
+mechanism. The types that you can use with a __kptr_untrusted and __kptr tagged
+pointer in your struct are NOT part of the UAPI contract. The supported types can
+and will change across kernel releases. However, operations like accessing kptr
+fields and bpf_kptr_xchg() helper will continue to be supported across kernel
+releases for the supported types.
+
+For any other supported struct type, unless explicitly stated in this document
+and added to bpf.h UAPI header, such types can and will arbitrarily change their
+size, type, and alignment, or any other user visible API or ABI detail across
+kernel releases. The users must adapt their BPF programs to the new changes and
+update them to make sure their programs continue to work correctly.
+
+NOTE: BPF subsystem specially reserves the 'bpf\_' prefix for type names, in
+order to introduce more special fields in the future. Hence, user programs must
+avoid defining types with 'bpf\_' prefix to not be broken in future releases.
+In other words, no backwards compatibility is guaranteed if one uses a type
+in BTF with the 'bpf\_' prefix.
+
+Q: What is the compatibility story for special BPF types in allocated objects?
+------------------------------------------------------------------------------
+Q: Same as above, but for allocated objects (i.e. objects allocated using
+bpf_obj_new for user defined types). Will the kernel preserve backwards
+compatibility for these features?
+
+A: NO.
+
+Unlike map value types, the API to work with allocated objects and any support
+for special fields inside them is exposed through kfuncs, and thus has the same
+lifecycle expectations as the kfuncs themselves. See
+:ref:`BPF_kfunc_lifecycle_expectations` for details.
diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst
index 253496af8fef..609b71f5747d 100644
--- a/Documentation/bpf/bpf_devel_QA.rst
+++ b/Documentation/bpf/bpf_devel_QA.rst
@@ -7,8 +7,8 @@ workflows related to reporting bugs, submitting patches, and queueing
patches for stable kernels.
For general information about submitting patches, please refer to
-`Documentation/process/`_. This document only describes additional specifics
-related to BPF.
+Documentation/process/submitting-patches.rst. This document only describes
+additional specifics related to BPF.
.. contents::
:local:
@@ -44,6 +44,33 @@ is a guarantee that the reported issue will be overlooked.**
Submitting patches
==================
+Q: How do I run BPF CI on my changes before sending them out for review?
+------------------------------------------------------------------------
+A: BPF CI is GitHub based and hosted at https://github.com/kernel-patches/bpf.
+While GitHub also provides a CLI that can be used to accomplish the same
+results, here we focus on the UI based workflow.
+
+The following steps lay out how to start a CI run for your patches:
+
+- Create a fork of the aforementioned repository in your own account (one time
+ action)
+
+- Clone the fork locally, check out a new branch tracking either the bpf-next
+ or bpf branch, and apply your to-be-tested patches on top of it
+
+- Push the local branch to your fork and create a pull request against
+ kernel-patches/bpf's bpf-next_base or bpf_base branch, respectively
+
+Shortly after the pull request has been created, the CI workflow will run. Note
+that capacity is shared with the checking of patches submitted upstream, so
+depending on utilization the run can take a while to finish.
+
+Note furthermore that both base branches (bpf-next_base and bpf_base) will be
+updated as patches are pushed to the respective upstream branches they track. As
+such, an attempt will automatically be made to rebase your patch set as well.
+This behavior can result in a CI run being aborted and restarted with the new
+baseline.
+
Q: To which mailing list do I need to submit my BPF patches?
------------------------------------------------------------
A: Please submit your BPF patches to the bpf kernel mailing list:
@@ -101,7 +128,8 @@ into the bpf-next tree will make their way into net-next tree. net and
net-next are both run by David S. Miller. From there, they will go
into the kernel mainline tree run by Linus Torvalds. To read up on the
process of net and net-next being merged into the mainline tree, see
-the :ref:`netdev-FAQ`
+the documentation on the netdev subsystem at
+Documentation/process/maintainer-netdev.rst.
@@ -120,7 +148,8 @@ request)::
Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be applied to?
---------------------------------------------------------------------------------
-A: The process is the very same as described in the :ref:`netdev-FAQ`,
+A: The process is the very same as described in the netdev subsystem
+documentation at Documentation/process/maintainer-netdev.rst,
so please read up on it. The subject line must indicate whether the
patch is a fix or rather "next-like" content in order to let the
maintainers know whether it is targeted at bpf or bpf-next.
@@ -179,8 +208,9 @@ ii) run extensive BPF test suite and
Once the BPF pull request was accepted by David S. Miller, then
the patches end up in net or net-next tree, respectively, and
make their way from there further into mainline. Again, see the
-:ref:`netdev-FAQ` for additional information e.g. on how often they are
-merged to mainline.
+documentation for the netdev subsystem at
+Documentation/process/maintainer-netdev.rst for additional information
+e.g. on how often they are merged to mainline.
Q: How long do I need to wait for feedback on my BPF patches?
-------------------------------------------------------------
@@ -203,7 +233,8 @@ Q: Are patches applied to bpf-next when the merge window is open?
-----------------------------------------------------------------
A: For the time when the merge window is open, bpf-next will not be
processed. This is roughly analogous to net-next patch processing,
-so feel free to read up on the :ref:`netdev-FAQ` about further details.
+so feel free to read up on the netdev docs at
+Documentation/process/maintainer-netdev.rst about further details.
During those two weeks of merge window, we might ask you to resend
your patch series once bpf-next is open again. Once Linus released
@@ -367,7 +398,8 @@ netdev kernel mailing list in Cc and ask for the fix to be queued up:
netdev@vger.kernel.org
The process in general is the same as on netdev itself, see also the
-:ref:`netdev-FAQ`.
+documentation on the networking subsystem at
+Documentation/process/maintainer-netdev.rst.
Q: Do you also backport to kernels not currently maintained as stable?
----------------------------------------------------------------------
@@ -383,7 +415,7 @@ Q: The BPF patch I am about to submit needs to go to stable as well
What should I do?
A: The same rules apply as with netdev patch submissions in general, see
-the :ref:`netdev-FAQ`.
+the netdev docs at Documentation/process/maintainer-netdev.rst.
Never add "``Cc: stable@vger.kernel.org``" to the patch description, but
ask the BPF maintainers to queue the patches instead. This can be done
@@ -434,15 +466,15 @@ needed::
$ sudo make run_tests
-See the kernels selftest `Documentation/dev-tools/kselftest.rst`_
-document for further documentation.
+See :doc:`kernel selftest documentation </dev-tools/kselftest>`
+for details.
To maximize the number of tests passing, the .config of the kernel
under test should match the config file fragment in
tools/testing/selftests/bpf as closely as possible.
Finally to ensure support for latest BPF Type Format features -
-discussed in `Documentation/bpf/btf.rst`_ - pahole version 1.16
+discussed in Documentation/bpf/btf.rst - pahole version 1.16
is required for kernels built with CONFIG_DEBUG_INFO_BTF=y.
pahole is delivered in the dwarves package or can be built
from source at
@@ -657,12 +689,7 @@ when:
.. Links
-.. _Documentation/process/: https://www.kernel.org/doc/html/latest/process/
-.. _netdev-FAQ: ../networking/netdev-FAQ.rst
.. _selftests:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/
-.. _Documentation/dev-tools/kselftest.rst:
- https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
-.. _Documentation/bpf/btf.rst: btf.rst
Happy BPF hacking!
diff --git a/Documentation/bpf/bpf_iterators.rst b/Documentation/bpf/bpf_iterators.rst
new file mode 100644
index 000000000000..6d7770793fab
--- /dev/null
+++ b/Documentation/bpf/bpf_iterators.rst
@@ -0,0 +1,485 @@
+=============
+BPF Iterators
+=============
+
+
+----------
+Motivation
+----------
+
+There are a few existing ways to dump kernel data into user space. The most
+popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
+all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
+sockets in the system. However, their output format tends to be fixed, and if
+users want more information about these sockets, they have to patch the kernel,
+which often takes time to publish upstream and release. The same is true for popular
+tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any
+additional information needs a kernel patch.
+
+To solve this problem, the `drgn
+<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
+dig out the kernel data with no kernel change. However, the main drawback for
+drgn is performance, as it cannot do pointer tracing inside the kernel. In
+addition, drgn cannot validate a pointer value and may read invalid data if the
+pointer becomes invalid inside the kernel.
+
+The BPF iterator solves the above problem by providing flexibility on what data
+(e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
+data object.
+
+----------------------
+How BPF Iterators Work
+----------------------
+
+A BPF iterator is a type of BPF program that allows users to iterate over
+specific types of kernel objects. Unlike traditional BPF tracing programs that
+allow users to define callbacks that are invoked at particular points of
+execution in the kernel, BPF iterators allow users to define callbacks that
+should be executed for every entry in a variety of kernel data structures.
+
+For example, users can define a BPF iterator that iterates over every task on
+the system and dumps the total amount of CPU runtime currently used by each of
+them. Another BPF task iterator may instead dump the cgroup information for each
+task. Such flexibility is the core value of BPF iterators.
+
+A BPF program is always loaded into the kernel at the behest of a user space
+process. A user space process loads a BPF program by opening and initializing
+the program skeleton as required and then invoking a syscall to have the BPF
+program verified and loaded by the kernel.
+
+In traditional tracing programs, a program is activated by having user space
+obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
+activated, the program callback will be invoked whenever the tracepoint is
+triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
+program is obtained using ``bpf_link_create()``, and the program callback is
+invoked by issuing system calls from user space.
+
+Next, let us see how you can use the iterators to iterate on kernel objects and
+read data.
+
+------------------------
+How to Use BPF iterators
+------------------------
+
+BPF selftests are a great resource to illustrate how to use the iterators. In
+this section, we’ll walk through a BPF selftest which shows how to load and use
+a BPF iterator program. To begin, we’ll look at `bpf_iter.c
+<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
+which illustrates how to load and trigger BPF iterators on the user space side.
+Later, we’ll look at a BPF program that runs in kernel space.
+
+Loading a BPF iterator in the kernel from user space typically involves the
+following steps:
+
+* The BPF program is loaded into the kernel through ``libbpf``. Once the kernel
+ has verified and loaded the program, it returns a file descriptor (fd) to user
+ space.
+* Obtain a ``link_fd`` to the BPF program by calling ``bpf_link_create()``
+ with the BPF program file descriptor received from the kernel.
+* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling
+ ``bpf_iter_create()`` with the ``bpf_link`` received from Step 2.
+* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
+ available.
+* Close the iterator fd using ``close(bpf_iter_fd)``.
+* If you need to reread the data, get a new ``bpf_iter_fd`` and do the read
+ again, as sketched below.
+
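+Putting these steps together, a minimal user space sketch (assuming
+``prog_fd`` is the file descriptor of an already loaded and verified iterator
+program) might look like the following:
+
+::
+
+  #include <unistd.h>
+  #include <bpf/bpf.h>
+
+  static int dump_iter(int prog_fd)
+  {
+      char buf[4096];
+      int link_fd, iter_fd;
+      ssize_t len;
+
+      /* Step 2: obtain a bpf_link to the loaded iterator program */
+      link_fd = bpf_link_create(prog_fd, 0, BPF_TRACE_ITER, NULL);
+      if (link_fd < 0)
+          return link_fd;
+
+      /* Step 3: create an iterator instance from the link */
+      iter_fd = bpf_iter_create(link_fd);
+      if (iter_fd < 0) {
+          close(link_fd);
+          return iter_fd;
+      }
+
+      /* Step 4: trigger the iteration until no data is available */
+      while ((len = read(iter_fd, buf, sizeof(buf))) > 0)
+          write(STDOUT_FILENO, buf, len);
+
+      /* Step 5: close the fds */
+      close(iter_fd);
+      close(link_fd);
+      return len < 0 ? -1 : 0;
+  }
+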
+The following are a few examples of selftest BPF iterator programs:
+
+* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
+* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
+* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_
+
+Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
+
+Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
+<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
+Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
+represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
+iterator.
+
+::
+
+ struct bpf_iter__task_file {
+     union {
+         struct bpf_iter_meta *meta;
+     };
+     union {
+         struct task_struct *task;
+     };
+     u32 fd;
+     union {
+         struct file *file;
+     };
+ };
+
+In the above code, the field 'meta' contains the metadata, which is the same for
+all BPF iterator programs. The rest of the fields are specific to different
+iterators. For example, for task_file iterators, the kernel layer provides the
+'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
+counted
+<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
+so they won't go away when the BPF program runs.
+
+Here is a snippet from the ``bpf_iter_task_file.c`` file:
+
+::
+
+ /* These globals are defined in the selftest and set/read via the skeleton */
+ int count = 0;
+ int tgid = 0;
+ int last_tgid = 0;
+ int unique_tgid_count = 0;
+
+ SEC("iter/task_file")
+ int dump_task_file(struct bpf_iter__task_file *ctx)
+ {
+     struct seq_file *seq = ctx->meta->seq;
+     struct task_struct *task = ctx->task;
+     struct file *file = ctx->file;
+     __u32 fd = ctx->fd;
+
+     if (task == NULL || file == NULL)
+         return 0;
+
+     if (ctx->meta->seq_num == 0) {
+         count = 0;
+         BPF_SEQ_PRINTF(seq, "    tgid      gid       fd      file\n");
+     }
+
+     if (tgid == task->tgid && task->tgid != task->pid)
+         count++;
+
+     if (last_tgid != task->tgid) {
+         last_tgid = task->tgid;
+         unique_tgid_count++;
+     }
+
+     BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
+                    (long)file->f_op);
+     return 0;
+ }
+
+In the above example, the section name ``SEC("iter/task_file")`` indicates that
+the program is a BPF iterator program for iterating over all files of all
+tasks. The context of the program is the ``bpf_iter__task_file`` struct.
+
+The user space program invokes the BPF iterator program running in the kernel
+by issuing a ``read()`` syscall. Once invoked, the BPF
+program can export data to user space using a variety of BPF helper functions.
+You can use either ``bpf_seq_printf()`` (and the BPF_SEQ_PRINTF helper macro)
+or the ``bpf_seq_write()`` function, depending on whether you need formatted
+output or just binary data. For binary-encoded data, the user space applications
+can process the data from ``bpf_seq_write()`` as needed. For the formatted data,
+you can use ``cat <path>`` to print the results similar to ``cat
+/proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later,
+use ``rm -f <path>`` to remove the pinned iterator.
+
+For example, you can use the following command to create a BPF iterator from the
+``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
+path:
+
+::
+
+ $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route
+
+And then print out the results using the following command:
+
+::
+
+ $ cat /sys/fs/bpf/my_route
+
+
+-------------------------------------------------------
+Implement Kernel Support for BPF Iterator Program Types
+-------------------------------------------------------
+
+To implement a BPF iterator in the kernel, the developer must make a one-time
+change to the following key data structure defined in the `bpf.h
+<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
+file.
+
+::
+
+ struct bpf_iter_reg {
+     const char *target;
+     bpf_iter_attach_target_t attach_target;
+     bpf_iter_detach_target_t detach_target;
+     bpf_iter_show_fdinfo_t show_fdinfo;
+     bpf_iter_fill_link_info_t fill_link_info;
+     bpf_iter_get_func_proto_t get_func_proto;
+     u32 ctx_arg_info_size;
+     u32 feature;
+     struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
+     const struct bpf_iter_seq_info *seq_info;
+ };
+
+After filling the data structure fields, call ``bpf_iter_reg_target()`` to
+register the iterator to the main BPF iterator subsystem.
+
+The following is the breakdown for each field in struct ``bpf_iter_reg``.
+
+.. list-table::
+ :widths: 25 50
+ :header-rows: 1
+
+ * - Fields
+ - Description
+ * - target
+ - Specifies the name of the BPF iterator. For example: ``bpf_map``,
+ ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel.
+ * - attach_target and detach_target
+ - Allows for target specific ``link_create`` action since some targets
+ may need special processing. Called during the user space link_create stage.
+ * - show_fdinfo and fill_link_info
+ - Called to fill target specific information when user tries to get link
+ info associated with the iterator.
+ * - get_func_proto
+ - Permits a BPF iterator to access BPF helpers specific to the iterator.
+ * - ctx_arg_info_size and ctx_arg_info
+ - Specifies the verifier states for BPF program arguments associated with
+ the bpf iterator.
+ * - feature
+ - Specifies certain action requests in the kernel BPF iterator
+ infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
+ that the kernel function cond_resched() is called to avoid other kernel
+ subsystems (e.g., rcu) misbehaving.
+ * - seq_info
+ - Specifies the ``seq_file`` operations for iterating the target and the
+ callbacks used to initialize and free the iterator's private data:
+ ``seq_ops``, ``init_seq_private``, ``fini_seq_private`` and
+ ``seq_priv_size`` in ``struct bpf_iter_seq_info``.
+
+
+`Click here
+<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
+to see an implementation of the ``task_vma`` BPF iterator in the kernel.
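+
+As a rough kernel-side sketch (all ``foo`` names are hypothetical and the
+``seq_file`` callbacks are elided; see kernel/bpf/task_iter.c for real
+registrations), adding a new target looks roughly like this:
+
+::
+
+  static const struct bpf_iter_seq_info foo_seq_info = {
+      .seq_ops          = &foo_seq_ops,       /* seq_file iteration ops */
+      .init_seq_private = foo_init_seq_priv,  /* optional init/fini hooks */
+      .fini_seq_private = foo_fini_seq_priv,
+      .seq_priv_size    = sizeof(struct foo_iter_priv),
+  };
+
+  static struct bpf_iter_reg foo_reg_info = {
+      .target            = "foo",
+      .feature           = BPF_ITER_RESCHED,
+      .ctx_arg_info_size = 1,
+      .ctx_arg_info      = {
+          { offsetof(struct bpf_iter__foo, foo),
+            PTR_TO_BTF_ID_OR_NULL },
+      },
+      .seq_info          = &foo_seq_info,
+  };
+
+  static int __init foo_iter_init(void)
+  {
+      return bpf_iter_reg_target(&foo_reg_info);
+  }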
+
+---------------------------------
+Parameterizing BPF Task Iterators
+---------------------------------
+
+By default, BPF iterators walk through all the objects of the specified types
+(processes, cgroups, maps, etc.) across the entire system to read relevant
+kernel data. But often, there are cases where we only care about a much smaller
+subset of iterable kernel objects, such as only iterating tasks within a
+specific process. Therefore, BPF iterator programs support filtering out objects
+from iteration by allowing user space to configure the iterator program when it
+is attached.
+
+--------------------------
+BPF Task Iterator Program
+--------------------------
+
+The following code is a BPF iterator program that prints file and task
+information through the ``seq_file`` of the iterator. It is a standard BPF
+iterator program that visits every file of every task. We will use this BPF
+program in our example later.
+
+::
+
+ #include <vmlinux.h>
+ #include <bpf/bpf_helpers.h>
+
+ char _license[] SEC("license") = "GPL";
+
+ SEC("iter/task_file")
+ int dump_task_file(struct bpf_iter__task_file *ctx)
+ {
+     struct seq_file *seq = ctx->meta->seq;
+     struct task_struct *task = ctx->task;
+     struct file *file = ctx->file;
+     __u32 fd = ctx->fd;
+
+     if (task == NULL || file == NULL)
+         return 0;
+
+     if (ctx->meta->seq_num == 0) {
+         BPF_SEQ_PRINTF(seq, "    tgid      pid       fd      file\n");
+     }
+
+     BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
+                    (long)file->f_op);
+     return 0;
+ }
+
+----------------------------------------
+Creating a File Iterator with Parameters
+----------------------------------------
+
+Now, let us look at how to create an iterator that includes only files of a
+process.
+
+First, fill the ``bpf_iter_attach_opts`` struct as shown below:
+
+::
+
+ LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+ union bpf_iter_link_info linfo;
+ memset(&linfo, 0, sizeof(linfo));
+ linfo.task.pid = getpid();
+ opts.link_info = &linfo;
+ opts.link_info_len = sizeof(linfo);
+
+``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
+that only includes opened files for the process with the specified ``pid``. In
+this example, we will only be iterating files for our process. If
+``linfo.task.pid`` is zero, the iterator will visit every opened file of every
+process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
+that visits opened files of a specific thread, not a process. In this example,
+``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a
+separate file descriptor table. In most circumstances, all process threads share
+a single file descriptor table.
+
+Now, in the userspace program, pass the pointer to this struct to
+``bpf_program__attach_iter()``.
+
+::
+
+ link = bpf_program__attach_iter(prog, &opts);
+ iter_fd = bpf_iter_create(bpf_link__fd(link));
+
+If both *tid* and *pid* are zero, an iterator created from this struct
+``bpf_iter_attach_opts`` will include every opened file of every task in the
+system (in the current *pid* namespace, actually). It is the same as passing
+NULL as the second argument to ``bpf_program__attach_iter()``.
+
+The whole program looks like the following code:
+
+::
+
+ #include <stdio.h>
+ #include <unistd.h>
+ #include <bpf/bpf.h>
+ #include <bpf/libbpf.h>
+ #include "bpf_iter_task_ex.skel.h"
+
+ static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
+ {
+     struct bpf_link *link;
+     char buf[16] = {};
+     int iter_fd = -1, len;
+     int ret = 0;
+
+     link = bpf_program__attach_iter(prog, opts);
+     if (!link) {
+         fprintf(stderr, "bpf_program__attach_iter() fails\n");
+         return -1;
+     }
+     iter_fd = bpf_iter_create(bpf_link__fd(link));
+     if (iter_fd < 0) {
+         fprintf(stderr, "bpf_iter_create() fails\n");
+         ret = -1;
+         goto free_link;
+     }
+     /* Do not check the contents, just ensure read() ends without error */
+     while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
+         buf[len] = 0;
+         printf("%s", buf);
+     }
+     printf("\n");
+ free_link:
+     if (iter_fd >= 0)
+         close(iter_fd);
+     bpf_link__destroy(link);
+     return ret;
+ }
+
+ static void test_task_file(void)
+ {
+     LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+     struct bpf_iter_task_ex *skel;
+     union bpf_iter_link_info linfo;
+
+     skel = bpf_iter_task_ex__open_and_load();
+     if (skel == NULL)
+         return;
+     memset(&linfo, 0, sizeof(linfo));
+     linfo.task.pid = getpid();
+     opts.link_info = &linfo;
+     opts.link_info_len = sizeof(linfo);
+     printf("PID %d\n", getpid());
+     do_read_opts(skel->progs.dump_task_file, &opts);
+     bpf_iter_task_ex__destroy(skel);
+ }
+
+ int main(int argc, const char * const * argv)
+ {
+     test_task_file();
+     return 0;
+ }
+
+The following lines are the output of the program.
+
+::
+
+ PID 1859
+
+ tgid pid fd file
+ 1859 1859 0 ffffffff82270aa0
+ 1859 1859 1 ffffffff82270aa0
+ 1859 1859 2 ffffffff82270aa0
+ 1859 1859 3 ffffffff82272980
+ 1859 1859 4 ffffffff8225e120
+ 1859 1859 5 ffffffff82255120
+ 1859 1859 6 ffffffff82254f00
+ 1859 1859 7 ffffffff82254d80
+ 1859 1859 8 ffffffff8225abe0
+
+------------------
+Without Parameters
+------------------
+
+Let us look at how a BPF iterator without parameters skips files of other
+processes in the system. In this case, the BPF program has to check the pid or
+the tid of tasks, or it will receive every opened file in the system (in the
+current *pid* namespace, actually). So, we usually add a global variable to the
+BPF program through which user space passes the target *pid*.
+
+The BPF program would look like the following block.
+
+ ::
+
+ ......
+ int target_pid = 0;
+
+ SEC("iter/task_file")
+ int dump_task_file(struct bpf_iter__task_file *ctx)
+ {
+     ......
+     if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
+         return 0;
+     BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
+                    (long)file->f_op);
+     return 0;
+ }
+
+The user space program would look like the following block:
+
+ ::
+
+ ......
+ static void test_task_file(void)
+ {
+     ......
+     skel = bpf_iter_task_ex__open_and_load();
+     if (skel == NULL)
+         return;
+     skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */
+     memset(&linfo, 0, sizeof(linfo));
+     linfo.task.pid = getpid();
+     opts.link_info = &linfo;
+     opts.link_info_len = sizeof(linfo);
+     ......
+ }
+
+``target_pid`` is a global variable in the BPF program. The user space program
+should initialize the variable with a process ID so that the BPF program skips
+opened files of other processes. When you parametrize a BPF iterator, the
+iterator calls the BPF program fewer times, which can save significant resources.
+
+---------------------------
+Parametrizing VMA Iterators
+---------------------------
+
+By default, a BPF VMA iterator includes every VMA in every process. However,
+you can still specify a process or a thread to include only its VMAs. Unlike
+files, a thread cannot have a separate address space (since Linux 2.6.0-test6).
+Here, using *tid* makes no difference compared to using *pid*.
+
+----------------------------
+Parametrizing Task Iterators
+----------------------------
+
+A BPF task iterator with *pid* includes all tasks (threads) of a process. The
+BPF program receives these tasks one after another. You can specify a BPF task
+iterator with *tid* parameter to include only the tasks that match the given
+*tid*.
diff --git a/Documentation/bpf/bpf_licensing.rst b/Documentation/bpf/bpf_licensing.rst
new file mode 100644
index 000000000000..b19c433f41d2
--- /dev/null
+++ b/Documentation/bpf/bpf_licensing.rst
@@ -0,0 +1,92 @@
+=============
+BPF licensing
+=============
+
+Background
+==========
+
+* Classic BPF was BSD licensed
+
+"BPF" was originally introduced as BSD Packet Filter in
+http://www.tcpdump.org/papers/bpf-usenix93.pdf. The corresponding instruction
+set and its implementation came from BSD with BSD license. That original
+instruction set is now known as "classic BPF".
+
+However, an instruction set is a specification for machine-language
+interaction, similar to a programming language. It is not code. Therefore, the
+application of a BSD license may be misleading in a certain context, as the
+instruction set may enjoy no copyright protection.
+
+* eBPF (extended BPF) instruction set continues to be BSD
+
+In 2014, the classic BPF instruction set was significantly extended. We
+typically refer to this instruction set as eBPF to disambiguate it from cBPF.
+The eBPF instruction set is still BSD licensed.
+
+Implementations of eBPF
+=======================
+
+Using the eBPF instruction set requires implementing code in both kernel space
+and user space.
+
+In Linux Kernel
+---------------
+
+The reference implementations of the eBPF interpreter and various just-in-time
+compilers are part of Linux and are GPLv2 licensed. The implementation of
+eBPF helper functions is also GPLv2 licensed. Interpreters, JITs, helpers,
+and verifiers are collectively called the eBPF runtime.
+
+In User Space
+-------------
+
+There are also implementations of eBPF runtime (interpreter, JITs, helper
+functions) under
+Apache2 (https://github.com/iovisor/ubpf),
+MIT (https://github.com/qmonnet/rbpf), and
+BSD (https://github.com/DPDK/dpdk/blob/main/lib/librte_bpf).
+
+In HW
+-----
+
+Hardware can choose to execute eBPF instructions natively and provide the eBPF
+runtime in hardware or via firmware, which may have a proprietary license.
+
+In other operating systems
+--------------------------
+
+Other kernels or user space implementations of the eBPF instruction set and runtime
+can have proprietary licenses.
+
+Using BPF programs in the Linux kernel
+======================================
+
+The Linux kernel (while being GPLv2) allows linking of proprietary kernel modules
+under these rules:
+Documentation/process/license-rules.rst
+
+When a kernel module is loaded, the Linux kernel checks which functions it
+intends to use. If any function is marked as "GPL only," the corresponding
+module or program has to have a GPL compatible license.
+
+Loading a BPF program into the Linux kernel is similar to loading a kernel
+module. BPF is loaded at run time and not statically linked to the Linux
+kernel. BPF program loading follows the same license checking rules as kernel
+modules. BPF programs can be proprietary if they don't use "GPL only" BPF
+helper functions.
+
+Further, some BPF program types - Linux Security Modules (LSM) and TCP
+Congestion Control (struct_ops), as of Aug 2021 - are required to be GPL
+compatible even if they don't use "GPL only" helper functions directly. The
+registration step of LSM and TCP congestion control modules of the Linux
+kernel is done through EXPORT_SYMBOL_GPL kernel functions. In that sense LSM
+and struct_ops BPF programs are implicitly calling "GPL only" functions.
+The same restriction applies to BPF programs that call kernel functions
+directly via the unstable interface also known as "kfuncs".
+
+Packaging BPF programs with user space applications
+====================================================
+
+Generally, proprietary-licensed applications and GPL licensed BPF programs
+written for the Linux kernel in the same package can co-exist because they are
+separate executable processes. This applies to both cBPF and eBPF programs.
diff --git a/Documentation/bpf/bpf_lsm.rst b/Documentation/bpf/bpf_lsm.rst
deleted file mode 100644
index 0dc3fb0d9544..000000000000
--- a/Documentation/bpf/bpf_lsm.rst
+++ /dev/null
@@ -1,143 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0+
-.. Copyright (C) 2020 Google LLC.
-
-================
-LSM BPF Programs
-================
-
-These BPF programs allow runtime instrumentation of the LSM hooks by privileged
-users to implement system-wide MAC (Mandatory Access Control) and Audit
-policies using eBPF.
-
-Structure
----------
-
-The example shows an eBPF program that can be attached to the ``file_mprotect``
-LSM hook:
-
-.. c:function:: int file_mprotect(struct vm_area_struct *vma, unsigned long reqprot, unsigned long prot);
-
-Other LSM hooks which can be instrumented can be found in
-``include/linux/lsm_hooks.h``.
-
-eBPF programs that use Documentation/bpf/btf.rst do not need to include kernel
-headers for accessing information from the attached eBPF program's context.
-They can simply declare the structures in the eBPF program and only specify
-the fields that need to be accessed.
-
-.. code-block:: c
-
- struct mm_struct {
- unsigned long start_brk, brk, start_stack;
- } __attribute__((preserve_access_index));
-
- struct vm_area_struct {
- unsigned long start_brk, brk, start_stack;
- unsigned long vm_start, vm_end;
- struct mm_struct *vm_mm;
- } __attribute__((preserve_access_index));
-
-
-.. note:: The order of the fields is irrelevant.
-
-This can be further simplified (if one has access to the BTF information at
-build time) by generating the ``vmlinux.h`` with:
-
-.. code-block:: console
-
- # bpftool btf dump file <path-to-btf-vmlinux> format c > vmlinux.h
-
-.. note:: ``path-to-btf-vmlinux`` can be ``/sys/kernel/btf/vmlinux`` if the
- build environment matches the environment the BPF programs are
- deployed in.
-
-The ``vmlinux.h`` can then simply be included in the BPF programs without
-requiring the definition of the types.
-
-The eBPF programs can be declared using the``BPF_PROG``
-macros defined in `tools/lib/bpf/bpf_tracing.h`_. In this
-example:
-
- * ``"lsm/file_mprotect"`` indicates the LSM hook that the program must
- be attached to
- * ``mprotect_audit`` is the name of the eBPF program
-
-.. code-block:: c
-
- SEC("lsm/file_mprotect")
- int BPF_PROG(mprotect_audit, struct vm_area_struct *vma,
- unsigned long reqprot, unsigned long prot, int ret)
- {
- /* ret is the return value from the previous BPF program
- * or 0 if it's the first hook.
- */
- if (ret != 0)
- return ret;
-
- int is_heap;
-
- is_heap = (vma->vm_start >= vma->vm_mm->start_brk &&
- vma->vm_end <= vma->vm_mm->brk);
-
- /* Return an -EPERM or write information to the perf events buffer
- * for auditing
- */
- if (is_heap)
- return -EPERM;
- }
-
-The ``__attribute__((preserve_access_index))`` is a clang feature that allows
-the BPF verifier to update the offsets for the access at runtime using the
-Documentation/bpf/btf.rst information. Since the BPF verifier is aware of the
-types, it also validates all the accesses made to the various types in the
-eBPF program.
-
-Loading
--------
-
-eBPF programs can be loaded with the :manpage:`bpf(2)` syscall's
-``BPF_PROG_LOAD`` operation:
-
-.. code-block:: c
-
- struct bpf_object *obj;
-
- obj = bpf_object__open("./my_prog.o");
- bpf_object__load(obj);
-
-This can be simplified by using a skeleton header generated by ``bpftool``:
-
-.. code-block:: console
-
- # bpftool gen skeleton my_prog.o > my_prog.skel.h
-
-and the program can be loaded by including ``my_prog.skel.h`` and using
-the generated helper, ``my_prog__open_and_load``.
-
-Attachment to LSM Hooks
------------------------
-
-The LSM allows attachment of eBPF programs as LSM hooks using :manpage:`bpf(2)`
-syscall's ``BPF_RAW_TRACEPOINT_OPEN`` operation or more simply by
-using the libbpf helper ``bpf_program__attach_lsm``.
-
-The program can be detached from the LSM hook by *destroying* the ``link``
-link returned by ``bpf_program__attach_lsm`` using ``bpf_link__destroy``.
-
-One can also use the helpers generated in ``my_prog.skel.h`` i.e.
-``my_prog__attach`` for attachment and ``my_prog__destroy`` for cleaning up.
-
-Examples
---------
-
-An example eBPF program can be found in
-`tools/testing/selftests/bpf/progs/lsm.c`_ and the corresponding
-userspace code in `tools/testing/selftests/bpf/prog_tests/test_lsm.c`_
-
-.. Links
-.. _tools/lib/bpf/bpf_tracing.h:
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/lib/bpf/bpf_tracing.h
-.. _tools/testing/selftests/bpf/progs/lsm.c:
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/lsm.c
-.. _tools/testing/selftests/bpf/prog_tests/test_lsm.c:
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/prog_tests/test_lsm.c
diff --git a/Documentation/bpf/bpf_prog_run.rst b/Documentation/bpf/bpf_prog_run.rst
new file mode 100644
index 000000000000..4868c909df5c
--- /dev/null
+++ b/Documentation/bpf/bpf_prog_run.rst
@@ -0,0 +1,117 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Running BPF programs from userspace
+===================================
+
+This document describes the ``BPF_PROG_RUN`` facility for running BPF programs
+from userspace.
+
+.. contents::
+ :local:
+ :depth: 2
+
+
+Overview
+--------
+
+The ``BPF_PROG_RUN`` command can be used through the ``bpf()`` syscall to
+execute a BPF program in the kernel and return the results to userspace. This
+can be used to unit test BPF programs against user-supplied context objects,
+and as a way to explicitly execute programs in the kernel for their side effects. The
+command was previously named ``BPF_PROG_TEST_RUN``, and both constants continue
+to be defined in the UAPI header, aliased to the same value.
+
+The ``BPF_PROG_RUN`` command can be used to execute BPF programs of the
+following types:
+
+- ``BPF_PROG_TYPE_SOCKET_FILTER``
+- ``BPF_PROG_TYPE_SCHED_CLS``
+- ``BPF_PROG_TYPE_SCHED_ACT``
+- ``BPF_PROG_TYPE_XDP``
+- ``BPF_PROG_TYPE_SK_LOOKUP``
+- ``BPF_PROG_TYPE_CGROUP_SKB``
+- ``BPF_PROG_TYPE_LWT_IN``
+- ``BPF_PROG_TYPE_LWT_OUT``
+- ``BPF_PROG_TYPE_LWT_XMIT``
+- ``BPF_PROG_TYPE_LWT_SEG6LOCAL``
+- ``BPF_PROG_TYPE_FLOW_DISSECTOR``
+- ``BPF_PROG_TYPE_STRUCT_OPS``
+- ``BPF_PROG_TYPE_RAW_TRACEPOINT``
+- ``BPF_PROG_TYPE_SYSCALL``
+
+When using the ``BPF_PROG_RUN`` command, userspace supplies an input context
+object and (for program types operating on network packets) a buffer containing
+the packet data that the BPF program will operate on. The kernel will then
+execute the program and return the results to userspace. Note that programs will
+not have any side effects while being run in this mode; in particular, packets
+will not actually be redirected or dropped, the program return code will just be
+returned to userspace. A separate mode for live execution of XDP programs is
+provided, documented separately below.
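+
+Through libbpf, this command is exposed as ``bpf_prog_test_run_opts()``. The
+following is a minimal sketch (assuming ``prog_fd`` refers to a loaded XDP
+program and ``pkt``/``pkt_len`` describe a valid Ethernet frame supplied by
+the caller):
+
+::
+
+  #include <bpf/bpf.h>
+
+  static int run_xdp_once(int prog_fd, void *pkt, __u32 pkt_len)
+  {
+      char out[1500];
+      LIBBPF_OPTS(bpf_test_run_opts, opts,
+          .data_in = pkt,
+          .data_size_in = pkt_len,
+          .data_out = out,
+          .data_size_out = sizeof(out),
+          .repeat = 1,
+      );
+      int err = bpf_prog_test_run_opts(prog_fd, &opts);
+
+      if (err)
+          return err;      /* syscall-level failure */
+      /* opts.retval holds the program's verdict, e.g. XDP_PASS */
+      return opts.retval;
+  }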
+
+Running XDP programs in "live frame mode"
+-----------------------------------------
+
+The ``BPF_PROG_RUN`` command has a separate mode for running live XDP programs,
+which can be used to execute XDP programs in a way where packets will actually
+be processed by the kernel after the execution of the XDP program as if they
+arrived on a physical interface. This mode is activated by setting the
+``BPF_F_TEST_XDP_LIVE_FRAMES`` flag when supplying an XDP program to
+``BPF_PROG_RUN``.
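+
+For illustration, activating this mode via libbpf might look like the following
+sketch (same assumptions as the example above; the exact semantics of this mode
+are described in the list below):
+
+.. code-block:: c
+
+    #include <bpf/bpf.h>
+    #include <linux/bpf.h>
+
+    /* Sketch: run an XDP program 2^20 times in live frame mode. */
+    static int run_live(int prog_fd, void *pkt, __u32 pkt_len)
+    {
+        LIBBPF_OPTS(bpf_test_run_opts, opts,
+            .data_in = pkt,
+            .data_size_in = pkt_len,
+            .repeat = 1 << 20,
+            .batch_size = 64, /* optional; this is the default */
+            .flags = BPF_F_TEST_XDP_LIVE_FRAMES,
+        );
+
+        /* Note: data_out/ctx_out must not be set in this mode. */
+        return bpf_prog_test_run_opts(prog_fd, &opts);
+    }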
+
+The live packet mode is optimised for high performance execution of the supplied
+XDP program many times (suitable for, e.g., running as a traffic generator),
+which means the semantics are not quite as straightforward as in the regular test
+run mode. Specifically:
+
+- When executing an XDP program in live frame mode, the result of the execution
+ will not be returned to userspace; instead, the kernel will perform the
+ operation indicated by the program's return code (drop the packet, redirect
+ it, etc). For this reason, setting the ``data_out`` or ``ctx_out`` attributes
+ in the syscall parameters when running in this mode will be rejected. In
+ addition, not all failures will be reported back to userspace directly;
+ specifically, only fatal errors in setup or during execution (like memory
+ allocation errors) will halt execution and return an error. If an error occurs
+ in packet processing, like a failure to redirect to a given interface,
+ execution will continue with the next repetition; these errors can be detected
+ via the same trace points as for regular XDP programs.
+
+- Userspace can supply an ifindex as part of the context object, just like in
+ the regular (non-live) mode. The XDP program will be executed as though the
+ packet arrived on this interface; i.e., the ``ingress_ifindex`` of the context
+ object will point to that interface. Furthermore, if the XDP program returns
+ ``XDP_PASS``, the packet will be injected into the kernel networking stack as
+ though it arrived on that ifindex, and if it returns ``XDP_TX``, the packet
+ will be transmitted *out* of that same interface. Do note, though, that
+ because the program execution is not happening in driver context, an
+ ``XDP_TX`` is actually turned into the same action as an ``XDP_REDIRECT`` to
+ that same interface (i.e., it will only work if the driver has support for the
+ ``ndo_xdp_xmit`` driver op).
+
+- When running the program with multiple repetitions, the execution will happen
+ in batches. The batch size defaults to 64 packets (which is the same as the
+ maximum NAPI receive batch size), but can be specified by userspace through
+ the ``batch_size`` parameter, up to a maximum of 256 packets. For each batch,
+ the kernel executes the XDP program repeatedly, each invocation getting a
+ separate copy of the packet data. For each repetition, if the program drops
+ the packet, the data page is immediately recycled (see below). Otherwise, the
+ packet is buffered until the end of the batch, at which point all packets
+ buffered this way during the batch are transmitted at once.
+
+- When setting up the test run, the kernel will initialise a pool of memory
+ pages of the same size as the batch size. Each memory page will be initialised
+ with the initial packet data supplied by userspace at ``BPF_PROG_RUN``
+ invocation. When possible, the pages will be recycled on future program
+ invocations, to improve performance. Pages will generally be recycled a full
+ batch at a time, except when a packet is dropped (by return code or because
+ of, say, a redirection error), in which case that page will be recycled
+ immediately. If a packet ends up being passed to the regular networking stack
+ (because the XDP program returns ``XDP_PASS``, or because it ends up being
+ redirected to an interface that injects it into the stack), the page will be
+ released and a new one will be allocated when the pool is empty.
+
+ When recycling, the page content is not rewritten; only the packet boundary
+ pointers (``data``, ``data_end`` and ``data_meta``) in the context object will
+ be reset to the original values. This means that if a program rewrites the
+ packet contents, it has to be prepared to see either the original content or
+ the modified version on subsequent invocations.
diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst
index 846354cd2d69..7cd7c5415a99 100644
--- a/Documentation/bpf/btf.rst
+++ b/Documentation/bpf/btf.rst
@@ -3,7 +3,7 @@ BPF Type Format (BTF)
=====================
1. Introduction
-***************
+===============
BTF (BPF Type Format) is the metadata format which encodes the debug info
related to BPF program/map. The name BTF was used initially to describe data
@@ -30,7 +30,7 @@ sections are discussed in details in :ref:`BTF_Type_String`.
.. _BTF_Type_String:
2. BTF Type and String Encoding
-*******************************
+===============================
The file ``include/uapi/linux/btf.h`` provides high-level definition of how
types/strings are encoded.
@@ -57,13 +57,13 @@ little-endian target. The ``btf_header`` is designed to be extensible with
generated.
2.1 String Encoding
-===================
+-------------------
The first string in the string section must be a null string. The rest of
string table is a concatenation of other null-terminated strings.
2.2 Type Encoding
-=================
+-----------------
The type id ``0`` is reserved for ``void`` type. The type section is parsed
sequentially and type id is assigned to each recognized type starting from id
@@ -74,7 +74,7 @@ sequentially and type id is assigned to each recognized type starting from id
#define BTF_KIND_ARRAY 3 /* Array */
#define BTF_KIND_STRUCT 4 /* Struct */
#define BTF_KIND_UNION 5 /* Union */
- #define BTF_KIND_ENUM 6 /* Enumeration */
+ #define BTF_KIND_ENUM 6 /* Enumeration up to 32-bit values */
#define BTF_KIND_FWD 7 /* Forward */
#define BTF_KIND_TYPEDEF 8 /* Typedef */
#define BTF_KIND_VOLATILE 9 /* Volatile */
@@ -85,6 +85,9 @@ sequentially and type id is assigned to each recognized type starting from id
#define BTF_KIND_VAR 14 /* Variable */
#define BTF_KIND_DATASEC 15 /* Section */
#define BTF_KIND_FLOAT 16 /* Floating point */
+ #define BTF_KIND_DECL_TAG 17 /* Decl Tag */
+ #define BTF_KIND_TYPE_TAG 18 /* Type Tag */
+ #define BTF_KIND_ENUM64 19 /* Enumeration up to 64-bit values */
Note that the type section encodes debug info, not just pure types.
``BTF_KIND_FUNC`` is not a type, and it represents a defined subprogram.
@@ -99,14 +102,14 @@ Each type contains the following common data::
* bits 24-28: kind (e.g. int, ptr, array...etc)
* bits 29-30: unused
* bit 31: kind_flag, currently used by
- * struct, union and fwd
+ * struct, union, fwd, enum and enum64.
*/
__u32 info;
- /* "size" is used by INT, ENUM, STRUCT and UNION.
+ /* "size" is used by INT, ENUM, STRUCT, UNION and ENUM64.
* "size" tells the size of the type it is describing.
*
* "type" is used by PTR, TYPEDEF, VOLATILE, CONST, RESTRICT,
- * FUNC and FUNC_PROTO.
+ * FUNC, FUNC_PROTO, DECL_TAG and TYPE_TAG.
* "type" is a type_id referring to another type.
*/
union {
@@ -279,10 +282,10 @@ modes exist:
``struct btf_type`` encoding requirement:
* ``name_off``: 0 or offset to a valid C identifier
- * ``info.kind_flag``: 0
+ * ``info.kind_flag``: 0 for unsigned, 1 for signed
* ``info.kind``: BTF_KIND_ENUM
* ``info.vlen``: number of enum values
- * ``size``: 4
+ * ``size``: 1/2/4/8
``btf_type`` is followed by ``info.vlen`` number of ``struct btf_enum``.::
@@ -295,6 +298,10 @@ The ``btf_enum`` encoding:
* ``name_off``: offset to a valid C identifier
* ``val``: any value
+If the original enum value is signed and the size is less than 4,
+that value will be sign extended into 4 bytes. If the size is 8,
+the value will be truncated into 4 bytes.
+
2.2.7 BTF_KIND_FWD
~~~~~~~~~~~~~~~~~~
@@ -362,7 +369,8 @@ No additional type data follow ``btf_type``.
* ``name_off``: offset to a valid C identifier
* ``info.kind_flag``: 0
* ``info.kind``: BTF_KIND_FUNC
- * ``info.vlen``: 0
+ * ``info.vlen``: linkage information (BTF_FUNC_STATIC, BTF_FUNC_GLOBAL
+ or BTF_FUNC_EXTERN)
* ``type``: a BTF_KIND_FUNC_PROTO type
No additional type data follow ``btf_type``.
@@ -373,6 +381,9 @@ type. The BTF_KIND_FUNC may in turn be referenced by a func_info in the
:ref:`BTF_Ext_Section` (ELF) or in the arguments to :ref:`BPF_Prog_Load`
(ABI).
+Currently, only linkage values of BTF_FUNC_STATIC and BTF_FUNC_GLOBAL are
+supported in the kernel.
+
2.2.13 BTF_KIND_FUNC_PROTO
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -465,8 +476,83 @@ map definition.
No additional type data follow ``btf_type``.
+2.2.17 BTF_KIND_DECL_TAG
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: offset to a non-empty string
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_DECL_TAG
+ * ``info.vlen``: 0
+ * ``type``: ``struct``, ``union``, ``func``, ``var`` or ``typedef``
+
+``btf_type`` is followed by ``struct btf_decl_tag``.::
+
+ struct btf_decl_tag {
+ __u32 component_idx;
+ };
+
+The ``name_off`` encodes the btf_decl_tag attribute string.
+The ``type`` should be ``struct``, ``union``, ``func``, ``var`` or ``typedef``.
+For ``var`` or ``typedef`` type, ``btf_decl_tag.component_idx`` must be ``-1``.
+For the other three types, if the btf_decl_tag attribute is
+applied to the ``struct``, ``union`` or ``func`` itself,
+``btf_decl_tag.component_idx`` must be ``-1``. Otherwise,
+the attribute is applied to a ``struct``/``union`` member or
+a ``func`` argument, and ``btf_decl_tag.component_idx`` should be a
+valid index (starting from 0) pointing to a member or an argument.
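+
+As an illustrative sketch (the tag and type names here are arbitrary), clang's
+``btf_decl_tag`` attribute produces these encodings::
+
+    #define __tag1 __attribute__((btf_decl_tag("tag1")))
+
+    struct t {
+        int a;
+        int b __tag1; /* tag on a member: component_idx == 1 */
+    } __tag1;         /* tag on the struct itself: component_idx == -1 */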
+
+2.2.18 BTF_KIND_TYPE_TAG
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: offset to a non-empty string
+ * ``info.kind_flag``: 0
+ * ``info.kind``: BTF_KIND_TYPE_TAG
+ * ``info.vlen``: 0
+ * ``type``: the type with ``btf_type_tag`` attribute
+
+Currently, ``BTF_KIND_TYPE_TAG`` is only emitted for pointer types.
+It has the following btf type chain:
+::
+
+ ptr -> [type_tag]*
+ -> [const | volatile | restrict | typedef]*
+ -> base_type
+
+Basically, a pointer type points to zero or more
+type_tag, then zero or more const/volatile/restrict/typedef
+and finally the base type. The base type is one of
+int, ptr, array, struct, union, enum, func_proto and float types.
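+
+For example, the following sketch (the macro name is arbitrary) yields the
+chain ``ptr -> type_tag("user") -> int`` for ``p``::
+
+    #define __user_tag __attribute__((btf_type_tag("user")))
+
+    int __user_tag *p;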
+
+2.2.19 BTF_KIND_ENUM64
+~~~~~~~~~~~~~~~~~~~~~~
+
+``struct btf_type`` encoding requirement:
+ * ``name_off``: 0 or offset to a valid C identifier
+ * ``info.kind_flag``: 0 for unsigned, 1 for signed
+ * ``info.kind``: BTF_KIND_ENUM64
+ * ``info.vlen``: number of enum values
+ * ``size``: 1/2/4/8
+
+``btf_type`` is followed by ``info.vlen`` number of ``struct btf_enum64``.::
+
+ struct btf_enum64 {
+ __u32 name_off;
+ __u32 val_lo32;
+ __u32 val_hi32;
+ };
+
+The ``btf_enum64`` encoding:
+ * ``name_off``: offset to a valid C identifier
+ * ``val_lo32``: lower 32-bit value for a 64-bit value
+ * ``val_hi32``: high 32-bit value for a 64-bit value
+
+If the original enum value is signed and the size is less than 8,
+that value will be sign extended into 8 bytes.
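+
+As a hypothetical example, the enumerator below does not fit into 32 bits and
+is therefore encoded as ``BTF_KIND_ENUM64``, with its value split across the
+two fields::
+
+    enum big {
+        BIG_VAL = 0x123456789abcdefULL, /* val_lo32 = 0x89abcdef,
+                                           val_hi32 = 0x01234567 */
+    };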
+
3. BTF Kernel API
-*****************
+=================
The following bpf syscall command involves BTF:
* BPF_BTF_LOAD: load a blob of BTF data into kernel
@@ -509,14 +595,14 @@ The workflow typically looks like:
3.1 BPF_BTF_LOAD
-================
+----------------
Load a blob of BTF data into kernel. A blob of data, described in
:ref:`BTF_Type_String`, can be directly loaded into the kernel. A ``btf_fd``
is returned to a userspace.
3.2 BPF_MAP_CREATE
-==================
+------------------
A map can be created with ``btf_fd`` and specified key/value type id.::
@@ -527,23 +613,20 @@ A map can be created with ``btf_fd`` and specified key/value type id.::
In libbpf, the map can be defined with extra annotation like below:
::
- struct bpf_map_def SEC("maps") btf_map = {
- .type = BPF_MAP_TYPE_ARRAY,
- .key_size = sizeof(int),
- .value_size = sizeof(struct ipv_counts),
- .max_entries = 4,
- };
- BPF_ANNOTATE_KV_PAIR(btf_map, int, struct ipv_counts);
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct ipv_counts);
+ __uint(max_entries, 4);
+ } btf_map SEC(".maps");
-Here, the parameters for macro BPF_ANNOTATE_KV_PAIR are map name, key and
-value types for the map. During ELF parsing, libbpf is able to extract
-key/value type_id's and assign them to BPF_MAP_CREATE attributes
-automatically.
+During ELF parsing, libbpf is able to extract key/value type_id's and assign
+them to BPF_MAP_CREATE attributes automatically.
.. _BPF_Prog_Load:
3.3 BPF_PROG_LOAD
-=================
+-----------------
During prog_load, func_info and line_info can be passed to kernel with proper
values for the following attributes:
@@ -593,7 +676,7 @@ For line_info, the line number and column number are defined as below:
#define BPF_LINE_INFO_LINE_COL(line_col) ((line_col) & 0x3ff)
3.4 BPF_{PROG,MAP}_GET_NEXT_ID
-==============================
+------------------------------
In kernel, every loaded program, map or btf has a unique id. The id won't
change during the lifetime of a program, map, or btf.
@@ -603,13 +686,13 @@ each command, to user space, for bpf program or maps, respectively, so an
inspection tool can inspect all programs and maps.
3.5 BPF_{PROG,MAP}_GET_FD_BY_ID
-===============================
+-------------------------------
An introspection tool cannot use id to get details about program or maps.
A file descriptor needs to be obtained first for reference-counting purpose.
3.6 BPF_OBJ_GET_INFO_BY_FD
-==========================
+--------------------------
Once a program/map fd is acquired, an introspection tool can get the detailed
information from kernel about this fd, some of which are BTF-related. For
@@ -618,7 +701,7 @@ example, ``bpf_map_info`` returns ``btf_id`` and key/value type ids.
bpf byte codes, and jited_line_info.
3.7 BPF_BTF_GET_FD_BY_ID
-========================
+------------------------
With ``btf_id`` obtained in ``bpf_map_info`` and ``bpf_prog_info``, bpf
syscall command BPF_BTF_GET_FD_BY_ID can retrieve a btf fd. Then, with
@@ -630,10 +713,10 @@ tool has full btf knowledge and is able to pretty print map key/values, dump
func signatures and line info, along with byte/jit codes.
4. ELF File Format Interface
-****************************
+============================
4.1 .BTF section
-================
+----------------
The .BTF section contains type and string data. The format of this section is
same as the one describe in :ref:`BTF_Type_String`.
@@ -641,7 +724,7 @@ same as the one describe in :ref:`BTF_Type_String`.
.. _BTF_Ext_Section:
4.2 .BTF.ext section
-====================
+--------------------
The .BTF.ext section encodes func_info and line_info which needs loader
manipulation before loading into the kernel.
@@ -705,7 +788,7 @@ bpf_insn``. For ELF API, the ``insn_off`` is the byte offset from the
beginning of section (``btf_ext_info_sec->sec_name_off``).
4.2 .BTF_ids section
-====================
+--------------------
The .BTF_ids section encodes BTF ID values that are used within the kernel.
@@ -766,10 +849,10 @@ All the BTF ID lists and sets are compiled in the .BTF_ids section and
resolved during the linking phase of kernel build by ``resolve_btfids`` tool.
5. Using BTF
-************
+============
5.1 bpftool map pretty print
-============================
+----------------------------
With BTF, the map key/value can be printed based on fields rather than simply
raw bytes. This is especially valuable for large structure or if your data
@@ -786,13 +869,12 @@ structure has bitfields. For example, for the following map,::
___A b1:4;
enum A b2:4;
};
- struct bpf_map_def SEC("maps") tmpmap = {
- .type = BPF_MAP_TYPE_ARRAY,
- .key_size = sizeof(__u32),
- .value_size = sizeof(struct tmp_t),
- .max_entries = 1,
- };
- BPF_ANNOTATE_KV_PAIR(tmpmap, int, struct tmp_t);
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct tmp_t);
+ __uint(max_entries, 1);
+ } tmpmap SEC(".maps");
bpftool is able to pretty print like below:
::
@@ -811,7 +893,7 @@ bpftool is able to pretty print like below:
]
5.2 bpftool prog dump
-=====================
+---------------------
The following is an example showing how func_info and line_info can help prog
dump with better kernel symbol names, function prototypes and line
@@ -845,7 +927,7 @@ information.::
[...]
5.3 Verifier Log
-================
+----------------
The following is an example of how line_info can help debugging verification
failure.::
@@ -871,7 +953,7 @@ failure.::
R2 offset is outside of the packet
6. BTF Generation
-*****************
+=================
You need latest pahole
@@ -978,6 +1060,11 @@ format.::
.long 8206 # Line 8 Col 14
7. Testing
-**********
+==========
+
+The kernel BPF selftest `tools/testing/selftests/bpf/prog_tests/btf.c`_
+provides an extensive set of BTF-related tests.
-Kernel bpf selftest `test_btf.c` provides extensive set of BTF-related tests.
+.. Links
+.. _tools/testing/selftests/bpf/prog_tests/btf.c:
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/prog_tests/btf.c
diff --git a/Documentation/bpf/clang-notes.rst b/Documentation/bpf/clang-notes.rst
new file mode 100644
index 000000000000..2c872a1ee08e
--- /dev/null
+++ b/Documentation/bpf/clang-notes.rst
@@ -0,0 +1,36 @@
+.. contents::
+.. sectnum::
+
+==========================
+Clang implementation notes
+==========================
+
+This document provides more details specific to the Clang/LLVM implementation of the eBPF instruction set.
+
+Versions
+========
+
+Clang defined "CPU" versions, where a CPU version of 3 corresponds to the current eBPF ISA.
+
+Clang can select the eBPF ISA version using, for example, ``-mcpu=v3`` to select version 3.
+
+Arithmetic instructions
+=======================
+
+For CPU versions prior to 3, Clang v7.0 and later can enable ``BPF_ALU`` support with
+``-Xclang -target-feature -Xclang +alu32``. In CPU version 3, support is automatically included.
+
+Jump instructions
+=================
+
+If ``-O0`` is used, Clang will generate the ``BPF_CALL | BPF_X | BPF_JMP`` (0x8d)
+instruction, which is not supported by the Linux kernel verifier.
+
+Atomic operations
+=================
+
+Clang can generate atomic instructions by default when ``-mcpu=v3`` is
+enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction
+Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable
+the atomics features, while keeping a lower ``-mcpu`` version, you can use
+``-Xclang -target-feature -Xclang +alu32``.
diff --git a/Documentation/bpf/classic_vs_extended.rst b/Documentation/bpf/classic_vs_extended.rst
new file mode 100644
index 000000000000..2f81a81f5267
--- /dev/null
+++ b/Documentation/bpf/classic_vs_extended.rst
@@ -0,0 +1,376 @@
+
+===================
+Classic BPF vs eBPF
+===================
+
+eBPF is designed to be JITed with one to one mapping, which can also open up
+the possibility for GCC/LLVM compilers to generate optimized eBPF code through
+an eBPF backend that performs almost as fast as natively compiled code.
+
+Some core changes of the eBPF format from classic BPF:
+
+- Number of registers increases from 2 to 10:
+
+ The old format had two registers A and X, and a hidden frame pointer. The
+ new layout extends this to be 10 internal registers and a read-only frame
+ pointer. Since 64-bit CPUs pass arguments to functions via registers, the
+ number of args from an eBPF program to an in-kernel function is restricted
+ to 5, and one register is used to accept the return value from an in-kernel
+ function. Natively, x86_64 passes the first 6 arguments in registers, aarch64/
+ sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
+ registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+ Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
+ etc, and eBPF calling convention maps directly to ABIs used by the kernel on
+ 64-bit architectures.
+
+ On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
+ and may let more complex programs be interpreted.
+
+ R0 - R5 are scratch registers and an eBPF program needs to spill/fill them if
+ necessary across calls. Note that there is only one eBPF program (== one
+ eBPF main routine) and it cannot call other eBPF functions; it can only
+ call predefined in-kernel functions.
+
+- Register width increases from 32-bit to 64-bit:
+
+ Still, the semantics of the original 32-bit ALU operations are preserved
+ via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
+ subregisters that zero-extend into 64-bit if they are being written to.
+ That behavior maps directly to x86_64 and arm64 subregister definition, but
+ makes other JITs more difficult.
+
+ 32-bit architectures run 64-bit eBPF programs via interpreter.
+ Their JITs may convert BPF programs that only use 32-bit subregisters into
+ native instruction set and let the rest be interpreted.
+
+ Operation is 64-bit, because on 64-bit architectures, pointers are also
+ 64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
+ so 32-bit eBPF registers would otherwise require defining a register-pair
+ ABI; thus, it would not be possible to use a direct eBPF register to HW
+ register mapping, and the JIT would need to do combine/split/move operations
+ for every register in and out of the function, which is complex, bug prone
+ and slow.
+ Another reason is the use of atomic 64-bit counters.
+
+- Conditional jt/jf targets replaced with jt/fall-through:
+
+ While the original design has constructs such as ``if (cond) jump_true;
+ else jump_false;``, they are being replaced into alternative constructs like
+ ``if (cond) jump_true; /* else fall-through */``.
+
+- Introduces bpf_call insn and register passing convention for zero overhead
+ calls from/to other kernel functions:
+
+ Before an in-kernel function call, the eBPF program needs to
+ place function arguments into R1 to R5 registers to satisfy calling
+ convention, then the interpreter will take them from registers and pass
+ to in-kernel function. If R1 - R5 registers are mapped to CPU registers
+ that are used for argument passing on given architecture, the JIT compiler
+ doesn't need to emit extra moves. Function arguments will be in the correct
+ registers and BPF_CALL instruction will be JITed as single 'call' HW
+ instruction. This calling convention was picked to cover common call
+ situations without performance penalty.
+
+ After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
+ a return value of the function. Since R6 - R9 are callee saved, their state
+ is preserved across the call.
+
+ For example, consider three C functions::
+
+ u64 f1() { return (*_f2)(1); }
+ u64 f2(u64 a) { return f3(a + 1, a); }
+ u64 f3(u64 a, u64 b) { return a - b; }
+
+ GCC can compile f1, f3 into x86_64::
+
+ f1:
+ movl $1, %edi
+ movq _f2(%rip), %rax
+ jmp *%rax
+ f3:
+ movq %rdi, %rax
+ subq %rsi, %rax
+ ret
+
+ Function f2 in eBPF may look like::
+
+ f2:
+ bpf_mov R2, R1
+ bpf_add R1, 1
+ bpf_call f3
+ bpf_exit
+
+ If f2 is JITed and the pointer is stored to ``_f2``, the calls f1 -> f2 -> f3
+ and returns will be seamless. Without JIT, the __bpf_prog_run() interpreter needs to
+ be used to call into f2.
+
+ For practical reasons all eBPF programs have only one argument 'ctx' which is
+ already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
+ can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
+ are currently not supported, but these restrictions can be lifted if necessary
+ in the future.
+
+ On 64-bit architectures all registers map to HW registers one to one. For
+ example, x86_64 JIT compiler can map them as ...
+
+ ::
+
+ R0 - rax
+ R1 - rdi
+ R2 - rsi
+ R3 - rdx
+ R4 - rcx
+ R5 - r8
+ R6 - rbx
+ R7 - r13
+ R8 - r14
+ R9 - r15
+ R10 - rbp
+
+ ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
+ and rbx, r12 - r15 are callee saved.
+
+ Then the following eBPF pseudo-program::
+
+ bpf_mov R6, R1 /* save ctx */
+ bpf_mov R2, 2
+ bpf_mov R3, 3
+ bpf_mov R4, 4
+ bpf_mov R5, 5
+ bpf_call foo
+ bpf_mov R7, R0 /* save foo() return value */
+ bpf_mov R1, R6 /* restore ctx for next call */
+ bpf_mov R2, 6
+ bpf_mov R3, 7
+ bpf_mov R4, 8
+ bpf_mov R5, 9
+ bpf_call bar
+ bpf_add R0, R7
+ bpf_exit
+
+ After JITing to x86_64, the code may look like::
+
+ push %rbp
+ mov %rsp,%rbp
+ sub $0x228,%rsp
+ mov %rbx,-0x228(%rbp)
+ mov %r13,-0x220(%rbp)
+ mov %rdi,%rbx
+ mov $0x2,%esi
+ mov $0x3,%edx
+ mov $0x4,%ecx
+ mov $0x5,%r8d
+ callq foo
+ mov %rax,%r13
+ mov %rbx,%rdi
+ mov $0x6,%esi
+ mov $0x7,%edx
+ mov $0x8,%ecx
+ mov $0x9,%r8d
+ callq bar
+ add %r13,%rax
+ mov -0x228(%rbp),%rbx
+ mov -0x220(%rbp),%r13
+ leaveq
+ retq
+
+ In this example, this is equivalent in C to::
+
+ u64 bpf_filter(u64 ctx)
+ {
+ return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
+ }
+
+ In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
+ arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
+ registers and place their return value into ``%rax`` which is R0 in eBPF.
+ Prologue and epilogue are emitted by JIT and are implicit in the
+ interpreter. R0-R5 are scratch registers, so an eBPF program needs to save
+ them itself if their values are needed across calls, as defined by the
+ calling convention.
+
+ For example the following program is invalid::
+
+ bpf_mov R1, 1
+ bpf_call foo
+ bpf_mov R0, R1
+ bpf_exit
+
+ After the call the registers R1-R5 contain junk values and cannot be read.
+ An in-kernel verifier (see verifier.rst) is used to validate eBPF programs.
+
+Also in the new design, eBPF is limited to 4096 insns, which means that any
+program will terminate quickly and will only call a fixed number of kernel
+functions. Original BPF and eBPF use two-operand instructions,
+which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.
+
+The input context pointer for invoking the interpreter function is generic;
+its content is defined by a specific use case. For seccomp, register R1 points
+to seccomp_data; for converted BPF filters, R1 points to a skb.
+
+A program that is translated internally consists of the following elements::
+
+ op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32
+
+So far 87 eBPF instructions have been implemented. The 8-bit 'op' opcode field
+has room for new instructions. Some of them may use 16/24/32 byte encoding. New
+instructions must be a multiple of 8 bytes to preserve backward compatibility.
+
+eBPF is a general purpose RISC instruction set. Not every register and
+every instruction is used during translation from original BPF to eBPF.
+For example, socket filters do not use the ``exclusive add`` instruction, but
+tracing filters may do so to maintain event counters. Register R9
+is not used by socket filters either, but more complex filters may run
+out of registers and would have to resort to spill/fill to stack.
+
+eBPF can be used as a generic assembler for last step performance
+optimizations; socket filters and seccomp use it as an assembler. Tracing
+filters may use it as an assembler to generate code from the kernel. In-kernel
+usage may not be bounded by security considerations, since generated eBPF code
+may optimize internal code paths without being exposed to user space.
+Safety of eBPF can come from the verifier (see verifier.rst). In such use
+cases as described, it may be used as a safe instruction set.
+
+Just like the original BPF, eBPF runs within a controlled environment,
+is deterministic and the kernel can easily prove that. The safety of the program
+can be determined in two steps: first step does depth-first-search to disallow
+loops and other CFG validation; second step starts from the first insn and
+descends all possible paths. It simulates execution of every insn and observes
+the state change of registers and stack.
+
+opcode encoding
+===============
+
+eBPF is reusing most of the opcode encoding from classic to simplify conversion
+of classic BPF to eBPF.
+
+For arithmetic and jump instructions the 8-bit 'code' field is divided into three
+parts::
+
+ +----------------+--------+--------------------+
+ | 4 bits | 1 bit | 3 bits |
+ | operation code | source | instruction class |
+ +----------------+--------+--------------------+
+ (MSB) (LSB)
+
+Three LSB bits store instruction class which is one of:
+
+ =================== ===============
+ Classic BPF classes eBPF classes
+ =================== ===============
+ BPF_LD 0x00 BPF_LD 0x00
+ BPF_LDX 0x01 BPF_LDX 0x01
+ BPF_ST 0x02 BPF_ST 0x02
+ BPF_STX 0x03 BPF_STX 0x03
+ BPF_ALU 0x04 BPF_ALU 0x04
+ BPF_JMP 0x05 BPF_JMP 0x05
+ BPF_RET 0x06 BPF_JMP32 0x06
+ BPF_MISC 0x07 BPF_ALU64 0x07
+ =================== ===============
+
+The 4th bit encodes the source operand ...
+
+ ::
+
+ BPF_K 0x00
+ BPF_X 0x08
+
+ * in classic BPF, this means::
+
+ BPF_SRC(code) == BPF_X - use register X as source operand
+ BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
+
+ * in eBPF, this means::
+
+ BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
+ BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
+
+... and four MSB bits store operation code.
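+
+As an illustrative sketch, these three fields can be extracted with the
+``BPF_CLASS()``, ``BPF_SRC()`` and ``BPF_OP()`` macros from
+``include/uapi/linux/bpf_common.h``. Decoding opcode ``0x2d``, for example::
+
+    #include <linux/bpf_common.h>
+
+    unsigned char code = 0x2d;
+
+    BPF_CLASS(code); /* 0x05 == BPF_JMP (3 LSB bits) */
+    BPF_SRC(code);   /* 0x08 == BPF_X   (4th bit)    */
+    BPF_OP(code);    /* 0x20 == BPF_JGT (4 MSB bits) */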
+
+If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::
+
+ BPF_ADD 0x00
+ BPF_SUB 0x10
+ BPF_MUL 0x20
+ BPF_DIV 0x30
+ BPF_OR 0x40
+ BPF_AND 0x50
+ BPF_LSH 0x60
+ BPF_RSH 0x70
+ BPF_NEG 0x80
+ BPF_MOD 0x90
+ BPF_XOR 0xa0
+ BPF_MOV 0xb0 /* eBPF only: mov reg to reg */
+ BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
+ BPF_END 0xd0 /* eBPF only: endianness conversion */
+
+If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::
+
+ BPF_JA 0x00 /* BPF_JMP only */
+ BPF_JEQ 0x10
+ BPF_JGT 0x20
+ BPF_JGE 0x30
+ BPF_JSET 0x40
+ BPF_JNE 0x50 /* eBPF only: jump != */
+ BPF_JSGT 0x60 /* eBPF only: signed '>' */
+ BPF_JSGE 0x70 /* eBPF only: signed '>=' */
+ BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */
+ BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */
+ BPF_JLT 0xa0 /* eBPF only: unsigned '<' */
+ BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */
+ BPF_JSLT 0xc0 /* eBPF only: signed '<' */
+ BPF_JSLE 0xd0 /* eBPF only: signed '<=' */
+
+So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
+and eBPF. There are only two registers in classic BPF, so it means A += X.
+In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
+BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and the analogous
+dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.
+
+Classic BPF uses the BPF_MISC class to represent A = X and X = A moves.
+eBPF uses the BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
+BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
+exactly the same operations as BPF_ALU, but with 64-bit wide operands
+instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
+dst_reg = dst_reg + src_reg
+
+Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
+operation. Classic BPF_RET | BPF_K means copy imm32 into return register
+and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
+in eBPF means function exit only. The eBPF program needs to store return
+value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
+BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
+operands for the comparisons instead.
+
+For load and store instructions the 8-bit 'code' field is divided as::
+
+ +--------+--------+-------------------+
+ | 3 bits | 2 bits | 3 bits |
+ | mode | size | instruction class |
+ +--------+--------+-------------------+
+ (MSB) (LSB)
+
+Size modifier is one of ...
+
+::
+
+ BPF_W 0x00 /* word */
+ BPF_H 0x08 /* half word */
+ BPF_B 0x10 /* byte */
+ BPF_DW 0x18 /* eBPF only, double word */
+
+... which encodes size of load/store operation::
+
+ B - 1 byte
+ H - 2 byte
+ W - 4 byte
+ DW - 8 byte (eBPF only)
+
+Mode modifier is one of::
+
+ BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
+ BPF_ABS 0x20
+ BPF_IND 0x40
+ BPF_MEM 0x60
+ BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */
+ BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */
+ BPF_ATOMIC 0xc0 /* eBPF only, atomic operations */
diff --git a/Documentation/bpf/cpumasks.rst b/Documentation/bpf/cpumasks.rst
new file mode 100644
index 000000000000..41efd8874eeb
--- /dev/null
+++ b/Documentation/bpf/cpumasks.rst
@@ -0,0 +1,383 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _cpumasks-header-label:
+
+==================
+BPF cpumask kfuncs
+==================
+
+1. Introduction
+===============
+
+``struct cpumask`` is a bitmap data structure in the kernel whose indices
+reflect the CPUs on the system. Commonly, cpumasks are used to track which CPUs
+a task is affinitized to, but they can also be used to e.g. track which cores
+are associated with a scheduling domain, which cores on a machine are idle,
+etc.
+
+BPF provides programs with a set of :ref:`kfuncs-header-label` that can be
+used to allocate, mutate, query, and free cpumasks.
+
+2. BPF cpumask objects
+======================
+
+There are two different types of cpumasks that can be used by BPF programs.
+
+2.1 ``struct bpf_cpumask *``
+----------------------------
+
+``struct bpf_cpumask *`` is a cpumask that is allocated by BPF, on behalf of a
+BPF program, and whose lifecycle is entirely controlled by BPF. These cpumasks
+are RCU-protected, can be mutated, can be used as kptrs, and can be safely cast
+to a ``struct cpumask *``.
+
+2.1.1 ``struct bpf_cpumask *`` lifecycle
+----------------------------------------
+
+A ``struct bpf_cpumask *`` is allocated, acquired, and released, using the
+following functions:
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_create
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_acquire
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_release
+
+For example:
+
+.. code-block:: c
+
+ struct cpumask_map_value {
+ struct bpf_cpumask __kptr * cpumask;
+ };
+
+ struct array_map {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct cpumask_map_value);
+ __uint(max_entries, 65536);
+ } cpumask_map SEC(".maps");
+
+ static int cpumask_map_insert(struct bpf_cpumask *mask, u32 pid)
+ {
+ struct cpumask_map_value local, *v;
+ long status;
+ struct bpf_cpumask *old;
+ u32 key = pid;
+
+ local.cpumask = NULL;
+ status = bpf_map_update_elem(&cpumask_map, &key, &local, 0);
+ if (status) {
+ bpf_cpumask_release(mask);
+ return status;
+ }
+
+ v = bpf_map_lookup_elem(&cpumask_map, &key);
+ if (!v) {
+ bpf_cpumask_release(mask);
+ return -ENOENT;
+ }
+
+ old = bpf_kptr_xchg(&v->cpumask, mask);
+ if (old)
+ bpf_cpumask_release(old);
+
+ return 0;
+ }
+
+ /**
+ * A sample tracepoint showing how a task's cpumask can be queried and
+ * recorded as a kptr.
+ */
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(record_task_cpumask, struct task_struct *task, u64 clone_flags)
+ {
+ struct bpf_cpumask *cpumask;
+ int ret;
+
+ cpumask = bpf_cpumask_create();
+ if (!cpumask)
+ return -ENOMEM;
+
+ if (!bpf_cpumask_full(task->cpus_ptr))
+ bpf_printk("task %s has CPU affinity", task->comm);
+
+ bpf_cpumask_copy(cpumask, task->cpus_ptr);
+ return cpumask_map_insert(cpumask, task->pid);
+ }
+
+----
+
+2.1.2 ``struct bpf_cpumask *`` as kptrs
+---------------------------------------
+
+As mentioned and illustrated above, these ``struct bpf_cpumask *`` objects can
+also be stored in a map and used as kptrs. If a ``struct bpf_cpumask *`` is in
+a map, the reference can be removed from the map with bpf_kptr_xchg(), or
+opportunistically acquired using RCU:
+
+.. code-block:: c
+
+ /* struct containing the struct bpf_cpumask kptr which is stored in the map. */
+ struct cpumasks_kfunc_map_value {
+ struct bpf_cpumask __kptr * bpf_cpumask;
+ };
+
+ /* The map containing struct cpumasks_kfunc_map_value entries. */
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct cpumasks_kfunc_map_value);
+ __uint(max_entries, 1);
+ } cpumasks_kfunc_map SEC(".maps");
+
+ /* ... */
+
+ /**
+ * A simple example tracepoint program showing how a
+ * struct bpf_cpumask * kptr that is stored in a map can
+ * be passed to kfuncs using RCU protection.
+ */
+ SEC("tp_btf/cgroup_mkdir")
+ int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
+ {
+ struct bpf_cpumask *kptr;
+ struct cpumasks_kfunc_map_value *v;
+ u32 key = 0;
+
+ /* Assume a bpf_cpumask * kptr was previously stored in the map. */
+ v = bpf_map_lookup_elem(&cpumasks_kfunc_map, &key);
+ if (!v)
+ return -ENOENT;
+
+ bpf_rcu_read_lock();
+ /* Acquire a reference to the bpf_cpumask * kptr that's already stored in the map. */
+ kptr = v->cpumask;
+ if (!kptr) {
+ /* If no bpf_cpumask was present in the map, it's because
+ * we're racing with another CPU that removed it with
+ * bpf_kptr_xchg() between the bpf_map_lookup_elem()
+ * above, and our load of the pointer from the map.
+ */
+ bpf_rcu_read_unlock();
+ return -EBUSY;
+ }
+
+ bpf_cpumask_setall(kptr);
+ bpf_rcu_read_unlock();
+
+ return 0;
+ }
+
+----
+
+2.2 ``struct cpumask``
+----------------------
+
+``struct cpumask`` is the object that actually contains the cpumask bitmap
+being queried, mutated, etc. A ``struct bpf_cpumask`` wraps a ``struct
+cpumask``, which is why it's safe to cast it as such (note however that it is
+**not** safe to cast a ``struct cpumask *`` to a ``struct bpf_cpumask *``, and
+the verifier will reject any program that tries to do so).
+
+As we'll see below, any kfunc that mutates its cpumask argument will take a
+``struct bpf_cpumask *`` as that argument. Any kfunc that simply queries the
+cpumask will instead take a ``struct cpumask *``.
+
+3. cpumask kfuncs
+=================
+
+Above, we described the kfuncs that can be used to allocate, acquire, release,
+etc., a ``struct bpf_cpumask *``. This section of the document will describe the
+kfuncs for mutating and querying cpumasks.
+
+3.1 Mutating cpumasks
+---------------------
+
+Some cpumask kfuncs are "read-only" in that they don't mutate any of their
+arguments, whereas others mutate at least one argument (which means that the
+argument must be a ``struct bpf_cpumask *``, as described above).
+
+This section will describe all of the cpumask kfuncs which mutate at least one
+argument. :ref:`cpumasks-querying-label` below describes the read-only kfuncs.
+
+3.1.1 Setting and clearing CPUs
+-------------------------------
+
+bpf_cpumask_set_cpu() and bpf_cpumask_clear_cpu() can be used to set and clear
+a CPU in a ``struct bpf_cpumask`` respectively:
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_set_cpu bpf_cpumask_clear_cpu
+
+These kfuncs are pretty straightforward, and can be used, for example, as
+follows:
+
+.. code-block:: c
+
+ /**
+ * A sample tracepoint showing how a cpumask can be queried.
+ */
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(test_set_clear_cpu, struct task_struct *task, u64 clone_flags)
+ {
+ struct bpf_cpumask *cpumask;
+
+ cpumask = bpf_cpumask_create();
+ if (!cpumask)
+ return -ENOMEM;
+
+ bpf_cpumask_set_cpu(0, cpumask);
+ if (!bpf_cpumask_test_cpu(0, (const struct cpumask *)cpumask))
+ /* Should never happen. */
+ goto release_exit;
+
+ bpf_cpumask_clear_cpu(0, cpumask);
+ if (bpf_cpumask_test_cpu(0, (const struct cpumask *)cpumask))
+ /* Should never happen. */
+ goto release_exit;
+
+ /* struct cpumask * pointers such as task->cpus_ptr can also be queried. */
+ if (bpf_cpumask_test_cpu(0, task->cpus_ptr))
+ bpf_printk("task %s can use CPU %d", task->comm, 0);
+
+ release_exit:
+ bpf_cpumask_release(cpumask);
+ return 0;
+ }
+
+----
+
+bpf_cpumask_test_and_set_cpu() and bpf_cpumask_test_and_clear_cpu() are
+complementary kfuncs that allow callers to atomically test and set (or clear)
+CPUs:
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_test_and_set_cpu bpf_cpumask_test_and_clear_cpu
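+
+For example, a brief sketch in the same style as the set/clear example above
+(the program name is arbitrary) could exercise both kfuncs as follows:
+
+.. code-block:: c
+
+    /**
+     * A sample tracepoint showing how CPUs can be atomically tested and
+     * set (or cleared).
+     */
+    SEC("tp_btf/task_newtask")
+    int BPF_PROG(test_test_and_set_clear, struct task_struct *task, u64 clone_flags)
+    {
+        struct bpf_cpumask *cpumask;
+
+        cpumask = bpf_cpumask_create();
+        if (!cpumask)
+            return -ENOMEM;
+
+        /* CPU 0 is clear in a fresh cpumask, so this must return false. */
+        if (bpf_cpumask_test_and_set_cpu(0, cpumask))
+            /* Should never happen. */
+            goto release_exit;
+
+        /* The bit is now set, so this must return true (and clear it). */
+        if (!bpf_cpumask_test_and_clear_cpu(0, cpumask))
+            /* Should never happen. */
+            goto release_exit;
+
+    release_exit:
+        bpf_cpumask_release(cpumask);
+        return 0;
+    }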
+
+----
+
+We can also set and clear entire ``struct bpf_cpumask *`` objects in one
+operation using bpf_cpumask_setall() and bpf_cpumask_clear():
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_setall bpf_cpumask_clear
+
+3.1.2 Operations between cpumasks
+---------------------------------
+
+In addition to setting and clearing individual CPUs in a single cpumask,
+callers can also perform bitwise operations between multiple cpumasks using
+bpf_cpumask_and(), bpf_cpumask_or(), and bpf_cpumask_xor():
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_and bpf_cpumask_or bpf_cpumask_xor
+
+The following is an example of how they may be used. Note that some of the
+kfuncs shown in this example will be covered in more detail below.
+
+.. code-block:: c
+
+ /**
+ * A sample tracepoint showing how a cpumask can be mutated using
+ * bitwise operators (and queried).
+ */
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(test_and_or_xor, struct task_struct *task, u64 clone_flags)
+ {
+ struct bpf_cpumask *mask1, *mask2, *dst1, *dst2;
+
+ mask1 = bpf_cpumask_create();
+ if (!mask1)
+ return -ENOMEM;
+
+ mask2 = bpf_cpumask_create();
+ if (!mask2) {
+ bpf_cpumask_release(mask1);
+ return -ENOMEM;
+ }
+
+ /* ...safely create the other two masks, dst1 and dst2... */
+
+ bpf_cpumask_set_cpu(0, mask1);
+ bpf_cpumask_set_cpu(1, mask2);
+ bpf_cpumask_and(dst1, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
+ if (!bpf_cpumask_empty((const struct cpumask *)dst1))
+ /* Should never happen. */
+ goto release_exit;
+
+ bpf_cpumask_or(dst1, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
+ if (!bpf_cpumask_test_cpu(0, (const struct cpumask *)dst1))
+ /* Should never happen. */
+ goto release_exit;
+
+ if (!bpf_cpumask_test_cpu(1, (const struct cpumask *)dst1))
+ /* Should never happen. */
+ goto release_exit;
+
+ bpf_cpumask_xor(dst2, (const struct cpumask *)mask1, (const struct cpumask *)mask2);
+ if (!bpf_cpumask_equal((const struct cpumask *)dst1,
+ (const struct cpumask *)dst2))
+ /* Should never happen. */
+ goto release_exit;
+
+ release_exit:
+ bpf_cpumask_release(mask1);
+ bpf_cpumask_release(mask2);
+ bpf_cpumask_release(dst1);
+ bpf_cpumask_release(dst2);
+ return 0;
+ }
+
+----
+
+The contents of an entire cpumask may be copied to another using
+bpf_cpumask_copy():
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_copy
+
+----
+
+.. _cpumasks-querying-label:
+
+3.2 Querying cpumasks
+---------------------
+
+In addition to the above kfuncs, there is also a set of read-only kfuncs that
+can be used to query the contents of cpumasks.
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_first bpf_cpumask_first_zero bpf_cpumask_test_cpu
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_equal bpf_cpumask_intersects bpf_cpumask_subset
+ bpf_cpumask_empty bpf_cpumask_full
+
+.. kernel-doc:: kernel/bpf/cpumask.c
+ :identifiers: bpf_cpumask_any bpf_cpumask_any_and
+
+----
+
+Some example usages of these querying kfuncs were shown above. We will not
+replicate those examples here. Note, however, that all of the aforementioned
+kfuncs are tested in `tools/testing/selftests/bpf/progs/cpumask_success.c`_, so
+please take a look there if you're looking for more examples of how they can be
+used.
+
+.. _tools/testing/selftests/bpf/progs/cpumask_success.c:
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/cpumask_success.c
+
+
+4. Adding BPF cpumask kfuncs
+============================
+
+The set of supported BPF cpumask kfuncs is not (yet) a 1-1 match with the
+cpumask operations in include/linux/cpumask.h. Any of those cpumask operations
+could easily be encapsulated in a new kfunc if and when required. If you'd like
+to support a new cpumask operation, please feel free to submit a patch. If you
+do add a new cpumask kfunc, please document it here, and add any relevant
+selftest testcases to the cpumask selftest suite.
diff --git a/Documentation/bpf/faq.rst b/Documentation/bpf/faq.rst
new file mode 100644
index 000000000000..a622602ce9ad
--- /dev/null
+++ b/Documentation/bpf/faq.rst
@@ -0,0 +1,11 @@
+================================
+Frequently asked questions (FAQ)
+================================
+
+Two sets of Questions and Answers (Q&A) are maintained.
+
+.. toctree::
+ :maxdepth: 1
+
+ bpf_design_QA
+ bpf_devel_QA
diff --git a/Documentation/bpf/graph_ds_impl.rst b/Documentation/bpf/graph_ds_impl.rst
new file mode 100644
index 000000000000..61274622b71d
--- /dev/null
+++ b/Documentation/bpf/graph_ds_impl.rst
@@ -0,0 +1,267 @@
+=========================
+BPF Graph Data Structures
+=========================
+
+This document describes implementation details of new-style "graph" data
+structures (linked_list, rbtree), with particular focus on the verifier's
+implementation of semantics specific to those data structures.
+
+Although no specific verifier code is referred to in this document, the document
+assumes that the reader has general knowledge of BPF verifier internals, BPF
+maps, and BPF program writing.
+
+Note that the intent of this document is to describe the current state of
+these graph data structures. **No guarantees** of stability for either
+semantics or APIs are made or implied here.
+
+.. contents::
+ :local:
+ :depth: 2
+
+Introduction
+------------
+
+The BPF map API has historically been the main way to expose data structures
+of various types for use within BPF programs. Some data structures fit naturally
+with the map API (HASH, ARRAY), others less so. Consequently, programs
+interacting with the latter group of data structures can be hard to parse
+for kernel programmers without previous BPF experience.
+
+Luckily, some restrictions which necessitated the use of BPF map semantics are
+no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
+BPF allocator, it is now possible to implement BPF data structures whose API
+and semantics more closely match those exposed to the rest of the kernel.
+
+Two such data structures - linked_list and rbtree - have many verification
+details in common. Because both have "root"s ("head" for linked_list) and
+"node"s, the verifier code and this document refer to common functionality
+as "graph_api", "graph_root", "graph_node", etc.
+
+Unless otherwise stated, examples and semantics below apply to both graph data
+structures.
+
+Unstable API
+------------
+
+Data structures implemented using the BPF map API have historically used BPF
+helper functions - either standard map API helpers like ``bpf_map_update_elem``
+or map-specific helpers. The new-style graph data structures instead use kfuncs
+to define their manipulation helpers. Because there are no stability guarantees
+for kfuncs, the API and semantics for these data structures can be evolved in
+a way that breaks backwards compatibility if necessary.
+
+Root and node types for the new data structures are opaquely defined in the
+``uapi/linux/bpf.h`` header.
+
+Locking
+-------
+
+The new-style data structures are intrusive and are defined similarly to their
+vanilla kernel counterparts:
+
+.. code-block:: c
+
+ struct node_data {
+ long key;
+ long data;
+ struct bpf_rb_node node;
+ };
+
+ struct bpf_spin_lock glock;
+ struct bpf_rb_root groot __contains(node_data, node);
+
+The "root" type for both linked_list and rbtree expects to be in a map_value
+which also contains a ``bpf_spin_lock`` - in the above example both global
+variables are placed in a single-value arraymap. The verifier considers this
+spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
+the same map_value and will enforce that the correct lock is held when
+verifying BPF programs that manipulate the tree. Since this lock checking
+happens at verification time, there is no runtime penalty.
+
+Non-owning references
+---------------------
+
+**Motivation**
+
+Consider the following BPF code:
+
+.. code-block:: c
+
+ struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
+
+ bpf_spin_lock(&lock);
+
+ bpf_rbtree_add(&tree, n); /* PASSED */
+
+ bpf_spin_unlock(&lock);
+
+From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
+has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
+``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
+program has ownership of the pointee's (object pointed to by ``n``) lifetime.
+The BPF program must pass off ownership before exiting - either via
+``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
+``bpf_rbtree_add``.
+
+(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
+"ownership is acquired" and "ownership is passed", respectively)
+
+What should the verifier do with ``n`` after ownership is passed off? If the
+object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
+should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
+the object is no longer valid. The underlying memory may have been reused for
+some other allocation, unmapped, etc.
+
+When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
+obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
+but that would result in programs with useful, common coding patterns being
+rejected, e.g.:
+
+.. code-block:: c
+
+ int x;
+ struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
+
+ bpf_spin_lock(&lock);
+
+ bpf_rbtree_add(&tree, n); /* PASSED */
+ x = n->data;
+ n->data = 42;
+
+ bpf_spin_unlock(&lock);
+
+Both the read from and write to ``n->data`` would be rejected. The verifier
+can do better, though, by taking advantage of two details:
+
+ * Graph data structure APIs can only be used when the ``bpf_spin_lock``
+ associated with the graph root is held
+
+ * Both graph data structures have pointer stability
+
+ * Because graph nodes are allocated with ``bpf_obj_new`` and
+ adding / removing from the root involves fiddling with the
+ ``bpf_{list,rb}_node`` field of the node struct, a graph node will
+ remain at the same address after either operation.
+
+Because the associated ``bpf_spin_lock`` must be held by any program adding
+or removing, if we're in the critical section bounded by that lock, we know
+that no other program can add or remove until the end of the critical section.
+This combined with pointer stability means that, until the critical section
+ends, we can safely access the graph node through ``n`` even after it was used
+to pass ownership.
+
+The verifier considers such a reference a *non-owning reference*. The ref
+returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
+Both terms currently only have meaning in the context of graph nodes and API.
+
+**Details**
+
+Let's enumerate the properties of both types of references.
+
+*owning reference*
+
+ * This reference controls the lifetime of the pointee
+
+ * Ownership of pointee must be 'released' by passing it to some graph API
+ kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee
+
+ * If not released before program ends, verifier considers program invalid
+
+ * Access to the pointee's memory will not page fault
+
+*non-owning reference*
+
+ * This reference does not own the pointee
+
+ * It cannot be used to add the graph node to a graph root, nor ``free``'d via
+ ``bpf_obj_drop``
+
+ * No explicit control of lifetime, but can infer valid lifetime based on
+ non-owning ref existence (see explanation below)
+
+ * Access to the pointee's memory will not page fault
+
+From the verifier's perspective, non-owning references can only exist
+between spin_lock and spin_unlock. Why? After spin_unlock another program
+can do arbitrary operations on the data structure like removing and ``free``-ing
+via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
+``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
+Or the memory could go away.
+
+To prevent this logic violation all non-owning references are invalidated by the
+verifier after a critical section ends. This is necessary to ensure the "will
+not page fault" property of non-owning references. So if the verifier hasn't
+invalidated a non-owning ref, accessing it will not page fault.
+
+Currently ``bpf_obj_drop`` is not allowed in the critical section, so
+if there's a valid non-owning ref, we must be in a critical section, and can
+conclude that the ref's memory hasn't been dropped-and- ``free``'d or
+dropped-and-reused.
+
+Any reference to a node that is in an rbtree _must_ be non-owning, since
+the tree has control of the pointee's lifetime. Similarly, any ref to a node
+that isn't in rbtree _must_ be owning. This results in a nice property:
+graph API add / remove implementations don't need to check if a node
+has already been added (or already removed), as the ownership model
+allows the verifier to prevent such a state from being valid by simply checking
+types.
+
+However, pointer aliasing poses an issue for the above "nice property".
+Consider the following example:
+
+.. code-block:: c
+
+ struct node_data *n, *m, *o, *p;
+ n = bpf_obj_new(typeof(*n)); /* 1 */
+
+ bpf_spin_lock(&lock);
+
+ bpf_rbtree_add(&tree, n); /* 2 */
+ m = bpf_rbtree_first(&tree); /* 3 */
+
+ o = bpf_rbtree_remove(&tree, n); /* 4 */
+ p = bpf_rbtree_remove(&tree, m); /* 5 */
+
+ bpf_spin_unlock(&lock);
+
+ bpf_obj_drop(o);
+ bpf_obj_drop(p); /* 6 */
+
+Assume the tree is empty before this program runs. If we track verifier state
+changes here using numbers in above comments:
+
+ 1) n is an owning reference
+
+ 2) n is a non-owning reference, it's been added to the tree
+
+ 3) n and m are non-owning references, they both point to the same node
+
+ 4) o is an owning reference, n and m non-owning, all point to same node
+
+ 5) o and p are owning, n and m non-owning, all point to the same node
+
+ 6) a double-free has occurred, since o and p point to the same node and o was
+ ``free``'d in the previous statement
+
+States 4 and 5 violate our "nice property", as there are non-owning refs to
+a node which is not in an rbtree. Statement 5 will try to remove a node which
+has already been removed as a result of this violation. State 6 is a dangerous
+double-free.
+
+At a minimum we should prevent state 6 from being possible. If we can't also
+prevent state 5 then we must abandon our "nice property" and check whether a
+node has already been removed at runtime.
+
+We prevent both by generalizing the "invalidate non-owning references" behavior
+of ``bpf_spin_unlock`` and doing similar invalidation after
+``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:
+
+ * takes an arbitrary node argument
+
+ * removes it from the data structure
+
+ * returns an owning reference to the removed node
+
+May result in a state where some other non-owning reference points to the same
+node. So ``remove``-type kfuncs must be considered a non-owning reference
+invalidation point as well.
diff --git a/Documentation/bpf/helpers.rst b/Documentation/bpf/helpers.rst
new file mode 100644
index 000000000000..c4ee0cc20dec
--- /dev/null
+++ b/Documentation/bpf/helpers.rst
@@ -0,0 +1,7 @@
+Helper functions
+================
+
+* `bpf-helpers(7)`_ maintains a list of helpers available to eBPF programs.
+
+.. Links
+.. _bpf-helpers(7): https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
\ No newline at end of file
diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst
index 1ceb5d704a97..dbb39e8f9889 100644
--- a/Documentation/bpf/index.rst
+++ b/Documentation/bpf/index.rst
@@ -5,95 +5,40 @@ BPF Documentation
This directory contains documentation for the BPF (Berkeley Packet
Filter) facility, with a focus on the extended BPF version (eBPF).
-This kernel side documentation is still work in progress. The main
-textual documentation is (for historical reasons) described in
-:ref:`networking-filter`, which describe both classical and extended
-BPF instruction-set.
+This kernel side documentation is still work in progress.
The Cilium project also maintains a `BPF and XDP Reference Guide`_
that goes into great technical depth about the BPF Architecture.
-libbpf
-======
-
-Documentation/bpf/libbpf/libbpf.rst is a userspace library for loading and interacting with bpf programs.
-
-BPF Type Format (BTF)
-=====================
-
.. toctree::
:maxdepth: 1
+ instruction-set
+ verifier
+ libbpf/index
btf
-
-
-Frequently asked questions (FAQ)
-================================
-
-Two sets of Questions and Answers (Q&A) are maintained.
-
-.. toctree::
- :maxdepth: 1
-
- bpf_design_QA
- bpf_devel_QA
-
-Syscall API
-===========
-
-The primary info for the bpf syscall is available in the `man-pages`_
-for `bpf(2)`_. For more information about the userspace API, see
-Documentation/userspace-api/ebpf/index.rst.
-
-Helper functions
-================
-
-* `bpf-helpers(7)`_ maintains a list of helpers available to eBPF programs.
-
-
-Program types
-=============
-
-.. toctree::
- :maxdepth: 1
-
- prog_cgroup_sockopt
- prog_cgroup_sysctl
- prog_flow_dissector
- bpf_lsm
- prog_sk_lookup
-
-
-Map types
-=========
-
-.. toctree::
- :maxdepth: 1
-
- map_cgroup_storage
-
-
-Testing and debugging BPF
-=========================
-
-.. toctree::
- :maxdepth: 1
-
- drgn
- s390
-
-
-Other
-=====
-
-.. toctree::
- :maxdepth: 1
-
- ringbuf
- llvm_reloc
+ faq
+ syscall_api
+ helpers
+ kfuncs
+ cpumasks
+ programs
+ maps
+ bpf_prog_run
+ classic_vs_extended.rst
+ bpf_iterators
+ bpf_licensing
+ test_debug
+ clang-notes
+ linux-notes
+ other
+ redirect
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
.. Links:
-.. _networking-filter: ../networking/filter.rst
-.. _man-pages: https://www.kernel.org/doc/man-pages/
-.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html
-.. _bpf-helpers(7): https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
.. _BPF and XDP Reference Guide: https://docs.cilium.io/en/latest/bpf/
diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst
new file mode 100644
index 000000000000..492980ece1ab
--- /dev/null
+++ b/Documentation/bpf/instruction-set.rst
@@ -0,0 +1,475 @@
+.. contents::
+.. sectnum::
+
+========================================
+eBPF Instruction Set Specification, v1.0
+========================================
+
+This document specifies version 1.0 of the eBPF instruction set.
+
+Documentation conventions
+=========================
+
+For brevity, this document uses the type notation "u64", "u32", etc.
+to mean an unsigned integer whose width is the specified number of bits,
+and "s32", etc. to mean a signed integer of the specified number of bits.
+
+Registers and calling convention
+================================
+
+eBPF has 10 general purpose registers and a read-only frame pointer register,
+all of which are 64-bits wide.
+
+The eBPF calling convention is defined as:
+
+* R0: return value from function calls, and exit value for eBPF programs
+* R1 - R5: arguments for function calls
+* R6 - R9: callee saved registers that function calls will preserve
+* R10: read-only frame pointer to access stack
+
+R0 - R5 are scratch registers and eBPF programs need to spill/fill them if
+necessary across calls.
+
+Instruction encoding
+====================
+
+eBPF has two instruction encodings:
+
+* the basic instruction encoding, which uses 64 bits to encode an instruction
+* the wide instruction encoding, which appends a second 64-bit immediate (i.e.,
+ constant) value after the basic instruction for a total of 128 bits.
+
+The fields of an encoded basic instruction are stored in the
+following order::
+
+ opcode:8 src_reg:4 dst_reg:4 offset:16 imm:32 // In little-endian BPF.
+ opcode:8 dst_reg:4 src_reg:4 offset:16 imm:32 // In big-endian BPF.
+
+**imm**
+ signed integer immediate value
+
+**offset**
+ signed integer offset used with pointer arithmetic
+
+**src_reg**
+ the source register number (0-10), except where otherwise specified
+ (`64-bit immediate instructions`_ reuse this field for other purposes)
+
+**dst_reg**
+ destination register number (0-10)
+
+**opcode**
+ operation to perform
+
+Note that the contents of multi-byte fields ('imm' and 'offset') are
+stored using big-endian byte ordering in big-endian BPF and
+little-endian byte ordering in little-endian BPF.
+
+For example::
+
+ opcode offset imm assembly
+ src_reg dst_reg
+ 07 0 1 00 00 44 33 22 11 r1 += 0x11223344 // little
+ dst_reg src_reg
+ 07 1 0 00 00 11 22 33 44 r1 += 0x11223344 // big
+
+Note that most instructions do not use all of the fields.
+Unused fields shall be cleared to zero.
+
+As discussed below in `64-bit immediate instructions`_, a 64-bit immediate
+instruction uses a 64-bit immediate value that is constructed as follows.
+The 64 bits following the basic instruction contain a pseudo instruction
+using the same format but with opcode, dst_reg, src_reg, and offset all set to zero,
+and imm containing the high 32 bits of the immediate value.
+
+This is depicted in the following figure::
+
+ basic_instruction
+ .-----------------------------.
+ | |
+ code:8 regs:8 offset:16 imm:32 unused:32 imm:32
+ | |
+ '--------------'
+ pseudo instruction
+
+Thus the 64-bit immediate value is constructed as follows:
+
+ imm64 = (next_imm << 32) | imm
+
+where 'next_imm' refers to the imm value of the pseudo instruction
+following the basic instruction. The unused bytes in the pseudo
+instruction are reserved and shall be cleared to zero.
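+
+As a sketch in C-like pseudocode, and taking care that 'imm' is a signed
+32-bit field, both halves must be zero-extended before being combined::
+
+  /* Zero-extend both 32-bit halves so that a negative 'imm' does not
+   * smear sign bits into the upper half of the result.
+   */
+  imm64 = ((u64) (u32) next_imm << 32) | (u32) imm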
+
+Instruction classes
+-------------------
+
+The three LSB bits of the 'opcode' field store the instruction class:
+
+========= ===== =============================== ===================================
+class value description reference
+========= ===== =============================== ===================================
+BPF_LD 0x00 non-standard load operations `Load and store instructions`_
+BPF_LDX 0x01 load into register operations `Load and store instructions`_
+BPF_ST 0x02 store from immediate operations `Load and store instructions`_
+BPF_STX 0x03 store from register operations `Load and store instructions`_
+BPF_ALU 0x04 32-bit arithmetic operations `Arithmetic and jump instructions`_
+BPF_JMP 0x05 64-bit jump operations `Arithmetic and jump instructions`_
+BPF_JMP32 0x06 32-bit jump operations `Arithmetic and jump instructions`_
+BPF_ALU64 0x07 64-bit arithmetic operations `Arithmetic and jump instructions`_
+========= ===== =============================== ===================================
+
+Arithmetic and jump instructions
+================================
+
+For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` and
+``BPF_JMP32``), the 8-bit 'opcode' field is divided into three parts:
+
+============== ====== =================
+4 bits (MSB) 1 bit 3 bits (LSB)
+============== ====== =================
+code source instruction class
+============== ====== =================
+
+**code**
+ the operation code, whose meaning varies by instruction class
+
+**source**
+ the source operand location, which unless otherwise specified is one of:
+
+ ====== ===== ==============================================
+ source value description
+ ====== ===== ==============================================
+ BPF_K 0x00 use 32-bit 'imm' value as source operand
+ BPF_X 0x08 use 'src_reg' register value as source operand
+ ====== ===== ==============================================
+
+**instruction class**
+ the instruction class (see `Instruction classes`_)
+
+Arithmetic instructions
+-----------------------
+
+``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
+otherwise identical operations.
+The 'code' field encodes the operation as below, where 'src' and 'dst' refer
+to the values of the source and destination registers, respectively.
+
+======== ===== ==========================================================
+code value description
+======== ===== ==========================================================
+BPF_ADD 0x00 dst += src
+BPF_SUB 0x10 dst -= src
+BPF_MUL 0x20 dst \*= src
+BPF_DIV 0x30 dst = (src != 0) ? (dst / src) : 0
+BPF_OR 0x40 dst \|= src
+BPF_AND 0x50 dst &= src
+BPF_LSH 0x60 dst <<= src
+BPF_RSH 0x70 dst >>= src
+BPF_NEG 0x80 dst = -dst
+BPF_MOD 0x90 dst = (src != 0) ? (dst % src) : dst
+BPF_XOR 0xa0 dst ^= src
+BPF_MOV 0xb0 dst = src
+BPF_ARSH 0xc0 sign extending shift right
+BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)
+======== ===== ==========================================================
+
+Underflow and overflow are allowed during arithmetic operations, meaning
+the 64-bit or 32-bit value will wrap. If eBPF program execution would
+result in division by zero, the destination register is instead set to zero.
+If execution would result in modulo by zero, for ``BPF_ALU64`` the value of
+the destination register is unchanged whereas for ``BPF_ALU`` the upper
+32 bits of the destination register are zeroed.
+
+``BPF_ADD | BPF_X | BPF_ALU`` means::
+
+ dst = (u32) ((u32) dst + (u32) src)
+
+where '(u32)' indicates that the upper 32 bits are zeroed.
+
+``BPF_ADD | BPF_X | BPF_ALU64`` means::
+
+ dst = dst + src
+
+``BPF_XOR | BPF_K | BPF_ALU`` means::
+
+ dst = (u32) dst ^ (u32) imm32
+
+``BPF_XOR | BPF_K | BPF_ALU64`` means::
+
+ dst = dst ^ imm32
+
+Also note that the division and modulo operations are unsigned. Thus, for
+``BPF_ALU``, 'imm' is first interpreted as an unsigned 32-bit value, whereas
+for ``BPF_ALU64``, 'imm' is first sign extended to 64 bits and the result
+interpreted as an unsigned 64-bit value. There are no instructions for
+signed division or modulo.
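+
+A sketch of these semantics in the same pseudocode style, with the sign
+extension of 'imm' written out explicitly:
+
+``BPF_DIV | BPF_K | BPF_ALU`` means::
+
+  dst = (u32) ((u32) imm != 0 ? ((u32) dst / (u32) imm) : 0)
+
+``BPF_DIV | BPF_K | BPF_ALU64`` means::
+
+  dst = (u64) (s64) imm != 0 ? (dst / (u64) (s64) imm) : 0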
+
+Byte swap instructions
+~~~~~~~~~~~~~~~~~~~~~~
+
+The byte swap instructions use an instruction class of ``BPF_ALU`` and a 4-bit
+'code' field of ``BPF_END``.
+
+The byte swap instructions operate on the destination register
+only and do not use a separate source register or immediate value.
+
+The 1-bit source operand field in the opcode is used to select what byte
+order the operation converts from or to:
+
+========= ===== =================================================
+source value description
+========= ===== =================================================
+BPF_TO_LE 0x00 convert between host byte order and little endian
+BPF_TO_BE 0x08 convert between host byte order and big endian
+========= ===== =================================================
+
+The 'imm' field encodes the width of the swap operations. The following widths
+are supported: 16, 32 and 64.
+
+Examples:
+
+``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means::
+
+ dst = htole16(dst)
+
+``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means::
+
+ dst = htobe64(dst)
+
+Jump instructions
+-----------------
+
+``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
+otherwise identical operations.
+The 'code' field encodes the operation as below:
+
+======== ===== === =========================================== =========================================
+code value src description notes
+======== ===== === =========================================== =========================================
+BPF_JA 0x0 0x0 PC += offset BPF_JMP only
+BPF_JEQ 0x1 any PC += offset if dst == src
+BPF_JGT 0x2 any PC += offset if dst > src unsigned
+BPF_JGE 0x3 any PC += offset if dst >= src unsigned
+BPF_JSET 0x4 any PC += offset if dst & src
+BPF_JNE 0x5 any PC += offset if dst != src
+BPF_JSGT 0x6 any PC += offset if dst > src signed
+BPF_JSGE 0x7 any PC += offset if dst >= src signed
+BPF_CALL 0x8 0x0 call helper function by address see `Helper functions`_
+BPF_CALL 0x8 0x1 call PC += offset see `Program-local functions`_
+BPF_CALL 0x8 0x2 call helper function by BTF ID see `Helper functions`_
+BPF_EXIT 0x9 0x0 return BPF_JMP only
+BPF_JLT 0xa any PC += offset if dst < src unsigned
+BPF_JLE 0xb any PC += offset if dst <= src unsigned
+BPF_JSLT 0xc any PC += offset if dst < src signed
+BPF_JSLE 0xd any PC += offset if dst <= src signed
+======== ===== === =========================================== =========================================
+
+The eBPF program needs to store the return value into register R0 before doing a
+``BPF_EXIT``.
+
+Example:
+
+``BPF_JSGE | BPF_X | BPF_JMP32`` (0x7e) means::
+
+ if (s32)dst s>= (s32)src goto +offset
+
+where 's>=' indicates a signed '>=' comparison.
+
+Helper functions
+~~~~~~~~~~~~~~~~
+
+Helper functions are a concept whereby BPF programs can call into a
+set of functions exposed by the underlying platform.
+
+Historically, each helper function was identified by an address
+encoded in the imm field. The available helper functions may differ
+for each program type, but address values are unique across all program types.
+
+Platforms that support the BPF Type Format (BTF) support identifying
+a helper function by a BTF ID encoded in the imm field, where the BTF ID
+identifies the helper name and type.
+
+Program-local functions
+~~~~~~~~~~~~~~~~~~~~~~~
+Program-local functions are functions exposed by the same BPF program as the
+caller, and are referenced by offset from the call instruction, similar to
+``BPF_JA``. A ``BPF_EXIT`` within the program-local function will return to
+the caller.
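+
+As an illustrative sketch, assuming a C compiler targeting BPF and using a
+function attribute to keep the helper from being inlined away, a static
+function is typically emitted as a program-local function::
+
+  static __attribute__((noinline)) int add_one(int x)
+  {
+      return x + 1;      /* its BPF_EXIT returns to the caller */
+  }
+
+  int prog(void *ctx)
+  {
+      return add_one(1); /* BPF_JMP | BPF_CALL with src set to 0x1 */
+  }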
+
+Load and store instructions
+===========================
+
+For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_STX``), the
+8-bit 'opcode' field is divided as:
+
+============ ====== =================
+3 bits (MSB) 2 bits 3 bits (LSB)
+============ ====== =================
+mode size instruction class
+============ ====== =================
+
+The mode modifier is one of:
+
+ ============= ===== ==================================== ========================================
+ mode modifier value description                          reference
+ ============= ===== ==================================== ========================================
+ BPF_IMM       0x00  64-bit immediate instructions        `64-bit immediate instructions`_
+ BPF_ABS       0x20  legacy BPF packet access (absolute)  `Legacy BPF Packet access instructions`_
+ BPF_IND       0x40  legacy BPF packet access (indirect)  `Legacy BPF Packet access instructions`_
+ BPF_MEM       0x60  regular load and store operations    `Regular load and store operations`_
+ BPF_ATOMIC    0xc0  atomic operations                    `Atomic operations`_
+ ============= ===== ==================================== ========================================
+
+The size modifier is one of:
+
+ ============= ===== =====================
+ size modifier value description
+ ============= ===== =====================
+ BPF_W 0x00 word (4 bytes)
+ BPF_H 0x08 half word (2 bytes)
+ BPF_B 0x10 byte
+ BPF_DW 0x18 double word (8 bytes)
+ ============= ===== =====================
+
+Regular load and store operations
+---------------------------------
+
+The ``BPF_MEM`` mode modifier is used to encode regular load and store
+instructions that transfer data between a register and memory.
+
+``BPF_MEM | <size> | BPF_STX`` means::
+
+ *(size *) (dst + offset) = src
+
+``BPF_MEM | <size> | BPF_ST`` means::
+
+ *(size *) (dst + offset) = imm32
+
+``BPF_MEM | <size> | BPF_LDX`` means::
+
+ dst = *(size *) (src + offset)
+
+Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
+
+Atomic operations
+-----------------
+
+Atomic operations are operations that operate on memory and cannot be
+interrupted or corrupted by other access to the same memory region
+by other eBPF programs or means outside of this specification.
+
+All atomic operations supported by eBPF are encoded as store operations
+that use the ``BPF_ATOMIC`` mode modifier as follows:
+
+* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations
+* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
+* 8-bit and 16-bit wide atomic operations are not supported.
+
+The 'imm' field is used to encode the actual atomic operation.
+Simple atomic operations use a subset of the values defined for
+arithmetic operations in the 'imm' field to encode the atomic operation:
+
+======== ===== ===========
+imm value description
+======== ===== ===========
+BPF_ADD 0x00 atomic add
+BPF_OR 0x40 atomic or
+BPF_AND 0x50 atomic and
+BPF_XOR 0xa0 atomic xor
+======== ===== ===========
+
+
+``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means::
+
+ *(u32 *)(dst + offset) += src
+
+``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF_ADD means::
+
+ *(u64 *)(dst + offset) += src
+
+In addition to the simple atomic operations, there is also a modifier and
+two complex atomic operations:
+
+=========== ================ ===========================
+imm value description
+=========== ================ ===========================
+BPF_FETCH 0x01 modifier: return old value
+BPF_XCHG 0xe0 | BPF_FETCH atomic exchange
+BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange
+=========== ================ ===========================
+
+The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
+always set for the complex atomic operations. If the ``BPF_FETCH`` flag
+is set, then the operation also overwrites ``src`` with the value that
+was in memory before it was modified.
+
+The ``BPF_XCHG`` operation atomically exchanges ``src`` with the value
+addressed by ``dst + offset``.
+
+The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
+``dst + offset`` with ``R0``. If they match, the value addressed by
+``dst + offset`` is replaced with ``src``. In either case, the
+value that was at ``dst + offset`` before the operation is zero-extended
+and loaded back to ``R0``.
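+
+A sketch of the 64-bit compare-and-exchange in C-like pseudocode, where the
+whole sequence executes atomically::
+
+  old = *(u64 *)(dst + offset)
+  if (old == R0)
+      *(u64 *)(dst + offset) = src
+  R0 = old    /* zero-extended for the 32-bit (BPF_W) form */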
+
+64-bit immediate instructions
+-----------------------------
+
+Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
+encoding defined in `Instruction encoding`_, and use the 'src' field of the
+basic instruction to hold an opcode subtype.
+
+The following table defines a set of ``BPF_IMM | BPF_DW | BPF_LD`` instructions
+with opcode subtypes in the 'src' field, using new terms such as "map"
+defined further below:
+
+========================= ====== === ========================================= =========== ==============
+opcode construction opcode src pseudocode imm type dst type
+========================= ====== === ========================================= =========== ==============
+BPF_IMM | BPF_DW | BPF_LD 0x18 0x0 dst = imm64 integer integer
+BPF_IMM | BPF_DW | BPF_LD 0x18 0x1 dst = map_by_fd(imm) map fd map
+BPF_IMM | BPF_DW | BPF_LD 0x18 0x2 dst = map_val(map_by_fd(imm)) + next_imm map fd data pointer
+BPF_IMM | BPF_DW | BPF_LD 0x18 0x3 dst = var_addr(imm) variable id data pointer
+BPF_IMM | BPF_DW | BPF_LD 0x18 0x4 dst = code_addr(imm) integer code pointer
+BPF_IMM | BPF_DW | BPF_LD 0x18 0x5 dst = map_by_idx(imm) map index map
+BPF_IMM | BPF_DW | BPF_LD 0x18 0x6 dst = map_val(map_by_idx(imm)) + next_imm map index data pointer
+========================= ====== === ========================================= =========== ==============
+
+where
+
+* map_by_fd(imm) means to convert a 32-bit file descriptor into an address of a map (see `Maps`_)
+* map_by_idx(imm) means to convert a 32-bit index into an address of a map
+* map_val(map) gets the address of the first value in a given map
+* var_addr(imm) gets the address of a platform variable (see `Platform Variables`_) with a given id
+* code_addr(imm) gets the address of the instruction at a specified relative offset in number of (64-bit) instructions
+* the 'imm type' can be used by disassemblers for display
+* the 'dst type' can be used for verification and JIT compilation purposes
+
+Maps
+~~~~
+
+Maps are shared memory regions accessible by eBPF programs on some platforms.
+A map can have various semantics as defined in a separate document, and may or
+may not have a single contiguous memory region, but the 'map_val(map)' is
+currently only defined for maps that do have a single contiguous memory region.
+
+Each map can have a file descriptor (fd) if supported by the platform, where
+'map_by_fd(imm)' means to get the map with the specified file descriptor. Each
+BPF program can also be defined to use a set of maps associated with the
+program at load time, and 'map_by_idx(imm)' means to get the map with the given
+index in the set associated with the BPF program containing the instruction.
+
+Platform Variables
+~~~~~~~~~~~~~~~~~~
+
+Platform variables are memory regions, identified by integer ids, exposed by
+the runtime and accessible by BPF programs on some platforms. The
+'var_addr(imm)' operation means to get the address of the memory region
+identified by the given id.
+
+Legacy BPF Packet access instructions
+-------------------------------------
+
+eBPF previously introduced special instructions for access to packet data that were
+carried over from classic BPF. However, these instructions are
+deprecated and should no longer be used.
diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
new file mode 100644
index 000000000000..ea2516374d92
--- /dev/null
+++ b/Documentation/bpf/kfuncs.rst
@@ -0,0 +1,609 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _kfuncs-header-label:
+
+=============================
+BPF Kernel Functions (kfuncs)
+=============================
+
+1. Introduction
+===============
+
+BPF Kernel Functions, more commonly known as kfuncs, are functions in the Linux
+kernel which are exposed for use by BPF programs. Unlike normal BPF helpers,
+kfuncs do not have a stable interface and can change from one kernel release to
+another. Hence, BPF programs need to be updated in response to changes in the
+kernel. See :ref:`BPF_kfunc_lifecycle_expectations` for more information.
+
+2. Defining a kfunc
+===================
+
+There are two ways to expose a kernel function to BPF programs: either make an
+existing function in the kernel visible, or add a new wrapper for BPF. In both
+cases, care must be taken that a BPF program can only call such a function in a
+valid context. To enforce this, visibility of a kfunc can be per program type.
+
+If you are not creating a BPF wrapper for an existing kernel function, skip ahead
+to :ref:`BPF_kfunc_nodef`.
+
+2.1 Creating a wrapper kfunc
+----------------------------
+
+When defining a wrapper kfunc, the wrapper function should have extern linkage.
+This prevents the compiler from optimizing away dead code, as this wrapper kfunc
+is not invoked anywhere in the kernel itself. It is not necessary to provide a
+prototype in a header for the wrapper kfunc.
+
+An example is given below::
+
+ /* Disables missing prototype warnings */
+ __diag_push();
+ __diag_ignore_all("-Wmissing-prototypes",
+ "Global kfuncs as their definitions will be in BTF");
+
+ __bpf_kfunc struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
+ {
+ return find_get_task_by_vpid(nr);
+ }
+
+ __diag_pop();
+
+A wrapper kfunc is often needed when we need to annotate parameters of the
+kfunc. Otherwise one may directly make the kfunc visible to the BPF program by
+registering it with the BPF subsystem. See :ref:`BPF_kfunc_nodef`.
+
+2.2 Annotating kfunc parameters
+-------------------------------
+
+Similar to BPF helpers, there is sometimes a need for additional context
+required by the verifier to make the usage of kernel functions safer and
+more useful.
+Hence, we can annotate a parameter by suffixing the name of the argument of the
+kfunc with a __tag, where tag may be one of the supported annotations.
+
+2.2.1 __sz Annotation
+---------------------
+
+This annotation is used to indicate a memory and size pair in the argument list.
+An example is given below::
+
+ __bpf_kfunc void bpf_memzero(void *mem, int mem__sz)
+ {
+ ...
+ }
+
+Here, the verifier will treat the first argument as a PTR_TO_MEM, and the second
+argument as its size. By default, without __sz annotation, the size of the type
+of the pointer is used. Without __sz annotation, a kfunc cannot accept a void
+pointer.
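+
+From the BPF program side, the hypothetical ``bpf_memzero`` above would then
+be called with a buffer and its size together, and the verifier checks that
+the size does not exceed the buffer::
+
+    char buf[16];
+
+    bpf_memzero(buf, sizeof(buf));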
+
+2.2.2 __k Annotation
+--------------------
+
+This annotation is only understood for scalar arguments. It indicates that the
+verifier must check that the scalar argument is a known constant which does not
+denote a size parameter, and whose value is relevant to the safety of the
+program.
+
+An example is given below::
+
+ __bpf_kfunc void *bpf_obj_new(u32 local_type_id__k, ...)
+ {
+ ...
+ }
+
+Here, bpf_obj_new uses the local_type_id argument to find out the size of that
+type ID in the program's BTF and returns a sized pointer to it. Each type ID has a
+distinct size, hence it is crucial to treat each such call as distinct when
+values don't match during verifier state pruning checks.
+
+Hence, whenever a kfunc accepts a constant scalar argument which is not a size
+parameter, and the value of the constant matters for program safety, the __k
+suffix should be used.
+
+2.2.3 __uninit Annotation
+-------------------------
+
+This annotation is used to indicate that the argument will be treated as
+uninitialized.
+
+An example is given below::
+
+ __bpf_kfunc int bpf_dynptr_from_skb(..., struct bpf_dynptr_kern *ptr__uninit)
+ {
+ ...
+ }
+
+Here, the dynptr will be treated as an uninitialized dynptr. Without this
+annotation, the verifier will reject the program if the dynptr passed in is
+not initialized.
+
+.. _BPF_kfunc_nodef:
+
+2.3 Using an existing kernel function
+-------------------------------------
+
+When an existing function in the kernel is fit for consumption by BPF programs,
+it can be directly registered with the BPF subsystem. However, care must still
+be taken to review the context in which it will be invoked by the BPF program
+and whether it is safe to do so.
+
+2.4 Annotating kfuncs
+---------------------
+
+In addition to kfuncs' arguments, the verifier may need more information about the
+type of kfunc(s) being registered with the BPF subsystem. To do so, we define
+flags on a set of kfuncs as follows::
+
+ BTF_SET8_START(bpf_task_set)
+ BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
+ BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
+ BTF_SET8_END(bpf_task_set)
+
+This set encodes the BTF ID of each kfunc listed above, and encodes the flags
+along with it. Of course, it is also allowed to specify no flags.
+
+kfunc definitions should also always be annotated with the ``__bpf_kfunc``
+macro. This prevents issues such as the compiler inlining the kfunc if it's a
+static kernel function, or the function being elided in an LTO build as it's
+not used in the rest of the kernel. Developers should not manually add any
+such annotations to their kfuncs. If an annotation is required to prevent an
+issue like this with your kfunc, it is a bug and should be
+added to the definition of the macro so that other kfuncs are similarly
+protected. An example is given below::
+
+ __bpf_kfunc struct task_struct *bpf_get_task_pid(s32 pid)
+ {
+ ...
+ }
+
+2.4.1 KF_ACQUIRE flag
+---------------------
+
+The KF_ACQUIRE flag is used to indicate that the kfunc returns a pointer to a
+refcounted object. The verifier will then ensure that the pointer to the object
+is eventually released using a release kfunc, or transferred to a map using a
+referenced kptr (by invoking bpf_kptr_xchg). If not, the verifier fails the
+loading of the BPF program until no lingering references remain in all possible
+explored states of the program.
+
+2.4.2 KF_RET_NULL flag
+----------------------
+
+The KF_RET_NULL flag is used to indicate that the pointer returned by the kfunc
+may be NULL. Hence, it forces the user to do a NULL check on the pointer
+returned from the kfunc before making use of it (dereferencing or passing to
+another helper). This flag is often used in pairing with KF_ACQUIRE flag, but
+both are orthogonal to each other.
+
+2.4.3 KF_RELEASE flag
+---------------------
+
+The KF_RELEASE flag is used to indicate that the kfunc releases the pointer
+passed in to it. There can be only one referenced pointer that can be passed
+in. All copies of the pointer being released are invalidated as a result of
+invoking kfunc with this flag. KF_RELEASE kfuncs automatically receive the
+protection afforded by the KF_TRUSTED_ARGS flag described below.
+
+2.4.4 KF_TRUSTED_ARGS flag
+--------------------------
+
+The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
+indicates that all pointer arguments are valid, and that all pointers to
+BTF objects have been passed in their unmodified form (that is, at a zero
+offset, and without having been obtained from walking another pointer, with one
+exception described below).
+
+There are two types of pointers to kernel objects which are considered "valid":
+
+1. Pointers which are passed as tracepoint or struct_ops callback arguments.
+2. Pointers which were returned from a KF_ACQUIRE kfunc.
+
+Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to
+KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.
+
+The definition of "valid" pointers is subject to change at any time, and has
+absolutely no ABI stability guarantees.
+
+As mentioned above, a nested pointer obtained from walking a trusted pointer is
+no longer trusted, with one exception. If a struct type has a field that is
+guaranteed to be valid as long as its parent pointer is trusted, the
+``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as
+follows:
+
+.. code-block:: c
+
+ BTF_TYPE_SAFE_NESTED(struct task_struct) {
+ const cpumask_t *cpus_ptr;
+ };
+
+In other words, you must:
+
+1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro.
+
+2. Specify the type and name of the trusted nested field. This field must match
+ the field in the original type definition exactly.
+
+2.4.5 KF_SLEEPABLE flag
+-----------------------
+
+The KF_SLEEPABLE flag is used for kfuncs that may sleep. Such kfuncs can only
+be called by sleepable BPF programs (BPF_F_SLEEPABLE).
+
+2.4.6 KF_DESTRUCTIVE flag
+--------------------------
+
+The KF_DESTRUCTIVE flag is used to indicate that calling the function is
+destructive to the system. For example, such a call can result in the system
+rebooting or panicking. Due to this, additional restrictions apply to these
+calls. At the moment they only require the CAP_SYS_BOOT capability, but more
+can be added later.
+
+2.4.7 KF_RCU flag
+-----------------
+
+The KF_RCU flag is a weaker version of KF_TRUSTED_ARGS. The kfuncs marked with
+KF_RCU expect either PTR_TRUSTED or MEM_RCU arguments. The verifier guarantees
+that the objects are valid and there is no use-after-free. The pointers are not
+NULL, but the object's refcount could have reached zero. The kfuncs need to
+consider doing refcnt != 0 check, especially when returning a KF_ACQUIRE
+pointer. Note as well that a KF_ACQUIRE kfunc that is KF_RCU should very likely
+also be KF_RET_NULL.
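+
+As an illustrative sketch (``struct widget`` and ``bpf_widget_acquire`` are
+made-up names), an acquire kfunc flagged KF_ACQUIRE | KF_RCU | KF_RET_NULL
+might look like::
+
+    /* The RCU-protected argument may have a refcount that already reached
+     * zero, so only hand out a reference when the count can still be
+     * raised.
+     */
+    __bpf_kfunc struct widget *bpf_widget_acquire(struct widget *w)
+    {
+            if (!refcount_inc_not_zero(&w->refcnt))
+                    return NULL;
+            return w;
+    }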
+
+.. _KF_deprecated_flag:
+
+2.4.8 KF_DEPRECATED flag
+------------------------
+
+The KF_DEPRECATED flag is used for kfuncs which are scheduled to be
+changed or removed in a subsequent kernel release. A kfunc that is
+marked with KF_DEPRECATED should also have any relevant information
+captured in its kernel doc. Such information typically includes the
+kfunc's expected remaining lifespan, a recommendation for new
+functionality that can replace it if any is available, and possibly a
+rationale for why it is being removed.
+
+Note that while on some occasions, a KF_DEPRECATED kfunc may continue to be
+supported and have its KF_DEPRECATED flag removed, it is likely to be far more
+difficult to remove a KF_DEPRECATED flag after it's been added than it is to
+prevent it from being added in the first place. As described in
+:ref:`BPF_kfunc_lifecycle_expectations`, users that rely on specific kfuncs are
+encouraged to make their use-cases known as early as possible, and participate
+in upstream discussions regarding whether to keep, change, deprecate, or remove
+those kfuncs if and when such discussions occur.
+
+2.5 Registering the kfuncs
+--------------------------
+
+Once the kfunc is prepared for use, the final step to making it visible is
+registering it with the BPF subsystem. Registration is done per BPF program
+type. An example is shown below::
+
+ BTF_SET8_START(bpf_task_set)
+ BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
+ BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
+ BTF_SET8_END(bpf_task_set)
+
+ static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &bpf_task_set,
+ };
+
+ static int init_subsystem(void)
+ {
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
+ }
+ late_initcall(init_subsystem);
+
+2.6 Specifying no-cast aliases with ___init
+--------------------------------------------
+
+The verifier will always enforce that the BTF type of a pointer passed to a
+kfunc by a BPF program, matches the type of pointer specified in the kfunc
+definition. The verifier does, however, allow types that are equivalent
+according to the C standard to be passed to the same kfunc arg, even if their
+BTF_IDs differ.
+
+For example, for the following type definition:
+
+.. code-block:: c
+
+ struct bpf_cpumask {
+ cpumask_t cpumask;
+ refcount_t usage;
+ };
+
+The verifier would allow a ``struct bpf_cpumask *`` to be passed to a kfunc
+taking a ``cpumask_t *`` (``cpumask_t`` being a typedef of ``struct cpumask``).
+For instance, both ``struct cpumask *`` and ``struct bpf_cpumask *`` can be
+passed to bpf_cpumask_test_cpu().
+
+In some cases, this type-aliasing behavior is not desired. ``struct
+nf_conn___init`` is one such example:
+
+.. code-block:: c
+
+ struct nf_conn___init {
+ struct nf_conn ct;
+ };
+
+The C standard would consider these types to be equivalent, but it would not
+always be safe to pass either type to a trusted kfunc. ``struct
+nf_conn___init`` represents an allocated ``struct nf_conn`` object that has
+*not yet been initialized*, so it would therefore be unsafe to pass a ``struct
+nf_conn___init *`` to a kfunc that's expecting a fully initialized ``struct
+nf_conn *`` (e.g. ``bpf_ct_change_timeout()``).
+
+In order to accommodate such requirements, the verifier will enforce strict
+PTR_TO_BTF_ID type matching if two types have the exact same name, with one
+being suffixed with ``___init``.
+
+.. _BPF_kfunc_lifecycle_expectations:
+
+3. kfunc lifecycle expectations
+===============================
+
+kfuncs provide a kernel <-> kernel API, and thus are not bound by any of the
+strict stability restrictions associated with kernel <-> user UAPIs. This means
+they can be thought of as similar to EXPORT_SYMBOL_GPL, and can therefore be
+modified or removed by a maintainer of the subsystem they're defined in when
+it's deemed necessary.
+
+Like any other change to the kernel, maintainers will not change or remove a
+kfunc without having a reasonable justification. Whether or not they'll choose
+to change a kfunc will ultimately depend on a variety of factors, such as how
+widely used the kfunc is, how long the kfunc has been in the kernel, whether an
+alternative kfunc exists, what the norm is in terms of stability for the
+subsystem in question, and of course what the technical cost is of continuing
+to support the kfunc.
+
+There are several implications of this:
+
+a) kfuncs that are widely used or have been in the kernel for a long time will
+ be more difficult to justify being changed or removed by a maintainer. In
+ other words, kfuncs that are known to have a lot of users and provide
+ significant value provide stronger incentives for maintainers to invest the
+ time and complexity in supporting them. It is therefore important for
+ developers that are using kfuncs in their BPF programs to communicate and
+ explain how and why those kfuncs are being used, and to participate in
+ discussions regarding those kfuncs when they occur upstream.
+
+b) Unlike regular kernel symbols marked with EXPORT_SYMBOL_GPL, BPF programs
+ that call kfuncs are generally not part of the kernel tree. This means that
+ refactoring cannot typically change callers in-place when a kfunc changes,
+ as is done for e.g. an upstreamed driver being updated in place when a
+ kernel symbol is changed.
+
+ Unlike with regular kernel symbols, this is expected behavior for BPF
+ symbols, and out-of-tree BPF programs that use kfuncs should be considered
+ relevant to discussions and decisions around modifying and removing those
+ kfuncs. The BPF community will take an active role in participating in
+ upstream discussions when necessary to ensure that the perspectives of such
+ users are taken into account.
+
+c) A kfunc will never have any hard stability guarantees. BPF APIs cannot and
+ will not ever hard-block a change in the kernel purely for stability
+ reasons. That being said, kfuncs are features that are meant to solve
+ problems and provide value to users. The decision of whether to change or
+ remove a kfunc is a multivariate technical decision that is made on a
+ case-by-case basis, and which is informed by data points such as those
+ mentioned above. It is expected that a kfunc being removed or changed with
+ no warning will not be a common occurrence or take place without sound
+ justification, but it is a possibility that must be accepted if one is to
+ use kfuncs.
+
+3.1 kfunc deprecation
+---------------------
+
+As described above, while sometimes a maintainer may find that a kfunc must be
+changed or removed immediately to accommodate some changes in their subsystem,
+usually kfuncs will be able to accommodate a longer and more measured
+deprecation process. For example, if a new kfunc comes along which provides
+superior functionality to an existing kfunc, the existing kfunc may be
+deprecated for some period of time to allow users to migrate their BPF programs
+to use the new one. Or, if a kfunc has no known users, a decision may be made
+to remove the kfunc (without providing an alternative API) after some
+deprecation period so as to provide users with a window to notify the kfunc
+maintainer if it turns out that the kfunc is actually being used.
+
+It's expected that the common case will be that kfuncs will go through a
+deprecation period rather than being changed or removed without warning. As
+described in :ref:`KF_deprecated_flag`, the kfunc framework provides the
+KF_DEPRECATED flag to kfunc developers to signal to users that a kfunc has been
+deprecated. Once a kfunc has been marked with KF_DEPRECATED, the following
+procedure is followed for removal:
+
+1. Any relevant information for deprecated kfuncs is documented in the kfunc's
+ kernel docs. This documentation will typically include the kfunc's expected
+ remaining lifespan, a recommendation for new functionality that can replace
+ the usage of the deprecated function (or an explanation as to why no such
+ replacement exists), etc.
+
+2. The deprecated kfunc is kept in the kernel for some period of time after it
+ was first marked as deprecated. This time period will be chosen on a
+ case-by-case basis, and will typically depend on how widespread the use of
+ the kfunc is, how long it has been in the kernel, and how hard it is to move
+ to alternatives. This deprecation time period is "best effort", and as
+ described :ref:`above<BPF_kfunc_lifecycle_expectations>`, circumstances may
+ sometimes dictate that the kfunc be removed before the full intended
+ deprecation period has elapsed.
+
+3. After the deprecation period the kfunc will be removed. At this point, BPF
+ programs calling the kfunc will be rejected by the verifier.
+
+4. Core kfuncs
+==============
+
+The BPF subsystem provides a number of "core" kfuncs that are potentially
+applicable to a wide variety of different possible use cases and programs.
+Those kfuncs are documented here.
+
+4.1 struct task_struct * kfuncs
+-------------------------------
+
+There are a number of kfuncs that allow ``struct task_struct *`` objects to be
+used as kptrs:
+
+.. kernel-doc:: kernel/bpf/helpers.c
+ :identifiers: bpf_task_acquire bpf_task_release
+
+These kfuncs are useful when you want to acquire or release a reference to a
+``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a
+struct_ops callback arg. For example:
+
+.. code-block:: c
+
+ /**
+ * A trivial example tracepoint program that shows how to
+ * acquire and release a struct task_struct * pointer.
+ */
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
+ {
+ struct task_struct *acquired;
+
+ acquired = bpf_task_acquire(task);
+ if (acquired)
+ /*
+ * In a typical program you'd do something like store
+ * the task in a map, and the map will automatically
+ * release it later. Here, we release it manually.
+ */
+ bpf_task_release(acquired);
+ return 0;
+ }
+
+
+References acquired on ``struct task_struct *`` objects are RCU protected.
+Therefore, when in an RCU read region, you can obtain a pointer to a task
+embedded in a map value without having to acquire a reference:
+
+.. code-block:: c
+
+ #define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+ private(TASK) static struct task_struct *global;
+
+ /**
+ * A trivial example showing how to access a task stored
+ * in a map using RCU.
+ */
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(task_rcu_read_example, struct task_struct *task, u64 clone_flags)
+ {
+ struct task_struct *local_copy;
+
+ bpf_rcu_read_lock();
+ local_copy = global;
+ if (local_copy)
+ /*
+ * We could also pass local_copy to kfuncs or helper functions here,
+ * as we're guaranteed that local_copy will be valid until we exit
+ * the RCU read region below.
+ */
+ bpf_printk("Global task %s is valid", local_copy->comm);
+ else
+ bpf_printk("No global task found");
+ bpf_rcu_read_unlock();
+
+ /* At this point we can no longer reference local_copy. */
+
+ return 0;
+ }
+
+----
+
+A BPF program can also look up a task from a pid. This can be useful if the
+caller doesn't have a trusted pointer to a ``struct task_struct *`` object that
+it can acquire a reference on with bpf_task_acquire().
+
+.. kernel-doc:: kernel/bpf/helpers.c
+ :identifiers: bpf_task_from_pid
+
+Here is an example of it being used:
+
+.. code-block:: c
+
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags)
+ {
+ struct task_struct *lookup;
+
+ lookup = bpf_task_from_pid(task->pid);
+ if (!lookup)
+ /* A task should always be found, as %task is a tracepoint arg. */
+ return -ENOENT;
+
+ if (lookup->pid != task->pid) {
+ /* bpf_task_from_pid() looks up the task via its
+ * globally-unique pid from the init_pid_ns. Thus,
+ * the pid of the lookup task should always be the
+ * same as the input task.
+ */
+ bpf_task_release(lookup);
+ return -EINVAL;
+ }
+
+ /* bpf_task_from_pid() returns an acquired reference,
+ * so it must be dropped before returning from the
+ * tracepoint handler.
+ */
+ bpf_task_release(lookup);
+ return 0;
+ }
+
+4.2 struct cgroup * kfuncs
+--------------------------
+
+``struct cgroup *`` objects also have acquire and release functions:
+
+.. kernel-doc:: kernel/bpf/helpers.c
+ :identifiers: bpf_cgroup_acquire bpf_cgroup_release
+
+These kfuncs are used in exactly the same manner as bpf_task_acquire() and
+bpf_task_release() respectively, so we won't provide examples for them.
+
+----
+
+Other kfuncs available for interacting with ``struct cgroup *`` objects are
+bpf_cgroup_ancestor() and bpf_cgroup_from_id(), allowing callers to access
+the ancestor of a cgroup and find a cgroup by its ID, respectively. Both
+return a cgroup kptr.
+
+.. kernel-doc:: kernel/bpf/helpers.c
+ :identifiers: bpf_cgroup_ancestor
+
+.. kernel-doc:: kernel/bpf/helpers.c
+ :identifiers: bpf_cgroup_from_id
+
+Eventually, BPF should be updated to allow this to happen with a normal memory
+load in the program itself. This is currently not possible without more work in
+the verifier. bpf_cgroup_ancestor() can be used as follows:
+
+.. code-block:: c
+
+ /**
+ * Simple tracepoint example that illustrates how a cgroup's
+ * ancestor can be accessed using bpf_cgroup_ancestor().
+ */
+ SEC("tp_btf/cgroup_mkdir")
+ int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
+ {
+ struct cgroup *parent;
+
+ /* The parent cgroup resides at the level before the current cgroup's level. */
+ parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
+ if (!parent)
+ return -ENOENT;
+
+ bpf_printk("Parent id is %d", parent->self.id);
+
+ /* Return the parent cgroup that was acquired above. */
+ bpf_cgroup_release(parent);
+ return 0;
+ }
+
+4.3 struct cpumask * kfuncs
+---------------------------
+
+BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
+destroy struct cpumask * objects. Please refer to :ref:`cpumasks-header-label`
+for more details.
diff --git a/Documentation/bpf/libbpf/index.rst b/Documentation/bpf/libbpf/index.rst
index 4f8adfc3ab83..7545a2049692 100644
--- a/Documentation/bpf/libbpf/index.rst
+++ b/Documentation/bpf/libbpf/index.rst
@@ -1,22 +1,33 @@
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+.. _libbpf:
+
+======
libbpf
======
-For API documentation see the `versioned API documentation site <https://libbpf.readthedocs.io/en/latest/api.html>`_.
+If you are looking to develop BPF applications using the libbpf library, this
+directory contains important documentation that you should read.
+
+To get started, it is recommended to begin with the :doc:`libbpf Overview
+<libbpf_overview>` document, which provides a high-level understanding of the
+libbpf APIs and their usage. This will give you a solid foundation to start
+exploring and utilizing the various features of libbpf to develop your BPF
+applications.
.. toctree::
:maxdepth: 1
+ libbpf_overview
+ API Documentation <https://libbpf.readthedocs.io/en/latest/api.html>
+ program_types
libbpf_naming_convention
libbpf_build
-This is documentation for libbpf, a userspace library for loading and
-interacting with bpf programs.
-All general BPF questions, including kernel functionality, libbpf APIs and
-their application, should be sent to bpf@vger.kernel.org mailing list.
-You can `subscribe <http://vger.kernel.org/vger-lists.html#bpf>`_ to the
-mailing list search its `archive <https://lore.kernel.org/bpf/>`_.
-Please search the archive before asking new questions. It very well might
-be that this was already addressed or answered before.
+All general BPF questions, including kernel functionality, libbpf APIs and their
+application, should be sent to bpf@vger.kernel.org mailing list. You can
+`subscribe <http://vger.kernel.org/vger-lists.html#bpf>`_ to the mailing list
+and search its `archive <https://lore.kernel.org/bpf/>`_. Please search the archive
+before asking new questions. It may be that this was already addressed or
+answered before.
diff --git a/Documentation/bpf/libbpf/libbpf_naming_convention.rst b/Documentation/bpf/libbpf/libbpf_naming_convention.rst
index 9c68d5014ff1..b5b41b61b3c0 100644
--- a/Documentation/bpf/libbpf/libbpf_naming_convention.rst
+++ b/Documentation/bpf/libbpf/libbpf_naming_convention.rst
@@ -9,8 +9,8 @@ described here. It's recommended to follow these conventions whenever a
new function or type is added to keep libbpf API clean and consistent.
All types and functions provided by libbpf API should have one of the
-following prefixes: ``bpf_``, ``btf_``, ``libbpf_``, ``xsk_``,
-``btf_dump_``, ``ring_buffer_``, ``perf_buffer_``.
+following prefixes: ``bpf_``, ``btf_``, ``libbpf_``, ``btf_dump_``,
+``ring_buffer_``, ``perf_buffer_``.
System call wrappers
--------------------
@@ -59,15 +59,6 @@ Auxiliary functions and types that don't fit well in any of categories
described above should have ``libbpf_`` prefix, e.g.
``libbpf_get_error`` or ``libbpf_prog_type_by_name``.
-AF_XDP functions
--------------------
-
-AF_XDP functions should have an ``xsk_`` prefix, e.g.
-``xsk_umem__get_data`` or ``xsk_umem__create``. The interface consists
-of both low-level ring access functions and high-level configuration
-functions. These can be mixed and matched. Note that these functions
-are not reentrant for performance reasons.
-
ABI
---
@@ -92,8 +83,8 @@ This prevents from accidentally exporting a symbol, that is not supposed
to be a part of ABI what, in turn, improves both libbpf developer- and
user-experiences.
-ABI versionning
----------------
+ABI versioning
+--------------
To make future ABI extensions possible libbpf ABI is versioned.
Versioning is implemented by ``libbpf.map`` version script that is
@@ -150,6 +141,46 @@ mirror of the mainline's version of libbpf for a stand-alone build.
However, all changes to libbpf's code base must be upstreamed through
the mainline kernel tree.
+
+API documentation convention
+============================
+
+The libbpf API is documented via comments above definitions in
+header files. These comments can be rendered by doxygen and sphinx
+for well organized html output. This section describes the
+convention in which these comments should be formatted.
+
+Here is an example from btf.h:
+
+.. code-block:: c
+
+ /**
+ * @brief **btf__new()** creates a new instance of a BTF object from the raw
+ * bytes of an ELF's BTF section
+ * @param data raw bytes
+ * @param size number of bytes passed in `data`
+ * @return new BTF object instance which has to be eventually freed with
+ * **btf__free()**
+ *
+ * On error, error-code-encoded-as-pointer is returned, not a NULL. To extract
+ * error code from such a pointer `libbpf_get_error()` should be used. If
+ * `libbpf_set_strict_mode(LIBBPF_STRICT_CLEAN_PTRS)` is enabled, NULL is
+ * returned on error instead. In both cases thread-local `errno` variable is
+ * always set to error code as well.
+ */
+
+The comment must start with a block comment of the form '/\*\*'.
+
+The documentation always starts with a @brief directive. This line is a short
+description of this API. It starts with the name of the API, denoted in bold
+like so: **api_name**. Please include an opening and closing parenthesis if
+this is a function. Follow with the short description of the API. A longer form
+description can be added below the last directive, at the bottom of the comment.
+
+Parameters are denoted with the @param directive; there should be one for each
+parameter. If this is a function with a non-void return, use the @return directive
+to document it.
+
License
-------------------
diff --git a/Documentation/bpf/libbpf/libbpf_overview.rst b/Documentation/bpf/libbpf/libbpf_overview.rst
new file mode 100644
index 000000000000..f36a2d4ffea2
--- /dev/null
+++ b/Documentation/bpf/libbpf/libbpf_overview.rst
@@ -0,0 +1,228 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+libbpf Overview
+===============
+
+libbpf is a C-based library containing a BPF loader that takes compiled BPF
+object files and prepares and loads them into the Linux kernel. libbpf takes the
+heavy lifting of loading, verifying, and attaching BPF programs to various
+kernel hooks, allowing BPF application developers to focus only on BPF program
+correctness and performance.
+
+The following are the high-level features supported by libbpf:
+
+* Provides high-level and low-level APIs for user space programs to interact
+ with BPF programs. The low-level APIs wrap all the bpf system call
+ functionality, which is useful when users need more fine-grained control
+ over the interactions between user space and BPF programs.
+* Provides overall support for the BPF object skeleton generated by bpftool.
+ The skeleton file simplifies the process for the user space programs to access
+ global variables and work with BPF programs.
+* Provides BPF-side APIs, including BPF helper definitions, BPF maps support,
+ and tracing helpers, allowing developers to simplify BPF code writing.
+* Supports BPF CO-RE mechanism, enabling BPF developers to write portable
+ BPF programs that can be compiled once and run across different kernel
+ versions.
+
+This document will delve into the above concepts in detail, providing a deeper
+understanding of the capabilities and advantages of libbpf and how it can help
+you develop BPF applications efficiently.
+
+BPF App Lifecycle and libbpf APIs
+==================================
+
+A BPF application consists of one or more BPF programs (either cooperating or
+completely independent), BPF maps, and global variables. The global
+variables are shared between all BPF programs, which allows them to cooperate on
+a common set of data. libbpf provides APIs that user space programs can use to
+manipulate the BPF programs by triggering different phases of a BPF application
+lifecycle.
+
+The following section provides a brief overview of each phase in the BPF
+lifecycle, followed by a sketch of how user space drives these phases:
+
+* **Open phase**: In this phase, libbpf parses the BPF
+ object file and discovers BPF maps, BPF programs, and global variables. After
+ a BPF app is opened, user space apps can make additional adjustments
+ (setting BPF program types, if necessary; pre-setting initial values for
+ global variables, etc.) before all the entities are created and loaded.
+
+* **Load phase**: In the load phase, libbpf creates BPF
+ maps, resolves various relocations, and verifies and loads BPF programs into
+ the kernel. At this point, libbpf validates all the parts of a BPF application
+ and loads the BPF program into the kernel, but no BPF program has yet been
+ executed. After the load phase, it’s possible to set up the initial BPF map
+ state without racing with the BPF program code execution.
+
+* **Attachment phase**: In this phase, libbpf
+ attaches BPF programs to various BPF hook points (e.g., tracepoints, kprobes,
+ cgroup hooks, network packet processing pipeline, etc.). During this
+ phase, BPF programs perform useful work such as processing
+ packets, or updating BPF maps and global variables that can be read from user
+ space.
+
+* **Tear down phase**: In the tear down phase,
+ libbpf detaches BPF programs and unloads them from the kernel. BPF maps are
+ destroyed, and all the resources used by the BPF app are freed.
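+
+A minimal sketch of driving these phases with the generic libbpf APIs (error
+handling abbreviated; ``example.bpf.o`` and the program name ``handle_tp`` are
+placeholders):
+
+.. code-block:: C
+
+    #include <bpf/libbpf.h>
+
+    int main(void)
+    {
+        struct bpf_object *obj;
+        struct bpf_program *prog;
+        struct bpf_link *link = NULL;
+
+        obj = bpf_object__open_file("example.bpf.o", NULL); /* open phase */
+        if (!obj)
+            return 1;
+
+        if (bpf_object__load(obj))                          /* load phase */
+            goto out;
+
+        prog = bpf_object__find_program_by_name(obj, "handle_tp");
+        if (prog)
+            link = bpf_program__attach(prog);               /* attachment phase */
+        if (!link)
+            goto out;
+
+        /* The BPF program is now live: read maps, poll data, etc. */
+
+        bpf_link__destroy(link);                            /* tear down phase */
+    out:
+        bpf_object__close(obj);
+        return 0;
+    }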
+
+BPF Object Skeleton File
+========================
+
+BPF skeleton is an alternative interface to libbpf APIs for working with BPF
+objects. Skeleton code abstracts away generic libbpf APIs to significantly
+simplify code for manipulating BPF programs from user space. Skeleton code
+includes a bytecode representation of the BPF object file, simplifying the
+process of distributing your BPF code. With BPF bytecode embedded, there are no
+extra files to deploy along with your application binary.
+
+You can generate the skeleton header file (``.skel.h``) for a specific object
+file by passing the BPF object to the bpftool. The generated BPF skeleton
+provides the following custom functions that correspond to the BPF lifecycle,
+each of them prefixed with the specific object name:
+
+* ``<name>__open()`` – creates and opens BPF application (``<name>`` stands for
+ the specific bpf object name)
+* ``<name>__load()`` – instantiates, loads, and verifies BPF application parts
+* ``<name>__attach()`` – attaches all auto-attachable BPF programs (it’s
+ optional, you can have more control by using libbpf APIs directly)
+* ``<name>__destroy()`` – detaches all BPF programs and
+ frees up all used resources
+
+Using the skeleton code is the recommended way to work with BPF programs. Keep
+in mind that the BPF skeleton provides access to the underlying BPF object, so whatever
+was possible to do with generic libbpf APIs is still possible even when the BPF
+skeleton is used. It's an additive convenience feature, with no syscalls, and no
+cumbersome code.
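+
+For comparison, a sketch of the same lifecycle using a skeleton (assuming the
+header was generated from an ``example`` BPF object, so all functions carry the
+``example__`` prefix; error handling abbreviated):
+
+.. code-block:: C
+
+    #include "example.skel.h"
+
+    int main(void)
+    {
+        struct example *skel;
+
+        skel = example__open();              /* open phase */
+        if (!skel)
+            return 1;
+
+        /* Adjustments, e.g. pre-setting global variables, go here. */
+
+        if (example__load(skel))             /* load phase */
+            goto out;
+
+        if (example__attach(skel))           /* attachment phase */
+            goto out;
+
+        /* The BPF program is now live. */
+    out:
+        example__destroy(skel);              /* tear down phase */
+        return 0;
+    }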
+
+Other Advantages of Using Skeleton File
+---------------------------------------
+
+* BPF skeleton provides an interface for user space programs to work with BPF
+ global variables. The skeleton code memory maps global variables as a struct
+ into user space. The struct interface allows user space programs to initialize
+ BPF programs before the BPF load phase and fetch and update data from user
+ space afterward.
+
+* The ``skel.h`` file reflects the object file structure by listing out the
+ available maps, programs, etc. BPF skeleton provides direct access to all the
+ BPF maps and BPF programs as struct fields. This eliminates the need for
+ string-based lookups with ``bpf_object__find_map_by_name()`` and
+ ``bpf_object__find_program_by_name()`` APIs, reducing errors due to BPF source
+ code and user-space code getting out of sync.
+
+* The embedded bytecode representation of the object file ensures that the
+ skeleton and the BPF object file are always in sync.
+
+BPF Helpers
+===========
+
+libbpf provides BPF-side APIs that BPF programs can use to interact with the
+system. The BPF helpers definition allows developers to use them in BPF code as
+any other plain C function. For example, there are helper functions to print
+debugging messages, get the time since the system was booted, interact with BPF
+maps, manipulate network packets, etc.
+
+For a complete description of what the helpers do, the arguments they take, and
+the return value, see the `bpf-helpers
+<https://man7.org/linux/man-pages/man7/bpf-helpers.7.html>`_ man page.
+
+BPF CO-RE (Compile Once – Run Everywhere)
+=========================================
+
+BPF programs work in the kernel space and have access to kernel memory and data
+structures. One limitation that BPF applications come across is the lack of
+portability across different kernel versions and configurations. `BCC
+<https://github.com/iovisor/bcc/>`_ is one of the solutions for BPF
+portability. However, it comes with runtime overhead and a large binary size
+from embedding the compiler with the application.
+
+libbpf steps up the BPF program portability by supporting the BPF CO-RE concept.
+BPF CO-RE brings together BTF type information, libbpf, and the compiler to
+produce a single executable binary that you can run on multiple kernel versions
+and configurations.
+
+To make BPF programs portable, libbpf relies on the BTF type information of the
+running kernel. The kernel also exposes this self-describing authoritative BTF
+information through ``sysfs`` at ``/sys/kernel/btf/vmlinux``.
+
+You can dump the BTF type information of the running kernel as a C header with
+the following command:
+
+::
+
+ $ bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
+
+The command generates a ``vmlinux.h`` header file with all kernel types
+(:doc:`BTF types <../btf>`) that the running kernel uses. Including
+``vmlinux.h`` in your BPF program eliminates dependency on system-wide kernel
+headers.
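+
+A portable BPF program can then start from just these two includes, with no
+system-wide kernel headers required (a sketch):
+
+.. code-block:: C
+
+    #include "vmlinux.h"          /* all kernel types, generated above */
+    #include <bpf/bpf_helpers.h>  /* SEC(), BPF helper declarations */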
+
+libbpf enables portability of BPF programs by looking at the BPF program’s
+recorded BTF type and relocation information and matching them to the BTF
+information (vmlinux) provided by the running kernel. libbpf then resolves and
+matches all the types and fields, and updates necessary offsets and other
+relocatable data to ensure that the BPF program’s logic functions correctly for
+a specific kernel on the host. The BPF CO-RE concept thus eliminates the
+overhead associated with BPF development and allows developers to write
+portable BPF applications without modifications and runtime source code
+compilation on the target machine.
+
+The following code snippet shows how to read the parent field of a kernel
+``task_struct`` using BPF CO-RE and libbpf. The basic helper to read a field in a
+CO-RE relocatable manner is ``bpf_core_read(dst, sz, src)``, which will read
+``sz`` bytes from the field referenced by ``src`` into the memory pointed to by
+``dst``.
+
+.. code-block:: C
+ :emphasize-lines: 6
+
+ //...
+ struct task_struct *task = (void *)bpf_get_current_task();
+ struct task_struct *parent_task;
+ int err;
+
+ err = bpf_core_read(&parent_task, sizeof(void *), &task->parent);
+ if (err) {
+ /* handle error */
+ }
+
+ /* parent_task contains the value of task->parent pointer */
+
+In the code snippet, we first get a pointer to the current ``task_struct`` using
+``bpf_get_current_task()``. We then use ``bpf_core_read()`` to read the parent
+field of the task struct into the ``parent_task`` variable. ``bpf_core_read()``
+is just like the ``bpf_probe_read_kernel()`` BPF helper, except that it records
+information about the field that should be relocated on the target kernel,
+i.e., if the ``parent`` field gets shifted to a different offset within
+``struct task_struct`` due to some new field added in front of it, libbpf will
+automatically adjust the actual offset to the proper value.
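+
+libbpf's ``bpf/bpf_core_read.h`` header also provides the ``BPF_CORE_READ()``
+macro, which chains several such relocatable reads into a single expression.
+A sketch of reading the parent task's PID:
+
+.. code-block:: C
+
+    #include <vmlinux.h>
+    #include <bpf/bpf_core_read.h>
+
+    /* ... */
+    /* equivalent to task->parent->pid, with each field access
+     * recorded for CO-RE relocation */
+    pid_t ppid = BPF_CORE_READ(task, parent, pid);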
+
+Getting Started with libbpf
+===========================
+
+Check out the `libbpf-bootstrap <https://github.com/libbpf/libbpf-bootstrap>`_
+repository with simple examples of using libbpf to build various BPF
+applications.
+
+See also `libbpf API documentation
+<https://libbpf.readthedocs.io/en/latest/api.html>`_.
+
+libbpf and Rust
+===============
+
+If you are building BPF applications in Rust, it is recommended to use the
+`Libbpf-rs <https://github.com/libbpf/libbpf-rs>`_ library instead of using
+bindgen bindings directly to libbpf. Libbpf-rs wraps libbpf functionality in
+Rust-idiomatic interfaces and provides the libbpf-cargo plugin to handle BPF
+code compilation and skeleton generation. Using Libbpf-rs will make building
+the user space part of the BPF application easier. Note that the BPF programs
+themselves must still be written in plain C.
+
+Additional Documentation
+========================
+
+* `Program types and ELF Sections <https://libbpf.readthedocs.io/en/latest/program_types.html>`_
+* `API naming convention <https://libbpf.readthedocs.io/en/latest/libbpf_naming_convention.html>`_
+* `Building libbpf <https://libbpf.readthedocs.io/en/latest/libbpf_build.html>`_
+* `API documentation Convention <https://libbpf.readthedocs.io/en/latest/libbpf_naming_convention.html#api-documentation-convention>`_
diff --git a/Documentation/bpf/libbpf/program_types.rst b/Documentation/bpf/libbpf/program_types.rst
new file mode 100644
index 000000000000..ad4d4d5eecb0
--- /dev/null
+++ b/Documentation/bpf/libbpf/program_types.rst
@@ -0,0 +1,203 @@
+.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+
+.. _program_types_and_elf:
+
+Program Types and ELF Sections
+==============================
+
+The table below lists the program types, their attach types where relevant,
+and the ELF section names supported by libbpf for them. The ELF section names
+follow these rules:
+
+- ``type`` is an exact match, e.g. ``SEC("socket")``
+- ``type+`` means it can be either exact ``SEC("type")`` or well-formed ``SEC("type/extras")``
+ with a '``/``' separator between ``type`` and ``extras``.
+
+When ``extras`` are specified, they provide details of how to auto-attach the BPF program. The
+format of ``extras`` depends on the program type, e.g. ``SEC("tracepoint/<category>/<name>")``
+for tracepoints or ``SEC("usdt/<path>:<provider>:<name>")`` for USDT probes. The extras are
+described in more detail in the footnotes.
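+
+As a sketch, two such section names in BPF C source (the function bodies and
+the tracepoint chosen here are illustrative):
+
+.. code-block:: c
+
+    #include <linux/bpf.h>
+    #include <bpf/bpf_helpers.h>
+
+    /* exact match: a socket filter program */
+    SEC("socket")
+    int socket_prog(struct __sk_buff *skb)
+    {
+        return skb->len;
+    }
+
+    /* type plus extras: auto-attach to a specific tracepoint */
+    SEC("tracepoint/syscalls/sys_enter_openat")
+    int tp_prog(void *ctx)
+    {
+        return 0;
+    }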
+
+
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| Program Type | Attach Type | ELF Section Name | Sleepable |
++===========================================+========================================+==================================+===========+
+| ``BPF_PROG_TYPE_CGROUP_DEVICE`` | ``BPF_CGROUP_DEVICE`` | ``cgroup/dev`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_CGROUP_SKB`` | | ``cgroup/skb`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET_EGRESS`` | ``cgroup_skb/egress`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET_INGRESS`` | ``cgroup_skb/ingress`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_CGROUP_SOCKOPT`` | ``BPF_CGROUP_GETSOCKOPT`` | ``cgroup/getsockopt`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_SETSOCKOPT`` | ``cgroup/setsockopt`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_CGROUP_SOCK_ADDR`` | ``BPF_CGROUP_INET4_BIND`` | ``cgroup/bind4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET4_CONNECT`` | ``cgroup/connect4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET4_GETPEERNAME`` | ``cgroup/getpeername4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET4_GETSOCKNAME`` | ``cgroup/getsockname4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET6_BIND`` | ``cgroup/bind6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET6_CONNECT`` | ``cgroup/connect6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET6_GETPEERNAME`` | ``cgroup/getpeername6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET6_GETSOCKNAME`` | ``cgroup/getsockname6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_UDP4_RECVMSG`` | ``cgroup/recvmsg4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_UDP4_SENDMSG`` | ``cgroup/sendmsg4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_UDP6_RECVMSG`` | ``cgroup/recvmsg6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_UDP6_SENDMSG`` | ``cgroup/sendmsg6`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_CGROUP_SOCK`` | ``BPF_CGROUP_INET4_POST_BIND`` | ``cgroup/post_bind4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET6_POST_BIND`` | ``cgroup/post_bind6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET_SOCK_CREATE`` | ``cgroup/sock_create`` | |
++ + +----------------------------------+-----------+
+| | | ``cgroup/sock`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET_SOCK_RELEASE`` | ``cgroup/sock_release`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_CGROUP_SYSCTL`` | ``BPF_CGROUP_SYSCTL`` | ``cgroup/sysctl`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_EXT`` | | ``freplace+`` [#fentry]_ | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_FLOW_DISSECTOR`` | ``BPF_FLOW_DISSECTOR`` | ``flow_dissector`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_KPROBE`` | | ``kprobe+`` [#kprobe]_ | |
++ + +----------------------------------+-----------+
+| | | ``kretprobe+`` [#kprobe]_ | |
++ + +----------------------------------+-----------+
+| | | ``ksyscall+`` [#ksyscall]_ | |
++ + +----------------------------------+-----------+
+| | | ``kretsyscall+`` [#ksyscall]_ | |
++ + +----------------------------------+-----------+
+| | | ``uprobe+`` [#uprobe]_ | |
++ + +----------------------------------+-----------+
+| | | ``uprobe.s+`` [#uprobe]_ | Yes |
++ + +----------------------------------+-----------+
+| | | ``uretprobe+`` [#uprobe]_ | |
++ + +----------------------------------+-----------+
+| | | ``uretprobe.s+`` [#uprobe]_ | Yes |
++ + +----------------------------------+-----------+
+| | | ``usdt+`` [#usdt]_ | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_TRACE_KPROBE_MULTI`` | ``kprobe.multi+`` [#kpmulti]_ | |
++ + +----------------------------------+-----------+
+| | | ``kretprobe.multi+`` [#kpmulti]_ | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LIRC_MODE2`` | ``BPF_LIRC_MODE2`` | ``lirc_mode2`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LSM`` | ``BPF_LSM_CGROUP`` | ``lsm_cgroup+`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_LSM_MAC`` | ``lsm+`` [#lsm]_ | |
++ + +----------------------------------+-----------+
+| | | ``lsm.s+`` [#lsm]_ | Yes |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LWT_IN`` | | ``lwt_in`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LWT_OUT`` | | ``lwt_out`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LWT_SEG6LOCAL`` | | ``lwt_seg6local`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LWT_XMIT`` | | ``lwt_xmit`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_PERF_EVENT`` | | ``perf_event`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE`` | | ``raw_tp.w+`` [#rawtp]_ | |
++ + +----------------------------------+-----------+
+| | | ``raw_tracepoint.w+`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_RAW_TRACEPOINT`` | | ``raw_tp+`` [#rawtp]_ | |
++ + +----------------------------------+-----------+
+| | | ``raw_tracepoint+`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SCHED_ACT`` | | ``action`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SCHED_CLS`` | | ``classifier`` | |
++ + +----------------------------------+-----------+
+| | | ``tc`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SK_LOOKUP`` | ``BPF_SK_LOOKUP`` | ``sk_lookup`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SK_MSG`` | ``BPF_SK_MSG_VERDICT`` | ``sk_msg`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SK_REUSEPORT`` | ``BPF_SK_REUSEPORT_SELECT_OR_MIGRATE`` | ``sk_reuseport/migrate`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_SK_REUSEPORT_SELECT`` | ``sk_reuseport`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SK_SKB`` | | ``sk_skb`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_SK_SKB_STREAM_PARSER`` | ``sk_skb/stream_parser`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_SK_SKB_STREAM_VERDICT`` | ``sk_skb/stream_verdict`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SOCKET_FILTER`` | | ``socket`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SOCK_OPS`` | ``BPF_CGROUP_SOCK_OPS`` | ``sockops`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_STRUCT_OPS`` | | ``struct_ops+`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SYSCALL`` | | ``syscall`` | Yes |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_TRACEPOINT`` | | ``tp+`` [#tp]_ | |
++ + +----------------------------------+-----------+
+| | | ``tracepoint+`` [#tp]_ | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_TRACING`` | ``BPF_MODIFY_RETURN`` | ``fmod_ret+`` [#fentry]_ | |
++ + +----------------------------------+-----------+
+| | | ``fmod_ret.s+`` [#fentry]_ | Yes |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_TRACE_FENTRY`` | ``fentry+`` [#fentry]_ | |
++ + +----------------------------------+-----------+
+| | | ``fentry.s+`` [#fentry]_ | Yes |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_TRACE_FEXIT`` | ``fexit+`` [#fentry]_ | |
++ + +----------------------------------+-----------+
+| | | ``fexit.s+`` [#fentry]_ | Yes |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_TRACE_ITER`` | ``iter+`` [#iter]_ | |
++ + +----------------------------------+-----------+
+| | | ``iter.s+`` [#iter]_ | Yes |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_TRACE_RAW_TP`` | ``tp_btf+`` [#fentry]_ | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_XDP`` | ``BPF_XDP_CPUMAP`` | ``xdp.frags/cpumap`` | |
++ + +----------------------------------+-----------+
+| | | ``xdp/cpumap`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_XDP_DEVMAP`` | ``xdp.frags/devmap`` | |
++ + +----------------------------------+-----------+
+| | | ``xdp/devmap`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_XDP`` | ``xdp.frags`` | |
++ + +----------------------------------+-----------+
+| | | ``xdp`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+
+
+.. rubric:: Footnotes
+
+.. [#fentry] The ``fentry`` attach format is ``fentry[.s]/<function>``.
+.. [#kprobe] The ``kprobe`` attach format is ``kprobe/<function>[+<offset>]``. Valid
+ characters for ``function`` are ``a-zA-Z0-9_.`` and ``offset`` must be a valid
+ non-negative integer.
+.. [#ksyscall] The ``ksyscall`` attach format is ``ksyscall/<syscall>``.
+.. [#uprobe] The ``uprobe`` attach format is ``uprobe[.s]/<path>:<function>[+<offset>]``.
+.. [#usdt] The ``usdt`` attach format is ``usdt/<path>:<provider>:<name>``.
+.. [#kpmulti] The ``kprobe.multi`` attach format is ``kprobe.multi/<pattern>`` where ``pattern``
+ supports ``*`` and ``?`` wildcards. Valid characters for pattern are
+ ``a-zA-Z0-9_.*?``.
+.. [#lsm] The ``lsm`` attach format is ``lsm[.s]/<hook>``.
+.. [#rawtp] The ``raw_tp`` attach format is ``raw_tracepoint[.w]/<tracepoint>``.
+.. [#tp] The ``tracepoint`` attach format is ``tracepoint/<category>/<name>``.
+.. [#iter] The ``iter`` attach format is ``iter[.s]/<struct-name>``.
diff --git a/Documentation/bpf/linux-notes.rst b/Documentation/bpf/linux-notes.rst
new file mode 100644
index 000000000000..508d009d3bed
--- /dev/null
+++ b/Documentation/bpf/linux-notes.rst
@@ -0,0 +1,83 @@
+.. contents::
+.. sectnum::
+
+==========================
+Linux implementation notes
+==========================
+
+This document provides more details specific to the Linux kernel implementation of the eBPF instruction set.
+
+Byte swap instructions
+======================
+
+``BPF_FROM_LE`` and ``BPF_FROM_BE`` exist as aliases for ``BPF_TO_LE`` and ``BPF_TO_BE`` respectively.
+
+Jump instructions
+=================
+
+``BPF_CALL | BPF_X | BPF_JMP`` (0x8d), where the helper function
+integer would be read from a specified register, is not currently supported
+by the verifier. Any programs with this instruction will fail to load
+until such support is added.
+
+Maps
+====
+
+Linux only supports the ``map_val(map)`` operation on array maps with a single
+element.
+
+Linux uses an fd_array to store maps associated with a BPF program. Thus,
+``map_by_idx(imm)`` uses the fd at that index in the array.
+
+Variables
+=========
+
+The following 64-bit immediate instruction specifies that a variable address,
+which corresponds to some integer stored in the ``imm`` field, should be loaded:
+
+========================= ====== === ========================================= =========== ==============
+opcode construction opcode src pseudocode imm type dst type
+========================= ====== === ========================================= =========== ==============
+BPF_IMM | BPF_DW | BPF_LD 0x18 0x3 dst = var_addr(imm) variable id data pointer
+========================= ====== === ========================================= =========== ==============
+
+On Linux, this integer is a BTF ID.
+
+Legacy BPF Packet access instructions
+=====================================
+
+As mentioned in the `ISA standard documentation <instruction-set.rst#legacy-bpf-packet-access-instructions>`_,
+Linux has special eBPF instructions for access to packet data that have been
+carried over from classic BPF to retain the performance of legacy socket
+filters running in the eBPF interpreter.
+
+The instructions come in two forms: ``BPF_ABS | <size> | BPF_LD`` and
+``BPF_IND | <size> | BPF_LD``.
+
+These instructions are used to access packet data and can only be used when
+the program context is a pointer to a networking packet. ``BPF_ABS``
+accesses packet data at an absolute offset specified by the immediate data
+and ``BPF_IND`` accesses packet data at an offset that includes the value of
+a register in addition to the immediate data.
+
+These instructions have seven implicit operands:
+
+* Register R6 is an implicit input that must contain a pointer to a
+ struct sk_buff.
+* Register R0 is an implicit output which contains the data fetched from
+ the packet.
+* Registers R1-R5 are scratch registers that are clobbered by the
+ instruction.
+
+These instructions have an implicit program exit condition as well. If an
+eBPF program attempts to access data beyond the packet boundary, the
+program execution will be aborted.
+
+``BPF_ABS | BPF_W | BPF_LD`` (0x20) means::
+
+ R0 = ntohl(*(u32 *) ((struct sk_buff *) R6->data + imm))
+
+where ``ntohl()`` converts a 32-bit value from network byte order to host byte order.
+
+``BPF_IND | BPF_W | BPF_LD`` (0x40) means::
+
+ R0 = ntohl(*(u32 *) ((struct sk_buff *) R6->data + src + imm))
diff --git a/Documentation/bpf/map_array.rst b/Documentation/bpf/map_array.rst
new file mode 100644
index 000000000000..f2f51a53e8ae
--- /dev/null
+++ b/Documentation/bpf/map_array.rst
@@ -0,0 +1,262 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+================================================
+BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_PERCPU_ARRAY
+================================================
+
+.. note::
+ - ``BPF_MAP_TYPE_ARRAY`` was introduced in kernel version 3.19
+ - ``BPF_MAP_TYPE_PERCPU_ARRAY`` was introduced in version 4.6
+
+``BPF_MAP_TYPE_ARRAY`` and ``BPF_MAP_TYPE_PERCPU_ARRAY`` provide generic array
+storage. The key type is an unsigned 32-bit integer (4 bytes) and the map is
+of constant size. The size of the array is defined in ``max_entries`` at
+creation time. All array elements are pre-allocated and zero initialized when
+created. ``BPF_MAP_TYPE_PERCPU_ARRAY`` uses a different memory region for each
+CPU whereas ``BPF_MAP_TYPE_ARRAY`` uses the same memory region. The value
+stored can be of any size; however, all array elements are aligned to 8
+bytes.
+
+Since kernel 5.5, memory mapping may be enabled for ``BPF_MAP_TYPE_ARRAY`` by
+setting the flag ``BPF_F_MMAPABLE``. The map definition is page-aligned and
+starts on the first page. Sufficient page-sized and page-aligned blocks of
+memory are allocated to store all array values, starting on the second page,
+which in some cases will result in over-allocation of memory. The benefit of
+memory mapping is increased performance and ease of use, since userspace
+programs are not required to use helper functions to access and mutate data.
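+
+A sketch of user space mapping such an array directly (assuming ``fd`` refers
+to a ``BPF_MAP_TYPE_ARRAY`` of 256 ``long`` values created with
+``BPF_F_MMAPABLE``):
+
+.. code-block:: c
+
+    #include <sys/mman.h>
+
+    long *values = mmap(NULL, 256 * sizeof(long), PROT_READ | PROT_WRITE,
+                        MAP_SHARED, fd, 0);
+    if (values != MAP_FAILED)
+        values[42] += 1;    /* direct access, no syscall per element */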
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Array elements can be retrieved using the ``bpf_map_lookup_elem()`` helper.
+This helper returns a pointer into the array element, so to avoid data races
+with userspace reading the value, the user must use primitives like
+``__sync_fetch_and_add()`` when updating the value in-place.
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+
+Array elements can be updated using the ``bpf_map_update_elem()`` helper.
+
+``bpf_map_update_elem()`` returns 0 on success, or negative error in case of
+failure.
+
+Since the array is of constant size, ``bpf_map_delete_elem()`` is not supported.
+To clear an array element, you may use ``bpf_map_update_elem()`` to insert a
+zero value to that index.
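+
+For example (a sketch, assuming the ``my_map`` array declared in the Examples
+section below):
+
+.. code-block:: c
+
+    __u32 index = 42;
+    long zero = 0;
+
+    /* "delete" the element by overwriting it with zero */
+    bpf_map_update_elem(&my_map, &index, &zero, BPF_ANY);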
+
+Per CPU Array
+-------------
+
+Values stored in ``BPF_MAP_TYPE_ARRAY`` can be accessed by multiple programs
+across different CPUs. To restrict storage to a single CPU, you may use a
+``BPF_MAP_TYPE_PERCPU_ARRAY``.
+
+When using a ``BPF_MAP_TYPE_PERCPU_ARRAY`` the ``bpf_map_update_elem()`` and
+``bpf_map_lookup_elem()`` helpers automatically access the slot for the current
+CPU.
+
+bpf_map_lookup_percpu_elem()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)
+
+The ``bpf_map_lookup_percpu_elem()`` helper can be used to look up the array
+value for a specific CPU. It returns the value on success, or ``NULL`` if no
+entry was found or ``cpu`` is invalid.
+
+Concurrency
+-----------
+
+Since kernel version 5.1, the BPF infrastructure provides ``struct bpf_spin_lock``
+to synchronize access.
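+
+A sketch of guarding in-place updates with a spin lock, assuming a
+hypothetical array map ``locked_map`` whose value embeds the lock:
+
+.. code-block:: c
+
+    struct locked_value {
+        struct bpf_spin_lock lock;
+        long data;
+    };
+
+    /* in a BPF program */
+    struct locked_value *value = bpf_map_lookup_elem(&locked_map, &index);
+
+    if (value) {
+        bpf_spin_lock(&value->lock);
+        value->data += 1;
+        bpf_spin_unlock(&value->lock);
+    }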
+
+Userspace
+---------
+
+Access from userspace uses libbpf APIs with the same names as above, with
+the map identified by its ``fd``.
+
+Examples
+========
+
+Please see the ``tools/testing/selftests/bpf`` directory for functional
+examples. The code samples below demonstrate API usage.
+
+Kernel BPF
+----------
+
+This snippet shows how to declare an array in a BPF program.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, u32);
+ __type(value, long);
+ __uint(max_entries, 256);
+ } my_map SEC(".maps");
+
+
+This example BPF program shows how to access an array element.
+
+.. code-block:: c
+
+ int bpf_prog(struct __sk_buff *skb)
+ {
+ struct iphdr ip;
+ int index;
+ long *value;
+
+ if (bpf_skb_load_bytes(skb, ETH_HLEN, &ip, sizeof(ip)) < 0)
+ return 0;
+
+ index = ip.protocol;
+ value = bpf_map_lookup_elem(&my_map, &index);
+ if (value)
+ __sync_fetch_and_add(value, skb->len);
+
+ return 0;
+ }
+
+Userspace
+---------
+
+BPF_MAP_TYPE_ARRAY
+~~~~~~~~~~~~~~~~~~
+
+This snippet shows how to create an array, using ``bpf_map_create_opts`` to
+set flags.
+
+.. code-block:: c
+
+ #include <bpf/libbpf.h>
+ #include <bpf/bpf.h>
+
+ int create_array()
+ {
+ int fd;
+ LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE);
+
+ fd = bpf_map_create(BPF_MAP_TYPE_ARRAY,
+ "example_array", /* name */
+ sizeof(__u32), /* key size */
+ sizeof(long), /* value size */
+ 256, /* max entries */
+ &opts); /* create opts */
+ return fd;
+ }
+
+This snippet shows how to initialize the elements of an array.
+
+.. code-block:: c
+
+ int initialize_array(int fd)
+ {
+ __u32 i;
+ long value;
+ int ret;
+
+ for (i = 0; i < 256; i++) {
+ value = i;
+ ret = bpf_map_update_elem(fd, &i, &value, BPF_ANY);
+ if (ret < 0)
+ return ret;
+ }
+
+ return ret;
+ }
+
+This snippet shows how to retrieve an element value from an array.
+
+.. code-block:: c
+
+ int lookup(int fd)
+ {
+ __u32 index = 42;
+ long value;
+ int ret;
+
+ ret = bpf_map_lookup_elem(fd, &index, &value);
+ if (ret < 0)
+ return ret;
+
+ /* use value here */
+ assert(value == 42);
+
+ return ret;
+ }
+
+BPF_MAP_TYPE_PERCPU_ARRAY
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This snippet shows how to initialize the elements of a per CPU array.
+
+.. code-block:: c
+
+ int initialize_array(int fd)
+ {
+ int ncpus = libbpf_num_possible_cpus();
+ long values[ncpus];
+ __u32 i, j;
+ int ret;
+
+ for (i = 0; i < 256; i++) {
+ for (j = 0; j < ncpus; j++)
+ values[j] = i;
+ ret = bpf_map_update_elem(fd, &i, &values, BPF_ANY);
+ if (ret < 0)
+ return ret;
+ }
+
+ return ret;
+ }
+
+This snippet shows how to access the per CPU elements of an array value.
+
+.. code-block:: c
+
+ int lookup(int fd)
+ {
+ int ncpus = libbpf_num_possible_cpus();
+ __u32 index = 42, j;
+ long values[ncpus];
+ int ret;
+
+ ret = bpf_map_lookup_elem(fd, &index, &values);
+ if (ret < 0)
+ return ret;
+
+ for (j = 0; j < ncpus; j++) {
+ /* Use per CPU value here */
+ assert(values[j] == 42);
+ }
+
+ return ret;
+ }
+
+Semantics
+=========
+
+As shown in the example above, when accessing a ``BPF_MAP_TYPE_PERCPU_ARRAY``
+in userspace, each value is an array with ``ncpus`` elements.
+
+When calling ``bpf_map_update_elem()`` the flag ``BPF_NOEXIST`` can not be used
+for these maps.
diff --git a/Documentation/bpf/map_bloom_filter.rst b/Documentation/bpf/map_bloom_filter.rst
new file mode 100644
index 000000000000..c82487f2fe0d
--- /dev/null
+++ b/Documentation/bpf/map_bloom_filter.rst
@@ -0,0 +1,174 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=========================
+BPF_MAP_TYPE_BLOOM_FILTER
+=========================
+
+.. note::
+ - ``BPF_MAP_TYPE_BLOOM_FILTER`` was introduced in kernel version 5.16
+
+``BPF_MAP_TYPE_BLOOM_FILTER`` provides a BPF bloom filter map. Bloom
+filters are a space-efficient probabilistic data structure used to
+quickly test whether an element exists in a set. In a bloom filter,
+false positives are possible whereas false negatives are not.
+
+The bloom filter map does not have keys, only values. When the bloom
+filter map is created, it must be created with a ``key_size`` of 0. The
+bloom filter map supports two operations:
+
+- push: adding an element to the map
+- peek: determining whether an element is present in the map
+
+BPF programs must use ``bpf_map_push_elem`` to add an element to the
+bloom filter map and ``bpf_map_peek_elem`` to query the map. These
+operations are exposed to userspace applications using the existing
+``bpf`` syscall in the following way:
+
+- ``BPF_MAP_UPDATE_ELEM`` -> push
+- ``BPF_MAP_LOOKUP_ELEM`` -> peek
+
+The ``max_entries`` size that is specified at map creation time is used
+to approximate a reasonable bitmap size for the bloom filter, and is not
+otherwise strictly enforced. If the user wishes to insert more entries
+into the bloom filter than ``max_entries``, this may lead to a higher
+false positive rate.
+
+The number of hashes to use for the bloom filter is configurable using
+the lower 4 bits of ``map_extra`` in ``union bpf_attr`` at map creation
+time. If no number is specified, the default used will be 5 hash
+functions. In general, using more hashes decreases both the false
+positive rate and the speed of a lookup.
+
+It is not possible to delete elements from a bloom filter map. A bloom
+filter map may be used as an inner map. The user is responsible for
+synchronising concurrent updates and lookups to ensure no false negative
+lookups occur.
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_map_push_elem()
+~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+
+A ``value`` can be added to a bloom filter using the
+``bpf_map_push_elem()`` helper. The ``flags`` parameter must be set to
+``BPF_ANY`` when adding an entry to the bloom filter. This helper
+returns ``0`` on success, or negative error in case of failure.
+
+bpf_map_peek_elem()
+~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_peek_elem(struct bpf_map *map, void *value)
+
+The ``bpf_map_peek_elem()`` helper is used to determine whether
+``value`` is present in the bloom filter map. This helper returns ``0``
+if ``value`` is probably present in the map, or ``-ENOENT`` if ``value``
+is definitely not present in the map.
+
+Userspace
+---------
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
+
+A userspace program can add a ``value`` to a bloom filter using libbpf's
+``bpf_map_update_elem`` function. The ``key`` parameter must be set to
+``NULL`` and ``flags`` must be set to ``BPF_ANY``. Returns ``0`` on
+success, or negative error in case of failure.
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_lookup_elem(int fd, const void *key, void *value)
+
+A userspace program can determine the presence of ``value`` in a bloom
+filter using libbpf's ``bpf_map_lookup_elem`` function. The ``key``
+parameter must be set to ``NULL``. Returns ``0`` if ``value`` is
+probably present in the map, or ``-ENOENT`` if ``value`` is definitely
+not present in the map.
+
+Examples
+========
+
+Kernel BPF
+----------
+
+This snippet shows how to declare a bloom filter in a BPF program:
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
+ __type(value, __u32);
+ __uint(max_entries, 1000);
+ __uint(map_extra, 3);
+ } bloom_filter SEC(".maps");
+
+This snippet shows how to determine presence of a value in a bloom
+filter in a BPF program:
+
+.. code-block:: c
+
+ void *lookup(__u32 key)
+ {
+ if (bpf_map_peek_elem(&bloom_filter, &key) == 0) {
+ /* Verify not a false positive and fetch an associated
+ * value using a secondary lookup, e.g. in a hash table
+ */
+ return bpf_map_lookup_elem(&hash_table, &key);
+ }
+ return NULL;
+ }
+
+Userspace
+---------
+
+This snippet shows how to use libbpf to create a bloom filter map from
+userspace:
+
+.. code-block:: c
+
+ int create_bloom()
+ {
+ LIBBPF_OPTS(bpf_map_create_opts, opts,
+ .map_extra = 3); /* number of hashes */
+
+ return bpf_map_create(BPF_MAP_TYPE_BLOOM_FILTER,
+ "ipv6_bloom", /* name */
+ 0, /* key size, must be zero */
+ sizeof(ipv6_addr), /* value size */
+ 10000, /* max entries */
+ &opts); /* create options */
+ }
+
+This snippet shows how to add an element to a bloom filter from
+userspace:
+
+.. code-block:: c
+
+ int add_element(struct bpf_map *bloom_map, __u32 value)
+ {
+ int bloom_fd = bpf_map__fd(bloom_map);
+ return bpf_map_update_elem(bloom_fd, NULL, &value, BPF_ANY);
+ }
+
+References
+==========
+
+https://lwn.net/ml/bpf/20210831225005.2762202-1-joannekoong@fb.com/
diff --git a/Documentation/bpf/map_cgroup_storage.rst b/Documentation/bpf/map_cgroup_storage.rst
index cab9543017bf..8e5fe532c07e 100644
--- a/Documentation/bpf/map_cgroup_storage.rst
+++ b/Documentation/bpf/map_cgroup_storage.rst
@@ -31,7 +31,7 @@ The map uses key of type of either ``__u64 cgroup_inode_id`` or
};
``cgroup_inode_id`` is the inode id of the cgroup directory.
-``attach_type`` is the the program's attach type.
+``attach_type`` is the program's attach type.
Linux 5.9 added support for type ``__u64 cgroup_inode_id`` as the key type.
When this key type is used, then all attach types of the particular cgroup and
@@ -155,7 +155,7 @@ However, the BPF program can still only associate with one map of each type
``BPF_MAP_TYPE_CGROUP_STORAGE`` or more than one
``BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE``.
-In all versions, userspace may use the the attach parameters of cgroup and
+In all versions, userspace may use the attach parameters of cgroup and
attach type pair in ``struct bpf_cgroup_storage_key`` as the key to the BPF map
APIs to read or update the storage for a given attachment. For Linux 5.9
attach type shared storages, only the first value in the struct, cgroup inode
diff --git a/Documentation/bpf/map_cgrp_storage.rst b/Documentation/bpf/map_cgrp_storage.rst
new file mode 100644
index 000000000000..5d3f603efffa
--- /dev/null
+++ b/Documentation/bpf/map_cgrp_storage.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Meta Platforms, Inc. and affiliates.
+
+=========================
+BPF_MAP_TYPE_CGRP_STORAGE
+=========================
+
+The ``BPF_MAP_TYPE_CGRP_STORAGE`` map type represents local fixed-size
+storage for cgroups. It is only available with ``CONFIG_CGROUPS``.
+The programs are made available by the same Kconfig. The
+data for a particular cgroup can be retrieved by looking up the map
+with that cgroup.
+
+This document describes the usage and semantics of the
+``BPF_MAP_TYPE_CGRP_STORAGE`` map type.
+
+Usage
+=====
+
+The map key must be ``sizeof(int)`` representing a cgroup fd.
+To access the storage in a program, use ``bpf_cgrp_storage_get``::
+
+ void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
+
+``flags`` can be 0 or ``BPF_LOCAL_STORAGE_GET_F_CREATE``, which indicates that
+a new local storage entry will be created if one does not exist.
+
+The local storage can be removed with ``bpf_cgrp_storage_delete``::
+
+ long bpf_cgrp_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
+
+The map is available to all program types.
+
+Examples
+========
+
+A BPF program example with BPF_MAP_TYPE_CGRP_STORAGE::
+
+ #include <vmlinux.h>
+ #include <bpf/bpf_helpers.h>
+ #include <bpf/bpf_tracing.h>
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, long);
+ } cgrp_storage SEC(".maps");
+
+ SEC("tp_btf/sys_enter")
+ int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+ {
+ struct task_struct *task = bpf_get_current_task_btf();
+ long *ptr;
+
+ ptr = bpf_cgrp_storage_get(&cgrp_storage, task->cgroups->dfl_cgrp, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (ptr)
+ __sync_fetch_and_add(ptr, 1);
+
+ return 0;
+ }
+
+Userspace accessing map declared above::
+
+    #include <bpf/bpf.h>
+    #include <bpf/libbpf.h>
+
+    long map_lookup(struct bpf_map *map, int cgrp_fd)
+    {
+        long value;
+
+        /* user space lookup copies the value out; returns 0 on success */
+        if (!bpf_map_lookup_elem(bpf_map__fd(map), &cgrp_fd, &value))
+            return value;
+        return 0;
+    }
+
+Difference Between BPF_MAP_TYPE_CGRP_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE
+============================================================================
+
+The old cgroup storage map ``BPF_MAP_TYPE_CGROUP_STORAGE`` has been marked as
+deprecated (renamed to ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``). The new
+``BPF_MAP_TYPE_CGRP_STORAGE`` map should be used instead. The following
+illustrates the main differences between ``BPF_MAP_TYPE_CGRP_STORAGE`` and
+``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.
+
+(1). ``BPF_MAP_TYPE_CGRP_STORAGE`` can be used by all program types while
+     ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` is available only to cgroup program
+     types like ``BPF_CGROUP_INET_INGRESS``, ``BPF_CGROUP_SOCK_OPS``, etc.
+
+(2). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports local storage for more than one
+ cgroup while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only supports one cgroup
+ which is attached by a BPF program.
+
+(3). ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` allocates local storage at attach time so
+     ``bpf_get_local_storage()`` always returns non-NULL local storage.
+     ``BPF_MAP_TYPE_CGRP_STORAGE`` allocates local storage at runtime so
+     it is possible that ``bpf_cgrp_storage_get()`` may return NULL local storage.
+     To avoid such a NULL local storage issue, user space can use
+     ``bpf_map_update_elem()`` to pre-allocate local storage before a BPF program
+     is attached.
+
+(4). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports deleting local storage by a BPF program
+     while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only deletes storage at
+     program detach time.
+
+So overall, ``BPF_MAP_TYPE_CGRP_STORAGE`` supports all ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``
+functionality and beyond. It is recommended to use ``BPF_MAP_TYPE_CGRP_STORAGE``
+instead of ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.
diff --git a/Documentation/bpf/map_cpumap.rst b/Documentation/bpf/map_cpumap.rst
new file mode 100644
index 000000000000..923cfc8ab51f
--- /dev/null
+++ b/Documentation/bpf/map_cpumap.rst
@@ -0,0 +1,177 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+===================
+BPF_MAP_TYPE_CPUMAP
+===================
+
+.. note::
+ - ``BPF_MAP_TYPE_CPUMAP`` was introduced in kernel version 4.15
+
+.. kernel-doc:: kernel/bpf/cpumap.c
+ :doc: cpu map
+
+An example use-case for this map type is software-based Receive Side Scaling (RSS).
+
+The CPUMAP represents the CPUs in the system indexed as the map-key, and the
+map-value is the config setting (per CPUMAP entry). Each CPUMAP entry has a dedicated
+kernel thread bound to the given CPU to represent the remote CPU execution unit.
+
+Starting from Linux kernel version 5.9 the CPUMAP can run a second XDP program
+on the remote CPU. This allows an XDP program to split its processing across
+multiple CPUs. Consider, for example, a scenario where the initial CPU (that
+sees/receives the packets) needs to do minimal packet processing and the
+remote CPU (to which the packet is directed) can afford to spend more cycles
+processing the frame. The initial CPU is where the XDP redirect program is
+executed. The remote CPU receives raw ``xdp_frame`` objects.
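+
+A sketch of such a second-stage program; the ``xdp/cpumap`` section name tells
+libbpf the expected attach type, and the program's fd can then be placed in
+``struct bpf_cpumap_val`` when updating the map:
+
+.. code-block:: c
+
+    SEC("xdp/cpumap")
+    int xdp_cpumap_prog(struct xdp_md *ctx)
+    {
+        /* runs on the remote CPU's kthread for each redirected frame */
+        return XDP_PASS;
+    }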
+
+Usage
+=====
+
+Kernel BPF
+----------
+bpf_redirect_map()
+^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+
+Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
+For ``BPF_MAP_TYPE_CPUMAP`` this map contains references to CPUs.
+
+The lower two bits of ``flags`` are used as the return code if the map lookup
+fails. This is so that the return value can be one of the XDP program return
+codes up to ``XDP_TX``, as chosen by the caller.
+
+User space
+----------
+.. note::
+ CPUMAP entries can only be updated/looked up/deleted from user space and not
+ from an eBPF program. Trying to call these functions from a kernel eBPF
+ program will result in the program failing to load and a verifier warning.
+
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);
+
+CPU entries can be added or updated using the ``bpf_map_update_elem()``
+helper. This helper replaces existing elements atomically. The ``value`` parameter
+can be ``struct bpf_cpumap_val``.
+
+ .. code-block:: c
+
+ struct bpf_cpumap_val {
+ __u32 qsize; /* queue size to remote target CPU */
+ union {
+ int fd; /* prog fd on map write */
+ __u32 id; /* prog id on map read */
+ } bpf_prog;
+ };
+
+The ``flags`` argument can be one of the following:
+ - ``BPF_ANY``: Create a new element or update an existing element.
+ - ``BPF_NOEXIST``: Create a new element only if it did not exist.
+ - ``BPF_EXIST``: Update an existing element.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_lookup_elem(int fd, const void *key, void *value);
+
+CPU entries can be retrieved using the ``bpf_map_lookup_elem()``
+helper.
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_delete_elem(int fd, const void *key);
+
+CPU entries can be deleted using the ``bpf_map_delete_elem()``
+helper. This helper will return 0 on success, or negative error in case of
+failure.
+
+Examples
+========
+Kernel
+------
+
+The following code snippet shows how to declare a ``BPF_MAP_TYPE_CPUMAP`` called
+``cpu_map`` and how to redirect packets to a remote CPU using a round robin scheme.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_CPUMAP);
+ __type(key, __u32);
+ __type(value, struct bpf_cpumap_val);
+ __uint(max_entries, 12);
+ } cpu_map SEC(".maps");
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, __u32);
+ __type(value, __u32);
+ __uint(max_entries, 12);
+ } cpus_available SEC(".maps");
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __type(key, __u32);
+ __type(value, __u32);
+ __uint(max_entries, 1);
+ } cpus_iterator SEC(".maps");
+
+ SEC("xdp")
+ int xdp_redir_cpu_round_robin(struct xdp_md *ctx)
+ {
+ __u32 key = 0;
+ __u32 cpu_dest = 0;
+ __u32 *cpu_selected, *cpu_iterator;
+ __u32 cpu_idx;
+
+ cpu_iterator = bpf_map_lookup_elem(&cpus_iterator, &key);
+ if (!cpu_iterator)
+ return XDP_ABORTED;
+ cpu_idx = *cpu_iterator;
+
+ *cpu_iterator += 1;
+ if (*cpu_iterator == bpf_num_possible_cpus())
+ *cpu_iterator = 0;
+
+ cpu_selected = bpf_map_lookup_elem(&cpus_available, &cpu_idx);
+ if (!cpu_selected)
+ return XDP_ABORTED;
+ cpu_dest = *cpu_selected;
+
+ if (cpu_dest >= bpf_num_possible_cpus())
+ return XDP_ABORTED;
+
+ return bpf_redirect_map(&cpu_map, cpu_dest, 0);
+ }
+
+User space
+----------
+
+The following code snippet shows how to dynamically set the ``max_entries``
+for a CPUMAP to the maximum number of CPUs available on the system.
+
+.. code-block:: c
+
+ int set_max_cpu_entries(struct bpf_map *cpu_map)
+ {
+ if (bpf_map__set_max_entries(cpu_map, libbpf_num_possible_cpus()) < 0) {
+ fprintf(stderr, "Failed to set max entries for cpu_map map: %s",
+ strerror(errno));
+ return -1;
+ }
+ return 0;
+ }
+
+References
+===========
+
+- https://developers.redhat.com/blog/2021/05/13/receive-side-scaling-rss-with-ebpf-and-cpumap#redirecting_into_a_cpumap
diff --git a/Documentation/bpf/map_devmap.rst b/Documentation/bpf/map_devmap.rst
new file mode 100644
index 000000000000..927312c7b8c8
--- /dev/null
+++ b/Documentation/bpf/map_devmap.rst
@@ -0,0 +1,238 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=================================================
+BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH
+=================================================
+
+.. note::
+ - ``BPF_MAP_TYPE_DEVMAP`` was introduced in kernel version 4.14
+ - ``BPF_MAP_TYPE_DEVMAP_HASH`` was introduced in kernel version 5.4
+
+``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` are BPF maps primarily
+used as backend maps for the XDP BPF helper call ``bpf_redirect_map()``.
+``BPF_MAP_TYPE_DEVMAP`` is backed by an array that uses the key as the index
+to look up a reference to a net device, while ``BPF_MAP_TYPE_DEVMAP_HASH`` is
+backed by a hash table that uses a key to look up a reference to a net device.
+The user provides either <``key``/``ifindex``> or <``key``/``struct bpf_devmap_val``>
+pairs to update the maps with new net devices.
+
+.. note::
+ - The key to a hash map doesn't have to be an ``ifindex``.
+ - While ``BPF_MAP_TYPE_DEVMAP_HASH`` allows for densely packing the net devices,
+   it comes at the cost of hashing the key when performing a lookup.
+
+The setup and packet enqueue/send code is shared between the two types of
+devmap; only the lookup and insertion routines differ.
+
+Usage
+=====
+Kernel BPF
+----------
+bpf_redirect_map()
+^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+
+Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
+For ``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` this map contains
+references to net devices (for forwarding packets through other ports).
+
+The lower two bits of ``flags`` are used as the return code if the map lookup
+fails. This is so that the return value can be one of the XDP program return
+codes up to ``XDP_TX``, as chosen by the caller. The higher bits of ``flags``
+can be set to ``BPF_F_BROADCAST`` or ``BPF_F_EXCLUDE_INGRESS`` as defined
+below.
+
+With ``BPF_F_BROADCAST`` the packet will be broadcast to all the interfaces
+in the map; with ``BPF_F_EXCLUDE_INGRESS`` the ingress interface will be
+excluded from the broadcast.
+
+.. note::
+ - The key is ignored if ``BPF_F_BROADCAST`` is set.
+ - The broadcast feature can also be used to implement multicast forwarding:
+ simply create multiple DEVMAPs, each one corresponding to a single multicast group.
+
+This helper will return ``XDP_REDIRECT`` on success, or the value of the two
+lower bits of the ``flags`` argument if the map lookup fails.
+
+More information about redirection can be found in :doc:`redirect`.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
+helper.
+
+User space
+----------
+.. note::
+ DEVMAP entries can only be updated/deleted from user space and not
+ from an eBPF program. Trying to call these functions from a kernel eBPF
+ program will result in the program failing to load and a verifier warning.
+
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);
+
+Net device entries can be added or updated using the ``bpf_map_update_elem()``
+helper. This helper replaces existing elements atomically. The ``value`` parameter
+can be ``struct bpf_devmap_val`` or a simple ``int ifindex`` for backwards
+compatibility.
+
+ .. code-block:: c
+
+ struct bpf_devmap_val {
+ __u32 ifindex; /* device index */
+ union {
+ int fd; /* prog fd on map write */
+ __u32 id; /* prog id on map read */
+ } bpf_prog;
+ };
+
+The ``flags`` argument can be one of the following:
+ - ``BPF_ANY``: Create a new element or update an existing element.
+ - ``BPF_NOEXIST``: Create a new element only if it did not exist.
+ - ``BPF_EXIST``: Update an existing element.
+
+DEVMAPs can associate a program with a device entry by adding a ``bpf_prog.fd``
+to ``struct bpf_devmap_val``. Programs are run after ``XDP_REDIRECT`` and have
+access to both Rx device and Tx device. The program associated with the ``fd``
+must have type XDP with expected attach type ``xdp_devmap``.
+When a program is associated with a device index, the program is run on an
+``XDP_REDIRECT`` and before the buffer is added to the per-cpu queue. Examples
+of how to attach/use xdp_devmap progs can be found in the kernel selftests:
+
+- ``tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c``
+- ``tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c``
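+
+A minimal sketch of such a program; the ``xdp/devmap`` section name tells
+libbpf the expected attach type:
+
+.. code-block:: c
+
+    SEC("xdp/devmap")
+    int xdp_devmap_prog(struct xdp_md *ctx)
+    {
+        /* runs after the redirect, before transmit on the egress device */
+        return XDP_PASS;
+    }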
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+   int bpf_map_lookup_elem(int fd, const void *key, void *value);
+
+Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
+helper.
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+   int bpf_map_delete_elem(int fd, const void *key);
+
+Net device entries can be deleted using the ``bpf_map_delete_elem()``
+helper. This helper will return 0 on success, or negative error in case of
+failure.
+
+Examples
+========
+
+Kernel BPF
+----------
+
+The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP``
+called ``tx_port``.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_DEVMAP);
+ __type(key, __u32);
+ __type(value, __u32);
+ __uint(max_entries, 256);
+ } tx_port SEC(".maps");
+
+The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP_HASH``
+called ``forward_map``.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
+ __type(key, __u32);
+ __type(value, struct bpf_devmap_val);
+ __uint(max_entries, 32);
+ } forward_map SEC(".maps");
+
+.. note::
+
+ The value type in the DEVMAP above is a ``struct bpf_devmap_val``
+
+The following code snippet shows a simple xdp_redirect_map program. This program
+would work with a user space program that populates the devmap ``forward_map`` based
+on ingress ifindexes. The BPF program (below) is redirecting packets using the
+ingress ``ifindex`` as the ``key``.
+
+.. code-block:: c
+
+ SEC("xdp")
+ int xdp_redirect_map_func(struct xdp_md *ctx)
+ {
+ int index = ctx->ingress_ifindex;
+
+ return bpf_redirect_map(&forward_map, index, 0);
+ }
+
+The following code snippet shows a BPF program that is broadcasting packets to
+all the interfaces in the ``tx_port`` devmap.
+
+.. code-block:: c
+
+ SEC("xdp")
+ int xdp_redirect_map_func(struct xdp_md *ctx)
+ {
+ return bpf_redirect_map(&tx_port, 0, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
+ }
+
+User space
+----------
+
+The following code snippet shows how to update a devmap called ``tx_port``.
+
+.. code-block:: c
+
+ int update_devmap(int ifindex, int redirect_ifindex)
+ {
+ int ret;
+
+ ret = bpf_map_update_elem(bpf_map__fd(tx_port), &ifindex, &redirect_ifindex, 0);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to update devmap value: %s\n",
+ strerror(errno));
+ }
+
+ return ret;
+ }
+
+The following code snippet shows how to update a devmap hash called ``forward_map``.
+
+.. code-block:: c
+
+ int update_devmap(int ifindex, int redirect_ifindex)
+ {
+ struct bpf_devmap_val devmap_val = { .ifindex = redirect_ifindex };
+ int ret;
+
+ ret = bpf_map_update_elem(bpf_map__fd(forward_map), &ifindex, &devmap_val, 0);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to update devmap value: %s\n",
+ strerror(errno));
+ }
+ return ret;
+ }
+
+References
+===========
+
+- https://lwn.net/Articles/728146/
+- https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=6f9d451ab1a33728adb72d7ff66a7b374d665176
+- https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4106
diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
new file mode 100644
index 000000000000..8669426264c6
--- /dev/null
+++ b/Documentation/bpf/map_hash.rst
@@ -0,0 +1,208 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+===============================================
+BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants
+===============================================
+
+.. note::
+ - ``BPF_MAP_TYPE_HASH`` was introduced in kernel version 3.19
+ - ``BPF_MAP_TYPE_PERCPU_HASH`` was introduced in version 4.6
+ - Both ``BPF_MAP_TYPE_LRU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
+ were introduced in version 4.10
+
+``BPF_MAP_TYPE_HASH`` and ``BPF_MAP_TYPE_PERCPU_HASH`` provide general
+purpose hash map storage. Both the key and the value can be structs,
+allowing for composite keys and values.
+
+The kernel is responsible for allocating and freeing key/value pairs, up
+to the max_entries limit that you specify. Hash maps use pre-allocation
+of hash table elements by default. The ``BPF_F_NO_PREALLOC`` flag can be
+used to disable pre-allocation when it is too memory expensive.
+
+``BPF_MAP_TYPE_PERCPU_HASH`` provides a separate value slot per
+CPU. The per-cpu values are stored internally in an array.
+
+The ``BPF_MAP_TYPE_LRU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
+variants add LRU semantics to their respective hash tables. An LRU hash
+will automatically evict the least recently used entries when the hash
+table reaches capacity. An LRU hash maintains an internal LRU list that
+is used to select elements for eviction. This internal LRU list is
+shared across CPUs, but it is possible to request a per-CPU LRU list with
+the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``, as the
+sketch below shows.
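+
+A sketch of requesting per-CPU LRU lists from user space (map name and sizes
+are illustrative):
+
+.. code-block:: c
+
+    LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_NO_COMMON_LRU);
+
+    int fd = bpf_map_create(BPF_MAP_TYPE_LRU_HASH, "lru_map",
+                            sizeof(__u32),  /* key size */
+                            sizeof(long),   /* value size */
+                            1024,           /* max entries */
+                            &opts);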
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+
+Hash entries can be added or updated using the ``bpf_map_update_elem()``
+helper. This helper replaces existing elements atomically. The ``flags``
+parameter can be used to control the update behaviour:
+
+- ``BPF_ANY`` will create a new element or update an existing element
+- ``BPF_NOEXIST`` will create a new element only if one did not already
+ exist
+- ``BPF_EXIST`` will update an existing element
+
+``bpf_map_update_elem()`` returns 0 on success, or negative error in
+case of failure.
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Hash entries can be retrieved using the ``bpf_map_lookup_elem()``
+helper. This helper returns a pointer to the value associated with
+``key``, or ``NULL`` if no entry was found.
+
+bpf_map_delete_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_delete_elem(struct bpf_map *map, const void *key)
+
+Hash entries can be deleted using the ``bpf_map_delete_elem()``
+helper. This helper will return 0 on success, or negative error in case
+of failure.
+
+Per CPU Hashes
+--------------
+
+For ``BPF_MAP_TYPE_PERCPU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
+the ``bpf_map_update_elem()`` and ``bpf_map_lookup_elem()`` helpers
+automatically access the hash slot for the current CPU.
+
+bpf_map_lookup_percpu_elem()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)
+
+The ``bpf_map_lookup_percpu_elem()`` helper can be used to look up the
+value in the hash slot for a specific CPU. It returns the value associated
+with ``key`` on ``cpu``, or ``NULL`` if no entry was found or ``cpu`` is
+invalid.
+
+Concurrency
+-----------
+
+Values stored in ``BPF_MAP_TYPE_HASH`` can be accessed concurrently by
+programs running on different CPUs. Since Kernel version 5.1, the BPF
+infrastructure provides ``struct bpf_spin_lock`` to synchronise access.
+See ``tools/testing/selftests/bpf/progs/test_spin_lock.c``.
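+
+A sketch of serialising updates to a shared value with a spin lock; the
+struct layout and map name are illustrative, and the usual
+``<linux/bpf.h>`` and ``<bpf/bpf_helpers.h>`` includes are assumed:
+
+.. code-block:: c
+
+ struct locked_value {
+ struct bpf_spin_lock lock;
+ __u64 counter;
+ };
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, 16);
+ __type(key, __u32);
+ __type(value, struct locked_value);
+ } locked_map SEC(".maps");
+
+ static void increment(__u32 key)
+ {
+ struct locked_value *val = bpf_map_lookup_elem(&locked_map, &key);
+
+ if (val) {
+ bpf_spin_lock(&val->lock);
+ val->counter++;
+ bpf_spin_unlock(&val->lock);
+ }
+ }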
+
+Userspace
+---------
+
+bpf_map_get_next_key()
+~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key)
+
+In userspace, it is possible to iterate through the keys of a hash using
+libbpf's ``bpf_map_get_next_key()`` function. The first key can be fetched by
+calling ``bpf_map_get_next_key()`` with ``cur_key`` set to
+``NULL``. Subsequent calls will fetch the next key that follows the
+current key. ``bpf_map_get_next_key()`` returns ``0`` on success,
+``-ENOENT`` if ``cur_key`` is the last key in the hash, or negative
+error in case of failure.
+
+Note that if ``cur_key`` gets deleted then ``bpf_map_get_next_key()``
+will instead return the *first* key in the hash table, which is
+undesirable. It is recommended to use batched lookup if keys are going
+to be deleted while iterating with ``bpf_map_get_next_key()``.
+
+Examples
+========
+
+Please see the ``tools/testing/selftests/bpf`` directory for functional
+examples. The code snippets below demonstrate API usage.
+
+This example shows how to declare an LRU Hash with a struct key and a
+struct value.
+
+.. code-block:: c
+
+ #include <linux/bpf.h>
+ #include <bpf/bpf_helpers.h>
+
+ struct key {
+ __u32 srcip;
+ };
+
+ struct value {
+ __u64 packets;
+ __u64 bytes;
+ };
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_LRU_HASH);
+ __uint(max_entries, 32);
+ __type(key, struct key);
+ __type(value, struct value);
+ } packet_stats SEC(".maps");
+
+This example shows how to create or update hash values using atomic
+instructions:
+
+.. code-block:: c
+
+ static void update_stats(__u32 srcip, int bytes)
+ {
+ struct key key = {
+ .srcip = srcip,
+ };
+ struct value *value = bpf_map_lookup_elem(&packet_stats, &key);
+
+ if (value) {
+ __sync_fetch_and_add(&value->packets, 1);
+ __sync_fetch_and_add(&value->bytes, bytes);
+ } else {
+ struct value newval = { 1, bytes };
+
+ bpf_map_update_elem(&packet_stats, &key, &newval, BPF_NOEXIST);
+ }
+ }
+
+Userspace walking the map elements from the map declared above:
+
+.. code-block:: c
+
+ #include <bpf/libbpf.h>
+ #include <bpf/bpf.h>
+
+ static void walk_hash_elements(int map_fd)
+ {
+ struct key *cur_key = NULL;
+ struct key next_key;
+ struct value value;
+ int err;
+
+ for (;;) {
+ err = bpf_map_get_next_key(map_fd, cur_key, &next_key);
+ if (err)
+ break;
+
+ bpf_map_lookup_elem(map_fd, &next_key, &value);
+
+ /* Use key and value here */
+
+ cur_key = &next_key;
+ }
+ }
diff --git a/Documentation/bpf/map_lpm_trie.rst b/Documentation/bpf/map_lpm_trie.rst
new file mode 100644
index 000000000000..74d64a30f500
--- /dev/null
+++ b/Documentation/bpf/map_lpm_trie.rst
@@ -0,0 +1,197 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=====================
+BPF_MAP_TYPE_LPM_TRIE
+=====================
+
+.. note::
+ - ``BPF_MAP_TYPE_LPM_TRIE`` was introduced in kernel version 4.11
+
+``BPF_MAP_TYPE_LPM_TRIE`` provides a longest prefix match algorithm that
+can be used to match IP addresses to a stored set of prefixes.
+Internally, data is stored in an unbalanced trie of nodes that uses
+``prefixlen,data`` pairs as its keys. The ``data`` is interpreted in
+network byte order, i.e. big endian, so ``data[0]`` stores the most
+significant byte.
+
+LPM tries may be created with a maximum prefix length that is a multiple
+of 8, in the range from 8 to 2048. The key used for lookup and update
+operations is a ``struct bpf_lpm_trie_key``, extended by
+``max_prefixlen/8`` bytes.
+
+- For IPv4 addresses the data length is 4 bytes
+- For IPv6 addresses the data length is 16 bytes
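+
+For example, a key for IPv6 prefixes could be laid out as follows (an
+illustrative declaration mirroring ``struct bpf_lpm_trie_key``):
+
+.. code-block:: c
+
+ struct ipv6_lpm_key {
+ __u32 prefixlen; /* up to 128 for IPv6 */
+ __u8 data[16]; /* address in network byte order */
+ };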
+
+The value type stored in the LPM trie can be any user defined type.
+
+.. note::
+ When creating a map of type ``BPF_MAP_TYPE_LPM_TRIE`` you must set the
+ ``BPF_F_NO_PREALLOC`` flag.
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+The longest prefix entry for a given data value can be found using the
+``bpf_map_lookup_elem()`` helper. This helper returns a pointer to the
+value associated with the longest matching ``key``, or ``NULL`` if no
+entry was found.
+
+The ``key`` should have ``prefixlen`` set to ``max_prefixlen`` when
+performing longest prefix lookups. For example, when searching for the
+longest prefix match for an IPv4 address, ``prefixlen`` should be set to
+``32``.
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+
+Prefix entries can be added or updated using the ``bpf_map_update_elem()``
+helper. This helper replaces existing elements atomically.
+
+``bpf_map_update_elem()`` returns ``0`` on success, or negative error in
+case of failure.
+
+.. note::
+ The ``flags`` parameter must be one of ``BPF_ANY``, ``BPF_NOEXIST`` or
+ ``BPF_EXIST``, but the value is ignored, giving ``BPF_ANY`` semantics.
+
+bpf_map_delete_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_delete_elem(struct bpf_map *map, const void *key)
+
+Prefix entries can be deleted using the ``bpf_map_delete_elem()``
+helper. This helper will return 0 on success, or negative error in case
+of failure.
+
+Userspace
+---------
+
+Access from userspace uses libbpf APIs with the same names as above, with
+the map identified by ``fd``.
+
+bpf_map_get_next_key()
+~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key)
+
+A userspace program can iterate through the entries in an LPM trie using
+libbpf's ``bpf_map_get_next_key()`` function. The first key can be
+fetched by calling ``bpf_map_get_next_key()`` with ``cur_key`` set to
+``NULL``. Subsequent calls will fetch the next key that follows the
+current key. ``bpf_map_get_next_key()`` returns ``0`` on success,
+``-ENOENT`` if ``cur_key`` is the last key in the trie, or negative
+error in case of failure.
+
+``bpf_map_get_next_key()`` will iterate through the LPM trie elements
+from leftmost leaf first. This means that iteration will return more
+specific keys before less specific ones.
+
+Examples
+========
+
+Please see ``tools/testing/selftests/bpf/test_lpm_map.c`` for examples
+of LPM trie usage from userspace. The code snippets below demonstrate
+API usage.
+
+Kernel BPF
+----------
+
+The following BPF code snippet shows how to declare a new LPM trie for IPv4
+address prefixes:
+
+.. code-block:: c
+
+ #include <linux/bpf.h>
+ #include <bpf/bpf_helpers.h>
+
+ struct ipv4_lpm_key {
+ __u32 prefixlen;
+ __u32 data;
+ };
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_LPM_TRIE);
+ __type(key, struct ipv4_lpm_key);
+ __type(value, __u32);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __uint(max_entries, 255);
+ } ipv4_lpm_map SEC(".maps");
+
+The following BPF code snippet shows how to lookup by IPv4 address:
+
+.. code-block:: c
+
+ void *lookup(__u32 ipaddr)
+ {
+ struct ipv4_lpm_key key = {
+ .prefixlen = 32,
+ .data = ipaddr
+ };
+
+ return bpf_map_lookup_elem(&ipv4_lpm_map, &key);
+ }
+
+Userspace
+---------
+
+The following snippet shows how to insert an IPv4 prefix entry into an
+LPM trie:
+
+.. code-block:: c
+
+ int add_prefix_entry(int lpm_fd, __u32 addr, __u32 prefixlen, struct value *value)
+ {
+ struct ipv4_lpm_key ipv4_key = {
+ .prefixlen = prefixlen,
+ .data = addr
+ };
+ return bpf_map_update_elem(lpm_fd, &ipv4_key, value, BPF_ANY);
+ }
+
+The following snippet shows a userspace program walking through the entries
+of an LPM trie:
+
+.. code-block:: c
+
+ #include <bpf/libbpf.h>
+ #include <bpf/bpf.h>
+
+ void iterate_lpm_trie(int map_fd)
+ {
+ struct ipv4_lpm_key *cur_key = NULL;
+ struct ipv4_lpm_key next_key;
+ struct value value;
+ int err;
+
+ for (;;) {
+ err = bpf_map_get_next_key(map_fd, cur_key, &next_key);
+ if (err)
+ break;
+
+ bpf_map_lookup_elem(map_fd, &next_key, &value);
+
+ /* Use key and value here */
+
+ cur_key = &next_key;
+ }
+ }
diff --git a/Documentation/bpf/map_of_maps.rst b/Documentation/bpf/map_of_maps.rst
new file mode 100644
index 000000000000..7b5617c2d017
--- /dev/null
+++ b/Documentation/bpf/map_of_maps.rst
@@ -0,0 +1,130 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+========================================================
+BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS
+========================================================
+
+.. note::
+ - ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` were
+ introduced in kernel version 4.12
+
+``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` provide general
+purpose support for map in map storage. One level of nesting is supported, where
+an outer map contains instances of a single type of inner map, for example
+``array_of_maps->sock_map``.
+
+When creating an outer map, an inner map instance is used to initialize the
+metadata that the outer map holds about its inner maps. This inner map has a
+separate lifetime from the outer map and can be deleted after the outer map has
+been created.
+
+The outer map supports element lookup, update and delete from user space using
+the syscall API. A BPF program is only allowed to do element lookup in the outer
+map.
+
+.. note::
+ - Multi-level nesting is not supported.
+ - Any BPF map type can be used as an inner map, except for
+ ``BPF_MAP_TYPE_PROG_ARRAY``.
+ - A BPF program cannot update or delete outer map entries.
+
+For ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` the key is an unsigned 32-bit integer index
+into the array. The array is a fixed size with ``max_entries`` elements that are
+zero initialized when created.
+
+For ``BPF_MAP_TYPE_HASH_OF_MAPS`` the key type can be chosen when defining the
+map. The kernel is responsible for allocating and freeing key/value pairs, up to
+the ``max_entries`` limit that you specify. Hash maps use pre-allocation of hash
+table elements by default. The ``BPF_F_NO_PREALLOC`` flag can be used to disable
+pre-allocation when it is too memory expensive.
+
+Usage
+=====
+
+Kernel BPF Helper
+-----------------
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Inner maps can be retrieved using the ``bpf_map_lookup_elem()`` helper. This
+helper returns a pointer to the inner map, or ``NULL`` if no entry was found.
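+
+As a sketch, a BPF program can chain lookups to reach an element of an
+inner map; the map names and types here are illustrative:
+
+.. code-block:: c
+
+ #include <linux/bpf.h>
+ #include <bpf/bpf_helpers.h>
+
+ struct inner_hash {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, 16);
+ __type(key, __u32);
+ __type(value, __u64);
+ } first_inner SEC(".maps");
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+ __uint(max_entries, 4);
+ __type(key, __u32);
+ __array(values, struct inner_hash);
+ } outer SEC(".maps") = {
+ .values = { &first_inner },
+ };
+
+ SEC("xdp")
+ int count_in_inner(struct xdp_md *ctx)
+ {
+ __u32 outer_key = 0, inner_key = 0;
+ __u64 *count;
+ void *inner;
+
+ inner = bpf_map_lookup_elem(&outer, &outer_key);
+ if (!inner)
+ return XDP_PASS;
+
+ count = bpf_map_lookup_elem(inner, &inner_key);
+ if (count)
+ __sync_fetch_and_add(count, 1);
+
+ return XDP_PASS;
+ }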
+
+Examples
+========
+
+Kernel BPF Example
+------------------
+
+This snippet shows how to create and initialise an array of devmaps in a BPF
+program. Note that the outer array can only be modified from user space using
+the syscall API.
+
+.. code-block:: c
+
+ struct inner_map {
+ __uint(type, BPF_MAP_TYPE_DEVMAP);
+ __uint(max_entries, 10);
+ __type(key, __u32);
+ __type(value, __u32);
+ } inner_map1 SEC(".maps"), inner_map2 SEC(".maps");
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+ __uint(max_entries, 2);
+ __type(key, __u32);
+ __array(values, struct inner_map);
+ } outer_map SEC(".maps") = {
+ .values = { &inner_map1,
+ &inner_map2 }
+ };
+
+See ``progs/test_btf_map_in_map.c`` in ``tools/testing/selftests/bpf`` for more
+examples of declarative initialisation of outer maps.
+
+User Space
+----------
+
+This snippet shows how to create an array based outer map:
+
+.. code-block:: c
+
+ int create_outer_array(int inner_fd)
+ {
+ LIBBPF_OPTS(bpf_map_create_opts, opts, .inner_map_fd = inner_fd);
+ int fd;
+
+ fd = bpf_map_create(BPF_MAP_TYPE_ARRAY_OF_MAPS,
+ "example_array", /* name */
+ sizeof(__u32), /* key size */
+ sizeof(__u32), /* value size */
+ 256, /* max entries */
+ &opts); /* create opts */
+ return fd;
+ }
+
+
+This snippet shows how to add an inner map to an outer map:
+
+.. code-block:: c
+
+ int add_devmap(int outer_fd, int index, const char *name)
+ {
+ int fd;
+
+ fd = bpf_map_create(BPF_MAP_TYPE_DEVMAP, name,
+ sizeof(__u32), sizeof(__u32), 256, NULL);
+ if (fd < 0)
+ return fd;
+
+ return bpf_map_update_elem(outer_fd, &index, &fd, BPF_ANY);
+ }
+
+References
+==========
+
+- https://lore.kernel.org/netdev/20170322170035.923581-3-kafai@fb.com/
+- https://lore.kernel.org/netdev/20170322170035.923581-4-kafai@fb.com/
diff --git a/Documentation/bpf/map_queue_stack.rst b/Documentation/bpf/map_queue_stack.rst
new file mode 100644
index 000000000000..8d14ed49d6e1
--- /dev/null
+++ b/Documentation/bpf/map_queue_stack.rst
@@ -0,0 +1,146 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=========================================
+BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK
+=========================================
+
+.. note::
+ - ``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` were introduced
+ in kernel version 4.20
+
+``BPF_MAP_TYPE_QUEUE`` provides FIFO storage and ``BPF_MAP_TYPE_STACK``
+provides LIFO storage for BPF programs. These maps support peek, pop and
+push operations that are exposed to BPF programs through the respective
+helpers. These operations are exposed to userspace applications using
+the existing ``bpf`` syscall in the following way:
+
+- ``BPF_MAP_LOOKUP_ELEM`` -> peek
+- ``BPF_MAP_LOOKUP_AND_DELETE_ELEM`` -> pop
+- ``BPF_MAP_UPDATE_ELEM`` -> push
+
+``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` do not support
+``BPF_F_NO_PREALLOC``.
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_map_push_elem()
+~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+
+An element ``value`` can be added to a queue or stack using the
+``bpf_map_push_elem`` helper. The ``flags`` parameter must be set to
+``BPF_ANY`` or ``BPF_EXIST``. If ``flags`` is set to ``BPF_EXIST`` then,
+when the queue or stack is full, the oldest element will be removed to
+make room for ``value`` to be added. Returns ``0`` on success, or
+negative error in case of failure.
+
+bpf_map_peek_elem()
+~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_peek_elem(struct bpf_map *map, void *value)
+
+This helper fetches an element ``value`` from a queue or stack without
+removing it. Returns ``0`` on success, or negative error in case of
+failure.
+
+bpf_map_pop_elem()
+~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_pop_elem(struct bpf_map *map, void *value)
+
+This helper removes an element from a queue or stack, storing it into
+``value``. Returns ``0`` on success, or negative error in case of failure.
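+
+As a sketch, a BPF program could combine these helpers as follows,
+reusing the ``queue`` map declared in the Examples section below:
+
+.. code-block:: c
+
+ SEC("xdp")
+ int queue_sample(struct xdp_md *ctx)
+ {
+ __u32 in = ctx->rx_queue_index;
+ __u32 out;
+
+ /* Push; with BPF_EXIST the oldest element is evicted when full */
+ if (bpf_map_push_elem(&queue, &in, BPF_EXIST))
+ return XDP_PASS;
+
+ /* Pop the element at the head of the queue into 'out' */
+ if (!bpf_map_pop_elem(&queue, &out))
+ bpf_printk("popped %u", out);
+
+ return XDP_PASS;
+ }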
+
+
+Userspace
+---------
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
+
+A userspace program can push ``value`` onto a queue or stack using libbpf's
+``bpf_map_update_elem`` function. The ``key`` parameter must be set to
+``NULL`` and ``flags`` must be set to ``BPF_ANY`` or ``BPF_EXIST``, with the
+same semantics as the ``bpf_map_push_elem`` kernel helper. Returns ``0`` on
+success, or negative error in case of failure.
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_lookup_elem(int fd, const void *key, void *value)
+
+A userspace program can peek at the ``value`` at the head of a queue or stack
+using the libbpf ``bpf_map_lookup_elem`` function. The ``key`` parameter must be
+set to ``NULL``. Returns ``0`` on success, or negative error in case of
+failure.
+
+bpf_map_lookup_and_delete_elem()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_lookup_and_delete_elem(int fd, const void *key, void *value)
+
+A userspace program can pop a ``value`` from the head of a queue or stack using
+the libbpf ``bpf_map_lookup_and_delete_elem`` function. The ``key`` parameter
+must be set to ``NULL``. Returns ``0`` on success, or negative error in case of
+failure.
+
+Examples
+========
+
+Kernel BPF
+----------
+
+This snippet shows how to declare a queue in a BPF program:
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __type(value, __u32);
+ __uint(max_entries, 10);
+ } queue SEC(".maps");
+
+
+Userspace
+---------
+
+This snippet shows how to use libbpf's low-level API to create a queue from
+userspace:
+
+.. code-block:: c
+
+ int create_queue()
+ {
+ return bpf_map_create(BPF_MAP_TYPE_QUEUE,
+ "sample_queue", /* name */
+ 0, /* key size, must be zero */
+ sizeof(__u32), /* value size */
+ 10, /* max entries */
+ NULL); /* create options */
+ }
+
+
+References
+==========
+
+https://lwn.net/ml/netdev/153986858555.9127.14517764371945179514.stgit@kernel/
diff --git a/Documentation/bpf/map_sk_storage.rst b/Documentation/bpf/map_sk_storage.rst
new file mode 100644
index 000000000000..4e9d23ab9ecd
--- /dev/null
+++ b/Documentation/bpf/map_sk_storage.rst
@@ -0,0 +1,159 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=======================
+BPF_MAP_TYPE_SK_STORAGE
+=======================
+
+.. note::
+ - ``BPF_MAP_TYPE_SK_STORAGE`` was introduced in kernel version 5.2
+
+``BPF_MAP_TYPE_SK_STORAGE`` is used to provide socket-local storage for BPF
+programs. A map of type ``BPF_MAP_TYPE_SK_STORAGE`` declares the type of storage
+to be provided and acts as the handle for accessing the socket-local
+storage. The values for maps of type ``BPF_MAP_TYPE_SK_STORAGE`` are stored
+locally with each socket instead of with the map. The kernel is responsible for
+allocating storage for a socket when requested and for freeing the storage when
+either the map or the socket is deleted.
+
+.. note::
+ - The key type must be ``int`` and ``max_entries`` must be set to ``0``.
+ - The ``BPF_F_NO_PREALLOC`` flag must be used when creating a map for
+ socket-local storage.
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_sk_storage_get()
+~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_sk_storage_get(struct bpf_map *map, void *sk, void *value, u64 flags)
+
+Socket-local storage for ``map`` can be retrieved from socket ``sk`` using the
+``bpf_sk_storage_get()`` helper. If the ``BPF_LOCAL_STORAGE_GET_F_CREATE``
+flag is used then ``bpf_sk_storage_get()`` will create the storage for ``sk``
+if it does not already exist. ``value`` can be used together with
+``BPF_LOCAL_STORAGE_GET_F_CREATE`` to initialize the storage value, otherwise
+it will be zero initialized. Returns a pointer to the storage on success, or
+``NULL`` in case of failure.
+
+.. note::
+ - ``sk`` is a kernel ``struct sock`` pointer for LSM or tracing programs.
+ - ``sk`` is a ``struct bpf_sock`` pointer for other program types.
+
+bpf_sk_storage_delete()
+~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_sk_storage_delete(struct bpf_map *map, void *sk)
+
+Socket-local storage for ``map`` can be deleted from socket ``sk`` using the
+``bpf_sk_storage_delete()`` helper. Returns ``0`` on success, or negative
+error in case of failure.
+
+User space
+----------
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_update_elem(int map_fd, const void *key, const void *value, __u64 flags)
+
+Socket-local storage for map ``map_fd`` can be added or updated locally to a
+socket using the ``bpf_map_update_elem()`` libbpf function. The socket is
+identified by a `socket` ``fd`` stored in the pointer ``key``. The pointer
+``value`` holds the data to be added or updated for the socket ``fd``. The type
+and size of ``value`` should be the same as the value type of the map
+definition.
+
+The ``flags`` parameter can be used to control the update behaviour:
+
+- ``BPF_ANY`` will create storage for `socket` ``fd`` or update existing storage.
+- ``BPF_NOEXIST`` will create storage for `socket` ``fd`` only if it did not
+ already exist, otherwise the call will fail with ``-EEXIST``.
+- ``BPF_EXIST`` will update existing storage for `socket` ``fd`` if it already
+ exists, otherwise the call will fail with ``-ENOENT``.
+
+Returns ``0`` on success, or negative error in case of failure.
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_lookup_elem(int map_fd, const void *key, void *value)
+
+Socket-local storage for map ``map_fd`` can be retrieved from a socket using
+the ``bpf_map_lookup_elem()`` libbpf function. The storage is retrieved from
+the socket identified by a `socket` ``fd`` stored in the pointer
+``key``. Returns ``0`` on success, or negative error in case of failure.
+
+bpf_map_delete_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_delete_elem(int map_fd, const void *key)
+
+Socket-local storage for map ``map_fd`` can be deleted from a socket using the
+``bpf_map_delete_elem()`` libbpf function. The storage is deleted from the
+socket identified by a `socket` ``fd`` stored in the pointer ``key``. Returns
+``0`` on success, or negative error in case of failure.
+
+Examples
+========
+
+Kernel BPF
+----------
+
+This snippet shows how to declare socket-local storage in a BPF program:
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_SK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct my_storage);
+ } socket_storage SEC(".maps");
+
+This snippet shows how to retrieve socket-local storage in a BPF program:
+
+.. code-block:: c
+
+ SEC("sockops")
+ int _sockops(struct bpf_sock_ops *ctx)
+ {
+ struct my_storage *storage;
+ struct bpf_sock *sk;
+
+ sk = ctx->sk;
+ if (!sk)
+ return 1;
+
+ storage = bpf_sk_storage_get(&socket_storage, sk, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (!storage)
+ return 1;
+
+ /* Use 'storage' here */
+
+ return 1;
+ }
+
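+User space
+----------
+
+A user-space sketch of attaching storage to a connected socket, assuming
+``map_fd`` refers to the ``socket_storage`` map declared above and
+``sock_fd`` is an open, connected socket:
+
+.. code-block:: c
+
+ #include <bpf/bpf.h>
+
+ int set_storage(int map_fd, int sock_fd)
+ {
+ struct my_storage init = {};
+
+ /* Create storage for this socket; fails with -EEXIST if present */
+ return bpf_map_update_elem(map_fd, &sock_fd, &init, BPF_NOEXIST);
+ }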
+
+Please see the ``tools/testing/selftests/bpf`` directory for functional
+examples.
+
+References
+==========
+
+https://lwn.net/ml/netdev/20190426171103.61892-1-kafai@fb.com/
diff --git a/Documentation/bpf/map_sockmap.rst b/Documentation/bpf/map_sockmap.rst
new file mode 100644
index 000000000000..cc92047c6630
--- /dev/null
+++ b/Documentation/bpf/map_sockmap.rst
@@ -0,0 +1,498 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright Red Hat
+
+==============================================
+BPF_MAP_TYPE_SOCKMAP and BPF_MAP_TYPE_SOCKHASH
+==============================================
+
+.. note::
+ - ``BPF_MAP_TYPE_SOCKMAP`` was introduced in kernel version 4.14
+ - ``BPF_MAP_TYPE_SOCKHASH`` was introduced in kernel version 4.18
+
+``BPF_MAP_TYPE_SOCKMAP`` and ``BPF_MAP_TYPE_SOCKHASH`` maps can be used to
+redirect skbs between sockets or to apply policy at the socket level based on
+the result of a BPF (verdict) program with the help of the BPF helpers
+``bpf_sk_redirect_map()``, ``bpf_sk_redirect_hash()``,
+``bpf_msg_redirect_map()`` and ``bpf_msg_redirect_hash()``.
+
+``BPF_MAP_TYPE_SOCKMAP`` is backed by an array that uses an integer key as the
+index to look up a reference to a ``struct sock``. The map values are socket
+descriptors. Similarly, ``BPF_MAP_TYPE_SOCKHASH`` is a hash backed BPF map that
+holds references to sockets via their socket descriptors.
+
+.. note::
+ The value type is either ``__u32`` or ``__u64``; the latter (``__u64``) is
+ to support returning socket cookies to userspace. Returning the
+ ``struct sock *`` that the map holds to user-space is neither safe nor
+ useful.
+
+These maps may have BPF programs attached to them, specifically a parser program
+and a verdict program. The parser program determines how much data has been
+parsed and therefore how much data needs to be queued to come to a verdict. The
+verdict program is essentially the redirect program and can return a verdict
+of ``__SK_DROP``, ``__SK_PASS``, or ``__SK_REDIRECT``.
+
+When a socket is inserted into one of these maps, its socket callbacks are
+replaced and a ``struct sk_psock`` is attached to it. Additionally, this
+``sk_psock`` inherits the programs that are attached to the map.
+
+A sock object may be in multiple maps, but can only inherit a single
+parse or verdict program. If adding a sock object to a map would result
+in having multiple parser programs the update will return an ``EBUSY`` error.
+
+The supported programs to attach to these maps are:
+
+.. code-block:: c
+
+ struct sk_psock_progs {
+ struct bpf_prog *msg_parser;
+ struct bpf_prog *stream_parser;
+ struct bpf_prog *stream_verdict;
+ struct bpf_prog *skb_verdict;
+ };
+
+.. note::
+ Users are not allowed to attach ``stream_verdict`` and ``skb_verdict``
+ programs to the same map.
+
+The attach types for the map programs are:
+
+- ``msg_parser`` program - ``BPF_SK_MSG_VERDICT``.
+- ``stream_parser`` program - ``BPF_SK_SKB_STREAM_PARSER``.
+- ``stream_verdict`` program - ``BPF_SK_SKB_STREAM_VERDICT``.
+- ``skb_verdict`` program - ``BPF_SK_SKB_VERDICT``.
+
+There are additional helpers available to use with the parser and verdict
+programs: ``bpf_msg_apply_bytes()`` and ``bpf_msg_cork_bytes()``. With
+``bpf_msg_apply_bytes()`` BPF programs can tell the infrastructure how many
+bytes the given verdict should apply to. The helper ``bpf_msg_cork_bytes()``
+handles a different case where a BPF program cannot reach a verdict on a msg
+until it receives more bytes AND the program doesn't want to forward the packet
+until it is known to be good.
+
+Finally, the helpers ``bpf_msg_pull_data()`` and ``bpf_msg_push_data()`` are
+available to ``BPF_PROG_TYPE_SK_MSG`` BPF programs to pull in data and set the
+start and end pointers to given values or to add metadata to the ``struct
+sk_msg_buff *msg``.
+
+All these helpers will be described in more detail below.
+
+Usage
+=====
+Kernel BPF
+----------
+bpf_msg_redirect_map()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_msg_redirect_map(struct sk_msg_buff *msg, struct bpf_map *map, u32 key, u64 flags)
+
+This helper is used in programs implementing policies at the socket level. If
+the message ``msg`` is allowed to pass (i.e., if the verdict BPF program
+returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces
+can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used
+to select the ingress path otherwise the egress path is selected. This is the
+only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
+bpf_sk_redirect_map()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_sk_redirect_map(struct sk_buff *skb, struct bpf_map *map, u32 key, u64 flags)
+
+Redirect the packet to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKMAP``) at index ``key``. Both ingress and egress interfaces
+can be used for redirection. The ``BPF_F_INGRESS`` value in ``flags`` is used
+to select the ingress path otherwise the egress path is selected. This is the
+only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Socket entries of type ``struct sock *`` can be retrieved using the
+``bpf_map_lookup_elem()`` helper.
+
+bpf_sock_map_update()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_sock_map_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
+
+Add an entry to, or update a ``map`` referencing sockets. The ``skops`` is used
+as a new value for the entry associated to ``key``. The ``flags`` argument can
+be one of the following:
+
+- ``BPF_ANY``: Create a new element or update an existing element.
+- ``BPF_NOEXIST``: Create a new element only if it did not exist.
+- ``BPF_EXIST``: Update an existing element.
+
+If the ``map`` has BPF programs (parser and verdict), those will be inherited
+by the socket being added. If the socket is already attached to BPF programs,
+this results in an error.
+
+Returns 0 on success, or a negative error in case of failure.
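+
+A sketch of inserting established sockets from a ``sockops`` program,
+reusing the ``sock_map_rx`` map declared in the Examples section:
+
+.. code-block:: c
+
+ SEC("sockops")
+ int add_established(struct bpf_sock_ops *skops)
+ {
+ __u32 key = 0;
+
+ switch (skops->op) {
+ case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+ case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+ bpf_sock_map_update(skops, &sock_map_rx, &key, BPF_ANY);
+ break;
+ }
+
+ return 0;
+ }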
+
+bpf_sock_hash_update()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_sock_hash_update(struct bpf_sock_ops *skops, struct bpf_map *map, void *key, u64 flags)
+
+Add an entry to, or update a sockhash ``map`` referencing sockets. The ``skops``
+is used as a new value for the entry associated to ``key``.
+
+The ``flags`` argument can be one of the following:
+
+- ``BPF_ANY``: Create a new element or update an existing element.
+- ``BPF_NOEXIST``: Create a new element only if it did not exist.
+- ``BPF_EXIST``: Update an existing element.
+
+If the ``map`` has BPF programs (parser and verdict), those will be inherited
+by the socket being added. If the socket is already attached to BPF programs,
+this results in an error.
+
+Returns 0 on success, or a negative error in case of failure.
+
+bpf_msg_redirect_hash()
+^^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
+
+This helper is used in programs implementing policies at the socket level. If
+the message ``msg`` is allowed to pass (i.e., if the verdict BPF program returns
+``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress
+interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in
+``flags`` is used to select the ingress path otherwise the egress path is
+selected. This is the only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
+bpf_sk_redirect_hash()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
+
+This helper is used in programs implementing policies at the skb socket level.
+If the sk_buff ``skb`` is allowed to pass (i.e., if the verdict BPF program
+returns ``SK_PASS``), redirect it to the socket referenced by ``map`` (of type
+``BPF_MAP_TYPE_SOCKHASH``) using hash ``key``. Both ingress and egress
+interfaces can be used for redirection. The ``BPF_F_INGRESS`` value in
+``flags`` is used to select the ingress path otherwise the egress path is
+selected. This is the only flag supported for now.
+
+Returns ``SK_PASS`` on success, or ``SK_DROP`` on error.
+
+bpf_msg_apply_bytes()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_msg_apply_bytes(struct sk_msg_buff *msg, u32 bytes)
+
+For socket policies, apply the verdict of the BPF program to the next
+``bytes`` bytes of message ``msg``. For example, this helper can be used in
+the following cases:
+
+- A single ``sendmsg()`` or ``sendfile()`` system call contains multiple
+ logical messages that the BPF program is supposed to read and for which it
+ should apply a verdict.
+- A BPF program only cares to read the first ``bytes`` of a ``msg``. If the
+ message has a large payload, then setting up and calling the BPF program
+ repeatedly for all bytes, even though the verdict is already known, would
+ create unnecessary overhead.
+
+Returns ``0``.
+
+bpf_msg_cork_bytes()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_msg_cork_bytes(struct sk_msg_buff *msg, u32 bytes)
+
+For socket policies, prevent the execution of the verdict BPF program for
+message ``msg`` until the number of ``bytes`` have been accumulated.
+
+This can be used when one needs a specific number of bytes before a verdict can
+be assigned, even if the data spans multiple ``sendmsg()`` or ``sendfile()``
+calls.
+
+Returns ``0``.
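+
+A sketch combining the two helpers in a ``sk_msg`` verdict program; the
+64-byte threshold is illustrative:
+
+.. code-block:: c
+
+ SEC("sk_msg")
+ int msg_policy(struct sk_msg_md *msg)
+ {
+ /* Hold the message until at least 64 bytes have accumulated */
+ bpf_msg_cork_bytes(msg, 64);
+
+ /* Let the resulting verdict cover exactly those 64 bytes */
+ bpf_msg_apply_bytes(msg, 64);
+
+ return SK_PASS;
+ }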
+
+bpf_msg_pull_data()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_msg_pull_data(struct sk_msg_buff *msg, u32 start, u32 end, u64 flags)
+
+For socket policies, pull in non-linear data from user space for ``msg`` and set
+pointers ``msg->data`` and ``msg->data_end`` to ``start`` and ``end`` bytes
+offsets into ``msg``, respectively.
+
+If a program of type ``BPF_PROG_TYPE_SK_MSG`` is run on a ``msg`` it can only
+parse data that the (``data``, ``data_end``) pointers have already consumed.
+For ``sendmsg()`` hooks this is likely the first scatterlist element. But for
+calls relying on the ``sendpage`` handler (e.g., ``sendfile()``) this will be
+the range (**0**, **0**) because the data is shared with user space and by
+default the objective is to avoid allowing user space to modify data while (or
+after) BPF verdict is being decided. This helper can be used to pull in data
+and to set the start and end pointers to given values. Data will be copied if
+necessary (i.e., if data was not linear and if start and end pointers do not
+point to the same chunk).
+
+A call to this helper may change the underlying packet buffer.
+Therefore, at load time, all checks on pointers previously done by the
+verifier are invalidated and must be performed again, if the helper is
+used in combination with direct packet access.
+
+All values for ``flags`` are reserved for future usage, and must be left at
+zero.
+
+Returns 0 on success, or a negative error in case of failure.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Look up a socket entry in the sockmap or sockhash map.
+
+Returns the socket entry associated to ``key``, or NULL if no entry was found.
+
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+
+Add or update a socket entry in a sockmap or sockhash.
+
+The flags argument can be one of the following:
+
+- BPF_ANY: Create a new element or update an existing element.
+- BPF_NOEXIST: Create a new element only if it did not exist.
+- BPF_EXIST: Update an existing element.
+
+Returns 0 on success, or a negative error in case of failure.
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_map_delete_elem(struct bpf_map *map, const void *key)
+
+Delete a socket entry from a sockmap or a sockhash.
+
+Returns 0 on success, or a negative error in case of failure.
+
+User space
+----------
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
+
+Sockmap entries can be added or updated using the ``bpf_map_update_elem()``
+function. The ``key`` parameter is the index value of the sockmap array, and
+the ``value`` parameter is the FD value of that socket.
+
+Under the hood, the sockmap update function uses the socket FD value to
+retrieve the associated socket and its attached psock.
+
+The flags argument can be one of the following:
+
+- BPF_ANY: Create a new element or update an existing element.
+- BPF_NOEXIST: Create a new element only if it did not exist.
+- BPF_EXIST: Update an existing element.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_lookup_elem(int fd, const void *key, void *value)
+
+Sockmap entries can be retrieved using the ``bpf_map_lookup_elem()`` function.
+
+.. note::
+ The entry returned is a socket cookie rather than a socket itself.
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_delete_elem(int fd, const void *key)
+
+Sockmap entries can be deleted using the ``bpf_map_delete_elem()``
+function.
+
+Returns 0 on success, or negative error in case of failure.
+
+Examples
+========
+
+Kernel BPF
+----------
+Several examples of the use of sockmap APIs can be found in:
+
+- `tools/testing/selftests/bpf/progs/test_sockmap_kern.h`_
+- `tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`_
+- `tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`_
+- `tools/testing/selftests/bpf/progs/test_sockmap_listen.c`_
+- `tools/testing/selftests/bpf/progs/test_sockmap_update.c`_
+
+The following code snippet shows how to declare a sockmap.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_SOCKMAP);
+ __uint(max_entries, 1);
+ __type(key, __u32);
+ __type(value, __u64);
+ } sock_map_rx SEC(".maps");
+
+The following code snippet shows a sample parser program.
+
+.. code-block:: c
+
+ SEC("sk_skb/stream_parser")
+ int bpf_prog_parser(struct __sk_buff *skb)
+ {
+ return skb->len;
+ }
+
+The following code snippet shows a simple verdict program that interacts with a
+sockmap to redirect traffic to another socket based on the local port.
+
+.. code-block:: c
+
+ SEC("sk_skb/stream_verdict")
+ int bpf_prog_verdict(struct __sk_buff *skb)
+ {
+ __u32 lport = skb->local_port;
+ __u32 idx = 0;
+
+ if (lport == 10000)
+ return bpf_sk_redirect_map(skb, &sock_map_rx, idx, 0);
+
+ return SK_PASS;
+ }
+
+The following code snippet shows how to declare a sockhash map.
+
+.. code-block:: c
+
+ struct socket_key {
+ __u32 src_ip;
+ __u32 dst_ip;
+ __u32 src_port;
+ __u32 dst_port;
+ };
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_SOCKHASH);
+ __uint(max_entries, 1);
+ __type(key, struct socket_key);
+ __type(value, __u64);
+ } sock_hash_rx SEC(".maps");
+
+The following code snippet shows a simple verdict program that interacts with a
+sockhash to redirect traffic to another socket based on a hash of some of the
+skb parameters.
+
+.. code-block:: c
+
+ static inline
+ void extract_socket_key(struct __sk_buff *skb, struct socket_key *key)
+ {
+ key->src_ip = skb->remote_ip4;
+ key->dst_ip = skb->local_ip4;
+ key->src_port = skb->remote_port >> 16;
+ key->dst_port = (bpf_htonl(skb->local_port)) >> 16;
+ }
+
+ SEC("sk_skb/stream_verdict")
+ int bpf_prog_verdict(struct __sk_buff *skb)
+ {
+ struct socket_key key;
+
+ extract_socket_key(skb, &key);
+
+ return bpf_sk_redirect_hash(skb, &sock_hash_rx, &key, 0);
+ }
+
+User space
+----------
+Several examples of the use of sockmap APIs can be found in:
+
+- `tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`_
+- `tools/testing/selftests/bpf/test_sockmap.c`_
+- `tools/testing/selftests/bpf/test_maps.c`_
+
+The following code sample shows how to create a sockmap, attach a parser and
+verdict program, as well as add a socket entry.
+
+.. code-block:: c
+
+ int create_sample_sockmap(int sock, int parse_prog_fd, int verdict_prog_fd)
+ {
+ int index = 0;
+ int map, err;
+
+ map = bpf_map_create(BPF_MAP_TYPE_SOCKMAP, NULL, sizeof(int), sizeof(int), 1, NULL);
+ if (map < 0) {
+ fprintf(stderr, "Failed to create sockmap: %s\n", strerror(errno));
+ return -1;
+ }
+
+ err = bpf_prog_attach(parse_prog_fd, map, BPF_SK_SKB_STREAM_PARSER, 0);
+ if (err) {
+ fprintf(stderr, "Failed to attach_parser_prog_to_map: %s\n", strerror(errno));
+ goto out;
+ }
+
+ err = bpf_prog_attach(verdict_prog_fd, map, BPF_SK_SKB_STREAM_VERDICT, 0);
+ if (err) {
+ fprintf(stderr, "Failed to attach_verdict_prog_to_map: %s\n", strerror(errno));
+ goto out;
+ }
+
+ err = bpf_map_update_elem(map, &index, &sock, BPF_NOEXIST);
+ if (err) {
+ fprintf(stderr, "Failed to update sockmap: %s\n", strerror(errno));
+ goto out;
+ }
+
+ out:
+ close(map);
+ return err;
+ }
+
+References
+===========
+
+- https://github.com/jrfastab/linux-kernel-xdp/commit/c89fd73cb9d2d7f3c716c3e00836f07b1aeb261f
+- https://lwn.net/Articles/731133/
+- http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf
+- https://lwn.net/Articles/748628/
+- https://lore.kernel.org/bpf/20200218171023.844439-7-jakub@cloudflare.com/
+
+.. _`tools/testing/selftests/bpf/progs/test_sockmap_kern.h`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_kern.h
+.. _`tools/testing/selftests/bpf/progs/sockmap_parse_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_parse_prog.c
+.. _`tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/sockmap_verdict_prog.c
+.. _`tools/testing/selftests/bpf/prog_tests/sockmap_basic.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+.. _`tools/testing/selftests/bpf/test_sockmap.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_sockmap.c
+.. _`tools/testing/selftests/bpf/test_maps.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/test_maps.c
+.. _`tools/testing/selftests/bpf/progs/test_sockmap_listen.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_listen.c
+.. _`tools/testing/selftests/bpf/progs/test_sockmap_update.c`: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/progs/test_sockmap_update.c
diff --git a/Documentation/bpf/map_xskmap.rst b/Documentation/bpf/map_xskmap.rst
new file mode 100644
index 000000000000..dc143edd9233
--- /dev/null
+++ b/Documentation/bpf/map_xskmap.rst
@@ -0,0 +1,192 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+===================
+BPF_MAP_TYPE_XSKMAP
+===================
+
+.. note::
+ - ``BPF_MAP_TYPE_XSKMAP`` was introduced in kernel version 4.18
+
+The ``BPF_MAP_TYPE_XSKMAP`` is used as a backend map for the XDP BPF helper
+call ``bpf_redirect_map()`` and the ``XDP_REDIRECT`` action, like 'devmap' and 'cpumap'.
+This map type redirects raw XDP frames to `AF_XDP`_ sockets (XSKs), a new type of
+address family in the kernel that allows redirection of frames from a driver to
+user space without having to traverse the full network stack. An AF_XDP socket
+binds to a single netdev queue. A mapping of XSKs to queues is shown below:
+
+.. code-block:: none
+
+ +---------------------------------------------------+
+ | xsk A | xsk B | xsk C |<---+ User space
+ =========================================================|==========
+ | Queue 0 | Queue 1 | Queue 2 | | Kernel
+ +---------------------------------------------------+ |
+ | Netdev eth0 | |
+ +---------------------------------------------------+ |
+ | +=============+ | |
+ | | key | xsk | | |
+ | +---------+ +=============+ | |
+ | | | | 0 | xsk A | | |
+ | | | +-------------+ | |
+ | | | | 1 | xsk B | | |
+ | | BPF |-- redirect -->+-------------+-------------+
+ | | prog | | 2 | xsk C | |
+ | | | +-------------+ |
+ | | | |
+ | | | |
+ | +---------+ |
+ | |
+ +---------------------------------------------------+
+
+.. note::
+ An AF_XDP socket that is bound to a certain <netdev/queue_id> will *only*
+ accept XDP frames from that <netdev/queue_id>. If an XDP program tries to redirect
+ from a <netdev/queue_id> other than what the socket is bound to, the frame will
+ not be received on the socket.
+
+Typically an XSKMAP is created per netdev. This map contains an array of XSK File
+Descriptors (FDs). The number of array elements is typically set or adjusted using
+the ``max_entries`` map parameter. For AF_XDP ``max_entries`` is equal to the number
+of queues supported by the netdev.
+
+.. note::
+ Both the map key and map value size must be 4 bytes.
+
+Usage
+=====
+
+Kernel BPF
+----------
+bpf_redirect_map()
+^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+
+Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
+For ``BPF_MAP_TYPE_XSKMAP`` this map contains references to XSK FDs
+for sockets attached to a netdev's queues.
+
+.. note::
+ If the map is empty at an index, the packet is dropped. This means that it is
+ necessary to have an XDP program loaded with at least one XSK in the
+ XSKMAP to be able to get any traffic to user space through the socket.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+XSK entry references of type ``struct xdp_sock *`` can be retrieved using the
+``bpf_map_lookup_elem()`` helper.
+
+User space
+----------
+.. note::
+ XSK entries can only be updated/deleted from user space and not from
+ a BPF program. Trying to call these functions from a kernel BPF program will
+ result in the program failing to load and a verifier warning.
+
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
+
+XSK entries can be added or updated using the ``bpf_map_update_elem()``
+helper. The ``key`` parameter is equal to the queue_id of the queue the XSK
+is attaching to, and the ``value`` parameter is the FD value of that socket.
+
+Under the hood, the XSKMAP update function uses the XSK FD value to retrieve the
+associated ``struct xdp_sock`` instance.
+
+The flags argument can be one of the following:
+
+- BPF_ANY: Create a new element or update an existing element.
+- BPF_NOEXIST: Create a new element only if it did not exist.
+- BPF_EXIST: Update an existing element.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_lookup_elem(int fd, const void *key, void *value)
+
+Returns ``0`` on success, or negative error in case of failure.
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_delete_elem(int fd, const void *key)
+
+XSK entries can be deleted using the ``bpf_map_delete_elem()``
+helper. This helper will return 0 on success, or negative error in case of
+failure.
+
+.. note::
+ When `libxdp`_ deletes an XSK it also removes the associated socket
+ entry from the XSKMAP.
+
+Examples
+========
+Kernel
+------
+
+The following code snippet shows how to declare a ``BPF_MAP_TYPE_XSKMAP`` called
+``xsks_map`` and how to redirect packets to an XSK.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_XSKMAP);
+ __type(key, __u32);
+ __type(value, __u32);
+ __uint(max_entries, 64);
+ } xsks_map SEC(".maps");
+
+
+ SEC("xdp")
+ int xsk_redir_prog(struct xdp_md *ctx)
+ {
+ __u32 index = ctx->rx_queue_index;
+
+ if (bpf_map_lookup_elem(&xsks_map, &index))
+ return bpf_redirect_map(&xsks_map, index, 0);
+ return XDP_PASS;
+ }
+
+User space
+----------
+
+The following code snippet shows how to update an XSKMAP with an XSK entry.
+
+.. code-block:: c
+
+ int update_xsks_map(struct bpf_map *xsks_map, int queue_id, int xsk_fd)
+ {
+ int ret;
+
+ ret = bpf_map_update_elem(bpf_map__fd(xsks_map), &queue_id, &xsk_fd, 0);
+ if (ret < 0)
+ fprintf(stderr, "Failed to update xsks_map: %s\n", strerror(errno));
+
+ return ret;
+ }
+
+For an example of how to create AF_XDP sockets, please see the AF_XDP-example and
+AF_XDP-forwarding programs in the `bpf-examples`_ directory in the `libxdp`_ repository.
+For a detailed explanation of the AF_XDP interface please see:
+
+- `libxdp-readme`_.
+- `AF_XDP`_ kernel documentation.
+
+.. note::
+ The most comprehensive resource for using XSKMAPs and AF_XDP is `libxdp`_.
+
+.. _libxdp: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp
+.. _AF_XDP: https://www.kernel.org/doc/html/latest/networking/af_xdp.html
+.. _bpf-examples: https://github.com/xdp-project/bpf-examples
+.. _libxdp-readme: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp#using-af_xdp-sockets
diff --git a/Documentation/bpf/maps.rst b/Documentation/bpf/maps.rst
new file mode 100644
index 000000000000..6f069f3d6f4b
--- /dev/null
+++ b/Documentation/bpf/maps.rst
@@ -0,0 +1,82 @@
+
+========
+BPF maps
+========
+
+BPF 'maps' provide generic storage of different types for sharing data between
+kernel and user space. There are several storage types available, including
+hash, array, bloom filter and radix-tree. Several of the map types exist to
+support specific BPF helpers that perform actions based on the map contents. The
+maps are accessed from BPF programs via BPF helpers which are documented in the
+`man-pages`_ for `bpf-helpers(7)`_.
+
+BPF maps are accessed from user space via the ``bpf`` syscall, which provides
+commands to create maps, lookup elements, update elements and delete elements.
+More details of the BPF syscall are available in `ebpf-syscall`_ and in the
+`man-pages`_ for `bpf(2)`_.
+
+Map Types
+=========
+
+.. toctree::
+ :maxdepth: 1
+ :glob:
+
+ map_*
+
+Usage Notes
+===========
+
+.. c:function::
+ int bpf(int command, union bpf_attr *attr, u32 size)
+
+Use the ``bpf()`` system call to perform the operation specified by
+``command``. The operation takes parameters provided in ``attr``. The ``size``
+argument is the size of the ``union bpf_attr`` in ``attr``.
+
+**BPF_MAP_CREATE**
+
+Create a map with the desired type and attributes in ``attr``:
+
+.. code-block:: c
+
+ int fd;
+ union bpf_attr attr = {
+ .map_type = BPF_MAP_TYPE_ARRAY, /* mandatory */
+ .key_size = sizeof(__u32), /* mandatory */
+ .value_size = sizeof(__u32), /* mandatory */
+ .max_entries = 256, /* mandatory */
+ .map_flags = BPF_F_MMAPABLE,
+ .map_name = "example_array",
+ };
+
+ fd = bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
+
+Returns a process-local file descriptor on success, or negative error in case of
+failure. The map can be deleted by calling ``close(fd)``. Maps held by open
+file descriptors will be deleted automatically when a process exits.
+
+.. note:: Valid characters for ``map_name`` are ``A-Z``, ``a-z``, ``0-9``,
+ ``'_'`` and ``'.'``.
+
+**BPF_MAP_LOOKUP_ELEM**
+
+Look up a key in a given map using ``attr->map_fd``, ``attr->key``,
+``attr->value``. Returns zero and stores the found elem into ``attr->value``
+on success, or negative error on failure.
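+
+For illustration, a lookup through the raw syscall might look as follows.
+Most C libraries provide no ``bpf()`` wrapper, so the sketch goes through
+``syscall(2)``; the function name is illustrative:
+
+.. code-block:: c
+
+ #include <linux/bpf.h>
+ #include <string.h>
+ #include <sys/syscall.h>
+ #include <unistd.h>
+
+ int lookup_u32(int map_fd, __u32 key, __u32 *value)
+ {
+ union bpf_attr attr;
+
+ memset(&attr, 0, sizeof(attr));
+ attr.map_fd = map_fd;
+ attr.key = (__u64)(unsigned long)&key;
+ attr.value = (__u64)(unsigned long)value;
+
+ return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
+ }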
+
+**BPF_MAP_UPDATE_ELEM**
+
+Create or update key/value pair in a given map using ``attr->map_fd``, ``attr->key``,
+``attr->value``. Returns zero on success or negative error on failure.
+
+**BPF_MAP_DELETE_ELEM**
+
+Find and delete element by key in a given map using ``attr->map_fd``,
+``attr->key``. Returns zero on success or negative error on failure.
+
+.. Links:
+.. _man-pages: https://www.kernel.org/doc/man-pages/
+.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html
+.. _bpf-helpers(7): https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
+.. _ebpf-syscall: https://docs.kernel.org/userspace-api/ebpf/syscall.html
diff --git a/Documentation/bpf/other.rst b/Documentation/bpf/other.rst
new file mode 100644
index 000000000000..7e6b12018802
--- /dev/null
+++ b/Documentation/bpf/other.rst
@@ -0,0 +1,10 @@
+=====
+Other
+=====
+
+.. toctree::
+ :maxdepth: 1
+
+ ringbuf
+ llvm_reloc
+ graph_ds_impl
diff --git a/Documentation/bpf/prog_lsm.rst b/Documentation/bpf/prog_lsm.rst
new file mode 100644
index 000000000000..ad2be02f30c2
--- /dev/null
+++ b/Documentation/bpf/prog_lsm.rst
@@ -0,0 +1,143 @@
+.. SPDX-License-Identifier: GPL-2.0+
+.. Copyright (C) 2020 Google LLC.
+
+================
+LSM BPF Programs
+================
+
+These BPF programs allow runtime instrumentation of the LSM hooks by privileged
+users to implement system-wide MAC (Mandatory Access Control) and Audit
+policies using eBPF.
+
+Structure
+---------
+
+The example shows an eBPF program that can be attached to the ``file_mprotect``
+LSM hook:
+
+.. c:function:: int file_mprotect(struct vm_area_struct *vma, unsigned long reqprot, unsigned long prot);
+
+Other LSM hooks which can be instrumented can be found in
+``security/security.c``.
+
+eBPF programs that use Documentation/bpf/btf.rst do not need to include kernel
+headers for accessing information from the attached eBPF program's context.
+They can simply declare the structures in the eBPF program and only specify
+the fields that need to be accessed.
+
+.. code-block:: c
+
+ struct mm_struct {
+ unsigned long start_brk, brk, start_stack;
+ } __attribute__((preserve_access_index));
+
+ struct vm_area_struct {
+ unsigned long start_brk, brk, start_stack;
+ unsigned long vm_start, vm_end;
+ struct mm_struct *vm_mm;
+ } __attribute__((preserve_access_index));
+
+
+.. note:: The order of the fields is irrelevant.
+
+This can be further simplified (if one has access to the BTF information at
+build time) by generating the ``vmlinux.h`` with:
+
+.. code-block:: console
+
+ # bpftool btf dump file <path-to-btf-vmlinux> format c > vmlinux.h
+
+.. note:: ``path-to-btf-vmlinux`` can be ``/sys/kernel/btf/vmlinux`` if the
+ build environment matches the environment the BPF programs are
+ deployed in.
+
+The ``vmlinux.h`` can then simply be included in the BPF programs without
+requiring the definition of the types.
+
+The eBPF programs can be declared using the ``BPF_PROG``
+macro defined in `tools/lib/bpf/bpf_tracing.h`_. In this
+example:
+
+ * ``"lsm/file_mprotect"`` indicates the LSM hook that the program must
+ be attached to
+ * ``mprotect_audit`` is the name of the eBPF program
+
+.. code-block:: c
+
+ SEC("lsm/file_mprotect")
+ int BPF_PROG(mprotect_audit, struct vm_area_struct *vma,
+ unsigned long reqprot, unsigned long prot, int ret)
+ {
+ /* ret is the return value from the previous BPF program
+ * or 0 if it's the first hook.
+ */
+ if (ret != 0)
+ return ret;
+
+ int is_heap;
+
+ is_heap = (vma->vm_start >= vma->vm_mm->start_brk &&
+ vma->vm_end <= vma->vm_mm->brk);
+
+ /* Return an -EPERM or write information to the perf events buffer
+ * for auditing
+ */
+ if (is_heap)
+ return -EPERM;
+
+ return 0;
+ }
+
+The ``__attribute__((preserve_access_index))`` is a clang feature that allows
+the BPF verifier to update the offsets for the access at runtime using the
+Documentation/bpf/btf.rst information. Since the BPF verifier is aware of the
+types, it also validates all the accesses made to the various types in the
+eBPF program.
+
+Loading
+-------
+
+eBPF programs can be loaded with the :manpage:`bpf(2)` syscall's
+``BPF_PROG_LOAD`` operation:
+
+.. code-block:: c
+
+ struct bpf_object *obj;
+
+ obj = bpf_object__open("./my_prog.o");
+ bpf_object__load(obj);
+
+This can be simplified by using a skeleton header generated by ``bpftool``:
+
+.. code-block:: console
+
+ # bpftool gen skeleton my_prog.o > my_prog.skel.h
+
+and the program can be loaded by including ``my_prog.skel.h`` and using
+the generated helper, ``my_prog__open_and_load``.
+
+Attachment to LSM Hooks
+-----------------------
+
+The LSM allows attachment of eBPF programs as LSM hooks using :manpage:`bpf(2)`
+syscall's ``BPF_RAW_TRACEPOINT_OPEN`` operation or more simply by
+using the libbpf helper ``bpf_program__attach_lsm``.
+
+The program can be detached from the LSM hook by *destroying* the ``link``
+returned by ``bpf_program__attach_lsm`` using ``bpf_link__destroy``.
+
+One can also use the helpers generated in ``my_prog.skel.h`` i.e.
+``my_prog__attach`` for attachment and ``my_prog__destroy`` for cleaning up.
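+
+Putting these together, a minimal loader sketch using the generated
+skeleton could look like this (the ``my_prog`` names are those produced
+by ``bpftool gen skeleton`` above):
+
+.. code-block:: c
+
+ #include "my_prog.skel.h"
+
+ int main(void)
+ {
+ struct my_prog *skel;
+ int err;
+
+ skel = my_prog__open_and_load();
+ if (!skel)
+ return 1;
+
+ err = my_prog__attach(skel);
+ if (!err) {
+ /* The LSM program is active; run the workload here */
+ }
+
+ my_prog__destroy(skel);
+ return err != 0;
+ }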
+
+Examples
+--------
+
+An example eBPF program can be found in
+`tools/testing/selftests/bpf/progs/lsm.c`_ and the corresponding
+userspace code in `tools/testing/selftests/bpf/prog_tests/test_lsm.c`_
+
+.. Links
+.. _tools/lib/bpf/bpf_tracing.h:
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/lib/bpf/bpf_tracing.h
+.. _tools/testing/selftests/bpf/progs/lsm.c:
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/progs/lsm.c
+.. _tools/testing/selftests/bpf/prog_tests/test_lsm.c:
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/prog_tests/test_lsm.c
diff --git a/Documentation/bpf/programs.rst b/Documentation/bpf/programs.rst
new file mode 100644
index 000000000000..c99000ab6d9b
--- /dev/null
+++ b/Documentation/bpf/programs.rst
@@ -0,0 +1,12 @@
+=============
+Program Types
+=============
+
+.. toctree::
+ :maxdepth: 1
+ :glob:
+
+ prog_*
+
+For a list of all program types, see :ref:`program_types_and_elf` in
+the :ref:`libbpf` documentation.
diff --git a/Documentation/bpf/redirect.rst b/Documentation/bpf/redirect.rst
new file mode 100644
index 000000000000..2fa2b0b05004
--- /dev/null
+++ b/Documentation/bpf/redirect.rst
@@ -0,0 +1,81 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+========
+Redirect
+========
+XDP_REDIRECT
+############
+Supported maps
+--------------
+
+XDP_REDIRECT works with the following map types:
+
+- ``BPF_MAP_TYPE_DEVMAP``
+- ``BPF_MAP_TYPE_DEVMAP_HASH``
+- ``BPF_MAP_TYPE_CPUMAP``
+- ``BPF_MAP_TYPE_XSKMAP``
+
+For more information on these maps, please see the specific map documentation.
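+
+As a sketch, an XDP program triggers ``XDP_REDIRECT`` through one of these
+maps with the ``bpf_redirect_map()`` helper. The devmap name is
+illustrative, and passing ``XDP_PASS`` in the low bits of ``flags`` as the
+fallback action assumes kernel 5.12 or later (use ``0`` otherwise):
+
+.. code-block:: c
+
+ #include <linux/bpf.h>
+ #include <bpf/bpf_helpers.h>
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_DEVMAP);
+ __uint(max_entries, 8);
+ __type(key, __u32);
+ __type(value, __u32);
+ } tx_ports SEC(".maps");
+
+ SEC("xdp")
+ int xdp_redirect_prog(struct xdp_md *ctx)
+ {
+ __u32 key = 0;
+
+ /* Redirect via tx_ports[0]; fall back to XDP_PASS on failure */
+ return bpf_redirect_map(&tx_ports, key, XDP_PASS);
+ }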
+
+Process
+-------
+
+.. kernel-doc:: net/core/filter.c
+ :doc: xdp redirect
+
+.. note::
+ Not all drivers support transmitting frames after a redirect, and for
+ those that do, not all of them support non-linear frames. Non-linear xdp
+ bufs/frames are bufs/frames that contain more than one fragment.
+
+Debugging packet drops
+----------------------
+Silent packet drops for XDP_REDIRECT can be debugged using:
+
+- bpftrace
+- perf record
+
+bpftrace
+^^^^^^^^
+The following bpftrace command can be used to capture and count all XDP tracepoints:
+
+.. code-block:: none
+
+ sudo bpftrace -e 'tracepoint:xdp:* { @cnt[probe] = count(); }'
+ Attaching 12 probes...
+ ^C
+
+ @cnt[tracepoint:xdp:mem_connect]: 18
+ @cnt[tracepoint:xdp:mem_disconnect]: 18
+ @cnt[tracepoint:xdp:xdp_exception]: 19605
+ @cnt[tracepoint:xdp:xdp_devmap_xmit]: 1393604
+ @cnt[tracepoint:xdp:xdp_redirect]: 22292200
+
+.. note::
+ The various xdp tracepoints can be found in ``source/include/trace/events/xdp.h``
+
+The following bpftrace command can be used to extract the ``ERRNO`` being
+returned as part of the ``err`` parameter:
+
+.. code-block:: none
+
+ sudo bpftrace -e \
+ 'tracepoint:xdp:xdp_redirect*_err {@redir_errno[-args->err] = count();}
+ tracepoint:xdp:xdp_devmap_xmit {@devmap_errno[-args->err] = count();}'
+
+perf record
+^^^^^^^^^^^
+The perf tool also supports recording tracepoints:
+
+.. code-block:: none
+
+ perf record -a -e xdp:xdp_redirect_err \
+ -e xdp:xdp_redirect_map_err \
+ -e xdp:xdp_exception \
+ -e xdp:xdp_devmap_xmit
+
+References
+===========
+
+- https://github.com/xdp-project/xdp-tutorial/tree/master/tracing02-xdp-monitor
diff --git a/Documentation/bpf/ringbuf.rst b/Documentation/bpf/ringbuf.rst
index 6a615cd62bda..a99cd05d79d4 100644
--- a/Documentation/bpf/ringbuf.rst
+++ b/Documentation/bpf/ringbuf.rst
@@ -124,7 +124,7 @@ buffer. Currently 4 are supported:
- ``BPF_RB_AVAIL_DATA`` returns amount of unconsumed data in ring buffer;
- ``BPF_RB_RING_SIZE`` returns the size of ring buffer;
-- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical possition
+- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical position
of consumer/producer, respectively.
Returned values are momentarily snapshots of ring buffer state and could be
@@ -146,7 +146,7 @@ Design and Implementation
This reserve/commit schema allows a natural way for multiple producers, either
on different CPUs or even on the same CPU/in the same BPF program, to reserve
independent records and work with them without blocking other producers. This
-means that if BPF program was interruped by another BPF program sharing the
+means that if BPF program was interrupted by another BPF program sharing the
same ring buffer, they will both get a record reserved (provided there is
enough space left) and can work with it and submit it independently. This
applies to NMI context as well, except that due to using a spinlock during
diff --git a/Documentation/bpf/syscall_api.rst b/Documentation/bpf/syscall_api.rst
new file mode 100644
index 000000000000..f0a1dff087ad
--- /dev/null
+++ b/Documentation/bpf/syscall_api.rst
@@ -0,0 +1,11 @@
+===========
+Syscall API
+===========
+
+The primary info for the bpf syscall is available in the `man-pages`_
+for `bpf(2)`_. For more information about the userspace API, see
+Documentation/userspace-api/ebpf/index.rst.
+
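+For illustration, a minimal sketch of invoking the syscall directly (there is
+no glibc wrapper, so ``syscall(2)`` is used; a one-element array map is
+created):
+
+.. code-block:: c
+
+   #include <string.h>
+   #include <unistd.h>
+   #include <sys/syscall.h>
+   #include <linux/bpf.h>
+
+   int create_array_map(void)
+   {
+           union bpf_attr attr;
+
+           memset(&attr, 0, sizeof(attr));
+           attr.map_type = BPF_MAP_TYPE_ARRAY;
+           attr.key_size = sizeof(__u32);
+           attr.value_size = sizeof(__u64);
+           attr.max_entries = 1;
+
+           /* returns a new map fd on success, -1 with errno on failure */
+           return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
+   }
+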
+.. Links:
+.. _man-pages: https://www.kernel.org/doc/man-pages/
+.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html \ No newline at end of file
diff --git a/Documentation/bpf/test_debug.rst b/Documentation/bpf/test_debug.rst
new file mode 100644
index 000000000000..ebf0caceb6a6
--- /dev/null
+++ b/Documentation/bpf/test_debug.rst
@@ -0,0 +1,9 @@
+=========================
+Testing and debugging BPF
+=========================
+
+.. toctree::
+ :maxdepth: 1
+
+ drgn
+ s390
diff --git a/Documentation/bpf/verifier.rst b/Documentation/bpf/verifier.rst
new file mode 100644
index 000000000000..f0ec19db301c
--- /dev/null
+++ b/Documentation/bpf/verifier.rst
@@ -0,0 +1,824 @@
+
+=============
+eBPF verifier
+=============
+
+The safety of the eBPF program is determined in two steps.
+
+The first step does a DAG check to disallow loops, along with other CFG
+validation. In particular, it will detect programs that have unreachable
+instructions (though the classic BPF checker allows them).
+
+The second step starts from the first insn and descends all possible paths.
+It simulates execution of every insn and observes the state change of
+registers and stack.
+
+At the start of the program the register R1 contains a pointer to context
+and has type PTR_TO_CTX.
+If the verifier sees an insn that does R2=R1, then R2 now has type
+PTR_TO_CTX as well and can be used on the right-hand side of an expression.
+If R1=PTR_TO_CTX and the insn is R2=R1+R1, then R2=SCALAR_VALUE,
+since the addition of two valid pointers makes an invalid pointer.
+(In 'secure' mode the verifier will reject any type of pointer arithmetic
+to make sure that kernel addresses don't leak to unprivileged users.)
+
+If a register was never written to, it is not readable::
+
+ bpf_mov R0 = R2
+ bpf_exit
+
+will be rejected, since R2 is unreadable at the start of the program.
+
+After a kernel function call, R1-R5 are reset to unreadable and
+R0 has the return type of the function.
+
+Since R6-R9 are callee saved, their state is preserved across the call.
+
+::
+
+ bpf_mov R6 = 1
+ bpf_call foo
+ bpf_mov R0 = R6
+ bpf_exit
+
+is a correct program. If there was R1 instead of R6, it would have
+been rejected.
+
+Load/store instructions are allowed only with registers of valid types, which
+are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds- and alignment-checked.
+For example::
+
+ bpf_mov R1 = 1
+ bpf_mov R2 = 2
+ bpf_xadd *(u32 *)(R1 + 3) += R2
+ bpf_exit
+
+will be rejected, since R1 doesn't have a valid pointer type at the time of
+execution of instruction bpf_xadd.
+
+At the start R1's type is PTR_TO_CTX (a pointer to a generic
+``struct bpf_context``). A callback is used to customize the verifier to
+restrict eBPF program access to only certain fields within the ctx structure,
+with specified size and alignment.
+
+For example, the following insn::
+
+ bpf_ld R0 = *(u32 *)(R6 + 8)
+
+intends to load a word from address R6 + 8 and store it into R0.
+If R6=PTR_TO_CTX, then via the is_valid_access() callback the verifier will
+know that offset 8 of size 4 bytes can be accessed for reading; otherwise
+the verifier will reject the program.
+If R6=PTR_TO_STACK, then the access should be aligned and within
+stack bounds, which are [-MAX_BPF_STACK, 0). In this example the offset is 8,
+so it will fail verification, since it's out of bounds.
+
+The verifier will allow an eBPF program to read data from the stack only
+after it has written into it.
+
+The classic BPF verifier does a similar check with the M[0-15] memory slots.
+For example::
+
+ bpf_ld R0 = *(u32 *)(R10 - 4)
+ bpf_exit
+
+is an invalid program.
+Though R10 is a correct read-only register with type PTR_TO_STACK,
+and R10 - 4 is within stack bounds, there were no stores into that location.
+
+Pointer register spill/fill is tracked as well, since four (R6-R9)
+callee saved registers may not be enough for some programs.
+
+Allowed function calls are customized with bpf_verifier_ops->get_func_proto().
+The eBPF verifier will check that registers match argument constraints.
+After the call, register R0 will be set to the return type of the function.
+
+Function calls are the main mechanism for extending the functionality of eBPF
+programs. Socket filters may let programs call one set of functions, whereas
+tracing filters may allow a completely different set.
+
+If a function is made accessible to eBPF programs, it needs to be thought
+through from a safety point of view. The verifier will guarantee that the
+function is called with valid arguments.
+
+Seccomp and socket filters have different security restrictions for classic
+BPF. Seccomp solves this with a two-stage verifier: the classic BPF verifier
+is followed by the seccomp verifier. In the case of eBPF, one configurable
+verifier is shared for all use cases.
+
+See kernel/bpf/verifier.c for the details of the eBPF verifier.
+
+Register value tracking
+=======================
+
+In order to determine the safety of an eBPF program, the verifier must track
+the range of possible values in each register and also in each stack slot.
+This is done with ``struct bpf_reg_state``, defined in
+include/linux/bpf_verifier.h, which unifies tracking of scalar and pointer
+values. Each
+register state has a type, which is either NOT_INIT (the register has not been
+written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
+pointer type. The types of pointers describe their base, as follows:
+
+ PTR_TO_CTX
+ Pointer to bpf_context.
+ CONST_PTR_TO_MAP
+ Pointer to struct bpf_map. "Const" because arithmetic
+ on these pointers is forbidden.
+ PTR_TO_MAP_VALUE
+ Pointer to the value stored in a map element.
+ PTR_TO_MAP_VALUE_OR_NULL
+ Either a pointer to a map value, or NULL; map accesses
+ (see maps.rst) return this type, which becomes a
+ PTR_TO_MAP_VALUE when checked != NULL. Arithmetic on
+ these pointers is forbidden.
+ PTR_TO_STACK
+ Frame pointer.
+ PTR_TO_PACKET
+ skb->data.
+ PTR_TO_PACKET_END
+ skb->data + headlen; arithmetic forbidden.
+ PTR_TO_SOCKET
+ Pointer to struct bpf_sock_ops, implicitly refcounted.
+ PTR_TO_SOCKET_OR_NULL
+ Either a pointer to a socket, or NULL; socket lookup
+ returns this type, which becomes a PTR_TO_SOCKET when
+ checked != NULL. PTR_TO_SOCKET is reference-counted,
+ so programs must release the reference through the
+ socket release function before the end of the program.
+ Arithmetic on these pointers is forbidden.
+
+However, a pointer may be offset from this base (as a result of pointer
+arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
+offset'. The former is used when an exactly-known value (e.g. an immediate
+operand) is added to a pointer, while the latter is used for values which are
+not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
+the range of possible values in the register.
+
+The verifier's knowledge about the variable offset consists of:
+
+* minimum and maximum values as unsigned
+* minimum and maximum values as signed
+
+* knowledge of the values of individual bits, in the form of a 'tnum': a u64
+ 'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
+ 1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
+ mask and value; no bit should ever be 1 in both. For example, if a byte is read
+ into a register from memory, the register's top 56 bits are known zero, while
+ the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
+ then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
+ 0x1ff), because of potential carries.
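+
+  For illustration, the same sequence written as ``(value; mask)`` pairs::
+
+      r = load_byte()      (0x0;  0xff)    ; low 8 bits unknown
+      r |= 0x40            (0x40; 0xbf)    ; bit 6 becomes known 1
+      r += 1               (0x0;  0x1ff)   ; a carry may ripple into bit 8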
+
+Besides arithmetic, the register state can also be updated by conditional
+branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
+it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
+branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
+BPF_JSGE) would instead update the signed minimum/maximum values. Information
+from the signed and unsigned bounds can be combined; for instance if a value is
+first tested < 8 and then tested s> 4, the verifier will conclude that the value
+is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
+
+PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
+pointers sharing that same variable offset. This is important for packet range
+checks: after adding a variable to a packet pointer register A, if you then copy
+it to another register B and then add a constant 4 to A, both registers will
+share the same 'id' but A will have a fixed offset of +4. Then if A is
+bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is
+now known to have a safe range of at least 4 bytes. See 'Direct packet access',
+below, for more on PTR_TO_PACKET ranges.
+
+The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
+the pointer returned from a map lookup. This means that when one copy is
+checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
+
+As well as range-checking, the tracked information is also used for enforcing
+alignment of pointer accesses. For instance, on most systems the packet pointer
+is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
+over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
+pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
+bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
+that pointer are safe.
+
+The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
+to all copies of the pointer returned from a socket lookup. This has similar
+behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
+it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
+represents a reference to the corresponding ``struct sock``. To ensure that the
+reference is not leaked, it is imperative to NULL-check the reference and, in
+the non-NULL case, pass the valid reference to the socket release function.
+
+Direct packet access
+====================
+
+In cls_bpf and act_bpf programs the verifier allows direct access to the packet
+data via skb->data and skb->data_end pointers.
+Ex::
+
+ 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */
+ 2: r3 = *(u32 *)(r1 +76) /* load skb->data */
+ 3: r5 = r3
+ 4: r5 += 14
+ 5: if r5 > r4 goto pc+16
+ R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
+ 6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */
+
+this 2-byte load from the packet is safe to do, since the program author
+did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which
+means that in the fall-through case the register R3 (which points to skb->data)
+has at least 14 directly accessible bytes. The verifier marks it
+as R3=pkt(id=0,off=0,r=14).
+id=0 means that no additional variables were added to the register.
+off=0 means that no additional constants were added.
+r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok.
+Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points
+to the packet data, but constant 14 was added to the register, so
+it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14)
+which is zero bytes.
+
+More complex packet access may look like::
+
+ R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
+ 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
+ 7: r4 = *(u8 *)(r3 +12)
+ 8: r4 *= 14
+ 9: r3 = *(u32 *)(r1 +76) /* load skb->data */
+ 10: r3 += r4
+ 11: r2 = r1
+ 12: r2 <<= 48
+ 13: r2 >>= 48
+ 14: r3 += r2
+ 15: r2 = r3
+ 16: r2 += 8
+ 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */
+ 18: if r2 > r1 goto pc+2
+ R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
+ 19: r1 = *(u8 *)(r3 +4)
+
+The state of the register R3 is R3=pkt(id=2,off=0,r=8).
+id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some
+offset within a packet and since the program author did
+``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8).
+The verifier only allows 'add'/'sub' operations on packet registers. Any other
+operation will set the register state to 'SCALAR_VALUE' and it won't be
+available for direct packet access.
+
+The operation ``r3 += rX`` may overflow and become less than the original
+skb->data, therefore the verifier has to prevent that. So when it sees an
+``r3 += rX`` instruction where rX is more than a 16-bit value, any subsequent
+bounds-check of r3 against skb->data_end will not give us 'range' information,
+so attempts to read through the pointer will give an "invalid access to
+packet" error.
+
+Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is
+R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
+of the register are guaranteed to be zero, and nothing is known about the lower
+8 bits. After insn ``r4 *= 14`` the state becomes
+R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
+value by the constant 14 will keep the upper 52 bits zero, and the least
+significant bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make
+R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
+extending. This logic is implemented in adjust_reg_min_max_vals() function,
+which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
+versa) and adjust_scalar_min_max_vals() for operations on two scalars.
+
+The end result is that a BPF program author can access the packet directly
+using normal C code as::
+
+ void *data = (void *)(long)skb->data;
+ void *data_end = (void *)(long)skb->data_end;
+ struct eth_hdr *eth = data;
+ struct iphdr *iph = data + sizeof(*eth);
+ struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);
+
+ if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
+ return 0;
+ if (eth->h_proto != htons(ETH_P_IP))
+ return 0;
+ if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
+ return 0;
+ if (udp->dest == 53 || udp->source == 9)
+ ...;
+
+which makes such programs easier to write compared to the LD_ABS insn,
+and significantly faster.
+
+Pruning
+=======
+
+The verifier does not actually walk all possible paths through the program. For
+each new branch to analyse, the verifier looks at all the states it's previously
+been in when at this instruction. If any of them contain the current state as a
+subset, the branch is 'pruned' - that is, the fact that the previous state was
+accepted implies the current state would be as well. For instance, if in the
+previous state, r1 held a packet-pointer, and in the current state, r1 holds a
+packet-pointer with a range as long or longer and at least as strict an
+alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't
+have been used by any path from that point, so any value in r2 (including
+another NOT_INIT) is safe. The implementation is in the function regsafe().
+Pruning considers not only the registers but also the stack (and any spilled
+registers it may hold). They must all be safe for the branch to be pruned.
+This is implemented in states_equal().
+
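+In pseudocode, the pruning decision made for each new branch could be
+sketched roughly as follows (function names as above)::
+
+  is_state_visited(env, insn_idx):
+      for each cached state sl at insn_idx:
+          if states_equal(env, sl->state, cur_state):
+              # the cached state subsumes the current one,
+              # so this path needs no further exploration
+              return prune
+      add cur_state to the cache at insn_idx
+      return continue
+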
+Some technical details about the state pruning implementation can be found below.
+
+Register liveness tracking
+--------------------------
+
+In order to make state pruning effective, liveness state is tracked for each
+register and stack slot. The basic idea is to track which registers and stack
+slots are actually used during subsequent execution of the program, until
+program exit is reached. Registers and stack slots that were never used can be
+removed from the cached state, thus making more states equivalent to a cached
+state. This can be illustrated by the following program::
+
+ 0: call bpf_get_prandom_u32()
+ 1: r1 = 0
+ 2: if r0 == 0 goto +1
+ 3: r0 = 1
+ --- checkpoint ---
+ 4: r0 = r1
+ 5: exit
+
+Suppose that a state cache entry is created at instruction #4 (such entries are
+also called "checkpoints" in the text below). The verifier could reach the
+instruction with one of two possible register states:
+
+* r0 = 1, r1 = 0
+* r0 = 0, r1 = 0
+
+However, only the value of register ``r1`` is important to successfully finish
+verification. The goal of the liveness tracking algorithm is to spot this fact
+and figure out that both states are actually equivalent.
+
+Data structures
+~~~~~~~~~~~~~~~
+
+Liveness is tracked using the following data structures::
+
+ enum bpf_reg_liveness {
+ REG_LIVE_NONE = 0,
+ REG_LIVE_READ32 = 0x1,
+ REG_LIVE_READ64 = 0x2,
+ REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64,
+ REG_LIVE_WRITTEN = 0x4,
+ REG_LIVE_DONE = 0x8,
+ };
+
+ struct bpf_reg_state {
+ ...
+ struct bpf_reg_state *parent;
+ ...
+ enum bpf_reg_liveness live;
+ ...
+ };
+
+ struct bpf_stack_state {
+ struct bpf_reg_state spilled_ptr;
+ ...
+ };
+
+ struct bpf_func_state {
+ struct bpf_reg_state regs[MAX_BPF_REG];
+ ...
+ struct bpf_stack_state *stack;
+ }
+
+ struct bpf_verifier_state {
+ struct bpf_func_state *frame[MAX_CALL_FRAMES];
+ struct bpf_verifier_state *parent;
+ ...
+ }
+
+* ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new
+ verifier state creation;
+
+* ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is
+ defined by some instruction verified between this verifier state's parent and
+ verifier state itself;
+
+* ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot)
+ is read by some child state of this verifier state;
+
+* ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid
+ processing the same verifier state multiple times and for some sanity checks;
+
+* ``->live`` field values are formed by combining ``enum bpf_reg_liveness``
+ values using bitwise or.
+
+Register parentage chains
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In order to propagate information between parent and child states, a *register
+parentage chain* is established. Each register or stack slot is linked to a
+corresponding register or stack slot in its parent state via a ``->parent``
+pointer. This link is established upon state creation in ``is_state_visited()``
+and might be modified by ``set_callee_state()`` called from
+``__check_func_call()``.
+
+The rules for correspondence between registers / stack slots are as follows:
+
+* For the current stack frame, registers and stack slots of the new state are
+ linked to the registers and stack slots of the parent state with the same
+ indices.
+
+* For the outer stack frames, only callee saved registers (r6-r9) and stack
+ slots are linked to the registers and stack slots of the parent state with the
+ same indices.
+
+* When a function call is processed, a new ``struct bpf_func_state`` instance is
+ allocated; it encapsulates a new set of registers and stack slots. For this
+ new frame, parent links for r6-r9 and stack slots are set to nil, while parent
+ links for r1-r5 are set to match the caller's r1-r5 parent links.
+
+This could be illustrated by the following diagram (arrows stand for
+``->parent`` pointers)::
+
+ ... ; Frame #0, some instructions
+ --- checkpoint #0 ---
+ 1 : r6 = 42 ; Frame #0
+ --- checkpoint #1 ---
+ 2 : call foo() ; Frame #0
+ ... ; Frame #1, instructions from foo()
+ --- checkpoint #2 ---
+ ... ; Frame #1, instructions from foo()
+ --- checkpoint #3 ---
+ exit ; Frame #1, return from foo()
+ 3 : r1 = r6 ; Frame #0 <- current state
+
+ +-------------------------------+-------------------------------+
+ | Frame #0 | Frame #1 |
+ Checkpoint +-------------------------------+-------------------------------+
+ #0 | r0 | r1-r5 | r6-r9 | fp-8 ... |
+ +-------------------------------+
+ ^ ^ ^ ^
+ | | | |
+ Checkpoint +-------------------------------+
+ #1 | r0 | r1-r5 | r6-r9 | fp-8 ... |
+ +-------------------------------+
+ ^ ^ ^
+ |_______|_______|_______________
+ | | |
+ nil nil | | | nil nil
+ | | | | | | |
+ Checkpoint +-------------------------------+-------------------------------+
+ #2 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
+ +-------------------------------+-------------------------------+
+ ^ ^ ^ ^ ^
+ nil nil | | | | |
+ | | | | | | |
+ Checkpoint +-------------------------------+-------------------------------+
+ #3 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... |
+ +-------------------------------+-------------------------------+
+ ^ ^
+ nil nil | |
+ | | | |
+ Current +-------------------------------+
+ state | r0 | r1-r5 | r6-r9 | fp-8 ... |
+ +-------------------------------+
+ \
+ r6 read mark is propagated via these links
+ all the way up to checkpoint #1.
+ The checkpoint #1 contains a write mark for r6
+ because of instruction (1), thus read propagation
+ does not reach checkpoint #0 (see section below).
+
+Liveness marks tracking
+~~~~~~~~~~~~~~~~~~~~~~~
+
+For each processed instruction, the verifier tracks read and written registers
+and stack slots. The main idea of the algorithm is that read marks propagate
+back along the state parentage chain until they hit a write mark, which 'screens
+off' earlier states from the read. The information about reads is propagated by
+function ``mark_reg_read()`` which could be summarized as follows::
+
+ mark_reg_read(struct bpf_reg_state *state, ...):
+ parent = state->parent
+ while parent:
+ if state->live & REG_LIVE_WRITTEN:
+ break
+ if parent->live & REG_LIVE_READ64:
+ break
+ parent->live |= REG_LIVE_READ64
+ state = parent
+ parent = state->parent
+
+Notes:
+
+* The read marks are applied to the **parent** state while write marks are
+ applied to the **current** state. The write mark on a register or stack slot
+ means that it is updated by some instruction in the straight-line code leading
+ from the parent state to the current state.
+
+* Details about REG_LIVE_READ32 are omitted.
+
+* Function ``propagate_liveness()`` (see section :ref:`read_marks_for_cache_hits`)
+ might override the first parent link. Please refer to the comments in the
+ ``propagate_liveness()`` and ``mark_reg_read()`` source code for further
+ details.
+
+Because stack writes can have different sizes, ``REG_LIVE_WRITTEN`` marks are
+applied conservatively: stack slots are marked as written only if the write
+size corresponds to the size of the register; e.g., see function
+``save_register_state()``.
+
+Consider the following example::
+
+ 0: (*u64)(r10 - 8) = 0 ; define 8 bytes of fp-8
+ --- checkpoint #0 ---
+ 1: (*u32)(r10 - 8) = 1 ; redefine lower 4 bytes
+ 2: r1 = (*u32)(r10 - 8) ; read lower 4 bytes defined at (1)
+ 3: r2 = (*u32)(r10 - 4) ; read upper 4 bytes defined at (0)
+
+As stated above, the write at (1) does not count as ``REG_LIVE_WRITTEN``. Should
+it be otherwise, the algorithm above wouldn't be able to propagate the read mark
+from (3) to checkpoint #0.
+
+Once the ``BPF_EXIT`` instruction is reached, ``update_branch_counts()`` is
+called to update the ``->branches`` counter for each verifier state in a chain
+of parent verifier states. When the ``->branches`` counter reaches zero the
+verifier state becomes a valid entry in a set of cached verifier states.
+
+Each entry of the verifier states cache is post-processed by a function
+``clean_live_states()``. This function marks all registers and stack slots
+without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``.
+Registers/stack slots marked in this way are ignored in function ``stacksafe()``
+called from ``states_equal()`` when a state cache entry is considered for
+equivalence with a current state.
+
+Now it is possible to explain how the example from the beginning of the section
+works::
+
+ 0: call bpf_get_prandom_u32()
+ 1: r1 = 0
+ 2: if r0 == 0 goto +1
+ 3: r0 = 1
+ --- checkpoint[0] ---
+ 4: r0 = r1
+ 5: exit
+
+* At instruction #2 a branching point is reached and the state ``{ r0 == 0, r1 == 0, pc == 4 }``
+ is pushed to the states processing queue (pc stands for program counter).
+
+* At instruction #4:
+
+ * ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``;
+ * ``checkpoint[0].r0`` is marked as written;
+ * ``checkpoint[0].r1`` is marked as read;
+
+* At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed
+ by ``clean_live_states()``. After this processing ``checkpoint[0].r1`` has a
+ read mark and all other registers and stack slots are marked as ``NOT_INIT``
+ or ``STACK_INVALID``.
+
+* The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue
+ and is compared against the cached state ``{ r1 == 0, pc == 4 }``; the states
+ are considered equivalent.
+
+.. _read_marks_for_cache_hits:
+
+Read marks propagation for cache hits
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Another point is the handling of read marks when a previously verified state is
+found in the states cache. Upon a cache hit, the verifier must behave in the
+same way as if the current state had been verified all the way to the program
+exit. This means that all read marks present on the registers and stack slots
+of the cached state must be propagated over the parentage chain of the current
+state. The example below shows why this is important. Function
+``propagate_liveness()`` handles this case.
+
+Consider the following state parentage chain (S is a starting state, A-E are
+derived states, -> arrows show which state is derived from which)::
+
+ r1 read
+ <------------- A[r1] == 0
+ C[r1] == 0
+ S ---> A ---> B ---> exit E[r1] == 1
+ |
+ ` ---> C ---> D
+ |
+ ` ---> E ^
+ |___ suppose all these
+ ^ states are at insn #Y
+ |
+ suppose all these
+ states are at insn #X
+
+* Chain of states ``S -> A -> B -> exit`` is verified first.
+
+* While ``B -> exit`` is verified, register ``r1`` is read and this read mark is
+ propagated up to state ``A``.
+
+* When chain of states ``C -> D`` is verified the state ``D`` turns out to be
+ equivalent to state ``B``.
+
+* The read mark for ``r1`` has to be propagated to state ``C``, otherwise state
+ ``C`` might get mistakenly marked as equivalent to state ``E`` even though
+ values for register ``r1`` differ between ``C`` and ``E``.
+
+Understanding eBPF verifier messages
+====================================
+
+The following are a few examples of invalid eBPF programs and the verifier
+error messages as seen in the log:
+
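+These messages come from the verifier log, which the loading process must
+request. A minimal sketch of doing so through the raw bpf(2) syscall follows
+(``insns`` and ``insn_cnt`` are assumed to describe the program being
+loaded)::
+
+  char log[4096];
+  union bpf_attr attr = {};
+  int prog_fd;
+
+  attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+  attr.insns = (__u64)(unsigned long)insns;
+  attr.insn_cnt = insn_cnt;
+  attr.license = (__u64)(unsigned long)"GPL";
+  attr.log_buf = (__u64)(unsigned long)log;
+  attr.log_size = sizeof(log);
+  attr.log_level = 1;
+
+  prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
+  if (prog_fd < 0)
+          fprintf(stderr, "%s\n", log);   /* verifier messages land here */
+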
+Program with unreachable instructions::
+
+ static struct bpf_insn prog[] = {
+ BPF_EXIT_INSN(),
+ BPF_EXIT_INSN(),
+ };
+
+Error::
+
+ unreachable insn 1
+
+Program that reads an uninitialized register::
+
+ BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+ BPF_EXIT_INSN(),
+
+Error::
+
+ 0: (bf) r0 = r2
+ R2 !read_ok
+
+Program that doesn't initialize R0 before exiting::
+
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
+ BPF_EXIT_INSN(),
+
+Error::
+
+ 0: (bf) r2 = r1
+ 1: (95) exit
+ R0 !read_ok
+
+Program that accesses stack out of bounds::
+
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
+ BPF_EXIT_INSN(),
+
+Error::
+
+ 0: (7a) *(u64 *)(r10 +8) = 0
+ invalid stack off=8 size=8
+
+Program that doesn't initialize the stack before passing its address into a function::
+
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+
+Error::
+
+ 0: (bf) r2 = r10
+ 1: (07) r2 += -8
+ 2: (b7) r1 = 0x0
+ 3: (85) call 1
+ invalid indirect read from stack off -8+0 size 8
+
+Program that uses an invalid map_fd=0 while calling the map_lookup_elem() function::
+
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+
+Error::
+
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 0x0
+ 4: (85) call 1
+ fd 0 is not pointing to valid bpf_map
+
+Program that doesn't check the return value of map_lookup_elem() before
+accessing a map element::
+
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+
+Error::
+
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 0x0
+ 4: (85) call 1
+ 5: (7a) *(u64 *)(r0 +0) = 0
+ R0 invalid mem access 'map_value_or_null'
+
+Program that correctly checks the value returned by map_lookup_elem() for
+NULL, but accesses the memory with incorrect alignment::
+
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
+ BPF_EXIT_INSN(),
+
+Error::
+
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 1
+ 4: (85) call 1
+ 5: (15) if r0 == 0x0 goto pc+1
+ R0=map_ptr R10=fp
+ 6: (7a) *(u64 *)(r0 +4) = 0
+ misaligned access off 4 size 8
+
+Program that correctly checks the value returned by map_lookup_elem() for NULL
+and accesses memory with correct alignment on one side of the 'if' branch, but
+fails to do so on the other side of the 'if' branch::
+
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
+ BPF_EXIT_INSN(),
+
+Error::
+
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 1
+ 4: (85) call 1
+ 5: (15) if r0 == 0x0 goto pc+2
+ R0=map_ptr R10=fp
+ 6: (7a) *(u64 *)(r0 +0) = 0
+ 7: (95) exit
+
+ from 5 to 8: R0=imm0 R10=fp
+ 8: (7a) *(u64 *)(r0 +0) = 1
+ R0 invalid mem access 'imm'
+
+Program that performs a socket lookup then sets the pointer to NULL without
+checking it::
+
+ BPF_MOV64_IMM(BPF_REG_2, 0),
+ BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_MOV64_IMM(BPF_REG_3, 4),
+ BPF_MOV64_IMM(BPF_REG_4, 0),
+ BPF_MOV64_IMM(BPF_REG_5, 0),
+ BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
+ BPF_MOV64_IMM(BPF_REG_0, 0),
+ BPF_EXIT_INSN(),
+
+Error::
+
+ 0: (b7) r2 = 0
+ 1: (63) *(u32 *)(r10 -8) = r2
+ 2: (bf) r2 = r10
+ 3: (07) r2 += -8
+ 4: (b7) r3 = 4
+ 5: (b7) r4 = 0
+ 6: (b7) r5 = 0
+ 7: (85) call bpf_sk_lookup_tcp#65
+ 8: (b7) r0 = 0
+ 9: (95) exit
+ Unreleased reference id=1, alloc_insn=7
+
+Program that performs a socket lookup but does not NULL-check the returned
+value::
+
+ BPF_MOV64_IMM(BPF_REG_2, 0),
+ BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_MOV64_IMM(BPF_REG_3, 4),
+ BPF_MOV64_IMM(BPF_REG_4, 0),
+ BPF_MOV64_IMM(BPF_REG_5, 0),
+ BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
+ BPF_EXIT_INSN(),
+
+Error::
+
+ 0: (b7) r2 = 0
+ 1: (63) *(u32 *)(r10 -8) = r2
+ 2: (bf) r2 = r10
+ 3: (07) r2 += -8
+ 4: (b7) r3 = 4
+ 5: (b7) r4 = 0
+ 6: (b7) r5 = 0
+ 7: (85) call bpf_sk_lookup_tcp#65
+ 8: (95) exit
+ Unreleased reference id=1, alloc_insn=7
diff --git a/Documentation/cdrom/cdrom-standard.rst b/Documentation/cdrom/cdrom-standard.rst
index 5845960ca382..7964fe134277 100644
--- a/Documentation/cdrom/cdrom-standard.rst
+++ b/Documentation/cdrom/cdrom-standard.rst
@@ -218,7 +218,6 @@ current *struct* is::
int (*tray_move)(struct cdrom_device_info *, int);
int (*lock_door)(struct cdrom_device_info *, int);
int (*select_speed)(struct cdrom_device_info *, int);
- int (*select_disc)(struct cdrom_device_info *, int);
int (*get_last_session) (struct cdrom_device_info *,
struct cdrom_multisession *);
int (*get_mcn)(struct cdrom_device_info *, struct cdrom_mcn *);
@@ -421,15 +420,6 @@ return value indicates an error.
::
- int select_disc(struct cdrom_device_info *cdi, int number)
-
-If the drive can store multiple discs (a juke-box) this function
-will perform disc selection. It should return the number of the
-selected disc on success, a negative value on error. Currently, only
-the ide-cd driver supports this functionality.
-
-::
-
int get_last_session(struct cdrom_device_info *cdi,
struct cdrom_multisession *ms_info)
@@ -907,6 +897,17 @@ commands can be identified by the underscores in their names.
specifies the slot for which the information is given. The special
value *CDSL_CURRENT* requests that information about the currently
selected slot be returned.
+`CDROM_TIMED_MEDIA_CHANGE`
+ Checks whether the disc has been changed since a user supplied time
+ and returns the time of the last disc change.
+
+ *arg* is a pointer to a *cdrom_timed_media_change_info* struct.
+ *arg->last_media_change* may be set by the calling code to the
+ timestamp of the last media change known to the caller.
+ Upon successful return, this ioctl call will set
+ *arg->last_media_change* to the latest media change timestamp (in ms)
+ known by the kernel/driver and set *arg->has_changed* to 1 if
+ that timestamp is more recent than the timestamp set by the caller.
`CDROM_DRIVE_STATUS`
Returns the status of the drive by a call to
*drive_status()*. Return values are defined in cdrom_drive_status_.
diff --git a/Documentation/cdrom/ide-cd.rst b/Documentation/cdrom/ide-cd.rst
deleted file mode 100644
index bdccb74fc92d..000000000000
--- a/Documentation/cdrom/ide-cd.rst
+++ /dev/null
@@ -1,538 +0,0 @@
-IDE-CD driver documentation
-===========================
-
-:Originally by: scott snyder <snyder@fnald0.fnal.gov> (19 May 1996)
-:Carrying on the torch is: Erik Andersen <andersee@debian.org>
-:New maintainers (19 Oct 1998): Jens Axboe <axboe@image.dk>
-
-1. Introduction
----------------
-
-The ide-cd driver should work with all ATAPI ver 1.2 to ATAPI 2.6 compliant
-CDROM drives which attach to an IDE interface. Note that some CDROM vendors
-(including Mitsumi, Sony, Creative, Aztech, and Goldstar) have made
-both ATAPI-compliant drives and drives which use a proprietary
-interface. If your drive uses one of those proprietary interfaces,
-this driver will not work with it (but one of the other CDROM drivers
-probably will). This driver will not work with `ATAPI` drives which
-attach to the parallel port. In addition, there is at least one drive
-(CyCDROM CR520ie) which attaches to the IDE port but is not ATAPI;
-this driver will not work with drives like that either (but see the
-aztcd driver).
-
-This driver provides the following features:
-
- - Reading from data tracks, and mounting ISO 9660 filesystems.
-
- - Playing audio tracks. Most of the CDROM player programs floating
- around should work; I usually use Workman.
-
- - Multisession support.
-
- - On drives which support it, reading digital audio data directly
- from audio tracks. The program cdda2wav can be used for this.
- Note, however, that only some drives actually support this.
-
- - There is now support for CDROM changers which comply with the
- ATAPI 2.6 draft standard (such as the NEC CDR-251). This additional
- functionality includes a function call to query which slot is the
- currently selected slot, a function call to query which slots contain
- CDs, etc. A sample program which demonstrates this functionality is
- appended to the end of this file. The Sanyo 3-disc changer
- (which does not conform to the standard) is also now supported.
- Please note the driver refers to the first CD as slot # 0.
-
-
-2. Installation
----------------
-
-0. The ide-cd relies on the ide disk driver. See
- Documentation/ide/ide.rst for up-to-date information on the ide
- driver.
-
-1. Make sure that the ide and ide-cd drivers are compiled into the
- kernel you're using. When configuring the kernel, in the section
- entitled "Floppy, IDE, and other block devices", say either `Y`
- (which will compile the support directly into the kernel) or `M`
- (to compile support as a module which can be loaded and unloaded)
- to the options::
-
- ATA/ATAPI/MFM/RLL support
- Include IDE/ATAPI CDROM support
-
- Depending on what type of IDE interface you have, you may need to
- specify additional configuration options. See
- Documentation/ide/ide.rst.
-
-2. You should also ensure that the iso9660 filesystem is either
- compiled into the kernel or available as a loadable module. You
- can see if a filesystem is known to the kernel by catting
- /proc/filesystems.
-
-3. The CDROM drive should be connected to the host on an IDE
- interface. Each interface on a system is defined by an I/O port
- address and an IRQ number, the standard assignments being
- 0x1f0 and 14 for the primary interface and 0x170 and 15 for the
- secondary interface. Each interface can control up to two devices,
- where each device can be a hard drive, a CDROM drive, a floppy drive,
- or a tape drive. The two devices on an interface are called `master`
- and `slave`; this is usually selectable via a jumper on the drive.
-
- Linux names these devices as follows. The master and slave devices
- on the primary IDE interface are called `hda` and `hdb`,
- respectively. The drives on the secondary interface are called
- `hdc` and `hdd`. (Interfaces at other locations get other letters
- in the third position; see Documentation/ide/ide.rst.)
-
- If you want your CDROM drive to be found automatically by the
- driver, you should make sure your IDE interface uses either the
- primary or secondary addresses mentioned above. In addition, if
- the CDROM drive is the only device on the IDE interface, it should
- be jumpered as `master`. (If for some reason you cannot configure
- your system in this manner, you can probably still use the driver.
- You may have to pass extra configuration information to the kernel
- when you boot, however. See Documentation/ide/ide.rst for more
- information.)
-
-4. Boot the system. If the drive is recognized, you should see a
- message which looks like::
-
- hdb: NEC CD-ROM DRIVE:260, ATAPI CDROM drive
-
- If you do not see this, see section 5 below.
-
-5. You may want to create a symbolic link /dev/cdrom pointing to the
- actual device. You can do this with the command::
-
- ln -s /dev/hdX /dev/cdrom
-
- where X should be replaced by the letter indicating where your
- drive is installed.
-
-6. You should be able to see any error messages from the driver with
- the `dmesg` command.
-
-
-3. Basic usage
---------------
-
-An ISO 9660 CDROM can be mounted by putting the disc in the drive and
-typing (as root)::
-
- mount -t iso9660 /dev/cdrom /mnt/cdrom
-
-where it is assumed that /dev/cdrom is a link pointing to the actual
-device (as described in step 5 of the last section) and /mnt/cdrom is
-an empty directory. You should now be able to see the contents of the
-CDROM under the /mnt/cdrom directory. If you want to eject the CDROM,
-you must first dismount it with a command like::
-
- umount /mnt/cdrom
-
-Note that audio CDs cannot be mounted.
-
-Some distributions set up /etc/fstab to always try to mount a CDROM
-filesystem on bootup. It is not required to mount the CDROM in this
-manner, though, and it may be a nuisance if you change CDROMs often.
-You should feel free to remove the cdrom line from /etc/fstab and
-mount CDROMs manually if that suits you better.
-
-Multisession and photocd discs should work with no special handling.
-The hpcdtoppm package (ftp.gwdg.de:/pub/linux/hpcdtoppm/) may be
-useful for reading photocds.
-
-To play an audio CD, you should first unmount and remove any data
-CDROM. Any of the CDROM player programs should then work (workman,
-workbone, cdplayer, etc.).
-
-On a few drives, you can read digital audio directly using a program
-such as cdda2wav. The only types of drive which I've heard support
-this are Sony and Toshiba drives. You will get errors if you try to
-use this function on a drive which does not support it.
-
-For supported changers, you can use the `cdchange` program (appended to
-the end of this file) to switch between changer slots. Note that the
-drive should be unmounted before attempting this. The program takes
-two arguments: the CDROM device, and the slot number to which you wish
-to change. If the slot number is -1, the drive is unloaded.
-
-
-4. Common problems
-------------------
-
-This section discusses some common problems encountered when trying to
-use the driver, and some possible solutions. Note that if you are
-experiencing problems, you should probably also review
-Documentation/ide/ide.rst for current information about the underlying
-IDE support code. Some of these items apply only to earlier versions
-of the driver, but are mentioned here for completeness.
-
-In most cases, you should probably check with `dmesg` for any errors
-from the driver.
-
-a. Drive is not detected during booting.
-
- - Review the configuration instructions above and in
- Documentation/ide/ide.rst, and check how your hardware is
- configured.
-
- - If your drive is the only device on an IDE interface, it should
- be jumpered as master, if at all possible.
-
- - If your IDE interface is not at the standard addresses of 0x170
- or 0x1f0, you'll need to explicitly inform the driver using a
- lilo option. See Documentation/ide/ide.rst. (This feature was
- added around kernel version 1.3.30.)
-
- - If the autoprobing is not finding your drive, you can tell the
- driver to assume that one exists by using a lilo option of the
- form `hdX=cdrom`, where X is the drive letter corresponding to
- where your drive is installed. Note that if you do this and you
- see a boot message like::
-
- hdX: ATAPI cdrom (?)
-
- this does _not_ mean that the driver has successfully detected
- the drive; rather, it means that the driver has not detected a
- drive, but is assuming there's one there anyway because you told
- it so. If you actually try to do I/O to a drive defined at a
- nonexistent or nonresponding I/O address, you'll probably get
- errors with a status value of 0xff.
-
- - Some IDE adapters require a nonstandard initialization sequence
- before they'll function properly. (If this is the case, there
- will often be a separate MS-DOS driver just for the controller.)
- IDE interfaces on sound cards often fall into this category.
-
- Support for some interfaces needing extra initialization is
- provided in later 1.3.x kernels. You may need to turn on
- additional kernel configuration options to get them to work;
- see Documentation/ide/ide.rst.
-
- Even if support is not available for your interface, you may be
- able to get it to work with the following procedure. First boot
- MS-DOS and load the appropriate drivers. Then warm-boot linux
- (i.e., without powering off). If this works, it can be automated
- by running loadlin from the MS-DOS autoexec.
-
-
-b. Timeout/IRQ errors.
-
- - If you always get timeout errors, interrupts from the drive are
- probably not making it to the host.
-
- - IRQ problems may also be indicated by the message
- `IRQ probe failed (<n>)` while booting. If <n> is zero, that
- means that the system did not see an interrupt from the drive when
- it was expecting one (on any feasible IRQ). If <n> is negative,
- that means the system saw interrupts on multiple IRQ lines, when
- it was expecting to receive just one from the CDROM drive.
-
- - Double-check your hardware configuration to make sure that the IRQ
- number of your IDE interface matches what the driver expects.
- (The usual assignments are 14 for the primary (0x1f0) interface
- and 15 for the secondary (0x170) interface.) Also be sure that
- you don't have some other hardware which might be conflicting with
- the IRQ you're using. Also check the BIOS setup for your system;
- some have the ability to disable individual IRQ levels, and I've
- had one report of a system which was shipped with IRQ 15 disabled
- by default.
-
- - Note that many MS-DOS CDROM drivers will still function even if
- there are hardware problems with the interrupt setup; they
- apparently don't use interrupts.
-
- - If you own a Pioneer DR-A24X, you _will_ get nasty error messages
- on boot such as "irq timeout: status=0x50 { DriveReady SeekComplete }"
- The Pioneer DR-A24X CDROM drives are fairly popular these days.
- Unfortunately, these drives seem to become very confused when we perform
- the standard Linux ATA disk drive probe. If you own one of these drives,
- you can bypass the ATA probing which confuses these CDROM drives, by
- adding `append="hdX=noprobe hdX=cdrom"` to your lilo.conf file and running
- lilo (again where X is the drive letter corresponding to where your drive
- is installed.)
-
-c. System hangups.
-
- - If the system locks up when you try to access the CDROM, the most
- likely cause is that you have a buggy IDE adapter which doesn't
- properly handle simultaneous transactions on multiple interfaces.
- The most notorious of these is the CMD640B chip. This problem can
- be worked around by specifying the `serialize` option when
- booting. Recent kernels should be able to detect the need for
- this automatically in most cases, but the detection is not
- foolproof. See Documentation/ide/ide.rst for more information
- about the `serialize` option and the CMD640B.
-
- - Note that many MS-DOS CDROM drivers will work with such buggy
- hardware, apparently because they never attempt to overlap CDROM
- operations with other disk activity.
-
-
-d. Can't mount a CDROM.
-
- - If you get errors from mount, it may help to check `dmesg` to see
- if there are any more specific errors from the driver or from the
- filesystem.
-
- - Make sure there's a CDROM loaded in the drive, and that's it's an
- ISO 9660 disc. You can't mount an audio CD.
-
- - With the CDROM in the drive and unmounted, try something like::
-
- cat /dev/cdrom | od | more
-
- If you see a dump, then the drive and driver are probably working
- OK, and the problem is at the filesystem level (i.e., the CDROM is
- not ISO 9660 or has errors in the filesystem structure).
-
- - If you see `not a block device` errors, check that the definitions
- of the device special files are correct. They should be as
- follows::
-
- brw-rw---- 1 root disk 3, 0 Nov 11 18:48 /dev/hda
- brw-rw---- 1 root disk 3, 64 Nov 11 18:48 /dev/hdb
- brw-rw---- 1 root disk 22, 0 Nov 11 18:48 /dev/hdc
- brw-rw---- 1 root disk 22, 64 Nov 11 18:48 /dev/hdd
-
- Some early Slackware releases had these defined incorrectly. If
- these are wrong, you can remake them by running the script
- scripts/MAKEDEV.ide. (You may have to make it executable
- with chmod first.)
-
- If you have a /dev/cdrom symbolic link, check that it is pointing
- to the correct device file.
-
- If you hear people talking of the devices `hd1a` and `hd1b`, these
- were old names for what are now called hdc and hdd. Those names
- should be considered obsolete.
-
- - If mount is complaining that the iso9660 filesystem is not
- available, but you know it is (check /proc/filesystems), you
- probably need a newer version of mount. Early versions would not
- always give meaningful error messages.
-
-
-e. Directory listings are unpredictably truncated, and `dmesg` shows
- `buffer botch` error messages from the driver.
-
- - There was a bug in the version of the driver in 1.2.x kernels
- which could cause this. It was fixed in 1.3.0. If you can't
- upgrade, you can probably work around the problem by specifying a
- blocksize of 2048 when mounting. (Note that you won't be able to
- directly execute binaries off the CDROM in that case.)
-
- If you see this in kernels later than 1.3.0, please report it as a
- bug.
-
-
-f. Data corruption.
-
- - Random data corruption was occasionally observed with the Hitachi
- CDR-7730 CDROM. If you experience data corruption, using "hdx=slow"
- as a command line parameter may work around the problem, at the
- expense of low system performance.
-
-
-5. cdchange.c
--------------
-
-::
-
- /*
- * cdchange.c [-v] <device> [<slot>]
- *
- * This loads a CDROM from a specified slot in a changer, and displays
- * information about the changer status. The drive should be unmounted before
- * using this program.
- *
- * Changer information is displayed if either the -v flag is specified
- * or no slot was specified.
- *
- * Based on code originally from Gerhard Zuber <zuber@berlin.snafu.de>.
- * Changer status information, and rewrite for the new Uniform CDROM driver
- * interface by Erik Andersen <andersee@debian.org>.
- */
-
- #include <stdio.h>
- #include <stdlib.h>
- #include <errno.h>
- #include <string.h>
- #include <unistd.h>
- #include <fcntl.h>
- #include <sys/ioctl.h>
- #include <linux/cdrom.h>
-
-
- int
- main (int argc, char **argv)
- {
- char *program;
- char *device;
- int fd; /* file descriptor for CD-ROM device */
- int status; /* return status for system calls */
- int verbose = 0;
- int slot=-1, x_slot;
- int total_slots_available;
-
- program = argv[0];
-
- ++argv;
- --argc;
-
- if (argc < 1 || argc > 3) {
- fprintf (stderr, "usage: %s [-v] <device> [<slot>]\n",
- program);
- fprintf (stderr, " Slots are numbered 1 -- n.\n");
- exit (1);
- }
-
- if (strcmp (argv[0], "-v") == 0) {
- verbose = 1;
- ++argv;
- --argc;
- }
-
- device = argv[0];
-
- if (argc == 2)
- slot = atoi (argv[1]) - 1;
-
- /* open device */
- fd = open(device, O_RDONLY | O_NONBLOCK);
- if (fd < 0) {
- fprintf (stderr, "%s: open failed for `%s`: %s\n",
- program, device, strerror (errno));
- exit (1);
- }
-
- /* Check CD player status */
- total_slots_available = ioctl (fd, CDROM_CHANGER_NSLOTS);
- if (total_slots_available <= 1 ) {
- fprintf (stderr, "%s: Device `%s` is not an ATAPI "
- "compliant CD changer.\n", program, device);
- exit (1);
- }
-
- if (slot >= 0) {
- if (slot >= total_slots_available) {
- fprintf (stderr, "Bad slot number. "
- "Should be 1 -- %d.\n",
- total_slots_available);
- exit (1);
- }
-
- /* load */
- slot=ioctl (fd, CDROM_SELECT_DISC, slot);
- if (slot<0) {
- fflush(stdout);
- perror ("CDROM_SELECT_DISC ");
- exit(1);
- }
- }
-
- if (slot < 0 || verbose) {
-
- status=ioctl (fd, CDROM_SELECT_DISC, CDSL_CURRENT);
- if (status<0) {
- fflush(stdout);
- perror (" CDROM_SELECT_DISC");
- exit(1);
- }
- slot=status;
-
- printf ("Current slot: %d\n", slot+1);
- printf ("Total slots available: %d\n",
- total_slots_available);
-
- printf ("Drive status: ");
- status = ioctl (fd, CDROM_DRIVE_STATUS, CDSL_CURRENT);
- if (status<0) {
- perror(" CDROM_DRIVE_STATUS");
- } else switch(status) {
- case CDS_DISC_OK:
- printf ("Ready.\n");
- break;
- case CDS_TRAY_OPEN:
- printf ("Tray Open.\n");
- break;
- case CDS_DRIVE_NOT_READY:
- printf ("Drive Not Ready.\n");
- break;
- default:
- printf ("This Should not happen!\n");
- break;
- }
-
- for (x_slot=0; x_slot<total_slots_available; x_slot++) {
- printf ("Slot %2d: ", x_slot+1);
- status = ioctl (fd, CDROM_DRIVE_STATUS, x_slot);
- if (status<0) {
- perror(" CDROM_DRIVE_STATUS");
- } else switch(status) {
- case CDS_DISC_OK:
- printf ("Disc present.");
- break;
- case CDS_NO_DISC:
- printf ("Empty slot.");
- break;
- case CDS_TRAY_OPEN:
- printf ("CD-ROM tray open.\n");
- break;
- case CDS_DRIVE_NOT_READY:
- printf ("CD-ROM drive not ready.\n");
- break;
- case CDS_NO_INFO:
- printf ("No Information available.");
- break;
- default:
- printf ("This Should not happen!\n");
- break;
- }
- if (slot == x_slot) {
- status = ioctl (fd, CDROM_DISC_STATUS);
- if (status<0) {
- perror(" CDROM_DISC_STATUS");
- }
- switch (status) {
- case CDS_AUDIO:
- printf ("\tAudio disc.\t");
- break;
- case CDS_DATA_1:
- case CDS_DATA_2:
- printf ("\tData disc type %d.\t", status-CDS_DATA_1+1);
- break;
- case CDS_XA_2_1:
- case CDS_XA_2_2:
- printf ("\tXA data disc type %d.\t", status-CDS_XA_2_1+1);
- break;
- default:
- printf ("\tUnknown disc type 0x%x!\t", status);
- break;
- }
- }
- status = ioctl (fd, CDROM_MEDIA_CHANGED, x_slot);
- if (status<0) {
- perror(" CDROM_MEDIA_CHANGED");
- }
- switch (status) {
- case 1:
- printf ("Changed.\n");
- break;
- default:
- printf ("\n");
- break;
- }
- }
- }
-
- /* close device */
- status = close (fd);
- if (status != 0) {
- fprintf (stderr, "%s: close failed for `%s`: %s\n",
- program, device, strerror (errno));
- exit (1);
- }
-
- exit (0);
- }
diff --git a/Documentation/cdrom/index.rst b/Documentation/cdrom/index.rst
index 338ad5f94e7c..e87a8785bc1a 100644
--- a/Documentation/cdrom/index.rst
+++ b/Documentation/cdrom/index.rst
@@ -8,7 +8,6 @@ cdrom
:maxdepth: 1
cdrom-standard
- ide-cd
packet-writing
.. only:: subproject and html
diff --git a/Documentation/cdrom/packet-writing.rst b/Documentation/cdrom/packet-writing.rst
index c5c957195a5a..43db58c50d29 100644
--- a/Documentation/cdrom/packet-writing.rst
+++ b/Documentation/cdrom/packet-writing.rst
@@ -11,7 +11,7 @@ Getting started quick
- Compile and install kernel and modules, reboot.
- You need the udftools package (pktsetup, mkudffs, cdrwtool).
- Download from http://sourceforge.net/projects/linux-udf/
+ Download from https://github.com/pali/udftools
- Grab a new CD-RW disc and format it (assuming CD-RW is hdc, substitute
as appropriate)::
@@ -102,7 +102,7 @@ Using the pktcdvd sysfs interface
Since Linux 2.6.20, the pktcdvd module has a sysfs interface
and can be controlled by it. For example the "pktcdvd" tool uses
-this interface. (see http://tom.ist-im-web.de/download/pktcdvd )
+this interface. (see http://tom.ist-im-web.de/linux/software/pktcdvd )
"pktcdvd" works similar to "pktsetup", e.g.::
diff --git a/Documentation/conf.py b/Documentation/conf.py
index 948a97d6387d..37314afd1ac8 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -15,10 +15,28 @@
import sys
import os
import sphinx
+import shutil
+
+# helper
+# ------
+
+def have_command(cmd):
+ """Search ``cmd`` in the ``PATH`` environment.
+
+ If found, return True.
+ If not found, return False.
+ """
+ return shutil.which(cmd) is not None
# Get Sphinx version
major, minor, patch = sphinx.version_info[:3]
+#
+# Warn about older versions that we don't want to support for much
+# longer.
+#
+if (major < 2) or (major == 2 and minor < 4):
+ print('WARNING: support for Sphinx < 2.4 will be removed soon.')
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
@@ -86,6 +104,7 @@ if major >= 3:
"__used",
"__weak",
"noinline",
+ "__fix_address",
# include/linux/memblock.h:
"__init_memblock",
@@ -97,6 +116,9 @@ if major >= 3:
# include/linux/linkage.h:
"asmlinkage",
+
+ # include/linux/btf.h
+ "__bpf_kfunc",
]
else:
@@ -106,10 +128,35 @@ else:
autosectionlabel_prefix_document = True
autosectionlabel_maxdepth = 2
-extensions.append("sphinx.ext.imgmath")
+# Load math renderer:
+# For html builder, load imgmath only when its dependencies are met.
+# mathjax is the default math renderer since Sphinx 1.8.
+have_latex = have_command('latex')
+have_dvipng = have_command('dvipng')
+load_imgmath = have_latex and have_dvipng
+
+# Respect SPHINX_IMGMATH (for html docs only)
+if 'SPHINX_IMGMATH' in os.environ:
+ env_sphinx_imgmath = os.environ['SPHINX_IMGMATH']
+ if 'yes' in env_sphinx_imgmath:
+ load_imgmath = True
+ elif 'no' in env_sphinx_imgmath:
+ load_imgmath = False
+ else:
+ sys.stderr.write("Unknown env SPHINX_IMGMATH=%s ignored.\n" % env_sphinx_imgmath)
+
+# Always load imgmath for Sphinx <1.8 or for epub docs
+load_imgmath = (load_imgmath or (major == 1 and minor < 8)
+ or 'epub' in sys.argv)
+
+if load_imgmath:
+ extensions.append("sphinx.ext.imgmath")
+ math_renderer = 'imgmath'
+else:
+ math_renderer = 'mathjax'
# Add any paths that contain templates here, relative to this directory.
-templates_path = ['_templates']
+templates_path = ['sphinx/templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
@@ -156,12 +203,30 @@ finally:
else:
version = release = "unknown version"
+#
+# HACK: there seems to be no easy way for us to get at the version and
+# release information passed in from the makefile...so go pawing through the
+# command-line options and find it for ourselves.
+#
+def get_cline_version():
+ c_version = c_release = ''
+ for arg in sys.argv:
+ if arg.startswith('version='):
+ c_version = arg[8:]
+ elif arg.startswith('release='):
+ c_release = arg[8:]
+ if c_version:
+ if c_release:
+ return c_version + '-' + c_release
+ return c_version
+ return version # Whatever we came up with before
+
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
-language = None
+language = 'en'
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
@@ -208,112 +273,88 @@ highlight_language = 'none'
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
-# The Read the Docs theme is available from
-# - https://github.com/snide/sphinx_rtd_theme
-# - https://pypi.python.org/pypi/sphinx_rtd_theme
-# - python-sphinx-rtd-theme package (on Debian)
-try:
- import sphinx_rtd_theme
- html_theme = 'sphinx_rtd_theme'
- html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
-except ImportError:
- sys.stderr.write('Warning: The Sphinx \'sphinx_rtd_theme\' HTML theme was not found. Make sure you have the theme installed to produce pretty HTML output. Falling back to the default theme.\n')
-
-# Theme options are theme-specific and customize the look and feel of a theme
-# further. For a list of options available for each theme, see the
-# documentation.
-#html_theme_options = {}
-
-# Add any paths that contain custom themes here, relative to this directory.
-#html_theme_path = []
-
-# The name for this set of Sphinx documents. If None, it defaults to
-# "<project> v<release> documentation".
-#html_title = None
-
-# A shorter title for the navigation bar. Default is the same as html_title.
-#html_short_title = None
+# Default theme
+html_theme = 'alabaster'
+html_css_files = []
-# The name of an image file (relative to this directory) to place at the top
-# of the sidebar.
-#html_logo = None
+if "DOCS_THEME" in os.environ:
+ html_theme = os.environ["DOCS_THEME"]
-# The name of an image file (within the static path) to use as favicon of the
-# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
-# pixels large.
-#html_favicon = None
+if html_theme == 'sphinx_rtd_theme' or html_theme == 'sphinx_rtd_dark_mode':
+ # Read the Docs theme
+ try:
+ import sphinx_rtd_theme
+ html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
-# Add any paths that contain custom static files (such as style sheets) here,
-# relative to this directory. They are copied after the builtin static files,
-# so a file named "default.css" will overwrite the builtin "default.css".
-
-html_static_path = ['sphinx-static']
-
-html_context = {
- 'css_files': [
- '_static/theme_overrides.css',
- ],
-}
-
-# Add any extra paths that contain custom files (such as robots.txt or
-# .htaccess) here, relative to this directory. These files are copied
-# directly to the root of the documentation.
-#html_extra_path = []
-
-# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
-# using the given strftime format.
-#html_last_updated_fmt = '%b %d, %Y'
+ # Add any paths that contain custom static files (such as style sheets) here,
+ # relative to this directory. They are copied after the builtin static files,
+ # so a file named "default.css" will overwrite the builtin "default.css".
+ html_css_files = [
+ 'theme_overrides.css',
+ ]
-# If true, SmartyPants will be used to convert quotes and dashes to
-# typographically correct entities.
-html_use_smartypants = False
+ # Read the Docs dark mode override theme
+ if html_theme == 'sphinx_rtd_dark_mode':
+ try:
+ import sphinx_rtd_dark_mode
+ extensions.append('sphinx_rtd_dark_mode')
+ except ImportError:
+ html_theme = 'sphinx_rtd_theme'
-# Custom sidebar templates, maps document names to template names.
-#html_sidebars = {}
+ if html_theme == 'sphinx_rtd_theme':
+ # Add color-specific RTD normal mode
+ html_css_files.append('theme_rtd_colors.css')
-# Additional templates that should be rendered to pages, maps page names to
-# template names.
-#html_additional_pages = {}
+ html_theme_options = {
+ 'navigation_depth': -1,
+ }
-# If false, no module index is generated.
-#html_domain_indices = True
+ except ImportError:
+ html_theme = 'alabaster'
-# If false, no index is generated.
-#html_use_index = True
+if "DOCS_CSS" in os.environ:
+ css = os.environ["DOCS_CSS"].split(" ")
-# If true, the index is split into individual pages for each letter.
-#html_split_index = False
+ for l in css:
+ html_css_files.append(l)
-# If true, links to the reST sources are added to the pages.
-#html_show_sourcelink = True
-
-# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
-#html_show_sphinx = True
+if major <= 1 and minor < 8:
+ html_context = {
+ 'css_files': [],
+ }
-# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
-#html_show_copyright = True
+ for l in html_css_files:
+ html_context['css_files'].append('_static/' + l)
+
+if html_theme == 'alabaster':
+ html_theme_options = {
+ 'description': get_cline_version(),
+ 'page_width': '65em',
+ 'sidebar_width': '15em',
+ 'fixed_sidebar': 'true',
+ 'font_size': 'inherit',
+ 'font_family': 'serif',
+ }
-# If true, an OpenSearch description file will be output, and all pages will
-# contain a <link> tag referring to it. The value of this option must be the
-# base URL from which the finished HTML is served.
-#html_use_opensearch = ''
+sys.stderr.write("Using %s theme\n" % html_theme)
-# This is the file name suffix for HTML files (e.g. ".xhtml").
-#html_file_suffix = None
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['sphinx-static']
-# Language to be used for generating the HTML full-text search index.
-# Sphinx supports the following languages:
-# 'da', 'de', 'en', 'es', 'fi', 'fr', 'h', 'it', 'ja'
-# 'nl', 'no', 'pt', 'ro', 'r', 'sv', 'tr'
-#html_search_language = 'en'
+# If true, Docutils "smart quotes" will be used to convert quotes and dashes
+# to typographically correct entities. This will convert "--" to "—",
+# which is not always what we want, so disable it.
+smartquotes = False
-# A dictionary with options for the search language support, empty by default.
-# Now only 'ja' uses this config value
-#html_search_options = {'type': 'default'}
+# Custom sidebar templates, maps document names to template names.
+# Note that the RTD theme ignores this
+html_sidebars = { '**': ['searchbox.html', 'kernel-toc.html', 'sourcelink.html']}
-# The name of a javascript file (relative to the configuration directory) that
-# implements a search results scorer. If empty, the default will be used.
-#html_search_scorer = 'scorer.js'
+# about.html is available for alabaster theme. Add it at the front.
+if html_theme == 'alabaster':
+ html_sidebars['**'].insert(0, 'about.html')
# Output file base name for HTML help builder.
htmlhelp_basename = 'TheLinuxKerneldoc'
@@ -346,132 +387,25 @@ latex_elements = {
# Additional stuff for the LaTeX preamble.
'preamble': '''
- % Prevent column squeezing of tabulary.
- \\setlength{\\tymin}{20em}
% Use some font with UTF-8 support with XeLaTeX
\\usepackage{fontspec}
\\setsansfont{DejaVu Sans}
\\setromanfont{DejaVu Serif}
\\setmonofont{DejaVu Sans Mono}
- ''',
+ ''',
}
-# Translations have Asian (CJK) characters which are only displayed if
-# xeCJK is used
-
-latex_elements['preamble'] += '''
- \\IfFontExistsTF{Noto Sans CJK SC}{
- % This is needed for translations
- \\usepackage{xeCJK}
- \\IfFontExistsTF{Noto Serif CJK SC}{
- \\setCJKmainfont{Noto Serif CJK SC}[AutoFakeSlant]
- }{
- \\setCJKmainfont{Noto Sans CJK SC}[AutoFakeSlant]
- }
- \\setCJKsansfont{Noto Sans CJK SC}[AutoFakeSlant]
- \\setCJKmonofont{Noto Sans Mono CJK SC}[AutoFakeSlant]
- % CJK Language-specific font choices
- \\IfFontExistsTF{Noto Serif CJK SC}{
- \\newCJKfontfamily[SCmain]\\scmain{Noto Serif CJK SC}[AutoFakeSlant]
- \\newCJKfontfamily[SCserif]\\scserif{Noto Serif CJK SC}[AutoFakeSlant]
- }{
- \\newCJKfontfamily[SCmain]\\scmain{Noto Sans CJK SC}[AutoFakeSlant]
- \\newCJKfontfamily[SCserif]\\scserif{Noto Sans CJK SC}[AutoFakeSlant]
- }
- \\newCJKfontfamily[SCsans]\\scsans{Noto Sans CJK SC}[AutoFakeSlant]
- \\newCJKfontfamily[SCmono]\\scmono{Noto Sans Mono CJK SC}[AutoFakeSlant]
- \\IfFontExistsTF{Noto Serif CJK TC}{
- \\newCJKfontfamily[TCmain]\\tcmain{Noto Serif CJK TC}[AutoFakeSlant]
- \\newCJKfontfamily[TCserif]\\tcserif{Noto Serif CJK TC}[AutoFakeSlant]
- }{
- \\newCJKfontfamily[TCmain]\\tcmain{Noto Sans CJK TC}[AutoFakeSlant]
- \\newCJKfontfamily[TCserif]\\tcserif{Noto Sans CJK TC}[AutoFakeSlant]
- }
- \\newCJKfontfamily[TCsans]\\tcsans{Noto Sans CJK TC}[AutoFakeSlant]
- \\newCJKfontfamily[TCmono]\\tcmono{Noto Sans Mono CJK TC}[AutoFakeSlant]
- \\IfFontExistsTF{Noto Serif CJK KR}{
- \\newCJKfontfamily[KRmain]\\krmain{Noto Serif CJK KR}[AutoFakeSlant]
- \\newCJKfontfamily[KRserif]\\krserif{Noto Serif CJK KR}[AutoFakeSlant]
- }{
- \\newCJKfontfamily[KRmain]\\krmain{Noto Sans CJK KR}[AutoFakeSlant]
- \\newCJKfontfamily[KRserif]\\krserif{Noto Sans CJK KR}[AutoFakeSlant]
- }
- \\newCJKfontfamily[KRsans]\\krsans{Noto Sans CJK KR}[AutoFakeSlant]
- \\newCJKfontfamily[KRmono]\\krmono{Noto Sans Mono CJK KR}[AutoFakeSlant]
- \\IfFontExistsTF{Noto Serif CJK JP}{
- \\newCJKfontfamily[JPmain]\\jpmain{Noto Serif CJK JP}[AutoFakeSlant]
- \\newCJKfontfamily[JPserif]\\jpserif{Noto Serif CJK JP}[AutoFakeSlant]
- }{
- \\newCJKfontfamily[JPmain]\\jpmain{Noto Sans CJK JP}[AutoFakeSlant]
- \\newCJKfontfamily[JPserif]\\jpserif{Noto Sans CJK JP}[AutoFakeSlant]
- }
- \\newCJKfontfamily[JPsans]\\jpsans{Noto Sans CJK JP}[AutoFakeSlant]
- \\newCJKfontfamily[JPmono]\\jpmono{Noto Sans Mono CJK JP}[AutoFakeSlant]
- % Dummy commands for Sphinx < 2.3 (no 'extrapackages' support)
- \\providecommand{\\onehalfspacing}{}
- \\providecommand{\\singlespacing}{}
- % Define custom macros to on/off CJK
- \\newcommand{\\kerneldocCJKon}{\\makexeCJKactive\\onehalfspacing}
- \\newcommand{\\kerneldocCJKoff}{\\makexeCJKinactive\\singlespacing}
- \\newcommand{\\kerneldocBeginSC}{%
- \\begingroup%
- \\scmain%
- }
- \\newcommand{\\kerneldocEndSC}{\\endgroup}
- \\newcommand{\\kerneldocBeginTC}{%
- \\begingroup%
- \\tcmain%
- \\renewcommand{\\CJKrmdefault}{TCserif}%
- \\renewcommand{\\CJKsfdefault}{TCsans}%
- \\renewcommand{\\CJKttdefault}{TCmono}%
- }
- \\newcommand{\\kerneldocEndTC}{\\endgroup}
- \\newcommand{\\kerneldocBeginKR}{%
- \\begingroup%
- \\xeCJKDeclareCharClass{HalfLeft}{`“,`‘}%
- \\xeCJKDeclareCharClass{HalfRight}{`”,`’}%
- \\krmain%
- \\renewcommand{\\CJKrmdefault}{KRserif}%
- \\renewcommand{\\CJKsfdefault}{KRsans}%
- \\renewcommand{\\CJKttdefault}{KRmono}%
- \\xeCJKsetup{CJKspace = true} % For inter-phrase space
- }
- \\newcommand{\\kerneldocEndKR}{\\endgroup}
- \\newcommand{\\kerneldocBeginJP}{%
- \\begingroup%
- \\xeCJKDeclareCharClass{HalfLeft}{`“,`‘}%
- \\xeCJKDeclareCharClass{HalfRight}{`”,`’}%
- \\jpmain%
- \\renewcommand{\\CJKrmdefault}{JPserif}%
- \\renewcommand{\\CJKsfdefault}{JPsans}%
- \\renewcommand{\\CJKttdefault}{JPmono}%
- }
- \\newcommand{\\kerneldocEndJP}{\\endgroup}
- % Single spacing in literal blocks
- \\fvset{baselinestretch=1}
- % To customize \\sphinxtableofcontents
- \\usepackage{etoolbox}
- % Inactivate CJK after tableofcontents
- \\apptocmd{\\sphinxtableofcontents}{\\kerneldocCJKoff}{}{}
- }{ % No CJK font found
- % Custom macros to on/off CJK (Dummy)
- \\newcommand{\\kerneldocCJKon}{}
- \\newcommand{\\kerneldocCJKoff}{}
- \\newcommand{\\kerneldocBeginSC}{}
- \\newcommand{\\kerneldocEndSC}{}
- \\newcommand{\\kerneldocBeginTC}{}
- \\newcommand{\\kerneldocEndTC}{}
- \\newcommand{\\kerneldocBeginKR}{}
- \\newcommand{\\kerneldocEndKR}{}
- \\newcommand{\\kerneldocBeginJP}{}
- \\newcommand{\\kerneldocEndJP}{}
- }
-'''
-
# Fix reference escape troubles with Sphinx 1.4.x
if major == 1:
latex_elements['preamble'] += '\\renewcommand*{\\DUrole}[2]{ #2 }\n'
+
+# Load kerneldoc specific LaTeX settings
+latex_elements['preamble'] += '''
+ % Load kerneldoc specific LaTeX settings
+ \\input{kerneldoc-preamble.sty}
+'''
+
# With Sphinx 1.6, it is possible to change the Bg color directly
# by using:
# \definecolor{sphinxnoteBgColor}{RGB}{204,255,255}
@@ -533,6 +467,11 @@ for fn in os.listdir('.'):
# If false, no module index is generated.
#latex_domain_indices = True
+# Additional LaTeX stuff to be copied to build directory
+latex_additional_files = [
+ 'sphinx/kerneldoc-preamble.sty',
+]
+
# -- Options for manual page output ---------------------------------------
@@ -558,19 +497,6 @@ texinfo_documents = [
'Miscellaneous'),
]
-# Documents to append as an appendix to all manuals.
-#texinfo_appendices = []
-
-# If false, no module index is generated.
-#texinfo_domain_indices = True
-
-# How to display URL addresses: 'footnote', 'no', or 'inline'.
-#texinfo_show_urls = 'footnote'
-
-# If true, do not generate a @detailmenu in the "Top" node's menu.
-#texinfo_no_detailmenu = False
-
-
# -- Options for Epub output ----------------------------------------------
# Bibliographic Dublin Core info.
@@ -579,67 +505,9 @@ epub_author = author
epub_publisher = author
epub_copyright = copyright
-# The basename for the epub file. It defaults to the project name.
-#epub_basename = project
-
-# The HTML theme for the epub output. Since the default themes are not
-# optimized for small screen space, using the same theme for HTML and epub
-# output is usually not wise. This defaults to 'epub', a theme designed to save
-# visual space.
-#epub_theme = 'epub'
-
-# The language of the text. It defaults to the language option
-# or 'en' if the language is not set.
-#epub_language = ''
-
-# The scheme of the identifier. Typical schemes are ISBN or URL.
-#epub_scheme = ''
-
-# The unique identifier of the text. This can be a ISBN number
-# or the project homepage.
-#epub_identifier = ''
-
-# A unique identification for the text.
-#epub_uid = ''
-
-# A tuple containing the cover image and cover page html template filenames.
-#epub_cover = ()
-
-# A sequence of (type, uri, title) tuples for the guide element of content.opf.
-#epub_guide = ()
-
-# HTML files that should be inserted before the pages created by sphinx.
-# The format is a list of tuples containing the path and title.
-#epub_pre_files = []
-
-# HTML files that should be inserted after the pages created by sphinx.
-# The format is a list of tuples containing the path and title.
-#epub_post_files = []
-
# A list of files that should not be packed into the epub file.
epub_exclude_files = ['search.html']
-# The depth of the table of contents in toc.ncx.
-#epub_tocdepth = 3
-
-# Allow duplicate toc entries.
-#epub_tocdup = True
-
-# Choose between 'default' and 'includehidden'.
-#epub_tocscope = 'default'
-
-# Fix unsupported image types using the Pillow.
-#epub_fix_images = False
-
-# Scale large images.
-#epub_max_image_width = 0
-
-# How to display URL addresses: 'footnote', 'no', or 'inline'.
-#epub_show_urls = 'inline'
-
-# If false, no index is generated.
-#epub_use_index = True
-
#=======
# rst2pdf
#
diff --git a/Documentation/asm-annotations.rst b/Documentation/core-api/asm-annotations.rst
index 76424e0431f4..11c96d3f9ad6 100644
--- a/Documentation/asm-annotations.rst
+++ b/Documentation/core-api/asm-annotations.rst
@@ -43,10 +43,11 @@ annotated objects like this, tools can be run on them to generate more useful
information. In particular, on properly annotated objects, ``objtool`` can be
run to check and fix the object if needed. Currently, ``objtool`` can report
missing frame pointer setup/destruction in functions. It can also
-automatically generate annotations for :doc:`ORC unwinder <x86/orc-unwinder>`
+automatically generate annotations for the ORC unwinder
+(Documentation/arch/x86/orc-unwinder.rst)
for most code. Both of these are especially important to support reliable
-stack traces which are in turn necessary for :doc:`Kernel live patching
-<livepatch/livepatch>`.
+stack traces which are in turn necessary for kernel live patching
+(Documentation/livepatch/livepatch.rst).
Caveat and Discussion
---------------------
@@ -64,7 +65,7 @@ macros, it was decided that brand new macros should be introduced instead::
of importing all the crappy, historic, essentially randomly chosen
debug symbol macro names from the binutils and older kernels?
-.. _discussion: https://lkml.kernel.org/r/20170217104757.28588-1-jslaby@suse.cz
+.. _discussion: https://lore.kernel.org/r/20170217104757.28588-1-jslaby@suse.cz
Macros Description
------------------
@@ -130,14 +131,13 @@ denoting a range of code via ``SYM_*_START/END`` annotations.
In fact, this kind of annotation corresponds to the now deprecated ``ENTRY``
and ``ENDPROC`` macros.
-* ``SYM_FUNC_START_ALIAS`` and ``SYM_FUNC_START_LOCAL_ALIAS`` serve for those
- who decided to have two or more names for one function. The typical use is::
+* ``SYM_FUNC_ALIAS``, ``SYM_FUNC_ALIAS_LOCAL``, and ``SYM_FUNC_ALIAS_WEAK`` can
+ be used to define multiple names for a function. The typical use is::
- SYM_FUNC_START_ALIAS(__memset)
- SYM_FUNC_START(memset)
+ SYM_FUNC_START(__memset)
... asm insns ...
- SYM_FUNC_END(memset)
- SYM_FUNC_END_ALIAS(__memset)
+ SYM_FUNC_END(__memset)
+ SYM_FUNC_ALIAS(memset, __memset)
In this example, one can call ``__memset`` or ``memset`` with the same
result, except the debug information for the instructions is generated to
diff --git a/Documentation/core-api/bus-virt-phys-mapping.rst b/Documentation/core-api/bus-virt-phys-mapping.rst
deleted file mode 100644
index c72b24a7d52c..000000000000
--- a/Documentation/core-api/bus-virt-phys-mapping.rst
+++ /dev/null
@@ -1,220 +0,0 @@
-==========================================================
-How to access I/O mapped memory from within device drivers
-==========================================================
-
-:Author: Linus
-
-.. warning::
-
- The virt_to_bus() and bus_to_virt() functions have been
- superseded by the functionality provided by the PCI DMA interface
- (see Documentation/core-api/dma-api-howto.rst). They continue
- to be documented below for historical purposes, but new code
- must not use them. --davidm 00/12/12
-
-::
-
- [ This is a mail message in response to a query on IO mapping, thus the
- strange format for a "document" ]
-
-The AHA-1542 is a bus-master device, and your patch makes the driver give the
-controller the physical address of the buffers, which is correct on x86
-(because all bus master devices see the physical memory mappings directly).
-
-However, on many setups, there are actually **three** different ways of looking
-at memory addresses, and in this case we actually want the third, the
-so-called "bus address".
-
-Essentially, the three ways of addressing memory are (this is "real memory",
-that is, normal RAM--see later about other details):
-
- - CPU untranslated. This is the "physical" address. Physical address
- 0 is what the CPU sees when it drives zeroes on the memory bus.
-
- - CPU translated address. This is the "virtual" address, and is
- completely internal to the CPU itself with the CPU doing the appropriate
- translations into "CPU untranslated".
-
- - bus address. This is the address of memory as seen by OTHER devices,
- not the CPU. Now, in theory there could be many different bus
- addresses, with each device seeing memory in some device-specific way, but
- happily most hardware designers aren't actually actively trying to make
- things any more complex than necessary, so you can assume that all
- external hardware sees the memory the same way.
-
-Now, on normal PCs the bus address is exactly the same as the physical
-address, and things are very simple indeed. However, they are that simple
-because the memory and the devices share the same address space, and that is
-not generally necessarily true on other PCI/ISA setups.
-
-Now, just as an example, on the PReP (PowerPC Reference Platform), the
-CPU sees a memory map something like this (this is from memory)::
-
- 0-2 GB "real memory"
- 2 GB-3 GB "system IO" (inb/out and similar accesses on x86)
- 3 GB-4 GB "IO memory" (shared memory over the IO bus)
-
-Now, that looks simple enough. However, when you look at the same thing from
-the viewpoint of the devices, you have the reverse, and the physical memory
-address 0 actually shows up as address 2 GB for any IO master.
-
-So when the CPU wants any bus master to write to physical memory 0, it
-has to give the master address 0x80000000 as the memory address.
-
-So, for example, depending on how the kernel is actually mapped on the
-PPC, you can end up with a setup like this::
-
- physical address: 0
- virtual address: 0xC0000000
- bus address: 0x80000000
-
-where all the addresses actually point to the same thing. It's just seen
-through different translations..
-
-Similarly, on the Alpha, the normal translation is::
-
- physical address: 0
- virtual address: 0xfffffc0000000000
- bus address: 0x40000000
-
-(but there are also Alphas where the physical address and the bus address
-are the same).
-
-Anyway, the way to look up all these translations, you do::
-
- #include <asm/io.h>
-
- phys_addr = virt_to_phys(virt_addr);
- virt_addr = phys_to_virt(phys_addr);
- bus_addr = virt_to_bus(virt_addr);
- virt_addr = bus_to_virt(bus_addr);
-
-Now, when do you need these?
-
-You want the **virtual** address when you are actually going to access that
-pointer from the kernel. So you can have something like this::
-
- /*
- * this is the hardware "mailbox" we use to communicate with
- * the controller. The controller sees this directly.
- */
- struct mailbox {
- __u32 status;
- __u32 bufstart;
- __u32 buflen;
- ..
- } mbox;
-
- unsigned char * retbuffer;
-
- /* get the address from the controller */
- retbuffer = bus_to_virt(mbox.bufstart);
- switch (retbuffer[0]) {
- case STATUS_OK:
- ...
-
-on the other hand, you want the bus address when you have a buffer that
-you want to give to the controller::
-
- /* ask the controller to read the sense status into "sense_buffer" */
- mbox.bufstart = virt_to_bus(&sense_buffer);
- mbox.buflen = sizeof(sense_buffer);
- mbox.status = 0;
- notify_controller(&mbox);
-
-And you generally **never** want to use the physical address, because you can't
-use that from the CPU (the CPU only uses translated virtual addresses), and
-you can't use it from the bus master.
-
-So why do we care about the physical address at all? We do need the physical
-address in some cases, it's just not very often in normal code. The physical
-address is needed if you use memory mappings, for example, because the
-"remap_pfn_range()" mm function wants the physical address of the memory to
-be remapped as measured in units of pages, a.k.a. the pfn (the memory
-management layer doesn't know about devices outside the CPU, so it
-shouldn't need to know about "bus addresses" etc).
-
-.. note::
-
- The above is only one part of the whole equation. The above
- only talks about "real memory", that is, CPU memory (RAM).
-
-There is a completely different type of memory too, and that's the "shared
-memory" on the PCI or ISA bus. That's generally not RAM (although in the case
-of a video graphics card it can be normal DRAM that is just used for a frame
-buffer), but can be things like a packet buffer in a network card etc.
-
-This memory is called "PCI memory" or "shared memory" or "IO memory" or
-whatever, and there is only one way to access it: the readb/writeb and
-related functions. You should never take the address of such memory, because
-there is really nothing you can do with such an address: it's not
-conceptually in the same memory space as "real memory" at all, so you cannot
-just dereference a pointer. (Sadly, on x86 it **is** in the same memory space,
-so on x86 it actually works to just deference a pointer, but it's not
-portable).
-
-For such memory, you can do things like:
-
- - reading::
-
- /*
- * read first 32 bits from ISA memory at 0xC0000, aka
- * C000:0000 in DOS terms
- */
- unsigned int signature = isa_readl(0xC0000);
-
- - remapping and writing::
-
- /*
- * remap framebuffer PCI memory area at 0xFC000000,
- * size 1MB, so that we can access it: We can directly
- * access only the 640k-1MB area, so anything else
- * has to be remapped.
- */
- void __iomem *baseptr = ioremap(0xFC000000, 1024*1024);
-
- /* write a 'A' to the offset 10 of the area */
- writeb('A',baseptr+10);
-
- /* unmap when we unload the driver */
- iounmap(baseptr);
-
- - copying and clearing::
-
- /* get the 6-byte Ethernet address at ISA address E000:0040 */
- memcpy_fromio(kernel_buffer, 0xE0040, 6);
- /* write a packet to the driver */
- memcpy_toio(0xE1000, skb->data, skb->len);
- /* clear the frame buffer */
- memset_io(0xA0000, 0, 0x10000);
-
-OK, that just about covers the basics of accessing IO portably. Questions?
-Comments? You may think that all the above is overly complex, but one day you
-might find yourself with a 500 MHz Alpha in front of you, and then you'll be
-happy that your driver works ;)
-
-Note that kernel versions 2.0.x (and earlier) mistakenly called the
-ioremap() function "vremap()". ioremap() is the proper name, but I
-didn't think straight when I wrote it originally. People who have to
-support both can do something like::
-
- /* support old naming silliness */
- #if LINUX_VERSION_CODE < 0x020100
- #define ioremap vremap
- #define iounmap vfree
- #endif
-
-at the top of their source files, and then they can use the right names
-even on 2.0.x systems.
-
-And the above sounds worse than it really is. Most real drivers really
-don't do all that complex things (or rather: the complexity is not so
-much in the actual IO accesses as in error handling and timeouts etc).
-It's generally not hard to fix drivers, and in many cases the code
-actually looks better afterwards::
-
- unsigned long signature = *(unsigned int *) 0xC0000;
- vs
- unsigned long signature = readl(0xC0000);
-
-I think the second version actually is more readable, no?
diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
index 8aed9103e48a..5c0552e78c58 100644
--- a/Documentation/core-api/cachetlb.rst
+++ b/Documentation/core-api/cachetlb.rst
@@ -326,6 +326,12 @@ maps this page at its virtual address.
dirty. Again, see sparc64 for examples of how
to deal with this.
+ ``void flush_dcache_folio(struct folio *folio)``
+ This function is called under the same circumstances as
+ flush_dcache_page(). It allows the architecture to
+ optimise for flushing the entire folio of pages instead
+ of flushing one page at a time.
+
``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``
``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst
index c6f4ba2fb32d..f75778d37488 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -560,7 +560,7 @@ available:
* cpuhp_state_remove_instance(state, node)
* cpuhp_state_remove_instance_nocalls(state, node)
-The arguments are the same as for the the cpuhp_state_add_instance*()
+The arguments are the same as for the cpuhp_state_add_instance*()
variants above.
The functions differ in the way how the installed callbacks are treated:
diff --git a/Documentation/core-api/dma-api-howto.rst b/Documentation/core-api/dma-api-howto.rst
index 358d495456d1..72f6cdb6be1c 100644
--- a/Documentation/core-api/dma-api-howto.rst
+++ b/Documentation/core-api/dma-api-howto.rst
@@ -185,7 +185,7 @@ device struct of your device is embedded in the bus-specific device struct of
your device. For example, &pdev->dev is a pointer to the device struct of a
PCI device (pdev is a pointer to the PCI device struct of your device).
-These calls usually return zero to indicated your device can perform DMA
+These calls usually return zero to indicate your device can perform DMA
properly on the machine given the address mask you provided, but they might
return an error if the mask is too small to be supportable on the given
system. If it returns non-zero, your device cannot perform DMA properly on
@@ -707,20 +707,6 @@ to use the dma_sync_*() interfaces::
}
}
-Drivers converted fully to this interface should not use virt_to_bus() any
-longer, nor should they use bus_to_virt(). Some drivers have to be changed a
-little bit, because there is no longer an equivalent to bus_to_virt() in the
-dynamic DMA mapping scheme - you have to always store the DMA addresses
-returned by the dma_alloc_coherent(), dma_pool_alloc(), and dma_map_single()
-calls (dma_map_sg() stores them in the scatterlist itself if the platform
-supports dynamic DMA mapping in hardware) in your driver structures and/or
-in the card registers.
-
-All drivers should be using these interfaces with no exceptions. It
-is planned to completely remove virt_to_bus() and bus_to_virt() as
-they are entirely deprecated. Some ports already do not provide these
-as it is impossible to correctly support them.
-
Handling Errors
===============
diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 6d6d0edd2d27..829f20a193ca 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -206,6 +206,20 @@ others should not be larger than the returned value.
::
+ size_t
+ dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device.
+
+Mapping larger buffers may take much longer in certain scenarios. In
+addition, for high-rate short-lived streaming mappings, the upfront time
+spent on the mapping may account for an appreciable part of the total
+request lifetime. As such, if splitting larger requests incurs no
+significant performance penalty, then device drivers are advised to
+limit the total length of their DMA streaming mappings to the returned value.
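+
+For example, a driver issuing large streaming mappings might clamp its
+per-request mapping length like this (a minimal sketch; ``dev`` and the
+request splitting logic are assumed to exist elsewhere in the driver)::
+
+        size_t opt = dma_opt_mapping_size(dev);
+
+        /* Split requests so no single streaming mapping exceeds 'opt' */
+        if (req_len > opt)
+                req_len = opt;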
+
+::
+
bool
dma_need_sync(struct device *dev, dma_addr_t dma_addr);
diff --git a/Documentation/core-api/entry.rst b/Documentation/core-api/entry.rst
new file mode 100644
index 000000000000..e12f22ab33c7
--- /dev/null
+++ b/Documentation/core-api/entry.rst
@@ -0,0 +1,279 @@
+Entry/exit handling for exceptions, interrupts, syscalls and KVM
+================================================================
+
+All transitions between execution domains require state updates which are
+subject to strict ordering constraints. State updates are required for the
+following:
+
+ * Lockdep
+ * RCU / Context tracking
+ * Preemption counter
+ * Tracing
+ * Time accounting
+
+The update order depends on the transition type and is explained below in
+the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
+exceptions`_, `NMI and NMI-like exceptions`_.
+
+Non-instrumentable code - noinstr
+---------------------------------
+
+Most instrumentation facilities depend on RCU, so instrumentation is prohibited
+for entry code before RCU starts watching and exit code after RCU stops
+watching. In addition, many architectures must save and restore register state,
+which means that (for example) a breakpoint in the breakpoint entry code would
+overwrite the debug registers of the initial breakpoint.
+
+Such code must be marked with the 'noinstr' attribute, placing that code into a
+special section inaccessible to instrumentation and debug facilities. Some
+functions are partially instrumentable, which is handled by marking them
+noinstr and using instrumentation_begin() and instrumentation_end() to flag the
+instrumentable ranges of code:
+
+.. code-block:: c
+
+ noinstr void entry(void)
+ {
+ handle_entry(); // <-- must be 'noinstr' or '__always_inline'
+ ...
+
+ instrumentation_begin();
+ handle_context(); // <-- instrumentable code
+ instrumentation_end();
+
+ ...
+ handle_exit(); // <-- must be 'noinstr' or '__always_inline'
+ }
+
+This allows verification of the 'noinstr' restrictions via objtool on
+supported architectures.
+
+Invoking non-instrumentable functions from instrumentable context has no
+restrictions and is useful to protect e.g. state switching which would
+cause malfunction if instrumented.
+
+All non-instrumentable entry/exit code sections before and after the RCU
+state transitions must run with interrupts disabled.
+
+Syscalls
+--------
+
+Syscall-entry code starts in assembly code and calls out into low-level C code
+after establishing low-level architecture-specific state and stack frames. This
+low-level C code must not be instrumented. A typical syscall handling function
+invoked from low-level assembly code looks like this:
+
+.. code-block:: c
+
+ noinstr void syscall(struct pt_regs *regs, int nr)
+ {
+ arch_syscall_enter(regs);
+ nr = syscall_enter_from_user_mode(regs, nr);
+
+ instrumentation_begin();
+ if (!invoke_syscall(regs, nr) && nr != -1)
+ result_reg(regs) = __sys_ni_syscall(regs);
+ instrumentation_end();
+
+ syscall_exit_to_user_mode(regs);
+ }
+
+syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
+establishes state in the following order:
+
+ * Lockdep
+ * RCU / Context tracking
+ * Tracing
+
+and then invokes the various entry work functions like ptrace, seccomp, audit,
+syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
+function can be invoked. The instrumentable code section then ends, after which
+syscall_exit_to_user_mode() is invoked.
+
+syscall_exit_to_user_mode() handles all work which needs to be done before
+returning to user space like tracing, audit, signals, task work etc. After
+that it invokes exit_to_user_mode() which again handles the state
+transition in the reverse order:
+
+ * Tracing
+ * RCU / Context tracking
+ * Lockdep
+
+syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
+available as fine-grained subfunctions in cases where the architecture code
+has to do extra work between the various steps. In such cases it has to
+ensure that enter_from_user_mode() is called first on entry and
+exit_to_user_mode() is called last on exit.
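+
+For instance, an architecture that needs to do extra work with interrupts
+enabled between the two steps could use the prepare/work split from
+include/linux/entry-common.h roughly like this (a sketch only; the exact
+split is architecture-specific):
+
+.. code-block:: c
+
+   noinstr void syscall(struct pt_regs *regs, int nr)
+   {
+           syscall_enter_from_user_mode_prepare(regs);
+
+           instrumentation_begin();
+           /* arch-specific work, e.g. fetching additional arguments */
+           nr = syscall_enter_from_user_mode_work(regs, nr);
+           /* ... invoke the syscall ... */
+           instrumentation_end();
+
+           syscall_exit_to_user_mode(regs);
+   }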
+
+Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
+to print a warning.
+
+KVM
+---
+
+Entering or exiting guest mode is very similar to syscalls. From the host
+kernel point of view the CPU goes off into user space when entering the
+guest and returns to the kernel on exit.
+
+kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
+and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
+The state operations have the same ordering.
+
+Task work handling is done separately for guest mode at the boundary of the
+vcpu_run() loop via xfer_to_guest_mode_handle_work(), which runs a subset of
+the work handled on return to user space.
+
+Do not nest KVM entry/exit transitions because doing so is nonsensical.
+
+Interrupts and regular exceptions
+---------------------------------
+
+Interrupt entry and exit handling is slightly more complex than that of syscalls
+and KVM transitions.
+
+If an interrupt is raised while the CPU executes in user space, the entry
+and exit handling is exactly the same as for syscalls.
+
+If the interrupt is raised while the CPU executes in kernel space the entry and
+exit handling is slightly different. RCU state is only updated when the
+interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
+already be watching. Lockdep and tracing have to be updated unconditionally.
+
+irqentry_enter() and irqentry_exit() provide the implementation for this.
+
+The architecture-specific part looks similar to syscall handling:
+
+.. code-block:: c
+
+ noinstr void interrupt(struct pt_regs *regs, int nr)
+ {
+ arch_interrupt_enter(regs);
+ state = irqentry_enter(regs);
+
+ instrumentation_begin();
+
+ irq_enter_rcu();
+ invoke_irq_handler(regs, nr);
+ irq_exit_rcu();
+
+ instrumentation_end();
+
+ irqentry_exit(regs, state);
+ }
+
+Note that the invocation of the actual interrupt handler is within an
+irq_enter_rcu() and irq_exit_rcu() pair.
+
+irq_enter_rcu() updates the preemption count which makes in_hardirq()
+return true, handles NOHZ tick state and interrupt time accounting. This
+means that, up to the point where irq_enter_rcu() is invoked, in_hardirq()
+returns false.
+
+irq_exit_rcu() handles interrupt time accounting, undoes the preemption
+count update and eventually handles soft interrupts and NOHZ tick state.
+
+In theory, the preemption count could be updated in irqentry_enter(). In
+practice, deferring this update to irq_enter_rcu() allows the preemption-count
+code to be traced, while also maintaining symmetry with irq_exit_rcu() and
+irqentry_exit(), which are described in the next paragraph. The only downside
+is that the early entry code up to irq_enter_rcu() must be aware that the
+preemption count has not yet been updated with the HARDIRQ_OFFSET state.
+
+Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
+before it handles soft interrupts, whose handlers must run in BH context rather
+than irq-disabled context. In addition, irqentry_exit() might schedule, which
+also requires that HARDIRQ_OFFSET has been removed from the preemption count.
+
+Even though interrupt handlers are expected to run with local interrupts
+disabled, interrupt nesting is common from an entry/exit perspective. For
+example, softirq handling happens within an irqentry_{enter,exit}() block with
+local interrupts enabled. Also, although uncommon, nothing prevents an
+interrupt handler from re-enabling interrupts.
+
+Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
+runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
+the entry code is shared between the two.
+
+NMI and NMI-like exceptions
+---------------------------
+
+NMIs and NMI-like exceptions (machine checks, double faults, debug
+interrupts, etc.) can hit any context and must be extra careful with
+the state.
+
+State changes for debug exceptions and machine-check exceptions depend on
+whether these exceptions happened in user-space (breakpoints or watchpoints) or
+in kernel mode (code patching). From user-space, they are treated like
+interrupts, while from kernel mode they are treated like NMIs.
+
+NMIs and other NMI-like exceptions handle state transitions without
+distinguishing between user-mode and kernel-mode origin.
+
+The state update on entry is handled in irqentry_nmi_enter() which updates
+state in the following order:
+
+ * Preemption counter
+ * Lockdep
+ * RCU / Context tracking
+ * Tracing
+
+The exit counterpart irqentry_nmi_exit() does the reverse operation in the
+reverse order.
+
+Note that the update of the preemption counter has to be the first
+operation on enter and the last operation on exit. The reason is that both
+lockdep and RCU rely on in_nmi() returning true in this case. The
+preemption count modification in the NMI entry/exit case must not be
+traced.
+
+Architecture-specific code looks like this:
+
+.. code-block:: c
+
+ noinstr void nmi(struct pt_regs *regs)
+ {
+ arch_nmi_enter(regs);
+ state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+ nmi_handler(regs);
+ instrumentation_end();
+
+ irqentry_nmi_exit(regs);
+ }
+
+and for e.g. a debug exception it can look like this:
+
+.. code-block:: c
+
+ noinstr void debug(struct pt_regs *regs)
+ {
+ arch_nmi_enter(regs);
+
+ debug_regs = save_debug_regs();
+
+ if (user_mode(regs)) {
+ state = irqentry_enter(regs);
+
+ instrumentation_begin();
+ user_mode_debug_handler(regs, debug_regs);
+ instrumentation_end();
+
+ irqentry_exit(regs, state);
+ } else {
+ state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+ kernel_mode_debug_handler(regs, debug_regs);
+ instrumentation_end();
+
+ irqentry_nmi_exit(regs, state);
+ }
+ }
+
+There is no combined irqentry_nmi_if_kernel() function available as the
+above cannot be handled in an exception-agnostic way.
+
+NMIs can happen in any context; for example, an NMI-like exception can be
+triggered while handling an NMI. NMI entry code therefore has to be reentrant
+and state updates need to handle nesting.
diff --git a/Documentation/core-api/idr.rst b/Documentation/core-api/idr.rst
index 2eb5afdb9931..18d724867064 100644
--- a/Documentation/core-api/idr.rst
+++ b/Documentation/core-api/idr.rst
@@ -17,6 +17,9 @@ solution to the problem to avoid everybody inventing their own. The IDR
provides the ability to map an ID to a pointer, while the IDA provides
only ID allocation, and as a result is much more memory-efficient.
+The IDR interface is deprecated; please use the :doc:`XArray <xarray>`
+instead.
+
IDR usage
=========
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 5de2c7a4b1b3..7a3a08d81f11 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -18,9 +18,12 @@ it.
kernel-api
workqueue
+ watch_queue
printk-basics
printk-formats
+ printk-index
symbol-namespaces
+ asm-annotations
Data structures and low-level utilities
=======================================
@@ -34,15 +37,25 @@ Library functionality that is used throughout the kernel.
kref
assoc_array
xarray
+ maple_tree
idr
circular-buffers
rbtree
generic-radix-tree
packing
- bus-virt-phys-mapping
this_cpu_ops
timekeeping
errseq
+ wrappers/atomic_t
+ wrappers/atomic_bitops
+
+Low level entry and exit
+========================
+
+.. toctree::
+ :maxdepth: 1
+
+ entry
Concurrency primitives
======================
@@ -58,6 +71,7 @@ Documentation/locking/index.rst for more related documentation.
local_ops
padata
../RCU/index
+ wrappers/memory-barriers.rst
Low-level hardware management
=============================
@@ -77,7 +91,7 @@ Memory management
=================
How to allocate and use memory in the kernel. Note that there is a lot
-more memory-management documentation in Documentation/vm/index.rst.
+more memory-management documentation in Documentation/mm/index.rst.
.. toctree::
:maxdepth: 1
@@ -113,6 +127,7 @@ Documents that don't fit elsewhere or which have yet to be categorized.
:maxdepth: 1
librs
+ netlink
.. only:: subproject and html
diff --git a/Documentation/core-api/irq/irq-domain.rst b/Documentation/core-api/irq/irq-domain.rst
index 9c0e8758037a..f88a6ee67a35 100644
--- a/Documentation/core-api/irq/irq-domain.rst
+++ b/Documentation/core-api/irq/irq-domain.rst
@@ -67,14 +67,11 @@ variety of methods:
deprecated
- generic_handle_domain_irq() handles an interrupt described by a
domain and a hwirq number
-- handle_domain_irq() does the same thing for root interrupt
- controllers and deals with the set_irq_reg()/irq_enter() sequences
- that most architecture requires
Note that irq domain lookups must happen in contexts that are
compatible with a RCU read-side critical section.
-The irq_create_mapping() function must be called *atleast once*
+The irq_create_mapping() function must be called *at least once*
before any call to irq_find_mapping(), lest the descriptor will not
be allocated.
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 2e7186805148..9b3f3e5f5a95 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -36,6 +36,9 @@ String Conversions
String Manipulation
-------------------
+.. kernel-doc:: include/linux/fortify-string.h
+ :internal:
+
.. kernel-doc:: lib/string.c
:export:
@@ -118,6 +121,12 @@ Text Searching
CRC and Math Functions in Linux
===============================
+Arithmetic Overflow Checking
+----------------------------
+
+.. kernel-doc:: include/linux/overflow.h
+ :internal:
+
CRC Functions
-------------
@@ -165,9 +174,6 @@ Division Functions
.. kernel-doc:: include/linux/math64.h
:internal:
-.. kernel-doc:: lib/math/div64.c
- :functions: div_s64_rem div64_u64_rem div64_u64 div64_s64
-
.. kernel-doc:: lib/math/gcd.c
:export:
@@ -214,16 +220,34 @@ relay interface
Module Support
==============
-Module Loading
---------------
+Kernel module auto-loading
+--------------------------
-.. kernel-doc:: kernel/kmod.c
+.. kernel-doc:: kernel/module/kmod.c
:export:
+Module debugging
+----------------
+
+.. kernel-doc:: kernel/module/stats.c
+ :doc: module debugging statistics overview
+
+dup_failed_modules - tracks duplicate failed modules
+****************************************************
+
+.. kernel-doc:: kernel/module/stats.c
+ :doc: dup_failed_modules - tracks duplicate failed modules
+
+module statistics debugfs counters
+**********************************
+
+.. kernel-doc:: kernel/module/stats.c
+ :doc: module statistics debugfs counters
+
Inter Module support
--------------------
-Refer to the file kernel/module.c for more information.
+Refer to the files in kernel/module/ for more information.
Hardware Interfaces
===================
@@ -279,6 +303,7 @@ Accounting Framework
Block Devices
=============
+.. kernel-doc:: include/linux/bio.h
.. kernel-doc:: block/blk-core.c
:export:
@@ -294,9 +319,6 @@ Block Devices
.. kernel-doc:: block/blk-settings.c
:export:
-.. kernel-doc:: block/blk-exec.c
- :export:
-
.. kernel-doc:: block/blk-flush.c
:export:
diff --git a/Documentation/core-api/kobject.rst b/Documentation/core-api/kobject.rst
index 2739f8b72575..7310247310a0 100644
--- a/Documentation/core-api/kobject.rst
+++ b/Documentation/core-api/kobject.rst
@@ -118,7 +118,7 @@ Initialization of kobjects
Code which creates a kobject must, of course, initialize that object. Some
of the internal fields are setup with a (mandatory) call to kobject_init()::
- void kobject_init(struct kobject *kobj, struct kobj_type *ktype);
+ void kobject_init(struct kobject *kobj, const struct kobj_type *ktype);
The ktype is required for a kobject to be created properly, as every kobject
must have an associated kobj_type. After calling kobject_init(), to
@@ -156,7 +156,7 @@ kobject_name()::
There is a helper function to both initialize and add the kobject to the
kernel at the same time, called surprisingly enough kobject_init_and_add()::
- int kobject_init_and_add(struct kobject *kobj, struct kobj_type *ktype,
+ int kobject_init_and_add(struct kobject *kobj, const struct kobj_type *ktype,
struct kobject *parent, const char *fmt, ...);
The arguments are the same as the individual kobject_init() and
@@ -299,7 +299,6 @@ kobj_type::
struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;
- struct attribute **default_attrs;
const struct attribute_group **default_groups;
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
const void *(*namespace)(struct kobject *kobj);
@@ -313,10 +312,10 @@ call kobject_init() or kobject_init_and_add().
The release field in struct kobj_type is, of course, a pointer to the
release() method for this type of kobject. The other two fields (sysfs_ops
-and default_attrs) control how objects of this type are represented in
+and default_groups) control how objects of this type are represented in
sysfs; they are beyond the scope of this document.
-The default_attrs pointer is a list of default attributes that will be
+The default_groups pointer is a list of default attributes that will be
automatically created for any kobject that is registered with this ktype.
@@ -373,10 +372,9 @@ If a kset wishes to control the uevent operations of the kobjects
associated with it, it can use the struct kset_uevent_ops to handle it::
struct kset_uevent_ops {
- int (* const filter)(struct kset *kset, struct kobject *kobj);
- const char *(* const name)(struct kset *kset, struct kobject *kobj);
- int (* const uevent)(struct kset *kset, struct kobject *kobj,
- struct kobj_uevent_env *env);
+ int (* const filter)(struct kobject *kobj);
+ const char *(* const name)(struct kobject *kobj);
+ int (* const uevent)(struct kobject *kobj, struct kobj_uevent_env *env);
};
diff --git a/Documentation/core-api/local_ops.rst b/Documentation/core-api/local_ops.rst
index 2ac3f9f29845..0b42ceaaf3c4 100644
--- a/Documentation/core-api/local_ops.rst
+++ b/Documentation/core-api/local_ops.rst
@@ -191,7 +191,7 @@ Here is a sample module which implements a basic per cpu counter using
static void __exit test_exit(void)
{
- del_timer_sync(&test_timer);
+ timer_shutdown_sync(&test_timer);
}
module_init(test_init);
diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst
new file mode 100644
index 000000000000..45defcf15da7
--- /dev/null
+++ b/Documentation/core-api/maple_tree.rst
@@ -0,0 +1,217 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+
+==========
+Maple Tree
+==========
+
+:Author: Liam R. Howlett
+
+Overview
+========
+
+The Maple Tree is a B-Tree data type which is optimized for storing
+non-overlapping ranges, including ranges of size 1. The tree was designed to
+be simple to use and does not require a user-written search method. It
+supports iterating over a range of entries and going to the previous or next
+entry in a cache-efficient manner. The tree can also be put into an RCU-safe
+mode of operation which allows reading and writing concurrently. Writers must
+synchronize on a lock, which can be the default spinlock, or the user can set
+the lock to an external lock of a different type.
+
+The Maple Tree maintains a small memory footprint and was designed to use
+modern processor cache efficiently. The majority of the users will be able to
+use the normal API. An :ref:`maple-tree-advanced-api` exists for more complex
+scenarios. The most important usage of the Maple Tree is the tracking of the
+virtual memory areas.
+
+The Maple Tree can store values between ``0`` and ``ULONG_MAX``. The Maple
+Tree reserves values with the bottom two bits set to '10' which are below 4096
+(i.e. 2, 6, 10 .. 4094) for internal use. If stored entries might collide
+with these reserved values, users can convert the entries using xa_mk_value()
+and convert them back by calling xa_to_value(). If a user needs to store a
+reserved value, it can be done through the :ref:`maple-tree-advanced-api`,
+but such values are rejected by the normal API.
+
+The Maple Tree can also be configured to support searching for a gap of a given
+size (or larger).
+
+Pre-allocation of nodes is also supported using the
+:ref:`maple-tree-advanced-api`. This is useful for users who must guarantee a
+successful store operation within a given code segment in which memory
+allocation cannot be performed. Node allocations are relatively small, at
+around 256 bytes.
+
+.. _maple-tree-normal-api:
+
+Normal API
+==========
+
+Start by initialising a maple tree, either with DEFINE_MTREE() for statically
+allocated maple trees or mt_init() for dynamically allocated ones. A
+freshly-initialised maple tree contains a ``NULL`` pointer for the range ``0``
+- ``ULONG_MAX``. There are currently two types of maple trees supported: the
+allocation tree and the regular tree. The regular tree has a higher branching
+factor for internal nodes. The allocation tree has a lower branching factor
+but allows the user to search for a gap of a given size or larger from either
+``0`` upwards or ``ULONG_MAX`` down. An allocation tree can be used by
+passing in the ``MT_FLAGS_ALLOC_RANGE`` flag when initialising the tree.
+
+You can then set entries using mtree_store() or mtree_store_range().
+mtree_store() will overwrite any entry with the new entry and return 0 on
+success or an error code otherwise. mtree_store_range() works in the same way
+but takes a range. mtree_load() is used to retrieve the entry stored at a
+given index. You can use mtree_erase() to erase an entire range while only
+knowing one index within that range; alternatively, mtree_store() may be
+called with a ``NULL`` entry to partially erase a range or many ranges at once.
+
+If you want to only store a new entry to a range (or index) if that range is
+currently ``NULL``, you can use mtree_insert_range() or mtree_insert() which
+return -EEXIST if the range is not empty.
+
+You can search for an entry from an index upwards by using mt_find().
+
+You can walk each entry within a range by calling mt_for_each(). You must
+provide a temporary variable to store a cursor. If you want to walk each
+element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If
+the caller is going to hold the lock for the duration of the walk then it is
+worth looking at the mas_for_each() API in the :ref:`maple-tree-advanced-api`
+section.
+
+Sometimes it is necessary to ensure the next call to store to a maple tree does
+not allocate memory; see :ref:`maple-tree-advanced-api` for this use case.
+
+Finally, you can remove all entries from a maple tree by calling
+mtree_destroy(). If the maple tree entries are pointers, you may wish to free
+the entries first.
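+
+A minimal sketch tying the above together (error handling abbreviated; the
+values are encoded with xa_mk_value() so they avoid the reserved ranges):
+
+.. code-block:: c
+
+   static DEFINE_MTREE(mt);
+
+   int example(void)
+   {
+           void *entry;
+           unsigned long index = 0;
+           int ret;
+
+           /* Store a value for the range 5-10 */
+           ret = mtree_store_range(&mt, 5, 10, xa_mk_value(42), GFP_KERNEL);
+           if (ret)
+                   return ret;
+
+           /* Any index inside the range finds the entry */
+           entry = mtree_load(&mt, 7);
+
+           /* Walk every entry in the tree */
+           mt_for_each(&mt, entry, index, ULONG_MAX)
+                   pr_info("%lu: %lu\n", index, xa_to_value(entry));
+
+           mtree_destroy(&mt);
+           return 0;
+   }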
+
+Allocating Nodes
+----------------
+
+The allocations are handled by the internal tree code. See
+:ref:`maple-tree-advanced-alloc` for other options.
+
+Locking
+-------
+
+You do not have to worry about locking. See :ref:`maple-tree-advanced-locks`
+for other options.
+
+The Maple Tree uses RCU and an internal spinlock to synchronise access:
+
+Takes RCU read lock:
+ * mtree_load()
+ * mt_find()
+ * mt_for_each()
+ * mt_next()
+ * mt_prev()
+
+Takes ma_lock internally:
+ * mtree_store()
+ * mtree_store_range()
+ * mtree_insert()
+ * mtree_insert_range()
+ * mtree_erase()
+ * mtree_destroy()
+ * mt_set_in_rcu()
+ * mt_clear_in_rcu()
+
+If you want to take advantage of the internal lock to protect the data
+structures that you are storing in the Maple Tree, you can call mtree_lock()
+before calling mtree_load(), then take a reference count on the object you
+have found before calling mtree_unlock(). This will prevent stores from
+removing the object from the tree between looking up the object and
+incrementing the refcount. You can also use RCU to avoid dereferencing
+freed memory, but an explanation of that is beyond the scope of this
+document.
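+
+A sketch of that pattern, assuming the stored objects embed a struct kref
+named ``ref`` (the ``struct foo`` here is purely illustrative):
+
+.. code-block:: c
+
+   struct foo {
+           struct kref ref;
+           /* ... payload ... */
+   };
+
+   struct foo *foo_get(struct maple_tree *mt, unsigned long index)
+   {
+           struct foo *f;
+
+           mtree_lock(mt);
+           f = mtree_load(mt, index);
+           if (f)
+                   kref_get(&f->ref);
+           mtree_unlock(mt);
+
+           return f;
+   }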
+
+.. _maple-tree-advanced-api:
+
+Advanced API
+============
+
+The advanced API offers more flexibility and better performance at the
+cost of an interface which can be harder to use and has fewer safeguards.
+You must take care of your own locking while using the advanced API.
+You can use the ma_lock, RCU or an external lock for protection.
+You can mix advanced and normal operations on the same array, as long
+as the locking is compatible. The :ref:`maple-tree-normal-api` is implemented
+in terms of the advanced API.
+
+The advanced API is based around the ma_state; this is where the 'mas'
+prefix originates. The ma_state struct keeps track of tree operations to make
+life easier for both internal and external tree users.
+
+Initialising the maple tree is the same as in the :ref:`maple-tree-normal-api`.
+Please see above.
+
+The maple state keeps track of the range start and end in mas->index and
+mas->last, respectively.
+
+mas_walk() will walk the tree to the location of mas->index and set
+mas->index and mas->last according to the range for the entry.
+
+You can set entries using mas_store(). mas_store() will overwrite any entry
+with the new entry and return the first existing entry that is overwritten.
+The range is passed in as members of the maple state: index and last.
+
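+For example, a write over the range 10 - 19 might look like this sketch;
+mas_store_gfp() is used here as it also takes care of node allocation, and
+``new_entry`` is assumed to be a pointer to your own data::
+
+	MA_STATE(mas, &mt, 10, 19);
+	int ret;
+
+	mas_lock(&mas);
+	ret = mas_store_gfp(&mas, new_entry, GFP_KERNEL);
+	mas_unlock(&mas);
+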
+You can use mas_erase() to erase an entire range by setting index and
+last of the maple state to the desired range to erase. This will erase
+the first range that is found within that range, set the maple state index
+and last to the range that was erased, and return the entry that existed
+at that location.
+
+You can walk each entry within a range by using mas_for_each(). If you want
+to walk each element of the tree then ``0`` and ``ULONG_MAX`` may be used as
+the range. If the lock needs to be periodically dropped, see mas_pause() in
+the locking section.
+
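+A read-side walk under RCU might look like this (an illustrative sketch)::
+
+	MA_STATE(mas, &mt, 0, 0);
+	void *entry;
+
+	rcu_read_lock();
+	mas_for_each(&mas, entry, ULONG_MAX) {
+		/* process entry; mas.index and mas.last frame its range */
+	}
+	rcu_read_unlock();
+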
+Using a maple state allows mas_next() and mas_prev() to function as if the
+tree were a linked list. With such a high branching factor the amortized
+performance penalty is outweighed by cache optimization. mas_next() will
+return the next entry which occurs after the entry at index. mas_prev()
+will return the previous entry which occurs before the entry at index.
+
+mas_find() will find the first entry which exists at or above index on
+the first call, and the next entry on every subsequent call.
+
+mas_find_rev() will find the first entry which exists at or below the last on
+the first call, and the previous entry on every subsequent call.
+
+If the user needs to yield the lock during an operation, then the maple state
+must be paused using mas_pause().
+
+There are a few extra interfaces provided when using an allocation tree.
+If you wish to search for a gap within a range, then mas_empty_area()
+or mas_empty_area_rev() can be used. mas_empty_area() searches for a gap
+starting at the lowest index given up to the maximum of the range.
+mas_empty_area_rev() searches for a gap starting at the highest index given
+and continues downward to the lower bound of the range.
+
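+A sketch of a gap search, assuming a tree created with the
+``MT_FLAGS_ALLOC_RANGE`` flag::
+
+	MA_STATE(mas, &mt, 0, 0);
+
+	mas_lock(&mas);
+	/* Look for a 16-slot empty area within [0, 1023]. */
+	if (mas_empty_area(&mas, 0, 1023, 16) == 0) {
+		/* mas.index now holds the start of the gap */
+	}
+	mas_unlock(&mas);
+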
+.. _maple-tree-advanced-alloc:
+
+Advanced Allocating Nodes
+-------------------------
+
+Allocations are usually handled internally to the tree; however, if
+allocations need to occur before a write, calling mas_expected_entries() will
+allocate the worst-case number of nodes needed to insert the provided number
+of ranges. This also causes the tree to enter mass insertion mode. Once
+insertions are complete, calling mas_destroy() on the maple state will free
+the unused allocations.
+
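+A sketch of this bulk-insertion pattern (``nr_ranges`` stands for the number
+of ranges about to be written)::
+
+	MA_STATE(mas, &mt, 0, 0);
+
+	mas_lock(&mas);
+	if (!mas_expected_entries(&mas, nr_ranges)) {
+		/* ... perform the nr_ranges stores with mas_store() ... */
+	}
+	mas_destroy(&mas);	/* free any unused pre-allocations */
+	mas_unlock(&mas);
+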
+.. _maple-tree-advanced-locks:
+
+Advanced Locking
+----------------
+
+The maple tree uses a spinlock by default, but external locks can be used for
+tree updates as well. To use an external lock, the tree must be initialised
+with the ``MT_FLAGS_LOCK_EXTERN`` flag. This is usually done with the
+MTREE_INIT_EXT() #define, which takes an external lock as an argument.
+
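+A sketch of such a definition (the exact flags will depend on the use case)::
+
+	static DEFINE_SPINLOCK(my_lock);
+	static struct maple_tree my_tree =
+		MTREE_INIT_EXT(my_tree, MT_FLAGS_LOCK_EXTERN, my_lock);
+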
+Functions and structures
+========================
+
+.. kernel-doc:: include/linux/maple_tree.h
+.. kernel-doc:: lib/maple_tree.c
diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst
index 5954ddf6ee13..1c58d883b273 100644
--- a/Documentation/core-api/memory-allocation.rst
+++ b/Documentation/core-api/memory-allocation.rst
@@ -170,7 +170,16 @@ should be used if a part of the cache might be copied to the userspace.
After the cache is created kmem_cache_alloc() and its convenience
wrappers can allocate memory from that cache.
-When the allocated memory is no longer needed it must be freed. You can
-use kvfree() for the memory allocated with `kmalloc`, `vmalloc` and
-`kvmalloc`. The slab caches should be freed with kmem_cache_free(). And
-don't forget to destroy the cache with kmem_cache_destroy().
+When the allocated memory is no longer needed it must be freed.
+
+Objects allocated by `kmalloc` can be freed by `kfree` or `kvfree`. Objects
+allocated by `kmem_cache_alloc` can be freed with `kmem_cache_free`, `kfree`
+or `kvfree`, where the latter two might be more convenient thanks to not
+needing the kmem_cache pointer.
+
+The same rules apply to the _bulk and _rcu flavors of the freeing functions.
+
+Memory allocated by `vmalloc` can be freed with `vfree` or `kvfree`.
+Memory allocated by `kvmalloc` can be freed with `kvfree`.
+Caches created by `kmem_cache_create` should be destroyed with
+`kmem_cache_destroy` only after all the allocated objects have been freed.
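+
+For example, the following pairings are all valid (an illustrative sketch)::
+
+	void *buf = kmalloc(64, GFP_KERNEL);
+	void *big = kvmalloc(1 << 20, GFP_KERNEL);	/* may fall back to vmalloc */
+
+	kfree(buf);	/* kvfree(buf) would also be correct */
+	kvfree(big);	/* correct for both kmalloc- and vmalloc-backed memory */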
diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst
index de7467e48067..682259ee633a 100644
--- a/Documentation/core-api/memory-hotplug.rst
+++ b/Documentation/core-api/memory-hotplug.rst
@@ -57,7 +57,6 @@ The third argument (arg) passes a pointer of struct memory_notify::
unsigned long start_pfn;
unsigned long nr_pages;
int status_change_nid_normal;
- int status_change_nid_high;
int status_change_nid;
}
@@ -65,8 +64,6 @@ The third argument (arg) passes a pointer of struct memory_notify::
- nr_pages is # of pages of online/offline memory.
- status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
is (will be) set/clear, if this is -1, then nodemask status is not changed.
-- status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask
- is (will be) set/clear, if this is -1, then nodemask status is not changed.
- status_change_nid is set node id when N_MEMORY of nodemask is (will be)
set/clear. It means a new(memoryless) node gets new memory by online and a
node loses all memory. If this is -1, then nodemask status is not changed.
diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst
index a42f9baddfbf..f5dde5bceaea 100644
--- a/Documentation/core-api/mm-api.rst
+++ b/Documentation/core-api/mm-api.rst
@@ -19,19 +19,16 @@ User Space Memory Access
Memory Allocation Controls
==========================
-.. kernel-doc:: include/linux/gfp.h
- :internal:
-
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Page mobility and placement hints
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Watermark modifiers
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Reclaim modifiers
-.. kernel-doc:: include/linux/gfp.h
+.. kernel-doc:: include/linux/gfp_types.h
:doc: Useful GFP flag combinations
The Slab Cache
@@ -58,15 +55,30 @@ Virtually Contiguous Mappings
File Mapping and Page Cache
===========================
-.. kernel-doc:: mm/readahead.c
- :export:
+Filemap
+-------
.. kernel-doc:: mm/filemap.c
:export:
+Readahead
+---------
+
+.. kernel-doc:: mm/readahead.c
+ :doc: Readahead Overview
+
+.. kernel-doc:: mm/readahead.c
+ :export:
+
+Writeback
+---------
+
.. kernel-doc:: mm/page-writeback.c
:export:
+Truncate
+--------
+
.. kernel-doc:: mm/truncate.c
:export:
@@ -95,6 +107,11 @@ More Memory Management Functions
.. kernel-doc:: mm/mempolicy.c
.. kernel-doc:: include/linux/mm_types.h
:internal:
+.. kernel-doc:: include/linux/mm_inline.h
+.. kernel-doc:: include/linux/page-flags.h
.. kernel-doc:: include/linux/mm.h
:internal:
+.. kernel-doc:: include/linux/page_ref.h
.. kernel-doc:: include/linux/mmzone.h
+.. kernel-doc:: mm/util.c
+ :functions: folio_mapping
diff --git a/Documentation/core-api/netlink.rst b/Documentation/core-api/netlink.rst
new file mode 100644
index 000000000000..e4a938a05cc9
--- /dev/null
+++ b/Documentation/core-api/netlink.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: BSD-3-Clause
+
+.. _kernel_netlink:
+
+===================================
+Netlink notes for kernel developers
+===================================
+
+General guidance
+================
+
+Attribute enums
+---------------
+
+Older families often define "null" attributes and commands with a value
+of ``0``, named ``unspec``. This is supported (``type: unused``)
+but should be avoided in new families. The ``unspec`` enum values are
+not used in practice, so just set the value of the first attribute to ``1``.
+
+Message enums
+-------------
+
+Use the same command IDs for requests and replies. This makes it easier
+to match them up, and we have plenty of ID space.
+
+Use separate command IDs for notifications. This makes it easier to
+sort the notifications from replies (and present them to the user
+application via a different API than replies).
+
+Answer requests
+---------------
+
+Older families do not reply to all of the commands, especially NEW / ADD
+commands. The user only learns whether the operation succeeded or not via
+the ACK. Try to find useful data to return. Once the command is added,
+whether it replies with a full message or only an ACK is uAPI and cannot
+be changed. It's better to err on the side of replying.
+
+Specifically NEW and ADD commands should reply with information identifying
+the created object such as the allocated object's ID (without having to
+resort to using ``NLM_F_ECHO``).
+
+NLM_F_ECHO
+----------
+
+Make sure to pass the request info to genl_notify() to allow ``NLM_F_ECHO``
+to take effect. This is useful for programs that need precise feedback
+from the kernel (for example for logging purposes).
+
+Support dump consistency
+------------------------
+
+If iterating over objects during a dump may skip over objects or repeat
+them, make sure to report the dump inconsistency with ``NLM_F_DUMP_INTR``.
+This is usually implemented by maintaining a generation id for the
+structure and recording it in the ``seq`` member of struct netlink_callback.
+
+Netlink specification
+=====================
+
+Documentation of the Netlink specification parts which are only relevant
+to the kernel space.
+
+Globals
+-------
+
+kernel-policy
+~~~~~~~~~~~~~
+
+Defines if the kernel validation policy is per operation (``per-op``)
+or for the entire family (``global``). New families should use ``per-op``
+(default) to be able to narrow down the attributes accepted by a specific
+command.
+
+checks
+------
+
+Documentation for the ``checks`` sub-sections of attribute specs.
+
+unterminated-ok
+~~~~~~~~~~~~~~~
+
+Accept strings without null-termination (for legacy families only).
+Switches from the ``NLA_NUL_STRING`` to ``NLA_STRING`` policy type.
+
+max-len
+~~~~~~~
+
+Defines the max length for a binary or string attribute (corresponding
+to the ``len`` member of struct nla_policy). For string attributes the
+terminating null character is not counted towards ``max-len``.
+
+The field may either be a literal integer value or the name of a defined
+constant. String types may reduce the constant by one
+(i.e. specify ``max-len: CONST - 1``) to reserve space for the terminating
+character, so implementations should recognize such a pattern.
+
+min-len
+~~~~~~~
+
+Similar to ``max-len`` but defines minimum length.
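+
+As an illustrative sketch (the ``MY_*`` names are hypothetical), these checks
+roughly correspond to ``struct nla_policy`` entries such as::
+
+	static const struct nla_policy my_policy[MY_ATTR_MAX + 1] = {
+		/* max-len: MY_NAME_MAX - 1 reserves room for the NUL */
+		[MY_ATTR_NAME] = { .type = NLA_NUL_STRING, .len = MY_NAME_MAX - 1 },
+		/* max-len: 64 for a binary attribute */
+		[MY_ATTR_DATA] = { .type = NLA_BINARY, .len = 64 },
+	};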
diff --git a/Documentation/core-api/packing.rst b/Documentation/core-api/packing.rst
index d8c341fe383e..3ed13bc9a195 100644
--- a/Documentation/core-api/packing.rst
+++ b/Documentation/core-api/packing.rst
@@ -161,6 +161,6 @@ xxx_packing() that calls it using the proper QUIRK_* one-hot bits set.
The packing() function returns an int-encoded error code, which protects the
programmer against incorrect API use. The errors are not expected to occur
-durring runtime, therefore it is reasonable for xxx_packing() to return void
+during runtime, therefore it is reasonable for xxx_packing() to return void
and simply swallow those errors. Optionally it can dump stack or print the
error description.
diff --git a/Documentation/core-api/padata.rst b/Documentation/core-api/padata.rst
index 35175710b43c..05b73c6c105f 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -42,7 +42,7 @@ padata_shells associated with it, each allowing a separate series of jobs.
Modifying cpumasks
------------------
-The CPUs used to run jobs can be changed in two ways, programatically with
+The CPUs used to run jobs can be changed in two ways, programmatically with
padata_set_cpumask() or via sysfs. The former is defined::
int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index fcf605be43d0..9fb0b1080d3b 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -55,18 +55,17 @@ flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing each by a special
value: GUP_PIN_COUNTING_BIAS.
-For huge pages (and in fact, any compound page of more than 2 pages), the
-GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
-is achieved, by using the 3rd struct page in the compound page. A new struct
-page field, hpage_pinned_refcount, has been added in order to support this.
-
-This approach for compound pages avoids the counting upper limit problems that
-are discussed below. Those limitations would have been aggravated severely by
-huge pages, because each tail page adds a refcount to the head page. And in
-fact, testing revealed that, without a separate hpage_pinned_refcount field,
-page overflows were seen in some huge page stress tests.
-
-This also means that huge pages and compound pages (of order > 1) do not suffer
+For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
+the extra space available in the struct folio is used to store the
+pincount directly.
+
+This approach for large folios avoids the counting upper limit problems
+that are discussed below. Those limitations would have been aggravated
+severely by huge pages, because each tail page adds a refcount to the
+head page. And in fact, testing revealed that, without a separate pincount
+field, refcount overflows were seen in some huge page stress tests.
+
+This also means that huge pages and large folios do not suffer
from the false positives problem that is mentioned below.::
Function
@@ -221,7 +220,7 @@ Unit testing
============
This file::
- tools/testing/selftests/vm/gup_test.c
+ tools/testing/selftests/mm/gup_test.c
has the following new calls to exercise the new pin*() wrapper functions:
@@ -264,9 +263,9 @@ place.)
Other diagnostics
=================
-dump_page() has been enhanced slightly, to handle these new counting fields, and
-to better report on compound pages in general. Specifically, for compound pages
-with order > 1, the exact (hpage_pinned_refcount) pincount is reported.
+dump_page() has been enhanced slightly to handle these new counting
+fields, and to better report on large folios in general. Specifically,
+for large folios, the exact pincount is reported.
References
==========
diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst
index e08bbe9b0cbf..dfe7e75a71de 100644
--- a/Documentation/core-api/printk-formats.rst
+++ b/Documentation/core-api/printk-formats.rst
@@ -575,20 +575,26 @@ The field width is passed by value, the bitmap is passed by reference.
Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease
printing cpumask and nodemask.
-Flags bitfields such as page flags, gfp_flags
----------------------------------------------
+Flags bitfields such as page flags, page_type, gfp_flags
+--------------------------------------------------------
::
- %pGp referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff
+ %pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff)
+ %pGt 0xffffff7f(buddy)
%pGg GFP_USER|GFP_DMA32|GFP_NOWARN
%pGv read|exec|mayread|maywrite|mayexec|denywrite
For printing flags bitfields as a collection of symbolic constants that
would construct the value. The type of flags is given by the third
-character. Currently supported are [p]age flags, [v]ma_flags (both
-expect ``unsigned long *``) and [g]fp_flags (expects ``gfp_t *``). The flag
-names and print order depends on the particular type.
+character. Currently supported are:
+
+ - p - [p]age flags, expects value of type (``unsigned long *``)
+ - t - page [t]ype, expects value of type (``unsigned int *``)
+ - v - [v]ma_flags, expects value of type (``unsigned long *``)
+ - g - [g]fp_flags, expects value of type (``gfp_t *``)
+
+The flag names and print order depends on the particular type.
Note that this format should not be used directly in the
:c:func:`TP_printk()` part of a tracepoint. Instead, use the show_*_flags()
@@ -625,6 +631,16 @@ Examples::
%p4cc Y10 little-endian (0x20303159)
%p4cc NV12 big-endian (0xb231564e)
+Rust
+----
+
+::
+
+ %pA
+
+Only intended to be used from Rust code to format ``core::fmt::Arguments``.
+Do *not* use it from C.
+
Thanks
======
diff --git a/Documentation/core-api/printk-index.rst b/Documentation/core-api/printk-index.rst
new file mode 100644
index 000000000000..3062f37d119b
--- /dev/null
+++ b/Documentation/core-api/printk-index.rst
@@ -0,0 +1,137 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Printk Index
+============
+
+There are many ways to monitor the state of the system. One important
+source of information is the system log. It provides a lot of information,
+including more or less important warnings and error messages.
+
+There are monitoring tools that filter and take action based on messages
+logged.
+
+The kernel messages are evolving together with the code. As a result,
+particular kernel messages are not KABI and never will be!
+
+This is a huge challenge for maintaining system log monitors. It requires
+knowing what messages were updated in a particular kernel version and why.
+Finding these changes in the sources would require non-trivial parsers.
+Also, it would require matching the sources with the binary kernel, which
+is not always trivial. Various changes might be backported. Various kernel
+versions might be used on different monitored systems.
+
+This is where the printk index feature might become useful. It provides
+a dump of all the printk formats used in the source code of the kernel
+and modules on the running system. It is accessible at runtime via debugfs.
+
+The printk index helps to find changes in the message formats. Also it helps
+to track the strings back to the kernel sources and the related commit.
+
+
+User Interface
+==============
+
+The index of printk formats is split into separate files. The files are
+named according to the binaries where the printk formats are built-in. There
+is always "vmlinux" and optionally also modules, for example::
+
+ /sys/kernel/debug/printk/index/vmlinux
+ /sys/kernel/debug/printk/index/ext4
+ /sys/kernel/debug/printk/index/scsi_mod
+
+Note that only loaded modules are shown. Also printk formats from a module
+might appear in "vmlinux" when the module is built-in.
+
+The content is inspired by the dynamic debug interface and looks like::
+
+ $> head -1 /sys/kernel/debug/printk/index/vmlinux; shuf -n 5 vmlinux
+ # <level[,flags]> filename:line function "format"
+ <5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n"
+ <4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n"
+ <6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n"
+ <6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n"
+ <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"
+
+, where the meaning is:
+
+ - :level: log level value: 0-7 for particular severity, -1 as default,
+ 'c' as continuous line without an explicit log level
+ - :flags: optional flags: currently only 'c' for KERN_CONT
+ - :filename\:line: source filename and line number of the related
+ printk() call. Note that there are many wrappers, for example,
+ pr_warn(), pr_warn_once(), dev_warn().
+ - :function: function name where the printk() call is used.
+ - :format: format string
+
+The extra information makes it a bit harder to find differences
+between various kernels, especially since the line number might change
+very often. On the other hand, it helps a lot to confirm that
+it is the same string, or to find the commit responsible
+for any changes.
+
+
+printk() Is Not a Stable KABI
+=============================
+
+Several developers are afraid that exporting all these implementation
+details into the user space will transform particular printk() calls
+into KABI.
+
+But it is exactly the opposite. printk() calls must _not_ be KABI.
+And the printk index helps user space tools to deal with this.
+
+
+Subsystem specific printk wrappers
+==================================
+
+The printk index is generated using extra metadata stored in a dedicated
+ELF section ``.printk_index``. The metadata are emitted by macro wrappers
+that call __printk_index_emit() together with the real printk() call. The
+same technique is also used for the metadata used by the dynamic debug
+feature.
+
+The metadata are stored for a particular message only when it is printed
+using these special wrappers. It is implemented for the commonly
+used printk() calls, including, for example, pr_warn() or pr_once().
+
+Additional changes are necessary for various subsystem specific wrappers
+that call the original printk() via a common helper function. These need
+their own wrappers adding __printk_index_emit().
+
+Only a few subsystem specific wrappers have been updated so far,
+for example, dev_printk(). As a result, the printk formats from
+some subsystems can be missing in the printk index.
+
+
+Subsystem specific prefix
+=========================
+
+The pr_fmt() macro allows defining a prefix that is printed
+before the string generated by the related printk() calls.
+
+Subsystem specific wrappers usually add even more complicated
+prefixes.
+
+These prefixes can be stored into the printk index metadata
+by an optional parameter of __printk_index_emit(). The debugfs
+interface might then show the printk formats including these prefixes.
+For example, drivers/acpi/osl.c contains::
+
+ #define pr_fmt(fmt) "ACPI: OSL: " fmt
+
+ static int __init acpi_no_auto_serialize_setup(char *str)
+ {
+ acpi_gbl_auto_serialize_methods = FALSE;
+ pr_info("Auto-serialization disabled\n");
+
+ return 1;
+ }
+
+This results in the following printk index entry::
+
+ <6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: OSL: Auto-serialization disabled\n"
+
+This helps to match messages from the real log with the printk index.
+The source file name, line number, and function name can then
+be used to match the string with the source code.
diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst
index ec575e72d0b2..bf28ac0401f3 100644
--- a/Documentation/core-api/protection-keys.rst
+++ b/Documentation/core-api/protection-keys.rst
@@ -4,31 +4,29 @@
Memory Protection Keys
======================
-Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature
-which is found on Intel's Skylake (and later) "Scalable Processor"
-Server CPUs. It will be available in future non-server Intel parts
-and future AMD processors.
-
-For anyone wishing to test or use this feature, it is available in
-Amazon's EC2 C5 instances and is known to work there using an Ubuntu
-17.04 image.
-
-Memory Protection Keys provides a mechanism for enforcing page-based
-protections, but without requiring modification of the page tables
-when an application changes protection domains. It works by
-dedicating 4 previously ignored bits in each page table entry to a
-"protection key", giving 16 possible keys.
-
-There is also a new user-accessible register (PKRU) with two separate
-bits (Access Disable and Write Disable) for each key. Being a CPU
-register, PKRU is inherently thread-local, potentially giving each
+Memory Protection Keys provide a mechanism for enforcing page-based
+protections, but without requiring modification of the page tables when an
+application changes protection domains.
+
+Pkeys Userspace (PKU) is a feature which can be found on:
+ * Intel server CPUs, Skylake and later
+ * Intel client CPUs, Tiger Lake (11th Gen Core) and later
+ * Future AMD CPUs
+
+Pkeys work by dedicating 4 previously Reserved bits in each page table entry to
+a "protection key", giving 16 possible keys.
+
+Protections for each key are defined with a per-CPU user-accessible register
+(PKRU). Each of these is a 32-bit register storing two bits (Access Disable
+and Write Disable) for each of 16 keys.
+
+Being a CPU register, PKRU is inherently thread-local, potentially giving each
thread a different set of protections from every other thread.
-There are two new instructions (RDPKRU/WRPKRU) for reading and writing
-to the new register. The feature is only available in 64-bit mode,
-even though there is theoretically space in the PAE PTEs. These
-permissions are enforced on data access only and have no effect on
-instruction fetches.
+There are two instructions (RDPKRU/WRPKRU) for reading and writing to the
+register. The feature is only available in 64-bit mode, even though there is
+theoretically space in the PAE PTEs. These permissions are enforced on data
+access only and have no effect on instruction fetches.
Syscalls
========
diff --git a/Documentation/core-api/symbol-namespaces.rst b/Documentation/core-api/symbol-namespaces.rst
index 5ad9e0abe42c..12e4aecdae94 100644
--- a/Documentation/core-api/symbol-namespaces.rst
+++ b/Documentation/core-api/symbol-namespaces.rst
@@ -51,8 +51,8 @@ namespace ``USB_STORAGE``, use::
The corresponding ksymtab entry struct ``kernel_symbol`` will have the member
``namespace`` set accordingly. A symbol that is exported without a namespace will
refer to ``NULL``. There is no default namespace if none is defined. ``modpost``
-and kernel/module.c make use the namespace at build time or module load time,
-respectively.
+and kernel/module/main.c make use of the namespace at build time or module
+load time, respectively.
2.2 Using the DEFAULT_SYMBOL_NAMESPACE define
=============================================
diff --git a/Documentation/core-api/timekeeping.rst b/Documentation/core-api/timekeeping.rst
index 729e24864fe7..22ec68f24421 100644
--- a/Documentation/core-api/timekeeping.rst
+++ b/Documentation/core-api/timekeeping.rst
@@ -132,6 +132,7 @@ Some additional variants exist for more specialized cases:
.. c:function:: u64 ktime_get_mono_fast_ns( void )
u64 ktime_get_raw_fast_ns( void )
u64 ktime_get_boot_fast_ns( void )
+ u64 ktime_get_tai_fast_ns( void )
u64 ktime_get_real_fast_ns( void )
These variants are safe to call from any context, including from
diff --git a/Documentation/watch_queue.rst b/Documentation/core-api/watch_queue.rst
index 54f13ad5fc17..54f13ad5fc17 100644
--- a/Documentation/watch_queue.rst
+++ b/Documentation/core-api/watch_queue.rst
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 541d31de8926..8ec4d6270b24 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -216,10 +216,6 @@ resources, scheduled and executed.
This flag is meaningless for unbound wq.
-Note that the flag ``WQ_NON_REENTRANT`` no longer exists as all
-workqueues are now non-reentrant - any work item is guaranteed to be
-executed by at most one worker system-wide at any given time.
-
``max_active``
--------------
@@ -374,8 +370,8 @@ of possible problems:
The first one can be tracked using tracing: ::
- $ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
- $ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
+ $ echo workqueue:workqueue_queue_work > /sys/kernel/tracing/set_event
+ $ cat /sys/kernel/tracing/trace_pipe > out.txt
(wait a few secs)
^C
@@ -391,6 +387,23 @@ the stack trace of the offending worker thread. ::
The work item's function should be trivially visible in the stack
trace.
+Non-reentrance Conditions
+=========================
+
+Workqueue guarantees that a work item cannot be re-entrant if the following
+conditions hold after a work item gets queued:
+
+ 1. The work function hasn't been changed.
+ 2. No one queues the work item to another workqueue.
+ 3. The work item hasn't been reinitiated.
+
+In other words, if the above conditions hold, the work item is guaranteed to be
+executed by at most one worker system-wide at any given time.
+
+Note that requeuing the work item (to the same queue) from within the work
+function itself doesn't break these conditions, so it's safe to do. Otherwise,
+caution is required when breaking the conditions inside a work function.
+
Kernel Inline Documentations Reference
======================================
diff --git a/Documentation/core-api/wrappers/atomic_bitops.rst b/Documentation/core-api/wrappers/atomic_bitops.rst
new file mode 100644
index 000000000000..bf24e4081a8f
--- /dev/null
+++ b/Documentation/core-api/wrappers/atomic_bitops.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+ This is a simple wrapper to bring atomic_bitops.txt into the RST world
+ until such a time as that file can be converted directly.
+
+=============
+Atomic bitops
+=============
+
+.. raw:: latex
+
+ \footnotesize
+
+.. include:: ../../atomic_bitops.txt
+ :literal:
+
+.. raw:: latex
+
+ \normalsize
diff --git a/Documentation/core-api/wrappers/atomic_t.rst b/Documentation/core-api/wrappers/atomic_t.rst
new file mode 100644
index 000000000000..ed109a964c77
--- /dev/null
+++ b/Documentation/core-api/wrappers/atomic_t.rst
@@ -0,0 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+ This is a simple wrapper to bring atomic_t.txt into the RST world
+ until such a time as that file can be converted directly.
+
+============
+Atomic types
+============
+
+.. raw:: latex
+
+ \footnotesize
+
+.. include:: ../../atomic_t.txt
+ :literal:
+
+.. raw:: latex
+
+ \normalsize
+
diff --git a/Documentation/core-api/wrappers/memory-barriers.rst b/Documentation/core-api/wrappers/memory-barriers.rst
new file mode 100644
index 000000000000..532460b5e3eb
--- /dev/null
+++ b/Documentation/core-api/wrappers/memory-barriers.rst
@@ -0,0 +1,18 @@
+.. SPDX-License-Identifier: GPL-2.0
+ This is a simple wrapper to bring memory-barriers.txt into the RST world
+ until such a time as that file can be converted directly.
+
+============================
+Linux kernel memory barriers
+============================
+
+.. raw:: latex
+
+ \footnotesize
+
+.. include:: ../../memory-barriers.txt
+ :literal:
+
+.. raw:: latex
+
+ \normalsize
diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
index a137a0e6d068..77e0ece2b1d6 100644
--- a/Documentation/core-api/xarray.rst
+++ b/Documentation/core-api/xarray.rst
@@ -315,11 +315,15 @@ indeed the normal API is implemented in terms of the advanced API. The
advanced API is only available to modules with a GPL-compatible license.
The advanced API is based around the xa_state. This is an opaque data
-structure which you declare on the stack using the XA_STATE()
-macro. This macro initialises the xa_state ready to start walking
-around the XArray. It is used as a cursor to maintain the position
-in the XArray and let you compose various operations together without
-having to restart from the top every time.
+structure which you declare on the stack using the XA_STATE() macro.
+This macro initialises the xa_state ready to start walking around the
+XArray. It is used as a cursor to maintain the position in the XArray
+and let you compose various operations together without having to restart
+from the top every time. The contents of the xa_state are protected by
+the rcu_read_lock() or the xas_lock(). If you need to drop whichever of
+those locks is protecting your state and tree, you must call xas_pause()
+so that future calls do not rely on the parts of the state which were
+left unprotected.
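+
+For instance, a read-side walk that periodically yields might look like the
+following sketch (assuming an existing ``struct xarray array``)::
+
+	XA_STATE(xas, &array, 0);
+	void *entry;
+
+	rcu_read_lock();
+	xas_for_each(&xas, entry, ULONG_MAX) {
+		if (xas_retry(&xas, entry))
+			continue;
+		if (need_resched()) {
+			xas_pause(&xas);	/* state no longer relies on the lock */
+			rcu_read_unlock();
+			cond_resched();
+			rcu_read_lock();
+		}
+		/* process entry */
+	}
+	rcu_read_unlock();
+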
The xa_state is also used to store errors. You can call
xas_error() to retrieve the error. All operations check whether
diff --git a/Documentation/cpu-freq/core.rst b/Documentation/cpu-freq/core.rst
index 33cb90bd1d8f..4ceef8e7217c 100644
--- a/Documentation/cpu-freq/core.rst
+++ b/Documentation/cpu-freq/core.rst
@@ -73,12 +73,12 @@ CPUFREQ_POSTCHANGE.
The third argument is a struct cpufreq_freqs with the following
values:
-===== ===========================
-cpu number of the affected CPU
+====== ======================================
+policy a pointer to the struct cpufreq_policy
old old frequency
new new frequency
flags flags of the cpufreq driver
-===== ===========================
+====== ======================================
3. CPUFreq Table Generation with Operating Performance Point (OPP)
==================================================================
diff --git a/Documentation/cpu-freq/cpu-drivers.rst b/Documentation/cpu-freq/cpu-drivers.rst
index 3b32336a7803..d84ededb66f9 100644
--- a/Documentation/cpu-freq/cpu-drivers.rst
+++ b/Documentation/cpu-freq/cpu-drivers.rst
@@ -75,6 +75,9 @@ And optionally
.resume - A pointer to a per-policy resume function which is called
with interrupts disabled and _before_ the governor is started again.
+ .ready - A pointer to a per-policy ready function which is called after
+ the policy is fully initialized.
+
.attr - A pointer to a NULL-terminated list of "struct freq_attr" which
allow to export values to sysfs.
diff --git a/Documentation/cpu-freq/index.rst b/Documentation/cpu-freq/index.rst
index aba7831ab1cb..de25740651f7 100644
--- a/Documentation/cpu-freq/index.rst
+++ b/Documentation/cpu-freq/index.rst
@@ -1,8 +1,8 @@
.. SPDX-License-Identifier: GPL-2.0
-==============================================================================
-Linux CPUFreq - CPU frequency and voltage scaling code in the Linux(TM) kernel
-==============================================================================
+========================================================================
+CPUFreq - CPU frequency and voltage scaling code in the Linux(TM) kernel
+========================================================================
Author: Dominik Brodowski <linux@brodo.de>
@@ -20,18 +20,15 @@ Author: Dominik Brodowski <linux@brodo.de>
Mailing List
------------
-There is a CPU frequency changing CVS commit and general list where
-you can report bugs, problems or submit patches. To post a message,
-send an email to linux-pm@vger.kernel.org.
+There is a CPU frequency general list where you can report bugs,
+problems or submit patches. To post a message, send an email to
+linux-pm@vger.kernel.org.
Links
-----
the FTP archives:
* ftp://ftp.linux.org.uk/pub/linux/cpufreq/
-how to access the CVS repository:
-* http://cvs.arm.linux.org.uk/
-
the CPUFreq Mailing list:
* http://vger.kernel.org/vger-lists.html#linux-pm
diff --git a/Documentation/crypto/crypto_engine.rst b/Documentation/crypto/crypto_engine.rst
index 25cf9836c336..d562ea17d994 100644
--- a/Documentation/crypto/crypto_engine.rst
+++ b/Documentation/crypto/crypto_engine.rst
@@ -69,6 +69,8 @@ the crypto engine via one of:
* crypto_transfer_hash_request_to_engine()
+* crypto_transfer_kpp_request_to_engine()
+
* crypto_transfer_skcipher_request_to_engine()
At the end of the request process, a call to one of the following functions is needed:
@@ -79,4 +81,6 @@ At the end of the request process, a call to one of the following functions is n
* crypto_finalize_hash_request()
+* crypto_finalize_kpp_request()
+
* crypto_finalize_skcipher_request()
diff --git a/Documentation/crypto/devel-algos.rst b/Documentation/crypto/devel-algos.rst
index f225a953ab4b..3506899ef83e 100644
--- a/Documentation/crypto/devel-algos.rst
+++ b/Documentation/crypto/devel-algos.rst
@@ -172,7 +172,7 @@ Here are schematics of how these functions are called when operated from
other part of the kernel. Note that the .setkey() call might happen
before or after any of these schematics happen, but must not happen
during any of these are in-flight. Please note that calling .init()
-followed immediately by .finish() is also a perfectly valid
+followed immediately by .final() is also a perfectly valid
transformation.
::
diff --git a/Documentation/crypto/index.rst b/Documentation/crypto/index.rst
index 21338fa92642..da5d5ad2bdf3 100644
--- a/Documentation/crypto/index.rst
+++ b/Documentation/crypto/index.rst
@@ -1,6 +1,6 @@
-=======================
-Linux Kernel Crypto API
-=======================
+==========
+Crypto API
+==========
:Author: Stephan Mueller
:Author: Marek Vasut
diff --git a/Documentation/crypto/userspace-if.rst b/Documentation/crypto/userspace-if.rst
index b45dabbf69d6..f80f243e227e 100644
--- a/Documentation/crypto/userspace-if.rst
+++ b/Documentation/crypto/userspace-if.rst
@@ -131,9 +131,9 @@ from the kernel crypto API. If the buffer is too small for the message
digest, the flag MSG_TRUNC is set by the kernel.
In order to set a message digest key, the calling application must use
-the setsockopt() option of ALG_SET_KEY. If the key is not set the HMAC
-operation is performed without the initial HMAC state change caused by
-the key.
+the setsockopt() option of ALG_SET_KEY or ALG_SET_KEY_BY_KEY_SERIAL. If the
+key is not set the HMAC operation is performed without the initial HMAC state
+change caused by the key.
Symmetric Cipher API
--------------------
@@ -382,6 +382,15 @@ mentioned optname:
- the RNG cipher type to provide the seed
+- ALG_SET_KEY_BY_KEY_SERIAL -- Setting the key via keyring key_serial_t.
+ This operation behaves the same as ALG_SET_KEY. The decrypted
+ data is copied from a keyring key and used as the key for
+ symmetric encryption.
+
+ The passed in key_serial_t must have the KEY_(POS|USR|GRP|OTH)_SEARCH
+ permission set, otherwise -EPERM is returned. Supports key types: user,
+ logon, encrypted, and trusted.
+
- ALG_SET_AEAD_AUTHSIZE -- Setting the authentication tag size for
AEAD ciphers. For an encryption operation, the authentication tag of
the given size will be generated. For a decryption operation, the
diff --git a/Documentation/dev-tools/checkpatch.rst b/Documentation/dev-tools/checkpatch.rst
index f0956e9ea2d8..c3389c6f3838 100644
--- a/Documentation/dev-tools/checkpatch.rst
+++ b/Documentation/dev-tools/checkpatch.rst
@@ -612,6 +612,13 @@ Commit message
See: https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes
+ **BAD_FIXES_TAG**
+ The Fixes: tag is malformed or does not follow the community conventions.
+ This can occur if the tag has been split into multiple lines (e.g., when
+ pasted in an email program with word wrapping enabled).
+
+ See: https://www.kernel.org/doc/html/latest/process/submitting-patches.html#describe-your-changes
+
Comparison style
----------------
@@ -710,6 +717,39 @@ Indentation and Line Breaks
See: https://www.kernel.org/doc/html/latest/process/coding-style.html#breaking-long-lines-and-strings
+ **SPLIT_STRING**
+ Quoted strings that appear as messages in userspace and can be
+ grepped, should not be split across multiple lines.
+
+ See: https://lore.kernel.org/lkml/20120203052727.GA15035@leaf/
+
+ **MULTILINE_DEREFERENCE**
+ A single dereferencing identifier spanned on multiple lines like::
+
+ struct_identifier->member[index].
+ member = <foo>;
+
+ is generally hard to follow. It can easily lead to typos and so makes
+ the code vulnerable to bugs.
+
+ If fixing the multiple line dereferencing leads to an 80 column
+ violation, then either rewrite the code in a simpler way, or, if the
+ starting part of the dereferencing identifier is the same and used at
+ multiple places, store it in a temporary variable and use that
+ temporary variable at all the places. For example, if there are
+ two dereferencing identifiers::
+
+ member1->member2->member3.foo1;
+ member1->member2->member3.foo2;
+
+ then store the member1->member2->member3 part in a temporary variable.
+ It not only helps to avoid the 80 column violation but also reduces
+ the program size by removing the unnecessary dereferences.
+
+ But if none of the above methods work then ignore the 80 column
+ violation because it is much easier to read a dereferencing identifier
+ on a single line.
+
**TRAILING_STATEMENTS**
Trailing statements (for example after any conditional) should be
on the next line.
@@ -845,6 +885,38 @@ Macros, Attributes and Symbols
Use the `fallthrough;` pseudo keyword instead of
`/* fallthrough */` like comments.
+ **TRAILING_SEMICOLON**
+ Macro definition should not end with a semicolon. The macro
+ invocation style should be consistent with function calls.
+ This can prevent any unexpected code paths::
+
+ #define MAC do_something;
+
+ If this macro is used within an if-else statement, like::
+
+ if (some_condition)
+ MAC;
+
+ else
+ do_something;
+
+ Then there would be a compilation error, because when the macro is
+ expanded there are two trailing semicolons, so the else branch gets
+ orphaned.
+
+ See: https://lore.kernel.org/lkml/1399671106.2912.21.camel@joe-AO725/
+
+ **SINGLE_STATEMENT_DO_WHILE_MACRO**
+ For the multi-statement macros, it is necessary to use the do-while
+ loop to avoid unpredictable code paths. The do-while loop helps to
+ group the multiple statements into a single one so that a
+ function-like macro can be used as a function only.
+
+ But for single statement macros, it is unnecessary to use the
+ do-while loop. Although the code is syntactically correct, using
+ the do-while loop is redundant. So remove the do-while loop from
+ single statement macros.
+
**WEAK_DECLARATION**
Using weak declarations like __attribute__((weak)) or __weak
can have unintended link defects. Avoid using them.
@@ -920,6 +992,11 @@ Functions and Variables
Your compiler (or rather your loader) automatically does
it for you.
+ **MULTIPLE_ASSIGNMENTS**
+ Multiple assignments on a single line make the code unnecessarily
+ complicated. So assign a value to a single variable on each line;
+ this makes the code more readable and helps avoid typos.
+
**RETURN_PARENTHESES**
return is not a function and as such doesn't need parentheses::
@@ -957,6 +1034,17 @@ Permissions
Permission bits should use 4 digit octal permissions (like 0700 or 0444).
Avoid using any other base like decimal.
+ **SYMBOLIC_PERMS**
+ Permission bits in the octal form are more readable and easier to
+ understand than their symbolic counterparts because many command-line
+ tools use this notation. Experienced kernel developers have been using
+ these traditional Unix permission bits for decades and so they find it
+ easier to understand the octal notation than the symbolic macros.
+ For example, it is harder to read S_IWUSR|S_IRUGO than 0644, which
+ obscures the developer's intent rather than clarifying it.
+
+ See: https://lore.kernel.org/lkml/CA+55aFw5v23T-zvDZp-MmD_EYxF8WbafwwB59934FV7g21uMGQ@mail.gmail.com/
+
Spacing and Brackets
--------------------
diff --git a/Documentation/dev-tools/coccinelle.rst b/Documentation/dev-tools/coccinelle.rst
index 9c454de5a7f7..535ce126fb4f 100644
--- a/Documentation/dev-tools/coccinelle.rst
+++ b/Documentation/dev-tools/coccinelle.rst
@@ -66,7 +66,7 @@ The wiki documentation always refers to the linux-next version of the script.
For Semantic Patch Language(SmPL) grammar documentation refer to:
-http://coccinelle.lip6.fr/documentation.php
+https://coccinelle.gitlabpages.inria.fr/website/docs/main_grammar.html
Using Coccinelle on the Linux kernel
------------------------------------
@@ -219,7 +219,7 @@ instance::
cat cocci.err
You can use SPFLAGS to add debugging flags; for instance you may want to
-add both --profile --show-trying to SPFLAGS when debugging. For example
+add both ``--profile --show-trying`` to SPFLAGS when debugging. For example
you may want to use::
rm -f err.log
@@ -248,7 +248,7 @@ variables for .cocciconfig is as follows:
- Your current user's home directory is processed first
- Your directory from which spatch is called is processed next
-- The directory provided with the --dir option is processed last, if used
+- The directory provided with the ``--dir`` option is processed last, if used
Since coccicheck runs through make, it naturally runs from the kernel
proper dir; as such the second rule above would be implied for picking up a
@@ -265,8 +265,8 @@ The kernel coccicheck script has::
fi
KBUILD_EXTMOD is set when an explicit target with M= is used. For both cases
-the spatch --dir argument is used, as such third rule applies when whether M=
-is used or not, and when M= is used the target directory can have its own
+the spatch ``--dir`` argument is used; as such the third rule applies whether
+M= is used or not, and when M= is used the target directory can have its own
.cocciconfig file. When M= is not passed as an argument to coccicheck the
target directory is the same as the directory from where spatch was called.
diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst b/Documentation/dev-tools/gdb-kernel-debugging.rst
index 8e0f1fe8d17a..895285c037c7 100644
--- a/Documentation/dev-tools/gdb-kernel-debugging.rst
+++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
@@ -39,6 +39,10 @@ Setup
this mode. In this case, you should build the kernel with
CONFIG_RANDOMIZE_BASE disabled if the architecture supports KASLR.
+- Build the gdb scripts (required on kernels v5.1 and above)::
+
+ make scripts_gdb
+
- Enable the gdb stub of QEMU/KVM, either
- at VM startup time by appending "-s" to the QEMU command line
diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst
index 010a2af1e7d9..6b0663075dc0 100644
--- a/Documentation/dev-tools/index.rst
+++ b/Documentation/dev-tools/index.rst
@@ -24,6 +24,7 @@ Documentation/dev-tools/testing-overview.rst
kcov
gcov
kasan
+ kmsan
ubsan
kmemleak
kcsan
@@ -32,6 +33,7 @@ Documentation/dev-tools/testing-overview.rst
kgdb
kselftest
kunit/index
+ ktap
.. only:: subproject and html
diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index 21dc03bc10a4..e66916a483cd 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -4,39 +4,76 @@ The Kernel Address Sanitizer (KASAN)
Overview
--------
-KernelAddressSANitizer (KASAN) is a dynamic memory safety error detector
-designed to find out-of-bound and use-after-free bugs. KASAN has three modes:
+Kernel Address Sanitizer (KASAN) is a dynamic memory safety error detector
+designed to find out-of-bounds and use-after-free bugs.
-1. generic KASAN (similar to userspace ASan),
-2. software tag-based KASAN (similar to userspace HWASan),
-3. hardware tag-based KASAN (based on hardware memory tagging).
+KASAN has three modes:
-Generic KASAN is mainly used for debugging due to a large memory overhead.
-Software tag-based KASAN can be used for dogfood testing as it has a lower
-memory overhead that allows using it with real workloads. Hardware tag-based
-KASAN comes with low memory and performance overheads and, therefore, can be
-used in production. Either as an in-field memory bug detector or as a security
-mitigation.
+1. Generic KASAN
+2. Software Tag-Based KASAN
+3. Hardware Tag-Based KASAN
-Software KASAN modes (#1 and #2) use compile-time instrumentation to insert
-validity checks before every memory access and, therefore, require a compiler
-version that supports that.
+Generic KASAN, enabled with CONFIG_KASAN_GENERIC, is the mode intended for
+debugging, similar to userspace ASan. This mode is supported on many CPU
+architectures, but it has significant performance and memory overheads.
-Generic KASAN is supported in GCC and Clang. With GCC, it requires version
-8.3.0 or later. Any supported Clang version is compatible, but detection of
-out-of-bounds accesses for global variables is only supported since Clang 11.
+Software Tag-Based KASAN or SW_TAGS KASAN, enabled with CONFIG_KASAN_SW_TAGS,
+can be used for both debugging and dogfood testing, similar to userspace HWASan.
+This mode is only supported for arm64, but its moderate memory overhead allows
+using it for testing on memory-restricted devices with real workloads.
-Software tag-based KASAN mode is only supported in Clang.
+Hardware Tag-Based KASAN or HW_TAGS KASAN, enabled with CONFIG_KASAN_HW_TAGS,
+is the mode intended to be used as an in-field memory bug detector or as a
+security mitigation. This mode only works on arm64 CPUs that support MTE
+(Memory Tagging Extension), but it has low memory and performance overheads and
+thus can be used in production.
-The hardware KASAN mode (#3) relies on hardware to perform the checks but
-still requires a compiler version that supports memory tagging instructions.
-This mode is supported in GCC 10+ and Clang 11+.
+For details about the memory and performance impact of each KASAN mode, see the
+descriptions of the corresponding Kconfig options.
-Both software KASAN modes work with SLUB and SLAB memory allocators,
-while the hardware tag-based KASAN currently only supports SLUB.
+The Generic and the Software Tag-Based modes are commonly referred to as the
+software modes. The Software Tag-Based and the Hardware Tag-Based modes are
+referred to as the tag-based modes.
-Currently, generic KASAN is supported for the x86_64, arm, arm64, xtensa, s390,
-and riscv architectures, and tag-based KASAN modes are supported only for arm64.
+Support
+-------
+
+Architectures
+~~~~~~~~~~~~~
+
+Generic KASAN is supported on x86_64, arm, arm64, powerpc, riscv, s390, and
+xtensa, and the tag-based KASAN modes are supported only on arm64.
+
+Compilers
+~~~~~~~~~
+
+Software KASAN modes use compile-time instrumentation to insert validity checks
+before every memory access and thus require a compiler version that provides
+support for that. The Hardware Tag-Based mode relies on hardware to perform
+these checks but still requires a compiler version that supports the memory
+tagging instructions.
+
+Generic KASAN requires GCC version 8.3.0 or later
+or any Clang version supported by the kernel.
+
+Software Tag-Based KASAN requires GCC 11+
+or any Clang version supported by the kernel.
+
+Hardware Tag-Based KASAN requires GCC 10+ or Clang 12+.
+
+Memory types
+~~~~~~~~~~~~
+
+Generic KASAN supports finding bugs in all of slab, page_alloc, vmap, vmalloc,
+stack, and global memory.
+
+Software Tag-Based KASAN supports slab, page_alloc, vmalloc, and stack memory.
+
+Hardware Tag-Based KASAN supports slab, page_alloc, and non-executable vmalloc
+memory.
+
+For slab, both software KASAN modes support SLUB and SLAB allocators, while
+Hardware Tag-Based KASAN only supports SLUB.
Usage
-----
@@ -45,18 +82,81 @@ To enable KASAN, configure the kernel with::
CONFIG_KASAN=y
-and choose between ``CONFIG_KASAN_GENERIC`` (to enable generic KASAN),
-``CONFIG_KASAN_SW_TAGS`` (to enable software tag-based KASAN), and
-``CONFIG_KASAN_HW_TAGS`` (to enable hardware tag-based KASAN).
+and choose between ``CONFIG_KASAN_GENERIC`` (to enable Generic KASAN),
+``CONFIG_KASAN_SW_TAGS`` (to enable Software Tag-Based KASAN), and
+``CONFIG_KASAN_HW_TAGS`` (to enable Hardware Tag-Based KASAN).
-For software modes, also choose between ``CONFIG_KASAN_OUTLINE`` and
+For the software modes, also choose between ``CONFIG_KASAN_OUTLINE`` and
``CONFIG_KASAN_INLINE``. Outline and inline are compiler instrumentation types.
-The former produces a smaller binary while the latter is 1.1-2 times faster.
+The former produces a smaller binary while the latter is up to 2 times faster.
To include alloc and free stack traces of affected slab objects into reports,
enable ``CONFIG_STACKTRACE``. To include alloc and free stack traces of affected
physical pages, enable ``CONFIG_PAGE_OWNER`` and boot with ``page_owner=on``.
+Boot parameters
+~~~~~~~~~~~~~~~
+
+KASAN is affected by the generic ``panic_on_warn`` command line parameter.
+When it is enabled, KASAN panics the kernel after printing a bug report.
+
+By default, KASAN prints a bug report only for the first invalid memory access.
+With ``kasan_multi_shot``, KASAN prints a report on every invalid access. This
+effectively disables ``panic_on_warn`` for KASAN reports.
+
+Alternatively, independent of ``panic_on_warn``, the ``kasan.fault=`` boot
+parameter can be used to control panic and reporting behaviour:
+
+- ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN
+ report or also panic the kernel (default: ``report``). The panic happens even
+ if ``kasan_multi_shot`` is enabled.
+
+Software and Hardware Tag-Based KASAN modes (see the section about various
+modes below) support altering stack trace collection behavior:
+
+- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
+ traces collection (default: ``on``).
+- ``kasan.stack_ring_size=<number of entries>`` specifies the number of entries
+ in the stack ring (default: ``32768``).
+
+Hardware Tag-Based KASAN mode is intended for use in production as a security
+mitigation. Therefore, it supports additional boot parameters that allow
+disabling KASAN altogether or controlling its features:
+
+- ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``).
+
+- ``kasan.mode=sync``, ``=async`` or ``=asymm`` controls whether KASAN
+ is configured in synchronous, asynchronous or asymmetric mode of
+ execution (default: ``sync``).
+ Synchronous mode: a bad access is detected immediately when a tag
+ check fault occurs.
+ Asynchronous mode: a bad access detection is delayed. When a tag check
+ fault occurs, the information is stored in hardware (in the TFSR_EL1
+ register for arm64). The kernel periodically checks the hardware and
+ only reports tag faults during these checks.
+ Asymmetric mode: a bad access is detected synchronously on reads and
+ asynchronously on writes.
+
+- ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
+ allocations (default: ``on``).
+
+- ``kasan.page_alloc.sample=<sampling interval>`` makes KASAN tag only every
+ Nth page_alloc allocation with the order equal or greater than
+ ``kasan.page_alloc.sample.order``, where N is the value of the ``sample``
+ parameter (default: ``1``, or tag every such allocation).
+ This parameter is intended to mitigate the performance overhead introduced
+ by KASAN.
+ Note that enabling this parameter makes Hardware Tag-Based KASAN skip checks
+ of allocations chosen by sampling and thus miss bad accesses to these
+ allocations. Use the default value for accurate bug detection.
+
+- ``kasan.page_alloc.sample.order=<minimum page order>`` specifies the minimum
+ order of allocations that are affected by sampling (default: ``3``).
+ Only applies when ``kasan.page_alloc.sample`` is set to a value greater
+ than ``1``.
+ This parameter is intended to allow sampling only large page_alloc
+ allocations, which is the biggest source of the performance overhead.
+
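+As an illustrative example (not a recommended configuration), a production
+arm64 kernel using the Hardware Tag-Based mode might boot with::
+
+	kasan=on kasan.mode=async kasan.stacktrace=off kasan.vmalloc=on
+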
Error reports
~~~~~~~~~~~~~
@@ -146,7 +246,7 @@ is either 8 or 16 aligned bytes depending on KASAN mode. Each number in the
memory state section of the report shows the state of one of the memory
granules that surround the accessed address.
-For generic KASAN, the size of each memory granule is 8. The state of each
+For Generic KASAN, the size of each memory granule is 8. The state of each
granule is encoded in one shadow byte. Those 8 bytes can be accessible,
partially accessible, freed, or be a part of a redzone. KASAN uses the following
encoding for each shadow byte: 00 means that all 8 bytes of the corresponding
@@ -171,41 +271,6 @@ traces point to places in code that interacted with the object but that are not
directly present in the bad access stack trace. Currently, this includes
call_rcu() and workqueue queuing.
-Boot parameters
-~~~~~~~~~~~~~~~
-
-KASAN is affected by the generic ``panic_on_warn`` command line parameter.
-When it is enabled, KASAN panics the kernel after printing a bug report.
-
-By default, KASAN prints a bug report only for the first invalid memory access.
-With ``kasan_multi_shot``, KASAN prints a report on every invalid access. This
-effectively disables ``panic_on_warn`` for KASAN reports.
-
-Alternatively, independent of ``panic_on_warn`` the ``kasan.fault=`` boot
-parameter can be used to control panic and reporting behaviour:
-
-- ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN
- report or also panic the kernel (default: ``report``). The panic happens even
- if ``kasan_multi_shot`` is enabled.
-
-Hardware tag-based KASAN mode (see the section about various modes below) is
-intended for use in production as a security mitigation. Therefore, it supports
-additional boot parameters that allow disabling KASAN or controlling features:
-
-- ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: ``on``).
-
-- ``kasan.mode=sync`` or ``=async`` controls whether KASAN is configured in
- synchronous or asynchronous mode of execution (default: ``sync``).
- Synchronous mode: a bad access is detected immediately when a tag
- check fault occurs.
- Asynchronous mode: a bad access detection is delayed. When a tag check
- fault occurs, the information is stored in hardware (in the TFSR_EL1
- register for arm64). The kernel periodically checks the hardware and
- only reports tag faults during these checks.
-
-- ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free stack
- traces collection (default: ``on``).
-
Implementation details
----------------------
@@ -244,49 +309,46 @@ outline-instrumented kernel.
Generic KASAN is the only mode that delays the reuse of freed objects via
quarantine (see mm/kasan/quarantine.c for implementation).
-Software tag-based KASAN
+Software Tag-Based KASAN
~~~~~~~~~~~~~~~~~~~~~~~~
-Software tag-based KASAN uses a software memory tagging approach to checking
+Software Tag-Based KASAN uses a software memory tagging approach to checking
access validity. It is currently only implemented for the arm64 architecture.
-Software tag-based KASAN uses the Top Byte Ignore (TBI) feature of arm64 CPUs
+Software Tag-Based KASAN uses the Top Byte Ignore (TBI) feature of arm64 CPUs
to store a pointer tag in the top byte of kernel pointers. It uses shadow memory
to store memory tags associated with each 16-byte memory cell (therefore, it
dedicates 1/16th of the kernel memory for shadow memory).
-On each memory allocation, software tag-based KASAN generates a random tag, tags
+On each memory allocation, Software Tag-Based KASAN generates a random tag, tags
the allocated memory with this tag, and embeds the same tag into the returned
pointer.
-Software tag-based KASAN uses compile-time instrumentation to insert checks
+Software Tag-Based KASAN uses compile-time instrumentation to insert checks
before each memory access. These checks make sure that the tag of the memory
that is being accessed is equal to the tag of the pointer that is used to access
-this memory. In case of a tag mismatch, software tag-based KASAN prints a bug
+this memory. In case of a tag mismatch, Software Tag-Based KASAN prints a bug
report.
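+
+As an illustration only (this is not the kernel's actual implementation; the
+helper names are made up, and kernel-style ``u8``/``u64`` types are assumed),
+storing a tag in the TBI-ignored top byte of a pointer and comparing it
+against the memory tag could look like this::
+
+	#define TAG_SHIFT	56
+	#define TAG_MASK	(0xffULL << TAG_SHIFT)
+
+	/* Embed an 8-bit tag into the ignored top byte of a pointer. */
+	static void *set_tag(void *ptr, u8 tag)
+	{
+		return (void *)(((u64)ptr & ~TAG_MASK) |
+				((u64)tag << TAG_SHIFT));
+	}
+
+	/* The check inserted before each access: the pointer tag must match
+	 * the tag stored for the accessed 16-byte memory granule. */
+	static bool tags_match(const void *ptr, u8 memory_tag)
+	{
+		return (u8)((u64)ptr >> TAG_SHIFT) == memory_tag;
+	}
+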
-Software tag-based KASAN also has two instrumentation modes (outline, which
+Software Tag-Based KASAN also has two instrumentation modes (outline, which
emits callbacks to check memory accesses; and inline, which performs the shadow
memory checks inline). With outline instrumentation mode, a bug report is
printed from the function that performs the access check. With inline
instrumentation, a ``brk`` instruction is emitted by the compiler, and a
dedicated ``brk`` handler is used to print bug reports.
-Software tag-based KASAN uses 0xFF as a match-all pointer tag (accesses through
+Software Tag-Based KASAN uses 0xFF as a match-all pointer tag (accesses through
pointers with the 0xFF pointer tag are not checked). The value 0xFE is currently
reserved to tag freed memory regions.
-Software tag-based KASAN currently only supports tagging of slab and page_alloc
-memory.
-
-Hardware tag-based KASAN
+Hardware Tag-Based KASAN
~~~~~~~~~~~~~~~~~~~~~~~~
-Hardware tag-based KASAN is similar to the software mode in concept but uses
+Hardware Tag-Based KASAN is similar to the software mode in concept but uses
hardware memory tagging support instead of compiler instrumentation and
shadow memory.
-Hardware tag-based KASAN is currently only implemented for arm64 architecture
+Hardware Tag-Based KASAN is currently only implemented for arm64 architecture
and based on both arm64 Memory Tagging Extension (MTE) introduced in ARMv8.5
Instruction Set Architecture and Top Byte Ignore (TBI).
@@ -296,26 +358,25 @@ access, hardware makes sure that the tag of the memory that is being accessed is
equal to the tag of the pointer that is used to access this memory. In case of a
tag mismatch, a fault is generated, and a report is printed.
-Hardware tag-based KASAN uses 0xFF as a match-all pointer tag (accesses through
+Hardware Tag-Based KASAN uses 0xFF as a match-all pointer tag (accesses through
pointers with the 0xFF pointer tag are not checked). The value 0xFE is currently
reserved to tag freed memory regions.
-Hardware tag-based KASAN currently only supports tagging of slab and page_alloc
-memory.
-
-If the hardware does not support MTE (pre ARMv8.5), hardware tag-based KASAN
+If the hardware does not support MTE (pre ARMv8.5), Hardware Tag-Based KASAN
will not be enabled. In this case, all KASAN boot parameters are ignored.
Note that enabling CONFIG_KASAN_HW_TAGS always results in in-kernel TBI being
enabled, even when ``kasan.mode=off`` is provided or when the hardware does not
support MTE (but supports TBI).
-Hardware tag-based KASAN only reports the first found bug. After that, MTE tag
+Hardware Tag-Based KASAN only reports the first found bug. After that, MTE tag
checking gets disabled.
Shadow memory
-------------
+The contents of this section are only applicable to software KASAN modes.
+
The kernel maps memory in several different parts of the address space.
The range of kernel virtual addresses is large: there is not enough real
memory to support a real shadow region for every address that could be
@@ -346,7 +407,7 @@ CONFIG_KASAN_VMALLOC
With ``CONFIG_KASAN_VMALLOC``, KASAN can cover vmalloc space at the
cost of greater memory usage. Currently, this is supported on x86,
-riscv, s390, and powerpc.
+arm64, riscv, s390, and powerpc.
This works by hooking into vmalloc and vmap and dynamically
allocating real shadow memory to back the mappings.
@@ -406,19 +467,18 @@ generic ``noinstr`` one.
Note that disabling compiler instrumentation (either on a per-file or a
per-function basis) makes KASAN ignore the accesses that happen directly in
that code for software KASAN modes. It does not help when the accesses happen
-indirectly (through calls to instrumented functions) or with the hardware
-tag-based mode that does not use compiler instrumentation.
+indirectly (through calls to instrumented functions) or with Hardware
+Tag-Based KASAN, which does not use compiler instrumentation.
For software KASAN modes, to disable KASAN reports in a part of the kernel code
for the current task, annotate this part of the code with a
``kasan_disable_current()``/``kasan_enable_current()`` section. This also
disables the reports for indirect accesses that happen through function calls.
-For tag-based KASAN modes (include the hardware one), to disable access
-checking, use ``kasan_reset_tag()`` or ``page_kasan_tag_reset()``. Note that
-temporarily disabling access checking via ``page_kasan_tag_reset()`` requires
-saving and restoring the per-page KASAN tag via
-``page_kasan_tag``/``page_kasan_tag_set``.
+For tag-based KASAN modes, to disable access checking, use
+``kasan_reset_tag()`` or ``page_kasan_tag_reset()``. Note that temporarily
+disabling access checking via ``page_kasan_tag_reset()`` requires saving and
+restoring the per-page KASAN tag via ``page_kasan_tag``/``page_kasan_tag_set``.
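+
+A minimal sketch of both mechanisms (``probe_object()`` is a made-up helper)::
+
+	/* Software KASAN modes: suppress reports for the current task. */
+	kasan_disable_current();
+	probe_object(ptr);
+	kasan_enable_current();
+
+	/* Tag-based KASAN modes: a pointer with its tag reset to the
+	 * match-all value is not checked on access. */
+	probe_object(kasan_reset_tag(ptr));
+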
Tests
~~~~~
diff --git a/Documentation/dev-tools/kcov.rst b/Documentation/dev-tools/kcov.rst
index d2c4c27e1702..6611434e2dd2 100644
--- a/Documentation/dev-tools/kcov.rst
+++ b/Documentation/dev-tools/kcov.rst
@@ -1,42 +1,50 @@
-kcov: code coverage for fuzzing
+KCOV: code coverage for fuzzing
===============================
-kcov exposes kernel code coverage information in a form suitable for coverage-
-guided fuzzing (randomized testing). Coverage data of a running kernel is
-exported via the "kcov" debugfs file. Coverage collection is enabled on a task
-basis, and thus it can capture precise coverage of a single system call.
+KCOV collects and exposes kernel code coverage information in a form suitable
+for coverage-guided fuzzing. Coverage data of a running kernel is exported via
+the ``kcov`` debugfs file. Coverage collection is enabled on a task basis, and
+thus KCOV can capture precise coverage of a single system call.
-Note that kcov does not aim to collect as much coverage as possible. It aims
-to collect more or less stable coverage that is function of syscall inputs.
-To achieve this goal it does not collect coverage in soft/hard interrupts
-and instrumentation of some inherently non-deterministic parts of kernel is
-disabled (e.g. scheduler, locking).
+Note that KCOV does not aim to collect as much coverage as possible. It aims
+to collect more or less stable coverage that is a function of syscall inputs.
+To achieve this goal, it does not collect coverage in soft/hard interrupts
+(unless remote coverage collection is enabled, see below) and from some
+inherently non-deterministic parts of the kernel (e.g. scheduler, locking).
-kcov is also able to collect comparison operands from the instrumented code
-(this feature currently requires that the kernel is compiled with clang).
+Besides collecting code coverage, KCOV can also collect comparison operands.
+See the "Comparison operands collection" section for details.
+
+Besides collecting coverage data from syscall handlers, KCOV can also collect
+coverage for annotated parts of the kernel executing in background kernel
+tasks or soft interrupts. See the "Remote coverage collection" section for
+details.
Prerequisites
-------------
-Configure the kernel with::
+KCOV relies on compiler instrumentation and requires GCC 6.1.0 or later,
+or any Clang version supported by the kernel.
- CONFIG_KCOV=y
+Collecting comparison operands is supported with GCC 8+ or with Clang.
+
+To enable KCOV, configure the kernel with::
-CONFIG_KCOV requires gcc 6.1.0 or later.
+ CONFIG_KCOV=y
-If the comparison operands need to be collected, set::
+To enable comparison operands collection, set::
CONFIG_KCOV_ENABLE_COMPARISONS=y
-Profiling data will only become accessible once debugfs has been mounted::
+Coverage data only becomes accessible once debugfs has been mounted::
mount -t debugfs none /sys/kernel/debug
Coverage collection
-------------------
-The following program demonstrates coverage collection from within a test
-program using kcov:
+The following program demonstrates how to use KCOV to collect coverage for a
+single syscall from within a test program:
.. code-block:: c
@@ -50,6 +58,7 @@ program using kcov:
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>
+ #include <linux/types.h>
#define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
#define KCOV_ENABLE _IO('c', 100)
@@ -83,7 +92,7 @@ program using kcov:
perror("ioctl"), exit(1);
/* Reset coverage from the tail of the ioctl() call. */
__atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);
- /* That's the target syscal call. */
+ /* Call the target syscall. */
read(-1, NULL, 0);
/* Read number of PCs collected. */
n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
@@ -102,7 +111,7 @@ program using kcov:
return 0;
}
-After piping through addr2line output of the program looks as follows::
+After piping through ``addr2line``, the output of the program looks as follows::
SyS_read
fs/read_write.c:562
@@ -120,12 +129,13 @@ After piping through addr2line output of the program looks as follows::
fs/read_write.c:562
If a program needs to collect coverage from several threads (independently),
-it needs to open /sys/kernel/debug/kcov in each thread separately.
+it needs to open ``/sys/kernel/debug/kcov`` in each thread separately.
The interface is fine-grained to allow efficient forking of test processes.
-That is, a parent process opens /sys/kernel/debug/kcov, enables trace mode,
-mmaps coverage buffer and then forks child processes in a loop. Child processes
-only need to enable coverage (disable happens automatically on thread end).
+That is, a parent process opens ``/sys/kernel/debug/kcov``, enables trace mode,
+mmaps coverage buffer, and then forks child processes in a loop. The child
+processes only need to enable coverage (it gets disabled automatically when
+a thread exits).
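+
+A hedged sketch of this pattern (reusing the defines from the example above
+and omitting error handling):
+
+.. code-block:: c
+
+    int fd = open("/sys/kernel/debug/kcov", O_RDWR);
+    unsigned long *cover;
+
+    /* Parent: set up the trace once and map the coverage buffer. */
+    ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
+    cover = (unsigned long *)mmap(NULL, COVER_SIZE * sizeof(unsigned long),
+                                  PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+    for (;;) {
+        if (fork() == 0) {
+            /* Child: enabling is all that is needed; coverage collection
+             * gets disabled automatically when the child exits. */
+            ioctl(fd, KCOV_ENABLE, KCOV_TRACE_PC);
+            __atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);
+            /* ... run the test and inspect cover[] ... */
+            _exit(0);
+        }
+        wait(NULL);
+    }
+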
Comparison operands collection
------------------------------
@@ -177,6 +187,8 @@ Comparison operands collection is similar to coverage collection:
/* Read number of comparisons collected. */
n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
for (i = 0; i < n; i++) {
+ uint64_t ip;
+
type = cover[i * KCOV_WORDS_PER_CMP + 1];
/* arg1 and arg2 - operands of the comparison. */
arg1 = cover[i * KCOV_WORDS_PER_CMP + 2];
@@ -202,55 +214,83 @@ Comparison operands collection is similar to coverage collection:
return 0;
}
-Note that the kcov modes (coverage collection or comparison operands) are
-mutually exclusive.
+Note that the KCOV modes (collection of code coverage or comparison operands)
+are mutually exclusive.
Remote coverage collection
--------------------------
-With KCOV_ENABLE coverage is collected only for syscalls that are issued
-from the current process. With KCOV_REMOTE_ENABLE it's possible to collect
-coverage for arbitrary parts of the kernel code, provided that those parts
-are annotated with kcov_remote_start()/kcov_remote_stop().
-
-This allows to collect coverage from two types of kernel background
-threads: the global ones, that are spawned during kernel boot in a limited
-number of instances (e.g. one USB hub_event() worker thread is spawned per
-USB HCD); and the local ones, that are spawned when a user interacts with
-some kernel interface (e.g. vhost workers); as well as from soft
-interrupts.
-
-To enable collecting coverage from a global background thread or from a
-softirq, a unique global handle must be assigned and passed to the
-corresponding kcov_remote_start() call. Then a userspace process can pass
-a list of such handles to the KCOV_REMOTE_ENABLE ioctl in the handles
-array field of the kcov_remote_arg struct. This will attach the used kcov
-device to the code sections, that are referenced by those handles.
-
-Since there might be many local background threads spawned from different
-userspace processes, we can't use a single global handle per annotation.
-Instead, the userspace process passes a non-zero handle through the
-common_handle field of the kcov_remote_arg struct. This common handle gets
-saved to the kcov_handle field in the current task_struct and needs to be
-passed to the newly spawned threads via custom annotations. Those threads
-should in turn be annotated with kcov_remote_start()/kcov_remote_stop().
-
-Internally kcov stores handles as u64 integers. The top byte of a handle
-is used to denote the id of a subsystem that this handle belongs to, and
-the lower 4 bytes are used to denote the id of a thread instance within
-that subsystem. A reserved value 0 is used as a subsystem id for common
-handles as they don't belong to a particular subsystem. The bytes 4-7 are
-currently reserved and must be zero. In the future the number of bytes
-used for the subsystem or handle ids might be increased.
-
-When a particular userspace process collects coverage via a common
-handle, kcov will collect coverage for each code section that is annotated
-to use the common handle obtained as kcov_handle from the current
-task_struct. However non common handles allow to collect coverage
-selectively from different subsystems.
+Besides collecting coverage data from handlers of syscalls issued from a
+userspace process, KCOV can also collect coverage for parts of the kernel
+executing in other contexts - so-called "remote" coverage.
+
+Using KCOV to collect remote coverage requires:
+
+1. Modifying kernel code to annotate the code section from where coverage
+ should be collected with ``kcov_remote_start`` and ``kcov_remote_stop``.
+
+2. Using ``KCOV_REMOTE_ENABLE`` instead of ``KCOV_ENABLE`` in the userspace
+ process that collects coverage.
+
+Both ``kcov_remote_start`` and ``kcov_remote_stop`` annotations and the
+``KCOV_REMOTE_ENABLE`` ioctl accept handles that identify particular coverage
+collection sections. The way a handle is used depends on the context where the
+matching code section executes.
+
+KCOV supports collecting remote coverage from the following contexts:
+
+1. Global kernel background tasks. These are the tasks that are spawned during
+ kernel boot in a limited number of instances (e.g. one USB ``hub_event``
+ worker is spawned per USB HCD).
+
+2. Local kernel background tasks. These are spawned when a userspace process
+ interacts with some kernel interface and are usually killed when the process
+ exits (e.g. vhost workers).
+
+3. Soft interrupts.
+
+For #1 and #3, a unique global handle must be chosen and passed to the
+corresponding ``kcov_remote_start`` call. Then a userspace process must pass
+this handle to ``KCOV_REMOTE_ENABLE`` in the ``handles`` array field of the
+``kcov_remote_arg`` struct. This will attach the used KCOV device to the code
+section referenced by this handle. Multiple global handles identifying
+different code sections can be passed at once.
+
+For #2, the userspace process instead must pass a non-zero handle through the
+``common_handle`` field of the ``kcov_remote_arg`` struct. This common handle
+gets saved to the ``kcov_handle`` field in the current ``task_struct`` and
+needs to be passed to the newly spawned local tasks via custom kernel code
+modifications. Those tasks should in turn use the passed handle in their
+``kcov_remote_start`` and ``kcov_remote_stop`` annotations.
+
+KCOV follows a predefined format for both global and common handles. Each
+handle is a ``u64`` integer. Currently, only the top byte and the lower 4
+bytes are used. Bytes 4-7 are reserved and must be zero.
+
+For global handles, the top byte of the handle denotes the id of a subsystem
+this handle belongs to. For example, KCOV uses ``1`` as the USB subsystem id.
+The lower 4 bytes of a global handle denote the id of a task instance within
+that subsystem. For example, each ``hub_event`` worker uses the USB bus number
+as the task instance id.
+
+For common handles, a reserved value ``0`` is used as a subsystem id, as such
+handles don't belong to a particular subsystem. The lower 4 bytes of a common
+handle identify a collective instance of all local tasks spawned by the
+userspace process that passed a common handle to ``KCOV_REMOTE_ENABLE``.
+
+In practice, any value can be used for the common handle instance id if
+coverage is only collected from a single userspace process on the system.
+However, if common handles are used by multiple processes, unique instance
+ids must be used for each process. One option is to use the process id as
+the common handle instance id.
+
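+A minimal sketch of constructing both kinds of handles, assuming the
+``kcov_remote_handle()`` helper and the ``KCOV_SUBSYSTEM_*`` defines from the
+``linux/kcov.h`` uapi header are available:
+
+.. code-block:: c
+
+    #include <linux/kcov.h>
+    #include <unistd.h>
+
+    static void make_handles(__u64 *global_handle, __u64 *common_handle)
+    {
+        /* Global handle: USB subsystem id, instance = USB bus number 1. */
+        *global_handle = kcov_remote_handle(KCOV_SUBSYSTEM_USB, 1);
+        /* Common handle: reserved subsystem id 0, pid as the instance id. */
+        *common_handle = kcov_remote_handle(KCOV_SUBSYSTEM_COMMON, getpid());
+    }
+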
+The following program demonstrates using KCOV to collect coverage from both
+local tasks spawned by the process and the global task that handles USB bus #1:
.. code-block:: c
+ /* Same includes and defines as above. */
+
struct kcov_remote_arg {
__u32 trace_mode;
__u32 area_size;
diff --git a/Documentation/dev-tools/kcsan.rst b/Documentation/dev-tools/kcsan.rst
index 7db43c7c09b8..3ae866dcc924 100644
--- a/Documentation/dev-tools/kcsan.rst
+++ b/Documentation/dev-tools/kcsan.rst
@@ -204,17 +204,17 @@ Ultimately this allows to determine the possible executions of concurrent code,
and if that code is free from data races.
KCSAN is aware of *marked atomic operations* (``READ_ONCE``, ``WRITE_ONCE``,
-``atomic_*``, etc.), but is oblivious of any ordering guarantees and simply
-assumes that memory barriers are placed correctly. In other words, KCSAN
-assumes that as long as a plain access is not observed to race with another
-conflicting access, memory operations are correctly ordered.
-
-This means that KCSAN will not report *potential* data races due to missing
-memory ordering. Developers should therefore carefully consider the required
-memory ordering requirements that remain unchecked. If, however, missing
-memory ordering (that is observable with a particular compiler and
-architecture) leads to an observable data race (e.g. entering a critical
-section erroneously), KCSAN would report the resulting data race.
+``atomic_*``, etc.), and a subset of ordering guarantees implied by memory
+barriers. With ``CONFIG_KCSAN_WEAK_MEMORY=y``, KCSAN models load or store
+buffering, and can detect missing ``smp_mb()``, ``smp_wmb()``, ``smp_rmb()``,
+``smp_store_release()``, and all ``atomic_*`` operations with equivalent
+implied barriers.
+
+Note, KCSAN will not report all data races due to missing memory ordering,
+specifically where a memory barrier would be required to prohibit subsequent
+memory operations from reordering before the barrier. Developers should
+therefore carefully consider the memory ordering requirements that remain
+unchecked.
Race Detection Beyond Data Races
--------------------------------
@@ -268,6 +268,56 @@ marked operations, if all accesses to a variable that is accessed concurrently
are properly marked, KCSAN will never trigger a watchpoint and therefore never
report the accesses.
+Modeling Weak Memory
+~~~~~~~~~~~~~~~~~~~~
+
+KCSAN's approach to detecting data races due to missing memory barriers is
+based on modeling access reordering (with ``CONFIG_KCSAN_WEAK_MEMORY=y``).
+Each plain memory access for which a watchpoint is set up is also selected for
+simulated reordering within the scope of its function (at most 1 in-flight
+access).
+
+Once an access has been selected for reordering, it is checked together with
+every other access until the end of the function scope. If an appropriate memory
+barrier is encountered, the access will no longer be considered for simulated
+reordering.
+
+When the result of a memory operation should be ordered by a barrier, KCSAN can
+then detect data races where the conflict only occurs as a result of a missing
+barrier. Consider the example::
+
+ int x, flag;
+ void T1(void)
+ {
+ x = 1; // data race!
+ WRITE_ONCE(flag, 1); // correct: smp_store_release(&flag, 1)
+ }
+ void T2(void)
+ {
+ while (!READ_ONCE(flag)); // correct: smp_load_acquire(&flag)
+ ... = x; // data race!
+ }
+
+When weak memory modeling is enabled, KCSAN can consider ``x`` in ``T1`` for
+simulated reordering. After the write of ``flag``, ``x`` is again checked for
+concurrent accesses: because ``T2`` is able to proceed after the write of
+``flag``, a data race is detected. With the correct barriers in place, ``x``
+would not be considered for reordering after the proper release of ``flag``,
+and no data race would be detected.
+
+Deliberate trade-offs in complexity but also practical limitations mean only a
+subset of data races due to missing memory barriers can be detected. With
+currently available compiler support, the implementation is limited to modeling
+the effects of "buffering" (delaying accesses), since the runtime cannot
+"prefetch" accesses. Also recall that watchpoints are only set up for plain
+accesses, and the only access type for which KCSAN simulates reordering. This
+means reordering of marked accesses is not modeled.
+
+A consequence of the above is that acquire operations do not require barrier
+instrumentation (no prefetching). Furthermore, marked accesses introducing
+address or control dependencies do not require special handling (the marked
+access cannot be reordered, later dependent accesses cannot be prefetched).
+
Key Properties
~~~~~~~~~~~~~~
@@ -290,8 +340,8 @@ Key Properties
4. **Detects Racy Writes from Devices:** Due to checking data values upon
setting up watchpoints, racy writes from devices can also be detected.
-5. **Memory Ordering:** KCSAN is *not* explicitly aware of the LKMM's ordering
- rules; this may result in missed data races (false negatives).
+5. **Memory Ordering:** KCSAN is aware of only a subset of LKMM ordering rules;
+ this may result in missed data races (false negatives).
6. **Analysis Accuracy:** For observed executions, due to using a sampling
strategy, the analysis is *unsound* (false negatives possible), but aims to
diff --git a/Documentation/dev-tools/kfence.rst b/Documentation/dev-tools/kfence.rst
index 0fbe3308bf37..936f6aaa75c8 100644
--- a/Documentation/dev-tools/kfence.rst
+++ b/Documentation/dev-tools/kfence.rst
@@ -41,6 +41,18 @@ guarded by KFENCE. The default is configurable via the Kconfig option
``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
disables KFENCE.
+The sample interval controls a timer that sets up KFENCE allocations. By
+default, to keep the real sample interval predictable, the normal timer also
+causes CPU wake-ups when the system is completely idle. This may be undesirable
+on power-constrained systems. The boot parameter ``kfence.deferrable=1``
+instead switches to a "deferrable" timer which does not force CPU wake-ups on
+idle systems, at the risk of unpredictable sample intervals. The default is
+configurable via the Kconfig option ``CONFIG_KFENCE_DEFERRABLE``.
+
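+For example, an illustrative command line (values made up for illustration)
+combining a 100 millisecond sample interval with the deferrable timer::
+
+	kfence.sample_interval=100 kfence.deferrable=1
+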
+.. warning::
+ The KUnit test suite is very likely to fail when using a deferrable timer
+ since it currently causes very unpredictable sample intervals.
+
The KFENCE memory pool is of fixed size, and if the pool is exhausted, no
further KFENCE allocations occur. With ``CONFIG_KFENCE_NUM_OBJECTS`` (default
255), the number of available guarded objects can be controlled. Each object
@@ -231,10 +243,14 @@ Guarded allocations are set up based on the sample interval. After expiration
of the sample interval, the next allocation through the main allocator (SLAB or
SLUB) returns a guarded allocation from the KFENCE object pool (allocation
sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and
-the next allocation is set up after the expiration of the interval. To "gate" a
-KFENCE allocation through the main allocator's fast-path without overhead,
-KFENCE relies on static branches via the static keys infrastructure. The static
-branch is toggled to redirect the allocation to KFENCE.
+the next allocation is set up after the expiration of the interval.
+
+When using ``CONFIG_KFENCE_STATIC_KEYS=y``, KFENCE allocations are "gated"
+through the main allocator's fast-path by relying on static branches via the
+static keys infrastructure. The static branch is toggled to redirect the
+allocation to KFENCE. Depending on sample interval, target workloads, and
+system architecture, this may perform better than the simple dynamic branch.
+Careful benchmarking is recommended.
KFENCE objects each reside on a dedicated page, at either the left or right
page boundaries selected at random. The pages to the left and right of the
@@ -269,6 +285,17 @@ tail of KFENCE's freelist, so that the least recently freed objects are reused
first, and the chances of detecting use-after-frees of recently freed objects
is increased.
+If pool utilization reaches 75% (default) or above, KFENCE limits currently
+covered allocations of the same source from further filling up the pool. This
+reduces the risk of the pool eventually being fully occupied by allocated
+objects while still ensuring diverse coverage of allocations. The "source" of
+an allocation is based on its partial allocation stack trace. A side-effect is
+that this also limits frequent long-lived allocations (e.g. pagecache) of the
+same source from permanently filling up the pool, which is the most common
+risk for the pool becoming full and the sampled allocation rate dropping to
+zero. The threshold at which to start limiting currently covered allocations
+can be configured via the boot parameter ``kfence.skip_covered_thresh``
+(pool usage%); see the example below.
+
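+For example, to only start limiting covered allocations once the pool is 90%
+occupied (an illustrative value)::
+
+	kfence.skip_covered_thresh=90
+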
Interface
---------
diff --git a/Documentation/dev-tools/kgdb.rst b/Documentation/dev-tools/kgdb.rst
index 43456244651a..f83ba2601e55 100644
--- a/Documentation/dev-tools/kgdb.rst
+++ b/Documentation/dev-tools/kgdb.rst
@@ -402,7 +402,7 @@ This is a quick example of how to use kdb.
2. Enter the kernel debugger manually or by waiting for an oops or
fault. There are several ways you can enter the kernel debugger
manually; all involve using the :kbd:`SysRq-G`, which means you must have
- enabled ``CONFIG_MAGIC_SysRq=y`` in your kernel config.
+ enabled ``CONFIG_MAGIC_SYSRQ=y`` in your kernel config.
- When logged in as root or with a super user session you can run::
@@ -461,7 +461,7 @@ This is a quick example of how to use kdb with a keyboard.
2. Enter the kernel debugger manually or by waiting for an oops or
fault. There are several ways you can enter the kernel debugger
manually; all involve using the :kbd:`SysRq-G`, which means you must have
- enabled ``CONFIG_MAGIC_SysRq=y`` in your kernel config.
+ enabled ``CONFIG_MAGIC_SYSRQ=y`` in your kernel config.
- When logged in as root or with a super user session you can run::
@@ -557,7 +557,7 @@ Connecting with gdb to a serial port
Example (using a directly connected port)::
% gdb ./vmlinux
- (gdb) set remotebaud 115200
+ (gdb) set serial baud 115200
(gdb) target remote /dev/ttyS0
diff --git a/Documentation/dev-tools/kmemleak.rst b/Documentation/dev-tools/kmemleak.rst
index 1c935f41cd3a..2cb00b53339f 100644
--- a/Documentation/dev-tools/kmemleak.rst
+++ b/Documentation/dev-tools/kmemleak.rst
@@ -174,7 +174,6 @@ mapping:
- ``kmemleak_alloc_phys``
- ``kmemleak_free_part_phys``
-- ``kmemleak_not_leak_phys``
- ``kmemleak_ignore_phys``
Dealing with false positives/negatives
@@ -228,7 +227,7 @@ Testing with kmemleak-test
--------------------------
To check if you have all set up to use kmemleak, you can use the kmemleak-test
-module, a module that deliberately leaks memory. Set CONFIG_DEBUG_KMEMLEAK_TEST
+module, a module that deliberately leaks memory. Set CONFIG_SAMPLE_KMEMLEAK
as module (it can't be used as built-in) and boot the kernel with kmemleak
enabled. Load the module and perform a scan with::
diff --git a/Documentation/dev-tools/kmsan.rst b/Documentation/dev-tools/kmsan.rst
new file mode 100644
index 000000000000..55fa82212eb2
--- /dev/null
+++ b/Documentation/dev-tools/kmsan.rst
@@ -0,0 +1,428 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. Copyright (C) 2022, Google LLC.
+
+===================================
+The Kernel Memory Sanitizer (KMSAN)
+===================================
+
+KMSAN is a dynamic error detector aimed at finding uses of uninitialized
+values. It is based on compiler instrumentation, and is quite similar to the
+userspace `MemorySanitizer tool`_.
+
+An important note is that KMSAN is not intended for production use, because it
+drastically increases kernel memory footprint and slows the whole system down.
+
+Usage
+=====
+
+Building the kernel
+-------------------
+
+In order to build a kernel with KMSAN you will need a fresh Clang (14.0.6+).
+Please refer to `LLVM documentation`_ for the instructions on how to build Clang.
+
+Now configure and build the kernel with CONFIG_KMSAN enabled.
+
+Example report
+--------------
+
+Here is an example of a KMSAN report::
+
+ =====================================================
+ BUG: KMSAN: uninit-value in test_uninit_kmsan_check_memory+0x1be/0x380 [kmsan_test]
+ test_uninit_kmsan_check_memory+0x1be/0x380 mm/kmsan/kmsan_test.c:273
+ kunit_run_case_internal lib/kunit/test.c:333
+ kunit_try_run_case+0x206/0x420 lib/kunit/test.c:374
+ kunit_generic_run_threadfn_adapter+0x6d/0xc0 lib/kunit/try-catch.c:28
+ kthread+0x721/0x850 kernel/kthread.c:327
+ ret_from_fork+0x1f/0x30 ??:?
+
+ Uninit was stored to memory at:
+ do_uninit_local_array+0xfa/0x110 mm/kmsan/kmsan_test.c:260
+ test_uninit_kmsan_check_memory+0x1a2/0x380 mm/kmsan/kmsan_test.c:271
+ kunit_run_case_internal lib/kunit/test.c:333
+ kunit_try_run_case+0x206/0x420 lib/kunit/test.c:374
+ kunit_generic_run_threadfn_adapter+0x6d/0xc0 lib/kunit/try-catch.c:28
+ kthread+0x721/0x850 kernel/kthread.c:327
+ ret_from_fork+0x1f/0x30 ??:?
+
+ Local variable uninit created at:
+ do_uninit_local_array+0x4a/0x110 mm/kmsan/kmsan_test.c:256
+ test_uninit_kmsan_check_memory+0x1a2/0x380 mm/kmsan/kmsan_test.c:271
+
+ Bytes 4-7 of 8 are uninitialized
+ Memory access of size 8 starts at ffff888083fe3da0
+
+ CPU: 0 PID: 6731 Comm: kunit_try_catch Tainted: G B E 5.16.0-rc3+ #104
+ Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
+ =====================================================
+
+The report says that the local variable ``uninit`` was created uninitialized in
+``do_uninit_local_array()``. The third stack trace corresponds to the place
+where this variable was created.
+
+The first stack trace shows where the uninit value was used (in
+``test_uninit_kmsan_check_memory()``). The tool shows the bytes which were left
+uninitialized in the local variable, as well as the stack where the value was
+copied to another memory location before use.
+
+A use of uninitialized value ``v`` is reported by KMSAN in the following cases:
+
+ - in a condition, e.g. ``if (v) { ... }``;
+ - in an indexing or pointer dereferencing, e.g. ``array[v]`` or ``*v``;
+ - when it is copied to userspace or hardware, e.g. ``copy_to_user(..., &v, ...)``;
+ - when it is passed as an argument to a function, and
+ ``CONFIG_KMSAN_CHECK_PARAM_RETVAL`` is enabled (see below).
+
+The mentioned cases (apart from copying data to userspace or hardware, which is
+a security issue) are considered undefined behavior from the C11 Standard point
+of view.
+
+Disabling the instrumentation
+-----------------------------
+
+A function can be marked with ``__no_kmsan_checks``. Doing so makes KMSAN
+ignore uninitialized values in that function and mark its output as initialized.
+As a result, the user will not get KMSAN reports related to that function.
+
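+A hedged sketch (the device-read helper is made up for illustration)::
+
+	/* KMSAN will not report uses of uninitialized values here, and the
+	 * contents of buf are treated as initialized on return. */
+	__no_kmsan_checks
+	static void fill_from_device(u8 *buf, size_t len)
+	{
+		read_device_fifo(buf, len); /* hypothetical helper */
+	}
+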
+Another function attribute supported by KMSAN is ``__no_sanitize_memory``.
+Applying this attribute to a function will result in KMSAN not instrumenting
+it, which can be helpful if we do not want the compiler to interfere with some
+low-level code (e.g. that marked with ``noinstr`` which implicitly adds
+``__no_sanitize_memory``).
+
+This however comes at a cost: stack allocations from such functions will have
+incorrect shadow/origin values, likely leading to false positives. Functions
+called from non-instrumented code may also receive incorrect metadata for their
+parameters.
+
+As a rule of thumb, avoid using ``__no_sanitize_memory`` explicitly.
+
+It is also possible to disable KMSAN for a single file (e.g. main.o)::
+
+ KMSAN_SANITIZE_main.o := n
+
+or for the whole directory::
+
+ KMSAN_SANITIZE := n
+
+in the Makefile. Think of this as applying ``__no_sanitize_memory`` to every
+function in the file or directory. Most users won't need KMSAN_SANITIZE, unless
+their code gets broken by KMSAN (e.g. runs at early boot time).
+
+Support
+=======
+
+In order for KMSAN to work, the kernel must be built with Clang, which so far is
+the only compiler that has KMSAN support. The kernel instrumentation pass is
+based on the userspace `MemorySanitizer tool`_.
+
+The runtime library only supports x86_64 at the moment.
+
+How KMSAN works
+===============
+
+KMSAN shadow memory
+-------------------
+
+KMSAN associates a metadata byte (also called shadow byte) with every byte of
+kernel memory. A bit in the shadow byte is set iff the corresponding bit of the
+kernel memory byte is uninitialized. Marking the memory uninitialized (i.e.
+setting its shadow bytes to ``0xff``) is called poisoning, marking it
+initialized (setting the shadow bytes to ``0x00``) is called unpoisoning.
+
+When a new variable is allocated on the stack, it is poisoned by default by
+instrumentation code inserted by the compiler (unless it is a stack variable
+that is immediately initialized). Any new heap allocation done without
+``__GFP_ZERO`` is also poisoned.
+
+Compiler instrumentation also tracks the shadow values as they are used along
+the code. When needed, instrumentation code invokes the runtime library in
+``mm/kmsan/`` to persist shadow values.
+
+The shadow value of a basic or compound type is an array of bytes of the same
+length. When a constant value is written into memory, that memory is unpoisoned.
+When a value is read from memory, its shadow memory is also obtained and
+propagated into all the operations which use that value. For every instruction
+that takes one or more values the compiler generates code that calculates the
+shadow of the result depending on those values and their shadows.
+
+Example::
+
+ int a = 0xff; // i.e. 0x000000ff
+ int b;
+ int c = a | b;
+
+In this case the shadow of ``a`` is ``0``, shadow of ``b`` is ``0xffffffff``,
+shadow of ``c`` is ``0xffffff00``. This means that the upper three bytes of
+``c`` are uninitialized, while the lower byte is initialized.
+
+Origin tracking
+---------------
+
+Every four bytes of kernel memory also have a so-called origin mapped to them.
+This origin describes the point in program execution at which the uninitialized
+value was created. Every origin is associated with either the full allocation
+stack (for heap-allocated memory), or the function containing the uninitialized
+variable (for locals).
+
+When an uninitialized variable is allocated on stack or heap, a new origin
+value is created, and that variable's origin is filled with that value. When a
+value is read from memory, its origin is also read and kept together with the
+shadow. For every instruction that takes one or more values, the origin of the
+result is one of the origins corresponding to any of the uninitialized inputs.
+If a poisoned value is written into memory, its origin is written to the
+corresponding storage as well.
+
+Example 1::
+
+ int a = 42;
+ int b;
+ int c = a + b;
+
+In this case the origin of ``b`` is generated upon function entry, and is
+stored to the origin of ``c`` right before the addition result is written into
+memory.
+
+Several variables may share the same origin address, if they are stored in the
+same four-byte chunk. In this case every write to either variable updates the
+origin for all of them. We have to sacrifice precision in this case, because
+storing origins for individual bits (and even bytes) would be too costly.
+
+Example 2::
+
+ int combine(short a, short b) {
+ union ret_t {
+ int i;
+ short s[2];
+ } ret;
+ ret.s[0] = a;
+ ret.s[1] = b;
+ return ret.i;
+ }
+
+If ``a`` is initialized and ``b`` is not, the shadow of the result would be
+0xffff0000, and the origin of the result would be the origin of ``b``.
+``ret.s[0]`` would have the same origin, but it will never be used, because
+that variable is initialized.
+
+If both function arguments are uninitialized, only the origin of the second
+argument is preserved.
+
+Origin chaining
+~~~~~~~~~~~~~~~
+
+To ease debugging, KMSAN creates a new origin for every store of an
+uninitialized value to memory. The new origin references both its creation stack
+and the previous origin the value had. This may cause increased memory
+consumption, so we limit the length of origin chains in the runtime.
+
+Clang instrumentation API
+-------------------------
+
+The Clang instrumentation pass inserts calls to functions defined in
+``mm/kmsan/instrumentation.c`` into the kernel code.
+
+Shadow manipulation
+~~~~~~~~~~~~~~~~~~~
+
+For every memory access the compiler emits a call to a function that returns a
+pair of pointers to the shadow and origin addresses of the given memory::
+
+ typedef struct {
+ void *shadow, *origin;
+ } shadow_origin_ptr_t
+
+ shadow_origin_ptr_t __msan_metadata_ptr_for_load_{1,2,4,8}(void *addr)
+ shadow_origin_ptr_t __msan_metadata_ptr_for_store_{1,2,4,8}(void *addr)
+ shadow_origin_ptr_t __msan_metadata_ptr_for_load_n(void *addr, uintptr_t size)
+ shadow_origin_ptr_t __msan_metadata_ptr_for_store_n(void *addr, uintptr_t size)
+
+The function name depends on the memory access size.
+
+The compiler makes sure that for every loaded value its shadow and origin
+values are read from memory. When a value is stored to memory, its shadow and
+origin are also stored using the metadata pointers.
+
+Handling locals
+~~~~~~~~~~~~~~~
+
+A special function is used to create a new origin value for a local variable and
+set the origin of that variable to that value::
+
+ void __msan_poison_alloca(void *addr, uintptr_t size, char *descr)
+
+Access to per-task data
+~~~~~~~~~~~~~~~~~~~~~~~
+
+At the beginning of every instrumented function KMSAN inserts a call to
+``__msan_get_context_state()``::
+
+ kmsan_context_state *__msan_get_context_state(void)
+
+``kmsan_context_state`` is declared in ``include/linux/kmsan.h``::
+
+ struct kmsan_context_state {
+ char param_tls[KMSAN_PARAM_SIZE];
+ char retval_tls[KMSAN_RETVAL_SIZE];
+ char va_arg_tls[KMSAN_PARAM_SIZE];
+ char va_arg_origin_tls[KMSAN_PARAM_SIZE];
+ u64 va_arg_overflow_size_tls;
+ char param_origin_tls[KMSAN_PARAM_SIZE];
+ depot_stack_handle_t retval_origin_tls;
+ };
+
+This structure is used by KMSAN to pass parameter shadows and origins between
+instrumented functions (unless the parameters are checked immediately by
+``CONFIG_KMSAN_CHECK_PARAM_RETVAL``).
+
+Passing uninitialized values to functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Clang's MemorySanitizer instrumentation has an option,
+``-fsanitize-memory-param-retval``, which makes the compiler check function
+parameters passed by value, as well as function return values.
+
+The option is controlled by ``CONFIG_KMSAN_CHECK_PARAM_RETVAL``, which is
+enabled by default to let KMSAN report uninitialized values earlier.
+Please refer to the `LKML discussion`_ for more details.
+
+Because of the way the checks are implemented in LLVM (they are only applied to
+parameters marked as ``noundef``), not all parameters are guaranteed to be
+checked, so we cannot give up the metadata storage in ``kmsan_context_state``.
+
+String functions
+~~~~~~~~~~~~~~~~
+
+The compiler replaces calls to ``memcpy()``/``memmove()``/``memset()`` with the
+following functions. These functions are also called when data structures are
+initialized or copied, making sure shadow and origin values are copied alongside
+with the data::
+
+ void *__msan_memcpy(void *dst, void *src, uintptr_t n)
+ void *__msan_memmove(void *dst, void *src, uintptr_t n)
+ void *__msan_memset(void *dst, int c, uintptr_t n)
+
+Error reporting
+~~~~~~~~~~~~~~~
+
+For each use of a value the compiler emits a shadow check that calls
+``__msan_warning()`` in the case that value is poisoned::
+
+ void __msan_warning(u32 origin)
+
+``__msan_warning()`` causes KMSAN runtime to print an error report.
+
+Inline assembly instrumentation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+KMSAN instruments every inline assembly output with a call to::
+
+ void __msan_instrument_asm_store(void *addr, uintptr_t size)
+
+This function unpoisons the memory region.
+
+This approach may mask certain errors, but it also helps to avoid a lot of
+false positives in bitwise operations, atomics etc.
+
+Sometimes the pointers passed into inline assembly do not point to valid memory.
+In such cases they are ignored at runtime.
+
+
+Runtime library
+---------------
+
+The code is located in ``mm/kmsan/``.
+
+Per-task KMSAN state
+~~~~~~~~~~~~~~~~~~~~
+
+Every task_struct has an associated KMSAN task state that holds the KMSAN
+context (see above) and a per-task flag disallowing KMSAN reports::
+
+ struct kmsan_context {
+ ...
+ bool allow_reporting;
+ struct kmsan_context_state cstate;
+ ...
+ }
+
+ struct task_struct {
+ ...
+ struct kmsan_context kmsan;
+ ...
+ }
+
+KMSAN contexts
+~~~~~~~~~~~~~~
+
+When running in a kernel task context, KMSAN uses ``current->kmsan.cstate`` to
+hold the metadata for function parameters and return values.
+
+But when the kernel runs in interrupt, softirq or NMI context, where
+``current`` is unavailable, KMSAN switches to the per-cpu interrupt state::
+
+ DEFINE_PER_CPU(struct kmsan_ctx, kmsan_percpu_ctx);
+
+Metadata allocation
+~~~~~~~~~~~~~~~~~~~
+
+There are several places in the kernel where the metadata is stored.
+
+1. Each ``struct page`` instance contains two pointers to its shadow and
+origin pages::
+
+ struct page {
+ ...
+ struct page *shadow, *origin;
+ ...
+ };
+
+At boot-time, the kernel allocates shadow and origin pages for every available
+kernel page. This is done quite late, when the kernel address space is already
+fragmented, so normal data pages may arbitrarily interleave with the metadata
+pages.
+
+This means that, in general, the shadow/origin pages of two contiguous memory
+pages may not be contiguous. Consequently, if a memory access crosses the
+boundary of a memory block, accesses to shadow/origin memory may potentially
+corrupt other pages or read incorrect values from them.
+
+In practice, contiguous memory pages returned by the same ``alloc_pages()``
+call will have contiguous metadata, whereas if these pages belong to two
+different allocations their metadata pages can be fragmented.
+
+For the kernel data (``.data``, ``.bss`` etc.) and percpu memory regions
+there also are no guarantees on metadata contiguity.
+
+If ``__msan_metadata_ptr_for_XXX_YYY()`` hits the boundary between two pages
+with non-contiguous metadata, it returns pointers to fake shadow/origin regions::
+
+ char dummy_load_page[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
+ char dummy_store_page[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
+
+``dummy_load_page`` is zero-initialized, so reads from it always yield zeroes.
+All stores to ``dummy_store_page`` are ignored.
+
+2. For vmalloc memory and modules, there is a direct mapping between the memory
+range, its shadow and origin. KMSAN reduces the vmalloc area by 3/4, making only
+the first quarter available to ``vmalloc()``. The second quarter of the vmalloc
+area contains shadow memory for the first quarter, the third one holds the
+origins. A small part of the fourth quarter contains shadow and origins for the
+kernel modules. Please refer to ``arch/x86/include/asm/pgtable_64_types.h`` for
+more details.
+
+When an array of pages is mapped into a contiguous virtual memory space, their
+shadow and origin pages are similarly mapped into contiguous regions.
+
+References
+==========
+
+E. Stepanov, K. Serebryany. `MemorySanitizer: fast detector of uninitialized
+memory use in C++
+<https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43308.pdf>`_.
+In Proceedings of CGO 2015.
+
+.. _MemorySanitizer tool: https://clang.llvm.org/docs/MemorySanitizer.html
+.. _LLVM documentation: https://llvm.org/docs/GettingStarted.html
+.. _LKML discussion: https://lore.kernel.org/all/20220614144853.3693273-1-glider@google.com/
diff --git a/Documentation/dev-tools/kselftest.rst b/Documentation/dev-tools/kselftest.rst
index dcefee707ccd..12b575b76b20 100644
--- a/Documentation/dev-tools/kselftest.rst
+++ b/Documentation/dev-tools/kselftest.rst
@@ -7,6 +7,14 @@ directory. These are intended to be small tests to exercise individual code
paths in the kernel. Tests are intended to be run after building, installing
and booting a kernel.
+Kselftest from mainline can be run on older stable kernels. Running tests
+from mainline offers the best coverage. Several test rings run the mainline
+kselftest suite on stable releases. The reason is that when a new test gets
+added to regression-test a bug in existing code, we should be able to run
+that test on an older kernel. Hence, it is important to keep the test code
+able to run on older kernels, and to make sure a test skips gracefully on
+kernels that do not support the feature being tested.
+
You can find additional information on Kselftest framework, how to
write new tests using the framework on Kselftest wiki:
@@ -200,6 +208,14 @@ In general, the rules for selftests are
Contributing new tests (details)
================================
+ * In your Makefile, use facilities from lib.mk by including it instead of
+ reinventing the wheel. Specify flags and binary generation rules as needed
+ before including lib.mk. ::
+
+ CFLAGS = $(KHDR_INCLUDES)
+ TEST_GEN_PROGS := close_range_test
+ include ../lib.mk
+
* Use TEST_GEN_XXX if such binaries or files are generated during
compiling.
@@ -222,13 +238,30 @@ Contributing new tests (details)
* First use the headers inside the kernel source and/or git repo, and then the
system headers. Headers for the kernel release as opposed to headers
installed by the distro on the system should be the primary focus to be able
- to find regressions.
+ to find regressions. Use KHDR_INCLUDES in Makefile to include headers from
+ the kernel source.
* If a test needs specific kernel config options enabled, add a config file in
the test directory to enable them.
e.g: tools/testing/selftests/android/config
+ * Create a .gitignore file inside test directory and add all generated objects
+ in it.
+
+ * Add new test name in TARGETS in selftests/Makefile::
+
+ TARGETS += android
+
+ * All changes should pass::
+
+ kselftest-{all,install,clean,gen_tar}
+ kselftest-{all,install,clean,gen_tar} O=abs_path
+ kselftest-{all,install,clean,gen_tar} O=rel_path
+ make -C tools/testing/selftests {all,install,clean,gen_tar}
+ make -C tools/testing/selftests {all,install,clean,gen_tar} O=abs_path
+ make -C tools/testing/selftests {all,install,clean,gen_tar} O=rel_path
+
Test Module
===========
@@ -242,6 +275,14 @@ assist writing kernel modules that are for use with kselftest:
- ``tools/testing/selftests/kselftest_module.h``
- ``tools/testing/selftests/kselftest/module.sh``
+Note that test modules should taint the kernel with TAINT_TEST. This will
+happen automatically for modules which are in the ``tools/testing/``
+directory, or for modules which use the ``kselftest_module.h`` header above.
+Otherwise, you'll need to add ``MODULE_INFO(test, "Y")`` to your module
+source. Selftests which do not load modules typically should not taint the
+kernel, but in cases where a non-test module is loaded, TAINT_TEST can be
+applied from userspace by writing to ``/proc/sys/kernel/tainted``.
+
How to use
----------
@@ -279,7 +320,7 @@ A bare bones test module might look like this:
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
- #include "../tools/testing/selftests/kselftest/module.h"
+ #include "../tools/testing/selftests/kselftest_module.h"
KSTM_MODULE_GLOBALS();
@@ -300,6 +341,7 @@ A bare bones test module might look like this:
KSTM_MODULE_LOADERS(test_foo);
MODULE_AUTHOR("John Developer <jd@fooman.org>");
MODULE_LICENSE("GPL");
+ MODULE_INFO(test, "Y");
Example test script
-------------------
diff --git a/Documentation/dev-tools/ktap.rst b/Documentation/dev-tools/ktap.rst
new file mode 100644
index 000000000000..414c105b10a9
--- /dev/null
+++ b/Documentation/dev-tools/ktap.rst
@@ -0,0 +1,311 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================================
+The Kernel Test Anything Protocol (KTAP), version 1
+===================================================
+
+TAP, or the Test Anything Protocol, is a format for specifying test results
+used by a number of projects. Its website and specification are found at this
+`link <https://testanything.org/>`_. The Linux kernel largely uses TAP output
+for test results. However, kernel testing frameworks have special needs for
+test results which don't align with the original TAP specification. Thus, a
+"Kernel TAP" (KTAP) format is specified to extend and alter TAP to support
+these use-cases. This specification describes the generally accepted format of
+KTAP as it is currently used in the kernel.
+
+KTAP test results describe a series of tests (which may be nested: i.e., tests
+can have subtests), each of which can contain both diagnostic data -- e.g., log
+lines -- and a final result. The test structure and results are
+machine-readable, whereas the diagnostic data is unstructured and is there to
+aid human debugging.
+
+KTAP output is built from four different types of lines:
+
+- Version lines
+- Plan lines
+- Test case result lines
+- Diagnostic lines
+
+In general, valid KTAP output should also form valid TAP output, but some
+information, in particular nested test results, may be lost. Also note that
+there is a stagnant draft specification for TAP14; KTAP diverges from it in
+a couple of places (notably the "Subtest" header), which are described where
+relevant later in this document.
+
+Version lines
+-------------
+
+All KTAP-formatted results begin with a "version line" which specifies which
+version of the (K)TAP standard the result is compliant with.
+
+For example:
+
+- "KTAP version 1"
+- "TAP version 13"
+- "TAP version 14"
+
+Note that, in KTAP, subtests also begin with a version line, which denotes the
+start of the nested test results. This differs from TAP14, which uses a
+separate "Subtest" line.
+
+While, going forward, "KTAP version 1" should be used by compliant tests, it
+is expected that most parsers and other tooling will accept the other versions
+listed here for compatibility with existing tests and frameworks.
+
+Plan lines
+----------
+
+A test plan provides the number of tests (or subtests) in the KTAP output.
+
+Plan lines must follow the format of "1..N", where N is the number of tests
+or subtests.
+Plan lines follow version lines to indicate the number of nested tests.
+
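+For example, a test with 4 subtests would have the plan line::
+
+  1..4
+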
+While there are cases where the number of tests is not known in advance -- in
+which case the test plan may be omitted -- it is strongly recommended one is
+present where possible.
+
+Test case result lines
+----------------------
+
+Test case result lines indicate the final status of a test.
+They are required and must have the format:
+
+.. code-block:: none
+
+ <result> <number> [<description>][ # [<directive>] [<diagnostic data>]]
+
+The result can be either "ok", which indicates the test case passed,
+or "not ok", which indicates that the test case failed.
+
+<number> represents the number of the test being performed. The first test must
+have the number 1, and the number must then increase by 1 for each additional
+subtest within the same test at the same nesting level.
+
+The description is a description of the test, generally the name of
+the test, and can be any string of characters other than # or a
+newline. The description is optional, but recommended.
+
+The directive and any diagnostic data is optional. If either are present, they
+must follow a hash sign, "#".
+
+A directive is a keyword that indicates a different outcome for a test other
+than passed and failed. The directive is optional, and consists of a single
+keyword preceding the diagnostic data. In the event that a parser encounters
+a directive it doesn't support, it should fall back to the "ok" / "not ok"
+result.
+
+Currently accepted directives are:
+
+- "SKIP", which indicates a test was skipped (note the result of the test case
+ result line can be either "ok" or "not ok" if the SKIP directive is used)
+- "TODO", which indicates that a test is not expected to pass at the moment,
+ e.g. because the feature it is testing is known to be broken. While this
+ directive is inherited from TAP, its use in the kernel is discouraged.
+- "XFAIL", which indicates that a test is expected to fail. This is similar
+ to "TODO", above, and is used by some kselftest tests.
+- "TIMEOUT", which indicates a test has timed out (note the result of the test
+ case result line should be "not ok" if the TIMEOUT directive is used)
+- "ERROR", which indicates that the execution of a test has failed due to a
+ specific error that is included in the diagnostic data. (note the result of
+ the test case result line should be "not ok" if the ERROR directive is used)
+
+The diagnostic data is a plain-text field which contains any additional details
+about why this result was produced. This is typically an error message for ERROR
+or failed tests, or a description of missing dependencies for a SKIP result.
+
+The diagnostic data field is optional, and results which have neither a
+directive nor any diagnostic data do not need to include the "#" field
+separator.
+
+Example result lines include::
+
+ ok 1 test_case_name
+
+The test "test_case_name" passed.
+
+::
+
+ not ok 1 test_case_name
+
+The test "test_case_name" failed.
+
+::
+
+ ok 1 test # SKIP necessary dependency unavailable
+
+The test "test" was SKIPPED with the diagnostic message "necessary dependency
+unavailable".
+
+::
+
+ not ok 1 test # TIMEOUT 30 seconds
+
+The test "test" timed out, with diagnostic data "30 seconds".
+
+::
+
+ ok 5 check return code # rcode=0
+
+The test "check return code" passed, with additional diagnostic data “rcode=0”
+
+
+Diagnostic lines
+----------------
+
+If tests wish to output any further information, they should do so using
+"diagnostic lines". Diagnostic lines are optional, freeform text, and are
+often used to describe what is being tested and any intermediate results in
+more detail than the final result and diagnostic data line provides.
+
+Diagnostic lines are formatted as "# <diagnostic_description>", where the
+description can be any string. Diagnostic lines can be anywhere in the test
+output. As a rule, diagnostic lines regarding a test are directly before the
+test result line for that test.
+
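+For example, a diagnostic line directly preceding the result of its test
+(taken from the larger example below)::
+
+  # test_1: initializing test_1
+  ok 1 test_1
+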
+Note that most tools will treat unknown lines (see below) as diagnostic lines,
+even if they do not start with a "#": this is to capture any other useful
+kernel output which may help debug the test. It is nevertheless recommended
+that tests always prefix any diagnostic output they have with a "#" character.
+
+Unknown lines
+-------------
+
+There may be lines within KTAP output that do not follow any of the four line
+formats described above. This is allowed; however, they will not influence the
+status of the tests.
+
+This is an important difference from TAP. Kernel tests may print messages
+to the system console or a log file. Both of these destinations may contain
+messages either from unrelated kernel or userspace activity, or kernel
+messages from non-test code that is invoked by the test. The kernel code
+invoked by the test is likely not aware that a test is in progress and
+thus cannot print the message as a diagnostic message.
+
+Nested tests
+------------
+
+In KTAP, tests can be nested. This is done by having a test include within its
+output an entire set of KTAP-formatted results. This can be used to categorize
+and group related tests, or to split out different results from the same test.
+
+The "parent" test's result should consist of all of its subtests' results,
+starting with another KTAP version line and test plan, and end with the overall
+result. If one of the subtests fail, for example, the parent test should also
+fail.
+
+Additionally, all lines in a subtest should be indented. One level of
+indentation is two spaces: "  ". The indentation should begin at the version
+line and should end before the parent test's result line.
+
+"Unknown lines" are not considered to be lines in a subtest and thus are
+allowed to be either indented or not indented.
+
+An example of a test with two nested subtests:
+
+::
+
+ KTAP version 1
+ 1..1
+ KTAP version 1
+ 1..2
+ ok 1 test_1
+ not ok 2 test_2
+ # example failed
+ not ok 1 example
+
+An example format with multiple levels of nested testing:
+
+::
+
+ KTAP version 1
+ 1..2
+ KTAP version 1
+ 1..2
+ KTAP version 1
+ 1..2
+ not ok 1 test_1
+ ok 2 test_2
+ not ok 1 test_3
+ ok 2 test_4 # SKIP
+ not ok 1 example_test_1
+ ok 2 example_test_2
+
+
+Major differences between TAP and KTAP
+--------------------------------------
+
+================================================== ========= ===============
+Feature                                            TAP       KTAP
+================================================== ========= ===============
+yaml and json in diagnostic message                ok        not recommended
+TODO directive                                     ok        not recognized
+allows an arbitrary number of tests to be nested   no        yes
+"Unknown lines" are in category of "Anything else" yes       no
+"Unknown lines" are                                incorrect allowed
+================================================== ========= ===============
+
+The TAP14 specification does permit nested tests, but instead of using another
+nested version line, it uses a line of the form
+"Subtest: <name>", where <name> is the name of the parent test.
+
+Example KTAP output
+--------------------
+::
+
+ KTAP version 1
+ 1..1
+ KTAP version 1
+ 1..3
+ KTAP version 1
+ 1..1
+ # test_1: initializing test_1
+ ok 1 test_1
+ ok 1 example_test_1
+ KTAP version 1
+ 1..2
+ ok 1 test_1 # SKIP test_1 skipped
+ ok 2 test_2
+ ok 2 example_test_2
+ KTAP version 1
+ 1..3
+ ok 1 test_1
+ # test_2: FAIL
+ not ok 2 test_2
+ ok 3 test_3 # SKIP test_3 skipped
+ not ok 3 example_test_3
+ not ok 1 main_test
+
+This output defines the following hierarchy:
+
+A single test called "main_test", which fails, and has three subtests:
+
+- "example_test_1", which passes, and has one subtest:
+
+ - "test_1", which passes, and outputs the diagnostic message "test_1: initializing test_1"
+
+- "example_test_2", which passes, and has two subtests:
+
+ - "test_1", which is skipped, with the explanation "test_1 skipped"
+ - "test_2", which passes
+
+- "example_test_3", which fails, and has three subtests
+
+ - "test_1", which passes
+ - "test_2", which outputs the diagnostic line "test_2: FAIL", and fails.
+ - "test_3", which is skipped with the explanation "test_3 skipped"
+
+Note that the individual subtests with the same names do not conflict, as they
+are found in different parent tests. This output also exhibits some sensible
+rules for "bubbling up" test results: a test fails if any of its subtests fail.
+Skipped tests do not affect the result of the parent test (though it often
+makes sense for a test to be marked skipped if *all* of its subtests have been
+skipped).
+
+See also:
+---------
+
+- The TAP specification:
+ https://testanything.org/tap-version-13-specification.html
+- The (stagnant) TAP version 14 specification:
+ https://github.com/TestAnything/Specification/blob/tap-14-specification/specification.md
+- The kselftest documentation:
+ Documentation/dev-tools/kselftest.rst
+- The KUnit documentation:
+ Documentation/dev-tools/kunit/index.rst
diff --git a/Documentation/dev-tools/kunit/api/functionredirection.rst b/Documentation/dev-tools/kunit/api/functionredirection.rst
new file mode 100644
index 000000000000..3791efc2fcca
--- /dev/null
+++ b/Documentation/dev-tools/kunit/api/functionredirection.rst
@@ -0,0 +1,162 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+Function Redirection API
+========================
+
+Overview
+========
+
+When writing unit tests, it's important to be able to isolate the code being
+tested from other parts of the kernel. This ensures the reliability of the test
+(it won't be affected by external factors), reduces dependencies on specific
+hardware or config options (making the test easier to run), and protects the
+stability of the rest of the system (making it less likely for test-specific
+state to interfere elsewhere).
+
+While for some code (typically generic data structures, helpers, and other
+"pure functions") this is trivial, for others (like device drivers,
+filesystems, core subsystems) the code is heavily coupled with other parts of
+the kernel.
+
+This coupling is often due to global state in some way: be it a global list of
+devices, the filesystem, or some hardware state. Tests need to either carefully
+manage, isolate, and restore state, or they can avoid it altogether by
+replacing access to and mutation of this state with a "fake" or "mock" variant.
+
+Access to such state can be refactored, for example by introducing a layer of
+indirection which can use or emulate a separate set of test state. However,
+such refactoring comes with its own costs (and undertaking significant
+refactoring before being able to write tests is suboptimal).
+
+A simpler way to intercept and replace some of the function calls is to use
+function redirection via static stubs.
+
+
+Static Stubs
+============
+
+Static stubs are a way of redirecting calls to one function (the "real"
+function) to another function (the "replacement" function).
+
+It works by adding a macro to the "real" function which checks whether a test
+is running and whether a replacement function is available. If so, the
+replacement function is called in place of the original.
+
+Using static stubs is pretty straightforward:
+
+1. Add the KUNIT_STATIC_STUB_REDIRECT() macro to the start of the "real"
+ function.
+
+ This should be the first statement in the function, after any variable
+ declarations. KUNIT_STATIC_STUB_REDIRECT() takes the name of the
+ function, followed by all of the arguments passed to the real function.
+
+ For example:
+
+ .. code-block:: c
+
+ void send_data_to_hardware(const char *str)
+ {
+ KUNIT_STATIC_STUB_REDIRECT(send_data_to_hardware, str);
+ /* real implementation */
+ }
+
+2. Write one or more replacement functions.
+
+ These functions should have the same function signature as the real function.
+ In the event they need to access or modify test-specific state, they can use
+ kunit_get_current_test() to get a struct kunit pointer. This can then
+ be passed to the expectation/assertion macros, or used to look up KUnit
+ resources.
+
+ For example:
+
+ .. code-block:: c
+
+ void fake_send_data_to_hardware(const char *str)
+ {
+ struct kunit *test = kunit_get_current_test();
+ KUNIT_EXPECT_STREQ(test, str, "Hello World!");
+ }
+
+3. Activate the static stub from your test.
+
+ From within a test, the redirection can be enabled with
+ kunit_activate_static_stub(), which accepts a struct kunit pointer,
+ the real function, and the replacement function. You can call this several
+ times with different replacement functions to swap out implementations of the
+ function.
+
+ In our example, this would be
+
+ .. code-block:: c
+
+ kunit_activate_static_stub(test,
+ send_data_to_hardware,
+ fake_send_data_to_hardware);
+
+4. Call (perhaps indirectly) the real function.
+
+ Once the redirection is activated, any call to the real function will call
+ the replacement function instead. Such calls may be buried deep in the
+ implementation of another function, but must occur from the test's kthread.
+
+ For example:
+
+ .. code-block:: c
+
+ send_data_to_hardware("Hello World!"); /* Succeeds */
+ send_data_to_hardware("Something else"); /* Fails the test. */
+
+5. (Optionally) disable the stub.
+
+ When you no longer need it, disable the redirection (and hence resume the
+ original behaviour of the 'real' function) using
+ kunit_deactivate_static_stub(). Otherwise, it will be automatically disabled
+ when the test exits.
+
+ For example:
+
+ .. code-block:: c
+
+ kunit_deactivate_static_stub(test, send_data_to_hardware);
+
+
+It's also possible to use these replacement functions to test whether a
+function is called at all, for example:
+
+.. code-block:: c
+
+ void send_data_to_hardware(const char *str)
+ {
+ KUNIT_STATIC_STUB_REDIRECT(send_data_to_hardware, str);
+ /* real implementation */
+ }
+
+ /* In test file */
+ int times_called = 0;
+ void fake_send_data_to_hardware(const char *str)
+ {
+ times_called++;
+ }
+ ...
+ /* In the test case, redirect calls for the duration of the test */
+ kunit_activate_static_stub(test, send_data_to_hardware, fake_send_data_to_hardware);
+
+ send_data_to_hardware("hello");
+ KUNIT_EXPECT_EQ(test, times_called, 1);
+
+ /* Can also deactivate the stub early, if wanted */
+ kunit_deactivate_static_stub(test, send_data_to_hardware);
+
+ send_data_to_hardware("hello again");
+ KUNIT_EXPECT_EQ(test, times_called, 1);
+
+
+
+API Reference
+=============
+
+.. kernel-doc:: include/kunit/static_stub.h
+ :internal:
diff --git a/Documentation/dev-tools/kunit/api/index.rst b/Documentation/dev-tools/kunit/api/index.rst
index b33ad72bcf0b..2d8f756aab56 100644
--- a/Documentation/dev-tools/kunit/api/index.rst
+++ b/Documentation/dev-tools/kunit/api/index.rst
@@ -4,13 +4,24 @@
API Reference
=============
.. toctree::
+ :hidden:
test
+ resource
+ functionredirection
-This section documents the KUnit kernel testing API. It is divided into the
+
+This page documents the KUnit kernel testing API. It is divided into the
following sections:
Documentation/dev-tools/kunit/api/test.rst
- - documents all of the standard testing API excluding mocking
- or mocking related features.
+ - Documents all of the standard testing API
+
+Documentation/dev-tools/kunit/api/resource.rst
+
+ - Documents the KUnit resource API
+
+Documentation/dev-tools/kunit/api/functionredirection.rst
+
+ - Documents the KUnit Function Redirection API
diff --git a/Documentation/dev-tools/kunit/api/resource.rst b/Documentation/dev-tools/kunit/api/resource.rst
new file mode 100644
index 000000000000..0a94f831259e
--- /dev/null
+++ b/Documentation/dev-tools/kunit/api/resource.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Resource API
+============
+
+This file documents the KUnit resource API.
+
+Most users won't need to use this API directly, but power users can use it to
+store state on a per-test basis, register custom cleanup actions, and more.
+
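+For example, the KUnit memory allocation helpers are built on top of test
+resources. A minimal sketch of what this enables (the test function below is
+hypothetical):
+
+.. code-block:: c
+
+    static void example_resource_test(struct kunit *test)
+    {
+            /*
+             * The allocation is tracked as a resource of this test, so
+             * it is freed automatically when the test finishes.
+             */
+            char *buf = kunit_kzalloc(test, 16, GFP_KERNEL);
+
+            KUNIT_ASSERT_NOT_ERR_OR_NULL(test, buf);
+    }
+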
+.. kernel-doc:: include/kunit/resource.h
+ :internal:
diff --git a/Documentation/dev-tools/kunit/api/test.rst b/Documentation/dev-tools/kunit/api/test.rst
index aaa97f17e5b3..c5eca423e8b6 100644
--- a/Documentation/dev-tools/kunit/api/test.rst
+++ b/Documentation/dev-tools/kunit/api/test.rst
@@ -4,8 +4,7 @@
Test API
========
-This file documents all of the standard testing API excluding mocking or mocking
-related features.
+This file documents all of the standard testing API.
.. kernel-doc:: include/kunit/test.h
:internal:
diff --git a/Documentation/dev-tools/kunit/architecture.rst b/Documentation/dev-tools/kunit/architecture.rst
new file mode 100644
index 000000000000..e95ab05342bb
--- /dev/null
+++ b/Documentation/dev-tools/kunit/architecture.rst
@@ -0,0 +1,196 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+KUnit Architecture
+==================
+
+The KUnit architecture is divided into two parts:
+
+- `In-Kernel Testing Framework`_
+- `kunit_tool (Command-line Test Harness)`_
+
+In-Kernel Testing Framework
+===========================
+
+The kernel testing library supports KUnit tests written in C using the
+KUnit framework. These KUnit tests are kernel code. KUnit performs the
+following tasks:
+
+- Organizes tests
+- Reports test results
+- Provides test utilities
+
+Test Cases
+----------
+
+The test case is the fundamental unit in KUnit. KUnit test cases are organised
+into suites. A KUnit test case is a function with type signature
+``void (*)(struct kunit *test)``. These test case functions are wrapped in a
+struct called ``struct kunit_case`` (a sketch of a test case follows below).
+
+.. note::
+   ``generate_params`` is optional for non-parameterized tests.
+
+Each KUnit test case receives a ``struct kunit`` context object that tracks a
+running test. The KUnit assertion macros and other KUnit utilities use the
+``struct kunit`` context object. As an exception, there are two fields that
+tests may use directly:
+
+- ``->priv``: The setup functions can use it to store arbitrary test
+ user data.
+
+- ``->param_value``: It contains the parameter value which can be
+ retrieved in the parameterized tests.
+
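+As an illustration, a minimal sketch of a test case function (the name and
+the check are hypothetical):
+
+.. code-block:: c
+
+    static void example_test_foo(struct kunit *test)
+    {
+            /* The context object is passed to every KUnit macro. */
+            KUNIT_EXPECT_EQ(test, 1 + 1, 2);
+    }
+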
+Test Suites
+-----------
+
+A KUnit suite includes a collection of test cases. The KUnit suites
+are represented by the ``struct kunit_suite``. For example:
+
+.. code-block:: c
+
+ static struct kunit_case example_test_cases[] = {
+ KUNIT_CASE(example_test_foo),
+ KUNIT_CASE(example_test_bar),
+ KUNIT_CASE(example_test_baz),
+ {}
+ };
+
+ static struct kunit_suite example_test_suite = {
+ .name = "example",
+ .init = example_test_init,
+ .exit = example_test_exit,
+ .test_cases = example_test_cases,
+ };
+ kunit_test_suite(example_test_suite);
+
+In the above example, the test suite ``example_test_suite`` runs the
+test cases ``example_test_foo``, ``example_test_bar``, and
+``example_test_baz``. Before each test case runs, ``example_test_init``
+is called, and after each test case completes, ``example_test_exit`` is
+called. The ``kunit_test_suite(example_test_suite)`` macro registers the
+test suite with the KUnit test framework.
+
+Executor
+--------
+
+The KUnit executor can list and run built-in KUnit tests on boot.
+The test suites are stored in a linker section
+called ``.kunit_test_suites``. For the code, see ``KUNIT_TABLE()`` macro
+definition in
+`include/asm-generic/vmlinux.lds.h <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/asm-generic/vmlinux.lds.h?h=v6.0#n950>`_.
+The linker section consists of an array of pointers to
+``struct kunit_suite``, and is populated by the ``kunit_test_suites()``
+macro. The KUnit executor iterates over the linker section array in order to
+run all the tests that are compiled into the kernel.
+
+.. kernel-figure:: kunit_suitememorydiagram.svg
+ :alt: KUnit Suite Memory
+
+ KUnit Suite Memory Diagram
+
+On kernel boot, the KUnit executor uses the start and end addresses
+of this section to iterate over and run all tests. For the implementation of the
+executor, see
+`lib/kunit/executor.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/kunit/executor.c>`_.
+When built as a module, the ``kunit_test_suites()`` macro defines a
+``module_init()`` function, which runs all the tests in the compilation
+unit instead of utilizing the executor.
+
+So that some classes of errors in a test do not affect other tests
+or other parts of the kernel, each KUnit test case executes in a separate
+thread context. See the ``kunit_try_catch_run()`` function in
+`lib/kunit/try-catch.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/kunit/try-catch.c?h=v5.15#n58>`_.
+
+Assertion Macros
+----------------
+
+KUnit tests verify state using expectations/assertions.
+All expectations/assertions are formatted as:
+``KUNIT_{EXPECT|ASSERT}_<op>[_MSG](kunit, property[, message])``
+(a sketch of the difference follows the list below).
+
+- ``{EXPECT|ASSERT}`` determines whether the check is an assertion or an
+ expectation.
+ In the event of a failure, the testing flow differs as follows:
+
+ - For expectations, the test is marked as failed and the failure is logged.
+
+ - Failing assertions, on the other hand, result in the test case being
+ terminated immediately.
+
+ - Assertions call the function:
+ ``void __noreturn kunit_abort(struct kunit *)``.
+
+ - ``kunit_abort`` calls the function:
+ ``void __noreturn kunit_try_catch_throw(struct kunit_try_catch *try_catch)``.
+
+ - ``kunit_try_catch_throw`` calls the function:
+ ``void kthread_complete_and_exit(struct completion *, long) __noreturn;``
+ and terminates the special thread context.
+
+- ``<op>`` denotes the type of check, for example: ``TRUE`` (supplied property
+  has the boolean value "true"), ``EQ`` (two supplied properties are
+  equal), ``NOT_ERR_OR_NULL`` (supplied pointer is not null and does not
+  contain an "err" value).
+
+- ``[_MSG]`` prints a custom message on failure.
+
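+As a sketch of the difference (``ret`` and ``buf`` are hypothetical
+variables):
+
+.. code-block:: c
+
+    /* On failure: marks the test failed and logs it, but keeps running. */
+    KUNIT_EXPECT_EQ(test, ret, 0);
+
+    /* On failure: terminates the test case immediately. */
+    KUNIT_ASSERT_NOT_ERR_OR_NULL(test, buf);
+
+    /* On failure: additionally prints the custom message. */
+    KUNIT_EXPECT_EQ_MSG(test, ret, 0, "unexpected return code");
+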
+Test Result Reporting
+---------------------
+KUnit prints the test results in KTAP format. KTAP is based on TAP14; see
+Documentation/dev-tools/ktap.rst.
+KTAP works with KUnit and Kselftest. The KUnit executor prints KTAP results to
+dmesg, and to debugfs (if configured).
+
+Parameterized Tests
+-------------------
+
+Each KUnit parameterized test is associated with a collection of
+parameters. The test is invoked multiple times, once for each parameter
+value, and the current parameter is stored in the ``param_value`` field.
+The test case is defined with the ``KUNIT_CASE_PARAM()`` macro, which accepts
+a generator function. The generator function is passed the previous parameter
+and returns the next parameter. KUnit also provides a macro for generating
+common-case, array-based generators.
+
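+A minimal sketch of a parameterized test using the array-based generator
+macro (all names and values here are hypothetical):
+
+.. code-block:: c
+
+    static const int example_params[] = { 1, 2, 3 };
+
+    static void example_param_desc(const int *p, char *desc)
+    {
+            snprintf(desc, KUNIT_PARAM_DESC_SIZE, "param %d", *p);
+    }
+
+    /* Defines example_gen_params(), which walks the array above. */
+    KUNIT_ARRAY_PARAM(example, example_params, example_param_desc);
+
+    static void example_param_test(struct kunit *test)
+    {
+            const int *param = test->param_value;
+
+            KUNIT_EXPECT_GT(test, *param, 0);
+    }
+
+    /* In the kunit_case array: */
+    KUNIT_CASE_PARAM(example_param_test, example_gen_params),
+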
+kunit_tool (Command-line Test Harness)
+======================================
+
+``kunit_tool`` is a Python script, found in ``tools/testing/kunit/kunit.py``. It
+is used to configure, build, and execute the kernel, and to parse test results;
+the ``run`` command performs all of these steps in the correct order.
+You have two options for running KUnit tests: either build the kernel with KUnit
+enabled and manually parse the results (see
+Documentation/dev-tools/kunit/run_manual.rst) or use ``kunit_tool``
+(see Documentation/dev-tools/kunit/run_wrapper.rst).
+
+- ``configure`` command generates the kernel ``.config`` from a
+  ``.kunitconfig`` file (and any architecture-specific options).
+  The Python scripts available in the ``qemu_configs`` folder
+  (for example, ``tools/testing/kunit/qemu_configs/powerpc.py``) contain
+  additional configuration options for specific architectures.
+ It parses both the existing ``.config`` and the ``.kunitconfig`` files
+ to ensure that ``.config`` is a superset of ``.kunitconfig``.
+ If not, it will combine the two and run ``make olddefconfig`` to regenerate
+ the ``.config`` file. It then checks to see if ``.config`` has become a superset.
+ This verifies that all the Kconfig dependencies are correctly specified in the
+ file ``.kunitconfig``. The ``kunit_config.py`` script contains the code for parsing
+ Kconfigs. The code which runs ``make olddefconfig`` is part of the
+ ``kunit_kernel.py`` script. You can invoke this command through:
+ ``./tools/testing/kunit/kunit.py config`` and
+ generate a ``.config`` file.
+- ``build`` runs ``make`` on the kernel tree with required options
+ (depends on the architecture and some options, for example: build_dir)
+ and reports any errors.
+ To build a KUnit kernel from the current ``.config``, you can use the
+ ``build`` argument: ``./tools/testing/kunit/kunit.py build``.
+- ``exec`` command executes the built kernel either directly (when using
+  the User-mode Linux configuration), or through an emulator such
+  as QEMU. It reads test results from the log using standard
+  output (stdout), and passes them to ``parse`` to be parsed.
+ If you already have built a kernel with built-in KUnit tests,
+ you can run the kernel and display the test results with the ``exec``
+ argument: ``./tools/testing/kunit/kunit.py exec``.
+- ``parse`` extracts the KTAP output from a kernel log, parses
+ the test results, and prints a summary. For failed tests, any
+ diagnostic output will be included.
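+  For example, to parse results from a kernel log:
+  ``dmesg | ./tools/testing/kunit/kunit.py parse``.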
diff --git a/Documentation/dev-tools/kunit/faq.rst b/Documentation/dev-tools/kunit/faq.rst
index 5c6555d020f3..fae426f2634a 100644
--- a/Documentation/dev-tools/kunit/faq.rst
+++ b/Documentation/dev-tools/kunit/faq.rst
@@ -4,56 +4,58 @@
Frequently Asked Questions
==========================
-How is this different from Autotest, kselftest, etc?
-====================================================
+How is this different from Autotest, kselftest, and so on?
+==========================================================
KUnit is a unit testing framework. Autotest, kselftest (and some others) are
not.
A `unit test <https://martinfowler.com/bliki/UnitTest.html>`_ is supposed to
-test a single unit of code in isolation, hence the name. A unit test should be
-the finest granularity of testing and as such should allow all possible code
-paths to be tested in the code under test; this is only possible if the code
-under test is very small and does not have any external dependencies outside of
+test a single unit of code in isolation and hence the name *unit test*. A unit
+test should be the finest granularity of testing and should allow all possible
+code paths to be tested in the code under test. This is only possible if the
+code under test is small and does not have any external dependencies outside of
the test's control like hardware.
There are no testing frameworks currently available for the kernel that do not
-require installing the kernel on a test machine or in a VM and all require
-tests to be written in userspace and run on the kernel under test; this is true
-for Autotest, kselftest, and some others, disqualifying any of them from being
-considered unit testing frameworks.
+require installing the kernel on a test machine or in a virtual machine. All
+testing frameworks require tests to be written in userspace and run on the
+kernel under test. This is true for Autotest, kselftest, and some others,
+disqualifying any of them from being considered unit testing frameworks.
Does KUnit support running on architectures other than UML?
===========================================================
-Yes, well, mostly.
+Yes, mostly.
-For the most part, the KUnit core framework (what you use to write the tests)
-can compile to any architecture; it compiles like just another part of the
+For the most part, the KUnit core framework (what we use to write the tests)
+can compile to any architecture. It compiles like just another part of the
kernel and runs when the kernel boots, or when built as a module, when the
-module is loaded. However, there is some infrastructure,
-like the KUnit Wrapper (``tools/testing/kunit/kunit.py``) that does not support
-other architectures.
+module is loaded. However, there is infrastructure, like the KUnit Wrapper
+(``tools/testing/kunit/kunit.py``) that might not support some architectures
+(see :ref:`kunit-on-qemu`).
-In short, this means that, yes, you can run KUnit on other architectures, but
-it might require more work than using KUnit on UML.
+In short, yes, you can run KUnit on other architectures, but it might require
+more work than using KUnit on UML.
For more information, see :ref:`kunit-on-non-uml`.
-What is the difference between a unit test and these other kinds of tests?
-==========================================================================
+.. _kinds-of-tests:
+
+What is the difference between a unit test and other kinds of tests?
+====================================================================
Most existing tests for the Linux kernel would be categorized as an integration
test, or an end-to-end test.
-- A unit test is supposed to test a single unit of code in isolation, hence the
- name. A unit test should be the finest granularity of testing and as such
- should allow all possible code paths to be tested in the code under test; this
- is only possible if the code under test is very small and does not have any
- external dependencies outside of the test's control like hardware.
+- A unit test is supposed to test a single unit of code in isolation. A unit
+ test should be the finest granularity of testing and, as such, allows all
+ possible code paths to be tested in the code under test. This is only possible
+ if the code under test is small and does not have any external dependencies
+ outside of the test's control like hardware.
- An integration test tests the interaction between a minimal set of components,
usually just two or three. For example, someone might write an integration
test to test the interaction between a driver and a piece of hardware, or to
test the interaction between the userspace libraries the kernel provides and
- the kernel itself; however, one of these tests would probably not test the
+ the kernel itself. However, one of these tests would probably not test the
entire kernel along with hardware interactions and interactions with the
userspace.
- An end-to-end test usually tests the entire system from the perspective of the
@@ -62,26 +64,26 @@ test, or an end-to-end test.
hardware with a production userspace and then trying to exercise some behavior
that depends on interactions between the hardware, the kernel, and userspace.
-KUnit isn't working, what should I do?
-======================================
+KUnit is not working, what should I do?
+=======================================
Unfortunately, there are a number of things which can break, but here are some
things to try.
-1. Try running ``./tools/testing/kunit/kunit.py run`` with the ``--raw_output``
+1. Run ``./tools/testing/kunit/kunit.py run`` with the ``--raw_output``
parameter. This might show details or error messages hidden by the kunit_tool
parser.
2. Instead of running ``kunit.py run``, try running ``kunit.py config``,
``kunit.py build``, and ``kunit.py exec`` independently. This can help track
down where an issue is occurring. (If you think the parser is at fault, you
- can run it manually against stdin or a file with ``kunit.py parse``.)
-3. Running the UML kernel directly can often reveal issues or error messages
- kunit_tool ignores. This should be as simple as running ``./vmlinux`` after
- building the UML kernel (e.g., by using ``kunit.py build``). Note that UML
- has some unusual requirements (such as the host having a tmpfs filesystem
- mounted), and has had issues in the past when built statically and the host
- has KASLR enabled. (On older host kernels, you may need to run ``setarch
- `uname -m` -R ./vmlinux`` to disable KASLR.)
+ can run it manually against ``stdin`` or a file with ``kunit.py parse``.)
+3. Running the UML kernel directly can often reveal issues or error messages
+   ``kunit_tool`` ignores. This should be as simple as running ``./vmlinux``
+ after building the UML kernel (for example, by using ``kunit.py build``).
+ Note that UML has some unusual requirements (such as the host having a tmpfs
+ filesystem mounted), and has had issues in the past when built statically and
+ the host has KASLR enabled. (On older host kernels, you may need to run
+ ``setarch `uname -m` -R ./vmlinux`` to disable KASLR.)
4. Make sure the kernel .config has ``CONFIG_KUNIT=y`` and at least one test
(e.g. ``CONFIG_KUNIT_EXAMPLE_TEST=y``). kunit_tool will keep its .config
around, so you can see what config was used after running ``kunit.py run``.
@@ -96,8 +98,7 @@ things to try.
seeing. When tests are built-in, they will execute when the kernel boots, and
modules will automatically execute associated tests when loaded. Test results
can be collected from ``/sys/kernel/debug/kunit/<test suite>/results``, and
- can be parsed with ``kunit.py parse``. For more details, see "KUnit on
- non-UML architectures" in Documentation/dev-tools/kunit/usage.rst.
+ can be parsed with ``kunit.py parse``. For more details, see :ref:`kunit-on-qemu`.
If none of the above tricks help, you are always welcome to email any issues to
kunit-dev@googlegroups.com.
diff --git a/Documentation/dev-tools/kunit/index.rst b/Documentation/dev-tools/kunit/index.rst
index cacb35ec658d..b3593ae29ace 100644
--- a/Documentation/dev-tools/kunit/index.rst
+++ b/Documentation/dev-tools/kunit/index.rst
@@ -1,97 +1,109 @@
.. SPDX-License-Identifier: GPL-2.0
-=========================================
-KUnit - Unit Testing for the Linux Kernel
-=========================================
+=================================
+KUnit - Linux Kernel Unit Testing
+=================================
.. toctree::
:maxdepth: 2
+ :caption: Contents:
start
+ architecture
+ run_wrapper
+ run_manual
usage
- kunit-tool
api/index
style
faq
- tips
running_tips
-What is KUnit?
-==============
-
-KUnit is a lightweight unit testing and mocking framework for the Linux kernel.
-
-KUnit is heavily inspired by JUnit, Python's unittest.mock, and
-Googletest/Googlemock for C++. KUnit provides facilities for defining unit test
-cases, grouping related test cases into test suites, providing common
-infrastructure for running tests, and much more.
-
-KUnit consists of a kernel component, which provides a set of macros for easily
-writing unit tests. Tests written against KUnit will run on kernel boot if
-built-in, or when loaded if built as a module. These tests write out results to
-the kernel log in `TAP <https://testanything.org/>`_ format.
-
-To make running these tests (and reading the results) easier, KUnit offers
-:doc:`kunit_tool <kunit-tool>`, which builds a `User Mode Linux
-<http://user-mode-linux.sourceforge.net>`_ kernel, runs it, and parses the test
-results. This provides a quick way of running KUnit tests during development,
-without requiring a virtual machine or separate hardware.
-
-Get started now: Documentation/dev-tools/kunit/start.rst
-
-Why KUnit?
-==========
-
-A unit test is supposed to test a single unit of code in isolation, hence the
-name. A unit test should be the finest granularity of testing and as such should
-allow all possible code paths to be tested in the code under test; this is only
-possible if the code under test is very small and does not have any external
-dependencies outside of the test's control like hardware.
-
-KUnit provides a common framework for unit tests within the kernel.
-
-KUnit tests can be run on most architectures, and most tests are architecture
-independent. All built-in KUnit tests run on kernel startup. Alternatively,
-KUnit and KUnit tests can be built as modules and tests will run when the test
-module is loaded.
-
-.. note::
-
- KUnit can also run tests without needing a virtual machine or actual
- hardware under User Mode Linux. User Mode Linux is a Linux architecture,
- like ARM or x86, which compiles the kernel as a Linux executable. KUnit
- can be used with UML either by building with ``ARCH=um`` (like any other
- architecture), or by using :doc:`kunit_tool <kunit-tool>`.
-
-KUnit is fast. Excluding build time, from invocation to completion KUnit can run
-several dozen tests in only 10 to 20 seconds; this might not sound like a big
-deal to some people, but having such fast and easy to run tests fundamentally
-changes the way you go about testing and even writing code in the first place.
-Linus himself said in his `git talk at Google
-<https://gist.github.com/lorn/1272686/revisions#diff-53c65572127855f1b003db4064a94573R874>`_:
-
- "... a lot of people seem to think that performance is about doing the
- same thing, just doing it faster, and that is not true. That is not what
- performance is all about. If you can do something really fast, really
- well, people will start using it differently."
-
-In this context Linus was talking about branching and merging,
-but this point also applies to testing. If your tests are slow, unreliable, are
-difficult to write, and require a special setup or special hardware to run,
-then you wait a lot longer to write tests, and you wait a lot longer to run
-tests; this means that tests are likely to break, unlikely to test a lot of
-things, and are unlikely to be rerun once they pass. If your tests are really
-fast, you run them all the time, every time you make a change, and every time
-someone sends you some code. Why trust that someone ran all their tests
-correctly on every change when you can just run them yourself in less time than
-it takes to read their test log?
+This section details the kernel unit testing framework.
+
+Introduction
+============
+
+KUnit (Kernel unit testing framework) provides a common framework for
+unit tests within the Linux kernel. Using KUnit, you can define groups
+of test cases called test suites. The tests either run on kernel boot
+if built-in, or load as a module. KUnit automatically flags and reports
+failed test cases in the kernel log. The test results appear in
+:doc:`KTAP (Kernel - Test Anything Protocol) format</dev-tools/ktap>`.
+KUnit is inspired by JUnit, Python's unittest.mock, and GoogleTest/GoogleMock
+(C++ unit testing frameworks).
+
+KUnit tests are part of the kernel, written in the C programming
+language, and test parts of the kernel implementation (for example: a C
+language function). Excluding build time, from invocation to
+completion, KUnit can run around 100 tests in less than 10 seconds.
+KUnit can test any kernel component, for example: file systems, system
+calls, memory management, device drivers and so on.
+
+KUnit follows the white-box testing approach. The test has access to
+internal system functionality. KUnit runs in kernel space and is not
+restricted to things exposed to user-space.
+
+In addition, KUnit has kunit_tool, a script (``tools/testing/kunit/kunit.py``)
+that configures the Linux kernel, runs KUnit tests under QEMU or UML
+(:doc:`User Mode Linux </virt/uml/user_mode_linux_howto_v2>`),
+parses the test results and
+displays them in a user-friendly manner.
+
+Features
+--------
+
+- Provides a framework for writing unit tests.
+- Runs tests on any kernel architecture.
+- Runs a test in milliseconds.
+
+Prerequisites
+-------------
+
+- Any Linux kernel compatible hardware.
+- For the kernel under test, Linux kernel version 5.5 or greater.
+
+Unit Testing
+============
+
+A unit test tests a single unit of code in isolation. A unit test is the finest
+granularity of testing and allows all possible code paths to be tested in the
+code under test. This is possible if the code under test is small and does not
+have any external dependencies outside of the test's control like hardware.
+
+
+Write Unit Tests
+----------------
+
+To write good unit tests, there is a simple but powerful pattern:
+Arrange-Act-Assert. This is a great way to structure test cases and
+defines an order of operations (see the sketch after this list).
+
+- Arrange inputs and targets: At the start of the test, arrange the data
+  that allows a function to work (for example: initialize variables or
+  objects).
+- Act on the target behavior: Call your function/code under test.
+- Assert expected outcome: Verify that the result (or resulting state) is as
+ expected.
+
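+A minimal sketch of this pattern in a KUnit test (``add()`` is a
+hypothetical function under test):
+
+.. code-block:: c
+
+    static void add_test_basic(struct kunit *test)
+    {
+            /* Arrange: set up the inputs. */
+            int a = 2, b = 3;
+
+            /* Act: call the code under test. */
+            int sum = add(a, b);
+
+            /* Assert: verify the outcome. */
+            KUNIT_EXPECT_EQ(test, sum, 5);
+    }
+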
+Unit Testing Advantages
+-----------------------
+
+- Speeds up testing and development in the long run.
+- Detects bugs at an early stage and therefore decreases the cost of
+  fixing bugs compared to acceptance testing.
+- Improves code quality.
+- Encourages writing testable code.
+
+Read also :ref:`kinds-of-tests`.
How do I use it?
================
-* Documentation/dev-tools/kunit/start.rst - for new users of KUnit
-* Documentation/dev-tools/kunit/tips.rst - for short examples of best practices
-* Documentation/dev-tools/kunit/usage.rst - for a more detailed explanation of KUnit features
-* Documentation/dev-tools/kunit/api/index.rst - for the list of KUnit APIs used for testing
-* Documentation/dev-tools/kunit/kunit-tool.rst - for more information on the kunit_tool helper script
-* Documentation/dev-tools/kunit/faq.rst - for answers to some common questions about KUnit
+You can find a step-by-step guide to writing and running KUnit tests in
+Documentation/dev-tools/kunit/start.rst
+
+Alternatively, feel free to look through the rest of the KUnit documentation,
+or to experiment with tools/testing/kunit/kunit.py and the example test under
+lib/kunit/kunit-example-test.c
+
+Happy testing!
diff --git a/Documentation/dev-tools/kunit/kunit-tool.rst b/Documentation/dev-tools/kunit/kunit-tool.rst
deleted file mode 100644
index ae52e0f489f9..000000000000
--- a/Documentation/dev-tools/kunit/kunit-tool.rst
+++ /dev/null
@@ -1,232 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=================
-kunit_tool How-To
-=================
-
-What is kunit_tool?
-===================
-
-kunit_tool is a script (``tools/testing/kunit/kunit.py``) that aids in building
-the Linux kernel as UML (`User Mode Linux
-<http://user-mode-linux.sourceforge.net/>`_), running KUnit tests, parsing
-the test results and displaying them in a user friendly manner.
-
-kunit_tool addresses the problem of being able to run tests without needing a
-virtual machine or actual hardware with User Mode Linux. User Mode Linux is a
-Linux architecture, like ARM or x86; however, unlike other architectures it
-compiles the kernel as a standalone Linux executable that can be run like any
-other program directly inside of a host operating system. To be clear, it does
-not require any virtualization support: it is just a regular program.
-
-What is a .kunitconfig?
-=======================
-
-It's just a defconfig that kunit_tool looks for in the build directory
-(``.kunit`` by default). kunit_tool uses it to generate a .config as you might
-expect. In addition, it verifies that the generated .config contains the CONFIG
-options in the .kunitconfig; the reason it does this is so that it is easy to
-be sure that a CONFIG that enables a test actually ends up in the .config.
-
-It's also possible to pass a separate .kunitconfig fragment to kunit_tool,
-which is useful if you have several different groups of tests you wish
-to run independently, or if you want to use pre-defined test configs for
-certain subsystems.
-
-Getting Started with kunit_tool
-===============================
-
-If a kunitconfig is present at the root directory, all you have to do is:
-
-.. code-block:: bash
-
- ./tools/testing/kunit/kunit.py run
-
-However, you most likely want to use it with the following options:
-
-.. code-block:: bash
-
- ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all`
-
-- ``--timeout`` sets a maximum amount of time to allow tests to run.
-- ``--jobs`` sets the number of threads to use to build the kernel.
-
-.. note::
- This command will work even without a .kunitconfig file: if no
- .kunitconfig is present, a default one will be used instead.
-
-If you wish to use a different .kunitconfig file (such as one provided for
-testing a particular subsystem), you can pass it as an option.
-
-.. code-block:: bash
-
- ./tools/testing/kunit/kunit.py run --kunitconfig=fs/ext4/.kunitconfig
-
-For a list of all the flags supported by kunit_tool, you can run:
-
-.. code-block:: bash
-
- ./tools/testing/kunit/kunit.py run --help
-
-Configuring, Building, and Running Tests
-========================================
-
-It's also possible to run just parts of the KUnit build process independently,
-which is useful if you want to make manual changes to part of the process.
-
-A .config can be generated from a .kunitconfig by using the ``config`` argument
-when running kunit_tool:
-
-.. code-block:: bash
-
- ./tools/testing/kunit/kunit.py config
-
-Similarly, if you just want to build a KUnit kernel from the current .config,
-you can use the ``build`` argument:
-
-.. code-block:: bash
-
- ./tools/testing/kunit/kunit.py build
-
-And, if you already have a built UML kernel with built-in KUnit tests, you can
-run the kernel and display the test results with the ``exec`` argument:
-
-.. code-block:: bash
-
- ./tools/testing/kunit/kunit.py exec
-
-The ``run`` command which is discussed above is equivalent to running all three
-of these in sequence.
-
-All of these commands accept a number of optional command-line arguments. The
-``--help`` flag will give a complete list of these, or keep reading this page
-for a guide to some of the more useful ones.
-
-Parsing Test Results
-====================
-
-KUnit tests output their results in TAP (Test Anything Protocol) format.
-kunit_tool will, when running tests, parse this output and print a summary
-which is much more pleasant to read. If you wish to look at the raw test
-results in TAP format, you can pass the ``--raw_output`` argument.
-
-.. code-block:: bash
-
- ./tools/testing/kunit/kunit.py run --raw_output
-
-The raw output from test runs may contain other, non-KUnit kernel log
-lines. You can see just KUnit output with ``--raw_output=kunit``:
-
-.. code-block:: bash
-
- ./tools/testing/kunit/kunit.py run --raw_output=kunit
-
-If you have KUnit results in their raw TAP format, you can parse them and print
-the human-readable summary with the ``parse`` command for kunit_tool. This
-accepts a filename for an argument, or will read from standard input.
-
-.. code-block:: bash
-
- # Reading from a file
- ./tools/testing/kunit/kunit.py parse /var/log/dmesg
- # Reading from stdin
- dmesg | ./tools/testing/kunit/kunit.py parse
-
-This is very useful if you wish to run tests in a configuration not supported
-by kunit_tool (such as on real hardware, or an unsupported architecture).
-
-Filtering Tests
-===============
-
-It's possible to run only a subset of the tests built into a kernel by passing
-a filter to the ``exec`` or ``run`` commands. For example, if you only wanted
-to run KUnit resource tests, you could use:
-
-.. code-block:: bash
-
- ./tools/testing/kunit/kunit.py run 'kunit-resource*'
-
-This uses the standard glob format for wildcards.
-
-Running Tests on QEMU
-=====================
-
-kunit_tool supports running tests on QEMU as well as via UML (as mentioned
-elsewhere). The default way of running tests on QEMU requires two flags:
-
-``--arch``
- Selects a collection of configs (Kconfig as well as QEMU configs
- options, etc) that allow KUnit tests to be run on the specified
- architecture in a minimal way; this is usually not much slower than
- using UML. The architecture argument is the same as the name of the
- option passed to the ``ARCH`` variable used by Kbuild. Not all
- architectures are currently supported by this flag, but can be handled
- by the ``--qemu_config`` discussed later. If ``um`` is passed (or this
- this flag is ignored) the tests will run via UML. Non-UML architectures,
- e.g. i386, x86_64, arm, um, etc. Non-UML run on QEMU.
-
-``--cross_compile``
- Specifies the use of a toolchain by Kbuild. The argument passed here is
- the same passed to the ``CROSS_COMPILE`` variable used by Kbuild. As a
- reminder this will be the prefix for the toolchain binaries such as gcc
- for example ``sparc64-linux-gnu-`` if you have the sparc toolchain
- installed on your system, or
- ``$HOME/toolchains/microblaze/gcc-9.2.0-nolibc/microblaze-linux/bin/microblaze-linux-``
- if you have downloaded the microblaze toolchain from the 0-day website
- to a directory in your home directory called ``toolchains``.
-
-In many cases it is likely that you may want to run an architecture which is
-not supported by the ``--arch`` flag, or you may want to just run KUnit tests
-on QEMU using a non-default configuration. For this use case, you can write
-your own QemuConfig. These QemuConfigs are written in Python. They must have an
-import line ``from ..qemu_config import QemuArchParams`` at the top of the file
-and the file must contain a variable called ``QEMU_ARCH`` that has an instance
-of ``QemuArchParams`` assigned to it. An example can be seen in
-``tools/testing/kunit/qemu_configs/x86_64.py``.
-
-Once you have a QemuConfig you can pass it into kunit_tool using the
-``--qemu_config`` flag; when used this flag replaces the ``--arch`` flag. If we
-were to do this with the ``x86_64.py`` example from above, the invocation would
-look something like this:
-
-.. code-block:: bash
-
- ./tools/testing/kunit/kunit.py run \
- --timeout=60 \
- --jobs=12 \
- --qemu_config=./tools/testing/kunit/qemu_configs/x86_64.py
-
-Other Useful Options
-====================
-
-kunit_tool has a number of other command-line arguments which can be useful
-when adapting it to fit your environment or needs.
-
-Some of the more useful ones are:
-
-``--help``
- Lists all of the available options. Note that different commands
- (``config``, ``build``, ``run``, etc) will have different supported
- options. Place ``--help`` before the command to list common options,
- and after the command for options specific to that command.
-
-``--build_dir``
- Specifies the build directory that kunit_tool will use. This is where
- the .kunitconfig file is located, as well as where the .config and
- compiled kernel will be placed. Defaults to ``.kunit``.
-
-``--make_options``
- Specifies additional options to pass to ``make`` when compiling a
- kernel (with the ``build`` or ``run`` commands). For example, to enable
- compiler warnings, you can pass ``--make_options W=1``.
-
-``--alltests``
- Builds a UML kernel with all config options enabled using ``make
- allyesconfig``. This allows you to run as many tests as is possible,
- but is very slow and prone to breakage as new options are added or
- modified. In most cases, enabling all tests which have satisfied
- dependencies by adding ``CONFIG_KUNIT_ALL_TESTS=1`` to your
- .kunitconfig is preferable.
-
-There are several other options (and new ones are often added), so do check
-``--help`` if you're looking for something not mentioned here.
diff --git a/Documentation/dev-tools/kunit/kunit_suitememorydiagram.svg b/Documentation/dev-tools/kunit/kunit_suitememorydiagram.svg
new file mode 100644
index 000000000000..cf8fddc27500
--- /dev/null
+++ b/Documentation/dev-tools/kunit/kunit_suitememorydiagram.svg
@@ -0,0 +1,81 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<svg width="796.93" height="555.73" version="1.1" viewBox="0 0 796.93 555.73" xmlns="http://www.w3.org/2000/svg">
+ <g transform="translate(-13.724 -17.943)">
+ <g fill="#dad4d4" fill-opacity=".91765" stroke="#1a1a1a">
+ <rect x="323.56" y="18.443" width="115.75" height="41.331"/>
+ <rect x="323.56" y="463.09" width="115.75" height="41.331"/>
+ <rect x="323.56" y="531.84" width="115.75" height="41.331"/>
+ <rect x="323.56" y="88.931" width="115.75" height="74.231"/>
+ </g>
+ <g>
+ <rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
+ <text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
+ </g>
+ <g transform="translate(0 -258.6)">
+ <rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
+ <text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
+ </g>
+ <g transform="translate(0 -217.27)">
+ <rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
+ <text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
+ </g>
+ <g transform="translate(0 -175.94)">
+ <rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
+ <text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
+ </g>
+ <g transform="translate(0 -134.61)">
+ <rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
+ <text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
+ </g>
+ <g transform="translate(0 -41.331)">
+ <rect x="323.56" y="421.76" width="115.75" height="41.331" fill="#b9dbc6" stroke="#1a1a1a"/>
+ <text x="328.00888" y="446.61826" fill="#000000" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="328.00888" y="446.61826" font-family="monospace" font-size="16px">kunit_suite</tspan></text>
+ </g>
+ <g transform="translate(3.4459e-5 -.71088)">
+ <rect x="502.19" y="143.16" width="201.13" height="41.331" fill="#dad4d4" fill-opacity=".91765" stroke="#1a1a1a"/>
+ <text x="512.02319" y="168.02026" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="512.02319" y="168.02026" font-family="monospace">_kunit_suites_start</tspan></text>
+ </g>
+ <g transform="translate(3.0518e-5 -3.1753)">
+ <rect x="502.19" y="445.69" width="201.13" height="41.331" fill="#dad4d4" fill-opacity=".91765" stroke="#1a1a1a"/>
+ <text x="521.61694" y="470.54846" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="521.61694" y="470.54846" font-family="monospace">_kunit_suites_end</tspan></text>
+ </g>
+ <rect x="14.224" y="277.78" width="134.47" height="41.331" fill="#dad4d4" fill-opacity=".91765" stroke="#1a1a1a"/>
+ <text x="32.062176" y="304.41287" font-family="sans-serif" font-size="16px" style="line-height:1.25" xml:space="preserve"><tspan x="32.062176" y="304.41287" font-family="monospace">.init.data</tspan></text>
+ <g transform="translate(217.98 145.12)" stroke="#1a1a1a">
+ <circle cx="149.97" cy="373.01" r="3.4012"/>
+ <circle cx="163.46" cy="373.01" r="3.4012"/>
+ <circle cx="176.95" cy="373.01" r="3.4012"/>
+ </g>
+ <g transform="translate(217.98 -298.66)" stroke="#1a1a1a">
+ <circle cx="149.97" cy="373.01" r="3.4012"/>
+ <circle cx="163.46" cy="373.01" r="3.4012"/>
+ <circle cx="176.95" cy="373.01" r="3.4012"/>
+ </g>
+ <g stroke="#1a1a1a">
+ <rect x="323.56" y="328.49" width="115.75" height="51.549" fill="#b9dbc6"/>
+ <g transform="translate(217.98 -18.75)">
+ <circle cx="149.97" cy="373.01" r="3.4012"/>
+ <circle cx="163.46" cy="373.01" r="3.4012"/>
+ <circle cx="176.95" cy="373.01" r="3.4012"/>
+ </g>
+ </g>
+ <g transform="scale(1.0933 .9147)" stroke-width="32.937" aria-label="{">
+ <path d="m275.49 545.57c-35.836-8.432-47.43-24.769-47.957-64.821v-88.536c-0.527-44.795-10.54-57.97-49.538-67.456 38.998-10.013 49.011-23.715 49.538-67.983v-88.536c0.527-40.052 12.121-56.389 47.957-64.821v-5.797c-65.348 0-85.901 17.391-86.955 73.253v93.806c-0.527 36.89-10.013 50.065-44.795 59.551 34.782 10.013 44.268 23.188 44.795 60.078v93.279c1.581 56.389 21.607 73.78 86.955 73.78z"/>
+ </g>
+ <g transform="scale(1.1071 .90325)" stroke-width="14.44" aria-label="{">
+ <path d="m461.46 443.55c-15.711-3.6967-20.794-10.859-21.025-28.418v-38.815c-0.23104-19.639-4.6209-25.415-21.718-29.574 17.097-4.3898 21.487-10.397 21.718-29.805v-38.815c0.23105-17.559 5.314-24.722 21.025-28.418v-2.5415c-28.649 0-37.66 7.6244-38.122 32.115v41.126c-0.23105 16.173-4.3898 21.949-19.639 26.108 15.249 4.3898 19.408 10.166 19.639 26.339v40.895c0.69313 24.722 9.4728 32.346 38.122 32.346z"/>
+ </g>
+ <path d="m449.55 161.84v2.5h49.504v-2.5z" color="#000000" style="-inkscape-stroke:none"/>
+ <g fill-rule="evenodd">
+ <path d="m443.78 163.09 8.65-5v10z" color="#000000" stroke-width="1pt" style="-inkscape-stroke:none"/>
+ <path d="m453.1 156.94-10.648 6.1543 0.99804 0.57812 9.6504 5.5781zm-1.334 2.3125v7.6856l-6.6504-3.8438z" color="#000000" style="-inkscape-stroke:none"/>
+ </g>
+ <path d="m449.55 461.91v2.5h49.504v-2.5z" color="#000000" style="-inkscape-stroke:none"/>
+ <g fill-rule="evenodd">
+ <path d="m443.78 463.16 8.65-5v10z" color="#000000" stroke-width="1pt" style="-inkscape-stroke:none"/>
+ <path d="m453.1 457-10.648 6.1562 0.99804 0.57617 9.6504 5.5781zm-1.334 2.3125v7.6856l-6.6504-3.8438z" color="#000000" style="-inkscape-stroke:none"/>
+ </g>
+ <rect x="515.64" y="223.9" width="294.52" height="178.49" fill="#dad4d4" fill-opacity=".91765" stroke="#1a1a1a"/>
+ <text x="523.33319" y="262.52542" font-family="monospace" font-size="14.667px" style="line-height:1.25" xml:space="preserve"><tspan x="523.33319" y="262.52542"><tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">struct</tspan> kunit_suite {</tspan><tspan x="523.33319" y="280.8588"><tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold"> const char</tspan> name[<tspan fill="#ff00ff" font-size="14.667px">256</tspan>];</tspan><tspan x="523.33319" y="299.19217"> <tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">int</tspan> (*init)(<tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">struct</tspan> kunit *);</tspan><tspan x="523.33319" y="317.52554"> <tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">void</tspan> (*exit)(<tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">struct</tspan> kunit *);</tspan><tspan x="523.33319" y="335.85892"> <tspan fill="#008000" font-family="monospace" font-size="14.667px" font-weight="bold">struct</tspan> kunit_case *test_cases;</tspan><tspan x="523.33319" y="354.19229"> ...</tspan><tspan x="523.33319" y="372.52567">};</tspan></text>
+ </g>
+</svg>
diff --git a/Documentation/dev-tools/kunit/run_manual.rst b/Documentation/dev-tools/kunit/run_manual.rst
new file mode 100644
index 000000000000..e7b46421f247
--- /dev/null
+++ b/Documentation/dev-tools/kunit/run_manual.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Run Tests without kunit_tool
+============================
+
+If we do not want to use kunit_tool (for example: to integrate
+with other systems, or to run tests on real hardware), we can
+include KUnit in any kernel, read out results, and parse them manually.
+
+.. note:: KUnit is not designed for use in a production system. It is
+ possible that tests may reduce the stability or security of
+ the system.
+
+Configure the Kernel
+====================
+
+KUnit tests can run without kunit_tool. This can be useful if:
+
+- We have an existing kernel configuration to test.
+- We need to run on real hardware (or using an emulator/VM that
+  kunit_tool does not support).
+- We wish to integrate with some existing testing systems.
+
+KUnit is configured with the ``CONFIG_KUNIT`` option, and individual
+tests can also be built by enabling their config options in our
+``.config``. KUnit tests usually (but don't always) have config options
+ending in ``_KUNIT_TEST``. Most tests can either be built as a module,
+or be built into the kernel.
+
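+For example, a ``.config`` fragment which enables KUnit and the example
+test from ``lib/kunit``:
+
+.. code-block:: none
+
+    CONFIG_KUNIT=y
+    CONFIG_KUNIT_EXAMPLE_TEST=y
+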
+.. note::
+
+ We can enable the ``KUNIT_ALL_TESTS`` config option to
+ automatically enable all tests with satisfied dependencies. This is
+ a good way of quickly testing everything applicable to the current
+ config.
+
+Once we have built our kernel (and/or modules), it is simple to run
+the tests. If the tests are built-in, they will run automatically on
+kernel boot. The results will be written to the kernel log (``dmesg``)
+in TAP format.
+
+If the tests are built as modules, they will run when the module is
+loaded.
+
+.. code-block:: bash
+
+ # modprobe example-test
+
+The results will appear in TAP format in ``dmesg``.
+
+.. note::
+
+ If ``CONFIG_KUNIT_DEBUGFS`` is enabled, KUnit test results will
+ be accessible from the ``debugfs`` filesystem (if mounted).
+ They will be in ``/sys/kernel/debug/kunit/<test_suite>/results``, in
+ TAP format.
diff --git a/Documentation/dev-tools/kunit/run_wrapper.rst b/Documentation/dev-tools/kunit/run_wrapper.rst
new file mode 100644
index 000000000000..dafe8eb28d30
--- /dev/null
+++ b/Documentation/dev-tools/kunit/run_wrapper.rst
@@ -0,0 +1,323 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+Running tests with kunit_tool
+=============================
+
+We can either run KUnit tests using kunit_tool, or run tests
+manually and then use kunit_tool to parse the results. To run tests
+manually, see: Documentation/dev-tools/kunit/run_manual.rst.
+As long as we can build the kernel, we can run KUnit.
+
+kunit_tool is a Python script which configures and builds a kernel, runs
+tests, and formats the test results.
+
+To run tests, use the following command:
+
+.. code-block::
+
+ ./tools/testing/kunit/kunit.py run
+
+We should see the following:
+
+.. code-block::
+
+ Configuring KUnit Kernel ...
+ Building KUnit kernel...
+ Starting KUnit kernel...
+
+We may want to use the following options:
+
+.. code-block::
+
+ ./tools/testing/kunit/kunit.py run --timeout=30 --jobs=`nproc --all`
+
+- ``--timeout`` sets a maximum amount of time for tests to run.
+- ``--jobs`` sets the number of jobs (threads) used to build the kernel.
+
+If no ``.kunitconfig`` file exists in the build directory, kunit_tool
+will generate one with a default configuration. In addition, it
+verifies that the generated ``.config`` file contains the ``CONFIG``
+options listed in the ``.kunitconfig``.
+It is also possible to pass a separate ``.kunitconfig`` fragment to
+kunit_tool. This is useful if we have several different groups of
+tests we want to run independently, or if we want to use pre-defined
+test configs for certain subsystems.
+
+To use a different ``.kunitconfig`` file (such as one
+provided to test a particular subsystem), pass it as an option:
+
+.. code-block::
+
+ ./tools/testing/kunit/kunit.py run --kunitconfig=fs/ext4/.kunitconfig
+
+To view kunit_tool flags (optional command-line arguments), run:
+
+.. code-block::
+
+ ./tools/testing/kunit/kunit.py run --help
+
+Creating a ``.kunitconfig`` file
+================================
+
+If we want to run a specific set of tests (rather than those listed
+in the KUnit ``defconfig``), we can provide Kconfig options in the
+``.kunitconfig`` file. For the default ``.kunitconfig``, see:
+https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/kunit/configs/default.config.
+A ``.kunitconfig`` is a ``minconfig`` (a ``.config``
+generated by running ``make savedefconfig``), used for running a
+specific set of tests. This file contains the regular kernel configs
+with specific test targets. The ``.kunitconfig`` also
+contains any other config options required by the tests (for example,
+dependencies of the features under test, configs that enable or disable
+certain code blocks, arch configs, and so on).
+
+To create a ``.kunitconfig`` using the KUnit ``defconfig``:
+
+.. code-block::
+
+ cd $PATH_TO_LINUX_REPO
+ cp tools/testing/kunit/configs/default.config .kunit/.kunitconfig
+
+We can then add any other Kconfig options. For example:
+
+.. code-block::
+
+ CONFIG_LIST_KUNIT_TEST=y
+
+kunit_tool ensures that all config options in ``.kunitconfig`` are
+set in the kernel ``.config`` before running the tests. It warns us if
+we have not included the dependencies of the options used.
+
+.. note:: Removing something from the ``.kunitconfig`` will
+ not rebuild the ``.config`` file. The configuration is only
+ updated if the ``.kunitconfig`` is not a subset of ``.config``.
+ This means that we can use other tools
+ (for example, ``make menuconfig``) to adjust other config options.
+ The build directory must be set for ``make menuconfig`` to
+ work; with the default build directory, use ``make O=.kunit menuconfig``.
+
+Configuring, building, and running tests
+========================================
+
+If we want to make manual changes to the KUnit build process, we
+can run part of the KUnit build process independently.
+When running kunit_tool, from a ``.kunitconfig``, we can generate a
+``.config`` by using the ``config`` argument:
+
+.. code-block::
+
+ ./tools/testing/kunit/kunit.py config
+
+To build a KUnit kernel from the current ``.config``, we can use the
+``build`` argument:
+
+.. code-block::
+
+ ./tools/testing/kunit/kunit.py build
+
+If we have already built a UML kernel with built-in KUnit tests, we
+can run the kernel, and display the test results with the ``exec``
+argument:
+
+.. code-block::
+
+ ./tools/testing/kunit/kunit.py exec
+
+The ``run`` command discussed in **Running tests with kunit_tool** is
+equivalent to running the above three commands in sequence.
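+
+That is, the following sequence is roughly what ``kunit.py run`` does:
+
+.. code-block::
+
+	./tools/testing/kunit/kunit.py config
+	./tools/testing/kunit/kunit.py build
+	./tools/testing/kunit/kunit.py exec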
+
+Parsing test results
+====================
+
+KUnit tests output their results in TAP (Test Anything Protocol)
+format. When running tests, kunit_tool parses this output and prints
+a summary. To see the raw test results in TAP format, we can pass the
+``--raw_output`` argument:
+
+.. code-block::
+
+ ./tools/testing/kunit/kunit.py run --raw_output
+
+If we have KUnit results in the raw TAP format, we can parse them and
+print a human-readable summary with kunit_tool's ``parse`` command.
+This accepts a filename as an argument, or reads from standard input.
+
+.. code-block:: bash
+
+ # Reading from a file
+ ./tools/testing/kunit/kunit.py parse /var/log/dmesg
+ # Reading from stdin
+ dmesg | ./tools/testing/kunit/kunit.py parse
+
+Filtering tests
+===============
+
+By passing a bash style glob filter to the ``exec`` or ``run``
+commands, we can run a subset of the tests built into a kernel. For
+example, if we only want to run KUnit resource tests, use:
+
+.. code-block::
+
+ ./tools/testing/kunit/kunit.py run 'kunit-resource*'
+
+This uses the standard glob format with wildcard characters.
+
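+Filters can also match individual test cases within a suite using a
+``suite_glob.test_glob`` pattern. For example (an illustrative pattern):
+
+.. code-block::
+
+	./tools/testing/kunit/kunit.py run 'kunit-resource*.*init*'
+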
+.. _kunit-on-qemu:
+
+Running tests on QEMU
+=====================
+
+kunit_tool supports running tests on qemu as well as
+via UML. To run tests on qemu, two flags are required by default:
+
+- ``--arch``: Selects a collection of configs (Kconfig, qemu config options,
+ and so on) that allows KUnit tests to be run on the specified
+ architecture in a minimal way. The architecture argument is the same as
+ the value passed to the ``ARCH`` variable used by Kbuild.
+ Not all architectures currently support this flag, but we can use
+ ``--qemu_config`` to handle it. If ``um`` is passed (or this flag
+ is omitted), the tests will run via UML. Non-UML architectures
+ (for example, i386, x86_64, arm, and so on) run on qemu.
+
+- ``--cross_compile``: Specifies the Kbuild toolchain. It takes the
+ same argument as the ``CROSS_COMPILE`` variable used by
+ Kbuild. As a reminder, this will be the prefix for the toolchain
+ binaries such as GCC. For example:
+
+ - ``sparc64-linux-gnu-`` if we have the sparc toolchain installed on
+ our system.
+
+ - ``$HOME/toolchains/microblaze/gcc-9.2.0-nolibc/microblaze-linux/bin/microblaze-linux``
+ if we have downloaded the microblaze toolchain from the 0-day
+ website to a directory in our home directory called toolchains.
+
+This means that for most architectures, running under qemu is as simple as:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run --arch=x86_64
+
+When cross-compiling, we'll likely need to specify a different toolchain, for
+example:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run \
+ --arch=s390 \
+ --cross_compile=s390x-linux-gnu-
+
+If we want to run KUnit tests on an architecture not supported by
+the ``--arch`` flag, or want to run KUnit tests on qemu using a
+non-default configuration, we can write our own ``QemuConfig``.
+These ``QemuConfigs`` are written in Python. They have an import line
+``from ..qemu_config import QemuArchParams`` at the top of the file.
+The file must contain a variable called ``QEMU_ARCH`` that has an
+instance of ``QemuArchParams`` assigned to it. See example in:
+``tools/testing/kunit/qemu_configs/x86_64.py``.
+
+Once we have a ``QemuConfig``, we can pass it into kunit_tool,
+using the ``--qemu_config`` flag. When used, this flag replaces the
+``--arch`` flag. For example, using
+``tools/testing/kunit/qemu_configs/x86_64.py``, the invocation would
+appear as:
+
+.. code-block:: bash
+
+ ./tools/testing/kunit/kunit.py run \
+ --timeout=60 \
+ --jobs=12 \
+ --qemu_config=./tools/testing/kunit/qemu_configs/x86_64.py
+
+Command-line arguments
+======================
+
+kunit_tool has a number of other command-line arguments which can
+be useful for our test environment. Below are the most commonly used
+command line arguments:
+
+- ``--help``: Lists all available options. To list common options,
+ place ``--help`` before the command. To list options specific to that
+ command, place ``--help`` after the command.
+
+ .. note:: Different commands (``config``, ``build``, ``run``, etc)
+ have different supported options.
+- ``--build_dir``: Specifies the kunit_tool build directory. It contains
+ the ``.kunitconfig`` and ``.config`` files and the compiled kernel.
+
+- ``--make_options``: Specifies additional options to pass to make when
+ compiling a kernel (using the ``build`` or ``run`` commands). For example:
+ to enable compiler warnings, we can pass ``--make_options W=1``.
+
+- ``--alltests``: Enables a predefined set of options in order to build
+ as many tests as possible.
+
+ .. note:: The list of enabled options can be found in
+ ``tools/testing/kunit/configs/all_tests.config``.
+
+ If you only want to enable all tests with otherwise satisfied
+ dependencies, instead add ``CONFIG_KUNIT_ALL_TESTS=y`` to your
+ ``.kunitconfig``.
+
+- ``--kunitconfig``: Specifies the path or the directory of the ``.kunitconfig``
+ file. For example:
+
+ - ``lib/kunit/.kunitconfig`` can be the path of the file.
+
+ - ``lib/kunit`` can be the directory in which the file is located.
+
+ This file is used to build and run with a predefined set of tests
+ and their dependencies. For example, to run tests for a given subsystem.
+
+- ``--kconfig_add``: Specifies additional configuration options to be
+ appended to the ``.kunitconfig`` file. For example:
+
+ .. code-block::
+
+ ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_KASAN=y
+
+- ``--arch``: Runs tests on the specified architecture. The architecture
+ argument is the same as the Kbuild ``ARCH`` environment variable.
+ For example, i386, x86_64, arm, um, etc. Non-UML architectures run on qemu.
+ The default is ``um``.
+
+- ``--cross_compile``: Specifies the Kbuild toolchain. It takes the
+ same argument as the ``CROSS_COMPILE`` variable used by
+ Kbuild. This will be the prefix for the toolchain
+ binaries such as GCC. For example:
+
+ - ``sparc64-linux-gnu-`` if we have the sparc toolchain installed on
+ our system.
+
+ - ``$HOME/toolchains/microblaze/gcc-9.2.0-nolibc/microblaze-linux/bin/microblaze-linux``
+ if we have downloaded the microblaze toolchain from the 0-day
+ website to a directory in our home directory called toolchains.
+
+- ``--qemu_config``: Specifies the path to a file containing a
+ custom qemu architecture definition. This should be a Python file
+ containing a ``QemuArchParams`` object.
+
+- ``--qemu_args``: Specifies additional qemu arguments, for example, ``-smp 8``.
+
+- ``--jobs``: Specifies the number of jobs (commands) to run simultaneously.
+ By default, this is set to the number of cores on your system.
+
+- ``--timeout``: Specifies the maximum number of seconds allowed for all tests to run.
+ This does not include the time taken to build the tests.
+
+- ``--kernel_args``: Specifies additional kernel command-line arguments. May be repeated.
+
+- ``--run_isolated``: If set, boots the kernel for each individual suite/test.
+ This is useful for debugging a non-hermetic test, one that
+ might pass/fail based on what ran before it.
+
+- ``--raw_output``: If set, generates unformatted output from the kernel. Possible options are:
+
+ - ``all``: To view the full kernel output, use ``--raw_output=all``.
+
+ - ``kunit``: This is the default option and filters to KUnit output. Use ``--raw_output`` or ``--raw_output=kunit``.
+
+- ``--json``: If set, stores the test results in JSON format and prints
+ them to ``stdout``, or saves them to a file if a filename is specified
+ (see the example below).
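+
+For example, a run combining several of these options might look as
+follows (the JSON filename is illustrative):
+
+.. code-block:: bash
+
+	./tools/testing/kunit/kunit.py run \
+		--timeout=60 \
+		--jobs=12 \
+		--json=test_results.json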
diff --git a/Documentation/dev-tools/kunit/running_tips.rst b/Documentation/dev-tools/kunit/running_tips.rst
index 30d2147eb5b5..8e8c493f17d1 100644
--- a/Documentation/dev-tools/kunit/running_tips.rst
+++ b/Documentation/dev-tools/kunit/running_tips.rst
@@ -15,7 +15,7 @@ It can be handy to create a bash function like:
.. code-block:: bash
function run_kunit() {
- ( cd "$(git rev-parse --show-toplevel)" && ./tools/testing/kunit/kunit.py run $@ )
+ ( cd "$(git rev-parse --show-toplevel)" && ./tools/testing/kunit/kunit.py run "$@" )
}
.. note::
@@ -25,8 +25,8 @@ It can be handy to create a bash function like:
Running a subset of tests
-------------------------
-``kunit.py run`` accepts an optional glob argument to filter tests. Currently
-this only matches against suite names, but this may change in the future.
+``kunit.py run`` accepts an optional glob argument to filter tests. The format
+is ``"<suite_glob>[.test_glob]"``.
Say that we wanted to run the sysctl tests, we could do so via:
@@ -35,6 +35,13 @@ Say that we wanted to run the sysctl tests, we could do so via:
$ echo -e 'CONFIG_KUNIT=y\nCONFIG_KUNIT_ALL_TESTS=y' > .kunit/.kunitconfig
$ ./tools/testing/kunit/kunit.py run 'sysctl*'
+We can filter down to just the "write" tests via:
+
+.. code-block:: bash
+
+ $ echo -e 'CONFIG_KUNIT=y\nCONFIG_KUNIT_ALL_TESTS=y' > .kunit/.kunitconfig
+ $ ./tools/testing/kunit/kunit.py run 'sysctl*.*write*'
+
We're paying the cost of building more tests than we need this way, but it's
easier than fiddling with ``.kunitconfig`` files or commenting out
``kunit_suite``'s.
@@ -107,6 +114,7 @@ Instead of enabling ``CONFIG_GCOV_KERNEL=y``, we can set these options:
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_INFO=y
+ CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
CONFIG_GCOV=y
@@ -115,8 +123,7 @@ Putting it together into a copy-pastable sequence of commands:
.. code-block:: bash
# Append coverage options to the current config
- $ echo -e "CONFIG_DEBUG_KERNEL=y\nCONFIG_DEBUG_INFO=y\nCONFIG_GCOV=y" >> .kunit/.kunitconfig
- $ ./tools/testing/kunit/kunit.py run
+ $ ./tools/testing/kunit/kunit.py run --kunitconfig=.kunit/ --kunitconfig=tools/testing/kunit/configs/coverage_uml.config
# Extract the coverage information from the build dir (.kunit/)
$ lcov -t "my_kunit_tests" -o coverage.info -c -d .kunit/
diff --git a/Documentation/dev-tools/kunit/start.rst b/Documentation/dev-tools/kunit/start.rst
index 1e00f9226f74..c736613c9b19 100644
--- a/Documentation/dev-tools/kunit/start.rst
+++ b/Documentation/dev-tools/kunit/start.rst
@@ -4,132 +4,190 @@
Getting Started
===============
-Installing dependencies
-=======================
-KUnit has the same dependencies as the Linux kernel. As long as you can build
-the kernel, you can run KUnit.
+This page gives an overview of kunit_tool and the KUnit framework,
+shows how to run existing tests and then write a simple test case,
+and covers common problems users face when using KUnit for the first time.
-Running tests with the KUnit Wrapper
-====================================
-Included with KUnit is a simple Python wrapper which runs tests under User Mode
-Linux, and formats the test results.
+Installing Dependencies
+=======================
+KUnit has the same dependencies as the Linux kernel. As long as you can
+build the kernel, you can run KUnit.
-The wrapper can be run with:
+Running tests with kunit_tool
+=============================
+kunit_tool is a Python script which configures and builds a kernel, runs
+tests, and formats the test results. From the kernel repository, you
+can run kunit_tool:
.. code-block:: bash
./tools/testing/kunit/kunit.py run
-For more information on this wrapper (also called kunit_tool) check out the
-Documentation/dev-tools/kunit/kunit-tool.rst page.
+.. note ::
+ You may see the following error:
+ "The source tree is not clean, please run 'make ARCH=um mrproper'"
-Creating a .kunitconfig
------------------------
-If you want to run a specific set of tests (rather than those listed in the
-KUnit defconfig), you can provide Kconfig options in the ``.kunitconfig`` file.
-This file essentially contains the regular Kernel config, with the specific
-test targets as well. The ``.kunitconfig`` should also contain any other config
-options required by the tests.
+ This happens because internally kunit.py specifies ``.kunit``
+ (the default option) as the build directory in the command ``make O=output/dir``
+ through the argument ``--build_dir``. Hence, before starting an
+ out-of-tree build, the source tree must be clean.
-A good starting point for a ``.kunitconfig`` is the KUnit defconfig:
+ There is also the same caveat mentioned in the "Build directory for
+ the kernel" section of the :doc:`admin-guide </admin-guide/README>`,
+ that is, once used, it must be used for all invocations of ``make``.
+ The good news is that the error can indeed be solved by running
+ ``make ARCH=um mrproper``; just be aware that this will delete the
+ current configuration and all generated files.
-.. code-block:: bash
+If everything worked correctly, you should see the following:
- cd $PATH_TO_LINUX_REPO
- cp tools/testing/kunit/configs/default.config .kunitconfig
+.. code-block::
-You can then add any other Kconfig options you wish, e.g.:
+ Configuring KUnit Kernel ...
+ Building KUnit Kernel ...
+ Starting KUnit Kernel ...
-.. code-block:: none
+followed by a list of tests that are run, each of which may pass or fail.
- CONFIG_LIST_KUNIT_TEST=y
+.. note ::
+ Because it is building a lot of sources for the first time,
+ the ``Building KUnit Kernel`` step may take a while.
-:doc:`kunit_tool <kunit-tool>` will ensure that all config options set in
-``.kunitconfig`` are set in the kernel ``.config`` before running the tests.
-It'll warn you if you haven't included the dependencies of the options you're
-using.
+For detailed information on this wrapper, see:
+Documentation/dev-tools/kunit/run_wrapper.rst.
-.. note::
- Note that removing something from the ``.kunitconfig`` will not trigger a
- rebuild of the ``.config`` file: the configuration is only updated if the
- ``.kunitconfig`` is not a subset of ``.config``. This means that you can use
- other tools (such as make menuconfig) to adjust other config options.
+Selecting which tests to run
+----------------------------
+By default, kunit_tool runs all tests reachable with minimal configuration,
+that is, using default values for most of the kconfig options. However,
+you can select which tests to run by:
-Running the tests (KUnit Wrapper)
----------------------------------
+- `Customizing Kconfig`_ used to compile the kernel, or
+- `Filtering tests by name`_ to select specifically which compiled tests to run.
-To make sure that everything is set up correctly, simply invoke the Python
-wrapper from your kernel repo:
+Customizing Kconfig
+~~~~~~~~~~~~~~~~~~~
+A good starting point for the ``.kunitconfig`` is the KUnit default config.
+If you have not run ``kunit.py run`` yet, you can generate it by running:
.. code-block:: bash
- ./tools/testing/kunit/kunit.py run
+ cd $PATH_TO_LINUX_REPO
+ tools/testing/kunit/kunit.py config
+ cat .kunit/.kunitconfig
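+
+At the time of writing, the default config contains something like:
+
+.. code-block:: none
+
+	CONFIG_KUNIT=y
+	CONFIG_KUNIT_TEST=y
+	CONFIG_KUNIT_EXAMPLE_TEST=y
+	CONFIG_KUNIT_ALL_TESTS=y
+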
-.. note::
- You may want to run ``make mrproper`` first.
+.. note ::
+ ``.kunitconfig`` lives in the ``--build_dir`` used by kunit.py, which is
+ ``.kunit`` by default.
-If everything worked correctly, you should see the following:
+Before running the tests, kunit_tool ensures that all config options
+set in ``.kunitconfig`` are set in the kernel ``.config``. It will warn
+you if you have not included dependencies for the options used.
-.. code-block:: bash
+There are many ways to customize the configurations:
- Generating .config ...
- Building KUnit Kernel ...
- Starting KUnit Kernel ...
+a. Edit ``.kunit/.kunitconfig``. The file should contain the list of kconfig
+ options required to run the desired tests, including their dependencies.
+ You may want to remove ``CONFIG_KUNIT_ALL_TESTS`` from the ``.kunitconfig`` as
+ it will enable a number of additional tests that you may not want.
+ If you need to run on an architecture other than UML see :ref:`kunit-on-qemu`.
-followed by a list of tests that are run. All of them should be passing.
+b. Enable additional kconfig options on top of ``.kunit/.kunitconfig``.
+ For example, to include the kernel's linked-list test you can run::
-.. note::
- Because it is building a lot of sources for the first time, the
- ``Building KUnit kernel`` step may take a while.
+ ./tools/testing/kunit/kunit.py run \
+ --kconfig_add CONFIG_LIST_KUNIT_TEST=y
-Running tests without the KUnit Wrapper
-=======================================
+c. Provide the path of one or more .kunitconfig files from the tree.
+ For example, to run only ``FAT_FS`` and ``EXT4`` tests you can run::
-If you'd rather not use the KUnit Wrapper (if, for example, you need to
-integrate with other systems, or use an architecture other than UML), KUnit can
-be included in any kernel, and the results read out and parsed manually.
+ ./tools/testing/kunit/kunit.py run \
+ --kunitconfig ./fs/fat/.kunitconfig \
+ --kunitconfig ./fs/ext4/.kunitconfig
-.. note::
- KUnit is not designed for use in a production system, and it's possible that
- tests may reduce the stability or security of the system.
+d. If you change the ``.kunitconfig``, kunit.py will trigger a rebuild of the
+ ``.config`` file. But you can edit the ``.config`` file directly or with
+ tools like ``make menuconfig O=.kunit``. As long as it is a superset of
+ ``.kunitconfig``, kunit.py won't overwrite your changes.
+.. note ::
-Configuring the kernel
-----------------------
+ To save a .kunitconfig after finding a satisfactory configuration::
-In order to enable KUnit itself, you simply need to enable the ``CONFIG_KUNIT``
-Kconfig option (it's under Kernel Hacking/Kernel Testing and Coverage in
-menuconfig). From there, you can enable any KUnit tests you want: they usually
-have config options ending in ``_KUNIT_TEST``.
+ make savedefconfig O=.kunit
+ cp .kunit/defconfig .kunit/.kunitconfig
-KUnit and KUnit tests can be compiled as modules: in this case the tests in a
-module will be run when the module is loaded.
+Filtering tests by name
+~~~~~~~~~~~~~~~~~~~~~~~
+If you want to be more selective than Kconfig allows, it is also possible
+to select which tests to execute at boot-time by passing a glob filter
+(see the manpage :manpage:`glob(7)` for the pattern syntax).
+If there is a ``"."`` (period) in the filter, it will be interpreted as a
+separator between the name of the test suite and the test case;
+otherwise, it will be interpreted as the name of the test suite.
+For example, let's assume we are using the default config:
+a. provide the name of a test suite, like ``"kunit_executor_test"``,
+ to run every test case it contains::
-Running the tests (w/o KUnit Wrapper)
--------------------------------------
+ ./tools/testing/kunit/kunit.py run "kunit_executor_test"
+
+b. provide the name of a test case prefixed by its test suite,
+ like ``"example.example_simple_test"``, to run specifically that test case::
+
+ ./tools/testing/kunit/kunit.py run "example.example_simple_test"
+
+c. use wildcard characters (``*?[``) to run any test case that matches the pattern,
+ like ``"*.*64*"`` to run test cases containing ``"64"`` in the name inside
+ any test suite::
+
+ ./tools/testing/kunit/kunit.py run "*.*64*"
+
+Running Tests without the KUnit Wrapper
+=======================================
+If you do not want to use the KUnit Wrapper (for example, if you want
+code under test to integrate with other systems, or to use a different
+or unsupported architecture or configuration), KUnit can be included in
+any kernel, and the results can be read out and parsed manually.
+
+.. note ::
+ ``CONFIG_KUNIT`` should not be enabled in a production environment.
+ Enabling KUnit disables Kernel Address-Space Layout Randomization
+ (KASLR), and tests may affect the state of the kernel in ways not
+ suitable for production.
+
+Configuring the Kernel
+----------------------
+To enable KUnit itself, you need to enable the ``CONFIG_KUNIT`` Kconfig
+option (under Kernel Hacking/Kernel Testing and Coverage in
+``menuconfig``). From there, you can enable any KUnit tests. They
+usually have config options ending in ``_KUNIT_TEST``.
-Build and run your kernel as usual. Test output will be written to the kernel
-log in `TAP <https://testanything.org/>`_ format.
+KUnit and KUnit tests can be compiled as modules. The tests in a module
+will run when the module is loaded.
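+
+For example, to build KUnit itself and the linked-list test as modules,
+the configuration could include (a minimal sketch):
+
+.. code-block:: none
+
+	CONFIG_KUNIT=m
+	CONFIG_LIST_KUNIT_TEST=m
+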
-.. note::
- It's possible that there will be other lines and/or data interspersed in the
- TAP output.
+Running Tests (without KUnit Wrapper)
+-------------------------------------
+Build and run your kernel. In the kernel log, the test output is printed
+out in the TAP format. This will only happen by default if KUnit/tests
+are built-in. Otherwise, the module will need to be loaded.
+.. note ::
+ Some lines and/or data may get interspersed in the TAP output.
-Writing your first test
+Writing Your First Test
=======================
+In your kernel repository, let's add some code that we can test.
-In your kernel repo let's add some code that we can test. Create a file
-``drivers/misc/example.h`` with the contents:
+1. Create a file ``drivers/misc/example.h``, which includes:
.. code-block:: c
int misc_example_add(int left, int right);
-create a file ``drivers/misc/example.c``:
+2. Create a file ``drivers/misc/example.c``, which includes:
.. code-block:: c
@@ -142,21 +200,22 @@ create a file ``drivers/misc/example.c``:
return left + right;
}
-Now add the following lines to ``drivers/misc/Kconfig``:
+3. Add the following lines to ``drivers/misc/Kconfig``:
.. code-block:: kconfig
config MISC_EXAMPLE
bool "My example"
-and the following lines to ``drivers/misc/Makefile``:
+4. Add the following lines to ``drivers/misc/Makefile``:
.. code-block:: make
obj-$(CONFIG_MISC_EXAMPLE) += example.o
-Now we are ready to write the test. The test will be in
-``drivers/misc/example-test.c``:
+Now we are ready to write the test cases.
+
+1. Add the following test case to ``drivers/misc/example_test.c``:
.. code-block:: c
@@ -191,7 +250,7 @@ Now we are ready to write the test. The test will be in
};
kunit_test_suite(misc_example_test_suite);
-Now add the following to ``drivers/misc/Kconfig``:
+2. Add the following lines to ``drivers/misc/Kconfig``:
.. code-block:: kconfig
@@ -200,20 +259,20 @@ Now add the following to ``drivers/misc/Kconfig``:
depends on MISC_EXAMPLE && KUNIT=y
default KUNIT_ALL_TESTS
-and the following to ``drivers/misc/Makefile``:
+3. Add the following lines to ``drivers/misc/Makefile``:
.. code-block:: make
- obj-$(CONFIG_MISC_EXAMPLE_TEST) += example-test.o
+ obj-$(CONFIG_MISC_EXAMPLE_TEST) += example_test.o
-Now add it to your ``.kunitconfig``:
+4. Add the following lines to ``.kunit/.kunitconfig``:
.. code-block:: none
CONFIG_MISC_EXAMPLE=y
CONFIG_MISC_EXAMPLE_TEST=y
-Now you can run the test:
+5. Run the test:
.. code-block:: bash
@@ -227,16 +286,19 @@ You should see the following failure:
[16:08:57] [PASSED] misc-example:misc_example_add_test_basic
[16:08:57] [FAILED] misc-example:misc_example_test_failure
[16:08:57] EXPECTATION FAILED at drivers/misc/example-test.c:17
- [16:08:57] This test never passes.
+ [16:08:57] This test never passes.
...
-Congrats! You just wrote your first KUnit test!
+Congrats! You just wrote your first KUnit test.
Next Steps
==========
-* Check out the Documentation/dev-tools/kunit/tips.rst page for tips on
- writing idiomatic KUnit tests.
-* Check out the :doc:`running_tips` page for tips on
- how to make running KUnit tests easier.
-* Optional: see the :doc:`usage` page for a more
- in-depth explanation of KUnit.
+
+If you're interested in using some of the more advanced features of kunit.py,
+take a look at Documentation/dev-tools/kunit/run_wrapper.rst
+
+If you'd like to run tests without using kunit.py, check out
+Documentation/dev-tools/kunit/run_manual.rst
+
+For more information on writing KUnit tests (including some common techniques
+for testing different things), see Documentation/dev-tools/kunit/usage.rst
diff --git a/Documentation/dev-tools/kunit/style.rst b/Documentation/dev-tools/kunit/style.rst
index 8dbcdc552606..b6d0d7359f00 100644
--- a/Documentation/dev-tools/kunit/style.rst
+++ b/Documentation/dev-tools/kunit/style.rst
@@ -4,37 +4,36 @@
Test Style and Nomenclature
===========================
-To make finding, writing, and using KUnit tests as simple as possible, it's
+To make finding, writing, and using KUnit tests as simple as possible, it is
strongly encouraged that they are named and written according to the guidelines
-below. While it's possible to write KUnit tests which do not follow these rules,
+below. While it is possible to write KUnit tests which do not follow these rules,
they may break some tooling, may conflict with other tests, and may not be run
automatically by testing systems.
-It's recommended that you only deviate from these guidelines when:
+It is recommended that you only deviate from these guidelines when:
-1. Porting tests to KUnit which are already known with an existing name, or
-2. Writing tests which would cause serious problems if automatically run (e.g.,
- non-deterministically producing false positives or negatives, or taking an
- extremely long time to run).
+1. Porting tests to KUnit which are already known under an existing name.
+2. Writing tests which would cause serious problems if automatically run. For
+ example, non-deterministically producing false positives or negatives, or
+ taking a long time to run.
Subsystems, Suites, and Tests
=============================
-In order to make tests as easy to find as possible, they're grouped into suites
-and subsystems. A test suite is a group of tests which test a related area of
-the kernel, and a subsystem is a set of test suites which test different parts
-of the same kernel subsystem or driver.
+To make tests easy to find, they are grouped into suites and subsystems. A test
+suite is a group of tests which test a related area of the kernel. A subsystem
+is a set of test suites which test different parts of a kernel subsystem
+or a driver.
Subsystems
----------
Every test suite must belong to a subsystem. A subsystem is a collection of one
or more KUnit test suites which test the same driver or part of the kernel. A
-rule of thumb is that a test subsystem should match a single kernel module. If
-the code being tested can't be compiled as a module, in many cases the subsystem
-should correspond to a directory in the source tree or an entry in the
-MAINTAINERS file. If unsure, follow the conventions set by tests in similar
-areas.
+test subsystem should match a single kernel module. If the code being tested
+cannot be compiled as a module, in many cases the subsystem should correspond to
+a directory in the source tree or an entry in the ``MAINTAINERS`` file. If
+unsure, follow the conventions set by tests in similar areas.
Test subsystems should be named after the code being tested, either after the
module (wherever possible), or after the directory or files being tested. Test
@@ -42,9 +41,8 @@ subsystems should be named to avoid ambiguity where necessary.
If a test subsystem name has multiple components, they should be separated by
underscores. *Do not* include "test" or "kunit" directly in the subsystem name
-unless you are actually testing other tests or the kunit framework itself.
-
-Example subsystems could be:
+unless we are actually testing other tests or the kunit framework itself. For
+example, subsystems could be called:
``ext4``
Matches the module and filesystem name.
@@ -56,48 +54,46 @@ Example subsystems could be:
Has several components (``snd``, ``hda``, ``codec``, ``hdmi``) separated by
underscores. Matches the module name.
-Avoid names like these:
+Avoid names like the examples below:
``linear-ranges``
Names should use underscores, not dashes, to separate words. Prefer
``linear_ranges``.
``qos-kunit-test``
- As well as using underscores, this name should not have "kunit-test" as a
- suffix, and ``qos`` is ambiguous as a subsystem name. ``power_qos`` would be a
- better name.
+ This name should use underscores, and not have "kunit-test" as a
+ suffix. ``qos`` is also ambiguous as a subsystem name, because several parts
+ of the kernel have a ``qos`` subsystem. ``power_qos`` would be a better name.
``pc_parallel_port``
The corresponding module name is ``parport_pc``, so this subsystem should also
be named ``parport_pc``.
.. note::
- The KUnit API and tools do not explicitly know about subsystems. They're
- simply a way of categorising test suites and naming modules which
- provides a simple, consistent way for humans to find and run tests. This
- may change in the future, though.
+ The KUnit API and tools do not explicitly know about subsystems. They are
+ a way of categorizing test suites and naming modules which provides a
+ simple, consistent way for humans to find and run tests. This may change
+ in the future.
Suites
------
KUnit tests are grouped into test suites, which cover a specific area of
-functionality being tested. Test suites can have shared initialisation and
-shutdown code which is run for all tests in the suite.
-Not all subsystems will need to be split into multiple test suites (e.g. simple drivers).
+functionality being tested. Test suites can have shared initialization and
+shutdown code which is run for all tests in the suite. Not all subsystems need
+to be split into multiple test suites (for example, simple drivers).
Test suites are named after the subsystem they are part of. If a subsystem
contains several suites, the specific area under test should be appended to the
subsystem name, separated by an underscore.
In the event that there are multiple types of test using KUnit within a
-subsystem (e.g., both unit tests and integration tests), they should be put into
-separate suites, with the type of test as the last element in the suite name.
-Unless these tests are actually present, avoid using ``_test``, ``_unittest`` or
-similar in the suite name.
+subsystem (for example, both unit tests and integration tests), they should be
+put into separate suites, with the type of test as the last element in the suite
+name. Unless these tests are actually present, avoid using ``_test``, ``_unittest``
+or similar in the suite name.
The full test suite name (including the subsystem name) should be specified as
the ``.name`` member of the ``kunit_suite`` struct, and forms the base for the
-module name (see below).
-
-Example test suites could include:
+module name. For example, test suites could include:
``ext4_inode``
Part of the ``ext4`` subsystem, testing the ``inode`` area.
@@ -109,26 +105,27 @@ Example test suites could include:
The ``kasan`` subsystem has only one suite, so the suite name is the same as
the subsystem name.
-Avoid names like:
+Avoid names such as:
``ext4_ext4_inode``
- There's no reason to state the subsystem twice.
+ There is no reason to state the subsystem twice.
``property_entry``
The suite name is ambiguous without the subsystem name.
``kasan_integration_test``
Because there is only one suite in the ``kasan`` subsystem, the suite should
- just be called ``kasan``. There's no need to redundantly add
- ``integration_test``. Should a separate test suite with, for example, unit
- tests be added, then that suite could be named ``kasan_unittest`` or similar.
+ just be called ``kasan``. Do not redundantly add
+ ``integration_test``. Should a separate test suite (for example, one
+ with unit tests) be added later, that suite could be named
+ ``kasan_unittest`` or similar.
Test Cases
----------
Individual tests consist of a single function which tests a constrained
-codepath, property, or function. In the test output, individual tests' results
-will show up as subtests of the suite's results.
+codepath, property, or function. In the test output, an individual test's
+results will show up as subtests of the suite's results.
-Tests should be named after what they're testing. This is often the name of the
+Tests should be named after what they are testing. This is often the name of the
function being tested, with a description of the input or codepath being tested.
As tests are C functions, they should be named and written in accordance with
the kernel coding style.
@@ -136,7 +133,7 @@ the kernel coding style.
.. note::
As tests are themselves functions, their names cannot conflict with
other C identifiers in the kernel. This may require some creative
- naming. It's a good idea to make your test functions `static` to avoid
+ naming. It is a good idea to make your test functions `static` to avoid
polluting the global namespace.
Example test names include:
@@ -162,16 +159,16 @@ This Kconfig entry must:
* be named ``CONFIG_<name>_KUNIT_TEST``: where <name> is the name of the test
suite.
* be listed either alongside the config entries for the driver/subsystem being
- tested, or be under [Kernel Hacking]→[Kernel Testing and Coverage]
-* depend on ``CONFIG_KUNIT``
+ tested, or be under [Kernel Hacking]->[Kernel Testing and Coverage].
+* depend on ``CONFIG_KUNIT``.
* be visible only if ``CONFIG_KUNIT_ALL_TESTS`` is not enabled.
* have a default value of ``CONFIG_KUNIT_ALL_TESTS``.
-* have a brief description of KUnit in the help text
+* have a brief description of KUnit in the help text.
-Unless there's a specific reason not to (e.g. the test is unable to be built as
-a module), Kconfig entries for tests should be tristate.
+Unless we are unable to meet the above conditions (for example, if the test
+cannot be built as a module), Kconfig entries for tests should be tristate.
-An example Kconfig entry:
+For example, a Kconfig entry might look like:
.. code-block:: none
@@ -182,8 +179,8 @@ An example Kconfig entry:
help
This builds unit tests for foo.
- For more information on KUnit and unit tests in general, please refer
- to the KUnit documentation in Documentation/dev-tools/kunit/.
+ For more information on KUnit and unit tests in general,
+ please refer to the KUnit documentation in Documentation/dev-tools/kunit/.
If unsure, say N.
diff --git a/Documentation/dev-tools/kunit/tips.rst b/Documentation/dev-tools/kunit/tips.rst
deleted file mode 100644
index 492d2ded2f5a..000000000000
--- a/Documentation/dev-tools/kunit/tips.rst
+++ /dev/null
@@ -1,190 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-============================
-Tips For Writing KUnit Tests
-============================
-
-Exiting early on failed expectations
-------------------------------------
-
-``KUNIT_EXPECT_EQ`` and friends will mark the test as failed and continue
-execution. In some cases, it's unsafe to continue and you can use the
-``KUNIT_ASSERT`` variant to exit on failure.
-
-.. code-block:: c
-
- void example_test_user_alloc_function(struct kunit *test)
- {
- void *object = alloc_some_object_for_me();
-
- /* Make sure we got a valid pointer back. */
- KUNIT_ASSERT_NOT_ERR_OR_NULL(test, object);
- do_something_with_object(object);
- }
-
-Allocating memory
------------------
-
-Where you would use ``kzalloc``, you should prefer ``kunit_kzalloc`` instead.
-KUnit will ensure the memory is freed once the test completes.
-
-This is particularly useful since it lets you use the ``KUNIT_ASSERT_EQ``
-macros to exit early from a test without having to worry about remembering to
-call ``kfree``.
-
-Example:
-
-.. code-block:: c
-
- void example_test_allocation(struct kunit *test)
- {
- char *buffer = kunit_kzalloc(test, 16, GFP_KERNEL);
- /* Ensure allocation succeeded. */
- KUNIT_ASSERT_NOT_ERR_OR_NULL(test, buffer);
-
- KUNIT_ASSERT_STREQ(test, buffer, "");
- }
-
-
-Testing static functions
-------------------------
-
-If you don't want to expose functions or variables just for testing, one option
-is to conditionally ``#include`` the test file at the end of your .c file, e.g.
-
-.. code-block:: c
-
- /* In my_file.c */
-
- static int do_interesting_thing();
-
- #ifdef CONFIG_MY_KUNIT_TEST
- #include "my_kunit_test.c"
- #endif
-
-Injecting test-only code
-------------------------
-
-Similarly to the above, it can be useful to add test-specific logic.
-
-.. code-block:: c
-
- /* In my_file.h */
-
- #ifdef CONFIG_MY_KUNIT_TEST
- /* Defined in my_kunit_test.c */
- void test_only_hook(void);
- #else
- void test_only_hook(void) { }
- #endif
-
-This test-only code can be made more useful by accessing the current kunit
-test, see below.
-
-Accessing the current test
---------------------------
-
-In some cases, you need to call test-only code from outside the test file, e.g.
-like in the example above or if you're providing a fake implementation of an
-ops struct.
-There is a ``kunit_test`` field in ``task_struct``, so you can access it via
-``current->kunit_test``.
-
-Here's a slightly in-depth example of how one could implement "mocking":
-
-.. code-block:: c
-
- #include <linux/sched.h> /* for current */
-
- struct test_data {
- int foo_result;
- int want_foo_called_with;
- };
-
- static int fake_foo(int arg)
- {
- struct kunit *test = current->kunit_test;
- struct test_data *test_data = test->priv;
-
- KUNIT_EXPECT_EQ(test, test_data->want_foo_called_with, arg);
- return test_data->foo_result;
- }
-
- static void example_simple_test(struct kunit *test)
- {
- /* Assume priv is allocated in the suite's .init */
- struct test_data *test_data = test->priv;
-
- test_data->foo_result = 42;
- test_data->want_foo_called_with = 1;
-
- /* In a real test, we'd probably pass a pointer to fake_foo somewhere
- * like an ops struct, etc. instead of calling it directly. */
- KUNIT_EXPECT_EQ(test, fake_foo(1), 42);
- }
-
-
-Note: here we're able to get away with using ``test->priv``, but if you wanted
-something more flexible you could use a named ``kunit_resource``, see
-Documentation/dev-tools/kunit/api/test.rst.
-
-Failing the current test
-------------------------
-
-But sometimes, you might just want to fail the current test. In that case, we
-have ``kunit_fail_current_test(fmt, args...)`` which is defined in ``<kunit/test-bug.h>`` and
-doesn't require pulling in ``<kunit/test.h>``.
-
-E.g. say we had an option to enable some extra debug checks on some data structure:
-
-.. code-block:: c
-
- #include <kunit/test-bug.h>
-
- #ifdef CONFIG_EXTRA_DEBUG_CHECKS
- static void validate_my_data(struct data *data)
- {
- if (is_valid(data))
- return;
-
- kunit_fail_current_test("data %p is invalid", data);
-
- /* Normal, non-KUnit, error reporting code here. */
- }
- #else
- static void my_debug_function(void) { }
- #endif
-
-
-Customizing error messages
---------------------------
-
-Each of the ``KUNIT_EXPECT`` and ``KUNIT_ASSERT`` macros have a ``_MSG`` variant.
-These take a format string and arguments to provide additional context to the automatically generated error messages.
-
-.. code-block:: c
-
- char some_str[41];
- generate_sha1_hex_string(some_str);
-
- /* Before. Not easy to tell why the test failed. */
- KUNIT_EXPECT_EQ(test, strlen(some_str), 40);
-
- /* After. Now we see the offending string. */
- KUNIT_EXPECT_EQ_MSG(test, strlen(some_str), 40, "some_str='%s'", some_str);
-
-Alternatively, one can take full control over the error message by using ``KUNIT_FAIL()``, e.g.
-
-.. code-block:: c
-
- /* Before */
- KUNIT_EXPECT_EQ(test, some_setup_function(), 0);
-
- /* After: full control over the failure message. */
- if (some_setup_function())
- KUNIT_FAIL(test, "Failed to setup thing for testing");
-
-Next Steps
-==========
-* Optional: see the Documentation/dev-tools/kunit/usage.rst page for a more
- in-depth explanation of KUnit.
diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
index 63f1bb89ebf5..9faf2b4153fc 100644
--- a/Documentation/dev-tools/kunit/usage.rst
+++ b/Documentation/dev-tools/kunit/usage.rst
@@ -1,57 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0
-===========
-Using KUnit
-===========
-
-The purpose of this document is to describe what KUnit is, how it works, how it
-is intended to be used, and all the concepts and terminology that are needed to
-understand it. This guide assumes a working knowledge of the Linux kernel and
-some basic knowledge of testing.
-
-For a high level introduction to KUnit, including setting up KUnit for your
-project, see Documentation/dev-tools/kunit/start.rst.
-
-Organization of this document
-=============================
-
-This document is organized into two main sections: Testing and Common Patterns.
-The first covers what unit tests are and how to use KUnit to write them. The
-second covers common testing patterns, e.g. how to isolate code and make it
-possible to unit test code that was otherwise un-unit-testable.
-
-Testing
-=======
-
-What is KUnit?
---------------
-
-"K" is short for "kernel" so "KUnit" is the "(Linux) Kernel Unit Testing
-Framework." KUnit is intended first and foremost for writing unit tests; it is
-general enough that it can be used to write integration tests; however, this is
-a secondary goal. KUnit has no ambition of being the only testing framework for
-the kernel; for example, it does not intend to be an end-to-end testing
-framework.
-
-What is Unit Testing?
----------------------
-
-A `unit test <https://martinfowler.com/bliki/UnitTest.html>`_ is a test that
-tests code at the smallest possible scope, a *unit* of code. In the C
-programming language that's a function.
-
-Unit tests should be written for all the publicly exposed functions in a
-compilation unit; so that is all the functions that are exported in either a
-*class* (defined below) or all functions which are **not** static.
-
Writing Tests
--------------
+=============
Test Cases
-~~~~~~~~~~
+----------
The fundamental unit in KUnit is the test case. A test case is a function with
-the signature ``void (*)(struct kunit *test)``. It calls a function to be tested
+the signature ``void (*)(struct kunit *test)``. It calls the function under test
and then sets *expectations* for what should happen. For example:
.. code-block:: c
@@ -65,18 +21,19 @@ and then sets *expectations* for what should happen. For example:
KUNIT_FAIL(test, "This test never passes.");
}
-In the above example ``example_test_success`` always passes because it does
-nothing; no expectations are set, so all expectations pass. On the other hand
-``example_test_failure`` always fails because it calls ``KUNIT_FAIL``, which is
-a special expectation that logs a message and causes the test case to fail.
+In the above example, ``example_test_success`` always passes because it does
+nothing; no expectations are set, and therefore all expectations pass. On the
+other hand ``example_test_failure`` always fails because it calls ``KUNIT_FAIL``,
+which is a special expectation that logs a message and causes the test case to
+fail.
Expectations
~~~~~~~~~~~~
-An *expectation* is a way to specify that you expect a piece of code to do
-something in a test. An expectation is called like a function. A test is made
-by setting expectations about the behavior of a piece of code under test; when
-one or more of the expectations fail, the test case fails and information about
-the failure is logged. For example:
+An *expectation* specifies that we expect a piece of code to do something in a
+test. An expectation is called like a function. A test is made by setting
+expectations about the behavior of a piece of code under test. When one or more
+expectations fail, the test case fails and information about the failure is
+logged. For example:
.. code-block:: c
@@ -86,29 +43,28 @@ the failure is logged. For example:
KUNIT_EXPECT_EQ(test, 2, add(1, 1));
}
-In the above example ``add_test_basic`` makes a number of assertions about the
-behavior of a function called ``add``; the first parameter is always of type
-``struct kunit *``, which contains information about the current test context;
-the second parameter, in this case, is what the value is expected to be; the
+In the above example, ``add_test_basic`` makes a number of assertions about the
+behavior of a function called ``add``. The first parameter is always of type
+``struct kunit *``, which contains information about the current test context.
+The second parameter, in this case, is what the value is expected to be. The
last value is what the value actually is. If ``add`` passes all of these
expectations, the test case, ``add_test_basic`` will pass; if any one of these
expectations fails, the test case will fail.
-It is important to understand that a test case *fails* when any expectation is
-violated; however, the test will continue running, potentially trying other
-expectations until the test case ends or is otherwise terminated. This is as
-opposed to *assertions* which are discussed later.
+A test case *fails* when any expectation is violated; however, the test will
+continue to run, and try other expectations until the test case ends or is
+otherwise terminated. This is as opposed to *assertions* which are discussed
+later.
-To learn about more expectations supported by KUnit, see
-Documentation/dev-tools/kunit/api/test.rst.
+To learn more about KUnit expectations, see Documentation/dev-tools/kunit/api/test.rst.
.. note::
- A single test case should be pretty short, pretty easy to understand,
- focused on a single behavior.
+ A single test case should be short, easy to understand, and focused on a
+ single behavior.
-For example, if we wanted to properly test the add function above, we would
-create additional tests cases which would each test a different property that an
-add function should have like this:
+For example, if we want to rigorously test the ``add`` function above, create
+additional test cases which would test each property that an ``add`` function
+should have, as shown below:
.. code-block:: c
@@ -134,56 +90,77 @@ add function should have like this:
KUNIT_EXPECT_EQ(test, INT_MIN, add(INT_MAX, 1));
}
-Notice how it is immediately obvious what all the properties that we are testing
-for are.
-
Assertions
~~~~~~~~~~
-KUnit also has the concept of an *assertion*. An assertion is just like an
-expectation except the assertion immediately terminates the test case if it is
-not satisfied.
-
-For example:
+An assertion is like an expectation, except that the assertion immediately
+terminates the test case if the condition is not satisfied. For example:
.. code-block:: c
- static void mock_test_do_expect_default_return(struct kunit *test)
+ static void test_sort(struct kunit *test)
{
- struct mock_test_context *ctx = test->priv;
- struct mock *mock = ctx->mock;
- int param0 = 5, param1 = -5;
- const char *two_param_types[] = {"int", "int"};
- const void *two_params[] = {&param0, &param1};
- const void *ret;
-
- ret = mock->do_expect(mock,
- "test_printk", test_printk,
- two_param_types, two_params,
- ARRAY_SIZE(two_params));
- KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ret);
- KUNIT_EXPECT_EQ(test, -4, *((int *) ret));
+ int *a, i, r = 1;
+ a = kunit_kmalloc_array(test, TEST_LEN, sizeof(*a), GFP_KERNEL);
+ KUNIT_ASSERT_NOT_ERR_OR_NULL(test, a);
+ for (i = 0; i < TEST_LEN; i++) {
+ r = (r * 725861) % 6599;
+ a[i] = r;
+ }
+ sort(a, TEST_LEN, sizeof(*a), cmpint, NULL);
+ for (i = 0; i < TEST_LEN-1; i++)
+ KUNIT_EXPECT_LE(test, a[i], a[i + 1]);
}
-In this example, the method under test should return a pointer to a value, so
-if the pointer returned by the method is null or an errno, we don't want to
-bother continuing the test since the following expectation could crash the test
-case. `ASSERT_NOT_ERR_OR_NULL(...)` allows us to bail out of the test case if
-the appropriate conditions have not been satisfied to complete the test.
+In this example, we need to be able to allocate an array to test the ``sort()``
+function. So we use ``KUNIT_ASSERT_NOT_ERR_OR_NULL()`` to abort the test if
+there is an allocation error.
+
+.. note::
+ In other test frameworks, ``ASSERT`` macros are often implemented by calling
+ ``return`` so they only work from the test function. In KUnit, we stop the
+ current kthread on failure, so you can call them from anywhere.
+
+Customizing error messages
+--------------------------
+
+Each of the ``KUNIT_EXPECT`` and ``KUNIT_ASSERT`` macros have a ``_MSG``
+variant. These take a format string and arguments to provide additional
+context to the automatically generated error messages.
+
+.. code-block:: c
+
+ char some_str[41];
+ generate_sha1_hex_string(some_str);
+
+ /* Before. Not easy to tell why the test failed. */
+ KUNIT_EXPECT_EQ(test, strlen(some_str), 40);
+
+ /* After. Now we see the offending string. */
+ KUNIT_EXPECT_EQ_MSG(test, strlen(some_str), 40, "some_str='%s'", some_str);
+
+Alternatively, one can take full control over the error message by using
+``KUNIT_FAIL()``, for example:
+
+.. code-block:: c
+
+ /* Before */
+ KUNIT_EXPECT_EQ(test, some_setup_function(), 0);
+
+ /* After: full control over the failure message. */
+ if (some_setup_function())
+ KUNIT_FAIL(test, "Failed to setup thing for testing");
+
Test Suites
~~~~~~~~~~~
-Now obviously one unit test isn't very helpful; the power comes from having
-many test cases covering all of a unit's behaviors. Consequently it is common
-to have many *similar* tests; in order to reduce duplication in these closely
-related tests most unit testing frameworks - including KUnit - provide the
-concept of a *test suite*. A *test suite* is just a collection of test cases
-for a unit of code with a set up function that gets invoked before every test
-case and then a tear down function that gets invoked after every test case
-completes.
-
-Example:
+We need many test cases covering all the unit's behaviors. It is common to have
+many similar tests. In order to reduce duplication in these closely related
+tests, most unit testing frameworks (including KUnit) provide the concept of a
+*test suite*. A test suite is a collection of test cases for a unit of code
+with optional setup and teardown functions that run before/after the whole
+suite and/or every test case. For example:
.. code-block:: c
@@ -198,27 +175,57 @@ Example:
.name = "example",
.init = example_test_init,
.exit = example_test_exit,
+ .suite_init = example_suite_init,
+ .suite_exit = example_suite_exit,
.test_cases = example_test_cases,
};
kunit_test_suite(example_test_suite);
-In the above example the test suite, ``example_test_suite``, would run the test
-cases ``example_test_foo``, ``example_test_bar``, and ``example_test_baz``;
-each would have ``example_test_init`` called immediately before it and would
-have ``example_test_exit`` called immediately after it.
-``kunit_test_suite(example_test_suite)`` registers the test suite with the
-KUnit test framework.
+In the above example, the test suite ``example_test_suite`` would first run
+``example_suite_init``, then run the test cases ``example_test_foo``,
+``example_test_bar``, and ``example_test_baz``. Each would have
+``example_test_init`` called immediately before it and ``example_test_exit``
+called immediately after it. Finally, ``example_suite_exit`` would be called
+after everything else. ``kunit_test_suite(example_test_suite)`` registers the
+test suite with the KUnit test framework.
.. note::
- A test case will only be run if it is associated with a test suite.
+ A test case will only run if it is associated with a test suite.
+
+``kunit_test_suite(...)`` is a macro which tells the linker to put the
+specified test suite in a special linker section so that it can be run by KUnit
+either after ``late_init``, or when the test module is loaded (if the test was
+built as a module).
+
+For more information, see Documentation/dev-tools/kunit/api/test.rst.
+
+.. _kunit-on-non-uml:
+
+Writing Tests For Other Architectures
+-------------------------------------
+
+It is better to write tests that run on UML than tests that only run under a
+particular architecture. It is better to write tests that run under QEMU or
+another easy-to-obtain (and monetarily free) software environment than tests
+that require a specific piece of hardware.
-``kunit_test_suite(...)`` is a macro which tells the linker to put the specified
-test suite in a special linker section so that it can be run by KUnit either
-after late_init, or when the test module is loaded (depending on whether the
-test was built in or not).
+Nevertheless, there are still valid reasons to write a test that is architecture
+or hardware specific. For example, we might want to test code that really
+belongs in ``arch/some-arch/*``. Even so, try to write the test so that it does
+not depend on physical hardware. Some of our test cases may not need
+hardware; often only a few tests actually require it. When hardware is not
+available, instead of disabling tests, we can skip them.
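+
+For example, a test case might skip itself when the hardware is absent.
+Below is a minimal sketch; ``hardware_is_present()`` is a hypothetical
+helper standing in for whatever detection the driver provides:
+
+.. code-block:: c
+
+	static void my_hardware_test(struct kunit *test)
+	{
+		/* Skip (rather than fail) when the hardware is missing. */
+		if (!hardware_is_present())
+			kunit_skip(test, "hardware not available");
+
+		/* Exercise the hardware-specific code here. */
+	}
+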
-For more information on these types of things see the
-Documentation/dev-tools/kunit/api/test.rst.
+Now that we have narrowed down exactly what bits are hardware specific, the
+actual procedure for writing and running the tests is the same as for normal
+KUnit tests.
+
+.. important::
+ We may have to reset hardware state. If this is not possible, we may only
+ be able to run one test case per invocation.
+
+.. TODO(brendanhiggins@google.com): Add an actual example of an architecture-
+ dependent KUnit test.
Common Patterns
===============
@@ -226,43 +233,39 @@ Common Patterns
Isolating Behavior
------------------
-The most important aspect of unit testing that other forms of testing do not
-provide is the ability to limit the amount of code under test to a single unit.
-In practice, this is only possible by being able to control what code gets run
-when the unit under test calls a function and this is usually accomplished
-through some sort of indirection where a function is exposed as part of an API
-such that the definition of that function can be changed without affecting the
-rest of the code base. In the kernel this primarily comes from two constructs,
-classes, structs that contain function pointers that are provided by the
-implementer, and architecture-specific functions which have definitions selected
-at compile time.
+Unit testing limits the amount of code under test to a single unit. It controls
+what code gets run when the unit under test calls a function. This is done
+through indirection: a function is exposed as part of an API so that its
+definition can be changed without affecting the rest of the code base. In the
+kernel, this comes from two constructs: classes, which are structs that contain
+function pointers provided by the implementer, and architecture-specific
+functions, which have definitions selected at compile time.
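+
+For illustration (all names here are hypothetical), a definition can be
+selected at compile time like so:
+
+.. code-block:: c
+
+	/* One of these definitions is chosen when the kernel is configured. */
+	#ifdef CONFIG_MY_FAST_PATH
+	static void do_flush(void) { /* optimized implementation */ }
+	#else
+	static void do_flush(void) { /* generic implementation */ }
+	#endif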
Classes
~~~~~~~
Classes are not a construct that is built into the C programming language;
-however, it is an easily derived concept. Accordingly, pretty much every project
-that does not use a standardized object oriented library (like GNOME's GObject)
-has their own slightly different way of doing object oriented programming; the
-Linux kernel is no exception.
+however, it is an easily derived concept. Accordingly, in most cases, every
+project that does not use a standardized object oriented library (like GNOME's
+GObject) has its own slightly different way of doing object oriented
+programming; the Linux kernel is no exception.
The central concept in kernel object oriented programming is the class. In the
kernel, a *class* is a struct that contains function pointers. This creates a
contract between *implementers* and *users* since it forces them to use the
-same function signature without having to call the function directly. In order
-for it to truly be a class, the function pointers must specify that a pointer
-to the class, known as a *class handle*, be one of the parameters; this makes
-it possible for the member functions (also known as *methods*) to have access
-to member variables (more commonly known as *fields*) allowing the same
-implementation to have multiple *instances*.
-
-Typically a class can be *overridden* by *child classes* by embedding the
-*parent class* in the child class. Then when a method provided by the child
-class is called, the child implementation knows that the pointer passed to it is
-of a parent contained within the child; because of this, the child can compute
-the pointer to itself because the pointer to the parent is always a fixed offset
-from the pointer to the child; this offset is the offset of the parent contained
-in the child struct. For example:
+same function signature without having to call the function directly. To be a
+class, the function pointers must specify that a pointer to the class, known as
+a *class handle*, be one of the parameters. Thus the member functions (also
+known as *methods*) have access to member variables (also known as *fields*)
+allowing the same implementation to have multiple *instances*.
+
+A class can be *overridden* by *child classes* by embedding the *parent class*
+in the child class. Then when the child class *method* is called, the child
+implementation knows that the pointer passed to it is of a parent contained
+within the child. Thus, the child can compute the pointer to itself because the
+pointer to the parent is always a fixed offset from the pointer to the child.
+This offset is the offset of the parent contained in the child struct. For
+example:
.. code-block:: c
@@ -278,7 +281,7 @@ in the child struct. For example:
int rectangle_area(struct shape *this)
{
- struct rectangle *self = container_of(this, struct shape, parent);
+ struct rectangle *self = container_of(this, struct rectangle, parent);
return self->length * self->width;
};
@@ -290,8 +293,8 @@ in the child struct. For example:
self->width = width;
}
-In this example (as in most kernel code) the operation of computing the pointer
-to the child from the pointer to the parent is done by ``container_of``.
+In this example, computing the pointer to the child from the pointer to the
+parent is done by ``container_of``.
Faking Classes
~~~~~~~~~~~~~~
@@ -300,14 +303,11 @@ In order to unit test a piece of code that calls a method in a class, the
behavior of the method must be controllable, otherwise the test ceases to be a
unit test and becomes an integration test.
-A fake just provides an implementation of a piece of code that is different than
-what runs in a production instance, but behaves identically from the standpoint
-of the callers; this is usually done to replace a dependency that is hard to
-deal with, or is slow.
-
-A good example for this might be implementing a fake EEPROM that just stores the
-"contents" in an internal buffer. For example, let's assume we have a class that
-represents an EEPROM:
+A fake class implements a piece of code that is different from what runs in a
+production instance, but behaves identically from the standpoint of the callers.
+This is done to replace a dependency that is hard to deal with, or is slow. For
+example, implementing a fake EEPROM that stores the "contents" in an
+internal buffer. Assume we have a class that represents an EEPROM:
.. code-block:: c
@@ -316,7 +316,7 @@ represents an EEPROM:
ssize_t (*write)(struct eeprom *this, size_t offset, const char *buffer, size_t count);
};
-And we want to test some code that buffers writes to the EEPROM:
+And we want to test code that buffers writes to the EEPROM:
.. code-block:: c
@@ -329,7 +329,7 @@ And we want to test some code that buffers writes to the EEPROM:
struct eeprom_buffer *new_eeprom_buffer(struct eeprom *eeprom);
void destroy_eeprom_buffer(struct eeprom *eeprom);
-We can easily test this code by *faking out* the underlying EEPROM:
+We can test this code by *faking out* the underlying EEPROM:
.. code-block:: c
@@ -456,14 +456,14 @@ We can now use it to test ``struct eeprom_buffer``:
destroy_eeprom_buffer(ctx->eeprom_buffer);
}
-Testing against multiple inputs
+Testing Against Multiple Inputs
-------------------------------
-Testing just a few inputs might not be enough to have confidence that the code
-works correctly, e.g. for a hash function.
+Testing just a few inputs is not enough to ensure that the code works correctly;
+consider, for example, testing a hash function.
-In such cases, it can be helpful to have a helper macro or function, e.g. this
-fictitious example for ``sha1sum(1)``
+In such cases, we can write a helper macro or function that is called for each
+input. For example, to test ``sha1sum(1)``, we can write:
.. code-block:: c
@@ -475,16 +475,15 @@ fictitious example for ``sha1sum(1)``
TEST_SHA1("hello world", "2aae6c35c94fcfb415dbe95f408b9ce91ee846ed");
TEST_SHA1("hello world!", "430ce34d020724ed75a196dfc2ad67c77772d169");
+Note the use of the ``_MSG`` version of ``KUNIT_EXPECT_STREQ`` to print a more
+detailed error and make the assertions clearer within the helper macros.
-Note the use of ``KUNIT_EXPECT_STREQ_MSG`` to give more context when it fails
-and make it easier to track down. (Yes, in this example, ``want`` is likely
-going to be unique enough on its own).
-
-The ``_MSG`` variants are even more useful when the same expectation is called
-multiple times (in a loop or helper function) and thus the line number isn't
-enough to identify what failed, like below.
+The ``_MSG`` variants are useful when the same expectation is called multiple
+times (in a loop or helper function) and thus the line number is not enough to
+identify what failed, as shown below.
-In some cases, it can be helpful to write a *table-driven test* instead, e.g.
+In complicated cases, we recommend using a *table-driven test* instead of the
+helper macro variation, for example:
.. code-block:: c
@@ -513,17 +512,18 @@ In some cases, it can be helpful to write a *table-driven test* instead, e.g.
}
-There's more boilerplate involved, but it can:
+There is more boilerplate code involved, but it can:
+
+* be more readable when there are multiple inputs/outputs (due to field names).
-* be more readable when there are multiple inputs/outputs thanks to field names,
+ * For example, see ``fs/ext4/inode-test.c``.
- * E.g. see ``fs/ext4/inode-test.c`` for an example of both.
-* reduce duplication if test cases can be shared across multiple tests.
+* reduce duplication if test cases are shared across multiple tests.
- * E.g. if we wanted to also test ``sha256sum``, we could add a ``sha256``
+ * For example, if we want to test ``sha256sum``, we could add a ``sha256``
field and reuse ``cases``.
-* be converted to a "parameterized test", see below.
+* be converted to a "parameterized test".
Parameterized Testing
~~~~~~~~~~~~~~~~~~~~~
@@ -531,7 +531,7 @@ Parameterized Testing
The table-driven testing pattern is common enough that KUnit has special
support for it.
-Reusing the same ``cases`` array from above, we can write the test as a
+By reusing the same ``cases`` array from above, we can write the test as a
"parameterized test" with the following.
.. code-block:: c
@@ -541,7 +541,7 @@ Reusing the same ``cases`` array from above, we can write the test as a
const char *str;
const char *sha1;
};
- struct sha1_test_case cases[] = {
+ const struct sha1_test_case cases[] = {
{
.str = "hello world",
.sha1 = "2aae6c35c94fcfb415dbe95f408b9ce91ee846ed",
@@ -580,195 +580,151 @@ Reusing the same ``cases`` array from above, we can write the test as a
{}
};
-.. _kunit-on-non-uml:
+Allocating Memory
+-----------------
-KUnit on non-UML architectures
-==============================
+Where you might use ``kzalloc``, you can instead use ``kunit_kzalloc`` as KUnit
+will then ensure that the memory is freed once the test completes.
-By default KUnit uses UML as a way to provide dependencies for code under test.
-Under most circumstances KUnit's usage of UML should be treated as an
-implementation detail of how KUnit works under the hood. Nevertheless, there
-are instances where being able to run architecture-specific code or test
-against real hardware is desirable. For these reasons KUnit supports running on
-other architectures.
-
-Running existing KUnit tests on non-UML architectures
------------------------------------------------------
-
-There are some special considerations when running existing KUnit tests on
-non-UML architectures:
-
-* Hardware may not be deterministic, so a test that always passes or fails
- when run under UML may not always do so on real hardware.
-* Hardware and VM environments may not be hermetic. KUnit tries its best to
- provide a hermetic environment to run tests; however, it cannot manage state
- that it doesn't know about outside of the kernel. Consequently, tests that
- may be hermetic on UML may not be hermetic on other architectures.
-* Some features and tooling may not be supported outside of UML.
-* Hardware and VMs are slower than UML.
-
-None of these are reasons not to run your KUnit tests on real hardware; they are
-only things to be aware of when doing so.
-
-Currently, the KUnit Wrapper (``tools/testing/kunit/kunit.py``) (aka
-kunit_tool) only fully supports running tests inside of UML and QEMU; however,
-this is only due to our own time limitations as humans working on KUnit. It is
-entirely possible to support other emulators and even actual hardware, but for
-now QEMU and UML is what is fully supported within the KUnit Wrapper. Again, to
-be clear, this is just the Wrapper. The actualy KUnit tests and the KUnit
-library they are written in is fully architecture agnostic and can be used in
-virtually any setup, you just won't have the benefit of typing a single command
-out of the box and having everything magically work perfectly.
-
-Again, all core KUnit framework features are fully supported on all
-architectures, and using them is straightforward: Most popular architectures
-are supported directly in the KUnit Wrapper via QEMU. Currently, supported
-architectures on QEMU include:
-
-* i386
-* x86_64
-* arm
-* arm64
-* alpha
-* powerpc
-* riscv
-* s390
-* sparc
-
-In order to run KUnit tests on one of these architectures via QEMU with the
-KUnit wrapper, all you need to do is specify the flags ``--arch`` and
-``--cross_compile`` when invoking the KUnit Wrapper. For example, we could run
-the default KUnit tests on ARM in the following manner (assuming we have an ARM
-toolchain installed):
+This is useful because it lets us use the ``KUNIT_ASSERT_EQ`` macros to exit
+early from a test without having to remember to call ``kfree``.
+For example:
-.. code-block:: bash
+.. code-block:: c
- tools/testing/kunit/kunit.py run --timeout=60 --jobs=12 --arch=arm --cross_compile=arm-linux-gnueabihf-
+ void example_test_allocation(struct kunit *test)
+ {
+ char *buffer = kunit_kzalloc(test, 16, GFP_KERNEL);
+ /* Ensure allocation succeeded. */
+ KUNIT_ASSERT_NOT_ERR_OR_NULL(test, buffer);
-Alternatively, if you want to run your tests on real hardware or in some other
-emulation environment, all you need to do is to take your kunitconfig, your
-Kconfig options for the tests you would like to run, and merge them into
-whatever config your are using for your platform. That's it!
+ KUNIT_ASSERT_STREQ(test, buffer, "");
+ }
-For example, let's say you have the following kunitconfig:
-.. code-block:: none
+Testing Static Functions
+------------------------
- CONFIG_KUNIT=y
- CONFIG_KUNIT_EXAMPLE_TEST=y
+If we do not want to expose functions or variables for testing, one option is
+to conditionally ``#include`` the test file at the end of the .c file. For
+example:
-If you wanted to run this test on an x86 VM, you might add the following config
-options to your ``.config``:
+.. code-block:: c
-.. code-block:: none
+ /* In my_file.c */
- CONFIG_KUNIT=y
- CONFIG_KUNIT_EXAMPLE_TEST=y
- CONFIG_SERIAL_8250=y
- CONFIG_SERIAL_8250_CONSOLE=y
+	static int do_interesting_thing(void);
-All these new options do is enable support for a common serial console needed
-for logging.
+ #ifdef CONFIG_MY_KUNIT_TEST
+ #include "my_kunit_test.c"
+ #endif
-Next, you could build a kernel with these tests as follows:
+Injecting Test-Only Code
+------------------------
+Similarly, we can conditionally add test-specific logic. For example:
-.. code-block:: bash
+.. code-block:: c
- make ARCH=x86 olddefconfig
- make ARCH=x86
+ /* In my_file.h */
-Once you have built a kernel, you could run it on QEMU as follows:
+ #ifdef CONFIG_MY_KUNIT_TEST
+ /* Defined in my_kunit_test.c */
+ void test_only_hook(void);
+ #else
+	static inline void test_only_hook(void) { }
+ #endif
-.. code-block:: bash
+This test-only code can be made more useful by accessing the current
+``kunit_test``, as shown in the next section: *Accessing The Current Test*.
- qemu-system-x86_64 -enable-kvm \
- -m 1024 \
- -kernel arch/x86_64/boot/bzImage \
- -append 'console=ttyS0' \
- --nographic
+Accessing The Current Test
+--------------------------
-Interspersed in the kernel logs you might see the following:
+In some cases, we need to call test-only code from outside the test file. This
+is helpful, for example, when providing a fake implementation of a function, or
+when failing the current test from within an error handler.
+We can do this via the ``kunit_test`` field in ``task_struct``, which we can
+access using the ``kunit_get_current_test()`` function in ``kunit/test-bug.h``.
-.. code-block:: none
+``kunit_get_current_test()`` is safe to call even if KUnit is not enabled. If
+KUnit is not enabled, or if no test is running in the current task, it will
+return ``NULL``. This compiles down to either a no-op or a static key check,
+so will have a negligible performance impact when no test is running.
- TAP version 14
- # Subtest: example
- 1..1
- # example_simple_test: initializing
- ok 1 - example_simple_test
- ok 1 - example
+The example below uses this to provide a "mock" implementation of a function,
+``foo``:
-Congratulations, you just ran a KUnit test on the x86 architecture!
+.. code-block:: c
-In a similar manner, kunit and kunit tests can also be built as modules,
-so if you wanted to run tests in this way you might add the following config
-options to your ``.config``:
+ #include <kunit/test-bug.h> /* for kunit_get_current_test */
-.. code-block:: none
+ struct test_data {
+ int foo_result;
+ int want_foo_called_with;
+ };
- CONFIG_KUNIT=m
- CONFIG_KUNIT_EXAMPLE_TEST=m
+ static int fake_foo(int arg)
+ {
+ struct kunit *test = kunit_get_current_test();
+ struct test_data *test_data = test->priv;
-Once the kernel is built and installed, a simple
+ KUNIT_EXPECT_EQ(test, test_data->want_foo_called_with, arg);
+ return test_data->foo_result;
+ }
-.. code-block:: bash
+ static void example_simple_test(struct kunit *test)
+ {
+ /* Assume priv (private, a member used to pass test data from
+ * the init function) is allocated in the suite's .init */
+ struct test_data *test_data = test->priv;
- modprobe example-test
+ test_data->foo_result = 42;
+ test_data->want_foo_called_with = 1;
-...will run the tests.
+ /* In a real test, we'd probably pass a pointer to fake_foo somewhere
+ * like an ops struct, etc. instead of calling it directly. */
+ KUNIT_EXPECT_EQ(test, fake_foo(1), 42);
+ }
-.. note::
- Note that you should make sure your test depends on ``KUNIT=y`` in Kconfig
- if the test does not support module build. Otherwise, it will trigger
- compile errors if ``CONFIG_KUNIT`` is ``m``.
+In this example, we are using the ``priv`` member of ``struct kunit`` as a way
+of passing data to the test from the init function. In general, ``priv`` is a
+pointer that can be used for any user data. This is preferred over static
+variables, as it avoids concurrency issues.
-Writing new tests for other architectures
------------------------------------------
+Had we wanted something more flexible, we could have used a named
+``kunit_resource``. Each test can have multiple resources which have string
+names, providing the same flexibility as a ``priv`` member, but also, for
+example, allowing helper functions to create resources without conflicting
+with each other. It is also possible to define a cleanup function for each
+resource, making it easy to avoid resource leaks. For more information, see
+Documentation/dev-tools/kunit/api/resource.rst.
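+
+A minimal sketch of using a named resource, assuming the API declared in
+``include/kunit/resource.h`` (the name ``"my-data"`` and the value are purely
+illustrative):
+
+.. code-block:: c
+
+	static struct kunit_resource res;
+	static int value = 42;
+
+	static void example_named_resource_test(struct kunit *test)
+	{
+		struct kunit_resource *found;
+
+		/* Register the data under a string name; no init/free callbacks. */
+		kunit_add_named_resource(test, NULL, NULL, &res, "my-data", &value);
+
+		/* A helper elsewhere can look the resource up by its name. */
+		found = kunit_find_named_resource(test, "my-data");
+		KUNIT_ASSERT_NOT_ERR_OR_NULL(test, found);
+		KUNIT_EXPECT_EQ(test, *(int *)found->data, 42);
+
+		/* The lookup takes a reference which must be dropped. */
+		kunit_put_resource(found);
+	}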
-The first thing you must do is ask yourself whether it is necessary to write a
-KUnit test for a specific architecture, and then whether it is necessary to
-write that test for a particular piece of hardware. In general, writing a test
-that depends on having access to a particular piece of hardware or software (not
-included in the Linux source repo) should be avoided at all costs.
+Failing The Current Test
+------------------------
-Even if you only ever plan on running your KUnit test on your hardware
-configuration, other people may want to run your tests and may not have access
-to your hardware. If you write your test to run on UML, then anyone can run your
-tests without knowing anything about your particular setup, and you can still
-run your tests on your hardware setup just by compiling for your architecture.
+If we want to fail the current test, we can use ``kunit_fail_current_test(fmt, args...)``
+which is defined in ``<kunit/test-bug.h>`` and does not require pulling in ``<kunit/test.h>``.
+For example, if we have an option that enables extra debug checks on some data
+structures, we can fail the current test from those checks as shown below:
-.. important::
- Always prefer tests that run on UML to tests that only run under a particular
- architecture, and always prefer tests that run under QEMU or another easy
- (and monetarily free) to obtain software environment to a specific piece of
- hardware.
-
-Nevertheless, there are still valid reasons to write an architecture or hardware
-specific test: for example, you might want to test some code that really belongs
-in ``arch/some-arch/*``. Even so, try your best to write the test so that it
-does not depend on physical hardware: if some of your test cases don't need the
-hardware, only require the hardware for tests that actually need it.
-
-Now that you have narrowed down exactly what bits are hardware specific, the
-actual procedure for writing and running the tests is pretty much the same as
-writing normal KUnit tests. One special caveat is that you have to reset
-hardware state in between test cases; if this is not possible, you may only be
-able to run one test case per invocation.
+.. code-block:: c
-.. TODO(brendanhiggins@google.com): Add an actual example of an architecture-
- dependent KUnit test.
+ #include <kunit/test-bug.h>
-KUnit debugfs representation
-============================
-When kunit test suites are initialized, they create an associated directory
-in ``/sys/kernel/debug/kunit/<test-suite>``. The directory contains one file
+ #ifdef CONFIG_EXTRA_DEBUG_CHECKS
+ static void validate_my_data(struct data *data)
+ {
+ if (is_valid(data))
+ return;
-- results: "cat results" displays results of each test case and the results
- of the entire suite for the last test run.
+ kunit_fail_current_test("data %p is invalid", data);
-The debugfs representation is primarily of use when kunit test suites are
-run in a native environment, either as modules or builtin. Having a way
-to display results like this is valuable as otherwise results can be
-intermixed with other events in dmesg output. The maximum size of each
-results file is KUNIT_LOG_SIZE bytes (defined in ``include/kunit/test.h``).
+ /* Normal, non-KUnit, error reporting code here. */
+ }
+ #else
+	static void validate_my_data(struct data *data) { }
+ #endif
+
+``kunit_fail_current_test()`` is safe to call even if KUnit is not enabled. If
+KUnit is not enabled, or if no test is running in the current task, it will do
+nothing. This compiles down to either a no-op or a static key check, so will
+have a negligible performance impact when no test is running.
diff --git a/Documentation/dev-tools/sparse.rst b/Documentation/dev-tools/sparse.rst
index 02102be7ff49..dc791c8d84d1 100644
--- a/Documentation/dev-tools/sparse.rst
+++ b/Documentation/dev-tools/sparse.rst
@@ -100,3 +100,5 @@ have already built it.
The optional make variable CF can be used to pass arguments to sparse. The
build system passes -Wbitwise to sparse automatically.
+
+Note that sparse defines the ``__CHECKER__`` preprocessor symbol.
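+
+For instance, code can use this symbol to provide annotations that only sparse
+sees. A minimal sketch (``__my_annotation`` is hypothetical, not a kernel API):
+
+.. code-block:: c
+
+	#ifdef __CHECKER__
+	#define __my_annotation __attribute__((noderef))
+	#else
+	#define __my_annotation
+	#endif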
diff --git a/Documentation/dev-tools/testing-overview.rst b/Documentation/dev-tools/testing-overview.rst
index 65feb81edb14..0aaf6ea53608 100644
--- a/Documentation/dev-tools/testing-overview.rst
+++ b/Documentation/dev-tools/testing-overview.rst
@@ -115,3 +115,66 @@ that none of these errors are occurring during the test.
Some of these tools integrate with KUnit or kselftest and will
automatically fail tests if an issue is detected.
+Static Analysis Tools
+=====================
+
+In addition to testing a running kernel, one can also analyze kernel source code
+directly (**at compile time**) using **static analysis** tools. The tools
+commonly used in the kernel allow one to inspect the whole source tree or just
+specific files within it. They make it easier to detect and fix problems during
+the development process.
+
+Sparse can help test the kernel by performing type-checking, lock checking,
+and value range checking, in addition to reporting various errors and warnings
+while examining the code. See the Documentation/dev-tools/sparse.rst
+documentation page for details on how to use it.
+
+Smatch extends Sparse and provides additional checks for programming logic
+mistakes such as missing breaks in switch statements, unused return values on
+error checking, forgetting to set an error code in the return of an error path,
+etc. Smatch also has tests against more serious issues such as integer
+overflows, null pointer dereferences, and memory leaks. See the project page at
+http://smatch.sourceforge.net/.
+
+Coccinelle is another static analyzer at our disposal. Coccinelle is often used
+to aid refactoring and collateral evolution of source code, but it can also help
+to avoid certain bugs that occur in common code patterns. The types of tests
+available include API tests, tests for correct usage of kernel iterators, checks
+for the soundness of free operations, analysis of locking behavior, and other
+tests that help keep kernel usage consistent. See the
+Documentation/dev-tools/coccinelle.rst documentation page for details.
+
+Beware, though, that static analysis tools suffer from **false positives**.
+Errors and warnings need to be evaluated carefully before attempting to fix
+them.
+
+When to use Sparse and Smatch
+-----------------------------
+
+Sparse does type checking, such as verifying that annotated variables do not
+cause endianness bugs, detecting places that use ``__user`` pointers improperly,
+and analyzing the compatibility of symbol initializers.
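+
+As a small sketch (not an in-tree example), sparse flags a direct dereference
+of a ``__user`` pointer:
+
+.. code-block:: c
+
+	int bad_read(int __user *p)
+	{
+		return *p;	/* sparse: dereference of noderef expression */
+	}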
+
+Smatch does flow analysis and, if allowed to build the function database, it
+also does cross function analysis. Smatch tries to answer questions such as:
+where is this buffer allocated? How big is it? Can this index be controlled by
+the user? Is this variable larger than that one?
+
+It's generally easier to write checks in Smatch than it is to write checks in
+Sparse. Nevertheless, there are some overlaps between Sparse and Smatch checks.
+
+Strong points of Smatch and Coccinelle
+--------------------------------------
+
+Coccinelle is probably the easiest tool to write checks for. It works before
+the pre-processor, so it is easier to check for bugs in macros with Coccinelle.
+It also creates patches for you, which no other tool does.
+
+For example, with Coccinelle you can do a mass conversion from
+``kmalloc(x * size, GFP_KERNEL)`` to ``kmalloc_array(x, size, GFP_KERNEL)``, and
+that's really useful. If you just created a Smatch warning and tried to push
+the work of converting onto the maintainers, they would be annoyed. You would
+have to argue about whether each warning can really overflow or not.
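+
+For illustration, the conversion in question looks like this in C:
+
+.. code-block:: c
+
+	/* Before: the multiplication x * size can overflow unnoticed. */
+	buf = kmalloc(x * size, GFP_KERNEL);
+
+	/* After: kmalloc_array() returns NULL on multiplication overflow. */
+	buf = kmalloc_array(x, size, GFP_KERNEL);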
+
+Coccinelle does no analysis of variable values, which is the strong point of
+Smatch. On the other hand, Coccinelle allows you to do simple things in a simple
+way.
diff --git a/Documentation/devicetree/bindings/.gitignore b/Documentation/devicetree/bindings/.gitignore
index a77719968a7e..51ddb26d93f0 100644
--- a/Documentation/devicetree/bindings/.gitignore
+++ b/Documentation/devicetree/bindings/.gitignore
@@ -2,3 +2,8 @@
*.example.dts
/processed-schema*.yaml
/processed-schema*.json
+
+#
+# We don't want to ignore the following even if they are dot-files
+#
+!.yamllint
diff --git a/Documentation/devicetree/bindings/.yamllint b/Documentation/devicetree/bindings/.yamllint
index 214abd3ec440..4abe9f0a1d46 100644
--- a/Documentation/devicetree/bindings/.yamllint
+++ b/Documentation/devicetree/bindings/.yamllint
@@ -19,7 +19,7 @@ rules:
colons: {max-spaces-before: 0, max-spaces-after: 1}
commas: {min-spaces-after: 1, max-spaces-after: 1}
comments:
- require-starting-space: false
+ require-starting-space: true
min-spaces-from-content: 1
comments-indentation: disable
document-start:
diff --git a/Documentation/devicetree/bindings/Makefile b/Documentation/devicetree/bindings/Makefile
index a072e95de626..8b395893bd85 100644
--- a/Documentation/devicetree/bindings/Makefile
+++ b/Documentation/devicetree/bindings/Makefile
@@ -3,12 +3,18 @@ DT_DOC_CHECKER ?= dt-doc-validate
DT_EXTRACT_EX ?= dt-extract-example
DT_MK_SCHEMA ?= dt-mk-schema
-DT_SCHEMA_LINT = $(shell which yamllint)
+DT_SCHEMA_LINT = $(shell which yamllint || \
+ echo "warning: python package 'yamllint' not installed, skipping" >&2)
-DT_SCHEMA_MIN_VERSION = 2021.2.1
+DT_SCHEMA_MIN_VERSION = 2022.3
PHONY += check_dtschema_version
check_dtschema_version:
+ @which $(DT_DOC_CHECKER) >/dev/null || \
+ { echo "Error: '$(DT_DOC_CHECKER)' not found!" >&2; \
+ echo "Ensure dtschema python package is installed and in your PATH." >&2; \
+ echo "Current PATH is:" >&2; \
+ echo "$$PATH" >&2; false; }
@{ echo $(DT_SCHEMA_MIN_VERSION); \
$(DT_DOC_CHECKER) --version 2>/dev/null || echo 0; } | sort -Vc >/dev/null || \
{ echo "ERROR: dtschema minimum version is v$(DT_SCHEMA_MIN_VERSION)" >&2; false; }
@@ -19,16 +25,16 @@ quiet_cmd_extract_ex = DTEX $@
$(obj)/%.example.dts: $(src)/%.yaml check_dtschema_version FORCE
$(call if_changed,extract_ex)
-# Use full schemas when checking %.example.dts
-DT_TMP_SCHEMA := $(obj)/processed-schema-examples.json
+find_all_cmd = find $(srctree)/$(src) \( -name '*.yaml' ! \
+ -name 'processed-schema*' \)
-find_cmd = find $(srctree)/$(src) \( -name '*.yaml' ! \
- -name 'processed-schema*' ! \
- -name '*.example.dt.yaml' \)
+find_cmd = $(find_all_cmd) | grep -F -e "$(subst :," -e ",$(DT_SCHEMA_FILES))"
+CHK_DT_DOCS := $(shell $(find_cmd))
quiet_cmd_yamllint = LINT $(src)
cmd_yamllint = ($(find_cmd) | \
- xargs $(DT_SCHEMA_LINT) -f parsable -c $(srctree)/$(src)/.yamllint >&2) || true
+ xargs -n200 -P$$(nproc) \
+ $(DT_SCHEMA_LINT) -f parsable -c $(srctree)/$(src)/.yamllint >&2) || true
quiet_cmd_chk_bindings = CHKDT $@
cmd_chk_bindings = ($(find_cmd) | \
@@ -36,9 +42,7 @@ quiet_cmd_chk_bindings = CHKDT $@
quiet_cmd_mk_schema = SCHEMA $@
cmd_mk_schema = f=$$(mktemp) ; \
- $(if $(DT_MK_SCHEMA_FLAGS), \
- printf '%s\n' $(real-prereqs), \
- $(find_cmd)) > $$f ; \
+ $(find_all_cmd) > $$f ; \
$(DT_MK_SCHEMA) -j $(DT_MK_SCHEMA_FLAGS) @$$f > $@ ; \
rm -f $$f
@@ -48,45 +52,29 @@ define rule_chkdt
$(call cmd,mk_schema)
endef
-DT_DOCS = $(patsubst $(srctree)/%,%,$(shell $(find_cmd)))
+DT_DOCS = $(patsubst $(srctree)/%,%,$(shell $(find_all_cmd)))
override DTC_FLAGS := \
-Wno-avoid_unnecessary_addr_size \
-Wno-graph_child_address \
- -Wno-interrupt_provider
+ -Wno-interrupt_provider \
+ -Wno-unique_unit_address \
+ -Wunique_unit_address_if_enabled
# Disable undocumented compatible checks until warning free
override DT_CHECKER_FLAGS ?=
-$(obj)/processed-schema-examples.json: $(DT_DOCS) $(src)/.yamllint check_dtschema_version FORCE
+$(obj)/processed-schema.json: $(DT_DOCS) $(src)/.yamllint check_dtschema_version FORCE
$(call if_changed_rule,chkdt)
-ifeq ($(DT_SCHEMA_FILES),)
-
-# Unless DT_SCHEMA_FILES is specified, use the full schema for dtbs_check too.
-# Just copy processed-schema-examples.json
-
-$(obj)/processed-schema.json: $(obj)/processed-schema-examples.json FORCE
- $(call if_changed,copy)
-
-DT_SCHEMA_FILES = $(DT_DOCS)
-
-else
-
-# If DT_SCHEMA_FILES is specified, use it for processed-schema.json
-
-$(obj)/processed-schema.json: DT_MK_SCHEMA_FLAGS := -u
-$(obj)/processed-schema.json: $(DT_SCHEMA_FILES) check_dtschema_version FORCE
- $(call if_changed,mk_schema)
-
-endif
-
-always-$(CHECK_DT_BINDING) += processed-schema-examples.json
-always-$(CHECK_DTBS) += processed-schema.json
-always-$(CHECK_DT_BINDING) += $(patsubst $(src)/%.yaml,%.example.dts, $(DT_SCHEMA_FILES))
-always-$(CHECK_DT_BINDING) += $(patsubst $(src)/%.yaml,%.example.dt.yaml, $(DT_SCHEMA_FILES))
+always-y += processed-schema.json
+always-$(CHECK_DT_BINDING) += $(patsubst $(srctree)/$(src)/%.yaml,%.example.dts, $(CHK_DT_DOCS))
+always-$(CHECK_DT_BINDING) += $(patsubst $(srctree)/$(src)/%.yaml,%.example.dtb, $(CHK_DT_DOCS))
# Hack: avoid 'Argument list too long' error for 'make clean'. Remove most of
# build artifacts here before they are processed by scripts/Makefile.clean
clean-files = $(shell find $(obj) \( -name '*.example.dts' -o \
- -name '*.example.dt.yaml' \) -delete 2>/dev/null)
+ -name '*.example.dtb' \) -delete 2>/dev/null)
+
+dt_compatible_check: $(obj)/processed-schema.json
+ $(Q)$(srctree)/scripts/dtc/dt-extract-compatibles $(srctree) | xargs dt-check-compatible -v -s $<
diff --git a/Documentation/devicetree/bindings/arm/actions.yaml b/Documentation/devicetree/bindings/arm/actions.yaml
index 02dc72c97645..e012f612f039 100644
--- a/Documentation/devicetree/bindings/arm/actions.yaml
+++ b/Documentation/devicetree/bindings/arm/actions.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/actions.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Actions Semi platforms device tree bindings
+title: Actions Semi platforms
maintainers:
- Andreas Färber <afaerber@suse.de>
diff --git a/Documentation/devicetree/bindings/arm/airoha.yaml b/Documentation/devicetree/bindings/arm/airoha.yaml
new file mode 100644
index 000000000000..3292c669ee11
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/airoha.yaml
@@ -0,0 +1,28 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/airoha.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Airoha SoC based Platforms
+
+maintainers:
+ - Felix Fietkau <nbd@nbd.name>
+ - John Crispin <john@phrozen.org>
+
+description:
+ Boards with an Airoha SoC shall have the following properties.
+
+properties:
+ $nodename:
+ const: '/'
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - airoha,en7523-evb
+ - const: airoha,en7523
+
+additionalProperties: true
+
+...
diff --git a/Documentation/devicetree/bindings/arm/altera.yaml b/Documentation/devicetree/bindings/arm/altera.yaml
index c15c92fdf2ed..8c7575455422 100644
--- a/Documentation/devicetree/bindings/arm/altera.yaml
+++ b/Documentation/devicetree/bindings/arm/altera.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/altera.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Altera's SoCFPGA platform device tree bindings
+title: Altera's SoCFPGA platform
maintainers:
- Dinh Nguyen <dinguyen@kernel.org>
@@ -13,12 +13,55 @@ properties:
$nodename:
const: "/"
compatible:
- items:
- - enum:
- - altr,socfpga-cyclone5
- - altr,socfpga-arria5
- - altr,socfpga-arria10
- - const: altr,socfpga
+ oneOf:
+ - description: Arria 5 boards
+ items:
+ - enum:
+ - altr,socfpga-arria5-socdk
+ - const: altr,socfpga-arria5
+ - const: altr,socfpga
+
+ - description: Arria 10 boards
+ items:
+ - enum:
+ - altr,socfpga-arria10-socdk
+ - const: altr,socfpga-arria10
+ - const: altr,socfpga
+
+ - description: Mercury+ AA1 boards
+ items:
+ - enum:
+ - enclustra,mercury-pe1
+ - google,chameleon-v3
+ - const: enclustra,mercury-aa1
+ - const: altr,socfpga-arria10
+ - const: altr,socfpga
+
+ - description: Cyclone 5 boards
+ items:
+ - enum:
+ - altr,socfpga-cyclone5-socdk
+ - denx,mcvevk
+ - ebv,socrates
+ - macnica,sodia
+ - novtech,chameleon96
+ - samtec,vining
+ - terasic,de0-atlas
+ - terasic,socfpga-cyclone5-sockit
+ - const: altr,socfpga-cyclone5
+ - const: altr,socfpga
+
+ - description: Stratix 10 boards
+ items:
+ - enum:
+ - altr,socfpga-stratix10-socdk
+ - altr,socfpga-stratix10-swvp
+ - const: altr,socfpga-stratix10
+
+ - description: SoCFPGA VT
+ items:
+ - const: altr,socfpga-vt
+ - const: altr,socfpga
additionalProperties: true
diff --git a/Documentation/devicetree/bindings/arm/amazon,al.yaml b/Documentation/devicetree/bindings/arm/amazon,al.yaml
index 0f03135d91b6..37dbb4768e5b 100644
--- a/Documentation/devicetree/bindings/arm/amazon,al.yaml
+++ b/Documentation/devicetree/bindings/arm/amazon,al.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/amazon,al.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Amazon's Annapurna Labs Alpine Platform Device Tree Bindings
+title: Amazon's Annapurna Labs Alpine Platform
maintainers:
- Hanna Hawa <hhhawa@amazon.com>
diff --git a/Documentation/devicetree/bindings/arm/amlogic.yaml b/Documentation/devicetree/bindings/arm/amlogic.yaml
index 6423377710ee..274ee0890312 100644
--- a/Documentation/devicetree/bindings/arm/amlogic.yaml
+++ b/Documentation/devicetree/bindings/arm/amlogic.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/amlogic.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Amlogic MesonX device tree bindings
+title: Amlogic MesonX
maintainers:
- Kevin Hilman <khilman@baylibre.com>
@@ -86,6 +86,7 @@ properties:
- enum:
- amlogic,p281
- oranth,tx3-mini
+ - jethome,jethub-j80
- const: amlogic,s905w
- const: amlogic,meson-gxl
@@ -107,6 +108,7 @@ properties:
- amlogic,p230
- amlogic,p231
- libretech,aml-s905d-pc
+ - osmc,vero4k-plus
- phicomm,n1
- smartlabs,sml5442tw
- videostrong,gxl-kii-pro
@@ -118,6 +120,7 @@ properties:
- enum:
- amlogic,q200
- amlogic,q201
+ - azw,gt1-ultimate
- khadas,vim2
- kingnovel,r-box-pro
- libretech,aml-s912-pc
@@ -133,6 +136,8 @@ properties:
items:
- enum:
- amlogic,s400
+ - jethome,jethub-j100
+ - jethome,jethub-j110
- const: amlogic,a113d
- const: amlogic,meson-axg
@@ -141,13 +146,24 @@ properties:
- enum:
- amediatech,x96-max
- amlogic,u200
+ - radxa,zero
- seirobotics,sei510
- const: amlogic,g12a
- description: Boards with the Amlogic Meson G12B A311D SoC
items:
- enum:
+ - bananapi,bpi-m2s
- khadas,vim3
+ - radxa,zero2
+ - const: amlogic,a311d
+ - const: amlogic,g12b
+
+ - description: Boards using the BPI-CM4 module with Amlogic Meson G12B A311D SoC
+ items:
+ - enum:
+ - bananapi,bpi-cm4io
+ - const: bananapi,bpi-cm4
- const: amlogic,a311d
- const: amlogic,g12b
@@ -157,7 +173,10 @@ properties:
- azw,gsking-x
- azw,gtking
- azw,gtking-pro
+ - bananapi,bpi-m2s
+ - hardkernel,odroid-go-ultra
- hardkernel,odroid-n2
+ - hardkernel,odroid-n2l
- hardkernel,odroid-n2-plus
- khadas,vim3
- ugoos,am6
@@ -167,9 +186,15 @@ properties:
- description: Boards with the Amlogic Meson SM1 S905X3/D3/Y3 SoC
items:
- enum:
+ - amediatech,x96-air
+ - amediatech,x96-air-gbit
+ - bananapi,bpi-m2-pro
- bananapi,bpi-m5
+ - cyx,a95xf3-air
+ - cyx,a95xf3-air-gbit
- hardkernel,odroid-c4
- hardkernel,odroid-hc4
+ - haochuangyi,h96-max
- khadas,vim3l
- seirobotics,sei610
- const: amlogic,sm1
@@ -180,6 +205,12 @@ properties:
- amlogic,ad401
- const: amlogic,a1
+ - description: Boards with the Amlogic Meson S4 S805X2 SoC
+ items:
+ - enum:
+ - amlogic,aq222
+ - const: amlogic,s4
+
additionalProperties: true
...
diff --git a/Documentation/devicetree/bindings/arm/amlogic/amlogic,meson-gx-ao-secure.yaml b/Documentation/devicetree/bindings/arm/amlogic/amlogic,meson-gx-ao-secure.yaml
index 6cc74523ebfd..7dff32f373cb 100644
--- a/Documentation/devicetree/bindings/arm/amlogic/amlogic,meson-gx-ao-secure.yaml
+++ b/Documentation/devicetree/bindings/arm/amlogic/amlogic,meson-gx-ao-secure.yaml
@@ -2,13 +2,13 @@
# Copyright 2019 BayLibre, SAS
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/arm/amlogic/amlogic,meson-gx-ao-secure.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/arm/amlogic/amlogic,meson-gx-ao-secure.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Amlogic Meson Firmware registers Interface
maintainers:
- - Neil Armstrong <narmstrong@baylibre.com>
+ - Neil Armstrong <neil.armstrong@linaro.org>
description: |
The Meson SoCs have a register bank with status and data shared with the
diff --git a/Documentation/devicetree/bindings/arm/amlogic/amlogic,meson-mx-secbus2.yaml b/Documentation/devicetree/bindings/arm/amlogic/amlogic,meson-mx-secbus2.yaml
index eee7cda9f91b..09b27e98d4c9 100644
--- a/Documentation/devicetree/bindings/arm/amlogic/amlogic,meson-mx-secbus2.yaml
+++ b/Documentation/devicetree/bindings/arm/amlogic/amlogic,meson-mx-secbus2.yaml
@@ -1,8 +1,8 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/arm/amlogic/amlogic,meson-mx-secbus2.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/arm/amlogic/amlogic,meson-mx-secbus2.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Amlogic Meson8/Meson8b/Meson8m2 SECBUS2 register interface
diff --git a/Documentation/devicetree/bindings/arm/apple.yaml b/Documentation/devicetree/bindings/arm/apple.yaml
index 1e772c85206c..883fd67e3752 100644
--- a/Documentation/devicetree/bindings/arm/apple.yaml
+++ b/Documentation/devicetree/bindings/arm/apple.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/apple.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Apple ARM Machine Device Tree Bindings
+title: Apple ARM Machine
maintainers:
- Hector Martin <marcan@marcan.st>
@@ -12,12 +12,27 @@ maintainers:
description: |
ARM platforms using SoCs designed by Apple Inc., branded "Apple Silicon".
- This currently includes devices based on the "M1" SoC, starting with the
- three Mac models released in late 2020:
+ This currently includes devices based on the "M1" SoC:
- Mac mini (M1, 2020)
- MacBook Pro (13-inch, M1, 2020)
- MacBook Air (M1, 2020)
+ - iMac (24-inch, M1, 2021)
+
+ Devices based on the "M2" SoC:
+
+ - MacBook Air (M2, 2022)
+ - MacBook Pro (13-inch, M2, 2022)
+ - Mac mini (M2, 2023)
+
+ And devices based on the "M1 Pro", "M1 Max" and "M1 Ultra" SoCs:
+
+ - MacBook Pro (14-inch, M1 Pro, 2021)
+ - MacBook Pro (14-inch, M1 Max, 2021)
+ - MacBook Pro (16-inch, M1 Pro, 2021)
+ - MacBook Pro (16-inch, M1 Max, 2021)
+ - Mac Studio (M1 Max, 2022)
+ - Mac Studio (M1 Ultra, 2022)
The compatible property should follow this format:
@@ -56,9 +71,44 @@ properties:
- apple,j274 # Mac mini (M1, 2020)
- apple,j293 # MacBook Pro (13-inch, M1, 2020)
- apple,j313 # MacBook Air (M1, 2020)
+ - apple,j456 # iMac (24-inch, 4x USB-C, M1, 2021)
+ - apple,j457 # iMac (24-inch, 2x USB-C, M1, 2021)
- const: apple,t8103
- const: apple,arm-platform
+ - description: Apple M2 SoC based platforms
+ items:
+ - enum:
+ - apple,j413 # MacBook Air (M2, 2022)
+ - apple,j473 # Mac mini (M2, 2023)
+ - apple,j493 # MacBook Pro (13-inch, M2, 2022)
+ - const: apple,t8112
+ - const: apple,arm-platform
+
+ - description: Apple M1 Pro SoC based platforms
+ items:
+ - enum:
+ - apple,j314s # MacBook Pro (14-inch, M1 Pro, 2021)
+ - apple,j316s # MacBook Pro (16-inch, M1 Pro, 2021)
+ - const: apple,t6000
+ - const: apple,arm-platform
+
+ - description: Apple M1 Max SoC based platforms
+ items:
+ - enum:
+ - apple,j314c # MacBook Pro (14-inch, M1 Max, 2021)
+ - apple,j316c # MacBook Pro (16-inch, M1 Max, 2021)
+ - apple,j375c # Mac Studio (M1 Max, 2022)
+ - const: apple,t6001
+ - const: apple,arm-platform
+
+ - description: Apple M1 Ultra SoC based platforms
+ items:
+ - enum:
+ - apple,j375d # Mac Studio (M1 Ultra, 2022)
+ - const: apple,t6002
+ - const: apple,arm-platform
+
additionalProperties: true
...
diff --git a/Documentation/devicetree/bindings/arm/apple/apple,pmgr.yaml b/Documentation/devicetree/bindings/arm/apple/apple,pmgr.yaml
new file mode 100644
index 000000000000..673277a7a224
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/apple/apple,pmgr.yaml
@@ -0,0 +1,135 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/apple/apple,pmgr.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Apple SoC Power Manager (PMGR)
+
+maintainers:
+ - Hector Martin <marcan@marcan.st>
+
+description: |
+ Apple SoCs include PMGR blocks responsible for power management,
+ which can control various clocks, resets, power states, and
+ performance features. This node represents the PMGR as a syscon,
+ with sub-nodes representing individual features.
+
+properties:
+ $nodename:
+ pattern: "^power-management@[0-9a-f]+$"
+
+ compatible:
+ items:
+ - enum:
+ - apple,t8103-pmgr
+ - apple,t8112-pmgr
+ - apple,t6000-pmgr
+ - const: apple,pmgr
+ - const: syscon
+ - const: simple-mfd
+
+ reg:
+ maxItems: 1
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 1
+
+patternProperties:
+ "power-controller@[0-9a-f]+$":
+ description:
+ The individual power management domains within this controller
+ type: object
+ $ref: /schemas/power/apple,pmgr-pwrstate.yaml#
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ power-management@23b700000 {
+ compatible = "apple,t8103-pmgr", "apple,pmgr", "syscon", "simple-mfd";
+ #address-cells = <1>;
+ #size-cells = <1>;
+ reg = <0x2 0x3b700000 0x0 0x14000>;
+
+ ps_sio: power-controller@1c0 {
+ compatible = "apple,t8103-pmgr-pwrstate", "apple,pmgr-pwrstate";
+ reg = <0x1c0 8>;
+ #power-domain-cells = <0>;
+ #reset-cells = <0>;
+ label = "sio";
+ apple,always-on;
+ };
+
+ ps_uart_p: power-controller@220 {
+ compatible = "apple,t8103-pmgr-pwrstate", "apple,pmgr-pwrstate";
+ reg = <0x220 8>;
+ #power-domain-cells = <0>;
+ #reset-cells = <0>;
+ label = "uart_p";
+ power-domains = <&ps_sio>;
+ };
+
+ ps_uart0: power-controller@270 {
+ compatible = "apple,t8103-pmgr-pwrstate", "apple,pmgr-pwrstate";
+ reg = <0x270 8>;
+ #power-domain-cells = <0>;
+ #reset-cells = <0>;
+ label = "uart0";
+ power-domains = <&ps_uart_p>;
+ };
+ };
+
+ power-management@23d280000 {
+ compatible = "apple,t8103-pmgr", "apple,pmgr", "syscon", "simple-mfd";
+ #address-cells = <1>;
+ #size-cells = <1>;
+ reg = <0x2 0x3d280000 0x0 0xc000>;
+
+ ps_aop_filter: power-controller@4000 {
+ compatible = "apple,t8103-pmgr-pwrstate", "apple,pmgr-pwrstate";
+ reg = <0x4000 8>;
+ #power-domain-cells = <0>;
+ #reset-cells = <0>;
+ label = "aop_filter";
+ };
+
+ ps_aop_base: power-controller@4010 {
+ compatible = "apple,t8103-pmgr-pwrstate", "apple,pmgr-pwrstate";
+ reg = <0x4010 8>;
+ #power-domain-cells = <0>;
+ #reset-cells = <0>;
+ label = "aop_base";
+ power-domains = <&ps_aop_filter>;
+ };
+
+ ps_aop_shim: power-controller@4038 {
+ compatible = "apple,t8103-pmgr-pwrstate", "apple,pmgr-pwrstate";
+ reg = <0x4038 8>;
+ #power-domain-cells = <0>;
+ #reset-cells = <0>;
+ label = "aop_shim";
+ power-domains = <&ps_aop_base>;
+ };
+
+ ps_aop_uart0: power-controller@4048 {
+ compatible = "apple,t8103-pmgr-pwrstate", "apple,pmgr-pwrstate";
+ reg = <0x4048 8>;
+ #power-domain-cells = <0>;
+ #reset-cells = <0>;
+ label = "aop_uart0";
+ power-domains = <&ps_aop_shim>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/arm/arm,cci-400.yaml b/Documentation/devicetree/bindings/arm/arm,cci-400.yaml
new file mode 100644
index 000000000000..d28303d909e1
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,cci-400.yaml
@@ -0,0 +1,211 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,cci-400.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ARM CCI Cache Coherent Interconnect
+
+maintainers:
+ - Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
+
+description: >
+ ARM multi-cluster systems maintain intra-cluster coherency through a cache
+ coherent interconnect (CCI) that is capable of monitoring bus transactions
+ and managing coherency, TLB invalidations and memory barriers.
+
+ It allows snooping and distributed virtual memory message broadcast across
+ clusters, through a memory mapped interface, with a global control register
+ space and multiple sets of interface control registers, one per slave
+ interface.
+
+properties:
+ $nodename:
+ pattern: "^cci(@[0-9a-f]+)?$"
+
+ compatible:
+ enum:
+ - arm,cci-400
+ - arm,cci-500
+ - arm,cci-550
+
+ reg:
+ maxItems: 1
+ description: >
+ Specifies base physical address of CCI control registers common to all
+ interfaces.
+
+ "#address-cells": true
+ "#size-cells": true
+ ranges: true
+
+patternProperties:
+ "^slave-if@[0-9a-f]+$":
+ type: object
+
+ properties:
+ compatible:
+ const: arm,cci-400-ctrl-if
+
+ interface-type:
+ enum:
+ - ace
+ - ace-lite
+
+ reg:
+ maxItems: 1
+
+ required:
+ - compatible
+ - interface-type
+ - reg
+
+ additionalProperties: false
+
+ "^pmu@[0-9a-f]+$":
+ type: object
+
+ properties:
+ compatible:
+ oneOf:
+ - const: arm,cci-400-pmu,r0
+ - const: arm,cci-400-pmu,r1
+ - const: arm,cci-400-pmu
+ deprecated: true
+ description: >
+ Permitted only where OS has secure access to CCI registers
+ - const: arm,cci-500-pmu,r0
+ - const: arm,cci-550-pmu,r0
+
+ interrupts:
+ minItems: 1
+ maxItems: 8
+ description: >
+ List of counter overflow interrupts, one per counter. The interrupts
+ must be specified starting with the cycle counter overflow interrupt,
+ followed by counter0 overflow interrupt, counter1 overflow
+ interrupt,... ,counterN overflow interrupt.
+
+ The CCI PMU has an interrupt signal for each counter. The number of
+ interrupts must be equal to the number of counters.
+
+ reg:
+ maxItems: 1
+
+ required:
+ - compatible
+ - interrupts
+ - reg
+
+ additionalProperties: false
+
+required:
+ - "#address-cells"
+ - "#size-cells"
+ - compatible
+ - ranges
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ / {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ compatible = "arm,vexpress,v2p-ca15_a7", "arm,vexpress";
+ model = "V2P-CA15_CA7";
+ arm,hbi = <0x249>;
+ interrupt-parent = <&gic>;
+
+ gic: interrupt-controller {
+ interrupt-controller;
+ #interrupt-cells = <3>;
+ };
+
+ /*
+ * This CCI node corresponds to a CCI component whose control
+ * registers sits at address 0x000000002c090000.
+ *
+ * CCI slave interface @0x000000002c091000 is connected to dma
+ * controller dma0.
+ *
+ * CCI slave interface @0x000000002c094000 is connected to CPUs
+ * {CPU0, CPU1};
+ *
+ * CCI slave interface @0x000000002c095000 is connected to CPUs
+ * {CPU2, CPU3};
+ */
+
+ cpus {
+ #size-cells = <0>;
+ #address-cells = <1>;
+
+ CPU0: cpu@0 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a15";
+ cci-control-port = <&cci_control1>;
+ reg = <0x0>;
+ };
+
+ CPU1: cpu@1 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a15";
+ cci-control-port = <&cci_control1>;
+ reg = <0x1>;
+ };
+
+ CPU2: cpu@100 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ cci-control-port = <&cci_control2>;
+ reg = <0x100>;
+ };
+
+ CPU3: cpu@101 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ cci-control-port = <&cci_control2>;
+ reg = <0x101>;
+ };
+ };
+
+ cci@2c090000 {
+ compatible = "arm,cci-400";
+ #address-cells = <1>;
+ #size-cells = <1>;
+ reg = <0x0 0x2c090000 0 0x1000>;
+ ranges = <0x0 0x0 0x2c090000 0x10000>;
+
+ cci_control0: slave-if@1000 {
+ compatible = "arm,cci-400-ctrl-if";
+ interface-type = "ace-lite";
+ reg = <0x1000 0x1000>;
+ };
+
+ cci_control1: slave-if@4000 {
+ compatible = "arm,cci-400-ctrl-if";
+ interface-type = "ace";
+ reg = <0x4000 0x1000>;
+ };
+
+ cci_control2: slave-if@5000 {
+ compatible = "arm,cci-400-ctrl-if";
+ interface-type = "ace";
+ reg = <0x5000 0x1000>;
+ };
+
+ pmu@9000 {
+ compatible = "arm,cci-400-pmu";
+ reg = <0x9000 0x5000>;
+ interrupts = <0 101 4>,
+ <0 102 4>,
+ <0 103 4>,
+ <0 104 4>,
+ <0 105 4>;
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-catu.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-catu.yaml
new file mode 100644
index 000000000000..2bae06eed693
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-catu.yaml
@@ -0,0 +1,104 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-catu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm Coresight Address Translation Unit (CATU)
+
+maintainers:
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+ - Mike Leach <mike.leach@linaro.org>
+ - Leo Yan <leo.yan@linaro.org>
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+
+description: |
+ CoreSight components are compliant with the ARM CoreSight architecture
+ specification and can be connected in various topologies to suit a particular
+ SoCs tracing needs. These trace components can generally be classified as
+ sinks, links and sources. Trace data produced by one or more sources flows
+ through the intermediate links connecting the source to the currently selected
+ sink.
+
+ The CoreSight Address Translation Unit (CATU) translates addresses between an
+ AXI master and system memory. The CATU is normally used along with the TMC to
+ implement scattering of virtual trace buffers in physical memory. The CATU
+ translates contiguous Virtual Addresses (VAs) from an AXI master into
+ non-contiguous Physical Addresses (PAs) that are intended for system memory.
+
+# Need a custom select here or 'arm,primecell' will match on lots of nodes
+select:
+ properties:
+ compatible:
+ contains:
+ const: arm,coresight-catu
+ required:
+ - compatible
+
+allOf:
+ - $ref: /schemas/arm/primecell.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: arm,coresight-catu
+ - const: arm,primecell
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 2
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: apb_pclk
+ - const: atclk
+
+ interrupts:
+ maxItems: 1
+ description: Address translation error interrupt
+
+ power-domains:
+ maxItems: 1
+
+ in-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ additionalProperties: false
+
+ properties:
+ port:
+ description: AXI Slave connected to another Coresight component
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - in-ports
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ catu@207e0000 {
+ compatible = "arm,coresight-catu", "arm,primecell";
+ reg = <0x207e0000 0x1000>;
+
+ clocks = <&oscclk6a>;
+ clock-names = "apb_pclk";
+
+ interrupts = <GIC_SPI 4 IRQ_TYPE_LEVEL_HIGH>;
+ in-ports {
+ port {
+ catu_in_port: endpoint {
+ remote-endpoint = <&etr_out_port>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-cpu-debug.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-cpu-debug.yaml
new file mode 100644
index 000000000000..0a6bc03ebe00
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-cpu-debug.yaml
@@ -0,0 +1,81 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-cpu-debug.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: CoreSight CPU Debug Component
+
+maintainers:
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+ - Mike Leach <mike.leach@linaro.org>
+ - Leo Yan <leo.yan@linaro.org>
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+
+description: |
+ CoreSight CPU debug components are compliant with the ARMv8 architecture
+ reference manual (ARM DDI 0487A.k) Chapter 'Part H: External debug'. The
+ external debug module is mainly used in two modes: self-hosted debug and
+ external debug. It can be accessed through an MMIO region from CoreSight,
+ and the debug module eventually connects with the CPU for debugging. The
+ debug module also provides a sample-based profiling extension, which can be
+ used to sample the CPU program counter, secure state and exception level,
+ etc.; usually every CPU has one dedicated debug module to be connected.
+
+select:
+ properties:
+ compatible:
+ contains:
+ const: arm,coresight-cpu-debug
+ required:
+ - compatible
+
+allOf:
+ - $ref: /schemas/arm/primecell.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: arm,coresight-cpu-debug
+ - const: arm,primecell
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ maxItems: 1
+
+ cpu:
+ description:
+ A phandle to the cpu this debug component is bound to.
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ power-domains:
+ maxItems: 1
+ description:
+ A phandle to the debug power domain if the debug logic has its own
+ dedicated power domain. CPU idle states may also need to be separately
+ constrained to keep CPU cores powered.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - cpu
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ debug@f6590000 {
+ compatible = "arm,coresight-cpu-debug", "arm,primecell";
+ reg = <0xf6590000 0x1000>;
+ clocks = <&sys_ctrl 1>;
+ clock-names = "apb_pclk";
+ cpu = <&cpu0>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-cti.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-cti.yaml
new file mode 100644
index 000000000000..0c5b875cb654
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-cti.yaml
@@ -0,0 +1,334 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+# Copyright 2019 Linaro Ltd.
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-cti.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ARM Coresight Cross Trigger Interface (CTI) device.
+
+description: |
+ The CoreSight Embedded Cross Trigger (ECT) consists of CTI devices connected
+ to one or more CoreSight components and/or a CPU, with CTIs interconnected in
+ a star topology via the Cross Trigger Matrix (CTM), which is not programmable.
+ The ECT components are not part of the trace generation data path and are thus
+ not part of the CoreSight graph.
+
+ The CTI component properties define the connections between the individual
+ CTI and the components it is directly connected to, consisting of input and
+ output hardware trigger signals. CTIs can have a maximum number of input and
+ output hardware trigger signals (8 each for v1 CTI, 32 each for v2 CTI). The
+ number is defined at design time; the maximum of each is defined in the
+ DEVID register.
+
+ CTIs are interconnected in a star topology via the CTM, using a number of
+ programmable channels, usually 4, but again implementation defined and
+ described in the DEVID register. The star topology is not required to be
+ described in the bindings as the actual connections are software
+ programmable.
+
+ In general the connections between CTI and components via the trigger signals
+ are implementation defined, except when the CTI is connected to an ARM v8
+ architecture core and optional ETM.
+
+ In this case the ARM v8 architecture defines the required signal connections
+ between CTI and the CPU core and ETM if present. In the case of a v8
+ architecturally connected CTI an additional compatible string is used to
+ indicate this feature (arm,coresight-cti-v8-arch).
+
+ When CTI trigger connection information is unavailable, a minimal driver
+ binding can be declared with no explicit trigger signals. This results in
+ the driver detecting the maximum available triggers and channels from the
+ DEVID register and making them all available for use as a single default
+ connection. Any user / client application will require additional information
+ on the connections between the CTI and other components for correct operation.
+ This information might be found by enabling the Integration Test registers in
+ the driver (set CONFIG_CORESIGHT_CTI_INTEGRATION_TEST in the kernel
+ configuration). These registers may be used to explore the trigger connections
+ between CTI and other CoreSight components.
+
+ Certain triggers between CoreSight devices and the CTI have specific types
+ and usages. These can be declared along with the signal indexes using the
+ constants defined in <dt-bindings/arm/coresight-cti-dt.h>.
+
+ For example a CTI connected to a core will usually have a DBGREQ signal. This
+ is defined in the binding as type PE_EDBGREQ. These types will appear in an
+ optional array alongside the signal indexes. Omitting types will default all
+ signals to GEN_IO.
+
+ Note that some hardware trigger signals can be connected to non-CoreSight
+ components (e.g. UART etc) depending on hardware implementation.
+
+maintainers:
+ - Mike Leach <mike.leach@linaro.org>
+
+allOf:
+ - $ref: /schemas/arm/primecell.yaml#
+
+# Need a custom select here or 'arm,primecell' will match on lots of nodes
+select:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - arm,coresight-cti
+ required:
+ - compatible
+
+properties:
+ $nodename:
+ pattern: "^cti(@[0-9a-f]+)$"
+ compatible:
+ oneOf:
+ - items:
+ - const: arm,coresight-cti
+ - const: arm,primecell
+ - items:
+ - const: arm,coresight-cti-v8-arch
+ - const: arm,coresight-cti
+ - const: arm,primecell
+
+ reg:
+ maxItems: 1
+
+ cpu:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Handle to cpu this device is associated with. This must appear in the
+ base cti node if compatible string arm,coresight-cti-v8-arch is used,
+ or may appear in a trig-conns child node when appropriate.
+
+ power-domains:
+ maxItems: 1
+
+ arm,cti-ctm-id:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Defines the CTM this CTI is connected to, in large systems with multiple
+ separate CTI/CTM nets. This is typical of multi-socket systems where the
+ CTM is propagated between sockets.
+
+ arm,cs-dev-assoc:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Defines a phandle reference to an associated CoreSight trace device.
+ When the associated trace device is enabled, the respective CTI will be
+ enabled as well. Use in a trig-conns node, or in the CTI base node when
+ the compatible string arm,coresight-cti-v8-arch is used. If the
+ associated device has not yet been registered, the node name will be
+ stored as the connection name for later resolution. If the associated
+ device is not a CoreSight device, or is never registered, the node name
+ remains the connection name and automatic enabling will not occur.
+
+ # size cells and address cells required if trig-conns node present.
+ "#size-cells":
+ const: 0
+
+ "#address-cells":
+ const: 1
+
+patternProperties:
+ '^trig-conns@([0-9]+)$':
+ type: object
+ description:
+ A trigger connections child node which describes the trigger signals
+ between this CTI and another hardware device. This device may be a CPU,
+ CoreSight device, any other hardware device or simple external IO lines.
+ The connection may have both input and output triggers, or only one or the
+ other.
+
+ properties:
+ reg:
+ maxItems: 1
+
+ arm,trig-in-sigs:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ minItems: 1
+ maxItems: 32
+ description:
+ List of CTI trigger in signal numbers in use by a trig-conns node.
+
+ arm,trig-in-types:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ minItems: 1
+ maxItems: 32
+ description:
+ List of constants representing the types for the CTI trigger in
+ signals. Types in this array match to the corresponding signal in the
+ arm,trig-in-sigs array. If the -types array is smaller, or omitted
+ completely, then the types will default to GEN_IO.
+
+ arm,trig-out-sigs:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ minItems: 1
+ maxItems: 32
+ description:
+ List of CTI trigger out signal numbers in use by a trig-conns node.
+
+ arm,trig-out-types:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ minItems: 1
+ maxItems: 32
+ description:
+ List of constants representing the types for the CTI trigger out
+ signals. Types in this array match to the corresponding signal
+ in the arm,trig-out-sigs array. If the "-types" array is smaller,
+ or omitted completely, then the types will default to GEN_IO.
+
+ arm,trig-filters:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ minItems: 1
+ maxItems: 32
+ description:
+ List of CTI trigger out signals that will be blocked from becoming
+ active, unless filtering is disabled on the driver.
+
+ arm,trig-conn-name:
+ $ref: /schemas/types.yaml#/definitions/string
+ description:
+ Defines a connection name that will be displayed, if the cpu or
+ arm,cs-dev-assoc properties are not being used in this connection.
+ Principal use is for CTIs that are connected to non-CoreSight devices
+ or external IO.
+
+ anyOf:
+ - required:
+ - arm,trig-in-sigs
+ - required:
+ - arm,trig-out-sigs
+ oneOf:
+ - required:
+ - arm,trig-conn-name
+ - required:
+ - cpu
+ - required:
+ - arm,cs-dev-assoc
+ required:
+ - reg
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+
+if:
+ properties:
+ compatible:
+ contains:
+ const: arm,coresight-cti-v8-arch
+
+then:
+ required:
+ - cpu
+
+unevaluatedProperties: false
+
+examples:
+ # minimum CTI definition. DEVID register used to set number of triggers.
+ - |
+ cti@20020000 {
+ compatible = "arm,coresight-cti", "arm,primecell";
+ reg = <0x20020000 0x1000>;
+
+ clocks = <&soc_smc50mhz>;
+ clock-names = "apb_pclk";
+ };
+ # v8 architecturally defined CTI - CPU + ETM connections generated by the
+ # driver according to the v8 architecture specification.
+ - |
+ cti@859000 {
+ compatible = "arm,coresight-cti-v8-arch", "arm,coresight-cti",
+ "arm,primecell";
+ reg = <0x859000 0x1000>;
+
+ clocks = <&soc_smc50mhz>;
+ clock-names = "apb_pclk";
+
+ cpu = <&CPU1>;
+ arm,cs-dev-assoc = <&etm1>;
+ };
+ # Implementation defined CTI - CPU + ETM connections explicitly defined.
+ # Shows use of type constants from dt-bindings/arm/coresight-cti-dt.h
+ # #size-cells and #address-cells are required if trig-conns@ nodes present.
+ - |
+ #include <dt-bindings/arm/coresight-cti-dt.h>
+
+ cti@858000 {
+ compatible = "arm,coresight-cti", "arm,primecell";
+ reg = <0x858000 0x1000>;
+
+ clocks = <&soc_smc50mhz>;
+ clock-names = "apb_pclk";
+
+ arm,cti-ctm-id = <1>;
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ trig-conns@0 {
+ reg = <0>;
+ arm,trig-in-sigs = <4 5 6 7>;
+ arm,trig-in-types = <ETM_EXTOUT
+ ETM_EXTOUT
+ ETM_EXTOUT
+ ETM_EXTOUT>;
+ arm,trig-out-sigs = <4 5 6 7>;
+ arm,trig-out-types = <ETM_EXTIN
+ ETM_EXTIN
+ ETM_EXTIN
+ ETM_EXTIN>;
+ arm,cs-dev-assoc = <&etm0>;
+ };
+
+ trig-conns@1 {
+ reg = <1>;
+ cpu = <&CPU0>;
+ arm,trig-in-sigs = <0 1>;
+ arm,trig-in-types = <PE_DBGTRIGGER
+ PE_PMUIRQ>;
+ arm,trig-out-sigs = <0 1 2>;
+ arm,trig-out-types = <PE_EDBGREQ
+ PE_DBGRESTART
+ PE_CTIIRQ>;
+
+ arm,trig-filters = <0>;
+ };
+ };
+ # Implementation defined CTI - non CoreSight component connections.
+ - |
+ cti@20110000 {
+ compatible = "arm,coresight-cti", "arm,primecell";
+ reg = <0x20110000 0x1000>;
+
+ clocks = <&soc_smc50mhz>;
+ clock-names = "apb_pclk";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ trig-conns@0 {
+ reg = <0>;
+ arm,trig-in-sigs = <0>;
+ arm,trig-in-types = <GEN_INTREQ>;
+ arm,trig-out-sigs = <0>;
+ arm,trig-out-types = <GEN_HALTREQ>;
+ arm,trig-conn-name = "sys_profiler";
+ };
+
+ trig-conns@1 {
+ reg = <1>;
+ arm,trig-out-sigs = <2 3>;
+ arm,trig-out-types = <GEN_HALTREQ GEN_RESTARTREQ>;
+ arm,trig-conn-name = "watchdog";
+ };
+
+ trig-conns@2 {
+ reg = <2>;
+ arm,trig-in-sigs = <1 6>;
+ arm,trig-in-types = <GEN_HALTREQ GEN_RESTARTREQ>;
+ arm,trig-conn-name = "g_counter";
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-dynamic-funnel.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-dynamic-funnel.yaml
new file mode 100644
index 000000000000..44a1041cb0fc
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-dynamic-funnel.yaml
@@ -0,0 +1,129 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-dynamic-funnel.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm CoreSight Programmable Trace Bus Funnel
+
+maintainers:
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+ - Mike Leach <mike.leach@linaro.org>
+ - Leo Yan <leo.yan@linaro.org>
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+
+description: |
+ CoreSight components are compliant with the ARM CoreSight architecture
+ specification and can be connected in various topologies to suit a particular
+ SoC's tracing needs. These trace components can generally be classified as
+ sinks, links and sources. Trace data produced by one or more sources flows
+ through the intermediate links connecting the source to the currently selected
+ sink.
+
+ The CoreSight funnel merges 2-8 trace sources into a single trace
+ stream with programmable enable and priority of input ports.
+
+# Need a custom select here or 'arm,primecell' will match on lots of nodes
+select:
+ properties:
+ compatible:
+ contains:
+ const: arm,coresight-dynamic-funnel
+ required:
+ - compatible
+
+allOf:
+ - $ref: /schemas/arm/primecell.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: arm,coresight-dynamic-funnel
+ - const: arm,primecell
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 2
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: apb_pclk
+ - const: atclk
+
+ power-domains:
+ maxItems: 1
+
+ in-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ patternProperties:
+ '^port(@[0-7])?$':
+ description: Input connections from CoreSight Trace bus
+ $ref: /schemas/graph.yaml#/properties/port
+
+ out-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ additionalProperties: false
+
+ properties:
+ port:
+ description: Output connection to CoreSight Trace bus
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - in-ports
+ - out-ports
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ funnel@20040000 {
+ compatible = "arm,coresight-dynamic-funnel", "arm,primecell";
+ reg = <0x20040000 0x1000>;
+
+ clocks = <&oscclk6a>;
+ clock-names = "apb_pclk";
+ out-ports {
+ port {
+ funnel_out_port0: endpoint {
+ remote-endpoint = <&replicator_in_port0>;
+ };
+ };
+ };
+
+ in-ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ funnel_in_port0: endpoint {
+ remote-endpoint = <&ptm0_out_port>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ funnel_in_port1: endpoint {
+ remote-endpoint = <&ptm1_out_port>;
+ };
+ };
+
+ port@2 {
+ reg = <2>;
+ funnel_in_port2: endpoint {
+ remote-endpoint = <&etm0_out_port>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-dynamic-replicator.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-dynamic-replicator.yaml
new file mode 100644
index 000000000000..03792e9bd97a
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-dynamic-replicator.yaml
@@ -0,0 +1,129 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-dynamic-replicator.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm Coresight Programmable Trace Bus Replicator
+
+maintainers:
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+ - Mike Leach <mike.leach@linaro.org>
+ - Leo Yan <leo.yan@linaro.org>
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+
+description: |
+ CoreSight components are compliant with the ARM CoreSight architecture
+ specification and can be connected in various topologies to suit a particular
+ SoC's tracing needs. These trace components can generally be classified as
+ sinks, links and sources. Trace data produced by one or more sources flows
+ through the intermediate links connecting the source to the currently selected
+ sink.
+
+ The CoreSight replicator splits a single trace stream into two trace streams
+ for systems that have more than one trace sink component.
+
+# Need a custom select here or 'arm,primecell' will match on lots of nodes
+select:
+ properties:
+ compatible:
+ contains:
+ const: arm,coresight-dynamic-replicator
+ required:
+ - compatible
+
+allOf:
+ - $ref: /schemas/arm/primecell.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: arm,coresight-dynamic-replicator
+ - const: arm,primecell
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 2
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: apb_pclk
+ - const: atclk
+
+ power-domains:
+ maxItems: 1
+
+ qcom,replicator-loses-context:
+ type: boolean
+ description:
+ Indicates that the replicator will lose register context when the AMBA
+ clock is removed, which is observed in some replicator designs.
+
+ in-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ additionalProperties: false
+
+ properties:
+ port:
+ description: Input connection from CoreSight Trace bus
+ $ref: /schemas/graph.yaml#/properties/port
+
+ out-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ patternProperties:
+ '^port(@[01])?$':
+ description: Output connections to CoreSight Trace bus
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - in-ports
+ - out-ports
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ replicator@20120000 {
+ compatible = "arm,coresight-dynamic-replicator", "arm,primecell";
+ reg = <0x20120000 0x1000>;
+
+ clocks = <&soc_smc50mhz>;
+ clock-names = "apb_pclk";
+
+ out-ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ /* replicator output ports */
+ port@0 {
+ reg = <0>;
+ replicator_out_port0: endpoint {
+ remote-endpoint = <&tpiu_in_port>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ replicator_out_port1: endpoint {
+ remote-endpoint = <&etr_in_port>;
+ };
+ };
+ };
+ in-ports {
+ port {
+ replicator_in_port0: endpoint {
+ remote-endpoint = <&csys2_funnel_out_port>;
+ };
+ };
+ };
+ };
+...
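The qcom,replicator-loses-context flag documented above is not exercised by the example. A sketch of a node using it, with hypothetical address, clock and endpoint labels:

    replicator@6046000 {
        compatible = "arm,coresight-dynamic-replicator", "arm,primecell";
        reg = <0x06046000 0x1000>;

        clocks = <&apb_clk>;
        clock-names = "apb_pclk";

        /* this replicator design loses register state when the
         * AMBA clock is removed
         */
        qcom,replicator-loses-context;

        in-ports {
            port {
                swao_rep_in: endpoint {
                    remote-endpoint = <&swao_funnel_out>;
                };
            };
        };

        out-ports {
            port {
                swao_rep_out: endpoint {
                    remote-endpoint = <&etr1_in_port>;
                };
            };
        };
    };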
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-etb10.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-etb10.yaml
new file mode 100644
index 000000000000..90679788e0bf
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-etb10.yaml
@@ -0,0 +1,95 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-etb10.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm CoreSight Embedded Trace Buffer
+
+maintainers:
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+ - Mike Leach <mike.leach@linaro.org>
+ - Leo Yan <leo.yan@linaro.org>
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+
+description: |
+ CoreSight components are compliant with the ARM CoreSight architecture
+ specification and can be connected in various topologies to suit a particular
+ SoC's tracing needs. These trace components can generally be classified as
+ sinks, links and sources. Trace data produced by one or more sources flows
+ through the intermediate links connecting the source to the currently selected
+ sink.
+
+ The CoreSight Embedded Trace Buffer stores traces in a dedicated SRAM that is
+ used as a circular buffer.
+
+# Need a custom select here or 'arm,primecell' will match on lots of nodes
+select:
+ properties:
+ compatible:
+ contains:
+ const: arm,coresight-etb10
+ required:
+ - compatible
+
+allOf:
+ - $ref: /schemas/arm/primecell.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: arm,coresight-etb10
+ - const: arm,primecell
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 2
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: apb_pclk
+ - const: atclk
+
+ power-domains:
+ maxItems: 1
+
+ in-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ additionalProperties: false
+
+ properties:
+ port:
+ description: Input connection from CoreSight Trace bus.
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - in-ports
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ etb@20010000 {
+ compatible = "arm,coresight-etb10", "arm,primecell";
+ reg = <0x20010000 0x1000>;
+
+ clocks = <&oscclk6a>;
+ clock-names = "apb_pclk";
+ in-ports {
+ port {
+ etb_in_port: endpoint {
+ remote-endpoint = <&replicator_out_port0>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-etm.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-etm.yaml
new file mode 100644
index 000000000000..01200f67504a
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-etm.yaml
@@ -0,0 +1,159 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-etm.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm CoreSight Embedded Trace MacroCell
+
+maintainers:
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+ - Mike Leach <mike.leach@linaro.org>
+ - Leo Yan <leo.yan@linaro.org>
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+
+description: |
+ CoreSight components are compliant with the ARM CoreSight architecture
+ specification and can be connected in various topologies to suit a particular
+ SoC's tracing needs. These trace components can generally be classified as
+ sinks, links and sources. Trace data produced by one or more sources flows
+ through the intermediate links connecting the source to the currently selected
+ sink.
+
+ The Embedded Trace Macrocell (ETM) is a real-time trace module providing
+ instruction and data tracing of a processor.
+
+select:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - arm,coresight-etm3x
+ - arm,coresight-etm4x
+ - arm,coresight-etm4x-sysreg
+ required:
+ - compatible
+
+allOf:
+ - if:
+ not:
+ properties:
+ compatible:
+ contains:
+ const: arm,coresight-etm4x-sysreg
+ then:
+ $ref: /schemas/arm/primecell.yaml#
+ required:
+ - reg
+
+properties:
+ compatible:
+ oneOf:
+ - description:
+ Embedded Trace Macrocell with memory mapped access.
+ items:
+ - enum:
+ - arm,coresight-etm3x
+ - arm,coresight-etm4x
+ - const: arm,primecell
+ - description:
+ Embedded Trace Macrocell (version 4.x), with system register access only
+ const: arm,coresight-etm4x-sysreg
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 2
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: apb_pclk
+ - const: atclk
+
+ power-domains:
+ maxItems: 1
+
+ arm,coresight-loses-context-with-cpu:
+ type: boolean
+ description:
+ Indicates that the hardware will lose register context on CPU power down
+ (e.g. CPUIdle). An example of where this may be needed is a system which
+ contains a CoreSight component and CPU in the same power domain. When the
+ CPU powers down, the CoreSight component also powers down and loses its
+ context.
+
+ arm,cp14:
+ type: boolean
+ description:
+ Must be present if the system accesses ETM/PTM management registers via
+ co-processor 14.
+
+ qcom,skip-power-up:
+ type: boolean
+ description:
+ Indicates that an implementation can skip powering up the trace unit.
+ TRCPDCR.PU does not have to be set on Qualcomm Technologies Inc. systems
+ since ETMs are in the same power domain as their CPU cores. This property
+ is required to identify such systems with hardware errata where the CPU
+ watchdog counter is stopped when TRCPDCR.PU is set.
+
+ cpu:
+ description:
+ phandle to the cpu this ETM is bound to.
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ out-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ additionalProperties: false
+
+ properties:
+ port:
+ description: Output connection from the ETM to CoreSight Trace bus.
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - cpu
+ - out-ports
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ ptm@2201c000 {
+ compatible = "arm,coresight-etm3x", "arm,primecell";
+ reg = <0x2201c000 0x1000>;
+
+ cpu = <&cpu0>;
+ clocks = <&oscclk6a>;
+ clock-names = "apb_pclk";
+ out-ports {
+ port {
+ ptm0_out_port: endpoint {
+ remote-endpoint = <&funnel_in_port0>;
+ };
+ };
+ };
+ };
+
+ ptm@2201d000 {
+ compatible = "arm,coresight-etm3x", "arm,primecell";
+ reg = <0x2201d000 0x1000>;
+
+ cpu = <&cpu1>;
+ clocks = <&oscclk6a>;
+ clock-names = "apb_pclk";
+ out-ports {
+ port {
+ ptm1_out_port: endpoint {
+ remote-endpoint = <&funnel_in_port1>;
+ };
+ };
+ };
+ };
+...
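The schema above documents the ETMv4 compatibles and the arm,coresight-loses-context-with-cpu flag, but the examples only show memory-mapped ETM3x/PTM nodes. A sketch of an ETMv4 node using the flag; the address, clock and endpoint labels are assumptions:

    etm@7040000 {
        compatible = "arm,coresight-etm4x", "arm,primecell";
        reg = <0x07040000 0x1000>;

        cpu = <&cpu0>;
        clocks = <&soc_clk>;
        clock-names = "apb_pclk";

        /* ETM shares the CPU power domain, so state is lost in CPUIdle */
        arm,coresight-loses-context-with-cpu;

        out-ports {
            port {
                etm0_out_port: endpoint {
                    remote-endpoint = <&funnel_in_port2>;
                };
            };
        };
    };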
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-static-funnel.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-static-funnel.yaml
new file mode 100644
index 000000000000..cc8c3baa79b4
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-static-funnel.yaml
@@ -0,0 +1,93 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-static-funnel.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm CoreSight Static Trace Bus Funnel
+
+maintainers:
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+ - Mike Leach <mike.leach@linaro.org>
+ - Leo Yan <leo.yan@linaro.org>
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+
+description: |
+ CoreSight components are compliant with the ARM CoreSight architecture
+ specification and can be connected in various topologies to suit a particular
+ SoC's tracing needs. These trace components can generally be classified as
+ sinks, links and sources. Trace data produced by one or more sources flows
+ through the intermediate links connecting the source to the currently selected
+ sink.
+
+ The CoreSight static funnel merges 2-8 trace sources into a single trace
+ stream.
+
+properties:
+ compatible:
+ const: arm,coresight-static-funnel
+
+ power-domains:
+ maxItems: 1
+
+ in-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ patternProperties:
+ '^port@[0-7]$':
+ description: Input connections from CoreSight Trace bus
+ $ref: /schemas/graph.yaml#/properties/port
+
+ out-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ additionalProperties: false
+
+ properties:
+ port:
+ description: Output connection to CoreSight Trace bus
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - in-ports
+ - out-ports
+
+additionalProperties: false
+
+examples:
+ - |
+ funnel {
+ /*
+ * non-configurable replicators don't show up on the
+ * AMBA bus. As such no need to add "arm,primecell".
+ */
+ compatible = "arm,coresight-static-funnel";
+
+ out-ports {
+ port {
+ combo_funnel_out: endpoint {
+ remote-endpoint = <&top_funnel_in>;
+ };
+ };
+ };
+
+ in-ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ combo_funnel_in0: endpoint {
+ remote-endpoint = <&cluster0_etf_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ combo_funnel_in1: endpoint {
+ remote-endpoint = <&cluster1_etf_out>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-static-replicator.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-static-replicator.yaml
new file mode 100644
index 000000000000..1892a091ac35
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-static-replicator.yaml
@@ -0,0 +1,94 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-static-replicator.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm CoreSight Static Trace Bus Replicator
+
+maintainers:
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+ - Mike Leach <mike.leach@linaro.org>
+ - Leo Yan <leo.yan@linaro.org>
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+
+description: |
+ CoreSight components are compliant with the ARM CoreSight architecture
+ specification and can be connected in various topologies to suit a particular
+ SoC's tracing needs. These trace components can generally be classified as
+ sinks, links and sources. Trace data produced by one or more sources flows
+ through the intermediate links connecting the source to the currently selected
+ sink.
+
+ The CoreSight replicator splits a single trace stream into two trace streams
+ for systems that have more than one trace sink component.
+
+properties:
+ compatible:
+ const: arm,coresight-static-replicator
+
+ power-domains:
+ maxItems: 1
+
+ in-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ additionalProperties: false
+
+ properties:
+ port:
+ description: Input connection from CoreSight Trace bus
+ $ref: /schemas/graph.yaml#/properties/port
+
+ out-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ patternProperties:
+ '^port@[01]$':
+ description: Output connections to CoreSight Trace bus
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - in-ports
+ - out-ports
+
+additionalProperties: false
+
+examples:
+ - |
+ replicator {
+ /*
+ * non-configurable replicators don't show up on the
+ * AMBA bus. As such no need to add "arm,primecell".
+ */
+ compatible = "arm,coresight-static-replicator";
+
+ out-ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ /* replicator output ports */
+ port@0 {
+ reg = <0>;
+ replicator_out_port0: endpoint {
+ remote-endpoint = <&etb_in_port>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ replicator_out_port1: endpoint {
+ remote-endpoint = <&tpiu_in_port>;
+ };
+ };
+ };
+
+ in-ports {
+ port {
+ replicator_in_port0: endpoint {
+ remote-endpoint = <&funnel_out_port0>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-stm.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-stm.yaml
new file mode 100644
index 000000000000..378380c3f5aa
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-stm.yaml
@@ -0,0 +1,104 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-stm.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm CoreSight System Trace MacroCell
+
+maintainers:
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+ - Mike Leach <mike.leach@linaro.org>
+ - Leo Yan <leo.yan@linaro.org>
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+
+description: |
+ CoreSight components are compliant with the ARM CoreSight architecture
+ specification and can be connected in various topologies to suit a particular
+ SoC's tracing needs. These trace components can generally be classified as
+ sinks, links and sources. Trace data produced by one or more sources flows
+ through the intermediate links connecting the source to the currently selected
+ sink.
+
+ The STM is a trace source that is integrated into a CoreSight system, designed
+ primarily for high-bandwidth trace of instrumentation embedded into software.
+ This instrumentation is made up of memory-mapped writes to the STM Advanced
+ eXtensible Interface (AXI) slave, which carry information about the behavior
+ of the software.
+
+select:
+ properties:
+ compatible:
+ contains:
+ const: arm,coresight-stm
+ required:
+ - compatible
+
+allOf:
+ - $ref: /schemas/arm/primecell.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: arm,coresight-stm
+ - const: arm,primecell
+
+ reg:
+ maxItems: 2
+
+ reg-names:
+ items:
+ - const: stm-base
+ - const: stm-stimulus-base
+
+ clocks:
+ minItems: 1
+ maxItems: 2
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: apb_pclk
+ - const: atclk
+
+ power-domains:
+ maxItems: 1
+
+ out-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ additionalProperties: false
+
+ properties:
+ port:
+ description: Output connection to the CoreSight Trace bus.
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+ - out-ports
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ stm@20100000 {
+ compatible = "arm,coresight-stm", "arm,primecell";
+ reg = <0x20100000 0x1000>,
+ <0x28000000 0x180000>;
+ reg-names = "stm-base", "stm-stimulus-base";
+
+ clocks = <&soc_smc50mhz>;
+ clock-names = "apb_pclk";
+ out-ports {
+ port {
+ stm_out_port: endpoint {
+ remote-endpoint = <&main_funnel_in_port2>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-tmc.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-tmc.yaml
new file mode 100644
index 000000000000..cb8dceaca70e
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-tmc.yaml
@@ -0,0 +1,137 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-tmc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm CoreSight Trace Memory Controller
+
+maintainers:
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+ - Mike Leach <mike.leach@linaro.org>
+ - Leo Yan <leo.yan@linaro.org>
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+
+description: |
+ CoreSight components are compliant with the ARM CoreSight architecture
+ specification and can be connected in various topologies to suit a particular
+ SoC's tracing needs. These trace components can generally be classified as
+ sinks, links and sources. Trace data produced by one or more sources flows
+ through the intermediate links connecting the source to the currently selected
+ sink.
+
+ The Trace Memory Controller is used for Embedded Trace Buffer (ETB),
+ Embedded Trace FIFO (ETF) and Embedded Trace Router (ETR) configurations.
+ The configuration
+ mode (ETB, ETF, ETR) is discovered at boot time when the device is probed.
+
+# Need a custom select here or 'arm,primecell' will match on lots of nodes
+select:
+ properties:
+ compatible:
+ contains:
+ const: arm,coresight-tmc
+ required:
+ - compatible
+
+allOf:
+ - $ref: /schemas/arm/primecell.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: arm,coresight-tmc
+ - const: arm,primecell
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 2
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: apb_pclk
+ - const: atclk
+
+ iommus:
+ maxItems: 1
+
+ power-domains:
+ maxItems: 1
+
+ arm,buffer-size:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ deprecated: true
+ description:
+ Size of contiguous buffer space for TMC ETR (Embedded Trace Router). The
+ buffer size can instead be configured dynamically via the buffer_size
+ property in sysfs.
+
+ arm,scatter-gather:
+ type: boolean
+ description:
+ Indicates that the TMC-ETR can safely use the SG mode on this system.
+
+ arm,max-burst-size:
+ description:
+ The maximum burst size initiated by the TMC on the AXI master interface.
+ The burst size can be in the range [0..15]; the setting supports from one
+ data transfer per burst up to a maximum of 16 data transfers per burst.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ maximum: 15
+
+ in-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ additionalProperties: false
+
+ properties:
+ port:
+ description: Input connection from the CoreSight Trace bus.
+ $ref: /schemas/graph.yaml#/properties/port
+
+ out-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ additionalProperties: false
+
+ properties:
+ port:
+ description: AXI or ATB Master output connection. Used for ETR
+ and ETF configurations.
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - in-ports
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ etr@20070000 {
+ compatible = "arm,coresight-tmc", "arm,primecell";
+ reg = <0x20070000 0x1000>;
+
+ clocks = <&oscclk6a>;
+ clock-names = "apb_pclk";
+ in-ports {
+ port {
+ etr_in_port: endpoint {
+ remote-endpoint = <&replicator2_out_port0>;
+ };
+ };
+ };
+
+ out-ports {
+ port {
+ etr_out_port: endpoint {
+ remote-endpoint = <&catu_in_port>;
+ };
+ };
+ };
+ };
+...
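The ETR-specific tunables arm,scatter-gather and arm,max-burst-size are documented above but absent from the example. A sketch showing both, with hypothetical labels; the value 15 requests up to 16 data transfers per AXI burst:

    etr@20070000 {
        compatible = "arm,coresight-tmc", "arm,primecell";
        reg = <0x20070000 0x1000>;

        clocks = <&oscclk6a>;
        clock-names = "apb_pclk";

        /* the ETR may safely use AXI scatter-gather mode on this system */
        arm,scatter-gather;
        /* 15 => up to 16 data transfers per AXI burst */
        arm,max-burst-size = <15>;

        in-ports {
            port {
                etr1_in_port: endpoint {
                    remote-endpoint = <&rep_out_port>;
                };
            };
        };
    };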
diff --git a/Documentation/devicetree/bindings/arm/arm,coresight-tpiu.yaml b/Documentation/devicetree/bindings/arm/arm,coresight-tpiu.yaml
new file mode 100644
index 000000000000..61a0cdc27745
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,coresight-tpiu.yaml
@@ -0,0 +1,94 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,coresight-tpiu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm CoreSight Trace Port Interface Unit
+
+maintainers:
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+ - Mike Leach <mike.leach@linaro.org>
+ - Leo Yan <leo.yan@linaro.org>
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+
+description: |
+ CoreSight components are compliant with the ARM CoreSight architecture
+ specification and can be connected in various topologies to suit a particular
+ SoC's tracing needs. These trace components can generally be classified as
+ sinks, links and sources. Trace data produced by one or more sources flows
+ through the intermediate links connecting the source to the currently selected
+ sink.
+
+ The CoreSight Trace Port Interface Unit captures trace data from the trace bus
+ and outputs it to an external trace port.
+
+# Need a custom select here or 'arm,primecell' will match on lots of nodes
+select:
+ properties:
+ compatible:
+ contains:
+ const: arm,coresight-tpiu
+ required:
+ - compatible
+
+allOf:
+ - $ref: /schemas/arm/primecell.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: arm,coresight-tpiu
+ - const: arm,primecell
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 2
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: apb_pclk
+ - const: atclk
+
+ power-domains:
+ maxItems: 1
+
+ in-ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ additionalProperties: false
+
+ properties:
+ port:
+ description: Input connection from the CoreSight Trace bus.
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - in-ports
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ tpiu@e3c05000 {
+ compatible = "arm,coresight-tpiu", "arm,primecell";
+ reg = <0xe3c05000 0x1000>;
+
+ clocks = <&clk_375m>;
+ clock-names = "apb_pclk";
+ in-ports {
+ port {
+ tpiu_in_port: endpoint {
+ remote-endpoint = <&funnel4_out_port0>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,corstone1000.yaml b/Documentation/devicetree/bindings/arm/arm,corstone1000.yaml
new file mode 100644
index 000000000000..693f3fe7be60
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,corstone1000.yaml
@@ -0,0 +1,45 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,corstone1000.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ARM Corstone1000
+
+maintainers:
+ - Vishnu Banavath <vishnu.banavath@arm.com>
+ - Rui Miguel Silva <rui.silva@linaro.org>
+
+description: |+
+ ARM's Corstone1000 includes the pre-verified Corstone SSE-710 subsystem,
+ which provides a flexible compute architecture combining Cortex-A and
+ Cortex-M processors.
+
+ It supports Cortex-A32, Cortex-A35 and Cortex-A53 processors, and offers
+ two expansion systems for M-class (or other) processors for adding sensors,
+ connectivity, video, audio and machine learning at the edge. System and
+ security IPs are included to build a secure SoC for a range of rich IoT
+ applications, for example gateways, smart cameras and embedded systems.
+
+ An integrated Secure Enclave provides a hardware Root of Trust and supports
+ seamless integration of the optional CryptoCell™-312 cryptographic
+ accelerator.
+
+properties:
+ $nodename:
+ const: '/'
+ compatible:
+ oneOf:
+ - description: Corstone1000 MPS3, which has one Cortex-A35 CPU core in an
+ FPGA implementation of the Corstone1000 on the MPS3 prototyping board.
+ See ARM document DAI0550.
+ items:
+ - const: arm,corstone1000-mps3
+ - description: Corstone1000 FVP is the Fixed Virtual Platform
+ implementation of this system. See ARM's ecosystem FVPs.
+ items:
+ - const: arm,corstone1000-fvp
+
+additionalProperties: true
+
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,embedded-trace-extension.yaml b/Documentation/devicetree/bindings/arm/arm,embedded-trace-extension.yaml
new file mode 100644
index 000000000000..108460627d9a
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,embedded-trace-extension.yaml
@@ -0,0 +1,77 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+# Copyright 2021, Arm Ltd
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/arm/arm,embedded-trace-extension.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: ARM Embedded Trace Extensions
+
+maintainers:
+ - Suzuki K Poulose <suzuki.poulose@arm.com>
+ - Mathieu Poirier <mathieu.poirier@linaro.org>
+
+description: |
+ The Arm Embedded Trace Extension (ETE) is a per-CPU trace component that
+ allows tracing of CPU execution. It overlaps with the CoreSight ETMv4
+ architecture and has extended support for future architecture changes.
+ The trace generated by the ETE can be stored via legacy CoreSight
+ components (e.g., TMC-ETR) or by other means (e.g., using the per-CPU
+ Arm Trace Buffer Extension (TRBE)). Since the ETE can be connected to
+ legacy CoreSight components, a node must be listed per instance, along
+ with any optional connection graph as per the CoreSight bindings.
+
+properties:
+ $nodename:
+ pattern: "^ete([0-9a-f]+)$"
+ compatible:
+ items:
+ - const: arm,embedded-trace-extension
+
+ cpu:
+ description: |
+ Handle to the cpu this ETE is bound to.
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ power-domains:
+ maxItems: 1
+
+ out-ports:
+ description: |
+ Output connections from the ETE to legacy CoreSight trace bus.
+ $ref: /schemas/graph.yaml#/properties/ports
+ properties:
+ port:
+ description: Output connection from the ETE to legacy CoreSight Trace bus.
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - cpu
+
+additionalProperties: false
+
+examples:
+
+# An ETE node without legacy CoreSight connections
+ - |
+ ete0 {
+ compatible = "arm,embedded-trace-extension";
+ cpu = <&cpu_0>;
+ };
+# An ETE node with legacy CoreSight connections
+ - |
+ ete1 {
+ compatible = "arm,embedded-trace-extension";
+ cpu = <&cpu_1>;
+
+ out-ports { /* legacy coresight connection */
+ port {
+ ete1_out_port: endpoint {
+ remote-endpoint = <&funnel_in_port0>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/arm/arm,integrator.yaml b/Documentation/devicetree/bindings/arm/arm,integrator.yaml
index 528eee64290a..98ff5698ae1f 100644
--- a/Documentation/devicetree/bindings/arm/arm,integrator.yaml
+++ b/Documentation/devicetree/bindings/arm/arm,integrator.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/arm,integrator.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ARM Integrator Boards Device Tree Bindings
+title: ARM Integrator Boards
maintainers:
- Linus Walleij <linus.walleij@linaro.org>
diff --git a/Documentation/devicetree/bindings/arm/arm,realview.yaml b/Documentation/devicetree/bindings/arm/arm,realview.yaml
index 4f9b21f49e84..8d3ed2e4ed31 100644
--- a/Documentation/devicetree/bindings/arm/arm,realview.yaml
+++ b/Documentation/devicetree/bindings/arm/arm,realview.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/arm,realview.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ARM RealView Boards Device Tree Bindings
+title: ARM RealView Boards
maintainers:
- Linus Walleij <linus.walleij@linaro.org>
diff --git a/Documentation/devicetree/bindings/arm/arm,trace-buffer-extension.yaml b/Documentation/devicetree/bindings/arm/arm,trace-buffer-extension.yaml
new file mode 100644
index 000000000000..b1322658063a
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,trace-buffer-extension.yaml
@@ -0,0 +1,49 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+# Copyright 2021, Arm Ltd
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/arm/arm,trace-buffer-extension.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: ARM Trace Buffer Extensions
+
+maintainers:
+ - Anshuman Khandual <anshuman.khandual@arm.com>
+
+description: |
+ The Arm Trace Buffer Extension (TRBE) is a per-CPU component
+ for storing trace generated on the CPU to memory. It is
+ accessed via CPU system registers. Software can verify
+ whether it is permitted to use the component by checking the
+ TRBIDR register.
+
+properties:
+ $nodename:
+ const: "trbe"
+ compatible:
+ items:
+ - const: arm,trace-buffer-extension
+
+ interrupts:
+ description: |
+ Exactly 1 PPI must be listed. For heterogeneous systems where
+ TRBE is only supported on a subset of the CPUs, please consult
+ the arm,gic-v3 binding for details on describing a PPI partition.
+ maxItems: 1
+
+required:
+ - compatible
+ - interrupts
+
+additionalProperties: false
+
+examples:
+
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ trbe {
+ compatible = "arm,trace-buffer-extension";
+ interrupts = <GIC_PPI 15 IRQ_TYPE_LEVEL_HIGH>;
+ };
+...
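For the heterogeneous case, the interrupts description defers to the arm,gic-v3 binding. A rough sketch of a PPI partition restricting the TRBE interrupt to TRBE-capable CPUs, following the partition pattern shown in that binding; the partition label and CPU phandles are assumptions, so consult the gic-v3 binding for the authoritative form:

    #include <dt-bindings/interrupt-controller/arm-gic.h>

    /* under the GIC node: a partition covering only TRBE-capable CPUs */
    ppi-partitions {
        part0: interrupt-partition-0 {
            affinity = <&cpu0 &cpu1>;
        };
    };

    trbe {
        compatible = "arm,trace-buffer-extension";
        interrupts-extended = <&part0 15 IRQ_TYPE_LEVEL_HIGH>;
    };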
diff --git a/Documentation/devicetree/bindings/arm/arm,versatile-sysreg.yaml b/Documentation/devicetree/bindings/arm/arm,versatile-sysreg.yaml
new file mode 100644
index 000000000000..491eef1e1b10
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm,versatile-sysreg.yaml
@@ -0,0 +1,35 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/arm,versatile-sysreg.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm Versatile system registers
+
+maintainers:
+ - Linus Walleij <linus.walleij@linaro.org>
+
+description:
+ This is a system control register block, providing multiple low-level
+ platform functions like board detection and identification, software
+ interrupt generation, MMC and NOR Flash control, etc.
+
+properties:
+ compatible:
+ items:
+ - const: arm,versatile-sysreg
+ - const: syscon
+ - const: simple-mfd
+
+ reg:
+ maxItems: 1
+
+ panel:
+ type: object
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+...
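This binding ships without an examples section. A minimal conforming node might look as follows; the address and size are assumptions based on the usual Versatile sysreg base, so treat them as placeholders:

    sysreg@10000000 {
        compatible = "arm,versatile-sysreg", "syscon", "simple-mfd";
        reg = <0x10000000 0x3000>;
    };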
diff --git a/Documentation/devicetree/bindings/arm/arm,versatile.yaml b/Documentation/devicetree/bindings/arm/arm,versatile.yaml
index 34b437c72751..13e52ba92060 100644
--- a/Documentation/devicetree/bindings/arm/arm,versatile.yaml
+++ b/Documentation/devicetree/bindings/arm/arm,versatile.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/arm,versatile.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ARM Versatile Boards Device Tree Bindings
+title: ARM Versatile Boards
maintainers:
- Linus Walleij <linus.walleij@linaro.org>
diff --git a/Documentation/devicetree/bindings/arm/arm,vexpress-juno.yaml b/Documentation/devicetree/bindings/arm/arm,vexpress-juno.yaml
index 55ef656d1192..09c319f803ba 100644
--- a/Documentation/devicetree/bindings/arm/arm,vexpress-juno.yaml
+++ b/Documentation/devicetree/bindings/arm/arm,vexpress-juno.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/arm,vexpress-juno.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ARM Versatile Express and Juno Boards Device Tree Bindings
+title: ARM Versatile Express and Juno Boards
maintainers:
- Sudeep Holla <sudeep.holla@arm.com>
@@ -119,22 +119,6 @@ properties:
- const: arm,foundation-aarch64
- const: arm,vexpress
- arm,hbi:
- $ref: '/schemas/types.yaml#/definitions/uint32'
- description: This indicates the ARM HBI (Hardware Board ID), this is
- ARM's unique board model ID, visible on the PCB's silkscreen.
-
- arm,vexpress,site:
- description: As Versatile Express can be configured in number of physically
- different setups, the device tree should describe platform topology.
- For this reason the root node and main motherboard node must define this
- property, describing the physical location of the children nodes.
- 0 means motherboard site, while 1 and 2 are daughterboard sites, and
- 0xf means "sisterboard" which is the site containing the main CPU tile.
- $ref: '/schemas/types.yaml#/definitions/uint32'
- minimum: 0
- maximum: 15
-
arm,vexpress,position:
description: When daughterboards are stacked on one site, their position
in the stack can be described by this attribute.
@@ -154,12 +138,13 @@ patternProperties:
description: Static Memory Bus (SMB) node, if this exists it describes
the connection between the motherboard and any tiles. Sometimes the
compatible is placed directly under this node, sometimes it is placed
- in a subnode named "motherboard". Sometimes the compatible includes
+ in a subnode named "motherboard-bus". Sometimes the compatible includes
"arm,vexpress,v2?-p1" sometimes (on software models) is is just
- "simple-bus". If the compatible is placed in the "motherboard" node,
+ "simple-bus". If the compatible is placed in the "motherboard-bus" node,
it is stricter and always has two compatibles.
type: object
$ref: '/schemas/simple-bus.yaml'
+ unevaluatedProperties: false
properties:
compatible:
@@ -170,7 +155,9 @@ patternProperties:
- arm,vexpress,v2p-p1
- const: simple-bus
- const: simple-bus
- motherboard:
+
+ patternProperties:
+ '^motherboard-bus@':
type: object
description: The motherboard description provides a single "motherboard"
node using 2 address cells corresponding to the Static Memory Bus
@@ -183,6 +170,8 @@ patternProperties:
const: 2
"#size-cells":
const: 1
+ ranges: true
+
compatible:
items:
- enum:
@@ -196,8 +185,28 @@ patternProperties:
- rs1
- rs2
+ arm,hbi:
+ $ref: '/schemas/types.yaml#/definitions/uint32'
+ description: This indicates the ARM HBI (Hardware Board ID), this is
+ ARM's unique board model ID, visible on the PCB's silkscreen.
+
+ arm,vexpress,site:
+ description: As Versatile Express can be configured in number of physically
+ different setups, the device tree should describe platform topology.
+ For this reason the root node and main motherboard node must define this
+ property, describing the physical location of the children nodes.
+ 0 means motherboard site, while 1 and 2 are daughterboard sites, and
+ 0xf means "sisterboard" which is the site containing the main CPU tile.
+ $ref: '/schemas/types.yaml#/definitions/uint32'
+ minimum: 0
+ maximum: 15
+
required:
- compatible
+
+ additionalProperties:
+ type: object
+
required:
- compatible
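The hunk above relocates arm,hbi and arm,vexpress,site from the root level into the motherboard-bus pattern node. A rough sketch of the resulting layout, with placeholder values (the 0x190 HBI and node addresses are shown purely as examples):

    smb@8000000 {
        compatible = "simple-bus";
        #address-cells = <2>;
        #size-cells = <1>;
        /* ranges omitted */

        motherboard-bus@0 {
            compatible = "arm,vexpress,v2m-p1", "simple-bus";
            #address-cells = <2>;
            #size-cells = <1>;
            arm,hbi = <0x190>;       /* placeholder board model ID */
            arm,vexpress,site = <0>; /* motherboard site */
            /* ranges and child devices omitted */
        };
    };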
diff --git a/Documentation/devicetree/bindings/arm/arm-dsu-pmu.txt b/Documentation/devicetree/bindings/arm/arm-dsu-pmu.txt
deleted file mode 100644
index 6efabba530f1..000000000000
--- a/Documentation/devicetree/bindings/arm/arm-dsu-pmu.txt
+++ /dev/null
@@ -1,27 +0,0 @@
-* ARM DynamIQ Shared Unit (DSU) Performance Monitor Unit (PMU)
-
-ARM DyanmIQ Shared Unit (DSU) integrates one or more CPU cores
-with a shared L3 memory system, control logic and external interfaces to
-form a multicore cluster. The PMU enables to gather various statistics on
-the operations of the DSU. The PMU provides independent 32bit counters that
-can count any of the supported events, along with a 64bit cycle counter.
-The PMU is accessed via CPU system registers and has no MMIO component.
-
-** DSU PMU required properties:
-
-- compatible : should be one of :
-
- "arm,dsu-pmu"
-
-- interrupts : Exactly 1 SPI must be listed.
-
-- cpus : List of phandles for the CPUs connected to this DSU instance.
-
-
-** Example:
-
-dsu-pmu-0 {
- compatible = "arm,dsu-pmu";
- interrupts = <GIC_SPI 02 IRQ_TYPE_LEVEL_HIGH>;
- cpus = <&cpu_0>, <&cpu_1>;
-};
diff --git a/Documentation/devicetree/bindings/arm/aspeed/aspeed,sbc.yaml b/Documentation/devicetree/bindings/arm/aspeed/aspeed,sbc.yaml
new file mode 100644
index 000000000000..c72aab706484
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/aspeed/aspeed,sbc.yaml
@@ -0,0 +1,37 @@
+# SPDX-License-Identifier: (GPL-2.0-or-later OR BSD-2-Clause)
+# Copyright 2021 Joel Stanley, IBM Corp.
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/arm/aspeed/aspeed,sbc.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: ASPEED Secure Boot Controller
+
+maintainers:
+ - Joel Stanley <joel@jms.id.au>
+ - Andrew Jeffery <andrew@aj.id.au>
+
+description: |
+ The ASPEED SoCs have a register bank for interacting with the secure boot
+ controller.
+
+properties:
+ compatible:
+ items:
+ - const: aspeed,ast2600-sbc
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ sbc: secure-boot-controller@1e6f2000 {
+ compatible = "aspeed,ast2600-sbc";
+ reg = <0x1e6f2000 0x1000>;
+ };
diff --git a/Documentation/devicetree/bindings/arm/aspeed/aspeed.yaml b/Documentation/devicetree/bindings/arm/aspeed/aspeed.yaml
new file mode 100644
index 000000000000..e0eff4c05879
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/aspeed/aspeed.yaml
@@ -0,0 +1,92 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/aspeed/aspeed.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Aspeed SoC based boards
+
+maintainers:
+ - Joel Stanley <joel@jms.id.au>
+
+properties:
+ $nodename:
+ const: '/'
+ compatible:
+ oneOf:
+ - description: AST2400 based boards
+ items:
+ - enum:
+ - delta,ahe50dc-bmc
+ - facebook,galaxy100-bmc
+ - facebook,wedge100-bmc
+ - facebook,wedge40-bmc
+ - microsoft,olympus-bmc
+ - quanta,q71l-bmc
+ - tyan,palmetto-bmc
+ - yadro,vesnin-bmc
+ - const: aspeed,ast2400
+
+ - description: AST2500 based boards
+ items:
+ - enum:
+ - amd,daytonax-bmc
+ - amd,ethanolx-bmc
+ - ampere,mtjade-bmc
+ - aspeed,ast2500-evb
+ - asrock,e3c246d4i-bmc
+ - asrock,romed8hm3-bmc
+ - bytedance,g220a-bmc
+ - facebook,cmm-bmc
+ - facebook,minipack-bmc
+ - facebook,tiogapass-bmc
+ - facebook,yamp-bmc
+ - facebook,yosemitev2-bmc
+ - facebook,wedge400-bmc
+ - hxt,stardragon4800-rep2-bmc
+ - ibm,mihawk-bmc
+ - ibm,mowgli-bmc
+ - ibm,romulus-bmc
+ - ibm,swift-bmc
+ - ibm,witherspoon-bmc
+ - ingrasys,zaius-bmc
+ - inspur,fp5280g2-bmc
+ - inspur,nf5280m6-bmc
+ - inspur,on5263m5-bmc
+ - intel,s2600wf-bmc
+ - inventec,lanyang-bmc
+ - lenovo,hr630-bmc
+ - lenovo,hr855xg2-bmc
+ - portwell,neptune-bmc
+ - qcom,centriq2400-rep-bmc
+ - supermicro,x11spi-bmc
+ - tyan,s7106-bmc
+ - tyan,s8036-bmc
+ - yadro,nicole-bmc
+ - yadro,vegman-n110-bmc
+ - yadro,vegman-rx20-bmc
+ - yadro,vegman-sx20-bmc
+ - const: aspeed,ast2500
+
+ - description: AST2600 based boards
+ items:
+ - enum:
+ - ampere,mtmitchell-bmc
+ - aspeed,ast2600-evb
+ - aspeed,ast2600-evb-a1
+ - facebook,bletchley-bmc
+ - facebook,cloudripper-bmc
+ - facebook,elbert-bmc
+ - facebook,fuji-bmc
+ - facebook,greatlakes-bmc
+ - ibm,everest-bmc
+ - ibm,rainier-bmc
+ - ibm,tacoma-bmc
+ - inventec,transformer-bmc
+ - jabil,rbp-bmc
+ - qcom,dc-scm-v1-bmc
+ - quanta,s6q-bmc
+ - ufispace,ncplite-bmc
+ - const: aspeed,ast2600
+
+additionalProperties: true
diff --git a/Documentation/devicetree/bindings/arm/atmel-at91.yaml b/Documentation/devicetree/bindings/arm/atmel-at91.yaml
index fba071b9af1d..dfb8fd089197 100644
--- a/Documentation/devicetree/bindings/arm/atmel-at91.yaml
+++ b/Documentation/devicetree/bindings/arm/atmel-at91.yaml
@@ -4,11 +4,12 @@
$id: http://devicetree.org/schemas/arm/atmel-at91.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Atmel AT91 device tree bindings.
+title: Atmel AT91
maintainers:
- Alexandre Belloni <alexandre.belloni@bootlin.com>
- - Ludovic Desroches <ludovic.desroches@microchip.com>
+ - Claudiu Beznea <claudiu.beznea@microchip.com>
+ - Nicolas Ferre <nicolas.ferre@microchip.com>
description: |
Boards with a SoC of the Atmel AT91 or SMART family shall have the following
@@ -90,9 +91,11 @@ properties:
- const: atmel,sama5d2
- const: atmel,sama5
- - description: SAM9X60-EK board
+ - description: Microchip SAM9X60 Evaluation Boards
items:
- - const: microchip,sam9x60ek
+ - enum:
+ - microchip,sam9x60ek
+ - microchip,sam9x60-curiosity
- const: microchip,sam9x60
- const: atmel,at91sam9
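The compatible lists in this schema are ordered most specific first. A board DTS root node chains them accordingly, for example (the model string is illustrative):

    / {
        model = "Microchip SAM9X60 Curiosity";
        compatible = "microchip,sam9x60-curiosity", "microchip,sam9x60",
                     "atmel,at91sam9";
    };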
@@ -126,6 +129,25 @@ properties:
- const: atmel,sama5d3
- const: atmel,sama5
+ - description: Microchip SAMA5D3 Ethernet Development System Board
+ items:
+ - const: microchip,sama5d3-eds
+ - const: atmel,sama5d36
+ - const: atmel,sama5d3
+ - const: atmel,sama5
+
+ - description: CalAmp LMU5000 board
+ items:
+ - const: calamp,lmu5000
+ - const: atmel,at91sam9g20
+ - const: atmel,at91sam9
+
+ - description: Exegin Q5xR5 board
+ items:
+ - const: exegin,q5xr5
+ - const: atmel,at91sam9g20
+ - const: atmel,at91sam9
+
- items:
- enum:
- atmel,sama5d31
@@ -150,6 +172,29 @@ properties:
- const: microchip,sama7g5
- const: microchip,sama7
+ - description: Microchip LAN9662 Evaluation Boards.
+ items:
+ - enum:
+ - microchip,lan9662-pcb8291
+ - microchip,lan9662-pcb8309
+ - const: microchip,lan9662
+ - const: microchip,lan966
+
+ - description: Microchip LAN9668 PCB8290 Evaluation Board.
+ items:
+ - const: microchip,lan9668-pcb8290
+ - const: microchip,lan9668
+ - const: microchip,lan966
+
+ - description: Kontron KSwitch D10 MMT series
+ items:
+ - enum:
+ - kontron,kswitch-d10-mmt-8g
+ - kontron,kswitch-d10-mmt-6g-2gs
+ - const: kontron,s1921
+ - const: microchip,lan9668
+ - const: microchip,lan966
+
- items:
- enum:
- atmel,sams70j19
diff --git a/Documentation/devicetree/bindings/arm/atmel-sysregs.txt b/Documentation/devicetree/bindings/arm/atmel-sysregs.txt
index 16eef600d599..ab1b352344ae 100644
--- a/Documentation/devicetree/bindings/arm/atmel-sysregs.txt
+++ b/Documentation/devicetree/bindings/arm/atmel-sysregs.txt
@@ -25,21 +25,6 @@ System Timer (ST) required properties:
Its subnodes can be:
- watchdog: compatible should be "atmel,at91rm9200-wdt"
-RSTC Reset Controller required properties:
-- compatible: Should be "atmel,<chip>-rstc".
- <chip> can be "at91sam9260", "at91sam9g45", "sama5d3" or "samx7"
- it also can be "microchip,sam9x60-rstc"
-- reg: Should contain registers location and length
-- clocks: phandle to input clock.
-
-Example:
-
- rstc@fffffd00 {
- compatible = "atmel,at91sam9260-rstc";
- reg = <0xfffffd00 0x10>;
- clocks = <&clk32k>;
- };
-
RAMC SDRAM/DDR Controller required properties:
- compatible: Should be "atmel,at91rm9200-sdramc", "syscon"
"atmel,at91sam9260-sdramc",
diff --git a/Documentation/devicetree/bindings/arm/axxia.yaml b/Documentation/devicetree/bindings/arm/axxia.yaml
index e0d2bb71cf50..d60907e43efc 100644
--- a/Documentation/devicetree/bindings/arm/axxia.yaml
+++ b/Documentation/devicetree/bindings/arm/axxia.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/axxia.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Axxia AXM55xx device tree bindings
+title: Axxia AXM55xx
maintainers:
- Anders Berg <anders.berg@lsi.com>
diff --git a/Documentation/devicetree/bindings/arm/bcm/bcm2835.yaml b/Documentation/devicetree/bindings/arm/bcm/bcm2835.yaml
index 230b80d9d6cf..162a39dab218 100644
--- a/Documentation/devicetree/bindings/arm/bcm/bcm2835.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/bcm2835.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bcm/bcm2835.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom BCM2711/BCM2835 Platforms Device Tree Bindings
+title: Broadcom BCM2711/BCM2835 Platforms
maintainers:
- Eric Anholt <eric@anholt.net>
@@ -19,6 +19,7 @@ properties:
items:
- enum:
- raspberrypi,400
+ - raspberrypi,4-compute-module
- raspberrypi,4-model-b
- const: brcm,bcm2711
@@ -50,6 +51,7 @@ properties:
- raspberrypi,3-model-b-plus
- raspberrypi,3-compute-module
- raspberrypi,3-compute-module-lite
+ - raspberrypi,model-zero-2-w
- const: brcm,bcm2837
additionalProperties: true
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm11351.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm11351.yaml
index c60324357435..f2bcac0096b7 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm11351.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm11351.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bcm/brcm,bcm11351.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom BCM11351 device tree bindings
+title: Broadcom BCM11351
maintainers:
- Florian Fainelli <f.fainelli@gmail.com>
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm21664.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm21664.yaml
index b3020757380f..cf4e254e32f1 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm21664.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm21664.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bcm/brcm,bcm21664.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom BCM21664 device tree bindings
+title: Broadcom BCM21664
maintainers:
- Florian Fainelli <f.fainelli@gmail.com>
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm23550.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm23550.yaml
index 37f3a6fcde76..eafec29ba7ab 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm23550.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm23550.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bcm/brcm,bcm23550.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom BCM23550 device tree bindings
+title: Broadcom BCM23550
maintainers:
- Florian Fainelli <f.fainelli@gmail.com>
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm4708.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm4708.yaml
index 434d3c6db61e..454b0e93245d 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm4708.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm4708.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bcm/brcm,bcm4708.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom BCM4708 device tree bindings
+title: Broadcom BCM4708
description:
Broadcom BCM4708/47081/4709/47094/53012 Wi-Fi/network SoCs based
@@ -64,7 +64,9 @@ properties:
- description: BCM47094 based boards
items:
- enum:
+ - asus,rt-ac88u
- dlink,dir-885l
+ - dlink,dir-890l
- linksys,panamera
- luxul,abr-4500-v1
- luxul,xap-1610-v1
@@ -83,9 +85,22 @@ properties:
- brcm,bcm953012er
- brcm,bcm953012hr
- brcm,bcm953012k
+ - const: brcm,bcm53012
+ - const: brcm,bcm4708
+
+ - description: BCM53015 based boards
+ items:
+ - enum:
+ - meraki,mr26
+ - const: brcm,bcm53015
+ - const: brcm,bcm4708
+
+ - description: BCM53016 based boards
+ items:
+ - enum:
+ - dlink,dwl-8610ap
- meraki,mr32
- - const: brcm,brcm53012
- - const: brcm,brcm53016
+ - const: brcm,bcm53016
- const: brcm,bcm4708
additionalProperties: true
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm4908.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm4908.yaml
deleted file mode 100644
index 2cd4e4a32278..000000000000
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm4908.yaml
+++ /dev/null
@@ -1,41 +0,0 @@
-# SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
-%YAML 1.2
----
-$id: http://devicetree.org/schemas/arm/bcm/brcm,bcm4908.yaml#
-$schema: http://devicetree.org/meta-schemas/core.yaml#
-
-title: Broadcom BCM4908 device tree bindings
-
-description:
- Broadcom BCM4906 / BCM4908 / BCM49408 Wi-Fi/network SoCs with Brahma CPUs.
-
-maintainers:
- - Rafał Miłecki <rafal@milecki.pl>
-
-properties:
- $nodename:
- const: '/'
- compatible:
- oneOf:
- - description: BCM4906 based boards
- items:
- - enum:
- - netgear,r8000p
- - tplink,archer-c2300-v1
- - const: brcm,bcm4906
- - const: brcm,bcm4908
-
- - description: BCM4908 based boards
- items:
- - enum:
- - asus,gt-ac5300
- - const: brcm,bcm4908
-
- - description: BCM49408 based boards
- items:
- - const: brcm,bcm49408
- - const: brcm,bcm4908
-
-additionalProperties: true
-
-...
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm63138.txt b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm63138.txt
index 8c7a4908a849..a8866c6e9d46 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,bcm63138.txt
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,bcm63138.txt
@@ -30,7 +30,7 @@ Example:
cpus {
cpu@0 {
- compatible = "arm,cotex-a9";
+ compatible = "arm,cortex-a9";
reg = <0>;
...
enable-method = "brcm,bcm63138";
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,bcmbca.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,bcmbca.yaml
new file mode 100644
index 000000000000..07892cbdd23c
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,bcmbca.yaml
@@ -0,0 +1,151 @@
+# SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/bcm/brcm,bcmbca.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Broadcom Broadband SoC
+
+description:
+ Broadcom Broadband SoCs include a family of high-performance DSL/PON/Wireless
+ chips that can be used as home gateways, routers and WLAN APs for residential,
+ enterprise and carrier applications.
+
+maintainers:
+ - William Zhang <william.zhang@broadcom.com>
+ - Anand Gore <anand.gore@broadcom.com>
+ - Kursad Oney <kursad.oney@broadcom.com>
+ - Rafał Miłecki <rafal@milecki.pl>
+
+properties:
+ $nodename:
+ const: '/'
+ compatible:
+ oneOf:
+ - description: BCM47622 based boards
+ items:
+ - enum:
+ - brcm,bcm947622
+ - const: brcm,bcm47622
+ - const: brcm,bcmbca
+
+ - description: BCM4906 based boards
+ items:
+ - enum:
+ - netgear,r8000p
+ - tplink,archer-c2300-v1
+ - const: brcm,bcm4906
+ - const: brcm,bcm4908
+ - const: brcm,bcmbca
+
+ - description: BCM4908 based boards
+ items:
+ - enum:
+ - asus,gt-ac5300
+ - brcm,bcm94908
+ - netgear,raxe500
+ - const: brcm,bcm4908
+ - const: brcm,bcmbca
+
+ - description: BCM49408 based boards
+ items:
+ - const: brcm,bcm49408
+ - const: brcm,bcm4908
+ - const: brcm,bcmbca
+
+ - description: BCM4912 based boards
+ items:
+ - enum:
+ - asus,gt-ax6000
+ - brcm,bcm94912
+ - const: brcm,bcm4912
+ - const: brcm,bcmbca
+
+ - description: BCM63138 based boards
+ items:
+ - enum:
+ - brcm,bcm963138
+ - brcm,BCM963138DVT
+ - const: brcm,bcm63138
+ - const: brcm,bcmbca
+
+ - description: BCM63146 based boards
+ items:
+ - enum:
+ - brcm,bcm963146
+ - const: brcm,bcm63146
+ - const: brcm,bcmbca
+
+ - description: BCM63148 based boards
+ items:
+ - enum:
+ - brcm,bcm963148
+ - const: brcm,bcm63148
+ - const: brcm,bcmbca
+
+ - description: BCM63158 based boards
+ items:
+ - enum:
+ - brcm,bcm963158
+ - const: brcm,bcm63158
+ - const: brcm,bcmbca
+
+ - description: BCM63178 based boards
+ items:
+ - enum:
+ - brcm,bcm963178
+ - const: brcm,bcm63178
+ - const: brcm,bcmbca
+
+ - description: BCM6756 based boards
+ items:
+ - enum:
+ - brcm,bcm96756
+ - const: brcm,bcm6756
+ - const: brcm,bcmbca
+
+ - description: BCM6813 based boards
+ items:
+ - enum:
+ - brcm,bcm96813
+ - const: brcm,bcm6813
+ - const: brcm,bcmbca
+
+ - description: BCM6846 based boards
+ items:
+ - enum:
+ - brcm,bcm96846
+ - const: brcm,bcm6846
+ - const: brcm,bcmbca
+
+ - description: BCM6855 based boards
+ items:
+ - enum:
+ - brcm,bcm96855
+ - const: brcm,bcm6855
+ - const: brcm,bcmbca
+
+ - description: BCM6856 based boards
+ items:
+ - enum:
+ - brcm,bcm96856
+ - const: brcm,bcm6856
+ - const: brcm,bcmbca
+
+ - description: BCM6858 based boards
+ items:
+ - enum:
+ - brcm,bcm96858
+ - const: brcm,bcm6858
+ - const: brcm,bcmbca
+
+ - description: BCM6878 based boards
+ items:
+ - enum:
+ - brcm,bcm96878
+ - const: brcm,bcm6878
+ - const: brcm,bcmbca
+
+additionalProperties: true
+
+...
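For illustration, boards based on these SoCs chain their compatibles from board
to SoC to the "brcm,bcmbca" family fallback, as the schema above requires. A
minimal sketch of a root node (the board string is taken from the BCM47622 enum
above; the model string is hypothetical):

	/ {
		compatible = "brcm,bcm947622", "brcm,bcm47622", "brcm,bcmbca";
		model = "Broadcom BCM947622 reference board";
		...
	};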
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,brcmstb.txt b/Documentation/devicetree/bindings/arm/bcm/brcm,brcmstb.txt
index 104cc9b41df4..071421dbc4d0 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,brcmstb.txt
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,brcmstb.txt
@@ -187,15 +187,8 @@ Required properties:
Sequencer DRAM parameters and control registers. Used for Self-Refresh
Power-Down (SRPD), among other things.
-Required properties:
-- compatible : should contain one of these
- "brcm,brcmstb-memc-ddr-rev-b.2.1"
- "brcm,brcmstb-memc-ddr-rev-b.2.2"
- "brcm,brcmstb-memc-ddr-rev-b.2.3"
- "brcm,brcmstb-memc-ddr-rev-b.3.0"
- "brcm,brcmstb-memc-ddr-rev-b.3.1"
- "brcm,brcmstb-memc-ddr"
-- reg : the MEMC DDR register range
+See Documentation/devicetree/bindings/memory-controllers/brcm,brcmstb-memc-ddr.yaml for a
+full list of supported compatible strings and properties.
Example:
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,cygnus.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,cygnus.yaml
index 432ccf990f9e..a0a3f32db54e 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,cygnus.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,cygnus.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bcm/brcm,cygnus.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom Cygnus device tree bindings
+title: Broadcom Cygnus
maintainers:
- Ray Jui <rjui@broadcom.com>
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,hr2.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,hr2.yaml
index 294948399f82..cc6add0e933a 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,hr2.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,hr2.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bcm/brcm,hr2.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom Hurricane 2 device tree bindings
+title: Broadcom Hurricane 2
description:
Broadcom Hurricane 2 family of SoCs are used for switching control. These SoCs
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,ns2.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,ns2.yaml
index c4847abbecd8..6696598eca0e 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,ns2.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,ns2.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bcm/brcm,ns2.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom North Star 2 (NS2) device tree bindings
+title: Broadcom North Star 2 (NS2)
maintainers:
- Ray Jui <rjui@broadcom.com>
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,nsp.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,nsp.yaml
index 476bc23a7f75..a43b2d4d936b 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,nsp.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,nsp.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bcm/brcm,nsp.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom Northstar Plus device tree bindings
+title: Broadcom Northstar Plus
description:
Broadcom Northstar Plus family of SoCs are used for switching control
@@ -22,16 +22,61 @@ properties:
$nodename:
const: '/'
compatible:
- items:
- - enum:
- - brcm,bcm58522
- - brcm,bcm58525
- - brcm,bcm58535
- - brcm,bcm58622
- - brcm,bcm58623
- - brcm,bcm58625
- - brcm,bcm88312
- - const: brcm,nsp
+ oneOf:
+ - description: BCM58522 based boards
+ items:
+ - enum:
+ - brcm,bcm958522er
+ - const: brcm,bcm58522
+ - const: brcm,nsp
+
+ - description: BCM58525 based boards
+ items:
+ - enum:
+ - brcm,bcm958525er
+ - brcm,bcm958525xmc
+ - const: brcm,bcm58525
+ - const: brcm,nsp
+
+ - description: BCM58535 based boards
+ items:
+ - const: brcm,bcm58535
+ - const: brcm,nsp
+
+ - description: BCM58622 based boards
+ items:
+ - enum:
+ - brcm,bcm958622hr
+ - const: brcm,bcm58622
+ - const: brcm,nsp
+
+ - description: BCM58623 based boards
+ items:
+ - enum:
+ - brcm,bcm958623hr
+ - const: brcm,bcm58623
+ - const: brcm,nsp
+
+ - description: BCM58625 based boards
+ items:
+ - enum:
+ - brcm,bcm958625hr
+ - brcm,bcm958625k
+ - meraki,mx64
+ - meraki,mx64-a0
+ - meraki,mx64w
+ - meraki,mx64w-a0
+ - meraki,mx65
+ - meraki,mx65w
+ - const: brcm,bcm58625
+ - const: brcm,nsp
+
+ - description: BCM88312 based boards
+ items:
+ - enum:
+ - brcm,bcm988312hr
+ - const: brcm,bcm88312
+ - const: brcm,nsp
additionalProperties: true
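Following the same board-to-SoC-to-family pattern, a hedged sketch for one of
the BCM58625 entries above (node contents elided):

	/ {
		compatible = "meraki,mx64", "brcm,bcm58625", "brcm,nsp";
		...
	};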
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,stingray.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,stingray.yaml
index c638e04ebae0..c6ccb78aab0a 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,stingray.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,stingray.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bcm/brcm,stingray.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom Stingray device tree bindings
+title: Broadcom Stingray
maintainers:
- Ray Jui <rjui@broadcom.com>
diff --git a/Documentation/devicetree/bindings/arm/bcm/brcm,vulcan-soc.yaml b/Documentation/devicetree/bindings/arm/bcm/brcm,vulcan-soc.yaml
index 4eba182abd53..3f441352fbf0 100644
--- a/Documentation/devicetree/bindings/arm/bcm/brcm,vulcan-soc.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/brcm,vulcan-soc.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bcm/brcm,vulcan-soc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom Vulcan device tree bindings
+title: Broadcom Vulcan
maintainers:
- Robert Richter <rrichter@marvell.com>
diff --git a/Documentation/devicetree/bindings/arm/bcm/raspberrypi,bcm2835-firmware.yaml b/Documentation/devicetree/bindings/arm/bcm/raspberrypi,bcm2835-firmware.yaml
index b369b374fc4a..39e3c248f5b7 100644
--- a/Documentation/devicetree/bindings/arm/bcm/raspberrypi,bcm2835-firmware.yaml
+++ b/Documentation/devicetree/bindings/arm/bcm/raspberrypi,bcm2835-firmware.yaml
@@ -30,6 +30,7 @@ properties:
clocks:
type: object
+ additionalProperties: false
properties:
compatible:
@@ -47,6 +48,7 @@ properties:
reset:
type: object
+ additionalProperties: false
properties:
compatible:
@@ -63,6 +65,7 @@ properties:
pwm:
type: object
+ additionalProperties: false
properties:
compatible:
@@ -76,8 +79,6 @@ properties:
- compatible
- "#pwm-cells"
- additionalProperties: false
-
required:
- compatible
- mboxes
diff --git a/Documentation/devicetree/bindings/arm/bitmain.yaml b/Documentation/devicetree/bindings/arm/bitmain.yaml
index 90ba02be48ce..55a5a570b5bc 100644
--- a/Documentation/devicetree/bindings/arm/bitmain.yaml
+++ b/Documentation/devicetree/bindings/arm/bitmain.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/bitmain.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Bitmain platform device tree bindings
+title: Bitmain platform
maintainers:
- Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
diff --git a/Documentation/devicetree/bindings/arm/calxeda.yaml b/Documentation/devicetree/bindings/arm/calxeda.yaml
index 46f78addebb0..3e9f5e1d862e 100644
--- a/Documentation/devicetree/bindings/arm/calxeda.yaml
+++ b/Documentation/devicetree/bindings/arm/calxeda.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/calxeda.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Calxeda Platforms Device Tree Bindings
+title: Calxeda Platforms
maintainers:
- Rob Herring <robh@kernel.org>
diff --git a/Documentation/devicetree/bindings/arm/cci-control-port.yaml b/Documentation/devicetree/bindings/arm/cci-control-port.yaml
new file mode 100644
index 000000000000..c29d250a6d77
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/cci-control-port.yaml
@@ -0,0 +1,38 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/cci-control-port.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: CCI Interconnect Bus Masters
+
+maintainers:
+ - Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
+
+description: |
+ Masters in the device tree connected to a CCI port (inclusive of CPUs
+ and their cpu nodes).
+
+select: true
+
+properties:
+ cci-control-port:
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+additionalProperties: true
+
+examples:
+ - |
+ cpus {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ cpu@0 {
+ compatible = "arm,cortex-a15";
+ device_type = "cpu";
+ cci-control-port = <&cci_control1>;
+ reg = <0>;
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/arm/cci.txt b/Documentation/devicetree/bindings/arm/cci.txt
deleted file mode 100644
index 9600761f2d5b..000000000000
--- a/Documentation/devicetree/bindings/arm/cci.txt
+++ /dev/null
@@ -1,224 +0,0 @@
-=======================================================
-ARM CCI cache coherent interconnect binding description
-=======================================================
-
-ARM multi-cluster systems maintain intra-cluster coherency through a
-cache coherent interconnect (CCI) that is capable of monitoring bus
-transactions and manage coherency, TLB invalidations and memory barriers.
-
-It allows snooping and distributed virtual memory message broadcast across
-clusters, through memory mapped interface, with a global control register
-space and multiple sets of interface control registers, one per slave
-interface.
-
-* CCI interconnect node
-
- Description: Describes a CCI cache coherent Interconnect component
-
- Node name must be "cci".
- Node's parent must be the root node /, and the address space visible
- through the CCI interconnect is the same as the one seen from the
- root node (ie from CPUs perspective as per DT standard).
- Every CCI node has to define the following properties:
-
- - compatible
- Usage: required
- Value type: <string>
- Definition: must contain one of the following:
- "arm,cci-400"
- "arm,cci-500"
- "arm,cci-550"
-
- - reg
- Usage: required
- Value type: Integer cells. A register entry, expressed as a pair
- of cells, containing base and size.
- Definition: A standard property. Specifies base physical
- address of CCI control registers common to all
- interfaces.
-
- - ranges:
- Usage: required
- Value type: Integer cells. An array of range entries, expressed
- as a tuple of cells, containing child address,
- parent address and the size of the region in the
- child address space.
- Definition: A standard property. Follow rules in the Devicetree
- Specification for hierarchical bus addressing. CCI
- interfaces addresses refer to the parent node
- addressing scheme to declare their register bases.
-
- CCI interconnect node can define the following child nodes:
-
- - CCI control interface nodes
-
- Node name must be "slave-if".
- Parent node must be CCI interconnect node.
-
- A CCI control interface node must contain the following
- properties:
-
- - compatible
- Usage: required
- Value type: <string>
- Definition: must be set to
- "arm,cci-400-ctrl-if"
-
- - interface-type:
- Usage: required
- Value type: <string>
- Definition: must be set to one of {"ace", "ace-lite"}
- depending on the interface type the node
- represents.
-
- - reg:
- Usage: required
- Value type: Integer cells. A register entry, expressed
- as a pair of cells, containing base and
- size.
- Definition: the base address and size of the
- corresponding interface programming
- registers.
-
- - CCI PMU node
-
- Parent node must be CCI interconnect node.
-
- A CCI pmu node must contain the following properties:
-
- - compatible
- Usage: required
- Value type: <string>
- Definition: Must contain one of:
- "arm,cci-400-pmu,r0"
- "arm,cci-400-pmu,r1"
- "arm,cci-400-pmu" - DEPRECATED, permitted only where OS has
- secure access to CCI registers
- "arm,cci-500-pmu,r0"
- "arm,cci-550-pmu,r0"
- - reg:
- Usage: required
- Value type: Integer cells. A register entry, expressed
- as a pair of cells, containing base and
- size.
- Definition: the base address and size of the
- corresponding interface programming
- registers.
-
- - interrupts:
- Usage: required
- Value type: Integer cells. Array of interrupt specifier
- entries, as defined in
- ../interrupt-controller/interrupts.txt.
- Definition: list of counter overflow interrupts, one per
- counter. The interrupts must be specified
- starting with the cycle counter overflow
- interrupt, followed by counter0 overflow
- interrupt, counter1 overflow interrupt,...
- ,counterN overflow interrupt.
-
- The CCI PMU has an interrupt signal for each
- counter. The number of interrupts must be
- equal to the number of counters.
-
-* CCI interconnect bus masters
-
- Description: masters in the device tree connected to a CCI port
- (inclusive of CPUs and their cpu nodes).
-
- A CCI interconnect bus master node must contain the following
- properties:
-
- - cci-control-port:
- Usage: required
- Value type: <phandle>
- Definition: a phandle containing the CCI control interface node
- the master is connected to.
-
-Example:
-
- cpus {
- #size-cells = <0>;
- #address-cells = <1>;
-
- CPU0: cpu@0 {
- device_type = "cpu";
- compatible = "arm,cortex-a15";
- cci-control-port = <&cci_control1>;
- reg = <0x0>;
- };
-
- CPU1: cpu@1 {
- device_type = "cpu";
- compatible = "arm,cortex-a15";
- cci-control-port = <&cci_control1>;
- reg = <0x1>;
- };
-
- CPU2: cpu@100 {
- device_type = "cpu";
- compatible = "arm,cortex-a7";
- cci-control-port = <&cci_control2>;
- reg = <0x100>;
- };
-
- CPU3: cpu@101 {
- device_type = "cpu";
- compatible = "arm,cortex-a7";
- cci-control-port = <&cci_control2>;
- reg = <0x101>;
- };
-
- };
-
- dma0: dma@3000000 {
- compatible = "arm,pl330", "arm,primecell";
- cci-control-port = <&cci_control0>;
- reg = <0x0 0x3000000 0x0 0x1000>;
- interrupts = <10>;
- #dma-cells = <1>;
- #dma-channels = <8>;
- #dma-requests = <32>;
- };
-
- cci@2c090000 {
- compatible = "arm,cci-400";
- #address-cells = <1>;
- #size-cells = <1>;
- reg = <0x0 0x2c090000 0 0x1000>;
- ranges = <0x0 0x0 0x2c090000 0x10000>;
-
- cci_control0: slave-if@1000 {
- compatible = "arm,cci-400-ctrl-if";
- interface-type = "ace-lite";
- reg = <0x1000 0x1000>;
- };
-
- cci_control1: slave-if@4000 {
- compatible = "arm,cci-400-ctrl-if";
- interface-type = "ace";
- reg = <0x4000 0x1000>;
- };
-
- cci_control2: slave-if@5000 {
- compatible = "arm,cci-400-ctrl-if";
- interface-type = "ace";
- reg = <0x5000 0x1000>;
- };
-
- pmu@9000 {
- compatible = "arm,cci-400-pmu";
- reg = <0x9000 0x5000>;
- interrupts = <0 101 4>,
- <0 102 4>,
- <0 103 4>,
- <0 104 4>,
- <0 105 4>;
- };
- };
-
-This CCI node corresponds to a CCI component whose control registers sits
-at address 0x000000002c090000.
-CCI slave interface @0x000000002c091000 is connected to dma controller dma0.
-CCI slave interface @0x000000002c094000 is connected to CPUs {CPU0, CPU1};
-CCI slave interface @0x000000002c095000 is connected to CPUs {CPU2, CPU3};
diff --git a/Documentation/devicetree/bindings/arm/coresight-cpu-debug.txt b/Documentation/devicetree/bindings/arm/coresight-cpu-debug.txt
deleted file mode 100644
index f1de3247c1b7..000000000000
--- a/Documentation/devicetree/bindings/arm/coresight-cpu-debug.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-* CoreSight CPU Debug Component:
-
-CoreSight CPU debug component are compliant with the ARMv8 architecture
-reference manual (ARM DDI 0487A.k) Chapter 'Part H: External debug'. The
-external debug module is mainly used for two modes: self-hosted debug and
-external debug, and it can be accessed from mmio region from Coresight
-and eventually the debug module connects with CPU for debugging. And the
-debug module provides sample-based profiling extension, which can be used
-to sample CPU program counter, secure state and exception level, etc;
-usually every CPU has one dedicated debug module to be connected.
-
-Required properties:
-
-- compatible : should be "arm,coresight-cpu-debug"; supplemented with
- "arm,primecell" since this driver is using the AMBA bus
- interface.
-
-- reg : physical base address and length of the register set.
-
-- clocks : the clock associated to this component.
-
-- clock-names : the name of the clock referenced by the code. Since we are
- using the AMBA framework, the name of the clock providing
- the interconnect should be "apb_pclk" and the clock is
- mandatory. The interface between the debug logic and the
- processor core is clocked by the internal CPU clock, so it
- is enabled with CPU clock by default.
-
-- cpu : the CPU phandle the debug module is affined to. Do not assume it
- to default to CPU0 if omitted.
-
-Optional properties:
-
-- power-domains: a phandle to the debug power domain. We use "power-domains"
- binding to turn on the debug logic if it has own dedicated
- power domain and if necessary to use "cpuidle.off=1" or
- "nohlt" in the kernel command line or sysfs node to
- constrain idle states to ensure registers in the CPU power
- domain are accessible.
-
-Example:
-
- debug@f6590000 {
- compatible = "arm,coresight-cpu-debug","arm,primecell";
- reg = <0 0xf6590000 0 0x1000>;
- clocks = <&sys_ctrl HI6220_DAPB_CLK>;
- clock-names = "apb_pclk";
- cpu = <&cpu0>;
- };
diff --git a/Documentation/devicetree/bindings/arm/coresight-cti.yaml b/Documentation/devicetree/bindings/arm/coresight-cti.yaml
deleted file mode 100644
index 21e3515491f4..000000000000
--- a/Documentation/devicetree/bindings/arm/coresight-cti.yaml
+++ /dev/null
@@ -1,332 +0,0 @@
-# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
-# Copyright 2019 Linaro Ltd.
-%YAML 1.2
----
-$id: http://devicetree.org/schemas/arm/coresight-cti.yaml#
-$schema: http://devicetree.org/meta-schemas/core.yaml#
-
-title: ARM Coresight Cross Trigger Interface (CTI) device.
-
-description: |
- The CoreSight Embedded Cross Trigger (ECT) consists of CTI devices connected
- to one or more CoreSight components and/or a CPU, with CTIs interconnected in
- a star topology via the Cross Trigger Matrix (CTM), which is not programmable.
- The ECT components are not part of the trace generation data path and are thus
- not part of the CoreSight graph described in the general CoreSight bindings
- file coresight.txt.
-
- The CTI component properties define the connections between the individual
- CTI and the components it is directly connected to, consisting of input and
- output hardware trigger signals. CTIs can have a maximum number of input and
- output hardware trigger signals (8 each for v1 CTI, 32 each for v2 CTI). The
- number is defined at design time, the maximum of each defined in the DEVID
- register.
-
- CTIs are interconnected in a star topology via the CTM, using a number of
- programmable channels, usually 4, but again implementation defined and
- described in the DEVID register. The star topology is not required to be
- described in the bindings as the actual connections are software
- programmable.
-
- In general the connections between CTI and components via the trigger signals
- are implementation defined, except when the CTI is connected to an ARM v8
- architecture core and optional ETM.
-
- In this case the ARM v8 architecture defines the required signal connections
- between CTI and the CPU core and ETM if present. In the case of a v8
- architecturally connected CTI an additional compatible string is used to
- indicate this feature (arm,coresight-cti-v8-arch).
-
- When CTI trigger connection information is unavailable then a minimal driver
- binding can be declared with no explicit trigger signals. This will result
- the driver detecting the maximum available triggers and channels from the
- DEVID register and make them all available for use as a single default
- connection. Any user / client application will require additional information
- on the connections between the CTI and other components for correct operation.
- This information might be found by enabling the Integration Test registers in
- the driver (set CONFIG_CORESIGHT_CTI_INTEGRATION_TEST in Kernel
- configuration). These registers may be used to explore the trigger connections
- between CTI and other CoreSight components.
-
- Certain triggers between CoreSight devices and the CTI have specific types
- and usages. These can be defined along with the signal indexes with the
- constants defined in <dt-bindings/arm/coresight-cti-dt.h>
-
- For example a CTI connected to a core will usually have a DBGREQ signal. This
- is defined in the binding as type PE_EDBGREQ. These types will appear in an
- optional array alongside the signal indexes. Omitting types will default all
- signals to GEN_IO.
-
- Note that some hardware trigger signals can be connected to non-CoreSight
- components (e.g. UART etc) depending on hardware implementation.
-
-maintainers:
- - Mike Leach <mike.leach@linaro.org>
-
-allOf:
- - $ref: /schemas/arm/primecell.yaml#
-
-# Need a custom select here or 'arm,primecell' will match on lots of nodes
-select:
- properties:
- compatible:
- contains:
- enum:
- - arm,coresight-cti
- required:
- - compatible
-
-properties:
- $nodename:
- pattern: "^cti(@[0-9a-f]+)$"
- compatible:
- oneOf:
- - items:
- - const: arm,coresight-cti
- - const: arm,primecell
- - items:
- - const: arm,coresight-cti-v8-arch
- - const: arm,coresight-cti
- - const: arm,primecell
-
- reg:
- maxItems: 1
-
- cpu:
- $ref: /schemas/types.yaml#/definitions/phandle
- description:
- Handle to cpu this device is associated with. This must appear in the
- base cti node if compatible string arm,coresight-cti-v8-arch is used,
- or may appear in a trig-conns child node when appropriate.
-
- arm,cti-ctm-id:
- $ref: /schemas/types.yaml#/definitions/uint32
- description:
- Defines the CTM this CTI is connected to, in large systems with multiple
- separate CTI/CTM nets. Typically multi-socket systems where the CTM is
- propagated between sockets.
-
- arm,cs-dev-assoc:
- $ref: /schemas/types.yaml#/definitions/phandle
- description:
- defines a phandle reference to an associated CoreSight trace device.
- When the associated trace device is enabled, then the respective CTI
- will be enabled. Use in a trig-conns node, or in CTI base node when
- compatible string arm,coresight-cti-v8-arch used. If the associated
- device has not been registered then the node name will be stored as
- the connection name for later resolution. If the associated device is
- not a CoreSight device or not registered then the node name will remain
- the connection name and automatic enabling will not occur.
-
- # size cells and address cells required if trig-conns node present.
- "#size-cells":
- const: 0
-
- "#address-cells":
- const: 1
-
-patternProperties:
- '^trig-conns@([0-9]+)$':
- type: object
- description:
- A trigger connections child node which describes the trigger signals
- between this CTI and another hardware device. This device may be a CPU,
- CoreSight device, any other hardware device or simple external IO lines.
- The connection may have both input and output triggers, or only one or the
- other.
-
- properties:
- reg:
- maxItems: 1
-
- arm,trig-in-sigs:
- $ref: /schemas/types.yaml#/definitions/uint32-array
- minItems: 1
- maxItems: 32
- description:
- List of CTI trigger in signal numbers in use by a trig-conns node.
-
- arm,trig-in-types:
- $ref: /schemas/types.yaml#/definitions/uint32-array
- minItems: 1
- maxItems: 32
- description:
- List of constants representing the types for the CTI trigger in
- signals. Types in this array match to the corresponding signal in the
- arm,trig-in-sigs array. If the -types array is smaller, or omitted
- completely, then the types will default to GEN_IO.
-
- arm,trig-out-sigs:
- $ref: /schemas/types.yaml#/definitions/uint32-array
- minItems: 1
- maxItems: 32
- description:
- List of CTI trigger out signal numbers in use by a trig-conns node.
-
- arm,trig-out-types:
- $ref: /schemas/types.yaml#/definitions/uint32-array
- minItems: 1
- maxItems: 32
- description:
- List of constants representing the types for the CTI trigger out
- signals. Types in this array match to the corresponding signal
- in the arm,trig-out-sigs array. If the "-types" array is smaller,
- or omitted completely, then the types will default to GEN_IO.
-
- arm,trig-filters:
- $ref: /schemas/types.yaml#/definitions/uint32-array
- minItems: 1
- maxItems: 32
- description:
- List of CTI trigger out signals that will be blocked from becoming
- active, unless filtering is disabled on the driver.
-
- arm,trig-conn-name:
- $ref: /schemas/types.yaml#/definitions/string
- description:
- Defines a connection name that will be displayed, if the cpu or
- arm,cs-dev-assoc properties are not being used in this connection.
- Principle use for CTI that are connected to non-CoreSight devices, or
- external IO.
-
- anyOf:
- - required:
- - arm,trig-in-sigs
- - required:
- - arm,trig-out-sigs
- oneOf:
- - required:
- - arm,trig-conn-name
- - required:
- - cpu
- - required:
- - arm,cs-dev-assoc
- required:
- - reg
-
-required:
- - compatible
- - reg
- - clocks
- - clock-names
-
-if:
- properties:
- compatible:
- contains:
- const: arm,coresight-cti-v8-arch
-
-then:
- required:
- - cpu
-
-unevaluatedProperties: false
-
-examples:
- # minimum CTI definition. DEVID register used to set number of triggers.
- - |
- cti@20020000 {
- compatible = "arm,coresight-cti", "arm,primecell";
- reg = <0x20020000 0x1000>;
-
- clocks = <&soc_smc50mhz>;
- clock-names = "apb_pclk";
- };
- # v8 architecturally defined CTI - CPU + ETM connections generated by the
- # driver according to the v8 architecture specification.
- - |
- cti@859000 {
- compatible = "arm,coresight-cti-v8-arch", "arm,coresight-cti",
- "arm,primecell";
- reg = <0x859000 0x1000>;
-
- clocks = <&soc_smc50mhz>;
- clock-names = "apb_pclk";
-
- cpu = <&CPU1>;
- arm,cs-dev-assoc = <&etm1>;
- };
- # Implementation defined CTI - CPU + ETM connections explicitly defined..
- # Shows use of type constants from dt-bindings/arm/coresight-cti-dt.h
- # #size-cells and #address-cells are required if trig-conns@ nodes present.
- - |
- #include <dt-bindings/arm/coresight-cti-dt.h>
-
- cti@858000 {
- compatible = "arm,coresight-cti", "arm,primecell";
- reg = <0x858000 0x1000>;
-
- clocks = <&soc_smc50mhz>;
- clock-names = "apb_pclk";
-
- arm,cti-ctm-id = <1>;
-
- #address-cells = <1>;
- #size-cells = <0>;
-
- trig-conns@0 {
- reg = <0>;
- arm,trig-in-sigs = <4 5 6 7>;
- arm,trig-in-types = <ETM_EXTOUT
- ETM_EXTOUT
- ETM_EXTOUT
- ETM_EXTOUT>;
- arm,trig-out-sigs = <4 5 6 7>;
- arm,trig-out-types = <ETM_EXTIN
- ETM_EXTIN
- ETM_EXTIN
- ETM_EXTIN>;
- arm,cs-dev-assoc = <&etm0>;
- };
-
- trig-conns@1 {
- reg = <1>;
- cpu = <&CPU0>;
- arm,trig-in-sigs = <0 1>;
- arm,trig-in-types = <PE_DBGTRIGGER
- PE_PMUIRQ>;
- arm,trig-out-sigs=<0 1 2 >;
- arm,trig-out-types = <PE_EDBGREQ
- PE_DBGRESTART
- PE_CTIIRQ>;
-
- arm,trig-filters = <0>;
- };
- };
- # Implementation defined CTI - non CoreSight component connections.
- - |
- cti@20110000 {
- compatible = "arm,coresight-cti", "arm,primecell";
- reg = <0x20110000 0x1000>;
-
- clocks = <&soc_smc50mhz>;
- clock-names = "apb_pclk";
-
- #address-cells = <1>;
- #size-cells = <0>;
-
- trig-conns@0 {
- reg = <0>;
- arm,trig-in-sigs=<0>;
- arm,trig-in-types=<GEN_INTREQ>;
- arm,trig-out-sigs=<0>;
- arm,trig-out-types=<GEN_HALTREQ>;
- arm,trig-conn-name = "sys_profiler";
- };
-
- trig-conns@1 {
- reg = <1>;
- arm,trig-out-sigs=<2 3>;
- arm,trig-out-types=<GEN_HALTREQ GEN_RESTARTREQ>;
- arm,trig-conn-name = "watchdog";
- };
-
- trig-conns@2 {
- reg = <2>;
- arm,trig-in-sigs=<1 6>;
- arm,trig-in-types=<GEN_HALTREQ GEN_RESTARTREQ>;
- arm,trig-conn-name = "g_counter";
- };
- };
-
-...
diff --git a/Documentation/devicetree/bindings/arm/coresight.txt b/Documentation/devicetree/bindings/arm/coresight.txt
deleted file mode 100644
index 7f9c1ca87487..000000000000
--- a/Documentation/devicetree/bindings/arm/coresight.txt
+++ /dev/null
@@ -1,397 +0,0 @@
-* CoreSight Components:
-
-CoreSight components are compliant with the ARM CoreSight architecture
-specification and can be connected in various topologies to suit a particular
-SoCs tracing needs. These trace components can generally be classified as
-sinks, links and sources. Trace data produced by one or more sources flows
-through the intermediate links connecting the source to the currently selected
-sink. Each CoreSight component device should use these properties to describe
-its hardware characteristcs.
-
-* Required properties for all components *except* non-configurable replicators
- and non-configurable funnels:
-
- * compatible: These have to be supplemented with "arm,primecell" as
- drivers are using the AMBA bus interface. Possible values include:
- - Embedded Trace Buffer (version 1.0):
- "arm,coresight-etb10", "arm,primecell";
-
- - Trace Port Interface Unit:
- "arm,coresight-tpiu", "arm,primecell";
-
- - Trace Memory Controller, used for Embedded Trace Buffer(ETB),
- Embedded Trace FIFO(ETF) and Embedded Trace Router(ETR)
- configuration. The configuration mode (ETB, ETF, ETR) is
- discovered at boot time when the device is probed.
- "arm,coresight-tmc", "arm,primecell";
-
- - Trace Programmable Funnel:
- "arm,coresight-dynamic-funnel", "arm,primecell";
- "arm,coresight-funnel", "arm,primecell"; (OBSOLETE. For
- backward compatibility and will be removed)
-
- - Embedded Trace Macrocell (version 3.x) and
- Program Flow Trace Macrocell:
- "arm,coresight-etm3x", "arm,primecell";
-
- - Embedded Trace Macrocell (version 4.x), with memory mapped access.
- "arm,coresight-etm4x", "arm,primecell";
-
- - Embedded Trace Macrocell (version 4.x), with system register access only.
- "arm,coresight-etm4x-sysreg";
-
- - Coresight programmable Replicator :
- "arm,coresight-dynamic-replicator", "arm,primecell";
-
- - System Trace Macrocell:
- "arm,coresight-stm", "arm,primecell"; [1]
- - Coresight Address Translation Unit (CATU)
- "arm,coresight-catu", "arm,primecell";
-
- - Coresight Cross Trigger Interface (CTI):
- "arm,coresight-cti", "arm,primecell";
- See coresight-cti.yaml for full CTI definitions.
-
- * reg: physical base address and length of the register
- set(s) of the component.
-
- * clocks: the clocks associated to this component.
-
- * clock-names: the name of the clocks referenced by the code.
- Since we are using the AMBA framework, the name of the clock
- providing the interconnect should be "apb_pclk", and some
- coresight blocks also have an additional clock "atclk", which
- clocks the core of that coresight component. The latter clock
- is optional.
-
- * port or ports: see "Graph bindings for Coresight" below.
-
-* Additional required property for Embedded Trace Macrocell (version 3.x and
- version 4.x):
- * cpu: the cpu phandle this ETM/PTM is affined to. Do not
- assume it to default to CPU0 if omitted.
-
-* Additional required properties for System Trace Macrocells (STM):
- * reg: along with the physical base address and length of the register
- set as described above, another entry is required to describe the
- mapping of the extended stimulus port area.
-
- * reg-names: the only acceptable values are "stm-base" and
- "stm-stimulus-base", each corresponding to the areas defined in "reg".
-
-* Required properties for Coresight Cross Trigger Interface (CTI)
- See coresight-cti.yaml for full CTI definitions.
-
-* Required properties for devices that don't show up on the AMBA bus, such as
- non-configurable replicators and non-configurable funnels:
-
- * compatible: Currently supported value is (note the absence of the
- AMBA markee):
- - Coresight Non-configurable Replicator:
- "arm,coresight-static-replicator";
- "arm,coresight-replicator"; (OBSOLETE. For backward
- compatibility and will be removed)
-
- - Coresight Non-configurable Funnel:
- "arm,coresight-static-funnel";
-
- * port or ports: see "Graph bindings for Coresight" below.
-
-* Optional properties for all components:
-
- * arm,coresight-loses-context-with-cpu : boolean. Indicates that the
- hardware will lose register context on CPU power down (e.g. CPUIdle).
- An example of where this may be needed are systems which contain a
- coresight component and CPU in the same power domain. When the CPU
- powers down the coresight component also powers down and loses its
- context. This property is currently only used for the ETM 4.x driver.
-
-* Optional properties for ETM/PTMs:
-
- * arm,cp14: must be present if the system accesses ETM/PTM management
- registers via co-processor 14.
-
- * qcom,skip-power-up: boolean. Indicates that an implementation can
- skip powering up the trace unit. TRCPDCR.PU does not have to be set
- on Qualcomm Technologies Inc. systems since ETMs are in the same power
- domain as their CPU cores. This property is required to identify such
- systems with hardware errata where the CPU watchdog counter is stopped
- when TRCPDCR.PU is set.
-
-* Optional property for TMC:
-
- * arm,buffer-size: size of contiguous buffer space for TMC ETR
- (embedded trace router). This property is obsolete. The buffer size
- can be configured dynamically via buffer_size property in sysfs.
-
- * arm,scatter-gather: boolean. Indicates that the TMC-ETR can safely
- use the SG mode on this system.
-
-* Optional property for CATU :
- * interrupts : Exactly one SPI may be listed for reporting the address
- error
-
-* Optional property for configurable replicators:
-
- * qcom,replicator-loses-context: boolean. Indicates that the replicator
- will lose register context when AMBA clock is removed which is observed
- in some replicator designs.
-
-Graph bindings for Coresight
--------------------------------
-
-Coresight components are interconnected to create a data path for the flow of
-trace data generated from the "sources" to their collection points "sink".
-Each coresight component must describe the "input" and "output" connections.
-The connections must be described via generic DT graph bindings as described
-by the "bindings/graph.txt", where each "port" along with an "endpoint"
-component represents a hardware port and the connection.
-
- * All output ports must be listed inside a child node named "out-ports"
- * All input ports must be listed inside a child node named "in-ports".
- * Port address must match the hardware port number.
-
-Example:
-
-1. Sinks
- etb@20010000 {
- compatible = "arm,coresight-etb10", "arm,primecell";
- reg = <0 0x20010000 0 0x1000>;
-
- clocks = <&oscclk6a>;
- clock-names = "apb_pclk";
- in-ports {
- port {
- etb_in_port: endpoint@0 {
- remote-endpoint = <&replicator_out_port0>;
- };
- };
- };
- };
-
- tpiu@20030000 {
- compatible = "arm,coresight-tpiu", "arm,primecell";
- reg = <0 0x20030000 0 0x1000>;
-
- clocks = <&oscclk6a>;
- clock-names = "apb_pclk";
- in-ports {
- port {
- tpiu_in_port: endpoint@0 {
- remote-endpoint = <&replicator_out_port1>;
- };
- };
- };
- };
-
- etr@20070000 {
- compatible = "arm,coresight-tmc", "arm,primecell";
- reg = <0 0x20070000 0 0x1000>;
-
- clocks = <&oscclk6a>;
- clock-names = "apb_pclk";
- in-ports {
- port {
- etr_in_port: endpoint {
- remote-endpoint = <&replicator2_out_port0>;
- };
- };
- };
-
- out-ports {
- port {
- etr_out_port: endpoint {
- remote-endpoint = <&catu_in_port>;
- };
- };
- };
- };
-
-2. Links
- replicator {
- /* non-configurable replicators don't show up on the
- * AMBA bus. As such no need to add "arm,primecell".
- */
- compatible = "arm,coresight-static-replicator";
-
- out-ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- /* replicator output ports */
- port@0 {
- reg = <0>;
- replicator_out_port0: endpoint {
- remote-endpoint = <&etb_in_port>;
- };
- };
-
- port@1 {
- reg = <1>;
- replicator_out_port1: endpoint {
- remote-endpoint = <&tpiu_in_port>;
- };
- };
- };
-
- in-ports {
- port {
- replicator_in_port0: endpoint {
- remote-endpoint = <&funnel_out_port0>;
- };
- };
- };
- };
-
- funnel {
- /*
- * non-configurable funnel don't show up on the AMBA
- * bus. As such no need to add "arm,primecell".
- */
- compatible = "arm,coresight-static-funnel";
- clocks = <&crg_ctrl HI3660_PCLK>;
- clock-names = "apb_pclk";
-
- out-ports {
- port {
- combo_funnel_out: endpoint {
- remote-endpoint = <&top_funnel_in>;
- };
- };
- };
-
- in-ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- combo_funnel_in0: endpoint {
- remote-endpoint = <&cluster0_etf_out>;
- };
- };
-
- port@1 {
- reg = <1>;
- combo_funnel_in1: endpoint {
- remote-endpoint = <&cluster1_etf_out>;
- };
- };
- };
- };
-
- funnel@20040000 {
- compatible = "arm,coresight-dynamic-funnel", "arm,primecell";
- reg = <0 0x20040000 0 0x1000>;
-
- clocks = <&oscclk6a>;
- clock-names = "apb_pclk";
- out-ports {
- port {
- funnel_out_port0: endpoint {
- remote-endpoint =
- <&replicator_in_port0>;
- };
- };
- };
-
- in-ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- funnel_in_port0: endpoint {
- remote-endpoint = <&ptm0_out_port>;
- };
- };
-
- port@1 {
- reg = <1>;
- funnel_in_port1: endpoint {
- remote-endpoint = <&ptm1_out_port>;
- };
- };
-
- port@2 {
- reg = <2>;
- funnel_in_port2: endpoint {
- remote-endpoint = <&etm0_out_port>;
- };
- };
-
- };
- };
-
-3. Sources
- ptm@2201c000 {
- compatible = "arm,coresight-etm3x", "arm,primecell";
- reg = <0 0x2201c000 0 0x1000>;
-
- cpu = <&cpu0>;
- clocks = <&oscclk6a>;
- clock-names = "apb_pclk";
- out-ports {
- port {
- ptm0_out_port: endpoint {
- remote-endpoint = <&funnel_in_port0>;
- };
- };
- };
- };
-
- ptm@2201d000 {
- compatible = "arm,coresight-etm3x", "arm,primecell";
- reg = <0 0x2201d000 0 0x1000>;
-
- cpu = <&cpu1>;
- clocks = <&oscclk6a>;
- clock-names = "apb_pclk";
- out-ports {
- port {
- ptm1_out_port: endpoint {
- remote-endpoint = <&funnel_in_port1>;
- };
- };
- };
- };
-
-4. STM
- stm@20100000 {
- compatible = "arm,coresight-stm", "arm,primecell";
- reg = <0 0x20100000 0 0x1000>,
- <0 0x28000000 0 0x180000>;
- reg-names = "stm-base", "stm-stimulus-base";
-
- clocks = <&soc_smc50mhz>;
- clock-names = "apb_pclk";
- out-ports {
- port {
- stm_out_port: endpoint {
- remote-endpoint = <&main_funnel_in_port2>;
- };
- };
- };
- };
-
-5. CATU
-
- catu@207e0000 {
- compatible = "arm,coresight-catu", "arm,primecell";
- reg = <0 0x207e0000 0 0x1000>;
-
- clocks = <&oscclk6a>;
- clock-names = "apb_pclk";
-
- interrupts = <GIC_SPI 4 IRQ_TYPE_LEVEL_HIGH>;
- in-ports {
- port {
- catu_in_port: endpoint {
- remote-endpoint = <&etr_out_port>;
- };
- };
- };
- };
-
-[1]. There is currently two version of STM: STM32 and STM500. Both
-have the same HW interface and as such don't need an explicit binding name.
diff --git a/Documentation/devicetree/bindings/arm/cpus.yaml b/Documentation/devicetree/bindings/arm/cpus.yaml
index 9a2432a88074..ff272e517d57 100644
--- a/Documentation/devicetree/bindings/arm/cpus.yaml
+++ b/Documentation/devicetree/bindings/arm/cpus.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/cpus.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ARM CPUs bindings
+title: ARM CPUs
maintainers:
- Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
@@ -85,6 +85,8 @@ properties:
compatible:
enum:
+ - apple,avalanche
+ - apple,blizzard
- apple,icestorm
- apple,firestorm
- arm,arm710t
@@ -137,6 +139,12 @@ properties:
- arm,cortex-a75
- arm,cortex-a76
- arm,cortex-a77
+ - arm,cortex-a78
+ - arm,cortex-a78ae
+ - arm,cortex-a78c
+ - arm,cortex-a510
+ - arm,cortex-a710
+ - arm,cortex-a715
- arm,cortex-m0
- arm,cortex-m0+
- arm,cortex-m1
@@ -145,8 +153,14 @@ properties:
- arm,cortex-r4
- arm,cortex-r5
- arm,cortex-r7
+ - arm,cortex-x1
+ - arm,cortex-x1c
+ - arm,cortex-x2
+ - arm,cortex-x3
- arm,neoverse-e1
- arm,neoverse-n1
+ - arm,neoverse-n2
+ - arm,neoverse-v1
- brcm,brahma-b15
- brcm,brahma-b53
- brcm,vulcan
@@ -166,12 +180,19 @@ properties:
- nvidia,tegra194-carmel
- qcom,krait
- qcom,kryo
+ - qcom,kryo240
+ - qcom,kryo250
- qcom,kryo260
- qcom,kryo280
+ - qcom,kryo360
- qcom,kryo385
- qcom,kryo468
- qcom,kryo485
+ - qcom,kryo560
+ - qcom,kryo570
+ - qcom,kryo660
- qcom,kryo685
+ - qcom,kryo780
- qcom,scorpion
enable-method:
@@ -209,6 +230,10 @@ properties:
- qcom,gcc-msm8660
- qcom,kpss-acc-v1
- qcom,kpss-acc-v2
+ - qcom,msm8226-smp
+ - qcom,msm8909-smp
+ # Only valid on ARM 32-bit, see above for ARM v8 64-bit
+ - qcom,msm8916-smp
- renesas,apmu
- renesas,r9a06g032-smp
- rockchip,rk3036-smp
@@ -219,27 +244,31 @@ properties:
- ti,am4372
cpu-release-addr:
- $ref: '/schemas/types.yaml#/definitions/uint64'
-
+ oneOf:
+ - $ref: '/schemas/types.yaml#/definitions/uint32'
+ - $ref: '/schemas/types.yaml#/definitions/uint64'
description:
+ The DT specification defines this as always 64-bit, but some 32-bit Arm
+ systems have used a 32-bit value, which must also be supported (see the
+ sketch after this hunk).
Required for systems that have an "enable-method"
property value of "spin-table".
- On ARM v8 64-bit systems must be a two cell
- property identifying a 64-bit zero-initialised
- memory location.
cpu-idle-states:
$ref: '/schemas/types.yaml#/definitions/phandle-array'
+ items:
+ maxItems: 1
description: |
List of phandles to idle state nodes supported
by this cpu (see ./idle-states.yaml).
capacity-dmips-mhz:
description:
- u32 value representing CPU capacity (see ./cpu-capacity.txt) in
+ u32 value representing CPU capacity (see ../cpu/cpu-capacity.txt) in
DMIPS/MHz, relative to highest capacity-dmips-mhz
in the system.
+ cci-control-port: true
+
dynamic-power-coefficient:
$ref: '/schemas/types.yaml#/definitions/uint32'
description:
@@ -293,7 +322,8 @@ properties:
Specifies the ACC* node associated with this CPU.
Required for systems that have an "enable-method" property
- value of "qcom,kpss-acc-v1" or "qcom,kpss-acc-v2"
+ value of "qcom,kpss-acc-v1", "qcom,kpss-acc-v2", "qcom,msm8226-smp" or
+ "qcom,msm8916-smp".
* arm/msm/qcom,kpss-acc.txt
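As a worked sketch of the spin-table case described above (the release address
is hypothetical): a 64-bit cpu-release-addr is written as two cells, while the
32-bit form that some 32-bit Arm systems use would be a single cell.

	cpus {
		#address-cells = <1>;
		#size-cells = <0>;

		cpu@0 {
			device_type = "cpu";
			compatible = "arm,cortex-a53";
			reg = <0>;
			enable-method = "spin-table";
			/* two cells = one 64-bit zero-initialised release address */
			cpu-release-addr = <0x0 0x8000fff8>;
		};
	};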
diff --git a/Documentation/devicetree/bindings/arm/digicolor.yaml b/Documentation/devicetree/bindings/arm/digicolor.yaml
index a35de3c9e284..0cf9ddaa527e 100644
--- a/Documentation/devicetree/bindings/arm/digicolor.yaml
+++ b/Documentation/devicetree/bindings/arm/digicolor.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/digicolor.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Conexant Digicolor Platforms Device Tree Bindings
+title: Conexant Digicolor Platforms
maintainers:
- Baruch Siach <baruch@tkos.co.il>
diff --git a/Documentation/devicetree/bindings/arm/ete.yaml b/Documentation/devicetree/bindings/arm/ete.yaml
deleted file mode 100644
index 7f9b2d1e1147..000000000000
--- a/Documentation/devicetree/bindings/arm/ete.yaml
+++ /dev/null
@@ -1,75 +0,0 @@
-# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
-# Copyright 2021, Arm Ltd
-%YAML 1.2
----
-$id: "http://devicetree.org/schemas/arm/ete.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
-
-title: ARM Embedded Trace Extensions
-
-maintainers:
- - Suzuki K Poulose <suzuki.poulose@arm.com>
- - Mathieu Poirier <mathieu.poirier@linaro.org>
-
-description: |
- Arm Embedded Trace Extension(ETE) is a per CPU trace component that
- allows tracing the CPU execution. It overlaps with the CoreSight ETMv4
- architecture and has extended support for future architecture changes.
- The trace generated by the ETE could be stored via legacy CoreSight
- components (e.g, TMC-ETR) or other means (e.g, using a per CPU buffer
- Arm Trace Buffer Extension (TRBE)). Since the ETE can be connected to
- legacy CoreSight components, a node must be listed per instance, along
- with any optional connection graph as per the coresight bindings.
- See bindings/arm/coresight.txt.
-
-properties:
- $nodename:
- pattern: "^ete([0-9a-f]+)$"
- compatible:
- items:
- - const: arm,embedded-trace-extension
-
- cpu:
- description: |
- Handle to the cpu this ETE is bound to.
- $ref: /schemas/types.yaml#/definitions/phandle
-
- out-ports:
- description: |
- Output connections from the ETE to legacy CoreSight trace bus.
- $ref: /schemas/graph.yaml#/properties/ports
- properties:
- port:
- description: Output connection from the ETE to legacy CoreSight Trace bus.
- $ref: /schemas/graph.yaml#/properties/port
-
-required:
- - compatible
- - cpu
-
-additionalProperties: false
-
-examples:
-
-# An ETE node without legacy CoreSight connections
- - |
- ete0 {
- compatible = "arm,embedded-trace-extension";
- cpu = <&cpu_0>;
- };
-# An ETE node with legacy CoreSight connections
- - |
- ete1 {
- compatible = "arm,embedded-trace-extension";
- cpu = <&cpu_1>;
-
- out-ports { /* legacy coresight connection */
- port {
- ete1_out_port: endpoint {
- remote-endpoint = <&funnel_in_port0>;
- };
- };
- };
- };
-
-...
diff --git a/Documentation/devicetree/bindings/arm/firmware/linaro,optee-tz.yaml b/Documentation/devicetree/bindings/arm/firmware/linaro,optee-tz.yaml
index c24047c1fdd5..5d033570b57b 100644
--- a/Documentation/devicetree/bindings/arm/firmware/linaro,optee-tz.yaml
+++ b/Documentation/devicetree/bindings/arm/firmware/linaro,optee-tz.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/firmware/linaro,optee-tz.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: OP-TEE Device Tree Bindings
+title: OP-TEE
maintainers:
- Jens Wiklander <jens.wiklander@linaro.org>
@@ -24,6 +24,13 @@ properties:
compatible:
const: linaro,optee-tz
+ interrupts:
+ maxItems: 1
+ description: |
+ This interrupt, which the secure world software uses to signal an event,
+ is expected to be either a per-cpu interrupt or an edge-triggered
+ peripheral interrupt.
+
method:
enum: [smc, hvc]
description: |
@@ -42,10 +49,12 @@ additionalProperties: false
examples:
- |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
firmware {
optee {
compatible = "linaro,optee-tz";
method = "smc";
+ interrupts = <GIC_SPI 187 IRQ_TYPE_EDGE_RISING>;
};
};
diff --git a/Documentation/devicetree/bindings/arm/firmware/tlm,trusted-foundations.txt b/Documentation/devicetree/bindings/arm/firmware/tlm,trusted-foundations.txt
deleted file mode 100644
index 780d0392a66b..000000000000
--- a/Documentation/devicetree/bindings/arm/firmware/tlm,trusted-foundations.txt
+++ /dev/null
@@ -1,20 +0,0 @@
-Trusted Foundations
--------------------
-
-Boards that use the Trusted Foundations secure monitor can signal its
-presence by declaring a node compatible with "tlm,trusted-foundations"
-under the /firmware/ node
-
-Required properties:
-- compatible: "tlm,trusted-foundations"
-- tlm,version-major: major version number of Trusted Foundations firmware
-- tlm,version-minor: minor version number of Trusted Foundations firmware
-
-Example:
- firmware {
- trusted-foundations {
- compatible = "tlm,trusted-foundations";
- tlm,version-major = <2>;
- tlm,version-minor = <8>;
- };
- };
diff --git a/Documentation/devicetree/bindings/arm/firmware/tlm,trusted-foundations.yaml b/Documentation/devicetree/bindings/arm/firmware/tlm,trusted-foundations.yaml
new file mode 100644
index 000000000000..9d1857c0aa07
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/firmware/tlm,trusted-foundations.yaml
@@ -0,0 +1,46 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/arm/firmware/tlm,trusted-foundations.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: Trusted Foundations
+
+description: |
+ Boards that use the Trusted Foundations secure monitor can signal its
+ presence by declaring a compatible node under the /firmware/ node.
+
+maintainers:
+ - Stephen Warren <swarren@nvidia.com>
+
+properties:
+ $nodename:
+ const: trusted-foundations
+
+ compatible:
+ const: tlm,trusted-foundations
+
+ tlm,version-major:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: major version number of Trusted Foundations firmware
+
+ tlm,version-minor:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: minor version number of Trusted Foundations firmware
+
+required:
+ - compatible
+ - tlm,version-major
+ - tlm,version-minor
+
+additionalProperties: false
+
+examples:
+ - |
+ firmware {
+ trusted-foundations {
+ compatible = "tlm,trusted-foundations";
+ tlm,version-major = <2>;
+ tlm,version-minor = <8>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/arm/freescale/fsl,layerscape-dcfg.txt b/Documentation/devicetree/bindings/arm/freescale/fsl,layerscape-dcfg.txt
deleted file mode 100644
index b5cb374dc47d..000000000000
--- a/Documentation/devicetree/bindings/arm/freescale/fsl,layerscape-dcfg.txt
+++ /dev/null
@@ -1,19 +0,0 @@
-Freescale DCFG
-
-DCFG is the device configuration unit, that provides general purpose
-configuration and status for the device. Such as setting the secondary
-core start address and release the secondary core from holdoff and startup.
-
-Required properties:
- - compatible: Should contain a chip-specific compatible string,
- Chip-specific strings are of the form "fsl,<chip>-dcfg",
- The following <chip>s are known to be supported:
- ls1012a, ls1021a, ls1043a, ls1046a, ls2080a.
-
- - reg : should contain base address and length of DCFG memory-mapped registers
-
-Example:
- dcfg: dcfg@1ee0000 {
- compatible = "fsl,ls1021a-dcfg";
- reg = <0x0 0x1ee0000 0x0 0x10000>;
- };
diff --git a/Documentation/devicetree/bindings/arm/freescale/fsl,layerscape-scfg.txt b/Documentation/devicetree/bindings/arm/freescale/fsl,layerscape-scfg.txt
deleted file mode 100644
index 0ab67b0b216d..000000000000
--- a/Documentation/devicetree/bindings/arm/freescale/fsl,layerscape-scfg.txt
+++ /dev/null
@@ -1,19 +0,0 @@
-Freescale SCFG
-
-SCFG is the supplemental configuration unit that provides SoC-specific
-configuration and status registers for the chip, such as the PEX port
-status.
-
-Required properties:
- - compatible: Should contain a chip-specific compatible string,
- Chip-specific strings are of the form "fsl,<chip>-scfg",
- The following <chip>s are known to be supported:
- ls1012a, ls1021a, ls1043a, ls1046a, ls2080a.
-
- - reg: should contain base address and length of SCFG memory-mapped registers
-
-Example:
- scfg: scfg@1570000 {
- compatible = "fsl,ls1021a-scfg";
- reg = <0x0 0x1570000 0x0 0x10000>;
- };
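The four-cell reg in this example (and in the DCFG one above) only makes sense under a parent bus with two address cells and two size cells; a minimal sketch making that assumption explicit:

	soc {
		#address-cells = <2>;
		#size-cells = <2>;

		scfg: scfg@1570000 {
			compatible = "fsl,ls1021a-scfg";
			/* 64-bit address 0x1570000, 64-bit length 0x10000 */
			reg = <0x0 0x1570000 0x0 0x10000>;
		};
	};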
diff --git a/Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt b/Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt
deleted file mode 100644
index fd0061712443..000000000000
--- a/Documentation/devicetree/bindings/arm/freescale/fsl,scu.txt
+++ /dev/null
@@ -1,270 +0,0 @@
-NXP i.MX System Controller Firmware (SCFW)
---------------------------------------------------------------------
-
-The System Controller Firmware (SCFW) is a low-level system function
-which runs on a dedicated Cortex-M core to provide power, clock, and
-resource management. It exists on some i.MX8 processors, e.g. i.MX8QM
-(QM, QP) and i.MX8QX (QXP, DX).
-
-The AP communicates with the SC using a multi-ported MU module found
-in the LSIO subsystem. The current definition of this MU module provides
-5 remote AP connections to the SC to support up to 5 execution environments
-(TZ, HV, standard Linux, etc.). The SC side of this MU module interfaces
-with the LSIO DSC IP bus. The SC firmware will communicate with this MU
-using the MSI bus.
-
-System Controller Device Node:
-============================================================
-
-The scu node with the following properties shall be under the /firmware/ node.
-
-Required properties:
--------------------
-- compatible: should be "fsl,imx-scu".
-- mbox-names: should include "tx0", "tx1", "tx2", "tx3",
- "rx0", "rx1", "rx2", "rx3";
- include "gip3" if general MU interrupt support is wanted.
-- mboxes: List of phandle of 4 MU channels for tx, 4 MU channels for
- rx, and 1 optional MU channel for general interrupt.
- All MU channels must be in the same MU instance.
- Cross instances are not allowed. The MU instance can only
- be one of LSIO MU0~MU4 for imx8qxp and imx8qm. Users need
- to make sure to use one that does not conflict with other
- execution environments, e.g. ATF.
- Note:
- Channel 0 must be "tx0" or "rx0".
- Channel 1 must be "tx1" or "rx1".
- Channel 2 must be "tx2" or "rx2".
- Channel 3 must be "tx3" or "rx3".
- General interrupt rx channel must be "gip3".
- e.g.
- mboxes = <&lsio_mu1 0 0
- &lsio_mu1 0 1
- &lsio_mu1 0 2
- &lsio_mu1 0 3
- &lsio_mu1 1 0
- &lsio_mu1 1 1
- &lsio_mu1 1 2
- &lsio_mu1 1 3
- &lsio_mu1 3 3>;
- See Documentation/devicetree/bindings/mailbox/fsl,mu.yaml
- for detailed mailbox binding.
-
-Note: Each MU that supports the general interrupt should have a correctly
-numbered alias in the "aliases" node.
-e.g.
-aliases {
- mu1 = &lsio_mu1;
-};
-
-i.MX SCU Client Device Node:
-============================================================
-
-Client nodes are maintained as children of the relevant IMX-SCU device node.
-
-Power domain bindings based on SCU Message Protocol
-------------------------------------------------------------
-
-This binding for the SCU power domain providers uses the generic power
-domain binding[2].
-
-Required properties:
-- compatible: Should be one of:
- "fsl,imx8qm-scu-pd",
- "fsl,imx8qxp-scu-pd"
- followed by "fsl,scu-pd"
-
-- #power-domain-cells: Must be 1. Contains the Resource ID used by
- SCU commands.
- See detailed Resource ID list from:
- include/dt-bindings/firmware/imx/rsrc.h
-
-Clock bindings based on SCU Message Protocol
-------------------------------------------------------------
-
-This binding uses the common clock binding[1].
-
-Required properties:
-- compatible: Should be one of:
- "fsl,imx8qm-clk"
- "fsl,imx8qxp-clk"
- followed by "fsl,scu-clk"
-- #clock-cells: Should be 2.
- Contains the Resource and Clock ID value.
-- clocks: List of clock specifiers, must contain an entry for
- each required entry in clock-names
-- clock-names: Should include entries "xtal_32KHz", "xtal_24MHz"
-
-The clock consumer should specify the desired clock by having the clock
-ID in its "clocks" phandle cell.
-
-See the full list of clock IDs from:
-include/dt-bindings/clock/imx8qxp-clock.h
-
-Pinctrl bindings based on SCU Message Protocol
-------------------------------------------------------------
-
-This binding uses the i.MX common pinctrl binding[3].
-
-Required properties:
-- compatible: Should be one of:
- "fsl,imx8qm-iomuxc",
- "fsl,imx8qxp-iomuxc",
- "fsl,imx8dxl-iomuxc".
-
-Required properties for Pinctrl sub nodes:
-- fsl,pins: Each entry consists of 3 integers which represent
- the mux and config setting for one pin. The first 2
- integers <pin_id mux_mode> are specified using a
- PIN_FUNC_ID macro, which can be found in
- <dt-bindings/pinctrl/pads-imx8qm.h>,
- <dt-bindings/pinctrl/pads-imx8qxp.h>,
- <dt-bindings/pinctrl/pads-imx8dxl.h>.
- The last integer CONFIG is the pad setting value like
- pull-up on this pin.
-
- Please refer to i.MX8QXP Reference Manual for detailed
- CONFIG settings.
-
-[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
-[2] Documentation/devicetree/bindings/power/power-domain.yaml
-[3] Documentation/devicetree/bindings/pinctrl/fsl,imx-pinctrl.txt
-
-RTC bindings based on SCU Message Protocol
-------------------------------------------------------------
-
-Required properties:
-- compatible: should be "fsl,imx8qxp-sc-rtc";
-
-OCOTP bindings based on SCU Message Protocol
-------------------------------------------------------------
-Required properties:
-- compatible: Should be one of:
- "fsl,imx8qm-scu-ocotp",
- "fsl,imx8qxp-scu-ocotp".
-- #address-cells: Must be 1. Contains byte index
-- #size-cells: Must be 1. Contains byte length
-
-Optional Child nodes:
-
-- Data cells of ocotp:
- Detailed bindings are described in bindings/nvmem/nvmem.txt
-
-Watchdog bindings based on SCU Message Protocol
-------------------------------------------------------------
-
-Required properties:
-- compatible: should be:
- "fsl,imx8qxp-sc-wdt"
- followed by "fsl,imx-sc-wdt";
-Optional properties:
-- timeout-sec: contains the watchdog timeout in seconds.
-
-SCU key bindings based on SCU Message Protocol
-------------------------------------------------------------
-
-Required properties:
-- compatible: should be:
- "fsl,imx8qxp-sc-key"
- followed by "fsl,imx-sc-key";
-- linux,keycodes: See Documentation/devicetree/bindings/input/input.yaml
-
-Thermal bindings based on SCU Message Protocol
-------------------------------------------------------------
-
-Required properties:
-- compatible: Should be :
- "fsl,imx8qxp-sc-thermal"
- followed by "fsl,imx-sc-thermal";
-
-- #thermal-sensor-cells: See Documentation/devicetree/bindings/thermal/thermal-sensor.yaml
- for a description.
-
-Example (imx8qxp):
--------------
-aliases {
- mu1 = &lsio_mu1;
-};
-
-lsio_mu1: mailbox@5d1c0000 {
- ...
- #mbox-cells = <2>;
-};
-
-firmware {
- scu {
- compatible = "fsl,imx-scu";
- mbox-names = "tx0", "tx1", "tx2", "tx3",
- "rx0", "rx1", "rx2", "rx3",
- "gip3";
- mboxes = <&lsio_mu1 0 0
- &lsio_mu1 0 1
- &lsio_mu1 0 2
- &lsio_mu1 0 3
- &lsio_mu1 1 0
- &lsio_mu1 1 1
- &lsio_mu1 1 2
- &lsio_mu1 1 3
- &lsio_mu1 3 3>;
-
- clk: clk {
- compatible = "fsl,imx8qxp-clk", "fsl,scu-clk";
- #clock-cells = <2>;
- };
-
- iomuxc {
- compatible = "fsl,imx8qxp-iomuxc";
-
- pinctrl_lpuart0: lpuart0grp {
- fsl,pins = <
- SC_P_UART0_RX_ADMA_UART0_RX 0x06000020
- SC_P_UART0_TX_ADMA_UART0_TX 0x06000020
- >;
- };
- ...
- };
-
- ocotp: imx8qx-ocotp {
- compatible = "fsl,imx8qxp-scu-ocotp";
- #address-cells = <1>;
- #size-cells = <1>;
-
- fec_mac0: mac@2c4 {
- reg = <0x2c4 8>;
- };
- };
-
- pd: imx8qx-pd {
- compatible = "fsl,imx8qxp-scu-pd", "fsl,scu-pd";
- #power-domain-cells = <1>;
- };
-
- rtc: rtc {
- compatible = "fsl,imx8qxp-sc-rtc";
- };
-
- scu_key: scu-key {
- compatible = "fsl,imx8qxp-sc-key", "fsl,imx-sc-key";
- linux,keycodes = <KEY_POWER>;
- };
-
- watchdog {
- compatible = "fsl,imx8qxp-sc-wdt", "fsl,imx-sc-wdt";
- timeout-sec = <60>;
- };
-
- tsens: thermal-sensor {
- compatible = "fsl,imx8qxp-sc-thermal", "fsl,imx-sc-thermal";
- #thermal-sensor-cells = <1>;
- };
- };
-};
-
-serial@5a060000 {
- ...
- pinctrl-names = "default";
- pinctrl-0 = <&pinctrl_lpuart0>;
- clocks = <&uart0_clk IMX_SC_R_UART_0 IMX_SC_PM_CLK_PER>;
- clock-names = "ipg";
- power-domains = <&pd IMX_SC_R_UART_0>;
-};
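Since the general-interrupt channel is optional, a minimal scu node, sketched here reusing the lsio_mu1 mailbox from the example above, wires only the four tx and four rx channels and omits "gip3" (and with it the aliases requirement):

	firmware {
		scu {
			compatible = "fsl,imx-scu";
			/* channel n must be named txn/rxn, per the note above */
			mbox-names = "tx0", "tx1", "tx2", "tx3",
				     "rx0", "rx1", "rx2", "rx3";
			mboxes = <&lsio_mu1 0 0 &lsio_mu1 0 1
				  &lsio_mu1 0 2 &lsio_mu1 0 3
				  &lsio_mu1 1 0 &lsio_mu1 1 1
				  &lsio_mu1 1 2 &lsio_mu1 1 3>;
		};
	};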
diff --git a/Documentation/devicetree/bindings/arm/fsl.yaml b/Documentation/devicetree/bindings/arm/fsl.yaml
index 60f4862ba15e..15d411084065 100644
--- a/Documentation/devicetree/bindings/arm/fsl.yaml
+++ b/Documentation/devicetree/bindings/arm/fsl.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/fsl.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Freescale i.MX Platforms Device Tree Bindings
+title: Freescale i.MX Platforms
maintainers:
- Shawn Guo <shawnguo@kernel.org>
@@ -88,12 +88,56 @@ properties:
items:
- enum:
- armadeus,imx28-apf28 # APF28 SoM
- - armadeus,imx28-apf28dev # APF28 SoM on APF28Dev board
+ - bluegiga,apx4devkit # Bluegiga APx4 SoM on dev board
+ - crystalfontz,cfa10036 # Crystalfontz CFA-10036 SoM
+ - eukrea,mbmx28lc
- fsl,imx28-evk
- i2se,duckbill
- i2se,duckbill-2
+ - karo,tx28 # Ka-Ro electronics TX28 module
+ - lwn,imx28-xea
+ - msr,m28cu3 # M28 SoM with custom base board
+ - schulercontrol,imx28-sps1
- technologic,imx28-ts4600
- const: fsl,imx28
+
+ - description: i.MX28 Aries M28 SoM Board
+ items:
+ - const: aries,m28
+ - const: denx,m28
+ - const: fsl,imx28
+
+ - description: i.MX28 Aries M28EVK Board
+ items:
+ - const: aries,m28evk
+ - const: denx,m28evk
+ - const: fsl,imx28
+
+ - description: i.MX28 Armadeus Systems APF28Dev Board
+ items:
+ - const: armadeus,imx28-apf28dev
+ - const: armadeus,imx28-apf28
+ - const: fsl,imx28
+
+ - description: i.MX28 Crystalfontz CFA-10036 based Boards
+ items:
+ - enum:
+ - crystalfontz,cfa10037
+ - crystalfontz,cfa10049
+ - crystalfontz,cfa10057
+ - crystalfontz,cfa10058
+ - const: crystalfontz,cfa10036
+ - const: fsl,imx28
+
+ - description: i.MX28 Crystalfontz CFA-10037 based Boards
+ items:
+ - enum:
+ - crystalfontz,cfa10055
+ - crystalfontz,cfa10056
+ - const: crystalfontz,cfa10037
+ - const: crystalfontz,cfa10036
+ - const: fsl,imx28
+
- description: i.MX28 Duckbill 2 based Boards
items:
- enum:
@@ -103,6 +147,19 @@ properties:
- const: i2se,duckbill-2
- const: fsl,imx28
+ - description: i.MX28 Eukrea Electromatique MBMX283LC Board
+ items:
+ - const: eukrea,mbmx283lc
+ - const: eukrea,mbmx28lc
+ - const: fsl,imx28
+
+ - description: i.MX28 Eukrea Electromatique MBMX287LC Board
+ items:
+ - const: eukrea,mbmx287lc
+ - const: eukrea,mbmx283lc
+ - const: eukrea,mbmx28lc
+ - const: fsl,imx28
+
- description: i.MX31 based Boards
items:
- enum:
@@ -172,7 +229,8 @@ properties:
- karo,tx53 # Ka-Ro electronics TX53 module
- kiebackpeter,imx53-ddc # K+P imx53 DDC
- kiebackpeter,imx53-hsc # K+P imx53 HSC
- - menlo,m53menlo
+ - menlo,m53menlo # i.MX53 Menlo board
+ - starterkit,sk-imx53
- voipac,imx53-dmm-668 # Voipac i.MX53 X53-DMM-668
- const: fsl,imx53
@@ -192,6 +250,7 @@ properties:
items:
- enum:
- auvidea,h100 # Auvidea H100
+ - bosch,imx6q-acc # Bosch ACC i.MX6 Dual
- boundary,imx6q-nitrogen6_max
- boundary,imx6q-nitrogen6_som2
- boundary,imx6q-nitrogen6x
@@ -235,11 +294,13 @@ properties:
- technexion,imx6q-pico-pi # TechNexion i.MX6Q Pico-Pi
- technologic,imx6q-ts4900
- technologic,imx6q-ts7970
- - toradex,apalis_imx6q # Apalis iMX6 Module
+ - toradex,apalis_imx6q # Apalis iMX6 Modules
- udoo,imx6q-udoo # Udoo i.MX6 Quad Board
- uniwest,imx6q-evi # Uniwest Evi
- variscite,dt6customboard
- wand,imx6q-wandboard # Wandboard i.MX6 Quad Board
+ - ysoft,imx6q-yapp4-crux # i.MX6 Quad Y Soft IOTA Crux board
+ - ysoft,imx6q-yapp4-pegasus # i.MX6 Quad Y Soft IOTA Pegasus board
- zealz,imx6q-gk802 # Zealz GK802
- zii,imx6q-zii-rdu2 # ZII RDU2 Board
- const: fsl,imx6q
@@ -314,19 +375,28 @@ properties:
- const: phytec,imx6q-pfla02 # PHYTEC phyFLEX-i.MX6 Quad
- const: fsl,imx6q
- - description: i.MX6Q Boards with Toradex Apalis iMX6Q/D Module
+ - description: i.MX6Q Boards with Toradex Apalis iMX6Q/D Modules
items:
- enum:
- - toradex,apalis_imx6q-ixora # Apalis iMX6Q/D Module on Ixora Carrier Board
- - toradex,apalis_imx6q-eval # Apalis iMX6Q/D Module on Apalis Evaluation Board
+ - toradex,apalis_imx6q-ixora # Apalis iMX6Q/D Module on Ixora Carrier Board
+ - toradex,apalis_imx6q-ixora-v1.1 # Apalis iMX6Q/D Module on Ixora V1.1 Carrier Board
+ - toradex,apalis_imx6q-ixora-v1.2 # Apalis iMX6Q/D Module on Ixora V1.2 Carrier Board
+ - toradex,apalis_imx6q-eval # Apalis iMX6Q/D Module on Apalis Evaluation Board
- const: toradex,apalis_imx6q
- const: fsl,imx6q
- - description: i.MX6Q Toradex Apalis iMX6Q/D Module on Ixora Carrier Board V1.1
+ - description: TQ-Systems TQMa6Q SoM (variant A) on MBa6x
items:
- - const: toradex,apalis_imx6q-ixora-v1.1
- - const: toradex,apalis_imx6q-ixora
- - const: toradex,apalis_imx6q
+ - const: tq,imx6q-mba6x-a
+ - const: tq,mba6a # Expected by bootloader, to be removed in the future
+ - const: tq,imx6q-tqma6q-a
+ - const: fsl,imx6q
+
+ - description: TQ-Systems TQMa6Q SoM (variant B) on MBa6x
+ items:
+ - const: tq,imx6q-mba6x-b
+ - const: tq,mba6b # Expected by bootloader, to be removed in the future
+ - const: tq,imx6q-tqma6q-b
- const: fsl,imx6q
- description: i.MX6QP based Boards
@@ -340,6 +410,8 @@ properties:
- kvg,vicutp # Kverneland UT1P board
- prt,prtwd3 # Protonic WD3 board
- wand,imx6qp-wandboard # Wandboard i.MX6 QuadPlus Board
+ - ysoft,imx6qp-yapp4-crux-plus # i.MX6 Quad Plus Y Soft IOTA Crux+ board
+ - ysoft,imx6qp-yapp4-pegasus-plus # i.MX6 Quad Plus Y Soft IOTA Pegasus+ board
- zii,imx6qp-zii-rdu2 # ZII RDU2+ Board
- const: fsl,imx6qp
@@ -350,6 +422,13 @@ properties:
- const: phytec,imx6qdl-pcm058 # PHYTEC phyCORE-i.MX6
- const: fsl,imx6qp
+ - description: TQ-Systems TQMa6QP SoM on MBa6x
+ items:
+ - const: tq,imx6qp-mba6x-b
+ - const: tq,mba6b # Expected by bootloader, to be removed in the future
+ - const: tq,imx6qp-tqma6qp-b
+ - const: fsl,imx6qp
+
- description: i.MX6DL based Boards
items:
- enum:
@@ -393,12 +472,15 @@ properties:
- technexion,imx6dl-pico-pi # TechNexion i.MX6DL Pico-Pi
- technologic,imx6dl-ts4900
- technologic,imx6dl-ts7970
+ - toradex,colibri_imx6dl # Colibri iMX6 Modules
- udoo,imx6dl-udoo # Udoo i.MX6 Dual-lite Board
- vdl,lanmcu # Van der Laan LANMCU board
- wand,imx6dl-wandboard # Wandboard i.MX6 Dual Lite Board
- - ysoft,imx6dl-yapp4-draco # i.MX6 DualLite Y Soft IOTA Draco board
+ - ysoft,imx6dl-yapp4-draco # i.MX6 Solo Y Soft IOTA Draco board
- ysoft,imx6dl-yapp4-hydra # i.MX6 DualLite Y Soft IOTA Hydra board
+ - ysoft,imx6dl-yapp4-lynx # i.MX6 DualLite Y Soft IOTA Lynx board
- ysoft,imx6dl-yapp4-orion # i.MX6 DualLite Y Soft IOTA Orion board
+ - ysoft,imx6dl-yapp4-phoenix # i.MX6 DualLite Y Soft IOTA Phoenix board
- ysoft,imx6dl-yapp4-ursa # i.MX6 Solo Y Soft IOTA Ursa board
- const: fsl,imx6dl
@@ -466,20 +548,14 @@ properties:
- const: phytec,imx6dl-pfla02 # PHYTEC phyFLEX-i.MX6 Quad
- const: fsl,imx6dl
- - description: i.MX6DL Toradex Colibri iMX6 Module on Colibri
- Evaluation Board V3
+ - description: i.MX6DL Boards with Toradex Colibri iMX6DL/S Modules
items:
- - const: toradex,colibri_imx6dl-eval-v3
- - const: toradex,colibri_imx6dl # Colibri iMX6 Module
- - const: fsl,imx6dl
-
- - description: i.MX6DL Toradex Colibri iMX6 Module V1.1 on Colibri
- Evaluation Board V3
- items:
- - const: toradex,colibri_imx6dl-v1_1-eval-v3
- - const: toradex,colibri_imx6dl-v1_1 # Colibri iMX6 Module V1.1
- - const: toradex,colibri_imx6dl-eval-v3
- - const: toradex,colibri_imx6dl # Colibri iMX6 Module
+ - enum:
+ - toradex,colibri_imx6dl-aster # Colibri iMX6DL/S Module on Aster Board
+ - toradex,colibri_imx6dl-eval-v3 # Colibri iMX6DL/S Module on Colibri Evaluation Board V3
+ - toradex,colibri_imx6dl-iris # Colibri iMX6DL/S Module on Iris Board
+ - toradex,colibri_imx6dl-iris-v2 # Colibri iMX6DL/S Module on Iris Board V2
+ - const: toradex,colibri_imx6dl # Colibri iMX6DL/S Module
- const: fsl,imx6dl
- description: i.MX6S DHCOM DRC02 Board
@@ -488,12 +564,29 @@ properties:
- const: dh,imx6s-dhcom-som
- const: fsl,imx6dl
+ - description: TQ-Systems TQMa6DL SoM (variant A) on MBa6x
+ items:
+ - const: tq,imx6dl-mba6x-a
+ - const: tq,mba6a # Expected by bootloader, to be removed in the future
+ - const: tq,imx6dl-tqma6dl-a
+ - const: fsl,imx6dl
+
+ - description: TQ-Systems TQMa6DL SoM (variant B) on MBa6x
+ items:
+ - const: tq,imx6dl-mba6x-b
+ - const: tq,mba6b # Expected by bootloader, to be removed in the future
+ - const: tq,imx6dl-tqma6dl-b
+ - const: fsl,imx6dl
+
- description: i.MX6SL based Boards
items:
- enum:
- fsl,imx6sl-evk # i.MX6 SoloLite EVK Board
+ - kobo,aura2
- kobo,tolino-shine2hd
- kobo,tolino-shine3
+ - kobo,tolino-vision
+ - kobo,tolino-vision5
- revotics,imx6sl-warp # Revotics WaRP Board
- const: fsl,imx6sl
@@ -502,6 +595,7 @@ properties:
- enum:
- fsl,imx6sll-evk
- kobo,clarahd
+ - kobo,librah2o
- const: fsl,imx6sll
- description: i.MX6SX based Boards
@@ -524,8 +618,7 @@ properties:
- engicam,imx6ul-isiot # Engicam Is.IoT MX6UL eMMC/NAND Starter kit
- fsl,imx6ul-14x14-evk # i.MX6 UltraLite 14x14 EVK Board
- karo,imx6ul-tx6ul # Ka-Ro electronics TXUL-0010 Module
- - kontron,imx6ul-n6310-som # Kontron N6310 SOM
- - kontron,imx6ul-n6311-som # Kontron N6311 SOM
+ - kontron,sl-imx6ul # Kontron SL i.MX6UL SoM
- prt,prti6g # Protonic PRTI6G Board
- technexion,imx6ul-pico-dwarf # TechNexion i.MX6UL Pico-Dwarf
- technexion,imx6ul-pico-hobbit # TechNexion i.MX6UL Pico-Hobbit
@@ -561,33 +654,51 @@ properties:
- const: phytec,imx6ul-pcl063 # PHYTEC phyCORE-i.MX 6UL
- const: fsl,imx6ul
- - description: Kontron N6310 S Board
+ - description: Kontron BL i.MX6UL (N631X S) Board
items:
- - const: kontron,imx6ul-n6310-s
- - const: kontron,imx6ul-n6310-som
+ - const: kontron,bl-imx6ul # Kontron BL i.MX6UL Carrier Board
+ - const: kontron,sl-imx6ul # Kontron SL i.MX6UL SoM
- const: fsl,imx6ul
- - description: Kontron N6311 S Board
+ - description: Kontron BL i.MX6UL 43 (N631X S 43) Board
items:
- - const: kontron,imx6ul-n6311-s
- - const: kontron,imx6ul-n6311-som
+ - const: kontron,bl-imx6ul-43 # Kontron BL i.MX6UL Carrier Board with 4.3" Display
+ - const: kontron,bl-imx6ul # Kontron BL i.MX6UL Carrier Board
+ - const: kontron,sl-imx6ul # Kontron SL i.MX6UL SoM
- const: fsl,imx6ul
- - description: Kontron N6310 S 43 Board
+ - description: TQ-Systems TQMa6UL1 SoM on MBa6ULx board
items:
- - const: kontron,imx6ul-n6310-s-43
- - const: kontron,imx6ul-n6310-s
- - const: kontron,imx6ul-n6310-som
+ - enum:
+ - tq,imx6ul-tqma6ul1-mba6ulx
+ - const: tq,imx6ul-tqma6ul1 # MCIMX6G1
+ - const: fsl,imx6ul
+
+ - description: TQ-Systems TQMa6UL2 SoM on MBa6ULx board
+ items:
+ - enum:
+ - tq,imx6ul-tqma6ul2-mba6ulx
+ - const: tq,imx6ul-tqma6ul2 # MCIMX6G2
+ - const: fsl,imx6ul
+
+ - description: TQ-Systems TQMa6ULxL SoM on MBa6ULx[L] board
+ items:
+ - enum:
+ - tq,imx6ul-tqma6ul2l-mba6ulx # using LGA adapter
+ - tq,imx6ul-tqma6ul2l-mba6ulxl
+ - const: tq,imx6ul-tqma6ul2l # MCIMX6G2, LGA SoM variant
- const: fsl,imx6ul
- description: i.MX6ULL based Boards
items:
- enum:
- fsl,imx6ull-14x14-evk # i.MX6 UltraLiteLite 14x14 EVK Board
- - kontron,imx6ull-n6411-som # Kontron N6411 SOM
+ - joz,jozacp # JOZ Access Point
+ - kontron,sl-imx6ull # Kontron SL i.MX6ULL SoM
- myir,imx6ull-mys-6ulx-eval # MYiR Tech iMX6ULL Evaluation Board
- - toradex,colibri-imx6ull-eval # Colibri iMX6ULL Module on Colibri Eval Board
- - toradex,colibri-imx6ull-wifi-eval # Colibri iMX6ULL Wi-Fi / BT Module on Colibri Eval Board
+ - toradex,colibri-imx6ull # Colibri iMX6ULL Modules
+ - toradex,colibri-imx6ull-emmc # Colibri iMX6ULL 1GB (eMMC) Module
+ - toradex,colibri-imx6ull-wifi # Colibri iMX6ULL Wi-Fi / BT Modules
- const: fsl,imx6ull
- description: i.MX6ULL Armadeus Systems OPOS6ULDev Board
@@ -596,6 +707,25 @@ properties:
- const: armadeus,imx6ull-opos6ul # OPOS6UL (i.MX6ULL) SoM
- const: fsl,imx6ull
+ - description: i.MX6ULL chargebyte Tarragon Boards
+ items:
+ - enum:
+ - chargebyte,imx6ull-tarragon-master
+ - chargebyte,imx6ull-tarragon-micro
+ - chargebyte,imx6ull-tarragon-slave
+ - chargebyte,imx6ull-tarragon-slavext
+ - const: fsl,imx6ull
+
+ - description: i.MX6ULL DHCOM SoM based Boards
+ items:
+ - enum:
+ - dh,imx6ull-dhcom-drc02
+ - dh,imx6ull-dhcom-pdk2
+ - dh,imx6ull-dhcom-picoitx
+ - const: dh,imx6ull-dhcom-som # The DHCOR is soldered on the DHCOM
+ - const: dh,imx6ull-dhcor-som
+ - const: fsl,imx6ull
+
- description: i.MX6ULL PHYTEC phyBOARD-Segin
items:
- enum:
@@ -605,15 +735,70 @@ properties:
- const: phytec,imx6ull-pcl063 # PHYTEC phyCORE-i.MX 6ULL
- const: fsl,imx6ull
- - description: Kontron N6411 S Board
+ - description: i.MX6ULL PHYTEC phyGATE-Tauri
items:
- - const: kontron,imx6ull-n6411-s
- - const: kontron,imx6ull-n6411-som
+ - enum:
+ - phytec,imx6ull-phygate-tauri-emmc
+ - phytec,imx6ull-phygate-tauri-nand
+ - const: phytec,imx6ull-phygate-tauri # PHYTEC phyGATE-Tauri with i.MX6 ULL
+ - const: phytec,imx6ull-pcl063 # PHYTEC phyCORE-i.MX 6ULL
+ - const: fsl,imx6ull
+
+ - description: i.MX6ULL Boards with Toradex Colibri iMX6ULL Modules
+ items:
+ - enum:
+ - toradex,colibri-imx6ull-aster # Aster Carrier Board
+ - toradex,colibri-imx6ull-eval # Colibri Evaluation Board V3
+ - toradex,colibri-imx6ull-iris # Iris Carrier Board
+ - toradex,colibri-imx6ull-iris-v2 # Iris V2 Carrier Board
+ - const: toradex,colibri-imx6ull # Colibri iMX6ULL Module
+ - const: fsl,imx6ull
+
+ - description: i.MX6ULL Boards with Toradex Colibri iMX6ULL 1GB (eMMC) Module
+ items:
+ - enum:
+ - toradex,colibri-imx6ull-emmc-aster # Aster Carrier Board
+ - toradex,colibri-imx6ull-emmc-eval # Colibri Evaluation B. V3
+ - toradex,colibri-imx6ull-emmc-iris # Iris Carrier Board
+ - toradex,colibri-imx6ull-emmc-iris-v2 # Iris V2 Carrier Board
+ - const: toradex,colibri-imx6ull-emmc # Colibri iMX6ULL 1GB (eMMC) Module
+ - const: fsl,imx6ull
+
+ - description: i.MX6ULL Boards with Toradex Colibri iMX6ULL Wi-Fi / BT Modules
+ items:
+ - enum:
+ - toradex,colibri-imx6ull-wifi-eval # Colibri Eval. B. V3
+ - toradex,colibri-imx6ull-wifi-aster # Aster Carrier Board
+ - toradex,colibri-imx6ull-wifi-iris # Iris Carrier Board
+ - toradex,colibri-imx6ull-wifi-iris-v2 # Iris V2 Carrier Board
+ - const: toradex,colibri-imx6ull-wifi # Colibri iMX6ULL Wi-Fi / BT Module
+ - const: fsl,imx6ull
+
+ - description: Kontron BL i.MX6ULL (N6411 S) Board
+ items:
+ - const: kontron,bl-imx6ull # Kontron BL i.MX6ULL Carrier Board
+ - const: kontron,sl-imx6ull # Kontron SL i.MX6ULL SoM
+ - const: fsl,imx6ull
+
+ - description: TQ Systems TQMa6ULLx SoM on MBa6ULx board
+ items:
+ - enum:
+ - tq,imx6ull-tqma6ull2-mba6ulx
+ - const: tq,imx6ull-tqma6ull2 # MCIMX6Y2
+ - const: fsl,imx6ull
+
+ - description: TQ Systems TQMa6ULLxL SoM on MBa6ULx[L] board
+ items:
+ - enum:
+ - tq,imx6ull-tqma6ull2l-mba6ulx # using LGA adapter
+ - tq,imx6ull-tqma6ull2l-mba6ulxl
+ - const: tq,imx6ull-tqma6ull2l # MCIMX6Y2, LGA SoM variant
- const: fsl,imx6ull
- description: i.MX6ULZ based Boards
items:
- enum:
+ - bsh,imx6ulz-bsh-smm-m2 # i.MX6 ULZ BSH SystemMaster
- fsl,imx6ulz-14x14-evk # i.MX6 ULZ 14x14 EVK Board
- const: fsl,imx6ull # This seems odd. Should be last?
- const: fsl,imx6ulz
@@ -622,6 +807,7 @@ properties:
items:
- enum:
- element14,imx7s-warp # Element14 Warp i.MX7 Board
+ - toradex,colibri-imx7s # Colibri iMX7S Module
- const: fsl,imx7s
- description: i.MX7S Boards with Toradex Colibri iMX7S Module
@@ -629,6 +815,8 @@ properties:
- enum:
- toradex,colibri-imx7s-aster # Module on Aster Carrier Board
- toradex,colibri-imx7s-eval-v3 # Module on Colibri Evaluation Board V3
+ - toradex,colibri-imx7s-iris # Module on Iris Carrier Board
+ - toradex,colibri-imx7s-iris-v2 # Module on Iris Carrier Board V2
- const: toradex,colibri-imx7s
- const: fsl,imx7s
@@ -649,19 +837,13 @@ properties:
- kam,imx7d-flex-concentrator-mfg # Kamstrup OMNIA Flex Concentrator in manufacturing mode
- novtech,imx7d-meerkat96 # i.MX7 Meerkat96 Board
- remarkable,imx7d-remarkable2 # i.MX7D ReMarkable 2 E-Ink Tablet
+ - storopack,imx7d-smegw01 # Storopack i.MX7D SMEGW01
- technexion,imx7d-pico-dwarf # TechNexion i.MX7D Pico-Dwarf
- technexion,imx7d-pico-hobbit # TechNexion i.MX7D Pico-Hobbit
- technexion,imx7d-pico-nymph # TechNexion i.MX7D Pico-Nymph
- technexion,imx7d-pico-pi # TechNexion i.MX7D Pico-Pi
- - toradex,colibri-imx7d # Colibri iMX7 Dual Module
- - toradex,colibri-imx7d-aster # Colibri iMX7 Dual Module on Aster Carrier Board
- - toradex,colibri-imx7d-emmc # Colibri iMX7 Dual 1GB (eMMC) Module
- - toradex,colibri-imx7d-emmc-aster # Colibri iMX7 Dual 1GB (eMMC) Module on
- # Aster Carrier Board
- - toradex,colibri-imx7d-emmc-eval-v3 # Colibri iMX7 Dual 1GB (eMMC) Module on
- # Colibri Evaluation Board V3
- - toradex,colibri-imx7d-eval-v3 # Colibri iMX7 Dual Module on
- # Colibri Evaluation Board V3
+ - toradex,colibri-imx7d # Colibri iMX7D Module
+ - toradex,colibri-imx7d-emmc # Colibri iMX7D 1GB (eMMC) Module
- zii,imx7d-rmu2 # ZII RMU2 Board
- zii,imx7d-rpu2 # ZII RPU2 Board
- const: fsl,imx7d
@@ -686,16 +868,20 @@ properties:
- description: i.MX7D Boards with Toradex Colibri i.MX7D Module
items:
- enum:
- - toradex,colibri-imx7d-aster # Module on Aster Carrier Board
- - toradex,colibri-imx7d-eval-v3 # Module on Colibri Evaluation Board V3
+ - toradex,colibri-imx7d-aster # Aster Carrier Board
+ - toradex,colibri-imx7d-eval-v3 # Colibri Evaluation Board V3
+ - toradex,colibri-imx7d-iris # Iris Carrier Board
+ - toradex,colibri-imx7d-iris-v2 # Iris Carrier Board V2
- const: toradex,colibri-imx7d
- const: fsl,imx7d
- - description: i.MX7D Boards with Toradex Colibri i.MX7D eMMC Module
+ - description: i.MX7D Boards with Toradex Colibri i.MX7D 1GB (eMMC) Module
items:
- enum:
- toradex,colibri-imx7d-emmc-aster # Module on Aster Carrier Board
- toradex,colibri-imx7d-emmc-eval-v3 # Module on Colibri Evaluation Board V3
+ - toradex,colibri-imx7d-emmc-iris # Module on Iris Carrier Board
+ - toradex,colibri-imx7d-emmc-iris-v2 # Module on Iris Carrier Board V2
- const: toradex,colibri-imx7d-emmc
- const: fsl,imx7d
@@ -711,15 +897,25 @@ properties:
- enum:
- beacon,imx8mm-beacon-kit # i.MX8MM Beacon Development Kit
- boundary,imx8mm-nitrogen8mm # i.MX8MM Nitrogen Board
+ - dmo,imx8mm-data-modul-edm-sbc # i.MX8MM eDM SBC
+ - emtrion,emcon-mx8mm-avari # emCON-MX8MM SoM on Avari Base
- fsl,imx8mm-ddr4-evk # i.MX8MM DDR4 EVK Board
- fsl,imx8mm-evk # i.MX8MM EVK Board
+ - gateworks,imx8mm-gw7904
- gw,imx8mm-gw71xx-0x # i.MX8MM Gateworks Development Kit
- gw,imx8mm-gw72xx-0x # i.MX8MM Gateworks Development Kit
- gw,imx8mm-gw73xx-0x # i.MX8MM Gateworks Development Kit
- gw,imx8mm-gw7901 # i.MX8MM Gateworks Board
- gw,imx8mm-gw7902 # i.MX8MM Gateworks Board
- - kontron,imx8mm-n801x-som # i.MX8MM Kontron SL (N801X) SOM
+ - gw,imx8mm-gw7903 # i.MX8MM Gateworks Board
+ - innocomm,wb15-evk # i.MX8MM Innocomm EVK board with WB15 SoM
+ - kontron,imx8mm-sl # i.MX8MM Kontron SL (N801X) SOM
+ - kontron,imx8mm-osm-s # i.MX8MM Kontron OSM-S (N802X) SOM
+ - toradex,verdin-imx8mm # Verdin iMX8M Mini Modules
+ - toradex,verdin-imx8mm-nonwifi # Verdin iMX8M Mini Modules without Wi-Fi / BT
+ - toradex,verdin-imx8mm-wifi # Verdin iMX8M Mini Wi-Fi / BT Modules
- variscite,var-som-mx8mm # i.MX8MM Variscite VAR-SOM-MX8MM module
+ - prt,prt8mm # i.MX8MM Protonic PRT8MM Board
- const: fsl,imx8mm
- description: Engicam i.Core MX8M Mini SoM based boards
@@ -732,8 +928,41 @@ properties:
- description: Kontron BL i.MX8MM (N801X S) Board
items:
- - const: kontron,imx8mm-n801x-s
- - const: kontron,imx8mm-n801x-som
+ - const: kontron,imx8mm-bl
+ - const: kontron,imx8mm-sl
+ - const: fsl,imx8mm
+
+ - description: Kontron BL i.MX8MM OSM-S (N802X S) Board
+ items:
+ - const: kontron,imx8mm-bl-osm-s
+ - const: kontron,imx8mm-osm-s
+ - const: fsl,imx8mm
+
+ - description: Toradex Boards with Verdin iMX8M Mini Modules
+ items:
+ - enum:
+ - menlo,mx8menlo # Verdin iMX8M Mini Module on i.MX8MM Menlo board
+ - toradex,verdin-imx8mm-nonwifi-dahlia # Verdin iMX8M Mini Module on Dahlia
+ - toradex,verdin-imx8mm-nonwifi-dev # Verdin iMX8M Mini Module on Verdin Development Board
+ - toradex,verdin-imx8mm-nonwifi-yavia # Verdin iMX8M Mini Module on Yavia
+ - const: toradex,verdin-imx8mm-nonwifi # Verdin iMX8M Mini Module without Wi-Fi / BT
+ - const: toradex,verdin-imx8mm # Verdin iMX8M Mini Module
+ - const: fsl,imx8mm
+
+ - description: Toradex Boards with Verdin iMX8M Mini Wi-Fi / BT Modules
+ items:
+ - enum:
+ - toradex,verdin-imx8mm-wifi-dahlia # Verdin iMX8M Mini Wi-Fi / BT Module on Dahlia
+ - toradex,verdin-imx8mm-wifi-dev # Verdin iMX8M Mini Wi-Fi / BT M. on Verdin Development B.
+ - toradex,verdin-imx8mm-wifi-yavia # Verdin iMX8M Mini Wi-Fi / BT Module on Yavia
+ - const: toradex,verdin-imx8mm-wifi # Verdin iMX8M Mini Wi-Fi / BT Module
+ - const: toradex,verdin-imx8mm # Verdin iMX8M Mini Module
+ - const: fsl,imx8mm
+
+ - description: PHYTEC phyCORE-i.MX8MM SoM based boards
+ items:
+ - const: phytec,imx8mm-phyboard-polis-rdk # phyBOARD-Polis RDK
+ - const: phytec,imx8mm-phycore-som # phyCORE-i.MX8MM SoM
- const: fsl,imx8mm
- description: Variscite VAR-SOM-MX8MM based boards
@@ -742,10 +971,25 @@ properties:
- const: variscite,var-som-mx8mm
- const: fsl,imx8mm
+ - description:
+ TQMa8MxML is a series of SOM featuring NXP i.MX8MM system-on-chip
+ variants. It is designed to be soldered on different carrier boards.
+ All variants (TQMa8M[Q,D,S][L]ML) use the same device tree, hence only
+ one compatible is needed.
+ items:
+ - enum:
+ - cloos,imx8mm-phg # i.MX8MM Cloos PHG Board
+ - tq,imx8mm-tqma8mqml-mba8mx # TQ-Systems GmbH i.MX8MM TQMa8MQML SOM on MBa8Mx
+ - const: tq,imx8mm-tqma8mqml # TQ-Systems GmbH i.MX8MM TQMa8MQML SOM
+ - const: fsl,imx8mm
+
- description: i.MX8MN based Boards
items:
- enum:
- beacon,imx8mn-beacon-kit # i.MX8MN Beacon Development Kit
+ - bsh,imx8mn-bsh-smm-s2 # i.MX8MN BSH SystemMaster S2
+ - bsh,imx8mn-bsh-smm-s2pro # i.MX8MN BSH SystemMaster S2 PRO
+ - fsl,imx8mn-ddr3l-evk # i.MX8MN DDR3L EVK Board
- fsl,imx8mn-ddr4-evk # i.MX8MN DDR4 EVK Board
- fsl,imx8mn-evk # i.MX8MN LPDDR4 EVK Board
- gw,imx8mn-gw7902 # i.MX8MM Gateworks Board
@@ -757,10 +1001,51 @@ properties:
- const: variscite,var-som-mx8mn
- const: fsl,imx8mn
+ - description:
+ TQMa8MxNL is a series of SOM featuring NXP i.MX8MN system-on-chip
+ variants. It is designed to be soldered on different carrier boards.
+ All variants (TQMa8M[Q,D,S][L]NL) use the same device tree, hence only
+ one compatible is needed.
+ items:
+ - enum:
+ - tq,imx8mn-tqma8mqnl-mba8mx # TQ-Systems GmbH i.MX8MN TQMa8MQNL SOM on MBa8Mx
+ - const: tq,imx8mn-tqma8mqnl # TQ-Systems GmbH i.MX8MN TQMa8MQNL SOM
+ - const: fsl,imx8mn
+
- description: i.MX8MP based Boards
items:
- enum:
+ - beacon,imx8mp-beacon-kit # i.MX8MP Beacon Development Kit
+ - dmo,imx8mp-data-modul-edm-sbc # i.MX8MP eDM SBC
- fsl,imx8mp-evk # i.MX8MP EVK Board
+ - gateworks,imx8mp-gw74xx # i.MX8MP Gateworks Board
+ - polyhex,imx8mp-debix # Polyhex Debix boards
+ - polyhex,imx8mp-debix-model-a # Polyhex Debix Model A Board
+ - toradex,verdin-imx8mp # Verdin iMX8M Plus Modules
+ - toradex,verdin-imx8mp-nonwifi # Verdin iMX8M Plus Modules without Wi-Fi / BT
+ - toradex,verdin-imx8mp-wifi # Verdin iMX8M Plus Wi-Fi / BT Modules
+ - const: fsl,imx8mp
+
+ - description: Avnet (MSC Branded) Boards with SM2S i.MX8M Plus Modules
+ items:
+ - const: avnet,sm2s-imx8mp-14N0600E-ep1 # SM2S-IMX8PLUS-14N0600E on SM2-MB-EP1 Carrier Board
+ - const: avnet,sm2s-imx8mp-14N0600E # 14N0600E variant of SM2S-IMX8PLUS SoM
+ - const: avnet,sm2s-imx8mp # SM2S-IMX8PLUS SoM
+ - const: fsl,imx8mp
+
+ - description: i.MX8MP DHCOM based Boards
+ items:
+ - enum:
+ - dh,imx8mp-dhcom-pdk2 # i.MX8MP DHCOM SoM on PDK2 board
+ - dh,imx8mp-dhcom-pdk3 # i.MX8MP DHCOM SoM on PDK3 board
+ - const: dh,imx8mp-dhcom-som # i.MX8MP DHCOM SoM
+ - const: fsl,imx8mp
+
+ - description: Engicam i.Core MX8M Plus SoM based boards
+ items:
+ - enum:
+ - engicam,icore-mx8mp-edimm2.2 # i.MX8MP Engicam i.Core MX8M Plus EDIMM2.2 Starter Kit
+ - const: engicam,icore-mx8mp # i.MX8MP Engicam i.Core MX8M Plus SoM
- const: fsl,imx8mp
- description: PHYTEC phyCORE-i.MX8MP SoM based boards
@@ -769,6 +1054,38 @@ properties:
- const: phytec,imx8mp-phycore-som # phyCORE-i.MX8MP SoM
- const: fsl,imx8mp
+ - description: Toradex Boards with Verdin iMX8M Plus Modules
+ items:
+ - enum:
+ - toradex,verdin-imx8mp-nonwifi-dahlia # Verdin iMX8M Plus Module on Dahlia
+ - toradex,verdin-imx8mp-nonwifi-dev # Verdin iMX8M Plus Module on Verdin Development Board
+ - toradex,verdin-imx8mp-nonwifi-yavia # Verdin iMX8M Plus Module on Yavia
+ - const: toradex,verdin-imx8mp-nonwifi # Verdin iMX8M Plus Module without Wi-Fi / BT
+ - const: toradex,verdin-imx8mp # Verdin iMX8M Plus Module
+ - const: fsl,imx8mp
+
+ - description: Toradex Boards with Verdin iMX8M Plus Wi-Fi / BT Modules
+ items:
+ - enum:
+ - toradex,verdin-imx8mp-wifi-dahlia # Verdin iMX8M Plus Wi-Fi / BT Module on Dahlia
+ - toradex,verdin-imx8mp-wifi-dev # Verdin iMX8M Plus Wi-Fi / BT M. on Verdin Development B.
+ - toradex,verdin-imx8mp-wifi-yavia # Verdin iMX8M Plus Wi-Fi / BT Module on Yavia
+ - const: toradex,verdin-imx8mp-wifi # Verdin iMX8M Plus Wi-Fi / BT Module
+ - const: toradex,verdin-imx8mp # Verdin iMX8M Plus Module
+ - const: fsl,imx8mp
+
+ - description:
+ TQMa8MPxL is a series of LGA SOM featuring NXP i.MX8MP system-on-chip
+ variants. It is designed to be soldered on different carrier boards.
+ All CPU variants use the same device tree hence only one compatible
+ is needed. MBa8MPxL mainboard can be used as starterkit or in a boxed
+ version as an industrial computing device.
+ items:
+ - enum:
+ - tq,imx8mp-tqma8mpql-mba8mpxl # TQ-Systems GmbH i.MX8MP TQMa8MPQL SOM on MBa8MPxL
+ - const: tq,imx8mp-tqma8mpql # TQ-Systems GmbH i.MX8MP TQMa8MPQL SOM
+ - const: fsl,imx8mp
+
- description: i.MX8MQ based Boards
items:
- enum:
@@ -778,12 +1095,17 @@ properties:
- fsl,imx8mq-evk # i.MX8MQ EVK Board
- google,imx8mq-phanbell # Google Coral Edge TPU
- kontron,pitx-imx8m # Kontron pITX-imx8m Board
- - mntre,reform2 # MNT Reform2 Laptop
- purism,librem5-devkit # Purism Librem5 devkit
- solidrun,hummingboard-pulse # SolidRun Hummingboard Pulse
- technexion,pico-pi-imx8m # TechNexion PICO-PI-8M evk
- const: fsl,imx8mq
+ - description: i.MX8MQ NITROGEN SoM based Boards
+ items:
+ - const: mntre,reform2 # MNT Reform2 Laptop
+ - const: boundary,imx8mq-nitrogen8m-som # i.MX8MQ NITROGEN SoM
+ - const: fsl,imx8mq
+
- description: Purism Librem5 phones
items:
- enum:
@@ -793,6 +1115,15 @@ properties:
- const: purism,librem5
- const: fsl,imx8mq
+ - description:
+ TQMa8Mx is a series of SOM featuring NXP i.MX8MQ system-on-chip
+ variants. It is designed to be clicked on different carrier boards.
+ items:
+ - enum:
+ - tq,imx8mq-tqma8mq-mba8mx # TQ-Systems GmbH i.MX8MQ TQMa8Mx SOM on MBa8Mx
+ - const: tq,imx8mq-tqma8mq # TQ-Systems GmbH i.MX8MQ TQMa8Mx SOM
+ - const: fsl,imx8mq
+
- description: Zodiac Inflight Innovations Ultra Boards
items:
- enum:
@@ -805,6 +1136,25 @@ properties:
items:
- enum:
- fsl,imx8qm-mek # i.MX8QM MEK Board
+ - toradex,apalis-imx8 # Apalis iMX8 Modules
+ - toradex,apalis-imx8-v1.1 # Apalis iMX8 V1.1 Modules
+ - const: fsl,imx8qm
+
+ - description: i.MX8QM Boards with Toradex Apalis iMX8 Modules
+ items:
+ - enum:
+ - toradex,apalis-imx8-eval # Apalis iMX8 Module on Apalis Evaluation Board
+ - toradex,apalis-imx8-ixora-v1.1 # Apalis iMX8 Module on Ixora V1.1 Carrier Board
+ - const: toradex,apalis-imx8
+ - const: fsl,imx8qm
+
+ - description: i.MX8QM Boards with Toradex Apalis iMX8 V1.1 Modules
+ items:
+ - enum:
+ - toradex,apalis-imx8-v1.1-eval # Apalis iMX8 V1.1 Module on Apalis Eval. Board
+ - toradex,apalis-imx8-v1.1-ixora-v1.1 # Apalis iMX8 V1.1 Module on Ixora V1.1 C. Board
+ - toradex,apalis-imx8-v1.1-ixora-v1.2 # Apalis iMX8 V1.1 Module on Ixora V1.2 C. Board
+ - const: toradex,apalis-imx8-v1.1
- const: fsl,imx8qm
- description: i.MX8QXP based Boards
@@ -812,16 +1162,49 @@ properties:
- enum:
- einfochips,imx8qxp-ai_ml # i.MX8QXP AI_ML Board
- fsl,imx8qxp-mek # i.MX8QXP MEK Board
- - toradex,colibri-imx8x # Colibri iMX8X Module
+ - toradex,colibri-imx8x # Colibri iMX8X Modules
- const: fsl,imx8qxp
- - description: Toradex Colibri i.MX8 Evaluation Board
+ - description: i.MX8DXL based Boards
items:
- enum:
+ - fsl,imx8dxl-evk # i.MX8DXL EVK Board
+ - const: fsl,imx8dxl
+
+ - description: i.MX8QXP Boards with Toradex Colibri iMX8X Modules
+ items:
+ - enum:
+ - toradex,colibri-imx8x-aster # Colibri iMX8X Module on Aster Board
- toradex,colibri-imx8x-eval-v3 # Colibri iMX8X Module on Colibri Evaluation Board V3
+ - toradex,colibri-imx8x-iris # Colibri iMX8X Module on Iris Board
+ - toradex,colibri-imx8x-iris-v2 # Colibri iMX8X Module on Iris Board V2
- const: toradex,colibri-imx8x
- const: fsl,imx8qxp
+ - description: i.MX8ULP based Boards
+ items:
+ - enum:
+ - fsl,imx8ulp-evk # i.MX8ULP EVK Board
+ - const: fsl,imx8ulp
+
+ - description: i.MX93 based Boards
+ items:
+ - enum:
+ - fsl,imx93-11x11-evk # i.MX93 11x11 EVK Board
+ - const: fsl,imx93
+
+ - description: i.MXRT1050 based Boards
+ items:
+ - enum:
+ - fsl,imxrt1050-evk # i.MXRT1050 EVK Board
+ - const: fsl,imxrt1050
+
+ - description: i.MXRT1170 based Boards
+ items:
+ - enum:
+ - fsl,imxrt1170-evk # i.MXRT1170 EVK Board
+ - const: fsl,imxrt1170
+
- description:
Freescale Vybrid Platform Device Tree Bindings
@@ -847,9 +1230,10 @@ properties:
- description: VF610 based Boards
items:
- enum:
+ - fsl,vf610-twr # VF610 Tower Board
- lwn,bk4 # Liebherr BK4 controller
- phytec,vf610-cosmic # PHYTEC Cosmic/Cosmic+ Board
- - fsl,vf610-twr # VF610 Tower Board
+ - toradex,vf610-colibri_vf61 # Colibri VF61 Modules
- const: fsl,vf610
- description: Toradex Colibri VF61 Module on Colibri Evaluation Board
@@ -884,8 +1268,10 @@ properties:
- description: LS1021A based Boards
items:
- enum:
+ - fsl,ls1021a-iot
- fsl,ls1021a-moxa-uc-8410a
- fsl,ls1021a-qds
+ - fsl,ls1021a-tsn
- fsl,ls1021a-twr
- const: fsl,ls1021a
@@ -977,6 +1363,8 @@ properties:
- description: LX2160A based Boards
items:
- enum:
+ - fsl,lx2160a-bluebox3
+ - fsl,lx2160a-bluebox3-rev-a
- fsl,lx2160a-qds
- fsl,lx2160a-rdb
- fsl,lx2162a-qds
@@ -990,6 +1378,13 @@ properties:
- const: solidrun,lx2160a-cex7
- const: fsl,lx2160a
+ - description: S32G2 based Boards
+ items:
+ - enum:
+ - nxp,s32g274a-evb
+ - nxp,s32g274a-rdb2
+ - const: nxp,s32g2
+
- description: S32V234 based Boards
items:
- enum:
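Each multi-entry items list in the schema above encodes an ordered compatible chain that a board device tree must reproduce verbatim, most specific entry first. A hedged sketch for the TQMa6Q (variant A) entry added in this change:

	/dts-v1/;
	/ {
		model = "TQ-Systems TQMa6Q SoM (variant A) on MBa6x";
		/* order matches the schema: board, bootloader alias, SoM, SoC */
		compatible = "tq,imx6q-mba6x-a", "tq,mba6a",
			     "tq,imx6q-tqma6q-a", "fsl,imx6q";
	};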
diff --git a/Documentation/devicetree/bindings/arm/fw-cfg.txt b/Documentation/devicetree/bindings/arm/fw-cfg.txt
deleted file mode 100644
index fd54e1db2156..000000000000
--- a/Documentation/devicetree/bindings/arm/fw-cfg.txt
+++ /dev/null
@@ -1,38 +0,0 @@
-* QEMU Firmware Configuration bindings for ARM
-
-QEMU's arm-softmmu and aarch64-softmmu emulation / virtualization targets
-provide the following Firmware Configuration interface on the "virt" machine
-type:
-
-- A write-only, 16-bit wide selector (or control) register,
-- a read-write, 64-bit wide data register.
-
-QEMU exposes the control and data register to ARM guests as memory mapped
-registers; their location is communicated to the guest's UEFI firmware in the
-DTB that QEMU places at the bottom of the guest's DRAM.
-
-The authoritative guest-side hardware interface documentation to the fw_cfg
-device can be found in "docs/specs/fw_cfg.txt" in the QEMU source tree.
-
-
-Required properties:
-
-- compatible: "qemu,fw-cfg-mmio".
-
-- reg: the MMIO region used by the device.
- * Bytes 0x0 to 0x7 cover the data register.
- * Bytes 0x8 to 0x9 cover the selector register.
- * Further registers may be appended to the region in case of future interface
- revisions / feature bits.
-
-Example:
-
-/ {
- #size-cells = <0x2>;
- #address-cells = <0x2>;
-
- fw-cfg@9020000 {
- compatible = "qemu,fw-cfg-mmio";
- reg = <0x0 0x9020000 0x0 0xa>;
- };
-};
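The 0xa-byte region length follows directly from the layout above: 8 bytes of data register (0x0-0x7) plus 2 bytes of selector (0x8-0x9). A sketch of the same node under single-cell addressing, assuming the usual 32-bit #address-cells/#size-cells:

	/ {
		#address-cells = <1>;
		#size-cells = <1>;

		fw-cfg@9020000 {
			compatible = "qemu,fw-cfg-mmio";
			/* 8-byte data register + 2-byte selector = 0xa bytes */
			reg = <0x9020000 0xa>;
		};
	};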
diff --git a/Documentation/devicetree/bindings/arm/hisilicon/controller/hip04-bootwrapper.yaml b/Documentation/devicetree/bindings/arm/hisilicon/controller/hip04-bootwrapper.yaml
index 7378159e61df..483caf0ce25b 100644
--- a/Documentation/devicetree/bindings/arm/hisilicon/controller/hip04-bootwrapper.yaml
+++ b/Documentation/devicetree/bindings/arm/hisilicon/controller/hip04-bootwrapper.yaml
@@ -17,14 +17,15 @@ properties:
- const: hisilicon,hip04-bootwrapper
boot-method:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
description: |
Address and size of boot method.
[0]: bootwrapper physical address
[1]: bootwrapper size
[2]: relocation physical address
[3]: relocation size
- minItems: 1
- maxItems: 2
+ minItems: 2
+ maxItems: 4
required:
- compatible
diff --git a/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.yaml b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.yaml
index b38458022946..540876322040 100644
--- a/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.yaml
+++ b/Documentation/devicetree/bindings/arm/hisilicon/hisilicon.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/hisilicon/hisilicon.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Hisilicon Platforms Device Tree Bindings
+title: Hisilicon Platforms
maintainers:
- Wei Xu <xuwei5@hisilicon.com>
diff --git a/Documentation/devicetree/bindings/arm/hpe,gxp.yaml b/Documentation/devicetree/bindings/arm/hpe,gxp.yaml
new file mode 100644
index 000000000000..224bbcb93f95
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/hpe,gxp.yaml
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/hpe,gxp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: HPE BMC GXP platforms
+
+maintainers:
+ - Nick Hawkins <nick.hawkins@hpe.com>
+ - Jean-Marie Verdun <verdun@hpe.com>
+
+properties:
+ compatible:
+ oneOf:
+ - description: GXP Based Boards
+ items:
+ - enum:
+ - hpe,gxp-dl360gen10
+ - const: hpe,gxp
+
+required:
+ - compatible
+
+additionalProperties: true
+
+...
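For reference, a minimal board tree matching this schema chains the board compatible with the platform fallback; a hedged sketch (the model string is illustrative):

	/dts-v1/;
	/ {
		model = "HPE ProLiant DL360 Gen10";
		compatible = "hpe,gxp-dl360gen10", "hpe,gxp";
	};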
diff --git a/Documentation/devicetree/bindings/arm/idle-states.yaml b/Documentation/devicetree/bindings/arm/idle-states.yaml
deleted file mode 100644
index 52bce5dbb11f..000000000000
--- a/Documentation/devicetree/bindings/arm/idle-states.yaml
+++ /dev/null
@@ -1,661 +0,0 @@
-# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
-%YAML 1.2
----
-$id: http://devicetree.org/schemas/arm/idle-states.yaml#
-$schema: http://devicetree.org/meta-schemas/core.yaml#
-
-title: ARM idle states binding description
-
-maintainers:
- - Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
-
-description: |+
- ==========================================
- 1 - Introduction
- ==========================================
-
- ARM systems contain HW capable of managing power consumption dynamically,
- where cores can be put in different low-power states (ranging from simple wfi
- to power gating) according to OS PM policies. The CPU states representing the
- range of dynamic idle states that a processor can enter at run-time, can be
- specified through device tree bindings representing the parameters required to
- enter/exit specific idle states on a given processor.
-
- According to the Server Base System Architecture document (SBSA, [3]), the
- power states an ARM CPU can be put into are identified by the following list:
-
- - Running
- - Idle_standby
- - Idle_retention
- - Sleep
- - Off
-
- The power states described in the SBSA document define the basic CPU states on
- top of which ARM platforms implement power management schemes that allow an OS
- PM implementation to put the processor in different idle states (which include
- states listed above; "off" state is not an idle state since it does not have
- wake-up capabilities, hence it is not considered in this document).
-
- Idle state parameters (e.g. entry latency) are platform specific and need to
- be characterized with bindings that provide the required information to OS PM
- code so that it can build the required tables and use them at runtime.
-
- The device tree binding definition for ARM idle states is the subject of this
- document.
-
- ===========================================
- 2 - idle-states definitions
- ===========================================
-
- Idle states are characterized for a specific system through a set of
- timing and energy related properties, that underline the HW behaviour
- triggered upon idle states entry and exit.
-
- The following diagram depicts the CPU execution phases and related timing
- properties required to enter and exit an idle state:
-
- ..__[EXEC]__|__[PREP]__|__[ENTRY]__|__[IDLE]__|__[EXIT]__|__[EXEC]__..
- | | | | |
-
- |<------ entry ------->|
- | latency |
- |<- exit ->|
- | latency |
- |<-------- min-residency -------->|
- |<------- wakeup-latency ------->|
-
- Diagram 1: CPU idle state execution phases
-
- EXEC: Normal CPU execution.
-
- PREP: Preparation phase before committing the hardware to idle mode
- like cache flushing. This is abortable on pending wake-up
- event conditions. The abort latency is assumed to be negligible
- (i.e. less than the ENTRY + EXIT duration). If aborted, CPU
- goes back to EXEC. This phase is optional. If not abortable,
- this should be included in the ENTRY phase instead.
-
- ENTRY: The hardware is committed to idle mode. This period must run
- to completion up to IDLE before anything else can happen.
-
- IDLE: This is the actual energy-saving idle period. This may last
- between 0 and infinite time, until a wake-up event occurs.
-
- EXIT: Period during which the CPU is brought back to operational
- mode (EXEC).
-
- entry-latency: Worst case latency required to enter the idle state. The
- exit-latency may be guaranteed only after entry-latency has passed.
-
- min-residency: Minimum period, including preparation and entry, for a given
- idle state to be worthwhile energywise.
-
- wakeup-latency: Maximum delay between the signaling of a wake-up event and the
- CPU being able to execute normal code again. If not specified, this is assumed
- to be entry-latency + exit-latency.
-
- These timing parameters can be used by an OS in different circumstances.
-
- An idle CPU requires the expected min-residency time to select the most
- appropriate idle state based on the expected expiry time of the next IRQ
- (i.e. wake-up) that causes the CPU to return to the EXEC phase.
-
- An operating system scheduler may need to compute the shortest wake-up delay
-for CPUs in the system by detecting how long it will take to get a CPU out
- of an idle state, e.g.:
-
- wakeup-delay = exit-latency + max(entry-latency - (now - entry-timestamp), 0)
-
- In other words, the scheduler can make its scheduling decision by selecting
- (e.g. waking-up) the CPU with the shortest wake-up delay.
- The wake-up delay must take into account the entry latency if that period
- has not expired. The abortable nature of the PREP period can be ignored
- if it cannot be relied upon (e.g. the PREP deadline may occur much sooner than
- the worst case since it depends on the CPU operating conditions, i.e. caches
- state).
-
- An OS has to reliably probe the wakeup-latency since some devices can enforce
- latency constraint guarantees to work properly, so the OS has to detect the
- worst case wake-up latency it can incur if a CPU is allowed to enter an
- idle state, and possibly prevent entry in order to guarantee reliable device
- functioning.
-
- The min-residency time parameter deserves further explanation since it is
- expressed in time units but must factor in energy consumption coefficients.
-
- The energy consumption of a cpu when it enters a power state can be roughly
- characterised by the following graph:
-
- |
- |
- |
- e |
- n | /---
- e | /------
- r | /------
- g | /-----
- y | /------
- | ----
- | /|
- | / |
- | / |
- | / |
- | / |
- | / |
- |/ |
- -----|-------+----------------------------------
- 0| 1 time(ms)
-
- Graph 1: Energy vs time example
-
- The graph is split in two parts delimited by time 1ms on the X-axis.
- The graph curve with X-axis values = { x | 0 < x < 1ms } has a steep slope
- and denotes the energy costs incurred while entering and leaving the idle
- state.
- The graph curve in the area delimited by X-axis values = {x | x > 1ms } has
- shallower slope and essentially represents the energy consumption of the idle
- state.
-
- min-residency is defined for a given idle state as the minimum expected
- residency time for a state (inclusive of preparation and entry) after
-which choosing that state becomes the most energy efficient option. A good
-way to visualise this is by taking the same graph above and comparing the
-energy consumption plots of several states.
-
- For sake of simplicity, let's consider a system with two idle states IDLE1,
- and IDLE2:
-
- |
- |
- |
- | /-- IDLE1
- e | /---
- n | /----
- e | /---
- r | /-----/--------- IDLE2
- g | /-------/---------
- y | ------------ /---|
- | / /---- |
- | / /--- |
- | / /---- |
- | / /--- |
- | --- |
- | / |
- | / |
- |/ | time
- ---/----------------------------+------------------------
- |IDLE1-energy < IDLE2-energy | IDLE2-energy < IDLE1-energy
- |
- IDLE2-min-residency
-
- Graph 2: idle states min-residency example
-
- In graph 2 above, which takes into account idle states entry/exit energy
- costs, it is clear that if the idle state residency time (i.e. time till next
- wake-up IRQ) is less than IDLE2-min-residency, IDLE1 is the better idle state
- choice energywise.
-
- This is mainly down to the fact that IDLE1 entry/exit energy costs are lower
- than IDLE2.
-
- However, the lower power consumption (i.e. shallower energy curve slope) of
- idle state IDLE2 implies that after a suitable time, IDLE2 becomes more energy
- efficient.
-
- The time at which IDLE2 becomes more energy efficient than IDLE1 (and other
- shallower states in a system with multiple idle states) is defined as
- IDLE2-min-residency and corresponds to the time when energy consumption of
- IDLE1 and IDLE2 states breaks even.
-
- The definitions provided in this section underpin the idle states
- properties specification that is the subject of the following sections.
-
- ===========================================
- 3 - idle-states node
- ===========================================
-
- ARM processor idle states are defined within the idle-states node, which is
- a direct child of the cpus node [1] and provides a container where the
- processor idle states, defined as device tree nodes, are listed.
-
- On ARM systems, it is a container of processor idle states nodes. If the
- system does not provide CPU power management capabilities, or the processor
- just supports idle_standby, an idle-states node is not required.
-
- ===========================================
- 4 - References
- ===========================================
-
- [1] ARM Linux Kernel documentation - CPUs bindings
- Documentation/devicetree/bindings/arm/cpus.yaml
-
- [2] ARM Linux Kernel documentation - PSCI bindings
- Documentation/devicetree/bindings/arm/psci.yaml
-
- [3] ARM Server Base System Architecture (SBSA)
- http://infocenter.arm.com/help/index.jsp
-
- [4] ARM Architecture Reference Manuals
- http://infocenter.arm.com/help/index.jsp
-
- [6] ARM Linux Kernel documentation - Booting AArch64 Linux
- Documentation/arm64/booting.rst
-
-properties:
- $nodename:
- const: idle-states
-
- entry-method:
- description: |
- Usage and definition depend on ARM architecture version.
-
- On ARM v8 64-bit this property is required.
- On ARM 32-bit systems this property is optional.
-
- This assumes that the "enable-method" property is set to "psci" in the cpu
- node[6] that is responsible for setting up CPU idle management in the OS
- implementation.
- const: psci
-
-patternProperties:
- "^(cpu|cluster)-":
- type: object
- description: |
- Each state node represents an idle state description and must be defined
- as follows.
-
- The idle state entered by executing the wfi instruction (idle_standby
- SBSA,[3][4]) is considered standard on all ARM platforms and therefore
- must not be listed.
-
- In addition to the properties listed above, a state node may require
- additional properties specific to the entry-method defined in the
- idle-states node. Please refer to the entry-method bindings
- documentation for properties definitions.
-
- properties:
- compatible:
- const: arm,idle-state
-
- local-timer-stop:
- description:
- If present the CPU local timer control logic is
- lost on state entry, otherwise it is retained.
- type: boolean
-
- entry-latency-us:
- description:
- Worst case latency in microseconds required to enter the idle state.
-
- exit-latency-us:
- description:
- Worst case latency in microseconds required to exit the idle state.
- The exit-latency-us duration may be guaranteed only after
- entry-latency-us has passed.
-
- min-residency-us:
- description:
- Minimum residency duration in microseconds, inclusive of preparation
- and entry, for this idle state to be considered worthwhile energy wise
- (refer to section 2 of this document for a complete description).
-
- wakeup-latency-us:
- description: |
- Maximum delay between the signaling of a wake-up event and the CPU
- being able to execute normal code again. If omitted, this is assumed
- to be equal to:
-
- entry-latency-us + exit-latency-us
-
- It is important to supply this value on systems where the duration of
- PREP phase (see diagram 1, section 2) is non-negligible. In such
- systems entry-latency-us + exit-latency-us will exceed
- wakeup-latency-us by this duration.
-
- idle-state-name:
- $ref: /schemas/types.yaml#/definitions/string
- description:
- A string used as a descriptive name for the idle state.
-
- required:
- - compatible
- - entry-latency-us
- - exit-latency-us
- - min-residency-us
-
-additionalProperties: false
-
-examples:
- - |
-
- cpus {
- #size-cells = <0>;
- #address-cells = <2>;
-
- cpu@0 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x0 0x0>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_0_0 &CPU_SLEEP_0_0
- &CLUSTER_RETENTION_0 &CLUSTER_SLEEP_0>;
- };
-
- cpu@1 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x0 0x1>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_0_0 &CPU_SLEEP_0_0
- &CLUSTER_RETENTION_0 &CLUSTER_SLEEP_0>;
- };
-
- cpu@100 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x0 0x100>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_0_0 &CPU_SLEEP_0_0
- &CLUSTER_RETENTION_0 &CLUSTER_SLEEP_0>;
- };
-
- cpu@101 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x0 0x101>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_0_0 &CPU_SLEEP_0_0
- &CLUSTER_RETENTION_0 &CLUSTER_SLEEP_0>;
- };
-
- cpu@10000 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x0 0x10000>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_0_0 &CPU_SLEEP_0_0
- &CLUSTER_RETENTION_0 &CLUSTER_SLEEP_0>;
- };
-
- cpu@10001 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x0 0x10001>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_0_0 &CPU_SLEEP_0_0
- &CLUSTER_RETENTION_0 &CLUSTER_SLEEP_0>;
- };
-
- cpu@10100 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x0 0x10100>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_0_0 &CPU_SLEEP_0_0
- &CLUSTER_RETENTION_0 &CLUSTER_SLEEP_0>;
- };
-
- cpu@10101 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x0 0x10101>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_0_0 &CPU_SLEEP_0_0
- &CLUSTER_RETENTION_0 &CLUSTER_SLEEP_0>;
- };
-
- cpu@100000000 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x1 0x0>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_1_0 &CPU_SLEEP_1_0
- &CLUSTER_RETENTION_1 &CLUSTER_SLEEP_1>;
- };
-
- cpu@100000001 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x1 0x1>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_1_0 &CPU_SLEEP_1_0
- &CLUSTER_RETENTION_1 &CLUSTER_SLEEP_1>;
- };
-
- cpu@100000100 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x1 0x100>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_1_0 &CPU_SLEEP_1_0
- &CLUSTER_RETENTION_1 &CLUSTER_SLEEP_1>;
- };
-
- cpu@100000101 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x1 0x101>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_1_0 &CPU_SLEEP_1_0
- &CLUSTER_RETENTION_1 &CLUSTER_SLEEP_1>;
- };
-
- cpu@100010000 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x1 0x10000>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_1_0 &CPU_SLEEP_1_0
- &CLUSTER_RETENTION_1 &CLUSTER_SLEEP_1>;
- };
-
- cpu@100010001 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x1 0x10001>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_1_0 &CPU_SLEEP_1_0
- &CLUSTER_RETENTION_1 &CLUSTER_SLEEP_1>;
- };
-
- cpu@100010100 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x1 0x10100>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_1_0 &CPU_SLEEP_1_0
- &CLUSTER_RETENTION_1 &CLUSTER_SLEEP_1>;
- };
-
- cpu@100010101 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x1 0x10101>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_RETENTION_1_0 &CPU_SLEEP_1_0
- &CLUSTER_RETENTION_1 &CLUSTER_SLEEP_1>;
- };
-
- idle-states {
- entry-method = "psci";
-
- CPU_RETENTION_0_0: cpu-retention-0-0 {
- compatible = "arm,idle-state";
- arm,psci-suspend-param = <0x0010000>;
- entry-latency-us = <20>;
- exit-latency-us = <40>;
- min-residency-us = <80>;
- };
-
- CLUSTER_RETENTION_0: cluster-retention-0 {
- compatible = "arm,idle-state";
- local-timer-stop;
- arm,psci-suspend-param = <0x1010000>;
- entry-latency-us = <50>;
- exit-latency-us = <100>;
- min-residency-us = <250>;
- wakeup-latency-us = <130>;
- };
-
- CPU_SLEEP_0_0: cpu-sleep-0-0 {
- compatible = "arm,idle-state";
- local-timer-stop;
- arm,psci-suspend-param = <0x0010000>;
- entry-latency-us = <250>;
- exit-latency-us = <500>;
- min-residency-us = <950>;
- };
-
- CLUSTER_SLEEP_0: cluster-sleep-0 {
- compatible = "arm,idle-state";
- local-timer-stop;
- arm,psci-suspend-param = <0x1010000>;
- entry-latency-us = <600>;
- exit-latency-us = <1100>;
- min-residency-us = <2700>;
- wakeup-latency-us = <1500>;
- };
-
- CPU_RETENTION_1_0: cpu-retention-1-0 {
- compatible = "arm,idle-state";
- arm,psci-suspend-param = <0x0010000>;
- entry-latency-us = <20>;
- exit-latency-us = <40>;
- min-residency-us = <90>;
- };
-
- CLUSTER_RETENTION_1: cluster-retention-1 {
- compatible = "arm,idle-state";
- local-timer-stop;
- arm,psci-suspend-param = <0x1010000>;
- entry-latency-us = <50>;
- exit-latency-us = <100>;
- min-residency-us = <270>;
- wakeup-latency-us = <100>;
- };
-
- CPU_SLEEP_1_0: cpu-sleep-1-0 {
- compatible = "arm,idle-state";
- local-timer-stop;
- arm,psci-suspend-param = <0x0010000>;
- entry-latency-us = <70>;
- exit-latency-us = <100>;
- min-residency-us = <300>;
- wakeup-latency-us = <150>;
- };
-
- CLUSTER_SLEEP_1: cluster-sleep-1 {
- compatible = "arm,idle-state";
- local-timer-stop;
- arm,psci-suspend-param = <0x1010000>;
- entry-latency-us = <500>;
- exit-latency-us = <1200>;
- min-residency-us = <3500>;
- wakeup-latency-us = <1300>;
- };
- };
- };
-
- - |
- // Example 2 (ARM 32-bit, 8-cpu system, two clusters):
-
- cpus {
- #size-cells = <0>;
- #address-cells = <1>;
-
- cpu@0 {
- device_type = "cpu";
- compatible = "arm,cortex-a15";
- reg = <0x0>;
- cpu-idle-states = <&cpu_sleep_0_0 &cluster_sleep_0>;
- };
-
- cpu@1 {
- device_type = "cpu";
- compatible = "arm,cortex-a15";
- reg = <0x1>;
- cpu-idle-states = <&cpu_sleep_0_0 &cluster_sleep_0>;
- };
-
- cpu@2 {
- device_type = "cpu";
- compatible = "arm,cortex-a15";
- reg = <0x2>;
- cpu-idle-states = <&cpu_sleep_0_0 &cluster_sleep_0>;
- };
-
- cpu@3 {
- device_type = "cpu";
- compatible = "arm,cortex-a15";
- reg = <0x3>;
- cpu-idle-states = <&cpu_sleep_0_0 &cluster_sleep_0>;
- };
-
- cpu@100 {
- device_type = "cpu";
- compatible = "arm,cortex-a7";
- reg = <0x100>;
- cpu-idle-states = <&cpu_sleep_1_0 &cluster_sleep_1>;
- };
-
- cpu@101 {
- device_type = "cpu";
- compatible = "arm,cortex-a7";
- reg = <0x101>;
- cpu-idle-states = <&cpu_sleep_1_0 &cluster_sleep_1>;
- };
-
- cpu@102 {
- device_type = "cpu";
- compatible = "arm,cortex-a7";
- reg = <0x102>;
- cpu-idle-states = <&cpu_sleep_1_0 &cluster_sleep_1>;
- };
-
- cpu@103 {
- device_type = "cpu";
- compatible = "arm,cortex-a7";
- reg = <0x103>;
- cpu-idle-states = <&cpu_sleep_1_0 &cluster_sleep_1>;
- };
-
- idle-states {
- cpu_sleep_0_0: cpu-sleep-0-0 {
- compatible = "arm,idle-state";
- local-timer-stop;
- entry-latency-us = <200>;
- exit-latency-us = <100>;
- min-residency-us = <400>;
- wakeup-latency-us = <250>;
- };
-
- cluster_sleep_0: cluster-sleep-0 {
- compatible = "arm,idle-state";
- local-timer-stop;
- entry-latency-us = <500>;
- exit-latency-us = <1500>;
- min-residency-us = <2500>;
- wakeup-latency-us = <1700>;
- };
-
- cpu_sleep_1_0: cpu-sleep-1-0 {
- compatible = "arm,idle-state";
- local-timer-stop;
- entry-latency-us = <300>;
- exit-latency-us = <500>;
- min-residency-us = <900>;
- wakeup-latency-us = <600>;
- };
-
- cluster_sleep_1: cluster-sleep-1 {
- compatible = "arm,idle-state";
- local-timer-stop;
- entry-latency-us = <800>;
- exit-latency-us = <2000>;
- min-residency-us = <6500>;
- wakeup-latency-us = <2300>;
- };
- };
- };
-
-...
diff --git a/Documentation/devicetree/bindings/arm/intel,keembay.yaml b/Documentation/devicetree/bindings/arm/intel,keembay.yaml
index 107e686ab207..53d2ce02b207 100644
--- a/Documentation/devicetree/bindings/arm/intel,keembay.yaml
+++ b/Documentation/devicetree/bindings/arm/intel,keembay.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/intel,keembay.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Keem Bay platform device tree bindings
+title: Keem Bay platform
maintainers:
- Paul J. Murphy <paul.j.murphy@intel.com>
diff --git a/Documentation/devicetree/bindings/arm/intel,socfpga.yaml b/Documentation/devicetree/bindings/arm/intel,socfpga.yaml
new file mode 100644
index 000000000000..4b4dcf551eb6
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/intel,socfpga.yaml
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/intel,socfpga.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Intel SoCFPGA platform
+
+maintainers:
+ - Dinh Nguyen <dinguyen@kernel.org>
+
+properties:
+ $nodename:
+ const: "/"
+ compatible:
+ oneOf:
+ - description: AgileX boards
+ items:
+ - enum:
+ - intel,n5x-socdk
+ - intel,socfpga-agilex-n6000
+ - intel,socfpga-agilex-socdk
+ - const: intel,socfpga-agilex
+
+additionalProperties: true
+
+...
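For orientation, a board device tree matching the AgileX branch above sets its root-node compatible most-specific first, exactly as the items list requires (a minimal sketch; the model string is illustrative):

    / {
        compatible = "intel,socfpga-agilex-socdk", "intel,socfpga-agilex";
        model = "Intel SoCFPGA Agilex SoCDK";
        #address-cells = <2>;
        #size-cells = <2>;
    };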
diff --git a/Documentation/devicetree/bindings/arm/intel-ixp4xx.yaml b/Documentation/devicetree/bindings/arm/intel-ixp4xx.yaml
index 230bffeec0e5..553dcbc70e35 100644
--- a/Documentation/devicetree/bindings/arm/intel-ixp4xx.yaml
+++ b/Documentation/devicetree/bindings/arm/intel-ixp4xx.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/intel-ixp4xx.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Intel IXP4xx Device Tree Bindings
+title: Intel IXP4xx
maintainers:
- Linus Walleij <linus.walleij@linaro.org>
diff --git a/Documentation/devicetree/bindings/arm/keystone/ti,k3-sci-common.yaml b/Documentation/devicetree/bindings/arm/keystone/ti,k3-sci-common.yaml
index 5cbcacaeb441..ff378d5cbd32 100644
--- a/Documentation/devicetree/bindings/arm/keystone/ti,k3-sci-common.yaml
+++ b/Documentation/devicetree/bindings/arm/keystone/ti,k3-sci-common.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/keystone/ti,k3-sci-common.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Common K3 TI-SCI bindings
+title: Common K3 TI-SCI
maintainers:
- Nishanth Menon <nm@ti.com>
diff --git a/Documentation/devicetree/bindings/arm/keystone/ti,sci.yaml b/Documentation/devicetree/bindings/arm/keystone/ti,sci.yaml
index 34f5f877d444..91b96065f7df 100644
--- a/Documentation/devicetree/bindings/arm/keystone/ti,sci.yaml
+++ b/Documentation/devicetree/bindings/arm/keystone/ti,sci.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/keystone/ti,sci.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: TI-SCI controller device node bindings
+title: TI-SCI controller
maintainers:
- Nishanth Menon <nm@ti.com>
diff --git a/Documentation/devicetree/bindings/arm/linux,dummy-virt.yaml b/Documentation/devicetree/bindings/arm/linux,dummy-virt.yaml
new file mode 100644
index 000000000000..c7c5eb48fc7e
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/linux,dummy-virt.yaml
@@ -0,0 +1,20 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/linux,dummy-virt.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: QEMU virt machine
+
+maintainers:
+ - Rob Herring <robh@kernel.org>
+
+properties:
+ $nodename:
+ const: "/"
+ compatible:
+ const: linux,dummy-virt
+
+additionalProperties: true
+
+...
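The schema is intentionally minimal: QEMU's virt machine synthesizes its own device tree at run time, so the only stable contract is the root compatible. A matching root node would be (sketch only):

    / {
        compatible = "linux,dummy-virt";
        #address-cells = <2>;
        #size-cells = <2>;
    };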
diff --git a/Documentation/devicetree/bindings/arm/marvell/ap80x-system-controller.txt b/Documentation/devicetree/bindings/arm/marvell/ap80x-system-controller.txt
index 052a967c1f28..c83245065d44 100644
--- a/Documentation/devicetree/bindings/arm/marvell/ap80x-system-controller.txt
+++ b/Documentation/devicetree/bindings/arm/marvell/ap80x-system-controller.txt
@@ -72,7 +72,7 @@ mpp19 19 gpio, uart0(rxd), sdio(pw_off)
GPIO:
-----
For common binding part and usage, refer to
-Documentation/devicetree/bindings/gpio/gpio-mvebu.txt.
+Documentation/devicetree/bindings/gpio/gpio-mvebu.yaml.
Required properties:
diff --git a/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt b/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt
index f6d6642d81c0..29fa93dad52b 100644
--- a/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt
+++ b/Documentation/devicetree/bindings/arm/marvell/armada-37xx.txt
@@ -1,21 +1,3 @@
-Marvell Armada 37xx Platforms Device Tree Bindings
---------------------------------------------------
-
-Boards using a SoC of the Marvell Armada 37xx family must carry the
-following root node property:
-
- - compatible: must contain "marvell,armada3710"
-
-In addition, boards using the Marvell Armada 3720 SoC shall have the
-following property before the previous one:
-
- - compatible: must contain "marvell,armada3720"
-
-Example:
-
-compatible = "marvell,armada-3720-db", "marvell,armada3720", "marvell,armada3710";
-
-
Power management
----------------
@@ -48,11 +30,3 @@ avs: avs@11500 {
compatible = "marvell,armada-3700-avs", "syscon";
reg = <0x11500 0x40>;
}
-
-
-CZ.NIC's Turris Mox SOHO router Device Tree Bindings
-----------------------------------------------------
-
-Required root node property:
-
- - compatible: must contain "cznic,turris-mox"
diff --git a/Documentation/devicetree/bindings/arm/marvell/armada-37xx.yaml b/Documentation/devicetree/bindings/arm/marvell/armada-37xx.yaml
new file mode 100644
index 000000000000..6905d29f3108
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/marvell/armada-37xx.yaml
@@ -0,0 +1,47 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/marvell/armada-37xx.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Marvell Armada 37xx Platforms
+
+maintainers:
+ - Robert Marko <robert.marko@sartura.hr>
+
+properties:
+ $nodename:
+ const: '/'
+ compatible:
+ oneOf:
+ - description: Armada 3720 SoC boards
+ items:
+ - enum:
+ - cznic,turris-mox
+ - globalscale,espressobin
+ - marvell,armada-3720-db
+ - methode,edpu
+ - methode,udpu
+ - const: marvell,armada3720
+ - const: marvell,armada3710
+
+ - description: Globalscale Espressobin boards
+ items:
+ - enum:
+ - globalscale,espressobin-emmc
+ - globalscale,espressobin-ultra
+ - globalscale,espressobin-v7
+ - const: globalscale,espressobin
+ - const: marvell,armada3720
+ - const: marvell,armada3710
+
+ - description: Globalscale Espressobin V7 boards
+ items:
+ - enum:
+ - globalscale,espressobin-v7-emmc
+ - const: globalscale,espressobin-v7
+ - const: globalscale,espressobin
+ - const: marvell,armada3720
+ - const: marvell,armada3710
+
+additionalProperties: true
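The three oneOf branches nest: an Espressobin V7 eMMC board matches the third branch, which embeds the compatibles of the second and first. Spelled out as a root-node property, derived directly from the enums above:

    / {
        compatible = "globalscale,espressobin-v7-emmc",
                     "globalscale,espressobin-v7",
                     "globalscale,espressobin",
                     "marvell,armada3720",
                     "marvell,armada3710";
    };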
diff --git a/Documentation/devicetree/bindings/arm/marvell/armada-7k-8k.yaml b/Documentation/devicetree/bindings/arm/marvell/armada-7k-8k.yaml
index e9bf3054529f..52d78521e412 100644
--- a/Documentation/devicetree/bindings/arm/marvell/armada-7k-8k.yaml
+++ b/Documentation/devicetree/bindings/arm/marvell/armada-7k-8k.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/marvell/armada-7k-8k.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Marvell Armada 7K/8K Platforms Device Tree Bindings
+title: Marvell Armada 7K/8K Platforms
maintainers:
- Gregory CLEMENT <gregory.clement@bootlin.com>
diff --git a/Documentation/devicetree/bindings/arm/marvell/cp110-system-controller.txt b/Documentation/devicetree/bindings/arm/marvell/cp110-system-controller.txt
index 0705e765f432..d84105c7c935 100644
--- a/Documentation/devicetree/bindings/arm/marvell/cp110-system-controller.txt
+++ b/Documentation/devicetree/bindings/arm/marvell/cp110-system-controller.txt
@@ -156,7 +156,7 @@ GPIO:
-----
For common binding part and usage, refer to
-Documentation/devicetree/bindings/gpio/gpio-mvebu.txt.
+Documentation/devicetree/bindings/gpio/gpio-mvebu.yaml.
Required properties:
diff --git a/Documentation/devicetree/bindings/arm/marvell/marvell,ac5.yaml b/Documentation/devicetree/bindings/arm/marvell/marvell,ac5.yaml
new file mode 100644
index 000000000000..8960fb8b2b2f
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/marvell/marvell,ac5.yaml
@@ -0,0 +1,32 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/marvell/marvell,ac5.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Marvell Alleycat5/5X Platforms
+
+maintainers:
+ - Chris Packham <chris.packham@alliedtelesis.co.nz>
+
+properties:
+ $nodename:
+ const: '/'
+ compatible:
+ oneOf:
+ - description: Alleycat5 (98DX25xx) Reference Design
+ items:
+ - enum:
+ - marvell,rd-ac5
+ - const: marvell,ac5
+
+ - description: Alleycat5X (98DX35xx) Reference Design
+ items:
+ - enum:
+ - marvell,rd-ac5x
+ - const: marvell,ac5x
+ - const: marvell,ac5
+
+additionalProperties: true
+
+...
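As in the Armada 37xx schema, the Alleycat5X branch chains down to the plain Alleycat5 compatible, so an RD-AC5X board declares (per the enums above):

    / {
        compatible = "marvell,rd-ac5x", "marvell,ac5x", "marvell,ac5";
    };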
diff --git a/Documentation/devicetree/bindings/arm/mediatek.yaml b/Documentation/devicetree/bindings/arm/mediatek.yaml
index 80a05f6fee85..ae12b1cab9fb 100644
--- a/Documentation/devicetree/bindings/arm/mediatek.yaml
+++ b/Documentation/devicetree/bindings/arm/mediatek.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/mediatek.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: MediaTek SoC based Platforms Device Tree Bindings
+title: MediaTek SoC based Platforms
maintainers:
- Sean Wang <sean.wang@mediatek.com>
@@ -32,6 +32,11 @@ properties:
- const: mediatek,mt6580
- items:
- enum:
+ - prestigio,pmt5008-3g
+ - const: mediatek,mt6582
+ - items:
+ - enum:
+ - fairphone,fp1
- mundoreader,bq-aquaris5
- const: mediatek,mt6589
- items:
@@ -53,6 +58,7 @@ properties:
- items:
- enum:
- mediatek,mt6795-evb
+ - sony,xperia-m5
- const: mediatek,mt6795
- items:
- enum:
@@ -78,6 +84,15 @@ properties:
- const: mediatek,mt7629
- items:
- enum:
+ - bananapi,bpi-r3
+ - mediatek,mt7986a-rfb
+ - const: mediatek,mt7986a
+ - items:
+ - enum:
+ - mediatek,mt7986b-rfb
+ - const: mediatek,mt7986b
+ - items:
+ - enum:
- mediatek,mt8127-moose
- const: mediatek,mt8127
- items:
@@ -118,8 +133,43 @@ properties:
- enum:
- mediatek,mt8183-evb
- const: mediatek,mt8183
+ - description: Google Hayato
+ items:
+ - const: google,hayato-rev1
+ - const: google,hayato
+ - const: mediatek,mt8192
+ - description: Google Spherion (Acer Chromebook 514)
+ items:
+ - const: google,spherion-rev3
+ - const: google,spherion-rev2
+ - const: google,spherion-rev1
+ - const: google,spherion-rev0
+ - const: google,spherion
+ - const: mediatek,mt8192
+ - description: Acer Tomato (Acer Chromebook Spin 513 CP513-2H)
+ items:
+ - enum:
+ - google,tomato-rev2
+ - google,tomato-rev1
+ - const: google,tomato
+ - const: mediatek,mt8195
+ - description: Acer Tomato rev3 - 4 (Acer Chromebook Spin 513 CP513-2H)
+ items:
+ - const: google,tomato-rev4
+ - const: google,tomato-rev3
+ - const: google,tomato
+ - const: mediatek,mt8195
- items:
- enum:
+ - mediatek,mt8186-evb
+ - const: mediatek,mt8186
+ - items:
+ - enum:
+ - mediatek,mt8192-evb
+ - const: mediatek,mt8192
+ - items:
+ - enum:
+ - mediatek,mt8195-demo
- mediatek,mt8195-evb
- const: mediatek,mt8195
- description: Google Burnet (HP Chromebook x360 11MK G3 EE)
@@ -133,6 +183,10 @@ properties:
- google,krane-sku176
- const: google,krane
- const: mediatek,mt8183
+ - description: Google Cozmo (Acer Chromebook 314)
+ items:
+ - const: google,cozmo
+ - const: mediatek,mt8183
- description: Google Damu (ASUS Chromebook Flip CM3)
items:
- const: google,damu
@@ -142,7 +196,9 @@ properties:
- enum:
- google,fennel-sku0
- google,fennel-sku1
+ - google,fennel-sku2
- google,fennel-sku6
+ - google,fennel-sku7
- const: google,fennel
- const: mediatek,mt8183
- description: Google Juniper (Acer Chromebook Spin 311) / Kenzo (Acer Chromebook 311)
@@ -158,6 +214,12 @@ properties:
- const: google,kakadu-rev2
- const: google,kakadu
- const: mediatek,mt8183
+ - description: Google Kakadu (ASUS Chromebook Detachable CM3)
+ items:
+ - const: google,kakadu-rev3-sku22
+ - const: google,kakadu-rev2-sku22
+ - const: google,kakadu
+ - const: mediatek,mt8183
- description: Google Kappa (HP Chromebook 11a)
items:
- const: google,kappa
@@ -184,6 +246,10 @@ properties:
- const: mediatek,mt8183
- items:
- enum:
+ - mediatek,mt8365-evk
+ - const: mediatek,mt8365
+ - items:
+ - enum:
- mediatek,mt8516-pumpkin
- const: mediatek,mt8516
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,apmixedsys.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,apmixedsys.txt
deleted file mode 100644
index ea827e8763de..000000000000
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,apmixedsys.txt
+++ /dev/null
@@ -1,34 +0,0 @@
-Mediatek apmixedsys controller
-==============================
-
-The Mediatek apmixedsys controller provides the PLLs to the system.
-
-Required Properties:
-
-- compatible: Should be one of:
- - "mediatek,mt2701-apmixedsys"
- - "mediatek,mt2712-apmixedsys", "syscon"
- - "mediatek,mt6765-apmixedsys", "syscon"
- - "mediatek,mt6779-apmixedsys", "syscon"
- - "mediatek,mt6797-apmixedsys"
- - "mediatek,mt7622-apmixedsys"
- - "mediatek,mt7623-apmixedsys", "mediatek,mt2701-apmixedsys"
- - "mediatek,mt7629-apmixedsys"
- - "mediatek,mt8135-apmixedsys"
- - "mediatek,mt8167-apmixedsys", "syscon"
- - "mediatek,mt8173-apmixedsys"
- - "mediatek,mt8183-apmixedsys", "syscon"
- - "mediatek,mt8516-apmixedsys"
-- #clock-cells: Must be 1
-
-The apmixedsys controller uses the common clk binding from
-Documentation/devicetree/bindings/clock/clock-bindings.txt
-The available clocks are defined in dt-bindings/clock/mt*-clk.h.
-
-Example:
-
-apmixedsys: clock-controller@10209000 {
- compatible = "mediatek,mt8173-apmixedsys";
- reg = <0 0x10209000 0 0x1000>;
- #clock-cells = <1>;
-};
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,ethsys.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,ethsys.txt
index 6b7e8067e7aa..eccd4b706a78 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,ethsys.txt
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,ethsys.txt
@@ -10,6 +10,8 @@ Required Properties:
- "mediatek,mt7622-ethsys", "syscon"
- "mediatek,mt7623-ethsys", "mediatek,mt2701-ethsys", "syscon"
- "mediatek,mt7629-ethsys", "syscon"
+ - "mediatek,mt7981-ethsys", "syscon"
+ - "mediatek,mt7986-ethsys", "syscon"
- #clock-cells: Must be 1
- #reset-cells: Must be 1
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.txt
deleted file mode 100644
index eb3523c7a7be..000000000000
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.txt
+++ /dev/null
@@ -1,41 +0,0 @@
-Mediatek infracfg controller
-============================
-
-The Mediatek infracfg controller provides various clocks and reset
-outputs to the system.
-
-Required Properties:
-
-- compatible: Should be one of:
- - "mediatek,mt2701-infracfg", "syscon"
- - "mediatek,mt2712-infracfg", "syscon"
- - "mediatek,mt6765-infracfg", "syscon"
- - "mediatek,mt6779-infracfg_ao", "syscon"
- - "mediatek,mt6797-infracfg", "syscon"
- - "mediatek,mt7622-infracfg", "syscon"
- - "mediatek,mt7623-infracfg", "mediatek,mt2701-infracfg", "syscon"
- - "mediatek,mt7629-infracfg", "syscon"
- - "mediatek,mt8135-infracfg", "syscon"
- - "mediatek,mt8167-infracfg", "syscon"
- - "mediatek,mt8173-infracfg", "syscon"
- - "mediatek,mt8183-infracfg", "syscon"
- - "mediatek,mt8516-infracfg", "syscon"
-- #clock-cells: Must be 1
-- #reset-cells: Must be 1
-
-The infracfg controller uses the common clk binding from
-Documentation/devicetree/bindings/clock/clock-bindings.txt
-The available clocks are defined in dt-bindings/clock/mt*-clk.h.
-Also it uses the common reset controller binding from
-Documentation/devicetree/bindings/reset/reset.txt.
-The available reset outputs are defined in
-dt-bindings/reset/mt*-resets.h
-
-Example:
-
-infracfg: power-controller@10001000 {
- compatible = "mediatek,mt8173-infracfg", "syscon";
- reg = <0 0x10001000 0 0x1000>;
- #clock-cells = <1>;
- #reset-cells = <1>;
-};
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.yaml
new file mode 100644
index 000000000000..ea98043c6ba3
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,infracfg.yaml
@@ -0,0 +1,84 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,infracfg.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Infrastructure System Configuration Controller
+
+maintainers:
+ - Matthias Brugger <matthias.bgg@gmail.com>
+
+description:
+ The MediaTek infracfg controller provides various clocks and reset outputs
+ to the system. The clock values can be found in <dt-bindings/clock/mt*-clk.h>,
+ and reset values in <dt-bindings/reset/mt*-reset.h> and
+ <dt-bindings/reset/mt*-resets.h>.
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - mediatek,mt2701-infracfg
+ - mediatek,mt2712-infracfg
+ - mediatek,mt6765-infracfg
+ - mediatek,mt6795-infracfg
+ - mediatek,mt6779-infracfg_ao
+ - mediatek,mt6797-infracfg
+ - mediatek,mt7622-infracfg
+ - mediatek,mt7629-infracfg
+ - mediatek,mt7981-infracfg
+ - mediatek,mt7986-infracfg
+ - mediatek,mt8135-infracfg
+ - mediatek,mt8167-infracfg
+ - mediatek,mt8173-infracfg
+ - mediatek,mt8183-infracfg
+ - mediatek,mt8516-infracfg
+ - const: syscon
+ - items:
+ - const: mediatek,mt7623-infracfg
+ - const: mediatek,mt2701-infracfg
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - mediatek,mt2701-infracfg
+ - mediatek,mt2712-infracfg
+ - mediatek,mt6795-infracfg
+ - mediatek,mt7622-infracfg
+ - mediatek,mt7986-infracfg
+ - mediatek,mt8135-infracfg
+ - mediatek,mt8173-infracfg
+ - mediatek,mt8183-infracfg
+then:
+ required:
+ - '#reset-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ infracfg: clock-controller@10001000 {
+ compatible = "mediatek,mt8173-infracfg", "syscon";
+ reg = <0x10001000 0x1000>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
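Consumers reference the controller through the standard clock and reset bindings, with indices taken from the dt-bindings headers named in the description. A hedged sketch (the consumer node and both macro names are placeholders, not taken from a real mt*-clk.h or mt*-resets.h):

    some-device@11002000 {
        compatible = "vendor,example-device";       /* hypothetical consumer */
        reg = <0x11002000 0x1000>;
        clocks = <&infracfg CLK_INFRA_EXAMPLE>;     /* gate index from <dt-bindings/clock/mt*-clk.h> */
        resets = <&infracfg MT_INFRA_EXAMPLE_RST>;  /* reset index from <dt-bindings/reset/mt*-resets.h> */
    };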
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
index f9ffa5b703a5..536f5a5ebd24 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
@@ -1,8 +1,8 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/arm/mediatek/mediatek,mmsys.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,mmsys.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: MediaTek mmsys controller
@@ -25,24 +25,71 @@ properties:
- mediatek,mt2712-mmsys
- mediatek,mt6765-mmsys
- mediatek,mt6779-mmsys
+ - mediatek,mt6795-mmsys
- mediatek,mt6797-mmsys
- mediatek,mt8167-mmsys
- mediatek,mt8173-mmsys
- mediatek,mt8183-mmsys
+ - mediatek,mt8186-mmsys
+ - mediatek,mt8188-vdosys0
- mediatek,mt8192-mmsys
+ - mediatek,mt8195-vdosys1
+ - mediatek,mt8195-vppsys0
+ - mediatek,mt8195-vppsys1
- mediatek,mt8365-mmsys
- const: syscon
+
+ - description: vdosys0 and vdosys1 are two display HW pipelines,
+ so the bare mt8195-mmsys binding is deprecated.
+ deprecated: true
+ items:
+ - const: mediatek,mt8195-mmsys
+ - const: syscon
+
- items:
- const: mediatek,mt7623-mmsys
- const: mediatek,mt2701-mmsys
- const: syscon
+ - items:
+ - const: mediatek,mt8195-vdosys0
+ - const: mediatek,mt8195-mmsys
+ - const: syscon
+
reg:
maxItems: 1
+ power-domains:
+ description:
+ A phandle and PM domain specifier as defined by bindings
+ of the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ mboxes:
+ description:
+ Mailboxes used to communicate with the GCE; a list of phandle and
+ mailbox specifier pairs. See
+ Documentation/devicetree/bindings/mailbox/mediatek,gce-mailbox.yaml
+ for details.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+
+ mediatek,gce-client-reg:
+ description:
+ The register of a client driver can be configured by the GCE with the
+ 4 arguments defined in this property: the phandle of the GCE, the
+ subsys id, the register offset and the size.
+ Each subsys id maps to a base address of the display function block
+ registers, which is defined in the GCE header
+ include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
"#clock-cells":
const: 1
+ '#reset-cells':
+ const: 1
+
required:
- compatible
- reg
@@ -52,8 +99,16 @@ additionalProperties: false
examples:
- |
+ #include <dt-bindings/power/mt8173-power.h>
+ #include <dt-bindings/gce/mt8173-gce.h>
+
mmsys: syscon@14000000 {
compatible = "mediatek,mt8173-mmsys", "syscon";
reg = <0x14000000 0x1000>;
+ power-domains = <&spm MT8173_POWER_DOMAIN_MM>;
#clock-cells = <1>;
+ #reset-cells = <1>;
+ mboxes = <&gce 0 CMDQ_THR_PRIO_HIGHEST>,
+ <&gce 1 CMDQ_THR_PRIO_HIGHEST>;
+ mediatek,gce-client-reg = <&gce SUBSYS_1400XXXX 0 0x1000>;
};
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-pcie-mirror.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-pcie-mirror.yaml
new file mode 100644
index 000000000000..d89848a8f478
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-pcie-mirror.yaml
@@ -0,0 +1,42 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,mt7622-pcie-mirror.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek PCIE Mirror Controller for MT7622
+
+maintainers:
+ - Lorenzo Bianconi <lorenzo@kernel.org>
+ - Felix Fietkau <nbd@nbd.name>
+
+description:
+ The MediaTek PCIe mirror provides a configuration interface for the
+ PCIe controller on the MT7622 SoC.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt7622-pcie-mirror
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+ pcie_mirror: pcie-mirror@10000400 {
+ compatible = "mediatek,mt7622-pcie-mirror", "syscon";
+ reg = <0 0x10000400 0 0x10>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-wed.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-wed.yaml
new file mode 100644
index 000000000000..28ded09d72e3
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-wed.yaml
@@ -0,0 +1,104 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,mt7622-wed.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Wireless Ethernet Dispatch Controller for MT7622
+
+maintainers:
+ - Lorenzo Bianconi <lorenzo@kernel.org>
+ - Felix Fietkau <nbd@nbd.name>
+
+description:
+ The MediaTek Wireless Ethernet Dispatch controller can be configured
+ to intercept and handle access to the WLAN DMA queues and PCIe
+ interrupts, and to implement hardware flow offloading from Ethernet
+ to WLAN.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt7622-wed
+ - mediatek,mt7981-wed
+ - mediatek,mt7986-wed
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ memory-region:
+ items:
+ - description: firmware EMI region
+ - description: firmware ILM region
+ - description: firmware DLM region
+ - description: firmware CPU DATA region
+ - description: firmware BOOT region
+
+ memory-region-names:
+ items:
+ - const: wo-emi
+ - const: wo-ilm
+ - const: wo-dlm
+ - const: wo-data
+ - const: wo-boot
+
+ mediatek,wo-ccif:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: MediaTek WED-WO controller interface.
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: mediatek,mt7622-wed
+ then:
+ properties:
+ memory-region-names: false
+ memory-region: false
+ mediatek,wo-ccif: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+ wed0: wed@1020a000 {
+ compatible = "mediatek,mt7622-wed","syscon";
+ reg = <0 0x1020a000 0 0x1000>;
+ interrupts = <GIC_SPI 214 IRQ_TYPE_LEVEL_LOW>;
+ };
+ };
+
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ wed@15010000 {
+ compatible = "mediatek,mt7986-wed", "syscon";
+ reg = <0 0x15010000 0 0x1000>;
+ interrupts = <GIC_SPI 205 IRQ_TYPE_LEVEL_HIGH>;
+
+ memory-region = <&wo_emi>, <&wo_ilm>, <&wo_dlm>,
+ <&wo_data>, <&wo_boot>;
+ memory-region-names = "wo-emi", "wo-ilm", "wo-dlm",
+ "wo-data", "wo-boot";
+ mediatek,wo-ccif = <&wo_ccif0>;
+ };
+ };
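On the consumer side, the SoC's Ethernet node points at the WED instances so the network driver can program the offload path; a sketch, assuming the mediatek,wed phandle-list property from the MediaTek Ethernet binding:

    &eth {
        mediatek,wed = <&wed0>, <&wed1>;  /* phandles to the WED instances */
    };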
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7986-wed-pcie.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7986-wed-pcie.yaml
new file mode 100644
index 000000000000..82f64469a601
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7986-wed-pcie.yaml
@@ -0,0 +1,43 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,mt7986-wed-pcie.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek PCIE WED Controller for MT7986
+
+maintainers:
+ - Lorenzo Bianconi <lorenzo@kernel.org>
+ - Felix Fietkau <nbd@nbd.name>
+
+description:
+ The MediaTek WED PCIe block provides a configuration interface for the
+ PCIe controller on the MT7986 SoC.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt7986-wed-pcie
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+ wed_pcie: wed-pcie@10003000 {
+ compatible = "mediatek,mt7986-wed-pcie",
+ "syscon";
+ reg = <0 0x10003000 0 0x10>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8186-clock.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8186-clock.yaml
new file mode 100644
index 000000000000..7cd14b163abe
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8186-clock.yaml
@@ -0,0 +1,56 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,mt8186-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Functional Clock Controller for MT8186
+
+maintainers:
+ - Chun-Jie Chen <chun-jie.chen@mediatek.com>
+
+description: |
+ The MediaTek clock architecture looks like this:
+ PLLs --> dividers --> muxes --> clock gate
+
+ The devices provide clock gate control in different IP blocks.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt8186-imp_iic_wrap
+ - mediatek,mt8186-mfgsys
+ - mediatek,mt8186-wpesys
+ - mediatek,mt8186-imgsys1
+ - mediatek,mt8186-imgsys2
+ - mediatek,mt8186-vdecsys
+ - mediatek,mt8186-vencsys
+ - mediatek,mt8186-camsys
+ - mediatek,mt8186-camsys_rawa
+ - mediatek,mt8186-camsys_rawb
+ - mediatek,mt8186-mdpsys
+ - mediatek,mt8186-ipesys
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ imp_iic_wrap: clock-controller@11017000 {
+ compatible = "mediatek,mt8186-imp_iic_wrap";
+ reg = <0x11017000 0x1000>;
+ #clock-cells = <1>;
+ };
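Peripheral nodes consume the gates by phandle plus clock index from <dt-bindings/clock/mt8186-clk.h>. A hedged sketch (the consumer node and the macro name are illustrative, not taken from the header):

    i2c@11007000 {
        compatible = "vendor,example-i2c";                  /* hypothetical consumer */
        reg = <0x11007000 0x1000>;
        clocks = <&imp_iic_wrap CLK_IMP_IIC_WRAP_EXAMPLE>;  /* gate index from mt8186-clk.h */
    };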
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8186-sys-clock.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8186-sys-clock.yaml
new file mode 100644
index 000000000000..64c769416690
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8186-sys-clock.yaml
@@ -0,0 +1,57 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,mt8186-sys-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek System Clock Controller for MT8186
+
+maintainers:
+ - Chun-Jie Chen <chun-jie.chen@mediatek.com>
+
+description: |
+ The MediaTek clock architecture looks like this:
+ PLLs --> dividers --> muxes --> clock gate
+
+ The apmixedsys provides most of the PLLs, which are generated from the SoC's 26 MHz clock.
+ The topckgen provides the dividers and muxes that supply the clock source to other IP blocks.
+ The infracfg_ao provides the clock gates in the peripheral and infrastructure IP blocks.
+ The mcusys provides the mux control to select the clock source of the AP MCU.
+ The device nodes also provide system control capability for configuration.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt8186-mcusys
+ - mediatek,mt8186-topckgen
+ - mediatek,mt8186-infracfg_ao
+ - mediatek,mt8186-apmixedsys
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ topckgen: syscon@10000000 {
+ compatible = "mediatek,mt8186-topckgen", "syscon";
+ reg = <0x10000000 0x1000>;
+ #clock-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8192-clock.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8192-clock.yaml
index c8c67c033f8c..dff4c8e8fd4b 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8192-clock.yaml
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8192-clock.yaml
@@ -1,8 +1,8 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/arm/mediatek/mediatek,mt8192-clock.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,mt8192-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: MediaTek Functional Clock Controller for MT8192
@@ -24,7 +24,6 @@ properties:
- mediatek,mt8192-imp_iic_wrap_w
- mediatek,mt8192-imp_iic_wrap_n
- mediatek,mt8192-msdc_top
- - mediatek,mt8192-msdc
- mediatek,mt8192-mfgcfg
- mediatek,mt8192-imgsys
- mediatek,mt8192-imgsys2
@@ -108,13 +107,6 @@ examples:
};
- |
- msdc: clock-controller@11f60000 {
- compatible = "mediatek,mt8192-msdc";
- reg = <0x11f60000 0x1000>;
- #clock-cells = <1>;
- };
-
- - |
mfgcfg: clock-controller@13fbf000 {
compatible = "mediatek,mt8192-mfgcfg";
reg = <0x13fbf000 0x1000>;
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8192-sys-clock.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8192-sys-clock.yaml
index 5705bcf1fe47..8d608fddf3f9 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8192-sys-clock.yaml
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8192-sys-clock.yaml
@@ -1,8 +1,8 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/arm/mediatek/mediatek,mt8192-sys-clock.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,mt8192-sys-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: MediaTek System Clock Controller for MT8192
@@ -29,6 +29,9 @@ properties:
'#clock-cells':
const: 1
+ '#reset-cells':
+ const: 1
+
required:
- compatible
- reg
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8195-clock.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8195-clock.yaml
new file mode 100644
index 000000000000..d17164b0b13e
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8195-clock.yaml
@@ -0,0 +1,238 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,mt8195-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Functional Clock Controller for MT8195
+
+maintainers:
+ - Chun-Jie Chen <chun-jie.chen@mediatek.com>
+
+description:
+ The MediaTek clock architecture looks like this:
+ PLLs --> dividers --> muxes --> clock gate
+
+ The devices except apusys_pll provide clock gate control in different IP blocks.
+ The apusys_pll provides the PLLs, generated from the SoC's 26 MHz clock, for the AI Processing Unit.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt8195-scp_adsp
+ - mediatek,mt8195-imp_iic_wrap_s
+ - mediatek,mt8195-imp_iic_wrap_w
+ - mediatek,mt8195-mfgcfg
+ - mediatek,mt8195-wpesys
+ - mediatek,mt8195-wpesys_vpp0
+ - mediatek,mt8195-wpesys_vpp1
+ - mediatek,mt8195-imgsys
+ - mediatek,mt8195-imgsys1_dip_top
+ - mediatek,mt8195-imgsys1_dip_nr
+ - mediatek,mt8195-imgsys1_wpe
+ - mediatek,mt8195-ipesys
+ - mediatek,mt8195-camsys
+ - mediatek,mt8195-camsys_rawa
+ - mediatek,mt8195-camsys_yuva
+ - mediatek,mt8195-camsys_rawb
+ - mediatek,mt8195-camsys_yuvb
+ - mediatek,mt8195-camsys_mraw
+ - mediatek,mt8195-ccusys
+ - mediatek,mt8195-vdecsys_soc
+ - mediatek,mt8195-vdecsys
+ - mediatek,mt8195-vdecsys_core1
+ - mediatek,mt8195-vencsys
+ - mediatek,mt8195-vencsys_core1
+ - mediatek,mt8195-apusys_pll
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ scp_adsp: clock-controller@10720000 {
+ compatible = "mediatek,mt8195-scp_adsp";
+ reg = <0x10720000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ imp_iic_wrap_s: clock-controller@11d03000 {
+ compatible = "mediatek,mt8195-imp_iic_wrap_s";
+ reg = <0x11d03000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ imp_iic_wrap_w: clock-controller@11e05000 {
+ compatible = "mediatek,mt8195-imp_iic_wrap_w";
+ reg = <0x11e05000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ mfgcfg: clock-controller@13fbf000 {
+ compatible = "mediatek,mt8195-mfgcfg";
+ reg = <0x13fbf000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ wpesys: clock-controller@14e00000 {
+ compatible = "mediatek,mt8195-wpesys";
+ reg = <0x14e00000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ wpesys_vpp0: clock-controller@14e02000 {
+ compatible = "mediatek,mt8195-wpesys_vpp0";
+ reg = <0x14e02000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ wpesys_vpp1: clock-controller@14e03000 {
+ compatible = "mediatek,mt8195-wpesys_vpp1";
+ reg = <0x14e03000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ imgsys: clock-controller@15000000 {
+ compatible = "mediatek,mt8195-imgsys";
+ reg = <0x15000000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ imgsys1_dip_top: clock-controller@15110000 {
+ compatible = "mediatek,mt8195-imgsys1_dip_top";
+ reg = <0x15110000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ imgsys1_dip_nr: clock-controller@15130000 {
+ compatible = "mediatek,mt8195-imgsys1_dip_nr";
+ reg = <0x15130000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ imgsys1_wpe: clock-controller@15220000 {
+ compatible = "mediatek,mt8195-imgsys1_wpe";
+ reg = <0x15220000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ ipesys: clock-controller@15330000 {
+ compatible = "mediatek,mt8195-ipesys";
+ reg = <0x15330000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ camsys: clock-controller@16000000 {
+ compatible = "mediatek,mt8195-camsys";
+ reg = <0x16000000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ camsys_rawa: clock-controller@1604f000 {
+ compatible = "mediatek,mt8195-camsys_rawa";
+ reg = <0x1604f000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ camsys_yuva: clock-controller@1606f000 {
+ compatible = "mediatek,mt8195-camsys_yuva";
+ reg = <0x1606f000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ camsys_rawb: clock-controller@1608f000 {
+ compatible = "mediatek,mt8195-camsys_rawb";
+ reg = <0x1608f000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ camsys_yuvb: clock-controller@160af000 {
+ compatible = "mediatek,mt8195-camsys_yuvb";
+ reg = <0x160af000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ camsys_mraw: clock-controller@16140000 {
+ compatible = "mediatek,mt8195-camsys_mraw";
+ reg = <0x16140000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ ccusys: clock-controller@17200000 {
+ compatible = "mediatek,mt8195-ccusys";
+ reg = <0x17200000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ vdecsys_soc: clock-controller@1800f000 {
+ compatible = "mediatek,mt8195-vdecsys_soc";
+ reg = <0x1800f000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ vdecsys: clock-controller@1802f000 {
+ compatible = "mediatek,mt8195-vdecsys";
+ reg = <0x1802f000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ vdecsys_core1: clock-controller@1803f000 {
+ compatible = "mediatek,mt8195-vdecsys_core1";
+ reg = <0x1803f000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ vencsys: clock-controller@1a000000 {
+ compatible = "mediatek,mt8195-vencsys";
+ reg = <0x1a000000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ vencsys_core1: clock-controller@1b000000 {
+ compatible = "mediatek,mt8195-vencsys_core1";
+ reg = <0x1b000000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ apusys_pll: clock-controller@190f3000 {
+ compatible = "mediatek,mt8195-apusys_pll";
+ reg = <0x190f3000 0x1000>;
+ #clock-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8195-sys-clock.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8195-sys-clock.yaml
new file mode 100644
index 000000000000..066c9b3d6ac9
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt8195-sys-clock.yaml
@@ -0,0 +1,76 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,mt8195-sys-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek System Clock Controller for MT8195
+
+maintainers:
+ - Chun-Jie Chen <chun-jie.chen@mediatek.com>
+
+description:
+ The MediaTek clock architecture looks like this:
+ PLLs --> dividers --> muxes --> clock gate
+
+ The apmixedsys provides most of the PLLs, which are generated from the SoC's 26 MHz clock.
+ The topckgen provides the dividers and muxes that supply the clock source to other IP blocks.
+ The infracfg_ao and pericfg_ao provide the clock gates in the peripheral and infrastructure IP blocks.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt8195-topckgen
+ - mediatek,mt8195-infracfg_ao
+ - mediatek,mt8195-apmixedsys
+ - mediatek,mt8195-pericfg_ao
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ topckgen: syscon@10000000 {
+ compatible = "mediatek,mt8195-topckgen", "syscon";
+ reg = <0x10000000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ infracfg_ao: syscon@10001000 {
+ compatible = "mediatek,mt8195-infracfg_ao", "syscon";
+ reg = <0x10001000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ apmixedsys: syscon@1000c000 {
+ compatible = "mediatek,mt8195-apmixedsys", "syscon";
+ reg = <0x1000c000 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ - |
+ pericfg_ao: syscon@11003000 {
+ compatible = "mediatek,mt8195-pericfg_ao", "syscon";
+ reg = <0x11003000 0x1000>;
+ #clock-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml
index 8723dfe34bab..26158d0d72f3 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,pericfg.yaml
@@ -1,8 +1,8 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/arm/mediatek/mediatek,pericfg.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/arm/mediatek/mediatek,pericfg.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: MediaTek Peripheral Configuration Controller
@@ -21,11 +21,14 @@ properties:
- mediatek,mt2701-pericfg
- mediatek,mt2712-pericfg
- mediatek,mt6765-pericfg
+ - mediatek,mt6795-pericfg
- mediatek,mt7622-pericfg
- mediatek,mt7629-pericfg
- mediatek,mt8135-pericfg
- mediatek,mt8173-pericfg
- mediatek,mt8183-pericfg
+ - mediatek,mt8186-pericfg
+ - mediatek,mt8195-pericfg
- mediatek,mt8516-pericfg
- const: syscon
- items:
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt
deleted file mode 100644
index 30cb645c0e54..000000000000
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,sgmiisys.txt
+++ /dev/null
@@ -1,23 +0,0 @@
-MediaTek SGMIISYS controller
-============================
-
-The MediaTek SGMIISYS controller provides various clocks to the system.
-
-Required Properties:
-
-- compatible: Should be:
- - "mediatek,mt7622-sgmiisys", "syscon"
- - "mediatek,mt7629-sgmiisys", "syscon"
-- #clock-cells: Must be 1
-
-The SGMIISYS controller uses the common clk binding from
-Documentation/devicetree/bindings/clock/clock-bindings.txt
-The available clocks are defined in dt-bindings/clock/mt*-clk.h.
-
-Example:
-
-sgmiisys: sgmiisys@1b128000 {
- compatible = "mediatek,mt7622-sgmiisys", "syscon";
- reg = <0 0x1b128000 0 0x1000>;
- #clock-cells = <1>;
-};
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,topckgen.txt b/Documentation/devicetree/bindings/arm/mediatek/mediatek,topckgen.txt
deleted file mode 100644
index 5ce7578cf274..000000000000
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,topckgen.txt
+++ /dev/null
@@ -1,34 +0,0 @@
-Mediatek topckgen controller
-============================
-
-The Mediatek topckgen controller provides various clocks to the system.
-
-Required Properties:
-
-- compatible: Should be one of:
- - "mediatek,mt2701-topckgen"
- - "mediatek,mt2712-topckgen", "syscon"
- - "mediatek,mt6765-topckgen", "syscon"
- - "mediatek,mt6779-topckgen", "syscon"
- - "mediatek,mt6797-topckgen"
- - "mediatek,mt7622-topckgen"
- - "mediatek,mt7623-topckgen", "mediatek,mt2701-topckgen"
- - "mediatek,mt7629-topckgen"
- - "mediatek,mt8135-topckgen"
- - "mediatek,mt8167-topckgen", "syscon"
- - "mediatek,mt8173-topckgen"
- - "mediatek,mt8183-topckgen", "syscon"
- - "mediatek,mt8516-topckgen"
-- #clock-cells: Must be 1
-
-The topckgen controller uses the common clk binding from
-Documentation/devicetree/bindings/clock/clock-bindings.txt
-The available clocks are defined in dt-bindings/clock/mt*-clk.h.
-
-Example:
-
-topckgen: power-controller@10000000 {
- compatible = "mediatek,mt8173-topckgen";
- reg = <0 0x10000000 0 0x1000>;
- #clock-cells = <1>;
-};
diff --git a/Documentation/devicetree/bindings/arm/microchip,sparx5.yaml b/Documentation/devicetree/bindings/arm/microchip,sparx5.yaml
index 6193388c6318..9a0d54e9799c 100644
--- a/Documentation/devicetree/bindings/arm/microchip,sparx5.yaml
+++ b/Documentation/devicetree/bindings/arm/microchip,sparx5.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/microchip,sparx5.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Microchip Sparx5 Boards Device Tree Bindings
+title: Microchip Sparx5 Boards
maintainers:
- Lars Povlsen <lars.povlsen@microchip.com>
diff --git a/Documentation/devicetree/bindings/arm/moxart.yaml b/Documentation/devicetree/bindings/arm/moxart.yaml
index 670d24ce8ec5..42565280914c 100644
--- a/Documentation/devicetree/bindings/arm/moxart.yaml
+++ b/Documentation/devicetree/bindings/arm/moxart.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/moxart.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: MOXA ART device tree bindings
+title: MOXA ART
maintainers:
- Jonas Jensen <jonas.jensen@gmail.com>
diff --git a/Documentation/devicetree/bindings/arm/mrvl/mrvl.yaml b/Documentation/devicetree/bindings/arm/mrvl/mrvl.yaml
index d58116136154..4c43eaf3632e 100644
--- a/Documentation/devicetree/bindings/arm/mrvl/mrvl.yaml
+++ b/Documentation/devicetree/bindings/arm/mrvl/mrvl.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/mrvl/mrvl.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Marvell Platforms Device Tree Bindings
+title: Marvell Platforms
maintainers:
- Lubomir Rintel <lkundrak@v3.sk>
diff --git a/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt b/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt
index 6ce0b212ec6d..606b4b1b709d 100644
--- a/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt
+++ b/Documentation/devicetree/bindings/arm/msm/qcom,idle-state.txt
@@ -81,4 +81,4 @@ Example:
};
};
-[1]. Documentation/devicetree/bindings/arm/idle-states.yaml
+[1]. Documentation/devicetree/bindings/cpu/idle-states.yaml
diff --git a/Documentation/devicetree/bindings/arm/msm/qcom,kpss-acc.txt b/Documentation/devicetree/bindings/arm/msm/qcom,kpss-acc.txt
deleted file mode 100644
index 7f696362a4a1..000000000000
--- a/Documentation/devicetree/bindings/arm/msm/qcom,kpss-acc.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-Krait Processor Sub-system (KPSS) Application Clock Controller (ACC)
-
-The KPSS ACC provides clock, power domain, and reset control to a Krait CPU.
-There is one ACC register region per CPU within the KPSS remapped region as
-well as an alias register region that remaps accesses to the ACC associated
-with the CPU accessing the region.
-
-PROPERTIES
-
-- compatible:
- Usage: required
- Value type: <string>
- Definition: should be one of:
- "qcom,kpss-acc-v1"
- "qcom,kpss-acc-v2"
-
-- reg:
- Usage: required
- Value type: <prop-encoded-array>
- Definition: the first element specifies the base address and size of
- the register region. An optional second element specifies
- the base address and size of the alias register region.
-
-- clocks:
- Usage: required
- Value type: <prop-encoded-array>
- Definition: reference to the pll parents.
-
-- clock-names:
- Usage: required
- Value type: <stringlist>
- Definition: must be "pll8_vote", "pxo".
-
-- clock-output-names:
- Usage: optional
- Value type: <string>
- Definition: Name of the output clock. Typically acpuX_aux where X is a
- CPU number starting at 0.
-
-Example:
-
- clock-controller@2088000 {
- compatible = "qcom,kpss-acc-v2";
- reg = <0x02088000 0x1000>,
- <0x02008000 0x1000>;
- clocks = <&gcc PLL8_VOTE>, <&gcc PXO_SRC>;
- clock-names = "pll8_vote", "pxo";
- clock-output-names = "acpu0_aux";
- };
diff --git a/Documentation/devicetree/bindings/arm/msm/qcom,kpss-gcc.txt b/Documentation/devicetree/bindings/arm/msm/qcom,kpss-gcc.txt
deleted file mode 100644
index e628758950e1..000000000000
--- a/Documentation/devicetree/bindings/arm/msm/qcom,kpss-gcc.txt
+++ /dev/null
@@ -1,44 +0,0 @@
-Krait Processor Sub-system (KPSS) Global Clock Controller (GCC)
-
-PROPERTIES
-
-- compatible:
- Usage: required
- Value type: <string>
- Definition: should be one of the following. The generic compatible
- "qcom,kpss-gcc" should also be included.
- "qcom,kpss-gcc-ipq8064", "qcom,kpss-gcc"
- "qcom,kpss-gcc-apq8064", "qcom,kpss-gcc"
- "qcom,kpss-gcc-msm8974", "qcom,kpss-gcc"
- "qcom,kpss-gcc-msm8960", "qcom,kpss-gcc"
-
-- reg:
- Usage: required
- Value type: <prop-encoded-array>
- Definition: base address and size of the register region
-
-- clocks:
- Usage: required
- Value type: <prop-encoded-array>
- Definition: reference to the pll parents.
-
-- clock-names:
- Usage: required
- Value type: <stringlist>
- Definition: must be "pll8_vote", "pxo".
-
-- clock-output-names:
- Usage: required
- Value type: <string>
- Definition: Name of the output clock. Typically acpu_l2_aux indicating
- an L2 cache auxiliary clock.
-
-Example:
-
- l2cc: clock-controller@2011000 {
- compatible = "qcom,kpss-gcc-ipq8064", "qcom,kpss-gcc";
- reg = <0x2011000 0x1000>;
- clocks = <&gcc PLL8_VOTE>, <&gcc PXO_SRC>;
- clock-names = "pll8_vote", "pxo";
- clock-output-names = "acpu_l2_aux";
- };
diff --git a/Documentation/devicetree/bindings/arm/msm/qcom,llcc.yaml b/Documentation/devicetree/bindings/arm/msm/qcom,llcc.yaml
deleted file mode 100644
index 62fcbd883392..000000000000
--- a/Documentation/devicetree/bindings/arm/msm/qcom,llcc.yaml
+++ /dev/null
@@ -1,60 +0,0 @@
-# SPDX-License-Identifier: (GPL-2.0-or-later OR BSD-2-Clause)
-%YAML 1.2
----
-$id: http://devicetree.org/schemas/arm/msm/qcom,llcc.yaml#
-$schema: http://devicetree.org/meta-schemas/core.yaml#
-
-title: Last Level Cache Controller
-
-maintainers:
- - Rishabh Bhatnagar <rishabhb@codeaurora.org>
- - Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
-
-description: |
- LLCC (Last Level Cache Controller) provides last level of cache memory in SoC,
- that can be shared by multiple clients. Clients here are different cores in the
- SoC, the idea is to minimize the local caches at the clients and migrate to
- common pool of memory. Cache memory is divided into partitions called slices
- which are assigned to clients. Clients can query the slice details, activate
- and deactivate them.
-
-properties:
- compatible:
- enum:
- - qcom,sc7180-llcc
- - qcom,sc7280-llcc
- - qcom,sdm845-llcc
- - qcom,sm8150-llcc
- - qcom,sm8250-llcc
-
- reg:
- items:
- - description: LLCC base register region
- - description: LLCC broadcast base register region
-
- reg-names:
- items:
- - const: llcc_base
- - const: llcc_broadcast_base
-
- interrupts:
- maxItems: 1
-
-required:
- - compatible
- - reg
- - reg-names
- - interrupts
-
-additionalProperties: false
-
-examples:
- - |
- #include <dt-bindings/interrupt-controller/arm-gic.h>
-
- system-cache-controller@1100000 {
- compatible = "qcom,sdm845-llcc";
- reg = <0x1100000 0x200000>, <0x1300000 0x50000> ;
- reg-names = "llcc_base", "llcc_broadcast_base";
- interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
- };
diff --git a/Documentation/devicetree/bindings/arm/msm/qcom,saw2.txt b/Documentation/devicetree/bindings/arm/msm/qcom,saw2.txt
index 94d50a949be1..c0e3c3a42bea 100644
--- a/Documentation/devicetree/bindings/arm/msm/qcom,saw2.txt
+++ b/Documentation/devicetree/bindings/arm/msm/qcom,saw2.txt
@@ -10,7 +10,7 @@ system, notifying them when a low power state is entered or exited.
Multiple revisions of the SAW hardware are supported using these Device Nodes.
SAW2 revisions differ in the register offset and configuration data. Also, the
same revision of the SAW in different SoCs may have different configuration
-data due the the differences in hardware capabilities. Hence the SoC name, the
+data due to the differences in hardware capabilities. Hence the SoC name, the
version of the SAW hardware in that SoC and the distinction between cpu (big
or Little) or cache, may be needed to uniquely identify the SAW register
configuration and initialization data. The compatible string is used to
diff --git a/Documentation/devicetree/bindings/arm/mstar/mstar.yaml b/Documentation/devicetree/bindings/arm/mstar/mstar.yaml
index a316eef1b728..937059fcc7b3 100644
--- a/Documentation/devicetree/bindings/arm/mstar/mstar.yaml
+++ b/Documentation/devicetree/bindings/arm/mstar/mstar.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/mstar/mstar.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: MStar platforms device tree bindings
+title: MStar platforms
maintainers:
- Daniel Palmer <daniel@thingy.jp>
@@ -23,8 +23,12 @@ properties:
- description: infinity2m boards
items:
- enum:
+ - 100ask,dongshanpione # 100ask DongShanPiOne
- honestar,ssd201htv2 # Honestar SSD201_HT_V2 devkit
- m5stack,unitv2 # M5Stack UnitV2
+ - miyoo,miyoo-mini # Miyoo Mini
+ - wirelesstag,ido-som2d01 # Wireless Tag IDO-SOM2D01
+ - wirelesstag,ido-sbc2d06-v1b-22w # Wireless Tag IDO-SBC2D06-V1B-22W
- const: mstar,infinity2m
- description: infinity3 boards
diff --git a/Documentation/devicetree/bindings/arm/npcm/npcm.yaml b/Documentation/devicetree/bindings/arm/npcm/npcm.yaml
index 95e51378089c..6871483947c5 100644
--- a/Documentation/devicetree/bindings/arm/npcm/npcm.yaml
+++ b/Documentation/devicetree/bindings/arm/npcm/npcm.yaml
@@ -4,10 +4,11 @@
$id: http://devicetree.org/schemas/arm/npcm/npcm.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: NPCM Platforms Device Tree Bindings
+title: NPCM Platforms
maintainers:
- Jonathan Neuschäfer <j.neuschaefer@gmx.net>
+ - Tomer Maimon <tmaimon77@gmail.com>
properties:
$nodename:
@@ -26,4 +27,10 @@ properties:
- nuvoton,npcm750-evb # NPCM750 evaluation board
- const: nuvoton,npcm750
+ - description: NPCM845 based boards
+ items:
+ - enum:
+ - nuvoton,npcm845-evb # NPCM845 evaluation board
+ - const: nuvoton,npcm845
+
additionalProperties: true
diff --git a/Documentation/devicetree/bindings/arm/npcm/nuvoton,gcr.yaml b/Documentation/devicetree/bindings/arm/npcm/nuvoton,gcr.yaml
new file mode 100644
index 000000000000..94e72f25b331
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/npcm/nuvoton,gcr.yaml
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/npcm/nuvoton,gcr.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Global Control Registers block in Nuvoton SoCs
+
+maintainers:
+ - Jonathan Neuschäfer <j.neuschaefer@gmx.net>
+ - Tomer Maimon <tmaimon77@gmail.com>
+
+description:
+ The Global Control Registers (GCR) are a block of registers in Nuvoton SoCs
+ that expose misc functionality such as chip model and version information or
+ pinmux settings.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - nuvoton,wpcm450-gcr
+ - nuvoton,npcm750-gcr
+ - nuvoton,npcm845-gcr
+ - const: syscon
+ - const: simple-mfd
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties:
+ type: object
+
+examples:
+ - |
+ gcr: syscon@800000 {
+ compatible = "nuvoton,npcm750-gcr", "syscon", "simple-mfd";
+ reg = <0x800000 0x1000>;
+
+ mux-controller {
+ compatible = "mmio-mux";
+ #mux-control-cells = <1>;
+ mux-reg-masks = <0x38 0x07>;
+ idle-states = <2>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/arm/nvidia,tegra194-ccplex.yaml b/Documentation/devicetree/bindings/arm/nvidia,tegra194-ccplex.yaml
index c9675c4cdc1b..84dc6b7512af 100644
--- a/Documentation/devicetree/bindings/arm/nvidia,tegra194-ccplex.yaml
+++ b/Documentation/devicetree/bindings/arm/nvidia,tegra194-ccplex.yaml
@@ -1,10 +1,10 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/arm/nvidia,tegra194-ccplex.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/arm/nvidia,tegra194-ccplex.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: NVIDIA Tegra194 CPU Complex device tree bindings
+title: NVIDIA Tegra194 CPU Complex
maintainers:
- Thierry Reding <thierry.reding@gmail.com>
@@ -25,7 +25,7 @@ properties:
- nvidia,tegra194-ccplex
nvidia,bpmp:
- $ref: '/schemas/types.yaml#/definitions/phandle'
+ $ref: /schemas/types.yaml#/definitions/phandle
description: |
Specifies the bpmp node that needs to be queried to get
operating point data for all CPUs.
diff --git a/Documentation/devicetree/bindings/arm/nxp/lpc32xx.yaml b/Documentation/devicetree/bindings/arm/nxp/lpc32xx.yaml
index 214c97bc3063..f1bd6f50e726 100644
--- a/Documentation/devicetree/bindings/arm/nxp/lpc32xx.yaml
+++ b/Documentation/devicetree/bindings/arm/nxp/lpc32xx.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/nxp/lpc32xx.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: NXP LPC32xx Platforms Device Tree Bindings
+title: NXP LPC32xx Platforms
maintainers:
- Roland Stigge <stigge@antcom.de>
diff --git a/Documentation/devicetree/bindings/arm/omap/omap.txt b/Documentation/devicetree/bindings/arm/omap/omap.txt
index e77635c5422c..fa8b31660cad 100644
--- a/Documentation/devicetree/bindings/arm/omap/omap.txt
+++ b/Documentation/devicetree/bindings/arm/omap/omap.txt
@@ -119,6 +119,9 @@ Boards (incomplete list of examples):
- OMAP3 BeagleBoard : Low cost community board
compatible = "ti,omap3-beagle", "ti,omap3430", "ti,omap3"
+- OMAP3 BeagleBoard A to B4 : Early BeagleBoard revisions A to B4 with a timer quirk
+ compatible = "ti,omap3-beagle-ab4", "ti,omap3-beagle", "ti,omap3430", "ti,omap3"
+
- OMAP3 Tobi with Overo : Commercial expansion board with daughter board
compatible = "gumstix,omap3-overo-tobi", "gumstix,omap3-overo", "ti,omap3430", "ti,omap3"
diff --git a/Documentation/devicetree/bindings/arm/omap/prcm.txt b/Documentation/devicetree/bindings/arm/omap/prcm.txt
index 3eb6d7afff14..431ef8c56a13 100644
--- a/Documentation/devicetree/bindings/arm/omap/prcm.txt
+++ b/Documentation/devicetree/bindings/arm/omap/prcm.txt
@@ -31,12 +31,17 @@ Required properties:
(base address and length)
- clocks: clocks for this module
- clockdomains: clockdomains for this module
+- #clock-cells: From common clock binding
+- clock-output-names: From common clock binding
+
Example:
-cm: cm@48004000 {
+cm: clock@48004000 {
compatible = "ti,omap3-cm";
reg = <0x48004000 0x4000>;
+ #clock-cells = <0>;
+ clock-output-names = "cm";
cm_clocks: clocks {
#address-cells = <1>;
diff --git a/Documentation/devicetree/bindings/arm/oxnas.txt b/Documentation/devicetree/bindings/arm/oxnas.txt
deleted file mode 100644
index ac64e60f99f1..000000000000
--- a/Documentation/devicetree/bindings/arm/oxnas.txt
+++ /dev/null
@@ -1,14 +0,0 @@
-Oxford Semiconductor OXNAS SoCs Family device tree bindings
--------------------------------------------
-
-Boards with the OX810SE SoC shall have the following properties:
- Required root node property:
- compatible: "oxsemi,ox810se"
-
-Boards with the OX820 SoC shall have the following properties:
- Required root node property:
- compatible: "oxsemi,ox820"
-
-Board compatible values:
- - "wd,mbwe" (OX810SE)
- - "cloudengines,pogoplugv3" (OX820)
diff --git a/Documentation/devicetree/bindings/arm/pmu.yaml b/Documentation/devicetree/bindings/arm/pmu.yaml
index e17ac049e890..e14358bf0b9c 100644
--- a/Documentation/devicetree/bindings/arm/pmu.yaml
+++ b/Documentation/devicetree/bindings/arm/pmu.yaml
@@ -20,6 +20,10 @@ properties:
items:
- enum:
- apm,potenza-pmu
+ - apple,avalanche-pmu
+ - apple,blizzard-pmu
+ - apple,firestorm-pmu
+ - apple,icestorm-pmu
- arm,armv8-pmuv3 # Only for s/w models
- arm,arm1136-pmu
- arm,arm1176-pmu
@@ -44,10 +48,18 @@ properties:
- arm,cortex-a76-pmu
- arm,cortex-a77-pmu
- arm,cortex-a78-pmu
+ - arm,cortex-a510-pmu
+ - arm,cortex-a710-pmu
+ - arm,cortex-x1-pmu
+ - arm,cortex-x2-pmu
- arm,neoverse-e1-pmu
- arm,neoverse-n1-pmu
+ - arm,neoverse-n2-pmu
+ - arm,neoverse-v1-pmu
- brcm,vulcan-pmu
- cavium,thunder-pmu
+ - nvidia,denver-pmu
+ - nvidia,carmel-pmu
- qcom,krait-pmu
- qcom,scorpion-pmu
- qcom,scorpion-mp-pmu
@@ -58,6 +70,8 @@ properties:
interrupt-affinity:
$ref: /schemas/types.yaml#/definitions/phandle-array
+ items:
+ maxItems: 1
description:
When using SPIs, specifies a list of phandles to CPU
nodes corresponding directly to the affinity of
diff --git a/Documentation/devicetree/bindings/arm/psci.yaml b/Documentation/devicetree/bindings/arm/psci.yaml
index 8b77cf83a095..3a2c908ff282 100644
--- a/Documentation/devicetree/bindings/arm/psci.yaml
+++ b/Documentation/devicetree/bindings/arm/psci.yaml
@@ -43,29 +43,24 @@ properties:
- description:
For implementations complying to PSCI 0.2.
- const: arm,psci-0.2
-
- - description:
- For implementations complying to PSCI 0.2.
Function IDs are not required and should be ignored by an OS with
PSCI 0.2 support, but are permitted to be present for compatibility
with existing software when "arm,psci" is later in the compatible
list.
+ minItems: 1
items:
- const: arm,psci-0.2
- const: arm,psci
- description:
For implementations complying to PSCI 1.0.
- const: arm,psci-1.0
-
- - description:
- For implementations complying to PSCI 1.0.
PSCI 1.0 is backward compatible with PSCI 0.2 with minor
specification updates, as defined in the PSCI specification[2].
+ minItems: 1
items:
- const: arm,psci-1.0
- const: arm,psci-0.2
+ - const: arm,psci
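+ # A node matching the full list above would therefore carry, e.g.:
+ # compatible = "arm,psci-1.0", "arm,psci-0.2", "arm,psci";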
method:
description: The method of calling the PSCI firmware.
@@ -101,7 +96,7 @@ properties:
bindings in [1]) must specify this property.
[1] Kernel documentation - ARM idle states bindings
- Documentation/devicetree/bindings/arm/idle-states.yaml
+ Documentation/devicetree/bindings/cpu/idle-states.yaml
patternProperties:
"^power-domain-":
diff --git a/Documentation/devicetree/bindings/arm/qcom,coresight-tpda.yaml b/Documentation/devicetree/bindings/arm/qcom,coresight-tpda.yaml
new file mode 100644
index 000000000000..2ec9b5b24d73
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/qcom,coresight-tpda.yaml
@@ -0,0 +1,129 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+# Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved.
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/qcom,coresight-tpda.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Trace, Profiling and Diagnostics Aggregator - TPDA
+
+description: |
+ TPDAs are responsible for the packetization and timestamping of data sets,
+ utilizing the MIPI STPv2 packet protocol. They pull data sets from one or
+ more attached TPDMs and push the resultant (packetized) data out a
+ master ATB interface, performing an arbitrated ATB interleaving (funneling)
+ task for free-flowing data from the TPDMs (i.e. CMB and DSB data set flows).
+
+ There is no strict binding between TPDM and TPDA. A TPDA can have multiple
+ TPDMs connected to it, but there must be only one TPDA in the path from a
+ TPDM source to a TMC sink. A TPDM can connect directly to the TPDA's input
+ port, or it can connect to a funnel which in turn connects to the TPDA's
+ input port.
+
+ Commands similar to the following can be used to validate TPDMs; enable
+ the coresight sink first.
+
+ echo 1 > /sys/bus/coresight/devices/tmc_etf0/enable_sink
+ echo 1 > /sys/bus/coresight/devices/tpdm0/enable_source
+ echo 1 > /sys/bus/coresight/devices/tpdm0/integration_test
+ echo 2 > /sys/bus/coresight/devices/tpdm0/integration_test
+
+ The test data will be collected in the enabled coresight sink. If the rwp
+ register of the sink keeps updating while integration_test runs, it means
+ data is flowing from the TPDM to the sink.
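+
+ For example, the sink's write pointer can be polled while the test runs
+ (tmc_etf0 is the example sink name used above; actual device names are
+ platform specific):
+
+ cat /sys/bus/coresight/devices/tmc_etf0/mgmt/rwp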
+
+maintainers:
+ - Mao Jinlong <quic_jinlmao@quicinc.com>
+ - Tao Zhang <quic_taozha@quicinc.com>
+
+# Need a custom select here or 'arm,primecell' will match on lots of nodes
+select:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,coresight-tpda
+ required:
+ - compatible
+
+properties:
+ $nodename:
+ pattern: "^tpda(@[0-9a-f]+)$"
+ compatible:
+ items:
+ - const: qcom,coresight-tpda
+ - const: arm,primecell
+
+ reg:
+ minItems: 1
+ maxItems: 2
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: apb_pclk
+
+ in-ports:
+ type: object
+ description: |
+ Input connections from TPDM to TPDA
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ out-ports:
+ type: object
+ description: |
+ Output connections from the TPDA to legacy CoreSight trace bus.
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port:
+ description:
+ Output connection from the TPDA to legacy CoreSight Trace bus.
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - in-ports
+ - out-ports
+
+additionalProperties: false
+
+examples:
+ # Minimal TPDA definition.
+ - |
+ tpda@6004000 {
+ compatible = "qcom,coresight-tpda", "arm,primecell";
+ reg = <0x6004000 0x1000>;
+
+ clocks = <&aoss_qmp>;
+ clock-names = "apb_pclk";
+
+ in-ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ tpda_qdss_0_in_tpdm_dcc: endpoint {
+ remote-endpoint =
+ <&tpdm_dcc_out_tpda_qdss_0>;
+ };
+ };
+ };
+
+ out-ports {
+ port {
+ tpda_qdss_out_funnel_in0: endpoint {
+ remote-endpoint =
+ <&funnel_in0_in_tpda_qdss>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/arm/qcom,coresight-tpdm.yaml b/Documentation/devicetree/bindings/arm/qcom,coresight-tpdm.yaml
new file mode 100644
index 000000000000..5c08342664ea
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/qcom,coresight-tpdm.yaml
@@ -0,0 +1,93 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+# Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved.
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/qcom,coresight-tpdm.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Trace, Profiling and Diagnostics Monitor - TPDM
+
+description: |
+ The TPDM or Monitor serves as a data collection component for the various
+ dataset types specified in the QPMDA spec. It covers Implementation Defined
+ (ImplDef), Basic Counts (BC), Tenure Counts (TC), Continuous Multi-Bit (CMB),
+ and Discrete Single Bit (DSB) data sets. It performs data collection in the
+ data producing clock domain and transfers it to the data collection time
+ domain, generally the ATB clock domain.
+
+ The primary use case of the TPDM is to collect data from different data
+ sources and send it to a TPDA for packetization, timestamping, and funneling.
+
+maintainers:
+ - Mao Jinlong <quic_jinlmao@quicinc.com>
+ - Tao Zhang <quic_taozha@quicinc.com>
+
+# Need a custom select here or 'arm,primecell' will match on lots of nodes
+select:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,coresight-tpdm
+ required:
+ - compatible
+
+properties:
+ $nodename:
+ pattern: "^tpdm(@[0-9a-f]+)$"
+ compatible:
+ items:
+ - const: qcom,coresight-tpdm
+ - const: arm,primecell
+
+ reg:
+ minItems: 1
+ maxItems: 2
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: apb_pclk
+
+ out-ports:
+ description: |
+ Output connections from the TPDM to coresight funnel/TPDA.
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port:
+ description: Output connection from the TPDM to coresight
+ funnel/TPDA.
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+
+additionalProperties: false
+
+examples:
+ # Minimal TPDM definition. The TPDM connects to the coresight TPDA.
+ - |
+ tpdm@684c000 {
+ compatible = "qcom,coresight-tpdm", "arm,primecell";
+ reg = <0x0684c000 0x1000>;
+
+ clocks = <&aoss_qmp>;
+ clock-names = "apb_pclk";
+
+ out-ports {
+ port {
+ tpdm_prng_out_tpda_qdss: endpoint {
+ remote-endpoint =
+ <&tpda_qdss_in_tpdm_prng>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/arm/qcom-soc.yaml b/Documentation/devicetree/bindings/arm/qcom-soc.yaml
new file mode 100644
index 000000000000..e333ec4a9c5f
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/qcom-soc.yaml
@@ -0,0 +1,66 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/qcom-soc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SoC compatibles naming convention
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+
+description: |
+ Guidelines for new compatibles for SoC blocks/components.
+ When adding new compatibles in new bindings, use the format::
+ qcom,SoC-IP
+
+ For example::
+ qcom,sdm845-llcc-bwmon
+
+ When adding new compatibles to existing bindings, use the format in the
+ existing binding, even if it contradicts the above.
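+
+ For instance (sm9999 is a hypothetical SoC name used only for
+ illustration), a compatible added to the existing "qcom,gcc-*" binding
+ keeps the legacy order::
+ qcom,gcc-sm9999
+ while a completely new clock binding would use::
+ qcom,sm9999-gcc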
+
+select:
+ properties:
+ compatible:
+ pattern: "^qcom,.*(apq|ipq|mdm|msm|qcm|qcs|sa|sc|sdm|sdx|sm)[0-9]+.*$"
+ required:
+ - compatible
+
+properties:
+ compatible:
+ oneOf:
+ # Preferred naming style for compatibles of SoC components:
+ - pattern: "^qcom,(apq|ipq|mdm|msm|qcm|qcs|sa|sc|sdm|sdx|sm)[0-9]+-.*$"
+ - pattern: "^qcom,(sa|sc)8[0-9]+[a-z][a-z]?-.*$"
+
+ # Legacy naming - variations of existing patterns/compatibles are OK,
+ # but do not add completely new entries to these:
+ - pattern: "^qcom,[ak]pss-wdt-(apq|ipq|mdm|msm|qcm|qcs|sa|sc|sdm|sdx|sm)[0-9]+.*$"
+ - pattern: "^qcom,gcc-(apq|ipq|mdm|msm|qcm|qcs|sa|sc|sdm|sdx|sm)[0-9]+.*$"
+ - pattern: "^qcom,mmcc-(apq|ipq|mdm|msm|qcm|qcs|sa|sc|sdm|sdx|sm)[0-9]+.*$"
+ - pattern: "^qcom,pcie-(apq|ipq|mdm|msm|qcm|qcs|sa|sc|sdm|sdx|sm)[0-9]+.*$"
+ - pattern: "^qcom,rpm-(apq|ipq|mdm|msm|qcm|qcs|sa|sc|sdm|sdx|sm)[0-9]+.*$"
+ - pattern: "^qcom,scm-(apq|ipq|mdm|msm|qcm|qcs|sa|sc|sdm|sdx|sm)[0-9]+.*$"
+ - enum:
+ - qcom,dsi-ctrl-6g-qcm2290
+ - qcom,gpucc-sdm630
+ - qcom,gpucc-sdm660
+ - qcom,lcc-apq8064
+ - qcom,lcc-ipq8064
+ - qcom,lcc-mdm9615
+ - qcom,lcc-msm8960
+ - qcom,lpass-cpu-apq8016
+ - qcom,usb-ss-ipq4019-phy
+ - qcom,usb-hs-ipq4019-phy
+ - qcom,vqmmc-ipq4019-regulator
+
+ # Legacy compatibles with wild-cards - list cannot grow with new bindings:
+ - enum:
+ - qcom,ipq806x-gmac
+ - qcom,ipq806x-nand
+ - qcom,ipq806x-sata-phy
+ - qcom,ipq806x-usb-phy-ss
+ - qcom,ipq806x-usb-phy-hs
+
+additionalProperties: true
diff --git a/Documentation/devicetree/bindings/arm/qcom.yaml b/Documentation/devicetree/bindings/arm/qcom.yaml
index 880ddafc634e..d9dd25695c3d 100644
--- a/Documentation/devicetree/bindings/arm/qcom.yaml
+++ b/Documentation/devicetree/bindings/arm/qcom.yaml
@@ -4,10 +4,10 @@
$id: http://devicetree.org/schemas/arm/qcom.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: QCOM device tree bindings
+title: QCOM
maintainers:
- - Stephen Boyd <sboyd@codeaurora.org>
+ - Bjorn Andersson <bjorn.andersson@linaro.org>
description: |
Some qcom based bootloaders identify the dtb blob based on a set of
@@ -25,32 +25,69 @@ description: |
The 'SoC' element must be one of the following strings:
apq8016
+ apq8026
apq8074
apq8084
apq8096
+ ipq4018
+ ipq5332
ipq6018
ipq8074
+ ipq9574
mdm9615
msm8226
msm8916
+ msm8939
+ msm8953
+ msm8956
msm8974
+ msm8976
msm8992
msm8994
msm8996
+ msm8998
+ qcs404
+ qcm2290
+ qdu1000
+ qrb2210
+ qrb4210
+ qru1000
sa8155p
+ sa8540p
+ sa8775p
sc7180
sc7280
+ sc8180x
+ sc8280xp
+ sda660
+ sdm450
sdm630
+ sdm632
+ sdm636
sdm660
+ sdm670
sdm845
sdx55
+ sdx65
+ sm4250
+ sm6115
+ sm6115p
+ sm6125
+ sm6350
+ sm6375
+ sm7225
sm8150
sm8250
sm8350
+ sm8450
+ sm8550
The 'board' element must be one of the following strings:
adp
+ ap-al02-c7
+ ap-mi01.2
+ ap-mi01.6
cdp
cp01-c1
dragonboard
@@ -61,7 +98,10 @@ description: |
liquid
mtp
qrd
+ rb2
+ ride
sbc
+ x100
The 'soc_version' and 'board_version' elements take the form of v<Major>.<Minor>
where the minor number may be omitted when it's zero, i.e. v1.0 is the same
@@ -82,6 +122,11 @@ description: |
A dragonboard board v0.1 of subtype 1 with an apq8074 SoC version 2, made in
foundry 2.
+ There are many devices in the list below that run the standard ChromeOS
+ bootloader setup and use the open source depthcharge bootloader to boot the
+ OS. These devices do not use the scheme described above. For details, see:
+ https://docs.kernel.org/arm/google/chromebook-boot-flow.html
+
properties:
$nodename:
const: "/"
@@ -94,6 +139,17 @@ properties:
- items:
- enum:
+ - asus,sparrow
+ - huawei,sturgeon
+ - lg,lenok
+ - samsung,matisse-wifi
+ - const: qcom,apq8026
+
+ - items:
+ - enum:
+ - asus,nexus7-flo
+ - lg,nexus4-mako
+ - sony,xperia-yuga
- qcom,apq8064-cm-qs600
- qcom,apq8064-ifc6410
- const: qcom,apq8064
@@ -122,47 +178,152 @@ properties:
- items:
- enum:
+ - sony,kanuti-tulip
+ - square,apq8039-t2
+ - const: qcom,msm8939
+
+ - items:
+ - enum:
+ - sony,kugo-row
+ - sony,suzu-row
+ - const: qcom,msm8956
+
+ - items:
+ - enum:
- qcom,msm8960-cdp
- const: qcom,msm8960
- items:
- enum:
- - fairphone,fp2
- lge,hammerhead
- sony,xperia-amami
- - sony,xperia-castor
- sony,xperia-honami
- const: qcom,msm8974
- items:
- enum:
- - alcatel,idol347
- - const: qcom,msm8916-mtp/1
+ - fairphone,fp2
+ - oneplus,bacon
+ - samsung,klte
+ - sony,xperia-castor
+ - const: qcom,msm8974pro
+ - const: qcom,msm8974
+
+ - items:
- const: qcom,msm8916-mtp
+ - const: qcom,msm8916-mtp/1
- const: qcom,msm8916
- items:
- enum:
- - longcheer,l8150
+ - acer,a1-724
+ - alcatel,idol347
+ - asus,z00l
+ - gplus,fl8005a
+ - huawei,g7
+ - longcheer,l8910
- samsung,a3u-eur
- samsung,a5u-eur
+ - samsung,e5
+ - samsung,e7
+ - samsung,grandmax
+ - samsung,gt510
+ - samsung,gt58
+ - samsung,j5
+ - samsung,j5x
+ - samsung,serranove
+ - thwc,uf896
+ - thwc,ufi001c
+ - wingtech,wt88047
+ - yiming,uz801-v3
- const: qcom,msm8916
- items:
+ - const: longcheer,l8150
+ - const: qcom,msm8916-v1-qrd/9-v1
+ - const: qcom,msm8916
+
+ - items:
+ - enum:
+ - motorola,potter
+ - xiaomi,daisy
+ - xiaomi,mido
+ - xiaomi,tissot
+ - xiaomi,vince
+ - const: qcom,msm8953
+
+ - items:
+ - enum:
+ - lg,bullhead
+ - microsoft,talkman
+ - xiaomi,libra
+ - const: qcom,msm8992
+
+ - items:
- enum:
- sony,karin_windy
+ - const: qcom,apq8094
+
+ - items:
+ - enum:
+ - huawei,angler
+ - microsoft,cityman
+ - sony,ivy-row
- sony,karin-row
- sony,satsuki-row
- sony,sumire-row
- sony,suzuran-row
- - qcom,msm8994
- - const: qcom,apq8094
+ - const: qcom,msm8994
- items:
- - const: qcom,msm8996-mtp
+ - enum:
+ - arrow,apq8096-db820c
+ - inforce,ifc6640
+ - const: qcom,apq8096-sbc
+ - const: qcom,apq8096
- items:
- enum:
+ - oneplus,oneplus3
+ - oneplus,oneplus3t
+ - qcom,msm8996-mtp
+ - sony,dora-row
+ - sony,kagura-row
+ - sony,keyaki-row
+ - xiaomi,gemini
+ - const: qcom,msm8996
+
+ - items:
+ - enum:
+ - xiaomi,natrium
+ - xiaomi,scorpio
+ - const: qcom,msm8996pro
+ - const: qcom,msm8996
+
+ - items:
+ - enum:
+ - asus,novago-tp370ql
+ - fxtec,pro1
+ - hp,envy-x2
+ - lenovo,miix-630
+ - oneplus,cheeseburger
+ - oneplus,dumpling
+ - qcom,msm8998-mtp
+ - sony,xperia-lilac
+ - sony,xperia-maple
+ - sony,xperia-poplar
+ - xiaomi,sagit
+ - const: qcom,msm8998
+
+ - items:
+ - enum:
+ - 8dev,jalapeno
+ - alfa-network,ap120c-ac
+ - const: qcom,ipq4018
+
+ - items:
+ - enum:
+ - qcom,ipq4019-ap-dk01.1-c1
- qcom,ipq4019-ap-dk04.1-c3
- qcom,ipq4019-ap-dk07.1-c1
- qcom,ipq4019-ap-dk07.1-c2
@@ -171,6 +332,13 @@ properties:
- items:
- enum:
+ - qcom,ipq5332-ap-mi01.2
+ - qcom,ipq5332-ap-mi01.6
+ - const: qcom,ipq5332
+
+ - items:
+ - enum:
+ - mikrotik,rb3011
- qcom,ipq8064-ap148
- const: qcom,ipq8064
@@ -183,16 +351,451 @@ properties:
- items:
- enum:
+ - qcom,ipq9574-ap-al02-c7
+ - const: qcom,ipq9574
+
+ - description: Sierra Wireless MangOH Green with WP8548 Module
+ items:
+ - const: swir,mangoh-green-wp8548
+ - const: swir,wp8548
+ - const: qcom,mdm9615
+
+ - description: Qualcomm Technologies, Inc. Robotics RB1
+ items:
+ - enum:
+ - qcom,qrb2210-rb1
+ - const: qcom,qrb2210
+ - const: qcom,qcm2290
+
+ - description: Qualcomm Technologies, Inc. Distributed Unit 1000 platform
+ items:
+ - enum:
+ - qcom,qdu1000-idp
+ - qcom,qdu1000-x100
+ - const: qcom,qdu1000
+
+ - description: Qualcomm Technologies, Inc. Radio Unit 1000 platform
+ items:
+ - enum:
+ - qcom,qru1000-idp
+ - const: qcom,qru1000
+
+ - description: Qualcomm Technologies, Inc. SC7180 IDP
+ items:
+ - enum:
- qcom,sc7180-idp
- const: qcom,sc7180
+ - description: HP Chromebook x2 11c (rev1 - 2)
+ items:
+ - const: google,coachz-rev1
+ - const: google,coachz-rev2
+ - const: qcom,sc7180
+
+ - description: HP Chromebook x2 11c (newest rev)
+ items:
+ - const: google,coachz
+ - const: qcom,sc7180
+
+ - description: HP Chromebook x2 11c with LTE (rev1 - 2)
+ items:
+ - const: google,coachz-rev1-sku0
+ - const: google,coachz-rev2-sku0
+ - const: qcom,sc7180
+
+ - description: HP Chromebook x2 11c with LTE (newest rev)
+ items:
+ - const: google,coachz-sku0
+ - const: qcom,sc7180
+
+ - description: Lenovo Chromebook Duet 5 13 (rev2)
+ items:
+ - const: google,homestar-rev2
+ - const: google,homestar-rev23
+ - const: qcom,sc7180
+
+ - description: Lenovo Chromebook Duet 5 13 (rev3)
+ items:
+ - const: google,homestar-rev3
+ - const: qcom,sc7180
+
+ - description: Lenovo Chromebook Duet 5 13 (newest rev)
+ items:
+ - const: google,homestar
+ - const: qcom,sc7180
+
+ - description: Google Kingoftown (rev0)
+ items:
+ - const: google,kingoftown-rev0
+ - const: qcom,sc7180
+
+ - description: Google Kingoftown (newest rev)
+ items:
+ - const: google,kingoftown
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook Spin 513 (rev0)
+ items:
+ - const: google,lazor-rev0
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook Spin 513 (rev1 - 2)
+ items:
+ - const: google,lazor-rev1
+ - const: google,lazor-rev2
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook Spin 513 (rev3 - 8)
+ items:
+ - const: google,lazor-rev3
+ - const: google,lazor-rev4
+ - const: google,lazor-rev5
+ - const: google,lazor-rev6
+ - const: google,lazor-rev7
+ - const: google,lazor-rev8
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook Spin 513 (newest rev)
+ items:
+ - const: google,lazor
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook Spin 513 with KB Backlight (rev1 - 2)
+ items:
+ - const: google,lazor-rev1-sku2
+ - const: google,lazor-rev2-sku2
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook Spin 513 with KB Backlight (rev3 - 8)
+ items:
+ - const: google,lazor-rev3-sku2
+ - const: google,lazor-rev4-sku2
+ - const: google,lazor-rev5-sku2
+ - const: google,lazor-rev6-sku2
+ - const: google,lazor-rev7-sku2
+ - const: google,lazor-rev8-sku2
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook Spin 513 with KB Backlight (newest rev)
+ items:
+ - const: google,lazor-sku2
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook Spin 513 with LTE (rev1 - 2)
+ items:
+ - const: google,lazor-rev1-sku0
+ - const: google,lazor-rev2-sku0
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook Spin 513 with LTE (rev3 - 8)
+ items:
+ - const: google,lazor-rev3-sku0
+ - const: google,lazor-rev4-sku0
+ - const: google,lazor-rev5-sku0
+ - const: google,lazor-rev6-sku0
+ - const: google,lazor-rev7-sku0
+ - const: google,lazor-rev8-sku0
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook Spin 513 with LTE (newest rev)
+ items:
+ - const: google,lazor-sku0
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook 511 (rev4 - rev8)
+ items:
+ - const: google,lazor-rev4-sku4
+ - const: google,lazor-rev5-sku4
+ - const: google,lazor-rev6-sku4
+ - const: google,lazor-rev7-sku4
+ - const: google,lazor-rev8-sku4
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook 511 (newest rev)
+ items:
+ - const: google,lazor-sku4
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook 511 without Touchscreen (rev4)
+ items:
+ - const: google,lazor-rev4-sku5
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook 511 without Touchscreen (rev5 - rev8)
+ items:
+ - const: google,lazor-rev5-sku5
+ - const: google,lazor-rev5-sku6
+ - const: google,lazor-rev6-sku6
+ - const: google,lazor-rev7-sku6
+ - const: google,lazor-rev8-sku6
+ - const: qcom,sc7180
+
+ - description: Acer Chromebook 511 without Touchscreen (newest rev)
+ items:
+ - const: google,lazor-sku6
+ - const: qcom,sc7180
+
+ - description: Google Mrbland with AUO panel (rev0)
+ items:
+ - const: google,mrbland-rev0-sku0
+ - const: qcom,sc7180
+
+ - description: Google Mrbland with AUO panel (newest rev)
+ items:
+ - const: google,mrbland-sku1536
+ - const: qcom,sc7180
+
+ - description: Google Mrbland with BOE panel (rev0)
+ items:
+ - const: google,mrbland-rev0-sku16
+ - const: qcom,sc7180
+
+ - description: Google Mrbland with BOE panel (newest rev)
+ items:
+ - const: google,mrbland-sku1024
+ - const: google,mrbland-sku768
+ - const: qcom,sc7180
+
+ - description: Google Pazquel with Parade (newest rev)
+ items:
+ - const: google,pazquel-sku5
+ - const: qcom,sc7180
+
+ - description: Google Pazquel with TI (newest rev)
+ items:
+ - const: google,pazquel-sku1
+ - const: qcom,sc7180
+
+ - description: Google Pazquel with LTE and Parade (newest rev)
+ items:
+ - const: google,pazquel-sku6
+ - const: google,pazquel-sku4
+ - const: qcom,sc7180
+
+ - description: Google Pazquel with LTE and TI (newest rev)
+ items:
+ - const: google,pazquel-sku0
+ - const: google,pazquel-sku2
+ - const: qcom,sc7180
+
+ - description: Google Pazquel360 with LTE (newest rev)
+ items:
+ - const: google,pazquel-sku22
+ - const: google,pazquel-sku20
+ - const: qcom,sc7180
+
+ - description: Google Pazquel360 with WiFi (newest rev)
+ items:
+ - const: google,pazquel-sku21
+ - const: qcom,sc7180
+
+ - description: Sharp Dynabook Chromebook C1 (rev1)
+ items:
+ - const: google,pompom-rev1
+ - const: qcom,sc7180
+
+ - description: Sharp Dynabook Chromebook C1 (rev2)
+ items:
+ - const: google,pompom-rev2
+ - const: qcom,sc7180
+
+ - description: Sharp Dynabook Chromebook C1 (newest rev)
+ items:
+ - const: google,pompom
+ - const: qcom,sc7180
+
+ - description: Sharp Dynabook Chromebook C1 with LTE (rev1)
+ items:
+ - const: google,pompom-rev1-sku0
+ - const: qcom,sc7180
+
+ - description: Sharp Dynabook Chromebook C1 with LTE (rev2)
+ items:
+ - const: google,pompom-rev2-sku0
+ - const: qcom,sc7180
+
+ - description: Sharp Dynabook Chromebook C1 with LTE (newest rev)
+ items:
+ - const: google,pompom-sku0
+ - const: qcom,sc7180
+
+ - description: Google Quackingstick (newest rev)
+ items:
+ - const: google,quackingstick-sku1537
+ - const: qcom,sc7180
+
+ - description: Google Quackingstick with LTE (newest rev)
+ items:
+ - const: google,quackingstick-sku1536
+ - const: qcom,sc7180
+
+ - description: Google Trogdor (newest rev)
+ items:
+ - const: google,trogdor
+ - const: qcom,sc7180
+
+ - description: Google Trogdor with LTE (newest rev)
+ items:
+ - const: google,trogdor-sku0
+ - const: qcom,sc7180
+
+ - description: Lenovo IdeaPad Chromebook Duet 3 with BOE panel (rev0)
+ items:
+ - const: google,wormdingler-rev0-sku16
+ - const: qcom,sc7180
+
+ - description: Lenovo IdeaPad Chromebook Duet 3 with BOE panel (newest rev)
+ items:
+ - const: google,wormdingler-sku1024
+ - const: qcom,sc7180
+
+ - description: Lenovo IdeaPad Chromebook Duet 3 with BOE panel and rt5682s (newest rev)
+ items:
+ - const: google,wormdingler-sku1025
+ - const: qcom,sc7180
+
+ - description: Lenovo IdeaPad Chromebook Duet 3 with INX panel (rev0)
+ items:
+ - const: google,wormdingler-rev0-sku0
+ - const: qcom,sc7180
+
+ - description: Lenovo IdeaPad Chromebook Duet 3 with INX panel (newest rev)
+ items:
+ - const: google,wormdingler-sku0
+ - const: qcom,sc7180
+
+ - description: Lenovo IdeaPad Chromebook Duet 3 with INX panel and rt5682s (newest rev)
+ items:
+ - const: google,wormdingler-sku1
+ - const: qcom,sc7180
+
+ - description: Qualcomm Technologies, Inc. sc7280 CRD platform (rev3 - 4)
+ items:
+ - const: qcom,sc7280-crd
+ - const: google,hoglin-rev3
+ - const: google,hoglin-rev4
+ - const: google,piglin-rev3
+ - const: google,piglin-rev4
+ - const: qcom,sc7280
+
+ - description: Qualcomm Technologies, Inc. sc7280 CRD platform (newest rev)
+ items:
+ - const: google,zoglin
+ - const: google,hoglin
+ - const: qcom,sc7280
+
+ - description: Qualcomm Technologies, Inc. sc7280 CRD Pro platform (newest rev)
+ items:
+ - const: google,zoglin-sku1536
+ - const: google,hoglin-sku1536
+ - const: qcom,sc7280
+
+ - description: Qualcomm Technologies, Inc. sc7280 IDP SKU1 platform
+ items:
+ - const: qcom,sc7280-idp
+ - const: google,senor
+ - const: qcom,sc7280
+
+ - description: Qualcomm Technologies, Inc. sc7280 IDP SKU2 platform
+ items:
+ - const: qcom,sc7280-idp2
+ - const: google,piglin
+ - const: qcom,sc7280
+
+ - description: Google Evoker (newest rev)
+ items:
+ - const: google,evoker
+ - const: qcom,sc7280
+
+ - description: Google Evoker with LTE (newest rev)
+ items:
+ - const: google,evoker-sku512
+ - const: qcom,sc7280
+
+ - description: Google Herobrine (newest rev)
+ items:
+ - const: google,herobrine
+ - const: qcom,sc7280
+
+ - description: Google Villager (rev0)
+ items:
+ - const: google,villager-rev0
+ - const: qcom,sc7280
+
+ - description: Google Villager (newest rev)
+ items:
+ - const: google,villager
+ - const: qcom,sc7280
+
+ - description: Google Villager with LTE (newest rev)
+ items:
+ - const: google,villager-sku512
+ - const: qcom,sc7280
+
+ - description: Google Zombie (newest rev)
+ items:
+ - const: google,zombie
+ - const: qcom,sc7280
+
+ - description: Google Zombie with LTE (newest rev)
+ items:
+ - const: google,zombie-sku512
+ - const: qcom,sc7280
+
+ - description: Google Zombie with NVMe (newest rev)
+ items:
+ - const: google,zombie-sku2
+ - const: google,zombie-sku3
+ - const: google,zombie-sku515
+ - const: qcom,sc7280
+
+ - description: Google Zombie with LTE and NVMe (newest rev)
+ items:
+ - const: google,zombie-sku514
+ - const: qcom,sc7280
+
- items:
- enum:
- - qcom,sc7280-idp
- - qcom,sc7280-idp2
- - google,piglin
- - google,senor
- - const: qcom,sc7280
+ - lenovo,flex-5g
+ - microsoft,surface-prox
+ - qcom,sc8180x-primus
+ - const: qcom,sc8180x
+
+ - items:
+ - enum:
+ - lenovo,thinkpad-x13s
+ - qcom,sc8280xp-crd
+ - qcom,sc8280xp-qrd
+ - const: qcom,sc8280xp
+
+ - items:
+ - enum:
+ - motorola,ali
+ - const: qcom,sdm450
+
+ - items:
+ - enum:
+ - sony,discovery-row
+ - sony,kirin-row
+ - sony,pioneer-row
+ - sony,voyager-row
+ - const: qcom,sdm630
+
+ - items:
+ - enum:
+ - inforce,ifc6560
+ - const: qcom,sda660
+
+ - items:
+ - enum:
+ - fairphone,fp3
+ - motorola,ocean
+ - const: qcom,sdm632
+
+ - items:
+ - enum:
+ - sony,mermaid-row
+ - const: qcom,sdm636
- items:
- enum:
@@ -201,6 +804,11 @@ properties:
- items:
- enum:
+ - google,sargo
+ - const: qcom,sdm670
+
+ - items:
+ - enum:
- qcom,sdx55-mtp
- qcom,sdx55-telit-fn980-tlb
- qcom,sdx55-t55
@@ -208,32 +816,270 @@ properties:
- items:
- enum:
+ - qcom,sdx65-mtp
+ - const: qcom,sdx65
+
+ - items:
+ - enum:
- qcom,ipq6018-cp01
- qcom,ipq6018-cp01-c1
- const: qcom,ipq6018
- items:
- enum:
+ - qcom,qcs404-evb-1000
+ - qcom,qcs404-evb-4000
+ - const: qcom,qcs404-evb
+ - const: qcom,qcs404
+
+ - items:
+ - enum:
- qcom,sa8155p-adp
- const: qcom,sa8155p
- items:
- enum:
+ - qcom,sa8295p-adp
+ - qcom,sa8540p-ride
+ - const: qcom,sa8540p
+
+ - items:
+ - enum:
+ - qcom,sa8775p-ride
+ - const: qcom,sa8775p
+
+ - items:
+ - enum:
+ - google,cheza
+ - google,cheza-rev1
+ - google,cheza-rev2
+ - lenovo,yoga-c630
+ - lg,judyln
+ - lg,judyp
+ - oneplus,enchilada
+ - oneplus,fajita
+ - qcom,sdm845-mtp
+ - shift,axolotl
+ - samsung,starqltechn
+ - samsung,w737
+ - sony,akari-row
+ - sony,akatsuki-row
+ - sony,apollo-row
+ - thundercomm,db845c
+ - xiaomi,beryllium
+ - xiaomi,beryllium-ebbg
+ - xiaomi,polaris
+ - const: qcom,sdm845
+
+ - items:
+ - enum:
+ - oneplus,billie2
+ - const: qcom,sm4250
+
+ - items:
+ - enum:
+ - qcom,qrb4210-rb2
+ - const: qcom,qrb4210
+ - const: qcom,sm4250
+
+ - items:
+ - enum:
+ - lenovo,j606f
+ - const: qcom,sm6115p
+ - const: qcom,sm6115
+
+ - items:
+ - enum:
+ - sony,pdx201
+ - xiaomi,laurel-sprout
+ - const: qcom,sm6125
+
+ - items:
+ - enum:
+ - sony,pdx213
+ - const: qcom,sm6350
+
+ - items:
+ - enum:
+ - sony,pdx225
+ - const: qcom,sm6375
+
+ - items:
+ - enum:
+ - fairphone,fp4
+ - const: qcom,sm7225
+
+ - items:
+ - enum:
+ - microsoft,surface-duo
+ - qcom,sm8150-hdk
- qcom,sm8150-mtp
+ - sony,bahamut-generic
+ - sony,griffin-generic
- const: qcom,sm8150
- items:
- enum:
- qcom,qrb5165-rb5
+ - qcom,sm8250-hdk
- qcom,sm8250-mtp
+ - sony,pdx203-generic
+ - sony,pdx206-generic
+ - xiaomi,elish
- const: qcom,sm8250
- items:
- enum:
+ - microsoft,surface-duo2
- qcom,sm8350-hdk
- qcom,sm8350-mtp
+ - sony,pdx214-generic
+ - sony,pdx215-generic
- const: qcom,sm8350
+ - items:
+ - enum:
+ - qcom,sm8450-hdk
+ - qcom,sm8450-qrd
+ - sony,pdx223
+ - sony,pdx224
+ - const: qcom,sm8450
+
+ - items:
+ - enum:
+ - qcom,sm8550-mtp
+ - qcom,sm8550-qrd
+ - const: qcom,sm8550
+
+ # Board compatibles go above
+
+ qcom,msm-id:
+ $ref: /schemas/types.yaml#/definitions/uint32-matrix
+ minItems: 1
+ maxItems: 8
+ items:
+ items:
+ - description: |
+ MSM chipset ID - an exact match value consisting of two bitfields::
+ - bits 0-15 - The unique MSM chipset ID
+ - bits 16-31 - Reserved; should be 0
+ - description: |
+ Hardware revision ID - a chipset specific 32-bit ID representing
+ the version of the chipset. It is a best match value - the
+ bootloader will look for the closest possible match.
+ deprecated: true
+ description:
+ The MSM chipset and hardware revision used by Qualcomm bootloaders. It
+ can optionally be an array of these to indicate multiple hardware that
+ use the same device tree. It is expected that the bootloader will use
+ this information at boot-up to decide which device tree to use when given
+ multiple device trees, some of which may not be compatible with the
+ actual hardware. It is the bootloader's responsibility to pass the
+ correct device tree to the kernel.
+ The property is deprecated.
+
+ qcom,board-id:
+ $ref: /schemas/types.yaml#/definitions/uint32-matrix
+ minItems: 1
+ maxItems: 8
+ oneOf:
+ - items:
+ - items:
+ - description: |
+ Board ID consisting of three bitfields::
+ - bits 31-24 - Unused
+ - bits 23-16 - Platform Version Major
+ - bits 15-8 - Platform Version Minor
+ - bits 7-0 - Platform Type
+ Platform Type field is an exact match value. The
+ Platform Major/Minor field is a best match. The bootloader will
+ look for the closest possible match.
+ - description: |
+ Subtype ID unique to a Platform Type/Chipset ID. For a given
+ Platform Type, there will typically only be a single board and the
+ subtype_id will be 0. However in some cases board variants may
+ need to be distinguished by different subtype_id values.
+ - items:
+ # OnePlus uses a variant of board-id with four elements:
+ - items:
+ - const: 8
+ - const: 0
+ - description: OnePlus board ID
+ - description: OnePlus subtype ID
+ deprecated: true
+ description:
+ The board type and revision information. It can optionally be an array
+ of these to indicate multiple boards that use the same device tree. It
+ is expected that the bootloader will use this information at boot-up to
+ decide which device tree to use when given multiple device trees, some of
+ which may not be compatible with the actual hardware. It is the
+ bootloader's responsibility to pass the correct device tree to the
+ kernel.
+ The property is deprecated.
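+
+ # A minimal sketch of how these two deprecated properties appear in a
+ # board DTS; the values are hypothetical, not taken from any real board:
+ #
+ # / {
+ # qcom,msm-id = <0xce 0x20000>; /* chipset ID, hardware revision */
+ # qcom,board-id = <8 0>; /* platform type, subtype */
+ # };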
+
+allOf:
+ # Explicit allow-list for older SoCs. The legacy properties are not allowed
+ # on newer SoCs.
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,apq8026
+ - qcom,apq8094
+ - qcom,apq8096
+ - qcom,msm8939
+ - qcom,msm8953
+ - qcom,msm8956
+ - qcom,msm8992
+ - qcom,msm8994
+ - qcom,msm8996
+ - qcom,msm8998
+ - qcom,sdm450
+ - qcom,sdm630
+ - qcom,sdm632
+ - qcom,sdm636
+ - qcom,sdm845
+ - qcom,sdx55
+ - qcom,sdx65
+ - qcom,sm4250
+ - qcom,sm6115
+ - qcom,sm6125
+ - qcom,sm6350
+ - qcom,sm7225
+ - qcom,sm8150
+ - qcom,sm8250
+ then:
+ properties:
+ qcom,board-id: true
+ qcom,msm-id: true
+ else:
+ properties:
+ qcom,board-id: false
+ qcom,msm-id: false
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - oneplus,cheeseburger
+ - oneplus,dumpling
+ - oneplus,enchilada
+ - oneplus,fajita
+ - oneplus,oneplus3
+ - oneplus,oneplus3t
+ then:
+ properties:
+ qcom,board-id:
+ items:
+ minItems: 4
+ else:
+ properties:
+ qcom,board-id:
+ items:
+ maxItems: 2
+
additionalProperties: true
...
diff --git a/Documentation/devicetree/bindings/arm/rda.yaml b/Documentation/devicetree/bindings/arm/rda.yaml
index a5c0444aa2b4..09241ea1d228 100644
--- a/Documentation/devicetree/bindings/arm/rda.yaml
+++ b/Documentation/devicetree/bindings/arm/rda.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/rda.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: RDA Micro platforms device tree bindings
+title: RDA Micro platforms
maintainers:
- Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
diff --git a/Documentation/devicetree/bindings/arm/realtek.yaml b/Documentation/devicetree/bindings/arm/realtek.yaml
index 9fb0297fe1ce..ddd9a85099e9 100644
--- a/Documentation/devicetree/bindings/arm/realtek.yaml
+++ b/Documentation/devicetree/bindings/arm/realtek.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/realtek.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Realtek platforms device tree bindings
+title: Realtek platforms
maintainers:
- Andreas Färber <afaerber@suse.de>
diff --git a/Documentation/devicetree/bindings/arm/renesas.yaml b/Documentation/devicetree/bindings/arm/renesas.yaml
deleted file mode 100644
index 8a11918866b8..000000000000
--- a/Documentation/devicetree/bindings/arm/renesas.yaml
+++ /dev/null
@@ -1,353 +0,0 @@
-# SPDX-License-Identifier: GPL-2.0
-%YAML 1.2
----
-$id: http://devicetree.org/schemas/arm/renesas.yaml#
-$schema: http://devicetree.org/meta-schemas/core.yaml#
-
-title: Renesas SH-Mobile, R-Mobile, and R-Car Platform Device Tree Bindings
-
-maintainers:
- - Geert Uytterhoeven <geert+renesas@glider.be>
-
-properties:
- $nodename:
- const: '/'
- compatible:
- oneOf:
- - description: Emma Mobile EV2
- items:
- - enum:
- - renesas,kzm9d # Kyoto Microcomputer Co. KZM-A9-Dual
- - const: renesas,emev2
-
- - description: RZ/A1H (R7S72100)
- items:
- - enum:
- - renesas,genmai # Genmai (RTK772100BC00000BR)
- - renesas,gr-peach # GR-Peach (X28A-M01-E/F)
- - renesas,rskrza1 # RSKRZA1 (YR0K77210C000BE)
- - const: renesas,r7s72100
-
- - description: RZ/A2 (R7S9210)
- items:
- - enum:
- - renesas,rza2mevb # RZ/A2M Eval Board (RTK7921053S00000BE)
- - const: renesas,r7s9210
-
- - description: SH-Mobile AG5 (R8A73A00/SH73A0)
- items:
- - enum:
- - renesas,kzm9g # Kyoto Microcomputer Co. KZM-A9-GT
- - const: renesas,sh73a0
-
- - description: R-Mobile APE6 (R8A73A40)
- items:
- - enum:
- - renesas,ape6evm
- - const: renesas,r8a73a4
-
- - description: R-Mobile A1 (R8A77400)
- items:
- - enum:
- - renesas,armadillo800eva # Atmark Techno Armadillo-800 EVA
- - const: renesas,r8a7740
-
- - description: RZ/G1H (R8A77420)
- items:
- - enum:
- # iWave Systems RZ/G1H Qseven System On Module (iW-RainboW-G21M-Qseven)
- - iwave,g21m
- - const: renesas,r8a7742
-
- - items:
- - enum:
- # iWave Systems RZ/G1H Qseven Development Platform (iW-RainboW-G21D-Qseven)
- - iwave,g21d
- - const: iwave,g21m
- - const: renesas,r8a7742
-
- - description: RZ/G1M (R8A77430)
- items:
- - enum:
- # iWave Systems RZ/G1M Qseven Development Platform (iW-RainboW-G20D-Qseven)
- - iwave,g20d
- - const: iwave,g20m
- - const: renesas,r8a7743
-
- - items:
- - enum:
- # iWave Systems RZ/G1M Qseven System On Module (iW-RainboW-G20M-Qseven)
- - iwave,g20m
- - renesas,sk-rzg1m # SK-RZG1M (YR8A77430S000BE)
- - const: renesas,r8a7743
-
- - description: RZ/G1N (R8A77440)
- items:
- - enum:
- # iWave Systems RZ/G1N Qseven Development Platform (iW-RainboW-G20D-Qseven)
- - iwave,g20d
- - const: iwave,g20m
- - const: renesas,r8a7744
-
- - items:
- - enum:
- # iWave Systems RZ/G1N Qseven System On Module (iW-RainboW-G20M-Qseven)
- - iwave,g20m
- - const: renesas,r8a7744
-
- - description: RZ/G1E (R8A77450)
- items:
- - enum:
- - iwave,g22m # iWave Systems RZ/G1E SODIMM System On Module (iW-RainboW-G22M-SM)
- - renesas,sk-rzg1e # SK-RZG1E (YR8A77450S000BE)
- - const: renesas,r8a7745
-
- - description: iWave Systems RZ/G1E SODIMM SOM Development Platform (iW-RainboW-G22D)
- items:
- - const: iwave,g22d
- - const: iwave,g22m
- - const: renesas,r8a7745
-
- - description: RZ/G1C (R8A77470)
- items:
- - enum:
- - iwave,g23s #iWave Systems RZ/G1C Single Board Computer (iW-RainboW-G23S)
- - const: renesas,r8a77470
-
- - description: RZ/G2M (R8A774A1)
- items:
- - enum:
- - hoperun,hihope-rzg2m # HopeRun HiHope RZ/G2M platform
- - beacon,beacon-rzg2m # Beacon EmbeddedWorks RZ/G2M Kit
- - const: renesas,r8a774a1
-
- - items:
- - enum:
- - hoperun,hihope-rzg2-ex # HopeRun expansion board for HiHope RZ/G2 platforms
- - const: hoperun,hihope-rzg2m
- - const: renesas,r8a774a1
-
- - description: RZ/G2N (R8A774B1)
- items:
- - enum:
- - beacon,beacon-rzg2n # Beacon EmbeddedWorks RZ/G2N Kit
- - hoperun,hihope-rzg2n # HopeRun HiHope RZ/G2N platform
- - const: renesas,r8a774b1
-
- - items:
- - enum:
- - hoperun,hihope-rzg2-ex # HopeRun expansion board for HiHope RZ/G2 platforms
- - const: hoperun,hihope-rzg2n
- - const: renesas,r8a774b1
-
- - description: RZ/G2E (R8A774C0)
- items:
- - enum:
- - si-linux,cat874 # Silicon Linux RZ/G2E 96board platform (CAT874)
- - const: renesas,r8a774c0
-
- - items:
- - enum:
- - si-linux,cat875 # Silicon Linux sub board for CAT874 (CAT875)
- - const: si-linux,cat874
- - const: renesas,r8a774c0
-
- - description: RZ/G2H (R8A774E1)
- items:
- - enum:
- - beacon,beacon-rzg2h # Beacon EmbeddedWorks RZ/G2H Kit
- - hoperun,hihope-rzg2h # HopeRun HiHope RZ/G2H platform
- - const: renesas,r8a774e1
-
- - items:
- - enum:
- - hoperun,hihope-rzg2-ex # HopeRun expansion board for HiHope RZ/G2 platforms
- - const: hoperun,hihope-rzg2h
- - const: renesas,r8a774e1
-
- - description: R-Car M1A (R8A77781)
- items:
- - enum:
- - renesas,bockw
- - const: renesas,r8a7778
-
- - description: R-Car H1 (R8A77790)
- items:
- - enum:
- - renesas,marzen # Marzen (R0P7779A00010S)
- - const: renesas,r8a7779
-
- - description: R-Car H2 (R8A77900)
- items:
- - enum:
- - renesas,lager # Lager (RTP0RC7790SEB00010S)
- - renesas,stout # Stout (ADAS Starterkit, Y-R-CAR-ADAS-SKH2-BOARD)
- - const: renesas,r8a7790
-
- - description: R-Car M2-W (R8A77910)
- items:
- - enum:
- - renesas,henninger
- - renesas,koelsch # Koelsch (RTP0RC7791SEB00010S)
- - renesas,porter # Porter (M2-LCDP)
- - const: renesas,r8a7791
-
- - description: R-Car V2H (R8A77920)
- items:
- - enum:
- - renesas,blanche # Blanche (RTP0RC7792SEB00010S)
- - renesas,wheat # Wheat (RTP0RC7792ASKB0000JE)
- - const: renesas,r8a7792
-
- - description: R-Car M2-N (R8A77930)
- items:
- - enum:
- - renesas,gose # Gose (RTP0RC7793SEB00010S)
- - const: renesas,r8a7793
-
- - description: R-Car E2 (R8A77940)
- items:
- - enum:
- - renesas,alt # Alt (RTP0RC7794SEB00010S)
- - renesas,silk # SILK (RTP0RC7794LCB00011S)
- - const: renesas,r8a7794
-
- - description: R-Car H3 (R8A77950)
- items:
- - enum:
- # H3ULCB (R-Car Starter Kit Premier, RTP0RC7795SKBX0010SA00 (H3 ES1.1))
- # H3ULCB (R-Car Starter Kit Premier, RTP0RC77951SKBX010SA00 (H3 ES2.0))
- - renesas,h3ulcb
- - renesas,salvator-x # Salvator-X (RTP0RC7795SIPB0010S)
- - renesas,salvator-xs # Salvator-XS (Salvator-X 2nd version, RTP0RC7795SIPB0012S)
- - const: renesas,r8a7795
-
- - description: R-Car M3-W (R8A77960)
- items:
- - enum:
- - renesas,m3ulcb # M3ULCB (R-Car Starter Kit Pro, RTP0RC7796SKBX0010SA09 (M3 ES1.0))
- - renesas,salvator-x # Salvator-X (RTP0RC7796SIPB0011S)
- - renesas,salvator-xs # Salvator-XS (Salvator-X 2nd version, RTP0RC7796SIPB0012S)
- - const: renesas,r8a7796
-
- - description: R-Car M3-W+ (R8A77961)
- items:
- - enum:
- - renesas,m3ulcb # M3ULCB (R-Car Starter Kit Pro, RTP8J77961ASKB0SK0SA05A (M3 ES3.0))
- - renesas,salvator-xs # Salvator-XS (Salvator-X 2nd version, RTP0RC7796SIPB0012SA5A)
- - const: renesas,r8a77961
-
- - description: Kingfisher (SBEV-RCAR-KF-M03)
- oneOf:
- - items:
- - const: shimafuji,kingfisher
- - enum:
- - renesas,h3ulcb
- - renesas,m3ulcb
- - renesas,m3nulcb
- - enum:
- - renesas,r8a7795
- - renesas,r8a7796
- - renesas,r8a77961
- - renesas,r8a77965
- - items:
- - const: shimafuji,kingfisher
- - enum:
- - renesas,h3ulcb
- - renesas,m3ulcb
- - enum:
- - renesas,r8a779m1
- - renesas,r8a779m3
- - enum:
- - renesas,r8a7795
- - renesas,r8a77961
-
- - description: R-Car M3-N (R8A77965)
- items:
- - enum:
- - renesas,m3nulcb # M3NULCB (R-Car Starter Kit Pro, RTP0RC77965SKBX010SA00 (M3-N ES1.1))
- - renesas,salvator-x # Salvator-X (RTP0RC7796SIPB0011S (M3-N))
- - renesas,salvator-xs # Salvator-XS (Salvator-X 2nd version, RTP0RC77965SIPB012S)
- - const: renesas,r8a77965
-
- - description: R-Car V3M (R8A77970)
- items:
- - enum:
- - renesas,eagle # Eagle (RTP0RC77970SEB0010S)
- - renesas,v3msk # V3MSK (Y-ASK-RCAR-V3M-WS10)
- - const: renesas,r8a77970
-
- - description: R-Car V3H (R8A77980)
- items:
- - enum:
- - renesas,condor # Condor (RTP0RC77980SEB0010SS/RTP0RC77980SEB0010SA01)
- - renesas,v3hsk # V3HSK (Y-ASK-RCAR-V3H-WS10)
- - const: renesas,r8a77980
-
- - description: R-Car E3 (R8A77990)
- items:
- - enum:
- - renesas,ebisu # Ebisu (RTP0RC77990SEB0010S)
- - const: renesas,r8a77990
-
- - description: R-Car D3 (R8A77995)
- items:
- - enum:
- - renesas,draak # Draak (RTP0RC77995SEB0010S)
- - const: renesas,r8a77995
-
- - description: R-Car V3U (R8A779A0)
- items:
- - enum:
- - renesas,falcon-cpu # Falcon CPU board (RTP0RC779A0CPB0010S)
- - const: renesas,r8a779a0
-
- - items:
- - enum:
- - renesas,falcon-breakout # Falcon BreakOut board (RTP0RC779A0BOB0010S)
- - const: renesas,falcon-cpu
- - const: renesas,r8a779a0
-
- - description: R-Car H3e-2G (R8A779M1)
- items:
- - enum:
- - renesas,h3ulcb # H3ULCB (R-Car Starter Kit Premier)
- - renesas,salvator-xs # Salvator-XS (Salvator-X 2nd version)
- - const: renesas,r8a779m1
- - const: renesas,r8a7795
-
- - description: R-Car M3e-2G (R8A779M3)
- items:
- - enum:
- - renesas,m3ulcb # M3ULCB (R-Car Starter Kit Pro)
- - renesas,salvator-xs # Salvator-XS (Salvator-X 2nd version)
- - const: renesas,r8a779m3
- - const: renesas,r8a77961
-
- - description: RZ/N1D (R9A06G032)
- items:
- - enum:
- - renesas,rzn1d400-db # RZN1D-DB (RZ/N1D Demo Board for the RZ/N1D 400 pins package)
- - const: renesas,r9a06g032
-
- - description: RZ/G2UL (R9A07G043)
- items:
- - enum:
- - renesas,r9a07g043u11 # RZ/G2UL Type-1
- - renesas,r9a07g043u12 # RZ/G2UL Type-2
- - const: renesas,r9a07g043
-
- - description: RZ/G2{L,LC} (R9A07G044)
- items:
- - enum:
- - renesas,smarc-evk # SMARC EVK
- - enum:
- - renesas,r9a07g044c1 # Single Cortex-A55 RZ/G2LC
- - renesas,r9a07g044c2 # Dual Cortex-A55 RZ/G2LC
- - renesas,r9a07g044l1 # Single Cortex-A55 RZ/G2L
- - renesas,r9a07g044l2 # Dual Cortex-A55 RZ/G2L
- - const: renesas,r9a07g044
-
-additionalProperties: true
-
-...
diff --git a/Documentation/devicetree/bindings/arm/rockchip.yaml b/Documentation/devicetree/bindings/arm/rockchip.yaml
index 6546b015fc62..ec141c937b8b 100644
--- a/Documentation/devicetree/bindings/arm/rockchip.yaml
+++ b/Documentation/devicetree/bindings/arm/rockchip.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/rockchip.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Rockchip platforms device tree bindings
+title: Rockchip platforms
maintainers:
- Heiko Stuebner <heiko@sntech.de>
@@ -30,6 +30,31 @@ properties:
- const: amarula,vyasa-rk3288
- const: rockchip,rk3288
+ - description: Anbernic RG351M
+ items:
+ - const: anbernic,rg351m
+ - const: rockchip,rk3326
+
+ - description: Anbernic RG353P
+ items:
+ - const: anbernic,rg353p
+ - const: rockchip,rk3566
+
+ - description: Anbernic RG353V
+ items:
+ - const: anbernic,rg353v
+ - const: rockchip,rk3566
+
+ - description: Anbernic RG353VS
+ items:
+ - const: anbernic,rg353vs
+ - const: rockchip,rk3566
+
+ - description: Anbernic RG503
+ items:
+ - const: anbernic,rg503
+ - const: rockchip,rk3566
+
- description: Asus Tinker board
items:
- const: asus,rk3288-tinker
@@ -65,11 +90,33 @@ properties:
- const: chipspark,rayeager-px2
- const: rockchip,rk3066a
+ - description: Edgeble Neural Compute Module 2 (Neu2) SoM based boards
+ items:
+ - const: edgeble,neural-compute-module-2-io # Edgeble Neural Compute Module 2 IO Board
+ - const: edgeble,neural-compute-module-2 # Edgeble Neural Compute Module 2 SoM
+ - const: rockchip,rv1126
+
+ - description: Edgeble Neural Compute Module 6 (Neu6) Model A SoM based boards
+ items:
+ - const: edgeble,neural-compute-module-6a-io # Edgeble Neural Compute Module 6A IO Board
+ - const: edgeble,neural-compute-module-6a # Edgeble Neural Compute Module 6A SoM
+ - const: rockchip,rk3588
+
- description: Elgin RV1108 R1
items:
- const: elgin,rv1108-r1
- const: rockchip,rv1108
+ - description: EmbedFire LubanCat 1
+ items:
+ - const: embedfire,lubancat-1
+ - const: rockchip,rk3566
+
+ - description: EmbedFire LubanCat 2
+ items:
+ - const: embedfire,lubancat-2
+ - const: rockchip,rk3568
+
- description: Engicam PX30.Core C.TOUCH 2.0
items:
- const: engicam,px30-core-ctouch2
@@ -115,6 +162,11 @@ properties:
- const: firefly,roc-rk3328-cc
- const: rockchip,rk3328
+ - description: Firefly ROC-RK3328-PC
+ items:
+ - const: firefly,roc-rk3328-pc
+ - const: rockchip,rk3328
+
- description: Firefly ROC-RK3399-PC
items:
- enum:
@@ -122,9 +174,22 @@ properties:
- firefly,roc-rk3399-pc-mezzanine
- const: rockchip,rk3399
- - description: FriendlyElec NanoPi R2S
+ - description: Firefly ROC-RK3399-PC-PLUS
+ items:
+ - enum:
+ - firefly,roc-rk3399-pc-plus
+ - const: rockchip,rk3399
+
+ - description: Firefly Station M2
+ items:
+ - const: firefly,rk3566-roc-pc
+ - const: rockchip,rk3566
+
+ - description: FriendlyElec NanoPi R2 series boards
items:
- - const: friendlyarm,nanopi-r2s
+ - enum:
+ - friendlyarm,nanopi-r2c
+ - friendlyarm,nanopi-r2s
- const: rockchip,rk3328
- description: FriendlyElec NanoPi4 series boards
@@ -135,8 +200,16 @@ properties:
- friendlyarm,nanopi-m4b
- friendlyarm,nanopi-neo4
- friendlyarm,nanopi-r4s
+ - friendlyarm,nanopi-r4s-enterprise
- const: rockchip,rk3399
+ - description: FriendlyElec NanoPi R5 series boards
+ items:
+ - enum:
+ - friendlyarm,nanopi-r5c
+ - friendlyarm,nanopi-r5s
+ - const: rockchip,rk3568
+
- description: GeekBuying GeekBox
items:
- const: geekbuying,geekbox
@@ -287,6 +360,34 @@ properties:
- const: google,veyron
- const: rockchip,rk3288
+ - description: Google Scarlet - Dumo (ASUS Chromebook Tablet CT100)
+ items:
+ - const: google,scarlet-rev15-sku0
+ - const: google,scarlet-rev15
+ - const: google,scarlet-rev14-sku0
+ - const: google,scarlet-rev14
+ - const: google,scarlet-rev13-sku0
+ - const: google,scarlet-rev13
+ - const: google,scarlet-rev12-sku0
+ - const: google,scarlet-rev12
+ - const: google,scarlet-rev11-sku0
+ - const: google,scarlet-rev11
+ - const: google,scarlet-rev10-sku0
+ - const: google,scarlet-rev10
+ - const: google,scarlet-rev9-sku0
+ - const: google,scarlet-rev9
+ - const: google,scarlet-rev8-sku0
+ - const: google,scarlet-rev8
+ - const: google,scarlet-rev7-sku0
+ - const: google,scarlet-rev7
+ - const: google,scarlet-rev6-sku0
+ - const: google,scarlet-rev6
+ - const: google,scarlet-rev5-sku0
+ - const: google,scarlet-rev5
+ - const: google,scarlet
+ - const: google,gru
+ - const: rockchip,rk3399
+
- description: Google Scarlet - Kingdisplay (Acer Chromebook Tab 10)
items:
- const: google,scarlet-rev15-sku7
@@ -319,30 +420,55 @@ properties:
- const: google,gru
- const: rockchip,rk3399
- - description: Google Scarlet - Innolux display (Acer Chromebook Tab 10)
+ - description: |
+ Google Scarlet - Innolux display (Acer Chromebook Tab 10 and more)
items:
+ - const: google,scarlet-rev15-sku2
+ - const: google,scarlet-rev15-sku4
- const: google,scarlet-rev15-sku6
- const: google,scarlet-rev15
+ - const: google,scarlet-rev14-sku2
+ - const: google,scarlet-rev14-sku4
- const: google,scarlet-rev14-sku6
- const: google,scarlet-rev14
+ - const: google,scarlet-rev13-sku2
+ - const: google,scarlet-rev13-sku4
- const: google,scarlet-rev13-sku6
- const: google,scarlet-rev13
+ - const: google,scarlet-rev12-sku2
+ - const: google,scarlet-rev12-sku4
- const: google,scarlet-rev12-sku6
- const: google,scarlet-rev12
+ - const: google,scarlet-rev11-sku2
+ - const: google,scarlet-rev11-sku4
- const: google,scarlet-rev11-sku6
- const: google,scarlet-rev11
+ - const: google,scarlet-rev10-sku2
+ - const: google,scarlet-rev10-sku4
- const: google,scarlet-rev10-sku6
- const: google,scarlet-rev10
+ - const: google,scarlet-rev9-sku2
+ - const: google,scarlet-rev9-sku4
- const: google,scarlet-rev9-sku6
- const: google,scarlet-rev9
+ - const: google,scarlet-rev8-sku2
+ - const: google,scarlet-rev8-sku4
- const: google,scarlet-rev8-sku6
- const: google,scarlet-rev8
+ - const: google,scarlet-rev7-sku2
+ - const: google,scarlet-rev7-sku4
- const: google,scarlet-rev7-sku6
- const: google,scarlet-rev7
+ - const: google,scarlet-rev6-sku2
+ - const: google,scarlet-rev6-sku4
- const: google,scarlet-rev6-sku6
- const: google,scarlet-rev6
+ - const: google,scarlet-rev5-sku2
+ - const: google,scarlet-rev5-sku4
- const: google,scarlet-rev5-sku6
- const: google,scarlet-rev5
+ - const: google,scarlet-rev4-sku2
+ - const: google,scarlet-rev4-sku4
- const: google,scarlet-rev4-sku6
- const: google,scarlet-rev4
- const: google,scarlet
@@ -388,6 +514,21 @@ properties:
- const: hardkernel,rk3326-odroid-go2
- const: rockchip,rk3326
+ - description: Hardkernel Odroid Go Advance Black Edition
+ items:
+ - const: hardkernel,rk3326-odroid-go2-v11
+ - const: rockchip,rk3326
+
+ - description: Hardkernel Odroid Go Super
+ items:
+ - const: hardkernel,rk3326-odroid-go3
+ - const: rockchip,rk3326
+
+ - description: Hardkernel Odroid M1
+ items:
+ - const: rockchip,rk3568-odroid-m1
+ - const: rockchip,rk3568
+
- description: Hugsun X99 TV Box
items:
- const: hugsun,x99
@@ -401,6 +542,11 @@ properties:
- khadas,edge-v
- const: rockchip,rk3399
+ - description: Khadas Edge2 series boards
+ items:
+ - const: khadas,edge2
+ - const: rockchip,rk3588s
+
- description: Kobol Helios64
items:
- const: kobol,helios64
@@ -426,6 +572,11 @@ properties:
- const: netxeon,r89
- const: rockchip,rk3288
+ - description: OPEN AI LAB EAIDK-610
+ items:
+ - const: openailab,eaidk-610
+ - const: rockchip,rk3399
+
- description: Orange Pi RK3399 board
items:
- const: rockchip,rk3399-orangepi
@@ -442,6 +593,19 @@ properties:
- const: pine64,pinebook-pro
- const: rockchip,rk3399
+ - description: Pine64 PineNote
+ items:
+ - enum:
+ - pine64,pinenote-v1.1
+ - pine64,pinenote-v1.2
+ - const: pine64,pinenote
+ - const: rockchip,rk3566
+
+ - description: Pine64 PinePhonePro
+ items:
+ - const: pine64,pinephone-pro
+ - const: rockchip,rk3399
+
- description: Pine64 Rock64
items:
- const: pine64,rock64
@@ -455,20 +619,57 @@ properties:
- const: pine64,rockpro64
- const: rockchip,rk3399
+ - description: Pine64 Quartz64 Model A/B
+ items:
+ - enum:
+ - pine64,quartz64-a
+ - pine64,quartz64-b
+ - const: rockchip,rk3566
+
+ - description: Pine64 SoQuartz SoM
+ items:
+ - enum:
+ - pine64,soquartz-blade
+ - pine64,soquartz-cm4io
+ - pine64,soquartz-model-a
+ - const: pine64,soquartz
+ - const: rockchip,rk3566
+
+ - description: Radxa Compute Module 3 (CM3)
+ items:
+ - enum:
+ - radxa,cm3-io
+ - const: radxa,cm3
+ - const: rockchip,rk3566
+
+ - description: Radxa CM3 Industrial
+ items:
+ - enum:
+ - radxa,e25
+ - const: radxa,cm3i
+ - const: rockchip,rk3568
+
- description: Radxa Rock
items:
- const: radxa,rock
- const: rockchip,rk3188
- - description: Radxa ROCK Pi 4A/B/C
+ - description: Radxa ROCK Pi 4A/A+/B/B+/C
items:
- enum:
- radxa,rockpi4a
+ - radxa,rockpi4a-plus
- radxa,rockpi4b
+ - radxa,rockpi4b-plus
- radxa,rockpi4c
- const: radxa,rockpi4
- const: rockchip,rk3399
+ - description: Radxa ROCK 4C+
+ items:
+ - const: radxa,rock-4c-plus
+ - const: rockchip,rk3399
+
- description: Radxa ROCK Pi E
items:
- const: radxa,rockpi-e
@@ -486,11 +687,31 @@ properties:
- const: vamrs,rk3399pro-vmarc-som
- const: rockchip,rk3399pro
+ - description: Radxa ROCK Pi S
+ items:
+ - const: radxa,rockpis
+ - const: rockchip,rk3308
+
- description: Radxa Rock2 Square
items:
- const: radxa,rock2-square
- const: rockchip,rk3288
+ - description: Radxa ROCK3 Model A
+ items:
+ - const: radxa,rock3a
+ - const: rockchip,rk3568
+
+ - description: Radxa ROCK 5 Model A
+ items:
+ - const: radxa,rock-5a
+ - const: rockchip,rk3588s
+
+ - description: Radxa ROCK 5 Model B
+ items:
+ - const: radxa,rock-5b
+ - const: rockchip,rk3588
+
- description: Rikomagic MK808 v1
items:
- const: rikomagic,mk808
@@ -528,6 +749,11 @@ properties:
- const: rockchip,rk3036-evb
- const: rockchip,rk3036
+ - description: Rockchip RK3128 Evaluation board
+ items:
+ - const: rockchip,rk3128-evb
+ - const: rockchip,rk3128
+
- description: Rockchip RK3228 Evaluation board
items:
- const: rockchip,rk3228-evb
@@ -575,11 +801,21 @@ properties:
- const: rockchip,rk3399-sapphire-excavator
- const: rockchip,rk3399
+ - description: Rockchip RK3588 Evaluation board
+ items:
+ - const: rockchip,rk3588-evb1-v10
+ - const: rockchip,rk3588
+
- description: Rockchip RV1108 Evaluation board
items:
- const: rockchip,rv1108-evb
- const: rockchip,rv1108
+ - description: Theobroma Systems PX30-uQ7 with Haikou baseboard
+ items:
+ - const: tsd,px30-ringneck-haikou
+ - const: rockchip,px30
+
- description: Theobroma Systems RK3368-uQ7 with Haikou baseboard
items:
- const: tsd,rk3368-lion-haikou
@@ -595,16 +831,33 @@ properties:
- const: tronsmart,orion-r68-meta
- const: rockchip,rk3368
+ - description: Xunlong Orange Pi R1 Plus / LTS
+ items:
+ - enum:
+ - xunlong,orangepi-r1-plus
+ - xunlong,orangepi-r1-plus-lts
+ - const: rockchip,rk3328
+
- description: Zkmagic A95X Z2
items:
- const: zkmagic,a95x-z2
- const: rockchip,rk3318
+ - description: Rockchip RK3566 BOX Evaluation Demo board
+ items:
+ - const: rockchip,rk3566-box-demo
+ - const: rockchip,rk3566
+
- description: Rockchip RK3568 Evaluation board
items:
- const: rockchip,rk3568-evb1-v10
- const: rockchip,rk3568
+ - description: Rockchip RK3568 Banana Pi R2 Pro
+ items:
+ - const: rockchip,rk3568-bpi-r2pro
+ - const: rockchip,rk3568
+
additionalProperties: true
...
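For readers less familiar with these schemas: each entry above pairs a
board-level compatible with its SoC fallback, most specific first. A minimal
sketch of a board root node matching one of the new entries (the model string
is illustrative, not taken from the binding):

    / {
        model = "Radxa ROCK 5 Model B";
        compatible = "radxa,rock-5b", "rockchip,rk3588";
    };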
diff --git a/Documentation/devicetree/bindings/arm/rockchip/pmu.yaml b/Documentation/devicetree/bindings/arm/rockchip/pmu.yaml
index 53115b92d17f..b79c81cd9f0e 100644
--- a/Documentation/devicetree/bindings/arm/rockchip/pmu.yaml
+++ b/Documentation/devicetree/bindings/arm/rockchip/pmu.yaml
@@ -21,8 +21,13 @@ select:
enum:
- rockchip,px30-pmu
- rockchip,rk3066-pmu
+ - rockchip,rk3128-pmu
- rockchip,rk3288-pmu
+ - rockchip,rk3368-pmu
- rockchip,rk3399-pmu
+ - rockchip,rk3568-pmu
+ - rockchip,rk3588-pmu
+ - rockchip,rv1126-pmu
required:
- compatible
@@ -33,8 +38,13 @@ properties:
- enum:
- rockchip,px30-pmu
- rockchip,rk3066-pmu
+ - rockchip,rk3128-pmu
- rockchip,rk3288-pmu
+ - rockchip,rk3368-pmu
- rockchip,rk3399-pmu
+ - rockchip,rk3568-pmu
+ - rockchip,rk3588-pmu
+ - rockchip,rv1126-pmu
- const: syscon
- const: simple-mfd
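For orientation, a PMU node matching one of the newly added compatibles
combines the SoC-specific string with "syscon" and "simple-mfd", as the
properties block above requires. A hedged sketch (the unit address and
register size are placeholders, not taken from this binding):

    pmu: power-management@fdd90000 {
        compatible = "rockchip,rk3568-pmu", "syscon", "simple-mfd";
        reg = <0xfdd90000 0x1000>;
    };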
diff --git a/Documentation/devicetree/bindings/arm/samsung/exynos-chipid.yaml b/Documentation/devicetree/bindings/arm/samsung/exynos-chipid.yaml
deleted file mode 100644
index f99c0c6df21b..000000000000
--- a/Documentation/devicetree/bindings/arm/samsung/exynos-chipid.yaml
+++ /dev/null
@@ -1,40 +0,0 @@
-# SPDX-License-Identifier: GPL-2.0
-%YAML 1.2
----
-$id: http://devicetree.org/schemas/arm/samsung/exynos-chipid.yaml#
-$schema: http://devicetree.org/meta-schemas/core.yaml#
-
-title: Samsung Exynos SoC series Chipid driver
-
-maintainers:
- - Krzysztof Kozlowski <krzk@kernel.org>
-
-properties:
- compatible:
- items:
- - const: samsung,exynos4210-chipid
-
- reg:
- maxItems: 1
-
- samsung,asv-bin:
- description:
- Adaptive Supply Voltage bin selection. This can be used
- to determine the ASV bin of an SoC if respective information
- is missing in the CHIPID registers or in the OTP memory.
- $ref: /schemas/types.yaml#/definitions/uint32
- enum: [0, 1, 2, 3]
-
-required:
- - compatible
- - reg
-
-additionalProperties: false
-
-examples:
- - |
- chipid@10000000 {
- compatible = "samsung,exynos4210-chipid";
- reg = <0x10000000 0x100>;
- samsung,asv-bin = <2>;
- };
diff --git a/Documentation/devicetree/bindings/arm/samsung/pmu.yaml b/Documentation/devicetree/bindings/arm/samsung/pmu.yaml
deleted file mode 100644
index 17678d9686c1..000000000000
--- a/Documentation/devicetree/bindings/arm/samsung/pmu.yaml
+++ /dev/null
@@ -1,128 +0,0 @@
-# SPDX-License-Identifier: GPL-2.0
-%YAML 1.2
----
-$id: http://devicetree.org/schemas/arm/samsung/pmu.yaml#
-$schema: http://devicetree.org/meta-schemas/core.yaml#
-
-title: Samsung Exynos SoC series Power Management Unit (PMU)
-
-maintainers:
- - Krzysztof Kozlowski <krzk@kernel.org>
-
-# Custom select to avoid matching all nodes with 'syscon'
-select:
- properties:
- compatible:
- contains:
- enum:
- - samsung,exynos3250-pmu
- - samsung,exynos4210-pmu
- - samsung,exynos4412-pmu
- - samsung,exynos5250-pmu
- - samsung,exynos5260-pmu
- - samsung,exynos5410-pmu
- - samsung,exynos5420-pmu
- - samsung,exynos5433-pmu
- - samsung,exynos7-pmu
- - samsung-s5pv210-pmu
- required:
- - compatible
-
-properties:
- compatible:
- items:
- - enum:
- - samsung,exynos3250-pmu
- - samsung,exynos4210-pmu
- - samsung,exynos4412-pmu
- - samsung,exynos5250-pmu
- - samsung,exynos5260-pmu
- - samsung,exynos5410-pmu
- - samsung,exynos5420-pmu
- - samsung,exynos5433-pmu
- - samsung,exynos7-pmu
- - samsung-s5pv210-pmu
- - const: syscon
-
- reg:
- maxItems: 1
-
- assigned-clock-parents: true
- assigned-clocks: true
-
- '#clock-cells':
- const: 1
-
- clock-names:
- description:
- List of clock names for particular CLKOUT mux inputs
- minItems: 1
- maxItems: 32
- items:
- pattern: '^clkout([0-9]|[12][0-9]|3[0-1])$'
-
- clocks:
- minItems: 1
- maxItems: 32
-
- interrupt-controller:
- description:
- Some PMUs are capable of behaving as an interrupt controller (mostly
- to wake up a suspended PMU).
-
- '#interrupt-cells':
- description:
- Must be identical to the that of the parent interrupt controller.
- const: 3
-
- syscon-poweroff:
- $ref: "../../power/reset/syscon-poweroff.yaml#"
- type: object
- description:
- Node for power off method
-
- syscon-reboot:
- $ref: "../../power/reset/syscon-reboot.yaml#"
- type: object
- description:
- Node for reboot method
-
-required:
- - compatible
- - reg
-
-additionalProperties: false
-
-allOf:
- - if:
- properties:
- compatible:
- contains:
- enum:
- - samsung,exynos3250-pmu
- - samsung,exynos4210-pmu
- - samsung,exynos4412-pmu
- - samsung,exynos5250-pmu
- - samsung,exynos5410-pmu
- - samsung,exynos5420-pmu
- - samsung,exynos5433-pmu
- then:
- required:
- - '#clock-cells'
- - clock-names
- - clocks
-
-examples:
- - |
- #include <dt-bindings/clock/exynos5250.h>
-
- pmu_system_controller: system-controller@10040000 {
- compatible = "samsung,exynos5250-pmu", "syscon";
- reg = <0x10040000 0x5000>;
- interrupt-controller;
- #interrupt-cells = <3>;
- interrupt-parent = <&gic>;
- #clock-cells = <1>;
- clock-names = "clkout16";
- clocks = <&clock CLK_FIN_PLL>;
- };
diff --git a/Documentation/devicetree/bindings/arm/samsung/samsung-boards.yaml b/Documentation/devicetree/bindings/arm/samsung/samsung-boards.yaml
index 0796f0c87727..deb2cf971871 100644
--- a/Documentation/devicetree/bindings/arm/samsung/samsung-boards.yaml
+++ b/Documentation/devicetree/bindings/arm/samsung/samsung-boards.yaml
@@ -140,6 +140,8 @@ properties:
items:
- enum:
- insignal,arndale-octa # Insignal Arndale Octa
+ - samsung,chagall-wifi # Samsung SM-T800
+ - samsung,klimt-wifi # Samsung SM-T700
- samsung,smdk5420 # Samsung SMDK5420 eval
- const: samsung,exynos5420
- const: samsung,exynos5
@@ -169,6 +171,7 @@ properties:
- hardkernel,odroid-xu3-lite # Hardkernel Odroid XU3 Lite
- hardkernel,odroid-xu4 # Hardkernel Odroid XU4
- hardkernel,odroid-hc1 # Hardkernel Odroid HC1
+ - samsung,k3g # Samsung Galaxy S5 (SM-G900H)
- const: samsung,exynos5800
- const: samsung,exynos5
@@ -199,6 +202,24 @@ properties:
- samsung,exynos7-espresso # Samsung Exynos7 Espresso
- const: samsung,exynos7
+ - description: Exynos7885 based boards
+ items:
+ - enum:
+ - samsung,jackpotlte # Samsung Galaxy A8 (2018)
+ - const: samsung,exynos7885
+
+ - description: Exynos850 based boards
+ items:
+ - enum:
+ - winlink,e850-96 # WinLink E850-96
+ - const: samsung,exynos850
+
+ - description: Exynos Auto v9 based boards
+ items:
+ - enum:
+ - samsung,exynosautov9-sadk # Samsung Exynos Auto v9 SADK
+ - const: samsung,exynosautov9
+
required:
- compatible
diff --git a/Documentation/devicetree/bindings/arm/samsung/samsung-soc.yaml b/Documentation/devicetree/bindings/arm/samsung/samsung-soc.yaml
new file mode 100644
index 000000000000..653f85997643
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/samsung/samsung-soc.yaml
@@ -0,0 +1,40 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/samsung/samsung-soc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung S3C, S5P and Exynos SoC compatibles naming convention
+
+maintainers:
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+description: |
+ Guidelines for new compatibles for SoC blocks/components.
+ When adding new compatibles in new bindings, use the format::
+ samsung,SoC-IP
+
+ For example::
+ samsung,exynos5433-cmu-isp
+
+select:
+ properties:
+ compatible:
+ pattern: "^samsung,.*(s3c|s5pv|exynos)[0-9a-z]+.*$"
+ required:
+ - compatible
+
+properties:
+ compatible:
+ oneOf:
+ - description: Preferred naming style for compatibles of SoC components
+ pattern: "^samsung,(s3c|s5pv|exynos|exynosautov)[0-9]+-.*$"
+
+ # Legacy compatibles with wild-cards - list cannot grow with new bindings:
+ - enum:
+ - samsung,exynos4x12-pinctrl
+ - samsung,exynos4x12-usb2-phy
+ - samsung,s3c64xx-pinctrl
+ - samsung,s3c64xx-wakeup-eint
+
+additionalProperties: true
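To make the convention concrete, a hypothetical IP-block node following the
preferred samsung,SoC-IP style (the unit address and cell count are invented
for illustration):

    clock-controller@146d0000 {
        compatible = "samsung,exynos5433-cmu-isp";
        reg = <0x146d0000 0x1000>;
        #clock-cells = <1>;
    };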
diff --git a/Documentation/devicetree/bindings/arm/socionext/milbeaut.yaml b/Documentation/devicetree/bindings/arm/socionext/milbeaut.yaml
index aa1d4afbc510..5a428a885760 100644
--- a/Documentation/devicetree/bindings/arm/socionext/milbeaut.yaml
+++ b/Documentation/devicetree/bindings/arm/socionext/milbeaut.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/socionext/milbeaut.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Milbeaut platforms device tree bindings
+title: Milbeaut platforms
maintainers:
- Taichi Sugaya <sugaya.taichi@socionext.com>
diff --git a/Documentation/devicetree/bindings/arm/socionext/uniphier.yaml b/Documentation/devicetree/bindings/arm/socionext/uniphier.yaml
index 8c0e91658474..3e7f3d927ec7 100644
--- a/Documentation/devicetree/bindings/arm/socionext/uniphier.yaml
+++ b/Documentation/devicetree/bindings/arm/socionext/uniphier.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/socionext/uniphier.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Socionext UniPhier platform device tree bindings
+title: Socionext UniPhier platform
maintainers:
- Masahiro Yamada <yamada.masahiro@socionext.com>
@@ -26,6 +26,12 @@ properties:
- socionext,uniphier-pro4-ref
- socionext,uniphier-pro4-sanji
- const: socionext,uniphier-pro4
+ - description: Pro5 SoC boards
+ items:
+ - enum:
+ - socionext,uniphier-pro5-epcore
+ - socionext,uniphier-pro5-proex
+ - const: socionext,uniphier-pro5
- description: sLD8 SoC boards
items:
- enum:
diff --git a/Documentation/devicetree/bindings/arm/sp810.txt b/Documentation/devicetree/bindings/arm/sp810.txt
deleted file mode 100644
index 46652bf65147..000000000000
--- a/Documentation/devicetree/bindings/arm/sp810.txt
+++ /dev/null
@@ -1,46 +0,0 @@
-SP810 System Controller
------------------------
-
-Required properties:
-
-- compatible: standard compatible string for a Primecell peripheral,
- see Documentation/devicetree/bindings/arm/primecell.yaml
- for more details
- should be: "arm,sp810", "arm,primecell"
-
-- reg: standard registers property, physical address and size
- of the control registers
-
-- clock-names: from the common clock bindings, for more details see
- Documentation/devicetree/bindings/clock/clock-bindings.txt;
- should be: "refclk", "timclk", "apb_pclk"
-
-- clocks: from the common clock bindings, phandle and clock
- specifier pairs for the entries of clock-names property
-
-- #clock-cells: from the common clock bindings;
- should be: <1>
-
-- clock-output-names: from the common clock bindings;
- should be: "timerclken0", "timerclken1", "timerclken2", "timerclken3"
-
-- assigned-clocks: from the common clock binding;
- should be: clock specifier for each output clock of this
- provider node
-
-- assigned-clock-parents: from the common clock binding;
- should be: phandle of input clock listed in clocks
- property with the highest frequency
-
-Example:
- v2m_sysctl: sysctl@20000 {
- compatible = "arm,sp810", "arm,primecell";
- reg = <0x020000 0x1000>;
- clocks = <&v2m_refclk32khz>, <&v2m_refclk1mhz>, <&smbclk>;
- clock-names = "refclk", "timclk", "apb_pclk";
- #clock-cells = <1>;
- clock-output-names = "timerclken0", "timerclken1", "timerclken2", "timerclken3";
- assigned-clocks = <&v2m_sysctl 0>, <&v2m_sysctl 1>, <&v2m_sysctl 3>, <&v2m_sysctl 3>;
- assigned-clock-parents = <&v2m_refclk1mhz>, <&v2m_refclk1mhz>, <&v2m_refclk1mhz>, <&v2m_refclk1mhz>;
-
- };
diff --git a/Documentation/devicetree/bindings/arm/sp810.yaml b/Documentation/devicetree/bindings/arm/sp810.yaml
new file mode 100644
index 000000000000..c9094e5ec565
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/sp810.yaml
@@ -0,0 +1,80 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/sp810.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ARM Versatile Express SP810 System Controller
+
+maintainers:
+ - Andre Przywara <andre.przywara@arm.com>
+
+description:
+ The Arm SP810 system controller provides clocks, timers and a watchdog.
+
+# We need a select here so we don't match all nodes with 'arm,primecell'
+select:
+ properties:
+ compatible:
+ contains:
+ const: arm,sp810
+ required:
+ - compatible
+
+properties:
+ compatible:
+ items:
+ - const: arm,sp810
+ - const: arm,primecell
+
+ reg:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: refclk
+ - const: timclk
+ - const: apb_pclk
+
+ clocks:
+ items:
+ - description: reference clock
+ - description: timer clock
+ - description: APB register access clock
+
+ "#clock-cells":
+ const: 1
+
+ clock-output-names:
+ maxItems: 4
+
+ assigned-clocks:
+ maxItems: 4
+
+ assigned-clock-parents:
+ maxItems: 4
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - "#clock-cells"
+
+examples:
+ - |
+ sysctl@20000 {
+ compatible = "arm,sp810", "arm,primecell";
+ reg = <0x020000 0x1000>;
+ clocks = <&v2m_refclk32khz>, <&v2m_refclk1mhz>, <&smbclk>;
+ clock-names = "refclk", "timclk", "apb_pclk";
+ #clock-cells = <1>;
+ clock-output-names = "timerclken0", "timerclken1",
+ "timerclken2", "timerclken3";
+ assigned-clocks = <&v2m_sysctl 0>, <&v2m_sysctl 1>,
+ <&v2m_sysctl 3>, <&v2m_sysctl 3>;
+ assigned-clock-parents = <&v2m_refclk1mhz>, <&v2m_refclk1mhz>,
+ <&v2m_refclk1mhz>, <&v2m_refclk1mhz>;
+ };
diff --git a/Documentation/devicetree/bindings/arm/spe-pmu.txt b/Documentation/devicetree/bindings/arm/spe-pmu.txt
deleted file mode 100644
index 93372f2a7df9..000000000000
--- a/Documentation/devicetree/bindings/arm/spe-pmu.txt
+++ /dev/null
@@ -1,20 +0,0 @@
-* ARMv8.2 Statistical Profiling Extension (SPE) Performance Monitor Units (PMU)
-
-ARMv8.2 introduces the optional Statistical Profiling Extension for collecting
-performance sample data using an in-memory trace buffer.
-
-** SPE Required properties:
-
-- compatible : should be one of:
- "arm,statistical-profiling-extension-v1"
-
-- interrupts : Exactly 1 PPI must be listed. For heterogeneous systems where
- SPE is only supported on a subset of the CPUs, please consult
- the arm,gic-v3 binding for details on describing a PPI partition.
-
-** Example:
-
-spe-pmu {
- compatible = "arm,statistical-profiling-extension-v1";
- interrupts = <GIC_PPI 05 IRQ_TYPE_LEVEL_HIGH &part1>;
-};
diff --git a/Documentation/devicetree/bindings/arm/spear.yaml b/Documentation/devicetree/bindings/arm/spear.yaml
index 605ad3f882ef..a465c9eca76e 100644
--- a/Documentation/devicetree/bindings/arm/spear.yaml
+++ b/Documentation/devicetree/bindings/arm/spear.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/spear.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ST SPEAr Platforms Device Tree Bindings
+title: ST SPEAr Platforms
maintainers:
- Viresh Kumar <vireshk@kernel.org>
diff --git a/Documentation/devicetree/bindings/arm/sprd/sprd.yaml b/Documentation/devicetree/bindings/arm/sprd/sprd.yaml
index 7b6ae3070396..eaa67b8e0d6c 100644
--- a/Documentation/devicetree/bindings/arm/sprd/sprd.yaml
+++ b/Documentation/devicetree/bindings/arm/sprd/sprd.yaml
@@ -5,7 +5,7 @@
$id: http://devicetree.org/schemas/arm/sprd/sprd.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Unisoc platforms device tree bindings
+title: Unisoc platforms
maintainers:
- Orson Zhai <orsonzhai@gmail.com>
@@ -30,6 +30,11 @@ properties:
- sprd,sp9863a-1h10
- const: sprd,sc9863a
+ - items:
+ - enum:
+ - sprd,ums512-1h10
+ - const: sprd,ums512
+
additionalProperties: true
...
diff --git a/Documentation/devicetree/bindings/arm/sti.yaml b/Documentation/devicetree/bindings/arm/sti.yaml
index b1f28d16d3fb..3ca054c64377 100644
--- a/Documentation/devicetree/bindings/arm/sti.yaml
+++ b/Documentation/devicetree/bindings/arm/sti.yaml
@@ -4,10 +4,10 @@
$id: http://devicetree.org/schemas/arm/sti.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ST STi Platforms Device Tree Bindings
+title: ST STi Platforms
maintainers:
- - Patrice Chotard <patrice.chotard@st.com>
+ - Patrice Chotard <patrice.chotard@foss.st.com>
properties:
$nodename:
diff --git a/Documentation/devicetree/bindings/arm/stm32/st,mlahb.yaml b/Documentation/devicetree/bindings/arm/stm32/st,mlahb.yaml
index 8e711bd202fd..2297ad3f4774 100644
--- a/Documentation/devicetree/bindings/arm/stm32/st,mlahb.yaml
+++ b/Documentation/devicetree/bindings/arm/stm32/st,mlahb.yaml
@@ -4,11 +4,11 @@
$id: "http://devicetree.org/schemas/arm/stm32/st,mlahb.yaml#"
$schema: "http://devicetree.org/meta-schemas/core.yaml#"
-title: STMicroelectronics STM32 ML-AHB interconnect bindings
+title: STMicroelectronics STM32 ML-AHB interconnect
maintainers:
- - Fabien Dessenne <fabien.dessenne@st.com>
- - Arnaud Pouliquen <arnaud.pouliquen@st.com>
+ - Fabien Dessenne <fabien.dessenne@foss.st.com>
+ - Arnaud Pouliquen <arnaud.pouliquen@foss.st.com>
description: |
These bindings describe the STM32 SoCs ML-AHB interconnect bus which connects
diff --git a/Documentation/devicetree/bindings/arm/stm32/st,stm32-syscon.yaml b/Documentation/devicetree/bindings/arm/stm32/st,stm32-syscon.yaml
index 149afb5df5af..ad8e51aa01b0 100644
--- a/Documentation/devicetree/bindings/arm/stm32/st,stm32-syscon.yaml
+++ b/Documentation/devicetree/bindings/arm/stm32/st,stm32-syscon.yaml
@@ -4,11 +4,11 @@
$id: "http://devicetree.org/schemas/arm/stm32/st,stm32-syscon.yaml#"
$schema: "http://devicetree.org/meta-schemas/core.yaml#"
-title: STMicroelectronics STM32 Platforms System Controller bindings
+title: STMicroelectronics STM32 Platforms System Controller
maintainers:
- - Alexandre Torgue <alexandre.torgue@st.com>
- - Christophe Roullier <christophe.roullier@st.com>
+ - Alexandre Torgue <alexandre.torgue@foss.st.com>
+ - Christophe Roullier <christophe.roullier@foss.st.com>
properties:
compatible:
@@ -20,6 +20,7 @@ properties:
- st,stm32-syscfg
- st,stm32-power-config
- st,stm32-tamp
+ - st,stm32f4-gcan
- const: syscon
- items:
- const: st,stm32-tamp
@@ -42,6 +43,7 @@ if:
contains:
enum:
- st,stm32mp157-syscfg
+ - st,stm32f4-gcan
then:
required:
- clocks
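The effect of extending the if-clause is that a "st,stm32f4-gcan" syscon must
now carry a clocks property, unlike most other variants in this schema. A
hedged sketch of such a node (the unit address, size and clock specifier are
placeholders, not taken from this binding):

    syscon@40002000 {
        compatible = "st,stm32f4-gcan", "syscon";
        reg = <0x40002000 0x200>;
        clocks = <&rcc 0>;
    };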
diff --git a/Documentation/devicetree/bindings/arm/stm32/stm32.yaml b/Documentation/devicetree/bindings/arm/stm32/stm32.yaml
index 9a77ab74be99..13e34241145b 100644
--- a/Documentation/devicetree/bindings/arm/stm32/stm32.yaml
+++ b/Documentation/devicetree/bindings/arm/stm32/stm32.yaml
@@ -4,30 +4,21 @@
$id: http://devicetree.org/schemas/arm/stm32/stm32.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: STMicroelectronics STM32 Platforms Device Tree Bindings
+title: STMicroelectronics STM32 Platforms
maintainers:
- - Alexandre Torgue <alexandre.torgue@st.com>
+ - Alexandre Torgue <alexandre.torgue@foss.st.com>
properties:
$nodename:
const: "/"
compatible:
oneOf:
- - description: DH STM32MP1 SoM based Boards
+ - description: emtrion STM32MP1 Argon based Boards
items:
- - enum:
- - arrow,stm32mp157a-avenger96 # Avenger96
- - dh,stm32mp153c-dhcom-drc02
- - dh,stm32mp157c-dhcom-pdk2
- - dh,stm32mp157c-dhcom-picoitx
- - enum:
- - dh,stm32mp153c-dhcom-som
- - dh,stm32mp157a-dhcor-som
- - dh,stm32mp157c-dhcom-som
- - enum:
- - st,stm32mp153
- - st,stm32mp157
+ - const: emtrion,stm32mp157c-emsbc-argon
+ - const: emtrion,stm32mp157c-emstamp-argon
+ - const: st,stm32mp157
- items:
- enum:
- st,stm32f429i-disco
@@ -57,22 +48,86 @@ properties:
- const: st,stm32h750
- items:
- enum:
+ - st,stm32mp135f-dk
+ - const: st,stm32mp135
+
+ - description: ST STM32MP151 based Boards
+ items:
+ - enum:
+ - prt,prtt1a # Protonic PRTT1A
+ - prt,prtt1c # Protonic PRTT1C
+ - prt,prtt1s # Protonic PRTT1S
+ - const: st,stm32mp151
+
+ - description: DH STM32MP151 DHCOR SoM based Boards
+ items:
+ - const: dh,stm32mp151a-dhcor-testbench
+ - const: dh,stm32mp151a-dhcor-som
+ - const: st,stm32mp151
+
+ - description: DH STM32MP153 DHCOM SoM based Boards
+ items:
+ - const: dh,stm32mp153c-dhcom-drc02
+ - const: dh,stm32mp153c-dhcom-som
+ - const: st,stm32mp153
+
+ - description: DH STM32MP153 DHCOR SoM based Boards
+ items:
+ - const: dh,stm32mp153c-dhcor-drc-compact
+ - const: dh,stm32mp153c-dhcor-som
+ - const: st,stm32mp153
+
+ - items:
+ - enum:
- shiratech,stm32mp157a-iot-box # IoT Box
- shiratech,stm32mp157a-stinger96 # Stinger96
- st,stm32mp157c-ed1
- st,stm32mp157a-dk1
- st,stm32mp157c-dk2
+ - const: st,stm32mp157
+ - items:
+ - const: st,stm32mp157a-dk1-scmi
+ - const: st,stm32mp157a-dk1
- const: st,stm32mp157
- items:
+ - const: st,stm32mp157c-dk2-scmi
+ - const: st,stm32mp157c-dk2
+ - const: st,stm32mp157
+ - items:
+ - const: st,stm32mp157c-ed1-scmi
+ - const: st,stm32mp157c-ed1
+ - const: st,stm32mp157
+ - items:
+ - const: st,stm32mp157c-ev1
+ - const: st,stm32mp157c-ed1
+ - const: st,stm32mp157
+ - items:
+ - const: st,stm32mp157c-ev1-scmi
- const: st,stm32mp157c-ev1
- const: st,stm32mp157c-ed1
- const: st,stm32mp157
+ - description: DH STM32MP1 SoM based Boards
+ items:
+ - enum:
+ - arrow,stm32mp157a-avenger96 # Avenger96
+ - const: dh,stm32mp157a-dhcor-som
+ - const: st,stm32mp157
+
+ - description: DH STM32MP1 SoM based Boards
+ items:
+ - enum:
+ - dh,stm32mp157c-dhcom-pdk2
+ - dh,stm32mp157c-dhcom-picoitx
+ - const: dh,stm32mp157c-dhcom-som
+ - const: st,stm32mp157
+
- description: Engicam i.Core STM32MP1 SoM based Boards
items:
- enum:
- engicam,icore-stm32mp1-ctouch2 # STM32MP1 Engicam i.Core STM32MP1 C.TOUCH 2.0
+ - engicam,icore-stm32mp1-ctouch2-of10 # STM32MP1 Engicam i.Core STM32MP1 C.TOUCH 2.0 10.1" OF
- engicam,icore-stm32mp1-edimm2.2 # STM32MP1 Engicam i.Core STM32MP1 EDIMM2.2 Starter Kit
- const: engicam,icore-stm32mp1 # STM32MP1 Engicam i.Core STM32MP1 SoM
- const: st,stm32mp157
@@ -92,6 +147,7 @@ properties:
- const: oct,stm32mp15xx-osd32
- enum:
- st,stm32mp157
+
- description: Odyssey STM32MP1 SoM based Boards
items:
- enum:
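Note how the new SCMI variants chain onto the existing board compatibles
rather than replacing them. A sketch of the corresponding board root node
(the model string is illustrative):

    / {
        model = "STM32MP157C-DK2 Discovery Board (SCMI)";
        compatible = "st,stm32mp157c-dk2-scmi", "st,stm32mp157c-dk2",
                     "st,stm32mp157";
    };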
diff --git a/Documentation/devicetree/bindings/arm/sunplus,sp7021.yaml b/Documentation/devicetree/bindings/arm/sunplus,sp7021.yaml
new file mode 100644
index 000000000000..def7d0cfeb31
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/sunplus,sp7021.yaml
@@ -0,0 +1,29 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright (C) Sunplus Co., Ltd. 2021
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/sunplus,sp7021.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Sunplus SP7021 Boards
+
+maintainers:
+ - qinjian <qinjian@cqplus1.com>
+
+description: |
+ ARM platforms using Sunplus SP7021, a quad-core ARM Cortex-A7 based SoC.
+ Wiki: https://sunplus-tibbo.atlassian.net/wiki/spaces/doc/overview
+
+properties:
+ $nodename:
+ const: '/'
+ compatible:
+ items:
+ - enum:
+ - sunplus,sp7021-achip
+ - sunplus,sp7021-demo-v3
+ - const: sunplus,sp7021
+
+additionalProperties: true
+
+...
diff --git a/Documentation/devicetree/bindings/arm/sunxi.yaml b/Documentation/devicetree/bindings/arm/sunxi.yaml
index 889128acf49a..013821f4a7b8 100644
--- a/Documentation/devicetree/bindings/arm/sunxi.yaml
+++ b/Documentation/devicetree/bindings/arm/sunxi.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/sunxi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner platforms device tree bindings
+title: Allwinner platforms
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -366,6 +366,12 @@ properties:
- const: lamobo,lamobo-r1
- const: allwinner,sun7i-a20
+ - description: Lctech Pi F1C200s
+ items:
+ - const: lctech,pi-f1c200s
+ - const: allwinner,suniv-f1c200s
+ - const: allwinner,suniv-f1c100s
+
- description: Libre Computer Board ALL-H3-CC H2+
items:
- const: libretech,all-h3-cc-h2-plus
@@ -391,6 +397,11 @@ properties:
- const: libretech,all-h5-cc-h5
- const: allwinner,sun50i-h5
+ - description: Lichee Pi Nano
+ items:
+ - const: licheepi,licheepi-nano
+ - const: allwinner,suniv-f1c100s
+
- description: Lichee Pi One
items:
- const: licheepi,licheepi-one
@@ -444,6 +455,11 @@ properties:
- const: haoyu,a10-marsboard
- const: allwinner,sun4i-a10
+ - description: HAOYU Electronics Marsboard A20
+ items:
+ - const: haoyu,a20-marsboard
+ - const: allwinner,sun7i-a20
+
- description: MapleBoard MP130
items:
- const: mapleboard,mp130
@@ -797,6 +813,13 @@ properties:
- const: sinlinx,sina33
- const: allwinner,sun8i-a33
+ - description: SourceParts PopStick v1.1
+ items:
+ - const: sourceparts,popstick-v1.1
+ - const: sourceparts,popstick
+ - const: allwinner,suniv-f1c200s
+ - const: allwinner,suniv-f1c100s
+
- description: SL631 Action Camera with IMX179
items:
- const: allwinner,sl631-imx179
@@ -808,6 +831,11 @@ properties:
- const: oranth,tanix-tx6
- const: allwinner,sun50i-h6
+ - description: Tanix TX6 mini
+ items:
+ - const: oranth,tanix-tx6-mini
+ - const: allwinner,sun50i-h6
+
- description: TBS A711 Tablet
items:
- const: tbs-biometrics,a711
@@ -828,6 +856,11 @@ properties:
- const: wexler,tab7200
- const: allwinner,sun7i-a20
+ - description: MangoPi MQ-R board
+ items:
+ - const: widora,mangopi-mq-r-t113
+ - const: allwinner,sun8i-t113s
+
- description: WITS A31 Colombus Evaluation Board
items:
- const: wits,colombus
@@ -848,6 +881,11 @@ properties:
- const: yones-toptech,bs1078-v2
- const: allwinner,sun6i-a31s
+ - description: X96 Mate TV box
+ items:
+ - const: hechuang,x96-mate
+ - const: allwinner,sun50i-h616
+
- description: Xunlong OrangePi
items:
- const: xunlong,orangepi
@@ -948,4 +986,9 @@ properties:
- const: xunlong,orangepi-zero-plus2-h3
- const: allwinner,sun8i-h3
+ - description: Xunlong OrangePi Zero 2
+ items:
+ - const: xunlong,orangepi-zero2
+ - const: allwinner,sun50i-h616
+
additionalProperties: true
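The PopStick entry above shows a deeper fallback chain than most: board
revision, board family, then the SoC and the older SoC it is compatible with.
A sketch of the corresponding root node (the model string is illustrative):

    / {
        model = "SourceParts PopStick v1.1";
        compatible = "sourceparts,popstick-v1.1", "sourceparts,popstick",
                     "allwinner,suniv-f1c200s", "allwinner,suniv-f1c100s";
    };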
diff --git a/Documentation/devicetree/bindings/arm/sunxi/allwinner,sun4i-a10-mbus.yaml b/Documentation/devicetree/bindings/arm/sunxi/allwinner,sun4i-a10-mbus.yaml
index e713a6fe4cf7..99566688d033 100644
--- a/Documentation/devicetree/bindings/arm/sunxi/allwinner,sun4i-a10-mbus.yaml
+++ b/Documentation/devicetree/bindings/arm/sunxi/allwinner,sun4i-a10-mbus.yaml
@@ -29,14 +29,51 @@ properties:
compatible:
enum:
- allwinner,sun5i-a13-mbus
+ - allwinner,sun8i-a33-mbus
+ - allwinner,sun8i-a50-mbus
+ - allwinner,sun8i-a83t-mbus
- allwinner,sun8i-h3-mbus
+ - allwinner,sun8i-r40-mbus
+ - allwinner,sun8i-v3s-mbus
+ - allwinner,sun8i-v536-mbus
+ - allwinner,sun20i-d1-mbus
- allwinner,sun50i-a64-mbus
+ - allwinner,sun50i-a100-mbus
+ - allwinner,sun50i-h5-mbus
+ - allwinner,sun50i-h6-mbus
+ - allwinner,sun50i-h616-mbus
+ - allwinner,sun50i-r329-mbus
reg:
- maxItems: 1
+ minItems: 1
+ items:
+ - description: MBUS interconnect/bandwidth limit/PMU registers
+ - description: DRAM controller/PHY registers
+
+ reg-names:
+ minItems: 1
+ items:
+ - const: mbus
+ - const: dram
clocks:
+ minItems: 1
+ items:
+ - description: MBUS interconnect module clock
+ - description: DRAM controller/PHY module clock
+ - description: Register bus clock, shared by MBUS and DRAM
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: mbus
+ - const: dram
+ - const: bus
+
+ interrupts:
maxItems: 1
+ description:
+ MBUS PMU activity interrupt.
dma-ranges:
description:
@@ -53,13 +90,55 @@ required:
- clocks
- dma-ranges
+if:
+ not:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - allwinner,sun5i-a13-mbus
+ - allwinner,sun8i-r40-mbus
+
+then:
+ properties:
+ reg:
+ minItems: 2
+
+ reg-names:
+ minItems: 2
+
+ clocks:
+ minItems: 3
+
+ clock-names:
+ minItems: 3
+
+ required:
+ - reg-names
+ - clock-names
+
+else:
+ properties:
+ reg:
+ maxItems: 1
+
+ reg-names:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ maxItems: 1
+
additionalProperties: false
examples:
- |
- #include <dt-bindings/clock/sun5i-ccu.h>
+ #include <dt-bindings/clock/sun50i-a64-ccu.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
- mbus: dram-controller@1c01000 {
+ dram-controller@1c01000 {
compatible = "allwinner,sun5i-a13-mbus";
reg = <0x01c01000 0x1000>;
clocks = <&ccu CLK_MBUS>;
@@ -69,4 +148,21 @@ examples:
#interconnect-cells = <1>;
};
+ - |
+ dram-controller@1c62000 {
+ compatible = "allwinner,sun50i-a64-mbus";
+ reg = <0x01c62000 0x1000>,
+ <0x01c63000 0x1000>;
+ reg-names = "mbus", "dram";
+ clocks = <&ccu CLK_MBUS>,
+ <&ccu CLK_DRAM>,
+ <&ccu CLK_BUS_DRAM>;
+ clock-names = "mbus", "dram", "bus";
+ interrupts = <GIC_SPI 69 IRQ_TYPE_LEVEL_HIGH>;
+ #address-cells = <1>;
+ #size-cells = <1>;
+ dma-ranges = <0x00000000 0x40000000 0xc0000000>;
+ #interconnect-cells = <1>;
+ };
+
...
diff --git a/Documentation/devicetree/bindings/arm/sunxi/allwinner,sun6i-a31-cpuconfig.yaml b/Documentation/devicetree/bindings/arm/sunxi/allwinner,sun6i-a31-cpuconfig.yaml
new file mode 100644
index 000000000000..d805c4508b4e
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/sunxi/allwinner,sun6i-a31-cpuconfig.yaml
@@ -0,0 +1,38 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/sunxi/allwinner,sun6i-a31-cpuconfig.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Allwinner CPU Configuration Controller
+
+maintainers:
+ - Chen-Yu Tsai <wens@csie.org>
+ - Maxime Ripard <mripard@kernel.org>
+
+properties:
+ compatible:
+ enum:
+ - allwinner,sun6i-a31-cpuconfig
+ - allwinner,sun8i-a23-cpuconfig
+ - allwinner,sun8i-a83t-cpucfg
+ - allwinner,sun8i-a83t-r-cpucfg
+ - allwinner,sun9i-a80-cpucfg
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ cpucfg@1f01c00 {
+ compatible = "allwinner,sun6i-a31-cpuconfig";
+ reg = <0x01f01c00 0x300>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/arm/sunxi/allwinner,sun9i-a80-prcm.yaml b/Documentation/devicetree/bindings/arm/sunxi/allwinner,sun9i-a80-prcm.yaml
new file mode 100644
index 000000000000..644f391afb32
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/sunxi/allwinner,sun9i-a80-prcm.yaml
@@ -0,0 +1,33 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/sunxi/allwinner,sun9i-a80-prcm.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Allwinner A80 PRCM
+
+maintainers:
+ - Chen-Yu Tsai <wens@csie.org>
+ - Maxime Ripard <mripard@kernel.org>
+
+properties:
+ compatible:
+ const: allwinner,sun9i-a80-prcm
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ prcm@8001400 {
+ compatible = "allwinner,sun9i-a80-prcm";
+ reg = <0x08001400 0x200>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/arm/swir.txt b/Documentation/devicetree/bindings/arm/swir.txt
deleted file mode 100644
index 042be73a95d3..000000000000
--- a/Documentation/devicetree/bindings/arm/swir.txt
+++ /dev/null
@@ -1,12 +0,0 @@
-Sierra Wireless Modules device tree bindings
---------------------------------------------
-
-Supported Modules :
- - WP8548 : Includes MDM9615 and PM8018 in a module
-
-Sierra Wireless modules shall have the following properties :
- Required root node property
- - compatible: "swir,wp8548" for the WP8548 CF3 Module
-
-Board compatible values:
- - "swir,mangoh-green-wp8548" for the mangOH green board with the WP8548 module
diff --git a/Documentation/devicetree/bindings/arm/syna.txt b/Documentation/devicetree/bindings/arm/syna.txt
index d8b48f2edf1b..851f48ead927 100644
--- a/Documentation/devicetree/bindings/arm/syna.txt
+++ b/Documentation/devicetree/bindings/arm/syna.txt
@@ -18,10 +18,6 @@ stable binding/ABI.
---------------------------------------------------------------
-Boards with the Synaptics AS370 SoC shall have the following properties:
- Required root node property:
- compatible: "syna,as370"
-
Boards with a SoC of the Marvell Berlin family, e.g. Armada 1500
shall have the following properties:
diff --git a/Documentation/devicetree/bindings/arm/tegra.yaml b/Documentation/devicetree/bindings/arm/tegra.yaml
index d79d36ac0c44..0df41f5b7e2a 100644
--- a/Documentation/devicetree/bindings/arm/tegra.yaml
+++ b/Documentation/devicetree/bindings/arm/tegra.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/tegra.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: NVIDIA Tegra device tree bindings
+title: NVIDIA Tegra
maintainers:
- Thierry Reding <thierry.reding@gmail.com>
@@ -37,6 +37,9 @@ properties:
- const: toradex,colibri_t20
- const: nvidia,tegra20
- items:
+ - const: asus,tf101
+ - const: nvidia,tegra20
+ - items:
- const: acer,picasso
- const: nvidia,tegra20
- items:
@@ -50,6 +53,18 @@ properties:
- const: nvidia,cardhu
- const: nvidia,tegra30
- items:
+ - const: asus,tf201
+ - const: nvidia,tegra30
+ - items:
+ - const: asus,tf300t
+ - const: nvidia,tegra30
+ - items:
+ - const: asus,tf300tg
+ - const: nvidia,tegra30
+ - items:
+ - const: asus,tf700t
+ - const: nvidia,tegra30
+ - items:
- const: toradex,apalis_t30-eval
- const: toradex,apalis_t30
- const: nvidia,tegra30
@@ -75,7 +90,11 @@ properties:
- const: ouya,ouya
- const: nvidia,tegra30
- items:
+ - const: pegatron,chagall
+ - const: nvidia,tegra30
+ - items:
- enum:
+ - asus,tf701t
- nvidia,dalmore
- nvidia,roth
- nvidia,tn7
@@ -108,14 +127,17 @@ properties:
- nvidia,p2571
- nvidia,p2894-0050-a08
- const: nvidia,tegra210
- - items:
- - enum:
- - nvidia,p2771-0000
- - nvidia,p3509-0000+p3636-0001
+ - description: Jetson TX2 Developer Kit
+ items:
+ - const: nvidia,p2771-0000
- const: nvidia,tegra186
- - items:
- - enum:
- - nvidia,p2972-0000
+ - description: Jetson TX2 NX Developer Kit
+ items:
+ - const: nvidia,p3509-0000+p3636-0001
+ - const: nvidia,tegra186
+ - description: Jetson AGX Xavier Developer Kit
+ items:
+ - const: nvidia,p2972-0000
- const: nvidia,tegra194
- description: Jetson Xavier NX
items:
@@ -134,8 +156,25 @@ properties:
- const: nvidia,p3509-0000+p3668-0001
- const: nvidia,tegra194
- items:
- - enum:
- - nvidia,tegra234-vdk
+ - const: nvidia,tegra234-vdk
+ - const: nvidia,tegra234
+ - description: Jetson AGX Orin
+ items:
+ - const: nvidia,p3701-0000
+ - const: nvidia,tegra234
+ - description: Jetson AGX Orin Developer Kit
+ items:
+ - const: nvidia,p3737-0000+p3701-0000
+ - const: nvidia,p3701-0000
+ - const: nvidia,tegra234
+ - description: Jetson Orin NX
+ items:
+ - const: nvidia,p3767-0000
+ - const: nvidia,tegra234
+ - description: Jetson Orin NX Engineering Reference Developer Kit
+ items:
+ - const: nvidia,p3768-0000+p3767-0000
+ - const: nvidia,p3767-0000
- const: nvidia,tegra234
additionalProperties: true
diff --git a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra-ccplex-cluster.yaml b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra-ccplex-cluster.yaml
new file mode 100644
index 000000000000..36dbd0838f2d
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra-ccplex-cluster.yaml
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/tegra/nvidia,tegra-ccplex-cluster.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra CPU COMPLEX CLUSTER area
+
+maintainers:
+ - Sumit Gupta <sumitg@nvidia.com>
+ - Mikko Perttunen <mperttunen@nvidia.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+ - Thierry Reding <thierry.reding@gmail.com>
+
+description: |+
+ The Tegra CPU COMPLEX CLUSTER area contains memory-mapped
+ registers that initiate CPU frequency/voltage transitions.
+
+properties:
+ $nodename:
+ pattern: "ccplex@([0-9a-f]+)$"
+
+ compatible:
+ enum:
+ - nvidia,tegra186-ccplex-cluster
+ - nvidia,tegra234-ccplex-cluster
+
+ reg:
+ maxItems: 1
+
+ nvidia,bpmp:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: |
+ Specifies the BPMP node that needs to be queried to get
+ operating point data for all CPUs.
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - nvidia,bpmp
+
+examples:
+ - |
+ ccplex@e000000 {
+ compatible = "nvidia,tegra234-ccplex-cluster";
+ reg = <0x0e000000 0x5ffff>;
+ nvidia,bpmp = <&bpmp>;
+ };
diff --git a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra186-pmc.txt b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra186-pmc.txt
deleted file mode 100644
index 576462fae27f..000000000000
--- a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra186-pmc.txt
+++ /dev/null
@@ -1,133 +0,0 @@
-NVIDIA Tegra Power Management Controller (PMC)
-
-Required properties:
-- compatible: Should contain one of the following:
- - "nvidia,tegra186-pmc": for Tegra186
- - "nvidia,tegra194-pmc": for Tegra194
- - "nvidia,tegra234-pmc": for Tegra234
-- reg: Must contain an (offset, length) pair of the register set for each
- entry in reg-names.
-- reg-names: Must include the following entries:
- - "pmc"
- - "wake"
- - "aotag"
- - "scratch"
- - "misc" (Only for Tegra194 and later)
-
-Optional properties:
-- nvidia,invert-interrupt: If present, inverts the PMU interrupt signal.
-- interrupt-controller: Identifies the node as an interrupt controller.
-- #interrupt-cells: Specifies the number of cells needed to encode an
- interrupt source. The value must be 2.
-
-Example:
-
-SoC DTSI:
-
- pmc@c3600000 {
- compatible = "nvidia,tegra186-pmc";
- reg = <0 0x0c360000 0 0x10000>,
- <0 0x0c370000 0 0x10000>,
- <0 0x0c380000 0 0x10000>,
- <0 0x0c390000 0 0x10000>;
- reg-names = "pmc", "wake", "aotag", "scratch";
- };
-
-Board DTS:
-
- pmc@c360000 {
- nvidia,invert-interrupt;
- };
-
-== Pad Control ==
-
-On Tegra SoCs a pad is a set of pins which are configured as a group.
-The pin grouping is a fixed attribute of the hardware. The PMC can be
-used to set pad power state and signaling voltage. A pad can be either
-in active or power down mode. The support for power state and signaling
-voltage configuration varies depending on the pad in question. 3.3 V and
-1.8 V signaling voltages are supported on pins where software
-controllable signaling voltage switching is available.
-
-Pad configurations are described with pin configuration nodes which
-are placed under the pmc node and they are referred to by the pinctrl
-client properties. For more information see
-Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt.
-
-The following pads are present on Tegra186:
-csia csib dsi mipi-bias
-pex-clk-bias pex-clk3 pex-clk2 pex-clk1
-usb0 usb1 usb2 usb-bias
-uart audio hsic dbg
-hdmi-dp0 hdmi-dp1 pex-cntrl sdmmc2-hv
-sdmmc4 cam dsib dsic
-dsid csic csid csie
-dsif spi ufs dmic-hv
-edp sdmmc1-hv sdmmc3-hv conn
-audio-hv ao-hv
-
-Required pin configuration properties:
- - pins: A list of strings, each of which contains the name of a pad
- to be configured.
-
-Optional pin configuration properties:
- - low-power-enable: Configure the pad into power down mode
- - low-power-disable: Configure the pad into active mode
- - power-source: Must contain either TEGRA_IO_PAD_VOLTAGE_1V8 or
- TEGRA_IO_PAD_VOLTAGE_3V3 to select between signaling voltages.
- The values are defined in
- include/dt-bindings/pinctrl/pinctrl-tegra-io-pad.h.
-
-Note: The power state can be configured on all of the above pads except
- for ao-hv. Following pads have software configurable signaling
- voltages: sdmmc2-hv, dmic-hv, sdmmc1-hv, sdmmc3-hv, audio-hv,
- ao-hv.
-
-Pad configuration state example:
- pmc: pmc@7000e400 {
- compatible = "nvidia,tegra186-pmc";
- reg = <0 0x0c360000 0 0x10000>,
- <0 0x0c370000 0 0x10000>,
- <0 0x0c380000 0 0x10000>,
- <0 0x0c390000 0 0x10000>;
- reg-names = "pmc", "wake", "aotag", "scratch";
-
- ...
-
- sdmmc1_3v3: sdmmc1-3v3 {
- pins = "sdmmc1-hv";
- power-source = <TEGRA_IO_PAD_VOLTAGE_3V3>;
- };
-
- sdmmc1_1v8: sdmmc1-1v8 {
- pins = "sdmmc1-hv";
- power-source = <TEGRA_IO_PAD_VOLTAGE_1V8>;
- };
-
- hdmi_off: hdmi-off {
- pins = "hdmi";
- low-power-enable;
- }
-
- hdmi_on: hdmi-on {
- pins = "hdmi";
- low-power-disable;
- }
- };
-
-Pinctrl client example:
- sdmmc1: sdhci@3400000 {
- ...
- pinctrl-names = "sdmmc-3v3", "sdmmc-1v8";
- pinctrl-0 = <&sdmmc1_3v3>;
- pinctrl-1 = <&sdmmc1_1v8>;
- };
-
- ...
-
- sor0: sor@15540000 {
- ...
- pinctrl-0 = <&hdmi_off>;
- pinctrl-1 = <&hdmi_on>;
- pinctrl-names = "hdmi-on", "hdmi-off";
- };
diff --git a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra186-pmc.yaml b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra186-pmc.yaml
new file mode 100644
index 000000000000..0faa403f68c8
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra186-pmc.yaml
@@ -0,0 +1,198 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/tegra/nvidia,tegra186-pmc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra Power Management Controller (PMC)
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ compatible:
+ enum:
+ - nvidia,tegra186-pmc
+ - nvidia,tegra194-pmc
+ - nvidia,tegra234-pmc
+
+ reg:
+ minItems: 4
+ maxItems: 5
+
+ reg-names:
+ minItems: 4
+ items:
+ - const: pmc
+ - const: wake
+ - const: aotag
+ - const: scratch
+ - const: misc
+
+ interrupt-controller: true
+
+ "#interrupt-cells":
+ description: Specifies the number of cells needed to encode an
+ interrupt source. The value must be 2.
+ const: 2
+
+ nvidia,invert-interrupt:
+ description: If present, inverts the PMU interrupt signal.
+ $ref: /schemas/types.yaml#/definitions/flag
+
+if:
+ properties:
+ compatible:
+ contains:
+ const: nvidia,tegra186-pmc
+then:
+ properties:
+ reg:
+ maxItems: 4
+
+ reg-names:
+ maxItems: 4
+else:
+ properties:
+ reg:
+ minItems: 5
+
+ reg-names:
+ minItems: 5
+
+patternProperties:
+ "^[a-z0-9]+-[a-z0-9]+$":
+ if:
+ type: object
+ then:
+ description: |
+ These are pad configuration nodes. On Tegra SoCs a pad is a set of
+ pins which are configured as a group. The pin grouping is a fixed
+ attribute of the hardware. The PMC can be used to set pad power
+ state and signaling voltage. A pad can be either in active or
+ power down mode. The support for power state and signaling voltage
+ configuration varies depending on the pad in question. 3.3 V and
+ 1.8 V signaling voltages are supported on pins where software
+ controllable signaling voltage switching is available.
+
+ Pad configurations are described with pin configuration nodes
+ which are placed under the pmc node and they are referred to by
+ the pinctrl client properties. For more information see
+
+ Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt
+
+ The following pads are present on Tegra186:
+
+ csia, csib, dsi, mipi-bias, pex-clk-bias, pex-clk3, pex-clk2,
+ pex-clk1, usb0, usb1, usb2, usb-bias, uart, audio, hsic, dbg,
+ hdmi-dp0, hdmi-dp1, pex-cntrl, sdmmc2-hv, sdmmc4, cam, dsib,
+ dsic, dsid, csic, csid, csie, dsif, spi, ufs, dmic-hv, edp,
+ sdmmc1-hv, sdmmc3-hv, conn, audio-hv, ao-hv
+
+ The following pads are present on Tegra194:
+
+ csia, csib, mipi-bias, pex-clk-bias, pex-clk3, pex-clk2,
+ pex-clk1, eqos, pex-clk-2-bias, pex-clk-2, dap3, dap5, uart,
+ pwr-ctl, soc-gpio53, audio, gp-pwm2, gp-pwm3, soc-gpio12,
+ soc-gpio13, soc-gpio10, uart4, uart5, dbg, hdmi-dp3, hdmi-dp2,
+ hdmi-dp0, hdmi-dp1, pex-cntrl, pex-ctl2, pex-l0-rst,
+ pex-l1-rst, sdmmc4, pex-l5-rst, cam, csic, csid, csie, csif,
+ spi, ufs, csig, csih, edp, sdmmc1-hv, sdmmc3-hv, conn,
+ audio-hv, ao-hv
+
+ properties:
+ pins:
+ $ref: /schemas/types.yaml#/definitions/string
+ description: Must contain the name of the pad(s) to be
+ configured.
+
+ low-power-enable:
+ description: Configure the pad into power down mode.
+ $ref: /schemas/types.yaml#/definitions/flag
+
+ low-power-disable:
+ description: Configure the pad into active mode.
+ $ref: /schemas/types.yaml#/definitions/flag
+
+ power-source:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: |
+ Must contain either TEGRA_IO_PAD_VOLTAGE_1V8 or
+ TEGRA_IO_PAD_VOLTAGE_3V3 to select between signalling
+ voltages.
+
+ The values are defined in
+
+ include/dt-bindings/pinctrl/pinctrl-tegra-io-pad.h
+
+ The power state can be configured on all of the above pads
+ except for ao-hv. Following pads have software configurable
+ signaling voltages: sdmmc2-hv, dmic-hv, sdmmc1-hv, sdmmc3-hv,
+ audio-hv, ao-hv.
+
+ phandle: true
+
+ required:
+ - pins
+
+ additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - reg-names
+
+additionalProperties: false
+
+dependencies:
+ interrupt-controller: ['#interrupt-cells']
+ "#interrupt-cells":
+ required:
+ - interrupt-controller
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra186-clock.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/pinctrl/pinctrl-tegra-io-pad.h>
+ #include <dt-bindings/memory/tegra186-mc.h>
+ #include <dt-bindings/reset/tegra186-reset.h>
+
+ pmc@c3600000 {
+ compatible = "nvidia,tegra186-pmc";
+ reg = <0x0c360000 0x10000>,
+ <0x0c370000 0x10000>,
+ <0x0c380000 0x10000>,
+ <0x0c390000 0x10000>;
+ reg-names = "pmc", "wake", "aotag", "scratch";
+ nvidia,invert-interrupt;
+
+ sdmmc1_3v3: sdmmc1-3v3 {
+ pins = "sdmmc1-hv";
+ power-source = <TEGRA_IO_PAD_VOLTAGE_3V3>;
+ };
+
+ sdmmc1_1v8: sdmmc1-1v8 {
+ pins = "sdmmc1-hv";
+ power-source = <TEGRA_IO_PAD_VOLTAGE_1V8>;
+ };
+ };
+
+ sdmmc1: mmc@3400000 {
+ compatible = "nvidia,tegra186-sdhci";
+ reg = <0x03400000 0x10000>;
+ interrupts = <GIC_SPI 62 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&bpmp TEGRA186_CLK_SDMMC1>,
+ <&bpmp TEGRA186_CLK_SDMMC_LEGACY_TM>;
+ clock-names = "sdhci", "tmclk";
+ resets = <&bpmp TEGRA186_RESET_SDMMC1>;
+ reset-names = "sdhci";
+ interconnects = <&mc TEGRA186_MEMORY_CLIENT_SDMMCRA &emc>,
+ <&mc TEGRA186_MEMORY_CLIENT_SDMMCWA &emc>;
+ interconnect-names = "dma-mem", "write";
+ iommus = <&smmu TEGRA186_SID_SDMMC1>;
+ pinctrl-names = "sdmmc-3v3", "sdmmc-1v8";
+ pinctrl-0 = <&sdmmc1_3v3>;
+ pinctrl-1 = <&sdmmc1_1v8>;
+ };
diff --git a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra194-axi2apb.yaml b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra194-axi2apb.yaml
new file mode 100644
index 000000000000..5e0f1dc542b0
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra194-axi2apb.yaml
@@ -0,0 +1,40 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/tegra/nvidia,tegra194-axi2apb.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra194 AXI2APB bridge
+
+maintainers:
+ - Sumit Gupta <sumitg@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^axi2apb@([0-9a-f]+)$"
+
+ compatible:
+ enum:
+ - nvidia,tegra194-axi2apb
+
+ reg:
+ maxItems: 6
+ description: Physical base address and length of registers for all bridges
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+
+examples:
+ - |
+ axi2apb: axi2apb@2390000 {
+ compatible = "nvidia,tegra194-axi2apb";
+ reg = <0x02390000 0x1000>,
+ <0x023a0000 0x1000>,
+ <0x023b0000 0x1000>,
+ <0x023c0000 0x1000>,
+ <0x023d0000 0x1000>,
+ <0x023e0000 0x1000>;
+ };
diff --git a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra194-cbb.yaml b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra194-cbb.yaml
new file mode 100644
index 000000000000..d9c54c32c6b9
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra194-cbb.yaml
@@ -0,0 +1,97 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/tegra/nvidia,tegra194-cbb.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra194 CBB 1.0
+
+maintainers:
+ - Sumit Gupta <sumitg@nvidia.com>
+
+description: |+
+ The Control Backbone (CBB) comprises the physical path from an
+ initiator to a target's register configuration space. CBB 1.0 has
+ multiple hierarchical sub-NOCs (Networks-on-Chip) and connects various
+ initiators and targets using different bridges such as AXIP2P and
+ AXI2APB.
+
+ This driver handles errors due to illegal register accesses reported
+ by the NOCs inside the CBB. The NOCs reporting errors are the cluster
+ NOCs "AON-NOC, SCE-NOC, RCE-NOC, BPMP-NOC, CV-NOC" and the main
+ "CBB Central NOC".
+
+ By default, the access-issuing initiator is informed about the error
+ using an SError or Data Abort exception, unless ERD (Error Response
+ Disable) is set for that initiator. If ERD is enabled, the SError or
+ Data Abort is masked and the error is reported with an interrupt.
+
+ - For the CCPLEX (CPU Complex) initiator, the driver sets the ERD bit,
+ so errors due to illegal accesses from the CCPLEX are reported by
+ interrupts. If ERD is not set, the error is reported by an SError.
+ - For other initiators, ERD is disabled, so the access-issuing
+ initiator is informed about the illegal access by a Data Abort
+ exception. In addition, an interrupt is also generated to the CCPLEX.
+ These initiators include all engines using a Cortex-R5 (an ARMv7 CPU
+ cluster) and engines like TSEC (Security co-processor), NVDEC (NVIDIA
+ Video Decoder engine), etc., which can initiate transactions.
+
+ On receiving an error notification, the driver prints relevant debug
+ information such as the Error Code, Error Description, Master, Address,
+ AXI ID, Cache, Protection, Security Group, etc.
+
+properties:
+ $nodename:
+ pattern: "^[a-z]+-noc@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - nvidia,tegra194-cbb-noc
+ - nvidia,tegra194-aon-noc
+ - nvidia,tegra194-bpmp-noc
+ - nvidia,tegra194-rce-noc
+ - nvidia,tegra194-sce-noc
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ description:
+ CCPLEX receives a secure or non-secure interrupt depending on the
+ error type. A secure interrupt is received for SEC (firewall) & SLV
+ errors and a non-secure interrupt for TMO & DEC errors.
+ items:
+ - description: non-secure interrupt
+ - description: secure interrupt
+
+ nvidia,axi2apb:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Specifies the node containing all axi2apb bridges which need to be
+ checked for any error logged in their status registers.
+
+ nvidia,apbmisc:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Specifies the apbmisc node which needs to be used for reading the ERD
+ register.
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - nvidia,apbmisc
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ cbb-noc@2300000 {
+ compatible = "nvidia,tegra194-cbb-noc";
+ reg = <0x02300000 0x1000>;
+ interrupts = <GIC_SPI 230 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 231 IRQ_TYPE_LEVEL_HIGH>;
+ nvidia,axi2apb = <&axi2apb>;
+ nvidia,apbmisc = <&apbmisc>;
+ };
diff --git a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra20-pmc.yaml b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra20-pmc.yaml
index 0afec83cc723..89191cfdf619 100644
--- a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra20-pmc.yaml
+++ b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra20-pmc.yaml
@@ -14,7 +14,6 @@ properties:
compatible:
enum:
- nvidia,tegra20-pmc
- - nvidia,tegra20-pmc
- nvidia,tegra30-pmc
- nvidia,tegra114-pmc
- nvidia,tegra124-pmc
@@ -124,6 +123,33 @@ properties:
some PLLs, clocks and then brings up CPU0 for resuming the
system.
+ core-supply:
+ description:
+ Phandle to voltage regulator connected to the SoC Core power rail.
+
+ core-domain:
+ type: object
+ description: |
+ The vast majority of hardware blocks of Tegra SoC belong to a
+ Core power domain, which has a dedicated voltage rail that powers
+ the blocks.
+
+ properties:
+ operating-points-v2:
+ description:
+ Should contain level, voltages and opp-supported-hw property.
+ The supported-hw is a bitfield indicating SoC speedo or process
+ ID mask.
+
+ "#power-domain-cells":
+ const: 0
+
+ required:
+ - operating-points-v2
+ - "#power-domain-cells"
+
+ additionalProperties: false
+
i2c-thermtrip:
type: object
description:
@@ -208,8 +234,9 @@ properties:
patternProperties:
"^[a-z0-9]+$":
type: object
+ additionalProperties: false
- patternProperties:
+ properties:
clocks:
minItems: 1
maxItems: 8
@@ -226,6 +253,9 @@ properties:
for controlling a power-gate.
See ../reset/reset.txt for more details.
+ power-domains:
+ maxItems: 1
+
'#power-domain-cells':
const: 0
description: Must be 0.
@@ -301,33 +331,6 @@ patternProperties:
additionalProperties: false
- core-domain:
- type: object
- description: |
- The vast majority of hardware blocks of Tegra SoC belong to a
- Core power domain, which has a dedicated voltage rail that powers
- the blocks.
-
- properties:
- operating-points-v2:
- description:
- Should contain level, voltages and opp-supported-hw property.
- The supported-hw is a bitfield indicating SoC speedo or process
- ID mask.
-
- "#power-domain-cells":
- const: 0
-
- required:
- - operating-points-v2
- - "#power-domain-cells"
-
- additionalProperties: false
-
- core-supply:
- description:
- Phandle to voltage regulator connected to the SoC Core power rail.
-
required:
- compatible
- reg
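To illustrate the relocated core-domain and core-supply properties, a hedged
sketch of a Tegra20 PMC node using them (the regulator and OPP-table phandles
are placeholders, not defined by this binding):

    pmc@7000e400 {
        compatible = "nvidia,tegra20-pmc";
        reg = <0x7000e400 0x400>;
        core-supply = <&vdd_core>;

        core_domain: core-domain {
            operating-points-v2 = <&core_opp_table>;
            #power-domain-cells = <0>;
        };
    };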
diff --git a/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra234-cbb.yaml b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra234-cbb.yaml
new file mode 100644
index 000000000000..fcdf03131323
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/tegra/nvidia,tegra234-cbb.yaml
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/tegra/nvidia,tegra234-cbb.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra CBB 2.0
+
+maintainers:
+ - Sumit Gupta <sumitg@nvidia.com>
+
+description: |+
+ The Control Backbone (CBB) comprises the physical path from an
+ initiator to a target's register configuration space. CBB 2.0 consists
+ of multiple sub-blocks connected to each other to create a topology.
+ The Tegra234 SoC has different fabrics based on the CBB 2.0 architecture,
+ which include the cluster fabrics BPMP, AON, PSC, SCE, RCE, DCE, FSI and
+ the "CBB central fabric".
+
+ In CBB 2.0, each initiator which can issue transactions connects to a
+ Root Master Node (MN) before it connects to any other element of the
+ fabric. Each Root MN contains an Error Monitor (EM) which detects and
+ logs errors. Interrupts from the various EM blocks are collated by an
+ Error Notifier (EN), which is per fabric and presents a single interrupt
+ from the fabric to the SoC interrupt controller.
+
+ The driver handles errors from the CBB due to illegal register accesses.
+ On receiving the interrupt from the EN, it prints debug information
+ about the failed transaction, including the Error Code, Error
+ Description, MasterID, Fabric, SlaveID, Address, Cache, Protection,
+ Security Group etc.
+
+ If Error Response Disable (ERD) is set/enabled for an initiator, then
+ the SError or Data Abort exception error response is masked and an
+ interrupt is used instead for reporting errors due to illegal accesses
+ from that initiator. The value returned on read failures is '0xFFFFFFFF'
+ for compatibility with PCIe.
+
+properties:
+ $nodename:
+ pattern: "^[a-z]+-fabric@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - nvidia,tegra234-aon-fabric
+ - nvidia,tegra234-bpmp-fabric
+ - nvidia,tegra234-cbb-fabric
+ - nvidia,tegra234-dce-fabric
+ - nvidia,tegra234-rce-fabric
+ - nvidia,tegra234-sce-fabric
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ items:
+ - description: secure interrupt from error notifier
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ cbb-fabric@13a00000 {
+ compatible = "nvidia,tegra234-cbb-fabric";
+ reg = <0x13a00000 0x400000>;
+ interrupts = <GIC_SPI 231 IRQ_TYPE_LEVEL_HIGH>;
+ };
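The cluster fabrics listed in the compatible enum follow the same shape as the CBB central fabric; a hedged sketch for the AON fabric, with an assumed unit address and SPI number, might read:

    aon-fabric@b600000 {
        compatible = "nvidia,tegra234-aon-fabric";
        reg = <0x0b600000 0x40000>;
        interrupts = <GIC_SPI 228 IRQ_TYPE_LEVEL_HIGH>;
    };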
diff --git a/Documentation/devicetree/bindings/arm/tesla.yaml b/Documentation/devicetree/bindings/arm/tesla.yaml
new file mode 100644
index 000000000000..d670a0d56222
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/tesla.yaml
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/tesla.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Tesla Full Self Driving (FSD) platforms
+
+maintainers:
+ - Alim Akhtar <alim.akhtar@samsung.com>
+ - linux-fsd@tesla.com
+
+properties:
+ $nodename:
+ const: '/'
+ compatible:
+ oneOf:
+ - description: FSD SoC board
+ items:
+ - enum:
+ - tesla,fsd-evb # Tesla FSD Evaluation
+ - const: tesla,fsd
+
+additionalProperties: true
+
+...
diff --git a/Documentation/devicetree/bindings/arm/ti/k3.yaml b/Documentation/devicetree/bindings/arm/ti/k3.yaml
index c5aa362e4026..e1183f90bb06 100644
--- a/Documentation/devicetree/bindings/arm/ti/k3.yaml
+++ b/Documentation/devicetree/bindings/arm/ti/k3.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/ti/k3.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Texas Instruments K3 Multicore SoC architecture device tree bindings
+title: Texas Instruments K3 Multicore SoC architecture
maintainers:
- Nishanth Menon <nm@ti.com>
@@ -19,28 +19,75 @@ properties:
compatible:
oneOf:
+ - description: K3 AM62A7 SoC
+ items:
+ - enum:
+ - ti,am62a7-sk
+ - const: ti,am62a7
+
+ - description: K3 AM625 SoC
+ items:
+ - enum:
+ - beagle,am625-beagleplay
+ - ti,am625-sk
+ - ti,am62-lp-sk
+ - const: ti,am625
+
+ - description: K3 AM642 SoC
+ items:
+ - enum:
+ - ti,am642-evm
+ - ti,am642-sk
+ - const: ti,am642
+
+ - description: K3 AM642 SoC PHYTEC phyBOARD-Electra
+ items:
+ - const: phytec,am642-phyboard-electra-rdk
+ - const: phytec,am64-phycore-som
+ - const: ti,am642
+
- description: K3 AM654 SoC
items:
- enum:
- - ti,am654-evm
- - siemens,iot2050-basic
- siemens,iot2050-advanced
+ - siemens,iot2050-advanced-m2
+ - siemens,iot2050-advanced-pg2
+ - siemens,iot2050-basic
+ - siemens,iot2050-basic-pg2
+ - ti,am654-evm
- const: ti,am654
+ - description: K3 J7200 SoC
+ oneOf:
+ - const: ti,j7200
+ - items:
+ - enum:
+ - ti,j7200-evm
+ - const: ti,j7200
+
- description: K3 J721E SoC
- items:
+ oneOf:
- const: ti,j721e
+ - items:
+ - enum:
+ - beagle,j721e-beagleboneai64
+ - ti,j721e-evm
+ - ti,j721e-sk
+ - const: ti,j721e
- - description: K3 J7200 SoC
+ - description: K3 J721s2 SoC
items:
- - const: ti,j7200
+ - enum:
+ - ti,am68-sk
+ - ti,j721s2-evm
+ - const: ti,j721s2
- - description: K3 AM642 SoC
+ - description: K3 J784s4 SoC
items:
- enum:
- - ti,am642-evm
- - ti,am642-sk
- - const: ti,am642
+ - ti,am69-sk
+ - ti,j784s4-evm
+ - const: ti,j784s4
additionalProperties: true
diff --git a/Documentation/devicetree/bindings/arm/ti/ti,davinci.yaml b/Documentation/devicetree/bindings/arm/ti/ti,davinci.yaml
index c022d325fc08..1656d1a4476f 100644
--- a/Documentation/devicetree/bindings/arm/ti/ti,davinci.yaml
+++ b/Documentation/devicetree/bindings/arm/ti/ti,davinci.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/ti/ti,davinci.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Texas Instruments DaVinci Platforms Device Tree Bindings
+title: Texas Instruments DaVinci Platforms
maintainers:
- Sekhar Nori <nsekhar@ti.com>
diff --git a/Documentation/devicetree/bindings/arm/toshiba.yaml b/Documentation/devicetree/bindings/arm/toshiba.yaml
index 001bbbcd1432..716ba4a3cab4 100644
--- a/Documentation/devicetree/bindings/arm/toshiba.yaml
+++ b/Documentation/devicetree/bindings/arm/toshiba.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/toshiba.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Toshiba Visconti Platform Device Tree Bindings
+title: Toshiba Visconti Platform
maintainers:
- Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
@@ -18,6 +18,7 @@ properties:
items:
- enum:
- toshiba,tmpv7708-rm-mbrc # TMPV7708 RM main board
+ - toshiba,tmpv7708-visrobo-vrb # TMPV7708 VisROBO VRB board
- const: toshiba,tmpv7708
additionalProperties: true
diff --git a/Documentation/devicetree/bindings/arm/trbe.yaml b/Documentation/devicetree/bindings/arm/trbe.yaml
deleted file mode 100644
index 4402d7bfd1fc..000000000000
--- a/Documentation/devicetree/bindings/arm/trbe.yaml
+++ /dev/null
@@ -1,49 +0,0 @@
-# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
-# Copyright 2021, Arm Ltd
-%YAML 1.2
----
-$id: "http://devicetree.org/schemas/arm/trbe.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
-
-title: ARM Trace Buffer Extensions
-
-maintainers:
- - Anshuman Khandual <anshuman.khandual@arm.com>
-
-description: |
- Arm Trace Buffer Extension (TRBE) is a per CPU component
- for storing trace generated on the CPU to memory. It is
- accessed via CPU system registers. The software can verify
- if it is permitted to use the component by checking the
- TRBIDR register.
-
-properties:
- $nodename:
- const: "trbe"
- compatible:
- items:
- - const: arm,trace-buffer-extension
-
- interrupts:
- description: |
- Exactly 1 PPI must be listed. For heterogeneous systems where
- TRBE is only supported on a subset of the CPUs, please consult
- the arm,gic-v3 binding for details on describing a PPI partition.
- maxItems: 1
-
-required:
- - compatible
- - interrupts
-
-additionalProperties: false
-
-examples:
-
- - |
- #include <dt-bindings/interrupt-controller/arm-gic.h>
-
- trbe {
- compatible = "arm,trace-buffer-extension";
- interrupts = <GIC_PPI 15 IRQ_TYPE_LEVEL_HIGH>;
- };
-...
diff --git a/Documentation/devicetree/bindings/arm/ux500.yaml b/Documentation/devicetree/bindings/arm/ux500.yaml
index 5db7cfba81a4..b42d20fa4359 100644
--- a/Documentation/devicetree/bindings/arm/ux500.yaml
+++ b/Documentation/devicetree/bindings/arm/ux500.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/ux500.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Ux500 platforms device tree bindings
+title: Ux500 platforms
maintainers:
- Linus Walleij <linus.walleij@linaro.org>
@@ -20,6 +20,11 @@ properties:
- const: st-ericsson,mop500
- const: st-ericsson,u8500
+ - description: ST-Ericsson HREF520
+ items:
+ - const: st-ericsson,href520
+ - const: st-ericsson,u8500
+
- description: ST-Ericsson HREF (v60+)
items:
- const: st-ericsson,hrefv60+
@@ -30,9 +35,39 @@ properties:
- const: calaosystems,snowball-a9500
- const: st-ericsson,u9500
+ - description: Samsung Galaxy Ace 2 (GT-I8160)
+ items:
+ - const: samsung,codina
+ - const: st-ericsson,u8500
+
+ - description: Samsung Galaxy Exhibit (SGH-T599)
+ items:
+ - const: samsung,codina-tmo
+ - const: st-ericsson,u8500
+
+ - description: Samsung Galaxy Beam (GT-I8530)
+ items:
+ - const: samsung,gavini
+ - const: st-ericsson,u8500
+
- description: Samsung Galaxy S III mini (GT-I8190)
items:
- const: samsung,golden
- const: st-ericsson,u8500
+ - description: Samsung Galaxy S Advance (GT-I9070)
+ items:
+ - const: samsung,janice
+ - const: st-ericsson,u8500
+
+ - description: Samsung Galaxy Amp (SGH-I407)
+ items:
+ - const: samsung,kyle
+ - const: st-ericsson,u8500
+
+ - description: Samsung Galaxy XCover 2 (GT-S7710)
+ items:
+ - const: samsung,skomer
+ - const: st-ericsson,u8500
+
additionalProperties: true
diff --git a/Documentation/devicetree/bindings/arm/versatile-sysreg.txt b/Documentation/devicetree/bindings/arm/versatile-sysreg.txt
deleted file mode 100644
index a4f15262d717..000000000000
--- a/Documentation/devicetree/bindings/arm/versatile-sysreg.txt
+++ /dev/null
@@ -1,10 +0,0 @@
-ARM Versatile system registers
---------------------------------------
-
-This is a system control registers block, providing multiple low level
-platform functions like board detection and identification, software
-interrupt generation, MMC and NOR Flash control etc.
-
-Required node properties:
-- compatible value : = "arm,versatile-sysreg", "syscon"
-- reg : physical base address and the size of the registers window
diff --git a/Documentation/devicetree/bindings/arm/vexpress-config.yaml b/Documentation/devicetree/bindings/arm/vexpress-config.yaml
new file mode 100644
index 000000000000..b74380da3198
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/vexpress-config.yaml
@@ -0,0 +1,285 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/vexpress-config.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ARM Versatile Express configuration bus
+
+maintainers:
+ - Andre Przywara <andre.przywara@arm.com>
+
+description:
+ This is a system control register block, acting as a bridge to the
+ platform's configuration bus via the "system control" interface, addressing
+ devices by site number, position in the board stack, config controller,
+ function and device numbers - see the motherboard's TRM for more details.
+
+properties:
+ compatible:
+ const: arm,vexpress,config-bus
+
+ arm,vexpress,config-bridge:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the sysreg node.
+
+ muxfpga:
+ type: object
+ properties:
+ compatible:
+ const: arm,vexpress-muxfpga
+
+ arm,vexpress-sysreg,func:
+ description: FPGA specifier
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - const: 7
+ - description: device number
+
+ additionalProperties: false
+ required:
+ - compatible
+ - arm,vexpress-sysreg,func
+
+ shutdown:
+ type: object
+ properties:
+ compatible:
+ const: arm,vexpress-shutdown
+
+ arm,vexpress-sysreg,func:
+ description: shutdown identifier
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - const: 8
+ - description: device number
+
+ additionalProperties: false
+ required:
+ - compatible
+ - arm,vexpress-sysreg,func
+
+ reboot:
+ type: object
+ properties:
+ compatible:
+ const: arm,vexpress-reboot
+
+ arm,vexpress-sysreg,func:
+ description: reboot identifier
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - const: 9
+ - description: device number
+
+ additionalProperties: false
+ required:
+ - compatible
+ - arm,vexpress-sysreg,func
+
+ dvimode:
+ type: object
+ properties:
+ compatible:
+ const: arm,vexpress-dvimode
+
+ arm,vexpress-sysreg,func:
+ description: DVI mode identifier
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - const: 11
+ - description: device number
+
+ additionalProperties: false
+ required:
+ - compatible
+ - arm,vexpress-sysreg,func
+
+additionalProperties: false
+
+required:
+ - compatible
+ - arm,vexpress,config-bridge
+
+patternProperties:
+ 'clk[0-9]*$':
+ type: object
+ description:
+ clocks
+
+ properties:
+ compatible:
+ const: arm,vexpress-osc
+
+ arm,vexpress-sysreg,func:
+ description: clock specifier
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - const: 1
+ - description: clock number
+
+ freq-range:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - description: minimum clock frequency
+ - description: maximum clock frequency
+
+ "#clock-cells":
+ const: 0
+
+ clock-output-names:
+ maxItems: 1
+
+ additionalProperties: false
+ required:
+ - compatible
+ - arm,vexpress-sysreg,func
+ - "#clock-cells"
+
+ "^volt-.+$":
+ $ref: /schemas/regulator/regulator.yaml#
+ properties:
+ compatible:
+ const: arm,vexpress-volt
+
+ arm,vexpress-sysreg,func:
+ description: regulator specifier
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - const: 2
+ - description: device number
+
+ label:
+ maxItems: 1
+
+ unevaluatedProperties: false
+ required:
+ - compatible
+ - arm,vexpress-sysreg,func
+
+ "^amp-.+$":
+ type: object
+ properties:
+ compatible:
+ const: arm,vexpress-amp
+
+ arm,vexpress-sysreg,func:
+ description: current sensor identifier
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - const: 3
+ - description: device number
+
+ label:
+ maxItems: 1
+
+ additionalProperties: false
+ required:
+ - compatible
+ - arm,vexpress-sysreg,func
+
+ "^temp-.+$":
+ type: object
+ properties:
+ compatible:
+ const: arm,vexpress-temp
+
+ arm,vexpress-sysreg,func:
+ description: temperature sensor identifier
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - const: 4
+ - description: device number
+
+ label:
+ maxItems: 1
+
+ additionalProperties: false
+ required:
+ - compatible
+ - arm,vexpress-sysreg,func
+
+ "^reset[0-9]*$":
+ type: object
+ properties:
+ compatible:
+ const: arm,vexpress-reset
+
+ arm,vexpress-sysreg,func:
+ description: reset specifier
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - const: 5
+ - description: reset device number
+
+ additionalProperties: false
+ required:
+ - compatible
+ - arm,vexpress-sysreg,func
+
+ "^power-.+$":
+ type: object
+ properties:
+ compatible:
+ const: arm,vexpress-power
+
+ arm,vexpress-sysreg,func:
+ description: power sensor identifier
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - const: 12
+ - description: device number
+
+ label:
+ maxItems: 1
+
+ additionalProperties: false
+ required:
+ - compatible
+ - arm,vexpress-sysreg,func
+
+ "^energy(-.+)?$":
+ type: object
+ properties:
+ compatible:
+ const: arm,vexpress-energy
+
+ arm,vexpress-sysreg,func:
+ description: energy sensor identifier
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ oneOf:
+ - items:
+ - const: 13
+ - description: device number
+ - items:
+ - const: 13
+ - description: device number
+ - const: 13
+ - description: second device number
+
+ label:
+ maxItems: 1
+
+ additionalProperties: false
+ required:
+ - compatible
+ - arm,vexpress-sysreg,func
+
+examples:
+ - |
+ mcc {
+ compatible = "arm,vexpress,config-bus";
+ arm,vexpress,config-bridge = <&v2m_sysreg>;
+
+ clk0 {
+ compatible = "arm,vexpress-osc";
+ arm,vexpress-sysreg,func = <1 0>;
+ #clock-cells = <0>;
+ };
+
+ energy {
+ compatible = "arm,vexpress-energy";
+ arm,vexpress-sysreg,func = <13 0>, <13 1>;
+ };
+ };
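Other function nodes use the same <function device> encoding defined above (5 for resets, 4 for temperature sensors, and so on); a hedged sketch of two more children, with assumed node names and labels, could look like:

    mcc {
        compatible = "arm,vexpress,config-bus";
        arm,vexpress,config-bridge = <&v2m_sysreg>;

        reset0 {
            compatible = "arm,vexpress-reset";
            arm,vexpress-sysreg,func = <5 0>;
        };

        temp-dcc {
            compatible = "arm,vexpress-temp";
            arm,vexpress-sysreg,func = <4 0>;
            label = "DCC";
        };
    };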
diff --git a/Documentation/devicetree/bindings/arm/vexpress-sysreg.txt b/Documentation/devicetree/bindings/arm/vexpress-sysreg.txt
deleted file mode 100644
index 50095802fb4a..000000000000
--- a/Documentation/devicetree/bindings/arm/vexpress-sysreg.txt
+++ /dev/null
@@ -1,103 +0,0 @@
-ARM Versatile Express system registers
---------------------------------------
-
-This is a system control registers block, providing multiple low level
-platform functions like board detection and identification, software
-interrupt generation, MMC and NOR Flash control etc.
-
-Required node properties:
-- compatible value : = "arm,vexpress,sysreg";
-- reg : physical base address and the size of the registers window
-
-Deprecated properties, replaced by GPIO subnodes (see below):
-- gpio-controller : specifies that the node is a GPIO controller
-- #gpio-cells : size of the GPIO specifier, should be 2:
- - first cell is the pseudo-GPIO line number:
- 0 - MMC CARDIN
- 1 - MMC WPROT
- 2 - NOR FLASH WPn
- - second cell can take standard GPIO flags (currently ignored).
-
-Control registers providing pseudo-GPIO lines must be represented
-by subnodes, each of them requiring the following properties:
-- compatible value : one of
- "arm,vexpress-sysreg,sys_led"
- "arm,vexpress-sysreg,sys_mci"
- "arm,vexpress-sysreg,sys_flash"
-- gpio-controller : makes the node a GPIO controller
-- #gpio-cells : size of the GPIO specifier, must be 2:
- - first cell is the function number:
- - for sys_led : 0..7 = LED 0..7
- - for sys_mci : 0 = MMC CARDIN, 1 = MMC WPROT
- - for sys_flash : 0 = NOR FLASH WPn
- - second cell can take standard GPIO flags (currently ignored).
-
-Example:
- v2m_sysreg: sysreg@10000000 {
- compatible = "arm,vexpress-sysreg";
- reg = <0x10000000 0x1000>;
-
- v2m_led_gpios: sys_led@8 {
- compatible = "arm,vexpress-sysreg,sys_led";
- gpio-controller;
- #gpio-cells = <2>;
- };
-
- v2m_mmc_gpios: sys_mci@48 {
- compatible = "arm,vexpress-sysreg,sys_mci";
- gpio-controller;
- #gpio-cells = <2>;
- };
-
- v2m_flash_gpios: sys_flash@4c {
- compatible = "arm,vexpress-sysreg,sys_flash";
- gpio-controller;
- #gpio-cells = <2>;
- };
- };
-
-This block also can also act a bridge to the platform's configuration
-bus via "system control" interface, addressing devices with site number,
-position in the board stack, config controller, function and device
-numbers - see motherboard's TRM for more details. All configuration
-controller accessible via this interface must reference the sysreg
-node via "arm,vexpress,config-bridge" phandle and define appropriate
-topology properties - see main vexpress node documentation for more
-details. Each child of such node describes one function and must
-define the following properties:
-- compatible value : must be one of (corresponding to the TRM):
- "arm,vexpress-amp"
- "arm,vexpress-dvimode"
- "arm,vexpress-energy"
- "arm,vexpress-muxfpga"
- "arm,vexpress-osc"
- "arm,vexpress-power"
- "arm,vexpress-reboot"
- "arm,vexpress-reset"
- "arm,vexpress-scc"
- "arm,vexpress-shutdown"
- "arm,vexpress-temp"
- "arm,vexpress-volt"
-- arm,vexpress-sysreg,func : must contain a set of two cells long groups:
- - first cell of each group defines the function number
- (eg. 1 for clock generator, 2 for voltage regulators etc.)
- - second cell of each group defines device number (eg. osc 0,
- osc 1 etc.)
- - some functions (eg. energy meter, with its 64 bit long counter)
- are using more than one function/device number pair
-
-Example:
- mcc {
- compatible = "arm,vexpress,config-bus";
- arm,vexpress,config-bridge = <&v2m_sysreg>;
-
- osc@0 {
- compatible = "arm,vexpress-osc";
- arm,vexpress-sysreg,func = <1 0>;
- };
-
- energy@0 {
- compatible = "arm,vexpress-energy";
- arm,vexpress-sysreg,func = <13 0>, <13 1>;
- };
- };
diff --git a/Documentation/devicetree/bindings/arm/vexpress-sysreg.yaml b/Documentation/devicetree/bindings/arm/vexpress-sysreg.yaml
new file mode 100644
index 000000000000..be6e3b542569
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/vexpress-sysreg.yaml
@@ -0,0 +1,96 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/arm/vexpress-sysreg.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ARM Versatile Express system registers
+
+maintainers:
+ - Andre Przywara <andre.przywara@arm.com>
+
+description:
+ This is a system control register block, providing multiple low-level
+ platform functions like board detection and identification, software
+ interrupt generation, MMC and NOR Flash control, etc.
+
+properties:
+ compatible:
+ const: arm,vexpress-sysreg
+
+ reg:
+ maxItems: 1
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 1
+
+ ranges: true
+
+ gpio-controller:
+ deprecated: true
+
+ "#gpio-cells":
+ deprecated: true
+ const: 2
+
+additionalProperties: false
+
+patternProperties:
+ '^gpio@[0-9a-f]+$':
+ type: object
+ additionalProperties: false
+ description:
+ GPIO children
+
+ properties:
+ compatible:
+ enum:
+ - arm,vexpress-sysreg,sys_led
+ - arm,vexpress-sysreg,sys_mci
+ - arm,vexpress-sysreg,sys_flash
+
+ gpio-controller: true
+
+ "#gpio-cells":
+ const: 2
+ description: |
+ The first cell is the function number:
+ for sys_led : 0..7 = LED 0..7
+ for sys_mci : 0 = MMC CARDIN, 1 = MMC WPROT
+ for sys_flash : 0 = NOR FLASH WPn
+ The second cell can take standard GPIO flags.
+
+ reg:
+ maxItems: 1
+
+ required:
+ - compatible
+ - reg
+ - gpio-controller
+ - "#gpio-cells"
+
+required:
+ - compatible
+ - reg
+
+examples:
+ - |
+ sysreg@0 {
+ compatible = "arm,vexpress-sysreg";
+ reg = <0x00000 0x1000>;
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges = <0 0 0x1000>;
+
+ v2m_led_gpios: gpio@8 {
+ compatible = "arm,vexpress-sysreg,sys_led";
+ reg = <0x008 4>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ };
+ };
+
+...
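The deleted text binding placed the MMC and NOR Flash pseudo-GPIO blocks at offsets 0x48 and 0x4c; under this schema, those children would be renamed by offset, roughly as in this sketch:

    v2m_mmc_gpios: gpio@48 {
        compatible = "arm,vexpress-sysreg,sys_mci";
        reg = <0x048 4>;
        gpio-controller;
        #gpio-cells = <2>;
    };

    v2m_flash_gpios: gpio@4c {
        compatible = "arm,vexpress-sysreg,sys_flash";
        reg = <0x04c 4>;
        gpio-controller;
        #gpio-cells = <2>;
    };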
diff --git a/Documentation/devicetree/bindings/arm/vt8500.yaml b/Documentation/devicetree/bindings/arm/vt8500.yaml
index 7b762bfc11e7..5d5ad5a60451 100644
--- a/Documentation/devicetree/bindings/arm/vt8500.yaml
+++ b/Documentation/devicetree/bindings/arm/vt8500.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/vt8500.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: VIA/Wondermedia VT8500 Platforms Device Tree Bindings
+title: VIA/Wondermedia VT8500 Platforms
maintainers:
- Tony Prisk <linux@prisktech.co.nz>
diff --git a/Documentation/devicetree/bindings/arm/xen.txt b/Documentation/devicetree/bindings/arm/xen.txt
index db5c56db30ec..61d77acbeb5e 100644
--- a/Documentation/devicetree/bindings/arm/xen.txt
+++ b/Documentation/devicetree/bindings/arm/xen.txt
@@ -7,15 +7,17 @@ the following properties:
compatible = "xen,xen-<version>", "xen,xen";
where <version> is the version of the Xen ABI of the platform.
-- reg: specifies the base physical address and size of a region in
- memory where the grant table should be mapped to, using an
- HYPERVISOR_memory_op hypercall. The memory region is large enough to map
- the whole grant table (it is larger or equal to gnttab_max_grant_frames()).
- This property is unnecessary when booting Dom0 using ACPI.
+- reg: specifies the base physical address and size of the regions in memory
+ where the special resources should be mapped to, using a HYPERVISOR_memory_op
+ hypercall.
+ Region 0 is reserved for mapping the grant table and must always be present.
+ The memory region is large enough to map the whole grant table (it is larger
+ than or equal to gnttab_max_grant_frames()).
+ Regions 1...N are extended regions (unused address space) for mapping foreign
+ GFNs and grants; they might be absent if there is nothing to expose.
- interrupts: the interrupt used by Xen to inject event notifications.
A GIC node is also required.
- This property is unnecessary when booting Dom0 using ACPI.
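Putting the above together, a hypervisor node exposing the grant-table region plus one extended region might be sketched as follows; the addresses, sizes and PPI number are illustrative assumptions:

    hypervisor {
        compatible = "xen,xen-4.16", "xen,xen";
        reg = <0x0 0x38000000 0x0 0x1000000>,
              <0x0 0x50000000 0x0 0x10000000>;
        interrupts = <1 15 0xf08>;
    };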
To support UEFI on Xen ARM virtual platforms, Xen populates the FDT "uefi" node
under /hypervisor with following parameters:
diff --git a/Documentation/devicetree/bindings/arm/xilinx.yaml b/Documentation/devicetree/bindings/arm/xilinx.yaml
index f52c7e8ce654..969cfe6dc434 100644
--- a/Documentation/devicetree/bindings/arm/xilinx.yaml
+++ b/Documentation/devicetree/bindings/arm/xilinx.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/arm/xilinx.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Xilinx Zynq Platforms Device Tree Bindings
+title: Xilinx Zynq Platforms
maintainers:
- Michal Simek <michal.simek@xilinx.com>
@@ -87,6 +87,7 @@ properties:
- xlnx,zynqmp-zcu102-revA
- xlnx,zynqmp-zcu102-revB
- xlnx,zynqmp-zcu102-rev1.0
+ - xlnx,zynqmp-zcu102-rev1.1
- const: xlnx,zynqmp-zcu102
- const: xlnx,zynqmp
@@ -115,6 +116,22 @@ properties:
- const: xlnx,zynqmp-zcu111
- const: xlnx,zynqmp
+ - description: Xilinx Kria SOMs
+ items:
+ - const: xlnx,zynqmp-sm-k26-rev1
+ - const: xlnx,zynqmp-sm-k26-revB
+ - const: xlnx,zynqmp-sm-k26-revA
+ - const: xlnx,zynqmp-sm-k26
+ - const: xlnx,zynqmp
+
+ - description: Xilinx Kria SOMs (starter)
+ items:
+ - const: xlnx,zynqmp-smk-k26-rev1
+ - const: xlnx,zynqmp-smk-k26-revB
+ - const: xlnx,zynqmp-smk-k26-revA
+ - const: xlnx,zynqmp-smk-k26
+ - const: xlnx,zynqmp
+
additionalProperties: true
...
diff --git a/Documentation/devicetree/bindings/ata/ahci-ceva.txt b/Documentation/devicetree/bindings/ata/ahci-ceva.txt
deleted file mode 100644
index bfb6da0281ec..000000000000
--- a/Documentation/devicetree/bindings/ata/ahci-ceva.txt
+++ /dev/null
@@ -1,63 +0,0 @@
-Binding for CEVA AHCI SATA Controller
-
-Required properties:
- - reg: Physical base address and size of the controller's register area.
- - compatible: Compatibility string. Must be 'ceva,ahci-1v84'.
- - clocks: Input clock specifier. Refer to common clock bindings.
- - interrupts: Interrupt specifier. Refer to interrupt binding.
- - ceva,p0-cominit-params: OOB timing value for COMINIT parameter for port 0.
- - ceva,p1-cominit-params: OOB timing value for COMINIT parameter for port 1.
- The fields for the above parameter must be as shown below:
- ceva,pN-cominit-params = /bits/ 8 <CIBGMN CIBGMX CIBGN CINMP>;
- CINMP : COMINIT Negate Minimum Period.
- CIBGN : COMINIT Burst Gap Nominal.
- CIBGMX: COMINIT Burst Gap Maximum.
- CIBGMN: COMINIT Burst Gap Minimum.
- - ceva,p0-comwake-params: OOB timing value for COMWAKE parameter for port 0.
- - ceva,p1-comwake-params: OOB timing value for COMWAKE parameter for port 1.
- The fields for the above parameter must be as shown below:
- ceva,pN-comwake-params = /bits/ 8 <CWBGMN CWBGMX CWBGN CWNMP>;
- CWBGMN: COMWAKE Burst Gap Minimum.
- CWBGMX: COMWAKE Burst Gap Maximum.
- CWBGN: COMWAKE Burst Gap Nominal.
- CWNMP: COMWAKE Negate Minimum Period.
- - ceva,p0-burst-params: Burst timing value for COM parameter for port 0.
- - ceva,p1-burst-params: Burst timing value for COM parameter for port 1.
- The fields for the above parameter must be as shown below:
- ceva,pN-burst-params = /bits/ 8 <BMX BNM SFD PTST>;
- BMX: COM Burst Maximum.
- BNM: COM Burst Nominal.
- SFD: Signal Failure Detection value.
- PTST: Partial to Slumber timer value.
- - ceva,p0-retry-params: Retry interval timing value for port 0.
- - ceva,p1-retry-params: Retry interval timing value for port 1.
- The fields for the above parameter must be as shown below:
- ceva,pN-retry-params = /bits/ 16 <RIT RCT>;
- RIT: Retry Interval Timer.
- RCT: Rate Change Timer.
-
-Optional properties:
- - ceva,broken-gen2: limit to gen1 speed instead of gen2.
- - phys: phandle for the PHY device
- - resets: phandle to the reset controller for the SATA IP
-
-Examples:
- ahci@fd0c0000 {
- compatible = "ceva,ahci-1v84";
- reg = <0xfd0c0000 0x200>;
- interrupt-parent = <&gic>;
- interrupts = <0 133 4>;
- clocks = <&clkc SATA_CLK_ID>;
- ceva,p0-cominit-params = /bits/ 8 <0x0F 0x25 0x18 0x29>;
- ceva,p0-comwake-params = /bits/ 8 <0x04 0x0B 0x08 0x0F>;
- ceva,p0-burst-params = /bits/ 8 <0x0A 0x08 0x4A 0x06>;
- ceva,p0-retry-params = /bits/ 16 <0x0216 0x7F06>;
-
- ceva,p1-cominit-params = /bits/ 8 <0x0F 0x25 0x18 0x29>;
- ceva,p1-comwake-params = /bits/ 8 <0x04 0x0B 0x08 0x0F>;
- ceva,p1-burst-params = /bits/ 8 <0x0A 0x08 0x4A 0x06>;
- ceva,p1-retry-params = /bits/ 16 <0x0216 0x7F06>;
- ceva,broken-gen2;
- phys = <&psgtr 1 PHY_TYPE_SATA 1 1>;
- resets = <&zynqmp_reset ZYNQMP_RESET_SATA>;
- };
diff --git a/Documentation/devicetree/bindings/ata/ahci-common.yaml b/Documentation/devicetree/bindings/ata/ahci-common.yaml
new file mode 100644
index 000000000000..7fdf40954a4c
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/ahci-common.yaml
@@ -0,0 +1,123 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/ata/ahci-common.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Common Properties for Serial ATA AHCI controllers
+
+maintainers:
+ - Hans de Goede <hdegoede@redhat.com>
+ - Damien Le Moal <damien.lemoal@opensource.wdc.com>
+
+description:
+ This document defines device tree properties for a common AHCI SATA
+ controller implementation. Its hardware interface is supposed to
+ conform to the technical standard defined by Intel (see the Serial ATA
+ Advanced Host Controller Interface specification for details). The
+ document doesn't constitute a DT-node binding by itself but merely
+ defines a set of common properties for AHCI-compatible devices.
+
+select: false
+
+allOf:
+ - $ref: sata-common.yaml#
+
+properties:
+ reg:
+ description:
+ Generic AHCI registers space conforming to the Serial ATA AHCI
+ specification.
+
+ reg-names:
+ description: CSR space IDs
+ contains:
+ const: ahci
+
+ interrupts:
+ description:
+ Generic AHCI state change interrupt. Can be implemented either as a
+ single line attached to the controller or as a set of signals
+ indicating particular port events.
+ minItems: 1
+ maxItems: 32
+
+ ahci-supply:
+ description: Power regulator for AHCI controller
+
+ target-supply:
+ description: Power regulator for SATA target device
+
+ phy-supply:
+ description: Power regulator for SATA PHY
+
+ phys:
+ description: Reference to the SATA PHY node
+ maxItems: 1
+
+ phy-names:
+ const: sata-phy
+
+ hba-cap:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Bitfield of the HBA generic platform capabilities like Staggered
+ Spin-up or Mechanical Presence Switch support. It can be used to
+ appropriately initialize the HWinit fields of the HBA CAP register
+ in case the system firmware hasn't done it.
+
+ ports-implemented:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Mask that indicates which ports the HBA supports. Useful if PI is not
+ programmed by the BIOS, which is true for some embedded SoCs.
+
+patternProperties:
+ "^sata-port@[0-9a-f]+$":
+ $ref: '#/$defs/ahci-port'
+ description:
+ It is optionally possible to describe the ports as sub-nodes, so as
+ to enable each port independently when dealing with multiple PHYs.
+
+required:
+ - reg
+ - interrupts
+
+additionalProperties: true
+
+$defs:
+ ahci-port:
+ $ref: /schemas/ata/sata-common.yaml#/$defs/sata-port
+
+ properties:
+ reg:
+ description:
+ AHCI SATA port identifier. By design an AHCI controller can't have
+ more than 32 ports due to the CAP.NP field and PI register size
+ constraints.
+ minimum: 0
+ maximum: 31
+
+ phys:
+ description: Individual AHCI SATA port PHY
+ maxItems: 1
+
+ phy-names:
+ description: AHCI SATA port PHY ID
+ const: sata-phy
+
+ target-supply:
+ description: Power regulator for SATA port target device
+
+ hba-port-cap:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Bitfield of the HBA port-specific platform capabilities like hot
+ plugging, eSATA, FIS-based switching, etc. (see the AHCI specification
+ for details). It can be used to initialize the HWinit fields of
+ the PxCMD register in case the system firmware hasn't done it.
+
+ required:
+ - reg
+
+...
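Because this schema is a property library rather than a standalone binding (note select: false), it is only referenced from other schemas; a consumer node drawing on it might be sketched as below, with an assumed compatible and placeholder addresses:

    sata@122f0000 {
        compatible = "snps,dwc-ahci";
        reg = <0x122f0000 0x1000>;
        interrupts = <115>;
        ports-implemented = <0x1>;
        phys = <&sata_phy>;
        phy-names = "sata-phy";
    };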
diff --git a/Documentation/devicetree/bindings/ata/ahci-platform.txt b/Documentation/devicetree/bindings/ata/ahci-platform.txt
deleted file mode 100644
index 77091a277642..000000000000
--- a/Documentation/devicetree/bindings/ata/ahci-platform.txt
+++ /dev/null
@@ -1,79 +0,0 @@
-* AHCI SATA Controller
-
-SATA nodes are defined to describe on-chip Serial ATA controllers.
-Each SATA controller should have its own node.
-
-It is possible, but not required, to represent each port as a sub-node.
-It allows to enable each port independently when dealing with multiple
-PHYs.
-
-Required properties:
-- compatible : compatible string, one of:
- - "brcm,iproc-ahci"
- - "hisilicon,hisi-ahci"
- - "cavium,octeon-7130-ahci"
- - "ibm,476gtr-ahci"
- - "marvell,armada-380-ahci"
- - "marvell,armada-3700-ahci"
- - "snps,dwc-ahci"
- - "snps,spear-ahci"
- - "generic-ahci"
-- interrupts : <interrupt mapping for SATA IRQ>
-- reg : <registers mapping>
-
-Please note that when using "generic-ahci" you must also specify a SoC specific
-compatible:
- compatible = "manufacturer,soc-model-ahci", "generic-ahci";
-
-Optional properties:
-- dma-coherent : Present if dma operations are coherent
-- clocks : a list of phandle + clock specifier pairs
-- resets : a list of phandle + reset specifier pairs
-- target-supply : regulator for SATA target power
-- phy-supply : regulator for PHY power
-- phys : reference to the SATA PHY node
-- phy-names : must be "sata-phy"
-- ahci-supply : regulator for AHCI controller
-- ports-implemented : Mask that indicates which ports that the HBA supports
- are available for software to use. Useful if PORTS_IMPL
- is not programmed by the BIOS, which is true with
- some embedded SOC's.
-
-Required properties when using sub-nodes:
-- #address-cells : number of cells to encode an address
-- #size-cells : number of cells representing the size of an address
-
-Sub-nodes required properties:
-- reg : the port number
-And at least one of the following properties:
-- phys : reference to the SATA PHY node
-- target-supply : regulator for SATA target power
-
-Examples:
- sata@ffe08000 {
- compatible = "snps,spear-ahci";
- reg = <0xffe08000 0x1000>;
- interrupts = <115>;
- };
-
-With sub-nodes:
- sata@f7e90000 {
- compatible = "marvell,berlin2q-achi", "generic-ahci";
- reg = <0xe90000 0x1000>;
- interrupts = <GIC_SPI 7 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&chip CLKID_SATA>;
- #address-cells = <1>;
- #size-cells = <0>;
-
- sata0: sata-port@0 {
- reg = <0>;
- phys = <&sata_phy 0>;
- target-supply = <&reg_sata0>;
- };
-
- sata1: sata-port@1 {
- reg = <1>;
- phys = <&sata_phy 1>;
- target-supply = <&reg_sata1>;;
- };
- };
diff --git a/Documentation/devicetree/bindings/ata/ahci-platform.yaml b/Documentation/devicetree/bindings/ata/ahci-platform.yaml
new file mode 100644
index 000000000000..358617115bb8
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/ahci-platform.yaml
@@ -0,0 +1,176 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/ata/ahci-platform.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: AHCI SATA Controller
+
+description: |
+ SATA nodes are defined to describe on-chip Serial ATA controllers.
+ Each SATA controller should have its own node.
+
+ It is possible, but not required, to represent each port as a sub-node.
+ This allows each port to be enabled independently when dealing with
+ multiple PHYs.
+
+maintainers:
+ - Hans de Goede <hdegoede@redhat.com>
+ - Jens Axboe <axboe@kernel.dk>
+
+select:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - brcm,iproc-ahci
+ - cavium,octeon-7130-ahci
+ - hisilicon,hisi-ahci
+ - ibm,476gtr-ahci
+ - marvell,armada-3700-ahci
+ - marvell,armada-8k-ahci
+ - marvell,berlin2q-ahci
+ - socionext,uniphier-pro4-ahci
+ - socionext,uniphier-pxs2-ahci
+ - socionext,uniphier-pxs3-ahci
+ required:
+ - compatible
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - brcm,iproc-ahci
+ - marvell,armada-8k-ahci
+ - marvell,berlin2-ahci
+ - marvell,berlin2q-ahci
+ - socionext,uniphier-pro4-ahci
+ - socionext,uniphier-pxs2-ahci
+ - socionext,uniphier-pxs3-ahci
+ - const: generic-ahci
+ - enum:
+ - cavium,octeon-7130-ahci
+ - hisilicon,hisi-ahci
+ - ibm,476gtr-ahci
+ - marvell,armada-3700-ahci
+
+ reg:
+ minItems: 1
+ maxItems: 2
+
+ reg-names:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 3
+
+ clock-names:
+ minItems: 1
+ maxItems: 3
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ maxItems: 1
+
+ resets:
+ minItems: 1
+ maxItems: 3
+
+patternProperties:
+ "^sata-port@[0-9a-f]+$":
+ $ref: /schemas/ata/ahci-common.yaml#/$defs/ahci-port
+
+ anyOf:
+ - required: [ phys ]
+ - required: [ target-supply ]
+
+ unevaluatedProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+
+allOf:
+ - $ref: ahci-common.yaml#
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: socionext,uniphier-pro4-ahci
+ then:
+ properties:
+ resets:
+ items:
+ - description: reset line for the parent
+ - description: reset line for the glue logic
+ - description: reset line for the controller
+ required:
+ - resets
+ else:
+ if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - socionext,uniphier-pxs2-ahci
+ - socionext,uniphier-pxs3-ahci
+ then:
+ properties:
+ resets:
+ items:
+ - description: reset for the glue logic
+ - description: reset for the controller
+ required:
+ - resets
+ else:
+ properties:
+ resets:
+ maxItems: 1
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ sata@ffe08000 {
+ compatible = "snps,spear-ahci";
+ reg = <0xffe08000 0x1000>;
+ interrupts = <115>;
+ };
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/berlin2q.h>
+ #include <dt-bindings/ata/ahci.h>
+
+ sata@f7e90000 {
+ compatible = "marvell,berlin2q-ahci", "generic-ahci";
+ reg = <0xf7e90000 0x1000>;
+ interrupts = <GIC_SPI 7 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&chip CLKID_SATA>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ hba-cap = <HBA_SMPS>;
+
+ sata0: sata-port@0 {
+ reg = <0>;
+
+ phys = <&sata_phy 0>;
+ target-supply = <&reg_sata0>;
+
+ hba-port-cap = <(HBA_PORT_FBSCP | HBA_PORT_ESP)>;
+ };
+
+ sata1: sata-port@1 {
+ reg = <1>;
+
+ phys = <&sata_phy 1>;
+ target-supply = <&reg_sata1>;
+
+ hba-port-cap = <(HBA_PORT_HPCP | HBA_PORT_MPSP | HBA_PORT_FBSCP)>;
+ };
+ };
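For the UniPhier variants constrained by the allOf block above, the three-reset form required for socionext,uniphier-pro4-ahci would be wired roughly as in this sketch; the clock and reset phandles and specifiers are assumed:

    sata@65600000 {
        compatible = "socionext,uniphier-pro4-ahci", "generic-ahci";
        reg = <0x65600000 0x10000>;
        interrupts = <GIC_SPI 39 IRQ_TYPE_LEVEL_HIGH>;
        clocks = <&sys_clk 12>;
        resets = <&sys_rst 12>, <&ahci_rst 0>, <&ahci_rst 1>;
    };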
diff --git a/Documentation/devicetree/bindings/ata/allwinner,sun4i-a10-ahci.yaml b/Documentation/devicetree/bindings/ata/allwinner,sun4i-a10-ahci.yaml
index cb530b46beff..2011bd03cdcd 100644
--- a/Documentation/devicetree/bindings/ata/allwinner,sun4i-a10-ahci.yaml
+++ b/Documentation/devicetree/bindings/ata/allwinner,sun4i-a10-ahci.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/ata/allwinner,sun4i-a10-ahci.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 AHCI SATA Controller bindings
+title: Allwinner A10 AHCI SATA Controller
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/ata/allwinner,sun8i-r40-ahci.yaml b/Documentation/devicetree/bindings/ata/allwinner,sun8i-r40-ahci.yaml
index e6b42a113ff1..a2afe2ad6063 100644
--- a/Documentation/devicetree/bindings/ata/allwinner,sun8i-r40-ahci.yaml
+++ b/Documentation/devicetree/bindings/ata/allwinner,sun8i-r40-ahci.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/ata/allwinner,sun8i-r40-ahci.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner R40 AHCI SATA Controller bindings
+title: Allwinner R40 AHCI SATA Controller
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/ata/ata-generic.yaml b/Documentation/devicetree/bindings/ata/ata-generic.yaml
new file mode 100644
index 000000000000..0697927f3d7e
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/ata-generic.yaml
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/ata/ata-generic.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Generic Parallel ATA Controller
+
+maintainers:
+ - Linus Walleij <linus.walleij@linaro.org>
+
+description:
+ Generic Parallel ATA controllers supporting PIO modes only.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - arm,vexpress-cf
+ - fsl,mpc8349emitx-pata
+ - const: ata-generic
+
+ reg:
+ items:
+ - description: Command interface registers
+ - description: Control interface registers
+
+ reg-shift:
+ enum: [ 1, 2 ]
+
+ interrupts:
+ maxItems: 1
+
+ ata-generic,use16bit:
+ type: boolean
+ description: Use 16-bit accesses instead of 32-bit for data transfers
+
+ pio-mode:
+ description: Maximum ATA PIO transfer mode
+ $ref: /schemas/types.yaml#/definitions/uint32
+ maximum: 6
+ default: 0
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ compact-flash@1a000 {
+ compatible = "arm,vexpress-cf", "ata-generic";
+ reg = <0x1a000 0x100>,
+ <0x1a100 0xf00>;
+ reg-shift = <2>;
+ };
+...
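A hedged sketch of the MPC8349E-mITX flavour exercising the optional properties; the addresses and interrupt number are assumed for illustration:

    pata@2000 {
        compatible = "fsl,mpc8349emitx-pata", "ata-generic";
        reg = <0x2000 0x100>,
              <0x2100 0x100>;
        reg-shift = <1>;
        interrupts = <23>;
        ata-generic,use16bit;
        pio-mode = <4>;
    };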
diff --git a/Documentation/devicetree/bindings/ata/baikal,bt1-ahci.yaml b/Documentation/devicetree/bindings/ata/baikal,bt1-ahci.yaml
new file mode 100644
index 000000000000..9b7ca4759bd7
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/baikal,bt1-ahci.yaml
@@ -0,0 +1,115 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/ata/baikal,bt1-ahci.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Baikal-T1 SoC AHCI SATA controller
+
+maintainers:
+ - Serge Semin <fancer.lancer@gmail.com>
+
+description:
+ The AHCI SATA controller embedded into the Baikal-T1 SoC is based on
+ the DWC AHCI SATA v4.10a IP-core.
+
+allOf:
+ - $ref: snps,dwc-ahci-common.yaml#
+
+properties:
+ compatible:
+ const: baikal,bt1-ahci
+
+ clocks:
+ items:
+ - description: Peripheral APB bus clock
+ - description: Application AXI BIU clock
+ - description: SATA Ports reference clock
+
+ clock-names:
+ items:
+ - const: pclk
+ - const: aclk
+ - const: ref
+
+ resets:
+ items:
+ - description: Application AXI BIU domain reset
+ - description: SATA Ports clock domain reset
+
+ reset-names:
+ items:
+ - const: arst
+ - const: ref
+
+ ports-implemented:
+ maximum: 0x3
+
+patternProperties:
+ "^sata-port@[0-1]$":
+ $ref: /schemas/ata/snps,dwc-ahci-common.yaml#/$defs/dwc-ahci-port
+
+ properties:
+ reg:
+ minimum: 0
+ maximum: 1
+
+ snps,tx-ts-max:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Since the AXI3 bus interface is utilized, the maximum Tx DMA
+ transaction size can't exceed 16 beats (AxLEN[3:0]).
+ enum: [ 1, 2, 4, 8, 16 ]
+
+ snps,rx-ts-max:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Since the AXI3 bus interface is utilized, the maximum Rx DMA
+ transaction size can't exceed 16 beats (AxLEN[3:0]).
+ enum: [ 1, 2, 4, 8, 16 ]
+
+ unevaluatedProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - resets
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ sata@1f050000 {
+ compatible = "baikal,bt1-ahci";
+ reg = <0x1f050000 0x2000>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ interrupts = <0 64 4>;
+
+ clocks = <&ccu_sys 1>, <&ccu_axi 2>, <&sata_ref_clk>;
+ clock-names = "pclk", "aclk", "ref";
+
+ resets = <&ccu_axi 2>, <&ccu_sys 0>;
+ reset-names = "arst", "ref";
+
+ ports-implemented = <0x3>;
+
+ sata-port@0 {
+ reg = <0>;
+
+ snps,tx-ts-max = <4>;
+ snps,rx-ts-max = <4>;
+ };
+
+ sata-port@1 {
+ reg = <1>;
+
+ snps,tx-ts-max = <4>;
+ snps,rx-ts-max = <4>;
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/ata/brcm,sata-brcm.txt b/Documentation/devicetree/bindings/ata/brcm,sata-brcm.txt
deleted file mode 100644
index b9ae4ce4a0a0..000000000000
--- a/Documentation/devicetree/bindings/ata/brcm,sata-brcm.txt
+++ /dev/null
@@ -1,45 +0,0 @@
-* Broadcom SATA3 AHCI Controller
-
-SATA nodes are defined to describe on-chip Serial ATA controllers.
-Each SATA controller should have its own node.
-
-Required properties:
-- compatible : should be one or more of
- "brcm,bcm7216-ahci"
- "brcm,bcm7425-ahci"
- "brcm,bcm7445-ahci"
- "brcm,bcm-nsp-ahci"
- "brcm,sata3-ahci"
- "brcm,bcm63138-ahci"
-- reg : register mappings for AHCI and SATA_TOP_CTRL
-- reg-names : "ahci" and "top-ctrl"
-- interrupts : interrupt mapping for SATA IRQ
-
-Optional properties:
-
-- reset: for "brcm,bcm7216-ahci" must be a valid reset phandle
- pointing to the RESCAL reset controller provider node.
-- reset-names: for "brcm,bcm7216-ahci", must be "rescal".
-
-Also see ahci-platform.txt.
-
-Example:
-
- sata@f045a000 {
- compatible = "brcm,bcm7445-ahci", "brcm,sata3-ahci";
- reg = <0xf045a000 0xa9c>, <0xf0458040 0x24>;
- reg-names = "ahci", "top-ctrl";
- interrupts = <0 30 0>;
- #address-cells = <1>;
- #size-cells = <0>;
-
- sata0: sata-port@0 {
- reg = <0>;
- phys = <&sata_phy 0>;
- };
-
- sata1: sata-port@1 {
- reg = <1>;
- phys = <&sata_phy 1>;
- };
- };
diff --git a/Documentation/devicetree/bindings/ata/brcm,sata-brcm.yaml b/Documentation/devicetree/bindings/ata/brcm,sata-brcm.yaml
new file mode 100644
index 000000000000..fe7f091e744f
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/brcm,sata-brcm.yaml
@@ -0,0 +1,87 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/ata/brcm,sata-brcm.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Broadcom SATA3 AHCI Controller
+
+description:
+ SATA nodes are defined to describe on-chip Serial ATA controllers.
+ Each SATA controller should have its own node.
+
+maintainers:
+ - Florian Fainelli <f.fainelli@gmail.com>
+
+allOf:
+ - $ref: ahci-common.yaml#
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - brcm,bcm7216-ahci
+ - brcm,bcm7445-ahci
+ - brcm,bcm7425-ahci
+ - brcm,bcm63138-ahci
+ - const: brcm,sata3-ahci
+ - items:
+ - const: brcm,bcm-nsp-ahci
+
+ reg:
+ maxItems: 2
+
+ reg-names:
+ items:
+ - const: ahci
+ - const: top-ctrl
+
+ interrupts:
+ maxItems: 1
+
+if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - brcm,bcm7216-ahci
+ - brcm,bcm63138-ahci
+then:
+ properties:
+ resets:
+ maxItems: 1
+ reset-names:
+ enum:
+ - rescal
+ - ahci
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - "#address-cells"
+ - "#size-cells"
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ sata@f045a000 {
+ compatible = "brcm,bcm7445-ahci", "brcm,sata3-ahci";
+ reg = <0xf045a000 0xa9c>, <0xf0458040 0x24>;
+ reg-names = "ahci", "top-ctrl";
+ interrupts = <0 30 0>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sata0: sata-port@0 {
+ reg = <0>;
+ phys = <&sata_phy 0>;
+ };
+
+ sata1: sata-port@1 {
+ reg = <1>;
+ phys = <&sata_phy 1>;
+ };
+ };
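For the BCM7216 case matched by the if/then clause above, the RESCAL reset would be referenced roughly as below; the &rescal phandle is an assumed name:

    sata@f045a000 {
        compatible = "brcm,bcm7216-ahci", "brcm,sata3-ahci";
        reg = <0xf045a000 0xa9c>, <0xf0458040 0x24>;
        reg-names = "ahci", "top-ctrl";
        interrupts = <0 30 0>;
        resets = <&rescal>;
        reset-names = "rescal";
        #address-cells = <1>;
        #size-cells = <0>;
    };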
diff --git a/Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml b/Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml
new file mode 100644
index 000000000000..9b31f864e071
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/ceva,ahci-1v84.yaml
@@ -0,0 +1,189 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/ata/ceva,ahci-1v84.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Ceva AHCI SATA Controller
+
+maintainers:
+ - Piyush Mehta <piyush.mehta@xilinx.com>
+
+description: |
+ The Ceva SATA controller mostly conforms to the AHCI interface, with
+ some special extensions to add functionality. It is a high-performance
+ dual-port SATA host controller with an AHCI-compliant command layer
+ which supports advanced features such as native command queuing and
+ frame information structure (FIS) based switching for systems employing
+ port multipliers.
+
+properties:
+ compatible:
+ const: ceva,ahci-1v84
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ dma-coherent: true
+
+ interrupts:
+ maxItems: 1
+
+ iommus:
+ maxItems: 1
+
+ power-domains:
+ maxItems: 1
+
+ ceva,p0-cominit-params:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ description: |
+ OOB timing value for COMINIT parameter for port 0.
+ The fields for the above parameter must be as shown below:-
+ ceva,p0-cominit-params = /bits/ 8 <CIBGMN CIBGMX CIBGN CINMP>;
+ items:
+ - description: CINMP - COMINIT Negate Minimum Period.
+ - description: CIBGN - COMINIT Burst Gap Nominal.
+ - description: CIBGMX - COMINIT Burst Gap Maximum.
+ - description: CIBGMN - COMINIT Burst Gap Minimum.
+
+ ceva,p0-comwake-params:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ description: |
+ OOB timing value for COMWAKE parameter for port 0.
+ The fields for the above parameter must be as shown below:-
+ ceva,p0-comwake-params = /bits/ 8 <CWBGMN CWBGMX CWBGN CWNMP>;
+ items:
+ - description: CWBGMN - COMWAKE Burst Gap Minimum.
+ - description: CWBGMX - COMWAKE Burst Gap Maximum.
+ - description: CWBGN - COMWAKE Burst Gap Nominal.
+ - description: CWNMP - COMWAKE Negate Minimum Period.
+
+ ceva,p0-burst-params:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ description: |
+ Burst timing value for COM parameter for port 0.
+ The fields for the above parameter must be as shown below:-
+ ceva,p0-burst-params = /bits/ 8 <BMX BNM SFD PTST>;
+ items:
+ - description: BMX - COM Burst Maximum.
+ - description: BNM - COM Burst Nominal.
+ - description: SFD - Signal Failure Detection value.
+ - description: PTST - Partial to Slumber timer value.
+
+ ceva,p0-retry-params:
+ $ref: /schemas/types.yaml#/definitions/uint16-array
+ description: |
+ Retry interval timing value for port 0.
+ The fields for the above parameter must be as shown below:-
+ ceva,p0-retry-params = /bits/ 16 <RIT RCT>;
+ items:
+ - description: RIT - Retry Interval Timer.
+ - description: RCT - Rate Change Timer.
+
+ ceva,p1-cominit-params:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ description: |
+ OOB timing value for COMINIT parameter for port 1.
+ The fields for the above parameter must be as shown below:-
+ ceva,p1-cominit-params = /bits/ 8 <CIBGMN CIBGMX CIBGN CINMP>;
+ items:
+ - description: CINMP - COMINIT Negate Minimum Period.
+ - description: CIBGN - COMINIT Burst Gap Nominal.
+ - description: CIBGMX - COMINIT Burst Gap Maximum.
+ - description: CIBGMN - COMINIT Burst Gap Minimum.
+
+ ceva,p1-comwake-params:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ description: |
+ OOB timing value for COMWAKE parameter for port 1.
+ The fields for the above parameter must be as shown below:-
+ ceva,p1-comwake-params = /bits/ 8 <CWBGMN CWBGMX CWBGN CWNMP>;
+ items:
+ - description: CWBGMN - COMWAKE Burst Gap Minimum.
+ - description: CWBGMX - COMWAKE Burst Gap Maximum.
+ - description: CWBGN - COMWAKE Burst Gap Nominal.
+ - description: CWNMP - COMWAKE Negate Minimum Period.
+
+ ceva,p1-burst-params:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ description: |
+ Burst timing value for COM parameter for port 1.
+ The fields for the above parameter must be as shown below:-
+ ceva,p1-burst-params = /bits/ 8 <BMX BNM SFD PTST>;
+ items:
+ - description: BMX - COM Burst Maximum.
+ - description: BNM - COM Burst Nominal.
+ - description: SFD - Signal Failure Detection value.
+ - description: PTST - Partial to Slumber timer value.
+
+ ceva,p1-retry-params:
+ $ref: /schemas/types.yaml#/definitions/uint16-array
+ description: |
+ Retry interval timing value for port 1.
+ The fields for the above parameter must be as shown below:-
+ ceva,pN-retry-params = /bits/ 16 <RIT RCT>;
+ items:
+ - description: RIT - Retry Interval Timer.
+ - description: RCT - Rate Change Timer.
+
+ ceva,broken-gen2:
+ $ref: /schemas/types.yaml#/definitions/flag
+ description: |
+ Limit to Gen 1 speed instead of Gen 2.
+
+ phys:
+ maxItems: 1
+
+ phy-names:
+ items:
+ - const: sata-phy
+
+ resets:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - interrupts
+ - ceva,p0-cominit-params
+ - ceva,p0-comwake-params
+ - ceva,p0-burst-params
+ - ceva,p0-retry-params
+ - ceva,p1-cominit-params
+ - ceva,p1-comwake-params
+ - ceva,p1-burst-params
+ - ceva,p1-retry-params
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/xlnx-zynqmp-clk.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/power/xlnx-zynqmp-power.h>
+ #include <dt-bindings/reset/xlnx-zynqmp-resets.h>
+ #include <dt-bindings/phy/phy.h>
+
+ sata: ahci@fd0c0000 {
+ compatible = "ceva,ahci-1v84";
+ reg = <0xfd0c0000 0x200>;
+ interrupt-parent = <&gic>;
+ interrupts = <0 133 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&zynqmp_clk SATA_REF>;
+ ceva,p0-cominit-params = /bits/ 8 <0x0F 0x25 0x18 0x29>;
+ ceva,p0-comwake-params = /bits/ 8 <0x04 0x0B 0x08 0x0F>;
+ ceva,p0-burst-params = /bits/ 8 <0x0A 0x08 0x4A 0x06>;
+ ceva,p0-retry-params = /bits/ 16 <0x0216 0x7F06>;
+ ceva,p1-cominit-params = /bits/ 8 <0x0F 0x25 0x18 0x29>;
+ ceva,p1-comwake-params = /bits/ 8 <0x04 0x0B 0x08 0x0F>;
+ ceva,p1-burst-params = /bits/ 8 <0x0A 0x08 0x4A 0x06>;
+ ceva,p1-retry-params = /bits/ 16 <0x0216 0x7F06>;
+ ceva,broken-gen2;
+ phys = <&psgtr 1 PHY_TYPE_SATA 1 1>;
+ resets = <&zynqmp_reset ZYNQMP_RESET_SATA>;
+ };
diff --git a/Documentation/devicetree/bindings/ata/cortina,gemini-sata-bridge.txt b/Documentation/devicetree/bindings/ata/cortina,gemini-sata-bridge.txt
deleted file mode 100644
index 1c3d3cc70051..000000000000
--- a/Documentation/devicetree/bindings/ata/cortina,gemini-sata-bridge.txt
+++ /dev/null
@@ -1,55 +0,0 @@
-* Cortina Systems Gemini SATA Bridge
-
-The Gemini SATA bridge in a SoC-internal PATA to SATA bridge that
-takes two Faraday Technology FTIDE010 PATA controllers and bridges
-them in different configurations to two SATA ports.
-
-Required properties:
-- compatible: should be
- "cortina,gemini-sata-bridge"
-- reg: registers and size for the block
-- resets: phandles to the reset lines for both SATA bridges
-- reset-names: must be "sata0", "sata1"
-- clocks: phandles to the compulsory peripheral clocks
-- clock-names: must be "SATA0_PCLK", "SATA1_PCLK"
-- syscon: a phandle to the global Gemini system controller
-- cortina,gemini-ata-muxmode: tell the desired multiplexing mode for
- the ATA controller and SATA bridges. Values 0..3:
- Mode 0: ata0 master <-> sata0
- ata1 master <-> sata1
- ata0 slave interface brought out on IDE pads
- Mode 1: ata0 master <-> sata0
- ata1 master <-> sata1
- ata1 slave interface brought out on IDE pads
- Mode 2: ata1 master <-> sata1
- ata1 slave <-> sata0
- ata0 master and slave interfaces brought out
- on IDE pads
- Mode 3: ata0 master <-> sata0
- ata0 slave <-> sata1
- ata1 master and slave interfaces brought out
- on IDE pads
-
-Optional boolean properties:
-- cortina,gemini-enable-ide-pins: enables the PATA to IDE connection.
- The muxmode setting decides whether ATA0 or ATA1 is brought out,
- and whether master, slave or both interfaces get brought out.
-- cortina,gemini-enable-sata-bridge: enables the PATA to SATA bridge
- inside the Gemnini SoC. The Muxmode decides what PATA blocks will
- be muxed out and how.
-
-Example:
-
-sata: sata@46000000 {
- compatible = "cortina,gemini-sata-bridge";
- reg = <0x46000000 0x100>;
- resets = <&rcon 26>, <&rcon 27>;
- reset-names = "sata0", "sata1";
- clocks = <&gcc GEMINI_CLK_GATE_SATA0>,
- <&gcc GEMINI_CLK_GATE_SATA1>;
- clock-names = "SATA0_PCLK", "SATA1_PCLK";
- syscon = <&syscon>;
- cortina,gemini-ata-muxmode = <3>;
- cortina,gemini-enable-ide-pins;
- cortina,gemini-enable-sata-bridge;
-};
diff --git a/Documentation/devicetree/bindings/ata/cortina,gemini-sata-bridge.yaml b/Documentation/devicetree/bindings/ata/cortina,gemini-sata-bridge.yaml
new file mode 100644
index 000000000000..529093666508
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/cortina,gemini-sata-bridge.yaml
@@ -0,0 +1,107 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/ata/cortina,gemini-sata-bridge.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Cortina Systems Gemini SATA Bridge
+
+maintainers:
+ - Linus Walleij <linus.walleij@linaro.org>
+
+description: |
+ The Gemini SATA bridge is a SoC-internal PATA to SATA bridge that
+ takes two Faraday Technology FTIDE010 PATA controllers and bridges
+ them in different configurations to two SATA ports.
+
+properties:
+ compatible:
+ const: cortina,gemini-sata-bridge
+
+ reg:
+ maxItems: 1
+
+ resets:
+ maxItems: 2
+ description: phandles to the reset lines for both SATA bridges
+
+ reset-names:
+ items:
+ - const: sata0
+ - const: sata1
+
+ clocks:
+ maxItems: 2
+ description: phandles to the compulsory peripheral clocks
+
+ clock-names:
+ items:
+ - const: SATA0_PCLK
+ - const: SATA1_PCLK
+
+ syscon:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: a phandle to the global Gemini system controller
+
+ cortina,gemini-ata-muxmode:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum:
+ - 0
+ - 1
+ - 2
+ - 3
+ description: |
+ Set the desired multiplexing mode for the ATA controller and SATA
+ bridges.
+ Mode 0: ata0 master <-> sata0
+ ata1 master <-> sata1
+ ata0 slave interface brought out on IDE pads
+ Mode 1: ata0 master <-> sata0
+ ata1 master <-> sata1
+ ata1 slave interface brought out on IDE pads
+ Mode 2: ata1 master <-> sata1
+ ata1 slave <-> sata0
+ ata0 master and slave interfaces brought out on IDE pads
+ Mode 3: ata0 master <-> sata0
+ ata0 slave <-> sata1
+ ata1 master and slave interfaces brought out on IDE pads
+
+ cortina,gemini-enable-ide-pins:
+ type: boolean
+ description: Enables the PATA to IDE connection.
+ The muxmode setting decides whether ATA0 or ATA1 is brought out,
+ and whether master, slave or both interfaces get brought out.
+
+ cortina,gemini-enable-sata-bridge:
+ type: boolean
+ description: Enables the PATA to SATA bridge inside the Gemini SoC.
+ The muxmode decides which PATA blocks will be muxed out and how.
+
+required:
+ - clocks
+ - clock-names
+ - cortina,gemini-ata-muxmode
+ - resets
+ - reset-names
+ - compatible
+ - reg
+ - syscon
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/cortina,gemini-clock.h>
+ sata@46000000 {
+ compatible = "cortina,gemini-sata-bridge";
+ reg = <0x46000000 0x100>;
+ resets = <&rcon 26>, <&rcon 27>;
+ reset-names = "sata0", "sata1";
+ clocks = <&gcc GEMINI_CLK_GATE_SATA0>,
+ <&gcc GEMINI_CLK_GATE_SATA1>;
+ clock-names = "SATA0_PCLK", "SATA1_PCLK";
+ syscon = <&syscon>;
+ cortina,gemini-ata-muxmode = <3>;
+ cortina,gemini-enable-ide-pins;
+ cortina,gemini-enable-sata-bridge;
+ };
diff --git a/Documentation/devicetree/bindings/ata/intel,ixp4xx-compact-flash.yaml b/Documentation/devicetree/bindings/ata/intel,ixp4xx-compact-flash.yaml
index 52e18600ecff..378692010c56 100644
--- a/Documentation/devicetree/bindings/ata/intel,ixp4xx-compact-flash.yaml
+++ b/Documentation/devicetree/bindings/ata/intel,ixp4xx-compact-flash.yaml
@@ -35,6 +35,7 @@ required:
allOf:
- $ref: pata-common.yaml#
+ - $ref: /schemas/memory-controllers/intel,ixp4xx-expansion-peripheral-props.yaml#
unevaluatedProperties: false
diff --git a/Documentation/devicetree/bindings/ata/renesas,rcar-sata.yaml b/Documentation/devicetree/bindings/ata/renesas,rcar-sata.yaml
index c060c7914cae..fe0909554790 100644
--- a/Documentation/devicetree/bindings/ata/renesas,rcar-sata.yaml
+++ b/Documentation/devicetree/bindings/ata/renesas,rcar-sata.yaml
@@ -1,8 +1,8 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/ata/renesas,rcar-sata.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/ata/renesas,rcar-sata.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Renesas R-Car Serial-ATA Interface
@@ -26,6 +26,7 @@ properties:
- items:
- enum:
- renesas,sata-r8a774b1 # RZ/G2N
+ - renesas,sata-r8a774e1 # RZ/G2H
- renesas,sata-r8a7795 # R-Car H3
- renesas,sata-r8a77965 # R-Car M3-N
- const: renesas,rcar-gen3-sata # generic R-Car Gen3 or RZ/G2
diff --git a/Documentation/devicetree/bindings/ata/sata-common.yaml b/Documentation/devicetree/bindings/ata/sata-common.yaml
index 7ac77b1c5850..58c9342b9925 100644
--- a/Documentation/devicetree/bindings/ata/sata-common.yaml
+++ b/Documentation/devicetree/bindings/ata/sata-common.yaml
@@ -31,22 +31,27 @@ properties:
"#size-cells":
const: 0
+ dma-coherent: true
+
patternProperties:
"^sata-port@[0-9a-e]$":
+ $ref: '#/$defs/sata-port'
description: |
DT nodes for ports connected on the SATA host. The SATA port
nodes will be named "sata-port".
+
+additionalProperties: true
+
+$defs:
+ sata-port:
type: object
properties:
reg:
minimum: 0
- maximum: 14
description:
- The ID number of the drive port SATA can potentially use a port
- multiplier making it possible to connect up to 15 disks to a single
- SATA port.
-
-additionalProperties: true
+ The ID number of the SATA port. Aside from being used directly,
+ each port can have a Port Multiplier attached, thus allowing
+ access to more than one drive by means of a single SATA port.
...
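
For illustration, a minimal sketch of how a host controller binding that
references sata-common.yaml can describe its ports; the compatible string,
unit addresses and sizes below are assumptions, only the sata-port children
and their reg values follow the schema:

	sata@40000000 {
		compatible = "vendor,example-ahci"; /* hypothetical compatible */
		reg = <0x40000000 0x1000>;
		#address-cells = <1>;
		#size-cells = <0>;

		sata-port@0 {
			reg = <0>; /* drive attached directly */
		};

		sata-port@1 {
			reg = <1>; /* a Port Multiplier may sit behind this port */
		};
	};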
diff --git a/Documentation/devicetree/bindings/ata/sata_highbank.yaml b/Documentation/devicetree/bindings/ata/sata_highbank.yaml
index ce75d77e9289..f23f26a8f21c 100644
--- a/Documentation/devicetree/bindings/ata/sata_highbank.yaml
+++ b/Documentation/devicetree/bindings/ata/sata_highbank.yaml
@@ -51,6 +51,8 @@ properties:
$ref: /schemas/types.yaml#/definitions/phandle-array
minItems: 1
maxItems: 8
+ items:
+ maxItems: 2
calxeda,tx-atten:
description: |
diff --git a/Documentation/devicetree/bindings/ata/snps,dwc-ahci-common.yaml b/Documentation/devicetree/bindings/ata/snps,dwc-ahci-common.yaml
new file mode 100644
index 000000000000..c1457910520b
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/snps,dwc-ahci-common.yaml
@@ -0,0 +1,102 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/ata/snps,dwc-ahci-common.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Synopsys DWC AHCI SATA controller properties
+
+maintainers:
+ - Serge Semin <fancer.lancer@gmail.com>
+
+description:
+ This document defines device tree schema for the generic Synopsys DWC
+ AHCI controller properties.
+
+select: false
+
+allOf:
+ - $ref: ahci-common.yaml#
+
+properties:
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ description:
+ Basic DWC AHCI SATA clock sources like application AXI/AHB BIU clock,
+ PM-alive clock, RxOOB detection clock, embedded PHYs reference (Rx/Tx)
+ clock, etc.
+ minItems: 1
+ maxItems: 4
+
+ clock-names:
+ minItems: 1
+ maxItems: 4
+ items:
+ oneOf:
+ - description: Application APB/AHB/AXI BIU clock
+ enum:
+ - pclk
+ - aclk
+ - hclk
+ - sata
+ - description: Power Module keep-alive clock
+ const: pmalive
+ - description: RxOOB detection clock
+ const: rxoob
+ - description: SATA Ports reference clock
+ const: ref
+
+ resets:
+ description:
+ At least the basic application and reference clock domain resets are
+ normally supported by the DWC AHCI SATA controller.
+ minItems: 1
+ maxItems: 4
+
+ reset-names:
+ minItems: 1
+ maxItems: 4
+ items:
+ oneOf:
+ - description: Application AHB/AXI BIU clock domain reset control
+ enum:
+ - arst
+ - hrst
+ - description: Power Module keep-alive clock domain reset control
+ const: pmalive
+ - description: RxOOB detection clock domain reset control
+ const: rxoob
+ - description: Reference clock domain reset control
+ const: ref
+
+patternProperties:
+ "^sata-port@[0-9a-e]$":
+ $ref: '#/$defs/dwc-ahci-port'
+
+additionalProperties: true
+
+$defs:
+ dwc-ahci-port:
+ $ref: /schemas/ata/ahci-common.yaml#/$defs/ahci-port
+
+ properties:
+ reg:
+ minimum: 0
+ maximum: 7
+
+ snps,tx-ts-max:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Maximum size of Tx DMA transactions in FIFO words
+ enum: [ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 ]
+
+ snps,rx-ts-max:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Maximum size of Rx DMA transactions in FIFO words
+ enum: [ 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 ]
+
+...
diff --git a/Documentation/devicetree/bindings/ata/snps,dwc-ahci.yaml b/Documentation/devicetree/bindings/ata/snps,dwc-ahci.yaml
new file mode 100644
index 000000000000..5afa4b57ce20
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/snps,dwc-ahci.yaml
@@ -0,0 +1,75 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/ata/snps,dwc-ahci.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Synopsys DWC AHCI SATA controller
+
+maintainers:
+ - Serge Semin <fancer.lancer@gmail.com>
+
+description:
+ This document defines device tree bindings for the generic Synopsys DWC
+ implementation of the AHCI SATA controller.
+
+allOf:
+ - $ref: snps,dwc-ahci-common.yaml#
+
+properties:
+ compatible:
+ oneOf:
+ - description: Synopsys AHCI SATA-compatible devices
+ const: snps,dwc-ahci
+ - description: SPEAr1340 AHCI SATA device
+ const: snps,spear-ahci
+ - description: Rockchip RK3568 AHCI controller
+ items:
+ - const: rockchip,rk3568-dwc-ahci
+ - const: snps,dwc-ahci
+
+patternProperties:
+ "^sata-port@[0-9a-e]$":
+ $ref: /schemas/ata/snps,dwc-ahci-common.yaml#/$defs/dwc-ahci-port
+
+ unevaluatedProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/ata/ahci.h>
+
+ sata@122f0000 {
+ compatible = "snps,dwc-ahci";
+ reg = <0x122F0000 0x1ff>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ interrupts = <GIC_SPI 115 IRQ_TYPE_LEVEL_HIGH>;
+
+ clocks = <&clock1>, <&clock2>;
+ clock-names = "aclk", "ref";
+
+ phys = <&sata_phy>;
+ phy-names = "sata-phy";
+
+ ports-implemented = <0x1>;
+
+ sata-port@0 {
+ reg = <0>;
+
+ hba-port-cap = <HBA_PORT_FBSCP>;
+
+ snps,tx-ts-max = <512>;
+ snps,rx-ts-max = <512>;
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/auxdisplay/holtek,ht16k33.yaml b/Documentation/devicetree/bindings/auxdisplay/holtek,ht16k33.yaml
index 64ffff460026..49304a1476ab 100644
--- a/Documentation/devicetree/bindings/auxdisplay/holtek,ht16k33.yaml
+++ b/Documentation/devicetree/bindings/auxdisplay/holtek,ht16k33.yaml
@@ -10,18 +10,25 @@ maintainers:
- Robin van der Gracht <robin@protonic.nl>
allOf:
- - $ref: "/schemas/input/matrix-keymap.yaml#"
+ - $ref: /schemas/input/matrix-keymap.yaml#
properties:
compatible:
- const: holtek,ht16k33
+ oneOf:
+ - items:
+ - enum:
+ - adafruit,3108 # 0.56" 4-Digit 7-Segment FeatherWing Display (Red)
+ - adafruit,3130 # 0.54" Quad Alphanumeric FeatherWing Display (Red)
+ - const: holtek,ht16k33
+
+ - const: holtek,ht16k33 # Generic 16*8 LED controller with dot-matrix display
reg:
maxItems: 1
refresh-rate-hz:
maxItems: 1
- description: Display update interval in Hertz
+ description: Display update interval in Hertz for dot-matrix displays
interrupts:
maxItems: 1
@@ -41,10 +48,22 @@ properties:
default: 16
description: Initial brightness level
+ led:
+ type: object
+ $ref: /schemas/leds/common.yaml#
+ unevaluatedProperties: false
+
required:
- compatible
- reg
- - refresh-rate-hz
+
+if:
+ properties:
+ compatible:
+ const: holtek,ht16k33
+then:
+ required:
+ - refresh-rate-hz
additionalProperties: false
@@ -52,7 +71,8 @@ examples:
- |
#include <dt-bindings/interrupt-controller/irq.h>
#include <dt-bindings/input/input.h>
- i2c1 {
+ #include <dt-bindings/leds/common.h>
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
@@ -73,5 +93,11 @@ examples:
<MATRIX_KEY(4, 1, KEY_F9)>,
<MATRIX_KEY(5, 1, KEY_F3)>,
<MATRIX_KEY(6, 1, KEY_F1)>;
+
+ led {
+ color = <LED_COLOR_ID_RED>;
+ function = LED_FUNCTION_BACKLIGHT;
+ linux,default-trigger = "backlight";
+ };
};
};
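
For contrast with the dot-matrix example above, a minimal sketch of one of
the newly added seven-segment variants, for which refresh-rate-hz is no
longer required; the I2C address is an assumption:

	display@70 {
		compatible = "adafruit,3108", "holtek,ht16k33";
		reg = <0x70>;
		/* refresh-rate-hz is only required for the generic
		 * "holtek,ht16k33" dot-matrix compatible */
	};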
diff --git a/Documentation/devicetree/bindings/bus/allwinner,sun50i-a64-de2.yaml b/Documentation/devicetree/bindings/bus/allwinner,sun50i-a64-de2.yaml
index 863a287ebc7e..9845a187bdf6 100644
--- a/Documentation/devicetree/bindings/bus/allwinner,sun50i-a64-de2.yaml
+++ b/Documentation/devicetree/bindings/bus/allwinner,sun50i-a64-de2.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/bus/allwinner,sun50i-a64-de2.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A64 Display Engine Bus Device Tree Bindings
+title: Allwinner A64 Display Engine Bus
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -35,7 +35,10 @@ properties:
The SRAM that needs to be claimed to access the display engine
bus.
$ref: /schemas/types.yaml#/definitions/phandle-array
- maxItems: 1
+ items:
+ - items:
+ - description: phandle to SRAM
+ - description: register value for device
ranges: true
@@ -43,6 +46,7 @@ patternProperties:
# All other properties should be child nodes with unit-address and 'reg'
"^[a-zA-Z][a-zA-Z0-9,+\\-._]{0,63}@[0-9a-fA-F]+$":
type: object
+ additionalProperties: true
properties:
reg:
maxItems: 1
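
With the phandle-array form introduced above, each allwinner,sram entry now
pairs the SRAM phandle with a register value for the device, e.g. (a sketch;
the phandle name and value are assumptions):

	/* inside the display engine bus node */
	allwinner,sram = <&de2_sram 1>;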
diff --git a/Documentation/devicetree/bindings/bus/allwinner,sun8i-a23-rsb.yaml b/Documentation/devicetree/bindings/bus/allwinner,sun8i-a23-rsb.yaml
index 3d719f468a5b..24c939f59091 100644
--- a/Documentation/devicetree/bindings/bus/allwinner,sun8i-a23-rsb.yaml
+++ b/Documentation/devicetree/bindings/bus/allwinner,sun8i-a23-rsb.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/bus/allwinner,sun8i-a23-rsb.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A23 RSB Device Tree Bindings
+title: Allwinner A23 RSB
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -45,6 +45,7 @@ properties:
patternProperties:
"^.*@[0-9a-fA-F]+$":
type: object
+ additionalProperties: true
properties:
reg:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/bus/aspeed,ast2600-ahbc.yaml b/Documentation/devicetree/bindings/bus/aspeed,ast2600-ahbc.yaml
new file mode 100644
index 000000000000..2894256c976d
--- /dev/null
+++ b/Documentation/devicetree/bindings/bus/aspeed,ast2600-ahbc.yaml
@@ -0,0 +1,37 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/bus/aspeed,ast2600-ahbc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ASPEED Advanced High-Performance Bus Controller (AHBC)
+
+maintainers:
+ - Neal Liu <neal_liu@aspeedtech.com>
+ - Chia-Wei Wang <chiawei_wang@aspeedtech.com>
+
+description: |
+ The Advanced High-performance Bus Controller (AHBC) provides several
+ mechanisms, including a priority arbiter, an address decoder and a data
+ multiplexer, to control the overall operation of the Advanced
+ High-performance Bus (AHB).
+
+properties:
+ compatible:
+ enum:
+ - aspeed,ast2600-ahbc
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ ahbc@1e600000 {
+ compatible = "aspeed,ast2600-ahbc";
+ reg = <0x1e600000 0x100>;
+ };
diff --git a/Documentation/devicetree/bindings/bus/brcm,gisb-arb.txt b/Documentation/devicetree/bindings/bus/brcm,gisb-arb.txt
deleted file mode 100644
index 10f6d0a8159d..000000000000
--- a/Documentation/devicetree/bindings/bus/brcm,gisb-arb.txt
+++ /dev/null
@@ -1,34 +0,0 @@
-Broadcom GISB bus Arbiter controller
-
-Required properties:
-
-- compatible:
- "brcm,bcm7278-gisb-arb" for V7 28nm chips
- "brcm,gisb-arb" or "brcm,bcm7445-gisb-arb" for other 28nm chips
- "brcm,bcm7435-gisb-arb" for newer 40nm chips
- "brcm,bcm7400-gisb-arb" for older 40nm chips and all 65nm chips
- "brcm,bcm7038-gisb-arb" for 130nm chips
-- reg: specifies the base physical address and size of the registers
-- interrupts: specifies the two interrupts (timeout and TEA) to be used from
- the parent interrupt controller. A third optional interrupt may be specified
- for breakpoints.
-
-Optional properties:
-
-- brcm,gisb-arb-master-mask: 32-bits wide bitmask used to specify which GISB
- masters are valid at the system level
-- brcm,gisb-arb-master-names: string list of the litteral name of the GISB
- masters. Should match the number of bits set in brcm,gisb-master-mask and
- the order in which they appear
-
-Example:
-
-gisb-arb@f0400000 {
- compatible = "brcm,gisb-arb";
- reg = <0xf0400000 0x800>;
- interrupts = <0>, <2>;
- interrupt-parent = <&sun_l2_intc>;
-
- brcm,gisb-arb-master-mask = <0x7>;
- brcm,gisb-arb-master-names = "bsp_0", "scpu_0", "cpu_0";
-};
diff --git a/Documentation/devicetree/bindings/bus/brcm,gisb-arb.yaml b/Documentation/devicetree/bindings/bus/brcm,gisb-arb.yaml
new file mode 100644
index 000000000000..b23c3001991e
--- /dev/null
+++ b/Documentation/devicetree/bindings/bus/brcm,gisb-arb.yaml
@@ -0,0 +1,66 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/bus/brcm,gisb-arb.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Broadcom GISB bus Arbiter controller
+
+maintainers:
+ - Florian Fainelli <f.fainelli@gmail.com>
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - brcm,bcm7445-gisb-arb # for other 28nm chips
+ - const: brcm,gisb-arb
+ - items:
+ - enum:
+ - brcm,bcm7278-gisb-arb # for V7 28nm chips
+ - brcm,bcm7435-gisb-arb # for newer 40nm chips
+ - brcm,bcm7400-gisb-arb # for older 40nm chips and all 65nm chips
+ - brcm,bcm7038-gisb-arb # for 130nm chips
+ - brcm,gisb-arb # fallback compatible
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ minItems: 2
+ items:
+ - description: timeout interrupt line
+ - description: target abort interrupt line
+ - description: breakpoint interrupt line
+
+ brcm,gisb-arb-master-mask:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: >
+ 32-bit wide bitmask used to specify which GISB masters are valid at the
+ system level
+
+ brcm,gisb-arb-master-names:
+ $ref: /schemas/types.yaml#/definitions/string-array
+ description: >
+ String list of the literal names of the GISB masters. Should match the
+ number of bits set in brcm,gisb-arb-master-mask and the order in which
+ they appear, from MSB to LSB.
+
+required:
+ - compatible
+ - reg
+ - interrupts
+
+additionalProperties: false
+
+examples:
+ - |
+ gisb-arb@f0400000 {
+ compatible = "brcm,gisb-arb";
+ reg = <0xf0400000 0x800>;
+ interrupts = <0>, <2>;
+ interrupt-parent = <&sun_l2_intc>;
+ brcm,gisb-arb-master-mask = <0x7>;
+ brcm,gisb-arb-master-names = "bsp_0", "scpu_0", "cpu_0";
+ };
diff --git a/Documentation/devicetree/bindings/bus/fsl,imx8qxp-pixel-link-msi-bus.yaml b/Documentation/devicetree/bindings/bus/fsl,imx8qxp-pixel-link-msi-bus.yaml
new file mode 100644
index 000000000000..b568d0ce438d
--- /dev/null
+++ b/Documentation/devicetree/bindings/bus/fsl,imx8qxp-pixel-link-msi-bus.yaml
@@ -0,0 +1,232 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/bus/fsl,imx8qxp-pixel-link-msi-bus.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale i.MX8qxp Pixel Link Medium Speed Interconnect (MSI) Bus
+
+maintainers:
+ - Liu Ying <victor.liu@nxp.com>
+
+description: |
+ i.MX8qxp pixel link MSI bus is used to control settings of PHYs, I/Os
+ sitting together with the PHYs. It is not the same as the MSI bus coming
+ from i.MX8 System Controller Unit (SCU) which is used to control power,
+ clock and reset through the i.MX8 Distributed Slave System Controller (DSC).
+
+ i.MX8qxp pixel link MSI bus is a simple memory-mapped bus. Two input clocks,
+ that is, MSI clock and AHB clock, need to be enabled so that peripherals
+ connected to the bus can be accessed. Also, the bus is part of a power
+ domain. The power domain needs to be enabled before the peripherals can
+ be accessed.
+
+ Peripherals in i.MX8qm/qxp imaging, LVDS, MIPI DSI and HDMI TX subsystems,
+ like I2C controller, PWM controller, MIPI DSI controller and Control and
+ Status Registers (CSR) module, are accessed through the bus.
+
+ The i.MX System Controller Firmware (SCFW) owns and uses the i.MX8qm/qxp
+ pixel link MSI bus controller and does not allow SCFW user to control it.
+ So, the controller's registers cannot be accessed by SCFW user. Hence,
+ the interrupts generated by the controller don't make any sense from SCFW
+ user's point of view.
+
+allOf:
+ - $ref: simple-pm-bus.yaml#
+
+# We need a select here so we don't match all nodes with 'simple-pm-bus'.
+select:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - fsl,imx8qxp-display-pixel-link-msi-bus
+ - fsl,imx8qm-display-pixel-link-msi-bus
+ required:
+ - compatible
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - fsl,imx8qxp-display-pixel-link-msi-bus
+ - fsl,imx8qm-display-pixel-link-msi-bus
+ - const: simple-pm-bus
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: master gated clock from system
+ - description: AHB clock
+
+ clock-names:
+ items:
+ - const: msi
+ - const: ahb
+
+patternProperties:
+ "^.*@[0-9a-f]+$":
+ description: Devices attached to the bus
+ type: object
+ properties:
+ reg:
+ maxItems: 1
+
+ required:
+ - reg
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - power-domains
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/imx8-lpcg.h>
+ #include <dt-bindings/firmware/imx/rsrc.h>
+ bus@56200000 {
+ compatible = "fsl,imx8qxp-display-pixel-link-msi-bus", "simple-pm-bus";
+ reg = <0x56200000 0x20000>;
+ #address-cells = <1>;
+ #size-cells = <1>;
+ interrupt-parent = <&dc0_irqsteer>;
+ interrupts = <320>;
+ ranges;
+ clocks = <&dc0_disp_ctrl_link_mst0_lpcg IMX_LPCG_CLK_4>,
+ <&dc0_disp_ctrl_link_mst0_lpcg IMX_LPCG_CLK_4>;
+ clock-names = "msi", "ahb";
+ power-domains = <&pd IMX_SC_R_DC_0>;
+
+ syscon@56221000 {
+ compatible = "fsl,imx8qxp-mipi-lvds-csr", "syscon", "simple-mfd";
+ reg = <0x56221000 0x1000>;
+ clocks = <&mipi_lvds_0_di_mipi_lvds_regs_lpcg IMX_LPCG_CLK_4>;
+ clock-names = "ipg";
+
+ pxl2dpi {
+ compatible = "fsl,imx8qxp-pxl2dpi";
+ fsl,sc-resource = <IMX_SC_R_MIPI_0>;
+ power-domains = <&pd IMX_SC_R_MIPI_0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <0>;
+
+ mipi_lvds_0_pxl2dpi_dc0_pixel_link0: endpoint@0 {
+ reg = <0>;
+ remote-endpoint = <&dc0_pixel_link0_mipi_lvds_0_pxl2dpi>;
+ };
+
+ mipi_lvds_0_pxl2dpi_dc0_pixel_link1: endpoint@1 {
+ reg = <1>;
+ remote-endpoint = <&dc0_pixel_link1_mipi_lvds_0_pxl2dpi>;
+ };
+ };
+
+ port@1 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <1>;
+
+ mipi_lvds_0_pxl2dpi_mipi_lvds_0_ldb_ch0: endpoint@0 {
+ reg = <0>;
+ remote-endpoint = <&mipi_lvds_0_ldb_ch0_mipi_lvds_0_pxl2dpi>;
+ };
+
+ mipi_lvds_0_pxl2dpi_mipi_lvds_0_ldb_ch1: endpoint@1 {
+ reg = <1>;
+ remote-endpoint = <&mipi_lvds_0_ldb_ch1_mipi_lvds_0_pxl2dpi>;
+ };
+ };
+ };
+ };
+
+ ldb {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ compatible = "fsl,imx8qxp-ldb";
+ clocks = <&clk IMX_SC_R_LVDS_0 IMX_SC_PM_CLK_MISC2>,
+ <&clk IMX_SC_R_LVDS_0 IMX_SC_PM_CLK_BYPASS>;
+ clock-names = "pixel", "bypass";
+ power-domains = <&pd IMX_SC_R_LVDS_0>;
+
+ channel@0 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <0>;
+ phys = <&mipi_lvds_0_phy>;
+ phy-names = "lvds_phy";
+
+ port@0 {
+ reg = <0>;
+
+ mipi_lvds_0_ldb_ch0_mipi_lvds_0_pxl2dpi: endpoint {
+ remote-endpoint = <&mipi_lvds_0_pxl2dpi_mipi_lvds_0_ldb_ch0>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+
+ /* ... */
+ };
+ };
+
+ channel@1 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <1>;
+ phys = <&mipi_lvds_0_phy>;
+ phy-names = "lvds_phy";
+
+ port@0 {
+ reg = <0>;
+
+ mipi_lvds_0_ldb_ch1_mipi_lvds_0_pxl2dpi: endpoint {
+ remote-endpoint = <&mipi_lvds_0_pxl2dpi_mipi_lvds_0_ldb_ch1>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+
+ /* ... */
+ };
+ };
+ };
+ };
+
+ clock-controller@56223004 {
+ compatible = "fsl,imx8qxp-lpcg";
+ reg = <0x56223004 0x4>;
+ #clock-cells = <1>;
+ clocks = <&mipi_lvds_0_ipg_clk>;
+ clock-indices = <IMX_LPCG_CLK_4>;
+ clock-output-names = "mipi_lvds_0_di_mipi_lvds_regs_lpcg_ipg_clk";
+ power-domains = <&pd IMX_SC_R_MIPI_0>;
+ };
+
+ phy@56228300 {
+ compatible = "fsl,imx8qxp-mipi-dphy";
+ reg = <0x56228300 0x100>;
+ clocks = <&clk IMX_SC_R_LVDS_0 IMX_SC_PM_CLK_PHY>;
+ clock-names = "phy_ref";
+ #phy-cells = <0>;
+ fsl,syscon = <&mipi_lvds_0_csr>;
+ power-domains = <&pd IMX_SC_R_MIPI_0>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/bus/fsl,spba-bus.yaml b/Documentation/devicetree/bindings/bus/fsl,spba-bus.yaml
new file mode 100644
index 000000000000..d42dbb0bbc2e
--- /dev/null
+++ b/Documentation/devicetree/bindings/bus/fsl,spba-bus.yaml
@@ -0,0 +1,68 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/bus/fsl,spba-bus.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Shared Peripherals Bus Interface
+
+maintainers:
+ - Shawn Guo <shawnguo@kernel.org>
+
+description: |
+ A simple bus enabling access to shared peripherals.
+
+ The "spba-bus" follows the "simple-bus" set of properties, as
+ specified in the Devicetree Specification. It is an extension of
+ "simple-bus" because the SDMA controller uses this compatible flag to
+ determine which peripherals are available to it and the address range
+ over which the SDMA has access. There are no special clocks for the
+ bus, because the SDMA controller itself has its own interrupt and
+ clock assignments.
+
+select:
+ properties:
+ compatible:
+ contains:
+ const: fsl,spba-bus
+ required:
+ - compatible
+
+properties:
+ $nodename:
+ pattern: "^spba-bus(@[0-9a-f]+)?$"
+
+ compatible:
+ items:
+ - const: fsl,spba-bus
+ - const: simple-bus
+
+ '#address-cells':
+ enum: [ 1, 2 ]
+
+ '#size-cells':
+ enum: [ 1, 2 ]
+
+ reg:
+ maxItems: 1
+
+ ranges: true
+
+required:
+ - compatible
+ - '#address-cells'
+ - '#size-cells'
+ - reg
+ - ranges
+
+additionalProperties:
+ type: object
+
+examples:
+ - |
+ spba-bus@30000000 {
+ compatible = "fsl,spba-bus", "simple-bus";
+ #address-cells = <1>;
+ #size-cells = <1>;
+ reg = <0x30000000 0x100000>;
+ ranges;
+ };
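
In practice, the peripherals the SDMA engine may reach sit as children of
the bus node. A minimal sketch with a hypothetical UART child; the child
compatible and addresses are illustrative only:

	spba-bus@30000000 {
		compatible = "fsl,spba-bus", "simple-bus";
		#address-cells = <1>;
		#size-cells = <1>;
		reg = <0x30000000 0x100000>;
		ranges;

		serial@30020000 {
			compatible = "fsl,imx8mq-uart", "fsl,imx6q-uart";
			reg = <0x30020000 0x10000>;
		};
	};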
diff --git a/Documentation/devicetree/bindings/bus/imx-weim.txt b/Documentation/devicetree/bindings/bus/imx-weim.txt
index 1b1d1c5c21ea..e7f502070d77 100644
--- a/Documentation/devicetree/bindings/bus/imx-weim.txt
+++ b/Documentation/devicetree/bindings/bus/imx-weim.txt
@@ -48,6 +48,11 @@ Optional properties:
devices, the presence of this property indicates that
the weim bus should operate in Burst Clock Mode.
+ - fsl,continuous-burst-clk Make Burst Clock to output continuous clock.
+ Without this option Burst Clock will output clock
+ only when necessary. This takes effect only if
+ "fsl,burst-clk-enable" is set.
+
Timing property for child nodes. It is mandatory, not optional.
- fsl,weim-cs-timing: The timing array, contains timing values for the
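
To illustrate the burst clock options, a hedged sketch of an i.MX6Q-style
WEIM node; the register address, clock specifier and ranges are assumptions:

	weim: weim@21b8000 {
		compatible = "fsl,imx6q-weim";
		reg = <0x021b8000 0x4000>;
		clocks = <&clks 196>; /* illustrative clock specifier */
		#address-cells = <2>;
		#size-cells = <1>;
		ranges = <0 0 0x08000000 0x08000000>;
		fsl,burst-clk-enable;
		/* keep the burst clock running continuously instead of
		   gating it when not needed */
		fsl,continuous-burst-clk;
	};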
diff --git a/Documentation/devicetree/bindings/bus/intel,ixp4xx-expansion-bus-controller.yaml b/Documentation/devicetree/bindings/bus/intel,ixp4xx-expansion-bus-controller.yaml
deleted file mode 100644
index 5fb4e7bfa4da..000000000000
--- a/Documentation/devicetree/bindings/bus/intel,ixp4xx-expansion-bus-controller.yaml
+++ /dev/null
@@ -1,168 +0,0 @@
-# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
-%YAML 1.2
----
-$id: http://devicetree.org/schemas/bus/intel,ixp4xx-expansion-bus-controller.yaml#
-$schema: http://devicetree.org/meta-schemas/core.yaml#
-
-title: Intel IXP4xx Expansion Bus Controller
-
-description: |
- The IXP4xx expansion bus controller handles access to devices on the
- memory-mapped expansion bus on the Intel IXP4xx family of system on chips,
- including IXP42x, IXP43x, IXP45x and IXP46x.
-
-maintainers:
- - Linus Walleij <linus.walleij@linaro.org>
-
-properties:
- $nodename:
- pattern: '^bus@[0-9a-f]+$'
-
- compatible:
- items:
- - enum:
- - intel,ixp42x-expansion-bus-controller
- - intel,ixp43x-expansion-bus-controller
- - intel,ixp45x-expansion-bus-controller
- - intel,ixp46x-expansion-bus-controller
- - const: syscon
-
- reg:
- description: Control registers for the expansion bus, these are not
- inside the memory range handled by the expansion bus.
- maxItems: 1
-
- native-endian:
- $ref: /schemas/types.yaml#/definitions/flag
- description: The IXP4xx has a peculiar MMIO access scheme, as it changes
- the access pattern for words (swizzling) on the bus depending on whether
- the SoC is running in big-endian or little-endian mode. Thus the
- registers must always be accessed using native endianness.
-
- "#address-cells":
- description: |
- The first cell is the chip select number.
- The second cell is the address offset within the bank.
- const: 2
-
- "#size-cells":
- const: 1
-
- ranges: true
- dma-ranges: true
-
-patternProperties:
- "^.*@[0-7],[0-9a-f]+$":
- description: Devices attached to chip selects are represented as
- subnodes.
- type: object
-
- properties:
- intel,ixp4xx-eb-t1:
- description: Address timing, extend address phase with n cycles.
- $ref: /schemas/types.yaml#/definitions/uint32
- maximum: 3
-
- intel,ixp4xx-eb-t2:
- description: Setup chip select timing, extend setup phase with n cycles.
- $ref: /schemas/types.yaml#/definitions/uint32
- maximum: 3
-
- intel,ixp4xx-eb-t3:
- description: Strobe timing, extend strobe phase with n cycles.
- $ref: /schemas/types.yaml#/definitions/uint32
- maximum: 15
-
- intel,ixp4xx-eb-t4:
- description: Hold timing, extend hold phase with n cycles.
- $ref: /schemas/types.yaml#/definitions/uint32
- maximum: 3
-
- intel,ixp4xx-eb-t5:
- description: Recovery timing, extend recovery phase with n cycles.
- $ref: /schemas/types.yaml#/definitions/uint32
- maximum: 15
-
- intel,ixp4xx-eb-cycle-type:
- description: The type of cycles to use on the expansion bus for this
- chip select. 0 = Intel cycles, 1 = Motorola cycles, 2 = HPI cycles.
- $ref: /schemas/types.yaml#/definitions/uint32
- enum: [0, 1, 2]
-
- intel,ixp4xx-eb-byte-access-on-halfword:
- description: Allow byte read access on half word devices.
- $ref: /schemas/types.yaml#/definitions/uint32
- enum: [0, 1]
-
- intel,ixp4xx-eb-hpi-hrdy-pol-high:
- description: Set HPI HRDY polarity to active high when using HPI.
- $ref: /schemas/types.yaml#/definitions/uint32
- enum: [0, 1]
-
- intel,ixp4xx-eb-mux-address-and-data:
- description: Multiplex address and data on the data bus.
- $ref: /schemas/types.yaml#/definitions/uint32
- enum: [0, 1]
-
- intel,ixp4xx-eb-ahb-split-transfers:
- description: Enable AHB split transfers.
- $ref: /schemas/types.yaml#/definitions/uint32
- enum: [0, 1]
-
- intel,ixp4xx-eb-write-enable:
- description: Enable write cycles.
- $ref: /schemas/types.yaml#/definitions/uint32
- enum: [0, 1]
-
- intel,ixp4xx-eb-byte-access:
- description: Expansion bus uses only 8 bits. The default is to use
- 16 bits.
- $ref: /schemas/types.yaml#/definitions/uint32
- enum: [0, 1]
-
-required:
- - compatible
- - reg
- - native-endian
- - "#address-cells"
- - "#size-cells"
- - ranges
- - dma-ranges
-
-additionalProperties: false
-
-examples:
- - |
- #include <dt-bindings/interrupt-controller/irq.h>
- bus@50000000 {
- compatible = "intel,ixp42x-expansion-bus-controller", "syscon";
- reg = <0xc4000000 0x28>;
- native-endian;
- #address-cells = <2>;
- #size-cells = <1>;
- ranges = <0 0x0 0x50000000 0x01000000>,
- <1 0x0 0x51000000 0x01000000>;
- dma-ranges = <0 0x0 0x50000000 0x01000000>,
- <1 0x0 0x51000000 0x01000000>;
- flash@0,0 {
- compatible = "intel,ixp4xx-flash", "cfi-flash";
- bank-width = <2>;
- reg = <0 0x00000000 0x1000000>;
- intel,ixp4xx-eb-t3 = <3>;
- intel,ixp4xx-eb-cycle-type = <0>;
- intel,ixp4xx-eb-byte-access-on-halfword = <1>;
- intel,ixp4xx-eb-write-enable = <1>;
- intel,ixp4xx-eb-byte-access = <0>;
- };
- serial@1,0 {
- compatible = "exar,xr16l2551", "ns8250";
- reg = <1 0x00000000 0x10>;
- interrupt-parent = <&gpio0>;
- interrupts = <4 IRQ_TYPE_LEVEL_LOW>;
- clock-frequency = <1843200>;
- intel,ixp4xx-eb-t3 = <3>;
- intel,ixp4xx-eb-cycle-type = <1>;
- intel,ixp4xx-eb-write-enable = <1>;
- intel,ixp4xx-eb-byte-access = <1>;
- };
- };
diff --git a/Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml b/Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml
new file mode 100644
index 000000000000..a8d40c766dcd
--- /dev/null
+++ b/Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml
@@ -0,0 +1,54 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/bus/microsoft,vmbus.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Microsoft Hyper-V VMBus
+
+maintainers:
+ - Saurabh Sengar <ssengar@linux.microsoft.com>
+
+description:
+ VMBus is a software bus that implements the protocols for communication
+ between the root or host OS and guest OSs (virtual machines).
+
+properties:
+ compatible:
+ const: microsoft,vmbus
+
+ ranges: true
+
+ '#address-cells':
+ const: 2
+
+ '#size-cells':
+ const: 1
+
+required:
+ - compatible
+ - ranges
+ - '#address-cells'
+ - '#size-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ soc {
+ #address-cells = <2>;
+ #size-cells = <1>;
+ bus {
+ compatible = "simple-bus";
+ #address-cells = <2>;
+ #size-cells = <1>;
+ ranges;
+
+ vmbus@ff0000000 {
+ compatible = "microsoft,vmbus";
+ #address-cells = <2>;
+ #size-cells = <1>;
+ ranges = <0x0f 0xf0000000 0x0f 0xf0000000 0x10000000>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/bus/nvidia,tegra210-aconnect.yaml b/Documentation/devicetree/bindings/bus/nvidia,tegra210-aconnect.yaml
index 7b1a08c62aef..4157e885c6e7 100644
--- a/Documentation/devicetree/bindings/bus/nvidia,tegra210-aconnect.yaml
+++ b/Documentation/devicetree/bindings/bus/nvidia,tegra210-aconnect.yaml
@@ -21,6 +21,7 @@ properties:
- const: nvidia,tegra210-aconnect
- items:
- enum:
+ - nvidia,tegra234-aconnect
- nvidia,tegra186-aconnect
- nvidia,tegra194-aconnect
- const: nvidia,tegra210-aconnect
@@ -39,10 +40,10 @@ properties:
maxItems: 1
"#address-cells":
- const: 1
+ enum: [ 1, 2 ]
"#size-cells":
- const: 1
+ enum: [ 1, 2 ]
ranges: true
diff --git a/Documentation/devicetree/bindings/bus/palmbus.yaml b/Documentation/devicetree/bindings/bus/palmbus.yaml
new file mode 100644
index 000000000000..c36c1e92a573
--- /dev/null
+++ b/Documentation/devicetree/bindings/bus/palmbus.yaml
@@ -0,0 +1,80 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/bus/palmbus.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Ralink PalmBus
+
+maintainers:
+ - Sergio Paracuellos <sergio.paracuellos@gmail.com>
+
+description: |
+ The Ralink palmbus controller can be found in all Ralink MIPS
+ SoCs. It provides an external bus for connecting multiple
+ external devices to the SoC.
+
+properties:
+ $nodename:
+ pattern: "^palmbus(@[0-9a-f]+)?$"
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 1
+
+ compatible:
+ const: palmbus
+
+ reg:
+ maxItems: 1
+
+ ranges: true
+
+patternProperties:
+ # All other properties should be child nodes with unit-address and 'reg'
+ "@[0-9a-f]+$":
+ type: object
+ additionalProperties: true
+ properties:
+ reg:
+ maxItems: 1
+
+ required:
+ - reg
+
+required:
+ - compatible
+ - reg
+ - "#address-cells"
+ - "#size-cells"
+ - ranges
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/mips-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ palmbus@1e000000 {
+ compatible = "palmbus";
+ reg = <0x1e000000 0x100000>;
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges = <0x0 0x1e000000 0x0fffff>;
+
+ gpio@600 {
+ #gpio-cells = <2>;
+ #interrupt-cells = <2>;
+ compatible = "mediatek,mt7621-gpio";
+ gpio-controller;
+ gpio-ranges = <&pinctrl 0 0 95>;
+ interrupt-controller;
+ reg = <0x600 0x100>;
+ interrupt-parent = <&gic>;
+ interrupts = <GIC_SHARED 12 IRQ_TYPE_LEVEL_HIGH>;
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/bus/qcom,ssc-block-bus.yaml b/Documentation/devicetree/bindings/bus/qcom,ssc-block-bus.yaml
new file mode 100644
index 000000000000..8e9e6ff35d7d
--- /dev/null
+++ b/Documentation/devicetree/bindings/bus/qcom,ssc-block-bus.yaml
@@ -0,0 +1,144 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/bus/qcom,ssc-block-bus.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: The AHB Bus Providing a Global View of the SSC Block on (some) qcom SoCs
+
+maintainers:
+ - Michael Srba <Michael.Srba@seznam.cz>
+
+description: |
+ This binding describes the dependencies (clocks, resets, power domains) which
+ need to be turned on in a sequence before communication over the AHB bus
+ becomes possible.
+
+ Additionally, the reg property is used to pass to the driver the location of
+ two sadly undocumented registers which need to be poked as part of the sequence.
+
+ The SSC (Snapdragon Sensor Core) block contains a gpio controller, i2c/spi/uart
+ controllers, a hexagon core, and a clock controller which provides clocks for
+ the above.
+
+properties:
+ compatible:
+ items:
+ - const: qcom,msm8998-ssc-block-bus
+ - const: qcom,ssc-block-bus
+
+ reg:
+ items:
+ - description: SSCAON_CONFIG0 registers
+ - description: SSCAON_CONFIG1 registers
+
+ reg-names:
+ items:
+ - const: mpm_sscaon_config0
+ - const: mpm_sscaon_config1
+
+ '#address-cells':
+ enum: [ 1, 2 ]
+
+ '#size-cells':
+ enum: [ 1, 2 ]
+
+ ranges: true
+
+ clocks:
+ maxItems: 6
+
+ clock-names:
+ items:
+ - const: xo
+ - const: aggre2
+ - const: gcc_im_sleep
+ - const: aggre2_north
+ - const: ssc_xo
+ - const: ssc_ahbs
+
+ power-domains:
+ items:
+ - description: CX power domain
+ - description: MX power domain
+
+ power-domain-names:
+ items:
+ - const: ssc_cx
+ - const: ssc_mx
+
+ resets:
+ items:
+ - description: Main reset
+ - description:
+ SSC Branch Control Register reset (associated with the ssc_xo and
+ ssc_ahbs clocks)
+
+ reset-names:
+ items:
+ - const: ssc_reset
+ - const: ssc_bcr
+
+ qcom,halt-regs:
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ description: describes how to locate the ssc AXI halt register
+ items:
+ - items:
+ - description: Phandle reference to a syscon representing TCSR
+ - description: offset for the ssc AXI halt register
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - '#address-cells'
+ - '#size-cells'
+ - ranges
+ - clocks
+ - clock-names
+ - power-domains
+ - power-domain-names
+ - resets
+ - reset-names
+ - qcom,halt-regs
+
+additionalProperties:
+ type: object
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-msm8998.h>
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ soc {
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ // devices under this node are physically located in the SSC block, connected to an ssc-internal bus;
+ ssc_ahb_slave: bus@10ac008 {
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges;
+
+ compatible = "qcom,msm8998-ssc-block-bus", "qcom,ssc-block-bus";
+ reg = <0x10ac008 0x4>, <0x10ac010 0x4>;
+ reg-names = "mpm_sscaon_config0", "mpm_sscaon_config1";
+
+ clocks = <&xo>,
+ <&rpmcc RPM_SMD_AGGR2_NOC_CLK>,
+ <&gcc GCC_IM_SLEEP>,
+ <&gcc AGGRE2_SNOC_NORTH_AXI>,
+ <&gcc SSC_XO>,
+ <&gcc SSC_CNOC_AHBS_CLK>;
+ clock-names = "xo", "aggre2", "gcc_im_sleep", "aggre2_north", "ssc_xo", "ssc_ahbs";
+
+ resets = <&gcc GCC_SSC_RESET>, <&gcc GCC_SSC_BCR>;
+ reset-names = "ssc_reset", "ssc_bcr";
+
+ power-domains = <&rpmpd MSM8998_SSCCX>, <&rpmpd MSM8998_SSCMX>;
+ power-domain-names = "ssc_cx", "ssc_mx";
+
+ qcom,halt-regs = <&tcsr_mutex_regs 0x26000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/bus/ti-sysc.txt b/Documentation/devicetree/bindings/bus/ti-sysc.txt
deleted file mode 100644
index c984143d08d2..000000000000
--- a/Documentation/devicetree/bindings/bus/ti-sysc.txt
+++ /dev/null
@@ -1,139 +0,0 @@
-Texas Instruments sysc interconnect target module wrapper binding
-
-Texas Instruments SoCs can have a generic interconnect target module
-hardware for devices connected to various interconnects such as L3
-interconnect (Arteris NoC) and L4 interconnect (Sonics s3220). The sysc
-is mostly used for interaction between module and PRCM. It participates
-in the OCP Disconnect Protocol but other than that is mostly independent
-of the interconnect.
-
-Each interconnect target module can have one or more devices connected to
-it. There is a set of control registers for managing interconnect target
-module clocks, idle modes and interconnect level resets for the module.
-
-These control registers are sprinkled into the unused register address
-space of the first child device IP block managed by the interconnect
-target module and typically are named REVISION, SYSCONFIG and SYSSTATUS.
-
-Required standard properties:
-
-- compatible shall be one of the following generic types:
-
- "ti,sysc"
- "ti,sysc-omap2"
- "ti,sysc-omap4"
- "ti,sysc-omap4-simple"
-
- or one of the following derivative types for hardware
- needing special workarounds:
-
- "ti,sysc-omap2-timer"
- "ti,sysc-omap4-timer"
- "ti,sysc-omap3430-sr"
- "ti,sysc-omap3630-sr"
- "ti,sysc-omap4-sr"
- "ti,sysc-omap3-sham"
- "ti,sysc-omap-aes"
- "ti,sysc-mcasp"
- "ti,sysc-dra7-mcasp"
- "ti,sysc-usb-host-fs"
- "ti,sysc-dra7-mcan"
- "ti,sysc-pruss"
-
-- reg shall have register areas implemented for the interconnect
- target module in question such as revision, sysc and syss
-
-- reg-names shall contain the register names implemented for the
- interconnect target module in question such as
- "rev, "sysc", and "syss"
-
-- ranges shall contain the interconnect target module IO range
- available for one or more child device IP blocks managed
- by the interconnect target module, the ranges may include
- multiple ranges such as device L4 range for control and
- parent L3 range for DMA access
-
-Optional properties:
-
-- ti,sysc-mask shall contain mask of supported register bits for the
- SYSCONFIG register as documented in the Technical Reference
- Manual (TRM) for the interconnect target module
-
-- ti,sysc-midle list of master idle modes supported by the interconnect
- target module as documented in the TRM for SYSCONFIG
- register MIDLEMODE bits
-
-- ti,sysc-sidle list of slave idle modes supported by the interconnect
- target module as documented in the TRM for SYSCONFIG
- register SIDLEMODE bits
-
-- ti,sysc-delay-us delay needed after OCP softreset before accssing
- SYSCONFIG register again
-
-- ti,syss-mask optional mask of reset done status bits as described in the
- TRM for SYSSTATUS registers, typically 1 with some devices
- having separate reset done bits for children like OHCI and
- EHCI
-
-- clocks clock specifier for each name in the clock-names as
- specified in the binding documentation for ti-clkctrl,
- typically available for all interconnect targets on TI SoCs
- based on omap4 except if it's read-only register in hwauto
- mode as for example omap4 L4_CFG_CLKCTRL
-
-- clock-names should contain at least "fck", and optionally also "ick"
- depending on the SoC and the interconnect target module,
- some interconnect target modules also need additional
- optional clocks that can be specified as listed in TRM
- for the related CLKCTRL register bits 8 to 15 such as
- "dbclk" or "clk32k" depending on their role
-
-- ti,hwmods optional TI interconnect module name to use legacy
- hwmod platform data
-
-- ti,no-reset-on-init interconnect target module should not be reset at init
-
-- ti,no-idle-on-init interconnect target module should not be idled at init
-
-- ti,no-idle interconnect target module should not be idled
-
-Example: Single instance of MUSB controller on omap4 using interconnect ranges
-using offsets from l4_cfg second segment (0x4a000000 + 0x80000 = 0x4a0ab000):
-
- target-module@2b000 { /* 0x4a0ab000, ap 84 12.0 */
- compatible = "ti,sysc-omap2";
- ti,hwmods = "usb_otg_hs";
- reg = <0x2b400 0x4>,
- <0x2b404 0x4>,
- <0x2b408 0x4>;
- reg-names = "rev", "sysc", "syss";
- clocks = <&l3_init_clkctrl OMAP4_USB_OTG_HS_CLKCTRL 0>;
- clock-names = "fck";
- ti,sysc-mask = <(SYSC_OMAP2_ENAWAKEUP |
- SYSC_OMAP2_SOFTRESET |
- SYSC_OMAP2_AUTOIDLE)>;
- ti,sysc-midle = <SYSC_IDLE_FORCE>,
- <SYSC_IDLE_NO>,
- <SYSC_IDLE_SMART>;
- ti,sysc-sidle = <SYSC_IDLE_FORCE>,
- <SYSC_IDLE_NO>,
- <SYSC_IDLE_SMART>,
- <SYSC_IDLE_SMART_WKUP>;
- ti,syss-mask = <1>;
- #address-cells = <1>;
- #size-cells = <1>;
- ranges = <0 0x2b000 0x1000>;
-
- usb_otg_hs: otg@0 {
- compatible = "ti,omap4-musb";
- reg = <0x0 0x7ff>;
- interrupts = <GIC_SPI 92 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 93 IRQ_TYPE_LEVEL_HIGH>;
- usb-phy = <&usb2_phy>;
- ...
- };
- };
-
-Note that other SoCs, such as am335x can have multiple child devices. On am335x
-there are two MUSB instances, two USB PHY instances, and a single CPPI41 DMA
-instance as children of a single interconnect target module.
diff --git a/Documentation/devicetree/bindings/bus/ti-sysc.yaml b/Documentation/devicetree/bindings/bus/ti-sysc.yaml
new file mode 100644
index 000000000000..f089634f9466
--- /dev/null
+++ b/Documentation/devicetree/bindings/bus/ti-sysc.yaml
@@ -0,0 +1,215 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/bus/ti-sysc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Texas Instruments interconnect target module
+
+maintainers:
+ - Tony Lindgren <tony@atomide.com>
+
+description:
+ Texas Instruments SoCs can have a generic interconnect target module
+ for devices connected to various interconnects such as L3 interconnect
+ using Arteris NoC, and L4 interconnect using Sonics s3220. This module
+ is mostly used for interaction between the module and the Power, Reset
+ and Clock Manager (PRCM). It participates in the OCP Disconnect
+ Protocol, but other
+ than that it is mostly independent of the interconnect.
+
+ Each interconnect target module can have one or more devices connected to
+ it. There is a set of control registers for managing the interconnect target
+ module clocks, idle modes and interconnect level resets.
+
+ The interconnect target module control registers are sprinkled into the
+ unused register address space of the first child device IP block managed by
+ the interconnect target module. Typically the register names are REVISION,
+ SYSCONFIG and SYSSTATUS.
+
+properties:
+ $nodename:
+ pattern: "^target-module(@[0-9a-f]+)?$"
+
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - ti,sysc-omap2
+ - ti,sysc-omap4
+ - ti,sysc-omap4-simple
+ - ti,sysc-omap2-timer
+ - ti,sysc-omap4-timer
+ - ti,sysc-omap3430-sr
+ - ti,sysc-omap3630-sr
+ - ti,sysc-omap4-sr
+ - ti,sysc-omap3-sham
+ - ti,sysc-omap-aes
+ - ti,sysc-mcasp
+ - ti,sysc-dra7-mcasp
+ - ti,sysc-usb-host-fs
+ - ti,sysc-dra7-mcan
+ - ti,sysc-pruss
+ - const: ti,sysc
+ - items:
+ - const: ti,sysc
+
+ reg:
+ description:
+ Interconnect target module control registers consisting of
+ REVISION, SYSCONFIG and SYSSTATUS registers as defined in the
+ Technical Reference Manual for the SoC.
+ minItems: 1
+ maxItems: 3
+
+ reg-names:
+ description:
+ Interconnect target module control register names consisting
+ of "rev", "sysc" and "syss".
+ oneOf:
+ - minItems: 1
+ items:
+ - const: rev
+ - const: sysc
+ - const: syss
+ - items:
+ - const: rev
+ - const: syss
+ - enum: [ sysc, syss ]
+
+ power-domains:
+ description: Target module power domain if available.
+ maxItems: 1
+
+ clocks:
+ description:
+ Target module clocks consisting of one functional clock, one
+ interface clock, and up to 8 module specific optional clocks.
+ Some modules may have only the functional clock, and some have
+ no configurable clocks.
+ minItems: 1
+ maxItems: 4
+
+ clock-names:
+ description:
+ Target module clock names like "fck", "ick", "optck1", "optck2"
+ if the clocks are configurable.
+ oneOf:
+ - enum: [ ick, fck, sys_clk ]
+ - items:
+ - const: fck
+ - enum: [ ick, dbclk, osc, sys_clk, dss_clk, ahclkx ]
+ - items:
+ - const: fck
+ - const: phy-clk
+ - const: phy-clk-div
+ - items:
+ - const: fck
+ - const: hdmi_clk
+ - const: sys_clk
+ - const: tv_clk
+ - items:
+ - const: fck
+ - const: ahclkx
+ - const: ahclkr
+
+ resets:
+ description:
+ Target module reset bit in the RSTCTRL register if wired for the module.
+ Note that the other reset bits should be mapped for the child device
+ driver to use.
+ maxItems: 1
+
+ reset-names:
+ description:
+ Target module reset names in the RSTCTRL register, typically named
+ "rstctrl" if only one reset bit is wired for the module.
+ items:
+ - const: rstctrl
+
+ '#address-cells':
+ enum: [ 1, 2 ]
+
+ '#size-cells':
+ enum: [ 1, 2 ]
+
+ ranges: true
+
+ dma-ranges: true
+
+ ti,sysc-mask:
+ description: Mask of supported register bits for the SYSCONFIG register
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+ ti,sysc-midle:
+ description: List of hardware supported master idle modes (SYSCONFIG MIDLEMODE bits)
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+
+ ti,sysc-sidle:
+ description: List of hardware supported slave idle modes (SYSCONFIG SIDLEMODE bits)
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+
+ ti,syss-mask:
+ description: Mask of supported register bits for the SYSSTATUS register
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+ ti,sysc-delay-us:
+ description: Delay needed after OCP softreset before accessing SYSCONFIG
+ default: 0
+ minimum: 0
+ maximum: 2
+
+ ti,no-reset-on-init:
+ description: Interconnect target module shall not be reset at init
+ type: boolean
+
+ ti,no-idle-on-init:
+ description: Interconnect target module shall not be idled at init
+ type: boolean
+
+ ti,no-idle:
+ description: Interconnect target module shall not be idled
+ type: boolean
+
+ ti,hwmods:
+ description: Interconnect module name to use with legacy hwmod data
+ $ref: /schemas/types.yaml#/definitions/string
+ deprecated: true
+
+required:
+ - compatible
+ - '#address-cells'
+ - '#size-cells'
+ - ranges
+
+additionalProperties:
+ type: object
+
+examples:
+ - |
+ #include <dt-bindings/bus/ti-sysc.h>
+ #include <dt-bindings/clock/omap4.h>
+
+ target-module@2b000 {
+ compatible = "ti,sysc-omap2", "ti,sysc";
+ ti,hwmods = "usb_otg_hs";
+ reg = <0x2b400 0x4>,
+ <0x2b404 0x4>,
+ <0x2b408 0x4>;
+ reg-names = "rev", "sysc", "syss";
+ clocks = <&l3_init_clkctrl OMAP4_USB_OTG_HS_CLKCTRL 0>;
+ clock-names = "fck";
+ ti,sysc-mask = <(SYSC_OMAP2_ENAWAKEUP |
+ SYSC_OMAP2_SOFTRESET |
+ SYSC_OMAP2_AUTOIDLE)>;
+ ti,sysc-midle = <SYSC_IDLE_FORCE>,
+ <SYSC_IDLE_NO>,
+ <SYSC_IDLE_SMART>;
+ ti,sysc-sidle = <SYSC_IDLE_FORCE>,
+ <SYSC_IDLE_NO>,
+ <SYSC_IDLE_SMART>,
+ <SYSC_IDLE_SMART_WKUP>;
+ ti,syss-mask = <1>;
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges = <0 0x2b000 0x1000>;
+ };
diff --git a/Documentation/devicetree/bindings/bus/xlnx,versal-net-cdx.yaml b/Documentation/devicetree/bindings/bus/xlnx,versal-net-cdx.yaml
new file mode 100644
index 000000000000..7f62ffbdc245
--- /dev/null
+++ b/Documentation/devicetree/bindings/bus/xlnx,versal-net-cdx.yaml
@@ -0,0 +1,82 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/bus/xlnx,versal-net-cdx.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: AMD CDX bus controller
+
+description: |
+ The CDX bus controller for AMD devices dynamically detects the CDX
+ bus and its devices using the firmware.
+ The CDX bus manages multiple FPGA based hardware devices, which
+ can support network, crypto or any other specialized type of
+ devices. These FPGA based devices can be added or modified dynamically
+ at run time.
+
+ All devices on the CDX bus will have a unique streamid (for IOMMU)
+ and a unique device ID (for MSI) corresponding to a requestor ID
+ (one to one associated with the device). The streamid and deviceid
+ are used to configure SMMU and GIC-ITS respectively.
+
+ iommu-map property is used to define the set of stream ids
+ corresponding to each device and the associated IOMMU.
+
+ The MSI writes are accompanied by sideband data (Device ID).
+ The msi-map property is used to associate the devices with the
+ device ID as well as the associated ITS controller.
+
+ The rproc property (xlnx,rproc) is used to identify the remote
+ processor with which the APU (Application Processor Unit) interacts
+ to discover the bus and device configuration.
+
+maintainers:
+ - Nipun Gupta <nipun.gupta@amd.com>
+ - Nikhil Agarwal <nikhil.agarwal@amd.com>
+
+properties:
+ compatible:
+ const: xlnx,versal-net-cdx
+
+ iommu-map: true
+
+ msi-map: true
+
+ xlnx,rproc:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ phandle to the remoteproc_r5 rproc node through which the APU
+ interacts with the remote processor.
+
+ ranges: true
+
+ "#address-cells":
+ enum: [1, 2]
+
+ "#size-cells":
+ enum: [1, 2]
+
+required:
+ - compatible
+ - iommu-map
+ - msi-map
+ - xlnx,rproc
+ - ranges
+ - "#address-cells"
+ - "#size-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ cdx {
+ compatible = "xlnx,versal-net-cdx";
+ #address-cells = <1>;
+ #size-cells = <1>;
+ /* define map for RIDs 250-259 */
+ iommu-map = <250 &smmu 250 10>;
+ /* define msi map for RIDs 250-259 */
+ msi-map = <250 &its 250 10>;
+ xlnx,rproc = <&remoteproc_r5>;
+ ranges;
+ };
diff --git a/Documentation/devicetree/bindings/memory-controllers/baikal,bt1-l2-ctl.yaml b/Documentation/devicetree/bindings/cache/baikal,bt1-l2-ctl.yaml
index 1fca282f64a2..ec4f367bc0b4 100644
--- a/Documentation/devicetree/bindings/memory-controllers/baikal,bt1-l2-ctl.yaml
+++ b/Documentation/devicetree/bindings/cache/baikal,bt1-l2-ctl.yaml
@@ -2,7 +2,7 @@
# Copyright (C) 2020 BAIKAL ELECTRONICS, JSC
%YAML 1.2
---
-$id: http://devicetree.org/schemas/memory-controllers/baikal,bt1-l2-ctl.yaml#
+$id: http://devicetree.org/schemas/cache/baikal,bt1-l2-ctl.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Baikal-T1 L2-cache Control Block
diff --git a/Documentation/devicetree/bindings/cache/freescale-l2cache.txt b/Documentation/devicetree/bindings/cache/freescale-l2cache.txt
new file mode 100644
index 000000000000..22ad012660e9
--- /dev/null
+++ b/Documentation/devicetree/bindings/cache/freescale-l2cache.txt
@@ -0,0 +1,55 @@
+Freescale L2 Cache Controller
+
+L2 cache is present in Freescale's QorIQ and QorIQ Qonverge platforms.
+The cache bindings explained below are Devicetree Specification compliant.
+
+Required Properties:
+
+- compatible : Should include one of the following:
+ "fsl,b4420-l2-cache-controller"
+ "fsl,b4860-l2-cache-controller"
+ "fsl,bsc9131-l2-cache-controller"
+ "fsl,bsc9132-l2-cache-controller"
+ "fsl,c293-l2-cache-controller"
+ "fsl,mpc8536-l2-cache-controller"
+ "fsl,mpc8540-l2-cache-controller"
+ "fsl,mpc8541-l2-cache-controller"
+ "fsl,mpc8544-l2-cache-controller"
+ "fsl,mpc8548-l2-cache-controller"
+ "fsl,mpc8555-l2-cache-controller"
+ "fsl,mpc8560-l2-cache-controller"
+ "fsl,mpc8568-l2-cache-controller"
+ "fsl,mpc8569-l2-cache-controller"
+ "fsl,mpc8572-l2-cache-controller"
+ "fsl,p1010-l2-cache-controller"
+ "fsl,p1011-l2-cache-controller"
+ "fsl,p1012-l2-cache-controller"
+ "fsl,p1013-l2-cache-controller"
+ "fsl,p1014-l2-cache-controller"
+ "fsl,p1015-l2-cache-controller"
+ "fsl,p1016-l2-cache-controller"
+ "fsl,p1020-l2-cache-controller"
+ "fsl,p1021-l2-cache-controller"
+ "fsl,p1022-l2-cache-controller"
+ "fsl,p1023-l2-cache-controller"
+ "fsl,p1024-l2-cache-controller"
+ "fsl,p1025-l2-cache-controller"
+ "fsl,p2010-l2-cache-controller"
+ "fsl,p2020-l2-cache-controller"
+ "fsl,t2080-l2-cache-controller"
+ "fsl,t4240-l2-cache-controller"
+ and "cache".
+- reg : Address and size of L2 cache controller registers
+- cache-size : Size of the entire L2 cache
+- interrupts : Error interrupt of L2 controller
+- cache-line-size : Size of L2 cache lines
+
+Example:
+
+ L2: l2-cache-controller@20000 {
+ compatible = "fsl,bsc9132-l2-cache-controller", "cache";
+ reg = <0x20000 0x1000>;
+ cache-line-size = <32>; // 32 bytes
+ cache-size = <0x40000>; // L2,256K
+ interrupts = <16 2 1 0>;
+ };
diff --git a/Documentation/devicetree/bindings/arm/l2c2x0.yaml b/Documentation/devicetree/bindings/cache/l2c2x0.yaml
index 6b8f4d4fa580..d7840a5c4037 100644
--- a/Documentation/devicetree/bindings/arm/l2c2x0.yaml
+++ b/Documentation/devicetree/bindings/cache/l2c2x0.yaml
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
%YAML 1.2
---
-$id: http://devicetree.org/schemas/arm/l2c2x0.yaml#
+$id: http://devicetree.org/schemas/cache/l2c2x0.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: ARM L2 Cache Controller
diff --git a/Documentation/devicetree/bindings/arm/mrvl/feroceon.txt b/Documentation/devicetree/bindings/cache/marvell,feroceon-cache.txt
index 0d244b999d10..0d244b999d10 100644
--- a/Documentation/devicetree/bindings/arm/mrvl/feroceon.txt
+++ b/Documentation/devicetree/bindings/cache/marvell,feroceon-cache.txt
diff --git a/Documentation/devicetree/bindings/arm/mrvl/tauros2.txt b/Documentation/devicetree/bindings/cache/marvell,tauros2-cache.txt
index 31af1cbb60bd..31af1cbb60bd 100644
--- a/Documentation/devicetree/bindings/arm/mrvl/tauros2.txt
+++ b/Documentation/devicetree/bindings/cache/marvell,tauros2-cache.txt
diff --git a/Documentation/devicetree/bindings/cache/qcom,llcc.yaml b/Documentation/devicetree/bindings/cache/qcom,llcc.yaml
new file mode 100644
index 000000000000..d8b91944180a
--- /dev/null
+++ b/Documentation/devicetree/bindings/cache/qcom,llcc.yaml
@@ -0,0 +1,168 @@
+# SPDX-License-Identifier: (GPL-2.0-or-later OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cache/qcom,llcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Last Level Cache Controller
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+
+description: |
+ LLCC (Last Level Cache Controller) provides the last level of cache memory in the
+ SoC, which can be shared by multiple clients. Clients here are different cores in
+ the SoC; the idea is to minimize the local caches at the clients and migrate to a
+ common pool of memory. Cache memory is divided into partitions called slices,
+ which are assigned to clients. Clients can query the slice details, and activate
+ and deactivate them.
+
+properties:
+ compatible:
+ enum:
+ - qcom,sc7180-llcc
+ - qcom,sc7280-llcc
+ - qcom,sc8180x-llcc
+ - qcom,sc8280xp-llcc
+ - qcom,sdm845-llcc
+ - qcom,sm6350-llcc
+ - qcom,sm7150-llcc
+ - qcom,sm8150-llcc
+ - qcom,sm8250-llcc
+ - qcom,sm8350-llcc
+ - qcom,sm8450-llcc
+ - qcom,sm8550-llcc
+
+ reg:
+ minItems: 2
+ maxItems: 9
+
+ reg-names:
+ minItems: 2
+ maxItems: 9
+
+ interrupts:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - reg-names
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sc7180-llcc
+ - qcom,sm6350-llcc
+ then:
+ properties:
+ reg:
+ items:
+ - description: LLCC0 base register region
+ - description: LLCC broadcast base register region
+ reg-names:
+ items:
+ - const: llcc0_base
+ - const: llcc_broadcast_base
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sc7280-llcc
+ then:
+ properties:
+ reg:
+ items:
+ - description: LLCC0 base register region
+ - description: LLCC1 base register region
+ - description: LLCC broadcast base register region
+ reg-names:
+ items:
+ - const: llcc0_base
+ - const: llcc1_base
+ - const: llcc_broadcast_base
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sc8180x-llcc
+ - qcom,sc8280xp-llcc
+ then:
+ properties:
+ reg:
+ items:
+ - description: LLCC0 base register region
+ - description: LLCC1 base register region
+ - description: LLCC2 base register region
+ - description: LLCC3 base register region
+ - description: LLCC4 base register region
+ - description: LLCC5 base register region
+ - description: LLCC6 base register region
+ - description: LLCC7 base register region
+ - description: LLCC broadcast base register region
+ reg-names:
+ items:
+ - const: llcc0_base
+ - const: llcc1_base
+ - const: llcc2_base
+ - const: llcc3_base
+ - const: llcc4_base
+ - const: llcc5_base
+ - const: llcc6_base
+ - const: llcc7_base
+ - const: llcc_broadcast_base
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sdm845-llcc
+ - qcom,sm8150-llcc
+ - qcom,sm8250-llcc
+ - qcom,sm8350-llcc
+ - qcom,sm8450-llcc
+ then:
+ properties:
+ reg:
+ items:
+ - description: LLCC0 base register region
+ - description: LLCC1 base register region
+ - description: LLCC2 base register region
+ - description: LLCC3 base register region
+ - description: LLCC broadcast base register region
+ reg-names:
+ items:
+ - const: llcc0_base
+ - const: llcc1_base
+ - const: llcc2_base
+ - const: llcc3_base
+ - const: llcc_broadcast_base
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ system-cache-controller@1100000 {
+ compatible = "qcom,sdm845-llcc";
+ reg = <0 0x01100000 0 0x50000>, <0 0x01180000 0 0x50000>,
+ <0 0x01200000 0 0x50000>, <0 0x01280000 0 0x50000>,
+ <0 0x01300000 0 0x50000>;
+ reg-names = "llcc0_base", "llcc1_base", "llcc2_base",
+ "llcc3_base", "llcc_broadcast_base";
+ interrupts = <GIC_SPI 582 IRQ_TYPE_LEVEL_HIGH>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/cache/sifive,ccache0.yaml b/Documentation/devicetree/bindings/cache/sifive,ccache0.yaml
new file mode 100644
index 000000000000..8a6a78e1a7ab
--- /dev/null
+++ b/Documentation/devicetree/bindings/cache/sifive,ccache0.yaml
@@ -0,0 +1,170 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+# Copyright (C) 2020 SiFive, Inc.
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cache/sifive,ccache0.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: SiFive Composable Cache Controller
+
+maintainers:
+ - Paul Walmsley <paul.walmsley@sifive.com>
+
+description:
+ The SiFive Composable Cache Controller is used to provide access to fast copies
+ of memory for masters in a Core Complex. The Composable Cache Controller also
+ acts as a directory-based coherency manager.
+ All the properties in the ePAPR/Devicetree Specification apply to this platform.
+
+select:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - sifive,ccache0
+ - sifive,fu540-c000-ccache
+ - sifive,fu740-c000-ccache
+
+ required:
+ - compatible
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - sifive,ccache0
+ - sifive,fu540-c000-ccache
+ - sifive,fu740-c000-ccache
+ - const: cache
+ - items:
+ - const: starfive,jh7110-ccache
+ - const: sifive,ccache0
+ - const: cache
+ - items:
+ - const: microchip,mpfs-ccache
+ - const: sifive,fu540-c000-ccache
+ - const: cache
+
+ cache-block-size:
+ const: 64
+
+ cache-level:
+ enum: [2, 3]
+
+ cache-sets:
+ enum: [1024, 2048]
+
+ cache-size:
+ const: 2097152
+
+ cache-unified: true
+
+ interrupts:
+ minItems: 3
+ items:
+ - description: DirError interrupt
+ - description: DataError interrupt
+ - description: DataFail interrupt
+ - description: DirFail interrupt
+
+ reg:
+ maxItems: 1
+
+ next-level-cache: true
+
+ memory-region:
+ maxItems: 1
+ description: |
+ A reference to the reserved-memory region used as the L2 Loosely Integrated Memory (LIM).
+ The reserved-memory node should be defined as per the bindings in reserved-memory.txt.
+
+allOf:
+ - $ref: /schemas/cache-controller.yaml#
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - sifive,fu740-c000-ccache
+ - starfive,jh7110-ccache
+ - microchip,mpfs-ccache
+
+ then:
+ properties:
+ interrupts:
+ description: |
+ Must contain entries for DirError, DataError, DataFail, DirFail signals.
+ minItems: 4
+
+ else:
+ properties:
+ interrupts:
+ description: |
+ Must contain entries for DirError, DataError and DataFail signals.
+ maxItems: 3
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - sifive,fu740-c000-ccache
+ - starfive,jh7110-ccache
+
+ then:
+ properties:
+ cache-sets:
+ const: 2048
+
+ else:
+ properties:
+ cache-sets:
+ const: 1024
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: sifive,ccache0
+
+ then:
+ properties:
+ cache-level:
+ enum: [2, 3]
+
+ else:
+ properties:
+ cache-level:
+ const: 2
+
+additionalProperties: false
+
+required:
+ - compatible
+ - cache-block-size
+ - cache-level
+ - cache-sets
+ - cache-size
+ - cache-unified
+ - interrupts
+ - reg
+
+examples:
+ - |
+ cache-controller@2010000 {
+ compatible = "sifive,fu540-c000-ccache", "cache";
+ cache-block-size = <64>;
+ cache-level = <2>;
+ cache-sets = <1024>;
+ cache-size = <2097152>;
+ cache-unified;
+ reg = <0x2010000 0x1000>;
+ interrupt-parent = <&plic0>;
+ interrupts = <1>,
+ <2>,
+ <3>;
+ next-level-cache = <&L25>;
+ memory-region = <&l2_lim>;
+ };
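
For illustration, the &l2_lim phandle in the example above would point at a
reserved-memory child node. A minimal sketch, with an assumed (not board-accurate)
address and size for the L2 LIM region:

    reserved-memory {
        #address-cells = <2>;
        #size-cells = <2>;
        ranges;

        l2_lim: memory@8000000 {
            reg = <0x0 0x8000000 0x0 0x200000>; /* assumed LIM address/size */
            no-map;
        };
    };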
diff --git a/Documentation/devicetree/bindings/arm/socionext/socionext,uniphier-system-cache.yaml b/Documentation/devicetree/bindings/cache/socionext,uniphier-system-cache.yaml
index 7ca5375f278f..3196263685a3 100644
--- a/Documentation/devicetree/bindings/arm/socionext/socionext,uniphier-system-cache.yaml
+++ b/Documentation/devicetree/bindings/cache/socionext,uniphier-system-cache.yaml
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
%YAML 1.2
---
-$id: http://devicetree.org/schemas/arm/socionext/socionext,uniphier-system-cache.yaml#
+$id: http://devicetree.org/schemas/cache/socionext,uniphier-system-cache.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: UniPhier outer cache controller
@@ -22,7 +22,6 @@ properties:
description: |
should contain 3 regions: control register, revision register,
operation register, in this order.
- minItems: 3
maxItems: 3
interrupts:
diff --git a/Documentation/devicetree/bindings/chosen.txt b/Documentation/devicetree/bindings/chosen.txt
deleted file mode 100644
index 1cc3aa10dcb1..000000000000
--- a/Documentation/devicetree/bindings/chosen.txt
+++ /dev/null
@@ -1,137 +0,0 @@
-The chosen node
----------------
-
-The chosen node does not represent a real device, but serves as a place
-for passing data between firmware and the operating system, like boot
-arguments. Data in the chosen node does not represent the hardware.
-
-The following properties are recognized:
-
-
-kaslr-seed
------------
-
-This property is used when booting with CONFIG_RANDOMIZE_BASE as the
-entropy used to randomize the kernel image base address location. Since
-it is used directly, this value is intended only for KASLR, and should
-not be used for other purposes (as it may leak information about KASLR
-offsets). It is parsed as a u64 value, e.g.
-
-/ {
- chosen {
- kaslr-seed = <0xfeedbeef 0xc0def00d>;
- };
-};
-
-Note that if this property is set from UEFI (or a bootloader in EFI
-mode) when EFI_RNG_PROTOCOL is supported, it will be overwritten by
-the Linux EFI stub (which will populate the property itself, using
-EFI_RNG_PROTOCOL).
-
-stdout-path
------------
-
-Device trees may specify the device to be used for boot console output
-with a stdout-path property under /chosen, as described in the Devicetree
-Specification, e.g.
-
-/ {
- chosen {
- stdout-path = "/serial@f00:115200";
- };
-
- serial@f00 {
- compatible = "vendor,some-uart";
- reg = <0xf00 0x10>;
- };
-};
-
-If the character ":" is present in the value, this terminates the path.
-The meaning of any characters following the ":" is device-specific, and
-must be specified in the relevant binding documentation.
-
-For UART devices, the preferred binding is a string in the form:
-
- <baud>{<parity>{<bits>{<flow>}}}
-
-where
-
- baud - baud rate in decimal
- parity - 'n' (none), 'o', (odd) or 'e' (even)
- bits - number of data bits
- flow - 'r' (rts)
-
-For example: 115200n8r
-
-Implementation note: Linux will look for the property "linux,stdout-path" or
-on PowerPC "stdout" if "stdout-path" is not found. However, the
-"linux,stdout-path" and "stdout" properties are deprecated. New platforms
-should only use the "stdout-path" property.
-
-linux,booted-from-kexec
------------------------
-
-This property is set (currently only on PowerPC, and only needed on
-book3e) by some versions of kexec-tools to tell the new kernel that it
-is being booted by kexec, as the booting environment may differ (e.g.
-a different secondary CPU release mechanism)
-
-linux,usable-memory-range
--------------------------
-
-This property holds a base address and size, describing a limited region in
-which memory may be considered available for use by the kernel. Memory outside
-of this range is not available for use.
-
-This property describes a limitation: memory within this range is only
-valid when also described through another mechanism that the kernel
-would otherwise use to determine available memory (e.g. memory nodes
-or the EFI memory map). Valid memory may be sparse within the range.
-e.g.
-
-/ {
- chosen {
- linux,usable-memory-range = <0x9 0xf0000000 0x0 0x10000000>;
- };
-};
-
-The main usage is for crash dump kernel to identify its own usable
-memory and exclude, at its boot time, any other memory areas that are
-part of the panicked kernel's memory.
-
-While this property does not represent a real hardware, the address
-and the size are expressed in #address-cells and #size-cells,
-respectively, of the root node.
-
-linux,elfcorehdr
-----------------
-
-This property holds the memory range, the address and the size, of the elf
-core header which mainly describes the panicked kernel's memory layout as
-PT_LOAD segments of elf format.
-e.g.
-
-/ {
- chosen {
- linux,elfcorehdr = <0x9 0xfffff000 0x0 0x800>;
- };
-};
-
-While this property does not represent a real hardware, the address
-and the size are expressed in #address-cells and #size-cells,
-respectively, of the root node.
-
-linux,initrd-start and linux,initrd-end
----------------------------------------
-
-These properties hold the physical start and end address of an initrd that's
-loaded by the bootloader. Note that linux,initrd-start is inclusive, but
-linux,initrd-end is exclusive.
-e.g.
-
-/ {
- chosen {
- linux,initrd-start = <0x82000000>;
- linux,initrd-end = <0x82800000>;
- };
-};
diff --git a/Documentation/devicetree/bindings/chrome/google,cros-ec-typec.yaml b/Documentation/devicetree/bindings/chrome/google,cros-ec-typec.yaml
index 2d98f7c4d3bc..3b0548c34791 100644
--- a/Documentation/devicetree/bindings/chrome/google,cros-ec-typec.yaml
+++ b/Documentation/devicetree/bindings/chrome/google,cros-ec-typec.yaml
@@ -20,23 +20,35 @@ properties:
compatible:
const: google,cros-ec-typec
- connector:
+ '#address-cells':
+ const: 1
+
+ '#size-cells':
+ const: 0
+
+patternProperties:
+ '^connector@[0-9a-f]+$':
$ref: /schemas/connector/usb-connector.yaml#
+ unevaluatedProperties: false
+ properties:
+ reg:
+ maxItems: 1
required:
- compatible
-additionalProperties: true #fixme
+additionalProperties: false
examples:
- |+
- spi0 {
+ spi {
#address-cells = <1>;
#size-cells = <0>;
cros_ec: ec@0 {
compatible = "google,cros-ec-spi";
reg = <0>;
+ interrupts = <35 0>;
typec {
compatible = "google,cros-ec-typec";
diff --git a/Documentation/devicetree/bindings/chrome/google,cros-kbd-led-backlight.yaml b/Documentation/devicetree/bindings/chrome/google,cros-kbd-led-backlight.yaml
new file mode 100644
index 000000000000..c94ab8f9e0b8
--- /dev/null
+++ b/Documentation/devicetree/bindings/chrome/google,cros-kbd-led-backlight.yaml
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/chrome/google,cros-kbd-led-backlight.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ChromeOS keyboard backlight LED driver
+
+maintainers:
+ - Tzung-Bi Shih <tzungbi@kernel.org>
+
+properties:
+ compatible:
+ const: google,cros-kbd-led-backlight
+
+required:
+ - compatible
+
+additionalProperties: false
+
+examples:
+ - |
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ cros_ec: ec@0 {
+ compatible = "google,cros-ec-spi";
+ reg = <0>;
+ interrupts = <15 0>;
+
+ kbd-led-backlight {
+ compatible = "google,cros-kbd-led-backlight";
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/clock/adi,axi-clkgen.yaml b/Documentation/devicetree/bindings/clock/adi,axi-clkgen.yaml
index 983033fe5b17..5e942bccf277 100644
--- a/Documentation/devicetree/bindings/clock/adi,axi-clkgen.yaml
+++ b/Documentation/devicetree/bindings/clock/adi,axi-clkgen.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/adi,axi-clkgen.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Binding for Analog Devices AXI clkgen pcore clock generator
+title: Analog Devices AXI clkgen pcore clock generator
maintainers:
- Lars-Peter Clausen <lars@metafoo.de>
diff --git a/Documentation/devicetree/bindings/clock/airoha,en7523-scu.yaml b/Documentation/devicetree/bindings/clock/airoha,en7523-scu.yaml
new file mode 100644
index 000000000000..79b0752faa91
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/airoha,en7523-scu.yaml
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/airoha,en7523-scu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: EN7523 Clock
+
+maintainers:
+ - Felix Fietkau <nbd@nbd.name>
+ - John Crispin <nbd@nbd.name>
+
+description: |
+ This node defines the System Control Unit of the EN7523 SoC,
+ a collection of registers configuring many different aspects of the SoC.
+
+ The clock driver uses it to read and configure settings of the
+ PLL controller, which provides clocks for the CPU, the bus and
+ other SoC internal peripherals.
+
+ Each clock is assigned an identifier and client nodes use this identifier
+ to specify which clock they consume.
+
+ All these identifiers can be found in:
+ [1]: <include/dt-bindings/clock/en7523-clk.h>.
+
+ The clocks are provided inside a system controller node.
+
+properties:
+ compatible:
+ items:
+ - const: airoha,en7523-scu
+
+ reg:
+ maxItems: 2
+
+ "#clock-cells":
+ description:
+ The first cell indicates the clock number, see [1] for available
+ clocks.
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/en7523-clk.h>
+ scu: system-controller@1fa20000 {
+ compatible = "airoha,en7523-scu";
+ reg = <0x1fa20000 0x400>,
+ <0x1fb00000 0x1000>;
+ #clock-cells = <1>;
+ };
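
A consumer selects one of the SCU's clocks with the identifier in its clocks
cell. The sketch below assumes EN7523_CLK_BUS is among the identifiers defined
in en7523-clk.h; the serial node itself is hypothetical:

    serial@1fbf0000 {
        compatible = "vendor,example-uart"; /* hypothetical consumer */
        reg = <0x1fbf0000 0x30>;
        clocks = <&scu EN7523_CLK_BUS>;
    };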
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ahb-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ahb-clk.yaml
index 558db4b6ed17..93587b700476 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ahb-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ahb-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-ahb-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 AHB Clock Device Tree Bindings
+title: Allwinner A10 AHB Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-apb0-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-apb0-clk.yaml
index b1e3d739beb2..e14e1aad9fd6 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-apb0-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-apb0-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-apb0-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 APB0 Bus Clock Device Tree Bindings
+title: Allwinner A10 APB0 Bus Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-apb1-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-apb1-clk.yaml
index 51b7a6d4ea54..8a4747ebe0ba 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-apb1-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-apb1-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-apb1-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 APB1 Bus Clock Device Tree Bindings
+title: Allwinner A10 APB1 Bus Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-axi-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-axi-clk.yaml
index d801158e15de..aa08dd49dd61 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-axi-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-axi-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-axi-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 AXI Clock Device Tree Bindings
+title: Allwinner A10 AXI Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ccu.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ccu.yaml
index c4b7243ddcf2..1690b9d99c3d 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ccu.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ccu.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-ccu.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner Clock Control Unit Device Tree Bindings
+title: Allwinner Clock Control Unit
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -34,6 +34,8 @@ properties:
- allwinner,sun8i-v3-ccu
- allwinner,sun8i-v3s-ccu
- allwinner,sun9i-a80-ccu
+ - allwinner,sun20i-d1-ccu
+ - allwinner,sun20i-d1-r-ccu
- allwinner,sun50i-a64-ccu
- allwinner,sun50i-a64-r-ccu
- allwinner,sun50i-a100-ccu
@@ -79,6 +81,7 @@ if:
enum:
- allwinner,sun8i-a83t-r-ccu
- allwinner,sun8i-h3-r-ccu
+ - allwinner,sun20i-d1-r-ccu
- allwinner,sun50i-a64-r-ccu
- allwinner,sun50i-a100-r-ccu
- allwinner,sun50i-h6-r-ccu
@@ -99,6 +102,7 @@ else:
properties:
compatible:
enum:
+ - allwinner,sun20i-d1-ccu
- allwinner,sun50i-a100-ccu
- allwinner,sun50i-h6-ccu
- allwinner,sun50i-h616-ccu
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-cpu-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-cpu-clk.yaml
index 0dfafba1a168..08d073520cfa 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-cpu-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-cpu-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-cpu-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 CPU Clock Device Tree Bindings
+title: Allwinner A10 CPU Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-display-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-display-clk.yaml
index 7484a7ab7dea..e665e50c1785 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-display-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-display-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-display-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Display Clock Device Tree Bindings
+title: Allwinner A10 Display Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-gates-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-gates-clk.yaml
index 9a37a357cb4e..c4714d0fbe07 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-gates-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-gates-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-gates-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Bus Gates Clock Device Tree Bindings
+title: Allwinner A10 Bus Gates Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mbus-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mbus-clk.yaml
index 18f131e262b4..e824e33489b6 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mbus-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mbus-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-mbus-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 MBUS Clock Device Tree Bindings
+title: Allwinner A10 MBUS Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mmc-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mmc-clk.yaml
index 5199285a661a..c612f94befb9 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mmc-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mmc-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-mmc-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Module 1 Clock Device Tree Bindings
+title: Allwinner A10 Module 1 Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mod0-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mod0-clk.yaml
index 3e2abe3e67c1..80ae3a7a588c 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mod0-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mod0-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-mod0-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Module 0 Clock Device Tree Bindings
+title: Allwinner A10 Module 0 Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mod1-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mod1-clk.yaml
index 7ddb55c75cff..4f9a8d44d42a 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mod1-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-mod1-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-mod1-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Module 1 Clock Device Tree Bindings
+title: Allwinner A10 Module 1 Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-osc-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-osc-clk.yaml
index c604822cda07..52a7b6e7124c 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-osc-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-osc-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-osc-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Gatable Oscillator Clock Device Tree Bindings
+title: Allwinner A10 Gatable Oscillator Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll1-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll1-clk.yaml
index e5d9d45dab8a..b13a1f21d5da 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll1-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll1-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-pll1-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 CPU PLL Device Tree Bindings
+title: Allwinner A10 CPU PLL
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll3-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll3-clk.yaml
index 4b80a42fb3da..418d207d23b8 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll3-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll3-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-pll3-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Video PLL Device Tree Bindings
+title: Allwinner A10 Video PLL
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll5-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll5-clk.yaml
index 415bd77de53d..76ef3f0c7f2c 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll5-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll5-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-pll5-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 DRAM PLL Device Tree Bindings
+title: Allwinner A10 DRAM PLL
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll6-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll6-clk.yaml
index ec5652f76027..a94c93c90ece 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll6-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-pll6-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-pll6-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Peripheral PLL Device Tree Bindings
+title: Allwinner A10 Peripheral PLL
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-tcon-ch0-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-tcon-ch0-clk.yaml
index 0a335c615efd..6646b2a99fc1 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-tcon-ch0-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-tcon-ch0-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-tcon-ch0-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 TCON Channel 0 Clock Device Tree Bindings
+title: Allwinner A10 TCON Channel 0 Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-usb-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-usb-clk.yaml
index cd95d25bfe7c..5103b675e488 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-usb-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-usb-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-usb-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 USB Clock Device Tree Bindings
+title: Allwinner A10 USB Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ve-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ve-clk.yaml
index 5dfd0c1c27b4..80337e38d6e5 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ve-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun4i-a10-ve-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun4i-a10-ve-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Video Engine Clock Device Tree Bindings
+title: Allwinner A10 Video Engine Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun5i-a13-ahb-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun5i-a13-ahb-clk.yaml
index 99add7991c48..c6a6fbb6863b 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun5i-a13-ahb-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun5i-a13-ahb-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun5i-a13-ahb-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A13 AHB Clock Device Tree Bindings
+title: Allwinner A13 AHB Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun6i-a31-pll6-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun6i-a31-pll6-clk.yaml
index 5f377205af71..7d6a6a34d20c 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun6i-a31-pll6-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun6i-a31-pll6-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun6i-a31-pll6-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A31 Peripheral PLL Device Tree Bindings
+title: Allwinner A31 Peripheral PLL
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun7i-a20-gmac-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun7i-a20-gmac-clk.yaml
index 59e5dce1b65a..b6202de35707 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun7i-a20-gmac-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun7i-a20-gmac-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun7i-a20-gmac-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A20 GMAC TX Clock Device Tree Bindings
+title: Allwinner A20 GMAC TX Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun7i-a20-out-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun7i-a20-out-clk.yaml
index c745733bcf04..fde7f7dc3d34 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun7i-a20-out-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun7i-a20-out-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun7i-a20-out-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A20 Output Clock Device Tree Bindings
+title: Allwinner A20 Output Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun8i-a83t-de2-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun8i-a83t-de2-clk.yaml
index 3f995d2b30eb..70369bd633e4 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun8i-a83t-de2-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun8i-a83t-de2-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun8i-a83t-de2-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A83t Display Engine 2/3 Clock Controller Device Tree Bindings
+title: Allwinner A83t Display Engine 2/3 Clock Controller
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -24,10 +24,13 @@ properties:
- const: allwinner,sun8i-v3s-de2-clk
- const: allwinner,sun50i-a64-de2-clk
- const: allwinner,sun50i-h5-de2-clk
- - const: allwinner,sun50i-h6-de2-clk
+ - const: allwinner,sun50i-h6-de3-clk
- items:
- const: allwinner,sun8i-r40-de2-clk
- const: allwinner,sun8i-h3-de2-clk
+ - items:
+ - const: allwinner,sun20i-d1-de2-clk
+ - const: allwinner,sun50i-h5-de2-clk
reg:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun8i-h3-bus-gates-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun8i-h3-bus-gates-clk.yaml
index 3eb2bf65b230..45b9e2c7c1d1 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun8i-h3-bus-gates-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun8i-h3-bus-gates-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun8i-h3-bus-gates-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Bus Gates Clock Device Tree Bindings
+title: Allwinner A10 Bus Gates Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-ahb-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-ahb-clk.yaml
index d178da90aaec..f0f65af8ae22 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-ahb-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-ahb-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun9i-a80-ahb-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A80 AHB Clock Device Tree Bindings
+title: Allwinner A80 AHB Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-apb0-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-apb0-clk.yaml
index 0351c79bd221..e9f9bc8f5794 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-apb0-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-apb0-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun9i-a80-apb0-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A80 APB0 Bus Clock Device Tree Bindings
+title: Allwinner A80 APB0 Bus Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-cpus-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-cpus-clk.yaml
index 24d5b2f1a314..c48db2d49340 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-cpus-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-cpus-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun9i-a80-cpus-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A80 CPUS Clock Device Tree Bindings
+title: Allwinner A80 CPUS Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-de-clks.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-de-clks.yaml
index a82c7c7e942b..e9f81a343be1 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-de-clks.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-de-clks.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun9i-a80-de-clks.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A80 Display Engine Clock Controller Device Tree Bindings
+title: Allwinner A80 Display Engine Clock Controller
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-gt-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-gt-clk.yaml
index 43963c3062c8..d3ce5eb18d4e 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-gt-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-gt-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun9i-a80-gt-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A80 GT Bus Clock Device Tree Bindings
+title: Allwinner A80 GT Bus Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-mmc-config-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-mmc-config-clk.yaml
index 20dc115fa211..65ee5afe83cc 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-mmc-config-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-mmc-config-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun9i-a80-mmc-config-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A80 MMC Configuration Clock Device Tree Bindings
+title: Allwinner A80 MMC Configuration Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-pll4-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-pll4-clk.yaml
index b76bab6a30e9..261264a8aef6 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-pll4-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-pll4-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun9i-a80-pll4-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A80 Peripheral PLL Device Tree Bindings
+title: Allwinner A80 Peripheral PLL
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-clks.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-clks.yaml
index 6532fb6821bc..515c15d5f661 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-clks.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-clks.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun9i-a80-usb-clks.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A80 USB Clock Controller Device Tree Bindings
+title: Allwinner A80 USB Clock Controller
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-mod-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-mod-clk.yaml
index 15218d10e78e..3f7b8d9511f1 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-mod-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-mod-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun9i-a80-usb-mod-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A80 USB Module Clock Device Tree Bindings
+title: Allwinner A80 USB Module Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-phy-clk.yaml b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-phy-clk.yaml
index 2569041684e6..0d49072d47ca 100644
--- a/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-phy-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/allwinner,sun9i-a80-usb-phy-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/allwinner,sun9i-a80-usb-phy-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A80 USB PHY Clock Device Tree Bindings
+title: Allwinner A80 USB PHY Clock
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/clock/amlogic,meson8-ddr-clkc.yaml b/Documentation/devicetree/bindings/clock/amlogic,meson8-ddr-clkc.yaml
index 4b8669f870ec..d98d95d8e8c9 100644
--- a/Documentation/devicetree/bindings/clock/amlogic,meson8-ddr-clkc.yaml
+++ b/Documentation/devicetree/bindings/clock/amlogic,meson8-ddr-clkc.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/amlogic,meson8-ddr-clkc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Amlogic DDR Clock Controller Device Tree Bindings
+title: Amlogic DDR Clock Controller
maintainers:
- Martin Blumenstingl <martin.blumenstingl@googlemail.com>
diff --git a/Documentation/devicetree/bindings/clock/apple,nco.yaml b/Documentation/devicetree/bindings/clock/apple,nco.yaml
new file mode 100644
index 000000000000..8b8411dc42f6
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/apple,nco.yaml
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/apple,nco.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Apple SoCs' NCO block
+
+maintainers:
+ - Martin Povišer <povik+lin@cutebit.org>
+
+description: |
+ The NCO (Numerically Controlled Oscillator) block found on Apple SoCs
+ such as the t8103 (M1) is a programmable clock generator performing
+ fractional division of a high frequency input clock.
+
+ It carries a number of independent channels and is typically used for
+ generation of audio bitclocks.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - apple,t6000-nco
+ - apple,t8103-nco
+ - apple,t8112-nco
+ - const: apple,nco
+
+ clocks:
+ description:
+ Specifies the reference clock from which the output clocks
+ are derived through fractional division.
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - clocks
+ - '#clock-cells'
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ nco_clkref: clock-ref {
+ compatible = "fixed-clock";
+ #clock-cells = <0>;
+ clock-frequency = <900000000>;
+ clock-output-names = "nco-ref";
+ };
+
+ nco: clock-controller@3b044000 {
+ compatible = "apple,t8103-nco", "apple,nco";
+ reg = <0x3b044000 0x14000>;
+ #clock-cells = <1>;
+ clocks = <&nco_clkref>;
+ };
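
Since #clock-cells is 1, each consumer selects an NCO channel by index. A
hypothetical audio controller taking its bit clock from channel 0 could look
like this sketch (node name, compatible and addresses are assumptions):

    audio-controller@3b500000 {
        compatible = "vendor,example-audio"; /* hypothetical consumer */
        reg = <0x3b500000 0x4000>;
        clocks = <&nco 0>;
        clock-names = "bitclk";
    };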
diff --git a/Documentation/devicetree/bindings/clock/arm,syscon-icst.yaml b/Documentation/devicetree/bindings/clock/arm,syscon-icst.yaml
index 118c5543e037..b5533f81307c 100644
--- a/Documentation/devicetree/bindings/clock/arm,syscon-icst.yaml
+++ b/Documentation/devicetree/bindings/clock/arm,syscon-icst.yaml
@@ -69,6 +69,10 @@ properties:
- arm,impd1-vco1
- arm,impd1-vco2
+ reg:
+ maxItems: 1
+ description: The VCO register
+
clocks:
description: Parent clock for the ICST VCO
maxItems: 1
@@ -77,12 +81,13 @@ properties:
maxItems: 1
lock-offset:
- $ref: '/schemas/types.yaml#/definitions/uint32'
+ $ref: /schemas/types.yaml#/definitions/uint32
description: Offset to the unlocking register for the oscillator
vco-offset:
- $ref: '/schemas/types.yaml#/definitions/uint32'
+ $ref: /schemas/types.yaml#/definitions/uint32
description: Offset to the VCO register for the oscillator
+ deprecated: true
required:
- "#clock-cells"
diff --git a/Documentation/devicetree/bindings/clock/bitmain,bm1880-clk.yaml b/Documentation/devicetree/bindings/clock/bitmain,bm1880-clk.yaml
index 228c9313df53..f0f9392470a6 100644
--- a/Documentation/devicetree/bindings/clock/bitmain,bm1880-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/bitmain,bm1880-clk.yaml
@@ -61,16 +61,4 @@ examples:
#clock-cells = <1>;
};
- # Example UART controller node that consumes clock generated by the clock controller:
- - |
- uart0: serial@58018000 {
- compatible = "snps,dw-apb-uart";
- reg = <0x58018000 0x2000>;
- clocks = <&clk 45>, <&clk 46>;
- clock-names = "baudclk", "apb_pclk";
- interrupts = <0 9 4>;
- reg-shift = <2>;
- reg-io-width = <4>;
- };
-
...
diff --git a/Documentation/devicetree/bindings/clock/brcm,bcm2711-dvp.yaml b/Documentation/devicetree/bindings/clock/brcm,bcm2711-dvp.yaml
index 08543ecbe35b..2d40df2d34df 100644
--- a/Documentation/devicetree/bindings/clock/brcm,bcm2711-dvp.yaml
+++ b/Documentation/devicetree/bindings/clock/brcm,bcm2711-dvp.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/brcm,bcm2711-dvp.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom BCM2711 HDMI DVP Device Tree Bindings
+title: Broadcom BCM2711 HDMI DVP
maintainers:
- Maxime Ripard <mripard@kernel.org>
diff --git a/Documentation/devicetree/bindings/clock/brcm,bcm63268-timer-clocks.yaml b/Documentation/devicetree/bindings/clock/brcm,bcm63268-timer-clocks.yaml
new file mode 100644
index 000000000000..199818b2fb6d
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/brcm,bcm63268-timer-clocks.yaml
@@ -0,0 +1,40 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/brcm,bcm63268-timer-clocks.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Broadcom BCM63268 Timer Clock and Reset Device Tree Bindings
+
+maintainers:
+ - Álvaro Fernández Rojas <noltari@gmail.com>
+
+properties:
+ compatible:
+ const: brcm,bcm63268-timer-clocks
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ timer_clk: clock-controller@100000ac {
+ compatible = "brcm,bcm63268-timer-clocks";
+ reg = <0x100000ac 0x4>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
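
Because the controller exposes both #clock-cells and #reset-cells, a consumer
can take a gated clock and a reset line from the same node. The indices below
are illustrative only, not taken from the BCM63268 binding headers:

    timer@100000c0 {
        compatible = "vendor,example-timer"; /* hypothetical consumer */
        reg = <0x100000c0 0x1c>;
        clocks = <&timer_clk 0>;
        resets = <&timer_clk 1>;
    };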
diff --git a/Documentation/devicetree/bindings/clock/calxeda.yaml b/Documentation/devicetree/bindings/clock/calxeda.yaml
index a34cbf3c9aaf..a88fbe20fef1 100644
--- a/Documentation/devicetree/bindings/clock/calxeda.yaml
+++ b/Documentation/devicetree/bindings/clock/calxeda.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/calxeda.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Device Tree Clock bindings for Calxeda highbank platform
+title: Calxeda highbank platform Clock Controller
description: |
This binding covers the Calxeda SoC internal peripheral and bus clocks
diff --git a/Documentation/devicetree/bindings/clock/canaan,k210-clk.yaml b/Documentation/devicetree/bindings/clock/canaan,k210-clk.yaml
index 7f5cf4001f76..998e5cce652f 100644
--- a/Documentation/devicetree/bindings/clock/canaan,k210-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/canaan,k210-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/canaan,k210-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Canaan Kendryte K210 Clock Device Tree Bindings
+title: Canaan Kendryte K210 Clock
maintainers:
- Damien Le Moal <damien.lemoal@wdc.com>
diff --git a/Documentation/devicetree/bindings/clock/cirrus,cs2000-cp.yaml b/Documentation/devicetree/bindings/clock/cirrus,cs2000-cp.yaml
new file mode 100644
index 000000000000..d416c374e853
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/cirrus,cs2000-cp.yaml
@@ -0,0 +1,90 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/cirrus,cs2000-cp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: CIRRUS LOGIC Fractional-N Clock Synthesizer & Clock Multiplier
+
+maintainers:
+ - Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
+
+description: |
+ The CS2000-CP is an extremely versatile system clocking device that
+ utilizes a programmable phase-locked loop.
+
+ Link: https://www.cirrus.com/products/cs2000/
+
+properties:
+ compatible:
+ enum:
+ - cirrus,cs2000-cp
+
+ clocks:
+ description:
+ Common clock binding for CLK_IN, XTI/REF_CLK
+ maxItems: 2
+
+ clock-names:
+ items:
+ - const: clk_in
+ - const: ref_clk
+
+ '#clock-cells':
+ const: 0
+
+ reg:
+ maxItems: 1
+
+ cirrus,aux-output-source:
+ description:
+ Specifies the function of the auxiliary clock output pin
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum:
+ - 0 # CS2000CP_AUX_OUTPUT_REF_CLK: ref_clk input
+ - 1 # CS2000CP_AUX_OUTPUT_CLK_IN: clk_in input
+ - 2 # CS2000CP_AUX_OUTPUT_CLK_OUT: clk_out output
+ - 3 # CS2000CP_AUX_OUTPUT_PLL_LOCK: pll lock status
+ default: 0
+
+ cirrus,clock-skip:
+ description:
+ This mode allows the PLL to maintain lock even when CLK_IN
+ has missing pulses for up to 20 ms.
+ $ref: /schemas/types.yaml#/definitions/flag
+
+ cirrus,dynamic-mode:
+ description:
+ In dynamic mode, the CLK_IN input is used to drive the
+ digital PLL of the silicon.
+ If not given, the static mode shall be used to derive the
+ output signal directly from the REF_CLK input.
+ $ref: /schemas/types.yaml#/definitions/flag
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/cirrus,cs2000-cp.h>
+
+ i2c@0 {
+ reg = <0x0 0x100>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ clock-controller@4f {
+ #clock-cells = <0>;
+ compatible = "cirrus,cs2000-cp";
+ reg = <0x4f>;
+ clocks = <&rcar_sound 0>, <&x12_clk>;
+ clock-names = "clk_in", "ref_clk";
+ cirrus,aux-output-source = <CS2000CP_AUX_OUTPUT_CLK_OUT>;
+ };
+ };
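
When CLK_IN rather than REF_CLK should drive the PLL, the cirrus,dynamic-mode
flag from the schema above is added; a minimal variation of the example,
otherwise unchanged:

    clock-controller@4f {
        #clock-cells = <0>;
        compatible = "cirrus,cs2000-cp";
        reg = <0x4f>;
        clocks = <&rcar_sound 0>, <&x12_clk>;
        clock-names = "clk_in", "ref_clk";
        cirrus,dynamic-mode;
    };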
diff --git a/Documentation/devicetree/bindings/clock/clock-bindings.txt b/Documentation/devicetree/bindings/clock/clock-bindings.txt
index f2ea53832ac6..6fe541368889 100644
--- a/Documentation/devicetree/bindings/clock/clock-bindings.txt
+++ b/Documentation/devicetree/bindings/clock/clock-bindings.txt
@@ -1,186 +1,2 @@
-This binding is a work-in-progress, and are based on some experimental
-work by benh[1].
-
-Sources of clock signal can be represented by any node in the device
-tree. Those nodes are designated as clock providers. Clock consumer
-nodes use a phandle and clock specifier pair to connect clock provider
-outputs to clock inputs. Similar to the gpio specifiers, a clock
-specifier is an array of zero, one or more cells identifying the clock
-output on a device. The length of a clock specifier is defined by the
-value of a #clock-cells property in the clock provider node.
-
-[1] https://patchwork.ozlabs.org/patch/31551/
-
-==Clock providers==
-
-Required properties:
-#clock-cells: Number of cells in a clock specifier; Typically 0 for nodes
- with a single clock output and 1 for nodes with multiple
- clock outputs.
-
-Optional properties:
-clock-output-names: Recommended to be a list of strings of clock output signal
- names indexed by the first cell in the clock specifier.
- However, the meaning of clock-output-names is domain
- specific to the clock provider, and is only provided to
- encourage using the same meaning for the majority of clock
- providers. This format may not work for clock providers
- using a complex clock specifier format. In those cases it
- is recommended to omit this property and create a binding
- specific names property.
-
- Clock consumer nodes must never directly reference
- the provider's clock-output-names property.
-
-For example:
-
- oscillator {
- #clock-cells = <1>;
- clock-output-names = "ckil", "ckih";
- };
-
-- this node defines a device with two clock outputs, the first named
- "ckil" and the second named "ckih". Consumer nodes always reference
- clocks by index. The names should reflect the clock output signal
- names for the device.
-
-clock-indices: If the identifying number for the clocks in the node
- is not linear from zero, then this allows the mapping of
- identifiers into the clock-output-names array.
-
-For example, if we have two clocks <&oscillator 1> and <&oscillator 3>:
-
- oscillator {
- compatible = "myclocktype";
- #clock-cells = <1>;
- clock-indices = <1>, <3>;
- clock-output-names = "clka", "clkb";
- }
-
- This ensures we do not have any empty strings in clock-output-names
-
-
-==Clock consumers==
-
-Required properties:
-clocks: List of phandle and clock specifier pairs, one pair
- for each clock input to the device. Note: if the
- clock provider specifies '0' for #clock-cells, then
- only the phandle portion of the pair will appear.
-
-Optional properties:
-clock-names: List of clock input name strings sorted in the same
- order as the clocks property. Consumers drivers
- will use clock-names to match clock input names
- with clocks specifiers.
-clock-ranges: Empty property indicating that child nodes can inherit named
- clocks from this node. Useful for bus nodes to provide a
- clock to their children.
-
-For example:
-
- device {
- clocks = <&osc 1>, <&ref 0>;
- clock-names = "baud", "register";
- };
-
-
-This represents a device with two clock inputs, named "baud" and "register".
-The baud clock is connected to output 1 of the &osc device, and the register
-clock is connected to output 0 of the &ref.
-
-==Example==
-
- /* external oscillator */
- osc: oscillator {
- compatible = "fixed-clock";
- #clock-cells = <0>;
- clock-frequency = <32678>;
- clock-output-names = "osc";
- };
-
- /* phase-locked-loop device, generates a higher frequency clock
- * from the external oscillator reference */
- pll: pll@4c000 {
- compatible = "vendor,some-pll-interface"
- #clock-cells = <1>;
- clocks = <&osc 0>;
- clock-names = "ref";
- reg = <0x4c000 0x1000>;
- clock-output-names = "pll", "pll-switched";
- };
-
- /* UART, using the low frequency oscillator for the baud clock,
- * and the high frequency switched PLL output for register
- * clocking */
- uart@a000 {
- compatible = "fsl,imx-uart";
- reg = <0xa000 0x1000>;
- interrupts = <33>;
- clocks = <&osc 0>, <&pll 1>;
- clock-names = "baud", "register";
- };
-
-This DT fragment defines three devices: an external oscillator to provide a
-low-frequency reference clock, a PLL device to generate a higher frequency
-clock signal, and a UART.
-
-* The oscillator is fixed-frequency, and provides one clock output, named "osc".
-* The PLL is both a clock provider and a clock consumer. It uses the clock
- signal generated by the external oscillator, and provides two output signals
- ("pll" and "pll-switched").
-* The UART has its baud clock connected the external oscillator and its
- register clock connected to the PLL clock (the "pll-switched" signal)
-
-==Assigned clock parents and rates==
-
-Some platforms may require initial configuration of default parent clocks
-and clock frequencies. Such a configuration can be specified in a device tree
-node through assigned-clocks, assigned-clock-parents and assigned-clock-rates
-properties. The assigned-clock-parents property should contain a list of parent
-clocks in the form of a phandle and clock specifier pair and the
-assigned-clock-rates property should contain a list of frequencies in Hz. Both
-these properties should correspond to the clocks listed in the assigned-clocks
-property.
-
-To skip setting parent or rate of a clock its corresponding entry should be
-set to 0, or can be omitted if it is not followed by any non-zero entry.
-
- uart@a000 {
- compatible = "fsl,imx-uart";
- reg = <0xa000 0x1000>;
- ...
- clocks = <&osc 0>, <&pll 1>;
- clock-names = "baud", "register";
-
- assigned-clocks = <&clkcon 0>, <&pll 2>;
- assigned-clock-parents = <&pll 2>;
- assigned-clock-rates = <0>, <460800>;
- };
-
-In this example the <&pll 2> clock is set as parent of clock <&clkcon 0> and
-the <&pll 2> clock is assigned a frequency value of 460800 Hz.
-
-Configuring a clock's parent and rate through the device node that consumes
-the clock can be done only for clocks that have a single user. Specifying
-conflicting parent or rate configuration in multiple consumer nodes for
-a shared clock is forbidden.
-
-Configuration of common clocks, which affect multiple consumer devices can
-be similarly specified in the clock provider node.
-
-==Protected clocks==
-
-Some platforms or firmwares may not fully expose all the clocks to the OS, such
-as in situations where those clks are used by drivers running in ARM secure
-execution levels. Such a configuration can be specified in device tree with the
-protected-clocks property in the form of a clock specifier list. This property should
-only be specified in the node that is providing the clocks being protected:
-
- clock-controller@a000f000 {
- compatible = "vendor,clk95;
- reg = <0xa000f000 0x1000>
- #clocks-cells = <1>;
- ...
- protected-clocks = <UART3_CLK>, <SPI5_CLK>;
- };
+This file has moved to the clock binding schema:
+https://github.com/devicetree-org/dt-schema/blob/main/dtschema/schemas/clock/clock.yaml
diff --git a/Documentation/devicetree/bindings/clock/cs2000-cp.txt b/Documentation/devicetree/bindings/clock/cs2000-cp.txt
deleted file mode 100644
index 54e6df0bee8a..000000000000
--- a/Documentation/devicetree/bindings/clock/cs2000-cp.txt
+++ /dev/null
@@ -1,22 +0,0 @@
-CIRRUS LOGIC Fractional-N Clock Synthesizer & Clock Multiplier
-
-Required properties:
-
-- compatible: "cirrus,cs2000-cp"
-- reg: The I2C device address
-- clocks: common clock binding for CLK_IN, XTI/REF_CLK
-- clock-names: CLK_IN : clk_in, XTI/REF_CLK : ref_clk
-- #clock-cells: must be <0>
-
-Example:
-
-&i2c2 {
- ...
- cs2000: clk_multiplier@4f {
- #clock-cells = <0>;
- compatible = "cirrus,cs2000-cp";
- reg = <0x4f>;
- clocks = <&rcar_sound 0>, <&x12_clk>;
- clock-names = "clk_in", "ref_clk";
- };
-};
diff --git a/Documentation/devicetree/bindings/clock/efm32-clock.txt b/Documentation/devicetree/bindings/clock/efm32-clock.txt
deleted file mode 100644
index 263d293f6a10..000000000000
--- a/Documentation/devicetree/bindings/clock/efm32-clock.txt
+++ /dev/null
@@ -1,11 +0,0 @@
-* Clock bindings for Energy Micro efm32 Giant Gecko's Clock Management Unit
-
-Required properties:
-- compatible: Should be "efm32gg,cmu"
-- reg: Base address and length of the register set
-- interrupts: Interrupt used by the CMU
-- #clock-cells: Should be <1>
-
-The clock consumer should specify the desired clock by having the clock ID in
-its "clocks" phandle cell. The header efm32-clk.h contains a list of available
-IDs.
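-
-For example, a consumer node might reference one of these clocks as below
-(a sketch; the node, its address and the clock ID macro are illustrative
-assumptions):
-
- uart@4000e000 {
- compatible = "energymicro,efm32-uart";
- reg = <0x4000e000 0x400>;
- /* clock ID macro from efm32-clk.h is assumed */
- clocks = <&cmu clk_HFPERCLKUART0>;
- };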
diff --git a/Documentation/devicetree/bindings/clock/exynos5260-clock.txt b/Documentation/devicetree/bindings/clock/exynos5260-clock.txt
deleted file mode 100644
index c79d31f7f66e..000000000000
--- a/Documentation/devicetree/bindings/clock/exynos5260-clock.txt
+++ /dev/null
@@ -1,190 +0,0 @@
-* Samsung Exynos5260 Clock Controller
-
-Exynos5260 has 13 clock controllers which are instantiated
-independently from the device-tree. These clock controllers
-generate and supply clocks to various hardware blocks within
-the SoC.
-
-Each clock is assigned an identifier and client nodes can use
-this identifier to specify the clock which they consume. All
-available clocks are defined as preprocessor macros in
-dt-bindings/clock/exynos5260-clk.h header and can be used in
-device tree sources.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It
-is expected that they are defined using standard clock bindings
-with the following clock-output-names (see the sketch after this list):
-
- - "fin_pll" - PLL input clock from XXTI
- - "xrtcxti" - input clock from XRTCXTI
- - "ioclk_pcm_extclk" - pcm external operation clock
- - "ioclk_spdif_extclk" - spdif external operation clock
- - "ioclk_i2s_cdclk" - i2s0 codec clock
-
-Phy clocks:
-
-There are several clocks which are generated by specific PHYs.
-These clocks are fed into the clock controller and then routed to
-the hardware blocks. These clocks are defined as fixed clocks in the
-driver with following names:
-
- - "phyclk_dptx_phy_ch3_txd_clk" - dp phy clock for channel 3
- - "phyclk_dptx_phy_ch2_txd_clk" - dp phy clock for channel 2
- - "phyclk_dptx_phy_ch1_txd_clk" - dp phy clock for channel 1
- - "phyclk_dptx_phy_ch0_txd_clk" - dp phy clock for channel 0
- - "phyclk_hdmi_phy_tmds_clko" - hdmi phy tmds clock
- - "phyclk_hdmi_phy_pixel_clko" - hdmi phy pixel clock
- - "phyclk_hdmi_link_o_tmds_clkhi" - hdmi phy for hdmi link
- - "phyclk_dptx_phy_o_ref_clk_24m" - dp phy reference clock
- - "phyclk_dptx_phy_clk_div2"
- - "phyclk_mipi_dphy_4l_m_rxclkesc0"
- - "phyclk_usbhost20_phy_phyclock" - usb 2.0 phy clock
- - "phyclk_usbhost20_phy_freeclk"
- - "phyclk_usbhost20_phy_clk48mohci"
- - "phyclk_usbdrd30_udrd30_pipe_pclk"
- - "phyclk_usbdrd30_udrd30_phyclock" - usb 3.0 phy clock
-
-Required Properties for Clock Controller:
-
- - compatible: should be one of the following.
- 1) "samsung,exynos5260-clock-top"
- 2) "samsung,exynos5260-clock-peri"
- 3) "samsung,exynos5260-clock-egl"
- 4) "samsung,exynos5260-clock-kfc"
- 5) "samsung,exynos5260-clock-g2d"
- 6) "samsung,exynos5260-clock-mif"
- 7) "samsung,exynos5260-clock-mfc"
- 8) "samsung,exynos5260-clock-g3d"
- 9) "samsung,exynos5260-clock-fsys"
- 10) "samsung,exynos5260-clock-aud"
- 11) "samsung,exynos5260-clock-isp"
- 12) "samsung,exynos5260-clock-gscl"
- 13) "samsung,exynos5260-clock-disp"
-
- - reg: physical base address of the controller and the length of
- memory mapped region.
-
- - #clock-cells: should be 1.
-
- - clocks: list of clock identifiers which are fed as the input to
- the given clock controller. Please refer to the next section to find
- the input clocks for a given controller.
-
- - clock-names: list of names of clocks which are fed as the input
- to the given clock controller.
-
-Input clocks for top clock controller:
- - fin_pll
- - dout_mem_pll
- - dout_bus_pll
- - dout_media_pll
-
-Input clocks for peri clock controller:
- - fin_pll
- - ioclk_pcm_extclk
- - ioclk_i2s_cdclk
- - ioclk_spdif_extclk
- - phyclk_hdmi_phy_ref_cko
- - dout_aclk_peri_66
- - dout_sclk_peri_uart0
- - dout_sclk_peri_uart1
- - dout_sclk_peri_uart2
- - dout_sclk_peri_spi0_b
- - dout_sclk_peri_spi1_b
- - dout_sclk_peri_spi2_b
- - dout_aclk_peri_aud
- - dout_sclk_peri_spi0_b
-
-Input clocks for egl clock controller:
- - fin_pll
- - dout_bus_pll
-
-Input clocks for kfc clock controller:
- - fin_pll
- - dout_media_pll
-
-Input clocks for g2d clock controller:
- - fin_pll
- - dout_aclk_g2d_333
-
-Input clocks for mif clock controller:
- - fin_pll
-
-Input clocks for mfc clock controller:
- - fin_pll
- - dout_aclk_mfc_333
-
-Input clocks for g3d clock controller:
- - fin_pll
-
-Input clocks for fsys clock controller:
- - fin_pll
- - phyclk_usbhost20_phy_phyclock
- - phyclk_usbhost20_phy_freeclk
- - phyclk_usbhost20_phy_clk48mohci
- - phyclk_usbdrd30_udrd30_pipe_pclk
- - phyclk_usbdrd30_udrd30_phyclock
- - dout_aclk_fsys_200
-
-Input clocks for aud clock controller:
- - fin_pll
- - fout_aud_pll
- - ioclk_i2s_cdclk
- - ioclk_pcm_extclk
-
-Input clocks for isp clock controller:
- - fin_pll
- - dout_aclk_isp1_266
- - dout_aclk_isp1_400
- - mout_aclk_isp1_266
-
-Input clocks for gscl clock controller:
- - fin_pll
- - dout_aclk_gscl_400
- - dout_aclk_gscl_333
-
-Input clocks for disp clock controller:
- - fin_pll
- - phyclk_dptx_phy_ch3_txd_clk
- - phyclk_dptx_phy_ch2_txd_clk
- - phyclk_dptx_phy_ch1_txd_clk
- - phyclk_dptx_phy_ch0_txd_clk
- - phyclk_hdmi_phy_tmds_clko
- - phyclk_hdmi_phy_ref_clko
- - phyclk_hdmi_phy_pixel_clko
- - phyclk_hdmi_link_o_tmds_clkhi
- - phyclk_mipi_dphy_4l_m_txbyte_clkhs
- - phyclk_dptx_phy_o_ref_clk_24m
- - phyclk_dptx_phy_clk_div2
- - phyclk_mipi_dphy_4l_m_rxclkesc0
- - phyclk_hdmi_phy_ref_cko
- - ioclk_spdif_extclk
- - dout_aclk_peri_aud
- - dout_aclk_disp_222
- - dout_sclk_disp_pixel
- - dout_aclk_disp_333
-
-Example 1: An example of a clock controller node is listed below.
-
- clock_mfc: clock-controller@11090000 {
- compatible = "samsung,exynos5260-clock-mfc";
- clocks = <&fin_pll>, <&clock_top TOP_DOUT_ACLK_MFC_333>;
- clock-names = "fin_pll", "dout_aclk_mfc_333";
- reg = <0x11090000 0x10000>;
- #clock-cells = <1>;
- };
-
-Example 2: UART controller node that consumes the clock generated by the
- peri clock controller. Refer to the standard clock bindings for
- information about 'clocks' and 'clock-names' properties.
-
- serial@12c00000 {
- compatible = "samsung,exynos4210-uart";
- reg = <0x12C00000 0x100>;
- interrupts = <0 146 0>;
- clocks = <&clock_peri PERI_PCLK_UART0>, <&clock_peri PERI_SCLK_UART0>;
- clock-names = "uart", "clk_uart_baud0";
- };
-
diff --git a/Documentation/devicetree/bindings/clock/exynos5410-clock.txt b/Documentation/devicetree/bindings/clock/exynos5410-clock.txt
deleted file mode 100644
index 217beb27c30e..000000000000
--- a/Documentation/devicetree/bindings/clock/exynos5410-clock.txt
+++ /dev/null
@@ -1,50 +0,0 @@
-* Samsung Exynos5410 Clock Controller
-
-The Exynos5410 clock controller generates and supplies clocks to various
-controllers within the Exynos5410 SoC.
-
-Required Properties:
-
-- compatible: should be "samsung,exynos5410-clock"
-
-- reg: physical base address of the controller and length of memory mapped
- region.
-
-- #clock-cells: should be 1.
-
-- clocks: should contain an entry specifying the root clock from external
- oscillator supplied through XXTI or XusbXTI pin. This clock should be
- defined using standard clock bindings with "fin_pll" clock-output-name.
- That clock is passed internally to the 9 PLLs.
-
-All available clocks are defined as preprocessor macros in
-dt-bindings/clock/exynos5410.h header and can be used in device
-tree sources.
-
-Example 1: An example of a clock controller node is listed below.
-
- fin_pll: xxti {
- compatible = "fixed-clock";
- clock-frequency = <24000000>;
- clock-output-names = "fin_pll";
- #clock-cells = <0>;
- };
-
- clock: clock-controller@10010000 {
- compatible = "samsung,exynos5410-clock";
- reg = <0x10010000 0x30000>;
- #clock-cells = <1>;
- clocks = <&fin_pll>;
- };
-
-Example 2: UART controller node that consumes the clock generated by the clock
- controller. Refer to the standard clock bindings for information
- about 'clocks' and 'clock-names' properties.
-
- serial@12c20000 {
- compatible = "samsung,exynos4210-uart";
- reg = <0x12C20000 0x100>;
- interrupts = <0 51 0>;
- clocks = <&clock CLK_UART0>, <&clock CLK_SCLK_UART0>;
- clock-names = "uart", "clk_uart_baud0";
- };
diff --git a/Documentation/devicetree/bindings/clock/exynos5433-clock.txt b/Documentation/devicetree/bindings/clock/exynos5433-clock.txt
deleted file mode 100644
index 183c327a7d6b..000000000000
--- a/Documentation/devicetree/bindings/clock/exynos5433-clock.txt
+++ /dev/null
@@ -1,507 +0,0 @@
-* Samsung Exynos5433 CMU (Clock Management Units)
-
-The Exynos5433 clock controller generates and supplies clocks to various
-controllers within the Exynos5433 SoC.
-
-Required Properties:
-
-- compatible: should be one of the following.
- - "samsung,exynos5433-cmu-top" - clock controller compatible for CMU_TOP
- which generates clocks for IMEM/FSYS/G3D/GSCL/HEVC/MSCL/G2D/MFC/PERIC/PERIS
- domains and bus clocks.
- - "samsung,exynos5433-cmu-cpif" - clock controller compatible for CMU_CPIF
- which generates clocks for LLI (Low Latency Interface) IP.
- - "samsung,exynos5433-cmu-mif" - clock controller compatible for CMU_MIF
- which generates clocks for DRAM Memory Controller domain.
- - "samsung,exynos5433-cmu-peric" - clock controller compatible for CMU_PERIC
- which generates clocks for UART/I2C/SPI/I2S/PCM/SPDIF/PWM/SLIMBUS IPs.
- - "samsung,exynos5433-cmu-peris" - clock controller compatible for CMU_PERIS
- which generates clocks for PMU/TMU/MCT/WDT/RTC/SECKEY/TZPC IPs.
- - "samsung,exynos5433-cmu-fsys" - clock controller compatible for CMU_FSYS
- which generates clocks for USB/UFS/SDMMC/TSI/PDMA IPs.
- - "samsung,exynos5433-cmu-g2d" - clock controller compatible for CMU_G2D
- which generates clocks for G2D/MDMA IPs.
- - "samsung,exynos5433-cmu-disp" - clock controller compatible for CMU_DISP
- which generates clocks for Display (DECON/HDMI/DSIM/MIXER) IPs.
- - "samsung,exynos5433-cmu-aud" - clock controller compatible for CMU_AUD
- which generates clocks for Cortex-A5/BUS/AUDIO clocks.
- - "samsung,exynos5433-cmu-bus0", "samsung,exynos5433-cmu-bus1"
- and "samsung,exynos5433-cmu-bus2" - clock controller compatible for CMU_BUS
- which generates global data buses clock and global peripheral buses clock.
- - "samsung,exynos5433-cmu-g3d" - clock controller compatible for CMU_G3D
- which generates clocks for 3D Graphics Engine IP.
- - "samsung,exynos5433-cmu-gscl" - clock controller compatible for CMU_GSCL
- which generates clocks for GSCALER IPs.
- - "samsung,exynos5433-cmu-apollo"- clock controller compatible for CMU_APOLLO
- which generates clocks for Cortex-A53 Quad-core processor.
- - "samsung,exynos5433-cmu-atlas" - clock controller compatible for CMU_ATLAS
- which generates clocks for Cortex-A57 Quad-core processor, CoreSight and
- L2 cache controller.
- - "samsung,exynos5433-cmu-mscl" - clock controller compatible for CMU_MSCL
- which generates clocks for M2M (Memory to Memory) scaler and JPEG IPs.
- - "samsung,exynos5433-cmu-mfc" - clock controller compatible for CMU_MFC
- which generates clocks for MFC(Multi-Format Codec) IP.
- - "samsung,exynos5433-cmu-hevc" - clock controller compatible for CMU_HEVC
- which generates clocks for HEVC(High Efficiency Video Codec) decoder IP.
- - "samsung,exynos5433-cmu-isp" - clock controller compatible for CMU_ISP
- which generates clocks for FIMC-ISP/DRC/SCLC/DIS/3DNR IPs.
- - "samsung,exynos5433-cmu-cam0" - clock controller compatible for CMU_CAM0
- which generates clocks for MIPI_CSIS{0|1}/FIMC_LITE_{A|B|D}/FIMC_3AA{0|1}
- IPs.
- - "samsung,exynos5433-cmu-cam1" - clock controller compatible for CMU_CAM1
- which generates clocks for Cortex-A5/MIPI_CSIS2/FIMC-LITE_C/FIMC-FD IPs.
- - "samsung,exynos5433-cmu-imem" - clock controller compatible for CMU_IMEM
- which generates clocks for SSS (Security SubSystem) and SlimSSS IPs.
-
-- reg: physical base address of the controller and length of memory mapped
- region.
-
-- #clock-cells: should be 1.
-
-- clocks: list of the clock controller input clock identifiers,
- from common clock bindings. Please refer to the next section
- to find the input clocks for a given controller.
-
-- clock-names: list of the clock controller input clock names,
- as described in clock-bindings.txt.
-
- Input clocks for top clock controller:
- - oscclk
- - sclk_mphy_pll
- - sclk_mfc_pll
- - sclk_bus_pll
-
- Input clocks for cpif clock controller:
- - oscclk
-
- Input clocks for mif clock controller:
- - oscclk
- - sclk_mphy_pll
-
- Input clocks for fsys clock controller:
- - oscclk
- - sclk_ufs_mphy
- - aclk_fsys_200
- - sclk_pcie_100_fsys
- - sclk_ufsunipro_fsys
- - sclk_mmc2_fsys
- - sclk_mmc1_fsys
- - sclk_mmc0_fsys
- - sclk_usbhost30_fsys
- - sclk_usbdrd30_fsys
-
- Input clocks for g2d clock controller:
- - oscclk
- - aclk_g2d_266
- - aclk_g2d_400
-
- Input clocks for disp clock controller:
- - oscclk
- - sclk_dsim1_disp
- - sclk_dsim0_disp
- - sclk_dsd_disp
- - sclk_decon_tv_eclk_disp
- - sclk_decon_vclk_disp
- - sclk_decon_eclk_disp
- - sclk_decon_tv_vclk_disp
- - aclk_disp_333
-
- Input clocks for audio clock controller:
- - oscclk
- - fout_aud_pll
-
- Input clocks for bus0 clock controller:
- - aclk_bus0_400
-
- Input clocks for bus1 clock controller:
- - aclk_bus1_400
-
- Input clocks for bus2 clock controller:
- - oscclk
- - aclk_bus2_400
-
- Input clocks for g3d clock controller:
- - oscclk
- - aclk_g3d_400
-
- Input clocks for gscl clock controller:
- - oscclk
- - aclk_gscl_111
- - aclk_gscl_333
-
- Input clocks for apollo clock controller:
- - oscclk
- - sclk_bus_pll_apollo
-
- Input clocks for atlas clock controller:
- - oscclk
- - sclk_bus_pll_atlas
-
- Input clocks for mscl clock controller:
- - oscclk
- - sclk_jpeg_mscl
- - aclk_mscl_400
-
- Input clocks for mfc clock controller:
- - oscclk
- - aclk_mfc_400
-
- Input clocks for hevc clock controller:
- - oscclk
- - aclk_hevc_400
-
- Input clocks for isp clock controller:
- - oscclk
- - aclk_isp_dis_400
- - aclk_isp_400
-
- Input clocks for cam0 clock controller:
- - oscclk
- - aclk_cam0_333
- - aclk_cam0_400
- - aclk_cam0_552
-
- Input clocks for cam1 clock controller:
- - oscclk
- - sclk_isp_uart_cam1
- - sclk_isp_spi1_cam1
- - sclk_isp_spi0_cam1
- - aclk_cam1_333
- - aclk_cam1_400
- - aclk_cam1_552
-
- Input clocks for imem clock controller:
- - oscclk
- - aclk_imem_sssx_266
- - aclk_imem_266
- - aclk_imem_200
-
-Optional properties:
- - power-domains: a phandle to respective power domain node as described by
- generic PM domain bindings (see power/power_domain.txt for more
- information).
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume.
-
-All available clocks are defined as preprocessor macros in
-dt-bindings/clock/exynos5433.h header and can be used in device
-tree sources.
-
-Example 1: An example of the 'oscclk' source clock node is listed below.
-
- xxti: xxti {
- compatible = "fixed-clock";
- clock-output-names = "oscclk";
- #clock-cells = <0>;
- };
-
-Example 2: Examples of clock controller nodes are listed below.
-
- cmu_top: clock-controller@10030000 {
- compatible = "samsung,exynos5433-cmu-top";
- reg = <0x10030000 0x0c04>;
- #clock-cells = <1>;
-
- clock-names = "oscclk",
- "sclk_mphy_pll",
- "sclk_mfc_pll",
- "sclk_bus_pll";
- clocks = <&xxti>,
- <&cmu_cpif CLK_SCLK_MPHY_PLL>,
- <&cmu_mif CLK_SCLK_MFC_PLL>,
- <&cmu_mif CLK_SCLK_BUS_PLL>;
- };
-
- cmu_cpif: clock-controller@10fc0000 {
- compatible = "samsung,exynos5433-cmu-cpif";
- reg = <0x10fc0000 0x0c04>;
- #clock-cells = <1>;
-
- clock-names = "oscclk";
- clocks = <&xxti>;
- };
-
- cmu_mif: clock-controller@105b0000 {
- compatible = "samsung,exynos5433-cmu-mif";
- reg = <0x105b0000 0x100c>;
- #clock-cells = <1>;
-
- clock-names = "oscclk",
- "sclk_mphy_pll";
- clocks = <&xxti>,
- <&cmu_cpif CLK_SCLK_MPHY_PLL>;
- };
-
- cmu_peric: clock-controller@14c80000 {
- compatible = "samsung,exynos5433-cmu-peric";
- reg = <0x14c80000 0x0b08>;
- #clock-cells = <1>;
- };
-
- cmu_peris: clock-controller@10040000 {
- compatible = "samsung,exynos5433-cmu-peris";
- reg = <0x10040000 0x0b20>;
- #clock-cells = <1>;
- };
-
- cmu_fsys: clock-controller@156e0000 {
- compatible = "samsung,exynos5433-cmu-fsys";
- reg = <0x156e0000 0x0b04>;
- #clock-cells = <1>;
-
- clock-names = "oscclk",
- "sclk_ufs_mphy",
- "aclk_fsys_200",
- "sclk_pcie_100_fsys",
- "sclk_ufsunipro_fsys",
- "sclk_mmc2_fsys",
- "sclk_mmc1_fsys",
- "sclk_mmc0_fsys",
- "sclk_usbhost30_fsys",
- "sclk_usbdrd30_fsys";
- clocks = <&xxti>,
- <&cmu_cpif CLK_SCLK_UFS_MPHY>,
- <&cmu_top CLK_ACLK_FSYS_200>,
- <&cmu_top CLK_SCLK_PCIE_100_FSYS>,
- <&cmu_top CLK_SCLK_UFSUNIPRO_FSYS>,
- <&cmu_top CLK_SCLK_MMC2_FSYS>,
- <&cmu_top CLK_SCLK_MMC1_FSYS>,
- <&cmu_top CLK_SCLK_MMC0_FSYS>,
- <&cmu_top CLK_SCLK_USBHOST30_FSYS>,
- <&cmu_top CLK_SCLK_USBDRD30_FSYS>;
- };
-
- cmu_g2d: clock-controller@12460000 {
- compatible = "samsung,exynos5433-cmu-g2d";
- reg = <0x12460000 0x0b08>;
- #clock-cells = <1>;
-
- clock-names = "oscclk",
- "aclk_g2d_266",
- "aclk_g2d_400";
- clocks = <&xxti>,
- <&cmu_top CLK_ACLK_G2D_266>,
- <&cmu_top CLK_ACLK_G2D_400>;
- power-domains = <&pd_g2d>;
- };
-
- cmu_disp: clock-controller@13b90000 {
- compatible = "samsung,exynos5433-cmu-disp";
- reg = <0x13b90000 0x0c04>;
- #clock-cells = <1>;
-
- clock-names = "oscclk",
- "sclk_dsim1_disp",
- "sclk_dsim0_disp",
- "sclk_dsd_disp",
- "sclk_decon_tv_eclk_disp",
- "sclk_decon_vclk_disp",
- "sclk_decon_eclk_disp",
- "sclk_decon_tv_vclk_disp",
- "aclk_disp_333";
- clocks = <&xxti>,
- <&cmu_mif CLK_SCLK_DSIM1_DISP>,
- <&cmu_mif CLK_SCLK_DSIM0_DISP>,
- <&cmu_mif CLK_SCLK_DSD_DISP>,
- <&cmu_mif CLK_SCLK_DECON_TV_ECLK_DISP>,
- <&cmu_mif CLK_SCLK_DECON_VCLK_DISP>,
- <&cmu_mif CLK_SCLK_DECON_ECLK_DISP>,
- <&cmu_mif CLK_SCLK_DECON_TV_VCLK_DISP>,
- <&cmu_mif CLK_ACLK_DISP_333>;
- power-domains = <&pd_disp>;
- };
-
- cmu_aud: clock-controller@114c0000 {
- compatible = "samsung,exynos5433-cmu-aud";
- reg = <0x114c0000 0x0b04>;
- #clock-cells = <1>;
-
- clock-names = "oscclk", "fout_aud_pll";
- clocks = <&xxti>, <&cmu_top CLK_FOUT_AUD_PLL>;
- power-domains = <&pd_aud>;
- };
-
- cmu_bus0: clock-controller@13600000 {
- compatible = "samsung,exynos5433-cmu-bus0";
- reg = <0x13600000 0x0b04>;
- #clock-cells = <1>;
-
- clock-names = "aclk_bus0_400";
- clocks = <&cmu_top CLK_ACLK_BUS0_400>;
- };
-
- cmu_bus1: clock-controller@14800000 {
- compatible = "samsung,exynos5433-cmu-bus1";
- reg = <0x14800000 0x0b04>;
- #clock-cells = <1>;
-
- clock-names = "aclk_bus1_400";
- clocks = <&cmu_top CLK_ACLK_BUS1_400>;
- };
-
- cmu_bus2: clock-controller@13400000 {
- compatible = "samsung,exynos5433-cmu-bus2";
- reg = <0x13400000 0x0b04>;
- #clock-cells = <1>;
-
- clock-names = "oscclk", "aclk_bus2_400";
- clocks = <&xxti>, <&cmu_mif CLK_ACLK_BUS2_400>;
- };
-
- cmu_g3d: clock-controller@14aa0000 {
- compatible = "samsung,exynos5433-cmu-g3d";
- reg = <0x14aa0000 0x1000>;
- #clock-cells = <1>;
-
- clock-names = "oscclk", "aclk_g3d_400";
- clocks = <&xxti>, <&cmu_top CLK_ACLK_G3D_400>;
- power-domains = <&pd_g3d>;
- };
-
- cmu_gscl: clock-controller@13cf0000 {
- compatible = "samsung,exynos5433-cmu-gscl";
- reg = <0x13cf0000 0x0b10>;
- #clock-cells = <1>;
-
- clock-names = "oscclk",
- "aclk_gscl_111",
- "aclk_gscl_333";
- clocks = <&xxti>,
- <&cmu_top CLK_ACLK_GSCL_111>,
- <&cmu_top CLK_ACLK_GSCL_333>;
- power-domains = <&pd_gscl>;
- };
-
- cmu_apollo: clock-controller@11900000 {
- compatible = "samsung,exynos5433-cmu-apollo";
- reg = <0x11900000 0x1088>;
- #clock-cells = <1>;
-
- clock-names = "oscclk", "sclk_bus_pll_apollo";
- clocks = <&xxti>, <&cmu_mif CLK_SCLK_BUS_PLL_APOLLO>;
- };
-
- cmu_atlas: clock-controller@11800000 {
- compatible = "samsung,exynos5433-cmu-atlas";
- reg = <0x11800000 0x1088>;
- #clock-cells = <1>;
-
- clock-names = "oscclk", "sclk_bus_pll_atlas";
- clocks = <&xxti>, <&cmu_mif CLK_SCLK_BUS_PLL_ATLAS>;
- };
-
- cmu_mscl: clock-controller@105d0000 {
- compatible = "samsung,exynos5433-cmu-mscl";
- reg = <0x105d0000 0x0b10>;
- #clock-cells = <1>;
-
- clock-names = "oscclk",
- "sclk_jpeg_mscl",
- "aclk_mscl_400";
- clocks = <&xxti>,
- <&cmu_top CLK_SCLK_JPEG_MSCL>,
- <&cmu_top CLK_ACLK_MSCL_400>;
- power-domains = <&pd_mscl>;
- };
-
- cmu_mfc: clock-controller@15280000 {
- compatible = "samsung,exynos5433-cmu-mfc";
- reg = <0x15280000 0x0b08>;
- #clock-cells = <1>;
-
- clock-names = "oscclk", "aclk_mfc_400";
- clocks = <&xxti>, <&cmu_top CLK_ACLK_MFC_400>;
- power-domains = <&pd_mfc>;
- };
-
- cmu_hevc: clock-controller@14f80000 {
- compatible = "samsung,exynos5433-cmu-hevc";
- reg = <0x14f80000 0x0b08>;
- #clock-cells = <1>;
-
- clock-names = "oscclk", "aclk_hevc_400";
- clocks = <&xxti>, <&cmu_top CLK_ACLK_HEVC_400>;
- power-domains = <&pd_hevc>;
- };
-
- cmu_isp: clock-controller@146d0000 {
- compatible = "samsung,exynos5433-cmu-isp";
- reg = <0x146d0000 0x0b0c>;
- #clock-cells = <1>;
-
- clock-names = "oscclk",
- "aclk_isp_dis_400",
- "aclk_isp_400";
- clocks = <&xxti>,
- <&cmu_top CLK_ACLK_ISP_DIS_400>,
- <&cmu_top CLK_ACLK_ISP_400>;
- power-domains = <&pd_isp>;
- };
-
- cmu_cam0: clock-controller@120d0000 {
- compatible = "samsung,exynos5433-cmu-cam0";
- reg = <0x120d0000 0x0b0c>;
- #clock-cells = <1>;
-
- clock-names = "oscclk",
- "aclk_cam0_333",
- "aclk_cam0_400",
- "aclk_cam0_552";
- clocks = <&xxti>,
- <&cmu_top CLK_ACLK_CAM0_333>,
- <&cmu_top CLK_ACLK_CAM0_400>,
- <&cmu_top CLK_ACLK_CAM0_552>;
- power-domains = <&pd_cam0>;
- };
-
- cmu_cam1: clock-controller@145d0000 {
- compatible = "samsung,exynos5433-cmu-cam1";
- reg = <0x145d0000 0x0b08>;
- #clock-cells = <1>;
-
- clock-names = "oscclk",
- "sclk_isp_uart_cam1",
- "sclk_isp_spi1_cam1",
- "sclk_isp_spi0_cam1",
- "aclk_cam1_333",
- "aclk_cam1_400",
- "aclk_cam1_552";
- clocks = <&xxti>,
- <&cmu_top CLK_SCLK_ISP_UART_CAM1>,
- <&cmu_top CLK_SCLK_ISP_SPI1_CAM1>,
- <&cmu_top CLK_SCLK_ISP_SPI0_CAM1>,
- <&cmu_top CLK_ACLK_CAM1_333>,
- <&cmu_top CLK_ACLK_CAM1_400>,
- <&cmu_top CLK_ACLK_CAM1_552>;
- power-domains = <&pd_cam1>;
- };
-
- cmu_imem: clock-controller@11060000 {
- compatible = "samsung,exynos5433-cmu-imem";
- reg = <0x11060000 0x1000>;
- #clock-cells = <1>;
-
- clock-names = "oscclk",
- "aclk_imem_sssx_266",
- "aclk_imem_266",
- "aclk_imem_200";
- clocks = <&xxti>,
- <&cmu_top CLK_DIV_ACLK_IMEM_SSSX_266>,
- <&cmu_top CLK_DIV_ACLK_IMEM_266>,
- <&cmu_top CLK_DIV_ACLK_IMEM_200>;
- };
-
-Example 3: UART controller node that consumes the clock generated by the clock
- controller.
-
- serial_0: serial@14c10000 {
- compatible = "samsung,exynos5433-uart";
- reg = <0x14C10000 0x100>;
- interrupts = <0 421 0>;
- clocks = <&cmu_peric CLK_PCLK_UART0>,
- <&cmu_peric CLK_SCLK_UART0>;
- clock-names = "uart", "clk_uart_baud0";
- pinctrl-names = "default";
- pinctrl-0 = <&uart0_bus>;
- };
diff --git a/Documentation/devicetree/bindings/clock/exynos7-clock.txt b/Documentation/devicetree/bindings/clock/exynos7-clock.txt
deleted file mode 100644
index 6bf1e7493f61..000000000000
--- a/Documentation/devicetree/bindings/clock/exynos7-clock.txt
+++ /dev/null
@@ -1,108 +0,0 @@
-* Samsung Exynos7 Clock Controller
-
-The Exynos7 clock controller has various blocks which are instantiated
-independently from the device-tree. These clock controllers
-generate and supply clocks to various hardware blocks within
-the SoC.
-
-Each clock is assigned an identifier and client nodes can use
-this identifier to specify the clock which they consume. All
-available clocks are defined as preprocessor macros in
-dt-bindings/clock/exynos7-clk.h header and can be used in
-device tree sources.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It
-is expected that they are defined using standard clock bindings
-with following clock-output-names:
-
- - "fin_pll" - PLL input clock from XXTI
-
-Required Properties for Clock Controller:
-
- - compatible: clock controllers will use one of the following
- compatible strings to indicate the clock controller
- functionality.
-
- - "samsung,exynos7-clock-topc"
- - "samsung,exynos7-clock-top0"
- - "samsung,exynos7-clock-top1"
- - "samsung,exynos7-clock-ccore"
- - "samsung,exynos7-clock-peric0"
- - "samsung,exynos7-clock-peric1"
- - "samsung,exynos7-clock-peris"
- - "samsung,exynos7-clock-fsys0"
- - "samsung,exynos7-clock-fsys1"
- - "samsung,exynos7-clock-mscl"
- - "samsung,exynos7-clock-aud"
-
- - reg: physical base address of the controller and the length of
- memory mapped region.
-
- - #clock-cells: should be 1.
-
- - clocks: list of clock identifiers which are fed as the input to
- the given clock controller. Please refer to the next section to
- find the input clocks for a given controller.
-
- - clock-names: list of names of clocks which are fed as the input
- to the given clock controller.
-
-Input clocks for top0 clock controller:
- - fin_pll
- - dout_sclk_bus0_pll
- - dout_sclk_bus1_pll
- - dout_sclk_cc_pll
- - dout_sclk_mfc_pll
- - dout_sclk_aud_pll
-
-Input clocks for top1 clock controller:
- - fin_pll
- - dout_sclk_bus0_pll
- - dout_sclk_bus1_pll
- - dout_sclk_cc_pll
- - dout_sclk_mfc_pll
-
-Input clocks for ccore clock controller:
- - fin_pll
- - dout_aclk_ccore_133
-
-Input clocks for peric0 clock controller:
- - fin_pll
- - dout_aclk_peric0_66
- - sclk_uart0
-
-Input clocks for peric1 clock controller:
- - fin_pll
- - dout_aclk_peric1_66
- - sclk_uart1
- - sclk_uart2
- - sclk_uart3
- - sclk_spi0
- - sclk_spi1
- - sclk_spi2
- - sclk_spi3
- - sclk_spi4
- - sclk_i2s1
- - sclk_pcm1
- - sclk_spdif
-
-Input clocks for peris clock controller:
- - fin_pll
- - dout_aclk_peris_66
-
-Input clocks for fsys0 clock controller:
- - fin_pll
- - dout_aclk_fsys0_200
- - dout_sclk_mmc2
-
-Input clocks for fsys1 clock controller:
- - fin_pll
- - dout_aclk_fsys1_200
- - dout_sclk_mmc0
- - dout_sclk_mmc1
-
-Input clocks for aud clock controller:
- - fin_pll
- - fout_aud_pll
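-
-Example: clock controller node together with the required external
-"fin_pll" clock (a sketch; the controller's address, register length,
-oscillator rate and its use of fin_pll as an input are illustrative
-assumptions, following the pattern of the other Exynos bindings):
-
- fin_pll: xxti {
- compatible = "fixed-clock";
- clock-frequency = <24000000>; /* assumed board-specific rate */
- clock-output-names = "fin_pll";
- #clock-cells = <0>;
- };
-
- clock_topc: clock-controller@10570000 {
- compatible = "samsung,exynos7-clock-topc";
- reg = <0x10570000 0x10000>; /* assumed address and length */
- #clock-cells = <1>;
- clocks = <&fin_pll>;
- clock-names = "fin_pll";
- };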
diff --git a/Documentation/devicetree/bindings/clock/fixed-clock.yaml b/Documentation/devicetree/bindings/clock/fixed-clock.yaml
index b657ecd0ef1c..b0a4fb8256e2 100644
--- a/Documentation/devicetree/bindings/clock/fixed-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/fixed-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/fixed-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Binding for simple fixed-rate clock sources
+title: Simple fixed-rate clock sources
maintainers:
- Michael Turquette <mturquette@baylibre.com>
diff --git a/Documentation/devicetree/bindings/clock/fixed-factor-clock.yaml b/Documentation/devicetree/bindings/clock/fixed-factor-clock.yaml
index f415845b38dd..8f71ab300470 100644
--- a/Documentation/devicetree/bindings/clock/fixed-factor-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/fixed-factor-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/fixed-factor-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Binding for simple fixed factor rate clock sources
+title: Simple fixed factor rate clock sources
maintainers:
- Michael Turquette <mturquette@baylibre.com>
@@ -13,7 +13,6 @@ maintainers:
properties:
compatible:
enum:
- - allwinner,sun4i-a10-pll3-2x-clk
- fixed-factor-clock
"#clock-cells":
diff --git a/Documentation/devicetree/bindings/clock/fixed-mmio-clock.txt b/Documentation/devicetree/bindings/clock/fixed-mmio-clock.txt
deleted file mode 100644
index c359367fd1a9..000000000000
--- a/Documentation/devicetree/bindings/clock/fixed-mmio-clock.txt
+++ /dev/null
@@ -1,24 +0,0 @@
-Binding for simple memory-mapped I/O fixed-rate clock sources.
-The driver reads a clock frequency value from a single 32-bit memory mapped
-I/O register and registers it as a fixed rate clock.
-
-It was designed for test systems, such as FPGAs, not for complete, finished SoCs.
-
-This binding uses the common clock binding[1].
-
-[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
-
-Required properties:
-- compatible : shall be "fixed-mmio-clock".
-- #clock-cells : from common clock binding; shall be set to 0.
-- reg : Address and length of the clock value register set.
-
-Optional properties:
-- clock-output-names : From common clock binding.
-
-Example:
-sysclock: sysclock@fd020004 {
- #clock-cells = <0>;
- compatible = "fixed-mmio-clock";
- reg = <0xfd020004 0x4>;
-};
diff --git a/Documentation/devicetree/bindings/clock/fixed-mmio-clock.yaml b/Documentation/devicetree/bindings/clock/fixed-mmio-clock.yaml
new file mode 100644
index 000000000000..e22fc272d023
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/fixed-mmio-clock.yaml
@@ -0,0 +1,47 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/fixed-mmio-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Simple memory mapped IO fixed-rate clock sources
+
+description:
+ This binding describes a fixed-rate clock for which the frequency can
+ be read from a single 32-bit memory mapped I/O register.
+
+ It was designed for test systems, such as FPGAs, not for complete,
+ finished SoCs.
+
+maintainers:
+ - Jan Kotas <jank@cadence.com>
+
+properties:
+ compatible:
+ const: fixed-mmio-clock
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 0
+
+ clock-output-names:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ sysclock: sysclock@fd020004 {
+ compatible = "fixed-mmio-clock";
+ #clock-cells = <0>;
+ reg = <0xfd020004 0x4>;
+ clock-output-names = "sysclk";
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/fsl,imx8m-anatop.yaml b/Documentation/devicetree/bindings/clock/fsl,imx8m-anatop.yaml
new file mode 100644
index 000000000000..bbd22e95b319
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/fsl,imx8m-anatop.yaml
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/fsl,imx8m-anatop.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NXP i.MX8M Family Anatop Module
+
+maintainers:
+ - Peng Fan <peng.fan@nxp.com>
+
+description: |
+ NXP i.MX8M Family anatop PLL module, which generates PLLs fed to the CCM root.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - fsl,imx8mm-anatop
+ - fsl,imx8mq-anatop
+ - items:
+ - enum:
+ - fsl,imx8mn-anatop
+ - fsl,imx8mp-anatop
+ - const: fsl,imx8mm-anatop
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ anatop: clock-controller@30360000 {
+ compatible = "fsl,imx8mn-anatop", "fsl,imx8mm-anatop";
+ reg = <0x30360000 0x10000>;
+ #clock-cells = <1>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/clock/fsl,plldig.yaml b/Documentation/devicetree/bindings/clock/fsl,plldig.yaml
index 9ac716dfa602..88dd9c18db92 100644
--- a/Documentation/devicetree/bindings/clock/fsl,plldig.yaml
+++ b/Documentation/devicetree/bindings/clock/fsl,plldig.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/fsl,plldig.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: NXP QorIQ Layerscape LS1028A Display PIXEL Clock Binding
+title: NXP QorIQ Layerscape LS1028A Display PIXEL Clock
maintainers:
- Wen He <wen.he_1@nxp.com>
diff --git a/Documentation/devicetree/bindings/clock/fsl,sai-clock.yaml b/Documentation/devicetree/bindings/clock/fsl,sai-clock.yaml
index fc3bdfdc091a..3bca9d11c148 100644
--- a/Documentation/devicetree/bindings/clock/fsl,sai-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/fsl,sai-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/fsl,sai-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Freescale SAI bitclock-as-a-clock binding
+title: Freescale SAI bitclock-as-a-clock
maintainers:
- Michael Walle <michael@walle.cc>
diff --git a/Documentation/devicetree/bindings/clock/fsl,scu-clk.yaml b/Documentation/devicetree/bindings/clock/fsl,scu-clk.yaml
new file mode 100644
index 000000000000..36d4cfc3c2f8
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/fsl,scu-clk.yaml
@@ -0,0 +1,43 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/fsl,scu-clk.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: i.MX SCU Client Device Node - Clock Controller Based on SCU Message Protocol
+
+maintainers:
+ - Abel Vesa <abel.vesa@nxp.com>
+
+description: i.MX SCU Client Device Node
+ Client nodes are maintained as children of the relevant IMX-SCU device node.
+ This binding uses the common clock binding.
+ (Documentation/devicetree/bindings/clock/clock-bindings.txt)
+ The clock consumer should specify the desired clock by having the clock
+ ID in its "clocks" phandle cell. See the full list of clock IDs from
+ include/dt-bindings/clock/imx8qxp-clock.h
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - fsl,imx8dxl-clk
+ - fsl,imx8qm-clk
+ - fsl,imx8qxp-clk
+ - const: fsl,scu-clk
+
+ '#clock-cells':
+ const: 2
+
+required:
+ - compatible
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ clock-controller {
+ compatible = "fsl,imx8qxp-clk", "fsl,scu-clk";
+ #clock-cells = <2>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/gpio-gate-clock.txt b/Documentation/devicetree/bindings/clock/gpio-gate-clock.txt
deleted file mode 100644
index d3379ff9b84b..000000000000
--- a/Documentation/devicetree/bindings/clock/gpio-gate-clock.txt
+++ /dev/null
@@ -1,21 +0,0 @@
-Binding for simple gpio gated clock.
-
-This binding uses the common clock binding[1].
-
-[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
-
-Required properties:
-- compatible : shall be "gpio-gate-clock".
-- #clock-cells : from common clock binding; shall be set to 0.
-- enable-gpios : GPIO reference for enabling and disabling the clock.
-
-Optional properties:
-- clocks: Maximum of one parent clock is supported.
-
-Example:
- clock {
- compatible = "gpio-gate-clock";
- clocks = <&parentclk>;
- #clock-cells = <0>;
- enable-gpios = <&gpio 1 GPIO_ACTIVE_HIGH>;
- };
diff --git a/Documentation/devicetree/bindings/clock/gpio-gate-clock.yaml b/Documentation/devicetree/bindings/clock/gpio-gate-clock.yaml
new file mode 100644
index 000000000000..d09d0e3f0c6e
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/gpio-gate-clock.yaml
@@ -0,0 +1,42 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/gpio-gate-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Simple GPIO clock gate
+
+maintainers:
+ - Jyri Sarha <jsarha@ti.com>
+
+properties:
+ compatible:
+ const: gpio-gate-clock
+
+ clocks:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 0
+
+ enable-gpios:
+ description: GPIO reference for enabling and disabling the clock.
+ maxItems: 1
+
+required:
+ - compatible
+ - '#clock-cells'
+ - enable-gpios
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ clock {
+ compatible = "gpio-gate-clock";
+ clocks = <&parentclk>;
+ #clock-cells = <0>;
+ enable-gpios = <&gpio 1 GPIO_ACTIVE_HIGH>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/idt,versaclock5.yaml b/Documentation/devicetree/bindings/clock/idt,versaclock5.yaml
index ffd6ae0eed64..a2c6eea9871d 100644
--- a/Documentation/devicetree/bindings/clock/idt,versaclock5.yaml
+++ b/Documentation/devicetree/bindings/clock/idt,versaclock5.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/idt,versaclock5.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Binding for IDT VersaClock 5 and 6 programmable I2C clock generators
+title: IDT VersaClock 5 and 6 programmable I2C clock generators
description: |
The IDT VersaClock 5 and VersaClock 6 are programmable I2C
@@ -45,7 +45,7 @@ description: |
The case where SH and SP are both 1 is likely not very interesting.
maintainers:
- - Luca Ceresoli <luca@lucaceresoli.net>
+ - Luca Ceresoli <luca.ceresoli@bootlin.com>
properties:
compatible:
@@ -54,8 +54,10 @@ properties:
- idt,5p49v5925
- idt,5p49v5933
- idt,5p49v5935
+ - idt,5p49v60
- idt,5p49v6901
- idt,5p49v6965
+ - idt,5p49v6975
reg:
description: I2C device address
@@ -108,7 +110,7 @@ patternProperties:
properties:
idt,mode:
description:
- The output drive mode. Values defined in dt-bindings/clk/versaclock.h
+ The output drive mode. Values defined in dt-bindings/clock/versaclock.h
$ref: /schemas/types.yaml#/definitions/uint32
minimum: 0
maximum: 6
@@ -134,6 +136,7 @@ allOf:
enum:
- idt,5p49v5933
- idt,5p49v5935
+ - idt,5p49v6975
then:
# Devices with builtin crystal + optional external input
properties:
@@ -151,7 +154,7 @@ additionalProperties: false
examples:
- |
- #include <dt-bindings/clk/versaclock.h>
+ #include <dt-bindings/clock/versaclock.h>
/* 25MHz reference crystal */
ref25: ref25m {
@@ -191,11 +194,4 @@ examples:
};
};
- /* Consumer referencing the 5P49V5923 pin OUT1 */
- consumer {
- /* ... */
- clocks = <&vc5 1>;
- /* ... */
- };
-
...
diff --git a/Documentation/devicetree/bindings/clock/imx1-clock.yaml b/Documentation/devicetree/bindings/clock/imx1-clock.yaml
index f4833a29b79e..7ade4c32aff3 100644
--- a/Documentation/devicetree/bindings/clock/imx1-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx1-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx1-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX1 CPUs
+title: Freescale i.MX1 CPUs Clock Controller
maintainers:
- Alexander Shiyan <shc_work@mail.ru>
@@ -40,12 +40,3 @@ examples:
compatible = "fsl,imx1-ccm";
reg = <0x0021b000 0x1000>;
};
-
- pwm@208000 {
- #pwm-cells = <2>;
- compatible = "fsl,imx1-pwm";
- reg = <0x00208000 0x1000>;
- interrupts = <34>;
- clocks = <&clks IMX1_CLK_DUMMY>, <&clks IMX1_CLK_PER1>;
- clock-names = "ipg", "per";
- };
diff --git a/Documentation/devicetree/bindings/clock/imx21-clock.yaml b/Documentation/devicetree/bindings/clock/imx21-clock.yaml
index 518ad9a4733c..79cc843703ec 100644
--- a/Documentation/devicetree/bindings/clock/imx21-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx21-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx21-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX21
+title: Freescale i.MX21 Clock Controller
maintainers:
- Alexander Shiyan <shc_work@mail.ru>
@@ -40,12 +40,3 @@ examples:
reg = <0x10027000 0x800>;
#clock-cells = <1>;
};
-
- serial@1000a000 {
- compatible = "fsl,imx21-uart";
- reg = <0x1000a000 0x1000>;
- interrupts = <20>;
- clocks = <&clks IMX21_CLK_UART1_IPG_GATE>,
- <&clks IMX21_CLK_PER1>;
- clock-names = "ipg", "per";
- };
diff --git a/Documentation/devicetree/bindings/clock/imx23-clock.yaml b/Documentation/devicetree/bindings/clock/imx23-clock.yaml
index 5e296a00e14f..5e71c9219500 100644
--- a/Documentation/devicetree/bindings/clock/imx23-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx23-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx23-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX23
+title: Freescale i.MX23 Clock Controller
maintainers:
- Shawn Guo <shawnguo@kernel.org>
@@ -83,12 +83,3 @@ examples:
reg = <0x80040000 0x2000>;
#clock-cells = <1>;
};
-
- serial@8006c000 {
- compatible = "fsl,imx23-auart";
- reg = <0x8006c000 0x2000>;
- interrupts = <24>;
- clocks = <&clks 32>;
- dmas = <&dma_apbx 6>, <&dma_apbx 7>;
- dma-names = "rx", "tx";
- };
diff --git a/Documentation/devicetree/bindings/clock/imx25-clock.yaml b/Documentation/devicetree/bindings/clock/imx25-clock.yaml
index 2a2b10778e72..c626a158590e 100644
--- a/Documentation/devicetree/bindings/clock/imx25-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx25-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx25-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX25
+title: Freescale i.MX25 Clock Controller
maintainers:
- Sascha Hauer <s.hauer@pengutronix.de>
@@ -176,11 +176,3 @@ examples:
interrupts = <31>;
#clock-cells = <1>;
};
-
- serial@43f90000 {
- compatible = "fsl,imx25-uart", "fsl,imx21-uart";
- reg = <0x43f90000 0x4000>;
- interrupts = <45>;
- clocks = <&clks 79>, <&clks 50>;
- clock-names = "ipg", "per";
- };
diff --git a/Documentation/devicetree/bindings/clock/imx27-clock.yaml b/Documentation/devicetree/bindings/clock/imx27-clock.yaml
index 160268f24487..71d78a0b551f 100644
--- a/Documentation/devicetree/bindings/clock/imx27-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx27-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx27-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX27
+title: Freescale i.MX27 Clock Controller
maintainers:
- Fabio Estevam <festevam@gmail.com>
@@ -44,12 +44,3 @@ examples:
interrupts = <31>;
#clock-cells = <1>;
};
-
- serial@1000a000 {
- compatible = "fsl,imx27-uart", "fsl,imx21-uart";
- reg = <0x1000a000 0x1000>;
- interrupts = <20>;
- clocks = <&clks IMX27_CLK_UART1_IPG_GATE>,
- <&clks IMX27_CLK_PER1_GATE>;
- clock-names = "ipg", "per";
- };
diff --git a/Documentation/devicetree/bindings/clock/imx28-clock.yaml b/Documentation/devicetree/bindings/clock/imx28-clock.yaml
index f831b780f951..4aaad7b9c66e 100644
--- a/Documentation/devicetree/bindings/clock/imx28-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx28-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx28-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX28
+title: Freescale i.MX28 Clock Controller
maintainers:
- Shawn Guo <shawnguo@kernel.org>
@@ -106,12 +106,3 @@ examples:
reg = <0x80040000 0x2000>;
#clock-cells = <1>;
};
-
- serial@8006a000 {
- compatible = "fsl,imx28-auart";
- reg = <0x8006a000 0x2000>;
- interrupts = <112>;
- dmas = <&dma_apbx 8>, <&dma_apbx 9>;
- dma-names = "rx", "tx";
- clocks = <&clks 45>;
- };
diff --git a/Documentation/devicetree/bindings/clock/imx31-clock.yaml b/Documentation/devicetree/bindings/clock/imx31-clock.yaml
index d2336261c922..50a8498eef8a 100644
--- a/Documentation/devicetree/bindings/clock/imx31-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx31-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx31-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX31
+title: Freescale i.MX31 Clock Controller
maintainers:
- Fabio Estevam <festevam@gmail.com>
@@ -110,11 +110,3 @@ examples:
interrupts = <31>, <53>;
#clock-cells = <1>;
};
-
- serial@43f90000 {
- compatible = "fsl,imx31-uart", "fsl,imx21-uart";
- reg = <0x43f90000 0x4000>;
- interrupts = <45>;
- clocks = <&clks 10>, <&clks 30>;
- clock-names = "ipg", "per";
- };
diff --git a/Documentation/devicetree/bindings/clock/imx35-clock.yaml b/Documentation/devicetree/bindings/clock/imx35-clock.yaml
index 3e20ccaf8131..c063369de3ec 100644
--- a/Documentation/devicetree/bindings/clock/imx35-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx35-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx35-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX35
+title: Freescale i.MX35 Clock Controller
maintainers:
- Steffen Trumtrar <s.trumtrar@pengutronix.de>
@@ -129,11 +129,3 @@ examples:
interrupts = <31>;
#clock-cells = <1>;
};
-
- mmc@53fb4000 {
- compatible = "fsl,imx35-esdhc";
- reg = <0x53fb4000 0x4000>;
- interrupts = <7>;
- clocks = <&clks 9>, <&clks 8>, <&clks 43>;
- clock-names = "ipg", "ahb", "per";
- };
diff --git a/Documentation/devicetree/bindings/clock/imx5-clock.yaml b/Documentation/devicetree/bindings/clock/imx5-clock.yaml
index b1740d7abe68..423c0142c1d3 100644
--- a/Documentation/devicetree/bindings/clock/imx5-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx5-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx5-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX5
+title: Freescale i.MX5 Clock Controller
maintainers:
- Fabio Estevam <festevam@gmail.com>
@@ -55,11 +55,4 @@ examples:
<0 72 IRQ_TYPE_LEVEL_HIGH>;
#clock-cells = <1>;
};
-
- can@53fc8000 {
- compatible = "fsl,imx53-flexcan", "fsl,imx25-flexcan";
- reg = <0x53fc8000 0x4000>;
- interrupts = <82>;
- clocks = <&clks IMX5_CLK_CAN1_IPG_GATE>, <&clks IMX5_CLK_CAN1_SERIAL_GATE>;
- clock-names = "ipg", "per";
- };
+...
diff --git a/Documentation/devicetree/bindings/clock/imx6q-clock.yaml b/Documentation/devicetree/bindings/clock/imx6q-clock.yaml
index 4f4637eddb8b..bae4fcb3aacc 100644
--- a/Documentation/devicetree/bindings/clock/imx6q-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx6q-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx6q-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX6 Quad
+title: Freescale i.MX6 Quad Clock Controller
maintainers:
- Anson Huang <Anson.Huang@nxp.com>
diff --git a/Documentation/devicetree/bindings/clock/imx6sl-clock.yaml b/Documentation/devicetree/bindings/clock/imx6sl-clock.yaml
index b83c8f43d664..c85ff6ea3d24 100644
--- a/Documentation/devicetree/bindings/clock/imx6sl-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx6sl-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx6sl-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX6 SoloLite
+title: Freescale i.MX6 SoloLite Clock Controller
maintainers:
- Anson Huang <Anson.Huang@nxp.com>
diff --git a/Documentation/devicetree/bindings/clock/imx6sll-clock.yaml b/Documentation/devicetree/bindings/clock/imx6sll-clock.yaml
index 484894a4b23f..6b549ed1493c 100644
--- a/Documentation/devicetree/bindings/clock/imx6sll-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx6sll-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx6sll-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX6 SLL
+title: Freescale i.MX6 SLL Clock Controller
maintainers:
- Anson Huang <Anson.Huang@nxp.com>
diff --git a/Documentation/devicetree/bindings/clock/imx6sx-clock.yaml b/Documentation/devicetree/bindings/clock/imx6sx-clock.yaml
index e6c795657c24..55dcad18b7c6 100644
--- a/Documentation/devicetree/bindings/clock/imx6sx-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx6sx-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx6sx-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX6 SoloX
+title: Freescale i.MX6 SoloX Clock Controller
maintainers:
- Anson Huang <Anson.Huang@nxp.com>
diff --git a/Documentation/devicetree/bindings/clock/imx6ul-clock.yaml b/Documentation/devicetree/bindings/clock/imx6ul-clock.yaml
index 6a51a3f51cd9..be54d4df5afa 100644
--- a/Documentation/devicetree/bindings/clock/imx6ul-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx6ul-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx6ul-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX6 UltraLite
+title: Freescale i.MX6 UltraLite Clock Controller
maintainers:
- Anson Huang <Anson.Huang@nxp.com>
diff --git a/Documentation/devicetree/bindings/clock/imx7d-clock.yaml b/Documentation/devicetree/bindings/clock/imx7d-clock.yaml
index cefb61db01a8..e7d8427e4957 100644
--- a/Documentation/devicetree/bindings/clock/imx7d-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx7d-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx7d-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX7 Dual
+title: Freescale i.MX7 Dual Clock Controller
maintainers:
- Frank Li <Frank.Li@nxp.com>
diff --git a/Documentation/devicetree/bindings/clock/imx7ulp-pcc-clock.yaml b/Documentation/devicetree/bindings/clock/imx7ulp-pcc-clock.yaml
index 7caf5cee9199..76842038f52e 100644
--- a/Documentation/devicetree/bindings/clock/imx7ulp-pcc-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx7ulp-pcc-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx7ulp-pcc-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX7ULP Peripheral Clock Control (PCC) modules
+title: Freescale i.MX7ULP Peripheral Clock Control (PCC) modules Clock Controller
maintainers:
- A.s. Dong <aisheng.dong@nxp.com>
@@ -108,14 +108,3 @@ examples:
"upll", "sosc_bus_clk", "firc_bus_clk",
"rosc", "spll_bus_clk";
};
-
- mmc@40380000 {
- compatible = "fsl,imx7ulp-usdhc";
- reg = <0x40380000 0x10000>;
- interrupts = <GIC_SPI 43 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&scg1 IMX7ULP_CLK_NIC1_BUS_DIV>,
- <&scg1 IMX7ULP_CLK_NIC1_DIV>,
- <&pcc2 IMX7ULP_CLK_USDHC1>;
- clock-names ="ipg", "ahb", "per";
- bus-width = <4>;
- };
diff --git a/Documentation/devicetree/bindings/clock/imx7ulp-scg-clock.yaml b/Documentation/devicetree/bindings/clock/imx7ulp-scg-clock.yaml
index ee8efb4ed599..5e25bc6d1372 100644
--- a/Documentation/devicetree/bindings/clock/imx7ulp-scg-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx7ulp-scg-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx7ulp-scg-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for Freescale i.MX7ULP System Clock Generation (SCG) modules
+title: Freescale i.MX7ULP System Clock Generation (SCG) modules Clock Controller
maintainers:
- A.s. Dong <aisheng.dong@nxp.com>
@@ -86,14 +86,3 @@ examples:
"firc", "upll";
#clock-cells = <1>;
};
-
- mmc@40380000 {
- compatible = "fsl,imx7ulp-usdhc";
- reg = <0x40380000 0x10000>;
- interrupts = <GIC_SPI 43 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&scg1 IMX7ULP_CLK_NIC1_BUS_DIV>,
- <&scg1 IMX7ULP_CLK_NIC1_DIV>,
- <&pcc2 IMX7ULP_CLK_USDHC1>;
- clock-names ="ipg", "ahb", "per";
- bus-width = <4>;
- };
diff --git a/Documentation/devicetree/bindings/clock/imx8m-clock.yaml b/Documentation/devicetree/bindings/clock/imx8m-clock.yaml
index 625f573a7b90..0dbc1433fede 100644
--- a/Documentation/devicetree/bindings/clock/imx8m-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/imx8m-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx8m-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: NXP i.MX8M Family Clock Control Module Binding
+title: NXP i.MX8M Family Clock Control Module
maintainers:
- Anson Huang <Anson.Huang@nxp.com>
@@ -55,8 +55,6 @@ allOf:
then:
properties:
clocks:
- minItems: 7
- maxItems: 7
items:
- description: 32k osc
- description: 25m osc
@@ -66,8 +64,6 @@ allOf:
- description: ext3 clock input
- description: ext4 clock input
clock-names:
- minItems: 7
- maxItems: 7
items:
- const: ckil
- const: osc_25m
@@ -112,7 +108,7 @@ examples:
};
- |
- clock-controller@30390000 {
+ clock-controller@30380000 {
compatible = "fsl,imx8mq-ccm";
reg = <0x30380000 0x10000>;
#clock-cells = <1>;
diff --git a/Documentation/devicetree/bindings/clock/imx8mp-audiomix.yaml b/Documentation/devicetree/bindings/clock/imx8mp-audiomix.yaml
new file mode 100644
index 000000000000..ff9600474df2
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/imx8mp-audiomix.yaml
@@ -0,0 +1,79 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/imx8mp-audiomix.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NXP i.MX8MP AudioMIX Block Control
+
+maintainers:
+ - Marek Vasut <marex@denx.de>
+
+description: |
+ NXP i.MX8M Plus AudioMIX is a dedicated clock muxing and gating IP
+ used to control audio-related clocks on the SoC.
+
+properties:
+ compatible:
+ const: fsl,imx8mp-audio-blk-ctrl
+
+ reg:
+ maxItems: 1
+
+ power-domains:
+ maxItems: 1
+
+ clocks:
+ minItems: 7
+ maxItems: 7
+
+ clock-names:
+ items:
+ - const: ahb
+ - const: sai1
+ - const: sai2
+ - const: sai3
+ - const: sai5
+ - const: sai6
+ - const: sai7
+
+ '#clock-cells':
+ const: 1
+ description:
+ The clock consumer should specify the desired clock by having the clock
+ ID in its "clocks" phandle cell. See include/dt-bindings/clock/imx8mp-clock.h
+ for the full list of i.MX8MP IMX8MP_CLK_AUDIOMIX_ clock IDs.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - power-domains
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ # Clock Control Module node:
+ - |
+ #include <dt-bindings/clock/imx8mp-clock.h>
+
+ clock-controller@30e20000 {
+ compatible = "fsl,imx8mp-audio-blk-ctrl";
+ reg = <0x30e20000 0x10000>;
+ #clock-cells = <1>;
+ clocks = <&clk IMX8MP_CLK_AUDIO_ROOT>,
+ <&clk IMX8MP_CLK_SAI1>,
+ <&clk IMX8MP_CLK_SAI2>,
+ <&clk IMX8MP_CLK_SAI3>,
+ <&clk IMX8MP_CLK_SAI5>,
+ <&clk IMX8MP_CLK_SAI6>,
+ <&clk IMX8MP_CLK_SAI7>;
+ clock-names = "ahb",
+ "sai1", "sai2", "sai3",
+ "sai5", "sai6", "sai7";
+ power-domains = <&pgc_audio>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/clock/imx8qxp-lpcg.yaml b/Documentation/devicetree/bindings/clock/imx8qxp-lpcg.yaml
index 0f6fe365ebf3..b207f95361b2 100644
--- a/Documentation/devicetree/bindings/clock/imx8qxp-lpcg.yaml
+++ b/Documentation/devicetree/bindings/clock/imx8qxp-lpcg.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/imx8qxp-lpcg.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: NXP i.MX8QXP LPCG (Low-Power Clock Gating) Clock bindings
+title: NXP i.MX8QXP LPCG (Low-Power Clock Gating) Clock
maintainers:
- Aisheng Dong <aisheng.dong@nxp.com>
@@ -101,14 +101,3 @@ examples:
"sdhc0_lpcg_ahb_clk";
power-domains = <&pd IMX_SC_R_SDHC_0>;
};
-
- mmc@5b010000 {
- compatible = "fsl,imx8qxp-usdhc", "fsl,imx7d-usdhc";
- interrupts = <GIC_SPI 232 IRQ_TYPE_LEVEL_HIGH>;
- reg = <0x5b010000 0x10000>;
- clocks = <&sdhc0_lpcg IMX_LPCG_CLK_4>,
- <&sdhc0_lpcg IMX_LPCG_CLK_5>,
- <&sdhc0_lpcg IMX_LPCG_CLK_0>;
- clock-names = "ipg", "ahb", "per";
- power-domains = <&pd IMX_SC_R_SDHC_0>;
- };
diff --git a/Documentation/devicetree/bindings/clock/imx8ulp-cgc-clock.yaml b/Documentation/devicetree/bindings/clock/imx8ulp-cgc-clock.yaml
new file mode 100644
index 000000000000..68a60cdc19af
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/imx8ulp-cgc-clock.yaml
@@ -0,0 +1,43 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/imx8ulp-cgc-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NXP i.MX8ULP Clock Generation & Control (CGC) Module
+
+maintainers:
+ - Jacky Bai <ping.bai@nxp.com>
+
+description: |
+ On i.MX8ULP, clock source generation, distribution and management are
+ under the control of several CGC & PCC modules. The CGC modules generate
+ and distribute clocks on the device.
+
+properties:
+ compatible:
+ enum:
+ - fsl,imx8ulp-cgc1
+ - fsl,imx8ulp-cgc2
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ # Clock Generation & Control Module node:
+ - |
+ clock-controller@292c0000 {
+ compatible = "fsl,imx8ulp-cgc1";
+ reg = <0x292c0000 0x10000>;
+ #clock-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/imx8ulp-pcc-clock.yaml b/Documentation/devicetree/bindings/clock/imx8ulp-pcc-clock.yaml
new file mode 100644
index 000000000000..d0b0792fe7ba
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/imx8ulp-pcc-clock.yaml
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/imx8ulp-pcc-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NXP i.MX8ULP Peripheral Clock Controller(PCC) Module
+
+maintainers:
+ - Jacky Bai <ping.bai@nxp.com>
+
+description: |
+ On i.MX8ULP, clock source generation, distribution and management are
+ under the control of several CGC and PCC modules. The PCC modules control
+ software reset, clock selection, optional division and clock gating mode
+ for peripherals.
+
+properties:
+ compatible:
+ enum:
+ - fsl,imx8ulp-pcc3
+ - fsl,imx8ulp-pcc4
+ - fsl,imx8ulp-pcc5
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+ - '#reset-cells'
+
+additionalProperties: false
+
+examples:
+ # Peripheral Clock Control Module node:
+ - |
+ clock-controller@292d0000 {
+ compatible = "fsl,imx8ulp-pcc3";
+ reg = <0x292d0000 0x10000>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
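Because the PCC exposes both #clock-cells and #reset-cells, a peripheral typically references the same controller for its clock and its reset. A minimal sketch, assuming a pcc3 label on the node above; the LPUART clock and reset indices are placeholders, and the real IDs live in the i.MX8ULP binding headers:

    serial@293a0000 {
        /* illustrative consumer; clock and reset indices are placeholders */
        clocks = <&pcc3 IMX8ULP_CLK_LPUART5>;
        clock-names = "ipg";
        resets = <&pcc3 IMX8ULP_CLK_LPUART5>;
    };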
diff --git a/Documentation/devicetree/bindings/clock/imx93-clock.yaml b/Documentation/devicetree/bindings/clock/imx93-clock.yaml
new file mode 100644
index 000000000000..ccb53c6b96c1
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/imx93-clock.yaml
@@ -0,0 +1,62 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/imx93-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NXP i.MX93 Clock Control Module
+
+maintainers:
+ - Peng Fan <peng.fan@nxp.com>
+
+description: |
+ The i.MX93 clock control module is an integrated clock controller that
+ includes clock generation and gating and supplies clocks to all modules.
+
+properties:
+ compatible:
+ enum:
+ - fsl,imx93-ccm
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ description:
+ Specifies the external clocks used by the CCM module.
+ items:
+ - description: 32k osc
+ - description: 24m osc
+ - description: ext1 clock input
+
+ clock-names:
+ description:
+ Specifies the names of the external clocks used by the CCM module.
+ items:
+ - const: osc_32k
+ - const: osc_24m
+ - const: clk_ext1
+
+ '#clock-cells':
+ const: 1
+ description:
+ See include/dt-bindings/clock/imx93-clock.h for the full list of
+ i.MX93 clock IDs.
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ # Clock Control Module node:
+ - |
+ clock-controller@44450000 {
+ compatible = "fsl,imx93-ccm";
+ reg = <0x44450000 0x10000>;
+ #clock-cells = <1>;
+ };
+
+...
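A consumer then selects one of the CCM outputs by ID; a minimal sketch against the controller above, assuming a clk label on its node and the IMX93_CLK_LPUART1_GATE ID from imx93-clock.h (the serial node itself is illustrative):

    serial@44380000 {
        /* illustrative consumer; the ID is assumed from imx93-clock.h */
        clocks = <&clk IMX93_CLK_LPUART1_GATE>;
        clock-names = "ipg";
    };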
diff --git a/Documentation/devicetree/bindings/clock/imxrt1050-clock.yaml b/Documentation/devicetree/bindings/clock/imxrt1050-clock.yaml
new file mode 100644
index 000000000000..777af4aad4b2
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/imxrt1050-clock.yaml
@@ -0,0 +1,59 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/imxrt1050-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale i.MXRT Clock Controller
+
+maintainers:
+ - Giulio Benetti <giulio.benetti@benettiengineering.com>
+ - Jesse Taube <Mr.Bossman075@gmail.com>
+
+description: |
+ The clock consumer should specify the desired clock by having the clock
+ ID in its "clocks" phandle cell. See include/dt-bindings/clock/imxrt*-clock.h
+ for the full list of i.MXRT clock IDs.
+
+properties:
+ compatible:
+ const: fsl,imxrt1050-ccm
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 2
+
+ clocks:
+ description: 24m osc
+ maxItems: 1
+
+ clock-names:
+ const: osc
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/imxrt1050-clock.h>
+
+ clks: clock-controller@400fc000 {
+ compatible = "fsl,imxrt1050-ccm";
+ reg = <0x400fc000 0x4000>;
+ interrupts = <95>, <96>;
+ clocks = <&osc>;
+ clock-names = "osc";
+ #clock-cells = <1>;
+ };
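As the description notes, consumers select a clock by its ID from imxrt*-clock.h; a minimal sketch against the clks label above, with an assumed IMXRT1050_CLK_LPUART1 ID and an illustrative serial node:

    serial@40184000 {
        /* illustrative consumer; ID assumed from imxrt1050-clock.h */
        clocks = <&clks IMXRT1050_CLK_LPUART1>;
        clock-names = "ipg";
    };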
diff --git a/Documentation/devicetree/bindings/clock/ingenic,cgu.yaml b/Documentation/devicetree/bindings/clock/ingenic,cgu.yaml
index 6e80dbc8b8b9..9e733b10c392 100644
--- a/Documentation/devicetree/bindings/clock/ingenic,cgu.yaml
+++ b/Documentation/devicetree/bindings/clock/ingenic,cgu.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/ingenic,cgu.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Ingenic SoCs CGU devicetree bindings
+title: Ingenic SoCs CGU
description: |
The CGU in an Ingenic SoC provides all the clocks generated on-chip. It
@@ -22,6 +22,7 @@ select:
enum:
- ingenic,jz4740-cgu
- ingenic,jz4725b-cgu
+ - ingenic,jz4755-cgu
- ingenic,jz4760-cgu
- ingenic,jz4760b-cgu
- ingenic,jz4770-cgu
@@ -51,6 +52,7 @@ properties:
- enum:
- ingenic,jz4740-cgu
- ingenic,jz4725b-cgu
+ - ingenic,jz4755-cgu
- ingenic,jz4760-cgu
- ingenic,jz4760b-cgu
- ingenic,jz4770-cgu
@@ -104,7 +106,7 @@ additionalProperties: false
examples:
- |
- #include <dt-bindings/clock/jz4770-cgu.h>
+ #include <dt-bindings/clock/ingenic,jz4770-cgu.h>
cgu: clock-controller@10000000 {
compatible = "ingenic,jz4770-cgu", "simple-mfd";
reg = <0x10000000 0x100>;
diff --git a/Documentation/devicetree/bindings/clock/intc_stratix10.txt b/Documentation/devicetree/bindings/clock/intc_stratix10.txt
deleted file mode 100644
index 9f4ec5cb5c6b..000000000000
--- a/Documentation/devicetree/bindings/clock/intc_stratix10.txt
+++ /dev/null
@@ -1,20 +0,0 @@
-Device Tree Clock bindings for Intel's SoCFPGA Stratix10 platform
-
-This binding uses the common clock binding[1].
-
-[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
-
-Required properties:
-- compatible : shall be
- "intel,stratix10-clkmgr"
-
-- reg : shall be the control register offset from CLOCK_MANAGER's base for the clock.
-
-- #clock-cells : from common clock binding, shall be set to 1.
-
-Example:
- clkmgr: clock-controller@ffd10000 {
- compatible = "intel,stratix10-clkmgr";
- reg = <0xffd10000 0x1000>;
- #clock-cells = <1>;
- };
diff --git a/Documentation/devicetree/bindings/clock/intel,agilex.yaml b/Documentation/devicetree/bindings/clock/intel,agilex.yaml
index cf5a9eb803e6..3745ba8dbd76 100644
--- a/Documentation/devicetree/bindings/clock/intel,agilex.yaml
+++ b/Documentation/devicetree/bindings/clock/intel,agilex.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/intel,agilex.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Intel SoCFPGA Agilex platform clock controller binding
+title: Intel SoCFPGA Agilex platform clock controller
maintainers:
- Dinh Nguyen <dinguyen@kernel.org>
diff --git a/Documentation/devicetree/bindings/clock/intel,cgu-lgm.yaml b/Documentation/devicetree/bindings/clock/intel,cgu-lgm.yaml
index f3e1a700a2ca..76609a390429 100644
--- a/Documentation/devicetree/bindings/clock/intel,cgu-lgm.yaml
+++ b/Documentation/devicetree/bindings/clock/intel,cgu-lgm.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/intel,cgu-lgm.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Intel Lightning Mountain SoC's Clock Controller(CGU) Binding
+title: Intel Lightning Mountain SoC's Clock Controller(CGU)
maintainers:
- Rahul Tanwar <rahul.tanwar@linux.intel.com>
diff --git a/Documentation/devicetree/bindings/clock/intel,easic-n5x.yaml b/Documentation/devicetree/bindings/clock/intel,easic-n5x.yaml
index 8f45976e946e..e000116a51a4 100644
--- a/Documentation/devicetree/bindings/clock/intel,easic-n5x.yaml
+++ b/Documentation/devicetree/bindings/clock/intel,easic-n5x.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/intel,easic-n5x.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Intel SoCFPGA eASIC N5X platform clock controller binding
+title: Intel SoCFPGA eASIC N5X platform clock controller
maintainers:
- Dinh Nguyen <dinguyen@kernel.org>
diff --git a/Documentation/devicetree/bindings/clock/intel,stratix10.yaml b/Documentation/devicetree/bindings/clock/intel,stratix10.yaml
new file mode 100644
index 000000000000..b4a8be213400
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/intel,stratix10.yaml
@@ -0,0 +1,35 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/intel,stratix10.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Intel SoCFPGA Stratix10 platform clock controller
+
+maintainers:
+ - Dinh Nguyen <dinguyen@kernel.org>
+
+properties:
+ compatible:
+ const: intel,stratix10-clkmgr
+
+ '#clock-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ clock-controller@ffd10000 {
+ compatible = "intel,stratix10-clkmgr";
+ reg = <0xffd10000 0x1000>;
+ #clock-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/loongson,ls1x-clk.yaml b/Documentation/devicetree/bindings/clock/loongson,ls1x-clk.yaml
new file mode 100644
index 000000000000..01561a0f35d5
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/loongson,ls1x-clk.yaml
@@ -0,0 +1,45 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/loongson,ls1x-clk.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Loongson-1 Clock Controller
+
+maintainers:
+ - Keguang Zhang <keguang.zhang@gmail.com>
+
+properties:
+ compatible:
+ enum:
+ - loongson,ls1b-clk
+ - loongson,ls1c-clk
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - "#clock-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ clkc: clock-controller@1fe78030 {
+ compatible = "loongson,ls1b-clk";
+ reg = <0x1fe78030 0x8>;
+
+ clocks = <&xtal>;
+ #clock-cells = <1>;
+ };
+
+...
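The single input clock (<&xtal> in the example) is an external oscillator that is not part of this binding; following the fixed-clock convention used by the loongson,ls2k-clk example below, it could be described as:

    xtal: clock-xtal {
        compatible = "fixed-clock";
        #clock-cells = <0>;
        clock-frequency = <33000000>; /* board-specific, shown as an example */
        clock-output-names = "xtal";
    };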
diff --git a/Documentation/devicetree/bindings/clock/loongson,ls2k-clk.yaml b/Documentation/devicetree/bindings/clock/loongson,ls2k-clk.yaml
new file mode 100644
index 000000000000..63a59015987e
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/loongson,ls2k-clk.yaml
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/loongson,ls2k-clk.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Loongson-2 SoC Clock Control Module
+
+maintainers:
+ - Yinbo Zhu <zhuyinbo@loongson.cn>
+
+description: |
+ The Loongson-2 SoC clock control module is an integrated clock controller
+ that generates and supplies clocks to all modules.
+
+properties:
+ compatible:
+ enum:
+ - loongson,ls2k-clk
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: 100m ref
+
+ clock-names:
+ items:
+ - const: ref_100m
+
+ '#clock-cells':
+ const: 1
+ description:
+ The clock consumer should specify the desired clock by having the clock
+ ID in its "clocks" phandle cell. See include/dt-bindings/clock/loongson,ls2k-clk.h
+ for the full list of Loongson-2 SoC clock IDs.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ ref_100m: clock-ref-100m {
+ compatible = "fixed-clock";
+ #clock-cells = <0>;
+ clock-frequency = <100000000>;
+ clock-output-names = "ref_100m";
+ };
+
+ clk: clock-controller@1fe00480 {
+ compatible = "loongson,ls2k-clk";
+ reg = <0x1fe00480 0x58>;
+ #clock-cells = <1>;
+ clocks = <&ref_100m>;
+ clock-names = "ref_100m";
+ };
diff --git a/Documentation/devicetree/bindings/clock/marvell,armada-3700-uart-clock.yaml b/Documentation/devicetree/bindings/clock/marvell,armada-3700-uart-clock.yaml
new file mode 100644
index 000000000000..175f5c8f2bc5
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/marvell,armada-3700-uart-clock.yaml
@@ -0,0 +1,59 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/marvell,armada-3700-uart-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+title: Marvell Armada 3720 UART clocks
+
+maintainers:
+ - Pali Rohár <pali@kernel.org>
+
+properties:
+ compatible:
+ const: marvell,armada-3700-uart-clock
+
+ reg:
+ items:
+ - description: UART Clock Control Register
+ - description: UART 2 Baud Rate Divisor Register
+
+ clocks:
+ description: |
+ List of parent clocks suitable for UART from the following set:
+ "TBG-A-P", "TBG-B-P", "TBG-A-S", "TBG-B-S", "xtal"
+ The UART clock can use any one from this set; when more are provided,
+ the kernel chooses and configures the most suitable one.
+ It is suggested to specify at least one TBG clock to achieve baud
+ rates above 230400, and also to specify the clock used by the
+ bootloader for the UART (most probably xtal) for a smooth boot log.
+
+ clock-names:
+ items:
+ - const: TBG-A-P
+ - const: TBG-B-P
+ - const: TBG-A-S
+ - const: TBG-B-S
+ - const: xtal
+ minItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ uartclk: clock-controller@12010 {
+ compatible = "marvell,armada-3700-uart-clock";
+ reg = <0x12010 0x4>, <0x12210 0x4>;
+ clocks = <&tbg 0>, <&tbg 1>, <&tbg 2>, <&tbg 3>, <&xtalclk>;
+ clock-names = "TBG-A-P", "TBG-B-P", "TBG-A-S", "TBG-B-S", "xtal";
+ #clock-cells = <1>;
+ };
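A UART node then consumes one of the outputs through the single clock cell; a minimal sketch, assuming cell 0 selects the first UART clock output (the cell assignment shown here is illustrative):

    uart0: serial@12000 {
        /* illustrative consumer of the first UART clock output */
        clocks = <&uartclk 0>;
    };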
diff --git a/Documentation/devicetree/bindings/clock/maxim,max77686.txt b/Documentation/devicetree/bindings/clock/maxim,max77686.txt
index 3472b461ca93..c10849efb444 100644
--- a/Documentation/devicetree/bindings/clock/maxim,max77686.txt
+++ b/Documentation/devicetree/bindings/clock/maxim,max77686.txt
@@ -49,7 +49,7 @@ Example:
max77686: max77686@9 {
compatible = "maxim,max77686";
interrupt-parent = <&wakeup_eint>;
- interrupts = <26 0>;
+ interrupts = <26 IRQ_TYPE_LEVEL_LOW>;
reg = <0x09>;
#clock-cells = <1>;
@@ -74,7 +74,7 @@ Example:
max77802: max77802@9 {
compatible = "maxim,max77802";
interrupt-parent = <&wakeup_eint>;
- interrupts = <26 0>;
+ interrupts = <26 IRQ_TYPE_LEVEL_LOW>;
reg = <0x09>;
#clock-cells = <1>;
diff --git a/Documentation/devicetree/bindings/clock/mediatek,apmixedsys.yaml b/Documentation/devicetree/bindings/clock/mediatek,apmixedsys.yaml
new file mode 100644
index 000000000000..372c1d744bc2
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/mediatek,apmixedsys.yaml
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/mediatek,apmixedsys.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek AP Mixedsys Controller
+
+maintainers:
+ - Michael Turquette <mturquette@baylibre.com>
+ - Stephen Boyd <sboyd@kernel.org>
+
+description:
+ The Mediatek apmixedsys controller provides PLLs to the system.
+ The clock values can be found in <dt-bindings/clock/mt*-clk.h>.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt6797-apmixedsys
+ - mediatek,mt7622-apmixedsys
+ - mediatek,mt7981-apmixedsys
+ - mediatek,mt7986-apmixedsys
+ - mediatek,mt8135-apmixedsys
+ - mediatek,mt8173-apmixedsys
+ - mediatek,mt8516-apmixedsys
+ - items:
+ - const: mediatek,mt7623-apmixedsys
+ - const: mediatek,mt2701-apmixedsys
+ - const: syscon
+ - items:
+ - enum:
+ - mediatek,mt2701-apmixedsys
+ - mediatek,mt2712-apmixedsys
+ - mediatek,mt6765-apmixedsys
+ - mediatek,mt6779-apmixedsys
+ - mediatek,mt6795-apmixedsys
+ - mediatek,mt7629-apmixedsys
+ - mediatek,mt8167-apmixedsys
+ - mediatek,mt8183-apmixedsys
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ apmixedsys: clock-controller@10209000 {
+ compatible = "mediatek,mt8173-apmixedsys";
+ reg = <0x10209000 0x1000>;
+ #clock-cells = <1>;
+ };
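Peripherals reference the apmixedsys PLLs with the IDs from the mt*-clk.h headers mentioned in the description; a minimal sketch, assuming the MT8173 CLK_APMIXED_MAINPLL ID and an illustrative consumer node:

    mmc@11230000 {
        /* illustrative consumer; ID assumed from dt-bindings/clock/mt8173-clk.h */
        clocks = <&apmixedsys CLK_APMIXED_MAINPLL>;
    };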
diff --git a/Documentation/devicetree/bindings/clock/mediatek,mt6795-clock.yaml b/Documentation/devicetree/bindings/clock/mediatek,mt6795-clock.yaml
new file mode 100644
index 000000000000..04469eabc8fa
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/mediatek,mt6795-clock.yaml
@@ -0,0 +1,66 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/mediatek,mt6795-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Functional Clock Controller for MT6795
+
+maintainers:
+ - AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
+ - Chun-Jie Chen <chun-jie.chen@mediatek.com>
+
+description: |
+ The clock architecture in MediaTek SoCs is as below:
+ PLLs --> dividers --> muxes --> clock gates
+
+ The devices provide clock gate control in different IP blocks.
+
+properties:
+ compatible:
+ enum:
+ - mediatek,mt6795-mfgcfg
+ - mediatek,mt6795-vdecsys
+ - mediatek,mt6795-vencsys
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ mfgcfg: clock-controller@13000000 {
+ compatible = "mediatek,mt6795-mfgcfg";
+ reg = <0 0x13000000 0 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ vdecsys: clock-controller@16000000 {
+ compatible = "mediatek,mt6795-vdecsys";
+ reg = <0 0x16000000 0 0x1000>;
+ #clock-cells = <1>;
+ };
+
+ vencsys: clock-controller@18000000 {
+ compatible = "mediatek,mt6795-vencsys";
+ reg = <0 0x18000000 0 0x1000>;
+ #clock-cells = <1>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/clock/mediatek,mt6795-sys-clock.yaml b/Documentation/devicetree/bindings/clock/mediatek,mt6795-sys-clock.yaml
new file mode 100644
index 000000000000..378b761237d3
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/mediatek,mt6795-sys-clock.yaml
@@ -0,0 +1,54 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/mediatek,mt6795-sys-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek System Clock Controller for MT6795
+
+maintainers:
+ - AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
+ - Chun-Jie Chen <chun-jie.chen@mediatek.com>
+
+description:
+ The Mediatek system clock controller provides various clocks and system
+ configuration functions, such as reset and bus protection, on MT6795.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt6795-apmixedsys
+ - mediatek,mt6795-infracfg
+ - mediatek,mt6795-pericfg
+ - mediatek,mt6795-topckgen
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ topckgen: clock-controller@10000000 {
+ compatible = "mediatek,mt6795-topckgen", "syscon";
+ reg = <0 0x10000000 0 0x1000>;
+ #clock-cells = <1>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/clock/mediatek,mt7621-sysc.yaml b/Documentation/devicetree/bindings/clock/mediatek,mt7621-sysc.yaml
index 915f84efd763..b42f0f5c11b7 100644
--- a/Documentation/devicetree/bindings/clock/mediatek,mt7621-sysc.yaml
+++ b/Documentation/devicetree/bindings/clock/mediatek,mt7621-sysc.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/mediatek,mt7621-sysc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: MT7621 Clock Device Tree Bindings
+title: MT7621 Clock
maintainers:
- Sergio Paracuellos <sergio.paracuellos@gmail.com>
@@ -22,6 +22,11 @@ description: |
The clocks are provided inside a system controller node.
+ This node is also a reset provider for all the peripherals.
+
+ Reset related bits are defined in:
+ [2]: <include/dt-bindings/reset/mt7621-reset.h>.
+
properties:
compatible:
items:
@@ -37,6 +42,12 @@ properties:
clocks.
const: 1
+ "#reset-cells":
+ description:
+ The first cell indicates the reset bit within the register, see
+ [2] for available resets.
+ const: 1
+
ralink,memctl:
$ref: /schemas/types.yaml#/definitions/phandle
description:
@@ -61,6 +72,7 @@ examples:
compatible = "mediatek,mt7621-sysc", "syscon";
reg = <0x0 0x100>;
#clock-cells = <1>;
+ #reset-cells = <1>;
ralink,memctl = <&memc>;
clock-output-names = "xtal", "cpu", "bus",
"50m", "125m", "150m",
diff --git a/Documentation/devicetree/bindings/clock/mediatek,mt8186-fhctl.yaml b/Documentation/devicetree/bindings/clock/mediatek,mt8186-fhctl.yaml
new file mode 100644
index 000000000000..d00327d12e1e
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/mediatek,mt8186-fhctl.yaml
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/mediatek,mt8186-fhctl.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek frequency hopping and spread spectrum clocking control
+
+maintainers:
+ - Edward-JW Yang <edward-jw.yang@mediatek.com>
+
+description: |
+ Frequency hopping control (FHCTL) is a piece of hardware that makes
+ some PLLs adopt a "hopping" mechanism to adjust their frequency.
+ Spread spectrum clocking (SSC) is another function provided by this hardware.
+
+properties:
+ compatible:
+ enum:
+ - mediatek,mt6795-fhctl
+ - mediatek,mt8173-fhctl
+ - mediatek,mt8186-fhctl
+ - mediatek,mt8192-fhctl
+ - mediatek,mt8195-fhctl
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ description: Phandles of the PLLs with FHCTL hardware capability.
+ minItems: 1
+ maxItems: 30
+
+ mediatek,hopping-ssc-percent:
+ description: The percentage of spread spectrum clocking for one PLL.
+ minItems: 1
+ maxItems: 30
+ items:
+ default: 0
+ minimum: 0
+ maximum: 8
+
+required:
+ - compatible
+ - reg
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/mt8186-clk.h>
+ fhctl: fhctl@1000ce00 {
+ compatible = "mediatek,mt8186-fhctl";
+ reg = <0x1000ce00 0x200>;
+ clocks = <&apmixedsys CLK_APMIXED_MSDCPLL>;
+ mediatek,hopping-ssc-percent = <3>;
+ };
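The entries of mediatek,hopping-ssc-percent are naturally read as pairing one-to-one with the entries of clocks, so a controller hopping two PLLs lists two percentages. A sketch extending the example above; the second PLL ID (CLK_APMIXED_MAINPLL) is assumed for illustration:

    fhctl@1000ce00 {
        compatible = "mediatek,mt8186-fhctl";
        reg = <0x1000ce00 0x200>;
        /* one SSC percentage per PLL listed in "clocks" */
        clocks = <&apmixedsys CLK_APMIXED_MSDCPLL>,
                 <&apmixedsys CLK_APMIXED_MAINPLL>;
        mediatek,hopping-ssc-percent = <3>, <0>;
    };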
diff --git a/Documentation/devicetree/bindings/clock/mediatek,mt8188-clock.yaml b/Documentation/devicetree/bindings/clock/mediatek,mt8188-clock.yaml
new file mode 100644
index 000000000000..d7214d97b2ba
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/mediatek,mt8188-clock.yaml
@@ -0,0 +1,71 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/mediatek,mt8188-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Functional Clock Controller for MT8188
+
+maintainers:
+ - Garmin Chang <garmin.chang@mediatek.com>
+
+description: |
+ The clock architecture in MediaTek SoCs is as below:
+ PLLs --> dividers --> muxes --> clock gates
+
+ The devices provide clock gate control in different IP blocks.
+
+properties:
+ compatible:
+ enum:
+ - mediatek,mt8188-adsp-audio26m
+ - mediatek,mt8188-camsys
+ - mediatek,mt8188-camsys-rawa
+ - mediatek,mt8188-camsys-rawb
+ - mediatek,mt8188-camsys-yuva
+ - mediatek,mt8188-camsys-yuvb
+ - mediatek,mt8188-ccusys
+ - mediatek,mt8188-imgsys
+ - mediatek,mt8188-imgsys-wpe1
+ - mediatek,mt8188-imgsys-wpe2
+ - mediatek,mt8188-imgsys-wpe3
+ - mediatek,mt8188-imgsys1-dip-nr
+ - mediatek,mt8188-imgsys1-dip-top
+ - mediatek,mt8188-imp-iic-wrap-c
+ - mediatek,mt8188-imp-iic-wrap-en
+ - mediatek,mt8188-imp-iic-wrap-w
+ - mediatek,mt8188-ipesys
+ - mediatek,mt8188-mfgcfg
+ - mediatek,mt8188-vdecsys
+ - mediatek,mt8188-vdecsys-soc
+ - mediatek,mt8188-vencsys
+ - mediatek,mt8188-vppsys0
+ - mediatek,mt8188-vppsys1
+ - mediatek,mt8188-wpesys
+ - mediatek,mt8188-wpesys-vpp0
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ clock-controller@11283000 {
+ compatible = "mediatek,mt8188-imp-iic-wrap-c";
+ reg = <0x11283000 0x1000>;
+ #clock-cells = <1>;
+ };
+
diff --git a/Documentation/devicetree/bindings/clock/mediatek,mt8188-sys-clock.yaml b/Documentation/devicetree/bindings/clock/mediatek,mt8188-sys-clock.yaml
new file mode 100644
index 000000000000..4cf8d3af9803
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/mediatek,mt8188-sys-clock.yaml
@@ -0,0 +1,55 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/mediatek,mt8188-sys-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek System Clock Controller for MT8188
+
+maintainers:
+ - Garmin Chang <garmin.chang@mediatek.com>
+
+description: |
+ The clock architecture in MediaTek SoCs is as below:
+ PLLs --> dividers --> muxes --> clock gates
+
+ The apmixedsys provides most of the PLLs, which are generated from the
+ SoC's 26 MHz clock.
+ The topckgen provides the dividers and muxes which supply the clock
+ source to other IP blocks.
+ The infracfg_ao provides clock gates in the peripheral and
+ infrastructure IP blocks.
+ The mcusys provides the mux control to select the clock source in the AP MCU.
+ The device nodes also provide system control capability for configuration.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt8188-apmixedsys
+ - mediatek,mt8188-infracfg-ao
+ - mediatek,mt8188-pericfg-ao
+ - mediatek,mt8188-topckgen
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ clock-controller@10000000 {
+ compatible = "mediatek,mt8188-topckgen", "syscon";
+ reg = <0x10000000 0x1000>;
+ #clock-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/mediatek,mt8365-clock.yaml b/Documentation/devicetree/bindings/clock/mediatek,mt8365-clock.yaml
new file mode 100644
index 000000000000..b327ecb4e524
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/mediatek,mt8365-clock.yaml
@@ -0,0 +1,42 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/mediatek,mt8365-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Functional Clock Controller for MT8365
+
+maintainers:
+ - Markus Schneider-Pargmann <msp@baylibre.com>
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt8365-apu
+ - mediatek,mt8365-imgsys
+ - mediatek,mt8365-mfgcfg
+ - mediatek,mt8365-vdecsys
+ - mediatek,mt8365-vencsys
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ apu: clock-controller@19020000 {
+ compatible = "mediatek,mt8365-apu", "syscon";
+ reg = <0x19020000 0x1000>;
+ #clock-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/mediatek,mt8365-sys-clock.yaml b/Documentation/devicetree/bindings/clock/mediatek,mt8365-sys-clock.yaml
new file mode 100644
index 000000000000..643f84660c8e
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/mediatek,mt8365-sys-clock.yaml
@@ -0,0 +1,47 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/mediatek,mt8365-sys-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek System Clock Controller for MT8365
+
+maintainers:
+ - Markus Schneider-Pargmann <msp@baylibre.com>
+
+description:
+ The apmixedsys module provides most of the PLLs, which are generated
+ from the SoC's 26 MHz clock.
+ The topckgen provides the dividers and muxes which supply the clock
+ source to other IP blocks.
+ The infracfg_ao and pericfg_ao provide clock gates in the peripheral
+ and infrastructure IP blocks.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt8365-topckgen
+ - mediatek,mt8365-infracfg
+ - mediatek,mt8365-apmixedsys
+ - mediatek,mt8365-pericfg
+ - mediatek,mt8365-mcucfg
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ topckgen: clock-controller@10000000 {
+ compatible = "mediatek,mt8365-topckgen", "syscon";
+ reg = <0x10000000 0x1000>;
+ #clock-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/mediatek,topckgen.yaml b/Documentation/devicetree/bindings/clock/mediatek,topckgen.yaml
new file mode 100644
index 000000000000..6d087ded7437
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/mediatek,topckgen.yaml
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/mediatek,topckgen.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Top Clock Generator Controller
+
+maintainers:
+ - Michael Turquette <mturquette@baylibre.com>
+ - Stephen Boyd <sboyd@kernel.org>
+
+description:
+ The Mediatek topckgen controller provides various clocks to the system.
+ The clock values can be found in <dt-bindings/clock/mt*-clk.h>.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt6797-topckgen
+ - mediatek,mt7622-topckgen
+ - mediatek,mt8135-topckgen
+ - mediatek,mt8173-topckgen
+ - mediatek,mt8516-topckgen
+ - items:
+ - const: mediatek,mt7623-topckgen
+ - const: mediatek,mt2701-topckgen
+ - const: syscon
+ - items:
+ - enum:
+ - mediatek,mt2701-topckgen
+ - mediatek,mt2712-topckgen
+ - mediatek,mt6765-topckgen
+ - mediatek,mt6779-topckgen
+ - mediatek,mt6795-topckgen
+ - mediatek,mt7629-topckgen
+ - mediatek,mt7981-topckgen
+ - mediatek,mt7986-topckgen
+ - mediatek,mt8167-topckgen
+ - mediatek,mt8183-topckgen
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ topckgen: clock-controller@10000000 {
+ compatible = "mediatek,mt8173-topckgen";
+ reg = <0x10000000 0x1000>;
+ #clock-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/microchip,lan966x-gck.yaml b/Documentation/devicetree/bindings/clock/microchip,lan966x-gck.yaml
new file mode 100644
index 000000000000..df2bec188706
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/microchip,lan966x-gck.yaml
@@ -0,0 +1,60 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/microchip,lan966x-gck.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Microchip LAN966X Generic Clock Controller
+
+maintainers:
+ - Kavyasree Kotagiri <kavyasree.kotagiri@microchip.com>
+
+description: |
+ The LAN966X Generic clock controller contains 3 PLLs: cpu_clk,
+ ddr_clk and sys_clk. This clock controller generates and supplies
+ clocks to various peripherals within the SoC.
+
+properties:
+ compatible:
+ const: microchip,lan966x-gck
+
+ reg:
+ minItems: 1
+ items:
+ - description: Generic clock registers
+ - description: Optional gate clock registers
+
+ clocks:
+ items:
+ - description: CPU clock source
+ - description: DDR clock source
+ - description: System clock source
+
+ clock-names:
+ items:
+ - const: cpu
+ - const: ddr
+ - const: sys
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ clks: clock-controller@e00c00a8 {
+ compatible = "microchip,lan966x-gck";
+ #clock-cells = <1>;
+ clocks = <&cpu_clk>, <&ddr_clk>, <&sys_clk>;
+ clock-names = "cpu", "ddr", "sys";
+ reg = <0xe00c00a8 0x38>;
+ };
+...
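Consumers select one of the generic clocks by ID; a minimal sketch against the clks label above, with the GCK_ID_FLEXCOM0 ID assumed from dt-bindings/clock/microchip,lan966x.h:

    flexcom@e0040000 {
        /* illustrative consumer; ID assumed from microchip,lan966x.h */
        clocks = <&clks GCK_ID_FLEXCOM0>;
    };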
diff --git a/Documentation/devicetree/bindings/clock/microchip,mpfs-ccc.yaml b/Documentation/devicetree/bindings/clock/microchip,mpfs-ccc.yaml
new file mode 100644
index 000000000000..f1770360798f
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/microchip,mpfs-ccc.yaml
@@ -0,0 +1,80 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/microchip,mpfs-ccc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Microchip PolarFire SoC Fabric Clock Conditioning Circuitry
+
+maintainers:
+ - Conor Dooley <conor.dooley@microchip.com>
+
+description: |
+ The Microchip PolarFire SoC has 4 Clock Conditioning Circuitry blocks.
+ Each of these blocks contains two PLLs and two DLLs, and the blocks are
+ located in the four corners of the FPGA. For more information see
+ "PolarFire SoC FPGA Clocking Resources" at:
+ https://onlinedocs.microchip.com/pr/GUID-8F0CC4C0-0317-4262-89CA-CE7773ED1931-en-US-1/index.html
+
+properties:
+ compatible:
+ const: microchip,mpfs-ccc
+
+ reg:
+ items:
+ - description: PLL0's control registers
+ - description: PLL1's control registers
+ - description: DLL0's control registers
+ - description: DLL1's control registers
+
+ clocks:
+ description:
+ The CCC PLLs have two input clocks. Both clocks must be provided even
+ if they are identical.
+ minItems: 2
+ items:
+ - description: PLL0's refclk0
+ - description: PLL0's refclk1
+ - description: PLL1's refclk0
+ - description: PLL1's refclk1
+ - description: DLL0's refclk
+ - description: DLL1's refclk
+
+ clock-names:
+ minItems: 2
+ items:
+ - const: pll0_ref0
+ - const: pll0_ref1
+ - const: pll1_ref0
+ - const: pll1_ref1
+ - const: dll0_ref
+ - const: dll1_ref
+
+ '#clock-cells':
+ const: 1
+ description: |
+ The clock consumer should specify the desired clock by having the clock
+ ID in its "clocks" phandle cell.
+ See include/dt-bindings/clock/microchip,mpfs-clock.h for the full list of
+ PolarFire clock IDs.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ clock-controller@38100000 {
+ compatible = "microchip,mpfs-ccc";
+ reg = <0x38010000 0x1000>, <0x38020000 0x1000>,
+ <0x39010000 0x1000>, <0x39020000 0x1000>;
+ #clock-cells = <1>;
+ clocks = <&refclk_ccc>, <&refclk_ccc>, <&refclk_ccc>, <&refclk_ccc>,
+ <&refclk_ccc>, <&refclk_ccc>;
+ clock-names = "pll0_ref0", "pll0_ref1", "pll1_ref0", "pll1_ref1",
+ "dll0_ref", "dll1_ref";
+ };
diff --git a/Documentation/devicetree/bindings/clock/microchip,mpfs-clkcfg.yaml b/Documentation/devicetree/bindings/clock/microchip,mpfs-clkcfg.yaml
new file mode 100644
index 000000000000..e4e1c31267d2
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/microchip,mpfs-clkcfg.yaml
@@ -0,0 +1,80 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/microchip,mpfs-clkcfg.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Microchip PolarFire Clock Control Module
+
+maintainers:
+ - Daire McNamara <daire.mcnamara@microchip.com>
+
+description: |
+ Microchip PolarFire clock control (CLKCFG) is an integrated clock controller,
+ which gates and enables all peripheral clocks.
+
+ This device tree binding describes 33 gate clocks. Clocks are referenced by
+ user nodes by the CLKCFG node phandle and the clock index in the group, from
+ 0 to 32.
+
+properties:
+ compatible:
+ const: microchip,mpfs-clkcfg
+
+ reg:
+ items:
+ - description: |
+ clock config registers:
+ These registers contain enable, reset and divider tables for the cpu,
+ axi, ahb and rtc/mtimer reference clocks, as well as enable and reset
+ controls for the peripheral clocks.
+ - description: |
+ mss pll dri registers:
+ Block of registers responsible for dynamic reconfiguration of the mss
+ pll
+
+ clocks:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+ description: |
+ The clock consumer should specify the desired clock by having the clock
+ ID in its "clocks" phandle cell.
+ See include/dt-bindings/clock/microchip,mpfs-clock.h for the full list of
+ PolarFire clock IDs.
+
+ resets:
+ maxItems: 1
+
+ '#reset-cells':
+ description:
+ The AHB/AXI peripherals on the PolarFire SoC have reset support, namely
+ those with clock IDs from CLK_ENVM to CLK_CFM. The reset consumer
+ should specify the desired peripheral via the clock ID in its "resets"
+ phandle cell.
+ See include/dt-bindings/clock/microchip,mpfs-clock.h for the full list of
+ PolarFire clock IDs.
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ # Clock Config node:
+ - |
+ #include <dt-bindings/clock/microchip,mpfs-clock.h>
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+ clkcfg: clock-controller@20002000 {
+ compatible = "microchip,mpfs-clkcfg";
+ reg = <0x0 0x20002000 0x0 0x1000>, <0x0 0x3E001000 0x0 0x1000>;
+ clocks = <&ref>;
+ #clock-cells = <1>;
+ };
+ };
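As the #reset-cells description explains, a peripheral uses its clock ID for the reset as well; a minimal sketch against the clkcfg label above, assuming the CLK_MMC ID from microchip,mpfs-clock.h (the mmc node is illustrative):

    mmc@20008000 {
        /* the same ID selects the peripheral's clock gate and its reset */
        clocks = <&clkcfg CLK_MMC>;
        resets = <&clkcfg CLK_MMC>;
    };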
diff --git a/Documentation/devicetree/bindings/clock/milbeaut-clock.yaml b/Documentation/devicetree/bindings/clock/milbeaut-clock.yaml
index 6d39344d2b70..0af1c569eb32 100644
--- a/Documentation/devicetree/bindings/clock/milbeaut-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/milbeaut-clock.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/milbeaut-clock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Milbeaut SoCs Clock Controller Binding
+title: Milbeaut SoCs Clock Controller
maintainers:
- Taichi Sugaya <sugaya.taichi@socionext.com>
diff --git a/Documentation/devicetree/bindings/clock/mstar,msc313-cpupll.yaml b/Documentation/devicetree/bindings/clock/mstar,msc313-cpupll.yaml
new file mode 100644
index 000000000000..a9ad7ab5230c
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/mstar,msc313-cpupll.yaml
@@ -0,0 +1,45 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/mstar,msc313-cpupll.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MStar/Sigmastar MSC313 CPU PLL
+
+maintainers:
+ - Daniel Palmer <daniel@thingy.jp>
+
+description: |
+ The MStar/SigmaStar MSC313 and later ARMv7 chips have a scalable
+ PLL that can be used as the clock source for the CPU(s).
+
+properties:
+ compatible:
+ const: mstar,msc313-cpupll
+
+ "#clock-cells":
+ const: 1
+
+ clocks:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - "#clock-cells"
+ - clocks
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/mstar-msc313-mpll.h>
+ cpupll: cpupll@206400 {
+ compatible = "mstar,msc313-cpupll";
+ reg = <0x206400 0x200>;
+ #clock-cells = <1>;
+ clocks = <&mpll MSTAR_MSC313_MPLL_DIV2>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/nuvoton,npcm845-clk.yaml b/Documentation/devicetree/bindings/clock/nuvoton,npcm845-clk.yaml
new file mode 100644
index 000000000000..b901ca13cd25
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/nuvoton,npcm845-clk.yaml
@@ -0,0 +1,49 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/nuvoton,npcm845-clk.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Nuvoton NPCM8XX Clock Controller
+
+maintainers:
+ - Tomer Maimon <tmaimon77@gmail.com>
+
+description: |
+ Nuvoton Arbel BMC NPCM8XX contains an integrated clock controller, which
+ generates and supplies clocks to all modules within the BMC.
+
+properties:
+ compatible:
+ enum:
+ - nuvoton,npcm845-clk
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+ description:
+ See include/dt-bindings/clock/nuvoton,npcm8xx-clock.h for the full
+ list of NPCM8XX clock IDs.
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ ahb {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ clock-controller@f0801000 {
+ compatible = "nuvoton,npcm845-clk";
+ reg = <0x0 0xf0801000 0x0 0x1000>;
+ #clock-cells = <1>;
+ };
+ };
+...
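Consumers reference the controller with IDs from nuvoton,npcm8xx-clock.h; a minimal sketch, assuming a clk label on the controller node and an illustrative NPCM8XX_CLK_APB2 ID:

    serial@f0001000 {
        /* illustrative consumer; ID assumed from nuvoton,npcm8xx-clock.h */
        clocks = <&clk NPCM8XX_CLK_APB2>;
    };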
diff --git a/Documentation/devicetree/bindings/clock/nvidia,tegra124-car.yaml b/Documentation/devicetree/bindings/clock/nvidia,tegra124-car.yaml
index ec7ab1483652..1b2181f6d440 100644
--- a/Documentation/devicetree/bindings/clock/nvidia,tegra124-car.yaml
+++ b/Documentation/devicetree/bindings/clock/nvidia,tegra124-car.yaml
@@ -106,10 +106,3 @@ examples:
#clock-cells = <1>;
#reset-cells = <1>;
};
-
- usb-controller@c5004000 {
- compatible = "nvidia,tegra20-ehci";
- reg = <0xc5004000 0x4000>;
- clocks = <&car TEGRA124_CLK_USB2>;
- resets = <&car TEGRA124_CLK_USB2>;
- };
diff --git a/Documentation/devicetree/bindings/clock/nvidia,tegra124-dfll.txt b/Documentation/devicetree/bindings/clock/nvidia,tegra124-dfll.txt
index 958e0ad78c52..f7d347385b57 100644
--- a/Documentation/devicetree/bindings/clock/nvidia,tegra124-dfll.txt
+++ b/Documentation/devicetree/bindings/clock/nvidia,tegra124-dfll.txt
@@ -136,7 +136,7 @@ clock@70110000 {
};
/* pinmux nodes added for completeness. Binding doc can be found in:
- * Documentation/devicetree/bindings/pinctrl/nvidia,tegra210-pinmux.txt
+ * Documentation/devicetree/bindings/pinctrl/nvidia,tegra210-pinmux.yaml
*/
pinmux: pinmux@700008d4 {
diff --git a/Documentation/devicetree/bindings/clock/nvidia,tegra20-car.yaml b/Documentation/devicetree/bindings/clock/nvidia,tegra20-car.yaml
index 459d2a525393..bee2dd4b29bf 100644
--- a/Documentation/devicetree/bindings/clock/nvidia,tegra20-car.yaml
+++ b/Documentation/devicetree/bindings/clock/nvidia,tegra20-car.yaml
@@ -42,6 +42,36 @@ properties:
"#reset-cells":
const: 1
+patternProperties:
+ "^(sclk)|(pll-[cem])$":
+ type: object
+ properties:
+ compatible:
+ enum:
+ - nvidia,tegra20-sclk
+ - nvidia,tegra30-sclk
+ - nvidia,tegra30-pllc
+ - nvidia,tegra30-plle
+ - nvidia,tegra30-pllm
+
+ operating-points-v2: true
+
+ clocks:
+ items:
+ - description: node's clock
+
+ power-domains:
+ maxItems: 1
+ description: phandle to the core SoC power domain
+
+ required:
+ - compatible
+ - operating-points-v2
+ - clocks
+ - power-domains
+
+ additionalProperties: false
+
required:
- compatible
- reg
@@ -59,11 +89,11 @@ examples:
reg = <0x60006000 0x1000>;
#clock-cells = <1>;
#reset-cells = <1>;
- };
- usb-controller@c5004000 {
- compatible = "nvidia,tegra20-ehci";
- reg = <0xc5004000 0x4000>;
- clocks = <&car TEGRA20_CLK_USB2>;
- resets = <&car TEGRA20_CLK_USB2>;
+ sclk {
+ compatible = "nvidia,tegra20-sclk";
+ operating-points-v2 = <&opp_table>;
+ clocks = <&tegra_car TEGRA20_CLK_SCLK>;
+ power-domains = <&domain>;
+ };
};
diff --git a/Documentation/devicetree/bindings/clock/pwm-clock.txt b/Documentation/devicetree/bindings/clock/pwm-clock.txt
deleted file mode 100644
index 83db876b3b90..000000000000
--- a/Documentation/devicetree/bindings/clock/pwm-clock.txt
+++ /dev/null
@@ -1,26 +0,0 @@
-Binding for an external clock signal driven by a PWM pin.
-
-This binding uses the common clock binding[1] and the common PWM binding[2].
-
-[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
-[2] Documentation/devicetree/bindings/pwm/pwm.txt
-
-Required properties:
-- compatible : shall be "pwm-clock".
-- #clock-cells : from common clock binding; shall be set to 0.
-- pwms : from common PWM binding; this determines the clock frequency
- via the period given in the PWM specifier.
-
-Optional properties:
-- clock-output-names : From common clock binding.
-- clock-frequency : Exact output frequency, in case the PWM period
- is not exact but was rounded to nanoseconds.
-
-Example:
- clock {
- compatible = "pwm-clock";
- #clock-cells = <0>;
- clock-frequency = <25000000>;
- clock-output-names = "mipi_mclk";
- pwms = <&pwm2 0 40>; /* 1 / 40 ns = 25 MHz */
- };
diff --git a/Documentation/devicetree/bindings/clock/pwm-clock.yaml b/Documentation/devicetree/bindings/clock/pwm-clock.yaml
new file mode 100644
index 000000000000..f88ecb2995e0
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/pwm-clock.yaml
@@ -0,0 +1,45 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/pwm-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: An external clock signal driven by a PWM pin.
+
+maintainers:
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+properties:
+ compatible:
+ const: pwm-clock
+
+ '#clock-cells':
+ const: 0
+
+ clock-frequency:
+ description: Exact output frequency, in case the PWM period is not exact
+ but was rounded to nanoseconds.
+
+ clock-output-names:
+ maxItems: 1
+
+ pwms:
+ maxItems: 1
+
+required:
+ - compatible
+ - '#clock-cells'
+ - pwms
+
+additionalProperties: false
+
+examples:
+ - |
+ clock {
+ compatible = "pwm-clock";
+ #clock-cells = <0>;
+ clock-frequency = <25000000>;
+ clock-output-names = "mipi_mclk";
+ pwms = <&pwm2 0 40>; /* 1 / 40 ns = 25 MHz */
+ };
+...
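clock-frequency only matters when the requested rate does not map to a whole number of nanoseconds. In the example above, 25 MHz corresponds to a period of exactly 1/25 MHz = 40 ns, so the property is informational. For, say, a 27 MHz output the ideal period is 1/27 MHz, about 37.037 ns, which has to be rounded to 37 ns in the PWM specifier, and clock-frequency then records the exact intended rate; a sketch:

    clock {
        compatible = "pwm-clock";
        #clock-cells = <0>;
        /* 1 / 37 ns is ~27.03 MHz, not exactly 27 MHz, so record the
         * intended rate explicitly */
        clock-frequency = <27000000>;
        clock-output-names = "mipi_mclk";
        pwms = <&pwm2 0 37>; /* period rounded from 37.037 ns */
    };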
diff --git a/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml b/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml
index fbd758470b88..659669bf224b 100644
--- a/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,a53pll.yaml
@@ -4,10 +4,10 @@
$id: http://devicetree.org/schemas/clock/qcom,a53pll.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm A53 PLL Binding
+title: Qualcomm A53 PLL clock
maintainers:
- - Sivaprakash Murugesan <sivaprak@codeaurora.org>
+ - Bjorn Andersson <andersson@kernel.org>
description:
The A53 PLL on a few Qualcomm platforms is the main CPU PLL used for
@@ -16,7 +16,9 @@ description:
properties:
compatible:
enum:
+ - qcom,ipq5332-a53pll
- qcom,ipq6018-a53pll
+ - qcom,ipq8074-a53pll
- qcom,msm8916-a53pll
- qcom,msm8939-a53pll
@@ -44,14 +46,14 @@ required:
additionalProperties: false
examples:
- #Example 1 - A53 PLL found on MSM8916 devices
+ # Example 1 - A53 PLL found on MSM8916 devices
- |
a53pll: clock@b016000 {
compatible = "qcom,msm8916-a53pll";
reg = <0xb016000 0x40>;
#clock-cells = <0>;
};
- #Example 2 - A53 PLL found on IPQ6018 devices
+ # Example 2 - A53 PLL found on IPQ6018 devices
- |
a53pll_ipq: clock-controller@b116000 {
compatible = "qcom,ipq6018-a53pll";
diff --git a/Documentation/devicetree/bindings/clock/qcom,a7pll.yaml b/Documentation/devicetree/bindings/clock/qcom,a7pll.yaml
index 8666e995725f..809c34eb7d5a 100644
--- a/Documentation/devicetree/bindings/clock/qcom,a7pll.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,a7pll.yaml
@@ -4,13 +4,13 @@
$id: http://devicetree.org/schemas/clock/qcom,a7pll.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm A7 PLL Binding
+title: Qualcomm A7 PLL clock
maintainers:
- Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
description:
- The A7 PLL on the Qualcomm platforms like SDX55 is used to provide high
+ The A7 PLL on the Qualcomm platforms like SDX55, SDX65 is used to provide high
frequency clock to the CPU.
properties:
diff --git a/Documentation/devicetree/bindings/clock/qcom,aoncc-sm8250.yaml b/Documentation/devicetree/bindings/clock/qcom,aoncc-sm8250.yaml
index c40a74b5d672..8b8932bd5a92 100644
--- a/Documentation/devicetree/bindings/clock/qcom,aoncc-sm8250.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,aoncc-sm8250.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/qcom,aoncc-sm8250.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for LPASS Always ON Clock Controller on SM8250 SoCs
+title: LPASS Always ON Clock Controller on SM8250 SoCs
maintainers:
- Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
@@ -17,7 +17,7 @@ description: |
properties:
compatible:
- const: qcom,sm8250-lpass-aon
+ const: qcom,sm8250-lpass-aoncc
reg:
maxItems: 1
@@ -28,11 +28,13 @@ properties:
clocks:
items:
- description: LPASS Core voting clock
+ - description: LPASS Audio codec voting clock
- description: Glitch Free Mux register clock
clock-names:
items:
- const: core
+ - const: audio
- const: bus
required:
@@ -50,9 +52,10 @@ examples:
#include <dt-bindings/sound/qcom,q6afe.h>
clock-controller@3800000 {
#clock-cells = <1>;
- compatible = "qcom,sm8250-lpass-aon";
+ compatible = "qcom,sm8250-lpass-aoncc";
reg = <0x03380000 0x40000>;
clocks = <&q6afecc LPASS_HW_MACRO_VOTE LPASS_CLK_ATTRIBUTE_COUPLE_NO>,
+ <&q6afecc LPASS_HW_DCODEC_VOTE LPASS_CLK_ATTRIBUTE_COUPLE_NO>,
<&q6afecc LPASS_CLK_ID_TX_CORE_MCLK LPASS_CLK_ATTRIBUTE_COUPLE_NO>;
- clock-names = "core", "bus";
+ clock-names = "core", "audio", "bus";
};
diff --git a/Documentation/devicetree/bindings/clock/qcom,audiocc-sm8250.yaml b/Documentation/devicetree/bindings/clock/qcom,audiocc-sm8250.yaml
index 915d76206ad0..cfca888f6014 100644
--- a/Documentation/devicetree/bindings/clock/qcom,audiocc-sm8250.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,audiocc-sm8250.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/qcom,audiocc-sm8250.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for LPASS Audio Clock Controller on SM8250 SoCs
+title: LPASS Audio Clock Controller on SM8250 SoCs
maintainers:
- Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
@@ -28,11 +28,13 @@ properties:
clocks:
items:
- description: LPASS Core voting clock
+ - description: LPASS Audio codec voting clock
- description: Glitch Free Mux register clock
clock-names:
items:
- const: core
+ - const: audio
- const: bus
required:
@@ -53,6 +55,7 @@ examples:
compatible = "qcom,sm8250-lpass-audiocc";
reg = <0x03300000 0x30000>;
clocks = <&q6afecc LPASS_HW_MACRO_VOTE LPASS_CLK_ATTRIBUTE_COUPLE_NO>,
+ <&q6afecc LPASS_HW_DCODEC_VOTE LPASS_CLK_ATTRIBUTE_COUPLE_NO>,
<&q6afecc LPASS_CLK_ID_TX_CORE_MCLK LPASS_CLK_ATTRIBUTE_COUPLE_NO>;
- clock-names = "core", "bus";
+ clock-names = "core", "audio", "bus";
};
diff --git a/Documentation/devicetree/bindings/clock/qcom,camcc-sm8250.yaml b/Documentation/devicetree/bindings/clock/qcom,camcc-sm8250.yaml
index 9f239c3960d1..426335a2841c 100644
--- a/Documentation/devicetree/bindings/clock/qcom,camcc-sm8250.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,camcc-sm8250.yaml
@@ -4,16 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,camcc-sm8250.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Camera Clock & Reset Controller Binding for SM8250
+title: Qualcomm Camera Clock & Reset Controller on SM8250
maintainers:
- Jonathan Marek <jonathan@marek.ca>
description: |
- Qualcomm camera clock control module which supports the clocks, resets and
+ Qualcomm camera clock control module provides the clocks, resets and
power domains on SM8250.
- See also dt-bindings/clock/qcom,camcc-sm8250.h
+ See also:: include/dt-bindings/clock/qcom,camcc-sm8250.h
properties:
compatible:
@@ -21,12 +21,16 @@ properties:
clocks:
items:
+ - description: AHB
- description: Board XO source
+ - description: Board active XO source
- description: Sleep clock source
clock-names:
items:
+ - const: iface
- const: bi_tcxo
+ - const: bi_tcxo_ao
- const: sleep_clk
'#clock-cells':
@@ -38,9 +42,18 @@ properties:
'#power-domain-cells':
const: 1
+ power-domains:
+ items:
+ - description: MMCX power domain
+
reg:
maxItems: 1
+ required-opps:
+ maxItems: 1
+ description:
+ OPP node describing required MMCX performance point.
+
required:
- compatible
- reg
@@ -54,13 +67,16 @@ additionalProperties: false
examples:
- |
+ #include <dt-bindings/clock/qcom,gcc-sm8250.h>
#include <dt-bindings/clock/qcom,rpmh.h>
clock-controller@ad00000 {
compatible = "qcom,sm8250-camcc";
reg = <0x0ad00000 0x10000>;
- clocks = <&rpmhcc RPMH_CXO_CLK>,
+ clocks = <&gcc GCC_CAMERA_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>,
+ <&rpmhcc RPMH_CXO_CLK_A>,
<&sleep_clk>;
- clock-names = "bi_tcxo", "sleep_clk";
+ clock-names = "iface", "bi_tcxo", "bi_tcxo_ao", "sleep_clk";
#clock-cells = <1>;
#reset-cells = <1>;
#power-domain-cells = <1>;
diff --git a/Documentation/devicetree/bindings/clock/qcom,camcc.txt b/Documentation/devicetree/bindings/clock/qcom,camcc.txt
deleted file mode 100644
index c5eb6694fda9..000000000000
--- a/Documentation/devicetree/bindings/clock/qcom,camcc.txt
+++ /dev/null
@@ -1,18 +0,0 @@
-Qualcomm Camera Clock & Reset Controller Binding
-------------------------------------------------
-
-Required properties :
-- compatible : shall contain "qcom,sdm845-camcc".
-- reg : shall contain base register location and length.
-- #clock-cells : from common clock binding, shall contain 1.
-- #reset-cells : from common reset binding, shall contain 1.
-- #power-domain-cells : from generic power domain binding, shall contain 1.
-
-Example:
- camcc: clock-controller@ad00000 {
- compatible = "qcom,sdm845-camcc";
- reg = <0xad00000 0x10000>;
- #clock-cells = <1>;
- #reset-cells = <1>;
- #power-domain-cells = <1>;
- };
diff --git a/Documentation/devicetree/bindings/clock/qcom,dispcc-sc8280xp.yaml b/Documentation/devicetree/bindings/clock/qcom,dispcc-sc8280xp.yaml
new file mode 100644
index 000000000000..3cb996b2c9d5
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,dispcc-sc8280xp.yaml
@@ -0,0 +1,97 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,dispcc-sc8280xp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display Clock & Reset Controller on SC8280XP
+
+maintainers:
+ - Bjorn Andersson <bjorn.andersson@linaro.org>
+
+description: |
+ The Qualcomm display clock control module provides the clocks, resets and
+ power domains for the two MDSS instances on SC8280XP.
+
+ See also:
+ include/dt-bindings/clock/qcom,dispcc-sc8280xp.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,sc8280xp-dispcc0
+ - qcom,sc8280xp-dispcc1
+
+ clocks:
+ items:
+ - description: AHB interface clock
+ - description: SoC CXO clock
+ - description: SoC sleep clock
+ - description: DisplayPort 0 link clock
+ - description: DisplayPort 0 VCO div clock
+ - description: DisplayPort 1 link clock
+ - description: DisplayPort 1 VCO div clock
+ - description: DisplayPort 2 link clock
+ - description: DisplayPort 2 VCO div clock
+ - description: DisplayPort 3 link clock
+ - description: DisplayPort 3 VCO div clock
+ - description: DSI 0 PLL byte clock
+ - description: DSI 0 PLL DSI clock
+ - description: DSI 1 PLL byte clock
+ - description: DSI 1 PLL DSI clock
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+ power-domains:
+ items:
+ - description: MMCX power domain
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - '#clock-cells'
+ - '#reset-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-sc8280xp.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+ clock-controller@af00000 {
+ compatible = "qcom,sc8280xp-dispcc0";
+ reg = <0x0af00000 0x20000>;
+ clocks = <&gcc GCC_DISP_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>,
+ <&sleep_clk>,
+ <&mdss0_dp_phy0 0>,
+ <&mdss0_dp_phy0 1>,
+ <&mdss0_dp_phy1 0>,
+ <&mdss0_dp_phy1 1>,
+ <&mdss0_dp_phy2 0>,
+ <&mdss0_dp_phy2 1>,
+ <&mdss0_dp_phy3 0>,
+ <&mdss0_dp_phy3 1>,
+ <&mdss0_dsi0_phy 0>,
+ <&mdss0_dsi0_phy 1>,
+ <&mdss0_dsi1_phy 0>,
+ <&mdss0_dsi1_phy 1>;
+ power-domains = <&rpmhpd SC8280XP_MMCX>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,dispcc-sm6125.yaml b/Documentation/devicetree/bindings/clock/qcom,dispcc-sm6125.yaml
new file mode 100644
index 000000000000..8a210c4c5f82
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,dispcc-sm6125.yaml
@@ -0,0 +1,86 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,dispcc-sm6125.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display Clock Controller on SM6125
+
+maintainers:
+ - Martin Botka <martin.botka@somainline.org>
+
+description: |
+ Qualcomm display clock control module provides the clocks and power domains
+ on SM6125.
+
+ See also:: include/dt-bindings/clock/qcom,dispcc-sm6125.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,sm6125-dispcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Byte clock from DSI PHY0
+ - description: Pixel clock from DSI PHY0
+ - description: Pixel clock from DSI PHY1
+ - description: Link clock from DP PHY
+ - description: VCO DIV clock from DP PHY
+ - description: AHB config clock from GCC
+
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: dsi0_phy_pll_out_byteclk
+ - const: dsi0_phy_pll_out_dsiclk
+ - const: dsi1_phy_pll_out_dsiclk
+ - const: dp_phy_pll_link_clk
+ - const: dp_phy_pll_vco_div_clk
+ - const: cfg_ahb_clk
+
+ '#clock-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+ #include <dt-bindings/clock/qcom,gcc-sm6125.h>
+ clock-controller@5f00000 {
+ compatible = "qcom,sm6125-dispcc";
+ reg = <0x5f00000 0x20000>;
+ clocks = <&rpmcc RPM_SMD_XO_CLK_SRC>,
+ <&dsi0_phy 0>,
+ <&dsi0_phy 1>,
+ <&dsi1_phy 1>,
+ <&dp_phy 0>,
+ <&dp_phy 1>,
+ <&gcc GCC_DISP_AHB_CLK>;
+ clock-names = "bi_tcxo",
+ "dsi0_phy_pll_out_byteclk",
+ "dsi0_phy_pll_out_dsiclk",
+ "dsi1_phy_pll_out_dsiclk",
+ "dp_phy_pll_link_clk",
+ "dp_phy_pll_vco_div_clk",
+ "cfg_ahb_clk";
+ #clock-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,dispcc-sm6350.yaml b/Documentation/devicetree/bindings/clock/qcom,dispcc-sm6350.yaml
new file mode 100644
index 000000000000..8efac3fb159f
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,dispcc-sm6350.yaml
@@ -0,0 +1,86 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,dispcc-sm6350.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display Clock & Reset Controller on SM6350
+
+maintainers:
+ - Konrad Dybcio <konrad.dybcio@somainline.org>
+
+description: |
+ Qualcomm display clock control module provides the clocks, resets and power
+ domains on SM6350.
+
+ See also:: include/dt-bindings/clock/qcom,dispcc-sm6350.h
+
+properties:
+ compatible:
+ const: qcom,sm6350-dispcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: GPLL0 source from GCC
+ - description: Byte clock from DSI PHY
+ - description: Pixel clock from DSI PHY
+ - description: Link clock from DP PHY
+ - description: VCO DIV clock from DP PHY
+
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: gcc_disp_gpll0_clk
+ - const: dsi0_phy_pll_out_byteclk
+ - const: dsi0_phy_pll_out_dsiclk
+ - const: dp_phy_pll_link_clk
+ - const: dp_phy_pll_vco_div_clk
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+ - '#reset-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-sm6350.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ clock-controller@af00000 {
+ compatible = "qcom,sm6350-dispcc";
+ reg = <0x0af00000 0x20000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>,
+ <&gcc GCC_DISP_GPLL0_CLK>,
+ <&dsi_phy 0>,
+ <&dsi_phy 1>,
+ <&dp_phy 0>,
+ <&dp_phy 1>;
+ clock-names = "bi_tcxo",
+ "gcc_disp_gpll0_clk",
+ "dsi0_phy_pll_out_byteclk",
+ "dsi0_phy_pll_out_dsiclk",
+ "dp_phy_pll_link_clk",
+ "dp_phy_pll_vco_div_clk";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,dispcc-sm8x50.yaml b/Documentation/devicetree/bindings/clock/qcom,dispcc-sm8x50.yaml
index 6667261dc665..d6774db257f0 100644
--- a/Documentation/devicetree/bindings/clock/qcom,dispcc-sm8x50.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,dispcc-sm8x50.yaml
@@ -4,18 +4,19 @@
$id: http://devicetree.org/schemas/clock/qcom,dispcc-sm8x50.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Display Clock & Reset Controller Binding for SM8150/SM8250
+title: Qualcomm Display Clock & Reset Controller on SM8150/SM8250/SM8350
maintainers:
- Jonathan Marek <jonathan@marek.ca>
description: |
- Qualcomm display clock control module which supports the clocks, resets and
- power domains on SM8150 and SM8250.
+ Qualcomm display clock control module provides the clocks, resets and power
+ domains on SM8150/SM8250/SM8350.
- See also:
- dt-bindings/clock/qcom,dispcc-sm8150.h
- dt-bindings/clock/qcom,dispcc-sm8250.h
+ See also::
+ include/dt-bindings/clock/qcom,dispcc-sm8150.h
+ include/dt-bindings/clock/qcom,dispcc-sm8250.h
+ include/dt-bindings/clock/qcom,dispcc-sm8350.h
properties:
compatible:
@@ -23,6 +24,7 @@ properties:
- qcom,sc8180x-dispcc
- qcom,sm8150-dispcc
- qcom,sm8250-dispcc
+ - qcom,sm8350-dispcc
clocks:
items:
@@ -56,6 +58,16 @@ properties:
reg:
maxItems: 1
+ power-domains:
+ description:
+ A phandle and PM domain specifier for the MMCX power domain.
+ maxItems: 1
+
+ required-opps:
+ description:
+ A phandle to an OPP node describing required MMCX performance point.
+ maxItems: 1
+
required:
- compatible
- reg
@@ -70,6 +82,7 @@ additionalProperties: false
examples:
- |
#include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
clock-controller@af00000 {
compatible = "qcom,sm8250-dispcc";
reg = <0x0af00000 0x10000>;
@@ -90,5 +103,7 @@ examples:
#clock-cells = <1>;
#reset-cells = <1>;
#power-domain-cells = <1>;
+ power-domains = <&rpmhpd SM8250_MMCX>;
+ required-opps = <&rpmhpd_opp_low_svs>;
};
...
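The new required-opps property names an OPP child of the rpmhpd provider, pinning a minimum MMCX performance level while dispcc is active. A provider-side sketch of where &rpmhpd_opp_low_svs would come from, following the usual SM8250 layout (node and label names here are assumptions):

    #include <dt-bindings/power/qcom-rpmpd.h>

    rpmhpd: power-controller {
        compatible = "qcom,sm8250-rpmhpd";
        #power-domain-cells = <1>;
        operating-points-v2 = <&rpmhpd_opp_table>;

        rpmhpd_opp_table: opp-table {
            compatible = "operating-points-v2";

            /* the OPP node that required-opps points at */
            rpmhpd_opp_low_svs: opp4 {
                opp-level = <RPMH_REGULATOR_LEVEL_LOW_SVS>;
            };
        };
    };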
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-apq8064.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-apq8064.yaml
index 8e2eac6cbfb9..09cd7a786871 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-apq8064.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-apq8064.yaml
@@ -4,39 +4,53 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-apq8064.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for APQ8064
+title: Qualcomm Global Clock & Reset Controller on APQ8064/MSM8960
maintainers:
- Stephen Boyd <sboyd@kernel.org>
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on APQ8064.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on APQ8064.
- See also:
- - dt-bindings/clock/qcom,gcc-msm8960.h
- - dt-bindings/reset/qcom,gcc-msm8960.h
+ See also::
+ include/dt-bindings/clock/qcom,gcc-msm8960.h
+ include/dt-bindings/reset/qcom,gcc-msm8960.h
+
+allOf:
+ - $ref: qcom,gcc.yaml#
properties:
compatible:
- const: qcom,gcc-apq8064
-
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
+ oneOf:
+ - items:
+ - enum:
+ - qcom,gcc-apq8064
+ - qcom,gcc-msm8960
+ - const: syscon
+ - enum:
+ - qcom,gcc-apq8064
+ - qcom,gcc-msm8960
+ deprecated: true
+
+ thermal-sensor:
+ description: child tsens device
+ $ref: /schemas/thermal/qcom-tsens.yaml#
+
+ clocks:
+ maxItems: 3
+
+ clock-names:
+ items:
+ - const: cxo
+ - const: pxo
+ - const: pll4
nvmem-cells:
minItems: 1
maxItems: 2
+ deprecated: true
description:
Qualcomm TSENS (thermal sensor device) on some devices can
be part of GCC and hence the TSENS properties can also be part
@@ -46,39 +60,39 @@ properties:
nvmem-cell-names:
minItems: 1
+ deprecated: true
items:
- const: calib
- const: calib_backup
'#thermal-sensor-cells':
const: 1
-
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
+ deprecated: true
required:
- compatible
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
- - nvmem-cells
- - nvmem-cell-names
- - '#thermal-sensor-cells'
-additionalProperties: false
+unevaluatedProperties: false
examples:
- |
clock-controller@900000 {
- compatible = "qcom,gcc-apq8064";
+ compatible = "qcom,gcc-apq8064", "syscon";
reg = <0x00900000 0x4000>;
- nvmem-cells = <&tsens_calib>, <&tsens_backup>;
- nvmem-cell-names = "calib", "calib_backup";
#clock-cells = <1>;
#reset-cells = <1>;
#power-domain-cells = <1>;
- #thermal-sensor-cells = <1>;
+
+ thermal-sensor {
+ compatible = "qcom,msm8960-tsens";
+
+ nvmem-cells = <&tsens_calib>, <&tsens_backup>;
+ nvmem-cell-names = "calib", "calib_backup";
+ interrupts = <0 178 4>;
+ interrupt-names = "uplow";
+
+ #qcom,sensors = <11>;
+ #thermal-sensor-cells = <1>;
+ };
};
...
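Moving tsens under GCC does not change how thermal zones consume it: #thermal-sensor-cells = <1> still lets a zone select one of the 11 sensors by index. A sketch, assuming the child node carries a tsens label (the sensor index is arbitrary):

    thermal-zones {
        cpu-thermal {
            polling-delay-passive = <250>;
            polling-delay = <1000>;

            /* cell selects sensor 0 of the 11 advertised in #qcom,sensors */
            thermal-sensors = <&tsens 0>;
        };
    };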
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-apq8084.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-apq8084.yaml
new file mode 100644
index 000000000000..d84608269080
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-apq8084.yaml
@@ -0,0 +1,86 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-apq8084.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on APQ8084
+
+maintainers:
+ - Stephen Boyd <sboyd@kernel.org>
+ - Taniya Das <quic_tdas@quicinc.com>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on APQ8084.
+
+ See also::
+ include/dt-bindings/clock/qcom,gcc-apq8084.h
+ include/dt-bindings/reset/qcom,gcc-apq8084.h
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+properties:
+ compatible:
+ const: qcom,gcc-apq8084
+
+ clocks:
+ items:
+ - description: XO source
+ - description: Sleep clock source
+ - description: UFS RX symbol 0 clock
+ - description: UFS RX symbol 1 clock
+ - description: UFS TX symbol 0 clock
+ - description: UFS TX symbol 1 clock
+ - description: SATA ASIC0 clock
+ - description: SATA RX clock
+ - description: PCIe PIPE clock
+
+ clock-names:
+ items:
+ - const: xo
+ - const: sleep_clk
+ - const: ufs_rx_symbol_0_clk_src
+ - const: ufs_rx_symbol_1_clk_src
+ - const: ufs_tx_symbol_0_clk_src
+ - const: ufs_tx_symbol_1_clk_src
+ - const: sata_asic0_clk
+ - const: sata_rx_clk
+ - const: pcie_pipe
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ /* UFS PHY on APQ8084 is not supported (yet), so these bindings just serve as an example */
+ clock-controller@fc400000 {
+ compatible = "qcom,gcc-apq8084";
+ reg = <0xfc400000 0x4000>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+
+ clocks = <&xo_board>,
+ <&sleep_clk>,
+ <&ufsphy 0>,
+ <&ufsphy 1>,
+ <&ufsphy 2>,
+ <&ufsphy 3>,
+ <&sata 0>,
+ <&sata 1>,
+ <&pcie_phy>;
+ clock-names = "xo",
+ "sleep_clk",
+ "ufs_rx_symbol_0_clk_src",
+ "ufs_rx_symbol_1_clk_src",
+ "ufs_tx_symbol_0_clk_src",
+ "ufs_tx_symbol_1_clk_src",
+ "sata_asic0_clk",
+ "sata_rx_clk",
+ "pcie_pipe";
+ };
+...
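Because the UFS PHY provider is missing, a real board cannot resolve the &ufsphy phandles above; the usual escape is the empty specifier <0>, which satisfies the fixed-length clocks list without naming a provider. A sketch of that variant (an assumption, not shown in this patch):

    clocks = <&xo_board>,
             <&sleep_clk>,
             <0>, <0>, <0>, <0>,   /* UFS symbol clocks left unconnected */
             <&sata 0>,
             <&sata 1>,
             <&pcie_phy>;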
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-ipq4019.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-ipq4019.yaml
new file mode 100644
index 000000000000..6ebaef2288fa
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-ipq4019.yaml
@@ -0,0 +1,53 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-ipq4019.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on IPQ4019
+
+maintainers:
+ - Stephen Boyd <sboyd@kernel.org>
+ - Taniya Das <tdas@codeaurora.org>
+ - Robert Marko <robert.marko@sartura.hr>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on IPQ4019.
+
+ See also:: include/dt-bindings/clock/qcom,gcc-ipq4019.h
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+properties:
+ compatible:
+ const: qcom,gcc-ipq4019
+
+ clocks:
+ items:
+ - description: board XO clock
+ - description: sleep clock
+
+ clock-names:
+ items:
+ - const: xo
+ - const: sleep_clk
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ clock-controller@1800000 {
+ compatible = "qcom,gcc-ipq4019";
+ reg = <0x1800000 0x60000>;
+ #clock-cells = <1>;
+ #power-domain-cells = <1>;
+ #reset-cells = <1>;
+ clocks = <&xo>, <&sleep_clk>;
+ clock-names = "xo", "sleep_clk";
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-ipq8064.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-ipq8064.yaml
new file mode 100644
index 000000000000..93f3084b97c1
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-ipq8064.yaml
@@ -0,0 +1,81 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-ipq8064.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on IPQ8064
+
+maintainers:
+ - Ansuel Smith <ansuelsmth@gmail.com>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on IPQ8064.
+
+ See also::
+ include/dt-bindings/clock/qcom,gcc-ipq806x.h (qcom,gcc-ipq8064)
+ include/dt-bindings/reset/qcom,gcc-ipq806x.h (qcom,gcc-ipq8064)
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: qcom,gcc-ipq8064
+ - const: syscon
+
+ clocks:
+ minItems: 2
+ items:
+ - description: PXO source
+ - description: CXO source
+ - description: PLL4 from LCC
+
+ clock-names:
+ minItems: 2
+ items:
+ - const: pxo
+ - const: cxo
+ - const: pll4
+
+ thermal-sensor:
+ type: object
+
+ allOf:
+ - $ref: /schemas/thermal/qcom-tsens.yaml#
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,lcc-ipq806x.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ gcc: clock-controller@900000 {
+ compatible = "qcom,gcc-ipq8064", "syscon";
+ reg = <0x00900000 0x4000>;
+ clocks = <&pxo_board>, <&cxo_board>, <&lcc PLL4>;
+ clock-names = "pxo", "cxo", "pll4";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+
+ tsens: thermal-sensor {
+ compatible = "qcom,ipq8064-tsens";
+
+ nvmem-cells = <&tsens_calib>, <&tsens_calib_backup>;
+ nvmem-cell-names = "calib", "calib_backup";
+ interrupts = <GIC_SPI 178 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "uplow";
+
+ #qcom,sensors = <11>;
+ #thermal-sensor-cells = <1>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-ipq8074.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-ipq8074.yaml
index 98572b4a9b60..deef398a9872 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-ipq8074.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-ipq8074.yaml
@@ -4,43 +4,39 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-ipq8074.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Bindingfor IPQ8074
+title: Qualcomm Global Clock & Reset Controller on IPQ8074
maintainers:
- Stephen Boyd <sboyd@kernel.org>
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on IPQ8074.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on IPQ8074.
- See also:
- - dt-bindings/clock/qcom,gcc-ipq8074.h
+ See also:: include/dt-bindings/clock/qcom,gcc-ipq8074.h
+
+allOf:
+ - $ref: qcom,gcc.yaml#
properties:
compatible:
const: qcom,gcc-ipq8074
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- reg:
- maxItems: 1
+ clocks:
+ items:
+ - description: board XO clock
+ - description: sleep clock
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
+ clock-names:
+ items:
+ - const: xo
+ - const: sleep_clk
required:
- compatible
- - reg
- - '#clock-cells'
- - '#reset-cells'
-additionalProperties: false
+unevaluatedProperties: false
examples:
- |
@@ -48,6 +44,7 @@ examples:
compatible = "qcom,gcc-ipq8074";
reg = <0x01800000 0x80000>;
#clock-cells = <1>;
+ #power-domain-cells = <1>;
#reset-cells = <1>;
};
...
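The properties being deleted in these conversions are not lost: each file now pulls them in through allOf -> qcom,gcc.yaml. Roughly, the shared schema contributes the following (a paraphrase of Documentation/devicetree/bindings/clock/qcom,gcc.yaml, not quoted from this patch):

    properties:
      reg:
        maxItems: 1
      '#clock-cells':
        const: 1
      '#reset-cells':
        const: 1
      '#power-domain-cells':
        const: 1
      protected-clocks:
        description: Protected clock specifier list as per common clock binding.

    required:
      - reg
      - '#clock-cells'
      - '#reset-cells'
      - '#power-domain-cells'

This is why each per-SoC file now only needs to require compatible plus its own clocks, and why they switch from additionalProperties to unevaluatedProperties.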
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-msm8660.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8660.yaml
new file mode 100644
index 000000000000..c9e985548621
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8660.yaml
@@ -0,0 +1,54 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-msm8660.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on MSM8660
+
+maintainers:
+ - Stephen Boyd <sboyd@kernel.org>
+ - Taniya Das <quic_tdas@quicinc.com>
+
+description: |
+ Qualcomm global clock control module provides the clocks and resets on
+ MSM8660.
+
+ See also::
+ include/dt-bindings/clock/qcom,gcc-msm8660.h
+ include/dt-bindings/reset/qcom,gcc-msm8660.h
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+properties:
+ compatible:
+ enum:
+ - qcom,gcc-msm8660
+
+ clocks:
+ maxItems: 2
+
+ clock-names:
+ items:
+ - const: pxo
+ - const: cxo
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ # Example for GCC for MSM8660:
+ - |
+ clock-controller@900000 {
+ compatible = "qcom,gcc-msm8660";
+ reg = <0x900000 0x4000>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ clocks = <&pxo_board>, <&cxo_board>;
+ clock-names = "pxo", "cxo";
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-msm8909.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8909.yaml
new file mode 100644
index 000000000000..b91462587df5
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8909.yaml
@@ -0,0 +1,62 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-msm8909.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on MSM8909, MSM8917 and QM215
+
+maintainers:
+ - Stephan Gerhold <stephan@gerhold.net>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on MSM8909, MSM8917 or QM215.
+
+ See also::
+ include/dt-bindings/clock/qcom,gcc-msm8909.h
+ include/dt-bindings/clock/qcom,gcc-msm8917.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,gcc-msm8909
+ - qcom,gcc-msm8917
+ - qcom,gcc-qm215
+
+ clocks:
+ items:
+ - description: XO source
+ - description: Sleep clock source
+ - description: DSI phy instance 0 dsi clock
+ - description: DSI phy instance 0 byte clock
+
+ clock-names:
+ items:
+ - const: xo
+ - const: sleep_clk
+ - const: dsi0pll
+ - const: dsi0pllbyte
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ gcc: clock-controller@1800000 {
+ compatible = "qcom,gcc-msm8909";
+ reg = <0x01800000 0x80000>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ clocks = <&xo_board>, <&sleep_clk>, <&dsi0_phy 1>, <&dsi0_phy 0>;
+ clock-names = "xo", "sleep_clk", "dsi0pll", "dsi0pllbyte";
+ };
+...
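Note the deliberate loop here: GCC consumes dsi0pll/dsi0pllbyte from the DSI PHY, while the PHY is itself clocked from GCC. A provider-side sketch of dsi0_phy, borrowing the MSM8916 28nm PHY binding purely as an illustration (the compatible, clock constant and register details are assumptions for MSM8909):

    dsi0_phy: phy@1a98300 {
        compatible = "qcom,dsi-phy-28nm-8916";
        /* reg regions omitted for brevity */

        /* cell 0 = byte clock (dsi0pllbyte), cell 1 = pixel clock (dsi0pll) */
        #clock-cells = <1>;
        #phy-cells = <0>;

        clocks = <&gcc GCC_MDSS_AHB_CLK>, <&xo_board>;
        clock-names = "iface", "ref";
    };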
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-msm8916.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8916.yaml
new file mode 100644
index 000000000000..ad84c0f7680b
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8916.yaml
@@ -0,0 +1,66 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-msm8916.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on MSM8916 and MSM8939
+
+maintainers:
+ - Stephen Boyd <sboyd@kernel.org>
+ - Taniya Das <quic_tdas@quicinc.com>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on MSM8916 or MSM8939.
+
+ See also::
+ include/dt-bindings/clock/qcom,gcc-msm8916.h
+ include/dt-bindings/clock/qcom,gcc-msm8939.h
+ include/dt-bindings/reset/qcom,gcc-msm8916.h
+ include/dt-bindings/reset/qcom,gcc-msm8939.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,gcc-msm8916
+ - qcom,gcc-msm8939
+
+ clocks:
+ items:
+ - description: XO source
+ - description: Sleep clock source
+ - description: DSI phy instance 0 dsi clock
+ - description: DSI phy instance 0 byte clock
+ - description: External MCLK clock
+ - description: External Primary I2S clock
+ - description: External Secondary I2S clock
+
+ clock-names:
+ items:
+ - const: xo
+ - const: sleep_clk
+ - const: dsi0pll
+ - const: dsi0pllbyte
+ - const: ext_mclk
+ - const: ext_pri_i2s
+ - const: ext_sec_i2s
+
+required:
+ - compatible
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ clock-controller@300000 {
+ compatible = "qcom,gcc-msm8916";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ reg = <0x300000 0x90000>;
+ };
+...
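On the consumer side, peripherals pick clocks and resets out of this node with the GCC_* constants from the headers above; MSM8916's ChipIdea USB controller is a representative shape (trimmed from the usual msm8916.dtsi layout, details approximate):

    #include <dt-bindings/clock/qcom,gcc-msm8916.h>

    usb@78d9000 {
        compatible = "qcom,ci-hdrc";
        reg = <0x78d9000 0x200>;
        clocks = <&gcc GCC_USB_HS_AHB_CLK>,
                 <&gcc GCC_USB_HS_SYSTEM_CLK>;
        clock-names = "iface", "core";
        resets = <&gcc GCC_USB_HS_BCR>;
        reset-names = "core";
    };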
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-msm8974.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8974.yaml
new file mode 100644
index 000000000000..1927aecc86bc
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8974.yaml
@@ -0,0 +1,61 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-msm8974.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on MSM8974 (including Pro) and
+ MSM8226
+
+maintainers:
+ - Stephen Boyd <sboyd@kernel.org>
+ - Taniya Das <quic_tdas@quicinc.com>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on MSM8974 (all variants) and MSM8226.
+
+ See also::
+ include/dt-bindings/clock/qcom,gcc-msm8974.h (qcom,gcc-msm8226 and qcom,gcc-msm8974)
+ include/dt-bindings/reset/qcom,gcc-msm8974.h (qcom,gcc-msm8226 and qcom,gcc-msm8974)
+
+$ref: qcom,gcc.yaml#
+
+properties:
+ compatible:
+ enum:
+ - qcom,gcc-msm8226
+ - qcom,gcc-msm8974
+ - qcom,gcc-msm8974pro
+ - qcom,gcc-msm8974pro-ac
+
+ clocks:
+ items:
+ - description: XO source
+ - description: Sleep clock source
+
+ clock-names:
+ items:
+ - const: xo
+ - const: sleep_clk
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ clock-controller@fc400000 {
+ compatible = "qcom,gcc-msm8974";
+ reg = <0xfc400000 0x4000>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+
+ clock-names = "xo", "sleep_clk";
+ clocks = <&xo_board>,
+ <&sleep_clk>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-msm8976.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8976.yaml
new file mode 100644
index 000000000000..d2186e25f55f
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8976.yaml
@@ -0,0 +1,83 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-msm8976.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on MSM8976
+
+maintainers:
+ - Stephen Boyd <sboyd@kernel.org>
+ - Taniya Das <tdas@codeaurora.org>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on MSM8976.
+
+ See also:: include/dt-bindings/clock/qcom,gcc-msm8976.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,gcc-msm8976
+ - qcom,gcc-msm8976-v1.1
+
+ clocks:
+ items:
+ - description: XO source
+ - description: Always-on XO source
+ - description: Pixel clock from DSI PHY0
+ - description: Byte clock from DSI PHY0
+ - description: Pixel clock from DSI PHY1
+ - description: Byte clock from DSI PHY1
+
+ clock-names:
+ items:
+ - const: xo
+ - const: xo_a
+ - const: dsi0pll
+ - const: dsi0pllbyte
+ - const: dsi1pll
+ - const: dsi1pllbyte
+
+ vdd_gfx-supply:
+ description:
+ Phandle to voltage regulator providing power to the GX domain.
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - vdd_gfx-supply
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ clock-controller@1800000 {
+ compatible = "qcom,gcc-msm8976";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ reg = <0x1800000 0x80000>;
+
+ clocks = <&xo_board>,
+ <&xo_board>,
+ <&dsi0_phy 1>,
+ <&dsi0_phy 0>,
+ <&dsi1_phy 1>,
+ <&dsi1_phy 0>;
+
+ clock-names = "xo",
+ "xo_a",
+ "dsi0pll",
+ "dsi0pllbyte",
+ "dsi1pll",
+ "dsi1pllbyte";
+
+ vdd_gfx-supply = <&pm8004_s5>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-msm8994.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8994.yaml
new file mode 100644
index 000000000000..8f0f20c1442a
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8994.yaml
@@ -0,0 +1,56 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-msm8994.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on MSM8994
+
+maintainers:
+ - Konrad Dybcio <konrad.dybcio@somainline.org>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on MSM8994 and MSM8992.
+
+ See also:: include/dt-bindings/clock/qcom,gcc-msm8994.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,gcc-msm8992
+ - qcom,gcc-msm8994
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Sleep clock source
+
+ clock-names:
+ items:
+ - const: xo
+ - const: sleep
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ clock-controller@300000 {
+ compatible = "qcom,gcc-msm8994";
+ reg = <0x00300000 0x90000>;
+ clocks = <&xo_board>, <&sleep_clk>;
+ clock-names = "xo", "sleep";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-msm8996.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8996.yaml
index 5a5b2214f0ca..f77036ace31b 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-msm8996.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8996.yaml
@@ -4,59 +4,57 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-msm8996.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for MSM8996
+title: Qualcomm Global Clock & Reset Controller on MSM8996
maintainers:
- Stephen Boyd <sboyd@kernel.org>
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
+ Qualcomm global clock control module which provides the clocks, resets and
power domains on MSM8996.
- See also:
- - dt-bindings/clock/qcom,gcc-msm8996.h
+ See also:: include/dt-bindings/clock/qcom,gcc-msm8996.h
properties:
compatible:
const: qcom,gcc-msm8996
clocks:
+ minItems: 3
items:
- description: XO source
- description: Second XO source
- description: Sleep clock source
+ - description: PCIe 0 PIPE clock (optional)
+ - description: PCIe 1 PIPE clock (optional)
+ - description: PCIe 2 PIPE clock (optional)
+ - description: USB3 PIPE clock (optional)
+ - description: UFS RX symbol 0 clock (optional)
+ - description: UFS RX symbol 1 clock (optional)
+ - description: UFS TX symbol 0 clock (optional)
clock-names:
+ minItems: 3
items:
- const: cxo
- const: cxo2
- const: sleep_clk
-
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
-
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
+ - const: pcie_0_pipe_clk_src
+ - const: pcie_1_pipe_clk_src
+ - const: pcie_2_pipe_clk_src
+ - const: usb3_phy_pipe_clk_src
+ - const: ufs_rx_symbol_0_clk_src
+ - const: ufs_rx_symbol_1_clk_src
+ - const: ufs_tx_symbol_0_clk_src
required:
- compatible
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-msm8998.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8998.yaml
index a0bb713929b0..3c9729050d6f 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-msm8998.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-msm8998.yaml
@@ -4,18 +4,17 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-msm8998.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for MSM8998
+title: Qualcomm Global Clock & Reset Controller on MSM8998
maintainers:
- Stephen Boyd <sboyd@kernel.org>
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on MSM8998.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on MSM8998.
- See also:
- - dt-bindings/clock/qcom,gcc-msm8998.h
+ See also:: include/dt-bindings/clock/qcom,gcc-msm8998.h
properties:
compatible:
@@ -25,48 +24,25 @@ properties:
items:
- description: Board XO source
- description: Sleep clock source
- - description: USB 3.0 phy pipe clock
- - description: UFS phy rx symbol clock for pipe 0
- - description: UFS phy rx symbol clock for pipe 1
- - description: UFS phy tx symbol clock
- - description: PCIE phy pipe clock
+ - description: Audio reference clock (Optional clock)
+ minItems: 2
clock-names:
items:
- const: xo
- const: sleep_clk
- - const: usb3_pipe
- - const: ufs_rx_symbol0
- - const: ufs_rx_symbol1
- - const: ufs_tx_symbol0
- - const: pcie0_pipe
-
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
-
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
+ - const: aud_ref_clk # Optional clock
+ minItems: 2
required:
- compatible
- clocks
- clock-names
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
@@ -79,17 +55,9 @@ examples:
reg = <0x00100000 0xb0000>;
clocks = <&rpmcc RPM_SMD_XO_CLK_SRC>,
<&sleep>,
- <0>,
- <0>,
- <0>,
- <0>,
<0>;
clock-names = "xo",
"sleep_clk",
- "usb3_pipe",
- "ufs_rx_symbol0",
- "ufs_rx_symbol1",
- "ufs_tx_symbol0",
- "pcie0_pipe";
+ "aud_ref_clk";
};
...
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-other.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-other.yaml
new file mode 100644
index 000000000000..ae01e7749534
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-other.yaml
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-other.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller
+
+maintainers:
+ - Stephen Boyd <sboyd@kernel.org>
+ - Taniya Das <tdas@codeaurora.org>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains.
+
+ See also::
+ include/dt-bindings/clock/qcom,gcc-ipq6018.h
+ include/dt-bindings/reset/qcom,gcc-ipq6018.h
+ include/dt-bindings/clock/qcom,gcc-msm8953.h
+ include/dt-bindings/clock/qcom,gcc-mdm9607.h
+ include/dt-bindings/clock/qcom,gcc-mdm9615.h
+ include/dt-bindings/reset/qcom,gcc-mdm9615.h
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+properties:
+ compatible:
+ enum:
+ - qcom,gcc-ipq6018
+ - qcom,gcc-mdm9607
+ - qcom,gcc-msm8953
+ - qcom,gcc-mdm9615
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ clock-controller@900000 {
+ compatible = "qcom,gcc-mdm9607";
+ reg = <0x900000 0x4000>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-qcm2290.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-qcm2290.yaml
new file mode 100644
index 000000000000..c9bec4656f6e
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-qcm2290.yaml
@@ -0,0 +1,54 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-qcm2290.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on QCM2290
+
+maintainers:
+ - Shawn Guo <shawn.guo@linaro.org>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on QCM2290.
+
+ See also:: include/dt-bindings/clock/qcom,gcc-qcm2290.h
+
+properties:
+ compatible:
+ const: qcom,gcc-qcm2290
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Sleep clock source
+
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: sleep_clk
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+ clock-controller@1400000 {
+ compatible = "qcom,gcc-qcm2290";
+ reg = <0x01400000 0x1f0000>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ clock-names = "bi_tcxo", "sleep_clk";
+ clocks = <&rpmcc RPM_SMD_XO_CLK_SRC>, <&sleep_clk>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-qcs404.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-qcs404.yaml
index ce06f3f8c3e3..b2256f81b265 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-qcs404.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-qcs404.yaml
@@ -4,43 +4,47 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-qcs404.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Bindingfor QCS404
+title: Qualcomm Global Clock & Reset Controller on QCS404
maintainers:
- Stephen Boyd <sboyd@kernel.org>
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on QCS404.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on QCS404.
- See also:
- - dt-bindings/clock/qcom,gcc-qcs404.h
+ See also:: include/dt-bindings/clock/qcom,gcc-qcs404.h
properties:
compatible:
const: qcom,gcc-qcs404
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- reg:
- maxItems: 1
-
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
+ clocks:
+ items:
+ - description: XO source
+ - description: Sleep clock source
+ - description: PCIe 0 PIPE clock (optional)
+ - description: DSI phy instance 0 dsi clock
+ - description: DSI phy instance 0 byte clock
+ - description: HDMI phy PLL clock
+
+ clock-names:
+ items:
+ - const: cxo
+ - const: sleep_clk
+ - const: pcie_0_pipe_clk_src
+ - const: dsi0pll
+ - const: dsi0pllbyte
+ - const: hdmi_pll
required:
- compatible
- - reg
- - '#clock-cells'
- - '#reset-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
@@ -49,5 +53,6 @@ examples:
reg = <0x01800000 0x80000>;
#clock-cells = <1>;
#reset-cells = <1>;
+ #power-domain-cells = <1>;
};
...
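Only cxo and sleep_clk are board-level sources here; the PCIe, DSI and HDMI entries are fed back from PHYs. A board without, say, PCIe can keep the slot but pass an empty specifier, along the lines of (phandle labels and cell counts are assumptions):

    clocks = <&xo_board>,
             <&sleep_clk>,
             <0>,               /* pcie_0_pipe_clk_src: no PCIe PHY fitted */
             <&dsi0_phy 1>,
             <&dsi0_phy 0>,
             <&hdmi_phy>;       /* assumed #clock-cells = <0> */
    clock-names = "cxo", "sleep_clk", "pcie_0_pipe_clk_src",
                  "dsi0pll", "dsi0pllbyte", "hdmi_pll";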
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sc7180.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sc7180.yaml
index a404c8fbee67..06dce0c6b7d0 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-sc7180.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sc7180.yaml
@@ -4,18 +4,17 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-sc7180.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for SC7180
+title: Qualcomm Global Clock & Reset Controller on SC7180
maintainers:
- Stephen Boyd <sboyd@kernel.org>
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on SC7180.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SC7180.
- See also:
- - dt-bindings/clock/qcom,gcc-sc7180.h
+ See also:: include/dt-bindings/clock/qcom,gcc-sc7180.h
properties:
compatible:
@@ -33,32 +32,15 @@ properties:
- const: bi_tcxo_ao
- const: sleep_clk
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
-
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
-
required:
- compatible
- clocks
- clock-names
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sc7280.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sc7280.yaml
index 5693b8997570..947b47168cec 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-sc7280.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sc7280.yaml
@@ -4,17 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-sc7280.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for SC7280
+title: Qualcomm Global Clock & Reset Controller on SC7280
maintainers:
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on SC7280.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SC7280.
- See also:
- - dt-bindings/clock/qcom,gcc-sc7280.h
+ See also:: include/dt-bindings/clock/qcom,gcc-sc7280.h
properties:
compatible:
@@ -44,28 +43,15 @@ properties:
- const: ufs_phy_tx_symbol_0_clk
- const: usb3_phy_wrapper_gcc_usb30_pipe_clk
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
-
required:
- compatible
- clocks
- clock-names
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sc8180x.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sc8180x.yaml
index f03ef96e57fa..6c4846b34e4b 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-sc8180x.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sc8180x.yaml
@@ -4,17 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-sc8180x.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for SC8180x
+title: Qualcomm Global Clock & Reset Controller on SC8180x
maintainers:
- Bjorn Andersson <bjorn.andersson@linaro.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on SC8180x.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SC8180x.
- See also:
- - dt-bindings/clock/qcom,gcc-sc8180x.h
+ See also:: include/dt-bindings/clock/qcom,gcc-sc8180x.h
properties:
compatible:
@@ -32,32 +31,15 @@ properties:
- const: bi_tcxo_ao
- const: sleep_clk
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
-
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
-
required:
- compatible
- clocks
- clock-names
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sc8280xp.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sc8280xp.yaml
new file mode 100644
index 000000000000..5681e535fede
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sc8280xp.yaml
@@ -0,0 +1,121 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-sc8280xp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on SC8280xp
+
+maintainers:
+ - Bjorn Andersson <bjorn.andersson@linaro.org>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and
+ power domains on SC8280xp.
+
+ See also:: include/dt-bindings/clock/qcom,gcc-sc8280xp.h
+
+properties:
+ compatible:
+ const: qcom,gcc-sc8280xp
+
+ clocks:
+ items:
+ - description: XO reference clock
+ - description: Sleep clock
+ - description: UFS memory first RX symbol clock
+ - description: UFS memory second RX symbol clock
+ - description: UFS memory first TX symbol clock
+ - description: UFS card first RX symbol clock
+ - description: UFS card second RX symbol clock
+ - description: UFS card first TX symbol clock
+ - description: Primary USB SuperSpeed pipe clock
+ - description: USB4 PHY pipegmux clock source
+ - description: USB4 PHY DP gmux clock source
+ - description: USB4 PHY sys pipegmux clock source
+ - description: USB4 PHY PCIe pipe clock
+ - description: USB4 PHY router max pipe clock
+ - description: Primary USB4 RX0 clock
+ - description: Primary USB4 RX1 clock
+ - description: Secondary USB SuperSpeed pipe clock
+ - description: Second USB4 PHY pipegmux clock source
+ - description: Second USB4 PHY DP gmux clock source
+ - description: Second USB4 PHY sys pipegmux clock source
+ - description: Second USB4 PHY PCIe pipe clock
+ - description: Second USB4 PHY router max pipe clock
+ - description: Secondary USB4 RX0 clock
+ - description: Secondary USB4 RX1 clock
+ - description: Multiport USB first SuperSpeed pipe clock
+ - description: Multiport USB second SuperSpeed pipe clock
+ - description: PCIe 2a pipe clock
+ - description: PCIe 2b pipe clock
+ - description: PCIe 3a pipe clock
+ - description: PCIe 3b pipe clock
+ - description: PCIe 4 pipe clock
+ - description: First EMAC controller reference clock
+ - description: Second EMAC controller reference clock
+
+ power-domains:
+ items:
+ - description: CX domain
+
+ protected-clocks:
+ maxItems: 389
+
+required:
+ - compatible
+ - clocks
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ clock-controller@100000 {
+ compatible = "qcom,gcc-sc8280xp";
+ reg = <0x00100000 0x1f0000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>,
+ <&sleep_clk>,
+ <&ufs_phy_rx_symbol_0_clk>,
+ <&ufs_phy_rx_symbol_1_clk>,
+ <&ufs_phy_tx_symbol_0_clk>,
+ <&ufs_card_rx_symbol_0_clk>,
+ <&ufs_card_rx_symbol_1_clk>,
+ <&ufs_card_tx_symbol_0_clk>,
+ <&usb_0_ssphy>,
+ <&gcc_usb4_phy_pipegmux_clk_src>,
+ <&gcc_usb4_phy_dp_gmux_clk_src>,
+ <&gcc_usb4_phy_sys_pipegmux_clk_src>,
+ <&usb4_phy_gcc_usb4_pcie_pipe_clk>,
+ <&usb4_phy_gcc_usb4rtr_max_pipe_clk>,
+ <&qusb4phy_gcc_usb4_rx0_clk>,
+ <&qusb4phy_gcc_usb4_rx1_clk>,
+ <&usb_1_ssphy>,
+ <&gcc_usb4_1_phy_pipegmux_clk_src>,
+ <&gcc_usb4_1_phy_dp_gmux_clk_src>,
+ <&gcc_usb4_1_phy_sys_pipegmux_clk_src>,
+ <&usb4_1_phy_gcc_usb4_pcie_pipe_clk>,
+ <&usb4_1_phy_gcc_usb4rtr_max_pipe_clk>,
+ <&qusb4phy_1_gcc_usb4_rx0_clk>,
+ <&qusb4phy_1_gcc_usb4_rx1_clk>,
+ <&usb_2_ssphy>,
+ <&usb_3_ssphy>,
+ <&pcie2a_lane>,
+ <&pcie2b_lane>,
+ <&pcie3a_lane>,
+ <&pcie3b_lane>,
+ <&pcie4_lane>,
+ <&rxc0_ref_clk>,
+ <&rxc1_ref_clk>;
+ power-domains = <&rpmhpd SC8280XP_CX>;
+
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
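The unusually large protected-clocks bound (389 entries) reflects how much of this GCC the firmware keeps to itself on some boards; board files list such IDs so the OS never reparents or disables them. The usage shape is an override in the board dts (the constants below are placeholders, not taken from the sc8280xp header):

    &gcc {
        protected-clocks = <GCC_QSPI_CNOC_PERIPH_AHB_CLK>,
                           <GCC_QSPI_CORE_CLK>,
                           <GCC_QSPI_CORE_CLK_SRC>;
    };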
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sdm660.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sdm660.yaml
new file mode 100644
index 000000000000..52e7412aace5
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sdm660.yaml
@@ -0,0 +1,61 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-sdm660.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SDM660/SDM630/SDM636 Global Clock & Reset Controller
+
+maintainers:
+ - Stephen Boyd <sboyd@kernel.org>
+ - Taniya Das <quic_tdas@quicinc.com>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SDM630, SDM636 and SDM660.
+
+ See also::
+ include/dt-bindings/clock/qcom,gcc-sdm660.h (qcom,gcc-sdm630 and qcom,gcc-sdm660)
+
+$ref: qcom,gcc.yaml#
+
+properties:
+ compatible:
+ enum:
+ - qcom,gcc-sdm630
+ - qcom,gcc-sdm660
+
+ clocks:
+ items:
+ - description: XO source
+ - description: Sleep clock source
+
+ clock-names:
+ items:
+ - const: xo
+ - const: sleep_clk
+
+ power-domains:
+ maxItems: 1
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ # Example for GCC for SDM660:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ clock-controller@100000 {
+ compatible = "qcom,gcc-sdm660";
+ reg = <0x00100000 0x94000>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+
+ clock-names = "xo", "sleep_clk";
+ clocks = <&xo_board>,
+ <&sleep_clk>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sdm845.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sdm845.yaml
index d902f137ab17..68e1b7822fe0 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-sdm845.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sdm845.yaml
@@ -4,63 +4,81 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-sdm845.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding
+title: Qualcomm Global Clock & Reset Controller on SDM670 and SDM845
maintainers:
- Stephen Boyd <sboyd@kernel.org>
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on SDM845
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SDM670 and SDM845.
- See also:
- - dt-bindings/clock/qcom,gcc-sdm845.h
+ See also:: include/dt-bindings/clock/qcom,gcc-sdm845.h
properties:
compatible:
- const: qcom,gcc-sdm845
+ enum:
+ - qcom,gcc-sdm670
+ - qcom,gcc-sdm845
clocks:
- items:
- - description: Board XO source
- - description: Board active XO source
- - description: Sleep clock source
- - description: PCIE 0 Pipe clock source
- - description: PCIE 1 Pipe clock source
+ minItems: 3
+ maxItems: 5
clock-names:
- items:
- - const: bi_tcxo
- - const: bi_tcxo_ao
- - const: sleep_clk
- - const: pcie_0_pipe_clk
- - const: pcie_1_pipe_clk
+ minItems: 3
+ maxItems: 5
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
+ power-domains:
maxItems: 1
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
-
required:
- compatible
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: qcom,gcc-sdm670
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board active XO source
+ - description: Sleep clock source
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: bi_tcxo_ao
+ - const: sleep_clk
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: qcom,gcc-sdm845
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board active XO source
+ - description: Sleep clock source
+ - description: PCIE 0 Pipe clock source
+ - description: PCIE 1 Pipe clock source
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: bi_tcxo_ao
+ - const: sleep_clk
+ - const: pcie_0_pipe_clk
+ - const: pcie_1_pipe_clk
+
+unevaluatedProperties: false
examples:
# Example for GCC for SDM845:
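For the SDM670 branch added above, a node validating against the three-clock variant would look roughly like this (the reg window is an assumption carried over from the SDM845 example):

    #include <dt-bindings/clock/qcom,rpmh.h>

    clock-controller@100000 {
        compatible = "qcom,gcc-sdm670";
        reg = <0x00100000 0x1f0000>;
        clocks = <&rpmhcc RPMH_CXO_CLK>,
                 <&rpmhcc RPMH_CXO_CLK_A>,
                 <&sleep_clk>;
        clock-names = "bi_tcxo", "bi_tcxo_ao", "sleep_clk";
        #clock-cells = <1>;
        #reset-cells = <1>;
        #power-domain-cells = <1>;
    };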
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sdx55.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sdx55.yaml
index b0d1c65aa354..428e954d7638 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-sdx55.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sdx55.yaml
@@ -4,18 +4,17 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-sdx55.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for SDX55
+title: Qualcomm Global Clock & Reset Controller on SDX55
maintainers:
- Vinod Koul <vkoul@kernel.org>
- Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
+ Qualcomm global clock control module provides the clocks, resets and
 power domains on SDX55.
- See also:
- - dt-bindings/clock/qcom,gcc-sdx55.h
+ See also:: include/dt-bindings/clock/qcom,gcc-sdx55.h
properties:
compatible:
@@ -25,38 +24,21 @@ properties:
items:
- description: Board XO source
- description: Sleep clock source
- - description: PLL test clock source (Optional clock)
- minItems: 2
clock-names:
items:
- const: bi_tcxo
- const: sleep_clk
- - const: core_bi_pll_test_se # Optional clock
- minItems: 2
-
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
required:
- compatible
- clocks
- clock-names
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
@@ -65,8 +47,9 @@ examples:
compatible = "qcom,gcc-sdx55";
reg = <0x00100000 0x1f0000>;
clocks = <&rpmhcc RPMH_CXO_CLK>,
- <&sleep_clk>, <&pll_test_clk>;
- clock-names = "bi_tcxo", "sleep_clk", "core_bi_pll_test_se";
+ <&sleep_clk>;
+ clock-names = "bi_tcxo",
+ "sleep_clk";
#clock-cells = <1>;
#reset-cells = <1>;
#power-domain-cells = <1>;
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sdx65.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sdx65.yaml
new file mode 100644
index 000000000000..523e18d7f150
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sdx65.yaml
@@ -0,0 +1,62 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-sdx65.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on SDX65
+
+maintainers:
+ - Vamsi krishna Lanka <quic_vamslank@quicinc.com>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SDX65.
+
+ See also:: include/dt-bindings/clock/qcom,gcc-sdx65.h
+
+properties:
+ compatible:
+ const: qcom,gcc-sdx65
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board active XO source
+ - description: Sleep clock source
+ - description: PCIE Pipe clock source
+ - description: USB3 phy wrapper pipe clock source
+
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: bi_tcxo_ao
+ - const: sleep_clk
+ - const: pcie_pipe_clk
+ - const: usb3_phy_wrapper_gcc_usb30_pipe_clk
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ clock-controller@100000 {
+ compatible = "qcom,gcc-sdx65";
+ reg = <0x100000 0x1f7400>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>, <&rpmhcc RPMH_CXO_CLK_A>, <&sleep_clk>,
+ <&pcie_pipe_clk>, <&usb3_phy_wrapper_gcc_usb30_pipe_clk>;
+ clock-names = "bi_tcxo", "bi_tcxo_ao", "sleep_clk",
+ "pcie_pipe_clk", "usb3_phy_wrapper_gcc_usb30_pipe_clk";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sm6115.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sm6115.yaml
index 26050da844d5..a5ad0a3da397 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-sm6115.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sm6115.yaml
@@ -4,17 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-sm6115.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for SM6115 and SM4250
+title: Qualcomm Global Clock & Reset Controller on SM6115 and SM4250
maintainers:
- Iskren Chernev <iskren.chernev@gmail.com>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on SM4250/6115.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SM4250/6115.
- See also:
- - dt-bindings/clock/qcom,gcc-sm6115.h
+ See also:: include/dt-bindings/clock/qcom,gcc-sm6115.h
properties:
compatible:
@@ -30,32 +29,15 @@ properties:
- const: bi_tcxo
- const: sleep_clk
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
-
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
-
required:
- compatible
- clocks
- clock-names
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sm6125.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sm6125.yaml
index ab12b391effc..8e37623788bd 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-sm6125.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sm6125.yaml
@@ -4,17 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-sm6125.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for SM6125
+title: Qualcomm Global Clock & Reset Controller on SM6125
maintainers:
- Konrad Dybcio <konrad.dybcio@somainline.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on SM6125.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SM6125.
- See also:
- - dt-bindings/clock/qcom,gcc-sm6125.h
+ See also:: include/dt-bindings/clock/qcom,gcc-sm6125.h
properties:
compatible:
@@ -30,32 +29,15 @@ properties:
- const: bi_tcxo
- const: sleep_clk
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
-
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
-
required:
- compatible
- clocks
- clock-names
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sm6350.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sm6350.yaml
index 20926cd8293e..d1b26ab48eaf 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-sm6350.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sm6350.yaml
@@ -4,17 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-sm6350.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for SM6350
+title: Qualcomm Global Clock & Reset Controller on SM6350
maintainers:
- Konrad Dybcio <konrad.dybcio@somainline.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on SM6350.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SM6350.
- See also:
- - dt-bindings/clock/qcom,gcc-sm6350.h
+ See also:: include/dt-bindings/clock/qcom,gcc-sm6350.h
properties:
compatible:
@@ -32,32 +31,15 @@ properties:
- const: bi_tcxo_ao
- const: sleep_clk
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
-
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
-
required:
- compatible
- clocks
- clock-names
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sm8150.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sm8150.yaml
index 12766a866625..3ea0ff37a4cb 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-sm8150.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sm8150.yaml
@@ -4,18 +4,17 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-sm8150.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for SM8150
+title: Qualcomm Global Clock & Reset Controller on SM8150
maintainers:
- Stephen Boyd <sboyd@kernel.org>
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on SM8150.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SM8150.
- See also:
- - dt-bindings/clock/qcom,gcc-sm8150.h
+ See also:: include/dt-bindings/clock/qcom,gcc-sm8150.h
properties:
compatible:
@@ -31,32 +30,15 @@ properties:
- const: bi_tcxo
- const: sleep_clk
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
-
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
-
required:
- compatible
- clocks
- clock-names
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sm8250.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sm8250.yaml
index 80bd6caf5bc9..b752542ee20c 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-sm8250.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sm8250.yaml
@@ -4,18 +4,17 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-sm8250.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for SM8250
+title: Qualcomm Global Clock & Reset Controller on SM8250
maintainers:
- Stephen Boyd <sboyd@kernel.org>
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on SM8250.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SM8250.
- See also:
- - dt-bindings/clock/qcom,gcc-sm8250.h
+ See also:: include/dt-bindings/clock/qcom,gcc-sm8250.h
properties:
compatible:
@@ -31,32 +30,15 @@ properties:
- const: bi_tcxo
- const: sleep_clk
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
-
- protected-clocks:
- description:
- Protected clock specifier list as per common clock binding.
-
required:
- compatible
- clocks
- clock-names
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sm8350.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sm8350.yaml
index 1122700dcc2b..b4fdde71ef18 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc-sm8350.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sm8350.yaml
@@ -4,17 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc-sm8350.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding for SM8350
+title: Qualcomm Global Clock & Reset Controller on SM8350
maintainers:
- Vinod Koul <vkoul@kernel.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains on SM8350.
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SM8350.
- See also:
- - dt-bindings/clock/qcom,gcc-sm8350.h
+ See also:: include/dt-bindings/clock/qcom,gcc-sm8350.h
properties:
compatible:
@@ -24,7 +23,6 @@ properties:
items:
- description: Board XO source
- description: Sleep clock source
- - description: PLL test clock source (Optional clock)
- description: PCIE 0 Pipe clock source (Optional clock)
- description: PCIE 1 Pipe clock source (Optional clock)
- description: UFS card Rx symbol 0 clock source (Optional clock)
@@ -41,7 +39,6 @@ properties:
items:
- const: bi_tcxo
- const: sleep_clk
- - const: core_bi_pll_test_se # Optional clock
- const: pcie_0_pipe_clk # Optional clock
- const: pcie_1_pipe_clk # Optional clock
- const: ufs_card_rx_symbol_0_clk # Optional clock
@@ -54,28 +51,15 @@ properties:
- const: usb3_uni_phy_sec_gcc_usb30_pipe_clk # Optional clock
minItems: 2
- '#clock-cells':
- const: 1
-
- '#reset-cells':
- const: 1
-
- '#power-domain-cells':
- const: 1
-
- reg:
- maxItems: 1
-
required:
- compatible
- clocks
- clock-names
- - reg
- - '#clock-cells'
- - '#reset-cells'
- - '#power-domain-cells'
-additionalProperties: false
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc-sm8450.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc-sm8450.yaml
new file mode 100644
index 000000000000..9a31981fbeb2
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc-sm8450.yaml
@@ -0,0 +1,71 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,gcc-sm8450.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on SM8450
+
+maintainers:
+ - Vinod Koul <vkoul@kernel.org>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SM8450.
+
+ See also:: include/dt-bindings/clock/qcom,gcc-sm8450.h
+
+properties:
+ compatible:
+ const: qcom,gcc-sm8450
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Sleep clock source
+ - description: PCIE 0 Pipe clock source (Optional clock)
+ - description: PCIE 1 Pipe clock source (Optional clock)
+ - description: PCIE 1 Phy Auxiliary clock source (Optional clock)
+ - description: UFS Phy Rx symbol 0 clock source (Optional clock)
+ - description: UFS Phy Rx symbol 1 clock source (Optional clock)
+ - description: UFS Phy Tx symbol 0 clock source (Optional clock)
+ - description: USB3 Phy wrapper pipe clock source (Optional clock)
+ minItems: 2
+
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: sleep_clk
+ - const: pcie_0_pipe_clk # Optional clock
+ - const: pcie_1_pipe_clk # Optional clock
+ - const: pcie_1_phy_aux_clk # Optional clock
+ - const: ufs_phy_rx_symbol_0_clk # Optional clock
+ - const: ufs_phy_rx_symbol_1_clk # Optional clock
+ - const: ufs_phy_tx_symbol_0_clk # Optional clock
+ - const: usb3_phy_wrapper_gcc_usb30_pipe_clk # Optional clock
+ minItems: 2
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ clock-controller@100000 {
+ compatible = "qcom,gcc-sm8450";
+ reg = <0x00100000 0x001f4200>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>, <&sleep_clk>;
+ clock-names = "bi_tcxo", "sleep_clk";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,gcc.yaml b/Documentation/devicetree/bindings/clock/qcom,gcc.yaml
index 2f20f8aa932a..7129fbcf2b6c 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gcc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gcc.yaml
@@ -4,59 +4,17 @@
$id: http://devicetree.org/schemas/clock/qcom,gcc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Global Clock & Reset Controller Binding
+title: Qualcomm Global Clock & Reset Controller Common Properties
maintainers:
- Stephen Boyd <sboyd@kernel.org>
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm global clock control module which supports the clocks, resets and
- power domains.
-
- See also:
- - dt-bindings/clock/qcom,gcc-apq8084.h
- - dt-bindings/reset/qcom,gcc-apq8084.h
- - dt-bindings/clock/qcom,gcc-ipq4019.h
- - dt-bindings/clock/qcom,gcc-ipq6018.h
- - dt-bindings/reset/qcom,gcc-ipq6018.h
- - dt-bindings/clock/qcom,gcc-ipq806x.h (qcom,gcc-ipq8064)
- - dt-bindings/reset/qcom,gcc-ipq806x.h (qcom,gcc-ipq8064)
- - dt-bindings/clock/qcom,gcc-msm8939.h
- - dt-bindings/clock/qcom,gcc-msm8953.h
- - dt-bindings/reset/qcom,gcc-msm8939.h
- - dt-bindings/clock/qcom,gcc-msm8660.h
- - dt-bindings/reset/qcom,gcc-msm8660.h
- - dt-bindings/clock/qcom,gcc-msm8974.h (qcom,gcc-msm8226 and qcom,gcc-msm8974)
- - dt-bindings/reset/qcom,gcc-msm8974.h (qcom,gcc-msm8226 and qcom,gcc-msm8974)
- - dt-bindings/clock/qcom,gcc-msm8994.h
- - dt-bindings/clock/qcom,gcc-mdm9607.h
- - dt-bindings/clock/qcom,gcc-mdm9615.h
- - dt-bindings/reset/qcom,gcc-mdm9615.h
- - dt-bindings/clock/qcom,gcc-sdm660.h (qcom,gcc-sdm630 and qcom,gcc-sdm660)
+ Common bindings for Qualcomm global clock control module providing the
+ clocks, resets and power domains.
properties:
- compatible:
- enum:
- - qcom,gcc-apq8084
- - qcom,gcc-ipq4019
- - qcom,gcc-ipq6018
- - qcom,gcc-ipq8064
- - qcom,gcc-mdm9607
- - qcom,gcc-msm8226
- - qcom,gcc-msm8660
- - qcom,gcc-msm8916
- - qcom,gcc-msm8939
- - qcom,gcc-msm8953
- - qcom,gcc-msm8960
- - qcom,gcc-msm8974
- - qcom,gcc-msm8974pro
- - qcom,gcc-msm8974pro-ac
- - qcom,gcc-msm8994
- - qcom,gcc-mdm9615
- - qcom,gcc-sdm630
- - qcom,gcc-sdm660
-
'#clock-cells':
const: 1
@@ -74,22 +32,11 @@ properties:
Protected clock specifier list as per common clock binding.
required:
- - compatible
- reg
- '#clock-cells'
- '#reset-cells'
- '#power-domain-cells'
-additionalProperties: false
+additionalProperties: true
-examples:
- # Example for GCC for MSM8960:
- - |
- clock-controller@900000 {
- compatible = "qcom,gcc-msm8960";
- reg = <0x900000 0x4000>;
- #clock-cells = <1>;
- #reset-cells = <1>;
- #power-domain-cells = <1>;
- };
...
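
This hunk completes the refactor the series applies to every Qualcomm GCC binding: the shared qcom,gcc.yaml keeps only the common properties and switches to additionalProperties: true so schemas referencing it can add their own, while each per-SoC schema pulls the common set in via allOf and closes validation with unevaluatedProperties: false. A minimal sketch of a leaf schema built on this pattern (the qcom,gcc-example compatible and its clock list are hypothetical):

    # Hypothetical leaf schema: only the SoC-specific compatible and clock
    # inputs are spelled out; reg, #clock-cells, #reset-cells,
    # #power-domain-cells and protected-clocks are inherited from the
    # referenced common schema.
    properties:
      compatible:
        const: qcom,gcc-example
      clocks:
        items:
          - description: Board XO source
          - description: Sleep clock source
      clock-names:
        items:
          - const: bi_tcxo
          - const: sleep_clk

    required:
      - compatible
      - clocks
      - clock-names

    allOf:
      - $ref: qcom,gcc.yaml#

    unevaluatedProperties: false
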
diff --git a/Documentation/devicetree/bindings/clock/qcom,gpucc-sdm660.yaml b/Documentation/devicetree/bindings/clock/qcom,gpucc-sdm660.yaml
index 3f70eb59aae3..0518ea963cdd 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gpucc-sdm660.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gpucc-sdm660.yaml
@@ -4,13 +4,13 @@
$id: http://devicetree.org/schemas/clock/qcom,gpucc-sdm660.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Graphics Clock & Reset Controller Binding for SDM630 and SDM660
+title: Qualcomm Graphics Clock & Reset Controller on SDM630 and SDM660
maintainers:
- AngeloGioacchino Del Regno <angelogioacchino.delregno@somainline.org>
description: |
- Qualcomm graphics clock control module which supports the clocks, resets and
+ Qualcomm graphics clock control module provides the clocks, resets and
power domains on SDM630 and SDM660.
See also dt-bindings/clock/qcom,gpucc-sdm660.h.
diff --git a/Documentation/devicetree/bindings/clock/qcom,gpucc.yaml b/Documentation/devicetree/bindings/clock/qcom,gpucc.yaml
index 46dff46d5760..1e3dc9deded9 100644
--- a/Documentation/devicetree/bindings/clock/qcom,gpucc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,gpucc.yaml
@@ -4,31 +4,39 @@
$id: http://devicetree.org/schemas/clock/qcom,gpucc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Graphics Clock & Reset Controller Binding
+title: Qualcomm Graphics Clock & Reset Controller
maintainers:
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm graphics clock control module which supports the clocks, resets and
- power domains on Qualcomm SoCs.
+ Qualcomm graphics clock control module provides the clocks, resets and power
+ domains on Qualcomm SoCs.
- See also:
- dt-bindings/clock/qcom,gpucc-sdm845.h
- dt-bindings/clock/qcom,gpucc-sc7180.h
- dt-bindings/clock/qcom,gpucc-sc7280.h
- dt-bindings/clock/qcom,gpucc-sm8150.h
- dt-bindings/clock/qcom,gpucc-sm8250.h
+ See also::
+ include/dt-bindings/clock/qcom,gpucc-sdm845.h
+ include/dt-bindings/clock/qcom,gpucc-sa8775p.h
+ include/dt-bindings/clock/qcom,gpucc-sc7180.h
+ include/dt-bindings/clock/qcom,gpucc-sc7280.h
+ include/dt-bindings/clock/qcom,gpucc-sc8280xp.h
+ include/dt-bindings/clock/qcom,gpucc-sm6350.h
+ include/dt-bindings/clock/qcom,gpucc-sm8150.h
+ include/dt-bindings/clock/qcom,gpucc-sm8250.h
+ include/dt-bindings/clock/qcom,gpucc-sm8350.h
properties:
compatible:
enum:
- qcom,sdm845-gpucc
+ - qcom,sa8775p-gpucc
- qcom,sc7180-gpucc
- qcom,sc7280-gpucc
- qcom,sc8180x-gpucc
+ - qcom,sc8280xp-gpucc
+ - qcom,sm6350-gpucc
- qcom,sm8150-gpucc
- qcom,sm8250-gpucc
+ - qcom,sm8350-gpucc
clocks:
items:
diff --git a/Documentation/devicetree/bindings/clock/qcom,ipq5332-gcc.yaml b/Documentation/devicetree/bindings/clock/qcom,ipq5332-gcc.yaml
new file mode 100644
index 000000000000..718fe0625424
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,ipq5332-gcc.yaml
@@ -0,0 +1,53 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,ipq5332-gcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on IPQ5332
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on IPQ5332.
+
+ See also:: include/dt-bindings/clock/qcom,gcc-ipq5332.h
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+properties:
+ compatible:
+ const: qcom,ipq5332-gcc
+
+ clocks:
+ items:
+ - description: Board XO clock source
+ - description: Sleep clock source
+ - description: PCIE 2lane PHY pipe clock source
+ - description: PCIE 2lane x1 PHY pipe clock source (For second lane)
+ - description: USB PCIE wrapper pipe clock source
+
+required:
+ - compatible
+ - clocks
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ clock-controller@1800000 {
+ compatible = "qcom,ipq5332-gcc";
+ reg = <0x01800000 0x80000>;
+ clocks = <&xo_board>,
+ <&sleep_clk>,
+ <&pcie_2lane_phy_pipe_clk>,
+ <&pcie_2lane_phy_pipe_clk_x1>,
+ <&usb_pcie_wrapper_pipe_clk>;
+ #clock-cells = <1>;
+ #power-domain-cells = <1>;
+ #reset-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,ipq9574-gcc.yaml b/Documentation/devicetree/bindings/clock/qcom,ipq9574-gcc.yaml
new file mode 100644
index 000000000000..afc68eb9d7cc
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,ipq9574-gcc.yaml
@@ -0,0 +1,61 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,ipq9574-gcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on IPQ9574
+
+maintainers:
+ - Anusha Rao <quic_anusha@quicinc.com>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on IPQ9574.
+
+ See also::
+ include/dt-bindings/clock/qcom,ipq9574-gcc.h
+ include/dt-bindings/reset/qcom,ipq9574-gcc.h
+
+properties:
+ compatible:
+ const: qcom,ipq9574-gcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Sleep clock source
+ - description: Bias PLL ubi clock source
+ - description: PCIE30 PHY0 pipe clock source
+ - description: PCIE30 PHY1 pipe clock source
+ - description: PCIE30 PHY2 pipe clock source
+ - description: PCIE30 PHY3 pipe clock source
+ - description: USB3 PHY pipe clock source
+
+required:
+ - compatible
+ - clocks
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ clock-controller@1800000 {
+ compatible = "qcom,ipq9574-gcc";
+ reg = <0x01800000 0x80000>;
+ clocks = <&xo_board_clk>,
+ <&sleep_clk>,
+ <&bias_pll_ubi_nc_clk>,
+ <&pcie30_phy0_pipe_clk>,
+ <&pcie30_phy1_pipe_clk>,
+ <&pcie30_phy2_pipe_clk>,
+ <&pcie30_phy3_pipe_clk>,
+ <&usb3phy_0_cc_pipe_clk>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,kpss-acc-v1.yaml b/Documentation/devicetree/bindings/clock/qcom,kpss-acc-v1.yaml
new file mode 100644
index 000000000000..a466e4e8aacd
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,kpss-acc-v1.yaml
@@ -0,0 +1,72 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,kpss-acc-v1.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Krait Processor Sub-system (KPSS) Application Clock Controller (ACC) v1
+
+maintainers:
+ - Christian Marangi <ansuelsmth@gmail.com>
+
+description:
+ The KPSS ACC provides clock, power domain, and reset control to a Krait CPU.
+ There is one ACC register region per CPU within the KPSS remapped region as
+ well as an alias register region that remaps accesses to the ACC associated
+ with the CPU accessing the region. ACC v1 is currently used as a
+ clock-controller for enabling the CPU and handling the aux clocks.
+
+properties:
+ compatible:
+ const: qcom,kpss-acc-v1
+
+ reg:
+ items:
+ - description: Base address and size of the register region
+ - description: Optional base address and size of the alias register region
+ minItems: 1
+
+ clocks:
+ minItems: 2
+ maxItems: 2
+
+ clock-names:
+ items:
+ - const: pll8_vote
+ - const: pxo
+
+ clock-output-names:
+ description: Name of the aux clock. Krait can have at most 4 CPUs.
+ enum:
+ - acpu0_aux
+ - acpu1_aux
+ - acpu2_aux
+ - acpu3_aux
+
+ '#clock-cells':
+ const: 0
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - clock-output-names
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-ipq806x.h>
+
+ clock-controller@2088000 {
+ compatible = "qcom,kpss-acc-v1";
+ reg = <0x02088000 0x1000>, <0x02008000 0x1000>;
+ clocks = <&gcc PLL8_VOTE>, <&pxo_board>;
+ clock-names = "pll8_vote", "pxo";
+ clock-output-names = "acpu0_aux";
+ #clock-cells = <0>;
+ };
+
+...
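
Because there is one ACC per Krait core, boards instantiate one such node per CPU, varying only the register base and the aux clock name. A sketch of a second instance for CPU1, assuming the IPQ8064 layout where the per-CPU ACC regions sit 0x10000 apart and share the alias region (addresses are illustrative):

    /* Hypothetical CPU1 instance; addresses follow the IPQ8064 layout. */
    clock-controller@2098000 {
        compatible = "qcom,kpss-acc-v1";
        reg = <0x02098000 0x1000>, <0x02008000 0x1000>;
        clocks = <&gcc PLL8_VOTE>, <&pxo_board>;
        clock-names = "pll8_vote", "pxo";
        clock-output-names = "acpu1_aux";
        #clock-cells = <0>;
    };
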
diff --git a/Documentation/devicetree/bindings/clock/qcom,kpss-gcc.yaml b/Documentation/devicetree/bindings/clock/qcom,kpss-gcc.yaml
new file mode 100644
index 000000000000..88b7672123a0
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,kpss-gcc.yaml
@@ -0,0 +1,88 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,kpss-gcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Krait Processor Sub-system (KPSS) Global Clock Controller (GCC)
+
+maintainers:
+ - Christian Marangi <ansuelsmth@gmail.com>
+
+description:
+ Krait Processor Sub-system (KPSS) Global Clock Controller (GCC). Used
+ to control the L2 mux (in the current implementation) and to provide
+ access to the kpss-gcc registers.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - qcom,kpss-gcc-ipq8064
+ - qcom,kpss-gcc-apq8064
+ - qcom,kpss-gcc-msm8974
+ - qcom,kpss-gcc-msm8960
+ - qcom,kpss-gcc-msm8660
+ - qcom,kpss-gcc-mdm9615
+ - const: qcom,kpss-gcc
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ minItems: 2
+ maxItems: 2
+
+ clock-names:
+ items:
+ - const: pll8_vote
+ - const: pxo
+
+ '#clock-cells':
+ const: 0
+
+required:
+ - compatible
+ - reg
+
+if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,kpss-gcc-ipq8064
+ - qcom,kpss-gcc-apq8064
+ - qcom,kpss-gcc-msm8974
+ - qcom,kpss-gcc-msm8960
+then:
+ required:
+ - clocks
+ - clock-names
+ - '#clock-cells'
+else:
+ properties:
+ clocks: false
+ clock-names: false
+ '#clock-cells': false
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-ipq806x.h>
+
+ clock-controller@2011000 {
+ compatible = "qcom,kpss-gcc-ipq8064", "qcom,kpss-gcc", "syscon";
+ reg = <0x2011000 0x1000>;
+ clocks = <&gcc PLL8_VOTE>, <&pxo_board>;
+ clock-names = "pll8_vote", "pxo";
+ #clock-cells = <0>;
+ };
+
+ - |
+ clock-controller@2011000 {
+ compatible = "qcom,kpss-gcc-mdm9615", "qcom,kpss-gcc", "syscon";
+ reg = <0x02011000 0x1000>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,lcc.txt b/Documentation/devicetree/bindings/clock/qcom,lcc.txt
deleted file mode 100644
index a3c78aa88038..000000000000
--- a/Documentation/devicetree/bindings/clock/qcom,lcc.txt
+++ /dev/null
@@ -1,22 +0,0 @@
-Qualcomm LPASS Clock & Reset Controller Binding
-------------------------------------------------
-
-Required properties :
-- compatible : shall contain only one of the following:
-
- "qcom,lcc-msm8960"
- "qcom,lcc-apq8064"
- "qcom,lcc-ipq8064"
- "qcom,lcc-mdm9615"
-
-- reg : shall contain base register location and length
-- #clock-cells : shall contain 1
-- #reset-cells : shall contain 1
-
-Example:
- clock-controller@28000000 {
- compatible = "qcom,lcc-ipq8064";
- reg = <0x28000000 0x1000>;
- #clock-cells = <1>;
- #reset-cells = <1>;
- };
diff --git a/Documentation/devicetree/bindings/clock/qcom,lcc.yaml b/Documentation/devicetree/bindings/clock/qcom,lcc.yaml
new file mode 100644
index 000000000000..8c783823e93c
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,lcc.yaml
@@ -0,0 +1,86 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,lcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm LPASS Clock & Reset Controller
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+
+properties:
+ compatible:
+ enum:
+ - qcom,lcc-apq8064
+ - qcom,lcc-ipq8064
+ - qcom,lcc-mdm9615
+ - qcom,lcc-msm8960
+
+ clocks:
+ maxItems: 8
+
+ clock-names:
+ maxItems: 8
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - '#clock-cells'
+ - '#reset-cells'
+
+additionalProperties: false
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,lcc-apq8064
+ - qcom,lcc-msm8960
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board PXO source
+ - description: PLL 4 Vote clock
+ - description: MI2S codec clock
+ - description: Mic I2S codec clock
+ - description: Mic I2S spare clock
+ - description: Speaker I2S codec clock
+ - description: Speaker I2S spare clock
+ - description: PCM codec clock
+
+ clock-names:
+ items:
+ - const: pxo
+ - const: pll4_vote
+ - const: mi2s_codec_clk
+ - const: codec_i2s_mic_codec_clk
+ - const: spare_i2s_mic_codec_clk
+ - const: codec_i2s_spkr_codec_clk
+ - const: spare_i2s_spkr_codec_clk
+ - const: pcm_codec_clk
+
+ required:
+ - clocks
+ - clock-names
+
+examples:
+ - |
+ clock-controller@28000000 {
+ compatible = "qcom,lcc-ipq8064";
+ reg = <0x28000000 0x1000>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
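
For the qcom,lcc-apq8064 and qcom,lcc-msm8960 variants, the if/then branch above makes all eight inputs mandatory, in the order fixed by clock-names. A sketch of such a node (PLL4_VOTE is assumed to come from the MSM8960 GCC header; the codec clock phandles are board-specific placeholders):

    #include <dt-bindings/clock/qcom,gcc-msm8960.h>

    clock-controller@28000000 {
        compatible = "qcom,lcc-apq8064";
        reg = <0x28000000 0x1000>;
        clocks = <&pxo_board>, <&gcc PLL4_VOTE>,
                 <&mi2s_codec>, <&mic_codec>, <&mic_spare>,
                 <&spkr_codec>, <&spkr_spare>, <&pcm_codec>;
        clock-names = "pxo", "pll4_vote",
                      "mi2s_codec_clk", "codec_i2s_mic_codec_clk",
                      "spare_i2s_mic_codec_clk", "codec_i2s_spkr_codec_clk",
                      "spare_i2s_spkr_codec_clk", "pcm_codec_clk";
        #clock-cells = <1>;
        #reset-cells = <1>;
    };
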
diff --git a/Documentation/devicetree/bindings/clock/qcom,lpasscc.txt b/Documentation/devicetree/bindings/clock/qcom,lpasscc.txt
deleted file mode 100644
index b9e9787045b9..000000000000
--- a/Documentation/devicetree/bindings/clock/qcom,lpasscc.txt
+++ /dev/null
@@ -1,26 +0,0 @@
-Qualcomm LPASS Clock Controller Binding
------------------------------------------------
-
-Required properties :
-- compatible : shall contain "qcom,sdm845-lpasscc"
-- #clock-cells : from common clock binding, shall contain 1.
-- reg : shall contain base register address and size,
- in the order
- Index-0 maps to LPASS_CC register region
- Index-1 maps to LPASS_QDSP6SS register region
-
-Optional properties :
-- reg-names : register names of LPASS domain
- "cc", "qdsp6ss".
-
-Example:
-
-The below node has to be defined in the cases where the LPASS peripheral loader
-would bring the subsystem out of reset.
-
- lpasscc: clock-controller@17014000 {
- compatible = "qcom,sdm845-lpasscc";
- reg = <0x17014000 0x1f004>, <0x17300000 0x200>;
- reg-names = "cc", "qdsp6ss";
- #clock-cells = <1>;
- };
diff --git a/Documentation/devicetree/bindings/clock/qcom,mmcc.yaml b/Documentation/devicetree/bindings/clock/qcom,mmcc.yaml
index 68fdc3d4982a..acf0c923c24f 100644
--- a/Documentation/devicetree/bindings/clock/qcom,mmcc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,mmcc.yaml
@@ -4,14 +4,14 @@
$id: http://devicetree.org/schemas/clock/qcom,mmcc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Multimedia Clock & Reset Controller Binding
+title: Qualcomm Multimedia Clock & Reset Controller
maintainers:
- - Jeffrey Hugo <jhugo@codeaurora.org>
+ - Jeffrey Hugo <quic_jhugo@quicinc.com>
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm multimedia clock control module which supports the clocks, resets and
+ Qualcomm multimedia clock control module provides the clocks, resets and
power domains.
properties:
@@ -19,6 +19,7 @@ properties:
enum:
- qcom,mmcc-apq8064
- qcom,mmcc-apq8084
+ - qcom,mmcc-msm8226
- qcom,mmcc-msm8660
- qcom,mmcc-msm8960
- qcom,mmcc-msm8974
@@ -30,30 +31,12 @@ properties:
- qcom,mmcc-sdm660
clocks:
- items:
- - description: Board XO source
- - description: Board sleep source
- - description: Global PLL 0 clock
- - description: DSI phy instance 0 dsi clock
- - description: DSI phy instance 0 byte clock
- - description: DSI phy instance 1 dsi clock
- - description: DSI phy instance 1 byte clock
- - description: HDMI phy PLL clock
- - description: DisplayPort phy PLL vco clock
- - description: DisplayPort phy PLL link clock
+ minItems: 8
+ maxItems: 13
clock-names:
- items:
- - const: xo
- - const: sleep
- - const: gpll0
- - const: dsi0dsi
- - const: dsi0byte
- - const: dsi1dsi
- - const: dsi1byte
- - const: hdmipll
- - const: dpvco
- - const: dplink
+ minItems: 8
+ maxItems: 13
'#clock-cells':
const: 1
@@ -84,16 +67,255 @@ required:
additionalProperties: false
-if:
- properties:
- compatible:
- contains:
- const: qcom,mmcc-msm8998
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,mmcc-apq8064
+ - qcom,mmcc-msm8960
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board PXO source
+ - description: PLL 3 clock
+ - description: PLL 3 Vote clock
+ - description: DSI phy instance 1 dsi clock
+ - description: DSI phy instance 1 byte clock
+ - description: DSI phy instance 2 dsi clock
+ - description: DSI phy instance 2 byte clock
+ - description: HDMI phy PLL clock
-then:
- required:
- - clocks
- - clock-names
+ clock-names:
+ items:
+ - const: pxo
+ - const: pll3
+ - const: pll8_vote
+ - const: dsi1pll
+ - const: dsi1pllbyte
+ - const: dsi2pll
+ - const: dsi2pllbyte
+ - const: hdmipll
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,mmcc-msm8974
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ - description: MMSS GPLL0 voted clock
+ - description: GPLL0 voted clock
+ - description: GPLL1 voted clock
+ - description: GFX3D clock source
+ - description: DSI phy instance 0 dsi clock
+ - description: DSI phy instance 0 byte clock
+ - description: DSI phy instance 1 dsi clock
+ - description: DSI phy instance 1 byte clock
+ - description: HDMI phy PLL clock
+ - description: eDP phy PLL link clock
+ - description: eDP phy PLL vco clock
+
+ clock-names:
+ items:
+ - const: xo
+ - const: mmss_gpll0_vote
+ - const: gpll0_vote
+ - const: gpll1_vote
+ - const: gfx3d_clk_src
+ - const: dsi0pll
+ - const: dsi0pllbyte
+ - const: dsi1pll
+ - const: dsi1pllbyte
+ - const: hdmipll
+ - const: edp_link_clk
+ - const: edp_vco_div
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,mmcc-apq8084
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board sleep source
+ - description: MMSS GPLL0 voted clock
+ - description: GPLL0 clock
+ - description: GPLL0 voted clock
+ - description: GPLL1 clock
+ - description: DSI phy instance 0 dsi clock
+ - description: DSI phy instance 0 byte clock
+ - description: DSI phy instance 1 dsi clock
+ - description: DSI phy instance 1 byte clock
+ - description: HDMI phy PLL clock
+ - description: eDP phy PLL link clock
+ - description: eDP phy PLL vco clock
+
+ clock-names:
+ items:
+ - const: xo
+ - const: sleep_clk
+ - const: mmss_gpll0_vote
+ - const: gpll0
+ - const: gpll0_vote
+ - const: gpll1
+ - const: dsi0pll
+ - const: dsi0pllbyte
+ - const: dsi1pll
+ - const: dsi1pllbyte
+ - const: hdmipll
+ - const: edp_link_clk
+ - const: edp_vco_div
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,mmcc-msm8994
+ - qcom,mmcc-msm8998
+ - qcom,mmcc-sdm630
+ - qcom,mmcc-sdm660
+ then:
+ required:
+ - clocks
+ - clock-names
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: qcom,mmcc-msm8994
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Global PLL 0 clock
+ - description: MMSS NoC AHB clock
+ - description: GFX3D clock
+ - description: DSI phy instance 0 dsi clock
+ - description: DSI phy instance 0 byte clock
+ - description: DSI phy instance 1 dsi clock
+ - description: DSI phy instance 1 byte clock
+ - description: HDMI phy PLL clock
+
+ clock-names:
+ items:
+ - const: xo
+ - const: gpll0
+ - const: mmssnoc_ahb
+ - const: oxili_gfx3d_clk_src
+ - const: dsi0pll
+ - const: dsi0pllbyte
+ - const: dsi1pll
+ - const: dsi1pllbyte
+ - const: hdmipll
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: qcom,mmcc-msm8996
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Global PLL 0 clock
+ - description: MMSS NoC AHB clock
+ - description: DSI phy instance 0 dsi clock
+ - description: DSI phy instance 0 byte clock
+ - description: DSI phy instance 1 dsi clock
+ - description: DSI phy instance 1 byte clock
+ - description: HDMI phy PLL clock
+
+ clock-names:
+ items:
+ - const: xo
+ - const: gpll0
+ - const: gcc_mmss_noc_cfg_ahb_clk
+ - const: dsi0pll
+ - const: dsi0pllbyte
+ - const: dsi1pll
+ - const: dsi1pllbyte
+ - const: hdmipll
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: qcom,mmcc-msm8998
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Global PLL 0 clock
+ - description: DSI phy instance 0 dsi clock
+ - description: DSI phy instance 0 byte clock
+ - description: DSI phy instance 1 dsi clock
+ - description: DSI phy instance 1 byte clock
+ - description: HDMI phy PLL clock
+ - description: DisplayPort phy PLL link clock
+ - description: DisplayPort phy PLL vco clock
+
+ clock-names:
+ items:
+ - const: xo
+ - const: gpll0
+ - const: dsi0dsi
+ - const: dsi0byte
+ - const: dsi1dsi
+ - const: dsi1byte
+ - const: hdmipll
+ - const: dplink
+ - const: dpvco
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,mmcc-sdm630
+ - qcom,mmcc-sdm660
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board sleep source
+ - description: Global PLL 0 clock
+ - description: Global PLL 0 DIV clock
+ - description: DSI phy instance 0 dsi clock
+ - description: DSI phy instance 0 byte clock
+ - description: DSI phy instance 1 dsi clock
+ - description: DSI phy instance 1 byte clock
+ - description: DisplayPort phy PLL link clock
+ - description: DisplayPort phy PLL vco clock
+
+ clock-names:
+ items:
+ - const: xo
+ - const: sleep_clk
+ - const: gpll0
+ - const: gpll0_div
+ - const: dsi0pll
+ - const: dsi0pllbyte
+ - const: dsi1pll
+ - const: dsi1pllbyte
+ - const: dp_link_2x_clk_divsel_five
+ - const: dp_vco_divided_clk_src_mux
examples:
# Example for MMCC for MSM8960:
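
With the clock lists now keyed off the compatible string, each SoC must supply exactly the inputs of its if/then branch. A sketch for qcom,mmcc-msm8974 following the branch above (provider phandles are illustrative; on real boards the DSI/HDMI/eDP inputs come from the respective PHY nodes and the GPLL votes from the GCC):

    clock-controller@fd8c0000 {
        compatible = "qcom,mmcc-msm8974";
        reg = <0xfd8c0000 0x6000>;
        clocks = <&xo_board>,
                 <&gcc_mmss_gpll0_vote>, <&gcc_gpll0_vote>, <&gcc_gpll1_vote>,
                 <&gfx3d_clk_src>,
                 <&dsi0_phy 1>, <&dsi0_phy 0>,
                 <&dsi1_phy 1>, <&dsi1_phy 0>,
                 <&hdmi_phy>,
                 <&edp_phy 1>, <&edp_phy 0>;
        clock-names = "xo", "mmss_gpll0_vote", "gpll0_vote", "gpll1_vote",
                      "gfx3d_clk_src",
                      "dsi0pll", "dsi0pllbyte", "dsi1pll", "dsi1pllbyte",
                      "hdmipll", "edp_link_clk", "edp_vco_div";
        #clock-cells = <1>;
        #reset-cells = <1>;
        #power-domain-cells = <1>;
    };
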
diff --git a/Documentation/devicetree/bindings/clock/qcom,msm8996-apcc.yaml b/Documentation/devicetree/bindings/clock/qcom,msm8996-apcc.yaml
index a20cb10636dd..fcace96c72eb 100644
--- a/Documentation/devicetree/bindings/clock/qcom,msm8996-apcc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,msm8996-apcc.yaml
@@ -26,22 +26,20 @@ properties:
clocks:
items:
- - description: Primary PLL clock for power cluster (little)
- - description: Primary PLL clock for perf cluster (big)
- - description: Alternate PLL clock for power cluster (little)
- - description: Alternate PLL clock for perf cluster (big)
+ - description: XO source
+ - description: SYS APCS AUX clock
clock-names:
items:
- - const: pwrcl_pll
- - const: perfcl_pll
- - const: pwrcl_alt_pll
- - const: perfcl_alt_pll
+ - const: xo
+ - const: sys_apcs_aux
required:
- compatible
- reg
- '#clock-cells'
+ - clocks
+ - clock-names
additionalProperties: false
@@ -51,4 +49,7 @@ examples:
compatible = "qcom,msm8996-apcc";
reg = <0x6400000 0x90000>;
#clock-cells = <1>;
+
+ clocks = <&xo_board>, <&apcs_glb>;
+ clock-names = "xo", "sys_apcs_aux";
};
diff --git a/Documentation/devicetree/bindings/clock/qcom,msm8996-cbf.yaml b/Documentation/devicetree/bindings/clock/qcom,msm8996-cbf.yaml
new file mode 100644
index 000000000000..3ffe69d8cdd5
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,msm8996-cbf.yaml
@@ -0,0 +1,53 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,msm8996-cbf.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm MSM8996 Core Bus Fabric (CBF) clock controller
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+
+description: >
+ The clock controller for the Qualcomm MSM8996 CBF clock, which drives the
+ interconnect between two CPU clusters.
+
+properties:
+ compatible:
+ const: qcom,msm8996-cbf
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: XO source
+ - description: SYS APCS AUX clock
+
+ '#clock-cells':
+ const: 0
+
+ '#interconnect-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - '#clock-cells'
+ - '#interconnect-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+ clock-controller@9a11000 {
+ compatible = "qcom,msm8996-cbf";
+ reg = <0x09a11000 0x10000>;
+ clocks = <&rpmcc RPM_SMD_BB_CLK1>, <&apcs_glb>;
+ #clock-cells = <0>;
+ #interconnect-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,msm8998-gpucc.yaml b/Documentation/devicetree/bindings/clock/qcom,msm8998-gpucc.yaml
index d747bb58f0a7..2d8897991663 100644
--- a/Documentation/devicetree/bindings/clock/qcom,msm8998-gpucc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,msm8998-gpucc.yaml
@@ -4,16 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,msm8998-gpucc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Graphics Clock & Reset Controller Binding for MSM8998
+title: Qualcomm Graphics Clock & Reset Controller on MSM8998
maintainers:
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm graphics clock control module which supports the clocks, resets and
- power domains on MSM8998.
+ Qualcomm graphics clock control module provides the clocks, resets and power
+ domains on MSM8998.
- See also dt-bindings/clock/qcom,gpucc-msm8998.h.
+ See also:: include/dt-bindings/clock/qcom,gpucc-msm8998.h
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/clock/qcom,q6sstopcc.yaml b/Documentation/devicetree/bindings/clock/qcom,q6sstopcc.yaml
index bbaaf1e2a203..03fa30fe9253 100644
--- a/Documentation/devicetree/bindings/clock/qcom,q6sstopcc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,q6sstopcc.yaml
@@ -11,7 +11,7 @@ maintainers:
properties:
compatible:
- const: "qcom,qcs404-q6sstopcc"
+ const: qcom,qcs404-q6sstopcc
reg:
items:
diff --git a/Documentation/devicetree/bindings/clock/qcom,qcm2290-dispcc.yaml b/Documentation/devicetree/bindings/clock/qcom,qcm2290-dispcc.yaml
new file mode 100644
index 000000000000..4a00f2d41684
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,qcm2290-dispcc.yaml
@@ -0,0 +1,87 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,qcm2290-dispcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display Clock & Reset Controller on QCM2290
+
+maintainers:
+ - Loic Poulain <loic.poulain@linaro.org>
+
+description: |
+ Qualcomm display clock control module provides the clocks, resets and power
+ domains on QCM2290.
+
+ See also:: include/dt-bindings/clock/qcom,dispcc-qcm2290.h
+
+properties:
+ compatible:
+ const: qcom,qcm2290-dispcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board active-only XO source
+ - description: GPLL0 source from GCC
+ - description: GPLL0 div source from GCC
+ - description: Byte clock from DSI PHY
+ - description: Pixel clock from DSI PHY
+
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: bi_tcxo_ao
+ - const: gcc_disp_gpll0_clk_src
+ - const: gcc_disp_gpll0_div_clk_src
+ - const: dsi0_phy_pll_out_byteclk
+ - const: dsi0_phy_pll_out_dsiclk
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+ - '#reset-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-qcm2290.h>
+ #include <dt-bindings/clock/qcom,gcc-qcm2290.h>
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+ clock-controller@5f00000 {
+ compatible = "qcom,qcm2290-dispcc";
+ reg = <0x5f00000 0x20000>;
+ clocks = <&rpmcc RPM_SMD_XO_CLK_SRC>,
+ <&rpmcc RPM_SMD_XO_A_CLK_SRC>,
+ <&gcc GCC_DISP_GPLL0_CLK_SRC>,
+ <&gcc GCC_DISP_GPLL0_DIV_CLK_SRC>,
+ <&dsi0_phy 0>,
+ <&dsi0_phy 1>;
+ clock-names = "bi_tcxo",
+ "bi_tcxo_ao",
+ "gcc_disp_gpll0_clk_src",
+ "gcc_disp_gpll0_div_clk_src",
+ "dsi0_phy_pll_out_byteclk",
+ "dsi0_phy_pll_out_dsiclk";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,qdu1000-gcc.yaml b/Documentation/devicetree/bindings/clock/qcom,qdu1000-gcc.yaml
new file mode 100644
index 000000000000..767a9d03aa32
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,qdu1000-gcc.yaml
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,qdu1000-gcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller for QDU1000 and QRU1000
+
+maintainers:
+ - Melody Olvera <quic_molvera@quicinc.com>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on QDU1000 and QRU1000.
+
+ See also:: include/dt-bindings/clock/qcom,qdu1000-gcc.h
+
+properties:
+ compatible:
+ const: qcom,qdu1000-gcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Sleep clock source
+ - description: PCIE 0 Pipe clock source
+ - description: PCIE 0 Phy Auxiliary clock source
+ - description: USB3 Phy wrapper pipe clock source
+
+required:
+ - compatible
+ - clocks
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ clock-controller@100000 {
+ compatible = "qcom,qdu1000-gcc";
+ reg = <0x00100000 0x001f4200>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>, <&sleep_clk>,
+ <&pcie_0_pipe_clk>, <&pcie_0_phy_aux_clk>,
+ <&usb3_phy_wrapper_pipe_clk>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/qcom,rpmcc.txt b/Documentation/devicetree/bindings/clock/qcom,rpmcc.txt
deleted file mode 100644
index a4877881f1d8..000000000000
--- a/Documentation/devicetree/bindings/clock/qcom,rpmcc.txt
+++ /dev/null
@@ -1,62 +0,0 @@
-Qualcomm RPM Clock Controller Binding
-------------------------------------------------
-The RPM is a dedicated hardware engine for managing the shared
-SoC resources in order to keep the lowest power profile. It
-communicates with other hardware subsystems via shared memory
-and accepts clock requests, aggregates the requests and turns
-the clocks on/off or scales them on demand.
-
-Required properties :
-- compatible : shall contain only one of the following. The generic
- compatible "qcom,rpmcc" should be also included.
-
- "qcom,rpmcc-mdm9607", "qcom,rpmcc"
- "qcom,rpmcc-msm8660", "qcom,rpmcc"
- "qcom,rpmcc-apq8060", "qcom,rpmcc"
- "qcom,rpmcc-msm8226", "qcom,rpmcc"
- "qcom,rpmcc-msm8916", "qcom,rpmcc"
- "qcom,rpmcc-msm8936", "qcom,rpmcc"
- "qcom,rpmcc-msm8953", "qcom,rpmcc"
- "qcom,rpmcc-msm8974", "qcom,rpmcc"
- "qcom,rpmcc-msm8976", "qcom,rpmcc"
- "qcom,rpmcc-apq8064", "qcom,rpmcc"
- "qcom,rpmcc-ipq806x", "qcom,rpmcc"
- "qcom,rpmcc-msm8992",·"qcom,rpmcc"
- "qcom,rpmcc-msm8994",·"qcom,rpmcc"
- "qcom,rpmcc-msm8996", "qcom,rpmcc"
- "qcom,rpmcc-msm8998", "qcom,rpmcc"
- "qcom,rpmcc-qcs404", "qcom,rpmcc"
- "qcom,rpmcc-sdm660", "qcom,rpmcc"
- "qcom,rpmcc-sm6115", "qcom,rpmcc"
- "qcom,rpmcc-sm6125", "qcom,rpmcc"
-
-- #clock-cells : shall contain 1
-
-The clock enumerators are defined in <dt-bindings/clock/qcom,rpmcc.h>
-and come in pairs: FOO_CLK followed by FOO_A_CLK. The latter clock
-is an "active" clock, which means that the consumer only care that the
-clock is available when the apps CPU subsystem is active, i.e. not
-suspended or in deep idle. If it is important that the clock keeps running
-during system suspend, you need to specify the non-active clock, the one
-not containing *_A_* in the enumerator name.
-
-Example:
- smd {
- compatible = "qcom,smd";
-
- rpm {
- interrupts = <0 168 1>;
- qcom,ipc = <&apcs 8 0>;
- qcom,smd-edge = <15>;
-
- rpm_requests {
- compatible = "qcom,rpm-msm8916";
- qcom,smd-channels = "rpm_requests";
-
- rpmcc: clock-controller {
- compatible = "qcom,rpmcc-msm8916", "qcom,rpmcc";
- #clock-cells = <1>;
- };
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/clock/qcom,rpmcc.yaml b/Documentation/devicetree/bindings/clock/qcom,rpmcc.yaml
new file mode 100644
index 000000000000..3665dd30604a
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,rpmcc.yaml
@@ -0,0 +1,160 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,rpmcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm RPM Clock Controller
+
+maintainers:
+ - Bjorn Andersson <bjorn.andersson@linaro.org>
+ - Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
+
+description: |
+ The clock enumerators are defined in <dt-bindings/clock/qcom,rpmcc.h> and
+ come in pairs:: FOO_CLK followed by FOO_A_CLK. The latter clock is
+ an "active" clock, which means that the consumer only care that the clock is
+ available when the apps CPU subsystem is active, i.e. not suspended or in
+ deep idle. If it is important that the clock keeps running during system
+ suspend, you need to specify the non-active clock, the one not containing
+ *_A_* in the enumerator name.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - qcom,rpmcc-apq8060
+ - qcom,rpmcc-apq8064
+ - qcom,rpmcc-ipq806x
+ - qcom,rpmcc-mdm9607
+ - qcom,rpmcc-msm8226
+ - qcom,rpmcc-msm8660
+ - qcom,rpmcc-msm8909
+ - qcom,rpmcc-msm8916
+ - qcom,rpmcc-msm8917
+ - qcom,rpmcc-msm8936
+ - qcom,rpmcc-msm8953
+ - qcom,rpmcc-msm8974
+ - qcom,rpmcc-msm8976
+ - qcom,rpmcc-msm8992
+ - qcom,rpmcc-msm8994
+ - qcom,rpmcc-msm8996
+ - qcom,rpmcc-msm8998
+ - qcom,rpmcc-qcm2290
+ - qcom,rpmcc-qcs404
+ - qcom,rpmcc-sdm660
+ - qcom,rpmcc-sm6115
+ - qcom,rpmcc-sm6125
+ - qcom,rpmcc-sm6375
+ - const: qcom,rpmcc
+
+ '#clock-cells':
+ const: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 2
+
+ clock-names:
+ minItems: 1
+ maxItems: 2
+
+required:
+ - compatible
+ - '#clock-cells'
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,rpmcc-apq8060
+ - qcom,rpmcc-ipq806x
+ - qcom,rpmcc-msm8660
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: pxo clock
+
+ clock-names:
+ items:
+ - const: pxo
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: qcom,rpmcc-apq8064
+ then:
+ properties:
+ clocks:
+ items:
+ - description: pxo clock
+ - description: cxo clock
+
+ clock-names:
+ items:
+ - const: pxo
+ - const: cxo
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,rpmcc-mdm9607
+ - qcom,rpmcc-msm8226
+ - qcom,rpmcc-msm8916
+ - qcom,rpmcc-msm8917
+ - qcom,rpmcc-msm8936
+ - qcom,rpmcc-msm8953
+ - qcom,rpmcc-msm8974
+ - qcom,rpmcc-msm8976
+ - qcom,rpmcc-msm8992
+ - qcom,rpmcc-msm8994
+ - qcom,rpmcc-msm8996
+ - qcom,rpmcc-msm8998
+ - qcom,rpmcc-qcm2290
+ - qcom,rpmcc-qcs404
+ - qcom,rpmcc-sdm660
+ - qcom,rpmcc-sm6115
+ - qcom,rpmcc-sm6125
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: xo clock
+
+ clock-names:
+ items:
+ - const: xo
+
+additionalProperties: false
+
+examples:
+ - |
+ rpm {
+ rpm-requests {
+ compatible = "qcom,rpm-msm8916";
+ qcom,smd-channels = "rpm_requests";
+
+ clock-controller {
+ compatible = "qcom,rpmcc-msm8916", "qcom,rpmcc";
+ #clock-cells = <1>;
+ };
+ };
+ };
+
+ - |
+ rpm {
+ clock-controller {
+ compatible = "qcom,rpmcc-ipq806x", "qcom,rpmcc";
+ #clock-cells = <1>;
+ clocks = <&pxo_board>;
+ clock-names = "pxo";
+ };
+ };
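
The FOO_CLK/FOO_A_CLK pairing in the description matters on the consumer side: a device that must keep its clock ticking across system suspend has to reference the non-active enumerator. A hedged consumer sketch (the node and its address are illustrative; the enumerators come from dt-bindings/clock/qcom,rpmcc.h):

    #include <dt-bindings/clock/qcom,rpmcc.h>

    serial@78b0000 {
        /* RPM_SMD_XO_CLK_SRC stays up across suspend; the active-only
         * RPM_SMD_XO_A_CLK_SRC is only guaranteed while the apps CPU
         * subsystem is running. */
        clocks = <&rpmcc RPM_SMD_XO_CLK_SRC>;
        clock-names = "xo";
    };
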
diff --git a/Documentation/devicetree/bindings/clock/qcom,rpmhcc.yaml b/Documentation/devicetree/bindings/clock/qcom,rpmhcc.yaml
index 72212970e6f5..d5a250b7c2af 100644
--- a/Documentation/devicetree/bindings/clock/qcom,rpmhcc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,rpmhcc.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/qcom,rpmhcc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Technologies, Inc. RPMh Clocks Bindings
+title: Qualcomm Technologies, Inc. RPMh Clocks
maintainers:
- Taniya Das <tdas@codeaurora.org>
@@ -17,15 +17,22 @@ description: |
properties:
compatible:
enum:
+ - qcom,qdu1000-rpmh-clk
+ - qcom,sa8775p-rpmh-clk
- qcom,sc7180-rpmh-clk
- qcom,sc7280-rpmh-clk
- qcom,sc8180x-rpmh-clk
+ - qcom,sc8280xp-rpmh-clk
+ - qcom,sdm670-rpmh-clk
- qcom,sdm845-rpmh-clk
- qcom,sdx55-rpmh-clk
+ - qcom,sdx65-rpmh-clk
- qcom,sm6350-rpmh-clk
- qcom,sm8150-rpmh-clk
- qcom,sm8250-rpmh-clk
- qcom,sm8350-rpmh-clk
+ - qcom,sm8450-rpmh-clk
+ - qcom,sm8550-rpmh-clk
clocks:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/clock/qcom,sa8775p-gcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sa8775p-gcc.yaml
new file mode 100644
index 000000000000..0f641c235b13
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sa8775p-gcc.yaml
@@ -0,0 +1,84 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sa8775p-gcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on sa8775p
+
+maintainers:
+ - Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and
+ power domains on sa8775p.
+
+ See also:: include/dt-bindings/clock/qcom,sa8775p-gcc.h
+
+properties:
+ compatible:
+ const: qcom,sa8775p-gcc
+
+ clocks:
+ items:
+ - description: XO reference clock
+ - description: Sleep clock
+ - description: UFS memory first RX symbol clock
+ - description: UFS memory second RX symbol clock
+ - description: UFS memory first TX symbol clock
+ - description: UFS card first RX symbol clock
+ - description: UFS card second RX symbol clock
+ - description: UFS card first TX symbol clock
+ - description: Primary USB3 PHY wrapper pipe clock
+ - description: Secondary USB3 PHY wrapper pipe clock
+ - description: PCIe 0 pipe clock
+ - description: PCIe 1 pipe clock
+ - description: PCIe PHY clock
+ - description: First EMAC controller reference clock
+ - description: Second EMAC controller reference clock
+
+ protected-clocks:
+ maxItems: 240
+
+ power-domains:
+ maxItems: 1
+
+required:
+ - compatible
+ - clocks
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ gcc: clock-controller@100000 {
+ compatible = "qcom,sa8775p-gcc";
+ reg = <0x100000 0xc7018>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>,
+ <&sleep_clk>,
+ <&ufs_phy_rx_symbol_0_clk>,
+ <&ufs_phy_rx_symbol_1_clk>,
+ <&ufs_phy_tx_symbol_0_clk>,
+ <&ufs_card_rx_symbol_0_clk>,
+ <&ufs_card_rx_symbol_1_clk>,
+ <&ufs_card_tx_symbol_0_clk>,
+ <&usb_0_ssphy>,
+ <&usb_1_ssphy>,
+ <&pcie_0_pipe_clk>,
+ <&pcie_1_pipe_clk>,
+ <&pcie_phy_pipe_clk>,
+ <&rxc0_ref_clk>,
+ <&rxc1_ref_clk>;
+ power-domains = <&rpmhpd SA8775P_CX>;
+
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sc7180-camcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sc7180-camcc.yaml
index f49027edfc44..098c8acf4bad 100644
--- a/Documentation/devicetree/bindings/clock/qcom,sc7180-camcc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,sc7180-camcc.yaml
@@ -4,17 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,sc7180-camcc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Camera Clock & Reset Controller Binding for SC7180
+title: Qualcomm Camera Clock & Reset Controller on SC7180
maintainers:
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm camera clock control module which supports the clocks, resets and
- power domains on SC7180.
+ Qualcomm camera clock control module provides the clocks, resets and power
+ domains on SC7180.
- See also:
- - dt-bindings/clock/qcom,camcc-sc7180.h
+ See also:: include/dt-bindings/clock/qcom,camcc-sc7180.h
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/clock/qcom,sc7180-dispcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sc7180-dispcc.yaml
index e94847f92770..95ad16d0abc3 100644
--- a/Documentation/devicetree/bindings/clock/qcom,sc7180-dispcc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,sc7180-dispcc.yaml
@@ -4,16 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,sc7180-dispcc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Display Clock & Reset Controller Binding for SC7180
+title: Qualcomm Display Clock & Reset Controller on SC7180
maintainers:
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm display clock control module which supports the clocks, resets and
- power domains on SC7180.
+ Qualcomm display clock control module provides the clocks, resets and power
+ domains on SC7180.
- See also dt-bindings/clock/qcom,dispcc-sc7180.h.
+ See also:: include/dt-bindings/clock/qcom,dispcc-sc7180.h
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/clock/qcom,sc7180-lpasscorecc.yaml b/Documentation/devicetree/bindings/clock/qcom,sc7180-lpasscorecc.yaml
index c54172fbf29f..f297694ef8b8 100644
--- a/Documentation/devicetree/bindings/clock/qcom,sc7180-lpasscorecc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,sc7180-lpasscorecc.yaml
@@ -4,17 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,sc7180-lpasscorecc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm LPASS Core Clock Controller Binding for SC7180
+title: Qualcomm LPASS Core Clock Controller on SC7180
maintainers:
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm LPASS core clock control module which supports the clocks and
- power domains on SC7180.
+ Qualcomm LPASS core clock control module provides the clocks and power
+ domains on SC7180.
- See also:
- - dt-bindings/clock/qcom,lpasscorecc-sc7180.h
+ See also:: include/dt-bindings/clock/qcom,lpasscorecc-sc7180.h
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/clock/qcom,sc7180-mss.yaml b/Documentation/devicetree/bindings/clock/qcom,sc7180-mss.yaml
index 970030986a86..1e856a8a996e 100644
--- a/Documentation/devicetree/bindings/clock/qcom,sc7180-mss.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,sc7180-mss.yaml
@@ -4,16 +4,15 @@
$id: http://devicetree.org/schemas/clock/qcom,sc7180-mss.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Modem Clock Controller Binding for SC7180
+title: Qualcomm Modem Clock Controller on SC7180
maintainers:
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm modem clock control module which supports the clocks on SC7180.
+ Qualcomm modem clock control module provides the clocks on SC7180.
- See also:
- - dt-bindings/clock/qcom,mss-sc7180.h
+ See also:: include/dt-bindings/clock/qcom,mss-sc7180.h
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/clock/qcom,sc7280-camcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sc7280-camcc.yaml
new file mode 100644
index 000000000000..b60adbad4590
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sc7280-camcc.yaml
@@ -0,0 +1,71 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sc7280-camcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Camera Clock & Reset Controller on SC7280
+
+maintainers:
+ - Taniya Das <tdas@codeaurora.org>
+
+description: |
+ Qualcomm camera clock control module provides the clocks, resets and
+ power domains on SC7280.
+
+ See also:: include/dt-bindings/clock/qcom,camcc-sc7280.h
+
+properties:
+ compatible:
+ const: qcom,sc7280-camcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board XO active source
+ - description: Sleep clock source
+
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: bi_tcxo_ao
+ - const: sleep_clk
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+ - '#reset-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ clock-controller@ad00000 {
+ compatible = "qcom,sc7280-camcc";
+ reg = <0x0ad00000 0x10000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>,
+ <&rpmhcc RPMH_CXO_CLK_A>,
+ <&sleep_clk>;
+ clock-names = "bi_tcxo", "bi_tcxo_ao", "sleep_clk";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sc7280-dispcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sc7280-dispcc.yaml
index 2178666fb697..cfe6594a0a6b 100644
--- a/Documentation/devicetree/bindings/clock/qcom,sc7280-dispcc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,sc7280-dispcc.yaml
@@ -4,16 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,sc7280-dispcc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Display Clock & Reset Controller Binding for SC7280
+title: Qualcomm Display Clock & Reset Controller on SC7280
maintainers:
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm display clock control module which supports the clocks, resets and
- power domains on SC7280.
+ Qualcomm display clock control module provides the clocks, resets and power
+ domains on SC7280.
- See also dt-bindings/clock/qcom,dispcc-sc7280.h.
+ See also:: include/dt-bindings/clock/qcom,dispcc-sc7280.h
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/clock/qcom,sc7280-lpasscc.yaml b/Documentation/devicetree/bindings/clock/qcom,sc7280-lpasscc.yaml
new file mode 100644
index 000000000000..97c6bd96e0cb
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sc7280-lpasscc.yaml
@@ -0,0 +1,72 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sc7280-lpasscc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm LPASS Core Clock Controller on SC7280
+
+maintainers:
+ - Taniya Das <tdas@codeaurora.org>
+
+description: |
+ Qualcomm LPASS core clock control module provides the clocks and power
+ domains on SC7280.
+
+ See also:: include/dt-bindings/clock/qcom,lpass-sc7280.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,sc7280-lpasscc
+
+ clocks:
+ items:
+ - description: gcc_cfg_noc_lpass_clk from GCC
+
+ clock-names:
+ items:
+ - const: iface
+
+ '#clock-cells':
+ const: 1
+
+ reg:
+ items:
+ - description: LPASS qdsp6ss register
+ - description: LPASS top-cc register
+
+ reg-names:
+ items:
+ - const: qdsp6ss
+ - const: top_cc
+
+ qcom,adsp-pil-mode:
+ description:
+ Indicates if the LPASS would be brought out of reset using the
+ remoteproc peripheral loader.
+ type: boolean
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-sc7280.h>
+ #include <dt-bindings/clock/qcom,lpass-sc7280.h>
+ clock-controller@3000000 {
+ compatible = "qcom,sc7280-lpasscc";
+ reg = <0x03000000 0x40>, <0x03c04000 0x4>;
+ reg-names = "qdsp6ss", "top_cc";
+ clocks = <&gcc GCC_CFG_NOC_LPASS_CLK>;
+ clock-names = "iface";
+ qcom,adsp-pil-mode;
+ #clock-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sc7280-lpasscorecc.yaml b/Documentation/devicetree/bindings/clock/qcom,sc7280-lpasscorecc.yaml
new file mode 100644
index 000000000000..447cdc447a0c
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sc7280-lpasscorecc.yaml
@@ -0,0 +1,192 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sc7280-lpasscorecc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm LPASS Core & Audio Clock Controller on SC7280
+
+maintainers:
+ - Taniya Das <tdas@codeaurora.org>
+
+description: |
+ Qualcomm LPASS core and audio clock control module provides the clocks and
+ power domains on SC7280.
+
+ See also::
+ include/dt-bindings/clock/qcom,lpasscorecc-sc7280.h
+ include/dt-bindings/clock/qcom,lpassaudiocc-sc7280.h
+
+properties:
+ clocks: true
+
+ clock-names: true
+
+ reg: true
+
+ compatible:
+ enum:
+ - qcom,sc7280-lpassaoncc
+ - qcom,sc7280-lpassaudiocc
+ - qcom,sc7280-lpasscorecc
+ - qcom,sc7280-lpasshm
+
+ power-domains:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+ qcom,adsp-pil-mode:
+ description:
+ Indicates if the LPASS would be brought out of reset using the
+ peripheral loader.
+ type: boolean
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: qcom,sc7280-lpassaudiocc
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ - description: LPASS_AON_CC_MAIN_RCG_CLK_SRC
+
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: lpass_aon_cc_main_rcg_clk_src
+
+ reg:
+ items:
+ - description: lpass core cc register
+ - description: lpass audio csr register
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sc7280-lpassaoncc
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board XO active only source
+ - description: LPASS_AON_CC_MAIN_RCG_CLK_SRC
+
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: bi_tcxo_ao
+ - const: iface
+
+ reg:
+ maxItems: 1
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sc7280-lpasshm
+ - qcom,sc7280-lpasscorecc
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+
+ clock-names:
+ items:
+ - const: bi_tcxo
+
+ reg:
+ maxItems: 1
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/clock/qcom,gcc-sc7280.h>
+ #include <dt-bindings/clock/qcom,lpassaudiocc-sc7280.h>
+ #include <dt-bindings/clock/qcom,lpasscorecc-sc7280.h>
+ lpass_audiocc: clock-controller@3300000 {
+ compatible = "qcom,sc7280-lpassaudiocc";
+ reg = <0x3300000 0x30000>,
+ <0x32a9000 0x1000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>,
+ <&lpass_aon LPASS_AON_CC_MAIN_RCG_CLK_SRC>;
+ clock-names = "bi_tcxo", "lpass_aon_cc_main_rcg_clk_src";
+ power-domains = <&lpass_aon LPASS_AON_CC_LPASS_AUDIO_HM_GDSC>;
+ #clock-cells = <1>;
+ #power-domain-cells = <1>;
+ #reset-cells = <1>;
+ };
+
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/clock/qcom,gcc-sc7280.h>
+ #include <dt-bindings/clock/qcom,lpassaudiocc-sc7280.h>
+ #include <dt-bindings/clock/qcom,lpasscorecc-sc7280.h>
+ lpass_hm: clock-controller@3c00000 {
+ compatible = "qcom,sc7280-lpasshm";
+ reg = <0x3c00000 0x28>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "bi_tcxo";
+ #clock-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/clock/qcom,gcc-sc7280.h>
+ #include <dt-bindings/clock/qcom,lpassaudiocc-sc7280.h>
+ #include <dt-bindings/clock/qcom,lpasscorecc-sc7280.h>
+ lpasscore: clock-controller@3900000 {
+ compatible = "qcom,sc7280-lpasscorecc";
+ reg = <0x3900000 0x50000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "bi_tcxo";
+ power-domains = <&lpass_hm LPASS_CORE_CC_LPASS_CORE_HM_GDSC>;
+ #clock-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/clock/qcom,gcc-sc7280.h>
+ #include <dt-bindings/clock/qcom,lpassaudiocc-sc7280.h>
+ #include <dt-bindings/clock/qcom,lpasscorecc-sc7280.h>
+ lpass_aon: clock-controller@3380000 {
+ compatible = "qcom,sc7280-lpassaoncc";
+ reg = <0x3380000 0x30000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>, <&rpmhcc RPMH_CXO_CLK_A>,
+ <&lpasscore LPASS_CORE_CC_CORE_CLK>;
+ clock-names = "bi_tcxo", "bi_tcxo_ao","iface";
+ qcom,adsp-pil-mode;
+ #clock-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sdm845-camcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sdm845-camcc.yaml
new file mode 100644
index 000000000000..91d1f7918037
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sdm845-camcc.yaml
@@ -0,0 +1,65 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sdm845-camcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Camera Clock & Reset Controller on SDM845
+
+maintainers:
+ - Bjorn Andersson <bjorn.andersson@linaro.org>
+
+description: |
+ Qualcomm camera clock control module provides the clocks, resets and power
+ domains on SDM845.
+
+ See also:: include/dt-bindings/clock/qcom,camcc-sdm845.h
+
+properties:
+ compatible:
+ const: qcom,sdm845-camcc
+
+ clocks:
+ items:
+ - description: Board XO source
+
+ clock-names:
+ items:
+ - const: bi_tcxo
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+ - '#reset-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ clock-controller@ad00000 {
+ compatible = "qcom,sdm845-camcc";
+ reg = <0x0ad00000 0x10000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "bi_tcxo";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sdm845-dispcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sdm845-dispcc.yaml
index 4a3be733d042..76b53ce64e40 100644
--- a/Documentation/devicetree/bindings/clock/qcom,sdm845-dispcc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,sdm845-dispcc.yaml
@@ -4,16 +4,16 @@
$id: http://devicetree.org/schemas/clock/qcom,sdm845-dispcc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Display Clock & Reset Controller Binding for SDM845
+title: Qualcomm Display Clock & Reset Controller on SDM845
maintainers:
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm display clock control module which supports the clocks, resets and
- power domains on SDM845.
+ Qualcomm display clock control module provides the clocks, resets and power
+ domains on SDM845.
- See also dt-bindings/clock/qcom,dispcc-sdm845.h.
+ See also:: include/dt-bindings/clock/qcom,dispcc-sdm845.h
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/clock/qcom,sdm845-lpasscc.yaml b/Documentation/devicetree/bindings/clock/qcom,sdm845-lpasscc.yaml
new file mode 100644
index 000000000000..a96fd837c70a
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sdm845-lpasscc.yaml
@@ -0,0 +1,47 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sdm845-lpasscc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SDM845 LPASS Clock Controller
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+
+description: |
+ Qualcomm SDM845 LPASS (Low Power Audio SubSystem) Clock Controller.
+
+ See also:: include/dt-bindings/clock/qcom,lpass-sdm845.h
+
+properties:
+ compatible:
+ const: qcom,sdm845-lpasscc
+
+ '#clock-cells':
+ const: 1
+
+ reg:
+ maxItems: 2
+
+ reg-names:
+ items:
+ - const: cc
+ - const: qdsp6ss
+
+required:
+ - compatible
+ - '#clock-cells'
+ - reg
+ - reg-names
+
+additionalProperties: false
+
+examples:
+ - |
+ clock-controller@17014000 {
+ compatible = "qcom,sdm845-lpasscc";
+ reg = <0x17014000 0x1f004>, <0x17300000 0x200>;
+ reg-names = "cc", "qdsp6ss";
+ #clock-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm6115-dispcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm6115-dispcc.yaml
new file mode 100644
index 000000000000..f802a2e7f818
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm6115-dispcc.yaml
@@ -0,0 +1,69 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm6115-dispcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display Clock Controller for SM6115
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+
+description: |
+ Qualcomm display clock control module provides the clocks and power domains
+ on SM6115.
+
+ See also:: include/dt-bindings/clock/qcom,sm6115-dispcc.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,sm6115-dispcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board sleep clock
+ - description: Byte clock from DSI PHY0
+ - description: Pixel clock from DSI PHY0
+ - description: GPLL0 DISP DIV clock from GCC
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - '#clock-cells'
+ - '#reset-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+ #include <dt-bindings/clock/qcom,gcc-sm6115.h>
+ clock-controller@5f00000 {
+ compatible = "qcom,sm6115-dispcc";
+ reg = <0x5f00000 0x20000>;
+ clocks = <&rpmcc RPM_SMD_XO_CLK_SRC>,
+ <&sleep_clk>,
+ <&dsi0_phy 0>,
+ <&dsi0_phy 1>,
+ <&gcc GCC_DISP_GPLL0_DIV_CLK_SRC>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm6115-gpucc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm6115-gpucc.yaml
new file mode 100644
index 000000000000..cf19f44af774
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm6115-gpucc.yaml
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm6115-gpucc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Graphics Clock & Reset Controller on SM6115
+
+maintainers:
+ - Konrad Dybcio <konrad.dybcio@linaro.org>
+
+description: |
+ Qualcomm graphics clock control module provides clocks, resets and power
+ domains on Qualcomm SoCs.
+
+ See also:: include/dt-bindings/clock/qcom,sm6115-gpucc.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,sm6115-gpucc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: GPLL0 main branch source
+ - description: GPLL0 main div source
+
+required:
+ - compatible
+ - clocks
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-sm6115.h>
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+
+ soc {
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ clock-controller@5990000 {
+ compatible = "qcom,sm6115-gpucc";
+ reg = <0x05990000 0x9000>;
+ clocks = <&rpmcc RPM_SMD_XO_CLK_SRC>,
+ <&gcc GCC_GPU_GPLL0_CLK_SRC>,
+ <&gcc GCC_GPU_GPLL0_DIV_CLK_SRC>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm6125-gpucc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm6125-gpucc.yaml
new file mode 100644
index 000000000000..374a1844a159
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm6125-gpucc.yaml
@@ -0,0 +1,64 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm6125-gpucc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Graphics Clock & Reset Controller on SM6125
+
+maintainers:
+ - Konrad Dybcio <konrad.dybcio@linaro.org>
+
+description: |
+ Qualcomm graphics clock control module provides clocks and power domains on
+ Qualcomm SoCs.
+
+ See also:: include/dt-bindings/clock/qcom,sm6125-gpucc.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,sm6125-gpucc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: GPLL0 main branch source
+
+ '#clock-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - '#clock-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-sm6125.h>
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+
+ soc {
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ clock-controller@5990000 {
+ compatible = "qcom,sm6125-gpucc";
+ reg = <0x05990000 0x9000>;
+ clocks = <&rpmcc RPM_SMD_XO_CLK_SRC>,
+ <&gcc GCC_GPU_GPLL0_CLK_SRC>;
+ #clock-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm6350-camcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm6350-camcc.yaml
new file mode 100644
index 000000000000..fd6658cb793d
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm6350-camcc.yaml
@@ -0,0 +1,49 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm6350-camcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Camera Clock & Reset Controller on SM6350
+
+maintainers:
+ - Konrad Dybcio <konrad.dybcio@linaro.org>
+
+description: |
+ Qualcomm camera clock control module provides the clocks, resets and power
+ domains on SM6350.
+
+ See also:: include/dt-bindings/clock/qcom,sm6350-camcc.h
+
+properties:
+ compatible:
+ const: qcom,sm6350-camcc
+
+ clocks:
+ items:
+ - description: Board XO source
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - clocks
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ clock-controller@ad00000 {
+ compatible = "qcom,sm6350-camcc";
+ reg = <0x0ad00000 0x16000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm6375-dispcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm6375-dispcc.yaml
new file mode 100644
index 000000000000..183b1c75dbdf
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm6375-dispcc.yaml
@@ -0,0 +1,54 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm6375-dispcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display Clock & Reset Controller on SM6375
+
+maintainers:
+ - Konrad Dybcio <konrad.dybcio@linaro.org>
+
+description: |
+ Qualcomm display clock control module provides the clocks, resets and power
+ domains on SM6375.
+
+ See also:: include/dt-bindings/clock/qcom,dispcc-sm6375.h
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm6375-dispcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: GPLL0 source from GCC
+ - description: Byte clock from DSI PHY
+ - description: Pixel clock from DSI PHY
+
+required:
+ - compatible
+ - clocks
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,sm6375-gcc.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+
+ clock-controller@5f00000 {
+ compatible = "qcom,sm6375-dispcc";
+ reg = <0x05f00000 0x20000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>,
+ <&gcc GCC_DISP_GPLL0_CLK_SRC>,
+ <&dsi_phy 0>,
+ <&dsi_phy 1>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm6375-gcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm6375-gcc.yaml
new file mode 100644
index 000000000000..295d4bb1a966
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm6375-gcc.yaml
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm6375-gcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on SM6375
+
+maintainers:
+ - Konrad Dybcio <konrad.dybcio@somainline.org>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SM6375.
+
+ See also:: include/dt-bindings/clock/qcom,sm6375-gcc.h
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm6375-gcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board XO Active-Only source
+ - description: Sleep clock source
+
+required:
+ - compatible
+ - clocks
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+ clock-controller@1400000 {
+ compatible = "qcom,sm6375-gcc";
+ reg = <0x01400000 0x1f0000>;
+ clocks = <&rpmcc RPM_SMD_XO_CLK_SRC>,
+ <&rpmcc RPM_SMD_XO_A_CLK_SRC>,
+ <&sleep_clk>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm6375-gpucc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm6375-gpucc.yaml
new file mode 100644
index 000000000000..b480ead5bd69
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm6375-gpucc.yaml
@@ -0,0 +1,60 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm6375-gpucc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Graphics Clock & Reset Controller on SM6375
+
+maintainers:
+ - Konrad Dybcio <konrad.dybcio@linaro.org>
+
+description: |
+ Qualcomm graphics clock control module provides clocks, resets and power
+ domains on Qualcomm SoCs.
+
+ See also:: include/dt-bindings/clock/qcom,sm6375-gpucc.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,sm6375-gpucc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: GPLL0 main branch source
+ - description: GPLL0 div branch source
+ - description: SNoC DVM GFX source
+
+required:
+ - compatible
+ - clocks
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,sm6375-gcc.h>
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ clock-controller@5990000 {
+ compatible = "qcom,sm6375-gpucc";
+ reg = <0 0x05990000 0 0x9000>;
+ clocks = <&rpmcc RPM_SMD_XO_CLK_SRC>,
+ <&gcc GCC_GPU_GPLL0_CLK_SRC>,
+ <&gcc GCC_GPU_GPLL0_DIV_CLK_SRC>,
+ <&gcc GCC_GPU_SNOC_DVM_GFX_CLK>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm7150-gcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm7150-gcc.yaml
new file mode 100644
index 000000000000..0eb76d9d51c4
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm7150-gcc.yaml
@@ -0,0 +1,52 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm7150-gcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on SM7150
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+ - Danila Tikhonov <danila@jiaxyga.com>
+ - David Wronek <davidwronek@gmail.com>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SM7150.
+
+ See also:: include/dt-bindings/clock/qcom,sm7150-gcc.h
+
+properties:
+ compatible:
+ const: qcom,sm7150-gcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board XO Active-Only source
+ - description: Sleep clock source
+
+required:
+ - compatible
+ - clocks
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ clock-controller@100000 {
+ compatible = "qcom,sm7150-gcc";
+ reg = <0x00100000 0x001f0000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>,
+ <&rpmhcc RPMH_CXO_CLK_A>,
+ <&sleep_clk>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm8450-camcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm8450-camcc.yaml
new file mode 100644
index 000000000000..87ae74166807
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm8450-camcc.yaml
@@ -0,0 +1,81 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm8450-camcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Camera Clock & Reset Controller on SM8450
+
+maintainers:
+ - Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>
+
+description: |
+ Qualcomm camera clock control module provides the clocks, resets and power
+ domains on SM8450.
+
+ See also:: include/dt-bindings/clock/qcom,sm8450-camcc.h
+
+properties:
+ compatible:
+ const: qcom,sm8450-camcc
+
+ clocks:
+ items:
+ - description: Camera AHB clock from GCC
+ - description: Board XO source
+ - description: Board active XO source
+ - description: Sleep clock source
+
+ power-domains:
+ maxItems: 1
+ description:
+ A phandle and PM domain specifier for the MMCX power domain.
+
+ required-opps:
+ maxItems: 1
+ description:
+ A phandle to an OPP node describing the required MMCX performance point.
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - power-domains
+ - required-opps
+ - '#clock-cells'
+ - '#reset-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-sm8450.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+ clock-controller@ade0000 {
+ compatible = "qcom,sm8450-camcc";
+ reg = <0xade0000 0x20000>;
+ clocks = <&gcc GCC_CAMERA_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>,
+ <&rpmhcc RPMH_CXO_CLK_A>,
+ <&sleep_clk>;
+ power-domains = <&rpmhpd SM8450_MMCX>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm8450-dispcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm8450-dispcc.yaml
new file mode 100644
index 000000000000..1dd1f696dcd3
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm8450-dispcc.yaml
@@ -0,0 +1,97 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm8450-dispcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display Clock & Reset Controller for SM8450
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+
+description: |
+ Qualcomm display clock control module provides the clocks, resets and power
+ domains on SM8450.
+
+ See also:: include/dt-bindings/clock/qcom,sm8450-dispcc.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,sm8450-dispcc
+
+ clocks:
+ minItems: 3
+ items:
+ - description: Board XO source
+ - description: Board Always On XO source
+ - description: Display's AHB clock
+ - description: sleep clock
+ - description: Byte clock from DSI PHY0
+ - description: Pixel clock from DSI PHY0
+ - description: Byte clock from DSI PHY1
+ - description: Pixel clock from DSI PHY1
+ - description: Link clock from DP PHY0
+ - description: VCO DIV clock from DP PHY0
+ - description: Link clock from DP PHY1
+ - description: VCO DIV clock from DP PHY1
+ - description: Link clock from DP PHY2
+ - description: VCO DIV clock from DP PHY2
+ - description: Link clock from DP PHY3
+ - description: VCO DIV clock from DP PHY3
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+ power-domains:
+ description:
+ A phandle and PM domain specifier for the MMCX power domain.
+ maxItems: 1
+
+ required-opps:
+ description:
+ A phandle to an OPP node describing the required MMCX performance point.
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - '#clock-cells'
+ - '#reset-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-sm8450.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+ clock-controller@af00000 {
+ compatible = "qcom,sm8450-dispcc";
+ reg = <0x0af00000 0x10000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>,
+ <&rpmhcc RPMH_CXO_CLK_A>,
+ <&gcc GCC_DISP_AHB_CLK>,
+ <&sleep_clk>,
+ <&dsi0_phy 0>,
+ <&dsi0_phy 1>,
+ <&dsi1_phy 0>,
+ <&dsi1_phy 1>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ power-domains = <&rpmhpd SM8450_MMCX>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+...
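Because the clocks list above sets minItems: 3, the trailing DSI and DP PHY inputs may be dropped when those PHYs are not wired up; the example keeps only the DSI pairs. A sketch of the fully populated case follows; the dp0_phy..dp3_phy labels are assumptions for illustration, not names defined by this binding:

    clock-controller@af00000 {
        compatible = "qcom,sm8450-dispcc";
        reg = <0x0af00000 0x10000>;
        clocks = <&rpmhcc RPMH_CXO_CLK>,
                 <&rpmhcc RPMH_CXO_CLK_A>,
                 <&gcc GCC_DISP_AHB_CLK>,
                 <&sleep_clk>,
                 <&dsi0_phy 0>, <&dsi0_phy 1>,
                 <&dsi1_phy 0>, <&dsi1_phy 1>,
                 <&dp0_phy 0>, <&dp0_phy 1>,   /* hypothetical DP PHY labels */
                 <&dp1_phy 0>, <&dp1_phy 1>,
                 <&dp2_phy 0>, <&dp2_phy 1>,
                 <&dp3_phy 0>, <&dp3_phy 1>;
        power-domains = <&rpmhpd SM8450_MMCX>;
        required-opps = <&rpmhpd_opp_low_svs>;
        #clock-cells = <1>;
        #reset-cells = <1>;
        #power-domain-cells = <1>;
    };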
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm8550-dispcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm8550-dispcc.yaml
new file mode 100644
index 000000000000..ab25f7cbaa2e
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm8550-dispcc.yaml
@@ -0,0 +1,105 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm8550-dispcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display Clock & Reset Controller for SM8550
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+ - Neil Armstrong <neil.armstrong@linaro.org>
+
+description: |
+ Qualcomm display clock control module provides the clocks, resets and power
+ domains on SM8550.
+
+ See also:: include/dt-bindings/clock/qcom,sm8550-dispcc.h
+
+properties:
+ compatible:
+ enum:
+ - qcom,sm8550-dispcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board Always On XO source
+ - description: Display's AHB clock
+ - description: sleep clock
+ - description: Byte clock from DSI PHY0
+ - description: Pixel clock from DSI PHY0
+ - description: Byte clock from DSI PHY1
+ - description: Pixel clock from DSI PHY1
+ - description: Link clock from DP PHY0
+ - description: VCO DIV clock from DP PHY0
+ - description: Link clock from DP PHY1
+ - description: VCO DIV clock from DP PHY1
+ - description: Link clock from DP PHY2
+ - description: VCO DIV clock from DP PHY2
+ - description: Link clock from DP PHY3
+ - description: VCO DIV clock from DP PHY3
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+ '#power-domain-cells':
+ const: 1
+
+ reg:
+ maxItems: 1
+
+ power-domains:
+ description:
+ A phandle and PM domain specifier for the MMCX power domain.
+ maxItems: 1
+
+ required-opps:
+ description:
+ A phandle to an OPP node describing the required MMCX performance point.
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - '#clock-cells'
+ - '#reset-cells'
+ - '#power-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,sm8550-gcc.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+ clock-controller@af00000 {
+ compatible = "qcom,sm8550-dispcc";
+ reg = <0x0af00000 0x10000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>,
+ <&rpmhcc RPMH_CXO_CLK_A>,
+ <&gcc GCC_DISP_AHB_CLK>,
+ <&sleep_clk>,
+ <&dsi0_phy 0>,
+ <&dsi0_phy 1>,
+ <&dsi1_phy 0>,
+ <&dsi1_phy 1>,
+ <&dp0_phy 0>,
+ <&dp0_phy 1>,
+ <&dp1_phy 0>,
+ <&dp1_phy 1>,
+ <&dp2_phy 0>,
+ <&dp2_phy 1>,
+ <&dp3_phy 0>,
+ <&dp3_phy 1>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ power-domains = <&rpmhpd SM8550_MMCX>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm8550-gcc.yaml b/Documentation/devicetree/bindings/clock/qcom,sm8550-gcc.yaml
new file mode 100644
index 000000000000..0c706de31cf1
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm8550-gcc.yaml
@@ -0,0 +1,62 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm8550-gcc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Global Clock & Reset Controller on SM8550
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+
+description: |
+ Qualcomm global clock control module provides the clocks, resets and power
+ domains on SM8550.
+
+ See also:: include/dt-bindings/clock/qcom,sm8550-gcc.h
+
+properties:
+ compatible:
+ const: qcom,sm8550-gcc
+
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Sleep clock source
+ - description: PCIE 0 Pipe clock source
+ - description: PCIE 1 Pipe clock source
+ - description: PCIE 1 Phy Auxiliary clock source
+ - description: UFS Phy Rx symbol 0 clock source
+ - description: UFS Phy Rx symbol 1 clock source
+ - description: UFS Phy Tx symbol 0 clock source
+ - description: USB3 Phy wrapper pipe clock source
+
+required:
+ - compatible
+ - clocks
+
+allOf:
+ - $ref: qcom,gcc.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ clock-controller@100000 {
+ compatible = "qcom,sm8550-gcc";
+ reg = <0x00100000 0x001f4200>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>, <&sleep_clk>,
+ <&pcie0_phy>,
+ <&pcie1_phy>,
+ <&pcie_1_phy_aux_clk>,
+ <&ufs_mem_phy 0>,
+ <&ufs_mem_phy 1>,
+ <&ufs_mem_phy 2>,
+ <&usb_1_qmpphy>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ #power-domain-cells = <1>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,sm8550-tcsr.yaml b/Documentation/devicetree/bindings/clock/qcom,sm8550-tcsr.yaml
new file mode 100644
index 000000000000..1bf1a41fd89c
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,sm8550-tcsr.yaml
@@ -0,0 +1,55 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,sm8550-tcsr.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm TCSR Clock Controller on SM8550
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+
+description: |
+ Qualcomm TCSR clock control module provides the clocks, resets and
+ power domains on SM8550.
+
+ See also:: include/dt-bindings/clock/qcom,sm8550-tcsr.h
+
+properties:
+ compatible:
+ items:
+ - const: qcom,sm8550-tcsr
+ - const: syscon
+
+ clocks:
+ items:
+ - description: TCXO pad clock
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+required:
+ - compatible
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+
+ clock-controller@1fc0000 {
+ compatible = "qcom,sm8550-tcsr", "syscon";
+ reg = <0x1fc0000 0x30000>;
+ clocks = <&rpmhcc RPMH_CXO_CLK>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/clock/qcom,spmi-clkdiv.txt b/Documentation/devicetree/bindings/clock/qcom,spmi-clkdiv.txt
deleted file mode 100644
index 7474aba36607..000000000000
--- a/Documentation/devicetree/bindings/clock/qcom,spmi-clkdiv.txt
+++ /dev/null
@@ -1,59 +0,0 @@
-Qualcomm Technologies, Inc. SPMI PMIC clock divider (clkdiv)
-
-clkdiv configures the clock frequency of a set of outputs on the PMIC.
-These clocks are typically wired through alternate functions on
-gpio pins.
-
-=======================
-Properties
-=======================
-
-- compatible
- Usage: required
- Value type: <string>
- Definition: must be "qcom,spmi-clkdiv".
-
-- reg
- Usage: required
- Value type: <prop-encoded-array>
- Definition: base address of CLKDIV peripherals.
-
-- qcom,num-clkdivs
- Usage: required
- Value type: <u32>
- Definition: number of CLKDIV peripherals.
-
-- clocks:
- Usage: required
- Value type: <prop-encoded-array>
- Definition: reference to the xo clock.
-
-- clock-names:
- Usage: required
- Value type: <stringlist>
- Definition: must be "xo".
-
-- #clock-cells:
- Usage: required
- Value type: <u32>
- Definition: shall contain 1.
-
-=======
-Example
-=======
-
-pm8998_clk_divs: clock-controller@5b00 {
- compatible = "qcom,spmi-clkdiv";
- reg = <0x5b00>;
- #clock-cells = <1>;
- qcom,num-clkdivs = <3>;
- clocks = <&xo_board>;
- clock-names = "xo";
-
- assigned-clocks = <&pm8998_clk_divs 1>,
- <&pm8998_clk_divs 2>,
- <&pm8998_clk_divs 3>;
- assigned-clock-rates = <9600000>,
- <9600000>,
- <9600000>;
-};
diff --git a/Documentation/devicetree/bindings/clock/qcom,spmi-clkdiv.yaml b/Documentation/devicetree/bindings/clock/qcom,spmi-clkdiv.yaml
new file mode 100644
index 000000000000..16c95ad6c9d1
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/qcom,spmi-clkdiv.yaml
@@ -0,0 +1,71 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/qcom,spmi-clkdiv.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SPMI PMIC clock divider
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+ - Stephen Boyd <sboyd@kernel.org>
+
+description: |
+ Qualcomm SPMI PMIC clock divider configures the clock frequency of a set of
+ outputs on the PMIC. These clocks are typically wired through alternate
+ functions on GPIO pins.
+
+properties:
+ compatible:
+ const: qcom,spmi-clkdiv
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: Board XO source
+
+ clock-names:
+ items:
+ - const: xo
+
+ "#clock-cells":
+ const: 1
+
+ qcom,num-clkdivs:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Number of CLKDIV peripherals.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - "#clock-cells"
+ - qcom,num-clkdivs
+
+additionalProperties: false
+
+examples:
+ - |
+ pmic {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ clock-controller@5b00 {
+ compatible = "qcom,spmi-clkdiv";
+ reg = <0x5b00>;
+ clocks = <&xo_board>;
+ clock-names = "xo";
+ #clock-cells = <1>;
+ qcom,num-clkdivs = <3>;
+
+ assigned-clocks = <&pm8998_clk_divs 1>,
+ <&pm8998_clk_divs 2>,
+ <&pm8998_clk_divs 3>;
+ assigned-clock-rates = <9600000>,
+ <9600000>,
+ <9600000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/clock/qcom,videocc.yaml b/Documentation/devicetree/bindings/clock/qcom,videocc.yaml
index 0d224f114b5b..2b07146161b4 100644
--- a/Documentation/devicetree/bindings/clock/qcom,videocc.yaml
+++ b/Documentation/devicetree/bindings/clock/qcom,videocc.yaml
@@ -4,21 +4,21 @@
$id: http://devicetree.org/schemas/clock/qcom,videocc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Qualcomm Video Clock & Reset Controller Binding
+title: Qualcomm Video Clock & Reset Controller
maintainers:
- Taniya Das <tdas@codeaurora.org>
description: |
- Qualcomm video clock control module which supports the clocks, resets and
- power domains on Qualcomm SoCs.
+ Qualcomm video clock control module provides the clocks, resets and power
+ domains on Qualcomm SoCs.
- See also:
- dt-bindings/clock/qcom,videocc-sc7180.h
- dt-bindings/clock/qcom,videocc-sc7280.h
- dt-bindings/clock/qcom,videocc-sdm845.h
- dt-bindings/clock/qcom,videocc-sm8150.h
- dt-bindings/clock/qcom,videocc-sm8250.h
+ See also::
+ include/dt-bindings/clock/qcom,videocc-sc7180.h
+ include/dt-bindings/clock/qcom,videocc-sc7280.h
+ include/dt-bindings/clock/qcom,videocc-sdm845.h
+ include/dt-bindings/clock/qcom,videocc-sm8150.h
+ include/dt-bindings/clock/qcom,videocc-sm8250.h
properties:
compatible:
@@ -30,12 +30,12 @@ properties:
- qcom,sm8250-videocc
clocks:
- items:
- - description: Board XO source
+ minItems: 1
+ maxItems: 3
clock-names:
- items:
- - const: bi_tcxo
+ minItems: 1
+ maxItems: 3
'#clock-cells':
const: 1
@@ -49,6 +49,16 @@ properties:
reg:
maxItems: 1
+ power-domains:
+ description:
+ A phandle and PM domain specifier for the MMCX power domain.
+ maxItems: 1
+
+ required-opps:
+ description:
+ A phandle to an OPP node describing the required MMCX performance point.
+ maxItems: 1
+
required:
- compatible
- reg
@@ -58,11 +68,63 @@ required:
- '#reset-cells'
- '#power-domain-cells'
+allOf:
+ - if:
+ properties:
+ compatible:
+ enum:
+ - qcom,sc7180-videocc
+ - qcom,sdm845-videocc
+ - qcom,sm8150-videocc
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ clock-names:
+ items:
+ - const: bi_tcxo
+
+ - if:
+ properties:
+ compatible:
+ enum:
+ - qcom,sc7280-videocc
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Board XO source
+ - description: Board active XO source
+ clock-names:
+ items:
+ - const: bi_tcxo
+ - const: bi_tcxo_ao
+
+ - if:
+ properties:
+ compatible:
+ enum:
+ - qcom,sm8250-videocc
+ then:
+ properties:
+ clocks:
+ items:
+ - description: AHB
+ - description: Board XO source
+ - description: Board active XO source
+ clock-names:
+ items:
+ - const: iface
+ - const: bi_tcxo
+ - const: bi_tcxo_ao
+
additionalProperties: false
examples:
- |
#include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
clock-controller@ab00000 {
compatible = "qcom,sdm845-videocc";
reg = <0x0ab00000 0x10000>;
@@ -71,5 +133,7 @@ examples:
#clock-cells = <1>;
#reset-cells = <1>;
#power-domain-cells = <1>;
+ power-domains = <&rpmhpd SM8250_MMCX>;
+ required-opps = <&rpmhpd_opp_low_svs>;
};
...
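The new allOf blocks key the clock list off the compatible string, so the sm8250 variant takes three inputs (iface, bi_tcxo, bi_tcxo_ao) where sdm845 takes one. A sketch of the sm8250 case, assuming the usual &gcc and &rpmhcc provider labels and an AHB source such as GCC_VIDEO_AHB_CLK (illustrative, not fixed by this schema):

    videocc: clock-controller@abf0000 {
        compatible = "qcom,sm8250-videocc";
        reg = <0x0abf0000 0x10000>;
        clocks = <&gcc GCC_VIDEO_AHB_CLK>,
                 <&rpmhcc RPMH_CXO_CLK>,
                 <&rpmhcc RPMH_CXO_CLK_A>;
        clock-names = "iface", "bi_tcxo", "bi_tcxo_ao";
        power-domains = <&rpmhpd SM8250_MMCX>;
        required-opps = <&rpmhpd_opp_low_svs>;
        #clock-cells = <1>;
        #reset-cells = <1>;
        #power-domain-cells = <1>;
    };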
diff --git a/Documentation/devicetree/bindings/clock/qoriq-clock.txt b/Documentation/devicetree/bindings/clock/qoriq-clock.txt
index f7d48f23da44..10119d9ef4b1 100644
--- a/Documentation/devicetree/bindings/clock/qoriq-clock.txt
+++ b/Documentation/devicetree/bindings/clock/qoriq-clock.txt
@@ -44,6 +44,7 @@ Required properties:
* "fsl,ls1046a-clockgen"
* "fsl,ls1088a-clockgen"
* "fsl,ls2080a-clockgen"
+ * "fsl,lx2160a-clockgen"
Chassis-version clock strings include:
* "fsl,qoriq-clockgen-1.0": for chassis 1.0 clocks
* "fsl,qoriq-clockgen-2.0": for chassis 2.0 clocks
diff --git a/Documentation/devicetree/bindings/clock/renesas,9series.yaml b/Documentation/devicetree/bindings/clock/renesas,9series.yaml
new file mode 100644
index 000000000000..3afdebdb52ad
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/renesas,9series.yaml
@@ -0,0 +1,103 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/renesas,9series.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Renesas 9-series I2C PCIe clock generators
+
+description: |
+ The Renesas 9-series are I2C PCIe clock generators providing
+ from 1 to 20 output clocks.
+
+ When referencing the provided clock in the DT using phandle
+ and clock specifier, the following mapping applies:
+
+ - 9FGV0241:
+ 0 -- DIF0
+ 1 -- DIF1
+ - 9FGV0441:
+ 0 -- DIF0
+ 1 -- DIF1
+ 2 -- DIF2
+ 3 -- DIF3
+
+maintainers:
+ - Marek Vasut <marex@denx.de>
+
+properties:
+ compatible:
+ enum:
+ - renesas,9fgv0241
+ - renesas,9fgv0441
+
+ reg:
+ description: I2C device address
+ enum: [ 0x68, 0x6a ]
+
+ '#clock-cells':
+ const: 1
+
+ clocks:
+ items:
+ - description: XTal input clock
+
+ renesas,out-amplitude-microvolt:
+ enum: [ 600000, 700000, 800000, 900000 ]
+ description: Output clock signal amplitude
+
+ renesas,out-spread-spectrum:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [ 100000, 99750, 99500 ]
+ description: Output clock down spread in pcm (1/1000 of a percent)
+
+patternProperties:
+ "^DIF[0-19]$":
+ type: object
+ description:
+ Description of one of the outputs (DIF0..DIF19).
+
+ properties:
+ renesas,slew-rate:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [ 2000000, 3000000 ]
+ description: Output clock slew rate select in V/ns
+
+ additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ /* 25MHz reference crystal */
+ ref25: ref25m {
+ compatible = "fixed-clock";
+ #clock-cells = <0>;
+ clock-frequency = <25000000>;
+ };
+
+ i2c@0 {
+ reg = <0x0 0x100>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ rs9: clock-generator@6a {
+ compatible = "renesas,9fgv0241";
+ reg = <0x6a>;
+ #clock-cells = <1>;
+
+ clocks = <&ref25m>;
+
+ DIF0 {
+ renesas,slew-rate = <3000000>;
+ };
+ };
+ };
+
+...
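Since #clock-cells is 1 and the cell selects an output per the DIF mapping in the description, a consumer picks an output by index. A minimal consumer sketch (the node and its use of the clock are hypothetical):

    /* Takes its reference clock from DIF1 of the rs9 generator above */
    ethernet-phy@1 {
        reg = <1>;
        clocks = <&rs9 1>;  /* specifier 1 = output DIF1 */
    };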
diff --git a/Documentation/devicetree/bindings/clock/renesas,cpg-div6-clock.yaml b/Documentation/devicetree/bindings/clock/renesas,cpg-div6-clock.yaml
index c55a7c494e01..2197c952e21d 100644
--- a/Documentation/devicetree/bindings/clock/renesas,cpg-div6-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/renesas,cpg-div6-clock.yaml
@@ -51,6 +51,18 @@ additionalProperties: false
examples:
- |
#include <dt-bindings/clock/r8a73a4-clock.h>
+
+ cpg_clocks: cpg_clocks@e6150000 {
+ compatible = "renesas,r8a73a4-cpg-clocks";
+ reg = <0xe6150000 0x10000>;
+ clocks = <&extal1_clk>, <&extal2_clk>;
+ #clock-cells = <1>;
+ clock-output-names = "main", "pll0", "pll1", "pll2",
+ "pll2s", "pll2h", "z", "z2",
+ "i", "m3", "b", "m1", "m2",
+ "zx", "zs", "hp";
+ };
+
sdhi2_clk: sdhi2_clk@e615007c {
compatible = "renesas,r8a73a4-div6-clock", "renesas,cpg-div6-clock";
reg = <0xe615007c 4>;
diff --git a/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.yaml b/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.yaml
index 9b414fbde6d7..9c3dc6c4fa94 100644
--- a/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.yaml
+++ b/Documentation/devicetree/bindings/clock/renesas,cpg-mssr.yaml
@@ -1,8 +1,8 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/clock/renesas,cpg-mssr.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/clock/renesas,cpg-mssr.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Renesas Clock Pulse Generator / Module Standby and Software Reset
@@ -48,6 +48,8 @@ properties:
- renesas,r8a77990-cpg-mssr # R-Car E3
- renesas,r8a77995-cpg-mssr # R-Car D3
- renesas,r8a779a0-cpg-mssr # R-Car V3U
+ - renesas,r8a779f0-cpg-mssr # R-Car S4-8
+ - renesas,r8a779g0-cpg-mssr # R-Car V4H
reg:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/clock/renesas,h8300-div-clock.txt b/Documentation/devicetree/bindings/clock/renesas,h8300-div-clock.txt
deleted file mode 100644
index 399e0da22348..000000000000
--- a/Documentation/devicetree/bindings/clock/renesas,h8300-div-clock.txt
+++ /dev/null
@@ -1,24 +0,0 @@
-* Renesas H8/300 divider clock
-
-Required Properties:
-
- - compatible: Must be "renesas,h8300-div-clock"
-
- - clocks: Reference to the parent clocks ("extal1" and "extal2")
-
- - #clock-cells: Must be 1
-
- - reg: Base address and length of the divide rate selector
-
- - renesas,width: bit width of selector
-
-Example
--------
-
- cclk: cclk {
- compatible = "renesas,h8300-div-clock";
- clocks = <&xclk>;
- #clock-cells = <0>;
- reg = <0xfee01b 2>;
- renesas,width = <2>;
- };
diff --git a/Documentation/devicetree/bindings/clock/renesas,h8s2678-pll-clock.txt b/Documentation/devicetree/bindings/clock/renesas,h8s2678-pll-clock.txt
deleted file mode 100644
index 500cdadbceb7..000000000000
--- a/Documentation/devicetree/bindings/clock/renesas,h8s2678-pll-clock.txt
+++ /dev/null
@@ -1,23 +0,0 @@
-Renesas H8S2678 PLL clock
-
-This device is Clock multiplyer
-
-Required Properties:
-
- - compatible: Must be "renesas,h8s2678-pll-clock"
-
- - clocks: Reference to the parent clocks
-
- - #clock-cells: Must be 0
-
- - reg: Two rate selector (Multiply / Divide) register address
-
-Example
--------
-
- pllclk: pllclk {
- compatible = "renesas,h8s2678-pll-clock";
- clocks = <&xclk>;
- #clock-cells = <0>;
- reg = <0xfee03b 2>, <0xfee045 2>;
- };
diff --git a/Documentation/devicetree/bindings/clock/renesas,r9a06g032-sysctrl.yaml b/Documentation/devicetree/bindings/clock/renesas,r9a06g032-sysctrl.yaml
index 25dbb0fac065..99686085f751 100644
--- a/Documentation/devicetree/bindings/clock/renesas,r9a06g032-sysctrl.yaml
+++ b/Documentation/devicetree/bindings/clock/renesas,r9a06g032-sysctrl.yaml
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Renesas RZ/N1D (R9A06G032) System Controller
maintainers:
- - Gareth Williams <gareth.williams.jx@renesas.com>
+ - Fabrizio Castro <fabrizio.castro.jz@renesas.com>
- Geert Uytterhoeven <geert+renesas@glider.be>
properties:
@@ -39,6 +39,17 @@ properties:
'#power-domain-cells':
const: 0
+ '#address-cells':
+ const: 1
+
+ '#size-cells':
+ const: 1
+
+patternProperties:
+ "^dma-router@[a-f0-9]+$":
+ type: object
+ $ref: "../dma/renesas,rzn1-dmamux.yaml#"
+
required:
- compatible
- reg
diff --git a/Documentation/devicetree/bindings/clock/renesas,rcar-usb2-clock-sel.yaml b/Documentation/devicetree/bindings/clock/renesas,rcar-usb2-clock-sel.yaml
index 6eaabb4d82ec..c84f29f1810f 100644
--- a/Documentation/devicetree/bindings/clock/renesas,rcar-usb2-clock-sel.yaml
+++ b/Documentation/devicetree/bindings/clock/renesas,rcar-usb2-clock-sel.yaml
@@ -1,8 +1,8 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/clock/renesas,rcar-usb2-clock-sel.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/clock/renesas,rcar-usb2-clock-sel.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Renesas R-Car USB 2.0 clock selector
@@ -47,7 +47,6 @@ properties:
maxItems: 1
clocks:
- minItems: 4
maxItems: 4
clock-names:
@@ -64,7 +63,6 @@ properties:
maxItems: 1
resets:
- minItems: 2
maxItems: 2
reset-names:
diff --git a/Documentation/devicetree/bindings/clock/renesas,rzg2l-cpg.yaml b/Documentation/devicetree/bindings/clock/renesas,rzg2l-cpg.yaml
index 30b2e3d0d25d..fe2fba18ae84 100644
--- a/Documentation/devicetree/bindings/clock/renesas,rzg2l-cpg.yaml
+++ b/Documentation/devicetree/bindings/clock/renesas,rzg2l-cpg.yaml
@@ -1,17 +1,18 @@
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/clock/renesas,rzg2l-cpg.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/clock/renesas,rzg2l-cpg.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Renesas RZ/G2L Clock Pulse Generator / Module Standby Mode
+title: Renesas RZ/{G2L,V2L,V2M} Clock Pulse Generator / Module Standby Mode
maintainers:
- Geert Uytterhoeven <geert+renesas@glider.be>
description: |
- On Renesas RZ/G2L SoC, the CPG (Clock Pulse Generator) and Module
- Standby Mode share the same register block.
+ On Renesas RZ/{G2L,V2L} and similar SoCs, the CPG (Clock Pulse Generator) and
+ Module Standby Mode share the same register block. On RZ/V2M, the functionality
+ is similar, but it does not have Clock Monitor Registers.
They provide the following functionalities:
- The CPG block generates various core clocks,
@@ -22,7 +23,11 @@ description: |
properties:
compatible:
- const: renesas,r9a07g044-cpg # RZ/G2{L,LC}
+ enum:
+ - renesas,r9a07g043-cpg # RZ/G2UL{Type-1,Type-2} and RZ/Five
+ - renesas,r9a07g044-cpg # RZ/G2{L,LC}
+ - renesas,r9a07g054-cpg # RZ/V2L
+ - renesas,r9a09g011-cpg # RZ/V2M
reg:
maxItems: 1
@@ -40,9 +45,9 @@ properties:
description: |
- For CPG core clocks, the two clock specifier cells must be "CPG_CORE"
and a core clock reference, as defined in
- <dt-bindings/clock/r9a07g044-cpg.h>
+ <dt-bindings/clock/r9a0*-cpg.h>,
- For module clocks, the two clock specifier cells must be "CPG_MOD" and
- a module number, as defined in the <dt-bindings/clock/r9a07g044-cpg.h>.
+ a module number, as defined in <dt-bindings/clock/r9a0*-cpg.h>.
const: 2
'#power-domain-cells':
@@ -56,7 +61,7 @@ properties:
'#reset-cells':
description:
The single reset specifier cell must be the module number, as defined in
- the <dt-bindings/clock/r9a07g044-cpg.h>.
+ <dt-bindings/clock/r9a0*-cpg.h>.
const: 1
required:
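A consumer combines the CPG_CORE/CPG_MOD selector with an index from the SoC header, and uses the same module number for resets. A sketch, assuming the R9A07G044 SCIF0 macros from <dt-bindings/clock/r9a07g044-cpg.h> (node name and address are illustrative):

    serial@1004b800 {
        compatible = "renesas,scif-r9a07g044";
        reg = <0x1004b800 0x400>;
        clocks = <&cpg CPG_MOD R9A07G044_SCIF0_CLK_PCK>;  /* module clock */
        clock-names = "fck";
        power-domains = <&cpg>;
        resets = <&cpg R9A07G044_SCIF0_RST_SYSTEM_N>;     /* same module numbering */
    };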
diff --git a/Documentation/devicetree/bindings/clock/renesas,versaclock7.yaml b/Documentation/devicetree/bindings/clock/renesas,versaclock7.yaml
new file mode 100644
index 000000000000..b339f1f9f072
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/renesas,versaclock7.yaml
@@ -0,0 +1,64 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/renesas,versaclock7.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Renesas Versaclock7 Programmable Clock
+
+maintainers:
+ - Alex Helms <alexander.helms.jy@renesas.com>
+
+description: |
+ Renesas Versaclock7 is a family of configurable clock generator and
+ jitter attenuator ICs with fractional and integer dividers.
+
+properties:
+ '#clock-cells':
+ const: 1
+
+ compatible:
+ enum:
+ - renesas,rc21008a
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: External crystal or oscillator
+
+ clock-names:
+ items:
+ - const: xin
+
+required:
+ - '#clock-cells'
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+
+additionalProperties: false
+
+examples:
+ - |
+ vc7_xin: clock {
+ compatible = "fixed-clock";
+ #clock-cells = <0>;
+ clock-frequency = <49152000>;
+ };
+
+ i2c@0 {
+ reg = <0x0 0x100>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ vc7: clock-controller@9 {
+ compatible = "renesas,rc21008a";
+ reg = <0x9>;
+ #clock-cells = <1>;
+ clocks = <&vc7_xin>;
+ clock-names = "xin";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/clock/rockchip,px30-cru.txt b/Documentation/devicetree/bindings/clock/rockchip,px30-cru.txt
deleted file mode 100644
index 55e78cddec8c..000000000000
--- a/Documentation/devicetree/bindings/clock/rockchip,px30-cru.txt
+++ /dev/null
@@ -1,70 +0,0 @@
-* Rockchip PX30 Clock and Reset Unit
-
-The PX30 clock controller generates and supplies clock to various
-controllers within the SoC and also implements a reset controller for SoC
-peripherals.
-
-Required Properties:
-
-- compatible: PMU for CRU should be "rockchip,px30-pmu-cru"
-- compatible: CRU should be "rockchip,px30-cru"
-- reg: physical base address of the controller and length of memory mapped
- region.
-- clocks: A list of phandle + clock-specifier pairs for the clocks listed
- in clock-names
-- clock-names: Should contain the following:
- - "xin24m" for both PMUCRU and CRU
- - "gpll" for CRU (sourced from PMUCRU)
-- #clock-cells: should be 1.
-- #reset-cells: should be 1.
-
-Optional Properties:
-
-- rockchip,grf: phandle to the syscon managing the "general register files"
- If missing, pll rates are not changeable, due to the missing pll lock status.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. All available clocks are defined as
-preprocessor macros in the dt-bindings/clock/px30-cru.h headers and can be
-used in device tree sources. Similar macros exist for the reset sources in
-these files.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xin24m" - crystal input - required,
- - "xin32k" - rtc clock - optional,
- - "i2sx_clkin" - external I2S clock - optional,
- - "gmac_clkin" - external GMAC clock - optional
-
-Example: Clock controller node:
-
- pmucru: clock-controller@ff2bc000 {
- compatible = "rockchip,px30-pmucru";
- reg = <0x0 0xff2bc000 0x0 0x1000>;
- #clock-cells = <1>;
- #reset-cells = <1>;
- };
-
- cru: clock-controller@ff2b0000 {
- compatible = "rockchip,px30-cru";
- reg = <0x0 0xff2b0000 0x0 0x1000>;
- rockchip,grf = <&grf>;
- #clock-cells = <1>;
- #reset-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller:
-
- uart0: serial@ff030000 {
- compatible = "rockchip,px30-uart", "snps,dw-apb-uart";
- reg = <0x0 0xff030000 0x0 0x100>;
- interrupts = <GIC_SPI 15 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&pmucru SCLK_UART0_PMU>, <&pmucru PCLK_UART0_PMU>;
- clock-names = "baudclk", "apb_pclk";
- reg-shift = <2>;
- reg-io-width = <4>;
- };
diff --git a/Documentation/devicetree/bindings/clock/rockchip,px30-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,px30-cru.yaml
new file mode 100644
index 000000000000..0f0f64b6f8cb
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/rockchip,px30-cru.yaml
@@ -0,0 +1,119 @@
+# SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/rockchip,px30-cru.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip PX30 Clock and Reset Unit (CRU)
+
+maintainers:
+ - Elaine Zhang <zhangqing@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+description: |
+ The PX30 clock controller generates and supplies clocks to various
+ controllers within the SoC and also implements a reset controller for SoC
+ peripherals.
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All available clocks are defined as
+ preprocessor macros in the dt-bindings/clock/px30-cru.h headers and can be
+ used in device tree sources. Similar macros exist for the reset sources in
+ these files.
+ There are several clocks that are generated outside the SoC. It is expected
+ that they are defined using standard clock bindings with the following
+ clock-output-names:
+ - "xin24m" - crystal input - required
+ - "xin32k" - rtc clock - optional
+ - "i2sx_clkin" - external I2S clock - optional
+ - "gmac_clkin" - external GMAC clock - optional
+
+properties:
+ compatible:
+ enum:
+ - rockchip,px30-cru
+ - rockchip,px30-pmucru
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ clocks:
+ minItems: 1
+ items:
+ - description: Clock for both PMUCRU and CRU
+ - description: Clock for CRU (sourced from PMUCRU)
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: xin24m
+ - const: gpll
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the syscon managing the "general register files" (GRF).
+ If missing, PLL rates are not changeable, due to the missing PLL
+ lock status.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - "#clock-cells"
+ - "#reset-cells"
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: rockchip,px30-cru
+
+ then:
+ properties:
+ clocks:
+ minItems: 2
+
+ clock-names:
+ minItems: 2
+
+ else:
+ properties:
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ maxItems: 1
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/px30-cru.h>
+
+ pmucru: clock-controller@ff2bc000 {
+ compatible = "rockchip,px30-pmucru";
+ reg = <0xff2bc000 0x1000>;
+ clocks = <&xin24m>;
+ clock-names = "xin24m";
+ rockchip,grf = <&grf>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
+
+ cru: clock-controller@ff2b0000 {
+ compatible = "rockchip,px30-cru";
+ reg = <0xff2b0000 0x1000>;
+ clocks = <&xin24m>, <&pmucru PLL_GPLL>;
+ clock-names = "xin24m", "gpll";
+ rockchip,grf = <&grf>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
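The external inputs named in the description are ordinary fixed-clock providers whose clock-output-names carry the expected strings. A sketch for the required crystal, assuming a 24 MHz part:

    xin24m: xin24m {
        compatible = "fixed-clock";
        clock-frequency = <24000000>;
        clock-output-names = "xin24m";
        #clock-cells = <0>;
    };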
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3036-cru.txt b/Documentation/devicetree/bindings/clock/rockchip,rk3036-cru.txt
deleted file mode 100644
index 20df350b9ef3..000000000000
--- a/Documentation/devicetree/bindings/clock/rockchip,rk3036-cru.txt
+++ /dev/null
@@ -1,56 +0,0 @@
-* Rockchip RK3036 Clock and Reset Unit
-
-The RK3036 clock controller generates and supplies clock to various
-controllers within the SoC and also implements a reset controller for SoC
-peripherals.
-
-Required Properties:
-
-- compatible: should be "rockchip,rk3036-cru"
-- reg: physical base address of the controller and length of memory mapped
- region.
-- #clock-cells: should be 1.
-- #reset-cells: should be 1.
-
-Optional Properties:
-
-- rockchip,grf: phandle to the syscon managing the "general register files"
- If missing pll rates are not changeable, due to the missing pll lock status.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. All available clocks are defined as
-preprocessor macros in the dt-bindings/clock/rk3036-cru.h headers and can be
-used in device tree sources. Similar macros exist for the reset sources in
-these files.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xin24m" - crystal input - required,
- - "ext_i2s" - external I2S clock - optional,
- - "rmii_clkin" - external EMAC clock - optional
-
-Example: Clock controller node:
-
- cru: cru@20000000 {
- compatible = "rockchip,rk3036-cru";
- reg = <0x20000000 0x1000>;
- rockchip,grf = <&grf>;
-
- #clock-cells = <1>;
- #reset-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller:
-
- uart0: serial@20060000 {
- compatible = "snps,dw-apb-uart";
- reg = <0x20060000 0x100>;
- interrupts = <GIC_SPI 20 IRQ_TYPE_LEVEL_HIGH>;
- reg-shift = <2>;
- reg-io-width = <4>;
- clocks = <&cru SCLK_UART0>;
- };
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3036-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rk3036-cru.yaml
new file mode 100644
index 000000000000..ba5b45464315
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/rockchip,rk3036-cru.yaml
@@ -0,0 +1,72 @@
+# SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/rockchip,rk3036-cru.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip RK3036 Clock and Reset Unit (CRU)
+
+maintainers:
+ - Elaine Zhang <zhangqing@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+description: |
+ The RK3036 clock controller generates and supplies clocks to various
+ controllers within the SoC and also implements a reset controller for SoC
+ peripherals.
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All available clocks are defined as
+ preprocessor macros in the dt-bindings/clock/rk3036-cru.h headers and can be
+ used in device tree sources. Similar macros exist for the reset sources in
+ these files.
+ There are several clocks that are generated outside the SoC. It is expected
+ that they are defined using standard clock bindings with the following
+ clock-output-names:
+ - "xin24m" - crystal input - required
+ - "ext_i2s" - external I2S clock - optional
+ - "rmii_clkin" - external EMAC clock - optional
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rk3036-cru
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: xin24m
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the syscon managing the "general register files" (GRF).
+ If missing, pll rates are not changeable due to the missing pll
+ lock status.
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ cru: clock-controller@20000000 {
+ compatible = "rockchip,rk3036-cru";
+ reg = <0x20000000 0x1000>;
+ rockchip,grf = <&grf>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
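
For reference, the consumer example from the old text binding still applies: a
client selects a clock by the identifiers from dt-bindings/clock/rk3036-cru.h,
for instance:

    uart0: serial@20060000 {
        compatible = "snps,dw-apb-uart";
        reg = <0x20060000 0x100>;
        interrupts = <GIC_SPI 20 IRQ_TYPE_LEVEL_HIGH>;
        reg-shift = <2>;
        reg-io-width = <4>;
        clocks = <&cru SCLK_UART0>;
    };
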
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3128-cru.txt b/Documentation/devicetree/bindings/clock/rockchip,rk3128-cru.txt
deleted file mode 100644
index 6f8744fd301b..000000000000
--- a/Documentation/devicetree/bindings/clock/rockchip,rk3128-cru.txt
+++ /dev/null
@@ -1,58 +0,0 @@
-* Rockchip RK3126/RK3128 Clock and Reset Unit
-
-The RK3126/RK3128 clock controller generates and supplies clock to various
-controllers within the SoC and also implements a reset controller for SoC
-peripherals.
-
-Required Properties:
-
-- compatible: should be "rockchip,rk3126-cru" or "rockchip,rk3128-cru"
- "rockchip,rk3126-cru" - controller compatible with RK3126 SoC.
- "rockchip,rk3128-cru" - controller compatible with RK3128 SoC.
-- reg: physical base address of the controller and length of memory mapped
- region.
-- #clock-cells: should be 1.
-- #reset-cells: should be 1.
-
-Optional Properties:
-
-- rockchip,grf: phandle to the syscon managing the "general register files"
- If missing pll rates are not changeable, due to the missing pll lock status.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. All available clocks are defined as
-preprocessor macros in the dt-bindings/clock/rk3128-cru.h headers and can be
-used in device tree sources. Similar macros exist for the reset sources in
-these files.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xin24m" - crystal input - required,
- - "ext_i2s" - external I2S clock - optional,
- - "gmac_clkin" - external GMAC clock - optional
-
-Example: Clock controller node:
-
- cru: cru@20000000 {
- compatible = "rockchip,rk3128-cru";
- reg = <0x20000000 0x1000>;
- rockchip,grf = <&grf>;
-
- #clock-cells = <1>;
- #reset-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller:
-
- uart2: serial@20068000 {
- compatible = "rockchip,serial";
- reg = <0x20068000 0x100>;
- interrupts = <GIC_SPI 22 IRQ_TYPE_LEVEL_HIGH>;
- clock-frequency = <24000000>;
- clocks = <&cru SCLK_UART2>, <&cru PCLK_UART2>;
- clock-names = "sclk_uart", "pclk_uart";
- };
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3128-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rk3128-cru.yaml
new file mode 100644
index 000000000000..b3d9c8eca989
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/rockchip,rk3128-cru.yaml
@@ -0,0 +1,76 @@
+# SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/rockchip,rk3128-cru.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip RK3126/RK3128 Clock and Reset Unit (CRU)
+
+maintainers:
+ - Elaine Zhang <zhangqing@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+description: |
+ The RK3126/RK3128 clock controller generates and supplies clock to various
+ controllers within the SoC and also implements a reset controller for SoC
+ peripherals.
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All available clocks are defined as
+ preprocessor macros in the dt-bindings/clock/rk3128-cru.h headers and can be
+ used in device tree sources. Similar macros exist for the reset sources in
+ these files.
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rk3126-cru
+ - rockchip,rk3128-cru
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 3
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: xin24m
+ - enum:
+ - ext_i2s
+ - gmac_clkin
+ - enum:
+ - ext_i2s
+ - gmac_clkin
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the syscon managing the "general register files" (GRF).
+ If missing, pll rates are not changeable due to the missing pll
+ lock status.
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ cru: clock-controller@20000000 {
+ compatible = "rockchip,rk3128-cru";
+ reg = <0x20000000 0x1000>;
+ rockchip,grf = <&grf>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
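
On RK3126/RK3128 a UART consumes both its baud clock and its bus clock from
the CRU, as in the example from the old text binding (IDs from
dt-bindings/clock/rk3128-cru.h):

    uart2: serial@20068000 {
        compatible = "rockchip,serial";
        reg = <0x20068000 0x100>;
        interrupts = <GIC_SPI 22 IRQ_TYPE_LEVEL_HIGH>;
        clock-frequency = <24000000>;
        clocks = <&cru SCLK_UART2>, <&cru PCLK_UART2>;
        clock-names = "sclk_uart", "pclk_uart";
    };
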
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3188-cru.txt b/Documentation/devicetree/bindings/clock/rockchip,rk3188-cru.txt
deleted file mode 100644
index 7f368530a2e4..000000000000
--- a/Documentation/devicetree/bindings/clock/rockchip,rk3188-cru.txt
+++ /dev/null
@@ -1,61 +0,0 @@
-* Rockchip RK3188/RK3066 Clock and Reset Unit
-
-The RK3188/RK3066 clock controller generates and supplies clock to various
-controllers within the SoC and also implements a reset controller for SoC
-peripherals.
-
-Required Properties:
-
-- compatible: should be "rockchip,rk3188-cru", "rockchip,rk3188a-cru" or
- "rockchip,rk3066a-cru"
-- reg: physical base address of the controller and length of memory mapped
- region.
-- #clock-cells: should be 1.
-- #reset-cells: should be 1.
-
-Optional Properties:
-
-- rockchip,grf: phandle to the syscon managing the "general register files"
- If missing pll rates are not changeable, due to the missing pll lock status.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. All available clocks are defined as
-preprocessor macros in the dt-bindings/clock/rk3188-cru.h and
-dt-bindings/clock/rk3066-cru.h headers and can be used in device tree sources.
-Similar macros exist for the reset sources in these files.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xin24m" - crystal input - required,
- - "xin32k" - rtc clock - optional,
- - "xin27m" - 27mhz crystal input on rk3066 - optional,
- - "ext_hsadc" - external HSADC clock - optional,
- - "ext_cif0" - external camera clock - optional,
- - "ext_rmii" - external RMII clock - optional,
- - "ext_jtag" - externalJTAG clock - optional
-
-Example: Clock controller node:
-
- cru: cru@20000000 {
- compatible = "rockchip,rk3188-cru";
- reg = <0x20000000 0x1000>;
- rockchip,grf = <&grf>;
-
- #clock-cells = <1>;
- #reset-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller:
-
- uart0: serial@10124000 {
- compatible = "snps,dw-apb-uart";
- reg = <0x10124000 0x400>;
- interrupts = <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>;
- reg-shift = <2>;
- reg-io-width = <1>;
- clocks = <&cru SCLK_UART0>;
- };
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3188-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rk3188-cru.yaml
new file mode 100644
index 000000000000..ddd7e46af0f2
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/rockchip,rk3188-cru.yaml
@@ -0,0 +1,78 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/rockchip,rk3188-cru.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip RK3188/RK3066 Clock and Reset Unit (CRU)
+
+maintainers:
+ - Elaine Zhang <zhangqing@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+description: |
+ The RK3188/RK3066 clock controller generates and supplies clocks to various
+ controllers within the SoC and also implements a reset controller for SoC
+ peripherals.
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All available clocks are defined as
+ preprocessor macros in the dt-bindings/clock/rk3188-cru.h and
+ dt-bindings/clock/rk3066-cru.h headers and can be used in device tree sources.
+ Similar macros exist for the reset sources in these files.
+ There are several clocks that are generated outside the SoC. It is expected
+ that they are defined using standard clock bindings with following
+ clock-output-names:
+ - "xin24m" - crystal input - required
+ - "xin32k" - RTC clock - optional
+ - "xin27m" - 27mhz crystal input on RK3066 - optional
+ - "ext_hsadc" - external HSADC clock - optional
+ - "ext_cif0" - external camera clock - optional
+ - "ext_rmii" - external RMII clock - optional
+ - "ext_jtag" - external JTAG clock - optional
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rk3066a-cru
+ - rockchip,rk3188-cru
+ - rockchip,rk3188a-cru
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: xin24m
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the syscon managing the "general register files" (GRF).
+ If missing, pll rates are not changeable due to the missing pll
+ lock status.
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ cru: clock-controller@20000000 {
+ compatible = "rockchip,rk3188-cru";
+ reg = <0x20000000 0x1000>;
+ rockchip,grf = <&grf>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
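
The required "xin24m" input is expected to come from a standard fixed-rate
clock node; a minimal provider sketch (the 24 MHz rate is implied by the
clock name):

    xin24m: xin24m {
        compatible = "fixed-clock";
        clock-frequency = <24000000>;
        clock-output-names = "xin24m";
        #clock-cells = <0>;
    };
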
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3228-cru.txt b/Documentation/devicetree/bindings/clock/rockchip,rk3228-cru.txt
deleted file mode 100644
index f323048127eb..000000000000
--- a/Documentation/devicetree/bindings/clock/rockchip,rk3228-cru.txt
+++ /dev/null
@@ -1,58 +0,0 @@
-* Rockchip RK3228 Clock and Reset Unit
-
-The RK3228 clock controller generates and supplies clock to various
-controllers within the SoC and also implements a reset controller for SoC
-peripherals.
-
-Required Properties:
-
-- compatible: should be "rockchip,rk3228-cru"
-- reg: physical base address of the controller and length of memory mapped
- region.
-- #clock-cells: should be 1.
-- #reset-cells: should be 1.
-
-Optional Properties:
-
-- rockchip,grf: phandle to the syscon managing the "general register files"
- If missing pll rates are not changeable, due to the missing pll lock status.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. All available clocks are defined as
-preprocessor macros in the dt-bindings/clock/rk3228-cru.h headers and can be
-used in device tree sources. Similar macros exist for the reset sources in
-these files.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xin24m" - crystal input - required,
- - "ext_i2s" - external I2S clock - optional,
- - "ext_gmac" - external GMAC clock - optional
- - "ext_hsadc" - external HSADC clock - optional
- - "phy_50m_out" - output clock of the pll in the mac phy
-
-Example: Clock controller node:
-
- cru: cru@20000000 {
- compatible = "rockchip,rk3228-cru";
- reg = <0x20000000 0x1000>;
- rockchip,grf = <&grf>;
-
- #clock-cells = <1>;
- #reset-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller:
-
- uart0: serial@10110000 {
- compatible = "snps,dw-apb-uart";
- reg = <0x10110000 0x100>;
- interrupts = <GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>;
- reg-shift = <2>;
- reg-io-width = <4>;
- clocks = <&cru SCLK_UART0>;
- };
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3228-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rk3228-cru.yaml
new file mode 100644
index 000000000000..1050fff72ade
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/rockchip,rk3228-cru.yaml
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/rockchip,rk3228-cru.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip RK3228 Clock and Reset Unit (CRU)
+
+maintainers:
+ - Elaine Zhang <zhangqing@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+description: |
+ The RK3228 clock controller generates and supplies clocks to various
+ controllers within the SoC and also implements a reset controller for SoC
+ peripherals.
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All available clocks are defined as
+ preprocessor macros in the dt-bindings/clock/rk3228-cru.h headers and can be
+ used in device tree sources. Similar macros exist for the reset sources in
+ these files.
+ There are several clocks that are generated outside the SoC. It is expected
+ that they are defined using standard clock bindings with following
+ clock-output-names:
+ - "xin24m" - crystal input - required
+ - "ext_i2s" - external I2S clock - optional
+ - "ext_gmac" - external GMAC clock - optional
+ - "ext_hsadc" - external HSADC clock - optional
+ - "phy_50m_out" - output clock of the pll in the mac phy
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rk3228-cru
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: xin24m
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the syscon managing the "general register files" (GRF).
+ If missing, pll rates are not changeable due to the missing pll
+ lock status.
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ cru: clock-controller@20000000 {
+ compatible = "rockchip,rk3228-cru";
+ reg = <0x20000000 0x1000>;
+ rockchip,grf = <&grf>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
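
Reset lines are referenced the same way as clocks, through the macros shipped
in the same dt-bindings header; a sketch based on the old text example (the
SRST_UART0 reset identifier is illustrative):

    uart0: serial@10110000 {
        compatible = "snps,dw-apb-uart";
        reg = <0x10110000 0x100>;
        interrupts = <GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>;
        reg-shift = <2>;
        reg-io-width = <4>;
        clocks = <&cru SCLK_UART0>;
        resets = <&cru SRST_UART0>; /* illustrative reset ID */
    };
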
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3288-cru.txt b/Documentation/devicetree/bindings/clock/rockchip,rk3288-cru.txt
deleted file mode 100644
index bf3a9ec19241..000000000000
--- a/Documentation/devicetree/bindings/clock/rockchip,rk3288-cru.txt
+++ /dev/null
@@ -1,67 +0,0 @@
-* Rockchip RK3288 Clock and Reset Unit
-
-The RK3288 clock controller generates and supplies clock to various
-controllers within the SoC and also implements a reset controller for SoC
-peripherals.
-
-A revision of this SoC is available: rk3288w. The clock tree is a bit
-different so another dt-compatible is available. Noticed that it is only
-setting the difference but there is no automatic revision detection. This
-should be performed by bootloaders.
-
-Required Properties:
-
-- compatible: should be "rockchip,rk3288-cru" or "rockchip,rk3288w-cru" in
- case of this revision of Rockchip rk3288.
-- reg: physical base address of the controller and length of memory mapped
- region.
-- #clock-cells: should be 1.
-- #reset-cells: should be 1.
-
-Optional Properties:
-
-- rockchip,grf: phandle to the syscon managing the "general register files"
- If missing pll rates are not changeable, due to the missing pll lock status.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. All available clocks are defined as
-preprocessor macros in the dt-bindings/clock/rk3288-cru.h headers and can be
-used in device tree sources. Similar macros exist for the reset sources in
-these files.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xin24m" - crystal input - required,
- - "xin32k" - rtc clock - optional,
- - "ext_i2s" - external I2S clock - optional,
- - "ext_hsadc" - external HSADC clock - optional,
- - "ext_edp_24m" - external display port clock - optional,
- - "ext_vip" - external VIP clock - optional,
- - "ext_isp" - external ISP clock - optional,
- - "ext_jtag" - external JTAG clock - optional
-
-Example: Clock controller node:
-
- cru: cru@20000000 {
- compatible = "rockchip,rk3188-cru";
- reg = <0x20000000 0x1000>;
- rockchip,grf = <&grf>;
-
- #clock-cells = <1>;
- #reset-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller:
-
- uart0: serial@10124000 {
- compatible = "snps,dw-apb-uart";
- reg = <0x10124000 0x400>;
- interrupts = <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>;
- reg-shift = <2>;
- reg-io-width = <1>;
- clocks = <&cru SCLK_UART0>;
- };
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3288-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rk3288-cru.yaml
new file mode 100644
index 000000000000..6655e97d52e4
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/rockchip,rk3288-cru.yaml
@@ -0,0 +1,85 @@
+# SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/rockchip,rk3288-cru.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip RK3288 Clock and Reset Unit (CRU)
+
+maintainers:
+ - Elaine Zhang <zhangqing@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+description: |
+ The RK3288 clock controller generates and supplies clocks to various
+ controllers within the SoC and also implements a reset controller for SoC
+ peripherals.
+
+ A revision of this SoC is available: rk3288w. The clock tree is a bit
+ different, so a separate dt-compatible is available. Note that it only
+ encodes the difference; there is no automatic revision detection, so
+ this should be performed by boot loaders.
+
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All available clocks are defined as
+ preprocessor macros in the dt-bindings/clock/rk3288-cru.h headers and can be
+ used in device tree sources. Similar macros exist for the reset sources in
+ these files.
+
+ There are several clocks that are generated outside the SoC. It is expected
+ that they are defined using standard clock bindings with following
+ clock-output-names:
+ - "xin24m" - crystal input - required,
+ - "xin32k" - rtc clock - optional,
+ - "ext_i2s" - external I2S clock - optional,
+ - "ext_hsadc" - external HSADC clock - optional,
+ - "ext_edp_24m" - external display port clock - optional,
+ - "ext_vip" - external VIP clock - optional,
+ - "ext_isp" - external ISP clock - optional,
+ - "ext_jtag" - external JTAG clock - optional
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rk3288-cru
+ - rockchip,rk3288w-cru
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: xin24m
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the syscon managing the "general register files" (GRF).
+ If missing, pll rates are not changeable due to the missing pll
+ lock status.
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ cru: clock-controller@ff760000 {
+ compatible = "rockchip,rk3288-cru";
+ reg = <0xff760000 0x1000>;
+ rockchip,grf = <&grf>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
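
A consumer sketch for RK3288, assuming the usual dw-apb-uart pairing of baud
and bus clocks (register and interrupt values are illustrative):

    uart0: serial@ff180000 {
        compatible = "snps,dw-apb-uart";
        reg = <0xff180000 0x100>;
        interrupts = <GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>;
        reg-shift = <2>;
        reg-io-width = <4>;
        clocks = <&cru SCLK_UART0>, <&cru PCLK_UART0>;
        clock-names = "baudclk", "apb_pclk";
    };
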
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3308-cru.txt b/Documentation/devicetree/bindings/clock/rockchip,rk3308-cru.txt
deleted file mode 100644
index 9b151c5b0c90..000000000000
--- a/Documentation/devicetree/bindings/clock/rockchip,rk3308-cru.txt
+++ /dev/null
@@ -1,60 +0,0 @@
-* Rockchip RK3308 Clock and Reset Unit
-
-The RK3308 clock controller generates and supplies clock to various
-controllers within the SoC and also implements a reset controller for SoC
-peripherals.
-
-Required Properties:
-
-- compatible: CRU should be "rockchip,rk3308-cru"
-- reg: physical base address of the controller and length of memory mapped
- region.
-- #clock-cells: should be 1.
-- #reset-cells: should be 1.
-
-Optional Properties:
-
-- rockchip,grf: phandle to the syscon managing the "general register files"
- If missing, pll rates are not changeable, due to the missing pll lock status.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. All available clocks are defined as
-preprocessor macros in the dt-bindings/clock/rk3308-cru.h headers and can be
-used in device tree sources. Similar macros exist for the reset sources in
-these files.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xin24m" - crystal input - required,
- - "xin32k" - rtc clock - optional,
- - "mclk_i2s0_8ch_in", "mclk_i2s1_8ch_in", "mclk_i2s2_8ch_in",
- "mclk_i2s3_8ch_in", "mclk_i2s0_2ch_in",
- "mclk_i2s1_2ch_in" - external I2S or SPDIF clock - optional,
- - "mac_clkin" - external MAC clock - optional
-
-Example: Clock controller node:
-
- cru: clock-controller@ff500000 {
- compatible = "rockchip,rk3308-cru";
- reg = <0x0 0xff500000 0x0 0x1000>;
- rockchip,grf = <&grf>;
- #clock-cells = <1>;
- #reset-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller:
-
- uart0: serial@ff0a0000 {
- compatible = "rockchip,rk3308-uart", "snps,dw-apb-uart";
- reg = <0x0 0xff0a0000 0x0 0x100>;
- interrupts = <GIC_SPI 18 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&cru SCLK_UART0>, <&cru PCLK_UART0>;
- clock-names = "baudclk", "apb_pclk";
- reg-shift = <2>;
- reg-io-width = <4>;
- status = "disabled";
- };
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3308-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rk3308-cru.yaml
new file mode 100644
index 000000000000..fec37f5b80f6
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/rockchip,rk3308-cru.yaml
@@ -0,0 +1,76 @@
+# SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/rockchip,rk3308-cru.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip RK3308 Clock and Reset Unit (CRU)
+
+maintainers:
+ - Elaine Zhang <zhangqing@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+description: |
+ The RK3308 clock controller generates and supplies clocks to various
+ controllers within the SoC and also implements a reset controller for SoC
+ peripherals.
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All available clocks are defined as
+ preprocessor macros in the dt-bindings/clock/rk3308-cru.h headers and can be
+ used in device tree sources. Similar macros exist for the reset sources in
+ these files.
+ There are several clocks that are generated outside the SoC. It is expected
+ that they are defined using standard clock bindings with following
+ clock-output-names:
+ - "xin24m" - crystal input - required
+ - "xin32k" - rtc clock - optional
+ - "mclk_i2s0_8ch_in", "mclk_i2s1_8ch_in",
+ "mclk_i2s2_8ch_in", "mclk_i2s3_8ch_in",
+ "mclk_i2s0_2ch_in", "mclk_i2s1_2ch_in" - external I2S or
+ SPDIF clock - optional
+ - "mac_clkin" - external MAC clock - optional
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rk3308-cru
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: xin24m
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the syscon managing the "general register files" (GRF).
+ If missing, pll rates are not changeable due to the missing pll
+ lock status.
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ cru: clock-controller@ff500000 {
+ compatible = "rockchip,rk3308-cru";
+ reg = <0xff500000 0x1000>;
+ rockchip,grf = <&grf>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
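
The UART example from the old RK3308 text binding, with the 64-bit address
cells dropped to match the controller example above:

    uart0: serial@ff0a0000 {
        compatible = "rockchip,rk3308-uart", "snps,dw-apb-uart";
        reg = <0xff0a0000 0x100>;
        interrupts = <GIC_SPI 18 IRQ_TYPE_LEVEL_HIGH>;
        clocks = <&cru SCLK_UART0>, <&cru PCLK_UART0>;
        clock-names = "baudclk", "apb_pclk";
        reg-shift = <2>;
        reg-io-width = <4>;
        status = "disabled";
    };
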
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3368-cru.txt b/Documentation/devicetree/bindings/clock/rockchip,rk3368-cru.txt
deleted file mode 100644
index 7c8bbcfed8d2..000000000000
--- a/Documentation/devicetree/bindings/clock/rockchip,rk3368-cru.txt
+++ /dev/null
@@ -1,61 +0,0 @@
-* Rockchip RK3368 Clock and Reset Unit
-
-The RK3368 clock controller generates and supplies clock to various
-controllers within the SoC and also implements a reset controller for SoC
-peripherals.
-
-Required Properties:
-
-- compatible: should be "rockchip,rk3368-cru"
-- reg: physical base address of the controller and length of memory mapped
- region.
-- #clock-cells: should be 1.
-- #reset-cells: should be 1.
-
-Optional Properties:
-
-- rockchip,grf: phandle to the syscon managing the "general register files"
- If missing, pll rates are not changeable, due to the missing pll lock status.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. All available clocks are defined as
-preprocessor macros in the dt-bindings/clock/rk3368-cru.h headers and can be
-used in device tree sources. Similar macros exist for the reset sources in
-these files.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xin24m" - crystal input - required,
- - "xin32k" - rtc clock - optional,
- - "ext_i2s" - external I2S clock - optional,
- - "ext_gmac" - external GMAC clock - optional
- - "ext_hsadc" - external HSADC clock - optional,
- - "ext_isp" - external ISP clock - optional,
- - "ext_jtag" - external JTAG clock - optional
- - "ext_vip" - external VIP clock - optional,
- - "usbotg_out" - output clock of the pll in the otg phy
-
-Example: Clock controller node:
-
- cru: clock-controller@ff760000 {
- compatible = "rockchip,rk3368-cru";
- reg = <0x0 0xff760000 0x0 0x1000>;
- rockchip,grf = <&grf>;
- #clock-cells = <1>;
- #reset-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller:
-
- uart0: serial@10124000 {
- compatible = "snps,dw-apb-uart";
- reg = <0x10124000 0x400>;
- interrupts = <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>;
- reg-shift = <2>;
- reg-io-width = <1>;
- clocks = <&cru SCLK_UART0>;
- };
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3368-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rk3368-cru.yaml
new file mode 100644
index 000000000000..90af242b41c1
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/rockchip,rk3368-cru.yaml
@@ -0,0 +1,78 @@
+# SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/rockchip,rk3368-cru.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip RK3368 Clock and Reset Unit (CRU)
+
+maintainers:
+ - Elaine Zhang <zhangqing@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+description: |
+ The RK3368 clock controller generates and supplies clocks to various
+ controllers within the SoC and also implements a reset controller for SoC
+ peripherals.
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All available clocks are defined as
+ preprocessor macros in the dt-bindings/clock/rk3368-cru.h headers and can be
+ used in device tree sources. Similar macros exist for the reset sources in
+ these files.
+ There are several clocks that are generated outside the SoC. It is expected
+ that they are defined using standard clock bindings with following
+ clock-output-names:
+ - "xin24m" - crystal input - required
+ - "xin32k" - rtc clock - optional
+ - "ext_i2s" - external I2S clock - optional
+ - "ext_gmac" - external GMAC clock - optional
+ - "ext_hsadc" - external HSADC clock - optional
+ - "ext_isp" - external ISP clock - optional
+ - "ext_jtag" - external JTAG clock - optional
+ - "ext_vip" - external VIP clock - optional
+ - "usbotg_out" - output clock of the pll in the otg phy
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rk3368-cru
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: xin24m
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the syscon managing the "general register files" (GRF).
+ If missing, pll rates are not changeable due to the missing pll
+ lock status.
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ cru: clock-controller@ff760000 {
+ compatible = "rockchip,rk3368-cru";
+ reg = <0xff760000 0x1000>;
+ rockchip,grf = <&grf>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
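
Consumers select RK3368 clocks by the identifiers from
dt-bindings/clock/rk3368-cru.h, as in the old text example (node values are
illustrative):

    uart0: serial@10124000 {
        compatible = "snps,dw-apb-uart";
        reg = <0x10124000 0x400>;
        interrupts = <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>;
        reg-shift = <2>;
        reg-io-width = <1>;
        clocks = <&cru SCLK_UART0>;
    };
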
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3399-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rk3399-cru.yaml
index 72b286a1beba..0b758e015ee3 100644
--- a/Documentation/devicetree/bindings/clock/rockchip,rk3399-cru.yaml
+++ b/Documentation/devicetree/bindings/clock/rockchip,rk3399-cru.yaml
@@ -1,4 +1,4 @@
-# SPDX-License-Identifier: GPL-2.0-only
+# SPDX-License-Identifier: (GPL-2.0+ OR MIT)
%YAML 1.2
---
$id: http://devicetree.org/schemas/clock/rockchip,rk3399-cru.yaml#
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Rockchip RK3399 Clock and Reset Unit
maintainers:
- - Xing Zheng <zhengxing@rock-chips.com>
+ - Elaine Zhang <zhangqing@rock-chips.com>
- Heiko Stuebner <heiko@sntech.de>
description: |
@@ -22,11 +22,11 @@ description: |
There are several clocks that are generated outside the SoC. It is expected
that they are defined using standard clock bindings with following
clock-output-names:
- - "xin24m" - crystal input - required,
- - "xin32k" - rtc clock - optional,
- - "clkin_gmac" - external GMAC clock - optional,
- - "clkin_i2s" - external I2S clock - optional,
- - "pclkin_cif" - external ISP clock - optional,
+ - "xin24m" - crystal input - required,
+ - "xin32k" - rtc clock - optional,
+ - "clkin_gmac" - external GMAC clock - optional,
+ - "clkin_i2s" - external I2S clock - optional,
+ - "pclkin_cif" - external ISP clock - optional,
- "clk_usbphy0_480m" - output clock of the pll in the usbphy0
- "clk_usbphy1_480m" - output clock of the pll in the usbphy1
@@ -46,24 +46,15 @@ properties:
const: 1
clocks:
- minItems: 1
-
- assigned-clocks:
- minItems: 1
- maxItems: 64
-
- assigned-clock-parents:
- minItems: 1
- maxItems: 64
+ maxItems: 1
- assigned-clock-rates:
- minItems: 1
- maxItems: 64
+ clock-names:
+ const: xin24m
rockchip,grf:
$ref: /schemas/types.yaml#/definitions/phandle
- description: >
- phandle to the syscon managing the "general register files". It is used
+ description:
+ Phandle to the syscon managing the "general register files". It is used
for GRF muxes; if missing, any muxes present in the GRF will not be
available.
@@ -77,7 +68,7 @@ additionalProperties: false
examples:
- |
- pmucru: pmu-clock-controller@ff750000 {
+ pmucru: clock-controller@ff750000 {
compatible = "rockchip,rk3399-pmucru";
reg = <0xff750000 0x1000>;
#clock-cells = <1>;
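
With the tightened schema, the RK3399 CRU itself takes at most one input
clock, named "xin24m"; a sketch of the main CRU node under that constraint:

    cru: clock-controller@ff760000 {
        compatible = "rockchip,rk3399-cru";
        reg = <0xff760000 0x1000>;
        clocks = <&xin24m>;
        clock-names = "xin24m";
        #clock-cells = <1>;
        #reset-cells = <1>;
    };
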
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3568-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rk3568-cru.yaml
index b2c26097827f..f809c289445e 100644
--- a/Documentation/devicetree/bindings/clock/rockchip,rk3568-cru.yaml
+++ b/Documentation/devicetree/bindings/clock/rockchip,rk3568-cru.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/rockchip,rk3568-cru.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ROCKCHIP rk3568 Family Clock Control Module Binding
+title: ROCKCHIP rk3568 Family Clock Control Module
maintainers:
- Elaine Zhang <zhangqing@rock-chips.com>
@@ -34,6 +34,19 @@ properties:
"#reset-cells":
const: 1
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: xin24m
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the syscon managing the "general register files" (GRF).
+ If missing, pll rates are not changeable due to the missing pll
+ lock status.
+
required:
- compatible
- reg
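
A node using the properties added here might look like the following sketch
(the base address is illustrative):

    cru: clock-controller@fdd20000 {
        compatible = "rockchip,rk3568-cru";
        reg = <0xfdd20000 0x1000>;
        clocks = <&xin24m>;
        clock-names = "xin24m";
        rockchip,grf = <&grf>;
        #clock-cells = <1>;
        #reset-cells = <1>;
    };
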
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rk3588-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rk3588-cru.yaml
new file mode 100644
index 000000000000..74cd3f3f229a
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/rockchip,rk3588-cru.yaml
@@ -0,0 +1,71 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/rockchip,rk3588-cru.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip rk3588 Family Clock and Reset Control Module
+
+maintainers:
+ - Elaine Zhang <zhangqing@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+description: |
+ The RK3588 clock controller generates the clocks and also implements a reset
+ controller for SoC peripherals. For example, it provides SCLK_UART2 and
+ PCLK_UART2, as well as SRST_P_UART2 and SRST_S_UART2 for the second UART
+ module.
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All available clock and reset IDs
+ are defined as preprocessor macros in dt-binding headers.
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rk3588-cru
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ clocks:
+ minItems: 2
+ maxItems: 2
+
+ clock-names:
+ items:
+ - const: xin24m
+ - const: xin32k
+
+ assigned-clocks: true
+
+ assigned-clock-rates: true
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: >
+ Phandle to the syscon managing the "general register files". It is used
+ for GRF muxes; if missing, any muxes present in the GRF will not be
+ available.
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ cru: clock-controller@fd7c0000 {
+ compatible = "rockchip,rk3588-cru";
+ reg = <0xfd7c0000 0x5c000>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
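
Spelling out the UART2 case from the description as a consumer, using the
clock and reset identifiers named there (address and interrupt number are
illustrative):

    uart2: serial@feb50000 {
        compatible = "snps,dw-apb-uart";
        reg = <0xfeb50000 0x100>;
        interrupts = <GIC_SPI 333 IRQ_TYPE_LEVEL_HIGH>;
        clocks = <&cru SCLK_UART2>, <&cru PCLK_UART2>;
        clock-names = "baudclk", "apb_pclk";
        resets = <&cru SRST_S_UART2>, <&cru SRST_P_UART2>;
        reg-shift = <2>;
        reg-io-width = <4>;
    };
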
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rv1108-cru.txt b/Documentation/devicetree/bindings/clock/rockchip,rv1108-cru.txt
deleted file mode 100644
index 161326a4f9c1..000000000000
--- a/Documentation/devicetree/bindings/clock/rockchip,rv1108-cru.txt
+++ /dev/null
@@ -1,59 +0,0 @@
-* Rockchip RV1108 Clock and Reset Unit
-
-The RV1108 clock controller generates and supplies clock to various
-controllers within the SoC and also implements a reset controller for SoC
-peripherals.
-
-Required Properties:
-
-- compatible: should be "rockchip,rv1108-cru"
-- reg: physical base address of the controller and length of memory mapped
- region.
-- #clock-cells: should be 1.
-- #reset-cells: should be 1.
-
-Optional Properties:
-
-- rockchip,grf: phandle to the syscon managing the "general register files"
- If missing pll rates are not changeable, due to the missing pll lock status.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. All available clocks are defined as
-preprocessor macros in the dt-bindings/clock/rv1108-cru.h headers and can be
-used in device tree sources. Similar macros exist for the reset sources in
-these files.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xin24m" - crystal input - required,
- - "ext_vip" - external VIP clock - optional
- - "ext_i2s" - external I2S clock - optional
- - "ext_gmac" - external GMAC clock - optional
- - "hdmiphy" - external clock input derived from HDMI PHY - optional
- - "usbphy" - external clock input derived from USB PHY - optional
-
-Example: Clock controller node:
-
- cru: cru@20200000 {
- compatible = "rockchip,rv1108-cru";
- reg = <0x20200000 0x1000>;
- rockchip,grf = <&grf>;
-
- #clock-cells = <1>;
- #reset-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller:
-
- uart0: serial@10230000 {
- compatible = "rockchip,rv1108-uart", "snps,dw-apb-uart";
- reg = <0x10230000 0x100>;
- interrupts = <GIC_SPI 44 IRQ_TYPE_LEVEL_HIGH>;
- reg-shift = <2>;
- reg-io-width = <4>;
- clocks = <&cru SCLK_UART0>;
- };
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rv1108-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rv1108-cru.yaml
new file mode 100644
index 000000000000..4611d920b8df
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/rockchip,rv1108-cru.yaml
@@ -0,0 +1,75 @@
+# SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/rockchip,rv1108-cru.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip RV1108 Clock and Reset Unit (CRU)
+
+maintainers:
+ - Elaine Zhang <zhangqing@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+description: |
+ The RV1108 clock controller generates and supplies clocks to various
+ controllers within the SoC and also implements a reset controller for SoC
+ peripherals.
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All available clocks are defined as
+ preprocessor macros in the dt-bindings/clock/rv1108-cru.h headers and can be
+ used in device tree sources. Similar macros exist for the reset sources in
+ these files.
+ There are several clocks that are generated outside the SoC. It is expected
+ that they are defined using standard clock bindings with following
+ clock-output-names:
+ - "xin24m" - crystal input - required
+ - "ext_vip" - external VIP clock - optional
+ - "ext_i2s" - external I2S clock - optional
+ - "ext_gmac" - external GMAC clock - optional
+ - "hdmiphy" - external clock input derived from HDMI PHY - optional
+ - "usbphy" - external clock input derived from USB PHY - optional
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rv1108-cru
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: xin24m
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the syscon managing the "general register files" (GRF).
+ If missing, pll rates are not changeable due to the missing pll
+ lock status.
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ cru: clock-controller@20200000 {
+ compatible = "rockchip,rv1108-cru";
+ reg = <0x20200000 0x1000>;
+ rockchip,grf = <&grf>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
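
The consumer example from the old RV1108 text binding:

    uart0: serial@10230000 {
        compatible = "rockchip,rv1108-uart", "snps,dw-apb-uart";
        reg = <0x10230000 0x100>;
        interrupts = <GIC_SPI 44 IRQ_TYPE_LEVEL_HIGH>;
        reg-shift = <2>;
        reg-io-width = <4>;
        clocks = <&cru SCLK_UART0>;
    };
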
diff --git a/Documentation/devicetree/bindings/clock/rockchip,rv1126-cru.yaml b/Documentation/devicetree/bindings/clock/rockchip,rv1126-cru.yaml
new file mode 100644
index 000000000000..0998f8b922bd
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/rockchip,rv1126-cru.yaml
@@ -0,0 +1,62 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/rockchip,rv1126-cru.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip RV1126 Clock and Reset Unit
+
+maintainers:
+ - Jagan Teki <jagan@edgeble.ai>
+ - Finley Xiao <finley.xiao@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+description:
+ The RV1126 clock controller generates the clocks and also implements a
+ reset controller for SoC peripherals.
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rv1126-cru
+ - rockchip,rv1126-pmucru
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: xin24m
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the syscon managing the "general register files" (GRF).
+ If missing, pll rates are not changeable due to the missing pll
+ lock status.
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ cru: clock-controller@ff490000 {
+ compatible = "rockchip,rv1126-cru";
+ reg = <0xff490000 0x1000>;
+ rockchip,grf = <&grf>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
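
The same schema also covers the PMU variant; a sketch with an assumed base
address:

    pmucru: clock-controller@ff480000 {
        compatible = "rockchip,rv1126-pmucru";
        reg = <0xff480000 0x1000>;
        #clock-cells = <1>;
        #reset-cells = <1>;
    };
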
diff --git a/Documentation/devicetree/bindings/clock/samsung,exynos-audss-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,exynos-audss-clock.yaml
index f14f1d39da36..d819dfaafff9 100644
--- a/Documentation/devicetree/bindings/clock/samsung,exynos-audss-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/samsung,exynos-audss-clock.yaml
@@ -8,7 +8,7 @@ title: Samsung Exynos SoC Audio SubSystem clock controller
maintainers:
- Chanwoo Choi <cw00.choi@samsung.com>
- - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
- Sylwester Nawrocki <s.nawrocki@samsung.com>
- Tomasz Figa <tomasz.figa@gmail.com>
diff --git a/Documentation/devicetree/bindings/clock/samsung,exynos-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,exynos-clock.yaml
index 4e8062860986..0589a63e273a 100644
--- a/Documentation/devicetree/bindings/clock/samsung,exynos-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/samsung,exynos-clock.yaml
@@ -8,7 +8,7 @@ title: Samsung Exynos SoC clock controller
maintainers:
- Chanwoo Choi <cw00.choi@samsung.com>
- - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
- Sylwester Nawrocki <s.nawrocki@samsung.com>
- Tomasz Figa <tomasz.figa@gmail.com>
diff --git a/Documentation/devicetree/bindings/clock/samsung,exynos-ext-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,exynos-ext-clock.yaml
index 64d027dbe3b2..c98eff64f2b5 100644
--- a/Documentation/devicetree/bindings/clock/samsung,exynos-ext-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/samsung,exynos-ext-clock.yaml
@@ -8,7 +8,7 @@ title: Samsung SoC external/osc/XXTI/XusbXTI clock
maintainers:
- Chanwoo Choi <cw00.choi@samsung.com>
- - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
- Sylwester Nawrocki <s.nawrocki@samsung.com>
- Tomasz Figa <tomasz.figa@gmail.com>
diff --git a/Documentation/devicetree/bindings/clock/samsung,exynos4412-isp-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,exynos4412-isp-clock.yaml
index 1ed64add4355..bee13436d1ea 100644
--- a/Documentation/devicetree/bindings/clock/samsung,exynos4412-isp-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/samsung,exynos4412-isp-clock.yaml
@@ -8,7 +8,7 @@ title: Samsung Exynos4412 SoC ISP clock controller
maintainers:
- Chanwoo Choi <cw00.choi@samsung.com>
- - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
- Sylwester Nawrocki <s.nawrocki@samsung.com>
- Tomasz Figa <tomasz.figa@gmail.com>
@@ -61,4 +61,3 @@ examples:
clocks = <&clock CLK_ACLK200>, <&clock CLK_ACLK400_MCUISP>;
clock-names = "aclk200", "aclk400_mcuisp";
};
-
diff --git a/Documentation/devicetree/bindings/clock/samsung,exynos5260-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,exynos5260-clock.yaml
new file mode 100644
index 000000000000..b05f83533e3d
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/samsung,exynos5260-clock.yaml
@@ -0,0 +1,382 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/samsung,exynos5260-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos5260 SoC clock controller
+
+maintainers:
+ - Chanwoo Choi <cw00.choi@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+ - Sylwester Nawrocki <s.nawrocki@samsung.com>
+ - Tomasz Figa <tomasz.figa@gmail.com>
+
+description: |
+ Expected external clocks, defined in DTS as fixed-rate clocks with a matching
+ name:
+ - "fin_pll" - PLL input clock from XXTI
+ - "xrtcxti" - input clock from XRTCXTI
+ - "ioclk_pcm_extclk" - pcm external operation clock
+ - "ioclk_spdif_extclk" - spdif external operation clock
+ - "ioclk_i2s_cdclk" - i2s0 codec clock
+
+ PHY clocks:
+ There are several clocks which are generated by specific PHYs. These clocks
+ are fed into the clock controller and then routed to the hardware blocks.
+ These clocks are defined as fixed clocks in the driver with the following
+ names:
+ - "phyclk_dptx_phy_ch3_txd_clk" - dp phy clock for channel 3
+ - "phyclk_dptx_phy_ch2_txd_clk" - dp phy clock for channel 2
+ - "phyclk_dptx_phy_ch1_txd_clk" - dp phy clock for channel 1
+ - "phyclk_dptx_phy_ch0_txd_clk" - dp phy clock for channel 0
+ - "phyclk_hdmi_phy_tmds_clko" - hdmi phy tmds clock
+ - "phyclk_hdmi_phy_pixel_clko" - hdmi phy pixel clock
+ - "phyclk_hdmi_link_o_tmds_clkhi" - hdmi phy for hdmi link
+ - "phyclk_dptx_phy_o_ref_clk_24m" - dp phy reference clock
+ - "phyclk_dptx_phy_clk_div2"
+ - "phyclk_mipi_dphy_4l_m_rxclkesc0"
+ - "phyclk_usbhost20_phy_phyclock" - usb 2.0 phy clock
+ - "phyclk_usbhost20_phy_freeclk"
+ - "phyclk_usbhost20_phy_clk48mohci"
+ - "phyclk_usbdrd30_udrd30_pipe_pclk"
+ - "phyclk_usbdrd30_udrd30_phyclock" - usb 3.0 phy clock
+
+ All available clocks are defined as preprocessor macros in
+ include/dt-bindings/clock/exynos5260-clk.h header.
+
+properties:
+ compatible:
+ enum:
+ - samsung,exynos5260-clock-top
+ - samsung,exynos5260-clock-peri
+ - samsung,exynos5260-clock-egl
+ - samsung,exynos5260-clock-kfc
+ - samsung,exynos5260-clock-g2d
+ - samsung,exynos5260-clock-mif
+ - samsung,exynos5260-clock-mfc
+ - samsung,exynos5260-clock-g3d
+ - samsung,exynos5260-clock-fsys
+ - samsung,exynos5260-clock-aud
+ - samsung,exynos5260-clock-isp
+ - samsung,exynos5260-clock-gscl
+ - samsung,exynos5260-clock-disp
+
+ clocks:
+ minItems: 1
+ maxItems: 19
+
+ clock-names:
+ minItems: 1
+ maxItems: 19
+
+ "#clock-cells":
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - "#clock-cells"
+ - reg
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-top
+ then:
+ properties:
+ clocks:
+ minItems: 4
+ maxItems: 4
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_mem_pll
+ - const: dout_bus_pll
+ - const: dout_media_pll
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-peri
+ then:
+ properties:
+ clocks:
+ minItems: 13
+ maxItems: 13
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: ioclk_pcm_extclk
+ - const: ioclk_i2s_cdclk
+ - const: ioclk_spdif_extclk
+ - const: phyclk_hdmi_phy_ref_cko
+ - const: dout_aclk_peri_66
+ - const: dout_sclk_peri_uart0
+ - const: dout_sclk_peri_uart1
+ - const: dout_sclk_peri_uart2
+ - const: dout_sclk_peri_spi0_b
+ - const: dout_sclk_peri_spi1_b
+ - const: dout_sclk_peri_spi2_b
+ - const: dout_aclk_peri_aud
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-egl
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_bus_pll
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-kfc
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_media_pll
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-g2d
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_aclk_g2d_333
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-mif
+ then:
+ properties:
+ clocks:
+ minItems: 1
+ maxItems: 1
+ clock-names:
+ items:
+ - const: fin_pll
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-mfc
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_aclk_mfc_333
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-g3d
+ then:
+ properties:
+ clocks:
+ minItems: 1
+ maxItems: 1
+ clock-names:
+ items:
+ - const: fin_pll
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-fsys
+ then:
+ properties:
+ clocks:
+ minItems: 7
+ maxItems: 7
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: phyclk_usbhost20_phy_phyclock
+ - const: phyclk_usbhost20_phy_freeclk
+ - const: phyclk_usbhost20_phy_clk48mohci
+ - const: phyclk_usbdrd30_udrd30_pipe_pclk
+ - const: phyclk_usbdrd30_udrd30_phyclock
+ - const: dout_aclk_fsys_200
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-aud
+ then:
+ properties:
+ clocks:
+ minItems: 4
+ maxItems: 4
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: fout_aud_pll
+ - const: ioclk_i2s_cdclk
+ - const: ioclk_pcm_extclk
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-isp
+ then:
+ properties:
+ clocks:
+ minItems: 4
+ maxItems: 4
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_aclk_isp1_266
+ - const: dout_aclk_isp1_400
+ - const: mout_aclk_isp1_266
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-gscl
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_aclk_gscl_400
+ - const: dout_aclk_gscl_333
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5260-clock-disp
+ then:
+ properties:
+ clocks:
+ minItems: 19
+ maxItems: 19
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: phyclk_dptx_phy_ch3_txd_clk
+ - const: phyclk_dptx_phy_ch2_txd_clk
+ - const: phyclk_dptx_phy_ch1_txd_clk
+ - const: phyclk_dptx_phy_ch0_txd_clk
+ - const: phyclk_hdmi_phy_tmds_clko
+ - const: phyclk_hdmi_phy_ref_clko
+ - const: phyclk_hdmi_phy_pixel_clko
+ - const: phyclk_hdmi_link_o_tmds_clkhi
+ - const: phyclk_mipi_dphy_4l_m_txbyte_clkhs
+ - const: phyclk_dptx_phy_o_ref_clk_24m
+ - const: phyclk_dptx_phy_clk_div2
+ - const: phyclk_mipi_dphy_4l_m_rxclkesc0
+ - const: phyclk_hdmi_phy_ref_cko
+ - const: ioclk_spdif_extclk
+ - const: dout_aclk_peri_aud
+ - const: dout_aclk_disp_222
+ - const: dout_sclk_disp_pixel
+ - const: dout_aclk_disp_333
+ required:
+ - clock-names
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos5260-clk.h>
+
+ fin_pll: clock {
+ compatible = "fixed-clock";
+ clock-output-names = "fin_pll";
+ #clock-cells = <0>;
+ clock-frequency = <24000000>;
+ };
+
+ clock-controller@10010000 {
+ compatible = "samsung,exynos5260-clock-top";
+ reg = <0x10010000 0x10000>;
+ #clock-cells = <1>;
+ clocks = <&fin_pll>,
+ <&clock_mif MIF_DOUT_MEM_PLL>,
+ <&clock_mif MIF_DOUT_BUS_PLL>,
+ <&clock_mif MIF_DOUT_MEDIA_PLL>;
+ clock-names = "fin_pll",
+ "dout_mem_pll",
+ "dout_bus_pll",
+ "dout_media_pll";
+ };
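
Per the if/then rules above, the MIF controller takes only "fin_pll"; a
minimal sketch (the base address is illustrative):

    clock_mif: clock-controller@10ce0000 {
        compatible = "samsung,exynos5260-clock-mif";
        reg = <0x10ce0000 0x10000>;
        #clock-cells = <1>;
        clocks = <&fin_pll>;
        clock-names = "fin_pll";
    };
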
diff --git a/Documentation/devicetree/bindings/clock/samsung,exynos5410-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,exynos5410-clock.yaml
new file mode 100644
index 000000000000..b737c9d35a1c
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/samsung,exynos5410-clock.yaml
@@ -0,0 +1,66 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/samsung,exynos5410-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos5410 SoC clock controller
+
+maintainers:
+ - Chanwoo Choi <cw00.choi@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+ - Sylwester Nawrocki <s.nawrocki@samsung.com>
+ - Tomasz Figa <tomasz.figa@gmail.com>
+
+description: |
+ Expected external clocks, defined in DTS as fixed-rate clocks with a matching
+ name:
+ - "fin_pll" - PLL input clock from XXTI
+
+ All available clocks are defined as preprocessor macros in
+ include/dt-bindings/clock/exynos5410.h header.
+
+properties:
+ compatible:
+ enum:
+ - samsung,exynos5410-clock
+
+ clocks:
+ description:
+ Should contain an entry specifying the root clock from the external
+ oscillator supplied through the XXTI or XusbXTI pin. This clock should be
+ defined using standard clock bindings with "fin_pll" clock-output-name.
+ That clock is passed internally to the 9 PLLs.
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - "#clock-cells"
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos5410.h>
+
+ fin_pll: osc-clock {
+ compatible = "fixed-clock";
+ clock-frequency = <24000000>;
+ clock-output-names = "fin_pll";
+ #clock-cells = <0>;
+ };
+
+ clock-controller@10010000 {
+ compatible = "samsung,exynos5410-clock";
+ reg = <0x10010000 0x30000>;
+ #clock-cells = <1>;
+ clocks = <&fin_pll>;
+ };
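
Peripherals then consume clocks by the identifiers from the exynos5410.h
header; a sketch of a UART consumer (the CLK_* names and node values are
illustrative):

    serial@12c00000 {
        compatible = "samsung,exynos4210-uart";
        reg = <0x12c00000 0x100>;
        interrupts = <GIC_SPI 51 IRQ_TYPE_LEVEL_HIGH>;
        clocks = <&clock CLK_UART0>, <&clock CLK_SCLK_UART0>;
        clock-names = "uart", "clk_uart_baud0";
    };
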
diff --git a/Documentation/devicetree/bindings/clock/samsung,exynos5433-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,exynos5433-clock.yaml
new file mode 100644
index 000000000000..3f9326e09f79
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/samsung,exynos5433-clock.yaml
@@ -0,0 +1,524 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/samsung,exynos5433-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos5433 SoC clock controller
+
+maintainers:
+ - Chanwoo Choi <cw00.choi@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+ - Sylwester Nawrocki <s.nawrocki@samsung.com>
+ - Tomasz Figa <tomasz.figa@gmail.com>
+
+description: |
+ Expected external clocks, defined in DTS as fixed-rate clocks with a matching
+ name:
+ - "oscclk" - PLL input clock from XXTI
+
+ All available clocks are defined as preprocessor macros in
+ include/dt-bindings/clock/exynos5433.h header.
+
+properties:
+ compatible:
+ enum:
+ # CMU_TOP which generates clocks for
+ # IMEM/FSYS/G3D/GSCL/HEVC/MSCL/G2D/MFC/PERIC/PERIS domains and bus
+ # clocks
+ - samsung,exynos5433-cmu-top
+ # CMU_CPIF which generates clocks for LLI (Low Latency Interface) IP
+ - samsung,exynos5433-cmu-cpif
+ # CMU_MIF which generates clocks for DRAM Memory Controller domain
+ - samsung,exynos5433-cmu-mif
+ # CMU_PERIC which generates clocks for
+ # UART/I2C/SPI/I2S/PCM/SPDIF/PWM/SLIMBUS IPs
+ - samsung,exynos5433-cmu-peric
+ # CMU_PERIS which generates clocks for PMU/TMU/MCT/WDT/RTC/SECKEY/TZPC IPs
+ - samsung,exynos5433-cmu-peris
+ # CMU_FSYS which generates clocks for USB/UFS/SDMMC/TSI/PDMA IPs
+ - samsung,exynos5433-cmu-fsys
+ - samsung,exynos5433-cmu-g2d
+ # CMU_DISP which generates clocks for Display (DECON/HDMI/DSIM/MIXER) IPs
+ - samsung,exynos5433-cmu-disp
+ - samsung,exynos5433-cmu-aud
+ - samsung,exynos5433-cmu-bus0
+ - samsung,exynos5433-cmu-bus1
+ - samsung,exynos5433-cmu-bus2
+ - samsung,exynos5433-cmu-g3d
+ - samsung,exynos5433-cmu-gscl
+ - samsung,exynos5433-cmu-apollo
+ # CMU_ATLAS which generates clocks for Cortex-A57 Quad-core processor,
+ # CoreSight and L2 cache controller
+ - samsung,exynos5433-cmu-atlas
+ # CMU_MSCL which generates clocks for M2M (Memory to Memory) scaler and
+ # JPEG IPs
+ - samsung,exynos5433-cmu-mscl
+ - samsung,exynos5433-cmu-mfc
+ - samsung,exynos5433-cmu-hevc
+ # CMU_ISP which generates clocks for FIMC-ISP/DRC/SCLC/DIS/3DNR IPs
+ - samsung,exynos5433-cmu-isp
+ # CMU_CAM0 which generates clocks for
+ # MIPI_CSIS{0|1}/FIMC_LITE_{A|B|D}/FIMC_3AA{0|1} IPs
+ - samsung,exynos5433-cmu-cam0
+ # CMU_CAM1 which generates clocks for
+ # Cortex-A5/MIPI_CSIS2/FIMC-LITE_C/FIMC-FD IPs
+ - samsung,exynos5433-cmu-cam1
+ # CMU_IMEM which generates clocks for SSS (Security SubSystem) and
+ # SlimSSS IPs
+ - samsung,exynos5433-cmu-imem
+
+ clocks:
+ minItems: 1
+ maxItems: 10
+
+ clock-names:
+ minItems: 1
+ maxItems: 10
+
+ "#clock-cells":
+ const: 1
+
+ power-domains:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - "#clock-cells"
+ - reg
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-top
+ then:
+ properties:
+ clocks:
+ minItems: 4
+ maxItems: 4
+ clock-names:
+ items:
+ - const: oscclk
+ - const: sclk_mphy_pll
+ - const: sclk_mfc_pll
+ - const: sclk_bus_pll
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-cpif
+ then:
+ properties:
+ clocks:
+ minItems: 1
+ maxItems: 1
+ clock-names:
+ items:
+ - const: oscclk
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-mif
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: oscclk
+ - const: sclk_mphy_pll
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-fsys
+ then:
+ properties:
+ clocks:
+ minItems: 10
+ maxItems: 10
+ clock-names:
+ items:
+ - const: oscclk
+ - const: sclk_ufs_mphy
+ - const: aclk_fsys_200
+ - const: sclk_pcie_100_fsys
+ - const: sclk_ufsunipro_fsys
+ - const: sclk_mmc2_fsys
+ - const: sclk_mmc1_fsys
+ - const: sclk_mmc0_fsys
+ - const: sclk_usbhost30_fsys
+ - const: sclk_usbdrd30_fsys
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-g2d
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ clock-names:
+ items:
+ - const: oscclk
+ - const: aclk_g2d_266
+ - const: aclk_g2d_400
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-disp
+ then:
+ properties:
+ clocks:
+ minItems: 9
+ maxItems: 9
+ clock-names:
+ items:
+ - const: oscclk
+ - const: sclk_dsim1_disp
+ - const: sclk_dsim0_disp
+ - const: sclk_dsd_disp
+ - const: sclk_decon_tv_eclk_disp
+ - const: sclk_decon_vclk_disp
+ - const: sclk_decon_eclk_disp
+ - const: sclk_decon_tv_vclk_disp
+ - const: aclk_disp_333
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-aud
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: oscclk
+ - const: fout_aud_pll
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-bus0
+ then:
+ properties:
+ clocks:
+ minItems: 1
+ maxItems: 1
+ clock-names:
+ items:
+ - const: aclk_bus0_400
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-bus1
+ then:
+ properties:
+ clocks:
+ minItems: 1
+ maxItems: 1
+ clock-names:
+ items:
+ - const: aclk_bus1_400
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-bus2
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: oscclk
+ - const: aclk_bus2_400
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-g3d
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: oscclk
+ - const: aclk_g3d_400
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-gscl
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ clock-names:
+ items:
+ - const: oscclk
+ - const: aclk_gscl_111
+ - const: aclk_gscl_333
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-apollo
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: oscclk
+ - const: sclk_bus_pll_apollo
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-atlas
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: oscclk
+ - const: sclk_bus_pll_atlas
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-mscl
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ clock-names:
+ items:
+ - const: oscclk
+ - const: sclk_jpeg_mscl
+ - const: aclk_mscl_400
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-mfc
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: oscclk
+ - const: aclk_mfc_400
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-hevc
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: oscclk
+ - const: aclk_hevc_400
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-isp
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ clock-names:
+ items:
+ - const: oscclk
+ - const: aclk_isp_dis_400
+ - const: aclk_isp_400
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-cam0
+ then:
+ properties:
+ clocks:
+ minItems: 4
+ maxItems: 4
+ clock-names:
+ items:
+ - const: oscclk
+ - const: aclk_cam0_333
+ - const: aclk_cam0_400
+ - const: aclk_cam0_552
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-cam1
+ then:
+ properties:
+ clocks:
+ minItems: 7
+ maxItems: 7
+ clock-names:
+ items:
+ - const: oscclk
+ - const: sclk_isp_uart_cam1
+ - const: sclk_isp_spi1_cam1
+ - const: sclk_isp_spi0_cam1
+ - const: aclk_cam1_333
+ - const: aclk_cam1_400
+ - const: aclk_cam1_552
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-cmu-imem
+ then:
+ properties:
+ clocks:
+ minItems: 4
+ maxItems: 4
+ clock-names:
+ items:
+ - const: oscclk
+ - const: aclk_imem_sssx_266
+ - const: aclk_imem_266
+ - const: aclk_imem_200
+ required:
+ - clock-names
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos5433.h>
+ xxti: clock {
+ compatible = "fixed-clock";
+ clock-output-names = "oscclk";
+ #clock-cells = <0>;
+ clock-frequency = <24000000>;
+ };
+
+ clock-controller@10030000 {
+ compatible = "samsung,exynos5433-cmu-top";
+ reg = <0x10030000 0x1000>;
+ #clock-cells = <1>;
+
+ clock-names = "oscclk",
+ "sclk_mphy_pll",
+ "sclk_mfc_pll",
+ "sclk_bus_pll";
+ clocks = <&xxti>,
+ <&cmu_cpif CLK_SCLK_MPHY_PLL>,
+ <&cmu_mif CLK_SCLK_MFC_PLL>,
+ <&cmu_mif CLK_SCLK_BUS_PLL>;
+ };
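The allOf conditionals above map one-to-one onto node contents for each CMU. A hedged sketch for CMU_MIF, which per the schema takes exactly "oscclk" and "sclk_mphy_pll"; the register base and size are assumed for illustration, and CLK_SCLK_MPHY_PLL is the same macro already used in the CMU_TOP example:

    cmu_mif: clock-controller@105b0000 {
        compatible = "samsung,exynos5433-cmu-mif";
        reg = <0x105b0000 0x2000>;  /* base/size assumed */
        #clock-cells = <1>;

        clock-names = "oscclk", "sclk_mphy_pll";
        clocks = <&xxti>, <&cmu_cpif CLK_SCLK_MPHY_PLL>;
    };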
diff --git a/Documentation/devicetree/bindings/clock/samsung,exynos7-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,exynos7-clock.yaml
new file mode 100644
index 000000000000..c137c6744ef9
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/samsung,exynos7-clock.yaml
@@ -0,0 +1,272 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/samsung,exynos7-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos7 SoC clock controller
+
+maintainers:
+ - Chanwoo Choi <cw00.choi@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+ - Sylwester Nawrocki <s.nawrocki@samsung.com>
+ - Tomasz Figa <tomasz.figa@gmail.com>
+
+description: |
+ Expected external clocks, defined in DTS as fixed-rate clocks with a matching
+ name:
+ - "fin_pll" - PLL input clock from XXTI
+
+ All available clocks are defined as preprocessor macros in
+ include/dt-bindings/clock/exynos7-clk.h header.
+
+properties:
+ compatible:
+ enum:
+ - samsung,exynos7-clock-topc
+ - samsung,exynos7-clock-top0
+ - samsung,exynos7-clock-top1
+ - samsung,exynos7-clock-ccore
+ - samsung,exynos7-clock-peric0
+ - samsung,exynos7-clock-peric1
+ - samsung,exynos7-clock-peris
+ - samsung,exynos7-clock-fsys0
+ - samsung,exynos7-clock-fsys1
+ - samsung,exynos7-clock-mscl
+ - samsung,exynos7-clock-aud
+
+ clocks:
+ minItems: 1
+ maxItems: 13
+
+ clock-names:
+ minItems: 1
+ maxItems: 13
+
+ "#clock-cells":
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - "#clock-cells"
+ - reg
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7-clock-top0
+ then:
+ properties:
+ clocks:
+ minItems: 6
+ maxItems: 6
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_sclk_bus0_pll
+ - const: dout_sclk_bus1_pll
+ - const: dout_sclk_cc_pll
+ - const: dout_sclk_mfc_pll
+ - const: dout_sclk_aud_pll
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7-clock-top1
+ then:
+ properties:
+ clocks:
+ minItems: 5
+ maxItems: 5
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_sclk_bus0_pll
+ - const: dout_sclk_bus1_pll
+ - const: dout_sclk_cc_pll
+ - const: dout_sclk_mfc_pll
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7-clock-ccore
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_aclk_ccore_133
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7-clock-peric0
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_aclk_peric0_66
+ - const: sclk_uart0
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7-clock-peric1
+ then:
+ properties:
+ clocks:
+ minItems: 13
+ maxItems: 13
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_aclk_peric1_66
+ - const: sclk_uart1
+ - const: sclk_uart2
+ - const: sclk_uart3
+ - const: sclk_spi0
+ - const: sclk_spi1
+ - const: sclk_spi2
+ - const: sclk_spi3
+ - const: sclk_spi4
+ - const: sclk_i2s1
+ - const: sclk_pcm1
+ - const: sclk_spdif
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7-clock-peris
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_aclk_peris_66
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7-clock-fsys0
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_aclk_fsys0_200
+ - const: dout_sclk_mmc2
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7-clock-fsys1
+ then:
+ properties:
+ clocks:
+ minItems: 7
+ maxItems: 7
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_aclk_fsys1_200
+ - const: dout_sclk_mmc0
+ - const: dout_sclk_mmc1
+ - const: dout_sclk_ufsunipro20
+ - const: dout_sclk_phy_fsys1
+ - const: dout_sclk_phy_fsys1_26m
+ required:
+ - clock-names
+ - clocks
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7-clock-aud
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 2
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: fout_aud_pll
+ required:
+ - clock-names
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos7-clk.h>
+
+ fin_pll: clock {
+ compatible = "fixed-clock";
+ clock-output-names = "fin_pll";
+ #clock-cells = <0>;
+ clock-frequency = <24000000>;
+ };
+
+ clock-controller@105e0000 {
+ compatible = "samsung,exynos7-clock-top1";
+ reg = <0x105e0000 0xb000>;
+ #clock-cells = <1>;
+ clocks = <&fin_pll>,
+ <&clock_topc DOUT_SCLK_BUS0_PLL>,
+ <&clock_topc DOUT_SCLK_BUS1_PLL>,
+ <&clock_topc DOUT_SCLK_CC_PLL>,
+ <&clock_topc DOUT_SCLK_MFC_PLL>;
+ clock-names = "fin_pll",
+ "dout_sclk_bus0_pll",
+ "dout_sclk_bus1_pll",
+ "dout_sclk_cc_pll",
+ "dout_sclk_mfc_pll";
+ };
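The same pattern applies to any leaf CMU; a sketch for the PERIC0 block, with the register base assumed and the macro names taken as assumptions about exynos7-clk.h:

    clock-controller@13610000 {
        compatible = "samsung,exynos7-clock-peric0";
        reg = <0x13610000 0xd000>;  /* base/size assumed */
        #clock-cells = <1>;
        clocks = <&fin_pll>,
                 <&clock_top0 DOUT_ACLK_PERIC0_66>,  /* macro name assumed */
                 <&clock_top0 SCLK_UART0>;           /* macro name assumed */
        clock-names = "fin_pll",
                      "dout_aclk_peric0_66",
                      "sclk_uart0";
    };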
diff --git a/Documentation/devicetree/bindings/clock/samsung,exynos7885-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,exynos7885-clock.yaml
new file mode 100644
index 000000000000..006d33a9e0f1
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/samsung,exynos7885-clock.yaml
@@ -0,0 +1,193 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/samsung,exynos7885-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos7885 SoC clock controller
+
+maintainers:
+ - Dávid Virág <virag.david003@gmail.com>
+ - Chanwoo Choi <cw00.choi@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+ - Sylwester Nawrocki <s.nawrocki@samsung.com>
+ - Tomasz Figa <tomasz.figa@gmail.com>
+
+description: |
+ Exynos7885 clock controller is comprised of several CMU units, generating
+ clocks for different domains. Those CMU units are modeled as separate device
+ tree nodes, and might depend on each other. The root clock in that clock tree
+ is an external clock: OSCCLK (26 MHz). This external clock must be defined
+ as a fixed-rate clock in dts.
+
+ CMU_TOP is a top-level CMU, where all base clocks are prepared using PLLs and
+ dividers; all other leaf clocks (other CMUs) are usually derived from CMU_TOP.
+
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All clocks available for usage
+ in clock consumer nodes are defined as preprocessor macros in
+ 'dt-bindings/clock/exynos7885.h' header.
+
+properties:
+ compatible:
+ enum:
+ - samsung,exynos7885-cmu-top
+ - samsung,exynos7885-cmu-core
+ - samsung,exynos7885-cmu-fsys
+ - samsung,exynos7885-cmu-peri
+
+ clocks:
+ minItems: 1
+ maxItems: 10
+
+ clock-names:
+ minItems: 1
+ maxItems: 10
+
+ "#clock-cells":
+ const: 1
+
+ reg:
+ maxItems: 1
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7885-cmu-top
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+
+ clock-names:
+ items:
+ - const: oscclk
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7885-cmu-core
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_CORE bus clock (from CMU_TOP)
+ - description: CCI clock (from CMU_TOP)
+ - description: G3D clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_core_bus
+ - const: dout_core_cci
+ - const: dout_core_g3d
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7885-cmu-fsys
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_FSYS bus clock (from CMU_TOP)
+ - description: MMC_CARD clock (from CMU_TOP)
+ - description: MMC_EMBD clock (from CMU_TOP)
+ - description: MMC_SDIO clock (from CMU_TOP)
+ - description: USB30DRD clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_fsys_bus
+ - const: dout_fsys_mmc_card
+ - const: dout_fsys_mmc_embd
+ - const: dout_fsys_mmc_sdio
+ - const: dout_fsys_usb30drd
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos7885-cmu-peri
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_PERI bus clock (from CMU_TOP)
+ - description: SPI0 clock (from CMU_TOP)
+ - description: SPI1 clock (from CMU_TOP)
+ - description: UART0 clock (from CMU_TOP)
+ - description: UART1 clock (from CMU_TOP)
+ - description: UART2 clock (from CMU_TOP)
+ - description: USI0 clock (from CMU_TOP)
+ - description: USI1 clock (from CMU_TOP)
+ - description: USI2 clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_peri_bus
+ - const: dout_peri_spi0
+ - const: dout_peri_spi1
+ - const: dout_peri_uart0
+ - const: dout_peri_uart1
+ - const: dout_peri_uart2
+ - const: dout_peri_usi0
+ - const: dout_peri_usi1
+ - const: dout_peri_usi2
+
+required:
+ - compatible
+ - "#clock-cells"
+ - clocks
+ - clock-names
+ - reg
+
+additionalProperties: false
+
+examples:
+ # Clock controller node for CMU_PERI
+ - |
+ #include <dt-bindings/clock/exynos7885.h>
+
+ cmu_peri: clock-controller@10010000 {
+ compatible = "samsung,exynos7885-cmu-peri";
+ reg = <0x10010000 0x8000>;
+ #clock-cells = <1>;
+
+ clocks = <&oscclk>,
+ <&cmu_top CLK_DOUT_PERI_BUS>,
+ <&cmu_top CLK_DOUT_PERI_SPI0>,
+ <&cmu_top CLK_DOUT_PERI_SPI1>,
+ <&cmu_top CLK_DOUT_PERI_UART0>,
+ <&cmu_top CLK_DOUT_PERI_UART1>,
+ <&cmu_top CLK_DOUT_PERI_UART2>,
+ <&cmu_top CLK_DOUT_PERI_USI0>,
+ <&cmu_top CLK_DOUT_PERI_USI1>,
+ <&cmu_top CLK_DOUT_PERI_USI2>;
+ clock-names = "oscclk",
+ "dout_peri_bus",
+ "dout_peri_spi0",
+ "dout_peri_spi1",
+ "dout_peri_uart0",
+ "dout_peri_uart1",
+ "dout_peri_uart2",
+ "dout_peri_usi0",
+ "dout_peri_usi1",
+ "dout_peri_usi2";
+ };
+
+...
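The CMU_FSYS conditional can be instantiated the same way; a sketch with an assumed register base, where the CLK_DOUT_FSYS_* macro names follow the CLK_DOUT_PERI_* pattern used above and are assumptions about exynos7885.h:

    cmu_fsys: clock-controller@13400000 {
        compatible = "samsung,exynos7885-cmu-fsys";
        reg = <0x13400000 0x8000>;  /* base assumed */
        #clock-cells = <1>;

        clocks = <&oscclk>,
                 <&cmu_top CLK_DOUT_FSYS_BUS>,       /* names assumed */
                 <&cmu_top CLK_DOUT_FSYS_MMC_CARD>,
                 <&cmu_top CLK_DOUT_FSYS_MMC_EMBD>,
                 <&cmu_top CLK_DOUT_FSYS_MMC_SDIO>,
                 <&cmu_top CLK_DOUT_FSYS_USB30DRD>;
        clock-names = "oscclk", "dout_fsys_bus",
                      "dout_fsys_mmc_card", "dout_fsys_mmc_embd",
                      "dout_fsys_mmc_sdio", "dout_fsys_usb30drd";
    };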
diff --git a/Documentation/devicetree/bindings/clock/samsung,exynos850-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,exynos850-clock.yaml
new file mode 100644
index 000000000000..c752c8985a53
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/samsung,exynos850-clock.yaml
@@ -0,0 +1,311 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/samsung,exynos850-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos850 SoC clock controller
+
+maintainers:
+ - Sam Protsenko <semen.protsenko@linaro.org>
+ - Chanwoo Choi <cw00.choi@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+ - Sylwester Nawrocki <s.nawrocki@samsung.com>
+ - Tomasz Figa <tomasz.figa@gmail.com>
+
+description: |
+ Exynos850 clock controller is comprised of several CMU units, generating
+ clocks for different domains. Those CMU units are modeled as separate device
+ tree nodes, and might depend on each other. Root clocks in that clock tree are
+ two external clocks: OSCCLK (26 MHz) and RTCCLK (32768 Hz). Those external
+ clocks must be defined as fixed-rate clocks in dts.
+
+ CMU_TOP is a top-level CMU, where all base clocks are prepared using PLLs and
+ dividers; all other leaf clocks (other CMUs) are usually derived from CMU_TOP.
+
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All clocks available for usage
+ in clock consumer nodes are defined as preprocessor macros in
+ 'dt-bindings/clock/exynos850.h' header.
+
+properties:
+ compatible:
+ enum:
+ - samsung,exynos850-cmu-top
+ - samsung,exynos850-cmu-apm
+ - samsung,exynos850-cmu-aud
+ - samsung,exynos850-cmu-cmgp
+ - samsung,exynos850-cmu-core
+ - samsung,exynos850-cmu-dpu
+ - samsung,exynos850-cmu-g3d
+ - samsung,exynos850-cmu-hsi
+ - samsung,exynos850-cmu-is
+ - samsung,exynos850-cmu-mfcmscl
+ - samsung,exynos850-cmu-peri
+
+ clocks:
+ minItems: 1
+ maxItems: 5
+
+ clock-names:
+ minItems: 1
+ maxItems: 5
+
+ "#clock-cells":
+ const: 1
+
+ reg:
+ maxItems: 1
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos850-cmu-top
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+
+ clock-names:
+ items:
+ - const: oscclk
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos850-cmu-apm
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_APM bus clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_clkcmu_apm_bus
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos850-cmu-aud
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: AUD clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_aud
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos850-cmu-cmgp
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_CMGP bus clock (from CMU_APM)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: gout_clkcmu_cmgp_bus
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos850-cmu-core
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_CORE bus clock (from CMU_TOP)
+ - description: CCI clock (from CMU_TOP)
+ - description: eMMC clock (from CMU_TOP)
+ - description: SSS clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_core_bus
+ - const: dout_core_cci
+ - const: dout_core_mmc_embd
+ - const: dout_core_sss
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos850-cmu-dpu
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: DPU clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_dpu
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos850-cmu-g3d
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: G3D clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_g3d_switch
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos850-cmu-hsi
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: External RTC clock (32768 Hz)
+ - description: CMU_HSI bus clock (from CMU_TOP)
+ - description: SD card clock (from CMU_TOP)
+ - description: USB 2.0 DRD clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: rtcclk
+ - const: dout_hsi_bus
+ - const: dout_hsi_mmc_card
+ - const: dout_hsi_usb20drd
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos850-cmu-is
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_IS bus clock (from CMU_TOP)
+ - description: Image Texture Processing core clock (from CMU_TOP)
+ - description: Visual Recognition Accelerator clock (from CMU_TOP)
+ - description: Geometric Distortion Correction clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_is_bus
+ - const: dout_is_itp
+ - const: dout_is_vra
+ - const: dout_is_gdc
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos850-cmu-mfcmscl
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: Multi-Format Codec clock (from CMU_TOP)
+ - description: Memory to Memory Scaler clock (from CMU_TOP)
+ - description: Multi-Channel Scaler clock (from CMU_TOP)
+ - description: JPEG codec clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_mfcmscl_mfc
+ - const: dout_mfcmscl_m2m
+ - const: dout_mfcmscl_mcsc
+ - const: dout_mfcmscl_jpeg
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos850-cmu-peri
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_PERI bus clock (from CMU_TOP)
+ - description: UART clock (from CMU_TOP)
+ - description: Parent clock for HSI2C and SPI (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_peri_bus
+ - const: dout_peri_uart
+ - const: dout_peri_ip
+
+required:
+ - compatible
+ - "#clock-cells"
+ - clocks
+ - clock-names
+ - reg
+
+additionalProperties: false
+
+examples:
+ # Clock controller node for CMU_PERI
+ - |
+ #include <dt-bindings/clock/exynos850.h>
+
+ cmu_peri: clock-controller@10030000 {
+ compatible = "samsung,exynos850-cmu-peri";
+ reg = <0x10030000 0x8000>;
+ #clock-cells = <1>;
+
+ clocks = <&oscclk>, <&cmu_top CLK_DOUT_PERI_BUS>,
+ <&cmu_top CLK_DOUT_PERI_UART>,
+ <&cmu_top CLK_DOUT_PERI_IP>;
+ clock-names = "oscclk", "dout_peri_bus",
+ "dout_peri_uart", "dout_peri_ip";
+ };
+
+...
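CMU_HSI is the one Exynos850 unit above that also consumes RTCCLK; a sketch with an assumed register base, where the CLK_DOUT_HSI_* macros are assumed by analogy with CLK_DOUT_PERI_*:

    cmu_hsi: clock-controller@13400000 {
        compatible = "samsung,exynos850-cmu-hsi";
        reg = <0x13400000 0x8000>;  /* base assumed */
        #clock-cells = <1>;

        clocks = <&oscclk>, <&rtcclk>,
                 <&cmu_top CLK_DOUT_HSI_BUS>,        /* names assumed */
                 <&cmu_top CLK_DOUT_HSI_MMC_CARD>,
                 <&cmu_top CLK_DOUT_HSI_USB20DRD>;
        clock-names = "oscclk", "rtcclk", "dout_hsi_bus",
                      "dout_hsi_mmc_card", "dout_hsi_usb20drd";
    };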
diff --git a/Documentation/devicetree/bindings/clock/samsung,exynosautov9-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,exynosautov9-clock.yaml
new file mode 100644
index 000000000000..55c4f94a14d1
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/samsung,exynosautov9-clock.yaml
@@ -0,0 +1,263 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/samsung,exynosautov9-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos Auto v9 SoC clock controller
+
+maintainers:
+ - Chanho Park <chanho61.park@samsung.com>
+ - Chanwoo Choi <cw00.choi@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+ - Sylwester Nawrocki <s.nawrocki@samsung.com>
+ - Tomasz Figa <tomasz.figa@gmail.com>
+
+description: |
+ Exynos Auto v9 clock controller is comprised of several CMU units, generating
+ clocks for different domains. Those CMU units are modeled as separate device
+ tree nodes, and might depend on each other. Root clocks in that clock tree are
+ two external clocks: OSCCLK/XTCXO (26 MHz) and RTCCLK/XrtcXTI (32768 Hz).
+ The external OSCCLK must be defined as a fixed-rate clock in dts.
+
+ CMU_TOP is a top-level CMU, where all base clocks are prepared using PLLs and
+ dividers; all other clocks of function blocks (other CMUs) are usually
+ derived from CMU_TOP.
+
+ Each clock is assigned an identifier and client nodes can use this identifier
+ to specify the clock which they consume. All clocks available for usage
+ in clock consumer nodes are defined as preprocessor macros in
+ 'include/dt-bindings/clock/samsung,exynosautov9.h' header.
+
+properties:
+ compatible:
+ enum:
+ - samsung,exynosautov9-cmu-top
+ - samsung,exynosautov9-cmu-busmc
+ - samsung,exynosautov9-cmu-core
+ - samsung,exynosautov9-cmu-fsys0
+ - samsung,exynosautov9-cmu-fsys1
+ - samsung,exynosautov9-cmu-fsys2
+ - samsung,exynosautov9-cmu-peric0
+ - samsung,exynosautov9-cmu-peric1
+ - samsung,exynosautov9-cmu-peris
+
+ clocks:
+ minItems: 1
+ maxItems: 5
+
+ clock-names:
+ minItems: 1
+ maxItems: 5
+
+ "#clock-cells":
+ const: 1
+
+ reg:
+ maxItems: 1
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynosautov9-cmu-top
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+
+ clock-names:
+ items:
+ - const: oscclk
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynosautov9-cmu-busmc
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_BUSMC bus clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_clkcmu_busmc_bus
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynosautov9-cmu-core
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_CORE bus clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_clkcmu_core_bus
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynosautov9-cmu-fsys0
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_FSYS0 bus clock (from CMU_TOP)
+ - description: CMU_FSYS0 pcie clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_clkcmu_fsys0_bus
+ - const: dout_clkcmu_fsys0_pcie
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynosautov9-cmu-fsys1
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_FSYS1 bus clock (from CMU_TOP)
+ - description: CMU_FSYS1 mmc card clock (from CMU_TOP)
+ - description: CMU_FSYS1 usb clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_clkcmu_fsys1_bus
+ - const: gout_clkcmu_fsys1_mmc_card
+ - const: dout_clkcmu_fsys1_usbdrd
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynosautov9-cmu-fsys2
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_FSYS2 bus clock (from CMU_TOP)
+ - description: UFS clock (from CMU_TOP)
+ - description: Ethernet clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_clkcmu_fsys2_bus
+ - const: dout_fsys2_clkcmu_ufs_embd
+ - const: dout_fsys2_clkcmu_ethernet
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynosautov9-cmu-peric0
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_PERIC0 bus clock (from CMU_TOP)
+ - description: PERIC0 IP clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_clkcmu_peric0_bus
+ - const: dout_clkcmu_peric0_ip
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynosautov9-cmu-peric1
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_PERIC1 bus clock (from CMU_TOP)
+ - description: PERIC1 IP clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_clkcmu_peric1_bus
+ - const: dout_clkcmu_peric1_ip
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynosautov9-cmu-peris
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (26 MHz)
+ - description: CMU_PERIS bus clock (from CMU_TOP)
+
+ clock-names:
+ items:
+ - const: oscclk
+ - const: dout_clkcmu_peris_bus
+
+required:
+ - compatible
+ - "#clock-cells"
+ - clocks
+ - clock-names
+ - reg
+
+additionalProperties: false
+
+examples:
+ # Clock controller node for CMU_FSYS2
+ - |
+ #include <dt-bindings/clock/samsung,exynosautov9.h>
+
+ cmu_fsys2: clock-controller@17c00000 {
+ compatible = "samsung,exynosautov9-cmu-fsys2";
+ reg = <0x17c00000 0x8000>;
+ #clock-cells = <1>;
+
+ clocks = <&xtcxo>,
+ <&cmu_top DOUT_CLKCMU_FSYS2_BUS>,
+ <&cmu_top DOUT_CLKCMU_FSYS2_UFS_EMBD>,
+ <&cmu_top DOUT_CLKCMU_FSYS2_ETHERNET>;
+ clock-names = "oscclk",
+ "dout_clkcmu_fsys2_bus",
+ "dout_fsys2_clkcmu_ufs_embd",
+ "dout_fsys2_clkcmu_ethernet";
+ };
+
+...
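Any of the other Exynos Auto v9 CMUs follows the same shape; a sketch for CMU_BUSMC, with the register base assumed and DOUT_CLKCMU_BUSMC_BUS assumed by analogy with the DOUT_CLKCMU_FSYS2_* naming seen in the example:

    cmu_busmc: clock-controller@1b200000 {
        compatible = "samsung,exynosautov9-cmu-busmc";
        reg = <0x1b200000 0x8000>;  /* base assumed */
        #clock-cells = <1>;

        clocks = <&xtcxo>,
                 <&cmu_top DOUT_CLKCMU_BUSMC_BUS>;   /* macro name assumed */
        clock-names = "oscclk", "dout_clkcmu_busmc_bus";
    };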
diff --git a/Documentation/devicetree/bindings/clock/samsung,s2mps11.txt b/Documentation/devicetree/bindings/clock/samsung,s2mps11.txt
deleted file mode 100644
index 2726c1d58a79..000000000000
--- a/Documentation/devicetree/bindings/clock/samsung,s2mps11.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-Binding for Samsung S2M and S5M family clock generator block
-============================================================
-
-This is a part of device tree bindings for S2M and S5M family multi-function
-devices.
-More information can be found in bindings/mfd/sec-core.txt file.
-
-The S2MPS11/13/15 and S5M8767 provide three(AP/CP/BT) buffered 32.768 kHz
-outputs. The S2MPS14 provides two (AP/BT) buffered 32.768 KHz outputs.
-
-To register these as clocks with common clock framework instantiate under
-main device node a sub-node named "clocks".
-
-It uses the common clock binding documented in:
- - Documentation/devicetree/bindings/clock/clock-bindings.txt
-
-
-Required properties of the "clocks" sub-node:
- - #clock-cells: should be 1.
- - compatible: Should be one of: "samsung,s2mps11-clk", "samsung,s2mps13-clk",
- "samsung,s2mps14-clk", "samsung,s5m8767-clk"
- The S2MPS15 uses the same compatible as S2MPS13, as both provides similar
- clocks.
-
-
-Each clock is assigned an identifier and client nodes use this identifier
-to specify the clock which they consume.
- Clock ID Devices
- ----------------------------------------------------------
- 32KhzAP 0 S2MPS11/13/14/15, S5M8767
- 32KhzCP 1 S2MPS11/13/15, S5M8767
- 32KhzBT 2 S2MPS11/13/14/15, S5M8767
-
-Include dt-bindings/clock/samsung,s2mps11.h file to use preprocessor defines
-in device tree sources.
-
-
-Example:
-
- s2mps11_pmic@66 {
- compatible = "samsung,s2mps11-pmic";
- reg = <0x66>;
-
- s2m_osc: clocks {
- compatible = "samsung,s2mps11-clk";
- #clock-cells = <1>;
- clock-output-names = "xx", "yy", "zz";
- };
- };
diff --git a/Documentation/devicetree/bindings/clock/samsung,s2mps11.yaml b/Documentation/devicetree/bindings/clock/samsung,s2mps11.yaml
new file mode 100644
index 000000000000..d5296e6053a1
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/samsung,s2mps11.yaml
@@ -0,0 +1,44 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/samsung,s2mps11.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung S2M and S5M family clock generator block
+
+maintainers:
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+description: |
+ This is a part of the device tree bindings for the S2M and S5M family of
+ Power Management ICs (PMIC).
+
+ The S2MPS11/13/15 and S5M8767 provide three (AP/CP/BT) buffered 32.768 kHz
+ outputs. The S2MPS14 provides two (AP/BT) buffered 32.768 kHz outputs.
+
+ All available clocks are defined as preprocessor macros in
+ dt-bindings/clock/samsung,s2mps11.h header.
+
+ See also Documentation/devicetree/bindings/mfd/samsung,s2mps11.yaml for
+ additional information and example.
+
+properties:
+ compatible:
+ enum:
+ - samsung,s2mps11-clk
+ - samsung,s2mps13-clk # S2MPS13 and S2MPS15
+ - samsung,s2mps14-clk
+ - samsung,s5m8767-clk
+
+ "#clock-cells":
+ const: 1
+
+ clock-output-names:
+ maxItems: 3
+ description: Names for AP, CP and BT clocks.
+
+required:
+ - compatible
+ - "#clock-cells"
+
+additionalProperties: false
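The converted schema carries no example of its own; for reference, the node shape from the removed txt binding still applies under the parent PMIC node (the output names are placeholders for the AP/CP/BT clocks):

    s2mps11_pmic@66 {
        compatible = "samsung,s2mps11-pmic";
        reg = <0x66>;

        s2m_osc: clocks {
            compatible = "samsung,s2mps11-clk";
            #clock-cells = <1>;
            /* placeholder names for the AP, CP and BT outputs */
            clock-output-names = "xx", "yy", "zz";
        };
    };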
diff --git a/Documentation/devicetree/bindings/clock/samsung,s3c2410-clock.txt b/Documentation/devicetree/bindings/clock/samsung,s3c2410-clock.txt
deleted file mode 100644
index 2632d3f13004..000000000000
--- a/Documentation/devicetree/bindings/clock/samsung,s3c2410-clock.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-* Samsung S3C2410 Clock Controller
-
-The S3C2410 clock controller generates and supplies clock to various controllers
-within the SoC. The clock binding described here is applicable to the s3c2410,
-s3c2440 and s3c2442 SoCs in the s3c24x family.
-
-Required Properties:
-
-- compatible: should be one of the following.
- - "samsung,s3c2410-clock" - controller compatible with S3C2410 SoC.
- - "samsung,s3c2440-clock" - controller compatible with S3C2440 SoC.
- - "samsung,s3c2442-clock" - controller compatible with S3C2442 SoC.
-- reg: physical base address of the controller and length of memory mapped
- region.
-- #clock-cells: should be 1.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. Some of the clocks are available only
-on a particular SoC.
-
-All available clocks are defined as preprocessor macros in
-dt-bindings/clock/s3c2410.h header and can be used in device
-tree sources.
-
-External clocks:
-
-The xti clock used as input for the plls is generated outside the SoC. It is
-expected that is are defined using standard clock bindings with a
-clock-output-names value of "xti".
-
-Example: Clock controller node:
-
- clocks: clock-controller@4c000000 {
- compatible = "samsung,s3c2410-clock";
- reg = <0x4c000000 0x20>;
- #clock-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller (refer to the standard clock bindings for information about
- "clocks" and "clock-names" properties):
-
- serial@50004000 {
- compatible = "samsung,s3c2440-uart";
- reg = <0x50004000 0x4000>;
- interrupts = <1 23 3 4>, <1 23 4 4>;
- clock-names = "uart", "clk_uart_baud2";
- clocks = <&clocks PCLK_UART0>, <&clocks PCLK_UART0>;
- };
diff --git a/Documentation/devicetree/bindings/clock/samsung,s3c2412-clock.txt b/Documentation/devicetree/bindings/clock/samsung,s3c2412-clock.txt
deleted file mode 100644
index 21a8c23e658f..000000000000
--- a/Documentation/devicetree/bindings/clock/samsung,s3c2412-clock.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-* Samsung S3C2412 Clock Controller
-
-The S3C2412 clock controller generates and supplies clock to various controllers
-within the SoC. The clock binding described here is applicable to the s3c2412
-and s3c2413 SoCs in the s3c24x family.
-
-Required Properties:
-
-- compatible: should be "samsung,s3c2412-clock"
-- reg: physical base address of the controller and length of memory mapped
- region.
-- #clock-cells: should be 1.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. Some of the clocks are available only
-on a particular SoC.
-
-All available clocks are defined as preprocessor macros in
-dt-bindings/clock/s3c2412.h header and can be used in device
-tree sources.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xti" - crystal input - required,
- - "ext" - external clock source - optional,
-
-Example: Clock controller node:
-
- clocks: clock-controller@4c000000 {
- compatible = "samsung,s3c2412-clock";
- reg = <0x4c000000 0x20>;
- #clock-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller (refer to the standard clock bindings for information about
- "clocks" and "clock-names" properties):
-
- serial@50004000 {
- compatible = "samsung,s3c2412-uart";
- reg = <0x50004000 0x4000>;
- interrupts = <1 23 3 4>, <1 23 4 4>;
- clock-names = "uart", "clk_uart_baud2", "clk_uart_baud3";
- clocks = <&clocks PCLK_UART0>, <&clocks PCLK_UART0>,
- <&clocks SCLK_UART>;
- };
diff --git a/Documentation/devicetree/bindings/clock/samsung,s3c2443-clock.txt b/Documentation/devicetree/bindings/clock/samsung,s3c2443-clock.txt
deleted file mode 100644
index 985c0f574e9a..000000000000
--- a/Documentation/devicetree/bindings/clock/samsung,s3c2443-clock.txt
+++ /dev/null
@@ -1,55 +0,0 @@
-* Samsung S3C2443 Clock Controller
-
-The S3C2443 clock controller generates and supplies clock to various controllers
-within the SoC. The clock binding described here is applicable to all SoCs in
-the s3c24x family starting with the s3c2443.
-
-Required Properties:
-
-- compatible: should be one of the following.
- - "samsung,s3c2416-clock" - controller compatible with S3C2416 SoC.
- - "samsung,s3c2443-clock" - controller compatible with S3C2443 SoC.
- - "samsung,s3c2450-clock" - controller compatible with S3C2450 SoC.
-- reg: physical base address of the controller and length of memory mapped
- region.
-- #clock-cells: should be 1.
-
-Each clock is assigned an identifier and client nodes can use this identifier
-to specify the clock which they consume. Some of the clocks are available only
-on a particular SoC.
-
-All available clocks are defined as preprocessor macros in
-dt-bindings/clock/s3c2443.h header and can be used in device
-tree sources.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xti" - crystal input - required,
- - "ext" - external clock source - optional,
- - "ext_i2s" - external I2S clock - optional,
- - "ext_uart" - external uart clock - optional,
-
-Example: Clock controller node:
-
- clocks: clock-controller@4c000000 {
- compatible = "samsung,s3c2416-clock";
- reg = <0x4c000000 0x40>;
- #clock-cells = <1>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller (refer to the standard clock bindings for information about
- "clocks" and "clock-names" properties):
-
- serial@50004000 {
- compatible = "samsung,s3c2440-uart";
- reg = <0x50004000 0x4000>;
- interrupts = <1 23 3 4>, <1 23 4 4>;
- clock-names = "uart", "clk_uart_baud2",
- "clk_uart_baud3";
- clocks = <&clocks PCLK_UART0>, <&clocks PCLK_UART0>,
- <&clocks SCLK_UART>;
- };
diff --git a/Documentation/devicetree/bindings/clock/samsung,s5pv210-audss-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,s5pv210-audss-clock.yaml
index ae8f8fc93233..2659854ea1c0 100644
--- a/Documentation/devicetree/bindings/clock/samsung,s5pv210-audss-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/samsung,s5pv210-audss-clock.yaml
@@ -8,7 +8,7 @@ title: Samsung S5Pv210 SoC Audio SubSystem clock controller
maintainers:
- Chanwoo Choi <cw00.choi@samsung.com>
- - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
- Sylwester Nawrocki <s.nawrocki@samsung.com>
- Tomasz Figa <tomasz.figa@gmail.com>
diff --git a/Documentation/devicetree/bindings/clock/samsung,s5pv210-clock.txt b/Documentation/devicetree/bindings/clock/samsung,s5pv210-clock.txt
deleted file mode 100644
index a86c83bf9d4e..000000000000
--- a/Documentation/devicetree/bindings/clock/samsung,s5pv210-clock.txt
+++ /dev/null
@@ -1,77 +0,0 @@
-* Samsung S5P6442/S5PC110/S5PV210 Clock Controller
-
-Samsung S5P6442, S5PC110 and S5PV210 SoCs contain integrated clock
-controller, which generates and supplies clock to various controllers
-within the SoC.
-
-Required Properties:
-
-- compatible: should be one of following:
- - "samsung,s5pv210-clock" : for clock controller of Samsung
- S5PC110/S5PV210 SoCs,
- - "samsung,s5p6442-clock" : for clock controller of Samsung
- S5P6442 SoC.
-
-- reg: physical base address of the controller and length of memory mapped
- region.
-
-- #clock-cells: should be 1.
-
-All available clocks are defined as preprocessor macros in
-dt-bindings/clock/s5pv210.h header and can be used in device tree sources.
-
-External clocks:
-
-There are several clocks that are generated outside the SoC. It is expected
-that they are defined using standard clock bindings with following
-clock-output-names:
- - "xxti": external crystal oscillator connected to XXTI and XXTO pins of
-the SoC,
- - "xusbxti": external crystal oscillator connected to XUSBXTI and XUSBXTO
-pins of the SoC,
-
-A subset of above clocks available on given board shall be specified in
-board device tree, including the system base clock, as selected by XOM[0]
-pin of the SoC. Refer to generic fixed rate clock bindings
-documentation[1] for more information how to specify these clocks.
-
-[1] Documentation/devicetree/bindings/clock/fixed-clock.yaml
-
-Example: Clock controller node:
-
- clock: clock-controller@7e00f000 {
- compatible = "samsung,s5pv210-clock";
- reg = <0x7e00f000 0x1000>;
- #clock-cells = <1>;
- };
-
-Example: Required external clocks:
-
- xxti: clock-xxti {
- compatible = "fixed-clock";
- clock-output-names = "xxti";
- clock-frequency = <24000000>;
- #clock-cells = <0>;
- };
-
- xusbxti: clock-xusbxti {
- compatible = "fixed-clock";
- clock-output-names = "xusbxti";
- clock-frequency = <24000000>;
- #clock-cells = <0>;
- };
-
-Example: UART controller node that consumes the clock generated by the clock
- controller (refer to the standard clock bindings for information about
- "clocks" and "clock-names" properties):
-
- uart0: serial@e2900000 {
- compatible = "samsung,s5pv210-uart";
- reg = <0xe2900000 0x400>;
- interrupt-parent = <&vic1>;
- interrupts = <10>;
- clock-names = "uart", "clk_uart_baud0",
- "clk_uart_baud1";
- clocks = <&clocks UART0>, <&clocks UART0>,
- <&clocks SCLK_UART0>;
- };
diff --git a/Documentation/devicetree/bindings/clock/samsung,s5pv210-clock.yaml b/Documentation/devicetree/bindings/clock/samsung,s5pv210-clock.yaml
new file mode 100644
index 000000000000..67a33665cf00
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/samsung,s5pv210-clock.yaml
@@ -0,0 +1,79 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/samsung,s5pv210-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung S5P6442/S5PC110/S5PV210 SoC clock controller
+
+maintainers:
+ - Chanwoo Choi <cw00.choi@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+ - Sylwester Nawrocki <s.nawrocki@samsung.com>
+ - Tomasz Figa <tomasz.figa@gmail.com>
+
+description: |
+ Expected external clocks, defined in DTS as fixed-rate clocks with a matching
+ name:
+ - "xxti" - external crystal oscillator connected to XXTI and XXTO pins of
+ the SoC,
+ - "xusbxti" - external crystal oscillator connected to XUSBXTI and XUSBXTO
+ pins of the SoC,
+
+ All available clocks are defined as preprocessor macros in
+ include/dt-bindings/clock/s5pv210.h header.
+
+properties:
+ compatible:
+ enum:
+ - samsung,s5pv210-clock
+ - samsung,s5p6442-clock
+
+ clocks:
+ items:
+ - description: xxti clock
+ - description: xusbxti clock
+
+ clock-names:
+ items:
+ - const: xxti
+ - const: xusbxti
+
+ "#clock-cells":
+ const: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - "#clock-cells"
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/s5pv210.h>
+
+ xxti: clock-0 {
+ compatible = "fixed-clock";
+ clock-frequency = <0>;
+ clock-output-names = "xxti";
+ #clock-cells = <0>;
+ };
+
+ xusbxti: clock-1 {
+ compatible = "fixed-clock";
+ clock-frequency = <0>;
+ clock-output-names = "xusbxti";
+ #clock-cells = <0>;
+ };
+
+ clock-controller@e0100000 {
+ compatible = "samsung,s5pv210-clock";
+ reg = <0xe0100000 0x10000>;
+ clock-names = "xxti", "xusbxti";
+ clocks = <&xxti>, <&xusbxti>;
+ #clock-cells = <1>;
+ };
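The consumer side is unchanged by the conversion; the UART example from the removed txt binding still applies, assuming the controller node above carries a `clocks` label:

    uart0: serial@e2900000 {
        compatible = "samsung,s5pv210-uart";
        reg = <0xe2900000 0x400>;
        interrupt-parent = <&vic1>;
        interrupts = <10>;
        clock-names = "uart", "clk_uart_baud0", "clk_uart_baud1";
        clocks = <&clocks UART0>, <&clocks UART0>, <&clocks SCLK_UART0>;
    };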
diff --git a/Documentation/devicetree/bindings/clock/sifive/fu540-prci.yaml b/Documentation/devicetree/bindings/clock/sifive/fu540-prci.yaml
index c3be1b600007..c79e752283aa 100644
--- a/Documentation/devicetree/bindings/clock/sifive/fu540-prci.yaml
+++ b/Documentation/devicetree/bindings/clock/sifive/fu540-prci.yaml
@@ -8,7 +8,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: SiFive FU540 Power Reset Clock Interrupt Controller (PRCI)
maintainers:
- - Sagar Kadam <sagar.kadam@sifive.com>
- Paul Walmsley <paul.walmsley@sifive.com>
description:
diff --git a/Documentation/devicetree/bindings/clock/sifive/fu740-prci.yaml b/Documentation/devicetree/bindings/clock/sifive/fu740-prci.yaml
index e17143cac316..252085a0cf65 100644
--- a/Documentation/devicetree/bindings/clock/sifive/fu740-prci.yaml
+++ b/Documentation/devicetree/bindings/clock/sifive/fu740-prci.yaml
@@ -42,6 +42,9 @@ properties:
"#clock-cells":
const: 1
+ "#reset-cells":
+ const: 1
+
required:
- compatible
- reg
@@ -57,4 +60,5 @@ examples:
reg = <0x10000000 0x1000>;
clocks = <&hfclk>, <&rtcclk>;
#clock-cells = <1>;
+ #reset-cells = <1>;
};
diff --git a/Documentation/devicetree/bindings/clock/silabs,si5351.txt b/Documentation/devicetree/bindings/clock/silabs,si5351.txt
index 8fe6f80afade..bfda6af76bee 100644
--- a/Documentation/devicetree/bindings/clock/silabs,si5351.txt
+++ b/Documentation/devicetree/bindings/clock/silabs,si5351.txt
@@ -2,7 +2,7 @@ Binding for Silicon Labs Si5351a/b/c programmable i2c clock generator.
Reference
[1] Si5351A/B/C Data Sheet
- https://www.silabs.com/Support%20Documents/TechnicalDocs/Si5351.pdf
+ https://www.skyworksinc.com/-/media/Skyworks/SL/documents/public/data-sheets/Si5351-B.pdf
The Si5351a/b/c are programmable i2c clock generators with up to 8 output
clocks. Si5351a also has a reduced pin-count package (MSOP10) where only
diff --git a/Documentation/devicetree/bindings/clock/skyworks,si521xx.yaml b/Documentation/devicetree/bindings/clock/skyworks,si521xx.yaml
new file mode 100644
index 000000000000..9e35e0e51ce8
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/skyworks,si521xx.yaml
@@ -0,0 +1,59 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/skyworks,si521xx.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Skyworks Si521xx I2C PCIe clock generators
+
+description: |
+ The Skyworks Si521xx are I2C PCIe clock generators providing
+ 4 to 9 output clocks.
+
+maintainers:
+ - Marek Vasut <marex@denx.de>
+
+properties:
+ compatible:
+ enum:
+ - skyworks,si52144
+ - skyworks,si52146
+ - skyworks,si52147
+
+ reg:
+ const: 0x6b
+
+ '#clock-cells':
+ const: 1
+
+ clocks:
+ items:
+ - description: XTal input clock
+
+ skyworks,out-amplitude-microvolt:
+ enum: [ 300000, 400000, 500000, 600000, 700000, 800000, 900000, 1000000 ]
+ description: Output clock signal amplitude
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ clock-generator@6b {
+ compatible = "skyworks,si52144";
+ reg = <0x6b>;
+ #clock-cells = <1>;
+ clocks = <&ref25m>;
+ };
+ };
+
+...
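The example references an otherwise undefined &ref25m input; a sketch of that node, with the frequency assumed from the label rather than taken from a datasheet:

    ref25m: clock-ref25m {
        compatible = "fixed-clock";
        #clock-cells = <0>;
        clock-frequency = <25000000>;  /* 25 MHz assumed from the label */
        clock-output-names = "ref25m";
    };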
diff --git a/Documentation/devicetree/bindings/clock/socionext,uniphier-clock.yaml b/Documentation/devicetree/bindings/clock/socionext,uniphier-clock.yaml
index c3930edc410f..4e82582fb2f3 100644
--- a/Documentation/devicetree/bindings/clock/socionext,uniphier-clock.yaml
+++ b/Documentation/devicetree/bindings/clock/socionext,uniphier-clock.yaml
@@ -23,6 +23,7 @@ properties:
- socionext,uniphier-ld11-clock
- socionext,uniphier-ld20-clock
- socionext,uniphier-pxs3-clock
+ - socionext,uniphier-nx1-clock
- description: Media I/O (MIO) clock, SD clock
enum:
- socionext,uniphier-ld4-mio-clock
@@ -33,6 +34,7 @@ properties:
- socionext,uniphier-ld11-mio-clock
- socionext,uniphier-ld20-sd-clock
- socionext,uniphier-pxs3-sd-clock
+ - socionext,uniphier-nx1-sd-clock
- description: Peripheral clock
enum:
- socionext,uniphier-ld4-peri-clock
@@ -43,6 +45,10 @@ properties:
- socionext,uniphier-ld11-peri-clock
- socionext,uniphier-ld20-peri-clock
- socionext,uniphier-pxs3-peri-clock
+ - socionext,uniphier-nx1-peri-clock
+ - description: SoC-glue clock
+ enum:
+ - socionext,uniphier-pro4-sg-clock
"#clock-cells":
const: 1
@@ -55,40 +61,7 @@ required:
examples:
- |
- sysctrl@61840000 {
- compatible = "socionext,uniphier-sysctrl", "simple-mfd", "syscon";
- reg = <0x61840000 0x4000>;
-
- clock {
- compatible = "socionext,uniphier-ld11-clock";
- #clock-cells = <1>;
- };
-
- // other nodes ...
- };
-
- - |
- mioctrl@59810000 {
- compatible = "socionext,uniphier-mioctrl", "simple-mfd", "syscon";
- reg = <0x59810000 0x800>;
-
- clock {
- compatible = "socionext,uniphier-ld11-mio-clock";
- #clock-cells = <1>;
- };
-
- // other nodes ...
- };
-
- - |
- perictrl@59820000 {
- compatible = "socionext,uniphier-perictrl", "simple-mfd", "syscon";
- reg = <0x59820000 0x200>;
-
- clock {
- compatible = "socionext,uniphier-ld11-peri-clock";
- #clock-cells = <1>;
- };
-
- // other nodes ...
+ clock-controller {
+ compatible = "socionext,uniphier-ld11-clock";
+ #clock-cells = <1>;
};
diff --git a/Documentation/devicetree/bindings/clock/sprd,sc9863a-clk.yaml b/Documentation/devicetree/bindings/clock/sprd,sc9863a-clk.yaml
index 47e1ab08c95d..1703e305e6d8 100644
--- a/Documentation/devicetree/bindings/clock/sprd,sc9863a-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/sprd,sc9863a-clk.yaml
@@ -2,10 +2,10 @@
# Copyright 2019 Unisoc Inc.
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/clock/sprd,sc9863a-clk.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/clock/sprd,sc9863a-clk.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: SC9863A Clock Control Unit Device Tree Bindings
+title: SC9863A Clock Control Unit
maintainers:
- Orson Zhai <orsonzhai@gmail.com>
diff --git a/Documentation/devicetree/bindings/clock/sprd,ums512-clk.yaml b/Documentation/devicetree/bindings/clock/sprd,ums512-clk.yaml
new file mode 100644
index 000000000000..43d2b6c31357
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/sprd,ums512-clk.yaml
@@ -0,0 +1,71 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright 2022 Unisoc Inc.
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/sprd,ums512-clk.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: UMS512 SoC clock controller
+
+maintainers:
+ - Orson Zhai <orsonzhai@gmail.com>
+ - Baolin Wang <baolin.wang7@gmail.com>
+ - Chunyan Zhang <zhang.lyra@gmail.com>
+
+properties:
+ compatible:
+ enum:
+ - sprd,ums512-apahb-gate
+ - sprd,ums512-ap-clk
+ - sprd,ums512-aonapb-clk
+ - sprd,ums512-pmu-gate
+ - sprd,ums512-g0-pll
+ - sprd,ums512-g2-pll
+ - sprd,ums512-g3-pll
+ - sprd,ums512-gc-pll
+ - sprd,ums512-aon-gate
+ - sprd,ums512-audcpapb-gate
+ - sprd,ums512-audcpahb-gate
+ - sprd,ums512-gpu-clk
+ - sprd,ums512-mm-clk
+ - sprd,ums512-mm-gate-clk
+ - sprd,ums512-apapb-gate
+
+ "#clock-cells":
+ const: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 4
+ description: |
+ The input parent clock(s) phandle for this clock; list only the
+ fixed clocks that are declared in the devicetree.
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: ext-26m
+ - const: ext-32k
+ - const: ext-4m
+ - const: rco-100m
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - '#clock-cells'
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ ap_clk: clock-controller@20200000 {
+ compatible = "sprd,ums512-ap-clk";
+ reg = <0x20200000 0x1000>;
+ clocks = <&ext_26m>;
+ clock-names = "ext-26m";
+ #clock-cells = <1>;
+ };
+...
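The example shows the single-parent form; a sketch of the full four-parent form permitted by clock-names, here for the AON APB unit with an assumed register base:

    aon_clk: clock-controller@32080000 {
        compatible = "sprd,ums512-aonapb-clk";
        reg = <0x32080000 0x1000>;  /* base assumed */
        clocks = <&ext_26m>, <&ext_32k>, <&ext_4m>, <&rco_100m>;
        clock-names = "ext-26m", "ext-32k", "ext-4m", "rco-100m";
        #clock-cells = <1>;
    };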
diff --git a/Documentation/devicetree/bindings/clock/st,stm32mp1-rcc.yaml b/Documentation/devicetree/bindings/clock/st,stm32mp1-rcc.yaml
index 8b1ecb2ecdd5..5194be0b410e 100644
--- a/Documentation/devicetree/bindings/clock/st,stm32mp1-rcc.yaml
+++ b/Documentation/devicetree/bindings/clock/st,stm32mp1-rcc.yaml
@@ -4,10 +4,10 @@
$id: http://devicetree.org/schemas/clock/st,stm32mp1-rcc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Reset Clock Controller Binding
+title: STMicroelectronics STM32MP1 Reset Clock Controller
maintainers:
- - Gabriel Fernandez <gabriel.fernandez@st.com>
+ - Gabriel Fernandez <gabriel.fernandez@foss.st.com>
description: |
The RCC IP is both a reset and a clock controller.
@@ -41,6 +41,7 @@ description: |
The list of valid indices for STM32MP1 is available in:
include/dt-bindings/reset-controller/stm32mp1-resets.h
+ include/dt-bindings/reset-controller/stm32mp13-resets.h
This file implements defines like:
#define LTDC_R 3072
@@ -57,7 +58,10 @@ properties:
- enum:
- st,stm32mp1-rcc-secure
- st,stm32mp1-rcc
+ - st,stm32mp13-rcc
- const: syscon
+ clocks: true
+ clock-names: true
reg:
maxItems: 1
@@ -68,14 +72,54 @@ required:
- compatible
- reg
+if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - st,stm32mp1-rcc-secure
+ - st,stm32mp13-rcc
+then:
+ properties:
+ clocks:
+ description: Specifies oscillators.
+ maxItems: 5
+
+ clock-names:
+ items:
+ - const: hse
+ - const: hsi
+ - const: csi
+ - const: lse
+ - const: lsi
+ required:
+ - clocks
+ - clock-names
+else:
+ properties:
+ clocks:
+ description:
+ Specifies the external RX clock for ethernet MAC.
+ maxItems: 1
+
+ clock-names:
+ const: ETH_RX_CLK/ETH_REF_CLK
+
additionalProperties: false
examples:
- |
+ #include <dt-bindings/clock/stm32mp1-clks.h>
rcc: rcc@50000000 {
compatible = "st,stm32mp1-rcc-secure", "syscon";
reg = <0x50000000 0x1000>;
#clock-cells = <1>;
#reset-cells = <1>;
+ clock-names = "hse", "hsi", "csi", "lse", "lsi";
+ clocks = <&scmi_clk CK_SCMI_HSE>,
+ <&scmi_clk CK_SCMI_HSI>,
+ <&scmi_clk CK_SCMI_CSI>,
+ <&scmi_clk CK_SCMI_LSE>,
+ <&scmi_clk CK_SCMI_LSI>;
};
...
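The example exercises the secure branch of the if/else above; the non-secure branch instead takes a single external ethernet reference clock. A sketch, with &eth_rx_clk as a hypothetical fixed clock feeding the MAC:

    rcc@50000000 {
        compatible = "st,stm32mp1-rcc", "syscon";
        reg = <0x50000000 0x1000>;
        #clock-cells = <1>;
        #reset-cells = <1>;
        clocks = <&eth_rx_clk>;  /* hypothetical external RX clock */
        clock-names = "ETH_RX_CLK/ETH_REF_CLK";
    };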
diff --git a/Documentation/devicetree/bindings/clock/st/st,flexgen.txt b/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
index 55a18939bddd..c918075405ba 100644
--- a/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
+++ b/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
@@ -78,7 +78,7 @@ Required properties:
- #clock-cells : from common clock binding; shall be set to 1 (multiple clock
outputs).
-- clocks : must be set to the parent's phandle. it's could be output clocks of
+- clocks : must be set to the parent's phandle. it could be output clocks of
a quadsfs or/and a pll or/and clk_sysin (up to 7 clocks)
- clock-output-names : List of strings used to name the clock outputs.
diff --git a/Documentation/devicetree/bindings/clock/starfive,jh7100-audclk.yaml b/Documentation/devicetree/bindings/clock/starfive,jh7100-audclk.yaml
new file mode 100644
index 000000000000..8f49a1ae03f1
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/starfive,jh7100-audclk.yaml
@@ -0,0 +1,57 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/starfive,jh7100-audclk.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: StarFive JH7100 Audio Clock Generator
+
+maintainers:
+ - Emil Renner Berthing <kernel@esmil.dk>
+
+properties:
+ compatible:
+ const: starfive,jh7100-audclk
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: Audio source clock
+ - description: External 12.288MHz clock
+ - description: Domain 7 AHB bus clock
+
+ clock-names:
+ items:
+ - const: audio_src
+ - const: audio_12288
+ - const: dom7ahb_bus
+
+ '#clock-cells':
+ const: 1
+ description:
+ See <dt-bindings/clock/starfive-jh7100-audio.h> for valid indices.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/starfive-jh7100.h>
+
+ clock-controller@10480000 {
+ compatible = "starfive,jh7100-audclk";
+ reg = <0x10480000 0x10000>;
+ clocks = <&clkgen JH7100_CLK_AUDIO_SRC>,
+ <&clkgen JH7100_CLK_AUDIO_12288>,
+ <&clkgen JH7100_CLK_DOM7AHB_BUS>;
+ clock-names = "audio_src", "audio_12288", "dom7ahb_bus";
+ #clock-cells = <1>;
+ };
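Consumers select one of the generated audio clocks through the single clock
cell; a hedged sketch, assuming the controller above is labelled audclk (the
consumer node, compatible and index macro are illustrative only; see
<dt-bindings/clock/starfive-jh7100-audio.h> for the real indices):

    audio-device@10400000 {
        compatible = "starfive,jh7100-audio-device"; /* hypothetical */
        reg = <0x10400000 0x1000>;
        clocks = <&audclk JH7100_AUDCLK_ADC_MCLK>;   /* illustrative index */
        clock-names = "mclk";
    };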
diff --git a/Documentation/devicetree/bindings/clock/starfive,jh7100-clkgen.yaml b/Documentation/devicetree/bindings/clock/starfive,jh7100-clkgen.yaml
new file mode 100644
index 000000000000..12f17b60ecbe
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/starfive,jh7100-clkgen.yaml
@@ -0,0 +1,56 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/starfive,jh7100-clkgen.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: StarFive JH7100 Clock Generator
+
+maintainers:
+ - Geert Uytterhoeven <geert@linux-m68k.org>
+ - Emil Renner Berthing <kernel@esmil.dk>
+
+properties:
+ compatible:
+ const: starfive,jh7100-clkgen
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: Main clock source (25 MHz)
+ - description: Application-specific clock source (12-27 MHz)
+ - description: RMII reference clock (50 MHz)
+ - description: RGMII RX clock (125 MHz)
+
+ clock-names:
+ items:
+ - const: osc_sys
+ - const: osc_aud
+ - const: gmac_rmii_ref
+ - const: gmac_gr_mii_rxclk
+
+ '#clock-cells':
+ const: 1
+ description:
+ See <dt-bindings/clock/starfive-jh7100.h> for valid indices.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ clock-controller@11800000 {
+ compatible = "starfive,jh7100-clkgen";
+ reg = <0x11800000 0x10000>;
+ clocks = <&osc_sys>, <&osc_aud>, <&gmac_rmii_ref>, <&gmac_gr_mii_rxclk>;
+ clock-names = "osc_sys", "osc_aud", "gmac_rmii_ref", "gmac_gr_mii_rxclk";
+ #clock-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/starfive,jh7110-aoncrg.yaml b/Documentation/devicetree/bindings/clock/starfive,jh7110-aoncrg.yaml
new file mode 100644
index 000000000000..923680a44aef
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/starfive,jh7110-aoncrg.yaml
@@ -0,0 +1,107 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/starfive,jh7110-aoncrg.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: StarFive JH7110 Always-On Clock and Reset Generator
+
+maintainers:
+ - Emil Renner Berthing <kernel@esmil.dk>
+
+properties:
+ compatible:
+ const: starfive,jh7110-aoncrg
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ oneOf:
+ - items:
+ - description: Main Oscillator (24 MHz)
+ - description: GMAC0 RMII reference or GMAC0 RGMII RX
+ - description: STG AXI/AHB
+ - description: APB Bus
+ - description: GMAC0 GTX
+
+ - items:
+ - description: Main Oscillator (24 MHz)
+ - description: GMAC0 RMII reference or GMAC0 RGMII RX
+ - description: STG AXI/AHB or GMAC0 RGMII RX
+ - description: APB Bus or STG AXI/AHB
+ - description: GMAC0 GTX or APB Bus
+ - description: RTC Oscillator (32.768 kHz) or GMAC0 GTX
+
+ - items:
+ - description: Main Oscillator (24 MHz)
+ - description: GMAC0 RMII reference
+ - description: GMAC0 RGMII RX
+ - description: STG AXI/AHB
+ - description: APB Bus
+ - description: GMAC0 GTX
+ - description: RTC Oscillator (32.768 kHz)
+
+ clock-names:
+ oneOf:
+ - minItems: 5
+ items:
+ - const: osc
+ - enum:
+ - gmac0_rmii_refin
+ - gmac0_rgmii_rxin
+ - const: stg_axiahb
+ - const: apb_bus
+ - const: gmac0_gtxclk
+ - const: rtc_osc
+
+ - minItems: 6
+ items:
+ - const: osc
+ - const: gmac0_rmii_refin
+ - const: gmac0_rgmii_rxin
+ - const: stg_axiahb
+ - const: apb_bus
+ - const: gmac0_gtxclk
+ - const: rtc_osc
+
+ '#clock-cells':
+ const: 1
+ description:
+ See <dt-bindings/clock/starfive,jh7110-crg.h> for valid indices.
+
+ '#reset-cells':
+ const: 1
+ description:
+ See <dt-bindings/reset/starfive,jh7110-crg.h> for valid indices.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+ - '#reset-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/starfive,jh7110-crg.h>
+
+ clock-controller@17000000 {
+ compatible = "starfive,jh7110-aoncrg";
+ reg = <0x17000000 0x10000>;
+ clocks = <&osc>, <&gmac0_rmii_refin>,
+ <&gmac0_rgmii_rxin>,
+ <&syscrg JH7110_SYSCLK_STG_AXIAHB>,
+ <&syscrg JH7110_SYSCLK_APB_BUS>,
+ <&syscrg JH7110_SYSCLK_GMAC0_GTXCLK>,
+ <&rtc_osc>;
+ clock-names = "osc", "gmac0_rmii_refin",
+ "gmac0_rgmii_rxin", "stg_axiahb",
+ "apb_bus", "gmac0_gtxclk",
+ "rtc_osc";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
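A peripheral in the always-on domain then consumes both the clock and the reset
cells; a sketch modelled on the JH7110 GMAC0, assuming the controller above is
labelled aoncrg (the macro names are taken, as an assumption, from the
starfive,jh7110-crg.h clock and reset headers):

    ethernet@16030000 {
        compatible = "starfive,jh7110-dwmac", "snps,dwmac-5.20";
        reg = <0x16030000 0x10000>;
        clocks = <&aoncrg JH7110_AONCLK_GMAC0_AXI>,
                 <&aoncrg JH7110_AONCLK_GMAC0_AHB>;
        clock-names = "stmmaceth", "pclk";
        resets = <&aoncrg JH7110_AONRST_GMAC0_AXI>,
                 <&aoncrg JH7110_AONRST_GMAC0_AHB>;
        reset-names = "stmmaceth", "ahb";
    };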
diff --git a/Documentation/devicetree/bindings/clock/starfive,jh7110-syscrg.yaml b/Documentation/devicetree/bindings/clock/starfive,jh7110-syscrg.yaml
new file mode 100644
index 000000000000..84373ae31644
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/starfive,jh7110-syscrg.yaml
@@ -0,0 +1,104 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/starfive,jh7110-syscrg.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: StarFive JH7110 System Clock and Reset Generator
+
+maintainers:
+ - Emil Renner Berthing <kernel@esmil.dk>
+
+properties:
+ compatible:
+ const: starfive,jh7110-syscrg
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ oneOf:
+ - items:
+ - description: Main Oscillator (24 MHz)
+ - description: GMAC1 RMII reference or GMAC1 RGMII RX
+ - description: External I2S TX bit clock
+ - description: External I2S TX left/right channel clock
+ - description: External I2S RX bit clock
+ - description: External I2S RX left/right channel clock
+ - description: External TDM clock
+ - description: External audio master clock
+
+ - items:
+ - description: Main Oscillator (24 MHz)
+ - description: GMAC1 RMII reference
+ - description: GMAC1 RGMII RX
+ - description: External I2S TX bit clock
+ - description: External I2S TX left/right channel clock
+ - description: External I2S RX bit clock
+ - description: External I2S RX left/right channel clock
+ - description: External TDM clock
+ - description: External audio master clock
+
+ clock-names:
+ oneOf:
+ - items:
+ - const: osc
+ - enum:
+ - gmac1_rmii_refin
+ - gmac1_rgmii_rxin
+ - const: i2stx_bclk_ext
+ - const: i2stx_lrck_ext
+ - const: i2srx_bclk_ext
+ - const: i2srx_lrck_ext
+ - const: tdm_ext
+ - const: mclk_ext
+
+ - items:
+ - const: osc
+ - const: gmac1_rmii_refin
+ - const: gmac1_rgmii_rxin
+ - const: i2stx_bclk_ext
+ - const: i2stx_lrck_ext
+ - const: i2srx_bclk_ext
+ - const: i2srx_lrck_ext
+ - const: tdm_ext
+ - const: mclk_ext
+
+ '#clock-cells':
+ const: 1
+ description:
+ See <dt-bindings/clock/starfive,jh7110-crg.h> for valid indices.
+
+ '#reset-cells':
+ const: 1
+ description:
+ See <dt-bindings/reset/starfive,jh7110-crg.h> for valid indices.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#clock-cells'
+ - '#reset-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ clock-controller@13020000 {
+ compatible = "starfive,jh7110-syscrg";
+ reg = <0x13020000 0x10000>;
+ clocks = <&osc>, <&gmac1_rmii_refin>,
+ <&gmac1_rgmii_rxin>,
+ <&i2stx_bclk_ext>, <&i2stx_lrck_ext>,
+ <&i2srx_bclk_ext>, <&i2srx_lrck_ext>,
+ <&tdm_ext>, <&mclk_ext>;
+ clock-names = "osc", "gmac1_rmii_refin",
+ "gmac1_rgmii_rxin",
+ "i2stx_bclk_ext", "i2stx_lrck_ext",
+ "i2srx_bclk_ext", "i2srx_lrck_ext",
+ "tdm_ext", "mclk_ext";
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/stericsson,u8500-clks.yaml b/Documentation/devicetree/bindings/clock/stericsson,u8500-clks.yaml
new file mode 100644
index 000000000000..2150307219a0
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/stericsson,u8500-clks.yaml
@@ -0,0 +1,178 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/stericsson,u8500-clks.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ST-Ericsson DB8500 (U8500) clocks
+
+maintainers:
+ - Ulf Hansson <ulf.hansson@linaro.org>
+ - Linus Walleij <linus.walleij@linaro.org>
+
+description: While named "U8500 clocks" these clocks are inside the
+ DB8500 digital baseband system-on-chip and its siblings such as
+ DB8520. These bindings consider the clocks present in the SoC
+ itself, not off-chip clocks. There are four different on-chip
+ clocks - RTC (32 kHz), CPU clock (SMP TWD), PRCMU (power reset and
+ control management unit) clocks and PRCC (peripheral reset and
+ clock controller) clocks. For some reason PRCC 4 does not exist so
+ the itemization can be a bit unintuitive.
+
+properties:
+ compatible:
+ enum:
+ - stericsson,u8500-clks
+ - stericsson,u8540-clks
+ - stericsson,u9540-clks
+
+ reg:
+ items:
+ - description: PRCC 1 register area
+ - description: PRCC 2 register area
+ - description: PRCC 3 register area
+ - description: PRCC 5 register area
+ - description: PRCC 6 register area
+
+ prcmu-clock:
+ description: A subnode with one clock cell for PRCMU (power, reset, control
+ management unit) clocks. The cell indicates which PRCMU clock in the
+ prcmu-clock node the consumer wants to use.
+ type: object
+
+ properties:
+ '#clock-cells':
+ const: 1
+
+ additionalProperties: false
+
+ prcc-periph-clock:
+ description: A subnode with two clock cells for PRCC (peripheral
+ reset and clock controller) peripheral clocks. The first cell indicates
+ which PRCC block the consumer wants to use, possible values are 1, 2, 3,
+ 5, 6. The second cell indicates which clock inside the PRCC block it
+ wants, possible values are 0 thru 31.
+ type: object
+
+ properties:
+ '#clock-cells':
+ const: 2
+
+ additionalProperties: false
+
+ prcc-kernel-clock:
+ description: A subnode with two clock cells for PRCC (peripheral reset
+ and clock controller) kernel clocks. The first cell indicates which PRCC
+ block the consumer wants to use, possible values are 1, 2, 3, 5, 6. The
+ second cell indicates which clock inside the PRCC block it wants, possible
+ values are 0 thru 31.
+ type: object
+
+ properties:
+ '#clock-cells':
+ const: 2
+
+ additionalProperties: false
+
+ prcc-reset-controller:
+ description: A subnode with two reset cells for the reset portions of the
+ PRCC (peripheral reset and clock controller). The first cell indicates
+ which PRCC block the consumer wants to use, possible values are 1, 2, 3,
+ 5 and 6. The second cell indicates which reset line inside the PRCC block
+ it wants to control, possible values are 0 thru 31.
+ type: object
+
+ properties:
+ '#reset-cells':
+ const: 2
+
+ additionalProperties: false
+
+ rtc32k-clock:
+ description: A subnode with zero clock cells for the 32kHz RTC clock.
+ type: object
+
+ properties:
+ '#clock-cells':
+ const: 0
+
+ additionalProperties: false
+
+ smp-twd-clock:
+ description: A subnode for the ARM SMP Timer Watchdog cluster with zero
+ clock cells.
+ type: object
+
+ properties:
+ '#clock-cells':
+ const: 0
+
+ additionalProperties: false
+
+ clkout-clock:
+ description: A subnode with three clock cells for externally routed
+ (output) clocks. These are two PRCMU-internal clocks that can be divided and
+ muxed out on the pads of the DB8500 SoC.
+ type: object
+
+ properties:
+ '#clock-cells':
+ description:
+ The first cell indicates which output clock we are using,
+ possible values are 0 (CLKOUT1) and 1 (CLKOUT2).
+ The second cell indicates which clock we want to use as source,
+ possible values are 0 thru 7, see the defines for the different
+ source clocks.
+ The third cell is a divider, legal values are 1 thru 63.
+ const: 3
+
+ additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - prcmu-clock
+ - prcc-periph-clock
+ - prcc-kernel-clock
+ - rtc32k-clock
+ - smp-twd-clock
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/ste-db8500-clkout.h>
clocks@8012f000 {
+ compatible = "stericsson,u8500-clks";
+ reg = <0x8012f000 0x1000>, <0x8011f000 0x1000>,
+ <0x8000f000 0x1000>, <0xa03ff000 0x1000>,
+ <0xa03cf000 0x1000>;
+
+ prcmu_clk: prcmu-clock {
+ #clock-cells = <1>;
+ };
+
+ prcc_pclk: prcc-periph-clock {
+ #clock-cells = <2>;
+ };
+
+ prcc_kclk: prcc-kernel-clock {
+ #clock-cells = <2>;
+ };
+
+ prcc_reset: prcc-reset-controller {
+ #reset-cells = <2>;
+ };
+
+ rtc_clk: rtc32k-clock {
+ #clock-cells = <0>;
+ };
+
+ smp_twd_clk: smp-twd-clock {
+ #clock-cells = <0>;
+ };
+
+ clkout_clk: clkout-clock {
+ #clock-cells = <3>;
+ };
+ };
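Consumers address PRCC clocks with the two-cell scheme: first the PRCC block
number (1, 2, 3, 5 or 6), then the clock bit within the block. A sketch
following the DB8500 UART convention (register address and bit values are
illustrative):

    serial@80120000 {
        compatible = "arm,pl011", "arm,primecell";
        reg = <0x80120000 0x1000>;
        clocks = <&prcc_kclk 1 0>, <&prcc_pclk 1 0>;
        clock-names = "uart", "apb_pclk";
    };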
diff --git a/Documentation/devicetree/bindings/clock/sunplus,sp7021-clkc.yaml b/Documentation/devicetree/bindings/clock/sunplus,sp7021-clkc.yaml
new file mode 100644
index 000000000000..bcc14088220a
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/sunplus,sp7021-clkc.yaml
@@ -0,0 +1,52 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright (C) Sunplus Co., Ltd. 2021
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/sunplus,sp7021-clkc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Sunplus SP7021 SoC Clock Controller
+
+maintainers:
+ - Qin Jian <qinjian@cqplus1.com>
+
+properties:
+ compatible:
+ const: sunplus,sp7021-clkc
+
+ reg:
+ maxItems: 3
+
+ clocks:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - "#clock-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ extclk: osc0 {
+ compatible = "fixed-clock";
+ #clock-cells = <0>;
+ clock-frequency = <27000000>;
+ clock-output-names = "extclk";
+ };
+
+ clkc: clock-controller@9c000004 {
+ compatible = "sunplus,sp7021-clkc";
+ reg = <0x9c000004 0x28>,
+ <0x9c000200 0x44>,
+ <0x9c000268 0x08>;
+ clocks = <&extclk>;
+ #clock-cells = <1>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/clock/tesla,fsd-clock.yaml b/Documentation/devicetree/bindings/clock/tesla,fsd-clock.yaml
new file mode 100644
index 000000000000..dc808e2f8327
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/tesla,fsd-clock.yaml
@@ -0,0 +1,198 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/tesla,fsd-clock.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Tesla FSD (Full Self-Driving) SoC clock controller
+
+maintainers:
+ - Alim Akhtar <alim.akhtar@samsung.com>
+ - linux-fsd@tesla.com
+
+description: |
+ The FSD clock controller consists of several clock management units
+ (CMUs), which generate clocks for various internal SoC blocks.
+ The root clock comes from external OSC clock (24 MHz).
+
+ All available clocks are defined as preprocessor macros in
+ 'dt-bindings/clock/fsd-clk.h' header.
+
+properties:
+ compatible:
+ enum:
+ - tesla,fsd-clock-cmu
+ - tesla,fsd-clock-imem
+ - tesla,fsd-clock-peric
+ - tesla,fsd-clock-fsys0
+ - tesla,fsd-clock-fsys1
+ - tesla,fsd-clock-mfc
+ - tesla,fsd-clock-cam_csi
+
+ clocks:
+ minItems: 1
+ maxItems: 6
+
+ clock-names:
+ minItems: 1
+ maxItems: 6
+
+ "#clock-cells":
+ const: 1
+
+ reg:
+ maxItems: 1
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: tesla,fsd-clock-cmu
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (24 MHz)
+ clock-names:
+ items:
+ - const: fin_pll
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: tesla,fsd-clock-imem
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (24 MHz)
+ - description: IMEM TCU clock (from CMU_CMU)
+ - description: IMEM bus clock (from CMU_CMU)
+ - description: IMEM DMA clock (from CMU_CMU)
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_cmu_imem_tcuclk
+ - const: dout_cmu_imem_aclk
+ - const: dout_cmu_imem_dmaclk
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: tesla,fsd-clock-peric
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (24 MHz)
+ - description: Shared0 PLL div4 clock (from CMU_CMU)
+ - description: PERIC shared1 div36 clock (from CMU_CMU)
+ - description: PERIC shared0 div3 TBU clock (from CMU_CMU)
+ - description: PERIC shared0 div20 clock (from CMU_CMU)
+ - description: PERIC shared1 div4 DMAclock (from CMU_CMU)
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_cmu_pll_shared0_div4
+ - const: dout_cmu_peric_shared1div36
+ - const: dout_cmu_peric_shared0div3_tbuclk
+ - const: dout_cmu_peric_shared0div20
+ - const: dout_cmu_peric_shared1div4_dmaclk
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: tesla,fsd-clock-fsys0
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (24 MHz)
+ - description: Shared0 PLL div6 clock (from CMU_CMU)
+ - description: FSYS0 shared1 div4 clock (from CMU_CMU)
+ - description: FSYS0 shared0 div4 clock (from CMU_CMU)
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_cmu_pll_shared0_div6
+ - const: dout_cmu_fsys0_shared1div4
+ - const: dout_cmu_fsys0_shared0div4
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: tesla,fsd-clock-fsys1
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (24 MHz)
+ - description: FSYS1 shared0 div8 clock (from CMU_CMU)
+ - description: FSYS1 shared0 div4 clock (from CMU_CMU)
+ clock-names:
+ items:
+ - const: fin_pll
+ - const: dout_cmu_fsys1_shared0div8
+ - const: dout_cmu_fsys1_shared0div4
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: tesla,fsd-clock-mfc
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (24 MHz)
+ clock-names:
+ items:
+ - const: fin_pll
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: tesla,fsd-clock-cam_csi
+ then:
+ properties:
+ clocks:
+ items:
+ - description: External reference clock (24 MHz)
+ clock-names:
+ items:
+ - const: fin_pll
+
+required:
+ - compatible
+ - "#clock-cells"
+ - clocks
+ - clock-names
+ - reg
+
+additionalProperties: false
+
+examples:
+ # Clock controller node for CMU_FSYS1
+ - |
+ #include <dt-bindings/clock/fsd-clk.h>
+
+ clock_fsys1: clock-controller@16810000 {
+ compatible = "tesla,fsd-clock-fsys1";
+ reg = <0x16810000 0x3000>;
+ #clock-cells = <1>;
+
+ clocks = <&fin_pll>,
+ <&clock_cmu DOUT_CMU_FSYS1_SHARED0DIV8>,
+ <&clock_cmu DOUT_CMU_FSYS1_SHARED0DIV4>;
+ clock-names = "fin_pll",
+ "dout_cmu_fsys1_shared0div8",
+ "dout_cmu_fsys1_shared0div4";
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/clock/ti,am654-ehrpwm-tbclk.yaml b/Documentation/devicetree/bindings/clock/ti,am654-ehrpwm-tbclk.yaml
index 9b537bc876b5..66765116aff5 100644
--- a/Documentation/devicetree/bindings/clock/ti,am654-ehrpwm-tbclk.yaml
+++ b/Documentation/devicetree/bindings/clock/ti,am654-ehrpwm-tbclk.yaml
@@ -15,6 +15,7 @@ properties:
- enum:
- ti,am654-ehrpwm-tbclk
- ti,am64-epwm-tbclk
+ - ti,am62-epwm-tbclk
- const: syscon
"#clock-cells":
diff --git a/Documentation/devicetree/bindings/clock/ti,cdce925.txt b/Documentation/devicetree/bindings/clock/ti,cdce925.txt
deleted file mode 100644
index df42ab72718f..000000000000
--- a/Documentation/devicetree/bindings/clock/ti,cdce925.txt
+++ /dev/null
@@ -1,53 +0,0 @@
-Binding for TI CDCE913/925/937/949 programmable I2C clock synthesizers.
-
-Reference
-This binding uses the common clock binding[1].
-
-[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
-[2] https://www.ti.com/product/cdce913
-[3] https://www.ti.com/product/cdce925
-[4] https://www.ti.com/product/cdce937
-[5] https://www.ti.com/product/cdce949
-
-The driver provides clock sources for each output Y1 through Y5.
-
-Required properties:
- - compatible: Shall be one of the following:
- - "ti,cdce913": 1-PLL, 3 Outputs
- - "ti,cdce925": 2-PLL, 5 Outputs
- - "ti,cdce937": 3-PLL, 7 Outputs
- - "ti,cdce949": 4-PLL, 9 Outputs
- - reg: I2C device address.
- - clocks: Points to a fixed parent clock that provides the input frequency.
- - #clock-cells: From common clock bindings: Shall be 1.
-
-Optional properties:
- - xtal-load-pf: Crystal load-capacitor value to fine-tune performance on a
- board, or to compensate for external influences.
-- vdd-supply: A regulator node for Vdd
-- vddout-supply: A regulator node for Vddout
-
-For all PLL1, PLL2, ... an optional child node can be used to specify spread
-spectrum clocking parameters for a board.
- - spread-spectrum: SSC mode as defined in the data sheet.
- - spread-spectrum-center: Use "centered" mode instead of "max" mode. When
- present, the clock runs at the requested frequency on average. Otherwise
- the requested frequency is the maximum value of the SCC range.
-
-
-Example:
-
- clockgen: cdce925pw@64 {
- compatible = "cdce925";
- reg = <0x64>;
- clocks = <&xtal_27Mhz>;
- #clock-cells = <1>;
- xtal-load-pf = <5>;
- vdd-supply = <&1v8-reg>;
- vddout-supply = <&3v3-reg>;
- /* PLL options to get SSC 1% centered */
- PLL2 {
- spread-spectrum = <4>;
- spread-spectrum-center;
- };
- };
diff --git a/Documentation/devicetree/bindings/clock/ti,cdce925.yaml b/Documentation/devicetree/bindings/clock/ti,cdce925.yaml
new file mode 100644
index 000000000000..a4ec8dd5ddf1
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/ti,cdce925.yaml
@@ -0,0 +1,103 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/ti,cdce925.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TI CDCE913/925/937/949 programmable I2C clock synthesizers
+
+maintainers:
+ - Alexander Stein <alexander.stein@ew.tq-group.com>
+
+description: |
+ Flexible Low Power LVCMOS Clock Generator with SSC Support for EMI Reduction
+
+ - CDCE(L)913: 1-PLL, 3 Outputs https://www.ti.com/product/cdce913
+ - CDCE(L)925: 2-PLL, 5 Outputs https://www.ti.com/product/cdce925
+ - CDCE(L)937: 3-PLL, 7 Outputs https://www.ti.com/product/cdce937
+ - CDCE(L)949: 4-PLL, 9 Outputs https://www.ti.com/product/cdce949
+
+properties:
+ compatible:
+ enum:
+ - ti,cdce913
+ - ti,cdce925
+ - ti,cdce937
+ - ti,cdce949
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: fixed parent clock
+
+ "#clock-cells":
+ const: 1
+
+ vdd-supply:
+ description: Regulator that provides 1.8V Vdd power supply
+
+ vddout-supply:
+ description: |
+ Regulator that provides Vddout power supply.
+ 2.5V or 3.3V for the non-L variant
+ 1.8V for the L variant
+
+ xtal-load-pf:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: |
+ Crystal load-capacitor value to fine-tune performance on a
+ board, or to compensate for external influences.
+
+patternProperties:
+ "^PLL[1-4]$":
+ type: object
+ description: |
+ An optional child node can be used to specify spread
+ spectrum clocking parameters for a board
+
+ additionalProperties: false
+
+ properties:
+ spread-spectrum:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: SSC mode as defined in the data sheet
+
+ spread-spectrum-center:
+ type: boolean
+ description: |
+ Use "centered" mode instead of "max" mode. When
+ present, the clock runs at the requested frequency on average.
+ Otherwise the requested frequency is the maximum value of the
+ SSC range.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - "#clock-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ cdce925: clock-controller@64 {
+ compatible = "ti,cdce925";
+ reg = <0x64>;
+ clocks = <&xtal_27Mhz>;
+ #clock-cells = <1>;
+ xtal-load-pf = <5>;
+ vdd-supply = <&reg_1v8>;
+ vddout-supply = <&reg_3v3>;
+ /* PLL options to get SSC 1% centered */
+ PLL2 {
+ spread-spectrum = <4>;
+ spread-spectrum-center;
+ };
+ };
+ };
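Each output Y1..Yn is then addressed through the single clock cell of the
synthesizer node; a hedged consumer sketch (the zero-based index-to-output
mapping is an assumption, and the consumer node is hypothetical):

    codec@1a {
        compatible = "acme,example-codec"; /* hypothetical consumer */
        reg = <0x1a>;
        clocks = <&cdce925 0>;             /* assumed to select output Y1 */
    };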
diff --git a/Documentation/devicetree/bindings/clock/ti,lmk04832.yaml b/Documentation/devicetree/bindings/clock/ti,lmk04832.yaml
index bd8173848253..13d7b3d03d84 100644
--- a/Documentation/devicetree/bindings/clock/ti,lmk04832.yaml
+++ b/Documentation/devicetree/bindings/clock/ti,lmk04832.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/ti,lmk04832.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Clock bindings for the Texas Instruments LMK04832
+title: Texas Instruments LMK04832 Clock Controller
maintainers:
- Liam Beguin <liambeguin@gmail.com>
@@ -160,7 +160,7 @@ examples:
};
};
- spi0 {
+ spi {
#address-cells = <1>;
#size-cells = <0>;
diff --git a/Documentation/devicetree/bindings/clock/ti,sci-clk.yaml b/Documentation/devicetree/bindings/clock/ti,sci-clk.yaml
index 0e370289a053..63d976341696 100644
--- a/Documentation/devicetree/bindings/clock/ti,sci-clk.yaml
+++ b/Documentation/devicetree/bindings/clock/ti,sci-clk.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/clock/ti,sci-clk.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: TI-SCI clock controller node bindings
+title: TI-SCI clock controller
maintainers:
- Nishanth Menon <nm@ti.com>
diff --git a/Documentation/devicetree/bindings/clock/ti-clkctrl.txt b/Documentation/devicetree/bindings/clock/ti-clkctrl.txt
index 18af6b9409e3..d20db7974a38 100644
--- a/Documentation/devicetree/bindings/clock/ti-clkctrl.txt
+++ b/Documentation/devicetree/bindings/clock/ti-clkctrl.txt
@@ -21,6 +21,7 @@ Required properties :
"ti,clkctrl-l4-per"
"ti,clkctrl-l4-secure"
"ti,clkctrl-l4-wkup"
+- clock-output-names : from common clock binding
- #clock-cells : shall contain 2 with the first entry being the instance
offset from the clock domain base and the second being the
clock index
@@ -32,7 +33,8 @@ Example: Clock controller node on omap 4430:
l4per: cm@1400 {
cm_l4per@0 {
cm_l4per_clkctrl: clock@20 {
- compatible = "ti,clkctrl-l4-per", "ti,clkctrl";
+ compatible = "ti,clkctrl";
+ clock-output-names = "l4_per";
reg = <0x20 0x1b0>;
#clock-cells = <2>;
};
diff --git a/Documentation/devicetree/bindings/clock/ti/clockdomain.txt b/Documentation/devicetree/bindings/clock/ti/clockdomain.txt
index cb76b3f2b341..9c6199249ce5 100644
--- a/Documentation/devicetree/bindings/clock/ti/clockdomain.txt
+++ b/Documentation/devicetree/bindings/clock/ti/clockdomain.txt
@@ -17,6 +17,9 @@ Required properties:
- #clock-cells : from common clock binding; shall be set to 0.
- clocks : link phandles of clocks within this domain
+Optional properties:
+- clock-output-names : from common clock binding.
+
Examples:
dss_clkdm: dss_clkdm {
compatible = "ti,clockdomain";
diff --git a/Documentation/devicetree/bindings/clock/ti/composite.txt b/Documentation/devicetree/bindings/clock/ti/composite.txt
index 5f43c4706b09..33ac7c9ad053 100644
--- a/Documentation/devicetree/bindings/clock/ti/composite.txt
+++ b/Documentation/devicetree/bindings/clock/ti/composite.txt
@@ -27,6 +27,9 @@ Required properties:
- clocks : link phandles of component clocks
- #clock-cells : from common clock binding; shall be set to 0.
+Optional properties:
+- clock-output-names : from common clock binding.
+
Examples:
usb_l4_gate_ick: usb_l4_gate_ick {
diff --git a/Documentation/devicetree/bindings/clock/ti/davinci/pll.txt b/Documentation/devicetree/bindings/clock/ti/davinci/pll.txt
index 36998e184821..c9894538315b 100644
--- a/Documentation/devicetree/bindings/clock/ti/davinci/pll.txt
+++ b/Documentation/devicetree/bindings/clock/ti/davinci/pll.txt
@@ -15,7 +15,7 @@ Required properties:
- for "ti,da850-pll1", shall be "clksrc"
Optional properties:
-- ti,clkmode-square-wave: Indicates that the the board is supplying a square
+- ti,clkmode-square-wave: Indicates that the board is supplying a square
wave input on the OSCIN pin instead of using a crystal oscillator.
This property is only valid when compatible = "ti,da850-pll0".
diff --git a/Documentation/devicetree/bindings/clock/ti/dra7-atl.txt b/Documentation/devicetree/bindings/clock/ti/dra7-atl.txt
index 21c002d28b9b..68504079f99f 100644
--- a/Documentation/devicetree/bindings/clock/ti/dra7-atl.txt
+++ b/Documentation/devicetree/bindings/clock/ti/dra7-atl.txt
@@ -6,7 +6,7 @@ functional clock but can be configured to provide different clocks.
ATL can maintain a clock averages to some desired frequency based on the bws/aws
signals - can compensate the drift between the two ws signal.
-In order to provide the support for ATL and it's output clocks (which can be used
+In order to provide the support for ATL and its output clocks (which can be used
internally within the SoC or external components) two sets of bindings is needed:
Clock tree binding:
diff --git a/Documentation/devicetree/bindings/clock/ti/fixed-factor-clock.txt b/Documentation/devicetree/bindings/clock/ti/fixed-factor-clock.txt
index 662b36d53bf0..518e3c142276 100644
--- a/Documentation/devicetree/bindings/clock/ti/fixed-factor-clock.txt
+++ b/Documentation/devicetree/bindings/clock/ti/fixed-factor-clock.txt
@@ -16,6 +16,7 @@ Required properties:
- clocks: parent clock.
Optional properties:
+- clock-output-names : from common clock binding.
- ti,autoidle-shift: bit shift of the autoidle enable bit for the clock,
see [2]
- reg: offset for the autoidle register of this clock, see [2]
diff --git a/Documentation/devicetree/bindings/clock/ti/gate.txt b/Documentation/devicetree/bindings/clock/ti/gate.txt
index 56d603c1f716..4982615c01b9 100644
--- a/Documentation/devicetree/bindings/clock/ti/gate.txt
+++ b/Documentation/devicetree/bindings/clock/ti/gate.txt
@@ -10,7 +10,7 @@ will be controlled instead and the corresponding hw-ops for
that is used.
[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
-[2] Documentation/devicetree/bindings/clock/gpio-gate-clock.txt
+[2] Documentation/devicetree/bindings/clock/gpio-gate-clock.yaml
[3] Documentation/devicetree/bindings/clock/ti/clockdomain.txt
Required properties:
@@ -36,6 +36,7 @@ Required properties:
ti,clkdm-gate-clock type
Optional properties:
+- clock-output-names : from common clock binding.
- ti,bit-shift : bit shift for programming the clock gate, invalid for
ti,clkdm-gate-clock type
- ti,set-bit-to-disable : inverts default gate programming. Setting the bit
diff --git a/Documentation/devicetree/bindings/clock/ti/interface.txt b/Documentation/devicetree/bindings/clock/ti/interface.txt
index 3f4704040140..d3eb5ca92a7f 100644
--- a/Documentation/devicetree/bindings/clock/ti/interface.txt
+++ b/Documentation/devicetree/bindings/clock/ti/interface.txt
@@ -9,7 +9,7 @@ companion clock finding (match corresponding functional gate
clock) and hardware autoidle enable / disable.
[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
-[2] Documentation/devicetree/bindings/clock/gpio-gate-clock.txt
+[2] Documentation/devicetree/bindings/clock/gpio-gate-clock.yaml
Required properties:
- compatible : shall be one of:
@@ -28,6 +28,7 @@ Required properties:
- reg : base address for the control register
Optional properties:
+- clock-output-names : from common clock binding.
- ti,bit-shift : bit shift for the bit enabling/disabling the clock (default 0)
Examples:
diff --git a/Documentation/devicetree/bindings/clock/ti/mux.txt b/Documentation/devicetree/bindings/clock/ti/mux.txt
index eec8994b9be8..e17425a58621 100644
--- a/Documentation/devicetree/bindings/clock/ti/mux.txt
+++ b/Documentation/devicetree/bindings/clock/ti/mux.txt
@@ -42,6 +42,7 @@ Required properties:
- reg : register offset for register controlling adjustable mux
Optional properties:
+- clock-output-names : from common clock binding.
- ti,bit-shift : number of bits to shift the bit-mask, defaults to
0 if not present
- ti,index-starts-at-one : valid input select programming starts at 1, not
diff --git a/Documentation/devicetree/bindings/clock/ti/ti,clksel.yaml b/Documentation/devicetree/bindings/clock/ti/ti,clksel.yaml
new file mode 100644
index 000000000000..d525f96cf244
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/ti/ti,clksel.yaml
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/ti/ti,clksel.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TI clksel clock
+
+maintainers:
+ - Tony Lindgren <tony@atomide.com>
+
+description: |
+ The TI CLKSEL clocks consist of input clock mux bits, and in some
+ cases also have divider, multiplier and gate bits.
+
+properties:
+ compatible:
+ const: ti,clksel
+
+ reg:
+ maxItems: 1
+ description: The CLKSEL register range
+
+ '#address-cells':
+ enum: [ 0, 1, 2 ]
+
+ '#size-cells':
+ enum: [ 0, 1, 2 ]
+
+ ranges: true
+
+ "#clock-cells":
+ const: 2
+ description: The CLKSEL register and bit offset
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+
+additionalProperties:
+ type: object
+
+examples:
+ - |
+ clksel_gfx_fclk: clock@52c {
+ compatible = "ti,clksel";
+ reg = <0x52c 0x4>;
+ #clock-cells = <2>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/toshiba,tmpv770x-pipllct.yaml b/Documentation/devicetree/bindings/clock/toshiba,tmpv770x-pipllct.yaml
new file mode 100644
index 000000000000..d36558aa39f3
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/toshiba,tmpv770x-pipllct.yaml
@@ -0,0 +1,57 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/toshiba,tmpv770x-pipllct.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Toshiba Visconti5 TMPV770X PLL Controller
+
+maintainers:
+ - Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
+
+description:
+ Toshiba Visconti5 PLL controller which supports the PLLs on TMPV770X.
+
+properties:
+ compatible:
+ const: toshiba,tmpv7708-pipllct
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+ clocks:
+ description: External reference clock (OSC2)
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+
+ osc2_clk: osc2-clk {
+ compatible = "fixed-clock";
+ clock-frequency = <20000000>;
+ #clock-cells = <0>;
+ };
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ pipllct: clock-controller@24220000 {
+ compatible = "toshiba,tmpv7708-pipllct";
+ reg = <0 0x24220000 0 0x820>;
+ #clock-cells = <1>;
+ clocks = <&osc2_clk>;
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/toshiba,tmpv770x-pismu.yaml b/Documentation/devicetree/bindings/clock/toshiba,tmpv770x-pismu.yaml
new file mode 100644
index 000000000000..081f85b1eb88
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/toshiba,tmpv770x-pismu.yaml
@@ -0,0 +1,52 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/toshiba,tmpv770x-pismu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Toshiba Visconti5 TMPV770x SMU controller
+
+maintainers:
+ - Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp>
+
+description:
+ Toshiba Visconti5 SMU (System Management Unit) which supports the clocks
+ and resets on TMPV770x.
+
+properties:
+ compatible:
+ items:
+ - const: toshiba,tmpv7708-pismu
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ '#clock-cells':
+ const: 1
+
+ '#reset-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - "#reset-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ pismu: syscon@24200000 {
+ compatible = "toshiba,tmpv7708-pismu", "syscon";
+ reg = <0 0x24200000 0 0x2140>;
+ #clock-cells = <1>;
+ #reset-cells = <1>;
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/clock/ux500.txt b/Documentation/devicetree/bindings/clock/ux500.txt
deleted file mode 100644
index e52bd4b72348..000000000000
--- a/Documentation/devicetree/bindings/clock/ux500.txt
+++ /dev/null
@@ -1,64 +0,0 @@
-Clock bindings for ST-Ericsson Ux500 clocks
-
-Required properties :
-- compatible : shall contain only one of the following:
- "stericsson,u8500-clks"
- "stericsson,u8540-clks"
- "stericsson,u9540-clks"
-- reg : shall contain base register location and length for
- CLKRST1, 2, 3, 5, and 6 in an array. Note the absence of
- CLKRST4, which does not exist.
-
-Required subnodes:
-- prcmu-clock: a subnode with one clock cell for PRCMU (power,
- reset, control unit) clocks. The cell indicates which PRCMU
- clock in the prcmu-clock node the consumer wants to use.
-- prcc-periph-clock: a subnode with two clock cells for
- PRCC (programmable reset- and clock controller) peripheral clocks.
- The first cell indicates which PRCC block the consumer
- wants to use, possible values are 1, 2, 3, 5, 6. The second
- cell indicates which clock inside the PRCC block it wants,
- possible values are 0 thru 31.
-- prcc-kernel-clock: a subnode with two clock cells for
- PRCC (programmable reset- and clock controller) kernel clocks
- The first cell indicates which PRCC block the consumer
- wants to use, possible values are 1, 2, 3, 5, 6. The second
- cell indicates which clock inside the PRCC block it wants,
- possible values are 0 thru 31.
-- rtc32k-clock: a subnode with zero clock cells for the 32kHz
- RTC clock.
-- smp-twd-clock: a subnode for the ARM SMP Timer Watchdog cluster
- with zero clock cells.
-
-Example:
-
-clocks {
- compatible = "stericsson,u8500-clks";
- /*
- * Registers for the CLKRST block on peripheral
- * groups 1, 2, 3, 5, 6,
- */
- reg = <0x8012f000 0x1000>, <0x8011f000 0x1000>,
- <0x8000f000 0x1000>, <0xa03ff000 0x1000>,
- <0xa03cf000 0x1000>;
-
- prcmu_clk: prcmu-clock {
- #clock-cells = <1>;
- };
-
- prcc_pclk: prcc-periph-clock {
- #clock-cells = <2>;
- };
-
- prcc_kclk: prcc-kernel-clock {
- #clock-cells = <2>;
- };
-
- rtc_clk: rtc32k-clock {
- #clock-cells = <0>;
- };
-
- smp_twd_clk: smp-twd-clock {
- #clock-cells = <0>;
- };
-};
diff --git a/Documentation/devicetree/bindings/clock/xlnx,clocking-wizard.yaml b/Documentation/devicetree/bindings/clock/xlnx,clocking-wizard.yaml
new file mode 100644
index 000000000000..c1f04830a832
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/xlnx,clocking-wizard.yaml
@@ -0,0 +1,77 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/clock/xlnx,clocking-wizard.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Xilinx clocking wizard
+
+maintainers:
+ - Shubhrajyoti Datta <shubhrajyoti.datta@xilinx.com>
+
+description:
+ The clocking wizard is a soft IP clocking block of Xilinx Versal. It
+ reads the required input clock frequencies from the devicetree and acts as a
+ clock output.
+
+properties:
+ compatible:
+ enum:
+ - xlnx,clocking-wizard
+ - xlnx,clocking-wizard-v5.2
+ - xlnx,clocking-wizard-v6.0
+
+
+ reg:
+ maxItems: 1
+
+ "#clock-cells":
+ const: 1
+
+ clocks:
+ items:
+ - description: clock input
+ - description: axi clock
+
+ clock-names:
+ items:
+ - const: clk_in1
+ - const: s_axi_aclk
+
+
+ xlnx,speed-grade:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [1, 2, 3]
+ description:
+ Speed grade of the device. The higher the speed grade, the faster the FPGA device.
+
+ xlnx,nr-outputs:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ minimum: 1
+ maximum: 8
+ description:
+ Number of outputs.
+
+required:
+ - compatible
+ - reg
+ - "#clock-cells"
+ - clocks
+ - clock-names
+ - xlnx,speed-grade
+ - xlnx,nr-outputs
+
+additionalProperties: false
+
+examples:
+ - |
+ clock-controller@b0000000 {
+ compatible = "xlnx,clocking-wizard";
+ reg = <0xb0000000 0x10000>;
+ #clock-cells = <1>;
+ xlnx,speed-grade = <1>;
+ xlnx,nr-outputs = <6>;
+ clock-names = "clk_in1", "s_axi_aclk";
+ clocks = <&clkc 15>, <&clkc 15>;
+ };
+...
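Downstream FPGA logic picks one of the xlnx,nr-outputs clocks through the
clock cell; a sketch assuming the controller above is labelled clk_wiz and
that the cell is the zero-based wizard output number (both assumptions):

    peripheral@a0000000 {
        compatible = "acme,fpga-peripheral"; /* hypothetical consumer */
        reg = <0xa0000000 0x1000>;
        clocks = <&clk_wiz 2>;               /* third wizard output */
    };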
diff --git a/Documentation/devicetree/bindings/connector/usb-connector.yaml b/Documentation/devicetree/bindings/connector/usb-connector.yaml
index 7eb8659fa610..ae515651fc6b 100644
--- a/Documentation/devicetree/bindings/connector/usb-connector.yaml
+++ b/Documentation/devicetree/bindings/connector/usb-connector.yaml
@@ -104,8 +104,7 @@ properties:
- "1.5A" and "3.0A", 5V 1.5A and 5V 3.0A respectively, as defined in USB
Type-C Cable and Connector specification, when Power Delivery is not
supported.
- allOf:
- - $ref: /schemas/types.yaml#/definitions/string
+ $ref: /schemas/types.yaml#/definitions/string
enum:
- default
- 1.5A
@@ -264,11 +263,11 @@ examples:
# Micro-USB connector with HS lines routed via controller (MUIC).
- |
muic-max77843 {
- usb_con1: connector {
- compatible = "usb-b-connector";
- label = "micro-USB";
- type = "micro";
- };
+ usb_con1: connector {
+ compatible = "usb-b-connector";
+ label = "micro-USB";
+ type = "micro";
+ };
};
# USB-C connector attached to CC controller (s2mm005), HS lines routed
@@ -276,34 +275,34 @@ examples:
# DisplayPort video lines are routed to the connector via SS mux in USB3 PHY.
- |
ccic: s2mm005 {
- usb_con2: connector {
- compatible = "usb-c-connector";
- label = "USB-C";
+ usb_con2: connector {
+ compatible = "usb-c-connector";
+ label = "USB-C";
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
- port@0 {
- reg = <0>;
- usb_con_hs: endpoint {
- remote-endpoint = <&max77865_usbc_hs>;
- };
- };
- port@1 {
- reg = <1>;
- usb_con_ss: endpoint {
- remote-endpoint = <&usbdrd_phy_ss>;
- };
- };
- port@2 {
- reg = <2>;
- usb_con_sbu: endpoint {
- remote-endpoint = <&dp_aux>;
+ port@0 {
+ reg = <0>;
+ usb_con_hs: endpoint {
+ remote-endpoint = <&max77865_usbc_hs>;
+ };
+ };
+ port@1 {
+ reg = <1>;
+ usb_con_ss: endpoint {
+ remote-endpoint = <&usbdrd_phy_ss>;
+ };
+ };
+ port@2 {
+ reg = <2>;
+ usb_con_sbu: endpoint {
+ remote-endpoint = <&dp_aux>;
+ };
+ };
};
- };
};
- };
};
# USB-C connector attached to a typec port controller(ptn5110), which has
@@ -311,16 +310,16 @@ examples:
- |
#include <dt-bindings/usb/pd.h>
typec: ptn5110 {
- usb_con3: connector {
- compatible = "usb-c-connector";
- label = "USB-C";
- power-role = "dual";
- try-power-role = "sink";
- source-pdos = <PDO_FIXED(5000, 2000, PDO_FIXED_USB_COMM)>;
- sink-pdos = <PDO_FIXED(5000, 2000, PDO_FIXED_USB_COMM)
- PDO_VAR(5000, 12000, 2000)>;
- op-sink-microwatt = <10000000>;
- };
+ usb_con3: connector {
+ compatible = "usb-c-connector";
+ label = "USB-C";
+ power-role = "dual";
+ try-power-role = "sink";
+ source-pdos = <PDO_FIXED(5000, 2000, PDO_FIXED_USB_COMM)>;
+ sink-pdos = <PDO_FIXED(5000, 2000, PDO_FIXED_USB_COMM)
+ PDO_VAR(5000, 12000, 2000)>;
+ op-sink-microwatt = <10000000>;
+ };
};
# USB-C connector attached to SoC and USB3 typec port controller(hd3ss3220)
@@ -333,20 +332,20 @@ examples:
data-role = "dual";
ports {
- #address-cells = <1>;
- #size-cells = <0>;
- port@0 {
- reg = <0>;
- hs_ep: endpoint {
- remote-endpoint = <&usb3_hs_ep>;
- };
+ #address-cells = <1>;
+ #size-cells = <0>;
+ port@0 {
+ reg = <0>;
+ hs_ep: endpoint {
+ remote-endpoint = <&usb3_hs_ep>;
};
- port@1 {
- reg = <1>;
- ss_ep: endpoint {
- remote-endpoint = <&hd3ss3220_in_ep>;
- };
+ };
+ port@1 {
+ reg = <1>;
+ ss_ep: endpoint {
+ remote-endpoint = <&hd3ss3220_in_ep>;
};
+ };
};
};
@@ -355,12 +354,12 @@ examples:
#include <dt-bindings/gpio/gpio.h>
usb {
- connector {
- compatible = "gpio-usb-b-connector", "usb-b-connector";
- type = "micro";
- id-gpios = <&pio 12 GPIO_ACTIVE_HIGH>;
- vbus-supply = <&usb_p0_vbus>;
- };
+ connector {
+ compatible = "gpio-usb-b-connector", "usb-b-connector";
+ type = "micro";
+ id-gpios = <&pio 12 GPIO_ACTIVE_HIGH>;
+ vbus-supply = <&usb_p0_vbus>;
+ };
};
# Micro-USB connector with HS lines routed via controller (MUIC) and MHL
@@ -368,27 +367,27 @@ examples:
# mobile phone
- |
muic-max77843 {
- usb_con4: connector {
- compatible = "samsung,usb-connector-11pin", "usb-b-connector";
- label = "micro-USB";
- type = "micro";
+ usb_con4: connector {
+ compatible = "samsung,usb-connector-11pin", "usb-b-connector";
+ label = "micro-USB";
+ type = "micro";
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
- port@0 {
- reg = <0>;
- muic_to_usb: endpoint {
- remote-endpoint = <&usb_to_muic>;
- };
- };
- port@3 {
- reg = <3>;
- usb_con_mhl: endpoint {
- remote-endpoint = <&sii8620_mhl>;
+ port@0 {
+ reg = <0>;
+ muic_to_usb: endpoint {
+ remote-endpoint = <&usb_to_muic>;
+ };
+ };
+ port@3 {
+ reg = <3>;
+ usb_con_mhl: endpoint {
+ remote-endpoint = <&sii8620_mhl>;
+ };
+ };
};
- };
};
- };
};
diff --git a/Documentation/devicetree/bindings/counter/ti,am62-ecap-capture.yaml b/Documentation/devicetree/bindings/counter/ti,am62-ecap-capture.yaml
new file mode 100644
index 000000000000..4e0b2d2b303e
--- /dev/null
+++ b/Documentation/devicetree/bindings/counter/ti,am62-ecap-capture.yaml
@@ -0,0 +1,61 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/counter/ti,am62-ecap-capture.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Texas Instruments Enhanced Capture (eCAP) Module
+
+maintainers:
+ - Julien Panis <jpanis@baylibre.com>
+
+description: |
+ The eCAP module resources can be used to capture timestamps
+ on input signal events (falling/rising edges).
+
+properties:
+ compatible:
+ const: ti,am62-ecap-capture
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: fck
+
+ power-domains:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/soc/ti,sci_pm_domain.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ capture@23100000 { /* eCAP in capture mode on am62x */
+ compatible = "ti,am62-ecap-capture";
+ reg = <0x00 0x23100000 0x00 0x100>;
+ interrupts = <GIC_SPI 113 IRQ_TYPE_EDGE_RISING>;
+ power-domains = <&k3_pds 51 TI_SCI_PD_EXCLUSIVE>;
+ clocks = <&k3_clks 51 0>;
+ clock-names = "fck";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/arm/cpu-capacity.txt b/Documentation/devicetree/bindings/cpu/cpu-capacity.txt
index 380e21c5fc7e..f28e1adad428 100644
--- a/Documentation/devicetree/bindings/arm/cpu-capacity.txt
+++ b/Documentation/devicetree/bindings/cpu/cpu-capacity.txt
@@ -1,12 +1,12 @@
==========================================
-ARM CPUs capacity bindings
+CPU capacity bindings
==========================================
==========================================
1 - Introduction
==========================================
-ARM systems may be configured to have cpus with different power/performance
+Some systems may be configured to have cpus with different power/performance
characteristics within the same chip. In this case, additional information has
to be made available to the kernel for it to be aware of such differences and
take decisions accordingly.
@@ -62,8 +62,8 @@ Example 1 (ARM 64-bit, 6-cpu system, two clusters):
The capacities-dmips-mhz or DMIPS/MHz values (scaled to 1024)
are 1024 and 578 for cluster0 and cluster1. Further normalization
is done by the operating system based on cluster0@max-freq=1100 and
-custer1@max-freq=850, final capacities are 1024 for cluster0 and
-446 for cluster1 (576*850/1100).
+cluster1@max-freq=850, final capacities are 1024 for cluster0 and
+446 for cluster1 (578*850/1100).
cpus {
#address-cells = <2>;
diff --git a/Documentation/devicetree/bindings/cpu/idle-states.yaml b/Documentation/devicetree/bindings/cpu/idle-states.yaml
new file mode 100644
index 000000000000..b8cc826c9501
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpu/idle-states.yaml
@@ -0,0 +1,855 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cpu/idle-states.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Idle states
+
+maintainers:
+ - Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
+ - Anup Patel <anup@brainfault.org>
+
+description: |+
+ ==========================================
+ 1 - Introduction
+ ==========================================
+
+ ARM and RISC-V systems contain HW capable of managing power consumption
+ dynamically, where cores can be put in different low-power states (ranging
+ from simple wfi to power gating) according to OS PM policies. The CPU states,
+ representing the range of dynamic idle states that a processor can enter at
+ run-time, can be specified through device tree bindings representing the
+ parameters required to enter/exit specific idle states on a given processor.
+
+ ==========================================
+ 2 - ARM idle states
+ ==========================================
+
+ According to the Server Base System Architecture document (SBSA, [3]), the
+ power states an ARM CPU can be put into are identified by the following list:
+
+ - Running
+ - Idle_standby
+ - Idle_retention
+ - Sleep
+ - Off
+
+ The power states described in the SBSA document define the basic CPU states on
+ top of which ARM platforms implement power management schemes that allow an OS
+ PM implementation to put the processor in different idle states (which include
+ states listed above; "off" state is not an idle state since it does not have
+ wake-up capabilities, hence it is not considered in this document).
+
+ Idle state parameters (e.g. entry latency) are platform specific and need to
+ be characterized with bindings that provide the required information to OS PM
+ code so that it can build the required tables and use them at runtime.
+
+ The device tree binding definition for ARM idle states is the subject of this
+ document.
+
+ ==========================================
+ 3 - RISC-V idle states
+ ==========================================
+
+ On RISC-V systems, the HARTs (or CPUs) [6] can be put in platform specific
+ suspend (or idle) states (ranging from simple WFI to power gating). The
+ RISC-V SBI v0.3 (or higher) [7] hart state management extension provides a
+ standard mechanism for OS to request HART state transitions.
+
+ The platform specific suspend (or idle) states of a hart can be either
+ retentive or non-retentive in nature. A retentive suspend state will
+ preserve HART registers and CSR values for all privilege modes whereas
+ a non-retentive suspend state will not preserve HART registers and CSR
+ values.
+
+ ===========================================
+ 4 - idle-states definitions
+ ===========================================
+
+ Idle states are characterized for a specific system through a set of
+ timing and energy related properties that underline the HW behaviour
+ triggered upon idle states entry and exit.
+
+ The following diagram depicts the CPU execution phases and related timing
+ properties required to enter and exit an idle state:
+
+ ..__[EXEC]__|__[PREP]__|__[ENTRY]__|__[IDLE]__|__[EXIT]__|__[EXEC]__..
+ | | | | |
+
+ |<------ entry ------->|
+ | latency |
+ |<- exit ->|
+ | latency |
+ |<-------- min-residency -------->|
+ |<------- wakeup-latency ------->|
+
+ Diagram 1: CPU idle state execution phases
+
+ EXEC: Normal CPU execution.
+
+ PREP: Preparation phase before committing the hardware to idle mode
+ like cache flushing. This is abortable on pending wake-up
+ event conditions. The abort latency is assumed to be negligible
+ (i.e. less than the ENTRY + EXIT duration). If aborted, CPU
+ goes back to EXEC. This phase is optional. If not abortable,
+ this should be included in the ENTRY phase instead.
+
+ ENTRY: The hardware is committed to idle mode. This period must run
+ to completion up to IDLE before anything else can happen.
+
+ IDLE: This is the actual energy-saving idle period. This may last
+ between 0 and infinite time, until a wake-up event occurs.
+
+ EXIT: Period during which the CPU is brought back to operational
+ mode (EXEC).
+
+ entry-latency: Worst case latency required to enter the idle state. The
+ exit-latency may be guaranteed only after entry-latency has passed.
+
+ min-residency: Minimum period, including preparation and entry, for a given
+ idle state to be worthwhile energywise.
+
+ wakeup-latency: Maximum delay between the signaling of a wake-up event and the
+ CPU being able to execute normal code again. If not specified, this is assumed
+ to be entry-latency + exit-latency.
+
+ These timing parameters can be used by an OS in different circumstances.
+
+ An idle CPU requires the expected min-residency time to select the most
+ appropriate idle state based on the expected expiry time of the next IRQ
+ (i.e. wake-up) that causes the CPU to return to the EXEC phase.
+
+ An operating system scheduler may need to compute the shortest wake-up delay
+ for CPUs in the system by detecting how long it will take to get a CPU out
+ of an idle state, e.g.:
+
+ wakeup-delay = exit-latency + max(entry-latency - (now - entry-timestamp), 0)
+
+ In other words, the scheduler can make its scheduling decision by selecting
+ (e.g. waking-up) the CPU with the shortest wake-up delay.
+ The wake-up delay must take into account the entry latency if that period
+ has not expired. The abortable nature of the PREP period can be ignored
+ if it cannot be relied upon (e.g. the PREP deadline may occur much sooner than
+ the worst case since it depends on the CPU operating conditions, i.e. caches
+ state).
+
+ An OS has to reliably probe the wakeup-latency since some devices require
+ latency constraints to be guaranteed in order to work properly; the OS
+ therefore has to detect the worst case wake-up latency it can incur if a
+ CPU is allowed to enter an idle state, and possibly prevent entry into that
+ state to guarantee reliable device functioning.
+
+ The min-residency time parameter deserves further explanation since it is
+ expressed in time units but must factor in energy consumption coefficients.
+
+ The energy consumption of a cpu when it enters a power state can be roughly
+ characterised by the following graph:
+
+ |
+ |
+ |
+ e |
+ n | /---
+ e | /------
+ r | /------
+ g | /-----
+ y | /------
+ | ----
+ | /|
+ | / |
+ | / |
+ | / |
+ | / |
+ | / |
+ |/ |
+ -----|-------+----------------------------------
+ 0| 1 time(ms)
+
+ Graph 1: Energy vs time example
+
+ The graph is split in two parts delimited by time 1ms on the X-axis.
+ The graph curve with X-axis values = { x | 0 < x < 1ms } has a steep slope
+ and denotes the energy costs incurred while entering and leaving the idle
+ state.
+ The graph curve in the area delimited by X-axis values = {x | x > 1ms } has a
+ shallower slope and essentially represents the energy consumption of the idle
+ state.
+
+ min-residency is defined for a given idle state as the minimum expected
+ residency time for a state (inclusive of preparation and entry) after
+ which choosing that state becomes the most energy efficient option. A good
+ way to visualise this is to take the same graph above and compare the
+ energy consumption plots of several states.
+
+ For the sake of simplicity, let's consider a system with two idle states IDLE1,
+ and IDLE2:
+
+ |
+ |
+ |
+ | /-- IDLE1
+ e | /---
+ n | /----
+ e | /---
+ r | /-----/--------- IDLE2
+ g | /-------/---------
+ y | ------------ /---|
+ | / /---- |
+ | / /--- |
+ | / /---- |
+ | / /--- |
+ | --- |
+ | / |
+ | / |
+ |/ | time
+ ---/----------------------------+------------------------
+ |IDLE1-energy < IDLE2-energy | IDLE2-energy < IDLE1-energy
+ |
+ IDLE2-min-residency
+
+ Graph 2: idle states min-residency example
+
+ In graph 2 above, which takes into account idle state entry/exit energy
+ costs, it is clear that if the idle state residency time (i.e. time till next
+ wake-up IRQ) is less than IDLE2-min-residency, IDLE1 is the better idle state
+ choice energywise.
+
+ This is mainly down to the fact that IDLE1 entry/exit energy costs are lower
+ than IDLE2.
+
+ However, the lower power consumption (i.e. shallower energy curve slope) of
+ idle state IDLE2 implies that after a suitable time, IDLE2 becomes more energy
+ efficient.
+
+ The time at which IDLE2 becomes more energy efficient than IDLE1 (and other
+ shallower states in a system with multiple idle states) is defined as
+ IDLE2-min-residency and corresponds to the time when the energy consumption of
+ IDLE1 and IDLE2 states breaks even.
+
+ The definitions provided in this section underpin the idle states
+ properties specification that is the subject of the following sections.
+
+ ===========================================
+ 5 - idle-states node
+ ===========================================
+
+ The processor idle states are defined within the idle-states node, which is
+ a direct child of the cpus node [1] and provides a container where the
+ processor idle states, defined as device tree nodes, are listed.
+
+ On ARM systems, it is a container of processor idle state nodes. If the
+ system does not provide CPU power management capabilities, or the processor
+ just supports idle_standby, an idle-states node is not required.
+
+ ===========================================
+ 6 - References
+ ===========================================
+
+ [1] ARM Linux Kernel documentation - CPUs bindings
+ Documentation/devicetree/bindings/arm/cpus.yaml
+
+ [2] ARM Linux Kernel documentation - PSCI bindings
+ Documentation/devicetree/bindings/arm/psci.yaml
+
+ [3] ARM Server Base System Architecture (SBSA)
+ http://infocenter.arm.com/help/index.jsp
+
+ [4] ARM Architecture Reference Manuals
+ http://infocenter.arm.com/help/index.jsp
+
+ [5] ARM Linux Kernel documentation - Booting AArch64 Linux
+ Documentation/arm64/booting.rst
+
+ [6] RISC-V Linux Kernel documentation - CPUs bindings
+ Documentation/devicetree/bindings/riscv/cpus.yaml
+
+ [7] RISC-V Supervisor Binary Interface (SBI)
+ http://github.com/riscv/riscv-sbi-doc/riscv-sbi.adoc
+
+properties:
+ $nodename:
+ const: idle-states
+
+ entry-method:
+ description: |
+ Usage and definition depend on ARM architecture version.
+
+ On ARM v8 64-bit this property is required.
+ On ARM 32-bit systems this property is optional.
+
+ This assumes that the "enable-method" property is set to "psci" in the cpu
+ node[5] that is responsible for setting up CPU idle management in the OS
+ implementation.
+ const: psci
+
+patternProperties:
+ "^(cpu|cluster)-":
+ type: object
+ description: |
+ Each state node represents an idle state description and must be defined
+ as follows.
+
+ The idle state entered by executing the wfi instruction (idle_standby
+ SBSA [3][4]) is considered standard on all ARM and RISC-V platforms and
+ therefore must not be listed.
+
+ In addition to the properties listed above, a state node may require
+ additional properties specific to the entry-method defined in the
+ idle-states node. Please refer to the entry-method bindings
+ documentation for property definitions.
+
+ properties:
+ compatible:
+ enum:
+ - arm,idle-state
+ - riscv,idle-state
+
+ arm,psci-suspend-param:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: |
+ power_state parameter to pass to the ARM PSCI suspend call.
+
+ Device tree nodes that require usage of the PSCI CPU_SUSPEND function
+ (i.e. idle state nodes with the entry-method property set to "psci")
+ must specify this property.
+
+ riscv,sbi-suspend-param:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: |
+ suspend_type parameter to pass to the RISC-V SBI HSM suspend call.
+
+ This property is required in idle state nodes of device trees meant
+ for RISC-V systems. For more details on the suspend_type parameter
+ refer to the SBI specification v0.3 (or higher) [7].
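+
+ Note: per the SBI HSM extension [7], suspend_type values with bit 31
+ clear denote retentive suspend and values with bit 31 set denote
+ non-retentive suspend; the RISC-V example below (0x10000000 vs
+ 0x90000000) follows this convention.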
+
+ local-timer-stop:
+ description:
+ If present, the CPU local timer control logic is lost on state entry;
+ otherwise it is retained.
+ type: boolean
+
+ entry-latency-us:
+ description:
+ Worst case latency in microseconds required to enter the idle state.
+
+ exit-latency-us:
+ description:
+ Worst case latency in microseconds required to exit the idle state.
+ The exit-latency-us duration may be guaranteed only after
+ entry-latency-us has passed.
+
+ min-residency-us:
+ description:
+ Minimum residency duration in microseconds, inclusive of preparation
+ and entry, for this idle state to be considered worthwhile energy-wise
+ (refer to section 2 of this document for a complete description).
+
+ wakeup-latency-us:
+ description: |
+ Maximum delay between the signaling of a wake-up event and the CPU
+ being able to execute normal code again. If omitted, this is assumed
+ to be equal to:
+
+ entry-latency-us + exit-latency-us
+
+ It is important to supply this value on systems where the duration of
+ the PREP phase (see diagram 1, section 2) is non-negligible. In such
+ systems entry-latency-us + exit-latency-us will exceed
+ wakeup-latency-us by this duration.
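+
+ As an illustration (hypothetical numbers): a state with
+ entry-latency-us = <250> and exit-latency-us = <500> that omits this
+ property is assumed to have a worst-case wake-up delay of 750us.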
+
+ idle-state-name:
+ $ref: /schemas/types.yaml#/definitions/string
+ description:
+ A string used as a descriptive name for the idle state.
+
+ additionalProperties: false
+
+ required:
+ - compatible
+ - entry-latency-us
+ - exit-latency-us
+ - min-residency-us
+
+additionalProperties: false
+
+examples:
+ - |
+
+ cpus {
+ #size-cells = <0>;
+ #address-cells = <2>;
+
+ cpu@0 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x0 0x0>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_0_0>, <&CPU_SLEEP_0_0>,
+ <&CLUSTER_RETENTION_0>, <&CLUSTER_SLEEP_0>;
+ };
+
+ cpu@1 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x0 0x1>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_0_0>, <&CPU_SLEEP_0_0>,
+ <&CLUSTER_RETENTION_0>, <&CLUSTER_SLEEP_0>;
+ };
+
+ cpu@100 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x0 0x100>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_0_0>, <&CPU_SLEEP_0_0>,
+ <&CLUSTER_RETENTION_0>, <&CLUSTER_SLEEP_0>;
+ };
+
+ cpu@101 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x0 0x101>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_0_0>, <&CPU_SLEEP_0_0>,
+ <&CLUSTER_RETENTION_0>, <&CLUSTER_SLEEP_0>;
+ };
+
+ cpu@10000 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x0 0x10000>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_0_0>, <&CPU_SLEEP_0_0>,
+ <&CLUSTER_RETENTION_0>, <&CLUSTER_SLEEP_0>;
+ };
+
+ cpu@10001 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x0 0x10001>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_0_0>, <&CPU_SLEEP_0_0>,
+ <&CLUSTER_RETENTION_0>, <&CLUSTER_SLEEP_0>;
+ };
+
+ cpu@10100 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x0 0x10100>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_0_0>, <&CPU_SLEEP_0_0>,
+ <&CLUSTER_RETENTION_0>, <&CLUSTER_SLEEP_0>;
+ };
+
+ cpu@10101 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x0 0x10101>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_0_0>, <&CPU_SLEEP_0_0>,
+ <&CLUSTER_RETENTION_0>, <&CLUSTER_SLEEP_0>;
+ };
+
+ cpu@100000000 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x1 0x0>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_1_0>, <&CPU_SLEEP_1_0>,
+ <&CLUSTER_RETENTION_1>, <&CLUSTER_SLEEP_1>;
+ };
+
+ cpu@100000001 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x1 0x1>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_1_0>, <&CPU_SLEEP_1_0>,
+ <&CLUSTER_RETENTION_1>, <&CLUSTER_SLEEP_1>;
+ };
+
+ cpu@100000100 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x1 0x100>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_1_0>, <&CPU_SLEEP_1_0>,
+ <&CLUSTER_RETENTION_1>, <&CLUSTER_SLEEP_1>;
+ };
+
+ cpu@100000101 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x1 0x101>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_1_0>, <&CPU_SLEEP_1_0>,
+ <&CLUSTER_RETENTION_1>, <&CLUSTER_SLEEP_1>;
+ };
+
+ cpu@100010000 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x1 0x10000>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_1_0>, <&CPU_SLEEP_1_0>,
+ <&CLUSTER_RETENTION_1>, <&CLUSTER_SLEEP_1>;
+ };
+
+ cpu@100010001 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x1 0x10001>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_1_0>, <&CPU_SLEEP_1_0>,
+ <&CLUSTER_RETENTION_1>, <&CLUSTER_SLEEP_1>;
+ };
+
+ cpu@100010100 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x1 0x10100>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_1_0>, <&CPU_SLEEP_1_0>,
+ <&CLUSTER_RETENTION_1>, <&CLUSTER_SLEEP_1>;
+ };
+
+ cpu@100010101 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x1 0x10101>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_RETENTION_1_0>, <&CPU_SLEEP_1_0>,
+ <&CLUSTER_RETENTION_1>, <&CLUSTER_SLEEP_1>;
+ };
+
+ idle-states {
+ entry-method = "psci";
+
+ CPU_RETENTION_0_0: cpu-retention-0-0 {
+ compatible = "arm,idle-state";
+ arm,psci-suspend-param = <0x0010000>;
+ entry-latency-us = <20>;
+ exit-latency-us = <40>;
+ min-residency-us = <80>;
+ };
+
+ CLUSTER_RETENTION_0: cluster-retention-0 {
+ compatible = "arm,idle-state";
+ local-timer-stop;
+ arm,psci-suspend-param = <0x1010000>;
+ entry-latency-us = <50>;
+ exit-latency-us = <100>;
+ min-residency-us = <250>;
+ wakeup-latency-us = <130>;
+ };
+
+ CPU_SLEEP_0_0: cpu-sleep-0-0 {
+ compatible = "arm,idle-state";
+ local-timer-stop;
+ arm,psci-suspend-param = <0x0010000>;
+ entry-latency-us = <250>;
+ exit-latency-us = <500>;
+ min-residency-us = <950>;
+ };
+
+ CLUSTER_SLEEP_0: cluster-sleep-0 {
+ compatible = "arm,idle-state";
+ local-timer-stop;
+ arm,psci-suspend-param = <0x1010000>;
+ entry-latency-us = <600>;
+ exit-latency-us = <1100>;
+ min-residency-us = <2700>;
+ wakeup-latency-us = <1500>;
+ };
+
+ CPU_RETENTION_1_0: cpu-retention-1-0 {
+ compatible = "arm,idle-state";
+ arm,psci-suspend-param = <0x0010000>;
+ entry-latency-us = <20>;
+ exit-latency-us = <40>;
+ min-residency-us = <90>;
+ };
+
+ CLUSTER_RETENTION_1: cluster-retention-1 {
+ compatible = "arm,idle-state";
+ local-timer-stop;
+ arm,psci-suspend-param = <0x1010000>;
+ entry-latency-us = <50>;
+ exit-latency-us = <100>;
+ min-residency-us = <270>;
+ wakeup-latency-us = <100>;
+ };
+
+ CPU_SLEEP_1_0: cpu-sleep-1-0 {
+ compatible = "arm,idle-state";
+ local-timer-stop;
+ arm,psci-suspend-param = <0x0010000>;
+ entry-latency-us = <70>;
+ exit-latency-us = <100>;
+ min-residency-us = <300>;
+ wakeup-latency-us = <150>;
+ };
+
+ CLUSTER_SLEEP_1: cluster-sleep-1 {
+ compatible = "arm,idle-state";
+ local-timer-stop;
+ arm,psci-suspend-param = <0x1010000>;
+ entry-latency-us = <500>;
+ exit-latency-us = <1200>;
+ min-residency-us = <3500>;
+ wakeup-latency-us = <1300>;
+ };
+ };
+ };
+
+ - |
+ // Example 2 (ARM 32-bit, 8-cpu system, two clusters):
+
+ cpus {
+ #size-cells = <0>;
+ #address-cells = <1>;
+
+ cpu@0 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a15";
+ reg = <0x0>;
+ cpu-idle-states = <&cpu_sleep_0_0>, <&cluster_sleep_0>;
+ };
+
+ cpu@1 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a15";
+ reg = <0x1>;
+ cpu-idle-states = <&cpu_sleep_0_0>, <&cluster_sleep_0>;
+ };
+
+ cpu@2 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a15";
+ reg = <0x2>;
+ cpu-idle-states = <&cpu_sleep_0_0>, <&cluster_sleep_0>;
+ };
+
+ cpu@3 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a15";
+ reg = <0x3>;
+ cpu-idle-states = <&cpu_sleep_0_0>, <&cluster_sleep_0>;
+ };
+
+ cpu@100 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x100>;
+ cpu-idle-states = <&cpu_sleep_1_0>, <&cluster_sleep_1>;
+ };
+
+ cpu@101 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x101>;
+ cpu-idle-states = <&cpu_sleep_1_0>, <&cluster_sleep_1>;
+ };
+
+ cpu@102 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x102>;
+ cpu-idle-states = <&cpu_sleep_1_0>, <&cluster_sleep_1>;
+ };
+
+ cpu@103 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x103>;
+ cpu-idle-states = <&cpu_sleep_1_0>, <&cluster_sleep_1>;
+ };
+
+ idle-states {
+ cpu_sleep_0_0: cpu-sleep-0-0 {
+ compatible = "arm,idle-state";
+ local-timer-stop;
+ entry-latency-us = <200>;
+ exit-latency-us = <100>;
+ min-residency-us = <400>;
+ wakeup-latency-us = <250>;
+ };
+
+ cluster_sleep_0: cluster-sleep-0 {
+ compatible = "arm,idle-state";
+ local-timer-stop;
+ entry-latency-us = <500>;
+ exit-latency-us = <1500>;
+ min-residency-us = <2500>;
+ wakeup-latency-us = <1700>;
+ };
+
+ cpu_sleep_1_0: cpu-sleep-1-0 {
+ compatible = "arm,idle-state";
+ local-timer-stop;
+ entry-latency-us = <300>;
+ exit-latency-us = <500>;
+ min-residency-us = <900>;
+ wakeup-latency-us = <600>;
+ };
+
+ cluster_sleep_1: cluster-sleep-1 {
+ compatible = "arm,idle-state";
+ local-timer-stop;
+ entry-latency-us = <800>;
+ exit-latency-us = <2000>;
+ min-residency-us = <6500>;
+ wakeup-latency-us = <2300>;
+ };
+ };
+ };
+
+ - |
+ // Example 3 (RISC-V 64-bit, 4-cpu systems, two clusters):
+
+ cpus {
+ #size-cells = <0>;
+ #address-cells = <1>;
+
+ cpu@0 {
+ device_type = "cpu";
+ compatible = "riscv";
+ reg = <0x0>;
+ riscv,isa = "rv64imafdc";
+ mmu-type = "riscv,sv48";
+ cpu-idle-states = <&CPU_RET_0_0>, <&CPU_NONRET_0_0>,
+ <&CLUSTER_RET_0>, <&CLUSTER_NONRET_0>;
+
+ cpu_intc0: interrupt-controller {
+ #interrupt-cells = <1>;
+ compatible = "riscv,cpu-intc";
+ interrupt-controller;
+ };
+ };
+
+ cpu@1 {
+ device_type = "cpu";
+ compatible = "riscv";
+ reg = <0x1>;
+ riscv,isa = "rv64imafdc";
+ mmu-type = "riscv,sv48";
+ cpu-idle-states = <&CPU_RET_0_0>, <&CPU_NONRET_0_0>,
+ <&CLUSTER_RET_0>, <&CLUSTER_NONRET_0>;
+
+ cpu_intc1: interrupt-controller {
+ #interrupt-cells = <1>;
+ compatible = "riscv,cpu-intc";
+ interrupt-controller;
+ };
+ };
+
+ cpu@10 {
+ device_type = "cpu";
+ compatible = "riscv";
+ reg = <0x10>;
+ riscv,isa = "rv64imafdc";
+ mmu-type = "riscv,sv48";
+ cpu-idle-states = <&CPU_RET_1_0>, <&CPU_NONRET_1_0>,
+ <&CLUSTER_RET_1>, <&CLUSTER_NONRET_1>;
+
+ cpu_intc10: interrupt-controller {
+ #interrupt-cells = <1>;
+ compatible = "riscv,cpu-intc";
+ interrupt-controller;
+ };
+ };
+
+ cpu@11 {
+ device_type = "cpu";
+ compatible = "riscv";
+ reg = <0x11>;
+ riscv,isa = "rv64imafdc";
+ mmu-type = "riscv,sv48";
+ cpu-idle-states = <&CPU_RET_1_0>, <&CPU_NONRET_1_0>,
+ <&CLUSTER_RET_1>, <&CLUSTER_NONRET_1>;
+
+ cpu_intc11: interrupt-controller {
+ #interrupt-cells = <1>;
+ compatible = "riscv,cpu-intc";
+ interrupt-controller;
+ };
+ };
+
+ idle-states {
+ CPU_RET_0_0: cpu-retentive-0-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x10000000>;
+ entry-latency-us = <20>;
+ exit-latency-us = <40>;
+ min-residency-us = <80>;
+ };
+
+ CPU_NONRET_0_0: cpu-nonretentive-0-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x90000000>;
+ entry-latency-us = <250>;
+ exit-latency-us = <500>;
+ min-residency-us = <950>;
+ };
+
+ CLUSTER_RET_0: cluster-retentive-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x11000000>;
+ local-timer-stop;
+ entry-latency-us = <50>;
+ exit-latency-us = <100>;
+ min-residency-us = <250>;
+ wakeup-latency-us = <130>;
+ };
+
+ CLUSTER_NONRET_0: cluster-nonretentive-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x91000000>;
+ local-timer-stop;
+ entry-latency-us = <600>;
+ exit-latency-us = <1100>;
+ min-residency-us = <2700>;
+ wakeup-latency-us = <1500>;
+ };
+
+ CPU_RET_1_0: cpu-retentive-1-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x10000010>;
+ entry-latency-us = <20>;
+ exit-latency-us = <40>;
+ min-residency-us = <80>;
+ };
+
+ CPU_NONRET_1_0: cpu-nonretentive-1-0 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x90000010>;
+ entry-latency-us = <250>;
+ exit-latency-us = <500>;
+ min-residency-us = <950>;
+ };
+
+ CLUSTER_RET_1: cluster-retentive-1 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x11000010>;
+ local-timer-stop;
+ entry-latency-us = <50>;
+ exit-latency-us = <100>;
+ min-residency-us = <250>;
+ wakeup-latency-us = <130>;
+ };
+
+ CLUSTER_NONRET_1: cluster-nonretentive-1 {
+ compatible = "riscv,idle-state";
+ riscv,sbi-suspend-param = <0x91000010>;
+ local-timer-stop;
+ entry-latency-us = <600>;
+ exit-latency-us = <1100>;
+ min-residency-us = <2700>;
+ wakeup-latency-us = <1500>;
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/cpufreq/apple,cluster-cpufreq.yaml b/Documentation/devicetree/bindings/cpufreq/apple,cluster-cpufreq.yaml
new file mode 100644
index 000000000000..76cb9726660e
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/apple,cluster-cpufreq.yaml
@@ -0,0 +1,117 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cpufreq/apple,cluster-cpufreq.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Apple SoC cluster cpufreq device
+
+maintainers:
+ - Hector Martin <marcan@marcan.st>
+
+description: |
+ Apple SoCs (e.g. M1) have a per-cpu-cluster DVFS controller that is part of
+ the cluster management register block. This binding uses the standard
+ operating-points-v2 table to define the CPU performance states, with the
+ opp-level property specifying the hardware p-state index for that level.
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - apple,t8103-cluster-cpufreq
+ - apple,t8112-cluster-cpufreq
+ - const: apple,cluster-cpufreq
+ - items:
+ - const: apple,t6000-cluster-cpufreq
+ - const: apple,t8103-cluster-cpufreq
+ - const: apple,cluster-cpufreq
+
+ reg:
+ maxItems: 1
+
+ '#performance-domain-cells':
+ const: 0
+
+required:
+ - compatible
+ - reg
+ - '#performance-domain-cells'
+
+additionalProperties: false
+
+examples:
+ - |
+ // This example shows a single CPU per domain and 2 domains,
+ // with two p-states per domain.
+ // Shipping hardware has 2-4 CPUs per domain and 2-6 domains.
+ cpus {
+ #address-cells = <2>;
+ #size-cells = <0>;
+
+ cpu@0 {
+ compatible = "apple,icestorm";
+ device_type = "cpu";
+ reg = <0x0 0x0>;
+ operating-points-v2 = <&ecluster_opp>;
+ performance-domains = <&cpufreq_e>;
+ };
+
+ cpu@10100 {
+ compatible = "apple,firestorm";
+ device_type = "cpu";
+ reg = <0x0 0x10100>;
+ operating-points-v2 = <&pcluster_opp>;
+ performance-domains = <&cpufreq_p>;
+ };
+ };
+
+ ecluster_opp: opp-table-0 {
+ compatible = "operating-points-v2";
+ opp-shared;
+
+ opp01 {
+ opp-hz = /bits/ 64 <600000000>;
+ opp-level = <1>;
+ clock-latency-ns = <7500>;
+ };
+ opp02 {
+ opp-hz = /bits/ 64 <972000000>;
+ opp-level = <2>;
+ clock-latency-ns = <22000>;
+ };
+ };
+
+ pcluster_opp: opp-table-1 {
+ compatible = "operating-points-v2";
+ opp-shared;
+
+ opp01 {
+ opp-hz = /bits/ 64 <600000000>;
+ opp-level = <1>;
+ clock-latency-ns = <8000>;
+ };
+ opp02 {
+ opp-hz = /bits/ 64 <828000000>;
+ opp-level = <2>;
+ clock-latency-ns = <19000>;
+ };
+ };
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ cpufreq_e: performance-controller@210e20000 {
+ compatible = "apple,t8103-cluster-cpufreq", "apple,cluster-cpufreq";
+ reg = <0x2 0x10e20000 0 0x1000>;
+ #performance-domain-cells = <0>;
+ };
+
+ cpufreq_p: performance-controller@211e20000 {
+ compatible = "apple,t8103-cluster-cpufreq", "apple,cluster-cpufreq";
+ reg = <0x2 0x11e20000 0 0x1000>;
+ #performance-domain-cells = <0>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/cpufreq/brcm,stb-avs-cpu-freq.txt b/Documentation/devicetree/bindings/cpufreq/brcm,stb-avs-cpu-freq.txt
index 73470ecd1f12..ce91a9197697 100644
--- a/Documentation/devicetree/bindings/cpufreq/brcm,stb-avs-cpu-freq.txt
+++ b/Documentation/devicetree/bindings/cpufreq/brcm,stb-avs-cpu-freq.txt
@@ -16,7 +16,7 @@ has been processed. See [2] for more information on the brcm,l2-intc node.
firmware. On some SoCs, this firmware supports DFS and DVFS in addition to
Adaptive Voltage Scaling.
-[2] Documentation/devicetree/bindings/interrupt-controller/brcm,l2-intc.txt
+[2] Documentation/devicetree/bindings/interrupt-controller/brcm,l2-intc.yaml
Node brcm,avs-cpu-data-mem
diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek-hw.yaml b/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek-hw.yaml
index 9cd42a64b13e..d0aecde2b89b 100644
--- a/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek-hw.yaml
+++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek-hw.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/cpufreq/cpufreq-mediatek-hw.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: MediaTek's CPUFREQ Bindings
+title: MediaTek's CPUFREQ
maintainers:
- Hector Yuan <hector.yuan@mediatek.com>
diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt b/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt
index b8233ec91d3d..e0a4ba599abc 100644
--- a/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt
+++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt
@@ -20,6 +20,13 @@ Optional properties:
Vsram to fit SoC specific needs. When absent, the voltage scaling
flow is handled by hardware, hence no software "voltage tracking" is
needed.
+- mediatek,cci:
+ Used to confirm the link status between cpufreq and the MediaTek CCI,
+ since cpufreq and the MediaTek CCI may share the same regulator on some
+ MediaTek SoCs. To prevent a high-frequency, low-voltage condition, this
+ property is needed to make sure the MediaTek CCI is ready.
+ For details of the MediaTek CCI, please refer to
+ Documentation/devicetree/bindings/interconnect/mediatek,cci.yaml
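+ A typical use (illustrative only; the cci node label is assumed) is:
+ mediatek,cci = <&cci>;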
- #cooling-cells:
For details, please refer to
Documentation/devicetree/bindings/thermal/thermal-cooling-devices.yaml
diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt b/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt
deleted file mode 100644
index 9299028ee712..000000000000
--- a/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.txt
+++ /dev/null
@@ -1,172 +0,0 @@
-Qualcomm Technologies, Inc. CPUFREQ Bindings
-
-CPUFREQ HW is a hardware engine used by some Qualcomm Technologies, Inc. (QTI)
-SoCs to manage frequency in hardware. It is capable of controlling frequency
-for multiple clusters.
-
-Properties:
-- compatible
- Usage: required
- Value type: <string>
- Definition: must be "qcom,cpufreq-hw" or "qcom,cpufreq-epss".
-
-- clocks
- Usage: required
- Value type: <phandle> From common clock binding.
- Definition: clock handle for XO clock and GPLL0 clock.
-
-- clock-names
- Usage: required
- Value type: <string> From common clock binding.
- Definition: must be "xo", "alternate".
-
-- reg
- Usage: required
- Value type: <prop-encoded-array>
- Definition: Addresses and sizes for the memory of the HW bases in
- each frequency domain.
-- reg-names
- Usage: Optional
- Value type: <string>
- Definition: Frequency domain name i.e.
- "freq-domain0", "freq-domain1".
-
-- #freq-domain-cells:
- Usage: required.
- Definition: Number of cells in a freqency domain specifier.
-
-* Property qcom,freq-domain
-Devices supporting freq-domain must set their "qcom,freq-domain" property with
-phandle to a cpufreq_hw followed by the Domain ID(0/1) in the CPU DT node.
-
-
-Example:
-
-Example 1: Dual-cluster, Quad-core per cluster. CPUs within a cluster switch
-DCVS state together.
-
-/ {
- cpus {
- #address-cells = <2>;
- #size-cells = <0>;
-
- CPU0: cpu@0 {
- device_type = "cpu";
- compatible = "qcom,kryo385";
- reg = <0x0 0x0>;
- enable-method = "psci";
- next-level-cache = <&L2_0>;
- qcom,freq-domain = <&cpufreq_hw 0>;
- L2_0: l2-cache {
- compatible = "cache";
- next-level-cache = <&L3_0>;
- L3_0: l3-cache {
- compatible = "cache";
- };
- };
- };
-
- CPU1: cpu@100 {
- device_type = "cpu";
- compatible = "qcom,kryo385";
- reg = <0x0 0x100>;
- enable-method = "psci";
- next-level-cache = <&L2_100>;
- qcom,freq-domain = <&cpufreq_hw 0>;
- L2_100: l2-cache {
- compatible = "cache";
- next-level-cache = <&L3_0>;
- };
- };
-
- CPU2: cpu@200 {
- device_type = "cpu";
- compatible = "qcom,kryo385";
- reg = <0x0 0x200>;
- enable-method = "psci";
- next-level-cache = <&L2_200>;
- qcom,freq-domain = <&cpufreq_hw 0>;
- L2_200: l2-cache {
- compatible = "cache";
- next-level-cache = <&L3_0>;
- };
- };
-
- CPU3: cpu@300 {
- device_type = "cpu";
- compatible = "qcom,kryo385";
- reg = <0x0 0x300>;
- enable-method = "psci";
- next-level-cache = <&L2_300>;
- qcom,freq-domain = <&cpufreq_hw 0>;
- L2_300: l2-cache {
- compatible = "cache";
- next-level-cache = <&L3_0>;
- };
- };
-
- CPU4: cpu@400 {
- device_type = "cpu";
- compatible = "qcom,kryo385";
- reg = <0x0 0x400>;
- enable-method = "psci";
- next-level-cache = <&L2_400>;
- qcom,freq-domain = <&cpufreq_hw 1>;
- L2_400: l2-cache {
- compatible = "cache";
- next-level-cache = <&L3_0>;
- };
- };
-
- CPU5: cpu@500 {
- device_type = "cpu";
- compatible = "qcom,kryo385";
- reg = <0x0 0x500>;
- enable-method = "psci";
- next-level-cache = <&L2_500>;
- qcom,freq-domain = <&cpufreq_hw 1>;
- L2_500: l2-cache {
- compatible = "cache";
- next-level-cache = <&L3_0>;
- };
- };
-
- CPU6: cpu@600 {
- device_type = "cpu";
- compatible = "qcom,kryo385";
- reg = <0x0 0x600>;
- enable-method = "psci";
- next-level-cache = <&L2_600>;
- qcom,freq-domain = <&cpufreq_hw 1>;
- L2_600: l2-cache {
- compatible = "cache";
- next-level-cache = <&L3_0>;
- };
- };
-
- CPU7: cpu@700 {
- device_type = "cpu";
- compatible = "qcom,kryo385";
- reg = <0x0 0x700>;
- enable-method = "psci";
- next-level-cache = <&L2_700>;
- qcom,freq-domain = <&cpufreq_hw 1>;
- L2_700: l2-cache {
- compatible = "cache";
- next-level-cache = <&L3_0>;
- };
- };
- };
-
- soc {
- cpufreq_hw: cpufreq@17d43000 {
- compatible = "qcom,cpufreq-hw";
- reg = <0x17d43000 0x1400>, <0x17d45800 0x1400>;
- reg-names = "freq-domain0", "freq-domain1";
-
- clocks = <&rpmhcc RPMH_CXO_CLK>, <&gcc GPLL0>;
- clock-names = "xo", "alternate";
-
- #freq-domain-cells = <1>;
- };
-}
diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.yaml b/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.yaml
new file mode 100644
index 000000000000..a6b3bb8fdf33
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-qcom-hw.yaml
@@ -0,0 +1,362 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cpufreq/cpufreq-qcom-hw.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Technologies, Inc. CPUFREQ
+
+maintainers:
+ - Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
+
+description: |
+
+ CPUFREQ HW is a hardware engine used by some Qualcomm Technologies, Inc. (QTI)
+ SoCs to manage frequency in hardware. It is capable of controlling frequency
+ for multiple clusters.
+
+properties:
+ compatible:
+ oneOf:
+ - description: v1 of CPUFREQ HW
+ items:
+ - enum:
+ - qcom,qcm2290-cpufreq-hw
+ - qcom,sc7180-cpufreq-hw
+ - qcom,sdm845-cpufreq-hw
+ - qcom,sm6115-cpufreq-hw
+ - qcom,sm6350-cpufreq-hw
+ - qcom,sm8150-cpufreq-hw
+ - const: qcom,cpufreq-hw
+
+ - description: v2 of CPUFREQ HW (EPSS)
+ items:
+ - enum:
+ - qcom,qdu1000-cpufreq-epss
+ - qcom,sa8775p-cpufreq-epss
+ - qcom,sc7280-cpufreq-epss
+ - qcom,sc8280xp-cpufreq-epss
+ - qcom,sm6375-cpufreq-epss
+ - qcom,sm8250-cpufreq-epss
+ - qcom,sm8350-cpufreq-epss
+ - qcom,sm8450-cpufreq-epss
+ - qcom,sm8550-cpufreq-epss
+ - const: qcom,cpufreq-epss
+
+ reg:
+ minItems: 1
+ items:
+ - description: Frequency domain 0 register region
+ - description: Frequency domain 1 register region
+ - description: Frequency domain 2 register region
+
+ reg-names:
+ minItems: 1
+ items:
+ - const: freq-domain0
+ - const: freq-domain1
+ - const: freq-domain2
+
+ clocks:
+ items:
+ - description: XO Clock
+ - description: GPLL0 Clock
+
+ clock-names:
+ items:
+ - const: xo
+ - const: alternate
+
+ interrupts:
+ minItems: 1
+ maxItems: 3
+
+ interrupt-names:
+ minItems: 1
+ items:
+ - const: dcvsh-irq-0
+ - const: dcvsh-irq-1
+ - const: dcvsh-irq-2
+
+ '#freq-domain-cells':
+ const: 1
+
+ '#clock-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - '#freq-domain-cells'
+
+additionalProperties: false
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,qcm2290-cpufreq-hw
+ then:
+ properties:
+ reg:
+ minItems: 1
+ maxItems: 1
+
+ reg-names:
+ minItems: 1
+ maxItems: 1
+
+ interrupts:
+ minItems: 1
+ maxItems: 1
+
+ interrupt-names:
+ minItems: 1
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,qdu1000-cpufreq-epss
+ - qcom,sc7180-cpufreq-hw
+ - qcom,sc8280xp-cpufreq-epss
+ - qcom,sdm845-cpufreq-hw
+ - qcom,sm6115-cpufreq-hw
+ - qcom,sm6350-cpufreq-hw
+ - qcom,sm6375-cpufreq-epss
+ then:
+ properties:
+ reg:
+ minItems: 2
+ maxItems: 2
+
+ reg-names:
+ minItems: 2
+ maxItems: 2
+
+ interrupts:
+ minItems: 2
+ maxItems: 2
+
+ interrupt-names:
+ minItems: 2
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sc7280-cpufreq-epss
+ - qcom,sm8250-cpufreq-epss
+ - qcom,sm8350-cpufreq-epss
+ - qcom,sm8450-cpufreq-epss
+ - qcom,sm8550-cpufreq-epss
+ then:
+ properties:
+ reg:
+ minItems: 3
+ maxItems: 3
+
+ reg-names:
+ minItems: 3
+ maxItems: 3
+
+ interrupts:
+ minItems: 3
+ maxItems: 3
+
+ interrupt-names:
+ minItems: 3
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sm8150-cpufreq-hw
+ then:
+ properties:
+ reg:
+ minItems: 3
+ maxItems: 3
+
+ reg-names:
+ minItems: 3
+ maxItems: 3
+
+ # On some SoCs the Prime core shares the LMH irq with Big cores
+ interrupts:
+ minItems: 2
+ maxItems: 2
+
+ interrupt-names:
+ minItems: 2
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-sdm845.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+
+ // Example 1: Dual-cluster, Quad-core per cluster. CPUs within a cluster
+ // switch DCVS state together.
+ cpus {
+ #address-cells = <2>;
+ #size-cells = <0>;
+
+ CPU0: cpu@0 {
+ device_type = "cpu";
+ compatible = "qcom,kryo385";
+ reg = <0x0 0x0>;
+ enable-method = "psci";
+ next-level-cache = <&L2_0>;
+ qcom,freq-domain = <&cpufreq_hw 0>;
+ clocks = <&cpufreq_hw 0>;
+ L2_0: l2-cache {
+ compatible = "cache";
+ cache-unified;
+ cache-level = <2>;
+ next-level-cache = <&L3_0>;
+ L3_0: l3-cache {
+ compatible = "cache";
+ cache-unified;
+ cache-level = <3>;
+ };
+ };
+ };
+
+ CPU1: cpu@100 {
+ device_type = "cpu";
+ compatible = "qcom,kryo385";
+ reg = <0x0 0x100>;
+ enable-method = "psci";
+ next-level-cache = <&L2_100>;
+ qcom,freq-domain = <&cpufreq_hw 0>;
+ clocks = <&cpufreq_hw 0>;
+ L2_100: l2-cache {
+ compatible = "cache";
+ cache-unified;
+ cache-level = <2>;
+ next-level-cache = <&L3_0>;
+ };
+ };
+
+ CPU2: cpu@200 {
+ device_type = "cpu";
+ compatible = "qcom,kryo385";
+ reg = <0x0 0x200>;
+ enable-method = "psci";
+ next-level-cache = <&L2_200>;
+ qcom,freq-domain = <&cpufreq_hw 0>;
+ clocks = <&cpufreq_hw 0>;
+ L2_200: l2-cache {
+ compatible = "cache";
+ cache-unified;
+ cache-level = <2>;
+ next-level-cache = <&L3_0>;
+ };
+ };
+
+ CPU3: cpu@300 {
+ device_type = "cpu";
+ compatible = "qcom,kryo385";
+ reg = <0x0 0x300>;
+ enable-method = "psci";
+ next-level-cache = <&L2_300>;
+ qcom,freq-domain = <&cpufreq_hw 0>;
+ clocks = <&cpufreq_hw 0>;
+ L2_300: l2-cache {
+ compatible = "cache";
+ cache-unified;
+ cache-level = <2>;
+ next-level-cache = <&L3_0>;
+ };
+ };
+
+ CPU4: cpu@400 {
+ device_type = "cpu";
+ compatible = "qcom,kryo385";
+ reg = <0x0 0x400>;
+ enable-method = "psci";
+ next-level-cache = <&L2_400>;
+ qcom,freq-domain = <&cpufreq_hw 1>;
+ clocks = <&cpufreq_hw 1>;
+ L2_400: l2-cache {
+ compatible = "cache";
+ cache-unified;
+ cache-level = <2>;
+ next-level-cache = <&L3_0>;
+ };
+ };
+
+ CPU5: cpu@500 {
+ device_type = "cpu";
+ compatible = "qcom,kryo385";
+ reg = <0x0 0x500>;
+ enable-method = "psci";
+ next-level-cache = <&L2_500>;
+ qcom,freq-domain = <&cpufreq_hw 1>;
+ clocks = <&cpufreq_hw 1>;
+ L2_500: l2-cache {
+ compatible = "cache";
+ cache-unified;
+ cache-level = <2>;
+ next-level-cache = <&L3_0>;
+ };
+ };
+
+ CPU6: cpu@600 {
+ device_type = "cpu";
+ compatible = "qcom,kryo385";
+ reg = <0x0 0x600>;
+ enable-method = "psci";
+ next-level-cache = <&L2_600>;
+ qcom,freq-domain = <&cpufreq_hw 1>;
+ clocks = <&cpufreq_hw 1>;
+ L2_600: l2-cache {
+ compatible = "cache";
+ cache-unified;
+ cache-level = <2>;
+ next-level-cache = <&L3_0>;
+ };
+ };
+
+ CPU7: cpu@700 {
+ device_type = "cpu";
+ compatible = "qcom,kryo385";
+ reg = <0x0 0x700>;
+ enable-method = "psci";
+ next-level-cache = <&L2_700>;
+ qcom,freq-domain = <&cpufreq_hw 1>;
+ clocks = <&cpufreq_hw 1>;
+ L2_700: l2-cache {
+ compatible = "cache";
+ cache-unified;
+ cache-level = <2>;
+ next-level-cache = <&L3_0>;
+ };
+ };
+ };
+
+ soc {
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ cpufreq@17d43000 {
+ compatible = "qcom,sdm845-cpufreq-hw", "qcom,cpufreq-hw";
+ reg = <0x17d43000 0x1400>, <0x17d45800 0x1400>;
+ reg-names = "freq-domain0", "freq-domain1";
+
+ clocks = <&rpmhcc RPMH_CXO_CLK>, <&gcc GPLL0>;
+ clock-names = "xo", "alternate";
+
+ #freq-domain-cells = <1>;
+ #clock-cells = <1>;
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/cpufreq/qcom-cpufreq-nvmem.yaml b/Documentation/devicetree/bindings/cpufreq/qcom-cpufreq-nvmem.yaml
new file mode 100644
index 000000000000..6f5e7904181f
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/qcom-cpufreq-nvmem.yaml
@@ -0,0 +1,204 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/cpufreq/qcom-cpufreq-nvmem.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Technologies, Inc. NVMEM CPUFreq
+
+maintainers:
+ - Ilia Lin <ilia.lin@kernel.org>
+
+description: |
+ In certain Qualcomm Technologies, Inc. SoCs such as the QCS404, the CPU
+ supply voltage is dynamically configured by Core Power Reduction (CPR)
+ depending on the current CPU frequency and efuse values.
+ CPR provides a power domain with multiple levels that are selected depending
+ on the CPU OPP in use. The CPUFreq driver sets the CPR power domain level
+ according to the required OPPs defined in the CPU OPP tables.
+
+ For older implementations, efuses are parsed to select the correct OPP
+ table and voltage, and CPR is not supported/used.
+
+select:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,apq8064
+ - qcom,apq8096
+ - qcom,ipq8064
+ - qcom,msm8939
+ - qcom,msm8960
+ - qcom,msm8974
+ - qcom,msm8996
+ - qcom,qcs404
+ required:
+ - compatible
+
+patternProperties:
+ '^opp-table(-[a-z0-9]+)?$':
+ allOf:
+ - if:
+ properties:
+ compatible:
+ const: operating-points-v2-kryo-cpu
+ then:
+ $ref: /schemas/opp/opp-v2-kryo-cpu.yaml#
+
+ - if:
+ properties:
+ compatible:
+ const: operating-points-v2-qcom-level
+ then:
+ $ref: /schemas/opp/opp-v2-qcom-level.yaml#
+
+ unevaluatedProperties: false
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,qcs404
+
+ then:
+ properties:
+ cpus:
+ type: object
+
+ patternProperties:
+ '^cpu@[0-9a-f]+$':
+ type: object
+
+ properties:
+ power-domains:
+ maxItems: 1
+
+ power-domain-names:
+ items:
+ - const: cpr
+
+ required:
+ - power-domains
+ - power-domain-names
+
+ patternProperties:
+ '^opp-table(-[a-z0-9]+)?$':
+ if:
+ properties:
+ compatible:
+ const: operating-points-v2-kryo-cpu
+ then:
+ patternProperties:
+ '^opp-?[0-9]+$':
+ required:
+ - required-opps
+
+additionalProperties: true
+
+examples:
+ - |
+ / {
+ model = "Qualcomm Technologies, Inc. QCS404 EVB 1000";
+ compatible = "qcom,qcs404-evb-1000", "qcom,qcs404-evb", "qcom,qcs404";
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ cpus {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ CPU0: cpu@100 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x100>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ next-level-cache = <&L2_0>;
+ #cooling-cells = <2>;
+ clocks = <&apcs_glb>;
+ operating-points-v2 = <&cpu_opp_table>;
+ power-domains = <&cpr>;
+ power-domain-names = "cpr";
+ };
+
+ CPU1: cpu@101 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x101>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ next-level-cache = <&L2_0>;
+ #cooling-cells = <2>;
+ clocks = <&apcs_glb>;
+ operating-points-v2 = <&cpu_opp_table>;
+ power-domains = <&cpr>;
+ power-domain-names = "cpr";
+ };
+
+ CPU2: cpu@102 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x102>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ next-level-cache = <&L2_0>;
+ #cooling-cells = <2>;
+ clocks = <&apcs_glb>;
+ operating-points-v2 = <&cpu_opp_table>;
+ power-domains = <&cpr>;
+ power-domain-names = "cpr";
+ };
+
+ CPU3: cpu@103 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x103>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ next-level-cache = <&L2_0>;
+ #cooling-cells = <2>;
+ clocks = <&apcs_glb>;
+ operating-points-v2 = <&cpu_opp_table>;
+ power-domains = <&cpr>;
+ power-domain-names = "cpr";
+ };
+ };
+
+ cpu_opp_table: opp-table-cpu {
+ compatible = "operating-points-v2-kryo-cpu";
+ opp-shared;
+
+ opp-1094400000 {
+ opp-hz = /bits/ 64 <1094400000>;
+ required-opps = <&cpr_opp1>;
+ };
+ opp-1248000000 {
+ opp-hz = /bits/ 64 <1248000000>;
+ required-opps = <&cpr_opp2>;
+ };
+ opp-1401600000 {
+ opp-hz = /bits/ 64 <1401600000>;
+ required-opps = <&cpr_opp3>;
+ };
+ };
+
+ cpr_opp_table: opp-table-cpr {
+ compatible = "operating-points-v2-qcom-level";
+
+ cpr_opp1: opp1 {
+ opp-level = <1>;
+ qcom,opp-fuse-level = <1>;
+ };
+ cpr_opp2: opp2 {
+ opp-level = <2>;
+ qcom,opp-fuse-level = <2>;
+ };
+ cpr_opp3: opp3 {
+ opp-level = <3>;
+ qcom,opp-fuse-level = <3>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/crypto/allwinner,sun4i-a10-crypto.yaml b/Documentation/devicetree/bindings/crypto/allwinner,sun4i-a10-crypto.yaml
index 0429fb774f10..0401c11da8d9 100644
--- a/Documentation/devicetree/bindings/crypto/allwinner,sun4i-a10-crypto.yaml
+++ b/Documentation/devicetree/bindings/crypto/allwinner,sun4i-a10-crypto.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/crypto/allwinner,sun4i-a10-crypto.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Security System Device Tree Bindings
+title: Allwinner A10 Security System
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -44,6 +44,16 @@ properties:
- const: ahb
- const: mod
+ dmas:
+ items:
+ - description: RX DMA Channel
+ - description: TX DMA Channel
+
+ dma-names:
+ items:
+ - const: rx
+ - const: tx
+
resets:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/crypto/allwinner,sun8i-ce.yaml b/Documentation/devicetree/bindings/crypto/allwinner,sun8i-ce.yaml
index 00648f9d9278..4287678aa79f 100644
--- a/Documentation/devicetree/bindings/crypto/allwinner,sun8i-ce.yaml
+++ b/Documentation/devicetree/bindings/crypto/allwinner,sun8i-ce.yaml
@@ -14,6 +14,7 @@ properties:
enum:
- allwinner,sun8i-h3-crypto
- allwinner,sun8i-r40-crypto
+ - allwinner,sun20i-d1-crypto
- allwinner,sun50i-a64-crypto
- allwinner,sun50i-h5-crypto
- allwinner,sun50i-h6-crypto
@@ -29,6 +30,7 @@ properties:
- description: Bus clock
- description: Module clock
- description: MBus clock
+ - description: TRNG clock (RC oscillator)
minItems: 2
clock-names:
@@ -36,6 +38,7 @@ properties:
- const: bus
- const: mod
- const: ram
+ - const: trng
minItems: 2
resets:
@@ -44,19 +47,33 @@ properties:
if:
properties:
compatible:
- const: allwinner,sun50i-h6-crypto
+ enum:
+ - allwinner,sun20i-d1-crypto
then:
properties:
clocks:
- minItems: 3
+ minItems: 4
clock-names:
- minItems: 3
+ minItems: 4
else:
- properties:
- clocks:
- maxItems: 2
- clock-names:
- maxItems: 2
+ if:
+ properties:
+ compatible:
+ const: allwinner,sun50i-h6-crypto
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ clock-names:
+ minItems: 3
+ maxItems: 3
+ else:
+ properties:
+ clocks:
+ maxItems: 2
+ clock-names:
+ maxItems: 2
required:
- compatible
@@ -82,4 +99,3 @@ examples:
clock-names = "bus", "mod";
resets = <&ccu RST_BUS_CE>;
};
-
diff --git a/Documentation/devicetree/bindings/crypto/aspeed,ast2500-hace.yaml b/Documentation/devicetree/bindings/crypto/aspeed,ast2500-hace.yaml
new file mode 100644
index 000000000000..a772d232de09
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/aspeed,ast2500-hace.yaml
@@ -0,0 +1,53 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/aspeed,ast2500-hace.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ASPEED HACE hash and crypto Hardware Accelerator Engines
+
+maintainers:
+ - Neal Liu <neal_liu@aspeedtech.com>
+
+description: |
+ The Hash and Crypto Engine (HACE) is designed to accelerate the throughput
+ of hash data digest, encryption, and decryption. Basically, HACE can be
+ divided into two independent engines - the Hash Engine and the Crypto
+ Engine.
+
+properties:
+ compatible:
+ enum:
+ - aspeed,ast2500-hace
+ - aspeed,ast2600-hace
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ resets:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - interrupts
+ - resets
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/ast2600-clock.h>
+ hace: crypto@1e6d0000 {
+ compatible = "aspeed,ast2600-hace";
+ reg = <0x1e6d0000 0x200>;
+ interrupts = <4>;
+ clocks = <&syscon ASPEED_CLK_GATE_YCLK>;
+ resets = <&syscon ASPEED_RESET_HACE>;
+ };
diff --git a/Documentation/devicetree/bindings/crypto/aspeed,ast2600-acry.yaml b/Documentation/devicetree/bindings/crypto/aspeed,ast2600-acry.yaml
new file mode 100644
index 000000000000..b18f178aac06
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/aspeed,ast2600-acry.yaml
@@ -0,0 +1,49 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/aspeed,ast2600-acry.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ASPEED ACRY ECDSA/RSA Hardware Accelerator Engines
+
+maintainers:
+ - Neal Liu <neal_liu@aspeedtech.com>
+
+description:
+ The ACRY ECDSA/RSA engines are designed to accelerate the throughput
+ of ECDSA/RSA signature and verification. Basically, ACRY can be
+ divided into two independent engines - the ECC Engine and the RSA Engine.
+
+properties:
+ compatible:
+ enum:
+ - aspeed,ast2600-acry
+
+ reg:
+ items:
+ - description: acry base address & size
+ - description: acry sram base address & size
+
+ clocks:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - interrupts
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/ast2600-clock.h>
+ acry: crypto@1e6fa000 {
+ compatible = "aspeed,ast2600-acry";
+ reg = <0x1e6fa000 0x400>, <0x1e710000 0x1800>;
+ interrupts = <160>;
+ clocks = <&syscon ASPEED_CLK_GATE_RSACLK>;
+ };
diff --git a/Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-aes.yaml b/Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-aes.yaml
new file mode 100644
index 000000000000..0b7383b3106b
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-aes.yaml
@@ -0,0 +1,66 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright (C) 2022 Microchip Technology, Inc. and its subsidiaries
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/atmel,at91sam9g46-aes.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Atmel Advanced Encryption Standard (AES) HW cryptographic accelerator
+
+maintainers:
+ - Tudor Ambarus <tudor.ambarus@linaro.org>
+
+properties:
+ compatible:
+ const: atmel,at91sam9g46-aes
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: aes_clk
+
+ dmas:
+ items:
+ - description: TX DMA Channel
+ - description: RX DMA Channel
+
+ dma-names:
+ items:
+ - const: tx
+ - const: rx
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - dmas
+ - dma-names
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/at91.h>
+ #include <dt-bindings/dma/at91.h>
+
+ aes: crypto@e1810000 {
+ compatible = "atmel,at91sam9g46-aes";
+ reg = <0xe1810000 0x100>;
+ interrupts = <GIC_SPI 27 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&pmc PMC_TYPE_PERIPHERAL 27>;
+ clock-names = "aes_clk";
+ dmas = <&dma0 AT91_XDMAC_DT_PERID(1)>,
+ <&dma0 AT91_XDMAC_DT_PERID(2)>;
+ dma-names = "tx", "rx";
+ };
diff --git a/Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-sha.yaml b/Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-sha.yaml
new file mode 100644
index 000000000000..ee2ffb034325
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-sha.yaml
@@ -0,0 +1,60 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright (C) 2022 Microchip Technology, Inc. and its subsidiaries
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/atmel,at91sam9g46-sha.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Atmel Secure Hash Algorithm (SHA) HW cryptographic accelerator
+
+maintainers:
+ - Tudor Ambarus <tudor.ambarus@linaro.org>
+
+properties:
+ compatible:
+ const: atmel,at91sam9g46-sha
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: sha_clk
+
+ dmas:
+ maxItems: 1
+ description: TX DMA Channel
+
+ dma-names:
+ const: tx
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/at91.h>
+ #include <dt-bindings/dma/at91.h>
+
+ sha: crypto@e1814000 {
+ compatible = "atmel,at91sam9g46-sha";
+ reg = <0xe1814000 0x100>;
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&pmc PMC_TYPE_PERIPHERAL 83>;
+ clock-names = "sha_clk";
+ dmas = <&dma0 AT91_XDMAC_DT_PERID(48)>;
+ dma-names = "tx";
+ };
diff --git a/Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-tdes.yaml b/Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-tdes.yaml
new file mode 100644
index 000000000000..3d6ed24b1b00
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/atmel,at91sam9g46-tdes.yaml
@@ -0,0 +1,64 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright (C) 2022 Microchip Technology, Inc. and its subsidiaries
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/atmel,at91sam9g46-tdes.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Atmel Triple Data Encryption Standard (TDES) HW cryptographic accelerator
+
+maintainers:
+ - Tudor Ambarus <tudor.ambarus@linaro.org>
+
+properties:
+ compatible:
+ const: atmel,at91sam9g46-tdes
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: tdes_clk
+
+ dmas:
+ items:
+ - description: TX DMA Channel
+ - description: RX DMA Channel
+
+ dma-names:
+ items:
+ - const: tx
+ - const: rx
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/at91.h>
+ #include <dt-bindings/dma/at91.h>
+
+ tdes: crypto@e2014000 {
+ compatible = "atmel,at91sam9g46-tdes";
+ reg = <0xe2014000 0x100>;
+ interrupts = <GIC_SPI 96 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&pmc PMC_TYPE_PERIPHERAL 96>;
+ clock-names = "tdes_clk";
+ dmas = <&dma0 AT91_XDMAC_DT_PERID(54)>,
+ <&dma0 AT91_XDMAC_DT_PERID(53)>;
+ dma-names = "tx", "rx";
+ };
diff --git a/Documentation/devicetree/bindings/crypto/atmel-crypto.txt b/Documentation/devicetree/bindings/crypto/atmel-crypto.txt
deleted file mode 100644
index f2aab3dc2b52..000000000000
--- a/Documentation/devicetree/bindings/crypto/atmel-crypto.txt
+++ /dev/null
@@ -1,68 +0,0 @@
-* Atmel HW cryptographic accelerators
-
-These are the HW cryptographic accelerators found on some Atmel products.
-
-* Advanced Encryption Standard (AES)
-
-Required properties:
-- compatible : Should be "atmel,at91sam9g46-aes".
-- reg: Should contain AES registers location and length.
-- interrupts: Should contain the IRQ line for the AES.
-- dmas: List of two DMA specifiers as described in
- atmel-dma.txt and dma.txt files.
-- dma-names: Contains one identifier string for each DMA specifier
- in the dmas property.
-
-Example:
-aes@f8038000 {
- compatible = "atmel,at91sam9g46-aes";
- reg = <0xf8038000 0x100>;
- interrupts = <43 4 0>;
- dmas = <&dma1 2 18>,
- <&dma1 2 19>;
- dma-names = "tx", "rx";
-
-* Triple Data Encryption Standard (Triple DES)
-
-Required properties:
-- compatible : Should be "atmel,at91sam9g46-tdes".
-- reg: Should contain TDES registers location and length.
-- interrupts: Should contain the IRQ line for the TDES.
-
-Optional properties:
-- dmas: List of two DMA specifiers as described in
- atmel-dma.txt and dma.txt files.
-- dma-names: Contains one identifier string for each DMA specifier
- in the dmas property.
-
-Example:
-tdes@f803c000 {
- compatible = "atmel,at91sam9g46-tdes";
- reg = <0xf803c000 0x100>;
- interrupts = <44 4 0>;
- dmas = <&dma1 2 20>,
- <&dma1 2 21>;
- dma-names = "tx", "rx";
-};
-
-* Secure Hash Algorithm (SHA)
-
-Required properties:
-- compatible : Should be "atmel,at91sam9g46-sha".
-- reg: Should contain SHA registers location and length.
-- interrupts: Should contain the IRQ line for the SHA.
-
-Optional properties:
-- dmas: One DMA specifiers as described in
- atmel-dma.txt and dma.txt files.
-- dma-names: Contains one identifier string for each DMA specifier
- in the dmas property. Only one "tx" string needed.
-
-Example:
-sha@f8034000 {
- compatible = "atmel,at91sam9g46-sha";
- reg = <0xf8034000 0x100>;
- interrupts = <42 4 0>;
- dmas = <&dma1 2 17>;
- dma-names = "tx";
-};
diff --git a/Documentation/devicetree/bindings/crypto/fsl,sec-v4.0-mon.yaml b/Documentation/devicetree/bindings/crypto/fsl,sec-v4.0-mon.yaml
new file mode 100644
index 000000000000..286dffa0671b
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/fsl,sec-v4.0-mon.yaml
@@ -0,0 +1,156 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2008-2011 Freescale Semiconductor Inc.
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/fsl,sec-v4.0-mon.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale Secure Non-Volatile Storage (SNVS)
+
+maintainers:
+ - '"Horia Geantă" <horia.geanta@nxp.com>'
+ - Pankaj Gupta <pankaj.gupta@nxp.com>
+ - Gaurav Jain <gaurav.jain@nxp.com>
+
+description:
+ Node defines the address range and the associated interrupt for the SNVS
+ function. This function monitors security state information and reports
+ security violations. It also includes the RTC, system power off and the
+ ON/OFF key.
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - const: fsl,sec-v4.0-mon
+ - const: syscon
+ - const: simple-mfd
+ - items:
+ - const: fsl,sec-v5.0-mon
+ - const: fsl,sec-v4.0-mon
+ - items:
+ - enum:
+ - fsl,sec-v5.3-mon
+ - fsl,sec-v5.4-mon
+ - const: fsl,sec-v5.0-mon
+ - const: fsl,sec-v4.0-mon
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 2
+
+ snvs-rtc-lp:
+ type: object
+ additionalProperties: false
+ description:
+ Secure Non-Volatile Storage (SNVS) Low Power (LP) RTC Node
+
+ properties:
+ compatible:
+ const: fsl,sec-v4.0-mon-rtc-lp
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: snvs-rtc
+
+ interrupts:
+ # VFxxx has only one. What is the 2nd one?
+ minItems: 1
+ maxItems: 2
+
+ regmap:
+ description: Parent node containing registers
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ offset:
+ description: LP register offset
+ $ref: /schemas/types.yaml#/definitions/uint32
+ default: 0x34
+
+ required:
+ - compatible
+ - interrupts
+ - regmap
+
+ snvs-powerkey:
+ type: object
+ additionalProperties: false
+ description:
+ The snvs-pwrkey is designed to enable the POWER key function, which is
+ controlled by SNVS ONOFF; the driver can report the status of the POWER
+ key and wake up the system if it is pressed after system suspend.
+
+ properties:
+ compatible:
+ const: fsl,sec-v4.0-pwrkey
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: snvs-pwrkey
+
+ interrupts:
+ maxItems: 1
+
+ regmap:
+ description: Parent node containing registers
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ wakeup-source: true
+
+ linux,keycode:
+ default: 116
+
+ required:
+ - compatible
+ - interrupts
+ - regmap
+
+ snvs-lpgpr:
+ $ref: /schemas/nvmem/snvs-lpgpr.yaml#
+
+ snvs-poweroff:
+ description:
+ The SNVS can drive a signal to the PMIC to turn off system power by
+ setting the SNVS_LP LPCR register.
+ $ref: /schemas/power/reset/syscon-poweroff.yaml#
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/imx7d-clock.h>
+
+ sec_mon: sec-mon@314000 {
+ compatible = "fsl,sec-v4.0-mon", "syscon", "simple-mfd";
+ reg = <0x314000 0x1000>;
+
+ snvs-rtc-lp {
+ compatible = "fsl,sec-v4.0-mon-rtc-lp";
+ regmap = <&sec_mon>;
+ offset = <0x34>;
+ clocks = <&clks IMX7D_SNVS_CLK>;
+ clock-names = "snvs-rtc";
+ interrupts = <GIC_SPI 19 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 20 IRQ_TYPE_LEVEL_HIGH>;
+ };
+
+ snvs-powerkey {
+ compatible = "fsl,sec-v4.0-pwrkey";
+ regmap = <&sec_mon>;
+ clocks = <&clks IMX7D_SNVS_CLK>;
+ clock-names = "snvs-pwrkey";
+ interrupts = <GIC_SPI 4 IRQ_TYPE_LEVEL_HIGH>;
+ linux,keycode = <116>; /* KEY_POWER */
+ wakeup-source;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/crypto/fsl,sec-v4.0.yaml b/Documentation/devicetree/bindings/crypto/fsl,sec-v4.0.yaml
new file mode 100644
index 000000000000..0a9ed2848b7c
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/fsl,sec-v4.0.yaml
@@ -0,0 +1,266 @@
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2008-2011 Freescale Semiconductor Inc.
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/fsl,sec-v4.0.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale SEC 4
+
+maintainers:
+ - '"Horia Geantă" <horia.geanta@nxp.com>'
+ - Pankaj Gupta <pankaj.gupta@nxp.com>
+ - Gaurav Jain <gaurav.jain@nxp.com>
+
+description: |
+ NOTE: the SEC 4 is also known as Freescale's Cryptographic Accelerator
+ and Assurance Module (CAAM).
+
+ SEC 4 h/w can process requests from 2 types of sources.
+ 1. DPAA Queue Interface (HW interface between Queue Manager & SEC 4).
+ 2. Job Rings (HW interface between cores & SEC 4 registers).
+
+ High Speed Data Path Configuration:
+
+ HW interface between QM & SEC 4 and also BM & SEC 4, on DPAA-enabled parts
+ such as the P4080. The number of simultaneous dequeues the QI can make is
+ equal to the number of Descriptor Controller (DECO) engines in a particular
+ SEC version. E.g., the SEC 4.0 in the P4080 has 5 DECOs and can thus
+ dequeue from 5 subportals simultaneously.
+
+ Job Ring Data Path Configuration:
+
+ Each JR is located on a separate 4k page; they may (or may not) be made
+ visible in the memory partition devoted to a particular core. The P4080
+ has 4 JRs, so up to 4 JRs can be configured; and all 4 JRs process
+ requests in parallel.
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - const: fsl,sec-v5.4
+ - const: fsl,sec-v5.0
+ - const: fsl,sec-v4.0
+ - items:
+ - enum:
+ - fsl,imx6ul-caam
+ - fsl,sec-v5.0
+ - const: fsl,sec-v4.0
+ - const: fsl,sec-v4.0
+
+ reg:
+ maxItems: 1
+
+ ranges:
+ maxItems: 1
+
+ '#address-cells':
+ enum: [1, 2]
+
+ '#size-cells':
+ enum: [1, 2]
+
+ clocks:
+ minItems: 1
+ maxItems: 4
+
+ clock-names:
+ minItems: 1
+ maxItems: 4
+ items:
+ enum: [mem, aclk, ipg, emi_slow]
+
+ dma-coherent: true
+
+ interrupts:
+ maxItems: 1
+
+ fsl,sec-era:
+ description: Defines the 'ERA' of the SEC device.
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+patternProperties:
+ '^jr@[0-9a-f]+$':
+ type: object
+ additionalProperties: false
+ description:
+ Job Ring (JR) Node. Defines data processing interface to SEC 4 across the
+ peripheral bus for purposes of processing cryptographic descriptors. The
+ specified address range can be made visible to one (or more) cores. The
+ interrupt defined for this node is controlled within the address range of
+ this node.
+
+ properties:
+ compatible:
+ oneOf:
+ - items:
+ - const: fsl,sec-v5.4-job-ring
+ - const: fsl,sec-v5.0-job-ring
+ - const: fsl,sec-v4.0-job-ring
+ - items:
+ - const: fsl,sec-v5.0-job-ring
+ - const: fsl,sec-v4.0-job-ring
+ - const: fsl,sec-v4.0-job-ring
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ fsl,liodn:
+ description:
+ Specifies the LIODN to be used in conjunction with the ppid-to-liodn
+ table that specifies the PPID to LIODN mapping. Needed if the PAMU is
+ used. The value is a 12-bit LIODN ID for this JR. This property is
+ normally set by boot firmware.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ maximum: 0xfff
+
+ '^rtic@[0-9a-f]+$':
+ type: object
+ additionalProperties: false
+ description:
+ Run Time Integrity Check (RTIC) Node. Defines a register space that
+ contains up to 5 sets of addresses and their lengths (sizes) that will be
+ checked at run time. After an initial hash result is calculated, these
+ addresses are checked by HW to monitor any change. If any memory is
+ modified, a Security Violation is triggered (see SNVS definition).
+
+ properties:
+ compatible:
+ oneOf:
+ - items:
+ - const: fsl,sec-v5.4-rtic
+ - const: fsl,sec-v5.0-rtic
+ - const: fsl,sec-v4.0-rtic
+ - const: fsl,sec-v4.0-rtic
+
+ reg:
+ maxItems: 1
+
+ ranges:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ '#address-cells':
+ const: 1
+
+ '#size-cells':
+ const: 1
+
+ patternProperties:
+ '^rtic-[a-z]@[0-9a-f]+$':
+ type: object
+ additionalProperties: false
+ description:
+ Run Time Integrity Check (RTIC) Memory Node defines individual RTIC
+ memory regions that are used to perform run-time integrity checks of
+ memory areas that should not be modified. The node defines a register
+ that contains the memory address & length (combined) and a second
+ register that contains the hash result in big endian format.
+
+ properties:
+ compatible:
+ oneOf:
+ - items:
+ - const: fsl,sec-v5.4-rtic-memory
+ - const: fsl,sec-v5.0-rtic-memory
+ - const: fsl,sec-v4.0-rtic-memory
+ - const: fsl,sec-v4.0-rtic-memory
+
+ reg:
+ items:
+ - description: RTIC memory address
+ - description: RTIC hash result
+
+ fsl,liodn:
+ description:
+ Specifies the LIODN to be used in conjunction with the
+ ppid-to-liodn table that specifies the PPID to LIODN mapping.
+ Needed if the PAMU is used. The value is a 12-bit LIODN ID for
+ this RTIC memory region. This property is normally set by boot
+ firmware.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ maximum: 0xfff
+
+ fsl,rtic-region:
+ description:
+ Specifies the HW address (36 bit address) for this region
+ followed by the length of the HW partition to be checked;
+ the address is represented as a 64 bit quantity followed
+ by a 32 bit length.
+ $ref: /schemas/types.yaml#/definitions/uint32-array
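+ # Encoding sketch with hypothetical values: a region at 36-bit address
+ # 0x1_2345_0000 with a 64 KiB length would be
+ # fsl,rtic-region = <0x00000001 0x23450000 0x00010000>;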
+
+required:
+ - compatible
+ - reg
+ - ranges
+
+additionalProperties: false
+
+examples:
+ - |
+ crypto@300000 {
+ compatible = "fsl,sec-v4.0";
+ #address-cells = <1>;
+ #size-cells = <1>;
+ reg = <0x300000 0x10000>;
+ ranges = <0 0x300000 0x10000>;
+ interrupts = <92 2>;
+
+ jr@1000 {
+ compatible = "fsl,sec-v4.0-job-ring";
+ reg = <0x1000 0x1000>;
+ interrupts = <88 2>;
+ };
+
+ jr@2000 {
+ compatible = "fsl,sec-v4.0-job-ring";
+ reg = <0x2000 0x1000>;
+ interrupts = <89 2>;
+ };
+
+ jr@3000 {
+ compatible = "fsl,sec-v4.0-job-ring";
+ reg = <0x3000 0x1000>;
+ interrupts = <90 2>;
+ };
+
+ jr@4000 {
+ compatible = "fsl,sec-v4.0-job-ring";
+ reg = <0x4000 0x1000>;
+ interrupts = <91 2>;
+ };
+
+ rtic@6000 {
+ compatible = "fsl,sec-v4.0-rtic";
+ #address-cells = <1>;
+ #size-cells = <1>;
+ reg = <0x6000 0x100>;
+ ranges = <0x0 0x6100 0xe00>;
+
+ rtic-a@0 {
+ compatible = "fsl,sec-v4.0-rtic-memory";
+ reg = <0x00 0x20>, <0x100 0x80>;
+ };
+
+ rtic-b@20 {
+ compatible = "fsl,sec-v4.0-rtic-memory";
+ reg = <0x20 0x20>, <0x200 0x80>;
+ };
+
+ rtic-c@40 {
+ compatible = "fsl,sec-v4.0-rtic-memory";
+ reg = <0x40 0x20>, <0x300 0x80>;
+ };
+
+ rtic-d@60 {
+ compatible = "fsl,sec-v4.0-rtic-memory";
+ reg = <0x60 0x20>, <0x500 0x80>;
+ };
+ };
+ };
+...
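For i.MX parts, where the schema's clocks and clock-names entries apply, a
minimal node sketch based on the example carried in the old text binding
(clock indices from the i.MX6 dt-bindings header) looks like:

    crypto@300000 {
        compatible = "fsl,sec-v4.0";
        fsl,sec-era = <2>;
        #address-cells = <1>;
        #size-cells = <1>;
        reg = <0x300000 0x10000>;
        ranges = <0 0x300000 0x10000>;
        interrupts = <92 2>;
        clocks = <&clks IMX6QDL_CLK_CAAM_MEM>,
                 <&clks IMX6QDL_CLK_CAAM_ACLK>,
                 <&clks IMX6QDL_CLK_CAAM_IPG>,
                 <&clks IMX6QDL_CLK_EIM_SLOW>;
        clock-names = "mem", "aclk", "ipg", "emi_slow";
    };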
diff --git a/Documentation/devicetree/bindings/crypto/fsl-sec4.txt b/Documentation/devicetree/bindings/crypto/fsl-sec4.txt
deleted file mode 100644
index 8f359f473ada..000000000000
--- a/Documentation/devicetree/bindings/crypto/fsl-sec4.txt
+++ /dev/null
@@ -1,553 +0,0 @@
-=====================================================================
-SEC 4 Device Tree Binding
-Copyright (C) 2008-2011 Freescale Semiconductor Inc.
-
- CONTENTS
- -Overview
- -SEC 4 Node
- -Job Ring Node
- -Run Time Integrity Check (RTIC) Node
- -Run Time Integrity Check (RTIC) Memory Node
- -Secure Non-Volatile Storage (SNVS) Node
- -Secure Non-Volatile Storage (SNVS) Low Power (LP) RTC Node
- -Full Example
-
-NOTE: the SEC 4 is also known as Freescale's Cryptographic Accelerator
-and Assurance Module (CAAM).
-
-=====================================================================
-Overview
-
-DESCRIPTION
-
-SEC 4 h/w can process requests from 2 types of sources.
-1. DPAA Queue Interface (HW interface between Queue Manager & SEC 4).
-2. Job Rings (HW interface between cores & SEC 4 registers).
-
-High Speed Data Path Configuration:
-
-HW interface between QM & SEC 4 and also BM & SEC 4, on DPAA-enabled parts
-such as the P4080. The number of simultaneous dequeues the QI can make is
-equal to the number of Descriptor Controller (DECO) engines in a particular
-SEC version. E.g., the SEC 4.0 in the P4080 has 5 DECOs and can thus
-dequeue from 5 subportals simultaneously.
-
-Job Ring Data Path Configuration:
-
-Each JR is located on a separate 4k page; it may (or may not) be made visible
-in the memory partition devoted to a particular core. The P4080 has 4 JRs, so
-up to 4 JRs can be configured, and all 4 JRs process requests in parallel.
-
-=====================================================================
-SEC 4 Node
-
-Description
-
- Node defines the base address of the SEC 4 block.
- This block specifies the address range of all global
- configuration registers for the SEC 4 block. It
- also receives interrupts from the Run Time Integrity Check
- (RTIC) function within the SEC 4 block.
-
-PROPERTIES
-
- - compatible
- Usage: required
- Value type: <string>
- Definition: Must include "fsl,sec-v4.0"
-
- - fsl,sec-era
- Usage: optional
- Value type: <u32>
- Definition: A standard property. Define the 'ERA' of the SEC
- device.
-
- - #address-cells
- Usage: required
- Value type: <u32>
- Definition: A standard property. Defines the number of cells
- for representing physical addresses in child nodes.
-
- - #size-cells
- Usage: required
- Value type: <u32>
- Definition: A standard property. Defines the number of cells
- for representing the size of physical addresses in
- child nodes.
-
- - reg
- Usage: required
- Value type: <prop-encoded-array>
- Definition: A standard property. Specifies the physical
- address and length of the SEC4 configuration registers.
-
- - ranges
- Usage: required
- Value type: <prop-encoded-array>
- Definition: A standard property. Specifies the physical address
- range of the SEC 4.0 register space (SNVS not included). A
- triplet that includes the child address, parent address, &
- length.
-
- - interrupts
- Usage: required
- Value type: <prop_encoded-array>
- Definition: Specifies the interrupts generated by this
- device. The value of the interrupts property
- consists of one interrupt specifier. The format
- of the specifier is defined by the binding document
- describing the node's interrupt parent.
-
- - clocks
- Usage: required if SEC 4.0 requires explicit enablement of clocks
- Value type: <prop_encoded-array>
- Definition: A list of phandle and clock specifier pairs describing
- the clocks required for enabling and disabling SEC 4.0.
-
- - clock-names
- Usage: required if SEC 4.0 requires explicit enablement of clocks
- Value type: <string>
- Definition: A list of clock name strings in the same order as the
- clocks property.
-
- Note: All other standard properties (see the Devicetree Specification)
- are allowed but are optional.
-
-
-EXAMPLE
-
-iMX6QDL/SX requires four clocks
-
- crypto@300000 {
- compatible = "fsl,sec-v4.0";
- fsl,sec-era = <2>;
- #address-cells = <1>;
- #size-cells = <1>;
- reg = <0x300000 0x10000>;
- ranges = <0 0x300000 0x10000>;
- interrupt-parent = <&mpic>;
- interrupts = <92 2>;
- clocks = <&clks IMX6QDL_CLK_CAAM_MEM>,
- <&clks IMX6QDL_CLK_CAAM_ACLK>,
- <&clks IMX6QDL_CLK_CAAM_IPG>,
- <&clks IMX6QDL_CLK_EIM_SLOW>;
- clock-names = "mem", "aclk", "ipg", "emi_slow";
- };
-
-
-iMX6UL only requires three clocks
-
- crypto: crypto@2140000 {
- compatible = "fsl,sec-v4.0";
- #address-cells = <1>;
- #size-cells = <1>;
- reg = <0x2140000 0x3c000>;
- ranges = <0 0x2140000 0x3c000>;
- interrupts = <GIC_SPI 48 IRQ_TYPE_LEVEL_HIGH>;
-
- clocks = <&clks IMX6UL_CLK_CAAM_MEM>,
- <&clks IMX6UL_CLK_CAAM_ACLK>,
- <&clks IMX6UL_CLK_CAAM_IPG>;
- clock-names = "mem", "aclk", "ipg";
- };
-
-=====================================================================
-Job Ring (JR) Node
-
- Child of the crypto node defines data processing interface to SEC 4
- across the peripheral bus for purposes of processing
- cryptographic descriptors. The specified address
- range can be made visible to one (or more) cores.
- The interrupt defined for this node is controlled within
- the address range of this node.
-
- - compatible
- Usage: required
- Value type: <string>
- Definition: Must include "fsl,sec-v4.0-job-ring"
-
- - reg
- Usage: required
- Value type: <prop-encoded-array>
- Definition: Specifies two JR parameters: an offset from
- the parent physical address and the length of the JR registers.
-
- - fsl,liodn
- Usage: optional-but-recommended
- Value type: <prop-encoded-array>
- Definition:
- Specifies the LIODN to be used in conjunction with
- the ppid-to-liodn table that specifies the PPID to LIODN mapping.
- Needed if the PAMU is used. The value is a 12-bit LIODN ID
- for this JR. This property is
- normally set by boot firmware.
-
- - interrupts
- Usage: required
- Value type: <prop_encoded-array>
- Definition: Specifies the interrupts generated by this
- device. The value of the interrupts property
- consists of one interrupt specifier. The format
- of the specifier is defined by the binding document
- describing the node's interrupt parent.
-
-EXAMPLE
- jr@1000 {
- compatible = "fsl,sec-v4.0-job-ring";
- reg = <0x1000 0x1000>;
- fsl,liodn = <0x081>;
- interrupt-parent = <&mpic>;
- interrupts = <88 2>;
- };
-
-
-=====================================================================
-Run Time Integrity Check (RTIC) Node
-
- Child node of the crypto node. Defines a register space that
- contains up to 5 sets of addresses and their lengths (sizes) that
- will be checked at run time. After an initial hash result is
- calculated, these addresses are checked by HW to monitor any
- change. If any memory is modified, a Security Violation is
- triggered (see SNVS definition).
-
-
- - compatible
- Usage: required
- Value type: <string>
- Definition: Must include "fsl,sec-v4.0-rtic".
-
- - #address-cells
- Usage: required
- Value type: <u32>
- Definition: A standard property. Defines the number of cells
- for representing physical addresses in child nodes. Must
- have a value of 1.
-
- - #size-cells
- Usage: required
- Value type: <u32>
- Definition: A standard property. Defines the number of cells
- for representing the size of physical addresses in
- child nodes. Must have a value of 1.
-
- - reg
- Usage: required
- Value type: <prop-encoded-array>
- Definition: A standard property. Specifies two parameters:
- an offset from the parent physical address and the length of
- the SEC4 registers.
-
- - ranges
- Usage: required
- Value type: <prop-encoded-array>
- Definition: A standard property. Specifies the physical address
- range of the SEC 4 register space (SNVS not included). A
- triplet that includes the child address, parent address, &
- length.
-
-EXAMPLE
- rtic@6000 {
- compatible = "fsl,sec-v4.0-rtic";
- #address-cells = <1>;
- #size-cells = <1>;
- reg = <0x6000 0x100>;
- ranges = <0x0 0x6100 0xe00>;
- };
-
-=====================================================================
-Run Time Integrity Check (RTIC) Memory Node
- A child node that defines individual RTIC memory regions that are used to
- perform a run-time integrity check of memory areas that should not be modified.
- The node defines a register that contains the memory address &
- length (combined) and a second register that contains the hash result
- in big endian format.
-
- - compatible
- Usage: required
- Value type: <string>
- Definition: Must include "fsl,sec-v4.0-rtic-memory".
-
- - reg
- Usage: required
- Value type: <prop-encoded-array>
- Definition: A standard property. Specifies two parameters:
- an offset from the parent physical address and the length:
-
- 1. The location of the RTIC memory address & length registers.
- 2. The location of the RTIC hash result.
-
- - fsl,rtic-region
- Usage: optional-but-recommended
- Value type: <prop-encoded-array>
- Definition:
- Specifies the HW address (36 bit address) for this region
- followed by the length of the HW partition to be checked;
- the address is represented as a 64 bit quantity followed
- by a 32 bit length.
-
- - fsl,liodn
- Usage: optional-but-recommended
- Value type: <prop-encoded-array>
- Definition:
- Specifies the LIODN to be used in conjunction with
- the ppid-to-liodn table that specifies the PPID to LIODN
- mapping. Needed if the PAMU is used. The value is a 12-bit
- LIODN ID for this RTIC memory region. This
- property is normally set by boot firmware.
-
-EXAMPLE
- rtic-a@0 {
- compatible = "fsl,sec-v4.0-rtic-memory";
- reg = <0x00 0x20 0x100 0x80>;
- fsl,liodn = <0x03c>;
- fsl,rtic-region = <0x12345678 0x12345678 0x12345678>;
- };
-
-=====================================================================
-Secure Non-Volatile Storage (SNVS) Node
-
- Node defines address range and the associated
- interrupt for the SNVS function. This function
- monitors security state information & reports
- security violations. It also includes the RTC, system
- power off and the ON/OFF key.
-
- - compatible
- Usage: required
- Value type: <string>
- Definition: Must include "fsl,sec-v4.0-mon" and "syscon".
-
- - reg
- Usage: required
- Value type: <prop-encoded-array>
- Definition: A standard property. Specifies the physical
- address and length of the SEC4 configuration
- registers.
-
- - #address-cells
- Usage: required
- Value type: <u32>
- Definition: A standard property. Defines the number of cells
- for representing physical addresses in child nodes. Must
- have a value of 1.
-
- - #size-cells
- Usage: required
- Value type: <u32>
- Definition: A standard property. Defines the number of cells
- for representing the size of physical addresses in
- child nodes. Must have a value of 1.
-
- - ranges
- Usage: required
- Value type: <prop-encoded-array>
- Definition: A standard property. Specifies the physical address
- range of the SNVS register space. A triplet that includes
- the child address, parent address, & length.
-
- - interrupts
- Usage: optional
- Value type: <prop_encoded-array>
- Definition: Specifies the interrupts generated by this
- device. The value of the interrupts property
- consists of one interrupt specifier. The format
- of the specifier is defined by the binding document
- describing the node's interrupt parent.
-
-EXAMPLE
- sec_mon@314000 {
- compatible = "fsl,sec-v4.0-mon", "syscon";
- reg = <0x314000 0x1000>;
- ranges = <0 0x314000 0x1000>;
- interrupt-parent = <&mpic>;
- interrupts = <93 2>;
- };
-
-=====================================================================
-Secure Non-Volatile Storage (SNVS) Low Power (LP) RTC Node
-
- A SNVS child node that defines SNVS LP RTC.
-
- - compatible
- Usage: required
- Value type: <string>
- Definition: Must include "fsl,sec-v4.0-mon-rtc-lp".
-
- - interrupts
- Usage: required
- Value type: <prop_encoded-array>
- Definition: Specifies the interrupts generated by this
- device. The value of the interrupts property
- consists of one interrupt specifier. The format
- of the specifier is defined by the binding document
- describing the node's interrupt parent.
-
- - regmap
- Usage: required
- Value type: <phandle>
- Definition: this is a phandle to the register map node.
-
- - offset
- Usage: optional
- Value type: <u32>
- Definition: LP register offset. By default it is 0x34.
-
- - clocks
- Usage: optional, required if SNVS LP RTC requires explicit
- enablement of clocks
- Value type: <prop_encoded-array>
- Definition: a clock specifier describing the clock required for
- enabling and disabling SNVS LP RTC.
-
- - clock-names
- Usage: optional, required if SNVS LP RTC requires explicit
- enablement of clocks
- Value type: <string>
- Definition: clock name string should be "snvs-rtc".
-
-EXAMPLE
- sec_mon_rtc_lp@1 {
- compatible = "fsl,sec-v4.0-mon-rtc-lp";
- interrupts = <93 2>;
- regmap = <&snvs>;
- offset = <0x34>;
- clocks = <&clks IMX7D_SNVS_CLK>;
- clock-names = "snvs-rtc";
- };
-
-=====================================================================
-System ON/OFF key driver
-
- The snvs-pwrkey is designed to enable the POWER key function, which is
- controlled by SNVS ONOFF; the driver can report the status of the POWER
- key and wake up the system if it is pressed after system suspend.
-
- - compatible:
- Usage: required
- Value type: <string>
- Definition: Must include "fsl,sec-v4.0-pwrkey".
-
- - interrupts:
- Usage: required
- Value type: <prop_encoded-array>
- Definition: The SNVS ON/OFF interrupt number to the CPU(s).
-
- - linux,keycode:
- Usage: optional
- Value type: <int>
- Definition: Keycode to emit, KEY_POWER by default.
-
- - wakeup-source:
- Usage: optional
- Value type: <bool>
- Definition: Button can wake-up the system.
-
- - regmap:
- Usage: required
- Value type: <phandle>
- Definition: this is a phandle to the register map node.
-
-EXAMPLE:
- snvs-pwrkey@020cc000 {
- compatible = "fsl,sec-v4.0-pwrkey";
- regmap = <&snvs>;
- interrupts = <0 4 0x4>;
- linux,keycode = <116>; /* KEY_POWER */
- wakeup-source;
- };
-
-=====================================================================
-FULL EXAMPLE
-
- crypto: crypto@300000 {
- compatible = "fsl,sec-v4.0";
- #address-cells = <1>;
- #size-cells = <1>;
- reg = <0x300000 0x10000>;
- ranges = <0 0x300000 0x10000>;
- interrupt-parent = <&mpic>;
- interrupts = <92 2>;
-
- sec_jr0: jr@1000 {
- compatible = "fsl,sec-v4.0-job-ring";
- reg = <0x1000 0x1000>;
- interrupt-parent = <&mpic>;
- interrupts = <88 2>;
- };
-
- sec_jr1: jr@2000 {
- compatible = "fsl,sec-v4.0-job-ring";
- reg = <0x2000 0x1000>;
- interrupt-parent = <&mpic>;
- interrupts = <89 2>;
- };
-
- sec_jr2: jr@3000 {
- compatible = "fsl,sec-v4.0-job-ring";
- reg = <0x3000 0x1000>;
- interrupt-parent = <&mpic>;
- interrupts = <90 2>;
- };
-
- sec_jr3: jr@4000 {
- compatible = "fsl,sec-v4.0-job-ring";
- reg = <0x4000 0x1000>;
- interrupt-parent = <&mpic>;
- interrupts = <91 2>;
- };
-
- rtic@6000 {
- compatible = "fsl,sec-v4.0-rtic";
- #address-cells = <1>;
- #size-cells = <1>;
- reg = <0x6000 0x100>;
- ranges = <0x0 0x6100 0xe00>;
-
- rtic_a: rtic-a@0 {
- compatible = "fsl,sec-v4.0-rtic-memory";
- reg = <0x00 0x20 0x100 0x80>;
- };
-
- rtic_b: rtic-b@20 {
- compatible = "fsl,sec-v4.0-rtic-memory";
- reg = <0x20 0x20 0x200 0x80>;
- };
-
- rtic_c: rtic-c@40 {
- compatible = "fsl,sec-v4.0-rtic-memory";
- reg = <0x40 0x20 0x300 0x80>;
- };
-
- rtic_d: rtic-d@60 {
- compatible = "fsl,sec-v4.0-rtic-memory";
- reg = <0x60 0x20 0x500 0x80>;
- };
- };
- };
-
- sec_mon: sec_mon@314000 {
- compatible = "fsl,sec-v4.0-mon";
- reg = <0x314000 0x1000>;
- ranges = <0 0x314000 0x1000>;
-
- sec_mon_rtc_lp@34 {
- compatible = "fsl,sec-v4.0-mon-rtc-lp";
- regmap = <&sec_mon>;
- offset = <0x34>;
- interrupts = <93 2>;
- clocks = <&clks IMX7D_SNVS_CLK>;
- clock-names = "snvs-rtc";
- };
-
- snvs-pwrkey@020cc000 {
- compatible = "fsl,sec-v4.0-pwrkey";
- regmap = <&sec_mon>;
- interrupts = <0 4 0x4>;
- linux,keycode = <116>; /* KEY_POWER */
- wakeup-source;
- };
- };
-
-=====================================================================
diff --git a/Documentation/devicetree/bindings/crypto/intel,ixp4xx-crypto.yaml b/Documentation/devicetree/bindings/crypto/intel,ixp4xx-crypto.yaml
index 9c53c27bd20a..e0fe63957888 100644
--- a/Documentation/devicetree/bindings/crypto/intel,ixp4xx-crypto.yaml
+++ b/Documentation/devicetree/bindings/crypto/intel,ixp4xx-crypto.yaml
@@ -22,19 +22,28 @@ properties:
intel,npe-handle:
$ref: '/schemas/types.yaml#/definitions/phandle-array'
- maxItems: 1
+ items:
+ - items:
+ - description: phandle to the NPE this crypto engine is using
+ - description: the NPE instance number
description: phandle to the NPE this crypto engine is using, the cell
describing the NPE instance to be used.
queue-rx:
$ref: /schemas/types.yaml#/definitions/phandle-array
- maxItems: 1
+ items:
+ - items:
+ - description: phandle to the RX queue on the NPE
+ - description: the queue instance number
description: phandle to the RX queue on the NPE, the cell describing
the queue instance to be used.
queue-txready:
$ref: /schemas/types.yaml#/definitions/phandle-array
- maxItems: 1
+ items:
+ - items:
+ - description: phandle to the TX READY queue on the NPE
+ - description: the queue instance number
description: phandle to the TX READY queue on the NPE, the cell describing
the queue instance to be used.
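For context, a consumer node wiring up these two-cell specifiers might look
as follows (a sketch; the NPE and queue numbers mirror those used in typical
IXP4xx device trees and are illustrative):

    crypto {
        compatible = "intel,ixp4xx-crypto";
        intel,npe-handle = <&npe 2>;
        queue-rx = <&qmgr 30>;
        queue-txready = <&qmgr 29>;
    };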
diff --git a/Documentation/devicetree/bindings/crypto/intel,keembay-ocs-aes.yaml b/Documentation/devicetree/bindings/crypto/intel,keembay-ocs-aes.yaml
index ee2c099981b2..fedd8be56ad6 100644
--- a/Documentation/devicetree/bindings/crypto/intel,keembay-ocs-aes.yaml
+++ b/Documentation/devicetree/bindings/crypto/intel,keembay-ocs-aes.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/crypto/intel,keembay-ocs-aes.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Intel Keem Bay OCS AES Device Tree Bindings
+title: Intel Keem Bay OCS AES
maintainers:
- Daniele Alessandrelli <daniele.alessandrelli@intel.com>
diff --git a/Documentation/devicetree/bindings/crypto/intel,keembay-ocs-ecc.yaml b/Documentation/devicetree/bindings/crypto/intel,keembay-ocs-ecc.yaml
new file mode 100644
index 000000000000..2bb95247b64f
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/intel,keembay-ocs-ecc.yaml
@@ -0,0 +1,47 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/intel,keembay-ocs-ecc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Intel Keem Bay OCS ECC
+
+maintainers:
+ - Daniele Alessandrelli <daniele.alessandrelli@intel.com>
+ - Prabhjot Khurana <prabhjot.khurana@intel.com>
+
+description:
+ The Intel Keem Bay Offload and Crypto Subsystem (OCS) Elliptic Curve
+ Cryptography (ECC) device provides hardware acceleration for elliptic curve
+ cryptography using the NIST P-256 and NIST P-384 elliptic curves.
+
+properties:
+ compatible:
+ const: intel,keembay-ocs-ecc
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ crypto@30001000 {
+ compatible = "intel,keembay-ocs-ecc";
+ reg = <0x30001000 0x1000>;
+ interrupts = <GIC_SPI 120 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&scmi_clk 95>;
+ };
diff --git a/Documentation/devicetree/bindings/crypto/intel,keembay-ocs-hcu.yaml b/Documentation/devicetree/bindings/crypto/intel,keembay-ocs-hcu.yaml
index acb92706d280..46e2853ab8f4 100644
--- a/Documentation/devicetree/bindings/crypto/intel,keembay-ocs-hcu.yaml
+++ b/Documentation/devicetree/bindings/crypto/intel,keembay-ocs-hcu.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/crypto/intel,keembay-ocs-hcu.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Intel Keem Bay OCS HCU Device Tree Bindings
+title: Intel Keem Bay OCS HCU
maintainers:
- Declan Murphy <declan.murphy@intel.com>
diff --git a/Documentation/devicetree/bindings/crypto/qcom,inline-crypto-engine.yaml b/Documentation/devicetree/bindings/crypto/qcom,inline-crypto-engine.yaml
new file mode 100644
index 000000000000..92e1d76e29ee
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/qcom,inline-crypto-engine.yaml
@@ -0,0 +1,42 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/qcom,inline-crypto-engine.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Technologies, Inc. (QTI) Inline Crypto Engine
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - qcom,sm8550-inline-crypto-engine
+ - const: qcom,inline-crypto-engine
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,sm8550-gcc.h>
+
+ crypto@1d88000 {
+ compatible = "qcom,sm8550-inline-crypto-engine",
+ "qcom,inline-crypto-engine";
+ reg = <0x01d88000 0x8000>;
+ clocks = <&gcc GCC_UFS_PHY_ICE_CORE_CLK>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/crypto/qcom,prng.txt b/Documentation/devicetree/bindings/crypto/qcom,prng.txt
deleted file mode 100644
index 7ee0e9eac973..000000000000
--- a/Documentation/devicetree/bindings/crypto/qcom,prng.txt
+++ /dev/null
@@ -1,19 +0,0 @@
-Qualcomm MSM pseudo random number generator.
-
-Required properties:
-
-- compatible : should be "qcom,prng" for 8916 etc
- : should be "qcom,prng-ee" for 8996 and later using EE
- (Execution Environment) slice of prng
-- reg : specifies base physical address and size of the registers map
-- clocks : phandle to clock-controller plus clock-specifier pair
-- clock-names : "core" clocks all registers, FIFO and circuits in PRNG IP block
-
-Example:
-
- rng@f9bff000 {
- compatible = "qcom,prng";
- reg = <0xf9bff000 0x200>;
- clocks = <&clock GCC_PRNG_AHB_CLK>;
- clock-names = "core";
- };
diff --git a/Documentation/devicetree/bindings/crypto/qcom,prng.yaml b/Documentation/devicetree/bindings/crypto/qcom,prng.yaml
new file mode 100644
index 000000000000..bb42f4588b40
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/qcom,prng.yaml
@@ -0,0 +1,43 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/qcom,prng.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Pseudo Random Number Generator
+
+maintainers:
+ - Vinod Koul <vkoul@kernel.org>
+
+properties:
+ compatible:
+ enum:
+ - qcom,prng # 8916 etc.
+ - qcom,prng-ee # 8996 and later using EE
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: core
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+
+additionalProperties: false
+
+examples:
+ - |
+ rng@f9bff000 {
+ compatible = "qcom,prng";
+ reg = <0xf9bff000 0x200>;
+ clocks = <&clk 125>;
+ clock-names = "core";
+ };
diff --git a/Documentation/devicetree/bindings/crypto/qcom-qce.txt b/Documentation/devicetree/bindings/crypto/qcom-qce.txt
deleted file mode 100644
index fdd53b184ba8..000000000000
--- a/Documentation/devicetree/bindings/crypto/qcom-qce.txt
+++ /dev/null
@@ -1,25 +0,0 @@
-Qualcomm crypto engine driver
-
-Required properties:
-
-- compatible : should be "qcom,crypto-v5.1"
-- reg : specifies base physical address and size of the registers map
-- clocks : phandle to clock-controller plus clock-specifier pair
-- clock-names : "iface" clocks register interface
- "bus" clocks data transfer interface
- "core" clocks rest of the crypto block
-- dmas : DMA specifiers for tx and rx dma channels. For more see
- Documentation/devicetree/bindings/dma/dma.txt
-- dma-names : DMA request names should be "rx" and "tx"
-
-Example:
- crypto@fd45a000 {
- compatible = "qcom,crypto-v5.1";
- reg = <0xfd45a000 0x6000>;
- clocks = <&gcc GCC_CE2_AHB_CLK>,
- <&gcc GCC_CE2_AXI_CLK>,
- <&gcc GCC_CE2_CLK>;
- clock-names = "iface", "bus", "core";
- dmas = <&cryptobam 2>, <&cryptobam 3>;
- dma-names = "rx", "tx";
- };
diff --git a/Documentation/devicetree/bindings/crypto/qcom-qce.yaml b/Documentation/devicetree/bindings/crypto/qcom-qce.yaml
new file mode 100644
index 000000000000..e375bd981300
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/qcom-qce.yaml
@@ -0,0 +1,123 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/qcom-qce.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm crypto engine driver
+
+maintainers:
+ - Bhupesh Sharma <bhupesh.sharma@linaro.org>
+
+description:
+ This document defines the binding for the QCE crypto
+ controller found on Qualcomm parts.
+
+properties:
+ compatible:
+ oneOf:
+ - const: qcom,crypto-v5.1
+ deprecated: true
+ description: Kept only for ABI backward compatibility
+
+ - const: qcom,crypto-v5.4
+ deprecated: true
+ description: Kept only for ABI backward compatibility
+
+ - items:
+ - enum:
+ - qcom,ipq6018-qce
+ - qcom,ipq8074-qce
+ - qcom,msm8996-qce
+ - qcom,sdm845-qce
+ - const: qcom,ipq4019-qce
+ - const: qcom,qce
+
+ - items:
+ - enum:
+ - qcom,sm8250-qce
+ - qcom,sm8350-qce
+ - qcom,sm8450-qce
+ - qcom,sm8550-qce
+ - const: qcom,sm8150-qce
+ - const: qcom,qce
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: iface clock, clocking the register interface.
+ - description: bus clock, clocking the data transfer interface.
+ - description: core clock, clocking the rest of the crypto block.
+
+ clock-names:
+ items:
+ - const: iface
+ - const: bus
+ - const: core
+
+ iommus:
+ minItems: 1
+ maxItems: 8
+ description:
+ phandle to apps_smmu node with sid mask.
+
+ interconnects:
+ maxItems: 1
+ description:
+ Interconnect path between qce crypto and main memory.
+
+ interconnect-names:
+ const: memory
+
+ dmas:
+ items:
+ - description: DMA specifiers for rx dma channel.
+ - description: DMA specifiers for tx dma channel.
+
+ dma-names:
+ items:
+ - const: rx
+ - const: tx
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,crypto-v5.1
+ - qcom,crypto-v5.4
+ - qcom,ipq4019-qce
+
+ then:
+ required:
+ - clocks
+ - clock-names
+
+required:
+ - compatible
+ - reg
+ - dmas
+ - dma-names
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-apq8084.h>
+ crypto-engine@fd45a000 {
+ compatible = "qcom,ipq6018-qce", "qcom,ipq4019-qce", "qcom,qce";
+ reg = <0xfd45a000 0x6000>;
+ clocks = <&gcc GCC_CE2_AHB_CLK>,
+ <&gcc GCC_CE2_AXI_CLK>,
+ <&gcc GCC_CE2_CLK>;
+ clock-names = "iface", "bus", "core";
+ dmas = <&cryptobam 2>, <&cryptobam 3>;
+ dma-names = "rx", "tx";
+ iommus = <&apps_smmu 0x584 0x0011>,
+ <&apps_smmu 0x586 0x0011>,
+ <&apps_smmu 0x594 0x0011>,
+ <&apps_smmu 0x596 0x0011>;
+ };
diff --git a/Documentation/devicetree/bindings/crypto/rockchip,rk3288-crypto.yaml b/Documentation/devicetree/bindings/crypto/rockchip,rk3288-crypto.yaml
new file mode 100644
index 000000000000..f1a9da8bff7a
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/rockchip,rk3288-crypto.yaml
@@ -0,0 +1,127 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/crypto/rockchip,rk3288-crypto.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip Electronics Security Accelerator
+
+maintainers:
+ - Heiko Stuebner <heiko@sntech.de>
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rk3288-crypto
+ - rockchip,rk3328-crypto
+ - rockchip,rk3399-crypto
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ minItems: 3
+ maxItems: 4
+
+ clock-names:
+ minItems: 3
+ maxItems: 4
+
+ resets:
+ minItems: 1
+ maxItems: 3
+
+ reset-names:
+ minItems: 1
+ maxItems: 3
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: rockchip,rk3288-crypto
+ then:
+ properties:
+ clocks:
+ minItems: 4
+ clock-names:
+ items:
+ - const: aclk
+ - const: hclk
+ - const: sclk
+ - const: apb_pclk
+ resets:
+ maxItems: 1
+ reset-names:
+ items:
+ - const: crypto-rst
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: rockchip,rk3328-crypto
+ then:
+ properties:
+ clocks:
+ maxItems: 3
+ clock-names:
+ items:
+ - const: hclk_master
+ - const: hclk_slave
+ - const: sclk
+ resets:
+ maxItems: 1
+ reset-names:
+ items:
+ - const: crypto-rst
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: rockchip,rk3399-crypto
+ then:
+ properties:
+ clocks:
+ maxItems: 3
+ clock-names:
+ items:
+ - const: hclk_master
+ - const: hclk_slave
+ - const: sclk
+ resets:
+ minItems: 3
+ reset-names:
+ items:
+ - const: master
+ - const: slave
+ - const: crypto-rst
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/rk3288-cru.h>
+ crypto@ff8a0000 {
+ compatible = "rockchip,rk3288-crypto";
+ reg = <0xff8a0000 0x4000>;
+ interrupts = <GIC_SPI 48 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&cru ACLK_CRYPTO>, <&cru HCLK_CRYPTO>,
+ <&cru SCLK_CRYPTO>, <&cru ACLK_DMAC1>;
+ clock-names = "aclk", "hclk", "sclk", "apb_pclk";
+ resets = <&cru SRST_CRYPTO>;
+ reset-names = "crypto-rst";
+ };
diff --git a/Documentation/devicetree/bindings/crypto/rockchip-crypto.txt b/Documentation/devicetree/bindings/crypto/rockchip-crypto.txt
deleted file mode 100644
index 5e2ba385b8c9..000000000000
--- a/Documentation/devicetree/bindings/crypto/rockchip-crypto.txt
+++ /dev/null
@@ -1,28 +0,0 @@
-Rockchip Electronics Security Accelerator
-
-Required properties:
-- compatible: Should be "rockchip,rk3288-crypto"
-- reg: Base physical address of the engine and length of memory mapped
- region
-- interrupts: Interrupt number
-- clocks: Reference to the clocks about crypto
-- clock-names: "aclk" used to clock data
- "hclk" used to clock data
- "sclk" used to clock crypto accelerator
- "apb_pclk" used to clock dma
-- resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
-- reset-names: Must include the name "crypto-rst".
-
-Examples:
-
- crypto: crypto-controller@ff8a0000 {
- compatible = "rockchip,rk3288-crypto";
- reg = <0xff8a0000 0x4000>;
- interrupts = <GIC_SPI 48 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&cru ACLK_CRYPTO>, <&cru HCLK_CRYPTO>,
- <&cru SCLK_CRYPTO>, <&cru ACLK_DMAC1>;
- clock-names = "aclk", "hclk", "sclk", "apb_pclk";
- resets = <&cru SRST_CRYPTO>;
- reset-names = "crypto-rst";
- };
diff --git a/Documentation/devicetree/bindings/crypto/samsung-slimsss.yaml b/Documentation/devicetree/bindings/crypto/samsung-slimsss.yaml
index 676950bb7b37..5b31891c97fe 100644
--- a/Documentation/devicetree/bindings/crypto/samsung-slimsss.yaml
+++ b/Documentation/devicetree/bindings/crypto/samsung-slimsss.yaml
@@ -24,7 +24,6 @@ properties:
maxItems: 1
clocks:
- minItems: 2
maxItems: 2
clock-names:
diff --git a/Documentation/devicetree/bindings/crypto/st,stm32-crc.yaml b/Documentation/devicetree/bindings/crypto/st,stm32-crc.yaml
index cee624c14f07..50b2c2e0c3cd 100644
--- a/Documentation/devicetree/bindings/crypto/st,stm32-crc.yaml
+++ b/Documentation/devicetree/bindings/crypto/st,stm32-crc.yaml
@@ -4,10 +4,10 @@
$id: http://devicetree.org/schemas/crypto/st,stm32-crc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: STMicroelectronics STM32 CRC bindings
+title: STMicroelectronics STM32 CRC
maintainers:
- - Lionel Debieve <lionel.debieve@st.com>
+ - Lionel Debieve <lionel.debieve@foss.st.com>
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/crypto/st,stm32-cryp.yaml b/Documentation/devicetree/bindings/crypto/st,stm32-cryp.yaml
index a4574552502a..0ddeb8a9a7a0 100644
--- a/Documentation/devicetree/bindings/crypto/st,stm32-cryp.yaml
+++ b/Documentation/devicetree/bindings/crypto/st,stm32-cryp.yaml
@@ -4,14 +4,20 @@
$id: http://devicetree.org/schemas/crypto/st,stm32-cryp.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: STMicroelectronics STM32 CRYP bindings
+title: STMicroelectronics STM32 CRYP
+
+description: The STM32 CRYP block is built on the CRYP block found in
+ the STn8820 SoC introduced in 2007, and subsequently used in the U8500
+ SoC in 2010.
maintainers:
- - Lionel Debieve <lionel.debieve@st.com>
+ - Lionel Debieve <lionel.debieve@foss.st.com>
properties:
compatible:
enum:
+ - st,stn8820-cryp
+ - stericsson,ux500-cryp
- st,stm32f756-cryp
- st,stm32mp1-cryp
@@ -27,6 +33,19 @@ properties:
resets:
maxItems: 1
+ dmas:
+ items:
+ - description: mem2cryp DMA channel
+ - description: cryp2mem DMA channel
+
+ dma-names:
+ items:
+ - const: mem2cryp
+ - const: cryp2mem
+
+ power-domains:
+ maxItems: 1
+
required:
- compatible
- reg
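A node using the new DMA properties might look like this (a sketch only; the
register address, interrupt, clock, reset, and DMA specifier cells are
placeholders whose real values depend on the SoC and its DMA controller
binding):

    cryp@54001000 {
        compatible = "st,stm32mp1-cryp";
        reg = <0x54001000 0x400>;
        interrupts = <GIC_SPI 79 IRQ_TYPE_LEVEL_HIGH>;
        clocks = <&rcc CRYP1>;
        resets = <&rcc CRYP1_R>;
        dmas = <&mdma 0 0 0>, <&mdma 1 0 0>; /* placeholder specifiers */
        dma-names = "mem2cryp", "cryp2mem";
    };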
diff --git a/Documentation/devicetree/bindings/crypto/st,stm32-hash.yaml b/Documentation/devicetree/bindings/crypto/st,stm32-hash.yaml
index 6dd658f0912c..b767ec72a999 100644
--- a/Documentation/devicetree/bindings/crypto/st,stm32-hash.yaml
+++ b/Documentation/devicetree/bindings/crypto/st,stm32-hash.yaml
@@ -4,14 +4,20 @@
$id: http://devicetree.org/schemas/crypto/st,stm32-hash.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: STMicroelectronics STM32 HASH bindings
+title: STMicroelectronics STM32 HASH
+
+description: The STM32 HASH block is built on the HASH block found in
+ the STn8820 SoC introduced in 2007, and subsequently used in the U8500
+ SoC in 2010.
maintainers:
- - Lionel Debieve <lionel.debieve@st.com>
+ - Lionel Debieve <lionel.debieve@foss.st.com>
properties:
compatible:
enum:
+ - st,stn8820-hash
+ - stericsson,ux500-hash
- st,stm32f456-hash
- st,stm32f756-hash
@@ -41,11 +47,26 @@ properties:
maximum: 2
default: 0
+ power-domains:
+ maxItems: 1
+
required:
- compatible
- reg
- clocks
- - interrupts
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ items:
+ const: stericsson,ux500-hash
+ then:
+ properties:
+ interrupts: false
+ else:
+ required:
+ - interrupts
additionalProperties: false
diff --git a/Documentation/devicetree/bindings/crypto/ti,sa2ul.yaml b/Documentation/devicetree/bindings/crypto/ti,sa2ul.yaml
index a410d2cedde6..77ec8bc70bf7 100644
--- a/Documentation/devicetree/bindings/crypto/ti,sa2ul.yaml
+++ b/Documentation/devicetree/bindings/crypto/ti,sa2ul.yaml
@@ -15,6 +15,7 @@ properties:
- ti,j721e-sa2ul
- ti,am654-sa2ul
- ti,am64-sa2ul
+ - ti,am62-sa3ul
reg:
maxItems: 1
@@ -25,8 +26,8 @@ properties:
dmas:
items:
- description: TX DMA Channel
- - description: RX DMA Channel #1
- - description: RX DMA Channel #2
+ - description: 'RX DMA Channel #1'
+ - description: 'RX DMA Channel #2'
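+ # (quoted so the '#' is not parsed as the start of a YAML comment)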
dma-names:
items:
@@ -34,8 +35,6 @@ properties:
- const: rx1
- const: rx2
- dma-coherent: true
-
"#address-cells":
const: 2
@@ -71,16 +70,6 @@ required:
- dmas
- dma-names
-if:
- properties:
- compatible:
- enum:
- - ti,j721e-sa2ul
- - ti,am654-sa2ul
-then:
- required:
- - dma-coherent
-
additionalProperties: false
examples:
@@ -94,5 +83,4 @@ examples:
dmas = <&main_udmap 0xc000>, <&main_udmap 0x4000>,
<&main_udmap 0x4001>;
dma-names = "tx", "rx1", "rx2";
- dma-coherent;
};
diff --git a/Documentation/devicetree/bindings/crypto/xlnx,zynqmp-aes.yaml b/Documentation/devicetree/bindings/crypto/xlnx,zynqmp-aes.yaml
index 55dd6e3d270d..9e8fbd02b150 100644
--- a/Documentation/devicetree/bindings/crypto/xlnx,zynqmp-aes.yaml
+++ b/Documentation/devicetree/bindings/crypto/xlnx,zynqmp-aes.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/crypto/xlnx,zynqmp-aes.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Xilinx ZynqMP AES-GCM Hardware Accelerator Device Tree Bindings
+title: Xilinx ZynqMP AES-GCM Hardware Accelerator
maintainers:
- Kalyani Akula <kalyani.akula@xilinx.com>
diff --git a/Documentation/devicetree/bindings/ddr/lpddr2-timings.txt b/Documentation/devicetree/bindings/ddr/lpddr2-timings.txt
deleted file mode 100644
index 9ceb19e0c7fd..000000000000
--- a/Documentation/devicetree/bindings/ddr/lpddr2-timings.txt
+++ /dev/null
@@ -1,52 +0,0 @@
-* AC timing parameters of LPDDR2(JESD209-2) memories for a given speed-bin
-
-Required properties:
-- compatible : Should be "jedec,lpddr2-timings"
-- min-freq : minimum DDR clock frequency for the speed-bin. Type is <u32>
-- max-freq : maximum DDR clock frequency for the speed-bin. Type is <u32>
-
-Optional properties:
-
-The following properties represent AC timing parameters from the memory
-data-sheet of the device for a given speed-bin. All these properties are
-of type <u32> and the default unit is ps (pico seconds). Parameters with
-a different unit have a suffix indicating the unit such as 'tRAS-max-ns'
-- tRCD
-- tWR
-- tRAS-min
-- tRRD
-- tWTR
-- tXP
-- tRTP
-- tDQSCK-max
-- tFAW
-- tZQCS
-- tZQinit
-- tRPab
-- tZQCL
-- tCKESR
-- tRAS-max-ns
-- tDQSCK-max-derated
-
-Example:
-
-timings_elpida_ECB240ABACN_400mhz: lpddr2-timings@0 {
- compatible = "jedec,lpddr2-timings";
- min-freq = <10000000>;
- max-freq = <400000000>;
- tRPab = <21000>;
- tRCD = <18000>;
- tWR = <15000>;
- tRAS-min = <42000>;
- tRRD = <10000>;
- tWTR = <7500>;
- tXP = <7500>;
- tRTP = <7500>;
- tCKESR = <15000>;
- tDQSCK-max = <5500>;
- tFAW = <50000>;
- tZQCS = <90000>;
- tZQCL = <360000>;
- tZQinit = <1000000>;
- tRAS-max-ns = <70000>;
-};
diff --git a/Documentation/devicetree/bindings/ddr/lpddr2.txt b/Documentation/devicetree/bindings/ddr/lpddr2.txt
deleted file mode 100644
index ddd40121e6f6..000000000000
--- a/Documentation/devicetree/bindings/ddr/lpddr2.txt
+++ /dev/null
@@ -1,102 +0,0 @@
-* LPDDR2 SDRAM memories compliant to JEDEC JESD209-2
-
-Required properties:
-- compatible : Should be one of - "jedec,lpddr2-nvm", "jedec,lpddr2-s2",
- "jedec,lpddr2-s4"
-
- "ti,jedec-lpddr2-s2" should be listed if the memory part is LPDDR2-S2 type
-
- "ti,jedec-lpddr2-s4" should be listed if the memory part is LPDDR2-S4 type
-
- "ti,jedec-lpddr2-nvm" should be listed if the memory part is LPDDR2-NVM type
-
-- density : <u32> representing density in Mb (Mega bits)
-
-- io-width : <u32> representing bus width. Possible values are 8, 16, and 32
-
-Optional properties:
-
-The following optional properties represent the minimum value of some AC
-timing parameters of the DDR device in terms of number of clock cycles.
-These values shall be obtained from the device data-sheet.
-- tRRD-min-tck
-- tWTR-min-tck
-- tXP-min-tck
-- tRTP-min-tck
-- tCKE-min-tck
-- tRPab-min-tck
-- tRCD-min-tck
-- tWR-min-tck
-- tRASmin-min-tck
-- tCKESR-min-tck
-- tFAW-min-tck
-
-Child nodes:
-- The lpddr2 node may have one or more child nodes of type "lpddr2-timings".
- "lpddr2-timings" provides AC timing parameters of the device for
- a given speed-bin. The user may provide the timings for as many
- speed-bins as is required. Please see Documentation/devicetree/
- bindings/ddr/lpddr2-timings.txt for more information on "lpddr2-timings"
-
-Example:
-
-elpida_ECB240ABACN : lpddr2 {
- compatible = "Elpida,ECB240ABACN","jedec,lpddr2-s4";
- density = <2048>;
- io-width = <32>;
-
- tRPab-min-tck = <3>;
- tRCD-min-tck = <3>;
- tWR-min-tck = <3>;
- tRASmin-min-tck = <3>;
- tRRD-min-tck = <2>;
- tWTR-min-tck = <2>;
- tXP-min-tck = <2>;
- tRTP-min-tck = <2>;
- tCKE-min-tck = <3>;
- tCKESR-min-tck = <3>;
- tFAW-min-tck = <8>;
-
- timings_elpida_ECB240ABACN_400mhz: lpddr2-timings@0 {
- compatible = "jedec,lpddr2-timings";
- min-freq = <10000000>;
- max-freq = <400000000>;
- tRPab = <21000>;
- tRCD = <18000>;
- tWR = <15000>;
- tRAS-min = <42000>;
- tRRD = <10000>;
- tWTR = <7500>;
- tXP = <7500>;
- tRTP = <7500>;
- tCKESR = <15000>;
- tDQSCK-max = <5500>;
- tFAW = <50000>;
- tZQCS = <90000>;
- tZQCL = <360000>;
- tZQinit = <1000000>;
- tRAS-max-ns = <70000>;
- };
-
- timings_elpida_ECB240ABACN_200mhz: lpddr2-timings@1 {
- compatible = "jedec,lpddr2-timings";
- min-freq = <10000000>;
- max-freq = <200000000>;
- tRPab = <21000>;
- tRCD = <18000>;
- tWR = <15000>;
- tRAS-min = <42000>;
- tRRD = <10000>;
- tWTR = <10000>;
- tXP = <7500>;
- tRTP = <7500>;
- tCKESR = <15000>;
- tDQSCK-max = <5500>;
- tFAW = <50000>;
- tZQCS = <90000>;
- tZQCL = <360000>;
- tZQinit = <1000000>;
- tRAS-max-ns = <70000>;
- };
-
-}
diff --git a/Documentation/devicetree/bindings/ddr/lpddr3-timings.txt b/Documentation/devicetree/bindings/ddr/lpddr3-timings.txt
deleted file mode 100644
index 84705e50a3fd..000000000000
--- a/Documentation/devicetree/bindings/ddr/lpddr3-timings.txt
+++ /dev/null
@@ -1,58 +0,0 @@
-* AC timing parameters of LPDDR3 memories for a given speed-bin.
-
-The structures are based on LPDDR2 and extended where needed.
-
-Required properties:
-- compatible : Should be "jedec,lpddr3-timings"
-- min-freq : minimum DDR clock frequency for the speed-bin. Type is <u32>
-- reg : maximum DDR clock frequency for the speed-bin. Type is <u32>
-
-Optional properties:
-
-The following properties represent AC timing parameters from the memory
-data-sheet of the device for a given speed-bin. All these properties are
-of type <u32> and the default unit is ps (picoseconds).
-- tRFC
-- tRRD
-- tRPab
-- tRPpb
-- tRCD
-- tRC
-- tRAS
-- tWTR
-- tWR
-- tRTP
-- tW2W-C2C
-- tR2R-C2C
-- tFAW
-- tXSR
-- tXP
-- tCKE
-- tCKESR
-- tMRD
-
-Example:
-
-timings_samsung_K3QF2F20DB_800mhz: lpddr3-timings@800000000 {
- compatible = "jedec,lpddr3-timings";
- reg = <800000000>; /* workaround: it shows max-freq */
- min-freq = <100000000>;
- tRFC = <65000>;
- tRRD = <6000>;
- tRPab = <12000>;
- tRPpb = <12000>;
- tRCD = <10000>;
- tRC = <33750>;
- tRAS = <23000>;
- tWTR = <3750>;
- tWR = <7500>;
- tRTP = <3750>;
- tW2W-C2C = <0>;
- tR2R-C2C = <0>;
- tFAW = <25000>;
- tXSR = <70000>;
- tXP = <3750>;
- tCKE = <3750>;
- tCKESR = <3750>;
- tMRD = <7000>;
-};
diff --git a/Documentation/devicetree/bindings/ddr/lpddr3.txt b/Documentation/devicetree/bindings/ddr/lpddr3.txt
deleted file mode 100644
index b221e653d384..000000000000
--- a/Documentation/devicetree/bindings/ddr/lpddr3.txt
+++ /dev/null
@@ -1,106 +0,0 @@
-* LPDDR3 SDRAM memories compliant to JEDEC JESD209-3C
-
-Required properties:
-- compatible : Should be "<vendor>,<type>", and generic value "jedec,lpddr3".
- Example "<vendor>,<type>" values:
- "samsung,K3QF2F20DB"
-
-- density : <u32> representing density in Mb (Mega bits)
-- io-width : <u32> representing bus width. Possible values are 8, 16, 32, 64
-- #address-cells: Must be set to 1
-- #size-cells: Must be set to 0
-
-Optional properties:
-
-- manufacturer-id : <u32> Manufacturer ID value read from Mode Register 5
-- revision-id : <u32 u32> Revision IDs read from Mode Registers 6 and 7
-
-The following optional properties represent the minimum value of some AC
-timing parameters of the DDR device in terms of number of clock cycles.
-These values shall be obtained from the device data-sheet.
-- tRFC-min-tck
-- tRRD-min-tck
-- tRPab-min-tck
-- tRPpb-min-tck
-- tRCD-min-tck
-- tRC-min-tck
-- tRAS-min-tck
-- tWTR-min-tck
-- tWR-min-tck
-- tRTP-min-tck
-- tW2W-C2C-min-tck
-- tR2R-C2C-min-tck
-- tWL-min-tck
-- tDQSCK-min-tck
-- tRL-min-tck
-- tFAW-min-tck
-- tXSR-min-tck
-- tXP-min-tck
-- tCKE-min-tck
-- tCKESR-min-tck
-- tMRD-min-tck
-
-Child nodes:
-- The lpddr3 node may have one or more child nodes of type "lpddr3-timings".
- "lpddr3-timings" provides AC timing parameters of the device for
- a given speed-bin. Please see Documentation/devicetree/
- bindings/ddr/lpddr3-timings.txt for more information on "lpddr3-timings"
-
-Example:
-
-samsung_K3QF2F20DB: lpddr3 {
- compatible = "samsung,K3QF2F20DB", "jedec,lpddr3";
- density = <16384>;
- io-width = <32>;
- manufacturer-id = <1>;
- revision-id = <123 234>;
- #address-cells = <1>;
- #size-cells = <0>;
-
- tRFC-min-tck = <17>;
- tRRD-min-tck = <2>;
- tRPab-min-tck = <2>;
- tRPpb-min-tck = <2>;
- tRCD-min-tck = <3>;
- tRC-min-tck = <6>;
- tRAS-min-tck = <5>;
- tWTR-min-tck = <2>;
- tWR-min-tck = <7>;
- tRTP-min-tck = <2>;
- tW2W-C2C-min-tck = <0>;
- tR2R-C2C-min-tck = <0>;
- tWL-min-tck = <8>;
- tDQSCK-min-tck = <5>;
- tRL-min-tck = <14>;
- tFAW-min-tck = <5>;
- tXSR-min-tck = <12>;
- tXP-min-tck = <2>;
- tCKE-min-tck = <2>;
- tCKESR-min-tck = <2>;
- tMRD-min-tck = <5>;
-
- timings_samsung_K3QF2F20DB_800mhz: lpddr3-timings@800000000 {
- compatible = "jedec,lpddr3-timings";
- /* workaround: 'reg' shows max-freq */
- reg = <800000000>;
- min-freq = <100000000>;
- tRFC = <65000>;
- tRRD = <6000>;
- tRPab = <12000>;
- tRPpb = <12000>;
- tRCD = <10000>;
- tRC = <33750>;
- tRAS = <23000>;
- tWTR = <3750>;
- tWR = <7500>;
- tRTP = <3750>;
- tW2W-C2C = <0>;
- tR2R-C2C = <0>;
- tFAW = <25000>;
- tXSR = <70000>;
- tXP = <3750>;
- tCKE = <3750>;
- tCKESR = <3750>;
- tMRD = <7000>;
- };
-}
diff --git a/Documentation/devicetree/bindings/devfreq/event/samsung,exynos-nocp.yaml b/Documentation/devicetree/bindings/devfreq/event/samsung,exynos-nocp.yaml
index d318fccf78f1..2bdd05af6079 100644
--- a/Documentation/devicetree/bindings/devfreq/event/samsung,exynos-nocp.yaml
+++ b/Documentation/devicetree/bindings/devfreq/event/samsung,exynos-nocp.yaml
@@ -8,7 +8,7 @@ title: Samsung Exynos NoC (Network on Chip) Probe
maintainers:
- Chanwoo Choi <cw00.choi@samsung.com>
- - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
description: |
The Samsung Exynos542x SoC has a NoC (Network on Chip) Probe for NoC bus.
diff --git a/Documentation/devicetree/bindings/devfreq/event/samsung,exynos-ppmu.yaml b/Documentation/devicetree/bindings/devfreq/event/samsung,exynos-ppmu.yaml
index c9a8cb5fd555..e300df4b47f3 100644
--- a/Documentation/devicetree/bindings/devfreq/event/samsung,exynos-ppmu.yaml
+++ b/Documentation/devicetree/bindings/devfreq/event/samsung,exynos-ppmu.yaml
@@ -8,7 +8,7 @@ title: Samsung Exynos SoC PPMU (Platform Performance Monitoring Unit)
maintainers:
- Chanwoo Choi <cw00.choi@samsung.com>
- - Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
description: |
The Samsung Exynos SoC has PPMU (Platform Performance Monitoring Unit) for
diff --git a/Documentation/devicetree/bindings/devfreq/exynos-bus.txt b/Documentation/devicetree/bindings/devfreq/exynos-bus.txt
deleted file mode 100644
index bcaa2c08ac11..000000000000
--- a/Documentation/devicetree/bindings/devfreq/exynos-bus.txt
+++ /dev/null
@@ -1,488 +0,0 @@
-* Generic Exynos Bus frequency device
-
-The Samsung Exynos SoC has many buses for data transfer between DRAM
-and the sub-blocks in the SoC. Most Exynos SoCs share a common bus
-architecture. Generally, each bus of an Exynos SoC includes a source
-clock and a power line, which allow the clock frequency of the bus to
-be changed at runtime. To monitor the usage of each bus at runtime,
-the driver uses the PPMU (Platform Performance Monitoring Unit), which
-is able to measure the current load of sub-blocks.
-
-The Exynos SoC includes various sub-blocks, each with its own AXI bus.
-Each AXI bus has its own source clock but not necessarily its own
-power line; a power line may be shared among several sub-blocks.
-So each sub-block can be classified by its role into one of two types
-of bus device:
-- parent bus device
-- passive bus device
-
-Basically, parent and passive bus devices share the same power line.
-Only the parent bus device can change the voltage of the shared power
-line; the remaining bus devices (passive bus devices) depend on the
-decision of the parent bus device. If three blocks share the VDD_xxx
-power line, only one block should be the parent device and the rest
-should depend on the parent device as passive devices.
-
- VDD_xxx |--- A block (parent)
- |--- B block (passive)
- |--- C block (passive)
-
-The composition differs slightly among Exynos SoCs because each SoC has
-different sub-blocks. Therefore, such differences should be specified in
-the devicetree file instead of in each device driver. As a result, this
-driver is able to support the bus frequency for all Exynos SoCs.
-
-Required properties for all bus devices:
-- compatible: Should be "samsung,exynos-bus".
-- clock-names : the name of clock used by the bus, "bus".
-- clocks : phandles for clock specified in "clock-names" property.
-- operating-points-v2: the OPP table including frequency/voltage information
- to support DVFS (Dynamic Voltage/Frequency Scaling) feature.
-
-Required properties only for parent bus device:
-- vdd-supply: the regulator to provide the buses with the voltage.
-- devfreq-events: the devfreq-event device to monitor the current utilization
- of buses.
-
-Required properties only for passive bus device:
-- devfreq: the parent bus device.
-
-Optional properties only for parent bus device:
-- exynos,saturation-ratio: the percentage value which is used to calibrate
- the performance count against total cycle count.
-
-Optional properties for the interconnect functionality (QoS frequency
-constraints):
-- #interconnect-cells: should be 0.
-- interconnects: as documented in ../interconnect.txt, describes a path at the
- higher level interconnects used by this interconnect provider.
- If this interconnect provider is directly linked to a top level interconnect
- provider the property contains only one phandle. The provider extends
- the interconnect graph by linking its node to a node registered by provider
- pointed to by first phandle in the 'interconnects' property.
-
-- samsung,data-clock-ratio: ratio of the data throughput in B/s to minimum data
- clock frequency in Hz, default value is 8 when this property is missing.
-
-Detailed correlation between sub-blocks and power line according to Exynos SoC:
-- In case of Exynos3250, there are two power lines, as follows:
- VDD_MIF |--- DMC
-
- VDD_INT |--- LEFTBUS (parent device)
- |--- PERIL
- |--- MFC
- |--- G3D
- |--- RIGHTBUS
- |--- PERIR
- |--- FSYS
- |--- LCD0
- |--- PERIR
- |--- ISP
- |--- CAM
-
-- In case of Exynos4210, there is one power line, as follows:
- VDD_INT |--- DMC (parent device)
- |--- LEFTBUS
- |--- PERIL
- |--- MFC(L)
- |--- G3D
- |--- TV
- |--- LCD0
- |--- RIGHTBUS
- |--- PERIR
- |--- MFC(R)
- |--- CAM
- |--- FSYS
- |--- GPS
- |--- LCD0
- |--- LCD1
-
-- In case of Exynos4x12, there are two power lines, as follows:
- VDD_MIF |--- DMC
-
- VDD_INT |--- LEFTBUS (parent device)
- |--- PERIL
- |--- MFC(L)
- |--- G3D
- |--- TV
- |--- IMAGE
- |--- RIGHTBUS
- |--- PERIR
- |--- MFC(R)
- |--- CAM
- |--- FSYS
- |--- GPS
- |--- LCD0
- |--- ISP
-
-- In case of Exynos5422, there are two power lines, as follows:
- VDD_MIF |--- DREX 0 (parent device, DRAM EXpress controller)
- |--- DREX 1
-
- VDD_INT |--- NoC_Core (parent device)
- |--- G2D
- |--- G3D
- |--- DISP1
- |--- NoC_WCORE
- |--- GSCL
- |--- MSCL
- |--- ISP
- |--- MFC
- |--- GEN
- |--- PERIS
- |--- PERIC
- |--- FSYS
- |--- FSYS2
-
-- In case of Exynos5433, there is the VDD_INT power line, as follows:
- VDD_INT |--- G2D (parent device)
- |--- MSCL
- |--- GSCL
- |--- JPEG
- |--- MFC
- |--- HEVC
- |--- BUS0
- |--- BUS1
- |--- BUS2
- |--- PERIS (Fixed clock rate)
- |--- PERIC (Fixed clock rate)
- |--- FSYS (Fixed clock rate)
-
-Example 1:
- Shows the AXI buses of the Exynos3250 SoC. Exynos3250 divides the buses by
- power line (regulator). The MIF (Memory Interface) AXI bus is used to
- transfer data between DRAM and CPU and uses the VDD_MIF regulator.
-
- - MIF (Memory Interface) block
- : VDD_MIF |--- DMC (Dynamic Memory Controller)
-
- - INT (Internal) block
- : VDD_INT |--- LEFTBUS (parent device)
- |--- PERIL
- |--- MFC
- |--- G3D
- |--- RIGHTBUS
- |--- FSYS
- |--- LCD0
- |--- PERIR
- |--- ISP
- |--- CAM
-
- - MIF bus's frequency/voltage table
- -----------------------
- |Lv| Freq | Voltage |
- -----------------------
- |L1| 50000 |800000 |
- |L2| 100000 |800000 |
- |L3| 134000 |800000 |
- |L4| 200000 |825000 |
- |L5| 400000 |875000 |
- -----------------------
-
- - INT bus's frequency/voltage table
- ----------------------------------------------------------
- |Block|LEFTBUS|RIGHTBUS|MCUISP |ISP |PERIL ||VDD_INT |
- | name| |LCD0 | | | || |
- | | |FSYS | | | || |
- | | |MFC | | | || |
- ----------------------------------------------------------
- |Mode |*parent|passive |passive|passive|passive|| |
- ----------------------------------------------------------
- |Lv |Frequency ||Voltage |
- ----------------------------------------------------------
- |L1 |50000 |50000 |50000 |50000 |50000 ||900000 |
- |L2 |80000 |80000 |80000 |80000 |80000 ||900000 |
- |L3 |100000 |100000 |100000 |100000 |100000 ||1000000 |
- |L4 |134000 |134000 |200000 |200000 | ||1000000 |
- |L5 |200000 |200000 |400000 |300000 | ||1000000 |
- ----------------------------------------------------------
-
-Example 2:
- The bus of DMC (Dynamic Memory Controller) block in exynos3250.dtsi
- is listed below:
-
- bus_dmc: bus_dmc {
- compatible = "samsung,exynos-bus";
- clocks = <&cmu_dmc CLK_DIV_DMC>;
- clock-names = "bus";
- operating-points-v2 = <&bus_dmc_opp_table>;
- status = "disabled";
- };
-
- bus_dmc_opp_table: opp_table1 {
- compatible = "operating-points-v2";
- opp-shared;
-
- opp-50000000 {
- opp-hz = /bits/ 64 <50000000>;
- opp-microvolt = <800000>;
- };
- opp-100000000 {
- opp-hz = /bits/ 64 <100000000>;
- opp-microvolt = <800000>;
- };
- opp-134000000 {
- opp-hz = /bits/ 64 <134000000>;
- opp-microvolt = <800000>;
- };
- opp-200000000 {
- opp-hz = /bits/ 64 <200000000>;
- opp-microvolt = <825000>;
- };
- opp-400000000 {
- opp-hz = /bits/ 64 <400000000>;
- opp-microvolt = <875000>;
- };
- };
-
- bus_leftbus: bus_leftbus {
- compatible = "samsung,exynos-bus";
- clocks = <&cmu CLK_DIV_GDL>;
- clock-names = "bus";
- operating-points-v2 = <&bus_leftbus_opp_table>;
- status = "disabled";
- };
-
- bus_rightbus: bus_rightbus {
- compatible = "samsung,exynos-bus";
- clocks = <&cmu CLK_DIV_GDR>;
- clock-names = "bus";
- operating-points-v2 = <&bus_leftbus_opp_table>;
- status = "disabled";
- };
-
- bus_lcd0: bus_lcd0 {
- compatible = "samsung,exynos-bus";
- clocks = <&cmu CLK_DIV_ACLK_160>;
- clock-names = "bus";
- operating-points-v2 = <&bus_leftbus_opp_table>;
- status = "disabled";
- };
-
- bus_fsys: bus_fsys {
- compatible = "samsung,exynos-bus";
- clocks = <&cmu CLK_DIV_ACLK_200>;
- clock-names = "bus";
- operating-points-v2 = <&bus_leftbus_opp_table>;
- status = "disabled";
- };
-
- bus_mcuisp: bus_mcuisp {
- compatible = "samsung,exynos-bus";
- clocks = <&cmu CLK_DIV_ACLK_400_MCUISP>;
- clock-names = "bus";
- operating-points-v2 = <&bus_mcuisp_opp_table>;
- status = "disabled";
- };
-
- bus_isp: bus_isp {
- compatible = "samsung,exynos-bus";
- clocks = <&cmu CLK_DIV_ACLK_266>;
- clock-names = "bus";
- operating-points-v2 = <&bus_isp_opp_table>;
- status = "disabled";
- };
-
- bus_peril: bus_peril {
- compatible = "samsung,exynos-bus";
- clocks = <&cmu CLK_DIV_ACLK_100>;
- clock-names = "bus";
- operating-points-v2 = <&bus_peril_opp_table>;
- status = "disabled";
- };
-
- bus_mfc: bus_mfc {
- compatible = "samsung,exynos-bus";
- clocks = <&cmu CLK_SCLK_MFC>;
- clock-names = "bus";
- operating-points-v2 = <&bus_leftbus_opp_table>;
- status = "disabled";
- };
-
- bus_leftbus_opp_table: opp_table1 {
- compatible = "operating-points-v2";
- opp-shared;
-
- opp-50000000 {
- opp-hz = /bits/ 64 <50000000>;
- opp-microvolt = <900000>;
- };
- opp-80000000 {
- opp-hz = /bits/ 64 <80000000>;
- opp-microvolt = <900000>;
- };
- opp-100000000 {
- opp-hz = /bits/ 64 <100000000>;
- opp-microvolt = <1000000>;
- };
- opp-134000000 {
- opp-hz = /bits/ 64 <134000000>;
- opp-microvolt = <1000000>;
- };
- opp-200000000 {
- opp-hz = /bits/ 64 <200000000>;
- opp-microvolt = <1000000>;
- };
- };
-
- bus_mcuisp_opp_table: opp_table2 {
- compatible = "operating-points-v2";
- opp-shared;
-
- opp-50000000 {
- opp-hz = /bits/ 64 <50000000>;
- };
- opp-80000000 {
- opp-hz = /bits/ 64 <80000000>;
- };
- opp-100000000 {
- opp-hz = /bits/ 64 <100000000>;
- };
- opp-200000000 {
- opp-hz = /bits/ 64 <200000000>;
- };
- opp-400000000 {
- opp-hz = /bits/ 64 <400000000>;
- };
- };
-
- bus_isp_opp_table: opp_table3 {
- compatible = "operating-points-v2";
- opp-shared;
-
- opp-50000000 {
- opp-hz = /bits/ 64 <50000000>;
- };
- opp-80000000 {
- opp-hz = /bits/ 64 <80000000>;
- };
- opp-100000000 {
- opp-hz = /bits/ 64 <100000000>;
- };
- opp-200000000 {
- opp-hz = /bits/ 64 <200000000>;
- };
- opp-300000000 {
- opp-hz = /bits/ 64 <300000000>;
- };
- };
-
- bus_peril_opp_table: opp_table4 {
- compatible = "operating-points-v2";
- opp-shared;
-
- opp-50000000 {
- opp-hz = /bits/ 64 <50000000>;
- };
- opp-80000000 {
- opp-hz = /bits/ 64 <80000000>;
- };
- opp-100000000 {
- opp-hz = /bits/ 64 <100000000>;
- };
- };
-
-
-	An example of handling the bus frequency and voltage at runtime
-	in exynos3250-rinato.dts is listed below:
-
- &bus_dmc {
- devfreq-events = <&ppmu_dmc0_3>, <&ppmu_dmc1_3>;
- vdd-supply = <&buck1_reg>; /* VDD_MIF */
- status = "okay";
- };
-
- &bus_leftbus {
- devfreq-events = <&ppmu_leftbus_3>, <&ppmu_rightbus_3>;
- vdd-supply = <&buck3_reg>;
- status = "okay";
- };
-
- &bus_rightbus {
- devfreq = <&bus_leftbus>;
- status = "okay";
- };
-
- &bus_lcd0 {
- devfreq = <&bus_leftbus>;
- status = "okay";
- };
-
- &bus_fsys {
- devfreq = <&bus_leftbus>;
- status = "okay";
- };
-
- &bus_mcuisp {
- devfreq = <&bus_leftbus>;
- status = "okay";
- };
-
- &bus_isp {
- devfreq = <&bus_leftbus>;
- status = "okay";
- };
-
- &bus_peril {
- devfreq = <&bus_leftbus>;
- status = "okay";
- };
-
- &bus_mfc {
- devfreq = <&bus_leftbus>;
- status = "okay";
- };
-
-Example 3:
-	An interconnect path "bus_display -- bus_leftbus -- bus_dmc" on the
-	Exynos4412 SoC, with the video mixer as an interconnect consumer device.
-
- soc {
- bus_dmc: bus_dmc {
- compatible = "samsung,exynos-bus";
- clocks = <&clock CLK_DIV_DMC>;
- clock-names = "bus";
- operating-points-v2 = <&bus_dmc_opp_table>;
- samsung,data-clock-ratio = <4>;
- #interconnect-cells = <0>;
- };
-
- bus_leftbus: bus_leftbus {
- compatible = "samsung,exynos-bus";
- clocks = <&clock CLK_DIV_GDL>;
- clock-names = "bus";
- operating-points-v2 = <&bus_leftbus_opp_table>;
- #interconnect-cells = <0>;
- interconnects = <&bus_dmc>;
- };
-
- bus_display: bus_display {
- compatible = "samsung,exynos-bus";
- clocks = <&clock CLK_ACLK160>;
- clock-names = "bus";
- operating-points-v2 = <&bus_display_opp_table>;
- #interconnect-cells = <0>;
- interconnects = <&bus_leftbus &bus_dmc>;
- };
-
- bus_dmc_opp_table: opp_table1 {
- compatible = "operating-points-v2";
- /* ... */
-	};
-
- bus_leftbus_opp_table: opp_table3 {
- compatible = "operating-points-v2";
- /* ... */
- };
-
- bus_display_opp_table: opp_table4 {
- compatible = "operating-points-v2";
-		/* ... */
- };
-
- &mixer {
- compatible = "samsung,exynos4212-mixer";
- interconnects = <&bus_display &bus_dmc>;
- /* ... */
- };
- };
diff --git a/Documentation/devicetree/bindings/devfreq/rk3399_dmc.txt b/Documentation/devicetree/bindings/devfreq/rk3399_dmc.txt
deleted file mode 100644
index 3fbeb3733c48..000000000000
--- a/Documentation/devicetree/bindings/devfreq/rk3399_dmc.txt
+++ /dev/null
@@ -1,212 +0,0 @@
-* Rockchip rk3399 DMC (Dynamic Memory Controller) device
-
-Required properties:
-- compatible: Must be "rockchip,rk3399-dmc".
-- devfreq-events:	 Node used to get the DDR load. Refer to
- Documentation/devicetree/bindings/devfreq/event/
- rockchip-dfi.txt
-- clocks: Phandles for clock specified in "clock-names" property
-- clock-names :	 The name of clock used by the DMC, must be
-			 "dmc_clk".
-- operating-points-v2: Refer to Documentation/devicetree/bindings/opp/opp-v2.yaml
- for details.
-- center-supply: DMC supply node.
-- status: Marks the node enabled/disabled.
-- rockchip,pmu: Phandle to the syscon managing the "PMU general register
- files".
-
-Optional properties:
-- interrupts: The CPU interrupt number. The interrupt specifier
- format depends on the interrupt controller.
-			 It should be a DCF interrupt. When DDR DVFS finishes,
-			 a DCF interrupt is triggered.
-
-Following properties relate to DDR timing:
-
-- rockchip,ddr3_speed_bin :	 Values are defined in include/dt-bindings/clock/rk3399-ddr.h.
-				 It selects the DDR3 cl-trp-trcd type. It must be
-				 set according to "Speed Bin" in the DDR3 datasheet;
-				 DO NOT use a smaller "Speed Bin" than specified
-				 for the DDR3 being used.
-
-- rockchip,pd_idle : Configure the PD_IDLE value. Defines the
- power-down idle period in which memories are
- placed into power-down mode if bus is idle
- for PD_IDLE DFI clock cycles.
-
-- rockchip,sr_idle : Configure the SR_IDLE value. Defines the
- self-refresh idle period in which memories are
- placed into self-refresh mode if bus is idle
- for SR_IDLE * 1024 DFI clock cycles (DFI
- clocks freq is half of DRAM clock), default
- value is "0".
-
-- rockchip,sr_mc_gate_idle : Defines the memory self-refresh and controller
- clock gating idle period. Memories are placed
- into self-refresh mode and memory controller
-				 clock gating is started if the bus is idle for
- sr_mc_gate_idle*1024 DFI clock cycles.
-
-- rockchip,srpd_lite_idle : Defines the self-refresh power down idle
- period in which memories are placed into
- self-refresh power down mode if bus is idle
- for srpd_lite_idle * 1024 DFI clock cycles.
- This parameter is for LPDDR4 only.
-
-- rockchip,standby_idle : Defines the standby idle period in which
- memories are placed into self-refresh mode.
- The controller, pi, PHY and DRAM clock will
- be gated if bus is idle for standby_idle * DFI
- clock cycles.
-
-- rockchip,dram_dll_dis_freq : Defines the DDR3 DLL bypass frequency in MHz.
- When DDR frequency is less than DRAM_DLL_DISB_FREQ,
-				 DDR3 DLL will be bypassed. Note: if the DLL is
-				 bypassed, the ODT will also stop working.
-
-- rockchip,phy_dll_dis_freq : Defines the PHY dll bypass frequency in
- MHz (Mega Hz). When DDR frequency is less than
-				 PHY_DLL_DISB_FREQ, the PHY DLL will be bypassed.
- Note: PHY DLL and PHY ODT are independent.
-
-- rockchip,ddr3_odt_dis_freq : When the DRAM type is DDR3, this parameter defines
- the ODT disable frequency in MHz (Mega Hz).
-				 When the DDR frequency is less than ddr3_odt_dis_freq,
- the ODT on the DRAM side and controller side are
- both disabled.
-
-- rockchip,ddr3_drv : When the DRAM type is DDR3, this parameter defines
- the DRAM side driver strength in ohms. Default
- value is 40.
-
-- rockchip,ddr3_odt : When the DRAM type is DDR3, this parameter defines
- the DRAM side ODT strength in ohms. Default value
- is 120.
-
-- rockchip,phy_ddr3_ca_drv : When the DRAM type is DDR3, this parameter defines
-				 the PHY side CA line (including command line,
- address line and clock line) driver strength.
- Default value is 40.
-
-- rockchip,phy_ddr3_dq_drv : When the DRAM type is DDR3, this parameter defines
- the PHY side DQ line (including DQS/DQ/DM line)
- driver strength. Default value is 40.
-
-- rockchip,phy_ddr3_odt : When the DRAM type is DDR3, this parameter defines
- the PHY side ODT strength. Default value is 240.
-
-- rockchip,lpddr3_odt_dis_freq : When the DRAM type is LPDDR3, this parameter defines
-				 the ODT disable frequency in MHz (Mega Hz).
-				 When the DDR frequency is less than lpddr3_odt_dis_freq,
- the ODT on the DRAM side and controller side are
- both disabled.
-
-- rockchip,lpddr3_drv : When the DRAM type is LPDDR3, this parameter defines
- the DRAM side driver strength in ohms. Default
- value is 34.
-
-- rockchip,lpddr3_odt : When the DRAM type is LPDDR3, this parameter defines
- the DRAM side ODT strength in ohms. Default value
- is 240.
-
-- rockchip,phy_lpddr3_ca_drv : When the DRAM type is LPDDR3, this parameter defines
- the PHY side CA line (including command line,
- address line and clock line) driver strength.
- Default value is 40.
-
-- rockchip,phy_lpddr3_dq_drv : When the DRAM type is LPDDR3, this parameter defines
- the PHY side DQ line (including DQS/DQ/DM line)
- driver strength. Default value is 40.
-
-- rockchip,phy_lpddr3_odt :	 When the DRAM type is LPDDR3, this parameter defines
-				 the PHY side ODT strength. Default value is 240.
-
-- rockchip,lpddr4_odt_dis_freq : When the DRAM type is LPDDR4, this parameter
- defines the ODT disable frequency in
-				 MHz (Mega Hz). When the DDR frequency is less than
-				 lpddr4_odt_dis_freq, the ODT on the DRAM side and
- controller side are both disabled.
-
-- rockchip,lpddr4_drv : When the DRAM type is LPDDR4, this parameter defines
- the DRAM side driver strength in ohms. Default
- value is 60.
-
-- rockchip,lpddr4_dq_odt : When the DRAM type is LPDDR4, this parameter defines
- the DRAM side ODT on DQS/DQ line strength in ohms.
- Default value is 40.
-
-- rockchip,lpddr4_ca_odt : When the DRAM type is LPDDR4, this parameter defines
- the DRAM side ODT on CA line strength in ohms.
- Default value is 40.
-
-- rockchip,phy_lpddr4_ca_drv : When the DRAM type is LPDDR4, this parameter defines
- the PHY side CA line (including command address
- line) driver strength. Default value is 40.
-
-- rockchip,phy_lpddr4_ck_cs_drv : When the DRAM type is LPDDR4, this parameter defines
- the PHY side clock line and CS line driver
- strength. Default value is 80.
-
-- rockchip,phy_lpddr4_dq_drv : When the DRAM type is LPDDR4, this parameter defines
- the PHY side DQ line (including DQS/DQ/DM line)
- driver strength. Default value is 80.
-
-- rockchip,phy_lpddr4_odt : When the DRAM type is LPDDR4, this parameter defines
- the PHY side ODT strength. Default value is 60.
-
-Example:
- dmc_opp_table: dmc_opp_table {
- compatible = "operating-points-v2";
-
- opp00 {
- opp-hz = /bits/ 64 <300000000>;
- opp-microvolt = <900000>;
- };
- opp01 {
- opp-hz = /bits/ 64 <666000000>;
- opp-microvolt = <900000>;
- };
- };
-
- dmc: dmc {
- compatible = "rockchip,rk3399-dmc";
- devfreq-events = <&dfi>;
- interrupts = <GIC_SPI 1 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&cru SCLK_DDRCLK>;
- clock-names = "dmc_clk";
- operating-points-v2 = <&dmc_opp_table>;
- center-supply = <&ppvar_centerlogic>;
- upthreshold = <15>;
- downdifferential = <10>;
- rockchip,ddr3_speed_bin = <21>;
- rockchip,pd_idle = <0x40>;
- rockchip,sr_idle = <0x2>;
- rockchip,sr_mc_gate_idle = <0x3>;
- rockchip,srpd_lite_idle = <0x4>;
- rockchip,standby_idle = <0x2000>;
- rockchip,dram_dll_dis_freq = <300>;
- rockchip,phy_dll_dis_freq = <125>;
- rockchip,auto_pd_dis_freq = <666>;
- rockchip,ddr3_odt_dis_freq = <333>;
- rockchip,ddr3_drv = <40>;
- rockchip,ddr3_odt = <120>;
- rockchip,phy_ddr3_ca_drv = <40>;
- rockchip,phy_ddr3_dq_drv = <40>;
- rockchip,phy_ddr3_odt = <240>;
- rockchip,lpddr3_odt_dis_freq = <333>;
- rockchip,lpddr3_drv = <34>;
- rockchip,lpddr3_odt = <240>;
- rockchip,phy_lpddr3_ca_drv = <40>;
- rockchip,phy_lpddr3_dq_drv = <40>;
- rockchip,phy_lpddr3_odt = <240>;
- rockchip,lpddr4_odt_dis_freq = <333>;
- rockchip,lpddr4_drv = <60>;
- rockchip,lpddr4_dq_odt = <40>;
- rockchip,lpddr4_ca_odt = <40>;
- rockchip,phy_lpddr4_ca_drv = <40>;
- rockchip,phy_lpddr4_ck_cs_drv = <80>;
- rockchip,phy_lpddr4_dq_drv = <80>;
- rockchip,phy_lpddr4_odt = <60>;
- };
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-backend.yaml b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-backend.yaml
index 3d8ea3c2d8dd..ba06d1857b7d 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-backend.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-backend.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun4i-a10-display-backend.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Display Engine Backend Device Tree Bindings
+title: Allwinner A10 Display Engine Backend
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-engine.yaml b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-engine.yaml
index e77523b02fad..e6088f379f70 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-engine.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-engine.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun4i-a10-display-engine.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Display Engine Pipeline Device Tree Bindings
+title: Allwinner A10 Display Engine Pipeline
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -62,6 +62,7 @@ properties:
- allwinner,sun8i-r40-display-engine
- allwinner,sun8i-v3s-display-engine
- allwinner,sun9i-a80-display-engine
+ - allwinner,sun20i-d1-display-engine
- allwinner,sun50i-a64-display-engine
- allwinner,sun50i-h6-display-engine
@@ -69,6 +70,8 @@ properties:
$ref: /schemas/types.yaml#/definitions/phandle-array
minItems: 1
maxItems: 2
+ items:
+ maxItems: 1
description: |
      Available display engine frontends (DE 1.0) or mixers
      (DE 2.0/3.0).
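
(As a sketch of how this constraint is consumed: a display-engine node
lists one phandle per available pipeline. The property name
"allwinner,pipelines", the H3 compatible string and the mixer labels are
illustrative assumptions, since only a fragment of the schema is visible
here.)

    display-engine {
        compatible = "allwinner,sun8i-h3-display-engine";
        /* One or two entries, each a single phandle, matching the
           minItems/maxItems/items limits above. */
        allwinner,pipelines = <&mixer0>, <&mixer1>;
    };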
@@ -91,6 +94,7 @@ if:
- allwinner,sun8i-a83t-display-engine
- allwinner,sun8i-r40-display-engine
- allwinner,sun9i-a80-display-engine
+ - allwinner,sun20i-d1-display-engine
- allwinner,sun50i-a64-display-engine
then:
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-frontend.yaml b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-frontend.yaml
index 055157fbf3bf..98e8240a05bd 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-frontend.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-display-frontend.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun4i-a10-display-frontend.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Display Engine Frontend Device Tree Bindings
+title: Allwinner A10 Display Engine Frontend
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-hdmi.yaml b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-hdmi.yaml
index 7f11452539f4..55703caacb9c 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-hdmi.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-hdmi.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun4i-a10-hdmi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 HDMI Controller Device Tree Bindings
+title: Allwinner A10 HDMI Controller
description: |
The HDMI Encoder supports the HDMI video and audio outputs, and does
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tcon.yaml b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tcon.yaml
index 3a7d5d731712..724d93b9193b 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tcon.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tcon.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun4i-a10-tcon.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 Timings Controller (TCON) Device Tree Bindings
+title: Allwinner A10 Timings Controller (TCON)
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -33,6 +33,8 @@ properties:
- const: allwinner,sun8i-v3s-tcon
- const: allwinner,sun9i-a80-tcon-lcd
- const: allwinner,sun9i-a80-tcon-tv
+ - const: allwinner,sun20i-d1-tcon-lcd
+ - const: allwinner,sun20i-d1-tcon-tv
- items:
- enum:
@@ -231,6 +233,7 @@ allOf:
- allwinner,sun8i-a83t-tcon-lcd
- allwinner,sun8i-v3s-tcon
- allwinner,sun9i-a80-tcon-lcd
+ - allwinner,sun20i-d1-tcon-lcd
then:
properties:
@@ -250,6 +253,7 @@ allOf:
- allwinner,sun8i-a83t-tcon-tv
- allwinner,sun8i-r40-tcon-tv
- allwinner,sun9i-a80-tcon-tv
+ - allwinner,sun20i-d1-tcon-tv
then:
properties:
@@ -276,6 +280,7 @@ allOf:
- allwinner,sun9i-a80-tcon-lcd
- allwinner,sun4i-a10-tcon
- allwinner,sun8i-a83t-tcon-lcd
+ - allwinner,sun20i-d1-tcon-lcd
then:
required:
@@ -292,6 +297,7 @@ allOf:
- allwinner,sun8i-a23-tcon
- allwinner,sun8i-a33-tcon
- allwinner,sun8i-a83t-tcon-lcd
+ - allwinner,sun20i-d1-tcon-lcd
then:
properties:
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tv-encoder.yaml b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tv-encoder.yaml
index afc0ed799e0e..c39e90a5945f 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tv-encoder.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun4i-a10-tv-encoder.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun4i-a10-tv-encoder.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 TV Encoder Device Tree Bindings
+title: Allwinner A10 TV Encoder
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-drc.yaml b/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-drc.yaml
index 71cce5687580..895506d93f4c 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-drc.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-drc.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun6i-a31-drc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A31 Dynamic Range Controller Device Tree Bindings
+title: Allwinner A31 Dynamic Range Controller
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-mipi-dsi.yaml b/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-mipi-dsi.yaml
index bf0bdf54e5f9..c731fbdc2fe0 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-mipi-dsi.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun6i-a31-mipi-dsi.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun6i-a31-mipi-dsi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A31 MIPI-DSI Controller Device Tree Bindings
+title: Allwinner A31 MIPI-DSI Controller
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -12,9 +12,14 @@ maintainers:
properties:
compatible:
- enum:
- - allwinner,sun6i-a31-mipi-dsi
- - allwinner,sun50i-a64-mipi-dsi
+ oneOf:
+ - enum:
+ - allwinner,sun6i-a31-mipi-dsi
+ - allwinner,sun50i-a64-mipi-dsi
+ - allwinner,sun50i-a100-mipi-dsi
+ - items:
+ - const: allwinner,sun20i-d1-mipi-dsi
+ - const: allwinner,sun50i-a100-mipi-dsi
reg:
maxItems: 1
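
(A sketch of the new D1 entry above: the D1-specific string must be
paired with the A100 string as its fallback, matching the "items" list.
The node name, unit address and label are illustrative assumptions.)

    dsi0: dsi@5450000 {
        compatible = "allwinner,sun20i-d1-mipi-dsi",
                     "allwinner,sun50i-a100-mipi-dsi";
        /* reg, clocks, phys, resets and the output port follow,
           as required by the rest of this schema. */
    };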
@@ -59,7 +64,6 @@ required:
- phys
- phy-names
- resets
- - vcc-dsi-supply
- port
allOf:
@@ -68,7 +72,9 @@ allOf:
properties:
compatible:
contains:
- const: allwinner,sun6i-a31-mipi-dsi
+ enum:
+ - allwinner,sun6i-a31-mipi-dsi
+ - allwinner,sun50i-a100-mipi-dsi
then:
properties:
@@ -78,16 +84,22 @@ allOf:
required:
- clock-names
+ else:
+ properties:
+ clocks:
+ maxItems: 1
+
- if:
properties:
compatible:
contains:
- const: allwinner,sun50i-a64-mipi-dsi
+ enum:
+ - allwinner,sun6i-a31-mipi-dsi
+ - allwinner,sun50i-a64-mipi-dsi
then:
- properties:
- clocks:
- minItems: 1
+ required:
+ - vcc-dsi-supply
unevaluatedProperties: false
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-de2-mixer.yaml b/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-de2-mixer.yaml
index 4f91eec26de9..b75c1ec686ad 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-de2-mixer.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-de2-mixer.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun8i-a83t-de2-mixer.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner Display Engine 2.0 Mixer Device Tree Bindings
+title: Allwinner Display Engine 2.0 Mixer
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -19,6 +19,8 @@ properties:
- allwinner,sun8i-r40-de2-mixer-0
- allwinner,sun8i-r40-de2-mixer-1
- allwinner,sun8i-v3s-de2-mixer
+ - allwinner,sun20i-d1-de2-mixer-0
+ - allwinner,sun20i-d1-de2-mixer-1
- allwinner,sun50i-a64-de2-mixer-0
- allwinner,sun50i-a64-de2-mixer-1
- allwinner,sun50i-h6-de3-mixer-0
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-dw-hdmi.yaml b/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-dw-hdmi.yaml
index 4951b5ef5c6a..60fd927b5a06 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-dw-hdmi.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-dw-hdmi.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun8i-a83t-dw-hdmi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A83t DWC HDMI TX Encoder Device Tree Bindings
+title: Allwinner A83t DWC HDMI TX Encoder
description: |
The HDMI transmitter is a Synopsys DesignWare HDMI 1.4 TX controller
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-hdmi-phy.yaml b/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-hdmi-phy.yaml
index a97366aaf924..1b47f3d99a78 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-hdmi-phy.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun8i-a83t-hdmi-phy.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun8i-a83t-hdmi-phy.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A83t HDMI PHY Device Tree Bindings
+title: Allwinner A83t HDMI PHY
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun8i-r40-tcon-top.yaml b/Documentation/devicetree/bindings/display/allwinner,sun8i-r40-tcon-top.yaml
index 61ef7b337218..7d849c4095a3 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun8i-r40-tcon-top.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun8i-r40-tcon-top.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun8i-r40-tcon-top.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner R40 TCON TOP Device Tree Bindings
+title: Allwinner R40 TCON TOP
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -41,6 +41,7 @@ properties:
compatible:
enum:
- allwinner,sun8i-r40-tcon-top
+ - allwinner,sun20i-d1-tcon-top
- allwinner,sun50i-h6-tcon-top
reg:
@@ -48,31 +49,15 @@ properties:
clocks:
minItems: 2
- items:
- - description: The TCON TOP interface clock
- - description: The TCON TOP TV0 clock
- - description: The TCON TOP TVE0 clock
- - description: The TCON TOP TV1 clock
- - description: The TCON TOP TVE1 clock
- - description: The TCON TOP MIPI DSI clock
+ maxItems: 6
clock-names:
minItems: 2
- items:
- - const: bus
- - const: tcon-tv0
- - const: tve0
- - const: tcon-tv1
- - const: tve1
- - const: dsi
+ maxItems: 6
clock-output-names:
minItems: 1
maxItems: 3
- description: >
- The first item is the name of the clock created for the TV0
- channel, the second item is the name of the TCON TV1 channel
- clock and the third one is the name of the DSI channel clock.
resets:
maxItems: 1
@@ -129,32 +114,92 @@ required:
additionalProperties: false
-if:
- properties:
- compatible:
- contains:
- const: allwinner,sun50i-h6-tcon-top
-
-then:
- properties:
- clocks:
- maxItems: 2
-
- clock-output-names:
- maxItems: 1
-
-else:
- properties:
- clocks:
- minItems: 6
-
- clock-output-names:
- minItems: 3
-
- ports:
- required:
- - port@2
- - port@3
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: allwinner,sun8i-r40-tcon-top
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: The TCON TOP interface clock
+ - description: The TCON TOP TV0 clock
+ - description: The TCON TOP TVE0 clock
+ - description: The TCON TOP TV1 clock
+ - description: The TCON TOP TVE1 clock
+ - description: The TCON TOP MIPI DSI clock
+
+ clock-names:
+ items:
+ - const: bus
+ - const: tcon-tv0
+ - const: tve0
+ - const: tcon-tv1
+ - const: tve1
+ - const: dsi
+
+ clock-output-names:
+ items:
+ - description: TCON TV0 output clock name
+ - description: TCON TV1 output clock name
+ - description: DSI output clock name
+
+ ports:
+ required:
+ - port@2
+ - port@3
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: allwinner,sun20i-d1-tcon-top
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: The TCON TOP interface clock
+ - description: The TCON TOP TV0 clock
+ - description: The TCON TOP TVE0 clock
+ - description: The TCON TOP MIPI DSI clock
+
+ clock-names:
+ items:
+ - const: bus
+ - const: tcon-tv0
+ - const: tve0
+ - const: dsi
+
+ clock-output-names:
+ items:
+ - description: TCON TV0 output clock name
+ - description: DSI output clock name
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: allwinner,sun50i-h6-tcon-top
+
+ then:
+ properties:
+ clocks:
+ items:
+ - description: The TCON TOP interface clock
+ - description: The TCON TOP TV0 clock
+
+ clock-names:
+ items:
+ - const: bus
+ - const: tcon-tv0
+
+ clock-output-names:
+ items:
+ - description: TCON TV0 output clock name
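
(To make the D1 branch above concrete, a sketch of a matching node
follows. Only the clock names and their order come from the schema; the
unit address, phandles and clock indices are illustrative assumptions.)

    tcon-top@5460000 {
        compatible = "allwinner,sun20i-d1-tcon-top";
        reg = <0x5460000 0x1000>;
        clocks = <&ccu 0>, <&ccu 1>, <&ccu 2>, <&ccu 3>;
        clock-names = "bus", "tcon-tv0", "tve0", "dsi";
        clock-output-names = "tcon-top-tv0", "tcon-top-dsi";
        /* resets and the ports node omitted for brevity */
    };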
examples:
- |
diff --git a/Documentation/devicetree/bindings/display/allwinner,sun9i-a80-deu.yaml b/Documentation/devicetree/bindings/display/allwinner,sun9i-a80-deu.yaml
index 637372ec4614..193afee2c3c1 100644
--- a/Documentation/devicetree/bindings/display/allwinner,sun9i-a80-deu.yaml
+++ b/Documentation/devicetree/bindings/display/allwinner,sun9i-a80-deu.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/allwinner,sun9i-a80-deu.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A80 Detail Enhancement Unit Device Tree Bindings
+title: Allwinner A80 Detail Enhancement Unit
maintainers:
- Chen-Yu Tsai <wens@csie.org>
diff --git a/Documentation/devicetree/bindings/display/amlogic,meson-dw-hdmi.yaml b/Documentation/devicetree/bindings/display/amlogic,meson-dw-hdmi.yaml
index cf5a208f2f10..0c85894648d8 100644
--- a/Documentation/devicetree/bindings/display/amlogic,meson-dw-hdmi.yaml
+++ b/Documentation/devicetree/bindings/display/amlogic,meson-dw-hdmi.yaml
@@ -2,13 +2,16 @@
# Copyright 2019 BayLibre, SAS
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/display/amlogic,meson-dw-hdmi.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/display/amlogic,meson-dw-hdmi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Amlogic specific extensions to the Synopsys Designware HDMI Controller
maintainers:
- - Neil Armstrong <narmstrong@baylibre.com>
+ - Neil Armstrong <neil.armstrong@linaro.org>
+
+allOf:
+ - $ref: /schemas/sound/dai-common.yaml#
description: |
The Amlogic Meson Synopsys Designware Integration is composed of
@@ -99,6 +102,8 @@ properties:
"#sound-dai-cells":
const: 0
+ sound-name-prefix: true
+
required:
- compatible
- reg
@@ -145,4 +150,3 @@ examples:
};
};
};
-
diff --git a/Documentation/devicetree/bindings/display/amlogic,meson-vpu.yaml b/Documentation/devicetree/bindings/display/amlogic,meson-vpu.yaml
index 851cb0781217..0c72120acc4f 100644
--- a/Documentation/devicetree/bindings/display/amlogic,meson-vpu.yaml
+++ b/Documentation/devicetree/bindings/display/amlogic,meson-vpu.yaml
@@ -2,13 +2,13 @@
# Copyright 2019 BayLibre, SAS
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/display/amlogic,meson-vpu.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/display/amlogic,meson-vpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Amlogic Meson Display Controller
maintainers:
- - Neil Armstrong <narmstrong@baylibre.com>
+ - Neil Armstrong <neil.armstrong@linaro.org>
description: |
The Amlogic Meson Display controller is composed of several components
@@ -78,6 +78,10 @@ properties:
interrupts:
maxItems: 1
+ amlogic,canvas:
+ description: should point to a canvas provider node
+ $ref: /schemas/types.yaml#/definitions/phandle
+
power-domains:
maxItems: 1
description: phandle to the associated power domain
@@ -106,6 +110,7 @@ required:
- port@1
- "#address-cells"
- "#size-cells"
+ - amlogic,canvas
additionalProperties: false
@@ -118,6 +123,7 @@ examples:
interrupts = <3>;
#address-cells = <1>;
#size-cells = <0>;
+ amlogic,canvas = <&canvas>;
/* CVBS VDAC output port */
port@0 {
diff --git a/Documentation/devicetree/bindings/display/arm,hdlcd.txt b/Documentation/devicetree/bindings/display/arm,hdlcd.txt
deleted file mode 100644
index 78bc24296f3e..000000000000
--- a/Documentation/devicetree/bindings/display/arm,hdlcd.txt
+++ /dev/null
@@ -1,79 +0,0 @@
-ARM HDLCD
-
-This is a display controller found on several development platforms produced
-by ARM Ltd and in its more modern Fast Models. The HDLCD is an RGB
-streamer that reads the data from a framebuffer and sends it to a single
-digital encoder (DVI or HDMI).
-
-Required properties:
- - compatible: "arm,hdlcd"
- - reg: Physical base address and length of the controller's registers.
- - interrupts: One interrupt used by the display controller to notify the
- interrupt controller when any of the interrupt sources programmed in
- the interrupt mask register have activated.
- - clocks: A list of phandle + clock-specifier pairs, one for each
- entry in 'clock-names'.
- - clock-names: A list of clock names. For HDLCD it should contain:
- - "pxlclk" for the clock feeding the output PLL of the controller.
-
-Required sub-nodes:
- - port: The HDLCD connection to an encoder chip. The connection is modeled
- using the OF graph bindings specified in
- Documentation/devicetree/bindings/graph.txt.
-
-Optional properties:
- - memory-region: phandle to a node describing memory (see
- Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt) to be
- used for the framebuffer; if not present, the framebuffer may be located
- anywhere in memory.
-
-
-Example:
-
-/ {
- ...
-
- hdlcd@2b000000 {
- compatible = "arm,hdlcd";
- reg = <0 0x2b000000 0 0x1000>;
- interrupts = <GIC_SPI 85 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&oscclk5>;
- clock-names = "pxlclk";
- port {
- hdlcd_output: endpoint@0 {
- remote-endpoint = <&hdmi_enc_input>;
- };
- };
- };
-
- /* HDMI encoder on I2C bus */
- i2c@7ffa0000 {
- ....
- hdmi-transmitter@70 {
- compatible = ".....";
- reg = <0x70>;
- port@0 {
- hdmi_enc_input: endpoint {
- remote-endpoint = <&hdlcd_output>;
- };
-
- hdmi_enc_output: endpoint {
- remote-endpoint = <&hdmi_1_port>;
- };
- };
- };
-
- };
-
- hdmi1: connector@1 {
- compatible = "hdmi-connector";
- type = "a";
- port {
- hdmi_1_port: endpoint {
- remote-endpoint = <&hdmi_enc_output>;
- };
- };
- };
-
- ...
-};
diff --git a/Documentation/devicetree/bindings/display/arm,hdlcd.yaml b/Documentation/devicetree/bindings/display/arm,hdlcd.yaml
new file mode 100644
index 000000000000..9a30e9005e8a
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/arm,hdlcd.yaml
@@ -0,0 +1,89 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/arm,hdlcd.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm HDLCD display controller
+
+maintainers:
+ - Liviu Dudau <Liviu.Dudau@arm.com>
+ - Andre Przywara <andre.przywara@arm.com>
+
+description:
+ The Arm HDLCD is a display controller found on several development platforms
+ produced by ARM Ltd and in more modern of its Fast Models. The HDLCD is an
+  produced by ARM Ltd and in its more modern Fast Models. The HDLCD is an
+ digital encoder (DVI or HDMI).
+
+properties:
+ compatible:
+ const: arm,hdlcd
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clock-names:
+ const: pxlclk
+
+ clocks:
+ maxItems: 1
+ description: The input reference for the pixel clock.
+
+ memory-region:
+ maxItems: 1
+ description:
+ Phandle to a node describing memory to be used for the framebuffer.
+ If not present, the framebuffer may be located anywhere in memory.
+
+ iommus:
+ maxItems: 1
+
+ port:
+ $ref: /schemas/graph.yaml#/properties/port
+ unevaluatedProperties: false
+ description:
+ Output endpoint of the controller, connecting the LCD panel signals.
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - port
+
+examples:
+ - |
+ hdlcd@2b000000 {
+ compatible = "arm,hdlcd";
+ reg = <0x2b000000 0x1000>;
+ interrupts = <0 85 4>;
+ clocks = <&oscclk5>;
+ clock-names = "pxlclk";
+ port {
+ hdlcd_output: endpoint {
+ remote-endpoint = <&hdmi_enc_input>;
+ };
+ };
+ };
+
+ /* HDMI encoder on I2C bus */
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ hdmi-transmitter@70 {
+ compatible = "nxp,tda998x";
+ reg = <0x70>;
+ port {
+ hdmi_enc_input: endpoint {
+ remote-endpoint = <&hdlcd_output>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/arm,komeda.txt b/Documentation/devicetree/bindings/display/arm,komeda.txt
deleted file mode 100644
index 8513695ee47f..000000000000
--- a/Documentation/devicetree/bindings/display/arm,komeda.txt
+++ /dev/null
@@ -1,78 +0,0 @@
-Device Tree bindings for Arm Komeda display driver
-
-Required properties:
-- compatible: Should be "arm,mali-d71"
-- reg: Physical base address and length of the registers in the system
-- interrupts: the interrupt line number of the device in the system
-- clocks: A list of phandle + clock-specifier pairs, one for each entry
- in 'clock-names'
-- clock-names: A list of clock names. It should contain:
- - "aclk": for the main processor clock
-- #address-cells: Must be 1
-- #size-cells: Must be 0
-- iommus: Configures the stream IDs for the IOMMU. Must be configured if you
-  want to enable the IOMMU for the display. For how to configure this node,
-  please refer to devicetree/bindings/iommu/arm,smmu-v3.txt and
-  devicetree/bindings/iommu/iommu.txt
-
-Required properties for sub-node: pipeline@N
-Each device contains one or two pipeline sub-nodes (at least one); each
-pipeline node should provide the following properties:
-- reg: Zero-indexed identifier for the pipeline
-- clocks: A list of phandle + clock-specifier pairs, one for each entry
- in 'clock-names'
-- clock-names: should contain:
- - "pxclk": pixel clock
-
-- port: each pipeline connect to an encoder input port. The connection is
- modeled using the OF graph bindings specified in
- Documentation/devicetree/bindings/graph.txt
-
-Optional properties:
- - memory-region: phandle to a node describing memory (see
- Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt)
- to be used for the framebuffer; if not present, the framebuffer may
- be located anywhere in memory.
-
-Example:
-/ {
- ...
-
- dp0: display@c00000 {
- #address-cells = <1>;
- #size-cells = <0>;
- compatible = "arm,mali-d71";
- reg = <0xc00000 0x20000>;
- interrupts = <0 168 4>;
- clocks = <&dpu_aclk>;
- clock-names = "aclk";
- iommus = <&smmu 0>, <&smmu 1>, <&smmu 2>, <&smmu 3>,
- <&smmu 4>, <&smmu 5>, <&smmu 6>, <&smmu 7>,
- <&smmu 8>, <&smmu 9>;
-
- dp0_pipe0: pipeline@0 {
- clocks = <&fpgaosc2>;
- clock-names = "pxclk";
- reg = <0>;
-
- port {
- dp0_pipe0_out: endpoint {
- remote-endpoint = <&db_dvi0_in>;
- };
- };
- };
-
- dp0_pipe1: pipeline@1 {
- clocks = <&fpgaosc2>;
- clock-names = "pxclk";
- reg = <1>;
-
- port {
- dp0_pipe1_out: endpoint {
- remote-endpoint = <&db_dvi1_in>;
- };
- };
- };
- };
- ...
-};
diff --git a/Documentation/devicetree/bindings/display/arm,komeda.yaml b/Documentation/devicetree/bindings/display/arm,komeda.yaml
new file mode 100644
index 000000000000..3ad3eef89ca8
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/arm,komeda.yaml
@@ -0,0 +1,131 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/arm,komeda.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm Komeda display processor
+
+maintainers:
+ - Liviu Dudau <Liviu.Dudau@arm.com>
+ - Andre Przywara <andre.przywara@arm.com>
+
+description:
+ The Arm Mali D71 display processor supports up to two displays with up
+ to a 4K resolution each. Each pipeline can be composed of up to four
+ layers. It is typically connected to a digital display connector like HDMI.
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - const: arm,mali-d32
+ - const: arm,mali-d71
+ - const: arm,mali-d71
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clock-names:
+ const: aclk
+
+ clocks:
+ maxItems: 1
+ description: The main DPU processor clock
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+ memory-region:
+ maxItems: 1
+ description:
+ Phandle to a node describing memory to be used for the framebuffer.
+ If not present, the framebuffer may be located anywhere in memory.
+
+ iommus:
+ description:
+ The stream IDs for each of the used pipelines, each four IDs for the
+ four layers, plus one for the write-back stream.
+ minItems: 5
+ maxItems: 10
+
+patternProperties:
+ '^pipeline@[01]$':
+ type: object
+ additionalProperties: false
+    description:
+      One display output pipeline of the processor.
+
+ properties:
+ reg:
+ enum: [ 0, 1 ]
+
+ clock-names:
+ const: pxclk
+
+ clocks:
+ maxItems: 1
+ description: The input reference for the pixel clock.
+
+ port:
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
+
+additionalProperties: false
+
+required:
+ - "#address-cells"
+ - "#size-cells"
+ - compatible
+ - reg
+ - interrupts
+ - clock-names
+ - clocks
+ - pipeline@0
+
+examples:
+ - |
+ display@c00000 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ compatible = "arm,mali-d71";
+ reg = <0xc00000 0x20000>;
+ interrupts = <168>;
+ clocks = <&dpu_aclk>;
+ clock-names = "aclk";
+ iommus = <&smmu 0>, <&smmu 1>, <&smmu 2>, <&smmu 3>,
+ <&smmu 8>,
+ <&smmu 4>, <&smmu 5>, <&smmu 6>, <&smmu 7>,
+ <&smmu 9>;
+
+ dp0_pipe0: pipeline@0 {
+ clocks = <&fpgaosc2>;
+ clock-names = "pxclk";
+ reg = <0>;
+
+ port {
+ dp0_pipe0_out: endpoint {
+ remote-endpoint = <&db_dvi0_in>;
+ };
+ };
+ };
+
+ dp0_pipe1: pipeline@1 {
+ clocks = <&fpgaosc2>;
+ clock-names = "pxclk";
+ reg = <1>;
+
+ port {
+ dp0_pipe1_out: endpoint {
+ remote-endpoint = <&db_dvi1_in>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/arm,malidp.txt b/Documentation/devicetree/bindings/display/arm,malidp.txt
deleted file mode 100644
index 7a97a2b48c2a..000000000000
--- a/Documentation/devicetree/bindings/display/arm,malidp.txt
+++ /dev/null
@@ -1,68 +0,0 @@
-ARM Mali-DP
-
-The following bindings apply to a family of Display Processors sold as
-licensable IP by ARM Ltd. The bindings describe the Mali DP500, DP550 and
-DP650 processors that offer multiple composition layers, support for
-rotation and scaling output.
-
-Required properties:
- - compatible: should be one of
- "arm,mali-dp500"
- "arm,mali-dp550"
- "arm,mali-dp650"
- depending on the particular implementation present in the hardware
- - reg: Physical base address and size of the block of registers used by
- the processor.
- - interrupts: Interrupt list, as defined in ../interrupt-controller/interrupts.txt,
- interrupt client nodes.
- - interrupt-names: name of the engine inside the processor that will
- use the corresponding interrupt. Should be one of "DE" or "SE".
- - clocks: A list of phandle + clock-specifier pairs, one for each entry
- in 'clock-names'
- - clock-names: A list of clock names. It should contain:
- - "pclk": for the APB interface clock
- - "aclk": for the AXI interface clock
- - "mclk": for the main processor clock
- - "pxlclk": for the pixel clock feeding the output PLL of the processor.
- - arm,malidp-output-port-lines: Array of u8 values describing the number
- of output lines per channel (R, G and B).
-
-Required sub-nodes:
- - port: The Mali DP connection to an encoder input port. The connection
- is modelled using the OF graph bindings specified in
- Documentation/devicetree/bindings/graph.txt
-
-Optional properties:
- - memory-region: phandle to a node describing memory (see
- Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt)
- to be used for the framebuffer; if not present, the framebuffer may
- be located anywhere in memory.
- - arm,malidp-arqos-high-level: integer of u32 value describing the ARQoS
- levels of DP500's QoS signaling.
-
-
-Example:
-
-/ {
- ...
-
- dp0: malidp@6f200000 {
- compatible = "arm,mali-dp650";
- reg = <0 0x6f200000 0 0x20000>;
- memory-region = <&display_reserved>;
- interrupts = <0 168 IRQ_TYPE_LEVEL_HIGH>,
- <0 168 IRQ_TYPE_LEVEL_HIGH>;
- interrupt-names = "DE", "SE";
- clocks = <&oscclk2>, <&fpgaosc0>, <&fpgaosc1>, <&fpgaosc1>;
- clock-names = "pxlclk", "mclk", "aclk", "pclk";
- arm,malidp-output-port-lines = /bits/ 8 <8 8 8>;
- arm,malidp-arqos-high-level = <0xd000d000>;
- port {
- dp0_output: endpoint {
- remote-endpoint = <&tda998x_2_input>;
- };
- };
- };
-
- ...
-};
diff --git a/Documentation/devicetree/bindings/display/arm,malidp.yaml b/Documentation/devicetree/bindings/display/arm,malidp.yaml
new file mode 100644
index 000000000000..91812573fd08
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/arm,malidp.yaml
@@ -0,0 +1,119 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/arm,malidp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm Mali Display Processor (Mali-DP)
+
+maintainers:
+ - Liviu Dudau <Liviu.Dudau@arm.com>
+ - Andre Przywara <andre.przywara@arm.com>
+
+description:
+ The following bindings apply to a family of Display Processors sold as
+ licensable IP by ARM Ltd. The bindings describe the Mali DP500, DP550 and
+ DP650 processors that offer multiple composition layers, support for
+ rotation and scaling output.
+
+properties:
+ compatible:
+ enum:
+ - arm,mali-dp500
+ - arm,mali-dp550
+ - arm,mali-dp650
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ items:
+ - description:
+ The interrupt used by the Display Engine (DE). Can be shared with
+ the interrupt for the Scaling Engine (SE), but it will have to be
+ listed individually.
+ - description:
+ The interrupt used by the Scaling Engine (SE). Can be shared with
+ the interrupt for the Display Engine (DE), but it will have to be
+ listed individually.
+
+ interrupt-names:
+ items:
+ - const: DE
+ - const: SE
+
+ clock-names:
+ items:
+ - const: pxlclk
+ - const: mclk
+ - const: aclk
+ - const: pclk
+
+ clocks:
+ items:
+ - description: the pixel clock feeding the output PLL of the processor
+ - description: the main processor clock
+ - description: the AXI interface clock
+ - description: the APB interface clock
+
+ memory-region:
+ maxItems: 1
+ description:
+ Phandle to a node describing memory to be used for the framebuffer.
+ If not present, the framebuffer may be located anywhere in memory.
+
+ arm,malidp-output-port-lines:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ description:
+ Number of output lines/bits for each colour channel.
+ items:
+ - description: number of output lines for the red channel (R)
+ - description: number of output lines for the green channel (G)
+ - description: number of output lines for the blue channel (B)
+
+ arm,malidp-arqos-value:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Quality-of-Service value for the display engine FIFOs, to write
+ into the RQOS register of the DP500.
+ See the ARM Mali-DP500 TRM for details on the encoding.
+ If omitted, the RQOS register will not be changed.
+
+ port:
+ $ref: /schemas/graph.yaml#/properties/port
+ unevaluatedProperties: false
+ description:
+ Output endpoint of the controller, connecting the LCD panel signals.
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - interrupt-names
+ - clocks
+ - clock-names
+ - port
+ - arm,malidp-output-port-lines
+
+examples:
+ - |
+ dp0: malidp@6f200000 {
+ compatible = "arm,mali-dp650";
+ reg = <0x6f200000 0x20000>;
+ memory-region = <&display_reserved>;
+ interrupts = <168>, <168>;
+ interrupt-names = "DE", "SE";
+ clocks = <&oscclk2>, <&fpgaosc0>, <&fpgaosc1>, <&fpgaosc1>;
+ clock-names = "pxlclk", "mclk", "aclk", "pclk";
+ arm,malidp-output-port-lines = /bits/ 8 <8 8 8>;
+ arm,malidp-arqos-value = <0xd000d000>;
+
+ port {
+ dp0_output: endpoint {
+ remote-endpoint = <&tda998x_2_input>;
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/arm,pl11x.txt b/Documentation/devicetree/bindings/display/arm,pl11x.txt
deleted file mode 100644
index 3f977e72a200..000000000000
--- a/Documentation/devicetree/bindings/display/arm,pl11x.txt
+++ /dev/null
@@ -1,110 +0,0 @@
-* ARM PrimeCell Color LCD Controller PL110/PL111
-
-See also Documentation/devicetree/bindings/arm/primecell.yaml
-
-Required properties:
-
-- compatible: must be one of:
- "arm,pl110", "arm,primecell"
- "arm,pl111", "arm,primecell"
-
-- reg: base address and size of the control registers block
-
-- interrupt-names: either the single entry "combined" representing a
- combined interrupt output (CLCDINTR), or the four entries
- "mbe", "vcomp", "lnbu", "fuf" representing the individual
- CLCDMBEINTR, CLCDVCOMPINTR, CLCDLNBUINTR, CLCDFUFINTR interrupts
-
-- interrupts: contains an interrupt specifier for each entry in
- interrupt-names
-
-- clock-names: should contain "clcdclk" and "apb_pclk"
-
-- clocks: contains phandle and clock specifier pairs for the entries
- in the clock-names property. See
- Documentation/devicetree/bindings/clock/clock-bindings.txt
-
-Optional properties:
-
-- memory-region: phandle to a node describing memory (see
- Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt)
- to be used for the framebuffer; if not present, the framebuffer
- may be located anywhere in the memory
-
-- max-memory-bandwidth: maximum bandwidth in bytes per second that the
- cell's memory interface can handle; if not present, the memory
- interface is fast enough to handle all possible video modes
-
-Required sub-nodes:
-
-- port: describes LCD panel signals, following the common binding
- for video transmitter interfaces; see
- Documentation/devicetree/bindings/media/video-interfaces.txt
-
-Deprecated properties:
-  The port's endpoint subnode used to have the following, now deprecated,
-  property. Drivers should be able to survive without it:
-
- - arm,pl11x,tft-r0g0b0-pads: an array of three 32-bit values,
- defining the way CLD pads are wired up; first value
- contains index of the "CLD" external pin (pad) used
- as R0 (first bit of the red component), second value
- index of the pad used as G0, third value index of the
- pad used as B0, see also "LCD panel signal multiplexing
- details" paragraphs in the PL110/PL111 Technical
- Reference Manuals; this implicitly defines available
- color modes, for example:
- - PL111 TFT 4:4:4 panel:
- arm,pl11x,tft-r0g0b0-pads = <4 15 20>;
- - PL110 TFT (1:)5:5:5 panel:
- arm,pl11x,tft-r0g0b0-pads = <1 7 13>;
- - PL111 TFT (1:)5:5:5 panel:
- arm,pl11x,tft-r0g0b0-pads = <3 11 19>;
- - PL111 TFT 5:6:5 panel:
- arm,pl11x,tft-r0g0b0-pads = <3 10 19>;
- - PL110 and PL111 TFT 8:8:8 panel:
- arm,pl11x,tft-r0g0b0-pads = <0 8 16>;
- - PL110 and PL111 TFT 8:8:8 panel, R & B components swapped:
- arm,pl11x,tft-r0g0b0-pads = <16 8 0>;
-
-
-Example:
-
- clcd@10020000 {
- compatible = "arm,pl111", "arm,primecell";
- reg = <0x10020000 0x1000>;
- interrupt-names = "combined";
- interrupts = <0 44 4>;
- clocks = <&oscclk1>, <&oscclk2>;
- clock-names = "clcdclk", "apb_pclk";
- max-memory-bandwidth = <94371840>; /* Bps, 1024x768@60 16bpp */
-
- port {
- clcd_pads: endpoint {
- remote-endpoint = <&clcd_panel>;
- };
- };
-
- };
-
- panel {
- compatible = "panel-dpi";
-
- port {
- clcd_panel: endpoint {
- remote-endpoint = <&clcd_pads>;
- };
- };
-
- panel-timing {
- clock-frequency = <25175000>;
- hactive = <640>;
- hback-porch = <40>;
- hfront-porch = <24>;
- hsync-len = <96>;
- vactive = <480>;
- vback-porch = <32>;
- vfront-porch = <11>;
- vsync-len = <2>;
- };
- };
diff --git a/Documentation/devicetree/bindings/display/arm,pl11x.yaml b/Documentation/devicetree/bindings/display/arm,pl11x.yaml
new file mode 100644
index 000000000000..6cc9045e5c68
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/arm,pl11x.yaml
@@ -0,0 +1,170 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/arm,pl11x.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm PrimeCell Color LCD Controller PL110/PL111
+
+maintainers:
+ - Liviu Dudau <Liviu.Dudau@arm.com>
+ - Andre Przywara <andre.przywara@arm.com>
+
+description:
+  The Arm PrimeCell PL110/PL111 is an LCD controller IP that scans out
+ a framebuffer region in system memory, and creates timed signals for
+ a variety of LCD panels.
+
+# We need a select here so we don't match all nodes with 'arm,primecell'
+select:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - arm,pl110
+ - arm,pl111
+ required:
+ - compatible
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - arm,pl110
+ - arm,pl111
+ - const: arm,primecell
+
+ reg:
+ maxItems: 1
+
+ interrupt-names:
+ oneOf:
+ - const: combined
+ description:
+ The IP provides four individual interrupt lines, but also one
+ combined line. If the integration only connects this line to the
+ interrupt controller, this single interrupt is noted here.
+ - items:
+ - const: mbe # CLCDMBEINTR
+ - const: vcomp # CLCDVCOMPINTR
+ - const: lnbu # CLCDLNBUINTR
+ - const: fuf # CLCDFUFINTR
+
+ interrupts:
+ minItems: 1
+ maxItems: 4
+
+ clock-names:
+ items:
+ - const: clcdclk
+ - const: apb_pclk
+
+ clocks:
+ items:
+ - description: The CLCDCLK reference clock for the controller.
+ - description: The HCLK AHB slave clock for the register access.
+
+ memory-region:
+ maxItems: 1
+ description:
+ Phandle to a node describing memory to be used for the framebuffer.
+ If not present, the framebuffer may be located anywhere in memory.
+
+ max-memory-bandwidth:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Maximum bandwidth in bytes per second that the cell's memory interface
+ can handle.
+ If not present, the memory interface is fast enough to handle all
+ possible video modes.
+
+ port:
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ additionalProperties: false
+
+ description:
+ Output endpoint of the controller, connecting the LCD panel signals.
+
+ properties:
+ endpoint:
+ $ref: /schemas/graph.yaml#/$defs/endpoint-base
+ unevaluatedProperties: false
+
+ properties:
+ arm,pl11x,tft-r0g0b0-pads:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ - description: index of CLD pad used for first red bit (R0)
+ - description: index of CLD pad used for first green bit (G0)
+              - description: index of CLD pad used for first blue bit (B0)
+ deprecated: true
+ description: |
+ DEPRECATED. An array of three 32-bit values, defining the way
+ CLD[23:0] pads are wired up.
+ The first value contains the index of the "CLD" external pin (pad)
+ used as R0 (first bit of the red component), the second value for
+ green, the third value for blue.
+ See also "LCD panel signal multiplexing details" paragraphs in the
+ PL110/PL111 Technical Reference Manuals.
+ This implicitly defines available color modes, for example:
+ - PL111 TFT 4:4:4 panel:
+ arm,pl11x,tft-r0g0b0-pads = <4 15 20>;
+ - PL110 TFT (1:)5:5:5 panel:
+ arm,pl11x,tft-r0g0b0-pads = <1 7 13>;
+ - PL111 TFT (1:)5:5:5 panel:
+ arm,pl11x,tft-r0g0b0-pads = <3 11 19>;
+ - PL111 TFT 5:6:5 panel:
+ arm,pl11x,tft-r0g0b0-pads = <3 10 19>;
+ - PL110 and PL111 TFT 8:8:8 panel:
+ arm,pl11x,tft-r0g0b0-pads = <0 8 16>;
+ - PL110 and PL111 TFT 8:8:8 panel, R & B components swapped:
+ arm,pl11x,tft-r0g0b0-pads = <16 8 0>;
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - clock-names
+ - clocks
+ - port
+
+allOf:
+ - if:
+ properties:
+ interrupts:
+ minItems: 2
+ required:
+ - interrupts
+ then:
+ required:
+ - interrupt-names
+
+examples:
+ - |
+ clcd@10020000 {
+ compatible = "arm,pl111", "arm,primecell";
+ reg = <0x10020000 0x1000>;
+ interrupt-names = "combined";
+ interrupts = <44>;
+ clocks = <&oscclk1>, <&oscclk2>;
+ clock-names = "clcdclk", "apb_pclk";
+ max-memory-bandwidth = <94371840>; /* Bps, 1024x768@60 16bpp */
+
+ port {
+ clcd_pads: endpoint {
+ remote-endpoint = <&clcd_panel>;
+ };
+ };
+ };
+
+ panel {
+ compatible = "arm,rtsm-display";
+
+ port {
+ clcd_panel: endpoint {
+ remote-endpoint = <&clcd_pads>;
+ };
+ };
+ };
+...
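
The example above uses the single combined interrupt. A board that wires
up the four individual lines would instead follow the second branch of
interrupt-names, roughly as in this sketch (the interrupt numbers and
panel label are illustrative assumptions):

    clcd@10020000 {
        compatible = "arm,pl110", "arm,primecell";
        reg = <0x10020000 0x1000>;
        /* one interrupt specifier per named line, in the same order */
        interrupt-names = "mbe", "vcomp", "lnbu", "fuf";
        interrupts = <41>, <42>, <43>, <44>;
        clocks = <&oscclk1>, <&oscclk2>;
        clock-names = "clcdclk", "apb_pclk";

        port {
            endpoint {
                remote-endpoint = <&panel_in>;
            };
        };
    };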
diff --git a/Documentation/devicetree/bindings/display/atmel,lcdc.txt b/Documentation/devicetree/bindings/display/atmel,lcdc.txt
index acb5a0132127..b5e355ada2fa 100644
--- a/Documentation/devicetree/bindings/display/atmel,lcdc.txt
+++ b/Documentation/devicetree/bindings/display/atmel,lcdc.txt
@@ -9,7 +9,6 @@ Required properties:
"atmel,at91sam9g45-lcdc" ,
"atmel,at91sam9g45es-lcdc" ,
"atmel,at91sam9rl-lcdc" ,
- "atmel,at32ap-lcdc"
- reg : Should contain 1 register ranges(address and length).
Can contain an additional register range(address and length)
for fixed framebuffer memory. Useful for dedicated memories.
diff --git a/Documentation/devicetree/bindings/display/brcm,bcm2711-hdmi.yaml b/Documentation/devicetree/bindings/display/brcm,bcm2711-hdmi.yaml
index a1d5a32660e0..5b35adf34c7b 100644
--- a/Documentation/devicetree/bindings/display/brcm,bcm2711-hdmi.yaml
+++ b/Documentation/devicetree/bindings/display/brcm,bcm2711-hdmi.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/brcm,bcm2711-hdmi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom BCM2711 HDMI Controller Device Tree Bindings
+title: Broadcom BCM2711 HDMI Controller
maintainers:
- Eric Anholt <eric@anholt.net>
@@ -72,8 +72,7 @@ properties:
- const: hpd-removed
ddc:
- allOf:
- - $ref: /schemas/types.yaml#/definitions/phandle
+ $ref: /schemas/types.yaml#/definitions/phandle
description: >
Phandle of the I2C controller used for DDC EDID probing
diff --git a/Documentation/devicetree/bindings/display/brcm,bcm2835-dsi0.yaml b/Documentation/devicetree/bindings/display/brcm,bcm2835-dsi0.yaml
index 32608578a352..c8b2459d64f6 100644
--- a/Documentation/devicetree/bindings/display/brcm,bcm2835-dsi0.yaml
+++ b/Documentation/devicetree/bindings/display/brcm,bcm2835-dsi0.yaml
@@ -47,6 +47,9 @@ properties:
interrupts:
maxItems: 1
+ power-domains:
+ maxItems: 1
+
required:
- "#clock-cells"
- compatible
diff --git a/Documentation/devicetree/bindings/display/brcm,bcm2835-hdmi.yaml b/Documentation/devicetree/bindings/display/brcm,bcm2835-hdmi.yaml
index 031e35e76db2..48c8cad0d96d 100644
--- a/Documentation/devicetree/bindings/display/brcm,bcm2835-hdmi.yaml
+++ b/Documentation/devicetree/bindings/display/brcm,bcm2835-hdmi.yaml
@@ -51,6 +51,9 @@ properties:
dma-names:
const: audio-rx
+ power-domains:
+ maxItems: 1
+
required:
- compatible
- reg
diff --git a/Documentation/devicetree/bindings/display/brcm,bcm2835-v3d.yaml b/Documentation/devicetree/bindings/display/brcm,bcm2835-v3d.yaml
index 8a73780f573d..c55a8217de25 100644
--- a/Documentation/devicetree/bindings/display/brcm,bcm2835-v3d.yaml
+++ b/Documentation/devicetree/bindings/display/brcm,bcm2835-v3d.yaml
@@ -24,6 +24,9 @@ properties:
interrupts:
maxItems: 1
+ power-domains:
+ maxItems: 1
+
required:
- compatible
- reg
diff --git a/Documentation/devicetree/bindings/display/brcm,bcm2835-vec.yaml b/Documentation/devicetree/bindings/display/brcm,bcm2835-vec.yaml
index 9b24081a0dbd..5d921e30394e 100644
--- a/Documentation/devicetree/bindings/display/brcm,bcm2835-vec.yaml
+++ b/Documentation/devicetree/bindings/display/brcm,bcm2835-vec.yaml
@@ -24,6 +24,9 @@ properties:
interrupts:
maxItems: 1
+ power-domains:
+ maxItems: 1
+
required:
- compatible
- reg
diff --git a/Documentation/devicetree/bindings/display/bridge/adi,adv7511.yaml b/Documentation/devicetree/bindings/display/bridge/adi,adv7511.yaml
index d3dd7a79b909..5bbe81862c8f 100644
--- a/Documentation/devicetree/bindings/display/bridge/adi,adv7511.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/adi,adv7511.yaml
@@ -76,9 +76,8 @@ properties:
adi,input-depth:
description: Number of bits per color component at the input.
- allOf:
- - $ref: /schemas/types.yaml#/definitions/uint32
- - enum: [ 8, 10, 12 ]
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [ 8, 10, 12 ]
adi,input-colorspace:
description: Input color space.
@@ -118,23 +117,21 @@ properties:
ports:
description:
- The ADV7511(W)/13 has two video ports and one audio port. This node
- models their connections as documented in
- Documentation/devicetree/bindings/media/video-interfaces.txt
- Documentation/devicetree/bindings/graph.txt
- type: object
+ The ADV7511(W)/13 has two video ports and one audio port.
+ $ref: /schemas/graph.yaml#/properties/ports
+
properties:
port@0:
description: Video port for the RGB or YUV input.
- type: object
+ $ref: /schemas/graph.yaml#/properties/port
port@1:
description: Video port for the HDMI output.
- type: object
+ $ref: /schemas/graph.yaml#/properties/port
port@2:
description: Audio port for the HDMI output.
- type: object
+ $ref: /schemas/graph.yaml#/properties/port
# adi,input-colorspace and adi,input-clock are required except in
# "rgb 1x" and "yuv444 1x" modes, in which case they must not be
diff --git a/Documentation/devicetree/bindings/display/bridge/adi,adv7533.yaml b/Documentation/devicetree/bindings/display/bridge/adi,adv7533.yaml
index f36209137c8a..987aa83c2649 100644
--- a/Documentation/devicetree/bindings/display/bridge/adi,adv7533.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/adi,adv7533.yaml
@@ -91,25 +91,23 @@ properties:
ports:
description:
- The ADV7533/35 has two video ports and one audio port. This node
- models their connections as documented in
- Documentation/devicetree/bindings/media/video-interfaces.txt
- Documentation/devicetree/bindings/graph.txt
- type: object
+ The ADV7533/35 has two video ports and one audio port.
+ $ref: /schemas/graph.yaml#/properties/ports
+
properties:
port@0:
description:
Video port for the DSI input. The remote endpoint phandle
should be a reference to a valid mipi_dsi_host_device.
- type: object
+ $ref: /schemas/graph.yaml#/properties/port
port@1:
description: Video port for the HDMI output.
- type: object
+ $ref: /schemas/graph.yaml#/properties/port
port@2:
description: Audio port for the HDMI output.
- type: object
+ $ref: /schemas/graph.yaml#/properties/port
required:
- compatible
diff --git a/Documentation/devicetree/bindings/display/bridge/analogix,anx7625.yaml b/Documentation/devicetree/bindings/display/bridge/analogix,anx7625.yaml
index ab48ab2f4240..a1ed1004651b 100644
--- a/Documentation/devicetree/bindings/display/bridge/analogix,anx7625.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/analogix,anx7625.yaml
@@ -2,8 +2,8 @@
# Copyright 2019 Analogix Semiconductor, Inc.
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/display/bridge/analogix,anx7625.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/display/bridge/analogix,anx7625.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Analogix ANX7625 SlimPort (4K Mobile HD Transmitter)
@@ -16,8 +16,7 @@ description: |
properties:
compatible:
- items:
- - const: analogix,anx7625
+ const: analogix,anx7625
reg:
maxItems: 1
@@ -43,14 +42,73 @@ properties:
vdd33-supply:
description: Regulator that provides the supply 3.3V power.
+ analogix,lane0-swing:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ minItems: 1
+ maxItems: 20
+ description:
+ An array of swing register settings for the DP TX lane0 PHY.
+ Registers 0~9 are Swing0_Pre0, Swing1_Pre0, Swing2_Pre0,
+ Swing3_Pre0, Swing0_Pre1, Swing1_Pre1, Swing2_Pre1, Swing0_Pre2,
+ Swing1_Pre2, Swing0_Pre3; they hold the [Boost control] and
+ [Swing control] settings.
+ In registers 0~9, bits 3:0 are [Boost control]; these bits set
+ the post cursor manually. Increase [Boost control] to increase
+ the pre-emphasis value.
+ In registers 0~9, bits 6:4 are [Swing control]; these bits set
+ the swing manually. Increase the [Swing control] setting to add
+ Vp-p for each Swing/Pre combination.
+ Registers 10~19 are Swing0_Pre0, Swing1_Pre0, Swing2_Pre0,
+ Swing3_Pre0, Swing0_Pre1, Swing1_Pre1, Swing2_Pre1, Swing0_Pre2,
+ Swing1_Pre2, Swing0_Pre3; they hold the [R select control] and
+ [R termination control] settings.
+ In registers 10~19, bits 4:0 are [R select control]; these bits
+ set the compensation manually. Increasing them enhances the IO
+ drive strength and Vp-p.
+ In registers 10~19, bits 6:5 are [R termination control]; they
+ adjust the 50 ohm impedance of the DP TX termination. 00: 55 ohm,
+ 01: 50 ohm (default), 10: 45 ohm, 11: 40 ohm.
+
+ analogix,lane1-swing:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ minItems: 1
+ maxItems: 20
+ description:
+ An array of swing register settings for the DP TX lane1 PHY.
+ The lane1 swing registers use the same layout as the lane0 ones;
+ see the analogix,lane0-swing property description.
+
+ analogix,audio-enable:
+ type: boolean
+ description: Let the driver enable the HDMI audio codec function.
+
+ aux-bus:
+ $ref: /schemas/display/dp-aux-bus.yaml#
+
ports:
$ref: /schemas/graph.yaml#/properties/ports
properties:
port@0:
- $ref: /schemas/graph.yaml#/properties/port
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
description:
- Video port for MIPI DSI input.
+ MIPI DSI/DPI input.
+
+ properties:
+ endpoint:
+ $ref: /schemas/media/video-interfaces.yaml#
+ type: object
+ additionalProperties: false
+
+ properties:
+ remote-endpoint: true
+
+ bus-type:
+ enum: [7]
+ default: 1
+
+ data-lanes: true
port@1:
$ref: /schemas/graph.yaml#/properties/port
@@ -75,7 +133,7 @@ examples:
- |
#include <dt-bindings/gpio/gpio.h>
- i2c0 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
@@ -87,6 +145,9 @@ examples:
vdd10-supply = <&pp1000_mipibrdg>;
vdd18-supply = <&pp1800_mipibrdg>;
vdd33-supply = <&pp3300_mipibrdg>;
+ analogix,audio-enable;
+ analogix,lane0-swing = /bits/ 8 <0x14 0x54 0x64 0x74>;
+ analogix,lane1-swing = /bits/ 8 <0x14 0x54 0x64 0x74>;
ports {
#address-cells = <1>;
@@ -96,6 +157,8 @@ examples:
reg = <0>;
anx7625_in: endpoint {
remote-endpoint = <&mipi_dsi>;
+ bus-type = <7>;
+ data-lanes = <0 1 2 3>;
};
};
@@ -106,5 +169,19 @@ examples:
};
};
};
+
+ aux-bus {
+ panel {
+ compatible = "innolux,n125hce-gn1";
+ power-supply = <&pp3300_disp_x>;
+ backlight = <&backlight_lcd0>;
+
+ port {
+ panel_in: endpoint {
+ remote-endpoint = <&anx7625_out>;
+ };
+ };
+ };
+ };
};
};
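
As a worked example of the swing register encoding described above, a single
byte such as 0x54 (a value chosen purely for illustration) decodes under that
layout as bits 3:0 = 0x4 for [Boost control] and bits 6:4 = 0x5 for
[Swing control]:

	/* 0x54: swing level 5 (bits 6:4), boost level 4 (bits 3:0) */
	analogix,lane0-swing = /bits/ 8 <0x54>;
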
diff --git a/Documentation/devicetree/bindings/display/bridge/analogix,anx7814.yaml b/Documentation/devicetree/bindings/display/bridge/analogix,anx7814.yaml
index 8e13f27b28ed..4a5e5d9d6f90 100644
--- a/Documentation/devicetree/bindings/display/bridge/analogix,anx7814.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/analogix,anx7814.yaml
@@ -7,7 +7,9 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Analogix ANX7814 SlimPort (Full-HD Transmitter)
maintainers:
- - Enric Balletbo i Serra <enric.balletbo@collabora.com>
+ - Andrzej Hajda <andrzej.hajda@intel.com>
+ - Neil Armstrong <neil.armstrong@linaro.org>
+ - Robert Foss <robert.foss@linaro.org>
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/display/bridge/analogix,dp.yaml b/Documentation/devicetree/bindings/display/bridge/analogix,dp.yaml
new file mode 100644
index 000000000000..c9b06885cc63
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/analogix,dp.yaml
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/analogix,dp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Analogix Display Port bridge
+
+maintainers:
+ - Rob Herring <robh@kernel.org>
+
+properties:
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks: true
+
+ clock-names: true
+
+ phys: true
+
+ phy-names:
+ const: dp
+
+ force-hpd:
+ description:
+ Indicates that the driver must force HPD when HPD detection
+ fails; used for some eDP screens that do not have an HPD signal.
+
+ hpd-gpios:
+ description:
+ Hotplug detect GPIO.
+ Indicates which GPIO should be used for hotplug detection.
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Input node to receive pixel data.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Port node with one endpoint connected to a dp-connector node.
+
+ required:
+ - port@0
+ - port@1
+
+required:
+ - reg
+ - interrupts
+ - clock-names
+ - clocks
+ - ports
+
+additionalProperties: true
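
The new schema carries no example of its own; a minimal node sketch that
satisfies its required properties, reusing values from the text binding
removed below (the phandles are hypothetical):

	dp-controller@145b0000 {
		/* compatible comes from the SoC-specific binding, e.g. */
		compatible = "samsung,exynos5-dp";
		reg = <0x145b0000 0x10000>;
		interrupts = <10 3>;
		interrupt-parent = <&combiner>;
		clocks = <&clock 342>;
		clock-names = "dp";
		phys = <&dp_phy>;
		phy-names = "dp";

		ports {
			#address-cells = <1>;
			#size-cells = <0>;

			port@0 {
				reg = <0>;
				dp_in: endpoint {
					remote-endpoint = <&fimd_out>;
				};
			};

			port@1 {
				reg = <1>;
				dp_out: endpoint {
					remote-endpoint = <&dp_connector_in>;
				};
			};
		};
	};
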
diff --git a/Documentation/devicetree/bindings/display/bridge/analogix_dp.txt b/Documentation/devicetree/bindings/display/bridge/analogix_dp.txt
deleted file mode 100644
index 027d76c27a41..000000000000
--- a/Documentation/devicetree/bindings/display/bridge/analogix_dp.txt
+++ /dev/null
@@ -1,51 +0,0 @@
-Analogix Display Port bridge bindings
-
-Required properties for dp-controller:
- -compatible:
- platform specific such as:
- * "samsung,exynos5-dp"
- * "rockchip,rk3288-dp"
- * "rockchip,rk3399-edp"
- -reg:
- physical base address of the controller and length
- of memory mapped region.
- -interrupts:
- interrupt combiner values.
- -clocks:
- from common clock binding: handle to dp clock.
- -clock-names:
- from common clock binding: Shall be "dp".
- -phys:
- from general PHY binding: the phandle for the PHY device.
- -phy-names:
- from general PHY binding: Should be "dp".
-
-Optional properties for dp-controller:
- -force-hpd:
- Indicate driver need force hpd when hpd detect failed, this
- is used for some eDP screen which don't have hpd signal.
- -hpd-gpios:
- Hotplug detect GPIO.
- Indicates which GPIO should be used for hotplug detection
- -port@[X]: SoC specific port nodes with endpoint definitions as defined
- in Documentation/devicetree/bindings/media/video-interfaces.txt,
- please refer to the SoC specific binding document:
- * Documentation/devicetree/bindings/display/exynos/exynos_dp.txt
- * Documentation/devicetree/bindings/display/rockchip/analogix_dp-rockchip.txt
-
-[1]: Documentation/devicetree/bindings/media/video-interfaces.txt
--------------------------------------------------------------------------------
-
-Example:
-
- dp-controller {
- compatible = "samsung,exynos5-dp";
- reg = <0x145b0000 0x10000>;
- interrupts = <10 3>;
- interrupt-parent = <&combiner>;
- clocks = <&clock 342>;
- clock-names = "dp";
-
- phys = <&dp_phy>;
- phy-names = "dp";
- };
diff --git a/Documentation/devicetree/bindings/display/bridge/anx6345.yaml b/Documentation/devicetree/bindings/display/bridge/anx6345.yaml
index 1c0406c38fe5..514f58852990 100644
--- a/Documentation/devicetree/bindings/display/bridge/anx6345.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/anx6345.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/bridge/anx6345.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Analogix ANX6345 eDP Transmitter Device Tree Bindings
+title: Analogix ANX6345 eDP Transmitter
maintainers:
- Torsten Duwe <duwe@lst.de>
@@ -61,7 +61,7 @@ additionalProperties: false
examples:
- |
- i2c0 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
diff --git a/Documentation/devicetree/bindings/display/bridge/cdns,dsi.txt b/Documentation/devicetree/bindings/display/bridge/cdns,dsi.txt
deleted file mode 100644
index 525a4bfd8634..000000000000
--- a/Documentation/devicetree/bindings/display/bridge/cdns,dsi.txt
+++ /dev/null
@@ -1,112 +0,0 @@
-Cadence DSI bridge
-==================
-
-The Cadence DSI bridge is a DPI to DSI bridge supporting up to 4 DSI lanes.
-
-Required properties:
-- compatible: should be set to "cdns,dsi".
-- reg: physical base address and length of the controller's registers.
-- interrupts: interrupt line connected to the DSI bridge.
-- clocks: DSI bridge clocks.
-- clock-names: must contain "dsi_p_clk" and "dsi_sys_clk".
-- phys: phandle link to the MIPI D-PHY controller.
-- phy-names: must contain "dphy".
-- #address-cells: must be set to 1.
-- #size-cells: must be set to 0.
-
-Optional properties:
-- resets: DSI reset lines.
-- reset-names: can contain "dsi_p_rst".
-
-Required subnodes:
-- ports: Ports as described in Documentation/devicetree/bindings/graph.txt.
- 2 ports are available:
- * port 0: this port is only needed if some of your DSI devices are
- controlled through an external bus like I2C or SPI. Can have at
- most 4 endpoints. The endpoint number is directly encoding the
- DSI virtual channel used by this device.
- * port 1: represents the DPI input.
- Other ports will be added later to support the new kind of inputs.
-
-- one subnode per DSI device connected on the DSI bus. Each DSI device should
- contain a reg property encoding its virtual channel.
-
-Example:
- dsi0: dsi@fd0c0000 {
- compatible = "cdns,dsi";
- reg = <0x0 0xfd0c0000 0x0 0x1000>;
- clocks = <&pclk>, <&sysclk>;
- clock-names = "dsi_p_clk", "dsi_sys_clk";
- interrupts = <1>;
- phys = <&dphy0>;
- phy-names = "dphy";
- #address-cells = <1>;
- #size-cells = <0>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@1 {
- reg = <1>;
- dsi0_dpi_input: endpoint {
- remote-endpoint = <&xxx_dpi_output>;
- };
- };
- };
-
- panel: dsi-dev@0 {
- compatible = "<vendor,panel>";
- reg = <0>;
- };
- };
-
-or
-
- dsi0: dsi@fd0c0000 {
- compatible = "cdns,dsi";
- reg = <0x0 0xfd0c0000 0x0 0x1000>;
- clocks = <&pclk>, <&sysclk>;
- clock-names = "dsi_p_clk", "dsi_sys_clk";
- interrupts = <1>;
- phys = <&dphy1>;
- phy-names = "dphy";
- #address-cells = <1>;
- #size-cells = <0>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- #address-cells = <1>;
- #size-cells = <0>;
-
- dsi0_output: endpoint@0 {
- reg = <0>;
- remote-endpoint = <&dsi_panel_input>;
- };
- };
-
- port@1 {
- reg = <1>;
- dsi0_dpi_input: endpoint {
- remote-endpoint = <&xxx_dpi_output>;
- };
- };
- };
- };
-
- i2c@xxx {
- panel: panel@59 {
- compatible = "<vendor,panel>";
- reg = <0x59>;
-
- port {
- dsi_panel_input: endpoint {
- remote-endpoint = <&dsi0_output>;
- };
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/bridge/cdns,dsi.yaml b/Documentation/devicetree/bindings/display/bridge/cdns,dsi.yaml
new file mode 100644
index 000000000000..23060324d16e
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/cdns,dsi.yaml
@@ -0,0 +1,180 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/cdns,dsi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Cadence DSI bridge
+
+maintainers:
+ - Boris Brezillon <boris.brezillon@bootlin.com>
+
+description: |
+ CDNS DSI is a bridge device which converts DPI to DSI.
+
+properties:
+ compatible:
+ enum:
+ - cdns,dsi
+ - ti,j721e-dsi
+
+ reg:
+ minItems: 1
+ items:
+ - description:
+ Register block for controller's registers.
+ - description:
+ Register block for wrapper settings registers in case of TI J7 SoCs.
+
+ clocks:
+ items:
+ - description: PSM clock, used by the IP
+ - description: sys clock, used by the IP
+
+ clock-names:
+ items:
+ - const: dsi_p_clk
+ - const: dsi_sys_clk
+
+ phys:
+ maxItems: 1
+
+ phy-names:
+ const: dphy
+
+ interrupts:
+ maxItems: 1
+
+ resets:
+ maxItems: 1
+
+ reset-names:
+ const: dsi_p_rst
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Output port representing the DSI output. It can have
+ at most 4 endpoints. The endpoint number is directly encoding
+ the DSI virtual channel used by this device.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Input port representing the DPI input.
+
+ required:
+ - port@1
+
+allOf:
+ - $ref: ../dsi-controller.yaml#
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: ti,j721e-dsi
+ then:
+ properties:
+ reg:
+ minItems: 2
+ maxItems: 2
+ power-domains:
+ maxItems: 1
+ else:
+ properties:
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - phys
+ - phy-names
+ - ports
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ bus {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ dsi@fd0c0000 {
+ compatible = "cdns,dsi";
+ reg = <0x0 0xfd0c0000 0x0 0x1000>;
+ clocks = <&pclk>, <&sysclk>;
+ clock-names = "dsi_p_clk", "dsi_sys_clk";
+ interrupts = <1>;
+ phys = <&dphy0>;
+ phy-names = "dphy";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@1 {
+ reg = <1>;
+ endpoint {
+ remote-endpoint = <&xxx_dpi_output>;
+ };
+ };
+ };
+
+ panel@0 {
+ compatible = "panasonic,vvx10f034n00";
+ reg = <0>;
+ power-supply = <&vcc_lcd_reg>;
+ };
+ };
+ };
+
+ - |
+ bus {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ dsi@fd0c0000 {
+ compatible = "cdns,dsi";
+ reg = <0x0 0xfd0c0000 0x0 0x1000>;
+ clocks = <&pclk>, <&sysclk>;
+ clock-names = "dsi_p_clk", "dsi_sys_clk";
+ interrupts = <1>;
+ phys = <&dphy1>;
+ phy-names = "dphy";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ endpoint@0 {
+ reg = <0>;
+ remote-endpoint = <&dsi_panel_input>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ endpoint {
+ remote-endpoint = <&xxx_dpi_output>;
+ };
+ };
+ };
+ };
+ };
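
The ti,j721e-dsi branch of the conditional above requires a second register
range for the wrapper settings and allows a power domain; a sketch of that
variant (the addresses and the power-domain specifier are hypothetical):

	dsi@4000000 {
		compatible = "ti,j721e-dsi";
		/* controller registers, then the J7 wrapper settings block */
		reg = <0x00 0x04000000 0x00 0x1000>,
		      <0x00 0x04080000 0x00 0x100>;
		clocks = <&pclk>, <&sysclk>;
		clock-names = "dsi_p_clk", "dsi_sys_clk";
		interrupts = <1>;
		phys = <&dphy0>;
		phy-names = "dphy";
		power-domains = <&k3_pds 220>;

		ports {
			#address-cells = <1>;
			#size-cells = <0>;

			port@1 {
				reg = <1>;
				endpoint {
					remote-endpoint = <&dpi_output>;
				};
			};
		};
	};
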
diff --git a/Documentation/devicetree/bindings/display/bridge/cdns,mhdp8546.yaml b/Documentation/devicetree/bindings/display/bridge/cdns,mhdp8546.yaml
index b2e8bc6da9d0..c2b369456e4e 100644
--- a/Documentation/devicetree/bindings/display/bridge/cdns,mhdp8546.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/cdns,mhdp8546.yaml
@@ -1,8 +1,8 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/display/bridge/cdns,mhdp8546.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/display/bridge/cdns,mhdp8546.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Cadence MHDP8546 bridge
diff --git a/Documentation/devicetree/bindings/display/bridge/chipone,icn6211.yaml b/Documentation/devicetree/bindings/display/bridge/chipone,icn6211.yaml
index 62c3bd4cb28d..5fb54375aeb6 100644
--- a/Documentation/devicetree/bindings/display/bridge/chipone,icn6211.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/chipone,icn6211.yaml
@@ -24,6 +24,15 @@ properties:
maxItems: 1
description: virtual channel number of a DSI peripheral
+ clock-names:
+ const: refclk
+
+ clocks:
+ maxItems: 1
+ description: |
+ Optional external clock connected to the REF_CLK input.
+ The clock rate must be in the 10..154 MHz range.
+
enable-gpios:
description: Bridge EN pin, chip is reset when EN is low.
@@ -41,17 +50,32 @@ properties:
properties:
port@0:
- $ref: /schemas/graph.yaml#/properties/port
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
description:
Video port for MIPI DSI input
+ properties:
+ endpoint:
+ $ref: /schemas/media/video-interfaces.yaml#
+ unevaluatedProperties: false
+
+ properties:
+ data-lanes:
+ description: array of physical DSI data lane indexes.
+ minItems: 1
+ items:
+ - const: 1
+ - const: 2
+ - const: 3
+ - const: 4
+
port@1:
$ref: /schemas/graph.yaml#/properties/port
description:
Video port for MIPI DPI output (panel or connector).
required:
- - port@0
- port@1
required:
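
A sketch of how the refclk clock and the DSI data-lanes remapping added
above could be used together (the register address, clock and endpoint
phandles are hypothetical):

	bridge@2c {
		compatible = "chipone,icn6211";
		reg = <0x2c>;
		clocks = <&refclk_26m>;
		clock-names = "refclk";

		ports {
			#address-cells = <1>;
			#size-cells = <0>;

			port@0 {
				reg = <0>;
				bridge_in: endpoint {
					remote-endpoint = <&dsi_out>;
					/* identity mapping of the four DSI data lanes */
					data-lanes = <1 2 3 4>;
				};
			};

			port@1 {
				reg = <1>;
				bridge_out: endpoint {
					remote-endpoint = <&panel_in>;
				};
			};
		};
	};
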
diff --git a/Documentation/devicetree/bindings/display/bridge/chrontel,ch7033.yaml b/Documentation/devicetree/bindings/display/bridge/chrontel,ch7033.yaml
index bb6289c7d375..b0589fa16736 100644
--- a/Documentation/devicetree/bindings/display/bridge/chrontel,ch7033.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/chrontel,ch7033.yaml
@@ -5,7 +5,7 @@
$id: http://devicetree.org/schemas/display/bridge/chrontel,ch7033.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Chrontel CH7033 Video Encoder Device Tree Bindings
+title: Chrontel CH7033 Video Encoder
maintainers:
- Lubomir Rintel <lkundrak@v3.sk>
diff --git a/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-ldb.yaml b/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-ldb.yaml
new file mode 100644
index 000000000000..94543006f5de
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-ldb.yaml
@@ -0,0 +1,173 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/fsl,imx8qxp-ldb.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale i.MX8qm/qxp LVDS Display Bridge
+
+maintainers:
+ - Liu Ying <victor.liu@nxp.com>
+
+description: |
+ The Freescale i.MX8qm/qxp LVDS Display Bridge (LDB) has two channels.
+
+ The i.MX8qm/qxp LDB is controlled by the Control and Status Registers (CSR)
+ module. The CSR module, as a system controller, contains the LDB's
+ configuration registers.
+
+ For the i.MX8qxp LDB, each channel supports up to 24bpp parallel input
+ color format and can map the input to the VESA or JEIDA standards. The two
+ channels cannot be used simultaneously, so the user should pick one of
+ them. Two LDB channels from two LDB instances can work together in LDB
+ split mode to support a dual-link LVDS display. The channel indexes have
+ to be different: channel0 outputs odd pixels and channel1 outputs even
+ pixels.
+
+ For the i.MX8qm LDB, each channel additionally supports up to 30bpp
+ parallel input color format. The two channels can be used simultaneously,
+ either in dual mode or split mode. In dual mode, the two channels output
+ identical data. In split mode, channel0 outputs odd pixels and channel1
+ outputs even pixels.
+
+ As a side note, the i.MX8qm/qxp LDB is officially called the pixel mapper
+ in the SoC reference manuals. The pixel mapper reuses the logic of the
+ LDBs embedded in i.MX6qdl/sx SoCs, i.e. it is essentially based on them.
+ To keep the naming consistent, this binding calls it LDB.
+
+properties:
+ compatible:
+ enum:
+ - fsl,imx8qm-ldb
+ - fsl,imx8qxp-ldb
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+ clocks:
+ items:
+ - description: pixel clock
+ - description: bypass clock
+
+ clock-names:
+ items:
+ - const: pixel
+ - const: bypass
+
+ power-domains:
+ maxItems: 1
+
+ fsl,companion-ldb:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: |
+ A phandle which points to companion LDB which is used in LDB split mode.
+
+patternProperties:
+ "^channel@[0-1]$":
+ type: object
+ description: Represents a channel of LDB.
+
+ properties:
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+ reg:
+ description: The channel index.
+ enum: [ 0, 1 ]
+
+ phys:
+ description: A phandle to the phy module representing the LVDS PHY.
+ maxItems: 1
+
+ phy-names:
+ const: lvds_phy
+
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Input port of the channel.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Output port of the channel.
+
+ required:
+ - "#address-cells"
+ - "#size-cells"
+ - reg
+ - phys
+ - phy-names
+
+ additionalProperties: false
+
+required:
+ - compatible
+ - "#address-cells"
+ - "#size-cells"
+ - clocks
+ - clock-names
+ - power-domains
+ - channel@0
+ - channel@1
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: fsl,imx8qm-ldb
+ then:
+ properties:
+ fsl,companion-ldb: false
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/firmware/imx/rsrc.h>
+ ldb {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ compatible = "fsl,imx8qxp-ldb";
+ clocks = <&clk IMX_SC_R_LVDS_0 IMX_SC_PM_CLK_MISC2>,
+ <&clk IMX_SC_R_LVDS_0 IMX_SC_PM_CLK_BYPASS>;
+ clock-names = "pixel", "bypass";
+ power-domains = <&pd IMX_SC_R_LVDS_0>;
+
+ channel@0 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <0>;
+ phys = <&mipi_lvds_0_phy>;
+ phy-names = "lvds_phy";
+
+ port@0 {
+ reg = <0>;
+
+ mipi_lvds_0_ldb_ch0_mipi_lvds_0_pxl2dpi: endpoint {
+ remote-endpoint = <&mipi_lvds_0_pxl2dpi_mipi_lvds_0_ldb_ch0>;
+ };
+ };
+ };
+
+ channel@1 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <1>;
+ phys = <&mipi_lvds_0_phy>;
+ phy-names = "lvds_phy";
+
+ port@0 {
+ reg = <0>;
+
+ mipi_lvds_0_ldb_ch1_mipi_lvds_0_pxl2dpi: endpoint {
+ remote-endpoint = <&mipi_lvds_0_pxl2dpi_mipi_lvds_0_ldb_ch1>;
+ };
+ };
+ };
+ };
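
The fsl,companion-ldb property, which the example above does not exercise,
pairs two LDB instances for split mode; a sketch of the linkage (the labels
are invented for illustration, and the clocks, power-domains and channel
nodes follow the example above):

	ldb1: ldb-a {
		compatible = "fsl,imx8qxp-ldb";
		/* clocks, power-domains, channel@0 as in the example above */
		fsl,companion-ldb = <&ldb2>;
	};

	ldb2: ldb-b {
		compatible = "fsl,imx8qxp-ldb";
		/* second instance; its channel index must differ (channel@1) */
	};
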
diff --git a/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pixel-combiner.yaml b/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pixel-combiner.yaml
new file mode 100644
index 000000000000..50bae2122183
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pixel-combiner.yaml
@@ -0,0 +1,144 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/fsl,imx8qxp-pixel-combiner.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale i.MX8qm/qxp Pixel Combiner
+
+maintainers:
+ - Liu Ying <victor.liu@nxp.com>
+
+description: |
+ The Freescale i.MX8qm/qxp Pixel Combiner takes two output streams from a
+ single display controller and manipulates the two streams to support a number
+ of modes (bypass, pixel combine, YUV444 to YUV422, split_RGB) configured as
+ either one screen, two screens, or virtual screens. The pixel combiner is
+ also responsible for generating some of the control signals for the pixel link
+ output channel.
+
+properties:
+ compatible:
+ enum:
+ - fsl,imx8qm-pixel-combiner
+ - fsl,imx8qxp-pixel-combiner
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: apb
+
+ power-domains:
+ maxItems: 1
+
+patternProperties:
+ "^channel@[0-1]$":
+ type: object
+ description: Represents a display stream of pixel combiner.
+
+ properties:
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+ reg:
+ description: The display stream index.
+ enum: [ 0, 1 ]
+
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Input endpoint of the display stream.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Output endpoint of the display stream.
+
+ required:
+ - "#address-cells"
+ - "#size-cells"
+ - reg
+ - port@0
+ - port@1
+
+ additionalProperties: false
+
+required:
+ - compatible
+ - "#address-cells"
+ - "#size-cells"
+ - reg
+ - clocks
+ - clock-names
+ - power-domains
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/imx8-lpcg.h>
+ #include <dt-bindings/firmware/imx/rsrc.h>
+ pixel-combiner@56020000 {
+ compatible = "fsl,imx8qxp-pixel-combiner";
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <0x56020000 0x10000>;
+ clocks = <&dc0_pixel_combiner_lpcg IMX_LPCG_CLK_4>;
+ clock-names = "apb";
+ power-domains = <&pd IMX_SC_R_DC_0>;
+
+ channel@0 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <0>;
+
+ port@0 {
+ reg = <0>;
+
+ dc0_pixel_combiner_ch0_dc0_dpu_disp0: endpoint {
+ remote-endpoint = <&dc0_dpu_disp0_dc0_pixel_combiner_ch0>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+
+ dc0_pixel_combiner_ch0_dc0_pixel_link0: endpoint {
+ remote-endpoint = <&dc0_pixel_link0_dc0_pixel_combiner_ch0>;
+ };
+ };
+ };
+
+ channel@1 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <1>;
+
+ port@0 {
+ reg = <0>;
+
+ dc0_pixel_combiner_ch1_dc0_dpu_disp1: endpoint {
+ remote-endpoint = <&dc0_dpu_disp1_dc0_pixel_combiner_ch1>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+
+ dc0_pixel_combiner_ch1_dc0_pixel_link1: endpoint {
+ remote-endpoint = <&dc0_pixel_link1_dc0_pixel_combiner_ch1>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pixel-link.yaml b/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pixel-link.yaml
new file mode 100644
index 000000000000..38ecc7926fad
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pixel-link.yaml
@@ -0,0 +1,144 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/fsl,imx8qxp-pixel-link.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale i.MX8qm/qxp Display Pixel Link
+
+maintainers:
+ - Liu Ying <victor.liu@nxp.com>
+
+description: |
+ The Freescale i.MX8qm/qxp Display Pixel Link (DPL) forms a standard
+ asynchronous linkage between pixel sources (display controller or
+ camera module) and pixel consumers (imaging or displays).
+ It consists of two distinct functions: a pixel transfer function and a
+ control interface. Multiple pixel channels can exist per control channel.
+ This binding documentation is only for pixel links whose pixel sources are
+ display controllers.
+
+ The i.MX8qm/qxp Display Pixel Link is accessed via the System Controller
+ Unit (SCU) firmware.
+
+properties:
+ compatible:
+ enum:
+ - fsl,imx8qm-dc-pixel-link
+ - fsl,imx8qxp-dc-pixel-link
+
+ fsl,dc-id:
+ $ref: /schemas/types.yaml#/definitions/uint8
+ description: |
+ u8 value representing the display controller index that the pixel link
+ connects to.
+
+ fsl,dc-stream-id:
+ $ref: /schemas/types.yaml#/definitions/uint8
+ description: |
+ u8 value representing the display controller stream index that the pixel
+ link connects to.
+ enum: [0, 1]
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: The pixel link input port node from upstream video source.
+
+ patternProperties:
+ "^port@[1-4]$":
+ $ref: /schemas/graph.yaml#/properties/port
+ description: The pixel link output port node to downstream bridge.
+
+ required:
+ - port@0
+ - port@1
+ - port@2
+ - port@3
+ - port@4
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: fsl,imx8qxp-dc-pixel-link
+ then:
+ properties:
+ fsl,dc-id:
+ const: 0
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: fsl,imx8qm-dc-pixel-link
+ then:
+ properties:
+ fsl,dc-id:
+ enum: [0, 1]
+
+required:
+ - compatible
+ - fsl,dc-id
+ - fsl,dc-stream-id
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ dc0-pixel-link0 {
+ compatible = "fsl,imx8qxp-dc-pixel-link";
+ fsl,dc-id = /bits/ 8 <0>;
+ fsl,dc-stream-id = /bits/ 8 <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ /* from dc0 pixel combiner channel0 */
+ port@0 {
+ reg = <0>;
+
+ dc0_pixel_link0_dc0_pixel_combiner_ch0: endpoint {
+ remote-endpoint = <&dc0_pixel_combiner_ch0_dc0_pixel_link0>;
+ };
+ };
+
+ /* to PXL2DPIs in MIPI/LVDS combo subsystems */
+ port@1 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <1>;
+
+ dc0_pixel_link0_mipi_lvds_0_pxl2dpi: endpoint@0 {
+ reg = <0>;
+ remote-endpoint = <&mipi_lvds_0_pxl2dpi_dc0_pixel_link0>;
+ };
+
+ dc0_pixel_link0_mipi_lvds_1_pxl2dpi: endpoint@1 {
+ reg = <1>;
+ remote-endpoint = <&mipi_lvds_1_pxl2dpi_dc0_pixel_link0>;
+ };
+ };
+
+ /* unused */
+ port@2 {
+ reg = <2>;
+ };
+
+ /* unused */
+ port@3 {
+ reg = <3>;
+ };
+
+ /* to imaging subsystem */
+ port@4 {
+ reg = <4>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pxl2dpi.yaml b/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pxl2dpi.yaml
new file mode 100644
index 000000000000..e4e77fad05f1
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/fsl,imx8qxp-pxl2dpi.yaml
@@ -0,0 +1,108 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/fsl,imx8qxp-pxl2dpi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale i.MX8qxp Pixel Link to Display Pixel Interface
+
+maintainers:
+ - Liu Ying <victor.liu@nxp.com>
+
+description: |
+ The Freescale i.MX8qxp Pixel Link to Display Pixel Interface (PXL2DPI)
+ interfaces the pixel link 36-bit data output to the DSI controller's
+ MIPI-DPI 24-bit data input, and to the inputs of the LVDS Display Bridge
+ (LDB) module used in LVDS mode, remapping the pixel color codings between those modules.
+ This module is purely combinatorial.
+
+ The i.MX8qxp PXL2DPI is controlled by the Control and Status Registers (CSR) module.
+ The CSR module, as a system controller, contains the PXL2DPI's configuration
+ register.
+
+properties:
+ compatible:
+ const: fsl,imx8qxp-pxl2dpi
+
+ fsl,sc-resource:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: The SCU resource ID associated with this PXL2DPI instance.
+
+ power-domains:
+ maxItems: 1
+
+ fsl,companion-pxl2dpi:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: |
+ A phandle which points to companion PXL2DPI which is used by downstream
+ LVDS Display Bridge(LDB) in split mode.
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: The PXL2DPI input port node from pixel link.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: The PXL2DPI output port node to downstream bridge.
+
+ required:
+ - port@0
+ - port@1
+
+required:
+ - compatible
+ - fsl,sc-resource
+ - power-domains
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/firmware/imx/rsrc.h>
+ pxl2dpi {
+ compatible = "fsl,imx8qxp-pxl2dpi";
+ fsl,sc-resource = <IMX_SC_R_MIPI_0>;
+ power-domains = <&pd IMX_SC_R_MIPI_0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <0>;
+
+ mipi_lvds_0_pxl2dpi_dc_pixel_link0: endpoint@0 {
+ reg = <0>;
+ remote-endpoint = <&dc_pixel_link0_mipi_lvds_0_pxl2dpi>;
+ };
+
+ mipi_lvds_0_pxl2dpi_dc_pixel_link1: endpoint@1 {
+ reg = <1>;
+ remote-endpoint = <&dc_pixel_link1_mipi_lvds_0_pxl2dpi>;
+ };
+ };
+
+ port@1 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ reg = <1>;
+
+ mipi_lvds_0_pxl2dpi_mipi_lvds_0_ldb_ch0: endpoint@0 {
+ reg = <0>;
+ remote-endpoint = <&mipi_lvds_0_ldb_ch0_mipi_lvds_0_pxl2dpi>;
+ };
+
+ mipi_lvds_0_pxl2dpi_mipi_lvds_0_ldb_ch1: endpoint@1 {
+ reg = <1>;
+ remote-endpoint = <&mipi_lvds_0_ldb_ch1_mipi_lvds_0_pxl2dpi>;
+ };
+ };
+ };
+ };
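
As with the LDB's companion property above, fsl,companion-pxl2dpi is not
exercised by the example; in LDB split mode a second instance would be
referenced (labels invented for illustration):

	pxl2dpi0: pxl2dpi-a {
		compatible = "fsl,imx8qxp-pxl2dpi";
		/* fsl,sc-resource, power-domains and ports as in the example */
		fsl,companion-pxl2dpi = <&pxl2dpi1>;
	};
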
diff --git a/Documentation/devicetree/bindings/display/bridge/fsl,ldb.yaml b/Documentation/devicetree/bindings/display/bridge/fsl,ldb.yaml
new file mode 100644
index 000000000000..6e0e3ba9b49e
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/fsl,ldb.yaml
@@ -0,0 +1,119 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/fsl,ldb.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale i.MX8MP DPI to LVDS bridge chip
+
+maintainers:
+ - Marek Vasut <marex@denx.de>
+
+description: |
+ The i.MX8MP mediamix contains two registers which are responsible
+ for configuring the on-SoC DPI-to-LVDS serializer. This binding
+ describes those registers as a bridge within the DT.
+
+properties:
+ compatible:
+ enum:
+ - fsl,imx8mp-ldb
+ - fsl,imx93-ldb
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: ldb
+
+ reg:
+ maxItems: 2
+
+ reg-names:
+ items:
+ - const: ldb
+ - const: lvds
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Video port for DPI input.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Video port for LVDS Channel-A output (panel or bridge).
+
+ port@2:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Video port for LVDS Channel-B output (panel or bridge).
+
+ required:
+ - port@0
+ - port@1
+
+required:
+ - compatible
+ - clocks
+ - ports
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: fsl,imx93-ldb
+ then:
+ properties:
+ ports:
+ properties:
+ port@2: false
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/imx8mp-clock.h>
+
+ blk-ctrl {
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ bridge@5c {
+ compatible = "fsl,imx8mp-ldb";
+ clocks = <&clk IMX8MP_CLK_MEDIA_LDB>;
+ clock-names = "ldb";
+ reg = <0x5c 0x4>, <0x128 0x4>;
+ reg-names = "ldb", "lvds";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+
+ ldb_from_lcdif2: endpoint {
+ remote-endpoint = <&lcdif2_to_ldb>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+
+ ldb_lvds_ch0: endpoint {
+ remote-endpoint = <&ldb_to_lvdsx4panel>;
+ };
+ };
+
+ port@2 {
+ reg = <2>;
+
+ ldb_lvds_ch1: endpoint {
+ };
+ };
+ };
+ };
+ };
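
For the fsl,imx93-ldb variant, the conditional above forbids port@2, so only
a single LVDS channel is described; a sketch (the register offsets, clock and
endpoint phandles are hypothetical):

	bridge@20 {
		compatible = "fsl,imx93-ldb";
		reg = <0x20 0x4>, <0x24 0x4>;
		reg-names = "ldb", "lvds";
		clocks = <&clk_ldb>;
		clock-names = "ldb";

		ports {
			#address-cells = <1>;
			#size-cells = <0>;

			port@0 {
				reg = <0>;
				imx93_ldb_from_lcdif: endpoint {
					remote-endpoint = <&lcdif_to_ldb>;
				};
			};

			port@1 {
				reg = <1>;
				imx93_ldb_lvds: endpoint {
					remote-endpoint = <&panel_lvds_in>;
				};
			};
		};
	};
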
diff --git a/Documentation/devicetree/bindings/display/bridge/google,cros-ec-anx7688.yaml b/Documentation/devicetree/bindings/display/bridge/google,cros-ec-anx7688.yaml
index 9f7cc6b757cb..a44d025d33bd 100644
--- a/Documentation/devicetree/bindings/display/bridge/google,cros-ec-anx7688.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/google,cros-ec-anx7688.yaml
@@ -8,7 +8,6 @@ title: ChromeOS EC ANX7688 HDMI to DP Converter through Type-C Port
maintainers:
- Nicolas Boichat <drinkcat@chromium.org>
- - Enric Balletbo i Serra <enric.balletbo@collabora.com>
description: |
ChromeOS EC ANX7688 is a display bridge that converts HDMI 2.0 to
@@ -79,4 +78,3 @@ examples:
};
};
};
-
diff --git a/Documentation/devicetree/bindings/display/bridge/ingenic,jz4780-hdmi.yaml b/Documentation/devicetree/bindings/display/bridge/ingenic,jz4780-hdmi.yaml
new file mode 100644
index 000000000000..0b27df429bdc
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/ingenic,jz4780-hdmi.yaml
@@ -0,0 +1,81 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/ingenic,jz4780-hdmi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Ingenic JZ4780 HDMI Transmitter
+
+maintainers:
+ - H. Nikolaus Schaller <hns@goldelico.com>
+
+description: |
+ The HDMI Transmitter in the Ingenic JZ4780 is a Synopsys DesignWare HDMI 1.4
+ TX controller IP with accompanying PHY IP.
+
+allOf:
+ - $ref: synopsys,dw-hdmi.yaml#
+
+properties:
+ compatible:
+ const: ingenic,jz4780-dw-hdmi
+
+ reg-io-width:
+ const: 4
+
+ clocks:
+ maxItems: 2
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Input from LCD controller output.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Link to the HDMI connector.
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - ports
+ - reg-io-width
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/ingenic,jz4780-cgu.h>
+
+ hdmi: hdmi@10180000 {
+ compatible = "ingenic,jz4780-dw-hdmi";
+ reg = <0x10180000 0x8000>;
+ reg-io-width = <4>;
+ interrupt-parent = <&intc>;
+ interrupts = <3>;
+ clocks = <&cgu JZ4780_CLK_AHB0>, <&cgu JZ4780_CLK_HDMI>;
+ clock-names = "iahb", "isfr";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ hdmi_in: port@0 {
+ reg = <0>;
+ dw_hdmi_in: endpoint {
+ remote-endpoint = <&jz4780_lcd_out>;
+ };
+ };
+ hdmi_out: port@1 {
+ reg = <1>;
+ dw_hdmi_out: endpoint {
+ remote-endpoint = <&hdmi_con>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/bridge/intel,keembay-dsi.yaml b/Documentation/devicetree/bindings/display/bridge/intel,keembay-dsi.yaml
index dcb1336ee2a5..958a073f4ff7 100644
--- a/Documentation/devicetree/bindings/display/bridge/intel,keembay-dsi.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/intel,keembay-dsi.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/bridge/intel,keembay-dsi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Devicetree bindings for Intel Keem Bay mipi dsi controller
+title: Intel Keem Bay MIPI DSI controller
maintainers:
- Anitha Chrisanthus <anitha.chrisanthus@intel.com>
diff --git a/Documentation/devicetree/bindings/display/bridge/ite,it6505.yaml b/Documentation/devicetree/bindings/display/bridge/ite,it6505.yaml
index 833d11b2303a..c9a882ee6d98 100644
--- a/Documentation/devicetree/bindings/display/bridge/ite,it6505.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/ite,it6505.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/bridge/ite,it6505.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ITE it6505 Device Tree Bindings
+title: ITE it6505
maintainers:
- Allen Chen <allen.chen@ite.com.tw>
@@ -52,9 +52,49 @@ properties:
maxItems: 1
description: extcon specifier for the Power Delivery
- port:
- $ref: /schemas/graph.yaml#/properties/port
- description: A port node pointing to DPI host port node
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
+ description: A port node pointing to DPI host port node
+
+ properties:
+ endpoint:
+ $ref: /schemas/graph.yaml#/$defs/endpoint-base
+ unevaluatedProperties: false
+
+ properties:
+ link-frequencies:
+ minItems: 1
+ maxItems: 1
+ description: Maximum allowed link frequency, in Hz
+
+ port@1:
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
+ description: Video port for DP output
+
+ properties:
+ endpoint:
+ $ref: /schemas/graph.yaml#/$defs/endpoint-base
+ unevaluatedProperties: false
+
+ properties:
+ data-lanes:
+ minItems: 1
+ uniqueItems: true
+ items:
+ - enum: [ 0, 1 ]
+ - const: 1
+ - const: 2
+ - const: 3
+
+ required:
+ - port@0
+ - port@1
required:
- compatible
@@ -63,6 +103,7 @@ required:
- interrupts
- reset-gpios
- extcon
+ - ports
additionalProperties: false
@@ -85,9 +126,24 @@ examples:
reset-gpios = <&pio 179 1>;
extcon = <&usbc_extcon>;
- port {
- it6505_in: endpoint {
- remote-endpoint = <&dpi_out>;
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ it6505_in: endpoint {
+ remote-endpoint = <&dpi_out>;
+ link-frequencies = /bits/ 64 <150000000>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ it6505_out: endpoint {
+ remote-endpoint = <&dp_in>;
+ data-lanes = <0 1>;
+ };
};
};
};
diff --git a/Documentation/devicetree/bindings/display/bridge/ite,it66121.yaml b/Documentation/devicetree/bindings/display/bridge/ite,it66121.yaml
index 6ec1d5fbb8bc..a7eb2603691f 100644
--- a/Documentation/devicetree/bindings/display/bridge/ite,it66121.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/ite,it66121.yaml
@@ -4,11 +4,11 @@
$id: http://devicetree.org/schemas/display/bridge/ite,it66121.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ITE it66121 HDMI bridge Device Tree Bindings
+title: ITE it66121 HDMI bridge
maintainers:
- Phong LE <ple@baylibre.com>
- - Neil Armstrong <narmstrong@baylibre.com>
+ - Neil Armstrong <neil.armstrong@linaro.org>
description: |
The IT66121 is a high-performance and low-power single channel HDMI
@@ -17,7 +17,9 @@ description: |
properties:
compatible:
- const: ite,it66121
+ enum:
+ - ite,it66121
+ - ite,it6610
reg:
maxItems: 1
@@ -38,6 +40,9 @@ properties:
interrupts:
maxItems: 1
+ "#sound-dai-cells":
+ const: 0
+
ports:
$ref: /schemas/graph.yaml#/properties/ports
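
With "#sound-dai-cells" set to 0 as added above, the bridge can be referenced
as the codec of an audio link; a sketch using a simple-audio-card (the i2s0
and it66121 labels are hypothetical):

	sound {
		compatible = "simple-audio-card";
		simple-audio-card,name = "HDMI";
		simple-audio-card,format = "i2s";

		simple-audio-card,cpu {
			sound-dai = <&i2s0>;
		};

		simple-audio-card,codec {
			sound-dai = <&it66121>;
		};
	};
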
diff --git a/Documentation/devicetree/bindings/display/bridge/lontium,lt8912b.yaml b/Documentation/devicetree/bindings/display/bridge/lontium,lt8912b.yaml
index 674891ee2f8e..f201ae4af4fb 100644
--- a/Documentation/devicetree/bindings/display/bridge/lontium,lt8912b.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/lontium,lt8912b.yaml
@@ -67,7 +67,7 @@ examples:
- |
#include <dt-bindings/gpio/gpio.h>
- i2c4 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
diff --git a/Documentation/devicetree/bindings/display/bridge/lontium,lt9211.yaml b/Documentation/devicetree/bindings/display/bridge/lontium,lt9211.yaml
new file mode 100644
index 000000000000..9a6e9b25d14a
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/lontium,lt9211.yaml
@@ -0,0 +1,117 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/lontium,lt9211.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Lontium LT9211 DSI/LVDS/DPI to DSI/LVDS/DPI bridge
+
+maintainers:
+ - Marek Vasut <marex@denx.de>
+
+description: |
+ The LT9211 is a bridge device which converts Single/Dual-Link DSI/LVDS
+ or Single DPI to Single/Dual-Link DSI/LVDS or Single DPI.
+
+properties:
+ compatible:
+ enum:
+ - lontium,lt9211
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ reset-gpios:
+ maxItems: 1
+ description: GPIO connected to active high RESET pin.
+
+ vccio-supply:
+ description: Regulator for 1.8V IO power.
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Primary MIPI DSI port-1 for MIPI input or
+ LVDS port-1 for LVDS input or DPI input.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Additional MIPI port-2 for MIPI input or LVDS port-2
+ for LVDS input. Used in combination with primary
+ port-1 to drive higher resolution displays
+
+ port@2:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Primary MIPI DSI port-1 for MIPI output or
+ LVDS port-1 for LVDS output or DPI output.
+
+ port@3:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Additional MIPI port-2 for MIPI output or LVDS port-2
+ for LVDS output. Used in combination with primary
+ port-1 to drive higher resolution displays.
+
+ required:
+ - port@0
+ - port@2
+
+required:
+ - compatible
+ - reg
+ - vccio-supply
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ hdmi-bridge@3b {
+ compatible = "lontium,lt9211";
+ reg = <0x3b>;
+
+ reset-gpios = <&tlmm 128 GPIO_ACTIVE_HIGH>;
+ interrupts-extended = <&tlmm 84 IRQ_TYPE_EDGE_FALLING>;
+
+ vccio-supply = <&lt9211_1v8>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+
+ endpoint {
+ remote-endpoint = <&dsi0_out>;
+ };
+ };
+
+ port@2 {
+ reg = <2>;
+
+ endpoint {
+ remote-endpoint = <&panel_in_lvds>;
+ };
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/bridge/lvds-codec.yaml b/Documentation/devicetree/bindings/display/bridge/lvds-codec.yaml
index 304a1367faaa..84aafcbf0919 100644
--- a/Documentation/devicetree/bindings/display/bridge/lvds-codec.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/lvds-codec.yaml
@@ -39,6 +39,7 @@ properties:
- const: lvds-encoder # Generic LVDS encoder compatible fallback
- items:
- enum:
+ - ti,ds90cf364a # For the DS90CF364A FPD-Link LVDS Receiver
- ti,ds90cf384a # For the DS90CF384A FPD-Link LVDS Receiver
- const: lvds-decoder # Generic LVDS decoders compatible fallback
- enum:
@@ -49,11 +50,27 @@ properties:
properties:
port@0:
- $ref: /schemas/graph.yaml#/properties/port
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
description: |
For LVDS encoders, port 0 is the parallel input
For LVDS decoders, port 0 is the LVDS input
+ properties:
+ endpoint:
+ $ref: /schemas/media/video-interfaces.yaml#
+ unevaluatedProperties: false
+
+ properties:
+ data-mapping:
+ enum:
+ - jeida-18
+ - jeida-24
+ - vesa-24
+ description: |
+ The color signals mapping order. See details in
+ Documentation/devicetree/bindings/display/lvds.yaml
+
port@1:
$ref: /schemas/graph.yaml#/properties/port
description: |
@@ -64,6 +81,14 @@ properties:
- port@0
- port@1
+ pclk-sample:
+ description:
+ Data sampling on rising or falling edge.
+ enum:
+ - 0 # Falling edge
+ - 1 # Rising edge
+ default: 0
+
powerdown-gpios:
description:
The GPIO used to control the power down line of this device.
@@ -71,6 +96,33 @@ properties:
power-supply: true
+allOf:
+ - if:
+ not:
+ properties:
+ compatible:
+ contains:
+ const: lvds-decoder
+ then:
+ properties:
+ ports:
+ properties:
+ port@0:
+ properties:
+ endpoint:
+ properties:
+ data-mapping: false
+
+ - if:
+ not:
+ properties:
+ compatible:
+ contains:
+ const: lvds-encoder
+ then:
+ properties:
+ pclk-sample: false
+
required:
- compatible
- ports
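
A sketch of a decoder node using the new data-mapping property on its LVDS
input endpoint (the remote endpoint labels are hypothetical):

	lvds-decoder {
		compatible = "ti,ds90cf364a", "lvds-decoder";

		ports {
			#address-cells = <1>;
			#size-cells = <0>;

			port@0 {
				reg = <0>;
				decoder_in: endpoint {
					data-mapping = "vesa-24";
					remote-endpoint = <&lvds_out>;
				};
			};

			port@1 {
				reg = <1>;
				decoder_out: endpoint {
					remote-endpoint = <&panel_in>;
				};
			};
		};
	};
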
diff --git a/Documentation/devicetree/bindings/display/bridge/nxp,ptn3460.yaml b/Documentation/devicetree/bindings/display/bridge/nxp,ptn3460.yaml
new file mode 100644
index 000000000000..70ec70922c13
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/nxp,ptn3460.yaml
@@ -0,0 +1,106 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/nxp,ptn3460.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NXP PTN3460 eDP to LVDS bridge
+
+maintainers:
+ - Sean Paul <seanpaul@chromium.org>
+
+properties:
+ compatible:
+ const: nxp,ptn3460
+
+ reg:
+ description: I2C address of the bridge
+ maxItems: 1
+
+ edid-emulation:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ The EDID emulation entry to use
+ Value Resolution Description
+ 0 1024x768 NXP Generic
+ 1 1920x1080 NXP Generic
+ 2 1920x1080 NXP Generic
+ 3 1600x900 Samsung LTM200KT
+ 4 1920x1080 Samsung LTM230HT
+ 5 1366x768 NXP Generic
+ 6 1600x900 ChiMei M215HGE
+ enum: [0, 1, 2, 3, 4, 5, 6]
+
+ powerdown-gpios:
+ description: GPIO connected to the PD_N signal.
+ maxItems: 1
+
+ reset-gpios:
+ description: GPIO connected to the RST_N signal.
+ maxItems: 1
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Video port for LVDS output
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Video port for eDP input
+
+ required:
+ - port@0
+ - port@1
+
+required:
+ - compatible
+ - reg
+ - edid-emulation
+ - powerdown-gpios
+ - reset-gpios
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ bridge@20 {
+ compatible = "nxp,ptn3460";
+ reg = <0x20>;
+ edid-emulation = <5>;
+ powerdown-gpios = <&gpy2 5 GPIO_ACTIVE_HIGH>;
+ reset-gpios = <&gpx1 5 GPIO_ACTIVE_LOW>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ bridge_out: endpoint {
+ remote-endpoint = <&panel_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ bridge_in: endpoint {
+ remote-endpoint = <&dp_out>;
+ };
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/bridge/nxp,tda998x.yaml b/Documentation/devicetree/bindings/display/bridge/nxp,tda998x.yaml
new file mode 100644
index 000000000000..c4bf54397473
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/nxp,tda998x.yaml
@@ -0,0 +1,109 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/nxp,tda998x.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NXP TDA998x HDMI transmitter
+
+maintainers:
+ - Russell King <linux@armlinux.org.uk>
+
+properties:
+ compatible:
+ const: nxp,tda998x
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ video-ports:
+ default: 0x230145
+ maximum: 0xffffff
+ description:
+ 24-bit value which defines how the video controller output is wired to
+ the TDA998x input.
+
+ audio-ports:
+ description:
+ Array of 8-bit values, 2 values per DAI (Documentation/sound/soc/dai.rst).
+ The implementation allows one or two DAIs.
+ If two DAIs are defined, they must be of different types.
+ $ref: /schemas/types.yaml#/definitions/uint32-matrix
+ items:
+ minItems: 1
+ items:
+ - description: |
+ The first value defines the DAI type: TDA998x_SPDIF or TDA998x_I2S
+ (see include/dt-bindings/display/tda998x.h).
+ - description:
+ The second value defines the tda998x AP_ENA reg content when the
+ DAI in question is used.
+
+ '#sound-dai-cells':
+ enum: [ 0, 1 ]
+
+ nxp,calib-gpios:
+ maxItems: 1
+ description:
+ Calibration GPIO, which must correspond with the gpio used for the
+ TDA998x interrupt pin.
+
+ port:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Parallel input port
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ type: object
+ description: Parallel input port
+
+ port@1:
+ type: object
+ description: HDMI output port
+
+required:
+ - compatible
+ - reg
+
+oneOf:
+ - required:
+ - port
+ - required:
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/display/tda998x.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ tda998x: hdmi-encoder@70 {
+ compatible = "nxp,tda998x";
+ reg = <0x70>;
+ interrupt-parent = <&gpio0>;
+ interrupts = <27 IRQ_TYPE_EDGE_FALLING>;
+ video-ports = <0x230145>;
+
+ #sound-dai-cells = <1>;
+ /* DAI-format / AP_ENA reg value */
+ audio-ports = <TDA998x_SPDIF 0x04>,
+ <TDA998x_I2S 0x03>;
+
+ port {
+ tda998x_in: endpoint {
+ remote-endpoint = <&lcdc_0>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/bridge/parade,ps8622.yaml b/Documentation/devicetree/bindings/display/bridge/parade,ps8622.yaml
new file mode 100644
index 000000000000..e6397ac2048b
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/parade,ps8622.yaml
@@ -0,0 +1,115 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/parade,ps8622.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Parade PS8622/PS8625 DisplayPort to LVDS Converter
+
+maintainers:
+ - Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
+
+properties:
+ compatible:
+ enum:
+ - parade,ps8622
+ - parade,ps8625
+
+ reg:
+ maxItems: 1
+
+ lane-count:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [1, 2]
+ description: Number of DP lanes to use.
+
+ use-external-pwm:
+ type: boolean
+ description: Backlight will be controlled by an external PWM.
+
+ reset-gpios:
+ maxItems: 1
+ description: GPIO connected to RST_ pin.
+
+ sleep-gpios:
+ maxItems: 1
+ description: GPIO connected to PD_ pin.
+
+ vdd12-supply: true
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Video port for LVDS output.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Video port for DisplayPort input.
+
+ required:
+ - port@0
+ - port@1
+
+required:
+ - compatible
+ - reg
+ - reset-gpios
+ - sleep-gpios
+ - ports
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ const: parade,ps8622
+ then:
+ properties:
+ lane-count:
+ const: 1
+ else:
+ properties:
+ lane-count:
+ const: 2
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ lvds-bridge@48 {
+ compatible = "parade,ps8625";
+ reg = <0x48>;
+ sleep-gpios = <&gpx3 5 GPIO_ACTIVE_HIGH>;
+ reset-gpios = <&gpy7 7 GPIO_ACTIVE_HIGH>;
+ lane-count = <2>;
+ use-external-pwm;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+
+ bridge_out: endpoint {
+ remote-endpoint = <&panel_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+
+ bridge_in: endpoint {
+ remote-endpoint = <&dp_out>;
+ };
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/bridge/ps8622.txt b/Documentation/devicetree/bindings/display/bridge/ps8622.txt
deleted file mode 100644
index c989c3807f2b..000000000000
--- a/Documentation/devicetree/bindings/display/bridge/ps8622.txt
+++ /dev/null
@@ -1,31 +0,0 @@
-ps8622-bridge bindings
-
-Required properties:
- - compatible: "parade,ps8622" or "parade,ps8625"
- - reg: first i2c address of the bridge
- - sleep-gpios: OF device-tree gpio specification for PD_ pin.
- - reset-gpios: OF device-tree gpio specification for RST_ pin.
-
-Optional properties:
- - lane-count: number of DP lanes to use
- - use-external-pwm: backlight will be controlled by an external PWM
- - video interfaces: Device node can contain video interface port
- nodes for panel according to [1].
-
-[1]: Documentation/devicetree/bindings/media/video-interfaces.txt
-
-Example:
- lvds-bridge@48 {
- compatible = "parade,ps8622";
- reg = <0x48>;
- sleep-gpios = <&gpc3 6 1 0 0>;
- reset-gpios = <&gpc3 1 1 0 0>;
- lane-count = <1>;
- ports {
- port@0 {
- bridge_out: endpoint {
- remote-endpoint = <&panel_in>;
- };
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/bridge/ps8640.yaml b/Documentation/devicetree/bindings/display/bridge/ps8640.yaml
index fce82b605c8b..5856450c5da7 100644
--- a/Documentation/devicetree/bindings/display/bridge/ps8640.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/ps8640.yaml
@@ -4,11 +4,10 @@
$id: http://devicetree.org/schemas/display/bridge/ps8640.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: MIPI DSI to eDP Video Format Converter Device Tree Bindings
+title: MIPI DSI to eDP Video Format Converter
maintainers:
- Nicolas Boichat <drinkcat@chromium.org>
- - Enric Balletbo i Serra <enric.balletbo@collabora.com>
description: |
The PS8640 is a low power MIPI-to-eDP video format converter supporting
@@ -40,6 +39,9 @@ properties:
vdd33-supply:
description: Regulator for 3.3V digital core power.
+ aux-bus:
+ $ref: /schemas/display/dp-aux-bus.yaml#
+
ports:
$ref: /schemas/graph.yaml#/properties/ports
@@ -71,7 +73,7 @@ additionalProperties: false
examples:
- |
#include <dt-bindings/gpio/gpio.h>
- i2c0 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
@@ -98,9 +100,22 @@ examples:
reg = <1>;
ps8640_out: endpoint {
remote-endpoint = <&panel_in>;
- };
+ };
+ };
+ };
+
+ aux-bus {
+ panel {
+ compatible = "boe,nv133fhm-n62";
+ power-supply = <&pp3300_dx_edp>;
+ backlight = <&backlight>;
+
+ port {
+ panel_in: endpoint {
+ remote-endpoint = <&ps8640_out>;
+ };
+ };
};
};
};
};
-
diff --git a/Documentation/devicetree/bindings/display/bridge/ptn3460.txt b/Documentation/devicetree/bindings/display/bridge/ptn3460.txt
deleted file mode 100644
index 361971ba104d..000000000000
--- a/Documentation/devicetree/bindings/display/bridge/ptn3460.txt
+++ /dev/null
@@ -1,39 +0,0 @@
-ptn3460 bridge bindings
-
-Required properties:
- - compatible: "nxp,ptn3460"
- - reg: i2c address of the bridge
- - powerdown-gpio: OF device-tree gpio specification for PD_N pin.
- - reset-gpio: OF device-tree gpio specification for RST_N pin.
- - edid-emulation: The EDID emulation entry to use
- +-------+------------+------------------+
- | Value | Resolution | Description |
- | 0 | 1024x768 | NXP Generic |
- | 1 | 1920x1080 | NXP Generic |
- | 2 | 1920x1080 | NXP Generic |
- | 3 | 1600x900 | Samsung LTM200KT |
- | 4 | 1920x1080 | Samsung LTM230HT |
- | 5 | 1366x768 | NXP Generic |
- | 6 | 1600x900 | ChiMei M215HGE |
- +-------+------------+------------------+
-
- - video interfaces: Device node can contain video interface port
- nodes for panel according to [1].
-
-[1]: Documentation/devicetree/bindings/media/video-interfaces.txt
-
-Example:
- lvds-bridge@20 {
- compatible = "nxp,ptn3460";
- reg = <0x20>;
- powerdown-gpio = <&gpy2 5 1 0 0>;
- reset-gpio = <&gpx1 5 1 0 0>;
- edid-emulation = <5>;
- ports {
- port@0 {
- bridge_out: endpoint {
- remote-endpoint = <&panel_in>;
- };
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/bridge/renesas,dsi-csi2-tx.yaml b/Documentation/devicetree/bindings/display/bridge/renesas,dsi-csi2-tx.yaml
new file mode 100644
index 000000000000..d33026f85e19
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/renesas,dsi-csi2-tx.yaml
@@ -0,0 +1,119 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/renesas,dsi-csi2-tx.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Renesas R-Car MIPI DSI/CSI-2 Encoder
+
+maintainers:
+ - Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
+
+description: |
+ This binding describes the MIPI DSI/CSI-2 encoder embedded in the Renesas
+ R-Car Gen4 SoCs. The encoder can operate in either DSI or CSI-2 mode, with up
+ to four data lanes.
+
+properties:
+ compatible:
+ enum:
+ - renesas,r8a779a0-dsi-csi2-tx # for V3U
+ - renesas,r8a779g0-dsi-csi2-tx # for V4H
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: Functional clock
+ - description: DSI (and CSI-2) functional clock
+ - description: PLL reference clock
+
+ clock-names:
+ items:
+ - const: fck
+ - const: dsi
+ - const: pll
+
+ power-domains:
+ maxItems: 1
+
+ resets:
+ maxItems: 1
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Parallel input port
+
+ port@1:
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
+ description: DSI/CSI-2 output port
+
+ properties:
+ endpoint:
+ $ref: /schemas/media/video-interfaces.yaml#
+ unevaluatedProperties: false
+
+ properties:
+ data-lanes:
+ minItems: 1
+ maxItems: 4
+
+ required:
+ - data-lanes
+
+ required:
+ - port@0
+ - port@1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - power-domains
+ - resets
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/r8a779a0-cpg-mssr.h>
+ #include <dt-bindings/power/r8a779a0-sysc.h>
+
+ dsi0: dsi-encoder@fed80000 {
+ compatible = "renesas,r8a779a0-dsi-csi2-tx";
+ reg = <0xfed80000 0x10000>;
+ power-domains = <&sysc R8A779A0_PD_ALWAYS_ON>;
+ clocks = <&cpg CPG_MOD 415>,
+ <&cpg CPG_CORE R8A779A0_CLK_DSI>,
+ <&cpg CPG_CORE R8A779A0_CLK_CP>;
+ clock-names = "fck", "dsi", "pll";
+ resets = <&cpg 415>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&du_out_dsi0>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ data-lanes = <1 2>;
+ remote-endpoint = <&sn65dsi86_in>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/bridge/renesas,dsi.yaml b/Documentation/devicetree/bindings/display/bridge/renesas,dsi.yaml
new file mode 100644
index 000000000000..e08c24633926
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/renesas,dsi.yaml
@@ -0,0 +1,183 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/renesas,dsi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Renesas RZ/G2L MIPI DSI Encoder
+
+maintainers:
+ - Biju Das <biju.das.jz@bp.renesas.com>
+
+description: |
+ This binding describes the MIPI DSI encoder embedded in the Renesas
+ RZ/G2L alike family of SoC's. The encoder can operate in DSI mode, with
+ up to four data lanes.
+
+allOf:
+ - $ref: /schemas/display/dsi-controller.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - renesas,r9a07g044-mipi-dsi # RZ/G2{L,LC}
+ - renesas,r9a07g054-mipi-dsi # RZ/V2L
+ - const: renesas,rzg2l-mipi-dsi
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ items:
+ - description: Sequence operation channel 0 interrupt
+ - description: Sequence operation channel 1 interrupt
+ - description: Video-Input operation channel 1 interrupt
+ - description: DSI Packet Receive interrupt
+ - description: DSI Fatal Error interrupt
+ - description: DSI D-PHY PPI interrupt
+ - description: Debug interrupt
+
+ interrupt-names:
+ items:
+ - const: seq0
+ - const: seq1
+ - const: vin1
+ - const: rcv
+ - const: ferr
+ - const: ppi
+ - const: debug
+
+ clocks:
+ items:
+ - description: DSI D-PHY PLL multiplied clock
+ - description: DSI D-PHY system clock
+ - description: DSI AXI bus clock
+ - description: DSI Register access clock
+ - description: DSI Video clock
+ - description: DSI D-PHY Escape mode transmit clock
+
+ clock-names:
+ items:
+ - const: pllclk
+ - const: sysclk
+ - const: aclk
+ - const: pclk
+ - const: vclk
+ - const: lpclk
+
+ resets:
+ items:
+ - description: MIPI_DSI_CMN_RSTB
+ - description: MIPI_DSI_ARESET_N
+ - description: MIPI_DSI_PRESET_N
+
+ reset-names:
+ items:
+ - const: rst
+ - const: arst
+ - const: prst
+
+ power-domains:
+ maxItems: 1
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Parallel input port
+
+ port@1:
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
+ description: DSI output port
+
+ properties:
+ endpoint:
+ $ref: /schemas/media/video-interfaces.yaml#
+ unevaluatedProperties: false
+
+ properties:
+ data-lanes:
+ description: array of physical DSI data lane indexes.
+ minItems: 1
+ items:
+ - const: 1
+ - const: 2
+ - const: 3
+ - const: 4
+
+ required:
+ - data-lanes
+
+ required:
+ - port@0
+ - port@1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - interrupt-names
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+ - power-domains
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/r9a07g044-cpg.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ dsi0: dsi@10850000 {
+ compatible = "renesas,r9a07g044-mipi-dsi", "renesas,rzg2l-mipi-dsi";
+ reg = <0x10850000 0x20000>;
+ interrupts = <GIC_SPI 142 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 143 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 144 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 145 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 146 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 147 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 148 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "seq0", "seq1", "vin1", "rcv",
+ "ferr", "ppi", "debug";
+ clocks = <&cpg CPG_MOD R9A07G044_MIPI_DSI_PLLCLK>,
+ <&cpg CPG_MOD R9A07G044_MIPI_DSI_SYSCLK>,
+ <&cpg CPG_MOD R9A07G044_MIPI_DSI_ACLK>,
+ <&cpg CPG_MOD R9A07G044_MIPI_DSI_PCLK>,
+ <&cpg CPG_MOD R9A07G044_MIPI_DSI_VCLK>,
+ <&cpg CPG_MOD R9A07G044_MIPI_DSI_LPCLK>;
+ clock-names = "pllclk", "sysclk", "aclk", "pclk", "vclk", "lpclk";
+ resets = <&cpg R9A07G044_MIPI_DSI_CMN_RSTB>,
+ <&cpg R9A07G044_MIPI_DSI_ARESET_N>,
+ <&cpg R9A07G044_MIPI_DSI_PRESET_N>;
+ reset-names = "rst", "arst", "prst";
+ power-domains = <&cpg>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&du_out_dsi0>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ data-lanes = <1 2 3 4>;
+ remote-endpoint = <&adv7535_in>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.yaml b/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.yaml
index 0c9785c8db51..e3ec697f89e7 100644
--- a/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/renesas,dw-hdmi.yaml
@@ -38,6 +38,9 @@ properties:
clock-names:
maxItems: 2
+ resets:
+ maxItems: 1
+
ports:
$ref: /schemas/graph.yaml#/properties/ports
@@ -67,6 +70,7 @@ required:
- reg
- clocks
- clock-names
+ - resets
- interrupts
- ports
@@ -85,6 +89,7 @@ examples:
clocks = <&cpg CPG_CORE R8A7795_CLK_S0D4>, <&cpg CPG_MOD 729>;
clock-names = "iahb", "isfr";
power-domains = <&sysc R8A7795_PD_ALWAYS_ON>;
+ resets = <&cpg 729>;
ports {
#address-cells = <1>;
diff --git a/Documentation/devicetree/bindings/display/bridge/renesas,lvds.yaml b/Documentation/devicetree/bindings/display/bridge/renesas,lvds.yaml
index acfc327f70a7..bb9dbfb9beaf 100644
--- a/Documentation/devicetree/bindings/display/bridge/renesas,lvds.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/renesas,lvds.yaml
@@ -28,6 +28,7 @@ properties:
- renesas,r8a7793-lvds # for R-Car M2-N compatible LVDS encoders
- renesas,r8a7795-lvds # for R-Car H3 compatible LVDS encoders
- renesas,r8a7796-lvds # for R-Car M3-W compatible LVDS encoders
+ - renesas,r8a77961-lvds # for R-Car M3-W+ compatible LVDS encoders
- renesas,r8a77965-lvds # for R-Car M3-N compatible LVDS encoders
- renesas,r8a77970-lvds # for R-Car V3M compatible LVDS encoders
- renesas,r8a77980-lvds # for R-Car V3H compatible LVDS encoders
@@ -94,7 +95,6 @@ then:
properties:
clocks:
minItems: 1
- maxItems: 4
items:
- description: Functional clock
- description: EXTAL input clock
@@ -103,7 +103,6 @@ then:
clock-names:
minItems: 1
- maxItems: 4
items:
- const: fck
# The LVDS encoder can use the EXTAL or DU_DOTCLKINx clocks.
@@ -127,12 +126,10 @@ then:
else:
properties:
clocks:
- maxItems: 1
items:
- description: Functional clock
clock-names:
- maxItems: 1
items:
- const: fck
diff --git a/Documentation/devicetree/bindings/display/bridge/samsung,mipi-dsim.yaml b/Documentation/devicetree/bindings/display/bridge/samsung,mipi-dsim.yaml
new file mode 100644
index 000000000000..e841659e20cd
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/samsung,mipi-dsim.yaml
@@ -0,0 +1,255 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/samsung,mipi-dsim.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung MIPI DSIM bridge controller
+
+maintainers:
+ - Inki Dae <inki.dae@samsung.com>
+ - Jagan Teki <jagan@amarulasolutions.com>
+ - Marek Szyprowski <m.szyprowski@samsung.com>
+
+description: |
+ The Samsung MIPI DSIM bridge controller can be found on Exynos
+ and i.MX8M Mini/Nano/Plus SoCs.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - samsung,exynos3250-mipi-dsi
+ - samsung,exynos4210-mipi-dsi
+ - samsung,exynos5410-mipi-dsi
+ - samsung,exynos5422-mipi-dsi
+ - samsung,exynos5433-mipi-dsi
+ - fsl,imx8mm-mipi-dsim
+ - fsl,imx8mp-mipi-dsim
+ - items:
+ - const: fsl,imx8mn-mipi-dsim
+ - const: fsl,imx8mm-mipi-dsim
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ '#address-cells':
+ const: 1
+
+ '#size-cells':
+ const: 0
+
+ clocks:
+ minItems: 2
+ maxItems: 5
+
+ clock-names:
+ minItems: 2
+ maxItems: 5
+
+ samsung,phy-type:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Samsung DSIM PHY type selector.
+
+ power-domains:
+ maxItems: 1
+
+ samsung,power-domain:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: phandle to the associated samsung power domain
+
+ vddcore-supply:
+ description: MIPI DSIM Core voltage supply (e.g. 1.1V)
+
+ vddio-supply:
+ description: MIPI DSIM I/O and PLL voltage supply (e.g. 1.8V)
+
+ samsung,burst-clock-frequency:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ DSIM high speed burst mode frequency.
+
+ samsung,esc-clock-frequency:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ DSIM escape mode frequency.
+
+ samsung,pll-clock-frequency:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ DSIM oscillator clock frequency.
+
+ phys:
+ maxItems: 1
+
+ phy-names:
+ const: dsim
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Input port node to receive pixel data from the
+ display controller. Exactly one endpoint must be
+ specified.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ DSI output port node to the panel or the next bridge
+ in the chain.
+
+required:
+ - clock-names
+ - clocks
+ - compatible
+ - interrupts
+ - reg
+ - samsung,burst-clock-frequency
+ - samsung,esc-clock-frequency
+ - samsung,pll-clock-frequency
+
+allOf:
+ - $ref: ../dsi-controller.yaml#
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-mipi-dsi
+
+ then:
+ properties:
+ clocks:
+ minItems: 5
+
+ clock-names:
+ items:
+ - const: bus_clk
+ - const: phyclk_mipidphy0_bitclkdiv8
+ - const: phyclk_mipidphy0_rxclkesc0
+ - const: sclk_rgb_vclk_to_dsim0
+ - const: sclk_mipi
+
+ ports:
+ required:
+ - port@0
+
+ required:
+ - ports
+ - vddcore-supply
+ - vddio-supply
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5410-mipi-dsi
+
+ then:
+ properties:
+ clocks:
+ minItems: 2
+
+ clock-names:
+ items:
+ - const: bus_clk
+ - const: pll_clk
+
+ required:
+ - vddcore-supply
+ - vddio-supply
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos4210-mipi-dsi
+
+ then:
+ properties:
+ clocks:
+ minItems: 2
+
+ clock-names:
+ items:
+ - const: bus_clk
+ - const: sclk_mipi
+
+ required:
+ - vddcore-supply
+ - vddio-supply
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos3250-mipi-dsi
+
+ then:
+ properties:
+ clocks:
+ minItems: 2
+
+ clock-names:
+ items:
+ - const: bus_clk
+ - const: pll_clk
+
+ required:
+ - vddcore-supply
+ - vddio-supply
+ - samsung,phy-type
+
+additionalProperties:
+ type: object
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos5433.h>
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ dsi@13900000 {
+ compatible = "samsung,exynos5433-mipi-dsi";
+ reg = <0x13900000 0xC0>;
+ interrupts = <GIC_SPI 205 IRQ_TYPE_LEVEL_HIGH>;
+ phys = <&mipi_phy 1>;
+ phy-names = "dsim";
+ clocks = <&cmu_disp CLK_PCLK_DSIM0>,
+ <&cmu_disp CLK_PHYCLK_MIPIDPHY0_BITCLKDIV8>,
+ <&cmu_disp CLK_PHYCLK_MIPIDPHY0_RXCLKESC0>,
+ <&cmu_disp CLK_SCLK_RGB_VCLK_TO_DSIM0>,
+ <&cmu_disp CLK_SCLK_DSIM0>;
+ clock-names = "bus_clk",
+ "phyclk_mipidphy0_bitclkdiv8",
+ "phyclk_mipidphy0_rxclkesc0",
+ "sclk_rgb_vclk_to_dsim0",
+ "sclk_mipi";
+ power-domains = <&pd_disp>;
+ vddcore-supply = <&ldo6_reg>;
+ vddio-supply = <&ldo7_reg>;
+ samsung,burst-clock-frequency = <512000000>;
+ samsung,esc-clock-frequency = <16000000>;
+ samsung,pll-clock-frequency = <24000000>;
+ pinctrl-names = "default";
+ pinctrl-0 = <&te_irq>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+
+ dsi_to_mic: endpoint {
+ remote-endpoint = <&mic_to_dsi>;
+ };
+ };
+ };
+ };
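Since the schema now also covers the i.MX8M DSIM instances, an i.MX8M Mini node might look like the sketch below. The register address, interrupt number, clock specifiers, and frequency values are assumptions based on typical i.MX8MM usage, not part of this patch:

    dsi@32e10000 {
        compatible = "fsl,imx8mm-mipi-dsim";
        reg = <0x32e10000 0x400>;
        interrupts = <GIC_SPI 18 IRQ_TYPE_LEVEL_HIGH>;
        clocks = <&clk IMX8MM_CLK_DSI_CORE>,
                 <&clk IMX8MM_CLK_DSI_PHY_REF>;
        clock-names = "bus_clk", "sclk_mipi";
        phys = <&dphy>;
        phy-names = "dsim";
        samsung,burst-clock-frequency = <891000000>;
        samsung,esc-clock-frequency = <54000000>;
        samsung,pll-clock-frequency = <12000000>;

        ports {
            #address-cells = <1>;
            #size-cells = <0>;

            port@0 {
                reg = <0>;
                dsim_from_lcdif: endpoint {
                    remote-endpoint = <&lcdif_to_dsim>;
                };
            };

            port@1 {
                reg = <1>;
                dsim_to_panel: endpoint {
                    remote-endpoint = <&panel_in>;
                };
            };
        };
    };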
diff --git a/Documentation/devicetree/bindings/display/bridge/sii902x.txt b/Documentation/devicetree/bindings/display/bridge/sii902x.txt
deleted file mode 100644
index 3bc760cc31cb..000000000000
--- a/Documentation/devicetree/bindings/display/bridge/sii902x.txt
+++ /dev/null
@@ -1,78 +0,0 @@
-sii902x HDMI bridge bindings
-
-Required properties:
- - compatible: "sil,sii9022"
- - reg: i2c address of the bridge
-
-Optional properties:
- - interrupts: describe the interrupt line used to inform the host
- about hotplug events.
- - reset-gpios: OF device-tree gpio specification for RST_N pin.
- - iovcc-supply: I/O Supply Voltage (1.8V or 3.3V)
- - cvcc12-supply: Digital Core Supply Voltage (1.2V)
-
- HDMI audio properties:
- - #sound-dai-cells: <0> or <1>. <0> if only i2s or spdif pin
- is wired, <1> if the both are wired. HDMI audio is
- configured only if this property is found.
- - sil,i2s-data-lanes: Array of up to 4 integers with values of 0-3
- Each integer indicates which i2s pin is connected to which
- audio fifo. The first integer selects i2s audio pin for the
- first audio fifo#0 (HDMI channels 1&2), second for fifo#1
- (HDMI channels 3&4), and so on. There is 4 fifos and 4 i2s
- pins (SD0 - SD3). Any i2s pin can be connected to any fifo,
- but there can be no gaps. E.g. an i2s pin must be mapped to
- fifo#0 and fifo#1 before mapping a channel to fifo#2. Default
- value is <0>, describing SD0 pin beiging routed to hdmi audio
- fifo #0.
- - clocks: phandle and clock specifier for each clock listed in
- the clock-names property
- - clock-names: "mclk"
- Describes SII902x MCLK input. MCLK can be used to produce
- HDMI audio CTS values. This property follows
- Documentation/devicetree/bindings/clock/clock-bindings.txt
- consumer binding.
-
- If HDMI audio is configured the sii902x device becomes an I2S
- and/or spdif audio codec component (e.g a digital audio sink),
- that can be used in configuring a full audio devices with
- simple-card or audio-graph-card binding. See their binding
- documents on how to describe the way the sii902x device is
- connected to the rest of the audio system:
- Documentation/devicetree/bindings/sound/simple-card.yaml
- Documentation/devicetree/bindings/sound/audio-graph-card.yaml
- Note: In case of the audio-graph-card binding the used port
- index should be 3.
-
-Optional subnodes:
- - video input: this subnode can contain a video input port node
- to connect the bridge to a display controller output (See this
- documentation [1]).
-
-[1]: Documentation/devicetree/bindings/media/video-interfaces.txt
-
-Example:
- hdmi-bridge@39 {
- compatible = "sil,sii9022";
- reg = <0x39>;
- reset-gpios = <&pioA 1 0>;
- iovcc-supply = <&v3v3_hdmi>;
- cvcc12-supply = <&v1v2_hdmi>;
-
- #sound-dai-cells = <0>;
- sil,i2s-data-lanes = < 0 1 2 >;
- clocks = <&mclk>;
- clock-names = "mclk";
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- bridge_in: endpoint {
- remote-endpoint = <&dc_out>;
- };
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/bridge/sii9234.txt b/Documentation/devicetree/bindings/display/bridge/sii9234.txt
deleted file mode 100644
index a55bf77bd960..000000000000
--- a/Documentation/devicetree/bindings/display/bridge/sii9234.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-Silicon Image SiI9234 HDMI/MHL bridge bindings
-
-Required properties:
- - compatible : "sil,sii9234".
- - reg : I2C address for TPI interface, use 0x39
- - avcc33-supply : MHL/USB Switch Supply Voltage (3.3V)
- - iovcc18-supply : I/O Supply Voltage (1.8V)
- - avcc12-supply : TMDS Analog Supply Voltage (1.2V)
- - cvcc12-supply : Digital Core Supply Voltage (1.2V)
- - interrupts: interrupt specifier of INT pin
- - reset-gpios: gpio specifier of RESET pin (active low)
- - video interfaces: Device node can contain two video interface port
- nodes for HDMI encoder and connector according to [1].
- - port@0 - MHL to HDMI
- - port@1 - MHL to connector
-
-[1]: Documentation/devicetree/bindings/media/video-interfaces.txt
-
-
-Example:
- sii9234@39 {
- compatible = "sil,sii9234";
- reg = <0x39>;
- avcc33-supply = <&vcc33mhl>;
- iovcc18-supply = <&vcc18mhl>;
- avcc12-supply = <&vsil12>;
- cvcc12-supply = <&vsil12>;
- reset-gpios = <&gpf3 4 GPIO_ACTIVE_LOW>;
- interrupt-parent = <&gpf3>;
- interrupts = <5 IRQ_TYPE_LEVEL_HIGH>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- mhl_to_hdmi: endpoint {
- remote-endpoint = <&hdmi_to_mhl>;
- };
- };
- port@1 {
- reg = <1>;
- mhl_to_connector: endpoint {
- remote-endpoint = <&connector_to_mhl>;
- };
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/bridge/sil,sii8620.yaml b/Documentation/devicetree/bindings/display/bridge/sil,sii8620.yaml
new file mode 100644
index 000000000000..6d1a36b76fcb
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/sil,sii8620.yaml
@@ -0,0 +1,108 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/sil,sii8620.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Silicon Image SiI8620 HDMI/MHL bridge
+
+maintainers:
+ - Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
+
+properties:
+ compatible:
+ const: sil,sii8620
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: xtal
+
+ cvcc10-supply:
+ description: Digital Core Supply Voltage (1.0V)
+
+ interrupts:
+ maxItems: 1
+
+ iovcc18-supply:
+ description: I/O Supply Voltage (1.8V)
+
+ reset-gpios:
+ maxItems: 1
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ unevaluatedProperties: false
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Video port for HDMI (encoder) input
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ MHL to connector port
+
+ required:
+ - port@0
+ - port@1
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - cvcc10-supply
+ - interrupts
+ - iovcc18-supply
+ - reset-gpios
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ bridge@39 {
+ reg = <0x39>;
+ compatible = "sil,sii8620";
+ cvcc10-supply = <&ldo36_reg>;
+ iovcc18-supply = <&ldo34_reg>;
+ interrupt-parent = <&gpf0>;
+ interrupts = <2 IRQ_TYPE_LEVEL_HIGH>;
+ reset-gpios = <&gpv7 0 GPIO_ACTIVE_LOW>;
+ clocks = <&pmu_system_controller 0>;
+ clock-names = "xtal";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ mhl_to_hdmi: endpoint {
+ remote-endpoint = <&hdmi_to_mhl>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ mhl_to_musb_con: endpoint {
+ remote-endpoint = <&musb_con_to_mhl>;
+ };
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/bridge/sil,sii9022.yaml b/Documentation/devicetree/bindings/display/bridge/sil,sii9022.yaml
new file mode 100644
index 000000000000..5a69547ad3d7
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/sil,sii9022.yaml
@@ -0,0 +1,131 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/sil,sii9022.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Silicon Image sii902x HDMI bridge
+
+maintainers:
+ - Boris Brezillon <bbrezillon@kernel.org>
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - sil,sii9022-cpi # CEC Programming Interface
+ - sil,sii9022-tpi # Transmitter Programming Interface
+ - const: sil,sii9022
+ - const: sil,sii9022
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+ description: Interrupt line used to inform the host about hotplug events.
+
+ reset-gpios:
+ maxItems: 1
+
+ iovcc-supply:
+ description: I/O Supply Voltage (1.8V or 3.3V)
+
+ cvcc12-supply:
+ description: Digital Core Supply Voltage (1.2V)
+
+ '#sound-dai-cells':
+ enum: [ 0, 1 ]
+ description: |
+ <0> if only I2S or S/PDIF pin is wired,
+ <1> if both are wired.
+ HDMI audio is configured only if this property is found.
+ If HDMI audio is configured, the sii902x device becomes an I2S and/or
+ S/PDIF audio codec component (e.g. a digital audio sink) that can be
+ used in configuring full audio devices with simple-card or
+ audio-graph-card bindings. See their binding documents on how to
+ describe the way the sii902x device is connected to the rest of the
+ audio system:
+ Documentation/devicetree/bindings/sound/simple-card.yaml
+ Documentation/devicetree/bindings/sound/audio-graph-card.yaml
+ Note: In case of the audio-graph-card binding the used port index should
+ be 3.
+
+ sil,i2s-data-lanes:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ minItems: 1
+ maxItems: 4
+ uniqueItems: true
+ items:
+ enum: [ 0, 1, 2, 3 ]
+ description:
+ Each integer indicates which I2S pin is connected to which audio FIFO.
+ The first integer selects the I2S audio pin for the first audio FIFO#0
+ (HDMI channels 1&2), the second for FIFO#1 (HDMI channels 3&4), and so
+ on. There are 4 FIFOs and 4 I2S pins (SD0 - SD3). Any I2S pin can be
+ connected to any FIFO, but there can be no gaps. E.g. an I2S pin must be
+ mapped to FIFO#0 and FIFO#1 before mapping a channel to FIFO#2. The
+ default value is <0>, describing SD0 pin being routed to HDMI audio
+ FIFO#0.
+
+ clocks:
+ maxItems: 1
+ description: MCLK input. MCLK can be used to produce HDMI audio CTS values.
+
+ clock-names:
+ const: mclk
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Parallel RGB input port
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: HDMI output port
+
+ port@3:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Sound input port
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ hdmi-bridge@39 {
+ compatible = "sil,sii9022";
+ reg = <0x39>;
+ reset-gpios = <&pioA 1 0>;
+ iovcc-supply = <&v3v3_hdmi>;
+ cvcc12-supply = <&v1v2_hdmi>;
+
+ #sound-dai-cells = <0>;
+ sil,i2s-data-lanes = < 0 1 2 >;
+ clocks = <&mclk>;
+ clock-names = "mclk";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ bridge_in: endpoint {
+ remote-endpoint = <&dc_out>;
+ };
+ };
+ };
+ };
+ };
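The schema above also reserves port@3 as the sound input for audio-graph-card setups (hence the note that the port index used should be 3). Inside the bridge's ports node, that could look like the fragment below; the cpu_dai_out label is an assumption:

    port@3 {
        reg = <3>;
        sii_audio_in: endpoint {
            remote-endpoint = <&cpu_dai_out>;
        };
    };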
diff --git a/Documentation/devicetree/bindings/display/bridge/sil,sii9234.yaml b/Documentation/devicetree/bindings/display/bridge/sil,sii9234.yaml
new file mode 100644
index 000000000000..176181d25530
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/sil,sii9234.yaml
@@ -0,0 +1,110 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/sil,sii9234.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Silicon Image SiI9234 HDMI/MHL bridge
+
+maintainers:
+ - Maciej Purski <m.purski@samsung.com>
+
+properties:
+ compatible:
+ const: sil,sii9234
+
+ reg:
+ description: I2C address for TPI interface
+ maxItems: 1
+
+ avcc12-supply:
+ description: TMDS Analog Supply Voltage, 1.2V
+
+ avcc33-supply:
+ description: MHL/USB Switch Supply Voltage, 3.3V
+
+ cvcc12-supply:
+ description: Digital Core Supply Voltage, 1.2V
+
+ iovcc18-supply:
+ description: I/O voltage supply, 1.8V
+
+ interrupts:
+ maxItems: 1
+
+ reset-gpios:
+ description: GPIO connected to the reset pin.
+ maxItems: 1
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Video port for HDMI (encoder) input
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ MHL to connector port
+
+ required:
+ - port@0
+
+required:
+ - compatible
+ - reg
+ - avcc12-supply
+ - avcc33-supply
+ - cvcc12-supply
+ - iovcc18-supply
+ - interrupts
+ - reset-gpios
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ bridge@39 {
+ compatible = "sil,sii9234";
+ reg = <0x39>;
+ avcc12-supply = <&vsil12>;
+ avcc33-supply = <&vcc33mhl>;
+ cvcc12-supply = <&vsil12>;
+ iovcc18-supply = <&vcc18mhl>;
+ interrupt-parent = <&gpf3>;
+ interrupts = <5 IRQ_TYPE_LEVEL_HIGH>;
+ reset-gpios = <&gpf3 4 GPIO_ACTIVE_LOW>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ mhl_to_hdmi: endpoint {
+ remote-endpoint = <&hdmi_to_mhl>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ mhl_to_connector: endpoint {
+ remote-endpoint = <&connector_to_mhl>;
+ };
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/bridge/sil-sii8620.txt b/Documentation/devicetree/bindings/display/bridge/sil-sii8620.txt
deleted file mode 100644
index b05052f7d62f..000000000000
--- a/Documentation/devicetree/bindings/display/bridge/sil-sii8620.txt
+++ /dev/null
@@ -1,33 +0,0 @@
-Silicon Image SiI8620 HDMI/MHL bridge bindings
-
-Required properties:
- - compatible: "sil,sii8620"
- - reg: i2c address of the bridge
- - cvcc10-supply: Digital Core Supply Voltage (1.0V)
- - iovcc18-supply: I/O Supply Voltage (1.8V)
- - interrupts: interrupt specifier of INT pin
- - reset-gpios: gpio specifier of RESET pin
- - clocks, clock-names: specification and name of "xtal" clock
- - video interfaces: Device node can contain video interface port
- node for HDMI encoder according to [1].
-
-[1]: Documentation/devicetree/bindings/media/video-interfaces.txt
-
-Example:
- sii8620@39 {
- reg = <0x39>;
- compatible = "sil,sii8620";
- cvcc10-supply = <&ldo36_reg>;
- iovcc18-supply = <&ldo34_reg>;
- interrupt-parent = <&gpf0>;
- interrupts = <2 0>;
- reset-gpio = <&gpv7 0 0>;
- clocks = <&pmu_system_controller 0>;
- clock-names = "xtal";
-
- port {
- mhl_to_hdmi: endpoint {
- remote-endpoint = <&hdmi_to_mhl>;
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/bridge/snps,dw-mipi-dsi.yaml b/Documentation/devicetree/bindings/display/bridge/snps,dw-mipi-dsi.yaml
index 3c3e51af154b..0b51c64f141a 100644
--- a/Documentation/devicetree/bindings/display/bridge/snps,dw-mipi-dsi.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/snps,dw-mipi-dsi.yaml
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Synopsys DesignWare MIPI DSI host controller
maintainers:
- - Philippe CORNU <philippe.cornu@st.com>
+ - Philippe CORNU <philippe.cornu@foss.st.com>
description: |
This document defines device tree properties for the Synopsys DesignWare MIPI
@@ -26,19 +26,9 @@ properties:
reg:
maxItems: 1
- clocks:
- items:
- - description: Module clock
- - description: DSI bus clock for either AHB and APB
- - description: Pixel clock for the DPI/RGB input
- minItems: 2
-
- clock-names:
- items:
- - const: ref
- - const: pclk
- - const: px_clk
- minItems: 2
+ clocks: true
+
+ clock-names: true
resets:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/display/bridge/synopsys,dw-hdmi.yaml b/Documentation/devicetree/bindings/display/bridge/synopsys,dw-hdmi.yaml
index 9be44a682e67..4b7e54a8f037 100644
--- a/Documentation/devicetree/bindings/display/bridge/synopsys,dw-hdmi.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/synopsys,dw-hdmi.yaml
@@ -26,9 +26,7 @@ properties:
reg-io-width:
description:
Width (in bytes) of the registers specified by the reg property.
- allOf:
- - $ref: /schemas/types.yaml#/definitions/uint32
- - enum: [1, 4]
+ enum: [1, 4]
default: 1
clocks:
diff --git a/Documentation/devicetree/bindings/display/bridge/tda998x.txt b/Documentation/devicetree/bindings/display/bridge/tda998x.txt
deleted file mode 100644
index f5a02f61dd36..000000000000
--- a/Documentation/devicetree/bindings/display/bridge/tda998x.txt
+++ /dev/null
@@ -1,54 +0,0 @@
-Device-Tree bindings for the NXP TDA998x HDMI transmitter
-
-Required properties;
- - compatible: must be "nxp,tda998x"
-
- - reg: I2C address
-
-Required node:
- - port: Input port node with endpoint definition, as described
- in Documentation/devicetree/bindings/graph.txt
-
-Optional properties:
- - interrupts: interrupt number and trigger type
- default: polling
-
- - pinctrl-0: pin control group to be used for
- screen plug/unplug interrupt.
-
- - pinctrl-names: must contain a "default" entry.
-
- - video-ports: 24 bits value which defines how the video controller
- output is wired to the TDA998x input - default: <0x230145>
-
- - audio-ports: array of 8-bit values, 2 values per one DAI[1].
- The first value defines the DAI type: TDA998x_SPDIF or TDA998x_I2S[2].
- The second value defines the tda998x AP_ENA reg content when the DAI
- in question is used. The implementation allows one or two DAIs. If two
- DAIs are defined, they must be of different type.
-
- - nxp,calib-gpios: calibration GPIO, which must correspond with the
- gpio used for the TDA998x interrupt pin.
-
-[1] Documentation/sound/soc/dai.rst
-[2] include/dt-bindings/display/tda998x.h
-
-Example:
-
-#include <dt-bindings/display/tda998x.h>
-
- tda998x: hdmi-encoder {
- compatible = "nxp,tda998x";
- reg = <0x70>;
- interrupt-parent = <&gpio0>;
- interrupts = <27 2>; /* falling edge */
- pinctrl-0 = <&pmx_camera>;
- pinctrl-names = "default";
- video-ports = <0x230145>;
-
- #sound-dai-cells = <2>;
- /* DAI-format AP_ENA reg value */
- audio-ports = < TDA998x_SPDIF 0x04
- TDA998x_I2S 0x03>;
-
- };
diff --git a/Documentation/devicetree/bindings/display/bridge/ti,dlpc3433.yaml b/Documentation/devicetree/bindings/display/bridge/ti,dlpc3433.yaml
new file mode 100644
index 000000000000..d3f84d220723
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/ti,dlpc3433.yaml
@@ -0,0 +1,117 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/ti,dlpc3433.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TI DLPC3433 MIPI DSI to DMD bridge
+
+maintainers:
+ - Jagan Teki <jagan@amarulasolutions.com>
+ - Christopher Vollo <chris@renewoutreach.org>
+
+description: |
+ TI DLPC3433 is a MIPI DSI based display controller bridge
+ for driving high resolution DMD based projectors.
+
+ It has a flexible configuration of MIPI DSI and DPI signal
+ input that produces a DMD output in RGB565, RGB666, RGB888
+ formats.
+
+ It supports up to 720p resolution with 60 and 120 Hz refresh
+ rates.
+
+properties:
+ compatible:
+ const: ti,dlpc3433
+
+ reg:
+ enum:
+ - 0x1b
+ - 0x1d
+
+ enable-gpios:
+ description: PROJ_ON pin; the chip powers up when PROJ_ON is high.
+
+ vcc_intf-supply:
+ description: A 1.8V/3.3V supply that powers the Host I/O.
+
+ vcc_flsh-supply:
+ description: A 1.8V/3.3V supply that powers the Flash I/O.
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
+ description: Video port for MIPI DSI input.
+
+ properties:
+ endpoint:
+ $ref: /schemas/media/video-interfaces.yaml#
+ unevaluatedProperties: false
+
+ properties:
+ data-lanes:
+ description: array of physical DSI data lane indexes.
+ minItems: 1
+ items:
+ - const: 1
+ - const: 2
+ - const: 3
+ - const: 4
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Video port for DMD output.
+
+ required:
+ - port@0
+ - port@1
+
+required:
+ - compatible
+ - reg
+ - enable-gpios
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ bridge@1b {
+ compatible = "ti,dlpc3433";
+ reg = <0x1b>;
+ enable-gpios = <&gpio2 1 GPIO_ACTIVE_HIGH>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+
+ bridge_in_dsi: endpoint {
+ remote-endpoint = <&dsi_out_bridge>;
+ data-lanes = <1 2 3 4>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+
+ bridge_out_panel: endpoint {
+ remote-endpoint = <&panel_out_bridge>;
+ };
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/bridge/ti,sn65dsi83.yaml b/Documentation/devicetree/bindings/display/bridge/ti,sn65dsi83.yaml
index b446d0f0f1b4..48a97bb3e2e0 100644
--- a/Documentation/devicetree/bindings/display/bridge/ti,sn65dsi83.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/ti,sn65dsi83.yaml
@@ -32,6 +32,9 @@ properties:
maxItems: 1
description: GPIO specifier for bridge_en pin (active high).
+ vcc-supply:
+ description: A 1.8V power supply (see regulator/regulator.yaml).
+
ports:
$ref: /schemas/graph.yaml#/properties/ports
@@ -91,7 +94,6 @@ properties:
required:
- compatible
- reg
- - enable-gpios
- ports
allOf:
@@ -133,6 +135,7 @@ examples:
reg = <0x2d>;
enable-gpios = <&gpio2 1 GPIO_ACTIVE_HIGH>;
+ vcc-supply = <&reg_sn65dsi83_1v8>;
ports {
#address-cells = <1>;
diff --git a/Documentation/devicetree/bindings/display/bridge/ti,sn65dsi86.yaml b/Documentation/devicetree/bindings/display/bridge/ti,sn65dsi86.yaml
index 911564468c5e..6ec6d287bff4 100644
--- a/Documentation/devicetree/bindings/display/bridge/ti,sn65dsi86.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/ti,sn65dsi86.yaml
@@ -90,7 +90,7 @@ properties:
properties:
endpoint:
- $ref: /schemas/graph.yaml#/$defs/endpoint-base
+ $ref: /schemas/media/video-interfaces.yaml#
unevaluatedProperties: false
properties:
@@ -106,7 +106,6 @@ properties:
description:
If you have 1 logical lane the bridge supports routing
to either port 0 or port 1. Port 0 is suggested.
- See ../../media/video-interface.txt for details.
- minItems: 2
maxItems: 2
@@ -118,7 +117,6 @@ properties:
description:
If you have 2 logical lanes the bridge supports
reordering but only on physical ports 0 and 1.
- See ../../media/video-interface.txt for details.
- minItems: 4
maxItems: 4
@@ -132,7 +130,6 @@ properties:
description:
If you have 4 logical lanes the bridge supports
reordering in any way.
- See ../../media/video-interface.txt for details.
lane-polarities:
minItems: 1
@@ -141,7 +138,6 @@ properties:
enum:
- 0
- 1
- description: See ../../media/video-interface.txt
dependencies:
lane-polarities: [data-lanes]
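As the retained descriptions state, data-lanes maps logical DP lanes onto the bridge's physical ports, and with four logical lanes any reordering is allowed. Purely as an illustration, a four-lane output endpoint that reverses the mapping could be written as below; the remote-endpoint label is an assumption:

    endpoint {
        remote-endpoint = <&dp_connector_in>;
        data-lanes = <3 2 1 0>;  /* logical lanes 0-3 routed to physical ports 3-0 */
    };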
diff --git a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358762.yaml b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358762.yaml
index 5216c27fc0ad..81ca3cbc7abe 100644
--- a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358762.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358762.yaml
@@ -39,7 +39,6 @@ properties:
Video port for MIPI DPI output (panel or connector).
required:
- - port@0
- port@1
required:
@@ -52,7 +51,7 @@ additionalProperties: false
examples:
- |
- i2c1 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
diff --git a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358764.txt b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358764.txt
deleted file mode 100644
index 8f9abf28a8fa..000000000000
--- a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358764.txt
+++ /dev/null
@@ -1,35 +0,0 @@
-TC358764 MIPI-DSI to LVDS panel bridge
-
-Required properties:
- - compatible: "toshiba,tc358764"
- - reg: the virtual channel number of a DSI peripheral
- - vddc-supply: core voltage supply, 1.2V
- - vddio-supply: I/O voltage supply, 1.8V or 3.3V
- - vddlvds-supply: LVDS1/2 voltage supply, 3.3V
- - reset-gpios: a GPIO spec for the reset pin
-
-The device node can contain following 'port' child nodes,
-according to the OF graph bindings defined in [1]:
- 0: DSI Input, not required, if the bridge is DSI controlled
- 1: LVDS Output, mandatory
-
-[1]: Documentation/devicetree/bindings/media/video-interfaces.txt
-
-Example:
-
- bridge@0 {
- reg = <0>;
- compatible = "toshiba,tc358764";
- vddc-supply = <&vcc_1v2_reg>;
- vddio-supply = <&vcc_1v8_reg>;
- vddlvds-supply = <&vcc_3v3_reg>;
- reset-gpios = <&gpd1 6 GPIO_ACTIVE_LOW>;
- #address-cells = <1>;
- #size-cells = <0>;
- port@1 {
- reg = <1>;
- lvds_ep: endpoint {
- remote-endpoint = <&panel_ep>;
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358764.yaml b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358764.yaml
new file mode 100644
index 000000000000..866607400514
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358764.yaml
@@ -0,0 +1,89 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/toshiba,tc358764.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Toshiba TC358764 MIPI-DSI to LVDS bridge
+
+maintainers:
+ - Andrzej Hajda <andrzej.hajda@intel.com>
+
+properties:
+ compatible:
+ const: toshiba,tc358764
+
+ reg:
+ description: Virtual channel number of a DSI peripheral
+ maxItems: 1
+
+ reset-gpios:
+ maxItems: 1
+
+ vddc-supply:
+ description: Core voltage supply, 1.2V
+
+ vddio-supply:
+ description: I/O voltage supply, 1.8V or 3.3V
+
+ vddlvds-supply:
+ description: LVDS1/2 voltage supply, 3.3V
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Video port for MIPI DSI input, used if the bridge is DSI-controlled.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Video port for LVDS output (panel or connector).
+
+ required:
+ - port@1
+
+required:
+ - compatible
+ - reg
+ - reset-gpios
+ - vddc-supply
+ - vddio-supply
+ - vddlvds-supply
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ bridge@0 {
+ compatible = "toshiba,tc358764";
+ reg = <0>;
+
+ reset-gpios = <&gpd1 6 GPIO_ACTIVE_LOW>;
+ vddc-supply = <&vcc_1v2_reg>;
+ vddio-supply = <&vcc_1v8_reg>;
+ vddlvds-supply = <&vcc_3v3_reg>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@1 {
+ reg = <1>;
+ lvds_ep: endpoint {
+ remote-endpoint = <&panel_ep>;
+ };
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358767.txt b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358767.txt
deleted file mode 100644
index 583c5e9dbe6b..000000000000
--- a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358767.txt
+++ /dev/null
@@ -1,54 +0,0 @@
-Toshiba TC358767 eDP bridge bindings
-
-Required properties:
- - compatible: "toshiba,tc358767"
- - reg: i2c address of the bridge, 0x68 or 0x0f, depending on bootstrap pins
- - clock-names: should be "ref"
- - clocks: OF device-tree clock specification for refclk input. The reference
- clock rate must be 13 MHz, 19.2 MHz, 26 MHz, or 38.4 MHz.
-
-Optional properties:
- - shutdown-gpios: OF device-tree gpio specification for SD pin
- (active high shutdown input)
- - reset-gpios: OF device-tree gpio specification for RSTX pin
- (active low system reset)
- - toshiba,hpd-pin: TC358767 GPIO pin number to which HPD is connected to (0 or 1)
- - ports: the ports node can contain video interface port nodes to connect
- to a DPI/DSI source and to an eDP/DP sink according to [1][2]:
- - port@0: DSI input port
- - port@1: DPI input port
- - port@2: eDP/DP output port
-
-[1]: Documentation/devicetree/bindings/graph.txt
-[2]: Documentation/devicetree/bindings/media/video-interfaces.txt
-
-Example:
- edp-bridge@68 {
- compatible = "toshiba,tc358767";
- reg = <0x68>;
- shutdown-gpios = <&gpio3 23 GPIO_ACTIVE_HIGH>;
- reset-gpios = <&gpio3 24 GPIO_ACTIVE_LOW>;
- clock-names = "ref";
- clocks = <&edp_refclk>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@1 {
- reg = <1>;
-
- bridge_in: endpoint {
- remote-endpoint = <&dpi_out>;
- };
- };
-
- port@2 {
- reg = <2>;
-
- bridge_out: endpoint {
- remote-endpoint = <&panel_in>;
- };
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358767.yaml b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358767.yaml
new file mode 100644
index 000000000000..e1494b5007cb
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358767.yaml
@@ -0,0 +1,174 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/bridge/toshiba,tc358767.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Toshiba TC358767 eDP bridge
+
+maintainers:
+ - Andrey Gusakov <andrey.gusakov@cogentembedded.com>
+
+description: The TC358767 is a bridge device which converts DSI/DPI to eDP/DP.
+
+properties:
+ compatible:
+ const: toshiba,tc358767
+
+ reg:
+ enum:
+ - 0x68
+ - 0x0f
+ description: |
+ I2C address of the bridge, 0x68 or 0x0f, depending on bootstrap pins.
+
+ clock-names:
+ const: ref
+
+ clocks:
+ maxItems: 1
+ description: |
+ OF device-tree clock specification for refclk input. The reference
+ clock rate must be 13 MHz, 19.2 MHz, 26 MHz, or 38.4 MHz.
+
+ shutdown-gpios:
+ maxItems: 1
+ description: |
+ OF device-tree GPIO specification for the SD pin (active-high shutdown input).
+
+ reset-gpios:
+ maxItems: 1
+ description: |
+ OF device-tree GPIO specification for the RSTX pin (active-low system reset).
+
+ toshiba,hpd-pin:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum:
+ - 0
+ - 1
+ description: TC358767 GPIO pin number to which HPD is connected (0 or 1)
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
+ description: |
+ DSI input port. The remote endpoint phandle should be a
+ reference to a valid DSI output endpoint node.
+
+ properties:
+ endpoint:
+ $ref: /schemas/media/video-interfaces.yaml#
+ unevaluatedProperties: false
+
+ properties:
+ data-lanes:
+ description: array of physical DSI data lane indexes.
+ minItems: 1
+ items:
+ - const: 1
+ - const: 2
+ - const: 3
+ - const: 4
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: |
+ DPI input/output port. The remote endpoint phandle should be a
+ reference to a valid DPI output or input endpoint node.
+
+ port@2:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: |
+ eDP/DP output port. The remote endpoint phandle should be a
+ reference to a valid eDP panel input endpoint node. This port is
+ optional; the output is treated as a DP panel if not defined.
+
+ oneOf:
+ - required:
+ - port@0
+ - required:
+ - port@1
+
+required:
+ - compatible
+ - reg
+ - clock-names
+ - clocks
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ /* DPI input and eDP output */
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ edp-bridge@68 {
+ compatible = "toshiba,tc358767";
+ reg = <0x68>;
+ shutdown-gpios = <&gpio3 23 GPIO_ACTIVE_HIGH>;
+ reset-gpios = <&gpio3 24 GPIO_ACTIVE_LOW>;
+ clock-names = "ref";
+ clocks = <&edp_refclk>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@1 {
+ reg = <1>;
+
+ bridge_in_0: endpoint {
+ remote-endpoint = <&dpi_out>;
+ };
+ };
+
+ port@2 {
+ reg = <2>;
+
+ bridge_out: endpoint {
+ remote-endpoint = <&panel_in>;
+ };
+ };
+ };
+ };
+ };
+ - |
+ /* DPI input and DP output */
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ edp-bridge@68 {
+ compatible = "toshiba,tc358767";
+ reg = <0x68>;
+ shutdown-gpios = <&gpio3 23 GPIO_ACTIVE_HIGH>;
+ reset-gpios = <&gpio3 24 GPIO_ACTIVE_LOW>;
+ clock-names = "ref";
+ clocks = <&edp_refclk>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@1 {
+ reg = <1>;
+
+ bridge_in_1: endpoint {
+ remote-endpoint = <&dpi_out>;
+ };
+ };
+ };
+ };
+ };
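Both examples feed the bridge through its DPI port. A DSI-fed variant, satisfying the port@0 arm of the oneOf, might look like this sketch; the endpoint labels and lane count are assumptions:

    edp-bridge@68 {
        compatible = "toshiba,tc358767";
        reg = <0x68>;
        clock-names = "ref";
        clocks = <&edp_refclk>;

        ports {
            #address-cells = <1>;
            #size-cells = <0>;

            port@0 {
                reg = <0>;
                bridge_in_dsi: endpoint {
                    remote-endpoint = <&dsi_out>;
                    data-lanes = <1 2 3 4>;
                };
            };

            port@2 {
                reg = <2>;
                bridge_out_edp: endpoint {
                    remote-endpoint = <&panel_in>;
                };
            };
        };
    };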
diff --git a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358768.yaml b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358768.yaml
index eacfe7165083..779d8c57f854 100644
--- a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358768.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358768.yaml
@@ -58,6 +58,7 @@ properties:
properties:
data-lines:
+ $ref: /schemas/types.yaml#/definitions/uint32
enum: [ 16, 18, 24 ]
port@1:
@@ -77,17 +78,20 @@ required:
- vddio-supply
- ports
-additionalProperties: false
+allOf:
+ - $ref: ../dsi-controller.yaml#
+
+unevaluatedProperties: false
examples:
- |
#include <dt-bindings/gpio/gpio.h>
- i2c1 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
- dsi_bridge: dsi-bridge@e {
+ dsi_bridge: dsi@e {
compatible = "toshiba,tc358768";
reg = <0xe>;
diff --git a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358775.yaml b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358775.yaml
index 10471c6c1ff9..d879c700594a 100644
--- a/Documentation/devicetree/bindings/display/bridge/toshiba,tc358775.yaml
+++ b/Documentation/devicetree/bindings/display/bridge/toshiba,tc358775.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/bridge/toshiba,tc358775.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Toshiba TC358775 DSI to LVDS bridge bindings
+title: Toshiba TC358775 DSI to LVDS bridge
maintainers:
- Vinay Simha BN <simhavcs@gmail.com>
diff --git a/Documentation/devicetree/bindings/display/dp-aux-bus.yaml b/Documentation/devicetree/bindings/display/dp-aux-bus.yaml
index 5e4afe9f98fb..0ece7b01790b 100644
--- a/Documentation/devicetree/bindings/display/dp-aux-bus.yaml
+++ b/Documentation/devicetree/bindings/display/dp-aux-bus.yaml
@@ -26,7 +26,7 @@ description:
properties:
$nodename:
- const: "aux-bus"
+ const: aux-bus
panel:
$ref: panel/panel-common.yaml#
diff --git a/Documentation/devicetree/bindings/display/dsi-controller.yaml b/Documentation/devicetree/bindings/display/dsi-controller.yaml
index ca21671f6bdd..67ce10307ee0 100644
--- a/Documentation/devicetree/bindings/display/dsi-controller.yaml
+++ b/Documentation/devicetree/bindings/display/dsi-controller.yaml
@@ -30,6 +30,15 @@ properties:
$nodename:
pattern: "^dsi(@.*)?$"
+ clock-master:
+ type: boolean
+ description:
+ Should be enabled if the host is being used in conjunction with
+ another DSI host to drive the same peripheral. Hardware supporting
+ such a configuration generally requires the data on both the busses
+ to be driven by the same clock. Only the DSI host instance
+ controlling this clock should contain this property.
+
"#address-cells":
const: 1
@@ -52,15 +61,6 @@ patternProperties:
case the reg property can take multiple entries, one for each virtual
channel that the peripheral responds to.
- clock-master:
- type: boolean
- description:
- Should be enabled if the host is being used in conjunction with
- another DSI host to drive the same peripheral. Hardware supporting
- such a configuration generally requires the data on both the busses
- to be driven by the same clock. Only the DSI host instance
- controlling this clock should contain this property.
-
enforce-video-mode:
type: boolean
description:
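With clock-master moved from the peripheral to the host level, a dual-link configuration marks only the clock-providing instance. A minimal sketch, with node names and addresses assumed:

    dsi0: dsi@ff450000 {
        /* this host drives the shared byte clock for both links */
        clock-master;
        /* ... remaining host properties and peripheral child node ... */
    };

    dsi1: dsi@ff460000 {
        /* companion host: deliberately carries no clock-master property */
        /* ... remaining host properties and peripheral child node ... */
    };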
diff --git a/Documentation/devicetree/bindings/display/exynos/exynos-mic.txt b/Documentation/devicetree/bindings/display/exynos/exynos-mic.txt
deleted file mode 100644
index 0fba2ee6440a..000000000000
--- a/Documentation/devicetree/bindings/display/exynos/exynos-mic.txt
+++ /dev/null
@@ -1,51 +0,0 @@
-Device-Tree bindings for Samsung Exynos SoC mobile image compressor (MIC)
-
-MIC (mobile image compressor) resides between decon and mipi dsi. Mipi dsi is
-not capable to transfer high resoltuion frame data as decon can send. MIC
-solves this problem by compressing the frame data by 1/2 before it is
-transferred through mipi dsi. The compressed frame data must be uncompressed in
-the panel PCB.
-
-Required properties:
-- compatible: value should be "samsung,exynos5433-mic".
-- reg: physical base address and length of the MIC registers set and system
- register of mic.
-- clocks: must include clock specifiers corresponding to entries in the
- clock-names property.
-- clock-names: list of clock names sorted in the same order as the clocks
- property. Must contain "pclk_mic0", "sclk_rgb_vclk_to_mic0".
-- samsung,disp-syscon: the reference node for syscon for DISP block.
-- ports: contains a port which is connected to decon node and dsi node.
- address-cells and size-cells must 1 and 0, respectively.
-- port: contains an endpoint node which is connected to the endpoint in the
- decon node or dsi node. The reg value must be 0 and 1 respectively.
-
-Example:
-SoC specific DT entry:
-mic: mic@13930000 {
- compatible = "samsung,exynos5433-mic";
- reg = <0x13930000 0x48>;
- clocks = <&cmu_disp CLK_PCLK_MIC0>,
- <&cmu_disp CLK_SCLK_RGB_VCLK_TO_MIC0>;
- clock-names = "pclk_mic0", "sclk_rgb_vclk_to_mic0";
- samsung,disp-syscon = <&syscon_disp>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- mic_to_decon: endpoint {
- remote-endpoint = <&decon_to_mic>;
- };
- };
-
- port@1 {
- reg = <1>;
- mic_to_dsi: endpoint {
- remote-endpoint = <&dsi_to_mic>;
- };
- };
- };
-};
diff --git a/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt b/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt
deleted file mode 100644
index 775193e1c641..000000000000
--- a/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt
+++ /dev/null
@@ -1,60 +0,0 @@
-Device-Tree bindings for Samsung Exynos SoC display controller (DECON)
-
-DECON (Display and Enhancement Controller) is the Display Controller for the
-Exynos series of SoCs which transfers the image data from a video memory
-buffer to an external LCD interface.
-
-Required properties:
-- compatible: value should be one of:
- "samsung,exynos5433-decon", "samsung,exynos5433-decon-tv";
-- reg: physical base address and length of the DECON registers set.
-- interrupt-names: should contain the interrupt names depending on mode of work:
- video mode: "vsync",
- command mode: "lcd_sys",
- command mode with software trigger: "lcd_sys", "te".
-- interrupts or interrupts-extended: list of interrupt specifiers corresponding
- to names privided in interrupt-names, as described in
- interrupt-controller/interrupts.txt
-- clocks: must include clock specifiers corresponding to entries in the
- clock-names property.
-- clock-names: list of clock names sorted in the same order as the clocks
- property. Must contain "pclk", "aclk_decon", "aclk_smmu_decon0x",
- "aclk_xiu_decon0x", "pclk_smmu_decon0x", "aclk_smmu_decon1x",
- "aclk_xiu_decon1x", "pclk_smmu_decon1x", clk_decon_vclk",
- "sclk_decon_eclk"
-- ports: contains a port which is connected to mic node. address-cells and
- size-cells must 1 and 0, respectively.
-- port: contains an endpoint node which is connected to the endpoint in the mic
- node. The reg value muset be 0.
-
-Example:
-SoC specific DT entry:
-decon: decon@13800000 {
- compatible = "samsung,exynos5433-decon";
- reg = <0x13800000 0x2104>;
- clocks = <&cmu_disp CLK_ACLK_DECON>, <&cmu_disp CLK_ACLK_SMMU_DECON0X>,
- <&cmu_disp CLK_ACLK_XIU_DECON0X>,
- <&cmu_disp CLK_PCLK_SMMU_DECON0X>,
- <&cmu_disp CLK_ACLK_SMMU_DECON1X>,
- <&cmu_disp CLK_ACLK_XIU_DECON1X>,
- <&cmu_disp CLK_PCLK_SMMU_DECON1X>,
- <&cmu_disp CLK_SCLK_DECON_VCLK>,
- <&cmu_disp CLK_SCLK_DECON_ECLK>;
- clock-names = "aclk_decon", "aclk_smmu_decon0x", "aclk_xiu_decon0x",
- "pclk_smmu_decon0x", "aclk_smmu_decon1x", "aclk_xiu_decon1x",
- "pclk_smmu_decon1x", "sclk_decon_vclk", "sclk_decon_eclk";
- interrupt-names = "vsync", "lcd_sys";
- interrupts = <0 202 0>, <0 203 0>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- decon_to_mic: endpoint {
- remote-endpoint = <&mic_to_decon>;
- };
- };
- };
-};
diff --git a/Documentation/devicetree/bindings/display/exynos/exynos7-decon.txt b/Documentation/devicetree/bindings/display/exynos/exynos7-decon.txt
deleted file mode 100644
index 53912c99ec38..000000000000
--- a/Documentation/devicetree/bindings/display/exynos/exynos7-decon.txt
+++ /dev/null
@@ -1,65 +0,0 @@
-Device-Tree bindings for Samsung Exynos7 SoC display controller (DECON)
-
-DECON (Display and Enhancement Controller) is the Display Controller for the
-Exynos7 series of SoCs which transfers the image data from a video memory
-buffer to an external LCD interface.
-
-Required properties:
-- compatible: value should be "samsung,exynos7-decon";
-
-- reg: physical base address and length of the DECON registers set.
-
-- interrupts: should contain a list of all DECON IP block interrupts in the
- order: FIFO Level, VSYNC, LCD_SYSTEM. The interrupt specifier
- format depends on the interrupt controller used.
-
-- interrupt-names: should contain the interrupt names: "fifo", "vsync",
- "lcd_sys", in the same order as they were listed in the interrupts
- property.
-
-- pinctrl-0: pin control group to be used for this controller.
-
-- pinctrl-names: must contain a "default" entry.
-
-- clocks: must include clock specifiers corresponding to entries in the
- clock-names property.
-
-- clock-names: list of clock names sorted in the same order as the clocks
- property. Must contain "pclk_decon0", "aclk_decon0",
- "decon0_eclk", "decon0_vclk".
-- i80-if-timings: timing configuration for lcd i80 interface support.
-
-Optional Properties:
-- power-domains: a phandle to DECON power domain node.
-- display-timings: timing settings for DECON, as described in document [1].
- Can be used in case timings cannot be provided otherwise
- or to override timings provided by the panel.
-
-[1]: Documentation/devicetree/bindings/display/panel/display-timing.txt
-
-Example:
-
-SoC specific DT entry:
-
- decon@13930000 {
- compatible = "samsung,exynos7-decon";
- interrupt-parent = <&combiner>;
- reg = <0x13930000 0x1000>;
- interrupt-names = "lcd_sys", "vsync", "fifo";
- interrupts = <0 188 0>, <0 189 0>, <0 190 0>;
- clocks = <&clock_disp PCLK_DECON_INT>,
- <&clock_disp ACLK_DECON_INT>,
- <&clock_disp SCLK_DECON_INT_ECLK>,
- <&clock_disp SCLK_DECON_INT_EXTCLKPLL>;
- clock-names = "pclk_decon0", "aclk_decon0", "decon0_eclk",
- "decon0_vclk";
- status = "disabled";
- };
-
-Board specific DT entry:
-
- decon@13930000 {
- pinctrl-0 = <&lcd_clk &pwm1_out>;
- pinctrl-names = "default";
- status = "okay";
- };
diff --git a/Documentation/devicetree/bindings/display/exynos/exynos_dp.txt b/Documentation/devicetree/bindings/display/exynos/exynos_dp.txt
index 9b6cba3f82af..3a401590320f 100644
--- a/Documentation/devicetree/bindings/display/exynos/exynos_dp.txt
+++ b/Documentation/devicetree/bindings/display/exynos/exynos_dp.txt
@@ -50,7 +50,7 @@ Optional properties for dp-controller:
Documentation/devicetree/bindings/display/panel/display-timing.txt
For the below properties, please refer to Analogix DP binding document:
- * Documentation/devicetree/bindings/display/bridge/analogix_dp.txt
+ * Documentation/devicetree/bindings/display/bridge/analogix,dp.yaml
-phys (required)
-phy-names (required)
-hpd-gpios (optional)
diff --git a/Documentation/devicetree/bindings/display/exynos/exynos_dsim.txt b/Documentation/devicetree/bindings/display/exynos/exynos_dsim.txt
deleted file mode 100644
index be377786e8cd..000000000000
--- a/Documentation/devicetree/bindings/display/exynos/exynos_dsim.txt
+++ /dev/null
@@ -1,90 +0,0 @@
-Exynos MIPI DSI Master
-
-Required properties:
- - compatible: value should be one of the following
- "samsung,exynos3250-mipi-dsi" /* for Exynos3250/3472 SoCs */
- "samsung,exynos4210-mipi-dsi" /* for Exynos4 SoCs */
- "samsung,exynos5410-mipi-dsi" /* for Exynos5410/5420/5440 SoCs */
- "samsung,exynos5422-mipi-dsi" /* for Exynos5422/5800 SoCs */
- "samsung,exynos5433-mipi-dsi" /* for Exynos5433 SoCs */
- - reg: physical base address and length of the registers set for the device
- - interrupts: should contain DSI interrupt
- - clocks: list of clock specifiers, must contain an entry for each required
- entry in clock-names
- - clock-names: should include "bus_clk" and "sclk_mipi" entries;
- the use of "pll_clk" is deprecated
- - phys: list of phy specifiers, must contain an entry for each required
- entry in phy-names
- - phy-names: should include "dsim" entry
- - vddcore-supply: MIPI DSIM Core voltage supply (e.g. 1.1V)
- - vddio-supply: MIPI DSIM I/O and PLL voltage supply (e.g. 1.8V)
- - samsung,pll-clock-frequency: specifies frequency of the oscillator clock
- - #address-cells, #size-cells: should be set respectively to <1> and <0>
- according to DSI host bindings (see MIPI DSI bindings [1])
- - samsung,burst-clock-frequency: specifies DSI frequency in high-speed burst
- mode
- - samsung,esc-clock-frequency: specifies DSI frequency in escape mode
-
-Optional properties:
- - power-domains: a phandle to DSIM power domain node
-
-Child nodes:
- Should contain DSI peripheral nodes (see MIPI DSI bindings [1]).
-
-Video interfaces:
- Device node can contain following video interface port nodes according to [2]:
- 0: RGB input,
- 1: DSI output
-
-[1]: Documentation/devicetree/bindings/display/mipi-dsi-bus.txt
-[2]: Documentation/devicetree/bindings/media/video-interfaces.txt
-
-Example:
-
- dsi@11c80000 {
- compatible = "samsung,exynos4210-mipi-dsi";
- reg = <0x11C80000 0x10000>;
- interrupts = <0 79 0>;
- clocks = <&clock 286>, <&clock 143>;
- clock-names = "bus_clk", "sclk_mipi";
- phys = <&mipi_phy 1>;
- phy-names = "dsim";
- vddcore-supply = <&vusb_reg>;
- vddio-supply = <&vmipi_reg>;
- power-domains = <&pd_lcd0>;
- #address-cells = <1>;
- #size-cells = <0>;
- samsung,pll-clock-frequency = <24000000>;
-
- panel@0 {
- reg = <0>;
- ...
- port {
- panel_ep: endpoint {
- remote-endpoint = <&dsi_ep>;
- };
- };
- };
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- dsi_to_mic: endpoint {
- remote-endpoint = <&mic_to_dsi>;
- };
- };
-
- port@1 {
- reg = <1>;
- dsi_ep: endpoint {
- reg = <0>;
- samsung,burst-clock-frequency = <500000000>;
- samsung,esc-clock-frequency = <20000000>;
- remote-endpoint = <&panel_ep>;
- };
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/exynos/exynos_hdmi.txt b/Documentation/devicetree/bindings/display/exynos/exynos_hdmi.txt
deleted file mode 100644
index 58b12e25bbb1..000000000000
--- a/Documentation/devicetree/bindings/display/exynos/exynos_hdmi.txt
+++ /dev/null
@@ -1,64 +0,0 @@
-Device-Tree bindings for drm hdmi driver
-
-Required properties:
-- compatible: value should be one among the following:
- 1) "samsung,exynos4210-hdmi"
- 2) "samsung,exynos4212-hdmi"
- 3) "samsung,exynos5420-hdmi"
- 4) "samsung,exynos5433-hdmi"
-- reg: physical base address of the hdmi and length of memory mapped
- region.
-- interrupts: interrupt number to the cpu.
-- hpd-gpios: the following information about the hotplug gpio pin:
- a) phandle of the gpio controller node.
- b) pin number within the gpio controller.
- c) optional flags and pull up/down.
-- ddc: phandle to the hdmi ddc node
-- phy: phandle to the hdmi phy node
-- samsung,syscon-phandle: phandle for system controller node for PMU.
-- #sound-dai-cells: should be 0.
-
-Required properties for Exynos 4210, 4212, 5420 and 5433:
-- clocks: list of clock IDs from SoC clock driver.
- a) hdmi: Gate of HDMI IP bus clock.
- b) sclk_hdmi: Gate of HDMI special clock.
- c) sclk_pixel: Pixel special clock, one of the two possible inputs of
- HDMI clock mux.
- d) sclk_hdmiphy: HDMI PHY clock output, one of two possible inputs of
- HDMI clock mux.
- e) mout_hdmi: Mux required by the driver to switch between its two
- parents, i.e. sclk_pixel and sclk_hdmiphy. If the HDMI PHY is stable
- after configuration, the parent is set to sclk_hdmiphy, otherwise to
- sclk_pixel.
-- clock-names: aliases as per driver requirements for above clock IDs:
- "hdmi", "sclk_hdmi", "sclk_pixel", "sclk_hdmiphy" and "mout_hdmi".
-
-Required properties for Exynos 5433:
-- clocks: list of clock specifiers according to common clock bindings.
- a) hdmi_pclk: Gate of HDMI IP APB bus.
- b) hdmi_i_pclk: Gate of HDMI-PHY IP APB bus.
- c) i_tmds_clk: Gate of HDMI TMDS clock.
- d) i_pixel_clk: Gate of HDMI pixel clock.
- e) i_spdif_clk: Gate of HDMI SPDIF clock.
- f) oscclk: Oscillator clock, used as parent of the following *_user clocks
- in case the HDMI-PHY is not operational.
- g) tmds_clko: TMDS clock generated by the HDMI-PHY.
- h) tmds_clko_user: MUX used to switch between oscclk and tmds_clko,
- respectively when the HDMI-PHY is off and operational.
- i) pixel_clko: Pixel clock generated by the HDMI-PHY.
- j) pixel_clko_user: MUX used to switch between oscclk and pixel_clko,
- respectively when the HDMI-PHY is off and operational.
-- clock-names: aliases for the above clock specifiers.
-- samsung,sysreg: handle to syscon used to control the system registers.
-
-Example:
-
- hdmi {
- compatible = "samsung,exynos4212-hdmi";
- reg = <0x14530000 0x100000>;
- interrupts = <0 95 0>;
- hpd-gpios = <&gpx3 7 1>;
- ddc = <&hdmi_ddc_node>;
- phy = <&hdmi_phy_node>;
- samsung,syscon-phandle = <&pmu_system_controller>;
- };
diff --git a/Documentation/devicetree/bindings/display/exynos/exynos_hdmiddc.txt b/Documentation/devicetree/bindings/display/exynos/exynos_hdmiddc.txt
deleted file mode 100644
index 41eee971562b..000000000000
--- a/Documentation/devicetree/bindings/display/exynos/exynos_hdmiddc.txt
+++ /dev/null
@@ -1,15 +0,0 @@
-Device-Tree bindings for hdmiddc driver
-
-Required properties:
-- compatible: value should be one of the following
- 1) "samsung,exynos5-hdmiddc" <DEPRECATED>
- 2) "samsung,exynos4210-hdmiddc"
-
-- reg: I2C address of the hdmiddc device.
-
-Example:
-
- hdmiddc {
- compatible = "samsung,exynos4210-hdmiddc";
- reg = <0x50>;
- };
diff --git a/Documentation/devicetree/bindings/display/exynos/exynos_hdmiphy.txt b/Documentation/devicetree/bindings/display/exynos/exynos_hdmiphy.txt
deleted file mode 100644
index 162f641f7639..000000000000
--- a/Documentation/devicetree/bindings/display/exynos/exynos_hdmiphy.txt
+++ /dev/null
@@ -1,15 +0,0 @@
-Device-Tree bindings for hdmiphy driver
-
-Required properties:
-- compatible: value should be one of the following:
- 1) "samsung,exynos5-hdmiphy" <DEPRECATED>
- 2) "samsung,exynos4210-hdmiphy".
- 3) "samsung,exynos4212-hdmiphy".
-- reg: I2C address of the hdmiphy device.
-
-Example:
-
- hdmiphy {
- compatible = "samsung,exynos4210-hdmiphy";
- reg = <0x38>;
- };
diff --git a/Documentation/devicetree/bindings/display/exynos/exynos_mixer.txt b/Documentation/devicetree/bindings/display/exynos/exynos_mixer.txt
deleted file mode 100644
index 3e38128f866b..000000000000
--- a/Documentation/devicetree/bindings/display/exynos/exynos_mixer.txt
+++ /dev/null
@@ -1,26 +0,0 @@
-Device-Tree bindings for mixer driver
-
-Required properties:
-- compatible: value should be one of the following:
- 1) "samsung,exynos5-mixer" <DEPRECATED>
- 2) "samsung,exynos4210-mixer"
- 3) "samsung,exynos4212-mixer"
- 4) "samsung,exynos5250-mixer"
- 5) "samsung,exynos5420-mixer"
-
-- reg: physical base address of the mixer and length of memory mapped
- region.
-- interrupts: interrupt number to the cpu.
-- clocks: list of clock IDs from SoC clock driver.
- a) mixer: Gate of Mixer IP bus clock.
- b) sclk_hdmi: HDMI Special clock, one of the two possible inputs of
- mixer mux.
- c) hdmi: Gate of HDMI IP bus clock, needed together with sclk_hdmi.
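-
-The example below omits the clock hookup; a hedged sketch, with placeholder
-clock indices in the order listed above, would be:
-
-mixer {
-	/* a) mixer, b) sclk_hdmi, c) hdmi - indices are illustrative */
-	clocks = <&clock 343>, <&clock 344>, <&clock 345>;
-};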
-
-Example:
-
- mixer {
- compatible = "samsung,exynos5250-mixer";
- reg = <0x14450000 0x10000>;
- interrupts = <0 94 0>;
- };
diff --git a/Documentation/devicetree/bindings/display/exynos/samsung-fimd.txt b/Documentation/devicetree/bindings/display/exynos/samsung-fimd.txt
deleted file mode 100644
index b3096421d42b..000000000000
--- a/Documentation/devicetree/bindings/display/exynos/samsung-fimd.txt
+++ /dev/null
@@ -1,107 +0,0 @@
-Device-Tree bindings for Samsung SoC display controller (FIMD)
-
-FIMD (Fully Interactive Mobile Display) is the Display Controller for the
-Samsung series of SoCs which transfers the image data from a video memory
-buffer to an external LCD interface.
-
-Required properties:
-- compatible: value should be one of the following
- "samsung,s3c2443-fimd"; /* for S3C24XX SoCs */
- "samsung,s3c6400-fimd"; /* for S3C64XX SoCs */
- "samsung,s5pv210-fimd"; /* for S5PV210 SoC */
- "samsung,exynos3250-fimd"; /* for Exynos3250/3472 SoCs */
- "samsung,exynos4210-fimd"; /* for Exynos4 SoCs */
- "samsung,exynos5250-fimd"; /* for Exynos5250 SoCs */
- "samsung,exynos5420-fimd"; /* for Exynos5420/5422/5800 SoCs */
-
-- reg: physical base address and length of the FIMD registers set.
-
-- interrupts: should contain a list of all FIMD IP block interrupts in the
- order: FIFO Level, VSYNC, LCD_SYSTEM. The interrupt specifier
- format depends on the interrupt controller used.
-
-- interrupt-names: should contain the interrupt names: "fifo", "vsync",
- "lcd_sys", in the same order as they were listed in the interrupts
- property.
-
-- pinctrl-0: pin control group to be used for this controller.
-
-- pinctrl-names: must contain a "default" entry.
-
-- clocks: must include clock specifiers corresponding to entries in the
- clock-names property.
-
-- clock-names: list of clock names sorted in the same order as the clocks
- property. Must contain "sclk_fimd" and "fimd".
-
-Optional Properties:
-- power-domains: a phandle to FIMD power domain node.
-- samsung,invert-vden: video enable signal is inverted
-- samsung,invert-vclk: video clock signal is inverted
-- display-timings: timing settings for FIMD, as described in document [1].
- Can be used in case timings cannot be provided otherwise
- or to override timings provided by the panel.
-- samsung,sysreg: handle to syscon used to control the system registers
-- i80-if-timings: timing configuration for lcd i80 interface support.
- - cs-setup: clock cycles for the active period of address signal is enabled
- until chip select is enabled.
- If not specified, the default value(0) will be used.
- - wr-setup: clock cycles for the active period of CS signal is enabled until
- write signal is enabled.
- If not specified, the default value(0) will be used.
- - wr-active: clock cycles for the active period of CS is enabled.
- If not specified, the default value(1) will be used.
- - wr-hold: clock cycles for the active period of CS is disabled until write
- signal is disabled.
- If not specified, the default value(0) will be used.
-
- The parameters are defined as:
-
- VCLK(internal)  __|¯¯¯¯¯¯|_____|¯¯¯¯¯¯|_____|¯¯¯¯¯¯|_____|¯¯¯¯¯¯|_____|¯¯
-                   :            :            :            :            :
- Address Output  --:<XXXXXXXXXXX:XXXXXXXXXXXX:XXXXXXXXXXXX:XXXXXXXXXXXX:XX
-                   | cs-setup+1 |            :            :            :
-                   |<---------->|            :            :            :
- Chip Select     ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|____________:____________:____________|¯¯
-                                | wr-setup+1 |            | wr-hold+1  |
-                                |<---------->|            |<---------->|
- Write Enable    ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|____________|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
-                                             | wr-active+1|
-                                             |<---------->|
- Video Data      ----------------------------<XXXXXXXXXXXXXXXXXXXXXXXXX>--
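-
-As a hedged sketch, an i80 interface configuration could then look as follows
-(the cycle counts are hypothetical and must be taken from the panel
-datasheet):
-
-	fimd@11c00000 {
-		i80-if-timings {
-			cs-setup = <0>;
-			wr-setup = <0>;
-			wr-active = <1>;
-			wr-hold = <0>;
-		};
-	};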
-
-The device node can contain 'port' child nodes according to the bindings defined
-in [2]. The following are properties specific to those nodes:
-- reg: (required) port index, can be:
- 0 - for CAMIF0 input,
- 1 - for CAMIF1 input,
- 2 - for CAMIF2 input,
- 3 - for parallel output,
- 4 - for write-back interface
-
-[1]: Documentation/devicetree/bindings/display/panel/display-timing.txt
-[2]: Documentation/devicetree/bindings/media/video-interfaces.txt
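-
-A hypothetical fragment for the parallel output port (port index 3 as listed
-above; the endpoint labels are illustrative):
-
-	fimd@11c00000 {
-		#address-cells = <1>;
-		#size-cells = <0>;
-
-		port@3 {
-			reg = <3>;
-			fimd_output: endpoint {
-				remote-endpoint = <&panel_input>;
-			};
-		};
-	};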
-
-Example:
-
-SoC specific DT entry:
-
- fimd@11c00000 {
- compatible = "samsung,exynos4210-fimd";
- interrupt-parent = <&combiner>;
- reg = <0x11c00000 0x20000>;
- interrupt-names = "fifo", "vsync", "lcd_sys";
- interrupts = <11 0>, <11 1>, <11 2>;
- clocks = <&clock 140>, <&clock 283>;
- clock-names = "sclk_fimd", "fimd";
- power-domains = <&pd_lcd0>;
- status = "disabled";
- };
-
-Board specific DT entry:
-
- fimd@11c00000 {
- pinctrl-0 = <&lcd_clk &lcd_data24 &pwm1_out>;
- pinctrl-names = "default";
- status = "okay";
- };
diff --git a/Documentation/devicetree/bindings/display/fsl,lcdif.yaml b/Documentation/devicetree/bindings/display/fsl,lcdif.yaml
index 900a56cae80e..75b4efd70ba8 100644
--- a/Documentation/devicetree/bindings/display/fsl,lcdif.yaml
+++ b/Documentation/devicetree/bindings/display/fsl,lcdif.yaml
@@ -20,6 +20,7 @@ properties:
- fsl,imx23-lcdif
- fsl,imx28-lcdif
- fsl,imx6sx-lcdif
+ - fsl,imx8mp-lcdif
- items:
- enum:
- fsl,imx6sl-lcdif
@@ -51,6 +52,9 @@ properties:
interrupts:
maxItems: 1
+ power-domains:
+ maxItems: 1
+
port:
$ref: /schemas/graph.yaml#/properties/port
description: The LCDIF output port
@@ -80,12 +84,48 @@ allOf:
maxItems: 3
required:
- clock-names
- else:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: fsl,imx8mp-lcdif
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ clock-names:
+ minItems: 3
+ maxItems: 3
+ required:
+ - clock-names
+ - if:
+ not:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - fsl,imx6sx-lcdif
+ - fsl,imx8mp-lcdif
+ then:
properties:
clocks:
maxItems: 1
clock-names:
maxItems: 1
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - fsl,imx6sl-lcdif
+ - fsl,imx6sx-lcdif
+ - fsl,imx8mm-lcdif
+ - fsl,imx8mn-lcdif
+ - fsl,imx8mp-lcdif
+ then:
+ required:
+ - power-domains
examples:
- |
@@ -100,6 +140,7 @@ examples:
<&clks IMX6SX_CLK_LCDIF_APB>,
<&clks IMX6SX_CLK_DISPLAY_AXI>;
clock-names = "pix", "axi", "disp_axi";
+ power-domains = <&pd_disp>;
port {
endpoint {
diff --git a/Documentation/devicetree/bindings/display/ilitek,ili9341.txt b/Documentation/devicetree/bindings/display/ilitek,ili9341.txt
deleted file mode 100644
index 169b32e4ee4e..000000000000
--- a/Documentation/devicetree/bindings/display/ilitek,ili9341.txt
+++ /dev/null
@@ -1,27 +0,0 @@
-Ilitek ILI9341 display panels
-
-This binding is for display panels using an Ilitek ILI9341 controller in SPI
-mode.
-
-Required properties:
-- compatible: "adafruit,yx240qv29", "ilitek,ili9341"
-- dc-gpios: D/C pin
-- reset-gpios: Reset pin
-
-The node for this driver must be a child node of a SPI controller, hence
-all mandatory properties described in ../spi/spi-bus.txt must be specified.
-
-Optional properties:
-- rotation: panel rotation in degrees counter-clockwise (0, 90, 180, 270)
-- backlight: phandle of the backlight device attached to the panel
-
-Example:
- display@0{
- compatible = "adafruit,yx240qv29", "ilitek,ili9341";
- reg = <0>;
- spi-max-frequency = <32000000>;
- dc-gpios = <&gpio0 9 GPIO_ACTIVE_HIGH>;
- reset-gpios = <&gpio0 8 GPIO_ACTIVE_HIGH>;
- rotation = <270>;
- backlight = <&backlight>;
- };
diff --git a/Documentation/devicetree/bindings/display/ilitek,ili9486.yaml b/Documentation/devicetree/bindings/display/ilitek,ili9486.yaml
index aecff34f505d..1f8f2182e2f1 100644
--- a/Documentation/devicetree/bindings/display/ilitek,ili9486.yaml
+++ b/Documentation/devicetree/bindings/display/ilitek,ili9486.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/ilitek,ili9486.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Ilitek ILI9486 display panels device tree bindings
+title: Ilitek ILI9486 display panels
maintainers:
- Kamlesh Gurudasani <kamlesh.gurudasani@gmail.com>
diff --git a/Documentation/devicetree/bindings/display/imx/fsl,imx-fb.txt b/Documentation/devicetree/bindings/display/imx/fsl,imx-fb.txt
deleted file mode 100644
index f4df9e83bcd2..000000000000
--- a/Documentation/devicetree/bindings/display/imx/fsl,imx-fb.txt
+++ /dev/null
@@ -1,57 +0,0 @@
-Freescale imx21 Framebuffer
-
-This framebuffer driver supports devices imx1, imx21, imx25, and imx27.
-
-Required properties:
-- compatible : "fsl,<chip>-fb", chip should be imx1 or imx21
-- reg : Should contain 1 register range (address and length)
-- interrupts : One interrupt of the fb dev
-
-Required nodes:
-- display: Phandle to a display node as described in
- Documentation/devicetree/bindings/display/panel/display-timing.txt
- Additionally, the display node has to define the following properties:
- - bits-per-pixel: Bits per pixel
- - fsl,pcr: LCDC PCR value
- A display node may optionally define
- - fsl,aus-mode: boolean to enable AUS mode (only for imx21)
-
-Optional properties:
-- lcd-supply: Regulator for LCD supply voltage.
-- fsl,dmacr: DMA Control Register value. This is optional. By default, the
- register is not modified as recommended by the datasheet.
-- fsl,lpccr: Contrast Control Register value. This property provides the
- default value for the contrast control register.
- If that property is omitted, the register is zeroed.
-- fsl,lscr1: LCDC Sharp Configuration Register value.
-
-Example:
-
- imxfb: fb@10021000 {
- compatible = "fsl,imx21-fb";
- interrupts = <61>;
- reg = <0x10021000 0x1000>;
- display = <&display0>;
- };
-
- ...
-
- display0: display0 {
- model = "Primeview-PD050VL1";
- bits-per-pixel = <16>;
- fsl,pcr = <0xf0c88080>; /* non-standard but required */
- display-timings {
- native-mode = <&timing_disp0>;
- timing_disp0: 640x480 {
- hactive = <640>;
- vactive = <480>;
- hback-porch = <112>;
- hfront-porch = <36>;
- hsync-len = <32>;
- vback-porch = <33>;
- vfront-porch = <33>;
- vsync-len = <2>;
- clock-frequency = <25000000>;
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/imx/fsl,imx-lcdc.yaml b/Documentation/devicetree/bindings/display/imx/fsl,imx-lcdc.yaml
new file mode 100644
index 000000000000..c2b29622bceb
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/imx/fsl,imx-lcdc.yaml
@@ -0,0 +1,146 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/imx/fsl,imx-lcdc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale i.MX LCD Controller, found on i.MX1, i.MX21, i.MX25 and i.MX27
+
+maintainers:
+ - Sascha Hauer <s.hauer@pengutronix.de>
+ - Pengutronix Kernel Team <kernel@pengutronix.de>
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - fsl,imx1-fb
+ - fsl,imx21-fb
+ - items:
+ - enum:
+ - fsl,imx25-fb
+ - fsl,imx27-fb
+ - const: fsl,imx21-fb
+ - items:
+ - const: fsl,imx25-lcdc
+ - const: fsl,imx21-lcdc
+
+ clocks:
+ maxItems: 3
+
+ clock-names:
+ items:
+ - const: ipg
+ - const: ahb
+ - const: per
+
+ port:
+ $ref: /schemas/graph.yaml#/properties/port
+
+ display:
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ interrupts:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+ lcd-supply:
+ description:
+ Regulator for LCD supply voltage.
+
+ fsl,dmacr:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Override value for DMA Control Register
+
+ fsl,lpccr:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Contrast Control Register value.
+
+ fsl,lscr1:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ LCDC Sharp Configuration Register value.
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - fsl,imx1-lcdc
+ - fsl,imx21-lcdc
+ then:
+ properties:
+ display: false
+ fsl,dmacr: false
+ fsl,lpccr: false
+ fsl,lscr1: false
+
+ required:
+ - port
+
+ else:
+ properties:
+ port: false
+
+ required:
+ - display
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - interrupts
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ lcdc@53fbc000 {
+ compatible = "fsl,imx25-lcdc", "fsl,imx21-lcdc";
+ reg = <0x53fbc000 0x4000>;
+ interrupts = <39>;
+ clocks = <&clks 103>, <&clks 66>, <&clks 49>;
+ clock-names = "ipg", "ahb", "per";
+
+ port {
+ parallel_out: endpoint {
+ remote-endpoint = <&panel_in>;
+ };
+ };
+ };
+ - |
+ imxfb: fb@10021000 {
+ compatible = "fsl,imx21-fb";
+ interrupts = <61>;
+ reg = <0x10021000 0x1000>;
+ display = <&display0>;
+ clocks = <&clks 103>, <&clks 49>, <&clks 66>;
+ clock-names = "ipg", "ahb", "per";
+ };
+
+ display0: display0 {
+ model = "Primeview-PD050VL1";
+ bits-per-pixel = <16>;
+ fsl,pcr = <0xf0c88080>; /* non-standard but required */
+
+ display-timings {
+ native-mode = <&timing_disp0>;
+ timing_disp0: timing0 {
+ hactive = <640>;
+ vactive = <480>;
+ hback-porch = <112>;
+ hfront-porch = <36>;
+ hsync-len = <32>;
+ vback-porch = <33>;
+ vfront-porch = <33>;
+ vsync-len = <2>;
+ clock-frequency = <25000000>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/imx/nxp,imx8mq-dcss.yaml b/Documentation/devicetree/bindings/display/imx/nxp,imx8mq-dcss.yaml
index 0091df9dd73b..4ae6328cde64 100644
--- a/Documentation/devicetree/bindings/display/imx/nxp,imx8mq-dcss.yaml
+++ b/Documentation/devicetree/bindings/display/imx/nxp,imx8mq-dcss.yaml
@@ -2,8 +2,8 @@
# Copyright 2019 NXP
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/display/imx/nxp,imx8mq-dcss.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/display/imx/nxp,imx8mq-dcss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: iMX8MQ Display Controller Subsystem (DCSS)
@@ -105,4 +105,3 @@ examples:
};
};
};
-
diff --git a/Documentation/devicetree/bindings/display/ingenic,ipu.yaml b/Documentation/devicetree/bindings/display/ingenic,ipu.yaml
index e679f48a3886..319bd7c88fe3 100644
--- a/Documentation/devicetree/bindings/display/ingenic,ipu.yaml
+++ b/Documentation/devicetree/bindings/display/ingenic,ipu.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/ingenic,ipu.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Ingenic SoCs Image Processing Unit (IPU) devicetree bindings
+title: Ingenic SoCs Image Processing Unit (IPU)
maintainers:
- Paul Cercueil <paul@crapouillou.net>
@@ -45,7 +45,7 @@ additionalProperties: false
examples:
- |
- #include <dt-bindings/clock/jz4770-cgu.h>
+ #include <dt-bindings/clock/ingenic,jz4770-cgu.h>
ipu@13080000 {
compatible = "ingenic,jz4770-ipu", "ingenic,jz4760-ipu";
reg = <0x13080000 0x800>;
diff --git a/Documentation/devicetree/bindings/display/ingenic,lcd.yaml b/Documentation/devicetree/bindings/display/ingenic,lcd.yaml
index 50d2b0a50e8a..6d4c00f3fcc8 100644
--- a/Documentation/devicetree/bindings/display/ingenic,lcd.yaml
+++ b/Documentation/devicetree/bindings/display/ingenic,lcd.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/ingenic,lcd.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Ingenic SoCs LCD controller devicetree bindings
+title: Ingenic SoCs LCD controller
maintainers:
- Paul Cercueil <paul@crapouillou.net>
@@ -17,6 +17,8 @@ properties:
enum:
- ingenic,jz4740-lcd
- ingenic,jz4725b-lcd
+ - ingenic,jz4760-lcd
+ - ingenic,jz4760b-lcd
- ingenic,jz4770-lcd
- ingenic,jz4780-lcd
@@ -88,7 +90,7 @@ additionalProperties: false
examples:
- |
- #include <dt-bindings/clock/jz4740-cgu.h>
+ #include <dt-bindings/clock/ingenic,jz4740-cgu.h>
lcd-controller@13050000 {
compatible = "ingenic,jz4740-lcd";
reg = <0x13050000 0x1000>;
@@ -107,7 +109,7 @@ examples:
};
- |
- #include <dt-bindings/clock/jz4725b-cgu.h>
+ #include <dt-bindings/clock/ingenic,jz4725b-cgu.h>
lcd-controller@13050000 {
compatible = "ingenic,jz4725b-lcd";
reg = <0x13050000 0x1000>;
diff --git a/Documentation/devicetree/bindings/display/intel,keembay-display.yaml b/Documentation/devicetree/bindings/display/intel,keembay-display.yaml
index bc6622b010ca..2cf54ecc707a 100644
--- a/Documentation/devicetree/bindings/display/intel,keembay-display.yaml
+++ b/Documentation/devicetree/bindings/display/intel,keembay-display.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/intel,keembay-display.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Devicetree bindings for Intel Keem Bay display controller
+title: Intel Keem Bay display controller
maintainers:
- Anitha Chrisanthus <anitha.chrisanthus@intel.com>
diff --git a/Documentation/devicetree/bindings/display/intel,keembay-msscam.yaml b/Documentation/devicetree/bindings/display/intel,keembay-msscam.yaml
index a222b52d8b8f..cc7e1f318fe4 100644
--- a/Documentation/devicetree/bindings/display/intel,keembay-msscam.yaml
+++ b/Documentation/devicetree/bindings/display/intel,keembay-msscam.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/intel,keembay-msscam.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Devicetree bindings for Intel Keem Bay MSSCAM
+title: Intel Keem Bay MSSCAM
maintainers:
- Anitha Chrisanthus <anitha.chrisanthus@intel.com>
diff --git a/Documentation/devicetree/bindings/display/panel/lvds.yaml b/Documentation/devicetree/bindings/display/lvds.yaml
index 49460c9dceea..7cd2ce7e9c33 100644
--- a/Documentation/devicetree/bindings/display/panel/lvds.yaml
+++ b/Documentation/devicetree/bindings/display/lvds.yaml
@@ -1,10 +1,10 @@
# SPDX-License-Identifier: GPL-2.0
%YAML 1.2
---
-$id: http://devicetree.org/schemas/display/panel/lvds.yaml#
+$id: http://devicetree.org/schemas/display/lvds.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: LVDS Display Panel
+title: LVDS Display Common Properties
maintainers:
- Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
@@ -13,8 +13,8 @@ maintainers:
description: |+
LVDS is a physical layer specification defined in ANSI/TIA/EIA-644-A. Multiple
incompatible data link layers have been used over time to transmit image data
- to LVDS panels. This bindings supports display panels compatible with the
- following specifications.
+ to LVDS devices. These bindings support devices compatible with the
+ following specifications.
[JEIDA] "Digital Interface Standards for Monitor", JEIDA-59-1999, February
1999 (Version 1.0), Japan Electronic Industry Development Association (JEIDA)
@@ -26,18 +26,7 @@ description: |+
Device compatible with those specifications have been marketed under the
FPD-Link and FlatLink brands.
-allOf:
- - $ref: panel-common.yaml#
-
properties:
- compatible:
- contains:
- const: panel-lvds
- description:
- Shall contain "panel-lvds" in addition to a mandatory panel-specific
- compatible string defined in individual panel bindings. The "panel-lvds"
- value shall never be used on its own.
-
data-mapping:
enum:
- jeida-18
@@ -96,22 +85,6 @@ properties:
If set, reverse the bit order described in the data mappings below on all
data lanes, transmitting bits for slots 6 to 0 instead of 0 to 6.
- port: true
- ports: true
-
-required:
- - compatible
- - data-mapping
- - width-mm
- - height-mm
- - panel-timing
-
-oneOf:
- - required:
- - port
- - required:
- - ports
-
additionalProperties: true
...
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,aal.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,aal.yaml
new file mode 100644
index 000000000000..92741486c24d
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,aal.yaml
@@ -0,0 +1,90 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,aal.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display adaptive ambient light processor
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display adaptive ambient light processor, namely AAL,
+ is responsible for backlight power saving and for improving visibility in
+ sunlight. The AAL device node must be a sibling of the central MMSYS_CONFIG
+ node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt8173-disp-aal
+ - mediatek,mt8183-disp-aal
+ - items:
+ - enum:
+ - mediatek,mt2712-disp-aal
+ - const: mediatek,mt8173-disp-aal
+ - items:
+ - enum:
+ - mediatek,mt8186-disp-aal
+ - mediatek,mt8188-disp-aal
+ - mediatek,mt8192-disp-aal
+ - mediatek,mt8195-disp-aal
+ - const: mediatek,mt8183-disp-aal
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: AAL Clock
+
+ mediatek,gce-client-reg:
+ description: The register of client driver can be configured by gce with
+ 4 arguments defined in this property, such as phandle of gce, subsys id,
+ register offset and size. Each GCE subsys id is mapping to a client
+ defined in the header include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8173-clk.h>
+ #include <dt-bindings/power/mt8173-power.h>
+ #include <dt-bindings/gce/mt8173-gce.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ aal@14015000 {
+ compatible = "mediatek,mt8173-disp-aal";
+ reg = <0 0x14015000 0 0x1000>;
+ interrupts = <GIC_SPI 189 IRQ_TYPE_LEVEL_LOW>;
+ power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
+ clocks = <&mmsys CLK_MM_DISP_AAL>;
+ mediatek,gce-client-reg = <&gce SUBSYS_1401XXXX 0x5000 0x1000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,ccorr.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,ccorr.yaml
new file mode 100644
index 000000000000..8c2a737237f2
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,ccorr.yaml
@@ -0,0 +1,85 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,ccorr.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display color correction
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display color correction, namely CCORR, reproduces correct color
+ on panels with different color gamut.
+ The CCORR device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt8183-disp-ccorr
+ - mediatek,mt8192-disp-ccorr
+ - items:
+ - enum:
+ - mediatek,mt8186-disp-ccorr
+ - mediatek,mt8188-disp-ccorr
+ - mediatek,mt8195-disp-ccorr
+ - const: mediatek,mt8192-disp-ccorr
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: CCORR Clock
+
+ mediatek,gce-client-reg:
+ description: The register of the client driver can be configured by GCE
+ with 4 arguments defined in this property, i.e. the GCE phandle, subsys
+ id, register offset and size. Each GCE subsys id maps to a client
+ defined in the header include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8183-clk.h>
+ #include <dt-bindings/power/mt8183-power.h>
+ #include <dt-bindings/gce/mt8183-gce.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ ccorr0: ccorr@1400f000 {
+ compatible = "mediatek,mt8183-disp-ccorr";
+ reg = <0 0x1400f000 0 0x1000>;
+ interrupts = <GIC_SPI 232 IRQ_TYPE_LEVEL_LOW>;
+ power-domains = <&spm MT8183_POWER_DOMAIN_DISP>;
+ clocks = <&mmsys CLK_MM_DISP_CCORR0>;
+ mediatek,gce-client-reg = <&gce SUBSYS_1400XXXX 0xf000 0x1000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,cec.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,cec.yaml
index 66288b9f0aa6..080cf321209e 100644
--- a/Documentation/devicetree/bindings/display/mediatek/mediatek,cec.yaml
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,cec.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/mediatek/mediatek,cec.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Mediatek HDMI CEC Controller Device Tree Bindings
+title: Mediatek HDMI CEC Controller
maintainers:
- CK Hu <ck.hu@mediatek.com>
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,color.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,color.yaml
new file mode 100644
index 000000000000..d0ea77fc4b06
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,color.yaml
@@ -0,0 +1,93 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,color.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display color processor
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display color processor, namely COLOR, provides hue, luma and
+ saturation adjustments to get better picture quality and to have one panel
+ resemble the other in their output characteristics.
+ The COLOR device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt2701-disp-color
+ - mediatek,mt8167-disp-color
+ - mediatek,mt8173-disp-color
+ - items:
+ - enum:
+ - mediatek,mt7623-disp-color
+ - mediatek,mt2712-disp-color
+ - const: mediatek,mt2701-disp-color
+ - items:
+ - enum:
+ - mediatek,mt8183-disp-color
+ - mediatek,mt8186-disp-color
+ - mediatek,mt8188-disp-color
+ - mediatek,mt8192-disp-color
+ - mediatek,mt8195-disp-color
+ - const: mediatek,mt8173-disp-color
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: COLOR Clock
+
+ mediatek,gce-client-reg:
+ description: The register of the client driver can be configured by GCE
+ with 4 arguments defined in this property, i.e. the GCE phandle, subsys
+ id, register offset and size. Each GCE subsys id maps to a client
+ defined in the header include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8173-clk.h>
+ #include <dt-bindings/power/mt8173-power.h>
+ #include <dt-bindings/gce/mt8173-gce.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ color0: color@14013000 {
+ compatible = "mediatek,mt8173-disp-color";
+ reg = <0 0x14013000 0 0x1000>;
+ interrupts = <GIC_SPI 187 IRQ_TYPE_LEVEL_LOW>;
+ power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
+ clocks = <&mmsys CLK_MM_DISP_COLOR0>;
+ mediatek,gce-client-reg = <&gce SUBSYS_1401XXXX 0x3000 0x1000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,disp.txt b/Documentation/devicetree/bindings/display/mediatek/mediatek,disp.txt
deleted file mode 100644
index 78044c340e20..000000000000
--- a/Documentation/devicetree/bindings/display/mediatek/mediatek,disp.txt
+++ /dev/null
@@ -1,219 +0,0 @@
-Mediatek display subsystem
-==========================
-
-The Mediatek display subsystem consists of various DISP function blocks in the
-MMSYS register space. The connections between them can be configured by output
-and input selectors in the MMSYS_CONFIG register space. Pixel clock and start
-of frame signal are distributed to the other function blocks by a DISP_MUTEX
-function block.
-
-All DISP device tree nodes must be siblings to the central MMSYS_CONFIG node.
-For a description of the MMSYS_CONFIG binding, see
-Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml.
-
-DISP function blocks
-====================
-
-A display stream starts at a source function block that reads pixel data from
-memory and ends with a sink function block that drives pixels on a display
-interface, or writes pixels back to memory. All DISP function blocks have
-their own register space, interrupt, and clock gate. The blocks that can
-access memory additionally have to list the IOMMU and local arbiter they are
-connected to.
-
-For a description of the display interface sink function blocks, see
-Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.txt and
-Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml.
-
-Required properties (all function blocks):
-- compatible: "mediatek,<chip>-disp-<function>", one of
- "mediatek,<chip>-disp-ovl" - overlay (4 layers, blending, csc)
- "mediatek,<chip>-disp-ovl-2l" - overlay (2 layers, blending, csc)
- "mediatek,<chip>-disp-rdma" - read DMA / line buffer
- "mediatek,<chip>-disp-wdma" - write DMA
- "mediatek,<chip>-disp-ccorr" - color correction
- "mediatek,<chip>-disp-color" - color processor
- "mediatek,<chip>-disp-dither" - dither
- "mediatek,<chip>-disp-aal" - adaptive ambient light controller
- "mediatek,<chip>-disp-gamma" - gamma correction
- "mediatek,<chip>-disp-merge" - merge streams from two RDMA sources
- "mediatek,<chip>-disp-postmask" - control round corner for display frame
- "mediatek,<chip>-disp-split" - split stream to two encoders
- "mediatek,<chip>-disp-ufoe" - data compression engine
- "mediatek,<chip>-dsi" - DSI controller, see mediatek,dsi.txt
- "mediatek,<chip>-dpi" - DPI controller, see mediatek,dpi.txt
- "mediatek,<chip>-disp-mutex" - display mutex
- "mediatek,<chip>-disp-od" - overdrive
- the supported chips are mt2701, mt7623, mt2712, mt8167, mt8173, mt8183 and mt8192.
-- reg: Physical base address and length of the function block register space
-- interrupts: The interrupt signal from the function block (required, except for
- merge and split function blocks).
-- clocks: device clocks
- See Documentation/devicetree/bindings/clock/clock-bindings.txt for details.
- For most function blocks this is just a single clock input. Only the DSI and
- DPI controller nodes have multiple clock inputs. These are documented in
- mediatek,dsi.txt and mediatek,dpi.txt, respectively.
- An exception is that the mt8183 mutex is always free running with no clocks property.
-
-Required properties (DMA function blocks):
-- compatible: Should be one of
- "mediatek,<chip>-disp-ovl"
- "mediatek,<chip>-disp-rdma"
- "mediatek,<chip>-disp-wdma"
- the supported chips are mt2701, mt8167 and mt8173.
-- larb: Should contain a phandle pointing to the local arbiter device as defined
- in Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml
-- iommus: Should point to the respective IOMMU block with master port as
- argument, see Documentation/devicetree/bindings/iommu/mediatek,iommu.yaml
- for details.
-
-Optional properties (RDMA function blocks):
-- mediatek,rdma-fifo-size: the RDMA FIFO size may differ even within the same
- SoC; add this property to the corresponding RDMA node. The value is the
- maximum size defined in the hardware data sheet, e.g.:
- mediatek,rdma-fifo-size of mt8173-rdma0 is 8K
- mediatek,rdma-fifo-size of mt8183-rdma0 is 5K
- mediatek,rdma-fifo-size of mt8183-rdma1 is 2K
-
-Examples:
-
-mmsys: clock-controller@14000000 {
- compatible = "mediatek,mt8173-mmsys", "syscon";
- reg = <0 0x14000000 0 0x1000>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- #clock-cells = <1>;
-};
-
-ovl0: ovl@1400c000 {
- compatible = "mediatek,mt8173-disp-ovl";
- reg = <0 0x1400c000 0 0x1000>;
- interrupts = <GIC_SPI 180 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_OVL0>;
- iommus = <&iommu M4U_PORT_DISP_OVL0>;
- mediatek,larb = <&larb0>;
-};
-
-ovl1: ovl@1400d000 {
- compatible = "mediatek,mt8173-disp-ovl";
- reg = <0 0x1400d000 0 0x1000>;
- interrupts = <GIC_SPI 181 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_OVL1>;
- iommus = <&iommu M4U_PORT_DISP_OVL1>;
- mediatek,larb = <&larb4>;
-};
-
-rdma0: rdma@1400e000 {
- compatible = "mediatek,mt8173-disp-rdma";
- reg = <0 0x1400e000 0 0x1000>;
- interrupts = <GIC_SPI 182 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_RDMA0>;
- iommus = <&iommu M4U_PORT_DISP_RDMA0>;
- mediatek,larb = <&larb0>;
- mediatek,rdma-fifo-size = <8192>;
-};
-
-rdma1: rdma@1400f000 {
- compatible = "mediatek,mt8173-disp-rdma";
- reg = <0 0x1400f000 0 0x1000>;
- interrupts = <GIC_SPI 183 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_RDMA1>;
- iommus = <&iommu M4U_PORT_DISP_RDMA1>;
- mediatek,larb = <&larb4>;
-};
-
-rdma2: rdma@14010000 {
- compatible = "mediatek,mt8173-disp-rdma";
- reg = <0 0x14010000 0 0x1000>;
- interrupts = <GIC_SPI 184 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_RDMA2>;
- iommus = <&iommu M4U_PORT_DISP_RDMA2>;
- mediatek,larb = <&larb4>;
-};
-
-wdma0: wdma@14011000 {
- compatible = "mediatek,mt8173-disp-wdma";
- reg = <0 0x14011000 0 0x1000>;
- interrupts = <GIC_SPI 185 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_WDMA0>;
- iommus = <&iommu M4U_PORT_DISP_WDMA0>;
- mediatek,larb = <&larb0>;
-};
-
-wdma1: wdma@14012000 {
- compatible = "mediatek,mt8173-disp-wdma";
- reg = <0 0x14012000 0 0x1000>;
- interrupts = <GIC_SPI 186 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_WDMA1>;
- iommus = <&iommu M4U_PORT_DISP_WDMA1>;
- mediatek,larb = <&larb4>;
-};
-
-color0: color@14013000 {
- compatible = "mediatek,mt8173-disp-color";
- reg = <0 0x14013000 0 0x1000>;
- interrupts = <GIC_SPI 187 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_COLOR0>;
-};
-
-color1: color@14014000 {
- compatible = "mediatek,mt8173-disp-color";
- reg = <0 0x14014000 0 0x1000>;
- interrupts = <GIC_SPI 188 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_COLOR1>;
-};
-
-aal@14015000 {
- compatible = "mediatek,mt8173-disp-aal";
- reg = <0 0x14015000 0 0x1000>;
- interrupts = <GIC_SPI 189 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_AAL>;
-};
-
-gamma@14016000 {
- compatible = "mediatek,mt8173-disp-gamma";
- reg = <0 0x14016000 0 0x1000>;
- interrupts = <GIC_SPI 190 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_GAMMA>;
-};
-
-ufoe@1401a000 {
- compatible = "mediatek,mt8173-disp-ufoe";
- reg = <0 0x1401a000 0 0x1000>;
- interrupts = <GIC_SPI 191 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_UFOE>;
-};
-
-dsi0: dsi@1401b000 {
- /* See mediatek,dsi.txt for details */
-};
-
-dpi0: dpi@1401d000 {
- /* See mediatek,dpi.txt for details */
-};
-
-mutex: mutex@14020000 {
- compatible = "mediatek,mt8173-disp-mutex";
- reg = <0 0x14020000 0 0x1000>;
- interrupts = <GIC_SPI 169 IRQ_TYPE_LEVEL_LOW>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_MUTEX_32K>;
-};
-
-od@14023000 {
- compatible = "mediatek,mt8173-disp-od";
- reg = <0 0x14023000 0 0x1000>;
- power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
- clocks = <&mmsys CLK_MM_DISP_OD>;
-};
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,dither.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,dither.yaml
new file mode 100644
index 000000000000..1588b3f7cec7
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,dither.yaml
@@ -0,0 +1,86 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,dither.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display dither processor
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display dither processor, namely DITHER, works by approximating
+ unavailable colors with available colors and by mixing and matching available
+ colors to mimic unavailable ones.
+ The DITHER device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt8183-disp-dither
+ - items:
+ - enum:
+ - mediatek,mt8186-disp-dither
+ - mediatek,mt8188-disp-dither
+ - mediatek,mt8192-disp-dither
+ - mediatek,mt8195-disp-dither
+ - const: mediatek,mt8183-disp-dither
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: DITHER Clock
+
+ mediatek,gce-client-reg:
+ description: The register of the client driver can be configured by GCE
+ with 4 arguments defined in this property, i.e. the GCE phandle, subsys
+ id, register offset and size. Each GCE subsys id maps to a client
+ defined in the header include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8183-clk.h>
+ #include <dt-bindings/power/mt8183-power.h>
+ #include <dt-bindings/gce/mt8183-gce.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ dither0: dither@14012000 {
+ compatible = "mediatek,mt8183-disp-dither";
+ reg = <0 0x14012000 0 0x1000>;
+ interrupts = <GIC_SPI 235 IRQ_TYPE_LEVEL_LOW>;
+ power-domains = <&spm MT8183_POWER_DOMAIN_DISP>;
+ clocks = <&mmsys CLK_MM_DISP_DITHER0>;
+ mediatek,gce-client-reg = <&gce SUBSYS_1401XXXX 0x2000 0x1000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,dp.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,dp.yaml
new file mode 100644
index 000000000000..ff781f2174a0
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,dp.yaml
@@ -0,0 +1,116 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,dp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Display Port Controller
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Jitao shi <jitao.shi@mediatek.com>
+
+description: |
+ MediaTek DP and eDP are different hardware blocks, and some features, such
+ as audio, are not supported on eDP. Therefore, two different compatibles
+ are needed to describe them. In addition, only the power domain of DP needs
+ to be enabled; the DP clock is generated by the block itself and no other
+ PLL is used to generate clocks.
+
+properties:
+ compatible:
+ enum:
+ - mediatek,mt8195-dp-tx
+ - mediatek,mt8195-edp-tx
+
+ reg:
+ maxItems: 1
+
+ nvmem-cells:
+ maxItems: 1
+ description: efuse data for display port calibration
+
+ nvmem-cell-names:
+ const: dp_calibration_data
+
+ power-domains:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Input endpoint of the controller, usually dp_intf
+
+ port@1:
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
+ description: Output endpoint of the controller
+ properties:
+ endpoint:
+ $ref: /schemas/media/video-interfaces.yaml#
+ unevaluatedProperties: false
+ properties:
+ data-lanes:
+ description: |
+ number of lanes supported by the hardware.
+ The possible values:
+ 0 - for 1 lane enabled in the IP.
+ 0 1 - for 2 lanes enabled in the IP.
+ 0 1 2 3 - for 4 lanes enabled in the IP.
+ minItems: 1
+ maxItems: 4
+ required:
+ - data-lanes
+
+ required:
+ - port@0
+ - port@1
+
+ max-linkrate-mhz:
+ enum: [ 1620, 2700, 5400, 8100 ]
+ description: maximum link rate supported by the hardware.
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - ports
+ - max-linkrate-mhz
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/power/mt8195-power.h>
+ dptx@1c600000 {
+ compatible = "mediatek,mt8195-dp-tx";
+ reg = <0x1c600000 0x8000>;
+ power-domains = <&spm MT8195_POWER_DOMAIN_DP_TX>;
+ interrupts = <GIC_SPI 458 IRQ_TYPE_LEVEL_HIGH 0>;
+ max-linkrate-mhz = <8100>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dptx_in: endpoint {
+ remote-endpoint = <&dp_intf0_out>;
+ };
+ };
+ port@1 {
+ reg = <1>;
+ dptx_out: endpoint {
+ data-lanes = <0 1 2 3>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml
index dd2896a40ff0..d976380801e3 100644
--- a/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,dpi.yaml
@@ -4,16 +4,16 @@
$id: http://devicetree.org/schemas/display/mediatek/mediatek,dpi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: mediatek DPI Controller Device Tree Bindings
+title: MediaTek DPI and DP_INTF Controller
maintainers:
- CK Hu <ck.hu@mediatek.com>
- Jitao shi <jitao.shi@mediatek.com>
description: |
- The Mediatek DPI function block is a sink of the display subsystem and
- provides 8-bit RGB/YUV444 or 8/10/10-bit YUV422 pixel data on a parallel
- output bus.
+ The MediaTek DPI and DP_INTF function blocks are sinks of the display
+ subsystem and provide 8-bit RGB/YUV444 or 8/10/10-bit YUV422 pixel data on
+ a parallel output bus.
properties:
compatible:
@@ -22,7 +22,10 @@ properties:
- mediatek,mt7623-dpi
- mediatek,mt8173-dpi
- mediatek,mt8183-dpi
+ - mediatek,mt8186-dpi
+ - mediatek,mt8188-dp-intf
- mediatek,mt8192-dpi
+ - mediatek,mt8195-dp-intf
reg:
maxItems: 1
@@ -54,7 +57,7 @@ properties:
$ref: /schemas/graph.yaml#/properties/port
description:
Output port node. This port should be connected to the input port of an
- attached HDMI or LVDS encoder chip.
+ attached HDMI, LVDS or DisplayPort encoder chip.
required:
- compatible
@@ -70,8 +73,7 @@ examples:
- |
#include <dt-bindings/interrupt-controller/arm-gic.h>
#include <dt-bindings/clock/mt8173-clk.h>
- #include <dt-bindings/interrupt-controller/arm-gic.h>
- #include <dt-bindings/interrupt-controller/irq.h>
+
dpi0: dpi@1401d000 {
compatible = "mediatek,mt8173-dpi";
reg = <0x1401d000 0x1000>;
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,dsc.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,dsc.yaml
new file mode 100644
index 000000000000..2cbdd9ee449d
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,dsc.yaml
@@ -0,0 +1,80 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,dsc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek display DSC controller
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ The DSC standard is a specification of the algorithms used for
+ compressing and decompressing image display streams, including
+ the specification of the syntax and semantics of the compressed
+ video bit stream. DSC is designed for real-time systems with
+ real-time compression, transmission, decompression and display.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt8195-disp-dsc
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: DSC Wrapper Clock
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ mediatek,gce-client-reg:
+ description:
+ The register of the client driver can be configured by GCE with 4
+ arguments defined in this property, i.e. the GCE phandle, subsys id,
+ register offset and size.
+ Each subsys id maps to a base address of the display function block
+ registers, as defined in the GCE header
+ include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8195-clk.h>
+ #include <dt-bindings/power/mt8195-power.h>
+ #include <dt-bindings/gce/mt8195-gce.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ dsc0: disp_dsc_wrap@1c009000 {
+ compatible = "mediatek,mt8195-disp-dsc";
+ reg = <0 0x1c009000 0 0x1000>;
+ interrupts = <GIC_SPI 645 IRQ_TYPE_LEVEL_HIGH 0>;
+ power-domains = <&spm MT8195_POWER_DOMAIN_VDOSYS0>;
+ clocks = <&vdosys0 CLK_VDO0_DSC_WRAP0>;
+ mediatek,gce-client-reg = <&gce1 SUBSYS_1c00XXXX 0x9000 0x1000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.txt b/Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.txt
deleted file mode 100644
index d30428b9fb33..000000000000
--- a/Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.txt
+++ /dev/null
@@ -1,56 +0,0 @@
-Mediatek DSI Device
-===================
-
-The Mediatek DSI function block is a sink of the display subsystem and can
-drive up to 4-lane MIPI DSI output. Two DSIs can be synchronized for dual-
-channel output.
-
-Required properties:
-- compatible: "mediatek,<chip>-dsi"
-- the supported chips are mt2701, mt7623, mt8167, mt8173 and mt8183.
-- reg: Physical base address and length of the controller's registers
-- interrupts: The interrupt signal from the function block.
-- clocks: device clocks
- See Documentation/devicetree/bindings/clock/clock-bindings.txt for details.
-- clock-names: must contain "engine", "digital", and "hs"
-- phys: phandle link to the MIPI D-PHY controller.
-- phy-names: must contain "dphy"
-- port: Output port node with endpoint definitions as described in
- Documentation/devicetree/bindings/graph.txt. This port should be connected
- to the input port of an attached DSI panel or DSI-to-eDP encoder chip.
-
-MIPI TX Configuration Module
-============================
-
-See phy/mediatek,dsi-phy.yaml
-
-Example:
-
-mipi_tx0: mipi-dphy@10215000 {
- compatible = "mediatek,mt8173-mipi-tx";
- reg = <0 0x10215000 0 0x1000>;
- clocks = <&clk26m>;
- clock-output-names = "mipi_tx0_pll";
- #clock-cells = <0>;
- #phy-cells = <0>;
- drive-strength-microamp = <4600>;
- nvmem-cells= <&mipi_tx_calibration>;
- nvmem-cell-names = "calibration-data";
-};
-
-dsi0: dsi@1401b000 {
- compatible = "mediatek,mt8173-dsi";
- reg = <0 0x1401b000 0 0x1000>;
- interrupts = <GIC_SPI 192 IRQ_TYPE_LEVEL_LOW>;
- clocks = <&mmsys MM_DSI0_ENGINE>, <&mmsys MM_DSI0_DIGITAL>,
- <&mipi_tx0>;
- clock-names = "engine", "digital", "hs";
- phys = <&mipi_tx0>;
- phy-names = "dphy";
-
- port {
- dsi0_out: endpoint {
- remote-endpoint = <&panel_in>;
- };
- };
-};
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.yaml
new file mode 100644
index 000000000000..4707b60238b0
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,dsi.yaml
@@ -0,0 +1,116 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,dsi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek DSI Controller
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+ - Jitao Shi <jitao.shi@mediatek.com>
+ - Xinlei Lee <xinlei.lee@mediatek.com>
+
+description: |
+ The MediaTek DSI function block is a sink of the display subsystem and can
+ drive up to 4-lane MIPI DSI output. Two DSIs can be synchronized for dual-
+ channel output.
+
+allOf:
+ - $ref: /schemas/display/dsi-controller.yaml#
+
+properties:
+ compatible:
+ enum:
+ - mediatek,mt2701-dsi
+ - mediatek,mt7623-dsi
+ - mediatek,mt8167-dsi
+ - mediatek,mt8173-dsi
+ - mediatek,mt8183-dsi
+ - mediatek,mt8186-dsi
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: Engine Clock
+ - description: Digital Clock
+ - description: HS Clock
+
+ clock-names:
+ items:
+ - const: engine
+ - const: digital
+ - const: hs
+
+ resets:
+ maxItems: 1
+
+ phys:
+ maxItems: 1
+
+ phy-names:
+ items:
+ - const: dphy
+
+ port:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Output port node. This port should be connected to the input
+ port of an attached DSI panel or DSI-to-eDP encoder chip.
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+ - clock-names
+ - phys
+ - phy-names
+ - port
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/mt8183-clk.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/power/mt8183-power.h>
+ #include <dt-bindings/phy/phy.h>
+ #include <dt-bindings/reset/mt8183-resets.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ dsi0: dsi@14014000 {
+ compatible = "mediatek,mt8183-dsi";
+ reg = <0 0x14014000 0 0x1000>;
+ interrupts = <GIC_SPI 236 IRQ_TYPE_LEVEL_LOW>;
+ power-domains = <&spm MT8183_POWER_DOMAIN_DISP>;
+ clocks = <&mmsys CLK_MM_DSI0_MM>,
+ <&mmsys CLK_MM_DSI0_IF>,
+ <&mipi_tx0>;
+ clock-names = "engine", "digital", "hs";
+ resets = <&mmsys MT8183_MMSYS_SW0_RST_B_DISP_DSI0>;
+ phys = <&mipi_tx0>;
+ phy-names = "dphy";
+ port {
+ dsi0_out: endpoint {
+ remote-endpoint = <&panel_in>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,ethdr.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,ethdr.yaml
new file mode 100644
index 000000000000..801fa66ae615
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,ethdr.yaml
@@ -0,0 +1,182 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,ethdr.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Ethdr Device
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description:
+ ETHDR (ET High Dynamic Range) is a MediaTek internal HDR engine and is
+ designed for HDR video and graphics conversion in the external display path.
+ It handles multiple HDR input types, performs tone mapping and color
+ space/color format conversion, then combines the different layers and
+ outputs the required HDR or SDR signal to the subsequent display path.
+ This engine is composed of two video frontends, two graphic frontends,
+ one video backend and a mixer. ETHDR has two DMA function blocks, DS and ADL.
+ These two function blocks read the pre-programmed registers from DRAM and
+ apply them to the hardware during the v-blanking period.
+
+properties:
+ compatible:
+ const: mediatek,mt8195-disp-ethdr
+
+ reg:
+ maxItems: 7
+
+ reg-names:
+ items:
+ - const: mixer
+ - const: vdo_fe0
+ - const: vdo_fe1
+ - const: gfx_fe0
+ - const: gfx_fe1
+ - const: vdo_be
+ - const: adl_ds
+
+ interrupts:
+ maxItems: 1
+
+ iommus:
+ minItems: 1
+ maxItems: 2
+
+ clocks:
+ items:
+ - description: mixer clock
+ - description: video frontend 0 clock
+ - description: video frontend 1 clock
+ - description: graphic frontend 0 clock
+ - description: graphic frontend 1 clock
+ - description: video backend clock
+ - description: autodownload and menuload clock
+ - description: video frontend 0 async clock
+ - description: video frontend 1 async clock
+ - description: graphic frontend 0 async clock
+ - description: graphic frontend 1 async clock
+ - description: video backend async clock
+ - description: ethdr top clock
+
+ clock-names:
+ items:
+ - const: mixer
+ - const: vdo_fe0
+ - const: vdo_fe1
+ - const: gfx_fe0
+ - const: gfx_fe1
+ - const: vdo_be
+ - const: adl_ds
+ - const: vdo_fe0_async
+ - const: vdo_fe1_async
+ - const: gfx_fe0_async
+ - const: gfx_fe1_async
+ - const: vdo_be_async
+ - const: ethdr_top
+
+ power-domains:
+ maxItems: 1
+
+ resets:
+ items:
+ - description: video frontend 0 async reset
+ - description: video frontend 1 async reset
+ - description: graphic frontend 0 async reset
+ - description: graphic frontend 1 async reset
+ - description: video backend async reset
+
+ reset-names:
+ items:
+ - const: vdo_fe0_async
+ - const: vdo_fe1_async
+ - const: gfx_fe0_async
+ - const: gfx_fe1_async
+ - const: vdo_be_async
+
+ mediatek,gce-client-reg:
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ minItems: 1
+ maxItems: 7
+ description: The register of the display function block to be set by gce.
+ There are 4 arguments in this property: gce node, subsys id, offset and
+ register size. The subsys id is defined in each chip's gce header,
+ include/dt-bindings/gce/<chip>-gce.h, and maps to the register of the
+ display function block.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - interrupts
+ - power-domains
+ - resets
+ - mediatek,gce-client-reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8195-clk.h>
+ #include <dt-bindings/gce/mt8195-gce.h>
+ #include <dt-bindings/memory/mt8195-memory-port.h>
+ #include <dt-bindings/power/mt8195-power.h>
+ #include <dt-bindings/reset/mt8195-resets.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ hdr-engine@1c114000 {
+ compatible = "mediatek,mt8195-disp-ethdr";
+ reg = <0 0x1c114000 0 0x1000>,
+ <0 0x1c115000 0 0x1000>,
+ <0 0x1c117000 0 0x1000>,
+ <0 0x1c119000 0 0x1000>,
+ <0 0x1c11a000 0 0x1000>,
+ <0 0x1c11b000 0 0x1000>,
+ <0 0x1c11c000 0 0x1000>;
+ reg-names = "mixer", "vdo_fe0", "vdo_fe1", "gfx_fe0", "gfx_fe1",
+ "vdo_be", "adl_ds";
+ mediatek,gce-client-reg = <&gce0 SUBSYS_1c11XXXX 0x4000 0x1000>,
+ <&gce0 SUBSYS_1c11XXXX 0x5000 0x1000>,
+ <&gce0 SUBSYS_1c11XXXX 0x7000 0x1000>,
+ <&gce0 SUBSYS_1c11XXXX 0x9000 0x1000>,
+ <&gce0 SUBSYS_1c11XXXX 0xa000 0x1000>,
+ <&gce0 SUBSYS_1c11XXXX 0xb000 0x1000>,
+ <&gce0 SUBSYS_1c11XXXX 0xc000 0x1000>;
+ clocks = <&vdosys1 CLK_VDO1_DISP_MIXER>,
+ <&vdosys1 CLK_VDO1_HDR_VDO_FE0>,
+ <&vdosys1 CLK_VDO1_HDR_VDO_FE1>,
+ <&vdosys1 CLK_VDO1_HDR_GFX_FE0>,
+ <&vdosys1 CLK_VDO1_HDR_GFX_FE1>,
+ <&vdosys1 CLK_VDO1_HDR_VDO_BE>,
+ <&vdosys1 CLK_VDO1_26M_SLOW>,
+ <&vdosys1 CLK_VDO1_HDR_VDO_FE0_DL_ASYNC>,
+ <&vdosys1 CLK_VDO1_HDR_VDO_FE1_DL_ASYNC>,
+ <&vdosys1 CLK_VDO1_HDR_GFX_FE0_DL_ASYNC>,
+ <&vdosys1 CLK_VDO1_HDR_GFX_FE1_DL_ASYNC>,
+ <&vdosys1 CLK_VDO1_HDR_VDO_BE_DL_ASYNC>,
+ <&topckgen CLK_TOP_ETHDR>;
+ clock-names = "mixer", "vdo_fe0", "vdo_fe1", "gfx_fe0", "gfx_fe1",
+ "vdo_be", "adl_ds", "vdo_fe0_async", "vdo_fe1_async",
+ "gfx_fe0_async", "gfx_fe1_async","vdo_be_async",
+ "ethdr_top";
+ power-domains = <&spm MT8195_POWER_DOMAIN_VDOSYS1>;
+ iommus = <&iommu_vpp M4U_PORT_L3_HDR_DS>,
+ <&iommu_vpp M4U_PORT_L3_HDR_ADL>;
+ interrupts = <GIC_SPI 517 IRQ_TYPE_LEVEL_HIGH 0>; /* disp mixer */
+ resets = <&vdosys1 MT8195_VDOSYS1_SW1_RST_B_HDR_VDO_FE0_DL_ASYNC>,
+ <&vdosys1 MT8195_VDOSYS1_SW1_RST_B_HDR_VDO_FE1_DL_ASYNC>,
+ <&vdosys1 MT8195_VDOSYS1_SW1_RST_B_HDR_GFX_FE0_DL_ASYNC>,
+ <&vdosys1 MT8195_VDOSYS1_SW1_RST_B_HDR_GFX_FE1_DL_ASYNC>,
+ <&vdosys1 MT8195_VDOSYS1_SW1_RST_B_HDR_VDO_BE_DL_ASYNC>;
+ reset-names = "vdo_fe0_async", "vdo_fe1_async", "gfx_fe0_async",
+ "gfx_fe1_async", "vdo_be_async";
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,gamma.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,gamma.yaml
new file mode 100644
index 000000000000..6c2be9d6840b
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,gamma.yaml
@@ -0,0 +1,86 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,gamma.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display gamma correction
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display gamma correction, namely GAMMA, provides a nonlinear
+ operation used to adjust luminance in the display system.
+ The GAMMA device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt8173-disp-gamma
+ - mediatek,mt8183-disp-gamma
+ - items:
+ - enum:
+ - mediatek,mt8186-disp-gamma
+ - mediatek,mt8188-disp-gamma
+ - mediatek,mt8192-disp-gamma
+ - mediatek,mt8195-disp-gamma
+ - const: mediatek,mt8183-disp-gamma
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: GAMMA Clock
+
+ mediatek,gce-client-reg:
+ description: The register of the client driver can be configured by gce
+ with the 4 arguments defined in this property: phandle of the gce,
+ subsys id, register offset and size. Each GCE subsys id maps to a client
+ defined in the header include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8173-clk.h>
+ #include <dt-bindings/power/mt8173-power.h>
+ #include <dt-bindings/gce/mt8173-gce.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ gamma@14016000 {
+ compatible = "mediatek,mt8173-disp-gamma";
+ reg = <0 0x14016000 0 0x1000>;
+ interrupts = <GIC_SPI 190 IRQ_TYPE_LEVEL_LOW>;
+ power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
+ clocks = <&mmsys CLK_MM_DISP_GAMMA>;
+ mediatek,gce-client-reg = <&gce SUBSYS_1401XXXX 0x6000 0x1000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,hdmi-ddc.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,hdmi-ddc.yaml
index b6fcdfb99ab2..bd8f7b8ae0ff 100644
--- a/Documentation/devicetree/bindings/display/mediatek/mediatek,hdmi-ddc.yaml
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,hdmi-ddc.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/mediatek/mediatek,hdmi-ddc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Mediatek HDMI DDC Device Tree Bindings
+title: Mediatek HDMI DDC
maintainers:
- CK Hu <ck.hu@mediatek.com>
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,hdmi.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,hdmi.yaml
index 111967efa999..b90b6d18a828 100644
--- a/Documentation/devicetree/bindings/display/mediatek/mediatek,hdmi.yaml
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,hdmi.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/mediatek/mediatek,hdmi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Mediatek HDMI Encoder Device Tree Bindings
+title: Mediatek HDMI Encoder
maintainers:
- CK Hu <ck.hu@mediatek.com>
@@ -50,8 +50,11 @@ properties:
- const: hdmi
mediatek,syscon-hdmi:
- $ref: '/schemas/types.yaml#/definitions/phandle-array'
- maxItems: 1
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ items:
+ - items:
+ - description: phandle to system configuration registers
+ - description: register offset in the system configuration registers
description: |
phandle link and register offset to the system configuration registers.
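The two-cell mediatek,syscon-hdmi property above (phandle plus register offset)
can be read with the standard syscon and OF helpers; a sketch assuming nothing
beyond those helpers:

    #include <linux/err.h>
    #include <linux/mfd/syscon.h>
    #include <linux/of.h>
    #include <linux/regmap.h>

    static int get_hdmi_syscon(struct device_node *np,
                               struct regmap **map, u32 *offset)
    {
            /* cell 0: phandle to the system configuration registers */
            *map = syscon_regmap_lookup_by_phandle(np, "mediatek,syscon-hdmi");
            if (IS_ERR(*map))
                    return PTR_ERR(*map);

            /* cell 1: register offset within that syscon */
            return of_property_read_u32_index(np, "mediatek,syscon-hdmi",
                                              1, offset);
    }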
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,mdp-rdma.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,mdp-rdma.yaml
new file mode 100644
index 000000000000..dd12e2ff685c
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,mdp-rdma.yaml
@@ -0,0 +1,88 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,mdp-rdma.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek MDP RDMA
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description:
+ MediaTek MDP RDMA stands for MDP Read Direct Memory Access; it reads
+ pixel data from memory and provides real-time data to the back-end
+ panel driver, such as DSI, DPI and DP_INTF.
+ It contains one line buffer to store sufficient pixel data.
+ The RDMA device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml for details.
+
+properties:
+ compatible:
+ const: mediatek,mt8195-vdo1-rdma
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: RDMA Clock
+
+ iommus:
+ maxItems: 1
+
+ mediatek,gce-client-reg:
+ description:
+ The register of the display function block to be set by gce. There are
+ 4 arguments: gce node, subsys id, offset and register size. The subsys
+ id, which maps to the register of the display function block, is defined
+ in each chip's gce header, include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ items:
+ items:
+ - description: phandle of GCE
+ - description: GCE subsys id
+ - description: register offset
+ - description: register size
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - power-domains
+ - clocks
+ - iommus
+ - mediatek,gce-client-reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8195-clk.h>
+ #include <dt-bindings/power/mt8195-power.h>
+ #include <dt-bindings/gce/mt8195-gce.h>
+ #include <dt-bindings/memory/mt8195-memory-port.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ rdma@1c104000 {
+ compatible = "mediatek,mt8195-vdo1-rdma";
+ reg = <0 0x1c104000 0 0x1000>;
+ interrupts = <GIC_SPI 495 IRQ_TYPE_LEVEL_HIGH 0>;
+ clocks = <&vdosys1 CLK_VDO1_MDP_RDMA0>;
+ power-domains = <&spm MT8195_POWER_DOMAIN_VDOSYS1>;
+ iommus = <&iommu_vdo M4U_PORT_L2_MDP_RDMA0>;
+ mediatek,gce-client-reg = <&gce0 SUBSYS_1c10XXXX 0x4000 0x1000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,merge.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,merge.yaml
new file mode 100644
index 000000000000..2f8e2f4dc3b8
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,merge.yaml
@@ -0,0 +1,103 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,merge.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display merge
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display merge, namely MERGE, is used to merge two slice-per-line
+ inputs into one side-by-side output.
+ The MERGE device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt8173-disp-merge
+ - mediatek,mt8195-disp-merge
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ minItems: 1
+ maxItems: 2
+
+ clock-names:
+ oneOf:
+ - items:
+ - const: merge
+ - items:
+ - const: merge
+ - const: merge_async
+
+ mediatek,merge-fifo-en:
+ description:
+ The merge fifo setting mainly provides a display latency buffer to
+ ensure that the back-end panel display data does not underrun; a
+ little more data is kept in the fifo for this purpose.
+ According to the merge fifo settings, when the water level is detected
+ to be insufficient, RDMA is triggered to send ultra and preultra
+ commands to SMI to speed up the data rate.
+ type: boolean
+
+ mediatek,merge-mute:
+ description: Enable the mute function, which mutes the content of the merge output.
+ type: boolean
+
+ mediatek,gce-client-reg:
+ description: The register of client driver can be configured by gce with
+ 4 arguments defined in this property, such as phandle of gce, subsys id,
+ register offset and size. Each GCE subsys id is mapping to a client
+ defined in the header include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+ resets:
+ description: Reset controller.
+ See Documentation/devicetree/bindings/reset/reset.txt for details.
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - power-domains
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8173-clk.h>
+ #include <dt-bindings/power/mt8173-power.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ merge@14017000 {
+ compatible = "mediatek,mt8173-disp-merge";
+ reg = <0 0x14017000 0 0x1000>;
+ power-domains = <&spm MT8173_POWER_DOMAIN_MM>;
+ clocks = <&mmsys CLK_MM_DISP_MERGE>;
+ clock-names = "merge";
+ };
+ };
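The example above shows only the single-clock mt8173 form. A sketch of an
mt8195 node exercising the two-clock variant and the fifo flag (node address
and clock macro names are illustrative):

    merge5: merge@1c110000 {
            compatible = "mediatek,mt8195-disp-merge";
            reg = <0 0x1c110000 0 0x1000>;
            power-domains = <&spm MT8195_POWER_DOMAIN_VDOSYS1>;
            clocks = <&vdosys1 CLK_VDO1_VPP_MERGE0>,
                     <&vdosys1 CLK_VDO1_MERGE0_DL_ASYNC>;
            clock-names = "merge", "merge_async";
            mediatek,merge-fifo-en;
    };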
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,od.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,od.yaml
new file mode 100644
index 000000000000..29f9fa8f8219
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,od.yaml
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,od.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display overdrive
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display overdrive, namely OD, increases the transition values
+ of pixels between consecutive frames to make the LCD transition faster.
+ The OD device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt2712-disp-od
+ - mediatek,mt8173-disp-od
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: OD Clock
+
+required:
+ - compatible
+ - reg
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/mt8173-clk.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ od@14023000 {
+ compatible = "mediatek,mt8173-disp-od";
+ reg = <0 0x14023000 0 0x1000>;
+ clocks = <&mmsys CLK_MM_DISP_OD>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,ovl-2l.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,ovl-2l.yaml
new file mode 100644
index 000000000000..c7dd0ef02dcf
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,ovl-2l.yaml
@@ -0,0 +1,91 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,ovl-2l.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display overlay 2 layer
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display overlay 2 layer, namely OVL-2L, provides 2 more layers
+ for OVL.
+ The OVL-2L device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt8183-disp-ovl-2l
+ - mediatek,mt8192-disp-ovl-2l
+ - items:
+ - enum:
+ - mediatek,mt8186-disp-ovl-2l
+ - const: mediatek,mt8192-disp-ovl-2l
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: OVL-2L Clock
+
+ iommus:
+ description:
+ This property should point to the respective IOMMU block with master port as argument,
+ see Documentation/devicetree/bindings/iommu/mediatek,iommu.yaml for details.
+
+ mediatek,gce-client-reg:
+ description: The register of the client driver can be configured by gce
+ with the 4 arguments defined in this property: phandle of the gce,
+ subsys id, register offset and size. Each GCE subsys id maps to a client
+ defined in the header include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+ - iommus
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8183-clk.h>
+ #include <dt-bindings/power/mt8183-power.h>
+ #include <dt-bindings/gce/mt8183-gce.h>
+ #include <dt-bindings/memory/mt8183-larb-port.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ ovl_2l0: ovl@14009000 {
+ compatible = "mediatek,mt8183-disp-ovl-2l";
+ reg = <0 0x14009000 0 0x1000>;
+ interrupts = <GIC_SPI 226 IRQ_TYPE_LEVEL_LOW>;
+ power-domains = <&spm MT8183_POWER_DOMAIN_DISP>;
+ clocks = <&mmsys CLK_MM_DISP_OVL0_2L>;
+ iommus = <&iommu M4U_PORT_DISP_2L_OVL0_LARB0>;
+ mediatek,gce-client-reg = <&gce SUBSYS_1400XXXX 0x9000 0x1000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,ovl.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,ovl.yaml
new file mode 100644
index 000000000000..92e320d54ba2
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,ovl.yaml
@@ -0,0 +1,103 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,ovl.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display overlay
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display overlay, namely OVL, can do alpha blending of layers
+ fetched from memory.
+ The OVL device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt2701-disp-ovl
+ - mediatek,mt8173-disp-ovl
+ - mediatek,mt8183-disp-ovl
+ - mediatek,mt8192-disp-ovl
+ - items:
+ - enum:
+ - mediatek,mt7623-disp-ovl
+ - mediatek,mt2712-disp-ovl
+ - const: mediatek,mt2701-disp-ovl
+ - items:
+ - enum:
+ - mediatek,mt8188-disp-ovl
+ - mediatek,mt8195-disp-ovl
+ - const: mediatek,mt8183-disp-ovl
+ - items:
+ - enum:
+ - mediatek,mt8186-disp-ovl
+ - const: mediatek,mt8192-disp-ovl
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: OVL Clock
+
+ iommus:
+ description:
+ This property should point to the respective IOMMU block with master port as argument,
+ see Documentation/devicetree/bindings/iommu/mediatek,iommu.yaml for details.
+
+ mediatek,gce-client-reg:
+ description: The register of the client driver can be configured by gce
+ with the 4 arguments defined in this property: phandle of the gce,
+ subsys id, register offset and size. Each GCE subsys id maps to a client
+ defined in the header include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+ - iommus
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8173-clk.h>
+ #include <dt-bindings/power/mt8173-power.h>
+ #include <dt-bindings/gce/mt8173-gce.h>
+ #include <dt-bindings/memory/mt8173-larb-port.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ ovl0: ovl@1400c000 {
+ compatible = "mediatek,mt8173-disp-ovl";
+ reg = <0 0x1400c000 0 0x1000>;
+ interrupts = <GIC_SPI 180 IRQ_TYPE_LEVEL_LOW>;
+ power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
+ clocks = <&mmsys CLK_MM_DISP_OVL0>;
+ iommus = <&iommu M4U_PORT_DISP_OVL0>;
+ mediatek,gce-client-reg = <&gce SUBSYS_1400XXXX 0xc000 0x1000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,postmask.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,postmask.yaml
new file mode 100644
index 000000000000..11fe32e50a59
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,postmask.yaml
@@ -0,0 +1,83 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,postmask.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display postmask
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display postmask, namely POSTMASK, provides rounded-corner
+ pattern generation.
+ The POSTMASK device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt8192-disp-postmask
+ - items:
+ - enum:
+ - mediatek,mt8186-disp-postmask
+ - mediatek,mt8188-disp-postmask
+ - const: mediatek,mt8192-disp-postmask
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: POSTMASK Clock
+
+ mediatek,gce-client-reg:
+ description: The register of the client driver can be configured by gce
+ with the 4 arguments defined in this property: phandle of the gce,
+ subsys id, register offset and size. Each GCE subsys id maps to a client
+ defined in the header include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8192-clk.h>
+ #include <dt-bindings/power/mt8192-power.h>
+ #include <dt-bindings/gce/mt8192-gce.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ postmask0: postmask@1400d000 {
+ compatible = "mediatek,mt8192-disp-postmask";
+ reg = <0 0x1400d000 0 0x1000>;
+ interrupts = <GIC_SPI 262 IRQ_TYPE_LEVEL_HIGH 0>;
+ power-domains = <&scpsys MT8192_POWER_DOMAIN_DISP>;
+ clocks = <&mmsys CLK_MM_DISP_POSTMASK0>;
+ mediatek,gce-client-reg = <&gce SUBSYS_1400XXXX 0xd000 0x1000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,rdma.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,rdma.yaml
new file mode 100644
index 000000000000..42059efad45d
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,rdma.yaml
@@ -0,0 +1,117 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,rdma.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek Read Direct Memory Access
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ The Mediatek Read Direct Memory Access (RDMA) component is used to read
+ pixel data from memory. It provides real-time data to the back-end panel
+ driver, such as DSI, DPI and DP_INTF.
+ It contains one line buffer to store sufficient pixel data.
+ The RDMA device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt2701-disp-rdma
+ - mediatek,mt8173-disp-rdma
+ - mediatek,mt8183-disp-rdma
+ - mediatek,mt8195-disp-rdma
+ - items:
+ - enum:
+ - mediatek,mt8188-disp-rdma
+ - const: mediatek,mt8195-disp-rdma
+ - items:
+ - enum:
+ - mediatek,mt7623-disp-rdma
+ - mediatek,mt2712-disp-rdma
+ - const: mediatek,mt2701-disp-rdma
+ - items:
+ - enum:
+ - mediatek,mt8186-disp-rdma
+ - mediatek,mt8192-disp-rdma
+ - const: mediatek,mt8183-disp-rdma
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: RDMA Clock
+
+ iommus:
+ description:
+ This property should point to the respective IOMMU block with master port as argument,
+ see Documentation/devicetree/bindings/iommu/mediatek,iommu.yaml for details.
+
+ mediatek,rdma-fifo-size:
+ description:
+ The rdma fifo size may differ between instances even on the same SoC;
+ add this property to the corresponding rdma node.
+ The values below are the maximum values defined in the hardware data sheet:
+ mediatek,rdma-fifo-size of mt8173-rdma0 is 8K
+ mediatek,rdma-fifo-size of mt8183-rdma0 is 5K
+ mediatek,rdma-fifo-size of mt8183-rdma1 is 2K
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [8192, 5120, 2048]
+
+ mediatek,gce-client-reg:
+ description: The register of the client driver can be configured by gce
+ with the 4 arguments defined in this property: phandle of the gce,
+ subsys id, register offset and size. Each GCE subsys id maps to a client
+ defined in the header include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+ - iommus
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8173-clk.h>
+ #include <dt-bindings/power/mt8173-power.h>
+ #include <dt-bindings/gce/mt8173-gce.h>
+ #include <dt-bindings/memory/mt8173-larb-port.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ rdma0: rdma@1400e000 {
+ compatible = "mediatek,mt8173-disp-rdma";
+ reg = <0 0x1400e000 0 0x1000>;
+ interrupts = <GIC_SPI 182 IRQ_TYPE_LEVEL_LOW>;
+ power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
+ clocks = <&mmsys CLK_MM_DISP_RDMA0>;
+ iommus = <&iommu M4U_PORT_DISP_RDMA0>;
+ mediatek,rdma-fifo-size = <8192>;
+ mediatek,gce-client-reg = <&gce SUBSYS_1400XXXX 0xe000 0x1000>;
+ };
+ };
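Since mediatek,rdma-fifo-size is not in the required list, a driver would
typically read it with of_property_read_u32 and fall back to a default when the
property is absent; a sketch (the fallback value is illustrative):

    u32 fifo_size;

    if (of_property_read_u32(node, "mediatek,rdma-fifo-size", &fifo_size))
            fifo_size = SZ_8K;    /* illustrative fallback */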
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml
new file mode 100644
index 000000000000..21a4e96ecd93
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,split.yaml
@@ -0,0 +1,65 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,split.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display split
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display split, namely SPLIT, is used to split a stream to two
+ encoders.
+ The SPLIT device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt8173-disp-split
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: SPLIT Clock
+
+required:
+ - compatible
+ - reg
+ - power-domains
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/mt8173-clk.h>
+ #include <dt-bindings/power/mt8173-power.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ split0: split@14018000 {
+ compatible = "mediatek,mt8173-disp-split";
+ reg = <0 0x14018000 0 0x1000>;
+ power-domains = <&spm MT8173_POWER_DOMAIN_MM>;
+ clocks = <&mmsys CLK_MM_DISP_SPLIT0>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,ufoe.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,ufoe.yaml
new file mode 100644
index 000000000000..62fad23a26f5
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,ufoe.yaml
@@ -0,0 +1,68 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,ufoe.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek display UFOe
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ Mediatek display UFOe stands for Unified Frame Optimization engine.
+ UFOe can cut the data rate of the DSI port, which may reduce power
+ consumption.
+ The UFOe device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt8173-disp-ufoe
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: UFOe Clock
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8173-clk.h>
+ #include <dt-bindings/power/mt8173-power.h>
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ ufoe@1401a000 {
+ compatible = "mediatek,mt8173-disp-ufoe";
+ reg = <0 0x1401a000 0 0x1000>;
+ interrupts = <GIC_SPI 191 IRQ_TYPE_LEVEL_LOW>;
+ power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
+ clocks = <&mmsys CLK_MM_DISP_UFOE>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/mediatek/mediatek,wdma.yaml b/Documentation/devicetree/bindings/display/mediatek/mediatek,wdma.yaml
new file mode 100644
index 000000000000..991183165d29
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/mediatek/mediatek,wdma.yaml
@@ -0,0 +1,86 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/mediatek/mediatek,wdma.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek Write Direct Memory Access
+
+maintainers:
+ - Chun-Kuang Hu <chunkuang.hu@kernel.org>
+ - Philipp Zabel <p.zabel@pengutronix.de>
+
+description: |
+ The Mediatek Write Direct Memory Access (WDMA) component is used to
+ write pixel data back into memory.
+ The WDMA device node must be a sibling of the central MMSYS_CONFIG node.
+ For a description of the MMSYS_CONFIG binding, see
+ Documentation/devicetree/bindings/arm/mediatek/mediatek,mmsys.yaml
+ for details.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - mediatek,mt8173-disp-wdma
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ description: A phandle and PM domain specifier as defined by bindings of
+ the power controller specified by phandle. See
+ Documentation/devicetree/bindings/power/power-domain.yaml for details.
+
+ clocks:
+ items:
+ - description: WDMA Clock
+
+ iommus:
+ description:
+ This property should point to the respective IOMMU block with master port as argument,
+ see Documentation/devicetree/bindings/iommu/mediatek,iommu.yaml for details.
+
+ mediatek,gce-client-reg:
+ description: The register of the client driver can be configured by gce
+ with the 4 arguments defined in this property: phandle of the gce,
+ subsys id, register offset and size. Each GCE subsys id maps to a client
+ defined in the header include/dt-bindings/gce/<chip>-gce.h.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - power-domains
+ - clocks
+ - iommus
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt8173-clk.h>
+ #include <dt-bindings/power/mt8173-power.h>
+ #include <dt-bindings/gce/mt8173-gce.h>
+ #include <dt-bindings/memory/mt8173-larb-port.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ wdma0: wdma@14011000 {
+ compatible = "mediatek,mt8173-disp-wdma";
+ reg = <0 0x14011000 0 0x1000>;
+ interrupts = <GIC_SPI 185 IRQ_TYPE_LEVEL_LOW>;
+ power-domains = <&scpsys MT8173_POWER_DOMAIN_MM>;
+ clocks = <&mmsys CLK_MM_DISP_WDMA0>;
+ iommus = <&iommu M4U_PORT_DISP_WDMA0>;
+ mediatek,gce-client-reg = <&gce SUBSYS_1401XXXX 0x1000 0x1000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/msm/dp-controller.yaml b/Documentation/devicetree/bindings/display/msm/dp-controller.yaml
index 64d8d9e5e47a..f0c2237d5f82 100644
--- a/Documentation/devicetree/bindings/display/msm/dp-controller.yaml
+++ b/Documentation/devicetree/bindings/display/msm/dp-controller.yaml
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: MSM Display Port Controller
maintainers:
- - Kuogee Hsieh <khsieh@codeaurora.org>
+ - Kuogee Hsieh <quic_khsieh@quicinc.com>
description: |
Device tree bindings for DisplayPort host controller for MSM targets
@@ -15,11 +15,30 @@ description: |
properties:
compatible:
- enum:
- - qcom,sc7180-dp
+ oneOf:
+ - enum:
+ - qcom,sc7180-dp
+ - qcom,sc7280-dp
+ - qcom,sc7280-edp
+ - qcom,sc8180x-dp
+ - qcom,sc8180x-edp
+ - qcom,sc8280xp-dp
+ - qcom,sc8280xp-edp
+ - qcom,sdm845-dp
+ - qcom,sm8350-dp
+ - items:
+ - enum:
+ - qcom,sm8450-dp
+ - const: qcom,sm8350-dp
reg:
- maxItems: 1
+ minItems: 4
+ items:
+ - description: ahb register block
+ - description: aux register block
+ - description: link register block
+ - description: p0 register block
+ - description: p1 register block
interrupts:
maxItems: 1
@@ -57,15 +76,32 @@ properties:
items:
- const: dp
- operating-points-v2:
- maxItems: 1
+ operating-points-v2: true
+
+ opp-table: true
power-domains:
maxItems: 1
+ aux-bus:
+ $ref: /schemas/display/dp-aux-bus.yaml#
+
+ data-lanes:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ deprecated: true
+ minItems: 1
+ maxItems: 4
+ items:
+ maximum: 3
+
"#sound-dai-cells":
const: 0
+ vdda-0p9-supply:
+ deprecated: true
+ vdda-1p2-supply:
+ deprecated: true
+
ports:
$ref: /schemas/graph.yaml#/properties/ports
properties:
@@ -74,8 +110,28 @@ properties:
description: Input endpoint of the controller
port@1:
- $ref: /schemas/graph.yaml#/properties/port
+ $ref: /schemas/graph.yaml#/$defs/port-base
description: Output endpoint of the controller
+ properties:
+ endpoint:
+ $ref: /schemas/media/video-interfaces.yaml#
+ unevaluatedProperties: false
+ properties:
+ data-lanes:
+ minItems: 1
+ maxItems: 4
+ items:
+ enum: [ 0, 1, 2, 3 ]
+
+ link-frequencies:
+ minItems: 1
+ maxItems: 4
+ items:
+ enum: [ 1620000000, 2700000000, 5400000000, 8100000000 ]
+
+ required:
+ - port@0
+ - port@1
required:
- compatible
@@ -85,22 +141,47 @@ required:
- clock-names
- phys
- phy-names
- - "#sound-dai-cells"
- power-domains
- ports
+allOf:
+ # The AUX bus does not exist on DP controllers.
+ # Audio output is present only on DP outputs.
+ # The p1 region is present on DP, but not on eDP.
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sc7280-edp
+ - qcom,sc8180x-edp
+ - qcom,sc8280xp-edp
+ then:
+ properties:
+ "#sound-dai-cells": false
+ else:
+ properties:
+ aux-bus: false
+ reg:
+ minItems: 5
+ required:
+ - "#sound-dai-cells"
+
additionalProperties: false
examples:
- |
#include <dt-bindings/interrupt-controller/arm-gic.h>
#include <dt-bindings/clock/qcom,dispcc-sc7180.h>
- #include <dt-bindings/power/qcom-aoss-qmp.h>
#include <dt-bindings/power/qcom-rpmpd.h>
displayport-controller@ae90000 {
compatible = "qcom,sc7180-dp";
- reg = <0xae90000 0x1400>;
+ reg = <0xae90000 0x200>,
+ <0xae90200 0x200>,
+ <0xae90400 0xc00>,
+ <0xae91000 0x400>,
+ <0xae91400 0x400>;
interrupt-parent = <&mdss>;
interrupts = <12>;
clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
@@ -139,6 +220,8 @@ examples:
reg = <1>;
endpoint {
remote-endpoint = <&typec>;
+ data-lanes = <0 1>;
+ link-frequencies = /bits/ 64 <1620000000 2700000000 5400000000 8100000000>;
};
};
};
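The example above covers the DP case. Under the if/then rules added earlier, an
eDP instance instead has four reg regions (no p1), an aux-bus child and no
#sound-dai-cells. A sketch, with illustrative addresses and with the clocks,
phys and ports that the schema still requires omitted for brevity:

    displayport-controller@aea0000 {
            compatible = "qcom,sc7280-edp";
            reg = <0xaea0000 0x200>,    /* ahb */
                  <0xaea0200 0x200>,    /* aux */
                  <0xaea0400 0xc00>,    /* link */
                  <0xaea1000 0x400>;    /* p0; no p1 region on eDP */

            aux-bus {
                    panel {
                            compatible = "edp-panel";
                    };
            };
    };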
diff --git a/Documentation/devicetree/bindings/display/msm/dpu-common.yaml b/Documentation/devicetree/bindings/display/msm/dpu-common.yaml
new file mode 100644
index 000000000000..3f953aa5e694
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/dpu-common.yaml
@@ -0,0 +1,56 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/dpu-common.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display DPU common properties
+
+maintainers:
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+ - Rob Clark <robdclark@gmail.com>
+
+description: |
+ Common properties for QCom DPU display controller.
+
+# Do not select this by default, otherwise it is also selected for all
+# display-controller@ nodes
+select:
+ false
+
+properties:
+ $nodename:
+ pattern: '^display-controller@[0-9a-f]+$'
+
+ interrupts:
+ maxItems: 1
+
+ power-domains:
+ maxItems: 1
+
+ operating-points-v2: true
+ opp-table:
+ type: object
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ description: |
+ Contains the list of output ports from the DPU device. These ports
+ connect to interfaces that are external to the DPU hardware,
+ such as DSI, DP, etc.
+
+ patternProperties:
+ "^port@[0-9a-f]+$":
+ $ref: /schemas/graph.yaml#/properties/port
+
+ # at least one port is required
+ required:
+ - port@0
+
+required:
+ - interrupts
+ - power-domains
+ - operating-points-v2
+ - ports
+
+additionalProperties: true
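Because dpu-common.yaml sets select: false, its properties only apply where a
per-SoC schema references it explicitly; a sketch of such a reference (the file
and compatible shown are illustrative):

    # in a per-SoC schema such as dpu-sc7180.yaml
    allOf:
      - $ref: /schemas/display/msm/dpu-common.yaml#

    properties:
      compatible:
        const: qcom,sc7180-dpu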
diff --git a/Documentation/devicetree/bindings/display/msm/dpu-sc7180.yaml b/Documentation/devicetree/bindings/display/msm/dpu-sc7180.yaml
deleted file mode 100644
index 12a86b1ec1bc..000000000000
--- a/Documentation/devicetree/bindings/display/msm/dpu-sc7180.yaml
+++ /dev/null
@@ -1,228 +0,0 @@
-# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
-%YAML 1.2
----
-$id: http://devicetree.org/schemas/display/msm/dpu-sc7180.yaml#
-$schema: http://devicetree.org/meta-schemas/core.yaml#
-
-title: Qualcomm Display DPU dt properties for SC7180 target
-
-maintainers:
- - Krishna Manikandan <mkrishn@codeaurora.org>
-
-description: |
- Device tree bindings for MSM Mobile Display Subsystem(MDSS) that encapsulates
- sub-blocks like DPU display controller, DSI and DP interfaces etc. Device tree
- bindings of MDSS and DPU are mentioned for SC7180 target.
-
-properties:
- compatible:
- items:
- - const: qcom,sc7180-mdss
-
- reg:
- maxItems: 1
-
- reg-names:
- const: mdss
-
- power-domains:
- maxItems: 1
-
- clocks:
- items:
- - description: Display AHB clock from gcc
- - description: Display AHB clock from dispcc
- - description: Display core clock
-
- clock-names:
- items:
- - const: iface
- - const: ahb
- - const: core
-
- interrupts:
- maxItems: 1
-
- interrupt-controller: true
-
- "#address-cells": true
-
- "#size-cells": true
-
- "#interrupt-cells":
- const: 1
-
- iommus:
- items:
- - description: Phandle to apps_smmu node with SID mask for Hard-Fail port0
-
- ranges: true
-
- interconnects:
- items:
- - description: Interconnect path specifying the port ids for data bus
-
- interconnect-names:
- const: mdp0-mem
-
-patternProperties:
- "^display-controller@[0-9a-f]+$":
- type: object
- description: Node containing the properties of DPU.
-
- properties:
- compatible:
- items:
- - const: qcom,sc7180-dpu
-
- reg:
- items:
- - description: Address offset and size for mdp register set
- - description: Address offset and size for vbif register set
-
- reg-names:
- items:
- - const: mdp
- - const: vbif
-
- clocks:
- items:
- - description: Display hf axi clock
- - description: Display ahb clock
- - description: Display rotator clock
- - description: Display lut clock
- - description: Display core clock
- - description: Display vsync clock
-
- clock-names:
- items:
- - const: bus
- - const: iface
- - const: rot
- - const: lut
- - const: core
- - const: vsync
-
- interrupts:
- maxItems: 1
-
- power-domains:
- maxItems: 1
-
- operating-points-v2: true
-
- ports:
- $ref: /schemas/graph.yaml#/properties/ports
- description: |
- Contains the list of output ports from DPU device. These ports
- connect to interfaces that are external to the DPU hardware,
- such as DSI, DP etc. Each output port contains an endpoint that
- describes how it is connected to an external interface.
-
- properties:
- port@0:
- $ref: /schemas/graph.yaml#/properties/port
- description: DPU_INTF1 (DSI1)
-
- port@2:
- $ref: /schemas/graph.yaml#/properties/port
- description: DPU_INTF0 (DP)
-
- required:
- - port@0
-
- required:
- - compatible
- - reg
- - reg-names
- - clocks
- - interrupts
- - power-domains
- - operating-points-v2
- - ports
-
-required:
- - compatible
- - reg
- - reg-names
- - power-domains
- - clocks
- - interrupts
- - interrupt-controller
- - iommus
- - ranges
-
-additionalProperties: false
-
-examples:
- - |
- #include <dt-bindings/clock/qcom,dispcc-sc7180.h>
- #include <dt-bindings/clock/qcom,gcc-sc7180.h>
- #include <dt-bindings/interrupt-controller/arm-gic.h>
- #include <dt-bindings/interconnect/qcom,sdm845.h>
- #include <dt-bindings/power/qcom-rpmpd.h>
-
- display-subsystem@ae00000 {
- #address-cells = <1>;
- #size-cells = <1>;
- compatible = "qcom,sc7180-mdss";
- reg = <0xae00000 0x1000>;
- reg-names = "mdss";
- power-domains = <&dispcc MDSS_GDSC>;
- clocks = <&gcc GCC_DISP_AHB_CLK>,
- <&dispcc DISP_CC_MDSS_AHB_CLK>,
- <&dispcc DISP_CC_MDSS_MDP_CLK>;
- clock-names = "iface", "ahb", "core";
-
- interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
- interrupt-controller;
- #interrupt-cells = <1>;
-
- interconnects = <&mmss_noc MASTER_MDP0 &mc_virt SLAVE_EBI1>;
- interconnect-names = "mdp0-mem";
-
- iommus = <&apps_smmu 0x800 0x2>;
- ranges;
-
- display-controller@ae01000 {
- compatible = "qcom,sc7180-dpu";
- reg = <0x0ae01000 0x8f000>,
- <0x0aeb0000 0x2008>;
-
- reg-names = "mdp", "vbif";
-
- clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
- <&dispcc DISP_CC_MDSS_AHB_CLK>,
- <&dispcc DISP_CC_MDSS_ROT_CLK>,
- <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
- <&dispcc DISP_CC_MDSS_MDP_CLK>,
- <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
- clock-names = "bus", "iface", "rot", "lut", "core",
- "vsync";
-
- interrupt-parent = <&mdss>;
- interrupts = <0>;
- power-domains = <&rpmhpd SC7180_CX>;
- operating-points-v2 = <&mdp_opp_table>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- dpu_intf1_out: endpoint {
- remote-endpoint = <&dsi0_in>;
- };
- };
-
- port@2 {
- reg = <2>;
- dpu_intf0_out: endpoint {
- remote-endpoint = <&dp_in>;
- };
- };
- };
- };
- };
-...
diff --git a/Documentation/devicetree/bindings/display/msm/dpu-sdm845.yaml b/Documentation/devicetree/bindings/display/msm/dpu-sdm845.yaml
deleted file mode 100644
index b4ea7c92fb3d..000000000000
--- a/Documentation/devicetree/bindings/display/msm/dpu-sdm845.yaml
+++ /dev/null
@@ -1,212 +0,0 @@
-# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
-%YAML 1.2
----
-$id: http://devicetree.org/schemas/display/msm/dpu-sdm845.yaml#
-$schema: http://devicetree.org/meta-schemas/core.yaml#
-
-title: Qualcomm Display DPU dt properties for SDM845 target
-
-maintainers:
- - Krishna Manikandan <mkrishn@codeaurora.org>
-
-description: |
- Device tree bindings for MSM Mobile Display Subsystem(MDSS) that encapsulates
- sub-blocks like DPU display controller, DSI and DP interfaces etc. Device tree
- bindings of MDSS and DPU are mentioned for SDM845 target.
-
-properties:
- compatible:
- items:
- - const: qcom,sdm845-mdss
-
- reg:
- maxItems: 1
-
- reg-names:
- const: mdss
-
- power-domains:
- maxItems: 1
-
- clocks:
- items:
- - description: Display AHB clock from gcc
- - description: Display AXI clock
- - description: Display core clock
-
- clock-names:
- items:
- - const: iface
- - const: bus
- - const: core
-
- interrupts:
- maxItems: 1
-
- interrupt-controller: true
-
- "#address-cells": true
-
- "#size-cells": true
-
- "#interrupt-cells":
- const: 1
-
- iommus:
- items:
- - description: Phandle to apps_smmu node with SID mask for Hard-Fail port0
- - description: Phandle to apps_smmu node with SID mask for Hard-Fail port1
-
- ranges: true
-
-patternProperties:
- "^display-controller@[0-9a-f]+$":
- type: object
- description: Node containing the properties of DPU.
-
- properties:
- compatible:
- items:
- - const: qcom,sdm845-dpu
-
- reg:
- items:
- - description: Address offset and size for mdp register set
- - description: Address offset and size for vbif register set
-
- reg-names:
- items:
- - const: mdp
- - const: vbif
-
- clocks:
- items:
- - description: Display ahb clock
- - description: Display axi clock
- - description: Display core clock
- - description: Display vsync clock
-
- clock-names:
- items:
- - const: iface
- - const: bus
- - const: core
- - const: vsync
-
- interrupts:
- maxItems: 1
-
- power-domains:
- maxItems: 1
-
- operating-points-v2: true
- ports:
- $ref: /schemas/graph.yaml#/properties/ports
- description: |
- Contains the list of output ports from DPU device. These ports
- connect to interfaces that are external to the DPU hardware,
- such as DSI, DP etc. Each output port contains an endpoint that
- describes how it is connected to an external interface.
-
- properties:
- port@0:
- $ref: /schemas/graph.yaml#/properties/port
- description: DPU_INTF1 (DSI1)
-
- port@1:
- $ref: /schemas/graph.yaml#/properties/port
- description: DPU_INTF2 (DSI2)
-
- required:
- - port@0
- - port@1
-
- required:
- - compatible
- - reg
- - reg-names
- - clocks
- - interrupts
- - power-domains
- - operating-points-v2
- - ports
-
-required:
- - compatible
- - reg
- - reg-names
- - power-domains
- - clocks
- - interrupts
- - interrupt-controller
- - iommus
- - ranges
-
-additionalProperties: false
-
-examples:
- - |
- #include <dt-bindings/clock/qcom,dispcc-sdm845.h>
- #include <dt-bindings/clock/qcom,gcc-sdm845.h>
- #include <dt-bindings/interrupt-controller/arm-gic.h>
- #include <dt-bindings/power/qcom-rpmpd.h>
-
- display-subsystem@ae00000 {
- #address-cells = <1>;
- #size-cells = <1>;
- compatible = "qcom,sdm845-mdss";
- reg = <0x0ae00000 0x1000>;
- reg-names = "mdss";
- power-domains = <&dispcc MDSS_GDSC>;
-
- clocks = <&gcc GCC_DISP_AHB_CLK>,
- <&gcc GCC_DISP_AXI_CLK>,
- <&dispcc DISP_CC_MDSS_MDP_CLK>;
- clock-names = "iface", "bus", "core";
-
- interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
- interrupt-controller;
- #interrupt-cells = <1>;
-
- iommus = <&apps_smmu 0x880 0x8>,
- <&apps_smmu 0xc80 0x8>;
- ranges;
-
- display-controller@ae01000 {
- compatible = "qcom,sdm845-dpu";
- reg = <0x0ae01000 0x8f000>,
- <0x0aeb0000 0x2008>;
- reg-names = "mdp", "vbif";
-
- clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
- <&dispcc DISP_CC_MDSS_AXI_CLK>,
- <&dispcc DISP_CC_MDSS_MDP_CLK>,
- <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
- clock-names = "iface", "bus", "core", "vsync";
-
- interrupt-parent = <&mdss>;
- interrupts = <0>;
- power-domains = <&rpmhpd SDM845_CX>;
- operating-points-v2 = <&mdp_opp_table>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- dpu_intf1_out: endpoint {
- remote-endpoint = <&dsi0_in>;
- };
- };
-
- port@1 {
- reg = <1>;
- dpu_intf2_out: endpoint {
- remote-endpoint = <&dsi1_in>;
- };
- };
- };
- };
- };
-...
diff --git a/Documentation/devicetree/bindings/display/msm/dsi-controller-main.yaml b/Documentation/devicetree/bindings/display/msm/dsi-controller-main.yaml
index 35426fde8610..e6c1ebfe8a32 100644
--- a/Documentation/devicetree/bindings/display/msm/dsi-controller-main.yaml
+++ b/Documentation/devicetree/bindings/display/msm/dsi-controller-main.yaml
@@ -7,15 +7,35 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Qualcomm Display DSI controller
maintainers:
- - Krishna Manikandan <mkrishn@codeaurora.org>
-
-allOf:
- - $ref: "../dsi-controller.yaml#"
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
properties:
compatible:
- items:
- - const: qcom,mdss-dsi-ctrl
+ oneOf:
+ - items:
+ - enum:
+ - qcom,apq8064-dsi-ctrl
+ - qcom,msm8916-dsi-ctrl
+ - qcom,msm8953-dsi-ctrl
+ - qcom,msm8974-dsi-ctrl
+ - qcom,msm8996-dsi-ctrl
+ - qcom,msm8998-dsi-ctrl
+ - qcom,qcm2290-dsi-ctrl
+ - qcom,sc7180-dsi-ctrl
+ - qcom,sc7280-dsi-ctrl
+ - qcom,sdm660-dsi-ctrl
+ - qcom,sdm845-dsi-ctrl
+ - qcom,sm6115-dsi-ctrl
+ - qcom,sm8150-dsi-ctrl
+ - qcom,sm8250-dsi-ctrl
+ - qcom,sm8350-dsi-ctrl
+ - qcom,sm8450-dsi-ctrl
+ - qcom,sm8550-dsi-ctrl
+ - const: qcom,mdss-dsi-ctrl
+ - enum:
+ - qcom,dsi-ctrl-6g-qcm2290
+ - qcom,mdss-dsi-ctrl # This should always come with an SoC-specific compatible
+ deprecated: true
reg:
maxItems: 1
@@ -27,36 +47,34 @@ properties:
maxItems: 1
clocks:
- items:
- - description: Display byte clock
- - description: Display byte interface clock
- - description: Display pixel clock
- - description: Display escape clock
- - description: Display AHB clock
- - description: Display AXI clock
+    description: |
+      Several clocks are used, depending on the variant. Typical ones are:
+       - bus: Display AXI clock.
+       - byte: Display byte clock.
+       - byte_intf: Display byte interface clock.
+       - core: Display core clock.
+       - core_mmss: Core MultiMedia SubSystem clock.
+       - iface: Display AHB clock.
+       - mdp_core: MDP Core clock.
+       - mnoc: MNOC clock.
+       - pixel: Display pixel clock.
+ minItems: 3
+ maxItems: 9
clock-names:
- items:
- - const: byte
- - const: byte_intf
- - const: pixel
- - const: core
- - const: iface
- - const: bus
+ minItems: 3
+ maxItems: 9
phys:
maxItems: 1
phy-names:
+ deprecated: true
const: dsi
- "#address-cells": true
-
- "#size-cells": true
-
syscon-sfpb:
description: A phandle to mmss_sfpb syscon node (only for DSIv2).
- $ref: "/schemas/types.yaml#/definitions/phandle"
+ $ref: /schemas/types.yaml#/definitions/phandle
qcom,dual-dsi-mode:
type: boolean
@@ -66,13 +84,15 @@ properties:
assigned-clocks:
minItems: 2
- maxItems: 2
+ maxItems: 4
description: |
Parents of "byte" and "pixel" for the given platform.
+ For DSIv2 platforms this should contain "byte", "esc", "src" and
+ "pixel_src" clocks.
assigned-clock-parents:
minItems: 2
- maxItems: 2
+ maxItems: 4
description: |
The Byte clock and Pixel clock PLL outputs provided by a DSI PHY block.
@@ -81,15 +101,18 @@ properties:
operating-points-v2: true
+ opp-table:
+ type: object
+
ports:
- $ref: "/schemas/graph.yaml#/properties/ports"
+ $ref: /schemas/graph.yaml#/properties/ports
description: |
Contains DSI controller input and output ports as children, each
containing one endpoint subnode.
properties:
port@0:
- $ref: "/schemas/graph.yaml#/$defs/port-base"
+ $ref: /schemas/graph.yaml#/$defs/port-base
unevaluatedProperties: false
description: |
Input endpoints of the controller.
@@ -100,12 +123,12 @@ properties:
properties:
data-lanes:
maxItems: 4
- minItems: 4
+ minItems: 1
items:
enum: [ 0, 1, 2, 3 ]
port@1:
- $ref: "/schemas/graph.yaml#/$defs/port-base"
+ $ref: /schemas/graph.yaml#/$defs/port-base
unevaluatedProperties: false
description: |
Output endpoints of the controller.
@@ -116,7 +139,7 @@ properties:
properties:
data-lanes:
maxItems: 4
- minItems: 4
+ minItems: 1
items:
enum: [ 0, 1, 2, 3 ]
@@ -124,6 +147,26 @@ properties:
- port@0
- port@1
+  avdd-supply:
+    description:
+      AVDD regulator
+
+  vcca-supply:
+    description:
+      VCCA regulator
+
+ vdd-supply:
+ description:
+ VDD regulator
+
+ vddio-supply:
+ description:
+ VDD-IO regulator
+
+ vdda-supply:
+ description:
+ VDDA regulator
+
required:
- compatible
- reg
@@ -132,14 +175,197 @@ required:
- clocks
- clock-names
- phys
- - phy-names
- assigned-clocks
- assigned-clock-parents
- - power-domains
- - operating-points-v2
- ports
-additionalProperties: false
+allOf:
+ - $ref: ../dsi-controller.yaml#
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,apq8064-dsi-ctrl
+ then:
+ properties:
+ clocks:
+ maxItems: 7
+ clock-names:
+ items:
+ - const: iface
+ - const: bus
+ - const: core_mmss
+ - const: src
+ - const: byte
+ - const: pixel
+ - const: core
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,msm8916-dsi-ctrl
+ then:
+ properties:
+ clocks:
+ maxItems: 6
+ clock-names:
+ items:
+ - const: mdp_core
+ - const: iface
+ - const: bus
+ - const: byte
+ - const: pixel
+ - const: core
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,msm8953-dsi-ctrl
+ then:
+ properties:
+ clocks:
+ maxItems: 6
+ clock-names:
+ items:
+ - const: mdp_core
+ - const: iface
+ - const: bus
+ - const: byte
+ - const: pixel
+ - const: core
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,msm8974-dsi-ctrl
+ then:
+ properties:
+ clocks:
+ maxItems: 7
+ clock-names:
+ items:
+ - const: mdp_core
+ - const: iface
+ - const: bus
+ - const: byte
+ - const: pixel
+ - const: core
+ - const: core_mmss
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,msm8996-dsi-ctrl
+ then:
+ properties:
+ clocks:
+ maxItems: 7
+ clock-names:
+ items:
+ - const: mdp_core
+ - const: byte
+ - const: iface
+ - const: bus
+ - const: core_mmss
+ - const: pixel
+ - const: core
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,msm8998-dsi-ctrl
+ then:
+ properties:
+ clocks:
+ maxItems: 6
+ clock-names:
+ items:
+ - const: byte
+ - const: byte_intf
+ - const: pixel
+ - const: core
+ - const: iface
+ - const: bus
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sc7180-dsi-ctrl
+ - qcom,sc7280-dsi-ctrl
+ - qcom,sm8150-dsi-ctrl
+ - qcom,sm8250-dsi-ctrl
+ - qcom,sm8350-dsi-ctrl
+ - qcom,sm8450-dsi-ctrl
+ - qcom,sm8550-dsi-ctrl
+ then:
+ properties:
+ clocks:
+ maxItems: 6
+ clock-names:
+ items:
+ - const: byte
+ - const: byte_intf
+ - const: pixel
+ - const: core
+ - const: iface
+ - const: bus
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sdm660-dsi-ctrl
+ then:
+ properties:
+ clocks:
+ maxItems: 9
+ clock-names:
+ items:
+ - const: mdp_core
+ - const: byte
+ - const: byte_intf
+ - const: mnoc
+ - const: iface
+ - const: bus
+ - const: core_mmss
+ - const: pixel
+ - const: core
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,sdm845-dsi-ctrl
+ - qcom,sm6115-dsi-ctrl
+ then:
+ properties:
+ clocks:
+ maxItems: 6
+ clock-names:
+ items:
+ - const: byte
+ - const: byte_intf
+ - const: pixel
+ - const: core
+ - const: iface
+ - const: bus
+
+unevaluatedProperties: false
examples:
- |
@@ -149,7 +375,7 @@ examples:
#include <dt-bindings/power/qcom-rpmpd.h>
dsi@ae94000 {
- compatible = "qcom,mdss-dsi-ctrl";
+ compatible = "qcom,sc7180-dsi-ctrl", "qcom,mdss-dsi-ctrl";
reg = <0x0ae94000 0x400>;
reg-names = "dsi_ctrl";
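
The per-SoC if/then clock rules above are easier to follow next to a concrete node. Below is a minimal sketch of an SDM845 controller under the new compatible scheme, assuming the dispcc provider and DISP_CC_* constants from the example headers; the unit address is illustrative, and interrupts, phys and ports are omitted:

    dsi@ae94000 {
        compatible = "qcom,sdm845-dsi-ctrl", "qcom,mdss-dsi-ctrl";
        reg = <0x0ae94000 0x400>;
        reg-names = "dsi_ctrl";

        /* six clocks, in the exact order the sdm845 branch requires */
        clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK>,
                 <&dispcc DISP_CC_MDSS_BYTE0_INTF_CLK>,
                 <&dispcc DISP_CC_MDSS_PCLK0_CLK>,
                 <&dispcc DISP_CC_MDSS_ESC0_CLK>,
                 <&dispcc DISP_CC_MDSS_AHB_CLK>,
                 <&dispcc DISP_CC_MDSS_AXI_CLK>;
        clock-names = "byte", "byte_intf", "pixel", "core", "iface", "bus";

        /* parents of "byte" and "pixel", provided by the DSI PHY */
        assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK_SRC>,
                          <&dispcc DISP_CC_MDSS_PCLK0_CLK_SRC>;
        assigned-clock-parents = <&dsi0_phy 0>, <&dsi0_phy 1>;
    };
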
diff --git a/Documentation/devicetree/bindings/display/msm/dsi-phy-10nm.yaml b/Documentation/devicetree/bindings/display/msm/dsi-phy-10nm.yaml
index 4399715953e1..e6b00d7387ce 100644
--- a/Documentation/devicetree/bindings/display/msm/dsi-phy-10nm.yaml
+++ b/Documentation/devicetree/bindings/display/msm/dsi-phy-10nm.yaml
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Qualcomm Display DSI 10nm PHY
maintainers:
- - Krishna Manikandan <mkrishn@codeaurora.org>
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
allOf:
- $ref: dsi-phy-common.yaml#
@@ -35,11 +35,40 @@ properties:
Connected to DSI0_MIPI_DSI_PLL_VDDA0P9 pin for sc7180 target and
connected to VDDA_MIPI_DSI_0_PLL_0P9 pin for sdm845 target
+ qcom,phy-rescode-offset-top:
+ $ref: /schemas/types.yaml#/definitions/int8-array
+ maxItems: 5
+    description:
+      Integer array of offsets for the pull-up legs rescode of all five lanes,
+      used to offset the drive strength upwards from its calibrated value;
+      -32 is the weakest and +31 the strongest setting.
+ items:
+ minimum: -32
+ maximum: 31
+
+ qcom,phy-rescode-offset-bot:
+ $ref: /schemas/types.yaml#/definitions/int8-array
+ maxItems: 5
+    description:
+      Integer array of offsets for the pull-down legs rescode of all five
+      lanes, used to offset the drive strength downwards from its calibrated
+      value; -32 is the weakest and +31 the strongest setting.
+ items:
+ minimum: -32
+ maximum: 31
+
+ qcom,phy-drive-ldo-level:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ The PHY LDO has an amplitude tuning feature to adjust the LDO output
+ for the HSTX drive. Use supported levels (mV) to offset the drive level
+ from the default value.
+ enum: [ 375, 400, 425, 450, 475, 500 ]
+
required:
- compatible
- reg
- reg-names
- - vdds-supply
unevaluatedProperties: false
@@ -64,5 +93,9 @@ examples:
clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
<&rpmhcc RPMH_CXO_CLK>;
clock-names = "iface", "ref";
+
+ qcom,phy-rescode-offset-top = /bits/ 8 <0 0 0 0 0>;
+ qcom,phy-rescode-offset-bot = /bits/ 8 <0 0 0 0 0>;
+ qcom,phy-drive-ldo-level = <400>;
};
...
diff --git a/Documentation/devicetree/bindings/display/msm/dsi-phy-14nm.yaml b/Documentation/devicetree/bindings/display/msm/dsi-phy-14nm.yaml
index 064df50e21a5..a43e11d3b00d 100644
--- a/Documentation/devicetree/bindings/display/msm/dsi-phy-14nm.yaml
+++ b/Documentation/devicetree/bindings/display/msm/dsi-phy-14nm.yaml
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Qualcomm Display DSI 14nm PHY
maintainers:
- - Krishna Manikandan <mkrishn@codeaurora.org>
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
allOf:
- $ref: dsi-phy-common.yaml#
@@ -16,7 +16,9 @@ properties:
compatible:
enum:
- qcom,dsi-phy-14nm
+ - qcom,dsi-phy-14nm-2290
- qcom,dsi-phy-14nm-660
+ - qcom,dsi-phy-14nm-8953
reg:
items:
@@ -37,7 +39,6 @@ required:
- compatible
- reg
- reg-names
- - vcca-supply
unevaluatedProperties: false
diff --git a/Documentation/devicetree/bindings/display/msm/dsi-phy-20nm.yaml b/Documentation/devicetree/bindings/display/msm/dsi-phy-20nm.yaml
index b8de785ce815..9c1f9140c731 100644
--- a/Documentation/devicetree/bindings/display/msm/dsi-phy-20nm.yaml
+++ b/Documentation/devicetree/bindings/display/msm/dsi-phy-20nm.yaml
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Qualcomm Display DSI 20nm PHY
maintainers:
- - Krishna Manikandan <mkrishn@codeaurora.org>
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
allOf:
- $ref: dsi-phy-common.yaml#
diff --git a/Documentation/devicetree/bindings/display/msm/dsi-phy-28nm.yaml b/Documentation/devicetree/bindings/display/msm/dsi-phy-28nm.yaml
index 69eecaa64b18..cf4a338c4661 100644
--- a/Documentation/devicetree/bindings/display/msm/dsi-phy-28nm.yaml
+++ b/Documentation/devicetree/bindings/display/msm/dsi-phy-28nm.yaml
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Qualcomm Display DSI 28nm PHY
maintainers:
- - Krishna Manikandan <mkrishn@codeaurora.org>
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
allOf:
- $ref: dsi-phy-common.yaml#
@@ -16,6 +16,7 @@ properties:
compatible:
enum:
- qcom,dsi-phy-28nm-hpm
+ - qcom,dsi-phy-28nm-hpm-fam-b
- qcom,dsi-phy-28nm-lp
- qcom,dsi-phy-28nm-8960
@@ -34,6 +35,10 @@ properties:
vddio-supply:
description: Phandle to vdd-io regulator device node.
+ qcom,dsi-phy-regulator-ldo-mode:
+ type: boolean
+    description: Indicates that the PHY regulator should operate in LDO mode.
+
required:
- compatible
- reg
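
Using the new LDO-mode flag is a single boolean line on the PHY node. A sketch follows, assuming the usual 28nm PHY register layout; the unit address, register ranges, clock providers and supply phandles are placeholders, not values from a real board:

    dsi-phy@fd922a00 {
        compatible = "qcom,dsi-phy-28nm-hpm-fam-b";
        reg = <0xfd922a00 0xd4>,
              <0xfd922b00 0x280>,
              <0xfd922d80 0x30>;
        reg-names = "dsi_pll", "dsi_phy", "dsi_phy_regulator";

        #clock-cells = <1>;
        #phy-cells = <0>;

        clocks = <&mmss_ahb_clk>, <&xo_board>;
        clock-names = "iface", "ref";

        vddio-supply = <&vreg_dsi_phy_io>;

        /* run the built-in PHY regulator block in LDO mode */
        qcom,dsi-phy-regulator-ldo-mode;
    };
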
diff --git a/Documentation/devicetree/bindings/display/msm/dsi-phy-7nm.yaml b/Documentation/devicetree/bindings/display/msm/dsi-phy-7nm.yaml
index c851770bbdf2..8e9031bbde73 100644
--- a/Documentation/devicetree/bindings/display/msm/dsi-phy-7nm.yaml
+++ b/Documentation/devicetree/bindings/display/msm/dsi-phy-7nm.yaml
@@ -18,6 +18,10 @@ properties:
- qcom,dsi-phy-7nm
- qcom,dsi-phy-7nm-8150
- qcom,sc7280-dsi-phy-7nm
+ - qcom,sm6375-dsi-phy-7nm
+ - qcom,sm8350-dsi-phy-5nm
+ - qcom,sm8450-dsi-phy-5nm
+ - qcom,sm8550-dsi-phy-4nm
reg:
items:
@@ -44,7 +48,6 @@ required:
- compatible
- reg
- reg-names
- - vdds-supply
unevaluatedProperties: false
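
The newly added 5nm and 4nm variants keep the common PHY node shape. A sketch for the SM8550 compatible, with placeholder unit address, register ranges and clock provider phandles; vdds-supply is shown even though the schema no longer requires it:

    dsi-phy@ae95000 {
        compatible = "qcom,sm8550-dsi-phy-4nm";
        reg = <0x0ae95000 0x200>,
              <0x0ae95200 0x280>,
              <0x0ae95500 0x400>;
        reg-names = "dsi_phy", "dsi_phy_lane", "dsi_pll";

        #clock-cells = <1>;
        #phy-cells = <0>;

        clocks = <&dispcc_ahb_clk>, <&rpmhcc_xo_clk>;
        clock-names = "iface", "ref";

        /* optional in the schema, but list the regulator when known */
        vdds-supply = <&vreg_dsi_phy>;
    };
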
diff --git a/Documentation/devicetree/bindings/display/msm/dsi-phy-common.yaml b/Documentation/devicetree/bindings/display/msm/dsi-phy-common.yaml
index 502bdda90235..0f6f08890e7e 100644
--- a/Documentation/devicetree/bindings/display/msm/dsi-phy-common.yaml
+++ b/Documentation/devicetree/bindings/display/msm/dsi-phy-common.yaml
@@ -4,14 +4,13 @@
$id: http://devicetree.org/schemas/display/msm/dsi-phy-common.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Description of Qualcomm Display DSI PHY common dt properties
+title: Qualcomm Display DSI PHY Common Properties
maintainers:
- - Krishna Manikandan <mkrishn@codeaurora.org>
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
-description: |
- This defines the DSI PHY dt properties which are common for all
- dsi phy versions.
+description:
+ Common properties for Qualcomm Display DSI PHY.
properties:
"#clock-cells":
diff --git a/Documentation/devicetree/bindings/display/msm/edp.txt b/Documentation/devicetree/bindings/display/msm/edp.txt
deleted file mode 100644
index eff9daff418c..000000000000
--- a/Documentation/devicetree/bindings/display/msm/edp.txt
+++ /dev/null
@@ -1,56 +0,0 @@
-Qualcomm Technologies Inc. adreno/snapdragon eDP output
-
-Required properties:
-- compatible:
- * "qcom,mdss-edp"
-- reg: Physical base address and length of the registers of controller and PLL
-- reg-names: The names of register regions. The following regions are required:
- * "edp"
- * "pll_base"
-- interrupts: The interrupt signal from the eDP block.
-- power-domains: Should be <&mmcc MDSS_GDSC>.
-- clocks: device clocks
- See Documentation/devicetree/bindings/clock/clock-bindings.txt for details.
-- clock-names: the following clocks are required:
- * "core"
- * "iface"
- * "mdp_core"
- * "pixel"
- * "link"
-- #clock-cells: The value should be 1.
-- vdda-supply: phandle to vdda regulator device node
-- lvl-vdd-supply: phandle to regulator device node which is used to supply power
- to HPD receiving chip
-- panel-en-gpios: GPIO pin to supply power to panel.
-- panel-hpd-gpios: GPIO pin used for eDP hpd.
-
-
-Example:
- mdss_edp: qcom,mdss_edp@fd923400 {
- compatible = "qcom,mdss-edp";
- reg-names =
- "edp",
- "pll_base";
- reg = <0xfd923400 0x700>,
- <0xfd923a00 0xd4>;
- interrupt-parent = <&mdss_mdp>;
- interrupts = <12 0>;
- power-domains = <&mmcc MDSS_GDSC>;
- clock-names =
- "core",
- "pixel",
- "iface",
- "link",
- "mdp_core";
- clocks =
- <&mmcc MDSS_EDPAUX_CLK>,
- <&mmcc MDSS_EDPPIXEL_CLK>,
- <&mmcc MDSS_AHB_CLK>,
- <&mmcc MDSS_EDPLINK_CLK>,
- <&mmcc MDSS_MDP_CLK>;
- #clock-cells = <1>;
- vdda-supply = <&pma8084_l12>;
- lvl-vdd-supply = <&lvl_vreg>;
- panel-en-gpios = <&tlmm 137 0>;
- panel-hpd-gpios = <&tlmm 103 0>;
- };
diff --git a/Documentation/devicetree/bindings/display/msm/gmu.yaml b/Documentation/devicetree/bindings/display/msm/gmu.yaml
index fe55611d2603..029d72822d8b 100644
--- a/Documentation/devicetree/bindings/display/msm/gmu.yaml
+++ b/Documentation/devicetree/bindings/display/msm/gmu.yaml
@@ -3,10 +3,10 @@
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/display/msm/gmu.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/display/msm/gmu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Devicetree bindings for the GMU attached to certain Adreno GPUs
+title: GMU attached to certain Adreno GPUs
maintainers:
- Rob Clark <robdclark@gmail.com>
@@ -20,35 +20,24 @@ description: |
properties:
compatible:
items:
- - enum:
- - qcom,adreno-gmu-630.2
+ - pattern: '^qcom,adreno-gmu-6[0-9][0-9]\.[0-9]$'
- const: qcom,adreno-gmu
reg:
- items:
- - description: Core GMU registers
- - description: GMU PDC registers
- - description: GMU PDC sequence registers
+ minItems: 3
+ maxItems: 4
reg-names:
- items:
- - const: gmu
- - const: gmu_pdc
- - const: gmu_pdc_seq
+ minItems: 3
+ maxItems: 4
clocks:
- items:
- - description: GMU clock
- - description: GPU CX clock
- - description: GPU AXI clock
- - description: GPU MEMNOC clock
+ minItems: 4
+ maxItems: 7
clock-names:
- items:
- - const: gmu
- - const: cxo
- - const: axi
- - const: memnoc
+ minItems: 4
+ maxItems: 7
interrupts:
items:
@@ -76,6 +65,9 @@ properties:
operating-points-v2: true
+ opp-table:
+ type: object
+
required:
- compatible
- reg
@@ -91,6 +83,140 @@ required:
additionalProperties: false
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,adreno-gmu-618.0
+ - qcom,adreno-gmu-630.2
+ then:
+ properties:
+ reg:
+ items:
+ - description: Core GMU registers
+ - description: GMU PDC registers
+ - description: GMU PDC sequence registers
+ reg-names:
+ items:
+ - const: gmu
+ - const: gmu_pdc
+ - const: gmu_pdc_seq
+ clocks:
+ items:
+ - description: GMU clock
+ - description: GPU CX clock
+ - description: GPU AXI clock
+ - description: GPU MEMNOC clock
+ clock-names:
+ items:
+ - const: gmu
+ - const: cxo
+ - const: axi
+ - const: memnoc
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,adreno-gmu-635.0
+ then:
+ properties:
+ reg:
+ items:
+ - description: Core GMU registers
+ - description: Resource controller registers
+ - description: GMU PDC registers
+ reg-names:
+ items:
+ - const: gmu
+ - const: rscc
+ - const: gmu_pdc
+ clocks:
+ items:
+ - description: GMU clock
+ - description: GPU CX clock
+ - description: GPU AXI clock
+ - description: GPU MEMNOC clock
+ - description: GPU AHB clock
+ - description: GPU HUB CX clock
+ - description: GPU SMMU vote clock
+ clock-names:
+ items:
+ - const: gmu
+ - const: cxo
+ - const: axi
+ - const: memnoc
+ - const: ahb
+ - const: hub
+ - const: smmu_vote
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,adreno-gmu-640.1
+ then:
+ properties:
+ reg:
+ items:
+ - description: Core GMU registers
+ - description: GMU PDC registers
+ - description: GMU PDC sequence registers
+ reg-names:
+ items:
+ - const: gmu
+ - const: gmu_pdc
+ - const: gmu_pdc_seq
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,adreno-gmu-650.2
+ then:
+ properties:
+ reg:
+ items:
+ - description: Core GMU registers
+ - description: Resource controller registers
+ - description: GMU PDC registers
+ - description: GMU PDC sequence registers
+ reg-names:
+ items:
+ - const: gmu
+ - const: rscc
+ - const: gmu_pdc
+ - const: gmu_pdc_seq
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,adreno-gmu-640.1
+ - qcom,adreno-gmu-650.2
+ then:
+ properties:
+ clocks:
+ items:
+ - description: GPU AHB clock
+ - description: GMU clock
+ - description: GPU CX clock
+ - description: GPU AXI clock
+ - description: GPU MEMNOC clock
+ clock-names:
+ items:
+ - const: ahb
+ - const: gmu
+ - const: cxo
+ - const: axi
+ - const: memnoc
+
examples:
- |
#include <dt-bindings/clock/qcom,gpucc-sdm845.h>
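
A node matching the adreno-gmu-630.2 branch of the if/then rules above looks roughly as follows; the register addresses, interrupt numbers and provider phandles follow the SDM845 layout and are assumptions here, not schema requirements:

    gmu: gmu@506a000 {
        compatible = "qcom,adreno-gmu-630.2", "qcom,adreno-gmu";

        reg = <0x506a000 0x30000>,
              <0xb280000 0x10000>,
              <0xb480000 0x10000>;
        reg-names = "gmu", "gmu_pdc", "gmu_pdc_seq";

        clocks = <&gpucc GPU_CC_CX_GMU_CLK>,
                 <&gpucc GPU_CC_CXO_CLK>,
                 <&gcc GCC_DDRSS_GPU_AXI_CLK>,
                 <&gcc GCC_GPU_MEMNOC_GFX_CLK>;
        clock-names = "gmu", "cxo", "axi", "memnoc";

        interrupts = <GIC_SPI 304 IRQ_TYPE_LEVEL_HIGH>,
                     <GIC_SPI 305 IRQ_TYPE_LEVEL_HIGH>;
        interrupt-names = "hfi", "gmu";

        power-domains = <&gpucc GPU_CX_GDSC>,
                        <&gpucc GPU_GX_GDSC>;
        power-domain-names = "cx", "gx";

        iommus = <&adreno_smmu 5>;

        operating-points-v2 = <&gmu_opp_table>;
    };
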
diff --git a/Documentation/devicetree/bindings/display/msm/gpu.txt b/Documentation/devicetree/bindings/display/msm/gpu.txt
deleted file mode 100644
index 090dcb3fc34d..000000000000
--- a/Documentation/devicetree/bindings/display/msm/gpu.txt
+++ /dev/null
@@ -1,157 +0,0 @@
-Qualcomm adreno/snapdragon GPU
-
-Required properties:
-- compatible: "qcom,adreno-XYZ.W", "qcom,adreno" or
- "amd,imageon-XYZ.W", "amd,imageon"
- for example: "qcom,adreno-306.0", "qcom,adreno"
- Note that you need to list the less specific "qcom,adreno" (since this
- is what the device is matched on), in addition to the more specific
- with the chip-id.
- If "amd,imageon" is used, there should be no top level msm device.
-- reg: Physical base address and length of the controller's registers.
-- interrupts: The interrupt signal from the gpu.
-- clocks: device clocks (if applicable)
- See ../clocks/clock-bindings.txt for details.
-- clock-names: the following clocks are required by a3xx, a4xx and a5xx
- cores:
- * "core"
- * "iface"
- * "mem_iface"
- For GMU attached devices the GPU clocks are not used and are not required. The
- following devices should not list clocks:
- - qcom,adreno-630.2
-- iommus: optional phandle to an adreno iommu instance
-- operating-points-v2: optional phandle to the OPP operating points
-- interconnects: optional phandle to an interconnect provider. See
- ../interconnect/interconnect.txt for details. Some A3xx and all A4xx platforms
- will have two paths; all others will have one path.
-- interconnect-names: The names of the interconnect paths that correspond to the
- interconnects property. Values must be gfx-mem and ocmem.
-- qcom,gmu: For GMU attached devices a phandle to the GMU device that will
- control the power for the GPU. Applicable targets:
- - qcom,adreno-630.2
-- zap-shader: For a5xx and a6xx devices this node contains a memory-region that
- points to reserved memory to store the zap shader that can be used to help
- bring the GPU out of secure mode.
-- firmware-name: optional property of the 'zap-shader' node, listing the
- relative path of the device specific zap firmware.
-- sram: phandle to the On Chip Memory (OCMEM) that's present on some a3xx and
- a4xx Snapdragon SoCs. See
- Documentation/devicetree/bindings/sram/qcom,ocmem.yaml.
-
-Optional properties:
-- #cooling-cells: The value must be 2. For details, please refer
- Documentation/devicetree/bindings/thermal/thermal-cooling-devices.yaml.
-
-Example 3xx/4xx:
-
-/ {
- ...
-
- gpu: adreno@fdb00000 {
- compatible = "qcom,adreno-330.2",
- "qcom,adreno";
- reg = <0xfdb00000 0x10000>;
- reg-names = "kgsl_3d0_reg_memory";
- interrupts = <GIC_SPI 33 IRQ_TYPE_LEVEL_HIGH>;
- interrupt-names = "kgsl_3d0_irq";
- clock-names = "core",
- "iface",
- "mem_iface";
- clocks = <&mmcc OXILI_GFX3D_CLK>,
- <&mmcc OXILICX_AHB_CLK>,
- <&mmcc OXILICX_AXI_CLK>;
- sram = <&gpu_sram>;
- power-domains = <&mmcc OXILICX_GDSC>;
- operating-points-v2 = <&gpu_opp_table>;
- iommus = <&gpu_iommu 0>;
- #cooling-cells = <2>;
- };
-
- gpu_sram: ocmem@fdd00000 {
- compatible = "qcom,msm8974-ocmem";
-
- reg = <0xfdd00000 0x2000>,
- <0xfec00000 0x180000>;
- reg-names = "ctrl",
- "mem";
-
- clocks = <&rpmcc RPM_SMD_OCMEMGX_CLK>,
- <&mmcc OCMEMCX_OCMEMNOC_CLK>;
- clock-names = "core",
- "iface";
-
- #address-cells = <1>;
- #size-cells = <1>;
-
- gpu_sram: gpu-sram@0 {
- reg = <0x0 0x100000>;
- ranges = <0 0 0xfec00000 0x100000>;
- };
- };
-};
-
-Example a6xx (with GMU):
-
-/ {
- ...
-
- gpu@5000000 {
- compatible = "qcom,adreno-630.2", "qcom,adreno";
- #stream-id-cells = <16>;
-
- reg = <0x5000000 0x40000>, <0x509e000 0x10>;
- reg-names = "kgsl_3d0_reg_memory", "cx_mem";
-
- #cooling-cells = <2>;
-
- /*
- * Look ma, no clocks! The GPU clocks and power are
- * controlled entirely by the GMU
- */
-
- interrupts = <GIC_SPI 300 IRQ_TYPE_LEVEL_HIGH>;
-
- iommus = <&adreno_smmu 0>;
-
- operating-points-v2 = <&gpu_opp_table>;
-
- interconnects = <&rsc_hlos MASTER_GFX3D &rsc_hlos SLAVE_EBI1>;
- interconnect-names = "gfx-mem";
-
- gpu_opp_table: opp-table {
- compatible = "operating-points-v2";
-
- opp-430000000 {
- opp-hz = /bits/ 64 <430000000>;
- opp-level = <RPMH_REGULATOR_LEVEL_SVS_L1>;
- opp-peak-kBps = <5412000>;
- };
-
- opp-355000000 {
- opp-hz = /bits/ 64 <355000000>;
- opp-level = <RPMH_REGULATOR_LEVEL_SVS>;
- opp-peak-kBps = <3072000>;
- };
-
- opp-267000000 {
- opp-hz = /bits/ 64 <267000000>;
- opp-level = <RPMH_REGULATOR_LEVEL_LOW_SVS>;
- opp-peak-kBps = <3072000>;
- };
-
- opp-180000000 {
- opp-hz = /bits/ 64 <180000000>;
- opp-level = <RPMH_REGULATOR_LEVEL_MIN_SVS>;
- opp-peak-kBps = <1804000>;
- };
- };
-
- qcom,gmu = <&gmu>;
-
- zap-shader {
- memory-region = <&zap_shader_region>;
- firmware-name = "qcom/LENOVO/81JL/qcdxkmsuc850.mbn"
- };
- };
-};
diff --git a/Documentation/devicetree/bindings/display/msm/gpu.yaml b/Documentation/devicetree/bindings/display/msm/gpu.yaml
new file mode 100644
index 000000000000..5dabe7b6794b
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/gpu.yaml
@@ -0,0 +1,294 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+
+$id: http://devicetree.org/schemas/display/msm/gpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Adreno or Snapdragon GPUs
+
+maintainers:
+ - Rob Clark <robdclark@gmail.com>
+
+properties:
+ compatible:
+ oneOf:
+      - description: |
+          The driver parses the Adreno compatible string to
+          determine the GPU ID and patch level.
+ items:
+ - pattern: '^qcom,adreno-[3-6][0-9][0-9]\.[0-9]$'
+ - const: qcom,adreno
+      - description: |
+          The driver parses the Imageon compatible string to
+          determine the GPU ID and patch level.
+ items:
+ - pattern: '^amd,imageon-200\.[0-1]$'
+ - const: amd,imageon
+
+ clocks: true
+
+ clock-names: true
+
+ reg:
+ minItems: 1
+ maxItems: 3
+
+ reg-names:
+ minItems: 1
+ items:
+ - const: kgsl_3d0_reg_memory
+ - const: cx_mem
+ - const: cx_dbgc
+
+ interrupts:
+ maxItems: 1
+
+ interrupt-names:
+ maxItems: 1
+
+ interconnects:
+ minItems: 1
+ maxItems: 2
+
+ interconnect-names:
+ minItems: 1
+ items:
+ - const: gfx-mem
+ - const: ocmem
+
+ iommus:
+ minItems: 1
+ maxItems: 64
+
+ sram:
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ minItems: 1
+ maxItems: 4
+ items:
+ maxItems: 1
+    description: |
+      Phandles to one or more reserved on-chip SRAM regions, such as the
+      On Chip Memory (OCMEM) present on some a3xx and a4xx Snapdragon SoCs.
+      See Documentation/devicetree/bindings/sram/qcom,ocmem.yaml
+
+ operating-points-v2: true
+ opp-table:
+ type: object
+
+ power-domains:
+ maxItems: 1
+
+ zap-shader:
+ type: object
+ additionalProperties: false
+    description: |
+      For a5xx and a6xx devices, this node contains a memory-region
+      pointing to reserved memory that stores the zap shader used to
+      bring the GPU out of secure mode.
+ properties:
+ memory-region:
+ maxItems: 1
+
+ firmware-name:
+ description: |
+ Default name of the firmware to load to the remote processor.
+
+ "#cooling-cells":
+ const: 2
+
+ nvmem-cell-names:
+ maxItems: 1
+
+ nvmem-cells:
+ description: efuse registers
+ maxItems: 1
+
+ qcom,gmu:
+ $ref: /schemas/types.yaml#/definitions/phandle
+    description: |
+      For GMU-attached devices, a phandle to the GMU device that
+      controls power for the GPU.
+
+required:
+ - compatible
+ - reg
+ - interrupts
+
+additionalProperties: false
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ pattern: '^qcom,adreno-[3-5][0-9][0-9]\.[0-9]$'
+
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ maxItems: 7
+
+ clock-names:
+ items:
+ anyOf:
+ - const: core
+ description: GPU Core clock
+ - const: iface
+ description: GPU Interface clock
+ - const: mem
+ description: GPU Memory clock
+ - const: mem_iface
+ description: GPU Memory Interface clock
+ - const: alt_mem_iface
+ description: GPU Alternative Memory Interface clock
+ - const: gfx3d
+ description: GPU 3D engine clock
+ - const: rbbmtimer
+ description: GPU RBBM Timer for Adreno 5xx series
+ - const: rbcpr
+ description: GPU RB Core Power Reduction clock
+ minItems: 2
+ maxItems: 7
+
+ required:
+ - clocks
+ - clock-names
+ - if:
+ properties:
+ compatible:
+ contains:
+ pattern: '^qcom,adreno-6[0-9][0-9]\.[0-9]$'
+
+    then: # Adreno 6xx series clocks are defined in the GMU node
+ properties:
+ clocks: false
+ clock-names: false
+
+examples:
+ - |
+
+ // Example a3xx/4xx:
+
+ #include <dt-bindings/clock/qcom,mmcc-msm8974.h>
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ gpu: gpu@fdb00000 {
+ compatible = "qcom,adreno-330.2", "qcom,adreno";
+
+ reg = <0xfdb00000 0x10000>;
+ reg-names = "kgsl_3d0_reg_memory";
+
+ clock-names = "core", "iface", "mem_iface";
+ clocks = <&mmcc OXILI_GFX3D_CLK>,
+ <&mmcc OXILICX_AHB_CLK>,
+ <&mmcc OXILICX_AXI_CLK>;
+
+ interrupts = <GIC_SPI 33 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "kgsl_3d0_irq";
+
+ sram = <&gpu_sram>;
+ power-domains = <&mmcc OXILICX_GDSC>;
+ operating-points-v2 = <&gpu_opp_table>;
+ iommus = <&gpu_iommu 0>;
+ #cooling-cells = <2>;
+ };
+
+ ocmem@fdd00000 {
+ compatible = "qcom,msm8974-ocmem";
+
+ reg = <0xfdd00000 0x2000>,
+ <0xfec00000 0x180000>;
+ reg-names = "ctrl", "mem";
+
+ clocks = <&rpmcc RPM_SMD_OCMEMGX_CLK>,
+ <&mmcc OCMEMCX_OCMEMNOC_CLK>;
+ clock-names = "core", "iface";
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges = <0 0xfec00000 0x100000>;
+
+ gpu_sram: gpu-sram@0 {
+ reg = <0x0 0x100000>;
+ };
+ };
+ - |
+
+ // Example a6xx (with GMU):
+
+ #include <dt-bindings/clock/qcom,gpucc-sdm845.h>
+ #include <dt-bindings/clock/qcom,gcc-sdm845.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sdm845.h>
+
+ reserved-memory {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ zap_shader_region: gpu@8f200000 {
+ compatible = "shared-dma-pool";
+ reg = <0x0 0x90b00000 0x0 0xa00000>;
+ no-map;
+ };
+ };
+
+ gpu@5000000 {
+ compatible = "qcom,adreno-630.2", "qcom,adreno";
+
+ reg = <0x5000000 0x40000>, <0x509e000 0x10>;
+ reg-names = "kgsl_3d0_reg_memory", "cx_mem";
+
+ #cooling-cells = <2>;
+
+ interrupts = <GIC_SPI 300 IRQ_TYPE_LEVEL_HIGH>;
+
+ iommus = <&adreno_smmu 0>;
+
+ operating-points-v2 = <&gpu_opp_table>;
+
+ interconnects = <&rsc_hlos MASTER_GFX3D &rsc_hlos SLAVE_EBI1>;
+ interconnect-names = "gfx-mem";
+
+ qcom,gmu = <&gmu>;
+
+ gpu_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-430000000 {
+ opp-hz = /bits/ 64 <430000000>;
+ opp-level = <RPMH_REGULATOR_LEVEL_SVS_L1>;
+ opp-peak-kBps = <5412000>;
+ };
+
+ opp-355000000 {
+ opp-hz = /bits/ 64 <355000000>;
+ opp-level = <RPMH_REGULATOR_LEVEL_SVS>;
+ opp-peak-kBps = <3072000>;
+ };
+
+ opp-267000000 {
+ opp-hz = /bits/ 64 <267000000>;
+ opp-level = <RPMH_REGULATOR_LEVEL_LOW_SVS>;
+ opp-peak-kBps = <3072000>;
+ };
+
+ opp-180000000 {
+ opp-hz = /bits/ 64 <180000000>;
+ opp-level = <RPMH_REGULATOR_LEVEL_MIN_SVS>;
+ opp-peak-kBps = <1804000>;
+ };
+ };
+
+ zap-shader {
+ memory-region = <&zap_shader_region>;
+ firmware-name = "qcom/LENOVO/81JL/qcdxkmsuc850.mbn";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/msm/hdmi.txt b/Documentation/devicetree/bindings/display/msm/hdmi.txt
deleted file mode 100644
index 5f90a40da51b..000000000000
--- a/Documentation/devicetree/bindings/display/msm/hdmi.txt
+++ /dev/null
@@ -1,99 +0,0 @@
-Qualcomm adreno/snapdragon hdmi output
-
-Required properties:
-- compatible: one of the following
- * "qcom,hdmi-tx-8996"
- * "qcom,hdmi-tx-8994"
- * "qcom,hdmi-tx-8084"
- * "qcom,hdmi-tx-8974"
- * "qcom,hdmi-tx-8660"
- * "qcom,hdmi-tx-8960"
-- reg: Physical base address and length of the controller's registers
-- reg-names: "core_physical"
-- interrupts: The interrupt signal from the hdmi block.
-- power-domains: Should be <&mmcc MDSS_GDSC>.
-- clocks: device clocks
- See ../clocks/clock-bindings.txt for details.
-- core-vdda-supply: phandle to supply regulator
-- hdmi-mux-supply: phandle to mux regulator
-- phys: the phandle for the HDMI PHY device
-- phy-names: the name of the corresponding PHY device
-
-Optional properties:
-- hpd-gpios: hpd pin
-- qcom,hdmi-tx-mux-en-gpios: hdmi mux enable pin
-- qcom,hdmi-tx-mux-sel-gpios: hdmi mux select pin
-- qcom,hdmi-tx-mux-lpm-gpios: hdmi mux lpm pin
-- power-domains: reference to the power domain(s), if available.
-- pinctrl-names: the pin control state names; should contain "default"
-- pinctrl-0: the default pinctrl state (active)
-- pinctrl-1: the "sleep" pinctrl state
-
-HDMI PHY:
-Required properties:
-- compatible: Could be the following
- * "qcom,hdmi-phy-8660"
- * "qcom,hdmi-phy-8960"
- * "qcom,hdmi-phy-8974"
- * "qcom,hdmi-phy-8084"
- * "qcom,hdmi-phy-8996"
-- #phy-cells: Number of cells in a PHY specifier; Should be 0.
-- reg: Physical base address and length of the registers of the PHY sub blocks.
-- reg-names: The names of register regions. The following regions are required:
- * "hdmi_phy"
- * "hdmi_pll"
- For HDMI PHY on msm8996, these additional register regions are required:
- * "hdmi_tx_l0"
- * "hdmi_tx_l1"
- * "hdmi_tx_l3"
- * "hdmi_tx_l4"
-- power-domains: Should be <&mmcc MDSS_GDSC>.
-- clocks: device clocks
- See Documentation/devicetree/bindings/clock/clock-bindings.txt for details.
-- core-vdda-supply: phandle to vdda regulator device node
-
-Example:
-
-/ {
- ...
-
- hdmi: hdmi@4a00000 {
- compatible = "qcom,hdmi-tx-8960";
- reg-names = "core_physical";
- reg = <0x04a00000 0x2f0>;
- interrupts = <GIC_SPI 79 0>;
- power-domains = <&mmcc MDSS_GDSC>;
- clock-names =
- "core",
- "master_iface",
- "slave_iface";
- clocks =
- <&mmcc HDMI_APP_CLK>,
- <&mmcc HDMI_M_AHB_CLK>,
- <&mmcc HDMI_S_AHB_CLK>;
- qcom,hdmi-tx-ddc-clk = <&msmgpio 70 GPIO_ACTIVE_HIGH>;
- qcom,hdmi-tx-ddc-data = <&msmgpio 71 GPIO_ACTIVE_HIGH>;
- qcom,hdmi-tx-hpd = <&msmgpio 72 GPIO_ACTIVE_HIGH>;
- core-vdda-supply = <&pm8921_hdmi_mvs>;
- hdmi-mux-supply = <&ext_3p3v>;
- pinctrl-names = "default", "sleep";
- pinctrl-0 = <&hpd_active &ddc_active &cec_active>;
- pinctrl-1 = <&hpd_suspend &ddc_suspend &cec_suspend>;
-
- phys = <&hdmi_phy>;
- phy-names = "hdmi_phy";
- };
-
- hdmi_phy: phy@4a00400 {
- compatible = "qcom,hdmi-phy-8960";
- reg-names = "hdmi_phy",
- "hdmi_pll";
- reg = <0x4a00400 0x60>,
- <0x4a00500 0x100>;
- #phy-cells = <0>;
- power-domains = <&mmcc MDSS_GDSC>;
- clock-names = "slave_iface";
- clocks = <&mmcc HDMI_S_AHB_CLK>;
- core-vdda-supply = <&pm8921_hdmi_mvs>;
- };
-};
diff --git a/Documentation/devicetree/bindings/display/msm/hdmi.yaml b/Documentation/devicetree/bindings/display/msm/hdmi.yaml
new file mode 100644
index 000000000000..47e97669821c
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/hdmi.yaml
@@ -0,0 +1,232 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+
+$id: http://devicetree.org/schemas/display/msm/hdmi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Adreno/Snapdragon HDMI output
+
+maintainers:
+ - Rob Clark <robdclark@gmail.com>
+
+properties:
+ compatible:
+ enum:
+ - qcom,hdmi-tx-8084
+ - qcom,hdmi-tx-8660
+ - qcom,hdmi-tx-8960
+ - qcom,hdmi-tx-8974
+ - qcom,hdmi-tx-8994
+ - qcom,hdmi-tx-8996
+
+ clocks:
+ minItems: 1
+ maxItems: 5
+
+ clock-names:
+ minItems: 1
+ maxItems: 5
+
+ reg:
+ minItems: 1
+ maxItems: 3
+
+ reg-names:
+ minItems: 1
+ items:
+ - const: core_physical
+ - const: qfprom_physical
+ - const: hdcp_physical
+
+ interrupts:
+ maxItems: 1
+
+ phys:
+ maxItems: 1
+
+ phy-names:
+ enum:
+ - hdmi_phy
+ - hdmi-phy
+ deprecated: true
+
+ core-vdda-supply:
+ description: phandle to VDDA supply regulator
+
+ hdmi-mux-supply:
+ description: phandle to mux regulator
+ deprecated: true
+
+ core-vcc-supply:
+ description: phandle to VCC supply regulator
+
+ hpd-gpios:
+ maxItems: 1
+ description: hpd pin
+
+ qcom,hdmi-tx-mux-en-gpios:
+ maxItems: 1
+ deprecated: true
+ description: HDMI mux enable pin
+
+ qcom,hdmi-tx-mux-sel-gpios:
+ maxItems: 1
+ deprecated: true
+ description: HDMI mux select pin
+
+ qcom,hdmi-tx-mux-lpm-gpios:
+ maxItems: 1
+ deprecated: true
+ description: HDMI mux lpm pin
+
+ '#sound-dai-cells':
+ const: 1
+
+ ports:
+ type: object
+ $ref: /schemas/graph.yaml#/properties/ports
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ description: |
+ Input endpoints of the controller.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ description: |
+ Output endpoints of the controller.
+
+ required:
+ - port@0
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - reg
+ - reg-names
+ - interrupts
+ - phys
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,hdmi-tx-8960
+ - qcom,hdmi-tx-8660
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ clock-names:
+ items:
+ - const: core
+ - const: master_iface
+ - const: slave_iface
+        core-vcc-supply: false
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,hdmi-tx-8974
+ - qcom,hdmi-tx-8084
+ - qcom,hdmi-tx-8994
+ - qcom,hdmi-tx-8996
+ then:
+ properties:
+ clocks:
+ minItems: 5
+ clock-names:
+ items:
+ - const: mdp_core
+ - const: iface
+ - const: core
+ - const: alt_iface
+ - const: extp
+        hdmi-mux-supply: false
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ hdmi: hdmi@4a00000 {
+ compatible = "qcom,hdmi-tx-8960";
+ reg-names = "core_physical";
+ reg = <0x04a00000 0x2f0>;
+ interrupts = <GIC_SPI 79 IRQ_TYPE_LEVEL_HIGH>;
+ clock-names = "core",
+ "master_iface",
+ "slave_iface";
+ clocks = <&clk 61>,
+ <&clk 72>,
+ <&clk 98>;
+ hpd-gpios = <&msmgpio 72 GPIO_ACTIVE_HIGH>;
+ core-vdda-supply = <&pm8921_hdmi_mvs>;
+ hdmi-mux-supply = <&ext_3p3v>;
+ pinctrl-names = "default", "sleep";
+ pinctrl-0 = <&hpd_active &ddc_active &cec_active>;
+ pinctrl-1 = <&hpd_suspend &ddc_suspend &cec_suspend>;
+
+ phys = <&hdmi_phy>;
+ };
+ - |
+ #include <dt-bindings/clock/qcom,gcc-msm8996.h>
+ #include <dt-bindings/clock/qcom,mmcc-msm8996.h>
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ hdmi@9a0000 {
+ compatible = "qcom,hdmi-tx-8996";
+ reg = <0x009a0000 0x50c>,
+ <0x00070000 0x6158>,
+ <0x009e0000 0xfff>;
+ reg-names = "core_physical",
+ "qfprom_physical",
+ "hdcp_physical";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <8 IRQ_TYPE_LEVEL_HIGH>;
+
+ clocks = <&mmcc MDSS_MDP_CLK>,
+ <&mmcc MDSS_AHB_CLK>,
+ <&mmcc MDSS_HDMI_CLK>,
+ <&mmcc MDSS_HDMI_AHB_CLK>,
+ <&mmcc MDSS_EXTPCLK_CLK>;
+ clock-names = "mdp_core",
+ "iface",
+ "core",
+ "alt_iface",
+ "extp";
+
+ phys = <&hdmi_phy>;
+ #sound-dai-cells = <1>;
+
+ pinctrl-names = "default", "sleep";
+ pinctrl-0 = <&hdmi_hpd_active &hdmi_ddc_active>;
+ pinctrl-1 = <&hdmi_hpd_suspend &hdmi_ddc_suspend>;
+
+ core-vdda-supply = <&vreg_l12a_1p8>;
+ core-vcc-supply = <&vreg_s4a_1p8>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&mdp5_intf3_out>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/mdp4.txt b/Documentation/devicetree/bindings/display/msm/mdp4.txt
deleted file mode 100644
index b07eeb38f709..000000000000
--- a/Documentation/devicetree/bindings/display/msm/mdp4.txt
+++ /dev/null
@@ -1,114 +0,0 @@
-Qualcomm adreno/snapdragon MDP4 display controller
-
-Description:
-
-This is the bindings documentation for the MDP4 display controller found in
-SoCs like MSM8960, APQ8064 and MSM8660.
-
-Required properties:
-- compatible:
- * "qcom,mdp4" - mdp4
-- reg: Physical base address and length of the controller's registers.
-- interrupts: The interrupt signal from the display controller.
-- clocks: device clocks
- See ../clocks/clock-bindings.txt for details.
-- clock-names: the following clocks are required.
- * "core_clk"
- * "iface_clk"
- * "bus_clk"
- * "lut_clk"
- * "hdmi_clk"
- * "tv_clk"
-- ports: contains the list of output ports from MDP. These connect to interfaces
- that are external to the MDP hardware, such as HDMI, DSI, EDP etc (LVDS is a
- special case since it is a part of the MDP block itself).
-
- Each output port contains an endpoint that describes how it is connected to an
- external interface. These are described by the standard properties documented
- here:
- Documentation/devicetree/bindings/graph.txt
- Documentation/devicetree/bindings/media/video-interfaces.txt
-
- The output port mappings are:
- Port 0 -> LCDC/LVDS
- Port 1 -> DSI1 Cmd/Video
- Port 2 -> DSI2 Cmd/Video
- Port 3 -> DTV
-
-Optional properties:
-- clock-names: the following clocks are optional:
- * "lut_clk"
-- qcom,lcdc-align-lsb: Boolean value indicating that LSB alignment should be
- used for LCDC. This is only valid for 18bpp panels.
-
-Example:
-
-/ {
- ...
-
- hdmi: hdmi@4a00000 {
- ...
- ports {
- ...
- port@0 {
- reg = <0>;
- hdmi_in: endpoint {
- remote-endpoint = <&mdp_dtv_out>;
- };
- };
- ...
- };
- ...
- };
-
- ...
-
- mdp: mdp@5100000 {
- compatible = "qcom,mdp4";
- reg = <0x05100000 0xf0000>;
- interrupts = <GIC_SPI 75 0>;
- clock-names =
- "core_clk",
- "iface_clk",
- "lut_clk",
- "hdmi_clk",
- "tv_clk";
- clocks =
- <&mmcc MDP_CLK>,
- <&mmcc MDP_AHB_CLK>,
- <&mmcc MDP_AXI_CLK>,
- <&mmcc MDP_LUT_CLK>,
- <&mmcc HDMI_TV_CLK>,
- <&mmcc MDP_TV_CLK>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- mdp_lvds_out: endpoint {
- };
- };
-
- port@1 {
- reg = <1>;
- mdp_dsi1_out: endpoint {
- };
- };
-
- port@2 {
- reg = <2>;
- mdp_dsi2_out: endpoint {
- };
- };
-
- port@3 {
- reg = <3>;
- mdp_dtv_out: endpoint {
- remote-endpoint = <&hdmi_in>;
- };
- };
- };
- };
-};
diff --git a/Documentation/devicetree/bindings/display/msm/mdp4.yaml b/Documentation/devicetree/bindings/display/msm/mdp4.yaml
new file mode 100644
index 000000000000..35204a287579
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/mdp4.yaml
@@ -0,0 +1,124 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/mdp4.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Adreno/Snapdragon MDP4 display controller
+
+description: >
+ MDP4 display controller found in SoCs like MSM8960, APQ8064 and MSM8660.
+
+maintainers:
+ - Rob Clark <robdclark@gmail.com>
+
+properties:
+ compatible:
+ const: qcom,mdp4
+
+ clocks:
+ minItems: 6
+ maxItems: 6
+
+ clock-names:
+ items:
+ - const: core_clk
+ - const: iface_clk
+ - const: bus_clk
+ - const: lut_clk
+ - const: hdmi_clk
+ - const: tv_clk
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ iommus:
+ maxItems: 4
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: LCDC/LVDS
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: DSI1 Cmd / Video
+
+ port@2:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: DSI2 Cmd / Video
+
+ port@3:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Digital TV
+
+ qcom,lcdc-align-lsb:
+ type: boolean
+ description: >
+ Indication that LSB alignment should be used for LCDC.
+ This is only valid for 18bpp panels.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ mdp: mdp@5100000 {
+ compatible = "qcom,mdp4";
+ reg = <0x05100000 0xf0000>;
+ interrupts = <0 75 0>;
+ clock-names =
+ "core_clk",
+ "iface_clk",
+ "bus_clk",
+ "lut_clk",
+ "hdmi_clk",
+ "tv_clk";
+ clocks =
+ <&mmcc 77>,
+ <&mmcc 86>,
+ <&mmcc 102>,
+ <&mmcc 75>,
+ <&mmcc 97>,
+ <&mmcc 12>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ mdp_lvds_out: endpoint {
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ mdp_dsi1_out: endpoint {
+ };
+ };
+
+ port@2 {
+ reg = <2>;
+ mdp_dsi2_out: endpoint {
+ };
+ };
+
+ port@3 {
+ reg = <3>;
+ mdp_dtv_out: endpoint {
+ remote-endpoint = <&hdmi_in>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/msm/mdp5.txt b/Documentation/devicetree/bindings/display/msm/mdp5.txt
deleted file mode 100644
index 43d11279c925..000000000000
--- a/Documentation/devicetree/bindings/display/msm/mdp5.txt
+++ /dev/null
@@ -1,160 +0,0 @@
-Qualcomm adreno/snapdragon MDP5 display controller
-
-Description:
-
-This is the bindings documentation for the Mobile Display Subsytem(MDSS) that
-encapsulates sub-blocks like MDP5, DSI, HDMI, eDP etc, and the MDP5 display
-controller found in SoCs like MSM8974, APQ8084, MSM8916, MSM8994 and MSM8996.
-
-MDSS:
-Required properties:
-- compatible:
- * "qcom,mdss" - MDSS
-- reg: Physical base address and length of the controller's registers.
-- reg-names: The names of register regions. The following regions are required:
- * "mdss_phys"
- * "vbif_phys"
-- interrupts: The interrupt signal from MDSS.
-- interrupt-controller: identifies the node as an interrupt controller.
-- #interrupt-cells: specifies the number of cells needed to encode an interrupt
- source, should be 1.
-- power-domains: a power domain consumer specifier according to
- Documentation/devicetree/bindings/power/power_domain.txt
-- clocks: device clocks. See ../clocks/clock-bindings.txt for details.
-- clock-names: the following clocks are required.
- * "iface"
- * "bus"
- * "vsync"
-- #address-cells: number of address cells for the MDSS children. Should be 1.
-- #size-cells: Should be 1.
-- ranges: parent bus address space is the same as the child bus address space.
-
-Optional properties:
-- clock-names: the following clocks are optional:
- * "lut"
-
-MDP5:
-Required properties:
-- compatible:
- * "qcom,mdp5" - MDP5
-- reg: Physical base address and length of the controller's registers.
-- reg-names: The names of register regions. The following regions are required:
- * "mdp_phys"
-- interrupts: Interrupt line from MDP5 to MDSS interrupt controller.
-- clocks: device clocks. See ../clocks/clock-bindings.txt for details.
-- clock-names: the following clocks are required.
-- * "bus"
-- * "iface"
-- * "core"
-- * "vsync"
-- ports: contains the list of output ports from MDP. These connect to interfaces
- that are external to the MDP hardware, such as HDMI, DSI, EDP etc (LVDS is a
- special case since it is a part of the MDP block itself).
-
- Each output port contains an endpoint that describes how it is connected to an
- external interface. These are described by the standard properties documented
- here:
- Documentation/devicetree/bindings/graph.txt
- Documentation/devicetree/bindings/media/video-interfaces.txt
-
- The availability of output ports can vary across SoC revisions:
-
- For MSM8974 and APQ8084:
- Port 0 -> MDP_INTF0 (eDP)
- Port 1 -> MDP_INTF1 (DSI1)
- Port 2 -> MDP_INTF2 (DSI2)
- Port 3 -> MDP_INTF3 (HDMI)
-
- For MSM8916:
- Port 0 -> MDP_INTF1 (DSI1)
-
- For MSM8994 and MSM8996:
- Port 0 -> MDP_INTF1 (DSI1)
- Port 1 -> MDP_INTF2 (DSI2)
- Port 2 -> MDP_INTF3 (HDMI)
-
-Optional properties:
-- clock-names: the following clocks are optional:
- * "lut"
- * "tbu"
- * "tbu_rt"
-
-Example:
-
-/ {
- ...
-
- mdss: mdss@1a00000 {
- compatible = "qcom,mdss";
- reg = <0x1a00000 0x1000>,
- <0x1ac8000 0x3000>;
- reg-names = "mdss_phys", "vbif_phys";
-
- power-domains = <&gcc MDSS_GDSC>;
-
- clocks = <&gcc GCC_MDSS_AHB_CLK>,
- <&gcc GCC_MDSS_AXI_CLK>,
- <&gcc GCC_MDSS_VSYNC_CLK>;
- clock-names = "iface",
- "bus",
- "vsync"
-
- interrupts = <0 72 0>;
-
- interrupt-controller;
- #interrupt-cells = <1>;
-
- #address-cells = <1>;
- #size-cells = <1>;
- ranges;
-
- mdp: mdp@1a01000 {
- compatible = "qcom,mdp5";
- reg = <0x1a01000 0x90000>;
- reg-names = "mdp_phys";
-
- interrupt-parent = <&mdss>;
- interrupts = <0 0>;
-
- clocks = <&gcc GCC_MDSS_AHB_CLK>,
- <&gcc GCC_MDSS_AXI_CLK>,
- <&gcc GCC_MDSS_MDP_CLK>,
- <&gcc GCC_MDSS_VSYNC_CLK>;
- clock-names = "iface",
- "bus",
- "core",
- "vsync";
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- mdp5_intf1_out: endpoint {
- remote-endpoint = <&dsi0_in>;
- };
- };
- };
- };
-
- dsi0: dsi@1a98000 {
- ...
- ports {
- ...
- port@0 {
- reg = <0>;
- dsi0_in: endpoint {
- remote-endpoint = <&mdp5_intf1_out>;
- };
- };
- ...
- };
- ...
- };
-
- dsi_phy0: dsi-phy@1a98300 {
- ...
- };
- };
-};
diff --git a/Documentation/devicetree/bindings/display/msm/mdss-common.yaml b/Documentation/devicetree/bindings/display/msm/mdss-common.yaml
new file mode 100644
index 000000000000..ccd7d6417523
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/mdss-common.yaml
@@ -0,0 +1,90 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/mdss-common.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display MDSS common properties
+
+maintainers:
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+ - Rob Clark <robdclark@gmail.com>
+
+description:
+  Device tree bindings for MSM Mobile Display Subsystem (MDSS) that
+  encapsulates sub-blocks like the DPU display controller, DSI and DP
+  interfaces, etc.
+
+# Do not select this by default, otherwise it is also selected for qcom,mdss
+# devices.
+select:
+ false
+
+properties:
+ $nodename:
+ pattern: "^display-subsystem@[0-9a-f]+$"
+
+ reg:
+ maxItems: 1
+
+ reg-names:
+ const: mdss
+
+ power-domains:
+ maxItems: 1
+
+ clocks:
+ minItems: 2
+ maxItems: 4
+
+ clock-names:
+ minItems: 2
+ maxItems: 4
+
+ interrupts:
+ maxItems: 1
+
+ interrupt-controller: true
+
+ "#address-cells": true
+
+ "#size-cells": true
+
+ "#interrupt-cells":
+ const: 1
+
+ iommus:
+ minItems: 1
+ items:
+ - description: Phandle to apps_smmu node with SID mask for Hard-Fail port0
+ - description: Phandle to apps_smmu node with SID mask for Hard-Fail port1
+
+ ranges: true
+
+ interconnects:
+ minItems: 1
+ items:
+ - description: Interconnect path from mdp0 (or a single mdp) port to the data bus
+ - description: Interconnect path from mdp1 port to the data bus
+
+ interconnect-names:
+ minItems: 1
+ items:
+ - const: mdp0-mem
+ - const: mdp1-mem
+
+ resets:
+ items:
+ - description: MDSS_CORE reset
+
+required:
+ - reg
+ - reg-names
+ - power-domains
+ - clocks
+ - interrupts
+ - interrupt-controller
+ - iommus
+ - ranges
+
+additionalProperties: true
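
This common schema deliberately carries no example of its own. The node shape it implies is sketched below, borrowing the compatible, clock list and addresses from the MSM8998 variant later in the series; all values are illustrative, and each SoC schema constrains them further:

    display-subsystem@c900000 {
        compatible = "qcom,msm8998-mdss";
        reg = <0x0c900000 0x1000>;
        reg-names = "mdss";

        power-domains = <&mmcc MDSS_GDSC>;

        clocks = <&mmcc MDSS_AHB_CLK>,
                 <&mmcc MDSS_AXI_CLK>,
                 <&mmcc MDSS_MDP_CLK>;
        clock-names = "iface", "bus", "core";

        interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
        interrupt-controller;
        #interrupt-cells = <1>;

        iommus = <&mmss_smmu 0>;

        #address-cells = <1>;
        #size-cells = <1>;
        ranges;

        /* DPU, DSI and PHY child nodes go here */
    };
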
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,mdp5.yaml b/Documentation/devicetree/bindings/display/msm/qcom,mdp5.yaml
new file mode 100644
index 000000000000..a763cf8da122
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,mdp5.yaml
@@ -0,0 +1,156 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,mdp5.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Adreno/Snapdragon Mobile Display controller (MDP5)
+
+description:
+ MDP5 display controller found in SoCs like MSM8974, APQ8084, MSM8916, MSM8994
+ and MSM8996.
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+ - Rob Clark <robdclark@gmail.com>
+
+properties:
+ compatible:
+ oneOf:
+ - const: qcom,mdp5
+ deprecated: true
+ - items:
+ - enum:
+ - qcom,apq8084-mdp5
+ - qcom,msm8916-mdp5
+ - qcom,msm8917-mdp5
+ - qcom,msm8953-mdp5
+ - qcom,msm8974-mdp5
+ - qcom,msm8976-mdp5
+ - qcom,msm8994-mdp5
+ - qcom,msm8996-mdp5
+ - qcom,sdm630-mdp5
+ - qcom,sdm660-mdp5
+ - const: qcom,mdp5
+
+ $nodename:
+ pattern: '^display-controller@[0-9a-f]+$'
+
+ reg:
+ maxItems: 1
+
+ reg-names:
+ items:
+ - const: mdp_phys
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ minItems: 4
+ maxItems: 7
+
+ clock-names:
+ oneOf:
+ - minItems: 4
+ items:
+ - const: iface
+ - const: bus
+ - const: core
+ - const: vsync
+ - const: lut
+ - const: tbu
+ - const: tbu_rt
+ # MSM8996 has additional iommu clock
+ - items:
+ - const: iface
+ - const: bus
+ - const: core
+ - const: iommu
+ - const: vsync
+
+ interconnects:
+ minItems: 1
+ items:
+ - description: Interconnect path from mdp0 (or a single mdp) port to the data bus
+ - description: Interconnect path from mdp1 port to the data bus
+ - description: Interconnect path from rotator port to the data bus
+
+ interconnect-names:
+ minItems: 1
+ items:
+ - const: mdp0-mem
+ - const: mdp1-mem
+ - const: rotator-mem
+
+ iommus:
+ items:
+ - description: apps SMMU with the Stream-ID mask for Hard-Fail port0
+
+ power-domains:
+ maxItems: 1
+
+ operating-points-v2: true
+ opp-table:
+ type: object
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ description: >
+      Contains the list of output ports from the MDP5 device. These ports
+      connect to interfaces that are external to the MDP5 hardware,
+      such as DSI, HDMI etc. MDP5 devices support up to 4 ports:
+      one or two DSI ports, HDMI and eDP.
+
+ patternProperties:
+ "^port@[0-3]+$":
+ $ref: /schemas/graph.yaml#/properties/port
+
+ # at least one port is required
+ required:
+ - port@0
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-msm8916.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ display-controller@1a01000 {
+        compatible = "qcom,msm8916-mdp5", "qcom,mdp5";
+ reg = <0x1a01000 0x90000>;
+ reg-names = "mdp_phys";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ clocks = <&gcc GCC_MDSS_AHB_CLK>,
+ <&gcc GCC_MDSS_AXI_CLK>,
+ <&gcc GCC_MDSS_MDP_CLK>,
+ <&gcc GCC_MDSS_VSYNC_CLK>;
+ clock-names = "iface",
+ "bus",
+ "core",
+ "vsync";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,mdss.yaml
new file mode 100644
index 000000000000..b0100105e428
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,mdss.yaml
@@ -0,0 +1,211 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Mobile Display SubSystem (MDSS)
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+ - Rob Clark <robdclark@gmail.com>
+
+description:
+  This is the bindings documentation for the Mobile Display Subsystem (MDSS)
+  that encapsulates sub-blocks like MDP5, DSI, HDMI, eDP, etc.
+
+properties:
+ $nodename:
+ pattern: "^display-subsystem@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - qcom,mdss
+
+ reg:
+ minItems: 2
+ maxItems: 3
+
+ reg-names:
+ minItems: 2
+ items:
+ - const: mdss_phys
+ - const: vbif_phys
+ - const: vbif_nrt_phys
+
+ interrupts:
+ maxItems: 1
+
+ interrupt-controller: true
+
+ "#interrupt-cells":
+ const: 1
+
+ power-domains:
+ maxItems: 1
+ description: |
+ The MDSS power domain provided by GCC
+
+ clocks:
+ oneOf:
+ - minItems: 3
+ items:
+          - description: Display ahb clock
+ - description: Display axi clock
+ - description: Display vsync clock
+ - description: Display core clock
+ - minItems: 1
+ items:
+          - description: Display ahb clock
+ - description: Display core clock
+
+ clock-names:
+ oneOf:
+ - minItems: 3
+ items:
+ - const: iface
+ - const: bus
+ - const: vsync
+ - const: core
+ - minItems: 1
+ items:
+ - const: iface
+ - const: core
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 1
+
+ ranges: true
+
+ resets:
+ items:
+ - description: MDSS_CORE reset
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - interrupts
+ - interrupt-controller
+ - "#interrupt-cells"
+ - power-domains
+ - clocks
+ - clock-names
+ - "#address-cells"
+ - "#size-cells"
+ - ranges
+
+patternProperties:
+ "^display-controller@[1-9a-f][0-9a-f]*$":
+ type: object
+ additionalProperties: true
+ properties:
+ compatible:
+ contains:
+ const: qcom,mdp5
+
+ "^dsi@[1-9a-f][0-9a-f]*$":
+ type: object
+ additionalProperties: true
+ properties:
+ compatible:
+ contains:
+ const: qcom,mdss-dsi-ctrl
+
+ "^phy@[1-9a-f][0-9a-f]*$":
+ type: object
+ additionalProperties: true
+ properties:
+ compatible:
+ enum:
+ - qcom,dsi-phy-14nm
+ - qcom,dsi-phy-14nm-660
+ - qcom,dsi-phy-14nm-8953
+ - qcom,dsi-phy-20nm
+ - qcom,dsi-phy-28nm-hpm
+ - qcom,dsi-phy-28nm-lp
+ - qcom,hdmi-phy-8084
+ - qcom,hdmi-phy-8660
+ - qcom,hdmi-phy-8960
+ - qcom,hdmi-phy-8974
+ - qcom,hdmi-phy-8996
+
+ "^hdmi-tx@[1-9a-f][0-9a-f]*$":
+ type: object
+ additionalProperties: true
+ properties:
+ compatible:
+ enum:
+ - qcom,hdmi-tx-8084
+ - qcom,hdmi-tx-8660
+ - qcom,hdmi-tx-8960
+ - qcom,hdmi-tx-8974
+ - qcom,hdmi-tx-8994
+ - qcom,hdmi-tx-8996
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-msm8916.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ display-subsystem@1a00000 {
+ compatible = "qcom,mdss";
+ reg = <0x1a00000 0x1000>,
+ <0x1ac8000 0x3000>;
+ reg-names = "mdss_phys", "vbif_phys";
+
+ power-domains = <&gcc MDSS_GDSC>;
+
+ clocks = <&gcc GCC_MDSS_AHB_CLK>,
+ <&gcc GCC_MDSS_AXI_CLK>,
+ <&gcc GCC_MDSS_VSYNC_CLK>;
+ clock-names = "iface",
+ "bus",
+ "vsync";
+
+ interrupts = <GIC_SPI 72 IRQ_TYPE_LEVEL_HIGH>;
+
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges;
+
+ display-controller@1a01000 {
+ compatible = "qcom,msm8916-mdp5", "qcom,mdp5";
+ reg = <0x01a01000 0x89000>;
+ reg-names = "mdp_phys";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ clocks = <&gcc GCC_MDSS_AHB_CLK>,
+ <&gcc GCC_MDSS_AXI_CLK>,
+ <&gcc GCC_MDSS_MDP_CLK>,
+ <&gcc GCC_MDSS_VSYNC_CLK>;
+ clock-names = "iface",
+ "bus",
+ "core",
+ "vsync";
+
+ iommus = <&apps_iommu 4>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ mdp5_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,msm8998-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,msm8998-dpu.yaml
new file mode 100644
index 000000000000..8d3cd46260fb
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,msm8998-dpu.yaml
@@ -0,0 +1,101 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,msm8998-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display DPU on MSM8998
+
+maintainers:
+ - AngeloGioacchino Del Regno <angelogioacchino.delregno@somainline.org>
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,msm8998-dpu
+
+ reg:
+ items:
+ - description: Address offset and size for mdp register set
+ - description: Address offset and size for regdma register set
+ - description: Address offset and size for vbif register set
+ - description: Address offset and size for non-realtime vbif register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: regdma
+ - const: vbif
+ - const: vbif_nrt
+
+ clocks:
+ items:
+ - description: Display ahb clock
+ - description: Display axi clock
+ - description: Display mem-noc clock
+ - description: Display core clock
+ - description: Display vsync clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: bus
+ - const: mnoc
+ - const: core
+ - const: vsync
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,mmcc-msm8998.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@c901000 {
+ compatible = "qcom,msm8998-dpu";
+ reg = <0x0c901000 0x8f000>,
+ <0x0c9a8e00 0xf0>,
+ <0x0c9b0000 0x2008>,
+ <0x0c9b8000 0x1040>;
+ reg-names = "mdp", "regdma", "vbif", "vbif_nrt";
+
+ clocks = <&mmcc MDSS_AHB_CLK>,
+ <&mmcc MDSS_AXI_CLK>,
+ <&mmcc MNOC_AHB_CLK>,
+ <&mmcc MDSS_MDP_CLK>,
+ <&mmcc MDSS_VSYNC_CLK>;
+ clock-names = "iface", "bus", "mnoc", "core", "vsync";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmpd MSM8998_VDDMX>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+ };
+...
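
The example above hands operating-points-v2 an &mdp_opp_table that is defined elsewhere in the SoC dtsi. A sketch of the shape such a table takes follows; the two frequencies and the rpmpd corner labels (rpmpd_opp_low_svs, rpmpd_opp_svs_l1, following the usual qcom,rpmpd naming) are illustrative assumptions, not values from this binding:

    mdp_opp_table: opp-table {
        compatible = "operating-points-v2";

        opp-150000000 {
            opp-hz = /bits/ 64 <150000000>;
            required-opps = <&rpmpd_opp_low_svs>;
        };

        opp-412500000 {
            opp-hz = /bits/ 64 <412500000>;
            required-opps = <&rpmpd_opp_svs_l1>;
        };
    };
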
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,msm8998-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,msm8998-mdss.yaml
new file mode 100644
index 000000000000..3c2b6ed98a56
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,msm8998-mdss.yaml
@@ -0,0 +1,272 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,msm8998-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm MSM8998 Display MDSS
+
+maintainers:
+ - AngeloGioacchino Del Regno <angelogioacchino.delregno@somainline.org>
+
+description:
+ Device tree bindings for the MSM Mobile Display Subsystem (MDSS), which
+ encapsulates sub-blocks such as the DPU display controller and the DSI and
+ DP interfaces. These bindings describe the MDSS found on the MSM8998 SoC.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,msm8998-mdss
+
+ clocks:
+ items:
+ - description: Display AHB clock
+ - description: Display AXI clock
+ - description: Display core clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: bus
+ - const: core
+
+ iommus:
+ maxItems: 1
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,msm8998-dpu
+
+ "^dsi@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ items:
+ - const: qcom,msm8998-dsi-ctrl
+ - const: qcom,mdss-dsi-ctrl
+
+ "^phy@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,dsi-phy-10nm-8998
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,mmcc-msm8998.h>
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@c900000 {
+ compatible = "qcom,msm8998-mdss";
+ reg = <0x0c900000 0x1000>;
+ reg-names = "mdss";
+
+ clocks = <&mmcc MDSS_AHB_CLK>,
+ <&mmcc MDSS_AXI_CLK>,
+ <&mmcc MDSS_MDP_CLK>;
+ clock-names = "iface", "bus", "core";
+
+ #address-cells = <1>;
+ #interrupt-cells = <1>;
+ #size-cells = <1>;
+
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ iommus = <&mmss_smmu 0>;
+
+ power-domains = <&mmcc MDSS_GDSC>;
+ ranges;
+
+ display-controller@c901000 {
+ compatible = "qcom,msm8998-dpu";
+ reg = <0x0c901000 0x8f000>,
+ <0x0c9a8e00 0xf0>,
+ <0x0c9b0000 0x2008>,
+ <0x0c9b8000 0x1040>;
+ reg-names = "mdp", "regdma", "vbif", "vbif_nrt";
+
+ clocks = <&mmcc MDSS_AHB_CLK>,
+ <&mmcc MDSS_AXI_CLK>,
+ <&mmcc MNOC_AHB_CLK>,
+ <&mmcc MDSS_MDP_CLK>,
+ <&mmcc MDSS_VSYNC_CLK>;
+ clock-names = "iface", "bus", "mnoc", "core", "vsync";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmpd MSM8998_VDDMX>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dpu_intf2_out: endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+ };
+
+ dsi@c994000 {
+ compatible = "qcom,msm8998-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0c994000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <4>;
+
+ clocks = <&mmcc MDSS_BYTE0_CLK>,
+ <&mmcc MDSS_BYTE0_INTF_CLK>,
+ <&mmcc MDSS_PCLK0_CLK>,
+ <&mmcc MDSS_ESC0_CLK>,
+ <&mmcc MDSS_AHB_CLK>,
+ <&mmcc MDSS_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+ assigned-clocks = <&mmcc BYTE0_CLK_SRC>, <&mmcc PCLK0_CLK_SRC>;
+ assigned-clock-parents = <&dsi0_phy 0>, <&dsi0_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmpd MSM8998_VDDCX>;
+
+ phys = <&dsi0_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&dpu_intf1_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ };
+ };
+ };
+ };
+
+ dsi0_phy: phy@c994400 {
+ compatible = "qcom,dsi-phy-10nm-8998";
+ reg = <0x0c994400 0x200>,
+ <0x0c994600 0x280>,
+ <0x0c994a00 0x1e0>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&mmcc MDSS_AHB_CLK>,
+ <&rpmcc RPM_SMD_XO_CLK_SRC>;
+ clock-names = "iface", "ref";
+
+ vdds-supply = <&pm8998_l1>;
+ };
+
+ dsi@c996000 {
+ compatible = "qcom,msm8998-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0c996000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <5>;
+
+ clocks = <&mmcc MDSS_BYTE1_CLK>,
+ <&mmcc MDSS_BYTE1_INTF_CLK>,
+ <&mmcc MDSS_PCLK1_CLK>,
+ <&mmcc MDSS_ESC1_CLK>,
+ <&mmcc MDSS_AHB_CLK>,
+ <&mmcc MDSS_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+ assigned-clocks = <&mmcc BYTE1_CLK_SRC>, <&mmcc PCLK1_CLK_SRC>;
+ assigned-clock-parents = <&dsi1_phy 0>, <&dsi1_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmpd MSM8998_VDDCX>;
+
+ phys = <&dsi1_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi1_in: endpoint {
+ remote-endpoint = <&dpu_intf2_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi1_out: endpoint {
+ };
+ };
+ };
+ };
+
+ dsi1_phy: phy@c996400 {
+ compatible = "qcom,dsi-phy-10nm-8998";
+ reg = <0x0c996400 0x200>,
+ <0x0c996600 0x280>,
+ <0x0c996a00 0x10e>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&mmcc MDSS_AHB_CLK>,
+ <&rpmcc RPM_SMD_XO_CLK_SRC>;
+ clock-names = "iface", "ref";
+
+ vdds-supply = <&pm8998_l1>;
+ };
+ };
+...
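
One detail of the example worth calling out: each DSI PHY node declares #clock-cells = <1>, i.e. the PHY is itself a clock provider, which is what lets the controller reparent the MMCC byte/pixel RCGs onto the PHY PLL outputs. The fragment below restates the pairing from the example; the "output 0 = byte clock, output 1 = pixel clock" indexing is a driver convention, not something this schema spells out:

    /* From the dsi@c994000 node above */
    assigned-clocks = <&mmcc BYTE0_CLK_SRC>, <&mmcc PCLK0_CLK_SRC>;
    assigned-clock-parents = <&dsi0_phy 0>,   /* PHY byte clock output */
                             <&dsi0_phy 1>;   /* PHY pixel clock output */
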
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,qcm2290-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,qcm2290-dpu.yaml
new file mode 100644
index 000000000000..414f4e7ebdf1
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,qcm2290-dpu.yaml
@@ -0,0 +1,90 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,qcm2290-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display DPU on QCM2290
+
+maintainers:
+ - Loic Poulain <loic.poulain@linaro.org>
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,qcm2290-dpu
+
+ reg:
+ items:
+ - description: Address offset and size for mdp register set
+ - description: Address offset and size for vbif register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: vbif
+
+ clocks:
+ items:
+ - description: Display AXI clock from gcc
+ - description: Display AHB clock from dispcc
+ - description: Display core clock from dispcc
+ - description: Display lut clock from dispcc
+ - description: Display vsync clock from dispcc
+
+ clock-names:
+ items:
+ - const: bus
+ - const: iface
+ - const: core
+ - const: lut
+ - const: vsync
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-qcm2290.h>
+ #include <dt-bindings/clock/qcom,gcc-qcm2290.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@5e01000 {
+ compatible = "qcom,qcm2290-dpu";
+ reg = <0x05e01000 0x8f000>,
+ <0x05eb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus", "iface", "core", "lut", "vsync";
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmpd QCM2290_VDDCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+ };
+ };
+...
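
As in the other DPU examples, the controller's interrupt comes from the MDSS wrapper rather than the GIC: the wrapper is an interrupt controller with #interrupt-cells = <1> that demuxes its single GIC SPI to the sub-blocks, with a fixed line number per sub-block. In the surrounding examples the DPU sits on line 0 and the DSI controllers on lines 4 and 5:

    /* Fragment from the example above */
    interrupt-parent = <&mdss>;
    interrupts = <0>;    /* DPU line inside the MDSS interrupt demux */
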
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,qcm2290-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,qcm2290-mdss.yaml
new file mode 100644
index 000000000000..2995b84b2cd4
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,qcm2290-mdss.yaml
@@ -0,0 +1,200 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,qcm2290-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm QCM2290 Display MDSS
+
+maintainers:
+ - Loic Poulain <loic.poulain@linaro.org>
+
+description:
+ Device tree bindings for the MSM Mobile Display Subsystem (MDSS), which
+ encapsulates sub-blocks such as the DPU display controller and the DSI
+ interface. These bindings describe the MDSS found on the QCM2290 SoC.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,qcm2290-mdss
+
+ clocks:
+ items:
+ - description: Display AHB clock from gcc
+ - description: Display AXI clock
+ - description: Display core clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: bus
+ - const: core
+
+ iommus:
+ maxItems: 2
+
+ interconnects:
+ maxItems: 1
+
+ interconnect-names:
+ maxItems: 1
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,qcm2290-dpu
+
+ "^dsi@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,dsi-ctrl-6g-qcm2290
+
+ "^phy@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,dsi-phy-14nm-2290
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-qcm2290.h>
+ #include <dt-bindings/clock/qcom,gcc-qcm2290.h>
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,qcm2290.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@5e00000 {
+ #address-cells = <1>;
+ #size-cells = <1>;
+ compatible = "qcom,qcm2290-mdss";
+ reg = <0x05e00000 0x1000>;
+ reg-names = "mdss";
+ power-domains = <&dispcc MDSS_GDSC>;
+ clocks = <&gcc GCC_DISP_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>;
+ clock-names = "iface", "bus", "core";
+
+ interrupts = <GIC_SPI 186 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ interconnects = <&mmrt_virt MASTER_MDP0 &bimc SLAVE_EBI1>;
+ interconnect-names = "mdp0-mem";
+
+ iommus = <&apps_smmu 0x420 0x2>,
+ <&apps_smmu 0x421 0x0>;
+ ranges;
+
+ display-controller@5e01000 {
+ compatible = "qcom,qcm2290-dpu";
+ reg = <0x05e01000 0x8f000>,
+ <0x05eb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus", "iface", "core", "lut", "vsync";
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmpd QCM2290_VDDCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+ };
+ };
+
+ dsi@5e94000 {
+ compatible = "qcom,dsi-ctrl-6g-qcm2290";
+ reg = <0x05e94000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <4>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE0_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC0_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK_SRC>, <&dispcc DISP_CC_MDSS_PCLK0_CLK_SRC>;
+ assigned-clock-parents = <&dsi0_phy 0>, <&dsi0_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmpd QCM2290_VDDCX>;
+
+ phys = <&dsi0_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&dpu_intf1_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ };
+ };
+ };
+ };
+
+ dsi0_phy: phy@5e94400 {
+ compatible = "qcom,dsi-phy-14nm-2290";
+ reg = <0x05e94400 0x100>,
+ <0x05e94500 0x300>,
+ <0x05e94800 0x188>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>, <&rpmcc RPM_SMD_XO_CLK_SRC>;
+ clock-names = "iface", "ref";
+ vcca-supply = <&vreg_dsi_phy>;
+ };
+ };
+...
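
A note on the two-cell iommus entries in this example: with the ARM SMMU's #iommu-cells = <2> format, the first cell is the stream ID and the second an SMR mask whose set bits are ignored when matching, so one entry can cover several stream IDs:

    iommus = <&apps_smmu 0x420 0x2>,   /* matches SIDs 0x420 and 0x422 */
             <&apps_smmu 0x421 0x0>;   /* matches SID 0x421 only */
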
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sc7180-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sc7180-dpu.yaml
new file mode 100644
index 000000000000..1fb8321d9ee8
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sc7180-dpu.yaml
@@ -0,0 +1,101 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sc7180-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display DPU on SC7180
+
+maintainers:
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sc7180-dpu
+
+ reg:
+ items:
+ - description: Address offset and size for mdp register set
+ - description: Address offset and size for vbif register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: vbif
+
+ clocks:
+ items:
+ - description: Display hf axi clock
+ - description: Display ahb clock
+ - description: Display rotator clock
+ - description: Display lut clock
+ - description: Display core clock
+ - description: Display vsync clock
+
+ clock-names:
+ items:
+ - const: bus
+ - const: iface
+ - const: rot
+ - const: lut
+ - const: core
+ - const: vsync
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sc7180.h>
+ #include <dt-bindings/clock/qcom,gcc-sc7180.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@ae01000 {
+ compatible = "qcom,sc7180-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_ROT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus", "iface", "rot", "lut", "core",
+ "vsync";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+ power-domains = <&rpmhpd SC7180_CX>;
+ operating-points-v2 = <&mdp_opp_table>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@2 {
+ reg = <2>;
+ endpoint {
+ remote-endpoint = <&dp_in>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sc7180-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sc7180-mdss.yaml
new file mode 100644
index 000000000000..42ef06edddc4
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sc7180-mdss.yaml
@@ -0,0 +1,308 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sc7180-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SC7180 Display MDSS
+
+maintainers:
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
+
+description:
+ Device tree bindings for the MSM Mobile Display Subsystem (MDSS), which
+ encapsulates sub-blocks such as the DPU display controller and the DSI and
+ DP interfaces. These bindings describe the MDSS found on the SC7180 SoC.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sc7180-mdss
+
+ clocks:
+ items:
+ - description: Display AHB clock from gcc
+ - description: Display AHB clock from dispcc
+ - description: Display core clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: ahb
+ - const: core
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ maxItems: 1
+
+ interconnect-names:
+ maxItems: 1
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sc7180-dpu
+
+ "^displayport-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sc7180-dp
+
+ "^dsi@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ items:
+ - const: qcom,sc7180-dsi-ctrl
+ - const: qcom,mdss-dsi-ctrl
+
+ "^phy@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,dsi-phy-10nm
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sc7180.h>
+ #include <dt-bindings/clock/qcom,gcc-sc7180.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sdm845.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@ae00000 {
+ #address-cells = <1>;
+ #size-cells = <1>;
+ compatible = "qcom,sc7180-mdss";
+ reg = <0xae00000 0x1000>;
+ reg-names = "mdss";
+ power-domains = <&dispcc MDSS_GDSC>;
+ clocks = <&gcc GCC_DISP_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>;
+ clock-names = "iface", "ahb", "core";
+
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ interconnects = <&mmss_noc MASTER_MDP0 &mc_virt SLAVE_EBI1>;
+ interconnect-names = "mdp0-mem";
+
+ iommus = <&apps_smmu 0x800 0x2>;
+ ranges;
+
+ display-controller@ae01000 {
+ compatible = "qcom,sc7180-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_ROT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus", "iface", "rot", "lut", "core",
+ "vsync";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+ power-domains = <&rpmhpd SC7180_CX>;
+ operating-points-v2 = <&mdp_opp_table>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@2 {
+ reg = <2>;
+ dpu_intf0_out: endpoint {
+ remote-endpoint = <&dp_in>;
+ };
+ };
+ };
+ };
+
+ dsi@ae94000 {
+ compatible = "qcom,sc7180-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae94000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <4>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE0_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC0_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK_SRC>, <&dispcc DISP_CC_MDSS_PCLK0_CLK_SRC>;
+ assigned-clock-parents = <&dsi_phy 0>, <&dsi_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SC7180_CX>;
+
+ phys = <&dsi_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&dpu_intf1_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ };
+ };
+ };
+
+ dsi_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-187500000 {
+ opp-hz = /bits/ 64 <187500000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-300000000 {
+ opp-hz = /bits/ 64 <300000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-358000000 {
+ opp-hz = /bits/ 64 <358000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+ };
+ };
+
+ dsi_phy: phy@ae94400 {
+ compatible = "qcom,dsi-phy-10nm";
+ reg = <0x0ae94400 0x200>,
+ <0x0ae94600 0x280>,
+ <0x0ae94a00 0x1e0>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+ vdds-supply = <&vreg_dsi_phy>;
+ };
+
+ displayport-controller@ae90000 {
+ compatible = "qcom,sc7180-dp";
+
+ reg = <0xae90000 0x200>,
+ <0xae90200 0x200>,
+ <0xae90400 0xc00>,
+ <0xae91000 0x400>,
+ <0xae91400 0x400>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <12>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_DP_AUX_CLK>,
+ <&dispcc DISP_CC_MDSS_DP_LINK_CLK>,
+ <&dispcc DISP_CC_MDSS_DP_LINK_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_DP_PIXEL_CLK>;
+ clock-names = "core_iface", "core_aux", "ctrl_link",
+ "ctrl_link_iface", "stream_pixel";
+ assigned-clocks = <&dispcc DISP_CC_MDSS_DP_LINK_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_DP_PIXEL_CLK_SRC>;
+ assigned-clock-parents = <&dp_phy 0>, <&dp_phy 1>;
+ phys = <&dp_phy>;
+ phy-names = "dp";
+
+ operating-points-v2 = <&dp_opp_table>;
+ power-domains = <&rpmhpd SC7180_CX>;
+
+ #sound-dai-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ port@0 {
+ reg = <0>;
+ dp_in: endpoint {
+ remote-endpoint = <&dpu_intf0_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dp_out: endpoint { };
+ };
+ };
+
+ dp_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-160000000 {
+ opp-hz = /bits/ 64 <160000000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-270000000 {
+ opp-hz = /bits/ 64 <270000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-540000000 {
+ opp-hz = /bits/ 64 <540000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+
+ opp-810000000 {
+ opp-hz = /bits/ 64 <810000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+ };
+ };
+ };
+...
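
The required-opps phandles in the DSI and DP OPP tables above (&rpmhpd_opp_low_svs and friends) resolve to corner OPPs exported by the RPMh power-domain provider, which lives outside this snippet. A sketch of that provider's shape, following the usual qcom,rpmhpd layout; the opp-level values are illustrative RPMh corner numbers, not taken from this binding:

    rpmhpd: power-controller {
        compatible = "qcom,sc7180-rpmhpd";
        #power-domain-cells = <1>;
        operating-points-v2 = <&rpmhpd_opp_table>;

        rpmhpd_opp_table: opp-table {
            compatible = "operating-points-v2";

            rpmhpd_opp_low_svs: opp-64 {
                opp-level = <64>;     /* LOW_SVS corner */
            };

            rpmhpd_opp_svs: opp-128 {
                opp-level = <128>;    /* SVS corner */
            };
        };
    };
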
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sc7280-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sc7280-dpu.yaml
new file mode 100644
index 000000000000..26dc073bd19a
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sc7280-dpu.yaml
@@ -0,0 +1,105 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sc7280-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display DPU on SC7280
+
+maintainers:
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sc7280-dpu
+
+ reg:
+ items:
+ - description: Address offset and size for mdp register set
+ - description: Address offset and size for vbif register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: vbif
+
+ clocks:
+ items:
+ - description: Display hf axi clock
+ - description: Display sf axi clock
+ - description: Display ahb clock
+ - description: Display lut clock
+ - description: Display core clock
+ - description: Display vsync clock
+
+ clock-names:
+ items:
+ - const: bus
+ - const: nrt_bus
+ - const: iface
+ - const: lut
+ - const: core
+ - const: vsync
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sc7280.h>
+ #include <dt-bindings/clock/qcom,gcc-sc7280.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@ae01000 {
+ compatible = "qcom,sc7280-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus",
+ "nrt_bus",
+ "iface",
+ "lut",
+ "core",
+ "vsync";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+ power-domains = <&rpmhpd SC7280_CX>;
+ operating-points-v2 = <&mdp_opp_table>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ endpoint {
+ remote-endpoint = <&edp_in>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sc7280-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sc7280-mdss.yaml
new file mode 100644
index 000000000000..078e1d1a7d2f
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sc7280-mdss.yaml
@@ -0,0 +1,427 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sc7280-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SC7280 Display MDSS
+
+maintainers:
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
+
+description:
+ Device tree bindings for the MSM Mobile Display Subsystem (MDSS), which
+ encapsulates sub-blocks such as the DPU display controller and the DSI,
+ DP and eDP interfaces. These bindings describe the MDSS found on the
+ SC7280 SoC.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sc7280-mdss
+
+ clocks:
+ items:
+ - description: Display AHB clock from gcc
+ - description: Display AHB clock from dispcc
+ - description: Display core clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: ahb
+ - const: core
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ maxItems: 1
+
+ interconnect-names:
+ maxItems: 1
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sc7280-dpu
+
+ "^displayport-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sc7280-dp
+
+ "^dsi@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ items:
+ - const: qcom,sc7280-dsi-ctrl
+ - const: qcom,mdss-dsi-ctrl
+
+ "^edp@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sc7280-edp
+
+ "^phy@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ enum:
+ - qcom,sc7280-dsi-phy-7nm
+ - qcom,sc7280-edp-phy
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sc7280.h>
+ #include <dt-bindings/clock/qcom,gcc-sc7280.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sc7280.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@ae00000 {
+ #address-cells = <1>;
+ #size-cells = <1>;
+ compatible = "qcom,sc7280-mdss";
+ reg = <0xae00000 0x1000>;
+ reg-names = "mdss";
+ power-domains = <&dispcc DISP_CC_MDSS_CORE_GDSC>;
+ clocks = <&gcc GCC_DISP_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>;
+ clock-names = "iface",
+ "ahb",
+ "core";
+
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ interconnects = <&mmss_noc MASTER_MDP0 &mc_virt SLAVE_EBI1>;
+ interconnect-names = "mdp0-mem";
+
+ iommus = <&apps_smmu 0x900 0x402>;
+ ranges;
+
+ display-controller@ae01000 {
+ compatible = "qcom,sc7280-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus",
+ "nrt_bus",
+ "iface",
+ "lut",
+ "core",
+ "vsync";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+ power-domains = <&rpmhpd SC7280_CX>;
+ operating-points-v2 = <&mdp_opp_table>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dpu_intf5_out: endpoint {
+ remote-endpoint = <&edp_in>;
+ };
+ };
+
+ port@2 {
+ reg = <2>;
+ dpu_intf0_out: endpoint {
+ remote-endpoint = <&dp_in>;
+ };
+ };
+ };
+ };
+
+ dsi@ae94000 {
+ compatible = "qcom,sc7280-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae94000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <4>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE0_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC0_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK_SRC>;
+ assigned-clock-parents = <&mdss_dsi_phy 0>, <&mdss_dsi_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SC7280_CX>;
+
+ phys = <&mdss_dsi_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&dpu_intf1_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ };
+ };
+ };
+
+ dsi_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-187500000 {
+ opp-hz = /bits/ 64 <187500000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-300000000 {
+ opp-hz = /bits/ 64 <300000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-358000000 {
+ opp-hz = /bits/ 64 <358000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+ };
+ };
+
+ mdss_dsi_phy: phy@ae94400 {
+ compatible = "qcom,sc7280-dsi-phy-7nm";
+ reg = <0x0ae94400 0x200>,
+ <0x0ae94600 0x280>,
+ <0x0ae94900 0x280>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+
+ vdds-supply = <&vreg_dsi_supply>;
+ };
+
+ edp@aea0000 {
+ compatible = "qcom,sc7280-edp";
+ pinctrl-names = "default";
+ pinctrl-0 = <&edp_hot_plug_det>;
+
+ reg = <0xaea0000 0x200>,
+ <0xaea0200 0x200>,
+ <0xaea0400 0xc00>,
+ <0xaea1000 0x400>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <14>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_EDP_AUX_CLK>,
+ <&dispcc DISP_CC_MDSS_EDP_LINK_CLK>,
+ <&dispcc DISP_CC_MDSS_EDP_LINK_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_EDP_PIXEL_CLK>;
+ clock-names = "core_iface",
+ "core_aux",
+ "ctrl_link",
+ "ctrl_link_iface",
+ "stream_pixel";
+ assigned-clocks = <&dispcc DISP_CC_MDSS_EDP_LINK_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_EDP_PIXEL_CLK_SRC>;
+ assigned-clock-parents = <&mdss_edp_phy 0>, <&mdss_edp_phy 1>;
+
+ phys = <&mdss_edp_phy>;
+ phy-names = "dp";
+
+ operating-points-v2 = <&edp_opp_table>;
+ power-domains = <&rpmhpd SC7280_CX>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ edp_in: endpoint {
+ remote-endpoint = <&dpu_intf5_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ mdss_edp_out: endpoint { };
+ };
+ };
+
+ edp_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-160000000 {
+ opp-hz = /bits/ 64 <160000000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-270000000 {
+ opp-hz = /bits/ 64 <270000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-540000000 {
+ opp-hz = /bits/ 64 <540000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+
+ opp-810000000 {
+ opp-hz = /bits/ 64 <810000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+ };
+ };
+
+ mdss_edp_phy: phy@aec2a00 {
+ compatible = "qcom,sc7280-edp-phy";
+
+ reg = <0xaec2a00 0x19c>,
+ <0xaec2200 0xa0>,
+ <0xaec2600 0xa0>,
+ <0xaec2000 0x1c0>;
+
+ clocks = <&rpmhcc RPMH_CXO_CLK>,
+ <&gcc GCC_EDP_CLKREF_EN>;
+ clock-names = "aux",
+ "cfg_ahb";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+ };
+
+ displayport-controller@ae90000 {
+ compatible = "qcom,sc7280-dp";
+
+ reg = <0xae90000 0x200>,
+ <0xae90200 0x200>,
+ <0xae90400 0xc00>,
+ <0xae91000 0x400>,
+ <0xae91400 0x400>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <12>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_DP_AUX_CLK>,
+ <&dispcc DISP_CC_MDSS_DP_LINK_CLK>,
+ <&dispcc DISP_CC_MDSS_DP_LINK_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_DP_PIXEL_CLK>;
+ clock-names = "core_iface",
+ "core_aux",
+ "ctrl_link",
+ "ctrl_link_iface",
+ "stream_pixel";
+ assigned-clocks = <&dispcc DISP_CC_MDSS_DP_LINK_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_DP_PIXEL_CLK_SRC>;
+ assigned-clock-parents = <&dp_phy 0>, <&dp_phy 1>;
+ phys = <&dp_phy>;
+ phy-names = "dp";
+
+ operating-points-v2 = <&dp_opp_table>;
+ power-domains = <&rpmhpd SC7280_CX>;
+
+ #sound-dai-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dp_in: endpoint {
+ remote-endpoint = <&dpu_intf0_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dp_out: endpoint { };
+ };
+ };
+
+ dp_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-160000000 {
+ opp-hz = /bits/ 64 <160000000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-270000000 {
+ opp-hz = /bits/ 64 <270000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-540000000 {
+ opp-hz = /bits/ 64 <540000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+
+ opp-810000000 {
+ opp-hz = /bits/ 64 <810000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sc8280xp-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sc8280xp-dpu.yaml
new file mode 100644
index 000000000000..f2c8e16cf067
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sc8280xp-dpu.yaml
@@ -0,0 +1,122 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sc8280xp-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SC8280XP Display Processing Unit
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+
+description:
+ Device tree bindings for SC8280XP Display Processing Unit.
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sc8280xp-dpu
+
+ reg:
+ items:
+ - description: Address offset and size for mdp register set
+ - description: Address offset and size for vbif register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: vbif
+
+ clocks:
+ items:
+ - description: Display hf axi clock
+ - description: Display sf axi clock
+ - description: Display ahb clock
+ - description: Display lut clock
+ - description: Display core clock
+ - description: Display vsync clock
+
+ clock-names:
+ items:
+ - const: bus
+ - const: nrt_bus
+ - const: iface
+ - const: lut
+ - const: core
+ - const: vsync
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sc8280xp.h>
+ #include <dt-bindings/clock/qcom,gcc-sc8280xp.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sc8280xp.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@ae01000 {
+ compatible = "qcom,sc8280xp-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc0 DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc0 DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc0 DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc0 DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus",
+ "nrt_bus",
+ "iface",
+ "lut",
+ "core",
+ "vsync";
+
+ assigned-clocks = <&dispcc0 DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc0 DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <460000000>,
+ <19200000>;
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmhpd SC8280XP_MMCX>;
+
+ interrupt-parent = <&mdss0>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&mdss0_dp0_in>;
+ };
+ };
+
+ port@4 {
+ reg = <4>;
+ endpoint {
+ remote-endpoint = <&mdss0_dp1_in>;
+ };
+ };
+
+ port@5 {
+ reg = <5>;
+ endpoint {
+ remote-endpoint = <&mdss0_dp3_in>;
+ };
+ };
+
+ port@6 {
+ reg = <6>;
+ endpoint {
+ remote-endpoint = <&mdss0_dp2_in>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sc8280xp-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sc8280xp-mdss.yaml
new file mode 100644
index 000000000000..c239544bc37f
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sc8280xp-mdss.yaml
@@ -0,0 +1,151 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sc8280xp-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SC8280XP Mobile Display Subsystem
+
+maintainers:
+ - Bjorn Andersson <andersson@kernel.org>
+
+description:
+ Device tree bindings for the MSM Mobile Display Subsystem (MDSS), which
+ encapsulates sub-blocks such as the DPU display controller and the DSI and
+ DP interfaces.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sc8280xp-mdss
+
+ clocks:
+ items:
+ - description: Display AHB clock from gcc
+ - description: Display AHB clock from dispcc
+ - description: Display core clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: ahb
+ - const: core
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sc8280xp-dpu
+
+ "^displayport-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ enum:
+ - qcom,sc8280xp-dp
+ - qcom,sc8280xp-edp
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sc8280xp.h>
+ #include <dt-bindings/clock/qcom,gcc-sc8280xp.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sc8280xp.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@ae00000 {
+ compatible = "qcom,sc8280xp-mdss";
+ reg = <0x0ae00000 0x1000>;
+ reg-names = "mdss";
+
+ power-domains = <&dispcc0 MDSS_GDSC>;
+
+ clocks = <&gcc GCC_DISP_AHB_CLK>,
+ <&dispcc0 DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc0 DISP_CC_MDSS_MDP_CLK>;
+ clock-names = "iface",
+ "ahb",
+ "core";
+
+ resets = <&dispcc0 DISP_CC_MDSS_CORE_BCR>;
+
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ interconnects = <&mmss_noc MASTER_MDP0 0 &mc_virt SLAVE_EBI1 0>,
+ <&mmss_noc MASTER_MDP1 0 &mc_virt SLAVE_EBI1 0>;
+ interconnect-names = "mdp0-mem", "mdp1-mem";
+
+ iommus = <&apps_smmu 0x1000 0x402>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges;
+
+ display-controller@ae01000 {
+ compatible = "qcom,sc8280xp-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc0 DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc0 DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc0 DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc0 DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus",
+ "nrt_bus",
+ "iface",
+ "lut",
+ "core",
+ "vsync";
+
+ assigned-clocks = <&dispcc0 DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <19200000>;
+
+ operating-points-v2 = <&mdss0_mdp_opp_table>;
+ power-domains = <&rpmhpd SC8280XP_MMCX>;
+
+ interrupt-parent = <&mdss0>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&mdss0_dp0_in>;
+ };
+ };
+
+ port@4 {
+ reg = <4>;
+ endpoint {
+ remote-endpoint = <&mdss0_dp1_in>;
+ };
+ };
+
+ port@5 {
+ reg = <5>;
+ endpoint {
+ remote-endpoint = <&mdss0_dp3_in>;
+ };
+ };
+
+ port@6 {
+ reg = <6>;
+ endpoint {
+ remote-endpoint = <&mdss0_dp2_in>;
+ };
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sdm845-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sdm845-dpu.yaml
new file mode 100644
index 000000000000..0f7765d832e7
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sdm845-dpu.yaml
@@ -0,0 +1,96 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sdm845-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display DPU on SDM845
+
+maintainers:
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sdm845-dpu
+
+ reg:
+ items:
+ - description: Address offset and size for mdp register set
+ - description: Address offset and size for vbif register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: vbif
+
+ clocks:
+ items:
+ - description: Display GCC bus clock
+ - description: Display ahb clock
+ - description: Display axi clock
+ - description: Display core clock
+ - description: Display vsync clock
+
+ clock-names:
+ items:
+ - const: gcc-bus
+ - const: iface
+ - const: bus
+ - const: core
+ - const: vsync
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sdm845.h>
+ #include <dt-bindings/clock/qcom,gcc-sdm845.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@ae01000 {
+ compatible = "qcom,sdm845-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "gcc-bus", "iface", "bus", "core", "vsync";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+ power-domains = <&rpmhpd SDM845_CX>;
+ operating-points-v2 = <&mdp_opp_table>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sdm845-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sdm845-mdss.yaml
new file mode 100644
index 000000000000..6ecb00920d7f
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sdm845-mdss.yaml
@@ -0,0 +1,280 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sdm845-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SDM845 Display MDSS
+
+maintainers:
+ - Krishna Manikandan <quic_mkrishn@quicinc.com>
+
+description:
+ Device tree bindings for the MSM Mobile Display Subsystem (MDSS), which
+ encapsulates sub-blocks such as the DPU display controller and the DSI and
+ DP interfaces. These bindings describe the MDSS found on the SDM845 SoC.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sdm845-mdss
+
+ clocks:
+ items:
+ - description: Display AHB clock from gcc
+ - description: Display core clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: core
+
+ iommus:
+ maxItems: 2
+
+ interconnects:
+ maxItems: 2
+
+ interconnect-names:
+ maxItems: 2
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sdm845-dpu
+
+ "^displayport-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sdm845-dp
+
+ "^dsi@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ items:
+ - const: qcom,sdm845-dsi-ctrl
+ - const: qcom,mdss-dsi-ctrl
+
+ "^phy@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,dsi-phy-10nm
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sdm845.h>
+ #include <dt-bindings/clock/qcom,gcc-sdm845.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@ae00000 {
+ #address-cells = <1>;
+ #size-cells = <1>;
+ compatible = "qcom,sdm845-mdss";
+ reg = <0x0ae00000 0x1000>;
+ reg-names = "mdss";
+ power-domains = <&dispcc MDSS_GDSC>;
+
+ clocks = <&gcc GCC_DISP_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>;
+ clock-names = "iface", "core";
+
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ iommus = <&apps_smmu 0x880 0x8>,
+ <&apps_smmu 0xc80 0x8>;
+ ranges;
+
+ display-controller@ae01000 {
+ compatible = "qcom,sdm845-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "gcc-bus", "iface", "bus", "core", "vsync";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+ power-domains = <&rpmhpd SDM845_CX>;
+ operating-points-v2 = <&mdp_opp_table>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dpu_intf2_out: endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+ };
+
+ dsi@ae94000 {
+ compatible = "qcom,sdm845-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae94000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <4>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE0_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC0_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK_SRC>;
+ assigned-clock-parents = <&dsi0_phy 0>, <&dsi0_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SDM845_CX>;
+
+ phys = <&dsi0_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&dpu_intf1_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ };
+ };
+ };
+ };
+
+ dsi0_phy: phy@ae94400 {
+ compatible = "qcom,dsi-phy-10nm";
+ reg = <0x0ae94400 0x200>,
+ <0x0ae94600 0x280>,
+ <0x0ae94a00 0x1e0>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+ vdds-supply = <&vreg_dsi_phy>;
+ };
+
+ dsi@ae96000 {
+ compatible = "qcom,sdm845-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae96000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <5>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE1_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE1_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK1_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC1_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE1_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK1_CLK_SRC>;
+ assigned-clock-parents = <&dsi1_phy 0>, <&dsi1_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SDM845_CX>;
+
+ phys = <&dsi1_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi1_in: endpoint {
+ remote-endpoint = <&dpu_intf2_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi1_out: endpoint {
+ };
+ };
+ };
+ };
+
+ dsi1_phy: phy@ae96400 {
+ compatible = "qcom,dsi-phy-10nm";
+ reg = <0x0ae96400 0x200>,
+ <0x0ae96600 0x280>,
+ <0x0ae96a00 0x10e>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+ vdds-supply = <&vreg_dsi_phy>;
+ };
+ };
+...
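
In both DSI controller examples the output port (port@1: dsi0_out, dsi1_out) is left dangling; the board DTS is what ties it to a panel. A minimal sketch of that hookup, assuming the SoC dtsi labels the first controller dsi0 and using a placeholder panel compatible:

    &dsi0_out {
        remote-endpoint = <&panel_in>;
        data-lanes = <0 1 2 3>;
    };

    &dsi0 {
        panel@0 {
            compatible = "vendor,example-panel";   /* placeholder */
            reg = <0>;

            port {
                panel_in: endpoint {
                    remote-endpoint = <&dsi0_out>;
                };
            };
        };
    };
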
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm6115-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm6115-dpu.yaml
new file mode 100644
index 000000000000..bf62c2f5325a
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm6115-dpu.yaml
@@ -0,0 +1,93 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm6115-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Display DPU on SM6115
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm6115-dpu
+
+ reg:
+ items:
+ - description: MDP register set
+ - description: VBIF register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: vbif
+
+ clocks:
+ items:
+ - description: Display AXI
+ - description: Display AHB
+ - description: Display core
+ - description: Display lut
+ - description: Display rotator
+ - description: Display vsync
+
+ clock-names:
+ items:
+ - const: bus
+ - const: iface
+ - const: core
+ - const: lut
+ - const: rot
+ - const: vsync
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,sm6115-dispcc.h>
+ #include <dt-bindings/clock/qcom,gcc-sm6115.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@5e01000 {
+ compatible = "qcom,sm6115-dpu";
+ reg = <0x05e01000 0x8f000>,
+ <0x05eb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_ROT_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus", "iface", "core", "lut", "rot", "vsync";
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmpd SM6115_VDDCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm6115-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm6115-mdss.yaml
new file mode 100644
index 000000000000..b9f83088f370
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm6115-mdss.yaml
@@ -0,0 +1,187 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm6115-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SM6115 Display MDSS
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+
+description:
+ Device tree bindings for the MSM Mobile Display Subsystem (MDSS), which
+ encapsulates sub-blocks such as the DPU display controller and the DSI
+ interface. These bindings describe the MDSS found on the SM6115 SoC.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm6115-mdss
+
+ clocks:
+ items:
+ - description: Display AHB clock from gcc
+ - description: Display AXI clock
+ - description: Display core clock
+
+ iommus:
+ maxItems: 2
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sm6115-dpu
+
+ "^dsi@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ oneOf:
+ - items:
+ - const: qcom,sm6115-dsi-ctrl
+ - const: qcom,mdss-dsi-ctrl
+ - description: Old binding, please don't use
+ deprecated: true
+ const: qcom,dsi-ctrl-6g-qcm2290
+
+ "^phy@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,dsi-phy-14nm-2290
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,sm6115-dispcc.h>
+ #include <dt-bindings/clock/qcom,gcc-sm6115.h>
+ #include <dt-bindings/clock/qcom,rpmcc.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@5e00000 {
+ #address-cells = <1>;
+ #size-cells = <1>;
+ compatible = "qcom,sm6115-mdss";
+ reg = <0x05e00000 0x1000>;
+ reg-names = "mdss";
+ power-domains = <&dispcc MDSS_GDSC>;
+ clocks = <&gcc GCC_DISP_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>;
+
+ interrupts = <GIC_SPI 186 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ iommus = <&apps_smmu 0x420 0x2>,
+ <&apps_smmu 0x421 0x0>;
+ ranges;
+
+ display-controller@5e01000 {
+ compatible = "qcom,sm6115-dpu";
+ reg = <0x05e01000 0x8f000>,
+ <0x05eb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_ROT_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus", "iface", "core", "lut", "rot", "vsync";
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmpd SM6115_VDDCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+ };
+ };
+
+ dsi@5e94000 {
+ compatible = "qcom,sm6115-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x05e94000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <4>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE0_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC0_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK_SRC>, <&dispcc DISP_CC_MDSS_PCLK0_CLK_SRC>;
+ assigned-clock-parents = <&dsi0_phy 0>, <&dsi0_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmpd SM6115_VDDCX>;
+ phys = <&dsi0_phy>;
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&dpu_intf1_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ };
+ };
+ };
+ };
+
+ dsi0_phy: phy@5e94400 {
+ compatible = "qcom,dsi-phy-14nm-2290";
+ reg = <0x05e94400 0x100>,
+ <0x05e94500 0x300>,
+ <0x05e94800 0x188>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>, <&rpmcc RPM_SMD_XO_CLK_SRC>;
+ clock-names = "iface", "ref";
+ };
+ };
+...
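
The dsi@ pattern above carries a deprecated alternative compatible: SM6115 device trees originally reused the QCM2290 DSI controller string. Side by side, the two spellings the oneOf accepts:

    /* Preferred */
    compatible = "qcom,sm6115-dsi-ctrl", "qcom,mdss-dsi-ctrl";

    /* Accepted but deprecated (flagged "Old binding, please don't use") */
    compatible = "qcom,dsi-ctrl-6g-qcm2290";
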
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm8150-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm8150-dpu.yaml
new file mode 100644
index 000000000000..2b3f3fe9bdf7
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm8150-dpu.yaml
@@ -0,0 +1,92 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm8150-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SM8150 Display DPU
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm8150-dpu
+
+ reg:
+ items:
+ - description: Address offset and size for mdp register set
+ - description: Address offset and size for vbif register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: vbif
+
+ clocks:
+ items:
+ - description: Display ahb clock
+ - description: Display hf axi clock
+ - description: Display core clock
+ - description: Display vsync clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: bus
+ - const: core
+ - const: vsync
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sm8150.h>
+ #include <dt-bindings/clock/qcom,gcc-sm8150.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sm8150.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@ae01000 {
+ compatible = "qcom,sm8150-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "iface", "bus", "core", "vsync";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <19200000>;
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmhpd SM8150_MMCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm8150-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm8150-mdss.yaml
new file mode 100644
index 000000000000..5182e958e069
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm8150-mdss.yaml
@@ -0,0 +1,332 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm8150-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SM8150 Display MDSS
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+
+description:
+  Device tree bindings for the MSM Mobile Display Subsystem (MDSS), which
+  encapsulates sub-blocks such as the DPU display controller and the DSI and
+  DP interfaces. These bindings describe the MDSS found on the SM8150 target.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: qcom,sm8150-mdss
+
+ clocks:
+ items:
+ - description: Display AHB clock from gcc
+ - description: Display hf axi clock
+ - description: Display sf axi clock
+ - description: Display core clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: bus
+ - const: nrt_bus
+ - const: core
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ maxItems: 2
+
+ interconnect-names:
+ maxItems: 2
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sm8150-dpu
+
+ "^dsi@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ items:
+ - const: qcom,sm8150-dsi-ctrl
+ - const: qcom,mdss-dsi-ctrl
+
+ "^phy@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,dsi-phy-7nm
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sm8150.h>
+ #include <dt-bindings/clock/qcom,gcc-sm8150.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sm8150.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@ae00000 {
+ compatible = "qcom,sm8150-mdss";
+ reg = <0x0ae00000 0x1000>;
+ reg-names = "mdss";
+
+ interconnects = <&mmss_noc MASTER_MDP_PORT0 &mc_virt SLAVE_EBI_CH0>,
+ <&mmss_noc MASTER_MDP_PORT1 &mc_virt SLAVE_EBI_CH0>;
+ interconnect-names = "mdp0-mem", "mdp1-mem";
+
+ power-domains = <&dispcc MDSS_GDSC>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>;
+ clock-names = "iface", "bus", "nrt_bus", "core";
+
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ iommus = <&apps_smmu 0x800 0x420>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges;
+
+ display-controller@ae01000 {
+ compatible = "qcom,sm8150-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "iface", "bus", "core", "vsync";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <19200000>;
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmhpd SM8150_MMCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dpu_intf2_out: endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+
+ mdp_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-171428571 {
+ opp-hz = /bits/ 64 <171428571>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-300000000 {
+ opp-hz = /bits/ 64 <300000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-345000000 {
+ opp-hz = /bits/ 64 <345000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+
+ opp-460000000 {
+ opp-hz = /bits/ 64 <460000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+ };
+ };
+
+ dsi@ae94000 {
+ compatible = "qcom,sm8150-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae94000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <4>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE0_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC0_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK_SRC>;
+ assigned-clock-parents = <&dsi0_phy 0>, <&dsi0_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SM8150_MMCX>;
+
+ phys = <&dsi0_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&dpu_intf1_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ };
+ };
+ };
+
+ dsi_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-187500000 {
+ opp-hz = /bits/ 64 <187500000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-300000000 {
+ opp-hz = /bits/ 64 <300000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-358000000 {
+ opp-hz = /bits/ 64 <358000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+ };
+ };
+
+ dsi0_phy: phy@ae94400 {
+ compatible = "qcom,dsi-phy-7nm";
+ reg = <0x0ae94400 0x200>,
+ <0x0ae94600 0x280>,
+ <0x0ae94900 0x260>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+ vdds-supply = <&vreg_dsi_phy>;
+ };
+
+ dsi@ae96000 {
+ compatible = "qcom,sm8150-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae96000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <5>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE1_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE1_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK1_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC1_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE1_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK1_CLK_SRC>;
+ assigned-clock-parents = <&dsi1_phy 0>, <&dsi1_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SM8150_MMCX>;
+
+ phys = <&dsi1_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi1_in: endpoint {
+ remote-endpoint = <&dpu_intf2_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi1_out: endpoint {
+ };
+ };
+ };
+ };
+
+ dsi1_phy: phy@ae96400 {
+ compatible = "qcom,dsi-phy-7nm";
+ reg = <0x0ae96400 0x200>,
+ <0x0ae96600 0x280>,
+ <0x0ae96900 0x260>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+ vdds-supply = <&vreg_dsi_phy>;
+ };
+ };
+...
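
A pattern worth calling out in the example above: the MDSS node consumes a single GIC SPI and acts as the interrupt controller for its sub-blocks, which then address its lines with one cell each. Distilled to the essentials (all unrelated properties elided):

    mdss: display-subsystem@ae00000 {
        interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
        interrupt-controller;
        #interrupt-cells = <1>;

        display-controller@ae01000 {
            interrupt-parent = <&mdss>;
            interrupts = <0>;   /* DPU line within the MDSS demux */
        };
    };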
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm8250-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm8250-dpu.yaml
new file mode 100644
index 000000000000..687c8c170cd4
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm8250-dpu.yaml
@@ -0,0 +1,99 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm8250-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SM8250 Display DPU
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm8250-dpu
+
+ reg:
+ items:
+ - description: Address offset and size for mdp register set
+ - description: Address offset and size for vbif register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: vbif
+
+ clocks:
+ items:
+ - description: Display ahb clock
+ - description: Display hf axi clock
+ - description: Display core clock
+ - description: Display vsync clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: bus
+ - const: core
+ - const: vsync
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sm8250.h>
+ #include <dt-bindings/clock/qcom,gcc-sm8250.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sm8250.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@ae01000 {
+ compatible = "qcom,sm8250-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "iface", "bus", "core", "vsync";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <19200000>;
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmhpd SM8250_MMCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm8250-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm8250-mdss.yaml
new file mode 100644
index 000000000000..368d3db0ce96
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm8250-mdss.yaml
@@ -0,0 +1,334 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm8250-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SM8250 Display MDSS
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+
+description:
+  Device tree bindings for the MSM Mobile Display Subsystem (MDSS), which
+  encapsulates sub-blocks such as the DPU display controller and the DSI and
+  DP interfaces. These bindings describe the MDSS found on the SM8250 target.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm8250-mdss
+
+ clocks:
+ items:
+ - description: Display AHB clock from gcc
+ - description: Display hf axi clock
+ - description: Display sf axi clock
+ - description: Display core clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: bus
+ - const: nrt_bus
+ - const: core
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ maxItems: 2
+
+ interconnect-names:
+ maxItems: 2
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sm8250-dpu
+
+ "^dsi@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ items:
+ - const: qcom,sm8250-dsi-ctrl
+ - const: qcom,mdss-dsi-ctrl
+
+ "^phy@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,dsi-phy-7nm
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sm8250.h>
+ #include <dt-bindings/clock/qcom,gcc-sm8250.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sm8250.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@ae00000 {
+ compatible = "qcom,sm8250-mdss";
+ reg = <0x0ae00000 0x1000>;
+ reg-names = "mdss";
+
+ interconnects = <&mmss_noc MASTER_MDP_PORT0 &mc_virt SLAVE_EBI_CH0>,
+ <&mmss_noc MASTER_MDP_PORT1 &mc_virt SLAVE_EBI_CH0>;
+ interconnect-names = "mdp0-mem", "mdp1-mem";
+
+ power-domains = <&dispcc MDSS_GDSC>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>;
+ clock-names = "iface", "bus", "nrt_bus", "core";
+
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ iommus = <&apps_smmu 0x820 0x402>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges;
+
+ display-controller@ae01000 {
+ compatible = "qcom,sm8250-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "iface", "bus", "core", "vsync";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <19200000>;
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmhpd SM8250_MMCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dpu_intf2_out: endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+
+ mdp_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-200000000 {
+ opp-hz = /bits/ 64 <200000000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-300000000 {
+ opp-hz = /bits/ 64 <300000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-345000000 {
+ opp-hz = /bits/ 64 <345000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+
+ opp-460000000 {
+ opp-hz = /bits/ 64 <460000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+ };
+ };
+
+ dsi@ae94000 {
+ compatible = "qcom,sm8250-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae94000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <4>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE0_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC0_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK_SRC>;
+ assigned-clock-parents = <&dsi0_phy 0>, <&dsi0_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SM8250_MMCX>;
+
+ phys = <&dsi0_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&dpu_intf1_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ };
+ };
+ };
+
+ dsi_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-187500000 {
+ opp-hz = /bits/ 64 <187500000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-300000000 {
+ opp-hz = /bits/ 64 <300000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-358000000 {
+ opp-hz = /bits/ 64 <358000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+ };
+ };
+
+ dsi0_phy: phy@ae94400 {
+ compatible = "qcom,dsi-phy-7nm";
+ reg = <0x0ae94400 0x200>,
+ <0x0ae94600 0x280>,
+ <0x0ae94900 0x260>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+ vdds-supply = <&vreg_dsi_phy>;
+ };
+
+ dsi@ae96000 {
+ compatible = "qcom,sm8250-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae96000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <5>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE1_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE1_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK1_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC1_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE1_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK1_CLK_SRC>;
+ assigned-clock-parents = <&dsi1_phy 0>, <&dsi1_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SM8250_MMCX>;
+
+ phys = <&dsi1_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi1_in: endpoint {
+ remote-endpoint = <&dpu_intf2_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi1_out: endpoint {
+ };
+ };
+ };
+ };
+
+ dsi1_phy: phy@ae96400 {
+ compatible = "qcom,dsi-phy-7nm";
+ reg = <0x0ae96400 0x200>,
+ <0x0ae96600 0x280>,
+ <0x0ae96900 0x260>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+ vdds-supply = <&vreg_dsi_phy>;
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm8350-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm8350-dpu.yaml
new file mode 100644
index 000000000000..120500395c9a
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm8350-dpu.yaml
@@ -0,0 +1,120 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm8350-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SM8350 Display DPU
+
+maintainers:
+ - Robert Foss <robert.foss@linaro.org>
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm8350-dpu
+
+ reg:
+ items:
+ - description: Address offset and size for mdp register set
+ - description: Address offset and size for vbif register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: vbif
+
+ clocks:
+ items:
+ - description: Display hf axi clock
+ - description: Display sf axi clock
+ - description: Display ahb clock
+ - description: Display lut clock
+ - description: Display core clock
+ - description: Display vsync clock
+
+ clock-names:
+ items:
+ - const: bus
+ - const: nrt_bus
+ - const: iface
+ - const: lut
+ - const: core
+ - const: vsync
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sm8350.h>
+ #include <dt-bindings/clock/qcom,gcc-sm8350.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sm8350.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@ae01000 {
+ compatible = "qcom,sm8350-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus",
+ "nrt_bus",
+ "iface",
+ "lut",
+ "core",
+ "vsync";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <19200000>;
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmhpd SM8350_MMCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+ };
+
+ mdp_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-200000000 {
+ opp-hz = /bits/ 64 <200000000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-300000000 {
+ opp-hz = /bits/ 64 <300000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-345000000 {
+ opp-hz = /bits/ 64 <345000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+
+ opp-460000000 {
+ opp-hz = /bits/ 64 <460000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm8350-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm8350-mdss.yaml
new file mode 100644
index 000000000000..4d94dbff3054
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm8350-mdss.yaml
@@ -0,0 +1,223 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm8350-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SM8350 Display MDSS
+
+maintainers:
+ - Robert Foss <robert.foss@linaro.org>
+
+description:
+  MSM Mobile Display Subsystem (MDSS), which encapsulates sub-blocks such as
+  the DPU display controller and the DSI and DP interfaces.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ items:
+ - const: qcom,sm8350-mdss
+
+ clocks:
+ items:
+ - description: Display AHB clock from gcc
+ - description: Display hf axi clock
+ - description: Display sf axi clock
+ - description: Display core clock
+
+ clock-names:
+ items:
+ - const: iface
+ - const: bus
+ - const: nrt_bus
+ - const: core
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ maxItems: 2
+
+ interconnect-names:
+ items:
+ - const: mdp0-mem
+ - const: mdp1-mem
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sm8350-dpu
+
+ "^dsi@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ items:
+ - const: qcom,sm8350-dsi-ctrl
+ - const: qcom,mdss-dsi-ctrl
+
+ "^phy@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,dsi-phy-5nm-8350
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,dispcc-sm8350.h>
+ #include <dt-bindings/clock/qcom,gcc-sm8350.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sm8350.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@ae00000 {
+ compatible = "qcom,sm8350-mdss";
+ reg = <0x0ae00000 0x1000>;
+ reg-names = "mdss";
+
+ interconnects = <&mmss_noc MASTER_MDP0 0 &mc_virt SLAVE_EBI1 0>,
+ <&mmss_noc MASTER_MDP1 0 &mc_virt SLAVE_EBI1 0>;
+ interconnect-names = "mdp0-mem", "mdp1-mem";
+
+ power-domains = <&dispcc MDSS_GDSC>;
+ resets = <&dispcc DISP_CC_MDSS_CORE_BCR>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>;
+ clock-names = "iface", "bus", "nrt_bus", "core";
+
+ iommus = <&apps_smmu 0x820 0x402>;
+
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges;
+
+ display-controller@ae01000 {
+ compatible = "qcom,sm8350-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus",
+ "nrt_bus",
+ "iface",
+ "lut",
+ "core",
+ "vsync";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <19200000>;
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmhpd SM8350_MMCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+ };
+
+ mdp_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-200000000 {
+ opp-hz = /bits/ 64 <200000000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-300000000 {
+ opp-hz = /bits/ 64 <300000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-345000000 {
+ opp-hz = /bits/ 64 <345000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+
+ opp-460000000 {
+ opp-hz = /bits/ 64 <460000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+ };
+ };
+
+ dsi0: dsi@ae94000 {
+ compatible = "qcom,sm8350-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae94000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <4>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE0_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC0_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK_SRC>;
+ assigned-clock-parents = <&mdss_dsi0_phy 0>,
+ <&mdss_dsi0_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SM8350_MMCX>;
+
+ phys = <&mdss_dsi0_phy>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&dpu_intf1_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ };
+ };
+ };
+ };
+ };
+...
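
The sm8350 example references mdss_dsi0_phy and dsi_opp_table without defining them. A minimal sketch of the referenced PHY node is shown below; the register offsets are carried over from the DSI PHY nodes in the sibling examples and should be read as illustrative, not as verified SM8350 values:

    mdss_dsi0_phy: phy@ae94400 {
        compatible = "qcom,dsi-phy-5nm-8350";
        reg = <0x0ae94400 0x200>,   /* dsi_phy - illustrative */
              <0x0ae94600 0x280>,   /* dsi_phy_lane - illustrative */
              <0x0ae94900 0x260>;   /* dsi_pll - illustrative */
        reg-names = "dsi_phy", "dsi_phy_lane", "dsi_pll";

        #clock-cells = <1>;
        #phy-cells = <0>;
    };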
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm8450-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm8450-dpu.yaml
new file mode 100644
index 000000000000..0d17ece1c453
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm8450-dpu.yaml
@@ -0,0 +1,139 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm8450-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SM8450 Display DPU
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm8450-dpu
+
+ reg:
+ items:
+ - description: Address offset and size for mdp register set
+ - description: Address offset and size for vbif register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: vbif
+
+ clocks:
+ items:
+ - description: Display hf axi
+ - description: Display sf axi
+ - description: Display ahb
+ - description: Display lut
+ - description: Display core
+ - description: Display vsync
+
+ clock-names:
+ items:
+ - const: bus
+ - const: nrt_bus
+ - const: iface
+ - const: lut
+ - const: core
+ - const: vsync
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,sm8450-dispcc.h>
+ #include <dt-bindings/clock/qcom,gcc-sm8450.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sm8450.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@ae01000 {
+ compatible = "qcom,sm8450-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus",
+ "nrt_bus",
+ "iface",
+ "lut",
+ "core",
+ "vsync";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <19200000>;
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmhpd SM8450_MMCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dpu_intf2_out: endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+
+ mdp_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-172000000 {
+ opp-hz = /bits/ 64 <172000000>;
+ required-opps = <&rpmhpd_opp_low_svs_d1>;
+ };
+
+ opp-200000000 {
+ opp-hz = /bits/ 64 <200000000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-325000000 {
+ opp-hz = /bits/ 64 <325000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-375000000 {
+ opp-hz = /bits/ 64 <375000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+
+ opp-500000000 {
+ opp-hz = /bits/ 64 <500000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm8450-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm8450-mdss.yaml
new file mode 100644
index 000000000000..f26eb5643aed
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm8450-mdss.yaml
@@ -0,0 +1,345 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm8450-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SM8450 Display MDSS
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+
+description:
+  SM8450 MSM Mobile Display Subsystem (MDSS), which encapsulates sub-blocks
+  such as the DPU display controller and the DSI and DP interfaces.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm8450-mdss
+
+ clocks:
+ items:
+ - description: Display AHB
+ - description: Display hf AXI
+ - description: Display sf AXI
+ - description: Display core
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ maxItems: 2
+
+ interconnect-names:
+ maxItems: 2
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sm8450-dpu
+
+ "^dsi@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ items:
+ - const: qcom,sm8450-dsi-ctrl
+ - const: qcom,mdss-dsi-ctrl
+
+ "^phy@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sm8450-dsi-phy-5nm
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,sm8450-dispcc.h>
+ #include <dt-bindings/clock/qcom,gcc-sm8450.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sm8450.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@ae00000 {
+ compatible = "qcom,sm8450-mdss";
+ reg = <0x0ae00000 0x1000>;
+ reg-names = "mdss";
+
+ interconnects = <&mmss_noc MASTER_MDP_DISP 0 &mc_virt SLAVE_EBI1_DISP 0>,
+ <&mmss_noc MASTER_MDP_DISP 0 &mc_virt SLAVE_EBI1_DISP 0>;
+ interconnect-names = "mdp0-mem", "mdp1-mem";
+
+ resets = <&dispcc DISP_CC_MDSS_CORE_BCR>;
+
+ power-domains = <&dispcc MDSS_GDSC>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>;
+ clock-names = "iface", "bus", "nrt_bus", "core";
+
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ iommus = <&apps_smmu 0x2800 0x402>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges;
+
+ display-controller@ae01000 {
+ compatible = "qcom,sm8450-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&gcc GCC_DISP_SF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus",
+ "nrt_bus",
+ "iface",
+ "lut",
+ "core",
+ "vsync";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <19200000>;
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmhpd SM8450_MMCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dpu_intf2_out: endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+
+ mdp_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-172000000 {
+ opp-hz = /bits/ 64 <172000000>;
+ required-opps = <&rpmhpd_opp_low_svs_d1>;
+ };
+
+ opp-200000000 {
+ opp-hz = /bits/ 64 <200000000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-325000000 {
+ opp-hz = /bits/ 64 <325000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-375000000 {
+ opp-hz = /bits/ 64 <375000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+
+ opp-500000000 {
+ opp-hz = /bits/ 64 <500000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+ };
+ };
+
+ dsi@ae94000 {
+ compatible = "qcom,sm8450-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae94000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <4>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE0_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC0_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK_SRC>;
+ assigned-clock-parents = <&dsi0_phy 0>, <&dsi0_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SM8450_MMCX>;
+
+ phys = <&dsi0_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&dpu_intf1_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ };
+ };
+ };
+
+ dsi_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-160310000 {
+ opp-hz = /bits/ 64 <160310000>;
+ required-opps = <&rpmhpd_opp_low_svs_d1>;
+ };
+
+ opp-187500000 {
+ opp-hz = /bits/ 64 <187500000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-300000000 {
+ opp-hz = /bits/ 64 <300000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-358000000 {
+ opp-hz = /bits/ 64 <358000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+ };
+ };
+
+ dsi0_phy: phy@ae94400 {
+ compatible = "qcom,sm8450-dsi-phy-5nm";
+ reg = <0x0ae94400 0x200>,
+ <0x0ae94600 0x280>,
+ <0x0ae94900 0x260>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+ vdds-supply = <&vreg_dsi_phy>;
+ };
+
+ dsi@ae96000 {
+ compatible = "qcom,sm8450-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae96000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <5>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE1_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE1_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK1_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC1_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE1_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK1_CLK_SRC>;
+ assigned-clock-parents = <&dsi1_phy 0>, <&dsi1_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SM8450_MMCX>;
+
+ phys = <&dsi1_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi1_in: endpoint {
+ remote-endpoint = <&dpu_intf2_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi1_out: endpoint {
+ };
+ };
+ };
+ };
+
+ dsi1_phy: phy@ae96400 {
+ compatible = "qcom,sm8450-dsi-phy-5nm";
+ reg = <0x0ae96400 0x200>,
+ <0x0ae96600 0x280>,
+ <0x0ae96900 0x260>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+ vdds-supply = <&vreg_dsi_phy>;
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm8550-dpu.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm8550-dpu.yaml
new file mode 100644
index 000000000000..ff58a747bb6f
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm8550-dpu.yaml
@@ -0,0 +1,133 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm8550-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SM8550 Display DPU
+
+maintainers:
+ - Neil Armstrong <neil.armstrong@linaro.org>
+
+$ref: /schemas/display/msm/dpu-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm8550-dpu
+
+ reg:
+ items:
+ - description: Address offset and size for mdp register set
+ - description: Address offset and size for vbif register set
+
+ reg-names:
+ items:
+ - const: mdp
+ - const: vbif
+
+ clocks:
+ items:
+ - description: Display AHB
+ - description: Display hf axi
+ - description: Display MDSS ahb
+ - description: Display lut
+ - description: Display core
+ - description: Display vsync
+
+ clock-names:
+ items:
+ - const: bus
+ - const: nrt_bus
+ - const: iface
+ - const: lut
+ - const: core
+ - const: vsync
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,sm8550-dispcc.h>
+ #include <dt-bindings/clock/qcom,sm8550-gcc.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-controller@ae01000 {
+ compatible = "qcom,sm8550-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus",
+ "nrt_bus",
+ "iface",
+ "lut",
+ "core",
+ "vsync";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <19200000>;
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmhpd SM8550_MMCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dpu_intf2_out: endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+
+ mdp_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-200000000 {
+ opp-hz = /bits/ 64 <200000000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-325000000 {
+ opp-hz = /bits/ 64 <325000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-375000000 {
+ opp-hz = /bits/ 64 <375000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+
+ opp-514000000 {
+ opp-hz = /bits/ 64 <514000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/msm/qcom,sm8550-mdss.yaml b/Documentation/devicetree/bindings/display/msm/qcom,sm8550-mdss.yaml
new file mode 100644
index 000000000000..887be33ba108
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/msm/qcom,sm8550-mdss.yaml
@@ -0,0 +1,333 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/msm/qcom,sm8550-mdss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm SM8550 Display MDSS
+
+maintainers:
+ - Neil Armstrong <neil.armstrong@linaro.org>
+
+description:
+  SM8550 MSM Mobile Display Subsystem (MDSS), which encapsulates sub-blocks
+  such as the DPU display controller and the DSI and DP interfaces.
+
+$ref: /schemas/display/msm/mdss-common.yaml#
+
+properties:
+ compatible:
+ const: qcom,sm8550-mdss
+
+ clocks:
+ items:
+ - description: Display MDSS AHB
+ - description: Display AHB
+ - description: Display hf AXI
+ - description: Display core
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ maxItems: 2
+
+ interconnect-names:
+ maxItems: 2
+
+patternProperties:
+ "^display-controller@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sm8550-dpu
+
+ "^dsi@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ items:
+ - const: qcom,sm8550-dsi-ctrl
+ - const: qcom,mdss-dsi-ctrl
+
+ "^phy@[0-9a-f]+$":
+ type: object
+ properties:
+ compatible:
+ const: qcom,sm8550-dsi-phy-4nm
+
+required:
+ - compatible
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,sm8550-dispcc.h>
+ #include <dt-bindings/clock/qcom,sm8550-gcc.h>
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interconnect/qcom,sm8550-rpmh.h>
+ #include <dt-bindings/power/qcom-rpmpd.h>
+
+ display-subsystem@ae00000 {
+ compatible = "qcom,sm8550-mdss";
+ reg = <0x0ae00000 0x1000>;
+ reg-names = "mdss";
+
+ interconnects = <&mmss_noc MASTER_MDP 0 &gem_noc SLAVE_LLCC 0>,
+ <&mc_virt MASTER_LLCC 0 &mc_virt SLAVE_EBI1 0>;
+ interconnect-names = "mdp0-mem", "mdp1-mem";
+
+ resets = <&dispcc DISP_CC_MDSS_CORE_BCR>;
+
+ power-domains = <&dispcc MDSS_GDSC>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>;
+ clock-names = "iface", "bus", "nrt_bus", "core";
+
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+
+ iommus = <&apps_smmu 0x1c00 0x2>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges;
+
+ display-controller@ae01000 {
+ compatible = "qcom,sm8550-dpu";
+ reg = <0x0ae01000 0x8f000>,
+ <0x0aeb0000 0x2008>;
+ reg-names = "mdp", "vbif";
+
+ clocks = <&gcc GCC_DISP_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_LUT_CLK>,
+ <&dispcc DISP_CC_MDSS_MDP_CLK>,
+ <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ clock-names = "bus",
+ "nrt_bus",
+ "iface",
+ "lut",
+ "core",
+ "vsync";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_VSYNC_CLK>;
+ assigned-clock-rates = <19200000>;
+
+ operating-points-v2 = <&mdp_opp_table>;
+ power-domains = <&rpmhpd SM8550_MMCX>;
+
+ interrupt-parent = <&mdss>;
+ interrupts = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dpu_intf1_out: endpoint {
+ remote-endpoint = <&dsi0_in>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dpu_intf2_out: endpoint {
+ remote-endpoint = <&dsi1_in>;
+ };
+ };
+ };
+
+ mdp_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-200000000 {
+ opp-hz = /bits/ 64 <200000000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-325000000 {
+ opp-hz = /bits/ 64 <325000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-375000000 {
+ opp-hz = /bits/ 64 <375000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+
+ opp-514000000 {
+ opp-hz = /bits/ 64 <514000000>;
+ required-opps = <&rpmhpd_opp_nom>;
+ };
+ };
+ };
+
+ dsi@ae94000 {
+ compatible = "qcom,sm8550-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae94000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <4>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE0_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC0_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE0_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK0_CLK_SRC>;
+ assigned-clock-parents = <&dsi0_phy 0>, <&dsi0_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SM8550_MMCX>;
+
+ phys = <&dsi0_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi0_in: endpoint {
+ remote-endpoint = <&dpu_intf1_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi0_out: endpoint {
+ };
+ };
+ };
+
+ dsi_opp_table: opp-table {
+ compatible = "operating-points-v2";
+
+ opp-187500000 {
+ opp-hz = /bits/ 64 <187500000>;
+ required-opps = <&rpmhpd_opp_low_svs>;
+ };
+
+ opp-300000000 {
+ opp-hz = /bits/ 64 <300000000>;
+ required-opps = <&rpmhpd_opp_svs>;
+ };
+
+ opp-358000000 {
+ opp-hz = /bits/ 64 <358000000>;
+ required-opps = <&rpmhpd_opp_svs_l1>;
+ };
+ };
+ };
+
+ dsi0_phy: phy@ae95000 {
+ compatible = "qcom,sm8550-dsi-phy-4nm";
+ reg = <0x0ae95000 0x200>,
+ <0x0ae95200 0x280>,
+ <0x0ae95500 0x400>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+ };
+
+ dsi@ae96000 {
+ compatible = "qcom,sm8550-dsi-ctrl", "qcom,mdss-dsi-ctrl";
+ reg = <0x0ae96000 0x400>;
+ reg-names = "dsi_ctrl";
+
+ interrupt-parent = <&mdss>;
+ interrupts = <5>;
+
+ clocks = <&dispcc DISP_CC_MDSS_BYTE1_CLK>,
+ <&dispcc DISP_CC_MDSS_BYTE1_INTF_CLK>,
+ <&dispcc DISP_CC_MDSS_PCLK1_CLK>,
+ <&dispcc DISP_CC_MDSS_ESC1_CLK>,
+ <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&gcc GCC_DISP_HF_AXI_CLK>;
+ clock-names = "byte",
+ "byte_intf",
+ "pixel",
+ "core",
+ "iface",
+ "bus";
+
+ assigned-clocks = <&dispcc DISP_CC_MDSS_BYTE1_CLK_SRC>,
+ <&dispcc DISP_CC_MDSS_PCLK1_CLK_SRC>;
+ assigned-clock-parents = <&dsi1_phy 0>, <&dsi1_phy 1>;
+
+ operating-points-v2 = <&dsi_opp_table>;
+ power-domains = <&rpmhpd SM8550_MMCX>;
+
+ phys = <&dsi1_phy>;
+ phy-names = "dsi";
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi1_in: endpoint {
+ remote-endpoint = <&dpu_intf2_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi1_out: endpoint {
+ };
+ };
+ };
+ };
+
+ dsi1_phy: phy@ae97000 {
+ compatible = "qcom,sm8550-dsi-phy-4nm";
+ reg = <0x0ae97000 0x200>,
+ <0x0ae97200 0x280>,
+ <0x0ae97500 0x400>;
+ reg-names = "dsi_phy",
+ "dsi_phy_lane",
+ "dsi_pll";
+
+ #clock-cells = <1>;
+ #phy-cells = <0>;
+
+ clocks = <&dispcc DISP_CC_MDSS_AHB_CLK>,
+ <&rpmhcc RPMH_CXO_CLK>;
+ clock-names = "iface", "ref";
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/panel/abt,y030xx067a.yaml b/Documentation/devicetree/bindings/display/panel/abt,y030xx067a.yaml
index a108029ecfab..acd2f3faa6b9 100644
--- a/Documentation/devicetree/bindings/display/panel/abt,y030xx067a.yaml
+++ b/Documentation/devicetree/bindings/display/panel/abt,y030xx067a.yaml
@@ -6,15 +6,12 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Asia Better Technology 3.0" (320x480 pixels) 24-bit IPS LCD panel
-description: |
- The panel must obey the rules for a SPI slave device as specified in
- spi/spi-controller.yaml
-
maintainers:
- Paul Cercueil <paul@crapouillou.net>
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/display/panel/advantech,idk-1110wr.yaml b/Documentation/devicetree/bindings/display/panel/advantech,idk-1110wr.yaml
index 93878c2cd370..f6fea9085aab 100644
--- a/Documentation/devicetree/bindings/display/panel/advantech,idk-1110wr.yaml
+++ b/Documentation/devicetree/bindings/display/panel/advantech,idk-1110wr.yaml
@@ -11,13 +11,23 @@ maintainers:
- Thierry Reding <thierry.reding@gmail.com>
allOf:
- - $ref: lvds.yaml#
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/display/lvds.yaml#
+
+select:
+ properties:
+ compatible:
+ contains:
+ const: advantech,idk-1110wr
+
+ required:
+ - compatible
properties:
compatible:
items:
- const: advantech,idk-1110wr
- - {} # panel-lvds, but not listed here to avoid false select
+ - const: panel-lvds
data-mapping:
const: jeida-24
@@ -35,6 +45,11 @@ additionalProperties: false
required:
- compatible
+ - data-mapping
+ - width-mm
+ - height-mm
+ - panel-timing
+ - port
examples:
- |+
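
The select clause added above changes when this schema is applied: without it, naming panel-lvds in the compatible items would make the schema match every panel-lvds node in a tree, which is why the second entry used to be an anonymous {} placeholder. With select keyed on advantech,idk-1110wr, the schema only triggers for this panel, so panel-lvds can now be listed explicitly. A conforming node, with the newly required properties stubbed out:

    panel {
        compatible = "advantech,idk-1110wr", "panel-lvds";
        data-mapping = "jeida-24";
        /* width-mm, height-mm, panel-timing and port as required above */
    };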
diff --git a/Documentation/devicetree/bindings/display/panel/arm,rtsm-display.yaml b/Documentation/devicetree/bindings/display/panel/arm,rtsm-display.yaml
new file mode 100644
index 000000000000..4ad484f09ba3
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/arm,rtsm-display.yaml
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/arm,rtsm-display.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm RTSM Virtual Platforms Display
+
+maintainers:
+ - Linus Walleij <linus.walleij@linaro.org>
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ const: arm,rtsm-display
+
+ port: true
+
+required:
+ - compatible
+ - port
+
+additionalProperties: false
+
+...
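
Unlike most panel schemas in this series, this one carries no example. A minimal node satisfying the two required properties would look like the sketch below; the remote endpoint label is a placeholder:

    panel {
        compatible = "arm,rtsm-display";

        port {
            panel_in: endpoint {
                remote-endpoint = <&display_pads>;  /* placeholder */
            };
        };
    };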
diff --git a/Documentation/devicetree/bindings/display/panel/arm,versatile-tft-panel.yaml b/Documentation/devicetree/bindings/display/panel/arm,versatile-tft-panel.yaml
index be69e0cc50fc..c9958f824d9a 100644
--- a/Documentation/devicetree/bindings/display/panel/arm,versatile-tft-panel.yaml
+++ b/Documentation/devicetree/bindings/display/panel/arm,versatile-tft-panel.yaml
@@ -37,9 +37,6 @@ examples:
compatible = "arm,versatile-sysreg", "syscon", "simple-mfd";
reg = <0x00000 0x1000>;
- #address-cells = <1>;
- #size-cells = <0>;
-
panel {
compatible = "arm,versatile-tft-panel";
diff --git a/Documentation/devicetree/bindings/display/panel/auo,a030jtn01.yaml b/Documentation/devicetree/bindings/display/panel/auo,a030jtn01.yaml
new file mode 100644
index 000000000000..86c834eb4d98
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/auo,a030jtn01.yaml
@@ -0,0 +1,60 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/auo,a030jtn01.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: AUO A030JTN01 3.0" (320x480 pixels) 24-bit TFT LCD panel
+
+description: |
+  Delta RGB 8-bit panel found in some retro-game handhelds.
+
+maintainers:
+ - Paul Cercueil <paul@crapouillou.net>
+ - Christophe Branchereau <cbranchereau@gmail.com>
+
+allOf:
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+
+properties:
+ compatible:
+ const: auo,a030jtn01
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - power-supply
+ - reset-gpios
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ panel@0 {
+ compatible = "auo,a030jtn01";
+ reg = <0>;
+
+ spi-max-frequency = <10000000>;
+
+ reset-gpios = <&gpe 4 GPIO_ACTIVE_LOW>;
+ power-supply = <&lcd_power>;
+
+ backlight = <&backlight>;
+
+ port {
+ panel_input: endpoint {
+ remote-endpoint = <&panel_output>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/panel/boe,bf060y8m-aj0.yaml b/Documentation/devicetree/bindings/display/panel/boe,bf060y8m-aj0.yaml
new file mode 100644
index 000000000000..a8f3afa922c8
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/boe,bf060y8m-aj0.yaml
@@ -0,0 +1,81 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/boe,bf060y8m-aj0.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: BOE BF060Y8M-AJ0 5.99" 1080x2160 AMOLED Panel
+
+maintainers:
+ - AngeloGioacchino Del Regno <angelogioacchino.delregno@somainline.org>
+
+description: |
+  This is a 5.99" 1080x2160 16.7M-color active-matrix AMOLED
+  video-mode panel module with a 4-lane MIPI-DSI interface, a
+  GGRB pixel arrangement and a 63 micrometer pitch, and an
+  active area of 68.04 x 136.08 millimeters.
+  Each pixel is divided into red and green dots, or blue and
+  green dots; two pixels share the red or blue dots, which are
+  arranged in a vertical stripe.
+  The driver IC for this panel module is the SW43404.
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ const: boe,bf060y8m-aj0
+
+ elvdd-supply:
+ description: EL Driving positive (VDD) supply (4.40-4.80V)
+ elvss-supply:
+ description: EL Driving negative (VSS) supply (-5.00V to -1.40V)
+ vcc-supply:
+ description: Core (TSP) voltage supply (2.70-3.60V)
+ vci-supply:
+ description: DriverIC Operation supply (2.60-3.60V)
+ vddio-supply:
+ description: I/O voltage supply (1.62-1.98V)
+
+ port: true
+ reg: true
+ reset-gpios: true
+
+required:
+ - compatible
+ - elvdd-supply
+ - elvss-supply
+ - vcc-supply
+ - vci-supply
+ - vddio-supply
+ - reg
+ - reset-gpios
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ dsi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ panel@0 {
+ compatible = "boe,bf060y8m-aj0";
+ reg = <0>;
+
+ reset-gpios = <&tlmm 94 GPIO_ACTIVE_HIGH>;
+
+ vcc-supply = <&disp_vcc_vreg>;
+ vddio-supply = <&disp_vddio_vreg>;
+ vci-supply = <&disp_vci_vreg>;
+ elvdd-supply = <&disp_elvdd_vreg>;
+ elvss-supply = <&disp_elvss_vreg>;
+
+ port {
+ panel_in: endpoint {
+ remote-endpoint = <&dsi0_out>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/panel/boe,tv101wum-nl6.yaml b/Documentation/devicetree/bindings/display/panel/boe,tv101wum-nl6.yaml
index b87a2e28c866..aed55608ebf6 100644
--- a/Documentation/devicetree/bindings/display/panel/boe,tv101wum-nl6.yaml
+++ b/Documentation/devicetree/bindings/display/panel/boe,tv101wum-nl6.yaml
@@ -26,6 +26,12 @@ properties:
- auo,b101uan08.3
# BOE TV105WUM-NW0 10.5" WUXGA TFT LCD panel
- boe,tv105wum-nw0
+ # BOE TV110C9M-LL3 10.95" WUXGA TFT LCD panel
+ - boe,tv110c9m-ll3
+ # INX HJ110IZ-01A 10.95" WUXGA TFT LCD panel
+ - innolux,hj110iz-01a
+ # STARRY 2081101QFH032011-53G 10.1" WUXGA TFT LCD panel
+ - starry,2081101qfh032011-53g
reg:
description: the virtual channel number of a DSI peripheral
@@ -36,6 +42,9 @@ properties:
pp1800-supply:
description: core voltage supply
+ pp3300-supply:
+ description: core voltage supply
+
avdd-supply:
description: phandle of the regulator that provides positive voltage
@@ -46,6 +55,7 @@ properties:
description: phandle of the backlight device attached to the panel
port: true
+ rotation: true
required:
- compatible
diff --git a/Documentation/devicetree/bindings/display/panel/display-timings.yaml b/Documentation/devicetree/bindings/display/panel/display-timings.yaml
index 56903ded005e..dc5f7e36e30b 100644
--- a/Documentation/devicetree/bindings/display/panel/display-timings.yaml
+++ b/Documentation/devicetree/bindings/display/panel/display-timings.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/panel/display-timings.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: display timings bindings
+title: display timings
maintainers:
- Thierry Reding <thierry.reding@gmail.com>
@@ -31,8 +31,7 @@ properties:
patternProperties:
"^timing":
type: object
- allOf:
- - $ref: panel-timing.yaml#
+ $ref: panel-timing.yaml#
additionalProperties: false
diff --git a/Documentation/devicetree/bindings/display/panel/ebbg,ft8719.yaml b/Documentation/devicetree/bindings/display/panel/ebbg,ft8719.yaml
new file mode 100644
index 000000000000..80deedc01c7c
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/ebbg,ft8719.yaml
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/ebbg,ft8719.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: EBBG FT8719 MIPI-DSI LCD panel
+
+maintainers:
+ - Joel Selvaraj <jo@jsfamily.in>
+
+description: |
+  The FT8719 panel from EBBG is an FHD+ LCD panel with a resolution of
+  1080x2246. It is a video-mode DSI panel. The backlight is managed
+  through the QCOM WLED driver.
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ const: ebbg,ft8719
+
+ reg:
+ maxItems: 1
+ description: DSI virtual channel of the peripheral
+
+ vddio-supply:
+ description: power IC supply regulator
+
+ vddpos-supply:
+ description: positive boost supply regulator
+
+ vddneg-supply:
+ description: negative boost supply regulator
+
+required:
+ - compatible
+ - reg
+ - vddio-supply
+ - vddpos-supply
+ - vddneg-supply
+ - reset-gpios
+ - port
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ dsi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ panel@0 {
+ compatible = "ebbg,ft8719";
+ reg = <0>;
+
+ vddio-supply = <&vreg_l14a_1p88>;
+ vddpos-supply = <&lab>;
+ vddneg-supply = <&ibb>;
+
+ reset-gpios = <&tlmm 6 GPIO_ACTIVE_LOW>;
+
+ backlight = <&pmi8998_wled>;
+
+ port {
+ ebbg_ft8719_in_0: endpoint {
+ remote-endpoint = <&dsi0_out>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/panel/elida,kd35t133.yaml b/Documentation/devicetree/bindings/display/panel/elida,kd35t133.yaml
index 7adb83e2e8d9..265ab6d30572 100644
--- a/Documentation/devicetree/bindings/display/panel/elida,kd35t133.yaml
+++ b/Documentation/devicetree/bindings/display/panel/elida,kd35t133.yaml
@@ -17,7 +17,9 @@ properties:
const: elida,kd35t133
reg: true
backlight: true
+ port: true
reset-gpios: true
+ rotation: true
iovcc-supply:
description: regulator that supplies the iovcc voltage
vdd-supply:
@@ -27,6 +29,7 @@ required:
- compatible
- reg
- backlight
+ - port
- iovcc-supply
- vdd-supply
@@ -43,6 +46,12 @@ examples:
backlight = <&backlight>;
iovcc-supply = <&vcc_1v8>;
vdd-supply = <&vcc3v3_lcd>;
+
+ port {
+ mipi_in_panel: endpoint {
+ remote-endpoint = <&mipi_out_panel>;
+ };
+ };
};
};
diff --git a/Documentation/devicetree/bindings/display/panel/feiyang,fy07024di26a30d.yaml b/Documentation/devicetree/bindings/display/panel/feiyang,fy07024di26a30d.yaml
index 95acf9e96f1c..92df69e80a82 100644
--- a/Documentation/devicetree/bindings/display/panel/feiyang,fy07024di26a30d.yaml
+++ b/Documentation/devicetree/bindings/display/panel/feiyang,fy07024di26a30d.yaml
@@ -26,6 +26,7 @@ properties:
dvdd-supply:
description: 3v3 digital regulator
+ port: true
reset-gpios: true
backlight: true
@@ -35,7 +36,7 @@ required:
- reg
- avdd-supply
- dvdd-supply
- - reset-gpios
+ - port
additionalProperties: false
@@ -54,5 +55,11 @@ examples:
dvdd-supply = <&reg_dldo2>;
reset-gpios = <&pio 3 24 GPIO_ACTIVE_HIGH>; /* LCD-RST: PD24 */
backlight = <&backlight>;
+
+ port {
+ mipi_in_panel: endpoint {
+ remote-endpoint = <&mipi_out_panel>;
+ };
+ };
};
};
diff --git a/Documentation/devicetree/bindings/display/panel/focaltech,gpt3.yaml b/Documentation/devicetree/bindings/display/panel/focaltech,gpt3.yaml
new file mode 100644
index 000000000000..d54e96b2a9e1
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/focaltech,gpt3.yaml
@@ -0,0 +1,56 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/focaltech,gpt3.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Focaltech GPT3 3.0" (640x480 pixels) IPS LCD panel
+
+maintainers:
+ - Christophe Branchereau <cbranchereau@gmail.com>
+
+allOf:
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+
+properties:
+ compatible:
+ const: focaltech,gpt3
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - power-supply
+ - reset-gpios
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ panel@0 {
+ compatible = "focaltech,gpt3";
+ reg = <0>;
+
+ spi-max-frequency = <3125000>;
+
+ reset-gpios = <&gpe 2 GPIO_ACTIVE_LOW>;
+
+ backlight = <&backlight>;
+ power-supply = <&vcc>;
+
+ port {
+ panel_input: endpoint {
+ remote-endpoint = <&panel_output>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/panel/himax,hx8394.yaml b/Documentation/devicetree/bindings/display/panel/himax,hx8394.yaml
new file mode 100644
index 000000000000..1b2a1baa26f9
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/himax,hx8394.yaml
@@ -0,0 +1,76 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/himax,hx8394.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Himax HX8394 MIPI-DSI LCD panel controller
+
+maintainers:
+ - Ondrej Jirman <megi@xff.cz>
+ - Javier Martinez Canillas <javierm@redhat.com>
+
+description:
+ Device tree bindings for panels based on the Himax HX8394 controller,
+ such as the HannStar HSD060BHW4 720x1440 TFT LCD panel connected with
+ a MIPI-DSI video interface.
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - hannstar,hsd060bhw4
+ - const: himax,hx8394
+
+ reg: true
+
+ reset-gpios: true
+
+ backlight: true
+
+ port: true
+
+ vcc-supply:
+ description: Panel power supply
+
+ iovcc-supply:
+ description: I/O voltage supply
+
+required:
+ - compatible
+ - reg
+ - reset-gpios
+ - backlight
+ - port
+ - vcc-supply
+ - iovcc-supply
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ dsi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ panel@0 {
+ compatible = "hannstar,hsd060bhw4", "himax,hx8394";
+ reg = <0>;
+ vcc-supply = <&reg_2v8_p>;
+ iovcc-supply = <&reg_1v8_p>;
+ reset-gpios = <&gpio3 13 GPIO_ACTIVE_LOW>;
+ backlight = <&backlight>;
+
+ port {
+ mipi_in_panel: endpoint {
+ remote-endpoint = <&mipi_out_panel>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/panel/ilitek,ili9163.yaml b/Documentation/devicetree/bindings/display/panel/ilitek,ili9163.yaml
new file mode 100644
index 000000000000..90e323e19edb
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/ilitek,ili9163.yaml
@@ -0,0 +1,70 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/ilitek,ili9163.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Ilitek ILI9163 display panels
+
+maintainers:
+ - Daniel Mack <daniel@zonque.org>
+
+description:
+ This binding is for display panels using an Ilitek ILI9163 controller in SPI
+ mode.
+
+allOf:
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - newhaven,1.8-128160EF
+ - const: ilitek,ili9163
+
+ spi-max-frequency:
+ maximum: 32000000
+
+ dc-gpios:
+ maxItems: 1
+ description: Display data/command selection (D/CX)
+
+ backlight: true
+ reg: true
+ reset-gpios: true
+ rotation: true
+
+required:
+ - compatible
+ - reg
+ - dc-gpios
+ - reset-gpios
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ backlight: backlight {
+ compatible = "gpio-backlight";
+ gpios = <&gpio 22 GPIO_ACTIVE_HIGH>;
+ };
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ display@0 {
+ compatible = "newhaven,1.8-128160EF", "ilitek,ili9163";
+ reg = <0>;
+ spi-max-frequency = <32000000>;
+ dc-gpios = <&gpio0 24 GPIO_ACTIVE_HIGH>;
+ reset-gpios = <&gpio0 25 GPIO_ACTIVE_HIGH>;
+ rotation = <180>;
+ backlight = <&backlight>;
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/panel/ilitek,ili9322.yaml b/Documentation/devicetree/bindings/display/panel/ilitek,ili9322.yaml
index e89c1ea62ffa..7d221ef35443 100644
--- a/Documentation/devicetree/bindings/display/panel/ilitek,ili9322.yaml
+++ b/Documentation/devicetree/bindings/display/panel/ilitek,ili9322.yaml
@@ -15,11 +15,9 @@ description: |
960 TFT source driver pins and 240 TFT gate driver pins, VCOM, VCOML and
VCOMH outputs.
- The panel must obey the rules for a SPI slave device as specified in
- spi/spi-controller.yaml
-
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/display/panel/ilitek,ili9341.yaml b/Documentation/devicetree/bindings/display/panel/ilitek,ili9341.yaml
index 20ce88ab4b3a..94f169ea065a 100644
--- a/Documentation/devicetree/bindings/display/panel/ilitek,ili9341.yaml
+++ b/Documentation/devicetree/bindings/display/panel/ilitek,ili9341.yaml
@@ -16,13 +16,16 @@ description: |
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
items:
- enum:
+ - adafruit,yx240qv29
# ili9341 240*320 Color on stm32f429-disco board
- st,sf-tc240t-9370-t
+ - canaan,kd233-tft
- const: ilitek,ili9341
reg: true
@@ -47,32 +50,50 @@ properties:
vddi-led-supply:
description: Voltage supply for the LED driver (1.65 .. 3.3 V)
-additionalProperties: false
+unevaluatedProperties: false
required:
- compatible
- reg
- dc-gpios
- - port
+
+if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - st,sf-tc240t-9370-t
+then:
+ required:
+ - port
examples:
- |+
+ #include <dt-bindings/gpio/gpio.h>
spi {
#address-cells = <1>;
#size-cells = <0>;
panel: display@0 {
- compatible = "st,sf-tc240t-9370-t",
- "ilitek,ili9341";
- reg = <0>;
- spi-3wire;
- spi-max-frequency = <10000000>;
- dc-gpios = <&gpiod 13 0>;
- port {
- panel_in: endpoint {
- remote-endpoint = <&display_out>;
- };
- };
- };
+ compatible = "st,sf-tc240t-9370-t",
+ "ilitek,ili9341";
+ reg = <0>;
+ spi-3wire;
+ spi-max-frequency = <10000000>;
+ dc-gpios = <&gpiod 13 0>;
+ port {
+ panel_in: endpoint {
+ remote-endpoint = <&display_out>;
+ };
+ };
};
+ display@1 {
+ compatible = "adafruit,yx240qv29", "ilitek,ili9341";
+ reg = <1>;
+ spi-max-frequency = <10000000>;
+ dc-gpios = <&gpio0 9 GPIO_ACTIVE_HIGH>;
+ reset-gpios = <&gpio0 8 GPIO_ACTIVE_HIGH>;
+ rotation = <270>;
+ backlight = <&backlight>;
+ };
+ };
...
-
diff --git a/Documentation/devicetree/bindings/display/panel/ilitek,ili9881c.yaml b/Documentation/devicetree/bindings/display/panel/ilitek,ili9881c.yaml
index b2fcec4f22fd..c5d1df680858 100644
--- a/Documentation/devicetree/bindings/display/panel/ilitek,ili9881c.yaml
+++ b/Documentation/devicetree/bindings/display/panel/ilitek,ili9881c.yaml
@@ -9,24 +9,28 @@ title: Ilitek ILI9881c based MIPI-DSI panels
maintainers:
- Maxime Ripard <mripard@kernel.org>
+allOf:
+ - $ref: panel-common.yaml#
+
properties:
compatible:
items:
- enum:
- bananapi,lhr050h41
- feixin,k101-im2byl02
+ - wanchanglong,w552946aba
- const: ilitek,ili9881c
backlight: true
power-supply: true
reg: true
reset-gpios: true
+ rotation: true
required:
- compatible
- power-supply
- reg
- - reset-gpios
additionalProperties: false
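With the relaxed schema above, an ILI9881c panel node may omit reset-gpios and can declare its mounting orientation through rotation. A minimal sketch of such a node, assuming the wanchanglong panel added above; the regulator and GPIO phandles are hypothetical:

    dsi {
        #address-cells = <1>;
        #size-cells = <0>;

        panel@0 {
            compatible = "wanchanglong,w552946aba", "ilitek,ili9881c";
            reg = <0>;
            power-supply = <&reg_display>;  /* hypothetical regulator label */
            backlight = <&backlight>;
            rotation = <180>;               /* panel mounted upside down */
        };
    };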
diff --git a/Documentation/devicetree/bindings/display/panel/innolux,ee101ia-01d.yaml b/Documentation/devicetree/bindings/display/panel/innolux,ee101ia-01d.yaml
index a69681e724cb..ab6b7be88341 100644
--- a/Documentation/devicetree/bindings/display/panel/innolux,ee101ia-01d.yaml
+++ b/Documentation/devicetree/bindings/display/panel/innolux,ee101ia-01d.yaml
@@ -11,15 +11,26 @@ maintainers:
- Thierry Reding <thierry.reding@gmail.com>
allOf:
- - $ref: lvds.yaml#
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/display/lvds.yaml#
+
+select:
+ properties:
+ compatible:
+ contains:
+ const: innolux,ee101ia-01d
+
+ required:
+ - compatible
properties:
compatible:
items:
- const: innolux,ee101ia-01d
- - {} # panel-lvds, but not listed here to avoid false select
+ - const: panel-lvds
backlight: true
+ data-mapping: true
enable-gpios: true
power-supply: true
width-mm: true
@@ -27,5 +38,13 @@ properties:
panel-timing: true
port: true
+required:
+ - compatible
+ - data-mapping
+ - width-mm
+ - height-mm
+ - panel-timing
+ - port
+
additionalProperties: false
...
diff --git a/Documentation/devicetree/bindings/display/panel/innolux,ej030na.yaml b/Documentation/devicetree/bindings/display/panel/innolux,ej030na.yaml
index cda36c04e85c..72788e3e6c59 100644
--- a/Documentation/devicetree/bindings/display/panel/innolux,ej030na.yaml
+++ b/Documentation/devicetree/bindings/display/panel/innolux,ej030na.yaml
@@ -6,15 +6,12 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Innolux EJ030NA 3.0" (320x480 pixels) 24-bit TFT LCD panel
-description: |
- The panel must obey the rules for a SPI slave device as specified in
- spi/spi-controller.yaml
-
maintainers:
- Paul Cercueil <paul@crapouillou.net>
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/display/panel/innolux,p120zdg-bf1.yaml b/Documentation/devicetree/bindings/display/panel/innolux,p120zdg-bf1.yaml
deleted file mode 100644
index 243dac2416f3..000000000000
--- a/Documentation/devicetree/bindings/display/panel/innolux,p120zdg-bf1.yaml
+++ /dev/null
@@ -1,43 +0,0 @@
-# SPDX-License-Identifier: GPL-2.0
-%YAML 1.2
----
-$id: http://devicetree.org/schemas/display/panel/innolux,p120zdg-bf1.yaml#
-$schema: http://devicetree.org/meta-schemas/core.yaml#
-
-title: Innolux P120ZDG-BF1 12.02 inch eDP 2K display panel
-
-maintainers:
- - Sandeep Panda <spanda@codeaurora.org>
- - Douglas Anderson <dianders@chromium.org>
-
-allOf:
- - $ref: panel-common.yaml#
-
-properties:
- compatible:
- const: innolux,p120zdg-bf1
-
- enable-gpios: true
- power-supply: true
- backlight: true
- no-hpd: true
-
-required:
- - compatible
- - power-supply
-
-additionalProperties: false
-
-examples:
- - |
- #include <dt-bindings/gpio/gpio.h>
-
- panel_edp: panel-edp {
- compatible = "innolux,p120zdg-bf1";
- enable-gpios = <&msmgpio 31 GPIO_ACTIVE_LOW>;
- power-supply = <&pm8916_l2>;
- backlight = <&backlight>;
- no-hpd;
- };
-
-...
diff --git a/Documentation/devicetree/bindings/display/panel/jadard,jd9365da-h3.yaml b/Documentation/devicetree/bindings/display/panel/jadard,jd9365da-h3.yaml
new file mode 100644
index 000000000000..41eb7fbf7715
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/jadard,jd9365da-h3.yaml
@@ -0,0 +1,72 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/jadard,jd9365da-h3.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Jadard JD9365DA-HE WXGA DSI panel
+
+maintainers:
+ - Jagan Teki <jagan@edgeble.ai>
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - chongzhou,cz101b4001
+ - radxa,display-10hd-ad001
+ - radxa,display-8hd-ad002
+ - const: jadard,jd9365da-h3
+
+ reg: true
+
+ vdd-supply:
+ description: supply regulator for VDD, usually 3.3V
+
+ vccio-supply:
+ description: supply regulator for VCCIO, usually 1.8V
+
+ reset-gpios: true
+
+ backlight: true
+
+ port: true
+
+required:
+ - compatible
+ - reg
+ - vdd-supply
+ - vccio-supply
+ - reset-gpios
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/pinctrl/rockchip.h>
+
+ dsi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ panel@0 {
+ compatible = "chongzhou,cz101b4001", "jadard,jd9365da-h3";
+ reg = <0>;
+ vdd-supply = <&lcd_3v3>;
+ vccio-supply = <&vcca_1v8>;
+ reset-gpios = <&gpio1 RK_PC2 GPIO_ACTIVE_HIGH>;
+ backlight = <&backlight>;
+
+ port {
+ mipi_in_panel: endpoint {
+ remote-endpoint = <&mipi_out_panel>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/panel/jdi,lt070me05000.yaml b/Documentation/devicetree/bindings/display/panel/jdi,lt070me05000.yaml
index 4f92365e888a..63c82a4378ff 100644
--- a/Documentation/devicetree/bindings/display/panel/jdi,lt070me05000.yaml
+++ b/Documentation/devicetree/bindings/display/panel/jdi,lt070me05000.yaml
@@ -35,6 +35,8 @@ properties:
phandle of the gpio for power ic line
Power IC supply enable, High active
+ port: true
+
required:
- compatible
- reg
diff --git a/Documentation/devicetree/bindings/display/panel/kingdisplay,kd035g6-54nt.yaml b/Documentation/devicetree/bindings/display/panel/kingdisplay,kd035g6-54nt.yaml
index c45c92a3d41f..b4be9bd8ddde 100644
--- a/Documentation/devicetree/bindings/display/panel/kingdisplay,kd035g6-54nt.yaml
+++ b/Documentation/devicetree/bindings/display/panel/kingdisplay,kd035g6-54nt.yaml
@@ -6,15 +6,12 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: King Display KD035G6-54NT 3.5" (320x240 pixels) 24-bit TFT LCD panel
-description: |
- The panel must obey the rules for a SPI slave device as specified in
- spi/spi-controller.yaml
-
maintainers:
- Paul Cercueil <paul@crapouillou.net>
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
@@ -26,6 +23,8 @@ properties:
reg: true
reset-gpios: true
+ spi-3wire: true
+
required:
- compatible
- power-supply
diff --git a/Documentation/devicetree/bindings/display/panel/leadtek,ltk035c5444t.yaml b/Documentation/devicetree/bindings/display/panel/leadtek,ltk035c5444t.yaml
new file mode 100644
index 000000000000..ebdca5f5a001
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/leadtek,ltk035c5444t.yaml
@@ -0,0 +1,61 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/leadtek,ltk035c5444t.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Leadtek ltk035c5444t 3.5" (640x480 pixels) 24-bit IPS LCD panel
+
+maintainers:
+ - Paul Cercueil <paul@crapouillou.net>
+ - Christophe Branchereau <cbranchereau@gmail.com>
+
+allOf:
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+
+properties:
+ compatible:
+ const: leadtek,ltk035c5444t
+
+ backlight: true
+ port: true
+ power-supply: true
+ reg: true
+ reset-gpios: true
+
+ spi-3wire: true
+
+required:
+ - compatible
+ - power-supply
+ - reset-gpios
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ panel@0 {
+ compatible = "leadtek,ltk035c5444t";
+ reg = <0>;
+
+ spi-3wire;
+ spi-max-frequency = <3125000>;
+
+ reset-gpios = <&gpe 2 GPIO_ACTIVE_LOW>;
+
+ backlight = <&backlight>;
+ power-supply = <&vcc>;
+
+ port {
+ panel_input: endpoint {
+ remote-endpoint = <&panel_output>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/panel/leadtek,ltk050h3146w.yaml b/Documentation/devicetree/bindings/display/panel/leadtek,ltk050h3146w.yaml
index 3715882b63b6..3f6efbb942da 100644
--- a/Documentation/devicetree/bindings/display/panel/leadtek,ltk050h3146w.yaml
+++ b/Documentation/devicetree/bindings/display/panel/leadtek,ltk050h3146w.yaml
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Leadtek LTK050H3146W 5.0in 720x1280 DSI panel
maintainers:
- - Heiko Stuebner <heiko.stuebner@theobroma-systems.com>
+ - Quentin Schulz <quentin.schulz@theobroma-systems.com>
allOf:
- $ref: panel-common.yaml#
diff --git a/Documentation/devicetree/bindings/display/panel/lg,lg4573.yaml b/Documentation/devicetree/bindings/display/panel/lg,lg4573.yaml
index b4314ce7b411..ee357e139ac0 100644
--- a/Documentation/devicetree/bindings/display/panel/lg,lg4573.yaml
+++ b/Documentation/devicetree/bindings/display/panel/lg,lg4573.yaml
@@ -15,13 +15,13 @@ maintainers:
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
const: lg,lg4573
reg: true
- spi-max-frequency: true
required:
- compatible
diff --git a/Documentation/devicetree/bindings/display/panel/lgphilips,lb035q02.yaml b/Documentation/devicetree/bindings/display/panel/lgphilips,lb035q02.yaml
index 830e335ddb53..628c4b898111 100644
--- a/Documentation/devicetree/bindings/display/panel/lgphilips,lb035q02.yaml
+++ b/Documentation/devicetree/bindings/display/panel/lgphilips,lb035q02.yaml
@@ -6,15 +6,12 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: LG.Philips LB035Q02 Panel
-description: |
- The panel must obey the rules for a SPI slave device as specified in
- spi/spi-controller.yaml
-
maintainers:
- Tomi Valkeinen <tomi.valkeinen@ti.com>
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
@@ -24,6 +21,9 @@ properties:
enable-gpios: true
port: true
+ spi-cpha: true
+ spi-cpol: true
+
required:
- compatible
- enable-gpios
diff --git a/Documentation/devicetree/bindings/display/panel/mitsubishi,aa104xd12.yaml b/Documentation/devicetree/bindings/display/panel/mitsubishi,aa104xd12.yaml
index b5e7ee230fa6..3623ffa6518d 100644
--- a/Documentation/devicetree/bindings/display/panel/mitsubishi,aa104xd12.yaml
+++ b/Documentation/devicetree/bindings/display/panel/mitsubishi,aa104xd12.yaml
@@ -11,13 +11,23 @@ maintainers:
- Thierry Reding <thierry.reding@gmail.com>
allOf:
- - $ref: lvds.yaml#
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/display/lvds.yaml#
+
+select:
+ properties:
+ compatible:
+ contains:
+ const: mitsubishi,aa104xd12
+
+ required:
+ - compatible
properties:
compatible:
items:
- const: mitsubishi,aa104xd12
- - {} # panel-lvds, but not listed here to avoid false select
+ - const: panel-lvds
vcc-supply:
description: Reference to the regulator powering the panel VCC pins.
@@ -39,6 +49,11 @@ additionalProperties: false
required:
- compatible
- vcc-supply
+ - data-mapping
+ - width-mm
+ - height-mm
+ - panel-timing
+ - port
examples:
- |+
diff --git a/Documentation/devicetree/bindings/display/panel/mitsubishi,aa121td01.yaml b/Documentation/devicetree/bindings/display/panel/mitsubishi,aa121td01.yaml
index 977c50a85b67..37f01d847aac 100644
--- a/Documentation/devicetree/bindings/display/panel/mitsubishi,aa121td01.yaml
+++ b/Documentation/devicetree/bindings/display/panel/mitsubishi,aa121td01.yaml
@@ -11,13 +11,23 @@ maintainers:
- Thierry Reding <thierry.reding@gmail.com>
allOf:
- - $ref: lvds.yaml#
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/display/lvds.yaml#
+
+select:
+ properties:
+ compatible:
+ contains:
+ const: mitsubishi,aa121td01
+
+ required:
+ - compatible
properties:
compatible:
items:
- const: mitsubishi,aa121td01
- - {} # panel-lvds, but not listed here to avoid false select
+ - const: panel-lvds
vcc-supply:
description: Reference to the regulator powering the panel VCC pins.
@@ -39,6 +49,11 @@ additionalProperties: false
required:
- compatible
- vcc-supply
+ - data-mapping
+ - width-mm
+ - height-mm
+ - panel-timing
+ - port
examples:
- |+
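The reworked select clauses above make the panel-lvds fallback explicit for both Mitsubishi panels (and for the Innolux EE101IA-01D), and promote the common LVDS properties to required. A sketch of a conforming aa104xd12 node; the timing values are illustrative placeholders rather than datasheet numbers, and the phandles are hypothetical:

    panel {
        compatible = "mitsubishi,aa104xd12", "panel-lvds";
        vcc-supply = <&vcc_3v3>;
        width-mm = <211>;
        height-mm = <158>;
        data-mapping = "jeida-24";

        panel-timing {
            /* illustrative XGA timing only - consult the datasheet */
            clock-frequency = <65000000>;
            hactive = <1024>;
            vactive = <768>;
            hsync-len = <136>;
            hfront-porch = <24>;
            hback-porch = <160>;
            vsync-len = <6>;
            vfront-porch = <3>;
            vback-porch = <29>;
        };

        port {
            panel_in: endpoint {
                remote-endpoint = <&lvds_out>;
            };
        };
    };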
diff --git a/Documentation/devicetree/bindings/display/panel/nec,nl8048hl11.yaml b/Documentation/devicetree/bindings/display/panel/nec,nl8048hl11.yaml
index aa788eaa2f71..accf933d6e46 100644
--- a/Documentation/devicetree/bindings/display/panel/nec,nl8048hl11.yaml
+++ b/Documentation/devicetree/bindings/display/panel/nec,nl8048hl11.yaml
@@ -15,6 +15,7 @@ maintainers:
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
@@ -34,13 +35,13 @@ required:
- reset-gpios
- port
-additionalProperties: false
+unevaluatedProperties: false
examples:
- |
#include <dt-bindings/gpio/gpio.h>
- spi0 {
+ spi {
#address-cells = <1>;
#size-cells = <0>;
diff --git a/Documentation/devicetree/bindings/display/panel/newvision,nv3051d.yaml b/Documentation/devicetree/bindings/display/panel/newvision,nv3051d.yaml
new file mode 100644
index 000000000000..116c1b6030a2
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/newvision,nv3051d.yaml
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/newvision,nv3051d.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NewVision NV3051D based LCD panel
+
+description: |
+ The NewVision NV3051D is a driver chip used to drive DSI panels. For now,
+ this driver only supports the 640x480 panels found in the Anbernic RG353
+ based devices.
+
+maintainers:
+ - Chris Morgan <macromorgan@hotmail.com>
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - anbernic,rg353p-panel
+ - anbernic,rg353v-panel
+ - const: newvision,nv3051d
+
+ reg: true
+ backlight: true
+ port: true
+ reset-gpios:
+ description: Active low reset GPIO
+ vdd-supply: true
+
+required:
+ - compatible
+ - reg
+ - backlight
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ dsi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ panel@0 {
+ compatible = "anbernic,rg353p-panel", "newvision,nv3051d";
+ reg = <0>;
+ backlight = <&backlight>;
+ reset-gpios = <&gpio4 0 GPIO_ACTIVE_LOW>;
+ vdd-supply = <&vcc3v3_lcd>;
+
+ port {
+ mipi_in_panel: endpoint {
+ remote-endpoint = <&mipi_out_panel>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/panel/novatek,nt35950.yaml b/Documentation/devicetree/bindings/display/panel/novatek,nt35950.yaml
new file mode 100644
index 000000000000..377a05d48a02
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/novatek,nt35950.yaml
@@ -0,0 +1,106 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/novatek,nt35950.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Novatek NT35950-based display panels
+
+maintainers:
+ - AngeloGioacchino Del Regno <angelogioacchino.delregno@somainline.org>
+
+description: |
+ The NT35950 IC from Novatek is a driver IC used to drive MIPI-DSI panels.
+ It has static RAM for content retention in command mode and also supports
+ video mode with VESA Frame Buffer Compression or Display Stream Compression
+ on a single DSI port or on dual DSI ports.
+ This DDIC is also capable of upscaling an input image to the panel's native
+ resolution; for example, it can upscale a 1920x1080 input to 3840x2160 with
+ either bilinear interpolation or pixel duplication.
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - sharp,ls055d1sx04
+ - const: novatek,nt35950
+ description: This indicates the panel manufacturer of the panel
+ that is in turn using the NT35950 panel driver. The compatible
+ string determines how the NT35950 panel driver shall be configured
+ to work with the indicated panel. The novatek,nt35950 compatible shall
+ always be provided as a fallback.
+
+ reset-gpios:
+ maxItems: 1
+ description: phandle of the GPIO for the reset line. This should be 8mA;
+ the GPIO can be configured using mux, pinctrl, pinctrl-names (active high)
+
+ avdd-supply:
+ description: positive boost supply regulator
+ avee-supply:
+ description: negative boost supply regulator
+ dvdd-supply:
+ description: regulator that supplies the digital voltage
+ vddio-supply:
+ description: regulator that supplies the I/O voltage
+
+ backlight: true
+ ports: true
+ reg: true
+
+required:
+ - compatible
+ - reg
+ - reset-gpios
+ - avdd-supply
+ - avee-supply
+ - dvdd-supply
+ - vddio-supply
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ dsi0 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ panel@0 {
+ compatible = "sharp,ls055d1sx04", "novatek,nt35950";
+ reg = <0>;
+
+ backlight = <&pmi8998_wled>;
+ reset-gpios = <&tlmm 94 GPIO_ACTIVE_HIGH>;
+
+ avdd-supply = <&lab>;
+ avee-supply = <&ibb>;
+ dvdd-supply = <&disp_dvdd_vreg>;
+ vddio-supply = <&vreg_l14a_1p85>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ panel_in0: endpoint {
+ remote-endpoint = <&dsi0_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ panel_in1: endpoint {
+ remote-endpoint = <&dsi1_out>;
+ };
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/panel/novatek,nt36523.yaml b/Documentation/devicetree/bindings/display/panel/novatek,nt36523.yaml
new file mode 100644
index 000000000000..0039561ef04c
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/novatek,nt36523.yaml
@@ -0,0 +1,85 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/novatek,nt36523.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Novatek NT36523 based DSI display Panels
+
+maintainers:
+ - Jianhua Lu <lujianhua000@gmail.com>
+
+description: |
+ The Novatek NT36523 is a generic DSI panel IC used to drive DSI
+ panels. It supports video mode panels from China Star Optoelectronics
+ Technology (CSOT) and BOE Technology.
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - xiaomi,elish-boe-nt36523
+ - xiaomi,elish-csot-nt36523
+ - const: novatek,nt36523
+
+ reset-gpios:
+ maxItems: 1
+ description: phandle of the GPIO for the reset line. This should be 8mA
+
+ vddio-supply:
+ description: regulator that supplies the I/O voltage
+
+ reg: true
+ ports: true
+ backlight: true
+
+required:
+ - compatible
+ - reg
+ - vddio-supply
+ - reset-gpios
+ - ports
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ dsi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ panel@0 {
+ compatible = "xiaomi,elish-csot-nt36523", "novatek,nt36523";
+ reg = <0>;
+
+ vddio-supply = <&vreg_l14a_1p88>;
+ reset-gpios = <&tlmm 75 GPIO_ACTIVE_LOW>;
+ backlight = <&backlight>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ panel_in_0: endpoint {
+ remote-endpoint = <&dsi0_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ panel_in_1: endpoint {
+ remote-endpoint = <&dsi1_out>;
+ };
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/panel/novatek,nt36672a.yaml b/Documentation/devicetree/bindings/display/panel/novatek,nt36672a.yaml
index ef4c0a24512d..ae821f465e1c 100644
--- a/Documentation/devicetree/bindings/display/panel/novatek,nt36672a.yaml
+++ b/Documentation/devicetree/bindings/display/panel/novatek,nt36672a.yaml
@@ -46,11 +46,12 @@ properties:
reg: true
port: true
+ backlight: true
required:
- compatible
- reg
- - vddi0-supply
+ - vddio-supply
- vddpos-supply
- vddneg-supply
- reset-gpios
@@ -69,14 +70,13 @@ examples:
panel@0 {
compatible = "tianma,fhd-video", "novatek,nt36672a";
reg = <0>;
- vddi0-supply = <&vreg_l14a_1p88>;
+ vddio-supply = <&vreg_l14a_1p88>;
vddpos-supply = <&lab>;
vddneg-supply = <&ibb>;
+ backlight = <&pmi8998_wled>;
reset-gpios = <&tlmm 6 GPIO_ACTIVE_HIGH>;
- #address-cells = <1>;
- #size-cells = <0>;
port {
tianma_nt36672a_in_0: endpoint {
remote-endpoint = <&dsi0_out>;
diff --git a/Documentation/devicetree/bindings/display/panel/olimex,lcd-olinuxino.yaml b/Documentation/devicetree/bindings/display/panel/olimex,lcd-olinuxino.yaml
index 2329d9610f83..9f97598efdfa 100644
--- a/Documentation/devicetree/bindings/display/panel/olimex,lcd-olinuxino.yaml
+++ b/Documentation/devicetree/bindings/display/panel/olimex,lcd-olinuxino.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/panel/olimex,lcd-olinuxino.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Binding for Olimex Ltd. LCD-OLinuXino bridge panel.
+title: Olimex Ltd. LCD-OLinuXino bridge panel
maintainers:
- Stefan Mavrodiev <stefan@olimex.com>
diff --git a/Documentation/devicetree/bindings/display/panel/orisetech,otm8009a.yaml b/Documentation/devicetree/bindings/display/panel/orisetech,otm8009a.yaml
index 4b6dda6dbc0f..ad7d3575190e 100644
--- a/Documentation/devicetree/bindings/display/panel/orisetech,otm8009a.yaml
+++ b/Documentation/devicetree/bindings/display/panel/orisetech,otm8009a.yaml
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Orise Tech OTM8009A 3.97" 480x800 TFT LCD panel (MIPI-DSI video mode)
maintainers:
- - Philippe CORNU <philippe.cornu@st.com>
+ - Philippe CORNU <philippe.cornu@foss.st.com>
description: |
The Orise Tech OTM8009A is a 3.97" 480x800 TFT LCD panel connected using
@@ -50,4 +50,3 @@ examples:
};
};
...
-
diff --git a/Documentation/devicetree/bindings/display/panel/panel-edp.yaml b/Documentation/devicetree/bindings/display/panel/panel-edp.yaml
new file mode 100644
index 000000000000..bb0cf6827e79
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/panel-edp.yaml
@@ -0,0 +1,188 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/panel-edp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Probeable (via DP AUX / EDID) eDP Panels with simple power-on sequences
+
+maintainers:
+ - Douglas Anderson <dianders@chromium.org>
+
+description: |
+ This binding file can be used to indicate that an eDP panel is connected
+ to an Embedded DisplayPort AUX bus (see display/dp-aux-bus.yaml) without
+ actually specifying exactly what panel is connected. This is useful for
+ the case that more than one different panel could be connected to the
+ board, either for second-sourcing purposes or to support multiple SKUs
+ with different LCDs that hook up to a common board.
+
+ As per above, a requirement for using this binding is that the panel is
+ represented under the DP AUX bus. This means that we can use any
+ information provided by the DP AUX bus (including the EDID) to identify
+ the panel. We can use this to identify display size, resolution, and
+ timings among other things.
+
+ One piece of information about eDP panels that is typically _not_
+ provided anywhere on the DP AUX bus is the power sequencing timings.
+ This is the reason why, historically, we've always had to explicitly
+ list eDP panels. We solve that here with two tricks. The "worst case"
+ power on timings for any panels expected to be connected to a board are
+ specified in these bindings. Once we've powered on, it's expected that
+ the operating system will look up the panel in a table (based on EDID
+ information) to figure out other power sequencing timings.
+
+ eDP panels in general can have somewhat arbitrary power sequencing
+ requirements. However, even though it's arbitrary in general, the
+ vast majority of panel datasheets have a power sequence diagram that
+ looks exactly the same as every other panel's. Each panel datasheet
+ cares about different timings in this diagram but the fact that the
+ diagram is so similar means we can come up with a single driver to
+ handle it.
+
+ These diagrams all look roughly like this, sometimes labeled with
+ slightly different numbers / lines but all pretty much the same
+ sequence. This is because much of this diagram comes straight from
+ the eDP Standard.
+
+ __________________________________________________
+ Vdd ___/: :\____ /
+ _/ : : \_____/
+ :<T1>:<T2>: :<--T10-->:<T11>:<T12>:
+ : +-----------------------+---------+---------+
+ eDP -----------+ Black video | Src vid | Blk vid +
+ Display : +-----------------------+---------+---------+
+ : _______________________:_________:_________:
+ HPD :<T3>| : : |
+ ___________| : : |_____________
+ : : : :
+ Sink +-----------------------:---------:---------+
+ AUX CH -----------+ AUX Ch operational : : +-------------
+ +-----------------------:---------:---------+
+ : : : :
+ :<T4>: :<T7>: : :
+ Src main +------+------+--------------+---------+
+ lnk data----------------+LnkTrn| Idle |Valid vid data| Idle/off+-------------
+ +------+------+--------------+---------+
+ : <T5> :<-T6->:<-T8->: :
+ :__:<T9>:
+ LED_EN | |
+ _____________________________________| |____________________________
+ : :
+ __________:__:_
+ PWM | : : |
+ __________________________| : : |__________________________
+ : : : :
+ _____________:__________:__:_:______
+ Bklight ____/: : : : : :\____
+ power _______/ :<---T13---->: : : :<T16>: \______________
+ (Vbl) :<T17>:<---------T14--------->: :<-T15->:<T18>:
+
+ The above looks fairly complex but, as per above, each panel only cares
+ about a subset of those timings.
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ const: edp-panel
+
+ hpd-reliable-delay-ms:
+ description:
+ A fixed amount of time that must be waited after powering on the
+ panel's power-supply before the HPD signal is a reliable way to know
+ when the AUX channel is ready. This is useful for panels that glitch
+ the HPD at the start of power-on. This value is not needed if HPD is
+ always reliable for all panels that might be connected.
+
+ hpd-absent-delay-ms:
+ description:
+ The panel specifies that HPD will be asserted this many milliseconds
+ from power on (timing T3 in the diagram above). If we have no way to
+ measure HPD then a fixed delay of this many milliseconds can be used.
+ This can also be used as a timeout when waiting for HPD. Does not
+ include the hpd-reliable-delay, so if hpd-reliable-delay was 80 ms
+ and hpd-absent-delay was 200 ms then we'd do a fixed 80 ms delay and
+ then we know HPD would assert in the next 120 ms. This value is not
+ needed if HPD is hooked up, either through a GPIO in the panel node or
+ hooked up directly to the eDP controller.
+
+ backlight: true
+ enable-gpios: true
+ port: true
+ power-supply: true
+ no-hpd: true
+ hpd-gpios: true
+
+additionalProperties: false
+
+required:
+ - compatible
+ - power-supply
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,rpmh.h>
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ bridge@2d {
+ compatible = "ti,sn65dsi86";
+ reg = <0x2d>;
+
+ interrupt-parent = <&tlmm>;
+ interrupts = <10 IRQ_TYPE_LEVEL_HIGH>;
+
+ enable-gpios = <&tlmm 102 GPIO_ACTIVE_HIGH>;
+
+ vpll-supply = <&src_pp1800_s4a>;
+ vccio-supply = <&src_pp1800_s4a>;
+ vcca-supply = <&src_pp1200_l2a>;
+ vcc-supply = <&src_pp1200_l2a>;
+
+ clocks = <&rpmhcc RPMH_LN_BB_CLK2>;
+ clock-names = "refclk";
+
+ no-hpd;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ endpoint {
+ remote-endpoint = <&dsi0_out>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ sn65dsi86_out: endpoint {
+ remote-endpoint = <&panel_in_edp>;
+ };
+ };
+ };
+
+ aux-bus {
+ panel {
+ compatible = "edp-panel";
+ power-supply = <&pp3300_dx_edp>;
+ backlight = <&backlight>;
+ hpd-gpios = <&sn65dsi86_bridge 2 GPIO_ACTIVE_HIGH>;
+ hpd-reliable-delay-ms = <15>;
+
+ port {
+ panel_in_edp: endpoint {
+ remote-endpoint = <&sn65dsi86_out>;
+ };
+ };
+ };
+ };
+ };
+ };
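To make the delay interplay concrete: with the values below, the OS does a fixed 80 ms wait after enabling power-supply and then knows HPD must assert within the remaining 120 ms of the 200 ms T3 budget. A trimmed sketch of just the aux-bus panel node; the regulator and backlight labels are hypothetical:

    aux-bus {
        panel {
            compatible = "edp-panel";
            power-supply = <&pp3300_disp>;   /* hypothetical label */
            backlight = <&backlight>;
            hpd-reliable-delay-ms = <80>;    /* HPD may glitch for the first 80 ms */
            hpd-absent-delay-ms = <200>;     /* T3: HPD asserts within 200 ms */
        };
    };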
diff --git a/Documentation/devicetree/bindings/display/panel/panel-lvds.yaml b/Documentation/devicetree/bindings/display/panel/panel-lvds.yaml
new file mode 100644
index 000000000000..929fe046d1e7
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/panel-lvds.yaml
@@ -0,0 +1,57 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/panel-lvds.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Generic LVDS Display Panel
+
+maintainers:
+ - Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
+ - Thierry Reding <thierry.reding@gmail.com>
+
+allOf:
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/display/lvds.yaml#
+
+select:
+ properties:
+ compatible:
+ contains:
+ const: panel-lvds
+
+ not:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - advantech,idk-1110wr
+ - advantech,idk-2121wr
+ - innolux,ee101ia-01d
+ - mitsubishi,aa104xd12
+ - mitsubishi,aa121td01
+ - sgd,gktw70sdae4se
+
+ required:
+ - compatible
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - auo,b101ew05
+ - tbs,a711-panel
+
+ - const: panel-lvds
+
+unevaluatedProperties: false
+
+required:
+ - compatible
+ - data-mapping
+ - width-mm
+ - height-mm
+ - panel-timing
+ - port
+
+...
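The select clause above matches a node only when its fallback compatible is panel-lvds and its vendor compatible is not one of the entries that already have a dedicated binding. A sketch of a conforming node with all of the mandatory properties; the timing numbers are illustrative, not datasheet values, and the phandles are hypothetical:

    panel {
        compatible = "tbs,a711-panel", "panel-lvds";
        backlight = <&backlight>;
        power-supply = <&reg_dc1sw>;        /* hypothetical regulator */
        width-mm = <153>;
        height-mm = <90>;
        data-mapping = "vesa-24";

        panel-timing {
            /* illustrative 1024x600 timing */
            clock-frequency = <51000000>;
            hactive = <1024>;
            vactive = <600>;
            hsync-len = <1>;
            hfront-porch = <160>;
            hback-porch = <160>;
            vsync-len = <1>;
            vfront-porch = <12>;
            vback-porch = <23>;
        };

        port {
            panel_in: endpoint {
                remote-endpoint = <&lvds_out>;
            };
        };
    };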
diff --git a/Documentation/devicetree/bindings/display/panel/panel-mipi-dbi-spi.yaml b/Documentation/devicetree/bindings/display/panel/panel-mipi-dbi-spi.yaml
new file mode 100644
index 000000000000..9b701df5e9d2
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/panel-mipi-dbi-spi.yaml
@@ -0,0 +1,134 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/panel-mipi-dbi-spi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MIPI DBI SPI Panel
+
+maintainers:
+ - Noralf Trønnes <noralf@tronnes.org>
+
+description: |
+ This binding is for display panels using a MIPI DBI compatible controller
+ in SPI mode.
+
+ The MIPI Alliance Standard for Display Bus Interface defines the electrical
+ and logical interfaces for display controllers historically used in mobile
+ phones. The standard defines 4 display architecture types and this binding is
+ for type 1 which has full frame memory. There are 3 interface types in the
+ standard and type C is the serial interface.
+
+ The standard defines the following interface signals for type C:
+ - Power:
+ - Vdd: Power supply for display module
+ Called power-supply in this binding.
+ - Vddi: Logic level supply for interface signals
+ Called io-supply in this binding.
+ - Interface:
+ - CSx: Chip select
+ - SCL: Serial clock
+ - Dout: Serial out
+ - Din: Serial in
+ - SDA: Bidirectional in/out
+ - D/CX: Data/command selection, high=data, low=command
+ Called dc-gpios in this binding.
+ - RESX: Reset when low
+ Called reset-gpios in this binding.
+
+ The type C interface has 3 options:
+
+ - Option 1: 9-bit mode and D/CX as the 9th bit
+ | Command | the next command or following data |
+ |<0><D7><D6><D5><D4><D3><D2><D1><D0>|<D/CX><D7><D6><D5><D4><D3><D2><D1><D0>|
+
+ - Option 2: 16-bit mode and D/CX as a 9th bit
+ | Command or data |
+ |<X><X><X><X><X><X><X><D/CX><D7><D6><D5><D4><D3><D2><D1><D0>|
+
+ - Option 3: 8-bit mode and D/CX as a separate interface line
+ | Command or data |
+ |<D7><D6><D5><D4><D3><D2><D1><D0>|
+
+ The panel resolution is specified using the panel-timing node properties
+ hactive (width) and vactive (height). The other mandatory panel-timing
+ properties should be set to zero, except clock-frequency, which can
+ optionally be set to inform about the actual pixel clock frequency.
+
+ If the panel is wired to the controller at an offset, specify this using
+ hback-porch (x-offset) and vback-porch (y-offset).
+
+allOf:
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - sainsmart18
+ - const: panel-mipi-dbi-spi
+
+ write-only:
+ type: boolean
+ description:
+ Controller is not readable (i.e. Din (MISO on the SPI interface) is not
+ wired up).
+
+ dc-gpios:
+ maxItems: 1
+ description: |
+ Controller data/command selection (D/CX) in 4-line SPI mode.
+ If not set, the controller is in 3-line SPI mode.
+
+ io-supply:
+ description: |
+ Logic level supply for interface signals (Vddi).
+ No need to set if this is the same as power-supply.
+
+required:
+ - compatible
+ - reg
+ - width-mm
+ - height-mm
+ - panel-timing
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ display@0 {
+ compatible = "sainsmart18", "panel-mipi-dbi-spi";
+ reg = <0>;
+ spi-max-frequency = <40000000>;
+
+ dc-gpios = <&gpio 24 GPIO_ACTIVE_HIGH>;
+ reset-gpios = <&gpio 25 GPIO_ACTIVE_HIGH>;
+ write-only;
+
+ backlight = <&backlight>;
+
+ width-mm = <35>;
+ height-mm = <28>;
+
+ panel-timing {
+ hactive = <160>;
+ vactive = <128>;
+ hback-porch = <0>;
+ vback-porch = <0>;
+ clock-frequency = <0>;
+ hfront-porch = <0>;
+ hsync-len = <0>;
+ vfront-porch = <0>;
+ vsync-len = <0>;
+ };
+ };
+ };
+
+...
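The binding above expresses a wiring offset through the back-porch properties. A hedged variant of the example's panel-timing node, assuming (hypothetically) that the glass is wired at x-offset 2 and y-offset 1 into the controller's frame memory:

    panel-timing {
        hactive = <160>;
        vactive = <128>;
        hback-porch = <2>;      /* x-offset into frame memory */
        vback-porch = <1>;      /* y-offset into frame memory */
        clock-frequency = <0>;
        hfront-porch = <0>;
        hsync-len = <0>;
        vfront-porch = <0>;
        vsync-len = <0>;
    };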
diff --git a/Documentation/devicetree/bindings/display/panel/panel-simple-dsi.yaml b/Documentation/devicetree/bindings/display/panel/panel-simple-dsi.yaml
index fbd71669248f..90c04cff8281 100644
--- a/Documentation/devicetree/bindings/display/panel/panel-simple-dsi.yaml
+++ b/Documentation/devicetree/bindings/display/panel/panel-simple-dsi.yaml
@@ -19,9 +19,6 @@ description: |
If the panel is more advanced a dedicated binding file is required.
-allOf:
- - $ref: panel-common.yaml#
-
properties:
compatible:
@@ -35,6 +32,8 @@ properties:
- boe,tv080wum-nl0
# Innolux P079ZCA 7.85" 768x1024 TFT LCD panel
- innolux,p079zca
+ # JDI FHD_R63452 1080x1920 5.2" IPS LCD Panel
+ - jdi,fhd-r63452
# Khadas TS050 5" 1080x1920 LCD panel
- khadas,ts050
# Kingdisplay KD097D04 9.7" 1536x2048 TFT LCD panel
@@ -65,12 +64,31 @@ properties:
reset-gpios: true
port: true
power-supply: true
+ vddio-supply: true
+
+allOf:
+ - $ref: panel-common.yaml#
+ - if:
+ properties:
+ compatible:
+ enum:
+ - samsung,s6e3fc2x01
+ - samsung,sofef00
+ then:
+ properties:
+ power-supply: false
+ required:
+ - vddio-supply
+ else:
+ properties:
+ vddio-supply: false
+ required:
+ - power-supply
additionalProperties: false
required:
- compatible
- - power-supply
- reg
examples:
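The if/then/else block above swaps the required supply: the two Samsung panels take vddio-supply and reject power-supply, while every other panel keeps the old rule. A minimal sketch of a samsung,sofef00 node under the new schema; the phandles are hypothetical:

    #include <dt-bindings/gpio/gpio.h>

    dsi {
        #address-cells = <1>;
        #size-cells = <0>;

        panel@0 {
            compatible = "samsung,sofef00";
            reg = <0>;
            vddio-supply = <&vreg_l14a_1p88>;  /* power-supply would fail validation */
            reset-gpios = <&tlmm 6 GPIO_ACTIVE_HIGH>;
        };
    };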
diff --git a/Documentation/devicetree/bindings/display/panel/panel-simple.yaml b/Documentation/devicetree/bindings/display/panel/panel-simple.yaml
index 335776c45474..01560fe226dd 100644
--- a/Documentation/devicetree/bindings/display/panel/panel-simple.yaml
+++ b/Documentation/devicetree/bindings/display/panel/panel-simple.yaml
@@ -35,6 +35,8 @@ properties:
- ampire,am-480272h3tmqw-t01h
# Ampire AM-800480R3TMQW-A1H 7.0" WVGA TFT LCD panel
- ampire,am800480r3tmqwa1h
+ # Ampire AM-800600P5TMQW-TB8H 8.0" SVGA TFT LCD panel
+ - ampire,am800600p5tmqw-tb8h
# AU Optronics Corporation 10.1" WSVGA TFT LCD panel
- auo,b101aw03
# AU Optronics Corporation 10.1" WSVGA TFT LCD panel
@@ -105,6 +107,10 @@ properties:
- chunghwa,claa101wb01
# Chunghwa Picture Tubes Ltd. 10.1" WXGA TFT LCD panel
- chunghwa,claa101wb03
+ # DataImage, Inc. 4.3" WQVGA (480x272) TFT LCD panel with 24-bit parallel interface.
+ - dataimage,fg040346dsswbg04
+ # DataImage, Inc. 10.1" WXGA (1280x800) TFT LCD panel
+ - dataimage,fg1001l0dsswmg01
# DataImage, Inc. 7" WVGA (800x480) TFT LCD panel with 24-bit parallel interface.
- dataimage,scf0700c48ggu18
# DLC Display Co. DLC1010GIG 10.1" WXGA TFT LCD Panel
@@ -135,6 +141,8 @@ properties:
# Emerging Display Technology Corp. WVGA TFT Display with capacitive touch
- edt,etm0700g0dh6
- edt,etm0700g0edh6
+ # Emerging Display Technology Corp. LVDS WSVGA TFT Display with capacitive touch
+ - edt,etml0700y5dha
# Emerging Display Technology Corp. 5.7" VGA TFT LCD panel with
# capacitive touch
- edt,etmv570g2dhu
@@ -156,6 +164,8 @@ properties:
- hannstar,hsd070pww1
# HannStar Display Corp. HSD100PXN1 10.1" XGA LVDS panel
- hannstar,hsd100pxn1
+ # HannStar Display Corp. HSD101PWW2 10.1" WXGA (1280x800) LVDS panel
+ - hannstar,hsd101pww2
# Hitachi Ltd. Corporation 9" WVGA (800x480) TFT LCD panel
- hit,tx23d38vm0caa
# InfoVision Optoelectronics M133NWF4 R0 13.3" FHD (1920x1080) TFT LCD panel
@@ -166,6 +176,8 @@ properties:
- innolux,at070tn92
# Innolux G070Y2-L01 7" WVGA (800x480) TFT LCD panel
- innolux,g070y2-l01
+ # Innolux G070Y2-T02 7" WVGA (800x480) TFT LCD TTL panel
+ - innolux,g070y2-t02
# Innolux Corporation 10.1" G101ICE-L01 WXGA (1280x800) LVDS panel
- innolux,g101ice-l01
# Innolux Corporation 12.1" WXGA (1280x800) TFT LCD panel
@@ -180,6 +192,8 @@ properties:
- innolux,n125hce-gn1
# InnoLux 15.6" WXGA TFT LCD panel
- innolux,n156bge-l21
+ # Innolux P120ZDG-BF1 12.02 inch eDP 2K display panel
+ - innolux,p120zdg-bf1
# Innolux Corporation 7.0" WSVGA (1024x600) TFT LCD panel
- innolux,zj070na-01p
# King & Display KD116N21-30NV-A010 eDP TFT LCD panel
@@ -220,6 +234,10 @@ properties:
- logictechno,lttd800480070-l6wh-rt
# Mitsubishi "AA070MC01 7.0" WVGA TFT LCD panel
- mitsubishi,aa070mc01-ca1
+ # Multi-Inno Technology Co.,Ltd MI0700S4T-6 7" 800x480 TFT Resistive Touch Module
+ - multi-inno,mi0700s4t-6
+ # Multi-Inno Technology Co.,Ltd MI0800FT-9 8" 800x600 TFT Resistive Touch Module
+ - multi-inno,mi0800ft-9
# Multi-Inno Technology Co.,Ltd MI1010AIT-1CP 10.1" 1280x800 LVDS IPS Cap Touch Mod.
- multi-inno,mi1010ait-1cp
# NEC LCD Technologies, Ltd. 12.1" WXGA (1280x800) LVDS TFT LCD panel
@@ -266,6 +284,8 @@ properties:
- samsung,atna33xc20
# Samsung 12.2" (2560x1600 pixels) TFT LCD panel
- samsung,lsn122dl01-c01
+ # Samsung Electronics 10.1" WXGA (1280x800) TFT LCD panel
+ - samsung,ltl101al01
# Samsung Electronics 10.1" WSVGA TFT LCD panel
- samsung,ltn101nt05
# Samsung Electronics 14" WXGA (1366x768) TFT LCD panel
@@ -280,6 +300,8 @@ properties:
- sharp,lq101k1ly04
# Sharp 12.3" (2400x1600 pixels) TFT LCD panel
- sharp,lq123p1jx31
+ # Sharp 14" (1920x1080 pixels) TFT LCD panel
+ - sharp,lq140m1jw46
# Sharp LS020B1DD01D 2.0" HQVGA TFT LCD panel
- sharp,ls020b1dd01d
# Shelly SCA07010-BFN-LNN 7.0" WVGA TFT LCD panel
@@ -288,6 +310,10 @@ properties:
- starry,kr070pe2t
# Starry 12.2" (1920x1200 pixels) TFT LCD panel
- starry,kr122ea0sra
+ # Startek KD070WVFPA043-C069A 7" TFT LCD panel
+ - startek,kd070wvfpa
+ # Team Source Display Technology TST043015CMHX 4.3" WQVGA TFT LCD panel
+ - team-source-display,tst043015cmhx
# Tianma Micro-electronics TM070JDHG30 7.0" WXGA TFT LCD panel
- tianma,tm070jdhg30
# Tianma Micro-electronics TM070JVHG33 7.0" WXGA TFT LCD panel
@@ -309,6 +335,8 @@ properties:
- urt,umsh-8596md-11t
- urt,umsh-8596md-19t
- urt,umsh-8596md-20t
+ # Vivax TPC-9150 tablet 9.0" WSVGA TFT LCD panel
+ - vivax,tpc9150-panel
# VXT 800x480 color TFT LCD panel
- vxt,vl050-8048nt-c01
# Winstar Display Corporation 3.5" QVGA (320x240) TFT LCD panel
@@ -317,6 +345,7 @@ properties:
- yes-optoelectronics,ytc700tlag-05-201c
backlight: true
+ ddc-i2c-bus: true
enable-gpios: true
port: true
power-supply: true
diff --git a/Documentation/devicetree/bindings/display/panel/panel-timing.yaml b/Documentation/devicetree/bindings/display/panel/panel-timing.yaml
index 9bf592dc3033..aea69b84ca5d 100644
--- a/Documentation/devicetree/bindings/display/panel/panel-timing.yaml
+++ b/Documentation/devicetree/bindings/display/panel/panel-timing.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/panel/panel-timing.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: panel timing bindings
+title: panel timing
maintainers:
- Thierry Reding <thierry.reding@gmail.com>
@@ -17,29 +17,29 @@ description: |
The parameters are defined as seen in the following illustration.
- +----------+-------------------------------------+----------+-------+
- | | ^ | | |
- | | |vback_porch | | |
- | | v | | |
- +----------#######################################----------+-------+
- | # ^ # | |
- | # | # | |
- | hback # | # hfront | hsync |
- | porch # | hactive # porch | len |
- |<-------->#<-------+--------------------------->#<-------->|<----->|
- | # | # | |
- | # |vactive # | |
- | # | # | |
- | # v # | |
- +----------#######################################----------+-------+
- | | ^ | | |
- | | |vfront_porch | | |
- | | v | | |
- +----------+-------------------------------------+----------+-------+
- | | ^ | | |
- | | |vsync_len | | |
- | | v | | |
- +----------+-------------------------------------+----------+-------+
+ +-------+----------+-------------------------------------+----------+
+ | | | ^ | |
+ | | | |vsync_len | |
+ | | | v | |
+ +-------+----------+-------------------------------------+----------+
+ | | | ^ | |
+ | | | |vback_porch | |
+ | | | v | |
+ +-------+----------#######################################----------+
+ | | # ^ # |
+ | | # | # |
+ | hsync | hback # | # hfront |
+ | len | porch # | hactive # porch |
+ |<----->|<-------->#<-------+--------------------------->#<-------->|
+ | | # | # |
+ | | # |vactive # |
+ | | # | # |
+ | | # v # |
+ +-------+----------#######################################----------+
+ | | | ^ | |
+ | | | |vfront_porch | |
+ | | | v | |
+ +-------+----------+-------------------------------------+----------+
The following is the panel timings shown with time on the x-axis.
@@ -71,78 +71,72 @@ properties:
hfront-porch:
description: Horizontal front porch panel timing
+ $ref: /schemas/types.yaml#/definitions/uint32-array
oneOf:
- - $ref: /schemas/types.yaml#/definitions/uint32
- maxItems: 1
+ - maxItems: 1
items:
description: typical number of pixels
- - $ref: /schemas/types.yaml#/definitions/uint32-array
- minItems: 3
+ - minItems: 3
maxItems: 3
items:
description: min, typ, max number of pixels
hback-porch:
description: Horizontal back porch timing
+ $ref: /schemas/types.yaml#/definitions/uint32-array
oneOf:
- - $ref: /schemas/types.yaml#/definitions/uint32
- maxItems: 1
+ - maxItems: 1
items:
description: typical number of pixels
- - $ref: /schemas/types.yaml#/definitions/uint32-array
- minItems: 3
+ - minItems: 3
maxItems: 3
items:
description: min, typ, max number of pixels
hsync-len:
description: Horizontal sync length panel timing
+ $ref: /schemas/types.yaml#/definitions/uint32-array
oneOf:
- - $ref: /schemas/types.yaml#/definitions/uint32
- maxItems: 1
+ - maxItems: 1
items:
description: typical number of pixels
- - $ref: /schemas/types.yaml#/definitions/uint32-array
- minItems: 3
+ - minItems: 3
maxItems: 3
items:
description: min, typ, max number of pixels
vfront-porch:
description: Vertical front porch panel timing
+ $ref: /schemas/types.yaml#/definitions/uint32-array
oneOf:
- - $ref: /schemas/types.yaml#/definitions/uint32
- maxItems: 1
+ - maxItems: 1
items:
description: typical number of lines
- - $ref: /schemas/types.yaml#/definitions/uint32-array
- minItems: 3
+ - minItems: 3
maxItems: 3
items:
description: min, typ, max number of lines
vback-porch:
description: Vertical back porch panel timing
+ $ref: /schemas/types.yaml#/definitions/uint32-array
oneOf:
- - $ref: /schemas/types.yaml#/definitions/uint32
- maxItems: 1
+ - maxItems: 1
items:
description: typical number of lines
- - $ref: /schemas/types.yaml#/definitions/uint32-array
- minItems: 3
+ - minItems: 3
maxItems: 3
items:
description: min, typ, max number of lines
vsync-len:
description: Vertical sync length panel timing
+ $ref: /schemas/types.yaml#/definitions/uint32-array
oneOf:
- - $ref: /schemas/types.yaml#/definitions/uint32
- maxItems: 1
+ - maxItems: 1
items:
description: typical number of lines
- - $ref: /schemas/types.yaml#/definitions/uint32-array
- minItems: 3
+ - minItems: 3
maxItems: 3
items:
description: min, typ, max number of lines
@@ -152,6 +146,7 @@ properties:
Horizontal sync pulse.
0 selects active low, 1 selects active high.
If omitted then it is not used by the hardware
+ $ref: /schemas/types.yaml#/definitions/uint32
enum: [0, 1]
vsync-active:
@@ -159,6 +154,7 @@ properties:
Vertical sync pulse.
0 selects active low, 1 selects active high.
If omitted then it is not used by the hardware
+ $ref: /schemas/types.yaml#/definitions/uint32
enum: [0, 1]
de-active:
@@ -166,6 +162,7 @@ properties:
Data enable.
0 selects active low, 1 selects active high.
If omitted then it is not used by the hardware
+ $ref: /schemas/types.yaml#/definitions/uint32
enum: [0, 1]
pixelclk-active:
@@ -175,6 +172,7 @@ properties:
sample data on rising edge.
Use 1 to drive pixel data on rising edge and
sample data on falling edge
+ $ref: /schemas/types.yaml#/definitions/uint32
enum: [0, 1]
syncclk-active:
@@ -185,6 +183,7 @@ properties:
sample sync on rising edge of pixel clock.
Use 1 to drive sync on rising edge and
sample sync on falling edge of pixel clock
+ $ref: /schemas/types.yaml#/definitions/uint32
enum: [0, 1]
interlaced:
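After the refactor above, every timing property is typed once as a uint32-array and the oneOf merely constrains its length: one cell gives the typical value, three cells give min/typ/max. A sketch mixing both forms; the numbers are illustrative only:

    panel-timing {
        clock-frequency = <33500000>;
        hactive = <800>;
        vactive = <480>;
        hfront-porch = <40>;          /* single typical value */
        hback-porch = <40 88 128>;    /* min, typ, max */
        hsync-len = <48 128 128>;
        vfront-porch = <13>;
        vback-porch = <31>;
        vsync-len = <3 3 5>;
        hsync-active = <0>;           /* active low */
    };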
diff --git a/Documentation/devicetree/bindings/display/panel/raydium,rm67191.yaml b/Documentation/devicetree/bindings/display/panel/raydium,rm67191.yaml
index 745dd247c409..d62fd692bf10 100644
--- a/Documentation/devicetree/bindings/display/panel/raydium,rm67191.yaml
+++ b/Documentation/devicetree/bindings/display/panel/raydium,rm67191.yaml
@@ -24,6 +24,7 @@ properties:
dsi-lanes:
description: Number of DSI lanes to be used must be <3> or <4>
+ $ref: /schemas/types.yaml#/definitions/uint32
enum: [3, 4]
v3p3-supply:
@@ -37,6 +38,7 @@ properties:
0 - burst-mode
1 - non-burst with sync event
2 - non-burst with sync pulse
+ $ref: /schemas/types.yaml#/definitions/uint32
enum: [0, 1, 2]
required:
diff --git a/Documentation/devicetree/bindings/display/panel/raydium,rm68200.yaml b/Documentation/devicetree/bindings/display/panel/raydium,rm68200.yaml
index 39477793d289..e8ce2315631a 100644
--- a/Documentation/devicetree/bindings/display/panel/raydium,rm68200.yaml
+++ b/Documentation/devicetree/bindings/display/panel/raydium,rm68200.yaml
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Raydium Semiconductor Corporation RM68200 5.5" 720p MIPI-DSI TFT LCD panel
maintainers:
- - Philippe CORNU <philippe.cornu@st.com>
+ - Philippe CORNU <philippe.cornu@foss.st.com>
description: |
The Raydium Semiconductor Corporation RM68200 is a 5.5" 720x1280 TFT LCD
diff --git a/Documentation/devicetree/bindings/display/panel/ronbo,rb070d30.yaml b/Documentation/devicetree/bindings/display/panel/ronbo,rb070d30.yaml
index d67617f6f74a..95ce22c6787a 100644
--- a/Documentation/devicetree/bindings/display/panel/ronbo,rb070d30.yaml
+++ b/Documentation/devicetree/bindings/display/panel/ronbo,rb070d30.yaml
@@ -37,7 +37,7 @@ properties:
backlight:
description: Backlight used by the panel
- $ref: "/schemas/types.yaml#/definitions/phandle"
+ $ref: /schemas/types.yaml#/definitions/phandle
required:
- compatible
diff --git a/Documentation/devicetree/bindings/display/panel/samsung,ams495qa01.yaml b/Documentation/devicetree/bindings/display/panel/samsung,ams495qa01.yaml
new file mode 100644
index 000000000000..58fa073ce258
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/samsung,ams495qa01.yaml
@@ -0,0 +1,57 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/samsung,ams495qa01.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung AMS495QA01 panel with Magnachip D53E6EA8966 controller
+
+maintainers:
+ - Chris Morgan <macromorgan@hotmail.com>
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ const: samsung,ams495qa01
+
+ reg: true
+ reset-gpios:
+ description: reset gpio, must be GPIO_ACTIVE_LOW
+ elvdd-supply:
+ description: regulator that supplies voltage to the panel display
+ enable-gpios: true
+ port: true
+ vdd-supply:
+ description: regulator that supplies voltage to panel logic
+
+required:
+ - compatible
+ - reg
+ - reset-gpios
+ - vdd-supply
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ panel@0 {
+ compatible = "samsung,ams495qa01";
+ reg = <0>;
+ reset-gpios = <&gpio4 0 GPIO_ACTIVE_LOW>;
+ vdd-supply = <&vcc_3v3>;
+
+ port {
+ mipi_in_panel: endpoint {
+ remote-endpoint = <&mipi_out_panel>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/panel/samsung,ld9040.yaml b/Documentation/devicetree/bindings/display/panel/samsung,ld9040.yaml
index 060ee27a4749..c0fabeb38628 100644
--- a/Documentation/devicetree/bindings/display/panel/samsung,ld9040.yaml
+++ b/Documentation/devicetree/bindings/display/panel/samsung,ld9040.yaml
@@ -6,15 +6,12 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Samsung LD9040 AMOLED LCD parallel RGB panel with SPI control bus
-description: |
- The panel must obey the rules for a SPI slave device as specified in
- spi/spi-controller.yaml
-
maintainers:
- Andrzej Hajda <a.hajda@samsung.com>
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
@@ -45,6 +42,9 @@ properties:
panel-height-mm:
description: physical panel height [mm]
+ spi-cpha: true
+ spi-cpol: true
+
required:
- compatible
- reg
@@ -63,8 +63,6 @@ examples:
lcd@0 {
compatible = "samsung,ld9040";
- #address-cells = <1>;
- #size-cells = <0>;
reg = <0>;
vdd3-supply = <&ldo7_reg>;
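This ld9040 conversion (and the similar ones that follow) replaces the prose pointer to spi-controller.yaml with an allOf reference to the shared spi-peripheral-props.yaml schema, so the SPI mode flags are validated centrally and the #address-cells/#size-cells bookkeeping stays on the parent controller. A minimal sketch of the resulting usage (regulator label taken from the binding's own example, the SPI controller node abbreviated):

    spi {
        #address-cells = <1>;
        #size-cells = <0>;

        lcd@0 {
            compatible = "samsung,ld9040";
            reg = <0>;
            spi-cpha;               /* now validated via spi-peripheral-props.yaml */
            spi-cpol;
            vdd3-supply = <&ldo7_reg>;
        };
    };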
diff --git a/Documentation/devicetree/bindings/display/panel/samsung,lms380kf01.yaml b/Documentation/devicetree/bindings/display/panel/samsung,lms380kf01.yaml
index 251f0c7115aa..70ffc88d2a08 100644
--- a/Documentation/devicetree/bindings/display/panel/samsung,lms380kf01.yaml
+++ b/Documentation/devicetree/bindings/display/panel/samsung,lms380kf01.yaml
@@ -9,14 +9,13 @@ title: Samsung LMS380KF01 display panel
description: The LMS380KF01 is a 480x800 DPI display panel from Samsung Mobile
Displays (SMD) utilizing the WideChips WS2401 display controller. It can be
used with internal or external backlight control.
- The panel must obey the rules for a SPI slave device as specified in
- spi/spi-controller.yaml
maintainers:
- Linus Walleij <linus.walleij@linaro.org>
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
@@ -59,7 +58,7 @@ required:
- spi-cpol
- port
-additionalProperties: false
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/display/panel/samsung,lms397kf04.yaml b/Documentation/devicetree/bindings/display/panel/samsung,lms397kf04.yaml
index cd62968426fb..5e77cee93f83 100644
--- a/Documentation/devicetree/bindings/display/panel/samsung,lms397kf04.yaml
+++ b/Documentation/devicetree/bindings/display/panel/samsung,lms397kf04.yaml
@@ -14,6 +14,7 @@ maintainers:
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
@@ -51,7 +52,7 @@ required:
- spi-cpol
- port
-additionalProperties: false
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/display/panel/samsung,s6d27a1.yaml b/Documentation/devicetree/bindings/display/panel/samsung,s6d27a1.yaml
new file mode 100644
index 000000000000..d273faf4442a
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/samsung,s6d27a1.yaml
@@ -0,0 +1,98 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/samsung,s6d27a1.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung S6D27A1 display panel
+
+description: The S6D27A1 is a 480x800 DPI display panel from Samsung Mobile
+ Displays (SMD).
+
+maintainers:
+ - Markuss Broks <markuss.broks@gmail.com>
+
+allOf:
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+
+properties:
+ compatible:
+ const: samsung,s6d27a1
+
+ reg: true
+
+ interrupts:
+ description: provides an optional ESD (electrostatic discharge)
+ interrupt that signals abnormalities in the display hardware.
+ This can also be raised for other reasons like erroneous
+ configuration.
+ maxItems: 1
+
+ reset-gpios: true
+
+ vci-supply:
+ description: regulator that supplies the VCI analog voltage
+ usually around 3.0 V
+
+ vccio-supply:
+ description: regulator that supplies the VCCIO voltage usually
+ around 1.8 V
+
+ backlight: true
+
+ spi-cpha: true
+
+ spi-cpol: true
+
+ spi-max-frequency:
+ maximum: 1200000
+
+ port: true
+
+required:
+ - compatible
+ - reg
+ - vci-supply
+ - vccio-supply
+ - spi-cpha
+ - spi-cpol
+ - port
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ spi {
+ compatible = "spi-gpio";
+ sck-gpios = <&gpio 0 GPIO_ACTIVE_HIGH>;
+ miso-gpios = <&gpio 1 GPIO_ACTIVE_HIGH>;
+ mosi-gpios = <&gpio 2 GPIO_ACTIVE_HIGH>;
+ cs-gpios = <&gpio 3 GPIO_ACTIVE_HIGH>;
+ num-chipselects = <1>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+ panel@0 {
+ compatible = "samsung,s6d27a1";
+ spi-max-frequency = <1200000>;
+ spi-cpha;
+ spi-cpol;
+ reg = <0>;
+ vci-supply = <&lcd_3v0_reg>;
+ vccio-supply = <&lcd_1v8_reg>;
+ reset-gpios = <&gpio 4 GPIO_ACTIVE_LOW>;
+ interrupt-parent = <&gpio>;
+ interrupts = <5 IRQ_TYPE_EDGE_RISING>;
+
+ port {
+ panel_in: endpoint {
+ remote-endpoint = <&display_out>;
+ };
+ };
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/panel/samsung,s6e63m0.yaml b/Documentation/devicetree/bindings/display/panel/samsung,s6e63m0.yaml
index ea58df49263a..6f1fc7469f07 100644
--- a/Documentation/devicetree/bindings/display/panel/samsung,s6e63m0.yaml
+++ b/Documentation/devicetree/bindings/display/panel/samsung,s6e63m0.yaml
@@ -12,6 +12,7 @@ maintainers:
allOf:
- $ref: panel-common.yaml#
- $ref: /schemas/leds/backlight/common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
@@ -23,6 +24,10 @@ properties:
default-brightness: true
max-brightness: true
+ spi-3wire: true
+ spi-cpha: true
+ spi-cpol: true
+
vdd3-supply:
description: VDD regulator
diff --git a/Documentation/devicetree/bindings/display/panel/samsung,s6e88a0-ams452ef01.yaml b/Documentation/devicetree/bindings/display/panel/samsung,s6e88a0-ams452ef01.yaml
index 44ce98f68705..b749e9e906b7 100644
--- a/Documentation/devicetree/bindings/display/panel/samsung,s6e88a0-ams452ef01.yaml
+++ b/Documentation/devicetree/bindings/display/panel/samsung,s6e88a0-ams452ef01.yaml
@@ -16,6 +16,7 @@ properties:
compatible:
const: samsung,s6e88a0-ams452ef01
reg: true
+ port: true
reset-gpios: true
vdd3-supply:
description: core voltage supply
@@ -25,6 +26,7 @@ properties:
required:
- compatible
- reg
+ - port
- vdd3-supply
- vci-supply
- reset-gpios
@@ -46,5 +48,11 @@ examples:
vdd3-supply = <&pm8916_l17>;
vci-supply = <&reg_vlcd_vci>;
reset-gpios = <&msmgpio 25 GPIO_ACTIVE_HIGH>;
+
+ port {
+ panel_in: endpoint {
+ remote-endpoint = <&dsi0_out>;
+ };
+ };
};
};
diff --git a/Documentation/devicetree/bindings/display/panel/samsung,s6e8aa0.yaml b/Documentation/devicetree/bindings/display/panel/samsung,s6e8aa0.yaml
index ca959451557e..1cdc91b3439f 100644
--- a/Documentation/devicetree/bindings/display/panel/samsung,s6e8aa0.yaml
+++ b/Documentation/devicetree/bindings/display/panel/samsung,s6e8aa0.yaml
@@ -36,6 +36,7 @@ properties:
init-delay:
description: delay after initialization sequence [ms]
+ $ref: /schemas/types.yaml#/definitions/uint32
panel-width-mm:
description: physical panel width [mm]
diff --git a/Documentation/devicetree/bindings/display/panel/seiko,43wvf1g.yaml b/Documentation/devicetree/bindings/display/panel/seiko,43wvf1g.yaml
index cfaa50cf5f5d..1df3cbb51ff9 100644
--- a/Documentation/devicetree/bindings/display/panel/seiko,43wvf1g.yaml
+++ b/Documentation/devicetree/bindings/display/panel/seiko,43wvf1g.yaml
@@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Seiko Instruments Inc. 4.3" WVGA (800 x RGB x 480) TFT with Touch-Panel
maintainers:
- - Marco Franchi <marco.franchi@nxp.com>
+ - Fabio Estevam <festevam@gmail.com>
allOf:
- $ref: panel-common.yaml#
@@ -25,6 +25,8 @@ properties:
avdd-supply:
description: 5v analog regulator
+ enable-gpios: true
+
required:
- compatible
- dvdd-supply
diff --git a/Documentation/devicetree/bindings/display/panel/sgd,gktw70sdae4se.yaml b/Documentation/devicetree/bindings/display/panel/sgd,gktw70sdae4se.yaml
index e63a570ae59d..e32d9188a3e0 100644
--- a/Documentation/devicetree/bindings/display/panel/sgd,gktw70sdae4se.yaml
+++ b/Documentation/devicetree/bindings/display/panel/sgd,gktw70sdae4se.yaml
@@ -7,17 +7,27 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Solomon Goldentek Display GKTW70SDAE4SE 7" WVGA LVDS Display Panel
maintainers:
- - Neil Armstrong <narmstrong@baylibre.com>
+ - Neil Armstrong <neil.armstrong@linaro.org>
- Thierry Reding <thierry.reding@gmail.com>
allOf:
- - $ref: lvds.yaml#
+ - $ref: panel-common.yaml#
+ - $ref: /schemas/display/lvds.yaml#
+
+select:
+ properties:
+ compatible:
+ contains:
+ const: sgd,gktw70sdae4se
+
+ required:
+ - compatible
properties:
compatible:
items:
- const: sgd,gktw70sdae4se
- - {} # panel-lvds, but not listed here to avoid false select
+ - const: panel-lvds
data-mapping:
const: jeida-18
@@ -35,6 +45,11 @@ additionalProperties: false
required:
- compatible
+ - port
+ - data-mapping
+ - width-mm
+ - height-mm
+ - panel-timing
examples:
- |+
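With the bare "{}" list item replaced by an explicit "panel-lvds" const, and a select block keyed on the vendor string so the schema never matches plain panel-lvds nodes, a conforming node spells out both compatibles. A minimal sketch, with the other now-required properties noted:

    panel {
        compatible = "sgd,gktw70sdae4se", "panel-lvds";
        data-mapping = "jeida-18";
        /* width-mm, height-mm, panel-timing and port are now required too */
    };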
diff --git a/Documentation/devicetree/bindings/display/panel/sharp,lq101r1sx01.yaml b/Documentation/devicetree/bindings/display/panel/sharp,lq101r1sx01.yaml
index a679d3647dbd..57b44a0e763d 100644
--- a/Documentation/devicetree/bindings/display/panel/sharp,lq101r1sx01.yaml
+++ b/Documentation/devicetree/bindings/display/panel/sharp,lq101r1sx01.yaml
@@ -30,7 +30,12 @@ allOf:
properties:
compatible:
- const: sharp,lq101r1sx01
+ oneOf:
+ - items:
+ - const: sharp,lq101r1sx03
+ - const: sharp,lq101r1sx01
+ - enum:
+ - sharp,lq101r1sx01
reg: true
power-supply: true
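The oneOf above lets the LQ101R1SX03 be declared with the original LQ101R1SX01 as a fallback, so existing drivers keep matching. A sketch of the new form (supply label hypothetical):

    panel@0 {
        compatible = "sharp,lq101r1sx03", "sharp,lq101r1sx01";
        reg = <0>;
        power-supply = <&panel_vdd>;
    };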
diff --git a/Documentation/devicetree/bindings/display/panel/sharp,ls060t1sx01.yaml b/Documentation/devicetree/bindings/display/panel/sharp,ls060t1sx01.yaml
new file mode 100644
index 000000000000..271c097cc9a4
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/sharp,ls060t1sx01.yaml
@@ -0,0 +1,56 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/sharp,ls060t1sx01.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Sharp Microelectronics 6.0" FullHD TFT LCD panel
+
+maintainers:
+ - Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ const: sharp,ls060t1sx01
+
+ reg: true
+ backlight: true
+ reset-gpios: true
+ port: true
+
+ avdd-supply:
+ description: phandle of the regulator that provides the positive supply voltage
+ avee-supply:
+ description: phandle of the regulator that provides the negative supply voltage
+ vddi-supply:
+ description: phandle of the regulator that provides the I/O supply voltage
+ vddh-supply:
+ description: phandle of the regulator that provides the analog supply voltage
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ dsi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ panel@0 {
+ compatible = "sharp,ls060t1sx01";
+ reg = <0>;
+ avdd-supply = <&pm8941_l22>;
+ backlight = <&backlight>;
+ reset-gpios = <&pm8916_gpios 25 GPIO_ACTIVE_LOW>;
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/display/panel/sitronix,st7701.yaml b/Documentation/devicetree/bindings/display/panel/sitronix,st7701.yaml
index 6dff59fe4be1..4dc0cd4a6a77 100644
--- a/Documentation/devicetree/bindings/display/panel/sitronix,st7701.yaml
+++ b/Documentation/devicetree/bindings/display/panel/sitronix,st7701.yaml
@@ -17,6 +17,9 @@ description: |
Techstar TS8550B is 480x854, 2-lane MIPI DSI LCD panel which has
inbuilt ST7701 chip.
+ Densitron DMT028VGHMCMI-1A is a 480x640, 2-lane MIPI DSI LCD panel
+ which has a built-in ST7701 chip.
+

+
allOf:
- $ref: panel-common.yaml#
@@ -24,6 +27,8 @@ properties:
compatible:
items:
- enum:
+ - densitron,dmt028vghmcmi-1a
+ - elida,kd50t048a
- techstar,ts8550b
- const: sitronix,st7701
@@ -37,7 +42,9 @@ properties:
IOVCC-supply:
description: I/O system regulator
+ port: true
reset-gpios: true
+ rotation: true
backlight: true
@@ -46,6 +53,7 @@ required:
- reg
- VCC-supply
- IOVCC-supply
+ - port
- reset-gpios
additionalProperties: false
@@ -65,5 +73,11 @@ examples:
IOVCC-supply = <&reg_dldo2>;
reset-gpios = <&pio 3 24 GPIO_ACTIVE_HIGH>; /* LCD-RST: PD24 */
backlight = <&backlight>;
+
+ port {
+ mipi_in_panel: endpoint {
+ remote-endpoint = <&mipi_out_panel>;
+ };
+ };
};
};
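Each vendor panel built around the ST7701 pairs its own string with the "sitronix,st7701" fallback, and with this change the port node becomes mandatory. A hedged sketch for the newly added Densitron panel (regulator and endpoint labels hypothetical; the reset GPIO mirrors the binding's existing example):

    panel@0 {
        compatible = "densitron,dmt028vghmcmi-1a", "sitronix,st7701";
        reg = <0>;
        VCC-supply = <&reg_panel_vcc>;
        IOVCC-supply = <&reg_panel_iovcc>;
        reset-gpios = <&pio 3 24 GPIO_ACTIVE_HIGH>;

        port {
            panel_in: endpoint {
                remote-endpoint = <&dsi_out>;
            };
        };
    };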
diff --git a/Documentation/devicetree/bindings/display/panel/sitronix,st7789v.yaml b/Documentation/devicetree/bindings/display/panel/sitronix,st7789v.yaml
index fa46d151e7b3..fa6556363cca 100644
--- a/Documentation/devicetree/bindings/display/panel/sitronix,st7789v.yaml
+++ b/Documentation/devicetree/bindings/display/panel/sitronix,st7789v.yaml
@@ -6,15 +6,12 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Sitronix ST7789V RGB panel with SPI control bus
-description: |
- The panel must obey the rules for a SPI slave device as specified in
- spi/spi-controller.yaml
-
maintainers:
- Maxime Ripard <mripard@kernel.org>
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
@@ -26,6 +23,13 @@ properties:
backlight: true
port: true
+ spi-cpha: true
+ spi-cpol: true
+
+ dc-gpios:
+ maxItems: 1
+ description: DCX pin, the display data/command selection pin of the parallel interface
+
required:
- compatible
- reg
diff --git a/Documentation/devicetree/bindings/display/panel/sony,acx424akp.yaml b/Documentation/devicetree/bindings/display/panel/sony,acx424akp.yaml
index 78d060097052..059cc6dbcfca 100644
--- a/Documentation/devicetree/bindings/display/panel/sony,acx424akp.yaml
+++ b/Documentation/devicetree/bindings/display/panel/sony,acx424akp.yaml
@@ -4,7 +4,12 @@
$id: http://devicetree.org/schemas/display/panel/sony,acx424akp.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Sony ACX424AKP 4" 480x864 AMOLED panel
+title: Sony ACX424AKP/ACX424AKM 4" 480x864/480x854 AMOLED panel
+
+description: The Sony ACX424AKP and ACX424AKM are panels built around
+ the Novatek NT35560 display controller. The only difference is that
+ the AKM is configured to use 10 pixels less in the Y axis than the
+ AKP.
maintainers:
- Linus Walleij <linus.walleij@linaro.org>
@@ -14,7 +19,9 @@ allOf:
properties:
compatible:
- const: sony,acx424akp
+ enum:
+ - sony,acx424akp
+ - sony,acx424akm
reg: true
reset-gpios: true
vddi-supply:
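Since the AKM differs from the AKP only in the number of active lines, the binding keeps a single property set and simply widens the compatible enum. A minimal node for the new variant (regulator and GPIO labels hypothetical):

    panel@0 {
        compatible = "sony,acx424akm";
        reg = <0>;
        vddi-supply = <&panel_vddi>;
        reset-gpios = <&gpio0 4 GPIO_ACTIVE_LOW>;
    };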
diff --git a/Documentation/devicetree/bindings/display/panel/sony,acx565akm.yaml b/Documentation/devicetree/bindings/display/panel/sony,acx565akm.yaml
index 95d053c548ab..98abdf4ddeac 100644
--- a/Documentation/devicetree/bindings/display/panel/sony,acx565akm.yaml
+++ b/Documentation/devicetree/bindings/display/panel/sony,acx565akm.yaml
@@ -6,15 +6,12 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Sony ACX565AKM SDI Panel
-description: |
- The panel must obey the rules for a SPI slave device as specified in
- spi/spi-controller.yaml
-
maintainers:
- Tomi Valkeinen <tomi.valkeinen@ti.com>
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/display/panel/sony,td4353-jdi.yaml b/Documentation/devicetree/bindings/display/panel/sony,td4353-jdi.yaml
new file mode 100644
index 000000000000..b6b885b4c22d
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/sony,td4353-jdi.yaml
@@ -0,0 +1,82 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/sony,td4353-jdi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Sony TD4353 JDI 5 / 5.7" 2160x1080 MIPI-DSI Panel
+
+maintainers:
+ - Konrad Dybcio <konrad.dybcio@somainline.org>
+
+description: |
+ The Sony TD4353 JDI is a 5 (XZ2c) / 5.7 (XZ2) inch 2160x1080
+ MIPI-DSI panel, used in Xperia XZ2 and XZ2 Compact smartphones.
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ const: sony,td4353-jdi-tama
+
+ reg: true
+
+ backlight: true
+
+ vddio-supply:
+ description: VDDIO 1.8V supply
+
+ vsp-supply:
+ description: Positive 5.5V supply
+
+ vsn-supply:
+ description: Negative 5.5V supply
+
+ panel-reset-gpios:
+ description: Display panel reset pin
+
+ touch-reset-gpios:
+ description: Touch panel reset pin
+
+ port: true
+
+required:
+ - compatible
+ - reg
+ - vddio-supply
+ - vsp-supply
+ - vsn-supply
+ - panel-reset-gpios
+ - touch-reset-gpios
+ - port
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ dsi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ panel: panel@0 {
+ compatible = "sony,td4353-jdi-tama";
+ reg = <0>;
+
+ backlight = <&pmi8998_wled>;
+ vddio-supply = <&vreg_l14a_1p8>;
+ vsp-supply = <&lab>;
+ vsn-supply = <&ibb>;
+ panel-reset-gpios = <&tlmm 6 GPIO_ACTIVE_HIGH>;
+ touch-reset-gpios = <&tlmm 99 GPIO_ACTIVE_HIGH>;
+
+ port {
+ panel_in: endpoint {
+ remote-endpoint = <&dsi0_out>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/panel/sony,tulip-truly-nt35521.yaml b/Documentation/devicetree/bindings/display/panel/sony,tulip-truly-nt35521.yaml
new file mode 100644
index 000000000000..967972939598
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/sony,tulip-truly-nt35521.yaml
@@ -0,0 +1,72 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/sony,tulip-truly-nt35521.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Sony Tulip Truly NT35521 5.24" 1280x720 MIPI-DSI Panel
+
+maintainers:
+ - Shawn Guo <shawn.guo@linaro.org>
+
+description: |
+ The Sony Tulip Truly NT35521 is a 5.24" 1280x720 MIPI-DSI panel, which
+ can be found on the Sony Xperia M4 phone. The panel backlight is managed
+ through the DSI link.
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ const: sony,tulip-truly-nt35521
+
+ reg: true
+
+ positive5-supply:
+ description: Positive 5V supply
+
+ negative5-supply:
+ description: Negative 5V supply
+
+ reset-gpios: true
+
+ enable-gpios: true
+
+ port: true
+
+required:
+ - compatible
+ - reg
+ - positive5-supply
+ - negative5-supply
+ - reset-gpios
+ - enable-gpios
+ - port
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ dsi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ panel@0 {
+ compatible = "sony,tulip-truly-nt35521";
+ reg = <0>;
+ positive5-supply = <&positive5_reg>;
+ negative5-supply = <&negative5_reg>;
+ reset-gpios = <&msmgpio 25 GPIO_ACTIVE_LOW>;
+ enable-gpios = <&msmgpio 10 GPIO_ACTIVE_HIGH>;
+
+ port {
+ panel_in: endpoint {
+ remote-endpoint = <&dsi0_out>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/panel/tpo,td.yaml b/Documentation/devicetree/bindings/display/panel/tpo,td.yaml
index 4aa605613445..e8c8ee8d7c88 100644
--- a/Documentation/devicetree/bindings/display/panel/tpo,td.yaml
+++ b/Documentation/devicetree/bindings/display/panel/tpo,td.yaml
@@ -6,16 +6,13 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Toppoly TD Panels
-description: |
- The panel must obey the rules for a SPI slave device as specified in
- spi/spi-controller.yaml
-
maintainers:
- Marek Belisko <marek@goldelico.com>
- H. Nikolaus Schaller <hns@goldelico.com>
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
@@ -31,6 +28,9 @@ properties:
backlight: true
port: true
+ spi-cpha: true
+ spi-cpol: true
+
required:
- compatible
- port
diff --git a/Documentation/devicetree/bindings/display/panel/tpo,tpg110.yaml b/Documentation/devicetree/bindings/display/panel/tpo,tpg110.yaml
index 6f1f02044b4b..f0243d196191 100644
--- a/Documentation/devicetree/bindings/display/panel/tpo,tpg110.yaml
+++ b/Documentation/devicetree/bindings/display/panel/tpo,tpg110.yaml
@@ -41,6 +41,7 @@ description: |+
allOf:
- $ref: panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/display/panel/visionox,rm69299.yaml b/Documentation/devicetree/bindings/display/panel/visionox,rm69299.yaml
index 076b057b4af5..444ac2a4772d 100644
--- a/Documentation/devicetree/bindings/display/panel/visionox,rm69299.yaml
+++ b/Documentation/devicetree/bindings/display/panel/visionox,rm69299.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/panel/visionox,rm69299.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Visionox model RM69299 Panels Device Tree Bindings.
+title: Visionox model RM69299 Panels
maintainers:
- Harigovindan P <harigovi@codeaurora.org>
@@ -19,6 +19,8 @@ properties:
compatible:
const: visionox,rm69299-1080p-display
+ reg: true
+
vdda-supply:
description: |
Phandle of the regulator that provides the vdda supply voltage.
@@ -34,6 +36,7 @@ additionalProperties: false
required:
- compatible
+ - reg
- vdda-supply
- vdd3p3-supply
- reset-gpios
@@ -41,16 +44,22 @@ required:
examples:
- |
- panel {
- compatible = "visionox,rm69299-1080p-display";
+ dsi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ panel@0 {
+ compatible = "visionox,rm69299-1080p-display";
+ reg = <0>;
- vdda-supply = <&src_pp1800_l8c>;
- vdd3p3-supply = <&src_pp2800_l18a>;
+ vdda-supply = <&src_pp1800_l8c>;
+ vdd3p3-supply = <&src_pp2800_l18a>;
- reset-gpios = <&pm6150l_gpio 3 0>;
- port {
- panel0_in: endpoint {
- remote-endpoint = <&dsi0_out>;
+ reset-gpios = <&pm6150l_gpio 3 0>;
+ port {
+ panel0_in: endpoint {
+ remote-endpoint = <&dsi0_out>;
+ };
};
};
};
diff --git a/Documentation/devicetree/bindings/display/panel/visionox,vtdr6130.yaml b/Documentation/devicetree/bindings/display/panel/visionox,vtdr6130.yaml
new file mode 100644
index 000000000000..84562a5b710a
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/panel/visionox,vtdr6130.yaml
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/panel/visionox,vtdr6130.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Visionox VTDR6130 AMOLED DSI Panel
+
+maintainers:
+ - Neil Armstrong <neil.armstrong@linaro.org>
+
+allOf:
+ - $ref: panel-common.yaml#
+
+properties:
+ compatible:
+ const: visionox,vtdr6130
+
+ reg:
+ maxItems: 1
+ description: DSI virtual channel
+
+ vddio-supply: true
+ vci-supply: true
+ vdd-supply: true
+ port: true
+ reset-gpios: true
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - vddio-supply
+ - vci-supply
+ - vdd-supply
+ - reset-gpios
+ - port
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ dsi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ panel@0 {
+ compatible = "visionox,vtdr6130";
+ reg = <0>;
+
+ vddio-supply = <&vreg_l12b_1p8>;
+ vci-supply = <&vreg_l13b_3p0>;
+ vdd-supply = <&vreg_l11b_1p2>;
+
+ reset-gpios = <&tlmm 133 GPIO_ACTIVE_LOW>;
+
+ port {
+ panel0_in: endpoint {
+ remote-endpoint = <&dsi0_out>;
+ };
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/display/panel/xinpeng,xpp055c272.yaml b/Documentation/devicetree/bindings/display/panel/xinpeng,xpp055c272.yaml
index d5c46a3cc2b0..c407deb6afb1 100644
--- a/Documentation/devicetree/bindings/display/panel/xinpeng,xpp055c272.yaml
+++ b/Documentation/devicetree/bindings/display/panel/xinpeng,xpp055c272.yaml
@@ -17,6 +17,7 @@ properties:
const: xinpeng,xpp055c272
reg: true
backlight: true
+ port: true
reset-gpios: true
iovcc-supply:
description: regulator that supplies the iovcc voltage
@@ -27,6 +28,7 @@ required:
- compatible
- reg
- backlight
+ - port
- iovcc-supply
- vci-supply
@@ -44,6 +46,12 @@ examples:
backlight = <&backlight>;
iovcc-supply = <&vcc_1v8>;
vci-supply = <&vcc3v3_lcd>;
+
+ port {
+ mipi_in_panel: endpoint {
+ remote-endpoint = <&mipi_out_panel>;
+ };
+ };
};
};
diff --git a/Documentation/devicetree/bindings/display/renesas,du.yaml b/Documentation/devicetree/bindings/display/renesas,du.yaml
index e3ca5389c17d..c5b9e6812bce 100644
--- a/Documentation/devicetree/bindings/display/renesas,du.yaml
+++ b/Documentation/devicetree/bindings/display/renesas,du.yaml
@@ -39,6 +39,8 @@ properties:
- renesas,du-r8a77980 # for R-Car V3H compatible DU
- renesas,du-r8a77990 # for R-Car E3 compatible DU
- renesas,du-r8a77995 # for R-Car D3 compatible DU
+ - renesas,du-r8a779a0 # for R-Car V3U compatible DU
+ - renesas,du-r8a779g0 # for R-Car V4H compatible DU
reg:
maxItems: 1
@@ -74,18 +76,22 @@ properties:
unevaluatedProperties: false
renesas,cmms:
- $ref: "/schemas/types.yaml#/definitions/phandle-array"
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ items:
+ maxItems: 1
description:
A list of phandles to the CMM instances present in the SoC, one for each
available DU channel.
renesas,vsps:
- $ref: "/schemas/types.yaml#/definitions/phandle-array"
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ items:
+ items:
+ - description: phandle to VSP instance that serves the DU channel
+ - description: Channel index identifying the LIF instance in that VSP
description:
A list of phandle and channel index tuples to the VSPs that handle the
- memory interfaces for the DU channels. The phandle identifies the VSP
- instance that serves the DU channel, and the channel index identifies
- the LIF instance in that VSP.
+ memory interfaces for the DU channels.
required:
- compatible
@@ -104,7 +110,6 @@ allOf:
properties:
clocks:
minItems: 1
- maxItems: 3
items:
- description: Functional clock
- description: DU_DOTCLKIN0 input clock
@@ -112,7 +117,6 @@ allOf:
clock-names:
minItems: 1
- maxItems: 3
items:
- const: du.0
- pattern: '^dclkin\.[01]$'
@@ -154,7 +158,6 @@ allOf:
properties:
clocks:
minItems: 2
- maxItems: 4
items:
- description: Functional clock for DU0
- description: Functional clock for DU1
@@ -163,7 +166,6 @@ allOf:
clock-names:
minItems: 2
- maxItems: 4
items:
- const: du.0
- const: du.1
@@ -211,7 +213,6 @@ allOf:
properties:
clocks:
minItems: 2
- maxItems: 4
items:
- description: Functional clock for DU0
- description: Functional clock for DU1
@@ -220,7 +221,6 @@ allOf:
clock-names:
minItems: 2
- maxItems: 4
items:
- const: du.0
- const: du.1
@@ -266,7 +266,6 @@ allOf:
properties:
clocks:
minItems: 2
- maxItems: 4
items:
- description: Functional clock for DU0
- description: Functional clock for DU1
@@ -275,7 +274,6 @@ allOf:
clock-names:
minItems: 2
- maxItems: 4
items:
- const: du.0
- const: du.1
@@ -322,7 +320,6 @@ allOf:
properties:
clocks:
minItems: 2
- maxItems: 4
items:
- description: Functional clock for DU0
- description: Functional clock for DU1
@@ -331,7 +328,6 @@ allOf:
clock-names:
minItems: 2
- maxItems: 4
items:
- const: du.0
- const: du.1
@@ -381,7 +377,6 @@ allOf:
properties:
clocks:
minItems: 3
- maxItems: 6
items:
- description: Functional clock for DU0
- description: Functional clock for DU1
@@ -392,7 +387,6 @@ allOf:
clock-names:
minItems: 3
- maxItems: 6
items:
- const: du.0
- const: du.1
@@ -443,7 +437,6 @@ allOf:
properties:
clocks:
minItems: 4
- maxItems: 8
items:
- description: Functional clock for DU0
- description: Functional clock for DU1
@@ -456,7 +449,6 @@ allOf:
clock-names:
minItems: 4
- maxItems: 8
items:
- const: du.0
- const: du.1
@@ -520,7 +512,6 @@ allOf:
properties:
clocks:
minItems: 3
- maxItems: 6
items:
- description: Functional clock for DU0
- description: Functional clock for DU1
@@ -531,7 +522,6 @@ allOf:
clock-names:
minItems: 3
- maxItems: 6
items:
- const: du.0
- const: du.1
@@ -591,7 +581,6 @@ allOf:
properties:
clocks:
minItems: 3
- maxItems: 6
items:
- description: Functional clock for DU0
- description: Functional clock for DU1
@@ -602,7 +591,6 @@ allOf:
clock-names:
minItems: 3
- maxItems: 6
items:
- const: du.0
- const: du.1
@@ -661,14 +649,12 @@ allOf:
properties:
clocks:
minItems: 1
- maxItems: 2
items:
- description: Functional clock for DU0
- description: DU_DOTCLKIN0 input clock
clock-names:
minItems: 1
- maxItems: 2
items:
- const: du.0
- const: dclkin.0
@@ -718,7 +704,6 @@ allOf:
properties:
clocks:
minItems: 2
- maxItems: 4
items:
- description: Functional clock for DU0
- description: Functional clock for DU1
@@ -727,7 +712,6 @@ allOf:
clock-names:
minItems: 2
- maxItems: 4
items:
- const: du.0
- const: du.1
@@ -773,6 +757,56 @@ allOf:
- reset-names
- renesas,vsps
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - renesas,du-r8a779a0
+ - renesas,du-r8a779g0
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Functional clock
+
+ clock-names:
+ items:
+ - const: du.0
+
+ interrupts:
+ maxItems: 2
+
+ resets:
+ maxItems: 1
+
+ reset-names:
+ items:
+ - const: du.0
+
+ ports:
+ properties:
+ port@0:
+ description: DSI 0
+ port@1:
+ description: DSI 1
+ port@2: false
+ port@3: false
+
+ required:
+ - port@0
+ - port@1
+
+ renesas,vsps:
+ minItems: 2
+
+ required:
+ - clock-names
+ - interrupts
+ - resets
+ - reset-names
+ - renesas,vsps
+
additionalProperties: false
examples:
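For the new V3U/V4H entries, the added conditional pins the DU down to a single functional clock, one reset, two DSI output ports, and at least two VSP tuples; renesas,vsps itself is now schema-described as <phandle channel> pairs. A sketch of an R-Car V3U node under those constraints (addresses, clock/reset specifiers, interrupt numbers and labels all hypothetical):

    display@feb00000 {
        compatible = "renesas,du-r8a779a0";
        reg = <0xfeb00000 0x40000>;
        interrupts = <GIC_SPI 256 IRQ_TYPE_LEVEL_HIGH>,
                     <GIC_SPI 257 IRQ_TYPE_LEVEL_HIGH>;
        clocks = <&cpg CPG_MOD 411>;
        clock-names = "du.0";
        resets = <&cpg 411>;
        reset-names = "du.0";
        renesas,vsps = <&vspd0 0>, <&vspd1 0>;  /* one <phandle channel> tuple per DU channel */

        ports {
            #address-cells = <1>;
            #size-cells = <0>;

            port@0 {
                reg = <0>;
                du_out_dsi0: endpoint {
                    remote-endpoint = <&dsi0_in>;
                };
            };
            port@1 {
                reg = <1>;
                du_out_dsi1: endpoint {
                    remote-endpoint = <&dsi1_in>;
                };
            };
        };
    };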
diff --git a/Documentation/devicetree/bindings/display/rockchip/analogix_dp-rockchip.txt b/Documentation/devicetree/bindings/display/rockchip/analogix_dp-rockchip.txt
deleted file mode 100644
index 43561584c13a..000000000000
--- a/Documentation/devicetree/bindings/display/rockchip/analogix_dp-rockchip.txt
+++ /dev/null
@@ -1,98 +0,0 @@
-Rockchip RK3288 specific extensions to the Analogix Display Port
-================================
-
-Required properties:
-- compatible: "rockchip,rk3288-dp",
- "rockchip,rk3399-edp";
-
-- reg: physical base address of the controller and length
-
-- clocks: from common clock binding: handle to dp clock.
- of memory mapped region.
-
-- clock-names: from common clock binding:
- Required elements: "dp" "pclk"
-
-- resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
-
-- pinctrl-names: Names corresponding to the chip hotplug pinctrl states.
-- pinctrl-0: pin-control mode. should be <&edp_hpd>
-
-- reset-names: Must include the name "dp"
-
-- rockchip,grf: this soc should set GRF regs, so need get grf here.
-
-- ports: there are 2 port nodes with endpoint definitions as defined in
- Documentation/devicetree/bindings/media/video-interfaces.txt.
- Port 0: contained 2 endpoints, connecting to the output of vop.
- Port 1: contained 1 endpoint, connecting to the input of panel.
-
-Optional property for different chips:
-- clocks: from common clock binding: handle to grf_vio clock.
-
-- clock-names: from common clock binding:
- Required elements: "grf"
-
-For the below properties, please refer to Analogix DP binding document:
- * Documentation/devicetree/bindings/display/bridge/analogix_dp.txt
-- phys (required)
-- phy-names (required)
-- hpd-gpios (optional)
-- force-hpd (optional)
--------------------------------------------------------------------------------
-
-Example:
- dp-controller: dp@ff970000 {
- compatible = "rockchip,rk3288-dp";
- reg = <0xff970000 0x4000>;
- interrupts = <GIC_SPI 98 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&cru SCLK_EDP>, <&cru PCLK_EDP_CTRL>;
- clock-names = "dp", "pclk";
- phys = <&dp_phy>;
- phy-names = "dp";
-
- rockchip,grf = <&grf>;
- resets = <&cru 111>;
- reset-names = "dp";
-
- pinctrl-names = "default";
- pinctrl-0 = <&edp_hpd>;
-
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
- edp_in: port@0 {
- reg = <0>;
- #address-cells = <1>;
- #size-cells = <0>;
- edp_in_vopb: endpoint@0 {
- reg = <0>;
- remote-endpoint = <&vopb_out_edp>;
- };
- edp_in_vopl: endpoint@1 {
- reg = <1>;
- remote-endpoint = <&vopl_out_edp>;
- };
- };
-
- edp_out: port@1 {
- reg = <1>;
- #address-cells = <1>;
- #size-cells = <0>;
- edp_out_panel: endpoint {
- reg = <0>;
- remote-endpoint = <&panel_in_edp>
- };
- };
- };
- };
-
- pinctrl {
- edp {
- edp_hpd: edp-hpd {
- rockchip,pins = <7 11 RK_FUNC_2 &pcfg_pull_none>;
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt b/Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt
deleted file mode 100644
index 39792f051d2d..000000000000
--- a/Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt
+++ /dev/null
@@ -1,93 +0,0 @@
-Rockchip specific extensions to the Synopsys Designware MIPI DSI
-================================
-
-Required properties:
-- #address-cells: Should be <1>.
-- #size-cells: Should be <0>.
-- compatible: one of
- "rockchip,px30-mipi-dsi", "snps,dw-mipi-dsi"
- "rockchip,rk3288-mipi-dsi", "snps,dw-mipi-dsi"
- "rockchip,rk3399-mipi-dsi", "snps,dw-mipi-dsi"
-- reg: Represent the physical address range of the controller.
-- interrupts: Represent the controller's interrupt to the CPU(s).
-- clocks, clock-names: Phandles to the controller's pll reference
- clock(ref) when using an internal dphy and APB clock(pclk).
- For RK3399, a phy config clock (phy_cfg) and a grf clock(grf)
- are required. As described in [1].
-- rockchip,grf: this soc should set GRF regs to mux vopl/vopb.
-- ports: contain a port node with endpoint definitions as defined in [2].
- For vopb,set the reg = <0> and set the reg = <1> for vopl.
-- video port 0 for the VOP input, the remote endpoint maybe vopb or vopl
-- video port 1 for either a panel or subsequent encoder
-
-Optional properties:
-- phys: from general PHY binding: the phandle for the PHY device.
-- phy-names: Should be "dphy" if phys references an external phy.
-- #phy-cells: Defined when used as ISP phy, should be 0.
-- power-domains: a phandle to mipi dsi power domain node.
-- resets: list of phandle + reset specifier pairs, as described in [3].
-- reset-names: string reset name, must be "apb".
-
-[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
-[2] Documentation/devicetree/bindings/media/video-interfaces.txt
-[3] Documentation/devicetree/bindings/reset/reset.txt
-
-Example:
- mipi_dsi: mipi@ff960000 {
- #address-cells = <1>;
- #size-cells = <0>;
- compatible = "rockchip,rk3288-mipi-dsi", "snps,dw-mipi-dsi";
- reg = <0xff960000 0x4000>;
- interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&cru SCLK_MIPI_24M>, <&cru PCLK_MIPI_DSI0>;
- clock-names = "ref", "pclk";
- resets = <&cru SRST_MIPIDSI0>;
- reset-names = "apb";
- rockchip,grf = <&grf>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- mipi_in: port@0 {
- reg = <0>;
- #address-cells = <1>;
- #size-cells = <0>;
-
- mipi_in_vopb: endpoint@0 {
- reg = <0>;
- remote-endpoint = <&vopb_out_mipi>;
- };
- mipi_in_vopl: endpoint@1 {
- reg = <1>;
- remote-endpoint = <&vopl_out_mipi>;
- };
- };
-
- mipi_out: port@1 {
- reg = <1>;
- #address-cells = <1>;
- #size-cells = <0>;
-
- mipi_out_panel: endpoint {
- remote-endpoint = <&panel_in_mipi>;
- };
- };
- };
-
- panel {
- compatible ="boe,tv080wum-nl0";
- reg = <0>;
-
- enable-gpios = <&gpio7 3 GPIO_ACTIVE_HIGH>;
- pinctrl-names = "default";
- pinctrl-0 = <&lcd_en>;
- backlight = <&backlight>;
-
- port {
- panel_in_mipi: endpoint {
- remote-endpoint = <&mipi_out_panel>;
- };
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/rockchip/rockchip,analogix-dp.yaml b/Documentation/devicetree/bindings/display/rockchip/rockchip,analogix-dp.yaml
new file mode 100644
index 000000000000..60dedf9b2be7
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/rockchip/rockchip,analogix-dp.yaml
@@ -0,0 +1,103 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/rockchip/rockchip,analogix-dp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip specific extensions to the Analogix Display Port
+
+maintainers:
+ - Sandy Huang <hjc@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rk3288-dp
+ - rockchip,rk3399-edp
+
+ clocks:
+ minItems: 2
+ maxItems: 3
+
+ clock-names:
+ minItems: 2
+ items:
+ - const: dp
+ - const: pclk
+ - const: grf
+
+ power-domains:
+ maxItems: 1
+
+ resets:
+ maxItems: 1
+
+ reset-names:
+ const: dp
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ This SoC makes use of GRF regs.
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+ - rockchip,grf
+
+allOf:
+ - $ref: /schemas/display/bridge/analogix,dp.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/rk3288-cru.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ dp@ff970000 {
+ compatible = "rockchip,rk3288-dp";
+ reg = <0xff970000 0x4000>;
+ interrupts = <GIC_SPI 98 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&cru SCLK_EDP>, <&cru PCLK_EDP_CTRL>;
+ clock-names = "dp", "pclk";
+ phys = <&dp_phy>;
+ phy-names = "dp";
+ resets = <&cru 111>;
+ reset-names = "dp";
+ rockchip,grf = <&grf>;
+ pinctrl-0 = <&edp_hpd>;
+ pinctrl-names = "default";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ edp_in: port@0 {
+ reg = <0>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ edp_in_vopb: endpoint@0 {
+ reg = <0>;
+ remote-endpoint = <&vopb_out_edp>;
+ };
+ edp_in_vopl: endpoint@1 {
+ reg = <1>;
+ remote-endpoint = <&vopl_out_edp>;
+ };
+ };
+
+ edp_out: port@1 {
+ reg = <1>;
+
+ edp_out_panel: endpoint {
+ remote-endpoint = <&panel_in_edp>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/rockchip/rockchip,dw-hdmi.yaml b/Documentation/devicetree/bindings/display/rockchip/rockchip,dw-hdmi.yaml
index da3b889ad8fc..7e59dee15a5f 100644
--- a/Documentation/devicetree/bindings/display/rockchip/rockchip,dw-hdmi.yaml
+++ b/Documentation/devicetree/bindings/display/rockchip/rockchip,dw-hdmi.yaml
@@ -23,10 +23,22 @@ properties:
- rockchip,rk3288-dw-hdmi
- rockchip,rk3328-dw-hdmi
- rockchip,rk3399-dw-hdmi
+ - rockchip,rk3568-dw-hdmi
reg-io-width:
const: 4
+ avdd-0v9-supply:
+ description:
+ A 0.9V supply that powers up the SoC internal circuitry. The actual pin name
+ varies between the different SoCs and is usually HDMI_TX_AVDD_0V9 or sometimes
+ HDMI_AVDD_1V0.
+
+ avdd-1v8-supply:
+ description:
+ A 1.8V supply that powers up the SoC internal circuitry. The pin name on the
+ SoC usually is HDMI_TX_AVDD_1V8.
+
clocks:
minItems: 2
items:
@@ -36,7 +48,8 @@ properties:
# order when present.
- description: The HDMI CEC controller main clock
- description: Power for GRF IO
- - description: External clock for some HDMI PHY
+ - description: External clock for some HDMI PHY (old clock name, deprecated)
+ - description: External clock for some HDMI PHY (new name)
clock-names:
minItems: 2
@@ -47,10 +60,14 @@ properties:
- cec
- grf
- vpll
+ - ref
- enum:
- grf
- vpll
- - const: vpll
+ - ref
+ - enum:
+ - vpll
+ - ref
ddc-i2c-bus:
$ref: /schemas/types.yaml#/definitions/phandle
@@ -72,6 +89,7 @@ properties:
The unwedge pinctrl entry shall drive the DDC SDA line low. This is
intended to work around a hardware errata that can cause the DDC I2C
bus to be wedged.
+ minItems: 1
items:
- const: default
- const: unwedge
@@ -79,27 +97,21 @@ properties:
ports:
$ref: /schemas/graph.yaml#/properties/ports
- properties:
- port:
- $ref: /schemas/graph.yaml#/$defs/port-base
- unevaluatedProperties: false
+ patternProperties:
+ "^port(@0)?$":
+ $ref: /schemas/graph.yaml#/properties/port
description: Input of the DWC HDMI TX
-
properties:
+ endpoint:
+ description: Connection to the VOP
endpoint@0:
- $ref: /schemas/graph.yaml#/properties/endpoint
description: Connection to the VOPB
-
endpoint@1:
- $ref: /schemas/graph.yaml#/properties/endpoint
description: Connection to the VOPL
-
- required:
- - endpoint@0
- - endpoint@1
-
- required:
- - port
+ properties:
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: Output of the DWC HDMI TX
rockchip,grf:
$ref: /schemas/types.yaml#/definitions/phandle
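The reworked clock-names list keeps accepting the old "vpll" spelling while steering new device trees toward "ref". A sketch of an RK3568 node using the new name (register range, clock specifiers and regulator labels hypothetical; "iahb"/"isfr" are assumed from the base Synopsys DW HDMI binding):

    hdmi: hdmi@fe0a0000 {
        compatible = "rockchip,rk3568-dw-hdmi";
        reg = <0xfe0a0000 0x20000>;
        reg-io-width = <4>;
        clocks = <&cru PCLK_HDMI_HOST>, <&cru CLK_SYS_HDMI>, <&cru CLK_REF_HDMI>;
        clock-names = "iahb", "isfr", "ref";   /* "ref" replaces the deprecated "vpll" */
        avdd-0v9-supply = <&vdda0v9_image>;
        avdd-1v8-supply = <&vcca1v8_image>;
        rockchip,grf = <&grf>;
    };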
diff --git a/Documentation/devicetree/bindings/display/rockchip/rockchip,dw-mipi-dsi.yaml b/Documentation/devicetree/bindings/display/rockchip/rockchip,dw-mipi-dsi.yaml
new file mode 100644
index 000000000000..8e8a40879140
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/rockchip/rockchip,dw-mipi-dsi.yaml
@@ -0,0 +1,166 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/rockchip/rockchip,dw-mipi-dsi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip specific extensions to the Synopsys Designware MIPI DSI
+
+maintainers:
+ - Sandy Huang <hjc@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - rockchip,px30-mipi-dsi
+ - rockchip,rk3288-mipi-dsi
+ - rockchip,rk3399-mipi-dsi
+ - rockchip,rk3568-mipi-dsi
+ - const: snps,dw-mipi-dsi
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 4
+
+ clock-names:
+ oneOf:
+ - minItems: 2
+ items:
+ - const: ref
+ - const: pclk
+ - const: phy_cfg
+ - const: grf
+ - const: pclk
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ This SoC uses GRF regs to switch between vopl/vopb.
+
+ phys:
+ maxItems: 1
+
+ phy-names:
+ const: dphy
+
+ "#phy-cells":
+ const: 0
+ description:
+ Defined when in use as ISP phy.
+
+ power-domains:
+ maxItems: 1
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - rockchip,grf
+
+allOf:
+ - $ref: /schemas/display/bridge/snps,dw-mipi-dsi.yaml#
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - rockchip,px30-mipi-dsi
+ - rockchip,rk3568-mipi-dsi
+
+ then:
+ properties:
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ maxItems: 1
+
+ required:
+ - phys
+ - phy-names
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: rockchip,rk3288-mipi-dsi
+
+ then:
+ properties:
+ clocks:
+ maxItems: 2
+
+ clock-names:
+ maxItems: 2
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: rockchip,rk3399-mipi-dsi
+
+ then:
+ properties:
+ clocks:
+ minItems: 4
+
+ clock-names:
+ minItems: 4
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/rk3288-cru.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ mipi_dsi: dsi@ff960000 {
+ compatible = "rockchip,rk3288-mipi-dsi", "snps,dw-mipi-dsi";
+ reg = <0xff960000 0x4000>;
+ interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&cru SCLK_MIPIDSI_24M>, <&cru PCLK_MIPI_DSI0>;
+ clock-names = "ref", "pclk";
+ resets = <&cru SRST_MIPIDSI0>;
+ reset-names = "apb";
+ rockchip,grf = <&grf>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ mipi_in: port@0 {
+ reg = <0>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ mipi_in_vopb: endpoint@0 {
+ reg = <0>;
+ remote-endpoint = <&vopb_out_mipi>;
+ };
+ mipi_in_vopl: endpoint@1 {
+ reg = <1>;
+ remote-endpoint = <&vopl_out_mipi>;
+ };
+ };
+
+ mipi_out: port@1 {
+ reg = <1>;
+
+ mipi_out_panel: endpoint {
+ remote-endpoint = <&panel_in_mipi>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/rockchip/rockchip,lvds.yaml b/Documentation/devicetree/bindings/display/rockchip/rockchip,lvds.yaml
new file mode 100644
index 000000000000..03b002a05c47
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/rockchip/rockchip,lvds.yaml
@@ -0,0 +1,170 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/rockchip/rockchip,lvds.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip low-voltage differential signal (LVDS) transmitter
+
+maintainers:
+ - Sandy Huang <hjc@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+properties:
+ compatible:
+ enum:
+ - rockchip,px30-lvds
+ - rockchip,rk3288-lvds
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: pclk_lvds
+
+ avdd1v0-supply:
+ description: 1.0V analog power.
+
+ avdd1v8-supply:
+ description: 1.8V analog power.
+
+ avdd3v3-supply:
+ description: 3.3V analog power.
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: Phandle to the general register files syscon.
+
+ rockchip,output:
+ $ref: /schemas/types.yaml#/definitions/string
+ enum: [rgb, lvds, duallvds]
+ description: This describes the output interface.
+
+ phys:
+ maxItems: 1
+
+ phy-names:
+ const: dphy
+
+ pinctrl-names:
+ const: lcdc
+
+ pinctrl-0: true
+
+ power-domains:
+ maxItems: 1
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Video port 0 for the VOP input.
+ The remote endpoint maybe vopb or vopl.
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Video port 1 for either a panel or subsequent encoder.
+
+ required:
+ - port@0
+ - port@1
+
+required:
+ - compatible
+ - rockchip,grf
+ - rockchip,output
+ - ports
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: rockchip,px30-lvds
+
+ then:
+ properties:
+ reg: false
+ clocks: false
+ clock-names: false
+ avdd1v0-supply: false
+ avdd1v8-supply: false
+ avdd3v3-supply: false
+
+ required:
+ - phys
+ - phy-names
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: rockchip,rk3288-lvds
+
+ then:
+ properties:
+ phys: false
+ phy-names: false
+
+ required:
+ - reg
+ - clocks
+ - clock-names
+ - avdd1v0-supply
+ - avdd1v8-supply
+ - avdd3v3-supply
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/rk3288-cru.h>
+
+ lvds: lvds@ff96c000 {
+ compatible = "rockchip,rk3288-lvds";
+ reg = <0xff96c000 0x4000>;
+ clocks = <&cru PCLK_LVDS_PHY>;
+ clock-names = "pclk_lvds";
+ avdd1v0-supply = <&vdd10_lcd>;
+ avdd1v8-supply = <&vcc18_lcd>;
+ avdd3v3-supply = <&vcca_33>;
+ pinctrl-names = "lcdc";
+ pinctrl-0 = <&lcdc_ctl>;
+ rockchip,grf = <&grf>;
+ rockchip,output = "rgb";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ lvds_in: port@0 {
+ reg = <0>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ lvds_in_vopb: endpoint@0 {
+ reg = <0>;
+ remote-endpoint = <&vopb_out_lvds>;
+ };
+ lvds_in_vopl: endpoint@1 {
+ reg = <1>;
+ remote-endpoint = <&vopl_out_lvds>;
+ };
+ };
+
+ lvds_out: port@1 {
+ reg = <1>;
+
+ lvds_out_panel: endpoint {
+ remote-endpoint = <&panel_in_lvds>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/rockchip/rockchip,rk3066-hdmi.yaml b/Documentation/devicetree/bindings/display/rockchip/rockchip,rk3066-hdmi.yaml
index 008c144257cb..1a68a940d165 100644
--- a/Documentation/devicetree/bindings/display/rockchip/rockchip,rk3066-hdmi.yaml
+++ b/Documentation/devicetree/bindings/display/rockchip/rockchip,rk3066-hdmi.yaml
@@ -26,14 +26,6 @@ properties:
clock-names:
const: hclk
- pinctrl-0:
- maxItems: 2
-
- pinctrl-names:
- const: default
- description:
- Switch the iomux for the HPD/I2C pins to HDMI function.
-
power-domains:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/display/rockchip/rockchip-drm.yaml b/Documentation/devicetree/bindings/display/rockchip/rockchip-drm.yaml
index 7204da5eb4c5..a8d18a37cb23 100644
--- a/Documentation/devicetree/bindings/display/rockchip/rockchip-drm.yaml
+++ b/Documentation/devicetree/bindings/display/rockchip/rockchip-drm.yaml
@@ -21,6 +21,8 @@ properties:
ports:
$ref: /schemas/types.yaml#/definitions/phandle-array
+ items:
+ maxItems: 1
description: |
Should contain a list of phandles pointing to display interface port
of vop devices. vop definitions as defined in
diff --git a/Documentation/devicetree/bindings/display/rockchip/rockchip-lvds.txt b/Documentation/devicetree/bindings/display/rockchip/rockchip-lvds.txt
deleted file mode 100644
index aaf8c44cf90f..000000000000
--- a/Documentation/devicetree/bindings/display/rockchip/rockchip-lvds.txt
+++ /dev/null
@@ -1,92 +0,0 @@
-Rockchip RK3288 LVDS interface
-================================
-
-Required properties:
-- compatible: matching the soc type, one of
- - "rockchip,rk3288-lvds";
- - "rockchip,px30-lvds";
-
-- reg: physical base address of the controller and length
- of memory mapped region.
-- clocks: must include clock specifiers corresponding to entries in the
- clock-names property.
-- clock-names: must contain "pclk_lvds"
-
-- avdd1v0-supply: regulator phandle for 1.0V analog power
-- avdd1v8-supply: regulator phandle for 1.8V analog power
-- avdd3v3-supply: regulator phandle for 3.3V analog power
-
-- rockchip,grf: phandle to the general register files syscon
-- rockchip,output: "rgb", "lvds" or "duallvds", This describes the output interface
-
-- phys: LVDS/DSI DPHY (px30 only)
-- phy-names: name of the PHY, must be "dphy" (px30 only)
-
-Optional properties:
-- pinctrl-names: must contain a "lcdc" entry.
-- pinctrl-0: pin control group to be used for this controller.
-
-Required nodes:
-
-The lvds has two video ports as described by
- Documentation/devicetree/bindings/media/video-interfaces.txt
-Their connections are modeled using the OF graph bindings specified in
- Documentation/devicetree/bindings/graph.txt.
-
-- video port 0 for the VOP input, the remote endpoint maybe vopb or vopl
-- video port 1 for either a panel or subsequent encoder
-
-Example:
-
-lvds_panel: lvds-panel {
- compatible = "auo,b101ean01";
- enable-gpios = <&gpio7 21 GPIO_ACTIVE_HIGH>;
- data-mapping = "jeida-24";
-
- ports {
- panel_in_lvds: endpoint {
- remote-endpoint = <&lvds_out_panel>;
- };
- };
-};
-
-For Rockchip RK3288:
-
- lvds: lvds@ff96c000 {
- compatible = "rockchip,rk3288-lvds";
- rockchip,grf = <&grf>;
- reg = <0xff96c000 0x4000>;
- clocks = <&cru PCLK_LVDS_PHY>;
- clock-names = "pclk_lvds";
- pinctrl-names = "lcdc";
- pinctrl-0 = <&lcdc_ctl>;
- avdd1v0-supply = <&vdd10_lcd>;
- avdd1v8-supply = <&vcc18_lcd>;
- avdd3v3-supply = <&vcca_33>;
- rockchip,output = "rgb";
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- lvds_in: port@0 {
- reg = <0>;
-
- lvds_in_vopb: endpoint@0 {
- reg = <0>;
- remote-endpoint = <&vopb_out_lvds>;
- };
- lvds_in_vopl: endpoint@1 {
- reg = <1>;
- remote-endpoint = <&vopl_out_lvds>;
- };
- };
-
- lvds_out: port@1 {
- reg = <1>;
-
- lvds_out_panel: endpoint {
- remote-endpoint = <&panel_in_lvds>;
- };
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/display/rockchip/rockchip-vop2.yaml b/Documentation/devicetree/bindings/display/rockchip/rockchip-vop2.yaml
new file mode 100644
index 000000000000..fba45091d909
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/rockchip/rockchip-vop2.yaml
@@ -0,0 +1,146 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/rockchip/rockchip-vop2.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip SoC display controller (VOP2)
+
+description:
+ VOP2 (Video Output Processor v2) is the display controller for the Rockchip
+ series of SoCs which transfers the image data from a video memory
+ buffer to an external LCD interface.
+
+maintainers:
+ - Sandy Huang <hjc@rock-chips.com>
+ - Heiko Stuebner <heiko@sntech.de>
+
+properties:
+ compatible:
+ enum:
+ - rockchip,rk3566-vop
+ - rockchip,rk3568-vop
+
+ reg:
+ items:
+ - description:
+ Must contain one entry corresponding to the base address and length
+ of the register space.
+ - description:
+ Can optionally contain a second entry corresponding to
+ the CRTC gamma LUT address.
+
+ reg-names:
+ items:
+ - const: vop
+ - const: gamma-lut
+
+ interrupts:
+ maxItems: 1
+ description:
+ The VOP interrupt is shared by several interrupt sources, such as
+ frame start (VSYNC), line flag and other status interrupts.
+
+ clocks:
+ items:
+ - description: Clock for ddr buffer transfer.
+ - description: Clock for the ahb bus to R/W the phy regs.
+ - description: Pixel clock for video port 0.
+ - description: Pixel clock for video port 1.
+ - description: Pixel clock for video port 2.
+
+ clock-names:
+ items:
+ - const: aclk
+ - const: hclk
+ - const: dclk_vp0
+ - const: dclk_vp1
+ - const: dclk_vp2
+
+ rockchip,grf:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to GRF regs used for misc control
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Output endpoint of VP0
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Output endpoint of VP1
+
+ port@2:
+ $ref: /schemas/graph.yaml#/properties/port
+ description:
+ Output endpoint of VP2
+
+ iommus:
+ maxItems: 1
+
+ power-domains:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - interrupts
+ - clocks
+ - clock-names
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/rk3568-cru.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/power/rk3568-power.h>
+ bus {
+ #address-cells = <2>;
+ #size-cells = <2>;
+ vop: vop@fe040000 {
+ compatible = "rockchip,rk3568-vop";
+ reg = <0x0 0xfe040000 0x0 0x3000>, <0x0 0xfe044000 0x0 0x1000>;
+ reg-names = "vop", "gamma-lut";
+ interrupts = <GIC_SPI 148 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&cru ACLK_VOP>,
+ <&cru HCLK_VOP>,
+ <&cru DCLK_VOP0>,
+ <&cru DCLK_VOP1>,
+ <&cru DCLK_VOP2>;
+ clock-names = "aclk",
+ "hclk",
+ "dclk_vp0",
+ "dclk_vp1",
+ "dclk_vp2";
+ power-domains = <&power RK3568_PD_VO>;
+ iommus = <&vop_mmu>;
+ vop_out: ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ vp0: port@0 {
+ reg = <0>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+ };
+ vp1: port@1 {
+ reg = <1>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+ };
+ vp2: port@2 {
+ reg = <2>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/samsung/samsung,exynos-hdmi-ddc.yaml b/Documentation/devicetree/bindings/display/samsung/samsung,exynos-hdmi-ddc.yaml
new file mode 100644
index 000000000000..458d399cb025
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/samsung/samsung,exynos-hdmi-ddc.yaml
@@ -0,0 +1,41 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/samsung/samsung,exynos-hdmi-ddc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos SoC HDMI DDC
+
+maintainers:
+ - Inki Dae <inki.dae@samsung.com>
+ - Seung-Woo Kim <sw0312.kim@samsung.com>
+ - Kyungmin Park <kyungmin.park@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+properties:
+ compatible:
+ oneOf:
+ - const: samsung,exynos4210-hdmiddc
+ - const: samsung,exynos5-hdmiddc
+ deprecated: true
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ddc@50 {
+ compatible = "samsung,exynos4210-hdmiddc";
+ reg = <0x50>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/samsung/samsung,exynos-hdmi.yaml b/Documentation/devicetree/bindings/display/samsung/samsung,exynos-hdmi.yaml
new file mode 100644
index 000000000000..e4a68c5a1a09
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/samsung/samsung,exynos-hdmi.yaml
@@ -0,0 +1,226 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/samsung/samsung,exynos-hdmi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos SoC HDMI
+
+maintainers:
+ - Inki Dae <inki.dae@samsung.com>
+ - Seung-Woo Kim <sw0312.kim@samsung.com>
+ - Kyungmin Park <kyungmin.park@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+properties:
+ compatible:
+ enum:
+ - samsung,exynos4210-hdmi
+ - samsung,exynos4212-hdmi
+ - samsung,exynos5420-hdmi
+ - samsung,exynos5433-hdmi
+
+ clocks:
+ minItems: 5
+ maxItems: 10
+
+ clock-names:
+ minItems: 5
+ maxItems: 10
+
+ ddc:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the HDMI DDC node.
+
+ hdmi-en-supply:
+ description:
+ Provides the voltage source for the DDC lines available on the HDMI
+ connector. When no power is provided for the DDC EEPROM, some TV sets
+ do not pull up the HPD (hot plug detect) line, which causes the HDMI
+ block to stay turned off. When provided, the regulator allows the TV
+ set to correctly signal the HPD event.
+
+ hpd-gpios:
+ maxItems: 1
+ description:
+ A GPIO line connected to HPD
+
+ interrupts:
+ maxItems: 1
+
+ phy:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: Phandle to the HDMI PHY node.
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ description:
+ Contains a port which is connected to the mic node.
+
+ power-domains:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+ samsung,syscon-phandle:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to the PMU system controller node.
+
+ samsung,sysreg-phandle:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to DISP system controller interface.
+
+ '#sound-dai-cells':
+ const: 0
+
+ vdd-supply:
+ description:
+ VDD 1.0V HDMI TX.
+
+ vdd_osc-supply:
+ description:
+ VDD 1.8V HDMI OSC.
+
+ vdd_pll-supply:
+ description:
+ VDD 1.0V HDMI PLL.
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - ddc
+ - hpd-gpios
+ - interrupts
+ - phy
+ - reg
+ - samsung,syscon-phandle
+ - '#sound-dai-cells'
+ - vdd-supply
+ - vdd_osc-supply
+ - vdd_pll-supply
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5433-hdmi
+ then:
+ properties:
+ clocks:
+ items:
+ - description: Gate of HDMI IP APB bus.
+ - description: Gate of HDMI-PHY IP APB bus.
+ - description: Gate of HDMI TMDS clock.
+ - description: Gate of HDMI pixel clock.
+ - description: TMDS clock generated by HDMI-PHY.
+ - description: MUX used to switch between oscclk and tmds_clko,
+ depending on whether the HDMI-PHY is off or operational.
+ - description: Pixel clock generated by HDMI-PHY.
+ - description: MUX used to switch between oscclk and pixel_clko,
+ depending on whether the HDMI-PHY is off or operational.
+ - description: Oscillator clock, used as the parent of the following
+ *_user clocks when the HDMI-PHY is not operational.
+ - description: Gate of HDMI SPDIF clock.
+ clock-names:
+ items:
+ - const: hdmi_pclk
+ - const: hdmi_i_pclk
+ - const: i_tmds_clk
+ - const: i_pixel_clk
+ - const: tmds_clko
+ - const: tmds_clko_user
+ - const: pixel_clko
+ - const: pixel_clko_user
+ - const: oscclk
+ - const: i_spdif_clk
+ required:
+ - samsung,sysreg-phandle
+ else:
+ properties:
+ clocks:
+ items:
+ - description: Gate of HDMI IP bus clock.
+ - description: Gate of HDMI special clock.
+ - description: Pixel special clock, one of the two possible inputs
+ of HDMI clock mux.
+ - description: HDMI PHY clock output, one of two possible inputs of
+ HDMI clock mux.
+ - description: HDMI mux clock, required by the driver to switch
+ between its two parents, i.e. sclk_pixel and sclk_hdmiphy. If the
+ HDMI PHY is stable after configuration, the parent is set to
+ sclk_hdmiphy, otherwise to sclk_pixel.
+ clock-names:
+ items:
+ - const: hdmi
+ - const: sclk_hdmi
+ - const: sclk_pixel
+ - const: sclk_hdmiphy
+ - const: mout_hdmi
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos5433.h>
+ #include <dt-bindings/gpio/gpio.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ hdmi@13970000 {
+ compatible = "samsung,exynos5433-hdmi";
+ reg = <0x13970000 0x70000>;
+ interrupts = <GIC_SPI 208 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&cmu_disp CLK_PCLK_HDMI>,
+ <&cmu_disp CLK_PCLK_HDMIPHY>,
+ <&cmu_disp CLK_PHYCLK_HDMIPHY_TMDS_CLKO>,
+ <&cmu_disp CLK_PHYCLK_HDMI_PIXEL>,
+ <&cmu_disp CLK_PHYCLK_HDMIPHY_TMDS_CLKO_PHY>,
+ <&cmu_disp CLK_MOUT_PHYCLK_HDMIPHY_TMDS_CLKO_USER>,
+ <&cmu_disp CLK_PHYCLK_HDMIPHY_PIXEL_CLKO_PHY>,
+ <&cmu_disp CLK_MOUT_PHYCLK_HDMIPHY_PIXEL_CLKO_USER>,
+ <&xxti>,
+ <&cmu_disp CLK_SCLK_HDMI_SPDIF>;
+ clock-names = "hdmi_pclk",
+ "hdmi_i_pclk",
+ "i_tmds_clk",
+ "i_pixel_clk",
+ "tmds_clko",
+ "tmds_clko_user",
+ "pixel_clko",
+ "pixel_clko_user",
+ "oscclk",
+ "i_spdif_clk";
+ phy = <&hdmiphy>;
+ ddc = <&hsi2c_11>;
+ samsung,syscon-phandle = <&pmu_system_controller>;
+ samsung,sysreg-phandle = <&syscon_disp>;
+ #sound-dai-cells = <0>;
+
+ hpd-gpios = <&gpa3 0 GPIO_ACTIVE_HIGH>;
+ vdd-supply = <&ldo6_reg>;
+ vdd_osc-supply = <&ldo7_reg>;
+ vdd_pll-supply = <&ldo6_reg>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ hdmi_to_tv: endpoint {
+ remote-endpoint = <&tv_to_hdmi>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ hdmi_to_mhl: endpoint {
+ remote-endpoint = <&mhl_to_hdmi>;
+ };
+ };
+ };
+ };
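
Note: the example above omits the optional hdmi-en-supply regulator described earlier. On a board where the DDC EEPROM needs its own supply, it would sit alongside the other regulator properties; a minimal sketch, with a purely illustrative regulator label:

    hdmi-en-supply = <&vdd_hdmi_5v0>;
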
diff --git a/Documentation/devicetree/bindings/display/samsung/samsung,exynos-mixer.yaml b/Documentation/devicetree/bindings/display/samsung/samsung,exynos-mixer.yaml
new file mode 100644
index 000000000000..25d53fde92e1
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/samsung/samsung,exynos-mixer.yaml
@@ -0,0 +1,142 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/samsung/samsung,exynos-mixer.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos SoC Mixer
+
+maintainers:
+ - Inki Dae <inki.dae@samsung.com>
+ - Seung-Woo Kim <sw0312.kim@samsung.com>
+ - Kyungmin Park <kyungmin.park@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+description:
+ Samsung Exynos SoC Mixer is responsible for mixing and blending multiple data
+ inputs before passing them to an output device. The output is passed to HDMI.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - samsung,exynos4210-mixer
+ - samsung,exynos4212-mixer
+ - samsung,exynos5250-mixer
+ - samsung,exynos5420-mixer
+ - const: samsung,exynos5-mixer
+ deprecated: true
+
+ clocks:
+ minItems: 3
+ items:
+ - description: Gate of Mixer IP bus clock.
+ - description: Gate of HDMI IP bus clock, needed together with sclk_hdmi.
+ - description: HDMI Special clock, one of the two possible inputs of
+ mixer mux.
+ - description: Video Processor clock.
+ - description: Mixer mux clock.
+ - description: Mixer Special clock.
+
+ clock-names:
+ minItems: 3
+ items:
+ - const: mixer
+ - const: hdmi
+ - const: sclk_hdmi
+ - const: vp
+ - const: mout_mixer
+ - const: sclk_mixer
+
+ interconnects:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ iommus:
+ maxItems: 1
+
+ power-domains:
+ maxItems: 1
+
+ reg:
+ minItems: 1
+ items:
+ - description: Mixer memory region.
+ - description: Video Processor memory region.
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - interrupts
+ - reg
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos4210-mixer
+ then:
+ properties:
+ clocks:
+ minItems: 6
+ maxItems: 6
+ reg:
+ minItems: 2
+ maxItems: 2
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos4212-mixer
+ then:
+ properties:
+ clocks:
+ minItems: 4
+ maxItems: 4
+ reg:
+ minItems: 2
+ maxItems: 2
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - samsung,exynos5-mixer
+ - samsung,exynos5250-mixer
+ - samsung,exynos5420-mixer
+ then:
+ properties:
+ clocks:
+ minItems: 3
+ maxItems: 3
+ reg:
+ minItems: 1
+ maxItems: 1
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos5250.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ mixer@14450000 {
+ compatible = "samsung,exynos5250-mixer";
+ reg = <0x14450000 0x10000>;
+ interrupts = <GIC_SPI 94 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&clock CLK_MIXER>,
+ <&clock CLK_HDMI>,
+ <&clock CLK_SCLK_HDMI>;
+ clock-names = "mixer",
+ "hdmi",
+ "sclk_hdmi";
+ iommus = <&sysmmu_tv>;
+ power-domains = <&pd_disp1>;
+ };
diff --git a/Documentation/devicetree/bindings/display/samsung/samsung,exynos5433-decon.yaml b/Documentation/devicetree/bindings/display/samsung/samsung,exynos5433-decon.yaml
new file mode 100644
index 000000000000..6380eeebb073
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/samsung/samsung,exynos5433-decon.yaml
@@ -0,0 +1,145 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/samsung/samsung,exynos5433-decon.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos5433 SoC Display and Enhancement Controller (DECON)
+
+maintainers:
+ - Inki Dae <inki.dae@samsung.com>
+ - Seung-Woo Kim <sw0312.kim@samsung.com>
+ - Kyungmin Park <kyungmin.park@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+description: |
+ DECON (Display and Enhancement Controller) is the Display Controller for the
+ Exynos5433 series of SoCs which transfers the image data from a video memory
+ buffer to an external LCD interface.
+
+properties:
+ compatible:
+ enum:
+ - samsung,exynos5433-decon
+ - samsung,exynos5433-decon-tv
+
+ clocks:
+ maxItems: 11
+
+ clock-names:
+ items:
+ - const: pclk
+ - const: aclk_decon
+ - const: aclk_smmu_decon0x
+ - const: aclk_xiu_decon0x
+ - const: pclk_smmu_decon0x
+ - const: aclk_smmu_decon1x
+ - const: aclk_xiu_decon1x
+ - const: pclk_smmu_decon1x
+ - const: sclk_decon_vclk
+ - const: sclk_decon_eclk
+ - const: dsd
+
+ interrupts:
+ minItems: 3
+ maxItems: 4
+ description: |
+ Interrupts depend on the mode of operation:
+ - video mode: vsync
+ - command mode: lcd_sys
+ - command mode with software trigger: lcd_sys, te
+
+ interrupt-names:
+ minItems: 3
+ items:
+ - const: fifo
+ - const: vsync
+ - const: lcd_sys
+ - const: te
+
+ iommus:
+ maxItems: 2
+
+ iommu-names:
+ items:
+ - const: m0
+ - const: m1
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ description:
+ Contains a port which is connected to the mic node.
+
+ power-domains:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+ samsung,disp-sysreg:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to DISP system controller interface.
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - interrupts
+ - interrupt-names
+ - ports
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos5433.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ display-controller@13800000 {
+ compatible = "samsung,exynos5433-decon";
+ reg = <0x13800000 0x2104>;
+ clocks = <&cmu_disp CLK_PCLK_DECON>,
+ <&cmu_disp CLK_ACLK_DECON>,
+ <&cmu_disp CLK_ACLK_SMMU_DECON0X>,
+ <&cmu_disp CLK_ACLK_XIU_DECON0X>,
+ <&cmu_disp CLK_PCLK_SMMU_DECON0X>,
+ <&cmu_disp CLK_ACLK_SMMU_DECON1X>,
+ <&cmu_disp CLK_ACLK_XIU_DECON1X>,
+ <&cmu_disp CLK_PCLK_SMMU_DECON1X>,
+ <&cmu_disp CLK_SCLK_DECON_VCLK>,
+ <&cmu_disp CLK_SCLK_DECON_ECLK>,
+ <&cmu_disp CLK_SCLK_DSD>;
+ clock-names = "pclk",
+ "aclk_decon",
+ "aclk_smmu_decon0x",
+ "aclk_xiu_decon0x",
+ "pclk_smmu_decon0x",
+ "aclk_smmu_decon1x",
+ "aclk_xiu_decon1x",
+ "pclk_smmu_decon1x",
+ "sclk_decon_vclk",
+ "sclk_decon_eclk",
+ "dsd";
+ power-domains = <&pd_disp>;
+ interrupt-names = "fifo", "vsync", "lcd_sys";
+ interrupts = <GIC_SPI 201 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 202 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 203 IRQ_TYPE_LEVEL_HIGH>;
+ samsung,disp-sysreg = <&syscon_disp>;
+ iommus = <&sysmmu_decon0x>, <&sysmmu_decon1x>;
+ iommu-names = "m0", "m1";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ decon_to_mic: endpoint {
+ remote-endpoint = <&mic_to_decon>;
+ };
+ };
+ };
+ };
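
For panels driven in command mode with a software trigger, the binding above additionally expects the te interrupt; a sketch of the four-interrupt variant (the fourth GIC number is illustrative, not taken from a datasheet):

    interrupt-names = "fifo", "vsync", "lcd_sys", "te";
    interrupts = <GIC_SPI 201 IRQ_TYPE_LEVEL_HIGH>,
                 <GIC_SPI 202 IRQ_TYPE_LEVEL_HIGH>,
                 <GIC_SPI 203 IRQ_TYPE_LEVEL_HIGH>,
                 <GIC_SPI 204 IRQ_TYPE_LEVEL_HIGH>;
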
diff --git a/Documentation/devicetree/bindings/display/samsung/samsung,exynos5433-mic.yaml b/Documentation/devicetree/bindings/display/samsung/samsung,exynos5433-mic.yaml
new file mode 100644
index 000000000000..26e5017737a3
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/samsung/samsung,exynos5433-mic.yaml
@@ -0,0 +1,93 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/samsung/samsung,exynos5433-mic.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos5433 SoC Mobile Image Compressor (MIC)
+
+maintainers:
+ - Inki Dae <inki.dae@samsung.com>
+ - Seung-Woo Kim <sw0312.kim@samsung.com>
+ - Kyungmin Park <kyungmin.park@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+description: |
+ MIC (Mobile Image Compressor) resides between DECON and MIPI DSI. MIPI DSI is
+ not capable of transferring frame data at as high a resolution as DECON can
+ send. MIC solves this problem by compressing the frame data by half before it
+ is transferred through MIPI DSI. The compressed frame data must be
+ decompressed on the panel PCB.
+
+properties:
+ compatible:
+ const: samsung,exynos5433-mic
+
+ clocks:
+ maxItems: 2
+
+ clock-names:
+ items:
+ - const: pclk_mic0
+ - const: sclk_rgb_vclk_to_mic0
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ description:
+ Contains ports which are connected to the DECON and DSI nodes.
+
+ power-domains:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+ samsung,disp-syscon:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to DISP system controller interface.
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - ports
+ - reg
+ - samsung,disp-syscon
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos5433.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ image-processor@13930000 {
+ compatible = "samsung,exynos5433-mic";
+ reg = <0x13930000 0x48>;
+ clocks = <&cmu_disp CLK_PCLK_MIC0>,
+ <&cmu_disp CLK_SCLK_RGB_VCLK_TO_MIC0>;
+ clock-names = "pclk_mic0",
+ "sclk_rgb_vclk_to_mic0";
+ power-domains = <&pd_disp>;
+ samsung,disp-syscon = <&syscon_disp>;
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ mic_to_decon: endpoint {
+ remote-endpoint = <&decon_to_mic>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ mic_to_dsi: endpoint {
+ remote-endpoint = <&dsi_to_mic>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/samsung/samsung,exynos7-decon.yaml b/Documentation/devicetree/bindings/display/samsung/samsung,exynos7-decon.yaml
new file mode 100644
index 000000000000..992c23ca7a4e
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/samsung/samsung,exynos7-decon.yaml
@@ -0,0 +1,119 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/samsung/samsung,exynos7-decon.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos7 SoC Display and Enhancement Controller (DECON)
+
+maintainers:
+ - Inki Dae <inki.dae@samsung.com>
+ - Seung-Woo Kim <sw0312.kim@samsung.com>
+ - Kyungmin Park <kyungmin.park@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+description: |
+ DECON (Display and Enhancement Controller) is the Display Controller for the
+ Exynos7 series of SoCs which transfers the image data from a video memory
+ buffer to an external LCD interface.
+
+properties:
+ compatible:
+ const: samsung,exynos7-decon
+
+ clocks:
+ maxItems: 4
+
+ clock-names:
+ items:
+ - const: pclk_decon0
+ - const: aclk_decon0
+ - const: decon0_eclk
+ - const: decon0_vclk
+
+ display-timings:
+ $ref: ../panel/display-timings.yaml#
+
+ i80-if-timings:
+ type: object
+ additionalProperties: false
+ description: Timing configuration for the LCD i80 interface.
+ properties:
+ cs-setup:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Number of clock cycles from when the address signal is enabled until
+ chip select is enabled.
+ default: 0
+
+ wr-active:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Number of clock cycles during which chip select stays active.
+ default: 1
+
+ wr-hold:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Number of clock cycles from when chip select is disabled until the
+ write signal is disabled.
+ default: 0
+
+ wr-setup:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Number of clock cycles from when the chip select signal is enabled
+ until the write signal is enabled.
+ default: 0
+
+ interrupts:
+ items:
+ - description: FIFO level
+ - description: VSYNC
+ - description: LCD system
+
+ interrupt-names:
+ items:
+ - const: fifo
+ - const: vsync
+ - const: lcd_sys
+
+ power-domains:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - interrupts
+ - interrupt-names
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos7-clk.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ display-controller@13930000 {
+ compatible = "samsung,exynos7-decon";
+ reg = <0x13930000 0x1000>;
+ interrupt-names = "fifo", "vsync", "lcd_sys";
+ interrupts = <GIC_SPI 190 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 189 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 188 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&clock_disp 100>, /* PCLK_DECON_INT */
+ <&clock_disp 101>, /* ACLK_DECON_INT */
+ <&clock_disp 102>, /* SCLK_DECON_INT_ECLK */
+ <&clock_disp 103>; /* SCLK_DECON_INT_EXTCLKPLL */
+ clock-names = "pclk_decon0",
+ "aclk_decon0",
+ "decon0_eclk",
+ "decon0_vclk";
+ pinctrl-0 = <&lcd_clk &pwm1_out>;
+ pinctrl-names = "default";
+ };
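
The i80-if-timings subnode documented above does not appear in the example; on a board using the i80 bus it would be added inside the display-controller node. A minimal sketch (the values shown are the documented defaults, so each property may equally be omitted):

    i80-if-timings {
        cs-setup = <0>;
        wr-setup = <0>;
        wr-active = <1>;
        wr-hold = <0>;
    };
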
diff --git a/Documentation/devicetree/bindings/display/samsung/samsung,fimd.yaml b/Documentation/devicetree/bindings/display/samsung/samsung,fimd.yaml
new file mode 100644
index 000000000000..075231716b2f
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/samsung/samsung,fimd.yaml
@@ -0,0 +1,197 @@
+# SPDX-License-Identifier: GPL-2.0-only
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/samsung/samsung,fimd.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung S3C/S5P/Exynos SoC Fully Interactive Mobile Display (FIMD)
+
+maintainers:
+ - Inki Dae <inki.dae@samsung.com>
+ - Seung-Woo Kim <sw0312.kim@samsung.com>
+ - Kyungmin Park <kyungmin.park@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+properties:
+ compatible:
+ enum:
+ - samsung,s3c2443-fimd
+ - samsung,s3c6400-fimd
+ - samsung,s5pv210-fimd
+ - samsung,exynos3250-fimd
+ - samsung,exynos4210-fimd
+ - samsung,exynos5250-fimd
+ - samsung,exynos5420-fimd
+
+ '#address-cells':
+ const: 1
+
+ clocks:
+ maxItems: 2
+
+ clock-names:
+ items:
+ - const: sclk_fimd
+ - const: fimd
+
+ display-timings:
+ $ref: ../panel/display-timings.yaml#
+
+ i80-if-timings:
+ type: object
+ additionalProperties: false
+ description: |
+ Timing configuration for the LCD i80 interface.
+ The parameters are defined as::
+
+   VCLK(internal)  __|¯¯¯¯¯¯|_____|¯¯¯¯¯¯|_____|¯¯¯¯¯¯|_____|¯¯¯¯¯¯|_____|¯¯
+                     :            :            :            :            :
+   Address Output  --:<XXXXXXXXXXX:XXXXXXXXXXXX:XXXXXXXXXXXX:XXXXXXXXXXXX:XX
+                     | cs-setup+1 |            :            :            :
+                     |<---------->|            :            :            :
+   Chip Select     ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|____________:____________:____________|¯¯
+                                  | wr-setup+1 |            | wr-hold+1  |
+                                  |<---------->|            |<---------->|
+   Write Enable    ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|____________|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
+                                               | wr-active+1|
+                                               |<---------->|
+   Video Data      ----------------------------<XXXXXXXXXXXXXXXXXXXXXXXXX>--
+
+ properties:
+ cs-setup:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Number of clock cycles from when the address signal is enabled until
+ chip select is enabled.
+ default: 0
+
+ wr-active:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Number of clock cycles during which chip select stays active.
+ default: 1
+
+ wr-hold:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Number of clock cycles from when chip select is disabled until the
+ write signal is disabled.
+ default: 0
+
+ wr-setup:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Number of clock cycles from when the chip select signal is enabled
+ until the write signal is enabled.
+ default: 0
+
+ iommus:
+ minItems: 1
+ maxItems: 2
+
+ iommu-names:
+ items:
+ - const: m0
+ - const: m1
+
+ interrupts:
+ items:
+ - description: FIFO level
+ - description: VSYNC
+ - description: LCD system
+
+ interrupt-names:
+ items:
+ - const: fifo
+ - const: vsync
+ - const: lcd_sys
+
+ power-domains:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+ samsung,invert-vden:
+ type: boolean
+ description:
+ Video enable signal is inverted.
+
+ samsung,invert-vclk:
+ type: boolean
+ description:
+ Video clock signal is inverted.
+
+ samsung,sysreg:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to System Register syscon.
+
+ '#size-cells':
+ const: 0
+
+patternProperties:
+ "^port@[0-4]+$":
+ $ref: /schemas/graph.yaml#/properties/port
+ description: |
+ Contains a port; the port index selects the interface::
+ 0 - for CAMIF0 input,
+ 1 - for CAMIF1 input,
+ 2 - for CAMIF2 input,
+ 3 - for parallel output,
+ 4 - for write-back interface
+
+required:
+ - compatible
+ - clocks
+ - clock-names
+ - interrupts
+ - interrupt-names
+ - reg
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: samsung,exynos5420-fimd
+ then:
+ properties:
+ iommus:
+ minItems: 2
+ maxItems: 2
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/exynos4.h>
+
+ fimd@11c00000 {
+ compatible = "samsung,exynos4210-fimd";
+ interrupt-parent = <&combiner>;
+ reg = <0x11c00000 0x20000>;
+ interrupt-names = "fifo", "vsync", "lcd_sys";
+ interrupts = <11 0>, <11 1>, <11 2>;
+ clocks = <&clock CLK_SCLK_FIMD0>, <&clock CLK_FIMD0>;
+ clock-names = "sclk_fimd", "fimd";
+ power-domains = <&pd_lcd0>;
+ iommus = <&sysmmu_fimd0>;
+ samsung,sysreg = <&sys_reg>;
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ samsung,invert-vden;
+ samsung,invert-vclk;
+
+ pinctrl-0 = <&lcd_clk>, <&lcd_data24>;
+ pinctrl-names = "default";
+
+ port@3 {
+ reg = <3>;
+
+ fimd_dpi_ep: endpoint {
+ remote-endpoint = <&lcd_ep>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/simple-framebuffer.yaml b/Documentation/devicetree/bindings/display/simple-framebuffer.yaml
index c2499a7906f5..296500f9da05 100644
--- a/Documentation/devicetree/bindings/display/simple-framebuffer.yaml
+++ b/Documentation/devicetree/bindings/display/simple-framebuffer.yaml
@@ -4,10 +4,9 @@
$id: http://devicetree.org/schemas/display/simple-framebuffer.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Simple Framebuffer Device Tree Bindings
+title: Simple Framebuffer
maintainers:
- - Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
- Hans de Goede <hdegoede@redhat.com>
description: |+
@@ -27,6 +26,11 @@ description: |+
over control to a driver for the real hardware. The bindings for the
hw nodes must specify which node is considered the primary node.
+ If a panel node is given, then the driver uses this to configure the
+ physical width and height of the display. If no panel node is given,
+ then the driver uses the width and height properties of the simplefb
+ node to estimate it.
+
It is advised to add display# aliases to help the OS determine how
to number things. If display# aliases are used, then if the simplefb
node contains a display property then the /aliases/display# path
@@ -52,16 +56,23 @@ description: |+
properties:
compatible:
- items:
- - enum:
- - apple,simple-framebuffer
- - allwinner,simple-framebuffer
- - amlogic,simple-framebuffer
+ oneOf:
+ - items:
+ - enum:
+ - apple,simple-framebuffer
+ - allwinner,simple-framebuffer
+ - amlogic,simple-framebuffer
+ - const: simple-framebuffer
- const: simple-framebuffer
reg:
description: Location and size of the framebuffer memory
+ memory-region:
+ maxItems: 1
+ description: Phandle to a node describing the memory to be used for the
+ framebuffer. If present, overrides the "reg" property (if one exists).
+
clocks:
description: List of clocks used by the framebuffer.
@@ -83,20 +94,38 @@ properties:
format:
description: >
Format of the framebuffer:
+ * `a1r5g5b5` - 16-bit pixels, d[15]=a, d[14:10]=r, d[9:5]=g, d[4:0]=b
+ * `a2r10g10b10` - 32-bit pixels, d[31:30]=a, d[29:20]=r, d[19:10]=g, d[9:0]=b
* `a8b8g8r8` - 32-bit pixels, d[31:24]=a, d[23:16]=b, d[15:8]=g, d[7:0]=r
+ * `a8r8g8b8` - 32-bit pixels, d[31:24]=a, d[23:16]=r, d[15:8]=g, d[7:0]=b
* `r5g6b5` - 16-bit pixels, d[15:11]=r, d[10:5]=g, d[4:0]=b
+ * `r5g5b5a1` - 16-bit pixels, d[15:11]=r, d[10:6]=g, d[5:1]=b, d[0]=a
+ * `r8g8b8` - 24-bit pixels, d[23:16]=r, d[15:8]=g, d[7:0]=b
+ * `x1r5g5b5` - 16-bit pixels, d[14:10]=r, d[9:5]=g, d[4:0]=b
* `x2r10g10b10` - 32-bit pixels, d[29:20]=r, d[19:10]=g, d[9:0]=b
* `x8r8g8b8` - 32-bit pixels, d[23:16]=r, d[15:8]=g, d[7:0]=b
+ * `x8b8g8r8` - 32-bit pixels, d[23:16]=b, d[15:8]=g, d[7:0]=r
enum:
+ - a1r5g5b5
+ - a2r10g10b10
- a8b8g8r8
+ - a8r8g8b8
- r5g6b5
+ - r5g5b5a1
+ - r8g8b8
+ - x1r5g5b5
- x2r10g10b10
- x8r8g8b8
+ - x8b8g8r8
display:
$ref: /schemas/types.yaml#/definitions/phandle
description: Primary display hardware node
+ panel:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: Display panel node
+
allwinner,pipeline:
description: Pipeline used by the framebuffer on Allwinner SoCs
enum:
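
As a sketch of how the newly documented properties combine: memory-region, when present, takes precedence over reg, and the panel phandle supplies the physical display dimensions. All phandle labels and values below are illustrative:

    framebuffer@1d385000 {
        compatible = "simple-framebuffer";
        memory-region = <&fb_reserved>;
        panel = <&lcd_panel>;
        format = "x8r8g8b8";
        width = <800>;
        height = <480>;
        stride = <3200>;
    };
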
diff --git a/Documentation/devicetree/bindings/display/sitronix,st7735r.yaml b/Documentation/devicetree/bindings/display/sitronix,st7735r.yaml
index 0cebaaefda03..621f27148419 100644
--- a/Documentation/devicetree/bindings/display/sitronix,st7735r.yaml
+++ b/Documentation/devicetree/bindings/display/sitronix,st7735r.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/display/sitronix,st7735r.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Sitronix ST7735R Display Panels Device Tree Bindings
+title: Sitronix ST7735R Display Panels
maintainers:
- David Lechner <david@lechnology.com>
@@ -15,6 +15,7 @@ description:
allOf:
- $ref: panel/panel-common.yaml#
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
properties:
compatible:
@@ -32,15 +33,13 @@ properties:
- okaya,rh128128t
- const: sitronix,st7715r
- spi-max-frequency:
- maximum: 32000000
-
dc-gpios:
maxItems: 1
description: Display data/command selection (D/CX)
backlight: true
reg: true
+ spi-max-frequency: true
reset-gpios: true
rotation: true
@@ -48,7 +47,6 @@ required:
- compatible
- reg
- dc-gpios
- - reset-gpios
additionalProperties: false
@@ -72,6 +70,7 @@ examples:
dc-gpios = <&gpio 43 GPIO_ACTIVE_HIGH>;
reset-gpios = <&gpio 80 GPIO_ACTIVE_HIGH>;
rotation = <270>;
+ backlight = <&backlight>;
};
};
diff --git a/Documentation/devicetree/bindings/display/solomon,ssd1307fb.yaml b/Documentation/devicetree/bindings/display/solomon,ssd1307fb.yaml
index 2ed2a7d0ca2f..94bb5ef567c6 100644
--- a/Documentation/devicetree/bindings/display/solomon,ssd1307fb.yaml
+++ b/Documentation/devicetree/bindings/display/solomon,ssd1307fb.yaml
@@ -8,14 +8,24 @@ title: Solomon SSD1307 OLED Controller Framebuffer
maintainers:
- Maxime Ripard <mripard@kernel.org>
+ - Javier Martinez Canillas <javierm@redhat.com>
properties:
compatible:
- enum:
- - solomon,ssd1305fb-i2c
- - solomon,ssd1306fb-i2c
- - solomon,ssd1307fb-i2c
- - solomon,ssd1309fb-i2c
+ oneOf:
+ # Deprecated compatible strings
+ - enum:
+ - solomon,ssd1305fb-i2c
+ - solomon,ssd1306fb-i2c
+ - solomon,ssd1307fb-i2c
+ - solomon,ssd1309fb-i2c
+ deprecated: true
+ - enum:
+ - sinowealth,sh1106
+ - solomon,ssd1305
+ - solomon,ssd1306
+ - solomon,ssd1307
+ - solomon,ssd1309
reg:
maxItems: 1
@@ -26,6 +36,14 @@ properties:
reset-gpios:
maxItems: 1
+ # Only required for SPI
+ dc-gpios:
+ description:
+ GPIO connected to the controller's D/C# (Data/Command) pin,
+ that is needed for 4-wire SPI to tell the controller if the
+ data sent is for a command register or the display data RAM
+ maxItems: 1
+
vbat-supply:
description: The supply for VBAT
@@ -130,11 +148,27 @@ required:
- reg
allOf:
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: sinowealth,sh1106
+ then:
+ properties:
+ solomon,dclk-div:
+ default: 1
+ solomon,dclk-frq:
+ default: 5
+
- if:
properties:
compatible:
contains:
- const: solomon,ssd1305fb-i2c
+ enum:
+ - solomon,ssd1305-i2c
+ - solomon,ssd1305
then:
properties:
solomon,dclk-div:
@@ -146,7 +180,9 @@ allOf:
properties:
compatible:
contains:
- const: solomon,ssd1306fb-i2c
+ enum:
+ - solomon,ssd1306-i2c
+ - solomon,ssd1306
then:
properties:
solomon,dclk-div:
@@ -158,7 +194,9 @@ allOf:
properties:
compatible:
contains:
- const: solomon,ssd1307fb-i2c
+ enum:
+ - solomon,ssd1307-i2c
+ - solomon,ssd1307
then:
properties:
solomon,dclk-div:
@@ -172,7 +210,9 @@ allOf:
properties:
compatible:
contains:
- const: solomon,ssd1309fb-i2c
+ enum:
+ - solomon,ssd1309-i2c
+ - solomon,ssd1309
then:
properties:
solomon,dclk-div:
@@ -180,26 +220,53 @@ allOf:
solomon,dclk-frq:
default: 10
-additionalProperties: false
+unevaluatedProperties: false
examples:
- |
- i2c1 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
- ssd1307: oled@3c {
- compatible = "solomon,ssd1307fb-i2c";
+ ssd1307_i2c: oled@3c {
+ compatible = "solomon,ssd1307";
reg = <0x3c>;
pwms = <&pwm 4 3000>;
reset-gpios = <&gpio2 7>;
};
- ssd1306: oled@3d {
- compatible = "solomon,ssd1306fb-i2c";
- reg = <0x3c>;
+ ssd1306_i2c: oled@3d {
+ compatible = "solomon,ssd1306";
+ reg = <0x3d>;
+ pwms = <&pwm 4 3000>;
+ reset-gpios = <&gpio2 7>;
+ solomon,com-lrremap;
+ solomon,com-invdir;
+ solomon,com-offset = <32>;
+ solomon,lookup-table = /bits/ 8 <0x3f 0x3f 0x3f 0x3f>;
+ };
+ };
+ - |
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ssd1307_spi: oled@0 {
+ compatible = "solomon,ssd1307";
+ reg = <0x0>;
+ pwms = <&pwm 4 3000>;
+ reset-gpios = <&gpio2 7>;
+ dc-gpios = <&gpio2 8>;
+ spi-max-frequency = <10000000>;
+ };
+
+ ssd1306_spi: oled@1 {
+ compatible = "solomon,ssd1306";
+ reg = <0x1>;
pwms = <&pwm 4 3000>;
reset-gpios = <&gpio2 7>;
+ dc-gpios = <&gpio2 8>;
+ spi-max-frequency = <10000000>;
solomon,com-lrremap;
solomon,com-invdir;
solomon,com-offset = <32>;
diff --git a/Documentation/devicetree/bindings/display/sprd/sprd,display-subsystem.yaml b/Documentation/devicetree/bindings/display/sprd/sprd,display-subsystem.yaml
new file mode 100644
index 000000000000..b3d5e1b96fae
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/sprd/sprd,display-subsystem.yaml
@@ -0,0 +1,65 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/sprd/sprd,display-subsystem.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Unisoc DRM master device
+
+maintainers:
+ - Kevin Tang <kevin.tang@unisoc.com>
+
+description: |
+ The Unisoc DRM master device is a virtual device needed to list all
+ DPU devices or other display interface nodes that comprise the
+ graphics subsystem.
+
+ Unisoc's display pipeline has several components, described below:
+ multiple display controllers and their corresponding physical interfaces.
+ For different display scenarios, dpu0 and dpu1 may be bound to different
+ encoders.
+
+ E.g.:
+ dpu0 and dpu1 both bound to DSI for a dual mipi-dsi display;
+ dpu0 bound to DSI for the primary display, and dpu1 bound to DP for an external display.
+
+ +-----------------------------------------+
+ | |
+ | +---------+ |
+ +----+ | +----+ +---------+ |DPHY/CPHY| | +------+
+ | +----->+dpu0+--->+MIPI|DSI +--->+Combo +----->+Panel0|
+ |AXI | | +----+ +---------+ +---------+ | +------+
+ | | | ^ |
+ | | | | |
+ | | | +-----------+ |
+ | | | | |
+ |APB | | +--+-+ +-----------+ +---+ | +------+
+ | +----->+dpu1+--->+DisplayPort+--->+PHY+--------->+Panel1|
+ | | | +----+ +-----------+ +---+ | +------+
+ +----+ | |
+ +-----------------------------------------+
+
+properties:
+ compatible:
+ const: sprd,display-subsystem
+
+ ports:
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ items:
+ maxItems: 1
+ description:
+ Should contain a list of phandles pointing to the display interface
+ ports of DPU devices.
+
+required:
+ - compatible
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ display-subsystem {
+ compatible = "sprd,display-subsystem";
+ ports = <&dpu_out>;
+ };
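
For the dual-DPU scenarios described above, the ports list simply carries one phandle per DPU output port; a sketch with illustrative labels:

    display-subsystem {
        compatible = "sprd,display-subsystem";
        ports = <&dpu0_out>, <&dpu1_out>;
    };
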
diff --git a/Documentation/devicetree/bindings/display/sprd/sprd,sharkl3-dpu.yaml b/Documentation/devicetree/bindings/display/sprd/sprd,sharkl3-dpu.yaml
new file mode 100644
index 000000000000..4ebea60b8c5b
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/sprd/sprd,sharkl3-dpu.yaml
@@ -0,0 +1,77 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/sprd/sprd,sharkl3-dpu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Unisoc Sharkl3 Display Processor Unit (DPU)
+
+maintainers:
+ - Kevin Tang <kevin.tang@unisoc.com>
+
+description: |
+ DPU (Display Processor Unit) is the Display Controller for the Unisoc SoCs
+ which transfers the image data from a video memory buffer to an internal
+ LCD interface.
+
+properties:
+ compatible:
+ const: sprd,sharkl3-dpu
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ minItems: 2
+
+ clock-names:
+ items:
+ - const: clk_src_128m
+ - const: clk_src_384m
+
+ power-domains:
+ maxItems: 1
+
+ iommus:
+ maxItems: 1
+
+ port:
+ type: object
+ description:
+ A port node with endpoint definitions as defined in
+ Documentation/devicetree/bindings/media/video-interfaces.txt.
+ That port should be the output endpoint, usually output to
+ the associated DSI.
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - port
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/sprd,sc9860-clk.h>
+ dpu: dpu@63000000 {
+ compatible = "sprd,sharkl3-dpu";
+ reg = <0x63000000 0x1000>;
+ interrupts = <GIC_SPI 46 IRQ_TYPE_LEVEL_HIGH>;
+ clock-names = "clk_src_128m", "clk_src_384m";
+
+ clocks = <&pll CLK_TWPLL_128M>,
+ <&pll CLK_TWPLL_384M>;
+
+ dpu_port: port {
+ dpu_out: endpoint {
+ remote-endpoint = <&dsi_in>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/sprd/sprd,sharkl3-dsi-host.yaml b/Documentation/devicetree/bindings/display/sprd/sprd,sharkl3-dsi-host.yaml
new file mode 100644
index 000000000000..bc5594d18643
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/sprd/sprd,sharkl3-dsi-host.yaml
@@ -0,0 +1,88 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/sprd/sprd,sharkl3-dsi-host.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Unisoc MIPI DSI Controller
+
+maintainers:
+ - Kevin Tang <kevin.tang@unisoc.com>
+
+properties:
+ compatible:
+ const: sprd,sharkl3-dsi-host
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 2
+
+ clocks:
+ minItems: 1
+
+ clock-names:
+ items:
+ - const: clk_src_96m
+
+ power-domains:
+ maxItems: 1
+
+ ports:
+ type: object
+
+ properties:
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+ port@0:
+ type: object
+ description:
+ A port node with endpoint definitions as defined in
+ Documentation/devicetree/bindings/media/video-interfaces.txt.
+ That port should be the input endpoint, usually coming from
+ the associated DPU.
+
+ required:
+ - "#address-cells"
+ - "#size-cells"
+ - port@0
+
+ additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/sprd,sc9860-clk.h>
+ dsi: dsi@63100000 {
+ compatible = "sprd,sharkl3-dsi-host";
+ reg = <0x63100000 0x1000>;
+ interrupts = <GIC_SPI 48 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 49 IRQ_TYPE_LEVEL_HIGH>;
+ clock-names = "clk_src_96m";
+ clocks = <&pll CLK_TWPLL_96M>;
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ port@0 {
+ reg = <0>;
+ dsi_in: endpoint {
+ remote-endpoint = <&dpu_out>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/st,stm32-dsi.yaml b/Documentation/devicetree/bindings/display/st,stm32-dsi.yaml
index ed310bbe3afe..c488308d7be1 100644
--- a/Documentation/devicetree/bindings/display/st,stm32-dsi.yaml
+++ b/Documentation/devicetree/bindings/display/st,stm32-dsi.yaml
@@ -7,8 +7,8 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: STMicroelectronics STM32 DSI host controller
maintainers:
- - Philippe Cornu <philippe.cornu@st.com>
- - Yannick Fertre <yannick.fertre@st.com>
+ - Philippe Cornu <philippe.cornu@foss.st.com>
+ - Yannick Fertre <yannick.fertre@foss.st.com>
description:
The STMicroelectronics STM32 DSI controller uses the Synopsys DesignWare MIPI-DSI host controller.
@@ -58,9 +58,20 @@ properties:
DSI input port node, connected to the ltdc rgb output port.
port@1:
- $ref: /schemas/graph.yaml#/properties/port
- description:
- DSI output port node, connected to a panel or a bridge input port"
+ $ref: /schemas/graph.yaml#/$defs/port-base
+ unevaluatedProperties: false
+ description: |
+ DSI output port node, connected to a panel or a bridge input port.
+ properties:
+ endpoint:
+ $ref: /schemas/media/video-interfaces.yaml#
+ unevaluatedProperties: false
+ properties:
+ data-lanes:
+ minItems: 1
+ items:
+ - const: 1
+ - const: 2
required:
- "#address-cells"
@@ -110,7 +121,7 @@ examples:
};
};
- panel-dsi@0 {
+ panel@0 {
compatible = "orisetech,otm8009a";
reg = <0>;
reset-gpios = <&gpioe 4 GPIO_ACTIVE_LOW>;
@@ -125,4 +136,3 @@ examples:
};
...
-
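
With the endpoint schema added above, a board using both DSI data lanes would list them in the output endpoint; a minimal sketch (node labels illustrative):

    dsi_out: endpoint {
        remote-endpoint = <&panel_in>;
        data-lanes = <1 2>;
    };
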
diff --git a/Documentation/devicetree/bindings/display/st,stm32-ltdc.yaml b/Documentation/devicetree/bindings/display/st,stm32-ltdc.yaml
index 4ae3d75492d3..d6ea4d62a2cf 100644
--- a/Documentation/devicetree/bindings/display/st,stm32-ltdc.yaml
+++ b/Documentation/devicetree/bindings/display/st,stm32-ltdc.yaml
@@ -7,8 +7,8 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: STMicroelectronics STM32 lcd-tft display controller
maintainers:
- - Philippe Cornu <philippe.cornu@st.com>
- - Yannick Fertre <yannick.fertre@st.com>
+ - Philippe Cornu <philippe.cornu@foss.st.com>
+ - Yannick Fertre <yannick.fertre@foss.st.com>
properties:
compatible:
@@ -75,4 +75,3 @@ examples:
};
...
-
diff --git a/Documentation/devicetree/bindings/display/ste,mcde.yaml b/Documentation/devicetree/bindings/display/ste,mcde.yaml
index de0c678b3c29..564ea845c82e 100644
--- a/Documentation/devicetree/bindings/display/ste,mcde.yaml
+++ b/Documentation/devicetree/bindings/display/ste,mcde.yaml
@@ -58,8 +58,8 @@ patternProperties:
"^dsi@[0-9a-f]+$":
description: subnodes for the three DSI host adapters
type: object
- allOf:
- - $ref: dsi-controller.yaml#
+ $ref: dsi-controller.yaml#
+
properties:
compatible:
const: ste,mcde-dsi
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra114-mipi.txt b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra114-mipi.txt
deleted file mode 100644
index e4a25cedc5cf..000000000000
--- a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra114-mipi.txt
+++ /dev/null
@@ -1,41 +0,0 @@
-NVIDIA Tegra MIPI pad calibration controller
-
-Required properties:
-- compatible: "nvidia,tegra<chip>-mipi"
-- reg: Physical base address and length of the controller's registers.
-- clocks: Must contain an entry for each entry in clock-names.
- See ../clocks/clock-bindings.txt for details.
-- clock-names: Must include the following entries:
- - mipi-cal
-- #nvidia,mipi-calibrate-cells: Should be 1. The cell is a bitmask of the pads
- that need to be calibrated for a given device.
-
-User nodes need to contain an nvidia,mipi-calibrate property that has a
-phandle to refer to the calibration controller node and a bitmask of the pads
-that need to be calibrated.
-
-Example:
-
- mipi: mipi@700e3000 {
- compatible = "nvidia,tegra114-mipi";
- reg = <0x700e3000 0x100>;
- clocks = <&tegra_car TEGRA114_CLK_MIPI_CAL>;
- clock-names = "mipi-cal";
- #nvidia,mipi-calibrate-cells = <1>;
- };
-
- ...
-
- host1x@50000000 {
- ...
-
- dsi@54300000 {
- ...
-
- nvidia,mipi-calibrate = <&mipi 0x060>;
-
- ...
- };
-
- ...
- };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra114-mipi.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra114-mipi.yaml
new file mode 100644
index 000000000000..f448624dd779
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra114-mipi.yaml
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra114-mipi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra MIPI pad calibration controller
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^mipi@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - nvidia,tegra114-mipi
+ - nvidia,tegra210-mipi
+ - nvidia,tegra186-mipi
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: module clock
+
+ clock-names:
+ items:
+ - const: mipi-cal
+
+ power-domains:
+ maxItems: 1
+
+ "#nvidia,mipi-calibrate-cells":
+ description: The number of cells in a MIPI calibration specifier.
+ Should be 1. The single cell specifies a bitmask of the pads that
+ need to be calibrated for a given device.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ const: 1
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - "#nvidia,mipi-calibrate-cells"
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra114-car.h>
+
+ mipi@700e3000 {
+ compatible = "nvidia,tegra114-mipi";
+ reg = <0x700e3000 0x100>;
+ clocks = <&tegra_car TEGRA114_CLK_MIPI_CAL>;
+ clock-names = "mipi-cal";
+ #nvidia,mipi-calibrate-cells = <1>;
+ };
+
+ dsia: dsi@54300000 {
+ compatible = "nvidia,tegra114-dsi";
+ reg = <0x54300000 0x00040000>;
+ clocks = <&tegra_car TEGRA114_CLK_DSIA>,
+ <&tegra_car TEGRA114_CLK_DSIALP>,
+ <&tegra_car TEGRA114_CLK_PLL_D_OUT0>;
+ clock-names = "dsi", "lp", "parent";
+ resets = <&tegra_car 48>;
+ reset-names = "dsi";
+ nvidia,mipi-calibrate = <&mipi 0x060>; /* DSIA & DSIB pads */
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-dpaux.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-dpaux.yaml
new file mode 100644
index 000000000000..5cdbc527a560
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-dpaux.yaml
@@ -0,0 +1,151 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra124-dpaux.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra DisplayPort AUX Interface
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+description: |
+ The Tegra Display Port Auxiliary (DPAUX) pad controller manages two
+ pins which can be assigned to either the DPAUX channel or to an I2C
+ controller.
+
+ When configured for DisplayPort AUX operation, the DPAUX controller
+ can also be used to communicate with a DisplayPort device using the
+ AUX channel.
+
+properties:
+ $nodename:
+ pattern: "^dpaux@[0-9a-f]+$"
+
+ compatible:
+ oneOf:
+ - enum:
+ - nvidia,tegra124-dpaux
+ - nvidia,tegra210-dpaux
+ - nvidia,tegra186-dpaux
+ - nvidia,tegra194-dpaux
+
+ - items:
+ - const: nvidia,tegra132-dpaux
+ - const: nvidia,tegra124-dpaux
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: clock input for the DPAUX hardware
+ - description: reference clock
+
+ clock-names:
+ items:
+ - const: dpaux
+ - const: parent
+
+ resets:
+ items:
+ - description: module reset
+
+ reset-names:
+ items:
+ - const: dpaux
+
+ power-domains:
+ maxItems: 1
+
+ i2c-bus:
+ description: Subnode where I2C slave devices are listed. This
+ subnode must always be present. If there are no I2C slave
+ devices, an empty node should be added. See ../../i2c/i2c.yaml
+ for more information.
+ type: object
+
+ aux-bus:
+ $ref: /schemas/display/dp-aux-bus.yaml#
+
+ vdd-supply:
+ description: phandle of a supply that powers the DisplayPort
+ link
+
+patternProperties:
+ "^pinmux-[a-z0-9]+$":
+ description:
+ Since only three configurations are possible, only three child
+ nodes are needed to describe the pin mux'ing options for the
+ DPAUX pads. Furthermore, given that the pad functions are only
+ applicable to a single set of pads, the child nodes only need
+ to describe the pad group the functions are being applied to
+ rather than the individual pads.
+ type: object
+ properties:
+ groups:
+ const: dpaux-io
+
+ function:
+ enum:
+ - aux
+ - i2c
+ - off
+
+ additionalProperties: false
+
+ required:
+ - groups
+ - function
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra210-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ dpaux: dpaux@545c0000 {
+ compatible = "nvidia,tegra210-dpaux";
+ reg = <0x545c0000 0x00040000>;
+ interrupts = <GIC_SPI 159 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car TEGRA210_CLK_DPAUX>,
+ <&tegra_car TEGRA210_CLK_PLL_DP>;
+ clock-names = "dpaux", "parent";
+ resets = <&tegra_car 181>;
+ reset-names = "dpaux";
+ power-domains = <&pd_sor>;
+
+ state_dpaux_aux: pinmux-aux {
+ groups = "dpaux-io";
+ function = "aux";
+ };
+
+ state_dpaux_i2c: pinmux-i2c {
+ groups = "dpaux-io";
+ function = "i2c";
+ };
+
+ state_dpaux_off: pinmux-off {
+ groups = "dpaux-io";
+ function = "off";
+ };
+
+ i2c-bus {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-sor.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-sor.yaml
new file mode 100644
index 000000000000..70f0e45c71d6
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-sor.yaml
@@ -0,0 +1,197 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra124-sor.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra SOR Output Encoder
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+description: |
+ The Serial Output Resource (SOR) can be used to drive HDMI, LVDS, eDP
+ and DP outputs.
+
+properties:
+ $nodename:
+ pattern: "^sor@[0-9a-f]+$"
+
+ compatible:
+ oneOf:
+ - enum:
+ - nvidia,tegra124-sor
+ - nvidia,tegra210-sor
+ - nvidia,tegra210-sor1
+ - nvidia,tegra186-sor
+ - nvidia,tegra186-sor1
+ - nvidia,tegra194-sor
+
+ - items:
+ - const: nvidia,tegra132-sor
+ - const: nvidia,tegra124-sor
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ minItems: 5
+ maxItems: 6
+
+ clock-names:
+ minItems: 5
+ maxItems: 6
+
+ resets:
+ items:
+ - description: module reset
+
+ reset-names:
+ items:
+ - const: sor
+
+ power-domains:
+ maxItems: 1
+
+ avdd-io-hdmi-dp-supply:
+ description: I/O supply for HDMI/DP
+
+ vdd-hdmi-dp-pll-supply:
+ description: PLL supply for HDMI/DP
+
+ hdmi-supply:
+ description: +5.0V HDMI connector supply, required for HDMI
+
+ # Tegra186 and later
+ nvidia,interface:
+ description: index of the SOR interface
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+ nvidia,ddc-i2c-bus:
+ description: phandle of an I2C controller used for DDC EDID
+ probing
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ nvidia,hpd-gpio:
+ description: specifies a GPIO used for hotplug detection
+ maxItems: 1
+
+ nvidia,edid:
+ description: supplies a binary EDID blob
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+
+ nvidia,panel:
+ description: phandle of a display panel, required for eDP
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ nvidia,xbar-cfg:
+ description: 5 cells containing the crossbar configuration.
+ Each lane of the SOR, identified by the cell's index, is
+ mapped via the crossbar to the pad specified by the cell's
+ value.
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+
+ # optional when driving an eDP output
+ nvidia,dpaux:
+ description: phandle to a DisplayPort AUX interface
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - nvidia,tegra186-sor
+ - nvidia,tegra194-sor
+ then:
+ properties:
+ clocks:
+ items:
+ - description: clock input for the SOR hardware
+ - description: SOR output clock
+ - description: input for the pixel clock
+ - description: reference clock for the SOR clock
+ - description: safe reference clock for the SOR clock
+ during power up
+ - description: SOR pad output clock
+
+ clock-names:
+ items:
+ - const: sor
+ - enum:
+ - source # deprecated
+ - out
+ - const: parent
+ - const: dp
+ - const: safe
+ - const: pad
+ else:
+ properties:
+ clocks:
+ items:
+ - description: clock input for the SOR hardware
+ - description: SOR output clock
+ - description: input for the pixel clock
+ - description: reference clock for the SOR clock
+ - description: safe reference clock for the SOR clock
+ during power up
+
+ clock-names:
+ items:
+ - const: sor
+ - enum:
+ - source # deprecated
+ - out
+ - const: parent
+ - const: dp
+ - const: safe
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+ - avdd-io-hdmi-dp-supply
+ - vdd-hdmi-dp-pll-supply
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra210-car.h>
+ #include <dt-bindings/gpio/tegra-gpio.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ sor0: sor@54540000 {
+ compatible = "nvidia,tegra210-sor";
+ reg = <0x54540000 0x00040000>;
+ interrupts = <GIC_SPI 76 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car TEGRA210_CLK_SOR0>,
+ <&tegra_car TEGRA210_CLK_SOR0_OUT>,
+ <&tegra_car TEGRA210_CLK_PLL_D_OUT0>,
+ <&tegra_car TEGRA210_CLK_PLL_DP>,
+ <&tegra_car TEGRA210_CLK_SOR_SAFE>;
+ clock-names = "sor", "out", "parent", "dp", "safe";
+ resets = <&tegra_car 182>;
+ reset-names = "sor";
+ pinctrl-0 = <&state_dpaux_aux>;
+ pinctrl-1 = <&state_dpaux_i2c>;
+ pinctrl-2 = <&state_dpaux_off>;
+ pinctrl-names = "aux", "i2c", "off";
+ power-domains = <&pd_sor>;
+
+ avdd-io-hdmi-dp-supply = <&avdd_1v05>;
+ vdd-hdmi-dp-pll-supply = <&vdd_1v8>;
+ hdmi-supply = <&vdd_hdmi>;
+
+ nvidia,ddc-i2c-bus = <&hdmi_ddc>;
+ nvidia,hpd-gpio = <&gpio TEGRA_GPIO(CC, 1) GPIO_ACTIVE_LOW>;
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-vic.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-vic.yaml
new file mode 100644
index 000000000000..7200095ef19e
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra124-vic.yaml
@@ -0,0 +1,72 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra124-vic.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra Video Image Composer
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^vic@[0-9a-f]+$"
+
+ compatible:
+ oneOf:
+ - enum:
+ - nvidia,tegra124-vic
+ - nvidia,tegra210-vic
+ - nvidia,tegra186-vic
+ - nvidia,tegra194-vic
+ - nvidia,tegra234-vic
+
+ - items:
+ - const: nvidia,tegra132-vic
+ - const: nvidia,tegra124-vic
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: clock input for the VIC hardware
+
+ clock-names:
+ items:
+ - const: vic
+
+ resets:
+ items:
+ - description: module reset
+
+ reset-names:
+ items:
+ - const: vic
+
+ power-domains:
+ maxItems: 1
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ description: Description of the interconnect paths for the VIC;
+ see ../interconnect/interconnect.txt for details.
+ items:
+ - description: memory read client for VIC
+ - description: memory write client for VIC
+
+ interconnect-names:
+ items:
+ - const: dma-mem # read
+ - const: write
+
+ dma-coherent: true
+
+additionalProperties: false
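
This binding ships without an examples section; for orientation, a Tegra210 VIC node following the schema could look roughly like this. The register address, interrupt number, clock/reset indices and phandle labels are illustrative, not taken from a real DTS:

    vic@54340000 {
        compatible = "nvidia,tegra210-vic";
        reg = <0x54340000 0x00040000>;
        interrupts = <GIC_SPI 206 IRQ_TYPE_LEVEL_HIGH>;
        clocks = <&tegra_car TEGRA210_CLK_VIC03>;
        clock-names = "vic";
        resets = <&tegra_car 178>;
        reset-names = "vic";
        power-domains = <&pd_vic>;
    };
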
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-dc.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-dc.yaml
new file mode 100644
index 000000000000..ce4589466a18
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-dc.yaml
@@ -0,0 +1,85 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra186-dc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra186 (and later) Display Controller
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^display@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - nvidia,tegra186-dc
+ - nvidia,tegra194-dc
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: display controller pixel clock
+
+ clock-names:
+ items:
+ - const: dc
+
+ resets:
+ items:
+ - description: display controller reset
+
+ reset-names:
+ items:
+ - const: dc
+
+ power-domains:
+ maxItems: 1
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ description: Description of the interconnect paths for the
+ display controller; see ../interconnect/interconnect.txt
+ for details.
+
+ interconnect-names:
+ items:
+ - const: dma-mem # read-0
+ - const: read-1
+
+ nvidia,outputs:
+ description: A list of phandles of outputs that this display
+ controller can drive.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+
+ nvidia,head:
+ description: The number of the display controller head. This
+ is used to set up the various types of output to receive
+ video data from the given head.
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+ - power-domains
+ - nvidia,outputs
+ - nvidia,head
+
+# see nvidia,tegra186-display.yaml for examples
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-display.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-display.yaml
new file mode 100644
index 000000000000..117c371ce24b
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-display.yaml
@@ -0,0 +1,308 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra186-display.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra186 (and later) Display Hub
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^display-hub@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - nvidia,tegra186-display
+ - nvidia,tegra194-display
+
+ '#address-cells':
+ enum: [ 1, 2 ]
+
+ '#size-cells':
+ enum: [ 1, 2 ]
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ minItems: 2
+ maxItems: 3
+
+ clock-names:
+ minItems: 2
+ maxItems: 3
+
+ resets:
+ items:
+ - description: display hub reset
+ - description: window group 0 reset
+ - description: window group 1 reset
+ - description: window group 2 reset
+ - description: window group 3 reset
+ - description: window group 4 reset
+ - description: window group 5 reset
+
+ reset-names:
+ items:
+ - const: misc
+ - const: wgrp0
+ - const: wgrp1
+ - const: wgrp2
+ - const: wgrp3
+ - const: wgrp4
+ - const: wgrp5
+
+ power-domains:
+ maxItems: 1
+
+ ranges:
+ maxItems: 1
+
+patternProperties:
+ "^display@[0-9a-f]+$":
+ type: object
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: nvidia,tegra186-display
+ then:
+ properties:
+ clocks:
+ items:
+ - description: display core clock
+ - description: display stream compression clock
+ - description: display hub clock
+
+ clock-names:
+ items:
+ - const: disp
+ - const: dsc
+ - const: hub
+ else:
+ properties:
+ clocks:
+ items:
+ - description: display core clock
+ - description: display hub clock
+
+ clock-names:
+ items:
+ - const: disp
+ - const: hub
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+ - power-domains
+ - "#address-cells"
+ - "#size-cells"
+ - ranges
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra186-clock.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/memory/tegra186-mc.h>
+ #include <dt-bindings/power/tegra186-powergate.h>
+ #include <dt-bindings/reset/tegra186-reset.h>
+
+ display-hub@15200000 {
+ compatible = "nvidia,tegra186-display";
+ reg = <0x15200000 0x00040000>;
+ resets = <&bpmp TEGRA186_RESET_NVDISPLAY0_MISC>,
+ <&bpmp TEGRA186_RESET_NVDISPLAY0_WGRP0>,
+ <&bpmp TEGRA186_RESET_NVDISPLAY0_WGRP1>,
+ <&bpmp TEGRA186_RESET_NVDISPLAY0_WGRP2>,
+ <&bpmp TEGRA186_RESET_NVDISPLAY0_WGRP3>,
+ <&bpmp TEGRA186_RESET_NVDISPLAY0_WGRP4>,
+ <&bpmp TEGRA186_RESET_NVDISPLAY0_WGRP5>;
+ reset-names = "misc", "wgrp0", "wgrp1", "wgrp2",
+ "wgrp3", "wgrp4", "wgrp5";
+ clocks = <&bpmp TEGRA186_CLK_NVDISPLAY_DISP>,
+ <&bpmp TEGRA186_CLK_NVDISPLAY_DSC>,
+ <&bpmp TEGRA186_CLK_NVDISPLAYHUB>;
+ clock-names = "disp", "dsc", "hub";
+
+ power-domains = <&bpmp TEGRA186_POWER_DOMAIN_DISP>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ ranges = <0x15200000 0x15200000 0x40000>;
+
+ display@15200000 {
+ compatible = "nvidia,tegra186-dc";
+ reg = <0x15200000 0x10000>;
+ interrupts = <GIC_SPI 153 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&bpmp TEGRA186_CLK_NVDISPLAY_P0>;
+ clock-names = "dc";
+ resets = <&bpmp TEGRA186_RESET_NVDISPLAY0_HEAD0>;
+ reset-names = "dc";
+
+ power-domains = <&bpmp TEGRA186_POWER_DOMAIN_DISP>;
+ interconnects = <&mc TEGRA186_MEMORY_CLIENT_NVDISPLAYR &emc>,
+ <&mc TEGRA186_MEMORY_CLIENT_NVDISPLAYR1 &emc>;
+ interconnect-names = "dma-mem", "read-1";
+ iommus = <&smmu TEGRA186_SID_NVDISPLAY>;
+
+ nvidia,outputs = <&dsia &dsib &sor0 &sor1>;
+ nvidia,head = <0>;
+ };
+
+ display@15210000 {
+ compatible = "nvidia,tegra186-dc";
+ reg = <0x15210000 0x10000>;
+ interrupts = <GIC_SPI 154 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&bpmp TEGRA186_CLK_NVDISPLAY_P1>;
+ clock-names = "dc";
+ resets = <&bpmp TEGRA186_RESET_NVDISPLAY0_HEAD1>;
+ reset-names = "dc";
+
+ power-domains = <&bpmp TEGRA186_POWER_DOMAIN_DISPB>;
+ interconnects = <&mc TEGRA186_MEMORY_CLIENT_NVDISPLAYR &emc>,
+ <&mc TEGRA186_MEMORY_CLIENT_NVDISPLAYR1 &emc>;
+ interconnect-names = "dma-mem", "read-1";
+ iommus = <&smmu TEGRA186_SID_NVDISPLAY>;
+
+ nvidia,outputs = <&dsia &dsib &sor0 &sor1>;
+ nvidia,head = <1>;
+ };
+
+ display@15220000 {
+ compatible = "nvidia,tegra186-dc";
+ reg = <0x15220000 0x10000>;
+ interrupts = <GIC_SPI 155 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&bpmp TEGRA186_CLK_NVDISPLAY_P2>;
+ clock-names = "dc";
+ resets = <&bpmp TEGRA186_RESET_NVDISPLAY0_HEAD2>;
+ reset-names = "dc";
+
+ power-domains = <&bpmp TEGRA186_POWER_DOMAIN_DISPC>;
+ interconnects = <&mc TEGRA186_MEMORY_CLIENT_NVDISPLAYR &emc>,
+ <&mc TEGRA186_MEMORY_CLIENT_NVDISPLAYR1 &emc>;
+ interconnect-names = "dma-mem", "read-1";
+ iommus = <&smmu TEGRA186_SID_NVDISPLAY>;
+
+ nvidia,outputs = <&sor0 &sor1>;
+ nvidia,head = <2>;
+ };
+ };
+
+ - |
+ #include <dt-bindings/clock/tegra194-clock.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/memory/tegra194-mc.h>
+ #include <dt-bindings/power/tegra194-powergate.h>
+ #include <dt-bindings/reset/tegra194-reset.h>
+
+ display-hub@15200000 {
+ compatible = "nvidia,tegra194-display";
+ reg = <0x15200000 0x00040000>;
+ resets = <&bpmp TEGRA194_RESET_NVDISPLAY0_MISC>,
+ <&bpmp TEGRA194_RESET_NVDISPLAY0_WGRP0>,
+ <&bpmp TEGRA194_RESET_NVDISPLAY0_WGRP1>,
+ <&bpmp TEGRA194_RESET_NVDISPLAY0_WGRP2>,
+ <&bpmp TEGRA194_RESET_NVDISPLAY0_WGRP3>,
+ <&bpmp TEGRA194_RESET_NVDISPLAY0_WGRP4>,
+ <&bpmp TEGRA194_RESET_NVDISPLAY0_WGRP5>;
+ reset-names = "misc", "wgrp0", "wgrp1", "wgrp2",
+ "wgrp3", "wgrp4", "wgrp5";
+ clocks = <&bpmp TEGRA194_CLK_NVDISPLAY_DISP>,
+ <&bpmp TEGRA194_CLK_NVDISPLAYHUB>;
+ clock-names = "disp", "hub";
+
+ power-domains = <&bpmp TEGRA194_POWER_DOMAIN_DISP>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ ranges = <0x15200000 0x15200000 0x40000>;
+
+ display@15200000 {
+ compatible = "nvidia,tegra194-dc";
+ reg = <0x15200000 0x10000>;
+ interrupts = <GIC_SPI 153 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&bpmp TEGRA194_CLK_NVDISPLAY_P0>;
+ clock-names = "dc";
+ resets = <&bpmp TEGRA194_RESET_NVDISPLAY0_HEAD0>;
+ reset-names = "dc";
+
+ power-domains = <&bpmp TEGRA194_POWER_DOMAIN_DISP>;
+ interconnects = <&mc TEGRA194_MEMORY_CLIENT_NVDISPLAYR &emc>,
+ <&mc TEGRA194_MEMORY_CLIENT_NVDISPLAYR1 &emc>;
+ interconnect-names = "dma-mem", "read-1";
+
+ nvidia,outputs = <&sor0 &sor1 &sor2 &sor3>;
+ nvidia,head = <0>;
+ };
+
+ display@15210000 {
+ compatible = "nvidia,tegra194-dc";
+ reg = <0x15210000 0x10000>;
+ interrupts = <GIC_SPI 154 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&bpmp TEGRA194_CLK_NVDISPLAY_P1>;
+ clock-names = "dc";
+ resets = <&bpmp TEGRA194_RESET_NVDISPLAY0_HEAD1>;
+ reset-names = "dc";
+
+ power-domains = <&bpmp TEGRA194_POWER_DOMAIN_DISPB>;
+ interconnects = <&mc TEGRA194_MEMORY_CLIENT_NVDISPLAYR &emc>,
+ <&mc TEGRA194_MEMORY_CLIENT_NVDISPLAYR1 &emc>;
+ interconnect-names = "dma-mem", "read-1";
+
+ nvidia,outputs = <&sor0 &sor1 &sor2 &sor3>;
+ nvidia,head = <1>;
+ };
+
+ display@15220000 {
+ compatible = "nvidia,tegra194-dc";
+ reg = <0x15220000 0x10000>;
+ interrupts = <GIC_SPI 155 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&bpmp TEGRA194_CLK_NVDISPLAY_P2>;
+ clock-names = "dc";
+ resets = <&bpmp TEGRA194_RESET_NVDISPLAY0_HEAD2>;
+ reset-names = "dc";
+
+ power-domains = <&bpmp TEGRA194_POWER_DOMAIN_DISPC>;
+ interconnects = <&mc TEGRA194_MEMORY_CLIENT_NVDISPLAYR &emc>,
+ <&mc TEGRA194_MEMORY_CLIENT_NVDISPLAYR1 &emc>;
+ interconnect-names = "dma-mem", "read-1";
+
+ nvidia,outputs = <&sor0 &sor1 &sor2 &sor3>;
+ nvidia,head = <2>;
+ };
+
+ display@15230000 {
+ compatible = "nvidia,tegra194-dc";
+ reg = <0x15230000 0x10000>;
+ interrupts = <GIC_SPI 242 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&bpmp TEGRA194_CLK_NVDISPLAY_P3>;
+ clock-names = "dc";
+ resets = <&bpmp TEGRA194_RESET_NVDISPLAY0_HEAD3>;
+ reset-names = "dc";
+
+ power-domains = <&bpmp TEGRA194_POWER_DOMAIN_DISPC>;
+ interconnects = <&mc TEGRA194_MEMORY_CLIENT_NVDISPLAYR &emc>,
+ <&mc TEGRA194_MEMORY_CLIENT_NVDISPLAYR1 &emc>;
+ interconnect-names = "dma-mem", "read-1";
+
+ nvidia,outputs = <&sor0 &sor1 &sor2 &sor3>;
+ nvidia,head = <3>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-dsi-padctl.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-dsi-padctl.yaml
new file mode 100644
index 000000000000..da75b71e8ece
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra186-dsi-padctl.yaml
@@ -0,0 +1,45 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra186-dsi-padctl.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra MIPI DSI pad controller
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^padctl@[0-9a-f]+$"
+
+ compatible:
+ const: nvidia,tegra186-dsi-padctl
+
+ reg:
+ maxItems: 1
+
+ resets:
+ items:
+ - description: module reset
+
+ reset-names:
+ items:
+ - const: dsi
+
+allOf:
+ - $ref: /schemas/reset/reset.yaml
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/reset/tegra186-reset.h>
+
+ padctl@15880000 {
+ compatible = "nvidia,tegra186-dsi-padctl";
+ reg = <0x15880000 0x10000>;
+ resets = <&bpmp TEGRA186_RESET_DSI>;
+ reset-names = "dsi";
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-dc.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-dc.yaml
new file mode 100644
index 000000000000..69be95afd562
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-dc.yaml
@@ -0,0 +1,182 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra20-dc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra Display Controller
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^dc@[0-9a-f]+$"
+
+ compatible:
+ oneOf:
+ - enum:
+ - nvidia,tegra20-dc
+ - nvidia,tegra30-dc
+ - nvidia,tegra114-dc
+ - nvidia,tegra124-dc
+ - nvidia,tegra210-dc
+
+ - items:
+ - const: nvidia,tegra132-dc
+ - const: nvidia,tegra124-dc
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ items:
+ - description: display controller pixel clock
+ - description: parent clock # optional
+
+ clock-names:
+ minItems: 1
+ items:
+ - const: dc
+ - const: parent # optional
+
+ resets:
+ items:
+ - description: module reset
+
+ reset-names:
+ items:
+ - const: dc
+
+ interconnect-names: true
+ interconnects: true
+
+ iommus:
+ maxItems: 1
+
+ operating-points-v2: true
+
+ power-domains:
+ items:
+ - description: phandle to the core power domain
+
+ memory-region: true
+
+ nvidia,head:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: The number of the display controller head. This is used to set up the
+ various types of output to receive video data from the given head.
+
+ nvidia,outputs:
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ description: A list of phandles of outputs that this display controller can drive.
+
+ rgb:
+ type: object
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - nvidia,tegra20-dc
+ - nvidia,tegra30-dc
+ - nvidia,tegra114-dc
+ then:
+ properties:
+ interconnects:
+ items:
+ - description: window A memory client
+ - description: window B memory client
+ - description: window B memory client (vertical filter)
+ - description: window C memory client
+ - description: cursor memory client
+
+ interconnect-names:
+ items:
+ - const: wina
+ - const: winb
+ - const: winb-vfilter
+ - const: winc
+ - const: cursor
+
+ rgb:
+ description: Each display controller node has a child node, named "rgb", that represents
+ the RGB output associated with the controller.
+ type: object
+ properties:
+ nvidia,ddc-i2c-bus:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: phandle of an I2C controller used for DDC EDID probing
+
+ nvidia,hpd-gpio:
+ description: specifies a GPIO used for hotplug detection
+ maxItems: 1
+
+ nvidia,edid:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ description: supplies a binary EDID blob
+
+ nvidia,panel:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: phandle of a display panel
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - nvidia,tegra124-dc
+ then:
+ properties:
+ interconnects:
+ minItems: 4
+ items:
+ - description: window A memory client
+ - description: window B memory client
+ - description: window C memory client
+ - description: cursor memory client
+ - description: window D memory client
+ - description: window T memory client
+
+ interconnect-names:
+ minItems: 4
+ items:
+ - const: wina
+ - const: winb
+ - const: winc
+ - const: cursor
+ - const: wind
+ - const: wint
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra20-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ dc@54200000 {
+ compatible = "nvidia,tegra20-dc";
+ reg = <0x54200000 0x00040000>;
+ interrupts = <GIC_SPI 73 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car TEGRA20_CLK_DISP1>;
+ clock-names = "dc";
+ resets = <&tegra_car 27>;
+ reset-names = "dc";
+ };
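+
+  # A hedged sketch of the second head, extending the example above with the
+  # memory clients, head number and RGB output described by this schema; the
+  # panel phandle is hypothetical, while clock, reset, interrupt and memory
+  # client IDs follow the Tegra20 DISP2 instance from the old text binding.
+  - |
+    #include <dt-bindings/clock/tegra20-car.h>
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+    #include <dt-bindings/memory/tegra20-mc.h>
+
+    dc@54240000 {
+        compatible = "nvidia,tegra20-dc";
+        reg = <0x54240000 0x00040000>;
+        interrupts = <GIC_SPI 74 IRQ_TYPE_LEVEL_HIGH>;
+        clocks = <&tegra_car TEGRA20_CLK_DISP2>,
+                 <&tegra_car TEGRA20_CLK_PLL_P>;
+        clock-names = "dc", "parent";
+        resets = <&tegra_car 26>;
+        reset-names = "dc";
+
+        interconnects = <&mc TEGRA20_MC_DISPLAY0AB &emc>,
+                        <&mc TEGRA20_MC_DISPLAY0BB &emc>,
+                        <&mc TEGRA20_MC_DISPLAY1BB &emc>,
+                        <&mc TEGRA20_MC_DISPLAY0CB &emc>,
+                        <&mc TEGRA20_MC_DISPLAYHCB &emc>;
+        interconnect-names = "wina", "winb", "winb-vfilter", "winc", "cursor";
+
+        nvidia,head = <1>;
+
+        rgb {
+            nvidia,panel = <&panel>;
+        };
+    };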
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-dsi.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-dsi.yaml
new file mode 100644
index 000000000000..59e1dc0813e7
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-dsi.yaml
@@ -0,0 +1,158 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra20-dsi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra Display Serial Interface
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - nvidia,tegra20-dsi
+ - nvidia,tegra30-dsi
+ - nvidia,tegra114-dsi
+ - nvidia,tegra124-dsi
+ - nvidia,tegra210-dsi
+ - nvidia,tegra186-dsi
+
+ - items:
+ - const: nvidia,tegra132-dsi
+ - const: nvidia,tegra124-dsi
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ minItems: 2
+ maxItems: 3
+
+ clock-names:
+ minItems: 2
+ maxItems: 3
+
+ resets:
+ items:
+ - description: module reset
+
+ reset-names:
+ items:
+ - const: dsi
+
+ operating-points-v2: true
+
+ power-domains:
+ maxItems: 1
+
+ avdd-dsi-csi-supply:
+ description: phandle of a supply that powers the DSI controller
+
+ nvidia,mipi-calibrate:
+ description: Should contain a phandle and a specifier specifying
+ which pads are used by this DSI output and need to be
+ calibrated. See nvidia,tegra114-mipi.yaml for details.
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+
+ nvidia,ddc-i2c-bus:
+ description: phandle of an I2C controller used for DDC EDID
+ probing
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ nvidia,hpd-gpio:
+ description: specifies a GPIO used for hotplug detection
+ maxItems: 1
+
+ nvidia,edid:
+ description: supplies a binary EDID blob
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+
+ nvidia,panel:
+ description: phandle of a display panel
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ nvidia,ganged-mode:
+ description: contains a phandle to a second DSI controller to
+ gang up with in order to support up to 8 data lanes
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+allOf:
+ - $ref: ../dsi-controller.yaml#
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - nvidia,tegra20-dsi
+ - nvidia,tegra30-dsi
+ then:
+ properties:
+ clocks:
+ items:
+ - description: DSI module clock
+ - description: input for the pixel clock
+
+ clock-names:
+ items:
+ - const: dsi
+ - const: parent
+ else:
+ properties:
+ clocks:
+ items:
+ - description: DSI module clock
+ - description: low-power module clock
+ - description: input for the pixel clock
+
+ clock-names:
+ items:
+ - const: dsi
+ - const: lp
+ - const: parent
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: nvidia,tegra186-dsi
+ then:
+ required:
+ - interrupts
+
+unevaluatedProperties: false
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra186-clock.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/power/tegra186-powergate.h>
+ #include <dt-bindings/reset/tegra186-reset.h>
+
+ dsi@15300000 {
+ compatible = "nvidia,tegra186-dsi";
+ reg = <0x15300000 0x10000>;
+ interrupts = <GIC_SPI 20 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&bpmp TEGRA186_CLK_DSI>,
+ <&bpmp TEGRA186_CLK_DSIA_LP>,
+ <&bpmp TEGRA186_CLK_PLLD>;
+ clock-names = "dsi", "lp", "parent";
+ resets = <&bpmp TEGRA186_RESET_DSI>;
+ reset-names = "dsi";
+
+ power-domains = <&bpmp TEGRA186_POWER_DOMAIN_DISP>;
+ };
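+
+  # A hedged sketch of ganged-mode operation on Tegra114, where two DSI
+  # controllers drive one panel over up to 8 data lanes; the supply and
+  # panel phandles are hypothetical, while addresses, clock and reset IDs
+  # loosely follow tegra114.dtsi.
+  - |
+    #include <dt-bindings/clock/tegra114-car.h>
+
+    dsi@54300000 {
+        compatible = "nvidia,tegra114-dsi";
+        reg = <0x54300000 0x00040000>;
+        clocks = <&tegra_car TEGRA114_CLK_DSIA>,
+                 <&tegra_car TEGRA114_CLK_DSIALP>,
+                 <&tegra_car TEGRA114_CLK_PLL_D_OUT0>;
+        clock-names = "dsi", "lp", "parent";
+        resets = <&tegra_car 48>;
+        reset-names = "dsi";
+
+        avdd-dsi-csi-supply = <&avdd_dsi_csi>;
+        nvidia,ganged-mode = <&dsib>;
+        nvidia,panel = <&panel>;
+    };
+
+    dsib: dsi@54400000 {
+        compatible = "nvidia,tegra114-dsi";
+        reg = <0x54400000 0x00040000>;
+        clocks = <&tegra_car TEGRA114_CLK_DSIB>,
+                 <&tegra_car TEGRA114_CLK_DSIBLP>,
+                 <&tegra_car TEGRA114_CLK_PLL_D_OUT0>;
+        clock-names = "dsi", "lp", "parent";
+        resets = <&tegra_car 82>;
+        reset-names = "dsi";
+
+        avdd-dsi-csi-supply = <&avdd_dsi_csi>;
+    };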
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-epp.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-epp.yaml
new file mode 100644
index 000000000000..3c095a5491fe
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-epp.yaml
@@ -0,0 +1,69 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra20-epp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra Encoder Pre-Processor
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^epp@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - nvidia,tegra20-epp
+ - nvidia,tegra30-epp
+ - nvidia,tegra114-epp
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ resets:
+ items:
+ - description: module reset
+
+ reset-names:
+ items:
+ - const: epp
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ maxItems: 4
+
+ interconnect-names:
+ maxItems: 4
+
+ operating-points-v2: true
+
+ power-domains:
+ items:
+ - description: phandle to the core power domain
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra20-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ epp@540c0000 {
+ compatible = "nvidia,tegra20-epp";
+ reg = <0x540c0000 0x00040000>;
+ interrupts = <GIC_SPI 70 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car TEGRA20_CLK_EPP>;
+ resets = <&tegra_car 19>;
+ reset-names = "epp";
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-gr2d.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-gr2d.yaml
new file mode 100644
index 000000000000..1026b0bc3dc8
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-gr2d.yaml
@@ -0,0 +1,73 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra20-gr2d.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA 2D graphics engine
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^gr2d@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - nvidia,tegra20-gr2d
+ - nvidia,tegra30-gr2d
+ - nvidia,tegra114-gr2d
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: module clock
+
+ resets:
+ items:
+ - description: module reset
+ - description: memory client hotflush reset
+
+ reset-names:
+ items:
+ - const: 2d
+ - const: mc
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ maxItems: 4
+
+ interconnect-names:
+ maxItems: 4
+
+ operating-points-v2: true
+
+ power-domains:
+ items:
+ - description: phandle to the HEG or core power domain
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra20-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/memory/tegra20-mc.h>
+
+ gr2d@54140000 {
+ compatible = "nvidia,tegra20-gr2d";
+ reg = <0x54140000 0x00040000>;
+ interrupts = <GIC_SPI 72 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car TEGRA20_CLK_GR2D>;
+ resets = <&tegra_car 21>, <&mc TEGRA20_MC_RESET_2D>;
+ reset-names = "2d", "mc";
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-gr3d.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-gr3d.yaml
new file mode 100644
index 000000000000..59a52e732ca3
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-gr3d.yaml
@@ -0,0 +1,213 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra20-gr3d.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA 3D graphics engine
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^gr3d@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - nvidia,tegra20-gr3d
+ - nvidia,tegra30-gr3d
+ - nvidia,tegra114-gr3d
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 2
+
+ clock-names:
+ minItems: 1
+ maxItems: 2
+
+ resets:
+ minItems: 2
+ maxItems: 4
+
+ reset-names:
+ minItems: 2
+ maxItems: 4
+
+ iommus:
+ minItems: 1
+ maxItems: 2
+
+ interconnects:
+ minItems: 4
+ maxItems: 10
+
+ interconnect-names:
+ minItems: 4
+ maxItems: 10
+
+ operating-points-v2: true
+
+ power-domains:
+ minItems: 1
+ maxItems: 2
+
+ power-domain-names:
+ maxItems: 2
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: nvidia,tegra20-gr3d
+ then:
+ properties:
+ clocks:
+ items:
+ - description: module clock
+
+ clock-names:
+ items:
+ - const: 3d
+
+ resets:
+ items:
+ - description: module reset
+ - description: memory client hotflush reset
+
+ reset-names:
+ items:
+ - const: 3d
+ - const: mc
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ minItems: 4
+ maxItems: 4
+
+ interconnect-names:
+ minItems: 4
+ maxItems: 4
+
+ power-domains:
+ items:
+ - description: phandle to the TD power domain
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: nvidia,tegra30-gr3d
+ then:
+ properties:
+ clocks:
+ items:
+ - description: primary module clock
+ - description: secondary module clock
+
+ clock-names:
+ items:
+ - const: 3d
+ - const: 3d2
+
+ resets:
+ items:
+ - description: primary module reset
+ - description: secondary module reset
+ - description: primary memory client hotflush reset
+ - description: secondary memory client hotflush reset
+
+ reset-names:
+ items:
+ - const: 3d
+ - const: 3d2
+ - const: mc
+ - const: mc2
+
+ iommus:
+ minItems: 2
+ maxItems: 2
+
+ interconnects:
+ minItems: 8
+ maxItems: 8
+
+ interconnect-names:
+ minItems: 8
+ maxItems: 8
+
+ power-domains:
+ items:
+ - description: phandle to the TD power domain
+ - description: phandle to the TD2 power domain
+
+ power-domain-names:
+ items:
+ - const: 3d0
+ - const: 3d1
+
+ dependencies:
+ power-domains: [ power-domain-names ]
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: nvidia,tegra114-gr3d
+ then:
+ properties:
+ clocks:
+ items:
+ - description: module clock
+
+ clock-names:
+ items:
+ - const: 3d
+
+ resets:
+ items:
+ - description: module reset
+ - description: memory client hotflush reset
+
+ reset-names:
+ items:
+ - const: 3d
+ - const: mc
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ minItems: 10
+ maxItems: 10
+
+ interconnect-names:
+ minItems: 10
+ maxItems: 10
+
+ power-domains:
+ items:
+ - description: phandle to the TD power domain
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra20-car.h>
+ #include <dt-bindings/memory/tegra20-mc.h>
+
+ gr3d@54180000 {
+ compatible = "nvidia,tegra20-gr3d";
+ reg = <0x54180000 0x00040000>;
+ clocks = <&tegra_car TEGRA20_CLK_GR3D>;
+ resets = <&tegra_car 24>, <&mc TEGRA20_MC_RESET_3D>;
+ reset-names = "3d", "mc";
+ };
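+
+  # A hedged sketch of the Tegra30 variant with its second 3D unit; the
+  # power-domain phandles are hypothetical, while clock, CAR reset and MC
+  # hotflush reset IDs loosely follow tegra30.dtsi.
+  - |
+    #include <dt-bindings/clock/tegra30-car.h>
+    #include <dt-bindings/memory/tegra30-mc.h>
+
+    gr3d@54180000 {
+        compatible = "nvidia,tegra30-gr3d";
+        reg = <0x54180000 0x00040000>;
+        clocks = <&tegra_car TEGRA30_CLK_GR3D>,
+                 <&tegra_car TEGRA30_CLK_GR3D2>;
+        clock-names = "3d", "3d2";
+        resets = <&tegra_car 24>, <&tegra_car 98>,
+                 <&mc TEGRA30_MC_RESET_3D>, <&mc TEGRA30_MC_RESET_3D2>;
+        reset-names = "3d", "3d2", "mc", "mc2";
+
+        power-domains = <&pd_3d0>, <&pd_3d1>;
+        power-domain-names = "3d0", "3d1";
+    };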
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-hdmi.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-hdmi.yaml
new file mode 100644
index 000000000000..f77197e4869f
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-hdmi.yaml
@@ -0,0 +1,125 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra20-hdmi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra HDMI Output Encoder
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^hdmi@[0-9a-f]+$"
+
+ compatible:
+ oneOf:
+ - enum:
+ - nvidia,tegra20-hdmi
+ - nvidia,tegra30-hdmi
+ - nvidia,tegra114-hdmi
+ - nvidia,tegra124-hdmi
+
+ - items:
+ - const: nvidia,tegra132-hdmi
+ - const: nvidia,tegra124-hdmi
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: module clock
+ - description: parent clock
+
+ clock-names:
+ items:
+ - const: hdmi
+ - const: parent
+
+ resets:
+ items:
+ - description: module reset
+
+ reset-names:
+ items:
+ - const: hdmi
+
+ operating-points-v2: true
+
+ power-domains:
+ items:
+ - description: phandle to the core power domain
+
+ hdmi-supply:
+ description: supply for the +5V HDMI connector pin
+
+ vdd-supply:
+ description: regulator for supply voltage
+
+ pll-supply:
+ description: regulator for PLL
+
+ nvidia,ddc-i2c-bus:
+ description: phandle of an I2C controller used for DDC EDID
+ probing
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ nvidia,hpd-gpio:
+ description: specifies a GPIO used for hotplug detection
+ maxItems: 1
+
+ nvidia,edid:
+ description: supplies a binary EDID blob
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+
+ nvidia,panel:
+ description: phandle of a display panel
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ "#sound-dai-cells":
+ const: 0
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+ - pll-supply
+ - vdd-supply
+ - nvidia,ddc-i2c-bus
+ - nvidia,hpd-gpio
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra124-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/gpio/tegra-gpio.h>
+
+ hdmi@54280000 {
+ compatible = "nvidia,tegra124-hdmi";
+ reg = <0x54280000 0x00040000>;
+ interrupts = <GIC_SPI 75 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car TEGRA124_CLK_HDMI>,
+ <&tegra_car TEGRA124_CLK_PLL_D2_OUT0>;
+ clock-names = "hdmi", "parent";
+ resets = <&tegra_car 51>;
+ reset-names = "hdmi";
+
+ hdmi-supply = <&vdd_5v0_hdmi>;
+ pll-supply = <&vdd_hdmi_pll>;
+ vdd-supply = <&vdd_3v3_hdmi>;
+
+ nvidia,ddc-i2c-bus = <&hdmi_ddc>;
+ nvidia,hpd-gpio = <&gpio TEGRA_GPIO(N, 7) GPIO_ACTIVE_HIGH>;
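+
+        /* optional: expose the encoder as an audio DAI for HDMI audio;
+           the schema fixes this at zero cells */
+        #sound-dai-cells = <0>;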
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-host1x.txt b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-host1x.txt
deleted file mode 100644
index 8a6d3e1ee306..000000000000
--- a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-host1x.txt
+++ /dev/null
@@ -1,622 +0,0 @@
-NVIDIA Tegra host1x
-
-Required properties:
-- compatible: "nvidia,tegra<chip>-host1x"
-- reg: Physical base address and length of the controller's registers.
- For pre-Tegra186, one entry describing the whole register area.
- For Tegra186, one entry for each entry in reg-names:
- "vm" - VM region assigned to Linux
- "hypervisor" - Hypervisor region (only if Linux acts as hypervisor)
-- interrupts: The interrupt outputs from the controller.
-- #address-cells: The number of cells used to represent physical base addresses
- in the host1x address space. Should be 1.
-- #size-cells: The number of cells used to represent the size of an address
- range in the host1x address space. Should be 1.
-- ranges: The mapping of the host1x address space to the CPU address space.
-- clocks: Must contain one entry, for the module clock.
- See ../clocks/clock-bindings.txt for details.
-- resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
-- reset-names: Must include the following entries:
- - host1x
-
-Each host1x client module having to perform DMA through the Memory Controller
-should have the interconnect endpoints set to the Memory Client and External
-Memory respectively.
-
-The host1x top-level node defines a number of children, each representing one
-of the following host1x client modules:
-
-- mpe: video encoder
-
- Required properties:
- - compatible: "nvidia,tegra<chip>-mpe"
- - reg: Physical base address and length of the controller's registers.
- - interrupts: The interrupt outputs from the controller.
- - clocks: Must contain one entry, for the module clock.
- See ../clocks/clock-bindings.txt for details.
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - mpe
-
- Optional properties:
- - interconnects: Must contain entry for the MPE memory clients.
- - interconnect-names: Must include name of the interconnect path for each
- interconnect entry. Consult TRM documentation for information about
- available memory clients, see MEMORY CONTROLLER section.
-
-- vi: video input
-
- Required properties:
- - compatible: "nvidia,tegra<chip>-vi"
- - reg: Physical base address and length of the controller registers.
- - interrupts: The interrupt outputs from the controller.
- - clocks: clocks: Must contain one entry, for the module clock.
- See ../clocks/clock-bindings.txt for details.
- - Tegra20/Tegra30/Tegra114/Tegra124:
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - vi
- - Tegra210:
- - power-domains: Must include venc powergate node as vi is in VE partition.
-
- ports (optional node)
- vi can have optional ports node and max 6 ports are supported. Each port
- should have single 'endpoint' child node. All port nodes are grouped under
- ports node. Please refer to the bindings defined in
- Documentation/devicetree/bindings/media/video-interfaces.txt
-
- csi (required node)
- Tegra210 has CSI part of VI sharing same host interface and register space.
- So, VI device node should have CSI child node.
-
- - csi: mipi csi interface to vi
-
- Required properties:
- - compatible: "nvidia,tegra210-csi"
- - reg: Physical base address offset to parent and length of the controller
- registers.
- - clocks: Must contain entries csi, cilab, cilcd, cile, csi_tpg clocks.
- See ../clocks/clock-bindings.txt for details.
- - power-domains: Must include sor powergate node as csicil is in
- SOR partition.
-
- channel (optional nodes)
- Maximum 6 channels are supported with each csi brick as either x4 or x2
- based on hw connectivity to sensor.
-
- Required properties:
- - reg: csi port number. Valid port numbers are 0 through 5.
- - nvidia,mipi-calibrate: Should contain a phandle and a specifier
- specifying which pads are used by this CSI port and need to be
- calibrated. See also ../display/tegra/nvidia,tegra114-mipi.txt.
-
- Each channel node must contain 2 port nodes which can be grouped
- under 'ports' node and each port should have a single child 'endpoint'
- node.
-
- ports node
- Please refer to the bindings defined in
- Documentation/devicetree/bindings/media/video-interfaces.txt
-
- ports node must contain below 2 port nodes.
- port@0 with single child 'endpoint' node always a sink.
- port@1 with single child 'endpoint' node always a source.
-
- port@0 (required node)
- Required properties:
- - reg: 0
-
- endpoint (required node)
- Required properties:
- - data-lanes: an array of data lane from 1 to 8. Valid array
- lengths are 1/2/4/8.
- - remote-endpoint: phandle to sensor 'endpoint' node.
-
- port@1 (required node)
- Required properties:
- - reg: 1
-
- endpoint (required node)
- Required properties:
- - remote-endpoint: phandle to vi port 'endpoint' node.
-
- Optional properties:
- - interconnects: Must contain entry for the VI memory clients.
- - interconnect-names: Must include name of the interconnect path for each
- interconnect entry. Consult TRM documentation for information about
- available memory clients, see MEMORY CONTROLLER section.
-
-- epp: encoder pre-processor
-
- Required properties:
- - compatible: "nvidia,tegra<chip>-epp"
- - reg: Physical base address and length of the controller's registers.
- - interrupts: The interrupt outputs from the controller.
- - clocks: Must contain one entry, for the module clock.
- See ../clocks/clock-bindings.txt for details.
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - epp
-
- Optional properties:
- - interconnects: Must contain entry for the EPP memory clients.
- - interconnect-names: Must include name of the interconnect path for each
- interconnect entry. Consult TRM documentation for information about
- available memory clients, see MEMORY CONTROLLER section.
-
-- isp: image signal processor
-
- Required properties:
- - compatible: "nvidia,tegra<chip>-isp"
- - reg: Physical base address and length of the controller's registers.
- - interrupts: The interrupt outputs from the controller.
- - clocks: Must contain one entry, for the module clock.
- See ../clocks/clock-bindings.txt for details.
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - isp
-
- Optional properties:
- - interconnects: Must contain entry for the ISP memory clients.
- - interconnect-names: Must include name of the interconnect path for each
- interconnect entry. Consult TRM documentation for information about
- available memory clients, see MEMORY CONTROLLER section.
-
-- gr2d: 2D graphics engine
-
- Required properties:
- - compatible: "nvidia,tegra<chip>-gr2d"
- - reg: Physical base address and length of the controller's registers.
- - interrupts: The interrupt outputs from the controller.
- - clocks: Must contain one entry, for the module clock.
- See ../clocks/clock-bindings.txt for details.
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - 2d
-
- Optional properties:
- - interconnects: Must contain entry for the GR2D memory clients.
- - interconnect-names: Must include name of the interconnect path for each
- interconnect entry. Consult TRM documentation for information about
- available memory clients, see MEMORY CONTROLLER section.
-
-- gr3d: 3D graphics engine
-
- Required properties:
- - compatible: "nvidia,tegra<chip>-gr3d"
- - reg: Physical base address and length of the controller's registers.
- - clocks: Must contain an entry for each entry in clock-names.
- See ../clocks/clock-bindings.txt for details.
- - clock-names: Must include the following entries:
- (This property may be omitted if the only clock in the list is "3d")
- - 3d
- This MUST be the first entry.
- - 3d2 (Only required on SoCs with two 3D clocks)
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - 3d
- - 3d2 (Only required on SoCs with two 3D clocks)
-
- Optional properties:
- - interconnects: Must contain entry for the GR3D memory clients.
- - interconnect-names: Must include name of the interconnect path for each
- interconnect entry. Consult TRM documentation for information about
- available memory clients, see MEMORY CONTROLLER section.
-
-- dc: display controller
-
- Required properties:
- - compatible: "nvidia,tegra<chip>-dc"
- - reg: Physical base address and length of the controller's registers.
- - interrupts: The interrupt outputs from the controller.
- - clocks: Must contain an entry for each entry in clock-names.
- See ../clocks/clock-bindings.txt for details.
- - clock-names: Must include the following entries:
- - dc
- This MUST be the first entry.
- - parent
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - dc
- - nvidia,head: The number of the display controller head. This is used to
- setup the various types of output to receive video data from the given
- head.
-
- Each display controller node has a child node, named "rgb", that represents
- the RGB output associated with the controller. It can take the following
- optional properties:
- - nvidia,ddc-i2c-bus: phandle of an I2C controller used for DDC EDID probing
- - nvidia,hpd-gpio: specifies a GPIO used for hotplug detection
- - nvidia,edid: supplies a binary EDID blob
- - nvidia,panel: phandle of a display panel
- - interconnects: Must contain entry for the DC memory clients.
- - interconnect-names: Must include name of the interconnect path for each
- interconnect entry. Consult TRM documentation for information about
- available memory clients, see MEMORY CONTROLLER section.
-
-- hdmi: High Definition Multimedia Interface
-
- Required properties:
- - compatible: "nvidia,tegra<chip>-hdmi"
- - reg: Physical base address and length of the controller's registers.
- - interrupts: The interrupt outputs from the controller.
- - hdmi-supply: supply for the +5V HDMI connector pin
- - vdd-supply: regulator for supply voltage
- - pll-supply: regulator for PLL
- - clocks: Must contain an entry for each entry in clock-names.
- See ../clocks/clock-bindings.txt for details.
- - clock-names: Must include the following entries:
- - hdmi
- This MUST be the first entry.
- - parent
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - hdmi
-
- Optional properties:
- - nvidia,ddc-i2c-bus: phandle of an I2C controller used for DDC EDID probing
- - nvidia,hpd-gpio: specifies a GPIO used for hotplug detection
- - nvidia,edid: supplies a binary EDID blob
- - nvidia,panel: phandle of a display panel
-
-- tvo: TV encoder output
-
- Required properties:
- - compatible: "nvidia,tegra<chip>-tvo"
- - reg: Physical base address and length of the controller's registers.
- - interrupts: The interrupt outputs from the controller.
- - clocks: Must contain one entry, for the module clock.
- See ../clocks/clock-bindings.txt for details.
-
-- dsi: display serial interface
-
- Required properties:
- - compatible: "nvidia,tegra<chip>-dsi"
- - reg: Physical base address and length of the controller's registers.
- - clocks: Must contain an entry for each entry in clock-names.
- See ../clocks/clock-bindings.txt for details.
- - clock-names: Must include the following entries:
- - dsi
- This MUST be the first entry.
- - lp
- - parent
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - dsi
- - avdd-dsi-supply: phandle of a supply that powers the DSI controller
- - nvidia,mipi-calibrate: Should contain a phandle and a specifier specifying
- which pads are used by this DSI output and need to be calibrated. See also
- ../display/tegra/nvidia,tegra114-mipi.txt.
-
- Optional properties:
- - nvidia,ddc-i2c-bus: phandle of an I2C controller used for DDC EDID probing
- - nvidia,hpd-gpio: specifies a GPIO used for hotplug detection
- - nvidia,edid: supplies a binary EDID blob
- - nvidia,panel: phandle of a display panel
- - nvidia,ganged-mode: contains a phandle to a second DSI controller to gang
- up with in order to support up to 8 data lanes
-
-- sor: serial output resource
-
- Required properties:
- - compatible: Should be:
- - "nvidia,tegra124-sor": for Tegra124 and Tegra132
- - "nvidia,tegra132-sor": for Tegra132
- - "nvidia,tegra210-sor": for Tegra210
- - "nvidia,tegra210-sor1": for Tegra210
- - "nvidia,tegra186-sor": for Tegra186
- - "nvidia,tegra186-sor1": for Tegra186
- - reg: Physical base address and length of the controller's registers.
- - interrupts: The interrupt outputs from the controller.
- - clocks: Must contain an entry for each entry in clock-names.
- See ../clocks/clock-bindings.txt for details.
- - clock-names: Must include the following entries:
- - sor: clock input for the SOR hardware
- - out: SOR output clock
- - parent: input for the pixel clock
- - dp: reference clock for the SOR clock
- - safe: safe reference for the SOR clock during power up
-
- For Tegra186 and later:
- - pad: SOR pad output clock (on Tegra186 and later)
-
- Obsolete:
- - source: source clock for the SOR clock (obsolete, use "out" instead)
-
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - sor
-
- Required properties on Tegra186 and later:
- - nvidia,interface: index of the SOR interface
-
- Optional properties:
- - nvidia,ddc-i2c-bus: phandle of an I2C controller used for DDC EDID probing
- - nvidia,hpd-gpio: specifies a GPIO used for hotplug detection
- - nvidia,edid: supplies a binary EDID blob
- - nvidia,panel: phandle of a display panel
- - nvidia,xbar-cfg: 5 cells containing the crossbar configuration. Each lane
- of the SOR, identified by the cell's index, is mapped via the crossbar to
- the pad specified by the cell's value.
-
- Optional properties when driving an eDP output:
- - nvidia,dpaux: phandle to a DispayPort AUX interface
-
-- dpaux: DisplayPort AUX interface
- - compatible : Should contain one of the following:
- - "nvidia,tegra124-dpaux": for Tegra124 and Tegra132
- - "nvidia,tegra210-dpaux": for Tegra210
- - reg: Physical base address and length of the controller's registers.
- - interrupts: The interrupt outputs from the controller.
- - clocks: Must contain an entry for each entry in clock-names.
- See ../clocks/clock-bindings.txt for details.
- - clock-names: Must include the following entries:
- - dpaux: clock input for the DPAUX hardware
- - parent: reference clock
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - dpaux
- - vdd-supply: phandle of a supply that powers the DisplayPort link
- - i2c-bus: Subnode where I2C slave devices are listed. This subnode
- must be always present. If there are no I2C slave devices, an empty
- node should be added. See ../../i2c/i2c.txt for more information.
-
- See ../pinctrl/nvidia,tegra124-dpaux-padctl.txt for information
- regarding the DPAUX pad controller bindings.
-
-- vic: Video Image Compositor
- - compatible : "nvidia,tegra<chip>-vic"
- - reg: Physical base address and length of the controller's registers.
- - interrupts: The interrupt outputs from the controller.
- - clocks: Must contain an entry for each entry in clock-names.
- See ../clocks/clock-bindings.txt for details.
- - clock-names: Must include the following entries:
- - vic: clock input for the VIC hardware
- - resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: Must include the following entries:
- - vic
-
- Optional properties:
- - interconnects: Must contain entry for the VIC memory clients.
- - interconnect-names: Must include name of the interconnect path for each
- interconnect entry. Consult TRM documentation for information about
- available memory clients, see MEMORY CONTROLLER section.
-
-Example:
-
-/ {
- ...
-
- host1x {
- compatible = "nvidia,tegra20-host1x", "simple-bus";
- reg = <0x50000000 0x00024000>;
- interrupts = <0 65 0x04 /* mpcore syncpt */
- 0 67 0x04>; /* mpcore general */
- clocks = <&tegra_car TEGRA20_CLK_HOST1X>;
- resets = <&tegra_car 28>;
- reset-names = "host1x";
-
- #address-cells = <1>;
- #size-cells = <1>;
-
- ranges = <0x54000000 0x54000000 0x04000000>;
-
- mpe {
- compatible = "nvidia,tegra20-mpe";
- reg = <0x54040000 0x00040000>;
- interrupts = <0 68 0x04>;
- clocks = <&tegra_car TEGRA20_CLK_MPE>;
- resets = <&tegra_car 60>;
- reset-names = "mpe";
- };
-
- vi@54080000 {
- compatible = "nvidia,tegra210-vi";
- reg = <0x0 0x54080000 0x0 0x700>;
- interrupts = <GIC_SPI 69 IRQ_TYPE_LEVEL_HIGH>;
- assigned-clocks = <&tegra_car TEGRA210_CLK_VI>;
- assigned-clock-parents = <&tegra_car TEGRA210_CLK_PLL_C4_OUT0>;
-
- clocks = <&tegra_car TEGRA210_CLK_VI>;
- power-domains = <&pd_venc>;
-
- #address-cells = <1>;
- #size-cells = <1>;
-
- ranges = <0x0 0x0 0x54080000 0x2000>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- imx219_vi_in0: endpoint {
- remote-endpoint = <&imx219_csi_out0>;
- };
- };
- };
-
- csi@838 {
- compatible = "nvidia,tegra210-csi";
- reg = <0x838 0x1300>;
- assigned-clocks = <&tegra_car TEGRA210_CLK_CILAB>,
- <&tegra_car TEGRA210_CLK_CILCD>,
- <&tegra_car TEGRA210_CLK_CILE>,
- <&tegra_car TEGRA210_CLK_CSI_TPG>;
- assigned-clock-parents = <&tegra_car TEGRA210_CLK_PLL_P>,
- <&tegra_car TEGRA210_CLK_PLL_P>,
- <&tegra_car TEGRA210_CLK_PLL_P>;
- assigned-clock-rates = <102000000>,
- <102000000>,
- <102000000>,
- <972000000>;
-
- clocks = <&tegra_car TEGRA210_CLK_CSI>,
- <&tegra_car TEGRA210_CLK_CILAB>,
- <&tegra_car TEGRA210_CLK_CILCD>,
- <&tegra_car TEGRA210_CLK_CILE>,
- <&tegra_car TEGRA210_CLK_CSI_TPG>;
- clock-names = "csi", "cilab", "cilcd", "cile", "csi_tpg";
- power-domains = <&pd_sor>;
-
- #address-cells = <1>;
- #size-cells = <0>;
-
- channel@0 {
- reg = <0>;
- nvidia,mipi-calibrate = <&mipi 0x001>;
-
- ports {
- #address-cells = <1>;
- #size-cells = <0>;
-
- port@0 {
- reg = <0>;
- imx219_csi_in0: endpoint {
- data-lanes = <1 2>;
- remote-endpoint = <&imx219_out0>;
- };
- };
-
- port@1 {
- reg = <1>;
- imx219_csi_out0: endpoint {
- remote-endpoint = <&imx219_vi_in0>;
- };
- };
- };
- };
- };
- };
-
- epp {
- compatible = "nvidia,tegra20-epp";
- reg = <0x540c0000 0x00040000>;
- interrupts = <0 70 0x04>;
- clocks = <&tegra_car TEGRA20_CLK_EPP>;
- resets = <&tegra_car 19>;
- reset-names = "epp";
- };
-
- isp {
- compatible = "nvidia,tegra20-isp";
- reg = <0x54100000 0x00040000>;
- interrupts = <0 71 0x04>;
- clocks = <&tegra_car TEGRA20_CLK_ISP>;
- resets = <&tegra_car 23>;
- reset-names = "isp";
- };
-
- gr2d {
- compatible = "nvidia,tegra20-gr2d";
- reg = <0x54140000 0x00040000>;
- interrupts = <0 72 0x04>;
- clocks = <&tegra_car TEGRA20_CLK_GR2D>;
- resets = <&tegra_car 21>;
- reset-names = "2d";
- };
-
- gr3d {
- compatible = "nvidia,tegra20-gr3d";
- reg = <0x54180000 0x00040000>;
- clocks = <&tegra_car TEGRA20_CLK_GR3D>;
- resets = <&tegra_car 24>;
- reset-names = "3d";
- };
-
- dc@54200000 {
- compatible = "nvidia,tegra20-dc";
- reg = <0x54200000 0x00040000>;
- interrupts = <0 73 0x04>;
- clocks = <&tegra_car TEGRA20_CLK_DISP1>,
- <&tegra_car TEGRA20_CLK_PLL_P>;
- clock-names = "dc", "parent";
- resets = <&tegra_car 27>;
- reset-names = "dc";
-
- interconnects = <&mc TEGRA20_MC_DISPLAY0A &emc>,
- <&mc TEGRA20_MC_DISPLAY0B &emc>,
- <&mc TEGRA20_MC_DISPLAY0C &emc>,
- <&mc TEGRA20_MC_DISPLAYHC &emc>;
- interconnect-names = "wina",
- "winb",
- "winc",
- "cursor";
-
- rgb {
- status = "disabled";
- };
- };
-
- dc@54240000 {
- compatible = "nvidia,tegra20-dc";
- reg = <0x54240000 0x00040000>;
- interrupts = <0 74 0x04>;
- clocks = <&tegra_car TEGRA20_CLK_DISP2>,
- <&tegra_car TEGRA20_CLK_PLL_P>;
- clock-names = "dc", "parent";
- resets = <&tegra_car 26>;
- reset-names = "dc";
-
- interconnects = <&mc TEGRA20_MC_DISPLAY0AB &emc>,
- <&mc TEGRA20_MC_DISPLAY0BB &emc>,
- <&mc TEGRA20_MC_DISPLAY0CB &emc>,
- <&mc TEGRA20_MC_DISPLAYHCB &emc>;
- interconnect-names = "wina",
- "winb",
- "winc",
- "cursor";
-
- rgb {
- status = "disabled";
- };
- };
-
- hdmi {
- compatible = "nvidia,tegra20-hdmi";
- reg = <0x54280000 0x00040000>;
- interrupts = <0 75 0x04>;
- clocks = <&tegra_car TEGRA20_CLK_HDMI>,
- <&tegra_car TEGRA20_CLK_PLL_D_OUT0>;
- clock-names = "hdmi", "parent";
- resets = <&tegra_car 51>;
- reset-names = "hdmi";
- status = "disabled";
- };
-
- tvo {
- compatible = "nvidia,tegra20-tvo";
- reg = <0x542c0000 0x00040000>;
- interrupts = <0 76 0x04>;
- clocks = <&tegra_car TEGRA20_CLK_TVO>;
- status = "disabled";
- };
-
- dsi {
- compatible = "nvidia,tegra20-dsi";
- reg = <0x54300000 0x00040000>;
- clocks = <&tegra_car TEGRA20_CLK_DSI>,
- <&tegra_car TEGRA20_CLK_PLL_D_OUT0>;
- clock-names = "dsi", "parent";
- resets = <&tegra_car 48>;
- reset-names = "dsi";
- status = "disabled";
- };
- };
-
- ...
-};
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-host1x.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-host1x.yaml
new file mode 100644
index 000000000000..94c5242c03b2
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-host1x.yaml
@@ -0,0 +1,430 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra20-host1x.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra host1x controller
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+description: The host1x top-level node defines a number of children, each
+ representing one of the host1x client modules defined in this binding.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - nvidia,tegra20-host1x
+ - nvidia,tegra30-host1x
+ - nvidia,tegra114-host1x
+ - nvidia,tegra124-host1x
+ - nvidia,tegra210-host1x
+ - nvidia,tegra186-host1x
+ - nvidia,tegra194-host1x
+ - nvidia,tegra234-host1x
+
+ - items:
+ - const: nvidia,tegra132-host1x
+ - const: nvidia,tegra124-host1x
+
+ reg:
+ minItems: 1
+ maxItems: 3
+
+ reg-names:
+ minItems: 1
+ maxItems: 3
+
+ interrupts:
+ minItems: 1
+ maxItems: 9
+
+ interrupt-names:
+ minItems: 1
+ maxItems: 9
+
+ '#address-cells':
+ description: The number of cells used to represent physical base addresses
+ in the host1x address space.
+ enum: [1, 2]
+
+ '#size-cells':
+ description: The number of cells used to represent the size of an address
+ range in the host1x address space.
+ enum: [1, 2]
+
+ ranges:
+ maxItems: 1
+
+ clocks:
+ description: Must contain one entry, for the module clock. See
+ ../clocks/clock-bindings.txt for details.
+
+ clock-names:
+ items:
+ - const: host1x
+
+ resets:
+ minItems: 1 # MC reset is optional on Tegra186 and later
+ items:
+ - description: module reset
+ - description: memory client hotflush reset
+
+ reset-names:
+ minItems: 1 # MC reset is optional on Tegra186 and later
+ items:
+ - const: host1x
+ - const: mc
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ items:
+ - description: memory read client for host1x
+
+ interconnect-names:
+ items:
+ - const: dma-mem # read
+
+ operating-points-v2: true
+
+ power-domains:
+ items:
+ - description: phandle to the HEG or core power domain
+
+required:
+ - compatible
+ - interrupts
+ - interrupt-names
+ - '#address-cells'
+ - '#size-cells'
+ - ranges
+ - reg
+ - clocks
+ - clock-names
+
+unevaluatedProperties:
+ type: object
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - nvidia,tegra20-host1x
+ - nvidia,tegra30-host1x
+ - nvidia,tegra114-host1x
+ - nvidia,tegra124-host1x
+ - nvidia,tegra210-host1x
+ then:
+ properties:
+ interrupts:
+ items:
+ - description: host1x syncpoint interrupt
+ - description: host1x general interrupt
+
+ interrupt-names:
+ items:
+ - const: syncpt
+ - const: host1x
+ required:
+ - resets
+ - reset-names
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - nvidia,tegra186-host1x
+ - nvidia,tegra194-host1x
+ then:
+ properties:
+ reg-names:
+ items:
+ - const: hypervisor
+ - const: vm
+
+ reg:
+ items:
+ - description: region used by the hypervisor
+ - description: region assigned to the virtual machine
+
+ resets:
+ maxItems: 1
+
+ reset-names:
+ maxItems: 1
+
+ interrupts:
+ items:
+ - description: host1x syncpoint interrupt
+ - description: host1x general interrupt
+
+ interrupt-names:
+ items:
+ - const: syncpt
+ - const: host1x
+
+ iommu-map:
+ description: Specification of stream IDs available for memory context device
+ use. Should be a mapping of IDs 0..n to IOMMU entries corresponding to
+ usable stream IDs.
+
+ required:
+ - reg-names
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - nvidia,tegra234-host1x
+ then:
+ properties:
+ reg-names:
+ items:
+ - const: common
+ - const: hypervisor
+ - const: vm
+
+ reg:
+ items:
+ - description: region used by host1x server
+ - description: region used by the hypervisor
+ - description: region assigned to the virtual machine
+
+ interrupts:
+ items:
+ - description: host1x syncpoint interrupt 0
+ - description: host1x syncpoint interrupt 1
+ - description: host1x syncpoint interrupt 2
+ - description: host1x syncpoint interrupt 3
+ - description: host1x syncpoint interrupt 4
+ - description: host1x syncpoint interrupt 5
+ - description: host1x syncpoint interrupt 6
+ - description: host1x syncpoint interrupt 7
+ - description: host1x general interrupt
+
+ interrupt-names:
+ items:
+ - const: syncpt0
+ - const: syncpt1
+ - const: syncpt2
+ - const: syncpt3
+ - const: syncpt4
+ - const: syncpt5
+ - const: syncpt6
+ - const: syncpt7
+ - const: host1x
+
+ iommu-map:
+ description: Specification of stream IDs available for memory context device
+ use. Should be a mapping of IDs 0..n to IOMMU entries corresponding to
+ usable stream IDs.
+
+ required:
+ - reg-names
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra20-car.h>
+ #include <dt-bindings/gpio/tegra-gpio.h>
+ #include <dt-bindings/memory/tegra20-mc.h>
+
+ host1x@50000000 {
+ compatible = "nvidia,tegra20-host1x";
+ reg = <0x50000000 0x00024000>;
+ interrupts = <0 65 0x04>, /* mpcore syncpt */
+ <0 67 0x04>; /* mpcore general */
+ interrupt-names = "syncpt", "host1x";
+ clocks = <&tegra_car TEGRA20_CLK_HOST1X>;
+ clock-names = "host1x";
+ resets = <&tegra_car 28>, <&mc TEGRA20_MC_RESET_HC>;
+ reset-names = "host1x", "mc";
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ ranges = <0x54000000 0x54000000 0x04000000>;
+
+ mpe@54040000 {
+ compatible = "nvidia,tegra20-mpe";
+ reg = <0x54040000 0x00040000>;
+ interrupts = <0 68 0x04>;
+ clocks = <&tegra_car TEGRA20_CLK_MPE>;
+ resets = <&tegra_car 60>;
+ reset-names = "mpe";
+ };
+
+ vi@54080000 {
+ compatible = "nvidia,tegra20-vi";
+ reg = <0x54080000 0x00040000>;
+ interrupts = <0 69 0x04>;
+ clocks = <&tegra_car TEGRA20_CLK_VI>;
+ resets = <&tegra_car 100>;
+ reset-names = "vi";
+ };
+
+ epp@540c0000 {
+ compatible = "nvidia,tegra20-epp";
+ reg = <0x540c0000 0x00040000>;
+ interrupts = <0 70 0x04>;
+ clocks = <&tegra_car TEGRA20_CLK_EPP>;
+ resets = <&tegra_car 19>;
+ reset-names = "epp";
+ };
+
+ isp@54100000 {
+ compatible = "nvidia,tegra20-isp";
+ reg = <0x54100000 0x00040000>;
+ interrupts = <0 71 0x04>;
+ clocks = <&tegra_car TEGRA20_CLK_ISP>;
+ resets = <&tegra_car 23>;
+ reset-names = "isp";
+ };
+
+ gr2d@54140000 {
+ compatible = "nvidia,tegra20-gr2d";
+ reg = <0x54140000 0x00040000>;
+ interrupts = <0 72 0x04>;
+ clocks = <&tegra_car TEGRA20_CLK_GR2D>;
+ resets = <&tegra_car 21>, <&mc TEGRA20_MC_RESET_2D>;
+ reset-names = "2d", "mc";
+ };
+
+ gr3d@54180000 {
+ compatible = "nvidia,tegra20-gr3d";
+ reg = <0x54180000 0x00040000>;
+ clocks = <&tegra_car TEGRA20_CLK_GR3D>;
+ resets = <&tegra_car 24>, <&mc TEGRA20_MC_RESET_3D>;
+ reset-names = "3d", "mc";
+ };
+
+ dc@54200000 {
+ compatible = "nvidia,tegra20-dc";
+ reg = <0x54200000 0x00040000>;
+ interrupts = <0 73 0x04>;
+ clocks = <&tegra_car TEGRA20_CLK_DISP1>;
+ clock-names = "dc";
+ resets = <&tegra_car 27>;
+ reset-names = "dc";
+
+ rgb {
+ };
+ };
+
+ dc@54240000 {
+ compatible = "nvidia,tegra20-dc";
+ reg = <0x54240000 0x00040000>;
+ interrupts = <0 74 0x04>;
+ clocks = <&tegra_car TEGRA20_CLK_DISP2>;
+ clock-names = "dc";
+ resets = <&tegra_car 26>;
+ reset-names = "dc";
+
+ rgb {
+ };
+ };
+
+ hdmi@54280000 {
+ compatible = "nvidia,tegra20-hdmi";
+ reg = <0x54280000 0x00040000>;
+ interrupts = <0 75 0x04>;
+ clocks = <&tegra_car TEGRA20_CLK_HDMI>,
+ <&tegra_car TEGRA20_CLK_PLL_D_OUT0>;
+ clock-names = "hdmi", "parent";
+ resets = <&tegra_car 51>;
+ reset-names = "hdmi";
+
+ hdmi-supply = <&vdd_5v0_hdmi>;
+ pll-supply = <&vdd_hdmi_pll>;
+ vdd-supply = <&vdd_3v3_hdmi>;
+
+ nvidia,ddc-i2c-bus = <&hdmi_ddc>;
+ nvidia,hpd-gpio = <&gpio TEGRA_GPIO(N, 7) GPIO_ACTIVE_HIGH>;
+ };
+
+ tvo@542c0000 {
+ compatible = "nvidia,tegra20-tvo";
+ reg = <0x542c0000 0x00040000>;
+ interrupts = <0 76 0x04>;
+ clocks = <&tegra_car TEGRA20_CLK_TVO>;
+ };
+
+ dsi@54300000 {
+ compatible = "nvidia,tegra20-dsi";
+ reg = <0x54300000 0x00040000>;
+ clocks = <&tegra_car TEGRA20_CLK_DSI>,
+ <&tegra_car TEGRA20_CLK_PLL_D_OUT0>;
+ clock-names = "dsi", "parent";
+ resets = <&tegra_car 48>;
+ reset-names = "dsi";
+ };
+ };
+
+ - |
+ #include <dt-bindings/clock/tegra210-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/memory/tegra210-mc.h>
+
+ host1x@50000000 {
+ compatible = "nvidia,tegra210-host1x";
+ reg = <0x50000000 0x00024000>;
+ interrupts = <GIC_SPI 65 IRQ_TYPE_LEVEL_HIGH>, /* mpcore syncpt */
+ <GIC_SPI 67 IRQ_TYPE_LEVEL_HIGH>; /* mpcore general */
+ interrupt-names = "syncpt", "host1x";
+ clocks = <&tegra_car TEGRA210_CLK_HOST1X>;
+ clock-names = "host1x";
+ resets = <&tegra_car 28>;
+ reset-names = "host1x";
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ ranges = <0x54000000 0x54000000 0x01000000>;
+ iommus = <&mc TEGRA_SWGROUP_HC>;
+
+ vi@54080000 {
+ compatible = "nvidia,tegra210-vi";
+ reg = <0x54080000 0x00000700>;
+ interrupts = <GIC_SPI 69 IRQ_TYPE_LEVEL_HIGH>;
+ assigned-clocks = <&tegra_car TEGRA210_CLK_VI>;
+ assigned-clock-parents = <&tegra_car TEGRA210_CLK_PLL_C4_OUT0>;
+
+ clocks = <&tegra_car TEGRA210_CLK_VI>;
+ power-domains = <&pd_venc>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ ranges = <0x0 0x54080000 0x2000>;
+
+ csi@838 {
+ compatible = "nvidia,tegra210-csi";
+ reg = <0x838 0x1300>;
+ assigned-clocks = <&tegra_car TEGRA210_CLK_CILAB>,
+ <&tegra_car TEGRA210_CLK_CILCD>,
+ <&tegra_car TEGRA210_CLK_CILE>,
+ <&tegra_car TEGRA210_CLK_CSI_TPG>;
+ assigned-clock-parents = <&tegra_car TEGRA210_CLK_PLL_P>,
+ <&tegra_car TEGRA210_CLK_PLL_P>,
+ <&tegra_car TEGRA210_CLK_PLL_P>;
+ assigned-clock-rates = <102000000>,
+ <102000000>,
+ <102000000>,
+ <972000000>;
+
+ clocks = <&tegra_car TEGRA210_CLK_CSI>,
+ <&tegra_car TEGRA210_CLK_CILAB>,
+ <&tegra_car TEGRA210_CLK_CILCD>,
+ <&tegra_car TEGRA210_CLK_CILE>,
+ <&tegra_car TEGRA210_CLK_CSI_TPG>;
+ clock-names = "csi", "cilab", "cilcd", "cile", "csi_tpg";
+ power-domains = <&pd_sor>;
+ };
+ };
+ };
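+
+  # A hedged sketch of the Tegra186 two-region ("hypervisor"/"vm") layout,
+  # including a stream ID mapping for memory context devices; register,
+  # interrupt and stream ID values loosely follow tegra186.dtsi and are
+  # illustrative rather than authoritative.
+  - |
+    #include <dt-bindings/clock/tegra186-clock.h>
+    #include <dt-bindings/interrupt-controller/arm-gic.h>
+    #include <dt-bindings/memory/tegra186-mc.h>
+    #include <dt-bindings/reset/tegra186-reset.h>
+
+    host1x@13e00000 {
+        compatible = "nvidia,tegra186-host1x";
+        reg = <0x13e00000 0x10000>,
+              <0x13e10000 0x10000>;
+        reg-names = "hypervisor", "vm";
+        interrupts = <GIC_SPI 265 IRQ_TYPE_LEVEL_HIGH>, /* syncpt */
+                     <GIC_SPI 263 IRQ_TYPE_LEVEL_HIGH>; /* general */
+        interrupt-names = "syncpt", "host1x";
+        clocks = <&bpmp TEGRA186_CLK_HOST1X>;
+        clock-names = "host1x";
+        resets = <&bpmp TEGRA186_RESET_HOST1X>;
+        reset-names = "host1x";
+
+        iommus = <&smmu TEGRA186_SID_HOST1X>;
+        iommu-map = <0 &smmu TEGRA186_SID_HOST1X_CTX0 8>;
+
+        #address-cells = <1>;
+        #size-cells = <1>;
+
+        ranges = <0x15000000 0x15000000 0x01000000>;
+    };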
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-isp.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-isp.yaml
new file mode 100644
index 000000000000..3bc3b22e98e1
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-isp.yaml
@@ -0,0 +1,67 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra20-isp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra ISP processor
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ compatible:
+ enum:
+ - nvidia,tegra20-isp
+ - nvidia,tegra30-isp
+ - nvidia,tegra210-isp
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: module clock
+
+ resets:
+ items:
+ - description: module reset
+
+ reset-names:
+ items:
+ - const: isp
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ items:
+ - description: memory write client
+
+ interconnect-names:
+ items:
+ - const: dma-mem # write
+
+ power-domains:
+ items:
+ - description: phandle to the VENC or core power domain
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra20-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ isp@54100000 {
+ compatible = "nvidia,tegra20-isp";
+ reg = <0x54100000 0x00040000>;
+ interrupts = <GIC_SPI 71 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car TEGRA20_CLK_ISP>;
+ resets = <&tegra_car 23>;
+ reset-names = "isp";
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-mpe.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-mpe.yaml
new file mode 100644
index 000000000000..2cd3e60cd0a8
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-mpe.yaml
@@ -0,0 +1,70 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra20-mpe.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra Video Encoder
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^mpe@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - nvidia,tegra20-mpe
+ - nvidia,tegra30-mpe
+ - nvidia,tegra114-mpe
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: module clock
+
+ resets:
+ items:
+ - description: module reset
+
+ reset-names:
+ items:
+ - const: mpe
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ maxItems: 6
+
+ interconnect-names:
+ maxItems: 6
+
+ operating-points-v2: true
+
+ power-domains:
+ items:
+ - description: phandle to the MPE power domain
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra20-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ mpe@54040000 {
+ compatible = "nvidia,tegra20-mpe";
+ reg = <0x54040000 0x00040000>;
+ interrupts = <GIC_SPI 68 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car TEGRA20_CLK_MPE>;
+ resets = <&tegra_car 60>;
+ reset-names = "mpe";
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-tvo.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-tvo.yaml
new file mode 100644
index 000000000000..6c84d8b7eb7b
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-tvo.yaml
@@ -0,0 +1,57 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra20-tvo.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra TV Encoder Output
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^tvo@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - nvidia,tegra20-tvo
+ - nvidia,tegra30-tvo
+ - nvidia,tegra114-tvo
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: module clock
+
+ operating-points-v2: true
+
+ power-domains:
+ items:
+ - description: phandle to the core power domain
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra20-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ tvo@542c0000 {
+ compatible = "nvidia,tegra20-tvo";
+ reg = <0x542c0000 0x00040000>;
+ interrupts = <GIC_SPI 76 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car TEGRA20_CLK_TVO>;
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-vi.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-vi.yaml
new file mode 100644
index 000000000000..a42bf33d1e7d
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra20-vi.yaml
@@ -0,0 +1,162 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra20-vi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra Video Input controller
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^vi@[0-9a-f]+$"
+
+ compatible:
+ oneOf:
+ - const: nvidia,tegra20-vi
+ - const: nvidia,tegra30-vi
+ - const: nvidia,tegra114-vi
+ - const: nvidia,tegra124-vi
+ - items:
+ - const: nvidia,tegra132-vi
+ - const: nvidia,tegra124-vi
+ - const: nvidia,tegra210-vi
+ - const: nvidia,tegra186-vi
+ - const: nvidia,tegra194-vi
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ resets:
+ items:
+ - description: module reset
+
+ reset-names:
+ items:
+ - const: vi
+
+ iommus:
+ maxItems: 1
+
+ interconnects:
+ minItems: 4
+ maxItems: 5
+
+ interconnect-names:
+ minItems: 4
+ maxItems: 5
+
+ operating-points-v2: true
+
+ power-domains:
+ items:
+ - description: phandle to the VENC power domain
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 1
+
+ ranges:
+ maxItems: 1
+
+ avdd-dsi-csi-supply:
+ description: DSI/CSI power supply. Must supply 1.2 V.
+
+patternProperties:
+ "^csi@[0-9a-f]+$":
+ type: object
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - nvidia,tegra20-vi
+ - nvidia,tegra30-vi
+ - nvidia,tegra114-vi
+ - nvidia,tegra124-vi
+ then:
+ required:
+ - resets
+ - reset-names
+ else:
+ required:
+ - power-domains
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra20-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ vi@54080000 {
+ compatible = "nvidia,tegra20-vi";
+ reg = <0x54080000 0x00040000>;
+ interrupts = <GIC_SPI 69 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&tegra_car TEGRA20_CLK_VI>;
+ resets = <&tegra_car 100>;
+ reset-names = "vi";
+ };
+
+ - |
+ #include <dt-bindings/clock/tegra210-car.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ vi@54080000 {
+ compatible = "nvidia,tegra210-vi";
+ reg = <0x54080000 0x00000700>;
+ interrupts = <GIC_SPI 69 IRQ_TYPE_LEVEL_HIGH>;
+ assigned-clocks = <&tegra_car TEGRA210_CLK_VI>;
+ assigned-clock-parents = <&tegra_car TEGRA210_CLK_PLL_C4_OUT0>;
+
+ clocks = <&tegra_car TEGRA210_CLK_VI>;
+ power-domains = <&pd_venc>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ ranges = <0x0 0x54080000 0x2000>;
+
+ csi@838 {
+ compatible = "nvidia,tegra210-csi";
+ reg = <0x838 0x1300>;
+ assigned-clocks = <&tegra_car TEGRA210_CLK_CILAB>,
+ <&tegra_car TEGRA210_CLK_CILCD>,
+ <&tegra_car TEGRA210_CLK_CILE>,
+ <&tegra_car TEGRA210_CLK_CSI_TPG>;
+ assigned-clock-parents = <&tegra_car TEGRA210_CLK_PLL_P>,
+ <&tegra_car TEGRA210_CLK_PLL_P>,
+ <&tegra_car TEGRA210_CLK_PLL_P>;
+ assigned-clock-rates = <102000000>,
+ <102000000>,
+ <102000000>,
+ <972000000>;
+
+ clocks = <&tegra_car TEGRA210_CLK_CSI>,
+ <&tegra_car TEGRA210_CLK_CILAB>,
+ <&tegra_car TEGRA210_CLK_CILCD>,
+ <&tegra_car TEGRA210_CLK_CILE>,
+ <&tegra_car TEGRA210_CLK_CSI_TPG>;
+ clock-names = "csi", "cilab", "cilcd", "cile", "csi_tpg";
+ power-domains = <&pd_sor>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/display/tegra/nvidia,tegra210-csi.yaml b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra210-csi.yaml
new file mode 100644
index 000000000000..fa07a40d1004
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/tegra/nvidia,tegra210-csi.yaml
@@ -0,0 +1,52 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/tegra/nvidia,tegra210-csi.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra CSI controller
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^csi@[0-9a-f]+$"
+
+ compatible:
+ enum:
+ - nvidia,tegra210-csi
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ items:
+ - description: module clock
+ - description: A/B lanes clock
+ - description: C/D lanes clock
+ - description: E lane clock
+ - description: test pattern generator clock
+
+ clock-names:
+ items:
+ - const: csi
+ - const: cilab
+ - const: cilcd
+ - const: cile
+ - const: csi_tpg
+
+ power-domains:
+ maxItems: 1
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - power-domains
+
+# see nvidia,tegra20-vi.yaml for an example
diff --git a/Documentation/devicetree/bindings/display/ti/ti,am65x-dss.yaml b/Documentation/devicetree/bindings/display/ti/ti,am65x-dss.yaml
index 781c1868b0b8..b6b402f16161 100644
--- a/Documentation/devicetree/bindings/display/ti/ti,am65x-dss.yaml
+++ b/Documentation/devicetree/bindings/display/ti/ti,am65x-dss.yaml
@@ -2,8 +2,8 @@
# Copyright 2019 Texas Instruments Incorporated
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/display/ti/ti,am65x-dss.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/display/ti/ti,am65x-dss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Texas Instruments AM65x Display Subsystem
@@ -88,8 +88,7 @@ properties:
The DSS DPI output port node from video port 2
ti,am65x-oldi-io-ctrl:
- $ref: "/schemas/types.yaml#/definitions/phandle-array"
- maxItems: 1
+ $ref: /schemas/types.yaml#/definitions/phandle
description:
phandle to syscon device node mapping OLDI IO_CTRL registers.
The mapped range should point to OLDI_DAT0_IO_CTRL, map it and
diff --git a/Documentation/devicetree/bindings/display/ti/ti,j721e-dss.yaml b/Documentation/devicetree/bindings/display/ti/ti,j721e-dss.yaml
index 2986f9acc9f0..fad7cba58d39 100644
--- a/Documentation/devicetree/bindings/display/ti/ti,j721e-dss.yaml
+++ b/Documentation/devicetree/bindings/display/ti/ti,j721e-dss.yaml
@@ -2,8 +2,8 @@
# Copyright 2019 Texas Instruments Incorporated
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/display/ti/ti,j721e-dss.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/display/ti/ti,j721e-dss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Texas Instruments J721E Display Subsystem
diff --git a/Documentation/devicetree/bindings/display/ti/ti,k2g-dss.yaml b/Documentation/devicetree/bindings/display/ti/ti,k2g-dss.yaml
index 7ce7bbad5780..96b1439f88e3 100644
--- a/Documentation/devicetree/bindings/display/ti/ti,k2g-dss.yaml
+++ b/Documentation/devicetree/bindings/display/ti/ti,k2g-dss.yaml
@@ -2,8 +2,8 @@
# Copyright 2019 Texas Instruments Incorporated
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/display/ti/ti,k2g-dss.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/display/ti/ti,k2g-dss.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Texas Instruments K2G Display Subsystem
diff --git a/Documentation/devicetree/bindings/display/tilcdc/tilcdc.txt b/Documentation/devicetree/bindings/display/tilcdc/tilcdc.txt
index 3e64075ac7ec..3b3d0bbfcfff 100644
--- a/Documentation/devicetree/bindings/display/tilcdc/tilcdc.txt
+++ b/Documentation/devicetree/bindings/display/tilcdc/tilcdc.txt
@@ -60,7 +60,7 @@ Example:
blue-and-red-wiring = "crossed";
port {
- lcdc_0: endpoint@0 {
+ lcdc_0: endpoint {
remote-endpoint = <&hdmi_0>;
};
};
@@ -75,7 +75,7 @@ Example:
pinctrl-1 = <&nxp_hdmi_bonelt_off_pins>;
port {
- hdmi_0: endpoint@0 {
+ hdmi_0: endpoint {
remote-endpoint = <&lcdc_0>;
};
};
diff --git a/Documentation/devicetree/bindings/display/xlnx/xlnx,zynqmp-dpsub.yaml b/Documentation/devicetree/bindings/display/xlnx/xlnx,zynqmp-dpsub.yaml
index d88bd93f4b80..554f9d5809d4 100644
--- a/Documentation/devicetree/bindings/display/xlnx/xlnx,zynqmp-dpsub.yaml
+++ b/Documentation/devicetree/bindings/display/xlnx/xlnx,zynqmp-dpsub.yaml
@@ -117,6 +117,45 @@ properties:
- const: dp-phy0
- const: dp-phy1
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ description: |
+ Connections to the programmable logic and the DisplayPort PHYs. Each port
+ shall have a single endpoint.
+
+ properties:
+ port@0:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: The live video input from the programmable logic
+
+ port@1:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: The live graphics input from the programmable logic
+
+ port@2:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: The live audio input from the programmable logic
+
+ port@3:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: The blended video output to the programmable logic
+
+ port@4:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: The mixed audio output to the programmable logic
+
+ port@5:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: The DisplayPort output
+
+ required:
+ - port@0
+ - port@1
+ - port@2
+ - port@3
+ - port@4
+ - port@5
+
required:
- compatible
- reg
@@ -130,6 +169,7 @@ required:
- dma-names
- phys
- phy-names
+ - ports
additionalProperties: false
@@ -160,10 +200,37 @@ examples:
<&xlnx_dpdma 2>,
<&xlnx_dpdma 3>;
- phys = <&psgtr 1 PHY_TYPE_DP 0 3 27000000>,
- <&psgtr 0 PHY_TYPE_DP 1 3 27000000>;
+ phys = <&psgtr 1 PHY_TYPE_DP 0 3>,
+ <&psgtr 0 PHY_TYPE_DP 1 3>;
phy-names = "dp-phy0", "dp-phy1";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ };
+ port@1 {
+ reg = <1>;
+ };
+ port@2 {
+ reg = <2>;
+ };
+ port@3 {
+ reg = <3>;
+ };
+ port@4 {
+ reg = <4>;
+ };
+ port@5 {
+ reg = <5>;
+ dpsub_dp_out: endpoint {
+ remote-endpoint = <&dp_connector>;
+ };
+ };
+ };
};
...
diff --git a/Documentation/devicetree/bindings/display/xylon,logicvc-display.yaml b/Documentation/devicetree/bindings/display/xylon,logicvc-display.yaml
new file mode 100644
index 000000000000..76b804b7c880
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/xylon,logicvc-display.yaml
@@ -0,0 +1,301 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+# Copyright 2019 Bootlin
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/display/xylon,logicvc-display.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Xylon LogiCVC display controller
+
+maintainers:
+ - Paul Kocialkowski <paul.kocialkowski@bootlin.com>
+
+description: |
+ The Xylon LogiCVC is a display controller that supports multiple layers.
+ It is usually implemented as programmable logic and was optimized for use
+ with Xilinx Zynq-7000 SoCs and Xilinx FPGAs.
+
+ Because the controller is intended for use in an FPGA, most of the
+ configuration of the controller takes place at logic configuration bitstream
+ synthesis time. As a result, many of the device-tree bindings are meant to
+ reflect the synthesis configuration and must not be configured differently.
+ Matching synthesis parameters are provided when applicable.
+
+ Layers are declared in the "layers" sub-node and have dedicated configuration.
+ In version 3 of the controller, each layer's framebuffer lives at a fixed
+ offset from the video memory base address. In version 4, framebuffers are
+ configured with a direct memory address instead.
+
+properties:
+ compatible:
+ enum:
+ - xylon,logicvc-3.02.a-display
+ - xylon,logicvc-4.01.a-display
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ minItems: 1
+ maxItems: 4
+
+ clock-names:
+ minItems: 1
+ items:
+ # vclk is required and must be provided as first item.
+ - const: vclk
+ # Other clocks are optional and can be provided in any order.
+ - enum:
+ - vclk2
+ - lvdsclk
+ - lvdsclkn
+ - enum:
+ - vclk2
+ - lvdsclk
+ - lvdsclkn
+ - enum:
+ - vclk2
+ - lvdsclk
+ - lvdsclkn
+
+ interrupts:
+ maxItems: 1
+
+ memory-region:
+ maxItems: 1
+
+ xylon,display-interface:
+ enum:
+ # Parallel RGB interface (C_DISPLAY_INTERFACE == 0)
+ - parallel-rgb
+ # ITU-T BR656 interface (C_DISPLAY_INTERFACE == 1)
+ - bt656
+ # 4-bit LVDS interface (C_DISPLAY_INTERFACE == 2)
+ - lvds-4bits
+ # 3-bit LVDS interface (C_DISPLAY_INTERFACE == 4)
+ - lvds-3bits
+ # DVI interface (C_DISPLAY_INTERFACE == 5)
+ - dvi
+ description: Display output interface (C_DISPLAY_INTERFACE).
+
+ xylon,display-colorspace:
+ enum:
+ # RGB colorspace (C_DISPLAY_COLOR_SPACE == 0)
+ - rgb
+ # YUV 4:2:2 colorspace (C_DISPLAY_COLOR_SPACE == 1)
+ - yuv422
+ # YUV 4:4:4 colorspace (C_DISPLAY_COLOR_SPACE == 2)
+ - yuv444
+ description: Display output colorspace (C_DISPLAY_COLOR_SPACE).
+
+ xylon,display-depth:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Display output depth (C_PIXEL_DATA_WIDTH).
+
+ xylon,row-stride:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Fixed number of pixels in a framebuffer row (C_ROW_STRIDE).
+
+ xylon,dithering:
+ $ref: /schemas/types.yaml#/definitions/flag
+ description: Dithering module is enabled (C_XCOLOR).
+
+ xylon,background-layer:
+ $ref: /schemas/types.yaml#/definitions/flag
+ description: |
+ The last layer is used to display a black background (C_USE_BACKGROUND).
+ The layer must still be registered.
+
+ xylon,layers-configurable:
+ $ref: /schemas/types.yaml#/definitions/flag
+ description: |
+ Configuration of layers' size, position and offset is enabled
+ (C_USE_SIZE_POSITION).
+
+ layers:
+ type: object
+
+ properties:
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+ patternProperties:
+ "^layer@[0-9]+$":
+ type: object
+
+ properties:
+ reg:
+ maxItems: 1
+
+ xylon,layer-depth:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Layer depth (C_LAYER_X_DATA_WIDTH).
+
+ xylon,layer-colorspace:
+ enum:
+ # RGB colorspace (C_LAYER_X_TYPE == 0)
+ - rgb
+ # YUV packed colorspace (C_LAYER_X_TYPE == 1)
+ - yuv
+ description: Layer colorspace (C_LAYER_X_TYPE).
+
+ xylon,layer-alpha-mode:
+ enum:
+ # Alpha is configured layer-wide (C_LAYER_X_ALPHA_MODE == 0)
+ - layer
+ # Alpha is configured per-pixel (C_LAYER_X_ALPHA_MODE == 1)
+ - pixel
+ description: Alpha mode for the layer (C_LAYER_X_ALPHA_MODE).
+
+ xylon,layer-base-offset:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: |
+ Offset in number of lines (C_LAYER_X_OFFSET) starting from the
+ video RAM base (C_VMEM_BASEADDR), only for version 3.
+
+ xylon,layer-buffer-offset:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: |
+ Offset in number of lines (C_BUFFER_*_OFFSET) starting from the
+ layer base offset for the second buffer used in double-buffering.
+
+ xylon,layer-primary:
+ $ref: /schemas/types.yaml#/definitions/flag
+ description: |
+ Layer should be registered as a primary plane (exactly one is
+ required).
+
+ additionalProperties: false
+
+ required:
+ - reg
+ - xylon,layer-depth
+ - xylon,layer-colorspace
+ - xylon,layer-alpha-mode
+
+ required:
+ - "#address-cells"
+ - "#size-cells"
+ - layer@0
+
+ additionalProperties: false
+
+ description: |
+ The description of the display controller layers, containing layer
+ sub-nodes that each describe a registered layer.
+
+ port:
+ $ref: /schemas/graph.yaml#/properties/port
+ description: |
+ Video output port, typically connected to a panel or bridge.
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - interrupts
+ - xylon,display-interface
+ - xylon,display-colorspace
+ - xylon,display-depth
+ - xylon,row-stride
+ - layers
+ - port
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ logicvc: logicvc@43c00000 {
+ compatible = "xylon,logicvc-3.02.a", "syscon", "simple-mfd";
+ reg = <0x43c00000 0x6000>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ logicvc_display: display@0 {
+ compatible = "xylon,logicvc-3.02.a-display";
+ reg = <0x0 0x6000>;
+
+ memory-region = <&logicvc_cma>;
+
+ clocks = <&logicvc_vclk 0>, <&logicvc_lvdsclk 0>;
+ clock-names = "vclk", "lvdsclk";
+
+ interrupt-parent = <&intc>;
+ interrupts = <0 34 IRQ_TYPE_LEVEL_HIGH>;
+
+ xylon,display-interface = "lvds-4bits";
+ xylon,display-colorspace = "rgb";
+ xylon,display-depth = <16>;
+ xylon,row-stride = <1024>;
+
+ xylon,layers-configurable;
+
+ layers {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ layer@0 {
+ reg = <0>;
+ xylon,layer-depth = <16>;
+ xylon,layer-colorspace = "rgb";
+ xylon,layer-alpha-mode = "layer";
+ xylon,layer-base-offset = <0>;
+ xylon,layer-buffer-offset = <480>;
+ xylon,layer-primary;
+ };
+
+ layer@1 {
+ reg = <1>;
+ xylon,layer-depth = <16>;
+ xylon,layer-colorspace = "rgb";
+ xylon,layer-alpha-mode = "layer";
+ xylon,layer-base-offset = <2400>;
+ xylon,layer-buffer-offset = <480>;
+ };
+
+ layer@2 {
+ reg = <2>;
+ xylon,layer-depth = <16>;
+ xylon,layer-colorspace = "rgb";
+ xylon,layer-alpha-mode = "layer";
+ xylon,layer-base-offset = <960>;
+ xylon,layer-buffer-offset = <480>;
+ };
+
+ layer@3 {
+ reg = <3>;
+ xylon,layer-depth = <16>;
+ xylon,layer-colorspace = "rgb";
+ xylon,layer-alpha-mode = "layer";
+ xylon,layer-base-offset = <480>;
+ xylon,layer-buffer-offset = <480>;
+ };
+
+ layer@4 {
+ reg = <4>;
+ xylon,layer-depth = <16>;
+ xylon,layer-colorspace = "rgb";
+ xylon,layer-alpha-mode = "layer";
+ xylon,layer-base-offset = <8192>;
+ xylon,layer-buffer-offset = <480>;
+ };
+ };
+
+ port {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ logicvc_output: endpoint@0 {
+ reg = <0>;
+ remote-endpoint = <&panel_input>;
+ };
+ };
+ };
+ };
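For the version 3 example above, each layer's framebuffer address can be worked
out from the synthesis parameters. A sketch of the arithmetic, assuming one line
occupies row-stride pixels times depth/8 bytes (an assumption, not text from the
binding):

    /* xylon,row-stride = <1024> pixels, xylon,layer-depth = <16> -> 2 bytes/pixel */
    offset(layer@1) = 2400 lines * 1024 pixels/line * 2 bytes/pixel
                    = 4915200 bytes from C_VMEM_BASEADDR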
diff --git a/Documentation/devicetree/bindings/dma/allwinner,sun4i-a10-dma.yaml b/Documentation/devicetree/bindings/dma/allwinner,sun4i-a10-dma.yaml
index 83808199657b..02d5bd035409 100644
--- a/Documentation/devicetree/bindings/dma/allwinner,sun4i-a10-dma.yaml
+++ b/Documentation/devicetree/bindings/dma/allwinner,sun4i-a10-dma.yaml
@@ -4,14 +4,14 @@
$id: http://devicetree.org/schemas/dma/allwinner,sun4i-a10-dma.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A10 DMA Controller Device Tree Bindings
+title: Allwinner A10 DMA Controller
maintainers:
- Chen-Yu Tsai <wens@csie.org>
- Maxime Ripard <mripard@kernel.org>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
"#dma-cells":
diff --git a/Documentation/devicetree/bindings/dma/allwinner,sun50i-a64-dma.yaml b/Documentation/devicetree/bindings/dma/allwinner,sun50i-a64-dma.yaml
index b6e1ebfaf366..ec2d7a789ffe 100644
--- a/Documentation/devicetree/bindings/dma/allwinner,sun50i-a64-dma.yaml
+++ b/Documentation/devicetree/bindings/dma/allwinner,sun50i-a64-dma.yaml
@@ -4,14 +4,14 @@
$id: http://devicetree.org/schemas/dma/allwinner,sun50i-a64-dma.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A64 DMA Controller Device Tree Bindings
+title: Allwinner A64 DMA Controller
maintainers:
- Chen-Yu Tsai <wens@csie.org>
- Maxime Ripard <mripard@kernel.org>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
"#dma-cells":
@@ -20,9 +20,11 @@ properties:
compatible:
oneOf:
- - const: allwinner,sun50i-a64-dma
- - const: allwinner,sun50i-a100-dma
- - const: allwinner,sun50i-h6-dma
+ - enum:
+ - allwinner,sun20i-d1-dma
+ - allwinner,sun50i-a64-dma
+ - allwinner,sun50i-a100-dma
+ - allwinner,sun50i-h6-dma
- items:
- const: allwinner,sun8i-r40-dma
- const: allwinner,sun50i-a64-dma
@@ -58,13 +60,14 @@ if:
properties:
compatible:
enum:
+ - allwinner,sun20i-d1-dma
- allwinner,sun50i-a100-dma
- allwinner,sun50i-h6-dma
then:
properties:
clocks:
- maxItems: 2
+ minItems: 2
required:
- clock-names
diff --git a/Documentation/devicetree/bindings/dma/allwinner,sun6i-a31-dma.yaml b/Documentation/devicetree/bindings/dma/allwinner,sun6i-a31-dma.yaml
index a6df6f8b54db..5d554bcfab3d 100644
--- a/Documentation/devicetree/bindings/dma/allwinner,sun6i-a31-dma.yaml
+++ b/Documentation/devicetree/bindings/dma/allwinner,sun6i-a31-dma.yaml
@@ -4,14 +4,14 @@
$id: http://devicetree.org/schemas/dma/allwinner,sun6i-a31-dma.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A31 DMA Controller Device Tree Bindings
+title: Allwinner A31 DMA Controller
maintainers:
- Chen-Yu Tsai <wens@csie.org>
- Maxime Ripard <mripard@kernel.org>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
"#dma-cells":
diff --git a/Documentation/devicetree/bindings/dma/altr,msgdma.yaml b/Documentation/devicetree/bindings/dma/altr,msgdma.yaml
index b193ee2db4a7..391bf5838602 100644
--- a/Documentation/devicetree/bindings/dma/altr,msgdma.yaml
+++ b/Documentation/devicetree/bindings/dma/altr,msgdma.yaml
@@ -7,14 +7,14 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Altera mSGDMA IP core
maintainers:
- - Olivier Dautricourt <olivier.dautricourt@orolia.com>
+ - Olivier Dautricourt <olivierdautricourt@gmail.com>
description: |
Altera / Intel modular Scatter-Gather Direct Memory Access (mSGDMA)
intellectual property (IP)
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/dma/apple,admac.yaml b/Documentation/devicetree/bindings/dma/apple,admac.yaml
new file mode 100644
index 000000000000..ab193bc8bdbb
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/apple,admac.yaml
@@ -0,0 +1,91 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/apple,admac.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Apple Audio DMA Controller (ADMAC)
+
+description: |
+ Apple's Audio DMA Controller (ADMAC) is used to fetch and store audio samples
+ on SoCs from the "Apple Silicon" family.
+
+ The controller has been seen with up to 24 channels. Even-numbered channels
+ are TX-only, odd-numbered are RX-only. Individual channels are coupled to
+ fixed device endpoints.
+
+maintainers:
+ - Martin Povišer <povik+lin@cutebit.org>
+
+allOf:
+ - $ref: dma-controller.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - apple,t6000-admac
+ - apple,t8103-admac
+ - apple,t8112-admac
+ - const: apple,admac
+
+ reg:
+ maxItems: 1
+
+ '#dma-cells':
+ const: 1
+ description:
+ Clients specify a single cell with channel number.
+
+ dma-channels:
+ maximum: 24
+
+ interrupts:
+ minItems: 4
+ maxItems: 4
+ description:
+ Interrupts that correspond to the 4 IRQ outputs of the controller. Usually
+ only one of the controller outputs will be connected as an usable interrupt
+ source. The remaining interrupts will be left without a valid value, e.g.
+ in an interrupts-extended list the disconnected positions will contain
+ an empty phandle reference <0>.
+
+ iommus:
+ minItems: 1
+ maxItems: 2
+
+ power-domains:
+ maxItems: 1
+
+ resets:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - '#dma-cells'
+ - dma-channels
+ - interrupts
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/apple-aic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ aic: interrupt-controller {
+ interrupt-controller;
+ #interrupt-cells = <3>;
+ };
+
+ admac: dma-controller@238200000 {
+ compatible = "apple,t8103-admac", "apple,admac";
+ reg = <0x38200000 0x34000>;
+ dma-channels = <24>;
+ interrupts-extended = <0>,
+ <&aic AIC_IRQ 626 IRQ_TYPE_LEVEL_HIGH>,
+ <0>,
+ <0>;
+ #dma-cells = <1>;
+ };
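Since "#dma-cells" is 1, a client references the controller with just a channel
number; even channels transmit, odd channels receive. A minimal consumer sketch
(the node name and channel numbers are illustrative, not taken from a shipping
device tree):

    audio-controller@238400000 {
        dmas = <&admac 0>, <&admac 1>;  /* even channel: TX-only, odd channel: RX-only */
        dma-names = "tx", "rx";
    };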
diff --git a/Documentation/devicetree/bindings/dma/arm,pl330.yaml b/Documentation/devicetree/bindings/dma/arm,pl330.yaml
new file mode 100644
index 000000000000..4a3dd6f5309b
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/arm,pl330.yaml
@@ -0,0 +1,92 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/arm,pl330.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ARM PrimeCell PL330 DMA Controller
+
+maintainers:
+ - Vinod Koul <vkoul@kernel.org>
+
+description:
+ The ARM PrimeCell PL330 DMA controller can move blocks of memory contents
+ between memory and peripherals or memory to memory.
+
+# We need a select here so we don't match all nodes with 'arm,primecell'
+select:
+ properties:
+ compatible:
+ contains:
+ const: arm,pl330
+ required:
+ - compatible
+
+allOf:
+ - $ref: dma-controller.yaml#
+ - $ref: /schemas/arm/primecell.yaml#
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - arm,pl330
+ - const: arm,primecell
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ minItems: 1
+ maxItems: 32
+ description: A single combined interrupt or an interrupt per event
+
+ '#dma-cells':
+ const: 1
+ description: Contains the DMA request number for the consumer
+
+ arm,pl330-broken-no-flushp:
+ type: boolean
+ description: quirk to avoid executing the DMAFLUSHP instruction
+
+ arm,pl330-periph-burst:
+ type: boolean
+ description: quirk to use burst transfers only
+
+ dma-coherent: true
+
+ iommus:
+ minItems: 1
+ maxItems: 9
+ description: Up to 1 IOMMU entry per DMA channel for writes and 1
+ IOMMU entry for reads.
+
+ power-domains:
+ maxItems: 1
+
+ resets:
+ minItems: 1
+ maxItems: 2
+
+ reset-names:
+ minItems: 1
+ items:
+ - const: dma
+ - const: dma-ocp
+
+required:
+ - compatible
+ - reg
+ - interrupts
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ dma-controller@12680000 {
+ compatible = "arm,pl330", "arm,primecell";
+ reg = <0x12680000 0x1000>;
+ interrupts = <99>;
+ #dma-cells = <1>;
+ };
+...
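The client format documented in the removed arm-pl330.txt (deleted further
below) carries over unchanged: one request-number cell per entry, matched 1:1
with dma-names. A trimmed sketch with a hypothetical client node:

    serial@12c20000 {
        dmas = <&pdma0 12>, <&pdma1 11>;  /* request IDs wired to this client */
        dma-names = "tx", "rx";
    };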
diff --git a/Documentation/devicetree/bindings/dma/arm-pl08x.yaml b/Documentation/devicetree/bindings/dma/arm-pl08x.yaml
index 3bd9eea543ca..ab25ae63d2c3 100644
--- a/Documentation/devicetree/bindings/dma/arm-pl08x.yaml
+++ b/Documentation/devicetree/bindings/dma/arm-pl08x.yaml
@@ -10,7 +10,8 @@ maintainers:
- Vinod Koul <vkoul@kernel.org>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: /schemas/arm/primecell.yaml#
+ - $ref: dma-controller.yaml#
# We need a select here so we don't match all nodes with 'arm,primecell'
select:
@@ -89,6 +90,9 @@ properties:
- 64
description: bus width used for memcpy in bits. FTDMAC020 also accept 64 bits
+ resets:
+ maxItems: 1
+
required:
- reg
- interrupts
diff --git a/Documentation/devicetree/bindings/dma/arm-pl330.txt b/Documentation/devicetree/bindings/dma/arm-pl330.txt
deleted file mode 100644
index 315e90122afa..000000000000
--- a/Documentation/devicetree/bindings/dma/arm-pl330.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-* ARM PrimeCell PL330 DMA Controller
-
-The ARM PrimeCell PL330 DMA controller can move blocks of memory contents
-between memory and peripherals or memory to memory.
-
-Required properties:
- - compatible: should include both "arm,pl330" and "arm,primecell".
- - reg: physical base address of the controller and length of memory mapped
- region.
- - interrupts: interrupt number to the cpu.
-
-Optional properties:
- - dma-coherent : Present if dma operations are coherent
- - #dma-cells: must be <1>. used to represent the number of integer
- cells in the dmas property of client device.
- - dma-channels: contains the total number of DMA channels supported by the DMAC
- - dma-requests: contains the total number of DMA requests supported by the DMAC
- - arm,pl330-broken-no-flushp: quirk for avoiding to execute DMAFLUSHP
- - arm,pl330-periph-burst: quirk for performing burst transfer only
- - resets: contains an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
- - reset-names: must contain at least "dma", and optional is "dma-ocp".
-
-Example:
-
- pdma0: pdma@12680000 {
- compatible = "arm,pl330", "arm,primecell";
- reg = <0x12680000 0x1000>;
- interrupts = <99>;
- #dma-cells = <1>;
- #dma-channels = <8>;
- #dma-requests = <32>;
- };
-
-Client drivers (device nodes requiring dma transfers from dev-to-mem or
-mem-to-dev) should specify the DMA channel numbers and dma channel names
-as shown below.
-
- [property name] = <[phandle of the dma controller] [dma request id]>;
- [property name] = <[dma channel name]>
-
- where 'dma request id' is the dma request number which is connected
- to the client controller. The 'property name' 'dmas' and 'dma-names'
- as required by the generic dma device tree binding helpers. The dma
- names correspond 1:1 with the dma request ids in the dmas property.
-
- Example: dmas = <&pdma0 12
- &pdma1 11>;
- dma-names = "tx", "rx";
diff --git a/Documentation/devicetree/bindings/dma/dma-common.yaml b/Documentation/devicetree/bindings/dma/dma-common.yaml
index ad06d36af208..ea700f8ee6c6 100644
--- a/Documentation/devicetree/bindings/dma/dma-common.yaml
+++ b/Documentation/devicetree/bindings/dma/dma-common.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/dma/dma-common.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: DMA Engine Generic Binding
+title: DMA Engine Common Properties
maintainers:
- Vinod Koul <vkoul@kernel.org>
diff --git a/Documentation/devicetree/bindings/dma/dma-controller.yaml b/Documentation/devicetree/bindings/dma/dma-controller.yaml
index 0043b91da95e..04d150d4d15d 100644
--- a/Documentation/devicetree/bindings/dma/dma-controller.yaml
+++ b/Documentation/devicetree/bindings/dma/dma-controller.yaml
@@ -4,13 +4,13 @@
$id: http://devicetree.org/schemas/dma/dma-controller.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: DMA Controller Generic Binding
+title: DMA Controller Common Properties
maintainers:
- Vinod Koul <vkoul@kernel.org>
allOf:
- - $ref: "dma-common.yaml#"
+ - $ref: dma-common.yaml#
# Everything else is described in the common file
properties:
@@ -24,10 +24,10 @@ examples:
dma: dma-controller@48000000 {
compatible = "ti,omap-sdma";
reg = <0x48000000 0x1000>;
- interrupts = <0 12 0x4
- 0 13 0x4
- 0 14 0x4
- 0 15 0x4>;
+ interrupts = <0 12 0x4>,
+ <0 13 0x4>,
+ <0 14 0x4>,
+ <0 15 0x4>;
#dma-cells = <1>;
dma-channels = <32>;
dma-requests = <127>;
diff --git a/Documentation/devicetree/bindings/dma/dma-router.yaml b/Documentation/devicetree/bindings/dma/dma-router.yaml
index e72748496fd9..346fe0fa4460 100644
--- a/Documentation/devicetree/bindings/dma/dma-router.yaml
+++ b/Documentation/devicetree/bindings/dma/dma-router.yaml
@@ -4,13 +4,13 @@
$id: http://devicetree.org/schemas/dma/dma-router.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: DMA Router Generic Binding
+title: DMA Router Common Properties
maintainers:
- Vinod Koul <vkoul@kernel.org>
allOf:
- - $ref: "dma-common.yaml#"
+ - $ref: dma-common.yaml#
description:
DMA routers are transparent IP blocks used to route DMA request
@@ -24,6 +24,8 @@ properties:
dma-masters:
$ref: /schemas/types.yaml#/definitions/phandle-array
+ items:
+ maxItems: 1
description:
Array of phandles to the DMA controllers the router can direct
the signal to.
diff --git a/Documentation/devicetree/bindings/dma/fsl,edma.yaml b/Documentation/devicetree/bindings/dma/fsl,edma.yaml
new file mode 100644
index 000000000000..5fd8fc604261
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/fsl,edma.yaml
@@ -0,0 +1,155 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/fsl,edma.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale enhanced Direct Memory Access (eDMA) Controller
+
+description: |
+ The eDMA channels are multiplexed through programmable memory-mapped
+ registers. Channels are split into two groups, DMAMUX0 and DMAMUX1;
+ a specific DMA request source can only be multiplexed by the channels
+ of one group, either DMAMUX0 or DMAMUX1, but not both.
+
+maintainers:
+ - Peng Fan <peng.fan@nxp.com>
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - fsl,vf610-edma
+ - fsl,imx7ulp-edma
+ - items:
+ - const: fsl,ls1028a-edma
+ - const: fsl,vf610-edma
+
+ reg:
+ minItems: 2
+ maxItems: 3
+
+ interrupts:
+ minItems: 2
+ maxItems: 17
+
+ interrupt-names:
+ minItems: 2
+ maxItems: 17
+
+ "#dma-cells":
+ const: 2
+
+ dma-channels:
+ const: 32
+
+ clocks:
+ maxItems: 2
+
+ clock-names:
+ maxItems: 2
+
+ big-endian:
+ description: |
+ If present, the registers and the hardware scatter/gather descriptors
+ of the eDMA are implemented in big endian mode; otherwise, little
+ endian mode is assumed.
+ type: boolean
+
+required:
+ - "#dma-cells"
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - dma-channels
+
+allOf:
+ - $ref: dma-controller.yaml#
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: fsl,vf610-edma
+ then:
+ properties:
+ clock-names:
+ items:
+ - const: dmamux0
+ - const: dmamux1
+ interrupts:
+ maxItems: 2
+ interrupt-names:
+ items:
+ - const: edma-tx
+ - const: edma-err
+ reg:
+ maxItems: 3
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: fsl,imx7ulp-edma
+ then:
+ properties:
+ clock-names:
+ items:
+ - const: dma
+ - const: dmamux0
+ interrupts:
+ maxItems: 17
+ reg:
+ maxItems: 2
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/vf610-clock.h>
+
+ edma0: dma-controller@40018000 {
+ #dma-cells = <2>;
+ compatible = "fsl,vf610-edma";
+ reg = <0x40018000 0x2000>,
+ <0x40024000 0x1000>,
+ <0x40025000 0x1000>;
+ interrupts = <0 8 IRQ_TYPE_LEVEL_HIGH>,
+ <0 9 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "edma-tx", "edma-err";
+ dma-channels = <32>;
+ clock-names = "dmamux0", "dmamux1";
+ clocks = <&clks VF610_CLK_DMAMUX0>, <&clks VF610_CLK_DMAMUX1>;
+ };
+
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/imx7ulp-clock.h>
+
+ edma1: dma-controller@40080000 {
+ #dma-cells = <2>;
+ compatible = "fsl,imx7ulp-edma";
+ reg = <0x40080000 0x2000>,
+ <0x40210000 0x1000>;
+ dma-channels = <32>;
+ interrupts = <GIC_SPI 0 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 1 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 2 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 3 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 4 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 5 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 6 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 7 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 8 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 9 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 10 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 11 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 12 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 13 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 14 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 15 IRQ_TYPE_LEVEL_HIGH>,
+ /* last is eDMA2-ERR interrupt */
+ <GIC_SPI 16 IRQ_TYPE_LEVEL_HIGH>;
+ clock-names = "dma", "dmamux0";
+ clocks = <&pcc2 IMX7ULP_CLK_DMA1>, <&pcc2 IMX7ULP_CLK_DMA_MUX1>;
+ };
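Clients use the two-cell specifier described above: the first cell selects the
DMAMUX group, the second the request source (slot) ID. A trimmed version of the
SAI client example from the removed fsl-edma.txt (deleted further below):

    sai2: sai@40031000 {
        compatible = "fsl,vf610-sai";
        dmas = <&edma0 0 21>, <&edma0 0 20>;  /* DMAMUX0, slots 21 (tx) and 20 (rx) */
        dma-names = "tx", "rx";
    };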
diff --git a/Documentation/devicetree/bindings/dma/fsl,imx-sdma.yaml b/Documentation/devicetree/bindings/dma/fsl,imx-sdma.yaml
new file mode 100644
index 000000000000..b95dd8db5a30
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/fsl,imx-sdma.yaml
@@ -0,0 +1,149 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/fsl,imx-sdma.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale Smart Direct Memory Access (SDMA) Controller for i.MX
+
+maintainers:
+ - Joy Zou <joy.zou@nxp.com>
+
+allOf:
+ - $ref: dma-controller.yaml#
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - fsl,imx50-sdma
+ - fsl,imx51-sdma
+ - fsl,imx53-sdma
+ - fsl,imx6q-sdma
+ - fsl,imx7d-sdma
+ - const: fsl,imx35-sdma
+ - items:
+ - enum:
+ - fsl,imx6sx-sdma
+ - fsl,imx6sl-sdma
+ - const: fsl,imx6q-sdma
+ - items:
+ - const: fsl,imx6ul-sdma
+ - const: fsl,imx6q-sdma
+ - const: fsl,imx35-sdma
+ - items:
+ - const: fsl,imx6sll-sdma
+ - const: fsl,imx6ul-sdma
+ - items:
+ - const: fsl,imx8mq-sdma
+ - const: fsl,imx7d-sdma
+ - items:
+ - enum:
+ - fsl,imx8mp-sdma
+ - fsl,imx8mn-sdma
+ - fsl,imx8mm-sdma
+ - const: fsl,imx8mq-sdma
+ - items:
+ - enum:
+ - fsl,imx25-sdma
+ - fsl,imx31-sdma
+ - fsl,imx35-sdma
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ fsl,sdma-ram-script-name:
+ $ref: /schemas/types.yaml#/definitions/string
+ description: Should contain the full path of SDMA RAM scripts firmware.
+
+ "#dma-cells":
+ const: 3
+ description: |
+ The first cell: request/event ID
+
+ The second cell: peripheral type ID
+ enum:
+ - MCU domain SSI: 0
+ - Shared SSI: 1
+ - MMC: 2
+ - SDHC: 3
+ - MCU domain UART: 4
+ - Shared UART: 5
+ - FIRI: 6
+ - MCU domain CSPI: 7
+ - Shared CSPI: 8
+ - SIM: 9
+ - ATA: 10
+ - CCM: 11
+ - External peripheral: 12
+ - Memory Stick Host Controller: 13
+ - Shared Memory Stick Host Controller: 14
+ - DSP: 15
+ - Memory: 16
+ - FIFO type Memory: 17
+ - SPDIF: 18
+ - IPU Memory: 19
+ - ASRC: 20
+ - ESAI: 21
+ - SSI Dual FIFO: 22
+ description: needs firmware version 2 or higher
+ - Shared ASRC: 23
+ - SAI: 24
+ - HDMI Audio: 25
+
+ The third cell: transfer priority ID
+ enum:
+ - High: 0
+ - Medium: 1
+ - Low: 2
+
+ gpr:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: The phandle to the General Purpose Register (GPR) node
+
+ fsl,sdma-event-remap:
+ $ref: /schemas/types.yaml#/definitions/uint32-matrix
+ maxItems: 2
+ items:
+ items:
+ - description: GPR register offset
+ - description: GPR register shift
+ - description: GPR register value
+ description: |
+ Register bits of sdma event remap, the format is <reg shift val>.
+ The order is <RX>, <TX>.
+
+ clocks:
+ maxItems: 2
+
+ clock-names:
+ items:
+ - const: ipg
+ - const: ahb
+
+ iram:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: The phandle to the On-chip RAM (OCRAM) node.
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - fsl,sdma-ram-script-name
+
+additionalProperties: false
+
+examples:
+ - |
+ sdma: dma-controller@83fb0000 {
+ compatible = "fsl,imx51-sdma", "fsl,imx35-sdma";
+ reg = <0x83fb0000 0x4000>;
+ interrupts = <6>;
+ #dma-cells = <3>;
+ fsl,sdma-ram-script-name = "sdma-imx51.bin";
+ };
+
+...
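A consumer fills in the three cells documented above: request/event ID,
peripheral type, and priority. The SSI client from the removed fsl-imx-sdma.txt
(deleted further below), trimmed to the DMA properties; events 24/25 with type 1
(Shared SSI) and priority 0 (High):

    ssi2: ssi@70014000 {
        compatible = "fsl,imx51-ssi", "fsl,imx21-ssi";
        dmas = <&sdma 24 1 0>, <&sdma 25 1 0>;
        dma-names = "rx", "tx";
    };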
diff --git a/Documentation/devicetree/bindings/dma/fsl,mxs-dma.yaml b/Documentation/devicetree/bindings/dma/fsl,mxs-dma.yaml
new file mode 100644
index 000000000000..add9c77e8b52
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/fsl,mxs-dma.yaml
@@ -0,0 +1,80 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/fsl,mxs-dma.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Freescale Direct Memory Access (DMA) Controller from i.MX23/i.MX28
+
+maintainers:
+ - Marek Vasut <marex@denx.de>
+
+allOf:
+ - $ref: dma-controller.yaml#
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - fsl,imx6q-dma-apbh
+ - fsl,imx6sx-dma-apbh
+ - fsl,imx7d-dma-apbh
+ - const: fsl,imx28-dma-apbh
+ - enum:
+ - fsl,imx23-dma-apbh
+ - fsl,imx23-dma-apbx
+ - fsl,imx28-dma-apbh
+ - fsl,imx28-dma-apbx
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ interrupts:
+ minItems: 4
+ maxItems: 16
+
+ "#dma-cells":
+ const: 1
+
+ dma-channels:
+ enum: [4, 8, 16]
+
+required:
+ - compatible
+ - reg
+ - "#dma-cells"
+ - dma-channels
+ - interrupts
+
+additionalProperties: false
+
+examples:
+ - |
+ interrupt-parent = <&irqc>;
+
+ dma-controller@80004000 {
+ compatible = "fsl,imx28-dma-apbh";
+ reg = <0x80004000 0x2000>;
+ interrupts = <82 83 84 85
+ 88 88 88 88
+ 88 88 88 88
+ 87 86 0 0>;
+ #dma-cells = <1>;
+ dma-channels = <16>;
+ };
+
+ dma-controller@80024000 {
+ compatible = "fsl,imx28-dma-apbx";
+ reg = <0x80024000 0x2000>;
+ interrupts = <78 79 66 0
+ 80 81 68 69
+ 70 71 72 73
+ 74 75 76 77>;
+ #dma-cells = <1>;
+ dma-channels = <16>;
+ };
+
+...
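Clients reference a channel with the single cell defined by "#dma-cells". A
trimmed version of the AUART client from the removed fsl-mxs-dma.txt (deleted
further below):

    auart0: serial@8006a000 {
        compatible = "fsl,imx28-auart", "fsl,imx23-auart";
        dmas = <&dma_apbx 8>, <&dma_apbx 9>;  /* channels 8 (rx) and 9 (tx) */
        dma-names = "rx", "tx";
    };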
diff --git a/Documentation/devicetree/bindings/dma/fsl-edma.txt b/Documentation/devicetree/bindings/dma/fsl-edma.txt
deleted file mode 100644
index ee1754739b4b..000000000000
--- a/Documentation/devicetree/bindings/dma/fsl-edma.txt
+++ /dev/null
@@ -1,111 +0,0 @@
-* Freescale enhanced Direct Memory Access(eDMA) Controller
-
- The eDMA channels have multiplex capability by programmble memory-mapped
-registers. channels are split into two groups, called DMAMUX0 and DMAMUX1,
-specific DMA request source can only be multiplexed by any channel of certain
-group, DMAMUX0 or DMAMUX1, but not both.
-
-* eDMA Controller
-Required properties:
-- compatible :
- - "fsl,vf610-edma" for eDMA used similar to that on Vybrid vf610 SoC
- - "fsl,imx7ulp-edma" for eDMA2 used similar to that on i.mx7ulp
- - "fsl,ls1028a-edma" followed by "fsl,vf610-edma" for eDMA used on the
- LS1028A SoC.
-- reg : Specifies base physical address(s) and size of the eDMA registers.
- The 1st region is eDMA control register's address and size.
- The 2nd and the 3rd regions are programmable channel multiplexing
- control register's address and size.
-- interrupts : A list of interrupt-specifiers, one for each entry in
- interrupt-names on vf610 similar SoC. But for i.mx7ulp per channel
- per transmission interrupt, total 16 channel interrupt and 1
- error interrupt(located in the last), no interrupt-names list on
- i.mx7ulp for clean on dts.
-- #dma-cells : Must be <2>.
- The 1st cell specifies the DMAMUX(0 for DMAMUX0 and 1 for DMAMUX1).
- Specific request source can only be multiplexed by specific channels
- group called DMAMUX.
- The 2nd cell specifies the request source(slot) ID.
- See the SoC's reference manual for all the supported request sources.
-- dma-channels : Number of channels supported by the controller
-- clock-names : A list of channel group clock names. Should contain:
- "dmamux0" - clock name of mux0 group
- "dmamux1" - clock name of mux1 group
- Note: No dmamux0 on i.mx7ulp, but another 'dma' clk added on i.mx7ulp.
-- clocks : A list of phandle and clock-specifier pairs, one for each entry in
- clock-names.
-
-Optional properties:
-- big-endian: If present registers and hardware scatter/gather descriptors
- of the eDMA are implemented in big endian mode, otherwise in little
- mode.
-- interrupt-names : Should contain the below on vf610 similar SoC but not used
- on i.mx7ulp similar SoC:
- "edma-tx" - the transmission interrupt
- "edma-err" - the error interrupt
-
-
-Examples:
-
-edma0: dma-controller@40018000 {
- #dma-cells = <2>;
- compatible = "fsl,vf610-edma";
- reg = <0x40018000 0x2000>,
- <0x40024000 0x1000>,
- <0x40025000 0x1000>;
- interrupts = <0 8 IRQ_TYPE_LEVEL_HIGH>,
- <0 9 IRQ_TYPE_LEVEL_HIGH>;
- interrupt-names = "edma-tx", "edma-err";
- dma-channels = <32>;
- clock-names = "dmamux0", "dmamux1";
- clocks = <&clks VF610_CLK_DMAMUX0>,
- <&clks VF610_CLK_DMAMUX1>;
-}; /* vf610 */
-
-edma1: dma-controller@40080000 {
- #dma-cells = <2>;
- compatible = "fsl,imx7ulp-edma";
- reg = <0x40080000 0x2000>,
- <0x40210000 0x1000>;
- dma-channels = <32>;
- interrupts = <GIC_SPI 0 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 1 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 2 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 3 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 4 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 5 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 6 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 7 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 8 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 9 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 10 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 11 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 12 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 13 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 14 IRQ_TYPE_LEVEL_HIGH>,
- <GIC_SPI 15 IRQ_TYPE_LEVEL_HIGH>,
- /* last is eDMA2-ERR interrupt */
- <GIC_SPI 16 IRQ_TYPE_LEVEL_HIGH>;
- clock-names = "dma", "dmamux0";
- clocks = <&pcc2 IMX7ULP_CLK_DMA1>,
- <&pcc2 IMX7ULP_CLK_DMA_MUX1>;
-}; /* i.mx7ulp */
-
-* DMA clients
-DMA client drivers that uses the DMA function must use the format described
-in the dma.txt file, using a two-cell specifier for each channel: the 1st
-specifies the channel group(DMAMUX) in which this request can be multiplexed,
-and the 2nd specifies the request source.
-
-Examples:
-
-sai2: sai@40031000 {
- compatible = "fsl,vf610-sai";
- reg = <0x40031000 0x1000>;
- interrupts = <0 86 IRQ_TYPE_LEVEL_HIGH>;
- clock-names = "sai";
- clocks = <&clks VF610_CLK_SAI2>;
- dma-names = "tx", "rx";
- dmas = <&edma0 0 21>,
- <&edma0 0 20>;
-};
diff --git a/Documentation/devicetree/bindings/dma/fsl-imx-dma.txt b/Documentation/devicetree/bindings/dma/fsl-imx-dma.txt
index 7bd8847d6394..1c9929d53727 100644
--- a/Documentation/devicetree/bindings/dma/fsl-imx-dma.txt
+++ b/Documentation/devicetree/bindings/dma/fsl-imx-dma.txt
@@ -13,8 +13,10 @@ Required properties:
- #dma-cells : Has to be 1. imx-dma does not support anything else.
Optional properties:
-- #dma-channels : Number of DMA channels supported. Should be 16.
-- #dma-requests : Number of DMA requests supported.
+- dma-channels : Number of DMA channels supported. Should be 16.
+- #dma-channels : deprecated
+- dma-requests : Number of DMA requests supported.
+- #dma-requests : deprecated
Example:
@@ -23,7 +25,7 @@ Example:
reg = <0x10001000 0x1000>;
interrupts = <32 33>;
#dma-cells = <1>;
- #dma-channels = <16>;
+ dma-channels = <16>;
};
diff --git a/Documentation/devicetree/bindings/dma/fsl-imx-sdma.txt b/Documentation/devicetree/bindings/dma/fsl-imx-sdma.txt
deleted file mode 100644
index 12c316ff4834..000000000000
--- a/Documentation/devicetree/bindings/dma/fsl-imx-sdma.txt
+++ /dev/null
@@ -1,118 +0,0 @@
-* Freescale Smart Direct Memory Access (SDMA) Controller for i.MX
-
-Required properties:
-- compatible : Should be one of
- "fsl,imx25-sdma"
- "fsl,imx31-sdma", "fsl,imx31-to1-sdma", "fsl,imx31-to2-sdma"
- "fsl,imx35-sdma", "fsl,imx35-to1-sdma", "fsl,imx35-to2-sdma"
- "fsl,imx51-sdma"
- "fsl,imx53-sdma"
- "fsl,imx6q-sdma"
- "fsl,imx7d-sdma"
- "fsl,imx6ul-sdma"
- "fsl,imx8mq-sdma"
- "fsl,imx8mm-sdma"
- "fsl,imx8mn-sdma"
- "fsl,imx8mp-sdma"
- The -to variants should be preferred since they allow to determine the
- correct ROM script addresses needed for the driver to work without additional
- firmware.
-- reg : Should contain SDMA registers location and length
-- interrupts : Should contain SDMA interrupt
-- #dma-cells : Must be <3>.
- The first cell specifies the DMA request/event ID. See details below
- about the second and third cell.
-- fsl,sdma-ram-script-name : Should contain the full path of SDMA RAM
- scripts firmware
-
-The second cell of dma phandle specifies the peripheral type of DMA transfer.
-The full ID of peripheral types can be found below.
-
- ID transfer type
- ---------------------
- 0 MCU domain SSI
- 1 Shared SSI
- 2 MMC
- 3 SDHC
- 4 MCU domain UART
- 5 Shared UART
- 6 FIRI
- 7 MCU domain CSPI
- 8 Shared CSPI
- 9 SIM
- 10 ATA
- 11 CCM
- 12 External peripheral
- 13 Memory Stick Host Controller
- 14 Shared Memory Stick Host Controller
- 15 DSP
- 16 Memory
- 17 FIFO type Memory
- 18 SPDIF
- 19 IPU Memory
- 20 ASRC
- 21 ESAI
- 22 SSI Dual FIFO (needs firmware ver >= 2)
- 23 Shared ASRC
- 24 SAI
-
-The third cell specifies the transfer priority as below.
-
- ID transfer priority
- -------------------------
- 0 High
- 1 Medium
- 2 Low
-
-Optional properties:
-
-- gpr : The phandle to the General Purpose Register (GPR) node.
-- fsl,sdma-event-remap : Register bits of sdma event remap, the format is
- <reg shift val>.
- reg is the GPR register offset.
- shift is the bit position inside the GPR register.
- val is the value of the bit (0 or 1).
-
-Examples:
-
-sdma@83fb0000 {
- compatible = "fsl,imx51-sdma", "fsl,imx35-sdma";
- reg = <0x83fb0000 0x4000>;
- interrupts = <6>;
- #dma-cells = <3>;
- fsl,sdma-ram-script-name = "sdma-imx51.bin";
-};
-
-DMA clients connected to the i.MX SDMA controller must use the format
-described in the dma.txt file.
-
-Examples:
-
-ssi2: ssi@70014000 {
- compatible = "fsl,imx51-ssi", "fsl,imx21-ssi";
- reg = <0x70014000 0x4000>;
- interrupts = <30>;
- clocks = <&clks 49>;
- dmas = <&sdma 24 1 0>,
- <&sdma 25 1 0>;
- dma-names = "rx", "tx";
- fsl,fifo-depth = <15>;
-};
-
-Using the fsl,sdma-event-remap property:
-
-If we want to use SDMA on the SAI1 port on a MX6SX:
-
-&sdma {
- gpr = <&gpr>;
- /* SDMA events remap for SAI1_RX and SAI1_TX */
- fsl,sdma-event-remap = <0 15 1>, <0 16 1>;
-};
-
-The fsl,sdma-event-remap property in this case has two values:
-- <0 15 1> means that the offset is 0, so GPR0 is the register of the
-SDMA remap. Bit 15 of GPR0 selects between UART4_RX and SAI1_RX.
-Setting bit 15 to 1 selects SAI1_RX.
-- <0 16 1> means that the offset is 0, so GPR0 is the register of the
-SDMA remap. Bit 16 of GPR0 selects between UART4_TX and SAI1_TX.
-Setting bit 16 to 1 selects SAI1_TX.
diff --git a/Documentation/devicetree/bindings/dma/fsl-mxs-dma.txt b/Documentation/devicetree/bindings/dma/fsl-mxs-dma.txt
deleted file mode 100644
index e30e184f50c7..000000000000
--- a/Documentation/devicetree/bindings/dma/fsl-mxs-dma.txt
+++ /dev/null
@@ -1,60 +0,0 @@
-* Freescale MXS DMA
-
-Required properties:
-- compatible : Should be "fsl,<chip>-dma-apbh" or "fsl,<chip>-dma-apbx"
-- reg : Should contain registers location and length
-- interrupts : Should contain the interrupt numbers of DMA channels.
- If a channel is empty/reserved, 0 should be filled in place.
-- #dma-cells : Must be <1>. The number cell specifies the channel ID.
-- dma-channels : Number of channels supported by the DMA controller
-
-Optional properties:
-- interrupt-names : Name of DMA channel interrupts
-
-Supported chips:
-imx23, imx28.
-
-Examples:
-
-dma_apbh: dma-apbh@80004000 {
- compatible = "fsl,imx28-dma-apbh";
- reg = <0x80004000 0x2000>;
- interrupts = <82 83 84 85
- 88 88 88 88
- 88 88 88 88
- 87 86 0 0>;
- interrupt-names = "ssp0", "ssp1", "ssp2", "ssp3",
- "gpmi0", "gmpi1", "gpmi2", "gmpi3",
- "gpmi4", "gmpi5", "gpmi6", "gmpi7",
- "hsadc", "lcdif", "empty", "empty";
- #dma-cells = <1>;
- dma-channels = <16>;
-};
-
-dma_apbx: dma-apbx@80024000 {
- compatible = "fsl,imx28-dma-apbx";
- reg = <0x80024000 0x2000>;
- interrupts = <78 79 66 0
- 80 81 68 69
- 70 71 72 73
- 74 75 76 77>;
- interrupt-names = "auart4-rx", "auart4-tx", "spdif-tx", "empty",
- "saif0", "saif1", "i2c0", "i2c1",
- "auart0-rx", "auart0-tx", "auart1-rx", "auart1-tx",
- "auart2-rx", "auart2-tx", "auart3-rx", "auart3-tx";
- #dma-cells = <1>;
- dma-channels = <16>;
-};
-
-DMA clients connected to the MXS DMA controller must use the format
-described in the dma.txt file.
-
-Examples:
-
-auart0: serial@8006a000 {
- compatible = "fsl,imx28-auart", "fsl,imx23-auart";
- reg = <0x8006a000 0x2000>;
- interrupts = <112>;
- dmas = <&dma_apbx 8>, <&dma_apbx 9>;
- dma-names = "rx", "tx";
-};
diff --git a/Documentation/devicetree/bindings/dma/ingenic,dma.yaml b/Documentation/devicetree/bindings/dma/ingenic,dma.yaml
index ac4d59494fc8..37400496e086 100644
--- a/Documentation/devicetree/bindings/dma/ingenic,dma.yaml
+++ b/Documentation/devicetree/bindings/dma/ingenic,dma.yaml
@@ -4,25 +4,34 @@
$id: http://devicetree.org/schemas/dma/ingenic,dma.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Ingenic SoCs DMA Controller DT bindings
+title: Ingenic SoCs DMA Controller
maintainers:
- Paul Cercueil <paul@crapouillou.net>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
- enum:
- - ingenic,jz4740-dma
- - ingenic,jz4725b-dma
- - ingenic,jz4760-dma
- - ingenic,jz4760b-dma
- - ingenic,jz4770-dma
- - ingenic,jz4780-dma
- - ingenic,x1000-dma
- - ingenic,x1830-dma
+ oneOf:
+ - enum:
+ - ingenic,jz4740-dma
+ - ingenic,jz4725b-dma
+ - ingenic,jz4755-dma
+ - ingenic,jz4760-dma
+ - ingenic,jz4760-bdma
+ - ingenic,jz4760-mdma
+ - ingenic,jz4760b-dma
+ - ingenic,jz4760b-bdma
+ - ingenic,jz4760b-mdma
+ - ingenic,jz4770-dma
+ - ingenic,jz4780-dma
+ - ingenic,x1000-dma
+ - ingenic,x1830-dma
+ - items:
+ - const: ingenic,jz4770-bdma
+ - const: ingenic,jz4760b-bdma
reg:
items:
@@ -36,13 +45,19 @@ properties:
maxItems: 1
"#dma-cells":
- const: 2
+ enum: [2, 3]
description: >
DMA clients must use the format described in dma.txt, giving a phandle
- to the DMA controller plus the following 2 integer cells:
+ to the DMA controller plus the following integer cells:
- - Request type: The DMA request type for transfers to/from the
- device on the allocated channel, as defined in the SoC documentation.
+ - Request type: The DMA request type specifies the device endpoint that
+ will be the source or destination of the DMA transfer.
+ If "#dma-cells" is 2, the request type is a single cell, and the
+ direction will be unidirectional (either RX or TX but not both).
+ If "#dma-cells" is 3, the request type has two cells; the first
+ one corresponds to the host to device direction (TX), the second one
+ corresponds to the device to host direction (RX). The DMA channel is
+ then bidirectional.
- Channel: If set to 0xffffffff, any available channel will be allocated
for the client. Otherwise, the exact channel specified will be used.
@@ -68,7 +83,7 @@ unevaluatedProperties: false
examples:
- |
- #include <dt-bindings/clock/jz4780-cgu.h>
+ #include <dt-bindings/clock/ingenic,jz4780-cgu.h>
dma: dma-controller@13420000 {
compatible = "ingenic,jz4780-dma";
reg = <0x13420000 0x400>, <0x13421000 0x40>;
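With the new three-cell form, a bidirectional client passes the TX request
type, the RX request type, and the channel (0xffffffff for any). A hedged
sketch; the node and the request-type values are SoC-specific placeholders,
not taken from the binding:

    mmc@13450000 {
        /* #dma-cells = <3>: TX type, RX type, channel */
        dmas = <&dma 0x1a 0x19 0xffffffff>;
        dma-names = "tx-rx";  /* one bidirectional channel */
    };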
diff --git a/Documentation/devicetree/bindings/dma/intel,ldma.yaml b/Documentation/devicetree/bindings/dma/intel,ldma.yaml
index a5c4be783593..d6bb553a2c6f 100644
--- a/Documentation/devicetree/bindings/dma/intel,ldma.yaml
+++ b/Documentation/devicetree/bindings/dma/intel,ldma.yaml
@@ -11,7 +11,7 @@ maintainers:
- mallikarjunax.reddy@intel.com
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/dma/mediatek,uart-dma.yaml b/Documentation/devicetree/bindings/dma/mediatek,uart-dma.yaml
new file mode 100644
index 000000000000..dab468a88942
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/mediatek,uart-dma.yaml
@@ -0,0 +1,124 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/mediatek,uart-dma.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek UART APDMA controller
+
+maintainers:
+ - Long Cheng <long.cheng@mediatek.com>
+
+description: |
+ The MediaTek UART APDMA controller provides DMA capabilities
+ for the UART peripheral bus.
+
+allOf:
+ - $ref: dma-controller.yaml#
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - mediatek,mt2712-uart-dma
+ - mediatek,mt6795-uart-dma
+ - mediatek,mt8365-uart-dma
+ - mediatek,mt8516-uart-dma
+ - const: mediatek,mt6577-uart-dma
+ - enum:
+ - mediatek,mt6577-uart-dma
+
+ reg:
+ minItems: 1
+ maxItems: 16
+
+ interrupts:
+ description: |
+ TX, RX interrupt lines for each UART APDMA channel
+ minItems: 1
+ maxItems: 16
+
+ clocks:
+ description: Must contain one entry for the APDMA main clock
+ maxItems: 1
+
+ clock-names:
+ const: apdma
+
+ "#dma-cells":
+ const: 1
+ description: |
+ The first cell specifies the UART APDMA channel number
+
+ dma-requests:
+ description: |
+ Number of virtual channels of the UART APDMA controller
+ maximum: 16
+
+ mediatek,dma-33bits:
+ type: boolean
+ description: Enable 33-bits UART APDMA support
+
+required:
+ - compatible
+ - reg
+ - interrupts
+
+additionalProperties: false
+
+if:
+ not:
+ required:
+ - dma-requests
+then:
+ properties:
+ interrupts:
+ maxItems: 8
+ reg:
+ maxItems: 8
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/mt2712-clk.h>
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ apdma: dma-controller@11000400 {
+ compatible = "mediatek,mt2712-uart-dma",
+ "mediatek,mt6577-uart-dma";
+ reg = <0 0x11000400 0 0x80>,
+ <0 0x11000480 0 0x80>,
+ <0 0x11000500 0 0x80>,
+ <0 0x11000580 0 0x80>,
+ <0 0x11000600 0 0x80>,
+ <0 0x11000680 0 0x80>,
+ <0 0x11000700 0 0x80>,
+ <0 0x11000780 0 0x80>,
+ <0 0x11000800 0 0x80>,
+ <0 0x11000880 0 0x80>,
+ <0 0x11000900 0 0x80>,
+ <0 0x11000980 0 0x80>;
+ interrupts = <GIC_SPI 103 IRQ_TYPE_LEVEL_LOW>,
+ <GIC_SPI 104 IRQ_TYPE_LEVEL_LOW>,
+ <GIC_SPI 105 IRQ_TYPE_LEVEL_LOW>,
+ <GIC_SPI 106 IRQ_TYPE_LEVEL_LOW>,
+ <GIC_SPI 107 IRQ_TYPE_LEVEL_LOW>,
+ <GIC_SPI 108 IRQ_TYPE_LEVEL_LOW>,
+ <GIC_SPI 109 IRQ_TYPE_LEVEL_LOW>,
+ <GIC_SPI 110 IRQ_TYPE_LEVEL_LOW>,
+ <GIC_SPI 111 IRQ_TYPE_LEVEL_LOW>,
+ <GIC_SPI 112 IRQ_TYPE_LEVEL_LOW>,
+ <GIC_SPI 113 IRQ_TYPE_LEVEL_LOW>,
+ <GIC_SPI 114 IRQ_TYPE_LEVEL_LOW>;
+ dma-requests = <12>;
+ clocks = <&pericfg CLK_PERI_AP_DMA>;
+ clock-names = "apdma";
+ mediatek,dma-33bits;
+ #dma-cells = <1>;
+ };
+ };
+
+...
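Since "#dma-cells" is 1, a UART client selects each virtual channel with a
single cell; a sketch, assuming channels 0/1 are the TX/RX pair of the first
UART (the actual channel assignment is SoC-specific):

    dmas = <&apdma 0>, <&apdma 1>;
    dma-names = "tx", "rx";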
diff --git a/Documentation/devicetree/bindings/dma/mmp-dma.txt b/Documentation/devicetree/bindings/dma/mmp-dma.txt
index 8f7364a7b349..ec18bf0a802a 100644
--- a/Documentation/devicetree/bindings/dma/mmp-dma.txt
+++ b/Documentation/devicetree/bindings/dma/mmp-dma.txt
@@ -10,10 +10,12 @@ Required properties:
or one irq for pdma device
Optional properties:
-- #dma-channels: Number of DMA channels supported by the controller (defaults
+- dma-channels: Number of DMA channels supported by the controller (defaults
to 32 when not specified)
-- #dma-requests: Number of DMA requestor lines supported by the controller
+- #dma-channels: deprecated
+- dma-requests: Number of DMA requestor lines supported by the controller
(defaults to 32 when not specified)
+- #dma-requests: deprecated
"marvell,pdma-1.0"
Used platforms: pxa25x, pxa27x, pxa3xx, pxa93x, pxa168, pxa910, pxa688.
@@ -33,7 +35,7 @@ pdma: dma-controller@d4000000 {
reg = <0xd4000000 0x10000>;
interrupts = <0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15>;
interrupt-parent = <&intcmux32>;
- #dma-channels = <16>;
+ dma-channels = <16>;
};
/*
@@ -45,7 +47,7 @@ pdma: dma-controller@d4000000 {
compatible = "marvell,pdma-1.0";
reg = <0xd4000000 0x10000>;
interrupts = <47>;
- #dma-channels = <16>;
+ dma-channels = <16>;
};
diff --git a/Documentation/devicetree/bindings/dma/moxa,moxart-dma.txt b/Documentation/devicetree/bindings/dma/moxa,moxart-dma.txt
index 8a9f3559335b..7e14e26676ec 100644
--- a/Documentation/devicetree/bindings/dma/moxa,moxart-dma.txt
+++ b/Documentation/devicetree/bindings/dma/moxa,moxart-dma.txt
@@ -34,8 +34,8 @@ Example:
Use specific request line passing from dma
For example, MMC request line is 5
- sdhci: sdhci@98e00000 {
- compatible = "moxa,moxart-sdhci";
+ mmc: mmc@98e00000 {
+ compatible = "moxa,moxart-mmc";
reg = <0x98e00000 0x5C>;
interrupts = <5 0>;
clocks = <&clk_apb>;
diff --git a/Documentation/devicetree/bindings/dma/mtk-uart-apdma.txt b/Documentation/devicetree/bindings/dma/mtk-uart-apdma.txt
deleted file mode 100644
index fef9c1eeb264..000000000000
--- a/Documentation/devicetree/bindings/dma/mtk-uart-apdma.txt
+++ /dev/null
@@ -1,56 +0,0 @@
-* Mediatek UART APDMA Controller
-
-Required properties:
-- compatible should contain:
- * "mediatek,mt2712-uart-dma" for MT2712 compatible APDMA
- * "mediatek,mt6577-uart-dma" for MT6577 and all of the above
- * "mediatek,mt8516-uart-dma", "mediatek,mt6577" for MT8516 SoC
-
-- reg: The base address of the APDMA register bank.
-
-- interrupts: A single interrupt specifier.
- One interrupt per dma-requests, or 8 if no dma-requests property is present
-
-- dma-requests: The number of DMA channels
-
-- clocks : Must contain an entry for each entry in clock-names.
- See ../clocks/clock-bindings.txt for details.
-- clock-names: The APDMA clock for register accesses
-
-- mediatek,dma-33bits: Present if the DMA requires support
-
-Examples:
-
- apdma: dma-controller@11000400 {
- compatible = "mediatek,mt2712-uart-dma",
- "mediatek,mt6577-uart-dma";
- reg = <0 0x11000400 0 0x80>,
- <0 0x11000480 0 0x80>,
- <0 0x11000500 0 0x80>,
- <0 0x11000580 0 0x80>,
- <0 0x11000600 0 0x80>,
- <0 0x11000680 0 0x80>,
- <0 0x11000700 0 0x80>,
- <0 0x11000780 0 0x80>,
- <0 0x11000800 0 0x80>,
- <0 0x11000880 0 0x80>,
- <0 0x11000900 0 0x80>,
- <0 0x11000980 0 0x80>;
- interrupts = <GIC_SPI 103 IRQ_TYPE_LEVEL_LOW>,
- <GIC_SPI 104 IRQ_TYPE_LEVEL_LOW>,
- <GIC_SPI 105 IRQ_TYPE_LEVEL_LOW>,
- <GIC_SPI 106 IRQ_TYPE_LEVEL_LOW>,
- <GIC_SPI 107 IRQ_TYPE_LEVEL_LOW>,
- <GIC_SPI 108 IRQ_TYPE_LEVEL_LOW>,
- <GIC_SPI 109 IRQ_TYPE_LEVEL_LOW>,
- <GIC_SPI 110 IRQ_TYPE_LEVEL_LOW>,
- <GIC_SPI 111 IRQ_TYPE_LEVEL_LOW>,
- <GIC_SPI 112 IRQ_TYPE_LEVEL_LOW>,
- <GIC_SPI 113 IRQ_TYPE_LEVEL_LOW>,
- <GIC_SPI 114 IRQ_TYPE_LEVEL_LOW>;
- dma-requests = <12>;
- clocks = <&pericfg CLK_PERI_AP_DMA>;
- clock-names = "apdma";
- mediatek,dma-33bits;
- #dma-cells = <1>;
- };
diff --git a/Documentation/devicetree/bindings/dma/nvidia,tegra186-gpc-dma.yaml b/Documentation/devicetree/bindings/dma/nvidia,tegra186-gpc-dma.yaml
new file mode 100644
index 000000000000..a790e5687844
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/nvidia,tegra186-gpc-dma.yaml
@@ -0,0 +1,117 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/nvidia,tegra186-gpc-dma.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra GPC DMA Controller
+
+description: |
+ The Tegra General Purpose Central (GPC) DMA controller is used for faster
+ data transfers between memory to memory, memory to device and device to
+ memory.
+
+maintainers:
+ - Jon Hunter <jonathanh@nvidia.com>
+ - Rajesh Gumasta <rgumasta@nvidia.com>
+
+allOf:
+ - $ref: dma-controller.yaml#
+
+properties:
+ compatible:
+ oneOf:
+ - const: nvidia,tegra186-gpcdma
+ - items:
+ - enum:
+ - nvidia,tegra234-gpcdma
+ - nvidia,tegra194-gpcdma
+ - const: nvidia,tegra186-gpcdma
+
+ "#dma-cells":
+ const: 1
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ description:
+ Should contain all of the per-channel DMA interrupts in
+ ascending order with respect to the DMA channel index.
+ minItems: 1
+ maxItems: 32
+
+ resets:
+ maxItems: 1
+
+ reset-names:
+ const: gpcdma
+
+ iommus:
+ maxItems: 1
+
+ dma-coherent: true
+
+ dma-channel-mask:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - resets
+ - reset-names
+ - "#dma-cells"
+ - iommus
+ - dma-channel-mask
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/memory/tegra186-mc.h>
+ #include <dt-bindings/reset/tegra186-reset.h>
+
+ dma-controller@2600000 {
+ compatible = "nvidia,tegra186-gpcdma";
+ reg = <0x2600000 0x210000>;
+ resets = <&bpmp TEGRA186_RESET_GPCDMA>;
+ reset-names = "gpcdma";
+ interrupts = <GIC_SPI 76 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 77 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 78 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 79 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 80 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 81 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 82 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 84 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 85 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 86 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 87 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 88 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 89 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 90 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 91 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 92 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 93 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 94 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 95 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 96 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 97 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 98 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 99 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 100 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 101 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 102 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 103 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 104 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 105 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 106 IRQ_TYPE_LEVEL_HIGH>;
+ #dma-cells = <1>;
+ iommus = <&smmu TEGRA186_SID_GPCDMA_0>;
+ dma-coherent;
+ dma-channel-mask = <0xfffffffe>;
+ };
+...
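On the client side the single cell selects the DMA request; a hedged sketch
for a UART, assuming request 8 (a placeholder; consult the SoC documentation
for real request numbers):

    dmas = <&gpcdma 8>, <&gpcdma 8>;
    dma-names = "rx", "tx";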
diff --git a/Documentation/devicetree/bindings/dma/nvidia,tegra210-adma.yaml b/Documentation/devicetree/bindings/dma/nvidia,tegra210-adma.yaml
index 5c2e2f156e31..4003dbe94940 100644
--- a/Documentation/devicetree/bindings/dma/nvidia,tegra210-adma.yaml
+++ b/Documentation/devicetree/bindings/dma/nvidia,tegra210-adma.yaml
@@ -14,7 +14,7 @@ maintainers:
- Jon Hunter <jonathanh@nvidia.com>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
@@ -23,7 +23,9 @@ properties:
- nvidia,tegra210-adma
- nvidia,tegra186-adma
- items:
- - const: nvidia,tegra194-adma
+ - enum:
+ - nvidia,tegra234-adma
+ - nvidia,tegra194-adma
- const: nvidia,tegra186-adma
reg:
diff --git a/Documentation/devicetree/bindings/dma/owl-dma.yaml b/Documentation/devicetree/bindings/dma/owl-dma.yaml
index 93b4847554fb..ec8b3dc37ca4 100644
--- a/Documentation/devicetree/bindings/dma/owl-dma.yaml
+++ b/Documentation/devicetree/bindings/dma/owl-dma.yaml
@@ -15,7 +15,7 @@ maintainers:
- Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/dma/qcom,adm.yaml b/Documentation/devicetree/bindings/dma/qcom,adm.yaml
new file mode 100644
index 000000000000..6a9d7bc74aff
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/qcom,adm.yaml
@@ -0,0 +1,99 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/qcom,adm.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm ADM DMA Controller
+
+maintainers:
+ - Christian Marangi <ansuelsmth@gmail.com>
+ - Bjorn Andersson <bjorn.andersson@linaro.org>
+
+description: |
+ QCOM ADM DMA controller provides DMA capabilities for
+ peripheral buses such as NAND and SPI.
+
+properties:
+ compatible:
+ const: qcom,adm
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ "#dma-cells":
+ const: 1
+
+ clocks:
+ items:
+ - description: phandle to the core clock
+ - description: phandle to the iface clock
+
+ clock-names:
+ items:
+ - const: core
+ - const: iface
+
+ resets:
+ items:
+ - description: phandle to the clk reset
+ - description: phandle to the pbus reset
+ - description: phandle to the c0 reset
+ - description: phandle to the c1 reset
+ - description: phandle to the c2 reset
+
+ reset-names:
+ items:
+ - const: clk
+ - const: pbus
+ - const: c0
+ - const: c1
+ - const: c2
+
+ qcom,ee:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: indicates the security domain identifier used in the secure world.
+ minimum: 0
+ maximum: 255
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - "#dma-cells"
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+ - qcom,ee
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-ipq806x.h>
+ #include <dt-bindings/reset/qcom,gcc-ipq806x.h>
+
+ adm_dma: dma-controller@18300000 {
+ compatible = "qcom,adm";
+ reg = <0x18300000 0x100000>;
+ interrupts = <0 170 0>;
+ #dma-cells = <1>;
+
+ clocks = <&gcc ADM0_CLK>,
+ <&gcc ADM0_PBUS_CLK>;
+ clock-names = "core", "iface";
+
+ resets = <&gcc ADM0_RESET>,
+ <&gcc ADM0_PBUS_RESET>,
+ <&gcc ADM0_C0_RESET>,
+ <&gcc ADM0_C1_RESET>,
+ <&gcc ADM0_C2_RESET>;
+ reset-names = "clk", "pbus", "c0", "c1", "c2";
+ qcom,ee = <0>;
+ };
+
+...
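With "#dma-cells" now 1, clients pass only the channel number; a sketch that
reuses the channel numbers from the SPI example of the removed text binding
(see qcom_adm.txt below):

    dmas = <&adm_dma 6>, <&adm_dma 5>;
    dma-names = "rx", "tx";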
diff --git a/Documentation/devicetree/bindings/dma/qcom,bam-dma.yaml b/Documentation/devicetree/bindings/dma/qcom,bam-dma.yaml
new file mode 100644
index 000000000000..f1ddcf672261
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/qcom,bam-dma.yaml
@@ -0,0 +1,100 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/qcom,bam-dma.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm Technologies Inc BAM DMA controller
+
+maintainers:
+ - Andy Gross <agross@kernel.org>
+ - Bjorn Andersson <andersson@kernel.org>
+
+allOf:
+ - $ref: dma-controller.yaml#
+
+properties:
+ compatible:
+ enum:
+ # APQ8064, IPQ8064 and MSM8960
+ - qcom,bam-v1.3.0
+ # MSM8974, APQ8074 and APQ8084
+ - qcom,bam-v1.4.0
+ # MSM8916 and SDM845
+ - qcom,bam-v1.7.0
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: bam_clk
+
+ "#dma-cells":
+ const: 1
+
+ interrupts:
+ maxItems: 1
+
+ iommus:
+ minItems: 1
+ maxItems: 4
+
+ num-channels:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Indicates supported number of DMA channels in a remotely controlled bam.
+
+ qcom,controlled-remotely:
+ type: boolean
+ description:
+ Indicates that the bam is controlled by a remote processor, i.e. an
+ execution environment.
+
+ qcom,ee:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ minimum: 0
+ maximum: 7
+ description:
+ Indicates the active Execution Environment identifier (0-7) used in the
+ secure world.
+
+ qcom,num-ees:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Indicates supported number of Execution Environments in a remotely
+ controlled bam.
+
+ qcom,powered-remotely:
+ type: boolean
+ description:
+ Indicates that the bam is powered up by a remote processor but must be
+ initialized by the local processor.
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - "#dma-cells"
+ - interrupts
+ - qcom,ee
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/clock/qcom,gcc-msm8974.h>
+
+ dma-controller@f9944000 {
+ compatible = "qcom,bam-v1.4.0";
+ reg = <0xf9944000 0x19000>;
+ interrupts = <GIC_SPI 239 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&gcc GCC_BLSP2_AHB_CLK>;
+ clock-names = "bam_clk";
+ #dma-cells = <1>;
+ qcom,ee = <0>;
+ };
+...
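Client usage is unchanged from the removed text binding: one cell per dmas
entry naming the BAM channel. A sketch, assuming the controller above is
labelled "bam":

    dmas = <&bam 0>, <&bam 1>;
    dma-names = "rx", "tx";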
diff --git a/Documentation/devicetree/bindings/dma/qcom,gpi.yaml b/Documentation/devicetree/bindings/dma/qcom,gpi.yaml
index e614fe3187bb..f61145c91b6d 100644
--- a/Documentation/devicetree/bindings/dma/qcom,gpi.yaml
+++ b/Documentation/devicetree/bindings/dma/qcom,gpi.yaml
@@ -14,14 +14,32 @@ description: |
peripheral buses such as I2C, UART, and SPI.
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
- enum:
- - qcom,sdm845-gpi-dma
- - qcom,sm8150-gpi-dma
- - qcom,sm8250-gpi-dma
+ oneOf:
+ - enum:
+ - qcom,sdm845-gpi-dma
+ - qcom,sm6350-gpi-dma
+ - items:
+ - enum:
+ - qcom,qcm2290-gpi-dma
+ - qcom,qdu1000-gpi-dma
+ - qcom,sc7280-gpi-dma
+ - qcom,sm6115-gpi-dma
+ - qcom,sm6375-gpi-dma
+ - qcom,sm8350-gpi-dma
+ - qcom,sm8450-gpi-dma
+ - qcom,sm8550-gpi-dma
+ - const: qcom,sm6350-gpi-dma
+ - items:
+ - enum:
+ - qcom,sdm670-gpi-dma
+ - qcom,sm6125-gpi-dma
+ - qcom,sm8150-gpi-dma
+ - qcom,sm8250-gpi-dma
+ - const: qcom,sdm845-gpi-dma
reg:
maxItems: 1
@@ -29,6 +47,7 @@ properties:
interrupts:
description:
Interrupt lines for each GPI instance
+ minItems: 1
maxItems: 13
"#dma-cells":
diff --git a/Documentation/devicetree/bindings/dma/qcom_adm.txt b/Documentation/devicetree/bindings/dma/qcom_adm.txt
deleted file mode 100644
index 9d3b2f917b7b..000000000000
--- a/Documentation/devicetree/bindings/dma/qcom_adm.txt
+++ /dev/null
@@ -1,61 +0,0 @@
-QCOM ADM DMA Controller
-
-Required properties:
-- compatible: must contain "qcom,adm" for IPQ/APQ8064 and MSM8960
-- reg: Address range for DMA registers
-- interrupts: Should contain one interrupt shared by all channels
-- #dma-cells: must be <2>. First cell denotes the channel number. Second cell
- denotes CRCI (client rate control interface) flow control assignment.
-- clocks: Should contain the core clock and interface clock.
-- clock-names: Must contain "core" for the core clock and "iface" for the
- interface clock.
-- resets: Must contain an entry for each entry in reset names.
-- reset-names: Must include the following entries:
- - clk
- - c0
- - c1
- - c2
-- qcom,ee: indicates the security domain identifier used in the secure world.
-
-Example:
- adm_dma: dma@18300000 {
- compatible = "qcom,adm";
- reg = <0x18300000 0x100000>;
- interrupts = <0 170 0>;
- #dma-cells = <2>;
-
- clocks = <&gcc ADM0_CLK>, <&gcc ADM0_PBUS_CLK>;
- clock-names = "core", "iface";
-
- resets = <&gcc ADM0_RESET>,
- <&gcc ADM0_C0_RESET>,
- <&gcc ADM0_C1_RESET>,
- <&gcc ADM0_C2_RESET>;
- reset-names = "clk", "c0", "c1", "c2";
- qcom,ee = <0>;
- };
-
-DMA clients must use the format descripted in the dma.txt file, using a three
-cell specifier for each channel.
-
-Each dmas request consists of 3 cells:
- 1. phandle pointing to the DMA controller
- 2. channel number
- 3. CRCI assignment, if applicable. If no CRCI flow control is required, use 0.
- The CRCI is used for flow control. It identifies the peripheral device that
- is the source/destination for the transferred data.
-
-Example:
-
- spi4: spi@1a280000 {
- spi-max-frequency = <50000000>;
-
- pinctrl-0 = <&spi_pins>;
- pinctrl-names = "default";
-
- cs-gpios = <&qcom_pinmux 20 0>;
-
- dmas = <&adm_dma 6 9>,
- <&adm_dma 5 10>;
- dma-names = "rx", "tx";
- };
diff --git a/Documentation/devicetree/bindings/dma/qcom_bam_dma.txt b/Documentation/devicetree/bindings/dma/qcom_bam_dma.txt
deleted file mode 100644
index cf5b9e44432c..000000000000
--- a/Documentation/devicetree/bindings/dma/qcom_bam_dma.txt
+++ /dev/null
@@ -1,50 +0,0 @@
-QCOM BAM DMA controller
-
-Required properties:
-- compatible: must be one of the following:
- * "qcom,bam-v1.4.0" for MSM8974, APQ8074 and APQ8084
- * "qcom,bam-v1.3.0" for APQ8064, IPQ8064 and MSM8960
- * "qcom,bam-v1.7.0" for MSM8916
-- reg: Address range for DMA registers
-- interrupts: Should contain the one interrupt shared by all channels
-- #dma-cells: must be <1>, the cell in the dmas property of the client device
- represents the channel number
-- clocks: required clock
-- clock-names: must contain "bam_clk" entry
-- qcom,ee : indicates the active Execution Environment identifier (0-7) used in
- the secure world.
-- qcom,controlled-remotely : optional, indicates that the bam is controlled by
- remote proccessor i.e. execution environment.
-- num-channels : optional, indicates supported number of DMA channels in a
- remotely controlled bam.
-- qcom,num-ees : optional, indicates supported number of Execution Environments
- in a remotely controlled bam.
-
-Example:
-
- uart-bam: dma@f9984000 = {
- compatible = "qcom,bam-v1.4.0";
- reg = <0xf9984000 0x15000>;
- interrupts = <0 94 0>;
- clocks = <&gcc GCC_BAM_DMA_AHB_CLK>;
- clock-names = "bam_clk";
- #dma-cells = <1>;
- qcom,ee = <0>;
- };
-
-DMA clients must use the format described in the dma.txt file, using a two cell
-specifier for each channel.
-
-Example:
- serial@f991e000 {
- compatible = "qcom,msm-uart";
- reg = <0xf991e000 0x1000>
- <0xf9944000 0x19000>;
- interrupts = <0 108 0>;
- clocks = <&gcc GCC_BLSP1_UART2_APPS_CLK>,
- <&gcc GCC_BLSP1_AHB_CLK>;
- clock-names = "core", "iface";
-
- dmas = <&uart-bam 0>, <&uart-bam 1>;
- dma-names = "rx", "tx";
- };
diff --git a/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.yaml b/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.yaml
index d8142cbd13d3..03aa067b1229 100644
--- a/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.yaml
+++ b/Documentation/devicetree/bindings/dma/renesas,rcar-dmac.yaml
@@ -10,7 +10,7 @@ maintainers:
- Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
@@ -42,7 +42,11 @@ properties:
- const: renesas,rcar-dmac
- items:
- - const: renesas,dmac-r8a779a0 # R-Car V3U
+ - enum:
+ - renesas,dmac-r8a779a0 # R-Car V3U
+ - renesas,dmac-r8a779f0 # R-Car S4-8
+ - renesas,dmac-r8a779g0 # R-Car V4H
+ - const: renesas,rcar-gen4-dmac # R-Car Gen4
reg: true
@@ -117,7 +121,7 @@ if:
compatible:
contains:
enum:
- - renesas,dmac-r8a779a0
+ - renesas,rcar-gen4-dmac
then:
properties:
reg:
diff --git a/Documentation/devicetree/bindings/dma/renesas,rz-dmac.yaml b/Documentation/devicetree/bindings/dma/renesas,rz-dmac.yaml
index 7a4f415d74dc..c284abc6784a 100644
--- a/Documentation/devicetree/bindings/dma/renesas,rz-dmac.yaml
+++ b/Documentation/devicetree/bindings/dma/renesas,rz-dmac.yaml
@@ -4,19 +4,21 @@
$id: http://devicetree.org/schemas/dma/renesas,rz-dmac.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Renesas RZ/G2L DMA Controller
+title: Renesas RZ/{G2L,G2UL,V2L} DMA Controller
maintainers:
- Biju Das <biju.das.jz@bp.renesas.com>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
items:
- enum:
+ - renesas,r9a07g043-dmac # RZ/G2UL
- renesas,r9a07g044-dmac # RZ/G2{L,LC}
+ - renesas,r9a07g054-dmac # RZ/V2L
- const: renesas,rz-dmac
reg:
@@ -52,6 +54,11 @@ properties:
- description: DMA main clock
- description: DMA register access clock
+ clock-names:
+ items:
+ - const: main
+ - const: register
+
'#dma-cells':
const: 1
description:
@@ -75,16 +82,23 @@ properties:
- description: Reset for DMA ARESETN reset terminal
- description: Reset for DMA RST_ASYNC reset terminal
+ reset-names:
+ items:
+ - const: arst
+ - const: rst_async
+
required:
- compatible
- reg
- interrupts
- interrupt-names
- clocks
+ - clock-names
- '#dma-cells'
- dma-channels
- power-domains
- resets
+ - reset-names
additionalProperties: false
@@ -122,9 +136,11 @@ examples:
"ch12", "ch13", "ch14", "ch15";
clocks = <&cpg CPG_MOD R9A07G044_DMAC_ACLK>,
<&cpg CPG_MOD R9A07G044_DMAC_PCLK>;
+ clock-names = "main", "register";
power-domains = <&cpg>;
resets = <&cpg R9A07G044_DMAC_ARESETN>,
<&cpg R9A07G044_DMAC_RST_ASYNC>;
+ reset-names = "arst", "rst_async";
#dma-cells = <1>;
dma-channels = <16>;
};
diff --git a/Documentation/devicetree/bindings/dma/renesas,rzn1-dmamux.yaml b/Documentation/devicetree/bindings/dma/renesas,rzn1-dmamux.yaml
new file mode 100644
index 000000000000..ee9833dcc36c
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/renesas,rzn1-dmamux.yaml
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/renesas,rzn1-dmamux.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Renesas RZ/N1 DMA mux
+
+maintainers:
+ - Miquel Raynal <miquel.raynal@bootlin.com>
+
+allOf:
+ - $ref: dma-router.yaml#
+
+properties:
+ compatible:
+ const: renesas,rzn1-dmamux
+
+ reg:
+ maxItems: 1
+ description: DMA mux first register offset within the system control parent.
+
+ '#dma-cells':
+ const: 6
+ description:
+ The first four cells are dedicated to the master DMA controller. The fifth
+ cell gives the DMA mux bit index that must be set, starting from 0. The
+ sixth cell gives the binary value that must be written there, i.e. 0 or 1.
+
+ dma-masters:
+ minItems: 1
+ maxItems: 2
+
+ dma-requests:
+ const: 32
+
+required:
+ - reg
+ - dma-requests
+
+additionalProperties: false
+
+examples:
+ - |
+ dma-router@a0 {
+ compatible = "renesas,rzn1-dmamux";
+ reg = <0xa0 4>;
+ #dma-cells = <6>;
+ dma-masters = <&dma0 &dma1>;
+ dma-requests = <32>;
+ };
diff --git a/Documentation/devicetree/bindings/dma/renesas,usb-dmac.yaml b/Documentation/devicetree/bindings/dma/renesas,usb-dmac.yaml
index ab287c652b2c..17813599fccb 100644
--- a/Documentation/devicetree/bindings/dma/renesas,usb-dmac.yaml
+++ b/Documentation/devicetree/bindings/dma/renesas,usb-dmac.yaml
@@ -10,7 +10,7 @@ maintainers:
- Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/dma/sifive,fu540-c000-pdma.yaml b/Documentation/devicetree/bindings/dma/sifive,fu540-c000-pdma.yaml
index d32a71b975fe..a1af0b906365 100644
--- a/Documentation/devicetree/bindings/dma/sifive,fu540-c000-pdma.yaml
+++ b/Documentation/devicetree/bindings/dma/sifive,fu540-c000-pdma.yaml
@@ -22,10 +22,21 @@ description: |
https://static.dev.sifive.com/FU540-C000-v1.0.pdf
+allOf:
+ - $ref: dma-controller.yaml#
+
properties:
compatible:
items:
- - const: sifive,fu540-c000-pdma
+ - enum:
+ - sifive,fu540-c000-pdma
+ - const: sifive,pdma0
+ description:
+ Should be "sifive,<chip>-pdma" and "sifive,pdma<version>".
+ Supported compatible strings are:
+ "sifive,fu540-c000-pdma" for the SiFive PDMA v0 as integrated onto the
+ SiFive FU540 chip, and "sifive,pdma0" for the SiFive PDMA v0 IP block
+ with no chip integration tweaks.
reg:
maxItems: 1
@@ -34,6 +45,12 @@ properties:
minItems: 1
maxItems: 8
+ dma-channels:
+ description: For backwards-compatibility, the default value is 4
+ minimum: 1
+ maximum: 4
+ default: 4
+
'#dma-cells':
const: 1
@@ -41,16 +58,16 @@ required:
- compatible
- reg
- interrupts
- - '#dma-cells'
-additionalProperties: false
+unevaluatedProperties: false
examples:
- |
- dma@3000000 {
- compatible = "sifive,fu540-c000-pdma";
+ dma-controller@3000000 {
+ compatible = "sifive,fu540-c000-pdma", "sifive,pdma0";
reg = <0x3000000 0x8000>;
- interrupts = <23 24 25 26 27 28 29 30>;
+ dma-channels = <4>;
+ interrupts = <23>, <24>, <25>, <26>, <27>, <28>, <29>, <30>;
#dma-cells = <1>;
};
diff --git a/Documentation/devicetree/bindings/dma/snps,dma-spear1340.yaml b/Documentation/devicetree/bindings/dma/snps,dma-spear1340.yaml
index 6b35089ac017..5da8291a7de0 100644
--- a/Documentation/devicetree/bindings/dma/snps,dma-spear1340.yaml
+++ b/Documentation/devicetree/bindings/dma/snps,dma-spear1340.yaml
@@ -11,11 +11,17 @@ maintainers:
- Andy Shevchenko <andriy.shevchenko@linux.intel.com>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
- const: snps,dma-spear1340
+ oneOf:
+ - const: snps,dma-spear1340
+ - items:
+ - enum:
+ - renesas,r9a06g032-dma
+ - const: renesas,rzn1-dma
+
"#dma-cells":
minimum: 3
diff --git a/Documentation/devicetree/bindings/dma/snps,dw-axi-dmac.yaml b/Documentation/devicetree/bindings/dma/snps,dw-axi-dmac.yaml
index 79e241498e25..363cf8bd150d 100644
--- a/Documentation/devicetree/bindings/dma/snps,dw-axi-dmac.yaml
+++ b/Documentation/devicetree/bindings/dma/snps,dw-axi-dmac.yaml
@@ -8,19 +8,19 @@ title: Synopsys DesignWare AXI DMA Controller
maintainers:
- Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
- - Jee Heng Sia <jee.heng.sia@intel.com>
description:
Synopsys DesignWare AXI DMA Controller DT Binding
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
enum:
- snps,axi-dma-1.01a
- intel,kmb-axi-dma
+ - starfive,jh7110-axi-dma
reg:
minItems: 1
@@ -34,7 +34,12 @@ properties:
- const: axidma_apb_regs
interrupts:
- maxItems: 1
+ description:
+ If the IP-core synthesis parameter DMAX_INTR_IO_TYPE is set to 1, these
+ are per-channel interrupts. Otherwise, this is a single combined IRQ
+ for all channels.
+ minItems: 1
+ maxItems: 8
clocks:
items:
@@ -53,6 +58,10 @@ properties:
minimum: 1
maximum: 8
+ resets:
+ minItems: 1
+ maxItems: 2
+
snps,dma-masters:
description: |
Number of AXI masters supported by the hardware.
@@ -102,25 +111,44 @@ required:
- snps,priority
- snps,block-size
+if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - starfive,jh7110-axi-dma
+then:
+ properties:
+ resets:
+ minItems: 2
+ items:
+ - description: AXI reset line
+ - description: AHB reset line
+ - description: module reset
+else:
+ properties:
+ resets:
+ maxItems: 1
+
additionalProperties: false
examples:
- |
- #include <dt-bindings/interrupt-controller/arm-gic.h>
- #include <dt-bindings/interrupt-controller/irq.h>
- /* example with snps,dw-axi-dmac */
- dmac: dma-controller@80000 {
- compatible = "snps,axi-dma-1.01a";
- reg = <0x80000 0x400>;
- clocks = <&core_clk>, <&cfgr_clk>;
- clock-names = "core-clk", "cfgr-clk";
- interrupt-parent = <&intc>;
- interrupts = <27>;
- #dma-cells = <1>;
- dma-channels = <4>;
- snps,dma-masters = <2>;
- snps,data-width = <3>;
- snps,block-size = <4096 4096 4096 4096>;
- snps,priority = <0 1 2 3>;
- snps,axi-max-burst-len = <16>;
- };
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ /* example with snps,dw-axi-dmac */
+ dma-controller@80000 {
+ compatible = "snps,axi-dma-1.01a";
+ reg = <0x80000 0x400>;
+ clocks = <&core_clk>, <&cfgr_clk>;
+ clock-names = "core-clk", "cfgr-clk";
+ interrupt-parent = <&intc>;
+ interrupts = <27>;
+ #dma-cells = <1>;
+ dma-channels = <4>;
+ snps,dma-masters = <2>;
+ snps,data-width = <3>;
+ snps,block-size = <4096 4096 4096 4096>;
+ snps,priority = <0 1 2 3>;
+ snps,axi-max-burst-len = <16>;
+ };
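For the StarFive branch of the schema above, a hedged node sketch showing the
AXI and AHB reset lines; the unit address, clock and reset specifiers are
placeholders, not taken from a datasheet:

    dma-controller@16008000 {
        compatible = "starfive,jh7110-axi-dma";
        reg = <0x16008000 0x10000>;
        clocks = <&stgcrg 1>, <&stgcrg 2>;
        clock-names = "core-clk", "cfgr-clk";
        resets = <&stgcrg 3>, <&stgcrg 4>; /* AXI, AHB */
        interrupts = <73>;
        #dma-cells = <1>;
        dma-channels = <4>;
        snps,dma-masters = <1>;
        snps,data-width = <3>;
        snps,block-size = <4096 4096 4096 4096>;
        snps,priority = <0 1 2 3>;
    };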
diff --git a/Documentation/devicetree/bindings/dma/socionext,uniphier-mio-dmac.yaml b/Documentation/devicetree/bindings/dma/socionext,uniphier-mio-dmac.yaml
index e7bf6dd7da29..23c8a7bf24de 100644
--- a/Documentation/devicetree/bindings/dma/socionext,uniphier-mio-dmac.yaml
+++ b/Documentation/devicetree/bindings/dma/socionext,uniphier-mio-dmac.yaml
@@ -14,7 +14,7 @@ maintainers:
- Masahiro Yamada <yamada.masahiro@socionext.com>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/dma/socionext,uniphier-xdmac.yaml b/Documentation/devicetree/bindings/dma/socionext,uniphier-xdmac.yaml
index 371f18773198..da61d1ddc9c3 100644
--- a/Documentation/devicetree/bindings/dma/socionext,uniphier-xdmac.yaml
+++ b/Documentation/devicetree/bindings/dma/socionext,uniphier-xdmac.yaml
@@ -15,7 +15,7 @@ maintainers:
- Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/dma/sprd-dma.txt b/Documentation/devicetree/bindings/dma/sprd-dma.txt
index adccea9941f1..c7e9b5fd50e7 100644
--- a/Documentation/devicetree/bindings/dma/sprd-dma.txt
+++ b/Documentation/devicetree/bindings/dma/sprd-dma.txt
@@ -8,10 +8,13 @@ Required properties:
- interrupts: Should contain one interrupt shared by all channels.
- #dma-cells: must be <1>. Used to represent the number of integer
 cells in the dmas property of the client device.
-- #dma-channels : Number of DMA channels supported. Should be 32.
+- dma-channels : Number of DMA channels supported. Should be 32.
- clock-names: Should contain the clock of the DMA controller.
- clocks: Should contain a clock specifier for each entry in clock-names.
+Deprecated properties:
+- #dma-channels : Number of DMA channels supported. Should be 32.
+
Example:
Controller:
@@ -20,7 +23,7 @@ apdma: dma-controller@20100000 {
reg = <0x20100000 0x4000>;
interrupts = <GIC_SPI 50 IRQ_TYPE_LEVEL_HIGH>;
#dma-cells = <1>;
- #dma-channels = <32>;
+ dma-channels = <32>;
clock-names = "enable";
clocks = <&clk_ap_ahb_gates 5>;
};
diff --git a/Documentation/devicetree/bindings/dma/st,stm32-dma.yaml b/Documentation/devicetree/bindings/dma/st,stm32-dma.yaml
index 4bf676fd25dc..329847ef096a 100644
--- a/Documentation/devicetree/bindings/dma/st,stm32-dma.yaml
+++ b/Documentation/devicetree/bindings/dma/st,stm32-dma.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/dma/st,stm32-dma.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: STMicroelectronics STM32 DMA Controller bindings
+title: STMicroelectronics STM32 DMA Controller
description: |
The STM32 DMA is a general-purpose direct memory access controller capable of
@@ -50,10 +50,10 @@ description: |
maintainers:
- - Amelie Delaunay <amelie.delaunay@st.com>
+ - Amelie Delaunay <amelie.delaunay@foss.st.com>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
"#dma-cells":
diff --git a/Documentation/devicetree/bindings/dma/st,stm32-dmamux.yaml b/Documentation/devicetree/bindings/dma/st,stm32-dmamux.yaml
index c8d2b51d8410..e722fbcd8a5f 100644
--- a/Documentation/devicetree/bindings/dma/st,stm32-dmamux.yaml
+++ b/Documentation/devicetree/bindings/dma/st,stm32-dmamux.yaml
@@ -4,13 +4,13 @@
$id: http://devicetree.org/schemas/dma/st,stm32-dmamux.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: STMicroelectronics STM32 DMA MUX (DMA request router) bindings
+title: STMicroelectronics STM32 DMA MUX (DMA request router)
maintainers:
- - Amelie Delaunay <amelie.delaunay@st.com>
+ - Amelie Delaunay <amelie.delaunay@foss.st.com>
allOf:
- - $ref: "dma-router.yaml#"
+ - $ref: dma-router.yaml#
properties:
"#dma-cells":
@@ -46,9 +46,8 @@ examples:
#dma-cells = <3>;
dma-requests = <128>;
dma-channels = <16>;
- dma-masters = <&dma1 &dma2>;
+ dma-masters = <&dma1>, <&dma2>;
clocks = <&timer_clk>;
};
...
-
diff --git a/Documentation/devicetree/bindings/dma/st,stm32-mdma.yaml b/Documentation/devicetree/bindings/dma/st,stm32-mdma.yaml
index c30be840be1c..3874544dfa74 100644
--- a/Documentation/devicetree/bindings/dma/st,stm32-mdma.yaml
+++ b/Documentation/devicetree/bindings/dma/st,stm32-mdma.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/dma/st,stm32-mdma.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: STMicroelectronics STM32 MDMA Controller bindings
+title: STMicroelectronics STM32 MDMA Controller
description: |
The STM32 MDMA is a general-purpose direct memory access controller capable of
@@ -50,10 +50,10 @@ description: |
if no HW ack signal is used by the MDMA client
maintainers:
- - Amelie Delaunay <amelie.delaunay@st.com>
+ - Amelie Delaunay <amelie.delaunay@foss.st.com>
allOf:
- - $ref: "dma-controller.yaml#"
+ - $ref: dma-controller.yaml#
properties:
"#dma-cells":
@@ -104,4 +104,3 @@ examples:
};
...
-
diff --git a/Documentation/devicetree/bindings/dma/ste-dma40.txt b/Documentation/devicetree/bindings/dma/ste-dma40.txt
deleted file mode 100644
index 99ab5c4d331e..000000000000
--- a/Documentation/devicetree/bindings/dma/ste-dma40.txt
+++ /dev/null
@@ -1,138 +0,0 @@
-* DMA40 DMA Controller
-
-Required properties:
-- compatible: "stericsson,dma40"
-- reg: Address range of the DMAC registers
-- reg-names: Names of the above areas to use during resource look-up
-- interrupt: Should contain the DMAC interrupt number
-- #dma-cells: must be <3>
-- memcpy-channels: Channels to be used for memcpy
-
-Optional properties:
-- dma-channels: Number of channels supported by hardware - if not present
- the driver will attempt to obtain the information from H/W
-- disabled-channels: Channels which can not be used
-
-Example:
-
- dma: dma-controller@801c0000 {
- compatible = "stericsson,db8500-dma40", "stericsson,dma40";
- reg = <0x801C0000 0x1000 0x40010000 0x800>;
- reg-names = "base", "lcpa";
- interrupt-parent = <&intc>;
- interrupts = <0 25 0x4>;
-
- #dma-cells = <2>;
- memcpy-channels = <56 57 58 59 60>;
- disabled-channels = <12>;
- dma-channels = <8>;
- };
-
-Clients
-Required properties:
-- dmas: Comma separated list of dma channel requests
-- dma-names: Names of the aforementioned requested channels
-
-Each dmas request consists of 4 cells:
- 1. A phandle pointing to the DMA controller
- 2. Device signal number, the signal line for single and burst requests
- connected from the device to the DMA40 engine
- 3. The DMA request line number (only when 'use fixed channel' is set)
- 4. A 32bit mask specifying; mode, direction and endianness
- [NB: This list will grow]
- 0x00000001: Mode:
- Logical channel when unset
- Physical channel when set
- 0x00000002: Direction:
- Memory to Device when unset
- Device to Memory when set
- 0x00000004: Endianness:
- Little endian when unset
- Big endian when set
- 0x00000008: Use fixed channel:
- Use automatic channel selection when unset
- Use DMA request line number when set
- 0x00000010: Set channel as high priority:
- Normal priority when unset
- High priority when set
-
-Existing signal numbers for the DB8500 ASIC. Unless specified, the signals are
-bidirectional, i.e. the same for RX and TX operations:
-
-0: SPI controller 0
-1: SD/MMC controller 0 (unused)
-2: SD/MMC controller 1 (unused)
-3: SD/MMC controller 2 (unused)
-4: I2C port 1
-5: I2C port 3
-6: I2C port 2
-7: I2C port 4
-8: Synchronous Serial Port SSP0
-9: Synchronous Serial Port SSP1
-10: Multi-Channel Display Engine MCDE RX
-11: UART port 2
-12: UART port 1
-13: UART port 0
-14: Multirate Serial Port MSP2
-15: I2C port 0
-16: USB OTG in/out endpoints 7 & 15
-17: USB OTG in/out endpoints 6 & 14
-18: USB OTG in/out endpoints 5 & 13
-19: USB OTG in/out endpoints 4 & 12
-20: SLIMbus or HSI channel 0
-21: SLIMbus or HSI channel 1
-22: SLIMbus or HSI channel 2
-23: SLIMbus or HSI channel 3
-24: Multimedia DSP SXA0
-25: Multimedia DSP SXA1
-26: Multimedia DSP SXA2
-27: Multimedia DSP SXA3
-28: SD/MM controller 2
-29: SD/MM controller 0
-30: MSP port 1 on DB8500 v1, MSP port 3 on DB8500 v2
-31: MSP port 0 or SLIMbus channel 0
-32: SD/MM controller 1
-33: SPI controller 2
-34: i2c3 RX2 TX2
-35: SPI controller 1
-36: USB OTG in/out endpoints 3 & 11
-37: USB OTG in/out endpoints 2 & 10
-38: USB OTG in/out endpoints 1 & 9
-39: USB OTG in/out endpoints 8
-40: SPI controller 3
-41: SD/MM controller 3
-42: SD/MM controller 4
-43: SD/MM controller 5
-44: Multimedia DSP SXA4
-45: Multimedia DSP SXA5
-46: SLIMbus channel 8 or Multimedia DSP SXA6
-47: SLIMbus channel 9 or Multimedia DSP SXA7
-48: Crypto Accelerator 1
-49: Crypto Accelerator 1 TX or Hash Accelerator 1 TX
-50: Hash Accelerator 1 TX
-51: memcpy TX (to be used by the DMA driver for memcpy operations)
-52: SLIMbus or HSI channel 4
-53: SLIMbus or HSI channel 5
-54: SLIMbus or HSI channel 6
-55: SLIMbus or HSI channel 7
-56: memcpy (to be used by the DMA driver for memcpy operations)
-57: memcpy (to be used by the DMA driver for memcpy operations)
-58: memcpy (to be used by the DMA driver for memcpy operations)
-59: memcpy (to be used by the DMA driver for memcpy operations)
-60: memcpy (to be used by the DMA driver for memcpy operations)
-61: Crypto Accelerator 0
-62: Crypto Accelerator 0 TX or Hash Accelerator 0 TX
-63: Hash Accelerator 0 TX
-
-Example:
-
- uart@80120000 {
- compatible = "arm,pl011", "arm,primecell";
- reg = <0x80120000 0x1000>;
- interrupts = <0 11 0x4>;
-
- dmas = <&dma 13 0 0x2>, /* Logical - DevToMem */
- <&dma 13 0 0x0>; /* Logical - MemToDev */
- dma-names = "rx", "rx";
-
- };
diff --git a/Documentation/devicetree/bindings/dma/stericsson,dma40.yaml b/Documentation/devicetree/bindings/dma/stericsson,dma40.yaml
new file mode 100644
index 000000000000..64845347f44d
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/stericsson,dma40.yaml
@@ -0,0 +1,159 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/stericsson,dma40.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ST-Ericsson DMA40 DMA Engine
+
+maintainers:
+ - Linus Walleij <linus.walleij@linaro.org>
+
+allOf:
+ - $ref: dma-controller.yaml#
+
+properties:
+ "#dma-cells":
+ const: 3
+ description: |
+ The first cell is the unique device channel number as indicated by this
+ table for DB8500, which is the only ASIC known to use DMA40:
+
+ 0: SPI controller 0
+ 1: SD/MMC controller 0 (unused)
+ 2: SD/MMC controller 1 (unused)
+ 3: SD/MMC controller 2 (unused)
+ 4: I2C port 1
+ 5: I2C port 3
+ 6: I2C port 2
+ 7: I2C port 4
+ 8: Synchronous Serial Port SSP0
+ 9: Synchronous Serial Port SSP1
+ 10: Multi-Channel Display Engine MCDE RX
+ 11: UART port 2
+ 12: UART port 1
+ 13: UART port 0
+ 14: Multirate Serial Port MSP2
+ 15: I2C port 0
+ 16: USB OTG in/out endpoints 7 & 15
+ 17: USB OTG in/out endpoints 6 & 14
+ 18: USB OTG in/out endpoints 5 & 13
+ 19: USB OTG in/out endpoints 4 & 12
+ 20: SLIMbus or HSI channel 0
+ 21: SLIMbus or HSI channel 1
+ 22: SLIMbus or HSI channel 2
+ 23: SLIMbus or HSI channel 3
+ 24: Multimedia DSP SXA0
+ 25: Multimedia DSP SXA1
+ 26: Multimedia DSP SXA2
+ 27: Multimedia DSP SXA3
+ 28: SD/MMC controller 2
+ 29: SD/MMC controller 0
+ 30: MSP port 1 on DB8500 v1, MSP port 3 on DB8500 v2
+ 31: MSP port 0 or SLIMbus channel 0
+ 32: SD/MMC controller 1
+ 33: SPI controller 2
+ 34: i2c3 RX2 TX2
+ 35: SPI controller 1
+ 36: USB OTG in/out endpoints 3 & 11
+ 37: USB OTG in/out endpoints 2 & 10
+ 38: USB OTG in/out endpoints 1 & 9
+ 39: USB OTG in/out endpoints 8
+ 40: SPI controller 3
+ 41: SD/MMC controller 3
+ 42: SD/MMC controller 4
+ 43: SD/MMC controller 5
+ 44: Multimedia DSP SXA4
+ 45: Multimedia DSP SXA5
+ 46: SLIMbus channel 8 or Multimedia DSP SXA6
+ 47: SLIMbus channel 9 or Multimedia DSP SXA7
+ 48: Crypto Accelerator 1
+ 49: Crypto Accelerator 1 TX or Hash Accelerator 1 TX
+ 50: Hash Accelerator 1 TX
+ 51: memcpy TX (to be used by the DMA driver for memcpy operations)
+ 52: SLIMbus or HSI channel 4
+ 53: SLIMbus or HSI channel 5
+ 54: SLIMbus or HSI channel 6
+ 55: SLIMbus or HSI channel 7
+ 56: memcpy (to be used by the DMA driver for memcpy operations)
+ 57: memcpy (to be used by the DMA driver for memcpy operations)
+ 58: memcpy (to be used by the DMA driver for memcpy operations)
+ 59: memcpy (to be used by the DMA driver for memcpy operations)
+ 60: memcpy (to be used by the DMA driver for memcpy operations)
+ 61: Crypto Accelerator 0
+ 62: Crypto Accelerator 0 TX or Hash Accelerator 0 TX
+ 63: Hash Accelerator 0 TX
+
+ The second cell is the DMA request line number. This is only used when
+ a fixed channel is allocated, and indicated by setting bit 3 in the
+ flags field (see below).
+
+ The third cell is a 32bit flags bitfield with the following possible
+ bits set:
+ 0x00000001 (bit 0) - mode:
+ Logical channel when unset
+ Physical channel when set
+ 0x00000002 (bit 1) - direction:
+ Memory to Device when unset
+ Device to Memory when set
+ 0x00000004 (bit 2) - endianness:
+ Little endian when unset
+ Big endian when set
+ 0x00000008 (bit 3) - use fixed channel:
+ Use automatic channel selection when unset
+ Use DMA request line number when set
+ 0x00000010 (bit 4) - set channel as high priority:
+ Normal priority when unset
+ High priority when set
+
+ compatible:
+ items:
+ - const: stericsson,db8500-dma40
+ - const: stericsson,dma40
+
+ reg:
+ items:
+ - description: DMA40 memory base
+ - description: LCPA memory base
+
+ reg-names:
+ items:
+ - const: base
+ - const: lcpa
+
+ interrupts:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ memcpy-channels:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ description: Array of u32 elements indicating which channels on the DMA
+ engine are eligible for memcpy transfers
+
+required:
+ - "#dma-cells"
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - memcpy-channels
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/mfd/dbx500-prcmu.h>
+ dma-controller@801c0000 {
+ compatible = "stericsson,db8500-dma40", "stericsson,dma40";
+ reg = <0x801c0000 0x1000>, <0x40010000 0x800>;
+ reg-names = "base", "lcpa";
+ interrupts = <GIC_SPI 25 IRQ_TYPE_LEVEL_HIGH>;
+ #dma-cells = <3>;
+ memcpy-channels = <56 57 58 59 60>;
+ clocks = <&prcmu_clk PRCMU_DMACLK>;
+ };
+...
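The client convention from the removed ste-dma40.txt still applies; a sketch
for UART port 0 (signal 13 in the table above) using logical channels, with
the flags cell built from the bitfield described under "#dma-cells":

    uart0: serial@80120000 {
        compatible = "arm,pl011", "arm,primecell";
        reg = <0x80120000 0x1000>;
        interrupts = <GIC_SPI 11 IRQ_TYPE_LEVEL_HIGH>;
        dmas = <&dma 13 0 0x2>, /* Logical - DevToMem */
               <&dma 13 0 0x0>; /* Logical - MemToDev */
        dma-names = "rx", "tx";
    };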
diff --git a/Documentation/devicetree/bindings/dma/ti-dma-crossbar.txt b/Documentation/devicetree/bindings/dma/ti-dma-crossbar.txt
index b849a1ed389d..47e477cce6d2 100644
--- a/Documentation/devicetree/bindings/dma/ti-dma-crossbar.txt
+++ b/Documentation/devicetree/bindings/dma/ti-dma-crossbar.txt
@@ -4,7 +4,7 @@ Required properties:
- compatible: "ti,dra7-dma-crossbar" for DRA7xx DMA crossbar
"ti,am335x-edma-crossbar" for AM335x and AM437x
- reg: Memory map for accessing module
-- #dma-cells: Should be set to to match with the DMA controller's dma-cells
+- #dma-cells: Should be set to match with the DMA controller's dma-cells
for ti,dra7-dma-crossbar and <3> for ti,am335x-edma-crossbar.
- dma-requests: Number of DMA requests the crossbar can receive
- dma-masters: phandle pointing to the DMA controller
diff --git a/Documentation/devicetree/bindings/dma/ti/k3-bcdma.yaml b/Documentation/devicetree/bindings/dma/ti/k3-bcdma.yaml
index df29d59d13a8..beecfe7a1732 100644
--- a/Documentation/devicetree/bindings/dma/ti/k3-bcdma.yaml
+++ b/Documentation/devicetree/bindings/dma/ti/k3-bcdma.yaml
@@ -6,7 +6,7 @@
$id: http://devicetree.org/schemas/dma/ti/k3-bcdma.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Texas Instruments K3 DMSS BCDMA Device Tree Bindings
+title: Texas Instruments K3 DMSS BCDMA
maintainers:
- Peter Ujfalusi <peter.ujfalusi@gmail.com>
@@ -28,12 +28,19 @@ description: |
PDMAs can be configured via BCDMA split channel's peer registers to match with
the configuration of the legacy peripheral.
-allOf:
- - $ref: /schemas/dma/dma-controller.yaml#
-
properties:
compatible:
- const: ti,am64-dmss-bcdma
+ enum:
+ - ti,am62a-dmss-bcdma-csirx
+ - ti,am64-dmss-bcdma
+
+ reg:
+ minItems: 3
+ maxItems: 5
+
+ reg-names:
+ minItems: 3
+ maxItems: 5
"#dma-cells":
const: 3
@@ -64,19 +71,13 @@ properties:
cell 3: ASEL value for the channel
- reg:
- maxItems: 5
-
- reg-names:
- items:
- - const: gcfg
- - const: bchanrt
- - const: rchanrt
- - const: tchanrt
- - const: ringrt
-
msi-parent: true
+ power-domains:
+ description:
+ Power domain if available
+ maxItems: 1
+
ti,asel:
$ref: /schemas/types.yaml#/definitions/uint32
description: ASEL value for non slave channels
@@ -122,10 +123,51 @@ required:
- msi-parent
- ti,sci
- ti,sci-dev-id
- - ti,sci-rm-range-bchan
- - ti,sci-rm-range-tchan
- ti,sci-rm-range-rchan
+allOf:
+ - $ref: /schemas/dma/dma-controller.yaml#
+ - $ref: /schemas/arm/keystone/ti,k3-sci-common.yaml#
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: ti,am62a-dmss-bcdma-csirx
+ then:
+ properties:
+ ti,sci-rm-range-bchan: false
+ ti,sci-rm-range-tchan: false
+
+ reg:
+ maxItems: 3
+
+ reg-names:
+ items:
+ - const: gcfg
+ - const: rchanrt
+ - const: ringrt
+
+ required:
+ - power-domains
+
+ else:
+ properties:
+ reg:
+ minItems: 5
+
+ reg-names:
+ items:
+ - const: gcfg
+ - const: bchanrt
+ - const: rchanrt
+ - const: tchanrt
+ - const: ringrt
+
+ required:
+ - ti,sci-rm-range-bchan
+ - ti,sci-rm-range-tchan
+
unevaluatedProperties: false
examples:
diff --git a/Documentation/devicetree/bindings/dma/ti/k3-pktdma.yaml b/Documentation/devicetree/bindings/dma/ti/k3-pktdma.yaml
index ea19d12a9337..a69f62f854d8 100644
--- a/Documentation/devicetree/bindings/dma/ti/k3-pktdma.yaml
+++ b/Documentation/devicetree/bindings/dma/ti/k3-pktdma.yaml
@@ -6,7 +6,7 @@
$id: http://devicetree.org/schemas/dma/ti/k3-pktdma.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Texas Instruments K3 DMSS PKTDMA Device Tree Bindings
+title: Texas Instruments K3 DMSS PKTDMA
maintainers:
- Peter Ujfalusi <peter.ujfalusi@gmail.com>
@@ -25,6 +25,7 @@ description: |
allOf:
- $ref: /schemas/dma/dma-controller.yaml#
+ - $ref: /schemas/arm/keystone/ti,k3-sci-common.yaml#
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/dma/ti/k3-udma.yaml b/Documentation/devicetree/bindings/dma/ti/k3-udma.yaml
index 6a09bbf83d46..22f6c5e2f7f4 100644
--- a/Documentation/devicetree/bindings/dma/ti/k3-udma.yaml
+++ b/Documentation/devicetree/bindings/dma/ti/k3-udma.yaml
@@ -6,7 +6,7 @@
$id: http://devicetree.org/schemas/dma/ti/k3-udma.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Texas Instruments K3 NAVSS Unified DMA Device Tree Bindings
+title: Texas Instruments K3 NAVSS Unified DMA
maintainers:
- Peter Ujfalusi <peter.ujfalusi@gmail.com>
@@ -43,7 +43,8 @@ description: |
configuration of the legacy peripheral.
allOf:
- - $ref: "../dma-controller.yaml#"
+ - $ref: ../dma-controller.yaml#
+ - $ref: /schemas/arm/keystone/ti,k3-sci-common.yaml#
properties:
"#dma-cells":
@@ -78,14 +79,6 @@ properties:
msi-parent: true
- ti,sci:
- description: phandle to TI-SCI compatible System controller node
- $ref: /schemas/types.yaml#/definitions/phandle
-
- ti,sci-dev-id:
- description: TI-SCI device id of UDMAP
- $ref: /schemas/types.yaml#/definitions/uint32
-
ti,ringacc:
description: phandle to the ring accelerator node
$ref: /schemas/types.yaml#/definitions/phandle
diff --git a/Documentation/devicetree/bindings/dma/xilinx/xilinx_dma.txt b/Documentation/devicetree/bindings/dma/xilinx/xilinx_dma.txt
index 325aca52cd43..d1700a5c36bf 100644
--- a/Documentation/devicetree/bindings/dma/xilinx/xilinx_dma.txt
+++ b/Documentation/devicetree/bindings/dma/xilinx/xilinx_dma.txt
@@ -110,7 +110,11 @@ axi_vdma_0: axivdma@40030000 {
Required properties:
- dmas: a list of <[Video DMA device phandle] [Channel ID]> pairs,
where Channel ID is '0' for write/tx and '1' for read/rx
- channel.
+ channel. For MCDMA, MM2S (write/tx) channel IDs start from
+ '0' and are in the [0-15] range. S2MM (read/rx) channel IDs
+ start from '16' and are in the [16-31] range. These channel
+ IDs are fixed irrespective of IP configuration.
+
- dma-names: a list of DMA channel names, one per "dmas" entry
Example:
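For the MCDMA case, a hypothetical client entry following the ID scheme above
(MM2S channel 0 for tx, S2MM channel 16 for rx):

	dmas = <&axi_mcdma_0 0>, <&axi_mcdma_0 16>;
	dma-names = "tx", "rx";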
diff --git a/Documentation/devicetree/bindings/dma/xilinx/xlnx,zynqmp-dma-1.0.yaml b/Documentation/devicetree/bindings/dma/xilinx/xlnx,zynqmp-dma-1.0.yaml
new file mode 100644
index 000000000000..23ada8f87526
--- /dev/null
+++ b/Documentation/devicetree/bindings/dma/xilinx/xlnx,zynqmp-dma-1.0.yaml
@@ -0,0 +1,85 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dma/xilinx/xlnx,zynqmp-dma-1.0.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Xilinx ZynqMP DMA Engine
+
+description: |
+ The Xilinx ZynqMP DMA engine supports memory to memory transfers,
+ memory to device and device to memory transfers. It also has flow
+ control and rate control support for slave/peripheral dma access.
+
+maintainers:
+ - Michael Tretter <m.tretter@pengutronix.de>
+
+allOf:
+ - $ref: ../dma-controller.yaml#
+
+properties:
+ "#dma-cells":
+ const: 1
+
+ compatible:
+ const: xlnx,zynqmp-dma-1.0
+
+ reg:
+ description: memory map for gdma/adma module access
+ maxItems: 1
+
+ interrupts:
+ description: DMA channel interrupt
+ maxItems: 1
+
+ clocks:
+ description: input clocks
+ minItems: 2
+ maxItems: 2
+
+ clock-names:
+ items:
+ - const: clk_main
+ - const: clk_apb
+
+ xlnx,bus-width:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum:
+ - 64
+ - 128
+ description: AXI bus width in bits
+
+ iommus:
+ maxItems: 1
+
+ power-domains:
+ maxItems: 1
+
+ dma-coherent:
+ description: present if dma operations are coherent
+
+required:
+ - "#dma-cells"
+ - compatible
+ - reg
+ - interrupts
+ - clocks
+ - clock-names
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/xlnx-zynqmp-clk.h>
+
+ fpd_dma_chan1: dma-controller@fd500000 {
+ compatible = "xlnx,zynqmp-dma-1.0";
+ reg = <0xfd500000 0x1000>;
+ interrupt-parent = <&gic>;
+ interrupts = <0 117 0x4>;
+ #dma-cells = <1>;
+ clock-names = "clk_main", "clk_apb";
+ clocks = <&zynqmp_clk GDMA_REF>, <&zynqmp_clk LPD_LSBUS>;
+ xlnx,bus-width = <128>;
+ dma-coherent;
+ };
diff --git a/Documentation/devicetree/bindings/dma/xilinx/xlnx,zynqmp-dpdma.yaml b/Documentation/devicetree/bindings/dma/xilinx/xlnx,zynqmp-dpdma.yaml
index 2a595b18ff6c..d6cbd95ec26d 100644
--- a/Documentation/devicetree/bindings/dma/xilinx/xlnx,zynqmp-dpdma.yaml
+++ b/Documentation/devicetree/bindings/dma/xilinx/xlnx,zynqmp-dpdma.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/dma/xilinx/xlnx,zynqmp-dpdma.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Xilinx ZynqMP DisplayPort DMA Controller Device Tree Bindings
+title: Xilinx ZynqMP DisplayPort DMA Controller
description: |
These bindings describe the DMA engine included in the Xilinx ZynqMP
@@ -16,7 +16,7 @@ maintainers:
- Laurent Pinchart <laurent.pinchart@ideasonboard.com>
allOf:
- - $ref: "../dma-controller.yaml#"
+ - $ref: ../dma-controller.yaml#
properties:
"#dma-cells":
diff --git a/Documentation/devicetree/bindings/dma/xilinx/zynqmp_dma.txt b/Documentation/devicetree/bindings/dma/xilinx/zynqmp_dma.txt
deleted file mode 100644
index 07a5a7aa9ea0..000000000000
--- a/Documentation/devicetree/bindings/dma/xilinx/zynqmp_dma.txt
+++ /dev/null
@@ -1,26 +0,0 @@
-Xilinx ZynqMP DMA engine, it does support memory to memory transfers,
-memory to device and device to memory transfers. It also has flow
-control and rate control support for slave/peripheral dma access.
-
-Required properties:
-- compatible : Should be "xlnx,zynqmp-dma-1.0"
-- reg : Memory map for gdma/adma module access.
-- interrupts : Should contain DMA channel interrupt.
-- xlnx,bus-width : Axi buswidth in bits. Should contain 128 or 64
-- clock-names : List of input clocks "clk_main", "clk_apb"
- (see clock bindings for details)
-
-Optional properties:
-- dma-coherent : Present if dma operations are coherent.
-
-Example:
-++++++++
-fpd_dma_chan1: dma@fd500000 {
- compatible = "xlnx,zynqmp-dma-1.0";
- reg = <0x0 0xFD500000 0x1000>;
- interrupt-parent = <&gic>;
- interrupts = <0 117 4>;
- clock-names = "clk_main", "clk_apb";
- xlnx,bus-width = <128>;
- dma-coherent;
-};
diff --git a/Documentation/devicetree/bindings/dsp/fsl,dsp.yaml b/Documentation/devicetree/bindings/dsp/fsl,dsp.yaml
index 7afc9f2be13a..9af40da5688e 100644
--- a/Documentation/devicetree/bindings/dsp/fsl,dsp.yaml
+++ b/Documentation/devicetree/bindings/dsp/fsl,dsp.yaml
@@ -8,6 +8,7 @@ title: NXP i.MX8 DSP core
maintainers:
- Daniel Baluta <daniel.baluta@nxp.com>
+ - Shengjiu Wang <shengjiu.wang@nxp.com>
description: |
Some boards from i.MX8 family contain a DSP core used for
@@ -19,6 +20,11 @@ properties:
- fsl,imx8qxp-dsp
- fsl,imx8qm-dsp
- fsl,imx8mp-dsp
+ - fsl,imx8ulp-dsp
+ - fsl,imx8qxp-hifi4
+ - fsl,imx8qm-hifi4
+ - fsl,imx8mp-hifi4
+ - fsl,imx8ulp-hifi4
reg:
maxItems: 1
@@ -28,37 +34,53 @@ properties:
- description: ipg clock
- description: ocram clock
- description: core clock
+ - description: debug interface clock
+ - description: message unit clock
+ minItems: 3
clock-names:
items:
- const: ipg
- const: ocram
- const: core
+ - const: debug
+ - const: mu
+ minItems: 3
power-domains:
description:
List of phandle and PM domain specifier as documented in
Documentation/devicetree/bindings/power/power_domain.txt
+ minItems: 1
maxItems: 4
mboxes:
description:
List of <&phandle type channel> - 2 channels for TXDB, 2 channels for RXDB
+ or - 1 channel for TX, 1 channel for RX, 1 channel for RXDB
(see mailbox/fsl,mu.txt)
+ minItems: 3
maxItems: 4
mbox-names:
- items:
- - const: txdb0
- - const: txdb1
- - const: rxdb0
- - const: rxdb1
+ minItems: 3
+ maxItems: 4
memory-region:
description:
phandle to a node describing reserved memory (System RAM memory)
used by DSP (see bindings/reserved-memory/reserved-memory.txt)
- maxItems: 1
+ minItems: 1
+ maxItems: 4
+
+ firmware-name:
+ description: |
+ Default name of the firmware to load to the remote processor.
+
+ fsl,dsp-ctrl:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+ Phandle to syscon block which provides access for processor enablement
required:
- compatible
@@ -70,6 +92,58 @@ required:
- mbox-names
- memory-region
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - fsl,imx8qxp-dsp
+ - fsl,imx8qm-dsp
+ - fsl,imx8qxp-hifi4
+ - fsl,imx8qm-hifi4
+ then:
+ properties:
+ power-domains:
+ minItems: 4
+ else:
+ properties:
+ power-domains:
+ maxItems: 1
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - fsl,imx8qxp-hifi4
+ - fsl,imx8qm-hifi4
+ - fsl,imx8mp-hifi4
+ - fsl,imx8ulp-hifi4
+ then:
+ properties:
+ memory-region:
+ minItems: 4
+ mboxes:
+ maxItems: 3
+ mbox-names:
+ items:
+ - const: tx
+ - const: rx
+ - const: rxdb
+ else:
+ properties:
+ memory-region:
+ maxItems: 1
+ mboxes:
+ minItems: 4
+ mbox-names:
+ items:
+ - const: txdb0
+ - const: txdb1
+ - const: rxdb0
+ - const: rxdb1
+
additionalProperties: false
examples:
@@ -91,3 +165,41 @@ examples:
mboxes = <&lsio_mu13 2 0>, <&lsio_mu13 2 1>, <&lsio_mu13 3 0>, <&lsio_mu13 3 1>;
memory-region = <&dsp_reserved>;
};
+ - |
+ #include <dt-bindings/clock/imx8mp-clock.h>
+ dsp_reserved: dsp@92400000 {
+ reg = <0x92400000 0x1000000>;
+ no-map;
+ };
+ dsp_vdev0vring0: vdev0vring0@942f0000 {
+ reg = <0x942f0000 0x8000>;
+ no-map;
+ };
+ dsp_vdev0vring1: vdev0vring1@942f8000 {
+ reg = <0x942f8000 0x8000>;
+ no-map;
+ };
+ dsp_vdev0buffer: vdev0buffer@94300000 {
+ compatible = "shared-dma-pool";
+ reg = <0x94300000 0x100000>;
+ no-map;
+ };
+
+ dsp: dsp@3b6e8000 {
+ compatible = "fsl,imx8mp-hifi4";
+ reg = <0x3b6e8000 0x88000>;
+ clocks = <&audio_blk_ctrl IMX8MP_CLK_AUDIOMIX_DSP_ROOT>,
+ <&audio_blk_ctrl IMX8MP_CLK_AUDIOMIX_OCRAMA_IPG>,
+ <&audio_blk_ctrl IMX8MP_CLK_AUDIOMIX_DSP_ROOT>,
+ <&audio_blk_ctrl IMX8MP_CLK_AUDIOMIX_DSPDBG_ROOT>;
+ clock-names = "ipg", "ocram", "core", "debug";
+ firmware-name = "imx/dsp/hifi4.bin";
+ power-domains = <&audiomix_pd>;
+ mbox-names = "tx", "rx", "rxdb";
+ mboxes = <&mu2 0 0>,
+ <&mu2 1 0>,
+ <&mu2 3 0>;
+ memory-region = <&dsp_vdev0buffer>, <&dsp_vdev0vring0>,
+ <&dsp_vdev0vring1>, <&dsp_reserved>;
+ fsl,dsp-ctrl = <&audio_blk_ctrl>;
+ };
diff --git a/Documentation/devicetree/bindings/dsp/mediatek,mt8186-dsp.yaml b/Documentation/devicetree/bindings/dsp/mediatek,mt8186-dsp.yaml
new file mode 100644
index 000000000000..88575da1e6d5
--- /dev/null
+++ b/Documentation/devicetree/bindings/dsp/mediatek,mt8186-dsp.yaml
@@ -0,0 +1,93 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dsp/mediatek,mt8186-dsp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek mt8186 DSP core
+
+maintainers:
+ - Tinghan Shen <tinghan.shen@mediatek.com>
+
+description: |
+ MediaTek mt8186 SoC contains a DSP core used for
+ advanced audio pre- and post-processing.
+
+properties:
+ compatible:
+ enum:
+ - mediatek,mt8186-dsp
+ - mediatek,mt8188-dsp
+
+ reg:
+ items:
+ - description: Address and size of the DSP config registers
+ - description: Address and size of the DSP SRAM
+ - description: Address and size of the DSP secure registers
+ - description: Address and size of the DSP bus registers
+
+ reg-names:
+ items:
+ - const: cfg
+ - const: sram
+ - const: sec
+ - const: bus
+
+ clocks:
+ items:
+ - description: mux for audio dsp clock
+ - description: mux for audio dsp local bus
+
+ clock-names:
+ items:
+ - const: audiodsp
+ - const: adsp_bus
+
+ power-domains:
+ maxItems: 1
+
+ mboxes:
+ items:
+ - description: mailbox for receiving audio DSP requests.
+ - description: mailbox for transmitting requests to audio DSP.
+
+ mbox-names:
+ items:
+ - const: rx
+ - const: tx
+
+ memory-region:
+ items:
+ - description: dma buffer between host and DSP.
+ - description: DSP system memory.
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+ - power-domains
+ - mbox-names
+ - mboxes
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/mt8186-clk.h>
+ dsp@10680000 {
+ compatible = "mediatek,mt8186-dsp";
+ reg = <0x10680000 0x2000>,
+ <0x10800000 0x100000>,
+ <0x1068b000 0x100>,
+ <0x1068f000 0x1000>;
+ reg-names = "cfg", "sram", "sec", "bus";
+ clocks = <&topckgen CLK_TOP_AUDIODSP>,
+ <&topckgen CLK_TOP_ADSP_BUS>;
+ clock-names = "audiodsp",
+ "adsp_bus";
+ power-domains = <&spm 6>;
+ mbox-names = "rx", "tx";
+ mboxes = <&adsp_mailbox0>, <&adsp_mailbox1>;
+ };
diff --git a/Documentation/devicetree/bindings/dsp/mediatek,mt8195-dsp.yaml b/Documentation/devicetree/bindings/dsp/mediatek,mt8195-dsp.yaml
new file mode 100644
index 000000000000..ca8d8661f872
--- /dev/null
+++ b/Documentation/devicetree/bindings/dsp/mediatek,mt8195-dsp.yaml
@@ -0,0 +1,105 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/dsp/mediatek,mt8195-dsp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek mt8195 DSP core
+
+maintainers:
+ - YC Hung <yc.hung@mediatek.com>
+
+description: |
+ Some boards from mt8195 contain a DSP core used for
+ advanced audio pre- and post-processing.
+
+properties:
+ compatible:
+ const: mediatek,mt8195-dsp
+
+ reg:
+ items:
+ - description: Address and size of the DSP Cfg registers
+ - description: Address and size of the DSP SRAM
+
+ reg-names:
+ items:
+ - const: cfg
+ - const: sram
+
+ clocks:
+ items:
+ - description: mux for audio dsp clock
+ - description: 26M clock
+ - description: mux for audio dsp local bus
+ - description: default audio dsp local bus clock source
+ - description: clock gate for audio dsp clock
+ - description: mux for audio dsp access external bus
+
+ clock-names:
+ items:
+ - const: adsp_sel
+ - const: clk26m_ck
+ - const: audio_local_bus
+ - const: mainpll_d7_d2
+ - const: scp_adsp_audiodsp
+ - const: audio_h
+
+ power-domains:
+ maxItems: 1
+
+ mboxes:
+ items:
+ - description: mailbox for receiving audio DSP requests.
+ - description: mailbox for transmitting requests to audio DSP.
+
+ mbox-names:
+ items:
+ - const: rx
+ - const: tx
+
+ memory-region:
+ items:
+ - description: dma buffer between host and DSP.
+ - description: DSP system memory.
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+ - clock-names
+ - memory-region
+ - power-domains
+ - mbox-names
+ - mboxes
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ dsp@10803000 {
+ compatible = "mediatek,mt8195-dsp";
+ reg = <0x10803000 0x1000>,
+ <0x10840000 0x40000>;
+ reg-names = "cfg", "sram";
+ clocks = <&topckgen 10>, //CLK_TOP_ADSP
+ <&clk26m>,
+ <&topckgen 107>, //CLK_TOP_AUDIO_LOCAL_BUS
+ <&topckgen 136>, //CLK_TOP_MAINPLL_D7_D2
+ <&scp_adsp 0>, //CLK_SCP_ADSP_AUDIODSP
+ <&topckgen 34>; //CLK_TOP_AUDIO_H
+ clock-names = "adsp_sel",
+ "clk26m_ck",
+ "audio_local_bus",
+ "mainpll_d7_d2",
+ "scp_adsp_audiodsp",
+ "audio_h";
+ memory-region = <&adsp_dma_mem_reserved>,
+ <&adsp_mem_reserved>;
+ power-domains = <&spm 6>; //MT8195_POWER_DOMAIN_ADSP
+ mbox-names = "rx", "tx";
+ mboxes = <&adsp_mailbox0>, <&adsp_mailbox1>;
+ };
diff --git a/Documentation/devicetree/bindings/dvfs/performance-domain.yaml b/Documentation/devicetree/bindings/dvfs/performance-domain.yaml
index c8b91207f34d..1dcb85a02a76 100644
--- a/Documentation/devicetree/bindings/dvfs/performance-domain.yaml
+++ b/Documentation/devicetree/bindings/dvfs/performance-domain.yaml
@@ -43,7 +43,6 @@ properties:
performance-domains:
$ref: '/schemas/types.yaml#/definitions/phandle-array'
- maxItems: 1
description:
A phandle and performance domain specifier as defined by bindings of the
performance controller/provider specified by phandle.
@@ -52,10 +51,16 @@ additionalProperties: true
examples:
- |
- performance: performance-controller@12340000 {
- compatible = "qcom,cpufreq-hw";
- reg = <0x12340000 0x1000>;
- #performance-domain-cells = <1>;
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ performance: performance-controller@11bc10 {
+ compatible = "mediatek,cpufreq-hw";
+ reg = <0 0x0011bc10 0 0x120>, <0 0x0011bd30 0 0x120>;
+
+ #performance-domain-cells = <1>;
+ };
};
// The node above defines a performance controller that is a performance
diff --git a/Documentation/devicetree/bindings/edac/dmc-520.yaml b/Documentation/devicetree/bindings/edac/dmc-520.yaml
index 3b6842e92d1b..84db3966662a 100644
--- a/Documentation/devicetree/bindings/edac/dmc-520.yaml
+++ b/Documentation/devicetree/bindings/edac/dmc-520.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/edac/dmc-520.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: ARM DMC-520 EDAC bindings
+title: ARM DMC-520 EDAC
maintainers:
- Lei Wang <lewan@microsoft.com>
diff --git a/Documentation/devicetree/bindings/eeprom/at24.txt b/Documentation/devicetree/bindings/eeprom/at24.txt
deleted file mode 100644
index c94acbb8cb0c..000000000000
--- a/Documentation/devicetree/bindings/eeprom/at24.txt
+++ /dev/null
@@ -1 +0,0 @@
-This file has been moved to at24.yaml.
diff --git a/Documentation/devicetree/bindings/eeprom/at24.yaml b/Documentation/devicetree/bindings/eeprom/at24.yaml
index 914a423ec449..84af0d5f52aa 100644
--- a/Documentation/devicetree/bindings/eeprom/at24.yaml
+++ b/Documentation/devicetree/bindings/eeprom/at24.yaml
@@ -10,6 +10,9 @@ title: I2C EEPROMs compatible with Atmel's AT24
maintainers:
- Bartosz Golaszewski <bgolaszewski@baylibre.com>
+allOf:
+ - $ref: /schemas/nvmem/nvmem.yaml
+
select:
properties:
compatible:
@@ -87,6 +90,10 @@ properties:
- items:
pattern: cs1024$
- items:
+ pattern: c1025$
+ - items:
+ pattern: cs1025$
+ - items:
pattern: c2048$
- items:
pattern: cs2048$
@@ -95,26 +102,31 @@ properties:
# These are special cases that don't conform to the above pattern.
# Each requires a standard at24 model as fallback.
- items:
- - const: nxp,se97b
- - const: atmel,24c02
+ - enum:
+ - rohm,br24g01
+ - rohm,br24t01
+ - const: atmel,24c01
- items:
- - const: renesas,r1ex24002
+ - enum:
+ - nxp,se97b
+ - renesas,r1ex24002
- const: atmel,24c02
- items:
+ - enum:
+ - onnn,cat24c04
+ - onnn,cat24c05
+ - const: atmel,24c04
+ - items:
- const: renesas,r1ex24016
- const: atmel,24c16
- items:
- const: giantec,gt24c32a
- const: atmel,24c32
- items:
- - const: renesas,r1ex24128
+ - enum:
+ - renesas,r1ex24128
+ - samsung,s524ad0xd1
- const: atmel,24c128
- - items:
- - const: rohm,br24g01
- - const: atmel,24c01
- - items:
- - const: rohm,br24t01
- - const: atmel,24c01
label:
description: Descriptive name of the EEPROM.
@@ -174,7 +186,7 @@ required:
- compatible
- reg
-additionalProperties: false
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/eeprom/at25.yaml b/Documentation/devicetree/bindings/eeprom/at25.yaml
index fbf99e346966..11e2a95a7bcb 100644
--- a/Documentation/devicetree/bindings/eeprom/at25.yaml
+++ b/Documentation/devicetree/bindings/eeprom/at25.yaml
@@ -44,8 +44,6 @@ properties:
reg:
maxItems: 1
- spi-max-frequency: true
-
pagesize:
$ref: /schemas/types.yaml#/definitions/uint32
enum: [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072]
@@ -105,6 +103,8 @@ required:
- spi-max-frequency
allOf:
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+ - $ref: /schemas/nvmem/nvmem.yaml
- if:
properties:
compatible:
@@ -117,12 +117,12 @@ allOf:
- size
- address-width
-additionalProperties: false
+unevaluatedProperties: false
examples:
- |
#include <dt-bindings/gpio/gpio.h>
- spi0 {
+ spi {
#address-cells = <1>;
#size-cells = <0>;
diff --git a/Documentation/devicetree/bindings/eeprom/microchip,93lc46b.yaml b/Documentation/devicetree/bindings/eeprom/microchip,93lc46b.yaml
new file mode 100644
index 000000000000..144e86ce5c0a
--- /dev/null
+++ b/Documentation/devicetree/bindings/eeprom/microchip,93lc46b.yaml
@@ -0,0 +1,70 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/eeprom/microchip,93lc46b.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Microchip 93xx46 SPI compatible EEPROM family
+
+maintainers:
+ - Cory Tusar <cory.tusar@pid1solutions.com>
+
+properties:
+ compatible:
+ enum:
+ - atmel,at93c46
+ - atmel,at93c46d
+ - atmel,at93c56
+ - atmel,at93c66
+ - eeprom-93xx46
+ - microchip,93lc46b
+
+ data-size:
+ description: number of data bits per word
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [8, 16]
+
+ reg:
+ description: chip select of EEPROM
+ maxItems: 1
+
+ read-only:
+ description:
+ parameter-less property which disables writes to the EEPROM
+ type: boolean
+
+ select-gpios:
+ description:
+ specifies the GPIO that needs to be asserted prior to each access
+ of the EEPROM (e.g. for SPI bus multiplexing)
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - data-size
+ - spi-max-frequency
+
+allOf:
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+ - $ref: /schemas/nvmem/nvmem.yaml
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ eeprom@0 {
+ compatible = "eeprom-93xx46";
+ reg = <0>;
+ spi-max-frequency = <1000000>;
+ spi-cs-high;
+ data-size = <8>;
+ select-gpios = <&gpio4 4 GPIO_ACTIVE_HIGH>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/example-schema.yaml b/Documentation/devicetree/bindings/example-schema.yaml
index ff6ec65145cf..f4eec4c42fb3 100644
--- a/Documentation/devicetree/bindings/example-schema.yaml
+++ b/Documentation/devicetree/bindings/example-schema.yaml
@@ -11,7 +11,7 @@ $id: http://devicetree.org/schemas/example-schema.yaml#
# $schema is the meta-schema this schema should be validated with.
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: An example schema annotated with jsonschema details
+title: An Example Device
maintainers:
- Rob Herring <robh@kernel.org>
@@ -119,7 +119,7 @@ properties:
# valid for this binding.
clock-frequency:
- # The type is set in the core schema. Per device schema only need to set
+ # The type is set in the core schema. Per-device schemas only need to set
# constraints on the possible values.
minimum: 100
maximum: 400000
@@ -133,24 +133,24 @@ properties:
# *-supply is always a single phandle, so nothing more to define.
foo-supply: true
- # Vendor specific properties
+ # Vendor-specific properties
#
- # Vendor specific properties have slightly different schema requirements than
+ # Vendor-specific properties have slightly different schema requirements than
# common properties. They must have at least a type definition and
# 'description'.
vendor,int-property:
- description: Vendor specific properties must have a description
+ description: Vendor-specific properties must have a description
$ref: /schemas/types.yaml#/definitions/uint32
enum: [2, 4, 6, 8, 10]
vendor,bool-property:
- description: Vendor specific properties must have a description. Boolean
+ description: Vendor-specific properties must have a description. Boolean
properties are one case where the json-schema 'type' keyword can be used
directly.
type: boolean
vendor,string-array-property:
- description: Vendor specific properties should reference a type in the
+ description: Vendor-specific properties should reference a type in the
core schema.
$ref: /schemas/types.yaml#/definitions/string-array
items:
@@ -158,14 +158,26 @@ properties:
- enum: [baz, boo]
vendor,property-in-standard-units-microvolt:
- description: Vendor specific properties having a standard unit suffix
+ description: Vendor-specific properties having a standard unit suffix
don't need a type.
enum: [ 100, 200, 300 ]
+ vendor,int-array-variable-length-and-constrained-values:
+ description: An array may also constrain what elements can be used (e.g.
+ their range).
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ minItems: 2
+ maxItems: 3
+ items:
+ minimum: 0
+ maximum: 8
+
child-node:
description: Child nodes are just another property from a json-schema
perspective.
type: object # DT nodes are json objects
+ # Child nodes also need additionalProperties or unevaluatedProperties
+ additionalProperties: false
properties:
vendor,a-child-node-property:
description: Child node properties have all the same schema
@@ -207,6 +219,10 @@ allOf:
then:
required:
- foo-supply
+ else:
+ # Otherwise, the property is not allowed:
+ properties:
+ foo-supply: false
# Altering schema depending on presence of properties is usually done by
# dependencies (see above), however some adjustments might require if:
- if:
@@ -235,13 +251,13 @@ examples:
# be overridden or an appropriate parent bus node should be shown (such as on
# i2c buses).
#
- # Any includes used have to be explicitly included.
+ # Any includes used have to be explicitly included. Use 4-space indentation.
- |
node@1000 {
- compatible = "vendor,soc4-ip", "vendor,soc1-ip";
- reg = <0x1000 0x80>,
- <0x3000 0x80>;
- reg-names = "core", "aux";
- interrupts = <10>;
- interrupt-controller;
+ compatible = "vendor,soc4-ip", "vendor,soc1-ip";
+ reg = <0x1000 0x80>,
+ <0x3000 0x80>;
+ reg-names = "core", "aux";
+ interrupts = <10>;
+ interrupt-controller;
};
diff --git a/Documentation/devicetree/bindings/extcon/extcon-usbc-cros-ec.yaml b/Documentation/devicetree/bindings/extcon/extcon-usbc-cros-ec.yaml
index 20e1ccfc8630..e00c8072bae9 100644
--- a/Documentation/devicetree/bindings/extcon/extcon-usbc-cros-ec.yaml
+++ b/Documentation/devicetree/bindings/extcon/extcon-usbc-cros-ec.yaml
@@ -8,7 +8,6 @@ title: ChromeOS EC USB Type-C cable and accessories detection
maintainers:
- Benson Leung <bleung@chromium.org>
- - Enric Balletbo i Serra <enric.balletbo@collabora.com>
description: |
On ChromeOS systems with USB Type C ports, the ChromeOS Embedded Controller is
@@ -35,12 +34,13 @@ additionalProperties: false
examples:
- |
- spi0 {
+ spi {
#address-cells = <1>;
#size-cells = <0>;
cros-ec@0 {
compatible = "google,cros-ec-spi";
reg = <0>;
+ interrupts = <44 0>;
usbc_extcon0: extcon0 {
compatible = "google,extcon-usbc-cros-ec";
diff --git a/Documentation/devicetree/bindings/extcon/extcon-usbc-tusb320.yaml b/Documentation/devicetree/bindings/extcon/extcon-usbc-tusb320.yaml
index 9875b4d5c356..126107dd57b1 100644
--- a/Documentation/devicetree/bindings/extcon/extcon-usbc-tusb320.yaml
+++ b/Documentation/devicetree/bindings/extcon/extcon-usbc-tusb320.yaml
@@ -11,7 +11,9 @@ maintainers:
properties:
compatible:
- const: ti,tusb320
+ enum:
+ - ti,tusb320
+ - ti,tusb320l
reg:
maxItems: 1
@@ -28,7 +30,7 @@ additionalProperties: false
examples:
- |
- i2c0 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
tusb320@61 {
diff --git a/Documentation/devicetree/bindings/extcon/maxim,max77843.yaml b/Documentation/devicetree/bindings/extcon/maxim,max77843.yaml
new file mode 100644
index 000000000000..128960545640
--- /dev/null
+++ b/Documentation/devicetree/bindings/extcon/maxim,max77843.yaml
@@ -0,0 +1,40 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/extcon/maxim,max77843.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Maxim MAX77843 MicroUSB and Companion Power Management IC Extcon
+
+maintainers:
+ - Chanwoo Choi <cw00.choi@samsung.com>
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+description: |
+ This is a part of device tree bindings for Maxim MAX77843 MicroUSB
+ Integrated Circuit (MUIC).
+
+ See also Documentation/devicetree/bindings/mfd/maxim,max77843.yaml for
+ additional information and example.
+
+properties:
+ compatible:
+ const: maxim,max77843-muic
+
+ connector:
+ $ref: /schemas/connector/usb-connector.yaml#
+
+ ports:
+ $ref: /schemas/graph.yaml#/properties/ports
+ description:
+ Any connector to the data bus of this controller should be modelled using
+ the OF graph bindings specified in bindings/graph.txt.
+ properties:
+ port:
+ $ref: /schemas/graph.yaml#/properties/port
+
+required:
+ - compatible
+ - connector
+
+additionalProperties: false
diff --git a/Documentation/devicetree/bindings/extcon/siliconmitus,sm5502-muic.yaml b/Documentation/devicetree/bindings/extcon/siliconmitus,sm5502-muic.yaml
index fd2e55088888..7a224b2f0977 100644
--- a/Documentation/devicetree/bindings/extcon/siliconmitus,sm5502-muic.yaml
+++ b/Documentation/devicetree/bindings/extcon/siliconmitus,sm5502-muic.yaml
@@ -20,11 +20,12 @@ properties:
enum:
- siliconmitus,sm5502-muic
- siliconmitus,sm5504-muic
+ - siliconmitus,sm5703-muic
reg:
maxItems: 1
- description: I2C slave address of the device. Usually 0x25 for SM5502,
- 0x14 for SM5504.
+ description: I2C slave address of the device. Usually 0x25 for SM5502
+ and SM5703, 0x14 for SM5504.
interrupts:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/firmware/amlogic,meson-gxbb-sm.yaml b/Documentation/devicetree/bindings/firmware/amlogic,meson-gxbb-sm.yaml
new file mode 100644
index 000000000000..8f50e698760e
--- /dev/null
+++ b/Documentation/devicetree/bindings/firmware/amlogic,meson-gxbb-sm.yaml
@@ -0,0 +1,39 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/firmware/amlogic,meson-gxbb-sm.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Amlogic Secure Monitor (SM)
+
+description:
+ In the Amlogic SoCs the Secure Monitor code is used to provide access to the
+ NVMEM, enable JTAG, set USB boot, etc.
+
+maintainers:
+ - Neil Armstrong <neil.armstrong@linaro.org>
+
+properties:
+ compatible:
+ oneOf:
+ - const: amlogic,meson-gxbb-sm
+ - items:
+ - const: amlogic,meson-gx-sm
+ - const: amlogic,meson-gxbb-sm
+
+ power-controller:
+ type: object
+ $ref: /schemas/power/amlogic,meson-sec-pwrc.yaml#
+
+required:
+ - compatible
+
+additionalProperties: false
+
+examples:
+ - |
+ firmware {
+ secure-monitor {
+ compatible = "amlogic,meson-gxbb-sm";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/firmware/arm,scmi.yaml b/Documentation/devicetree/bindings/firmware/arm,scmi.yaml
index 5c4c6782e052..5824c43e9893 100644
--- a/Documentation/devicetree/bindings/firmware/arm,scmi.yaml
+++ b/Documentation/devicetree/bindings/firmware/arm,scmi.yaml
@@ -5,7 +5,7 @@
$id: http://devicetree.org/schemas/firmware/arm,scmi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: System Control and Management Interface (SCMI) Message Protocol bindings
+title: System Control and Management Interface (SCMI) Message Protocol
maintainers:
- Sudeep Holla <sudeep.holla@arm.com>
@@ -38,6 +38,9 @@ properties:
The virtio transport only supports a single device.
items:
- const: arm,scmi-virtio
+ - description: SCMI compliant firmware with OP-TEE transport
+ items:
+ - const: linaro,scmi-optee
interrupts:
description:
@@ -53,17 +56,38 @@ properties:
description:
Specifies the mailboxes used to communicate with SCMI compliant
firmware.
- items:
- - const: tx
- - const: rx
+ oneOf:
+ - items:
+ - const: tx
+ - const: rx
+ minItems: 1
+ - items:
+ - const: tx
+ - const: tx_reply
+ - const: rx
+ minItems: 2
mboxes:
description:
List of phandle and mailbox channel specifiers. It should contain
- exactly one or two mailboxes, one for transmitting messages("tx")
- and another optional for receiving the notifications("rx") if supported.
+ exactly one, two or three mailboxes; the first one or two for transmitting
+ messages ("tx") and another optional ("rx") for receiving notifications
+ and delayed responses, if supported by the platform.
+ The number of mailboxes needed for transmitting messages depends on the
+ type of channels exposed by the specific underlying mailbox controller;
+ a single channel descriptor is enough if the channel is bidirectional,
+ while two channel descriptors are needed to represent the SCMI ("tx")
+ channel if the underlying mailbox channels are unidirectional.
+ The effective combination of mboxes and shmem descriptors lets the
+ SCMI subsystem determine unambiguously which type of SCMI channels are
+ made available by the underlying mailbox controller and how to use them.
+ 1 mbox / 1 shmem => SCMI TX over 1 mailbox bidirectional channel
+ 2 mbox / 2 shmem => SCMI TX and RX over 2 mailbox bidirectional channels
+ 2 mbox / 1 shmem => SCMI TX over 2 mailbox unidirectional channels
+ 3 mbox / 2 shmem => SCMI TX and RX over 3 mailbox unidirectional channels
+ Any other combination of mboxes and shmem is invalid.
minItems: 1
- maxItems: 2
+ maxItems: 3
shmem:
description:
@@ -78,13 +102,28 @@ properties:
'#size-cells':
const: 0
+ atomic-threshold-us:
+ description:
+ An optional time value, expressed in microseconds, representing, on this
+ platform, the threshold above which any SCMI command, advertised to have
+ a higher-than-threshold execution latency, should not be considered for
+ atomic mode of operation, even if requested.
+ default: 0
+
arm,smc-id:
$ref: /schemas/types.yaml#/definitions/uint32
description:
SMC id required when using smc or hvc transports
+ linaro,optee-channel-id:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Channel specifier required when using OP-TEE transport.
+
protocol@11:
- type: object
+ $ref: '#/$defs/protocol-node'
+ unevaluatedProperties: false
+
properties:
reg:
const: 0x11
@@ -96,7 +135,9 @@ properties:
- '#power-domain-cells'
protocol@13:
- type: object
+ $ref: '#/$defs/protocol-node'
+ unevaluatedProperties: false
+
properties:
reg:
const: 0x13
@@ -108,7 +149,9 @@ properties:
- '#clock-cells'
protocol@14:
- type: object
+ $ref: '#/$defs/protocol-node'
+ unevaluatedProperties: false
+
properties:
reg:
const: 0x14
@@ -120,7 +163,9 @@ properties:
- '#clock-cells'
protocol@15:
- type: object
+ $ref: '#/$defs/protocol-node'
+ unevaluatedProperties: false
+
properties:
reg:
const: 0x15
@@ -132,7 +177,9 @@ properties:
- '#thermal-sensor-cells'
protocol@16:
- type: object
+ $ref: '#/$defs/protocol-node'
+ unevaluatedProperties: false
+
properties:
reg:
const: 0x16
@@ -144,20 +191,31 @@ properties:
- '#reset-cells'
protocol@17:
- type: object
+ $ref: '#/$defs/protocol-node'
+ unevaluatedProperties: false
+
properties:
reg:
const: 0x17
regulators:
type: object
+ additionalProperties: false
description:
The list of all regulators provided by this SCMI controller.
+ properties:
+ '#address-cells':
+ const: 1
+
+ '#size-cells':
+ const: 0
+
patternProperties:
- '^regulators@[0-9a-f]+$':
+ '^regulator@[0-9a-f]+$':
type: object
$ref: "../regulator/regulator.yaml#"
+ unevaluatedProperties: false
properties:
reg:
@@ -167,10 +225,18 @@ properties:
required:
- reg
+ protocol@18:
+ $ref: '#/$defs/protocol-node'
+ unevaluatedProperties: false
+
+ properties:
+ reg:
+ const: 0x18
+
additionalProperties: false
-patternProperties:
- '^protocol@[0-9a-f]+$':
+$defs:
+ protocol-node:
type: object
description:
Each sub-node represents a protocol supported. If the platform
@@ -183,18 +249,31 @@ patternProperties:
maxItems: 1
mbox-names:
- items:
- - const: tx
- - const: rx
+ oneOf:
+ - items:
+ - const: tx
+ - const: rx
+ minItems: 1
+ - items:
+ - const: tx
+ - const: tx_reply
+ - const: rx
+ minItems: 2
mboxes:
minItems: 1
- maxItems: 2
+ maxItems: 3
shmem:
minItems: 1
maxItems: 2
+ linaro,optee-channel-id:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Channel specifier required when using the OP-TEE transport and the
+ protocol has a dedicated communication channel.
+
required:
- reg
@@ -226,6 +305,16 @@ else:
- arm,smc-id
- shmem
+ else:
+ if:
+ properties:
+ compatible:
+ contains:
+ const: linaro,scmi-optee
+ then:
+ required:
+ - linaro,optee-channel-id
+
examples:
- |
firmware {
@@ -240,6 +329,8 @@ examples:
#address-cells = <1>;
#size-cells = <0>;
+ atomic-threshold-us = <10000>;
+
scmi_devpd: protocol@11 {
reg = <0x11>;
#power-domain-cells = <1>;
@@ -289,6 +380,10 @@ examples:
};
};
};
+
+ scmi_powercap: protocol@18 {
+ reg = <0x18>;
+ };
};
};
@@ -330,7 +425,7 @@ examples:
firmware {
scmi {
compatible = "arm,scmi-smc";
- shmem = <&cpu_scp_lpri0 &cpu_scp_lpri1>;
+ shmem = <&cpu_scp_lpri0>, <&cpu_scp_lpri1>;
arm,smc-id = <0xc3000001>;
#address-cells = <1>;
@@ -340,7 +435,48 @@ examples:
reg = <0x11>;
#power-domain-cells = <1>;
};
+ };
+ };
+
+ - |
+ firmware {
+ scmi {
+ compatible = "linaro,scmi-optee";
+ linaro,optee-channel-id = <0>;
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ scmi_dvfs1: protocol@13 {
+ reg = <0x13>;
+ linaro,optee-channel-id = <1>;
+ shmem = <&cpu_optee_lpri0>;
+ #clock-cells = <1>;
+ };
+ scmi_clk0: protocol@14 {
+ reg = <0x14>;
+ #clock-cells = <1>;
+ };
+ };
+ };
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ sram@51000000 {
+ compatible = "mmio-sram";
+ reg = <0x0 0x51000000 0x0 0x10000>;
+
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges = <0 0x0 0x51000000 0x10000>;
+
+ cpu_optee_lpri0: optee-sram-section@0 {
+ compatible = "arm,scmi-shmem";
+ reg = <0x0 0x80>;
+ };
};
};
diff --git a/Documentation/devicetree/bindings/firmware/arm,scpi.yaml b/Documentation/devicetree/bindings/firmware/arm,scpi.yaml
index 23b346bd1252..241317239ffc 100644
--- a/Documentation/devicetree/bindings/firmware/arm,scpi.yaml
+++ b/Documentation/devicetree/bindings/firmware/arm,scpi.yaml
@@ -5,7 +5,7 @@
$id: http://devicetree.org/schemas/firmware/arm,scpi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: System Control and Power Interface (SCPI) Message Protocol bindings
+title: System Control and Power Interface (SCPI) Message Protocol
maintainers:
- Sudeep Holla <sudeep.holla@arm.com>
@@ -43,6 +43,7 @@ properties:
by remote SCP firmware for use by SCPI message protocol should be
specified in any order.
minItems: 1
+ maxItems: 4
shmem:
description:
@@ -51,6 +52,7 @@ properties:
be any memory reserved for the purpose of this communication between the
processors.
minItems: 1
+ maxItems: 4
power-controller:
type: object
@@ -235,8 +237,8 @@ examples:
firmware {
scpi {
compatible = "amlogic,meson-gxbb-scpi", "arm,scpi-pre-1.0";
- mboxes = <&mailbox 1 &mailbox 2>;
- shmem = <&cpu_scp_lpri &cpu_scp_hpri>;
+ mboxes = <&mailbox 1>, <&mailbox 2>;
+ shmem = <&cpu_scp_lpri>, <&cpu_scp_hpri>;
scpi_sensors1: sensors {
compatible = "amlogic,meson-gxbb-scpi-sensors", "arm,scpi-sensors";
diff --git a/Documentation/devicetree/bindings/firmware/fsl,scu.yaml b/Documentation/devicetree/bindings/firmware/fsl,scu.yaml
new file mode 100644
index 000000000000..557e524786c2
--- /dev/null
+++ b/Documentation/devicetree/bindings/firmware/fsl,scu.yaml
@@ -0,0 +1,215 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/firmware/fsl,scu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NXP i.MX System Controller Firmware (SCFW)
+
+maintainers:
+ - Dong Aisheng <aisheng.dong@nxp.com>
+
+description:
+ The System Controller Firmware (SCFW) is a low-level system function
+ which runs on a dedicated Cortex-M core to provide power, clock, and
+ resource management. It exists on some i.MX8 processors, e.g. i.MX8QM
+ (QM, QP) and i.MX8QX (QXP, DX).
+ The AP communicates with the SC using a multi-ported MU module found
+ in the LSIO subsystem. The current definition of this MU module provides
+ 5 remote AP connections to the SC to support up to 5 execution environments
+ (TZ, HV, standard Linux, etc.). The SC side of this MU module interfaces
+ with the LSIO DSC IP bus. The SC firmware will communicate with this MU
+ using the MSI bus.
+
+properties:
+ compatible:
+ const: fsl,imx-scu
+
+ clock-controller:
+ description:
+ Clock controller node that provides the clocks controlled by the SCU
+ $ref: /schemas/clock/fsl,scu-clk.yaml
+
+ gpio:
+ description:
+ Control the GPIO PINs on SCU domain over the firmware APIs
+ $ref: /schemas/gpio/fsl,imx8qxp-sc-gpio.yaml
+
+ ocotp:
+ description:
+ OCOTP controller node provided by the SCU
+ $ref: /schemas/nvmem/fsl,scu-ocotp.yaml
+
+ keys:
+ description:
+ Keys provided by the SCU
+ $ref: /schemas/input/fsl,scu-key.yaml
+
+ mboxes:
+ description:
+ A list of phandles of TX MU channels followed by a list of phandles of
+ RX MU channels. The list may include one more optional MU channel for
+ the general interrupt at the end. The expected number of channels is
+ 1 TX and 1 RX channel if the MU instance is "fsl,imx8-mu-scu"
+ compatible, and 4 TX and 4 RX channels otherwise. All MU channels must
+ be within the same MU instance; cross-instance use is not allowed. The
+ MU instance can only be one of LSIO MU0~MU4 for imx8qxp and imx8qm.
+ Users need to ensure the instance used does not conflict with other
+ execution environments such as ATF.
+ oneOf:
+ - items:
+ - description: TX0 MU channel
+ - description: RX0 MU channel
+ - items:
+ - description: TX0 MU channel
+ - description: RX0 MU channel
+ - description: optional MU channel for general interrupt
+ - items:
+ - description: TX0 MU channel
+ - description: TX1 MU channel
+ - description: TX2 MU channel
+ - description: TX3 MU channel
+ - description: RX0 MU channel
+ - description: RX1 MU channel
+ - description: RX2 MU channel
+ - description: RX3 MU channel
+ - items:
+ - description: TX0 MU channel
+ - description: TX1 MU channel
+ - description: TX2 MU channel
+ - description: TX3 MU channel
+ - description: RX0 MU channel
+ - description: RX1 MU channel
+ - description: RX2 MU channel
+ - description: RX3 MU channel
+ - description: optional MU channel for general interrupt
+
+ mbox-names:
+ oneOf:
+ - items:
+ - const: tx0
+ - const: rx0
+ - items:
+ - const: tx0
+ - const: rx0
+ - const: gip3
+ - items:
+ - const: tx0
+ - const: tx1
+ - const: tx2
+ - const: tx3
+ - const: rx0
+ - const: rx1
+ - const: rx2
+ - const: rx3
+ - items:
+ - const: tx0
+ - const: tx1
+ - const: tx2
+ - const: tx3
+ - const: rx0
+ - const: rx1
+ - const: rx2
+ - const: rx3
+ - const: gip3
+
+ pinctrl:
+ description:
+ Pin controller provided by the SCU
+ $ref: /schemas/pinctrl/fsl,scu-pinctrl.yaml
+
+ power-controller:
+ description:
+ Power domains controller node that provides the power domains
+ controlled by the SCU
+ $ref: /schemas/power/fsl,scu-pd.yaml
+
+ rtc:
+ description:
+ RTC controller provided by the SCU
+ $ref: /schemas/rtc/fsl,scu-rtc.yaml
+
+ thermal-sensor:
+ description:
+ Thermal sensor provided by the SCU
+ $ref: /schemas/thermal/fsl,scu-thermal.yaml
+
+ watchdog:
+ description:
+ Watchdog controller provided by the SCU
+ $ref: /schemas/watchdog/fsl,scu-wdt.yaml
+
+required:
+ - compatible
+ - mbox-names
+ - mboxes
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/firmware/imx/rsrc.h>
+ #include <dt-bindings/input/input.h>
+ #include <dt-bindings/pinctrl/pads-imx8qxp.h>
+
+ firmware {
+ system-controller {
+ compatible = "fsl,imx-scu";
+ mbox-names = "tx0", "tx1", "tx2", "tx3",
+ "rx0", "rx1", "rx2", "rx3",
+ "gip3";
+ mboxes = <&lsio_mu1 0 0>, <&lsio_mu1 0 1>, <&lsio_mu1 0 2>,
+ <&lsio_mu1 0 3>, <&lsio_mu1 1 0>, <&lsio_mu1 1 1>,
+ <&lsio_mu1 1 2>, <&lsio_mu1 1 3>, <&lsio_mu1 3 3>;
+
+ clock-controller {
+ compatible = "fsl,imx8qxp-clk", "fsl,scu-clk";
+ #clock-cells = <2>;
+ };
+
+ pinctrl {
+ compatible = "fsl,imx8qxp-iomuxc";
+
+ pinctrl_lpuart0: lpuart0grp {
+ fsl,pins = <
+ IMX8QXP_UART0_RX_ADMA_UART0_RX 0x06000020
+ IMX8QXP_UART0_TX_ADMA_UART0_TX 0x06000020
+ >;
+ };
+ };
+
+ ocotp {
+ compatible = "fsl,imx8qxp-scu-ocotp";
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ fec_mac0: mac@2c4 {
+ reg = <0x2c4 6>;
+ };
+ };
+
+ power-controller {
+ compatible = "fsl,imx8qxp-scu-pd", "fsl,scu-pd";
+ #power-domain-cells = <1>;
+ };
+
+ rtc {
+ compatible = "fsl,imx8qxp-sc-rtc";
+ };
+
+ keys {
+ compatible = "fsl,imx8qxp-sc-key", "fsl,imx-sc-key";
+ linux,keycodes = <KEY_POWER>;
+ };
+
+ watchdog {
+ compatible = "fsl,imx8qxp-sc-wdt", "fsl,imx-sc-wdt";
+ timeout-sec = <60>;
+ };
+
+ thermal-sensor {
+ compatible = "fsl,imx8qxp-sc-thermal", "fsl,imx-sc-thermal";
+ #thermal-sensor-cells = <1>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/firmware/intel,ixp4xx-network-processing-engine.yaml b/Documentation/devicetree/bindings/firmware/intel,ixp4xx-network-processing-engine.yaml
index c435c9f369a4..9a785bbaafb7 100644
--- a/Documentation/devicetree/bindings/firmware/intel,ixp4xx-network-processing-engine.yaml
+++ b/Documentation/devicetree/bindings/firmware/intel,ixp4xx-network-processing-engine.yaml
@@ -37,6 +37,20 @@ properties:
should be named with the instance number of the NPE engine used for
the crypto engine.
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+patternProperties:
+ hss@[0-9]+$:
+ $ref: /schemas/net/intel,ixp4xx-hss.yaml#
+ type: object
+ description: Optional node for the High Speed Serial link (HSS); the
+ node should be named with the instance number of the NPE engine
+ used for the HSS.
+
required:
- compatible
- reg
@@ -45,9 +59,30 @@ additionalProperties: false
examples:
- |
+ #include <dt-bindings/gpio/gpio.h>
+
npe: npe@c8006000 {
compatible = "intel,ixp4xx-network-processing-engine";
reg = <0xc8006000 0x1000>, <0xc8007000 0x1000>, <0xc8008000 0x1000>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ hss@0 {
+ compatible = "intel,ixp4xx-hss";
+ reg = <0>;
+ intel,npe-handle = <&npe 0>;
+ intel,queue-chl-rxtrig = <&qmgr 12>;
+ intel,queue-chl-txready = <&qmgr 34>;
+ intel,queue-pkt-rx = <&qmgr 13>;
+ intel,queue-pkt-tx = <&qmgr 14>, <&qmgr 15>, <&qmgr 16>, <&qmgr 17>;
+ intel,queue-pkt-rxfree = <&qmgr 18>, <&qmgr 19>, <&qmgr 20>, <&qmgr 21>;
+ intel,queue-pkt-txdone = <&qmgr 22>;
+ cts-gpios = <&gpio0 10 GPIO_ACTIVE_LOW>;
+ rts-gpios = <&gpio0 14 GPIO_ACTIVE_LOW>;
+ dcd-gpios = <&gpio0 6 GPIO_ACTIVE_LOW>;
+ dtr-gpios = <&gpio_74 2 GPIO_ACTIVE_LOW>;
+ clk-internal-gpios = <&gpio_74 0 GPIO_ACTIVE_HIGH>;
+ };
crypto {
compatible = "intel,ixp4xx-crypto";
diff --git a/Documentation/devicetree/bindings/firmware/meson/meson_sm.txt b/Documentation/devicetree/bindings/firmware/meson/meson_sm.txt
deleted file mode 100644
index c248cd44f727..000000000000
--- a/Documentation/devicetree/bindings/firmware/meson/meson_sm.txt
+++ /dev/null
@@ -1,15 +0,0 @@
-* Amlogic Secure Monitor
-
-In the Amlogic SoCs the Secure Monitor code is used to provide access to the
-NVMEM, enable JTAG, set USB boot, etc...
-
-Required properties for the secure monitor node:
-- compatible: Should be "amlogic,meson-gxbb-sm"
-
-Example:
-
- firmware {
- sm: secure-monitor {
- compatible = "amlogic,meson-gxbb-sm";
- };
- };
diff --git a/Documentation/devicetree/bindings/firmware/nvidia,tegra186-bpmp.txt b/Documentation/devicetree/bindings/firmware/nvidia,tegra186-bpmp.txt
deleted file mode 100644
index e44a13bc06ed..000000000000
--- a/Documentation/devicetree/bindings/firmware/nvidia,tegra186-bpmp.txt
+++ /dev/null
@@ -1,107 +0,0 @@
-NVIDIA Tegra Boot and Power Management Processor (BPMP)
-
-The BPMP is a specific processor in Tegra chip, which is designed for
-booting process handling and offloading the power management, clock
-management, and reset control tasks from the CPU. The binding document
-defines the resources that would be used by the BPMP firmware driver,
-which can create the interprocessor communication (IPC) between the CPU
-and BPMP.
-
-Required properties:
-- compatible
- Array of strings
- One of:
- - "nvidia,tegra186-bpmp"
-- mboxes : The phandle of mailbox controller and the mailbox specifier.
-- shmem : List of the phandle of the TX and RX shared memory area that
- the IPC between CPU and BPMP is based on.
-- #clock-cells : Should be 1.
-- #power-domain-cells : Should be 1.
-- #reset-cells : Should be 1.
-
-This node is a mailbox consumer. See the following files for details of
-the mailbox subsystem, and the specifiers implemented by the relevant
-provider(s):
-
-- .../mailbox/mailbox.txt
-- .../mailbox/nvidia,tegra186-hsp.txt
-
-This node is a clock, power domain, and reset provider. See the following
-files for general documentation of those features, and the specifiers
-implemented by this node:
-
-- .../clock/clock-bindings.txt
-- <dt-bindings/clock/tegra186-clock.h>
-- ../power/power-domain.yaml
-- <dt-bindings/power/tegra186-powergate.h>
-- .../reset/reset.txt
-- <dt-bindings/reset/tegra186-reset.h>
-
-The BPMP implements some services which must be represented by separate nodes.
-For example, it can provide access to certain I2C controllers, and the I2C
-bindings represent each I2C controller as a device tree node. Such nodes should
-be nested directly inside the main BPMP node.
-
-Software can determine whether a child node of the BPMP node represents a device
-by checking for a compatible property. Any node with a compatible property
-represents a device that can be instantiated. Nodes without a compatible
-property may be used to provide configuration information regarding the BPMP
-itself, although no such configuration nodes are currently defined by this
-binding.
-
-The BPMP firmware defines no single global name-/numbering-space for such
-services. Put another way, the numbering scheme for I2C buses is distinct from
-the numbering scheme for any other service the BPMP may provide (e.g. a future
-hypothetical SPI bus service). As such, child device nodes will have no reg
-property, and the BPMP node will have no #address-cells or #size-cells property.
-
-The shared memory bindings for BPMP
------------------------------------
-
-The shared memory area for the IPC TX and RX between CPU and BPMP are
-predefined and work on top of sysram, which is an SRAM inside the chip.
-
-See ".../sram/sram.txt" for the bindings.
-
-Example:
-
-hsp_top0: hsp@3c00000 {
- ...
- #mbox-cells = <2>;
-};
-
-sysram@30000000 {
- compatible = "nvidia,tegra186-sysram", "mmio-sram";
- reg = <0x0 0x30000000 0x0 0x50000>;
- #address-cells = <2>;
- #size-cells = <2>;
- ranges = <0 0x0 0x0 0x30000000 0x0 0x50000>;
-
- cpu_bpmp_tx: shmem@4e000 {
- compatible = "nvidia,tegra186-bpmp-shmem";
- reg = <0x0 0x4e000 0x0 0x1000>;
- label = "cpu-bpmp-tx";
- pool;
- };
-
- cpu_bpmp_rx: shmem@4f000 {
- compatible = "nvidia,tegra186-bpmp-shmem";
- reg = <0x0 0x4f000 0x0 0x1000>;
- label = "cpu-bpmp-rx";
- pool;
- };
-};
-
-bpmp {
- compatible = "nvidia,tegra186-bpmp";
- mboxes = <&hsp_top0 TEGRA_HSP_MBOX_TYPE_DB TEGRA_HSP_DB_MASTER_BPMP>;
- shmem = <&cpu_bpmp_tx &cpu_bpmp_rx>;
- #clock-cells = <1>;
- #power-domain-cells = <1>;
- #reset-cells = <1>;
-
- i2c {
- compatible = "...";
- ...
- };
-};
diff --git a/Documentation/devicetree/bindings/firmware/nvidia,tegra186-bpmp.yaml b/Documentation/devicetree/bindings/firmware/nvidia,tegra186-bpmp.yaml
new file mode 100644
index 000000000000..833c07f1685c
--- /dev/null
+++ b/Documentation/devicetree/bindings/firmware/nvidia,tegra186-bpmp.yaml
@@ -0,0 +1,186 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/firmware/nvidia,tegra186-bpmp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra Boot and Power Management Processor (BPMP)
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+description: |
+ The BPMP is a dedicated processor in the Tegra chip, designed to
+ handle the boot process and to offload the power management, clock
+ management, and reset control tasks from the CPU. This binding
+ document defines the resources used by the BPMP firmware driver,
+ which creates the interprocessor communication (IPC) between the
+ CPU and BPMP.
+
+ This node is a mailbox consumer. See the following files for details
+ of the mailbox subsystem, and the specifiers implemented by the
+ relevant provider(s):
+
+ - .../mailbox/mailbox.txt
+ - .../mailbox/nvidia,tegra186-hsp.yaml
+
+ This node is a clock, power domain, and reset provider. See the
+ following files for general documentation of those features, and the
+ specifiers implemented by this node:
+
+ - .../clock/clock-bindings.txt
+ - <dt-bindings/clock/tegra186-clock.h>
+ - ../power/power-domain.yaml
+ - <dt-bindings/power/tegra186-powergate.h>
+ - .../reset/reset.txt
+ - <dt-bindings/reset/tegra186-reset.h>
+
+ The BPMP implements some services which must be represented by
+ separate nodes. For example, it can provide access to certain I2C
+ controllers, and the I2C bindings represent each I2C controller as a
+ device tree node. Such nodes should be nested directly inside the main
+ BPMP node.
+
+ Software can determine whether a child node of the BPMP node
+ represents a device by checking for a compatible property. Any node
+ with a compatible property represents a device that can be
+ instantiated. Nodes without a compatible property may be used to
+ provide configuration information regarding the BPMP itself, although
+ no such configuration nodes are currently defined by this binding.
+
+ The BPMP firmware defines no single global name-/numbering-space for
+ such services. Put another way, the numbering scheme for I2C buses is
+ distinct from the numbering scheme for any other service the BPMP may
+ provide (e.g. a future hypothetical SPI bus service). As such, child
+ device nodes will have no reg property, and the BPMP node will have no
+ "#address-cells" or "#size-cells" property.
+
+ The shared memory area for the IPC TX and RX between CPU and BPMP are
+ predefined and work on top of sysram, which is an SRAM inside the
+ chip. See ".../sram/sram.yaml" for the bindings.
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - nvidia,tegra194-bpmp
+ - nvidia,tegra234-bpmp
+ - const: nvidia,tegra186-bpmp
+ - const: nvidia,tegra186-bpmp
+
+ mboxes:
+ description: A phandle and channel specifier for the mailbox used to
+ communicate with the BPMP.
+ maxItems: 1
+
+ shmem:
+ description: List of phandles to the TX and RX shared memory areas
+ that the IPC between the CPU and BPMP is based on.
+ minItems: 2
+ maxItems: 2
+
+ "#clock-cells":
+ const: 1
+
+ "#power-domain-cells":
+ const: 1
+
+ "#reset-cells":
+ const: 1
+
+ interconnects:
+ items:
+ - description: memory read client
+ - description: memory write client
+ - description: DMA read client
+ - description: DMA write client
+
+ interconnect-names:
+ items:
+ - const: read
+ - const: write
+ - const: dma-mem # dma-read
+ - const: dma-write
+
+ iommus:
+ maxItems: 1
+
+ i2c:
+ type: object
+
+ thermal:
+ type: object
+
+additionalProperties: false
+
+required:
+ - compatible
+ - mboxes
+ - shmem
+ - "#clock-cells"
+ - "#power-domain-cells"
+ - "#reset-cells"
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/mailbox/tegra186-hsp.h>
+ #include <dt-bindings/memory/tegra186-mc.h>
+
+ hsp_top0: hsp@3c00000 {
+ compatible = "nvidia,tegra186-hsp";
+ reg = <0x03c00000 0xa0000>;
+ interrupts = <GIC_SPI 176 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "doorbell";
+ #mbox-cells = <2>;
+ };
+
+ sram@30000000 {
+ compatible = "nvidia,tegra186-sysram", "mmio-sram";
+ reg = <0x30000000 0x50000>;
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges = <0x0 0x30000000 0x50000>;
+
+ cpu_bpmp_tx: sram@4e000 {
+ reg = <0x4e000 0x1000>;
+ label = "cpu-bpmp-tx";
+ pool;
+ };
+
+ cpu_bpmp_rx: sram@4f000 {
+ reg = <0x4f000 0x1000>;
+ label = "cpu-bpmp-rx";
+ pool;
+ };
+ };
+
+ bpmp {
+ compatible = "nvidia,tegra186-bpmp";
+ interconnects = <&mc TEGRA186_MEMORY_CLIENT_BPMPR &emc>,
+ <&mc TEGRA186_MEMORY_CLIENT_BPMPW &emc>,
+ <&mc TEGRA186_MEMORY_CLIENT_BPMPDMAR &emc>,
+ <&mc TEGRA186_MEMORY_CLIENT_BPMPDMAW &emc>;
+ interconnect-names = "read", "write", "dma-mem", "dma-write";
+ iommus = <&smmu TEGRA186_SID_BPMP>;
+ mboxes = <&hsp_top0 TEGRA_HSP_MBOX_TYPE_DB
+ TEGRA_HSP_DB_MASTER_BPMP>;
+ shmem = <&cpu_bpmp_tx>, <&cpu_bpmp_rx>;
+ #clock-cells = <1>;
+ #power-domain-cells = <1>;
+ #reset-cells = <1>;
+
+ i2c {
+ compatible = "nvidia,tegra186-bpmp-i2c";
+ nvidia,bpmp-bus-id = <5>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+ };
+
+ thermal {
+ compatible = "nvidia,tegra186-bpmp-thermal";
+ #thermal-sensor-cells = <1>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/firmware/qcom,scm.txt b/Documentation/devicetree/bindings/firmware/qcom,scm.txt
deleted file mode 100644
index a7333ad938d2..000000000000
--- a/Documentation/devicetree/bindings/firmware/qcom,scm.txt
+++ /dev/null
@@ -1,52 +0,0 @@
-QCOM Secure Channel Manager (SCM)
-
-Qualcomm processors include an interface to communicate to the secure firmware.
-This interface allows for clients to request different types of actions. These
-can include CPU power up/down, HDCP requests, loading of firmware, and other
-assorted actions.
-
-Required properties:
-- compatible: must contain one of the following:
- * "qcom,scm-apq8064"
- * "qcom,scm-apq8084"
- * "qcom,scm-ipq4019"
- * "qcom,scm-ipq806x"
- * "qcom,scm-ipq8074"
- * "qcom,scm-mdm9607"
- * "qcom,scm-msm8660"
- * "qcom,scm-msm8916"
- * "qcom,scm-msm8960"
- * "qcom,scm-msm8974"
- * "qcom,scm-msm8994"
- * "qcom,scm-msm8996"
- * "qcom,scm-msm8998"
- * "qcom,scm-sc7180"
- * "qcom,scm-sc7280"
- * "qcom,scm-sdm845"
- * "qcom,scm-sdx55"
- * "qcom,scm-sm8150"
- * "qcom,scm-sm8250"
- * "qcom,scm-sm8350"
- and:
- * "qcom,scm"
-- clocks: Specifies clocks needed by the SCM interface, if any:
- * core clock required for "qcom,scm-apq8064", "qcom,scm-msm8660" and
- "qcom,scm-msm8960"
- * core, iface and bus clocks required for "qcom,scm-apq8084",
- "qcom,scm-msm8916" and "qcom,scm-msm8974"
-- clock-names: Must contain "core" for the core clock, "iface" for the interface
- clock and "bus" for the bus clock per the requirements of the compatible.
-- qcom,dload-mode: phandle to the TCSR hardware block and offset of the
- download mode control register (optional)
-
-Example for MSM8916:
-
- firmware {
- scm {
- compatible = "qcom,msm8916", "qcom,scm";
- clocks = <&gcc GCC_CRYPTO_CLK> ,
- <&gcc GCC_CRYPTO_AXI_CLK>,
- <&gcc GCC_CRYPTO_AHB_CLK>;
- clock-names = "core", "bus", "iface";
- };
- };
diff --git a/Documentation/devicetree/bindings/firmware/qcom,scm.yaml b/Documentation/devicetree/bindings/firmware/qcom,scm.yaml
new file mode 100644
index 000000000000..367d04ad1923
--- /dev/null
+++ b/Documentation/devicetree/bindings/firmware/qcom,scm.yaml
@@ -0,0 +1,212 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/firmware/qcom,scm.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: QCOM Secure Channel Manager (SCM)
+
+description: |
+ Qualcomm processors include an interface to communicate with the secure firmware.
+ This interface allows clients to request different types of actions.
+ These can include CPU power up/down, HDCP requests, loading of firmware,
+ and other assorted actions.
+
+maintainers:
+ - Bjorn Andersson <bjorn.andersson@linaro.org>
+ - Robert Marko <robimarko@gmail.com>
+ - Guru Das Srinagesh <quic_gurus@quicinc.com>
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - qcom,scm-apq8064
+ - qcom,scm-apq8084
+ - qcom,scm-ipq4019
+ - qcom,scm-ipq5332
+ - qcom,scm-ipq6018
+ - qcom,scm-ipq806x
+ - qcom,scm-ipq8074
+ - qcom,scm-ipq9574
+ - qcom,scm-mdm9607
+ - qcom,scm-msm8226
+ - qcom,scm-msm8660
+ - qcom,scm-msm8916
+ - qcom,scm-msm8953
+ - qcom,scm-msm8960
+ - qcom,scm-msm8974
+ - qcom,scm-msm8976
+ - qcom,scm-msm8994
+ - qcom,scm-msm8996
+ - qcom,scm-msm8998
+ - qcom,scm-qcm2290
+ - qcom,scm-qcs404
+ - qcom,scm-qdu1000
+ - qcom,scm-sa8775p
+ - qcom,scm-sc7180
+ - qcom,scm-sc7280
+ - qcom,scm-sc8180x
+ - qcom,scm-sc8280xp
+ - qcom,scm-sdm670
+ - qcom,scm-sdm845
+ - qcom,scm-sdx55
+ - qcom,scm-sdx65
+ - qcom,scm-sm6115
+ - qcom,scm-sm6125
+ - qcom,scm-sm6350
+ - qcom,scm-sm6375
+ - qcom,scm-sm8150
+ - qcom,scm-sm8250
+ - qcom,scm-sm8350
+ - qcom,scm-sm8450
+ - qcom,scm-sm8550
+ - const: qcom,scm
+
+ clocks:
+ minItems: 1
+ maxItems: 3
+
+ clock-names:
+ minItems: 1
+ maxItems: 3
+
+ interconnects:
+ maxItems: 1
+
+ interconnect-names:
+ maxItems: 1
+
+ '#reset-cells':
+ const: 1
+
+ interrupts:
+ description:
+ The wait-queue interrupt that the firmware raises as part of the
+ handshake protocol to handle sleeping SCM calls.
+ maxItems: 1
+
+ qcom,dload-mode:
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ items:
+ - items:
+ - description: phandle to TCSR hardware block
+ - description: offset of the download mode control register
+ description: TCSR hardware block
+
+allOf:
+ # Clocks
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,scm-apq8064
+ - qcom,scm-apq8084
+ - qcom,scm-mdm9607
+ - qcom,scm-msm8226
+ - qcom,scm-msm8660
+ - qcom,scm-msm8916
+ - qcom,scm-msm8953
+ - qcom,scm-msm8960
+ - qcom,scm-msm8974
+ - qcom,scm-msm8976
+ - qcom,scm-qcm2290
+ - qcom,scm-sm6375
+ then:
+ required:
+ - clocks
+ - clock-names
+ else:
+ properties:
+ clock-names: false
+ clocks: false
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,scm-apq8064
+ - qcom,scm-msm8660
+ - qcom,scm-msm8960
+ - qcom,scm-qcm2290
+ - qcom,scm-sm6375
+ then:
+ properties:
+ clock-names:
+ items:
+ - const: core
+
+ clocks:
+ maxItems: 1
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,scm-apq8084
+ - qcom,scm-mdm9607
+ - qcom,scm-msm8226
+ - qcom,scm-msm8916
+ - qcom,scm-msm8953
+ - qcom,scm-msm8974
+ - qcom,scm-msm8976
+ then:
+ properties:
+ clock-names:
+ items:
+ - const: core
+ - const: bus
+ - const: iface
+
+ clocks:
+ minItems: 3
+ maxItems: 3
+
+ # Interconnects
+ - if:
+ not:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,scm-qdu1000
+ - qcom,scm-sm8450
+ - qcom,scm-sm8550
+ then:
+ properties:
+ interconnects: false
+
+ # Interrupts
+ - if:
+ not:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,scm-sm8450
+ - qcom,scm-sm8550
+ then:
+ properties:
+ interrupts: false
+
+required:
+ - compatible
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/qcom,gcc-msm8916.h>
+
+ firmware {
+ scm {
+ compatible = "qcom,scm-msm8916", "qcom,scm";
+ clocks = <&gcc GCC_CRYPTO_CLK>,
+ <&gcc GCC_CRYPTO_AXI_CLK>,
+ <&gcc GCC_CRYPTO_AHB_CLK>;
+ clock-names = "core", "bus", "iface";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/firmware/qemu,fw-cfg-mmio.yaml b/Documentation/devicetree/bindings/firmware/qemu,fw-cfg-mmio.yaml
new file mode 100644
index 000000000000..3faae3236665
--- /dev/null
+++ b/Documentation/devicetree/bindings/firmware/qemu,fw-cfg-mmio.yaml
@@ -0,0 +1,54 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/firmware/qemu,fw-cfg-mmio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: QEMU Firmware Configuration
+
+maintainers:
+ - Rob Herring <robh@kernel.org>
+
+description: |
+ Various QEMU emulation / virtualization targets provide the following
+ Firmware Configuration interface on the "virt" machine type:
+
+ - A write-only, 16-bit wide selector (or control) register,
+ - a read-write, 64-bit wide data register.
+
+ QEMU exposes the control and data registers to guests as memory-mapped
+ registers; their location is communicated to the guest's UEFI firmware in the
+ DTB that QEMU places at the bottom of the guest's DRAM.
+
+ The authoritative guest-side hardware interface documentation for the fw_cfg
+ device can be found in "docs/specs/fw_cfg.txt" in the QEMU source tree.
+
+properties:
+ compatible:
+ const: qemu,fw-cfg-mmio
+
+ reg:
+ maxItems: 1
+ description: |
+ * Bytes 0x0 to 0x7 cover the data register.
+ * Bytes 0x8 to 0x9 cover the selector register.
+ * Further registers may be appended to the region in case of future interface
+ revisions / feature bits.
+
+ dma-coherent: true
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ fw-cfg@9020000 {
+ compatible = "qemu,fw-cfg-mmio";
+ reg = <0x9020000 0xa>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/fpga/fpga-region.txt b/Documentation/devicetree/bindings/fpga/fpga-region.txt
index 7d3515264838..6694ef29a267 100644
--- a/Documentation/devicetree/bindings/fpga/fpga-region.txt
+++ b/Documentation/devicetree/bindings/fpga/fpga-region.txt
@@ -330,7 +330,7 @@ succeeded.
The Device Tree Overlay will contain:
* "target-path" or "target"
- The insertion point where the the contents of the overlay will go into the
+ The insertion point where the contents of the overlay will go into the
live tree. target-path is a full path, while target is a phandle.
* "ranges"
The address space mapping from processor to FPGA bus(ses).
diff --git a/Documentation/devicetree/bindings/fpga/lattice,sysconfig.yaml b/Documentation/devicetree/bindings/fpga/lattice,sysconfig.yaml
new file mode 100644
index 000000000000..4fb05eb84e2a
--- /dev/null
+++ b/Documentation/devicetree/bindings/fpga/lattice,sysconfig.yaml
@@ -0,0 +1,81 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/fpga/lattice,sysconfig.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Lattice Slave SPI sysCONFIG FPGA manager
+
+maintainers:
+ - Ivan Bornyakov <i.bornyakov@metrotek.ru>
+
+description: |
+ The Lattice sysCONFIG port, which is used for FPGA configuration, has,
+ among others, a Slave Serial Peripheral Interface. Only full
+ reconfiguration is supported.
+
+ Programming of the ECP5 is done by writing an uncompressed bitstream
+ image in .bit format into the FPGA's SRAM configuration memory.
+
+properties:
+ compatible:
+ enum:
+ - lattice,sysconfig-ecp5
+
+ reg:
+ maxItems: 1
+
+ program-gpios:
+ description:
+ A GPIO line connected to PROGRAMN (active low) pin of the device.
+ Initiates configuration sequence.
+ maxItems: 1
+
+ init-gpios:
+ description:
+ A GPIO line connected to INITN (active low) pin of the device.
+ Indicates that the FPGA is ready to be configured.
+ maxItems: 1
+
+ done-gpios:
+ description:
+ A GPIO line connected to DONE (active high) pin of the device.
+ Indicates that the configuration sequence is complete.
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+allOf:
+ - $ref: /schemas/spi/spi-peripheral-props.yaml
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: lattice,sysconfig-ecp5
+ then:
+ properties:
+ spi-max-frequency:
+ maximum: 60000000
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ fpga-mgr@0 {
+ compatible = "lattice,sysconfig-ecp5";
+ reg = <0>;
+ spi-max-frequency = <20000000>;
+ program-gpios = <&gpio3 4 GPIO_ACTIVE_LOW>;
+ init-gpios = <&gpio3 3 GPIO_ACTIVE_LOW>;
+ done-gpios = <&gpio3 2 GPIO_ACTIVE_HIGH>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/fpga/microchip,mpf-spi-fpga-mgr.yaml b/Documentation/devicetree/bindings/fpga/microchip,mpf-spi-fpga-mgr.yaml
new file mode 100644
index 000000000000..527532f039ce
--- /dev/null
+++ b/Documentation/devicetree/bindings/fpga/microchip,mpf-spi-fpga-mgr.yaml
@@ -0,0 +1,45 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/fpga/microchip,mpf-spi-fpga-mgr.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Microchip Polarfire FPGA manager
+
+maintainers:
+ - Ivan Bornyakov <i.bornyakov@metrotek.ru>
+
+description:
+ Device Tree Bindings for Microchip Polarfire FPGA Manager using slave SPI to
+ load the bitstream in .dat format.
+
+properties:
+ compatible:
+ enum:
+ - microchip,mpf-spi-fpga-mgr
+
+ reg:
+ description: SPI chip select
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+allOf:
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ fpga_mgr@0 {
+ compatible = "microchip,mpf-spi-fpga-mgr";
+ spi-max-frequency = <20000000>;
+ reg = <0>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/fpga/xilinx-pr-decoupler.txt b/Documentation/devicetree/bindings/fpga/xilinx-pr-decoupler.txt
deleted file mode 100644
index 0acdfa6d62a4..000000000000
--- a/Documentation/devicetree/bindings/fpga/xilinx-pr-decoupler.txt
+++ /dev/null
@@ -1,54 +0,0 @@
-Xilinx LogiCORE Partial Reconfig Decoupler Softcore
-
-The Xilinx LogiCORE Partial Reconfig Decoupler manages one or more
-decouplers / fpga bridges.
-The controller can decouple/disable the bridges which prevents signal
-changes from passing through the bridge. The controller can also
-couple / enable the bridges which allows traffic to pass through the
-bridge normally.
-
-Xilinx LogiCORE Dynamic Function eXchange(DFX) AXI shutdown manager
-Softcore is compatible with the Xilinx LogiCORE pr-decoupler.
-
-The Dynamic Function eXchange AXI shutdown manager prevents AXI traffic
-from passing through the bridge. The controller safely handles AXI4MM
-and AXI4-Lite interfaces on a Reconfigurable Partition when it is
-undergoing dynamic reconfiguration, preventing the system deadlock
-that can occur if AXI transactions are interrupted by DFX
-
-The Driver supports only MMIO handling. A PR region can have multiple
-PR Decouplers which can be handled independently or chained via decouple/
-decouple_status signals.
-
-Required properties:
-- compatible : Should contain "xlnx,pr-decoupler-1.00" followed by
- "xlnx,pr-decoupler" or
- "xlnx,dfx-axi-shutdown-manager-1.00" followed by
- "xlnx,dfx-axi-shutdown-manager"
-- regs : base address and size for decoupler module
-- clocks : input clock to IP
-- clock-names : should contain "aclk"
-
-See Documentation/devicetree/bindings/fpga/fpga-region.txt and
-Documentation/devicetree/bindings/fpga/fpga-bridge.txt for generic bindings.
-
-Example:
-Partial Reconfig Decoupler:
- fpga-bridge@100000450 {
- compatible = "xlnx,pr-decoupler-1.00",
- "xlnx-pr-decoupler";
- regs = <0x10000045 0x10>;
- clocks = <&clkc 15>;
- clock-names = "aclk";
- bridge-enable = <0>;
- };
-
-Dynamic Function eXchange AXI shutdown manager:
- fpga-bridge@100000450 {
- compatible = "xlnx,dfx-axi-shutdown-manager-1.00",
- "xlnx,dfx-axi-shutdown-manager";
- regs = <0x10000045 0x10>;
- clocks = <&clkc 15>;
- clock-names = "aclk";
- bridge-enable = <0>;
- };
diff --git a/Documentation/devicetree/bindings/fpga/xilinx-slave-serial.txt b/Documentation/devicetree/bindings/fpga/xilinx-slave-serial.txt
deleted file mode 100644
index 5ef659c1394d..000000000000
--- a/Documentation/devicetree/bindings/fpga/xilinx-slave-serial.txt
+++ /dev/null
@@ -1,51 +0,0 @@
-Xilinx Slave Serial SPI FPGA Manager
-
-Xilinx Spartan-6 and 7 Series FPGAs support a method of loading the
-bitstream over what is referred to as "slave serial" interface.
-The slave serial link is not technically SPI, and might require extra
-circuits in order to play nicely with other SPI slaves on the same bus.
-
-See:
-- https://www.xilinx.com/support/documentation/user_guides/ug380.pdf
-- https://www.xilinx.com/support/documentation/user_guides/ug470_7Series_Config.pdf
-- https://www.xilinx.com/support/documentation/application_notes/xapp583-fpga-configuration.pdf
-
-Required properties:
-- compatible: should contain "xlnx,fpga-slave-serial"
-- reg: spi chip select of the FPGA
-- prog_b-gpios: config pin (referred to as PROGRAM_B in the manual)
-- done-gpios: config status pin (referred to as DONE in the manual)
-
-Optional properties:
-- init-b-gpios: initialization status and configuration error pin
- (referred to as INIT_B in the manual)
-
-Example for full FPGA configuration:
-
- fpga-region0 {
- compatible = "fpga-region";
- fpga-mgr = <&fpga_mgr_spi>;
- #address-cells = <0x1>;
- #size-cells = <0x1>;
- };
-
- spi1: spi@10680 {
- compatible = "marvell,armada-xp-spi", "marvell,orion-spi";
- pinctrl-0 = <&spi0_pins>;
- pinctrl-names = "default";
- #address-cells = <1>;
- #size-cells = <0>;
- cell-index = <1>;
- interrupts = <92>;
- clocks = <&coreclk 0>;
-
- fpga_mgr_spi: fpga-mgr@0 {
- compatible = "xlnx,fpga-slave-serial";
- spi-max-frequency = <60000000>;
- spi-cpha;
- reg = <0>;
- prog_b-gpios = <&gpio0 29 GPIO_ACTIVE_LOW>;
- init-b-gpios = <&gpio0 28 GPIO_ACTIVE_LOW>;
- done-gpios = <&gpio0 9 GPIO_ACTIVE_HIGH>;
- };
- };
diff --git a/Documentation/devicetree/bindings/fpga/xilinx-zynq-fpga-mgr.yaml b/Documentation/devicetree/bindings/fpga/xilinx-zynq-fpga-mgr.yaml
index 29daca4be47f..f47b6140a742 100644
--- a/Documentation/devicetree/bindings/fpga/xilinx-zynq-fpga-mgr.yaml
+++ b/Documentation/devicetree/bindings/fpga/xilinx-zynq-fpga-mgr.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/fpga/xilinx-zynq-fpga-mgr.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Xilinx Zynq FPGA Manager Device Tree Bindings
+title: Xilinx Zynq FPGA Manager
maintainers:
- Michal Simek <michal.simek@xilinx.com>
diff --git a/Documentation/devicetree/bindings/fpga/xlnx,fpga-slave-serial.yaml b/Documentation/devicetree/bindings/fpga/xlnx,fpga-slave-serial.yaml
new file mode 100644
index 000000000000..614d86ad825f
--- /dev/null
+++ b/Documentation/devicetree/bindings/fpga/xlnx,fpga-slave-serial.yaml
@@ -0,0 +1,80 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/fpga/xlnx,fpga-slave-serial.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Xilinx Slave Serial SPI FPGA
+
+maintainers:
+ - Nava kishore Manne <nava.kishore.manne@amd.com>
+
+description: |
+ Xilinx Spartan-6 and 7 Series FPGAs support a method of loading the bitstream
+ over what is referred to as slave serial interface. The slave serial link is
+ not technically SPI, and might require extra circuits in order to play nicely
+ with other SPI slaves on the same bus.
+
+ Datasheets:
+ https://www.xilinx.com/support/documentation/user_guides/ug380.pdf
+ https://www.xilinx.com/support/documentation/user_guides/ug470_7Series_Config.pdf
+ https://www.xilinx.com/support/documentation/application_notes/xapp583-fpga-configuration.pdf
+
+allOf:
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+
+properties:
+ compatible:
+ enum:
+ - xlnx,fpga-slave-serial
+
+ spi-cpha: true
+
+ spi-max-frequency:
+ maximum: 60000000
+
+ reg:
+ maxItems: 1
+
+ prog_b-gpios:
+ description:
+ config pin (referred to as PROGRAM_B in the manual)
+ maxItems: 1
+
+ done-gpios:
+ description:
+ config status pin (referred to as DONE in the manual)
+ maxItems: 1
+
+ init-b-gpios:
+ description:
+ initialization status and configuration error pin
+ (referred to as INIT_B in the manual)
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - prog_b-gpios
+ - done-gpios
+ - init-b-gpios
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ spi {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ fpga_mgr_spi: fpga-mgr@0 {
+ compatible = "xlnx,fpga-slave-serial";
+ spi-max-frequency = <60000000>;
+ spi-cpha;
+ reg = <0>;
+ prog_b-gpios = <&gpio0 29 GPIO_ACTIVE_LOW>;
+ init-b-gpios = <&gpio0 28 GPIO_ACTIVE_LOW>;
+ done-gpios = <&gpio0 9 GPIO_ACTIVE_HIGH>;
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/fpga/xlnx,pr-decoupler.yaml b/Documentation/devicetree/bindings/fpga/xlnx,pr-decoupler.yaml
new file mode 100644
index 000000000000..a7d4b8e59e19
--- /dev/null
+++ b/Documentation/devicetree/bindings/fpga/xlnx,pr-decoupler.yaml
@@ -0,0 +1,64 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/fpga/xlnx,pr-decoupler.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Xilinx LogiCORE Partial Reconfig Decoupler/AXI shutdown manager Softcore
+
+maintainers:
+ - Nava kishore Manne <nava.kishore.manne@amd.com>
+
+description: |
+ The Xilinx LogiCORE Partial Reconfig(PR) Decoupler manages one or more
+ decouplers/fpga bridges. The controller can decouple/disable the bridges
+ which prevents signal changes from passing through the bridge. The controller
+ can also couple / enable the bridges which allows traffic to pass through the
+ bridge normally.
+ Xilinx LogiCORE Dynamic Function eXchange(DFX) AXI shutdown manager Softcore
+ is compatible with the Xilinx LogiCORE pr-decoupler. The Dynamic Function
+ eXchange AXI shutdown manager prevents AXI traffic from passing through the
+ bridge. The controller safely handles AXI4MM and AXI4-Lite interfaces on a
+ Reconfigurable Partition when it is undergoing dynamic reconfiguration,
+ preventing the system deadlock that can occur if AXI transactions are
+ interrupted by DFX.
+ Please refer to fpga-region.txt and fpga-bridge.txt in this directory for
+ the common binding parts and usage.
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - const: xlnx,pr-decoupler-1.00
+ - const: xlnx,pr-decoupler
+ - items:
+ - const: xlnx,dfx-axi-shutdown-manager-1.00
+ - const: xlnx,dfx-axi-shutdown-manager
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: aclk
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+
+additionalProperties: false
+
+examples:
+ - |
+ fpga-bridge@100000450 {
+ compatible = "xlnx,pr-decoupler-1.00", "xlnx,pr-decoupler";
+ reg = <0x10000045 0x10>;
+ clocks = <&clkc 15>;
+ clock-names = "aclk";
+ };
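+
+ # The DFX AXI shutdown manager variant uses the same properties; this
+ # second example is carried over from the removed text binding, minus
+ # the bridge-enable property, which is not part of this schema.
+ - |
+ fpga-bridge@100000450 {
+ compatible = "xlnx,dfx-axi-shutdown-manager-1.00",
+ "xlnx,dfx-axi-shutdown-manager";
+ reg = <0x10000045 0x10>;
+ clocks = <&clkc 15>;
+ clock-names = "aclk";
+ };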
+...
diff --git a/Documentation/devicetree/bindings/fpga/xlnx,zynqmp-pcap-fpga.yaml b/Documentation/devicetree/bindings/fpga/xlnx,zynqmp-pcap-fpga.yaml
index 6cd2bdc06b5f..00a8d92ff736 100644
--- a/Documentation/devicetree/bindings/fpga/xlnx,zynqmp-pcap-fpga.yaml
+++ b/Documentation/devicetree/bindings/fpga/xlnx,zynqmp-pcap-fpga.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/fpga/xlnx,zynqmp-pcap-fpga.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Xilinx Zynq Ultrascale MPSoC FPGA Manager Device Tree Bindings
+title: Xilinx Zynq Ultrascale MPSoC FPGA Manager
maintainers:
- Nava kishore Manne <navam@xilinx.com>
diff --git a/Documentation/devicetree/bindings/fuse/nvidia,tegra20-fuse.txt b/Documentation/devicetree/bindings/fuse/nvidia,tegra20-fuse.txt
deleted file mode 100644
index b109911669e4..000000000000
--- a/Documentation/devicetree/bindings/fuse/nvidia,tegra20-fuse.txt
+++ /dev/null
@@ -1,42 +0,0 @@
-NVIDIA Tegra20/Tegra30/Tegr114/Tegra124 fuse block.
-
-Required properties:
-- compatible : For Tegra20, must contain "nvidia,tegra20-efuse". For Tegra30,
- must contain "nvidia,tegra30-efuse". For Tegra114, must contain
- "nvidia,tegra114-efuse". For Tegra124, must contain "nvidia,tegra124-efuse".
- For Tegra132 must contain "nvidia,tegra132-efuse", "nvidia,tegra124-efuse".
- For Tegra210 must contain "nvidia,tegra210-efuse". For Tegra186 must contain
- "nvidia,tegra186-efuse". For Tegra194 must contain "nvidia,tegra194-efuse".
- For Tegra234 must contain "nvidia,tegra234-efuse".
- Details:
- nvidia,tegra20-efuse: Tegra20 requires using APB DMA to read the fuse data
- due to a hardware bug. Tegra20 also lacks certain information which is
- available in later generations such as fab code, lot code, wafer id,..
- nvidia,tegra30-efuse, nvidia,tegra114-efuse and nvidia,tegra124-efuse:
- The differences between these SoCs are the size of the efuse array,
- the location of the spare (OEM programmable) bits and the location of
- the speedo data.
-- reg: Should contain 1 entry: the entry gives the physical address and length
- of the fuse registers.
-- clocks: Must contain an entry for each entry in clock-names.
- See ../clocks/clock-bindings.txt for details.
-- clock-names: Must include the following entries:
- - fuse
-- resets: Must contain an entry for each entry in reset-names.
- See ../reset/reset.txt for details.
-- reset-names: Must include the following entries:
- - fuse
-
-Example:
-
- fuse@7000f800 {
- compatible = "nvidia,tegra20-efuse";
- reg = <0x7000f800 0x400>,
- <0x70000000 0x400>;
- clocks = <&tegra_car TEGRA20_CLK_FUSE>;
- clock-names = "fuse";
- resets = <&tegra_car 39>;
- reset-names = "fuse";
- };
-
-
diff --git a/Documentation/devicetree/bindings/fuse/nvidia,tegra20-fuse.yaml b/Documentation/devicetree/bindings/fuse/nvidia,tegra20-fuse.yaml
new file mode 100644
index 000000000000..02f0b0462377
--- /dev/null
+++ b/Documentation/devicetree/bindings/fuse/nvidia,tegra20-fuse.yaml
@@ -0,0 +1,88 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/fuse/nvidia,tegra20-fuse.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra FUSE block
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - nvidia,tegra20-efuse
+ - nvidia,tegra30-efuse
+ - nvidia,tegra114-efuse
+ - nvidia,tegra124-efuse
+ - nvidia,tegra210-efuse
+ - nvidia,tegra186-efuse
+ - nvidia,tegra194-efuse
+ - nvidia,tegra234-efuse
+
+ - items:
+ - const: nvidia,tegra132-efuse
+ - const: nvidia,tegra124-efuse
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: fuse
+
+ resets:
+ maxItems: 1
+
+ reset-names:
+ items:
+ - const: fuse
+
+ operating-points-v2: true
+
+ power-domains:
+ items:
+ - description: phandle to the core power domain
+
+additionalProperties: false
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+
+if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - nvidia,tegra20-efuse
+ - nvidia,tegra30-efuse
+ - nvidia,tegra114-efuse
+ - nvidia,tegra124-efuse
+ - nvidia,tegra132-efuse
+ - nvidia,tegra210-efuse
+then:
+ required:
+ - resets
+ - reset-names
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra20-car.h>
+
+ fuse@7000f800 {
+ compatible = "nvidia,tegra20-efuse";
+ reg = <0x7000f800 0x400>;
+ clocks = <&tegra_car TEGRA20_CLK_FUSE>;
+ clock-names = "fuse";
+ resets = <&tegra_car 39>;
+ reset-names = "fuse";
+ };
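+
+ # A hedged sketch for one of the newer SoCs, for which the schema above
+ # does not require resets; the unit address, register size and clock
+ # specifier are illustrative only, not taken from a real device tree.
+ - |
+ fuse@3810000 {
+ compatible = "nvidia,tegra234-efuse";
+ reg = <0x3810000 0x600>;
+ clocks = <&bpmp 40>; /* illustrative clock specifier */
+ clock-names = "fuse";
+ };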
diff --git a/Documentation/devicetree/bindings/gnss/brcm,bcm4751.yaml b/Documentation/devicetree/bindings/gnss/brcm,bcm4751.yaml
new file mode 100644
index 000000000000..c21549e0fba6
--- /dev/null
+++ b/Documentation/devicetree/bindings/gnss/brcm,bcm4751.yaml
@@ -0,0 +1,69 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gnss/brcm,bcm4751.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Broadcom BCM4751 family GNSS Receiver
+
+maintainers:
+ - Johan Hovold <johan@kernel.org>
+ - Linus Walleij <linus.walleij@linaro.org>
+
+description:
+ Broadcom GPS chips can be used over the UART or I2C bus. The UART
+ bus requires CTS/RTS support. The part number on the capsule is more
+ elaborate than the compatible string; a BCM4751 may for example be
+ printed as BCM4751IFBG.
+
+allOf:
+ - $ref: gnss-common.yaml#
+
+properties:
+ compatible:
+ enum:
+ - brcm,bcm4751
+ - brcm,bcm4752
+ - brcm,bcm4753
+
+ reg:
+ description:
+ The I2C address; not required on UART buses.
+
+ vdd-auxin-supply:
+ description:
+ Main voltage supply, pin name VDD_AUXIN, typically connected directly
+ to a battery such as a 3.8 V LiIon cell, or to a 2.6 V supply.
+
+ vddio-supply:
+ description:
+ IO voltage supply, pin name VDDIO, typically 1.8V
+
+ reset-gpios:
+ maxItems: 1
+ description: An optional active-low reset line; it should be flagged
+ with GPIO_ACTIVE_LOW.
+
+ enable-gpios:
+ description: Enable GPIO line, connected to pins named REGPU or NSTANDBY.
+ If the line is active low such as NSTANDBY, it should be tagged
+ GPIO_ACTIVE_LOW.
+
+required:
+ - compatible
+ - enable-gpios
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ serial {
+ gnss {
+ compatible = "brcm,bcm4751";
+ vdd-auxin-supply = <&vbat>;
+ reset-gpios = <&gpio0 15 GPIO_ACTIVE_LOW>;
+ enable-gpios = <&gpio0 16 GPIO_ACTIVE_HIGH>;
+ current-speed = <38400>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/gnss/gnss-common.yaml b/Documentation/devicetree/bindings/gnss/gnss-common.yaml
new file mode 100644
index 000000000000..963b926e30a7
--- /dev/null
+++ b/Documentation/devicetree/bindings/gnss/gnss-common.yaml
@@ -0,0 +1,55 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gnss/gnss-common.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Common Properties for Global Navigation Satellite Systems (GNSS)
+ receiver devices
+
+maintainers:
+ - Johan Hovold <johan@kernel.org>
+
+description: |
+ This document defines device tree properties common to Global Navigation
+ Satellite System receivers.
+
+properties:
+ $nodename:
+ pattern: "^gnss(@.*)?$"
+
+ lna-supply:
+ description: A separate regulator supplying power for the Low Noise
+ Amplifier (LNA). This is an amplifier connected between the GNSS
+ device and the receiver antenna.
+
+ enable-gpios:
+ description: A GPIO line that will enable the GNSS receiver when
+ asserted. If this line is active low, the GPIO phandle should
+ consequently be tagged with the GPIO_ACTIVE_LOW flag so the operating
+ system can rely on asserting the line to enable the GNSS device.
+ maxItems: 1
+
+ timepulse-gpios:
+ description: The GPIO line used for the time pulse signal, when the
+ device provides its time pulse over a GPIO line.
+ maxItems: 1
+
+ current-speed:
+ description: The current active speed, i.e. the baud rate in bits per
+ second at which the device comes online.
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+additionalProperties: true
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ serial {
+ gnss {
+ compatible = "u-blox,neo-8";
+ vcc-supply = <&gnss_reg>;
+ timepulse-gpios = <&gpio0 16 GPIO_ACTIVE_HIGH>;
+ current-speed = <4800>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/gnss/gnss.txt b/Documentation/devicetree/bindings/gnss/gnss.txt
deleted file mode 100644
index d6dc9c0d8249..000000000000
--- a/Documentation/devicetree/bindings/gnss/gnss.txt
+++ /dev/null
@@ -1,37 +0,0 @@
-GNSS Receiver DT binding
-
-This documents the binding structure and common properties for GNSS receiver
-devices.
-
-A GNSS receiver node is a node named "gnss" and typically resides on a serial
-bus (e.g. UART, I2C or SPI).
-
-Please refer to the following documents for generic properties:
-
- Documentation/devicetree/bindings/serial/serial.yaml
- Documentation/devicetree/bindings/spi/spi-bus.txt
-
-Required properties:
-
-- compatible : A string reflecting the vendor and specific device the node
- represents
-
-Optional properties:
-- lna-supply : Separate supply for an LNA
-- enable-gpios : GPIO used to enable the device
-- timepulse-gpios : Time pulse GPIO
-
-Example:
-
-serial@1234 {
- compatible = "ns16550a";
-
- gnss {
- compatible = "u-blox,neo-8";
-
- vcc-supply = <&gnss_reg>;
- timepulse-gpios = <&gpio0 16 GPIO_ACTIVE_HIGH>;
-
- current-speed = <4800>;
- };
-};
diff --git a/Documentation/devicetree/bindings/gnss/mediatek.txt b/Documentation/devicetree/bindings/gnss/mediatek.txt
deleted file mode 100644
index 80cb802813c5..000000000000
--- a/Documentation/devicetree/bindings/gnss/mediatek.txt
+++ /dev/null
@@ -1,35 +0,0 @@
-Mediatek-based GNSS Receiver DT binding
-
-Mediatek chipsets are used in GNSS-receiver modules produced by several
-vendors and can use a UART interface.
-
-Please see Documentation/devicetree/bindings/gnss/gnss.txt for generic
-properties.
-
-Required properties:
-
-- compatible : Must be
-
- "globaltop,pa6h"
-
-- vcc-supply : Main voltage regulator (pin name: VCC)
-
-Optional properties:
-
-- current-speed : Default UART baud rate
-- gnss-fix-gpios : GPIO used to determine device position fix state
- (pin name: FIX, 3D_FIX)
-- reset-gpios : GPIO used to reset the device (pin name: RESET, NRESET)
-- timepulse-gpios : Time pulse GPIO (pin name: PPS1, 1PPS)
-- vbackup-supply : Backup voltage regulator (pin name: VBAT, VBACKUP)
-
-Example:
-
-serial@1234 {
- compatible = "ns16550a";
-
- gnss {
- compatible = "globaltop,pa6h";
- vcc-supply = <&vcc_3v3>;
- };
-};
diff --git a/Documentation/devicetree/bindings/gnss/mediatek.yaml b/Documentation/devicetree/bindings/gnss/mediatek.yaml
new file mode 100644
index 000000000000..c0eb35beb2ef
--- /dev/null
+++ b/Documentation/devicetree/bindings/gnss/mediatek.yaml
@@ -0,0 +1,59 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gnss/mediatek.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Mediatek GNSS Receiver
+
+maintainers:
+ - Johan Hovold <johan@kernel.org>
+
+description:
+ Mediatek chipsets are used in GNSS-receiver modules produced by several
+ vendors and can use a UART interface.
+
+allOf:
+ - $ref: gnss-common.yaml#
+
+properties:
+ compatible:
+ const: globaltop,pa6h
+
+ vcc-supply:
+ description:
+ Main voltage regulator, pin name VCC.
+
+ reset-gpios:
+ maxItems: 1
+ description: An optional reset line, with names such as RESET or NRESET.
+ If the line is active low it should be flagged with GPIO_ACTIVE_LOW.
+
+ timepulse-gpios:
+ description: Time pulse GPIO, with pin names such as PPS1 or 1PPS.
+
+ gnss-fix-gpios:
+ maxItems: 1
+ description: GPIO used to determine device position fix state, pin names
+ FIX or 3D_FIX.
+
+ vbackup-supply:
+ description:
+ Regulator providing backup voltage, pin names such as VBAT or VBACKUP.
+
+required:
+ - compatible
+ - vcc-supply
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ serial {
+ gnss {
+ compatible = "globaltop,pa6h";
+ vcc-supply = <&vcc_3v3>;
+ reset-gpios = <&gpio 1 GPIO_ACTIVE_LOW>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/gnss/sirfstar.txt b/Documentation/devicetree/bindings/gnss/sirfstar.txt
deleted file mode 100644
index f4252b6b660b..000000000000
--- a/Documentation/devicetree/bindings/gnss/sirfstar.txt
+++ /dev/null
@@ -1,46 +0,0 @@
-SiRFstar-based GNSS Receiver DT binding
-
-SiRFstar chipsets are used in GNSS-receiver modules produced by several
-vendors and can use UART, SPI or I2C interfaces.
-
-Please see Documentation/devicetree/bindings/gnss/gnss.txt for generic
-properties.
-
-Required properties:
-
-- compatible : Must be one of
-
- "fastrax,uc430"
- "linx,r4"
- "wi2wi,w2sg0004"
- "wi2wi,w2sg0008i"
- "wi2wi,w2sg0084i"
-
-- vcc-supply : Main voltage regulator (pin name: 3V3_IN, VCC, VDD)
-
-Required properties (I2C):
-- reg : I2C slave address
-
-Required properties (SPI):
-- reg : SPI chip select address
-
-Optional properties:
-
-- sirf,onoff-gpios : GPIO used to power on and off device (pin name: ON_OFF)
-- sirf,wakeup-gpios : GPIO used to determine device power state
- (pin name: RFPWRUP, WAKEUP)
-- timepulse-gpios : Time pulse GPIO (pin name: 1PPS, TM)
-
-Example:
-
-serial@1234 {
- compatible = "ns16550a";
-
- gnss {
- compatible = "wi2wi,w2sg0084i";
-
- vcc-supply = <&gnss_reg>;
- sirf,onoff-gpios = <&gpio0 16 GPIO_ACTIVE_HIGH>;
- sirf,wakeup-gpios = <&gpio0 17 GPIO_ACTIVE_HIGH>;
- };
-};
diff --git a/Documentation/devicetree/bindings/gnss/sirfstar.yaml b/Documentation/devicetree/bindings/gnss/sirfstar.yaml
new file mode 100644
index 000000000000..0bbe684d82e1
--- /dev/null
+++ b/Documentation/devicetree/bindings/gnss/sirfstar.yaml
@@ -0,0 +1,76 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gnss/sirfstar.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: SiRFstar GNSS Receiver
+
+maintainers:
+ - Johan Hovold <johan@kernel.org>
+
+description:
+ The SiRFstar GNSS receivers have appeared over the years in a number of
+ different chips, starting with the SiRFstarIII, which was introduced in
+ 2004 and used in a lot of dedicated GPS devices. In 2009 SiRF was acquired
+ by CSR (Cambridge Silicon Radio) and in 2012 the CSR GPS business was
+ acquired by Samsung, while some products remained with CSR. In 2014 CSR
+ was acquired by Qualcomm, who still sell some of the SiRF products.
+
+ SiRF chips can be used over UART, I2C or SPI buses.
+
+allOf:
+ - $ref: gnss-common.yaml#
+
+properties:
+ compatible:
+ enum:
+ - csr,gsd4t
+ - csr,csrg05ta03-icje-r
+ - fastrax,uc430
+ - linx,r4
+ - wi2wi,w2sg0004
+ - wi2wi,w2sg0008i
+ - wi2wi,w2sg0084i
+
+ reg:
+ description:
+ The I2C address or SPI chip select address; not required on UART buses.
+
+ vcc-supply:
+ description:
+ Main voltage regulator, pin names such as 3V3_IN, VCC, VDD.
+
+ reset-gpios:
+ maxItems: 1
+ description: An optional active-low reset line; it should be flagged
+ with GPIO_ACTIVE_LOW.
+
+ sirf,onoff-gpios:
+ maxItems: 1
+ description: GPIO used to power on and off device, pin name ON_OFF.
+
+ sirf,wakeup-gpios:
+ maxItems: 1
+ description: GPIO used to determine device power state, pin names such
+ as RFPWRUP, WAKEUP.
+
+required:
+ - compatible
+ - vcc-supply
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+ serial {
+ gnss {
+ compatible = "wi2wi,w2sg0084i";
+ vcc-supply = <&gnss_vcc_reg>;
+ reset-gpios = <&gpio0 15 GPIO_ACTIVE_LOW>;
+ sirf,onoff-gpios = <&gpio0 16 GPIO_ACTIVE_HIGH>;
+ sirf,wakeup-gpios = <&gpio0 17 GPIO_ACTIVE_HIGH>;
+ current-speed = <38400>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/gnss/u-blox,neo-6m.yaml b/Documentation/devicetree/bindings/gnss/u-blox,neo-6m.yaml
new file mode 100644
index 000000000000..4835a280b3bf
--- /dev/null
+++ b/Documentation/devicetree/bindings/gnss/u-blox,neo-6m.yaml
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gnss/u-blox,neo-6m.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: U-blox GNSS Receiver
+
+allOf:
+ - $ref: gnss-common.yaml#
+
+maintainers:
+ - Johan Hovold <johan@kernel.org>
+
+description: >
+ The U-blox GNSS receivers can use UART, DDC (I2C), SPI and USB interfaces.
+
+properties:
+ compatible:
+ enum:
+ - u-blox,neo-6m
+ - u-blox,neo-8
+ - u-blox,neo-m8
+
+ reg:
+ description: >
+ The DDC (I2C) slave address, the SPI chip select address, or the number of
+ the USB hub port or USB host-controller port to which this device is
+ attached, depending on the bus used. Required on the DDC, SPI and USB buses.
+
+ vcc-supply:
+ description: >
+ Main voltage regulator
+
+ u-blox,extint-gpios:
+ maxItems: 1
+ description: >
+ GPIO connected to the "external interrupt" input pin
+
+ v-bckp-supply:
+ description: >
+ Backup voltage regulator
+
+required:
+ - compatible
+ - vcc-supply
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ serial {
+ gnss {
+ compatible = "u-blox,neo-8";
+ v-bckp-supply = <&gnss_v_bckp_reg>;
+ vcc-supply = <&gnss_vcc_reg>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/gnss/u-blox.txt b/Documentation/devicetree/bindings/gnss/u-blox.txt
deleted file mode 100644
index 7cdefd058fe0..000000000000
--- a/Documentation/devicetree/bindings/gnss/u-blox.txt
+++ /dev/null
@@ -1,45 +0,0 @@
-u-blox GNSS Receiver DT binding
-
-The u-blox GNSS receivers can use UART, DDC (I2C), SPI and USB interfaces.
-
-Please see Documentation/devicetree/bindings/gnss/gnss.txt for generic
-properties.
-
-Required properties:
-
-- compatible : Must be one of
-
- "u-blox,neo-6m"
- "u-blox,neo-8"
- "u-blox,neo-m8"
-
-- vcc-supply : Main voltage regulator
-
-Required properties (DDC):
-- reg : DDC (I2C) slave address
-
-Required properties (SPI):
-- reg : SPI chip select address
-
-Required properties (USB):
-- reg : Number of the USB hub port or the USB host-controller port
- to which this device is attached
-
-Optional properties:
-
-- timepulse-gpios : Time pulse GPIO
-- u-blox,extint-gpios : GPIO connected to the "external interrupt" input pin
-- v-bckp-supply : Backup voltage regulator
-
-Example:
-
-serial@1234 {
- compatible = "ns16550a";
-
- gnss {
- compatible = "u-blox,neo-8";
-
- v-bckp-supply = <&gnss_v_bckp_reg>;
- vcc-supply = <&gnss_vcc_reg>;
- };
-};
diff --git a/Documentation/devicetree/bindings/gpio/airoha,en7523-gpio.yaml b/Documentation/devicetree/bindings/gpio/airoha,en7523-gpio.yaml
new file mode 100644
index 000000000000..7c41d8e814cd
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/airoha,en7523-gpio.yaml
@@ -0,0 +1,66 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/airoha,en7523-gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Airoha EN7523 GPIO controller
+
+maintainers:
+ - John Crispin <john@phrozen.org>
+
+description: |
+ Airoha's GPIO controller on their ARM EN7523 SoCs consists of two banks of 32
+ GPIOs.
+
+properties:
+ $nodename:
+ pattern: "^gpio@[0-9a-f]+$"
+
+ compatible:
+ items:
+ - const: airoha,en7523-gpio
+
+ reg:
+ description: |
+ The first tuple points to the input register.
+ The second and third tuples point to the direction registers.
+ The fourth tuple points to the output register.
+ maxItems: 4
+
+ "#gpio-cells":
+ const: 2
+
+ gpio-controller: true
+
+required:
+ - compatible
+ - reg
+ - "#gpio-cells"
+ - gpio-controller
+
+additionalProperties: false
+
+examples:
+ - |
+ gpio0: gpio@1fbf0200 {
+ compatible = "airoha,en7523-gpio";
+ reg = <0x1fbf0204 0x4>,
+ <0x1fbf0200 0x4>,
+ <0x1fbf0220 0x4>,
+ <0x1fbf0214 0x4>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ };
+
+ gpio1: gpio@1fbf0270 {
+ compatible = "airoha,en7523-gpio";
+ reg = <0x1fbf0270 0x4>,
+ <0x1fbf0260 0x4>,
+ <0x1fbf0264 0x4>,
+ <0x1fbf0278 0x4>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/gpio/brcm,brcmstb-gpio.txt b/Documentation/devicetree/bindings/gpio/brcm,brcmstb-gpio.txt
deleted file mode 100644
index 5d468ecd1809..000000000000
--- a/Documentation/devicetree/bindings/gpio/brcm,brcmstb-gpio.txt
+++ /dev/null
@@ -1,83 +0,0 @@
-Broadcom STB "UPG GIO" GPIO controller
-
-The controller's registers are organized as sets of eight 32-bit
-registers with each set controlling a bank of up to 32 pins. A single
-interrupt is shared for all of the banks handled by the controller.
-
-Required properties:
-
-- compatible:
- Must be "brcm,brcmstb-gpio"
-
-- reg:
- Define the base and range of the I/O address space containing
- the brcmstb GPIO controller registers
-
-- #gpio-cells:
- Should be <2>. The first cell is the pin number (within the controller's
- pin space), and the second is used for the following:
- bit[0]: polarity (0 for active-high, 1 for active-low)
-
-- gpio-controller:
- Specifies that the node is a GPIO controller.
-
-- brcm,gpio-bank-widths:
- Number of GPIO lines for each bank. Number of elements must
- correspond to number of banks suggested by the 'reg' property.
-
-Optional properties:
-
-- interrupts:
- The interrupt shared by all GPIO lines for this controller.
-
-- interrupts-extended:
- Alternate form of specifying interrupts and parents that allows for
- multiple parents. This takes precedence over 'interrupts' and
- 'interrupt-parent'. Wakeup-capable GPIO controllers often route their
- wakeup interrupt lines through a different interrupt controller than the
- primary interrupt line, making this property necessary.
-
-- #interrupt-cells:
- Should be <2>. The first cell is the GPIO number, the second should specify
- flags. The following subset of flags is supported:
- - bits[3:0] trigger type and level flags
- 1 = low-to-high edge triggered
- 2 = high-to-low edge triggered
- 4 = active high level-sensitive
- 8 = active low level-sensitive
- Valid combinations are 1, 2, 3, 4, 8.
- See also Documentation/devicetree/bindings/interrupt-controller/interrupts.txt
-
-- interrupt-controller:
- Marks the device node as an interrupt controller
-
-- wakeup-source:
- GPIOs for this controller can be used as a wakeup source
-
-Example:
- upg_gio: gpio@f040a700 {
- #gpio-cells = <2>;
- #interrupt-cells = <2>;
- compatible = "brcm,bcm7445-gpio", "brcm,brcmstb-gpio";
- gpio-controller;
- interrupt-controller;
- reg = <0xf040a700 0x80>;
- interrupt-parent = <&irq0_intc>;
- interrupts = <0x6>;
- brcm,gpio-bank-widths = <32 32 32 24>;
- };
-
- upg_gio_aon: gpio@f04172c0 {
- #gpio-cells = <2>;
- #interrupt-cells = <2>;
- compatible = "brcm,bcm7445-gpio", "brcm,brcmstb-gpio";
- gpio-controller;
- interrupt-controller;
- reg = <0xf04172c0 0x40>;
- interrupt-parent = <&irq0_aon_intc>;
- interrupts = <0x6>;
- interrupts-extended = <&irq0_aon_intc 0x6>,
- <&aon_pm_l2_intc 0x5>;
- wakeup-source;
- brcm,gpio-bank-widths = <18 4>;
- };
diff --git a/Documentation/devicetree/bindings/gpio/brcm,brcmstb-gpio.yaml b/Documentation/devicetree/bindings/gpio/brcm,brcmstb-gpio.yaml
new file mode 100644
index 000000000000..4a896ff7edc5
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/brcm,brcmstb-gpio.yaml
@@ -0,0 +1,104 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/brcm,brcmstb-gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Broadcom STB "UPG GIO" GPIO controller
+
+description: >
+ The controller's registers are organized as sets of eight 32-bit
+ registers with each set controlling a bank of up to 32 pins. A single
+ interrupt is shared for all of the banks handled by the controller.
+
+maintainers:
+ - Doug Berger <opendmb@gmail.com>
+ - Florian Fainelli <f.fainelli@gmail.com>
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - brcm,bcm7445-gpio
+ - const: brcm,brcmstb-gpio
+
+ reg:
+ maxItems: 1
+ description: >
+ Define the base and range of the I/O address space containing
+ the brcmstb GPIO controller registers
+
+ "#gpio-cells":
+ const: 2
+ description: >
+ The first cell is the pin number (within the controller's
+ pin space), and the second is used for the following:
+ bit[0]: polarity (0 for active-high, 1 for active-low)
+
+ gpio-controller: true
+
+ brcm,gpio-bank-widths:
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ description: >
+ Number of GPIO lines for each bank. The number of elements must
+ correspond to the number of banks suggested by the 'reg' property.
+
+ interrupts:
+ maxItems: 1
+ description: >
+ The interrupt shared by all GPIO lines for this controller.
+
+ "#interrupt-cells":
+ const: 2
+ description: |
+ The first cell is the GPIO number, the second should specify
+ flags. The following subset of flags is supported:
+ - bits[3:0] trigger type and level flags
+ 1 = low-to-high edge triggered
+ 2 = high-to-low edge triggered
+ 4 = active high level-sensitive
+ 8 = active low level-sensitive
+ Valid combinations are 1, 2, 3, 4, 8.
+
+ interrupt-controller: true
+
+ wakeup-source:
+ type: boolean
+ description: >
+ GPIOs for this controller can be used as a wakeup source
+
+required:
+ - compatible
+ - reg
+ - gpio-controller
+ - "#gpio-cells"
+ - "brcm,gpio-bank-widths"
+
+additionalProperties: false
+
+examples:
+ - |
+ upg_gio: gpio@f040a700 {
+ #gpio-cells = <2>;
+ #interrupt-cells = <2>;
+ compatible = "brcm,bcm7445-gpio", "brcm,brcmstb-gpio";
+ gpio-controller;
+ interrupt-controller;
+ reg = <0xf040a700 0x80>;
+ interrupt-parent = <&irq0_intc>;
+ interrupts = <0x6>;
+ brcm,gpio-bank-widths = <32 32 32 24>;
+ };
+
+ upg_gio_aon: gpio@f04172c0 {
+ #gpio-cells = <2>;
+ #interrupt-cells = <2>;
+ compatible = "brcm,bcm7445-gpio", "brcm,brcmstb-gpio";
+ gpio-controller;
+ interrupt-controller;
+ reg = <0xf04172c0 0x40>;
+ interrupt-parent = <&irq0_aon_intc>;
+ interrupts = <0x6>;
+ wakeup-source;
+ brcm,gpio-bank-widths = <18 4>;
+ };
diff --git a/Documentation/devicetree/bindings/gpio/delta,tn48m-gpio.yaml b/Documentation/devicetree/bindings/gpio/delta,tn48m-gpio.yaml
new file mode 100644
index 000000000000..e3e668a12091
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/delta,tn48m-gpio.yaml
@@ -0,0 +1,39 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/delta,tn48m-gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Delta Networks TN48M CPLD GPIO controller
+
+maintainers:
+ - Robert Marko <robert.marko@sartura.hr>
+
+description: |
+ This module is part of the Delta TN48M multi-function device. For more
+ details see ../mfd/delta,tn48m-cpld.yaml.
+
+ The Delta TN48M has an onboard Lattice CPLD that is used as a GPIO
+ expander. It provides 12 pins in total; each of them is either input-only
+ or output-only.
+
+properties:
+ compatible:
+ enum:
+ - delta,tn48m-gpo
+ - delta,tn48m-gpi
+
+ reg:
+ maxItems: 1
+
+ "#gpio-cells":
+ const: 2
+
+ gpio-controller: true
+
+required:
+ - compatible
+ - reg
+ - "#gpio-cells"
+ - gpio-controller
+
+additionalProperties: false
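+
+examples:
+ # A minimal sketch only; the parent bus and the 0x31 register offset are
+ # illustrative assumptions, not taken from TN48M documentation.
+ - |
+ gpio@31 {
+ compatible = "delta,tn48m-gpi";
+ reg = <0x31>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ };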
diff --git a/Documentation/devicetree/bindings/gpio/fairchild,74hc595.yaml b/Documentation/devicetree/bindings/gpio/fairchild,74hc595.yaml
index 5fe19fa5f67c..c0ad70e66f76 100644
--- a/Documentation/devicetree/bindings/gpio/fairchild,74hc595.yaml
+++ b/Documentation/devicetree/bindings/gpio/fairchild,74hc595.yaml
@@ -26,14 +26,13 @@ properties:
const: 2
registers-number:
+ $ref: /schemas/types.yaml#/definitions/uint32
description: Number of daisy-chained shift registers
enable-gpios:
description: GPIO connected to the OE (Output Enable) pin.
maxItems: 1
- spi-max-frequency: true
-
patternProperties:
"^(hog-[0-9]+|.+-hog(-[0-9]+)?)$":
type: object
@@ -58,7 +57,10 @@ required:
- '#gpio-cells'
- registers-number
-additionalProperties: false
+allOf:
+ - $ref: /schemas/spi/spi-peripheral-props.yaml#
+
+unevaluatedProperties: false
examples:
- |
diff --git a/Documentation/devicetree/bindings/gpio/faraday,ftgpio010.txt b/Documentation/devicetree/bindings/gpio/faraday,ftgpio010.txt
deleted file mode 100644
index d04236558619..000000000000
--- a/Documentation/devicetree/bindings/gpio/faraday,ftgpio010.txt
+++ /dev/null
@@ -1,27 +0,0 @@
-Faraday Technology FTGPIO010 GPIO Controller
-
-Required properties:
-
-- compatible : Should be one of
- "cortina,gemini-gpio", "faraday,ftgpio010"
- "moxa,moxart-gpio", "faraday,ftgpio010"
- "faraday,ftgpio010"
-- reg : Should contain registers location and length
-- interrupts : Should contain the interrupt line for the GPIO block
-- gpio-controller : marks this as a GPIO controller
-- #gpio-cells : Should be 2, see gpio/gpio.txt
-- interrupt-controller : marks this as an interrupt controller
-- #interrupt-cells : a standard two-cell interrupt flag, see
- interrupt-controller/interrupts.txt
-
-Example:
-
-gpio@4d000000 {
- compatible = "cortina,gemini-gpio", "faraday,ftgpio010";
- reg = <0x4d000000 0x100>;
- interrupts = <22 IRQ_TYPE_LEVEL_HIGH>;
- gpio-controller;
- #gpio-cells = <2>;
- interrupt-controller;
- #interrupt-cells = <2>;
-};
diff --git a/Documentation/devicetree/bindings/gpio/faraday,ftgpio010.yaml b/Documentation/devicetree/bindings/gpio/faraday,ftgpio010.yaml
new file mode 100644
index 000000000000..640da5b9b0cc
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/faraday,ftgpio010.yaml
@@ -0,0 +1,65 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/faraday,ftgpio010.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Faraday Technology FTGPIO010 GPIO Controller
+
+maintainers:
+ - Linus Walleij <linus.walleij@linaro.org>
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - const: cortina,gemini-gpio
+ - const: faraday,ftgpio010
+ - items:
+ - const: moxa,moxart-gpio
+ - const: faraday,ftgpio010
+ - const: faraday,ftgpio010
+
+ reg:
+ maxItems: 1
+
+ resets:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+ description: Should contain the interrupt line for the GPIO block
+
+ gpio-controller: true
+ "#gpio-cells":
+ const: 2
+
+ interrupt-controller: true
+ "#interrupt-cells":
+ const: 2
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - "#gpio-cells"
+ - interrupt-controller
+ - "#interrupt-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+ gpio@4d000000 {
+ compatible = "cortina,gemini-gpio", "faraday,ftgpio010";
+ reg = <0x4d000000 0x100>;
+ interrupts = <22 IRQ_TYPE_LEVEL_HIGH>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ };
diff --git a/Documentation/devicetree/bindings/gpio/fcs,fxl6408.yaml b/Documentation/devicetree/bindings/gpio/fcs,fxl6408.yaml
new file mode 100644
index 000000000000..65b6970e42fb
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/fcs,fxl6408.yaml
@@ -0,0 +1,58 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/fcs,fxl6408.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Fairchild FXL6408 I2C GPIO Expander
+
+maintainers:
+ - Emanuele Ghidoli <emanuele.ghidoli@toradex.com>
+
+properties:
+ compatible:
+ enum:
+ - fcs,fxl6408
+
+ reg:
+ maxItems: 1
+
+ "#gpio-cells":
+ const: 2
+
+ gpio-controller: true
+
+ gpio-line-names:
+ minItems: 1
+ maxItems: 8
+
+patternProperties:
+ "^(hog-[0-9]+|.+-hog(-[0-9]+)?)$":
+ required:
+ - gpio-hog
+
+required:
+ - compatible
+ - reg
+ - gpio-controller
+ - "#gpio-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ gpio_expander_43: gpio-expander@43 {
+ compatible = "fcs,fxl6408";
+ reg = <0x43>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ gpio-line-names = "Wi-Fi_W_DISABLE", "Wi-Fi_WKUP_WLAN",
+ "PWR_EN_+V3.3_WiFi_N", "PCIe_REF_CLK_EN",
+ "USB_RESET_N", "USB_BYPASS_N", "Wi-Fi_PDn",
+ "Wi-Fi_WKUP_BT";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/gpio/fsl,imx8qxp-sc-gpio.yaml b/Documentation/devicetree/bindings/gpio/fsl,imx8qxp-sc-gpio.yaml
new file mode 100644
index 000000000000..b7b32220935d
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/fsl,imx8qxp-sc-gpio.yaml
@@ -0,0 +1,39 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/fsl,imx8qxp-sc-gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: GPIO driver over IMX SCU firmware API
+
+maintainers:
+ - Shenwei Wang <shenwei.wang@nxp.com>
+
+description: |
+ This module provides the standard interface to control the
+ resource pins in SCU domain on i.MX8 platforms.
+
+properties:
+ compatible:
+ enum:
+ - fsl,imx8qxp-sc-gpio
+
+ "#gpio-cells":
+ const: 2
+
+ gpio-controller: true
+
+required:
+ - compatible
+ - "#gpio-cells"
+ - gpio-controller
+
+additionalProperties: false
+
+examples:
+ - |
+ gpio0: gpio {
+ compatible = "fsl,imx8qxp-sc-gpio";
+ gpio-controller;
+ #gpio-cells = <2>;
+ };
diff --git a/Documentation/devicetree/bindings/gpio/fsl-imx-gpio.yaml b/Documentation/devicetree/bindings/gpio/fsl-imx-gpio.yaml
index f57d22d1ebd6..ae18603697d7 100644
--- a/Documentation/devicetree/bindings/gpio/fsl-imx-gpio.yaml
+++ b/Documentation/devicetree/bindings/gpio/fsl-imx-gpio.yaml
@@ -37,6 +37,8 @@ properties:
- fsl,imx8mp-gpio
- fsl,imx8mq-gpio
- fsl,imx8qxp-gpio
+ - fsl,imxrt1050-gpio
+ - fsl,imxrt1170-gpio
- const: fsl,imx35-gpio
reg:
diff --git a/Documentation/devicetree/bindings/gpio/fujitsu,mb86s70-gpio.txt b/Documentation/devicetree/bindings/gpio/fujitsu,mb86s70-gpio.txt
deleted file mode 100644
index bef353f370d8..000000000000
--- a/Documentation/devicetree/bindings/gpio/fujitsu,mb86s70-gpio.txt
+++ /dev/null
@@ -1,20 +0,0 @@
-Fujitsu MB86S7x GPIO Controller
--------------------------------
-
-Required properties:
-- compatible: Should be "fujitsu,mb86s70-gpio"
-- reg: Base address and length of register space
-- clocks: Specify the clock
-- gpio-controller: Marks the device node as a gpio controller.
-- #gpio-cells: Should be <2>. The first cell is the pin number and the
- second cell is used to specify optional parameters:
- - bit 0 specifies polarity (0 for normal, 1 for inverted).
-
-Examples:
- gpio0: gpio@31000000 {
- compatible = "fujitsu,mb86s70-gpio";
- reg = <0 0x31000000 0x10000>;
- gpio-controller;
- #gpio-cells = <2>;
- clocks = <&clk 0 2 1>;
- };
diff --git a/Documentation/devicetree/bindings/gpio/fujitsu,mb86s70-gpio.yaml b/Documentation/devicetree/bindings/gpio/fujitsu,mb86s70-gpio.yaml
new file mode 100644
index 000000000000..d18d95285465
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/fujitsu,mb86s70-gpio.yaml
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/fujitsu,mb86s70-gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Fujitsu MB86S7x GPIO Controller
+
+maintainers:
+ - Jassi Brar <jaswinder.singh@linaro.org>
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - const: socionext,synquacer-gpio
+ - const: fujitsu,mb86s70-gpio
+ - const: fujitsu,mb86s70-gpio
+
+ reg:
+ maxItems: 1
+
+ '#gpio-cells':
+ const: 2
+
+ gpio-controller: true
+ gpio-line-names: true
+
+ clocks:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - '#gpio-cells'
+ - gpio-controller
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ gpio@31000000 {
+ compatible = "fujitsu,mb86s70-gpio";
+ reg = <0x31000000 0x10000>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ clocks = <&clk 0 2 1>;
+ };
+...
diff --git a/Documentation/devicetree/bindings/gpio/gpio-altera.txt b/Documentation/devicetree/bindings/gpio/gpio-altera.txt
index 146e554b3c67..2a80e272cd66 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-altera.txt
+++ b/Documentation/devicetree/bindings/gpio/gpio-altera.txt
@@ -9,8 +9,9 @@ Required properties:
- The second cell is reserved and is currently unused.
- gpio-controller : Marks the device node as a GPIO controller.
- interrupt-controller: Mark the device node as an interrupt controller
-- #interrupt-cells : Should be 1. The interrupt type is fixed in the hardware.
+- #interrupt-cells : Should be 2. The interrupt type is fixed in the hardware.
- The first cell is the GPIO offset number within the GPIO controller.
+ - The second cell is the interrupt trigger type and level flags.
- interrupts: Specify the interrupt.
- altr,interrupt-type: Specifies the interrupt trigger type the GPIO
hardware is synthesized. This field is required if the Altera GPIO controller
@@ -38,6 +39,6 @@ gpio_altr: gpio@ff200000 {
altr,interrupt-type = <IRQ_TYPE_EDGE_RISING>;
#gpio-cells = <2>;
gpio-controller;
- #interrupt-cells = <1>;
+ #interrupt-cells = <2>;
interrupt-controller;
};
diff --git a/Documentation/devicetree/bindings/gpio/gpio-axp209.txt b/Documentation/devicetree/bindings/gpio/gpio-axp209.txt
deleted file mode 100644
index fc42b2caa06d..000000000000
--- a/Documentation/devicetree/bindings/gpio/gpio-axp209.txt
+++ /dev/null
@@ -1,75 +0,0 @@
-AXP209 GPIO & pinctrl controller
-
-This driver follows the usual GPIO bindings found in
-Documentation/devicetree/bindings/gpio/gpio.txt
-
-This driver follows the usual pinctrl bindings found in
-Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt
-
-This driver employs the per-pin muxing pattern.
-
-Required properties:
-- compatible: Should be one of:
- - "x-powers,axp209-gpio"
- - "x-powers,axp813-gpio"
-- #gpio-cells: Should be two. The first cell is the pin number and the
- second is the GPIO flags.
-- gpio-controller: Marks the device node as a GPIO controller.
-
-This node must be a subnode of the axp20x PMIC, documented in
-Documentation/devicetree/bindings/mfd/axp20x.txt
-
-Example:
-
-axp209: pmic@34 {
- compatible = "x-powers,axp209";
- reg = <0x34>;
- interrupt-parent = <&nmi_intc>;
- interrupts = <0 IRQ_TYPE_LEVEL_LOW>;
- interrupt-controller;
- #interrupt-cells = <1>;
-
- axp_gpio: gpio {
- compatible = "x-powers,axp209-gpio";
- gpio-controller;
- #gpio-cells = <2>;
- };
-};
-
-The GPIOs can be muxed to other functions and therefore, must be a subnode of
-axp_gpio.
-
-Example:
-
-&axp_gpio {
- gpio0_adc: gpio0-adc {
- pins = "GPIO0";
- function = "adc";
- };
-};
-
-&example_node {
- pinctrl-names = "default";
- pinctrl-0 = <&gpio0_adc>;
-};
-
-GPIOs and their functions
--------------------------
-
-Each GPIO is independent from the other (i.e. GPIO0 in gpio_in function does
-not force GPIO1 and GPIO2 to be in gpio_in function as well).
-
-axp209
-------
-GPIO | Functions
-------------------------
-GPIO0 | gpio_in, gpio_out, ldo, adc
-GPIO1 | gpio_in, gpio_out, ldo, adc
-GPIO2 | gpio_in, gpio_out
-
-axp813
-------
-GPIO | Functions
-------------------------
-GPIO0 | gpio_in, gpio_out, ldo, adc
-GPIO1 | gpio_in, gpio_out, ldo
diff --git a/Documentation/devicetree/bindings/gpio/gpio-consumer-common.yaml b/Documentation/devicetree/bindings/gpio/gpio-consumer-common.yaml
new file mode 100644
index 000000000000..40d0be31e200
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/gpio-consumer-common.yaml
@@ -0,0 +1,64 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/gpio-consumer-common.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Common GPIO lines
+
+maintainers:
+ - Bartosz Golaszewski <brgl@bgdev.pl>
+ - Linus Walleij <linus.walleij@linaro.org>
+
+description:
+ Pay attention to using the proper GPIO flag (e.g. GPIO_ACTIVE_LOW) for
+ GPIOs that use an inverted signal (e.g. RESETN).
+
+select: true
+
+properties:
+ enable-gpios:
+ maxItems: 1
+ description:
+ GPIO connected to the enable control pin.
+
+ reset-gpios:
+ description:
+ GPIO (or GPIOs for power sequence) connected to the device reset pin
+ (e.g. RESET or RESETN).
+
+ powerdown-gpios:
+ maxItems: 1
+ description:
+ GPIO connected to the power down pin (hardware power down or power cut,
+ e.g. PD or PWDN).
+
+ pwdn-gpios:
+ maxItems: 1
+ description: Use powerdown-gpios instead.
+ deprecated: true
+
+ wakeup-gpios:
+ maxItems: 1
+ description:
+ GPIO connected to the pin waking up the device from suspend or other
+ power-saving modes.
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - mmc-pwrseq-simple
+ then:
+ properties:
+ reset-gpios:
+ minItems: 1
+ maxItems: 32
+ else:
+ properties:
+ reset-gpios:
+ maxItems: 1
+
+additionalProperties: true
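+
+examples:
+ # Illustrative consumer node only; the compatible string and GPIO numbers
+ # are hypothetical. The active-low RESETN pin is flagged GPIO_ACTIVE_LOW,
+ # as the description above asks.
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ sensor {
+ compatible = "vendor,sensor"; /* hypothetical */
+ powerdown-gpios = <&gpio0 3 GPIO_ACTIVE_HIGH>;
+ reset-gpios = <&gpio0 4 GPIO_ACTIVE_LOW>;
+ };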
diff --git a/Documentation/devicetree/bindings/gpio/gpio-davinci.yaml b/Documentation/devicetree/bindings/gpio/gpio-davinci.yaml
index f32e09ef937c..10e56cf306db 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-davinci.yaml
+++ b/Documentation/devicetree/bindings/gpio/gpio-davinci.yaml
@@ -35,7 +35,7 @@ properties:
gpio-line-names:
description: strings describing the names of each gpio line.
minItems: 1
- maxItems: 100
+ maxItems: 144
"#gpio-cells":
const: 2
diff --git a/Documentation/devicetree/bindings/gpio/gpio-eic-sprd.txt b/Documentation/devicetree/bindings/gpio/gpio-eic-sprd.txt
deleted file mode 100644
index 54040a2bfe3a..000000000000
--- a/Documentation/devicetree/bindings/gpio/gpio-eic-sprd.txt
+++ /dev/null
@@ -1,97 +0,0 @@
-Spreadtrum EIC controller bindings
-
-The EIC is the abbreviation of external interrupt controller, which can
-be used only in input mode. The Spreadtrum platform has 2 EIC controllers,
-one is in digital chip, and another one is in PMIC. The digital chip EIC
-controller contains 4 sub-modules: EIC-debounce, EIC-latch, EIC-async and
-EIC-sync. But the PMIC EIC controller contains only one EIC-debounce sub-
-module.
-
-The EIC-debounce sub-module provides up to 8 source input signal
-connections. A debounce mechanism is used to capture the input signals'
-stable status (millisecond resolution) and a single-trigger mechanism
-is introduced into this sub-module to enhance the input event detection
-reliability. In addition, this sub-module's clock can be shut off
-automatically to reduce power dissipation. Moreover the debounce range
-is from 1ms to 4s with a step size of 1ms. The input signal will be
-ignored if it is asserted for less than 1 ms.
-
-The EIC-latch sub-module is used to latch some special power down signals
-and generate interrupts, since the EIC-latch does not depend on the APB
-clock to capture signals.
-
-The EIC-async sub-module uses a 32kHz clock to capture the short signals
-(microsecond resolution) to generate interrupts by level or edge trigger.
-
-The EIC-sync is similar with GPIO's input function, which is a synchronized
-signal input register. It can generate interrupts by level or edge trigger
-when detecting input signals.
-
-Required properties:
-- compatible: Should be one of the following:
- "sprd,sc9860-eic-debounce",
- "sprd,sc9860-eic-latch",
- "sprd,sc9860-eic-async",
- "sprd,sc9860-eic-sync",
- "sprd,sc2731-eic".
-- reg: Define the base and range of the I/O address space containing
- the GPIO controller registers.
-- gpio-controller: Marks the device node as a GPIO controller.
-- #gpio-cells: Should be <2>. The first cell is the gpio number and
- the second cell is used to specify optional parameters.
-- interrupt-controller: Marks the device node as an interrupt controller.
-- #interrupt-cells: Should be <2>. Specifies the number of cells needed
- to encode interrupt source.
-- interrupts: Should be the port interrupt shared by all the gpios.
-
-Example:
- eic_debounce: gpio@40210000 {
- compatible = "sprd,sc9860-eic-debounce";
- reg = <0 0x40210000 0 0x80>;
- gpio-controller;
- #gpio-cells = <2>;
- interrupt-controller;
- #interrupt-cells = <2>;
- interrupts = <GIC_SPI 52 IRQ_TYPE_LEVEL_HIGH>;
- };
-
- eic_latch: gpio@40210080 {
- compatible = "sprd,sc9860-eic-latch";
- reg = <0 0x40210080 0 0x20>;
- gpio-controller;
- #gpio-cells = <2>;
- interrupt-controller;
- #interrupt-cells = <2>;
- interrupts = <GIC_SPI 52 IRQ_TYPE_LEVEL_HIGH>;
- };
-
- eic_async: gpio@402100a0 {
- compatible = "sprd,sc9860-eic-async";
- reg = <0 0x402100a0 0 0x20>;
- gpio-controller;
- #gpio-cells = <2>;
- interrupt-controller;
- #interrupt-cells = <2>;
- interrupts = <GIC_SPI 52 IRQ_TYPE_LEVEL_HIGH>;
- };
-
- eic_sync: gpio@402100c0 {
- compatible = "sprd,sc9860-eic-sync";
- reg = <0 0x402100c0 0 0x20>;
- gpio-controller;
- #gpio-cells = <2>;
- interrupt-controller;
- #interrupt-cells = <2>;
- interrupts = <GIC_SPI 52 IRQ_TYPE_LEVEL_HIGH>;
- };
-
- pmic_eic: gpio@300 {
- compatible = "sprd,sc2731-eic";
- reg = <0x300>;
- interrupt-parent = <&sc2731_pmic>;
- interrupts = <5 IRQ_TYPE_LEVEL_HIGH>;
- gpio-controller;
- #gpio-cells = <2>;
- interrupt-controller;
- #interrupt-cells = <2>;
- };
diff --git a/Documentation/devicetree/bindings/gpio/gpio-latch.yaml b/Documentation/devicetree/bindings/gpio/gpio-latch.yaml
new file mode 100644
index 000000000000..1ed82a2cebda
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/gpio-latch.yaml
@@ -0,0 +1,94 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/gpio-latch.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: GPIO latch controller
+
+maintainers:
+ - Sascha Hauer <s.hauer@pengutronix.de>
+
+description: |
+ This binding describes a GPIO multiplexer based on latches connected to
+ other GPIOs, like this:
+
+ CLK0 ----------------------. ,--------.
+ CLK1 -------------------. `--------|> #0 |
+ | | |
+ OUT0 ----------------+--|-----------|D0 Q0|-----|<
+ OUT1 --------------+-|--|-----------|D1 Q1|-----|<
+ OUT2 ------------+-|-|--|-----------|D2 Q2|-----|<
+ OUT3 ----------+-|-|-|--|-----------|D3 Q3|-----|<
+ OUT4 --------+-|-|-|-|--|-----------|D4 Q4|-----|<
+ OUT5 ------+-|-|-|-|-|--|-----------|D5 Q5|-----|<
+ OUT6 ----+-|-|-|-|-|-|--|-----------|D6 Q6|-----|<
+ OUT7 --+-|-|-|-|-|-|-|--|-----------|D7 Q7|-----|<
+ | | | | | | | | | `--------'
+ | | | | | | | | |
+ | | | | | | | | | ,--------.
+ | | | | | | | | `-----------|> #1 |
+ | | | | | | | | | |
+ | | | | | | | `--------------|D0 Q0|-----|<
+ | | | | | | `----------------|D1 Q1|-----|<
+ | | | | | `------------------|D2 Q2|-----|<
+ | | | | `--------------------|D3 Q3|-----|<
+ | | | `----------------------|D4 Q4|-----|<
+ | | `------------------------|D5 Q5|-----|<
+ | `--------------------------|D6 Q6|-----|<
+ `----------------------------|D7 Q7|-----|<
+ `--------'
+
+ The number of clk-gpios and latched-gpios is not fixed. The actual
+ number of latches and the number of inputs per latch are derived from
+ the number of GPIOs given in the corresponding device tree properties.
+
+properties:
+ compatible:
+ const: gpio-latch
+ "#gpio-cells":
+ const: 2
+
+ clk-gpios:
+ description: Array of GPIOs to be used to clock a latch
+
+ latched-gpios:
+ description: Array of GPIOs to be used as inputs per latch
+
+ setup-duration-ns:
+ description: Delay in nanoseconds to wait after the latch inputs have been
+ set up
+
+ clock-duration-ns:
+ description: Delay in nanoseconds to wait between clock output changes
+
+ gpio-controller: true
+
+ gpio-line-names: true
+
+required:
+ - compatible
+ - "#gpio-cells"
+ - gpio-controller
+ - clk-gpios
+ - latched-gpios
+
+additionalProperties: false
+
+examples:
+ - |
+ gpio-latch {
+ #gpio-cells = <2>;
+ pinctrl-names = "default";
+ pinctrl-0 = <&pinctrl_di_do_leds>;
+ compatible = "gpio-latch";
+ gpio-controller;
+ setup-duration-ns = <100>;
+ clock-duration-ns = <100>;
+
+ clk-gpios = <&gpio3 7 0>, <&gpio3 8 0>;
+ latched-gpios = <&gpio3 21 0>, <&gpio3 22 0>,
+ <&gpio3 23 0>, <&gpio3 24 0>,
+ <&gpio3 25 0>, <&gpio3 26 0>,
+ <&gpio3 27 0>, <&gpio3 28 0>;
+ };
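
A consumer then addresses the multiplexed outputs through the usual
two-cell specifier. A minimal sketch, assuming a gpio_latch label on the
node above, <dt-bindings/gpio/gpio.h> included, and the driver numbering
the virtual lines latch by latch from offset 0 (so line 10 would be
latch #1, output Q2):

    leds {
        compatible = "gpio-leds";

        led-status {
            /* virtual line 10 = latch #1, output Q2 (assumed mapping) */
            gpios = <&gpio_latch 10 GPIO_ACTIVE_HIGH>;
        };
    };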
diff --git a/Documentation/devicetree/bindings/gpio/gpio-mvebu.txt b/Documentation/devicetree/bindings/gpio/gpio-mvebu.txt
deleted file mode 100644
index 0fc6700ed800..000000000000
--- a/Documentation/devicetree/bindings/gpio/gpio-mvebu.txt
+++ /dev/null
@@ -1,93 +0,0 @@
-* Marvell EBU GPIO controller
-
-Required properties:
-
-- compatible : Should be "marvell,orion-gpio", "marvell,mv78200-gpio",
- "marvell,armadaxp-gpio" or "marvell,armada-8k-gpio".
-
- "marvell,orion-gpio" should be used for Orion, Kirkwood, Dove,
- Discovery (except MV78200) and Armada 370. "marvell,mv78200-gpio"
- should be used for the Discovery MV78200.
-
- "marvel,armadaxp-gpio" should be used for all Armada XP SoCs
- (MV78230, MV78260, MV78460).
-
- "marvell,armada-8k-gpio" should be used for the Armada 7K and 8K
- SoCs (either from AP or CP), see
- Documentation/devicetree/bindings/arm/marvell/ap80x-system-controller.txt
- for specific details about the offset property.
-
-- reg: Address and length of the register set for the device. Only one
- entry is expected, except for the "marvell,armadaxp-gpio" variant
- for which two entries are expected: one for the general registers,
- one for the per-cpu registers. Not used for marvell,armada-8k-gpio.
-
-- interrupts: The list of interrupts that are used for all the pins
- managed by this GPIO bank. There can be more than one interrupt
- (example: 1 interrupt per 8 pins on Armada XP, which means 4
- interrupts per bank of 32 GPIOs).
-
-- interrupt-controller: identifies the node as an interrupt controller
-
-- #interrupt-cells: specifies the number of cells needed to encode an
- interrupt source. Should be two.
- The first cell is the GPIO number.
- The second cell is used to specify flags:
- bits[3:0] trigger type and level flags:
- 1 = low-to-high edge triggered.
- 2 = high-to-low edge triggered.
- 4 = active high level-sensitive.
- 8 = active low level-sensitive.
-
-- gpio-controller: marks the device node as a gpio controller
-
-- ngpios: number of GPIOs this controller has
-
-- #gpio-cells: Should be two. The first cell is the pin number. The
- second cell is reserved for flags, unused at the moment.
-
-Optional properties:
-
-In order to use the GPIO lines in PWM mode, some additional optional
-properties are required.
-
-- compatible: Must contain "marvell,armada-370-gpio"
-
-- reg: an additional register set is needed, for the GPIO Blink
- Counter on/off registers.
-
-- reg-names: Must contain an entry "pwm" corresponding to the
- additional register range needed for PWM operation.
-
-- #pwm-cells: Should be two. The first cell is the GPIO line number. The
- second cell is the period in nanoseconds.
-
-- clocks: Must be a phandle to the clock for the GPIO controller.
-
-Example:
-
- gpio0: gpio@d0018100 {
- compatible = "marvell,armadaxp-gpio";
- reg = <0xd0018100 0x40>,
- <0xd0018800 0x30>;
- ngpios = <32>;
- gpio-controller;
- #gpio-cells = <2>;
- interrupt-controller;
- #interrupt-cells = <2>;
- interrupts = <16>, <17>, <18>, <19>;
- };
-
- gpio1: gpio@18140 {
- compatible = "marvell,armada-370-gpio";
- reg = <0x18140 0x40>, <0x181c8 0x08>;
- reg-names = "gpio", "pwm";
- ngpios = <17>;
- gpio-controller;
- #gpio-cells = <2>;
- #pwm-cells = <2>;
- interrupt-controller;
- #interrupt-cells = <2>;
- interrupts = <87>, <88>, <89>;
- clocks = <&coreclk 0>;
- };
diff --git a/Documentation/devicetree/bindings/gpio/gpio-mvebu.yaml b/Documentation/devicetree/bindings/gpio/gpio-mvebu.yaml
new file mode 100644
index 000000000000..f1bd1e6b2e1f
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/gpio-mvebu.yaml
@@ -0,0 +1,146 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/gpio-mvebu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Marvell EBU GPIO controller
+
+maintainers:
+ - Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
+ - Andrew Lunn <andrew@lunn.ch>
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - marvell,armada-8k-gpio
+ - marvell,orion-gpio
+
+ - items:
+ - enum:
+ - marvell,mv78200-gpio
+ - marvell,armada-370-gpio
+ - const: marvell,orion-gpio
+
+ - description: Deprecated binding
+ items:
+ - const: marvell,armadaxp-gpio
+ - const: marvell,orion-gpio
+ deprecated: true
+
+ reg:
+ description: |
+ Address and length of the register set for the device. Not used for
+ marvell,armada-8k-gpio.
+
+ A second entry can be provided, for the PWM function using the GPIO Blink
+ Counter on/off registers.
+ minItems: 1
+ maxItems: 2
+
+ reg-names:
+ items:
+ - const: gpio
+ - const: pwm
+ minItems: 1
+
+ offset:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Offset in the register map for the gpio registers (in bytes)
+
+ interrupts:
+ description: |
+ The list of interrupts that are used for all the pins managed by this
+ GPIO bank. There can be more than one interrupt (example: 1 interrupt
+ per 8 pins on Armada XP, which means 4 interrupts per bank of 32
+ GPIOs).
+ minItems: 1
+ maxItems: 4
+
+ interrupt-controller: true
+
+ "#interrupt-cells":
+ const: 2
+
+ gpio-controller: true
+
+ ngpios:
+ minimum: 1
+ maximum: 32
+
+ "#gpio-cells":
+ const: 2
+
+ marvell,pwm-offset:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Offset in the register map for the pwm registers (in bytes)
+
+ "#pwm-cells":
+ description:
+ The first cell is the GPIO line number. The second cell is the period
+ in nanoseconds.
+ const: 2
+
+ clocks:
+ description:
+ Clock(s) used for PWM function.
+ items:
+ - description: Core clock
+ - description: AXI bus clock
+ minItems: 1
+
+ clock-names:
+ items:
+ - const: core
+ - const: axi
+ minItems: 1
+
+required:
+ - compatible
+ - gpio-controller
+ - ngpios
+ - "#gpio-cells"
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: marvell,armada-8k-gpio
+ then:
+ required:
+ - offset
+ else:
+ required:
+ - reg
+
+unevaluatedProperties: true
+
+examples:
+ - |
+ gpio@d0018100 {
+ compatible = "marvell,armadaxp-gpio", "marvell,orion-gpio";
+ reg = <0xd0018100 0x40>, <0xd0018800 0x30>;
+ ngpios = <32>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ interrupts = <16>, <17>, <18>, <19>;
+ };
+
+ - |
+ gpio@18140 {
+ compatible = "marvell,armada-370-gpio", "marvell,orion-gpio";
+ reg = <0x18140 0x40>, <0x181c8 0x08>;
+ reg-names = "gpio", "pwm";
+ ngpios = <17>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ #pwm-cells = <2>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ interrupts = <87>, <88>, <89>;
+ clocks = <&coreclk 0>;
+ };
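
Note that neither example exercises the marvell,armada-8k-gpio branch of
the allOf clause above, which requires "offset" instead of "reg". A
sketch of such a node, nested under a system-controller syscon as on the
AP/CP dies (the parent node, addresses and offset value are
illustrative):

    system-controller@6f4000 {
        compatible = "syscon", "simple-mfd";
        reg = <0x6f4000 0x1000>;

        ap_gpio: gpio@1040 {
            compatible = "marvell,armada-8k-gpio";
            offset = <0x1040>; /* GPIO registers inside the syscon */
            ngpios = <20>;
            gpio-controller;
            #gpio-cells = <2>;
        };
    };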
diff --git a/Documentation/devicetree/bindings/gpio/gpio-pca9570.yaml b/Documentation/devicetree/bindings/gpio/gpio-pca9570.yaml
index 338c5312a106..5b0134304e51 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-pca9570.yaml
+++ b/Documentation/devicetree/bindings/gpio/gpio-pca9570.yaml
@@ -12,7 +12,9 @@ maintainers:
properties:
compatible:
enum:
+ - dlg,slg7xl45106
- nxp,pca9570
+ - nxp,pca9571
reg:
maxItems: 1
@@ -32,7 +34,7 @@ additionalProperties: false
examples:
- |
- i2c0 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
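
The two added compatibles reuse the same node shape as the existing
pca9570 example. A minimal sketch for the 8-bit nxp,pca9571 (the I2C
address 0x25 is illustrative):

    i2c {
        #address-cells = <1>;
        #size-cells = <0>;

        gpio@25 {
            compatible = "nxp,pca9571";
            reg = <0x25>;
            gpio-controller;
            #gpio-cells = <2>;
        };
    };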
diff --git a/Documentation/devicetree/bindings/gpio/gpio-pca95xx.yaml b/Documentation/devicetree/bindings/gpio/gpio-pca95xx.yaml
index b6a6e742b66d..fa116148ee90 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-pca95xx.yaml
+++ b/Documentation/devicetree/bindings/gpio/gpio-pca95xx.yaml
@@ -15,51 +15,59 @@ description: |+
properties:
compatible:
- enum:
- - exar,xra1202
- - maxim,max7310
- - maxim,max7312
- - maxim,max7313
- - maxim,max7315
- - maxim,max7319
- - maxim,max7320
- - maxim,max7321
- - maxim,max7322
- - maxim,max7323
- - maxim,max7324
- - maxim,max7325
- - maxim,max7326
- - maxim,max7327
- - nxp,pca6416
- - nxp,pca9505
- - nxp,pca9506
- - nxp,pca9534
- - nxp,pca9535
- - nxp,pca9536
- - nxp,pca9537
- - nxp,pca9538
- - nxp,pca9539
- - nxp,pca9554
- - nxp,pca9555
- - nxp,pca9556
- - nxp,pca9557
- - nxp,pca9574
- - nxp,pca9575
- - nxp,pca9698
- - nxp,pcal6416
- - nxp,pcal6524
- - nxp,pcal9535
- - nxp,pcal9554b
- - nxp,pcal9555a
- - onnn,cat9554
- - onnn,pca9654
- - ti,pca6107
- - ti,pca9536
- - ti,tca6408
- - ti,tca6416
- - ti,tca6424
- - ti,tca9539
- - ti,tca9554
+ oneOf:
+ - items:
+ - const: diodes,pi4ioe5v6534q
+ - const: nxp,pcal6534
+ - items:
+ - enum:
+ - exar,xra1202
+ - maxim,max7310
+ - maxim,max7312
+ - maxim,max7313
+ - maxim,max7315
+ - maxim,max7319
+ - maxim,max7320
+ - maxim,max7321
+ - maxim,max7322
+ - maxim,max7323
+ - maxim,max7324
+ - maxim,max7325
+ - maxim,max7326
+ - maxim,max7327
+ - nxp,pca6408
+ - nxp,pca6416
+ - nxp,pca9505
+ - nxp,pca9506
+ - nxp,pca9534
+ - nxp,pca9535
+ - nxp,pca9536
+ - nxp,pca9537
+ - nxp,pca9538
+ - nxp,pca9539
+ - nxp,pca9554
+ - nxp,pca9555
+ - nxp,pca9556
+ - nxp,pca9557
+ - nxp,pca9574
+ - nxp,pca9575
+ - nxp,pca9698
+ - nxp,pcal6408
+ - nxp,pcal6416
+ - nxp,pcal6524
+ - nxp,pcal6534
+ - nxp,pcal9535
+ - nxp,pcal9554b
+ - nxp,pcal9555a
+ - onnn,cat9554
+ - onnn,pca9654
+ - ti,pca6107
+ - ti,pca9536
+ - ti,tca6408
+ - ti,tca6416
+ - ti,tca6424
+ - ti,tca9539
+ - ti,tca9554
reg:
maxItems: 1
@@ -143,7 +151,7 @@ examples:
#include <dt-bindings/gpio/gpio.h>
#include <dt-bindings/interrupt-controller/irq.h>
- i2c0 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
@@ -169,7 +177,7 @@ examples:
- |
#include <dt-bindings/interrupt-controller/irq.h>
- i2c1 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
@@ -190,20 +198,12 @@ examples:
"chg-status+red", "green", "blue", "en-esata",
"fault1", "p26", "p27";
};
-
- ts3a227@3b {
- compatible = "ti,ts3a227e";
- reg = <0x3b>;
- interrupt-parent = <&gpio99>;
- interrupts = <14 IRQ_TYPE_EDGE_RISING>;
- ti,micbias = <0>; /* 2.1V */
- };
};
- |
#include <dt-bindings/interrupt-controller/irq.h>
- i2c2 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
@@ -221,7 +221,7 @@ examples:
};
- |
- i2c3 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
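
Per the new oneOf construct, the diodes,pi4ioe5v6534q entry is only
valid together with the nxp,pcal6534 fallback. A minimal sketch (the I2C
address 0x22 is illustrative):

    i2c {
        #address-cells = <1>;
        #size-cells = <0>;

        gpio@22 {
            compatible = "diodes,pi4ioe5v6534q", "nxp,pcal6534";
            reg = <0x22>;
            gpio-controller;
            #gpio-cells = <2>;
        };
    };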
diff --git a/Documentation/devicetree/bindings/gpio/gpio-pisosr.txt b/Documentation/devicetree/bindings/gpio/gpio-pisosr.txt
index 414a01cdf715..fba3c61f6a5b 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-pisosr.txt
+++ b/Documentation/devicetree/bindings/gpio/gpio-pisosr.txt
@@ -14,7 +14,7 @@ Optional properties:
- ngpios : Number of used GPIO lines (0..n-1), default is 8.
- load-gpios : GPIO pin specifier attached to load enable, this
pin is pulsed before reading from the device to
- load input pin values into the the device.
+ load input pin values into the device.
For other required and optional properties of SPI slave
nodes please refer to ../spi/spi-bus.txt.
diff --git a/Documentation/devicetree/bindings/gpio/gpio-samsung.txt b/Documentation/devicetree/bindings/gpio/gpio-samsung.txt
deleted file mode 100644
index 5375625e8cd2..000000000000
--- a/Documentation/devicetree/bindings/gpio/gpio-samsung.txt
+++ /dev/null
@@ -1,41 +0,0 @@
-Samsung Exynos4 GPIO Controller
-
-Required properties:
-- compatible: Compatible property value should be "samsung,exynos4-gpio>".
-
-- reg: Physical base address of the controller and length of memory mapped
- region.
-
-- #gpio-cells: Should be 4. The syntax of the gpio specifier used by client nodes
- should be the following with values derived from the SoC user manual.
- <[phandle of the gpio controller node]
- [pin number within the gpio controller]
- [mux function]
- [flags and pull up/down]
- [drive strength]>
-
- Values for gpio specifier:
- - Pin number: is a value between 0 to 7.
- - Flags and Pull Up/Down: 0 - Pull Up/Down Disabled.
- 1 - Pull Down Enabled.
- 3 - Pull Up Enabled.
- Bit 16 (0x00010000) - Input is active low.
- - Drive Strength: 0 - 1x,
- 1 - 3x,
- 2 - 2x,
- 3 - 4x
-
-- gpio-controller: Specifies that the node is a gpio controller.
-- #address-cells: should be 1.
-- #size-cells: should be 1.
-
-Example:
-
- gpa0: gpio-controller@11400000 {
- #address-cells = <1>;
- #size-cells = <1>;
- compatible = "samsung,exynos4-gpio";
- reg = <0x11400000 0x20>;
- #gpio-cells = <4>;
- gpio-controller;
- };
diff --git a/Documentation/devicetree/bindings/gpio/gpio-sprd.txt b/Documentation/devicetree/bindings/gpio/gpio-sprd.txt
deleted file mode 100644
index eca97d45388f..000000000000
--- a/Documentation/devicetree/bindings/gpio/gpio-sprd.txt
+++ /dev/null
@@ -1,28 +0,0 @@
-Spreadtrum GPIO controller bindings
-
-The controller's registers are organized as sets of sixteen 16-bit
-registers with each set controlling a bank of up to 16 pins. A single
-interrupt is shared for all of the banks handled by the controller.
-
-Required properties:
-- compatible: Should be "sprd,sc9860-gpio".
-- reg: Define the base and range of the I/O address space containing
-the GPIO controller registers.
-- gpio-controller: Marks the device node as a GPIO controller.
-- #gpio-cells: Should be <2>. The first cell is the gpio number and
-the second cell is used to specify optional parameters.
-- interrupt-controller: Marks the device node as an interrupt controller.
-- #interrupt-cells: Should be <2>. Specifies the number of cells needed
-to encode interrupt source.
-- interrupts: Should be the port interrupt shared by all the gpios.
-
-Example:
- ap_gpio: gpio@40280000 {
- compatible = "sprd,sc9860-gpio";
- reg = <0 0x40280000 0 0x1000>;
- gpio-controller;
- #gpio-cells = <2>;
- interrupt-controller;
- #interrupt-cells = <2>;
- interrupts = <GIC_SPI 50 IRQ_TYPE_LEVEL_HIGH>;
- };
diff --git a/Documentation/devicetree/bindings/gpio/gpio-stmpe.txt b/Documentation/devicetree/bindings/gpio/gpio-stmpe.txt
index a0e4cf885213..b33f8f02c0d7 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-stmpe.txt
+++ b/Documentation/devicetree/bindings/gpio/gpio-stmpe.txt
@@ -8,8 +8,7 @@ Optional properties:
- st,norequest-mask: bitmask specifying which GPIOs should _not_ be requestable
due to different usage (e.g. touch, keypad)
-Node name must be stmpe_gpio and should be child node of stmpe node to which it
-belongs.
+Node should be child node of stmpe node to which it belongs.
Example:
stmpe_gpio {
diff --git a/Documentation/devicetree/bindings/gpio/gpio-tpic2810.txt b/Documentation/devicetree/bindings/gpio/gpio-tpic2810.txt
deleted file mode 100644
index 1afc2de7a537..000000000000
--- a/Documentation/devicetree/bindings/gpio/gpio-tpic2810.txt
+++ /dev/null
@@ -1,16 +0,0 @@
-TPIC2810 GPIO controller bindings
-
-Required properties:
- - compatible : Should be "ti,tpic2810".
- - reg : The I2C address of the device
- - gpio-controller : Marks the device node as a GPIO controller.
- - #gpio-cells : Should be two. For consumer use see gpio.txt.
-
-Example:
-
- gpio@60 {
- compatible = "ti,tpic2810";
- reg = <0x60>;
- gpio-controller;
- #gpio-cells = <2>;
- };
diff --git a/Documentation/devicetree/bindings/gpio/gpio-tpic2810.yaml b/Documentation/devicetree/bindings/gpio/gpio-tpic2810.yaml
new file mode 100644
index 000000000000..157969bc4c46
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/gpio-tpic2810.yaml
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/gpio-tpic2810.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TPIC2810 GPIO controller
+
+maintainers:
+ - Aswath Govindraju <a-govindraju@ti.com>
+
+properties:
+ compatible:
+ enum:
+ - ti,tpic2810
+
+ reg:
+ maxItems: 1
+
+ gpio-controller: true
+
+ "#gpio-cells":
+ const: 2
+
+ gpio-line-names:
+ minItems: 1
+ maxItems: 32
+
+required:
+ - compatible
+ - reg
+ - gpio-controller
+ - "#gpio-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/gpio/gpio.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ gpio@60 {
+ compatible = "ti,tpic2810";
+ reg = <0x60>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ gpio-line-names = "LED A", "LED B", "LED C";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/gpio/gpio-vf610.yaml b/Documentation/devicetree/bindings/gpio/gpio-vf610.yaml
index 19738a457a58..d2c39dba56ad 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-vf610.yaml
+++ b/Documentation/devicetree/bindings/gpio/gpio-vf610.yaml
@@ -24,6 +24,11 @@ properties:
- items:
- const: fsl,imx7ulp-gpio
- const: fsl,vf610-gpio
+ - items:
+ - enum:
+ - fsl,imx93-gpio
+ - fsl,imx8ulp-gpio
+ - const: fsl,imx7ulp-gpio
reg:
description: The first reg tuple represents the PORT module, the second tuple
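
The new entries must carry the fsl,imx7ulp-gpio fallback, so an i.MX93
port would be described roughly as below; the addresses and interrupt
number are illustrative, not taken from the i.MX93 reference manual:

    gpio@43810080 {
        compatible = "fsl,imx93-gpio", "fsl,imx7ulp-gpio";
        reg = <0x43810080 0x1000>, <0x43810000 0x40>;
        interrupts = <GIC_SPI 57 IRQ_TYPE_LEVEL_HIGH>;
        gpio-controller;
        #gpio-cells = <2>;
        interrupt-controller;
        #interrupt-cells = <2>;
    };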
diff --git a/Documentation/devicetree/bindings/gpio/gpio-xilinx.txt b/Documentation/devicetree/bindings/gpio/gpio-xilinx.txt
deleted file mode 100644
index e506f30e1a95..000000000000
--- a/Documentation/devicetree/bindings/gpio/gpio-xilinx.txt
+++ /dev/null
@@ -1,48 +0,0 @@
-Xilinx plb/axi GPIO controller
-
-Dual channel GPIO controller with configurable number of pins
-(from 1 to 32 per channel). Every pin can be configured as
-input/output/tristate. Both channels share the same global IRQ but
-local interrupts can be enabled on channel basis.
-
-Required properties:
-- compatible : Should be "xlnx,xps-gpio-1.00.a"
-- reg : Address and length of the register set for the device
-- #gpio-cells : Should be two. The first cell is the pin number and the
- second cell is used to specify optional parameters (currently unused).
-- gpio-controller : Marks the device node as a GPIO controller.
-
-Optional properties:
-- clocks : Input clock specifier. Refer to common clock bindings.
-- interrupts : Interrupt mapping for GPIO IRQ.
-- xlnx,all-inputs : if n-th bit is setup, GPIO-n is input
-- xlnx,dout-default : if n-th bit is 1, GPIO-n default value is 1
-- xlnx,gpio-width : gpio width
-- xlnx,tri-default : if n-th bit is 1, GPIO-n is in tristate mode
-- xlnx,is-dual : if 1, controller also uses the second channel
-- xlnx,all-inputs-2 : as above but for the second channel
-- xlnx,dout-default-2 : as above but the second channel
-- xlnx,gpio2-width : as above but for the second channel
-- xlnx,tri-default-2 : as above but for the second channel
-
-
-Example:
-gpio: gpio@40000000 {
- #gpio-cells = <2>;
- compatible = "xlnx,xps-gpio-1.00.a";
- clocks = <&clkc25>;
- gpio-controller ;
- interrupt-parent = <&microblaze_0_intc>;
- interrupts = < 6 2 >;
- reg = < 0x40000000 0x10000 >;
- xlnx,all-inputs = <0x0>;
- xlnx,all-inputs-2 = <0x0>;
- xlnx,dout-default = <0x0>;
- xlnx,dout-default-2 = <0x0>;
- xlnx,gpio-width = <0x2>;
- xlnx,gpio2-width = <0x2>;
- xlnx,interrupt-present = <0x1>;
- xlnx,is-dual = <0x1>;
- xlnx,tri-default = <0xffffffff>;
- xlnx,tri-default-2 = <0xffffffff>;
-} ;
diff --git a/Documentation/devicetree/bindings/gpio/gpio-xlp.txt b/Documentation/devicetree/bindings/gpio/gpio-xlp.txt
deleted file mode 100644
index 47fc64922fe0..000000000000
--- a/Documentation/devicetree/bindings/gpio/gpio-xlp.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-Netlogic XLP Family GPIO
-========================
-
-This GPIO driver is used for following Netlogic XLP SoCs:
- XLP832, XLP316, XLP208, XLP980, XLP532
-This GPIO driver is also compatible with GPIO controller found on
-Broadcom Vulcan ARM64.
-
-Required properties:
--------------------
-
-- compatible: Should be one of the following:
- - "netlogic,xlp832-gpio": For Netlogic XLP832
- - "netlogic,xlp316-gpio": For Netlogic XLP316
- - "netlogic,xlp208-gpio": For Netlogic XLP208
- - "netlogic,xlp980-gpio": For Netlogic XLP980
- - "netlogic,xlp532-gpio": For Netlogic XLP532
- - "brcm,vulcan-gpio": For Broadcom Vulcan ARM64
-- reg: Physical base address and length of the controller's registers.
-- #gpio-cells: Should be two. The first cell is the pin number and the second
- cell is used to specify optional parameters (currently unused).
-- gpio-controller: Marks the device node as a GPIO controller.
-- nr-gpios: Number of GPIO pins supported by the controller.
-- interrupt-cells: Should be two. The first cell is the GPIO Number. The
- second cell is used to specify flags. The following subset of flags is
- supported:
- - trigger type:
- 1 = low to high edge triggered.
- 2 = high to low edge triggered.
- 4 = active high level-sensitive.
- 8 = active low level-sensitive.
-- interrupts: Interrupt number for this device.
-- interrupt-controller: Identifies the node as an interrupt controller.
-
-Example:
-
- gpio: xlp_gpio@34000 {
- compatible = "netlogic,xlp316-gpio";
- reg = <0 0x34100 0x1000
- 0 0x35100 0x1000>;
- #gpio-cells = <2>;
- gpio-controller;
- nr-gpios = <57>;
-
- #interrupt-cells = <2>;
- interrupt-parent = <&pic>;
- interrupts = <39>;
- interrupt-controller;
- };
diff --git a/Documentation/devicetree/bindings/gpio/gpio-zynq.yaml b/Documentation/devicetree/bindings/gpio/gpio-zynq.yaml
index 378da2649e66..572e1718f501 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-zynq.yaml
+++ b/Documentation/devicetree/bindings/gpio/gpio-zynq.yaml
@@ -4,14 +4,18 @@
$id: http://devicetree.org/schemas/gpio/gpio-zynq.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Xilinx Zynq GPIO controller Device Tree Bindings
+title: Xilinx Zynq GPIO controller
maintainers:
- Michal Simek <michal.simek@xilinx.com>
properties:
compatible:
- const: xlnx,zynq-gpio-1.0
+ enum:
+ - xlnx,zynq-gpio-1.0
+ - xlnx,zynqmp-gpio-1.0
+ - xlnx,versal-gpio-1.0
+ - xlnx,pmc-gpio-1.0
reg:
maxItems: 1
@@ -24,6 +28,11 @@ properties:
gpio-controller: true
+ gpio-line-names:
+ description: strings describing the names of each gpio line
+ minItems: 58
+ maxItems: 174
+
interrupt-controller: true
"#interrupt-cells":
@@ -32,6 +41,54 @@ properties:
clocks:
maxItems: 1
+ power-domains:
+ maxItems: 1
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ enum:
+ - xlnx,zynqmp-gpio-1.0
+ then:
+ properties:
+ gpio-line-names:
+ minItems: 174
+ maxItems: 174
+
+ - if:
+ properties:
+ compatible:
+ enum:
+ - xlnx,zynq-gpio-1.0
+ then:
+ properties:
+ gpio-line-names:
+ minItems: 118
+ maxItems: 118
+
+ - if:
+ properties:
+ compatible:
+ enum:
+ - xlnx,versal-gpio-1.0
+ then:
+ properties:
+ gpio-line-names:
+ minItems: 58
+ maxItems: 58
+
+ - if:
+ properties:
+ compatible:
+ enum:
+ - xlnx,pmc-gpio-1.0
+ then:
+ properties:
+ gpio-line-names:
+ minItems: 116
+ maxItems: 116
+
required:
- compatible
- reg
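
With the added compatibles and the per-variant gpio-line-names
constraints, a Versal bank could be described as below. gpio-line-names
is omitted since this variant would require exactly 58 entries; the
address, interrupt, clock and power-domain specifiers are illustrative:

    gpio@ff0b0000 {
        compatible = "xlnx,versal-gpio-1.0";
        reg = <0xff0b0000 0x1000>;
        interrupts = <GIC_SPI 13 IRQ_TYPE_LEVEL_HIGH>;
        gpio-controller;
        #gpio-cells = <2>;
        interrupt-controller;
        #interrupt-cells = <2>;
        clocks = <&lpd_lsbus>;
        power-domains = <&firmware 0>;
    };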
diff --git a/Documentation/devicetree/bindings/gpio/gpio.txt b/Documentation/devicetree/bindings/gpio/gpio.txt
index a8895d339bfe..d82c32217fff 100644
--- a/Documentation/devicetree/bindings/gpio/gpio.txt
+++ b/Documentation/devicetree/bindings/gpio/gpio.txt
@@ -154,18 +154,35 @@ of the GPIOs that can't be used.
Optionally, a GPIO controller may have a "gpio-line-names" property. This is
an array of strings defining the names of the GPIO lines going out of the
-GPIO controller. This name should be the most meaningful producer name
-for the system, such as a rail name indicating the usage. Package names
-such as pin name are discouraged: such lines have opaque names (since they
-are by definition generic purpose) and such names are usually not very
-helpful. For example "MMC-CD", "Red LED Vdd" and "ethernet reset" are
-reasonable line names as they describe what the line is used for. "GPIO0"
-is not a good name to give to a GPIO line. Placeholders are discouraged:
-rather use the "" (blank string) if the use of the GPIO line is undefined
-in your design. The names are assigned starting from line offset 0 from
-left to right from the passed array. An incomplete array (where the number
-of passed named are less than ngpios) will still be used up until the last
-provided valid line index.
+GPIO controller.
+
+For lines which are routed to on-board devices, this name should be
+the most meaningful producer name for the system, such as a rail name
+indicating the usage. Package names, such as a pin name, are discouraged:
+such lines have opaque names (since they are by definition general-purpose)
+and such names are usually not very helpful. For example "MMC-CD", "Red LED
+Vdd" and "ethernet reset" are reasonable line names as they describe what
+the line is used for. "GPIO0" is not a good name to give to a GPIO line
+that is hard-wired to a specific device.
+
+However, in the case of lines that are routed to a general purpose header
+(e.g. the Raspberry Pi 40-pin header), and therefore are not hard-wired to
+specific devices, using a pin number or the names on the header is fine
+provided these are real (preferably unique) names. Using an SoC's pad name
+or package name, or names made up from kernel-internal software constructs,
+are strongly discouraged. For example "pin8 [gpio14/uart0_txd]" is fine
+if the board's documentation labels pin 8 as such. However "PortB_24" (an
+example of a name from an SoC's reference manual) would not be desirable.
+
+In either case placeholders are discouraged: rather use the "" (blank
+string) if the use of the GPIO line is undefined in your design. Ideally,
+try to add comments to the dts file describing the naming convention
+you have chosen, and specifying from where the names are derived.
+
+The names are assigned starting from line offset 0, from left to right,
+from the passed array. An incomplete array (where the number of passed
+names is less than ngpios) will be used up until the last provided valid
+line index.
Example:
@@ -213,7 +230,7 @@ Example of two SOC GPIO banks defined as gpio-controller nodes:
gpio-controller;
#gpio-cells = <2>;
- line_b {
+ line_b-hog {
gpio-hog;
gpios = <6 0>;
output-low;
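
Pulling the revised naming guidance together, a bank feeding both
on-board devices and a general purpose header might be named as in this
sketch (the labels are assumed to come from the board documentation):

    gpio-line-names = "MMC-CD",                  /* wired to the SD slot */
                      "",                        /* unused in this design */
                      "pin8 [gpio14/uart0_txd]", /* routed to the header */
                      "Red LED Vdd";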
diff --git a/Documentation/devicetree/bindings/gpio/hisilicon,ascend910-gpio.yaml b/Documentation/devicetree/bindings/gpio/hisilicon,ascend910-gpio.yaml
new file mode 100644
index 000000000000..735d97d645a0
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/hisilicon,ascend910-gpio.yaml
@@ -0,0 +1,56 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/hisilicon,ascend910-gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: HiSilicon common GPIO controller
+
+maintainers:
+ - Jay Fang <f.fangjian@huawei.com>
+
+description:
+ The HiSilicon common GPIO controller can be used for many different
+ types of SoCs, such as the Huawei Ascend AI series chips.
+
+properties:
+ compatible:
+ const: hisilicon,ascend910-gpio
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+ gpio-controller: true
+
+ "#gpio-cells":
+ const: 2
+
+ ngpios:
+ minimum: 1
+ maximum: 32
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - gpio-controller
+ - "#gpio-cells"
+ - ngpios
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ gpio@840d0000 {
+ compatible = "hisilicon,ascend910-gpio";
+ reg = <0x840d0000 0x1000>;
+ ngpios = <32>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupts = <GIC_SPI 33 IRQ_TYPE_LEVEL_HIGH>;
+ };
diff --git a/Documentation/devicetree/bindings/gpio/loongson,ls-gpio.yaml b/Documentation/devicetree/bindings/gpio/loongson,ls-gpio.yaml
new file mode 100644
index 000000000000..fb86e8ce6349
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/loongson,ls-gpio.yaml
@@ -0,0 +1,126 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/loongson,ls-gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Loongson GPIO controller
+
+maintainers:
+ - Yinbo Zhu <zhuyinbo@loongson.cn>
+
+properties:
+ compatible:
+ enum:
+ - loongson,ls2k-gpio
+ - loongson,ls7a-gpio
+
+ reg:
+ maxItems: 1
+
+ ngpios:
+ minimum: 1
+ maximum: 64
+
+ "#gpio-cells":
+ const: 2
+
+ gpio-controller: true
+
+ gpio-ranges: true
+
+ interrupts:
+ minItems: 1
+ maxItems: 64
+
+required:
+ - compatible
+ - reg
+ - ngpios
+ - "#gpio-cells"
+ - gpio-controller
+ - gpio-ranges
+ - interrupts
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ gpio0: gpio@1fe00500 {
+ compatible = "loongson,ls2k-gpio";
+ reg = <0x1fe00500 0x38>;
+ ngpios = <64>;
+ #gpio-cells = <2>;
+ gpio-controller;
+ gpio-ranges = <&pctrl 0 0 15>,
+ <&pctrl 16 16 15>,
+ <&pctrl 32 32 10>,
+ <&pctrl 44 44 20>;
+ interrupt-parent = <&liointc1>;
+ interrupts = <28 IRQ_TYPE_LEVEL_LOW>,
+ <29 IRQ_TYPE_LEVEL_LOW>,
+ <30 IRQ_TYPE_LEVEL_LOW>,
+ <30 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <26 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <>,
+ <>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>,
+ <27 IRQ_TYPE_LEVEL_LOW>;
+ };
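
The loongson,ls7a-gpio variant follows the same shape. A minimal sketch
with a single shared interrupt (the base address, register range, range
width and interrupt number are illustrative):

    gpio@100e0000 {
        compatible = "loongson,ls7a-gpio";
        reg = <0x100e0000 0x900>;
        ngpios = <57>;
        gpio-controller;
        #gpio-cells = <2>;
        gpio-ranges = <&pch_pinctrl 0 0 57>;
        interrupt-parent = <&pch_pic>;
        interrupts = <26 IRQ_TYPE_LEVEL_LOW>;
    };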
diff --git a/Documentation/devicetree/bindings/gpio/loongson,ls1x-gpio.yaml b/Documentation/devicetree/bindings/gpio/loongson,ls1x-gpio.yaml
new file mode 100644
index 000000000000..1a472c05697c
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/loongson,ls1x-gpio.yaml
@@ -0,0 +1,49 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/loongson,ls1x-gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Loongson-1 GPIO controller
+
+maintainers:
+ - Keguang Zhang <keguang.zhang@gmail.com>
+
+properties:
+ compatible:
+ const: loongson,ls1x-gpio
+
+ reg:
+ maxItems: 1
+
+ gpio-controller: true
+
+ "#gpio-cells":
+ const: 2
+
+ ngpios:
+ minimum: 1
+ maximum: 32
+
+required:
+ - compatible
+ - reg
+ - gpio-controller
+ - "#gpio-cells"
+ - ngpios
+
+additionalProperties: false
+
+examples:
+ - |
+ gpio0: gpio@1fd010c0 {
+ compatible = "loongson,ls1x-gpio";
+ reg = <0x1fd010c0 0x4>;
+
+ gpio-controller;
+ #gpio-cells = <2>;
+
+ ngpios = <32>;
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/gpio/microchip,mpfs-gpio.yaml b/Documentation/devicetree/bindings/gpio/microchip,mpfs-gpio.yaml
new file mode 100644
index 000000000000..d481e78958a7
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/microchip,mpfs-gpio.yaml
@@ -0,0 +1,97 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/microchip,mpfs-gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Microchip MPFS GPIO Controller
+
+maintainers:
+ - Conor Dooley <conor.dooley@microchip.com>
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - microchip,mpfs-gpio
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ description:
+ Interrupt mapping, one per GPIO. Maximum 32 GPIOs.
+ minItems: 1
+ maxItems: 32
+
+ interrupt-controller: true
+
+ clocks:
+ maxItems: 1
+
+ "#gpio-cells":
+ const: 2
+
+ "#interrupt-cells":
+ const: 1
+
+ ngpios:
+ description:
+ The number of GPIOs available.
+ minimum: 1
+ maximum: 32
+ default: 32
+
+ gpio-controller: true
+
+patternProperties:
+ "^.+-hog(-[0-9]+)?$":
+ type: object
+
+ additionalProperties: false
+
+ properties:
+ gpio-hog: true
+ gpios: true
+ input: true
+ output-high: true
+ output-low: true
+ line-name: true
+
+ required:
+ - gpio-hog
+ - gpios
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - "#interrupt-cells"
+ - interrupt-controller
+ - "#gpio-cells"
+ - gpio-controller
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ gpio@20122000 {
+ compatible = "microchip,mpfs-gpio";
+ reg = <0x20122000 0x1000>;
+ clocks = <&clkcfg 25>;
+ interrupt-parent = <&plic>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+ #interrupt-cells = <1>;
+ interrupts = <53>, <53>, <53>, <53>,
+ <53>, <53>, <53>, <53>,
+ <53>, <53>, <53>, <53>,
+ <53>, <53>, <53>, <53>,
+ <53>, <53>, <53>, <53>,
+ <53>, <53>, <53>, <53>,
+ <53>, <53>, <53>, <53>,
+ <53>, <53>, <53>, <53>;
+ };
+...
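
The schema also admits GPIO hog children matching "^.+-hog(-[0-9]+)?$",
which the example above does not exercise. A sketch of such a child
inside the gpio@20122000 node (the line number and name are
illustrative):

    sd-sel-hog {
        gpio-hog;
        gpios = <12 0>;
        output-high;
        line-name = "SD card select";
    };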
diff --git a/Documentation/devicetree/bindings/gpio/mstar,msc313-gpio.yaml b/Documentation/devicetree/bindings/gpio/mstar,msc313-gpio.yaml
index fe1e1c63ffe3..18fe90387b87 100644
--- a/Documentation/devicetree/bindings/gpio/mstar,msc313-gpio.yaml
+++ b/Documentation/devicetree/bindings/gpio/mstar,msc313-gpio.yaml
@@ -14,7 +14,9 @@ properties:
pattern: "^gpio@[0-9a-f]+$"
compatible:
- const: mstar,msc313-gpio
+ enum:
+ - mstar,msc313-gpio
+ - sstar,ssd20xd-gpio
reg:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/gpio/nvidia,tegra186-gpio.txt b/Documentation/devicetree/bindings/gpio/nvidia,tegra186-gpio.txt
deleted file mode 100644
index adff16c71d21..000000000000
--- a/Documentation/devicetree/bindings/gpio/nvidia,tegra186-gpio.txt
+++ /dev/null
@@ -1,165 +0,0 @@
-NVIDIA Tegra186 GPIO controllers
-
-Tegra186 contains two GPIO controllers; a main controller and an "AON"
-controller. This binding document applies to both controllers. The register
-layouts for the controllers share many similarities, but also some significant
-differences. Hence, this document describes closely related but different
-bindings and compatible values.
-
-The Tegra186 GPIO controller allows software to set the IO direction of, and
-read/write the value of, numerous GPIO signals. Routing of GPIO signals to
-package balls is under the control of a separate pin controller HW block. Two
-major sets of registers exist:
-
-a) Security registers, which allow configuration of allowed access to the GPIO
-register set. These registers exist in a single contiguous block of physical
-address space. The size of this block, and the security features available,
-varies between the different GPIO controllers.
-
-Access to this set of registers is not necessary in all circumstances. Code
-that wishes to configure access to the GPIO registers needs access to these
-registers to do so. Code which simply wishes to read or write GPIO data does not
-need access to these registers.
-
-b) GPIO registers, which allow manipulation of the GPIO signals. In some GPIO
-controllers, these registers are exposed via multiple "physical aliases" in
-address space, each of which access the same underlying state. See the hardware
-documentation for rationale. Any particular GPIO client is expected to access
-just one of these physical aliases.
-
-Tegra HW documentation describes a unified naming convention for all GPIOs
-implemented by the SoC. Each GPIO is assigned to a port, and a port may control
-a number of GPIOs. Thus, each GPIO is named according to an alphabetical port
-name and an integer GPIO name within the port. For example, GPIO_PA0, GPIO_PN6,
-or GPIO_PCC3.
-
-The number of ports implemented by each GPIO controller varies. The number of
-implemented GPIOs within each port varies. GPIO registers within a controller
-are grouped and laid out according to the port they affect.
-
-The mapping from port name to the GPIO controller that implements that port, and
-the mapping from port name to register offset within a controller, are both
-extremely non-linear. The header file <dt-bindings/gpio/tegra186-gpio.h>
-describes the port-level mapping. In that file, the naming convention for ports
-matches the HW documentation. The values chosen for the names are alphabetically
-sorted within a particular controller. Drivers need to map between the DT GPIO
-IDs and HW register offsets using a lookup table.
-
-Each GPIO controller can generate a number of interrupt signals. Each signal
-represents the aggregate status for all GPIOs within a set of ports. Thus, the
-number of interrupt signals generated by a controller varies as a rough function
-of the number of ports it implements. Note that the HW documentation refers to
-both the overall controller HW module and the sets-of-ports as "controllers".
-
-Each GPIO controller in fact generates multiple interrupts signals for each set
-of ports. Each GPIO may be configured to feed into a specific one of the
-interrupt signals generated by a set-of-ports. The intent is for each generated
-signal to be routed to a different CPU, thus allowing different CPUs to each
-handle subsets of the interrupts within a port. The status of each of these
-per-port-set signals is reported via a separate register. Thus, a driver needs
-to know which status register to observe. This binding currently defines no
-configuration mechanism for this. By default, drivers should use register
-GPIO_${port}_INTERRUPT_STATUS_G1_0. Future revisions to the binding could
-define a property to configure this.
-
-Required properties:
-- compatible
- Array of strings.
- One of:
- - "nvidia,tegra186-gpio".
- - "nvidia,tegra186-gpio-aon".
- - "nvidia,tegra194-gpio".
- - "nvidia,tegra194-gpio-aon".
-- reg-names
- Array of strings.
- Contains a list of names for the register spaces described by the reg
- property. May contain the following entries, in any order:
- - "gpio": Mandatory. GPIO control registers. This may cover either:
- a) The single physical alias that this OS should use.
- b) All physical aliases that exist in the controller. This is
- appropriate when the OS is responsible for managing assignment of
- the physical aliases.
- - "security": Optional. Security configuration registers.
- Users of this binding MUST look up entries in the reg property by name,
- using this reg-names property to do so.
-- reg
- Array of (physical base address, length) tuples.
- Must contain one entry per entry in the reg-names property, in a matching
- order.
-- interrupts
- Array of interrupt specifiers.
- The interrupt outputs from the HW block, one per set of ports, in the
- order the HW manual describes them. The number of entries required varies
- depending on compatible value:
- - "nvidia,tegra186-gpio": 6 entries.
- - "nvidia,tegra186-gpio-aon": 1 entry.
- - "nvidia,tegra194-gpio": 6 entries.
- - "nvidia,tegra194-gpio-aon": 1 entry.
-- gpio-controller
- Boolean.
- Marks the device node as a GPIO controller/provider.
-- #gpio-cells
- Single-cell integer.
- Must be <2>.
- Indicates how many cells are used in a consumer's GPIO specifier.
- In the specifier:
- - The first cell is the pin number.
- See <dt-bindings/gpio/tegra186-gpio.h>.
- - The second cell contains flags:
- - Bit 0 specifies polarity
- - 0: Active-high (normal).
- - 1: Active-low (inverted).
-- interrupt-controller
- Boolean.
- Marks the device node as an interrupt controller/provider.
-- #interrupt-cells
- Single-cell integer.
- Must be <2>.
- Indicates how many cells are used in a consumer's interrupt specifier.
- In the specifier:
- - The first cell is the GPIO number.
- See <dt-bindings/gpio/tegra186-gpio.h>.
- - The second cell is contains flags:
- - Bits [3:0] indicate trigger type and level:
- - 1: Low-to-high edge triggered.
- - 2: High-to-low edge triggered.
- - 4: Active high level-sensitive.
- - 8: Active low level-sensitive.
- Valid combinations are 1, 2, 3, 4, 8.
-
-Example:
-
-#include <dt-bindings/interrupt-controller/irq.h>
-
-gpio@2200000 {
- compatible = "nvidia,tegra186-gpio";
- reg-names = "security", "gpio";
- reg =
- <0x0 0x2200000 0x0 0x10000>,
- <0x0 0x2210000 0x0 0x10000>;
- interrupts =
- <0 47 IRQ_TYPE_LEVEL_HIGH>,
- <0 50 IRQ_TYPE_LEVEL_HIGH>,
- <0 53 IRQ_TYPE_LEVEL_HIGH>,
- <0 56 IRQ_TYPE_LEVEL_HIGH>,
- <0 59 IRQ_TYPE_LEVEL_HIGH>,
- <0 180 IRQ_TYPE_LEVEL_HIGH>;
- gpio-controller;
- #gpio-cells = <2>;
- interrupt-controller;
- #interrupt-cells = <2>;
-};
-
-gpio@c2f0000 {
- compatible = "nvidia,tegra186-gpio-aon";
- reg-names = "security", "gpio";
- reg =
- <0x0 0xc2f0000 0x0 0x1000>,
- <0x0 0xc2f1000 0x0 0x1000>;
- interrupts =
- <0 60 IRQ_TYPE_LEVEL_HIGH>;
- gpio-controller;
- #gpio-cells = <2>;
- interrupt-controller;
- #interrupt-cells = <2>;
-};
diff --git a/Documentation/devicetree/bindings/gpio/nvidia,tegra186-gpio.yaml b/Documentation/devicetree/bindings/gpio/nvidia,tegra186-gpio.yaml
new file mode 100644
index 000000000000..4ef06b2ff1ff
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/nvidia,tegra186-gpio.yaml
@@ -0,0 +1,214 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/nvidia,tegra186-gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra GPIO Controller (Tegra186 and later)
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+description: |
+ Tegra186 contains two GPIO controllers; a main controller and an "AON"
+ controller. This binding document applies to both controllers. The register
+ layouts for the controllers share many similarities, but also some
+ significant differences. Hence, this document describes closely related but
+ different bindings and compatible values.
+
+ The Tegra186 GPIO controller allows software to set the IO direction of,
+ and read/write the value of, numerous GPIO signals. Routing of GPIO signals
+ to package balls is under the control of a separate pin controller hardware
+ block. Two major sets of registers exist:
+
+ a) Security registers, which allow configuration of allowed access to the
+ GPIO register set. These registers exist in a single contiguous block
+ of physical address space. The size of this block, and the security
+ features available, varies between the different GPIO controllers.
+
+ Access to this set of registers is not necessary in all circumstances.
+ Code that wishes to configure access to the GPIO registers needs access
+ to these registers to do so. Code which simply wishes to read or write
+ GPIO data does not need access to these registers.
+
+ b) GPIO registers, which allow manipulation of the GPIO signals. In some
+ GPIO controllers, these registers are exposed via multiple "physical
+ aliases" in address space, each of which access the same underlying
+ state. See the hardware documentation for rationale. Any particular
+ GPIO client is expected to access just one of these physical aliases.
+
+ Tegra HW documentation describes a unified naming convention for all GPIOs
+ implemented by the SoC. Each GPIO is assigned to a port, and a port may
+ control a number of GPIOs. Thus, each GPIO is named according to an
+ alphabetical port name and an integer GPIO name within the port. For
+ example, GPIO_PA0, GPIO_PN6, or GPIO_PCC3.
+
+ The number of ports implemented by each GPIO controller varies. The number
+ of implemented GPIOs within each port varies. GPIO registers within a
+ controller are grouped and laid out according to the port they affect.
+
+ The mapping from port name to the GPIO controller that implements that
+ port, and the mapping from port name to register offset within a
+ controller, are both extremely non-linear. The header file
+ <dt-bindings/gpio/tegra186-gpio.h> describes the port-level mapping. In
+ that file, the naming convention for ports matches the HW documentation.
+ The values chosen for the names are alphabetically sorted within a
+ particular controller. Drivers need to map between the DT GPIO IDs and HW
+ register offsets using a lookup table.
+
+ Each GPIO controller can generate a number of interrupt signals. Each
+ signal represents the aggregate status for all GPIOs within a set of
+ ports. Thus, the number of interrupt signals generated by a controller
+ varies as a rough function of the number of ports it implements. Note
+ that the HW documentation refers to both the overall controller HW
+ module and the sets-of-ports as "controllers".
+
+ Each GPIO controller in fact generates multiple interrupt signals for
+ each set of ports. Each GPIO may be configured to feed into a specific
+ one of the interrupt signals generated by a set-of-ports. The intent is
+ for each generated signal to be routed to a different CPU, thus allowing
+ different CPUs to each handle subsets of the interrupts within a port.
+ The status of each of these per-port-set signals is reported via a
+ separate register. Thus, a driver needs to know which status register to
+ observe. This binding currently defines no configuration mechanism for
+ this. By default, drivers should use register
+ GPIO_${port}_INTERRUPT_STATUS_G1_0. Future revisions to the binding could
+ define a property to configure this.
+
+properties:
+ compatible:
+ enum:
+ - nvidia,tegra186-gpio
+ - nvidia,tegra186-gpio-aon
+ - nvidia,tegra194-gpio
+ - nvidia,tegra194-gpio-aon
+ - nvidia,tegra234-gpio
+ - nvidia,tegra234-gpio-aon
+
+ reg-names:
+ items:
+ - const: security
+ - const: gpio
+ minItems: 1
+
+ reg:
+ items:
+ - description: Security configuration registers.
+ - description: |
+ GPIO control registers. This may cover either:
+
+ a) The single physical alias that this OS should use.
+ b) All physical aliases that exist in the controller. This is
+ appropriate when the OS is responsible for managing assignment
+ of the physical aliases.
+ minItems: 1
+
+ interrupts:
+ description: The interrupt outputs from the HW block, one per set of
+ ports, in the order the HW manual describes them. The number of entries
+ required varies depending on compatible value.
+
+ gpio-controller: true
+
+ "#gpio-cells":
+ description: |
+ Indicates how many cells are used in a consumer's GPIO specifier. In the
+ specifier:
+
+ - The first cell is the pin number.
+ See <dt-bindings/gpio/tegra186-gpio.h>.
+ - The second cell contains flags:
+ - Bit 0 specifies polarity
+ - 0: Active-high (normal).
+ - 1: Active-low (inverted).
+ const: 2
+
+ interrupt-controller: true
+
+ "#interrupt-cells":
+ description: |
+ Indicates how many cells are used in a consumer's interrupt specifier.
+ In the specifier:
+
+ - The first cell is the GPIO number.
+ See <dt-bindings/gpio/tegra186-gpio.h>.
+ - The second cell contains flags:
+ - Bits [3:0] indicate trigger type and level:
+ - 1: Low-to-high edge triggered.
+ - 2: High-to-low edge triggered.
+ - 4: Active high level-sensitive.
+ - 8: Active low level-sensitive.
+
+ Valid combinations are 1, 2, 3, 4, 8.
+ const: 2
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - nvidia,tegra186-gpio
+ - nvidia,tegra194-gpio
+ - nvidia,tegra234-gpio
+ then:
+ properties:
+ interrupts:
+ minItems: 6
+ maxItems: 48
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - nvidia,tegra186-gpio-aon
+ - nvidia,tegra194-gpio-aon
+ - nvidia,tegra234-gpio-aon
+ then:
+ properties:
+ interrupts:
+ minItems: 1
+ maxItems: 4
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - interrupts
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ gpio@2200000 {
+ compatible = "nvidia,tegra186-gpio";
+ reg-names = "security", "gpio";
+ reg = <0x2200000 0x10000>,
+ <0x2210000 0x10000>;
+ interrupts = <0 47 IRQ_TYPE_LEVEL_HIGH>,
+ <0 50 IRQ_TYPE_LEVEL_HIGH>,
+ <0 53 IRQ_TYPE_LEVEL_HIGH>,
+ <0 56 IRQ_TYPE_LEVEL_HIGH>,
+ <0 59 IRQ_TYPE_LEVEL_HIGH>,
+ <0 180 IRQ_TYPE_LEVEL_HIGH>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ };
+
+ gpio@c2f0000 {
+ compatible = "nvidia,tegra186-gpio-aon";
+ reg-names = "security", "gpio";
+ reg = <0xc2f0000 0x1000>,
+ <0xc2f1000 0x1000>;
+ interrupts = <0 60 IRQ_TYPE_LEVEL_HIGH>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ };
diff --git a/Documentation/devicetree/bindings/gpio/nvidia,tegra20-gpio.txt b/Documentation/devicetree/bindings/gpio/nvidia,tegra20-gpio.txt
deleted file mode 100644
index 023c9526e5f8..000000000000
--- a/Documentation/devicetree/bindings/gpio/nvidia,tegra20-gpio.txt
+++ /dev/null
@@ -1,40 +0,0 @@
-NVIDIA Tegra GPIO controller
-
-Required properties:
-- compatible : "nvidia,tegra<chip>-gpio"
-- reg : Physical base address and length of the controller's registers.
-- interrupts : The interrupt outputs from the controller. For Tegra20,
- there should be 7 interrupts specified, and for Tegra30, there should
- be 8 interrupts specified.
-- #gpio-cells : Should be two. The first cell is the pin number and the
- second cell is used to specify optional parameters:
- - bit 0 specifies polarity (0 for normal, 1 for inverted)
-- gpio-controller : Marks the device node as a GPIO controller.
-- #interrupt-cells : Should be 2.
- The first cell is the GPIO number.
- The second cell is used to specify flags:
- bits[3:0] trigger type and level flags:
- 1 = low-to-high edge triggered.
- 2 = high-to-low edge triggered.
- 4 = active high level-sensitive.
- 8 = active low level-sensitive.
- Valid combinations are 1, 2, 3, 4, 8.
-- interrupt-controller : Marks the device node as an interrupt controller.
-
-Example:
-
-gpio: gpio@6000d000 {
- compatible = "nvidia,tegra20-gpio";
- reg = < 0x6000d000 0x1000 >;
- interrupts = < 0 32 0x04
- 0 33 0x04
- 0 34 0x04
- 0 35 0x04
- 0 55 0x04
- 0 87 0x04
- 0 89 0x04 >;
- #gpio-cells = <2>;
- gpio-controller;
- #interrupt-cells = <2>;
- interrupt-controller;
-};
diff --git a/Documentation/devicetree/bindings/gpio/nvidia,tegra20-gpio.yaml b/Documentation/devicetree/bindings/gpio/nvidia,tegra20-gpio.yaml
new file mode 100644
index 000000000000..94b51749ee76
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/nvidia,tegra20-gpio.yaml
@@ -0,0 +1,110 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/nvidia,tegra20-gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra GPIO Controller (Tegra20 - Tegra210)
+
+maintainers:
+ - Thierry Reding <thierry.reding@gmail.com>
+ - Jon Hunter <jonathanh@nvidia.com>
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - nvidia,tegra20-gpio
+ - nvidia,tegra30-gpio
+
+ - items:
+ - enum:
+ - nvidia,tegra114-gpio
+ - nvidia,tegra124-gpio
+ - nvidia,tegra210-gpio
+ - const: nvidia,tegra30-gpio
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ description: The interrupt outputs from the controller. For Tegra20,
+ there should be 7 interrupts specified, and for Tegra30, there should
+ be 8 interrupts specified.
+
+ "#gpio-cells":
+ description: The first cell is the pin number and the second cell is used
+ to specify the GPIO polarity (0 = active high, 1 = active low).
+ const: 2
+
+ gpio-controller: true
+
+ gpio-ranges:
+ maxItems: 1
+
+ "#interrupt-cells":
+ description: |
+ Should be 2. The first cell is the GPIO number. The second cell is
+ used to specify flags:
+
+ bits[3:0] trigger type and level flags:
+ 1 = low-to-high edge triggered.
+ 2 = high-to-low edge triggered.
+ 4 = active high level-sensitive.
+ 8 = active low level-sensitive.
+
+ Valid combinations are 1, 2, 3, 4, 8.
+ const: 2
+
+ interrupt-controller: true
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: nvidia,tegra30-gpio
+ then:
+ properties:
+ interrupts:
+ minItems: 8
+ maxItems: 8
+ else:
+ properties:
+ interrupts:
+ minItems: 7
+ maxItems: 7
+
+required:
+ - compatible
+ - reg
+ - interrupts
+ - "#gpio-cells"
+ - gpio-controller
+ - "#interrupt-cells"
+ - interrupt-controller
+
+additionalProperties:
+ type: object
+ required:
+ - gpio-hog
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ gpio: gpio@6000d000 {
+ compatible = "nvidia,tegra20-gpio";
+ reg = <0x6000d000 0x1000>;
+ interrupts = <GIC_SPI 32 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 33 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 35 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 87 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 89 IRQ_TYPE_LEVEL_HIGH>;
+ #gpio-cells = <2>;
+ gpio-controller;
+ #interrupt-cells = <2>;
+ interrupt-controller;
+ };
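
For nvidia,tegra30-gpio and the chips falling back to it, the if/then
clause above requires eight interrupts instead of seven. A sketch for
Tegra114; the eighth interrupt number is an assumption, not taken from
the TRM:

    gpio@6000d000 {
        compatible = "nvidia,tegra114-gpio", "nvidia,tegra30-gpio";
        reg = <0x6000d000 0x1000>;
        interrupts = <GIC_SPI 32 IRQ_TYPE_LEVEL_HIGH>,
                     <GIC_SPI 33 IRQ_TYPE_LEVEL_HIGH>,
                     <GIC_SPI 34 IRQ_TYPE_LEVEL_HIGH>,
                     <GIC_SPI 35 IRQ_TYPE_LEVEL_HIGH>,
                     <GIC_SPI 55 IRQ_TYPE_LEVEL_HIGH>,
                     <GIC_SPI 87 IRQ_TYPE_LEVEL_HIGH>,
                     <GIC_SPI 89 IRQ_TYPE_LEVEL_HIGH>,
                     <GIC_SPI 125 IRQ_TYPE_LEVEL_HIGH>;
        #gpio-cells = <2>;
        gpio-controller;
        #interrupt-cells = <2>;
        interrupt-controller;
    };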
diff --git a/Documentation/devicetree/bindings/gpio/nxp,pcf8575.yaml b/Documentation/devicetree/bindings/gpio/nxp,pcf8575.yaml
index f0ff66c4c74e..3718103e966a 100644
--- a/Documentation/devicetree/bindings/gpio/nxp,pcf8575.yaml
+++ b/Documentation/devicetree/bindings/gpio/nxp,pcf8575.yaml
@@ -39,6 +39,10 @@ properties:
reg:
maxItems: 1
+ gpio-line-names:
+ minItems: 1
+ maxItems: 16
+
gpio-controller: true
'#gpio-cells':
diff --git a/Documentation/devicetree/bindings/gpio/realtek,otto-gpio.yaml b/Documentation/devicetree/bindings/gpio/realtek,otto-gpio.yaml
index 100f20cebd76..39fd959c45d2 100644
--- a/Documentation/devicetree/bindings/gpio/realtek,otto-gpio.yaml
+++ b/Documentation/devicetree/bindings/gpio/realtek,otto-gpio.yaml
@@ -28,10 +28,11 @@ properties:
- enum:
- realtek,rtl8380-gpio
- realtek,rtl8390-gpio
+ - realtek,rtl9300-gpio
+ - realtek,rtl9310-gpio
- const: realtek,otto-gpio
- reg:
- maxItems: 1
+ reg: true
"#gpio-cells":
const: 2
@@ -50,6 +51,23 @@ properties:
interrupts:
maxItems: 1
+if:
+ properties:
+ compatible:
+ contains:
+ const: realtek,rtl9300-gpio
+then:
+ properties:
+ reg:
+ items:
+ - description: GPIO and interrupt control
+ - description: interrupt CPU map
+else:
+ properties:
+ reg:
+ items:
+ - description: GPIO and interrupt control
+
required:
- compatible
- reg
@@ -74,5 +92,17 @@ examples:
interrupt-parent = <&rtlintc>;
interrupts = <23>;
};
+ - |
+ gpio@3300 {
+ compatible = "realtek,rtl9300-gpio", "realtek,otto-gpio";
+ reg = <0x3300 0x1c>, <0x3338 0x8>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ ngpios = <24>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ interrupt-parent = <&rtlintc>;
+ interrupts = <13>;
+ };
...
diff --git a/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml b/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml
index f2541739ee3b..aa424e2b95f8 100644
--- a/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml
+++ b/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml
@@ -49,7 +49,11 @@ properties:
- const: renesas,rcar-gen3-gpio # R-Car Gen3 or RZ/G2
- items:
- - const: renesas,gpio-r8a779a0 # R-Car V3U
+ - enum:
+ - renesas,gpio-r8a779a0 # R-Car V3U
+ - renesas,gpio-r8a779f0 # R-Car S4-8
+ - renesas,gpio-r8a779g0 # R-Car V4H
+ - const: renesas,rcar-gen4-gpio # R-Car Gen4
reg:
maxItems: 1
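
Each new R-Car Gen4 entry pairs the SoC-specific compatible with the
common renesas,rcar-gen4-gpio fallback. A sketch for an S4-8 bank; the
register address, interrupt number and the clock/power-domain/reset
specifiers are illustrative:

    gpio@e6050180 {
        compatible = "renesas,gpio-r8a779f0", "renesas,rcar-gen4-gpio";
        reg = <0xe6050180 0x54>;
        interrupts = <GIC_SPI 992 IRQ_TYPE_LEVEL_HIGH>;
        clocks = <&cpg CPG_MOD 915>;
        power-domains = <&sysc R8A779F0_PD_ALWAYS_ON>;
        resets = <&cpg 915>;
        gpio-controller;
        #gpio-cells = <2>;
        gpio-ranges = <&pfc 0 0 21>;
        interrupt-controller;
        #interrupt-cells = <2>;
    };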
diff --git a/Documentation/devicetree/bindings/gpio/rockchip,gpio-bank.yaml b/Documentation/devicetree/bindings/gpio/rockchip,gpio-bank.yaml
index 0d62c28fb58d..affd823c881d 100644
--- a/Documentation/devicetree/bindings/gpio/rockchip,gpio-bank.yaml
+++ b/Documentation/devicetree/bindings/gpio/rockchip,gpio-bank.yaml
@@ -27,8 +27,12 @@ properties:
- description: APB interface clock source
- description: GPIO debounce reference clock source
+ gpio-ranges: true
+
gpio-controller: true
+ gpio-line-names: true
+
"#gpio-cells":
const: 2
diff --git a/Documentation/devicetree/bindings/gpio/sifive,gpio.yaml b/Documentation/devicetree/bindings/gpio/sifive,gpio.yaml
index c2902aac2514..fc095646adea 100644
--- a/Documentation/devicetree/bindings/gpio/sifive,gpio.yaml
+++ b/Documentation/devicetree/bindings/gpio/sifive,gpio.yaml
@@ -7,7 +7,6 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: SiFive GPIO controller
maintainers:
- - Yash Shah <yash.shah@sifive.com>
- Paul Walmsley <paul.walmsley@sifive.com>
properties:
@@ -47,6 +46,10 @@ properties:
maximum: 32
default: 16
+ gpio-line-names:
+ minItems: 1
+ maxItems: 32
+
gpio-controller: true
required:
@@ -77,9 +80,10 @@ examples:
gpio@10060000 {
compatible = "sifive,fu540-c000-gpio", "sifive,gpio0";
interrupt-parent = <&plic>;
- interrupts = <7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22>;
+ interrupts = <7>, <8>, <9>, <10>, <11>, <12>, <13>, <14>, <15>, <16>,
+ <17>, <18>, <19>, <20>, <21>, <22>;
reg = <0x10060000 0x1000>;
- clocks = <&tlclk PRCI_CLK_TLCLK>;
+ clocks = <&tlclk FU540_PRCI_CLK_TLCLK>;
gpio-controller;
#gpio-cells = <2>;
interrupt-controller;
diff --git a/Documentation/devicetree/bindings/gpio/socionext,uniphier-gpio.yaml b/Documentation/devicetree/bindings/gpio/socionext,uniphier-gpio.yaml
index bcafa494ed7a..228fa27ffdc3 100644
--- a/Documentation/devicetree/bindings/gpio/socionext,uniphier-gpio.yaml
+++ b/Documentation/devicetree/bindings/gpio/socionext,uniphier-gpio.yaml
@@ -52,6 +52,23 @@ properties:
<child-interrupt-base parent-interrupt-base length> triplets.
$ref: /schemas/types.yaml#/definitions/uint32-matrix
+patternProperties:
+ "^.+-hog(-[0-9]+)?$":
+ type: object
+ properties:
+ gpio-hog: true
+ gpios: true
+ input: true
+ output-high: true
+ output-low: true
+ line-name: true
+
+ required:
+ - gpio-hog
+ - gpios
+
+ additionalProperties: false
+
required:
- compatible
- reg
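Note: a hog child node matching the new "^.+-hog(-[0-9]+)?$" pattern only
needs gpio-hog and gpios; the offset and line name in this sketch are
hypothetical:

    vbus-hog {
        gpio-hog;
        gpios = <54 0>;    /* line offset 54, no flags; hypothetical */
        output-high;
        line-name = "usb-vbus";
    };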
diff --git a/Documentation/devicetree/bindings/gpio/sprd,gpio-eic.yaml b/Documentation/devicetree/bindings/gpio/sprd,gpio-eic.yaml
new file mode 100644
index 000000000000..99fcf970773a
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/sprd,gpio-eic.yaml
@@ -0,0 +1,124 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright 2022 Unisoc Inc.
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/sprd,gpio-eic.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Unisoc EIC controller
+
+maintainers:
+ - Orson Zhai <orsonzhai@gmail.com>
+ - Baolin Wang <baolin.wang7@gmail.com>
+ - Chunyan Zhang <zhang.lyra@gmail.com>
+
+description: |
+  EIC is an abbreviation for external interrupt controller; it can be
+  used only in input mode. The Spreadtrum platform has two EIC controllers:
+  one in the digital chip and one in the PMIC. The digital chip EIC
+  controller contains 4 sub-modules, i.e. EIC-debounce, EIC-latch, EIC-async
+  and EIC-sync, while the PMIC EIC controller contains only the EIC-debounce
+  sub-module.
+
+ The EIC-debounce sub-module provides up to 8 source input signal
+ connections. A debounce mechanism is used to capture the input signals'
+ stable status (millisecond resolution) and a single-trigger mechanism
+ is introduced into this sub-module to enhance the input event detection
+  reliability. In addition, this sub-module's clock can be shut off
+  automatically to reduce power dissipation. Moreover, the debounce range
+  is from 1 ms to 4 s with a step size of 1 ms. The input signal will be
+  ignored if it is asserted for less than 1 ms.
+
+ The EIC-latch sub-module is used to latch some special power down signals
+ and generate interrupts, since the EIC-latch does not depend on the APB
+ clock to capture signals.
+
+ The EIC-async sub-module uses a 32kHz clock to capture the short signals
+ (microsecond resolution) to generate interrupts by level or edge trigger.
+
+  The EIC-sync sub-module is similar to the GPIO input function: it is a
+  synchronized signal input register that can generate interrupts by level
+  or edge trigger when detecting input signals.
+
+properties:
+ compatible:
+ oneOf:
+ - enum:
+ - sprd,sc9860-eic-debounce
+ - sprd,sc9860-eic-latch
+ - sprd,sc9860-eic-async
+ - sprd,sc9860-eic-sync
+ - sprd,sc2731-eic
+ - items:
+ - enum:
+ - sprd,ums512-eic-debounce
+ - const: sprd,sc9860-eic-debounce
+ - items:
+ - enum:
+ - sprd,ums512-eic-latch
+ - const: sprd,sc9860-eic-latch
+ - items:
+ - enum:
+ - sprd,ums512-eic-async
+ - const: sprd,sc9860-eic-async
+ - items:
+ - enum:
+ - sprd,ums512-eic-sync
+ - const: sprd,sc9860-eic-sync
+ - items:
+ - enum:
+ - sprd,sc2730-eic
+ - const: sprd,sc2731-eic
+
+ reg:
+ minItems: 1
+ maxItems: 3
+ description:
+      The EIC controller supports up to 3 banks, each with its own
+      address base.
+
+ gpio-controller: true
+
+ "#gpio-cells":
+ const: 2
+
+ interrupt-controller: true
+
+ "#interrupt-cells":
+ const: 2
+
+ interrupts:
+ maxItems: 1
+ description:
+ The interrupt shared by all GPIO lines for this controller.
+
+required:
+ - compatible
+ - reg
+ - gpio-controller
+ - "#gpio-cells"
+ - interrupt-controller
+ - "#interrupt-cells"
+ - interrupts
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ eic_debounce: gpio@40210000 {
+ compatible = "sprd,sc9860-eic-debounce";
+ reg = <0 0x40210000 0 0x80>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ interrupts = <GIC_SPI 52 IRQ_TYPE_LEVEL_HIGH>;
+ };
+ };
+...
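Note: the schema's example covers only the digital-chip debounce bank. A
PMIC-side instance would use the sc2731-eic compatible and take its interrupt
from the PMIC; a hedged sketch with hypothetical register offset and interrupt
number, assuming dt-bindings/interrupt-controller/irq.h is included:

    pmic_eic: gpio@300 {
        compatible = "sprd,sc2731-eic";
        reg = <0x300>;
        gpio-controller;
        #gpio-cells = <2>;
        interrupt-controller;
        #interrupt-cells = <2>;
        interrupts = <5 IRQ_TYPE_LEVEL_HIGH>;
    };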
diff --git a/Documentation/devicetree/bindings/gpio/sprd,gpio.yaml b/Documentation/devicetree/bindings/gpio/sprd,gpio.yaml
new file mode 100644
index 000000000000..483168838128
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/sprd,gpio.yaml
@@ -0,0 +1,75 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright 2022 Unisoc Inc.
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/sprd,gpio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Unisoc GPIO controller
+
+maintainers:
+ - Orson Zhai <orsonzhai@gmail.com>
+ - Baolin Wang <baolin.wang7@gmail.com>
+ - Chunyan Zhang <zhang.lyra@gmail.com>
+
+description: |
+ The controller's registers are organized as sets of sixteen 16-bit
+ registers with each set controlling a bank of up to 16 pins. A single
+ interrupt is shared for all of the banks handled by the controller.
+
+properties:
+ compatible:
+ oneOf:
+ - const: sprd,sc9860-gpio
+ - items:
+ - enum:
+ - sprd,ums512-gpio
+ - const: sprd,sc9860-gpio
+
+ reg:
+ maxItems: 1
+
+ gpio-controller: true
+
+ "#gpio-cells":
+ const: 2
+
+ interrupt-controller: true
+
+ "#interrupt-cells":
+ const: 2
+
+ interrupts:
+ maxItems: 1
+ description: The interrupt shared by all GPIO lines for this controller.
+
+required:
+ - compatible
+ - reg
+ - gpio-controller
+ - "#gpio-cells"
+ - interrupt-controller
+ - "#interrupt-cells"
+ - interrupts
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ ap_gpio: gpio@40280000 {
+ compatible = "sprd,sc9860-gpio";
+ reg = <0 0x40280000 0 0x1000>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ interrupts = <GIC_SPI 50 IRQ_TYPE_LEVEL_HIGH>;
+ };
+ };
+...
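Note: both cells of #gpio-cells follow the common GPIO binding (line number,
then flags), so a consumer can reference a line directly; in this sketch the
consumer node and line number are hypothetical:

    vibrator {
        compatible = "gpio-vibrator";
        /* line 56 of the controller from the example above */
        enable-gpios = <&ap_gpio 56 GPIO_ACTIVE_HIGH>;
    };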
diff --git a/Documentation/devicetree/bindings/gpio/ti,omap-gpio.yaml b/Documentation/devicetree/bindings/gpio/ti,omap-gpio.yaml
index 7087e4a5013f..bd721c839059 100644
--- a/Documentation/devicetree/bindings/gpio/ti,omap-gpio.yaml
+++ b/Documentation/devicetree/bindings/gpio/ti,omap-gpio.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/gpio/ti,omap-gpio.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: OMAP GPIO controller bindings
+title: OMAP GPIO controller
maintainers:
- Grygorii Strashko <grygorii.strashko@ti.com>
diff --git a/Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml b/Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml
index 9ad470e01953..b085450b527f 100644
--- a/Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml
+++ b/Documentation/devicetree/bindings/gpio/toshiba,gpio-visconti.yaml
@@ -43,7 +43,6 @@ required:
- gpio-controller
- interrupt-controller
- "#interrupt-cells"
- - interrupt-parent
additionalProperties: false
diff --git a/Documentation/devicetree/bindings/gpio/x-powers,axp209-gpio.yaml b/Documentation/devicetree/bindings/gpio/x-powers,axp209-gpio.yaml
new file mode 100644
index 000000000000..31906c253940
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/x-powers,axp209-gpio.yaml
@@ -0,0 +1,62 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/gpio/x-powers,axp209-gpio.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: X-Powers AXP209 GPIO
+
+maintainers:
+ - Chen-Yu Tsai <wens@csie.org>
+
+properties:
+ "#gpio-cells":
+ const: 2
+ description: >
+ The first cell is the pin number and the second is the GPIO flags.
+
+ compatible:
+ oneOf:
+ - enum:
+ - x-powers,axp209-gpio
+ - x-powers,axp221-gpio
+ - x-powers,axp813-gpio
+ - items:
+ - enum:
+ - x-powers,axp223-gpio
+ - x-powers,axp809-gpio
+ - const: x-powers,axp221-gpio
+ - items:
+ - const: x-powers,axp803-gpio
+ - const: x-powers,axp813-gpio
+
+ gpio-controller: true
+
+patternProperties:
+ "^.*-pins?$":
+ $ref: /schemas/pinctrl/pinmux-node.yaml#
+ additionalProperties: false
+
+ properties:
+ pins:
+ items:
+ enum:
+ - GPIO0
+ - GPIO1
+ - GPIO2
+
+ function:
+ enum:
+ - adc
+ - ldo
+ - gpio_in
+ - gpio_out
+
+required:
+ - compatible
+ - "#gpio-cells"
+ - gpio-controller
+
+additionalProperties: false
+
+...
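Note: this schema ships without an example. A node conforming to it,
including a pinmux child matching "^.*-pins?$", might look like the following
sketch (the child node name is hypothetical):

    axp_gpio: gpio {
        compatible = "x-powers,axp209-gpio";
        gpio-controller;
        #gpio-cells = <2>;

        gpio0-adc-pin {
            pins = "GPIO0";
            function = "adc";
        };
    };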
diff --git a/Documentation/devicetree/bindings/gpio/xlnx,gpio-xilinx.yaml b/Documentation/devicetree/bindings/gpio/xlnx,gpio-xilinx.yaml
new file mode 100644
index 000000000000..f333ee2288e7
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/xlnx,gpio-xilinx.yaml
@@ -0,0 +1,154 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpio/xlnx,gpio-xilinx.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Xilinx AXI GPIO controller
+
+maintainers:
+ - Neeli Srinivas <srinivas.neeli@xilinx.com>
+
+description:
+  The AXI GPIO design provides a general-purpose input/output interface
+  on an AXI4-Lite bus. The AXI GPIO can be configured as either
+ a single or a dual-channel device. The width of each channel is
+ independently configurable. The channels can be configured to
+ generate an interrupt when a transition on any of their inputs occurs.
+
+properties:
+ compatible:
+ enum:
+ - xlnx,xps-gpio-1.00.a
+
+ reg:
+ maxItems: 1
+
+ "#gpio-cells":
+ const: 2
+
+ interrupts:
+ maxItems: 1
+
+ gpio-controller: true
+
+ gpio-line-names:
+ description: strings describing the names of each gpio line
+ minItems: 1
+ maxItems: 64
+
+ interrupt-controller: true
+
+ "#interrupt-cells":
+ const: 2
+
+ clocks:
+ maxItems: 1
+
+ interrupt-names: true
+
+ xlnx,all-inputs:
+ $ref: /schemas/types.yaml#/definitions/uint32
+    description: This option sets all bits of GPIO channel 1 to input mode.
+
+ xlnx,all-inputs-2:
+ $ref: /schemas/types.yaml#/definitions/uint32
+    description: This option sets all bits of GPIO channel 2 to input mode.
+
+ xlnx,all-outputs:
+ $ref: /schemas/types.yaml#/definitions/uint32
+    description: This option sets all bits of GPIO channel 1 to output mode.
+
+ xlnx,all-outputs-2:
+ $ref: /schemas/types.yaml#/definitions/uint32
+    description: This option sets all bits of GPIO channel 2 to output mode.
+
+ xlnx,dout-default:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Sets the default value of all the enabled bits of
+ channel1.
+ default: 0
+
+ xlnx,dout-default-2:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Sets the default value of all the enabled bits of
+ channel2.
+ default: 0
+
+ xlnx,gpio-width:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: The value defines the bit width of the GPIO channel1.
+ minimum: 1
+ maximum: 32
+ default: 32
+
+ xlnx,gpio2-width:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: The value defines the bit width of the GPIO channel2.
+ minimum: 1
+ maximum: 32
+ default: 32
+
+ xlnx,interrupt-present:
+ $ref: /schemas/types.yaml#/definitions/uint32
+    description: This parameter enables the interrupt control logic
+      and interrupt registers in the GPIO module.
+ minimum: 0
+ maximum: 1
+ default: 0
+
+ xlnx,is-dual:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: This parameter enables a second GPIO channel (GPIO2).
+ minimum: 0
+ maximum: 1
+ default: 0
+
+ xlnx,tri-default:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: This value configures the input or output mode
+ of each bit of GPIO channel1.
+
+ xlnx,tri-default-2:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: This value configures the input or output mode
+ of each bit of GPIO channel2.
+
+required:
+ - reg
+ - compatible
+ - gpio-controller
+ - "#gpio-cells"
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ gpio@e000a000 {
+ compatible = "xlnx,xps-gpio-1.00.a";
+ reg = <0xa0020000 0x10000>;
+ #gpio-cells = <2>;
+ #interrupt-cells = <0x2>;
+ clocks = <&zynqmp_clk 71>;
+ gpio-controller;
+ interrupt-controller;
+ interrupt-names = "ip2intc_irpt";
+ interrupt-parent = <&gic>;
+ interrupts = <0 89 4>;
+ xlnx,all-inputs = <0x0>;
+ xlnx,all-inputs-2 = <0x0>;
+ xlnx,all-outputs = <0x0>;
+ xlnx,all-outputs-2 = <0x0>;
+ xlnx,dout-default = <0x0>;
+ xlnx,dout-default-2 = <0x0>;
+ xlnx,gpio-width = <0x20>;
+ xlnx,gpio2-width = <0x20>;
+ xlnx,interrupt-present = <0x1>;
+ xlnx,is-dual = <0x1>;
+ xlnx,tri-default = <0xFFFFFFFF>;
+ xlnx,tri-default-2 = <0xFFFFFFFF>;
+ };
+
+...
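Note: when xlnx,is-dual = <0>, the channel-2 knobs (xlnx,gpio2-width,
xlnx,all-inputs-2 and friends) can simply be omitted, since they default per
the schema. A trimmed single-channel sketch with a hypothetical address:

    gpio@a0010000 {
        compatible = "xlnx,xps-gpio-1.00.a";
        reg = <0xa0010000 0x10000>;
        #gpio-cells = <2>;
        gpio-controller;
        xlnx,gpio-width = <8>;
        xlnx,all-inputs = <0x1>;
        xlnx,is-dual = <0>;
    };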
diff --git a/Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml b/Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml
new file mode 100644
index 000000000000..31c0fc345903
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpio/xlnx,zynqmp-gpio-modepin.yaml
@@ -0,0 +1,43 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/gpio/xlnx,zynqmp-gpio-modepin.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: ZynqMP Mode Pin GPIO controller
+
+description:
+  PS_MODE is a 4-bit bank of boot-mode pins sampled on POR deassertion.
+  The Mode Pin GPIO controller exposes these pins (numbered 0 to 3 within
+  PS_MODE) as GPIOs, and every pin can be configured as input or output.
+
+maintainers:
+ - Piyush Mehta <piyush.mehta@xilinx.com>
+
+properties:
+ compatible:
+ const: xlnx,zynqmp-gpio-modepin
+
+ gpio-controller: true
+
+ "#gpio-cells":
+ const: 2
+
+required:
+ - compatible
+ - gpio-controller
+ - "#gpio-cells"
+
+additionalProperties: false
+
+examples:
+ - |
+ zynqmp-firmware {
+ gpio {
+ compatible = "xlnx,zynqmp-gpio-modepin";
+ gpio-controller;
+ #gpio-cells = <2>;
+ };
+ };
+
+...
diff --git a/Documentation/devicetree/bindings/gpu/arm,mali-bifrost.yaml b/Documentation/devicetree/bindings/gpu/arm,mali-bifrost.yaml
index 6f98dd55fb4c..0400a361875d 100644
--- a/Documentation/devicetree/bindings/gpu/arm,mali-bifrost.yaml
+++ b/Documentation/devicetree/bindings/gpu/arm,mali-bifrost.yaml
@@ -14,32 +14,53 @@ properties:
pattern: '^gpu@[a-f0-9]+$'
compatible:
- items:
- - enum:
- - amlogic,meson-g12a-mali
- - mediatek,mt8183-mali
- - realtek,rtd1619-mali
- - rockchip,px30-mali
- - rockchip,rk3568-mali
- - const: arm,mali-bifrost # Mali Bifrost GPU model/revision is fully discoverable
+ oneOf:
+ - items:
+ - enum:
+ - amlogic,meson-g12a-mali
+ - mediatek,mt8183-mali
+ - mediatek,mt8183b-mali
+ - mediatek,mt8186-mali
+ - realtek,rtd1619-mali
+ - renesas,r9a07g044-mali
+ - renesas,r9a07g054-mali
+ - rockchip,px30-mali
+ - rockchip,rk3568-mali
+ - const: arm,mali-bifrost # Mali Bifrost GPU model/revision is fully discoverable
+ - items:
+ - enum:
+ - mediatek,mt8195-mali
+ - const: mediatek,mt8192-mali
+ - const: arm,mali-valhall-jm # Mali Valhall GPU model/revision is fully discoverable
+ - items:
+ - enum:
+ - mediatek,mt8192-mali
+ - const: arm,mali-valhall-jm # Mali Valhall GPU model/revision is fully discoverable
reg:
maxItems: 1
interrupts:
+ minItems: 3
items:
- description: Job interrupt
- description: MMU interrupt
- description: GPU interrupt
+ - description: Event interrupt
interrupt-names:
+ minItems: 3
items:
- const: job
- const: mmu
- const: gpu
+ - const: event
clocks:
- maxItems: 1
+ minItems: 1
+ maxItems: 3
+
+ clock-names: true
mali-supply: true
@@ -49,10 +70,17 @@ properties:
power-domains:
minItems: 1
- maxItems: 3
+ maxItems: 5
+
+ power-domain-names:
+ minItems: 2
+ maxItems: 5
resets:
- maxItems: 2
+ minItems: 1
+ maxItems: 3
+
+ reset-names: true
"#cooling-cells":
const: 2
@@ -76,6 +104,13 @@ properties:
dma-coherent: true
+ nvmem-cell-names:
+ items:
+ - const: speed-bin
+
+ nvmem-cells:
+ maxItems: 1
+
required:
- compatible
- reg
@@ -92,8 +127,47 @@ allOf:
contains:
const: amlogic,meson-g12a-mali
then:
+ properties:
+ power-domains:
+ maxItems: 1
+ power-domain-names: false
+ required:
+ - resets
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - renesas,r9a07g044-mali
+ - renesas,r9a07g054-mali
+ then:
+ properties:
+ interrupts:
+ minItems: 4
+ interrupt-names:
+ minItems: 4
+ clocks:
+ minItems: 3
+ clock-names:
+ items:
+ - const: gpu
+ - const: bus
+ - const: bus_ace
+ power-domains:
+ maxItems: 1
+ power-domain-names: false
+ resets:
+ minItems: 3
+ reset-names:
+ items:
+ - const: rst
+ - const: axi_rst
+ - const: ace_rst
required:
+ - clock-names
+ - power-domains
- resets
+ - reset-names
- if:
properties:
compatible:
@@ -103,6 +177,7 @@ allOf:
properties:
power-domains:
minItems: 3
+ maxItems: 3
power-domain-names:
items:
- const: core0
@@ -115,9 +190,79 @@ allOf:
- power-domain-names
else:
properties:
+ sram-supply: false
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: mediatek,mt8183b-mali
+ then:
+ properties:
+ power-domains:
+ minItems: 3
+ maxItems: 3
+ power-domain-names:
+ items:
+ - const: core0
+ - const: core1
+ - const: core2
+ required:
+ - power-domains
+ - power-domain-names
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: mediatek,mt8186-mali
+ then:
+ properties:
+ power-domains:
+ minItems: 2
+ maxItems: 2
+ power-domain-names:
+ items:
+ - const: core0
+ - const: core1
+ required:
+ - power-domains
+ - power-domain-names
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: mediatek,mt8192-mali
+ then:
+ properties:
+ power-domains:
+ minItems: 5
+ power-domain-names:
+ items:
+ - const: core0
+ - const: core1
+ - const: core2
+ - const: core3
+ - const: core4
+ required:
+ - power-domains
+ - power-domain-names
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: rockchip,rk3568-mali
+ then:
+ properties:
+ clocks:
+ minItems: 2
+ clock-names:
+ items:
+ - const: gpu
+ - const: bus
power-domains:
maxItems: 1
- sram-supply: false
+ power-domain-names: false
+ required:
+ - clock-names
examples:
- |
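Note: under the new rockchip,rk3568-mali conditional, a conforming node needs
"gpu" and "bus" clocks and at most one power domain. A hedged sketch; the
interrupt numbers and provider phandles are placeholders, and arm-gic.h is
assumed to be included:

    gpu@fde60000 {
        compatible = "rockchip,rk3568-mali", "arm,mali-bifrost";
        reg = <0xfde60000 0x4000>;
        interrupts = <GIC_SPI 67 IRQ_TYPE_LEVEL_HIGH>,
                     <GIC_SPI 68 IRQ_TYPE_LEVEL_HIGH>,
                     <GIC_SPI 66 IRQ_TYPE_LEVEL_HIGH>;
        interrupt-names = "job", "mmu", "gpu";
        clocks = <&scmi_clk 1>, <&cru 2>;
        clock-names = "gpu", "bus";
        power-domains = <&power 1>;
        mali-supply = <&vdd_gpu>;
    };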
diff --git a/Documentation/devicetree/bindings/gpu/arm,mali-midgard.yaml b/Documentation/devicetree/bindings/gpu/arm,mali-midgard.yaml
index d209f272625d..2a25384ca3ef 100644
--- a/Documentation/devicetree/bindings/gpu/arm,mali-midgard.yaml
+++ b/Documentation/devicetree/bindings/gpu/arm,mali-midgard.yaml
@@ -74,7 +74,8 @@ properties:
- const: bus
mali-supply: true
- opp-table: true
+ opp-table:
+ type: object
power-domains:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/gpu/arm,mali-utgard.yaml b/Documentation/devicetree/bindings/gpu/arm,mali-utgard.yaml
index eceaa176bd57..318122d95eb5 100644
--- a/Documentation/devicetree/bindings/gpu/arm,mali-utgard.yaml
+++ b/Documentation/devicetree/bindings/gpu/arm,mali-utgard.yaml
@@ -101,7 +101,8 @@ properties:
mali-supply: true
- opp-table: true
+ opp-table:
+ type: object
power-domains:
maxItems: 1
diff --git a/Documentation/devicetree/bindings/gpu/brcm,bcm-v3d.yaml b/Documentation/devicetree/bindings/gpu/brcm,bcm-v3d.yaml
index e6485f7b046f..dae55b8a267b 100644
--- a/Documentation/devicetree/bindings/gpu/brcm,bcm-v3d.yaml
+++ b/Documentation/devicetree/bindings/gpu/brcm,bcm-v3d.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/gpu/brcm,bcm-v3d.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Broadcom V3D GPU Bindings
+title: Broadcom V3D GPU
maintainers:
- Eric Anholt <eric@anholt.net>
@@ -16,6 +16,7 @@ properties:
compatible:
enum:
+ - brcm,2711-v3d
- brcm,7268-v3d
- brcm,7278-v3d
diff --git a/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvdec.yaml b/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvdec.yaml
new file mode 100644
index 000000000000..ba4c6473ff92
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvdec.yaml
@@ -0,0 +1,106 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpu/host1x/nvidia,tegra210-nvdec.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra NVDEC
+
+description: |
+ NVDEC is the hardware video decoder present on NVIDIA Tegra210
+ and newer chips. It is located on the Host1x bus and typically
+ programmed through Host1x channels.
+
+maintainers:
+ - Thierry Reding <treding@gmail.com>
+ - Mikko Perttunen <mperttunen@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^nvdec@[0-9a-f]*$"
+
+ compatible:
+ enum:
+ - nvidia,tegra210-nvdec
+ - nvidia,tegra186-nvdec
+ - nvidia,tegra194-nvdec
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: nvdec
+
+ resets:
+ maxItems: 1
+
+ reset-names:
+ items:
+ - const: nvdec
+
+ power-domains:
+ maxItems: 1
+
+ iommus:
+ maxItems: 1
+
+ dma-coherent: true
+
+ interconnects:
+ items:
+ - description: DMA read memory client
+ - description: DMA read 2 memory client
+ - description: DMA write memory client
+
+ interconnect-names:
+ items:
+ - const: dma-mem
+ - const: read-1
+ - const: write
+
+ nvidia,host1x-class:
+ description: |
+ Host1x class of the engine, used to specify the targeted engine
+ when programming the engine through Host1x channels or when
+ configuring engine-specific behavior in Host1x.
+ default: 0xf0
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+ - power-domains
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra186-clock.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/memory/tegra186-mc.h>
+ #include <dt-bindings/power/tegra186-powergate.h>
+ #include <dt-bindings/reset/tegra186-reset.h>
+
+ nvdec@15480000 {
+ compatible = "nvidia,tegra186-nvdec";
+ reg = <0x15480000 0x40000>;
+ clocks = <&bpmp TEGRA186_CLK_NVDEC>;
+ clock-names = "nvdec";
+ resets = <&bpmp TEGRA186_RESET_NVDEC>;
+ reset-names = "nvdec";
+
+ power-domains = <&bpmp TEGRA186_POWER_DOMAIN_NVDEC>;
+ interconnects = <&mc TEGRA186_MEMORY_CLIENT_NVDECSRD &emc>,
+ <&mc TEGRA186_MEMORY_CLIENT_NVDECSRD1 &emc>,
+ <&mc TEGRA186_MEMORY_CLIENT_NVDECSWR &emc>;
+ interconnect-names = "dma-mem", "read-1", "write";
+ iommus = <&smmu TEGRA186_SID_NVDEC>;
+ };
diff --git a/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvenc.yaml b/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvenc.yaml
new file mode 100644
index 000000000000..c23dae713eb8
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvenc.yaml
@@ -0,0 +1,135 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpu/host1x/nvidia,tegra210-nvenc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra NVENC
+
+description: |
+ NVENC is the hardware video encoder present on NVIDIA Tegra210
+ and newer chips. It is located on the Host1x bus and typically
+ programmed through Host1x channels.
+
+maintainers:
+ - Thierry Reding <treding@gmail.com>
+ - Mikko Perttunen <mperttunen@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^nvenc@[0-9a-f]*$"
+
+ compatible:
+ enum:
+ - nvidia,tegra210-nvenc
+ - nvidia,tegra186-nvenc
+ - nvidia,tegra194-nvenc
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: nvenc
+
+ resets:
+ maxItems: 1
+
+ reset-names:
+ items:
+ - const: nvenc
+
+ power-domains:
+ maxItems: 1
+
+ iommus:
+ maxItems: 1
+
+ dma-coherent: true
+
+ interconnects:
+ minItems: 2
+ maxItems: 3
+
+ interconnect-names:
+ minItems: 2
+ maxItems: 3
+
+ nvidia,host1x-class:
+ description: |
+ Host1x class of the engine, used to specify the targeted engine
+ when programming the engine through Host1x channels or when
+ configuring engine-specific behavior in Host1x.
+ default: 0x21
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+ - power-domains
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ enum:
+ - nvidia,tegra210-nvenc
+ - nvidia,tegra186-nvenc
+ then:
+ properties:
+ interconnects:
+ items:
+ - description: DMA read memory client
+ - description: DMA write memory client
+ interconnect-names:
+ items:
+ - const: dma-mem
+ - const: write
+ - if:
+ properties:
+ compatible:
+ enum:
+ - nvidia,tegra194-nvenc
+ then:
+ properties:
+ interconnects:
+ items:
+ - description: DMA read memory client
+ - description: DMA read 2 memory client
+ - description: DMA write memory client
+ interconnect-names:
+ items:
+ - const: dma-mem
+ - const: read-1
+ - const: write
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra186-clock.h>
+ #include <dt-bindings/memory/tegra186-mc.h>
+ #include <dt-bindings/power/tegra186-powergate.h>
+ #include <dt-bindings/reset/tegra186-reset.h>
+
+ nvenc@154c0000 {
+ compatible = "nvidia,tegra186-nvenc";
+ reg = <0x154c0000 0x40000>;
+ clocks = <&bpmp TEGRA186_CLK_NVENC>;
+ clock-names = "nvenc";
+ resets = <&bpmp TEGRA186_RESET_NVENC>;
+ reset-names = "nvenc";
+
+ power-domains = <&bpmp TEGRA186_POWER_DOMAIN_MPE>;
+ interconnects = <&mc TEGRA186_MEMORY_CLIENT_NVENCSRD &emc>,
+ <&mc TEGRA186_MEMORY_CLIENT_NVENCSWR &emc>;
+ interconnect-names = "dma-mem", "write";
+ iommus = <&smmu TEGRA186_SID_NVENC>;
+ };
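Note: for nvidia,tegra194-nvenc the conditional above adds a third, "read-1",
interconnect. Relative to the Tegra186 example, only the interconnect lines
change, roughly as below (macro names assumed from the tegra194-mc binding
header):

    interconnects = <&mc TEGRA194_MEMORY_CLIENT_NVENCSRD &emc>,
                    <&mc TEGRA194_MEMORY_CLIENT_NVENCSRD1 &emc>,
                    <&mc TEGRA194_MEMORY_CLIENT_NVENCSWR &emc>;
    interconnect-names = "dma-mem", "read-1", "write";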
diff --git a/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvjpg.yaml b/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvjpg.yaml
new file mode 100644
index 000000000000..99a33a5eac3f
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra210-nvjpg.yaml
@@ -0,0 +1,94 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpu/host1x/nvidia,tegra210-nvjpg.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra NVJPG
+
+description: |
+ NVJPG is the hardware JPEG decoder and encoder present on NVIDIA Tegra210
+ and newer chips. It is located on the Host1x bus and typically programmed
+ through Host1x channels.
+
+maintainers:
+ - Thierry Reding <treding@gmail.com>
+ - Mikko Perttunen <mperttunen@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^nvjpg@[0-9a-f]*$"
+
+ compatible:
+ enum:
+ - nvidia,tegra210-nvjpg
+ - nvidia,tegra186-nvjpg
+ - nvidia,tegra194-nvjpg
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ items:
+ - const: nvjpg
+
+ resets:
+ maxItems: 1
+
+ reset-names:
+ items:
+ - const: nvjpg
+
+ power-domains:
+ maxItems: 1
+
+ iommus:
+ maxItems: 1
+
+ dma-coherent: true
+
+ interconnects:
+ items:
+ - description: DMA read memory client
+ - description: DMA write memory client
+
+ interconnect-names:
+ items:
+ - const: dma-mem
+ - const: write
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+ - power-domains
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra186-clock.h>
+ #include <dt-bindings/memory/tegra186-mc.h>
+ #include <dt-bindings/power/tegra186-powergate.h>
+ #include <dt-bindings/reset/tegra186-reset.h>
+
+ nvjpg@15380000 {
+ compatible = "nvidia,tegra186-nvjpg";
+ reg = <0x15380000 0x40000>;
+ clocks = <&bpmp TEGRA186_CLK_NVJPG>;
+ clock-names = "nvjpg";
+ resets = <&bpmp TEGRA186_RESET_NVJPG>;
+ reset-names = "nvjpg";
+
+ power-domains = <&bpmp TEGRA186_POWER_DOMAIN_NVJPG>;
+ interconnects = <&mc TEGRA186_MEMORY_CLIENT_NVJPGSRD &emc>,
+ <&mc TEGRA186_MEMORY_CLIENT_NVJPGSWR &emc>;
+ interconnect-names = "dma-mem", "write";
+ iommus = <&smmu TEGRA186_SID_NVJPG>;
+ };
diff --git a/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra234-nvdec.yaml b/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra234-nvdec.yaml
new file mode 100644
index 000000000000..0b7561c8b9bb
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpu/host1x/nvidia,tegra234-nvdec.yaml
@@ -0,0 +1,156 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpu/host1x/nvidia,tegra234-nvdec.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NVIDIA Tegra234 NVDEC
+
+description: |
+ NVDEC is the hardware video decoder present on NVIDIA Tegra210
+ and newer chips. It is located on the Host1x bus and typically
+ programmed through Host1x channels.
+
+maintainers:
+ - Thierry Reding <treding@gmail.com>
+ - Mikko Perttunen <mperttunen@nvidia.com>
+
+properties:
+ $nodename:
+ pattern: "^nvdec@[0-9a-f]*$"
+
+ compatible:
+ enum:
+ - nvidia,tegra234-nvdec
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ maxItems: 3
+
+ clock-names:
+ items:
+ - const: nvdec
+ - const: fuse
+ - const: tsec_pka
+
+ resets:
+ maxItems: 1
+
+ reset-names:
+ items:
+ - const: nvdec
+
+ power-domains:
+ maxItems: 1
+
+ iommus:
+ maxItems: 1
+
+ dma-coherent: true
+
+ interconnects:
+ items:
+ - description: DMA read memory client
+ - description: DMA write memory client
+
+ interconnect-names:
+ items:
+ - const: dma-mem
+ - const: write
+
+ nvidia,memory-controller:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description:
+      Phandle to the memory controller, used to determine information about
+      the NVDEC firmware secure carveout. This carveout is configured by the
+      bootloader and is not accessible to the CPU.
+
+  nvidia,bl-manifest-offset:
+    $ref: /schemas/types.yaml#/definitions/uint32
+    description:
+      Offset of the bootloader manifest from the beginning of the firmware,
+      as configured by the bootloader.
+
+  nvidia,bl-code-offset:
+    $ref: /schemas/types.yaml#/definitions/uint32
+    description:
+      Offset of the bootloader code section from the beginning of the
+      firmware, as configured by the bootloader.
+
+  nvidia,bl-data-offset:
+    $ref: /schemas/types.yaml#/definitions/uint32
+    description:
+      Offset of the bootloader data section from the beginning of the
+      firmware, as configured by the bootloader.
+
+  nvidia,os-manifest-offset:
+    $ref: /schemas/types.yaml#/definitions/uint32
+    description:
+      Offset of the operating system manifest from the beginning of the
+      firmware, as configured by the bootloader.
+
+  nvidia,os-code-offset:
+    $ref: /schemas/types.yaml#/definitions/uint32
+    description:
+      Offset of the operating system code section from the beginning of the
+      firmware, as configured by the bootloader.
+
+  nvidia,os-data-offset:
+    $ref: /schemas/types.yaml#/definitions/uint32
+    description:
+      Offset of the operating system data section from the beginning of the
+      firmware, as configured by the bootloader.
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+ - power-domains
+ - nvidia,memory-controller
+ - nvidia,bl-manifest-offset
+ - nvidia,bl-code-offset
+ - nvidia,bl-data-offset
+ - nvidia,os-manifest-offset
+ - nvidia,os-code-offset
+ - nvidia,os-data-offset
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/tegra234-clock.h>
+ #include <dt-bindings/memory/tegra234-mc.h>
+ #include <dt-bindings/power/tegra234-powergate.h>
+ #include <dt-bindings/reset/tegra234-reset.h>
+
+ nvdec@15480000 {
+ compatible = "nvidia,tegra234-nvdec";
+ reg = <0x15480000 0x00040000>;
+ clocks = <&bpmp TEGRA234_CLK_NVDEC>,
+ <&bpmp TEGRA234_CLK_FUSE>,
+ <&bpmp TEGRA234_CLK_TSEC_PKA>;
+ clock-names = "nvdec", "fuse", "tsec_pka";
+ resets = <&bpmp TEGRA234_RESET_NVDEC>;
+ reset-names = "nvdec";
+ power-domains = <&bpmp TEGRA234_POWER_DOMAIN_NVDEC>;
+ interconnects = <&mc TEGRA234_MEMORY_CLIENT_NVDECSRD &emc>,
+ <&mc TEGRA234_MEMORY_CLIENT_NVDECSWR &emc>;
+ interconnect-names = "dma-mem", "write";
+ iommus = <&smmu_niso1 TEGRA234_SID_NVDEC>;
+ dma-coherent;
+
+ nvidia,memory-controller = <&mc>;
+
+ /* Placeholder values, to be replaced with values from overlay */
+ nvidia,bl-manifest-offset = <0>;
+ nvidia,bl-data-offset = <0>;
+ nvidia,bl-code-offset = <0>;
+ nvidia,os-manifest-offset = <0>;
+ nvidia,os-data-offset = <0>;
+ nvidia,os-code-offset = <0>;
+ };
diff --git a/Documentation/devicetree/bindings/gpu/samsung-rotator.yaml b/Documentation/devicetree/bindings/gpu/samsung-rotator.yaml
index 62486f55177d..d60626ffb28e 100644
--- a/Documentation/devicetree/bindings/gpu/samsung-rotator.yaml
+++ b/Documentation/devicetree/bindings/gpu/samsung-rotator.yaml
@@ -53,4 +53,3 @@ examples:
clocks = <&clock 278>;
clock-names = "rotator";
};
-
diff --git a/Documentation/devicetree/bindings/gpu/vivante,gc.yaml b/Documentation/devicetree/bindings/gpu/vivante,gc.yaml
index 93e7244cdc0e..b1b10ea70ad9 100644
--- a/Documentation/devicetree/bindings/gpu/vivante,gc.yaml
+++ b/Documentation/devicetree/bindings/gpu/vivante,gc.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/gpu/vivante,gc.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Vivante GPU Bindings
+title: Vivante GPU
description: Vivante GPU core devices
diff --git a/Documentation/devicetree/bindings/h8300/cpu.txt b/Documentation/devicetree/bindings/h8300/cpu.txt
deleted file mode 100644
index 70cd58608f4b..000000000000
--- a/Documentation/devicetree/bindings/h8300/cpu.txt
+++ /dev/null
@@ -1,13 +0,0 @@
-* H8/300 CPU bindings
-
-Required properties:
-
-- compatible: Compatible property value should be "renesas,h8300".
-- clock-frequency: Contains the clock frequency for CPU, in Hz.
-
-Example:
-
- cpu@0 {
- compatible = "renesas,h8300";
- clock-frequency = <20000000>;
- };
diff --git a/Documentation/devicetree/bindings/hwinfo/loongson,ls2k-chipid.yaml b/Documentation/devicetree/bindings/hwinfo/loongson,ls2k-chipid.yaml
new file mode 100644
index 000000000000..9d0c36ec1982
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwinfo/loongson,ls2k-chipid.yaml
@@ -0,0 +1,38 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwinfo/loongson,ls2k-chipid.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Loongson-2 SoC ChipID
+
+maintainers:
+ - Yinbo Zhu <zhuyinbo@loongson.cn>
+
+description: |
+  The Loongson-2 SoC contains many groups of global utilities register
+  blocks, of which the ChipID group registers record SoC version,
+  feature, vendor and ID information.
+
+properties:
+ compatible:
+ const: loongson,ls2k-chipid
+
+ reg:
+ maxItems: 1
+
+ little-endian: true
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ chipid: chipid@1fe00000 {
+ compatible = "loongson,ls2k-chipid";
+ reg = <0x1fe00000 0x3ffc>;
+ little-endian;
+ };
diff --git a/Documentation/devicetree/bindings/arm/renesas,prr.yaml b/Documentation/devicetree/bindings/hwinfo/renesas,prr.yaml
index 1f80767da38b..792f371cec03 100644
--- a/Documentation/devicetree/bindings/arm/renesas,prr.yaml
+++ b/Documentation/devicetree/bindings/hwinfo/renesas,prr.yaml
@@ -1,7 +1,7 @@
-# SPDX-License-Identifier: GPL-2.0
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
-$id: http://devicetree.org/schemas/arm/renesas,prr.yaml#
+$id: http://devicetree.org/schemas/hwinfo/renesas,prr.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Renesas Product Register
diff --git a/Documentation/devicetree/bindings/hwinfo/samsung,exynos-chipid.yaml b/Documentation/devicetree/bindings/hwinfo/samsung,exynos-chipid.yaml
new file mode 100644
index 000000000000..95cbdcb56efe
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwinfo/samsung,exynos-chipid.yaml
@@ -0,0 +1,41 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwinfo/samsung,exynos-chipid.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung Exynos SoC series ChipID
+
+maintainers:
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+properties:
+ compatible:
+ enum:
+ - samsung,exynos4210-chipid
+ - samsung,exynos850-chipid
+
+ reg:
+ maxItems: 1
+
+ samsung,asv-bin:
+ description:
+      Adaptive Supply Voltage bin selection. This can be used to
+      determine the ASV bin of an SoC if the respective information is
+      missing from the CHIPID registers or the OTP memory.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [0, 1, 2, 3]
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ chipid@10000000 {
+ compatible = "samsung,exynos4210-chipid";
+ reg = <0x10000000 0x100>;
+ samsung,asv-bin = <2>;
+ };
diff --git a/Documentation/devicetree/bindings/hwinfo/samsung,s5pv210-chipid.yaml b/Documentation/devicetree/bindings/hwinfo/samsung,s5pv210-chipid.yaml
new file mode 100644
index 000000000000..563ded4fca83
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwinfo/samsung,s5pv210-chipid.yaml
@@ -0,0 +1,30 @@
+# SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwinfo/samsung,s5pv210-chipid.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Samsung S5PV210 SoC ChipID
+
+maintainers:
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+properties:
+ compatible:
+ const: samsung,s5pv210-chipid
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ chipid@e0000000 {
+ compatible = "samsung,s5pv210-chipid";
+ reg = <0xe0000000 0x1000>;
+ };
diff --git a/Documentation/devicetree/bindings/hwinfo/ti,k3-socinfo.yaml b/Documentation/devicetree/bindings/hwinfo/ti,k3-socinfo.yaml
new file mode 100644
index 000000000000..dada28b47ea0
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwinfo/ti,k3-socinfo.yaml
@@ -0,0 +1,40 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwinfo/ti,k3-socinfo.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Texas Instruments K3 Multicore SoC platforms chipid module
+
+maintainers:
+ - Tero Kristo <t-kristo@ti.com>
+ - Nishanth Menon <nm@ti.com>
+
+description: |
+  On Texas Instruments (ARM64) K3 Multicore SoC platforms, the chipid
+  module is represented by the CTRLMMR_xxx_JTAGID register, which contains
+  information about the SoC ID and revision.
+
+properties:
+ $nodename:
+ pattern: "^chipid@[0-9a-f]+$"
+
+ compatible:
+ items:
+ - const: ti,am654-chipid
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ chipid@43000014 {
+ compatible = "ti,am654-chipid";
+ reg = <0x43000014 0x4>;
+ };
diff --git a/Documentation/devicetree/bindings/hwlock/allwinner,sun6i-a31-hwspinlock.yaml b/Documentation/devicetree/bindings/hwlock/allwinner,sun6i-a31-hwspinlock.yaml
index 10e5a53e447b..38478dad8b25 100644
--- a/Documentation/devicetree/bindings/hwlock/allwinner,sun6i-a31-hwspinlock.yaml
+++ b/Documentation/devicetree/bindings/hwlock/allwinner,sun6i-a31-hwspinlock.yaml
@@ -26,11 +26,15 @@ properties:
resets:
maxItems: 1
+ '#hwlock-cells':
+ const: 1
+
required:
- compatible
- reg
- clocks
- resets
+ - "#hwlock-cells"
additionalProperties: false
@@ -44,5 +48,6 @@ examples:
reg = <0x01c18000 0x1000>;
clocks = <&ccu CLK_BUS_SPINLOCK>;
resets = <&ccu RST_BUS_SPINLOCK>;
+ #hwlock-cells = <1>;
};
...
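Note: consumers then reference a lock through the single cell declared by
#hwlock-cells; a sketch with a hypothetical consumer node and label:

    some-device {
        /* claim hardware spinlock number 2 */
        hwlocks = <&hwspinlock 2>;
    };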
diff --git a/Documentation/devicetree/bindings/hwlock/qcom-hwspinlock.yaml b/Documentation/devicetree/bindings/hwlock/qcom-hwspinlock.yaml
index 1c7149f7d171..ee2726149cf3 100644
--- a/Documentation/devicetree/bindings/hwlock/qcom-hwspinlock.yaml
+++ b/Documentation/devicetree/bindings/hwlock/qcom-hwspinlock.yaml
@@ -15,9 +15,22 @@ description:
properties:
compatible:
- enum:
- - qcom,sfpb-mutex
- - qcom,tcsr-mutex
+ oneOf:
+ - enum:
+ - qcom,sfpb-mutex
+ - qcom,tcsr-mutex
+ - items:
+ - enum:
+ - qcom,apq8084-tcsr-mutex
+ - qcom,ipq6018-tcsr-mutex
+ - qcom,msm8226-tcsr-mutex
+ - qcom,msm8994-tcsr-mutex
+ - const: qcom,tcsr-mutex
+ - items:
+ - enum:
+ - qcom,msm8974-tcsr-mutex
+ - const: qcom,tcsr-mutex
+ - const: syscon
reg:
maxItems: 1
@@ -34,9 +47,9 @@ additionalProperties: false
examples:
- |
- tcsr_mutex: hwlock@1f40000 {
- compatible = "qcom,tcsr-mutex";
- reg = <0x01f40000 0x40000>;
- #hwlock-cells = <1>;
- };
+ hwlock@1f40000 {
+ compatible = "qcom,tcsr-mutex";
+ reg = <0x01f40000 0x40000>;
+ #hwlock-cells = <1>;
+ };
...
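Note: among the new entries, only qcom,msm8974-tcsr-mutex also requires the
syscon compatible, so a node for it chains three strings; the register
address in this sketch is hypothetical:

    hwlock@fd484000 {
        compatible = "qcom,msm8974-tcsr-mutex", "qcom,tcsr-mutex", "syscon";
        reg = <0xfd484000 0x1000>;
        #hwlock-cells = <1>;
    };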
diff --git a/Documentation/devicetree/bindings/hwlock/st,stm32-hwspinlock.yaml b/Documentation/devicetree/bindings/hwlock/st,stm32-hwspinlock.yaml
index 47cf9c8d97e9..829d1fdf4c67 100644
--- a/Documentation/devicetree/bindings/hwlock/st,stm32-hwspinlock.yaml
+++ b/Documentation/devicetree/bindings/hwlock/st,stm32-hwspinlock.yaml
@@ -4,11 +4,10 @@
$id: http://devicetree.org/schemas/hwlock/st,stm32-hwspinlock.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: STMicroelectronics STM32 Hardware Spinlock bindings
+title: STMicroelectronics STM32 Hardware Spinlock
maintainers:
- - Benjamin Gaignard <benjamin.gaignard@st.com>
- - Fabien Dessenne <fabien.dessenne@st.com>
+ - Fabien Dessenne <fabien.dessenne@foss.st.com>
properties:
"#hwlock-cells":
diff --git a/Documentation/devicetree/bindings/hwlock/ti,omap-hwspinlock.yaml b/Documentation/devicetree/bindings/hwlock/ti,omap-hwspinlock.yaml
index ae1b37dbee75..0a955c7b9706 100644
--- a/Documentation/devicetree/bindings/hwlock/ti,omap-hwspinlock.yaml
+++ b/Documentation/devicetree/bindings/hwlock/ti,omap-hwspinlock.yaml
@@ -39,39 +39,8 @@ additionalProperties: false
examples:
- |
- /* OMAP4 SoCs */
- hwspinlock: spinlock@4a0f6000 {
+ spinlock@4a0f6000 {
compatible = "ti,omap4-hwspinlock";
reg = <0x4a0f6000 0x1000>;
#hwlock-cells = <1>;
};
-
- - |
- / {
- /* K3 AM65x SoCs */
- model = "Texas Instruments K3 AM654 SoC";
- compatible = "ti,am654-evm", "ti,am654";
- #address-cells = <2>;
- #size-cells = <2>;
-
- bus@100000 {
- compatible = "simple-bus";
- #address-cells = <2>;
- #size-cells = <2>;
- ranges = <0x00 0x00100000 0x00 0x00100000 0x00 0x00020000>, /* ctrl mmr */
- <0x00 0x30800000 0x00 0x30800000 0x00 0x0bc00000>; /* Main NavSS */
-
- bus@30800000 {
- compatible = "simple-mfd";
- #address-cells = <2>;
- #size-cells = <2>;
- ranges = <0x00 0x30800000 0x00 0x30800000 0x00 0x0bc00000>;
-
- spinlock@30e00000 {
- compatible = "ti,am654-hwspinlock";
- reg = <0x00 0x30e00000 0x00 0x1000>;
- #hwlock-cells = <1>;
- };
- };
- };
- };
diff --git a/Documentation/devicetree/bindings/hwmon/adi,adm1177.yaml b/Documentation/devicetree/bindings/hwmon/adi,adm1177.yaml
index 154bee851139..ca2b47320689 100644
--- a/Documentation/devicetree/bindings/hwmon/adi,adm1177.yaml
+++ b/Documentation/devicetree/bindings/hwmon/adi,adm1177.yaml
@@ -8,7 +8,6 @@ title: Analog Devices ADM1177 Hot Swap Controller and Digital Power Monitor
maintainers:
- Michael Hennerich <michael.hennerich@analog.com>
- - Beniamin Bia <beniamin.bia@analog.com>
description: |
Analog Devices ADM1177 Hot Swap Controller and Digital Power Monitor
@@ -53,16 +52,16 @@ examples:
- |
#include <dt-bindings/gpio/gpio.h>
#include <dt-bindings/interrupt-controller/irq.h>
- i2c0 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
pwmon@5a {
- compatible = "adi,adm1177";
- reg = <0x5a>;
- shunt-resistor-micro-ohms = <50000>; /* 50 mOhm */
- adi,shutdown-threshold-microamp = <1059000>; /* 1.059 A */
- adi,vrange-high-enable;
+ compatible = "adi,adm1177";
+ reg = <0x5a>;
+ shunt-resistor-micro-ohms = <50000>; /* 50 mOhm */
+ adi,shutdown-threshold-microamp = <1059000>; /* 1.059 A */
+ adi,vrange-high-enable;
};
};
...
diff --git a/Documentation/devicetree/bindings/hwmon/adi,adm1266.yaml b/Documentation/devicetree/bindings/hwmon/adi,adm1266.yaml
index 43b4f4f57b49..4f8e11bd5142 100644
--- a/Documentation/devicetree/bindings/hwmon/adi,adm1266.yaml
+++ b/Documentation/devicetree/bindings/hwmon/adi,adm1266.yaml
@@ -39,13 +39,13 @@ additionalProperties: false
examples:
- |
- i2c0 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
adm1266@40 {
- compatible = "adi,adm1266";
- reg = <0x40>;
+ compatible = "adi,adm1266";
+ reg = <0x40>;
};
};
...
diff --git a/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml b/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
index 223393d7cafd..ab87f51c5aef 100644
--- a/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
+++ b/Documentation/devicetree/bindings/hwmon/adi,adm1275.yaml
@@ -37,6 +37,72 @@ properties:
description:
Shunt resistor value in micro-Ohm.
+ adi,volt-curr-sample-average:
+ description: |
+ Number of samples to be used to report voltage and current values.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [1, 2, 4, 8, 16, 32, 64, 128]
+
+ adi,power-sample-average:
+ description: |
+ Number of samples to be used to report power values.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [1, 2, 4, 8, 16, 32, 64, 128]
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - adi,adm1075
+ - adi,adm1276
+ then:
+ properties:
+ adi,volt-curr-sample-average:
+ default: 128
+ adi,power-sample-average: false
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - adi,adm1275
+ then:
+ properties:
+ adi,volt-curr-sample-average:
+ default: 16
+ adi,power-sample-average: false
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - adi,adm1272
+ then:
+ properties:
+ adi,volt-curr-sample-average:
+ default: 128
+ adi,power-sample-average:
+ default: 128
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - adi,adm1278
+ - adi,adm1293
+ - adi,adm1294
+ then:
+ properties:
+ adi,volt-curr-sample-average:
+ default: 128
+ adi,power-sample-average:
+ default: 1
+
required:
- compatible
- reg
@@ -53,5 +119,7 @@ examples:
compatible = "adi,adm1272";
reg = <0x10>;
shunt-resistor-micro-ohms = <500>;
+ adi,volt-curr-sample-average = <128>;
+ adi,power-sample-average = <128>;
};
};
diff --git a/Documentation/devicetree/bindings/hwmon/adi,axi-fan-control.yaml b/Documentation/devicetree/bindings/hwmon/adi,axi-fan-control.yaml
index 6747b870f297..0cf3ed6212a6 100644
--- a/Documentation/devicetree/bindings/hwmon/adi,axi-fan-control.yaml
+++ b/Documentation/devicetree/bindings/hwmon/adi,axi-fan-control.yaml
@@ -5,7 +5,7 @@
$id: http://devicetree.org/schemas/hwmon/adi,axi-fan-control.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Analog Devices AXI FAN Control Device Tree Bindings
+title: Analog Devices AXI FAN Control
maintainers:
- Nuno Sá <nuno.sa@analog.com>
@@ -49,15 +49,15 @@ additionalProperties: false
examples:
- |
fpga_axi: fpga-axi {
- #address-cells = <0x2>;
- #size-cells = <0x1>;
-
- axi_fan_control: axi-fan-control@80000000 {
- compatible = "adi,axi-fan-control-1.00.a";
- reg = <0x0 0x80000000 0x10000>;
- clocks = <&clk 71>;
- interrupts = <0 110 0>;
- pulses-per-revolution = <2>;
- };
+ #address-cells = <0x2>;
+ #size-cells = <0x1>;
+
+ axi_fan_control: axi-fan-control@80000000 {
+ compatible = "adi,axi-fan-control-1.00.a";
+ reg = <0x0 0x80000000 0x10000>;
+ clocks = <&clk 71>;
+ interrupts = <0 110 0>;
+ pulses-per-revolution = <2>;
+ };
};
...
diff --git a/Documentation/devicetree/bindings/hwmon/adi,ltc2945.yaml b/Documentation/devicetree/bindings/hwmon/adi,ltc2945.yaml
new file mode 100644
index 000000000000..5cb66e97e816
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/adi,ltc2945.yaml
@@ -0,0 +1,49 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/adi,ltc2945.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Analog Devices LTC2945 wide range I2C power monitor
+
+maintainers:
+ - Guenter Roeck <linux@roeck-us.net>
+
+description: |
+  Analog Devices LTC2945 wide-range power monitor, accessed over I2C.
+
+ https://www.analog.com/media/en/technical-documentation/data-sheets/LTC2945.pdf
+
+properties:
+ compatible:
+ enum:
+ - adi,ltc2945
+
+ reg:
+ maxItems: 1
+
+ shunt-resistor-micro-ohms:
+ description:
+ Shunt resistor value in micro-Ohms
+ default: 1000
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ power-monitor@6e {
+ compatible = "adi,ltc2945";
+ reg = <0x6e>;
+ /* 10 milli-Ohm shunt resistor */
+ shunt-resistor-micro-ohms = <10000>;
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/hwmon/adi,ltc2947.yaml b/Documentation/devicetree/bindings/hwmon/adi,ltc2947.yaml
index bf04151b63d2..152935334c76 100644
--- a/Documentation/devicetree/bindings/hwmon/adi,ltc2947.yaml
+++ b/Documentation/devicetree/bindings/hwmon/adi,ltc2947.yaml
@@ -87,15 +87,15 @@ additionalProperties: false
examples:
- |
spi {
- #address-cells = <1>;
- #size-cells = <0>;
-
- ltc2947_spi: ltc2947@0 {
- compatible = "adi,ltc2947";
- reg = <0>;
- /* accumulation takes place always for energ1/charge1. */
- /* accumulation only on positive current for energy2/charge2. */
- adi,accumulator-ctl-pol = <0 1>;
- };
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ ltc2947_spi: ltc2947@0 {
+ compatible = "adi,ltc2947";
+ reg = <0>;
+ /* accumulation takes place always for energ1/charge1. */
+ /* accumulation only on positive current for energy2/charge2. */
+ adi,accumulator-ctl-pol = <0 1>;
+ };
};
...
diff --git a/Documentation/devicetree/bindings/hwmon/adi,ltc2992.yaml b/Documentation/devicetree/bindings/hwmon/adi,ltc2992.yaml
index 64a8fcb7bc46..b39c632956e8 100644
--- a/Documentation/devicetree/bindings/hwmon/adi,ltc2992.yaml
+++ b/Documentation/devicetree/bindings/hwmon/adi,ltc2992.yaml
@@ -32,6 +32,7 @@ properties:
patternProperties:
"^channel@([0-1])$":
type: object
+ additionalProperties: false
description: |
Represents the two supplies to be monitored.
@@ -55,26 +56,26 @@ additionalProperties: false
examples:
- |
- i2c1 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
- ltc2992@6F {
- #address-cells = <1>;
- #size-cells = <0>;
+ ltc2992@6f {
+ #address-cells = <1>;
+ #size-cells = <0>;
- compatible = "adi,ltc2992";
- reg = <0x6F>;
+ compatible = "adi,ltc2992";
+ reg = <0x6f>;
- channel@0 {
- reg = <0x0>;
- shunt-resistor-micro-ohms = <10000>;
- };
+ channel@0 {
+ reg = <0x0>;
+ shunt-resistor-micro-ohms = <10000>;
+ };
- channel@1 {
- reg = <0x1>;
- shunt-resistor-micro-ohms = <10000>;
- };
+ channel@1 {
+ reg = <0x1>;
+ shunt-resistor-micro-ohms = <10000>;
+ };
};
};
...
diff --git a/Documentation/devicetree/bindings/hwmon/adi,max31760.yaml b/Documentation/devicetree/bindings/hwmon/adi,max31760.yaml
new file mode 100644
index 000000000000..9f2d08d7b978
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/adi,max31760.yaml
@@ -0,0 +1,42 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/adi,max31760.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Analog Devices MAX31760 Fan-Speed Controller
+
+maintainers:
+ - Ibrahim Tilki <Ibrahim.Tilki@analog.com>
+
+description: |
+ Analog Devices MAX31760 Fan-Speed Controller
+ https://datasheets.maximintegrated.com/en/ds/MAX31760.pdf
+
+properties:
+ compatible:
+ enum:
+ - adi,max31760
+
+ reg:
+ description: I2C address of slave device.
+ minimum: 0x50
+ maximum: 0x57
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ fan-controller@50 {
+ reg = <0x50>;
+ compatible = "adi,max31760";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/adt7475.yaml b/Documentation/devicetree/bindings/hwmon/adt7475.yaml
index 7d9c083632b9..051c976ab711 100644
--- a/Documentation/devicetree/bindings/hwmon/adt7475.yaml
+++ b/Documentation/devicetree/bindings/hwmon/adt7475.yaml
@@ -57,10 +57,30 @@ patternProperties:
Configures bypassing the individual voltage input attenuator. If
set to 1 the attenuator is bypassed if set to 0 the attenuator is
not bypassed. If the property is absent then the attenuator
- retains it's configuration from the bios/bootloader.
+ retains its configuration from the bios/bootloader.
$ref: /schemas/types.yaml#/definitions/uint32
enum: [0, 1]
+ "^adi,pin(5|10)-function$":
+ description: |
+      Configures the function for pin 5 on the adi,adt7473 and adi,adt7475,
+      or pin 10 on the adi,adt7476 and adi,adt7490.
+ $ref: /schemas/types.yaml#/definitions/string
+ enum:
+ - pwm2
+ - smbalert#
+
+ "^adi,pin(9|14)-function$":
+ description: |
+      Configures the function for pin 9 on the adi,adt7473 and adi,adt7475,
+      or pin 14 on the adi,adt7476 and adi,adt7490.
+ $ref: /schemas/types.yaml#/definitions/string
+ enum:
+ - tach4
+ - therm#
+ - smbalert#
+ - gpio
+
required:
- compatible
- reg
@@ -79,6 +99,7 @@ examples:
adi,bypass-attenuator-in0 = <1>;
adi,bypass-attenuator-in1 = <0>;
adi,pwm-active-state = <1 0 1>;
+ adi,pin10-function = "smbalert#";
+ adi,pin14-function = "tach4";
};
};
-
diff --git a/Documentation/devicetree/bindings/hwmon/amd,sbrmi.yaml b/Documentation/devicetree/bindings/hwmon/amd,sbrmi.yaml
index 7598b083979c..353d81d89bf5 100644
--- a/Documentation/devicetree/bindings/hwmon/amd,sbrmi.yaml
+++ b/Documentation/devicetree/bindings/hwmon/amd,sbrmi.yaml
@@ -41,13 +41,13 @@ additionalProperties: false
examples:
- |
- i2c0 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
sbrmi@3c {
- compatible = "amd,sbrmi";
- reg = <0x3c>;
+ compatible = "amd,sbrmi";
+ reg = <0x3c>;
};
};
...
diff --git a/Documentation/devicetree/bindings/hwmon/amd,sbtsi.yaml b/Documentation/devicetree/bindings/hwmon/amd,sbtsi.yaml
index 446b09f1ce94..75088244a274 100644
--- a/Documentation/devicetree/bindings/hwmon/amd,sbtsi.yaml
+++ b/Documentation/devicetree/bindings/hwmon/amd,sbtsi.yaml
@@ -42,13 +42,13 @@ additionalProperties: false
examples:
- |
- i2c0 {
+ i2c {
#address-cells = <1>;
#size-cells = <0>;
sbtsi@4c {
- compatible = "amd,sbtsi";
- reg = <0x4c>;
+ compatible = "amd,sbtsi";
+ reg = <0x4c>;
};
};
...
diff --git a/Documentation/devicetree/bindings/hwmon/dps650ab.txt b/Documentation/devicetree/bindings/hwmon/dps650ab.txt
deleted file mode 100644
index 76780e795899..000000000000
--- a/Documentation/devicetree/bindings/hwmon/dps650ab.txt
+++ /dev/null
@@ -1,11 +0,0 @@
-Bindings for Delta Electronics DPS-650-AB power supply
-
-Required properties:
-- compatible : "delta,dps650ab"
-- reg : I2C address, one of 0x58, 0x59.
-
-Example:
- dps650ab@58 {
- compatible = "delta,dps650ab";
- reg = <0x58>;
- };
diff --git a/Documentation/devicetree/bindings/hwmon/hih6130.txt b/Documentation/devicetree/bindings/hwmon/hih6130.txt
deleted file mode 100644
index 2c43837af4c2..000000000000
--- a/Documentation/devicetree/bindings/hwmon/hih6130.txt
+++ /dev/null
@@ -1,12 +0,0 @@
-Honeywell Humidicon HIH-6130 humidity/temperature sensor
---------------------------------------------------------
-
-Requires node properties:
-- compatible : "honeywell,hi6130"
-- reg : the I2C address of the device. This is 0x27.
-
-Example:
- hih6130@27 {
- compatible = "honeywell,hih6130";
- reg = <0x27>;
- };
diff --git a/Documentation/devicetree/bindings/hwmon/hpe,gxp-fan-ctrl.yaml b/Documentation/devicetree/bindings/hwmon/hpe,gxp-fan-ctrl.yaml
new file mode 100644
index 000000000000..4a52aac6be72
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/hpe,gxp-fan-ctrl.yaml
@@ -0,0 +1,45 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/hpe,gxp-fan-ctrl.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: HPE GXP Fan Controller
+
+maintainers:
+ - Nick Hawkins <nick.hawkins@hpe.com>
+
+description: |
+  The HPE GXP fan controller drives the system fans through an external
+  CPLD device.
+
+properties:
+ compatible:
+ const: hpe,gxp-fan-ctrl
+
+ reg:
+ items:
+ - description: Fan controller PWM
+ - description: Programmable logic
+ - description: Function 2
+
+ reg-names:
+ items:
+ - const: base
+ - const: pl
+ - const: fn2
+
+required:
+ - compatible
+ - reg
+ - reg-names
+
+additionalProperties: false
+
+examples:
+ - |
+ fan-controller@1000c00 {
+ compatible = "hpe,gxp-fan-ctrl";
+ reg = <0x1000c00 0x200>, <0xd1000000 0xff>, <0x80200000 0x100000>;
+ reg-names = "base", "pl", "fn2";
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt b/Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt
deleted file mode 100644
index d9a2719f9243..000000000000
--- a/Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt
+++ /dev/null
@@ -1,26 +0,0 @@
-Device-tree bindings for IBM Common Form Factor Power Supply Versions 1 and 2
------------------------------------------------------------------------------
-
-Required properties:
- - compatible : Must be one of the following:
- "ibm,cffps1"
- "ibm,cffps2"
- or "ibm,cffps" if the system
- must support any version of the
- power supply
- - reg = < I2C bus address >; : Address of the power supply on the
- I2C bus.
-
-Example:
-
- i2c-bus@100 {
- #address-cells = <1>;
- #size-cells = <0>;
- #interrupt-cells = <1>;
- < more properties >
-
- power-supply@68 {
- compatible = "ibm,cffps1";
- reg = <0x68>;
- };
- };
diff --git a/Documentation/devicetree/bindings/hwmon/ibm,occ-hwmon.yaml b/Documentation/devicetree/bindings/hwmon/ibm,occ-hwmon.yaml
new file mode 100644
index 000000000000..3dbdc5af2804
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/ibm,occ-hwmon.yaml
@@ -0,0 +1,39 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/ibm,occ-hwmon.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: IBM On-Chip Controller (OCC) accessed from a service processor
+
+maintainers:
+ - Eddie James <eajames@linux.ibm.com>
+
+description: |
+ The POWER processor On-Chip Controller (OCC) helps manage power and
+ thermals for the system. A service processor or baseboard management
+ controller can query the OCC for its power and thermal data to report
+ through hwmon.
+
+properties:
+ compatible:
+ enum:
+ - ibm,p9-occ-hwmon
+ - ibm,p10-occ-hwmon
+
+ ibm,no-poll-on-init:
+ description: When set, the OCC will not be polled during driver
+ initialization.
+ type: boolean
+
+required:
+ - compatible
+
+additionalProperties: false
+
+examples:
+ - |
+ hwmon {
+ compatible = "ibm,p10-occ-hwmon";
+ ibm,no-poll-on-init;
+ };
diff --git a/Documentation/devicetree/bindings/i2c/ibm,p8-occ-hwmon.txt b/Documentation/devicetree/bindings/hwmon/ibm,p8-occ-hwmon.txt
index 5dc5d2e2573d..5dc5d2e2573d 100644
--- a/Documentation/devicetree/bindings/i2c/ibm,p8-occ-hwmon.txt
+++ b/Documentation/devicetree/bindings/hwmon/ibm,p8-occ-hwmon.txt
diff --git a/Documentation/devicetree/bindings/hwmon/iio-hwmon.yaml b/Documentation/devicetree/bindings/hwmon/iio-hwmon.yaml
new file mode 100644
index 000000000000..c54b5986b365
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/iio-hwmon.yaml
@@ -0,0 +1,37 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: "http://devicetree.org/schemas/hwmon/iio-hwmon.yaml#"
+$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+
+title: ADC-attached Hardware Sensor
+
+maintainers:
+ - Jonathan Cameron <jic23@kernel.org>
+
+description: >
+ Bindings for hardware monitoring devices connected to ADC controllers
+ supporting the Industrial I/O bindings.
+
+properties:
+ compatible:
+ const: iio-hwmon
+
+ io-channels:
+ minItems: 1
+ maxItems: 8 # Should be enough
+ description: >
+ List of phandles to ADC channels to read the monitoring values
+
+required:
+ - compatible
+ - io-channels
+
+additionalProperties: false
+
+examples:
+ - |
+ iio-hwmon {
+ compatible = "iio-hwmon";
+ io-channels = <&adc 1>, <&adc 2>;
+ };
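Note: the io-channels phandles above resolve against an IIO provider node that
declares #io-channel-cells. A minimal sketch of the producer/consumer pairing,
assuming a one-cell channel specifier (the provider node and its compatible are
illustrative, not part of this patch):

    adc: adc@48 {
        compatible = "ti,ads1015";      /* illustrative ADC provider */
        reg = <0x48>;
        #io-channel-cells = <1>;        /* one cell selects the channel */
    };

    iio-hwmon {
        compatible = "iio-hwmon";
        io-channels = <&adc 1>, <&adc 2>;  /* channels 1 and 2 become hwmon inputs */
    };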
diff --git a/Documentation/devicetree/bindings/hwmon/jc42.txt b/Documentation/devicetree/bindings/hwmon/jc42.txt
deleted file mode 100644
index f569db58f64a..000000000000
--- a/Documentation/devicetree/bindings/hwmon/jc42.txt
+++ /dev/null
@@ -1,46 +0,0 @@
-Properties for Jedec JC-42.4 compatible temperature sensors
-
-Required properties:
-- compatible: May include a device-specific string consisting of the
- manufacturer and the name of the chip. A list of supported
- chip names follows.
- Must include "jedec,jc-42.4-temp" for any Jedec JC-42.4
- compatible temperature sensor.
-
- Supported chip names:
- adi,adt7408
- atmel,at30ts00
- atmel,at30tse004
- onnn,cat6095
- onnn,cat34ts02
- maxim,max6604
- microchip,mcp9804
- microchip,mcp9805
- microchip,mcp9808
- microchip,mcp98243
- microchip,mcp98244
- microchip,mcp9843
- nxp,se97
- nxp,se98
- st,stts2002
- st,stts2004
- st,stts3000
- st,stts424
- st,stts424e
- idt,tse2002
- idt,tse2004
- idt,ts3000
- idt,ts3001
-
-- reg: I2C address
-
-Optional properties:
-- smbus-timeout-disable: When set, the smbus timeout function will be disabled.
- This is not supported on all chips.
-
-Example:
-
-temp-sensor@1a {
- compatible = "jedec,jc-42.4-temp";
- reg = <0x1a>;
-};
diff --git a/Documentation/devicetree/bindings/hwmon/jedec,jc42.yaml b/Documentation/devicetree/bindings/hwmon/jedec,jc42.yaml
new file mode 100644
index 000000000000..0e49b3901161
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/jedec,jc42.yaml
@@ -0,0 +1,78 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/jedec,jc42.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Jedec JC-42.4 compatible temperature sensors
+
+maintainers:
+ - Jean Delvare <jdelvare@suse.com>
+ - Guenter Roeck <linux@roeck-us.net>
+
+select:
+ properties:
+ compatible:
+ const: jedec,jc-42.4-temp
+
+ required:
+ - compatible
+
+properties:
+ compatible:
+ oneOf:
+ - const: jedec,jc-42.4-temp
+ - items:
+ - enum:
+ - adi,adt7408
+ - atmel,at30ts00
+ - atmel,at30tse004
+ - idt,tse2002
+ - idt,tse2004
+ - idt,ts3000
+ - idt,ts3001
+ - maxim,max6604
+ - microchip,mcp9804
+ - microchip,mcp9805
+ - microchip,mcp9808
+ - microchip,mcp98243
+ - microchip,mcp98244
+ - microchip,mcp9843
+ - nxp,se97
+ - nxp,se97b
+ - nxp,se98
+ - onnn,cat6095
+ - onnn,cat34ts02
+ - st,stts2002
+ - st,stts2004
+ - st,stts3000
+ - st,stts424
+ - st,stts424e
+ - const: jedec,jc-42.4-temp
+
+ reg:
+ maxItems: 1
+
+ smbus-timeout-disable:
+ description: |
+ When set, the smbus timeout function will be disabled. This is not
+ supported on all chips.
+ type: boolean
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ temp-sensor@1a {
+ compatible = "jedec,jc-42.4-temp";
+ reg = <0x1a>;
+ };
+ };
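Note: per the oneOf above, a chip-specific compatible must be followed by the
"jedec,jc-42.4-temp" fallback. A sketch of that two-entry form (chip and
address are illustrative):

    temp-sensor@18 {
        compatible = "microchip,mcp9808", "jedec,jc-42.4-temp";
        reg = <0x18>;
    };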
diff --git a/Documentation/devicetree/bindings/hwmon/lltc,ltc4151.yaml b/Documentation/devicetree/bindings/hwmon/lltc,ltc4151.yaml
new file mode 100644
index 000000000000..b1a4c235376e
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/lltc,ltc4151.yaml
@@ -0,0 +1,41 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/lltc,ltc4151.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: LTC4151 High Voltage I2C Current and Voltage Monitor
+
+maintainers:
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+properties:
+ compatible:
+ const: lltc,ltc4151
+
+ reg:
+ maxItems: 1
+
+ shunt-resistor-micro-ohms:
+ description:
+ Shunt resistor value in micro-Ohms
+ default: 1000
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@6e {
+ compatible = "lltc,ltc4151";
+ reg = <0x6e>;
+ shunt-resistor-micro-ohms = <1500>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/lm70.txt b/Documentation/devicetree/bindings/hwmon/lm70.txt
deleted file mode 100644
index ea417a0d32af..000000000000
--- a/Documentation/devicetree/bindings/hwmon/lm70.txt
+++ /dev/null
@@ -1,22 +0,0 @@
-* LM70/TMP121/LM71/LM74 thermometer.
-
-Required properties:
-- compatible: one of
- "ti,lm70"
- "ti,tmp121"
- "ti,tmp122"
- "ti,lm71"
- "ti,lm74"
-
-See Documentation/devicetree/bindings/spi/spi-bus.txt for more required and
-optional properties.
-
-Example:
-
-spi_master {
- temperature-sensor@0 {
- compatible = "ti,lm70";
- reg = <0>;
- spi-max-frequency = <1000000>;
- };
-};
diff --git a/Documentation/devicetree/bindings/hwmon/lm75.yaml b/Documentation/devicetree/bindings/hwmon/lm75.yaml
index 72980d083c21..8226e3b5d028 100644
--- a/Documentation/devicetree/bindings/hwmon/lm75.yaml
+++ b/Documentation/devicetree/bindings/hwmon/lm75.yaml
@@ -14,6 +14,7 @@ properties:
compatible:
enum:
- adi,adt75
+ - atmel,at30ts74
- dallas,ds1775
- dallas,ds75
- dallas,ds7505
diff --git a/Documentation/devicetree/bindings/hwmon/lm90.txt b/Documentation/devicetree/bindings/hwmon/lm90.txt
deleted file mode 100644
index 398dcb965751..000000000000
--- a/Documentation/devicetree/bindings/hwmon/lm90.txt
+++ /dev/null
@@ -1,51 +0,0 @@
-* LM90 series thermometer.
-
-Required node properties:
-- compatible: manufacturer and chip name, one of
- "adi,adm1032"
- "adi,adt7461"
- "adi,adt7461a"
- "gmt,g781"
- "national,lm90"
- "national,lm86"
- "national,lm89"
- "national,lm99"
- "dallas,max6646"
- "dallas,max6647"
- "dallas,max6649"
- "dallas,max6657"
- "dallas,max6658"
- "dallas,max6659"
- "dallas,max6680"
- "dallas,max6681"
- "dallas,max6695"
- "dallas,max6696"
- "onnn,nct1008"
- "winbond,w83l771"
- "nxp,sa56004"
- "ti,tmp451"
-
-- reg: I2C bus address of the device
-
-- vcc-supply: vcc regulator for the supply voltage.
-
-Optional properties:
-- interrupts: Contains a single interrupt specifier which describes the
- LM90 "-ALERT" pin output.
- See interrupt-controller/interrupts.txt for the format.
-
-- #thermal-sensor-cells: should be set to 1. See thermal/thermal-sensor.yaml
- for details. See <include/dt-bindings/thermal/lm90.h> for the
- definition of the local, remote and 2nd remote sensor index
- constants.
-
-Example LM90 node:
-
-temp-sensor {
- compatible = "onnn,nct1008";
- reg = <0x4c>;
- vcc-supply = <&palmas_ldo6_reg>;
- interrupt-parent = <&gpio>;
- interrupts = <TEGRA_GPIO(O, 4) IRQ_TYPE_LEVEL_LOW>;
- #thermal-sensor-cells = <1>;
-}
diff --git a/Documentation/devicetree/bindings/hwmon/ltc4151.txt b/Documentation/devicetree/bindings/hwmon/ltc4151.txt
deleted file mode 100644
index d008a5ef525a..000000000000
--- a/Documentation/devicetree/bindings/hwmon/ltc4151.txt
+++ /dev/null
@@ -1,18 +0,0 @@
-LTC4151 High Voltage I2C Current and Voltage Monitor
-
-Required properties:
-- compatible: Must be "lltc,ltc4151"
-- reg: I2C address
-
-Optional properties:
-- shunt-resistor-micro-ohms
- Shunt resistor value in micro-Ohms
- Defaults to <1000> if unset.
-
-Example:
-
-ltc4151@6e {
- compatible = "lltc,ltc4151";
- reg = <0x6e>;
- shunt-resistor-micro-ohms = <1500>;
-};
diff --git a/Documentation/devicetree/bindings/hwmon/mcp3021.txt b/Documentation/devicetree/bindings/hwmon/mcp3021.txt
deleted file mode 100644
index 294318ba6914..000000000000
--- a/Documentation/devicetree/bindings/hwmon/mcp3021.txt
+++ /dev/null
@@ -1,21 +0,0 @@
-mcp3021 properties
-
-Required properties:
-- compatible: Must be one of the following:
- - "microchip,mcp3021" for mcp3021
- - "microchip,mcp3221" for mcp3221
-- reg: I2C address
-
-Optional properties:
-
-- reference-voltage-microvolt
- Reference voltage in microvolt (uV)
-
-Example:
-
-mcp3021@4d {
- compatible = "microchip,mcp3021";
- reg = <0x4d>;
-
- reference-voltage-microvolt = <4500000>; /* 4.5 V */
-};
diff --git a/Documentation/devicetree/bindings/hwmon/microchip,lan966x.yaml b/Documentation/devicetree/bindings/hwmon/microchip,lan966x.yaml
new file mode 100644
index 000000000000..390dd6755ff5
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/microchip,lan966x.yaml
@@ -0,0 +1,53 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/microchip,lan966x.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Microchip LAN966x Hardware Monitor
+
+maintainers:
+ - Michael Walle <michael@walle.cc>
+
+description: |
+ Microchip LAN966x temperature monitor and fan controller
+
+properties:
+ compatible:
+ enum:
+ - microchip,lan9668-hwmon
+
+ reg:
+ items:
+ - description: PVT registers
+ - description: FAN registers
+
+ reg-names:
+ items:
+ - const: pvt
+ - const: fan
+
+ clocks:
+ maxItems: 1
+
+ '#thermal-sensor-cells':
+ const: 0
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - clocks
+
+additionalProperties: false
+
+examples:
+ - |
+ hwmon: hwmon@e2010180 {
+ compatible = "microchip,lan9668-hwmon";
+ reg = <0xe2010180 0xc>,
+ <0xe20042a8 0xc>;
+ reg-names = "pvt", "fan";
+ clocks = <&sys_clk>;
+ #thermal-sensor-cells = <0>;
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/microchip,mcp3021.yaml b/Documentation/devicetree/bindings/hwmon/microchip,mcp3021.yaml
new file mode 100644
index 000000000000..028d6e570131
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/microchip,mcp3021.yaml
@@ -0,0 +1,43 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/microchip,mcp3021.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Microchip MCP3021 A/D converter
+
+maintainers:
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+properties:
+ compatible:
+ enum:
+ - microchip,mcp3021
+ - microchip,mcp3221
+
+ reg:
+ maxItems: 1
+
+ reference-voltage-microvolt:
+ description:
+ VDD supply power and reference voltage
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ adc@4d {
+ compatible = "microchip,mcp3021";
+ reg = <0x4d>;
+
+ reference-voltage-microvolt = <4500000>; /* 4.5 V */
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/microchip,sparx5-temp.yaml b/Documentation/devicetree/bindings/hwmon/microchip,sparx5-temp.yaml
index 76be625d5646..51e8619dbf3c 100644
--- a/Documentation/devicetree/bindings/hwmon/microchip,sparx5-temp.yaml
+++ b/Documentation/devicetree/bindings/hwmon/microchip,sparx5-temp.yaml
@@ -22,7 +22,7 @@ properties:
clocks:
items:
- - description: AHB reference clock
+ - description: System reference clock
'#thermal-sensor-cells':
const: 0
@@ -40,5 +40,5 @@ examples:
compatible = "microchip,sparx5-temp";
reg = <0x10508110 0xc>;
#thermal-sensor-cells = <0>;
- clocks = <&ahb_clk>;
+ clocks = <&sys_clk>;
};
diff --git a/Documentation/devicetree/bindings/hwmon/moortec,mr75203.yaml b/Documentation/devicetree/bindings/hwmon/moortec,mr75203.yaml
index b79f069a04c2..ae4f68d4e696 100644
--- a/Documentation/devicetree/bindings/hwmon/moortec,mr75203.yaml
+++ b/Documentation/devicetree/bindings/hwmon/moortec,mr75203.yaml
@@ -4,11 +4,37 @@
$id: http://devicetree.org/schemas/hwmon/moortec,mr75203.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Moortec Semiconductor MR75203 PVT Controller bindings
+title: Moortec Semiconductor MR75203 PVT Controller
maintainers:
- Rahul Tanwar <rtanwar@maxlinear.com>
+description: |
+ A Moortec PVT (Process, Voltage, Temperature) monitoring logic design can
+ include many different units.
+ Such a design usually consists of several of Moortec's embedded analog IPs,
+ and a single Moortec controller (mr75203) to configure and control the IPs.
+
+ Some of Moortec's analog hard IPs that can be used in a design:
+ *) Temperature Sensor (TS) - used to monitor core temperature (e.g. mr74137).
+ *) Voltage Monitor (VM) - used to monitor voltage levels (e.g. mr74138).
+ *) Process Detector (PD) - used to assess silicon speed (e.g. mr74139).
+ *) Delay Chain - ring oscillator connected to the PD, used to measure IO
+ based transistors (e.g. mr76008 ring oscillator at 1.1V, mr76007 ring
+ oscillator at 1.8V).
+ *) Pre Scaler - provides divide-by-X scaling of input voltage, which can then
+ be presented for VM for measurement within its range (e.g. mr76006 -
+ divide by 2 pre-scaler).
+
+ TS, VM & PD also include a digital interface, which consists of configuration
+ inputs and measurement outputs.
+
+ Some of the units come in a number of series; each series can have
+ slightly different characteristics.
+
+ The mr75203 binding describes configuration for the controller unit, but also
+ for some of the analog IPs.
+
properties:
compatible:
const: moortec,mr75203
@@ -44,13 +70,76 @@ properties:
"#thermal-sensor-cells":
const: 1
+ moortec,vm-active-channels:
+ description:
+ Defines the number of channels per VM that are actually used and are
+ connected to some input source.
+ Maximum number of items - number of VMs.
+ Maximum value of each item - number of channels.
+ Minimum value of each item - 0 (which means the entire VM sensor is not used).
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+
+ moortec,vm-pre-scaler-x2:
+ description:
+ Defines the channels that use a mr76006 pre-scaler to divide the input
+ source by 2.
+ The pre-scaler is used for input sources that exceed the VM input range.
+ The driver uses this information to present the actual value of the
+ voltage source to the user.
+ For channels that are not listed, no pre-scaler is assumed.
+ Maximum number of items - total number of channels in all VMs.
+ Each channel should not appear more than once.
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+
+ moortec,ts-series:
+ description:
+ Definition of the temperature equation and coefficients that shall be
+ used to convert the digital output to a value in milli-Celsius.
+ minimum: 5
+ maximum: 6
+ default: 5
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+ moortec,ts-coeff-g:
+ description:
+ G coefficient for temperature equation.
+ Default for series 5 = 60000
+ Default for series 6 = 57400
+ multipleOf: 1000
+ minimum: 1000
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+ moortec,ts-coeff-h:
+ description:
+ H coefficient for temperature equation.
+ Default for series 5 = 200000
+ Default for series 6 = 249400
+ multipleOf: 1000
+ minimum: 1000
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+ moortec,ts-coeff-cal5:
+ description:
+ cal5 coefficient for temperature equation.
+ Default for series 5 = 4094
+ Default for series 6 = 4096
+ minimum: 1
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+ moortec,ts-coeff-j:
+ description:
+ J coefficient for temperature equation.
+ Default for series 5 = -100
+ Default for series 6 = 0
+ multipleOf: 1000
+ maximum: 0
+ $ref: /schemas/types.yaml#/definitions/int32
+
required:
- compatible
- reg
- reg-names
- - intel,vm-map
- clocks
- - resets
- "#thermal-sensor-cells"
additionalProperties: false
@@ -67,5 +156,9 @@ examples:
intel,vm-map = [03 01 04 ff ff];
clocks = <&osc0>;
resets = <&rcu0 0x40 7>;
+ moortec,vm-active-channels = /bits/ 8 <0x10 0x05>;
+ moortec,vm-pre-scaler-x2 = /bits/ 8 <5 6 20>;
+ moortec,ts-coeff-g = <61400>;
+ moortec,ts-coeff-h = <253700>;
#thermal-sensor-cells = <1>;
};
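Note: reading the new example properties against the descriptions above:
/bits/ 8 <0x10 0x05> marks 16 active channels on the first VM and 5 on the
second, and <5 6 20> lists the absolute channel indices sitting behind a
mr76006 divide-by-2 pre-scaler. Annotated for illustration:

    moortec,vm-active-channels = /bits/ 8 <0x10 0x05>; /* VM0: 16 channels, VM1: 5 */
    moortec,vm-pre-scaler-x2 = /bits/ 8 <5 6 20>;      /* channels 5, 6 and 20 are divided by 2 */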
diff --git a/Documentation/devicetree/bindings/hwmon/national,lm90.yaml b/Documentation/devicetree/bindings/hwmon/national,lm90.yaml
new file mode 100644
index 000000000000..7b9d48d6d6da
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/national,lm90.yaml
@@ -0,0 +1,227 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/national,lm90.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: LM90 series thermometer
+
+maintainers:
+ - Jean Delvare <jdelvare@suse.com>
+ - Guenter Roeck <linux@roeck-us.net>
+
+properties:
+ compatible:
+ enum:
+ - adi,adm1032
+ - adi,adt7461
+ - adi,adt7461a
+ - adi,adt7481
+ - dallas,max6646
+ - dallas,max6647
+ - dallas,max6649
+ - dallas,max6657
+ - dallas,max6658
+ - dallas,max6659
+ - dallas,max6680
+ - dallas,max6681
+ - dallas,max6695
+ - dallas,max6696
+ - gmt,g781
+ - national,lm86
+ - national,lm89
+ - national,lm90
+ - national,lm99
+ - nxp,sa56004
+ - onnn,nct1008
+ - ti,tmp451
+ - ti,tmp461
+ - winbond,w83l771
+
+ interrupts:
+ items:
+ - description: |
+ Single interrupt specifier which describes the LM90 "-ALERT" pin
+ output.
+
+ reg:
+ maxItems: 1
+
+ "#thermal-sensor-cells":
+ const: 1
+
+ '#address-cells':
+ const: 1
+
+ '#size-cells':
+ const: 0
+
+ vcc-supply:
+ description: phandle to the regulator that provides the +VCC supply
+
+ ti,extended-range-enable:
+ description: Set to enable extended range temperature.
+ type: boolean
+
+required:
+ - compatible
+ - reg
+
+patternProperties:
+ "^channel@([0-2])$":
+ type: object
+ description: Represents channels of the device and their specific configuration.
+
+ properties:
+ reg:
+ description: The channel number. 0 is local channel, 1-2 are remote channels.
+ items:
+ minimum: 0
+ maximum: 2
+
+ label:
+ description: A descriptive name for this channel, like "ambient" or "psu".
+
+ temperature-offset-millicelsius:
+ description: Temperature offset to be added to or subtracted from remote temperature measurements.
+
+ required:
+ - reg
+
+ additionalProperties: false
+
+allOf:
+ - if:
+ not:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - adi,adt7461
+ - adi,adt7461a
+ - adi,adt7481
+ - ti,tmp451
+ - ti,tmp461
+ then:
+ properties:
+ ti,extended-range-enable: false
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - dallas,max6646
+ - dallas,max6647
+ - dallas,max6649
+ - dallas,max6657
+ - dallas,max6658
+ - dallas,max6659
+ - dallas,max6695
+ - dallas,max6696
+ then:
+ patternProperties:
+ "^channel@([0-2])$":
+ properties:
+ temperature-offset-millicelsius: false
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - adi,adt7461
+ - adi,adt7461a
+ - adi,adt7481
+ - onnn,nct1008
+ then:
+ patternProperties:
+ "^channel@([0-2])$":
+ properties:
+ temperature-offset-millicelsius:
+ maximum: 127750
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - adi,adm1032
+ - dallas,max6680
+ - dallas,max6681
+ - gmt,g781
+ - national,lm86
+ - national,lm89
+ - national,lm90
+ - national,lm99
+ - nxp,sa56004
+ - winbond,w83l771
+ then:
+ patternProperties:
+ "^channel@([0-2])$":
+ properties:
+ temperature-offset-millicelsius:
+ maximum: 127875
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - ti,tmp451
+ - ti,tmp461
+ then:
+ patternProperties:
+ "^channel@([0-2])$":
+ properties:
+ temperature-offset-millicelsius:
+ maximum: 127937
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@4c {
+ compatible = "onnn,nct1008";
+ reg = <0x4c>;
+ vcc-supply = <&palmas_ldo6_reg>;
+ interrupts = <4 IRQ_TYPE_LEVEL_LOW>;
+ #thermal-sensor-cells = <1>;
+ };
+ };
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@4c {
+ compatible = "adi,adt7481";
+ reg = <0x4c>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ channel@0 {
+ reg = <0x0>;
+ label = "local";
+ };
+
+ channel@1 {
+ reg = <0x1>;
+ label = "front";
+ temperature-offset-millicelsius = <4000>;
+ };
+
+ channel@2 {
+ reg = <0x2>;
+ label = "back";
+ temperature-offset-millicelsius = <750>;
+ };
+ };
+ };
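Note: the three temperature-offset-millicelsius maxima in the allOf blocks
track the offset register resolution of each chip family (a reading of the
limits, assuming the usual integer-plus-fraction offset registers):

    127750 mC = 127 + 3/4   degC  (0.25   degC steps)
    127875 mC = 127 + 7/8   degC  (0.125  degC steps)
    127937 mC = 127 + 15/16 degC  (0.0625 degC steps, truncated to whole millidegrees)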
diff --git a/Documentation/devicetree/bindings/hwmon/ntc-thermistor.yaml b/Documentation/devicetree/bindings/hwmon/ntc-thermistor.yaml
new file mode 100644
index 000000000000..3d0146e20d3e
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/ntc-thermistor.yaml
@@ -0,0 +1,142 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/ntc-thermistor.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NTC thermistor temperature sensors
+
+maintainers:
+ - Linus Walleij <linus.walleij@linaro.org>
+
+description: |
+ Thermistors with negative temperature coefficient (NTC) are resistors that
+ vary in resistance in an often non-linear way in relation to temperature.
+ The negative temperature coefficient means that the resistance decreases
+ as the temperature rises. Since the relationship between resistance and
+ temperature is non-linear, software drivers most often need to use a look
+ up table and interpolation to get from resistance to temperature.
+
+ When used in practice, a thermistor is often connected between ground, a
+ pull-up resistor and/or a pull-down resistor, and a fixed voltage like this:
+
+ + e.g. 5V = pull-up voltage (puv)
+ |
+ +-+
+ | |
+ | | Pull-up resistor
+ | | (puo)
+ +-+
+ |-------------------------o
+ +-+ | ^
+ | |/ |
+ | / |
+ |/| Thermistor | Measured voltage (mv)
+ / | | "connected ground"
+ /| | |
+ +-+ |
+ |-------------------------o
+ +-+ ^
+ | | |
+ | | Pull-down resistor | Measured voltage (mv)
+ | | (pdo) | "connected positive"
+ +-+ |
+ | |
+ | v
+ + GND GND
+
+ The arrangements of where we measure the voltage over the thermistor are
+ called "connected ground" and "connected positive" and shall be understood as
+ the cases when either pull-up or pull-down resistance is zero.
+
+ If the pull-up resistance is 0 one end of the thermistor is connected to the
+ positive voltage and we get the thermistor on top of a pull-down resistor
+ and we take the measure between the thermistor and the pull-down resistor.
+
+ Conversely if the pull-down resistance is zero, one end of the thermistor is
+ connected to ground and we get the thermistor under the pull-up resistor
+ and we take the measure between the pull-up resistor and the thermistor.
+
+ We can use both pull-up and pull-down resistors at the same time, and then
+ the figure illustrates where the voltage will be measured for the "connected
+ ground" and "connected positive" cases.
+
+properties:
+ $nodename:
+ pattern: "^thermistor(.*)?$"
+
+ compatible:
+ oneOf:
+ - const: epcos,b57330v2103
+ - const: epcos,b57891s0103
+ - const: murata,ncp15wb473
+ - const: murata,ncp18wb473
+ - const: murata,ncp21wb473
+ - const: murata,ncp03wb473
+ - const: murata,ncp15wl333
+ - const: murata,ncp03wf104
+ - const: murata,ncp15xh103
+ - const: samsung,1404-001221
+ # Deprecated "ntp," compatible strings
+ - const: ntc,ncp15wb473
+ deprecated: true
+ - const: ntc,ncp18wb473
+ deprecated: true
+ - const: ntc,ncp21wb473
+ deprecated: true
+ - const: ntc,ncp03wb473
+ deprecated: true
+ - const: ntc,ncp15wl333
+ deprecated: true
+
+ "#thermal-sensor-cells":
+ description: Thermal sensor cells if used for thermal sensing.
+ const: 0
+
+ pullup-uv:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Pull-up voltage in microvolts. Must always be specified.
+
+ pullup-ohm:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Pull-up resistance in ohms. Must always be specified, even
+ if zero.
+
+ pulldown-ohm:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Pull-down resistance in ohms. Must always be specified, even
+ if zero.
+
+ connected-positive:
+ $ref: /schemas/types.yaml#/definitions/flag
+ description: Indicates how the thermistor is connected in series with
+ a pull-up and/or a pull-down resistor. See the description above for
+ an illustration. If this flag is NOT specified, the thermistor is assumed
+ to be connected-ground, which usually means a pull-down resistance of
+ zero but complex arrangements are possible.
+
+ # See /schemas/iio/adc/adc.yaml
+ io-channels:
+ maxItems: 1
+ description: IIO ADC channel to read the voltage over the resistor. Must
+ always be specified.
+
+required:
+ - compatible
+ - pullup-uv
+ - pullup-ohm
+ - pulldown-ohm
+ - io-channels
+
+additionalProperties: false
+
+examples:
+ - |
+ thermistor {
+ compatible = "murata,ncp18wb473";
+ io-channels = <&gpadc 0x06>;
+ pullup-uv = <1800000>;
+ pullup-ohm = <220000>;
+ pulldown-ohm = <0>;
+ #thermal-sensor-cells = <0>;
+ };
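Note: for the pull-down-of-zero ("connected ground") arrangement assumed by
the example above, the thermistor resistance follows from the voltage
divider. A worked calculation using the example's pullup-uv = 1800000 and
pullup-ohm = 220000, with an assumed ADC reading of 600000 uV:

    mv = puv * Rt / (puo + Rt)
    Rt = puo * mv / (puv - mv)
       = 220000 * 600000 / (1800000 - 600000)
       = 110000 ohms, then mapped to temperature via the chip's R/T table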
diff --git a/Documentation/devicetree/bindings/hwmon/ntc_thermistor.txt b/Documentation/devicetree/bindings/hwmon/ntc_thermistor.txt
deleted file mode 100644
index 4c5c3712970e..000000000000
--- a/Documentation/devicetree/bindings/hwmon/ntc_thermistor.txt
+++ /dev/null
@@ -1,44 +0,0 @@
-NTC Thermistor hwmon sensors
--------------------------------
-
-Requires node properties:
-- "compatible" value : one of
- "epcos,b57330v2103"
- "epcos,b57891s0103"
- "murata,ncp15wb473"
- "murata,ncp18wb473"
- "murata,ncp21wb473"
- "murata,ncp03wb473"
- "murata,ncp15wl333"
- "murata,ncp03wf104"
- "murata,ncp15xh103"
-
-/* Usage of vendor name "ntc" is deprecated */
-<DEPRECATED> "ntc,ncp15wb473"
-<DEPRECATED> "ntc,ncp18wb473"
-<DEPRECATED> "ntc,ncp21wb473"
-<DEPRECATED> "ntc,ncp03wb473"
-<DEPRECATED> "ntc,ncp15wl333"
-
-- "pullup-uv" Pull up voltage in micro volts
-- "pullup-ohm" Pull up resistor value in ohms
-- "pulldown-ohm" Pull down resistor value in ohms
-- "connected-positive" Always ON, If not specified.
- Status change is possible.
-- "io-channels" Channel node of ADC to be used for
- conversion.
-
-Optional node properties:
-- "#thermal-sensor-cells" Used to expose itself to thermal fw.
-
-Read more about iio bindings at
- https://github.com/devicetree-org/dt-schema/blob/master/schemas/iio/
-
-Example:
- ncp15wb473@0 {
- compatible = "murata,ncp15wb473";
- pullup-uv = <1800000>;
- pullup-ohm = <47000>;
- pulldown-ohm = <0>;
- io-channels = <&adc 3>;
- };
diff --git a/Documentation/devicetree/bindings/hwmon/nuvoton,nct6775.yaml b/Documentation/devicetree/bindings/hwmon/nuvoton,nct6775.yaml
new file mode 100644
index 000000000000..358b262431fc
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/nuvoton,nct6775.yaml
@@ -0,0 +1,57 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+
+$id: http://devicetree.org/schemas/hwmon/nuvoton,nct6775.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Nuvoton NCT6775 and compatible Super I/O chips
+
+maintainers:
+ - Zev Weiss <zev@bewilderbeest.net>
+
+properties:
+ compatible:
+ enum:
+ - nuvoton,nct6106
+ - nuvoton,nct6116
+ - nuvoton,nct6775
+ - nuvoton,nct6776
+ - nuvoton,nct6779
+ - nuvoton,nct6791
+ - nuvoton,nct6792
+ - nuvoton,nct6793
+ - nuvoton,nct6795
+ - nuvoton,nct6796
+ - nuvoton,nct6797
+ - nuvoton,nct6798
+
+ reg:
+ maxItems: 1
+
+ nuvoton,tsi-channel-mask:
+ description:
+ Bitmask indicating which TSI temperature sensor channels are
+ active. LSB is TSI0, bit 1 is TSI1, etc.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ maximum: 0xff
+ default: 0
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ superio@4d {
+ compatible = "nuvoton,nct6779";
+ reg = <0x4d>;
+ nuvoton,tsi-channel-mask = <0x03>;
+ };
+ };
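Note: per the LSB-first layout of nuvoton,tsi-channel-mask, the example's
0x03 activates TSI0 and TSI1. Non-adjacent channels work the same way; an
illustrative value:

    nuvoton,tsi-channel-mask = <0x05>;  /* bit 0 = TSI0, bit 2 = TSI2 */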
diff --git a/Documentation/devicetree/bindings/hwmon/nuvoton,nct7802.yaml b/Documentation/devicetree/bindings/hwmon/nuvoton,nct7802.yaml
new file mode 100644
index 000000000000..cd8dcd797031
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/nuvoton,nct7802.yaml
@@ -0,0 +1,145 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+
+$id: http://devicetree.org/schemas/hwmon/nuvoton,nct7802.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Nuvoton NCT7802Y Hardware Monitoring IC
+
+maintainers:
+ - Guenter Roeck <linux@roeck-us.net>
+
+description: |
+ The NCT7802Y is a hardware monitor IC which supports one on-die and up to
+ 5 remote temperature sensors with SMBus interface.
+
+ Datasheets:
+ https://www.nuvoton.com/export/resource-files/Nuvoton_NCT7802Y_Datasheet_V12.pdf
+
+additionalProperties: false
+
+properties:
+ compatible:
+ enum:
+ - nuvoton,nct7802
+
+ reg:
+ maxItems: 1
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+patternProperties:
+ "^channel@[0-3]$":
+ type: object
+
+ additionalProperties: false
+
+ properties:
+ reg:
+ items:
+ - enum:
+ - 0 # Local Temperature Sensor ("LTD")
+ - 1 # Remote Temperature Sensor or Voltage Sensor 1 ("RTD1")
+ - 2 # Remote Temperature Sensor or Voltage Sensor 2 ("RTD2")
+ - 3 # Remote Temperature Sensor or Voltage Sensor 3 ("RTD3")
+
+ sensor-type:
+ items:
+ - enum:
+ - temperature
+ - voltage
+
+ temperature-mode:
+ items:
+ - enum:
+ - thermistor
+ - thermal-diode
+
+ required:
+ - reg
+
+ allOf:
+ # For channels RTD1, RTD2 and RTD3, require sensor-type to be set.
+ # Otherwise (for all other channels), do not allow temperature-mode to be
+ # set.
+ - if:
+ properties:
+ reg:
+ items:
+ - enum:
+ - 1
+ - 2
+ - 3
+ then:
+ required:
+ - sensor-type
+ else:
+ not:
+ required:
+ - sensor-type
+
+ # For channels RTD1 and RTD2 and if sensor-type is "temperature", require
+ # temperature-mode to be set. Otherwise (for all other channels or
+ # sensor-type settings), do not allow temperature-mode to be set
+ - if:
+ properties:
+ reg:
+ items:
+ - enum:
+ - 1
+ - 2
+ sensor-type:
+ items:
+ - enum:
+ - temperature
+ then:
+ required:
+ - temperature-mode
+ else:
+ not:
+ required:
+ - temperature-mode
+
+required:
+ - compatible
+ - reg
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ nct7802@28 {
+ compatible = "nuvoton,nct7802";
+ reg = <0x28>;
+
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ channel@0 { /* LTD */
+ reg = <0>;
+ };
+
+ channel@1 { /* RTD1 */
+ reg = <1>;
+ sensor-type = "voltage";
+ };
+
+ channel@2 { /* RTD2 */
+ reg = <2>;
+ sensor-type = "temperature";
+ temperature-mode = "thermal-diode";
+ };
+
+ channel@3 { /* RTD3 */
+ reg = <3>;
+ sensor-type = "temperature";
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/nxp,mc34vr500.yaml b/Documentation/devicetree/bindings/hwmon/nxp,mc34vr500.yaml
new file mode 100644
index 000000000000..306f67315835
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/nxp,mc34vr500.yaml
@@ -0,0 +1,36 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/nxp,mc34vr500.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NXP MC34VR500 hwmon sensor
+
+maintainers:
+ - Mario Kicherer <dev@kicherer.org>
+
+properties:
+ compatible:
+ enum:
+ - nxp,mc34vr500
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ pmic@8 {
+ compatible = "nxp,mc34vr500";
+ reg = <0x08>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/pmbus/ti,lm25066.yaml b/Documentation/devicetree/bindings/hwmon/pmbus/ti,lm25066.yaml
new file mode 100644
index 000000000000..da8292bc32f5
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/pmbus/ti,lm25066.yaml
@@ -0,0 +1,54 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+
+$id: http://devicetree.org/schemas/hwmon/pmbus/ti,lm25066.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: National Semiconductor/Texas Instruments LM250x6/LM506x power-management ICs
+
+maintainers:
+ - Zev Weiss <zev@bewilderbeest.net>
+
+description: |
+ The LM25066 family of power-management ICs (a.k.a. hot-swap
+ controllers or eFuses in various contexts) consists of PMBus devices
+ that offer temperature, current, voltage, and power monitoring.
+
+ Datasheet: https://www.ti.com/lit/ds/symlink/lm25066.pdf
+
+properties:
+ compatible:
+ enum:
+ - ti,lm25056
+ - ti,lm25066
+ - ti,lm5064
+ - ti,lm5066
+ - ti,lm5066i
+
+ reg:
+ maxItems: 1
+
+ shunt-resistor-micro-ohms:
+ description:
+ Shunt (sense) resistor value in micro-Ohms
+ default: 1000
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ pmic@40 {
+ compatible = "ti,lm25066";
+ reg = <0x40>;
+ shunt-resistor-micro-ohms = <675>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/pwm-fan.txt b/Documentation/devicetree/bindings/hwmon/pwm-fan.txt
index 4509e688623a..48886f0ce415 100644
--- a/Documentation/devicetree/bindings/hwmon/pwm-fan.txt
+++ b/Documentation/devicetree/bindings/hwmon/pwm-fan.txt
@@ -1,67 +1 @@
-Bindings for a fan connected to the PWM lines
-
-Required properties:
-- compatible : "pwm-fan"
-- pwms : the PWM that is used to control the PWM fan
-- cooling-levels : PWM duty cycle values in a range from 0 to 255
- which correspond to thermal cooling states
-
-Optional properties:
-- fan-supply : phandle to the regulator that provides power to the fan
-- interrupts : This contains an interrupt specifier for each fan
- tachometer output connected to an interrupt source.
- The output signal must generate a defined number of
- interrupts per fan revolution, which require that
- it must be self resetting edge interrupts. See
- interrupt-controller/interrupts.txt for the format.
-- pulses-per-revolution : define the number of pulses per fan revolution for
- each tachometer input as an integer (default is 2
- interrupts per revolution). The value must be
- greater than zero.
-
-Example:
- fan0: pwm-fan {
- compatible = "pwm-fan";
- #cooling-cells = <2>;
- pwms = <&pwm 0 10000 0>;
- cooling-levels = <0 102 170 230>;
- };
-
- thermal-zones {
- cpu_thermal: cpu-thermal {
- thermal-sensors = <&tmu 0>;
- polling-delay-passive = <0>;
- polling-delay = <0>;
- trips {
- cpu_alert1: cpu-alert1 {
- temperature = <100000>; /* millicelsius */
- hysteresis = <2000>; /* millicelsius */
- type = "passive";
- };
- };
- cooling-maps {
- map0 {
- trip = <&cpu_alert1>;
- cooling-device = <&fan0 0 1>;
- };
- };
- };
-
-Example 2:
- fan0: pwm-fan {
- compatible = "pwm-fan";
- pwms = <&pwm 0 40000 0>;
- fan-supply = <&reg_fan>;
- interrupt-parent = <&gpio5>;
- interrupts = <1 IRQ_TYPE_EDGE_FALLING>;
- pulses-per-revolution = <2>;
- };
-
-Example 3:
- fan0: pwm-fan {
- compatible = "pwm-fan";
- pwms = <&pwm1 0 25000 0>;
- interrupts-extended = <&gpio1 1 IRQ_TYPE_EDGE_FALLING>,
- <&gpio2 5 IRQ_TYPE_EDGE_FALLING>;
- pulses-per-revolution = <2>, <1>;
- };
+This file has moved to pwm-fan.yaml.
diff --git a/Documentation/devicetree/bindings/hwmon/pwm-fan.yaml b/Documentation/devicetree/bindings/hwmon/pwm-fan.yaml
new file mode 100644
index 000000000000..4e5abf7580cc
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/pwm-fan.yaml
@@ -0,0 +1,97 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/pwm-fan.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Fan connected to PWM lines
+
+maintainers:
+ - Jean Delvare <jdelvare@suse.com>
+ - Guenter Roeck <linux@roeck-us.net>
+
+properties:
+ compatible:
+ const: pwm-fan
+
+ cooling-levels:
+ description: PWM duty cycle values corresponding to thermal cooling states.
+ $ref: /schemas/types.yaml#/definitions/uint32-array
+ items:
+ maximum: 255
+
+ fan-supply:
+ description: Phandle to the regulator that provides power to the fan.
+
+ interrupts:
+ description:
+ This contains an interrupt specifier for each fan tachometer output
+ connected to an interrupt source. The output signal must generate a
+ defined number of interrupts per fan revolution, which requires
+ self-resetting edge interrupts.
+ maxItems: 1
+
+ pulses-per-revolution:
+ description:
+ Define the number of pulses per fan revolution for each tachometer
+ input as an integer.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ minimum: 1
+ maximum: 4
+ default: 2
+
+ pwms:
+ description: The PWM that is used to control the fan.
+ maxItems: 1
+
+ "#cooling-cells": true
+
+required:
+ - compatible
+ - pwms
+
+additionalProperties: false
+
+examples:
+ - |
+ pwm-fan {
+ compatible = "pwm-fan";
+ cooling-levels = <0 102 170 230>;
+ pwms = <&pwm 0 10000 0>;
+ #cooling-cells = <2>;
+ };
+
+ thermal-zones {
+ cpu_thermal: cpu-thermal {
+ thermal-sensors = <&tmu 0>;
+ polling-delay-passive = <0>;
+ polling-delay = <0>;
+
+ trips {
+ cpu_alert1: cpu-alert1 {
+ temperature = <100000>; /* millicelsius */
+ hysteresis = <2000>; /* millicelsius */
+ type = "passive";
+ };
+ };
+
+ cooling-maps {
+ map0 {
+ trip = <&cpu_alert1>;
+ cooling-device = <&fan0 0 1>;
+ };
+ };
+ };
+ };
+
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ pwm-fan {
+ compatible = "pwm-fan";
+ pwms = <&pwm 0 40000 0>;
+ fan-supply = <&reg_fan>;
+ interrupt-parent = <&gpio5>;
+ interrupts = <1 IRQ_TYPE_EDGE_FALLING>;
+ pulses-per-revolution = <2>;
+ };
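Note: cooling-levels maps thermal cooling states to PWM duty cycles out of
255, so the first example's <0 102 170 230> yields four states. The duty
each state requests (arithmetic added for illustration):

    state 0:   0/255 =  0 %  (fan off)
    state 1: 102/255 = 40 %
    state 2: 170/255 = 67 %
    state 3: 230/255 = 90 %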
diff --git a/Documentation/devicetree/bindings/hwmon/sensirion,sht15.yaml b/Documentation/devicetree/bindings/hwmon/sensirion,sht15.yaml
new file mode 100644
index 000000000000..80df7182ea28
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/sensirion,sht15.yaml
@@ -0,0 +1,43 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/sensirion,sht15.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Sensirion SHT15 humidity and temperature sensor
+
+maintainers:
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+properties:
+ compatible:
+ const: sensirion,sht15
+
+ clk-gpios:
+ maxItems: 1
+
+ data-gpios:
+ maxItems: 1
+
+ vcc-supply:
+ description: regulator that drives the VCC pin
+
+required:
+ - compatible
+ - clk-gpios
+ - data-gpios
+ - vcc-supply
+
+additionalProperties: false
+
+examples:
+ - |
+ sensor {
+ compatible = "sensirion,sht15";
+ clk-gpios = <&gpio4 12 0>;
+ data-gpios = <&gpio4 13 0>;
+ vcc-supply = <&reg_sht15>;
+
+ pinctrl-names = "default";
+ pinctrl-0 = <&pinctrl_sensor>;
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/sensirion,shtc1.yaml b/Documentation/devicetree/bindings/hwmon/sensirion,shtc1.yaml
index 7d49478d9668..159238efa9ed 100644
--- a/Documentation/devicetree/bindings/hwmon/sensirion,shtc1.yaml
+++ b/Documentation/devicetree/bindings/hwmon/sensirion,shtc1.yaml
@@ -10,7 +10,7 @@ maintainers:
- Christopher Ruehl chris.ruehl@gtsys.com.hk
description: |
- The SHTC1, SHTW1 and SHTC3 are digital humidity and temperature sensor
+ The SHTC1, SHTW1 and SHTC3 are digital humidity and temperature sensors
designed especially for battery-driven high-volume consumer electronics
applications.
For further information refer to Documentation/hwmon/shtc1.rst
@@ -31,13 +31,13 @@ properties:
sensirion,blocking-io:
$ref: /schemas/types.yaml#/definitions/flag
description:
- If set, the driver hold the i2c bus until measurement is finished.
+ If set, the driver holds the i2c bus until the measurement is finished.
sensirion,low-precision:
$ref: /schemas/types.yaml#/definitions/flag
description:
- If set, the sensor aquire data with low precision (not recommended).
- The driver aquire data with high precision by default.
+ If set, the sensor acquires data with low precision (not recommended).
+ The driver acquires data with high precision by default.
required:
- compatible
diff --git a/Documentation/devicetree/bindings/hwmon/sht15.txt b/Documentation/devicetree/bindings/hwmon/sht15.txt
deleted file mode 100644
index 6a80277cc426..000000000000
--- a/Documentation/devicetree/bindings/hwmon/sht15.txt
+++ /dev/null
@@ -1,19 +0,0 @@
-Sensirion SHT15 Humidity and Temperature Sensor
-
-Required properties:
-
- - "compatible": must be "sensirion,sht15".
- - "data-gpios": GPIO connected to the data line.
- - "clk-gpios": GPIO connected to the clock line.
- - "vcc-supply": regulator that drives the VCC pin.
-
-Example:
-
- sensor {
- pinctrl-names = "default";
- pinctrl-0 = <&pinctrl_sensor>;
- compatible = "sensirion,sht15";
- clk-gpios = <&gpio4 12 0>;
- data-gpios = <&gpio4 13 0>;
- vcc-supply = <&reg_sht15>;
- };
diff --git a/Documentation/devicetree/bindings/hwmon/starfive,jh71x0-temp.yaml b/Documentation/devicetree/bindings/hwmon/starfive,jh71x0-temp.yaml
new file mode 100644
index 000000000000..f5b34528928d
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/starfive,jh71x0-temp.yaml
@@ -0,0 +1,70 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/starfive,jh71x0-temp.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: StarFive JH71x0 Temperature Sensor
+
+maintainers:
+ - Emil Renner Berthing <kernel@esmil.dk>
+
+description: |
+ StarFive Technology Co. JH71x0 embedded temperature sensor
+
+properties:
+ compatible:
+ enum:
+ - starfive,jh7100-temp
+ - starfive,jh7110-temp
+
+ reg:
+ maxItems: 1
+
+ clocks:
+ minItems: 2
+ maxItems: 2
+
+ clock-names:
+ items:
+ - const: "sense"
+ - const: "bus"
+
+ '#thermal-sensor-cells':
+ const: 0
+
+ resets:
+ minItems: 2
+ maxItems: 2
+
+ reset-names:
+ items:
+ - const: "sense"
+ - const: "bus"
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - resets
+ - reset-names
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/starfive-jh7100.h>
+ #include <dt-bindings/reset/starfive-jh7100.h>
+
+ temperature-sensor@124a0000 {
+ compatible = "starfive,jh7100-temp";
+ reg = <0x124a0000 0x10000>;
+ clocks = <&clkgen JH7100_CLK_TEMP_SENSE>,
+ <&clkgen JH7100_CLK_TEMP_APB>;
+ clock-names = "sense", "bus";
+ #thermal-sensor-cells = <0>;
+ resets = <&rstgen JH7100_RSTN_TEMP_SENSE>,
+ <&rstgen JH7100_RSTN_TEMP_APB>;
+ reset-names = "sense", "bus";
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/ti,ina2xx.yaml b/Documentation/devicetree/bindings/hwmon/ti,ina2xx.yaml
index 6f0443322a36..8648877d2d01 100644
--- a/Documentation/devicetree/bindings/hwmon/ti,ina2xx.yaml
+++ b/Documentation/devicetree/bindings/hwmon/ti,ina2xx.yaml
@@ -26,6 +26,7 @@ properties:
- ti,ina226
- ti,ina230
- ti,ina231
+ - ti,ina238
reg:
maxItems: 1
@@ -35,6 +36,31 @@ properties:
Shunt resistor value in micro-Ohm.
$ref: /schemas/types.yaml#/definitions/uint32
+ ti,shunt-gain:
+ description: |
+ Programmable gain divisor for the shunt voltage accuracy and range. This
+ property only applies to devices that have configurable PGA/ADCRANGE. The
+ gain value is used to configure the gain and to convert the shunt voltage,
+ current and power register values when reading measurements from the
+ device.
+
+ For devices that have a configurable PGA (e.g. INA209, INA219, INA220),
+ the gain value maps directly with the PG bits of the config register.
+
+ For devices that have ADCRANGE configuration (e.g. INA238) a shunt-gain
+ value of 1 maps to ADCRANGE=1 where no gain divisor is applied to the
+ shunt voltage, and a value of 4 maps to ADCRANGE=0 such that a wider
+ voltage range is used.
+
+ The default value is device dependent, and is defined by the reset value
+ of PGA/ADCRANGE in the respective configuration registers.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ enum: [1, 2, 4, 8]
+
+ vs-supply:
+ description: phandle to the regulator that provides the VS supply,
+ typically in the range of 2.7 V to 5.5 V.
+
required:
- compatible
- reg
@@ -51,5 +77,6 @@ examples:
compatible = "ti,ina220";
reg = <0x44>;
shunt-resistor = <1000>;
+ vs-supply = <&vdd_3v0>;
};
};
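Note: the new ti,shunt-gain property selects the PGA divisor or, on the
INA238, the ADCRANGE setting described above. A sketch for the newly added
ti,ina238 compatible (node name and values illustrative):

    power-sensor@44 {
        compatible = "ti,ina238";
        reg = <0x44>;
        shunt-resistor = <1000>;
        ti,shunt-gain = <4>;    /* ADCRANGE = 0: divide-by-4, wider shunt voltage range */
        vs-supply = <&vdd_3v0>;
    };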
diff --git a/Documentation/devicetree/bindings/hwmon/ti,tmp102.yaml b/Documentation/devicetree/bindings/hwmon/ti,tmp102.yaml
new file mode 100644
index 000000000000..c5a889e3e27b
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/ti,tmp102.yaml
@@ -0,0 +1,47 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/ti,tmp102.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TMP102 temperature sensor
+
+maintainers:
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+properties:
+ compatible:
+ enum:
+ - ti,tmp102
+
+ interrupts:
+ maxItems: 1
+
+ reg:
+ maxItems: 1
+
+ "#thermal-sensor-cells":
+ const: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@48 {
+ compatible = "ti,tmp102";
+ reg = <0x48>;
+ interrupt-parent = <&gpio7>;
+ interrupts = <16 IRQ_TYPE_LEVEL_LOW>;
+ #thermal-sensor-cells = <1>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/ti,tmp108.yaml b/Documentation/devicetree/bindings/hwmon/ti,tmp108.yaml
new file mode 100644
index 000000000000..dcbc6fbc3b48
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/ti,tmp108.yaml
@@ -0,0 +1,50 @@
+# SPDX-License-Identifier: GPL-2.0-only or BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/ti,tmp108.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TMP108 temperature sensor
+
+maintainers:
+ - Krzysztof Kozlowski <krzk@kernel.org>
+
+properties:
+ compatible:
+ enum:
+ - ti,tmp108
+
+ interrupts:
+ items:
+ - description: alert interrupt
+
+ reg:
+ maxItems: 1
+
+ "#thermal-sensor-cells":
+ const: 0
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/irq.h>
+
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@48 {
+ compatible = "ti,tmp108";
+ reg = <0x48>;
+ interrupt-parent = <&gpio1>;
+ interrupts = <7 IRQ_TYPE_LEVEL_LOW>;
+ pinctrl-names = "default";
+ pinctrl-0 = <&tmp_alrt>;
+ #thermal-sensor-cells = <0>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/ti,tmp401.yaml b/Documentation/devicetree/bindings/hwmon/ti,tmp401.yaml
new file mode 100644
index 000000000000..0e8ddf0ad789
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/ti,tmp401.yaml
@@ -0,0 +1,104 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/ti,tmp401.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TMP401, TMP411 and TMP43x temperature sensor
+
+maintainers:
+ - Guenter Roeck <linux@roeck-us.net>
+
+description: |
+ ±1°C Remote and Local temperature sensor
+
+ Datasheets:
+ https://www.ti.com/lit/ds/symlink/tmp401.pdf
+ https://www.ti.com/lit/ds/symlink/tmp411.pdf
+ https://www.ti.com/lit/ds/symlink/tmp431.pdf
+ https://www.ti.com/lit/ds/symlink/tmp435.pdf
+
+properties:
+ compatible:
+ enum:
+ - ti,tmp401
+ - ti,tmp411
+ - ti,tmp431
+ - ti,tmp432
+ - ti,tmp435
+
+ reg:
+ maxItems: 1
+
+ ti,extended-range-enable:
+ description:
+ When set, this sensor measures over an extended temperature range.
+ type: boolean
+
+ ti,n-factor:
+ description:
+ value to be used for converting remote channel measurements to
+ temperature.
+ $ref: /schemas/types.yaml#/definitions/int32
+ minimum: -128
+ maximum: 127
+
+ ti,beta-compensation:
+ description:
+ value to select beta correction range.
+ $ref: /schemas/types.yaml#/definitions/uint32
+ minimum: 0
+ maximum: 15
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - ti,tmp401
+ then:
+ properties:
+ ti,n-factor: false
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - ti,tmp401
+ - ti,tmp411
+ then:
+ properties:
+ ti,beta-compensation: false
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@4c {
+ compatible = "ti,tmp401";
+ reg = <0x4c>;
+ };
+ };
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@4c {
+ compatible = "ti,tmp431";
+ reg = <0x4c>;
+ ti,extended-range-enable;
+ ti,n-factor = <0x3b>;
+ ti,beta-compensation = <0x7>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/ti,tmp421.yaml b/Documentation/devicetree/bindings/hwmon/ti,tmp421.yaml
new file mode 100644
index 000000000000..a6f1fa75a67c
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/ti,tmp421.yaml
@@ -0,0 +1,109 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/ti,tmp421.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TMP42x/TMP44x temperature sensor
+
+maintainers:
+ - Guenter Roeck <linux@roeck-us.net>
+
+description: |
+ ±1°C Remote and Local temperature sensor
+ https://www.ti.com/lit/ds/symlink/tmp422.pdf
+
+properties:
+ compatible:
+ enum:
+ - ti,tmp421
+ - ti,tmp422
+ - ti,tmp423
+ - ti,tmp441
+ - ti,tmp442
+ reg:
+ maxItems: 1
+
+ '#address-cells':
+ const: 1
+
+ '#size-cells':
+ const: 0
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+patternProperties:
+ "^channel@([0-3])$":
+ type: object
+ description: |
+ Represents channels of the device and their specific configuration.
+
+ properties:
+ reg:
+ description: |
+ The channel number. 0 is local channel, 1-3 are remote channels
+ items:
+ minimum: 0
+ maximum: 3
+
+ label:
+ description: |
+ A descriptive name for this channel, like "ambient" or "psu".
+
+ ti,n-factor:
+ description: |
+ The value (two's complement) to be programmed in the channel specific N correction register.
+ For remote channels only.
+ $ref: /schemas/types.yaml#/definitions/int32
+ minimum: -128
+ maximum: 127
+
+ required:
+ - reg
+
+ additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@4c {
+ compatible = "ti,tmp422";
+ reg = <0x4c>;
+ };
+ };
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@4c {
+ compatible = "ti,tmp422";
+ reg = <0x4c>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ channel@0 {
+ reg = <0x0>;
+ ti,n-factor = <0x1>;
+ label = "local";
+ };
+
+ channel@1 {
+ reg = <0x1>;
+ ti,n-factor = <0x0>;
+ label = "somelabel";
+ };
+
+ channel@2 {
+ reg = <0x2>;
+ status = "disabled";
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/ti,tmp464.yaml b/Documentation/devicetree/bindings/hwmon/ti,tmp464.yaml
new file mode 100644
index 000000000000..f9c00cbb2806
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/ti,tmp464.yaml
@@ -0,0 +1,113 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/hwmon/ti,tmp464.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: TMP464 and TMP468 temperature sensors
+
+maintainers:
+ - Guenter Roeck <linux@roeck-us.net>
+
+description: |
+ ±0.0625°C Remote and Local temperature sensor
+ https://www.ti.com/lit/ds/symlink/tmp464.pdf
+ https://www.ti.com/lit/ds/symlink/tmp468.pdf
+
+properties:
+ compatible:
+ enum:
+ - ti,tmp464
+ - ti,tmp468
+
+ reg:
+ maxItems: 1
+
+ '#address-cells':
+ const: 1
+
+ '#size-cells':
+ const: 0
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+patternProperties:
+ "^channel@([0-8])$":
+ type: object
+ description: |
+ Represents channels of the device and their specific configuration.
+
+ properties:
+ reg:
+ description: |
+ The channel number. 0 is local channel, 1-8 are remote channels.
+ items:
+ minimum: 0
+ maximum: 8
+
+ label:
+ description: |
+ A descriptive name for this channel, like "ambient" or "psu".
+
+ ti,n-factor:
+ description: |
+ The value (two's complement) to be programmed in the channel specific N correction register.
+ For remote channels only.
+ $ref: /schemas/types.yaml#/definitions/int32
+ minimum: -128
+ maximum: 127
+
+ required:
+ - reg
+
+ additionalProperties: false
+
+examples:
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@4b {
+ compatible = "ti,tmp464";
+ reg = <0x4b>;
+ };
+ };
+ - |
+ i2c {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ sensor@4b {
+ compatible = "ti,tmp464";
+ reg = <0x4b>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ channel@0 {
+ reg = <0x0>;
+ label = "local";
+ };
+
+ channel@1 {
+ reg = <0x1>;
+ ti,n-factor = <(-10)>;
+ label = "external";
+ };
+
+ channel@2 {
+ reg = <0x2>;
+ ti,n-factor = <0x10>;
+ label = "somelabel";
+ };
+
+ channel@3 {
+ reg = <0x3>;
+ status = "disabled";
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/ti,tmp513.yaml b/Documentation/devicetree/bindings/hwmon/ti,tmp513.yaml
index 1502b22c77cc..fde5225ce012 100644
--- a/Documentation/devicetree/bindings/hwmon/ti,tmp513.yaml
+++ b/Documentation/devicetree/bindings/hwmon/ti,tmp513.yaml
@@ -77,15 +77,15 @@ additionalProperties: false
examples:
- |
i2c {
- #address-cells = <1>;
- #size-cells = <0>;
-
- tmp513@5c {
- compatible = "ti,tmp513";
- reg = <0x5C>;
- shunt-resistor-micro-ohms = <330000>;
- ti,bus-range-microvolt = <32000000>;
- ti,pga-gain = <8>;
- ti,nfactor = <0x1 0xF3 0x00>;
- };
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ tmp513@5c {
+ compatible = "ti,tmp513";
+ reg = <0x5c>;
+ shunt-resistor-micro-ohms = <330000>;
+ ti,bus-range-microvolt = <32000000>;
+ ti,pga-gain = <8>;
+ ti,nfactor = <0x1 0xf3 0x00>;
+ };
};
diff --git a/Documentation/devicetree/bindings/hwmon/ti,tps23861.yaml b/Documentation/devicetree/bindings/hwmon/ti,tps23861.yaml
index 3bc8e73dfbf0..bce68a326919 100644
--- a/Documentation/devicetree/bindings/hwmon/ti,tps23861.yaml
+++ b/Documentation/devicetree/bindings/hwmon/ti,tps23861.yaml
@@ -40,12 +40,12 @@ additionalProperties: false
examples:
- |
i2c {
- #address-cells = <1>;
- #size-cells = <0>;
-
- tps23861@30 {
- compatible = "ti,tps23861";
- reg = <0x30>;
- shunt-resistor-micro-ohms = <255000>;
- };
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ tps23861@30 {
+ compatible = "ti,tps23861";
+ reg = <0x30>;
+ shunt-resistor-micro-ohms = <255000>;
+ };
};
diff --git a/Documentation/devicetree/bindings/hwmon/tmp108.txt b/Documentation/devicetree/bindings/hwmon/tmp108.txt
deleted file mode 100644
index 54d4beed4ee5..000000000000
--- a/Documentation/devicetree/bindings/hwmon/tmp108.txt
+++ /dev/null
@@ -1,18 +0,0 @@
-TMP108 temperature sensor
--------------------------
-
-This device supports I2C only.
-
-Requires node properties:
-- compatible : "ti,tmp108"
-- reg : the I2C address of the device. This is 0x48, 0x49, 0x4a, or 0x4b.
-
-Optional properties:
-- interrupts: Reference to the TMP108 alert interrupt.
-- #thermal-sensor-cells: should be set to 0.
-
-Example:
- tmp108@48 {
- compatible = "ti,tmp108";
- reg = <0x48>;
- };
diff --git a/Documentation/devicetree/bindings/hwmon/vexpress.txt b/Documentation/devicetree/bindings/hwmon/vexpress.txt
index 9c27ed694bbb..4a4df4ffc460 100644
--- a/Documentation/devicetree/bindings/hwmon/vexpress.txt
+++ b/Documentation/devicetree/bindings/hwmon/vexpress.txt
@@ -9,7 +9,7 @@ Requires node properties:
"arm,vexpress-power"
"arm,vexpress-energy"
- "arm,vexpress-sysreg,func" when controlled via vexpress-sysreg
- (see Documentation/devicetree/bindings/arm/vexpress-sysreg.txt
+ (see Documentation/devicetree/bindings/arm/vexpress-config.yaml
for more details)
Optional node properties:
diff --git a/Documentation/devicetree/bindings/i2c/allwinner,sun6i-a31-p2wi.yaml b/Documentation/devicetree/bindings/i2c/allwinner,sun6i-a31-p2wi.yaml
index 6097e8ac46c1..5a799246a373 100644
--- a/Documentation/devicetree/bindings/i2c/allwinner,sun6i-a31-p2wi.yaml
+++ b/Documentation/devicetree/bindings/i2c/allwinner,sun6i-a31-p2wi.yaml
@@ -4,7 +4,7 @@
$id: http://devicetree.org/schemas/i2c/allwinner,sun6i-a31-p2wi.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
-title: Allwinner A31 P2WI (Push/Pull 2 Wires Interface) Device Tree Bindings
+title: Allwinner A31 P2WI (Push/Pull 2 Wires Interface)
maintainers:
- Chen-Yu Tsai <wens@csie.org>
@@ -55,7 +55,7 @@ examples:
#size-cells = <0>;
axp221: pmic@68 {
- compatible = "x-powers,axp221";
+ /* compatible = "x-powers,axp221"; */
reg = <0x68>;
};
};
diff --git a/Documentation/devicetree/bindings/i2c/amlogic,meson6-i2c.yaml b/Documentation/devicetree/bindings/i2c/amlogic,meson6-i2c.yaml
index 6ecb0270d88d..26bed558c6b8 100644
--- a/Documentation/devicetree/bindings/i2c/amlogic,meson6-i2c.yaml
+++ b/Documentation/devicetree/bindings/i2c/amlogic,meson6-i2c.yaml
@@ -2,13 +2,13 @@
# Copyright 2019 BayLibre, SAS
%YAML 1.2
---
-$id: "http://devicetree.org/schemas/i2c/amlogic,meson6-i2c.yaml#"
-$schema: "http://devicetree.org/meta-schemas/core.yaml#"
+$id: http://devicetree.org/schemas/i2c/amlogic,meson6-i2c.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Amlogic Meson I2C Controller
maintainers:
- - Neil Armstrong <narmstrong@baylibre.com>
+ - Neil Armstrong <neil.armstrong@linaro.org>
- Beniamino Galvani <b.galvani@gmail.com>
allOf:
diff --git a/Documentation/devicetree/bindings/i2c/apple,i2c.yaml b/Documentation/devicetree/bindings/i2c/apple,i2c.yaml
new file mode 100644
index 000000000000..077d2a539c83
--- /dev/null
+++ b/Documentation/devicetree/bindings/i2c/apple,i2c.yaml
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef _ASM_X86_STATFS_H
#define _ASM_X86_STATFS_H
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
new file mode 100644
index 000000000000..650e3256ea7d
--- /dev/null
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -0,0 +1,252 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI__SVM_H
+#define _UAPI__SVM_H
+
+#define SVM_EXIT_READ_CR0 0x000
+#define SVM_EXIT_READ_CR2 0x002
+#define SVM_EXIT_READ_CR3 0x003
+#define SVM_EXIT_READ_CR4 0x004
+#define SVM_EXIT_READ_CR8 0x008
+#define SVM_EXIT_WRITE_CR0 0x010
+#define SVM_EXIT_WRITE_CR2 0x012
+#define SVM_EXIT_WRITE_CR3 0x013
+#define SVM_EXIT_WRITE_CR4 0x014
+#define SVM_EXIT_WRITE_CR8 0x018
+#define SVM_EXIT_READ_DR0 0x020
+#define SVM_EXIT_READ_DR1 0x021
+#define SVM_EXIT_READ_DR2 0x022
+#define SVM_EXIT_READ_DR3 0x023
+#define SVM_EXIT_READ_DR4 0x024
+#define SVM_EXIT_READ_DR5 0x025
+#define SVM_EXIT_READ_DR6 0x026
+#define SVM_EXIT_READ_DR7 0x027
+#define SVM_EXIT_WRITE_DR0 0x030
+#define SVM_EXIT_WRITE_DR1 0x031
+#define SVM_EXIT_WRITE_DR2 0x032
+#define SVM_EXIT_WRITE_DR3 0x033
+#define SVM_EXIT_WRITE_DR4 0x034
+#define SVM_EXIT_WRITE_DR5 0x035
+#define SVM_EXIT_WRITE_DR6 0x036
+#define SVM_EXIT_WRITE_DR7 0x037
+#define SVM_EXIT_EXCP_BASE 0x040
+#define SVM_EXIT_LAST_EXCP 0x05f
+#define SVM_EXIT_INTR 0x060
+#define SVM_EXIT_NMI 0x061
+#define SVM_EXIT_SMI 0x062
+#define SVM_EXIT_INIT 0x063
+#define SVM_EXIT_VINTR 0x064
+#define SVM_EXIT_CR0_SEL_WRITE 0x065
+#define SVM_EXIT_IDTR_READ 0x066
+#define SVM_EXIT_GDTR_READ 0x067
+#define SVM_EXIT_LDTR_READ 0x068
+#define SVM_EXIT_TR_READ 0x069
+#define SVM_EXIT_IDTR_WRITE 0x06a
+#define SVM_EXIT_GDTR_WRITE 0x06b
+#define SVM_EXIT_LDTR_WRITE 0x06c
+#define SVM_EXIT_TR_WRITE 0x06d
+#define SVM_EXIT_RDTSC 0x06e
+#define SVM_EXIT_RDPMC 0x06f
+#define SVM_EXIT_PUSHF 0x070
+#define SVM_EXIT_POPF 0x071
+#define SVM_EXIT_CPUID 0x072
+#define SVM_EXIT_RSM 0x073
+#define SVM_EXIT_IRET 0x074
+#define SVM_EXIT_SWINT 0x075
+#define SVM_EXIT_INVD 0x076
+#define SVM_EXIT_PAUSE 0x077
+#define SVM_EXIT_HLT 0x078
+#define SVM_EXIT_INVLPG 0x079
+#define SVM_EXIT_INVLPGA 0x07a
+#define SVM_EXIT_IOIO 0x07b
+#define SVM_EXIT_MSR 0x07c
+#define SVM_EXIT_TASK_SWITCH 0x07d
+#define SVM_EXIT_FERR_FREEZE 0x07e
+#define SVM_EXIT_SHUTDOWN 0x07f
+#define SVM_EXIT_VMRUN 0x080
+#define SVM_EXIT_VMMCALL 0x081
+#define SVM_EXIT_VMLOAD 0x082
+#define SVM_EXIT_VMSAVE 0x083
+#define SVM_EXIT_STGI 0x084
+#define SVM_EXIT_CLGI 0x085
+#define SVM_EXIT_SKINIT 0x086
+#define SVM_EXIT_RDTSCP 0x087
+#define SVM_EXIT_ICEBP 0x088
+#define SVM_EXIT_WBINVD 0x089
+#define SVM_EXIT_MONITOR 0x08a
+#define SVM_EXIT_MWAIT 0x08b
+#define SVM_EXIT_MWAIT_COND 0x08c
+#define SVM_EXIT_XSETBV 0x08d
+#define SVM_EXIT_RDPRU 0x08e
+#define SVM_EXIT_EFER_WRITE_TRAP 0x08f
+#define SVM_EXIT_CR0_WRITE_TRAP 0x090
+#define SVM_EXIT_CR1_WRITE_TRAP 0x091
+#define SVM_EXIT_CR2_WRITE_TRAP 0x092
+#define SVM_EXIT_CR3_WRITE_TRAP 0x093
+#define SVM_EXIT_CR4_WRITE_TRAP 0x094
+#define SVM_EXIT_CR5_WRITE_TRAP 0x095
+#define SVM_EXIT_CR6_WRITE_TRAP 0x096
+#define SVM_EXIT_CR7_WRITE_TRAP 0x097
+#define SVM_EXIT_CR8_WRITE_TRAP 0x098
+#define SVM_EXIT_CR9_WRITE_TRAP 0x099
+#define SVM_EXIT_CR10_WRITE_TRAP 0x09a
+#define SVM_EXIT_CR11_WRITE_TRAP 0x09b
+#define SVM_EXIT_CR12_WRITE_TRAP 0x09c
+#define SVM_EXIT_CR13_WRITE_TRAP 0x09d
+#define SVM_EXIT_CR14_WRITE_TRAP 0x09e
+#define SVM_EXIT_CR15_WRITE_TRAP 0x09f
+#define SVM_EXIT_INVPCID 0x0a2
+#define SVM_EXIT_BUS_LOCK 0x0a5
+#define SVM_EXIT_IDLE_HLT 0x0a6
+#define SVM_EXIT_NPF 0x400
+#define SVM_EXIT_AVIC_INCOMPLETE_IPI 0x401
+#define SVM_EXIT_AVIC_UNACCELERATED_ACCESS 0x402
+#define SVM_EXIT_VMGEXIT 0x403
+
+/* SEV-ES software-defined VMGEXIT events */
+#define SVM_VMGEXIT_MMIO_READ 0x80000001
+#define SVM_VMGEXIT_MMIO_WRITE 0x80000002
+#define SVM_VMGEXIT_NMI_COMPLETE 0x80000003
+#define SVM_VMGEXIT_AP_HLT_LOOP 0x80000004
+#define SVM_VMGEXIT_AP_JUMP_TABLE 0x80000005
+#define SVM_VMGEXIT_SET_AP_JUMP_TABLE 0
+#define SVM_VMGEXIT_GET_AP_JUMP_TABLE 1
+#define SVM_VMGEXIT_PSC 0x80000010
+#define SVM_VMGEXIT_GUEST_REQUEST 0x80000011
+#define SVM_VMGEXIT_EXT_GUEST_REQUEST 0x80000012
+#define SVM_VMGEXIT_AP_CREATION 0x80000013
+#define SVM_VMGEXIT_AP_CREATE_ON_INIT 0
+#define SVM_VMGEXIT_AP_CREATE 1
+#define SVM_VMGEXIT_AP_DESTROY 2
+#define SVM_VMGEXIT_SNP_RUN_VMPL 0x80000018
+#define SVM_VMGEXIT_SAVIC 0x8000001a
+#define SVM_VMGEXIT_SAVIC_REGISTER_GPA 0
+#define SVM_VMGEXIT_SAVIC_UNREGISTER_GPA 1
+#define SVM_VMGEXIT_SAVIC_SELF_GPA ~0ULL
+#define SVM_VMGEXIT_HV_FEATURES 0x8000fffd
+#define SVM_VMGEXIT_TERM_REQUEST 0x8000fffe
+#define SVM_VMGEXIT_TERM_REASON(reason_set, reason_code) \
+ /* SW_EXITINFO1[3:0] */ \
+ (((((u64)reason_set) & 0xf)) | \
+ /* SW_EXITINFO1[11:4] */ \
+ ((((u64)reason_code) & 0xff) << 4))
+#define SVM_VMGEXIT_UNSUPPORTED_EVENT 0x8000ffff
+
+/* Exit code reserved for hypervisor/software use */
+#define SVM_EXIT_SW 0xf0000000
+
+#define SVM_EXIT_ERR -1
+
+#define SVM_EXIT_REASONS \
+ { SVM_EXIT_READ_CR0, "read_cr0" }, \
+ { SVM_EXIT_READ_CR2, "read_cr2" }, \
+ { SVM_EXIT_READ_CR3, "read_cr3" }, \
+ { SVM_EXIT_READ_CR4, "read_cr4" }, \
+ { SVM_EXIT_READ_CR8, "read_cr8" }, \
+ { SVM_EXIT_WRITE_CR0, "write_cr0" }, \
+ { SVM_EXIT_WRITE_CR2, "write_cr2" }, \
+ { SVM_EXIT_WRITE_CR3, "write_cr3" }, \
+ { SVM_EXIT_WRITE_CR4, "write_cr4" }, \
+ { SVM_EXIT_WRITE_CR8, "write_cr8" }, \
+ { SVM_EXIT_READ_DR0, "read_dr0" }, \
+ { SVM_EXIT_READ_DR1, "read_dr1" }, \
+ { SVM_EXIT_READ_DR2, "read_dr2" }, \
+ { SVM_EXIT_READ_DR3, "read_dr3" }, \
+ { SVM_EXIT_READ_DR4, "read_dr4" }, \
+ { SVM_EXIT_READ_DR5, "read_dr5" }, \
+ { SVM_EXIT_READ_DR6, "read_dr6" }, \
+ { SVM_EXIT_READ_DR7, "read_dr7" }, \
+ { SVM_EXIT_WRITE_DR0, "write_dr0" }, \
+ { SVM_EXIT_WRITE_DR1, "write_dr1" }, \
+ { SVM_EXIT_WRITE_DR2, "write_dr2" }, \
+ { SVM_EXIT_WRITE_DR3, "write_dr3" }, \
+ { SVM_EXIT_WRITE_DR4, "write_dr4" }, \
+ { SVM_EXIT_WRITE_DR5, "write_dr5" }, \
+ { SVM_EXIT_WRITE_DR6, "write_dr6" }, \
+ { SVM_EXIT_WRITE_DR7, "write_dr7" }, \
+ { SVM_EXIT_EXCP_BASE + DE_VECTOR, "DE excp" }, \
+ { SVM_EXIT_EXCP_BASE + DB_VECTOR, "DB excp" }, \
+ { SVM_EXIT_EXCP_BASE + BP_VECTOR, "BP excp" }, \
+ { SVM_EXIT_EXCP_BASE + OF_VECTOR, "OF excp" }, \
+ { SVM_EXIT_EXCP_BASE + BR_VECTOR, "BR excp" }, \
+ { SVM_EXIT_EXCP_BASE + UD_VECTOR, "UD excp" }, \
+ { SVM_EXIT_EXCP_BASE + NM_VECTOR, "NM excp" }, \
+ { SVM_EXIT_EXCP_BASE + DF_VECTOR, "DF excp" }, \
+ { SVM_EXIT_EXCP_BASE + TS_VECTOR, "TS excp" }, \
+ { SVM_EXIT_EXCP_BASE + NP_VECTOR, "NP excp" }, \
+ { SVM_EXIT_EXCP_BASE + SS_VECTOR, "SS excp" }, \
+ { SVM_EXIT_EXCP_BASE + GP_VECTOR, "GP excp" }, \
+ { SVM_EXIT_EXCP_BASE + PF_VECTOR, "PF excp" }, \
+ { SVM_EXIT_EXCP_BASE + MF_VECTOR, "MF excp" }, \
+ { SVM_EXIT_EXCP_BASE + AC_VECTOR, "AC excp" }, \
+ { SVM_EXIT_EXCP_BASE + MC_VECTOR, "MC excp" }, \
+ { SVM_EXIT_EXCP_BASE + XM_VECTOR, "XF excp" }, \
+ { SVM_EXIT_INTR, "interrupt" }, \
+ { SVM_EXIT_NMI, "nmi" }, \
+ { SVM_EXIT_SMI, "smi" }, \
+ { SVM_EXIT_INIT, "init" }, \
+ { SVM_EXIT_VINTR, "vintr" }, \
+ { SVM_EXIT_CR0_SEL_WRITE, "cr0_sel_write" }, \
+ { SVM_EXIT_IDTR_READ, "read_idtr" }, \
+ { SVM_EXIT_GDTR_READ, "read_gdtr" }, \
+ { SVM_EXIT_LDTR_READ, "read_ldtr" }, \
+ { SVM_EXIT_TR_READ, "read_rt" }, \
+ { SVM_EXIT_IDTR_WRITE, "write_idtr" }, \
+ { SVM_EXIT_GDTR_WRITE, "write_gdtr" }, \
+ { SVM_EXIT_LDTR_WRITE, "write_ldtr" }, \
+ { SVM_EXIT_TR_WRITE, "write_rt" }, \
+ { SVM_EXIT_RDTSC, "rdtsc" }, \
+ { SVM_EXIT_RDPMC, "rdpmc" }, \
+ { SVM_EXIT_PUSHF, "pushf" }, \
+ { SVM_EXIT_POPF, "popf" }, \
+ { SVM_EXIT_CPUID, "cpuid" }, \
+ { SVM_EXIT_RSM, "rsm" }, \
+ { SVM_EXIT_IRET, "iret" }, \
+ { SVM_EXIT_SWINT, "swint" }, \
+ { SVM_EXIT_INVD, "invd" }, \
+ { SVM_EXIT_PAUSE, "pause" }, \
+ { SVM_EXIT_HLT, "hlt" }, \
+ { SVM_EXIT_INVLPG, "invlpg" }, \
+ { SVM_EXIT_INVLPGA, "invlpga" }, \
+ { SVM_EXIT_IOIO, "io" }, \
+ { SVM_EXIT_MSR, "msr" }, \
+ { SVM_EXIT_TASK_SWITCH, "task_switch" }, \
+ { SVM_EXIT_FERR_FREEZE, "ferr_freeze" }, \
+ { SVM_EXIT_SHUTDOWN, "shutdown" }, \
+ { SVM_EXIT_VMRUN, "vmrun" }, \
+ { SVM_EXIT_VMMCALL, "hypercall" }, \
+ { SVM_EXIT_VMLOAD, "vmload" }, \
+ { SVM_EXIT_VMSAVE, "vmsave" }, \
+ { SVM_EXIT_STGI, "stgi" }, \
+ { SVM_EXIT_CLGI, "clgi" }, \
+ { SVM_EXIT_SKINIT, "skinit" }, \
+ { SVM_EXIT_RDTSCP, "rdtscp" }, \
+ { SVM_EXIT_ICEBP, "icebp" }, \
+ { SVM_EXIT_WBINVD, "wbinvd" }, \
+ { SVM_EXIT_MONITOR, "monitor" }, \
+ { SVM_EXIT_MWAIT, "mwait" }, \
+ { SVM_EXIT_XSETBV, "xsetbv" }, \
+ { SVM_EXIT_EFER_WRITE_TRAP, "write_efer_trap" }, \
+ { SVM_EXIT_CR0_WRITE_TRAP, "write_cr0_trap" }, \
+ { SVM_EXIT_CR4_WRITE_TRAP, "write_cr4_trap" }, \
+ { SVM_EXIT_CR8_WRITE_TRAP, "write_cr8_trap" }, \
+ { SVM_EXIT_INVPCID, "invpcid" }, \
+ { SVM_EXIT_BUS_LOCK, "buslock" }, \
+ { SVM_EXIT_IDLE_HLT, "idle-halt" }, \
+ { SVM_EXIT_NPF, "npf" }, \
+ { SVM_EXIT_AVIC_INCOMPLETE_IPI, "avic_incomplete_ipi" }, \
+ { SVM_EXIT_AVIC_UNACCELERATED_ACCESS, "avic_unaccelerated_access" }, \
+ { SVM_EXIT_VMGEXIT, "vmgexit" }, \
+ { SVM_VMGEXIT_MMIO_READ, "vmgexit_mmio_read" }, \
+ { SVM_VMGEXIT_MMIO_WRITE, "vmgexit_mmio_write" }, \
+ { SVM_VMGEXIT_NMI_COMPLETE, "vmgexit_nmi_complete" }, \
+ { SVM_VMGEXIT_AP_HLT_LOOP, "vmgexit_ap_hlt_loop" }, \
+ { SVM_VMGEXIT_AP_JUMP_TABLE, "vmgexit_ap_jump_table" }, \
+ { SVM_VMGEXIT_PSC, "vmgexit_page_state_change" }, \
+ { SVM_VMGEXIT_GUEST_REQUEST, "vmgexit_guest_request" }, \
+ { SVM_VMGEXIT_EXT_GUEST_REQUEST, "vmgexit_ext_guest_request" }, \
+ { SVM_VMGEXIT_AP_CREATION, "vmgexit_ap_creation" }, \
+ { SVM_VMGEXIT_HV_FEATURES, "vmgexit_hypervisor_feature" }, \
+ { SVM_EXIT_ERR, "invalid_guest_state" }
+
+#endif /* _UAPI__SVM_H */
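
Note: SVM_VMGEXIT_TERM_REASON above packs the reason set into SW_EXITINFO1[3:0] and the reason code into bits [11:4]. A standalone sketch of the same packing, with u64 spelled out so it compiles outside the kernel:

#include <stdio.h>

typedef unsigned long long u64;

/* Same bit layout as SVM_VMGEXIT_TERM_REASON: set in [3:0], code in [11:4]. */
#define TERM_REASON(reason_set, reason_code) \
	((((u64)(reason_set)) & 0xf) | ((((u64)(reason_code)) & 0xff) << 4))

int main(void)
{
	/* reason set 1, reason code 2 -> SW_EXITINFO1 = 0x21 */
	printf("%#llx\n", TERM_REASON(1, 2));
	return 0;
}
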
diff --git a/arch/x86/include/asm/swab.h b/arch/x86/include/uapi/asm/swab.h
index 557cd9f00661..cd3fd8ddbe9a 100644
--- a/arch/x86/include/asm/swab.h
+++ b/arch/x86/include/uapi/asm/swab.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef _ASM_X86_SWAB_H
#define _ASM_X86_SWAB_H
@@ -6,22 +7,7 @@
static inline __attribute_const__ __u32 __arch_swab32(__u32 val)
{
-#ifdef __i386__
-# ifdef CONFIG_X86_BSWAP
- asm("bswap %0" : "=r" (val) : "0" (val));
-# else
- asm("xchgb %b0,%h0\n\t" /* swap lower bytes */
- "rorl $16,%0\n\t" /* swap words */
- "xchgb %b0,%h0" /* swap higher bytes */
- : "=q" (val)
- : "0" (val));
-# endif
-
-#else /* __i386__ */
- asm("bswapl %0"
- : "=r" (val)
- : "0" (val));
-#endif
+ asm("bswapl %0" : "=r" (val) : "0" (val));
return val;
}
#define __arch_swab32 __arch_swab32
@@ -37,22 +23,12 @@ static inline __attribute_const__ __u64 __arch_swab64(__u64 val)
__u64 u;
} v;
v.u = val;
-# ifdef CONFIG_X86_BSWAP
asm("bswapl %0 ; bswapl %1 ; xchgl %0,%1"
: "=r" (v.s.a), "=r" (v.s.b)
: "0" (v.s.a), "1" (v.s.b));
-# else
- v.s.a = __arch_swab32(v.s.a);
- v.s.b = __arch_swab32(v.s.b);
- asm("xchgl %0,%1"
- : "=r" (v.s.a), "=r" (v.s.b)
- : "0" (v.s.a), "1" (v.s.b));
-# endif
return v.u;
#else /* __i386__ */
- asm("bswapq %0"
- : "=r" (val)
- : "0" (val));
+ asm("bswapq %0" : "=r" (val) : "0" (val));
return val;
#endif
}
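
Note: the removed CONFIG_X86_BSWAP branches were fallbacks for CPUs without the bswap instruction; with them gone, __arch_swab32() is a single bswapl. A portable reference for what that instruction computes, useful as a sanity check:

#include <stdint.h>
#include <stdio.h>

/* Portable equivalent of the bswapl used by __arch_swab32(). */
static uint32_t swab32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

int main(void)
{
	printf("%#010x\n", swab32(0x12345678u));	/* 0x78563412 */
	return 0;
}
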
diff --git a/arch/x86/include/uapi/asm/ucontext.h b/arch/x86/include/uapi/asm/ucontext.h
new file mode 100644
index 000000000000..5657b7a49f03
--- /dev/null
+++ b/arch/x86/include/uapi/asm/ucontext.h
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _ASM_X86_UCONTEXT_H
+#define _ASM_X86_UCONTEXT_H
+
+/*
+ * Indicates the presence of extended state information in the memory
+ * layout pointed by the fpstate pointer in the ucontext's sigcontext
+ * struct (uc_mcontext).
+ */
+#define UC_FP_XSTATE 0x1
+
+#ifdef __x86_64__
+/*
+ * UC_SIGCONTEXT_SS will be set when delivering 64-bit or x32 signals on
+ * kernels that save SS in the sigcontext. All kernels that set
+ * UC_SIGCONTEXT_SS will correctly restore at least the low 32 bits of esp
+ * regardless of SS (i.e. they implement espfix).
+ *
+ * Kernels that set UC_SIGCONTEXT_SS will also set UC_STRICT_RESTORE_SS
+ * when delivering a signal that came from 64-bit code.
+ *
+ * Sigreturn restores SS as follows:
+ *
+ * if (saved SS is valid || UC_STRICT_RESTORE_SS is set ||
+ * saved CS is not 64-bit)
+ * new SS = saved SS (will fail IRET and signal if invalid)
+ * else
+ * new SS = a flat 32-bit data segment
+ *
+ * This behavior serves three purposes:
+ *
+ * - Legacy programs that construct a 64-bit sigcontext from scratch
+ * with zero or garbage in the SS slot (e.g. old CRIU) and call
+ * sigreturn will still work.
+ *
+ * - Old DOSEMU versions sometimes catch a signal from a segmented
+ * context, delete the old SS segment (with modify_ldt), and change
+ * the saved CS to a 64-bit segment. These DOSEMU versions expect
+ * sigreturn to send them back to 64-bit mode without killing them,
+ * despite the fact that the SS selector when the signal was raised is
+ * no longer valid. UC_STRICT_RESTORE_SS will be clear, so the kernel
+ * will fix up SS for these DOSEMU versions.
+ *
+ * - Old and new programs that catch a signal and return without
+ * modifying the saved context will end up in exactly the state they
+ * started in, even if they were running in a segmented context when
+ * the signal was raised. Old kernels would lose track of the
+ * previous SS value.
+ */
+#define UC_SIGCONTEXT_SS 0x2
+#define UC_STRICT_RESTORE_SS 0x4
+#endif
+
+#include <asm-generic/ucontext.h>
+
+#endif /* _ASM_X86_UCONTEXT_H */
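
Note: a minimal sketch of probing these bits from userspace, assuming glibc's ucontext_t exposes uc_flags on x86; the flag values are duplicated locally so the example stays self-contained:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <ucontext.h>

#ifndef UC_SIGCONTEXT_SS
# define UC_SIGCONTEXT_SS	0x2	/* values from asm/ucontext.h */
# define UC_STRICT_RESTORE_SS	0x4
#endif

static volatile unsigned long seen_flags;

static void handler(int sig, siginfo_t *si, void *ctx)
{
	seen_flags = ((ucontext_t *)ctx)->uc_flags;
}

int main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGUSR1, &sa, NULL);
	raise(SIGUSR1);

	printf("UC_SIGCONTEXT_SS %s, UC_STRICT_RESTORE_SS %s\n",
	       (seen_flags & UC_SIGCONTEXT_SS) ? "set" : "clear",
	       (seen_flags & UC_STRICT_RESTORE_SS) ? "set" : "clear");
	return 0;
}
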
diff --git a/arch/x86/include/uapi/asm/unistd.h b/arch/x86/include/uapi/asm/unistd.h
new file mode 100644
index 000000000000..be5e2e747f50
--- /dev/null
+++ b/arch/x86/include/uapi/asm/unistd.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_X86_UNISTD_H
+#define _UAPI_ASM_X86_UNISTD_H
+
+/*
+ * x32 syscall flag bit. Some user programs expect syscall NR macros
+ * and __X32_SYSCALL_BIT to have type int, even though syscall numbers
+ * are, for practical purposes, unsigned long.
+ *
+ * Fortunately, expressions like (nr & ~__X32_SYSCALL_BIT) do the right
+ * thing regardless.
+ */
+#define __X32_SYSCALL_BIT 0x40000000
+
+#ifndef __KERNEL__
+# ifdef __i386__
+# include <asm/unistd_32.h>
+# elif defined(__ILP32__)
+# include <asm/unistd_x32.h>
+# else
+# include <asm/unistd_64.h>
+# endif
+#endif
+
+#endif /* _UAPI_ASM_X86_UNISTD_H */
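
Note: the header comment's point about int-typed macros can be checked directly: ~__X32_SYSCALL_BIT is a negative int, but the usual arithmetic conversions sign-extend it, so masking an unsigned long syscall number clears only bit 30. A small sketch; the syscall number used is illustrative:

#include <stdio.h>

#define __X32_SYSCALL_BIT 0x40000000	/* int, as in the header */

int main(void)
{
	unsigned long nr = __X32_SYSCALL_BIT + 1;	/* an illustrative x32 nr */

	/* ~__X32_SYSCALL_BIT is int 0xbfffffff (negative); conversion to
	 * unsigned long sign-extends it, so only bit 30 is cleared. */
	printf("%lu\n", nr & ~__X32_SYSCALL_BIT);	/* prints 1 */
	return 0;
}
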
diff --git a/arch/x86/include/uapi/asm/vm86.h b/arch/x86/include/uapi/asm/vm86.h
new file mode 100644
index 000000000000..18909b8050bc
--- /dev/null
+++ b/arch/x86/include/uapi/asm/vm86.h
@@ -0,0 +1,130 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_X86_VM86_H
+#define _UAPI_ASM_X86_VM86_H
+
+/*
+ * I'm guessing at the VIF/VIP flag usage, but hope that this is how
+ * the Pentium uses them. Linux will return from vm86 mode when both
+ * VIF and VIP are set.
+ *
+ * On a Pentium, we could probably optimize the virtual flags directly
+ * in the eflags register instead of doing it "by hand" in vflags...
+ *
+ * Linus
+ */
+
+#include <asm/processor-flags.h>
+
+#define BIOSSEG 0x0f000
+
+#define CPU_086 0
+#define CPU_186 1
+#define CPU_286 2
+#define CPU_386 3
+#define CPU_486 4
+#define CPU_586 5
+
+/*
+ * Return values for the 'vm86()' system call
+ */
+#define VM86_TYPE(retval) ((retval) & 0xff)
+#define VM86_ARG(retval) ((retval) >> 8)
+
+#define VM86_SIGNAL 0 /* return due to signal */
+#define VM86_UNKNOWN 1 /* unhandled GP fault
+ - IO-instruction or similar */
+#define VM86_INTx 2 /* int3/int x instruction (ARG = x) */
+#define VM86_STI 3 /* sti/popf/iret instruction enabled
+ virtual interrupts */
+
+/*
+ * Additional return values when invoking new vm86()
+ */
+#define VM86_PICRETURN 4 /* return due to pending PIC request */
+#define VM86_TRAP 6 /* return due to DOS-debugger request */
+
+/*
+ * function codes when invoking new vm86()
+ */
+#define VM86_PLUS_INSTALL_CHECK 0
+#define VM86_ENTER 1
+#define VM86_ENTER_NO_BYPASS 2
+#define VM86_REQUEST_IRQ 3
+#define VM86_FREE_IRQ 4
+#define VM86_GET_IRQ_BITS 5
+#define VM86_GET_AND_RESET_IRQ 6
+
+/*
+ * This is the stack-layout seen by the user space program when we have
+ * done a translation of "SAVE_ALL" from vm86 mode. The real kernel layout
+ * is 'kernel_vm86_regs' (see below).
+ */
+
+struct vm86_regs {
+/*
+ * normal regs, with special meaning for the segment descriptors..
+ */
+ long ebx;
+ long ecx;
+ long edx;
+ long esi;
+ long edi;
+ long ebp;
+ long eax;
+ long __null_ds;
+ long __null_es;
+ long __null_fs;
+ long __null_gs;
+ long orig_eax;
+ long eip;
+ unsigned short cs, __csh;
+ long eflags;
+ long esp;
+ unsigned short ss, __ssh;
+/*
+ * these are specific to v86 mode:
+ */
+ unsigned short es, __esh;
+ unsigned short ds, __dsh;
+ unsigned short fs, __fsh;
+ unsigned short gs, __gsh;
+};
+
+struct revectored_struct {
+ unsigned long __map[8]; /* 256 bits */
+};
+
+struct vm86_struct {
+ struct vm86_regs regs;
+ unsigned long flags;
+ unsigned long screen_bitmap; /* unused, preserved by vm86() */
+ unsigned long cpu_type;
+ struct revectored_struct int_revectored;
+ struct revectored_struct int21_revectored;
+};
+
+/*
+ * flags masks
+ */
+#define VM86_SCREEN_BITMAP 0x0001 /* no longer supported */
+
+struct vm86plus_info_struct {
+ unsigned long force_return_for_pic:1;
+ unsigned long vm86dbg_active:1; /* for debugger */
+ unsigned long vm86dbg_TFpendig:1; /* for debugger */
+ unsigned long unused:28;
+ unsigned long is_vm86pus:1; /* for vm86 internal use */
+ unsigned char vm86dbg_intxxtab[32]; /* for debugger */
+};
+struct vm86plus_struct {
+ struct vm86_regs regs;
+ unsigned long flags;
+ unsigned long screen_bitmap;
+ unsigned long cpu_type;
+ struct revectored_struct int_revectored;
+ struct revectored_struct int21_revectored;
+ struct vm86plus_info_struct vm86plus;
+};
+
+#endif /* _UAPI_ASM_X86_VM86_H */
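
Note: a vm86() return value carries both an event class and its argument in one int, split by VM86_TYPE()/VM86_ARG(). A sketch of the decoding, with the macros copied from the header; the retval is synthesized rather than obtained from a real vm86() call:

#include <stdio.h>

#define VM86_TYPE(retval) ((retval) & 0xff)
#define VM86_ARG(retval)  ((retval) >> 8)
#define VM86_INTx	2

int main(void)
{
	int retval = (0x21 << 8) | VM86_INTx;	/* as if vm86() reported "int 0x21" */

	if (VM86_TYPE(retval) == VM86_INTx)
		printf("guest executed int 0x%x\n", VM86_ARG(retval));
	return 0;
}
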
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
new file mode 100644
index 000000000000..1baa86dfe029
--- /dev/null
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -0,0 +1,175 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * vmx.h: VMX Architecture related definitions
+ * Copyright (c) 2004, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
+ * Place - Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * A few random additions are:
+ * Copyright (C) 2006 Qumranet
+ * Avi Kivity <avi@qumranet.com>
+ * Yaniv Kamay <yaniv@qumranet.com>
+ *
+ */
+#ifndef _UAPIVMX_H
+#define _UAPIVMX_H
+
+
+#define VMX_EXIT_REASONS_FAILED_VMENTRY 0x80000000
+#define VMX_EXIT_REASONS_SGX_ENCLAVE_MODE 0x08000000
+
+#define EXIT_REASON_EXCEPTION_NMI 0
+#define EXIT_REASON_EXTERNAL_INTERRUPT 1
+#define EXIT_REASON_TRIPLE_FAULT 2
+#define EXIT_REASON_INIT_SIGNAL 3
+#define EXIT_REASON_SIPI_SIGNAL 4
+#define EXIT_REASON_OTHER_SMI 6
+
+#define EXIT_REASON_INTERRUPT_WINDOW 7
+#define EXIT_REASON_NMI_WINDOW 8
+#define EXIT_REASON_TASK_SWITCH 9
+#define EXIT_REASON_CPUID 10
+#define EXIT_REASON_HLT 12
+#define EXIT_REASON_INVD 13
+#define EXIT_REASON_INVLPG 14
+#define EXIT_REASON_RDPMC 15
+#define EXIT_REASON_RDTSC 16
+#define EXIT_REASON_VMCALL 18
+#define EXIT_REASON_VMCLEAR 19
+#define EXIT_REASON_VMLAUNCH 20
+#define EXIT_REASON_VMPTRLD 21
+#define EXIT_REASON_VMPTRST 22
+#define EXIT_REASON_VMREAD 23
+#define EXIT_REASON_VMRESUME 24
+#define EXIT_REASON_VMWRITE 25
+#define EXIT_REASON_VMOFF 26
+#define EXIT_REASON_VMON 27
+#define EXIT_REASON_CR_ACCESS 28
+#define EXIT_REASON_DR_ACCESS 29
+#define EXIT_REASON_IO_INSTRUCTION 30
+#define EXIT_REASON_MSR_READ 31
+#define EXIT_REASON_MSR_WRITE 32
+#define EXIT_REASON_INVALID_STATE 33
+#define EXIT_REASON_MSR_LOAD_FAIL 34
+#define EXIT_REASON_MWAIT_INSTRUCTION 36
+#define EXIT_REASON_MONITOR_TRAP_FLAG 37
+#define EXIT_REASON_MONITOR_INSTRUCTION 39
+#define EXIT_REASON_PAUSE_INSTRUCTION 40
+#define EXIT_REASON_MCE_DURING_VMENTRY 41
+#define EXIT_REASON_TPR_BELOW_THRESHOLD 43
+#define EXIT_REASON_APIC_ACCESS 44
+#define EXIT_REASON_EOI_INDUCED 45
+#define EXIT_REASON_GDTR_IDTR 46
+#define EXIT_REASON_LDTR_TR 47
+#define EXIT_REASON_EPT_VIOLATION 48
+#define EXIT_REASON_EPT_MISCONFIG 49
+#define EXIT_REASON_INVEPT 50
+#define EXIT_REASON_RDTSCP 51
+#define EXIT_REASON_PREEMPTION_TIMER 52
+#define EXIT_REASON_INVVPID 53
+#define EXIT_REASON_WBINVD 54
+#define EXIT_REASON_XSETBV 55
+#define EXIT_REASON_APIC_WRITE 56
+#define EXIT_REASON_RDRAND 57
+#define EXIT_REASON_INVPCID 58
+#define EXIT_REASON_VMFUNC 59
+#define EXIT_REASON_ENCLS 60
+#define EXIT_REASON_RDSEED 61
+#define EXIT_REASON_PML_FULL 62
+#define EXIT_REASON_XSAVES 63
+#define EXIT_REASON_XRSTORS 64
+#define EXIT_REASON_UMWAIT 67
+#define EXIT_REASON_TPAUSE 68
+#define EXIT_REASON_BUS_LOCK 74
+#define EXIT_REASON_NOTIFY 75
+#define EXIT_REASON_SEAMCALL 76
+#define EXIT_REASON_TDCALL 77
+#define EXIT_REASON_MSR_READ_IMM 84
+#define EXIT_REASON_MSR_WRITE_IMM 85
+
+#define VMX_EXIT_REASONS \
+ { EXIT_REASON_EXCEPTION_NMI, "EXCEPTION_NMI" }, \
+ { EXIT_REASON_EXTERNAL_INTERRUPT, "EXTERNAL_INTERRUPT" }, \
+ { EXIT_REASON_TRIPLE_FAULT, "TRIPLE_FAULT" }, \
+ { EXIT_REASON_INIT_SIGNAL, "INIT_SIGNAL" }, \
+ { EXIT_REASON_SIPI_SIGNAL, "SIPI_SIGNAL" }, \
+ { EXIT_REASON_INTERRUPT_WINDOW, "INTERRUPT_WINDOW" }, \
+ { EXIT_REASON_NMI_WINDOW, "NMI_WINDOW" }, \
+ { EXIT_REASON_TASK_SWITCH, "TASK_SWITCH" }, \
+ { EXIT_REASON_CPUID, "CPUID" }, \
+ { EXIT_REASON_HLT, "HLT" }, \
+ { EXIT_REASON_INVD, "INVD" }, \
+ { EXIT_REASON_INVLPG, "INVLPG" }, \
+ { EXIT_REASON_RDPMC, "RDPMC" }, \
+ { EXIT_REASON_RDTSC, "RDTSC" }, \
+ { EXIT_REASON_VMCALL, "VMCALL" }, \
+ { EXIT_REASON_VMCLEAR, "VMCLEAR" }, \
+ { EXIT_REASON_VMLAUNCH, "VMLAUNCH" }, \
+ { EXIT_REASON_VMPTRLD, "VMPTRLD" }, \
+ { EXIT_REASON_VMPTRST, "VMPTRST" }, \
+ { EXIT_REASON_VMREAD, "VMREAD" }, \
+ { EXIT_REASON_VMRESUME, "VMRESUME" }, \
+ { EXIT_REASON_VMWRITE, "VMWRITE" }, \
+ { EXIT_REASON_VMOFF, "VMOFF" }, \
+ { EXIT_REASON_VMON, "VMON" }, \
+ { EXIT_REASON_CR_ACCESS, "CR_ACCESS" }, \
+ { EXIT_REASON_DR_ACCESS, "DR_ACCESS" }, \
+ { EXIT_REASON_IO_INSTRUCTION, "IO_INSTRUCTION" }, \
+ { EXIT_REASON_MSR_READ, "MSR_READ" }, \
+ { EXIT_REASON_MSR_WRITE, "MSR_WRITE" }, \
+ { EXIT_REASON_INVALID_STATE, "INVALID_STATE" }, \
+ { EXIT_REASON_MSR_LOAD_FAIL, "MSR_LOAD_FAIL" }, \
+ { EXIT_REASON_MWAIT_INSTRUCTION, "MWAIT_INSTRUCTION" }, \
+ { EXIT_REASON_MONITOR_TRAP_FLAG, "MONITOR_TRAP_FLAG" }, \
+ { EXIT_REASON_MONITOR_INSTRUCTION, "MONITOR_INSTRUCTION" }, \
+ { EXIT_REASON_PAUSE_INSTRUCTION, "PAUSE_INSTRUCTION" }, \
+ { EXIT_REASON_MCE_DURING_VMENTRY, "MCE_DURING_VMENTRY" }, \
+ { EXIT_REASON_TPR_BELOW_THRESHOLD, "TPR_BELOW_THRESHOLD" }, \
+ { EXIT_REASON_APIC_ACCESS, "APIC_ACCESS" }, \
+ { EXIT_REASON_EOI_INDUCED, "EOI_INDUCED" }, \
+ { EXIT_REASON_GDTR_IDTR, "GDTR_IDTR" }, \
+ { EXIT_REASON_LDTR_TR, "LDTR_TR" }, \
+ { EXIT_REASON_EPT_VIOLATION, "EPT_VIOLATION" }, \
+ { EXIT_REASON_EPT_MISCONFIG, "EPT_MISCONFIG" }, \
+ { EXIT_REASON_INVEPT, "INVEPT" }, \
+ { EXIT_REASON_RDTSCP, "RDTSCP" }, \
+ { EXIT_REASON_PREEMPTION_TIMER, "PREEMPTION_TIMER" }, \
+ { EXIT_REASON_INVVPID, "INVVPID" }, \
+ { EXIT_REASON_WBINVD, "WBINVD" }, \
+ { EXIT_REASON_XSETBV, "XSETBV" }, \
+ { EXIT_REASON_APIC_WRITE, "APIC_WRITE" }, \
+ { EXIT_REASON_RDRAND, "RDRAND" }, \
+ { EXIT_REASON_INVPCID, "INVPCID" }, \
+ { EXIT_REASON_VMFUNC, "VMFUNC" }, \
+ { EXIT_REASON_ENCLS, "ENCLS" }, \
+ { EXIT_REASON_RDSEED, "RDSEED" }, \
+ { EXIT_REASON_PML_FULL, "PML_FULL" }, \
+ { EXIT_REASON_XSAVES, "XSAVES" }, \
+ { EXIT_REASON_XRSTORS, "XRSTORS" }, \
+ { EXIT_REASON_UMWAIT, "UMWAIT" }, \
+ { EXIT_REASON_TPAUSE, "TPAUSE" }, \
+ { EXIT_REASON_BUS_LOCK, "BUS_LOCK" }, \
+ { EXIT_REASON_NOTIFY, "NOTIFY" }, \
+ { EXIT_REASON_TDCALL, "TDCALL" }, \
+ { EXIT_REASON_MSR_READ_IMM, "MSR_READ_IMM" }, \
+ { EXIT_REASON_MSR_WRITE_IMM, "MSR_WRITE_IMM" }
+
+#define VMX_EXIT_REASON_FLAGS \
+ { VMX_EXIT_REASONS_FAILED_VMENTRY, "FAILED_VMENTRY" }
+
+#define VMX_ABORT_SAVE_GUEST_MSR_FAIL 1
+#define VMX_ABORT_LOAD_HOST_PDPTE_FAIL 2
+#define VMX_ABORT_LOAD_HOST_MSR_FAIL 4
+
+#endif /* _UAPIVMX_H */
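
Note: an exit reason read from the VMCS carries flag bits on top of the basic reason, which is why VMX_EXIT_REASONS_FAILED_VMENTRY is defined as a separate mask. A sketch of splitting the two, assuming the basic reason sits in the low 16 bits as on hardware:

#include <stdint.h>
#include <stdio.h>

#define VMX_EXIT_REASONS_FAILED_VMENTRY	0x80000000
#define EXIT_REASON_EPT_VIOLATION	48

int main(void)
{
	uint32_t exit_reason = VMX_EXIT_REASONS_FAILED_VMENTRY |
			       EXIT_REASON_EPT_VIOLATION;

	printf("basic reason %u, failed vmentry: %s\n",
	       exit_reason & 0xffff,
	       (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) ? "yes" : "no");
	return 0;
}
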
diff --git a/arch/x86/include/uapi/asm/vsyscall.h b/arch/x86/include/uapi/asm/vsyscall.h
new file mode 100644
index 000000000000..75275f547444
--- /dev/null
+++ b/arch/x86/include/uapi/asm/vsyscall.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_X86_VSYSCALL_H
+#define _UAPI_ASM_X86_VSYSCALL_H
+
+enum vsyscall_num {
+ __NR_vgettimeofday,
+ __NR_vtime,
+ __NR_vgetcpu,
+};
+
+#define VSYSCALL_ADDR (-10UL << 20)
+
+#endif /* _UAPI_ASM_X86_VSYSCALL_H */
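
Note: VSYSCALL_ADDR relies on unsigned wraparound: -10UL is 0xfffffffffffffff6, and shifting it left by 20 produces the fixed legacy vsyscall page address. A one-line check:

#include <stdio.h>

#define VSYSCALL_ADDR (-10UL << 20)

int main(void)
{
	printf("%#lx\n", VSYSCALL_ADDR);	/* 0xffffffffff600000 on x86-64 */
	return 0;
}
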
diff --git a/arch/x86/kernel/.gitignore b/arch/x86/kernel/.gitignore
index 08f4fd731469..ef66569e7e22 100644
--- a/arch/x86/kernel/.gitignore
+++ b/arch/x86/kernel/.gitignore
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0-only
vsyscall.lds
vsyscall_32.lds
vmlinux.lds
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index e77b22083721..bc184dd38d99 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -1,131 +1,168 @@
+# SPDX-License-Identifier: GPL-2.0
#
# Makefile for the linux kernel.
#
-extra-y := head_$(BITS).o head$(BITS).o head.o init_task.o vmlinux.lds
+always-$(KBUILD_BUILTIN) += vmlinux.lds
CPPFLAGS_vmlinux.lds += -U$(UTS_MACHINE)
ifdef CONFIG_FUNCTION_TRACER
# Do not profile debug and lowlevel utilities
CFLAGS_REMOVE_tsc.o = -pg
-CFLAGS_REMOVE_rtc.o = -pg
CFLAGS_REMOVE_paravirt-spinlocks.o = -pg
+CFLAGS_REMOVE_pvclock.o = -pg
+CFLAGS_REMOVE_kvmclock.o = -pg
CFLAGS_REMOVE_ftrace.o = -pg
CFLAGS_REMOVE_early_printk.o = -pg
+CFLAGS_REMOVE_head64.o = -pg
+CFLAGS_REMOVE_head32.o = -pg
+CFLAGS_REMOVE_rethook.o = -pg
endif
-#
-# vsyscalls (which work on the user stack) should have
-# no stack-protector checks:
-#
-nostackp := $(call cc-option, -fno-stack-protector)
-CFLAGS_vsyscall_64.o := $(PROFILING) -g0 $(nostackp)
-CFLAGS_hpet.o := $(nostackp)
-CFLAGS_tsc.o := $(nostackp)
-CFLAGS_paravirt.o := $(nostackp)
-GCOV_PROFILE_vsyscall_64.o := n
-GCOV_PROFILE_hpet.o := n
-GCOV_PROFILE_tsc.o := n
-GCOV_PROFILE_paravirt.o := n
-
-obj-y := process_$(BITS).o signal.o entry_$(BITS).o
-obj-y += traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
-obj-y += time.o ioport.o ldt.o dumpstack.o
+KASAN_SANITIZE_head$(BITS).o := n
+KASAN_SANITIZE_dumpstack.o := n
+KASAN_SANITIZE_dumpstack_$(BITS).o := n
+KASAN_SANITIZE_stacktrace.o := n
+KASAN_SANITIZE_paravirt.o := n
+
+# With some compiler versions the generated code results in boot hangs, caused
+# by several compilation units. To be safe, disable all instrumentation.
+KCSAN_SANITIZE := n
+KMSAN_SANITIZE_head$(BITS).o := n
+KMSAN_SANITIZE_nmi.o := n
+
+# If instrumentation of the following files is enabled, the boot hangs
+# during the first second.
+KCOV_INSTRUMENT_head$(BITS).o := n
+# These are called from save_stack_trace() on debug paths,
+# and produce large amounts of uninteresting coverage.
+KCOV_INSTRUMENT_stacktrace.o := n
+KCOV_INSTRUMENT_dumpstack.o := n
+KCOV_INSTRUMENT_dumpstack_$(BITS).o := n
+KCOV_INSTRUMENT_unwind_orc.o := n
+KCOV_INSTRUMENT_unwind_frame.o := n
+KCOV_INSTRUMENT_unwind_guess.o := n
+
+CFLAGS_head32.o := -fno-stack-protector
+CFLAGS_head64.o := -fno-stack-protector
+CFLAGS_irq.o := -I $(src)/../include/asm/trace
+
+obj-y += head_$(BITS).o
+obj-y += head$(BITS).o
+obj-y += ebda.o
+obj-y += platform-quirks.o
+obj-y += process_$(BITS).o signal.o signal_$(BITS).o
+obj-y += traps.o idt.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
+obj-y += time.o ioport.o dumpstack.o nmi.o
+obj-$(CONFIG_X86_FRED) += fred.o
+obj-$(CONFIG_MODIFY_LDT_SYSCALL) += ldt.o
+obj-$(CONFIG_X86_KERNEL_IBT) += ibt_selftest.o
obj-y += setup.o x86_init.o i8259.o irqinit.o
-obj-$(CONFIG_X86_VISWS) += visws_quirks.o
-obj-$(CONFIG_X86_32) += probe_roms_32.o
-obj-$(CONFIG_X86_32) += sys_i386_32.o i386_ksyms_32.o
-obj-$(CONFIG_X86_64) += sys_x86_64.o x8664_ksyms_64.o
-obj-$(CONFIG_X86_64) += syscall_64.o vsyscall_64.o
+obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-$(CONFIG_IRQ_WORK) += irq_work.o
+obj-y += probe_roms.o
+obj-$(CONFIG_X86_32) += sys_ia32.o
+obj-$(CONFIG_IA32_EMULATION) += sys_ia32.o signal_32.o
+obj-$(CONFIG_X86_64) += sys_x86_64.o
+obj-$(CONFIG_X86_ESPFIX64) += espfix_64.o
+obj-$(CONFIG_SYSFS) += ksysfs.o
obj-y += bootflag.o e820.o
-obj-y += pci-dma.o quirks.o i8237.o topology.o kdebugfs.o
-obj-y += alternative.o i8253.o pci-nommu.o hw_breakpoint.o
-obj-y += tsc.o io_delay.o rtc.o
+obj-y += pci-dma.o quirks.o kdebugfs.o
+obj-y += alternative.o i8253.o hw_breakpoint.o
+obj-y += tsc.o tsc_msr.o io_delay.o rtc.o
+obj-y += resource.o
+obj-y += irqflags.o
+obj-y += static_call.o
-obj-$(CONFIG_X86_TRAMPOLINE) += trampoline.o
obj-y += process.o
-obj-y += i387.o xsave.o
+obj-y += fpu/
obj-y += ptrace.o
obj-$(CONFIG_X86_32) += tls.o
obj-$(CONFIG_IA32_EMULATION) += tls.o
obj-y += step.o
obj-$(CONFIG_INTEL_TXT) += tboot.o
-obj-$(CONFIG_STACKTRACE) += stacktrace.o
+obj-$(CONFIG_ISA_DMA_API) += i8237.o
+obj-y += stacktrace.o
obj-y += cpu/
obj-y += acpi/
-obj-$(CONFIG_SFI) += sfi.o
obj-y += reboot.o
-obj-$(CONFIG_MCA) += mca_32.o
obj-$(CONFIG_X86_MSR) += msr.o
obj-$(CONFIG_X86_CPUID) += cpuid.o
obj-$(CONFIG_PCI) += early-quirks.o
apm-y := apm_32.o
obj-$(CONFIG_APM) += apm.o
obj-$(CONFIG_SMP) += smp.o
-obj-$(CONFIG_SMP) += smpboot.o tsc_sync.o
+obj-$(CONFIG_SMP) += smpboot.o
+obj-$(CONFIG_X86_TSC) += tsc_sync.o
obj-$(CONFIG_SMP) += setup_percpu.o
-obj-$(CONFIG_X86_64_SMP) += tsc_sync.o
-obj-$(CONFIG_X86_TRAMPOLINE) += trampoline_$(BITS).o
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-y += apic/
obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o
obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o
+obj-$(CONFIG_FUNCTION_TRACER) += ftrace_$(BITS).o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
obj-$(CONFIG_FTRACE_SYSCALLS) += ftrace.o
-obj-$(CONFIG_KEXEC) += machine_kexec_$(BITS).o
-obj-$(CONFIG_KEXEC) += relocate_kernel_$(BITS).o crash.o
-obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o
-obj-$(CONFIG_KPROBES) += kprobes.o
+obj-$(CONFIG_X86_TSC) += trace_clock.o
+obj-$(CONFIG_TRACING) += trace.o
+obj-$(CONFIG_RETHOOK) += rethook.o
+obj-$(CONFIG_VMCORE_INFO) += vmcore_info_$(BITS).o
+obj-$(CONFIG_KEXEC_CORE) += machine_kexec_$(BITS).o
+obj-$(CONFIG_KEXEC_CORE) += relocate_kernel_$(BITS).o
+obj-$(CONFIG_KEXEC_FILE) += kexec-bzimage64.o
+obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o crash.o
+obj-y += kprobes/
obj-$(CONFIG_MODULES) += module.o
-obj-$(CONFIG_EFI) += efi.o efi_$(BITS).o efi_stub_$(BITS).o
-obj-$(CONFIG_DOUBLEFAULT) += doublefault_32.o
+obj-$(CONFIG_X86_32) += doublefault_32.o
obj-$(CONFIG_KGDB) += kgdb.o
obj-$(CONFIG_VM86) += vm86_32.o
obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
obj-$(CONFIG_HPET_TIMER) += hpet.o
-obj-$(CONFIG_APB_TIMER) += apb_timer.o
-obj-$(CONFIG_K8_NB) += k8.o
-obj-$(CONFIG_DEBUG_RODATA_TEST) += test_rodata.o
-obj-$(CONFIG_DEBUG_NX_TEST) += test_nx.o
+obj-$(CONFIG_AMD_NB) += amd_nb.o
+obj-$(CONFIG_AMD_NODE) += amd_node.o
+obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
-obj-$(CONFIG_VMI) += vmi_32.o vmiclock_32.o
-obj-$(CONFIG_KVM_GUEST) += kvm.o
-obj-$(CONFIG_KVM_CLOCK) += kvmclock.o
-obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o
+obj-$(CONFIG_KVM_GUEST) += kvm.o kvmclock.o
+obj-$(CONFIG_PARAVIRT) += paravirt.o
obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
+obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o
+obj-$(CONFIG_JAILHOUSE_GUEST) += jailhouse.o
+
+obj-$(CONFIG_EISA) += eisa.o
obj-$(CONFIG_PCSPKR_PLATFORM) += pcspeaker.o
-obj-$(CONFIG_SCx200) += scx200.o
-scx200-y += scx200_32.o
+obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
+
+obj-$(CONFIG_OF) += devicetree.o
+obj-$(CONFIG_UPROBES) += uprobes.o
-obj-$(CONFIG_OLPC) += olpc.o
-obj-$(CONFIG_X86_MRST) += mrst.o
+obj-$(CONFIG_PERF_EVENTS) += perf_regs.o
+obj-$(CONFIG_SCHED_MC_PRIO) += itmt.o
+obj-$(CONFIG_X86_UMIP) += umip.o
-microcode-y := microcode_core.o
-microcode-$(CONFIG_MICROCODE_INTEL) += microcode_intel.o
-microcode-$(CONFIG_MICROCODE_AMD) += microcode_amd.o
-obj-$(CONFIG_MICROCODE) += microcode.o
+obj-$(CONFIG_UNWINDER_ORC) += unwind_orc.o
+obj-$(CONFIG_UNWINDER_FRAME_POINTER) += unwind_frame.o
+obj-$(CONFIG_UNWINDER_GUESS) += unwind_guess.o
-obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
+obj-$(CONFIG_CFI) += cfi.o
+
+obj-$(CONFIG_CALL_THUNKS) += callthunks.o
+
+obj-$(CONFIG_X86_CET) += cet.o
-obj-$(CONFIG_SWIOTLB) += pci-swiotlb.o
+obj-$(CONFIG_X86_USER_SHADOW_STACK) += shstk.o
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
- obj-$(CONFIG_X86_UV) += tlb_uv.o bios_uv.o uv_irq.o uv_sysfs.o uv_time.o
- obj-$(CONFIG_X86_PM_TIMER) += pmtimer_64.o
obj-$(CONFIG_AUDIT) += audit_64.o
- obj-$(CONFIG_GART_IOMMU) += pci-gart_64.o aperture_64.o
- obj-$(CONFIG_CALGARY_IOMMU) += pci-calgary_64.o tce_64.o
- obj-$(CONFIG_AMD_IOMMU) += amd_iommu_init.o amd_iommu.o
+ obj-$(CONFIG_GART_IOMMU) += amd_gart_64.o aperture_64.o
- obj-$(CONFIG_PCI_MMCONFIG) += mmconf-fam10h_64.o
+ obj-$(CONFIG_MMCONF_FAM10H) += mmconf-fam10h_64.o
obj-y += vsmp_64.o
endif
diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile
index 6f35260bb3ef..842a5f449404 100644
--- a/arch/x86/kernel/acpi/Makefile
+++ b/arch/x86/kernel/acpi/Makefile
@@ -1,14 +1,12 @@
-subdir- := realmode
+# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_ACPI) += boot.o
-obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_rm.o wakeup_$(BITS).o
+obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
+obj-$(CONFIG_ACPI_APEI) += apei.o
+obj-$(CONFIG_ACPI_CPPC_LIB) += cppc.o
+obj-$(CONFIG_ACPI_MADT_WAKEUP) += madt_wakeup.o madt_playdead.o
ifneq ($(CONFIG_ACPI_PROCESSOR),)
obj-y += cstate.o
endif
-$(obj)/wakeup_rm.o: $(obj)/realmode/wakeup.bin
-
-$(obj)/realmode/wakeup.bin: FORCE
- $(Q)$(MAKE) $(build)=$(obj)/realmode
-
diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
new file mode 100644
index 000000000000..e21419e686eb
--- /dev/null
+++ b/arch/x86/kernel/acpi/apei.c
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Arch-specific APEI-related functions.
+ */
+
+#include <acpi/apei.h>
+
+#include <asm/mce.h>
+#include <asm/tlbflush.h>
+
+int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data)
+{
+#ifdef CONFIG_X86_MCE
+ int i;
+ struct acpi_hest_ia_corrected *cmc;
+ struct acpi_hest_ia_error_bank *mc_bank;
+
+ cmc = (struct acpi_hest_ia_corrected *)hest_hdr;
+ if (!cmc->enabled)
+ return 0;
+
+ mce_save_apei_thr_limit(cmc->notify.error_threshold_value);
+
+ /*
+ * We expect HEST to provide a list of MC banks that report errors
+ * in firmware first mode. Otherwise, return a non-zero value to
+ * indicate that we are done parsing HEST.
+ */
+ if (!(cmc->flags & ACPI_HEST_FIRMWARE_FIRST) ||
+ !cmc->num_hardware_banks)
+ return 1;
+
+ pr_info("HEST: Enabling Firmware First mode for corrected errors.\n");
+
+ mc_bank = (struct acpi_hest_ia_error_bank *)(cmc + 1);
+ for (i = 0; i < cmc->num_hardware_banks; i++, mc_bank++)
+ mce_disable_bank(mc_bank->bank_number);
+#endif
+ return 1;
+}
+
+void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
+{
+#ifdef CONFIG_X86_MCE
+ apei_mce_report_mem_error(sev, mem_err);
+#endif
+}
+
+int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
+{
+ return apei_smca_report_x86_error(ctx_info, lapic_id);
+}
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index c05872aa3ce0..9fa321a95eb3 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -1,84 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/*
* boot.c - Architecture-Specific Low-Level ACPI Boot Support
*
* Copyright (C) 2001, 2002 Paul Diefenbaugh <paul.s.diefenbaugh@intel.com>
* Copyright (C) 2001 Jun Nakajima <jun.nakajima@intel.com>
- *
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- *
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/
+#define pr_fmt(fmt) "ACPI: " fmt
#include <linux/init.h>
#include <linux/acpi.h>
#include <linux/acpi_pmtmr.h>
#include <linux/efi.h>
#include <linux/cpumask.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/dmi.h>
#include <linux/irq.h>
#include <linux/slab.h>
-#include <linux/bootmem.h>
+#include <linux/memblock.h>
#include <linux/ioport.h>
#include <linux/pci.h>
+#include <linux/efi-bgrt.h>
+#include <linux/serial_core.h>
+#include <linux/pgtable.h>
+
+#include <xen/xen.h>
+#include <asm/e820/api.h>
+#include <asm/irqdomain.h>
#include <asm/pci_x86.h>
-#include <asm/pgtable.h>
#include <asm/io_apic.h>
#include <asm/apic.h>
#include <asm/io.h>
#include <asm/mpspec.h>
#include <asm/smp.h>
+#include <asm/i8259.h>
+#include <asm/setup.h>
+#include "sleep.h" /* To include x86_acpi_suspend_lowlevel */
static int __initdata acpi_force = 0;
-u32 acpi_rsdt_forced;
int acpi_disabled;
EXPORT_SYMBOL(acpi_disabled);
#ifdef CONFIG_X86_64
# include <asm/proto.h>
-# include <asm/numa_64.h>
#endif /* X86 */
-#define BAD_MADT_ENTRY(entry, end) ( \
- (!entry) || (unsigned long)entry + sizeof(*entry) > end || \
- ((struct acpi_subtable_header *)entry)->length < sizeof(*entry))
-
-#define PREFIX "ACPI: "
-
int acpi_noirq; /* skip ACPI IRQ initialization */
+static int acpi_nobgrt; /* skip ACPI BGRT */
int acpi_pci_disabled; /* skip ACPI PCI scan and IRQ initialization */
EXPORT_SYMBOL(acpi_pci_disabled);
int acpi_lapic;
int acpi_ioapic;
int acpi_strict;
+int acpi_disable_cmcff;
+bool acpi_int_src_ovr[NR_IRQS_LEGACY];
+/* ACPI SCI override configuration */
u8 acpi_sci_flags __initdata;
-int acpi_sci_override_gsi __initdata;
+u32 acpi_sci_override_gsi __initdata = INVALID_ACPI_IRQ;
int acpi_skip_timer_override __initdata;
int acpi_use_timer_override __initdata;
+int acpi_fix_pin2_polarity __initdata;
#ifdef CONFIG_X86_LOCAL_APIC
static u64 acpi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
+static bool has_lapic_cpus __initdata;
+static bool acpi_support_online_capable;
#endif
-#ifndef __HAVE_ARCH_CMPXCHG
-#warning ACPI uses CMPXCHG, i486 and later hardware
+#ifdef CONFIG_X86_IO_APIC
+/*
+ * Locks related to IOAPIC hotplug
+ * Hotplug side:
+ * ->device_hotplug_lock
+ * ->acpi_ioapic_lock
+ * ->ioapic_lock
+ * Interrupt mapping side:
+ * ->acpi_ioapic_lock
+ * ->ioapic_mutex
+ * ->ioapic_lock
+ */
+static DEFINE_MUTEX(acpi_ioapic_lock);
#endif
/* --------------------------------------------------------------------------
@@ -100,71 +103,25 @@ static u32 isa_irq_to_gsi[NR_IRQS_LEGACY] __read_mostly = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
};
-static unsigned int gsi_to_irq(unsigned int gsi)
-{
- unsigned int irq = gsi + NR_IRQS_LEGACY;
- unsigned int i;
-
- for (i = 0; i < NR_IRQS_LEGACY; i++) {
- if (isa_irq_to_gsi[i] == gsi) {
- return i;
- }
- }
-
- /* Provide an identity mapping of gsi == irq
- * except on truly weird platforms that have
- * non isa irqs in the first 16 gsis.
- */
- if (gsi >= NR_IRQS_LEGACY)
- irq = gsi;
- else
- irq = gsi_top + gsi;
-
- return irq;
-}
-
-static u32 irq_to_gsi(int irq)
-{
- unsigned int gsi;
-
- if (irq < NR_IRQS_LEGACY)
- gsi = isa_irq_to_gsi[irq];
- else if (irq < gsi_top)
- gsi = irq;
- else if (irq < (gsi_top + NR_IRQS_LEGACY))
- gsi = irq - gsi_top;
- else
- gsi = 0xffffffff;
-
- return gsi;
-}
-
/*
- * Temporarily use the virtual area starting from FIX_IO_APIC_BASE_END,
- * to map the target physical address. The problem is that set_fixmap()
- * provides a single page, and it is possible that the page is not
- * sufficient.
- * By using this area, we can map up to MAX_IO_APICS pages temporarily,
- * i.e. until the next __va_range() call.
- *
- * Important Safety Note: The fixed I/O APIC page numbers are *subtracted*
- * from the fixed base. That's why we start at FIX_IO_APIC_BASE_END and
- * count idx down while incrementing the phys address.
+ * This is just a simple wrapper around early_memremap(),
+ * with sanity checks for phys == 0 and size == 0.
*/
-char *__init __acpi_map_table(unsigned long phys, unsigned long size)
+void __init __iomem *__acpi_map_table(unsigned long phys, unsigned long size)
{
if (!phys || !size)
return NULL;
- return early_ioremap(phys, size);
+ return early_memremap(phys, size);
}
-void __init __acpi_unmap_table(char *map, unsigned long size)
+
+void __init __acpi_unmap_table(void __iomem *map, unsigned long size)
{
if (!map || !size)
return;
- early_iounmap(map, size);
+ early_memunmap(map, size);
}
#ifdef CONFIG_X86_LOCAL_APIC
@@ -172,56 +129,84 @@ static int __init acpi_parse_madt(struct acpi_table_header *table)
{
struct acpi_table_madt *madt = NULL;
- if (!cpu_has_apic)
+ if (!boot_cpu_has(X86_FEATURE_APIC))
return -EINVAL;
madt = (struct acpi_table_madt *)table;
if (!madt) {
- printk(KERN_WARNING PREFIX "Unable to map MADT\n");
+ pr_warn("Unable to map MADT\n");
return -ENODEV;
}
if (madt->address) {
acpi_lapic_addr = (u64) madt->address;
- printk(KERN_DEBUG PREFIX "Local APIC address 0x%08x\n",
- madt->address);
+ pr_debug("Local APIC address 0x%08x\n", madt->address);
}
+ if (madt->flags & ACPI_MADT_PCAT_COMPAT)
+ legacy_pic_pcat_compat();
+
+ /* ACPI 6.3 and newer support the online capable bit. */
+ if (acpi_gbl_FADT.header.revision > 6 ||
+ (acpi_gbl_FADT.header.revision == 6 &&
+ acpi_gbl_FADT.minor_revision >= 3))
+ acpi_support_online_capable = true;
+
default_acpi_madt_oem_check(madt->header.oem_id,
madt->header.oem_table_id);
return 0;
}
-static void __cpuinit acpi_register_lapic(int id, u8 enabled)
+static bool __init acpi_is_processor_usable(u32 lapic_flags)
{
- unsigned int ver = 0;
-
- if (!enabled) {
- ++disabled_cpus;
- return;
- }
+ if (lapic_flags & ACPI_MADT_ENABLED)
+ return true;
- if (boot_cpu_physical_apicid != -1U)
- ver = apic_version[boot_cpu_physical_apicid];
+ if (!acpi_support_online_capable ||
+ (lapic_flags & ACPI_MADT_ONLINE_CAPABLE))
+ return true;
- generic_processor_info(id, ver);
+ return false;
}
static int __init
-acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
+acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end)
{
struct acpi_madt_local_x2apic *processor = NULL;
+#ifdef CONFIG_X86_X2APIC
+ u32 apic_id;
+ u8 enabled;
+#endif
processor = (struct acpi_madt_local_x2apic *)header;
if (BAD_MADT_ENTRY(processor, end))
return -EINVAL;
- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);
#ifdef CONFIG_X86_X2APIC
+ apic_id = processor->local_apic_id;
+ enabled = processor->lapic_flags & ACPI_MADT_ENABLED;
+
+ /* Ignore invalid ID */
+ if (apic_id == 0xffffffff)
+ return 0;
+
+ /* don't register processors that cannot be onlined */
+ if (!acpi_is_processor_usable(processor->lapic_flags))
+ return 0;
+
+ /*
+ * According to https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#processor-local-x2apic-structure
+ * when MADT provides both valid LAPIC and x2APIC entries, the APIC ID
+ * in x2APIC must be equal or greater than 0xff.
+ */
+ if (has_lapic_cpus && apic_id < 0xff)
+ return 0;
+
/*
* We need to register disabled CPU as well to permit
* counting disabled CPUs. This allows us to size
@@ -229,17 +214,22 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
* to not preallocating memory for all NR_CPUS
* when we use CPU hotplug.
*/
- acpi_register_lapic(processor->local_apic_id, /* APIC ID */
- processor->lapic_flags & ACPI_MADT_ENABLED);
+ if (!apic_id_valid(apic_id)) {
+ if (enabled)
+ pr_warn("x2apic entry ignored\n");
+ return 0;
+ }
+
+ topology_register_apic(apic_id, processor->uid, enabled);
#else
- printk(KERN_WARNING PREFIX "x2apic entry ignored\n");
+ pr_warn("x2apic entry ignored\n");
#endif
return 0;
}
static int __init
-acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_check_lapic(union acpi_subtable_headers *header, const unsigned long end)
{
struct acpi_madt_local_apic *processor = NULL;
@@ -248,7 +238,37 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
if (BAD_MADT_ENTRY(processor, end))
return -EINVAL;
- acpi_table_print_madt_entry(header);
+ /* Ignore invalid ID */
+ if (processor->id == 0xff)
+ return 0;
+
+ /* Ignore processors that can not be onlined */
+ if (!acpi_is_processor_usable(processor->lapic_flags))
+ return 0;
+
+ has_lapic_cpus = true;
+ return 0;
+}
+
+static int __init
+acpi_parse_lapic(union acpi_subtable_headers * header, const unsigned long end)
+{
+ struct acpi_madt_local_apic *processor = NULL;
+
+ processor = (struct acpi_madt_local_apic *)header;
+
+ if (BAD_MADT_ENTRY(processor, end))
+ return -EINVAL;
+
+ acpi_table_print_madt_entry(&header->common);
+
+ /* Ignore invalid ID */
+ if (processor->id == 0xff)
+ return 0;
+
+ /* don't register processors that can not be onlined */
+ if (!acpi_is_processor_usable(processor->lapic_flags))
+ return 0;
/*
* We need to register disabled CPU as well to permit
@@ -257,14 +277,15 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
* to not preallocating memory for all NR_CPUS
* when we use CPU hotplug.
*/
- acpi_register_lapic(processor->id, /* APIC ID */
- processor->lapic_flags & ACPI_MADT_ENABLED);
+ topology_register_apic(processor->id, /* APIC ID */
+ processor->processor_id, /* ACPI ID */
+ processor->lapic_flags & ACPI_MADT_ENABLED);
return 0;
}
static int __init
-acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
+acpi_parse_sapic(union acpi_subtable_headers *header, const unsigned long end)
{
struct acpi_madt_local_sapic *processor = NULL;
@@ -273,16 +294,17 @@ acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
if (BAD_MADT_ENTRY(processor, end))
return -EINVAL;
- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);
- acpi_register_lapic((processor->id << 8) | processor->eid,/* APIC ID */
- processor->lapic_flags & ACPI_MADT_ENABLED);
+ topology_register_apic((processor->id << 8) | processor->eid,/* APIC ID */
+ processor->processor_id, /* ACPI ID */
+ processor->lapic_flags & ACPI_MADT_ENABLED);
return 0;
}
static int __init
-acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
+acpi_parse_lapic_addr_ovr(union acpi_subtable_headers * header,
const unsigned long end)
{
struct acpi_madt_local_apic_override *lapic_addr_ovr = NULL;
@@ -292,13 +314,15 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
if (BAD_MADT_ENTRY(lapic_addr_ovr, end))
return -EINVAL;
+ acpi_table_print_madt_entry(&header->common);
+
acpi_lapic_addr = lapic_addr_ovr->address;
return 0;
}
static int __init
-acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
+acpi_parse_x2apic_nmi(union acpi_subtable_headers *header,
const unsigned long end)
{
struct acpi_madt_local_x2apic_nmi *x2apic_nmi = NULL;
@@ -308,16 +332,16 @@ acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
if (BAD_MADT_ENTRY(x2apic_nmi, end))
return -EINVAL;
- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);
if (x2apic_nmi->lint != 1)
- printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n");
+ pr_warn("NMI not connected to LINT 1!\n");
return 0;
}
static int __init
-acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long end)
{
struct acpi_madt_local_apic_nmi *lapic_nmi = NULL;
@@ -326,32 +350,137 @@ acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long e
if (BAD_MADT_ENTRY(lapic_nmi, end))
return -EINVAL;
- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);
if (lapic_nmi->lint != 1)
- printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n");
+ pr_warn("NMI not connected to LINT 1!\n");
return 0;
}
-
-#endif /*CONFIG_X86_LOCAL_APIC */
+#endif /* CONFIG_X86_LOCAL_APIC */
#ifdef CONFIG_X86_IO_APIC
+#define MP_ISA_BUS 0
+
+static int __init mp_register_ioapic_irq(u8 bus_irq, u8 polarity,
+ u8 trigger, u32 gsi);
+
+static void __init mp_override_legacy_irq(u8 bus_irq, u8 polarity, u8 trigger,
+ u32 gsi)
+{
+ /*
+ * Check bus_irq boundary.
+ */
+ if (bus_irq >= NR_IRQS_LEGACY) {
+ pr_warn("Invalid bus_irq %u for legacy override\n", bus_irq);
+ return;
+ }
+
+ /*
+ * TBD: This check is for faulty timer entries, where the override
+ * erroneously sets the trigger to level, resulting in a HUGE
+ * increase of timer interrupts!
+ */
+ if ((bus_irq == 0) && (trigger == 3))
+ trigger = 1;
+
+ if (mp_register_ioapic_irq(bus_irq, polarity, trigger, gsi) < 0)
+ return;
+ /*
+ * Reset default identity mapping if gsi is also a legacy IRQ,
+ * otherwise there will be more than one entry with the same GSI
+ * and acpi_isa_irq_to_gsi() may give wrong result.
+ */
+ if (gsi < nr_legacy_irqs() && isa_irq_to_gsi[gsi] == gsi)
+ isa_irq_to_gsi[gsi] = INVALID_ACPI_IRQ;
+ isa_irq_to_gsi[bus_irq] = gsi;
+}
+
+static void mp_config_acpi_gsi(struct device *dev, u32 gsi, int trigger,
+ int polarity)
+{
+#ifdef CONFIG_X86_MPPARSE
+ struct mpc_intsrc mp_irq;
+ struct pci_dev *pdev;
+ unsigned char number;
+ unsigned int devfn;
+ int ioapic;
+ u8 pin;
+
+ if (!acpi_ioapic)
+ return;
+ if (!dev || !dev_is_pci(dev))
+ return;
+
+ pdev = to_pci_dev(dev);
+ number = pdev->bus->number;
+ devfn = pdev->devfn;
+ pin = pdev->pin;
+ /* Build the entry exactly as it would appear in an MP table. */
+ mp_irq.type = MP_INTSRC;
+ mp_irq.irqtype = mp_INT;
+ mp_irq.irqflag = (trigger == ACPI_EDGE_SENSITIVE ? 4 : 0x0c) |
+ (polarity == ACPI_ACTIVE_HIGH ? 1 : 3);
+ mp_irq.srcbus = number;
+ mp_irq.srcbusirq = (((devfn >> 3) & 0x1f) << 2) | ((pin - 1) & 3);
+ ioapic = mp_find_ioapic(gsi);
+ mp_irq.dstapic = mpc_ioapic_id(ioapic);
+ mp_irq.dstirq = mp_find_ioapic_pin(ioapic, gsi);
+
+ mp_save_irq(&mp_irq);
+#endif
+}
+
+static int __init mp_register_ioapic_irq(u8 bus_irq, u8 polarity,
+ u8 trigger, u32 gsi)
+{
+ struct mpc_intsrc mp_irq;
+ int ioapic, pin;
+
+ /* Convert 'gsi' to 'ioapic.pin'(INTIN#) */
+ ioapic = mp_find_ioapic(gsi);
+ if (ioapic < 0) {
+ pr_warn("Failed to find ioapic for gsi : %u\n", gsi);
+ return ioapic;
+ }
+
+ pin = mp_find_ioapic_pin(ioapic, gsi);
+
+ mp_irq.type = MP_INTSRC;
+ mp_irq.irqtype = mp_INT;
+ mp_irq.irqflag = (trigger << 2) | polarity;
+ mp_irq.srcbus = MP_ISA_BUS;
+ mp_irq.srcbusirq = bus_irq;
+ mp_irq.dstapic = mpc_ioapic_id(ioapic);
+ mp_irq.dstirq = pin;
+
+ mp_save_irq(&mp_irq);
+
+ return 0;
+}
static int __init
-acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_ioapic(union acpi_subtable_headers * header, const unsigned long end)
{
struct acpi_madt_io_apic *ioapic = NULL;
+ struct ioapic_domain_cfg cfg = {
+ .type = IOAPIC_DOMAIN_DYNAMIC,
+ .ops = &mp_ioapic_irqdomain_ops,
+ };
ioapic = (struct acpi_madt_io_apic *)header;
if (BAD_MADT_ENTRY(ioapic, end))
return -EINVAL;
- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);
- mp_register_ioapic(ioapic->id,
- ioapic->address, ioapic->global_irq_base);
+ /* Statically assign IRQ numbers for IOAPICs hosting legacy IRQs */
+ if (ioapic->global_irq_base < nr_legacy_irqs())
+ cfg.type = IOAPIC_DOMAIN_LEGACY;
+
+ mp_register_ioapic(ioapic->id, ioapic->address, ioapic->global_irq_base,
+ &cfg);
return 0;
}
@@ -374,12 +503,12 @@ static void __init acpi_sci_ioapic_setup(u8 bus_irq, u16 polarity, u16 trigger,
if (acpi_sci_flags & ACPI_MADT_POLARITY_MASK)
polarity = acpi_sci_flags & ACPI_MADT_POLARITY_MASK;
- /*
- * mp_config_acpi_legacy_irqs() already setup IRQs < 16
- * If GSI is < 16, this will update its flags,
- * else it will create a new mp_irqs[] entry.
- */
- mp_override_legacy_irq(bus_irq, polarity, trigger, gsi);
+ if (bus_irq < NR_IRQS_LEGACY)
+ mp_override_legacy_irq(bus_irq, polarity, trigger, gsi);
+ else
+ mp_register_ioapic_irq(bus_irq, polarity, trigger, gsi);
+
+ acpi_penalize_sci_irq(bus_irq, trigger, polarity);
/*
* stash over-ride to indicate we've been here
@@ -390,7 +519,7 @@ static void __init acpi_sci_ioapic_setup(u8 bus_irq, u16 polarity, u16 trigger,
}
static int __init
-acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
+acpi_parse_int_src_ovr(union acpi_subtable_headers * header,
const unsigned long end)
{
struct acpi_madt_interrupt_override *intsrc = NULL;
@@ -400,7 +529,10 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
if (BAD_MADT_ENTRY(intsrc, end))
return -EINVAL;
- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);
+
+ if (intsrc->source_irq < NR_IRQS_LEGACY)
+ acpi_int_src_ovr[intsrc->source_irq] = true;
if (intsrc->source_irq == acpi_gbl_FADT.sci_interrupt) {
acpi_sci_ioapic_setup(intsrc->source_irq,
@@ -410,10 +542,17 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
return 0;
}
- if (acpi_skip_timer_override &&
- intsrc->source_irq == 0 && intsrc->global_irq == 2) {
- printk(PREFIX "BIOS IRQ0 pin2 override ignored.\n");
- return 0;
+ if (intsrc->source_irq == 0) {
+ if (acpi_skip_timer_override) {
+ pr_warn("BIOS IRQ0 override ignored.\n");
+ return 0;
+ }
+
+ if ((intsrc->global_irq == 2) && acpi_fix_pin2_polarity
+ && (intsrc->inti_flags & ACPI_MADT_POLARITY_MASK)) {
+ intsrc->inti_flags &= ~ACPI_MADT_POLARITY_MASK;
+ pr_warn("BIOS IRQ0 pin2 override: forcing polarity to high active.\n");
+ }
}
mp_override_legacy_irq(intsrc->source_irq,
@@ -425,7 +564,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
}
static int __init
-acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_nmi_src(union acpi_subtable_headers * header, const unsigned long end)
{
struct acpi_madt_nmi_source *nmi_src = NULL;
@@ -434,7 +573,7 @@ acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end
if (BAD_MADT_ENTRY(nmi_src, end))
return -EINVAL;
- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);
/* TBD: Support nmi_src entries? */
@@ -451,10 +590,10 @@ acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end
* If a PIC-mode SCI is not recognized or gives spurious IRQ7's
* it may require Edge Trigger -- use "acpi_sci=edge"
*
- * Port 0x4d0-4d1 are ECLR1 and ECLR2, the Edge/Level Control Registers
+ * Port 0x4d0-4d1 are ELCR1 and ELCR2, the Edge/Level Control Registers
* for the 8259 PIC. bit[n] = 1 means irq[n] is Level, otherwise Edge.
- * ECLR1 is IRQs 0-7 (IRQ 0, 1, 2 must be 0)
- * ECLR2 is IRQs 8-15 (IRQ 8, 13 must be 0)
+ * ELCR1 is IRQs 0-7 (IRQ 0, 1, 2 must be 0)
+ * ELCR2 is IRQs 8-15 (IRQ 8, 13 must be 0)
*/
void __init acpi_pic_sci_set_trigger(unsigned int irq, u16 trigger)
@@ -463,7 +602,7 @@ void __init acpi_pic_sci_set_trigger(unsigned int irq, u16 trigger)
unsigned int old, new;
/* Real old ELCR mask */
- old = inb(0x4d0) | (inb(0x4d1) << 8);
+ old = inb(PIC_ELCR1) | (inb(PIC_ELCR2) << 8);
/*
* If we use ACPI to set PCI IRQs, then we should clear ELCR
@@ -488,202 +627,257 @@ void __init acpi_pic_sci_set_trigger(unsigned int irq, u16 trigger)
if (old == new)
return;
- printk(PREFIX "setting ELCR to %04x (from %04x)\n", new, old);
- outb(new, 0x4d0);
- outb(new >> 8, 0x4d1);
+ pr_warn("setting ELCR to %04x (from %04x)\n", new, old);
+ outb(new, PIC_ELCR1);
+ outb(new >> 8, PIC_ELCR2);
}
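acpi_pic_sci_set_trigger() treats the two 8-bit ELCR ports as a single 16-bit mask with bit[n] = 1 marking IRQ n level-triggered. A toy model of the read-modify-write it performs, on fake register values instead of real port I/O:

#include <stdio.h>

int main(void)
{
	unsigned int elcr = 0x0c00;	/* pretend IRQs 10 and 11 are level */
	unsigned int irq = 9;
	unsigned int trigger = 3;	/* level, per the ACPI MADT flag encoding */
	unsigned int mask = 1u << irq;

	if (trigger == 3)
		elcr |= mask;		/* mark IRQ 9 level-triggered */
	else
		elcr &= ~mask;		/* mark it edge-triggered */

	printf("new ELCR = %#06x\n", elcr);	/* prints 0x0e00 */
	return 0;
}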
-int acpi_gsi_to_irq(u32 gsi, unsigned int *irq)
+int acpi_gsi_to_irq(u32 gsi, unsigned int *irqp)
{
- *irq = gsi_to_irq(gsi);
+ int rc, irq, trigger, polarity;
-#ifdef CONFIG_X86_IO_APIC
- if (acpi_irq_model == ACPI_IRQ_MODEL_IOAPIC)
- setup_IO_APIC_irq_extra(gsi);
-#endif
+ if (acpi_irq_model == ACPI_IRQ_MODEL_PIC) {
+ *irqp = gsi;
+ return 0;
+ }
+
+ rc = acpi_get_override_irq(gsi, &trigger, &polarity);
+ if (rc)
+ return rc;
+ trigger = trigger ? ACPI_LEVEL_SENSITIVE : ACPI_EDGE_SENSITIVE;
+ polarity = polarity ? ACPI_ACTIVE_LOW : ACPI_ACTIVE_HIGH;
+ irq = acpi_register_gsi(NULL, gsi, trigger, polarity);
+ if (irq < 0)
+ return irq;
+
+ *irqp = irq;
return 0;
}
+EXPORT_SYMBOL_GPL(acpi_gsi_to_irq);
int acpi_isa_irq_to_gsi(unsigned isa_irq, u32 *gsi)
{
- if (isa_irq >= 16)
- return -1;
- *gsi = irq_to_gsi(isa_irq);
- return 0;
+ if (isa_irq < nr_legacy_irqs() &&
+ isa_irq_to_gsi[isa_irq] != INVALID_ACPI_IRQ) {
+ *gsi = isa_irq_to_gsi[isa_irq];
+ return 0;
+ }
+
+ return -1;
}
-/*
- * success: return IRQ number (>=0)
- * failure: return < 0
- */
-int acpi_register_gsi(struct device *dev, u32 gsi, int trigger, int polarity)
+static int acpi_register_gsi_pic(struct device *dev, u32 gsi,
+ int trigger, int polarity)
{
- unsigned int irq;
- unsigned int plat_gsi = gsi;
-
#ifdef CONFIG_PCI
/*
* Make sure all (legacy) PCI IRQs are set as level-triggered.
*/
- if (acpi_irq_model == ACPI_IRQ_MODEL_PIC) {
- if (trigger == ACPI_LEVEL_SENSITIVE)
- eisa_set_level_irq(gsi);
- }
+ if (trigger == ACPI_LEVEL_SENSITIVE)
+ elcr_set_level_irq(gsi);
#endif
+ return gsi;
+}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+static int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+ int trigger, int polarity)
+{
+ int irq = gsi;
#ifdef CONFIG_X86_IO_APIC
- if (acpi_irq_model == ACPI_IRQ_MODEL_IOAPIC) {
- plat_gsi = mp_register_gsi(dev, gsi, trigger, polarity);
- }
+ int node;
+ struct irq_alloc_info info;
+
+ node = dev ? dev_to_node(dev) : NUMA_NO_NODE;
+ trigger = trigger == ACPI_EDGE_SENSITIVE ? 0 : 1;
+ polarity = polarity == ACPI_ACTIVE_HIGH ? 0 : 1;
+ ioapic_set_alloc_attr(&info, node, trigger, polarity);
+
+ mutex_lock(&acpi_ioapic_lock);
+ irq = mp_map_gsi_to_irq(gsi, IOAPIC_MAP_ALLOC, &info);
+ /* Don't set up the ACPI SCI because it's already set up */
+ if (irq >= 0 && enable_update_mptable && gsi != acpi_gbl_FADT.sci_interrupt)
+ mp_config_acpi_gsi(dev, gsi, trigger, polarity);
+ mutex_unlock(&acpi_ioapic_lock);
#endif
- irq = gsi_to_irq(plat_gsi);
return irq;
}
+static void acpi_unregister_gsi_ioapic(u32 gsi)
+{
+#ifdef CONFIG_X86_IO_APIC
+ int irq;
+
+ mutex_lock(&acpi_ioapic_lock);
+ irq = mp_map_gsi_to_irq(gsi, 0, NULL);
+ if (irq > 0)
+ mp_unmap_irq(irq);
+ mutex_unlock(&acpi_ioapic_lock);
+#endif
+}
+#endif
+
+int (*__acpi_register_gsi)(struct device *dev, u32 gsi,
+ int trigger, int polarity) = acpi_register_gsi_pic;
+void (*__acpi_unregister_gsi)(u32 gsi) = NULL;
+
+#ifdef CONFIG_ACPI_SLEEP
+int (*acpi_suspend_lowlevel)(void) = x86_acpi_suspend_lowlevel;
+#else
+int (*acpi_suspend_lowlevel)(void);
+#endif
+
+/*
+ * success: return IRQ number (>=0)
+ * failure: return < 0
+ */
+int acpi_register_gsi(struct device *dev, u32 gsi, int trigger, int polarity)
+{
+ return __acpi_register_gsi(dev, gsi, trigger, polarity);
+}
+EXPORT_SYMBOL_GPL(acpi_register_gsi);
+
+void acpi_unregister_gsi(u32 gsi)
+{
+ if (__acpi_unregister_gsi)
+ __acpi_unregister_gsi(gsi);
+}
+EXPORT_SYMBOL_GPL(acpi_unregister_gsi);
+
+#ifdef CONFIG_X86_LOCAL_APIC
+static void __init acpi_set_irq_model_ioapic(void)
+{
+ acpi_irq_model = ACPI_IRQ_MODEL_IOAPIC;
+ __acpi_register_gsi = acpi_register_gsi_ioapic;
+ __acpi_unregister_gsi = acpi_unregister_gsi_ioapic;
+ acpi_ioapic = 1;
+}
+#endif
+
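The __acpi_register_gsi/__acpi_unregister_gsi pointers make the GSI path swappable at boot: the kernel starts out in PIC mode and acpi_set_irq_model_ioapic() rewires both hooks once the MADT IO-APIC entries have parsed cleanly. A minimal userspace sketch of the same dispatch pattern (the function names and the +16 offset are invented for illustration):

#include <stdio.h>

static int register_gsi_pic(unsigned int gsi)    { return (int)gsi; }
static int register_gsi_ioapic(unsigned int gsi) { return (int)gsi + 16; }

/* default handler, swapped out once the interrupt model is known */
static int (*register_gsi)(unsigned int) = register_gsi_pic;

int main(void)
{
	printf("PIC mode:    GSI 5 -> IRQ %d\n", register_gsi(5));
	register_gsi = register_gsi_ioapic;	/* like acpi_set_irq_model_ioapic() */
	printf("IOAPIC mode: GSI 5 -> IRQ %d\n", register_gsi(5));
	return 0;
}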
/*
* ACPI based hotplug support for CPU
*/
#ifdef CONFIG_ACPI_HOTPLUG_CPU
#include <acpi/processor.h>
-static void acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
+static int acpi_map_cpu2node(acpi_handle handle, int cpu, int physid)
{
#ifdef CONFIG_ACPI_NUMA
int nid;
nid = acpi_get_node(handle);
- if (nid == -1 || !node_online(nid))
- return;
-#ifdef CONFIG_X86_64
- apicid_to_node[physid] = nid;
- numa_set_node(cpu, nid);
-#else /* CONFIG_X86_32 */
- apicid_2_node[physid] = nid;
- cpu_to_node_map[cpu] = nid;
-#endif
-
+ if (nid != NUMA_NO_NODE) {
+ set_apicid_to_node(physid, nid);
+ numa_set_node(cpu, nid);
+ }
#endif
+ return 0;
}
-static int __cpuinit _acpi_map_lsapic(acpi_handle handle, int *pcpu)
+int acpi_map_cpu(acpi_handle handle, phys_cpuid_t physid, u32 acpi_id, int *pcpu)
{
- struct acpi_buffer buffer = { ACPI_ALLOCATE_BUFFER, NULL };
- union acpi_object *obj;
- struct acpi_madt_local_apic *lapic;
- cpumask_var_t tmp_map, new_map;
- u8 physid;
- int cpu;
- int retval = -ENOMEM;
-
- if (ACPI_FAILURE(acpi_evaluate_object(handle, "_MAT", NULL, &buffer)))
- return -EINVAL;
-
- if (!buffer.length || !buffer.pointer)
- return -EINVAL;
-
- obj = buffer.pointer;
- if (obj->type != ACPI_TYPE_BUFFER ||
- obj->buffer.length < sizeof(*lapic)) {
- kfree(buffer.pointer);
- return -EINVAL;
- }
-
- lapic = (struct acpi_madt_local_apic *)obj->buffer.pointer;
-
- if (lapic->header.type != ACPI_MADT_TYPE_LOCAL_APIC ||
- !(lapic->lapic_flags & ACPI_MADT_ENABLED)) {
- kfree(buffer.pointer);
- return -EINVAL;
- }
-
- physid = lapic->id;
-
- kfree(buffer.pointer);
- buffer.length = ACPI_ALLOCATE_BUFFER;
- buffer.pointer = NULL;
+ int cpu = topology_hotplug_apic(physid, acpi_id);
- if (!alloc_cpumask_var(&tmp_map, GFP_KERNEL))
- goto out;
-
- if (!alloc_cpumask_var(&new_map, GFP_KERNEL))
- goto free_tmp_map;
-
- cpumask_copy(tmp_map, cpu_present_mask);
- acpi_register_lapic(physid, lapic->lapic_flags & ACPI_MADT_ENABLED);
-
- /*
- * If mp_register_lapic successfully generates a new logical cpu
- * number, then the following will get us exactly what was mapped
- */
- cpumask_andnot(new_map, cpu_present_mask, tmp_map);
- if (cpumask_empty(new_map)) {
- printk ("Unable to map lapic to logical cpu number\n");
- retval = -EINVAL;
- goto free_new_map;
+ if (cpu < 0) {
+ pr_info("Unable to map lapic to logical cpu number\n");
+ return cpu;
}
acpi_processor_set_pdc(handle);
-
- cpu = cpumask_first(new_map);
acpi_map_cpu2node(handle, cpu, physid);
*pcpu = cpu;
- retval = 0;
-
-free_new_map:
- free_cpumask_var(new_map);
-free_tmp_map:
- free_cpumask_var(tmp_map);
-out:
- return retval;
+ return 0;
}
+EXPORT_SYMBOL(acpi_map_cpu);
-/* wrapper to silence section mismatch warning */
-int __ref acpi_map_lsapic(acpi_handle handle, int *pcpu)
+int acpi_unmap_cpu(int cpu)
{
- return _acpi_map_lsapic(handle, pcpu);
+#ifdef CONFIG_ACPI_NUMA
+ set_apicid_to_node(per_cpu(x86_cpu_to_apicid, cpu), NUMA_NO_NODE);
+#endif
+ topology_hotunplug_apic(cpu);
+ return 0;
}
-EXPORT_SYMBOL(acpi_map_lsapic);
+EXPORT_SYMBOL(acpi_unmap_cpu);
+#endif /* CONFIG_ACPI_HOTPLUG_CPU */
-int acpi_unmap_lsapic(int cpu)
+int acpi_register_ioapic(acpi_handle handle, u64 phys_addr, u32 gsi_base)
{
- per_cpu(x86_cpu_to_apicid, cpu) = -1;
- set_cpu_present(cpu, false);
- num_processors--;
-
- return (0);
-}
+ int ret = -ENOSYS;
+#ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
+ int ioapic_id;
+ u64 addr;
+ struct ioapic_domain_cfg cfg = {
+ .type = IOAPIC_DOMAIN_DYNAMIC,
+ .ops = &mp_ioapic_irqdomain_ops,
+ };
+
+ ioapic_id = acpi_get_ioapic_id(handle, gsi_base, &addr);
+ if (ioapic_id < 0) {
+ unsigned long long uid;
+ acpi_status status;
+
+ status = acpi_evaluate_integer(handle, METHOD_NAME__UID,
+ NULL, &uid);
+ if (ACPI_FAILURE(status)) {
+ acpi_handle_warn(handle, "failed to get IOAPIC ID.\n");
+ return -EINVAL;
+ }
+ ioapic_id = (int)uid;
+ }
-EXPORT_SYMBOL(acpi_unmap_lsapic);
-#endif /* CONFIG_ACPI_HOTPLUG_CPU */
+ mutex_lock(&acpi_ioapic_lock);
+ ret = mp_register_ioapic(ioapic_id, phys_addr, gsi_base, &cfg);
+ mutex_unlock(&acpi_ioapic_lock);
+#endif
-int acpi_register_ioapic(acpi_handle handle, u64 phys_addr, u32 gsi_base)
-{
- /* TBD */
- return -EINVAL;
+ return ret;
}
-
EXPORT_SYMBOL(acpi_register_ioapic);
int acpi_unregister_ioapic(acpi_handle handle, u32 gsi_base)
{
- /* TBD */
- return -EINVAL;
-}
+ int ret = -ENOSYS;
+
+#ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
+ mutex_lock(&acpi_ioapic_lock);
+ ret = mp_unregister_ioapic(gsi_base);
+ mutex_unlock(&acpi_ioapic_lock);
+#endif
+ return ret;
+}
EXPORT_SYMBOL(acpi_unregister_ioapic);
-static int __init acpi_parse_sbf(struct acpi_table_header *table)
+/**
+ * acpi_ioapic_registered - Check whether IOAPIC associated with @gsi_base
+ * has been registered
+ * @handle: ACPI handle of the IOAPIC device
+ * @gsi_base: GSI base associated with the IOAPIC
+ *
+ * Assume caller holds some type of lock to serialize acpi_ioapic_registered()
+ * with acpi_register_ioapic()/acpi_unregister_ioapic().
+ */
+int acpi_ioapic_registered(acpi_handle handle, u32 gsi_base)
{
- struct acpi_table_boot *sb;
+ int ret = 0;
- sb = (struct acpi_table_boot *)table;
- if (!sb) {
- printk(KERN_WARNING PREFIX "Unable to map SBF\n");
- return -ENODEV;
- }
+#ifdef CONFIG_ACPI_HOTPLUG_IOAPIC
+ mutex_lock(&acpi_ioapic_lock);
+ ret = mp_ioapic_registered(gsi_base);
+ mutex_unlock(&acpi_ioapic_lock);
+#endif
+
+ return ret;
+}
+
+static int __init acpi_parse_sbf(struct acpi_table_header *table)
+{
+ struct acpi_table_boot *sb = (struct acpi_table_boot *)table;
sbf_port = sb->cmos_index; /* Save CMOS port */
@@ -693,21 +887,14 @@ static int __init acpi_parse_sbf(struct acpi_table_header *table)
#ifdef CONFIG_HPET_TIMER
#include <asm/hpet.h>
-static struct __initdata resource *hpet_res;
+static struct resource *hpet_res __initdata;
static int __init acpi_parse_hpet(struct acpi_table_header *table)
{
- struct acpi_table_hpet *hpet_tbl;
-
- hpet_tbl = (struct acpi_table_hpet *)table;
- if (!hpet_tbl) {
- printk(KERN_WARNING PREFIX "Unable to map HPET\n");
- return -ENODEV;
- }
+ struct acpi_table_hpet *hpet_tbl = (struct acpi_table_hpet *)table;
if (hpet_tbl->address.space_id != ACPI_SPACE_MEM) {
- printk(KERN_WARNING PREFIX "HPET timers must be located in "
- "memory.\n");
+ pr_warn("HPET timers must be located in memory.\n");
return -1;
}
@@ -719,9 +906,7 @@ static int __init acpi_parse_hpet(struct acpi_table_header *table)
* want to allocate a resource there.
*/
if (!hpet_address) {
- printk(KERN_WARNING PREFIX
- "HPET id: %#x base: %#lx is invalid\n",
- hpet_tbl->id, hpet_address);
+ pr_warn("HPET id: %#x base: %#lx is invalid\n", hpet_tbl->id, hpet_address);
return 0;
}
#ifdef CONFIG_X86_64
@@ -732,28 +917,25 @@ static int __init acpi_parse_hpet(struct acpi_table_header *table)
*/
if (hpet_address == 0xfed0000000000000UL) {
if (!hpet_force_user) {
- printk(KERN_WARNING PREFIX "HPET id: %#x "
- "base: 0xfed0000000000000 is bogus\n "
- "try hpet=force on the kernel command line to "
- "fix it up to 0xfed00000.\n", hpet_tbl->id);
+ pr_warn("HPET id: %#x base: 0xfed0000000000000 is bogus, try hpet=force on the kernel command line to fix it up to 0xfed00000.\n",
+ hpet_tbl->id);
hpet_address = 0;
return 0;
}
- printk(KERN_WARNING PREFIX
- "HPET id: %#x base: 0xfed0000000000000 fixed up "
- "to 0xfed00000.\n", hpet_tbl->id);
+ pr_warn("HPET id: %#x base: 0xfed0000000000000 fixed up to 0xfed00000.\n",
+ hpet_tbl->id);
hpet_address >>= 32;
}
#endif
- printk(KERN_INFO PREFIX "HPET id: %#x base: %#lx\n",
- hpet_tbl->id, hpet_address);
+ pr_info("HPET id: %#x base: %#lx\n", hpet_tbl->id, hpet_address);
/*
* Allocate and initialize the HPET firmware resource for adding into
* the resource tree during the lateinit timeframe.
*/
#define HPET_RESOURCE_NAME_SIZE 9
- hpet_res = alloc_bootmem(sizeof(*hpet_res) + HPET_RESOURCE_NAME_SIZE);
+ hpet_res = memblock_alloc_or_panic(sizeof(*hpet_res) + HPET_RESOURCE_NAME_SIZE,
+ SMP_CACHE_BYTES);
hpet_res->name = (void *)&hpet_res[1];
hpet_res->flags = IORESOURCE_MEM;
@@ -786,6 +968,27 @@ late_initcall(hpet_insert_resource);
static int __init acpi_parse_fadt(struct acpi_table_header *table)
{
+ if (!(acpi_gbl_FADT.boot_flags & ACPI_FADT_LEGACY_DEVICES)) {
+ pr_debug("no legacy devices present\n");
+ x86_platform.legacy.devices.pnpbios = 0;
+ }
+
+ if (acpi_gbl_FADT.header.revision >= FADT2_REVISION_ID &&
+ !(acpi_gbl_FADT.boot_flags & ACPI_FADT_8042) &&
+ x86_platform.legacy.i8042 != X86_LEGACY_I8042_PLATFORM_ABSENT) {
+ pr_debug("i8042 controller is absent\n");
+ x86_platform.legacy.i8042 = X86_LEGACY_I8042_FIRMWARE_ABSENT;
+ }
+
+ if (acpi_gbl_FADT.boot_flags & ACPI_FADT_NO_CMOS_RTC) {
+ pr_debug("not registering RTC platform device\n");
+ x86_platform.legacy.rtc = 0;
+ }
+
+ if (acpi_gbl_FADT.boot_flags & ACPI_FADT_NO_VGA) {
+ pr_debug("probing for VGA not safe\n");
+ x86_platform.legacy.no_vga = 1;
+ }
#ifdef CONFIG_X86_PM_TIMER
/* detect the location of the ACPI PM Timer */
@@ -808,8 +1011,7 @@ static int __init acpi_parse_fadt(struct acpi_table_header *table)
pmtmr_ioport = acpi_gbl_FADT.pm_timer_block;
}
if (pmtmr_ioport)
- printk(KERN_INFO PREFIX "PM-Timer IO Port: %#x\n",
- pmtmr_ioport);
+ pr_info("PM-Timer IO Port: %#x\n", pmtmr_ioport);
#endif
return 0;
}
@@ -820,94 +1022,81 @@ static int __init acpi_parse_fadt(struct acpi_table_header *table)
* returns 0 on success, < 0 on error
*/
-static void __init acpi_register_lapic_address(unsigned long address)
-{
- mp_lapic_addr = address;
-
- set_fixmap_nocache(FIX_APIC_BASE, address);
- if (boot_cpu_physical_apicid == -1U) {
- boot_cpu_physical_apicid = read_apic_id();
- apic_version[boot_cpu_physical_apicid] =
- GET_APIC_VERSION(apic_read(APIC_LVR));
- }
-}
-
static int __init early_acpi_parse_madt_lapic_addr_ovr(void)
{
int count;
- if (!cpu_has_apic)
+ if (!boot_cpu_has(X86_FEATURE_APIC))
return -ENODEV;
/*
* Note that the LAPIC address is obtained from the MADT (32-bit value)
- * and (optionally) overriden by a LAPIC_ADDR_OVR entry (64-bit value).
+ * and (optionally) overridden by a LAPIC_ADDR_OVR entry (64-bit value).
*/
- count =
- acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_OVERRIDE,
- acpi_parse_lapic_addr_ovr, 0);
+ count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_OVERRIDE,
+ acpi_parse_lapic_addr_ovr, 0);
if (count < 0) {
- printk(KERN_ERR PREFIX
- "Error parsing LAPIC address override entry\n");
+ pr_err("Error parsing LAPIC address override entry\n");
return count;
}
- acpi_register_lapic_address(acpi_lapic_addr);
+ register_lapic_address(acpi_lapic_addr);
return count;
}
static int __init acpi_parse_madt_lapic_entries(void)
{
- int count;
- int x2count = 0;
+ int count, x2count = 0;
+ struct acpi_subtable_proc madt_proc[2];
+ int ret;
- if (!cpu_has_apic)
+ if (!boot_cpu_has(X86_FEATURE_APIC))
return -ENODEV;
- /*
- * Note that the LAPIC address is obtained from the MADT (32-bit value)
- * and (optionally) overriden by a LAPIC_ADDR_OVR entry (64-bit value).
- */
-
- count =
- acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_OVERRIDE,
- acpi_parse_lapic_addr_ovr, 0);
- if (count < 0) {
- printk(KERN_ERR PREFIX
- "Error parsing LAPIC address override entry\n");
- return count;
- }
-
- acpi_register_lapic_address(acpi_lapic_addr);
-
count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_SAPIC,
- acpi_parse_sapic, MAX_APICS);
+ acpi_parse_sapic, MAX_LOCAL_APIC);
if (!count) {
- x2count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC,
- acpi_parse_x2apic, MAX_APICS);
- count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC,
- acpi_parse_lapic, MAX_APICS);
+ /* Check if there are valid LAPIC entries */
+ acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC, acpi_check_lapic, MAX_LOCAL_APIC);
+
+ /*
+ * Enumerate the APIC IDs in the order that they appear in the
+ * MADT, no matter LAPIC entry or x2APIC entry is used.
+ */
+ memset(madt_proc, 0, sizeof(madt_proc));
+ madt_proc[0].id = ACPI_MADT_TYPE_LOCAL_APIC;
+ madt_proc[0].handler = acpi_parse_lapic;
+ madt_proc[1].id = ACPI_MADT_TYPE_LOCAL_X2APIC;
+ madt_proc[1].handler = acpi_parse_x2apic;
+ ret = acpi_table_parse_entries_array(ACPI_SIG_MADT,
+ sizeof(struct acpi_table_madt),
+ madt_proc, ARRAY_SIZE(madt_proc), MAX_LOCAL_APIC);
+ if (ret < 0) {
+ pr_err("Error parsing LAPIC/X2APIC entries\n");
+ return ret;
+ }
+ count = madt_proc[0].count;
+ x2count = madt_proc[1].count;
}
if (!count && !x2count) {
- printk(KERN_ERR PREFIX "No LAPIC entries present\n");
+ pr_err("No LAPIC entries present\n");
/* TBD: Cleanup to allow fallback to MPS */
return -ENODEV;
} else if (count < 0 || x2count < 0) {
- printk(KERN_ERR PREFIX "Error parsing LAPIC entry\n");
+ pr_err("Error parsing LAPIC entry\n");
/* TBD: Cleanup to allow fallback to MPS */
return count;
}
- x2count =
- acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC_NMI,
- acpi_parse_x2apic_nmi, 0);
- count =
- acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_NMI, acpi_parse_lapic_nmi, 0);
+ x2count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC_NMI,
+ acpi_parse_x2apic_nmi, 0);
+ count = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_NMI,
+ acpi_parse_lapic_nmi, 0);
if (count < 0 || x2count < 0) {
- printk(KERN_ERR PREFIX "Error parsing LAPIC NMI entry\n");
+ pr_err("Error parsing LAPIC NMI entry\n");
/* TBD: Cleanup to allow fallback to MPS */
return count;
}
@@ -916,100 +1105,25 @@ static int __init acpi_parse_madt_lapic_entries(void)
#endif /* CONFIG_X86_LOCAL_APIC */
#ifdef CONFIG_X86_IO_APIC
-#define MP_ISA_BUS 0
-
-#ifdef CONFIG_X86_ES7000
-extern int es7000_plat;
-#endif
-
-static void assign_to_mp_irq(struct mpc_intsrc *m,
- struct mpc_intsrc *mp_irq)
-{
- memcpy(mp_irq, m, sizeof(struct mpc_intsrc));
-}
-
-static int mp_irq_cmp(struct mpc_intsrc *mp_irq,
- struct mpc_intsrc *m)
-{
- return memcmp(mp_irq, m, sizeof(struct mpc_intsrc));
-}
-
-static void save_mp_irq(struct mpc_intsrc *m)
-{
- int i;
-
- for (i = 0; i < mp_irq_entries; i++) {
- if (!mp_irq_cmp(&mp_irqs[i], m))
- return;
- }
-
- assign_to_mp_irq(m, &mp_irqs[mp_irq_entries]);
- if (++mp_irq_entries == MAX_IRQ_SOURCES)
- panic("Max # of irq sources exceeded!!\n");
-}
-
-void __init mp_override_legacy_irq(u8 bus_irq, u8 polarity, u8 trigger, u32 gsi)
-{
- int ioapic;
- int pin;
- struct mpc_intsrc mp_irq;
-
- /*
- * Convert 'gsi' to 'ioapic.pin'.
- */
- ioapic = mp_find_ioapic(gsi);
- if (ioapic < 0)
- return;
- pin = mp_find_ioapic_pin(ioapic, gsi);
-
- /*
- * TBD: This check is for faulty timer entries, where the override
- * erroneously sets the trigger to level, resulting in a HUGE
- * increase of timer interrupts!
- */
- if ((bus_irq == 0) && (trigger == 3))
- trigger = 1;
-
- mp_irq.type = MP_INTSRC;
- mp_irq.irqtype = mp_INT;
- mp_irq.irqflag = (trigger << 2) | polarity;
- mp_irq.srcbus = MP_ISA_BUS;
- mp_irq.srcbusirq = bus_irq; /* IRQ */
- mp_irq.dstapic = mp_ioapics[ioapic].apicid; /* APIC ID */
- mp_irq.dstirq = pin; /* INTIN# */
-
- save_mp_irq(&mp_irq);
-
- isa_irq_to_gsi[bus_irq] = gsi;
-}
-
-void __init mp_config_acpi_legacy_irqs(void)
+static void __init mp_config_acpi_legacy_irqs(void)
{
int i;
struct mpc_intsrc mp_irq;
-#if defined (CONFIG_MCA) || defined (CONFIG_EISA)
+#ifdef CONFIG_EISA
/*
* Fabricate the legacy ISA bus (bus #31).
*/
mp_bus_id_to_type[MP_ISA_BUS] = MP_BUS_ISA;
#endif
set_bit(MP_ISA_BUS, mp_bus_not_pci);
- pr_debug("Bus #%d is ISA\n", MP_ISA_BUS);
-
-#ifdef CONFIG_X86_ES7000
- /*
- * Older generations of ES7000 have no legacy identity mappings
- */
- if (es7000_plat == 1)
- return;
-#endif
+ pr_debug("Bus #%d is ISA (nIRQs: %d)\n", MP_ISA_BUS, nr_legacy_irqs());
/*
* Use the default configuration for the IRQs 0-15. Unless
* overridden by (MADT) interrupt source override entries.
*/
- for (i = 0; i < 16; i++) {
+ for (i = 0; i < nr_legacy_irqs(); i++) {
int ioapic, pin;
unsigned int dstapic;
int idx;
@@ -1026,7 +1140,7 @@ void __init mp_config_acpi_legacy_irqs(void)
if (ioapic < 0)
continue;
pin = mp_find_ioapic_pin(ioapic, gsi);
- dstapic = mp_ioapics[ioapic].apicid;
+ dstapic = mpc_ioapic_id(ioapic);
for (idx = 0; idx < mp_irq_entries; idx++) {
struct mpc_intsrc *irq = mp_irqs + idx;
@@ -1041,7 +1155,7 @@ void __init mp_config_acpi_legacy_irqs(void)
}
if (idx != mp_irq_entries) {
- printk(KERN_DEBUG "ACPI: IRQ%d used by override.\n", i);
+ pr_debug("ACPI: IRQ%d used by override.\n", i);
continue; /* IRQ already used */
}
@@ -1053,85 +1167,8 @@ void __init mp_config_acpi_legacy_irqs(void)
mp_irq.srcbusirq = i; /* Identity mapped */
mp_irq.dstirq = pin;
- save_mp_irq(&mp_irq);
- }
-}
-
-static int mp_config_acpi_gsi(struct device *dev, u32 gsi, int trigger,
- int polarity)
-{
-#ifdef CONFIG_X86_MPPARSE
- struct mpc_intsrc mp_irq;
- struct pci_dev *pdev;
- unsigned char number;
- unsigned int devfn;
- int ioapic;
- u8 pin;
-
- if (!acpi_ioapic)
- return 0;
- if (!dev)
- return 0;
- if (dev->bus != &pci_bus_type)
- return 0;
-
- pdev = to_pci_dev(dev);
- number = pdev->bus->number;
- devfn = pdev->devfn;
- pin = pdev->pin;
- /* print the entry should happen on mptable identically */
- mp_irq.type = MP_INTSRC;
- mp_irq.irqtype = mp_INT;
- mp_irq.irqflag = (trigger == ACPI_EDGE_SENSITIVE ? 4 : 0x0c) |
- (polarity == ACPI_ACTIVE_HIGH ? 1 : 3);
- mp_irq.srcbus = number;
- mp_irq.srcbusirq = (((devfn >> 3) & 0x1f) << 2) | ((pin - 1) & 3);
- ioapic = mp_find_ioapic(gsi);
- mp_irq.dstapic = mp_ioapics[ioapic].apicid;
- mp_irq.dstirq = mp_find_ioapic_pin(ioapic, gsi);
-
- save_mp_irq(&mp_irq);
-#endif
- return 0;
-}
-
-int mp_register_gsi(struct device *dev, u32 gsi, int trigger, int polarity)
-{
- int ioapic;
- int ioapic_pin;
- struct io_apic_irq_attr irq_attr;
-
- if (acpi_irq_model != ACPI_IRQ_MODEL_IOAPIC)
- return gsi;
-
- /* Don't set up the ACPI SCI because it's already set up */
- if (acpi_gbl_FADT.sci_interrupt == gsi)
- return gsi;
-
- ioapic = mp_find_ioapic(gsi);
- if (ioapic < 0) {
- printk(KERN_WARNING "No IOAPIC for GSI %u\n", gsi);
- return gsi;
- }
-
- ioapic_pin = mp_find_ioapic_pin(ioapic, gsi);
-
- if (ioapic_pin > MP_MAX_IOAPIC_PIN) {
- printk(KERN_ERR "Invalid reference to IOAPIC pin "
- "%d-%d\n", mp_ioapics[ioapic].apicid,
- ioapic_pin);
- return gsi;
+ mp_save_irq(&mp_irq);
}
-
- if (enable_update_mptable)
- mp_config_acpi_gsi(dev, gsi, trigger, polarity);
-
- set_io_apic_irq_attr(&irq_attr, ioapic, ioapic_pin,
- trigger == ACPI_EDGE_SENSITIVE ? 0 : 1,
- polarity == ACPI_ACTIVE_HIGH ? 0 : 1);
- io_apic_set_pci_routing(dev, gsi_to_irq(gsi), &irq_attr);
-
- return gsi;
}
/*
@@ -1151,35 +1188,32 @@ static int __init acpi_parse_madt_ioapic_entries(void)
if (acpi_disabled || acpi_noirq)
return -ENODEV;
- if (!cpu_has_apic)
+ if (!boot_cpu_has(X86_FEATURE_APIC))
return -ENODEV;
/*
* if "noapic" boot option, don't look for IO-APICs
*/
- if (skip_ioapic_setup) {
- printk(KERN_INFO PREFIX "Skipping IOAPIC probe "
- "due to 'noapic' option.\n");
+ if (ioapic_is_disabled) {
+ pr_info("Skipping IOAPIC probe due to 'noapic' option.\n");
return -ENODEV;
}
- count =
- acpi_table_parse_madt(ACPI_MADT_TYPE_IO_APIC, acpi_parse_ioapic,
- MAX_IO_APICS);
+ count = acpi_table_parse_madt(ACPI_MADT_TYPE_IO_APIC, acpi_parse_ioapic,
+ MAX_IO_APICS);
if (!count) {
- printk(KERN_ERR PREFIX "No IOAPIC entries present\n");
+ pr_err("No IOAPIC entries present\n");
return -ENODEV;
} else if (count < 0) {
- printk(KERN_ERR PREFIX "Error parsing IOAPIC entry\n");
+ pr_err("Error parsing IOAPIC entry\n");
return count;
}
- count =
- acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_parse_int_src_ovr,
- nr_irqs);
+ count = acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
+ acpi_parse_int_src_ovr,
+ irq_get_nr_irqs());
if (count < 0) {
- printk(KERN_ERR PREFIX
- "Error parsing interrupt source overrides entry\n");
+ pr_err("Error parsing interrupt source overrides entry\n");
/* TBD: Cleanup to allow fallback to MPS */
return count;
}
@@ -1187,19 +1221,20 @@ static int __init acpi_parse_madt_ioapic_entries(void)
/*
* If BIOS did not supply an INT_SRC_OVR for the SCI
* pretend we got one so we can set the SCI flags.
+ * But ignore setting up SCI on hardware-reduced platforms.
*/
- if (!acpi_sci_override_gsi)
+ if (acpi_sci_override_gsi == INVALID_ACPI_IRQ && !acpi_gbl_reduced_hardware)
acpi_sci_ioapic_setup(acpi_gbl_FADT.sci_interrupt, 0, 0,
acpi_gbl_FADT.sci_interrupt);
/* Fill in identity legacy mappings where no override */
mp_config_acpi_legacy_irqs();
- count =
- acpi_table_parse_madt(ACPI_MADT_TYPE_NMI_SOURCE, acpi_parse_nmi_src,
- nr_irqs);
+ count = acpi_table_parse_madt(ACPI_MADT_TYPE_NMI_SOURCE,
+ acpi_parse_nmi_src,
+ irq_get_nr_irqs());
if (count < 0) {
- printk(KERN_ERR PREFIX "Error parsing NMI SRC entry\n");
+ pr_err("Error parsing NMI SRC entry\n");
/* TBD: Cleanup to allow fallback to MPS */
return count;
}
@@ -1232,8 +1267,7 @@ static void __init early_acpi_process_madt(void)
/*
* Dell Precision Workstation 410, 610 come here.
*/
- printk(KERN_ERR PREFIX
- "Invalid BIOS MADT, disabling ACPI\n");
+ pr_err("Invalid BIOS MADT, disabling ACPI\n");
disable_acpi();
}
}
@@ -1257,20 +1291,28 @@ static void __init acpi_process_madt(void)
/*
* Parse MADT IO-APIC entries
*/
+ mutex_lock(&acpi_ioapic_lock);
error = acpi_parse_madt_ioapic_entries();
+ mutex_unlock(&acpi_ioapic_lock);
if (!error) {
- acpi_irq_model = ACPI_IRQ_MODEL_IOAPIC;
- acpi_ioapic = 1;
+ acpi_set_irq_model_ioapic();
smp_found_config = 1;
}
+
+#ifdef CONFIG_ACPI_MADT_WAKEUP
+ /*
+ * Parse MADT MP Wake entry.
+ */
+ acpi_table_parse_madt(ACPI_MADT_TYPE_MULTIPROC_WAKEUP,
+ acpi_parse_mp_wake, 1);
+#endif
}
if (error == -EINVAL) {
/*
* Dell Precision Workstation 410, 610 come here.
*/
- printk(KERN_ERR PREFIX
- "Invalid BIOS MADT, disabling ACPI\n");
+ pr_err("Invalid BIOS MADT, disabling ACPI\n");
disable_acpi();
}
} else {
@@ -1280,8 +1322,7 @@ static void __init acpi_process_madt(void)
* Boot with "acpi=off" to use MPS on such a system.
*/
if (smp_found_config) {
- printk(KERN_WARNING PREFIX
- "No APIC-table, disabling MPS\n");
+ pr_warn("No APIC-table, disabling MPS\n");
smp_found_config = 0;
}
}
@@ -1291,11 +1332,9 @@ static void __init acpi_process_madt(void)
* processors, where MPS only supports physical.
*/
if (acpi_lapic && acpi_ioapic)
- printk(KERN_INFO "Using ACPI (MADT) for SMP configuration "
- "information\n");
+ pr_info("Using ACPI (MADT) for SMP configuration information\n");
else if (acpi_lapic)
- printk(KERN_INFO "Using ACPI for processor (LAPIC) "
- "configuration information\n");
+ pr_info("Using ACPI for processor (LAPIC) configuration information\n");
#endif
return;
}
@@ -1303,8 +1342,7 @@ static void __init acpi_process_madt(void)
static int __init disable_acpi_irq(const struct dmi_system_id *d)
{
if (!acpi_force) {
- printk(KERN_NOTICE "%s detected: force use of acpi=noirq\n",
- d->ident);
+ pr_notice("%s detected: force use of acpi=noirq\n", d->ident);
acpi_noirq_set();
}
return 0;
@@ -1313,37 +1351,41 @@ static int __init disable_acpi_irq(const struct dmi_system_id *d)
static int __init disable_acpi_pci(const struct dmi_system_id *d)
{
if (!acpi_force) {
- printk(KERN_NOTICE "%s detected: force use of pci=noacpi\n",
- d->ident);
+ pr_notice("%s detected: force use of pci=noacpi\n", d->ident);
acpi_disable_pci();
}
return 0;
}
+static int __init disable_acpi_xsdt(const struct dmi_system_id *d)
+{
+ if (!acpi_force) {
+ pr_notice("%s detected: force use of acpi=rsdt\n", d->ident);
+ acpi_gbl_do_not_use_xsdt = TRUE;
+ } else {
+ pr_notice("Warning: DMI blacklist says broken, but acpi XSDT forced\n");
+ }
+ return 0;
+}
+
static int __init dmi_disable_acpi(const struct dmi_system_id *d)
{
if (!acpi_force) {
- printk(KERN_NOTICE "%s detected: acpi off\n", d->ident);
+ pr_notice("%s detected: acpi off\n", d->ident);
disable_acpi();
} else {
- printk(KERN_NOTICE
- "Warning: DMI blacklist says broken, but acpi forced\n");
+ pr_notice("Warning: DMI blacklist says broken, but acpi forced\n");
}
return 0;
}
/*
- * Force ignoring BIOS IRQ0 pin2 override
+ * Force ignoring BIOS IRQ0 override
*/
static int __init dmi_ignore_irq0_timer_override(const struct dmi_system_id *d)
{
- /*
- * The ati_ixp4x0_rev() early PCI quirk should have set
- * the acpi_skip_timer_override flag already:
- */
if (!acpi_skip_timer_override) {
- WARN(1, KERN_ERR "ati_ixp4x0 quirk not complete.\n");
- pr_notice("%s detected: Ignoring BIOS IRQ0 pin2 override\n",
+ pr_notice("%s detected: Ignoring BIOS IRQ0 override\n",
d->ident);
acpi_skip_timer_override = 1;
}
@@ -1351,10 +1393,34 @@ static int __init dmi_ignore_irq0_timer_override(const struct dmi_system_id *d)
}
/*
+ * ACPI offers an alternative platform interface model that removes
+ * ACPI hardware requirements for platforms that do not implement
+ * the PC Architecture.
+ *
+ * We initialize the Hardware-reduced ACPI model here:
+ */
+void __init acpi_generic_reduced_hw_init(void)
+{
+ /*
+ * Override x86_init functions and bypass legacy PIC in
+ * hardware reduced ACPI mode.
+ */
+ x86_init.timers.timer_init = x86_init_noop;
+ x86_init.irqs.pre_vector_init = x86_init_noop;
+ legacy_pic = &null_legacy_pic;
+}
+
+static void __init acpi_reduced_hw_init(void)
+{
+ if (acpi_gbl_reduced_hardware)
+ x86_init.acpi.reduced_hw_early_init();
+}
+
+/*
* If your system is blacklisted here, but you find that acpi=force
* works for you, please contact linux-acpi@vger.kernel.org
*/
-static struct dmi_system_id __initdata acpi_dmi_table[] = {
+static const struct dmi_system_id acpi_dmi_table[] __initconst = {
/*
* Boxes that need ACPI disabled
*/
@@ -1425,11 +1491,24 @@ static struct dmi_system_id __initdata acpi_dmi_table[] = {
DMI_MATCH(DMI_PRODUCT_NAME, "TravelMate 360"),
},
},
+ /*
+ * Boxes that need ACPI XSDT use disabled due to corrupted tables
+ */
+ {
+ .callback = disable_acpi_xsdt,
+ .ident = "Advantech DAC-BJ01",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "NEC"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Bearlake CRB Board"),
+ DMI_MATCH(DMI_BIOS_VERSION, "V1.12"),
+ DMI_MATCH(DMI_BIOS_DATE, "02/01/2011"),
+ },
+ },
{}
};
/* second table for DMI checks that should run after early-quirks */
-static struct dmi_system_id __initdata acpi_dmi_table_late[] = {
+static const struct dmi_system_id acpi_dmi_table_late[] __initconst = {
/*
* HP laptops which use a DSDT reporting as HP/SB400/10000,
* which includes some code which overrides all temperature
@@ -1437,7 +1516,7 @@ static struct dmi_system_id __initdata acpi_dmi_table_late[] = {
* is enabled. This input is incorrectly designated the
* ISA IRQ 0 via an interrupt source override even though
* it is wired to the output of the master 8259A and INTIN0
- * is not connected at all. Force ignoring BIOS IRQ0 pin2
+ * is not connected at all. Force ignoring BIOS IRQ0
* override in that cases.
*/
{
@@ -1472,6 +1551,14 @@ static struct dmi_system_id __initdata acpi_dmi_table_late[] = {
DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq 6715b"),
},
},
+ {
+ .callback = dmi_ignore_irq0_timer_override,
+ .ident = "FUJITSU SIEMENS",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "FUJITSU SIEMENS"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "AMILO PRO V2030"),
+ },
+ },
{}
};
@@ -1502,15 +1589,23 @@ void __init acpi_boot_table_init(void)
* If acpi_disabled, bail out
*/
if (acpi_disabled)
- return;
+ return;
/*
* Initialize the ACPI boot-time table parser.
*/
- if (acpi_table_init()) {
+ if (acpi_locate_initial_tables())
disable_acpi();
- return;
- }
+ else
+ acpi_reserve_initial_tables();
+}
+
+int __init early_acpi_boot_init(void)
+{
+ if (acpi_disabled)
+ return 1;
+
+ acpi_table_init_complete();
acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf);
@@ -1519,27 +1614,23 @@ void __init acpi_boot_table_init(void)
*/
if (acpi_blacklisted()) {
if (acpi_force) {
- printk(KERN_WARNING PREFIX "acpi=force override\n");
+ pr_warn("acpi=force override\n");
} else {
- printk(KERN_WARNING PREFIX "Disabling ACPI support\n");
+ pr_warn("Disabling ACPI support\n");
disable_acpi();
- return;
+ return 1;
}
}
-}
-int __init early_acpi_boot_init(void)
-{
/*
- * If acpi_disabled, bail out
+ * Process the Multiple APIC Description Table (MADT), if present
*/
- if (acpi_disabled)
- return 1;
+ early_acpi_process_madt();
/*
- * Process the Multiple APIC Description Table (MADT), if present
+ * Hardware-reduced ACPI mode initialization:
*/
- early_acpi_process_madt();
+ acpi_reduced_hw_init();
return 0;
}
@@ -1568,10 +1659,14 @@ int __init acpi_boot_init(void)
acpi_process_madt();
acpi_table_parse(ACPI_SIG_HPET, acpi_parse_hpet);
+ if (IS_ENABLED(CONFIG_ACPI_BGRT) && !acpi_nobgrt)
+ acpi_table_parse(ACPI_SIG_BGRT, acpi_parse_bgrt);
if (!acpi_noirq)
x86_init.pci.init = pci_acpi_init;
+ /* Do not enable ACPI SPCR console by default */
+ acpi_parse_spcr(earlycon_acpi_spcr_enable, false);
return 0;
}
@@ -1595,15 +1690,19 @@ static int __init parse_acpi(char *arg)
}
/* acpi=rsdt use RSDT instead of XSDT */
else if (strcmp(arg, "rsdt") == 0) {
- acpi_rsdt_forced = 1;
+ acpi_gbl_do_not_use_xsdt = TRUE;
}
/* "acpi=noirq" disables ACPI interrupt routing */
else if (strcmp(arg, "noirq") == 0) {
acpi_noirq_set();
}
- /* "acpi=copy_dsdt" copys DSDT */
+ /* "acpi=copy_dsdt" copies DSDT */
else if (strcmp(arg, "copy_dsdt") == 0) {
acpi_gbl_copy_dsdt_locally = 1;
+ }
+ /* "acpi=nocmcff" disables FF mode for corrected errors */
+ else if (strcmp(arg, "nocmcff") == 0) {
+ acpi_disable_cmcff = 1;
} else {
/* Core will printk when we return error. */
return -EINVAL;
@@ -1612,6 +1711,13 @@ static int __init parse_acpi(char *arg)
}
early_param("acpi", parse_acpi);
+static int __init parse_acpi_bgrt(char *arg)
+{
+ acpi_nobgrt = true;
+ return 0;
+}
+early_param("bgrt_disable", parse_acpi_bgrt);
+
/* FIXME: Using pci= for an ACPI parameter is a travesty. */
static int __init parse_pci(char *arg)
{
@@ -1625,10 +1731,17 @@ int __init acpi_mps_check(void)
{
#if defined(CONFIG_X86_LOCAL_APIC) && !defined(CONFIG_X86_MPPARSE)
/* mptable code is not built-in*/
+
+ /*
+ * Xen disables ACPI in PV DomU guests but it still emulates APIC and
+ * supports SMP. Returning early here ensures that APIC is not disabled
+ * unnecessarily and the guest is not limited to a single vCPU.
+ */
+ if (xen_pv_domain() && !xen_initial_domain())
+ return 0;
+
if (acpi_disabled || acpi_noirq) {
- printk(KERN_WARNING "MPS support code is not built-in.\n"
- "Using acpi=off or acpi=noirq or pci=noacpi "
- "may have problem\n");
+ pr_warn("MPS support code is not built-in, using acpi=off or acpi=noirq or pci=noacpi may have problem\n");
return 1;
}
#endif
@@ -1676,21 +1789,53 @@ early_param("acpi_sci", setup_acpi_sci);
int __acpi_acquire_global_lock(unsigned int *lock)
{
unsigned int old, new, val;
+
+ old = READ_ONCE(*lock);
do {
- old = *lock;
- new = (((old & ~0x3) + 2) + ((old >> 1) & 0x1));
- val = cmpxchg(lock, old, new);
- } while (unlikely (val != old));
- return (new < 3) ? -1 : 0;
+ val = (old >> 1) & 0x1;
+ new = (old & ~0x3) + 2 + val;
+ } while (!try_cmpxchg(lock, &old, new));
+
+ if (val)
+ return 0;
+
+ return -1;
}
int __acpi_release_global_lock(unsigned int *lock)
{
- unsigned int old, new, val;
+ unsigned int old, new;
+
+ old = READ_ONCE(*lock);
do {
- old = *lock;
new = old & ~0x3;
- val = cmpxchg(lock, old, new);
- } while (unlikely (val != old));
+ } while (!try_cmpxchg(lock, &old, new));
return old & 0x1;
}
+
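The lock word manipulated above uses bit 0 as the pending flag and bit 1 as the owned flag. Note the inverted-looking return convention: -1 (non-zero) reports success to ACPI_ACQUIRE_GLOBAL_LOCK(), while 0 means the owned bit was already set, the pending bit is now raised, and the caller must wait for the release notification. A non-atomic toy model of both transitions (the kernel of course uses try_cmpxchg()):

#include <stdio.h>

static int acquire(unsigned int *lock)
{
	unsigned int was_owned = (*lock >> 1) & 0x1;

	*lock = (*lock & ~0x3) + 2 + was_owned;	/* set owned; pend if taken */
	return was_owned ? 0 : -1;		/* -1 == acquired */
}

static int release(unsigned int *lock)
{
	unsigned int pending = *lock & 0x1;

	*lock &= ~0x3;
	return pending;				/* 1: caller must signal waiters */
}

int main(void)
{
	unsigned int lock = 0;
	int ret;

	ret = acquire(&lock);
	printf("acquire -> %d, lock = %#x\n", ret, lock);	/* -1, 0x2 */
	ret = acquire(&lock);
	printf("acquire -> %d, lock = %#x\n", ret, lock);	/*  0, 0x3 */
	ret = release(&lock);
	printf("release -> pending = %d, lock = %#x\n", ret, lock);
	return 0;
}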
+void __init arch_reserve_mem_area(acpi_physical_address addr, size_t size)
+{
+ e820__range_add(addr, size, E820_TYPE_NVS);
+ e820__update_table_print();
+}
+
+void x86_default_set_root_pointer(u64 addr)
+{
+ boot_params.acpi_rsdp_addr = addr;
+}
+
+u64 x86_default_get_root_pointer(void)
+{
+ return boot_params.acpi_rsdp_addr;
+}
+
+#ifdef CONFIG_XEN_PV
+void __iomem *x86_acpi_os_ioremap(acpi_physical_address phys, acpi_size size)
+{
+ return ioremap_cache(phys, size);
+}
+
+void __iomem * (*acpi_os_ioremap)(acpi_physical_address phys, acpi_size size) =
+ x86_acpi_os_ioremap;
+EXPORT_SYMBOL_GPL(acpi_os_ioremap);
+#endif
diff --git a/arch/x86/kernel/acpi/cppc.c b/arch/x86/kernel/acpi/cppc.c
new file mode 100644
index 000000000000..d7c8ef1e354d
--- /dev/null
+++ b/arch/x86/kernel/acpi/cppc.c
@@ -0,0 +1,298 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * cppc.c: CPPC Interface for x86
+ * Copyright (c) 2016, Intel Corporation.
+ */
+
+#include <linux/bitfield.h>
+
+#include <acpi/cppc_acpi.h>
+#include <asm/msr.h>
+#include <asm/processor.h>
+#include <asm/topology.h>
+
+#define CPPC_HIGHEST_PERF_PERFORMANCE 196
+#define CPPC_HIGHEST_PERF_PREFCORE 166
+
+enum amd_pref_core {
+ AMD_PREF_CORE_UNKNOWN = 0,
+ AMD_PREF_CORE_SUPPORTED,
+ AMD_PREF_CORE_UNSUPPORTED,
+};
+static enum amd_pref_core amd_pref_core_detected;
+static u64 boost_numerator;
+
+/* Refer to drivers/acpi/cppc_acpi.c for the description of functions */
+
+bool cpc_supported_by_cpu(void)
+{
+ switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_AMD:
+ case X86_VENDOR_HYGON:
+ if (boot_cpu_data.x86 == 0x19 && ((boot_cpu_data.x86_model <= 0x0f) ||
+ (boot_cpu_data.x86_model >= 0x20 && boot_cpu_data.x86_model <= 0x2f)))
+ return true;
+ else if (boot_cpu_data.x86 == 0x17 &&
+ boot_cpu_data.x86_model >= 0x30 && boot_cpu_data.x86_model <= 0x7f)
+ return true;
+ return boot_cpu_has(X86_FEATURE_CPPC);
+ }
+ return false;
+}
+
+bool cpc_ffh_supported(void)
+{
+ return true;
+}
+
+int cpc_read_ffh(int cpunum, struct cpc_reg *reg, u64 *val)
+{
+ int err;
+
+ err = rdmsrq_safe_on_cpu(cpunum, reg->address, val);
+ if (!err) {
+ u64 mask = GENMASK_ULL(reg->bit_offset + reg->bit_width - 1,
+ reg->bit_offset);
+
+ *val &= mask;
+ *val >>= reg->bit_offset;
+ }
+ return err;
+}
+
+int cpc_write_ffh(int cpunum, struct cpc_reg *reg, u64 val)
+{
+ u64 rd_val;
+ int err;
+
+ err = rdmsrq_safe_on_cpu(cpunum, reg->address, &rd_val);
+ if (!err) {
+ u64 mask = GENMASK_ULL(reg->bit_offset + reg->bit_width - 1,
+ reg->bit_offset);
+
+ val <<= reg->bit_offset;
+ val &= mask;
+ rd_val &= ~mask;
+ rd_val |= val;
+ err = wrmsrq_safe_on_cpu(cpunum, reg->address, rd_val);
+ }
+ return err;
+}
+
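cpc_read_ffh() and cpc_write_ffh() isolate one bitfield of an MSR with a GENMASK_ULL()-built mask. A standalone check of the same mask arithmetic, assuming a hypothetical 8-bit field at bit offset 8 and made-up register contents:

#include <stdint.h>
#include <stdio.h>

#define GENMASK_ULL(h, l) \
	((~0ULL << (l)) & (~0ULL >> (63 - (h))))

int main(void)
{
	unsigned int bit_offset = 8, bit_width = 8;
	uint64_t msr = 0x1122334455667788ULL;	/* fake current MSR value */
	uint64_t val = 0xab;			/* new field contents */
	uint64_t mask = GENMASK_ULL(bit_offset + bit_width - 1, bit_offset);

	msr = (msr & ~mask) | ((val << bit_offset) & mask);
	printf("msr = %#llx\n", (unsigned long long)msr);
	/* prints 0x112233445566ab88: only bits 15..8 changed */
	return 0;
}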
+static void amd_set_max_freq_ratio(void)
+{
+ struct cppc_perf_caps perf_caps;
+ u64 numerator, nominal_perf;
+ u64 perf_ratio;
+ int rc;
+
+ rc = cppc_get_perf_caps(0, &perf_caps);
+ if (rc) {
+ pr_warn("Could not retrieve perf counters (%d)\n", rc);
+ return;
+ }
+
+ rc = amd_get_boost_ratio_numerator(0, &numerator);
+ if (rc) {
+ pr_warn("Could not retrieve highest performance (%d)\n", rc);
+ return;
+ }
+ nominal_perf = perf_caps.nominal_perf;
+
+ if (!nominal_perf) {
+ pr_warn("Could not retrieve nominal performance\n");
+ return;
+ }
+
+ /* midpoint between max_boost and max_P */
+ perf_ratio = (div_u64(numerator * SCHED_CAPACITY_SCALE, nominal_perf) + SCHED_CAPACITY_SCALE) >> 1;
+
+ freq_invariance_set_perf_ratio(perf_ratio, false);
+}
+
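As a worked example with purely illustrative numbers: for numerator = 166 and nominal_perf = 120, div_u64(166 * 1024, 120) = 1416, so perf_ratio = (1416 + 1024) >> 1 = 1220, i.e. roughly 1.19x SCHED_CAPACITY_SCALE: the midpoint between the boost ratio (about 1.38x) and nominal (1.0x).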
+static DEFINE_MUTEX(freq_invariance_lock);
+
+static inline void init_freq_invariance_cppc(void)
+{
+ static bool init_done;
+
+ if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+ return;
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD)
+ return;
+
+ mutex_lock(&freq_invariance_lock);
+ if (!init_done)
+ amd_set_max_freq_ratio();
+ init_done = true;
+ mutex_unlock(&freq_invariance_lock);
+}
+
+void acpi_processor_init_invariance_cppc(void)
+{
+ init_freq_invariance_cppc();
+}
+
+/*
+ * Get the highest performance register value.
+ * @cpu: CPU from which to get highest performance.
+ * @highest_perf: Return address for highest performance value.
+ *
+ * Return: 0 for success, negative error code otherwise.
+ */
+int amd_get_highest_perf(unsigned int cpu, u32 *highest_perf)
+{
+ u64 val;
+ int ret;
+
+ if (cpu_feature_enabled(X86_FEATURE_CPPC)) {
+ ret = rdmsrq_safe_on_cpu(cpu, MSR_AMD_CPPC_CAP1, &val);
+ if (ret)
+ goto out;
+
+ val = FIELD_GET(AMD_CPPC_HIGHEST_PERF_MASK, val);
+ } else {
+ ret = cppc_get_highest_perf(cpu, &val);
+ if (ret)
+ goto out;
+ }
+
+ WRITE_ONCE(*highest_perf, (u32)val);
+out:
+ return ret;
+}
+EXPORT_SYMBOL_GPL(amd_get_highest_perf);
+
+/**
+ * amd_detect_prefcore: Detect if CPUs in the system support preferred cores
+ * @detected: Output variable for the result of the detection.
+ *
+ * Determine whether CPUs in the system support preferred cores. On systems
+ * that support preferred cores, different highest perf values will be found
+ * on different cores. On other systems, the highest perf value will be the
+ * same on all cores.
+ *
+ * The result of the detection will be stored in the 'detected' parameter.
+ *
+ * Return: 0 for success, negative error code otherwise
+ */
+int amd_detect_prefcore(bool *detected)
+{
+ int cpu, count = 0;
+ u64 highest_perf[2] = {0};
+
+ if (WARN_ON(!detected))
+ return -EINVAL;
+
+ switch (amd_pref_core_detected) {
+ case AMD_PREF_CORE_SUPPORTED:
+ *detected = true;
+ return 0;
+ case AMD_PREF_CORE_UNSUPPORTED:
+ *detected = false;
+ return 0;
+ default:
+ break;
+ }
+
+ for_each_online_cpu(cpu) {
+ u32 tmp;
+ int ret;
+
+ ret = amd_get_highest_perf(cpu, &tmp);
+ if (ret)
+ return ret;
+
+ if (!count || (count == 1 && tmp != highest_perf[0]))
+ highest_perf[count++] = tmp;
+
+ if (count == 2)
+ break;
+ }
+
+ *detected = (count == 2);
+ boost_numerator = highest_perf[0];
+
+ amd_pref_core_detected = *detected ? AMD_PREF_CORE_SUPPORTED :
+ AMD_PREF_CORE_UNSUPPORTED;
+
+ pr_debug("AMD CPPC preferred core is %ssupported (highest perf: 0x%llx)\n",
+ *detected ? "" : "un", highest_perf[0]);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(amd_detect_prefcore);
+
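The loop in amd_detect_prefcore() only needs to observe two distinct highest-perf values to conclude that preferred cores exist. A compact userspace model of that loop, over made-up per-CPU readings:

#include <stdio.h>

int main(void)
{
	unsigned int perfs[] = { 166, 166, 196, 166 };	/* fake per-CPU reads */
	unsigned int highest[2] = { 0 };
	unsigned int count = 0;

	for (unsigned int i = 0; i < sizeof(perfs) / sizeof(perfs[0]); i++) {
		unsigned int tmp = perfs[i];

		if (!count || (count == 1 && tmp != highest[0]))
			highest[count++] = tmp;
		if (count == 2)
			break;
	}

	printf("preferred cores %ssupported (highest perf: %u)\n",
	       count == 2 ? "" : "un", highest[0]);
	return 0;
}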
+/**
+ * amd_get_boost_ratio_numerator: Get the numerator to use for boost ratio calculation
+ * @cpu: CPU to get numerator for.
+ * @numerator: Output variable for numerator.
+ *
+ * Determine the numerator to use for calculating the boost ratio on
+ * a CPU. On systems that support preferred cores, this will be a hardcoded
+ * value. On other systems this will be the highest performance register value.
+ *
+ * If booting the system with amd-pstate enabled but preferred cores disabled, then
+ * the correct boost numerator will be returned to match hardware capabilities
+ * even if the preferred cores scheduling hints are not enabled.
+ *
+ * Return: 0 for success, negative error code otherwise.
+ */
+int amd_get_boost_ratio_numerator(unsigned int cpu, u64 *numerator)
+{
+ enum x86_topology_cpu_type core_type = get_topology_cpu_type(&cpu_data(cpu));
+ bool prefcore;
+ int ret;
+ u32 tmp;
+
+ ret = amd_detect_prefcore(&prefcore);
+ if (ret)
+ return ret;
+
+ /* without preferred cores, return the highest perf register value */
+ if (!prefcore) {
+ *numerator = boost_numerator;
+ return 0;
+ }
+
+ /*
+ * For AMD CPUs with Family ID 19H and Model ID range 0x70 to 0x7f,
+ * the highest performance level is set to 196.
+ * https://bugzilla.kernel.org/show_bug.cgi?id=218759
+ */
+ if (cpu_feature_enabled(X86_FEATURE_ZEN4)) {
+ switch (boot_cpu_data.x86_model) {
+ case 0x70 ... 0x7f:
+ *numerator = CPPC_HIGHEST_PERF_PERFORMANCE;
+ return 0;
+ default:
+ break;
+ }
+ }
+
+ /* detect if running on heterogeneous design */
+ if (cpu_feature_enabled(X86_FEATURE_AMD_HTR_CORES)) {
+ switch (core_type) {
+ case TOPO_CPU_TYPE_UNKNOWN:
+ pr_warn("Undefined core type found for cpu %d\n", cpu);
+ break;
+ case TOPO_CPU_TYPE_PERFORMANCE:
+ /* use the max scale for performance cores */
+ *numerator = CPPC_HIGHEST_PERF_PERFORMANCE;
+ return 0;
+ case TOPO_CPU_TYPE_EFFICIENCY:
+ /* use the highest perf value for efficiency cores */
+ ret = amd_get_highest_perf(cpu, &tmp);
+ if (ret)
+ return ret;
+ *numerator = tmp;
+ return 0;
+ }
+ }
+
+ *numerator = CPPC_HIGHEST_PERF_PREFCORE;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(amd_get_boost_ratio_numerator);
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index fb7a5f052e2b..0281703da5e2 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Copyright (C) 2005 Intel Corporation
* Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
@@ -5,14 +6,18 @@
*/
#include <linux/kernel.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/init.h>
#include <linux/acpi.h>
#include <linux/cpu.h>
#include <linux/sched.h>
#include <acpi/processor.h>
-#include <asm/acpi.h>
+#include <asm/cpu_device_id.h>
+#include <asm/cpuid/api.h>
+#include <asm/mwait.h>
+#include <asm/special_insns.h>
+#include <asm/smp.h>
/*
* Initialize bm_flags based on the CPU cache properties
@@ -44,12 +49,61 @@ void acpi_processor_power_init_bm_check(struct acpi_processor_flags *flags,
/*
* On all recent Intel platforms, ARB_DISABLE is a nop.
* So, set bm_control to zero to indicate that ARB_DISABLE
- * is not required while entering C3 type state on
- * P4, Core and beyond CPUs
+ * is not required while entering C3 type state.
*/
if (c->x86_vendor == X86_VENDOR_INTEL &&
- (c->x86 > 0xf || (c->x86 == 6 && c->x86_model >= 0x0f)))
+ (c->x86 > 15 || (c->x86_vfm >= INTEL_CORE2_MEROM && c->x86_vfm <= INTEL_FAM6_LAST)))
+ flags->bm_control = 0;
+
+ if (c->x86_vendor == X86_VENDOR_CENTAUR) {
+ if (c->x86 > 6 || (c->x86 == 6 && c->x86_model == 0x0f &&
+ c->x86_stepping >= 0x0e)) {
+ /*
+ * For all recent Centaur CPUs, the ucode will make sure that each
+ * core can keep cache coherence with each other while entering C3
+ * type state. So, set bm_check to 1 to indicate that the kernel
+ * doesn't need to execute a cache flush operation (WBINVD) when
+ * entering C3 type state.
+ */
+ flags->bm_check = 1;
+ /*
+ * For all recent Centaur platforms, ARB_DISABLE is a nop.
+ * Set bm_control to zero to indicate that ARB_DISABLE is
+ * not required while entering C3 type state.
+ */
flags->bm_control = 0;
+ }
+ }
+
+ if (c->x86_vendor == X86_VENDOR_ZHAOXIN) {
+ /*
+ * All Zhaoxin CPUs that support C3 share cache.
+ * And caches should not be flushed by software while
+ * entering C3 type state.
+ */
+ flags->bm_check = 1;
+ /*
+ * On all recent Zhaoxin platforms, ARB_DISABLE is a nop.
+ * So, set bm_control to zero to indicate that ARB_DISABLE
+ * is not required while entering C3 type state.
+ */
+ flags->bm_control = 0;
+ }
+ if (cpu_feature_enabled(X86_FEATURE_ZEN)) {
+ /*
+ * For all AMD Zen or newer CPUs that support C3, caches
+ * should not be flushed by software while entering C3
+ * type state. Set bm_check to 1 so that the kernel doesn't
+ * need to execute a cache flush operation.
+ */
+ flags->bm_check = 1;
+ /*
+ * In the current AMD C-state implementation ARB_DIS is no longer
+ * used. So set bm_control to zero to indicate ARB_DIS is not
+ * required while entering C3 type state.
+ */
+ flags->bm_control = 0;
+ }
}
EXPORT_SYMBOL(acpi_processor_power_init_bm_check);
@@ -61,20 +115,10 @@ struct cstate_entry {
unsigned int ecx;
} states[ACPI_PROCESSOR_MAX_POWER];
};
-static struct cstate_entry *cpu_cstate_entry; /* per CPU ptr */
+static struct cstate_entry __percpu *cpu_cstate_entry; /* per CPU ptr */
static short mwait_supported[ACPI_PROCESSOR_MAX_POWER];
-#define MWAIT_SUBSTATE_MASK (0xf)
-#define MWAIT_CSTATE_MASK (0xf)
-#define MWAIT_SUBSTATE_SIZE (4)
-
-#define CPUID_MWAIT_LEAF (5)
-#define CPUID5_ECX_EXTENSIONS_SUPPORTED (0x1)
-#define CPUID5_ECX_INTERRUPT_BREAK (0x2)
-
-#define MWAIT_ECX_INTERRUPT_BREAK (0x1)
-
#define NATIVE_CSTATE_BEYOND_HALT (2)
static long acpi_processor_ffh_cstate_probe_cpu(void *_cx)
@@ -86,16 +130,19 @@ static long acpi_processor_ffh_cstate_probe_cpu(void *_cx)
unsigned int cstate_type; /* C-state type and not ACPI C-state type */
unsigned int num_cstate_subtype;
- cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
+ cpuid(CPUID_LEAF_MWAIT, &eax, &ebx, &ecx, &edx);
/* Check whether this particular cx_type (in CST) is supported or not */
- cstate_type = ((cx->address >> MWAIT_SUBSTATE_SIZE) &
- MWAIT_CSTATE_MASK) + 1;
+ cstate_type = (((cx->address >> MWAIT_SUBSTATE_SIZE) &
+ MWAIT_CSTATE_MASK) + 1) & MWAIT_CSTATE_MASK;
edx_part = edx >> (cstate_type * MWAIT_SUBSTATE_SIZE);
num_cstate_subtype = edx_part & MWAIT_SUBSTATE_MASK;
retval = 0;
- if (num_cstate_subtype < (cx->address & MWAIT_SUBSTATE_MASK)) {
+ /* If the HW does not support any sub-states in this C-state */
+ if (num_cstate_subtype == 0) {
+ pr_warn(FW_BUG "ACPI MWAIT C-state 0x%x not supported by HW (0x%x)\n",
+ cx->address, edx_part);
retval = -1;
goto out;
}
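The probe above decodes the _CST FFH address as an MWAIT hint: the high nibble, plus one and wrapped back into the 4-bit range, selects the C-state type, and CPUID leaf 5's EDX carries the sub-state count for that type in the matching 4-bit group. A standalone decode with a hypothetical hint and a fabricated EDX value:

#include <stdio.h>

#define MWAIT_SUBSTATE_MASK	0xf
#define MWAIT_CSTATE_MASK	0xf
#define MWAIT_SUBSTATE_SIZE	4

int main(void)
{
	unsigned int address = 0x20;	/* hypothetical MWAIT hint from _CST */
	unsigned int edx = 0x2000;	/* fabricated CPUID.5 EDX: two C3 sub-states */
	unsigned int cstate_type =
		(((address >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1) &
		MWAIT_CSTATE_MASK;
	unsigned int num_sub =
		(edx >> (cstate_type * MWAIT_SUBSTATE_SIZE)) & MWAIT_SUBSTATE_MASK;

	printf("type C-%u, %u sub-state(s)\n", cstate_type, num_sub);
	return 0;
}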
@@ -110,11 +157,11 @@ static long acpi_processor_ffh_cstate_probe_cpu(void *_cx)
if (!mwait_supported[cstate_type]) {
mwait_supported[cstate_type] = 1;
printk(KERN_DEBUG
- "Monitor-Mwait will be used to enter C-%d "
- "state\n", cx->type);
+ "Monitor-Mwait will be used to enter C-%d state\n",
+ cx->type);
}
snprintf(cx->desc,
- ACPI_CX_DESC_LEN, "ACPI FFH INTEL MWAIT 0x%x",
+ ACPI_CX_DESC_LEN, "ACPI FFH MWAIT 0x%x",
cx->address);
out:
return retval;
@@ -127,7 +174,7 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
struct cpuinfo_x86 *c = &cpu_data(cpu);
long retval;
- if (!cpu_cstate_entry || c->cpuid_level < CPUID_MWAIT_LEAF)
+ if (!cpu_cstate_entry || c->cpuid_level < CPUID_LEAF_MWAIT)
return -1;
if (reg->bit_offset != NATIVE_CSTATE_BEYOND_HALT)
@@ -139,7 +186,8 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
/* Make sure we are running on right CPU */
- retval = work_on_cpu(cpu, acpi_processor_ffh_cstate_probe_cpu, cx);
+ retval = call_on_cpu(cpu, acpi_processor_ffh_cstate_probe_cpu, cx,
+ false);
if (retval == 0) {
/* Use the hint in CST */
percpu_entry->states[cx->index].eax = cx->address;
@@ -158,7 +206,17 @@ int acpi_processor_ffh_cstate_probe(unsigned int cpu,
}
EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
-void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
+void __noreturn acpi_processor_ffh_play_dead(struct acpi_processor_cx *cx)
+{
+ unsigned int cpu = smp_processor_id();
+ struct cstate_entry *percpu_entry;
+
+ percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
+ mwait_play_dead(percpu_entry->states[cx->index].eax);
+}
+EXPORT_SYMBOL_GPL(acpi_processor_ffh_play_dead);
+
+void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
{
unsigned int cpu = smp_processor_id();
struct cstate_entry *percpu_entry;
@@ -172,7 +230,10 @@ EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
static int __init ffh_cstate_init(void)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
- if (c->x86_vendor != X86_VENDOR_INTEL)
+
+ if (c->x86_vendor != X86_VENDOR_INTEL &&
+ c->x86_vendor != X86_VENDOR_AMD &&
+ c->x86_vendor != X86_VENDOR_HYGON)
return -1;
cpu_cstate_entry = alloc_percpu(struct cstate_entry);
diff --git a/arch/x86/kernel/acpi/madt_playdead.S b/arch/x86/kernel/acpi/madt_playdead.S
new file mode 100644
index 000000000000..aefb9cb583ad
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt_playdead.S
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/nospec-branch.h>
+#include <asm/page_types.h>
+#include <asm/processor-flags.h>
+
+ .text
+ .align PAGE_SIZE
+
+/*
+ * asm_acpi_mp_play_dead() - Hand over control of the CPU to the BIOS
+ *
+ * rdi: Address of the ACPI MADT MPWK ResetVector
+ * rsi: PGD of the identity mapping
+ */
+SYM_FUNC_START(asm_acpi_mp_play_dead)
+ ANNOTATE_NOENDBR
+ /* Turn off global entries. Following CR3 write will flush them. */
+ movq %cr4, %rdx
+ andq $~(X86_CR4_PGE), %rdx
+ movq %rdx, %cr4
+
+ /* Switch to identity mapping */
+ movq %rsi, %cr3
+
+ /* Jump to reset vector */
+ ANNOTATE_RETPOLINE_SAFE
+ jmp *%rdi
+SYM_FUNC_END(asm_acpi_mp_play_dead)
diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c
new file mode 100644
index 000000000000..6d7603511f52
--- /dev/null
+++ b/arch/x86/kernel/acpi/madt_wakeup.c
@@ -0,0 +1,249 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <linux/acpi.h>
+#include <linux/cpu.h>
+#include <linux/delay.h>
+#include <linux/io.h>
+#include <linux/kexec.h>
+#include <linux/memblock.h>
+#include <linux/pgtable.h>
+#include <linux/sched/hotplug.h>
+#include <asm/apic.h>
+#include <asm/barrier.h>
+#include <asm/init.h>
+#include <asm/intel_pt.h>
+#include <asm/nmi.h>
+#include <asm/processor.h>
+#include <asm/reboot.h>
+
+/* Physical address of the Multiprocessor Wakeup Structure mailbox */
+static u64 acpi_mp_wake_mailbox_paddr __ro_after_init;
+
+/* Virtual address of the Multiprocessor Wakeup Structure mailbox */
+static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox;
+
+static u64 acpi_mp_pgd __ro_after_init;
+static u64 acpi_mp_reset_vector_paddr __ro_after_init;
+
+static void acpi_mp_stop_this_cpu(void)
+{
+ asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr, acpi_mp_pgd);
+}
+
+static void acpi_mp_play_dead(void)
+{
+ play_dead_common();
+ asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr, acpi_mp_pgd);
+}
+
+static void acpi_mp_cpu_die(unsigned int cpu)
+{
+ u32 apicid = per_cpu(x86_cpu_to_apicid, cpu);
+ unsigned long timeout;
+
+ /*
+ * Use TEST mailbox command to prove that BIOS got control over
+ * the CPU before declaring it dead.
+ *
+ * BIOS has to clear 'command' field of the mailbox.
+ */
+ acpi_mp_wake_mailbox->apic_id = apicid;
+ smp_store_release(&acpi_mp_wake_mailbox->command,
+ ACPI_MP_WAKE_COMMAND_TEST);
+
+ /* Don't wait longer than a second. */
+ timeout = USEC_PER_SEC;
+ while (READ_ONCE(acpi_mp_wake_mailbox->command) && --timeout)
+ udelay(1);
+
+ if (!timeout)
+ pr_err("Failed to hand over CPU %d to BIOS\n", cpu);
+}
+
+/* The argument is required to match type of x86_mapping_info::alloc_pgt_page */
+static void __init *alloc_pgt_page(void *dummy)
+{
+ return memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+}
+
+static void __init free_pgt_page(void *pgt, void *dummy)
+{
+ return memblock_free(pgt, PAGE_SIZE);
+}
+
+static int __init acpi_mp_setup_reset(u64 reset_vector)
+{
+ struct x86_mapping_info info = {
+ .alloc_pgt_page = alloc_pgt_page,
+ .free_pgt_page = free_pgt_page,
+ .page_flag = __PAGE_KERNEL_LARGE_EXEC,
+ .kernpg_flag = _KERNPG_TABLE_NOENC,
+ };
+ unsigned long mstart, mend;
+ pgd_t *pgd;
+
+ pgd = alloc_pgt_page(NULL);
+ if (!pgd)
+ return -ENOMEM;
+
+ for (int i = 0; i < nr_pfn_mapped; i++) {
+ mstart = pfn_mapped[i].start << PAGE_SHIFT;
+ mend = pfn_mapped[i].end << PAGE_SHIFT;
+ if (kernel_ident_mapping_init(&info, pgd, mstart, mend)) {
+ kernel_ident_mapping_free(&info, pgd);
+ return -ENOMEM;
+ }
+ }
+
+ mstart = PAGE_ALIGN_DOWN(reset_vector);
+ mend = mstart + PAGE_SIZE;
+ if (kernel_ident_mapping_init(&info, pgd, mstart, mend)) {
+ kernel_ident_mapping_free(&info, pgd);
+ return -ENOMEM;
+ }
+
+ /*
+ * Make sure asm_acpi_mp_play_dead() is present in the identity mapping
+ * at the same place as in the kernel page tables.
+ * asm_acpi_mp_play_dead() switches to the identity mapping and the
+ * function must be present at the same spot in the virtual address space
+ * before and after switching page tables.
+ */
+ info.offset = __START_KERNEL_map - phys_base;
+ mstart = PAGE_ALIGN_DOWN(__pa(asm_acpi_mp_play_dead));
+ mend = mstart + PAGE_SIZE;
+ if (kernel_ident_mapping_init(&info, pgd, mstart, mend)) {
+ kernel_ident_mapping_free(&info, pgd);
+ return -ENOMEM;
+ }
+
+ smp_ops.play_dead = acpi_mp_play_dead;
+ smp_ops.stop_this_cpu = acpi_mp_stop_this_cpu;
+ smp_ops.cpu_die = acpi_mp_cpu_die;
+
+ acpi_mp_reset_vector_paddr = reset_vector;
+ acpi_mp_pgd = __pa(pgd);
+
+ return 0;
+}
+
+static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip, unsigned int cpu)
+{
+ if (!acpi_mp_wake_mailbox_paddr) {
+ pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. Booting with kexec?\n");
+ return -EOPNOTSUPP;
+ }
+
+ /*
+ * Remap mailbox memory only for the first call to acpi_wakeup_cpu().
+ *
+ * Wakeup of secondary CPUs is fully serialized in the core code.
+ * No need to protect acpi_mp_wake_mailbox from concurrent accesses.
+ */
+ if (!acpi_mp_wake_mailbox) {
+ acpi_mp_wake_mailbox = memremap(acpi_mp_wake_mailbox_paddr,
+ sizeof(*acpi_mp_wake_mailbox),
+ MEMREMAP_WB);
+ if (!acpi_mp_wake_mailbox)
+ return -ENOMEM;
+ }
+
+ /*
+ * Mailbox memory is shared between the firmware and OS. Firmware will
+ * listen on the mailbox command address, and once it receives the wakeup
+ * command, the CPU associated with the given apicid will be booted.
+ *
+ * The values of 'apic_id' and 'wakeup_vector' must be visible to the
+ * firmware before the wakeup command is visible. smp_store_release()
+ * ensures ordering and visibility.
+ */
+ acpi_mp_wake_mailbox->apic_id = apicid;
+ acpi_mp_wake_mailbox->wakeup_vector = start_ip;
+ smp_store_release(&acpi_mp_wake_mailbox->command,
+ ACPI_MP_WAKE_COMMAND_WAKEUP);
+
+ /*
+ * Wait for the CPU to wake up.
+ *
+ * The CPU being woken up is essentially in a spin loop waiting to be
+ * woken up. It should not take long for it to wake up and acknowledge
+ * by zeroing out ->command.
+ *
+ * The ACPI specification doesn't provide any guidance on how long the
+ * kernel has to wait for a wake-up acknowledgment. It also doesn't
+ * provide a way to cancel a wake-up request if it takes too long.
+ *
+ * In a TDX environment, the VMM has control over how long it takes to
+ * wake up a secondary CPU. It can postpone scheduling the secondary
+ * vCPU indefinitely. Giving up on the wake-up request and reporting an
+ * error opens a possible attack vector for the VMM: it can wake up a
+ * secondary CPU when the kernel doesn't expect it. Wait until the
+ * wake-up request succeeds.
+ */
+ while (READ_ONCE(acpi_mp_wake_mailbox->command))
+ cpu_relax();
+
+ return 0;
+}
+
+static void acpi_mp_disable_offlining(struct acpi_madt_multiproc_wakeup *mp_wake)
+{
+ cpu_hotplug_disable_offlining();
+
+ /*
+ * ACPI MADT doesn't allow a CPU to be offlined after it has been
+ * onlined. This limits kexec: the second kernel won't be able to use
+ * more than one CPU.
+ *
+ * To prevent a kexec kernel from onlining secondary CPUs, invalidate
+ * the mailbox address in the ACPI MADT wakeup structure so that the
+ * kexec kernel cannot use it.
+ *
+ * This is safe as the booting kernel has the mailbox address cached
+ * already and acpi_wakeup_cpu() uses the cached value to bring up the
+ * secondary CPUs.
+ *
+ * Note: This is a Linux-specific convention and not covered by the
+ * ACPI specification.
+ */
+ mp_wake->mailbox_address = 0;
+}
+
+int __init acpi_parse_mp_wake(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_madt_multiproc_wakeup *mp_wake;
+
+ mp_wake = (struct acpi_madt_multiproc_wakeup *)header;
+
+ /*
+ * Cannot use the standard BAD_MADT_ENTRY() to sanity check the @mp_wake
+ * entry. 'sizeof(struct acpi_madt_multiproc_wakeup)' can be larger
+ * than the actual size of the MP wakeup entry in the ACPI table because
+ * 'reset_vector' is only available in the V1 MP wakeup structure.
+ */
+ if (!mp_wake)
+ return -EINVAL;
+ if (end - (unsigned long)mp_wake < ACPI_MADT_MP_WAKEUP_SIZE_V0)
+ return -EINVAL;
+ if (mp_wake->header.length < ACPI_MADT_MP_WAKEUP_SIZE_V0)
+ return -EINVAL;
+
+ acpi_table_print_madt_entry(&header->common);
+
+ acpi_mp_wake_mailbox_paddr = mp_wake->mailbox_address;
+
+ if (mp_wake->version >= ACPI_MADT_MP_WAKEUP_VERSION_V1 &&
+ mp_wake->header.length >= ACPI_MADT_MP_WAKEUP_SIZE_V1) {
+ if (acpi_mp_setup_reset(mp_wake->reset_vector)) {
+ pr_warn("Failed to setup MADT reset vector\n");
+ acpi_mp_disable_offlining(mp_wake);
+ }
+ } else {
+ /*
+ * CPU offlining requires version 1 of the ACPI MADT wakeup
+ * structure.
+ */
+ acpi_mp_disable_offlining(mp_wake);
+ }
+
+ apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu);
+
+ return 0;
+}
diff --git a/arch/x86/kernel/acpi/realmode/.gitignore b/arch/x86/kernel/acpi/realmode/.gitignore
deleted file mode 100644
index 58f1f48a58f8..000000000000
--- a/arch/x86/kernel/acpi/realmode/.gitignore
+++ /dev/null
@@ -1,3 +0,0 @@
-wakeup.bin
-wakeup.elf
-wakeup.lds
diff --git a/arch/x86/kernel/acpi/realmode/Makefile b/arch/x86/kernel/acpi/realmode/Makefile
deleted file mode 100644
index 6a564ac67ef5..000000000000
--- a/arch/x86/kernel/acpi/realmode/Makefile
+++ /dev/null
@@ -1,59 +0,0 @@
-#
-# arch/x86/kernel/acpi/realmode/Makefile
-#
-# This file is subject to the terms and conditions of the GNU General Public
-# License. See the file "COPYING" in the main directory of this archive
-# for more details.
-#
-
-always := wakeup.bin
-targets := wakeup.elf wakeup.lds
-
-wakeup-y += wakeup.o wakemain.o video-mode.o copy.o bioscall.o regs.o
-
-# The link order of the video-*.o modules can matter. In particular,
-# video-vga.o *must* be listed first, followed by video-vesa.o.
-# Hardware-specific drivers should follow in the order they should be
-# probed, and video-bios.o should typically be last.
-wakeup-y += video-vga.o
-wakeup-y += video-vesa.o
-wakeup-y += video-bios.o
-
-targets += $(wakeup-y)
-
-bootsrc := $(src)/../../../boot
-
-# ---------------------------------------------------------------------------
-
-# How to compile the 16-bit code. Note we always compile for -march=i386,
-# that way we can complain to the user if the CPU is insufficient.
-# Compile with _SETUP since this is similar to the boot-time setup code.
-KBUILD_CFLAGS := $(LINUXINCLUDE) -g -Os -D_SETUP -D_WAKEUP -D__KERNEL__ \
- -I$(srctree)/$(bootsrc) \
- $(cflags-y) \
- -Wall -Wstrict-prototypes \
- -march=i386 -mregparm=3 \
- -include $(srctree)/$(bootsrc)/code16gcc.h \
- -fno-strict-aliasing -fomit-frame-pointer \
- $(call cc-option, -ffreestanding) \
- $(call cc-option, -fno-toplevel-reorder,\
- $(call cc-option, -fno-unit-at-a-time)) \
- $(call cc-option, -fno-stack-protector) \
- $(call cc-option, -mpreferred-stack-boundary=2)
-KBUILD_CFLAGS += $(call cc-option, -m32)
-KBUILD_AFLAGS := $(KBUILD_CFLAGS) -D__ASSEMBLY__
-GCOV_PROFILE := n
-
-WAKEUP_OBJS = $(addprefix $(obj)/,$(wakeup-y))
-
-LDFLAGS_wakeup.elf := -T
-
-CPPFLAGS_wakeup.lds += -P -C
-
-$(obj)/wakeup.elf: $(obj)/wakeup.lds $(WAKEUP_OBJS) FORCE
- $(call if_changed,ld)
-
-OBJCOPYFLAGS_wakeup.bin := -O binary
-
-$(obj)/wakeup.bin: $(obj)/wakeup.elf FORCE
- $(call if_changed,objcopy)
diff --git a/arch/x86/kernel/acpi/realmode/bioscall.S b/arch/x86/kernel/acpi/realmode/bioscall.S
deleted file mode 100644
index f51eb0bb56ce..000000000000
--- a/arch/x86/kernel/acpi/realmode/bioscall.S
+++ /dev/null
@@ -1 +0,0 @@
-#include "../../../boot/bioscall.S"
diff --git a/arch/x86/kernel/acpi/realmode/copy.S b/arch/x86/kernel/acpi/realmode/copy.S
deleted file mode 100644
index dc59ebee69d8..000000000000
--- a/arch/x86/kernel/acpi/realmode/copy.S
+++ /dev/null
@@ -1 +0,0 @@
-#include "../../../boot/copy.S"
diff --git a/arch/x86/kernel/acpi/realmode/regs.c b/arch/x86/kernel/acpi/realmode/regs.c
deleted file mode 100644
index 6206033ba202..000000000000
--- a/arch/x86/kernel/acpi/realmode/regs.c
+++ /dev/null
@@ -1 +0,0 @@
-#include "../../../boot/regs.c"
diff --git a/arch/x86/kernel/acpi/realmode/video-bios.c b/arch/x86/kernel/acpi/realmode/video-bios.c
deleted file mode 100644
index 7deabc144a27..000000000000
--- a/arch/x86/kernel/acpi/realmode/video-bios.c
+++ /dev/null
@@ -1 +0,0 @@
-#include "../../../boot/video-bios.c"
diff --git a/arch/x86/kernel/acpi/realmode/video-mode.c b/arch/x86/kernel/acpi/realmode/video-mode.c
deleted file mode 100644
index 328ad209f113..000000000000
--- a/arch/x86/kernel/acpi/realmode/video-mode.c
+++ /dev/null
@@ -1 +0,0 @@
-#include "../../../boot/video-mode.c"
diff --git a/arch/x86/kernel/acpi/realmode/video-vesa.c b/arch/x86/kernel/acpi/realmode/video-vesa.c
deleted file mode 100644
index 9dbb9672226a..000000000000
--- a/arch/x86/kernel/acpi/realmode/video-vesa.c
+++ /dev/null
@@ -1 +0,0 @@
-#include "../../../boot/video-vesa.c"
diff --git a/arch/x86/kernel/acpi/realmode/video-vga.c b/arch/x86/kernel/acpi/realmode/video-vga.c
deleted file mode 100644
index bcc81255f374..000000000000
--- a/arch/x86/kernel/acpi/realmode/video-vga.c
+++ /dev/null
@@ -1 +0,0 @@
-#include "../../../boot/video-vga.c"
diff --git a/arch/x86/kernel/acpi/realmode/wakeup.S b/arch/x86/kernel/acpi/realmode/wakeup.S
deleted file mode 100644
index 580b4e296010..000000000000
--- a/arch/x86/kernel/acpi/realmode/wakeup.S
+++ /dev/null
@@ -1,149 +0,0 @@
-/*
- * ACPI wakeup real mode startup stub
- */
-#include <asm/segment.h>
-#include <asm/msr-index.h>
-#include <asm/page_types.h>
-#include <asm/pgtable_types.h>
-#include <asm/processor-flags.h>
-
- .code16
- .section ".header", "a"
-
-/* This should match the structure in wakeup.h */
- .globl wakeup_header
-wakeup_header:
-video_mode: .short 0 /* Video mode number */
-pmode_return: .byte 0x66, 0xea /* ljmpl */
- .long 0 /* offset goes here */
- .short __KERNEL_CS
-pmode_cr0: .long 0 /* Saved %cr0 */
-pmode_cr3: .long 0 /* Saved %cr3 */
-pmode_cr4: .long 0 /* Saved %cr4 */
-pmode_efer: .quad 0 /* Saved EFER */
-pmode_gdt: .quad 0
-realmode_flags: .long 0
-real_magic: .long 0
-trampoline_segment: .word 0
-_pad1: .byte 0
-wakeup_jmp: .byte 0xea /* ljmpw */
-wakeup_jmp_off: .word 3f
-wakeup_jmp_seg: .word 0
-wakeup_gdt: .quad 0, 0, 0
-signature: .long 0x51ee1111
-
- .text
- .globl _start
- .code16
-wakeup_code:
-_start:
- cli
- cld
-
- /* Apparently some dimwit BIOS programmers don't know how to
- program a PM to RM transition, and we might end up here with
- junk in the data segment descriptor registers. The only way
- to repair that is to go into PM and fix it ourselves... */
- movw $16, %cx
- lgdtl %cs:wakeup_gdt
- movl %cr0, %eax
- orb $X86_CR0_PE, %al
- movl %eax, %cr0
- jmp 1f
-1: ljmpw $8, $2f
-2:
- movw %cx, %ds
- movw %cx, %es
- movw %cx, %ss
- movw %cx, %fs
- movw %cx, %gs
-
- andb $~X86_CR0_PE, %al
- movl %eax, %cr0
- jmp wakeup_jmp
-3:
- /* Set up segments */
- movw %cs, %ax
- movw %ax, %ds
- movw %ax, %es
- movw %ax, %ss
- lidtl wakeup_idt
-
- movl $wakeup_stack_end, %esp
-
- /* Clear the EFLAGS */
- pushl $0
- popfl
-
- /* Check header signature... */
- movl signature, %eax
- cmpl $0x51ee1111, %eax
- jne bogus_real_magic
-
- /* Check we really have everything... */
- movl end_signature, %eax
- cmpl $0x65a22c82, %eax
- jne bogus_real_magic
-
- /* Call the C code */
- calll main
-
- /* Do any other stuff... */
-
-#ifndef CONFIG_64BIT
- /* This could also be done in C code... */
- movl pmode_cr3, %eax
- movl %eax, %cr3
-
- movl pmode_cr4, %ecx
- jecxz 1f
- movl %ecx, %cr4
-1:
- movl pmode_efer, %eax
- movl pmode_efer + 4, %edx
- movl %eax, %ecx
- orl %edx, %ecx
- jz 1f
- movl $0xc0000080, %ecx
- wrmsr
-1:
-
- lgdtl pmode_gdt
-
- /* This really couldn't... */
- movl pmode_cr0, %eax
- movl %eax, %cr0
- jmp pmode_return
-#else
- pushw $0
- pushw trampoline_segment
- pushw $0
- lret
-#endif
-
-bogus_real_magic:
-1:
- hlt
- jmp 1b
-
- .data
- .balign 8
-
- /* This is the standard real-mode IDT */
-wakeup_idt:
- .word 0xffff /* limit */
- .long 0 /* address */
- .word 0
-
- .globl HEAP, heap_end
-HEAP:
- .long wakeup_heap
-heap_end:
- .long wakeup_stack
-
- .bss
-wakeup_heap:
- .space 2048
-wakeup_stack:
- .space 2048
-wakeup_stack_end:
diff --git a/arch/x86/kernel/acpi/realmode/wakeup.lds.S b/arch/x86/kernel/acpi/realmode/wakeup.lds.S
deleted file mode 100644
index 060fff8f5c5b..000000000000
--- a/arch/x86/kernel/acpi/realmode/wakeup.lds.S
+++ /dev/null
@@ -1,64 +0,0 @@
-/*
- * wakeup.ld
- *
- * Linker script for the real-mode wakeup code
- */
-#undef i386
-#include "wakeup.h"
-
-OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
-OUTPUT_ARCH(i386)
-ENTRY(_start)
-
-SECTIONS
-{
- . = 0;
- .text : {
- *(.text*)
- }
-
- . = ALIGN(16);
- .rodata : {
- *(.rodata*)
- }
-
- .videocards : {
- video_cards = .;
- *(.videocards)
- video_cards_end = .;
- }
-
- . = ALIGN(16);
- .data : {
- *(.data*)
- }
-
- .signature : {
- end_signature = .;
- LONG(0x65a22c82)
- }
-
- . = ALIGN(16);
- .bss : {
- __bss_start = .;
- *(.bss)
- __bss_end = .;
- }
-
- . = HEADER_OFFSET;
- .header : {
- *(.header)
- }
-
- . = ALIGN(16);
- _end = .;
-
- /DISCARD/ : {
- *(.note*)
- }
-
- /*
- * The ASSERT() sink to . is intentional, for binutils 2.14 compatibility:
- */
- . = ASSERT(_end <= WAKEUP_SIZE, "Wakeup too big!");
-}
diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index fcc3c61fdecc..91fa262f0e30 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -1,150 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* sleep.c - x86-specific ACPI sleep support.
*
* Copyright (C) 2001-2003 Patrick Mochel
- * Copyright (C) 2001-2003 Pavel Machek <pavel@suse.cz>
+ * Copyright (C) 2001-2003 Pavel Machek <pavel@ucw.cz>
*/
#include <linux/acpi.h>
-#include <linux/bootmem.h>
+#include <linux/memblock.h>
#include <linux/dmi.h>
#include <linux/cpumask.h>
+#include <linux/pgtable.h>
#include <asm/segment.h>
#include <asm/desc.h>
-
-#include "realmode/wakeup.h"
+#include <asm/cacheflush.h>
+#include <asm/realmode.h>
+#include <asm/hypervisor.h>
+#include <asm/msr.h>
+#include <asm/smp.h>
+
+#include <linux/ftrace.h>
+#include "../../realmode/rm/wakeup.h"
#include "sleep.h"
-unsigned long acpi_wakeup_address;
unsigned long acpi_realmode_flags;
-/* address in low memory of the wakeup routine. */
-static unsigned long acpi_realmode;
-
#if defined(CONFIG_SMP) && defined(CONFIG_64BIT)
static char temp_stack[4096];
#endif
/**
- * acpi_save_state_mem - save kernel state
+ * acpi_get_wakeup_address - provide physical address for S3 wakeup
*
- * Create an identity mapped page table and copy the wakeup routine to
- * low memory.
+ * Returns the physical address where the kernel should be resumed after the
+ * system awakes from S3, e.g. for programming into the firmware waking vector.
+ */
+unsigned long acpi_get_wakeup_address(void)
+{
+ return ((unsigned long)(real_mode_header->wakeup_start));
+}
+
+/**
+ * x86_acpi_enter_sleep_state - enter sleep state
+ * @state: Sleep state to enter.
*
- * Note that this is too late to change acpi_wakeup_address.
+ * Wrapper around acpi_enter_sleep_state() to be called by assembly.
*/
-int acpi_save_state_mem(void)
+asmlinkage acpi_status __visible x86_acpi_enter_sleep_state(u8 state)
{
- struct wakeup_header *header;
+ return acpi_enter_sleep_state(state);
+}
- if (!acpi_realmode) {
- printk(KERN_ERR "Could not allocate memory during boot, "
- "S3 disabled\n");
- return -ENOMEM;
- }
- memcpy((void *)acpi_realmode, &wakeup_code_start, WAKEUP_SIZE);
+/**
+ * x86_acpi_suspend_lowlevel - save kernel state
+ *
+ * Create an identity mapped page table and copy the wakeup routine to
+ * low memory.
+ */
+int x86_acpi_suspend_lowlevel(void)
+{
+ struct wakeup_header *header =
+ (struct wakeup_header *) __va(real_mode_header->wakeup_header);
- header = (struct wakeup_header *)(acpi_realmode + HEADER_OFFSET);
- if (header->signature != 0x51ee1111) {
+ if (header->signature != WAKEUP_HEADER_SIGNATURE) {
printk(KERN_ERR "wakeup header does not match\n");
return -EINVAL;
}
header->video_mode = saved_video_mode;
- header->wakeup_jmp_seg = acpi_wakeup_address >> 4;
-
- /*
- * Set up the wakeup GDT. We set these up as Big Real Mode,
- * that is, with limits set to 4 GB. At least the Lenovo
- * Thinkpad X61 is known to need this for the video BIOS
- * initialization quirk to work; this is likely to also
- * be the case for other laptops or integrated video devices.
- */
-
- /* GDT[0]: GDT self-pointer */
- header->wakeup_gdt[0] =
- (u64)(sizeof(header->wakeup_gdt) - 1) +
- ((u64)(acpi_wakeup_address +
- ((char *)&header->wakeup_gdt - (char *)acpi_realmode))
- << 16);
- /* GDT[1]: big real mode-like code segment */
- header->wakeup_gdt[1] =
- GDT_ENTRY(0x809b, acpi_wakeup_address, 0xfffff);
- /* GDT[2]: big real mode-like data segment */
- header->wakeup_gdt[2] =
- GDT_ENTRY(0x8093, acpi_wakeup_address, 0xfffff);
+ header->pmode_behavior = 0;
#ifndef CONFIG_64BIT
- store_gdt((struct desc_ptr *)&header->pmode_gdt);
+ native_store_gdt((struct desc_ptr *)&header->pmode_gdt);
- if (rdmsr_safe(MSR_EFER, &header->pmode_efer_low,
- &header->pmode_efer_high))
- header->pmode_efer_low = header->pmode_efer_high = 0;
+ /*
+ * We have to check that we can write back the value, and not
+ * just read it. At least on 90 nm Pentium M (Family 6, Model
+ * 13), reading an invalid MSR is not guaranteed to trap, see
+ * Erratum X4 in "Intel Pentium M Processor on 90 nm Process
+ * with 2-MB L2 Cache and Intel® Processor A100 and A110 on 90
+ * nm process with 512-KB L2 Cache Specification Update".
+ */
+ if (!rdmsr_safe(MSR_EFER,
+ &header->pmode_efer_low,
+ &header->pmode_efer_high) &&
+ !wrmsr_safe(MSR_EFER,
+ header->pmode_efer_low,
+ header->pmode_efer_high))
+ header->pmode_behavior |= (1 << WAKEUP_BEHAVIOR_RESTORE_EFER);
#endif /* !CONFIG_64BIT */
header->pmode_cr0 = read_cr0();
- header->pmode_cr4 = read_cr4_safe();
+ if (__this_cpu_read(cpu_info.cpuid_level) >= 0) {
+ header->pmode_cr4 = __read_cr4();
+ header->pmode_behavior |= (1 << WAKEUP_BEHAVIOR_RESTORE_CR4);
+ }
+ if (!rdmsr_safe(MSR_IA32_MISC_ENABLE,
+ &header->pmode_misc_en_low,
+ &header->pmode_misc_en_high) &&
+ !wrmsr_safe(MSR_IA32_MISC_ENABLE,
+ header->pmode_misc_en_low,
+ header->pmode_misc_en_high))
+ header->pmode_behavior |=
+ (1 << WAKEUP_BEHAVIOR_RESTORE_MISC_ENABLE);
header->realmode_flags = acpi_realmode_flags;
header->real_magic = 0x12345678;
#ifndef CONFIG_64BIT
header->pmode_entry = (u32)&wakeup_pmode_return;
- header->pmode_cr3 = (u32)(swsusp_pg_dir - __PAGE_OFFSET);
+ header->pmode_cr3 = (u32)__pa_symbol(initial_page_table);
saved_magic = 0x12345678;
#else /* CONFIG_64BIT */
- header->trampoline_segment = setup_trampoline() >> 4;
#ifdef CONFIG_SMP
- stack_start.sp = temp_stack + sizeof(temp_stack);
- early_gdt_descr.address =
- (unsigned long)get_cpu_gdt_table(smp_processor_id());
- initial_gs = per_cpu_offset(smp_processor_id());
+ /*
+ * As each CPU starts up, it will find its own stack pointer
+ * from its current_task->thread.sp. Typically that will be
+ * the idle thread for a newly-started AP, or even the boot
+ * CPU which will find it set to &init_task in the static
+ * per-cpu data.
+ *
+ * Make the resuming CPU use the temporary stack at startup
+ * by setting current->thread.sp to point to that. The true
+ * %rsp will be restored with the rest of the CPU context,
+ * by do_suspend_lowlevel(). And unwinders don't care about
+ * the abuse of ->thread.sp because it's a dead variable
+ * while the thread is running on the CPU anyway; the true
+ * value is in the actual %rsp register.
+ */
+ current->thread.sp = (unsigned long)temp_stack + sizeof(temp_stack);
+ /*
+ * Ensure the CPU knows which one it is when it comes back, if
+ * it isn't in parallel mode and expected to work that out for
+ * itself.
+ */
+ if (!(smpboot_control & STARTUP_PARALLEL_MASK))
+ smpboot_control = smp_processor_id();
#endif
initial_code = (unsigned long)wakeup_long64;
- saved_magic = 0x123456789abcdef0L;
+ saved_magic = 0x123456789abcdef0L;
#endif /* CONFIG_64BIT */
+ /*
+ * Pause/unpause graph tracing around do_suspend_lowlevel as it has
+ * inconsistent call/return info after it jumps to the wakeup vector.
+ */
+ pause_graph_tracing();
+ do_suspend_lowlevel();
+ unpause_graph_tracing();
return 0;
}
-/*
- * acpi_restore_state - undo effects of acpi_save_state_mem
- */
-void acpi_restore_state_mem(void)
-{
-}
-
-
-/**
- * acpi_reserve_wakeup_memory - do _very_ early ACPI initialisation
- *
- * We allocate a page from the first 1MB of memory for the wakeup
- * routine for when we come back from a sleep state. The
- * runtime allocator allows specification of <16MB pages, but not
- * <1MB pages.
- */
-void __init acpi_reserve_wakeup_memory(void)
-{
- unsigned long mem;
-
- if ((&wakeup_code_end - &wakeup_code_start) > WAKEUP_SIZE) {
- printk(KERN_ERR
- "ACPI: Wakeup code way too big, S3 disabled.\n");
- return;
- }
-
- mem = find_e820_area(0, 1<<20, WAKEUP_SIZE, PAGE_SIZE);
-
- if (mem == -1L) {
- printk(KERN_ERR "ACPI: Cannot allocate lowmem, S3 disabled.\n");
- return;
- }
- acpi_realmode = (unsigned long) phys_to_virt(mem);
- acpi_wakeup_address = mem;
- reserve_early(mem, mem + WAKEUP_SIZE, "ACPI WAKEUP");
-}
-
-
static int __init acpi_sleep_setup(char *str)
{
while ((str != NULL) && (*str != '\0')) {
@@ -155,18 +161,19 @@ static int __init acpi_sleep_setup(char *str)
if (strncmp(str, "s3_beep", 7) == 0)
acpi_realmode_flags |= 4;
#ifdef CONFIG_HIBERNATION
+ if (strncmp(str, "s4_hwsig", 8) == 0)
+ acpi_check_s4_hw_signature = 1;
if (strncmp(str, "s4_nohwsig", 10) == 0)
- acpi_no_s4_hw_signature();
- if (strncmp(str, "s4_nonvs", 8) == 0) {
- pr_warning("ACPI: acpi_sleep=s4_nonvs is deprecated, "
- "please use acpi_sleep=nonvs instead");
- acpi_nvs_nosave();
- }
+ acpi_check_s4_hw_signature = 0;
#endif
if (strncmp(str, "nonvs", 5) == 0)
acpi_nvs_nosave();
+ if (strncmp(str, "nonvs_s3", 8) == 0)
+ acpi_nvs_nosave_s3();
if (strncmp(str, "old_ordering", 12) == 0)
acpi_old_suspend_ordering();
+ if (strncmp(str, "nobl", 4) == 0)
+ acpi_sleep_no_blacklist();
str = strchr(str, ',');
if (str != NULL)
str += strspn(str, ", \t");
@@ -175,3 +182,21 @@ static int __init acpi_sleep_setup(char *str)
}
__setup("acpi_sleep=", acpi_sleep_setup);
+
+#if defined(CONFIG_HIBERNATION) && defined(CONFIG_HYPERVISOR_GUEST)
+static int __init init_s4_sigcheck(void)
+{
+ /*
+ * If running on a hypervisor, honour the ACPI specification
+ * by default and trigger a clean reboot when the hardware
+ * signature in FACS is changed after hibernation.
+ */
+ if (acpi_check_s4_hw_signature == -1 &&
+ !hypervisor_is_type(X86_HYPER_NATIVE))
+ acpi_check_s4_hw_signature = 1;
+
+ return 0;
+}
+/* This must happen before acpi_init() which is a subsys initcall */
+arch_initcall(init_s4_sigcheck);
+#endif
diff --git a/arch/x86/kernel/acpi/sleep.h b/arch/x86/kernel/acpi/sleep.h
index adbcbaa6f1df..054c15a2f860 100644
--- a/arch/x86/kernel/acpi/sleep.h
+++ b/arch/x86/kernel/acpi/sleep.h
@@ -1,16 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
/*
* Variables and functions used by the code in sleep.c
*/
-#include <asm/trampoline.h>
-
-extern char wakeup_code_start, wakeup_code_end;
+#include <linux/linkage.h>
extern unsigned long saved_video_mode;
extern long saved_magic;
extern int wakeup_pmode_return;
-extern char swsusp_pg_dir[PAGE_SIZE];
-extern unsigned long acpi_copy_wakeup_routine(unsigned long);
+extern u8 wake_sleep_flags;
+
extern void wakeup_long64(void);
+
+extern void do_suspend_lowlevel(void);
+
+extern int x86_acpi_suspend_lowlevel(void);
+
+asmlinkage acpi_status x86_acpi_enter_sleep_state(u8 state);
diff --git a/arch/x86/kernel/acpi/wakeup_32.S b/arch/x86/kernel/acpi/wakeup_32.S
index 13ab720573e3..cf69081073b5 100644
--- a/arch/x86/kernel/acpi/wakeup_32.S
+++ b/arch/x86/kernel/acpi/wakeup_32.S
@@ -1,24 +1,25 @@
- .section .text..page_aligned
+/* SPDX-License-Identifier: GPL-2.0-only */
+ .text
#include <linux/linkage.h>
#include <asm/segment.h>
#include <asm/page_types.h>
-# Copyright 2003, 2008 Pavel Machek <pavel@suse.cz>, distribute under GPLv2
+# Copyright 2003, 2008 Pavel Machek <pavel@suse.cz>
.code32
ALIGN
-ENTRY(wakeup_pmode_return)
-wakeup_pmode_return:
+SYM_CODE_START(wakeup_pmode_return)
movw $__KERNEL_DS, %ax
movw %ax, %ss
- movw %ax, %ds
- movw %ax, %es
movw %ax, %fs
movw %ax, %gs
+ movw $__USER_DS, %ax
+ movw %ax, %ds
+ movw %ax, %es
+
# reload the gdt, as we need the full 32 bit address
- lgdt saved_gdt
lidt saved_idt
lldt saved_ldt
ljmp $(__KERNEL_CS), $1f
@@ -37,6 +38,7 @@ wakeup_pmode_return:
# jump to place where we left off
movl saved_eip, %eax
jmp *%eax
+SYM_CODE_END(wakeup_pmode_return)
bogus_magic:
jmp bogus_magic
@@ -44,7 +46,6 @@ bogus_magic:
save_registers:
- sgdt saved_gdt
sidt saved_idt
sldt saved_ldt
str saved_tss
@@ -59,7 +60,7 @@ save_registers:
popl saved_context_eflags
movl $ret_point, saved_eip
- ret
+ RET
restore_registers:
@@ -69,13 +70,13 @@ restore_registers:
movl saved_context_edi, %edi
pushl saved_context_eflags
popfl
- ret
+ RET
-ENTRY(do_suspend_lowlevel)
+SYM_CODE_START(do_suspend_lowlevel)
call save_processor_state
call save_registers
pushl $3
- call acpi_enter_sleep_state
+ call x86_acpi_enter_sleep_state
addl $4, %esp
# In case of S3 failure, we'll emerge here. Jump
@@ -85,15 +86,15 @@ ENTRY(do_suspend_lowlevel)
ret_point:
call restore_registers
call restore_processor_state
- ret
+ RET
+SYM_CODE_END(do_suspend_lowlevel)
.data
ALIGN
-ENTRY(saved_magic) .long 0
-ENTRY(saved_eip) .long 0
+SYM_DATA(saved_magic, .long 0)
+saved_eip: .long 0
# saved registers
-saved_gdt: .long 0,0
saved_idt: .long 0,0
saved_ldt: .long 0
saved_tss: .long 0
diff --git a/arch/x86/kernel/acpi/wakeup_64.S b/arch/x86/kernel/acpi/wakeup_64.S
index 8ea5164cbd04..04f561f75e99 100644
--- a/arch/x86/kernel/acpi/wakeup_64.S
+++ b/arch/x86/kernel/acpi/wakeup_64.S
@@ -1,44 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
.text
#include <linux/linkage.h>
+#include <linux/objtool.h>
#include <asm/segment.h>
#include <asm/pgtable_types.h>
#include <asm/page_types.h>
#include <asm/msr.h>
#include <asm/asm-offsets.h>
+#include <asm/frame.h>
+#include <asm/nospec-branch.h>
-# Copyright 2003 Pavel Machek <pavel@suse.cz>, distribute under GPLv2
+# Copyright 2003 Pavel Machek <pavel@suse.cz>
.code64
/*
* Hooray, we are in Long 64-bit mode (but still running in low memory)
*/
-ENTRY(wakeup_long64)
- movq saved_magic, %rax
+SYM_FUNC_START(wakeup_long64)
+ ANNOTATE_NOENDBR
+ movq saved_magic(%rip), %rax
movq $0x123456789abcdef0, %rdx
cmpq %rdx, %rax
- jne bogus_64_magic
+ je 2f
+ /* stop here on a saved_magic mismatch */
+ movq $0xbad6d61676963, %rcx
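+ /* 0xbad followed by ASCII "magic" (0x6d61676963), easy to spot in a register dump */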
+1:
+ jmp 1b
+2:
movw $__KERNEL_DS, %ax
movw %ax, %ss
movw %ax, %ds
movw %ax, %es
movw %ax, %fs
movw %ax, %gs
- movq saved_rsp, %rsp
+ movq saved_rsp(%rip), %rsp
- movq saved_rbx, %rbx
- movq saved_rdi, %rdi
- movq saved_rsi, %rsi
- movq saved_rbp, %rbp
+ movq saved_rbx(%rip), %rbx
+ movq saved_rdi(%rip), %rdi
+ movq saved_rsi(%rip), %rsi
+ movq saved_rbp(%rip), %rbp
- movq saved_rip, %rax
+ movq saved_rip(%rip), %rax
+ ANNOTATE_RETPOLINE_SAFE
jmp *%rax
-ENDPROC(wakeup_long64)
+SYM_FUNC_END(wakeup_long64)
-bogus_64_magic:
- jmp bogus_64_magic
-
-ENTRY(do_suspend_lowlevel)
+SYM_FUNC_START(do_suspend_lowlevel)
+ FRAME_BEGIN
subq $8, %rsp
xorl %eax, %eax
call save_processor_state
@@ -62,23 +71,24 @@ ENTRY(do_suspend_lowlevel)
pushfq
popq pt_regs_flags(%rax)
- movq $resume_point, saved_rip(%rip)
+ movq $.Lresume_point, saved_rip(%rip)
- movq %rsp, saved_rsp
- movq %rbp, saved_rbp
- movq %rbx, saved_rbx
- movq %rdi, saved_rdi
- movq %rsi, saved_rsi
+ movq %rsp, saved_rsp(%rip)
+ movq %rbp, saved_rbp(%rip)
+ movq %rbx, saved_rbx(%rip)
+ movq %rdi, saved_rdi(%rip)
+ movq %rsi, saved_rsi(%rip)
addq $8, %rsp
movl $3, %edi
xorl %eax, %eax
- call acpi_enter_sleep_state
+ call x86_acpi_enter_sleep_state
/* in case something went wrong, restore the machine status and go on */
- jmp resume_point
+ jmp .Lresume_point
.align 4
-resume_point:
+.Lresume_point:
+ ANNOTATE_NOENDBR
/* We don't restore %rax, it must be 0 anyway */
movq $saved_context, %rax
movq saved_context_cr4(%rax), %rbx
@@ -107,18 +117,29 @@ resume_point:
movq pt_regs_r14(%rax), %r14
movq pt_regs_r15(%rax), %r15
+#if defined(CONFIG_KASAN) && defined(CONFIG_KASAN_STACK)
+ /*
+ * The suspend path may have poisoned some areas deeper in the stack,
+ * which we now need to unpoison.
+ */
+ movq %rsp, %rdi
+ call kasan_unpoison_task_stack_below
+#endif
+
xorl %eax, %eax
addq $8, %rsp
+ FRAME_END
jmp restore_processor_state
-ENDPROC(do_suspend_lowlevel)
+SYM_FUNC_END(do_suspend_lowlevel)
+STACK_FRAME_NON_STANDARD do_suspend_lowlevel
.data
-ENTRY(saved_rbp) .quad 0
-ENTRY(saved_rsi) .quad 0
-ENTRY(saved_rdi) .quad 0
-ENTRY(saved_rbx) .quad 0
+saved_rbp: .quad 0
+saved_rsi: .quad 0
+saved_rdi: .quad 0
+saved_rbx: .quad 0
-ENTRY(saved_rip) .quad 0
-ENTRY(saved_rsp) .quad 0
+saved_rip: .quad 0
+saved_rsp: .quad 0
-ENTRY(saved_magic) .quad 0
+SYM_DATA(saved_magic, .quad 0)
diff --git a/arch/x86/kernel/acpi/wakeup_rm.S b/arch/x86/kernel/acpi/wakeup_rm.S
deleted file mode 100644
index 6ff3b5730575..000000000000
--- a/arch/x86/kernel/acpi/wakeup_rm.S
+++ /dev/null
@@ -1,10 +0,0 @@
-/*
- * Wrapper script for the realmode binary as a transport object
- * before copying to low memory.
- */
- .section ".rodata","a"
- .globl wakeup_code_start, wakeup_code_end
-wakeup_code_start:
- .incbin "arch/x86/kernel/acpi/realmode/wakeup.bin"
-wakeup_code_end:
- .size wakeup_code_start, .-wakeup_code_start
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 70237732a6c7..28518371d8bf 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1,45 +1,42 @@
-#include <linux/module.h>
-#include <linux/sched.h>
-#include <linux/mutex.h>
-#include <linux/list.h>
-#include <linux/stringify.h>
-#include <linux/kprobes.h>
-#include <linux/mm.h>
+// SPDX-License-Identifier: GPL-2.0-only
+#define pr_fmt(fmt) "SMP alternatives: " fmt
+
+#include <linux/mmu_context.h>
+#include <linux/perf_event.h>
#include <linux/vmalloc.h>
#include <linux/memory.h>
-#include <linux/stop_machine.h>
-#include <linux/slab.h>
-#include <asm/alternative.h>
-#include <asm/sections.h>
-#include <asm/pgtable.h>
-#include <asm/mce.h>
+#include <linux/execmem.h>
+
+#include <asm/text-patching.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
+#include <asm/ibt.h>
+#include <asm/set_memory.h>
#include <asm/nmi.h>
-#include <asm/vsyscall.h>
-#include <asm/cacheflush.h>
-#include <asm/tlbflush.h>
-#include <asm/io.h>
-#include <asm/fixmap.h>
-#define MAX_PATCH_LEN (255-1)
+int __read_mostly alternatives_patched;
-#ifdef CONFIG_HOTPLUG_CPU
-static int smp_alt_once;
+EXPORT_SYMBOL_GPL(alternatives_patched);
-static int __init bootonly(char *str)
-{
- smp_alt_once = 1;
- return 1;
-}
-__setup("smp-alt-boot", bootonly);
-#else
-#define smp_alt_once 1
-#endif
+#define MAX_PATCH_LEN (255-1)
+
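+/* Debug classes selected by the "debug-alternative=" boot parameter. */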
+#define DA_ALL (~0)
+#define DA_ALT 0x01
+#define DA_RET 0x02
+#define DA_RETPOLINE 0x04
+#define DA_ENDBR 0x08
+#define DA_SMP 0x10
-static int __initdata_or_module debug_alternative;
+static unsigned int debug_alternative;
static int __init debug_alt(char *str)
{
- debug_alternative = 1;
+ if (str && *str == '=')
+ str++;
+
+ if (!str || kstrtouint(str, 0, &debug_alternative))
+ debug_alternative = DA_ALL;
+
return 1;
}
__setup("debug-alternative", debug_alt);
@@ -53,194 +50,2021 @@ static int __init setup_noreplace_smp(char *str)
}
__setup("noreplace-smp", setup_noreplace_smp);
-#ifdef CONFIG_PARAVIRT
-static int __initdata_or_module noreplace_paravirt = 0;
-
-static int __init setup_noreplace_paravirt(char *str)
+#define DPRINTK(type, fmt, args...) \
+do { \
+ if (debug_alternative & DA_##type) \
+ printk(KERN_DEBUG pr_fmt(fmt) "\n", ##args); \
+} while (0)
+
+#define DUMP_BYTES(type, buf, len, fmt, args...) \
+do { \
+ if (unlikely(debug_alternative & DA_##type)) { \
+ int j; \
+ \
+ if (!(len)) \
+ break; \
+ \
+ printk(KERN_DEBUG pr_fmt(fmt), ##args); \
+ for (j = 0; j < (len) - 1; j++) \
+ printk(KERN_CONT "%02hhx ", buf[j]); \
+ printk(KERN_CONT "%02hhx\n", buf[j]); \
+ } \
+} while (0)
+
+static const unsigned char x86nops[] =
{
- noreplace_paravirt = 1;
- return 1;
-}
-__setup("noreplace-paravirt", setup_noreplace_paravirt);
+ BYTES_NOP1,
+ BYTES_NOP2,
+ BYTES_NOP3,
+ BYTES_NOP4,
+ BYTES_NOP5,
+ BYTES_NOP6,
+ BYTES_NOP7,
+ BYTES_NOP8,
+#ifdef CONFIG_64BIT
+ BYTES_NOP9,
+ BYTES_NOP10,
+ BYTES_NOP11,
#endif
+};
-#define DPRINTK(fmt, args...) if (debug_alternative) \
- printk(KERN_DEBUG fmt, args)
-
-#if defined(GENERIC_NOP1) && !defined(CONFIG_X86_64)
-/* Use inline assembly to define this because the nops are defined
- as inline assembly strings in the include files and we cannot
- get them easily into strings. */
-asm("\t" __stringify(__INITRODATA_OR_MODULE) "\nintelnops: "
- GENERIC_NOP1 GENERIC_NOP2 GENERIC_NOP3 GENERIC_NOP4 GENERIC_NOP5 GENERIC_NOP6
- GENERIC_NOP7 GENERIC_NOP8
- "\t.previous");
-extern const unsigned char intelnops[];
-static const unsigned char *const __initconst_or_module
-intel_nops[ASM_NOP_MAX+1] = {
+const unsigned char * const x86_nops[ASM_NOP_MAX+1] =
+{
NULL,
- intelnops,
- intelnops + 1,
- intelnops + 1 + 2,
- intelnops + 1 + 2 + 3,
- intelnops + 1 + 2 + 3 + 4,
- intelnops + 1 + 2 + 3 + 4 + 5,
- intelnops + 1 + 2 + 3 + 4 + 5 + 6,
- intelnops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
+ x86nops,
+ x86nops + 1,
+ x86nops + 1 + 2,
+ x86nops + 1 + 2 + 3,
+ x86nops + 1 + 2 + 3 + 4,
+ x86nops + 1 + 2 + 3 + 4 + 5,
+ x86nops + 1 + 2 + 3 + 4 + 5 + 6,
+ x86nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
+#ifdef CONFIG_64BIT
+ x86nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8,
+ x86nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9,
+ x86nops + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10,
+#endif
};
+
+#ifdef CONFIG_FINEIBT
+static bool cfi_paranoid __ro_after_init;
#endif
-#ifdef K8_NOP1
-asm("\t" __stringify(__INITRODATA_OR_MODULE) "\nk8nops: "
- K8_NOP1 K8_NOP2 K8_NOP3 K8_NOP4 K8_NOP5 K8_NOP6
- K8_NOP7 K8_NOP8
- "\t.previous");
-extern const unsigned char k8nops[];
-static const unsigned char *const __initconst_or_module
-k8_nops[ASM_NOP_MAX+1] = {
- NULL,
- k8nops,
- k8nops + 1,
- k8nops + 1 + 2,
- k8nops + 1 + 2 + 3,
- k8nops + 1 + 2 + 3 + 4,
- k8nops + 1 + 2 + 3 + 4 + 5,
- k8nops + 1 + 2 + 3 + 4 + 5 + 6,
- k8nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-};
+#ifdef CONFIG_MITIGATION_ITS
+
+#ifdef CONFIG_MODULES
+static struct module *its_mod;
#endif
+static void *its_page;
+static unsigned int its_offset;
+struct its_array its_pages;
-#if defined(K7_NOP1) && !defined(CONFIG_X86_64)
-asm("\t" __stringify(__INITRODATA_OR_MODULE) "\nk7nops: "
- K7_NOP1 K7_NOP2 K7_NOP3 K7_NOP4 K7_NOP5 K7_NOP6
- K7_NOP7 K7_NOP8
- "\t.previous");
-extern const unsigned char k7nops[];
-static const unsigned char *const __initconst_or_module
-k7_nops[ASM_NOP_MAX+1] = {
- NULL,
- k7nops,
- k7nops + 1,
- k7nops + 1 + 2,
- k7nops + 1 + 2 + 3,
- k7nops + 1 + 2 + 3 + 4,
- k7nops + 1 + 2 + 3 + 4 + 5,
- k7nops + 1 + 2 + 3 + 4 + 5 + 6,
- k7nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-};
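+/*
+ * Allocate one page of RW executable memory and record it in @pages so
+ * that it can later be made ROX or freed; the __free(execmem) cleanup
+ * releases the page automatically on the error path.
+ */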
+static void *__its_alloc(struct its_array *pages)
+{
+ void *page __free(execmem) = execmem_alloc_rw(EXECMEM_MODULE_TEXT, PAGE_SIZE);
+ if (!page)
+ return NULL;
+
+ void *tmp = krealloc(pages->pages, (pages->num+1) * sizeof(void *),
+ GFP_KERNEL);
+ if (!tmp)
+ return NULL;
+
+ pages->pages = tmp;
+ pages->pages[pages->num++] = page;
+
+ return no_free_ptr(page);
+}
+
+/* Initialize a thunk with the "jmp *reg; int3" instructions. */
+static void *its_init_thunk(void *thunk, int reg)
+{
+ u8 *bytes = thunk;
+ int offset = 0;
+ int i = 0;
+
+#ifdef CONFIG_FINEIBT
+ if (cfi_paranoid) {
+ /*
+ * When ITS uses an indirect branch thunk, the fineibt_paranoid
+ * caller sequence doesn't fit at the caller site. So put the
+ * remaining part of the sequence (UDB + JNE) into the ITS
+ * thunk.
+ */
+ bytes[i++] = 0xd6; /* UDB */
+ bytes[i++] = 0x75; /* JNE */
+ bytes[i++] = 0xfd;
+
+ offset = 1;
+ }
#endif
-#ifdef P6_NOP1
-asm("\t" __stringify(__INITRODATA_OR_MODULE) "\np6nops: "
- P6_NOP1 P6_NOP2 P6_NOP3 P6_NOP4 P6_NOP5 P6_NOP6
- P6_NOP7 P6_NOP8
- "\t.previous");
-extern const unsigned char p6nops[];
-static const unsigned char *const __initconst_or_module
-p6_nops[ASM_NOP_MAX+1] = {
- NULL,
- p6nops,
- p6nops + 1,
- p6nops + 1 + 2,
- p6nops + 1 + 2 + 3,
- p6nops + 1 + 2 + 3 + 4,
- p6nops + 1 + 2 + 3 + 4 + 5,
- p6nops + 1 + 2 + 3 + 4 + 5 + 6,
- p6nops + 1 + 2 + 3 + 4 + 5 + 6 + 7,
-};
+ if (reg >= 8) {
+ bytes[i++] = 0x41; /* REX.B prefix */
+ reg -= 8;
+ }
+ bytes[i++] = 0xff;
+ bytes[i++] = 0xe0 + reg; /* JMP *reg */
+ bytes[i++] = 0xcc;
+
+ return thunk + offset;
+}
+
+static void its_pages_protect(struct its_array *pages)
+{
+ for (int i = 0; i < pages->num; i++) {
+ void *page = pages->pages[i];
+ execmem_restore_rox(page, PAGE_SIZE);
+ }
+}
+
+static void its_fini_core(void)
+{
+ if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
+ its_pages_protect(&its_pages);
+ kfree(its_pages.pages);
+}
+
+#ifdef CONFIG_MODULES
+void its_init_mod(struct module *mod)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_INDIRECT_THUNK_ITS))
+ return;
+
+ mutex_lock(&text_mutex);
+ its_mod = mod;
+ its_page = NULL;
+}
+
+void its_fini_mod(struct module *mod)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_INDIRECT_THUNK_ITS))
+ return;
+
+ WARN_ON_ONCE(its_mod != mod);
+
+ its_mod = NULL;
+ its_page = NULL;
+ mutex_unlock(&text_mutex);
+
+ if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
+ its_pages_protect(&mod->arch.its_pages);
+}
+
+void its_free_mod(struct module *mod)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_INDIRECT_THUNK_ITS))
+ return;
+
+ for (int i = 0; i < mod->arch.its_pages.num; i++) {
+ void *page = mod->arch.its_pages.pages[i];
+ execmem_free(page);
+ }
+ kfree(mod->arch.its_pages.pages);
+}
+#endif /* CONFIG_MODULES */
+
+static void *its_alloc(void)
+{
+ struct its_array *pages = &its_pages;
+ void *page;
+
+#ifdef CONFIG_MODULES
+ if (its_mod)
+ pages = &its_mod->arch.its_pages;
#endif
-#ifdef CONFIG_X86_64
+ page = __its_alloc(pages);
+ if (!page)
+ return NULL;
+
+ if (pages == &its_pages)
+ set_memory_x((unsigned long)page, 1);
+
+ return page;
+}
+
+static void *its_allocate_thunk(int reg)
+{
+ int size = 3 + (reg / 8);
+ void *thunk;
+
+#ifdef CONFIG_FINEIBT
+ /*
+ * The ITS thunk contains an indirect jump and an int3 instruction so
+ * its size is 3 or 4 bytes depending on the register used. If CFI
+ * paranoid is used then 3 extra bytes are added in the ITS thunk to
+ * complete the fineibt_paranoid caller sequence.
+ */
+ if (cfi_paranoid)
+ size += 3;
+#endif
+
+ if (!its_page || (its_offset + size - 1) >= PAGE_SIZE) {
+ its_page = its_alloc();
+ if (!its_page) {
+ pr_err("ITS page allocation failed\n");
+ return NULL;
+ }
+ memset(its_page, INT3_INSN_OPCODE, PAGE_SIZE);
+ its_offset = 32;
+ }
-extern char __vsyscall_0;
-static const unsigned char *const *__init_or_module find_nop_table(void)
+ /*
+ * If the indirect branch instruction will be in the lower half
+ * of a cacheline, then update the offset to reach the upper half.
+ */
+ if ((its_offset + size - 1) % 64 < 32)
+ its_offset = ((its_offset - 1) | 0x3F) + 33;
+
+ thunk = its_page + its_offset;
+ its_offset += size;
+
+ return its_init_thunk(thunk, reg);
+}
+
+u8 *its_static_thunk(int reg)
{
- if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
- boot_cpu_has(X86_FEATURE_NOPL))
- return p6_nops;
- else
- return k8_nops;
+ u8 *thunk = __x86_indirect_its_thunk_array[reg];
+
+#ifdef CONFIG_FINEIBT
+ /* Paranoid thunk starts 2 bytes before */
+ if (cfi_paranoid)
+ return thunk - 2;
+#endif
+ return thunk;
}
-#else /* CONFIG_X86_64 */
+#else
+static inline void its_fini_core(void) {}
+#endif /* CONFIG_MITIGATION_ITS */
+
+/*
+ * Nomenclature for variable names to simplify and clarify this code and ease
+ * any potential staring at it:
+ *
+ * @instr: source address of the original instructions in the kernel text as
+ * generated by the compiler.
+ *
+ * @buf: temporary buffer on which the patching operates. This buffer is
+ * eventually text-poked into the kernel image.
+ *
+ * @replacement/@repl: pointer to the opcodes which are replacing @instr, located
+ * in the .altinstr_replacement section.
+ */
-static const unsigned char *const *__init_or_module find_nop_table(void)
+/*
+ * Fill the buffer with a single effective instruction of size @len.
+ *
+ * In order not to issue an ORC stack depth tracking CFI entry (Call Frame Info)
+ * for every single-byte NOP, try to generate the maximally available NOP of
+ * size <= ASM_NOP_MAX such that only a single CFI entry is generated (vs one for
+ * each single-byte NOP). If the @len to fill out is > ASM_NOP_MAX, pad with INT3 and
+ * *jump* over instead of executing long and daft NOPs.
+ */
+static void add_nop(u8 *buf, unsigned int len)
{
- if (boot_cpu_has(X86_FEATURE_K8))
- return k8_nops;
- else if (boot_cpu_has(X86_FEATURE_K7))
- return k7_nops;
- else if (boot_cpu_has(X86_FEATURE_NOPL))
- return p6_nops;
- else
- return intel_nops;
+ u8 *target = buf + len;
+
+ if (!len)
+ return;
+
+ if (len <= ASM_NOP_MAX) {
+ memcpy(buf, x86_nops[len], len);
+ return;
+ }
+
+ if (len < 128) {
+ __text_gen_insn(buf, JMP8_INSN_OPCODE, buf, target, JMP8_INSN_SIZE);
+ buf += JMP8_INSN_SIZE;
+ } else {
+ __text_gen_insn(buf, JMP32_INSN_OPCODE, buf, target, JMP32_INSN_SIZE);
+ buf += JMP32_INSN_SIZE;
+ }
+
+ for (; buf < target; buf++)
+ *buf = INT3_INSN_OPCODE;
}
-#endif /* CONFIG_X86_64 */
+/*
+ * Find the offset of the first non-NOP instruction starting at @offset
+ * but no further than @len.
+ */
+static int skip_nops(u8 *buf, int offset, int len)
+{
+ struct insn insn;
+
+ for (; offset < len; offset += insn.length) {
+ if (insn_decode_kernel(&insn, &buf[offset]))
+ break;
-/* Use this to add nops to a buffer, then text_poke the whole buffer. */
-static void __init_or_module add_nops(void *insns, unsigned int len)
+ if (!insn_is_nop(&insn))
+ break;
+ }
+
+ return offset;
+}
+
+/*
+ * "noinline" to cause control flow change and thus invalidate I$ and
+ * cause refetch after modification.
+ */
+static void noinline optimize_nops(const u8 * const instr, u8 *buf, size_t len)
{
- const unsigned char *const *noptable = find_nop_table();
+ for (int next, i = 0; i < len; i = next) {
+ struct insn insn;
+
+ if (insn_decode_kernel(&insn, &buf[i]))
+ return;
+
+ next = i + insn.length;
+
+ if (insn_is_nop(&insn)) {
+ int nop = i;
- while (len > 0) {
- unsigned int noplen = len;
- if (noplen > ASM_NOP_MAX)
- noplen = ASM_NOP_MAX;
- memcpy(insns, noptable[noplen], noplen);
- insns += noplen;
- len -= noplen;
+ /* Has the NOP already been optimized? */
+ if (i + insn.length == len)
+ return;
+
+ next = skip_nops(buf, next, len);
+
+ add_nop(buf + nop, next - nop);
+ DUMP_BYTES(ALT, buf, len, "%px: [%d:%d) optimized NOPs: ", instr, nop, next);
+ }
}
}
-extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
-extern s32 __smp_locks[], __smp_locks_end[];
-static void *text_poke_early(void *addr, const void *opcode, size_t len);
+/*
+ * In this context, "source" is where the instructions are placed in the
+ * section .altinstr_replacement, for example during kernel build by the
+ * toolchain.
+ * "Destination" is where the instructions are being patched in by this
+ * machinery.
+ *
+ * The source offset is:
+ *
+ * src_imm = target - src_next_ip (1)
+ *
+ * and the target offset is:
+ *
+ * dst_imm = target - dst_next_ip (2)
+ *
+ * so rework (1) as an expression for target like:
+ *
+ * target = src_imm + src_next_ip (1a)
+ *
+ * and substitute in (2) to get:
+ *
+ * dst_imm = (src_imm + src_next_ip) - dst_next_ip (3)
+ *
+ * Now, since the instruction stream is 'identical' at src and dst (it
+ * is being copied after all) it can be stated that:
+ *
+ * src_next_ip = src + ip_offset
+ * dst_next_ip = dst + ip_offset (4)
+ *
+ * Substitute (4) in (3) and observe ip_offset being cancelled out to
+ * obtain:
+ *
+ * dst_imm = src_imm + (src + ip_offset) - (dst + ip_offset)
+ * = src_imm + src - dst + ip_offset - ip_offset
+ * = src_imm + src - dst (5)
+ *
+ * IOW, only the relative displacement of the code block matters.
+ */
+
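+/*
+ * Add @d_ to the n_-bit signed displacement at @p_, BUGging if the
+ * result no longer fits in n_ bits.
+ */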
+#define apply_reloc_n(n_, p_, d_) \
+ do { \
+ s32 v = *(s##n_ *)(p_); \
+ v += (d_); \
+ BUG_ON((v >> 31) != (v >> (n_-1))); \
+ *(s##n_ *)(p_) = (s##n_)v; \
+ } while (0)
+
+
+static __always_inline
+void apply_reloc(int n, void *ptr, uintptr_t diff)
+{
+ switch (n) {
+ case 1: apply_reloc_n(8, ptr, diff); break;
+ case 2: apply_reloc_n(16, ptr, diff); break;
+ case 4: apply_reloc_n(32, ptr, diff); break;
+ default: BUG();
+ }
+}
+
+static __always_inline
+bool need_reloc(unsigned long offset, u8 *src, size_t src_len)
+{
+ u8 *target = src + offset;
+ /*
+ * If the target is inside the patched block, it's relative to the
+ * block itself and does not need relocation.
+ */
+ return (target < src || target > src + src_len);
+}
+
+static void __apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u8 *repl, size_t repl_len)
+{
+ for (int next, i = 0; i < instrlen; i = next) {
+ struct insn insn;
+
+ if (WARN_ON_ONCE(insn_decode_kernel(&insn, &buf[i])))
+ return;
+
+ next = i + insn.length;
+
+ switch (insn.opcode.bytes[0]) {
+ case 0x0f:
+ if (insn.opcode.bytes[1] < 0x80 ||
+ insn.opcode.bytes[1] > 0x8f)
+ break;
+
+ fallthrough; /* Jcc.d32 */
+ case 0x70 ... 0x7f: /* Jcc.d8 */
+ case JMP8_INSN_OPCODE:
+ case JMP32_INSN_OPCODE:
+ case CALL_INSN_OPCODE:
+ if (need_reloc(next + insn.immediate.value, repl, repl_len)) {
+ apply_reloc(insn.immediate.nbytes,
+ buf + i + insn_offset_immediate(&insn),
+ repl - instr);
+ }
+
+ /*
+ * Where possible, convert JMP.d32 into JMP.d8.
+ */
+ if (insn.opcode.bytes[0] == JMP32_INSN_OPCODE) {
+ s32 imm = insn.immediate.value;
+ imm += repl - instr;
+ imm += JMP32_INSN_SIZE - JMP8_INSN_SIZE;
+ if ((imm >> 31) == (imm >> 7)) {
+ buf[i+0] = JMP8_INSN_OPCODE;
+ buf[i+1] = (s8)imm;
+
+ memset(&buf[i+2], INT3_INSN_OPCODE, insn.length - 2);
+ }
+ }
+ break;
+ }
+
+ if (insn_rip_relative(&insn)) {
+ if (need_reloc(next + insn.displacement.value, repl, repl_len)) {
+ apply_reloc(insn.displacement.nbytes,
+ buf + i + insn_offset_displacement(&insn),
+ repl - instr);
+ }
+ }
+ }
+}
+
+void text_poke_apply_relocation(u8 *buf, const u8 * const instr, size_t instrlen, u8 *repl, size_t repl_len)
+{
+ __apply_relocation(buf, instr, instrlen, repl, repl_len);
+ optimize_nops(instr, buf, instrlen);
+}
-/* Replace instructions with better alternatives for this CPU type.
- This runs before SMP is initialized to avoid SMP problems with
- self modifying code. This implies that assymetric systems where
- APs have less capabilities than the boot processor are not handled.
- Tough. Make sure you disable such features by hand. */
+/* Low-level backend functions usable from alternative code replacements. */
+DEFINE_ASM_FUNC(nop_func, "", .entry.text);
+EXPORT_SYMBOL_GPL(nop_func);
-void __init_or_module apply_alternatives(struct alt_instr *start,
- struct alt_instr *end)
+noinstr void BUG_func(void)
{
- struct alt_instr *a;
- u8 insnbuf[MAX_PATCH_LEN];
+ BUG();
+}
+EXPORT_SYMBOL(BUG_func);
+
+#define CALL_RIP_REL_OPCODE 0xff
+#define CALL_RIP_REL_MODRM 0x15
+
+/*
+ * Rewrite the "call BUG_func" replacement to point to the target of the
+ * indirect pv_ops call "call *disp(%ip)".
+ */
+static unsigned int alt_replace_call(u8 *instr, u8 *insn_buff, struct alt_instr *a)
+{
+ void *target, *bug = &BUG_func;
+ s32 disp;
+
+ if (a->replacementlen != 5 || insn_buff[0] != CALL_INSN_OPCODE) {
+ pr_err("ALT_FLAG_DIRECT_CALL set for a non-call replacement instruction\n");
+ BUG();
+ }
+
+ if (a->instrlen != 6 ||
+ instr[0] != CALL_RIP_REL_OPCODE ||
+ instr[1] != CALL_RIP_REL_MODRM) {
+ pr_err("ALT_FLAG_DIRECT_CALL set for unrecognized indirect call\n");
+ BUG();
+ }
+
+ /* Skip CALL_RIP_REL_OPCODE and CALL_RIP_REL_MODRM */
+ disp = *(s32 *)(instr + 2);
+#ifdef CONFIG_X86_64
+ /* ff 15 00 00 00 00 call *0x0(%rip) */
+ /* target address is stored at "next instruction + disp". */
+ target = *(void **)(instr + a->instrlen + disp);
+#else
+ /* ff 15 00 00 00 00 call *0x0 */
+ /* target address is stored at disp. */
+ target = *(void **)disp;
+#endif
+ if (!target)
+ target = bug;
+
+ /* (BUG_func - .) + (target - BUG_func) := target - . */
+ *(s32 *)(insn_buff + 1) += target - bug;
+
+ if (target == &nop_func)
+ return 0;
+
+ return 5;
+}
+
+static inline u8 * instr_va(struct alt_instr *i)
+{
+ return (u8 *)&i->instr_offset + i->instr_offset;
+}
+
+/*
+ * Replace instructions with better alternatives for this CPU type. This runs
+ * before SMP is initialized to avoid SMP problems with self modifying code.
+ * This implies that asymmetric systems where APs have fewer capabilities than
+ * the boot processor are not handled. Tough. Make sure you disable such
+ * features by hand.
+ *
+ * Marked "noinline" to cause control flow change and thus insn cache
+ * to refetch changed I$ lines.
+ */
+void __init_or_module noinline apply_alternatives(struct alt_instr *start,
+ struct alt_instr *end)
+{
+ u8 insn_buff[MAX_PATCH_LEN];
+ u8 *instr, *replacement;
+ struct alt_instr *a, *b;
+
+ DPRINTK(ALT, "alt table %px, -> %px", start, end);
+
+ /*
+ * KASAN_SHADOW_START is defined using
+ * cpu_feature_enabled(X86_FEATURE_LA57) and is therefore patched here.
+ * During the process, KASAN becomes confused seeing partial LA57
+ * conversion and triggers a false-positive out-of-bound report.
+ *
+ * Disable KASAN until the patching is complete.
+ */
+ kasan_disable_current();
- DPRINTK("%s: alt table %p -> %p\n", __func__, start, end);
+ /*
+ * The scan order should be from start to end. Later scanned
+ * alternative code can overwrite previously scanned alternative code.
+ * Some kernel functions (e.g. memcpy, memset, etc) use this order to
+ * patch code.
+ *
+ * So be careful if you want to change the scan order to any other
+ * order.
+ */
for (a = start; a < end; a++) {
- u8 *instr = a->instr;
- BUG_ON(a->replacementlen > a->instrlen);
- BUG_ON(a->instrlen > sizeof(insnbuf));
- if (!boot_cpu_has(a->cpuid))
+ unsigned int insn_buff_sz = 0;
+
+ /*
+ * In case of nested ALTERNATIVE()s the outer alternative might
+ * add more padding. To ensure consistent patching find the max
+ * padding for all alt_instr entries for this site (nested
+ * alternatives result in consecutive entries).
+ */
+ for (b = a+1; b < end && instr_va(b) == instr_va(a); b++) {
+ u8 len = max(a->instrlen, b->instrlen);
+ a->instrlen = b->instrlen = len;
+ }
+
+ instr = instr_va(a);
+ replacement = (u8 *)&a->repl_offset + a->repl_offset;
+ BUG_ON(a->instrlen > sizeof(insn_buff));
+ BUG_ON(a->cpuid >= (NCAPINTS + NBUGINTS) * 32);
+
+ /*
+ * Patch if either:
+ * - feature is present
+ * - feature is not present but ALT_FLAG_NOT is set, meaning:
+ * patch if the feature is *NOT* present.
+ */
+ if (!boot_cpu_has(a->cpuid) == !(a->flags & ALT_FLAG_NOT)) {
+ memcpy(insn_buff, instr, a->instrlen);
+ optimize_nops(instr, insn_buff, a->instrlen);
+ text_poke_early(instr, insn_buff, a->instrlen);
continue;
-#ifdef CONFIG_X86_64
- /* vsyscall code is not mapped yet. resolve it manually. */
- if (instr >= (u8 *)VSYSCALL_START && instr < (u8*)VSYSCALL_END) {
- instr = __va(instr - (u8*)VSYSCALL_START + (u8*)__pa_symbol(&__vsyscall_0));
- DPRINTK("%s: vsyscall fixup: %p => %p\n",
- __func__, a->instr, instr);
}
+
+ DPRINTK(ALT, "feat: %d*32+%d, old: (%pS (%px) len: %d), repl: (%px, len: %d) flags: 0x%x",
+ a->cpuid >> 5,
+ a->cpuid & 0x1f,
+ instr, instr, a->instrlen,
+ replacement, a->replacementlen, a->flags);
+
+ memcpy(insn_buff, replacement, a->replacementlen);
+ insn_buff_sz = a->replacementlen;
+
+ if (a->flags & ALT_FLAG_DIRECT_CALL)
+ insn_buff_sz = alt_replace_call(instr, insn_buff, a);
+
+ for (; insn_buff_sz < a->instrlen; insn_buff_sz++)
+ insn_buff[insn_buff_sz] = 0x90;
+
+ text_poke_apply_relocation(insn_buff, instr, a->instrlen, replacement, a->replacementlen);
+
+ DUMP_BYTES(ALT, instr, a->instrlen, "%px: old_insn: ", instr);
+ DUMP_BYTES(ALT, replacement, a->replacementlen, "%px: rpl_insn: ", replacement);
+ DUMP_BYTES(ALT, insn_buff, insn_buff_sz, "%px: final_insn: ", instr);
+
+ text_poke_early(instr, insn_buff, insn_buff_sz);
+ }
+
+ kasan_enable_current();
+}
+
+static inline bool is_jcc32(struct insn *insn)
+{
+ /* Jcc.d32 second opcode byte is in the range: 0x80-0x8f */
+ return insn->opcode.bytes[0] == 0x0f && (insn->opcode.bytes[1] & 0xf0) == 0x80;
+}
+
+#if defined(CONFIG_MITIGATION_RETPOLINE) && defined(CONFIG_OBJTOOL)
+
+/*
+ * [CS]{,3} CALL/JMP *%\reg [INT3]*
+ */
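+/*
+ * CALLs may be padded in front with up to three CS segment-override
+ * prefixes; JMPs are padded behind with INT3s. Any bytes still left
+ * over are NOP-padded by the caller.
+ */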
+static int emit_indirect(int op, int reg, u8 *bytes, int len)
+{
+ int cs = 0, bp = 0;
+ int i = 0;
+ u8 modrm;
+
+ /*
+ * Set @len to the excess bytes after writing the instruction.
+ */
+ len -= 2 + (reg >= 8);
+ WARN_ON_ONCE(len < 0);
+
+ switch (op) {
+ case CALL_INSN_OPCODE:
+ modrm = 0x10; /* Reg = 2; CALL r/m */
+ /*
+ * Additional NOP is better than prefix decode penalty.
+ */
+ if (len <= 3)
+ cs = len;
+ break;
+
+ case JMP32_INSN_OPCODE:
+ modrm = 0x20; /* Reg = 4; JMP r/m */
+ bp = len;
+ break;
+
+ default:
+ WARN_ON_ONCE(1);
+ return -1;
+ }
+
+ while (cs--)
+ bytes[i++] = 0x2e; /* CS-prefix */
+
+ if (reg >= 8) {
+ bytes[i++] = 0x41; /* REX.B prefix */
+ reg -= 8;
+ }
+
+ modrm |= 0xc0; /* Mod = 3 */
+ modrm += reg;
+
+ bytes[i++] = 0xff; /* opcode */
+ bytes[i++] = modrm;
+
+ while (bp--)
+ bytes[i++] = 0xcc; /* INT3 */
+
+ return i;
+}
+
+static int __emit_trampoline(void *addr, struct insn *insn, u8 *bytes,
+ void *call_dest, void *jmp_dest)
+{
+ u8 op = insn->opcode.bytes[0];
+ int i = 0;
+
+ /*
+ * Clang does 'weird' Jcc __x86_indirect_thunk_r11 conditional
+ * tail-calls. Deal with them.
+ */
+ if (is_jcc32(insn)) {
+ bytes[i++] = op;
+ op = insn->opcode.bytes[1];
+ goto clang_jcc;
+ }
+
+ if (insn->length == 6)
+ bytes[i++] = 0x2e; /* CS-prefix */
+
+ switch (op) {
+ case CALL_INSN_OPCODE:
+ __text_gen_insn(bytes+i, op, addr+i,
+ call_dest,
+ CALL_INSN_SIZE);
+ i += CALL_INSN_SIZE;
+ break;
+
+ case JMP32_INSN_OPCODE:
+clang_jcc:
+ __text_gen_insn(bytes+i, op, addr+i,
+ jmp_dest,
+ JMP32_INSN_SIZE);
+ i += JMP32_INSN_SIZE;
+ break;
+
+ default:
+ WARN(1, "%pS %px %*ph\n", addr, addr, 6, addr);
+ return -1;
+ }
+
+ WARN_ON_ONCE(i != insn->length);
+
+ return i;
+}
+
+static int emit_call_track_retpoline(void *addr, struct insn *insn, int reg, u8 *bytes)
+{
+ return __emit_trampoline(addr, insn, bytes,
+ __x86_indirect_call_thunk_array[reg],
+ __x86_indirect_jump_thunk_array[reg]);
+}
+
+#ifdef CONFIG_MITIGATION_ITS
+static int emit_its_trampoline(void *addr, struct insn *insn, int reg, u8 *bytes)
+{
+ u8 *thunk = __x86_indirect_its_thunk_array[reg];
+ u8 *tmp = its_allocate_thunk(reg);
+
+ if (tmp)
+ thunk = tmp;
+
+ return __emit_trampoline(addr, insn, bytes, thunk, thunk);
+}
+
+/* Check if an indirect branch is at an ITS-unsafe address */
+static bool cpu_wants_indirect_its_thunk_at(unsigned long addr, int reg)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_INDIRECT_THUNK_ITS))
+ return false;
+
+ /* Indirect branch opcode is 2 or 3 bytes depending on reg */
+ addr += 1 + reg / 8;
+
+ /* Lower-half of the cacheline? */
+ return !(addr & 0x20);
+}
+#else /* CONFIG_MITIGATION_ITS */
+
+#ifdef CONFIG_FINEIBT
+static bool cpu_wants_indirect_its_thunk_at(unsigned long addr, int reg)
+{
+ return false;
+}
#endif
- memcpy(insnbuf, a->replacement, a->replacementlen);
- if (*insnbuf == 0xe8 && a->replacementlen == 5)
- *(s32 *)(insnbuf + 1) += a->replacement - a->instr;
- add_nops(insnbuf + a->replacementlen,
- a->instrlen - a->replacementlen);
- text_poke_early(instr, insnbuf, a->instrlen);
+
+#endif /* CONFIG_MITIGATION_ITS */
+
+/*
+ * Rewrite the compiler generated retpoline thunk calls.
+ *
+ * For spectre_v2=off (!X86_FEATURE_RETPOLINE), rewrite them into immediate
+ * indirect instructions, avoiding the extra indirection.
+ *
+ * For example, convert:
+ *
+ * CALL __x86_indirect_thunk_\reg
+ *
+ * into:
+ *
+ * CALL *%\reg
+ *
+ * It also tries to inline spectre_v2=retpoline,lfence when size permits.
+ */
+static int patch_retpoline(void *addr, struct insn *insn, u8 *bytes)
+{
+ retpoline_thunk_t *target;
+ int reg, ret, i = 0;
+ u8 op, cc;
+
+ target = addr + insn->length + insn->immediate.value;
+ reg = target - __x86_indirect_thunk_array;
+
+ if (WARN_ON_ONCE(reg & ~0xf))
+ return -1;
+
+ /* If anyone ever does: CALL/JMP *%rsp, we're in deep trouble. */
+ BUG_ON(reg == 4);
+
+ if (cpu_feature_enabled(X86_FEATURE_RETPOLINE) &&
+ !cpu_feature_enabled(X86_FEATURE_RETPOLINE_LFENCE)) {
+ if (cpu_feature_enabled(X86_FEATURE_CALL_DEPTH))
+ return emit_call_track_retpoline(addr, insn, reg, bytes);
+
+ return -1;
}
+
+ op = insn->opcode.bytes[0];
+
+ /*
+ * Convert:
+ *
+ * Jcc.d32 __x86_indirect_thunk_\reg
+ *
+ * into:
+ *
+ * Jncc.d8 1f
+ * [ LFENCE ]
+ * JMP *%\reg
+ * [ NOP ]
+ * 1:
+ */
+ if (is_jcc32(insn)) {
+ cc = insn->opcode.bytes[1] & 0xf;
+ cc ^= 1; /* invert condition */
+
+ bytes[i++] = 0x70 + cc; /* Jcc.d8 */
+ bytes[i++] = insn->length - 2; /* sizeof(Jcc.d8) == 2 */
+
+ /* Continue as if: JMP.d32 __x86_indirect_thunk_\reg */
+ op = JMP32_INSN_OPCODE;
+ }
+
+ /*
+ * For RETPOLINE_LFENCE: prepend the indirect CALL/JMP with an LFENCE.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_RETPOLINE_LFENCE)) {
+ bytes[i++] = 0x0f;
+ bytes[i++] = 0xae;
+ bytes[i++] = 0xe8; /* LFENCE */
+ }
+
+#ifdef CONFIG_MITIGATION_ITS
+ /*
+ * Check if the address of the last byte of the emitted indirect branch
+ * is in the lower half of the cacheline. Such branches need ITS mitigation.
+ */
+ if (cpu_wants_indirect_its_thunk_at((unsigned long)addr + i, reg))
+ return emit_its_trampoline(addr, insn, reg, bytes);
+#endif
+
+ ret = emit_indirect(op, reg, bytes + i, insn->length - i);
+ if (ret < 0)
+ return ret;
+ i += ret;
+
+ for (; i < insn->length;)
+ bytes[i++] = BYTES_NOP1;
+
+ return i;
}
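+
+/*
+ * Worked example (illustrative): with retpolines disabled, a 6-byte
+ * "jne.d32 __x86_indirect_thunk_rax" (0f 85 <rel32>) becomes:
+ *
+ *   74 04    je  .+4    (inverted Jcc.d8, skipping the indirect branch)
+ *   ff e0    jmp *%rax
+ *   90 90    nop; nop   (padding back to the original length)
+ */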
-#ifdef CONFIG_SMP
+/*
+ * Generated by 'objtool --retpoline'.
+ */
+void __init_or_module noinline apply_retpolines(s32 *start, s32 *end)
+{
+ s32 *s;
+
+ for (s = start; s < end; s++) {
+ void *addr = (void *)s + *s;
+ struct insn insn;
+ int len, ret;
+ u8 bytes[16];
+ u8 op1, op2;
+ u8 *dest;
+
+ ret = insn_decode_kernel(&insn, addr);
+ if (WARN_ON_ONCE(ret < 0))
+ continue;
+
+ op1 = insn.opcode.bytes[0];
+ op2 = insn.opcode.bytes[1];
+
+ switch (op1) {
+ case 0x70 ... 0x7f: /* Jcc.d8 */
+ /* See cfi_paranoid. */
+ WARN_ON_ONCE(cfi_mode != CFI_FINEIBT);
+ continue;
+
+ case CALL_INSN_OPCODE:
+ case JMP32_INSN_OPCODE:
+ /* Check for cfi_paranoid + ITS */
+ dest = addr + insn.length + insn.immediate.value;
+ if (dest[-1] == 0xd6 && (dest[0] & 0xf0) == 0x70) {
+ WARN_ON_ONCE(cfi_mode != CFI_FINEIBT);
+ continue;
+ }
+ break;
+
+ case 0x0f: /* escape */
+ if (op2 >= 0x80 && op2 <= 0x8f)
+ break;
+ fallthrough;
+ default:
+ WARN_ON_ONCE(1);
+ continue;
+ }
+
+ DPRINTK(RETPOLINE, "retpoline at: %pS (%px) len: %d to: %pS",
+ addr, addr, insn.length,
+ addr + insn.length + insn.immediate.value);
+
+ len = patch_retpoline(addr, &insn, bytes);
+ if (len == insn.length) {
+ optimize_nops(addr, bytes, len);
+ DUMP_BYTES(RETPOLINE, ((u8*)addr), len, "%px: orig: ", addr);
+ DUMP_BYTES(RETPOLINE, ((u8*)bytes), len, "%px: repl: ", addr);
+ text_poke_early(addr, bytes, len);
+ }
+ }
+}
+
+#ifdef CONFIG_MITIGATION_RETHUNK
+
+bool cpu_wants_rethunk(void)
+{
+ return cpu_feature_enabled(X86_FEATURE_RETHUNK);
+}
+
+bool cpu_wants_rethunk_at(void *addr)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_RETHUNK))
+ return false;
+ if (x86_return_thunk != its_return_thunk)
+ return true;
+
+ return !((unsigned long)addr & 0x20);
+}
+
+/*
+ * Rewrite the compiler generated return thunk tail-calls.
+ *
+ * For example, convert:
+ *
+ * JMP __x86_return_thunk
+ *
+ * into:
+ *
+ * RET
+ */
+static int patch_return(void *addr, struct insn *insn, u8 *bytes)
+{
+ int i = 0;
+
+ /* Patch the custom return thunks... */
+ if (cpu_wants_rethunk_at(addr)) {
+ i = JMP32_INSN_SIZE;
+ __text_gen_insn(bytes, JMP32_INSN_OPCODE, addr, x86_return_thunk, i);
+ } else {
+ /* ... or patch them out if not needed. */
+ bytes[i++] = RET_INSN_OPCODE;
+ }
+
+ for (; i < insn->length;)
+ bytes[i++] = INT3_INSN_OPCODE;
+ return i;
+}
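+
+/*
+ * Example (illustrative): a 5-byte "jmp __x86_return_thunk" (e9 <rel32>)
+ * becomes "c3 cc cc cc cc" (RET plus INT3 padding) when no return thunk
+ * is required, or a JMP32 to x86_return_thunk when one is.
+ */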
+
+void __init_or_module noinline apply_returns(s32 *start, s32 *end)
+{
+ s32 *s;
+
+ if (cpu_wants_rethunk())
+ static_call_force_reinit();
+
+ for (s = start; s < end; s++) {
+ void *dest = NULL, *addr = (void *)s + *s;
+ struct insn insn;
+ int len, ret;
+ u8 bytes[16];
+ u8 op;
+
+ ret = insn_decode_kernel(&insn, addr);
+ if (WARN_ON_ONCE(ret < 0))
+ continue;
+
+ op = insn.opcode.bytes[0];
+ if (op == JMP32_INSN_OPCODE)
+ dest = addr + insn.length + insn.immediate.value;
+
+ if (__static_call_fixup(addr, op, dest) ||
+ WARN_ONCE(dest != &__x86_return_thunk,
+ "missing return thunk: %pS-%pS: %*ph",
+ addr, dest, 5, addr))
+ continue;
+
+ DPRINTK(RET, "return thunk at: %pS (%px) len: %d to: %pS",
+ addr, addr, insn.length,
+ addr + insn.length + insn.immediate.value);
+
+ len = patch_return(addr, &insn, bytes);
+ if (len == insn.length) {
+ DUMP_BYTES(RET, ((u8*)addr), len, "%px: orig: ", addr);
+ DUMP_BYTES(RET, ((u8*)bytes), len, "%px: repl: ", addr);
+ text_poke_early(addr, bytes, len);
+ }
+ }
+}
+#else /* !CONFIG_MITIGATION_RETHUNK: */
+void __init_or_module noinline apply_returns(s32 *start, s32 *end) { }
+#endif /* !CONFIG_MITIGATION_RETHUNK */
+
+#else /* !CONFIG_MITIGATION_RETPOLINE || !CONFIG_OBJTOOL */
+
+void __init_or_module noinline apply_retpolines(s32 *start, s32 *end) { }
+void __init_or_module noinline apply_returns(s32 *start, s32 *end) { }
+
+#endif /* !CONFIG_MITIGATION_RETPOLINE || !CONFIG_OBJTOOL */
+
+#ifdef CONFIG_X86_KERNEL_IBT
+
+__noendbr bool is_endbr(u32 *val)
+{
+ u32 endbr;
+
+ __get_kernel_nofault(&endbr, val, u32, Efault);
+ return __is_endbr(endbr);
+
+Efault:
+ return false;
+}
+
+#ifdef CONFIG_FINEIBT
+
+static __noendbr bool exact_endbr(u32 *val)
+{
+ u32 endbr;
+
+ __get_kernel_nofault(&endbr, val, u32, Efault);
+ return endbr == gen_endbr();
+
+Efault:
+ return false;
+}
+
+#endif
+
+static void poison_cfi(void *addr);
+
+static void __init_or_module poison_endbr(void *addr)
+{
+ u32 poison = gen_endbr_poison();
+
+ if (WARN_ON_ONCE(!is_endbr(addr)))
+ return;
+
+ DPRINTK(ENDBR, "ENDBR at: %pS (%px)", addr, addr);
+
+ /*
+ * When we have IBT, the lack of ENDBR will trigger #CP
+ */
+ DUMP_BYTES(ENDBR, ((u8*)addr), 4, "%px: orig: ", addr);
+ DUMP_BYTES(ENDBR, ((u8*)&poison), 4, "%px: repl: ", addr);
+ text_poke_early(addr, &poison, 4);
+}
+
+/*
+ * Generated by: objtool --ibt
+ *
+ * Seal the functions for indirect calls by clobbering the ENDBR instructions
+ * and the kCFI hash value.
+ */
+void __init_or_module noinline apply_seal_endbr(s32 *start, s32 *end)
+{
+ s32 *s;
+
+ for (s = start; s < end; s++) {
+ void *addr = (void *)s + *s;
+
+ poison_endbr(addr);
+ if (IS_ENABLED(CONFIG_FINEIBT))
+ poison_cfi(addr - 16);
+ }
+}
+
+#else /* !CONFIG_X86_KERNEL_IBT: */
+
+void __init_or_module apply_seal_endbr(s32 *start, s32 *end) { }
+
+#endif /* !CONFIG_X86_KERNEL_IBT */
+
+#ifdef CONFIG_CFI_AUTO_DEFAULT
+# define __CFI_DEFAULT CFI_AUTO
+#elif defined(CONFIG_CFI)
+# define __CFI_DEFAULT CFI_KCFI
+#else
+# define __CFI_DEFAULT CFI_OFF
+#endif
+
+enum cfi_mode cfi_mode __ro_after_init = __CFI_DEFAULT;
+static bool cfi_debug __ro_after_init;
+
+#ifdef CONFIG_FINEIBT_BHI
+bool cfi_bhi __ro_after_init = false;
+#endif
+
+#ifdef CONFIG_CFI
+u32 cfi_get_func_hash(void *func)
+{
+ u32 hash;
+
+ func -= cfi_get_offset();
+ switch (cfi_mode) {
+ case CFI_FINEIBT:
+ func += 7;
+ break;
+ case CFI_KCFI:
+ func += 1;
+ break;
+ default:
+ return 0;
+ }
+
+ if (get_kernel_nofault(hash, func))
+ return 0;
+
+ return hash;
+}
+
+int cfi_get_func_arity(void *func)
+{
+ bhi_thunk *target;
+ s32 disp;
+
+ if (cfi_mode != CFI_FINEIBT && !cfi_bhi)
+ return 0;
+
+ if (get_kernel_nofault(disp, func - 4))
+ return 0;
+
+ target = func + disp;
+ return target - __bhi_args;
+}
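+
+/*
+ * Example (illustrative): when the preamble ends with "cs cs call
+ * __bhi_args[N]", the rel32 occupies the 4 bytes before func+0, so
+ * func + disp gives the thunk address and its index in __bhi_args[]
+ * recovers the arity N.
+ */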
+#endif
+
+#ifdef CONFIG_FINEIBT
+
+static bool cfi_rand __ro_after_init = true;
+static u32 cfi_seed __ro_after_init;
+
+/*
+ * Re-hash the CFI hash with a boot-time seed while making sure the result is
+ * not a valid ENDBR instruction.
+ */
+static u32 cfi_rehash(u32 hash)
+{
+ hash ^= cfi_seed;
+ while (unlikely(__is_endbr(hash) || __is_endbr(-hash))) {
+ bool lsb = hash & 1;
+ hash >>= 1;
+ if (lsb)
+ hash ^= 0x80200003;
+ }
+ return hash;
+}
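+
+/*
+ * Example (illustrative, assuming __is_endbr() matches the raw endbr64
+ * dword): if hash ^ cfi_seed came out as 0xfa1e0ff3 (f3 0f 1e fa), the
+ * low bit is set, so one iteration yields (0xfa1e0ff3 >> 1) ^ 0x80200003
+ * == 0xfd2f07fa, which no longer looks like an ENDBR.
+ */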
+
+static __init int cfi_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ while (str) {
+ char *next = strchr(str, ',');
+ if (next) {
+ *next = 0;
+ next++;
+ }
+
+ if (!strcmp(str, "auto")) {
+ cfi_mode = CFI_AUTO;
+ } else if (!strcmp(str, "off")) {
+ cfi_mode = CFI_OFF;
+ cfi_rand = false;
+ } else if (!strcmp(str, "debug")) {
+ cfi_debug = true;
+ } else if (!strcmp(str, "kcfi")) {
+ cfi_mode = CFI_KCFI;
+ } else if (!strcmp(str, "fineibt")) {
+ cfi_mode = CFI_FINEIBT;
+ } else if (!strcmp(str, "norand")) {
+ cfi_rand = false;
+ } else if (!strcmp(str, "warn")) {
+ pr_alert("CFI: mismatch non-fatal!\n");
+ cfi_warn = true;
+ } else if (!strcmp(str, "paranoid")) {
+ if (cfi_mode == CFI_FINEIBT) {
+ cfi_paranoid = true;
+ } else {
+ pr_err("CFI: ignoring paranoid; depends on fineibt.\n");
+ }
+ } else if (!strcmp(str, "bhi")) {
+#ifdef CONFIG_FINEIBT_BHI
+ if (cfi_mode == CFI_FINEIBT) {
+ cfi_bhi = true;
+ } else {
+ pr_err("CFI: ignoring bhi; depends on fineibt.\n");
+ }
+#else
+ pr_err("CFI: ignoring bhi; depends on FINEIBT_BHI=y.\n");
+#endif
+ } else {
+ pr_err("CFI: Ignoring unknown option (%s).", str);
+ }
+
+ str = next;
+ }
+
+ return 0;
+}
+early_param("cfi", cfi_parse_cmdline);
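+
+/*
+ * Example (illustrative): booting with "cfi=fineibt,paranoid,norand"
+ * selects FineIBT with the extra caller-side hash check and disables
+ * boot-time hash randomization.
+ */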
+
+/*
+ * kCFI                                      FineIBT
+ *
+ * __cfi_\func:                              __cfi_\func:
+ *      movl $0x12345678,%eax       // 5         endbr64                 // 4
+ *      nop                                      subl $0x12345678,%eax   // 5
+ *      nop                                      jne.d32,pn \func+3      // 7
+ *      nop
+ *      nop
+ *      nop
+ *      nop
+ *      nop
+ *      nop
+ *      nop
+ *      nop
+ *      nop
+ * \func:                                    \func:
+ *      endbr64                                  nopl -42(%rax)
+ *
+ *
+ * caller:                                   caller:
+ *      movl $(-0x12345678),%r10d   // 6         movl $0x12345678,%eax   // 5
+ *      addl -15(%r11),%r10d        // 4         lea  -0x10(%r11),%r11   // 4
+ *      je   1f                     // 2         nop5                    // 5
+ *      ud2                         // 2
+ * 1:   cs call __x86_indirect_thunk_r11 // 6    call *%r11; nop3;       // 6
+ *
+ *
+ * Notably, the FineIBT sequences are crafted such that branches are presumed
+ * non-taken. This is based on Agner Fog's optimization manual, which states:
+ *
+ * "Make conditional jumps most often not taken: The efficiency and throughput
+ * for not-taken branches is better than for taken branches on most
+ * processors. Therefore, it is good to place the most frequent branch first"
+ */
+
+/*
+ * <fineibt_preamble_start>:
+ * 0: f3 0f 1e fa endbr64
+ * 4: 2d 78 56 34 12 sub $0x12345678, %eax
+ * 9: 2e 0f 85 03 00 00 00 jne,pn 13 <fineibt_preamble_start+0x13>
+ * 10: 0f 1f 40 d6 nopl -0x2a(%rax)
+ *
+ * Note that the JNE target is the 0xD6 byte inside the NOPL; on x86_64 that
+ * byte decodes as UDB and raises #UD.
+ */
+asm( ".pushsection .rodata \n"
+ "fineibt_preamble_start: \n"
+ " endbr64 \n"
+ " subl $0x12345678, %eax \n"
+ "fineibt_preamble_bhi: \n"
+ " cs jne.d32 fineibt_preamble_start+0x13 \n"
+ "#fineibt_func: \n"
+ " nopl -42(%rax) \n"
+ "fineibt_preamble_end: \n"
+ ".popsection\n"
+);
+
+extern u8 fineibt_preamble_start[];
+extern u8 fineibt_preamble_bhi[];
+extern u8 fineibt_preamble_end[];
+
+#define fineibt_preamble_size (fineibt_preamble_end - fineibt_preamble_start)
+#define fineibt_preamble_bhi (fineibt_preamble_bhi - fineibt_preamble_start)
+#define fineibt_preamble_ud 0x13
+#define fineibt_preamble_hash 5
+
+/*
+ * <fineibt_caller_start>:
+ * 0: b8 78 56 34 12 mov $0x12345678, %eax
+ * 5: 4d 8d 5b f0 lea -0x10(%r11), %r11
+ * 9: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
+ */
+asm( ".pushsection .rodata \n"
+ "fineibt_caller_start: \n"
+ " movl $0x12345678, %eax \n"
+ " lea -0x10(%r11), %r11 \n"
+ ASM_NOP5
+ "fineibt_caller_end: \n"
+ ".popsection \n"
+);
+
+extern u8 fineibt_caller_start[];
+extern u8 fineibt_caller_end[];
+
+#define fineibt_caller_size (fineibt_caller_end - fineibt_caller_start)
+#define fineibt_caller_hash 1
+
+#define fineibt_caller_jmp (fineibt_caller_size - 2)
+
+/*
+ * Since FineIBT does hash validation on the callee side it is prone to
+ * circumvention attacks where a 'naked' ENDBR instruction exists that
+ * is not part of the fineibt_preamble sequence.
+ *
+ * Notably the x86 entry points must be ENDBR and equally cannot be
+ * fineibt_preamble.
+ *
+ * The fineibt_paranoid caller sequence adds additional caller side
+ * hash validation. This stops such circumvention attacks dead, but at the cost
+ * of adding a load.
+ *
+ * <fineibt_paranoid_start>:
+ * 0: b8 78 56 34 12 mov $0x12345678, %eax
+ * 5: 41 3b 43 f5 cmp -0xb(%r11), %eax
+ * 9: 2e 4d 8d 5b <f0> cs lea -0x10(%r11), %r11
+ * e: 75 fd jne d <fineibt_paranoid_start+0xd>
+ * 10: 41 ff d3 call *%r11
+ * 13: 90 nop
+ *
+ * Notably, the LEA does not modify flags and can therefore be reordered with
+ * the CMP, avoiding a dependency. Again, the failure case uses a non-taken
+ * (backwards) branch and abuses the LEA's trailing displacement byte 0xf0 as
+ * a LOCK prefix for the Jcc.d8, causing #UD.
+ */
+asm( ".pushsection .rodata \n"
+ "fineibt_paranoid_start: \n"
+ " mov $0x12345678, %eax \n"
+ " cmpl -11(%r11), %eax \n"
+ " cs lea -0x10(%r11), %r11 \n"
+ "#fineibt_caller_size: \n"
+ " jne fineibt_paranoid_start+0xd \n"
+ "fineibt_paranoid_ind: \n"
+ " cs call *%r11 \n"
+ "fineibt_paranoid_end: \n"
+ ".popsection \n"
+);
+
+extern u8 fineibt_paranoid_start[];
+extern u8 fineibt_paranoid_ind[];
+extern u8 fineibt_paranoid_end[];
+
+#define fineibt_paranoid_size (fineibt_paranoid_end - fineibt_paranoid_start)
+#define fineibt_paranoid_ind (fineibt_paranoid_ind - fineibt_paranoid_start)
+#define fineibt_paranoid_ud 0xd
+
+static u32 decode_preamble_hash(void *addr, int *reg)
+{
+ u8 *p = addr;
+
+ /* b8+reg 78 56 34 12 movl $0x12345678,\reg */
+ if (p[0] >= 0xb8 && p[0] < 0xc0) {
+ if (reg)
+ *reg = p[0] - 0xb8;
+ return *(u32 *)(addr + 1);
+ }
+
+ return 0; /* invalid hash value */
+}
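+
+/*
+ * Example (illustrative): "b9 78 56 34 12" is "movl $0x12345678, %ecx",
+ * so *reg is set to 1 (ecx) and the hash 0x12345678 is returned.
+ */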
+
+static u32 decode_caller_hash(void *addr)
+{
+ u8 *p = addr;
+
+ /* 41 ba 88 a9 cb ed mov $(-0x12345678),%r10d */
+ if (p[0] == 0x41 && p[1] == 0xba)
+ return -*(u32 *)(addr + 2);
+
+ /* eb 0c 88 a9 cb ed jmp.d8 +12 */
+ if (p[0] == JMP8_INSN_OPCODE && p[1] == fineibt_caller_jmp)
+ return -*(u32 *)(addr + 2);
+
+ return 0; /* invalid hash value */
+}
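+
+/*
+ * Example (illustrative): "41 ba 88 a9 cb ed" is
+ * "mov $(-0x12345678), %r10d"; negating the immediate 0xedcba988
+ * returns the hash 0x12345678.
+ */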
+
+/* .retpoline_sites */
+static int cfi_disable_callers(s32 *start, s32 *end)
+{
+ /*
+ * Disable kCFI by patching in a JMP.d8, this leaves the hash immediate
+ * in tact for later usage. Also see decode_caller_hash() and
+ * cfi_rewrite_callers().
+ */
+ const u8 jmp[] = { JMP8_INSN_OPCODE, fineibt_caller_jmp };
+ s32 *s;
+
+ for (s = start; s < end; s++) {
+ void *addr = (void *)s + *s;
+ u32 hash;
+
+ addr -= fineibt_caller_size;
+ hash = decode_caller_hash(addr);
+ if (!hash) /* nocfi callers */
+ continue;
+
+ text_poke_early(addr, jmp, 2);
+ }
+
+ return 0;
+}
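+
+/*
+ * Example (illustrative): "41 ba" becomes "eb 0c", a JMP.d8 over the
+ * remaining 12 bytes of the 14-byte caller sequence, landing on the
+ * indirect CALL with the hash immediate still in place.
+ */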
+
+static int cfi_enable_callers(s32 *start, s32 *end)
+{
+ /*
+ * Re-enable kCFI, undo what cfi_disable_callers() did.
+ */
+ const u8 mov[] = { 0x41, 0xba };
+ s32 *s;
+
+ for (s = start; s < end; s++) {
+ void *addr = (void *)s + *s;
+ u32 hash;
+
+ addr -= fineibt_caller_size;
+ hash = decode_caller_hash(addr);
+ if (!hash) /* nocfi callers */
+ continue;
+
+ text_poke_early(addr, mov, 2);
+ }
+
+ return 0;
+}
+
+/* .cfi_sites */
+static int cfi_rand_preamble(s32 *start, s32 *end)
+{
+ s32 *s;
+
+ for (s = start; s < end; s++) {
+ void *addr = (void *)s + *s;
+ u32 hash;
+
+ hash = decode_preamble_hash(addr, NULL);
+ if (WARN(!hash, "no CFI hash found at: %pS %px %*ph\n",
+ addr, addr, 5, addr))
+ return -EINVAL;
+
+ hash = cfi_rehash(hash);
+ text_poke_early(addr + 1, &hash, 4);
+ }
+
+ return 0;
+}
+
+/*
+ * Inline the bhi-arity 1 case:
+ *
+ * __cfi_foo:
+ * 0: f3 0f 1e fa endbr64
+ * 4: 2d 78 56 34 12 sub $0x12345678, %eax
+ * 9: 49 0f 45 fa cmovne %r10, %rdi
+ * d: 2e 75 03 jne,pn foo+0x3
+ *
+ * foo:
+ * 10: 0f 1f 40 <d6> nopl -42(%rax)
+ *
+ * Notably, this scheme is incompatible with permissive CFI: on a hash
+ * mismatch the CMOVNE has already clobbered RDI, so execution cannot
+ * simply be allowed to continue into the function.
+ */
+asm( ".pushsection .rodata \n"
+ "fineibt_bhi1_start: \n"
+ " cmovne %rax, %rdi \n"
+ " cs jne fineibt_bhi1_func + 0x3 \n"
+ "fineibt_bhi1_func: \n"
+ " nopl -42(%rax) \n"
+ "fineibt_bhi1_end: \n"
+ ".popsection \n"
+);
+
+extern u8 fineibt_bhi1_start[];
+extern u8 fineibt_bhi1_end[];
+
+#define fineibt_bhi1_size (fineibt_bhi1_end - fineibt_bhi1_start)
+
+static void cfi_fineibt_bhi_preamble(void *addr, int arity)
+{
+ u8 bytes[MAX_INSN_SIZE];
+
+ if (!arity)
+ return;
+
+ if (!cfi_warn && arity == 1) {
+ text_poke_early(addr + fineibt_preamble_bhi,
+ fineibt_bhi1_start, fineibt_bhi1_size);
+ return;
+ }
+
+ /*
+ * Replace the bytes at fineibt_preamble_bhi with a CALL instruction
+ * that lines up exactly with the end of the preamble, such that the
+ * return address will be foo+0.
+ *
+ * __cfi_foo:
+ * 0: f3 0f 1e fa endbr64
+ * 4: 2d 78 56 34 12 sub $0x12345678, %eax
+ * 9: 2e 2e e8 DD DD DD DD cs cs call __bhi_args[arity]
+ */
+ bytes[0] = 0x2e;
+ bytes[1] = 0x2e;
+ __text_gen_insn(bytes + 2, CALL_INSN_OPCODE,
+ addr + fineibt_preamble_bhi + 2,
+ __bhi_args[arity], CALL_INSN_SIZE);
+ text_poke_early(addr + fineibt_preamble_bhi, bytes, 7);
+}
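+
+/*
+ * Example (illustrative): for arity 2 the 7-byte "cs jne.d32" at
+ * preamble+9 is replaced by "2e 2e e8 <rel32>", a double CS-prefixed
+ * CALL of the same length whose return address is exactly foo+0.
+ */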
+
+static int cfi_rewrite_preamble(s32 *start, s32 *end)
+{
+ s32 *s;
+
+ for (s = start; s < end; s++) {
+ void *addr = (void *)s + *s;
+ int arity;
+ u32 hash;
+
+ /*
+ * When the function doesn't start with ENDBR the compiler will
+ * have determined there are no indirect calls to it and we
+ * don't need CFI either.
+ */
+ if (!is_endbr(addr + 16))
+ continue;
+
+ hash = decode_preamble_hash(addr, &arity);
+ if (WARN(!hash, "no CFI hash found at: %pS %px %*ph\n",
+ addr, addr, 5, addr))
+ return -EINVAL;
+
+ text_poke_early(addr, fineibt_preamble_start, fineibt_preamble_size);
+ WARN_ON(*(u32 *)(addr + fineibt_preamble_hash) != 0x12345678);
+ text_poke_early(addr + fineibt_preamble_hash, &hash, 4);
+
+ WARN_ONCE(!IS_ENABLED(CONFIG_FINEIBT_BHI) && arity,
+ "kCFI preamble has wrong register at: %pS %*ph\n",
+ addr, 5, addr);
+
+ if (cfi_bhi)
+ cfi_fineibt_bhi_preamble(addr, arity);
+ }
+
+ return 0;
+}
+
+static void cfi_rewrite_endbr(s32 *start, s32 *end)
+{
+ s32 *s;
+
+ for (s = start; s < end; s++) {
+ void *addr = (void *)s + *s;
+
+ if (!exact_endbr(addr + 16))
+ continue;
+
+ poison_endbr(addr + 16);
+ }
+}
+
+/* .retpoline_sites */
+static int cfi_rand_callers(s32 *start, s32 *end)
+{
+ s32 *s;
+
+ for (s = start; s < end; s++) {
+ void *addr = (void *)s + *s;
+ u32 hash;
+
+ addr -= fineibt_caller_size;
+ hash = decode_caller_hash(addr);
+ if (hash) {
+ hash = -cfi_rehash(hash);
+ text_poke_early(addr + 2, &hash, 4);
+ }
+ }
+
+ return 0;
+}
+
+static int emit_paranoid_trampoline(void *addr, struct insn *insn, int reg, u8 *bytes)
+{
+ u8 *thunk = (void *)__x86_indirect_its_thunk_array[reg] - 2;
+
+#ifdef CONFIG_MITIGATION_ITS
+ u8 *tmp = its_allocate_thunk(reg);
+ if (tmp)
+ thunk = tmp;
+#endif
+
+ return __emit_trampoline(addr, insn, bytes, thunk, thunk);
+}
+
+static int cfi_rewrite_callers(s32 *start, s32 *end)
+{
+ s32 *s;
+
+ for (s = start; s < end; s++) {
+ void *addr = (void *)s + *s;
+ struct insn insn;
+ u8 bytes[20];
+ u32 hash;
+ int ret;
+ u8 op;
+
+ addr -= fineibt_caller_size;
+ hash = decode_caller_hash(addr);
+ if (!hash)
+ continue;
+
+ if (!cfi_paranoid) {
+ text_poke_early(addr, fineibt_caller_start, fineibt_caller_size);
+ WARN_ON(*(u32 *)(addr + fineibt_caller_hash) != 0x12345678);
+ text_poke_early(addr + fineibt_caller_hash, &hash, 4);
+ /* rely on apply_retpolines() */
+ continue;
+ }
+
+ /* cfi_paranoid */
+ ret = insn_decode_kernel(&insn, addr + fineibt_caller_size);
+ if (WARN_ON_ONCE(ret < 0))
+ continue;
+
+ op = insn.opcode.bytes[0];
+ if (op != CALL_INSN_OPCODE && op != JMP32_INSN_OPCODE) {
+ WARN_ON_ONCE(1);
+ continue;
+ }
+
+ memcpy(bytes, fineibt_paranoid_start, fineibt_paranoid_size);
+ memcpy(bytes + fineibt_caller_hash, &hash, 4);
+
+ if (cpu_wants_indirect_its_thunk_at((unsigned long)addr + fineibt_paranoid_ind, 11)) {
+ emit_paranoid_trampoline(addr + fineibt_caller_size,
+ &insn, 11, bytes + fineibt_caller_size);
+ } else {
+ int len = fineibt_paranoid_size - fineibt_paranoid_ind;
+ ret = emit_indirect(op, 11, bytes + fineibt_paranoid_ind, len);
+ if (WARN_ON_ONCE(ret != len))
+ continue;
+ }
+
+ text_poke_early(addr, bytes, fineibt_paranoid_size);
+ }
+
+ return 0;
+}
+
+#define pr_cfi_debug(X...) do { if (cfi_debug) pr_info(X); } while (0)
+
+#define FINEIBT_WARN(_f, _v) \
+ WARN_ONCE((_f) != (_v), "FineIBT: " #_f " %ld != %d\n", _f, _v)
+
+static void __apply_fineibt(s32 *start_retpoline, s32 *end_retpoline,
+ s32 *start_cfi, s32 *end_cfi, bool builtin)
+{
+ int ret;
+
+ if (FINEIBT_WARN(fineibt_preamble_size, 20) ||
+ FINEIBT_WARN(fineibt_preamble_bhi + fineibt_bhi1_size, 20) ||
+ FINEIBT_WARN(fineibt_caller_size, 14) ||
+ FINEIBT_WARN(fineibt_paranoid_size, 20))
+ return;
+
+ if (cfi_mode == CFI_AUTO) {
+ cfi_mode = CFI_KCFI;
+ if (HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT)) {
+ /*
+ * FRED has much saner context on exception entry and
+ * is less easy to take advantage of.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_FRED))
+ cfi_paranoid = true;
+ cfi_mode = CFI_FINEIBT;
+ }
+ }
+
+ /*
+ * Rewrite the callers to not use the __cfi_ stubs, such that we might
+ * rewrite them. This disables all CFI. If this succeeds but any of the
+ * later stages fails, we're without CFI.
+ */
+ pr_cfi_debug("CFI: disabling all indirect call checking\n");
+ ret = cfi_disable_callers(start_retpoline, end_retpoline);
+ if (ret)
+ goto err;
+
+ if (cfi_rand) {
+ if (builtin) {
+ cfi_seed = get_random_u32();
+ cfi_bpf_hash = cfi_rehash(cfi_bpf_hash);
+ cfi_bpf_subprog_hash = cfi_rehash(cfi_bpf_subprog_hash);
+ }
+ pr_cfi_debug("CFI: cfi_seed: 0x%08x\n", cfi_seed);
+
+ pr_cfi_debug("CFI: rehashing all preambles\n");
+ ret = cfi_rand_preamble(start_cfi, end_cfi);
+ if (ret)
+ goto err;
+
+ pr_cfi_debug("CFI: rehashing all indirect calls\n");
+ ret = cfi_rand_callers(start_retpoline, end_retpoline);
+ if (ret)
+ goto err;
+ } else {
+ pr_cfi_debug("CFI: rehashing disabled\n");
+ }
+
+ switch (cfi_mode) {
+ case CFI_OFF:
+ if (builtin)
+ pr_info("CFI: disabled\n");
+ return;
+
+ case CFI_KCFI:
+ pr_cfi_debug("CFI: re-enabling all indirect call checking\n");
+ ret = cfi_enable_callers(start_retpoline, end_retpoline);
+ if (ret)
+ goto err;
+
+ if (builtin)
+ pr_info("CFI: Using %sretpoline kCFI\n",
+ cfi_rand ? "rehashed " : "");
+ return;
+
+ case CFI_FINEIBT:
+ pr_cfi_debug("CFI: adding FineIBT to all preambles\n");
+ /* place the FineIBT preamble at func()-16 */
+ ret = cfi_rewrite_preamble(start_cfi, end_cfi);
+ if (ret)
+ goto err;
+
+ /* rewrite the callers to target func()-16 */
+ pr_cfi_debug("CFI: rewriting indirect call sites to use FineIBT\n");
+ ret = cfi_rewrite_callers(start_retpoline, end_retpoline);
+ if (ret)
+ goto err;
+
+ /* now that nobody targets func()+0, remove ENDBR there */
+ pr_cfi_debug("CFI: removing old endbr insns\n");
+ cfi_rewrite_endbr(start_cfi, end_cfi);
+
+ if (builtin) {
+ pr_info("Using %sFineIBT%s CFI\n",
+ cfi_paranoid ? "paranoid " : "",
+ cfi_bhi ? "+BHI" : "");
+ }
+ return;
+
+ default:
+ break;
+ }
+
+err:
+ pr_err("Something went horribly wrong trying to rewrite the CFI implementation.\n");
+}
+
+static inline void poison_hash(void *addr)
+{
+ *(u32 *)addr = 0;
+}
+
+static void poison_cfi(void *addr)
+{
+ /*
+ * Compilers manage to be inconsistent with ENDBR vs __cfi prefixes,
+ * some (static) functions for which they can determine the address
+ * is never taken do not get a __cfi prefix, but *DO* get an ENDBR.
+ *
+ * As such, these functions will get sealed, but we need to be careful
+ * to not unconditionally scribble the previous function.
+ */
+ switch (cfi_mode) {
+ case CFI_FINEIBT:
+ /*
+ * FineIBT prefix should start with an ENDBR.
+ */
+ if (!is_endbr(addr))
+ break;
+
+ /*
+ * __cfi_\func:
+ * nopl -42(%rax)
+ * sub $0, %eax
+ * jne \func+3
+ * \func:
+ * nopl -42(%rax)
+ */
+ poison_endbr(addr);
+ poison_hash(addr + fineibt_preamble_hash);
+ break;
+
+ case CFI_KCFI:
+ /*
+ * kCFI prefix should start with a valid hash.
+ */
+ if (!decode_preamble_hash(addr, NULL))
+ break;
+
+ /*
+ * __cfi_\func:
+ * movl $0, %eax
+ * .skip 11, 0x90
+ */
+ poison_hash(addr + 1);
+ break;
+
+ default:
+ break;
+ }
+}
+
+#define fineibt_prefix_size (fineibt_preamble_size - ENDBR_INSN_SIZE)
+
+/*
+ * When regs->ip points to a 0xD6 byte in the FineIBT preamble,
+ * return true and fill out target and type.
+ *
+ * We identify the preamble by checking for the ENDBR instruction at its
+ * known offset relative to the UDB instruction.
+ */
+static bool decode_fineibt_preamble(struct pt_regs *regs, unsigned long *target, u32 *type)
+{
+ unsigned long addr = regs->ip - fineibt_preamble_ud;
+ u32 hash;
+
+ if (!exact_endbr((void *)addr))
+ return false;
+
+ *target = addr + fineibt_prefix_size;
+
+ __get_kernel_nofault(&hash, addr + fineibt_preamble_hash, u32, Efault);
+ *type = (u32)regs->ax + hash;
+
+ /*
+ * Since regs->ip points to the middle of an instruction; it cannot
+ * continue with the normal fixup.
+ */
+ regs->ip = *target;
+
+ return true;
+
+Efault:
+ return false;
+}
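+
+/*
+ * Example (illustrative): the caller loaded %eax with its hash H_c and
+ * the preamble did "subl $H_f, %eax", so on a mismatch %eax holds
+ * H_c - H_f; adding the preamble hash H_f back recovers the caller's
+ * type H_c for reporting.
+ */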
+
+/*
+ * regs->ip points to one of the UD2 in __bhi_args[].
+ */
+static bool decode_fineibt_bhi(struct pt_regs *regs, unsigned long *target, u32 *type)
+{
+ unsigned long addr;
+ u32 hash;
+
+ if (!cfi_bhi)
+ return false;
+
+ if (regs->ip < (unsigned long)__bhi_args ||
+ regs->ip >= (unsigned long)__bhi_args_end)
+ return false;
+
+ /*
+ * Fetch the return address from the stack; this points to the
+ * FineIBT preamble. Since the CALL instruction is in the last 5
+ * bytes of the preamble, the return address is in fact the target
+ * address.
+ */
+ __get_kernel_nofault(&addr, regs->sp, unsigned long, Efault);
+ *target = addr;
+
+ addr -= fineibt_prefix_size;
+ if (!exact_endbr((void *)addr))
+ return false;
+
+ __get_kernel_nofault(&hash, addr + fineibt_preamble_hash, u32, Efault);
+ *type = (u32)regs->ax + hash;
+
+ /*
+ * The UD2 sites are constructed with a RET immediately following,
+ * as such the non-fatal case can use the regular fixup.
+ */
+ return true;
+
+Efault:
+ return false;
+}
+
+static bool is_paranoid_thunk(unsigned long addr)
+{
+ u32 thunk;
+
+ __get_kernel_nofault(&thunk, (u32 *)addr, u32, Efault);
+ return (thunk & 0x00FFFFFF) == 0xfd75d6;
+
+Efault:
+ return false;
+}
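+
+/*
+ * Example (illustrative): the thunk tail bytes "d6 75 fd" (udb; jne -3)
+ * load as the little-endian u32 0xXXfd75d6; masking off the top byte
+ * ignores whatever instruction follows.
+ */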
+
+/*
+ * regs->ip points to a LOCK Jcc.d8 instruction from the fineibt_paranoid_start[]
+ * sequence, or to UDB + Jcc.d8 for cfi_paranoid + ITS thunk.
+ */
+static bool decode_fineibt_paranoid(struct pt_regs *regs, unsigned long *target, u32 *type)
+{
+ unsigned long addr = regs->ip - fineibt_paranoid_ud;
+
+ if (!cfi_paranoid)
+ return false;
+
+ if (is_cfi_trap(addr + fineibt_caller_size - LEN_UD2)) {
+ *target = regs->r11 + fineibt_prefix_size;
+ *type = regs->ax;
+
+ /*
+ * Since the trapping instruction is the exact, but LOCK prefixed,
+ * Jcc.d8 that got us here, the normal fixup will work.
+ */
+ return true;
+ }
+
+ /*
+ * The cfi_paranoid + ITS thunk combination results in:
+ *
+ * 0: b8 78 56 34 12 mov $0x12345678, %eax
+ * 5: 41 3b 43 f5 cmp -0xb(%r11), %eax
+ * 9: 2e 4d 8d 5b f0 cs lea -0x10(%r11), %r11
+ * e: 2e e8 XX XX XX XX cs call __x86_indirect_paranoid_thunk_r11
+ *
+ * Where the paranoid_thunk looks like:
+ *
+ * 1d: <d6> udb
+ * __x86_indirect_paranoid_thunk_r11:
+ * 1e: 75 fd jne 1d
+ * __x86_indirect_its_thunk_r11:
+ * 20: 41 ff e3 jmp *%r11
+ * 23: cc int3
+ *
+ */
+ if (is_paranoid_thunk(regs->ip)) {
+ *target = regs->r11 + fineibt_prefix_size;
+ *type = regs->ax;
+
+ regs->ip = *target;
+ return true;
+ }
+
+ return false;
+}
+
+bool decode_fineibt_insn(struct pt_regs *regs, unsigned long *target, u32 *type)
+{
+ if (decode_fineibt_paranoid(regs, target, type))
+ return true;
+
+ if (decode_fineibt_bhi(regs, target, type))
+ return true;
+
+ return decode_fineibt_preamble(regs, target, type);
+}
+
+#else /* !CONFIG_FINEIBT: */
+
+static void __apply_fineibt(s32 *start_retpoline, s32 *end_retpoline,
+ s32 *start_cfi, s32 *end_cfi, bool builtin)
+{
+ if (IS_ENABLED(CONFIG_CFI) && builtin)
+ pr_info("CFI: Using standard kCFI\n");
+}
+
+#ifdef CONFIG_X86_KERNEL_IBT
+static void poison_cfi(void *addr) { }
+#endif
+
+#endif /* !CONFIG_FINEIBT */
+
+void apply_fineibt(s32 *start_retpoline, s32 *end_retpoline,
+ s32 *start_cfi, s32 *end_cfi)
+{
+ return __apply_fineibt(start_retpoline, end_retpoline,
+ start_cfi, end_cfi,
+ /* .builtin = */ false);
+}
+
+#ifdef CONFIG_SMP
static void alternatives_smp_lock(const s32 *start, const s32 *end,
u8 *text, u8 *text_end)
{
const s32 *poff;
- mutex_lock(&text_mutex);
for (poff = start; poff < end; poff++) {
u8 *ptr = (u8 *)poff + *poff;
@@ -249,8 +2073,7 @@ static void alternatives_smp_lock(const s32 *start, const s32 *end,
/* turn DS segment override prefix into lock prefix */
if (*ptr == 0x3e)
text_poke(ptr, ((unsigned char []){0xf0}), 1);
- };
- mutex_unlock(&text_mutex);
+ }
}
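
/*
 * Example (illustrative): "3e 0f b1 0a" (ds cmpxchg %ecx,(%rdx)) becomes
 * "f0 0f b1 0a" (lock cmpxchg), and alternatives_smp_unlock() undoes it.
 */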
static void alternatives_smp_unlock(const s32 *start, const s32 *end,
@@ -258,10 +2081,6 @@ static void alternatives_smp_unlock(const s32 *start, const s32 *end,
{
const s32 *poff;
- if (noreplace_smp)
- return;
-
- mutex_lock(&text_mutex);
for (poff = start; poff < end; poff++) {
u8 *ptr = (u8 *)poff + *poff;
@@ -270,8 +2089,7 @@ static void alternatives_smp_unlock(const s32 *start, const s32 *end,
/* turn lock prefix into DS segment override prefix */
if (*ptr == 0xf0)
text_poke(ptr, ((unsigned char []){0x3E}), 1);
- };
- mutex_unlock(&text_mutex);
+ }
}
struct smp_alt_module {
@@ -290,8 +2108,7 @@ struct smp_alt_module {
struct list_head next;
};
static LIST_HEAD(smp_alt_modules);
-static DEFINE_MUTEX(smp_alt);
-static int smp_mode = 1; /* protected by smp_alt */
+static bool uniproc_patched = false; /* protected by text_mutex */
void __init_or_module alternatives_smp_module_add(struct module *mod,
char *name,
@@ -300,19 +2117,18 @@ void __init_or_module alternatives_smp_module_add(struct module *mod,
{
struct smp_alt_module *smp;
- if (noreplace_smp)
- return;
+ mutex_lock(&text_mutex);
+ if (!uniproc_patched)
+ goto unlock;
- if (smp_alt_once) {
- if (boot_cpu_has(X86_FEATURE_UP))
- alternatives_smp_unlock(locks, locks_end,
- text, text_end);
- return;
- }
+ if (num_possible_cpus() == 1)
+ /* Don't bother remembering, we'll never have to undo it. */
+ goto smp_unlock;
smp = kzalloc(sizeof(*smp), GFP_KERNEL);
if (NULL == smp)
- return; /* we'll run the (safe but slow) SMP code then ... */
+ /* we'll run the (safe but slow) SMP code then ... */
+ goto unlock;
smp->mod = mod;
smp->name = name;
@@ -320,85 +2136,58 @@ void __init_or_module alternatives_smp_module_add(struct module *mod,
smp->locks_end = locks_end;
smp->text = text;
smp->text_end = text_end;
- DPRINTK("%s: locks %p -> %p, text %p -> %p, name %s\n",
- __func__, smp->locks, smp->locks_end,
+ DPRINTK(SMP, "locks %p -> %p, text %p -> %p, name %s\n",
+ smp->locks, smp->locks_end,
smp->text, smp->text_end, smp->name);
- mutex_lock(&smp_alt);
list_add_tail(&smp->next, &smp_alt_modules);
- if (boot_cpu_has(X86_FEATURE_UP))
- alternatives_smp_unlock(smp->locks, smp->locks_end,
- smp->text, smp->text_end);
- mutex_unlock(&smp_alt);
+smp_unlock:
+ alternatives_smp_unlock(locks, locks_end, text, text_end);
+unlock:
+ mutex_unlock(&text_mutex);
}
void __init_or_module alternatives_smp_module_del(struct module *mod)
{
struct smp_alt_module *item;
- if (smp_alt_once || noreplace_smp)
- return;
-
- mutex_lock(&smp_alt);
+ mutex_lock(&text_mutex);
list_for_each_entry(item, &smp_alt_modules, next) {
if (mod != item->mod)
continue;
list_del(&item->next);
- mutex_unlock(&smp_alt);
- DPRINTK("%s: %s\n", __func__, item->name);
kfree(item);
- return;
+ break;
}
- mutex_unlock(&smp_alt);
+ mutex_unlock(&text_mutex);
}
-void alternatives_smp_switch(int smp)
+void alternatives_enable_smp(void)
{
struct smp_alt_module *mod;
-#ifdef CONFIG_LOCKDEP
- /*
- * Older binutils section handling bug prevented
- * alternatives-replacement from working reliably.
- *
- * If this still occurs then you should see a hang
- * or crash shortly after this line:
- */
- printk("lockdep: fixing up alternatives.\n");
-#endif
-
- if (noreplace_smp || smp_alt_once)
- return;
- BUG_ON(!smp && (num_online_cpus() > 1));
+ /* Why bother if there are no other CPUs? */
+ BUG_ON(num_possible_cpus() == 1);
- mutex_lock(&smp_alt);
+ mutex_lock(&text_mutex);
- /*
- * Avoid unnecessary switches because it forces JIT based VMs to
- * throw away all cached translations, which can be quite costly.
- */
- if (smp == smp_mode) {
- /* nothing */
- } else if (smp) {
- printk(KERN_INFO "SMP alternatives: switching to SMP code\n");
+ if (uniproc_patched) {
+ pr_info("switching to SMP code\n");
+ BUG_ON(num_online_cpus() != 1);
clear_cpu_cap(&boot_cpu_data, X86_FEATURE_UP);
clear_cpu_cap(&cpu_data(0), X86_FEATURE_UP);
list_for_each_entry(mod, &smp_alt_modules, next)
alternatives_smp_lock(mod->locks, mod->locks_end,
mod->text, mod->text_end);
- } else {
- printk(KERN_INFO "SMP alternatives: switching to UP code\n");
- set_cpu_cap(&boot_cpu_data, X86_FEATURE_UP);
- set_cpu_cap(&cpu_data(0), X86_FEATURE_UP);
- list_for_each_entry(mod, &smp_alt_modules, next)
- alternatives_smp_unlock(mod->locks, mod->locks_end,
- mod->text, mod->text_end);
+ uniproc_patched = false;
}
- smp_mode = smp;
- mutex_unlock(&smp_alt);
+ mutex_unlock(&text_mutex);
}
-/* Return 1 if the address range is reserved for smp-alternatives */
+/*
+ * Return 1 if the address range is reserved for SMP-alternatives.
+ * Must hold text_mutex.
+ */
int alternatives_text_reserved(void *start, void *end)
{
struct smp_alt_module *mod;
@@ -406,6 +2195,8 @@ int alternatives_text_reserved(void *start, void *end)
u8 *text_start = start;
u8 *text_end = end;
+ lockdep_assert_held(&text_mutex);
+
list_for_each_entry(mod, &smp_alt_modules, next) {
if (mod->text > text_end || mod->text_end < text_start)
continue;
@@ -419,94 +2210,206 @@ int alternatives_text_reserved(void *start, void *end)
return 0;
}
-#endif
+#endif /* CONFIG_SMP */
+
+/*
+ * Self-test for the INT3 based CALL emulation code.
+ *
+ * This exercises int3_emulate_call() to make sure INT3 pt_regs are set up
+ * properly and that there is a stack gap between the INT3 frame and the
+ * previous context. Without this gap doing a virtual PUSH on the interrupted
+ * stack would corrupt the INT3 IRET frame.
+ *
+ * See entry_{32,64}.S for more details.
+ */
-#ifdef CONFIG_PARAVIRT
-void __init_or_module apply_paravirt(struct paravirt_patch_site *start,
- struct paravirt_patch_site *end)
+extern void int3_selftest_asm(unsigned int *ptr);
+
+asm (
+" .pushsection .init.text, \"ax\", @progbits\n"
+" .type int3_selftest_asm, @function\n"
+"int3_selftest_asm:\n"
+ ANNOTATE_NOENDBR "\n"
+ /*
+ * INT3 padded with NOP to CALL_INSN_SIZE. The INT3 triggers an
+ * exception, then the int3_exception_nb notifier emulates a call to
+ * int3_selftest_callee().
+ */
+" int3; nop; nop; nop; nop\n"
+ ASM_RET
+" .size int3_selftest_asm, . - int3_selftest_asm\n"
+" .popsection\n"
+);
+
+extern void int3_selftest_callee(unsigned int *ptr);
+
+asm (
+" .pushsection .init.text, \"ax\", @progbits\n"
+" .type int3_selftest_callee, @function\n"
+"int3_selftest_callee:\n"
+ ANNOTATE_NOENDBR "\n"
+" movl $0x1234, (%" _ASM_ARG1 ")\n"
+ ASM_RET
+" .size int3_selftest_callee, . - int3_selftest_callee\n"
+" .popsection\n"
+);
+
+
+static int __init
+int3_exception_notify(struct notifier_block *self, unsigned long val, void *data)
{
- struct paravirt_patch_site *p;
- char insnbuf[MAX_PATCH_LEN];
+ unsigned long selftest = (unsigned long)&int3_selftest_asm;
+ struct die_args *args = data;
+ struct pt_regs *regs = args->regs;
- if (noreplace_paravirt)
- return;
+ OPTIMIZER_HIDE_VAR(selftest);
- for (p = start; p < end; p++) {
- unsigned int used;
+ if (!regs || user_mode(regs))
+ return NOTIFY_DONE;
- BUG_ON(p->len > MAX_PATCH_LEN);
- /* prep the buffer with the original instructions */
- memcpy(insnbuf, p->instr, p->len);
- used = pv_init_ops.patch(p->instrtype, p->clobbers, insnbuf,
- (unsigned long)p->instr, p->len);
+ if (val != DIE_INT3)
+ return NOTIFY_DONE;
- BUG_ON(used > p->len);
+ if (regs->ip - INT3_INSN_SIZE != selftest)
+ return NOTIFY_DONE;
- /* Pad the rest with nops */
- add_nops(insnbuf + used, p->len - used);
- text_poke_early(p->instr, insnbuf, p->len);
- }
+ int3_emulate_call(regs, (unsigned long)&int3_selftest_callee);
+ return NOTIFY_STOP;
+}
+
+/* Must be noinline to ensure uniqueness of int3_selftest_ip. */
+static noinline void __init int3_selftest(void)
+{
+ static __initdata struct notifier_block int3_exception_nb = {
+ .notifier_call = int3_exception_notify,
+ .priority = INT_MAX-1, /* last */
+ };
+ unsigned int val = 0;
+
+ BUG_ON(register_die_notifier(&int3_exception_nb));
+
+ /*
+ * Basically: int3_selftest_callee(&val); but really complicated :-)
+ */
+ int3_selftest_asm(&val);
+
+ BUG_ON(val != 0x1234);
+
+ unregister_die_notifier(&int3_exception_nb);
+}
+
+static __initdata int __alt_reloc_selftest_addr;
+
+extern void __init __alt_reloc_selftest(void *arg);
+__visible noinline void __init __alt_reloc_selftest(void *arg)
+{
+ WARN_ON(arg != &__alt_reloc_selftest_addr);
+}
+
+static noinline void __init alt_reloc_selftest(void)
+{
+ /*
+ * Tests text_poke_apply_relocation().
+ *
+ * This has a relative immediate (CALL) in a place other than the first
+ * instruction and additionally on x86_64 we get a RIP-relative LEA:
+ *
+ * lea 0x0(%rip),%rdi # 5d0: R_X86_64_PC32 .init.data+0x5566c
+ * call +0 # 5d5: R_X86_64_PLT32 __alt_reloc_selftest-0x4
+ *
+ * Getting this wrong will either crash and burn or tickle the WARN
+ * above.
+ */
+ asm_inline volatile (
+ ALTERNATIVE("", "lea %[mem], %%" _ASM_ARG1 "; call __alt_reloc_selftest;", X86_FEATURE_ALWAYS)
+ : ASM_CALL_CONSTRAINT
+ : [mem] "m" (__alt_reloc_selftest_addr)
+ : _ASM_ARG1
+ );
}
-extern struct paravirt_patch_site __start_parainstructions[],
- __stop_parainstructions[];
-#endif /* CONFIG_PARAVIRT */
void __init alternative_instructions(void)
{
- /* The patching is not fully atomic, so try to avoid local interruptions
- that might execute the to be patched code.
- Other CPUs are not running. */
+ u64 ibt;
+
+ int3_selftest();
+
+ /*
+ * The patching is not fully atomic, so try to avoid local
+ * interruptions that might execute the to be patched code.
+ * Other CPUs are not running.
+ */
stop_nmi();
/*
* Don't stop machine check exceptions while patching.
* MCEs only happen when something got corrupted and in this
* case we must do something about the corruption.
- * Ignoring it is worse than a unlikely patching race.
+ * Ignoring it is worse than an unlikely patching race.
* Also machine checks tend to be broadcast and if one CPU
* goes into machine check the others follow quickly, so we don't
* expect a machine check to cause undue problems during code
* patching.
*/
+ /*
+ * Make sure to set (artificial) features depending on used paravirt
+ * functions which can later influence alternative patching.
+ */
+ paravirt_set_cap();
+
+ /* Keep CET-IBT disabled until caller/callee are patched */
+ ibt = ibt_save(/*disable*/ true);
+
+ __apply_fineibt(__retpoline_sites, __retpoline_sites_end,
+ __cfi_sites, __cfi_sites_end, true);
+ cfi_debug = false;
+
+ /*
+ * Rewrite the retpolines, must be done before alternatives since
+ * those can rewrite the retpoline thunks.
+ */
+ apply_retpolines(__retpoline_sites, __retpoline_sites_end);
+ apply_returns(__return_sites, __return_sites_end);
+
+ its_fini_core();
+
+ /*
+ * Adjust all CALL instructions to point to func()-10, including
+ * those in .altinstr_replacement.
+ */
+ callthunks_patch_builtin_calls();
+
apply_alternatives(__alt_instructions, __alt_instructions_end);
- /* switch to patch-once-at-boottime-only mode and free the
- * tables in case we know the number of CPUs will never ever
- * change */
-#ifdef CONFIG_HOTPLUG_CPU
- if (num_possible_cpus() < 2)
- smp_alt_once = 1;
-#endif
+ /*
+ * Seal all functions that do not have their address taken.
+ */
+ apply_seal_endbr(__ibt_endbr_seal, __ibt_endbr_seal_end);
+
+ ibt_restore(ibt);
#ifdef CONFIG_SMP
- if (smp_alt_once) {
- if (1 == num_possible_cpus()) {
- printk(KERN_INFO "SMP alternatives: switching to UP code\n");
- set_cpu_cap(&boot_cpu_data, X86_FEATURE_UP);
- set_cpu_cap(&cpu_data(0), X86_FEATURE_UP);
-
- alternatives_smp_unlock(__smp_locks, __smp_locks_end,
- _text, _etext);
- }
- } else {
+ /* Patch to UP if other cpus not imminent. */
+ if (!noreplace_smp && (num_present_cpus() == 1 || setup_max_cpus <= 1)) {
+ uniproc_patched = true;
alternatives_smp_module_add(NULL, "core kernel",
__smp_locks, __smp_locks_end,
_text, _etext);
-
- /* Only switch to UP mode if we don't immediately boot others */
- if (num_present_cpus() == 1 || setup_max_cpus <= 1)
- alternatives_smp_switch(0);
}
-#endif
- apply_paravirt(__parainstructions, __parainstructions_end);
- if (smp_alt_once)
+ if (!uniproc_patched || num_possible_cpus() == 1) {
free_init_pages("SMP alternatives",
(unsigned long)__smp_locks,
(unsigned long)__smp_locks_end);
+ }
+#endif
restart_nmi();
+ alternatives_patched = 1;
+
+ alt_reloc_selftest();
}
/**
@@ -518,19 +2421,169 @@ void __init alternative_instructions(void)
* When you use this code to patch more than one byte of an instruction
* you need to make sure that other CPUs cannot execute this code in parallel.
* Also no thread must be currently preempted in the middle of these
- * instructions. And on the local CPU you need to be protected again NMI or MCE
- * handlers seeing an inconsistent instruction while you patch.
+ * instructions. And on the local CPU you need to be protected against NMI or
+ * MCE handlers seeing an inconsistent instruction while you patch.
+ */
+void __init_or_module text_poke_early(void *addr, const void *opcode,
+ size_t len)
+{
+ unsigned long flags;
+
+ if (boot_cpu_has(X86_FEATURE_NX) &&
+ is_module_text_address((unsigned long)addr)) {
+ /*
+ * Modules text is marked initially as non-executable, so the
+ * code cannot be running and speculative code-fetches are
+ * prevented. Just change the code.
+ */
+ memcpy(addr, opcode, len);
+ } else {
+ local_irq_save(flags);
+ memcpy(addr, opcode, len);
+ sync_core();
+ local_irq_restore(flags);
+
+ /*
+ * Could also do a CLFLUSH here to speed up CPU recovery; but
+ * that causes hangs on some VIA CPUs.
+ */
+ }
+}
+
+__ro_after_init struct mm_struct *text_poke_mm;
+__ro_after_init unsigned long text_poke_mm_addr;
+
+/*
+ * Text poking creates and uses a mapping in the lower half of the
+ * address space. Relax LASS enforcement when accessing the poking
+ * address.
+ *
+ * objtool enforces a strict policy of "no function calls within AC=1
+ * regions". Adhere to the policy by using inline versions of
+ * memcpy()/memset() that will never result in a function call.
*/
-static void *__init_or_module text_poke_early(void *addr, const void *opcode,
- size_t len)
+
+static void text_poke_memcpy(void *dst, const void *src, size_t len)
+{
+ lass_stac();
+ __inline_memcpy(dst, src, len);
+ lass_clac();
+}
+
+static void text_poke_memset(void *dst, const void *src, size_t len)
+{
+ int c = *(const int *)src;
+
+ lass_stac();
+ __inline_memset(dst, c, len);
+ lass_clac();
+}
+
+typedef void text_poke_f(void *dst, const void *src, size_t len);
+
+static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t len)
{
+ bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
+ struct page *pages[2] = {NULL};
+ struct mm_struct *prev_mm;
unsigned long flags;
+ pte_t pte, *ptep;
+ spinlock_t *ptl;
+ pgprot_t pgprot;
+
+ /*
+ * While boot memory allocator is running we cannot use struct pages as
+ * they are not yet initialized. There is no way to recover.
+ */
+ BUG_ON(!after_bootmem);
+
+ if (!core_kernel_text((unsigned long)addr)) {
+ pages[0] = vmalloc_to_page(addr);
+ if (cross_page_boundary)
+ pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
+ } else {
+ pages[0] = virt_to_page(addr);
+ WARN_ON(!PageReserved(pages[0]));
+ if (cross_page_boundary)
+ pages[1] = virt_to_page(addr + PAGE_SIZE);
+ }
+ /*
+ * If something went wrong, crash and burn since recovery paths are not
+ * implemented.
+ */
+ BUG_ON(!pages[0] || (cross_page_boundary && !pages[1]));
+
+ /*
+ * Map the page without the global bit, as TLB flushing is done with
+ * flush_tlb_mm_range(), which is intended for non-global PTEs.
+ */
+ pgprot = __pgprot(pgprot_val(PAGE_KERNEL) & ~_PAGE_GLOBAL);
+
+ /*
+ * The lock is not really needed, but using get_locked_pte() lets us
+ * avoid open-coding the PTE lookup.
+ */
+ ptep = get_locked_pte(text_poke_mm, text_poke_mm_addr, &ptl);
+
+ /*
+ * This must not fail; preallocated in poking_init().
+ */
+ VM_BUG_ON(!ptep);
+
local_irq_save(flags);
- memcpy(addr, opcode, len);
- sync_core();
+
+ pte = mk_pte(pages[0], pgprot);
+ set_pte_at(text_poke_mm, text_poke_mm_addr, ptep, pte);
+
+ if (cross_page_boundary) {
+ pte = mk_pte(pages[1], pgprot);
+ set_pte_at(text_poke_mm, text_poke_mm_addr + PAGE_SIZE, ptep + 1, pte);
+ }
+
+ /*
+ * Loading the temporary mm behaves as a compiler barrier, which
+ * guarantees that the PTE will be set at the time memcpy() is done.
+ */
+ prev_mm = use_temporary_mm(text_poke_mm);
+
+ kasan_disable_current();
+ func((u8 *)text_poke_mm_addr + offset_in_page(addr), src, len);
+ kasan_enable_current();
+
+ /*
+ * Ensure that the PTE is only cleared after the instructions of memcpy
+ * were issued by using a compiler barrier.
+ */
+ barrier();
+
+ pte_clear(text_poke_mm, text_poke_mm_addr, ptep);
+ if (cross_page_boundary)
+ pte_clear(text_poke_mm, text_poke_mm_addr + PAGE_SIZE, ptep + 1);
+
+ /*
+ * Loading the previous page-table hierarchy requires a serializing
+ * instruction that already allows the core to see the updated version.
+ * Xen-PV is assumed to serialize execution in a similar manner.
+ */
+ unuse_temporary_mm(prev_mm);
+
+ /*
+ * Flushing the TLB might involve IPIs, which would require enabled
+ * IRQs, but none are needed here since the mm is not in use at this point.
+ */
+ flush_tlb_mm_range(text_poke_mm, text_poke_mm_addr, text_poke_mm_addr +
+ (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
+ PAGE_SHIFT, false);
+
+ if (func == text_poke_memcpy) {
+ /*
+ * If the text does not match what we just wrote then something is
+ * fundamentally screwy; there's nothing we can really do about that.
+ */
+ BUG_ON(memcmp(addr, src, len));
+ }
+
local_irq_restore(flags);
- /* Could also do a CLFLUSH here to speed up CPU recovery; but
- that causes hangs on some VIA CPUs. */
+ pte_unmap_unlock(ptep, ptl);
return addr;
}
@@ -545,98 +2598,566 @@ static void *__init_or_module text_poke_early(void *addr, const void *opcode,
* in a way that permits an atomic write. It also makes sure we fit on a single
* page.
*
- * Note: Must be called under text_mutex.
+ * Note that the caller must ensure that if the modified code is part of a
+ * module, the module would not be removed during poking. This can be achieved
+ * by registering a module notifier, and ordering module removal and patching
+ * through a mutex.
*/
-void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
+void *text_poke(void *addr, const void *opcode, size_t len)
{
- unsigned long flags;
- char *vaddr;
- struct page *pages[2];
- int i;
+ lockdep_assert_held(&text_mutex);
- if (!core_kernel_text((unsigned long)addr)) {
- pages[0] = vmalloc_to_page(addr);
- pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
- } else {
- pages[0] = virt_to_page(addr);
- WARN_ON(!PageReserved(pages[0]));
- pages[1] = virt_to_page(addr + PAGE_SIZE);
+ return __text_poke(text_poke_memcpy, addr, opcode, len);
+}
+
+/**
+ * text_poke_kgdb - Update instructions on a live kernel by kgdb
+ * @addr: address to modify
+ * @opcode: source of the copy
+ * @len: length to copy
+ *
+ * Only atomic text poke/set should be allowed when not doing early patching.
+ * It means the size must be writable atomically and the address must be aligned
+ * in a way that permits an atomic write. It also makes sure we fit on a single
+ * page.
+ *
+ * Context: should only be used by kgdb, which ensures no other core is running,
+ * despite the fact it does not hold the text_mutex.
+ */
+void *text_poke_kgdb(void *addr, const void *opcode, size_t len)
+{
+ return __text_poke(text_poke_memcpy, addr, opcode, len);
+}
+
+void *text_poke_copy_locked(void *addr, const void *opcode, size_t len,
+ bool core_ok)
+{
+ unsigned long start = (unsigned long)addr;
+ size_t patched = 0;
+
+ if (WARN_ON_ONCE(!core_ok && core_kernel_text(start)))
+ return NULL;
+
+ while (patched < len) {
+ unsigned long ptr = start + patched;
+ size_t s;
+
+ s = min_t(size_t, PAGE_SIZE * 2 - offset_in_page(ptr), len - patched);
+
+ __text_poke(text_poke_memcpy, (void *)ptr, opcode + patched, s);
+ patched += s;
}
- BUG_ON(!pages[0]);
- local_irq_save(flags);
- set_fixmap(FIX_TEXT_POKE0, page_to_phys(pages[0]));
- if (pages[1])
- set_fixmap(FIX_TEXT_POKE1, page_to_phys(pages[1]));
- vaddr = (char *)fix_to_virt(FIX_TEXT_POKE0);
- memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
- clear_fixmap(FIX_TEXT_POKE0);
- if (pages[1])
- clear_fixmap(FIX_TEXT_POKE1);
- local_flush_tlb();
- sync_core();
- /* Could also do a CLFLUSH here to speed up CPU recovery; but
- that causes hangs on some VIA CPUs. */
- for (i = 0; i < len; i++)
- BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
- local_irq_restore(flags);
return addr;
}
-/*
- * Cross-modifying kernel text with stop_machine().
- * This code originally comes from immediate value.
+/**
+ * text_poke_copy - Copy instructions into (an unused part of) RX memory
+ * @addr: address to modify
+ * @opcode: source of the copy
+ * @len: length to copy, could be more than 2x PAGE_SIZE
+ *
+ * Not safe against concurrent execution; useful for JITs to dump
+ * new code blocks into unused regions of RX memory. Can be used in
+ * conjunction with synchronize_rcu_tasks() to wait for existing
+ * execution to quiesce after having made sure no existing functions
+ * pointers are live.
*/
-static atomic_t stop_machine_first;
-static int wrote_text;
+void *text_poke_copy(void *addr, const void *opcode, size_t len)
+{
+ mutex_lock(&text_mutex);
+ addr = text_poke_copy_locked(addr, opcode, len, false);
+ mutex_unlock(&text_mutex);
+ return addr;
+}
+
+/**
+ * text_poke_set - memset into (an unused part of) RX memory
+ * @addr: address to modify
+ * @c: the byte to fill the area with
+ * @len: length to copy, could be more than 2x PAGE_SIZE
+ *
+ * This is useful to overwrite unused regions of RX memory with illegal
+ * instructions.
+ */
+void *text_poke_set(void *addr, int c, size_t len)
+{
+ unsigned long start = (unsigned long)addr;
+ size_t patched = 0;
+
+ if (WARN_ON_ONCE(core_kernel_text(start)))
+ return NULL;
+
+ mutex_lock(&text_mutex);
+ while (patched < len) {
+ unsigned long ptr = start + patched;
+ size_t s;
+
+ s = min_t(size_t, PAGE_SIZE * 2 - offset_in_page(ptr), len - patched);
+
+ __text_poke(text_poke_memset, (void *)ptr, (void *)&c, s);
+ patched += s;
+ }
+ mutex_unlock(&text_mutex);
+ return addr;
+}
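+
+/*
+ * Example (illustrative): text_poke_set(buf, 0xcc, size) fills a stale
+ * JIT region with INT3 so any errant jump into it traps.
+ */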
+
+static void do_sync_core(void *info)
+{
+ sync_core();
+}
+
+void smp_text_poke_sync_each_cpu(void)
+{
+ on_each_cpu(do_sync_core, NULL, 1);
+}
-struct text_poke_params {
- void *addr;
- const void *opcode;
- size_t len;
+/*
+ * NOTE: crazy scheme to allow patching Jcc.d32 but not increase the size of
+ * this thing. When len == 6 everything is prefixed with 0x0f and we map
+ * opcode to Jcc.d8, using len to distinguish.
+ */
+struct smp_text_poke_loc {
+ /* addr := _stext + rel_addr */
+ s32 rel_addr;
+ s32 disp;
+ u8 len;
+ u8 opcode;
+ const u8 text[TEXT_POKE_MAX_OPCODE_SIZE];
+ /* see smp_text_poke_batch_finish() */
+ u8 old;
};
-static int __kprobes stop_machine_text_poke(void *data)
+#define TEXT_POKE_ARRAY_MAX (PAGE_SIZE / sizeof(struct smp_text_poke_loc))
+
+static struct smp_text_poke_array {
+ struct smp_text_poke_loc vec[TEXT_POKE_ARRAY_MAX];
+ int nr_entries;
+} text_poke_array;
+
+static DEFINE_PER_CPU(atomic_t, text_poke_array_refs);
+
+/*
+ * These four __always_inline annotations imply noinstr, necessary
+ * due to smp_text_poke_int3_handler() being noinstr:
+ */
+
+static __always_inline bool try_get_text_poke_array(void)
+{
+ atomic_t *refs = this_cpu_ptr(&text_poke_array_refs);
+
+ if (!raw_atomic_inc_not_zero(refs))
+ return false;
+
+ return true;
+}
+
+static __always_inline void put_text_poke_array(void)
+{
+ atomic_t *refs = this_cpu_ptr(&text_poke_array_refs);
+
+ smp_mb__before_atomic();
+ raw_atomic_dec(refs);
+}
+
+static __always_inline void *text_poke_addr(const struct smp_text_poke_loc *tpl)
+{
+ return _stext + tpl->rel_addr;
+}
+
+static __always_inline int patch_cmp(const void *tpl_a, const void *tpl_b)
+{
+ if (tpl_a < text_poke_addr(tpl_b))
+ return -1;
+ if (tpl_a > text_poke_addr(tpl_b))
+ return 1;
+ return 0;
+}
+
+noinstr int smp_text_poke_int3_handler(struct pt_regs *regs)
{
- struct text_poke_params *tpp = data;
+ struct smp_text_poke_loc *tpl;
+ int ret = 0;
+ void *ip;
+
+ if (user_mode(regs))
+ return 0;
- if (atomic_dec_and_test(&stop_machine_first)) {
- text_poke(tpp->addr, tpp->opcode, tpp->len);
- smp_wmb(); /* Make sure other cpus see that this has run */
- wrote_text = 1;
+ /*
+ * Having observed our INT3 instruction, we now must observe
+ * text_poke_array with non-zero refcount:
+ *
+ * text_poke_array_refs = 1 INT3
+ * WMB RMB
+ * write INT3 if (text_poke_array_refs != 0)
+ */
+ smp_rmb();
+
+ if (!try_get_text_poke_array())
+ return 0;
+
+ /*
+ * Discount the INT3. See smp_text_poke_batch_finish().
+ */
+ ip = (void *) regs->ip - INT3_INSN_SIZE;
+
+ /*
+ * Skip the binary search if there is a single member in the vector.
+ */
+ if (unlikely(text_poke_array.nr_entries > 1)) {
+ tpl = __inline_bsearch(ip, text_poke_array.vec, text_poke_array.nr_entries,
+ sizeof(struct smp_text_poke_loc),
+ patch_cmp);
+ if (!tpl)
+ goto out_put;
} else {
- while (!wrote_text)
- cpu_relax();
- smp_mb(); /* Load wrote_text before following execution */
+ tpl = text_poke_array.vec;
+ if (text_poke_addr(tpl) != ip)
+ goto out_put;
}
- flush_icache_range((unsigned long)tpp->addr,
- (unsigned long)tpp->addr + tpp->len);
- return 0;
+ ip += tpl->len;
+
+ switch (tpl->opcode) {
+ case INT3_INSN_OPCODE:
+ /*
+ * Someone poked an explicit INT3, they'll want to handle it,
+ * do not consume.
+ */
+ goto out_put;
+
+ case RET_INSN_OPCODE:
+ int3_emulate_ret(regs);
+ break;
+
+ case CALL_INSN_OPCODE:
+ int3_emulate_call(regs, (long)ip + tpl->disp);
+ break;
+
+ case JMP32_INSN_OPCODE:
+ case JMP8_INSN_OPCODE:
+ int3_emulate_jmp(regs, (long)ip + tpl->disp);
+ break;
+
+ case 0x70 ... 0x7f: /* Jcc */
+ int3_emulate_jcc(regs, tpl->opcode & 0xf, (long)ip, tpl->disp);
+ break;
+
+ default:
+ BUG();
+ }
+
+ ret = 1;
+
+out_put:
+ put_text_poke_array();
+ return ret;
}
/**
- * text_poke_smp - Update instructions on a live kernel on SMP
- * @addr: address to modify
- * @opcode: source of the copy
- * @len: length to copy
+ * smp_text_poke_batch_finish() -- update instructions on live kernel on SMP
+ *
+ * Input state:
+ * text_poke_array.vec: vector of instructions to patch
+ * text_poke_array.nr_entries: number of entries in the vector
*
- * Modify multi-byte instruction by using stop_machine() on SMP. This allows
- * user to poke/set multi-byte text on SMP. Only non-NMI/MCE code modifying
- * should be allowed, since stop_machine() does _not_ protect code against
- * NMI and MCE.
+ * Modify multi-byte instructions by using INT3 breakpoints on SMP.
+ * We completely avoid using stop_machine() here, and achieve the
+ * synchronization using INT3 breakpoints and SMP cross-calls.
*
- * Note: Must be called under get_online_cpus() and text_mutex.
+ * The way it is done:
+ * - For each entry in the vector:
+ * - add an INT3 trap to the address that will be patched
+ * - SMP sync all CPUs
+ * - For each entry in the vector:
+ * - update all but the first byte of the patched range
+ * - SMP sync all CPUs
+ * - For each entry in the vector:
+ * - replace the first byte (INT3) by the first byte of the
+ * replacing opcode
+ * - SMP sync all CPUs
*/
-void *__kprobes text_poke_smp(void *addr, const void *opcode, size_t len)
+void smp_text_poke_batch_finish(void)
{
- struct text_poke_params tpp;
+ unsigned char int3 = INT3_INSN_OPCODE;
+ unsigned int i;
+ int do_sync;
- tpp.addr = addr;
- tpp.opcode = opcode;
- tpp.len = len;
- atomic_set(&stop_machine_first, 1);
- wrote_text = 0;
- stop_machine(stop_machine_text_poke, (void *)&tpp, NULL);
- return addr;
+ if (!text_poke_array.nr_entries)
+ return;
+
+ lockdep_assert_held(&text_mutex);
+
+ /*
+ * Corresponds to the implicit memory barrier in try_get_text_poke_array() to
+ * ensure reading a non-zero refcount provides up to date text_poke_array data.
+ */
+ for_each_possible_cpu(i)
+ atomic_set_release(per_cpu_ptr(&text_poke_array_refs, i), 1);
+
+ /*
+ * Function tracing can enable thousands of places that need to be
+ * updated. This can take quite some time, and with full kernel debugging
+ * enabled, this could cause the softlockup watchdog to trigger.
+ * This function gets called every 256 entries added to be patched.
+ * Call cond_resched() here to make sure that other tasks can get scheduled
+ * while processing all the functions being patched.
+ */
+ cond_resched();
+
+ /*
+ * Corresponding read barrier in INT3 notifier for making sure the
+ * text_poke_array.nr_entries and handler are correctly ordered wrt. patching.
+ */
+ smp_wmb();
+
+ /*
+ * First step: add a INT3 trap to the address that will be patched.
+ */
+ for (i = 0; i < text_poke_array.nr_entries; i++) {
+ text_poke_array.vec[i].old = *(u8 *)text_poke_addr(&text_poke_array.vec[i]);
+ text_poke(text_poke_addr(&text_poke_array.vec[i]), &int3, INT3_INSN_SIZE);
+ }
+
+ smp_text_poke_sync_each_cpu();
+
+ /*
+ * Second step: update all but the first byte of the patched range.
+ */
+ for (do_sync = 0, i = 0; i < text_poke_array.nr_entries; i++) {
+ u8 old[TEXT_POKE_MAX_OPCODE_SIZE+1] = { text_poke_array.vec[i].old, };
+ u8 _new[TEXT_POKE_MAX_OPCODE_SIZE+1];
+ const u8 *new = text_poke_array.vec[i].text;
+ int len = text_poke_array.vec[i].len;
+
+ if (len - INT3_INSN_SIZE > 0) {
+ memcpy(old + INT3_INSN_SIZE,
+ text_poke_addr(&text_poke_array.vec[i]) + INT3_INSN_SIZE,
+ len - INT3_INSN_SIZE);
+
+ if (len == 6) {
+ _new[0] = 0x0f;
+ memcpy(_new + 1, new, 5);
+ new = _new;
+ }
+
+ text_poke(text_poke_addr(&text_poke_array.vec[i]) + INT3_INSN_SIZE,
+ new + INT3_INSN_SIZE,
+ len - INT3_INSN_SIZE);
+
+ do_sync++;
+ }
+
+ /*
+ * Emit a perf event to record the text poke, primarily to
+ * support Intel PT decoding which must walk the executable code
+ * to reconstruct the trace. The flow up to here is:
+ * - write INT3 byte
+ * - IPI-SYNC
+ * - write instruction tail
+ * At this point the actual control flow will be through the
+ * INT3 and handler and not hit the old or new instruction.
+ * Intel PT outputs FUP/TIP packets for the INT3, so the flow
+ * can still be decoded. Subsequently:
+ * - emit RECORD_TEXT_POKE with the new instruction
+ * - IPI-SYNC
+ * - write first byte
+ * - IPI-SYNC
+ * So before the text poke event timestamp, the decoder will see
+ * either the old instruction flow or FUP/TIP of INT3. After the
+ * text poke event timestamp, the decoder will see either the
+ * new instruction flow or FUP/TIP of INT3. Thus decoders can
+ * use the timestamp as the point at which to modify the
+ * executable code.
+ * The old instruction is recorded so that the event can be
+ * processed forwards or backwards.
+ */
+ perf_event_text_poke(text_poke_addr(&text_poke_array.vec[i]), old, len, new, len);
+ }
+
+ if (do_sync) {
+ /*
+ * According to Intel, this core syncing is very likely
+ * not necessary and we'd be safe even without it. But
+ * better safe than sorry (plus there's not only Intel).
+ */
+ smp_text_poke_sync_each_cpu();
+ }
+
+ /*
+ * Third step: replace the first byte (INT3) by the first byte of the
+ * replacing opcode.
+ */
+ for (do_sync = 0, i = 0; i < text_poke_array.nr_entries; i++) {
+ u8 byte = text_poke_array.vec[i].text[0];
+
+ if (text_poke_array.vec[i].len == 6)
+ byte = 0x0f;
+
+ if (byte == INT3_INSN_OPCODE)
+ continue;
+
+ text_poke(text_poke_addr(&text_poke_array.vec[i]), &byte, INT3_INSN_SIZE);
+ do_sync++;
+ }
+
+ if (do_sync)
+ smp_text_poke_sync_each_cpu();
+
+ /*
+ * Remove and wait for refs to be zero.
+ *
+ * Notably, if after step-3 above the INT3 got removed, then the
+ * smp_text_poke_sync_each_cpu() will have serialized against any running INT3
+ * handlers and the below spin-wait will not happen.
+ *
+ * IOW. unless the replacement instruction is INT3, this case goes
+ * unused.
+ */
+ for_each_possible_cpu(i) {
+ atomic_t *refs = per_cpu_ptr(&text_poke_array_refs, i);
+
+ if (unlikely(!atomic_dec_and_test(refs)))
+ atomic_cond_read_acquire(refs, !VAL);
+ }
+
+ /* They are all completed: */
+ text_poke_array.nr_entries = 0;
+}
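/*
 * For reference, a minimal sketch of the INT3-exception side that the
 * per-CPU refcount handling above pairs with. try_get_text_poke_array()
 * and text_poke_array_refs are the names used in the comments above;
 * the linear lookup and the emulation stub below are illustrative
 * assumptions, not the exact upstream implementation.
 */
static __always_inline bool try_get_text_poke_array(void)
{
	atomic_t *refs = this_cpu_ptr(&text_poke_array_refs);

	if (!atomic_inc_not_zero(refs))
		return false;		/* no batch currently in flight */

	/* Implies a full barrier: text_poke_array is stable from here on. */
	return true;
}

static __always_inline void put_text_poke_array(void)
{
	atomic_t *refs = this_cpu_ptr(&text_poke_array_refs);

	smp_mb__before_atomic();
	atomic_dec(refs);
}

/*
 * Called from the INT3 handler: emulate the target instruction if the
 * trapping address is one of the queued patch sites.
 */
static bool smp_text_poke_int3_sketch(struct pt_regs *regs)
{
	void *ip = (void *)regs->ip - INT3_INSN_SIZE;
	bool handled = false;
	unsigned int i;

	if (user_mode(regs) || !try_get_text_poke_array())
		return false;

	for (i = 0; i < text_poke_array.nr_entries; i++) {
		if (text_poke_addr(&text_poke_array.vec[i]) == ip) {
			/* ...emulate vec[i].opcode/disp against regs... */
			handled = true;
			break;
		}
	}

	put_text_poke_array();
	return handled;
}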
+
+static void __smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, const void *emulate)
+{
+ struct smp_text_poke_loc *tpl;
+ struct insn insn;
+ int ret, i = 0;
+
+ tpl = &text_poke_array.vec[text_poke_array.nr_entries++];
+
+ if (len == 6)
+ i = 1;
+ memcpy((void *)tpl->text, opcode+i, len-i);
+ if (!emulate)
+ emulate = opcode;
+
+ ret = insn_decode_kernel(&insn, emulate);
+ BUG_ON(ret < 0);
+
+ tpl->rel_addr = addr - (void *)_stext;
+ tpl->len = len;
+ tpl->opcode = insn.opcode.bytes[0];
+
+ if (is_jcc32(&insn)) {
+ /*
+ * Map Jcc.d32 onto Jcc.d8 and use len to distinguish.
+ */
+ tpl->opcode = insn.opcode.bytes[1] - 0x10;
+ }
+
+ switch (tpl->opcode) {
+ case RET_INSN_OPCODE:
+ case JMP32_INSN_OPCODE:
+ case JMP8_INSN_OPCODE:
+ /*
+ * Control flow instructions without implied execution of the
+ * next instruction can be padded with INT3.
+ */
+ for (i = insn.length; i < len; i++)
+ BUG_ON(tpl->text[i] != INT3_INSN_OPCODE);
+ break;
+
+ default:
+ BUG_ON(len != insn.length);
+ }
+
+ switch (tpl->opcode) {
+ case INT3_INSN_OPCODE:
+ case RET_INSN_OPCODE:
+ break;
+
+ case CALL_INSN_OPCODE:
+ case JMP32_INSN_OPCODE:
+ case JMP8_INSN_OPCODE:
+ case 0x70 ... 0x7f: /* Jcc */
+ tpl->disp = insn.immediate.value;
+ break;
+
+ default: /* assume NOP */
+ switch (len) {
+ case 2: /* NOP2 -- emulate as JMP8+0 */
+ BUG_ON(memcmp(emulate, x86_nops[len], len));
+ tpl->opcode = JMP8_INSN_OPCODE;
+ tpl->disp = 0;
+ break;
+
+ case 5: /* NOP5 -- emulate as JMP32+0 */
+ BUG_ON(memcmp(emulate, x86_nops[len], len));
+ tpl->opcode = JMP32_INSN_OPCODE;
+ tpl->disp = 0;
+ break;
+
+ default: /* unknown instruction */
+ BUG();
+ }
+ break;
+ }
}
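/*
 * Why the "insn.opcode.bytes[1] - 0x10" mapping above works: the
 * 32-bit-displacement conditional jumps encode as 0x0F 0x80..0x8F while
 * their 8-bit-displacement twins are 0x70..0x7F, so subtracting 0x10
 * from the second opcode byte yields the Jcc.d8 opcode, and ->len tells
 * the two forms apart. Likewise, for the 6-byte Jcc.d32 form only the
 * trailing 5 bytes are stored in ->text and the leading 0x0F is
 * re-materialized at patch time. A stand-alone check (plain user-space
 * C, opcode values from the Intel SDM):
 */
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint8_t je_d32[2] = { 0x0f, 0x84 };	/* JE rel32 */
	uint8_t je_d8     = 0x74;		/* JE rel8  */

	assert(je_d32[1] - 0x10 == je_d8);

	/*
	 * NOP2/NOP5 are queued as JMP8+0/JMP32+0 above: a jump with a
	 * zero displacement falls through to the next instruction,
	 * which is behaviorally a NOP.
	 */
	return 0;
}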
+/*
+ * We rely hard on text_poke_array.vec being address-ordered; ensure this
+ * is so by flushing the queue early if needed.
+ */
+static bool text_poke_addr_ordered(void *addr)
+{
+ WARN_ON_ONCE(!addr);
+
+ if (!text_poke_array.nr_entries)
+ return true;
+
+ /*
+ * If the last current entry's address is higher than the
+ * new entry's address we'd like to add, then ordering
+ * is violated and we must first flush all pending patching
+ * requests:
+ */
+ if (text_poke_addr(text_poke_array.vec + text_poke_array.nr_entries-1) > addr)
+ return false;
+
+ return true;
+}
+
+/**
+ * smp_text_poke_batch_add() -- update instruction on live kernel on SMP, batched
+ * @addr: address to patch
+ * @opcode: opcode of new instruction
+ * @len: length to copy
+ * @emulate: instruction to be emulated
+ *
+ * Add a new instruction to the current queue of to-be-patched instructions
+ * the kernel maintains. The patching request will not be executed immediately,
+ * but becomes part of an array of patching requests, optimized for batched
+ * execution. All pending patching requests will be executed on the next
+ * smp_text_poke_batch_finish() call.
+ */
+void __ref smp_text_poke_batch_add(void *addr, const void *opcode, size_t len, const void *emulate)
+{
+ if (text_poke_array.nr_entries == TEXT_POKE_ARRAY_MAX || !text_poke_addr_ordered(addr))
+ smp_text_poke_batch_finish();
+ __smp_text_poke_batch_add(addr, opcode, len, emulate);
+}
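/*
 * Illustrative use of the batching API above: queue a series of patch
 * sites, then flush once. "struct patch_site" and the caller are
 * hypothetical; real users (e.g. the jump-label machinery) follow the
 * same pattern and hold text_mutex around it.
 */
struct patch_site {
	void		*addr;
	const u8	*insn;
	size_t		len;
};

static void patch_sites_sketch(struct patch_site *site, int nr)
{
	int i;

	lockdep_assert_held(&text_mutex);

	for (i = 0; i < nr; i++) {
		/*
		 * Only queues; flushes early by itself if the array
		 * fills up or a site arrives out of address order.
		 */
		smp_text_poke_batch_add(site[i].addr, site[i].insn,
					site[i].len, NULL);
	}
	smp_text_poke_batch_finish();
}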
+
+/**
+ * smp_text_poke_single() -- update instruction on live kernel on SMP immediately
+ * @addr: address to patch
+ * @opcode: opcode of new instruction
+ * @len: length to copy
+ * @emulate: instruction to be emulated
+ *
+ * Update a single instruction immediately: the request is added to the
+ * kernel's queue of to-be-patched instructions and the queue is flushed
+ * right away. No memory needs to be allocated by the caller, and the
+ * patch takes effect before this function returns.
+ */
+void __ref smp_text_poke_single(void *addr, const void *opcode, size_t len, const void *emulate)
+{
+ smp_text_poke_batch_add(addr, opcode, len, emulate);
+ smp_text_poke_batch_finish();
+}
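/*
 * Illustrative single-site use of the helper above: repoint one 5-byte
 * CALL immediately. Both names below are hypothetical; new_call holds a
 * pre-built CALL instruction.
 */
static void repoint_call_sketch(void *call_site, const u8 *new_call)
{
	/* Queue the one site and flush right away. */
	smp_text_poke_single(call_site, new_call, CALL_INSN_SIZE, NULL);
}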
diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 0f7f130caa67..e8000a56732e 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Dynamic DMA mapping support for AMD Hammer.
*
@@ -5,10 +6,9 @@
* This allows using PCI devices that only support 32bit addresses on systems
* with more than 4GB.
*
- * See Documentation/PCI/PCI-DMA-mapping.txt for the interface specification.
+ * See Documentation/core-api/dma-api-howto.rst for the interface specification.
*
* Copyright 2002 Andi Kleen, SuSE Labs.
- * Subject to the GNU General Public License v2 only.
*/
#include <linux/types.h>
@@ -17,29 +17,29 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/sched.h>
+#include <linux/sched/debug.h>
#include <linux/string.h>
#include <linux/spinlock.h>
#include <linux/pci.h>
-#include <linux/module.h>
#include <linux/topology.h>
#include <linux/interrupt.h>
#include <linux/bitmap.h>
#include <linux/kdebug.h>
#include <linux/scatterlist.h>
#include <linux/iommu-helper.h>
-#include <linux/sysdev.h>
+#include <linux/syscore_ops.h>
#include <linux/io.h>
#include <linux/gfp.h>
-#include <asm/atomic.h>
+#include <linux/atomic.h>
+#include <linux/dma-direct.h>
+#include <linux/dma-map-ops.h>
#include <asm/mtrr.h>
-#include <asm/pgtable.h>
#include <asm/proto.h>
#include <asm/iommu.h>
#include <asm/gart.h>
-#include <asm/cacheflush.h>
-#include <asm/swiotlb.h>
+#include <asm/set_memory.h>
#include <asm/dma.h>
-#include <asm/k8.h>
+#include <asm/amd/nb.h>
#include <asm/x86_init.h>
static unsigned long iommu_bus_base; /* GART remapping area (physical) */
@@ -48,14 +48,12 @@ static unsigned long iommu_pages; /* .. and in pages */
static u32 *iommu_gatt_base; /* Remapping table */
-static dma_addr_t bad_dma_addr;
-
/*
* If this is disabled the IOMMU will use an optimized flushing strategy
* of only flushing when a mapping is reused. With it true the GART is
* flushed for every mapping. Problem is that doing the lazy flush seems
* to trigger bugs with some popular PCI cards, in particular 3ware (but
- * has been also also seen with Qlogic at least).
+ * has been also seen with Qlogic at least).
*/
static int iommu_fullflush = 1;
@@ -72,14 +70,15 @@ static u32 gart_unmapped_entry;
(((x) & 0xfffff000) | (((x) >> 32) << 4) | GPTE_VALID | GPTE_COHERENT)
#define GPTE_DECODE(x) (((x) & 0xfffff000) | (((u64)(x) & 0xff0) << 28))
-#define EMERGENCY_PAGES 32 /* = 128KB */
-
#ifdef CONFIG_AGP
#define AGPEXTERN extern
#else
#define AGPEXTERN
#endif
+/* GART can only remap to physical addresses < 1TB */
+#define GART_MAX_PHYS_ADDR (1ULL << 40)
+
/* backdoor interface to AGP driver */
AGPEXTERN int agp_memory_reserved;
AGPEXTERN __u32 *agp_gatt_table;
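/*
 * Stand-alone illustration of the GPTE format above: physical address
 * bits 12..31 stay in place and bits 32..39 are folded into PTE bits
 * 4..11, which is exactly why the GART can only target the first 1TB
 * of physical memory (GART_MAX_PHYS_ADDR). Compiles as plain user-space
 * C; GPTE_VALID/GPTE_COHERENT assumed to be 1 and 2 as in <asm/gart.h>.
 */
#include <assert.h>
#include <stdint.h>

#define GPTE_VALID	1u
#define GPTE_COHERENT	2u
#define ENC(x)	(((x) & 0xfffff000) | (((x) >> 32) << 4) | GPTE_VALID | GPTE_COHERENT)
#define DEC(x)	(((x) & 0xfffff000) | (((uint64_t)(x) & 0xff0) << 28))

int main(void)
{
	uint64_t phys = 0x7654321000ULL;	/* page aligned, < 1TB */
	uint32_t pte  = ENC(phys);

	assert(DEC(pte) == phys);		/* round-trips below 1TB */
	return 0;
}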
@@ -96,8 +95,7 @@ static unsigned long alloc_iommu(struct device *dev, int size,
base_index = ALIGN(iommu_bus_base & dma_get_seg_boundary(dev),
PAGE_SIZE) >> PAGE_SHIFT;
- boundary_size = ALIGN((u64)dma_get_seg_boundary(dev) + 1,
- PAGE_SIZE) >> PAGE_SHIFT;
+ boundary_size = dma_get_seg_boundary_nr_pages(dev, PAGE_SHIFT);
spin_lock_irqsave(&iommu_bitmap_lock, flags);
offset = iommu_area_alloc(iommu_gart_bitmap, iommu_pages, next_bit,
@@ -142,7 +140,7 @@ static void flush_gart(void)
spin_lock_irqsave(&iommu_bitmap_lock, flags);
if (need_flush) {
- k8_flush_garts();
+ amd_flush_garts();
need_flush = false;
}
spin_unlock_irqrestore(&iommu_bitmap_lock, flags);
@@ -150,9 +148,6 @@ static void flush_gart(void)
#ifdef CONFIG_IOMMU_LEAK
/* Debugging aid for drivers that don't free their IOMMU tables */
-static int leak_trace;
-static int iommu_leak_pages = 20;
-
static void dump_leak(void)
{
static int dump;
@@ -161,7 +156,7 @@ static void dump_leak(void)
return;
dump = 1;
- show_stack(NULL, NULL);
+ show_stack(NULL, NULL, KERN_ERR);
debug_dma_dump_mappings(NULL);
}
#endif
@@ -179,14 +174,6 @@ static void iommu_full(struct device *dev, size_t size, int dir)
*/
dev_err(dev, "PCI-DMA: Out of IOMMU space for %lu bytes\n", size);
-
- if (size > PAGE_SIZE*EMERGENCY_PAGES) {
- if (dir == PCI_DMA_FROMDEVICE || dir == PCI_DMA_BIDIRECTIONAL)
- panic("PCI-DMA: Memory would be corrupted\n");
- if (dir == PCI_DMA_TODEVICE || dir == PCI_DMA_BIDIRECTIONAL)
- panic(KERN_ERR
- "PCI-DMA: Random memory would be DMAed\n");
- }
#ifdef CONFIG_IOMMU_LEAK
dump_leak();
#endif
@@ -195,13 +182,13 @@ static void iommu_full(struct device *dev, size_t size, int dir)
static inline int
need_iommu(struct device *dev, unsigned long addr, size_t size)
{
- return force_iommu || !dma_capable(dev, addr, size);
+ return force_iommu || !dma_capable(dev, addr, size, true);
}
static inline int
nonforced_iommu(struct device *dev, unsigned long addr, size_t size)
{
- return !dma_capable(dev, addr, size);
+ return !dma_capable(dev, addr, size, true);
}
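/*
 * For context: dma_capable() (from <linux/dma-direct.h>) decides
 * whether the device can reach [addr, addr + size) without remapping,
 * essentially (paraphrased, see the header for the exact test):
 *
 *	dma_addr_t end = addr + size - 1;
 *	return end <= min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
 *
 * The new fourth argument ("true" above) declares the range to be RAM,
 * which enables an extra low-memory sanity check on configurations with
 * a 32-bit dma_addr_t. So need_iommu() reads: route through the GART
 * whenever the device cannot address the buffer directly, or when
 * force_iommu overrides.
 */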
/* Map a single continuous physical area into the IOMMU.
@@ -211,16 +198,20 @@ static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem,
size_t size, int dir, unsigned long align_mask)
{
unsigned long npages = iommu_num_pages(phys_mem, size, PAGE_SIZE);
- unsigned long iommu_page = alloc_iommu(dev, npages, align_mask);
+ unsigned long iommu_page;
int i;
+ if (unlikely(phys_mem + size > GART_MAX_PHYS_ADDR))
+ return DMA_MAPPING_ERROR;
+
+ iommu_page = alloc_iommu(dev, npages, align_mask);
if (iommu_page == -1) {
if (!nonforced_iommu(dev, phys_mem, size))
return phys_mem;
if (panic_on_overflow)
panic("dma_map_area overflow %lu bytes\n", size);
iommu_full(dev, size, dir);
- return bad_dma_addr;
+ return DMA_MAPPING_ERROR;
}
for (i = 0; i < npages; i++) {
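/*
 * With the driver-private bad_dma_addr sentinel gone, failures are
 * reported via the generic DMA_MAPPING_ERROR and the ->mapping_error
 * callback disappears further below. Callers check the standard way
 * (illustrative caller, hypothetical names):
 */
static int example_map_sketch(struct device *dev, struct page *page,
			      size_t len)
{
	dma_addr_t addr = dma_map_page(dev, page, 0, len, DMA_TO_DEVICE);

	if (dma_mapping_error(dev, addr))
		return -ENOMEM;		/* no driver-specific sentinel */

	/* ... program the device with addr, later dma_unmap_page() ... */
	return 0;
}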
@@ -231,16 +222,14 @@ static dma_addr_t dma_map_area(struct device *dev, dma_addr_t phys_mem,
}
/* Map a single area into the IOMMU */
-static dma_addr_t gart_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
+static dma_addr_t gart_map_phys(struct device *dev, phys_addr_t paddr,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
{
unsigned long bus;
- phys_addr_t paddr = page_to_phys(page) + offset;
- if (!dev)
- dev = &x86_dma_fallback_dev;
+ if (unlikely(attrs & DMA_ATTR_MMIO))
+ return DMA_MAPPING_ERROR;
if (!need_iommu(dev, paddr, size))
return paddr;
@@ -254,15 +243,23 @@ static dma_addr_t gart_map_page(struct device *dev, struct page *page,
/*
* Free a DMA mapping.
*/
-static void gart_unmap_page(struct device *dev, dma_addr_t dma_addr,
+static void gart_unmap_phys(struct device *dev, dma_addr_t dma_addr,
size_t size, enum dma_data_direction dir,
- struct dma_attrs *attrs)
+ unsigned long attrs)
{
unsigned long iommu_page;
int npages;
int i;
- if (dma_addr < iommu_bus_base + EMERGENCY_PAGES*PAGE_SIZE ||
+ if (WARN_ON_ONCE(dma_addr == DMA_MAPPING_ERROR))
+ return;
+
+ /*
+ * This driver will not always use a GART mapping, but might have
+ * created a direct mapping instead. If that is the case there is
+ * nothing to unmap here.
+ */
+ if (dma_addr < iommu_bus_base ||
dma_addr >= iommu_bus_base + iommu_size)
return;
@@ -278,7 +275,7 @@ static void gart_unmap_page(struct device *dev, dma_addr_t dma_addr,
* Wrapper for pci_unmap_single working with scatterlists.
*/
static void gart_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
- enum dma_data_direction dir, struct dma_attrs *attrs)
+ enum dma_data_direction dir, unsigned long attrs)
{
struct scatterlist *s;
int i;
@@ -286,7 +283,7 @@ static void gart_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
for_each_sg(sg, s, nents, i) {
if (!s->dma_length || !s->length)
break;
- gart_unmap_page(dev, s->dma_address, s->dma_length, dir, NULL);
+ gart_unmap_phys(dev, s->dma_address, s->dma_length, dir, 0);
}
}
@@ -306,9 +303,9 @@ static int dma_map_sg_nonforce(struct device *dev, struct scatterlist *sg,
if (nonforced_iommu(dev, addr, s->length)) {
addr = dma_map_area(dev, addr, s->length, dir, 0);
- if (addr == bad_dma_addr) {
+ if (addr == DMA_MAPPING_ERROR) {
if (i > 0)
- gart_unmap_sg(dev, sg, i, dir, NULL);
+ gart_unmap_sg(dev, sg, i, dir, 0);
nents = 0;
sg[0].dma_length = 0;
break;
@@ -333,7 +330,7 @@ static int __dma_map_cont(struct device *dev, struct scatterlist *start,
int i;
if (iommu_start == -1)
- return -1;
+ return -ENOMEM;
for_each_sg(start, s, nelems, i) {
unsigned long pages, addr;
@@ -379,19 +376,16 @@ dma_map_cont(struct device *dev, struct scatterlist *start, int nelems,
* Merge chunks that have page aligned sizes into a continuous mapping.
*/
static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents,
- enum dma_data_direction dir, struct dma_attrs *attrs)
+ enum dma_data_direction dir, unsigned long attrs)
{
struct scatterlist *s, *ps, *start_sg, *sgmap;
- int need = 0, nextneed, i, out, start;
+ int need = 0, nextneed, i, out, start, ret;
unsigned long pages = 0;
unsigned int seg_size;
unsigned int max_seg_size;
if (nents == 0)
- return 0;
-
- if (!dev)
- dev = &x86_dma_fallback_dev;
+ return -EINVAL;
out = 0;
start = 0;
@@ -419,8 +413,9 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents,
if (!iommu_merge || !nextneed || !need || s->offset ||
(s->length + seg_size > max_seg_size) ||
(ps->offset + ps->length) % PAGE_SIZE) {
- if (dma_map_cont(dev, start_sg, i - start,
- sgmap, pages, need) < 0)
+ ret = dma_map_cont(dev, start_sg, i - start,
+ sgmap, pages, need);
+ if (ret < 0)
goto error;
out++;
@@ -437,7 +432,8 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents,
pages += iommu_num_pages(s->offset, s->length, PAGE_SIZE);
ps = s;
}
- if (dma_map_cont(dev, start_sg, i - start, sgmap, pages, need) < 0)
+ ret = dma_map_cont(dev, start_sg, i - start, sgmap, pages, need);
+ if (ret < 0)
goto error;
out++;
flush_gart();
@@ -449,7 +445,7 @@ static int gart_map_sg(struct device *dev, struct scatterlist *sg, int nents,
error:
flush_gart();
- gart_unmap_sg(dev, sg, out, dir, NULL);
+ gart_unmap_sg(dev, sg, out, dir, 0);
/* When it was forced or merged try again in a dumb way */
if (force_iommu || iommu_merge) {
@@ -461,54 +457,39 @@ error:
panic("dma_map_sg: overflow on %lu pages\n", pages);
iommu_full(dev, pages << PAGE_SHIFT, dir);
- for_each_sg(sg, s, nents, i)
- s->dma_address = bad_dma_addr;
- return 0;
+ return ret;
}
/* allocate and map a coherent mapping */
static void *
gart_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_addr,
- gfp_t flag)
+ gfp_t flag, unsigned long attrs)
{
- dma_addr_t paddr;
- unsigned long align_mask;
- struct page *page;
-
- if (force_iommu && !(flag & GFP_DMA)) {
- flag &= ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32);
- page = alloc_pages(flag | __GFP_ZERO, get_order(size));
- if (!page)
- return NULL;
-
- align_mask = (1UL << get_order(size)) - 1;
- paddr = dma_map_area(dev, page_to_phys(page), size,
- DMA_BIDIRECTIONAL, align_mask);
-
- flush_gart();
- if (paddr != bad_dma_addr) {
- *dma_addr = paddr;
- return page_address(page);
- }
- __free_pages(page, get_order(size));
- } else
- return dma_generic_alloc_coherent(dev, size, dma_addr, flag);
+ void *vaddr;
+
+ vaddr = dma_direct_alloc(dev, size, dma_addr, flag, attrs);
+ if (!vaddr ||
+ !force_iommu || dev->coherent_dma_mask <= DMA_BIT_MASK(24))
+ return vaddr;
+ *dma_addr = dma_map_area(dev, virt_to_phys(vaddr), size,
+ DMA_BIDIRECTIONAL, (1UL << get_order(size)) - 1);
+ flush_gart();
+ if (unlikely(*dma_addr == DMA_MAPPING_ERROR))
+ goto out_free;
+ return vaddr;
+out_free:
+ dma_direct_free(dev, size, vaddr, *dma_addr, attrs);
return NULL;
}
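/*
 * Note on the align_mask passed above: dma_alloc_coherent() guarantees
 * a DMA address aligned to the smallest power-of-two order that covers
 * the requested size, so the GART mapping must be placed with the same
 * alignment as the backing allocation. Worked example, assuming 4K
 * pages:
 *
 *	size = 16K  =>  get_order(16K) = 2
 *	align_mask  =  (1UL << 2) - 1  =  0x3 pages
 *
 * i.e. alloc_iommu() places the 4-page remapping on a 4-page boundary.
 */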
/* free a coherent mapping */
static void
gart_free_coherent(struct device *dev, size_t size, void *vaddr,
- dma_addr_t dma_addr)
+ dma_addr_t dma_addr, unsigned long attrs)
{
- gart_unmap_page(dev, dma_addr, size, DMA_BIDIRECTIONAL, NULL);
- free_pages((unsigned long)vaddr, get_order(size));
-}
-
-static int gart_mapping_error(struct device *dev, dma_addr_t dma_addr)
-{
- return (dma_addr == bad_dma_addr);
+ gart_unmap_phys(dev, dma_addr, size, DMA_BIDIRECTIONAL, 0);
+ dma_direct_free(dev, size, vaddr, dma_addr, attrs);
}
static int no_agp;
@@ -524,13 +505,12 @@ static __init unsigned long check_iommu_size(unsigned long aper, u64 aper_size)
}
a = aper + iommu_size;
- iommu_size -= round_up(a, PMD_PAGE_SIZE) - a;
+ iommu_size -= round_up(a, PMD_SIZE) - a;
if (iommu_size < 64*1024*1024) {
- pr_warning(
- "PCI-DMA: Warning: Small IOMMU %luMB."
+ pr_warn("PCI-DMA: Warning: Small IOMMU %luMB."
" Consider increasing the AGP aperture in BIOS\n",
- iommu_size >> 20);
+ iommu_size >> 20);
}
return iommu_size;
@@ -560,14 +540,17 @@ static void enable_gart_translations(void)
{
int i;
- for (i = 0; i < num_k8_northbridges; i++) {
- struct pci_dev *dev = k8_northbridges[i];
+ if (!amd_nb_has_feature(AMD_NB_GART))
+ return;
+
+ for (i = 0; i < amd_nb_num(); i++) {
+ struct pci_dev *dev = node_to_amd_nb(i)->misc;
enable_gart_translation(dev, __pa(agp_gatt_table));
}
/* Flush the GART-TLB to remove stale entries */
- k8_flush_garts();
+ amd_flush_garts();
}
/*
@@ -585,72 +568,66 @@ void set_up_gart_resume(u32 aper_order, u32 aper_alloc)
aperture_alloc = aper_alloc;
}
-static void gart_fixup_northbridges(struct sys_device *dev)
+static void gart_fixup_northbridges(void)
{
int i;
if (!fix_up_north_bridges)
return;
+ if (!amd_nb_has_feature(AMD_NB_GART))
+ return;
+
pr_info("PCI-DMA: Restoring GART aperture settings\n");
- for (i = 0; i < num_k8_northbridges; i++) {
- struct pci_dev *dev = k8_northbridges[i];
+ for (i = 0; i < amd_nb_num(); i++) {
+ struct pci_dev *dev = node_to_amd_nb(i)->misc;
/*
* Don't enable translations just yet. That is the next
* step. Restore the pre-suspend aperture settings.
*/
- pci_write_config_dword(dev, AMD64_GARTAPERTURECTL, aperture_order << 1);
+ gart_set_size_and_enable(dev, aperture_order);
pci_write_config_dword(dev, AMD64_GARTAPERTUREBASE, aperture_alloc >> 25);
}
}
-static int gart_resume(struct sys_device *dev)
+static void gart_resume(void *data)
{
pr_info("PCI-DMA: Resuming GART IOMMU\n");
- gart_fixup_northbridges(dev);
+ gart_fixup_northbridges();
enable_gart_translations();
-
- return 0;
-}
-
-static int gart_suspend(struct sys_device *dev, pm_message_t state)
-{
- return 0;
}
-static struct sysdev_class gart_sysdev_class = {
- .name = "gart",
- .suspend = gart_suspend,
+static const struct syscore_ops gart_syscore_ops = {
.resume = gart_resume,
};
-static struct sys_device device_gart = {
- .cls = &gart_sysdev_class,
+static struct syscore gart_syscore = {
+ .ops = &gart_syscore_ops,
};
/*
* Private Northbridge GATT initialization in case we cannot use the
* AGP driver for some reason.
*/
-static __init int init_k8_gatt(struct agp_kern_info *info)
+static __init int init_amd_gatt(struct agp_kern_info *info)
{
unsigned aper_size, gatt_size, new_aper_size;
unsigned aper_base, new_aper_base;
struct pci_dev *dev;
void *gatt;
- int i, error;
+ int i;
pr_info("PCI-DMA: Disabling AGP.\n");
aper_size = aper_base = info->aper_size = 0;
dev = NULL;
- for (i = 0; i < num_k8_northbridges; i++) {
- dev = k8_northbridges[i];
+ for (i = 0; i < amd_nb_num(); i++) {
+ dev = node_to_amd_nb(i)->misc;
new_aper_base = read_aperture(dev, &new_aper_size);
if (!new_aper_base)
goto nommu;
@@ -678,12 +655,7 @@ static __init int init_k8_gatt(struct agp_kern_info *info)
agp_gatt_table = gatt;
- error = sysdev_class_register(&gart_sysdev_class);
- if (!error)
- error = sysdev_register(&device_gart);
- if (error)
- panic("Could not register gart_sysdev -- "
- "would corrupt data on next suspend");
+ register_syscore(&gart_syscore);
flush_gart();
@@ -694,19 +666,23 @@ static __init int init_k8_gatt(struct agp_kern_info *info)
nommu:
/* Should not happen anymore */
- pr_warning("PCI-DMA: More than 4GB of RAM and no IOMMU\n"
- "falling back to iommu=soft.\n");
+ pr_warn("PCI-DMA: More than 4GB of RAM and no IOMMU - falling back to iommu=soft.\n");
return -1;
}
-static struct dma_map_ops gart_dma_ops = {
+static const struct dma_map_ops gart_dma_ops = {
.map_sg = gart_map_sg,
.unmap_sg = gart_unmap_sg,
- .map_page = gart_map_page,
- .unmap_page = gart_unmap_page,
- .alloc_coherent = gart_alloc_coherent,
- .free_coherent = gart_free_coherent,
- .mapping_error = gart_mapping_error,
+ .map_phys = gart_map_phys,
+ .unmap_phys = gart_unmap_phys,
+ .alloc = gart_alloc_coherent,
+ .free = gart_free_coherent,
+ .mmap = dma_common_mmap,
+ .get_sgtable = dma_common_get_sgtable,
+ .dma_supported = dma_direct_supported,
+ .get_required_mask = dma_direct_get_required_mask,
+ .alloc_pages_op = dma_direct_alloc_pages,
+ .free_pages = dma_direct_free_pages,
};
static void gart_iommu_shutdown(void)
@@ -718,10 +694,13 @@ static void gart_iommu_shutdown(void)
if (!no_agp)
return;
- for (i = 0; i < num_k8_northbridges; i++) {
+ if (!amd_nb_has_feature(AMD_NB_GART))
+ return;
+
+ for (i = 0; i < amd_nb_num(); i++) {
u32 ctl;
- dev = k8_northbridges[i];
+ dev = node_to_amd_nb(i)->misc;
pci_read_config_dword(dev, AMD64_GARTAPERTURECTL, &ctl);
ctl &= ~GARTEN;
@@ -737,16 +716,15 @@ int __init gart_iommu_init(void)
unsigned long aper_base, aper_size;
unsigned long start_pfn, end_pfn;
unsigned long scratch;
- long i;
- if (num_k8_northbridges == 0)
+ if (!amd_nb_has_feature(AMD_NB_GART))
return 0;
#ifndef CONFIG_AGP_AMD64
no_agp = 1;
#else
/* Makefile puts PCI initialization via subsys_initcall first. */
- /* Add other K8 AGP bridge drivers here */
+ /* Add other AMD AGP bridge drivers here */
no_agp = no_agp ||
(agp_amd64_init() < 0) ||
(agp_copy_info(agp_bridge, &info) < 0);
@@ -755,10 +733,10 @@ int __init gart_iommu_init(void)
if (no_iommu ||
(!force_iommu && max_pfn <= MAX_DMA32_PFN) ||
!gart_iommu_aperture ||
- (no_agp && init_k8_gatt(&info) < 0)) {
+ (no_agp && init_amd_gatt(&info) < 0)) {
if (max_pfn > MAX_DMA32_PFN) {
- pr_warning("More than 4GB of memory but GART IOMMU not available.\n");
- pr_warning("falling back to iommu=soft.\n");
+ pr_warn("More than 4GB of memory but GART IOMMU not available.\n");
+ pr_warn("falling back to iommu=soft.\n");
}
return 0;
}
@@ -768,10 +746,10 @@ int __init gart_iommu_init(void)
aper_base = info.aper_base;
end_pfn = (aper_base>>PAGE_SHIFT) + (aper_size>>PAGE_SHIFT);
- if (end_pfn > max_low_pfn_mapped) {
- start_pfn = (aper_base>>PAGE_SHIFT);
- init_memory_mapping(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT);
- }
+ start_pfn = PFN_DOWN(aper_base);
+ if (!pfn_range_is_mapped(start_pfn, end_pfn))
+ init_memory_mapping(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT,
+ PAGE_KERNEL);
pr_info("PCI-DMA: using GART IOMMU.\n");
iommu_size = check_iommu_size(info.aper_base, aper_size);
@@ -782,29 +760,12 @@ int __init gart_iommu_init(void)
if (!iommu_gart_bitmap)
panic("Cannot allocate iommu bitmap\n");
-#ifdef CONFIG_IOMMU_LEAK
- if (leak_trace) {
- int ret;
-
- ret = dma_debug_resize_entries(iommu_pages);
- if (ret)
- pr_debug("PCI-DMA: Cannot trace all the entries\n");
- }
-#endif
-
- /*
- * Out of IOMMU space handling.
- * Reserve some invalid pages at the beginning of the GART.
- */
- bitmap_set(iommu_gart_bitmap, 0, EMERGENCY_PAGES);
-
pr_info("PCI-DMA: Reserving %luMB of IOMMU area in the AGP aperture\n",
iommu_size >> 20);
agp_memory_reserved = iommu_size;
iommu_start = aper_size - iommu_size;
iommu_bus_base = info.aper_base + iommu_start;
- bad_dma_addr = iommu_bus_base;
iommu_gatt_base = agp_gatt_table + (iommu_start>>PAGE_SHIFT);
/*
@@ -820,7 +781,7 @@ int __init gart_iommu_init(void)
iommu_size >> PAGE_SHIFT);
/*
* Tricky. The GART table remaps the physical memory range,
- * so the CPU wont notice potential aliases and if the memory
+ * so the CPU won't notice potential aliases and if the memory
* is remapped to UC later on, we might surprise the PCI devices
* with a stray writeout of a cacheline. So play it safe and
* do an explicit, full-scale wbinvd() _after_ having marked all
@@ -846,13 +807,11 @@ int __init gart_iommu_init(void)
if (!scratch)
panic("Cannot allocate iommu scratch page");
gart_unmapped_entry = GPTE_ENCODE(__pa(scratch));
- for (i = EMERGENCY_PAGES; i < iommu_pages; i++)
- iommu_gatt_base[i] = gart_unmapped_entry;
flush_gart();
dma_ops = &gart_dma_ops;
x86_platform.iommu_shutdown = gart_iommu_shutdown;
- swiotlb = 0;
+ x86_swiotlb_enable = false;
return 0;
}
@@ -861,16 +820,6 @@ void __init gart_parse_options(char *p)
{
int arg;
-#ifdef CONFIG_IOMMU_LEAK
- if (!strncmp(p, "leak", 4)) {
- leak_trace = 1;
- p += 4;
- if (*p == '=')
- ++p;
- if (isdigit(*p) && get_option(&p, &arg))
- iommu_leak_pages = arg;
- }
-#endif
if (isdigit(*p) && get_option(&p, &arg))
iommu_size = arg;
if (!strncmp(p, "fullflush", 9))
diff --git a/arch/x86/kernel/amd_iommu.c b/arch/x86/kernel/amd_iommu.c
deleted file mode 100644
index 0d20286d78c6..000000000000
--- a/arch/x86/kernel/amd_iommu.c
+++ /dev/null
@@ -1,2629 +0,0 @@
-/*
- * Copyright (C) 2007-2009 Advanced Micro Devices, Inc.
- * Author: Joerg Roedel <joerg.roedel@amd.com>
- * Leo Duran <leo.duran@amd.com>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License version 2 as published
- * by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- */
-
-#include <linux/pci.h>
-#include <linux/bitmap.h>
-#include <linux/slab.h>
-#include <linux/debugfs.h>
-#include <linux/scatterlist.h>
-#include <linux/dma-mapping.h>
-#include <linux/iommu-helper.h>
-#include <linux/iommu.h>
-#include <asm/proto.h>
-#include <asm/iommu.h>
-#include <asm/gart.h>
-#include <asm/amd_iommu_proto.h>
-#include <asm/amd_iommu_types.h>
-#include <asm/amd_iommu.h>
-
-#define CMD_SET_TYPE(cmd, t) ((cmd)->data[1] |= ((t) << 28))
-
-#define EXIT_LOOP_COUNT 10000000
-
-static DEFINE_RWLOCK(amd_iommu_devtable_lock);
-
-/* A list of preallocated protection domains */
-static LIST_HEAD(iommu_pd_list);
-static DEFINE_SPINLOCK(iommu_pd_list_lock);
-
-/*
- * Domain for untranslated devices - only allocated
- * if iommu=pt passed on kernel cmd line.
- */
-static struct protection_domain *pt_domain;
-
-static struct iommu_ops amd_iommu_ops;
-
-/*
- * general struct to manage commands sent to an IOMMU
- */
-struct iommu_cmd {
- u32 data[4];
-};
-
-static void reset_iommu_command_buffer(struct amd_iommu *iommu);
-static void update_domain(struct protection_domain *domain);
-
-/****************************************************************************
- *
- * Helper functions
- *
- ****************************************************************************/
-
-static inline u16 get_device_id(struct device *dev)
-{
- struct pci_dev *pdev = to_pci_dev(dev);
-
- return calc_devid(pdev->bus->number, pdev->devfn);
-}
-
-static struct iommu_dev_data *get_dev_data(struct device *dev)
-{
- return dev->archdata.iommu;
-}
-
-/*
- * In this function the list of preallocated protection domains is traversed to
- * find the domain for a specific device
- */
-static struct dma_ops_domain *find_protection_domain(u16 devid)
-{
- struct dma_ops_domain *entry, *ret = NULL;
- unsigned long flags;
- u16 alias = amd_iommu_alias_table[devid];
-
- if (list_empty(&iommu_pd_list))
- return NULL;
-
- spin_lock_irqsave(&iommu_pd_list_lock, flags);
-
- list_for_each_entry(entry, &iommu_pd_list, list) {
- if (entry->target_dev == devid ||
- entry->target_dev == alias) {
- ret = entry;
- break;
- }
- }
-
- spin_unlock_irqrestore(&iommu_pd_list_lock, flags);
-
- return ret;
-}
-
-/*
- * This function checks if the driver got a valid device from the caller to
- * avoid dereferencing invalid pointers.
- */
-static bool check_device(struct device *dev)
-{
- u16 devid;
-
- if (!dev || !dev->dma_mask)
- return false;
-
- /* No device or no PCI device */
- if (dev->bus != &pci_bus_type)
- return false;
-
- devid = get_device_id(dev);
-
- /* Out of our scope? */
- if (devid > amd_iommu_last_bdf)
- return false;
-
- if (amd_iommu_rlookup_table[devid] == NULL)
- return false;
-
- return true;
-}
-
-static int iommu_init_device(struct device *dev)
-{
- struct iommu_dev_data *dev_data;
- struct pci_dev *pdev;
- u16 devid, alias;
-
- if (dev->archdata.iommu)
- return 0;
-
- dev_data = kzalloc(sizeof(*dev_data), GFP_KERNEL);
- if (!dev_data)
- return -ENOMEM;
-
- dev_data->dev = dev;
-
- devid = get_device_id(dev);
- alias = amd_iommu_alias_table[devid];
- pdev = pci_get_bus_and_slot(PCI_BUS(alias), alias & 0xff);
- if (pdev)
- dev_data->alias = &pdev->dev;
-
- atomic_set(&dev_data->bind, 0);
-
- dev->archdata.iommu = dev_data;
-
-
- return 0;
-}
-
-static void iommu_uninit_device(struct device *dev)
-{
- kfree(dev->archdata.iommu);
-}
-
-void __init amd_iommu_uninit_devices(void)
-{
- struct pci_dev *pdev = NULL;
-
- for_each_pci_dev(pdev) {
-
- if (!check_device(&pdev->dev))
- continue;
-
- iommu_uninit_device(&pdev->dev);
- }
-}
-
-int __init amd_iommu_init_devices(void)
-{
- struct pci_dev *pdev = NULL;
- int ret = 0;
-
- for_each_pci_dev(pdev) {
-
- if (!check_device(&pdev->dev))
- continue;
-
- ret = iommu_init_device(&pdev->dev);
- if (ret)
- goto out_free;
- }
-
- return 0;
-
-out_free:
-
- amd_iommu_uninit_devices();
-
- return ret;
-}
-#ifdef CONFIG_AMD_IOMMU_STATS
-
-/*
- * Initialization code for statistics collection
- */
-
-DECLARE_STATS_COUNTER(compl_wait);
-DECLARE_STATS_COUNTER(cnt_map_single);
-DECLARE_STATS_COUNTER(cnt_unmap_single);
-DECLARE_STATS_COUNTER(cnt_map_sg);
-DECLARE_STATS_COUNTER(cnt_unmap_sg);
-DECLARE_STATS_COUNTER(cnt_alloc_coherent);
-DECLARE_STATS_COUNTER(cnt_free_coherent);
-DECLARE_STATS_COUNTER(cross_page);
-DECLARE_STATS_COUNTER(domain_flush_single);
-DECLARE_STATS_COUNTER(domain_flush_all);
-DECLARE_STATS_COUNTER(alloced_io_mem);
-DECLARE_STATS_COUNTER(total_map_requests);
-
-static struct dentry *stats_dir;
-static struct dentry *de_fflush;
-
-static void amd_iommu_stats_add(struct __iommu_counter *cnt)
-{
- if (stats_dir == NULL)
- return;
-
- cnt->dent = debugfs_create_u64(cnt->name, 0444, stats_dir,
- &cnt->value);
-}
-
-static void amd_iommu_stats_init(void)
-{
- stats_dir = debugfs_create_dir("amd-iommu", NULL);
- if (stats_dir == NULL)
- return;
-
- de_fflush = debugfs_create_bool("fullflush", 0444, stats_dir,
- (u32 *)&amd_iommu_unmap_flush);
-
- amd_iommu_stats_add(&compl_wait);
- amd_iommu_stats_add(&cnt_map_single);
- amd_iommu_stats_add(&cnt_unmap_single);
- amd_iommu_stats_add(&cnt_map_sg);
- amd_iommu_stats_add(&cnt_unmap_sg);
- amd_iommu_stats_add(&cnt_alloc_coherent);
- amd_iommu_stats_add(&cnt_free_coherent);
- amd_iommu_stats_add(&cross_page);
- amd_iommu_stats_add(&domain_flush_single);
- amd_iommu_stats_add(&domain_flush_all);
- amd_iommu_stats_add(&alloced_io_mem);
- amd_iommu_stats_add(&total_map_requests);
-}
-
-#endif
-
-/****************************************************************************
- *
- * Interrupt handling functions
- *
- ****************************************************************************/
-
-static void dump_dte_entry(u16 devid)
-{
- int i;
-
- for (i = 0; i < 8; ++i)
- pr_err("AMD-Vi: DTE[%d]: %08x\n", i,
- amd_iommu_dev_table[devid].data[i]);
-}
-
-static void dump_command(unsigned long phys_addr)
-{
- struct iommu_cmd *cmd = phys_to_virt(phys_addr);
- int i;
-
- for (i = 0; i < 4; ++i)
- pr_err("AMD-Vi: CMD[%d]: %08x\n", i, cmd->data[i]);
-}
-
-static void iommu_print_event(struct amd_iommu *iommu, void *__evt)
-{
- u32 *event = __evt;
- int type = (event[1] >> EVENT_TYPE_SHIFT) & EVENT_TYPE_MASK;
- int devid = (event[0] >> EVENT_DEVID_SHIFT) & EVENT_DEVID_MASK;
- int domid = (event[1] >> EVENT_DOMID_SHIFT) & EVENT_DOMID_MASK;
- int flags = (event[1] >> EVENT_FLAGS_SHIFT) & EVENT_FLAGS_MASK;
- u64 address = (u64)(((u64)event[3]) << 32) | event[2];
-
- printk(KERN_ERR "AMD-Vi: Event logged [");
-
- switch (type) {
- case EVENT_TYPE_ILL_DEV:
- printk("ILLEGAL_DEV_TABLE_ENTRY device=%02x:%02x.%x "
- "address=0x%016llx flags=0x%04x]\n",
- PCI_BUS(devid), PCI_SLOT(devid), PCI_FUNC(devid),
- address, flags);
- dump_dte_entry(devid);
- break;
- case EVENT_TYPE_IO_FAULT:
- printk("IO_PAGE_FAULT device=%02x:%02x.%x "
- "domain=0x%04x address=0x%016llx flags=0x%04x]\n",
- PCI_BUS(devid), PCI_SLOT(devid), PCI_FUNC(devid),
- domid, address, flags);
- break;
- case EVENT_TYPE_DEV_TAB_ERR:
- printk("DEV_TAB_HARDWARE_ERROR device=%02x:%02x.%x "
- "address=0x%016llx flags=0x%04x]\n",
- PCI_BUS(devid), PCI_SLOT(devid), PCI_FUNC(devid),
- address, flags);
- break;
- case EVENT_TYPE_PAGE_TAB_ERR:
- printk("PAGE_TAB_HARDWARE_ERROR device=%02x:%02x.%x "
- "domain=0x%04x address=0x%016llx flags=0x%04x]\n",
- PCI_BUS(devid), PCI_SLOT(devid), PCI_FUNC(devid),
- domid, address, flags);
- break;
- case EVENT_TYPE_ILL_CMD:
- printk("ILLEGAL_COMMAND_ERROR address=0x%016llx]\n", address);
- iommu->reset_in_progress = true;
- reset_iommu_command_buffer(iommu);
- dump_command(address);
- break;
- case EVENT_TYPE_CMD_HARD_ERR:
- printk("COMMAND_HARDWARE_ERROR address=0x%016llx "
- "flags=0x%04x]\n", address, flags);
- break;
- case EVENT_TYPE_IOTLB_INV_TO:
- printk("IOTLB_INV_TIMEOUT device=%02x:%02x.%x "
- "address=0x%016llx]\n",
- PCI_BUS(devid), PCI_SLOT(devid), PCI_FUNC(devid),
- address);
- break;
- case EVENT_TYPE_INV_DEV_REQ:
- printk("INVALID_DEVICE_REQUEST device=%02x:%02x.%x "
- "address=0x%016llx flags=0x%04x]\n",
- PCI_BUS(devid), PCI_SLOT(devid), PCI_FUNC(devid),
- address, flags);
- break;
- default:
- printk(KERN_ERR "UNKNOWN type=0x%02x]\n", type);
- }
-}
-
-static void iommu_poll_events(struct amd_iommu *iommu)
-{
- u32 head, tail;
- unsigned long flags;
-
- spin_lock_irqsave(&iommu->lock, flags);
-
- head = readl(iommu->mmio_base + MMIO_EVT_HEAD_OFFSET);
- tail = readl(iommu->mmio_base + MMIO_EVT_TAIL_OFFSET);
-
- while (head != tail) {
- iommu_print_event(iommu, iommu->evt_buf + head);
- head = (head + EVENT_ENTRY_SIZE) % iommu->evt_buf_size;
- }
-
- writel(head, iommu->mmio_base + MMIO_EVT_HEAD_OFFSET);
-
- spin_unlock_irqrestore(&iommu->lock, flags);
-}
-
-irqreturn_t amd_iommu_int_handler(int irq, void *data)
-{
- struct amd_iommu *iommu;
-
- for_each_iommu(iommu)
- iommu_poll_events(iommu);
-
- return IRQ_HANDLED;
-}
-
-/****************************************************************************
- *
- * IOMMU command queuing functions
- *
- ****************************************************************************/
-
-/*
- * Writes the command to the IOMMUs command buffer and informs the
- * hardware about the new command. Must be called with iommu->lock held.
- */
-static int __iommu_queue_command(struct amd_iommu *iommu, struct iommu_cmd *cmd)
-{
- u32 tail, head;
- u8 *target;
-
- WARN_ON(iommu->cmd_buf_size & CMD_BUFFER_UNINITIALIZED);
- tail = readl(iommu->mmio_base + MMIO_CMD_TAIL_OFFSET);
- target = iommu->cmd_buf + tail;
- memcpy_toio(target, cmd, sizeof(*cmd));
- tail = (tail + sizeof(*cmd)) % iommu->cmd_buf_size;
- head = readl(iommu->mmio_base + MMIO_CMD_HEAD_OFFSET);
- if (tail == head)
- return -ENOMEM;
- writel(tail, iommu->mmio_base + MMIO_CMD_TAIL_OFFSET);
-
- return 0;
-}
-
-/*
- * General queuing function for commands. Takes iommu->lock and calls
- * __iommu_queue_command().
- */
-static int iommu_queue_command(struct amd_iommu *iommu, struct iommu_cmd *cmd)
-{
- unsigned long flags;
- int ret;
-
- spin_lock_irqsave(&iommu->lock, flags);
- ret = __iommu_queue_command(iommu, cmd);
- if (!ret)
- iommu->need_sync = true;
- spin_unlock_irqrestore(&iommu->lock, flags);
-
- return ret;
-}
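/*
 * The command buffer being deleted here was a simple producer ring:
 * the CPU wrote commands at 'tail', the IOMMU consumed at 'head', and
 * the ring counted as full when advancing the tail would land on the
 * head. Stand-alone arithmetic (plain user-space C, sizes illustrative):
 */
#include <assert.h>

#define CMD_BUF_SIZE	4096
#define CMD_SIZE	16

static int ring_full(unsigned int head, unsigned int tail)
{
	return ((tail + CMD_SIZE) % CMD_BUF_SIZE) == head;
}

int main(void)
{
	assert(!ring_full(0, 16));			/* room left */
	assert(ring_full(0, CMD_BUF_SIZE - CMD_SIZE));	/* would catch up to head */
	return 0;
}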
-
-/*
- * This function waits until an IOMMU has completed a completion
- * wait command
- */
-static void __iommu_wait_for_completion(struct amd_iommu *iommu)
-{
- int ready = 0;
- unsigned status = 0;
- unsigned long i = 0;
-
- INC_STATS_COUNTER(compl_wait);
-
- while (!ready && (i < EXIT_LOOP_COUNT)) {
- ++i;
- /* wait for the bit to become one */
- status = readl(iommu->mmio_base + MMIO_STATUS_OFFSET);
- ready = status & MMIO_STATUS_COM_WAIT_INT_MASK;
- }
-
- /* set bit back to zero */
- status &= ~MMIO_STATUS_COM_WAIT_INT_MASK;
- writel(status, iommu->mmio_base + MMIO_STATUS_OFFSET);
-
- if (unlikely(i == EXIT_LOOP_COUNT))
- iommu->reset_in_progress = true;
-}
-
-/*
- * This function queues a completion wait command into the command
- * buffer of an IOMMU
- */
-static int __iommu_completion_wait(struct amd_iommu *iommu)
-{
- struct iommu_cmd cmd;
-
- memset(&cmd, 0, sizeof(cmd));
- cmd.data[0] = CMD_COMPL_WAIT_INT_MASK;
- CMD_SET_TYPE(&cmd, CMD_COMPL_WAIT);
-
- return __iommu_queue_command(iommu, &cmd);
-}
-
-/*
- * This function is called whenever we need to ensure that the IOMMU has
- * completed execution of all commands we sent. It sends a
- * COMPLETION_WAIT command and waits for it to finish. The IOMMU informs
- * us about that by writing a value to a physical address we pass with
- * the command.
- */
-static int iommu_completion_wait(struct amd_iommu *iommu)
-{
- int ret = 0;
- unsigned long flags;
-
- spin_lock_irqsave(&iommu->lock, flags);
-
- if (!iommu->need_sync)
- goto out;
-
- ret = __iommu_completion_wait(iommu);
-
- iommu->need_sync = false;
-
- if (ret)
- goto out;
-
- __iommu_wait_for_completion(iommu);
-
-out:
- spin_unlock_irqrestore(&iommu->lock, flags);
-
- if (iommu->reset_in_progress)
- reset_iommu_command_buffer(iommu);
-
- return 0;
-}
-
-static void iommu_flush_complete(struct protection_domain *domain)
-{
- int i;
-
- for (i = 0; i < amd_iommus_present; ++i) {
- if (!domain->dev_iommu[i])
- continue;
-
- /*
- * Devices of this domain are behind this IOMMU
- * We need to wait for completion of all commands.
- */
- iommu_completion_wait(amd_iommus[i]);
- }
-}
-
-/*
- * Command send function for invalidating a device table entry
- */
-static int iommu_flush_device(struct device *dev)
-{
- struct amd_iommu *iommu;
- struct iommu_cmd cmd;
- u16 devid;
-
- devid = get_device_id(dev);
- iommu = amd_iommu_rlookup_table[devid];
-
- /* Build command */
- memset(&cmd, 0, sizeof(cmd));
- CMD_SET_TYPE(&cmd, CMD_INV_DEV_ENTRY);
- cmd.data[0] = devid;
-
- return iommu_queue_command(iommu, &cmd);
-}
-
-static void __iommu_build_inv_iommu_pages(struct iommu_cmd *cmd, u64 address,
- u16 domid, int pde, int s)
-{
- memset(cmd, 0, sizeof(*cmd));
- address &= PAGE_MASK;
- CMD_SET_TYPE(cmd, CMD_INV_IOMMU_PAGES);
- cmd->data[1] |= domid;
- cmd->data[2] = lower_32_bits(address);
- cmd->data[3] = upper_32_bits(address);
- if (s) /* size bit - we flush more than one 4kb page */
- cmd->data[2] |= CMD_INV_IOMMU_PAGES_SIZE_MASK;
- if (pde) /* PDE bit - we want to flush everything, not only the PTEs */
- cmd->data[2] |= CMD_INV_IOMMU_PAGES_PDE_MASK;
-}
-
-/*
- * Generic command send function for invalidating TLB entries
- */
-static int iommu_queue_inv_iommu_pages(struct amd_iommu *iommu,
- u64 address, u16 domid, int pde, int s)
-{
- struct iommu_cmd cmd;
- int ret;
-
- __iommu_build_inv_iommu_pages(&cmd, address, domid, pde, s);
-
- ret = iommu_queue_command(iommu, &cmd);
-
- return ret;
-}
-
-/*
- * TLB invalidation function which is called from the mapping functions.
- * It invalidates a single PTE if the range to flush is within a single
- * page. Otherwise it flushes the whole TLB of the IOMMU.
- */
-static void __iommu_flush_pages(struct protection_domain *domain,
- u64 address, size_t size, int pde)
-{
- int s = 0, i;
- unsigned long pages = iommu_num_pages(address, size, PAGE_SIZE);
-
- address &= PAGE_MASK;
-
- if (pages > 1) {
- /*
- * If we have to flush more than one page, flush all
- * TLB entries for this domain
- */
- address = CMD_INV_IOMMU_ALL_PAGES_ADDRESS;
- s = 1;
- }
-
-
- for (i = 0; i < amd_iommus_present; ++i) {
- if (!domain->dev_iommu[i])
- continue;
-
- /*
- * Devices of this domain are behind this IOMMU
- * We need a TLB flush
- */
- iommu_queue_inv_iommu_pages(amd_iommus[i], address,
- domain->id, pde, s);
- }
-
- return;
-}
-
-static void iommu_flush_pages(struct protection_domain *domain,
- u64 address, size_t size)
-{
- __iommu_flush_pages(domain, address, size, 0);
-}
-
-/* Flush the whole IO/TLB for a given protection domain */
-static void iommu_flush_tlb(struct protection_domain *domain)
-{
- __iommu_flush_pages(domain, 0, CMD_INV_IOMMU_ALL_PAGES_ADDRESS, 0);
-}
-
-/* Flush the whole IO/TLB for a given protection domain - including PDE */
-static void iommu_flush_tlb_pde(struct protection_domain *domain)
-{
- __iommu_flush_pages(domain, 0, CMD_INV_IOMMU_ALL_PAGES_ADDRESS, 1);
-}
-
-
-/*
- * This function flushes the DTEs for all devices in domain
- */
-static void iommu_flush_domain_devices(struct protection_domain *domain)
-{
- struct iommu_dev_data *dev_data;
- unsigned long flags;
-
- spin_lock_irqsave(&domain->lock, flags);
-
- list_for_each_entry(dev_data, &domain->dev_list, list)
- iommu_flush_device(dev_data->dev);
-
- spin_unlock_irqrestore(&domain->lock, flags);
-}
-
-static void iommu_flush_all_domain_devices(void)
-{
- struct protection_domain *domain;
- unsigned long flags;
-
- spin_lock_irqsave(&amd_iommu_pd_lock, flags);
-
- list_for_each_entry(domain, &amd_iommu_pd_list, list) {
- iommu_flush_domain_devices(domain);
- iommu_flush_complete(domain);
- }
-
- spin_unlock_irqrestore(&amd_iommu_pd_lock, flags);
-}
-
-void amd_iommu_flush_all_devices(void)
-{
- iommu_flush_all_domain_devices();
-}
-
-/*
- * This function uses heavy locking and may disable irqs for some time. But
- * this is no issue because it is only called during resume.
- */
-void amd_iommu_flush_all_domains(void)
-{
- struct protection_domain *domain;
- unsigned long flags;
-
- spin_lock_irqsave(&amd_iommu_pd_lock, flags);
-
- list_for_each_entry(domain, &amd_iommu_pd_list, list) {
- spin_lock(&domain->lock);
- iommu_flush_tlb_pde(domain);
- iommu_flush_complete(domain);
- spin_unlock(&domain->lock);
- }
-
- spin_unlock_irqrestore(&amd_iommu_pd_lock, flags);
-}
-
-static void reset_iommu_command_buffer(struct amd_iommu *iommu)
-{
- pr_err("AMD-Vi: Resetting IOMMU command buffer\n");
-
- if (iommu->reset_in_progress)
- panic("AMD-Vi: ILLEGAL_COMMAND_ERROR while resetting command buffer\n");
-
- amd_iommu_reset_cmd_buffer(iommu);
- amd_iommu_flush_all_devices();
- amd_iommu_flush_all_domains();
-
- iommu->reset_in_progress = false;
-}
-
-/****************************************************************************
- *
- * The functions below are used to create the page table mappings for
- * unity mapped regions.
- *
- ****************************************************************************/
-
-/*
- * This function is used to add another level to an IO page table. Adding
- * another level increases the size of the address space by 9 bits to a size up
- * to 64 bits.
- */
-static bool increase_address_space(struct protection_domain *domain,
- gfp_t gfp)
-{
- u64 *pte;
-
- if (domain->mode == PAGE_MODE_6_LEVEL)
- /* address space already 64 bit large */
- return false;
-
- pte = (void *)get_zeroed_page(gfp);
- if (!pte)
- return false;
-
- *pte = PM_LEVEL_PDE(domain->mode,
- virt_to_phys(domain->pt_root));
- domain->pt_root = pte;
- domain->mode += 1;
- domain->updated = true;
-
- return true;
-}
-
-static u64 *alloc_pte(struct protection_domain *domain,
- unsigned long address,
- unsigned long page_size,
- u64 **pte_page,
- gfp_t gfp)
-{
- int level, end_lvl;
- u64 *pte, *page;
-
- BUG_ON(!is_power_of_2(page_size));
-
- while (address > PM_LEVEL_SIZE(domain->mode))
- increase_address_space(domain, gfp);
-
- level = domain->mode - 1;
- pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
- address = PAGE_SIZE_ALIGN(address, page_size);
- end_lvl = PAGE_SIZE_LEVEL(page_size);
-
- while (level > end_lvl) {
- if (!IOMMU_PTE_PRESENT(*pte)) {
- page = (u64 *)get_zeroed_page(gfp);
- if (!page)
- return NULL;
- *pte = PM_LEVEL_PDE(level, virt_to_phys(page));
- }
-
- /* No level skipping support yet */
- if (PM_PTE_LEVEL(*pte) != level)
- return NULL;
-
- level -= 1;
-
- pte = IOMMU_PTE_PAGE(*pte);
-
- if (pte_page && level == end_lvl)
- *pte_page = pte;
-
- pte = &pte[PM_LEVEL_INDEX(level, address)];
- }
-
- return pte;
-}
-
-/*
- * This function checks if there is a PTE for a given dma address. If
- * there is one, it returns the pointer to it.
- */
-static u64 *fetch_pte(struct protection_domain *domain, unsigned long address)
-{
- int level;
- u64 *pte;
-
- if (address > PM_LEVEL_SIZE(domain->mode))
- return NULL;
-
- level = domain->mode - 1;
- pte = &domain->pt_root[PM_LEVEL_INDEX(level, address)];
-
- while (level > 0) {
-
- /* Not Present */
- if (!IOMMU_PTE_PRESENT(*pte))
- return NULL;
-
- /* Large PTE */
- if (PM_PTE_LEVEL(*pte) == 0x07) {
- unsigned long pte_mask, __pte;
-
- /*
- * If we have a series of large PTEs, make
- * sure to return a pointer to the first one.
- */
- pte_mask = PTE_PAGE_SIZE(*pte);
- pte_mask = ~((PAGE_SIZE_PTE_COUNT(pte_mask) << 3) - 1);
- __pte = ((unsigned long)pte) & pte_mask;
-
- return (u64 *)__pte;
- }
-
- /* No level skipping support yet */
- if (PM_PTE_LEVEL(*pte) != level)
- return NULL;
-
- level -= 1;
-
- /* Walk to the next level */
- pte = IOMMU_PTE_PAGE(*pte);
- pte = &pte[PM_LEVEL_INDEX(level, address)];
- }
-
- return pte;
-}
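/*
 * How the large-PTE rounding above works: this driver encoded a
 * power-of-two page size by replicating one level-7 entry across
 * consecutive slots of a table. Assuming the PAGE_SIZE_PTE_COUNT()
 * macro of this era, count = 2^((ffs(pgsize) - 12) % 9); e.g. a 32K
 * page occupies 2^((15 - 12) % 9) = 8 consecutive 8-byte entries, so
 *
 *	pte_mask = ~((8 << 3) - 1) = ~63
 *
 * rounds the PTE pointer down to the first entry of that series.
 */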
-
-/*
- * Generic mapping function: it maps a physical address into a DMA
- * address space, allocating the page table pages if necessary.
- * In the future it can be extended to a generic mapping function
- * supporting all features of AMD IOMMU page tables like level skipping
- * and full 64 bit address spaces.
- */
-static int iommu_map_page(struct protection_domain *dom,
- unsigned long bus_addr,
- unsigned long phys_addr,
- int prot,
- unsigned long page_size)
-{
- u64 __pte, *pte;
- int i, count;
-
- if (!(prot & IOMMU_PROT_MASK))
- return -EINVAL;
-
- bus_addr = PAGE_ALIGN(bus_addr);
- phys_addr = PAGE_ALIGN(phys_addr);
- count = PAGE_SIZE_PTE_COUNT(page_size);
- pte = alloc_pte(dom, bus_addr, page_size, NULL, GFP_KERNEL);
-
- for (i = 0; i < count; ++i)
- if (IOMMU_PTE_PRESENT(pte[i]))
- return -EBUSY;
-
- if (page_size > PAGE_SIZE) {
- __pte = PAGE_SIZE_PTE(phys_addr, page_size);
- __pte |= PM_LEVEL_ENC(7) | IOMMU_PTE_P | IOMMU_PTE_FC;
- } else
- __pte = phys_addr | IOMMU_PTE_P | IOMMU_PTE_FC;
-
- if (prot & IOMMU_PROT_IR)
- __pte |= IOMMU_PTE_IR;
- if (prot & IOMMU_PROT_IW)
- __pte |= IOMMU_PTE_IW;
-
- for (i = 0; i < count; ++i)
- pte[i] = __pte;
-
- update_domain(dom);
-
- return 0;
-}
-
-static unsigned long iommu_unmap_page(struct protection_domain *dom,
- unsigned long bus_addr,
- unsigned long page_size)
-{
- unsigned long long unmap_size, unmapped;
- u64 *pte;
-
- BUG_ON(!is_power_of_2(page_size));
-
- unmapped = 0;
-
- while (unmapped < page_size) {
-
- pte = fetch_pte(dom, bus_addr);
-
- if (!pte) {
- /*
- * No PTE for this address
- * move forward in 4kb steps
- */
- unmap_size = PAGE_SIZE;
- } else if (PM_PTE_LEVEL(*pte) == 0) {
- /* 4kb PTE found for this address */
- unmap_size = PAGE_SIZE;
- *pte = 0ULL;
- } else {
- int count, i;
-
- /* Large PTE found which maps this address */
- unmap_size = PTE_PAGE_SIZE(*pte);
- count = PAGE_SIZE_PTE_COUNT(unmap_size);
- for (i = 0; i < count; i++)
- pte[i] = 0ULL;
- }
-
- bus_addr = (bus_addr & ~(unmap_size - 1)) + unmap_size;
- unmapped += unmap_size;
- }
-
- BUG_ON(!is_power_of_2(unmapped));
-
- return unmapped;
-}
-
-/*
- * This function checks if a specific unity mapping entry is needed for
- * this specific IOMMU.
- */
-static int iommu_for_unity_map(struct amd_iommu *iommu,
- struct unity_map_entry *entry)
-{
- u16 bdf, i;
-
- for (i = entry->devid_start; i <= entry->devid_end; ++i) {
- bdf = amd_iommu_alias_table[i];
- if (amd_iommu_rlookup_table[bdf] == iommu)
- return 1;
- }
-
- return 0;
-}
-
-/*
- * This function actually applies the mapping to the page table of the
- * dma_ops domain.
- */
-static int dma_ops_unity_map(struct dma_ops_domain *dma_dom,
- struct unity_map_entry *e)
-{
- u64 addr;
- int ret;
-
- for (addr = e->address_start; addr < e->address_end;
- addr += PAGE_SIZE) {
- ret = iommu_map_page(&dma_dom->domain, addr, addr, e->prot,
- PAGE_SIZE);
- if (ret)
- return ret;
- /*
- * if unity mapping is in aperture range mark the page
- * as allocated in the aperture
- */
- if (addr < dma_dom->aperture_size)
- __set_bit(addr >> PAGE_SHIFT,
- dma_dom->aperture[0]->bitmap);
- }
-
- return 0;
-}
-
-/*
- * Init the unity mappings for a specific IOMMU in the system
- *
- * Basically iterates over all unity mapping entries and applies them to
- * the default domain DMA of that IOMMU if necessary.
- */
-static int iommu_init_unity_mappings(struct amd_iommu *iommu)
-{
- struct unity_map_entry *entry;
- int ret;
-
- list_for_each_entry(entry, &amd_iommu_unity_map, list) {
- if (!iommu_for_unity_map(iommu, entry))
- continue;
- ret = dma_ops_unity_map(iommu->default_dom, entry);
- if (ret)
- return ret;
- }
-
- return 0;
-}
-
-/*
- * Inits the unity mappings required for a specific device
- */
-static int init_unity_mappings_for_device(struct dma_ops_domain *dma_dom,
- u16 devid)
-{
- struct unity_map_entry *e;
- int ret;
-
- list_for_each_entry(e, &amd_iommu_unity_map, list) {
- if (!(devid >= e->devid_start && devid <= e->devid_end))
- continue;
- ret = dma_ops_unity_map(dma_dom, e);
- if (ret)
- return ret;
- }
-
- return 0;
-}
-
-/****************************************************************************
- *
- * The next functions belong to the address allocator for the dma_ops
- * interface functions. They work like the allocators in the other IOMMU
- * drivers. It's basically a bitmap which marks the allocated pages in
- * the aperture. Maybe it could be enhanced in the future to a more
- * efficient allocator.
- *
- ****************************************************************************/
-
-/*
- * The address allocator core functions.
- *
- * called with domain->lock held
- */
-
-/*
- * Used to reserve address ranges in the aperture (e.g. for exclusion
- * ranges).
- */
-static void dma_ops_reserve_addresses(struct dma_ops_domain *dom,
- unsigned long start_page,
- unsigned int pages)
-{
- unsigned int i, last_page = dom->aperture_size >> PAGE_SHIFT;
-
- if (start_page + pages > last_page)
- pages = last_page - start_page;
-
- for (i = start_page; i < start_page + pages; ++i) {
- int index = i / APERTURE_RANGE_PAGES;
- int page = i % APERTURE_RANGE_PAGES;
- __set_bit(page, dom->aperture[index]->bitmap);
- }
-}
-
-/*
- * This function is used to add a new aperture range to an existing
- * aperture in case of dma_ops domain allocation or address allocation
- * failure.
- */
-static int alloc_new_range(struct dma_ops_domain *dma_dom,
- bool populate, gfp_t gfp)
-{
- int index = dma_dom->aperture_size >> APERTURE_RANGE_SHIFT;
- struct amd_iommu *iommu;
- unsigned long i;
-
-#ifdef CONFIG_IOMMU_STRESS
- populate = false;
-#endif
-
- if (index >= APERTURE_MAX_RANGES)
- return -ENOMEM;
-
- dma_dom->aperture[index] = kzalloc(sizeof(struct aperture_range), gfp);
- if (!dma_dom->aperture[index])
- return -ENOMEM;
-
- dma_dom->aperture[index]->bitmap = (void *)get_zeroed_page(gfp);
- if (!dma_dom->aperture[index]->bitmap)
- goto out_free;
-
- dma_dom->aperture[index]->offset = dma_dom->aperture_size;
-
- if (populate) {
- unsigned long address = dma_dom->aperture_size;
- int i, num_ptes = APERTURE_RANGE_PAGES / 512;
- u64 *pte, *pte_page;
-
- for (i = 0; i < num_ptes; ++i) {
- pte = alloc_pte(&dma_dom->domain, address, PAGE_SIZE,
- &pte_page, gfp);
- if (!pte)
- goto out_free;
-
- dma_dom->aperture[index]->pte_pages[i] = pte_page;
-
- address += APERTURE_RANGE_SIZE / 64;
- }
- }
-
- dma_dom->aperture_size += APERTURE_RANGE_SIZE;
-
- /* Initialize the exclusion range if necessary */
- for_each_iommu(iommu) {
- if (iommu->exclusion_start &&
- iommu->exclusion_start >= dma_dom->aperture[index]->offset
- && iommu->exclusion_start < dma_dom->aperture_size) {
- unsigned long startpage;
- int pages = iommu_num_pages(iommu->exclusion_start,
- iommu->exclusion_length,
- PAGE_SIZE);
- startpage = iommu->exclusion_start >> PAGE_SHIFT;
- dma_ops_reserve_addresses(dma_dom, startpage, pages);
- }
- }
-
- /*
- * Check for areas already mapped as present in the new aperture
- * range and mark those pages as reserved in the allocator. Such
- * mappings may already exist as a result of requested unity
- * mappings for devices.
- */
- for (i = dma_dom->aperture[index]->offset;
- i < dma_dom->aperture_size;
- i += PAGE_SIZE) {
- u64 *pte = fetch_pte(&dma_dom->domain, i);
- if (!pte || !IOMMU_PTE_PRESENT(*pte))
- continue;
-
- dma_ops_reserve_addresses(dma_dom, i << PAGE_SHIFT, 1);
- }
-
- update_domain(&dma_dom->domain);
-
- return 0;
-
-out_free:
- update_domain(&dma_dom->domain);
-
- free_page((unsigned long)dma_dom->aperture[index]->bitmap);
-
- kfree(dma_dom->aperture[index]);
- dma_dom->aperture[index] = NULL;
-
- return -ENOMEM;
-}
-
-static unsigned long dma_ops_area_alloc(struct device *dev,
- struct dma_ops_domain *dom,
- unsigned int pages,
- unsigned long align_mask,
- u64 dma_mask,
- unsigned long start)
-{
- unsigned long next_bit = dom->next_address % APERTURE_RANGE_SIZE;
- int max_index = dom->aperture_size >> APERTURE_RANGE_SHIFT;
- int i = start >> APERTURE_RANGE_SHIFT;
- unsigned long boundary_size;
- unsigned long address = -1;
- unsigned long limit;
-
- next_bit >>= PAGE_SHIFT;
-
- boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- PAGE_SIZE) >> PAGE_SHIFT;
-
- for (;i < max_index; ++i) {
- unsigned long offset = dom->aperture[i]->offset >> PAGE_SHIFT;
-
- if (dom->aperture[i]->offset >= dma_mask)
- break;
-
- limit = iommu_device_max_index(APERTURE_RANGE_PAGES, offset,
- dma_mask >> PAGE_SHIFT);
-
- address = iommu_area_alloc(dom->aperture[i]->bitmap,
- limit, next_bit, pages, 0,
- boundary_size, align_mask);
- if (address != -1) {
- address = dom->aperture[i]->offset +
- (address << PAGE_SHIFT);
- dom->next_address = address + (pages << PAGE_SHIFT);
- break;
- }
-
- next_bit = 0;
- }
-
- return address;
-}
-
-static unsigned long dma_ops_alloc_addresses(struct device *dev,
- struct dma_ops_domain *dom,
- unsigned int pages,
- unsigned long align_mask,
- u64 dma_mask)
-{
- unsigned long address;
-
-#ifdef CONFIG_IOMMU_STRESS
- dom->next_address = 0;
- dom->need_flush = true;
-#endif
-
- address = dma_ops_area_alloc(dev, dom, pages, align_mask,
- dma_mask, dom->next_address);
-
- if (address == -1) {
- dom->next_address = 0;
- address = dma_ops_area_alloc(dev, dom, pages, align_mask,
- dma_mask, 0);
- dom->need_flush = true;
- }
-
- if (unlikely(address == -1))
- address = DMA_ERROR_CODE;
-
- WARN_ON((address + (PAGE_SIZE*pages)) > dom->aperture_size);
-
- return address;
-}
-
-/*
- * The address free function.
- *
- * called with domain->lock held
- */
-static void dma_ops_free_addresses(struct dma_ops_domain *dom,
- unsigned long address,
- unsigned int pages)
-{
- unsigned i = address >> APERTURE_RANGE_SHIFT;
- struct aperture_range *range = dom->aperture[i];
-
- BUG_ON(i >= APERTURE_MAX_RANGES || range == NULL);
-
-#ifdef CONFIG_IOMMU_STRESS
- if (i < 4)
- return;
-#endif
-
- if (address >= dom->next_address)
- dom->need_flush = true;
-
- address = (address % APERTURE_RANGE_SIZE) >> PAGE_SHIFT;
-
- bitmap_clear(range->bitmap, address, pages);
-
-}
-
-/****************************************************************************
- *
- * The next functions belong to the domain allocation. A domain is
- * allocated for every IOMMU as the default domain. If device isolation
- * is enabled, every device gets its own domain. The most important thing
- * about domains is the page table mapping the DMA address space they
- * contain.
- *
- ****************************************************************************/
-
-/*
- * This function adds a protection domain to the global protection domain list
- */
-static void add_domain_to_list(struct protection_domain *domain)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&amd_iommu_pd_lock, flags);
- list_add(&domain->list, &amd_iommu_pd_list);
- spin_unlock_irqrestore(&amd_iommu_pd_lock, flags);
-}
-
-/*
- * This function removes a protection domain from the global
- * protection domain list
- */
-static void del_domain_from_list(struct protection_domain *domain)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&amd_iommu_pd_lock, flags);
- list_del(&domain->list);
- spin_unlock_irqrestore(&amd_iommu_pd_lock, flags);
-}
-
-static u16 domain_id_alloc(void)
-{
- unsigned long flags;
- int id;
-
- write_lock_irqsave(&amd_iommu_devtable_lock, flags);
- id = find_first_zero_bit(amd_iommu_pd_alloc_bitmap, MAX_DOMAIN_ID);
- BUG_ON(id == 0);
- if (id > 0 && id < MAX_DOMAIN_ID)
- __set_bit(id, amd_iommu_pd_alloc_bitmap);
- else
- id = 0;
- write_unlock_irqrestore(&amd_iommu_devtable_lock, flags);
-
- return id;
-}
-
-static void domain_id_free(int id)
-{
- unsigned long flags;
-
- write_lock_irqsave(&amd_iommu_devtable_lock, flags);
- if (id > 0 && id < MAX_DOMAIN_ID)
- __clear_bit(id, amd_iommu_pd_alloc_bitmap);
- write_unlock_irqrestore(&amd_iommu_devtable_lock, flags);
-}
-
-static void free_pagetable(struct protection_domain *domain)
-{
- int i, j;
- u64 *p1, *p2, *p3;
-
- p1 = domain->pt_root;
-
- if (!p1)
- return;
-
- for (i = 0; i < 512; ++i) {
- if (!IOMMU_PTE_PRESENT(p1[i]))
- continue;
-
- p2 = IOMMU_PTE_PAGE(p1[i]);
- for (j = 0; j < 512; ++j) {
- if (!IOMMU_PTE_PRESENT(p2[j]))
- continue;
- p3 = IOMMU_PTE_PAGE(p2[j]);
- free_page((unsigned long)p3);
- }
-
- free_page((unsigned long)p2);
- }
-
- free_page((unsigned long)p1);
-
- domain->pt_root = NULL;
-}
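-
-/*
- * A coverage sketch for the three-level walk above, assuming 4KB pages and
- * 512 entries per level: each leaf table (p3) maps 512 * 4KB = 2MB, each
- * p2 page therefore covers 512 * 2MB = 1GB, and the root (p1) covers up to
- * 512 * 1GB = 512GB of DMA address space.
- */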
-
-/*
- * Free a domain, only used if something went wrong in the
- * allocation path and we need to free an already allocated page table
- */
-static void dma_ops_domain_free(struct dma_ops_domain *dom)
-{
- int i;
-
- if (!dom)
- return;
-
- del_domain_from_list(&dom->domain);
-
- free_pagetable(&dom->domain);
-
- for (i = 0; i < APERTURE_MAX_RANGES; ++i) {
- if (!dom->aperture[i])
- continue;
- free_page((unsigned long)dom->aperture[i]->bitmap);
- kfree(dom->aperture[i]);
- }
-
- kfree(dom);
-}
-
-/*
- * Allocates a new protection domain usable for the dma_ops functions.
- * It also initializes the page table and the address allocator data
- * structures required for the dma_ops interface
- */
-static struct dma_ops_domain *dma_ops_domain_alloc(void)
-{
- struct dma_ops_domain *dma_dom;
-
- dma_dom = kzalloc(sizeof(struct dma_ops_domain), GFP_KERNEL);
- if (!dma_dom)
- return NULL;
-
- spin_lock_init(&dma_dom->domain.lock);
-
- dma_dom->domain.id = domain_id_alloc();
- if (dma_dom->domain.id == 0)
- goto free_dma_dom;
- INIT_LIST_HEAD(&dma_dom->domain.dev_list);
- dma_dom->domain.mode = PAGE_MODE_2_LEVEL;
- dma_dom->domain.pt_root = (void *)get_zeroed_page(GFP_KERNEL);
- dma_dom->domain.flags = PD_DMA_OPS_MASK;
- dma_dom->domain.priv = dma_dom;
- if (!dma_dom->domain.pt_root)
- goto free_dma_dom;
-
- dma_dom->need_flush = false;
- dma_dom->target_dev = 0xffff;
-
- add_domain_to_list(&dma_dom->domain);
-
- if (alloc_new_range(dma_dom, true, GFP_KERNEL))
- goto free_dma_dom;
-
- /*
-	 * Mark the first page as allocated so we never return 0 as a
-	 * valid DMA address; this lets us use 0 as an error value.
- */
- dma_dom->aperture[0]->bitmap[0] = 1;
- dma_dom->next_address = 0;
-
- return dma_dom;
-
-free_dma_dom:
- dma_ops_domain_free(dma_dom);
-
- return NULL;
-}
-
-/*
- * little helper function to check whether a given protection domain is a
- * dma_ops domain
- */
-static bool dma_ops_domain(struct protection_domain *domain)
-{
- return domain->flags & PD_DMA_OPS_MASK;
-}
-
-static void set_dte_entry(u16 devid, struct protection_domain *domain)
-{
- u64 pte_root = virt_to_phys(domain->pt_root);
-
- pte_root |= (domain->mode & DEV_ENTRY_MODE_MASK)
- << DEV_ENTRY_MODE_SHIFT;
- pte_root |= IOMMU_PTE_IR | IOMMU_PTE_IW | IOMMU_PTE_P | IOMMU_PTE_TV;
-
- amd_iommu_dev_table[devid].data[2] = domain->id;
- amd_iommu_dev_table[devid].data[1] = upper_32_bits(pte_root);
- amd_iommu_dev_table[devid].data[0] = lower_32_bits(pte_root);
-}
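-
-/*
- * A small worked example for the DTE composition above (hypothetical
- * values; DEV_ENTRY_MODE_SHIFT is assumed to be 9 as in this driver's
- * headers): with pt_root at physical 0x12340000 and a 3-level page table,
- *
- *   pte_root = 0x12340000 | (3 << 9) | IR | IW | P | TV
- *
- * data[0]/data[1] then receive the low/high 32 bits and data[2] the
- * 16-bit domain id.
- */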
-
-static void clear_dte_entry(u16 devid)
-{
- /* remove entry from the device table seen by the hardware */
- amd_iommu_dev_table[devid].data[0] = IOMMU_PTE_P | IOMMU_PTE_TV;
- amd_iommu_dev_table[devid].data[1] = 0;
- amd_iommu_dev_table[devid].data[2] = 0;
-
- amd_iommu_apply_erratum_63(devid);
-}
-
-static void do_attach(struct device *dev, struct protection_domain *domain)
-{
- struct iommu_dev_data *dev_data;
- struct amd_iommu *iommu;
- u16 devid;
-
- devid = get_device_id(dev);
- iommu = amd_iommu_rlookup_table[devid];
- dev_data = get_dev_data(dev);
-
- /* Update data structures */
- dev_data->domain = domain;
- list_add(&dev_data->list, &domain->dev_list);
- set_dte_entry(devid, domain);
-
- /* Do reference counting */
- domain->dev_iommu[iommu->index] += 1;
- domain->dev_cnt += 1;
-
- /* Flush the DTE entry */
- iommu_flush_device(dev);
-}
-
-static void do_detach(struct device *dev)
-{
- struct iommu_dev_data *dev_data;
- struct amd_iommu *iommu;
- u16 devid;
-
- devid = get_device_id(dev);
- iommu = amd_iommu_rlookup_table[devid];
- dev_data = get_dev_data(dev);
-
- /* decrease reference counters */
- dev_data->domain->dev_iommu[iommu->index] -= 1;
- dev_data->domain->dev_cnt -= 1;
-
- /* Update data structures */
- dev_data->domain = NULL;
- list_del(&dev_data->list);
- clear_dte_entry(devid);
-
- /* Flush the DTE entry */
- iommu_flush_device(dev);
-}
-
-/*
- * If a device is not yet associated with a domain, this function
- * assigns it to the domain and makes that assignment visible to the hardware
- */
-static int __attach_device(struct device *dev,
- struct protection_domain *domain)
-{
- struct iommu_dev_data *dev_data, *alias_data;
- int ret;
-
- dev_data = get_dev_data(dev);
- alias_data = get_dev_data(dev_data->alias);
-
- if (!alias_data)
- return -EINVAL;
-
- /* lock domain */
- spin_lock(&domain->lock);
-
- /* Some sanity checks */
- ret = -EBUSY;
- if (alias_data->domain != NULL &&
- alias_data->domain != domain)
- goto out_unlock;
-
- if (dev_data->domain != NULL &&
- dev_data->domain != domain)
- goto out_unlock;
-
- /* Do real assignment */
- if (dev_data->alias != dev) {
- alias_data = get_dev_data(dev_data->alias);
- if (alias_data->domain == NULL)
- do_attach(dev_data->alias, domain);
-
- atomic_inc(&alias_data->bind);
- }
-
- if (dev_data->domain == NULL)
- do_attach(dev, domain);
-
- atomic_inc(&dev_data->bind);
-
- ret = 0;
-
-out_unlock:
-
- /* ready */
- spin_unlock(&domain->lock);
-
- return ret;
-}
-
-/*
- * If a device is not yet associated with a domain, this function
- * assigns it to the domain and makes that assignment visible to the hardware
- */
-static int attach_device(struct device *dev,
- struct protection_domain *domain)
-{
- unsigned long flags;
- int ret;
-
- write_lock_irqsave(&amd_iommu_devtable_lock, flags);
- ret = __attach_device(dev, domain);
- write_unlock_irqrestore(&amd_iommu_devtable_lock, flags);
-
- /*
- * We might boot into a crash-kernel here. The crashed kernel
- * left the caches in the IOMMU dirty. So we have to flush
- * here to evict all dirty stuff.
- */
- iommu_flush_tlb_pde(domain);
-
- return ret;
-}
-
-/*
- * Removes a device from a protection domain (unlocked)
- */
-static void __detach_device(struct device *dev)
-{
- struct iommu_dev_data *dev_data = get_dev_data(dev);
- struct iommu_dev_data *alias_data;
- struct protection_domain *domain;
- unsigned long flags;
-
- BUG_ON(!dev_data->domain);
-
- domain = dev_data->domain;
-
- spin_lock_irqsave(&domain->lock, flags);
-
- if (dev_data->alias != dev) {
- alias_data = get_dev_data(dev_data->alias);
- if (atomic_dec_and_test(&alias_data->bind))
- do_detach(dev_data->alias);
- }
-
- if (atomic_dec_and_test(&dev_data->bind))
- do_detach(dev);
-
- spin_unlock_irqrestore(&domain->lock, flags);
-
- /*
- * If we run in passthrough mode the device must be assigned to the
- * passthrough domain if it is detached from any other domain.
- * Make sure we can deassign from the pt_domain itself.
- */
- if (iommu_pass_through &&
- (dev_data->domain == NULL && domain != pt_domain))
- __attach_device(dev, pt_domain);
-}
-
-/*
- * Removes a device from a protection domain (with devtable_lock held)
- */
-static void detach_device(struct device *dev)
-{
- unsigned long flags;
-
- /* lock device table */
- write_lock_irqsave(&amd_iommu_devtable_lock, flags);
- __detach_device(dev);
- write_unlock_irqrestore(&amd_iommu_devtable_lock, flags);
-}
-
-/*
- * Find out the protection domain structure for a given PCI device. This
- * will give us the pointer to the page table root for example.
- */
-static struct protection_domain *domain_for_device(struct device *dev)
-{
- struct protection_domain *dom;
- struct iommu_dev_data *dev_data, *alias_data;
- unsigned long flags;
- u16 devid, alias;
-
- devid = get_device_id(dev);
- alias = amd_iommu_alias_table[devid];
- dev_data = get_dev_data(dev);
- alias_data = get_dev_data(dev_data->alias);
- if (!alias_data)
- return NULL;
-
- read_lock_irqsave(&amd_iommu_devtable_lock, flags);
- dom = dev_data->domain;
- if (dom == NULL &&
- alias_data->domain != NULL) {
- __attach_device(dev, alias_data->domain);
- dom = alias_data->domain;
- }
-
- read_unlock_irqrestore(&amd_iommu_devtable_lock, flags);
-
- return dom;
-}
-
-static int device_change_notifier(struct notifier_block *nb,
- unsigned long action, void *data)
-{
- struct device *dev = data;
- u16 devid;
- struct protection_domain *domain;
- struct dma_ops_domain *dma_domain;
- struct amd_iommu *iommu;
- unsigned long flags;
-
- if (!check_device(dev))
- return 0;
-
- devid = get_device_id(dev);
- iommu = amd_iommu_rlookup_table[devid];
-
- switch (action) {
- case BUS_NOTIFY_UNBOUND_DRIVER:
-
- domain = domain_for_device(dev);
-
- if (!domain)
- goto out;
- if (iommu_pass_through)
- break;
- detach_device(dev);
- break;
- case BUS_NOTIFY_ADD_DEVICE:
-
- iommu_init_device(dev);
-
- domain = domain_for_device(dev);
-
- /* allocate a protection domain if a device is added */
- dma_domain = find_protection_domain(devid);
- if (dma_domain)
- goto out;
- dma_domain = dma_ops_domain_alloc();
- if (!dma_domain)
- goto out;
- dma_domain->target_dev = devid;
-
- spin_lock_irqsave(&iommu_pd_list_lock, flags);
- list_add_tail(&dma_domain->list, &iommu_pd_list);
- spin_unlock_irqrestore(&iommu_pd_list_lock, flags);
-
- break;
- case BUS_NOTIFY_DEL_DEVICE:
-
- iommu_uninit_device(dev);
-
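-		/* fall through - the device is gone, so skip the flush below */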
- default:
- goto out;
- }
-
- iommu_flush_device(dev);
- iommu_completion_wait(iommu);
-
-out:
- return 0;
-}
-
-static struct notifier_block device_nb = {
- .notifier_call = device_change_notifier,
-};
-
-void amd_iommu_init_notifier(void)
-{
- bus_register_notifier(&pci_bus_type, &device_nb);
-}
-
-/*****************************************************************************
- *
- * The next functions belong to the dma_ops mapping/unmapping code.
- *
- *****************************************************************************/
-
-/*
- * In the dma_ops path we only have the struct device. This function
- * finds the corresponding IOMMU, the protection domain and the
- * requestor id for a given device.
- * If the device is not yet associated with a domain, this is also
- * done in this function.
- */
-static struct protection_domain *get_domain(struct device *dev)
-{
- struct protection_domain *domain;
- struct dma_ops_domain *dma_dom;
- u16 devid = get_device_id(dev);
-
- if (!check_device(dev))
- return ERR_PTR(-EINVAL);
-
- domain = domain_for_device(dev);
- if (domain != NULL && !dma_ops_domain(domain))
- return ERR_PTR(-EBUSY);
-
- if (domain != NULL)
- return domain;
-
-	/* Device not bound yet - bind it */
- dma_dom = find_protection_domain(devid);
- if (!dma_dom)
- dma_dom = amd_iommu_rlookup_table[devid]->default_dom;
- attach_device(dev, &dma_dom->domain);
- DUMP_printk("Using protection domain %d for device %s\n",
- dma_dom->domain.id, dev_name(dev));
-
- return &dma_dom->domain;
-}
-
-static void update_device_table(struct protection_domain *domain)
-{
- struct iommu_dev_data *dev_data;
-
- list_for_each_entry(dev_data, &domain->dev_list, list) {
- u16 devid = get_device_id(dev_data->dev);
- set_dte_entry(devid, domain);
- }
-}
-
-static void update_domain(struct protection_domain *domain)
-{
- if (!domain->updated)
- return;
-
- update_device_table(domain);
- iommu_flush_domain_devices(domain);
- iommu_flush_tlb_pde(domain);
-
- domain->updated = false;
-}
-
-/*
- * This function fetches the PTE for a given address in the aperture
- */
-static u64* dma_ops_get_pte(struct dma_ops_domain *dom,
- unsigned long address)
-{
- struct aperture_range *aperture;
- u64 *pte, *pte_page;
-
- aperture = dom->aperture[APERTURE_RANGE_INDEX(address)];
- if (!aperture)
- return NULL;
-
- pte = aperture->pte_pages[APERTURE_PAGE_INDEX(address)];
- if (!pte) {
- pte = alloc_pte(&dom->domain, address, PAGE_SIZE, &pte_page,
- GFP_ATOMIC);
- aperture->pte_pages[APERTURE_PAGE_INDEX(address)] = pte_page;
- } else
- pte += PM_LEVEL_INDEX(0, address);
-
- update_domain(&dom->domain);
-
- return pte;
-}
-
-/*
- * This is the generic map function. It maps one 4KB page at paddr to
- * the given address in the DMA address space for the domain.
- */
-static dma_addr_t dma_ops_domain_map(struct dma_ops_domain *dom,
- unsigned long address,
- phys_addr_t paddr,
- int direction)
-{
- u64 *pte, __pte;
-
- WARN_ON(address > dom->aperture_size);
-
- paddr &= PAGE_MASK;
-
- pte = dma_ops_get_pte(dom, address);
- if (!pte)
- return DMA_ERROR_CODE;
-
- __pte = paddr | IOMMU_PTE_P | IOMMU_PTE_FC;
-
- if (direction == DMA_TO_DEVICE)
- __pte |= IOMMU_PTE_IR;
- else if (direction == DMA_FROM_DEVICE)
- __pte |= IOMMU_PTE_IW;
- else if (direction == DMA_BIDIRECTIONAL)
- __pte |= IOMMU_PTE_IR | IOMMU_PTE_IW;
-
- WARN_ON(*pte);
-
- *pte = __pte;
-
- return (dma_addr_t)address;
-}
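-
-/*
- * Note on the direction bits above: DMA_TO_DEVICE means the device reads
- * from memory, hence IOMMU_PTE_IR (read permission); DMA_FROM_DEVICE means
- * the device writes to memory, hence IOMMU_PTE_IW; DMA_BIDIRECTIONAL sets
- * both.
- */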
-
-/*
- * The generic unmapping function for one page in the DMA address space.
- */
-static void dma_ops_domain_unmap(struct dma_ops_domain *dom,
- unsigned long address)
-{
- struct aperture_range *aperture;
- u64 *pte;
-
- if (address >= dom->aperture_size)
- return;
-
- aperture = dom->aperture[APERTURE_RANGE_INDEX(address)];
- if (!aperture)
- return;
-
- pte = aperture->pte_pages[APERTURE_PAGE_INDEX(address)];
- if (!pte)
- return;
-
- pte += PM_LEVEL_INDEX(0, address);
-
- WARN_ON(!*pte);
-
- *pte = 0ULL;
-}
-
-/*
- * This function contains common code for mapping of a physically
- * contiguous memory region into DMA address space. It is used by all
- * mapping functions provided with this IOMMU driver.
- * Must be called with the domain lock held.
- */
-static dma_addr_t __map_single(struct device *dev,
- struct dma_ops_domain *dma_dom,
- phys_addr_t paddr,
- size_t size,
- int dir,
- bool align,
- u64 dma_mask)
-{
- dma_addr_t offset = paddr & ~PAGE_MASK;
- dma_addr_t address, start, ret;
- unsigned int pages;
- unsigned long align_mask = 0;
- int i;
-
- pages = iommu_num_pages(paddr, size, PAGE_SIZE);
- paddr &= PAGE_MASK;
-
- INC_STATS_COUNTER(total_map_requests);
-
- if (pages > 1)
- INC_STATS_COUNTER(cross_page);
-
- if (align)
- align_mask = (1UL << get_order(size)) - 1;
-
-retry:
- address = dma_ops_alloc_addresses(dev, dma_dom, pages, align_mask,
- dma_mask);
- if (unlikely(address == DMA_ERROR_CODE)) {
- /*
-		 * Setting next_address here lets the address allocator
-		 * scan only the newly allocated range on its first pass.
-		 * This is a small optimization.
- */
- dma_dom->next_address = dma_dom->aperture_size;
-
- if (alloc_new_range(dma_dom, false, GFP_ATOMIC))
- goto out;
-
- /*
- * aperture was successfully enlarged by 128 MB, try
- * allocation again
- */
- goto retry;
- }
-
- start = address;
- for (i = 0; i < pages; ++i) {
- ret = dma_ops_domain_map(dma_dom, start, paddr, dir);
- if (ret == DMA_ERROR_CODE)
- goto out_unmap;
-
- paddr += PAGE_SIZE;
- start += PAGE_SIZE;
- }
- address += offset;
-
- ADD_STATS_COUNTER(alloced_io_mem, size);
-
- if (unlikely(dma_dom->need_flush && !amd_iommu_unmap_flush)) {
- iommu_flush_tlb(&dma_dom->domain);
- dma_dom->need_flush = false;
- } else if (unlikely(amd_iommu_np_cache))
- iommu_flush_pages(&dma_dom->domain, address, size);
-
-out:
- return address;
-
-out_unmap:
-
- for (--i; i >= 0; --i) {
- start -= PAGE_SIZE;
- dma_ops_domain_unmap(dma_dom, start);
- }
-
- dma_ops_free_addresses(dma_dom, address, pages);
-
- return DMA_ERROR_CODE;
-}
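-
-/*
- * A worked example for the offset handling in __map_single (hypothetical
- * numbers, assuming 4KB pages): paddr = 0x12345678, size = 0x1000. Then
- * offset = 0x678, iommu_num_pages() yields 2 (the region crosses a page
- * boundary), the two pages at 0x12345000 and 0x12346000 are mapped, and
- * the returned DMA address is the allocated base plus 0x678.
- */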
-
-/*
- * Does the reverse of the __map_single function. Must be called with
- * the domain lock held too
- */
-static void __unmap_single(struct dma_ops_domain *dma_dom,
- dma_addr_t dma_addr,
- size_t size,
- int dir)
-{
- dma_addr_t i, start;
- unsigned int pages;
-
- if ((dma_addr == DMA_ERROR_CODE) ||
- (dma_addr + size > dma_dom->aperture_size))
- return;
-
- pages = iommu_num_pages(dma_addr, size, PAGE_SIZE);
- dma_addr &= PAGE_MASK;
- start = dma_addr;
-
- for (i = 0; i < pages; ++i) {
- dma_ops_domain_unmap(dma_dom, start);
- start += PAGE_SIZE;
- }
-
- SUB_STATS_COUNTER(alloced_io_mem, size);
-
- dma_ops_free_addresses(dma_dom, dma_addr, pages);
-
- if (amd_iommu_unmap_flush || dma_dom->need_flush) {
- iommu_flush_pages(&dma_dom->domain, dma_addr, size);
- dma_dom->need_flush = false;
- }
-}
-
-/*
- * The exported map_single function for dma_ops.
- */
-static dma_addr_t map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
-{
- unsigned long flags;
- struct protection_domain *domain;
- dma_addr_t addr;
- u64 dma_mask;
- phys_addr_t paddr = page_to_phys(page) + offset;
-
- INC_STATS_COUNTER(cnt_map_single);
-
- domain = get_domain(dev);
- if (PTR_ERR(domain) == -EINVAL)
- return (dma_addr_t)paddr;
- else if (IS_ERR(domain))
- return DMA_ERROR_CODE;
-
- dma_mask = *dev->dma_mask;
-
- spin_lock_irqsave(&domain->lock, flags);
-
- addr = __map_single(dev, domain->priv, paddr, size, dir, false,
- dma_mask);
- if (addr == DMA_ERROR_CODE)
- goto out;
-
- iommu_flush_complete(domain);
-
-out:
- spin_unlock_irqrestore(&domain->lock, flags);
-
- return addr;
-}
-
-/*
- * The exported unmap_single function for dma_ops.
- */
-static void unmap_page(struct device *dev, dma_addr_t dma_addr, size_t size,
- enum dma_data_direction dir, struct dma_attrs *attrs)
-{
- unsigned long flags;
- struct protection_domain *domain;
-
- INC_STATS_COUNTER(cnt_unmap_single);
-
- domain = get_domain(dev);
- if (IS_ERR(domain))
- return;
-
- spin_lock_irqsave(&domain->lock, flags);
-
- __unmap_single(domain->priv, dma_addr, size, dir);
-
- iommu_flush_complete(domain);
-
- spin_unlock_irqrestore(&domain->lock, flags);
-}
-
-/*
- * This is a special map_sg function which is used when we have to map a
- * device which is not handled by an AMD IOMMU in the system.
- */
-static int map_sg_no_iommu(struct device *dev, struct scatterlist *sglist,
- int nelems, int dir)
-{
- struct scatterlist *s;
- int i;
-
- for_each_sg(sglist, s, nelems, i) {
- s->dma_address = (dma_addr_t)sg_phys(s);
- s->dma_length = s->length;
- }
-
- return nelems;
-}
-
-/*
- * The exported map_sg function for dma_ops (handles scatter-gather
- * lists).
- */
-static int map_sg(struct device *dev, struct scatterlist *sglist,
- int nelems, enum dma_data_direction dir,
- struct dma_attrs *attrs)
-{
- unsigned long flags;
- struct protection_domain *domain;
- int i;
- struct scatterlist *s;
- phys_addr_t paddr;
- int mapped_elems = 0;
- u64 dma_mask;
-
- INC_STATS_COUNTER(cnt_map_sg);
-
- domain = get_domain(dev);
- if (PTR_ERR(domain) == -EINVAL)
- return map_sg_no_iommu(dev, sglist, nelems, dir);
- else if (IS_ERR(domain))
- return 0;
-
- dma_mask = *dev->dma_mask;
-
- spin_lock_irqsave(&domain->lock, flags);
-
- for_each_sg(sglist, s, nelems, i) {
- paddr = sg_phys(s);
-
- s->dma_address = __map_single(dev, domain->priv,
- paddr, s->length, dir, false,
- dma_mask);
-
- if (s->dma_address) {
- s->dma_length = s->length;
- mapped_elems++;
- } else
- goto unmap;
- }
-
- iommu_flush_complete(domain);
-
-out:
- spin_unlock_irqrestore(&domain->lock, flags);
-
- return mapped_elems;
-unmap:
- for_each_sg(sglist, s, mapped_elems, i) {
- if (s->dma_address)
- __unmap_single(domain->priv, s->dma_address,
- s->dma_length, dir);
- s->dma_address = s->dma_length = 0;
- }
-
- mapped_elems = 0;
-
- goto out;
-}
-
-/*
- * The exported unmap_sg function for dma_ops (handles scatter-gather
- * lists).
- */
-static void unmap_sg(struct device *dev, struct scatterlist *sglist,
- int nelems, enum dma_data_direction dir,
- struct dma_attrs *attrs)
-{
- unsigned long flags;
- struct protection_domain *domain;
- struct scatterlist *s;
- int i;
-
- INC_STATS_COUNTER(cnt_unmap_sg);
-
- domain = get_domain(dev);
- if (IS_ERR(domain))
- return;
-
- spin_lock_irqsave(&domain->lock, flags);
-
- for_each_sg(sglist, s, nelems, i) {
- __unmap_single(domain->priv, s->dma_address,
- s->dma_length, dir);
- s->dma_address = s->dma_length = 0;
- }
-
- iommu_flush_complete(domain);
-
- spin_unlock_irqrestore(&domain->lock, flags);
-}
-
-/*
- * The exported alloc_coherent function for dma_ops.
- */
-static void *alloc_coherent(struct device *dev, size_t size,
- dma_addr_t *dma_addr, gfp_t flag)
-{
- unsigned long flags;
- void *virt_addr;
- struct protection_domain *domain;
- phys_addr_t paddr;
- u64 dma_mask = dev->coherent_dma_mask;
-
- INC_STATS_COUNTER(cnt_alloc_coherent);
-
- domain = get_domain(dev);
- if (PTR_ERR(domain) == -EINVAL) {
- virt_addr = (void *)__get_free_pages(flag, get_order(size));
- *dma_addr = __pa(virt_addr);
- return virt_addr;
- } else if (IS_ERR(domain))
- return NULL;
-
- dma_mask = dev->coherent_dma_mask;
- flag &= ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32);
- flag |= __GFP_ZERO;
-
- virt_addr = (void *)__get_free_pages(flag, get_order(size));
- if (!virt_addr)
- return NULL;
-
- paddr = virt_to_phys(virt_addr);
-
- if (!dma_mask)
- dma_mask = *dev->dma_mask;
-
- spin_lock_irqsave(&domain->lock, flags);
-
- *dma_addr = __map_single(dev, domain->priv, paddr,
- size, DMA_BIDIRECTIONAL, true, dma_mask);
-
- if (*dma_addr == DMA_ERROR_CODE) {
- spin_unlock_irqrestore(&domain->lock, flags);
- goto out_free;
- }
-
- iommu_flush_complete(domain);
-
- spin_unlock_irqrestore(&domain->lock, flags);
-
- return virt_addr;
-
-out_free:
-
- free_pages((unsigned long)virt_addr, get_order(size));
-
- return NULL;
-}
-
-/*
- * The exported free_coherent function for dma_ops.
- */
-static void free_coherent(struct device *dev, size_t size,
- void *virt_addr, dma_addr_t dma_addr)
-{
- unsigned long flags;
- struct protection_domain *domain;
-
- INC_STATS_COUNTER(cnt_free_coherent);
-
- domain = get_domain(dev);
- if (IS_ERR(domain))
- goto free_mem;
-
- spin_lock_irqsave(&domain->lock, flags);
-
- __unmap_single(domain->priv, dma_addr, size, DMA_BIDIRECTIONAL);
-
- iommu_flush_complete(domain);
-
- spin_unlock_irqrestore(&domain->lock, flags);
-
-free_mem:
- free_pages((unsigned long)virt_addr, get_order(size));
-}
-
-/*
- * This function is called by the DMA layer to find out if we can handle a
- * particular device. It is part of the dma_ops.
- */
-static int amd_iommu_dma_supported(struct device *dev, u64 mask)
-{
- return check_device(dev);
-}
-
-/*
- * The function for pre-allocating protection domains.
- *
- * Once the driver core informs the DMA layer when a driver grabs a
- * device, we won't need to preallocate the protection domains anymore.
- * For now we have to.
- */
-static void prealloc_protection_domains(void)
-{
- struct pci_dev *dev = NULL;
- struct dma_ops_domain *dma_dom;
- u16 devid;
-
- for_each_pci_dev(dev) {
-
- /* Do we handle this device? */
- if (!check_device(&dev->dev))
- continue;
-
- /* Is there already any domain for it? */
- if (domain_for_device(&dev->dev))
- continue;
-
- devid = get_device_id(&dev->dev);
-
- dma_dom = dma_ops_domain_alloc();
- if (!dma_dom)
- continue;
- init_unity_mappings_for_device(dma_dom, devid);
- dma_dom->target_dev = devid;
-
- attach_device(&dev->dev, &dma_dom->domain);
-
- list_add_tail(&dma_dom->list, &iommu_pd_list);
- }
-}
-
-static struct dma_map_ops amd_iommu_dma_ops = {
- .alloc_coherent = alloc_coherent,
- .free_coherent = free_coherent,
- .map_page = map_page,
- .unmap_page = unmap_page,
- .map_sg = map_sg,
- .unmap_sg = unmap_sg,
- .dma_supported = amd_iommu_dma_supported,
-};
-
-/*
- * The function which glues the AMD IOMMU driver into dma_ops.
- */
-
-void __init amd_iommu_init_api(void)
-{
- register_iommu(&amd_iommu_ops);
-}
-
-int __init amd_iommu_init_dma_ops(void)
-{
- struct amd_iommu *iommu;
- int ret;
-
- /*
-	 * First, allocate a default protection domain for every IOMMU we
- * found in the system. Devices not assigned to any other
- * protection domain will be assigned to the default one.
- */
- for_each_iommu(iommu) {
- iommu->default_dom = dma_ops_domain_alloc();
- if (iommu->default_dom == NULL)
- return -ENOMEM;
- iommu->default_dom->domain.flags |= PD_DEFAULT_MASK;
- ret = iommu_init_unity_mappings(iommu);
- if (ret)
- goto free_domains;
- }
-
- /*
- * Pre-allocate the protection domains for each device.
- */
- prealloc_protection_domains();
-
- iommu_detected = 1;
- swiotlb = 0;
-
-	/* Finally make our dma_ops visible to the drivers */
- dma_ops = &amd_iommu_dma_ops;
-
- amd_iommu_stats_init();
-
- return 0;
-
-free_domains:
-
- for_each_iommu(iommu) {
- if (iommu->default_dom)
- dma_ops_domain_free(iommu->default_dom);
- }
-
- return ret;
-}
-
-/*****************************************************************************
- *
- * The following functions belong to the exported interface of AMD IOMMU
- *
- * This interface allows access to lower level functions of the IOMMU
- * like protection domain handling and assignment of devices to domains
- * which is not possible with the dma_ops interface.
- *
- *****************************************************************************/
-
-static void cleanup_domain(struct protection_domain *domain)
-{
- struct iommu_dev_data *dev_data, *next;
- unsigned long flags;
-
- write_lock_irqsave(&amd_iommu_devtable_lock, flags);
-
- list_for_each_entry_safe(dev_data, next, &domain->dev_list, list) {
- struct device *dev = dev_data->dev;
-
- __detach_device(dev);
- atomic_set(&dev_data->bind, 0);
- }
-
- write_unlock_irqrestore(&amd_iommu_devtable_lock, flags);
-}
-
-static void protection_domain_free(struct protection_domain *domain)
-{
- if (!domain)
- return;
-
- del_domain_from_list(domain);
-
- if (domain->id)
- domain_id_free(domain->id);
-
- kfree(domain);
-}
-
-static struct protection_domain *protection_domain_alloc(void)
-{
- struct protection_domain *domain;
-
- domain = kzalloc(sizeof(*domain), GFP_KERNEL);
- if (!domain)
- return NULL;
-
- spin_lock_init(&domain->lock);
- mutex_init(&domain->api_lock);
- domain->id = domain_id_alloc();
- if (!domain->id)
- goto out_err;
- INIT_LIST_HEAD(&domain->dev_list);
-
- add_domain_to_list(domain);
-
- return domain;
-
-out_err:
- kfree(domain);
-
- return NULL;
-}
-
-static int amd_iommu_domain_init(struct iommu_domain *dom)
-{
- struct protection_domain *domain;
-
- domain = protection_domain_alloc();
- if (!domain)
- goto out_free;
-
- domain->mode = PAGE_MODE_3_LEVEL;
- domain->pt_root = (void *)get_zeroed_page(GFP_KERNEL);
- if (!domain->pt_root)
- goto out_free;
-
- dom->priv = domain;
-
- return 0;
-
-out_free:
- protection_domain_free(domain);
-
- return -ENOMEM;
-}
-
-static void amd_iommu_domain_destroy(struct iommu_domain *dom)
-{
- struct protection_domain *domain = dom->priv;
-
- if (!domain)
- return;
-
- if (domain->dev_cnt > 0)
- cleanup_domain(domain);
-
- BUG_ON(domain->dev_cnt != 0);
-
- free_pagetable(domain);
-
- protection_domain_free(domain);
-
- dom->priv = NULL;
-}
-
-static void amd_iommu_detach_device(struct iommu_domain *dom,
- struct device *dev)
-{
- struct iommu_dev_data *dev_data = dev->archdata.iommu;
- struct amd_iommu *iommu;
- u16 devid;
-
- if (!check_device(dev))
- return;
-
- devid = get_device_id(dev);
-
- if (dev_data->domain != NULL)
- detach_device(dev);
-
- iommu = amd_iommu_rlookup_table[devid];
- if (!iommu)
- return;
-
- iommu_flush_device(dev);
- iommu_completion_wait(iommu);
-}
-
-static int amd_iommu_attach_device(struct iommu_domain *dom,
- struct device *dev)
-{
- struct protection_domain *domain = dom->priv;
- struct iommu_dev_data *dev_data;
- struct amd_iommu *iommu;
- int ret;
- u16 devid;
-
- if (!check_device(dev))
- return -EINVAL;
-
- dev_data = dev->archdata.iommu;
-
- devid = get_device_id(dev);
-
- iommu = amd_iommu_rlookup_table[devid];
- if (!iommu)
- return -EINVAL;
-
- if (dev_data->domain)
- detach_device(dev);
-
- ret = attach_device(dev, domain);
-
- iommu_completion_wait(iommu);
-
- return ret;
-}
-
-static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
- phys_addr_t paddr, int gfp_order, int iommu_prot)
-{
- unsigned long page_size = 0x1000UL << gfp_order;
- struct protection_domain *domain = dom->priv;
- int prot = 0;
- int ret;
-
- if (iommu_prot & IOMMU_READ)
- prot |= IOMMU_PROT_IR;
- if (iommu_prot & IOMMU_WRITE)
- prot |= IOMMU_PROT_IW;
-
- mutex_lock(&domain->api_lock);
- ret = iommu_map_page(domain, iova, paddr, prot, page_size);
- mutex_unlock(&domain->api_lock);
-
- return ret;
-}
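-
-/*
- * Page size sketch for the order-based interface above: page_size =
- * 0x1000UL << gfp_order, so gfp_order 0 maps a 4KB page and gfp_order 9
- * maps a 2MB large page (0x1000 << 9 = 0x200000).
- */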
-
-static int amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
- int gfp_order)
-{
- struct protection_domain *domain = dom->priv;
- unsigned long page_size, unmap_size;
-
- page_size = 0x1000UL << gfp_order;
-
- mutex_lock(&domain->api_lock);
- unmap_size = iommu_unmap_page(domain, iova, page_size);
- mutex_unlock(&domain->api_lock);
-
- iommu_flush_tlb_pde(domain);
-
- return get_order(unmap_size);
-}
-
-static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
- unsigned long iova)
-{
- struct protection_domain *domain = dom->priv;
- unsigned long offset_mask;
- phys_addr_t paddr;
- u64 *pte, __pte;
-
- pte = fetch_pte(domain, iova);
-
- if (!pte || !IOMMU_PTE_PRESENT(*pte))
- return 0;
-
- if (PM_PTE_LEVEL(*pte) == 0)
- offset_mask = PAGE_SIZE - 1;
- else
- offset_mask = PTE_PAGE_SIZE(*pte) - 1;
-
- __pte = *pte & PM_ADDR_MASK;
- paddr = (__pte & ~offset_mask) | (iova & offset_mask);
-
- return paddr;
-}
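-
-/*
- * A worked example for the offset masking above (hypothetical values):
- * for a 2MB large-page PTE, offset_mask = 0x200000 - 1 = 0x1fffff, so
- * with iova = 0x12345678 and the PTE pointing at physical 0x40000000,
- * paddr = 0x40000000 | (0x12345678 & 0x1fffff) = 0x40145678.
- */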
-
-static int amd_iommu_domain_has_cap(struct iommu_domain *domain,
- unsigned long cap)
-{
- return 0;
-}
-
-static struct iommu_ops amd_iommu_ops = {
- .domain_init = amd_iommu_domain_init,
- .domain_destroy = amd_iommu_domain_destroy,
- .attach_dev = amd_iommu_attach_device,
- .detach_dev = amd_iommu_detach_device,
- .map = amd_iommu_map,
- .unmap = amd_iommu_unmap,
- .iova_to_phys = amd_iommu_iova_to_phys,
- .domain_has_cap = amd_iommu_domain_has_cap,
-};
-
-/*****************************************************************************
- *
- * The next functions do a basic initialization of IOMMU for pass through
- * mode
- *
- * In passthrough mode the IOMMU is initialized and enabled but not used for
- * DMA-API translation.
- *
- *****************************************************************************/
-
-int __init amd_iommu_init_passthrough(void)
-{
- struct amd_iommu *iommu;
- struct pci_dev *dev = NULL;
- u16 devid;
-
- /* allocate passthrough domain */
- pt_domain = protection_domain_alloc();
- if (!pt_domain)
- return -ENOMEM;
-
- pt_domain->mode |= PAGE_MODE_NONE;
-
- while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
-
- if (!check_device(&dev->dev))
- continue;
-
- devid = get_device_id(&dev->dev);
-
- iommu = amd_iommu_rlookup_table[devid];
- if (!iommu)
- continue;
-
- attach_device(&dev->dev, pt_domain);
- }
-
- pr_info("AMD-Vi: Initialized for Passthrough Mode\n");
-
- return 0;
-}
diff --git a/arch/x86/kernel/amd_iommu_init.c b/arch/x86/kernel/amd_iommu_init.c
deleted file mode 100644
index 3cc63e2b8dd4..000000000000
--- a/arch/x86/kernel/amd_iommu_init.c
+++ /dev/null
@@ -1,1430 +0,0 @@
-/*
- * Copyright (C) 2007-2009 Advanced Micro Devices, Inc.
- * Author: Joerg Roedel <joerg.roedel@amd.com>
- * Leo Duran <leo.duran@amd.com>
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License version 2 as published
- * by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- */
-
-#include <linux/pci.h>
-#include <linux/acpi.h>
-#include <linux/list.h>
-#include <linux/slab.h>
-#include <linux/sysdev.h>
-#include <linux/interrupt.h>
-#include <linux/msi.h>
-#include <asm/pci-direct.h>
-#include <asm/amd_iommu_proto.h>
-#include <asm/amd_iommu_types.h>
-#include <asm/amd_iommu.h>
-#include <asm/iommu.h>
-#include <asm/gart.h>
-#include <asm/x86_init.h>
-
-/*
- * definitions for the ACPI scanning code
- */
-#define IVRS_HEADER_LENGTH 48
-
-#define ACPI_IVHD_TYPE 0x10
-#define ACPI_IVMD_TYPE_ALL 0x20
-#define ACPI_IVMD_TYPE 0x21
-#define ACPI_IVMD_TYPE_RANGE 0x22
-
-#define IVHD_DEV_ALL 0x01
-#define IVHD_DEV_SELECT 0x02
-#define IVHD_DEV_SELECT_RANGE_START 0x03
-#define IVHD_DEV_RANGE_END 0x04
-#define IVHD_DEV_ALIAS 0x42
-#define IVHD_DEV_ALIAS_RANGE 0x43
-#define IVHD_DEV_EXT_SELECT 0x46
-#define IVHD_DEV_EXT_SELECT_RANGE 0x47
-
-#define IVHD_FLAG_HT_TUN_EN_MASK 0x01
-#define IVHD_FLAG_PASSPW_EN_MASK 0x02
-#define IVHD_FLAG_RESPASSPW_EN_MASK 0x04
-#define IVHD_FLAG_ISOC_EN_MASK 0x08
-
-#define IVMD_FLAG_EXCL_RANGE 0x08
-#define IVMD_FLAG_UNITY_MAP 0x01
-
-#define ACPI_DEVFLAG_INITPASS 0x01
-#define ACPI_DEVFLAG_EXTINT 0x02
-#define ACPI_DEVFLAG_NMI 0x04
-#define ACPI_DEVFLAG_SYSMGT1 0x10
-#define ACPI_DEVFLAG_SYSMGT2 0x20
-#define ACPI_DEVFLAG_LINT0 0x40
-#define ACPI_DEVFLAG_LINT1 0x80
-#define ACPI_DEVFLAG_ATSDIS 0x10000000
-
-/*
- * ACPI table definitions
- *
- * These data structures are laid over the table to parse the important values
- * out of it.
- */
-
-/*
- * structure describing one IOMMU in the ACPI table. Typically followed by one
- * or more ivhd_entrys.
- */
-struct ivhd_header {
- u8 type;
- u8 flags;
- u16 length;
- u16 devid;
- u16 cap_ptr;
- u64 mmio_phys;
- u16 pci_seg;
- u16 info;
- u32 reserved;
-} __attribute__((packed));
-
-/*
- * A device entry describing which devices a specific IOMMU translates and
- * which requestor ids they use.
- */
-struct ivhd_entry {
- u8 type;
- u16 devid;
- u8 flags;
- u32 ext;
-} __attribute__((packed));
-
-/*
- * An AMD IOMMU memory definition structure. It defines things like exclusion
- * ranges for devices and regions that should be unity mapped.
- */
-struct ivmd_header {
- u8 type;
- u8 flags;
- u16 length;
- u16 devid;
- u16 aux;
- u64 resv;
- u64 range_start;
- u64 range_length;
-} __attribute__((packed));
-
-bool amd_iommu_dump;
-
-static int __initdata amd_iommu_detected;
-static bool __initdata amd_iommu_disabled;
-
-u16 amd_iommu_last_bdf; /* largest PCI device id we have
- to handle */
-LIST_HEAD(amd_iommu_unity_map); /* a list of required unity mappings
- we find in ACPI */
-bool amd_iommu_unmap_flush; /* if true, flush on every unmap */
-
-LIST_HEAD(amd_iommu_list); /* list of all AMD IOMMUs in the
- system */
-
-/* Array to assign indices to IOMMUs */
-struct amd_iommu *amd_iommus[MAX_IOMMUS];
-int amd_iommus_present;
-
-/* IOMMUs have a non-present cache? */
-bool amd_iommu_np_cache __read_mostly;
-
-/*
- * The ACPI table parsing functions set this variable on an error
- */
-static int __initdata amd_iommu_init_err;
-
-/*
- * List of protection domains - used during resume
- */
-LIST_HEAD(amd_iommu_pd_list);
-spinlock_t amd_iommu_pd_lock;
-
-/*
- * Pointer to the device table which is shared by all AMD IOMMUs
- * it is indexed by the PCI device id or the HT unit id and contains
- * information about the domain the device belongs to as well as the
- * page table root pointer.
- */
-struct dev_table_entry *amd_iommu_dev_table;
-
-/*
- * The alias table is a driver specific data structure which contains the
- * mappings of the PCI device ids to the actual requestor ids on the IOMMU.
- * More than one device can share the same requestor id.
- */
-u16 *amd_iommu_alias_table;
-
-/*
- * The rlookup table is used to find the IOMMU which is responsible
- * for a specific device. It is also indexed by the PCI device id.
- */
-struct amd_iommu **amd_iommu_rlookup_table;
-
-/*
- * AMD IOMMU allows up to 2^16 different protection domains. This is a bitmap
- * to know which ones are already in use.
- */
-unsigned long *amd_iommu_pd_alloc_bitmap;
-
-static u32 dev_table_size; /* size of the device table */
-static u32 alias_table_size; /* size of the alias table */
-static u32 rlookup_table_size;	/* size of the rlookup table */
-
-static inline void update_last_devid(u16 devid)
-{
- if (devid > amd_iommu_last_bdf)
- amd_iommu_last_bdf = devid;
-}
-
-static inline unsigned long tbl_size(int entry_size)
-{
- unsigned shift = PAGE_SHIFT +
- get_order(((int)amd_iommu_last_bdf + 1) * entry_size);
-
- return 1UL << shift;
-}
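-
-/*
- * A worked example for tbl_size(), assuming 4KB pages and the 32-byte
- * device table entries this driver uses: with amd_iommu_last_bdf = 0xffff,
- * (0xffff + 1) * 32 = 2MB, get_order(2MB) = 9, so the shift is
- * 12 + 9 = 21 and the table size comes out as 2MB.
- */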
-
-/****************************************************************************
- *
- * AMD IOMMU MMIO register space handling functions
- *
- * These functions are used to program the IOMMU device registers in
- * MMIO space required for that driver.
- *
- ****************************************************************************/
-
-/*
- * This function sets the exclusion range in the IOMMU. DMA accesses to the
- * exclusion range are passed through untranslated
- */
-static void iommu_set_exclusion_range(struct amd_iommu *iommu)
-{
- u64 start = iommu->exclusion_start & PAGE_MASK;
- u64 limit = (start + iommu->exclusion_length) & PAGE_MASK;
- u64 entry;
-
- if (!iommu->exclusion_start)
- return;
-
- entry = start | MMIO_EXCL_ENABLE_MASK;
- memcpy_toio(iommu->mmio_base + MMIO_EXCL_BASE_OFFSET,
- &entry, sizeof(entry));
-
- entry = limit;
- memcpy_toio(iommu->mmio_base + MMIO_EXCL_LIMIT_OFFSET,
- &entry, sizeof(entry));
-}
-
-/* Programs the physical address of the device table into the IOMMU hardware */
-static void __init iommu_set_device_table(struct amd_iommu *iommu)
-{
- u64 entry;
-
- BUG_ON(iommu->mmio_base == NULL);
-
- entry = virt_to_phys(amd_iommu_dev_table);
- entry |= (dev_table_size >> 12) - 1;
- memcpy_toio(iommu->mmio_base + MMIO_DEV_TABLE_OFFSET,
- &entry, sizeof(entry));
-}
-
-/* Generic functions to enable/disable certain features of the IOMMU. */
-static void iommu_feature_enable(struct amd_iommu *iommu, u8 bit)
-{
- u32 ctrl;
-
- ctrl = readl(iommu->mmio_base + MMIO_CONTROL_OFFSET);
- ctrl |= (1 << bit);
- writel(ctrl, iommu->mmio_base + MMIO_CONTROL_OFFSET);
-}
-
-static void iommu_feature_disable(struct amd_iommu *iommu, u8 bit)
-{
- u32 ctrl;
-
- ctrl = readl(iommu->mmio_base + MMIO_CONTROL_OFFSET);
- ctrl &= ~(1 << bit);
- writel(ctrl, iommu->mmio_base + MMIO_CONTROL_OFFSET);
-}
-
-/* Function to enable the hardware */
-static void iommu_enable(struct amd_iommu *iommu)
-{
- printk(KERN_INFO "AMD-Vi: Enabling IOMMU at %s cap 0x%hx\n",
- dev_name(&iommu->dev->dev), iommu->cap_ptr);
-
- iommu_feature_enable(iommu, CONTROL_IOMMU_EN);
-}
-
-static void iommu_disable(struct amd_iommu *iommu)
-{
- /* Disable command buffer */
- iommu_feature_disable(iommu, CONTROL_CMDBUF_EN);
-
- /* Disable event logging and event interrupts */
- iommu_feature_disable(iommu, CONTROL_EVT_INT_EN);
- iommu_feature_disable(iommu, CONTROL_EVT_LOG_EN);
-
- /* Disable IOMMU hardware itself */
- iommu_feature_disable(iommu, CONTROL_IOMMU_EN);
-}
-
-/*
- * mapping and unmapping functions for the IOMMU MMIO space. Each AMD IOMMU in
- * the system has one.
- */
-static u8 * __init iommu_map_mmio_space(u64 address)
-{
- u8 *ret;
-
- if (!request_mem_region(address, MMIO_REGION_LENGTH, "amd_iommu")) {
- pr_err("AMD-Vi: Can not reserve memory region %llx for mmio\n",
- address);
- pr_err("AMD-Vi: This is a BIOS bug. Please contact your hardware vendor\n");
- return NULL;
- }
-
- ret = ioremap_nocache(address, MMIO_REGION_LENGTH);
- if (ret != NULL)
- return ret;
-
- release_mem_region(address, MMIO_REGION_LENGTH);
-
- return NULL;
-}
-
-static void __init iommu_unmap_mmio_space(struct amd_iommu *iommu)
-{
- if (iommu->mmio_base)
- iounmap(iommu->mmio_base);
- release_mem_region(iommu->mmio_phys, MMIO_REGION_LENGTH);
-}
-
-/****************************************************************************
- *
- * The functions below belong to the first pass of AMD IOMMU ACPI table
- * parsing. In this pass we try to find out the highest device id this
- * code has to handle. Based on this information, the size of the shared
- * data structures is determined later.
- *
- ****************************************************************************/
-
-/*
- * This function calculates the length of a given IVHD entry
- */
-static inline int ivhd_entry_length(u8 *ivhd)
-{
- return 0x04 << (*ivhd >> 6);
-}
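-
-/*
- * Example for ivhd_entry_length(): the top two bits of the type byte
- * select the entry size, 0x04 << (type >> 6). A 4-byte entry such as
- * IVHD_DEV_SELECT (0x02) gives 0x04 << 0 = 4; an 8-byte entry such as
- * IVHD_DEV_ALIAS (0x42) gives 0x04 << 1 = 8.
- */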
-
-/*
- * This function reads the last device id the IOMMU has to handle from the PCI
- * capability header for this IOMMU
- */
-static int __init find_last_devid_on_pci(int bus, int dev, int fn, int cap_ptr)
-{
- u32 cap;
-
- cap = read_pci_config(bus, dev, fn, cap_ptr+MMIO_RANGE_OFFSET);
- update_last_devid(calc_devid(MMIO_GET_BUS(cap), MMIO_GET_LD(cap)));
-
- return 0;
-}
-
-/*
- * After reading the highest device id from the IOMMU PCI capability header,
- * this function checks whether a higher device id is defined in the ACPI table
- */
-static int __init find_last_devid_from_ivhd(struct ivhd_header *h)
-{
- u8 *p = (void *)h, *end = (void *)h;
- struct ivhd_entry *dev;
-
- p += sizeof(*h);
- end += h->length;
-
- find_last_devid_on_pci(PCI_BUS(h->devid),
- PCI_SLOT(h->devid),
- PCI_FUNC(h->devid),
- h->cap_ptr);
-
- while (p < end) {
- dev = (struct ivhd_entry *)p;
- switch (dev->type) {
- case IVHD_DEV_SELECT:
- case IVHD_DEV_RANGE_END:
- case IVHD_DEV_ALIAS:
- case IVHD_DEV_EXT_SELECT:
- /* all the above subfield types refer to device ids */
- update_last_devid(dev->devid);
- break;
- default:
- break;
- }
- p += ivhd_entry_length(p);
- }
-
- WARN_ON(p != end);
-
- return 0;
-}
-
-/*
- * Iterate over all IVHD entries in the ACPI table and find the highest device
- * id which we need to handle. This is the first of three functions which parse
- * the ACPI table. So we check the checksum here.
- */
-static int __init find_last_devid_acpi(struct acpi_table_header *table)
-{
- int i;
- u8 checksum = 0, *p = (u8 *)table, *end = (u8 *)table;
- struct ivhd_header *h;
-
- /*
- * Validate checksum here so we don't need to do it when
- * we actually parse the table
- */
- for (i = 0; i < table->length; ++i)
- checksum += p[i];
- if (checksum != 0) {
- /* ACPI table corrupt */
- amd_iommu_init_err = -ENODEV;
- return 0;
- }
-
- p += IVRS_HEADER_LENGTH;
-
- end += table->length;
- while (p < end) {
- h = (struct ivhd_header *)p;
- switch (h->type) {
- case ACPI_IVHD_TYPE:
- find_last_devid_from_ivhd(h);
- break;
- default:
- break;
- }
- p += h->length;
- }
- WARN_ON(p != end);
-
- return 0;
-}
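-
-/*
- * Checksum sketch: an ACPI table is valid when the byte-wise sum of the
- * whole table is 0 modulo 256. E.g. for a hypothetical table whose bytes
- * sum to 0x3ff, the u8 checksum above accumulates to 0xff != 0 and the
- * table is rejected with -ENODEV.
- */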
-
-/****************************************************************************
- *
- * The following functions belong to the code path which parses the ACPI table
- * the second time. In this ACPI parsing iteration we allocate IOMMU specific
- * data structures, initialize the device/alias/rlookup table and also
- * basically initialize the hardware.
- *
- ****************************************************************************/
-
-/*
- * Allocates the command buffer. This buffer is per AMD IOMMU. We can
- * write commands to that buffer later and the IOMMU will execute them
- * asynchronously
- */
-static u8 * __init alloc_command_buffer(struct amd_iommu *iommu)
-{
- u8 *cmd_buf = (u8 *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
- get_order(CMD_BUFFER_SIZE));
-
- if (cmd_buf == NULL)
- return NULL;
-
- iommu->cmd_buf_size = CMD_BUFFER_SIZE | CMD_BUFFER_UNINITIALIZED;
-
- return cmd_buf;
-}
-
-/*
- * This function resets the command buffer if the IOMMU stopped fetching
- * commands from it.
- */
-void amd_iommu_reset_cmd_buffer(struct amd_iommu *iommu)
-{
- iommu_feature_disable(iommu, CONTROL_CMDBUF_EN);
-
- writel(0x00, iommu->mmio_base + MMIO_CMD_HEAD_OFFSET);
- writel(0x00, iommu->mmio_base + MMIO_CMD_TAIL_OFFSET);
-
- iommu_feature_enable(iommu, CONTROL_CMDBUF_EN);
-}
-
-/*
- * This function writes the command buffer address to the hardware and
- * enables it.
- */
-static void iommu_enable_command_buffer(struct amd_iommu *iommu)
-{
- u64 entry;
-
- BUG_ON(iommu->cmd_buf == NULL);
-
- entry = (u64)virt_to_phys(iommu->cmd_buf);
- entry |= MMIO_CMD_SIZE_512;
-
- memcpy_toio(iommu->mmio_base + MMIO_CMD_BUF_OFFSET,
- &entry, sizeof(entry));
-
- amd_iommu_reset_cmd_buffer(iommu);
- iommu->cmd_buf_size &= ~(CMD_BUFFER_UNINITIALIZED);
-}
-
-static void __init free_command_buffer(struct amd_iommu *iommu)
-{
- free_pages((unsigned long)iommu->cmd_buf,
- get_order(iommu->cmd_buf_size & ~(CMD_BUFFER_UNINITIALIZED)));
-}
-
-/* allocates the memory where the IOMMU will log its events to */
-static u8 * __init alloc_event_buffer(struct amd_iommu *iommu)
-{
- iommu->evt_buf = (u8 *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
- get_order(EVT_BUFFER_SIZE));
-
- if (iommu->evt_buf == NULL)
- return NULL;
-
- iommu->evt_buf_size = EVT_BUFFER_SIZE;
-
- return iommu->evt_buf;
-}
-
-static void iommu_enable_event_buffer(struct amd_iommu *iommu)
-{
- u64 entry;
-
- BUG_ON(iommu->evt_buf == NULL);
-
- entry = (u64)virt_to_phys(iommu->evt_buf) | EVT_LEN_MASK;
-
- memcpy_toio(iommu->mmio_base + MMIO_EVT_BUF_OFFSET,
- &entry, sizeof(entry));
-
- /* set head and tail to zero manually */
- writel(0x00, iommu->mmio_base + MMIO_EVT_HEAD_OFFSET);
- writel(0x00, iommu->mmio_base + MMIO_EVT_TAIL_OFFSET);
-
- iommu_feature_enable(iommu, CONTROL_EVT_LOG_EN);
-}
-
-static void __init free_event_buffer(struct amd_iommu *iommu)
-{
- free_pages((unsigned long)iommu->evt_buf, get_order(EVT_BUFFER_SIZE));
-}
-
-/* sets a specific bit in the device table entry. */
-static void set_dev_entry_bit(u16 devid, u8 bit)
-{
- int i = (bit >> 5) & 0x07;
- int _bit = bit & 0x1f;
-
- amd_iommu_dev_table[devid].data[i] |= (1 << _bit);
-}
-
-static int get_dev_entry_bit(u16 devid, u8 bit)
-{
- int i = (bit >> 5) & 0x07;
- int _bit = bit & 0x1f;
-
- return (amd_iommu_dev_table[devid].data[i] & (1 << _bit)) >> _bit;
-}
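-
-/*
- * Indexing sketch for the two helpers above: bit numbers address the DTE
- * as an array of 32-bit words, so bit B lands in data[(B >> 5) & 0x07],
- * bit (B & 0x1f). For example, a hypothetical bit 0x3e selects data[1],
- * bit 30.
- */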
-
-
-void amd_iommu_apply_erratum_63(u16 devid)
-{
- int sysmgt;
-
- sysmgt = get_dev_entry_bit(devid, DEV_ENTRY_SYSMGT1) |
- (get_dev_entry_bit(devid, DEV_ENTRY_SYSMGT2) << 1);
-
- if (sysmgt == 0x01)
- set_dev_entry_bit(devid, DEV_ENTRY_IW);
-}
-
-/* Writes the specific IOMMU for a device into the rlookup table */
-static void __init set_iommu_for_device(struct amd_iommu *iommu, u16 devid)
-{
- amd_iommu_rlookup_table[devid] = iommu;
-}
-
-/*
- * This function takes the device specific flags read from the ACPI
- * table and sets up the device table entry with that information
- */
-static void __init set_dev_entry_from_acpi(struct amd_iommu *iommu,
- u16 devid, u32 flags, u32 ext_flags)
-{
- if (flags & ACPI_DEVFLAG_INITPASS)
- set_dev_entry_bit(devid, DEV_ENTRY_INIT_PASS);
- if (flags & ACPI_DEVFLAG_EXTINT)
- set_dev_entry_bit(devid, DEV_ENTRY_EINT_PASS);
- if (flags & ACPI_DEVFLAG_NMI)
- set_dev_entry_bit(devid, DEV_ENTRY_NMI_PASS);
- if (flags & ACPI_DEVFLAG_SYSMGT1)
- set_dev_entry_bit(devid, DEV_ENTRY_SYSMGT1);
- if (flags & ACPI_DEVFLAG_SYSMGT2)
- set_dev_entry_bit(devid, DEV_ENTRY_SYSMGT2);
- if (flags & ACPI_DEVFLAG_LINT0)
- set_dev_entry_bit(devid, DEV_ENTRY_LINT0_PASS);
- if (flags & ACPI_DEVFLAG_LINT1)
- set_dev_entry_bit(devid, DEV_ENTRY_LINT1_PASS);
-
- amd_iommu_apply_erratum_63(devid);
-
- set_iommu_for_device(iommu, devid);
-}
-
-/*
- * Reads the device exclusion range from ACPI and initializes the IOMMU
- * with it
- */
-static void __init set_device_exclusion_range(u16 devid, struct ivmd_header *m)
-{
- struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];
-
- if (!(m->flags & IVMD_FLAG_EXCL_RANGE))
- return;
-
- if (iommu) {
- /*
-		 * We can only configure exclusion ranges per IOMMU, not
-		 * per device. But we can enable the exclusion range per
-		 * device; this is done here.
- */
- set_dev_entry_bit(m->devid, DEV_ENTRY_EX);
- iommu->exclusion_start = m->range_start;
- iommu->exclusion_length = m->range_length;
- }
-}
-
-/*
- * This function reads some important data from the IOMMU PCI space and
- * initializes the driver data structure with it. It reads the hardware
- * capabilities and the first/last device entries
- */
-static void __init init_iommu_from_pci(struct amd_iommu *iommu)
-{
- int cap_ptr = iommu->cap_ptr;
- u32 range, misc;
-
- pci_read_config_dword(iommu->dev, cap_ptr + MMIO_CAP_HDR_OFFSET,
- &iommu->cap);
- pci_read_config_dword(iommu->dev, cap_ptr + MMIO_RANGE_OFFSET,
- &range);
- pci_read_config_dword(iommu->dev, cap_ptr + MMIO_MISC_OFFSET,
- &misc);
-
- iommu->first_device = calc_devid(MMIO_GET_BUS(range),
- MMIO_GET_FD(range));
- iommu->last_device = calc_devid(MMIO_GET_BUS(range),
- MMIO_GET_LD(range));
- iommu->evt_msi_num = MMIO_MSI_NUM(misc);
-}
-
-/*
- * Takes a pointer to an AMD IOMMU entry in the ACPI table and
- * initializes the hardware and our data structures with it.
- */
-static void __init init_iommu_from_acpi(struct amd_iommu *iommu,
- struct ivhd_header *h)
-{
- u8 *p = (u8 *)h;
- u8 *end = p, flags = 0;
- u16 dev_i, devid = 0, devid_start = 0, devid_to = 0;
- u32 ext_flags = 0;
- bool alias = false;
- struct ivhd_entry *e;
-
- /*
- * First set the recommended feature enable bits from ACPI
- * into the IOMMU control registers
- */
- h->flags & IVHD_FLAG_HT_TUN_EN_MASK ?
- iommu_feature_enable(iommu, CONTROL_HT_TUN_EN) :
- iommu_feature_disable(iommu, CONTROL_HT_TUN_EN);
-
- h->flags & IVHD_FLAG_PASSPW_EN_MASK ?
- iommu_feature_enable(iommu, CONTROL_PASSPW_EN) :
- iommu_feature_disable(iommu, CONTROL_PASSPW_EN);
-
- h->flags & IVHD_FLAG_RESPASSPW_EN_MASK ?
- iommu_feature_enable(iommu, CONTROL_RESPASSPW_EN) :
- iommu_feature_disable(iommu, CONTROL_RESPASSPW_EN);
-
- h->flags & IVHD_FLAG_ISOC_EN_MASK ?
- iommu_feature_enable(iommu, CONTROL_ISOC_EN) :
- iommu_feature_disable(iommu, CONTROL_ISOC_EN);
-
- /*
- * make IOMMU memory accesses cache coherent
- */
- iommu_feature_enable(iommu, CONTROL_COHERENT_EN);
-
- /*
- * Done. Now parse the device entries
- */
- p += sizeof(struct ivhd_header);
- end += h->length;
-
-
- while (p < end) {
- e = (struct ivhd_entry *)p;
- switch (e->type) {
- case IVHD_DEV_ALL:
-
- DUMP_printk(" DEV_ALL\t\t\t first devid: %02x:%02x.%x"
- " last device %02x:%02x.%x flags: %02x\n",
- PCI_BUS(iommu->first_device),
- PCI_SLOT(iommu->first_device),
- PCI_FUNC(iommu->first_device),
- PCI_BUS(iommu->last_device),
- PCI_SLOT(iommu->last_device),
- PCI_FUNC(iommu->last_device),
- e->flags);
-
- for (dev_i = iommu->first_device;
- dev_i <= iommu->last_device; ++dev_i)
- set_dev_entry_from_acpi(iommu, dev_i,
- e->flags, 0);
- break;
- case IVHD_DEV_SELECT:
-
- DUMP_printk(" DEV_SELECT\t\t\t devid: %02x:%02x.%x "
- "flags: %02x\n",
- PCI_BUS(e->devid),
- PCI_SLOT(e->devid),
- PCI_FUNC(e->devid),
- e->flags);
-
- devid = e->devid;
- set_dev_entry_from_acpi(iommu, devid, e->flags, 0);
- break;
- case IVHD_DEV_SELECT_RANGE_START:
-
- DUMP_printk(" DEV_SELECT_RANGE_START\t "
- "devid: %02x:%02x.%x flags: %02x\n",
- PCI_BUS(e->devid),
- PCI_SLOT(e->devid),
- PCI_FUNC(e->devid),
- e->flags);
-
- devid_start = e->devid;
- flags = e->flags;
- ext_flags = 0;
- alias = false;
- break;
- case IVHD_DEV_ALIAS:
-
- DUMP_printk(" DEV_ALIAS\t\t\t devid: %02x:%02x.%x "
- "flags: %02x devid_to: %02x:%02x.%x\n",
- PCI_BUS(e->devid),
- PCI_SLOT(e->devid),
- PCI_FUNC(e->devid),
- e->flags,
- PCI_BUS(e->ext >> 8),
- PCI_SLOT(e->ext >> 8),
- PCI_FUNC(e->ext >> 8));
-
- devid = e->devid;
- devid_to = e->ext >> 8;
-			set_dev_entry_from_acpi(iommu, devid, e->flags, 0);
- set_dev_entry_from_acpi(iommu, devid_to, e->flags, 0);
- amd_iommu_alias_table[devid] = devid_to;
- break;
- case IVHD_DEV_ALIAS_RANGE:
-
- DUMP_printk(" DEV_ALIAS_RANGE\t\t "
- "devid: %02x:%02x.%x flags: %02x "
- "devid_to: %02x:%02x.%x\n",
- PCI_BUS(e->devid),
- PCI_SLOT(e->devid),
- PCI_FUNC(e->devid),
- e->flags,
- PCI_BUS(e->ext >> 8),
- PCI_SLOT(e->ext >> 8),
- PCI_FUNC(e->ext >> 8));
-
- devid_start = e->devid;
- flags = e->flags;
- devid_to = e->ext >> 8;
- ext_flags = 0;
- alias = true;
- break;
- case IVHD_DEV_EXT_SELECT:
-
- DUMP_printk(" DEV_EXT_SELECT\t\t devid: %02x:%02x.%x "
- "flags: %02x ext: %08x\n",
- PCI_BUS(e->devid),
- PCI_SLOT(e->devid),
- PCI_FUNC(e->devid),
- e->flags, e->ext);
-
- devid = e->devid;
- set_dev_entry_from_acpi(iommu, devid, e->flags,
- e->ext);
- break;
- case IVHD_DEV_EXT_SELECT_RANGE:
-
- DUMP_printk(" DEV_EXT_SELECT_RANGE\t devid: "
- "%02x:%02x.%x flags: %02x ext: %08x\n",
- PCI_BUS(e->devid),
- PCI_SLOT(e->devid),
- PCI_FUNC(e->devid),
- e->flags, e->ext);
-
- devid_start = e->devid;
- flags = e->flags;
- ext_flags = e->ext;
- alias = false;
- break;
- case IVHD_DEV_RANGE_END:
-
- DUMP_printk(" DEV_RANGE_END\t\t devid: %02x:%02x.%x\n",
- PCI_BUS(e->devid),
- PCI_SLOT(e->devid),
- PCI_FUNC(e->devid));
-
- devid = e->devid;
- for (dev_i = devid_start; dev_i <= devid; ++dev_i) {
- if (alias) {
- amd_iommu_alias_table[dev_i] = devid_to;
- set_dev_entry_from_acpi(iommu,
- devid_to, flags, ext_flags);
- }
- set_dev_entry_from_acpi(iommu, dev_i,
- flags, ext_flags);
- }
- break;
- default:
- break;
- }
-
- p += ivhd_entry_length(p);
- }
-}
-
-/* Initializes the device->iommu mapping for the driver */
-static int __init init_iommu_devices(struct amd_iommu *iommu)
-{
- u16 i;
-
- for (i = iommu->first_device; i <= iommu->last_device; ++i)
- set_iommu_for_device(iommu, i);
-
- return 0;
-}
-
-static void __init free_iommu_one(struct amd_iommu *iommu)
-{
- free_command_buffer(iommu);
- free_event_buffer(iommu);
- iommu_unmap_mmio_space(iommu);
-}
-
-static void __init free_iommu_all(void)
-{
- struct amd_iommu *iommu, *next;
-
- for_each_iommu_safe(iommu, next) {
- list_del(&iommu->list);
- free_iommu_one(iommu);
- kfree(iommu);
- }
-}
-
-/*
- * This function glues the initialization for one IOMMU
- * together and also allocates the command buffer and programs the
- * hardware. It does NOT enable the IOMMU. This is done afterwards.
- */
-static int __init init_iommu_one(struct amd_iommu *iommu, struct ivhd_header *h)
-{
- spin_lock_init(&iommu->lock);
-
- /* Add IOMMU to internal data structures */
- list_add_tail(&iommu->list, &amd_iommu_list);
- iommu->index = amd_iommus_present++;
-
- if (unlikely(iommu->index >= MAX_IOMMUS)) {
- WARN(1, "AMD-Vi: System has more IOMMUs than supported by this driver\n");
- return -ENOSYS;
- }
-
- /* Index is fine - add IOMMU to the array */
- amd_iommus[iommu->index] = iommu;
-
- /*
- * Copy data from ACPI table entry to the iommu struct
- */
- iommu->dev = pci_get_bus_and_slot(PCI_BUS(h->devid), h->devid & 0xff);
- if (!iommu->dev)
- return 1;
-
- iommu->cap_ptr = h->cap_ptr;
- iommu->pci_seg = h->pci_seg;
- iommu->mmio_phys = h->mmio_phys;
- iommu->mmio_base = iommu_map_mmio_space(h->mmio_phys);
- if (!iommu->mmio_base)
- return -ENOMEM;
-
- iommu->cmd_buf = alloc_command_buffer(iommu);
- if (!iommu->cmd_buf)
- return -ENOMEM;
-
- iommu->evt_buf = alloc_event_buffer(iommu);
- if (!iommu->evt_buf)
- return -ENOMEM;
-
- iommu->int_enabled = false;
-
- init_iommu_from_pci(iommu);
- init_iommu_from_acpi(iommu, h);
- init_iommu_devices(iommu);
-
- if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE))
- amd_iommu_np_cache = true;
-
- return pci_enable_device(iommu->dev);
-}
-
-/*
- * Iterates over all IOMMU entries in the ACPI table, allocates the
- * IOMMU structure and initializes it with init_iommu_one()
- */
-static int __init init_iommu_all(struct acpi_table_header *table)
-{
- u8 *p = (u8 *)table, *end = (u8 *)table;
- struct ivhd_header *h;
- struct amd_iommu *iommu;
- int ret;
-
- end += table->length;
- p += IVRS_HEADER_LENGTH;
-
- while (p < end) {
- h = (struct ivhd_header *)p;
- switch (*p) {
- case ACPI_IVHD_TYPE:
-
- DUMP_printk("device: %02x:%02x.%01x cap: %04x "
- "seg: %d flags: %01x info %04x\n",
- PCI_BUS(h->devid), PCI_SLOT(h->devid),
- PCI_FUNC(h->devid), h->cap_ptr,
- h->pci_seg, h->flags, h->info);
- DUMP_printk(" mmio-addr: %016llx\n",
- h->mmio_phys);
-
- iommu = kzalloc(sizeof(struct amd_iommu), GFP_KERNEL);
- if (iommu == NULL) {
- amd_iommu_init_err = -ENOMEM;
- return 0;
- }
-
- ret = init_iommu_one(iommu, h);
- if (ret) {
- amd_iommu_init_err = ret;
- return 0;
- }
- break;
- default:
- break;
- }
- p += h->length;
-
- }
- WARN_ON(p != end);
-
- return 0;
-}
-
-/****************************************************************************
- *
- * The following functions initialize the MSI interrupts for all IOMMUs
- * in the system. It's a bit challenging because there could be multiple
- * IOMMUs per PCI BDF but we can call pci_enable_msi(x) only once per
- * pci_dev.
- *
- ****************************************************************************/
-
-static int iommu_setup_msi(struct amd_iommu *iommu)
-{
- int r;
-
- if (pci_enable_msi(iommu->dev))
- return 1;
-
- r = request_irq(iommu->dev->irq, amd_iommu_int_handler,
- IRQF_SAMPLE_RANDOM,
- "AMD-Vi",
- NULL);
-
- if (r) {
- pci_disable_msi(iommu->dev);
- return 1;
- }
-
- iommu->int_enabled = true;
- iommu_feature_enable(iommu, CONTROL_EVT_INT_EN);
-
- return 0;
-}
-
-static int iommu_init_msi(struct amd_iommu *iommu)
-{
- if (iommu->int_enabled)
- return 0;
-
- if (pci_find_capability(iommu->dev, PCI_CAP_ID_MSI))
- return iommu_setup_msi(iommu);
-
- return 1;
-}
-
-/****************************************************************************
- *
- * The next functions belong to the third pass of parsing the ACPI
- * table. In this last pass the memory mapping requirements are
- * gathered (like exclusion and unity mapping ranges).
- *
- ****************************************************************************/
-
-static void __init free_unity_maps(void)
-{
- struct unity_map_entry *entry, *next;
-
- list_for_each_entry_safe(entry, next, &amd_iommu_unity_map, list) {
- list_del(&entry->list);
- kfree(entry);
- }
-}
-
-/* called when we find an exclusion range definition in ACPI */
-static int __init init_exclusion_range(struct ivmd_header *m)
-{
- int i;
-
- switch (m->type) {
- case ACPI_IVMD_TYPE:
- set_device_exclusion_range(m->devid, m);
- break;
- case ACPI_IVMD_TYPE_ALL:
- for (i = 0; i <= amd_iommu_last_bdf; ++i)
- set_device_exclusion_range(i, m);
- break;
- case ACPI_IVMD_TYPE_RANGE:
- for (i = m->devid; i <= m->aux; ++i)
- set_device_exclusion_range(i, m);
- break;
- default:
- break;
- }
-
- return 0;
-}
-
-/* called for unity map ACPI definition */
-static int __init init_unity_map_range(struct ivmd_header *m)
-{
- struct unity_map_entry *e = 0;
- char *s;
-
- e = kzalloc(sizeof(*e), GFP_KERNEL);
- if (e == NULL)
- return -ENOMEM;
-
- switch (m->type) {
- default:
- kfree(e);
- return 0;
- case ACPI_IVMD_TYPE:
- s = "IVMD_TYPEi\t\t\t";
- e->devid_start = e->devid_end = m->devid;
- break;
- case ACPI_IVMD_TYPE_ALL:
- s = "IVMD_TYPE_ALL\t\t";
- e->devid_start = 0;
- e->devid_end = amd_iommu_last_bdf;
- break;
- case ACPI_IVMD_TYPE_RANGE:
- s = "IVMD_TYPE_RANGE\t\t";
- e->devid_start = m->devid;
- e->devid_end = m->aux;
- break;
- }
- e->address_start = PAGE_ALIGN(m->range_start);
- e->address_end = e->address_start + PAGE_ALIGN(m->range_length);
- e->prot = m->flags >> 1;
-
- DUMP_printk("%s devid_start: %02x:%02x.%x devid_end: %02x:%02x.%x"
- " range_start: %016llx range_end: %016llx flags: %x\n", s,
- PCI_BUS(e->devid_start), PCI_SLOT(e->devid_start),
- PCI_FUNC(e->devid_start), PCI_BUS(e->devid_end),
- PCI_SLOT(e->devid_end), PCI_FUNC(e->devid_end),
- e->address_start, e->address_end, m->flags);
-
- list_add_tail(&e->list, &amd_iommu_unity_map);
-
- return 0;
-}
-
-/* iterates over all memory definitions we find in the ACPI table */
-static int __init init_memory_definitions(struct acpi_table_header *table)
-{
- u8 *p = (u8 *)table, *end = (u8 *)table;
- struct ivmd_header *m;
-
- end += table->length;
- p += IVRS_HEADER_LENGTH;
-
- while (p < end) {
- m = (struct ivmd_header *)p;
- if (m->flags & IVMD_FLAG_EXCL_RANGE)
- init_exclusion_range(m);
- else if (m->flags & IVMD_FLAG_UNITY_MAP)
- init_unity_map_range(m);
-
- p += m->length;
- }
-
- return 0;
-}
-
-/*
- * Init the device table to not allow DMA access for devices and
- * suppress all page faults
- */
-static void init_device_table(void)
-{
- u16 devid;
-
- for (devid = 0; devid <= amd_iommu_last_bdf; ++devid) {
- set_dev_entry_bit(devid, DEV_ENTRY_VALID);
- set_dev_entry_bit(devid, DEV_ENTRY_TRANSLATION);
- }
-}
-
-/*
- * This function finally enables all IOMMUs found in the system after
- * they have been initialized
- */
-static void enable_iommus(void)
-{
- struct amd_iommu *iommu;
-
- for_each_iommu(iommu) {
- iommu_disable(iommu);
- iommu_set_device_table(iommu);
- iommu_enable_command_buffer(iommu);
- iommu_enable_event_buffer(iommu);
- iommu_set_exclusion_range(iommu);
- iommu_init_msi(iommu);
- iommu_enable(iommu);
- }
-}
-
-static void disable_iommus(void)
-{
- struct amd_iommu *iommu;
-
- for_each_iommu(iommu)
- iommu_disable(iommu);
-}
-
-/*
- * Suspend/Resume support
- * disable suspend until real resume implemented
- */
-
-static int amd_iommu_resume(struct sys_device *dev)
-{
- /* re-load the hardware */
- enable_iommus();
-
- /*
- * we have to flush after the IOMMUs are enabled because a
- * disabled IOMMU will never execute the commands we send
- */
- amd_iommu_flush_all_devices();
- amd_iommu_flush_all_domains();
-
- return 0;
-}
-
-static int amd_iommu_suspend(struct sys_device *dev, pm_message_t state)
-{
- /* disable IOMMUs to go out of the way for BIOS */
- disable_iommus();
-
- return 0;
-}
-
-static struct sysdev_class amd_iommu_sysdev_class = {
- .name = "amd_iommu",
- .suspend = amd_iommu_suspend,
- .resume = amd_iommu_resume,
-};
-
-static struct sys_device device_amd_iommu = {
- .id = 0,
- .cls = &amd_iommu_sysdev_class,
-};
-
-/*
- * This is the core init function for AMD IOMMU hardware in the system.
- * This function is called from the generic x86 DMA layer initialization
- * code.
- *
- * This function basically parses the ACPI table for AMD IOMMU (IVRS)
- * three times:
- *
- * 1 pass) Find the highest PCI device id the driver has to handle.
- *         Based on this information, the sizes of the data
- *         structures that need to be allocated are determined.
- *
- * 2 pass) Initialize the data structures just allocated with the
- * information in the ACPI table about available AMD IOMMUs
- * in the system. It also maps the PCI devices in the
- * system to specific IOMMUs
- *
- * 3 pass) After the basic data structures are allocated and
- * initialized we update them with information about memory
- * remapping requirements parsed out of the ACPI table in
- * this last pass.
- *
- * After that the hardware is initialized and ready to go. In the last
- * step we do some Linux specific things like registering the driver in
- * the dma_ops interface and initializing the suspend/resume support
- * functions. Finally it prints some information about AMD IOMMUs and
- * the driver state and enables the hardware.
- */
-static int __init amd_iommu_init(void)
-{
- int i, ret = 0;
-
- /*
- * First parse ACPI tables to find the largest Bus/Dev/Func
-	 * we need to handle. Based on this information, the shared data
-	 * structures for the IOMMUs in the system will be allocated.
- */
- if (acpi_table_parse("IVRS", find_last_devid_acpi) != 0)
- return -ENODEV;
-
- ret = amd_iommu_init_err;
- if (ret)
- goto out;
-
- dev_table_size = tbl_size(DEV_TABLE_ENTRY_SIZE);
- alias_table_size = tbl_size(ALIAS_TABLE_ENTRY_SIZE);
- rlookup_table_size = tbl_size(RLOOKUP_TABLE_ENTRY_SIZE);
-
- ret = -ENOMEM;
-
- /* Device table - directly used by all IOMMUs */
- amd_iommu_dev_table = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
- get_order(dev_table_size));
- if (amd_iommu_dev_table == NULL)
- goto out;
-
- /*
- * Alias table - map PCI Bus/Dev/Func to Bus/Dev/Func the
-	 * IOMMU sees for that device
- */
- amd_iommu_alias_table = (void *)__get_free_pages(GFP_KERNEL,
- get_order(alias_table_size));
- if (amd_iommu_alias_table == NULL)
- goto free;
-
- /* IOMMU rlookup table - find the IOMMU for a specific device */
- amd_iommu_rlookup_table = (void *)__get_free_pages(
- GFP_KERNEL | __GFP_ZERO,
- get_order(rlookup_table_size));
- if (amd_iommu_rlookup_table == NULL)
- goto free;
-
- amd_iommu_pd_alloc_bitmap = (void *)__get_free_pages(
- GFP_KERNEL | __GFP_ZERO,
- get_order(MAX_DOMAIN_ID/8));
- if (amd_iommu_pd_alloc_bitmap == NULL)
- goto free;
-
- /* init the device table */
- init_device_table();
-
- /*
-	 * let all alias entries point to themselves
- */
- for (i = 0; i <= amd_iommu_last_bdf; ++i)
- amd_iommu_alias_table[i] = i;
-
- /*
-	 * never allocate domain 0 because it's used as the non-allocated and
- * error value placeholder
- */
- amd_iommu_pd_alloc_bitmap[0] = 1;
-
- spin_lock_init(&amd_iommu_pd_lock);
-
- /*
-	 * now that the data structures are allocated and basically
-	 * initialized, start the real ACPI table scan
- */
- ret = -ENODEV;
- if (acpi_table_parse("IVRS", init_iommu_all) != 0)
- goto free;
-
- if (amd_iommu_init_err) {
- ret = amd_iommu_init_err;
- goto free;
- }
-
- if (acpi_table_parse("IVRS", init_memory_definitions) != 0)
- goto free;
-
- if (amd_iommu_init_err) {
- ret = amd_iommu_init_err;
- goto free;
- }
-
- ret = sysdev_class_register(&amd_iommu_sysdev_class);
- if (ret)
- goto free;
-
- ret = sysdev_register(&device_amd_iommu);
- if (ret)
- goto free;
-
- ret = amd_iommu_init_devices();
- if (ret)
- goto free;
-
- enable_iommus();
-
- if (iommu_pass_through)
- ret = amd_iommu_init_passthrough();
- else
- ret = amd_iommu_init_dma_ops();
-
- if (ret)
- goto free_disable;
-
- amd_iommu_init_api();
-
- amd_iommu_init_notifier();
-
- if (iommu_pass_through)
- goto out;
-
- if (amd_iommu_unmap_flush)
- printk(KERN_INFO "AMD-Vi: IO/TLB flush on unmap enabled\n");
- else
- printk(KERN_INFO "AMD-Vi: Lazy IO/TLB flushing enabled\n");
-
- x86_platform.iommu_shutdown = disable_iommus;
-out:
- return ret;
-
-free_disable:
- disable_iommus();
-
-free:
- amd_iommu_uninit_devices();
-
- free_pages((unsigned long)amd_iommu_pd_alloc_bitmap,
- get_order(MAX_DOMAIN_ID/8));
-
- free_pages((unsigned long)amd_iommu_rlookup_table,
- get_order(rlookup_table_size));
-
- free_pages((unsigned long)amd_iommu_alias_table,
- get_order(alias_table_size));
-
- free_pages((unsigned long)amd_iommu_dev_table,
- get_order(dev_table_size));
-
- free_iommu_all();
-
- free_unity_maps();
-
-#ifdef CONFIG_GART_IOMMU
- /*
- * We failed to initialize the AMD IOMMU - try fallback to GART
- * if possible.
- */
- gart_iommu_init();
-
-#endif
-
- goto out;
-}
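[Editor's note] The three table sizes near the top of amd_iommu_init() above come from the tbl_size() helper defined earlier in this file. A sketch of its contemporary definition, reproduced from memory, so treat it as illustrative rather than authoritative:

	static inline unsigned long tbl_size(int entry_size)
	{
		/* Round the table up to a whole power-of-two number of
		 * pages, based on the highest Bus/Dev/Func found in the
		 * first ACPI parsing pass. */
		unsigned shift = PAGE_SHIFT +
				 get_order(((int)amd_iommu_last_bdf + 1) * entry_size);

		return 1UL << shift;
	}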
-
-/****************************************************************************
- *
- * Early detect code. This code runs at IOMMU detection time in the DMA
- * layer. It just looks if there is an IVRS ACPI table to detect AMD
- * IOMMUs
- *
- ****************************************************************************/
-static int __init early_amd_iommu_detect(struct acpi_table_header *table)
-{
- return 0;
-}
-
-void __init amd_iommu_detect(void)
-{
- if (no_iommu || (iommu_detected && !gart_iommu_aperture))
- return;
-
- if (amd_iommu_disabled)
- return;
-
- if (acpi_table_parse("IVRS", early_amd_iommu_detect) == 0) {
- iommu_detected = 1;
- amd_iommu_detected = 1;
- x86_init.iommu.iommu_init = amd_iommu_init;
-
- /* Make sure ACS will be enabled */
- pci_request_acs();
- }
-}
-
-/****************************************************************************
- *
- * Parsing functions for the AMD IOMMU specific kernel command line
- * options.
- *
- ****************************************************************************/
-
-static int __init parse_amd_iommu_dump(char *str)
-{
- amd_iommu_dump = true;
-
- return 1;
-}
-
-static int __init parse_amd_iommu_options(char *str)
-{
- for (; *str; ++str) {
- if (strncmp(str, "fullflush", 9) == 0)
- amd_iommu_unmap_flush = true;
- if (strncmp(str, "off", 3) == 0)
- amd_iommu_disabled = true;
- }
-
- return 1;
-}
-
-__setup("amd_iommu_dump", parse_amd_iommu_dump);
-__setup("amd_iommu=", parse_amd_iommu_options);
diff --git a/arch/x86/kernel/amd_nb.c b/arch/x86/kernel/amd_nb.c
new file mode 100644
index 000000000000..c1acead6227a
--- /dev/null
+++ b/arch/x86/kernel/amd_nb.c
@@ -0,0 +1,331 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Shared support code for AMD K8 northbridges and derivatives.
+ * Copyright 2006 Andi Kleen, SUSE Labs.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/export.h>
+#include <linux/spinlock.h>
+#include <linux/pci_ids.h>
+
+#include <asm/amd/nb.h>
+#include <asm/cpuid/api.h>
+
+static u32 *flush_words;
+
+static const struct pci_device_id amd_nb_misc_ids[] = {
+ { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB_MISC) },
+ { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_10H_NB_MISC) },
+ { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F3) },
+ { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_M10H_F3) },
+ { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_M30H_NB_F3) },
+ { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_M60H_NB_F3) },
+ { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_16H_NB_F3) },
+ { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_16H_M30H_NB_F3) },
+ {}
+};
+
+const struct amd_nb_bus_dev_range amd_nb_bus_dev_ranges[] __initconst = {
+ { 0x00, 0x18, 0x20 },
+ { 0xff, 0x00, 0x20 },
+ { 0xfe, 0x00, 0x20 },
+ { }
+};
+
+static struct amd_northbridge_info amd_northbridges;
+
+u16 amd_nb_num(void)
+{
+ return amd_northbridges.num;
+}
+EXPORT_SYMBOL_GPL(amd_nb_num);
+
+bool amd_nb_has_feature(unsigned int feature)
+{
+ return ((amd_northbridges.flags & feature) == feature);
+}
+EXPORT_SYMBOL_GPL(amd_nb_has_feature);
+
+struct amd_northbridge *node_to_amd_nb(int node)
+{
+ return (node < amd_northbridges.num) ? &amd_northbridges.nb[node] : NULL;
+}
+EXPORT_SYMBOL_GPL(node_to_amd_nb);
+
+static int amd_cache_northbridges(void)
+{
+ struct amd_northbridge *nb;
+ u16 i;
+
+ if (amd_northbridges.num)
+ return 0;
+
+ amd_northbridges.num = amd_num_nodes();
+
+ nb = kcalloc(amd_northbridges.num, sizeof(struct amd_northbridge), GFP_KERNEL);
+ if (!nb)
+ return -ENOMEM;
+
+ amd_northbridges.nb = nb;
+
+ for (i = 0; i < amd_northbridges.num; i++) {
+ node_to_amd_nb(i)->misc = amd_node_get_func(i, 3);
+
+ /*
+ * Each Northbridge must have a 'misc' device.
+ * If not, then uninitialize everything.
+ */
+ if (!node_to_amd_nb(i)->misc) {
+ amd_northbridges.num = 0;
+ kfree(nb);
+ return -ENODEV;
+ }
+
+ node_to_amd_nb(i)->link = amd_node_get_func(i, 4);
+ }
+
+ if (amd_gart_present())
+ amd_northbridges.flags |= AMD_NB_GART;
+
+ if (!cpuid_amd_hygon_has_l3_cache())
+ return 0;
+
+ /*
+ * Some CPU families support L3 Cache Index Disable. There are some
+ * limitations because of E382 and E388 on family 0x10.
+ */
+ if (boot_cpu_data.x86 == 0x10 &&
+ boot_cpu_data.x86_model >= 0x8 &&
+ (boot_cpu_data.x86_model > 0x9 ||
+ boot_cpu_data.x86_stepping >= 0x1))
+ amd_northbridges.flags |= AMD_NB_L3_INDEX_DISABLE;
+
+ if (boot_cpu_data.x86 == 0x15)
+ amd_northbridges.flags |= AMD_NB_L3_INDEX_DISABLE;
+
+ /* L3 cache partitioning is supported on family 0x15 */
+ if (boot_cpu_data.x86 == 0x15)
+ amd_northbridges.flags |= AMD_NB_L3_PARTITIONING;
+
+ return 0;
+}
+
+/*
+ * Ignores subdevice/subvendor but as far as I can figure out
+ * they're useless anyways
+ */
+bool __init early_is_amd_nb(u32 device)
+{
+ const struct pci_device_id *id;
+ u32 vendor = device & 0xffff;
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
+ boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
+ return false;
+
+ if (cpu_feature_enabled(X86_FEATURE_ZEN))
+ return false;
+
+ device >>= 16;
+ for (id = amd_nb_misc_ids; id->vendor; id++)
+ if (vendor == id->vendor && device == id->device)
+ return true;
+ return false;
+}
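[Editor's note] Since early_is_amd_nb() takes the combined vendor/device dword (vendor in bits 15:0, device in bits 31:16), it pairs naturally with the early PCI accessors. A minimal sketch, assuming read_pci_config() from <asm/pci-direct.h>; scan_for_nb() itself is hypothetical:

	static void __init scan_for_nb(void)
	{
		int slot;

		/* Node devices sit on bus 0, slots 0x18-0x1f; function 3
		 * is the traditional northbridge "misc" function. */
		for (slot = 0x18; slot < 0x20; slot++) {
			u32 id = read_pci_config(0, slot, 3, 0x00);

			if (early_is_amd_nb(id))
				pr_info("NB misc device at 00:%02x.3\n", slot);
		}
	}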
+
+struct resource *amd_get_mmconfig_range(struct resource *res)
+{
+ u64 base, msr;
+ unsigned int segn_busn_bits;
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
+ boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
+ return NULL;
+
+ /* Assume CPUs from Fam10h have mmconfig, although not all VMs do */
+ if (boot_cpu_data.x86 < 0x10 ||
+ rdmsrq_safe(MSR_FAM10H_MMIO_CONF_BASE, &msr))
+ return NULL;
+
+ /* mmconfig is not enabled */
+ if (!(msr & FAM10H_MMIO_CONF_ENABLE))
+ return NULL;
+
+ base = msr & (FAM10H_MMIO_CONF_BASE_MASK<<FAM10H_MMIO_CONF_BASE_SHIFT);
+
+ segn_busn_bits = (msr >> FAM10H_MMIO_CONF_BUSRANGE_SHIFT) &
+ FAM10H_MMIO_CONF_BUSRANGE_MASK;
+
+ res->flags = IORESOURCE_MEM;
+ res->start = base;
+ res->end = base + (1ULL<<(segn_busn_bits + 20)) - 1;
+ return res;
+}
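[Editor's note] The window size falls directly out of the decoded bus-range field: ECAM assigns each bus 1 MB of config space (32 devices x 8 functions x 4 KB), so 2^n buses need 2^(n+20) bytes. A standalone arithmetic check; the field value 8 is just an example:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		unsigned int segn_busn_bits = 8;	/* 2^8 = 256 buses */
		uint64_t size = 1ULL << (segn_busn_bits + 20);

		/* 256 buses * 1 MB of ECAM space per bus = 256 MB */
		printf("mmconfig window: %llu MB\n",
		       (unsigned long long)(size >> 20));
		return 0;
	}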
+
+int amd_get_subcaches(int cpu)
+{
+ struct pci_dev *link = node_to_amd_nb(topology_amd_node_id(cpu))->link;
+ unsigned int mask;
+
+ if (!amd_nb_has_feature(AMD_NB_L3_PARTITIONING))
+ return 0;
+
+ pci_read_config_dword(link, 0x1d4, &mask);
+
+ return (mask >> (4 * cpu_data(cpu).topo.core_id)) & 0xf;
+}
+
+int amd_set_subcaches(int cpu, unsigned long mask)
+{
+ static unsigned int reset, ban;
+ struct amd_northbridge *nb = node_to_amd_nb(topology_amd_node_id(cpu));
+ unsigned int reg;
+ int cuid;
+
+ if (!amd_nb_has_feature(AMD_NB_L3_PARTITIONING) || mask > 0xf)
+ return -EINVAL;
+
+ /* if necessary, collect reset state of L3 partitioning and BAN mode */
+ if (reset == 0) {
+ pci_read_config_dword(nb->link, 0x1d4, &reset);
+ pci_read_config_dword(nb->misc, 0x1b8, &ban);
+ ban &= 0x180000;
+ }
+
+ /* deactivate BAN mode if any subcaches are to be disabled */
+ if (mask != 0xf) {
+ pci_read_config_dword(nb->misc, 0x1b8, &reg);
+ pci_write_config_dword(nb->misc, 0x1b8, reg & ~0x180000);
+ }
+
+ cuid = cpu_data(cpu).topo.core_id;
+ mask <<= 4 * cuid;
+ mask |= (0xf ^ (1 << cuid)) << 26;
+
+ pci_write_config_dword(nb->link, 0x1d4, mask);
+
+ /* reset BAN mode if L3 partitioning returned to reset state */
+ pci_read_config_dword(nb->link, 0x1d4, &reg);
+ if (reg == reset) {
+ pci_read_config_dword(nb->misc, 0x1b8, &reg);
+ reg &= ~0x180000;
+ pci_write_config_dword(nb->misc, 0x1b8, reg | ban);
+ }
+
+ return 0;
+}
+
+static void amd_cache_gart(void)
+{
+ u16 i;
+
+ if (!amd_nb_has_feature(AMD_NB_GART))
+ return;
+
+ flush_words = kmalloc_array(amd_northbridges.num, sizeof(u32), GFP_KERNEL);
+ if (!flush_words) {
+ amd_northbridges.flags &= ~AMD_NB_GART;
+ pr_notice("Cannot initialize GART flush words, GART support disabled\n");
+ return;
+ }
+
+ for (i = 0; i != amd_northbridges.num; i++)
+ pci_read_config_dword(node_to_amd_nb(i)->misc, 0x9c, &flush_words[i]);
+}
+
+void amd_flush_garts(void)
+{
+ int flushed, i;
+ unsigned long flags;
+ static DEFINE_SPINLOCK(gart_lock);
+
+ if (!amd_nb_has_feature(AMD_NB_GART))
+ return;
+
+ /*
+ * Avoid races between AGP and IOMMU. In theory it's not needed
+ * but I'm not sure if the hardware won't lose flush requests
+ * when another is pending. This whole thing is so expensive anyways
+ * that it doesn't matter to serialize more. -AK
+ */
+ spin_lock_irqsave(&gart_lock, flags);
+ flushed = 0;
+ for (i = 0; i < amd_northbridges.num; i++) {
+ pci_write_config_dword(node_to_amd_nb(i)->misc, 0x9c,
+ flush_words[i] | 1);
+ flushed++;
+ }
+ for (i = 0; i < amd_northbridges.num; i++) {
+ u32 w;
+		/* Make sure the hardware actually executed the flush */
+ for (;;) {
+ pci_read_config_dword(node_to_amd_nb(i)->misc,
+ 0x9c, &w);
+ if (!(w & 1))
+ break;
+ cpu_relax();
+ }
+ }
+ spin_unlock_irqrestore(&gart_lock, flags);
+ if (!flushed)
+ pr_notice("nothing to flush?\n");
+}
+EXPORT_SYMBOL_GPL(amd_flush_garts);
+
+static void __fix_erratum_688(void *info)
+{
+#define MSR_AMD64_IC_CFG 0xC0011021
+
+ msr_set_bit(MSR_AMD64_IC_CFG, 3);
+ msr_set_bit(MSR_AMD64_IC_CFG, 14);
+}
+
+/* Apply erratum 688 fix so machines without a BIOS fix work. */
+static __init void fix_erratum_688(void)
+{
+ struct pci_dev *F4;
+ u32 val;
+
+ if (boot_cpu_data.x86 != 0x14)
+ return;
+
+ if (!amd_northbridges.num)
+ return;
+
+ F4 = node_to_amd_nb(0)->link;
+ if (!F4)
+ return;
+
+ if (pci_read_config_dword(F4, 0x164, &val))
+ return;
+
+ if (val & BIT(2))
+ return;
+
+ on_each_cpu(__fix_erratum_688, NULL, 0);
+
+ pr_info("x86/cpu/AMD: CPU erratum 688 worked around\n");
+}
+
+static __init int init_amd_nbs(void)
+{
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
+ boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
+ return 0;
+
+ amd_cache_northbridges();
+ amd_cache_gart();
+
+ fix_erratum_688();
+
+ return 0;
+}
+
+/* This has to go after the PCI subsystem */
+fs_initcall(init_amd_nbs);
diff --git a/arch/x86/kernel/amd_node.c b/arch/x86/kernel/amd_node.c
new file mode 100644
index 000000000000..3d0a4768d603
--- /dev/null
+++ b/arch/x86/kernel/amd_node.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * AMD Node helper functions and common defines
+ *
+ * Copyright (c) 2024, Advanced Micro Devices, Inc.
+ * All Rights Reserved.
+ *
+ * Author: Yazen Ghannam <Yazen.Ghannam@amd.com>
+ */
+
+#include <linux/debugfs.h>
+#include <asm/amd/node.h>
+
+/*
+ * AMD Nodes are a physical collection of I/O devices within an SoC. There can be one
+ * or more nodes per package.
+ *
+ * The nodes are software-visible through PCI config space. All nodes are enumerated
+ * on segment 0 bus 0. The device (slot) numbers range from 0x18 to 0x1F (maximum 8
+ * nodes) with 0x18 corresponding to node 0, 0x19 to node 1, etc. Each node can be a
+ * multi-function device.
+ *
+ * On legacy systems, these node devices represent integrated Northbridge functionality.
+ * On Zen-based systems, these node devices represent Data Fabric functionality.
+ *
+ * See "Configuration Space Accesses" section in BKDGs or
+ * "Processor x86 Core" -> "Configuration Space" section in PPRs.
+ */
+struct pci_dev *amd_node_get_func(u16 node, u8 func)
+{
+ if (node >= MAX_AMD_NUM_NODES)
+ return NULL;
+
+ return pci_get_domain_bus_and_slot(0, 0, PCI_DEVFN(AMD_NODE0_PCI_SLOT + node, func));
+}
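[Editor's note] A hypothetical caller, to show the reference-counting contract: pci_get_domain_bus_and_slot() hands back a counted reference, so users of amd_node_get_func() must drop it with pci_dev_put(). The function name and register offset below are illustrative only:

	static int example_read_node0(void)
	{
		struct pci_dev *df_f3 = amd_node_get_func(0, 3);
		u32 val;
		int err;

		if (!df_f3)
			return -ENODEV;

		/* 0x44 is a made-up offset, used only for illustration */
		err = pci_read_config_dword(df_f3, 0x44, &val);
		pci_dev_put(df_f3);

		return err;
	}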
+
+static struct pci_dev **amd_roots;
+
+/* Protect the PCI config register pairs used for SMN. */
+static DEFINE_MUTEX(smn_mutex);
+static bool smn_exclusive;
+
+#define SMN_INDEX_OFFSET 0x60
+#define SMN_DATA_OFFSET 0x64
+
+#define HSMP_INDEX_OFFSET 0xc4
+#define HSMP_DATA_OFFSET 0xc8
+
+/*
+ * SMN accesses may fail in ways that are difficult to detect here in the called
+ * functions amd_smn_read() and amd_smn_write(). Therefore, callers must do
+ * their own checking based on what behavior they expect.
+ *
+ * For SMN reads, the returned value may be zero if the register is Read-as-Zero.
+ * Or it may be a "PCI Error Response", e.g. all 0xFFs. The "PCI Error Response"
+ * can be checked here, and a proper error code can be returned.
+ *
+ * But the Read-as-Zero response cannot be verified here. A value of 0 may be
+ * correct in some cases, so callers must check that this is correct for the
+ * register/fields they need.
+ *
+ * For SMN writes, success can be determined through a "write and read back".
+ * However, this is not robust when done here.
+ *
+ * Possible issues:
+ *
+ * 1) Bits that are "Write-1-to-Clear". In this case, the read value should
+ * *not* match the write value.
+ *
+ * 2) Bits that are "Read-as-Zero"/"Writes-Ignored". This information cannot be
+ * known here.
+ *
+ * 3) Bits that are "Reserved / Set to 1". Ditto above.
+ *
+ * Callers of amd_smn_write() should do the "write and read back" check
+ * themselves, if needed.
+ *
+ * For #1, they can see if their target bits got cleared.
+ *
+ * For #2 and #3, they can check if their target bits got set as intended.
+ *
+ * This matches what is done for RDMSR/WRMSR. As long as there's no #GP, then
+ * the operation is considered a success, and the caller does their own
+ * checking.
+ */
+static int __amd_smn_rw(u8 i_off, u8 d_off, u16 node, u32 address, u32 *value, bool write)
+{
+ struct pci_dev *root;
+ int err = -ENODEV;
+
+ if (node >= amd_num_nodes())
+ return err;
+
+ root = amd_roots[node];
+ if (!root)
+ return err;
+
+ if (!smn_exclusive)
+ return err;
+
+ guard(mutex)(&smn_mutex);
+
+ err = pci_write_config_dword(root, i_off, address);
+ if (err) {
+ pr_warn("Error programming SMN address 0x%x.\n", address);
+ return pcibios_err_to_errno(err);
+ }
+
+ err = (write ? pci_write_config_dword(root, d_off, *value)
+ : pci_read_config_dword(root, d_off, value));
+
+ return pcibios_err_to_errno(err);
+}
+
+int __must_check amd_smn_read(u16 node, u32 address, u32 *value)
+{
+ int err = __amd_smn_rw(SMN_INDEX_OFFSET, SMN_DATA_OFFSET, node, address, value, false);
+
+ if (PCI_POSSIBLE_ERROR(*value)) {
+ err = -ENODEV;
+ *value = 0;
+ }
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(amd_smn_read);
+
+int __must_check amd_smn_write(u16 node, u32 address, u32 value)
+{
+ return __amd_smn_rw(SMN_INDEX_OFFSET, SMN_DATA_OFFSET, node, address, &value, true);
+}
+EXPORT_SYMBOL_GPL(amd_smn_write);
+
+int __must_check amd_smn_hsmp_rdwr(u16 node, u32 address, u32 *value, bool write)
+{
+ return __amd_smn_rw(HSMP_INDEX_OFFSET, HSMP_DATA_OFFSET, node, address, value, write);
+}
+EXPORT_SYMBOL_GPL(amd_smn_hsmp_rdwr);
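[Editor's note] Per the comment block above, write verification is the caller's job. A minimal sketch of the "write and read back" pattern for a hypothetical register whose target bits are plain read/write (it would be wrong for W1C or read-as-zero fields); smn_write_checked() is not part of this patch:

	static int smn_write_checked(u16 node, u32 addr, u32 mask, u32 val)
	{
		u32 readback;
		int err;

		err = amd_smn_write(node, addr, val);
		if (err)
			return err;

		err = amd_smn_read(node, addr, &readback);
		if (err)
			return err;

		/* confirm the bits we care about actually stuck */
		return (readback & mask) == (val & mask) ? 0 : -EIO;
	}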
+
+static struct dentry *debugfs_dir;
+static u16 debug_node;
+static u32 debug_address;
+
+static ssize_t smn_node_write(struct file *file, const char __user *userbuf,
+ size_t count, loff_t *ppos)
+{
+ u16 node;
+ int ret;
+
+ ret = kstrtou16_from_user(userbuf, count, 0, &node);
+ if (ret)
+ return ret;
+
+ if (node >= amd_num_nodes())
+ return -ENODEV;
+
+ debug_node = node;
+ return count;
+}
+
+static int smn_node_show(struct seq_file *m, void *v)
+{
+ seq_printf(m, "0x%08x\n", debug_node);
+ return 0;
+}
+
+static ssize_t smn_address_write(struct file *file, const char __user *userbuf,
+ size_t count, loff_t *ppos)
+{
+ int ret;
+
+ ret = kstrtouint_from_user(userbuf, count, 0, &debug_address);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
+static int smn_address_show(struct seq_file *m, void *v)
+{
+ seq_printf(m, "0x%08x\n", debug_address);
+ return 0;
+}
+
+static int smn_value_show(struct seq_file *m, void *v)
+{
+ u32 val;
+ int ret;
+
+ ret = amd_smn_read(debug_node, debug_address, &val);
+ if (ret)
+ return ret;
+
+ seq_printf(m, "0x%08x\n", val);
+ return 0;
+}
+
+static ssize_t smn_value_write(struct file *file, const char __user *userbuf,
+ size_t count, loff_t *ppos)
+{
+ u32 val;
+ int ret;
+
+ ret = kstrtouint_from_user(userbuf, count, 0, &val);
+ if (ret)
+ return ret;
+
+ add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
+
+ ret = amd_smn_write(debug_node, debug_address, val);
+ if (ret)
+ return ret;
+
+ return count;
+}
+
+DEFINE_SHOW_STORE_ATTRIBUTE(smn_node);
+DEFINE_SHOW_STORE_ATTRIBUTE(smn_address);
+DEFINE_SHOW_STORE_ATTRIBUTE(smn_value);
+
+static struct pci_dev *get_next_root(struct pci_dev *root)
+{
+ while ((root = pci_get_class(PCI_CLASS_BRIDGE_HOST << 8, root))) {
+ /* Root device is Device 0 Function 0. */
+ if (root->devfn)
+ continue;
+
+ if (root->vendor != PCI_VENDOR_ID_AMD &&
+ root->vendor != PCI_VENDOR_ID_HYGON)
+ continue;
+
+ break;
+ }
+
+ return root;
+}
+
+static bool enable_dfs;
+
+static int __init amd_smn_enable_dfs(char *str)
+{
+ enable_dfs = true;
+ return 1;
+}
+__setup("amd_smn_debugfs_enable", amd_smn_enable_dfs);
+
+static int __init amd_smn_init(void)
+{
+ u16 count, num_roots, roots_per_node, node, num_nodes;
+ struct pci_dev *root;
+
+ if (!cpu_feature_enabled(X86_FEATURE_ZEN))
+ return 0;
+
+ guard(mutex)(&smn_mutex);
+
+ if (amd_roots)
+ return 0;
+
+ num_roots = 0;
+ root = NULL;
+ while ((root = get_next_root(root))) {
+ pci_dbg(root, "Reserving PCI config space\n");
+
+ /*
+ * There are a few SMN index/data pairs and other registers
+ * that shouldn't be accessed by user space. So reserve the
+ * entire PCI config space for simplicity rather than covering
+ * specific registers piecemeal.
+ */
+ if (!pci_request_config_region_exclusive(root, 0, PCI_CFG_SPACE_SIZE, NULL)) {
+ pci_err(root, "Failed to reserve config space\n");
+ return -EEXIST;
+ }
+
+ num_roots++;
+ }
+
+ pr_debug("Found %d AMD root devices\n", num_roots);
+
+ if (!num_roots)
+ return -ENODEV;
+
+ num_nodes = amd_num_nodes();
+ amd_roots = kcalloc(num_nodes, sizeof(*amd_roots), GFP_KERNEL);
+ if (!amd_roots)
+ return -ENOMEM;
+
+ roots_per_node = num_roots / num_nodes;
+
+ count = 0;
+ node = 0;
+ root = NULL;
+ while (node < num_nodes && (root = get_next_root(root))) {
+ /* Use one root for each node and skip the rest. */
+ if (count++ % roots_per_node)
+ continue;
+
+ pci_dbg(root, "is root for AMD node %u\n", node);
+ amd_roots[node++] = root;
+ }
+
+ if (enable_dfs) {
+ debugfs_dir = debugfs_create_dir("amd_smn", arch_debugfs_dir);
+
+ debugfs_create_file("node", 0600, debugfs_dir, NULL, &smn_node_fops);
+ debugfs_create_file("address", 0600, debugfs_dir, NULL, &smn_address_fops);
+ debugfs_create_file("value", 0600, debugfs_dir, NULL, &smn_value_fops);
+ }
+
+ smn_exclusive = true;
+
+ return 0;
+}
+
+fs_initcall(amd_smn_init);
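[Editor's note] To exercise the interface above from user space: boot with amd_smn_debugfs_enable, then drive the three files under /sys/kernel/debug/x86/amd_smn (arch_debugfs_dir maps to the x86 directory). A hedged sketch; the SMN address 0x50000 is a placeholder, not a documented register:

	#include <stdio.h>

	static int write_file(const char *path, const char *s)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fputs(s, f);
		return fclose(f);
	}

	int main(void)
	{
		unsigned int val;
		FILE *f;

		if (write_file("/sys/kernel/debug/x86/amd_smn/node", "0\n") ||
		    write_file("/sys/kernel/debug/x86/amd_smn/address", "0x50000\n"))
			return 1;

		f = fopen("/sys/kernel/debug/x86/amd_smn/value", "r");
		if (!f)
			return 1;
		if (fscanf(f, "%x", &val) == 1)
			printf("SMN[0x50000] = 0x%08x\n", val);
		fclose(f);
		return 0;
	}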
diff --git a/arch/x86/kernel/apb_timer.c b/arch/x86/kernel/apb_timer.c
deleted file mode 100644
index a35347501d36..000000000000
--- a/arch/x86/kernel/apb_timer.c
+++ /dev/null
@@ -1,785 +0,0 @@
-/*
- * apb_timer.c: Driver for Langwell APB timers
- *
- * (C) Copyright 2009 Intel Corporation
- * Author: Jacob Pan (jacob.jun.pan@intel.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; version 2
- * of the License.
- *
- * Note:
- * Langwell is the south complex of Intel Moorestown MID platform. There are
- * eight external timers in total that can be used by the operating system.
- * The timer information, such as frequency and addresses, is provided to the
- * OS via SFI tables.
- * Timer interrupts are routed via FW/HW emulated IOAPIC independently via
- * individual redirection table entries (RTE).
- * Unlike HPET, there is no master counter; therefore one of the timers is
- * used as the clocksource. The overall allocation looks like:
- *  - timers 0 .. NR_CPUs-1 for per cpu timers
- *  - one timer for the clocksource
- *  - one timer for the watchdog driver.
- * It is also worth noting that the APB timer does not support true one-shot
- * mode; free-running mode is used here to emulate it.
- * The APB timer can also be used as a broadcast timer along with the per cpu
- * local APIC timer, but by default it has a higher rating than the local
- * APIC timers.
- */
-
-#include <linux/clocksource.h>
-#include <linux/clockchips.h>
-#include <linux/delay.h>
-#include <linux/errno.h>
-#include <linux/init.h>
-#include <linux/sysdev.h>
-#include <linux/slab.h>
-#include <linux/pm.h>
-#include <linux/pci.h>
-#include <linux/sfi.h>
-#include <linux/interrupt.h>
-#include <linux/cpu.h>
-#include <linux/irq.h>
-
-#include <asm/fixmap.h>
-#include <asm/apb_timer.h>
-
-#define APBT_MASK CLOCKSOURCE_MASK(32)
-#define APBT_SHIFT 22
-#define APBT_CLOCKEVENT_RATING 150
-#define APBT_CLOCKSOURCE_RATING 250
-#define APBT_MIN_DELTA_USEC 200
-
-#define EVT_TO_APBT_DEV(evt) container_of(evt, struct apbt_dev, evt)
-#define APBT_CLOCKEVENT0_NUM (0)
-#define APBT_CLOCKEVENT1_NUM (1)
-#define APBT_CLOCKSOURCE_NUM (2)
-
-static unsigned long apbt_address;
-static int apb_timer_block_enabled;
-static void __iomem *apbt_virt_address;
-static int phy_cs_timer_id;
-
-/*
- * Common DW APB timer info
- */
-static uint64_t apbt_freq;
-
-static void apbt_set_mode(enum clock_event_mode mode,
- struct clock_event_device *evt);
-static int apbt_next_event(unsigned long delta,
- struct clock_event_device *evt);
-static cycle_t apbt_read_clocksource(struct clocksource *cs);
-static void apbt_restart_clocksource(struct clocksource *cs);
-
-struct apbt_dev {
- struct clock_event_device evt;
- unsigned int num;
- int cpu;
- unsigned int irq;
- unsigned int tick;
- unsigned int count;
- unsigned int flags;
- char name[10];
-};
-
-int disable_apbt_percpu __cpuinitdata;
-
-static DEFINE_PER_CPU(struct apbt_dev, cpu_apbt_dev);
-
-#ifdef CONFIG_SMP
-static unsigned int apbt_num_timers_used;
-static struct apbt_dev *apbt_devs;
-#endif
-
-static inline unsigned long apbt_readl_reg(unsigned long a)
-{
- return readl(apbt_virt_address + a);
-}
-
-static inline void apbt_writel_reg(unsigned long d, unsigned long a)
-{
- writel(d, apbt_virt_address + a);
-}
-
-static inline unsigned long apbt_readl(int n, unsigned long a)
-{
- return readl(apbt_virt_address + a + n * APBTMRS_REG_SIZE);
-}
-
-static inline void apbt_writel(int n, unsigned long d, unsigned long a)
-{
- writel(d, apbt_virt_address + a + n * APBTMRS_REG_SIZE);
-}
-
-static inline void apbt_set_mapping(void)
-{
- struct sfi_timer_table_entry *mtmr;
-
- if (apbt_virt_address) {
- pr_debug("APBT base already mapped\n");
- return;
- }
- mtmr = sfi_get_mtmr(APBT_CLOCKEVENT0_NUM);
- if (mtmr == NULL) {
- printk(KERN_ERR "Failed to get MTMR %d from SFI\n",
- APBT_CLOCKEVENT0_NUM);
- return;
- }
- apbt_address = (unsigned long)mtmr->phys_addr;
- if (!apbt_address) {
- printk(KERN_WARNING "No timer base from SFI, use default\n");
- apbt_address = APBT_DEFAULT_BASE;
- }
- apbt_virt_address = ioremap_nocache(apbt_address, APBT_MMAP_SIZE);
- if (apbt_virt_address) {
- pr_debug("Mapped APBT physical addr %p at virtual addr %p\n",\
- (void *)apbt_address, (void *)apbt_virt_address);
- } else {
- pr_debug("Failed mapping APBT phy address at %p\n",\
- (void *)apbt_address);
- goto panic_noapbt;
- }
- apbt_freq = mtmr->freq_hz / USEC_PER_SEC;
- sfi_free_mtmr(mtmr);
-
- /* Now figure out the physical timer id for clocksource device */
- mtmr = sfi_get_mtmr(APBT_CLOCKSOURCE_NUM);
- if (mtmr == NULL)
- goto panic_noapbt;
-
- /* Now figure out the physical timer id */
- phy_cs_timer_id = (unsigned int)(mtmr->phys_addr & 0xff)
- / APBTMRS_REG_SIZE;
- pr_debug("Use timer %d for clocksource\n", phy_cs_timer_id);
- return;
-
-panic_noapbt:
- panic("Failed to setup APB system timer\n");
-
-}
-
-static inline void apbt_clear_mapping(void)
-{
- iounmap(apbt_virt_address);
- apbt_virt_address = NULL;
-}
-
-/*
- * APBT timer interrupt enable / disable
- */
-static inline int is_apbt_capable(void)
-{
- return apbt_virt_address ? 1 : 0;
-}
-
-static struct clocksource clocksource_apbt = {
- .name = "apbt",
- .rating = APBT_CLOCKSOURCE_RATING,
- .read = apbt_read_clocksource,
- .mask = APBT_MASK,
- .shift = APBT_SHIFT,
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
- .resume = apbt_restart_clocksource,
-};
-
-/* boot APB clock event device */
-static struct clock_event_device apbt_clockevent = {
- .name = "apbt0",
- .features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,
- .set_mode = apbt_set_mode,
- .set_next_event = apbt_next_event,
- .shift = APBT_SHIFT,
- .irq = 0,
- .rating = APBT_CLOCKEVENT_RATING,
-};
-
-/*
- * If the user does not want to use the per CPU APB timer, just give it a
- * lower rating than the local APIC timer and skip the late per cpu timer init.
- */
-static inline int __init setup_x86_mrst_timer(char *arg)
-{
- if (!arg)
- return -EINVAL;
-
- if (strcmp("apbt_only", arg) == 0)
- disable_apbt_percpu = 0;
- else if (strcmp("lapic_and_apbt", arg) == 0)
- disable_apbt_percpu = 1;
- else {
- pr_warning("X86 MRST timer option %s not recognised"
- " use x86_mrst_timer=apbt_only or lapic_and_apbt\n",
- arg);
- return -EINVAL;
- }
- return 0;
-}
-__setup("x86_mrst_timer=", setup_x86_mrst_timer);
-
-/*
- * start counting down from 0xffff_ffff. this is done by toggling the enable
- * bit, then loading the initial load count of ~0.
- */
-static void apbt_start_counter(int n)
-{
- unsigned long ctrl = apbt_readl(n, APBTMR_N_CONTROL);
-
- ctrl &= ~APBTMR_CONTROL_ENABLE;
- apbt_writel(n, ctrl, APBTMR_N_CONTROL);
- apbt_writel(n, ~0, APBTMR_N_LOAD_COUNT);
- /* enable, mask interrupt */
- ctrl &= ~APBTMR_CONTROL_MODE_PERIODIC;
- ctrl |= (APBTMR_CONTROL_ENABLE | APBTMR_CONTROL_INT);
- apbt_writel(n, ctrl, APBTMR_N_CONTROL);
- /* read it once to get cached counter value initialized */
- apbt_read_clocksource(&clocksource_apbt);
-}
-
-static irqreturn_t apbt_interrupt_handler(int irq, void *data)
-{
- struct apbt_dev *dev = (struct apbt_dev *)data;
- struct clock_event_device *aevt = &dev->evt;
-
- if (!aevt->event_handler) {
- printk(KERN_INFO "Spurious APBT timer interrupt on %d\n",
- dev->num);
- return IRQ_NONE;
- }
- aevt->event_handler(aevt);
- return IRQ_HANDLED;
-}
-
-static void apbt_restart_clocksource(struct clocksource *cs)
-{
- apbt_start_counter(phy_cs_timer_id);
-}
-
-/* Setup IRQ routing via IOAPIC */
-#ifdef CONFIG_SMP
-static void apbt_setup_irq(struct apbt_dev *adev)
-{
- struct irq_chip *chip;
- struct irq_desc *desc;
-
- /* timer0 irq has been setup early */
- if (adev->irq == 0)
- return;
- desc = irq_to_desc(adev->irq);
- chip = get_irq_chip(adev->irq);
- disable_irq(adev->irq);
- desc->status |= IRQ_MOVE_PCNTXT;
- irq_set_affinity(adev->irq, cpumask_of(adev->cpu));
-	/* APB timer irqs are set up as mp_irqs, timer is edge triggered */
- set_irq_chip_and_handler_name(adev->irq, chip, handle_edge_irq, "edge");
- enable_irq(adev->irq);
- if (system_state == SYSTEM_BOOTING)
- if (request_irq(adev->irq, apbt_interrupt_handler,
- IRQF_TIMER | IRQF_DISABLED | IRQF_NOBALANCING,
- adev->name, adev)) {
- printk(KERN_ERR "Failed request IRQ for APBT%d\n",
- adev->num);
- }
-}
-#endif
-
-static void apbt_enable_int(int n)
-{
- unsigned long ctrl = apbt_readl(n, APBTMR_N_CONTROL);
- /* clear pending intr */
- apbt_readl(n, APBTMR_N_EOI);
- ctrl &= ~APBTMR_CONTROL_INT;
- apbt_writel(n, ctrl, APBTMR_N_CONTROL);
-}
-
-static void apbt_disable_int(int n)
-{
- unsigned long ctrl = apbt_readl(n, APBTMR_N_CONTROL);
-
- ctrl |= APBTMR_CONTROL_INT;
- apbt_writel(n, ctrl, APBTMR_N_CONTROL);
-}
-
-
-static int __init apbt_clockevent_register(void)
-{
- struct sfi_timer_table_entry *mtmr;
- struct apbt_dev *adev = &__get_cpu_var(cpu_apbt_dev);
-
- mtmr = sfi_get_mtmr(APBT_CLOCKEVENT0_NUM);
- if (mtmr == NULL) {
- printk(KERN_ERR "Failed to get MTMR %d from SFI\n",
- APBT_CLOCKEVENT0_NUM);
- return -ENODEV;
- }
-
- /*
- * We need to calculate the scaled math multiplication factor for
- * nanosecond to apbt tick conversion.
- * mult = (nsec/cycle)*2^APBT_SHIFT
- */
- apbt_clockevent.mult = div_sc((unsigned long) mtmr->freq_hz
- , NSEC_PER_SEC, APBT_SHIFT);
-
- /* Calculate the min / max delta */
- apbt_clockevent.max_delta_ns = clockevent_delta2ns(0x7FFFFFFF,
- &apbt_clockevent);
- apbt_clockevent.min_delta_ns = clockevent_delta2ns(
- APBT_MIN_DELTA_USEC*apbt_freq,
- &apbt_clockevent);
- /*
- * Start apbt with the boot cpu mask and make it
- * global if not used for per cpu timer.
- */
- apbt_clockevent.cpumask = cpumask_of(smp_processor_id());
- adev->num = smp_processor_id();
- memcpy(&adev->evt, &apbt_clockevent, sizeof(struct clock_event_device));
-
- if (disable_apbt_percpu) {
- apbt_clockevent.rating = APBT_CLOCKEVENT_RATING - 100;
- global_clock_event = &adev->evt;
- printk(KERN_DEBUG "%s clockevent registered as global\n",
- global_clock_event->name);
- }
-
- if (request_irq(apbt_clockevent.irq, apbt_interrupt_handler,
- IRQF_TIMER | IRQF_DISABLED | IRQF_NOBALANCING,
- apbt_clockevent.name, adev)) {
- printk(KERN_ERR "Failed request IRQ for APBT%d\n",
- apbt_clockevent.irq);
- }
-
- clockevents_register_device(&adev->evt);
- /* Start APBT 0 interrupts */
- apbt_enable_int(APBT_CLOCKEVENT0_NUM);
-
- sfi_free_mtmr(mtmr);
- return 0;
-}
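[Editor's note] For intuition about the mult/shift scaled math used above: mult is the tick rate scaled by 2^APBT_SHIFT, so the nanosecond-to-tick conversion becomes a multiply and a shift with no division on the hot path. A standalone check, assuming a hypothetical 25 MHz timer clock:

	#include <stdio.h>
	#include <stdint.h>

	#define NSEC_PER_SEC	1000000000ULL
	#define APBT_SHIFT	22

	int main(void)
	{
		uint64_t freq_hz = 25000000;	/* hypothetical 25 MHz clock */

		/* the same factor div_sc() computes in the driver */
		uint64_t mult = (freq_hz << APBT_SHIFT) / NSEC_PER_SEC;

		/* a 1 ms period converted to timer counts, as done for
		 * CLOCK_EVT_MODE_PERIODIC in apbt_set_mode() */
		uint64_t ticks = (1000000ULL * mult) >> APBT_SHIFT;

		printf("mult=%llu, 1ms = %llu ticks (expect ~25000)\n",
		       (unsigned long long)mult, (unsigned long long)ticks);
		return 0;
	}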
-
-#ifdef CONFIG_SMP
-/* Should be called on each secondary CPU, with that CPU's per cpu data */
-void apbt_setup_secondary_clock(void)
-{
- struct apbt_dev *adev;
- struct clock_event_device *aevt;
- int cpu;
-
- /* Don't register boot CPU clockevent */
- cpu = smp_processor_id();
- if (cpu == boot_cpu_id)
- return;
- /*
- * We need to calculate the scaled math multiplication factor for
- * nanosecond to apbt tick conversion.
- * mult = (nsec/cycle)*2^APBT_SHIFT
- */
- printk(KERN_INFO "Init per CPU clockevent %d\n", cpu);
- adev = &per_cpu(cpu_apbt_dev, cpu);
- aevt = &adev->evt;
-
- memcpy(aevt, &apbt_clockevent, sizeof(*aevt));
- aevt->cpumask = cpumask_of(cpu);
- aevt->name = adev->name;
- aevt->mode = CLOCK_EVT_MODE_UNUSED;
-
- printk(KERN_INFO "Registering CPU %d clockevent device %s, mask %08x\n",
- cpu, aevt->name, *(u32 *)aevt->cpumask);
-
- apbt_setup_irq(adev);
-
- clockevents_register_device(aevt);
-
- apbt_enable_int(cpu);
-
- return;
-}
-
-/*
- * This notifier handles CPU hotplug events. In case of S0i3, non-boot CPUs
- * are disabled/enabled frequently; for performance reasons, we keep the per
- * cpu timer irq registered so that we do not need to do free_irq/request_irq.
- *
- * TODO: it might be more reliable to directly disable percpu clockevent device
- * without the notifier chain. currently, cpu 0 may get interrupts from other
- * cpu timers during the offline process due to the ordering of notification.
- * the extra interrupt is harmless.
- */
-static int apbt_cpuhp_notify(struct notifier_block *n,
- unsigned long action, void *hcpu)
-{
- unsigned long cpu = (unsigned long)hcpu;
- struct apbt_dev *adev = &per_cpu(cpu_apbt_dev, cpu);
-
- switch (action & 0xf) {
- case CPU_DEAD:
- apbt_disable_int(cpu);
- if (system_state == SYSTEM_RUNNING)
- pr_debug("skipping APBT CPU %lu offline\n", cpu);
- else if (adev) {
- pr_debug("APBT clockevent for cpu %lu offline\n", cpu);
- free_irq(adev->irq, adev);
- }
- break;
- default:
- pr_debug(KERN_INFO "APBT notified %lu, no action\n", action);
- }
- return NOTIFY_OK;
-}
-
-static __init int apbt_late_init(void)
-{
- if (disable_apbt_percpu || !apb_timer_block_enabled)
- return 0;
- /* This notifier should be called after workqueue is ready */
- hotcpu_notifier(apbt_cpuhp_notify, -20);
- return 0;
-}
-fs_initcall(apbt_late_init);
-#else
-
-void apbt_setup_secondary_clock(void) {}
-
-#endif /* CONFIG_SMP */
-
-static void apbt_set_mode(enum clock_event_mode mode,
- struct clock_event_device *evt)
-{
- unsigned long ctrl;
- uint64_t delta;
- int timer_num;
- struct apbt_dev *adev = EVT_TO_APBT_DEV(evt);
-
- timer_num = adev->num;
- pr_debug("%s CPU %d timer %d mode=%d\n",
- __func__, first_cpu(*evt->cpumask), timer_num, mode);
-
- switch (mode) {
- case CLOCK_EVT_MODE_PERIODIC:
- delta = ((uint64_t)(NSEC_PER_SEC/HZ)) * apbt_clockevent.mult;
- delta >>= apbt_clockevent.shift;
- ctrl = apbt_readl(timer_num, APBTMR_N_CONTROL);
- ctrl |= APBTMR_CONTROL_MODE_PERIODIC;
- apbt_writel(timer_num, ctrl, APBTMR_N_CONTROL);
- /*
-		 * per DW APB databook p. 46, the timer has to be disabled
-		 * before loading the counter, or it may cause a sync problem.
- */
- ctrl &= ~APBTMR_CONTROL_ENABLE;
- apbt_writel(timer_num, ctrl, APBTMR_N_CONTROL);
- udelay(1);
- pr_debug("Setting clock period %d for HZ %d\n", (int)delta, HZ);
- apbt_writel(timer_num, delta, APBTMR_N_LOAD_COUNT);
- ctrl |= APBTMR_CONTROL_ENABLE;
- apbt_writel(timer_num, ctrl, APBTMR_N_CONTROL);
- break;
- /* APB timer does not have one-shot mode, use free running mode */
- case CLOCK_EVT_MODE_ONESHOT:
- ctrl = apbt_readl(timer_num, APBTMR_N_CONTROL);
- /*
-		 * set free running mode: the timer reloads the maximum
-		 * timeout, which gives enough time (~3min on a 25MHz clock)
-		 * to rearm the next event, thereby emulating one-shot mode.
- */
- ctrl &= ~APBTMR_CONTROL_ENABLE;
- ctrl &= ~APBTMR_CONTROL_MODE_PERIODIC;
-
- apbt_writel(timer_num, ctrl, APBTMR_N_CONTROL);
- /* write again to set free running mode */
- apbt_writel(timer_num, ctrl, APBTMR_N_CONTROL);
-
- /*
- * DW APB p. 46, load counter with all 1s before starting free
- * running mode.
- */
- apbt_writel(timer_num, ~0, APBTMR_N_LOAD_COUNT);
- ctrl &= ~APBTMR_CONTROL_INT;
- ctrl |= APBTMR_CONTROL_ENABLE;
- apbt_writel(timer_num, ctrl, APBTMR_N_CONTROL);
- break;
-
- case CLOCK_EVT_MODE_UNUSED:
- case CLOCK_EVT_MODE_SHUTDOWN:
- apbt_disable_int(timer_num);
- ctrl = apbt_readl(timer_num, APBTMR_N_CONTROL);
- ctrl &= ~APBTMR_CONTROL_ENABLE;
- apbt_writel(timer_num, ctrl, APBTMR_N_CONTROL);
- break;
-
- case CLOCK_EVT_MODE_RESUME:
- apbt_enable_int(timer_num);
- break;
- }
-}
-
-static int apbt_next_event(unsigned long delta,
- struct clock_event_device *evt)
-{
- unsigned long ctrl;
- int timer_num;
-
- struct apbt_dev *adev = EVT_TO_APBT_DEV(evt);
-
- timer_num = adev->num;
- /* Disable timer */
- ctrl = apbt_readl(timer_num, APBTMR_N_CONTROL);
- ctrl &= ~APBTMR_CONTROL_ENABLE;
- apbt_writel(timer_num, ctrl, APBTMR_N_CONTROL);
- /* write new count */
- apbt_writel(timer_num, delta, APBTMR_N_LOAD_COUNT);
- ctrl |= APBTMR_CONTROL_ENABLE;
- apbt_writel(timer_num, ctrl, APBTMR_N_CONTROL);
- return 0;
-}
-
-/*
- * The APB timer clock is not in sync with pclk on Langwell, which translates
- * to unreliable read values caused by sampling error. The error does not add
- * up over time and only happens when sampling a 0 as a 1 by mistake, so the
- * time would go backwards. The following code tries to prevent time from
- * traveling backwards. A little bit paranoid.
- */
-static cycle_t apbt_read_clocksource(struct clocksource *cs)
-{
- unsigned long t0, t1, t2;
- static unsigned long last_read;
-
-bad_count:
- t1 = apbt_readl(phy_cs_timer_id,
- APBTMR_N_CURRENT_VALUE);
- t2 = apbt_readl(phy_cs_timer_id,
- APBTMR_N_CURRENT_VALUE);
- if (unlikely(t1 < t2)) {
- pr_debug("APBT: read current count error %lx:%lx:%lx\n",
- t1, t2, t2 - t1);
- goto bad_count;
- }
- /*
-	 * check against the cached last read to make sure time does not go
-	 * back. it could be a normal rollover but we will triple check anyway
- */
- if (unlikely(t2 > last_read)) {
- /* check if we have a normal rollover */
- unsigned long raw_intr_status =
- apbt_readl_reg(APBTMRS_RAW_INT_STATUS);
- /*
- * cs timer interrupt is masked but raw intr bit is set if
- * rollover occurs. then we read EOI reg to clear it.
- */
- if (raw_intr_status & (1 << phy_cs_timer_id)) {
- apbt_readl(phy_cs_timer_id, APBTMR_N_EOI);
- goto out;
- }
- pr_debug("APB CS going back %lx:%lx:%lx ",
- t2, last_read, t2 - last_read);
-bad_count_x3:
- pr_debug(KERN_INFO "tripple check enforced\n");
- t0 = apbt_readl(phy_cs_timer_id,
- APBTMR_N_CURRENT_VALUE);
- udelay(1);
- t1 = apbt_readl(phy_cs_timer_id,
- APBTMR_N_CURRENT_VALUE);
- udelay(1);
- t2 = apbt_readl(phy_cs_timer_id,
- APBTMR_N_CURRENT_VALUE);
- if ((t2 > t1) || (t1 > t0)) {
- printk(KERN_ERR "Error: APB CS tripple check failed\n");
- goto bad_count_x3;
- }
- }
-out:
- last_read = t2;
- return (cycle_t)~t2;
-}
-
-static int apbt_clocksource_register(void)
-{
- u64 start, now;
- cycle_t t1;
-
- /* Start the counter, use timer 2 as source, timer 0/1 for event */
- apbt_start_counter(phy_cs_timer_id);
-
- /* Verify whether apbt counter works */
- t1 = apbt_read_clocksource(&clocksource_apbt);
- rdtscll(start);
-
- /*
- * We don't know the TSC frequency yet, but waiting for
- * 200000 TSC cycles is safe:
- * 4 GHz == 50us
- * 1 GHz == 200us
- */
- do {
- rep_nop();
- rdtscll(now);
- } while ((now - start) < 200000UL);
-
- /* APBT is the only always on clocksource, it has to work! */
- if (t1 == apbt_read_clocksource(&clocksource_apbt))
- panic("APBT counter not counting. APBT disabled\n");
-
- /*
- * initialize and register APBT clocksource
- * convert that to ns/clock cycle
- * mult = (ns/c) * 2^APBT_SHIFT
- */
- clocksource_apbt.mult = div_sc(MSEC_PER_SEC,
- (unsigned long) apbt_freq, APBT_SHIFT);
- clocksource_register(&clocksource_apbt);
-
- return 0;
-}
-
-/*
- * Early setup of the APBT timer: only timer 0 is used for booting, then we
- * switch to per CPU timers if possible.
- * Panics if setup fails, since this is the only platform timer on Moorestown.
- */
-void __init apbt_time_init(void)
-{
-#ifdef CONFIG_SMP
- int i;
- struct sfi_timer_table_entry *p_mtmr;
- unsigned int percpu_timer;
- struct apbt_dev *adev;
-#endif
-
- if (apb_timer_block_enabled)
- return;
- apbt_set_mapping();
- if (apbt_virt_address) {
- pr_debug("Found APBT version 0x%lx\n",\
- apbt_readl_reg(APBTMRS_COMP_VERSION));
- } else
- goto out_noapbt;
- /*
- * Read the frequency and check for a sane value, for ESL model
- * we extend the possible clock range to allow time scaling.
- */
-
- if (apbt_freq < APBT_MIN_FREQ || apbt_freq > APBT_MAX_FREQ) {
- pr_debug("APBT has invalid freq 0x%llx\n", apbt_freq);
- goto out_noapbt;
- }
- if (apbt_clocksource_register()) {
- pr_debug("APBT has failed to register clocksource\n");
- goto out_noapbt;
- }
- if (!apbt_clockevent_register())
- apb_timer_block_enabled = 1;
- else {
- pr_debug("APBT has failed to register clockevent\n");
- goto out_noapbt;
- }
-#ifdef CONFIG_SMP
- /* kernel cmdline disable apb timer, so we will use lapic timers */
- if (disable_apbt_percpu) {
- printk(KERN_INFO "apbt: disabled per cpu timer\n");
- return;
- }
- pr_debug("%s: %d CPUs online\n", __func__, num_online_cpus());
- if (num_possible_cpus() <= sfi_mtimer_num) {
- percpu_timer = 1;
- apbt_num_timers_used = num_possible_cpus();
- } else {
- percpu_timer = 0;
- apbt_num_timers_used = 1;
- adev = &per_cpu(cpu_apbt_dev, 0);
- adev->flags &= ~APBT_DEV_USED;
- }
- pr_debug("%s: %d APB timers used\n", __func__, apbt_num_timers_used);
-
- /* here we set up per CPU timer data structure */
- apbt_devs = kzalloc(sizeof(struct apbt_dev) * apbt_num_timers_used,
- GFP_KERNEL);
- if (!apbt_devs) {
- printk(KERN_ERR "Failed to allocate APB timer devices\n");
- return;
- }
- for (i = 0; i < apbt_num_timers_used; i++) {
- adev = &per_cpu(cpu_apbt_dev, i);
- adev->num = i;
- adev->cpu = i;
- p_mtmr = sfi_get_mtmr(i);
- if (p_mtmr) {
- adev->tick = p_mtmr->freq_hz;
- adev->irq = p_mtmr->irq;
- } else
- printk(KERN_ERR "Failed to get timer for cpu %d\n", i);
- adev->count = 0;
- sprintf(adev->name, "apbt%d", i);
- }
-#endif
-
- return;
-
-out_noapbt:
- apbt_clear_mapping();
- apb_timer_block_enabled = 0;
- panic("failed to enable APB timer\n");
-}
-
-static inline void apbt_disable(int n)
-{
- if (is_apbt_capable()) {
- unsigned long ctrl = apbt_readl(n, APBTMR_N_CONTROL);
- ctrl &= ~APBTMR_CONTROL_ENABLE;
- apbt_writel(n, ctrl, APBTMR_N_CONTROL);
- }
-}
-
-/* called before apb_timer_enable, use early map */
-unsigned long apbt_quick_calibrate(void)
-{
- int i, scale;
- u64 old, new;
- cycle_t t1, t2;
- unsigned long khz = 0;
- u32 loop, shift;
-
- apbt_set_mapping();
- apbt_start_counter(phy_cs_timer_id);
-
- /* check if the timer can count down, otherwise return */
- old = apbt_read_clocksource(&clocksource_apbt);
- i = 10000;
- while (--i) {
- if (old != apbt_read_clocksource(&clocksource_apbt))
- break;
- }
- if (!i)
- goto failed;
-
- /* count 16 ms */
- loop = (apbt_freq * 1000) << 4;
-
- /* restart the timer to ensure it won't get to 0 in the calibration */
- apbt_start_counter(phy_cs_timer_id);
-
- old = apbt_read_clocksource(&clocksource_apbt);
- old += loop;
-
- t1 = __native_read_tsc();
-
- do {
- new = apbt_read_clocksource(&clocksource_apbt);
- } while (new < old);
-
- t2 = __native_read_tsc();
-
- shift = 5;
- if (unlikely(loop >> shift == 0)) {
- printk(KERN_INFO
- "APBT TSC calibration failed, not enough resolution\n");
- return 0;
- }
- scale = (int)div_u64((t2 - t1), loop >> shift);
- khz = (scale * apbt_freq * 1000) >> shift;
- printk(KERN_INFO "TSC freq calculated by APB timer is %lu khz\n", khz);
- return khz;
-failed:
- return 0;
-}
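[Editor's note] The arithmetic in apbt_quick_calibrate() reduces nicely: loop is 16 ms worth of timer ticks (apbt_freq is stored in MHz), scale is TSC cycles per (loop >> 5) ticks, and the final expression collapses to TSC cycles per millisecond, i.e. kHz. A standalone check with hypothetical numbers (25 MHz timer, 2 GHz TSC):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t apbt_freq = 25;		/* MHz, driver convention */
		uint64_t tsc_hz = 2000000000ULL;	/* hypothetical TSC */

		uint32_t loop = (apbt_freq * 1000) << 4;	/* ticks in 16 ms */
		uint64_t tsc_delta = tsc_hz / 1000 * 16;	/* cycles in 16 ms */

		unsigned int shift = 5;
		uint64_t scale = tsc_delta / (loop >> shift);
		uint64_t khz = (scale * apbt_freq * 1000) >> shift;

		/* algebraically khz == tsc_delta / 16: cycles per ms */
		printf("calibrated TSC: %llu kHz (expect 2000000)\n",
		       (unsigned long long)khz);
		return 0;
	}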
diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index b5d8b0bcf235..769321185a08 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Firmware replacement code.
*
@@ -10,25 +11,43 @@
*
* Copyright 2002 Andi Kleen, SuSE Labs.
*/
+#define pr_fmt(fmt) "AGP: " fmt
+
#include <linux/kernel.h>
+#include <linux/kcore.h>
#include <linux/types.h>
#include <linux/init.h>
-#include <linux/bootmem.h>
+#include <linux/memblock.h>
#include <linux/mmzone.h>
#include <linux/pci_ids.h>
#include <linux/pci.h>
#include <linux/bitops.h>
-#include <linux/ioport.h>
#include <linux/suspend.h>
-#include <linux/kmemleak.h>
-#include <asm/e820.h>
+#include <asm/e820/api.h>
#include <asm/io.h>
#include <asm/iommu.h>
#include <asm/gart.h>
#include <asm/pci-direct.h>
#include <asm/dma.h>
-#include <asm/k8.h>
+#include <asm/amd/nb.h>
#include <asm/x86_init.h>
+#include <linux/crash_dump.h>
+
+/*
+ * Using 512M as goal, in case kexec will load kernel_big
+ * that will do the in-place decompress, and could overlap with
+ * the gart aperture that is used.
+ * Sequence:
+ * kernel_small
+ * ==> kexec (with kdump trigger path or gart still enabled)
+ * ==> kernel_small (gart area become e820_reserved)
+ * ==> kexec (with kdump trigger path or gart still enabled)
+ *  ==> kernel_big (uncompressed size will be bigger than 64M or 128M)
+ * So don't use 512M below as gart iommu, leave the space for kernel
+ * code for safe.
+ */
+#define GART_MIN_ADDR (512ULL << 20)
+#define GART_MAX_ADDR (1ULL << 32)
int gart_iommu_aperture;
int gart_iommu_aperture_disabled __initdata;
@@ -39,29 +58,48 @@ int fallback_aper_force __initdata;
int fix_aperture __initdata = 1;
-struct bus_dev_range {
- int bus;
- int dev_base;
- int dev_limit;
-};
+#if defined(CONFIG_PROC_VMCORE) || defined(CONFIG_PROC_KCORE)
+/*
+ * If the first kernel maps the aperture over e820 RAM, the kdump kernel will
+ * use the same range because it will remain configured in the northbridge.
+ * Trying to dump this area via /proc/vmcore may crash the machine, so exclude
+ * it from vmcore.
+ */
+static unsigned long aperture_pfn_start, aperture_page_count;
-static struct bus_dev_range bus_dev_ranges[] __initdata = {
- { 0x00, 0x18, 0x20},
- { 0xff, 0x00, 0x20},
- { 0xfe, 0x00, 0x20}
-};
+static int gart_mem_pfn_is_ram(unsigned long pfn)
+{
+ return likely((pfn < aperture_pfn_start) ||
+ (pfn >= aperture_pfn_start + aperture_page_count));
+}
-static struct resource gart_resource = {
- .name = "GART",
- .flags = IORESOURCE_MEM,
+#ifdef CONFIG_PROC_VMCORE
+static bool gart_oldmem_pfn_is_ram(struct vmcore_cb *cb, unsigned long pfn)
+{
+ return !!gart_mem_pfn_is_ram(pfn);
+}
+
+static struct vmcore_cb gart_vmcore_cb = {
+ .pfn_is_ram = gart_oldmem_pfn_is_ram,
};
+#endif
-static void __init insert_aperture_resource(u32 aper_base, u32 aper_size)
+static void __init exclude_from_core(u64 aper_base, u32 aper_order)
+{
+ aperture_pfn_start = aper_base >> PAGE_SHIFT;
+ aperture_page_count = (32 * 1024 * 1024) << aper_order >> PAGE_SHIFT;
+#ifdef CONFIG_PROC_VMCORE
+ register_vmcore_cb(&gart_vmcore_cb);
+#endif
+#ifdef CONFIG_PROC_KCORE
+ WARN_ON(register_mem_pfn_is_ram(&gart_mem_pfn_is_ram));
+#endif
+}
+#else
+static void exclude_from_core(u64 aper_base, u32 aper_order)
{
- gart_resource.start = aper_base;
- gart_resource.end = aper_base + aper_size - 1;
- insert_resource(&iomem_resource, &gart_resource);
}
+#endif
/* This code runs before the PCI subsystem is initialized, so just
access the northbridge directly. */
@@ -69,7 +107,7 @@ static void __init insert_aperture_resource(u32 aper_base, u32 aper_size)
static u32 __init allocate_aperture(void)
{
u32 aper_size;
- void *p;
+ unsigned long addr;
/* aper_size should <= 1G */
if (fallback_aper_order > 5)
@@ -82,40 +120,19 @@ static u32 __init allocate_aperture(void)
* memory. Unfortunately we cannot move it up because that would
* make the IOMMU useless.
*/
- /*
- * using 512M as goal, in case kexec will load kernel_big
- * that will do the on position decompress, and could overlap with
- * that positon with gart that is used.
- * sequende:
- * kernel_small
- * ==> kexec (with kdump trigger path or previous doesn't shutdown gart)
- * ==> kernel_small(gart area become e820_reserved)
- * ==> kexec (with kdump trigger path or previous doesn't shutdown gart)
- * ==> kerne_big (uncompressed size will be big than 64M or 128M)
- * so don't use 512M below as gart iommu, leave the space for kernel
- * code for safe
- */
- p = __alloc_bootmem_nopanic(aper_size, aper_size, 512ULL<<20);
- /*
- * Kmemleak should not scan this block as it may not be mapped via the
- * kernel direct mapping.
- */
- kmemleak_ignore(p);
- if (!p || __pa(p)+aper_size > 0xffffffff) {
- printk(KERN_ERR
- "Cannot allocate aperture memory hole (%p,%uK)\n",
- p, aper_size>>10);
- if (p)
- free_bootmem(__pa(p), aper_size);
+ addr = memblock_phys_alloc_range(aper_size, aper_size,
+ GART_MIN_ADDR, GART_MAX_ADDR);
+ if (!addr) {
+ pr_err("Cannot allocate aperture memory hole [mem %#010lx-%#010lx] (%uKB)\n",
+ addr, addr + aper_size - 1, aper_size >> 10);
return 0;
}
- printk(KERN_INFO "Mapping aperture over %d KB of RAM @ %lx\n",
- aper_size >> 10, __pa(p));
- insert_aperture_resource((u32)__pa(p), aper_size);
- register_nosave_region((u32)__pa(p) >> PAGE_SHIFT,
- (u32)__pa(p+aper_size) >> PAGE_SHIFT);
+ pr_info("Mapping aperture over RAM [mem %#010lx-%#010lx] (%uKB)\n",
+ addr, addr + aper_size - 1, aper_size >> 10);
+ register_nosave_region(addr >> PAGE_SHIFT,
+ (addr+aper_size) >> PAGE_SHIFT);
- return (u32)__pa(p);
+ return (u32)addr;
}
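
The memblock call above replaces the old bootmem allocation with an explicit
physical-range request: aper_size bytes, naturally aligned (the second
argument is the alignment), somewhere in [GART_MIN_ADDR, GART_MAX_ADDR). A
standalone sketch of those constraints, assuming GART_MIN_ADDR is the same
512 MiB floor the removed __alloc_bootmem_nopanic() goal used:

    #include <stdio.h>
    #include <stdint.h>

    #define GART_MIN_ADDR (512ULL << 20)    /* assumed, defined next to GART_MAX_ADDR */
    #define GART_MAX_ADDR (1ULL << 32)

    int main(void)
    {
        for (unsigned int order = 0; order <= 5; order++) {
            uint64_t aper_size = (32ULL << 20) << order;
            /* first naturally aligned candidate at or above the floor */
            uint64_t first = (GART_MIN_ADDR + aper_size - 1) & ~(aper_size - 1);
            uint64_t last = GART_MAX_ADDR - aper_size;

            printf("order %u: %4llu MiB aperture, bases %#010llx..%#010llx\n",
                   order, (unsigned long long)(aper_size >> 20),
                   (unsigned long long)first, (unsigned long long)last);
        }
        return 0;
    }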
@@ -155,10 +172,11 @@ static u32 __init read_agp(int bus, int slot, int func, int cap, u32 *order)
u64 aper;
u32 old_order;
- printk(KERN_INFO "AGP bridge at %02x:%02x:%02x\n", bus, slot, func);
+ pr_info("pci 0000:%02x:%02x:%02x: AGP bridge\n", bus, slot, func);
apsizereg = read_pci_config_16(bus, slot, func, cap + 0x14);
if (apsizereg == 0xffffffff) {
- printk(KERN_ERR "APSIZE in AGP bridge unreadable\n");
+ pr_err("pci 0000:%02x:%02x.%d: APSIZE unreadable\n",
+ bus, slot, func);
return 0;
}
@@ -182,16 +200,18 @@ static u32 __init read_agp(int bus, int slot, int func, int cap, u32 *order)
* On some sick chips, APSIZE is 0. This means it wants 4G, so
* double-check that order, and trust the AMD NB settings:
*/
- printk(KERN_INFO "Aperture from AGP @ %Lx old size %u MB\n",
- aper, 32 << old_order);
+ pr_info("pci 0000:%02x:%02x.%d: AGP aperture [bus addr %#010Lx-%#010Lx] (old size %uMB)\n",
+ bus, slot, func, aper, aper + (32ULL << (old_order + 20)) - 1,
+ 32 << old_order);
if (aper + (32ULL<<(20 + *order)) > 0x100000000ULL) {
- printk(KERN_INFO "Aperture size %u MB (APSIZE %x) is not right, using settings from NB\n",
- 32 << *order, apsizereg);
+ pr_info("pci 0000:%02x:%02x.%d: AGP aperture size %uMB (APSIZE %#x) is not right, using settings from NB\n",
+ bus, slot, func, 32 << *order, apsizereg);
*order = old_order;
}
- printk(KERN_INFO "Aperture from AGP @ %Lx size %u MB (APSIZE %x)\n",
- aper, 32 << *order, apsizereg);
+ pr_info("pci 0000:%02x:%02x.%d: AGP aperture [bus addr %#010Lx-%#010Lx] (%uMB, APSIZE %#x)\n",
+ bus, slot, func, aper, aper + (32ULL << (*order + 20)) - 1,
+ 32 << *order, apsizereg);
if (!aperture_valid(aper, (32*1024*1024) << *order, 32<<20))
return 0;
@@ -206,7 +226,7 @@ static u32 __init read_agp(int bus, int slot, int func, int cap, u32 *order)
* Do a PCI bus scan by hand because we're running before the PCI
* subsystem.
*
- * All K8 AGP bridges are AGPv3 compliant, so we can do this scan
+ * All AMD AGP bridges are AGPv3 compliant, so we can do this scan
* generically. It's probably overkill to always scan all slots because
* the AGP bridges should always be on their own bus in the HT hierarchy,
* but do it here for future safety.
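
Concretely, read_pci_config() and friends use the legacy type 1 configuration
mechanism (index port 0xCF8, data port 0xCFC), which needs nothing from the
PCI core. A sketch of the CONFIG_ADDRESS encoding involved, taking 00:18.3
(the first AMD NB misc device scanned below) and config offset 0x90, assumed
here to be AMD64_GARTAPERTURECTL, as the example:

    #include <stdio.h>
    #include <stdint.h>

    /* 0x80000000 enables the cycle; bus/device/function/register follow */
    static uint32_t conf_addr(int bus, int slot, int func, int reg)
    {
        return 0x80000000u | (uint32_t)bus << 16 | (uint32_t)slot << 11 |
               (uint32_t)func << 8 | (uint32_t)(reg & 0xfc);
    }

    int main(void)
    {
        printf("CONFIG_ADDRESS for 00:18.3 reg 0x90: %#010x\n",
               conf_addr(0x00, 0x18, 3, 0x90));
        return 0;
    }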
@@ -239,75 +259,74 @@ static u32 __init search_agp_bridge(u32 *order, int *valid_agp)
order);
}
- /* No multi-function device? */
type = read_pci_config_byte(bus, slot, func,
PCI_HEADER_TYPE);
- if (!(type & 0x80))
+ if (!(type & PCI_HEADER_TYPE_MFD))
break;
}
}
}
- printk(KERN_INFO "No AGP bridge found\n");
+ pr_info("No AGP bridge found\n");
return 0;
}
-static int gart_fix_e820 __initdata = 1;
+static bool gart_fix_e820 __initdata = true;
static int __init parse_gart_mem(char *p)
{
- if (!p)
- return -EINVAL;
-
- if (!strncmp(p, "off", 3))
- gart_fix_e820 = 0;
- else if (!strncmp(p, "on", 2))
- gart_fix_e820 = 1;
-
- return 0;
+ return kstrtobool(p, &gart_fix_e820);
}
early_param("gart_fix_e820", parse_gart_mem);
+/*
+ * With kexec/kdump, if the first kernel doesn't shut down the GART and the
+ * second kernel allocates a different GART region, there might be two
+ * overlapping GART regions present:
+ *
+ * - the first still used by the GART initialized in the first kernel.
+ * - (sub-)set of it used as normal RAM by the second kernel.
+ *
+ * which leads to memory corruptions and a kernel panic eventually.
+ *
+ * This can also happen if the BIOS has forgotten to mark the GART region
+ * as reserved.
+ *
+ * Try to update the e820 map to mark that new region as reserved.
+ */
void __init early_gart_iommu_check(void)
{
- /*
- * in case it is enabled before, esp for kexec/kdump,
- * previous kernel already enable that. memset called
- * by allocate_aperture/__alloc_bootmem_nopanic cause restart.
- * or second kernel have different position for GART hole. and new
- * kernel could use hole as RAM that is still used by GART set by
- * first kernel
- * or BIOS forget to put that in reserved.
- * try to update e820 to make that region as reserved.
- */
- u32 agp_aper_base = 0, agp_aper_order = 0;
+ u32 agp_aper_order = 0;
int i, fix, slot, valid_agp = 0;
u32 ctl;
u32 aper_size = 0, aper_order = 0, last_aper_order = 0;
u64 aper_base = 0, last_aper_base = 0;
int aper_enabled = 0, last_aper_enabled = 0, last_valid = 0;
+ if (!amd_gart_present())
+ return;
+
if (!early_pci_allowed())
return;
/* This is mostly duplicate of iommu_hole_init */
- agp_aper_base = search_agp_bridge(&agp_aper_order, &valid_agp);
+ search_agp_bridge(&agp_aper_order, &valid_agp);
fix = 0;
- for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
+ for (i = 0; amd_nb_bus_dev_ranges[i].dev_limit; i++) {
int bus;
int dev_base, dev_limit;
- bus = bus_dev_ranges[i].bus;
- dev_base = bus_dev_ranges[i].dev_base;
- dev_limit = bus_dev_ranges[i].dev_limit;
+ bus = amd_nb_bus_dev_ranges[i].bus;
+ dev_base = amd_nb_bus_dev_ranges[i].dev_base;
+ dev_limit = amd_nb_bus_dev_ranges[i].dev_limit;
for (slot = dev_base; slot < dev_limit; slot++) {
- if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+ if (!early_is_amd_nb(read_pci_config(bus, slot, 3, 0x00)))
continue;
ctl = read_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL);
- aper_enabled = ctl & AMD64_GARTEN;
+ aper_enabled = ctl & GARTEN;
aper_order = (ctl >> 1) & 7;
aper_size = (32 * 1024 * 1024) << aper_order;
aper_base = read_pci_config(bus, slot, 3, AMD64_GARTAPERTUREBASE) & 0x7fff;
@@ -336,12 +355,13 @@ void __init early_gart_iommu_check(void)
fix = 1;
if (gart_fix_e820 && !fix && aper_enabled) {
- if (e820_any_mapped(aper_base, aper_base + aper_size,
- E820_RAM)) {
+ if (e820__mapped_any(aper_base, aper_base + aper_size,
+ E820_TYPE_RAM)) {
/* reserve it, so we can reuse it in second kernel */
- printk(KERN_INFO "update e820 for GART\n");
- e820_add_region(aper_base, aper_size, E820_RESERVED);
- update_e820();
+ pr_info("e820: reserve [mem %#010Lx-%#010Lx] for GART\n",
+ aper_base, aper_base + aper_size - 1);
+ e820__range_add(aper_base, aper_size, E820_TYPE_RESERVED);
+ e820__update_table_print();
}
}
@@ -349,20 +369,20 @@ void __init early_gart_iommu_check(void)
return;
/* disable them all at first */
- for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
+ for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
int bus;
int dev_base, dev_limit;
- bus = bus_dev_ranges[i].bus;
- dev_base = bus_dev_ranges[i].dev_base;
- dev_limit = bus_dev_ranges[i].dev_limit;
+ bus = amd_nb_bus_dev_ranges[i].bus;
+ dev_base = amd_nb_bus_dev_ranges[i].dev_base;
+ dev_limit = amd_nb_bus_dev_ranges[i].dev_limit;
for (slot = dev_base; slot < dev_limit; slot++) {
- if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+ if (!early_is_amd_nb(read_pci_config(bus, slot, 3, 0x00)))
continue;
ctl = read_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL);
- ctl &= ~AMD64_GARTEN;
+ ctl &= ~GARTEN;
write_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL, ctl);
}
}
@@ -379,28 +399,31 @@ void __init gart_iommu_hole_init(void)
int fix, slot, valid_agp = 0;
int i, node;
+ if (!amd_gart_present())
+ return;
+
if (gart_iommu_aperture_disabled || !fix_aperture ||
!early_pci_allowed())
return;
- printk(KERN_INFO "Checking aperture...\n");
+ pr_info("Checking aperture...\n");
if (!fallback_aper_force)
agp_aper_base = search_agp_bridge(&agp_aper_order, &valid_agp);
fix = 0;
node = 0;
- for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
+ for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
int bus;
int dev_base, dev_limit;
u32 ctl;
- bus = bus_dev_ranges[i].bus;
- dev_base = bus_dev_ranges[i].dev_base;
- dev_limit = bus_dev_ranges[i].dev_limit;
+ bus = amd_nb_bus_dev_ranges[i].bus;
+ dev_base = amd_nb_bus_dev_ranges[i].dev_base;
+ dev_limit = amd_nb_bus_dev_ranges[i].dev_limit;
for (slot = dev_base; slot < dev_limit; slot++) {
- if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+ if (!early_is_amd_nb(read_pci_config(bus, slot, 3, 0x00)))
continue;
iommu_detected = 1;
@@ -424,8 +447,9 @@ void __init gart_iommu_hole_init(void)
aper_base = read_pci_config(bus, slot, 3, AMD64_GARTAPERTUREBASE) & 0x7fff;
aper_base <<= 25;
- printk(KERN_INFO "Node %d: aperture @ %Lx size %u MB\n",
- node, aper_base, aper_size >> 20);
+ pr_info("Node %d: aperture [bus addr %#010Lx-%#010Lx] (%uMB)\n",
+ node, aper_base, aper_base + aper_size - 1,
+ aper_size >> 20);
node++;
if (!aperture_valid(aper_base, aper_size, 64<<20)) {
@@ -436,9 +460,9 @@ void __init gart_iommu_hole_init(void)
if (!no_iommu &&
max_pfn > MAX_DMA32_PFN &&
!printed_gart_size_msg) {
- printk(KERN_ERR "you are using iommu with agp, but GART size is less than 64M\n");
- printk(KERN_ERR "please increase GART size in your BIOS setup\n");
- printk(KERN_ERR "if BIOS doesn't have that option, contact your HW vendor!\n");
+ pr_err("you are using iommu with agp, but GART size is less than 64MB\n");
+ pr_err("please increase GART size in your BIOS setup\n");
+ pr_err("if BIOS doesn't have that option, contact your HW vendor!\n");
printed_gart_size_msg = 1;
}
} else {
@@ -460,9 +484,12 @@ void __init gart_iommu_hole_init(void)
out:
if (!fix && !fallback_aper_force) {
if (last_aper_base) {
- unsigned long n = (32 * 1024 * 1024) << last_aper_order;
-
- insert_aperture_resource((u32)last_aper_base, n);
+ /*
+ * If this is the kdump kernel, the first kernel
+ * may have allocated the range over its e820 RAM
+ * and fixed up the northbridge
+ */
+ exclude_from_core(last_aper_base, last_aper_order);
}
return;
}
@@ -478,13 +505,10 @@ out:
force_iommu ||
valid_agp ||
fallback_aper_force) {
- printk(KERN_INFO
- "Your BIOS doesn't leave a aperture memory hole\n");
- printk(KERN_INFO
- "Please enable the IOMMU option in the BIOS setup\n");
- printk(KERN_INFO
- "This costs you %d MB of RAM\n",
- 32 << fallback_aper_order);
+ pr_info("Your BIOS doesn't leave an aperture memory hole\n");
+ pr_info("Please enable the IOMMU option in the BIOS setup\n");
+ pr_info("This costs you %dMB of RAM\n",
+ 32 << fallback_aper_order);
aper_order = fallback_aper_order;
aper_alloc = allocate_aperture();
@@ -503,22 +527,32 @@ out:
return;
}
- /* Fix up the north bridges */
- for (i = 0; i < ARRAY_SIZE(bus_dev_ranges); i++) {
- int bus;
- int dev_base, dev_limit;
+ /*
+ * If this is the kdump kernel _and_ the first kernel did not
+ * configure the aperture in the northbridge, this range may
+ * overlap with the first kernel's memory. We can't access the
+ * range through vmcore even though it should be part of the dump.
+ */
+ exclude_from_core(aper_alloc, aper_order);
- bus = bus_dev_ranges[i].bus;
- dev_base = bus_dev_ranges[i].dev_base;
- dev_limit = bus_dev_ranges[i].dev_limit;
+ /* Fix up the north bridges */
+ for (i = 0; i < amd_nb_bus_dev_ranges[i].dev_limit; i++) {
+ int bus, dev_base, dev_limit;
+
+ /*
+ * Don't enable translation yet but enable GART IO and CPU
+ * accesses and set DISTLBWALKPRB since GART table memory is UC.
+ */
+ u32 ctl = DISTLBWALKPRB | aper_order << 1;
+
+ bus = amd_nb_bus_dev_ranges[i].bus;
+ dev_base = amd_nb_bus_dev_ranges[i].dev_base;
+ dev_limit = amd_nb_bus_dev_ranges[i].dev_limit;
for (slot = dev_base; slot < dev_limit; slot++) {
- if (!early_is_k8_nb(read_pci_config(bus, slot, 3, 0x00)))
+ if (!early_is_amd_nb(read_pci_config(bus, slot, 3, 0x00)))
continue;
- /* Don't enable translation yet. That is done later.
- Assume this BIOS didn't initialise the GART so
- just overwrite all previous bits */
- write_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL, aper_order << 1);
+ write_pci_config(bus, slot, 3, AMD64_GARTAPERTURECTL, ctl);
write_pci_config(bus, slot, 3, AMD64_GARTAPERTUREBASE, aper_alloc >> 25);
}
}
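
To make the final register writes concrete: with the bit layout this file
relies on (GARTEN in bit 0, the size order in bits [3:1], DISTLBWALKPRB
assumed to be bit 6, and the base register holding the physical address in
32 MiB units), here is a standalone decode of what gets programmed for a
hypothetical 64 MiB aperture at 512 MiB:

    #include <stdio.h>
    #include <stdint.h>

    #define GARTEN        (1u << 0)
    #define DISTLBWALKPRB (1u << 6)    /* assumed bit position */

    int main(void)
    {
        unsigned int aper_order = 1;           /* 64 MiB */
        uint64_t aper_alloc = 512ULL << 20;    /* hypothetical base */
        uint32_t ctl  = DISTLBWALKPRB | aper_order << 1;
        uint32_t base = (uint32_t)(aper_alloc >> 25);

        /* GARTEN stays clear: translation is enabled later, not here */
        printf("GARTAPERTURECTL  = %#010x (GARTEN=%u, order=%u -> %u MiB)\n",
               ctl, ctl & GARTEN, (ctl >> 1) & 7, 32u << ((ctl >> 1) & 7));
        printf("GARTAPERTUREBASE = %#010x (phys %#010llx)\n",
               base, (unsigned long long)((uint64_t)base << 25));
        return 0;
    }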
diff --git a/arch/x86/kernel/apic/Makefile b/arch/x86/kernel/apic/Makefile
index 565c1bfc507d..581db89477f9 100644
--- a/arch/x86/kernel/apic/Makefile
+++ b/arch/x86/kernel/apic/Makefile
@@ -1,19 +1,28 @@
+# SPDX-License-Identifier: GPL-2.0
#
# Makefile for local APIC drivers and for the IO-APIC code
#
-obj-$(CONFIG_X86_LOCAL_APIC) += apic.o apic_noop.o probe_$(BITS).o ipi.o nmi.o
+# Leads to non-deterministic coverage that is not a function of syscall inputs.
+# In particular, smp_apic_timer_interrupt() is called in random places.
+KCOV_INSTRUMENT := n
+
+obj-$(CONFIG_X86_LOCAL_APIC) += apic.o apic_common.o apic_noop.o ipi.o vector.o init.o
+obj-y += hw_nmi.o
+
obj-$(CONFIG_X86_IO_APIC) += io_apic.o
+obj-$(CONFIG_PCI_MSI) += msi.o
obj-$(CONFIG_SMP) += ipi.o
ifeq ($(CONFIG_X86_64),y)
-obj-y += apic_flat_64.o
-obj-$(CONFIG_X86_X2APIC) += x2apic_cluster.o
-obj-$(CONFIG_X86_X2APIC) += x2apic_phys.o
+# APIC probe will depend on the listing order here
+obj-$(CONFIG_X86_NUMACHIP) += apic_numachip.o
obj-$(CONFIG_X86_UV) += x2apic_uv_x.o
+obj-$(CONFIG_AMD_SECURE_AVIC) += x2apic_savic.o
+obj-$(CONFIG_X86_X2APIC) += x2apic_phys.o
+obj-$(CONFIG_X86_X2APIC) += x2apic_cluster.o
+obj-y += apic_flat_64.o
endif
-obj-$(CONFIG_X86_BIGSMP) += bigsmp_32.o
-obj-$(CONFIG_X86_NUMAQ) += numaq_32.o
-obj-$(CONFIG_X86_ES7000) += es7000_32.o
-obj-$(CONFIG_X86_SUMMIT) += summit_32.o
+# For 32bit, probe_32 needs to be listed last
+obj-$(CONFIG_X86_LOCAL_APIC) += probe_$(BITS).o
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index a96489ee6cab..d93f87f29d03 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Local APIC handling, local APIC timers
*
@@ -18,84 +19,85 @@
#include <linux/kernel_stat.h>
#include <linux/mc146818rtc.h>
#include <linux/acpi_pmtmr.h>
+#include <linux/bitmap.h>
#include <linux/clockchips.h>
#include <linux/interrupt.h>
-#include <linux/bootmem.h>
+#include <linux/memblock.h>
#include <linux/ftrace.h>
#include <linux/ioport.h>
-#include <linux/module.h>
-#include <linux/sysdev.h>
+#include <linux/export.h>
+#include <linux/syscore_ops.h>
#include <linux/delay.h>
#include <linux/timex.h>
+#include <linux/i8253.h>
#include <linux/dmar.h>
#include <linux/init.h>
#include <linux/cpu.h>
#include <linux/dmi.h>
-#include <linux/nmi.h>
#include <linux/smp.h>
#include <linux/mm.h>
+#include <linux/kvm_types.h>
+#include <xen/xen.h>
+
+#include <asm/trace/irq_vectors.h>
+#include <asm/irq_remapping.h>
+#include <asm/pc-conf-reg.h>
#include <asm/perf_event.h>
#include <asm/x86_init.h>
-#include <asm/pgalloc.h>
-#include <asm/atomic.h>
+#include <linux/atomic.h>
+#include <asm/barrier.h>
#include <asm/mpspec.h>
-#include <asm/i8253.h>
#include <asm/i8259.h>
#include <asm/proto.h>
+#include <asm/traps.h>
#include <asm/apic.h>
+#include <asm/acpi.h>
+#include <asm/io_apic.h>
#include <asm/desc.h>
#include <asm/hpet.h>
-#include <asm/idle.h>
#include <asm/mtrr.h>
+#include <asm/time.h>
#include <asm/smp.h>
#include <asm/mce.h>
-#include <asm/kvm_para.h>
+#include <asm/msr.h>
#include <asm/tsc.h>
+#include <asm/hypervisor.h>
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+#include <asm/irq_regs.h>
+#include <asm/cpu.h>
-unsigned int num_processors;
-
-unsigned disabled_cpus __cpuinitdata;
+#include "local.h"
/* Processor that is doing the boot up */
-unsigned int boot_cpu_physical_apicid = -1U;
+u32 boot_cpu_physical_apicid __ro_after_init = BAD_APICID;
+EXPORT_SYMBOL_GPL(boot_cpu_physical_apicid);
-/*
- * The highest APIC ID seen during enumeration.
- */
-unsigned int max_physical_apicid;
+u8 boot_cpu_apic_version __ro_after_init;
/*
- * Bitmask of physically existing CPUs:
+ * This variable controls which CPUs receive external NMIs. By default,
+ * external NMIs are delivered only to the BSP.
*/
-physid_mask_t phys_cpu_present_map;
+static int apic_extnmi __ro_after_init = APIC_EXTNMI_BSP;
/*
- * Map cpu index to physical APIC ID
+ * Hypervisor supports 15 bits of APIC ID in MSI Extended Destination ID
*/
-DEFINE_EARLY_PER_CPU(u16, x86_cpu_to_apicid, BAD_APICID);
-DEFINE_EARLY_PER_CPU(u16, x86_bios_cpu_apicid, BAD_APICID);
-EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_apicid);
-EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
+static bool virt_ext_dest_id __ro_after_init;
-#ifdef CONFIG_X86_32
-/*
- * Knob to control our willingness to enable the local APIC.
- *
- * +1=force-enable
- */
-static int force_enable_local_apic;
-/*
- * APIC command line parameters
- */
-static int __init parse_lapic(char *arg)
+/* For parallel bootup. */
+unsigned long apic_mmio_base __ro_after_init;
+
+static inline bool apic_accessible(void)
{
- force_enable_local_apic = 1;
- return 0;
+ return x2apic_mode || apic_mmio_base;
}
-early_param("lapic", parse_lapic);
+
+#ifdef CONFIG_X86_32
/* Local APIC was disabled by the BIOS and enabled by the kernel */
-static int enabled_via_apicbase;
+static int enabled_via_apicbase __ro_after_init;
/*
* Handle interrupt mode configuration register (IMCR).
@@ -107,103 +109,77 @@ static int enabled_via_apicbase;
*/
static inline void imcr_pic_to_apic(void)
{
- /* select IMCR register */
- outb(0x70, 0x22);
/* NMI and 8259 INTR go through APIC */
- outb(0x01, 0x23);
+ pc_conf_set(PC_CONF_MPS_IMCR, 0x01);
}
static inline void imcr_apic_to_pic(void)
{
- /* select IMCR register */
- outb(0x70, 0x22);
/* NMI and 8259 INTR go directly to BSP */
- outb(0x00, 0x23);
+ pc_conf_set(PC_CONF_MPS_IMCR, 0x00);
}
#endif
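
The pc_conf_set() helper boils down to the same two port writes the deleted
lines did by hand (and, in the kernel proper, callers serialize on a lock). A
userspace sketch with the actual port I/O stubbed out:

    #include <stdio.h>
    #include <stdint.h>

    #define PC_CONF_INDEX    0x22
    #define PC_CONF_DATA     0x23
    #define PC_CONF_MPS_IMCR 0x70

    /* stand-in for outb(); real code touches the I/O ports directly */
    static void outb_stub(uint8_t val, uint16_t port)
    {
        printf("outb(%#04x, %#06x)\n", val, port);
    }

    static void pc_conf_set(uint8_t reg, uint8_t data)
    {
        outb_stub(reg, PC_CONF_INDEX);    /* select the register */
        outb_stub(data, PC_CONF_DATA);    /* write its value */
    }

    int main(void)
    {
        pc_conf_set(PC_CONF_MPS_IMCR, 0x01);    /* route NMI/INTR via APIC */
        return 0;
    }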
+/*
+ * Knob to control our willingness to enable the local APIC.
+ *
+ * +1=force-enable
+ */
+static int force_enable_local_apic __initdata;
+
+/*
+ * APIC command line parameters
+ */
+static int __init parse_lapic(char *arg)
+{
+ if (IS_ENABLED(CONFIG_X86_32) && !arg)
+ force_enable_local_apic = 1;
+ else if (arg && !strncmp(arg, "notscdeadline", 13))
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+ return 0;
+}
+early_param("lapic", parse_lapic);
+
#ifdef CONFIG_X86_64
static int apic_calibrate_pmtmr __initdata;
static __init int setup_apicpmtimer(char *s)
{
apic_calibrate_pmtmr = 1;
notsc_setup(NULL);
- return 0;
+ return 1;
}
__setup("apicpmtimer", setup_apicpmtimer);
#endif
-int x2apic_mode;
-#ifdef CONFIG_X86_X2APIC
-/* x2apic enabled before OS handover */
-static int x2apic_preenabled;
-static __init int setup_nox2apic(char *str)
-{
- if (x2apic_enabled()) {
- pr_warning("Bios already enabled x2apic, "
- "can't enforce nox2apic");
- return 0;
- }
-
- setup_clear_cpu_cap(X86_FEATURE_X2APIC);
- return 0;
-}
-early_param("nox2apic", setup_nox2apic);
-#endif
-
-unsigned long mp_lapic_addr;
-int disable_apic;
+static unsigned long mp_lapic_addr __ro_after_init;
+bool apic_is_disabled __ro_after_init;
/* Disable local APIC timer from the kernel commandline or via dmi quirk */
-static int disable_apic_timer __cpuinitdata;
+static int disable_apic_timer __initdata;
/* Local APIC timer works in C2 */
-int local_apic_timer_c2_ok;
+int local_apic_timer_c2_ok __ro_after_init;
EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok);
-int first_system_vector = 0xfe;
-
/*
* Debug level, exported for io_apic.c
*/
-unsigned int apic_verbosity;
+int apic_verbosity __ro_after_init;
-int pic_mode;
+int pic_mode __ro_after_init;
/* Have we found an MP table */
-int smp_found_config;
+int smp_found_config __ro_after_init;
static struct resource lapic_resource = {
.name = "Local APIC",
.flags = IORESOURCE_MEM | IORESOURCE_BUSY,
};
-static unsigned int calibration_result;
+/* Measured in ticks per HZ. */
+unsigned int lapic_timer_period = 0;
-static int lapic_next_event(unsigned long delta,
- struct clock_event_device *evt);
-static void lapic_timer_setup(enum clock_event_mode mode,
- struct clock_event_device *evt);
-static void lapic_timer_broadcast(const struct cpumask *mask);
static void apic_pm_activate(void);
/*
- * The local apic timer can be used for any function which is CPU local.
- */
-static struct clock_event_device lapic_clockevent = {
- .name = "lapic",
- .features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT
- | CLOCK_EVT_FEAT_C3STOP | CLOCK_EVT_FEAT_DUMMY,
- .shift = 32,
- .set_mode = lapic_timer_setup,
- .set_next_event = lapic_next_event,
- .broadcast = lapic_timer_broadcast,
- .rating = 100,
- .irq = -1,
-};
-static DEFINE_PER_CPU(struct clock_event_device, lapic_events);
-
-static unsigned long apic_phys;
-
-/*
* Get the LAPIC version
*/
static inline int lapic_get_version(void)
@@ -216,11 +192,7 @@ static inline int lapic_get_version(void)
*/
static inline int lapic_is_integrated(void)
{
-#ifdef CONFIG_X86_64
- return 1;
-#else
return APIC_INTEGRATED(lapic_get_version());
-#endif
}
/*
@@ -232,6 +204,11 @@ static int modern_apic(void)
if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
boot_cpu_data.x86 >= 0xf)
return 1;
+
+ /* Hygon systems use modern APIC */
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
+ return 1;
+
return lapic_get_version() >= 0x14;
}
@@ -239,38 +216,19 @@ static int modern_apic(void)
* right after this call apic become NOOP driven
* so apic->write/read doesn't do anything
*/
-void apic_disable(void)
-{
- pr_info("APIC: switched to apic NOOP\n");
- apic = &apic_noop;
-}
-
-void native_apic_wait_icr_idle(void)
-{
- while (apic_read(APIC_ICR) & APIC_ICR_BUSY)
- cpu_relax();
-}
-
-u32 native_safe_apic_wait_icr_idle(void)
+static void __init apic_disable(void)
{
- u32 send_status;
- int timeout;
-
- timeout = 0;
- do {
- send_status = apic_read(APIC_ICR) & APIC_ICR_BUSY;
- if (!send_status)
- break;
- udelay(100);
- } while (timeout++ < 1000);
-
- return send_status;
+ apic_install_driver(&apic_noop);
}
void native_apic_icr_write(u32 low, u32 id)
{
- apic_write(APIC_ICR2, SET_APIC_DEST_FIELD(id));
+ unsigned long flags;
+
+ local_irq_save(flags);
+ apic_write(APIC_ICR2, SET_XAPIC_DEST_FIELD(id));
apic_write(APIC_ICR, low);
+ local_irq_restore(flags);
}
u64 native_apic_icr_read(void)
@@ -284,45 +242,15 @@ u64 native_apic_icr_read(void)
}
/**
- * enable_NMI_through_LVT0 - enable NMI through local vector table 0
- */
-void __cpuinit enable_NMI_through_LVT0(void)
-{
- unsigned int v;
-
- /* unmask and set to NMI */
- v = APIC_DM_NMI;
-
- /* Level triggered for 82489DX (32bit mode) */
- if (!lapic_is_integrated())
- v |= APIC_LVT_LEVEL_TRIGGER;
-
- apic_write(APIC_LVT0, v);
-}
-
-#ifdef CONFIG_X86_32
-/**
- * get_physical_broadcast - Get number of physical broadcast IDs
- */
-int get_physical_broadcast(void)
-{
- return modern_apic() ? 0xff : 0xf;
-}
-#endif
-
-/**
* lapic_get_maxlvt - get the maximum number of local vector table entries
*/
int lapic_get_maxlvt(void)
{
- unsigned int v;
-
- v = apic_read(APIC_LVR);
/*
* - we always have APIC integrated on 64bit mode
* - 82489DXs do not report # of LVT entries
*/
- return APIC_INTEGRATED(GET_APIC_VERSION(v)) ? GET_APIC_MAXLVT(v) : 2;
+ return lapic_is_integrated() ? GET_APIC_MAXLVT(apic_read(APIC_LVR)) : 2;
}
/*
@@ -331,6 +259,10 @@ int lapic_get_maxlvt(void)
/* Clock divisor */
#define APIC_DIVISOR 16
+#define TSC_DIVISOR 8
+
+/* i82489DX specific */
+#define I82489DX_BASE_DIVIDER (((0x2) << 18))
/*
* This function sets up the local APIC timer, with a timeout of
@@ -349,14 +281,33 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
lvtt_value = LOCAL_TIMER_VECTOR;
if (!oneshot)
lvtt_value |= APIC_LVT_TIMER_PERIODIC;
+ else if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
+ lvtt_value |= APIC_LVT_TIMER_TSCDEADLINE;
+
+ /*
+ * The i82489DX APIC uses bits 18 and 19 for the base divider. This
+ * overlaps with bit 18 on integrated APICs, but is not documented
+ * in the SDM. No problem though. i82489DX equipped systems do not
+ * have TSC deadline timer.
+ */
if (!lapic_is_integrated())
- lvtt_value |= SET_APIC_TIMER_BASE(APIC_TIMER_BASE_DIV);
+ lvtt_value |= I82489DX_BASE_DIVIDER;
if (!irqen)
lvtt_value |= APIC_LVT_MASKED;
apic_write(APIC_LVTT, lvtt_value);
+ if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
+ /*
+ * See Intel SDM: TSC-Deadline Mode chapter. In xAPIC mode,
+ * writing to the APIC LVTT and TSC_DEADLINE MSR isn't serialized.
+ * According to Intel, MFENCE can do the serialization here.
+ */
+ asm volatile("mfence" : : : "memory");
+ return;
+ }
+
/*
* Divide PICLK by 16
*/
@@ -370,38 +321,93 @@ static void __setup_APIC_LVTT(unsigned int clocks, int oneshot, int irqen)
}
/*
- * Setup extended LVT, AMD specific (K8, family 10h)
+ * Setup extended LVT, AMD specific
*
- * Vector mappings are hard coded. On K8 only offset 0 (APIC500) and
- * MCE interrupts are supported. Thus MCE offset must be set to 0.
+ * Software should use the LVT offsets the BIOS provides. The offsets
+ * are determined by the subsystems using it like those for MCE
+ * threshold or IBS. On K8 only offset 0 (APIC500) and MCE interrupts
+ * are supported. Beginning with family 10h at least 4 offsets are
+ * available.
*
- * If mask=1, the LVT entry does not generate interrupts while mask=0
- * enables the vector. See also the BKDGs.
+ * Since the offsets must be consistent for all cores, we keep track
+ * of the LVT offsets in software and reserve the offset for the same
+ * vector also to be used on other cores. An offset is freed by
+ * setting the entry to APIC_EILVT_MASKED.
+ *
+ * If the BIOS is right, there should be no conflicts. Otherwise a
+ * "[Firmware Bug]: ..." error message is generated. However, if
+ * software does not properly determine the offsets, it is not
+ * necessarily a BIOS bug.
*/
-#define APIC_EILVT_LVTOFF_MCE 0
-#define APIC_EILVT_LVTOFF_IBS 1
+static atomic_t eilvt_offsets[APIC_EILVT_NR_MAX];
-static void setup_APIC_eilvt(u8 lvt_off, u8 vector, u8 msg_type, u8 mask)
+static inline int eilvt_entry_is_changeable(unsigned int old, unsigned int new)
{
- unsigned long reg = (lvt_off << 4) + APIC_EILVTn(0);
- unsigned int v = (mask << 16) | (msg_type << 8) | vector;
-
- apic_write(reg, v);
+ return (old & APIC_EILVT_MASKED)
+ || (new == APIC_EILVT_MASKED)
+ || ((new & ~APIC_EILVT_MASKED) == old);
}
-u8 setup_APIC_eilvt_mce(u8 vector, u8 msg_type, u8 mask)
+static unsigned int reserve_eilvt_offset(int offset, unsigned int new)
{
- setup_APIC_eilvt(APIC_EILVT_LVTOFF_MCE, vector, msg_type, mask);
- return APIC_EILVT_LVTOFF_MCE;
+ unsigned int rsvd, vector;
+
+ if (offset >= APIC_EILVT_NR_MAX)
+ return ~0;
+
+ rsvd = atomic_read(&eilvt_offsets[offset]);
+ do {
+ vector = rsvd & ~APIC_EILVT_MASKED; /* 0: unassigned */
+ if (vector && !eilvt_entry_is_changeable(vector, new))
+ /* may not change if vectors are different */
+ return rsvd;
+ } while (!atomic_try_cmpxchg(&eilvt_offsets[offset], &rsvd, new));
+
+ rsvd = new & ~APIC_EILVT_MASKED;
+ if (rsvd && rsvd != vector)
+ pr_info("LVT offset %d assigned for vector 0x%02x\n",
+ offset, rsvd);
+
+ return new;
}
-u8 setup_APIC_eilvt_ibs(u8 vector, u8 msg_type, u8 mask)
+/*
+ * If mask=1, the LVT entry does not generate interrupts while mask=0
+ * enables the vector. See also the BKDGs. Must be called with
+ * preemption disabled.
+ */
+
+int setup_APIC_eilvt(u8 offset, u8 vector, u8 msg_type, u8 mask)
{
- setup_APIC_eilvt(APIC_EILVT_LVTOFF_IBS, vector, msg_type, mask);
- return APIC_EILVT_LVTOFF_IBS;
+ unsigned long reg = APIC_EILVTn(offset);
+ unsigned int new, old, reserved;
+
+ new = (mask << 16) | (msg_type << 8) | vector;
+ old = apic_read(reg);
+ reserved = reserve_eilvt_offset(offset, new);
+
+ if (reserved != new) {
+ pr_err(FW_BUG "cpu %d, try to use APIC%lX (LVT offset %d) for "
+ "vector 0x%x, but the register is already in use for "
+ "vector 0x%x on another cpu\n",
+ smp_processor_id(), reg, offset, new, reserved);
+ return -EINVAL;
+ }
+
+ if (!eilvt_entry_is_changeable(old, new)) {
+ pr_err(FW_BUG "cpu %d, try to use APIC%lX (LVT offset %d) for "
+ "vector 0x%x, but the register is already in use for "
+ "vector 0x%x on this cpu\n",
+ smp_processor_id(), reg, offset, new, old);
+ return -EBUSY;
+ }
+
+ apic_write(reg, new);
+
+ return 0;
}
-EXPORT_SYMBOL_GPL(setup_APIC_eilvt_ibs);
+EXPORT_SYMBOL_GPL(setup_APIC_eilvt);
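
A standalone sketch of the reservation rule encoded in
eilvt_entry_is_changeable(): an entry may be (re)written only if the old
value is masked, the new value is masked, or both carry the same payload.
The IBS-style values below are hypothetical:

    #include <stdio.h>

    #define APIC_EILVT_MASKED (1 << 16)

    static int changeable(unsigned int old, unsigned int new)
    {
        return (old & APIC_EILVT_MASKED) ||
               (new == APIC_EILVT_MASKED) ||
               ((new & ~APIC_EILVT_MASKED) == old);
    }

    int main(void)
    {
        unsigned int vec_a = 0x400 | 0xf8;    /* hypothetical NMI-type entry */
        unsigned int vec_b = 0x400 | 0xf9;    /* same offset, other vector */

        printf("%d\n", changeable(APIC_EILVT_MASKED, vec_a));  /* 1: was free */
        printf("%d\n", changeable(vec_a, vec_a));              /* 1: same vector */
        printf("%d\n", changeable(vec_a, vec_b));              /* 0: conflict */
        return 0;
    }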
/*
* Program the next event, relative to now
@@ -413,40 +419,65 @@ static int lapic_next_event(unsigned long delta,
return 0;
}
-/*
- * Setup the lapic timer in periodic or oneshot mode
- */
-static void lapic_timer_setup(enum clock_event_mode mode,
- struct clock_event_device *evt)
+static int lapic_next_deadline(unsigned long delta,
+ struct clock_event_device *evt)
+{
+ u64 tsc;
+
+ /* This MSR is special and need a special fence: */
+ weak_wrmsr_fence();
+
+ tsc = rdtsc();
+ wrmsrq(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR));
+ return 0;
+}
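
The unit conversion is worth spelling out: the deadline clockevent is
registered at tsc_khz * (1000 / TSC_DIVISOR) Hz (see setup_APIC_timer()
below), so one clockevent tick equals TSC_DIVISOR TSC cycles, and the MSR
write above scales delta back up to raw cycles. A worked example with a
hypothetical 2.8 GHz TSC:

    #include <stdio.h>
    #include <stdint.h>

    #define TSC_DIVISOR 8

    int main(void)
    {
        uint64_t tsc_khz = 2800000;    /* hypothetical 2.8 GHz TSC */
        uint64_t evt_hz  = tsc_khz * (1000 / TSC_DIVISOR);  /* 350 MHz */
        uint64_t delta   = evt_hz / 1000;    /* ticks for a 1 ms event */
        uint64_t now     = 1000000;          /* pretend rdtsc() value */

        printf("deadline MSR = %llu (now + %llu TSC cycles)\n",
               (unsigned long long)(now + delta * TSC_DIVISOR),
               (unsigned long long)(delta * TSC_DIVISOR));
        return 0;
    }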
+
+static int lapic_timer_shutdown(struct clock_event_device *evt)
{
- unsigned long flags;
unsigned int v;
/* Lapic used as dummy for broadcast ? */
if (evt->features & CLOCK_EVT_FEAT_DUMMY)
- return;
+ return 0;
- local_irq_save(flags);
+ v = apic_read(APIC_LVTT);
+ v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
+ apic_write(APIC_LVTT, v);
- switch (mode) {
- case CLOCK_EVT_MODE_PERIODIC:
- case CLOCK_EVT_MODE_ONESHOT:
- __setup_APIC_LVTT(calibration_result,
- mode != CLOCK_EVT_MODE_PERIODIC, 1);
- break;
- case CLOCK_EVT_MODE_UNUSED:
- case CLOCK_EVT_MODE_SHUTDOWN:
- v = apic_read(APIC_LVTT);
- v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
- apic_write(APIC_LVTT, v);
+ /*
+ * Setting APIC_LVT_MASKED (above) should be enough to tell
+ * the hardware that this timer will never fire. But AMD
+ * erratum 411 and some Intel CPU behavior circa 2024 say
+ * otherwise. Time for belt and suspenders programming: mask
+ * the timer _and_ zero the counter registers:
+ */
+ if (v & APIC_LVT_TIMER_TSCDEADLINE)
+ wrmsrq(MSR_IA32_TSC_DEADLINE, 0);
+ else
apic_write(APIC_TMICT, 0);
- break;
- case CLOCK_EVT_MODE_RESUME:
- /* Nothing to do here */
- break;
- }
- local_irq_restore(flags);
+ return 0;
+}
+
+static inline int
+lapic_timer_set_periodic_oneshot(struct clock_event_device *evt, bool oneshot)
+{
+ /* Lapic used as dummy for broadcast ? */
+ if (evt->features & CLOCK_EVT_FEAT_DUMMY)
+ return 0;
+
+ __setup_APIC_LVTT(lapic_timer_period, oneshot, 1);
+ return 0;
+}
+
+static int lapic_timer_set_periodic(struct clock_event_device *evt)
+{
+ return lapic_timer_set_periodic_oneshot(evt, false);
+}
+
+static int lapic_timer_set_oneshot(struct clock_event_device *evt)
+{
+ return lapic_timer_set_periodic_oneshot(evt, true);
}
/*
@@ -455,35 +486,147 @@ static void lapic_timer_setup(enum clock_event_mode mode,
static void lapic_timer_broadcast(const struct cpumask *mask)
{
#ifdef CONFIG_SMP
- apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
+ __apic_send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
#endif
}
+
+/*
+ * The local apic timer can be used for any function which is CPU local.
+ */
+static struct clock_event_device lapic_clockevent = {
+ .name = "lapic",
+ .features = CLOCK_EVT_FEAT_PERIODIC |
+ CLOCK_EVT_FEAT_ONESHOT | CLOCK_EVT_FEAT_C3STOP
+ | CLOCK_EVT_FEAT_DUMMY,
+ .shift = 32,
+ .set_state_shutdown = lapic_timer_shutdown,
+ .set_state_periodic = lapic_timer_set_periodic,
+ .set_state_oneshot = lapic_timer_set_oneshot,
+ .set_state_oneshot_stopped = lapic_timer_shutdown,
+ .set_next_event = lapic_next_event,
+ .broadcast = lapic_timer_broadcast,
+ .rating = 100,
+ .irq = -1,
+};
+static DEFINE_PER_CPU(struct clock_event_device, lapic_events);
+
+static const struct x86_cpu_id deadline_match[] __initconst = {
+ X86_MATCH_VFM_STEPS(INTEL_HASWELL_X, 0x2, 0x2, 0x3a), /* EP */
+ X86_MATCH_VFM_STEPS(INTEL_HASWELL_X, 0x4, 0x4, 0x0f), /* EX */
+
+ X86_MATCH_VFM(INTEL_BROADWELL_X, 0x0b000020),
+
+ X86_MATCH_VFM_STEPS(INTEL_BROADWELL_D, 0x2, 0x2, 0x00000011),
+ X86_MATCH_VFM_STEPS(INTEL_BROADWELL_D, 0x3, 0x3, 0x0700000e),
+ X86_MATCH_VFM_STEPS(INTEL_BROADWELL_D, 0x4, 0x4, 0x0f00000c),
+ X86_MATCH_VFM_STEPS(INTEL_BROADWELL_D, 0x5, 0x5, 0x0e000003),
+
+ X86_MATCH_VFM_STEPS(INTEL_SKYLAKE_X, 0x3, 0x3, 0x01000136),
+ X86_MATCH_VFM_STEPS(INTEL_SKYLAKE_X, 0x4, 0x4, 0x02000014),
+ X86_MATCH_VFM_STEPS(INTEL_SKYLAKE_X, 0x5, 0xf, 0),
+
+ X86_MATCH_VFM(INTEL_HASWELL, 0x22),
+ X86_MATCH_VFM(INTEL_HASWELL_L, 0x20),
+ X86_MATCH_VFM(INTEL_HASWELL_G, 0x17),
+
+ X86_MATCH_VFM(INTEL_BROADWELL, 0x25),
+ X86_MATCH_VFM(INTEL_BROADWELL_G, 0x17),
+
+ X86_MATCH_VFM(INTEL_SKYLAKE_L, 0xb2),
+ X86_MATCH_VFM(INTEL_SKYLAKE, 0xb2),
+
+ X86_MATCH_VFM(INTEL_KABYLAKE_L, 0x52),
+ X86_MATCH_VFM(INTEL_KABYLAKE, 0x52),
+
+ {},
+};
+
+static __init bool apic_validate_deadline_timer(void)
+{
+ const struct x86_cpu_id *m;
+ u32 rev;
+
+ if (!boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
+ return false;
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ return true;
+
+ m = x86_match_cpu(deadline_match);
+ if (!m)
+ return true;
+
+ rev = (u32)m->driver_data;
+
+ if (boot_cpu_data.microcode >= rev)
+ return true;
+
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+ pr_err(FW_BUG "TSC_DEADLINE disabled due to Errata; "
+ "please update microcode to version: 0x%x (or later)\n", rev);
+ return false;
+}
+
/*
- * Setup the local APIC timer for this CPU. Copy the initilized values
+ * Setup the local APIC timer for this CPU. Copy the initialized values
* of the boot CPU and register the clock event in the framework.
*/
-static void __cpuinit setup_APIC_timer(void)
+static void setup_APIC_timer(void)
{
- struct clock_event_device *levt = &__get_cpu_var(lapic_events);
+ struct clock_event_device *levt = this_cpu_ptr(&lapic_events);
- if (cpu_has(&current_cpu_data, X86_FEATURE_ARAT)) {
+ if (this_cpu_has(X86_FEATURE_ARAT)) {
lapic_clockevent.features &= ~CLOCK_EVT_FEAT_C3STOP;
- /* Make LAPIC timer preferrable over percpu HPET */
+ /* Make LAPIC timer preferable over percpu HPET */
lapic_clockevent.rating = 150;
}
memcpy(levt, &lapic_clockevent, sizeof(*levt));
levt->cpumask = cpumask_of(smp_processor_id());
- clockevents_register_device(levt);
+ if (this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER)) {
+ levt->name = "lapic-deadline";
+ levt->features &= ~(CLOCK_EVT_FEAT_PERIODIC |
+ CLOCK_EVT_FEAT_DUMMY);
+ levt->set_next_event = lapic_next_deadline;
+ clockevents_config_and_register(levt,
+ tsc_khz * (1000 / TSC_DIVISOR),
+ 0xF, ~0UL);
+ } else
+ clockevents_register_device(levt);
+
+ apic_update_vector(smp_processor_id(), LOCAL_TIMER_VECTOR, true);
+}
+
+/*
+ * Install the updated TSC frequency from recalibration at the TSC
+ * deadline clockevent devices.
+ */
+static void __lapic_update_tsc_freq(void *info)
+{
+ struct clock_event_device *levt = this_cpu_ptr(&lapic_events);
+
+ if (!this_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
+ return;
+
+ clockevents_update_freq(levt, tsc_khz * (1000 / TSC_DIVISOR));
+}
+
+void lapic_update_tsc_freq(void)
+{
+ /*
+ * The clockevent device's ->mult and ->shift can both be
+ * changed. In order to avoid races, schedule the frequency
+ * update code on each CPU.
+ */
+ on_each_cpu(__lapic_update_tsc_freq, NULL, 0);
}
/*
* In this functions we calibrate APIC bus clocks to the external timer.
*
* We want to do the calibration only once since we want to have local timer
- * irqs syncron. CPUs connected by the same APIC bus have the very same bus
+ * irqs synchronous. CPUs connected by the same APIC bus have the very same bus
* frequency.
*
* This was previously done by reading the PIT/HPET and waiting for a wrap
@@ -505,20 +648,20 @@ static void __cpuinit setup_APIC_timer(void)
static __initdata int lapic_cal_loops = -1;
static __initdata long lapic_cal_t1, lapic_cal_t2;
static __initdata unsigned long long lapic_cal_tsc1, lapic_cal_tsc2;
-static __initdata unsigned long lapic_cal_pm1, lapic_cal_pm2;
+static __initdata u32 lapic_cal_pm1, lapic_cal_pm2;
static __initdata unsigned long lapic_cal_j1, lapic_cal_j2;
/*
- * Temporary interrupt handler.
+ * Temporary interrupt handler and polled calibration function.
*/
static void __init lapic_cal_handler(struct clock_event_device *dev)
{
unsigned long long tsc = 0;
long tapic = apic_read(APIC_TMCCT);
- unsigned long pm = acpi_pm_read_early();
+ u32 pm = acpi_pm_read_early();
- if (cpu_has_tsc)
- rdtscll(tsc);
+ if (boot_cpu_has(X86_FEATURE_TSC))
+ tsc = rdtsc();
switch (lapic_cal_loops++) {
case 0:
@@ -540,7 +683,7 @@ static void __init lapic_cal_handler(struct clock_event_device *dev)
}
static int __init
-calibrate_by_pmtimer(long deltapm, long *delta, long *deltatsc)
+calibrate_by_pmtimer(u32 deltapm, long *delta, long *deltatsc)
{
const long pm_100ms = PMTMR_TICKS_PER_SEC / 10;
const long pm_thresh = pm_100ms / 100;
@@ -551,7 +694,7 @@ calibrate_by_pmtimer(long deltapm, long *delta, long *deltatsc)
return -1;
#endif
- apic_printk(APIC_VERBOSE, "... PM-Timer delta = %ld\n", deltapm);
+ apic_pr_verbose("... PM-Timer delta = %u\n", deltapm);
/* Check, if the PM timer is available */
if (!deltapm)
@@ -561,14 +704,14 @@ calibrate_by_pmtimer(long deltapm, long *delta, long *deltatsc)
if (deltapm > (pm_100ms - pm_thresh) &&
deltapm < (pm_100ms + pm_thresh)) {
- apic_printk(APIC_VERBOSE, "... PM-Timer result ok\n");
+ apic_pr_verbose("... PM-Timer result ok\n");
return 0;
}
res = (((u64)deltapm) * mult) >> 22;
do_div(res, 1000000);
- pr_warning("APIC calibration not consistent "
- "with PM-Timer: %ldms instead of 100ms\n",(long)res);
+ pr_warn("APIC calibration not consistent with PM-Timer: %ldms instead of 100ms\n",
+ (long)res);
/* Correct the lapic counter value */
res = (((u64)(*delta)) * pm_100ms);
@@ -578,31 +721,111 @@ calibrate_by_pmtimer(long deltapm, long *delta, long *deltatsc)
*delta = (long)res;
/* Correct the tsc counter value */
- if (cpu_has_tsc) {
+ if (boot_cpu_has(X86_FEATURE_TSC)) {
res = (((u64)(*deltatsc)) * pm_100ms);
do_div(res, deltapm);
- apic_printk(APIC_VERBOSE, "TSC delta adjusted to "
- "PM-Timer: %lu (%ld)\n",
- (unsigned long)res, *deltatsc);
+ apic_pr_verbose("TSC delta adjusted to PM-Timer: %lu (%ld)\n",
+ (unsigned long)res, *deltatsc);
*deltatsc = (long)res;
}
return 0;
}
+static int __init lapic_init_clockevent(void)
+{
+ if (!lapic_timer_period)
+ return -1;
+
+ /* Calculate the scaled math multiplication factor */
+ lapic_clockevent.mult = div_sc(lapic_timer_period/APIC_DIVISOR,
+ TICK_NSEC, lapic_clockevent.shift);
+ lapic_clockevent.max_delta_ns =
+ clockevent_delta2ns(0x7FFFFFFF, &lapic_clockevent);
+ lapic_clockevent.max_delta_ticks = 0x7FFFFFFF;
+ lapic_clockevent.min_delta_ns =
+ clockevent_delta2ns(0xF, &lapic_clockevent);
+ lapic_clockevent.min_delta_ticks = 0xF;
+
+ return 0;
+}
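
The fixed-point factor computed here follows the clockevents convention:
div_sc(a, b, shift) is (a << shift) / b, so with shift = 32 the core can
convert nanoseconds to timer ticks as (ns * mult) >> 32. A worked example,
assuming HZ=1000 (so TICK_NSEC is one million) and a hypothetical timer
period:

    #include <stdio.h>
    #include <stdint.h>

    #define APIC_DIVISOR 16
    #define TICK_NSEC    1000000    /* ns per tick for an assumed HZ=1000 */

    static uint32_t div_sc(uint32_t ticks, uint32_t nsec, int shift)
    {
        return (uint32_t)(((uint64_t)ticks << shift) / nsec);
    }

    int main(void)
    {
        uint32_t lapic_timer_period = 100000;    /* hypothetical ticks per HZ */
        uint32_t mult = div_sc(lapic_timer_period / APIC_DIVISOR,
                               TICK_NSEC, 32);
        uint64_t ns = 500000;                    /* program a 500 us event */

        printf("mult=%u -> %llu timer ticks for %llu ns\n", mult,
               (unsigned long long)((ns * mult) >> 32),
               (unsigned long long)ns);
        return 0;
    }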
+
+bool __init apic_needs_pit(void)
+{
+ /*
+ * If the frequencies are not known, PIT is required for both TSC
+ * and apic timer calibration.
+ */
+ if (!tsc_khz || !cpu_khz)
+ return true;
+
+ /* Is there an APIC at all or is it disabled? */
+ if (!boot_cpu_has(X86_FEATURE_APIC) || apic_is_disabled)
+ return true;
+
+ /*
+ * If interrupt delivery mode is legacy PIC or virtual wire without
+ * configuration, the local APIC timer won't be set up. Make sure
+ * that the PIT is initialized.
+ */
+ if (apic_intr_mode == APIC_PIC ||
+ apic_intr_mode == APIC_VIRTUAL_WIRE_NO_CONFIG)
+ return true;
+
+ /* Virt guests may lack ARAT, but still have DEADLINE */
+ if (!boot_cpu_has(X86_FEATURE_ARAT))
+ return true;
+
+ /* Deadline timer is based on TSC so no further PIT action required */
+ if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
+ return false;
+
+ /* APIC timer disabled? */
+ if (disable_apic_timer)
+ return true;
+ /*
+ * The APIC timer frequency is known already, no PIT calibration
+ * required. If unknown, let the PIT be initialized.
+ */
+ return lapic_timer_period == 0;
+}
+
static int __init calibrate_APIC_clock(void)
{
- struct clock_event_device *levt = &__get_cpu_var(lapic_events);
- void (*real_handler)(struct clock_event_device *dev);
+ struct clock_event_device *levt = this_cpu_ptr(&lapic_events);
+ u64 tsc_perj = 0, tsc_start = 0;
+ long delta_tsc_khz, bus_khz;
+ unsigned long jif_start;
unsigned long deltaj;
long delta, deltatsc;
int pm_referenced = 0;
- local_irq_disable();
+ if (boot_cpu_has(X86_FEATURE_TSC_DEADLINE_TIMER))
+ return 0;
- /* Replace the global interrupt handler */
- real_handler = global_clock_event->event_handler;
- global_clock_event->event_handler = lapic_cal_handler;
+ /*
+ * Check if lapic timer has already been calibrated by platform
+ * specific routine, such as tsc calibration code. If so just fill
+ * in the clockevent structure and return.
+ */
+ if (!lapic_init_clockevent()) {
+ apic_pr_verbose("lapic timer already calibrated %d\n", lapic_timer_period);
+ /*
+ * Direct calibration methods must have an always running
+ * local APIC timer, no need for broadcast timer.
+ */
+ lapic_clockevent.features &= ~CLOCK_EVT_FEAT_DUMMY;
+ return 0;
+ }
+
+ apic_pr_verbose("Using local APIC timer interrupts. Calibrating APIC timer ...\n");
+
+ /*
+ * There are platforms w/o global clockevent devices. Instead of
+ * making the calibration conditional on that, use a polling based
+ * approach everywhere.
+ */
+ local_irq_disable();
/*
* Setup the APIC counter to maximum. There is no way the lapic
@@ -610,20 +833,55 @@ static int __init calibrate_APIC_clock(void)
*/
__setup_APIC_LVTT(0xffffffff, 0, 0);
- /* Let the interrupts run */
+ /*
+ * Methods to terminate the calibration loop:
+ * 1) Global clockevent if available (jiffies)
+ * 2) TSC if available and frequency is known
+ */
+ jif_start = READ_ONCE(jiffies);
+
+ if (tsc_khz) {
+ tsc_start = rdtsc();
+ tsc_perj = div_u64((u64)tsc_khz * 1000, HZ);
+ }
+
+ /*
+ * Enable interrupts so the tick can fire, if a global
+ * clockevent device is available
+ */
local_irq_enable();
- while (lapic_cal_loops <= LAPIC_CAL_LOOPS)
- cpu_relax();
+ while (lapic_cal_loops <= LAPIC_CAL_LOOPS) {
+ /* Wait for a tick to elapse */
+ while (1) {
+ if (tsc_khz) {
+ u64 tsc_now = rdtsc();
+ if ((tsc_now - tsc_start) >= tsc_perj) {
+ tsc_start += tsc_perj;
+ break;
+ }
+ } else {
+ unsigned long jif_now = READ_ONCE(jiffies);
- local_irq_disable();
+ if (time_after(jif_now, jif_start)) {
+ jif_start = jif_now;
+ break;
+ }
+ }
+ cpu_relax();
+ }
- /* Restore the real event handler */
- global_clock_event->event_handler = real_handler;
+ /* Invoke the calibration routine */
+ local_irq_disable();
+ lapic_cal_handler(NULL);
+ local_irq_enable();
+ }
+
+ local_irq_disable();
/* Build delta t1-t2 as apic timer counts down */
delta = lapic_cal_t1 - lapic_cal_t2;
- apic_printk(APIC_VERBOSE, "... lapic delta = %ld\n", delta);
+ apic_pr_verbose("... lapic delta = %ld\n", delta);
deltatsc = (long)(lapic_cal_tsc2 - lapic_cal_tsc1);
@@ -631,56 +889,48 @@ static int __init calibrate_APIC_clock(void)
pm_referenced = !calibrate_by_pmtimer(lapic_cal_pm2 - lapic_cal_pm1,
&delta, &deltatsc);
- /* Calculate the scaled math multiplication factor */
- lapic_clockevent.mult = div_sc(delta, TICK_NSEC * LAPIC_CAL_LOOPS,
- lapic_clockevent.shift);
- lapic_clockevent.max_delta_ns =
- clockevent_delta2ns(0x7FFFFF, &lapic_clockevent);
- lapic_clockevent.min_delta_ns =
- clockevent_delta2ns(0xF, &lapic_clockevent);
+ lapic_timer_period = (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS;
+ lapic_init_clockevent();
- calibration_result = (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS;
+ apic_pr_verbose("..... delta %ld\n", delta);
+ apic_pr_verbose("..... mult: %u\n", lapic_clockevent.mult);
+ apic_pr_verbose("..... calibration result: %u\n", lapic_timer_period);
- apic_printk(APIC_VERBOSE, "..... delta %ld\n", delta);
- apic_printk(APIC_VERBOSE, "..... mult: %u\n", lapic_clockevent.mult);
- apic_printk(APIC_VERBOSE, "..... calibration result: %u\n",
- calibration_result);
+ if (boot_cpu_has(X86_FEATURE_TSC)) {
+ delta_tsc_khz = (deltatsc * HZ) / (1000 * LAPIC_CAL_LOOPS);
- if (cpu_has_tsc) {
- apic_printk(APIC_VERBOSE, "..... CPU clock speed is "
- "%ld.%04ld MHz.\n",
- (deltatsc / LAPIC_CAL_LOOPS) / (1000000 / HZ),
- (deltatsc / LAPIC_CAL_LOOPS) % (1000000 / HZ));
+ apic_pr_verbose("..... CPU clock speed is %ld.%03ld MHz.\n",
+ delta_tsc_khz / 1000, delta_tsc_khz % 1000);
}
- apic_printk(APIC_VERBOSE, "..... host bus clock speed is "
- "%u.%04u MHz.\n",
- calibration_result / (1000000 / HZ),
- calibration_result % (1000000 / HZ));
+ bus_khz = (long)lapic_timer_period * HZ / 1000;
+ apic_pr_verbose("..... host bus clock speed is %ld.%03ld MHz.\n",
+ bus_khz / 1000, bus_khz % 1000);
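
A worked example of the calibration arithmetic: delta is how far the APIC
count dropped across LAPIC_CAL_LOOPS ticks with the divider at APIC_DIVISOR,
so scaling it back up yields ticks per HZ, and from there the bus clock. The
numbers are hypothetical:

    #include <stdio.h>

    #define APIC_DIVISOR    16
    #define HZ              1000        /* assumed config */
    #define LAPIC_CAL_LOOPS (HZ / 10)   /* as defined earlier in apic.c */

    int main(void)
    {
        long delta = 625000;    /* hypothetical APIC count drop */
        unsigned int lapic_timer_period =
            (delta * APIC_DIVISOR) / LAPIC_CAL_LOOPS;
        long bus_khz = (long)lapic_timer_period * HZ / 1000;

        printf("%u ticks per HZ -> host bus %ld.%03ld MHz\n",
               lapic_timer_period, bus_khz / 1000, bus_khz % 1000);
        return 0;
    }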
/*
* Do a sanity check on the APIC calibration result
*/
- if (calibration_result < (1000000 / HZ)) {
+ if (lapic_timer_period < (1000000 / HZ)) {
local_irq_enable();
- pr_warning("APIC frequency too slow, disabling apic timer\n");
+ pr_warn("APIC frequency too slow, disabling apic timer\n");
return -1;
}
levt->features &= ~CLOCK_EVT_FEAT_DUMMY;
/*
- * PM timer calibration failed or not turned on
- * so lets try APIC timer based calibration
+ * PM timer calibration failed or not turned on so lets try APIC
+ * timer based calibration, if a global clockevent device is
+ * available.
*/
- if (!pm_referenced) {
- apic_printk(APIC_VERBOSE, "... verify APIC timer\n");
+ if (!pm_referenced && global_clock_event) {
+ apic_pr_verbose("... verify APIC timer\n");
/*
* Setup the apic timer manually
*/
levt->event_handler = lapic_cal_handler;
- lapic_timer_setup(CLOCK_EVT_MODE_PERIODIC, levt);
+ lapic_timer_set_periodic(levt);
lapic_cal_loops = -1;
/* Let the interrupts run */
@@ -690,23 +940,24 @@ static int __init calibrate_APIC_clock(void)
cpu_relax();
/* Stop the lapic timer */
- lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, levt);
+ local_irq_disable();
+ lapic_timer_shutdown(levt);
/* Jiffies delta */
deltaj = lapic_cal_j2 - lapic_cal_j1;
- apic_printk(APIC_VERBOSE, "... jiffies delta = %lu\n", deltaj);
+ apic_pr_verbose("... jiffies delta = %lu\n", deltaj);
/* Check, if the jiffies result is consistent */
if (deltaj >= LAPIC_CAL_LOOPS-2 && deltaj <= LAPIC_CAL_LOOPS+2)
- apic_printk(APIC_VERBOSE, "... jiffies result ok\n");
+ apic_pr_verbose("... jiffies result ok\n");
else
levt->features |= CLOCK_EVT_FEAT_DUMMY;
- } else
- local_irq_enable();
+ }
+ local_irq_enable();
if (levt->features & CLOCK_EVT_FEAT_DUMMY) {
- pr_warning("APIC timer disabled due to verification failure\n");
- return -1;
+ pr_warn("APIC timer disabled due to verification failure\n");
+ return -1;
}
return 0;
@@ -735,9 +986,6 @@ void __init setup_boot_APIC_clock(void)
return;
}
- apic_printk(APIC_VERBOSE, "Using local APIC timer interrupts.\n"
- "calibrating APIC timer ...\n");
-
if (calibrate_APIC_clock()) {
/* No broadcast on UP ! */
if (num_possible_cpus() > 1)
@@ -750,19 +998,17 @@ void __init setup_boot_APIC_clock(void)
* PIT/HPET going. Otherwise register lapic as a dummy
* device.
*/
- if (nmi_watchdog != NMI_IO_APIC)
- lapic_clockevent.features &= ~CLOCK_EVT_FEAT_DUMMY;
- else
- pr_warning("APIC timer registered as dummy,"
- " due to nmi_watchdog=%d!\n", nmi_watchdog);
+ lapic_clockevent.features &= ~CLOCK_EVT_FEAT_DUMMY;
/* Setup the lapic or request the broadcast */
setup_APIC_timer();
+ amd_e400_c1e_apic_setup();
}
-void __cpuinit setup_secondary_APIC_clock(void)
+void setup_secondary_APIC_clock(void)
{
setup_APIC_timer();
+ amd_e400_c1e_apic_setup();
}
/*
@@ -770,8 +1016,7 @@ void __cpuinit setup_secondary_APIC_clock(void)
*/
static void local_apic_timer_interrupt(void)
{
- int cpu = smp_processor_id();
- struct clock_event_device *evt = &per_cpu(lapic_events, cpu);
+ struct clock_event_device *evt = this_cpu_ptr(&lapic_events);
/*
* Normally we should not be here till LAPIC has been initialized but
@@ -785,9 +1030,10 @@ static void local_apic_timer_interrupt(void)
* spurious.
*/
if (!evt->event_handler) {
- pr_warning("Spurious LAPIC timer interrupt on cpu %d\n", cpu);
+ pr_warn("Spurious LAPIC timer interrupt on cpu %d\n",
+ smp_processor_id());
/* Switch it off */
- lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, evt);
+ lapic_timer_shutdown(evt);
return;
}
@@ -807,33 +1053,18 @@ static void local_apic_timer_interrupt(void)
* [ if a single-CPU system runs an SMP kernel then we call the local
* interrupt as well. Thus we cannot inline the local irq ... ]
*/
-void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_apic_timer_interrupt)
{
struct pt_regs *old_regs = set_irq_regs(regs);
- /*
- * NOTE! We'd better ACK the irq immediately,
- * because timer handling can be slow.
- */
- ack_APIC_irq();
- /*
- * update_process_times() expects us to have done irq_enter().
- * Besides, if we don't timer interrupts ignore the global
- * interrupt lock, which is the WrongThing (tm) to do.
- */
- exit_idle();
- irq_enter();
+ apic_eoi();
+ trace_local_timer_entry(LOCAL_TIMER_VECTOR);
local_apic_timer_interrupt();
- irq_exit();
+ trace_local_timer_exit(LOCAL_TIMER_VECTOR);
set_irq_regs(old_regs);
}
-int setup_profiling_timer(unsigned int multiplier)
-{
- return -EINVAL;
-}
-
/*
* Local APIC start and shutdown
*/
@@ -850,8 +1081,7 @@ void clear_local_APIC(void)
int maxlvt;
u32 v;
- /* APIC hasn't been mapped yet */
- if (!x2apic_mode && !apic_phys)
+ if (!apic_accessible())
return;
maxlvt = lapic_get_maxlvt();
@@ -914,25 +1144,40 @@ void clear_local_APIC(void)
}
/**
- * disable_local_APIC - clear and disable the local APIC
+ * apic_soft_disable - Clears and software disables the local APIC on hotplug
+ *
+ * Contrary to disable_local_APIC() this does not touch the enable bit in
+ * MSR_IA32_APICBASE. Clearing that bit on systems based on the 3 wire APIC
+ * bus would require a hardware reset as the APIC would lose track of bus
+ * arbitration. On systems with FSB delivery APICBASE could be disabled,
+ * but it has to be guaranteed that no interrupt is sent to the APIC while
+ * in that state and it's not clear from the SDM whether it still responds
+ * to INIT/SIPI messages. Stay on the safe side and use software disable.
*/
-void disable_local_APIC(void)
+void apic_soft_disable(void)
{
- unsigned int value;
-
- /* APIC hasn't been mapped yet */
- if (!x2apic_mode && !apic_phys)
- return;
+ u32 value;
clear_local_APIC();
- /*
- * Disable APIC (implies clearing of registers
- * for 82489DX!).
- */
+ /* Soft disable APIC (implies clearing of registers for 82489DX!). */
value = apic_read(APIC_SPIV);
value &= ~APIC_SPIV_APIC_ENABLED;
apic_write(APIC_SPIV, value);
+}
+
+/**
+ * disable_local_APIC - clear and disable the local APIC
+ */
+void disable_local_APIC(void)
+{
+ if (!apic_accessible())
+ return;
+
+ if (apic->teardown)
+ apic->teardown();
+
+ apic_soft_disable();
#ifdef CONFIG_X86_32
/*
@@ -959,7 +1204,7 @@ void lapic_shutdown(void)
{
unsigned long flags;
- if (!cpu_has_apic && !apic_from_smp_config())
+ if (!boot_cpu_has(X86_FEATURE_APIC) && !apic_from_smp_config())
return;
local_irq_save(flags);
@@ -975,67 +1220,6 @@ void lapic_shutdown(void)
local_irq_restore(flags);
}
-/*
- * This is to verify that we're looking at a real local APIC.
- * Check these against your board if the CPUs aren't getting
- * started for no apparent reason.
- */
-int __init verify_local_APIC(void)
-{
- unsigned int reg0, reg1;
-
- /*
- * The version register is read-only in a real APIC.
- */
- reg0 = apic_read(APIC_LVR);
- apic_printk(APIC_DEBUG, "Getting VERSION: %x\n", reg0);
- apic_write(APIC_LVR, reg0 ^ APIC_LVR_MASK);
- reg1 = apic_read(APIC_LVR);
- apic_printk(APIC_DEBUG, "Getting VERSION: %x\n", reg1);
-
- /*
- * The two version reads above should print the same
- * numbers. If the second one is different, then we
- * poke at a non-APIC.
- */
- if (reg1 != reg0)
- return 0;
-
- /*
- * Check if the version looks reasonably.
- */
- reg1 = GET_APIC_VERSION(reg0);
- if (reg1 == 0x00 || reg1 == 0xff)
- return 0;
- reg1 = lapic_get_maxlvt();
- if (reg1 < 0x02 || reg1 == 0xff)
- return 0;
-
- /*
- * The ID register is read/write in a real APIC.
- */
- reg0 = apic_read(APIC_ID);
- apic_printk(APIC_DEBUG, "Getting ID: %x\n", reg0);
- apic_write(APIC_ID, reg0 ^ apic->apic_id_mask);
- reg1 = apic_read(APIC_ID);
- apic_printk(APIC_DEBUG, "Getting ID: %x\n", reg1);
- apic_write(APIC_ID, reg0);
- if (reg1 != (reg0 ^ apic->apic_id_mask))
- return 0;
-
- /*
- * The next two are just to see if we have sane values.
- * They're only really relevant if we're in Virtual Wire
- * compatibility mode, but most boxes are anymore.
- */
- reg0 = apic_read(APIC_LVT0);
- apic_printk(APIC_DEBUG, "Getting LVT0: %x\n", reg0);
- reg1 = apic_read(APIC_LVT1);
- apic_printk(APIC_DEBUG, "Getting LVT1: %x\n", reg1);
-
- return 1;
-}
-
/**
* sync_Arb_IDs - synchronize APIC bus arbitration IDs
*/
@@ -1053,9 +1237,71 @@ void __init sync_Arb_IDs(void)
*/
apic_wait_icr_idle();
- apic_printk(APIC_DEBUG, "Synchronizing Arb IDs.\n");
- apic_write(APIC_ICR, APIC_DEST_ALLINC |
- APIC_INT_LEVELTRIG | APIC_DM_INIT);
+ apic_pr_debug("Synchronizing Arb IDs.\n");
+ apic_write(APIC_ICR, APIC_DEST_ALLINC | APIC_INT_LEVELTRIG | APIC_DM_INIT);
+}
+
+enum apic_intr_mode_id apic_intr_mode __ro_after_init;
+
+static int __init __apic_intr_mode_select(void)
+{
+ /* Check kernel option */
+ if (apic_is_disabled) {
+ pr_info("APIC disabled via kernel command line\n");
+ return APIC_PIC;
+ }
+
+ /* Check BIOS */
+#ifdef CONFIG_X86_64
+ /* On 64-bit, the APIC must be integrated; check the local APIC only */
+ if (!boot_cpu_has(X86_FEATURE_APIC)) {
+ apic_is_disabled = true;
+ pr_info("APIC disabled by BIOS\n");
+ return APIC_PIC;
+ }
+#else
+ /* On 32-bit, the APIC may be integrated APIC or 82489DX */
+
+ /* Neither 82489DX nor integrated APIC ? */
+ if (!boot_cpu_has(X86_FEATURE_APIC) && !smp_found_config) {
+ apic_is_disabled = true;
+ return APIC_PIC;
+ }
+
+ /* If the BIOS pretends there is an integrated APIC ? */
+ if (!boot_cpu_has(X86_FEATURE_APIC) &&
+ APIC_INTEGRATED(boot_cpu_apic_version)) {
+ apic_is_disabled = true;
+ pr_err(FW_BUG "Local APIC not detected, force emulation\n");
+ return APIC_PIC;
+ }
+#endif
+
+ /* Check MP table or ACPI MADT configuration */
+ if (!smp_found_config) {
+ disable_ioapic_support();
+ if (!acpi_lapic) {
+ pr_info("APIC: ACPI MADT or MP tables are not detected\n");
+ return APIC_VIRTUAL_WIRE_NO_CONFIG;
+ }
+ return APIC_VIRTUAL_WIRE;
+ }
+
+#ifdef CONFIG_SMP
+ /* If SMP should be disabled, then really disable it! */
+ if (!setup_max_cpus) {
+ pr_info("APIC: SMP mode deactivated\n");
+ return APIC_SYMMETRIC_IO_NO_ROUTING;
+ }
+#endif
+
+ return APIC_SYMMETRIC_IO;
+}
+
+/* Select the interrupt delivery mode for the BSP */
+void __init apic_intr_mode_select(void)
+{
+ apic_intr_mode = __apic_intr_mode_select();
}
/*
@@ -1069,7 +1315,7 @@ void __init init_bsp_APIC(void)
* Don't do the setup now if we have a SMP BIOS as the
* through-I/O-APIC virtual wire mode might be active.
*/
- if (smp_found_config || !cpu_has_apic)
+ if (smp_found_config || !boot_cpu_has(X86_FEATURE_APIC))
return;
/*
@@ -1102,10 +1348,46 @@ void __init init_bsp_APIC(void)
value = APIC_DM_NMI;
if (!lapic_is_integrated()) /* 82489DX */
value |= APIC_LVT_LEVEL_TRIGGER;
+ if (apic_extnmi == APIC_EXTNMI_NONE)
+ value |= APIC_LVT_MASKED;
apic_write(APIC_LVT1, value);
}
-static void __cpuinit lapic_setup_esr(void)
+static void __init apic_bsp_setup(bool upmode);
+
+/* Init the interrupt delivery mode for the BSP */
+void __init apic_intr_mode_init(void)
+{
+ bool upmode = IS_ENABLED(CONFIG_UP_LATE_INIT);
+
+ switch (apic_intr_mode) {
+ case APIC_PIC:
+ pr_info("APIC: Keep in PIC mode(8259)\n");
+ return;
+ case APIC_VIRTUAL_WIRE:
+ pr_info("APIC: Switch to virtual wire mode setup\n");
+ break;
+ case APIC_VIRTUAL_WIRE_NO_CONFIG:
+ pr_info("APIC: Switch to virtual wire mode setup with no configuration\n");
+ upmode = true;
+ break;
+ case APIC_SYMMETRIC_IO:
+ pr_info("APIC: Switch to symmetric I/O mode setup\n");
+ break;
+ case APIC_SYMMETRIC_IO_NO_ROUTING:
+ pr_info("APIC: Switch to symmetric I/O mode setup in no SMP routine\n");
+ break;
+ }
+
+ x86_64_probe_apic();
+
+ if (x86_platform.apic_post_init)
+ x86_platform.apic_post_init();
+
+ apic_bsp_setup(upmode);
+}
+
+static void lapic_setup_esr(void)
{
unsigned int oldvalue, value, maxlvt;
@@ -1140,31 +1422,105 @@ static void __cpuinit lapic_setup_esr(void)
if (maxlvt > 3)
apic_write(APIC_ESR, 0);
value = apic_read(APIC_ESR);
- if (value != oldvalue)
- apic_printk(APIC_VERBOSE, "ESR value before enabling "
- "vector: 0x%08x after: 0x%08x\n",
- oldvalue, value);
+ if (value != oldvalue) {
+ apic_pr_verbose("ESR value before enabling vector: 0x%08x after: 0x%08x\n",
+ oldvalue, value);
+ }
+}
+
+#define APIC_IR_REGS APIC_ISR_NR
+#define APIC_IR_BITS (APIC_IR_REGS * 32)
+#define APIC_IR_MAPSIZE (APIC_IR_BITS / BITS_PER_LONG)
+
+union apic_ir {
+ unsigned long map[APIC_IR_MAPSIZE];
+ u32 regs[APIC_IR_REGS];
+};
+
+static bool apic_check_and_eoi_isr(union apic_ir *isr)
+{
+ int i, bit;
+
+ /* Read the ISRs */
+ for (i = 0; i < APIC_IR_REGS; i++)
+ isr->regs[i] = apic_read(APIC_ISR + i * 0x10);
+
+ /* If the ISR map is empty, nothing to do here. */
+ if (bitmap_empty(isr->map, APIC_IR_BITS))
+ return true;
+
+ /*
+ * There can be multiple ISR bits set when a high priority
+ * interrupt preempted a lower priority one. Issue an EOI for each
+ * set bit. The priority traversal order does not matter as there
+ * can't be new ISR bits raised at this point. What matters is that
+ * an EOI is issued for each ISR bit.
+ */
+ for_each_set_bit(bit, isr->map, APIC_IR_BITS)
+ apic_eoi();
+
+ /* Reread the ISRs, they should be empty now */
+ for (i = 0; i < APIC_IR_REGS; i++)
+ isr->regs[i] = apic_read(APIC_ISR + i * 0x10);
+
+ return bitmap_empty(isr->map, APIC_IR_BITS);
}
+/*
+ * If a CPU services an interrupt and crashes before issuing EOI to the
+ * local APIC, the corresponding ISR bit is still set when the crashing CPU
+ * jumps into a crash kernel. Read the ISR and issue an EOI for each set
+ * bit to acknowledge it as otherwise these slots would be locked forever
+ * waiting for an EOI.
+ *
+ * If there are pending bits in the IRR, then they won't be converted into
+ * ISR bits as the CPU has interrupts disabled. They will be delivered once
+ * the CPU enables interrupts and there is nothing which can prevent that.
+ *
+ * In the worst case this results in spurious interrupt warnings.
+ */
+static void apic_clear_isr(void)
+{
+ union apic_ir ir;
+ unsigned int i;
+
+ if (!apic_check_and_eoi_isr(&ir))
+ pr_warn("APIC: Stale ISR: %256pb\n", ir.map);
+
+ for (i = 0; i < APIC_IR_REGS; i++)
+ ir.regs[i] = apic_read(APIC_IRR + i * 0x10);
+
+ if (!bitmap_empty(ir.map, APIC_IR_BITS))
+ pr_warn("APIC: Stale IRR: %256pb\n", ir.map);
+}
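[Editor's note: the union above overlays the eight 32-bit ISR/IRR snapshots on an unsigned-long array so the generic bitmap helpers can scan them as one flat bitmap. A minimal userspace sketch of the same trick, assuming the x86 layout of eight 32-bit in-service registers; the kernel constants are replaced with local ones and this is illustrative only:]

    #include <limits.h>
    #include <stdio.h>

    #define IR_REGS    8                                /* APIC_ISR_NR on x86 */
    #define IR_BITS    (IR_REGS * 32)
    #define LONG_BITS  (sizeof(unsigned long) * CHAR_BIT)

    union ir {
        unsigned long map[IR_BITS / LONG_BITS];
        unsigned int  regs[IR_REGS];
    };

    int main(void)
    {
        union ir isr = { .map = { 0 } };
        int bit;

        /* Pretend vector 0x25 is in service: register 1, bit 5 */
        isr.regs[0x25 / 32] = 1u << (0x25 % 32);

        /* Scan the same storage as one bitmap, like for_each_set_bit() */
        for (bit = 0; bit < IR_BITS; bit++) {
            if (isr.map[bit / LONG_BITS] & (1ul << (bit % LONG_BITS)))
                printf("vector 0x%02x needs an EOI\n", bit);
        }
        return 0;
    }

[The overlay relies on x86 being little-endian; register n, bit m lands on flat bit 32*n + m, which is exactly the vector number.]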
/**
* setup_local_APIC - setup the local APIC
+ *
+ * Used to set up the local APIC while initializing the BSP or bringing up APs.
+ * Always called with preemption disabled.
*/
-void __cpuinit setup_local_APIC(void)
+static void setup_local_APIC(void)
{
- unsigned int value, queued;
- int i, j, acked = 0;
- unsigned long long tsc = 0, ntsc;
- long long max_loops = cpu_khz;
-
- if (cpu_has_tsc)
- rdtscll(tsc);
+ int cpu = smp_processor_id();
+ unsigned int value;
- if (disable_apic) {
- arch_disable_smp_support();
+ if (apic_is_disabled) {
+ disable_ioapic_support();
return;
}
+ if (apic->setup)
+ apic->setup();
+
+ /*
+ * If this comes from kexec/kcrash the APIC might be enabled in
+ * SPIV. Soft disable it before doing further initialization.
+ */
+ value = apic_read(APIC_SPIV);
+ value &= ~APIC_SPIV_APIC_ENABLED;
+ apic_write(APIC_SPIV, value);
+
#ifdef CONFIG_X86_32
/* Pound the ESR really hard over the head with a big hammer - mbligh */
if (lapic_is_integrated() && apic->disable_esr) {
@@ -1174,68 +1530,28 @@ void __cpuinit setup_local_APIC(void)
apic_write(APIC_ESR, 0);
}
#endif
- perf_events_lapic_init();
-
- preempt_disable();
-
- /*
- * Double-check whether this APIC is really registered.
- * This is meaningless in clustered apic mode, so we skip it.
- */
- BUG_ON(!apic->apic_id_registered());
-
/*
* Intel recommends to set DFR, LDR and TPR before enabling
* an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel
- * document number 292116). So here it goes...
+ * document number 292116).
+ *
+ * Except for APICs which operate in physical destination mode.
*/
- apic->init_apic_ldr();
+ if (apic->init_apic_ldr)
+ apic->init_apic_ldr();
/*
- * Set Task Priority to 'accept all'. We never change this
- * later on.
+ * Set Task Priority to 'accept all except vectors 0-31'. An APIC
+ * vector in the 16-31 range could be delivered if TPR == 0, but we
+ * would think it's an exception and terrible things will happen. We
+ * never change this later on.
*/
value = apic_read(APIC_TASKPRI);
value &= ~APIC_TPRI_MASK;
+ value |= 0x10;
apic_write(APIC_TASKPRI, value);
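[Editor's note: a quick illustration of the TPR arithmetic behind the 0x10 above (a sketch, not kernel code): the task-priority class is the upper nibble of TPR, and fixed interrupts whose vector class is at or below it are held back, which is what keeps vectors 0-31 out:]

    #include <assert.h>

    /* With TPR = 0x10 the priority class is 1, so vector classes 0 and 1
     * -- i.e. vectors 0x00-0x1f, the exception range -- cannot be taken. */
    static int vector_inhibited(unsigned int tpr, unsigned int vector)
    {
        return (vector >> 4) <= (tpr >> 4);
    }

    int main(void)
    {
        assert(vector_inhibited(0x10, 0x1f));   /* exception range blocked */
        assert(!vector_inhibited(0x10, 0x20));  /* first usable vector ok */
        return 0;
    }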
- /*
- * After a crash, we no longer service the interrupts and a pending
- * interrupt from previous kernel might still have ISR bit set.
- *
- * Most probably by now CPU has serviced that pending interrupt and
- * it might not have done the ack_APIC_irq() because it thought,
- * interrupt came from i8259 as ExtInt. LAPIC did not get EOI so it
- * does not clear the ISR bit and cpu thinks it has already serivced
- * the interrupt. Hence a vector might get locked. It was noticed
- * for timer irq (vector 0x31). Issue an extra EOI to clear ISR.
- */
- do {
- queued = 0;
- for (i = APIC_ISR_NR - 1; i >= 0; i--)
- queued |= apic_read(APIC_IRR + i*0x10);
-
- for (i = APIC_ISR_NR - 1; i >= 0; i--) {
- value = apic_read(APIC_ISR + i*0x10);
- for (j = 31; j >= 0; j--) {
- if (value & (1<<j)) {
- ack_APIC_irq();
- acked++;
- }
- }
- }
- if (acked > 256) {
- printk(KERN_ERR "LAPIC pending interrupts after %d EOI\n",
- acked);
- break;
- }
- if (cpu_has_tsc) {
- rdtscll(ntsc);
- max_loops = (cpu_khz << 10) - (ntsc - tsc);
- } else
- max_loops--;
- } while (queued && max_loops > 0);
- WARN_ON(max_loops <= 0);
+ apic_clear_isr();
/*
* Now that we are all set up, enable the APIC
@@ -1263,9 +1579,8 @@ void __cpuinit setup_local_APIC(void)
*/
/*
* Actually disabling the focus CPU check just makes the hang less
- * frequent as it makes the interrupt distributon model be more
+ * frequent as it makes the interrupt distribution model be more
* like LRU than MRU (the short-term load is more even across CPUs).
- * See also the comment in end_level_ioapic_irq(). --macro
*/
/*
@@ -1282,10 +1597,12 @@ void __cpuinit setup_local_APIC(void)
value |= SPURIOUS_APIC_VECTOR;
apic_write(APIC_SPIV, value);
+ perf_events_lapic_init();
+
/*
* Set up LVT0, LVT1:
*
- * set up through-local-APIC on the BP's LINT0. This is not
+ * set up through-local-APIC on the boot CPU's LINT0. This is not
* strictly necessary in pure symmetric-IO mode, but sometimes
* we delegate interrupts to the 8259A.
*/
@@ -1293,38 +1610,38 @@ void __cpuinit setup_local_APIC(void)
* TODO: set up through-local-APIC from through-I/O-APIC? --macro
*/
value = apic_read(APIC_LVT0) & APIC_LVT_MASKED;
- if (!smp_processor_id() && (pic_mode || !value)) {
+ if (!cpu && (pic_mode || !value || ioapic_is_disabled)) {
value = APIC_DM_EXTINT;
- apic_printk(APIC_VERBOSE, "enabled ExtINT on CPU#%d\n",
- smp_processor_id());
+ apic_pr_verbose("Enabled ExtINT on CPU#%d\n", cpu);
} else {
value = APIC_DM_EXTINT | APIC_LVT_MASKED;
- apic_printk(APIC_VERBOSE, "masked ExtINT on CPU#%d\n",
- smp_processor_id());
+ apic_pr_verbose("Masked ExtINT on CPU#%d\n", cpu);
}
apic_write(APIC_LVT0, value);
/*
- * only the BP should see the LINT1 NMI signal, obviously.
+ * Only the BSP sees the LINT1 NMI signal by default. This can be
+ * modified by apic_extnmi= boot option.
*/
- if (!smp_processor_id())
+ if ((!cpu && apic_extnmi != APIC_EXTNMI_NONE) ||
+ apic_extnmi == APIC_EXTNMI_ALL)
value = APIC_DM_NMI;
else
value = APIC_DM_NMI | APIC_LVT_MASKED;
- if (!lapic_is_integrated()) /* 82489DX */
+
+ /* Is this an 82489DX? */
+ if (!lapic_is_integrated())
value |= APIC_LVT_LEVEL_TRIGGER;
apic_write(APIC_LVT1, value);
- preempt_enable();
-
#ifdef CONFIG_X86_MCE_INTEL
/* Recheck CMCI information after local APIC is up on CPU #0 */
- if (smp_processor_id() == 0)
+ if (!cpu)
cmci_recheck();
#endif
}
-void __cpuinit end_local_APIC_setup(void)
+static void end_local_APIC_setup(void)
{
lapic_setup_esr();
@@ -1338,129 +1655,281 @@ void __cpuinit end_local_APIC_setup(void)
}
#endif
- setup_apic_nmi_watchdog(NULL);
apic_pm_activate();
}
+/*
+ * APIC setup function for application processors. Called from smpboot.c
+ */
+void apic_ap_setup(void)
+{
+ setup_local_APIC();
+ end_local_APIC_setup();
+}
+
+static __init void apic_read_boot_cpu_id(bool x2apic)
+{
+ /*
+ * This can be invoked from check_x2apic() before the APIC has been
+ * selected. But that code knows for sure that the BIOS enabled
+ * X2APIC.
+ */
+ if (x2apic) {
+ boot_cpu_physical_apicid = native_apic_msr_read(APIC_ID);
+ boot_cpu_apic_version = GET_APIC_VERSION(native_apic_msr_read(APIC_LVR));
+ } else {
+ boot_cpu_physical_apicid = read_apic_id();
+ boot_cpu_apic_version = GET_APIC_VERSION(apic_read(APIC_LVR));
+ }
+ topology_register_boot_apic(boot_cpu_physical_apicid);
+}
+
#ifdef CONFIG_X86_X2APIC
-void check_x2apic(void)
+int x2apic_mode;
+EXPORT_SYMBOL_GPL(x2apic_mode);
+
+enum {
+ X2APIC_OFF,
+ X2APIC_DISABLED,
+ /* All states below here have X2APIC enabled */
+ X2APIC_ON,
+ X2APIC_ON_LOCKED
+};
+static int x2apic_state;
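[Editor's note: the ordering of the enum above is deliberate -- every state from X2APIC_ON upwards means the hardware is enabled, so later checks need only a single comparison. A small sketch of the idiom:]

    #include <assert.h>

    enum { X2OFF, X2DISABLED, X2ON, X2ON_LOCKED };  /* mirrors the states above */

    /* Every "x2apic_state < X2APIC_ON" test in the patch reduces to this */
    static int x2apic_active(int state)
    {
        return state >= X2ON;       /* true only for X2ON and X2ON_LOCKED */
    }

    int main(void)
    {
        assert(!x2apic_active(X2DISABLED));
        assert(x2apic_active(X2ON_LOCKED));
        return 0;
    }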
+
+static bool x2apic_hw_locked(void)
{
- if (x2apic_enabled()) {
- pr_info("x2apic enabled by BIOS, switching to x2apic ops\n");
- x2apic_preenabled = x2apic_mode = 1;
+ u64 x86_arch_cap_msr;
+ u64 msr;
+
+ x86_arch_cap_msr = x86_read_arch_cap_msr();
+ if (x86_arch_cap_msr & ARCH_CAP_XAPIC_DISABLE) {
+ rdmsrq(MSR_IA32_XAPIC_DISABLE_STATUS, msr);
+ return (msr & LEGACY_XAPIC_DISABLED);
}
+ return false;
}
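[Editor's note: the lock test above combines two MSRs. An illustrative sketch with the MSR reads stubbed out; the bit positions reflect my reading of msr-index.h and should be treated as assumptions:]

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define ARCH_CAP_XAPIC_DISABLE (1ull << 21)  /* assumed: ARCH_CAPABILITIES bit 21 */
    #define LEGACY_XAPIC_DISABLED  (1ull << 0)   /* assumed: XAPIC_DISABLE_STATUS bit 0 */

    /* Locked when the capability is advertised AND legacy xAPIC is fused off */
    static bool x2apic_locked(uint64_t arch_cap, uint64_t xapic_status)
    {
        if (!(arch_cap & ARCH_CAP_XAPIC_DISABLE))
            return false;
        return xapic_status & LEGACY_XAPIC_DISABLED;
    }

    int main(void)
    {
        /* Capability absent -> never locked, regardless of status bits */
        assert(!x2apic_locked(0, LEGACY_XAPIC_DISABLED));
        /* Capability present and legacy xAPIC fused off -> locked */
        assert(x2apic_locked(ARCH_CAP_XAPIC_DISABLE, LEGACY_XAPIC_DISABLED));
        return 0;
    }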
-void enable_x2apic(void)
+static void __x2apic_disable(void)
{
- int msr, msr2;
+ u64 msr;
- if (!x2apic_mode)
+ if (!boot_cpu_has(X86_FEATURE_APIC))
return;
- rdmsr(MSR_IA32_APICBASE, msr, msr2);
- if (!(msr & X2APIC_ENABLE)) {
- printk_once(KERN_INFO "Enabling x2apic\n");
- wrmsr(MSR_IA32_APICBASE, msr | X2APIC_ENABLE, 0);
- }
+ rdmsrq(MSR_IA32_APICBASE, msr);
+ if (!(msr & X2APIC_ENABLE))
+ return;
+ /* Disable xapic and x2apic first and then reenable xapic mode */
+ wrmsrq(MSR_IA32_APICBASE, msr & ~(X2APIC_ENABLE | XAPIC_ENABLE));
+ wrmsrq(MSR_IA32_APICBASE, msr & ~X2APIC_ENABLE);
+ printk_once(KERN_INFO "x2apic disabled\n");
}
-#endif /* CONFIG_X86_X2APIC */
-int __init enable_IR(void)
+static void __x2apic_enable(void)
{
-#ifdef CONFIG_INTR_REMAP
- if (!intr_remapping_supported()) {
- pr_debug("intr-remapping not supported\n");
- return 0;
+ u64 msr;
+
+ rdmsrq(MSR_IA32_APICBASE, msr);
+ if (msr & X2APIC_ENABLE)
+ return;
+ wrmsrq(MSR_IA32_APICBASE, msr | X2APIC_ENABLE);
+ printk_once(KERN_INFO "x2apic enabled\n");
+}
+
+static int __init setup_nox2apic(char *str)
+{
+ if (x2apic_enabled()) {
+ u32 apicid = native_apic_msr_read(APIC_ID);
+
+ if (apicid >= 255) {
+ pr_warn("Apicid: %08x, cannot enforce nox2apic\n",
+ apicid);
+ return 0;
+ }
+ if (x2apic_hw_locked()) {
+ pr_warn("APIC locked in x2apic mode, can't disable\n");
+ return 0;
+ }
+ pr_warn("x2apic already enabled.\n");
+ __x2apic_disable();
}
+ setup_clear_cpu_cap(X86_FEATURE_X2APIC);
+ x2apic_state = X2APIC_DISABLED;
+ x2apic_mode = 0;
+ return 0;
+}
+early_param("nox2apic", setup_nox2apic);
- if (!x2apic_preenabled && skip_ioapic_setup) {
- pr_info("Skipped enabling intr-remap because of skipping "
- "io-apic setup\n");
- return 0;
+/* Called from cpu_init() to enable x2apic on (secondary) cpus */
+void x2apic_setup(void)
+{
+ /*
+ * Try to make the AP's APIC state match that of the BSP, but if the
+ * BSP is unlocked and the AP is locked then there is a state mismatch.
+ * Warn about the mismatch in case a GP fault occurs due to a locked AP
+ * trying to be turned off.
+ */
+ if (x2apic_state != X2APIC_ON_LOCKED && x2apic_hw_locked())
+ pr_warn("x2apic lock mismatch between BSP and AP.\n");
+ /*
+ * If x2apic is not in ON or LOCKED state, disable it if already enabled
+ * from BIOS.
+ */
+ if (x2apic_state < X2APIC_ON) {
+ __x2apic_disable();
+ return;
}
+ __x2apic_enable();
+}
- if (enable_intr_remapping(x2apic_supported()))
- return 0;
+static __init void apic_set_fixmap(bool read_apic);
- pr_info("Enabled Interrupt-remapping\n");
+static __init void x2apic_disable(void)
+{
+ u32 x2apic_id;
- return 1;
+ if (x2apic_state < X2APIC_ON)
+ return;
-#endif
- return 0;
+ x2apic_id = read_apic_id();
+ if (x2apic_id >= 255)
+ panic("Cannot disable x2apic, id: %08x\n", x2apic_id);
+
+ if (x2apic_hw_locked()) {
+ pr_warn("Cannot disable locked x2apic, id: %08x\n", x2apic_id);
+ return;
+ }
+
+ __x2apic_disable();
+
+ x2apic_mode = 0;
+ x2apic_state = X2APIC_DISABLED;
+
+ /*
+ * Don't reread the APIC ID as it was already done from
+ * check_x2apic() and the APIC driver still is a x2APIC variant,
+ * which fails to do the read after x2APIC was disabled.
+ */
+ apic_set_fixmap(false);
}
-void __init enable_IR_x2apic(void)
+static __init void x2apic_enable(void)
{
- unsigned long flags;
- struct IO_APIC_route_entry **ioapic_entries = NULL;
- int ret, x2apic_enabled = 0;
- int dmar_table_init_ret;
-
- dmar_table_init_ret = dmar_table_init();
- if (dmar_table_init_ret && !x2apic_supported())
+ if (x2apic_state != X2APIC_OFF)
return;
- ioapic_entries = alloc_ioapic_entries();
- if (!ioapic_entries) {
- pr_err("Allocate ioapic_entries failed\n");
- goto out;
- }
+ x2apic_mode = 1;
+ x2apic_state = X2APIC_ON;
+ __x2apic_enable();
+}
- ret = save_IO_APIC_setup(ioapic_entries);
- if (ret) {
- pr_info("Saving IO-APIC state failed: %d\n", ret);
- goto out;
- }
+static __init void try_to_enable_x2apic(int remap_mode)
+{
+ if (x2apic_state == X2APIC_DISABLED)
+ return;
- local_irq_save(flags);
- legacy_pic->mask_all();
- mask_IO_APIC_setup(ioapic_entries);
+ if (remap_mode != IRQ_REMAP_X2APIC_MODE) {
+ u32 apic_limit = 255;
- if (dmar_table_init_ret)
- ret = 0;
- else
- ret = enable_IR();
+ /*
+ * Using X2APIC without IR is not architecturally supported
+ * on bare metal but may be supported in guests.
+ */
+ if (!x86_init.hyper.x2apic_available()) {
+ pr_info("x2apic: IRQ remapping doesn't support X2APIC mode\n");
+ x2apic_disable();
+ return;
+ }
- if (!ret) {
- /* IR is required if there is APIC ID > 255 even when running
- * under KVM
+ /*
+ * If the hypervisor supports extended destination ID in
+ * MSI, that increases the maximum APIC ID that can be
+ * used for non-remapped IRQ domains.
*/
- if (max_physical_apicid > 255 || !kvm_para_available())
- goto nox2apic;
+ if (x86_init.hyper.msi_ext_dest_id()) {
+ virt_ext_dest_id = 1;
+ apic_limit = 32767;
+ }
+
/*
- * without IR all CPUs can be addressed by IOAPIC/MSI
- * only in physical mode
+ * Without IR, all CPUs can be addressed by IOAPIC/MSI only
+ * in physical mode, and CPUs with an APIC ID that cannot
+ * be addressed must not be brought online.
*/
- x2apic_force_phys();
+ x2apic_set_max_apicid(apic_limit);
+ x2apic_phys = 1;
}
+ x2apic_enable();
+}
- x2apic_enabled = 1;
-
- if (x2apic_supported() && !x2apic_mode) {
+void __init check_x2apic(void)
+{
+ if (x2apic_enabled()) {
+ pr_info("x2apic: enabled by BIOS, switching to x2apic ops\n");
x2apic_mode = 1;
- enable_x2apic();
- pr_info("Enabled x2apic\n");
+ if (x2apic_hw_locked())
+ x2apic_state = X2APIC_ON_LOCKED;
+ else
+ x2apic_state = X2APIC_ON;
+ apic_read_boot_cpu_id(true);
+ } else if (!boot_cpu_has(X86_FEATURE_X2APIC)) {
+ x2apic_state = X2APIC_DISABLED;
}
+}
+#else /* CONFIG_X86_X2APIC */
+void __init check_x2apic(void)
+{
+ if (!apic_is_x2apic_enabled())
+ return;
+ /*
+ * Checkme: Can we simply turn off x2APIC here instead of disabling the APIC?
+ */
+ pr_err("Kernel does not support x2APIC, please recompile with CONFIG_X86_X2APIC.\n");
+ pr_err("Disabling APIC, expect reduced performance and functionality.\n");
-nox2apic:
- if (!ret) /* IR enabling failed */
- restore_IO_APIC_setup(ioapic_entries);
- legacy_pic->restore_mask();
- local_irq_restore(flags);
+ apic_is_disabled = true;
+ setup_clear_cpu_cap(X86_FEATURE_APIC);
+}
-out:
- if (ioapic_entries)
- free_ioapic_entries(ioapic_entries);
+static inline void try_to_enable_x2apic(int remap_mode) { }
+static inline void __x2apic_enable(void) { }
+#endif /* !CONFIG_X86_X2APIC */
+
+void __init enable_IR_x2apic(void)
+{
+ unsigned long flags;
+ int ret, ir_stat;
- if (x2apic_enabled)
+ if (ioapic_is_disabled) {
+ pr_info("Not enabling interrupt remapping due to skipped IO-APIC setup\n");
+ return;
+ }
+
+ ir_stat = irq_remapping_prepare();
+ if (ir_stat < 0 && !x2apic_supported())
+ return;
+
+ ret = save_ioapic_entries();
+ if (ret) {
+ pr_info("Saving IO-APIC state failed: %d\n", ret);
return;
+ }
- if (x2apic_preenabled)
- panic("x2apic: enabled by BIOS but kernel init failed.");
- else if (cpu_has_x2apic)
- pr_info("Not enabling x2apic, Intr-remapping init failed.\n");
+ local_irq_save(flags);
+ legacy_pic->mask_all();
+ mask_ioapic_entries();
+
+ /* If irq_remapping_prepare() succeeded, try to enable it */
+ if (ir_stat >= 0)
+ ir_stat = irq_remapping_enable();
+ /* ir_stat contains the remap mode or an error code */
+ try_to_enable_x2apic(ir_stat);
+
+ if (ir_stat < 0)
+ restore_ioapic_entries();
+ legacy_pic->restore_mask();
+ local_irq_restore(flags);
}
#ifdef CONFIG_X86_64
@@ -1470,27 +1939,78 @@ out:
* On AMD64 we trust the BIOS - if it says no APIC it is likely
* not correctly set up (usually the APIC timer won't work etc.)
*/
-static int __init detect_init_APIC(void)
+static bool __init detect_init_APIC(void)
{
- if (!cpu_has_apic) {
+ if (!boot_cpu_has(X86_FEATURE_APIC)) {
pr_info("No local APIC present\n");
- return -1;
+ return false;
}
- mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
- return 0;
+ register_lapic_address(APIC_DEFAULT_PHYS_BASE);
+ return true;
}
#else
+
+static bool __init apic_verify(unsigned long addr)
+{
+ u32 features, h, l;
+
+ /*
+ * The APIC feature bit should now be enabled
+ * in `cpuid'
+ */
+ features = cpuid_edx(1);
+ if (!(features & (1 << X86_FEATURE_APIC))) {
+ pr_warn("Could not enable APIC!\n");
+ return false;
+ }
+ set_cpu_cap(&boot_cpu_data, X86_FEATURE_APIC);
+
+ /* The BIOS may have set up the APIC at some other address */
+ if (boot_cpu_data.x86 >= 6) {
+ rdmsr(MSR_IA32_APICBASE, l, h);
+ if (l & MSR_IA32_APICBASE_ENABLE)
+ addr = l & MSR_IA32_APICBASE_BASE;
+ }
+
+ register_lapic_address(addr);
+ pr_info("Found and enabled local APIC!\n");
+ return true;
+}
+
+bool __init apic_force_enable(unsigned long addr)
+{
+ u32 h, l;
+
+ if (apic_is_disabled)
+ return false;
+
+ /*
+ * Some BIOSes disable the local APIC in the APIC_BASE
+ * MSR. This can only be done in software for Intel P6 or later
+ * and AMD K7 (Model > 1) or later.
+ */
+ if (boot_cpu_data.x86 >= 6) {
+ rdmsr(MSR_IA32_APICBASE, l, h);
+ if (!(l & MSR_IA32_APICBASE_ENABLE)) {
+ pr_info("Local APIC disabled by BIOS -- reenabling.\n");
+ l &= ~MSR_IA32_APICBASE_BASE;
+ l |= MSR_IA32_APICBASE_ENABLE | addr;
+ wrmsr(MSR_IA32_APICBASE, l, h);
+ enabled_via_apicbase = 1;
+ }
+ }
+ return apic_verify(addr);
+}
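[Editor's note: a small sketch of the APIC_BASE MSR fields that apic_force_enable() pokes, using the standard layout (bit 11 = global enable, bits 35:12 = base address); the sample value is the architectural default base:]

    #include <assert.h>
    #include <stdint.h>

    #define APICBASE_ENABLE  (1u << 11)
    #define APICBASE_BASE    0xfffff000u     /* low 32 bits of the base mask */
    #define DEFAULT_BASE     0xfee00000u     /* APIC_DEFAULT_PHYS_BASE */

    int main(void)
    {
        uint32_t l = 0;                      /* BIOS left the APIC disabled */

        if (!(l & APICBASE_ENABLE)) {        /* same test as the code above */
            l &= ~APICBASE_BASE;
            l |= APICBASE_ENABLE | DEFAULT_BASE;
        }
        assert((l & APICBASE_BASE) == DEFAULT_BASE);
        assert(l & APICBASE_ENABLE);
        return 0;
    }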
+
/*
* Detect and initialize APIC
*/
-static int __init detect_init_APIC(void)
+static bool __init detect_init_APIC(void)
{
- u32 h, l, features;
-
/* Disabled by kernel option? */
- if (disable_apic)
- return -1;
+ if (apic_is_disabled)
+ return false;
switch (boot_cpu_data.x86_vendor) {
case X86_VENDOR_AMD:
@@ -1498,16 +2018,18 @@ static int __init detect_init_APIC(void)
(boot_cpu_data.x86 >= 15))
break;
goto no_apic;
+ case X86_VENDOR_HYGON:
+ break;
case X86_VENDOR_INTEL:
- if (boot_cpu_data.x86 == 6 || boot_cpu_data.x86 == 15 ||
- (boot_cpu_data.x86 == 5 && cpu_has_apic))
+ if ((boot_cpu_data.x86 == 5 && boot_cpu_has(X86_FEATURE_APIC)) ||
+ boot_cpu_data.x86_vfm >= INTEL_PENTIUM_PRO)
break;
goto no_apic;
default:
goto no_apic;
}
- if (!cpu_has_apic) {
+ if (!boot_cpu_has(X86_FEATURE_APIC)) {
/*
* Over-ride BIOS and try to enable the local APIC only if
* "lapic" specified.
@@ -1515,70 +2037,22 @@ static int __init detect_init_APIC(void)
if (!force_enable_local_apic) {
pr_info("Local APIC disabled by BIOS -- "
"you can enable it with \"lapic\"\n");
- return -1;
- }
- /*
- * Some BIOSes disable the local APIC in the APIC_BASE
- * MSR. This can only be done in software for Intel P6 or later
- * and AMD K7 (Model > 1) or later.
- */
- rdmsr(MSR_IA32_APICBASE, l, h);
- if (!(l & MSR_IA32_APICBASE_ENABLE)) {
- pr_info("Local APIC disabled by BIOS -- reenabling.\n");
- l &= ~MSR_IA32_APICBASE_BASE;
- l |= MSR_IA32_APICBASE_ENABLE | APIC_DEFAULT_PHYS_BASE;
- wrmsr(MSR_IA32_APICBASE, l, h);
- enabled_via_apicbase = 1;
+ return false;
}
+ if (!apic_force_enable(APIC_DEFAULT_PHYS_BASE))
+ return false;
+ } else {
+ if (!apic_verify(APIC_DEFAULT_PHYS_BASE))
+ return false;
}
- /*
- * The APIC feature bit should now be enabled
- * in `cpuid'
- */
- features = cpuid_edx(1);
- if (!(features & (1 << X86_FEATURE_APIC))) {
- pr_warning("Could not enable APIC!\n");
- return -1;
- }
- set_cpu_cap(&boot_cpu_data, X86_FEATURE_APIC);
- mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
-
- /* The BIOS may have set up the APIC at some other address */
- rdmsr(MSR_IA32_APICBASE, l, h);
- if (l & MSR_IA32_APICBASE_ENABLE)
- mp_lapic_addr = l & MSR_IA32_APICBASE_BASE;
-
- pr_info("Found and enabled local APIC!\n");
apic_pm_activate();
- return 0;
+ return true;
no_apic:
pr_info("No local APIC present or hardware disabled\n");
- return -1;
-}
-#endif
-
-#ifdef CONFIG_X86_64
-void __init early_init_lapic_mapping(void)
-{
- /*
- * If no local APIC can be found then go out
- * : it means there is no mpatable and MADT
- */
- if (!smp_found_config)
- return;
-
- set_fixmap_nocache(FIX_APIC_BASE, mp_lapic_addr);
- apic_printk(APIC_VERBOSE, "mapped APIC to %16lx (%16lx)\n",
- APIC_BASE, mp_lapic_addr);
-
- /*
- * Fetch the APIC ID of the BSP in case we have a
- * default configuration (or the MP table is broken).
- */
- boot_cpu_physical_apicid = read_apic_id();
+ return false;
}
#endif
@@ -1587,203 +2061,146 @@ void __init early_init_lapic_mapping(void)
*/
void __init init_apic_mappings(void)
{
- unsigned int new_apicid;
+ if (apic_validate_deadline_timer())
+ pr_info("TSC deadline timer available\n");
- if (x2apic_mode) {
- boot_cpu_physical_apicid = read_apic_id();
+ if (x2apic_mode)
return;
- }
- /* If no local APIC can be found return early */
- if (!smp_found_config && detect_init_APIC()) {
- /* lets NOP'ify apic operations */
- pr_info("APIC: disable apic facility\n");
- apic_disable();
- } else {
- apic_phys = mp_lapic_addr;
+ if (!smp_found_config) {
+ if (!detect_init_APIC()) {
+ pr_info("APIC: disable apic facility\n");
+ apic_disable();
+ }
+ }
+}
- /*
- * acpi lapic path already maps that address in
- * acpi_register_lapic_address()
- */
- if (!acpi_lapic)
- set_fixmap_nocache(FIX_APIC_BASE, apic_phys);
+static __init void apic_set_fixmap(bool read_apic)
+{
+ set_fixmap_nocache(FIX_APIC_BASE, mp_lapic_addr);
+ apic_mmio_base = APIC_BASE;
+ apic_pr_verbose("Mapped APIC to %16lx (%16lx)\n", apic_mmio_base, mp_lapic_addr);
+ if (read_apic)
+ apic_read_boot_cpu_id(false);
+}
- apic_printk(APIC_VERBOSE, "mapped APIC to %08lx (%08lx)\n",
- APIC_BASE, apic_phys);
- }
+void __init register_lapic_address(unsigned long address)
+{
+ /* This should only happen once */
+ WARN_ON_ONCE(mp_lapic_addr);
+ mp_lapic_addr = address;
- /*
- * Fetch the APIC ID of the BSP in case we have a
- * default configuration (or the MP table is broken).
- */
- new_apicid = read_apic_id();
- if (boot_cpu_physical_apicid != new_apicid) {
- boot_cpu_physical_apicid = new_apicid;
- /*
- * yeah -- we lie about apic_version
- * in case if apic was disabled via boot option
- * but it's not a problem for SMP compiled kernel
- * since smp_sanity_check is prepared for such a case
- * and disable smp mode
- */
- apic_version[new_apicid] =
- GET_APIC_VERSION(apic_read(APIC_LVR));
- }
+ if (!x2apic_mode)
+ apic_set_fixmap(true);
}
/*
- * This initializes the IO-APIC and APIC hardware if this is
- * a UP kernel.
+ * Local APIC interrupts
*/
-int apic_version[MAX_APICS];
-int __init APIC_init_uniprocessor(void)
+/*
+ * Common handling code for spurious_interrupt and spurious_vector entry
+ * points below. No point in allowing the compiler to inline it twice.
+ */
+static noinline void handle_spurious_interrupt(u8 vector)
{
- if (disable_apic) {
- pr_info("Apic disabled\n");
- return -1;
- }
-#ifdef CONFIG_X86_64
- if (!cpu_has_apic) {
- disable_apic = 1;
- pr_info("Apic disabled by BIOS\n");
- return -1;
- }
-#else
- if (!smp_found_config && !cpu_has_apic)
- return -1;
-
- /*
- * Complain if the BIOS pretends there is one.
- */
- if (!cpu_has_apic &&
- APIC_INTEGRATED(apic_version[boot_cpu_physical_apicid])) {
- pr_err("BIOS bug, local APIC 0x%x not detected!...\n",
- boot_cpu_physical_apicid);
- return -1;
- }
-#endif
+ u32 v;
-#ifndef CONFIG_SMP
- enable_IR_x2apic();
- default_setup_apic_routing();
-#endif
+ trace_spurious_apic_entry(vector);
- verify_local_APIC();
- connect_bsp_APIC();
+ inc_irq_stat(irq_spurious_count);
-#ifdef CONFIG_X86_64
- apic_write(APIC_ID, SET_APIC_ID(boot_cpu_physical_apicid));
-#else
/*
- * Hack: In case of kdump, after a crash, kernel might be booting
- * on a cpu with non-zero lapic id. But boot_cpu_physical_apicid
- * might be zero if read from MP tables. Get it from LAPIC.
+ * If this is a spurious interrupt then do not acknowledge
*/
-# ifdef CONFIG_CRASH_DUMP
- boot_cpu_physical_apicid = read_apic_id();
-# endif
-#endif
- physid_set_mask_of_physid(boot_cpu_physical_apicid, &phys_cpu_present_map);
- setup_local_APIC();
+ if (vector == SPURIOUS_APIC_VECTOR) {
+ /* See SDM vol 3 */
+ pr_info("Spurious APIC interrupt (vector 0xFF) on CPU#%d, should never happen.\n",
+ smp_processor_id());
+ goto out;
+ }
-#ifdef CONFIG_X86_IO_APIC
/*
- * Now enable IO-APICs, actually call clear_IO_APIC
- * We need clear_IO_APIC before enabling error vector
+ * If it is a vectored one, verify it's set in the ISR. If set,
+ * acknowledge it.
*/
- if (!skip_ioapic_setup && nr_ioapics)
- enable_IO_APIC();
-#endif
-
- end_local_APIC_setup();
-
-#ifdef CONFIG_X86_IO_APIC
- if (smp_found_config && !skip_ioapic_setup && nr_ioapics)
- setup_IO_APIC();
- else {
- nr_ioapics = 0;
- localise_nmi_watchdog();
+ v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
+ if (v & (1 << (vector & 0x1f))) {
+ pr_info("Spurious interrupt (vector 0x%02x) on CPU#%d. Acked\n",
+ vector, smp_processor_id());
+ apic_eoi();
+ } else {
+ pr_info("Spurious interrupt (vector 0x%02x) on CPU#%d. Not pending!\n",
+ vector, smp_processor_id());
}
-#else
- localise_nmi_watchdog();
-#endif
-
- x86_init.timers.setup_percpu_clockev();
-#ifdef CONFIG_X86_64
- check_nmi_watchdog();
-#endif
-
- return 0;
+out:
+ trace_spurious_apic_exit(vector);
}
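[Editor's note: the (vector & ~0x1f) >> 1 above is compact but worth unpacking: ISR registers cover 32 vectors each and sit 0x10 bytes apart, so masking off the low five bits and halving yields the register offset. A minimal check:]

    #include <assert.h>

    static unsigned int isr_reg_offset(unsigned int vector)
    {
        return (vector & ~0x1fu) >> 1;       /* == (vector / 32) * 0x10 */
    }

    int main(void)
    {
        unsigned int v;

        for (v = 0; v < 256; v++)
            assert(isr_reg_offset(v) == (v / 32) * 0x10);
        assert(isr_reg_offset(0xff) == 0x70); /* SPURIOUS_APIC_VECTOR's register */
        return 0;
    }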
-/*
- * Local APIC interrupts
- */
-
-/*
- * This interrupt should _never_ happen with our APIC/SMP architecture
+/**
+ * spurious_interrupt - Catch all for interrupts raised on unused vectors
+ * @regs: Pointer to pt_regs on stack
+ * @vector: The vector number
+ *
+ * This is invoked from ASM entry code to catch all interrupts which
+ * trigger on an entry which is routed to the common_spurious idtentry
+ * point.
*/
-void smp_spurious_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_IRQ(spurious_interrupt)
{
- u32 v;
-
- exit_idle();
- irq_enter();
- /*
- * Check if this really is a spurious interrupt and ACK it
- * if it is a vectored one. Just in case...
- * Spurious interrupts should not be ACKed.
- */
- v = apic_read(APIC_ISR + ((SPURIOUS_APIC_VECTOR & ~0x1f) >> 1));
- if (v & (1 << (SPURIOUS_APIC_VECTOR & 0x1f)))
- ack_APIC_irq();
-
- inc_irq_stat(irq_spurious_count);
+ handle_spurious_interrupt(vector);
+}
- /* see sw-dev-man vol 3, chapter 7.4.13.5 */
- pr_info("spurious APIC interrupt on CPU#%d, "
- "should never happen.\n", smp_processor_id());
- irq_exit();
+DEFINE_IDTENTRY_SYSVEC(sysvec_spurious_apic_interrupt)
+{
+ handle_spurious_interrupt(SPURIOUS_APIC_VECTOR);
}
/*
* This interrupt should never happen with our APIC/SMP architecture
*/
-void smp_error_interrupt(struct pt_regs *regs)
-{
- u32 v, v1;
+DEFINE_IDTENTRY_SYSVEC(sysvec_error_interrupt)
+{
+ static const char * const error_interrupt_reason[] = {
+ "Send CS error", /* APIC Error Bit 0 */
+ "Receive CS error", /* APIC Error Bit 1 */
+ "Send accept error", /* APIC Error Bit 2 */
+ "Receive accept error", /* APIC Error Bit 3 */
+ "Redirectable IPI", /* APIC Error Bit 4 */
+ "Send illegal vector", /* APIC Error Bit 5 */
+ "Received illegal vector", /* APIC Error Bit 6 */
+ "Illegal register address", /* APIC Error Bit 7 */
+ };
+ u32 v, i = 0;
+
+ trace_error_apic_entry(ERROR_APIC_VECTOR);
- exit_idle();
- irq_enter();
/* First tickle the hardware, only then report what went on. -- REW */
+ if (lapic_get_maxlvt() > 3) /* Due to the Pentium erratum 3AP. */
+ apic_write(APIC_ESR, 0);
v = apic_read(APIC_ESR);
- apic_write(APIC_ESR, 0);
- v1 = apic_read(APIC_ESR);
- ack_APIC_irq();
+ apic_eoi();
atomic_inc(&irq_err_count);
- /*
- * Here is what the APIC error bits mean:
- * 0: Send CS error
- * 1: Receive CS error
- * 2: Send accept error
- * 3: Receive accept error
- * 4: Reserved
- * 5: Send illegal vector
- * 6: Received illegal vector
- * 7: Illegal register address
- */
- pr_debug("APIC error on CPU%d: %02x(%02x)\n",
- smp_processor_id(), v , v1);
- irq_exit();
+ apic_pr_debug("APIC error on CPU%d: %02x", smp_processor_id(), v);
+
+ v &= 0xff;
+ while (v) {
+ if (v & 0x1)
+ apic_pr_debug_cont(" : %s", error_interrupt_reason[i]);
+ i++;
+ v >>= 1;
+ }
+
+ apic_pr_debug_cont("\n");
+
+ trace_error_apic_exit(ERROR_APIC_VECTOR);
}
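[Editor's note: a standalone sketch of the ESR decode loop above, runnable in userspace with a sample error-status value:]

    #include <stdio.h>

    static const char * const reason[] = {
        "Send CS error", "Receive CS error",
        "Send accept error", "Receive accept error",
        "Redirectable IPI", "Send illegal vector",
        "Received illegal vector", "Illegal register address",
    };

    int main(void)
    {
        unsigned int v = 0x44;   /* sample ESR: bits 2 and 6 set */
        unsigned int i = 0;

        printf("APIC error: %02x", v);
        v &= 0xff;
        while (v) {              /* same walk as the handler above */
            if (v & 0x1)
                printf(" : %s", reason[i]);
            i++;
            v >>= 1;
        }
        printf("\n");
        return 0;
    }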
/**
* connect_bsp_APIC - attach the APIC to the interrupt system
*/
-void __init connect_bsp_APIC(void)
+static void __init connect_bsp_APIC(void)
{
#ifdef CONFIG_X86_32
if (pic_mode) {
@@ -1795,13 +2212,10 @@ void __init connect_bsp_APIC(void)
* PIC mode, enable APIC mode in the IMCR, i.e. connect BSP's
* local APIC to INT and NMI lines.
*/
- apic_printk(APIC_VERBOSE, "leaving PIC mode, "
- "enabling APIC mode.\n");
+ apic_pr_verbose("Leaving PIC mode, enabling APIC mode.\n");
imcr_pic_to_apic();
}
#endif
- if (apic->enable_apic_mode)
- apic->enable_apic_mode();
}
/**
@@ -1823,8 +2237,7 @@ void disconnect_bsp_APIC(int virt_wire_setup)
* IPIs, won't work beyond this point! The only exception are
* INIT IPIs.
*/
- apic_printk(APIC_VERBOSE, "disabling APIC mode, "
- "entering PIC mode.\n");
+ apic_pr_verbose("Disabling APIC mode, entering PIC mode.\n");
imcr_apic_to_pic();
return;
}
@@ -1869,85 +2282,77 @@ void disconnect_bsp_APIC(int virt_wire_setup)
apic_write(APIC_LVT1, value);
}
-void __cpuinit generic_processor_info(int apicid, int version)
+void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg,
+ bool dmar)
{
- int cpu;
-
- /*
- * Validate version
- */
- if (version == 0x0) {
- pr_warning("BIOS bug, APIC version is 0 for CPU#%d! "
- "fixing up to 0x10. (tell your hw vendor)\n",
- version);
- version = 0x10;
- }
- apic_version[apicid] = version;
+ memset(msg, 0, sizeof(*msg));
- if (num_processors >= nr_cpu_ids) {
- int max = nr_cpu_ids;
- int thiscpu = max + disabled_cpus;
+ msg->arch_addr_lo.base_address = X86_MSI_BASE_ADDRESS_LOW;
+ msg->arch_addr_lo.dest_mode_logical = apic->dest_mode_logical;
+ msg->arch_addr_lo.destid_0_7 = cfg->dest_apicid & 0xFF;
- pr_warning(
- "ACPI: NR_CPUS/possible_cpus limit of %i reached."
- " Processor %d/0x%x ignored.\n", max, thiscpu, apicid);
+ msg->arch_data.delivery_mode = APIC_DELIVERY_MODE_FIXED;
+ msg->arch_data.vector = cfg->vector;
- disabled_cpus++;
- return;
- }
-
- num_processors++;
- cpu = cpumask_next_zero(-1, cpu_present_mask);
-
- if (version != apic_version[boot_cpu_physical_apicid])
- WARN_ONCE(1,
- "ACPI: apic version mismatch, bootcpu: %x cpu %d: %x\n",
- apic_version[boot_cpu_physical_apicid], cpu, version);
-
- physid_set(apicid, phys_cpu_present_map);
- if (apicid == boot_cpu_physical_apicid) {
- /*
- * x86_bios_cpu_apicid is required to have processors listed
- * in same order as logical cpu numbers. Hence the first
- * entry is BSP, and so on.
- */
- cpu = 0;
- }
- if (apicid > max_physical_apicid)
- max_physical_apicid = apicid;
+ msg->address_hi = X86_MSI_BASE_ADDRESS_HIGH;
+ /*
+ * Only the IOMMU itself can use the trick of putting destination
+ * APIC ID into the high bits of the address. Anything else would
+ * just be writing to memory if it tried that, and needs IR to
+ * address APICs which can't be addressed in the normal 32-bit
+ * address range at 0xFFExxxxx. That is typically just 8 bits, but
+ * some hypervisors allow the extended destination ID field in bits
+ * 5-11 to be used, giving support for 15 bits of APIC IDs in total.
+ */
+ if (dmar)
+ msg->arch_addr_hi.destid_8_31 = cfg->dest_apicid >> 8;
+ else if (virt_ext_dest_id && cfg->dest_apicid < 0x8000)
+ msg->arch_addr_lo.virt_destid_8_14 = cfg->dest_apicid >> 8;
+ else
+ WARN_ON_ONCE(cfg->dest_apicid > 0xFF);
+}
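[Editor's note: a hedged sketch of the destination-ID packing described in the comment above: 8 bits ride in the classic low-address field, and with the hypervisor extension bits 8-14 go in the extended field, giving the 15-bit space that motivates the 32767 limit in try_to_enable_x2apic(). Field names are illustrative, not the kernel's:]

    #include <assert.h>

    struct dest {
        unsigned int destid_0_7;        /* classic MSI address bits */
        unsigned int virt_destid_8_14;  /* hypervisor extended-dest-ID bits */
    };

    static struct dest pack(unsigned int apicid)
    {
        assert(apicid < 0x8000);        /* mirrors the dest_apicid < 0x8000 test */
        return (struct dest){ apicid & 0xff, apicid >> 8 };
    }

    static unsigned int unpack(struct dest d)
    {
        return (d.virt_destid_8_14 << 8) | d.destid_0_7;
    }

    int main(void)
    {
        assert(unpack(pack(0x7ffe)) == 0x7ffe);  /* 15-bit IDs round-trip */
        return 0;
    }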
-#if defined(CONFIG_SMP) || defined(CONFIG_X86_64)
- early_per_cpu(x86_cpu_to_apicid, cpu) = apicid;
- early_per_cpu(x86_bios_cpu_apicid, cpu) = apicid;
-#endif
+u32 x86_msi_msg_get_destid(struct msi_msg *msg, bool extid)
+{
+ u32 dest = msg->arch_addr_lo.destid_0_7;
- set_cpu_possible(cpu, true);
- set_cpu_present(cpu, true);
+ if (extid)
+ dest |= msg->arch_addr_hi.destid_8_31 << 8;
+ return dest;
}
+EXPORT_SYMBOL_FOR_KVM(x86_msi_msg_get_destid);
-int hard_smp_processor_id(void)
+static void __init apic_bsp_up_setup(void)
{
- return read_apic_id();
+ reset_phys_cpu_present_map(boot_cpu_physical_apicid);
}
-void default_init_apic_ldr(void)
+/**
+ * apic_bsp_setup - Setup function for local apic and io-apic
+ * @upmode: Force UP mode (for APIC_init_uniprocessor)
+ */
+static void __init apic_bsp_setup(bool upmode)
{
- unsigned long val;
+ connect_bsp_APIC();
+ if (upmode)
+ apic_bsp_up_setup();
+ setup_local_APIC();
- apic_write(APIC_DFR, APIC_DFR_VALUE);
- val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
- val |= SET_APIC_LOGICAL_ID(1UL << smp_processor_id());
- apic_write(APIC_LDR, val);
+ enable_IO_APIC();
+ end_local_APIC_setup();
+ irq_remap_enable_fault_handling();
+ setup_IO_APIC();
+ lapic_update_legacy_vectors();
}
-#ifdef CONFIG_X86_32
-int default_apicid_to_node(int logical_apicid)
+#ifdef CONFIG_UP_LATE_INIT
+void __init up_late_init(void)
{
-#ifdef CONFIG_SMP
- return apicid_2_node[hard_smp_processor_id()];
-#else
- return 0;
-#endif
+ if (apic_intr_mode == APIC_PIC)
+ return;
+
+ /* Setup local timer */
+ x86_init.timers.setup_percpu_clockev();
}
#endif
@@ -1964,7 +2369,7 @@ static struct {
*/
int active;
/* r/w apic fields */
- unsigned int apic_id;
+ u32 apic_id;
unsigned int apic_taskpri;
unsigned int apic_ldr;
unsigned int apic_dfr;
@@ -1977,9 +2382,10 @@ static struct {
unsigned int apic_tmict;
unsigned int apic_tdcr;
unsigned int apic_thmr;
+ unsigned int apic_cmci;
} apic_pm_state;
-static int lapic_suspend(struct sys_device *dev, pm_message_t state)
+static int lapic_suspend(void *data)
{
unsigned long flags;
int maxlvt;
@@ -2006,61 +2412,62 @@ static int lapic_suspend(struct sys_device *dev, pm_message_t state)
if (maxlvt >= 5)
apic_pm_state.apic_thmr = apic_read(APIC_LVTTHMR);
#endif
+#ifdef CONFIG_X86_MCE_INTEL
+ if (maxlvt >= 6)
+ apic_pm_state.apic_cmci = apic_read(APIC_LVTCMCI);
+#endif
local_irq_save(flags);
+
+ /*
+ * Mask IOAPIC before disabling the local APIC to prevent stale IRR
+ * entries on some implementations.
+ */
+ mask_ioapic_entries();
+
disable_local_APIC();
- if (intr_remapping_enabled)
- disable_intr_remapping();
+ irq_remapping_disable();
local_irq_restore(flags);
return 0;
}
-static int lapic_resume(struct sys_device *dev)
+static void lapic_resume(void *data)
{
unsigned int l, h;
unsigned long flags;
int maxlvt;
- int ret = 0;
- struct IO_APIC_route_entry **ioapic_entries = NULL;
if (!apic_pm_state.active)
- return 0;
+ return;
local_irq_save(flags);
- if (intr_remapping_enabled) {
- ioapic_entries = alloc_ioapic_entries();
- if (!ioapic_entries) {
- WARN(1, "Alloc ioapic_entries in lapic resume failed.");
- ret = -ENOMEM;
- goto restore;
- }
-
- ret = save_IO_APIC_setup(ioapic_entries);
- if (ret) {
- WARN(1, "Saving IO-APIC state failed: %d\n", ret);
- free_ioapic_entries(ioapic_entries);
- goto restore;
- }
- mask_IO_APIC_setup(ioapic_entries);
- legacy_pic->mask_all();
- }
+ /*
+ * IO-APIC and PIC have their own resume routines.
+ * We just mask them here to make sure the interrupt
+ * subsystem is completely quiet while we enable x2apic
+ * and interrupt-remapping.
+ */
+ mask_ioapic_entries();
+ legacy_pic->mask_all();
- if (x2apic_mode)
- enable_x2apic();
- else {
+ if (x2apic_mode) {
+ __x2apic_enable();
+ } else {
/*
* Make sure the APICBASE points to the right address
*
* FIXME! This will be wrong if we ever support suspend on
* SMP! We'll need to do this as part of the CPU restore!
*/
- rdmsr(MSR_IA32_APICBASE, l, h);
- l &= ~MSR_IA32_APICBASE_BASE;
- l |= MSR_IA32_APICBASE_ENABLE | mp_lapic_addr;
- wrmsr(MSR_IA32_APICBASE, l, h);
+ if (boot_cpu_data.x86 >= 6) {
+ rdmsr(MSR_IA32_APICBASE, l, h);
+ l &= ~MSR_IA32_APICBASE_BASE;
+ l |= MSR_IA32_APICBASE_ENABLE | mp_lapic_addr;
+ wrmsr(MSR_IA32_APICBASE, l, h);
+ }
}
maxlvt = lapic_get_maxlvt();
@@ -2072,10 +2479,14 @@ static int lapic_resume(struct sys_device *dev)
apic_write(APIC_SPIV, apic_pm_state.apic_spiv);
apic_write(APIC_LVT0, apic_pm_state.apic_lvt0);
apic_write(APIC_LVT1, apic_pm_state.apic_lvt1);
-#if defined(CONFIG_X86_MCE_P4THERMAL) || defined(CONFIG_X86_MCE_INTEL)
+#ifdef CONFIG_X86_THERMAL_VECTOR
if (maxlvt >= 5)
apic_write(APIC_LVTTHMR, apic_pm_state.apic_thmr);
#endif
+#ifdef CONFIG_X86_MCE_INTEL
+ if (maxlvt >= 6)
+ apic_write(APIC_LVTCMCI, apic_pm_state.apic_cmci);
+#endif
if (maxlvt >= 4)
apic_write(APIC_LVTPC, apic_pm_state.apic_lvtpc);
apic_write(APIC_LVTT, apic_pm_state.apic_lvtt);
@@ -2087,16 +2498,9 @@ static int lapic_resume(struct sys_device *dev)
apic_write(APIC_ESR, 0);
apic_read(APIC_ESR);
- if (intr_remapping_enabled) {
- reenable_intr_remapping(x2apic_mode);
- legacy_pic->restore_mask();
- restore_IO_APIC_setup(ioapic_entries);
- free_ioapic_entries(ioapic_entries);
- }
-restore:
- local_irq_restore(flags);
+ irq_remapping_reenable(x2apic_mode);
- return ret;
+ local_irq_restore(flags);
}
/*
@@ -2104,34 +2508,27 @@ restore:
* are needed on every CPU up until machine_halt/restart/poweroff.
*/
-static struct sysdev_class lapic_sysclass = {
- .name = "lapic",
+static const struct syscore_ops lapic_syscore_ops = {
.resume = lapic_resume,
.suspend = lapic_suspend,
};
-static struct sys_device device_lapic = {
- .id = 0,
- .cls = &lapic_sysclass,
+static struct syscore lapic_syscore = {
+ .ops = &lapic_syscore_ops,
};
-static void __cpuinit apic_pm_activate(void)
+static void apic_pm_activate(void)
{
apic_pm_state.active = 1;
}
static int __init init_lapic_sysfs(void)
{
- int error;
-
- if (!cpu_has_apic)
- return 0;
/* XXX: remove suspend/resume procs if !apic_pm_state.active? */
+ if (boot_cpu_has(X86_FEATURE_APIC))
+ register_syscore(&lapic_syscore);
- error = sysdev_class_register(&lapic_sysclass);
- if (!error)
- error = sysdev_register(&device_lapic);
- return error;
+ return 0;
}
/* local apic needs to resume before other devices access its registers. */
@@ -2145,55 +2542,10 @@ static void apic_pm_activate(void) { }
#ifdef CONFIG_X86_64
-static int __cpuinit apic_cluster_num(void)
-{
- int i, clusters, zeros;
- unsigned id;
- u16 *bios_cpu_apicid;
- DECLARE_BITMAP(clustermap, NUM_APIC_CLUSTERS);
-
- bios_cpu_apicid = early_per_cpu_ptr(x86_bios_cpu_apicid);
- bitmap_zero(clustermap, NUM_APIC_CLUSTERS);
-
- for (i = 0; i < nr_cpu_ids; i++) {
- /* are we being called early in kernel startup? */
- if (bios_cpu_apicid) {
- id = bios_cpu_apicid[i];
- } else if (i < nr_cpu_ids) {
- if (cpu_present(i))
- id = per_cpu(x86_bios_cpu_apicid, i);
- else
- continue;
- } else
- break;
-
- if (id != BAD_APICID)
- __set_bit(APIC_CLUSTERID(id), clustermap);
- }
+static int multi_checked;
+static int multi;
- /* Problem: Partially populated chassis may not have CPUs in some of
- * the APIC clusters they have been allocated. Only present CPUs have
- * x86_bios_cpu_apicid entries, thus causing zeroes in the bitmap.
- * Since clusters are allocated sequentially, count zeros only if
- * they are bounded by ones.
- */
- clusters = 0;
- zeros = 0;
- for (i = 0; i < NUM_APIC_CLUSTERS; i++) {
- if (test_bit(i, clustermap)) {
- clusters += 1 + zeros;
- zeros = 0;
- } else
- ++zeros;
- }
-
- return clusters;
-}
-
-static int __cpuinitdata multi_checked;
-static int __cpuinitdata multi;
-
-static int __cpuinit set_multi(const struct dmi_system_id *d)
+static int set_multi(const struct dmi_system_id *d)
{
if (multi)
return 0;
@@ -2202,7 +2554,7 @@ static int __cpuinit set_multi(const struct dmi_system_id *d)
return 0;
}
-static const __cpuinitconst struct dmi_system_id multi_dmi_table[] = {
+static const struct dmi_system_id multi_dmi_table[] = {
{
.callback = set_multi,
.ident = "IBM System Summit2",
@@ -2214,7 +2566,7 @@ static const __cpuinitconst struct dmi_system_id multi_dmi_table[] = {
{}
};
-static void __cpuinit dmi_check_multi(void)
+static void dmi_check_multi(void)
{
if (multi_checked)
return;
@@ -2231,42 +2583,22 @@ static void __cpuinit dmi_check_multi(void)
* multi-chassis.
* Use DMI to check them
*/
-__cpuinit int apic_is_clustered_box(void)
+int apic_is_clustered_box(void)
{
dmi_check_multi();
- if (multi)
- return 1;
-
- if (!is_vsmp_box())
- return 0;
-
- /*
- * ScaleMP vSMPowered boxes have one cluster per board and TSCs are
- * not guaranteed to be synced between boards
- */
- if (apic_cluster_num() > 1)
- return 1;
-
- return 0;
+ return multi;
}
#endif
/*
* APIC command line parameters
*/
-static int __init setup_disableapic(char *arg)
+static int __init setup_nolapic(char *arg)
{
- disable_apic = 1;
+ apic_is_disabled = true;
setup_clear_cpu_cap(X86_FEATURE_APIC);
return 0;
}
-early_param("disableapic", setup_disableapic);
-
-/* same as disableapic, for compatibility */
-static int __init setup_nolapic(char *arg)
-{
- return setup_disableapic(arg);
-}
early_param("nolapic", setup_nolapic);
static int __init parse_lapic_timer_c2_ok(char *arg)
@@ -2293,22 +2625,24 @@ early_param("nolapic_timer", parse_nolapic_timer);
static int __init apic_set_verbosity(char *arg)
{
if (!arg) {
-#ifdef CONFIG_X86_64
- skip_ioapic_setup = 0;
+ if (IS_ENABLED(CONFIG_X86_32))
+ return -EINVAL;
+
+ ioapic_is_disabled = false;
return 0;
-#endif
- return -EINVAL;
}
if (strcmp("debug", arg) == 0)
apic_verbosity = APIC_DEBUG;
else if (strcmp("verbose", arg) == 0)
apic_verbosity = APIC_VERBOSE;
+#ifdef CONFIG_X86_64
else {
- pr_warning("APIC Verbosity level %s not recognised"
+ pr_warn("APIC Verbosity level %s not recognised"
" use apic=verbose or apic=debug\n", arg);
return -EINVAL;
}
+#endif
return 0;
}
@@ -2316,11 +2650,11 @@ early_param("apic", apic_set_verbosity);
static int __init lapic_insert_resource(void)
{
- if (!apic_phys)
+ if (!apic_mmio_base)
return -1;
/* Put local APIC into the resource map. */
- lapic_resource.start = apic_phys;
+ lapic_resource.start = apic_mmio_base;
lapic_resource.end = lapic_resource.start + PAGE_SIZE - 1;
insert_resource(&iomem_resource, &lapic_resource);
@@ -2328,7 +2662,27 @@ static int __init lapic_insert_resource(void)
}
/*
- * need call insert after e820_reserve_resources()
+ * need call insert after e820__reserve_resources()
* that is using request_resource
*/
late_initcall(lapic_insert_resource);
+
+static int __init apic_set_extnmi(char *arg)
+{
+ if (!arg)
+ return -EINVAL;
+
+ if (!strncmp("all", arg, 3))
+ apic_extnmi = APIC_EXTNMI_ALL;
+ else if (!strncmp("none", arg, 4))
+ apic_extnmi = APIC_EXTNMI_NONE;
+ else if (!strncmp("bsp", arg, 3))
+ apic_extnmi = APIC_EXTNMI_BSP;
+ else {
+ pr_warn("Unknown external NMI delivery mode `%s' ignored\n", arg);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+early_param("apic_extnmi", apic_set_extnmi);
diff --git a/arch/x86/kernel/apic/apic_common.c b/arch/x86/kernel/apic/apic_common.c
new file mode 100644
index 000000000000..2ed3b5c88c7f
--- /dev/null
+++ b/arch/x86/kernel/apic/apic_common.c
@@ -0,0 +1,43 @@
+/*
+ * Common functions shared between the various APIC flavours
+ *
+ * SPDX-License-Identifier: GPL-2.0
+ */
+#include <linux/irq.h>
+#include <linux/kvm_types.h>
+#include <asm/apic.h>
+
+#include "local.h"
+
+u32 apic_default_calc_apicid(unsigned int cpu)
+{
+ return per_cpu(x86_cpu_to_apicid, cpu);
+}
+
+u32 apic_flat_calc_apicid(unsigned int cpu)
+{
+ return 1U << cpu;
+}
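[Editor's note: the 1U << cpu above is the whole story of flat logical addressing: each CPU owns one bit of the 8-bit logical destination, so multi-CPU destinations are a simple OR of per-CPU bits and the scheme caps out at eight CPUs. A tiny sketch:]

    #include <assert.h>

    static unsigned int flat_calc_apicid(unsigned int cpu)
    {
        return 1u << cpu;
    }

    int main(void)
    {
        /* A destination covering CPUs 0 and 3 is the OR of their bits */
        assert((flat_calc_apicid(0) | flat_calc_apicid(3)) == 0x09);
        return 0;
    }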
+
+u32 default_cpu_present_to_apicid(int mps_cpu)
+{
+ if (mps_cpu < nr_cpu_ids && cpu_present(mps_cpu))
+ return (int)per_cpu(x86_cpu_to_apicid, mps_cpu);
+ else
+ return BAD_APICID;
+}
+EXPORT_SYMBOL_FOR_KVM(default_cpu_present_to_apicid);
+
+/*
+ * Set up the logical destination ID when the APIC operates in logical
+ * destination mode.
+ */
+void default_init_apic_ldr(void)
+{
+ unsigned long val;
+
+ apic_write(APIC_DFR, APIC_DFR_FLAT);
+ val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
+ val |= SET_APIC_LOGICAL_ID(1UL << smp_processor_id());
+ apic_write(APIC_LDR, val);
+}
diff --git a/arch/x86/kernel/apic/apic_flat_64.c b/arch/x86/kernel/apic/apic_flat_64.c
index 09d3b17ce0c2..e0308d8c4e6c 100644
--- a/arch/x86/kernel/apic/apic_flat_64.c
+++ b/arch/x86/kernel/apic/apic_flat_64.c
@@ -1,6 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Copyright 2004 James Cleverdon, IBM.
- * Subject to the GNU Public License, v.2
*
* Flat APIC subarch code.
*
@@ -8,368 +8,61 @@
* Martin Bligh, Andi Kleen, James Bottomley, John Stultz, and
* James Cleverdon.
*/
-#include <linux/errno.h>
-#include <linux/threads.h>
-#include <linux/cpumask.h>
-#include <linux/string.h>
-#include <linux/kernel.h>
-#include <linux/ctype.h>
-#include <linux/init.h>
-#include <linux/hardirq.h>
-#include <asm/smp.h>
-#include <asm/apic.h>
-#include <asm/ipi.h>
-
-#ifdef CONFIG_ACPI
-#include <acpi/acpi_bus.h>
-#endif
-
-static int flat_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
-{
- return 1;
-}
-
-static const struct cpumask *flat_target_cpus(void)
-{
- return cpu_online_mask;
-}
-
-static void flat_vector_allocation_domain(int cpu, struct cpumask *retmask)
-{
- /* Careful. Some cpus do not strictly honor the set of cpus
- * specified in the interrupt destination when using lowest
- * priority interrupt delivery mode.
- *
- * In particular there was a hyperthreading cpu observed to
- * deliver interrupts to the wrong hyperthread when only one
- * hyperthread was specified in the interrupt desitination.
- */
- cpumask_clear(retmask);
- cpumask_bits(retmask)[0] = APIC_ALL_CPUS;
-}
-
-/*
- * Set up the logical destination ID.
- *
- * Intel recommends to set DFR, LDR and TPR before enabling
- * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel
- * document number 292116). So here it goes...
- */
-static void flat_init_apic_ldr(void)
-{
- unsigned long val;
- unsigned long num, id;
-
- num = smp_processor_id();
- id = 1UL << num;
- apic_write(APIC_DFR, APIC_DFR_FLAT);
- val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
- val |= SET_APIC_LOGICAL_ID(id);
- apic_write(APIC_LDR, val);
-}
-
-static inline void _flat_send_IPI_mask(unsigned long mask, int vector)
-{
- unsigned long flags;
-
- local_irq_save(flags);
- __default_send_IPI_dest_field(mask, vector, apic->dest_logical);
- local_irq_restore(flags);
-}
-
-static void flat_send_IPI_mask(const struct cpumask *cpumask, int vector)
-{
- unsigned long mask = cpumask_bits(cpumask)[0];
-
- _flat_send_IPI_mask(mask, vector);
-}
-
-static void
- flat_send_IPI_mask_allbutself(const struct cpumask *cpumask, int vector)
-{
- unsigned long mask = cpumask_bits(cpumask)[0];
- int cpu = smp_processor_id();
-
- if (cpu < BITS_PER_LONG)
- clear_bit(cpu, &mask);
-
- _flat_send_IPI_mask(mask, vector);
-}
-
-static void flat_send_IPI_allbutself(int vector)
-{
- int cpu = smp_processor_id();
-#ifdef CONFIG_HOTPLUG_CPU
- int hotplug = 1;
-#else
- int hotplug = 0;
-#endif
- if (hotplug || vector == NMI_VECTOR) {
- if (!cpumask_equal(cpu_online_mask, cpumask_of(cpu))) {
- unsigned long mask = cpumask_bits(cpu_online_mask)[0];
-
- if (cpu < BITS_PER_LONG)
- clear_bit(cpu, &mask);
-
- _flat_send_IPI_mask(mask, vector);
- }
- } else if (num_online_cpus() > 1) {
- __default_send_IPI_shortcut(APIC_DEST_ALLBUT,
- vector, apic->dest_logical);
- }
-}
-
-static void flat_send_IPI_all(int vector)
-{
- if (vector == NMI_VECTOR) {
- flat_send_IPI_mask(cpu_online_mask, vector);
- } else {
- __default_send_IPI_shortcut(APIC_DEST_ALLINC,
- vector, apic->dest_logical);
- }
-}
-
-static unsigned int flat_get_apic_id(unsigned long x)
-{
- unsigned int id;
-
- id = (((x)>>24) & 0xFFu);
+#include <linux/export.h>
- return id;
-}
-
-static unsigned long set_apic_id(unsigned int id)
-{
- unsigned long x;
-
- x = ((id & 0xFFu)<<24);
- return x;
-}
-
-static unsigned int read_xapic_id(void)
-{
- unsigned int id;
+#include <asm/apic.h>
- id = flat_get_apic_id(apic_read(APIC_ID));
- return id;
-}
+#include "local.h"
-static int flat_apic_id_registered(void)
+static u32 physflat_get_apic_id(u32 x)
{
- return physid_isset(read_xapic_id(), phys_cpu_present_map);
+ return (x >> 24) & 0xFF;
}
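[Editor's note: a one-liner, but worth a worked value: in xAPIC mode the 8-bit APIC ID sits in bits 31:24 of the APIC_ID register, which is all the helper above extracts. Sketch:]

    #include <assert.h>

    static unsigned int get_apic_id(unsigned int reg)
    {
        return (reg >> 24) & 0xff;
    }

    int main(void)
    {
        assert(get_apic_id(0x0b000000) == 0x0b);  /* ID 11 as read from APIC_ID */
        return 0;
    }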
-static int flat_phys_pkg_id(int initial_apic_id, int index_msb)
+static int physflat_probe(void)
{
- return initial_apic_id >> index_msb;
+ return 1;
}
-struct apic apic_flat = {
- .name = "flat",
- .probe = NULL,
- .acpi_madt_oem_check = flat_acpi_madt_oem_check,
- .apic_id_registered = flat_apic_id_registered,
-
- .irq_delivery_mode = dest_LowestPrio,
- .irq_dest_mode = 1, /* logical */
-
- .target_cpus = flat_target_cpus,
- .disable_esr = 0,
- .dest_logical = APIC_DEST_LOGICAL,
- .check_apicid_used = NULL,
- .check_apicid_present = NULL,
-
- .vector_allocation_domain = flat_vector_allocation_domain,
- .init_apic_ldr = flat_init_apic_ldr,
-
- .ioapic_phys_id_map = NULL,
- .setup_apic_routing = NULL,
- .multi_timer_check = NULL,
- .apicid_to_node = NULL,
- .cpu_to_logical_apicid = NULL,
- .cpu_present_to_apicid = default_cpu_present_to_apicid,
- .apicid_to_cpu_present = NULL,
- .setup_portio_remap = NULL,
- .check_phys_apicid_present = default_check_phys_apicid_present,
- .enable_apic_mode = NULL,
- .phys_pkg_id = flat_phys_pkg_id,
- .mps_oem_check = NULL,
-
- .get_apic_id = flat_get_apic_id,
- .set_apic_id = set_apic_id,
- .apic_id_mask = 0xFFu << 24,
-
- .cpu_mask_to_apicid = default_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = default_cpu_mask_to_apicid_and,
-
- .send_IPI_mask = flat_send_IPI_mask,
- .send_IPI_mask_allbutself = flat_send_IPI_mask_allbutself,
- .send_IPI_allbutself = flat_send_IPI_allbutself,
- .send_IPI_all = flat_send_IPI_all,
- .send_IPI_self = apic_send_IPI_self,
-
- .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
- .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
- .wait_for_init_deassert = NULL,
- .smp_callin_clear_local_apic = NULL,
- .inquire_remote_apic = default_inquire_remote_apic,
-
- .read = native_apic_mem_read,
- .write = native_apic_mem_write,
- .icr_read = native_apic_icr_read,
- .icr_write = native_apic_icr_write,
- .wait_icr_idle = native_apic_wait_icr_idle,
- .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
-};
-
-/*
- * Physflat mode is used when there are more than 8 CPUs on a system.
- * We cannot use logical delivery in this case because the mask
- * overflows, so use physical mode.
- */
static int physflat_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
{
-#ifdef CONFIG_ACPI
- /*
- * Quirk: some x86_64 machines can only use physical APIC mode
- * regardless of how many processors are present (x86_64 ES7000
- * is an example).
- */
- if (acpi_gbl_FADT.header.revision >= FADT2_REVISION_ID &&
- (acpi_gbl_FADT.flags & ACPI_FADT_APIC_PHYSICAL)) {
- printk(KERN_DEBUG "system APIC only can use physical flat");
- return 1;
- }
-
- if (!strncmp(oem_id, "IBM", 3) && !strncmp(oem_table_id, "EXA", 3)) {
- printk(KERN_DEBUG "IBM Summit detected, will use apic physical");
- return 1;
- }
-#endif
-
- return 0;
-}
-
-static const struct cpumask *physflat_target_cpus(void)
-{
- return cpu_online_mask;
-}
-
-static void physflat_vector_allocation_domain(int cpu, struct cpumask *retmask)
-{
- cpumask_clear(retmask);
- cpumask_set_cpu(cpu, retmask);
-}
-
-static void physflat_send_IPI_mask(const struct cpumask *cpumask, int vector)
-{
- default_send_IPI_mask_sequence_phys(cpumask, vector);
-}
-
-static void physflat_send_IPI_mask_allbutself(const struct cpumask *cpumask,
- int vector)
-{
- default_send_IPI_mask_allbutself_phys(cpumask, vector);
-}
-
-static void physflat_send_IPI_allbutself(int vector)
-{
- default_send_IPI_mask_allbutself_phys(cpu_online_mask, vector);
-}
-
-static void physflat_send_IPI_all(int vector)
-{
- physflat_send_IPI_mask(cpu_online_mask, vector);
-}
-
-static unsigned int physflat_cpu_mask_to_apicid(const struct cpumask *cpumask)
-{
- int cpu;
-
- /*
- * We're using fixed IRQ delivery, can only return one phys APIC ID.
- * May as well be the first.
- */
- cpu = cpumask_first(cpumask);
- if ((unsigned)cpu < nr_cpu_ids)
- return per_cpu(x86_cpu_to_apicid, cpu);
- else
- return BAD_APICID;
-}
-
-static unsigned int
-physflat_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
- const struct cpumask *andmask)
-{
- int cpu;
-
- /*
- * We're using fixed IRQ delivery, can only return one phys APIC ID.
- * May as well be the first.
- */
- for_each_cpu_and(cpu, cpumask, andmask) {
- if (cpumask_test_cpu(cpu, cpu_online_mask))
- break;
- }
- return per_cpu(x86_cpu_to_apicid, cpu);
+ return 1;
}
-struct apic apic_physflat = {
+static struct apic apic_physflat __ro_after_init = {
.name = "physical flat",
- .probe = NULL,
+ .probe = physflat_probe,
.acpi_madt_oem_check = physflat_acpi_madt_oem_check,
- .apic_id_registered = flat_apic_id_registered,
- .irq_delivery_mode = dest_Fixed,
- .irq_dest_mode = 0, /* physical */
+ .dest_mode_logical = false,
- .target_cpus = physflat_target_cpus,
.disable_esr = 0,
- .dest_logical = 0,
- .check_apicid_used = NULL,
- .check_apicid_present = NULL,
- .vector_allocation_domain = physflat_vector_allocation_domain,
- /* not needed, but shouldn't hurt: */
- .init_apic_ldr = flat_init_apic_ldr,
-
- .ioapic_phys_id_map = NULL,
- .setup_apic_routing = NULL,
- .multi_timer_check = NULL,
- .apicid_to_node = NULL,
- .cpu_to_logical_apicid = NULL,
.cpu_present_to_apicid = default_cpu_present_to_apicid,
- .apicid_to_cpu_present = NULL,
- .setup_portio_remap = NULL,
- .check_phys_apicid_present = default_check_phys_apicid_present,
- .enable_apic_mode = NULL,
- .phys_pkg_id = flat_phys_pkg_id,
- .mps_oem_check = NULL,
-
- .get_apic_id = flat_get_apic_id,
- .set_apic_id = set_apic_id,
- .apic_id_mask = 0xFFu << 24,
- .cpu_mask_to_apicid = physflat_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = physflat_cpu_mask_to_apicid_and,
+ .max_apic_id = 0xFE,
+ .get_apic_id = physflat_get_apic_id,
- .send_IPI_mask = physflat_send_IPI_mask,
- .send_IPI_mask_allbutself = physflat_send_IPI_mask_allbutself,
- .send_IPI_allbutself = physflat_send_IPI_allbutself,
- .send_IPI_all = physflat_send_IPI_all,
- .send_IPI_self = apic_send_IPI_self,
+ .calc_dest_apicid = apic_default_calc_apicid,
- .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
- .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
- .wait_for_init_deassert = NULL,
- .smp_callin_clear_local_apic = NULL,
- .inquire_remote_apic = default_inquire_remote_apic,
+ .send_IPI = default_send_IPI_single_phys,
+ .send_IPI_mask = default_send_IPI_mask_sequence_phys,
+ .send_IPI_mask_allbutself = default_send_IPI_mask_allbutself_phys,
+ .send_IPI_allbutself = default_send_IPI_allbutself,
+ .send_IPI_all = default_send_IPI_all,
+ .send_IPI_self = default_send_IPI_self,
+ .nmi_to_offline_cpu = true,
.read = native_apic_mem_read,
.write = native_apic_mem_write,
+ .eoi = native_apic_mem_eoi,
.icr_read = native_apic_icr_read,
.icr_write = native_apic_icr_write,
- .wait_icr_idle = native_apic_wait_icr_idle,
- .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+ .wait_icr_idle = apic_mem_wait_icr_idle,
+ .safe_wait_icr_idle = apic_mem_wait_icr_idle_timeout,
};
+apic_driver(apic_physflat);
+
+struct apic *apic __ro_after_init = &apic_physflat;
+EXPORT_SYMBOL_GPL(apic);
diff --git a/arch/x86/kernel/apic/apic_noop.c b/arch/x86/kernel/apic/apic_noop.c
index e31b9ffe25f5..58abb941c45b 100644
--- a/arch/x86/kernel/apic/apic_noop.c
+++ b/arch/x86/kernel/apic/apic_noop.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* NOOP APIC driver.
*
@@ -7,173 +8,62 @@
* Though, if the apic is disabled (for some reason), we try not to
* uglify the caller's code and still allow calling (some) apic routines
* like self-ipi, etc...
+ *
+ * FIXME: Remove this gunk. The above argument which was intentionally left
+ * in place is silly to begin with because none of the callbacks except for
+ * APIC::read/write() have a WARN_ON_ONCE() in them. Sigh...
*/
-
-#include <linux/threads.h>
#include <linux/cpumask.h>
-#include <linux/module.h>
-#include <linux/string.h>
-#include <linux/kernel.h>
-#include <linux/ctype.h>
-#include <linux/init.h>
-#include <linux/errno.h>
-#include <asm/fixmap.h>
-#include <asm/mpspec.h>
-#include <asm/apicdef.h>
-#include <asm/apic.h>
-#include <asm/setup.h>
+#include <linux/thread_info.h>
-#include <linux/smp.h>
-#include <asm/ipi.h>
+#include <asm/apic.h>
-#include <linux/interrupt.h>
-#include <asm/acpi.h>
-#include <asm/e820.h>
+#include "local.h"
-static void noop_init_apic_ldr(void) { }
+static void noop_send_IPI(int cpu, int vector) { }
static void noop_send_IPI_mask(const struct cpumask *cpumask, int vector) { }
static void noop_send_IPI_mask_allbutself(const struct cpumask *cpumask, int vector) { }
static void noop_send_IPI_allbutself(int vector) { }
static void noop_send_IPI_all(int vector) { }
static void noop_send_IPI_self(int vector) { }
-static void noop_apic_wait_icr_idle(void) { }
static void noop_apic_icr_write(u32 low, u32 id) { }
-static int noop_wakeup_secondary_cpu(int apicid, unsigned long start_eip)
+static int noop_wakeup_secondary_cpu(u32 apicid, unsigned long start_eip,
+ unsigned int cpu)
{
return -1;
}
-static u32 noop_safe_apic_wait_icr_idle(void)
-{
- return 0;
-}
-
-static u64 noop_apic_icr_read(void)
-{
- return 0;
-}
-
-static int noop_cpu_to_logical_apicid(int cpu)
-{
- return 0;
-}
-
-static int noop_phys_pkg_id(int cpuid_apic, int index_msb)
-{
- return 0;
-}
-
-static unsigned int noop_get_apic_id(unsigned long x)
-{
- return 0;
-}
-
-static int noop_probe(void)
-{
- /*
- * NOOP apic should not ever be
- * enabled via probe routine
- */
- return 0;
-}
-
-static int noop_apic_id_registered(void)
-{
- /*
- * if we would be really "pedantic"
- * we should pass read_apic_id() here
- * but since NOOP suppose APIC ID = 0
- * lets save a few cycles
- */
- return physid_isset(0, phys_cpu_present_map);
-}
-
-static const struct cpumask *noop_target_cpus(void)
-{
- /* only BSP here */
- return cpumask_of(0);
-}
-
-static unsigned long noop_check_apicid_used(physid_mask_t *map, int apicid)
-{
- return physid_isset(apicid, *map);
-}
-
-static unsigned long noop_check_apicid_present(int bit)
-{
- return physid_isset(bit, phys_cpu_present_map);
-}
-
-static void noop_vector_allocation_domain(int cpu, struct cpumask *retmask)
-{
- if (cpu != 0)
- pr_warning("APIC: Vector allocated for non-BSP cpu\n");
- cpumask_clear(retmask);
- cpumask_set_cpu(cpu, retmask);
-}
-
-int noop_apicid_to_node(int logical_apicid)
-{
- /* we're always on node 0 */
- return 0;
-}
+static u64 noop_apic_icr_read(void) { return 0; }
+static u32 noop_get_apic_id(u32 apicid) { return 0; }
+static void noop_apic_eoi(void) { }
static u32 noop_apic_read(u32 reg)
{
- WARN_ON_ONCE((cpu_has_apic && !disable_apic));
+ WARN_ON_ONCE(boot_cpu_has(X86_FEATURE_APIC) && !apic_is_disabled);
return 0;
}
-static void noop_apic_write(u32 reg, u32 v)
+static void noop_apic_write(u32 reg, u32 val)
{
- WARN_ON_ONCE(cpu_has_apic && !disable_apic);
+ WARN_ON_ONCE(boot_cpu_has(X86_FEATURE_APIC) && !apic_is_disabled);
}
-struct apic apic_noop = {
+struct apic apic_noop __ro_after_init = {
.name = "noop",
- .probe = noop_probe,
- .acpi_madt_oem_check = NULL,
- .apic_id_registered = noop_apic_id_registered,
+ .dest_mode_logical = true,
- .irq_delivery_mode = dest_LowestPrio,
- /* logical delivery broadcast to all CPUs: */
- .irq_dest_mode = 1,
-
- .target_cpus = noop_target_cpus,
.disable_esr = 0,
- .dest_logical = APIC_DEST_LOGICAL,
- .check_apicid_used = noop_check_apicid_used,
- .check_apicid_present = noop_check_apicid_present,
-
- .vector_allocation_domain = noop_vector_allocation_domain,
- .init_apic_ldr = noop_init_apic_ldr,
-
- .ioapic_phys_id_map = default_ioapic_phys_id_map,
- .setup_apic_routing = NULL,
- .multi_timer_check = NULL,
- .apicid_to_node = noop_apicid_to_node,
- .cpu_to_logical_apicid = noop_cpu_to_logical_apicid,
.cpu_present_to_apicid = default_cpu_present_to_apicid,
- .apicid_to_cpu_present = physid_set_mask_of_physid,
-
- .setup_portio_remap = NULL,
- .check_phys_apicid_present = default_check_phys_apicid_present,
- .enable_apic_mode = NULL,
-
- .phys_pkg_id = noop_phys_pkg_id,
-
- .mps_oem_check = NULL,
+ .max_apic_id = 0xFE,
.get_apic_id = noop_get_apic_id,
- .set_apic_id = NULL,
- .apic_id_mask = 0x0F << 24,
- .cpu_mask_to_apicid = default_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = default_cpu_mask_to_apicid_and,
+ .calc_dest_apicid = apic_flat_calc_apicid,
+ .send_IPI = noop_send_IPI,
.send_IPI_mask = noop_send_IPI_mask,
.send_IPI_mask_allbutself = noop_send_IPI_mask_allbutself,
.send_IPI_allbutself = noop_send_IPI_allbutself,
@@ -182,19 +72,9 @@ struct apic apic_noop = {
.wakeup_secondary_cpu = noop_wakeup_secondary_cpu,
- /* should be safe */
- .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
- .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
-
- .wait_for_init_deassert = NULL,
-
- .smp_callin_clear_local_apic = NULL,
- .inquire_remote_apic = NULL,
-
.read = noop_apic_read,
.write = noop_apic_write,
+ .eoi = noop_apic_eoi,
.icr_read = noop_apic_icr_read,
.icr_write = noop_apic_icr_write,
- .wait_icr_idle = noop_apic_wait_icr_idle,
- .safe_wait_icr_idle = noop_safe_apic_wait_icr_idle,
};
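The noop driver above is a null object: with every callback an empty stub, call sites can invoke the APIC unconditionally while it is disabled instead of guarding each call. A hedged illustration of what that buys (kernel context with asm/apic.h assumed; RESCHEDULE_VECTOR is just an example vector):

/* Without the null object, every caller would need a guard: */
static void reschedule_self_guarded(void)
{
	if (boot_cpu_has(X86_FEATURE_APIC) && !apic_is_disabled)
		apic->send_IPI_self(RESCHEDULE_VECTOR);
}

/* With 'apic' pointing at apic_noop when disabled, the guard vanishes: */
static void reschedule_self(void)
{
	apic->send_IPI_self(RESCHEDULE_VECTOR);	/* no-op pre-APIC */
}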
diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
new file mode 100644
index 000000000000..5c5be2d58242
--- /dev/null
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -0,0 +1,272 @@
+/*
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Numascale NumaConnect-Specific APIC Code
+ *
+ * Copyright (C) 2011 Numascale AS. All rights reserved.
+ *
+ * Send feedback to <support@numascale.com>
+ *
+ */
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/pgtable.h>
+
+#include <asm/msr.h>
+#include <asm/numachip/numachip.h>
+#include <asm/numachip/numachip_csr.h>
+
+
+#include "local.h"
+
+u8 numachip_system __read_mostly;
+static const struct apic apic_numachip1;
+static const struct apic apic_numachip2;
+static void (*numachip_apic_icr_write)(int apicid, unsigned int val) __read_mostly;
+
+static u32 numachip1_get_apic_id(u32 x)
+{
+ unsigned long value;
+ unsigned int id = (x >> 24) & 0xff;
+
+ if (static_cpu_has(X86_FEATURE_NODEID_MSR)) {
+ rdmsrq(MSR_FAM10H_NODE_ID, value);
+ id |= (value << 2) & 0xff00;
+ }
+
+ return id;
+}
+
+static u32 numachip2_get_apic_id(u32 x)
+{
+ u64 mcfg;
+
+ rdmsrq(MSR_FAM10H_MMIO_CONF_BASE, mcfg);
+ return ((mcfg >> (28 - 8)) & 0xfff00) | (x >> 24);
+}
+
+static void numachip1_apic_icr_write(int apicid, unsigned int val)
+{
+ write_lcsr(CSR_G3_EXT_IRQ_GEN, (apicid << 16) | val);
+}
+
+static void numachip2_apic_icr_write(int apicid, unsigned int val)
+{
+ numachip2_write32_lcsr(NUMACHIP2_APIC_ICR, (apicid << 12) | val);
+}
+
+static int numachip_wakeup_secondary(u32 phys_apicid, unsigned long start_rip, unsigned int cpu)
+{
+ numachip_apic_icr_write(phys_apicid, APIC_DM_INIT);
+ numachip_apic_icr_write(phys_apicid, APIC_DM_STARTUP |
+ (start_rip >> 12));
+
+ return 0;
+}
+
+static void numachip_send_IPI_one(int cpu, int vector)
+{
+ int local_apicid, apicid = per_cpu(x86_cpu_to_apicid, cpu);
+ unsigned int dmode;
+
+ preempt_disable();
+ local_apicid = __this_cpu_read(x86_cpu_to_apicid);
+
+ /* Send via local APIC where non-local part matches */
+ if (!((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS)) {
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __default_send_IPI_dest_field(apicid, vector,
+ APIC_DEST_PHYSICAL);
+ local_irq_restore(flags);
+ preempt_enable();
+ return;
+ }
+ preempt_enable();
+
+ dmode = (vector == NMI_VECTOR) ? APIC_DM_NMI : APIC_DM_FIXED;
+ numachip_apic_icr_write(apicid, dmode | vector);
+}
+
+static void numachip_send_IPI_mask(const struct cpumask *mask, int vector)
+{
+ unsigned int cpu;
+
+ for_each_cpu(cpu, mask)
+ numachip_send_IPI_one(cpu, vector);
+}
+
+static void numachip_send_IPI_mask_allbutself(const struct cpumask *mask,
+ int vector)
+{
+ unsigned int this_cpu = smp_processor_id();
+ unsigned int cpu;
+
+ for_each_cpu(cpu, mask) {
+ if (cpu != this_cpu)
+ numachip_send_IPI_one(cpu, vector);
+ }
+}
+
+static void numachip_send_IPI_allbutself(int vector)
+{
+ unsigned int this_cpu = smp_processor_id();
+ unsigned int cpu;
+
+ for_each_online_cpu(cpu) {
+ if (cpu != this_cpu)
+ numachip_send_IPI_one(cpu, vector);
+ }
+}
+
+static void numachip_send_IPI_all(int vector)
+{
+ numachip_send_IPI_mask(cpu_online_mask, vector);
+}
+
+static void numachip_send_IPI_self(int vector)
+{
+ apic_write(APIC_SELF_IPI, vector);
+}
+
+static int __init numachip1_probe(void)
+{
+ return apic == &apic_numachip1;
+}
+
+static int __init numachip2_probe(void)
+{
+ return apic == &apic_numachip2;
+}
+
+static void fixup_cpu_id(struct cpuinfo_x86 *c, int node)
+{
+ u64 val;
+ u32 nodes = 1;
+
+ c->topo.llc_id = node;
+
+ /* Account for nodes per socket in multi-core-module processors */
+ if (boot_cpu_has(X86_FEATURE_NODEID_MSR)) {
+ rdmsrq(MSR_FAM10H_NODE_ID, val);
+ nodes = ((val >> 3) & 7) + 1;
+ }
+
+ c->topo.pkg_id = node / nodes;
+}
+
+static int __init numachip_system_init(void)
+{
+ /* Map the LCSR area and set up the apic_icr_write function */
+ switch (numachip_system) {
+ case 1:
+ init_extra_mapping_uc(NUMACHIP_LCSR_BASE, NUMACHIP_LCSR_SIZE);
+ numachip_apic_icr_write = numachip1_apic_icr_write;
+ break;
+ case 2:
+ init_extra_mapping_uc(NUMACHIP2_LCSR_BASE, NUMACHIP2_LCSR_SIZE);
+ numachip_apic_icr_write = numachip2_apic_icr_write;
+ break;
+ default:
+ return 0;
+ }
+
+ x86_cpuinit.fixup_cpu_id = fixup_cpu_id;
+ x86_init.pci.arch_init = pci_numachip_init;
+
+ return 0;
+}
+early_initcall(numachip_system_init);
+
+static int numachip1_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
+{
+ if ((strncmp(oem_id, "NUMASC", 6) != 0) ||
+ (strncmp(oem_table_id, "NCONNECT", 8) != 0))
+ return 0;
+
+ numachip_system = 1;
+
+ return 1;
+}
+
+static int numachip2_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
+{
+ if ((strncmp(oem_id, "NUMASC", 6) != 0) ||
+ (strncmp(oem_table_id, "NCONECT2", 8) != 0))
+ return 0;
+
+ numachip_system = 2;
+
+ return 1;
+}
+
+static const struct apic apic_numachip1 __refconst = {
+ .name = "NumaConnect system",
+ .probe = numachip1_probe,
+ .acpi_madt_oem_check = numachip1_acpi_madt_oem_check,
+
+ .dest_mode_logical = false,
+
+ .disable_esr = 0,
+
+ .cpu_present_to_apicid = default_cpu_present_to_apicid,
+
+ .max_apic_id = UINT_MAX,
+ .get_apic_id = numachip1_get_apic_id,
+
+ .calc_dest_apicid = apic_default_calc_apicid,
+
+ .send_IPI = numachip_send_IPI_one,
+ .send_IPI_mask = numachip_send_IPI_mask,
+ .send_IPI_mask_allbutself = numachip_send_IPI_mask_allbutself,
+ .send_IPI_allbutself = numachip_send_IPI_allbutself,
+ .send_IPI_all = numachip_send_IPI_all,
+ .send_IPI_self = numachip_send_IPI_self,
+
+ .wakeup_secondary_cpu = numachip_wakeup_secondary,
+
+ .read = native_apic_mem_read,
+ .write = native_apic_mem_write,
+ .eoi = native_apic_mem_eoi,
+ .icr_read = native_apic_icr_read,
+ .icr_write = native_apic_icr_write,
+};
+
+apic_driver(apic_numachip1);
+
+static const struct apic apic_numachip2 __refconst = {
+ .name = "NumaConnect2 system",
+ .probe = numachip2_probe,
+ .acpi_madt_oem_check = numachip2_acpi_madt_oem_check,
+
+ .dest_mode_logical = false,
+
+ .disable_esr = 0,
+
+ .cpu_present_to_apicid = default_cpu_present_to_apicid,
+
+ .max_apic_id = UINT_MAX,
+ .get_apic_id = numachip2_get_apic_id,
+
+ .calc_dest_apicid = apic_default_calc_apicid,
+
+ .send_IPI = numachip_send_IPI_one,
+ .send_IPI_mask = numachip_send_IPI_mask,
+ .send_IPI_mask_allbutself = numachip_send_IPI_mask_allbutself,
+ .send_IPI_allbutself = numachip_send_IPI_allbutself,
+ .send_IPI_all = numachip_send_IPI_all,
+ .send_IPI_self = numachip_send_IPI_self,
+
+ .wakeup_secondary_cpu = numachip_wakeup_secondary,
+
+ .read = native_apic_mem_read,
+ .write = native_apic_mem_write,
+ .eoi = native_apic_mem_eoi,
+ .icr_read = native_apic_icr_read,
+ .icr_write = native_apic_icr_write,
+};
+
+apic_driver(apic_numachip2);
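NumaChip1 widens the 8-bit xAPIC ID with node bits read from MSR_FAM10H_NODE_ID, which is why both drivers above set .max_apic_id to UINT_MAX. A standalone restatement of the composition done by numachip1_get_apic_id(), with made-up input values:

/* Combine the local xAPIC ID (bits 31:24 of the APIC ID register) with
 * the shifted node ID to form an extended APIC ID. Sketch only. */
static unsigned int numachip1_compose_id(u32 apic_reg, u64 node_msr)
{
	unsigned int id = (apic_reg >> 24) & 0xff;	/* local xAPIC ID */

	id |= (node_msr << 2) & 0xff00;			/* node -> bits 15:8 */
	return id;	/* e.g. apic_reg=0x05000000, node_msr=0x40 -> 0x105 */
}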
diff --git a/arch/x86/kernel/apic/bigsmp_32.c b/arch/x86/kernel/apic/bigsmp_32.c
deleted file mode 100644
index cb804c5091b9..000000000000
--- a/arch/x86/kernel/apic/bigsmp_32.c
+++ /dev/null
@@ -1,259 +0,0 @@
-/*
- * APIC driver for "bigsmp" xAPIC machines with more than 8 virtual CPUs.
- *
- * Drives the local APIC in "clustered mode".
- */
-#include <linux/threads.h>
-#include <linux/cpumask.h>
-#include <linux/kernel.h>
-#include <linux/init.h>
-#include <linux/dmi.h>
-#include <linux/smp.h>
-
-#include <asm/apicdef.h>
-#include <asm/fixmap.h>
-#include <asm/mpspec.h>
-#include <asm/apic.h>
-#include <asm/ipi.h>
-
-static unsigned bigsmp_get_apic_id(unsigned long x)
-{
- return (x >> 24) & 0xFF;
-}
-
-static int bigsmp_apic_id_registered(void)
-{
- return 1;
-}
-
-static const struct cpumask *bigsmp_target_cpus(void)
-{
-#ifdef CONFIG_SMP
- return cpu_online_mask;
-#else
- return cpumask_of(0);
-#endif
-}
-
-static unsigned long bigsmp_check_apicid_used(physid_mask_t *map, int apicid)
-{
- return 0;
-}
-
-static unsigned long bigsmp_check_apicid_present(int bit)
-{
- return 1;
-}
-
-static inline unsigned long calculate_ldr(int cpu)
-{
- unsigned long val, id;
-
- val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
- id = per_cpu(x86_bios_cpu_apicid, cpu);
- val |= SET_APIC_LOGICAL_ID(id);
-
- return val;
-}
-
-/*
- * Set up the logical destination ID.
- *
- * Intel recommends to set DFR, LDR and TPR before enabling
- * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel
- * document number 292116). So here it goes...
- */
-static void bigsmp_init_apic_ldr(void)
-{
- unsigned long val;
- int cpu = smp_processor_id();
-
- apic_write(APIC_DFR, APIC_DFR_FLAT);
- val = calculate_ldr(cpu);
- apic_write(APIC_LDR, val);
-}
-
-static void bigsmp_setup_apic_routing(void)
-{
- printk(KERN_INFO
- "Enabling APIC mode: Physflat. Using %d I/O APICs\n",
- nr_ioapics);
-}
-
-static int bigsmp_apicid_to_node(int logical_apicid)
-{
- return apicid_2_node[hard_smp_processor_id()];
-}
-
-static int bigsmp_cpu_present_to_apicid(int mps_cpu)
-{
- if (mps_cpu < nr_cpu_ids)
- return (int) per_cpu(x86_bios_cpu_apicid, mps_cpu);
-
- return BAD_APICID;
-}
-
-/* Mapping from cpu number to logical apicid */
-static inline int bigsmp_cpu_to_logical_apicid(int cpu)
-{
- if (cpu >= nr_cpu_ids)
- return BAD_APICID;
- return cpu_physical_id(cpu);
-}
-
-static void bigsmp_ioapic_phys_id_map(physid_mask_t *phys_map, physid_mask_t *retmap)
-{
- /* For clustered we don't have a good way to do this yet - hack */
- physids_promote(0xFFL, retmap);
-}
-
-static int bigsmp_check_phys_apicid_present(int phys_apicid)
-{
- return 1;
-}
-
-/* As we are using single CPU as destination, pick only one CPU here */
-static unsigned int bigsmp_cpu_mask_to_apicid(const struct cpumask *cpumask)
-{
- return bigsmp_cpu_to_logical_apicid(cpumask_first(cpumask));
-}
-
-static unsigned int bigsmp_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
- const struct cpumask *andmask)
-{
- int cpu;
-
- /*
- * We're using fixed IRQ delivery, can only return one phys APIC ID.
- * May as well be the first.
- */
- for_each_cpu_and(cpu, cpumask, andmask) {
- if (cpumask_test_cpu(cpu, cpu_online_mask))
- break;
- }
- return bigsmp_cpu_to_logical_apicid(cpu);
-}
-
-static int bigsmp_phys_pkg_id(int cpuid_apic, int index_msb)
-{
- return cpuid_apic >> index_msb;
-}
-
-static inline void bigsmp_send_IPI_mask(const struct cpumask *mask, int vector)
-{
- default_send_IPI_mask_sequence_phys(mask, vector);
-}
-
-static void bigsmp_send_IPI_allbutself(int vector)
-{
- default_send_IPI_mask_allbutself_phys(cpu_online_mask, vector);
-}
-
-static void bigsmp_send_IPI_all(int vector)
-{
- bigsmp_send_IPI_mask(cpu_online_mask, vector);
-}
-
-static int dmi_bigsmp; /* can be set by dmi scanners */
-
-static int hp_ht_bigsmp(const struct dmi_system_id *d)
-{
- printk(KERN_NOTICE "%s detected: force use of apic=bigsmp\n", d->ident);
- dmi_bigsmp = 1;
-
- return 0;
-}
-
-
-static const struct dmi_system_id bigsmp_dmi_table[] = {
- { hp_ht_bigsmp, "HP ProLiant DL760 G2",
- { DMI_MATCH(DMI_BIOS_VENDOR, "HP"),
- DMI_MATCH(DMI_BIOS_VERSION, "P44-"),
- }
- },
-
- { hp_ht_bigsmp, "HP ProLiant DL740",
- { DMI_MATCH(DMI_BIOS_VENDOR, "HP"),
- DMI_MATCH(DMI_BIOS_VERSION, "P47-"),
- }
- },
- { } /* NULL entry stops DMI scanning */
-};
-
-static void bigsmp_vector_allocation_domain(int cpu, struct cpumask *retmask)
-{
- cpumask_clear(retmask);
- cpumask_set_cpu(cpu, retmask);
-}
-
-static int probe_bigsmp(void)
-{
- if (def_to_bigsmp)
- dmi_bigsmp = 1;
- else
- dmi_check_system(bigsmp_dmi_table);
-
- return dmi_bigsmp;
-}
-
-struct apic apic_bigsmp = {
-
- .name = "bigsmp",
- .probe = probe_bigsmp,
- .acpi_madt_oem_check = NULL,
- .apic_id_registered = bigsmp_apic_id_registered,
-
- .irq_delivery_mode = dest_Fixed,
- /* phys delivery to target CPU: */
- .irq_dest_mode = 0,
-
- .target_cpus = bigsmp_target_cpus,
- .disable_esr = 1,
- .dest_logical = 0,
- .check_apicid_used = bigsmp_check_apicid_used,
- .check_apicid_present = bigsmp_check_apicid_present,
-
- .vector_allocation_domain = bigsmp_vector_allocation_domain,
- .init_apic_ldr = bigsmp_init_apic_ldr,
-
- .ioapic_phys_id_map = bigsmp_ioapic_phys_id_map,
- .setup_apic_routing = bigsmp_setup_apic_routing,
- .multi_timer_check = NULL,
- .apicid_to_node = bigsmp_apicid_to_node,
- .cpu_to_logical_apicid = bigsmp_cpu_to_logical_apicid,
- .cpu_present_to_apicid = bigsmp_cpu_present_to_apicid,
- .apicid_to_cpu_present = physid_set_mask_of_physid,
- .setup_portio_remap = NULL,
- .check_phys_apicid_present = bigsmp_check_phys_apicid_present,
- .enable_apic_mode = NULL,
- .phys_pkg_id = bigsmp_phys_pkg_id,
- .mps_oem_check = NULL,
-
- .get_apic_id = bigsmp_get_apic_id,
- .set_apic_id = NULL,
- .apic_id_mask = 0xFF << 24,
-
- .cpu_mask_to_apicid = bigsmp_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = bigsmp_cpu_mask_to_apicid_and,
-
- .send_IPI_mask = bigsmp_send_IPI_mask,
- .send_IPI_mask_allbutself = NULL,
- .send_IPI_allbutself = bigsmp_send_IPI_allbutself,
- .send_IPI_all = bigsmp_send_IPI_all,
- .send_IPI_self = default_send_IPI_self,
-
- .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
- .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
-
- .wait_for_init_deassert = default_wait_for_init_deassert,
-
- .smp_callin_clear_local_apic = NULL,
- .inquire_remote_apic = default_inquire_remote_apic,
-
- .read = native_apic_mem_read,
- .write = native_apic_mem_write,
- .icr_read = native_apic_icr_read,
- .icr_write = native_apic_icr_write,
- .wait_icr_idle = native_apic_wait_icr_idle,
- .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
-};
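One pattern from the deleted driver worth keeping in mind: logical destination setup writes the Destination Format Register before the Logical Destination Register, per the 82489DX guidance quoted in the removed comment. A condensed sketch using the same asm/apicdef.h macros the removed code used:

static void example_init_apic_ldr(unsigned long apicid)
{
	unsigned long val;

	apic_write(APIC_DFR, APIC_DFR_FLAT);		/* flat logical model */
	val  = apic_read(APIC_LDR) & ~APIC_LDR_MASK;	/* keep other bits */
	val |= SET_APIC_LOGICAL_ID(apicid);		/* ID into bits 31:24 */
	apic_write(APIC_LDR, val);
}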
diff --git a/arch/x86/kernel/apic/es7000_32.c b/arch/x86/kernel/apic/es7000_32.c
deleted file mode 100644
index 425e53a87feb..000000000000
--- a/arch/x86/kernel/apic/es7000_32.c
+++ /dev/null
@@ -1,762 +0,0 @@
-/*
- * Written by: Garry Forsgren, Unisys Corporation
- * Natalie Protasevich, Unisys Corporation
- *
- * This file contains the code to configure and interface
- * with Unisys ES7000 series hardware system manager.
- *
- * Copyright (c) 2003 Unisys Corporation.
- * Copyright (C) 2009, Red Hat, Inc., Ingo Molnar
- *
- * All Rights Reserved.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it would be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
- *
- * You should have received a copy of the GNU General Public License along
- * with this program; if not, write the Free Software Foundation, Inc., 59
- * Temple Place - Suite 330, Boston MA 02111-1307, USA.
- *
- * Contact information: Unisys Corporation, Township Line & Union Meeting
- * Roads-A, Unisys Way, Blue Bell, Pennsylvania, 19424, or:
- *
- * http://www.unisys.com
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include <linux/notifier.h>
-#include <linux/spinlock.h>
-#include <linux/cpumask.h>
-#include <linux/threads.h>
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/reboot.h>
-#include <linux/string.h>
-#include <linux/types.h>
-#include <linux/errno.h>
-#include <linux/acpi.h>
-#include <linux/init.h>
-#include <linux/gfp.h>
-#include <linux/nmi.h>
-#include <linux/smp.h>
-#include <linux/io.h>
-
-#include <asm/apicdef.h>
-#include <asm/atomic.h>
-#include <asm/fixmap.h>
-#include <asm/mpspec.h>
-#include <asm/setup.h>
-#include <asm/apic.h>
-#include <asm/ipi.h>
-
-/*
- * ES7000 chipsets
- */
-
-#define NON_UNISYS 0
-#define ES7000_CLASSIC 1
-#define ES7000_ZORRO 2
-
-#define MIP_REG 1
-#define MIP_PSAI_REG 4
-
-#define MIP_BUSY 1
-#define MIP_SPIN 0xf0000
-#define MIP_VALID 0x0100000000000000ULL
-#define MIP_SW_APIC 0x1020b
-
-#define MIP_PORT(val) ((val >> 32) & 0xffff)
-
-#define MIP_RD_LO(val) (val & 0xffffffff)
-
-struct mip_reg {
- unsigned long long off_0x00;
- unsigned long long off_0x08;
- unsigned long long off_0x10;
- unsigned long long off_0x18;
- unsigned long long off_0x20;
- unsigned long long off_0x28;
- unsigned long long off_0x30;
- unsigned long long off_0x38;
-};
-
-struct mip_reg_info {
- unsigned long long mip_info;
- unsigned long long delivery_info;
- unsigned long long host_reg;
- unsigned long long mip_reg;
-};
-
-struct psai {
- unsigned long long entry_type;
- unsigned long long addr;
- unsigned long long bep_addr;
-};
-
-#ifdef CONFIG_ACPI
-
-struct es7000_oem_table {
- struct acpi_table_header Header;
- u32 OEMTableAddr;
- u32 OEMTableSize;
-};
-
-static unsigned long oem_addrX;
-static unsigned long oem_size;
-
-#endif
-
-/*
- * ES7000 Globals
- */
-
-static volatile unsigned long *psai;
-static struct mip_reg *mip_reg;
-static struct mip_reg *host_reg;
-static int mip_port;
-static unsigned long mip_addr;
-static unsigned long host_addr;
-
-int es7000_plat;
-
-/*
- * GSI override for ES7000 platforms.
- */
-
-static unsigned int base;
-
-static int __cpuinit wakeup_secondary_cpu_via_mip(int cpu, unsigned long eip)
-{
- unsigned long vect = 0, psaival = 0;
-
- if (psai == NULL)
- return -1;
-
- vect = ((unsigned long)__pa(eip)/0x1000) << 16;
- psaival = (0x1000000 | vect | cpu);
-
- while (*psai & 0x1000000)
- ;
-
- *psai = psaival;
-
- return 0;
-}
-
-static int es7000_apic_is_cluster(void)
-{
- /* MPENTIUMIII */
- if (boot_cpu_data.x86 == 6 &&
- (boot_cpu_data.x86_model >= 7 && boot_cpu_data.x86_model <= 11))
- return 1;
-
- return 0;
-}
-
-static void setup_unisys(void)
-{
- /*
- * Determine the generation of the ES7000 currently running.
- *
- * es7000_plat = 1 if the machine is a 5xx ES7000 box
- * es7000_plat = 2 if the machine is a x86_64 ES7000 box
- *
- */
- if (!(boot_cpu_data.x86 <= 15 && boot_cpu_data.x86_model <= 2))
- es7000_plat = ES7000_ZORRO;
- else
- es7000_plat = ES7000_CLASSIC;
-}
-
-/*
- * Parse the OEM Table:
- */
-static int parse_unisys_oem(char *oemptr)
-{
- int i;
- int success = 0;
- unsigned char type, size;
- unsigned long val;
- char *tp = NULL;
- struct psai *psaip = NULL;
- struct mip_reg_info *mi;
- struct mip_reg *host, *mip;
-
- tp = oemptr;
-
- tp += 8;
-
- for (i = 0; i <= 6; i++) {
- type = *tp++;
- size = *tp++;
- tp -= 2;
- switch (type) {
- case MIP_REG:
- mi = (struct mip_reg_info *)tp;
- val = MIP_RD_LO(mi->host_reg);
- host_addr = val;
- host = (struct mip_reg *)val;
- host_reg = __va(host);
- val = MIP_RD_LO(mi->mip_reg);
- mip_port = MIP_PORT(mi->mip_info);
- mip_addr = val;
- mip = (struct mip_reg *)val;
- mip_reg = __va(mip);
- pr_debug("host_reg = 0x%lx\n",
- (unsigned long)host_reg);
- pr_debug("mip_reg = 0x%lx\n",
- (unsigned long)mip_reg);
- success++;
- break;
- case MIP_PSAI_REG:
- psaip = (struct psai *)tp;
- if (tp != NULL) {
- if (psaip->addr)
- psai = __va(psaip->addr);
- else
- psai = NULL;
- success++;
- }
- break;
- default:
- break;
- }
- tp += size;
- }
-
- if (success < 2)
- es7000_plat = NON_UNISYS;
- else
- setup_unisys();
-
- return es7000_plat;
-}
-
-#ifdef CONFIG_ACPI
-static int __init find_unisys_acpi_oem_table(unsigned long *oem_addr)
-{
- struct acpi_table_header *header = NULL;
- struct es7000_oem_table *table;
- acpi_size tbl_size;
- acpi_status ret;
- int i = 0;
-
- for (;;) {
- ret = acpi_get_table_with_size("OEM1", i++, &header, &tbl_size);
- if (!ACPI_SUCCESS(ret))
- return -1;
-
- if (!memcmp((char *) &header->oem_id, "UNISYS", 6))
- break;
-
- early_acpi_os_unmap_memory(header, tbl_size);
- }
-
- table = (void *)header;
-
- oem_addrX = table->OEMTableAddr;
- oem_size = table->OEMTableSize;
-
- early_acpi_os_unmap_memory(header, tbl_size);
-
- *oem_addr = (unsigned long)__acpi_map_table(oem_addrX, oem_size);
-
- return 0;
-}
-
-static void __init unmap_unisys_acpi_oem_table(unsigned long oem_addr)
-{
- if (!oem_addr)
- return;
-
- __acpi_unmap_table((char *)oem_addr, oem_size);
-}
-
-static int es7000_check_dsdt(void)
-{
- struct acpi_table_header header;
-
- if (ACPI_SUCCESS(acpi_get_table_header(ACPI_SIG_DSDT, 0, &header)) &&
- !strncmp(header.oem_id, "UNISYS", 6))
- return 1;
- return 0;
-}
-
-static int es7000_acpi_ret;
-
-/* Hook from generic ACPI tables.c */
-static int __init es7000_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
-{
- unsigned long oem_addr = 0;
- int check_dsdt;
- int ret = 0;
-
- /* check dsdt at first to avoid clear fix_map for oem_addr */
- check_dsdt = es7000_check_dsdt();
-
- if (!find_unisys_acpi_oem_table(&oem_addr)) {
- if (check_dsdt) {
- ret = parse_unisys_oem((char *)oem_addr);
- } else {
- setup_unisys();
- ret = 1;
- }
- /*
- * we need to unmap it
- */
- unmap_unisys_acpi_oem_table(oem_addr);
- }
-
- es7000_acpi_ret = ret;
-
- return ret && !es7000_apic_is_cluster();
-}
-
-static int es7000_acpi_madt_oem_check_cluster(char *oem_id, char *oem_table_id)
-{
- int ret = es7000_acpi_ret;
-
- return ret && es7000_apic_is_cluster();
-}
-
-#else /* !CONFIG_ACPI: */
-static int es7000_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
-{
- return 0;
-}
-
-static int es7000_acpi_madt_oem_check_cluster(char *oem_id, char *oem_table_id)
-{
- return 0;
-}
-#endif /* !CONFIG_ACPI */
-
-static void es7000_spin(int n)
-{
- int i = 0;
-
- while (i++ < n)
- rep_nop();
-}
-
-static int es7000_mip_write(struct mip_reg *mip_reg)
-{
- int status = 0;
- int spin;
-
- spin = MIP_SPIN;
- while ((host_reg->off_0x38 & MIP_VALID) != 0) {
- if (--spin <= 0) {
- WARN(1, "Timeout waiting for Host Valid Flag\n");
- return -1;
- }
- es7000_spin(MIP_SPIN);
- }
-
- memcpy(host_reg, mip_reg, sizeof(struct mip_reg));
- outb(1, mip_port);
-
- spin = MIP_SPIN;
-
- while ((mip_reg->off_0x38 & MIP_VALID) == 0) {
- if (--spin <= 0) {
- WARN(1, "Timeout waiting for MIP Valid Flag\n");
- return -1;
- }
- es7000_spin(MIP_SPIN);
- }
-
- status = (mip_reg->off_0x00 & 0xffff0000000000ULL) >> 48;
- mip_reg->off_0x38 &= ~MIP_VALID;
-
- return status;
-}
-
-static void es7000_enable_apic_mode(void)
-{
- struct mip_reg es7000_mip_reg;
- int mip_status;
-
- if (!es7000_plat)
- return;
-
- pr_info("Enabling APIC mode.\n");
- memset(&es7000_mip_reg, 0, sizeof(struct mip_reg));
- es7000_mip_reg.off_0x00 = MIP_SW_APIC;
- es7000_mip_reg.off_0x38 = MIP_VALID;
-
- while ((mip_status = es7000_mip_write(&es7000_mip_reg)) != 0)
- WARN(1, "Command failed, status = %x\n", mip_status);
-}
-
-static void es7000_vector_allocation_domain(int cpu, struct cpumask *retmask)
-{
- /* Careful. Some cpus do not strictly honor the set of cpus
- * specified in the interrupt destination when using lowest
- * priority interrupt delivery mode.
- *
- * In particular there was a hyperthreading cpu observed to
- * deliver interrupts to the wrong hyperthread when only one
- * hyperthread was specified in the interrupt desitination.
- */
- cpumask_clear(retmask);
- cpumask_bits(retmask)[0] = APIC_ALL_CPUS;
-}
-
-
-static void es7000_wait_for_init_deassert(atomic_t *deassert)
-{
- while (!atomic_read(deassert))
- cpu_relax();
-}
-
-static unsigned int es7000_get_apic_id(unsigned long x)
-{
- return (x >> 24) & 0xFF;
-}
-
-static void es7000_send_IPI_mask(const struct cpumask *mask, int vector)
-{
- default_send_IPI_mask_sequence_phys(mask, vector);
-}
-
-static void es7000_send_IPI_allbutself(int vector)
-{
- default_send_IPI_mask_allbutself_phys(cpu_online_mask, vector);
-}
-
-static void es7000_send_IPI_all(int vector)
-{
- es7000_send_IPI_mask(cpu_online_mask, vector);
-}
-
-static int es7000_apic_id_registered(void)
-{
- return 1;
-}
-
-static const struct cpumask *target_cpus_cluster(void)
-{
- return cpu_all_mask;
-}
-
-static const struct cpumask *es7000_target_cpus(void)
-{
- return cpumask_of(smp_processor_id());
-}
-
-static unsigned long es7000_check_apicid_used(physid_mask_t *map, int apicid)
-{
- return 0;
-}
-
-static unsigned long es7000_check_apicid_present(int bit)
-{
- return physid_isset(bit, phys_cpu_present_map);
-}
-
-static unsigned long calculate_ldr(int cpu)
-{
- unsigned long id = per_cpu(x86_bios_cpu_apicid, cpu);
-
- return SET_APIC_LOGICAL_ID(id);
-}
-
-/*
- * Set up the logical destination ID.
- *
- * Intel recommends to set DFR, LdR and TPR before enabling
- * an APIC. See e.g. "AP-388 82489DX User's Manual" (Intel
- * document number 292116). So here it goes...
- */
-static void es7000_init_apic_ldr_cluster(void)
-{
- unsigned long val;
- int cpu = smp_processor_id();
-
- apic_write(APIC_DFR, APIC_DFR_CLUSTER);
- val = calculate_ldr(cpu);
- apic_write(APIC_LDR, val);
-}
-
-static void es7000_init_apic_ldr(void)
-{
- unsigned long val;
- int cpu = smp_processor_id();
-
- apic_write(APIC_DFR, APIC_DFR_FLAT);
- val = calculate_ldr(cpu);
- apic_write(APIC_LDR, val);
-}
-
-static void es7000_setup_apic_routing(void)
-{
- int apic = per_cpu(x86_bios_cpu_apicid, smp_processor_id());
-
- pr_info("Enabling APIC mode: %s. Using %d I/O APICs, target cpus %lx\n",
- (apic_version[apic] == 0x14) ?
- "Physical Cluster" : "Logical Cluster",
- nr_ioapics, cpumask_bits(es7000_target_cpus())[0]);
-}
-
-static int es7000_apicid_to_node(int logical_apicid)
-{
- return 0;
-}
-
-
-static int es7000_cpu_present_to_apicid(int mps_cpu)
-{
- if (!mps_cpu)
- return boot_cpu_physical_apicid;
- else if (mps_cpu < nr_cpu_ids)
- return per_cpu(x86_bios_cpu_apicid, mps_cpu);
- else
- return BAD_APICID;
-}
-
-static int cpu_id;
-
-static void es7000_apicid_to_cpu_present(int phys_apicid, physid_mask_t *retmap)
-{
- physid_set_mask_of_physid(cpu_id, retmap);
- ++cpu_id;
-}
-
-/* Mapping from cpu number to logical apicid */
-static int es7000_cpu_to_logical_apicid(int cpu)
-{
-#ifdef CONFIG_SMP
- if (cpu >= nr_cpu_ids)
- return BAD_APICID;
- return cpu_2_logical_apicid[cpu];
-#else
- return logical_smp_processor_id();
-#endif
-}
-
-static void es7000_ioapic_phys_id_map(physid_mask_t *phys_map, physid_mask_t *retmap)
-{
- /* For clustered we don't have a good way to do this yet - hack */
- physids_promote(0xFFL, retmap);
-}
-
-static int es7000_check_phys_apicid_present(int cpu_physical_apicid)
-{
- boot_cpu_physical_apicid = read_apic_id();
- return 1;
-}
-
-static unsigned int es7000_cpu_mask_to_apicid(const struct cpumask *cpumask)
-{
- unsigned int round = 0;
- int cpu, uninitialized_var(apicid);
-
- /*
- * The cpus in the mask must all be on the apic cluster.
- */
- for_each_cpu(cpu, cpumask) {
- int new_apicid = es7000_cpu_to_logical_apicid(cpu);
-
- if (round && APIC_CLUSTER(apicid) != APIC_CLUSTER(new_apicid)) {
- WARN(1, "Not a valid mask!");
-
- return BAD_APICID;
- }
- apicid = new_apicid;
- round++;
- }
- return apicid;
-}
-
-static unsigned int
-es7000_cpu_mask_to_apicid_and(const struct cpumask *inmask,
- const struct cpumask *andmask)
-{
- int apicid = es7000_cpu_to_logical_apicid(0);
- cpumask_var_t cpumask;
-
- if (!alloc_cpumask_var(&cpumask, GFP_ATOMIC))
- return apicid;
-
- cpumask_and(cpumask, inmask, andmask);
- cpumask_and(cpumask, cpumask, cpu_online_mask);
- apicid = es7000_cpu_mask_to_apicid(cpumask);
-
- free_cpumask_var(cpumask);
-
- return apicid;
-}
-
-static int es7000_phys_pkg_id(int cpuid_apic, int index_msb)
-{
- return cpuid_apic >> index_msb;
-}
-
-static int probe_es7000(void)
-{
- /* probed later in mptable/ACPI hooks */
- return 0;
-}
-
-static int es7000_mps_ret;
-static int es7000_mps_oem_check(struct mpc_table *mpc, char *oem,
- char *productid)
-{
- int ret = 0;
-
- if (mpc->oemptr) {
- struct mpc_oemtable *oem_table =
- (struct mpc_oemtable *)mpc->oemptr;
-
- if (!strncmp(oem, "UNISYS", 6))
- ret = parse_unisys_oem((char *)oem_table);
- }
-
- es7000_mps_ret = ret;
-
- return ret && !es7000_apic_is_cluster();
-}
-
-static int es7000_mps_oem_check_cluster(struct mpc_table *mpc, char *oem,
- char *productid)
-{
- int ret = es7000_mps_ret;
-
- return ret && es7000_apic_is_cluster();
-}
-
-/* We've been warned by a false positive warning.Use __refdata to keep calm. */
-struct apic __refdata apic_es7000_cluster = {
-
- .name = "es7000",
- .probe = probe_es7000,
- .acpi_madt_oem_check = es7000_acpi_madt_oem_check_cluster,
- .apic_id_registered = es7000_apic_id_registered,
-
- .irq_delivery_mode = dest_LowestPrio,
- /* logical delivery broadcast to all procs: */
- .irq_dest_mode = 1,
-
- .target_cpus = target_cpus_cluster,
- .disable_esr = 1,
- .dest_logical = 0,
- .check_apicid_used = es7000_check_apicid_used,
- .check_apicid_present = es7000_check_apicid_present,
-
- .vector_allocation_domain = es7000_vector_allocation_domain,
- .init_apic_ldr = es7000_init_apic_ldr_cluster,
-
- .ioapic_phys_id_map = es7000_ioapic_phys_id_map,
- .setup_apic_routing = es7000_setup_apic_routing,
- .multi_timer_check = NULL,
- .apicid_to_node = es7000_apicid_to_node,
- .cpu_to_logical_apicid = es7000_cpu_to_logical_apicid,
- .cpu_present_to_apicid = es7000_cpu_present_to_apicid,
- .apicid_to_cpu_present = es7000_apicid_to_cpu_present,
- .setup_portio_remap = NULL,
- .check_phys_apicid_present = es7000_check_phys_apicid_present,
- .enable_apic_mode = es7000_enable_apic_mode,
- .phys_pkg_id = es7000_phys_pkg_id,
- .mps_oem_check = es7000_mps_oem_check_cluster,
-
- .get_apic_id = es7000_get_apic_id,
- .set_apic_id = NULL,
- .apic_id_mask = 0xFF << 24,
-
- .cpu_mask_to_apicid = es7000_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = es7000_cpu_mask_to_apicid_and,
-
- .send_IPI_mask = es7000_send_IPI_mask,
- .send_IPI_mask_allbutself = NULL,
- .send_IPI_allbutself = es7000_send_IPI_allbutself,
- .send_IPI_all = es7000_send_IPI_all,
- .send_IPI_self = default_send_IPI_self,
-
- .wakeup_secondary_cpu = wakeup_secondary_cpu_via_mip,
-
- .trampoline_phys_low = 0x467,
- .trampoline_phys_high = 0x469,
-
- .wait_for_init_deassert = NULL,
-
- /* Nothing to do for most platforms, since cleared by the INIT cycle: */
- .smp_callin_clear_local_apic = NULL,
- .inquire_remote_apic = default_inquire_remote_apic,
-
- .read = native_apic_mem_read,
- .write = native_apic_mem_write,
- .icr_read = native_apic_icr_read,
- .icr_write = native_apic_icr_write,
- .wait_icr_idle = native_apic_wait_icr_idle,
- .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
-};
-
-struct apic __refdata apic_es7000 = {
-
- .name = "es7000",
- .probe = probe_es7000,
- .acpi_madt_oem_check = es7000_acpi_madt_oem_check,
- .apic_id_registered = es7000_apic_id_registered,
-
- .irq_delivery_mode = dest_Fixed,
- /* phys delivery to target CPUs: */
- .irq_dest_mode = 0,
-
- .target_cpus = es7000_target_cpus,
- .disable_esr = 1,
- .dest_logical = 0,
- .check_apicid_used = es7000_check_apicid_used,
- .check_apicid_present = es7000_check_apicid_present,
-
- .vector_allocation_domain = es7000_vector_allocation_domain,
- .init_apic_ldr = es7000_init_apic_ldr,
-
- .ioapic_phys_id_map = es7000_ioapic_phys_id_map,
- .setup_apic_routing = es7000_setup_apic_routing,
- .multi_timer_check = NULL,
- .apicid_to_node = es7000_apicid_to_node,
- .cpu_to_logical_apicid = es7000_cpu_to_logical_apicid,
- .cpu_present_to_apicid = es7000_cpu_present_to_apicid,
- .apicid_to_cpu_present = es7000_apicid_to_cpu_present,
- .setup_portio_remap = NULL,
- .check_phys_apicid_present = es7000_check_phys_apicid_present,
- .enable_apic_mode = es7000_enable_apic_mode,
- .phys_pkg_id = es7000_phys_pkg_id,
- .mps_oem_check = es7000_mps_oem_check,
-
- .get_apic_id = es7000_get_apic_id,
- .set_apic_id = NULL,
- .apic_id_mask = 0xFF << 24,
-
- .cpu_mask_to_apicid = es7000_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = es7000_cpu_mask_to_apicid_and,
-
- .send_IPI_mask = es7000_send_IPI_mask,
- .send_IPI_mask_allbutself = NULL,
- .send_IPI_allbutself = es7000_send_IPI_allbutself,
- .send_IPI_all = es7000_send_IPI_all,
- .send_IPI_self = default_send_IPI_self,
-
- .trampoline_phys_low = 0x467,
- .trampoline_phys_high = 0x469,
-
- .wait_for_init_deassert = es7000_wait_for_init_deassert,
-
- /* Nothing to do for most platforms, since cleared by the INIT cycle: */
- .smp_callin_clear_local_apic = NULL,
- .inquire_remote_apic = default_inquire_remote_apic,
-
- .read = native_apic_mem_read,
- .write = native_apic_mem_write,
- .icr_read = native_apic_icr_read,
- .icr_write = native_apic_icr_write,
- .wait_icr_idle = native_apic_wait_icr_idle,
- .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
-};
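The removed wakeup path spoke a small mailbox protocol to the ES7000 system manager: wait for the host-side valid flag to clear, post the request, ring the doorbell port, then poll for the response valid flag. A condensed, illustrative restatement (the real es7000_mip_write() bounded both spins with MIP_SPIN and warned on timeout; the status extraction is simplified):

static int example_mip_transaction(struct mip_reg *request)
{
	while (host_reg->off_0x38 & MIP_VALID)		/* prior request busy */
		cpu_relax();

	memcpy(host_reg, request, sizeof(*request));	/* post request */
	outb(1, mip_port);				/* ring doorbell */

	while (!(mip_reg->off_0x38 & MIP_VALID))	/* await response */
		cpu_relax();

	mip_reg->off_0x38 &= ~MIP_VALID;		/* acknowledge */
	return (mip_reg->off_0x00 >> 48) & 0xff;	/* status field */
}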
diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
new file mode 100644
index 000000000000..45af535c44a0
--- /dev/null
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * HW NMI watchdog support
+ *
+ * started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
+ *
+ * Arch specific calls to support NMI watchdog
+ *
+ * Bits copied from original nmi.c file
+ *
+ */
+#include <linux/thread_info.h>
+#include <asm/apic.h>
+#include <asm/nmi.h>
+
+#include <linux/cpumask.h>
+#include <linux/kdebug.h>
+#include <linux/notifier.h>
+#include <linux/kprobes.h>
+#include <linux/nmi.h>
+#include <linux/init.h>
+#include <linux/delay.h>
+
+#include "local.h"
+
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
+u64 hw_nmi_get_sample_period(int watchdog_thresh)
+{
+ return (u64)(cpu_khz) * 1000 * watchdog_thresh;
+}
+#endif
+
+#ifdef arch_trigger_cpumask_backtrace
+static void nmi_raise_cpu_backtrace(cpumask_t *mask)
+{
+ __apic_send_IPI_mask(mask, NMI_VECTOR);
+}
+
+void arch_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu)
+{
+ nmi_trigger_cpumask_backtrace(mask, exclude_cpu,
+ nmi_raise_cpu_backtrace);
+}
+
+static int nmi_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
+{
+ if (nmi_cpu_backtrace(regs))
+ return NMI_HANDLED;
+
+ return NMI_DONE;
+}
+NOKPROBE_SYMBOL(nmi_cpu_backtrace_handler);
+
+static int __init register_nmi_cpu_backtrace_handler(void)
+{
+ register_nmi_handler(NMI_LOCAL, nmi_cpu_backtrace_handler,
+ 0, "arch_bt");
+ return 0;
+}
+early_initcall(register_nmi_cpu_backtrace_handler);
+#endif
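The arithmetic in hw_nmi_get_sample_period() above converts the hardlockup threshold from seconds into CPU cycles for the perf event that drives the watchdog NMI. A worked example with assumed numbers:

/*
 * cpu_khz is the CPU frequency in kHz, so cpu_khz * 1000 is cycles per
 * second; multiplying by watchdog_thresh (seconds) gives the period:
 *
 *   cpu_khz = 2,400,000 (2.4 GHz), watchdog_thresh = 10
 *   period  = 2,400,000 * 1000 * 10 = 24,000,000,000 cycles  (10 s)
 *
 * The perf NMI fires once per period and the handler checks whether the
 * CPU made forward progress since the previous sample.
 */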
diff --git a/arch/x86/kernel/apic/init.c b/arch/x86/kernel/apic/init.c
new file mode 100644
index 000000000000..821e2e536f19
--- /dev/null
+++ b/arch/x86/kernel/apic/init.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define pr_fmt(fmt) "APIC: " fmt
+
+#include <asm/apic.h>
+
+#include "local.h"
+
+/*
+ * Use DEFINE_STATIC_CALL_NULL() to avoid having to provide stub functions
+ * for each callback. The callbacks are setup during boot and all except
+ * wait_icr_idle() must be initialized before usage. The IPI wrappers
+ * use static_call() and not static_call_cond() to catch any fails.
+ */
+#define DEFINE_APIC_CALL(__cb) \
+ DEFINE_STATIC_CALL_NULL(apic_call_##__cb, *apic->__cb)
+
+DEFINE_APIC_CALL(eoi);
+DEFINE_APIC_CALL(native_eoi);
+DEFINE_APIC_CALL(icr_read);
+DEFINE_APIC_CALL(icr_write);
+DEFINE_APIC_CALL(read);
+DEFINE_APIC_CALL(send_IPI);
+DEFINE_APIC_CALL(send_IPI_mask);
+DEFINE_APIC_CALL(send_IPI_mask_allbutself);
+DEFINE_APIC_CALL(send_IPI_allbutself);
+DEFINE_APIC_CALL(send_IPI_all);
+DEFINE_APIC_CALL(send_IPI_self);
+DEFINE_APIC_CALL(wait_icr_idle);
+DEFINE_APIC_CALL(wakeup_secondary_cpu);
+DEFINE_APIC_CALL(wakeup_secondary_cpu_64);
+DEFINE_APIC_CALL(write);
+
+EXPORT_STATIC_CALL_TRAMP_GPL(apic_call_send_IPI_mask);
+EXPORT_STATIC_CALL_TRAMP_GPL(apic_call_send_IPI_self);
+
+/* The container for function call overrides */
+struct apic_override __x86_apic_override __initdata;
+
+#define apply_override(__cb) \
+ if (__x86_apic_override.__cb) \
+ apic->__cb = __x86_apic_override.__cb
+
+static __init void restore_override_callbacks(void)
+{
+ apply_override(eoi);
+ apply_override(native_eoi);
+ apply_override(write);
+ apply_override(read);
+ apply_override(send_IPI);
+ apply_override(send_IPI_mask);
+ apply_override(send_IPI_mask_allbutself);
+ apply_override(send_IPI_allbutself);
+ apply_override(send_IPI_all);
+ apply_override(send_IPI_self);
+ apply_override(icr_read);
+ apply_override(icr_write);
+ apply_override(wakeup_secondary_cpu);
+ apply_override(wakeup_secondary_cpu_64);
+}
+
+#define update_call(__cb) \
+ static_call_update(apic_call_##__cb, *apic->__cb)
+
+static __init void update_static_calls(void)
+{
+ update_call(eoi);
+ update_call(native_eoi);
+ update_call(write);
+ update_call(read);
+ update_call(send_IPI);
+ update_call(send_IPI_mask);
+ update_call(send_IPI_mask_allbutself);
+ update_call(send_IPI_allbutself);
+ update_call(send_IPI_all);
+ update_call(send_IPI_self);
+ update_call(icr_read);
+ update_call(icr_write);
+ update_call(wait_icr_idle);
+ update_call(wakeup_secondary_cpu);
+ update_call(wakeup_secondary_cpu_64);
+}
+
+void __init apic_setup_apic_calls(void)
+{
+ /* Ensure that the default APIC has native_eoi populated */
+ apic->native_eoi = apic->eoi;
+ update_static_calls();
+ pr_info("Static calls initialized\n");
+}
+
+void __init apic_install_driver(struct apic *driver)
+{
+ if (apic == driver)
+ return;
+
+ apic = driver;
+
+ if (IS_ENABLED(CONFIG_X86_X2APIC) && apic->x2apic_set_max_apicid)
+ apic->max_apic_id = x2apic_max_apicid;
+
+ /* Copy the original eoi() callback as KVM/HyperV might overwrite it */
+ if (!apic->native_eoi)
+ apic->native_eoi = apic->eoi;
+
+ /* Apply any already installed callback overrides */
+ restore_override_callbacks();
+ update_static_calls();
+
+ pr_info("Switched APIC routing to: %s\n", driver->name);
+}
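For readers new to the machinery used above: DEFINE_STATIC_CALL_NULL() emits a patchable call site that must be pointed at a real function with static_call_update() before use, which is exactly what update_static_calls() does. A minimal self-contained sketch of the idiom with illustrative names:

#include <linux/printk.h>
#include <linux/static_call.h>

static void real_handler(int cpu)
{
	pr_info("handling on CPU %d\n", cpu);
}

/* Starts out NULL: invoking it before the update below would trap. */
DEFINE_STATIC_CALL_NULL(my_handler, real_handler);

static void wire_up(void)
{
	/* Patches the call site(s) in the text; no indirect branch remains. */
	static_call_update(my_handler, real_handler);
}

static void caller(int cpu)
{
	static_call(my_handler)(cpu);	/* direct call once wired up */
}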
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index e41ed24ab26d..28f934f05a85 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Intel IO-APIC support for multi-Pentium hosts.
*
@@ -18,10 +19,21 @@
* and Rolf G. Tews
* for testing these extensively
* Paul Diefenbaugh : Added full ACPI support
+ *
+ * Historical information worth preserving:
+ *
+ * - SiS APIC rmw bug:
+ *
+ *	We used to have a workaround for a bug in SiS chips which
+ *	required rewriting the index register for a read-modify-write
+ *	operation because the chip lost the index information that was
+ *	already set up for the read. We cache the data now, so that
+ *	workaround has been removed.
*/
#include <linux/mm.h>
#include <linux/interrupt.h>
+#include <linux/irq.h>
#include <linux/init.h>
#include <linux/delay.h>
#include <linux/sched.h>
@@ -29,22 +41,16 @@
#include <linux/mc146818rtc.h>
#include <linux/compiler.h>
#include <linux/acpi.h>
-#include <linux/module.h>
-#include <linux/sysdev.h>
-#include <linux/msi.h>
-#include <linux/htirq.h>
+#include <linux/export.h>
+#include <linux/syscore_ops.h>
#include <linux/freezer.h>
#include <linux/kthread.h>
#include <linux/jiffies.h> /* time_after() */
#include <linux/slab.h>
-#ifdef CONFIG_ACPI
-#include <acpi/acpi_bus.h>
-#endif
-#include <linux/bootmem.h>
-#include <linux/dmar.h>
-#include <linux/hpet.h>
+#include <linux/memblock.h>
+#include <linux/msi.h>
-#include <asm/idle.h>
+#include <asm/irqdomain.h>
#include <asm/io.h>
#include <asm/smp.h>
#include <asm/cpu.h>
@@ -53,295 +59,194 @@
#include <asm/acpi.h>
#include <asm/dma.h>
#include <asm/timer.h>
+#include <asm/time.h>
#include <asm/i8259.h>
-#include <asm/nmi.h>
-#include <asm/msidef.h>
-#include <asm/hypertransport.h>
#include <asm/setup.h>
#include <asm/irq_remapping.h>
-#include <asm/hpet.h>
#include <asm/hw_irq.h>
-
#include <asm/apic.h>
-
-#define __apicdebuginit(type) static type __init
+#include <asm/pgtable.h>
+#include <asm/x86_init.h>
+
+#define for_each_ioapic(idx) \
+ for ((idx) = 0; (idx) < nr_ioapics; (idx)++)
+#define for_each_ioapic_reverse(idx) \
+ for ((idx) = nr_ioapics - 1; (idx) >= 0; (idx)--)
+#define for_each_pin(idx, pin) \
+ for ((pin) = 0; (pin) < ioapics[(idx)].nr_registers; (pin)++)
+#define for_each_ioapic_pin(idx, pin) \
+ for_each_ioapic((idx)) \
+ for_each_pin((idx), (pin))
#define for_each_irq_pin(entry, head) \
- for (entry = head; entry; entry = entry->next)
-
-/*
- * Is the SiS APIC rmw bug present ?
- * -1 = don't know, 0 = no, 1 = yes
- */
-int sis_apic_bug = -1;
+ list_for_each_entry(entry, &head, list)
static DEFINE_RAW_SPINLOCK(ioapic_lock);
-static DEFINE_RAW_SPINLOCK(vector_lock);
-
-/*
- * # of IRQ routing registers
- */
-int nr_ioapic_registers[MAX_IO_APICS];
-
-/* I/O APIC entries */
-struct mpc_ioapic mp_ioapics[MAX_IO_APICS];
-int nr_ioapics;
+static DEFINE_MUTEX(ioapic_mutex);
+static unsigned int ioapic_dynirq_base;
+static int ioapic_initialized;
-/* IO APIC gsi routing info */
-struct mp_ioapic_gsi mp_gsi_routing[MAX_IO_APICS];
-
-/* The one past the highest gsi number used */
-u32 gsi_top;
-
-/* MP IRQ source entries */
-struct mpc_intsrc mp_irqs[MAX_IRQ_SOURCES];
-
-/* # of MP IRQ source entries */
-int mp_irq_entries;
+struct irq_pin_list {
+ struct list_head list;
+ int apic, pin;
+};
-/* GSI interrupts */
-static int nr_irqs_gsi = NR_IRQS_LEGACY;
+struct mp_chip_data {
+ struct list_head irq_2_pin;
+ struct IO_APIC_route_entry entry;
+ bool is_level;
+ bool active_low;
+ bool isa_irq;
+ u32 count;
+};
-#if defined (CONFIG_MCA) || defined (CONFIG_EISA)
-int mp_bus_id_to_type[MAX_MP_BUSSES];
-#endif
+struct mp_ioapic_gsi {
+ u32 gsi_base;
+ u32 gsi_end;
+};
-DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
+static struct ioapic {
+ /* # of IRQ routing registers */
+ int nr_registers;
+ /* Saved state during suspend/resume, or while enabling intr-remap. */
+ struct IO_APIC_route_entry *saved_registers;
+ /* I/O APIC config */
+ struct mpc_ioapic mp_config;
+ /* IO APIC gsi routing info */
+ struct mp_ioapic_gsi gsi_config;
+ struct ioapic_domain_cfg irqdomain_cfg;
+ struct irq_domain *irqdomain;
+ struct resource *iomem_res;
+} ioapics[MAX_IO_APICS];
-int skip_ioapic_setup;
+#define mpc_ioapic_ver(ioapic_idx) ioapics[ioapic_idx].mp_config.apicver
-void arch_disable_smp_support(void)
+int mpc_ioapic_id(int ioapic_idx)
{
-#ifdef CONFIG_PCI
- noioapicquirk = 1;
- noioapicreroute = -1;
-#endif
- skip_ioapic_setup = 1;
+ return ioapics[ioapic_idx].mp_config.apicid;
}
-static int __init parse_noapic(char *str)
+unsigned int mpc_ioapic_addr(int ioapic_idx)
{
- /* disable IO-APIC */
- arch_disable_smp_support();
- return 0;
+ return ioapics[ioapic_idx].mp_config.apicaddr;
}
-early_param("noapic", parse_noapic);
-struct irq_pin_list {
- int apic, pin;
- struct irq_pin_list *next;
-};
+static inline struct mp_ioapic_gsi *mp_ioapic_gsi_routing(int ioapic_idx)
+{
+ return &ioapics[ioapic_idx].gsi_config;
+}
-static struct irq_pin_list *get_one_free_irq_2_pin(int node)
+static inline int mp_ioapic_pin_count(int ioapic)
{
- struct irq_pin_list *pin;
+ struct mp_ioapic_gsi *gsi_cfg = mp_ioapic_gsi_routing(ioapic);
- pin = kzalloc_node(sizeof(*pin), GFP_ATOMIC, node);
+ return gsi_cfg->gsi_end - gsi_cfg->gsi_base + 1;
+}
- return pin;
+static inline u32 mp_pin_to_gsi(int ioapic, int pin)
+{
+ return mp_ioapic_gsi_routing(ioapic)->gsi_base + pin;
}
-/* irq_cfg is indexed by the sum of all RTEs in all I/O APICs. */
-#ifdef CONFIG_SPARSE_IRQ
-static struct irq_cfg irq_cfgx[NR_IRQS_LEGACY];
-#else
-static struct irq_cfg irq_cfgx[NR_IRQS];
-#endif
+static inline bool mp_is_legacy_irq(int irq)
+{
+ return irq >= 0 && irq < nr_legacy_irqs();
+}
-int __init arch_early_irq_init(void)
+static inline struct irq_domain *mp_ioapic_irqdomain(int ioapic)
{
- struct irq_cfg *cfg;
- struct irq_desc *desc;
- int count;
- int node;
- int i;
+ return ioapics[ioapic].irqdomain;
+}
- if (!legacy_pic->nr_legacy_irqs) {
- nr_irqs_gsi = 0;
- io_apic_irqs = ~0UL;
- }
+int nr_ioapics;
- cfg = irq_cfgx;
- count = ARRAY_SIZE(irq_cfgx);
- node= cpu_to_node(boot_cpu_id);
+/* The one past the highest gsi number used */
+u32 gsi_top;
- for (i = 0; i < count; i++) {
- desc = irq_to_desc(i);
- desc->chip_data = &cfg[i];
- zalloc_cpumask_var_node(&cfg[i].domain, GFP_NOWAIT, node);
- zalloc_cpumask_var_node(&cfg[i].old_domain, GFP_NOWAIT, node);
- /*
- * For legacy IRQ's, start with assigning irq0 to irq15 to
- * IRQ0_VECTOR to IRQ15_VECTOR on cpu 0.
- */
- if (i < legacy_pic->nr_legacy_irqs) {
- cfg[i].vector = IRQ0_VECTOR + i;
- cpumask_set_cpu(0, cfg[i].domain);
- }
- }
+/* MP IRQ source entries */
+struct mpc_intsrc mp_irqs[MAX_IRQ_SOURCES];
- return 0;
-}
+/* # of MP IRQ source entries */
+int mp_irq_entries;
-#ifdef CONFIG_SPARSE_IRQ
-struct irq_cfg *irq_cfg(unsigned int irq)
-{
- struct irq_cfg *cfg = NULL;
- struct irq_desc *desc;
+#ifdef CONFIG_EISA
+int mp_bus_id_to_type[MAX_MP_BUSSES];
+#endif
- desc = irq_to_desc(irq);
- if (desc)
- cfg = desc->chip_data;
+DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
- return cfg;
-}
+bool ioapic_is_disabled __ro_after_init;
-static struct irq_cfg *get_one_free_irq_cfg(int node)
+/**
+ * disable_ioapic_support() - disables ioapic support at runtime
+ */
+void disable_ioapic_support(void)
{
- struct irq_cfg *cfg;
-
- cfg = kzalloc_node(sizeof(*cfg), GFP_ATOMIC, node);
- if (cfg) {
- if (!zalloc_cpumask_var_node(&cfg->domain, GFP_ATOMIC, node)) {
- kfree(cfg);
- cfg = NULL;
- } else if (!zalloc_cpumask_var_node(&cfg->old_domain,
- GFP_ATOMIC, node)) {
- free_cpumask_var(cfg->domain);
- kfree(cfg);
- cfg = NULL;
- }
- }
-
- return cfg;
+#ifdef CONFIG_PCI
+ noioapicquirk = 1;
+ noioapicreroute = -1;
+#endif
+ ioapic_is_disabled = true;
}
-int arch_init_chip_data(struct irq_desc *desc, int node)
+static int __init parse_noapic(char *str)
{
- struct irq_cfg *cfg;
-
- cfg = desc->chip_data;
- if (!cfg) {
- desc->chip_data = get_one_free_irq_cfg(node);
- if (!desc->chip_data) {
- printk(KERN_ERR "can not alloc irq_cfg\n");
- BUG_ON(1);
- }
- }
-
+ /* disable IO-APIC */
+ disable_ioapic_support();
return 0;
}
+early_param("noapic", parse_noapic);
-/* for move_irq_desc */
-static void
-init_copy_irq_2_pin(struct irq_cfg *old_cfg, struct irq_cfg *cfg, int node)
+/* Will be called in mpparse/ACPI code for saving IRQ info */
+void mp_save_irq(struct mpc_intsrc *m)
{
- struct irq_pin_list *old_entry, *head, *tail, *entry;
-
- cfg->irq_2_pin = NULL;
- old_entry = old_cfg->irq_2_pin;
- if (!old_entry)
- return;
+ int i;
- entry = get_one_free_irq_2_pin(node);
- if (!entry)
- return;
+ apic_pr_verbose("Int: type %d, pol %d, trig %d, bus %02x, IRQ %02x, APIC ID %x, APIC INT %02x\n",
+ m->irqtype, m->irqflag & 3, (m->irqflag >> 2) & 3, m->srcbus,
+ m->srcbusirq, m->dstapic, m->dstirq);
- entry->apic = old_entry->apic;
- entry->pin = old_entry->pin;
- head = entry;
- tail = entry;
- old_entry = old_entry->next;
- while (old_entry) {
- entry = get_one_free_irq_2_pin(node);
- if (!entry) {
- entry = head;
- while (entry) {
- head = entry->next;
- kfree(entry);
- entry = head;
- }
- /* still use the old one */
+ for (i = 0; i < mp_irq_entries; i++) {
+ if (!memcmp(&mp_irqs[i], m, sizeof(*m)))
return;
- }
- entry->apic = old_entry->apic;
- entry->pin = old_entry->pin;
- tail->next = entry;
- tail = entry;
- old_entry = old_entry->next;
}
- tail->next = NULL;
- cfg->irq_2_pin = head;
+ memcpy(&mp_irqs[mp_irq_entries], m, sizeof(*m));
+ if (++mp_irq_entries == MAX_IRQ_SOURCES)
+ panic("Max # of irq sources exceeded!!\n");
}
-static void free_irq_2_pin(struct irq_cfg *old_cfg, struct irq_cfg *cfg)
+static void alloc_ioapic_saved_registers(int idx)
{
- struct irq_pin_list *entry, *next;
+ size_t size;
- if (old_cfg->irq_2_pin == cfg->irq_2_pin)
+ if (ioapics[idx].saved_registers)
return;
- entry = old_cfg->irq_2_pin;
-
- while (entry) {
- next = entry->next;
- kfree(entry);
- entry = next;
- }
- old_cfg->irq_2_pin = NULL;
+ size = sizeof(struct IO_APIC_route_entry) * ioapics[idx].nr_registers;
+ ioapics[idx].saved_registers = kzalloc(size, GFP_KERNEL);
+ if (!ioapics[idx].saved_registers)
+ pr_err("IOAPIC %d: suspend/resume impossible!\n", idx);
}
-void arch_init_copy_chip_data(struct irq_desc *old_desc,
- struct irq_desc *desc, int node)
+static void free_ioapic_saved_registers(int idx)
{
- struct irq_cfg *cfg;
- struct irq_cfg *old_cfg;
-
- cfg = get_one_free_irq_cfg(node);
-
- if (!cfg)
- return;
-
- desc->chip_data = cfg;
-
- old_cfg = old_desc->chip_data;
-
- memcpy(cfg, old_cfg, sizeof(struct irq_cfg));
-
- init_copy_irq_2_pin(old_cfg, cfg, node);
-}
-
-static void free_irq_cfg(struct irq_cfg *old_cfg)
-{
- kfree(old_cfg);
+ kfree(ioapics[idx].saved_registers);
+ ioapics[idx].saved_registers = NULL;
}
-void arch_free_chip_data(struct irq_desc *old_desc, struct irq_desc *desc)
+int __init arch_early_ioapic_init(void)
{
- struct irq_cfg *old_cfg, *cfg;
-
- old_cfg = old_desc->chip_data;
- cfg = desc->chip_data;
+ int i;
- if (old_cfg == cfg)
- return;
+ if (!nr_legacy_irqs())
+ io_apic_irqs = ~0UL;
- if (old_cfg) {
- free_irq_2_pin(old_cfg, cfg);
- free_irq_cfg(old_cfg);
- old_desc->chip_data = NULL;
- }
-}
-/* end for move_irq_desc */
+ for_each_ioapic(i)
+ alloc_ioapic_saved_registers(i);
-#else
-struct irq_cfg *irq_cfg(unsigned int irq)
-{
- return irq < nr_irqs ? irq_cfgx + irq : NULL;
+ return 0;
}
-#endif
-
struct io_apic {
unsigned int index;
unsigned int unused[3];
@@ -353,81 +258,47 @@ struct io_apic {
static __attribute_const__ struct io_apic __iomem *io_apic_base(int idx)
{
return (void __iomem *) __fix_to_virt(FIX_IO_APIC_BASE_0 + idx)
- + (mp_ioapics[idx].apicaddr & ~PAGE_MASK);
+ + (mpc_ioapic_addr(idx) & ~PAGE_MASK);
}
static inline void io_apic_eoi(unsigned int apic, unsigned int vector)
{
struct io_apic __iomem *io_apic = io_apic_base(apic);
+
writel(vector, &io_apic->eoi);
}
-static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
+unsigned int native_io_apic_read(unsigned int apic, unsigned int reg)
{
struct io_apic __iomem *io_apic = io_apic_base(apic);
- writel(reg, &io_apic->index);
- return readl(&io_apic->data);
-}
-static inline void io_apic_write(unsigned int apic, unsigned int reg, unsigned int value)
-{
- struct io_apic __iomem *io_apic = io_apic_base(apic);
writel(reg, &io_apic->index);
- writel(value, &io_apic->data);
+ return readl(&io_apic->data);
}
-/*
- * Re-write a value: to be used for read-modify-write
- * cycles where the read already set up the index register.
- *
- * Older SiS APIC requires we rewrite the index register
- */
-static inline void io_apic_modify(unsigned int apic, unsigned int reg, unsigned int value)
+static void io_apic_write(unsigned int apic, unsigned int reg,
+ unsigned int value)
{
struct io_apic __iomem *io_apic = io_apic_base(apic);
- if (sis_apic_bug)
- writel(reg, &io_apic->index);
+ writel(reg, &io_apic->index);
writel(value, &io_apic->data);
}
-static bool io_apic_level_ack_pending(struct irq_cfg *cfg)
+static struct IO_APIC_route_entry __ioapic_read_entry(int apic, int pin)
{
- struct irq_pin_list *entry;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- for_each_irq_pin(entry, cfg->irq_2_pin) {
- unsigned int reg;
- int pin;
+ struct IO_APIC_route_entry entry;
- pin = entry->pin;
- reg = io_apic_read(entry->apic, 0x10 + pin*2);
- /* Is the remote IRR bit set? */
- if (reg & IO_APIC_REDIR_REMOTE_IRR) {
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
- return true;
- }
- }
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
+ entry.w1 = io_apic_read(apic, 0x10 + 2 * pin);
+ entry.w2 = io_apic_read(apic, 0x11 + 2 * pin);
- return false;
+ return entry;
}
-union entry_union {
- struct { u32 w1, w2; };
- struct IO_APIC_route_entry entry;
-};
-
static struct IO_APIC_route_entry ioapic_read_entry(int apic, int pin)
{
- union entry_union eu;
- unsigned long flags;
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- eu.w1 = io_apic_read(apic, 0x10 + 2 * pin);
- eu.w2 = io_apic_read(apic, 0x11 + 2 * pin);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
- return eu.entry;
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ return __ioapic_read_entry(apic, pin);
}
/*
@@ -436,22 +307,16 @@ static struct IO_APIC_route_entry ioapic_read_entry(int apic, int pin)
* the interrupt, and we need to make sure the entry is fully populated
* before that happens.
*/
-static void
-__ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e)
+static void __ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e)
{
- union entry_union eu = {{0, 0}};
-
- eu.entry = e;
- io_apic_write(apic, 0x11 + 2*pin, eu.w2);
- io_apic_write(apic, 0x10 + 2*pin, eu.w1);
+ io_apic_write(apic, 0x11 + 2*pin, e.w2);
+ io_apic_write(apic, 0x10 + 2*pin, e.w1);
}
-void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e)
+static void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e)
{
- unsigned long flags;
- raw_spin_lock_irqsave(&ioapic_lock, flags);
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
__ioapic_write_entry(apic, pin, e);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
}
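The two conversions above replace raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore() pairs with the scope-based guard() helper from linux/cleanup.h: the lock is released automatically when the enclosing scope exits, which is what lets ioapic_read_entry() return straight out of the locked region. A minimal sketch of the idiom:

#include <linux/cleanup.h>
#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(example_lock);
static int shared_state;

static int read_shared_state(void)
{
	/* Acquired with IRQs saved; unlock/restore runs at scope exit. */
	guard(raw_spinlock_irqsave)(&example_lock);

	return shared_state;	/* returning releases the lock */
}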
/*
@@ -461,13 +326,11 @@ void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e)
*/
static void ioapic_mask_entry(int apic, int pin)
{
- unsigned long flags;
- union entry_union eu = { .entry.mask = 1 };
+ struct IO_APIC_route_entry e = { .masked = true };
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- io_apic_write(apic, 0x10 + 2*pin, eu.w1);
- io_apic_write(apic, 0x11 + 2*pin, eu.w2);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ io_apic_write(apic, 0x10 + 2*pin, e.w1);
+ io_apic_write(apic, 0x11 + 2*pin, e.w2);
}
/*
@@ -475,151 +338,131 @@ static void ioapic_mask_entry(int apic, int pin)
* shared ISA-space IRQs, so we have to support them. We are super
* fast in the common case, and fast for shared ISA-space IRQs.
*/
-static int
-add_pin_to_irq_node_nopanic(struct irq_cfg *cfg, int node, int apic, int pin)
+static bool add_pin_to_irq_node(struct mp_chip_data *data, int node, int apic, int pin)
{
- struct irq_pin_list **last, *entry;
+ struct irq_pin_list *entry;
- /* don't allow duplicates */
- last = &cfg->irq_2_pin;
- for_each_irq_pin(entry, cfg->irq_2_pin) {
+ /* Don't allow duplicates */
+ for_each_irq_pin(entry, data->irq_2_pin) {
if (entry->apic == apic && entry->pin == pin)
- return 0;
- last = &entry->next;
+ return true;
}
- entry = get_one_free_irq_2_pin(node);
+ entry = kzalloc_node(sizeof(struct irq_pin_list), GFP_ATOMIC, node);
if (!entry) {
- printk(KERN_ERR "can not alloc irq_pin_list (%d,%d,%d)\n",
- node, apic, pin);
- return -ENOMEM;
+ pr_err("Cannot allocate irq_pin_list (%d,%d,%d)\n", node, apic, pin);
+ return false;
}
+
entry->apic = apic;
entry->pin = pin;
-
- *last = entry;
- return 0;
+ list_add_tail(&entry->list, &data->irq_2_pin);
+ return true;
}
-static void add_pin_to_irq_node(struct irq_cfg *cfg, int node, int apic, int pin)
+static void __remove_pin_from_irq(struct mp_chip_data *data, int apic, int pin)
{
- if (add_pin_to_irq_node_nopanic(cfg, node, apic, pin))
- panic("IO-APIC: failed to add irq-pin. Can not proceed\n");
-}
+ struct irq_pin_list *tmp, *entry;
-/*
- * Reroute an IRQ to a different pin.
- */
-static void __init replace_pin_at_irq_node(struct irq_cfg *cfg, int node,
- int oldapic, int oldpin,
- int newapic, int newpin)
-{
- struct irq_pin_list *entry;
-
- for_each_irq_pin(entry, cfg->irq_2_pin) {
- if (entry->apic == oldapic && entry->pin == oldpin) {
- entry->apic = newapic;
- entry->pin = newpin;
- /* every one is different, right? */
+ list_for_each_entry_safe(entry, tmp, &data->irq_2_pin, list) {
+ if (entry->apic == apic && entry->pin == pin) {
+ list_del(&entry->list);
+ kfree(entry);
return;
}
}
-
- /* old apic/pin didn't exist, so just add new ones */
- add_pin_to_irq_node(cfg, node, newapic, newpin);
-}
-
-static void __io_apic_modify_irq(struct irq_pin_list *entry,
- int mask_and, int mask_or,
- void (*final)(struct irq_pin_list *entry))
-{
- unsigned int reg, pin;
-
- pin = entry->pin;
- reg = io_apic_read(entry->apic, 0x10 + pin * 2);
- reg &= mask_and;
- reg |= mask_or;
- io_apic_modify(entry->apic, 0x10 + pin * 2, reg);
- if (final)
- final(entry);
}
-static void io_apic_modify_irq(struct irq_cfg *cfg,
- int mask_and, int mask_or,
+static void io_apic_modify_irq(struct mp_chip_data *data, bool masked,
void (*final)(struct irq_pin_list *entry))
{
struct irq_pin_list *entry;
- for_each_irq_pin(entry, cfg->irq_2_pin)
- __io_apic_modify_irq(entry, mask_and, mask_or, final);
-}
-
-static void __mask_and_edge_IO_APIC_irq(struct irq_pin_list *entry)
-{
- __io_apic_modify_irq(entry, ~IO_APIC_REDIR_LEVEL_TRIGGER,
- IO_APIC_REDIR_MASKED, NULL);
-}
+ data->entry.masked = masked;
-static void __unmask_and_level_IO_APIC_irq(struct irq_pin_list *entry)
-{
- __io_apic_modify_irq(entry, ~IO_APIC_REDIR_MASKED,
- IO_APIC_REDIR_LEVEL_TRIGGER, NULL);
-}
-
-static void __unmask_IO_APIC_irq(struct irq_cfg *cfg)
-{
- io_apic_modify_irq(cfg, ~IO_APIC_REDIR_MASKED, 0, NULL);
+ for_each_irq_pin(entry, data->irq_2_pin) {
+ io_apic_write(entry->apic, 0x10 + 2 * entry->pin, data->entry.w1);
+ if (final)
+ final(entry);
+ }
}
+/*
+ * Synchronize the IO-APIC and the CPU by doing a dummy read from the
+ * IO-APIC
+ */
static void io_apic_sync(struct irq_pin_list *entry)
{
- /*
- * Synchronize the IO-APIC and the CPU by doing
- * a dummy read from the IO-APIC
- */
struct io_apic __iomem *io_apic;
+
io_apic = io_apic_base(entry->apic);
readl(&io_apic->data);
}
-static void __mask_IO_APIC_irq(struct irq_cfg *cfg)
+static void mask_ioapic_irq(struct irq_data *irq_data)
{
- io_apic_modify_irq(cfg, ~0, IO_APIC_REDIR_MASKED, &io_apic_sync);
+ struct mp_chip_data *data = irq_data->chip_data;
+
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ io_apic_modify_irq(data, true, &io_apic_sync);
}
-static void mask_IO_APIC_irq_desc(struct irq_desc *desc)
+static void __unmask_ioapic(struct mp_chip_data *data)
{
- struct irq_cfg *cfg = desc->chip_data;
- unsigned long flags;
-
- BUG_ON(!cfg);
-
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- __mask_IO_APIC_irq(cfg);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
+ io_apic_modify_irq(data, false, NULL);
}
-static void unmask_IO_APIC_irq_desc(struct irq_desc *desc)
+static void unmask_ioapic_irq(struct irq_data *irq_data)
{
- struct irq_cfg *cfg = desc->chip_data;
- unsigned long flags;
+ struct mp_chip_data *data = irq_data->chip_data;
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- __unmask_IO_APIC_irq(cfg);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ __unmask_ioapic(data);
}
-static void mask_IO_APIC_irq(unsigned int irq)
+/*
+ * IO-APIC versions below 0x20 don't support the EOI register.
+ * For the record, here is the information about various versions:
+ * 0Xh 82489DX
+ * 1Xh I/OAPIC or I/O(x)APIC which are not PCI 2.2 Compliant
+ * 2Xh I/O(x)APIC which is PCI 2.2 Compliant
+ * 30h-FFh Reserved
+ *
+ * Some of the Intel ICH specs (ICH2 to ICH5) document the io-apic
+ * version as 0x2. This is a documentation error: these ICH chips
+ * actually use io-apics of version 0x20.
+ *
+ * For IO-APICs with an EOI register, we use it to do an explicit EOI.
+ * Otherwise, we simulate the EOI message manually by changing the trigger
+ * mode to edge and then back to level, with RTE being masked during this.
+ */
+static void __eoi_ioapic_pin(int apic, int pin, int vector)
{
- struct irq_desc *desc = irq_to_desc(irq);
+ if (mpc_ioapic_ver(apic) >= 0x20) {
+ io_apic_eoi(apic, vector);
+ } else {
+ struct IO_APIC_route_entry entry, entry1;
+
+ entry = entry1 = __ioapic_read_entry(apic, pin);
+
+ /* Mask the entry and change the trigger mode to edge. */
+ entry1.masked = true;
+ entry1.is_level = false;
+
+ __ioapic_write_entry(apic, pin, entry1);
- mask_IO_APIC_irq_desc(desc);
+ /* Restore the previous level triggered entry. */
+ __ioapic_write_entry(apic, pin, entry);
+ }
}
-static void unmask_IO_APIC_irq(unsigned int irq)
+
+static void eoi_ioapic_pin(int vector, struct mp_chip_data *data)
{
- struct irq_desc *desc = irq_to_desc(irq);
+ struct irq_pin_list *entry;
- unmask_IO_APIC_irq_desc(desc);
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ for_each_irq_pin(entry, data->irq_2_pin)
+ __eoi_ioapic_pin(entry->apic, entry->pin, vector);
}
static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
@@ -628,21 +471,50 @@ static void clear_IO_APIC_pin(unsigned int apic, unsigned int pin)
/* Check delivery_mode to be sure we're not clearing an SMI pin */
entry = ioapic_read_entry(apic, pin);
- if (entry.delivery_mode == dest_SMI)
+ if (entry.delivery_mode == APIC_DELIVERY_MODE_SMI)
return;
+
+ /*
+ * Make sure the entry is masked and re-read the contents to check
+ * if it is a level triggered pin and if the remote-IRR is set.
+ */
+ if (!entry.masked) {
+ entry.masked = true;
+ ioapic_write_entry(apic, pin, entry);
+ entry = ioapic_read_entry(apic, pin);
+ }
+
+ if (entry.irr) {
+ /*
+ * Make sure the trigger mode is set to level. Explicit EOI
+ * doesn't clear the remote-IRR if the trigger mode is not
+ * set to level.
+ */
+ if (!entry.is_level) {
+ entry.is_level = true;
+ ioapic_write_entry(apic, pin, entry);
+ }
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ __eoi_ioapic_pin(apic, pin, entry.vector);
+ }
+
/*
- * Disable it in the IO-APIC irq-routing table:
+ * Clear the rest of the bits in the IO-APIC RTE except for the mask
+ * bit.
*/
ioapic_mask_entry(apic, pin);
+ entry = ioapic_read_entry(apic, pin);
+ if (entry.irr)
+ pr_err("Unable to reset IRR for apic: %d, pin :%d\n",
+ mpc_ioapic_id(apic), pin);
}
-static void clear_IO_APIC (void)
+void clear_IO_APIC (void)
{
int apic, pin;
- for (apic = 0; apic < nr_ioapics; apic++)
- for (pin = 0; pin < nr_ioapic_registers[apic]; pin++)
- clear_IO_APIC_pin(apic, pin);
+ for_each_ioapic_pin(apic, pin)
+ clear_IO_APIC_pin(apic, pin);
}
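
for_each_ioapic_pin() is assumed here to be the usual nested iterator over every IO-APIC and each of its pins; a hypothetical expansion in terms of the ioapics[] array used elsewhere in this patch:

	#define for_each_ioapic_pin(idx, pin)					\
		for ((idx) = 0; (idx) < nr_ioapics; (idx)++)			\
			for ((pin) = 0; (pin) < ioapics[(idx)].nr_registers; (pin)++)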
#ifdef CONFIG_X86_32
@@ -658,101 +530,64 @@ static int pirq_entries[MAX_PIRQS] = {
static int __init ioapic_pirq_setup(char *str)
{
- int i, max;
- int ints[MAX_PIRQS+1];
+ int i, max, ints[MAX_PIRQS+1];
get_options(str, ARRAY_SIZE(ints), ints);
- apic_printk(APIC_VERBOSE, KERN_INFO
- "PIRQ redirection, working around broken MP-BIOS.\n");
+ apic_pr_verbose("PIRQ redirection, working around broken MP-BIOS.\n");
+
max = MAX_PIRQS;
if (ints[0] < MAX_PIRQS)
max = ints[0];
for (i = 0; i < max; i++) {
- apic_printk(APIC_VERBOSE, KERN_DEBUG
- "... PIRQ%d -> IRQ %d\n", i, ints[i+1]);
- /*
- * PIRQs are mapped upside down, usually.
- */
+ apic_pr_verbose("... PIRQ%d -> IRQ %d\n", i, ints[i + 1]);
+ /* PIRQs are mapped upside down, usually */
pirq_entries[MAX_PIRQS-i-1] = ints[i+1];
}
return 1;
}
-
__setup("pirq=", ioapic_pirq_setup);
#endif /* CONFIG_X86_32 */
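
A worked example of the option parsing above (values illustrative): booting with pirq=0,9 makes get_options() fill ints[] = { 2, 0, 9 }, so the loop stores

	pirq_entries[MAX_PIRQS - 1] = 0;	/* PIRQ0 disabled */
	pirq_entries[MAX_PIRQS - 2] = 9;	/* PIRQ1 -> IRQ 9 */

which is the "mapped upside down" layout the comment refers to.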
-struct IO_APIC_route_entry **alloc_ioapic_entries(void)
-{
- int apic;
- struct IO_APIC_route_entry **ioapic_entries;
-
- ioapic_entries = kzalloc(sizeof(*ioapic_entries) * nr_ioapics,
- GFP_ATOMIC);
- if (!ioapic_entries)
- return 0;
-
- for (apic = 0; apic < nr_ioapics; apic++) {
- ioapic_entries[apic] =
- kzalloc(sizeof(struct IO_APIC_route_entry) *
- nr_ioapic_registers[apic], GFP_ATOMIC);
- if (!ioapic_entries[apic])
- goto nomem;
- }
-
- return ioapic_entries;
-
-nomem:
- while (--apic >= 0)
- kfree(ioapic_entries[apic]);
- kfree(ioapic_entries);
-
- return 0;
-}
-
/*
* Saves all the IO-APIC RTE's
*/
-int save_IO_APIC_setup(struct IO_APIC_route_entry **ioapic_entries)
+int save_ioapic_entries(void)
{
int apic, pin;
+ int err = 0;
- if (!ioapic_entries)
- return -ENOMEM;
-
- for (apic = 0; apic < nr_ioapics; apic++) {
- if (!ioapic_entries[apic])
- return -ENOMEM;
+ for_each_ioapic(apic) {
+ if (!ioapics[apic].saved_registers) {
+ err = -ENOMEM;
+ continue;
+ }
- for (pin = 0; pin < nr_ioapic_registers[apic]; pin++)
- ioapic_entries[apic][pin] =
- ioapic_read_entry(apic, pin);
+ for_each_pin(apic, pin)
+ ioapics[apic].saved_registers[pin] = ioapic_read_entry(apic, pin);
}
- return 0;
+ return err;
}
/*
* Mask all IO APIC entries.
*/
-void mask_IO_APIC_setup(struct IO_APIC_route_entry **ioapic_entries)
+void mask_ioapic_entries(void)
{
int apic, pin;
- if (!ioapic_entries)
- return;
-
- for (apic = 0; apic < nr_ioapics; apic++) {
- if (!ioapic_entries[apic])
- break;
+ for_each_ioapic(apic) {
+ if (!ioapics[apic].saved_registers)
+ continue;
- for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) {
+ for_each_pin(apic, pin) {
struct IO_APIC_route_entry entry;
- entry = ioapic_entries[apic][pin];
- if (!entry.mask) {
- entry.mask = 1;
+ entry = ioapics[apic].saved_registers[pin];
+ if (!entry.masked) {
+ entry.masked = true;
ioapic_write_entry(apic, pin, entry);
}
}
@@ -760,49 +595,36 @@ void mask_IO_APIC_setup(struct IO_APIC_route_entry **ioapic_entries)
}
/*
- * Restore IO APIC entries which was saved in ioapic_entries.
+ * Restore IO APIC entries which were saved in the ioapic structure.
*/
-int restore_IO_APIC_setup(struct IO_APIC_route_entry **ioapic_entries)
+int restore_ioapic_entries(void)
{
int apic, pin;
- if (!ioapic_entries)
- return -ENOMEM;
-
- for (apic = 0; apic < nr_ioapics; apic++) {
- if (!ioapic_entries[apic])
- return -ENOMEM;
+ for_each_ioapic(apic) {
+ if (!ioapics[apic].saved_registers)
+ continue;
- for (pin = 0; pin < nr_ioapic_registers[apic]; pin++)
- ioapic_write_entry(apic, pin,
- ioapic_entries[apic][pin]);
+ for_each_pin(apic, pin)
+ ioapic_write_entry(apic, pin, ioapics[apic].saved_registers[pin]);
}
return 0;
}
-void free_ioapic_entries(struct IO_APIC_route_entry **ioapic_entries)
-{
- int apic;
-
- for (apic = 0; apic < nr_ioapics; apic++)
- kfree(ioapic_entries[apic]);
-
- kfree(ioapic_entries);
-}
-
/*
* Find the IRQ entry number of a certain pin.
*/
-static int find_irq_entry(int apic, int pin, int type)
+static int find_irq_entry(int ioapic_idx, int pin, int type)
{
int i;
- for (i = 0; i < mp_irq_entries; i++)
+ for (i = 0; i < mp_irq_entries; i++) {
if (mp_irqs[i].irqtype == type &&
- (mp_irqs[i].dstapic == mp_ioapics[apic].apicid ||
+ (mp_irqs[i].dstapic == mpc_ioapic_id(ioapic_idx) ||
mp_irqs[i].dstapic == MP_APIC_ALL) &&
mp_irqs[i].dstirq == pin)
return i;
+ }
return -1;
}
@@ -817,10 +639,8 @@ static int __init find_isa_irq_pin(int irq, int type)
for (i = 0; i < mp_irq_entries; i++) {
int lbus = mp_irqs[i].srcbus;
- if (test_bit(lbus, mp_bus_not_pci) &&
- (mp_irqs[i].irqtype == type) &&
+ if (test_bit(lbus, mp_bus_not_pci) && (mp_irqs[i].irqtype == type) &&
(mp_irqs[i].srcbusirq == irq))
-
return mp_irqs[i].dstirq;
}
return -1;
@@ -833,847 +653,549 @@ static int __init find_isa_irq_apic(int irq, int type)
for (i = 0; i < mp_irq_entries; i++) {
int lbus = mp_irqs[i].srcbus;
- if (test_bit(lbus, mp_bus_not_pci) &&
- (mp_irqs[i].irqtype == type) &&
+ if (test_bit(lbus, mp_bus_not_pci) && (mp_irqs[i].irqtype == type) &&
(mp_irqs[i].srcbusirq == irq))
break;
}
+
if (i < mp_irq_entries) {
- int apic;
- for(apic = 0; apic < nr_ioapics; apic++) {
- if (mp_ioapics[apic].apicid == mp_irqs[i].dstapic)
- return apic;
+ int ioapic_idx;
+
+ for_each_ioapic(ioapic_idx) {
+ if (mpc_ioapic_id(ioapic_idx) == mp_irqs[i].dstapic)
+ return ioapic_idx;
}
}
return -1;
}
-#if defined(CONFIG_EISA) || defined(CONFIG_MCA)
-/*
- * EISA Edge/Level control register, ELCR
- */
-static int EISA_ELCR(unsigned int irq)
-{
- if (irq < legacy_pic->nr_legacy_irqs) {
- unsigned int port = 0x4d0 + (irq >> 3);
- return (inb(port) >> (irq & 7)) & 1;
- }
- apic_printk(APIC_VERBOSE, KERN_INFO
- "Broken MPtable reports ISA irq %d\n", irq);
- return 0;
-}
-
-#endif
-
-/* ISA interrupts are always polarity zero edge triggered,
- * when listed as conforming in the MP table. */
-
-#define default_ISA_trigger(idx) (0)
-#define default_ISA_polarity(idx) (0)
-
-/* EISA interrupts are always polarity zero and can be edge or level
- * trigger depending on the ELCR value. If an interrupt is listed as
- * EISA conforming in the MP table, that means its trigger type must
- * be read in from the ELCR */
-
-#define default_EISA_trigger(idx) (EISA_ELCR(mp_irqs[idx].srcbusirq))
-#define default_EISA_polarity(idx) default_ISA_polarity(idx)
-
-/* PCI interrupts are always polarity one level triggered,
- * when listed as conforming in the MP table. */
-
-#define default_PCI_trigger(idx) (1)
-#define default_PCI_polarity(idx) (1)
-
-/* MCA interrupts are always polarity zero level triggered,
- * when listed as conforming in the MP table. */
-
-#define default_MCA_trigger(idx) (1)
-#define default_MCA_polarity(idx) default_ISA_polarity(idx)
-
-static int MPBIOS_polarity(int idx)
+static bool irq_active_low(int idx)
{
int bus = mp_irqs[idx].srcbus;
- int polarity;
/*
* Determine IRQ line polarity (high active or low active):
*/
- switch (mp_irqs[idx].irqflag & 3)
- {
- case 0: /* conforms, ie. bus-type dependent polarity */
- if (test_bit(bus, mp_bus_not_pci))
- polarity = default_ISA_polarity(idx);
- else
- polarity = default_PCI_polarity(idx);
- break;
- case 1: /* high active */
- {
- polarity = 0;
- break;
- }
- case 2: /* reserved */
- {
- printk(KERN_WARNING "broken BIOS!!\n");
- polarity = 1;
- break;
- }
- case 3: /* low active */
- {
- polarity = 1;
- break;
- }
- default: /* invalid */
- {
- printk(KERN_WARNING "broken BIOS!!\n");
- polarity = 1;
- break;
- }
+ switch (mp_irqs[idx].irqflag & MP_IRQPOL_MASK) {
+ case MP_IRQPOL_DEFAULT:
+ /*
+ * Conforms to spec, ie. bus-type dependent polarity. PCI
+ * defaults to low active. [E]ISA defaults to high active.
+ */
+ return !test_bit(bus, mp_bus_not_pci);
+ case MP_IRQPOL_ACTIVE_HIGH:
+ return false;
+ case MP_IRQPOL_RESERVED:
+ pr_warn("IOAPIC: Invalid polarity: 2, defaulting to low\n");
+ fallthrough;
+ case MP_IRQPOL_ACTIVE_LOW:
+ default: /* Pointless default required due to gcc stupidity */
+ return true;
}
- return polarity;
}
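
The irqflag decoding above assumes the standard MP table encoding, polarity in bits 1:0 and trigger mode in bits 3:2. A sketch of the constants (per the MP specification; the real definitions live in mpspec_def.h):

	#define MP_IRQPOL_MASK		0x03	/* bits 1:0 of irqflag */
	#define MP_IRQTRIG_MASK		0x0c	/* bits 3:2 of irqflag */

	/* e.g. irqflag == 0x0f decodes as active low, level triggered */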
-static int MPBIOS_trigger(int idx)
+#ifdef CONFIG_EISA
+/*
+ * EISA Edge/Level control register, ELCR
+ */
+static bool EISA_ELCR(unsigned int irq)
{
- int bus = mp_irqs[idx].srcbus;
- int trigger;
-
- /*
- * Determine IRQ trigger mode (edge or level sensitive):
- */
- switch ((mp_irqs[idx].irqflag>>2) & 3)
- {
- case 0: /* conforms, ie. bus-type dependent */
- if (test_bit(bus, mp_bus_not_pci))
- trigger = default_ISA_trigger(idx);
- else
- trigger = default_PCI_trigger(idx);
-#if defined(CONFIG_EISA) || defined(CONFIG_MCA)
- switch (mp_bus_id_to_type[bus]) {
- case MP_BUS_ISA: /* ISA pin */
- {
- /* set before the switch */
- break;
- }
- case MP_BUS_EISA: /* EISA pin */
- {
- trigger = default_EISA_trigger(idx);
- break;
- }
- case MP_BUS_PCI: /* PCI pin */
- {
- /* set before the switch */
- break;
- }
- case MP_BUS_MCA: /* MCA pin */
- {
- trigger = default_MCA_trigger(idx);
- break;
- }
- default:
- {
- printk(KERN_WARNING "broken BIOS!!\n");
- trigger = 1;
- break;
- }
- }
-#endif
- break;
- case 1: /* edge */
- {
- trigger = 0;
- break;
- }
- case 2: /* reserved */
- {
- printk(KERN_WARNING "broken BIOS!!\n");
- trigger = 1;
- break;
- }
- case 3: /* level */
- {
- trigger = 1;
- break;
- }
- default: /* invalid */
- {
- printk(KERN_WARNING "broken BIOS!!\n");
- trigger = 0;
- break;
- }
+ if (irq < nr_legacy_irqs()) {
+ unsigned int port = PIC_ELCR1 + (irq >> 3);
+ return (inb(port) >> (irq & 7)) & 1;
}
- return trigger;
+ apic_pr_verbose("Broken MPtable reports ISA irq %d\n", irq);
+ return false;
}
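
A worked example of the ELCR arithmetic above: for irq = 10 the port is PIC_ELCR1 + (10 >> 3), i.e. the second ELCR byte (conventionally 0x4d1), and the trigger bit is (inb(port) >> (10 & 7)) & 1, i.e. bit 2; a set bit means the line is level triggered.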
-static inline int irq_polarity(int idx)
+/*
+ * EISA interrupts are always active high and can be edge or level
+ * triggered depending on the ELCR value. If an interrupt is listed as
+ * EISA conforming in the MP table, that means its trigger type must be
+ * read in from the ELCR.
+ */
+static bool eisa_irq_is_level(int idx, int bus, bool level)
{
- return MPBIOS_polarity(idx);
+ switch (mp_bus_id_to_type[bus]) {
+ case MP_BUS_PCI:
+ case MP_BUS_ISA:
+ return level;
+ case MP_BUS_EISA:
+ return EISA_ELCR(mp_irqs[idx].srcbusirq);
+ }
+ pr_warn("IOAPIC: Invalid srcbus: %d defaulting to level\n", bus);
+ return true;
}
-
-static inline int irq_trigger(int idx)
+#else
+static inline int eisa_irq_is_level(int idx, int bus, bool level)
{
- return MPBIOS_trigger(idx);
+ return level;
}
+#endif
-static int pin_2_irq(int idx, int apic, int pin)
+static bool irq_is_level(int idx)
{
- int irq;
int bus = mp_irqs[idx].srcbus;
+ bool level;
/*
- * Debugging check, we are in big trouble if this message pops up!
- */
- if (mp_irqs[idx].dstirq != pin)
- printk(KERN_ERR "broken BIOS or MPTABLE parser, ayiee!!\n");
-
- if (test_bit(bus, mp_bus_not_pci)) {
- irq = mp_irqs[idx].srcbusirq;
- } else {
- u32 gsi = mp_gsi_routing[apic].gsi_base + pin;
-
- if (gsi >= NR_IRQS_LEGACY)
- irq = gsi;
- else
- irq = gsi_top + gsi;
- }
-
-#ifdef CONFIG_X86_32
- /*
- * PCI IRQ command line redirection. Yes, limits are hardcoded.
+ * Determine IRQ trigger mode (edge or level sensitive):
*/
- if ((pin >= 16) && (pin <= 23)) {
- if (pirq_entries[pin-16] != -1) {
- if (!pirq_entries[pin-16]) {
- apic_printk(APIC_VERBOSE, KERN_DEBUG
- "disabling PIRQ%d\n", pin-16);
- } else {
- irq = pirq_entries[pin-16];
- apic_printk(APIC_VERBOSE, KERN_DEBUG
- "using PIRQ%d -> IRQ %d\n",
- pin-16, irq);
- }
- }
+ switch (mp_irqs[idx].irqflag & MP_IRQTRIG_MASK) {
+ case MP_IRQTRIG_DEFAULT:
+ /*
+ * Conforms to spec, ie. bus-type dependent trigger
+ * mode. PCI defaults to level, ISA to edge.
+ */
+ level = !test_bit(bus, mp_bus_not_pci);
+ /* Take EISA into account */
+ return eisa_irq_is_level(idx, bus, level);
+ case MP_IRQTRIG_EDGE:
+ return false;
+ case MP_IRQTRIG_RESERVED:
+ pr_warn("IOAPIC: Invalid trigger mode 2 defaulting to level\n");
+ fallthrough;
+ case MP_IRQTRIG_LEVEL:
+ default: /* Pointless default required due to gcc stupidity */
+ return true;
}
-#endif
-
- return irq;
}
-/*
- * Find a specific PCI IRQ entry.
- * Not an __init, possibly needed by modules
- */
-int IO_APIC_get_PCI_irq_vector(int bus, int slot, int pin,
- struct io_apic_irq_attr *irq_attr)
+static int __acpi_get_override_irq(u32 gsi, bool *trigger, bool *polarity)
{
- int apic, i, best_guess = -1;
+ int ioapic, pin, idx;
- apic_printk(APIC_DEBUG,
- "querying PCI -> IRQ mapping bus:%d, slot:%d, pin:%d.\n",
- bus, slot, pin);
- if (test_bit(bus, mp_bus_not_pci)) {
- apic_printk(APIC_VERBOSE,
- "PCI BIOS passed nonexistent PCI bus %d!\n", bus);
+ if (ioapic_is_disabled)
return -1;
- }
- for (i = 0; i < mp_irq_entries; i++) {
- int lbus = mp_irqs[i].srcbus;
- for (apic = 0; apic < nr_ioapics; apic++)
- if (mp_ioapics[apic].apicid == mp_irqs[i].dstapic ||
- mp_irqs[i].dstapic == MP_APIC_ALL)
- break;
+ ioapic = mp_find_ioapic(gsi);
+ if (ioapic < 0)
+ return -1;
- if (!test_bit(lbus, mp_bus_not_pci) &&
- !mp_irqs[i].irqtype &&
- (bus == lbus) &&
- (slot == ((mp_irqs[i].srcbusirq >> 2) & 0x1f))) {
- int irq = pin_2_irq(i, apic, mp_irqs[i].dstirq);
+ pin = mp_find_ioapic_pin(ioapic, gsi);
+ if (pin < 0)
+ return -1;
- if (!(apic || IO_APIC_IRQ(irq)))
- continue;
+ idx = find_irq_entry(ioapic, pin, mp_INT);
+ if (idx < 0)
+ return -1;
- if (pin == (mp_irqs[i].srcbusirq & 3)) {
- set_io_apic_irq_attr(irq_attr, apic,
- mp_irqs[i].dstirq,
- irq_trigger(i),
- irq_polarity(i));
- return irq;
- }
- /*
- * Use the first all-but-pin matching entry as a
- * best-guess fuzzy result for broken mptables.
- */
- if (best_guess < 0) {
- set_io_apic_irq_attr(irq_attr, apic,
- mp_irqs[i].dstirq,
- irq_trigger(i),
- irq_polarity(i));
- best_guess = irq;
- }
- }
- }
- return best_guess;
+ *trigger = irq_is_level(idx);
+ *polarity = irq_active_low(idx);
+ return 0;
}
-EXPORT_SYMBOL(IO_APIC_get_PCI_irq_vector);
-void lock_vector_lock(void)
+#ifdef CONFIG_ACPI
+int acpi_get_override_irq(u32 gsi, int *is_level, int *active_low)
{
- /* Used to the online set of cpus does not change
- * during assign_irq_vector.
- */
- raw_spin_lock(&vector_lock);
+ *is_level = *active_low = 0;
+ return __acpi_get_override_irq(gsi, (bool *)is_level,
+ (bool *)active_low);
}
+#endif
-void unlock_vector_lock(void)
+void ioapic_set_alloc_attr(struct irq_alloc_info *info, int node,
+ int trigger, int polarity)
{
- raw_spin_unlock(&vector_lock);
+ init_irq_alloc_info(info, NULL);
+ info->type = X86_IRQ_ALLOC_TYPE_IOAPIC;
+ info->ioapic.node = node;
+ info->ioapic.is_level = trigger;
+ info->ioapic.active_low = polarity;
+ info->ioapic.valid = 1;
}
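
A hedged usage sketch of the helper above (the attribute values are illustrative, not taken from a specific caller):

	struct irq_alloc_info info;

	/* level triggered, active low, no NUMA preference */
	ioapic_set_alloc_attr(&info, NUMA_NO_NODE, 1, 1);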
-static int
-__assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
+static void ioapic_copy_alloc_attr(struct irq_alloc_info *dst,
+ struct irq_alloc_info *src,
+ u32 gsi, int ioapic_idx, int pin)
{
- /*
- * NOTE! The local APIC isn't very good at handling
- * multiple interrupts at the same interrupt level.
- * As the interrupt level is determined by taking the
- * vector number and shifting that right by 4, we
- * want to spread these out a bit so that they don't
- * all fall in the same interrupt level.
- *
- * Also, we've got to be careful not to trash gate
- * 0x80, because int 0x80 is hm, kind of importantish. ;)
- */
- static int current_vector = FIRST_EXTERNAL_VECTOR + VECTOR_OFFSET_START;
- static int current_offset = VECTOR_OFFSET_START % 8;
- unsigned int old_vector;
- int cpu, err;
- cpumask_var_t tmp_mask;
-
- if (cfg->move_in_progress)
- return -EBUSY;
-
- if (!alloc_cpumask_var(&tmp_mask, GFP_ATOMIC))
- return -ENOMEM;
-
- old_vector = cfg->vector;
- if (old_vector) {
- cpumask_and(tmp_mask, mask, cpu_online_mask);
- cpumask_and(tmp_mask, cfg->domain, tmp_mask);
- if (!cpumask_empty(tmp_mask)) {
- free_cpumask_var(tmp_mask);
- return 0;
- }
- }
-
- /* Only try and allocate irqs on cpus that are present */
- err = -ENOSPC;
- for_each_cpu_and(cpu, mask, cpu_online_mask) {
- int new_cpu;
- int vector, offset;
-
- apic->vector_allocation_domain(cpu, tmp_mask);
-
- vector = current_vector;
- offset = current_offset;
-next:
- vector += 8;
- if (vector >= first_system_vector) {
- /* If out of vectors on large boxen, must share them. */
- offset = (offset + 1) % 8;
- vector = FIRST_EXTERNAL_VECTOR + offset;
- }
- if (unlikely(current_vector == vector))
- continue;
+ bool level, pol_low;
- if (test_bit(vector, used_vectors))
- goto next;
-
- for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
- if (per_cpu(vector_irq, new_cpu)[vector] != -1)
- goto next;
- /* Found one! */
- current_vector = vector;
- current_offset = offset;
- if (old_vector) {
- cfg->move_in_progress = 1;
- cpumask_copy(cfg->old_domain, cfg->domain);
+ copy_irq_alloc_info(dst, src);
+ dst->type = X86_IRQ_ALLOC_TYPE_IOAPIC;
+ dst->devid = mpc_ioapic_id(ioapic_idx);
+ dst->ioapic.pin = pin;
+ dst->ioapic.valid = 1;
+ if (src && src->ioapic.valid) {
+ dst->ioapic.node = src->ioapic.node;
+ dst->ioapic.is_level = src->ioapic.is_level;
+ dst->ioapic.active_low = src->ioapic.active_low;
+ } else {
+ dst->ioapic.node = NUMA_NO_NODE;
+ if (__acpi_get_override_irq(gsi, &level, &pol_low) >= 0) {
+ dst->ioapic.is_level = level;
+ dst->ioapic.active_low = pol_low;
+ } else {
+ /*
+ * PCI interrupts are always active low level
+ * triggered.
+ */
+ dst->ioapic.is_level = true;
+ dst->ioapic.active_low = true;
}
- for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
- per_cpu(vector_irq, new_cpu)[vector] = irq;
- cfg->vector = vector;
- cpumask_copy(cfg->domain, tmp_mask);
- err = 0;
- break;
}
- free_cpumask_var(tmp_mask);
- return err;
}
-int assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
+static int ioapic_alloc_attr_node(struct irq_alloc_info *info)
{
- int err;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&vector_lock, flags);
- err = __assign_irq_vector(irq, cfg, mask);
- raw_spin_unlock_irqrestore(&vector_lock, flags);
- return err;
+ return (info && info->ioapic.valid) ? info->ioapic.node : NUMA_NO_NODE;
}
-static void __clear_irq_vector(int irq, struct irq_cfg *cfg)
+static void mp_register_handler(unsigned int irq, bool level)
{
- int cpu, vector;
+ irq_flow_handler_t hdl;
+ bool fasteoi;
- BUG_ON(!cfg->vector);
-
- vector = cfg->vector;
- for_each_cpu_and(cpu, cfg->domain, cpu_online_mask)
- per_cpu(vector_irq, cpu)[vector] = -1;
-
- cfg->vector = 0;
- cpumask_clear(cfg->domain);
-
- if (likely(!cfg->move_in_progress))
- return;
- for_each_cpu_and(cpu, cfg->old_domain, cpu_online_mask) {
- for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS;
- vector++) {
- if (per_cpu(vector_irq, cpu)[vector] != irq)
- continue;
- per_cpu(vector_irq, cpu)[vector] = -1;
- break;
- }
+ if (level) {
+ irq_set_status_flags(irq, IRQ_LEVEL);
+ fasteoi = true;
+ } else {
+ irq_clear_status_flags(irq, IRQ_LEVEL);
+ fasteoi = false;
}
- cfg->move_in_progress = 0;
+
+ hdl = fasteoi ? handle_fasteoi_irq : handle_edge_irq;
+ __irq_set_handler(irq, hdl, 0, fasteoi ? "fasteoi" : "edge");
}
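
For context on the handler choice above: handle_fasteoi_irq() issues a single EOI at the end of the flow, which suits level-triggered IO-APIC pins, while handle_edge_irq() acks early so a new edge arriving mid-handler is not lost. Both are the kernel's standard flow handlers.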
-void __setup_vector_irq(int cpu)
+static bool mp_check_pin_attr(int irq, struct irq_alloc_info *info)
{
- /* Initialize vector_irq on a new cpu */
- int irq, vector;
- struct irq_cfg *cfg;
- struct irq_desc *desc;
+ struct mp_chip_data *data = irq_get_chip_data(irq);
/*
- * vector_lock will make sure that we don't run into irq vector
- * assignments that might be happening on another cpu in parallel,
- * while we setup our initial vector to irq mappings.
+ * setup_IO_APIC_irqs() programs all legacy IRQs with default trigger
+ * and polarity attributes. So allow the first user to reprogram the
+ * pin with real trigger and polarity attributes.
*/
- raw_spin_lock(&vector_lock);
- /* Mark the inuse vectors */
- for_each_irq_desc(irq, desc) {
- cfg = desc->chip_data;
-
- /*
- * If it is a legacy IRQ handled by the legacy PIC, this cpu
- * will be part of the irq_cfg's domain.
- */
- if (irq < legacy_pic->nr_legacy_irqs && !IO_APIC_IRQ(irq))
- cpumask_set_cpu(cpu, cfg->domain);
-
- if (!cpumask_test_cpu(cpu, cfg->domain))
- continue;
- vector = cfg->vector;
- per_cpu(vector_irq, cpu)[vector] = irq;
+ if (irq < nr_legacy_irqs() && data->count == 1) {
+ if (info->ioapic.is_level != data->is_level)
+ mp_register_handler(irq, info->ioapic.is_level);
+ data->entry.is_level = data->is_level = info->ioapic.is_level;
+ data->entry.active_low = data->active_low = info->ioapic.active_low;
}
- /* Mark the free vectors */
- for (vector = 0; vector < NR_VECTORS; ++vector) {
- irq = per_cpu(vector_irq, cpu)[vector];
- if (irq < 0)
- continue;
- cfg = irq_cfg(irq);
- if (!cpumask_test_cpu(cpu, cfg->domain))
- per_cpu(vector_irq, cpu)[vector] = -1;
- }
- raw_spin_unlock(&vector_lock);
+ return data->is_level == info->ioapic.is_level &&
+ data->active_low == info->ioapic.active_low;
}
-static struct irq_chip ioapic_chip;
-static struct irq_chip ir_ioapic_chip;
-
-#define IOAPIC_AUTO -1
-#define IOAPIC_EDGE 0
-#define IOAPIC_LEVEL 1
-
-#ifdef CONFIG_X86_32
-static inline int IO_APIC_irq_trigger(int irq)
+static int alloc_irq_from_domain(struct irq_domain *domain, int ioapic, u32 gsi,
+ struct irq_alloc_info *info)
{
- int apic, idx, pin;
+ int type = ioapics[ioapic].irqdomain_cfg.type;
+ bool legacy = false;
+ int irq = -1;
- for (apic = 0; apic < nr_ioapics; apic++) {
- for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) {
- idx = find_irq_entry(apic, pin, mp_INT);
- if ((idx != -1) && (irq == pin_2_irq(idx, apic, pin)))
- return irq_trigger(idx);
- }
+ switch (type) {
+ case IOAPIC_DOMAIN_LEGACY:
+ /*
+ * Dynamically allocate IRQ number for non-ISA IRQs in the first
+ * 16 GSIs on some weird platforms.
+ */
+ if (!ioapic_initialized || gsi >= nr_legacy_irqs())
+ irq = gsi;
+ legacy = mp_is_legacy_irq(irq);
+ break;
+ case IOAPIC_DOMAIN_STRICT:
+ irq = gsi;
+ break;
+ case IOAPIC_DOMAIN_DYNAMIC:
+ break;
+ default:
+ WARN(1, "ioapic: unknown irqdomain type %d\n", type);
+ return -1;
}
- /*
- * nonexistent IRQs are edge default
- */
- return 0;
-}
-#else
-static inline int IO_APIC_irq_trigger(int irq)
-{
- return 1;
+
+ return __irq_domain_alloc_irqs(domain, irq, 1, ioapic_alloc_attr_node(info),
+ info, legacy, NULL);
}
-#endif
-static void ioapic_register_intr(int irq, struct irq_desc *desc, unsigned long trigger)
+/*
+ * Need special handling for ISA IRQs because there may be multiple IOAPIC pins
+ * sharing the same ISA IRQ number and irqdomain only supports 1:1 mapping
+ * between IOAPIC pin and IRQ number. A typical IOAPIC has 24 pins, pin 0-15 are
+ * used for legacy IRQs and pin 16-23 are used for PCI IRQs (PIRQ A-H).
+ * When ACPI is disabled, only legacy IRQ numbers (IRQ0-15) are available, and
+ * some BIOSes may use MP Interrupt Source records to override IRQ numbers for
+ * PIRQs instead of reprogramming the interrupt routing logic. Thus there may be
+ * multiple pins sharing the same legacy IRQ number when ACPI is disabled.
+ */
+static int alloc_isa_irq_from_domain(struct irq_domain *domain, int irq, int ioapic, int pin,
+ struct irq_alloc_info *info)
{
+ struct irq_data *irq_data = irq_get_irq_data(irq);
+ int node = ioapic_alloc_attr_node(info);
+ struct mp_chip_data *data;
- if ((trigger == IOAPIC_AUTO && IO_APIC_irq_trigger(irq)) ||
- trigger == IOAPIC_LEVEL)
- desc->status |= IRQ_LEVEL;
- else
- desc->status &= ~IRQ_LEVEL;
-
- if (irq_remapped(irq)) {
- desc->status |= IRQ_MOVE_PCNTXT;
- if (trigger)
- set_irq_chip_and_handler_name(irq, &ir_ioapic_chip,
- handle_fasteoi_irq,
- "fasteoi");
- else
- set_irq_chip_and_handler_name(irq, &ir_ioapic_chip,
- handle_edge_irq, "edge");
- return;
+ /*
+ * Legacy ISA IRQ has already been allocated, just add pin to
+ * the pin list associated with this IRQ and program the IOAPIC
+ * entry.
+ */
+ if (irq_data && irq_data->parent_data) {
+ if (!mp_check_pin_attr(irq, info))
+ return -EBUSY;
+ if (!add_pin_to_irq_node(irq_data->chip_data, node, ioapic, info->ioapic.pin))
+ return -ENOMEM;
+ } else {
+ info->flags |= X86_IRQ_ALLOC_LEGACY;
+ irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true, NULL);
+ if (irq >= 0) {
+ irq_data = irq_domain_get_irq_data(domain, irq);
+ data = irq_data->chip_data;
+ data->isa_irq = true;
+ }
}
- if ((trigger == IOAPIC_AUTO && IO_APIC_irq_trigger(irq)) ||
- trigger == IOAPIC_LEVEL)
- set_irq_chip_and_handler_name(irq, &ioapic_chip,
- handle_fasteoi_irq,
- "fasteoi");
- else
- set_irq_chip_and_handler_name(irq, &ioapic_chip,
- handle_edge_irq, "edge");
+ return irq;
}
-int setup_ioapic_entry(int apic_id, int irq,
- struct IO_APIC_route_entry *entry,
- unsigned int destination, int trigger,
- int polarity, int vector, int pin)
+static int mp_map_pin_to_irq(u32 gsi, int idx, int ioapic, int pin,
+ unsigned int flags, struct irq_alloc_info *info)
{
- /*
- * add it to the IO-APIC irq-routing table:
- */
- memset(entry,0,sizeof(*entry));
-
- if (intr_remapping_enabled) {
- struct intel_iommu *iommu = map_ioapic_to_ir(apic_id);
- struct irte irte;
- struct IR_IO_APIC_route_entry *ir_entry =
- (struct IR_IO_APIC_route_entry *) entry;
- int index;
-
- if (!iommu)
- panic("No mapping iommu for ioapic %d\n", apic_id);
-
- index = alloc_irte(iommu, irq, 1);
- if (index < 0)
- panic("Failed to allocate IRTE for ioapic %d\n", apic_id);
+ struct irq_domain *domain = mp_ioapic_irqdomain(ioapic);
+ struct irq_alloc_info tmp;
+ struct mp_chip_data *data;
+ bool legacy = false;
+ int irq;
- memset(&irte, 0, sizeof(irte));
+ if (!domain)
+ return -ENOSYS;
- irte.present = 1;
- irte.dst_mode = apic->irq_dest_mode;
+ if (idx >= 0 && test_bit(mp_irqs[idx].srcbus, mp_bus_not_pci)) {
+ irq = mp_irqs[idx].srcbusirq;
+ legacy = mp_is_legacy_irq(irq);
/*
- * Trigger mode in the IRTE will always be edge, and the
- * actual level or edge trigger will be setup in the IO-APIC
- * RTE. This will help simplify level triggered irq migration.
- * For more details, see the comments above explainig IO-APIC
- * irq migration in the presence of interrupt-remapping.
+ * IRQ2 is unusable for historical reasons on systems which
+ * have a legacy PIC. See the comment vs. IRQ2 further down.
+ *
+ * If this gets removed at some point then the related code
+ * in lapic_assign_system_vectors() needs to be adjusted as
+ * well.
*/
- irte.trigger_mode = 0;
- irte.dlvry_mode = apic->irq_delivery_mode;
- irte.vector = vector;
- irte.dest_id = IRTE_DEST(destination);
-
- /* Set source-id of interrupt request */
- set_ioapic_sid(&irte, apic_id);
-
- modify_irte(irq, &irte);
+ if (legacy && irq == PIC_CASCADE_IR)
+ return -EINVAL;
+ }
- ir_entry->index2 = (index >> 15) & 0x1;
- ir_entry->zero = 0;
- ir_entry->format = 1;
- ir_entry->index = (index & 0x7fff);
- /*
- * IO-APIC RTE will be configured with virtual vector.
- * irq handler will do the explicit EOI to the io-apic.
- */
- ir_entry->vector = pin;
+ guard(mutex)(&ioapic_mutex);
+ if (!(flags & IOAPIC_MAP_ALLOC)) {
+ if (!legacy) {
+ irq = irq_find_mapping(domain, pin);
+ if (irq == 0)
+ irq = -ENOENT;
+ }
} else {
- entry->delivery_mode = apic->irq_delivery_mode;
- entry->dest_mode = apic->irq_dest_mode;
- entry->dest = destination;
- entry->vector = vector;
+ ioapic_copy_alloc_attr(&tmp, info, gsi, ioapic, pin);
+ if (legacy)
+ irq = alloc_isa_irq_from_domain(domain, irq,
+ ioapic, pin, &tmp);
+ else if ((irq = irq_find_mapping(domain, pin)) == 0)
+ irq = alloc_irq_from_domain(domain, ioapic, gsi, &tmp);
+ else if (!mp_check_pin_attr(irq, &tmp))
+ irq = -EBUSY;
+ if (irq >= 0) {
+ data = irq_get_chip_data(irq);
+ data->count++;
+ }
}
-
- entry->mask = 0; /* enable IRQ */
- entry->trigger = trigger;
- entry->polarity = polarity;
-
- /* Mask level triggered irqs.
- * Use IRQ_DELAYED_DISABLE for edge triggered irqs.
- */
- if (trigger)
- entry->mask = 1;
- return 0;
+ return irq;
}
-static void setup_IO_APIC_irq(int apic_id, int pin, unsigned int irq, struct irq_desc *desc,
- int trigger, int polarity)
+static int pin_2_irq(int idx, int ioapic, int pin, unsigned int flags)
{
- struct irq_cfg *cfg;
- struct IO_APIC_route_entry entry;
- unsigned int dest;
+ u32 gsi = mp_pin_to_gsi(ioapic, pin);
- if (!IO_APIC_IRQ(irq))
- return;
+ /* Debugging check, we are in big trouble if this message pops up! */
+ if (mp_irqs[idx].dstirq != pin)
+ pr_err("broken BIOS or MPTABLE parser, ayiee!!\n");
- cfg = desc->chip_data;
+#ifdef CONFIG_X86_32
+ /* PCI IRQ command line redirection. Yes, limits are hardcoded. */
+ if ((pin >= 16) && (pin <= 23)) {
+ if (pirq_entries[pin - 16] != -1) {
+ if (!pirq_entries[pin - 16]) {
+ apic_pr_verbose("Disabling PIRQ%d\n", pin - 16);
+ } else {
+ int irq = pirq_entries[pin-16];
- /*
- * For legacy irqs, cfg->domain starts with cpu 0 for legacy
- * controllers like 8259. Now that IO-APIC can handle this irq, update
- * the cfg->domain.
- */
- if (irq < legacy_pic->nr_legacy_irqs && cpumask_test_cpu(0, cfg->domain))
- apic->vector_allocation_domain(0, cfg->domain);
+ apic_pr_verbose("Using PIRQ%d -> IRQ %d\n", pin - 16, irq);
+ return irq;
+ }
+ }
+ }
+#endif
- if (assign_irq_vector(irq, cfg, apic->target_cpus()))
- return;
+ return mp_map_pin_to_irq(gsi, idx, ioapic, pin, flags, NULL);
+}
+
+int mp_map_gsi_to_irq(u32 gsi, unsigned int flags, struct irq_alloc_info *info)
+{
+ int ioapic, pin, idx;
+
+ ioapic = mp_find_ioapic(gsi);
+ if (ioapic < 0)
+ return -ENODEV;
- dest = apic->cpu_mask_to_apicid_and(cfg->domain, apic->target_cpus());
+ pin = mp_find_ioapic_pin(ioapic, gsi);
+ idx = find_irq_entry(ioapic, pin, mp_INT);
+ if ((flags & IOAPIC_MAP_CHECK) && idx < 0)
+ return -ENODEV;
- apic_printk(APIC_VERBOSE,KERN_DEBUG
- "IOAPIC[%d]: Set routing entry (%d-%d -> 0x%x -> "
- "IRQ %d Mode:%i Active:%i)\n",
- apic_id, mp_ioapics[apic_id].apicid, pin, cfg->vector,
- irq, trigger, polarity);
+ return mp_map_pin_to_irq(gsi, idx, ioapic, pin, flags, info);
+}
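
A hedged usage sketch of the lookup above (the GSI value is hypothetical; the flags are the IOAPIC_MAP_* bits consumed by mp_map_pin_to_irq()):

	/* Translate GSI 20 to a Linux IRQ, allocating the mapping if needed */
	int irq = mp_map_gsi_to_irq(20, IOAPIC_MAP_ALLOC | IOAPIC_MAP_CHECK, NULL);

	if (irq < 0)
		pr_err("no IRQ for GSI 20: %d\n", irq);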
+void mp_unmap_irq(int irq)
+{
+ struct irq_data *irq_data = irq_get_irq_data(irq);
+ struct mp_chip_data *data;
- if (setup_ioapic_entry(mp_ioapics[apic_id].apicid, irq, &entry,
- dest, trigger, polarity, cfg->vector, pin)) {
- printk("Failed to setup ioapic entry for ioapic %d, pin %d\n",
- mp_ioapics[apic_id].apicid, pin);
- __clear_irq_vector(irq, cfg);
+ if (!irq_data || !irq_data->domain)
return;
- }
- ioapic_register_intr(irq, desc, trigger);
- if (irq < legacy_pic->nr_legacy_irqs)
- legacy_pic->chip->mask(irq);
+ data = irq_data->chip_data;
+ if (!data || data->isa_irq)
+ return;
- ioapic_write_entry(apic_id, pin, entry);
+ guard(mutex)(&ioapic_mutex);
+ if (--data->count == 0)
+ irq_domain_free_irqs(irq, 1);
}
-static struct {
- DECLARE_BITMAP(pin_programmed, MP_MAX_IOAPIC_PIN + 1);
-} mp_ioapic_routing[MAX_IO_APICS];
-
-static void __init setup_IO_APIC_irqs(void)
+/*
+ * Find a specific PCI IRQ entry.
+ * Not an __init, possibly needed by modules
+ */
+int IO_APIC_get_PCI_irq_vector(int bus, int slot, int pin)
{
- int apic_id, pin, idx, irq;
- int notcon = 0;
- struct irq_desc *desc;
- struct irq_cfg *cfg;
- int node = cpu_to_node(boot_cpu_id);
-
- apic_printk(APIC_VERBOSE, KERN_DEBUG "init IO_APIC IRQs\n");
-
- for (apic_id = 0; apic_id < nr_ioapics; apic_id++)
- for (pin = 0; pin < nr_ioapic_registers[apic_id]; pin++) {
- idx = find_irq_entry(apic_id, pin, mp_INT);
- if (idx == -1) {
- if (!notcon) {
- notcon = 1;
- apic_printk(APIC_VERBOSE,
- KERN_DEBUG " %d-%d",
- mp_ioapics[apic_id].apicid, pin);
- } else
- apic_printk(APIC_VERBOSE, " %d-%d",
- mp_ioapics[apic_id].apicid, pin);
- continue;
- }
- if (notcon) {
- apic_printk(APIC_VERBOSE,
- " (apicid-pin) not connected\n");
- notcon = 0;
- }
+ int irq, i, best_ioapic = -1, best_idx = -1;
- irq = pin_2_irq(idx, apic_id, pin);
+ apic_pr_debug("Querying PCI -> IRQ mapping bus:%d, slot:%d, pin:%d.\n",
+ bus, slot, pin);
+ if (test_bit(bus, mp_bus_not_pci)) {
+ apic_pr_verbose("PCI BIOS passed nonexistent PCI bus %d!\n", bus);
+ return -1;
+ }
+
+ for (i = 0; i < mp_irq_entries; i++) {
+ int lbus = mp_irqs[i].srcbus;
+ int ioapic_idx, found = 0;
- if ((apic_id > 0) && (irq > 16))
+ if (bus != lbus || mp_irqs[i].irqtype != mp_INT ||
+ slot != ((mp_irqs[i].srcbusirq >> 2) & 0x1f))
continue;
- /*
- * Skip the timer IRQ if there's a quirk handler
- * installed and if it returns 1:
- */
- if (apic->multi_timer_check &&
- apic->multi_timer_check(apic_id, irq))
+ for_each_ioapic(ioapic_idx)
+ if (mpc_ioapic_id(ioapic_idx) == mp_irqs[i].dstapic ||
+ mp_irqs[i].dstapic == MP_APIC_ALL) {
+ found = 1;
+ break;
+ }
+ if (!found)
continue;
- desc = irq_to_desc_alloc_node(irq, node);
- if (!desc) {
- printk(KERN_INFO "can not get irq_desc for %d\n", irq);
+ /* Skip ISA IRQs */
+ irq = pin_2_irq(i, ioapic_idx, mp_irqs[i].dstirq, 0);
+ if (irq > 0 && !IO_APIC_IRQ(irq))
continue;
+
+ if (pin == (mp_irqs[i].srcbusirq & 3)) {
+ best_idx = i;
+ best_ioapic = ioapic_idx;
+ goto out;
}
- cfg = desc->chip_data;
- add_pin_to_irq_node(cfg, node, apic_id, pin);
+
/*
- * don't mark it in pin_programmed, so later acpi could
- * set it correctly when irq < 16
+ * Use the first all-but-pin matching entry as a
+ * best-guess fuzzy result for broken mptables.
*/
- setup_IO_APIC_irq(apic_id, pin, irq, desc,
- irq_trigger(idx), irq_polarity(idx));
+ if (best_idx < 0) {
+ best_idx = i;
+ best_ioapic = ioapic_idx;
+ }
}
+ if (best_idx < 0)
+ return -1;
- if (notcon)
- apic_printk(APIC_VERBOSE,
- " (apicid-pin) not connected\n");
+out:
+ return pin_2_irq(best_idx, best_ioapic, mp_irqs[best_idx].dstirq, IOAPIC_MAP_ALLOC);
}
+EXPORT_SYMBOL(IO_APIC_get_PCI_irq_vector);
-/*
- * for the gsit that is not in first ioapic
- * but could not use acpi_register_gsi()
- * like some special sci in IBM x3330
- */
-void setup_IO_APIC_irq_extra(u32 gsi)
-{
- int apic_id = 0, pin, idx, irq;
- int node = cpu_to_node(boot_cpu_id);
- struct irq_desc *desc;
- struct irq_cfg *cfg;
-
- /*
- * Convert 'gsi' to 'ioapic.pin'.
- */
- apic_id = mp_find_ioapic(gsi);
- if (apic_id < 0)
- return;
+static struct irq_chip ioapic_chip, ioapic_ir_chip;
- pin = mp_find_ioapic_pin(apic_id, gsi);
- idx = find_irq_entry(apic_id, pin, mp_INT);
- if (idx == -1)
- return;
-
- irq = pin_2_irq(idx, apic_id, pin);
-#ifdef CONFIG_SPARSE_IRQ
- desc = irq_to_desc(irq);
- if (desc)
- return;
-#endif
- desc = irq_to_desc_alloc_node(irq, node);
- if (!desc) {
- printk(KERN_INFO "can not get irq_desc for %d\n", irq);
- return;
- }
+static void __init setup_IO_APIC_irqs(void)
+{
+ unsigned int ioapic, pin;
+ int idx;
- cfg = desc->chip_data;
- add_pin_to_irq_node(cfg, node, apic_id, pin);
+ apic_pr_verbose("Init IO_APIC IRQs\n");
- if (test_bit(pin, mp_ioapic_routing[apic_id].pin_programmed)) {
- pr_debug("Pin %d-%d already programmed\n",
- mp_ioapics[apic_id].apicid, pin);
- return;
+ for_each_ioapic_pin(ioapic, pin) {
+ idx = find_irq_entry(ioapic, pin, mp_INT);
+ if (idx < 0) {
+ apic_pr_verbose("apic %d pin %d not connected\n",
+ mpc_ioapic_id(ioapic), pin);
+ } else {
+ pin_2_irq(idx, ioapic, pin, ioapic ? 0 : IOAPIC_MAP_ALLOC);
+ }
}
- set_bit(pin, mp_ioapic_routing[apic_id].pin_programmed);
+}
- setup_IO_APIC_irq(apic_id, pin, irq, desc,
- irq_trigger(idx), irq_polarity(idx));
+void ioapic_zap_locks(void)
+{
+ raw_spin_lock_init(&ioapic_lock);
}
-/*
- * Set up the timer pin, possibly with the 8259A-master behind.
- */
-static void __init setup_timer_IRQ0_pin(unsigned int apic_id, unsigned int pin,
- int vector)
+static void io_apic_print_entries(unsigned int apic, unsigned int nr_entries)
{
struct IO_APIC_route_entry entry;
+ char buf[256];
+ int i;
- if (intr_remapping_enabled)
- return;
-
- memset(&entry, 0, sizeof(entry));
-
- /*
- * We use logical delivery to get the timer IRQ
- * to the first CPU.
- */
- entry.dest_mode = apic->irq_dest_mode;
- entry.mask = 0; /* don't mask IRQ for edge */
- entry.dest = apic->cpu_mask_to_apicid(apic->target_cpus());
- entry.delivery_mode = apic->irq_delivery_mode;
- entry.polarity = 0;
- entry.trigger = 0;
- entry.vector = vector;
-
- /*
- * The timer IRQ doesn't have to know that behind the
- * scene we may have a 8259A-master in AEOI mode ...
- */
- set_irq_chip_and_handler_name(0, &ioapic_chip, handle_edge_irq, "edge");
-
- /*
- * Add it to the IO-APIC irq-routing table:
- */
- ioapic_write_entry(apic_id, pin, entry);
+ apic_dbg("IOAPIC %d:\n", apic);
+ for (i = 0; i <= nr_entries; i++) {
+ entry = ioapic_read_entry(apic, i);
+ snprintf(buf, sizeof(buf), " pin%02x, %s, %s, %s, V(%02X), IRR(%1d), S(%1d)",
+ i, entry.masked ? "disabled" : "enabled ",
+ entry.is_level ? "level" : "edge ",
+ entry.active_low ? "low " : "high",
+ entry.vector, entry.irr, entry.delivery_status);
+ if (entry.ir_format) {
+ apic_dbg("%s, remapped, I(%04X), Z(%X)\n", buf,
+ (entry.ir_index_15 << 15) | entry.ir_index_0_14, entry.ir_zero);
+ } else {
+ apic_dbg("%s, %s, D(%02X%02X), M(%1d)\n", buf,
+ entry.dest_mode_logical ? "logical " : "physical",
+ entry.virt_destid_8_14, entry.destid_0_7, entry.delivery_mode);
+ }
+ }
}
-
-__apicdebuginit(void) print_IO_APIC(void)
+static void __init print_IO_APIC(int ioapic_idx)
{
- int apic, i;
union IO_APIC_reg_00 reg_00;
union IO_APIC_reg_01 reg_01;
union IO_APIC_reg_02 reg_02;
union IO_APIC_reg_03 reg_03;
- unsigned long flags;
- struct irq_cfg *cfg;
- struct irq_desc *desc;
- unsigned int irq;
-
- printk(KERN_DEBUG "number of MP IRQ sources: %d.\n", mp_irq_entries);
- for (i = 0; i < nr_ioapics; i++)
- printk(KERN_DEBUG "number of IO-APIC #%d registers: %d.\n",
- mp_ioapics[i].apicid, nr_ioapic_registers[i]);
-
- /*
- * We are a bit conservative about what we expect. We have to
- * know about every hardware change ASAP.
- */
- printk(KERN_INFO "testing the IO APIC.......................\n");
-
- for (apic = 0; apic < nr_ioapics; apic++) {
-
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- reg_00.raw = io_apic_read(apic, 0);
- reg_01.raw = io_apic_read(apic, 1);
- if (reg_01.bits.version >= 0x10)
- reg_02.raw = io_apic_read(apic, 2);
- if (reg_01.bits.version >= 0x20)
- reg_03.raw = io_apic_read(apic, 3);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
-
- printk("\n");
- printk(KERN_DEBUG "IO APIC #%d......\n", mp_ioapics[apic].apicid);
- printk(KERN_DEBUG ".... register #00: %08X\n", reg_00.raw);
- printk(KERN_DEBUG "....... : physical APIC id: %02X\n", reg_00.bits.ID);
- printk(KERN_DEBUG "....... : Delivery Type: %X\n", reg_00.bits.delivery_type);
- printk(KERN_DEBUG "....... : LTS : %X\n", reg_00.bits.LTS);
- printk(KERN_DEBUG ".... register #01: %08X\n", *(int *)&reg_01);
- printk(KERN_DEBUG "....... : max redirection entries: %04X\n", reg_01.bits.entries);
-
- printk(KERN_DEBUG "....... : PRQ implemented: %X\n", reg_01.bits.PRQ);
- printk(KERN_DEBUG "....... : IO APIC version: %04X\n", reg_01.bits.version);
+ scoped_guard (raw_spinlock_irqsave, &ioapic_lock) {
+ reg_00.raw = io_apic_read(ioapic_idx, 0);
+ reg_01.raw = io_apic_read(ioapic_idx, 1);
+ if (reg_01.bits.version >= 0x10)
+ reg_02.raw = io_apic_read(ioapic_idx, 2);
+ if (reg_01.bits.version >= 0x20)
+ reg_03.raw = io_apic_read(ioapic_idx, 3);
+ }
+
+ apic_dbg("IO APIC #%d......\n", mpc_ioapic_id(ioapic_idx));
+ apic_dbg(".... register #00: %08X\n", reg_00.raw);
+ apic_dbg("....... : physical APIC id: %02X\n", reg_00.bits.ID);
+ apic_dbg("....... : Delivery Type: %X\n", reg_00.bits.delivery_type);
+ apic_dbg("....... : LTS : %X\n", reg_00.bits.LTS);
+ apic_dbg(".... register #01: %08X\n", *(int *)&reg_01);
+ apic_dbg("....... : max redirection entries: %02X\n", reg_01.bits.entries);
+ apic_dbg("....... : PRQ implemented: %X\n", reg_01.bits.PRQ);
+ apic_dbg("....... : IO APIC version: %02X\n", reg_01.bits.version);
/*
* Some Intel chipsets with IO APIC VERSION of 0x1? don't have reg_02,
@@ -1681,8 +1203,8 @@ __apicdebuginit(void) print_IO_APIC(void)
* value, so ignore it if reg_02 == reg_01.
*/
if (reg_01.bits.version >= 0x10 && reg_02.raw != reg_01.raw) {
- printk(KERN_DEBUG ".... register #02: %08X\n", reg_02.raw);
- printk(KERN_DEBUG "....... : arbitration: %02X\n", reg_02.bits.arbitration);
+ apic_dbg(".... register #02: %08X\n", reg_02.raw);
+ apic_dbg("....... : arbitration: %02X\n", reg_02.bits.arbitration);
}
/*
@@ -1692,286 +1214,90 @@ __apicdebuginit(void) print_IO_APIC(void)
*/
if (reg_01.bits.version >= 0x20 && reg_03.raw != reg_02.raw &&
reg_03.raw != reg_01.raw) {
- printk(KERN_DEBUG ".... register #03: %08X\n", reg_03.raw);
- printk(KERN_DEBUG "....... : Boot DT : %X\n", reg_03.bits.boot_DT);
+ apic_dbg(".... register #03: %08X\n", reg_03.raw);
+ apic_dbg("....... : Boot DT : %X\n", reg_03.bits.boot_DT);
}
- printk(KERN_DEBUG ".... IRQ redirection table:\n");
-
- printk(KERN_DEBUG " NR Dst Mask Trig IRR Pol"
- " Stat Dmod Deli Vect:\n");
-
- for (i = 0; i <= reg_01.bits.entries; i++) {
- struct IO_APIC_route_entry entry;
-
- entry = ioapic_read_entry(apic, i);
-
- printk(KERN_DEBUG " %02x %03X ",
- i,
- entry.dest
- );
-
- printk("%1d %1d %1d %1d %1d %1d %1d %02X\n",
- entry.mask,
- entry.trigger,
- entry.irr,
- entry.polarity,
- entry.delivery_status,
- entry.dest_mode,
- entry.delivery_mode,
- entry.vector
- );
- }
- }
- printk(KERN_DEBUG "IRQ to pin mappings:\n");
- for_each_irq_desc(irq, desc) {
- struct irq_pin_list *entry;
-
- cfg = desc->chip_data;
- entry = cfg->irq_2_pin;
- if (!entry)
- continue;
- printk(KERN_DEBUG "IRQ%d ", irq);
- for_each_irq_pin(entry, cfg->irq_2_pin)
- printk("-> %d:%d", entry->apic, entry->pin);
- printk("\n");
- }
-
- printk(KERN_INFO ".................................... done.\n");
-
- return;
+ apic_dbg(".... IRQ redirection table:\n");
+ io_apic_print_entries(ioapic_idx, reg_01.bits.entries);
}
-__apicdebuginit(void) print_APIC_field(int base)
+void __init print_IO_APICs(void)
{
- int i;
-
- printk(KERN_DEBUG);
-
- for (i = 0; i < 8; i++)
- printk(KERN_CONT "%08x", apic_read(base + i*0x10));
-
- printk(KERN_CONT "\n");
-}
+ int ioapic_idx;
+ unsigned int irq;
-__apicdebuginit(void) print_local_APIC(void *dummy)
-{
- unsigned int i, v, ver, maxlvt;
- u64 icr;
-
- printk(KERN_DEBUG "printing local APIC contents on CPU#%d/%d:\n",
- smp_processor_id(), hard_smp_processor_id());
- v = apic_read(APIC_ID);
- printk(KERN_INFO "... APIC ID: %08x (%01x)\n", v, read_apic_id());
- v = apic_read(APIC_LVR);
- printk(KERN_INFO "... APIC VERSION: %08x\n", v);
- ver = GET_APIC_VERSION(v);
- maxlvt = lapic_get_maxlvt();
-
- v = apic_read(APIC_TASKPRI);
- printk(KERN_DEBUG "... APIC TASKPRI: %08x (%02x)\n", v, v & APIC_TPRI_MASK);
-
- if (APIC_INTEGRATED(ver)) { /* !82489DX */
- if (!APIC_XAPIC(ver)) {
- v = apic_read(APIC_ARBPRI);
- printk(KERN_DEBUG "... APIC ARBPRI: %08x (%02x)\n", v,
- v & APIC_ARBPRI_MASK);
- }
- v = apic_read(APIC_PROCPRI);
- printk(KERN_DEBUG "... APIC PROCPRI: %08x\n", v);
+ apic_dbg("number of MP IRQ sources: %d.\n", mp_irq_entries);
+ for_each_ioapic(ioapic_idx) {
+ apic_dbg("number of IO-APIC #%d registers: %d.\n",
+ mpc_ioapic_id(ioapic_idx), ioapics[ioapic_idx].nr_registers);
}
/*
- * Remote read supported only in the 82489DX and local APIC for
- * Pentium processors.
+ * We are a bit conservative about what we expect. We have to
+ * know about every hardware change ASAP.
*/
- if (!APIC_INTEGRATED(ver) || maxlvt == 3) {
- v = apic_read(APIC_RRR);
- printk(KERN_DEBUG "... APIC RRR: %08x\n", v);
- }
-
- v = apic_read(APIC_LDR);
- printk(KERN_DEBUG "... APIC LDR: %08x\n", v);
- if (!x2apic_enabled()) {
- v = apic_read(APIC_DFR);
- printk(KERN_DEBUG "... APIC DFR: %08x\n", v);
- }
- v = apic_read(APIC_SPIV);
- printk(KERN_DEBUG "... APIC SPIV: %08x\n", v);
-
- printk(KERN_DEBUG "... APIC ISR field:\n");
- print_APIC_field(APIC_ISR);
- printk(KERN_DEBUG "... APIC TMR field:\n");
- print_APIC_field(APIC_TMR);
- printk(KERN_DEBUG "... APIC IRR field:\n");
- print_APIC_field(APIC_IRR);
-
- if (APIC_INTEGRATED(ver)) { /* !82489DX */
- if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */
- apic_write(APIC_ESR, 0);
-
- v = apic_read(APIC_ESR);
- printk(KERN_DEBUG "... APIC ESR: %08x\n", v);
- }
-
- icr = apic_icr_read();
- printk(KERN_DEBUG "... APIC ICR: %08x\n", (u32)icr);
- printk(KERN_DEBUG "... APIC ICR2: %08x\n", (u32)(icr >> 32));
-
- v = apic_read(APIC_LVTT);
- printk(KERN_DEBUG "... APIC LVTT: %08x\n", v);
-
- if (maxlvt > 3) { /* PC is LVT#4. */
- v = apic_read(APIC_LVTPC);
- printk(KERN_DEBUG "... APIC LVTPC: %08x\n", v);
- }
- v = apic_read(APIC_LVT0);
- printk(KERN_DEBUG "... APIC LVT0: %08x\n", v);
- v = apic_read(APIC_LVT1);
- printk(KERN_DEBUG "... APIC LVT1: %08x\n", v);
-
- if (maxlvt > 2) { /* ERR is LVT#3. */
- v = apic_read(APIC_LVTERR);
- printk(KERN_DEBUG "... APIC LVTERR: %08x\n", v);
- }
-
- v = apic_read(APIC_TMICT);
- printk(KERN_DEBUG "... APIC TMICT: %08x\n", v);
- v = apic_read(APIC_TMCCT);
- printk(KERN_DEBUG "... APIC TMCCT: %08x\n", v);
- v = apic_read(APIC_TDCR);
- printk(KERN_DEBUG "... APIC TDCR: %08x\n", v);
-
- if (boot_cpu_has(X86_FEATURE_EXTAPIC)) {
- v = apic_read(APIC_EFEAT);
- maxlvt = (v >> 16) & 0xff;
- printk(KERN_DEBUG "... APIC EFEAT: %08x\n", v);
- v = apic_read(APIC_ECTRL);
- printk(KERN_DEBUG "... APIC ECTRL: %08x\n", v);
- for (i = 0; i < maxlvt; i++) {
- v = apic_read(APIC_EILVTn(i));
- printk(KERN_DEBUG "... APIC EILVT%d: %08x\n", i, v);
- }
- }
- printk("\n");
-}
-
-__apicdebuginit(void) print_local_APICs(int maxcpu)
-{
- int cpu;
-
- if (!maxcpu)
- return;
-
- preempt_disable();
- for_each_online_cpu(cpu) {
- if (cpu >= maxcpu)
- break;
- smp_call_function_single(cpu, print_local_APIC, NULL, 1);
- }
- preempt_enable();
-}
-
-__apicdebuginit(void) print_PIC(void)
-{
- unsigned int v;
- unsigned long flags;
-
- if (!legacy_pic->nr_legacy_irqs)
- return;
-
- printk(KERN_DEBUG "\nprinting PIC contents\n");
-
- raw_spin_lock_irqsave(&i8259A_lock, flags);
-
- v = inb(0xa1) << 8 | inb(0x21);
- printk(KERN_DEBUG "... PIC IMR: %04x\n", v);
-
- v = inb(0xa0) << 8 | inb(0x20);
- printk(KERN_DEBUG "... PIC IRR: %04x\n", v);
-
- outb(0x0b,0xa0);
- outb(0x0b,0x20);
- v = inb(0xa0) << 8 | inb(0x20);
- outb(0x0a,0xa0);
- outb(0x0a,0x20);
-
- raw_spin_unlock_irqrestore(&i8259A_lock, flags);
+ printk(KERN_INFO "testing the IO APIC.......................\n");
- printk(KERN_DEBUG "... PIC ISR: %04x\n", v);
+ for_each_ioapic(ioapic_idx)
+ print_IO_APIC(ioapic_idx);
- v = inb(0x4d1) << 8 | inb(0x4d0);
- printk(KERN_DEBUG "... PIC ELCR: %04x\n", v);
-}
+ apic_dbg("IRQ to pin mappings:\n");
+ for_each_active_irq(irq) {
+ struct irq_pin_list *entry;
+ struct irq_chip *chip;
+ struct mp_chip_data *data;
-static int __initdata show_lapic = 1;
-static __init int setup_show_lapic(char *arg)
-{
- int num = -1;
+ chip = irq_get_chip(irq);
+ if (chip != &ioapic_chip && chip != &ioapic_ir_chip)
+ continue;
+ data = irq_get_chip_data(irq);
+ if (!data)
+ continue;
+ if (list_empty(&data->irq_2_pin))
+ continue;
- if (strcmp(arg, "all") == 0) {
- show_lapic = CONFIG_NR_CPUS;
- } else {
- get_option(&arg, &num);
- if (num >= 0)
- show_lapic = num;
+ apic_dbg("IRQ%d ", irq);
+ for_each_irq_pin(entry, data->irq_2_pin)
+ pr_cont("-> %d:%d", entry->apic, entry->pin);
+ pr_cont("\n");
}
- return 1;
-}
-__setup("show_lapic=", setup_show_lapic);
-
-__apicdebuginit(int) print_ICs(void)
-{
- if (apic_verbosity == APIC_QUIET)
- return 0;
-
- print_PIC();
-
- /* don't print out if apic is not there */
- if (!cpu_has_apic && !apic_from_smp_config())
- return 0;
-
- print_local_APICs(show_lapic);
- print_IO_APIC();
-
- return 0;
+ printk(KERN_INFO ".................................... done.\n");
}
-fs_initcall(print_ICs);
-
-
/* Where, if anywhere, is the i8259 connected in external int mode */
static struct { int pin, apic; } ioapic_i8259 = { -1, -1 };
void __init enable_IO_APIC(void)
{
- int i8259_apic, i8259_pin;
- int apic;
+ int i8259_apic, i8259_pin, apic, pin;
+
+ if (ioapic_is_disabled)
+ nr_ioapics = 0;
- if (!legacy_pic->nr_legacy_irqs)
+ if (!nr_legacy_irqs() || !nr_ioapics)
return;
- for(apic = 0; apic < nr_ioapics; apic++) {
- int pin;
+ for_each_ioapic_pin(apic, pin) {
/* See if any of the pins is in ExtINT mode */
- for (pin = 0; pin < nr_ioapic_registers[apic]; pin++) {
- struct IO_APIC_route_entry entry;
- entry = ioapic_read_entry(apic, pin);
+ struct IO_APIC_route_entry entry = ioapic_read_entry(apic, pin);
- /* If the interrupt line is enabled and in ExtInt mode
- * I have found the pin where the i8259 is connected.
- */
- if ((entry.mask == 0) && (entry.delivery_mode == dest_ExtINT)) {
- ioapic_i8259.apic = apic;
- ioapic_i8259.pin = pin;
- goto found_i8259;
- }
+ /*
+ * If the interrupt line is enabled and in ExtInt mode, we have
+ * found the pin where the i8259 is connected.
+ */
+ if (!entry.masked && entry.delivery_mode == APIC_DELIVERY_MODE_EXTINT) {
+ ioapic_i8259.apic = apic;
+ ioapic_i8259.pin = pin;
+ break;
}
}
- found_i8259:
- /* Look to see what if the MP table has reported the ExtINT */
- /* If we could not find the appropriate pin by looking at the ioapic
+
+ /*
+ * Look to see if the MP table has reported the ExtINT
+ *
+ * If we could not find the appropriate pin by looking at the ioapic,
 * the i8259 probably is not connected to the ioapic, but give the
* mptable a chance anyway.
*/
@@ -1979,72 +1305,52 @@ void __init enable_IO_APIC(void)
i8259_apic = find_isa_irq_apic(0, mp_ExtINT);
/* Trust the MP table if nothing is setup in the hardware */
if ((ioapic_i8259.pin == -1) && (i8259_pin >= 0)) {
- printk(KERN_WARNING "ExtINT not setup in hardware but reported by MP table\n");
+ pr_warn("ExtINT not setup in hardware but reported by MP table\n");
ioapic_i8259.pin = i8259_pin;
ioapic_i8259.apic = i8259_apic;
}
/* Complain if the MP table and the hardware disagree */
if (((ioapic_i8259.apic != i8259_apic) || (ioapic_i8259.pin != i8259_pin)) &&
- (i8259_pin >= 0) && (ioapic_i8259.pin >= 0))
- {
- printk(KERN_WARNING "ExtINT in hardware and MP table differ\n");
- }
+ (i8259_pin >= 0) && (ioapic_i8259.pin >= 0))
+ pr_warn("ExtINT in hardware and MP table differ\n");
- /*
- * Do not trust the IO-APIC being empty at bootup
- */
+ /* Do not trust the IO-APIC being empty at bootup */
clear_IO_APIC();
}
-/*
- * Not an __init, needed by the reboot code
- */
-void disable_IO_APIC(void)
+void native_restore_boot_irq_mode(void)
{
/*
- * Clear the IO-APIC before rebooting:
+ * If the i8259 is routed through an IOAPIC, put that IOAPIC in
+ * virtual wire mode so legacy interrupts can be delivered.
*/
- clear_IO_APIC();
-
- if (!legacy_pic->nr_legacy_irqs)
- return;
-
- /*
- * If the i8259 is routed through an IOAPIC
- * Put that IOAPIC in virtual wire mode
- * so legacy interrupts can be delivered.
- *
- * With interrupt-remapping, for now we will use virtual wire A mode,
- * as virtual wire B is little complex (need to configure both
- * IOAPIC RTE aswell as interrupt-remapping table entry).
- * As this gets called during crash dump, keep this simple for now.
- */
- if (ioapic_i8259.pin != -1 && !intr_remapping_enabled) {
+ if (ioapic_i8259.pin != -1) {
struct IO_APIC_route_entry entry;
+ u32 apic_id = read_apic_id();
memset(&entry, 0, sizeof(entry));
- entry.mask = 0; /* Enabled */
- entry.trigger = 0; /* Edge */
- entry.irr = 0;
- entry.polarity = 0; /* High */
- entry.delivery_status = 0;
- entry.dest_mode = 0; /* Physical */
- entry.delivery_mode = dest_ExtINT; /* ExtInt */
- entry.vector = 0;
- entry.dest = read_apic_id();
-
- /*
- * Add it to the IO-APIC irq-routing table:
- */
+ entry.masked = false;
+ entry.is_level = false;
+ entry.active_low = false;
+ entry.dest_mode_logical = false;
+ entry.delivery_mode = APIC_DELIVERY_MODE_EXTINT;
+ entry.destid_0_7 = apic_id & 0xFF;
+ entry.virt_destid_8_14 = apic_id >> 8;
+
+ /* Add it to the IO-APIC irq-routing table */
ioapic_write_entry(ioapic_i8259.apic, ioapic_i8259.pin, entry);
}
- /*
- * Use virtual wire A mode when interrupt remapping is enabled.
- */
- if (cpu_has_apic || apic_from_smp_config())
- disconnect_bsp_APIC(!intr_remapping_enabled &&
- ioapic_i8259.pin != -1);
+ if (boot_cpu_has(X86_FEATURE_APIC) || apic_from_smp_config())
+ disconnect_bsp_APIC(ioapic_i8259.pin != -1);
+}
+
+void restore_boot_irq_mode(void)
+{
+ if (!nr_legacy_irqs())
+ return;
+
+ x86_apic_ops.restore();
}
#ifdef CONFIG_X86_32
@@ -2054,49 +1360,35 @@ void disable_IO_APIC(void)
*
* by Matt Domsch <Matt_Domsch@dell.com> Tue Dec 21 12:25:05 CST 1999
*/
-
-void __init setup_ioapic_ids_from_mpc(void)
+static void __init setup_ioapic_ids_from_mpc_nocheck(void)
{
+ DECLARE_BITMAP(phys_id_present_map, MAX_LOCAL_APIC);
+ const u32 broadcast_id = 0xF;
union IO_APIC_reg_00 reg_00;
- physid_mask_t phys_id_present_map;
- int apic_id;
- int i;
unsigned char old_id;
- unsigned long flags;
+ int ioapic_idx, i;
- if (acpi_ioapic)
- return;
- /*
- * Don't check I/O APIC IDs for xAPIC systems. They have
- * no meaning without the serial APIC bus.
- */
- if (!(boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
- || APIC_XAPIC(apic_version[boot_cpu_physical_apicid]))
- return;
/*
* This is broken; anything with a real cpu count has to
* circumvent this idiocy regardless.
*/
- apic->ioapic_phys_id_map(&phys_cpu_present_map, &phys_id_present_map);
+ copy_phys_cpu_present_map(phys_id_present_map);
/*
* Set the IOAPIC ID to the value stored in the MPC table.
*/
- for (apic_id = 0; apic_id < nr_ioapics; apic_id++) {
-
+ for_each_ioapic(ioapic_idx) {
/* Read the register 0 value */
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- reg_00.raw = io_apic_read(apic_id, 0);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
-
- old_id = mp_ioapics[apic_id].apicid;
-
- if (mp_ioapics[apic_id].apicid >= get_physical_broadcast()) {
- printk(KERN_ERR "BIOS bug, IO-APIC#%d ID is %d in the MPC table!...\n",
- apic_id, mp_ioapics[apic_id].apicid);
- printk(KERN_ERR "... fixing up to %d. (tell your hw vendor)\n",
- reg_00.bits.ID);
- mp_ioapics[apic_id].apicid = reg_00.bits.ID;
+ scoped_guard (raw_spinlock_irqsave, &ioapic_lock)
+ reg_00.raw = io_apic_read(ioapic_idx, 0);
+
+ old_id = mpc_ioapic_id(ioapic_idx);
+
+ if (mpc_ioapic_id(ioapic_idx) >= broadcast_id) {
+ pr_err(FW_BUG "IO-APIC#%d ID is %d in the MPC table!...\n",
+ ioapic_idx, mpc_ioapic_id(ioapic_idx));
+ pr_err("... fixing up to %d. (tell your hw vendor)\n", reg_00.bits.ID);
+ ioapics[ioapic_idx].mp_config.apicid = reg_00.bits.ID;
}
/*
@@ -2104,64 +1396,71 @@ void __init setup_ioapic_ids_from_mpc(void)
* system must have a unique ID or we get lots of nice
* 'stuck on smp_invalidate_needed IPI wait' messages.
*/
- if (apic->check_apicid_used(&phys_id_present_map,
- mp_ioapics[apic_id].apicid)) {
- printk(KERN_ERR "BIOS bug, IO-APIC#%d ID %d is already used!...\n",
- apic_id, mp_ioapics[apic_id].apicid);
- for (i = 0; i < get_physical_broadcast(); i++)
- if (!physid_isset(i, phys_id_present_map))
+ if (test_bit(mpc_ioapic_id(ioapic_idx), phys_id_present_map)) {
+ pr_err(FW_BUG "IO-APIC#%d ID %d is already used!...\n",
+ ioapic_idx, mpc_ioapic_id(ioapic_idx));
+ for (i = 0; i < broadcast_id; i++)
+ if (!test_bit(i, phys_id_present_map))
break;
- if (i >= get_physical_broadcast())
+ if (i >= broadcast_id)
panic("Max APIC ID exceeded!\n");
- printk(KERN_ERR "... fixing up to %d. (tell your hw vendor)\n",
- i);
- physid_set(i, phys_id_present_map);
- mp_ioapics[apic_id].apicid = i;
+ pr_err("... fixing up to %d. (tell your hw vendor)\n", i);
+ set_bit(i, phys_id_present_map);
+ ioapics[ioapic_idx].mp_config.apicid = i;
} else {
- physid_mask_t tmp;
- apic->apicid_to_cpu_present(mp_ioapics[apic_id].apicid, &tmp);
- apic_printk(APIC_VERBOSE, "Setting %d in the "
- "phys_id_present_map\n",
- mp_ioapics[apic_id].apicid);
- physids_or(phys_id_present_map, phys_id_present_map, tmp);
+ apic_pr_verbose("Setting %d in the phys_id_present_map\n",
+ mpc_ioapic_id(ioapic_idx));
+ set_bit(mpc_ioapic_id(ioapic_idx), phys_id_present_map);
}
-
/*
- * We need to adjust the IRQ routing table
- * if the ID changed.
+ * We need to adjust the IRQ routing table if the ID
+ * changed.
*/
- if (old_id != mp_ioapics[apic_id].apicid)
- for (i = 0; i < mp_irq_entries; i++)
+ if (old_id != mpc_ioapic_id(ioapic_idx)) {
+ for (i = 0; i < mp_irq_entries; i++) {
if (mp_irqs[i].dstapic == old_id)
- mp_irqs[i].dstapic
- = mp_ioapics[apic_id].apicid;
+ mp_irqs[i].dstapic = mpc_ioapic_id(ioapic_idx);
+ }
+ }
/*
- * Read the right value from the MPC table and
- * write it into the ID register.
+ * Update the ID register according to the right value from
+ * the MPC table if they are different.
*/
- apic_printk(APIC_VERBOSE, KERN_INFO
- "...changing IO-APIC physical APIC ID to %d ...",
- mp_ioapics[apic_id].apicid);
+ if (mpc_ioapic_id(ioapic_idx) == reg_00.bits.ID)
+ continue;
- reg_00.bits.ID = mp_ioapics[apic_id].apicid;
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- io_apic_write(apic_id, 0, reg_00.raw);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
+ apic_pr_verbose("...changing IO-APIC physical APIC ID to %d ...",
+ mpc_ioapic_id(ioapic_idx));
- /*
- * Sanity check
- */
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- reg_00.raw = io_apic_read(apic_id, 0);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
- if (reg_00.bits.ID != mp_ioapics[apic_id].apicid)
- printk("could not set ID!\n");
+ reg_00.bits.ID = mpc_ioapic_id(ioapic_idx);
+ scoped_guard (raw_spinlock_irqsave, &ioapic_lock) {
+ io_apic_write(ioapic_idx, 0, reg_00.raw);
+ reg_00.raw = io_apic_read(ioapic_idx, 0);
+ }
+ /* Sanity check */
+ if (reg_00.bits.ID != mpc_ioapic_id(ioapic_idx))
+ pr_cont("could not set ID!\n");
else
- apic_printk(APIC_VERBOSE, " ok.\n");
+ apic_pr_verbose(" ok.\n");
}
}
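
The duplicate-ID fixup in setup_ioapic_ids_from_mpc_nocheck() boils down to a first-clear-bit scan below the broadcast ID. A runnable single-word model of that scan (my simplification of the DECLARE_BITMAP/test_bit version, assuming fewer than BITS_PER_LONG IDs):

#include <stdio.h>

#define BROADCAST_ID 0xF	/* xAPIC physical broadcast, as above */

int main(void)
{
	unsigned long used = (1UL << 0) | (1UL << 1) | (1UL << 2);
	int i;

	for (i = 0; i < BROADCAST_ID; i++)
		if (!(used & (1UL << i)))
			break;
	if (i >= BROADCAST_ID) {
		fprintf(stderr, "Max APIC ID exceeded!\n");
		return 1;
	}
	printf("... fixing up to %d. (tell your hw vendor)\n", i);	/* 3 */
	return 0;
}
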
+
+void __init setup_ioapic_ids_from_mpc(void)
+{
+
+ if (acpi_ioapic)
+ return;
+ /*
+ * Don't check I/O APIC IDs for xAPIC systems. They have
+ * no meaning without the serial APIC bus.
+ */
+ if (!(boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ || APIC_XAPIC(boot_cpu_apic_version))
+ return;
+ setup_ioapic_ids_from_mpc_nocheck();
+}
#endif
int no_timer_check __initdata;
@@ -2173,6 +1472,42 @@ static int __init notimercheck(char *s)
}
__setup("no_timer_check", notimercheck);
+static void __init delay_with_tsc(void)
+{
+ unsigned long long start, now;
+ unsigned long end = jiffies + 4;
+
+ start = rdtsc();
+
+ /*
+ * We don't know the TSC frequency yet, but waiting for
+ * 40000000000/HZ TSC cycles is safe:
+ * 4 GHz == 10 jiffies
+ * 1 GHz == 40 jiffies
+ */
+ do {
+ native_pause();
+ now = rdtsc();
+ } while ((now - start) < 40000000000ULL / HZ && time_before_eq(jiffies, end));
+}
+
+static void __init delay_without_tsc(void)
+{
+ unsigned long end = jiffies + 4;
+ int band = 1;
+
+ /*
+ * We don't know any frequency yet, but waiting for
+ * 40940000000/HZ cycles is safe:
+ * 4 GHz == 10 jiffies
+ * 1 GHz == 40 jiffies
+ * 1 << 1 + 1 << 2 +...+ 1 << 11 = 4094
+ */
+ do {
+ __delay(((1U << band++) * 10000000UL) / HZ);
+ } while (band < 12 && time_before_eq(jiffies, end));
+}
+
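
Both delay helpers rely only on the arithmetic quoted in their comments. A throwaway check (mine, not part of the patch) that 40000000000 cycles at 4 GHz and 1 GHz really correspond to 10 and 40 jiffies of waiting, and that the doubling bands sum to the quoted 4094:

#include <stdio.h>

int main(void)
{
	/* delay_with_tsc() spins for 40000000000/HZ cycles, which is
	 * 40000000000/freq seconds * HZ = 40000000000/freq jiffies */
	printf("4 GHz: %llu jiffies\n", 40000000000ULL / 4000000000ULL); /* 10 */
	printf("1 GHz: %llu jiffies\n", 40000000000ULL / 1000000000ULL); /* 40 */

	/* delay_without_tsc(): 1<<1 + 1<<2 + ... + 1<<11 = 4094 */
	unsigned long long sum = 0;
	for (int band = 1; band < 12; band++)
		sum += 1ULL << band;
	printf("band sum: %llu\n", sum);	/* 4094 */
	return 0;
}
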
/*
* There is a nasty bug in some older SMP boards, their mptable lies
* about the timer IRQ. We do the following to work around the situation:
@@ -2184,16 +1519,15 @@ __setup("no_timer_check", notimercheck);
static int __init timer_irq_works(void)
{
unsigned long t1 = jiffies;
- unsigned long flags;
if (no_timer_check)
return 1;
- local_save_flags(flags);
local_irq_enable();
- /* Let ten ticks pass... */
- mdelay((10 * 1000) / HZ);
- local_irq_restore(flags);
+ if (boot_cpu_has(X86_FEATURE_TSC))
+ delay_with_tsc();
+ else
+ delay_without_tsc();
/*
* Expect a few ticks at least, to be sure some possible
@@ -2203,10 +1537,10 @@ static int __init timer_irq_works(void)
* least one tick may be lost due to delays.
*/
- /* jiffies wrap? */
- if (time_after(jiffies, t1 + 4))
- return 1;
- return 0;
+ local_irq_disable();
+
+ /* Did jiffies advance? */
+ return time_after(jiffies, t1 + 4);
}
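
The rewritten tail of timer_irq_works() can drop the old explicit "jiffies wrap?" remark because time_after() is wrap-safe by construction. A small demonstration of the signed-difference idiom it uses (the macro matches the include/linux/jiffies.h definition; the test values are mine):

#include <limits.h>
#include <stdio.h>

/* Same idiom as the kernel's time_after(a, b): true if a is after b,
 * even across an unsigned wrap, via signed subtraction. */
#define time_after(a, b)	((long)((b) - (a)) < 0)

int main(void)
{
	unsigned long t1 = ULONG_MAX - 2;	/* just before the wrap */
	unsigned long now = t1 + 7;		/* wrapped past zero */

	printf("advanced: %d\n", time_after(now, t1 + 4));	/* 1 */
	return 0;
}
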
/*
@@ -2215,409 +1549,120 @@ static int __init timer_irq_works(void)
* so we 'resend' these IRQs via IPIs, to the same CPU. It's much
* better to do it this way as thus we do not have to be aware of
* 'pending' interrupts in the IRQ path, except at this point.
- */
-/*
- * Edge triggered needs to resend any interrupt
- * that was delayed but this is now handled in the device
- * independent code.
- */
-
-/*
- * Starting up a edge-triggered IO-APIC interrupt is
- * nasty - we need to make sure that we get the edge.
- * If it is already asserted for some reason, we need
- * return 1 to indicate that is was pending.
*
- * This is not complete - we should be able to fake
- * an edge even if it isn't on the 8259A...
+ *
+ * Edge triggered interrupt handling needs to resend any interrupt that
+ * was delayed, but this is now handled in the device independent code.
+ *
+ * Starting up an edge-triggered IO-APIC interrupt is nasty - we need to
+ * make sure that we get the edge. If it is already asserted for some
+ * reason, we need to return 1 to indicate that it was pending.
+ *
+ * This is not complete - we should be able to fake an edge even if it
+ * isn't on the 8259A...
*/
-
-static unsigned int startup_ioapic_irq(unsigned int irq)
+static unsigned int startup_ioapic_irq(struct irq_data *data)
{
- int was_pending = 0;
- unsigned long flags;
- struct irq_cfg *cfg;
+ int was_pending = 0, irq = data->irq;
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- if (irq < legacy_pic->nr_legacy_irqs) {
- legacy_pic->chip->mask(irq);
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ if (irq < nr_legacy_irqs()) {
+ legacy_pic->mask(irq);
if (legacy_pic->irq_pending(irq))
was_pending = 1;
}
- cfg = irq_cfg(irq);
- __unmask_IO_APIC_irq(cfg);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
-
+ __unmask_ioapic(data->chip_data);
return was_pending;
}
-static int ioapic_retrigger_irq(unsigned int irq)
-{
-
- struct irq_cfg *cfg = irq_cfg(irq);
- unsigned long flags;
-
- raw_spin_lock_irqsave(&vector_lock, flags);
- apic->send_IPI_mask(cpumask_of(cpumask_first(cfg->domain)), cfg->vector);
- raw_spin_unlock_irqrestore(&vector_lock, flags);
-
- return 1;
-}
-
-/*
- * Level and edge triggered IO-APIC interrupts need different handling,
- * so we use two separate IRQ descriptors. Edge triggered IRQs can be
- * handled with the level-triggered descriptor, but that one has slightly
- * more overhead. Level-triggered interrupts cannot be handled with the
- * edge-triggered handler, without risking IRQ storms and other ugly
- * races.
- */
-
-#ifdef CONFIG_SMP
-void send_cleanup_vector(struct irq_cfg *cfg)
-{
- cpumask_var_t cleanup_mask;
-
- if (unlikely(!alloc_cpumask_var(&cleanup_mask, GFP_ATOMIC))) {
- unsigned int i;
- for_each_cpu_and(i, cfg->old_domain, cpu_online_mask)
- apic->send_IPI_mask(cpumask_of(i), IRQ_MOVE_CLEANUP_VECTOR);
- } else {
- cpumask_and(cleanup_mask, cfg->old_domain, cpu_online_mask);
- apic->send_IPI_mask(cleanup_mask, IRQ_MOVE_CLEANUP_VECTOR);
- free_cpumask_var(cleanup_mask);
- }
- cfg->move_in_progress = 0;
-}
+atomic_t irq_mis_count;
-static void __target_IO_APIC_irq(unsigned int irq, unsigned int dest, struct irq_cfg *cfg)
+#ifdef CONFIG_GENERIC_PENDING_IRQ
+static bool io_apic_level_ack_pending(struct mp_chip_data *data)
{
- int apic, pin;
struct irq_pin_list *entry;
- u8 vector = cfg->vector;
- for_each_irq_pin(entry, cfg->irq_2_pin) {
- unsigned int reg;
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ for_each_irq_pin(entry, data->irq_2_pin) {
+ struct IO_APIC_route_entry e;
+ int pin;
- apic = entry->apic;
pin = entry->pin;
- /*
- * With interrupt-remapping, destination information comes
- * from interrupt-remapping table entry.
- */
- if (!irq_remapped(irq))
- io_apic_write(apic, 0x11 + pin*2, dest);
- reg = io_apic_read(apic, 0x10 + pin*2);
- reg &= ~IO_APIC_REDIR_VECTOR_MASK;
- reg |= vector;
- io_apic_modify(apic, 0x10 + pin*2, reg);
+ e.w1 = io_apic_read(entry->apic, 0x10 + pin*2);
+ /* Is the remote IRR bit set? */
+ if (e.irr)
+ return true;
}
+ return false;
}
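
io_apic_level_ack_pending() only reads w1, the low dword of the RTE, because Remote IRR sits at bit 14 there. A hedged user-space model of that dword (bit positions per the 82093AA datasheet; the union below is mine, not the kernel's IO_APIC_route_entry, and it assumes a little-endian bitfield layout):

#include <stdint.h>
#include <stdio.h>

union rte_lo {
	struct {
		uint32_t vector            : 8;	/* bits 0-7 */
		uint32_t delivery_mode     : 3;	/* bits 8-10 */
		uint32_t dest_mode_logical : 1;	/* bit 11 */
		uint32_t delivery_status   : 1;	/* bit 12 */
		uint32_t active_low        : 1;	/* bit 13 */
		uint32_t irr               : 1;	/* bit 14: Remote IRR */
		uint32_t is_level          : 1;	/* bit 15 */
		uint32_t masked            : 1;	/* bit 16 */
	};
	uint32_t w1;
};

int main(void)
{
	union rte_lo e = { .w1 = 0x0000c031 };	/* level, IRR set, vector 0x31 */

	printf("vector=%#x level=%u irr=%u\n",
	       (unsigned)e.vector, (unsigned)e.is_level, (unsigned)e.irr);
	return 0;
}
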
-/*
- * Either sets desc->affinity to a valid value, and returns
- * ->cpu_mask_to_apicid of that in dest_id, or returns -1 and
- * leaves desc->affinity untouched.
- */
-unsigned int
-set_desc_affinity(struct irq_desc *desc, const struct cpumask *mask,
- unsigned int *dest_id)
-{
- struct irq_cfg *cfg;
- unsigned int irq;
-
- if (!cpumask_intersects(mask, cpu_online_mask))
- return -1;
-
- irq = desc->irq;
- cfg = desc->chip_data;
- if (assign_irq_vector(irq, cfg, mask))
- return -1;
-
- cpumask_copy(desc->affinity, mask);
-
- *dest_id = apic->cpu_mask_to_apicid_and(desc->affinity, cfg->domain);
- return 0;
-}
-
-static int
-set_ioapic_affinity_irq_desc(struct irq_desc *desc, const struct cpumask *mask)
+static inline bool ioapic_prepare_move(struct irq_data *data)
{
- struct irq_cfg *cfg;
- unsigned long flags;
- unsigned int dest;
- unsigned int irq;
- int ret = -1;
-
- irq = desc->irq;
- cfg = desc->chip_data;
-
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- ret = set_desc_affinity(desc, mask, &dest);
- if (!ret) {
- /* Only the high 8 bits are valid. */
- dest = SET_APIC_LOGICAL_ID(dest);
- __target_IO_APIC_irq(irq, dest, cfg);
+ /* If we are moving the IRQ we need to mask it */
+ if (unlikely(irqd_is_setaffinity_pending(data))) {
+ if (!irqd_irq_masked(data))
+ mask_ioapic_irq(data);
+ return true;
}
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
-
- return ret;
-}
-
-static int
-set_ioapic_affinity_irq(unsigned int irq, const struct cpumask *mask)
-{
- struct irq_desc *desc;
-
- desc = irq_to_desc(irq);
-
- return set_ioapic_affinity_irq_desc(desc, mask);
-}
-
-#ifdef CONFIG_INTR_REMAP
-
-/*
- * Migrate the IO-APIC irq in the presence of intr-remapping.
- *
- * For both level and edge triggered, irq migration is a simple atomic
- * update(of vector and cpu destination) of IRTE and flush the hardware cache.
- *
- * For level triggered, we eliminate the io-apic RTE modification (with the
- * updated vector information), by using a virtual vector (io-apic pin number).
- * Real vector that is used for interrupting cpu will be coming from
- * the interrupt-remapping table entry.
- */
-static int
-migrate_ioapic_irq_desc(struct irq_desc *desc, const struct cpumask *mask)
-{
- struct irq_cfg *cfg;
- struct irte irte;
- unsigned int dest;
- unsigned int irq;
- int ret = -1;
-
- if (!cpumask_intersects(mask, cpu_online_mask))
- return ret;
-
- irq = desc->irq;
- if (get_irte(irq, &irte))
- return ret;
-
- cfg = desc->chip_data;
- if (assign_irq_vector(irq, cfg, mask))
- return ret;
-
- dest = apic->cpu_mask_to_apicid_and(cfg->domain, mask);
-
- irte.vector = cfg->vector;
- irte.dest_id = IRTE_DEST(dest);
-
- /*
- * Modified the IRTE and flushes the Interrupt entry cache.
- */
- modify_irte(irq, &irte);
-
- if (cfg->move_in_progress)
- send_cleanup_vector(cfg);
-
- cpumask_copy(desc->affinity, mask);
-
- return 0;
-}
-
-/*
- * Migrates the IRQ destination in the process context.
- */
-static int set_ir_ioapic_affinity_irq_desc(struct irq_desc *desc,
- const struct cpumask *mask)
-{
- return migrate_ioapic_irq_desc(desc, mask);
-}
-static int set_ir_ioapic_affinity_irq(unsigned int irq,
- const struct cpumask *mask)
-{
- struct irq_desc *desc = irq_to_desc(irq);
-
- return set_ir_ioapic_affinity_irq_desc(desc, mask);
-}
-#else
-static inline int set_ir_ioapic_affinity_irq_desc(struct irq_desc *desc,
- const struct cpumask *mask)
-{
- return 0;
+ return false;
}
-#endif
-asmlinkage void smp_irq_move_cleanup_interrupt(void)
+static inline void ioapic_finish_move(struct irq_data *data, bool moveit)
{
- unsigned vector, me;
-
- ack_APIC_irq();
- exit_idle();
- irq_enter();
-
- me = smp_processor_id();
- for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
- unsigned int irq;
- unsigned int irr;
- struct irq_desc *desc;
- struct irq_cfg *cfg;
- irq = __get_cpu_var(vector_irq)[vector];
-
- if (irq == -1)
- continue;
-
- desc = irq_to_desc(irq);
- if (!desc)
- continue;
-
- cfg = irq_cfg(irq);
- raw_spin_lock(&desc->lock);
-
- /*
- * Check if the irq migration is in progress. If so, we
- * haven't received the cleanup request yet for this irq.
- */
- if (cfg->move_in_progress)
- goto unlock;
-
- if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain))
- goto unlock;
-
- irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
+ if (unlikely(moveit)) {
/*
- * Check if the vector that needs to be cleanedup is
- * registered at the cpu's IRR. If so, then this is not
- * the best time to clean it up. Lets clean it up in the
- * next attempt by sending another IRQ_MOVE_CLEANUP_VECTOR
- * to myself.
+ * Only migrate the irq if the ack has been received.
+ *
+ * On rare occasions the broadcast level triggered ack gets
+ * delayed going to ioapics, and if we reprogram the
+ * vector while Remote IRR is still set the irq will never
+ * fire again.
+ *
+ * To prevent this scenario we read the Remote IRR bit
+ * of the ioapic. This has two effects.
+ * - On any sane system the read of the ioapic will
+ * flush writes (and acks) going to the ioapic from
+ * this cpu.
+ * - We get to see if the ACK has actually been delivered.
+ *
+ * Based on failed experiments of reprogramming the
+ * ioapic entry from outside of irq context, starting
+ * with masking the ioapic entry and then polling until
+ * Remote IRR was clear before reprogramming the
+ * ioapic, I don't trust the Remote IRR bit to be
+ * completely accurate.
+ *
+ * However, there appears to be no other way to plug
+ * this race, so if the Remote IRR bit is not
+ * accurate and is causing problems, then it is a hardware
+ * bug and you can go talk to the chipset vendor about it.
*/
- if (irr & (1 << (vector % 32))) {
- apic->send_IPI_self(IRQ_MOVE_CLEANUP_VECTOR);
- goto unlock;
- }
- __get_cpu_var(vector_irq)[vector] = -1;
-unlock:
- raw_spin_unlock(&desc->lock);
+ if (!io_apic_level_ack_pending(data->chip_data))
+ irq_move_masked_irq(data);
+ /* If the IRQ is masked in the core, leave it: */
+ if (!irqd_irq_masked(data))
+ unmask_ioapic_irq(data);
}
-
- irq_exit();
-}
-
-static void __irq_complete_move(struct irq_desc **descp, unsigned vector)
-{
- struct irq_desc *desc = *descp;
- struct irq_cfg *cfg = desc->chip_data;
- unsigned me;
-
- if (likely(!cfg->move_in_progress))
- return;
-
- me = smp_processor_id();
-
- if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain))
- send_cleanup_vector(cfg);
-}
-
-static void irq_complete_move(struct irq_desc **descp)
-{
- __irq_complete_move(descp, ~get_irq_regs()->orig_ax);
-}
-
-void irq_force_complete_move(int irq)
-{
- struct irq_desc *desc = irq_to_desc(irq);
- struct irq_cfg *cfg = desc->chip_data;
-
- if (!cfg)
- return;
-
- __irq_complete_move(&desc, cfg->vector);
}
#else
-static inline void irq_complete_move(struct irq_desc **descp) {}
-#endif
-
-static void ack_apic_edge(unsigned int irq)
-{
- struct irq_desc *desc = irq_to_desc(irq);
-
- irq_complete_move(&desc);
- move_native_irq(irq);
- ack_APIC_irq();
-}
-
-atomic_t irq_mis_count;
-
-/*
- * IO-APIC versions below 0x20 don't support EOI register.
- * For the record, here is the information about various versions:
- * 0Xh 82489DX
- * 1Xh I/OAPIC or I/O(x)APIC which are not PCI 2.2 Compliant
- * 2Xh I/O(x)APIC which is PCI 2.2 Compliant
- * 30h-FFh Reserved
- *
- * Some of the Intel ICH Specs (ICH2 to ICH5) documents the io-apic
- * version as 0x2. This is an error with documentation and these ICH chips
- * use io-apic's of version 0x20.
- *
- * For IO-APIC's with EOI register, we use that to do an explicit EOI.
- * Otherwise, we simulate the EOI message manually by changing the trigger
- * mode to edge and then back to level, with RTE being masked during this.
-*/
-static void __eoi_ioapic_irq(unsigned int irq, struct irq_cfg *cfg)
+static inline bool ioapic_prepare_move(struct irq_data *data)
{
- struct irq_pin_list *entry;
-
- for_each_irq_pin(entry, cfg->irq_2_pin) {
- if (mp_ioapics[entry->apic].apicver >= 0x20) {
- /*
- * Intr-remapping uses pin number as the virtual vector
- * in the RTE. Actual vector is programmed in
- * intr-remapping table entry. Hence for the io-apic
- * EOI we use the pin number.
- */
- if (irq_remapped(irq))
- io_apic_eoi(entry->apic, entry->pin);
- else
- io_apic_eoi(entry->apic, cfg->vector);
- } else {
- __mask_and_edge_IO_APIC_irq(entry);
- __unmask_and_level_IO_APIC_irq(entry);
- }
- }
+ return false;
}
-
-static void eoi_ioapic_irq(struct irq_desc *desc)
+static inline void ioapic_finish_move(struct irq_data *data, bool moveit)
{
- struct irq_cfg *cfg;
- unsigned long flags;
- unsigned int irq;
-
- irq = desc->irq;
- cfg = desc->chip_data;
-
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- __eoi_ioapic_irq(irq, cfg);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
}
+#endif
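
The prepare/finish split above exists so a pending affinity change only reprograms the entry while it is masked and after the level ack has visibly landed. A toy model of that ordering (all names and state here are mine; the real check is io_apic_level_ack_pending() against the Remote IRR bits):

#include <stdbool.h>
#include <stdio.h>

struct toy_irq {
	bool move_pending;	/* affinity change requested */
	bool masked;
	bool remote_irr;	/* "hardware" still owes us an EOI */
};

static bool prepare_move(struct toy_irq *irq)
{
	if (!irq->move_pending)
		return false;
	irq->masked = true;	/* mask before the EOI/retarget dance */
	return true;
}

static void finish_move(struct toy_irq *irq, bool moveit)
{
	if (!moveit)
		return;
	if (!irq->remote_irr) {		/* ack delivered: safe to move */
		irq->move_pending = false;
		printf("entry retargeted\n");
	}
	irq->masked = false;	/* the real code also honours a core-level mask */
}

int main(void)
{
	struct toy_irq irq = { .move_pending = true };
	bool moveit = prepare_move(&irq);

	/* ...the EOI happens between the two phases... */
	finish_move(&irq, moveit);
	return 0;
}
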
-static void ack_apic_level(unsigned int irq)
+static void ioapic_ack_level(struct irq_data *irq_data)
{
- struct irq_desc *desc = irq_to_desc(irq);
+ struct irq_cfg *cfg = irqd_cfg(irq_data);
unsigned long v;
+ bool moveit;
int i;
- struct irq_cfg *cfg;
- int do_unmask_irq = 0;
- irq_complete_move(&desc);
-#ifdef CONFIG_GENERIC_PENDING_IRQ
- /* If we are moving the irq we need to mask it */
- if (unlikely(desc->status & IRQ_MOVE_PENDING)) {
- do_unmask_irq = 1;
- mask_IO_APIC_irq_desc(desc);
- }
-#endif
+ irq_complete_move(cfg);
+ moveit = ioapic_prepare_move(irq_data);
/*
* It appears there is an erratum which affects at least version 0x11
@@ -2651,7 +1696,6 @@ static void ack_apic_level(unsigned int irq)
* we use the above logic (mask+edge followed by unmask+level) from
* Manfred Spraul to clear the remote IRR.
*/
- cfg = desc->chip_data;
i = cfg->vector;
v = apic_read(APIC_TMR + ((i & ~0x1f) >> 1));
@@ -2659,129 +1703,199 @@ static void ack_apic_level(unsigned int irq)
* We must acknowledge the irq before we move it or the acknowledge will
* not propagate properly.
*/
- ack_APIC_irq();
+ apic_eoi();
/*
* Tail end of clearing remote IRR bit (either by delivering the EOI
* message via io-apic EOI register write or simulating it using
- * mask+edge followed by unnask+level logic) manually when the
+ * mask+edge followed by unmask+level logic) manually when the
* level triggered interrupt is seen as the edge triggered interrupt
* at the cpu.
*/
if (!(v & (1 << (i & 0x1f)))) {
atomic_inc(&irq_mis_count);
-
- eoi_ioapic_irq(desc);
+ eoi_ioapic_pin(cfg->vector, irq_data->chip_data);
}
- /* Now we can move and renable the irq */
- if (unlikely(do_unmask_irq)) {
- /* Only migrate the irq if the ack has been received.
- *
- * On rare occasions the broadcast level triggered ack gets
- * delayed going to ioapics, and if we reprogram the
- * vector while Remote IRR is still set the irq will never
- * fire again.
- *
- * To prevent this scenario we read the Remote IRR bit
- * of the ioapic. This has two effects.
- * - On any sane system the read of the ioapic will
- * flush writes (and acks) going to the ioapic from
- * this cpu.
- * - We get to see if the ACK has actually been delivered.
- *
- * Based on failed experiments of reprogramming the
- * ioapic entry from outside of irq context starting
- * with masking the ioapic entry and then polling until
- * Remote IRR was clear before reprogramming the
- * ioapic I don't trust the Remote IRR bit to be
- * completey accurate.
- *
- * However there appears to be no other way to plug
- * this race, so if the Remote IRR bit is not
- * accurate and is causing problems then it is a hardware bug
- * and you can go talk to the chipset vendor about it.
- */
- cfg = desc->chip_data;
- if (!io_apic_level_ack_pending(cfg))
- move_masked_irq(irq);
- unmask_IO_APIC_irq_desc(desc);
- }
+ ioapic_finish_move(irq_data, moveit);
+}
+
+static void ioapic_ir_ack_level(struct irq_data *irq_data)
+{
+ struct mp_chip_data *data = irq_data->chip_data;
+
+ /*
+ * Intr-remapping uses pin number as the virtual vector
+ * in the RTE. Actual vector is programmed in
+ * intr-remapping table entry. Hence for the io-apic
+ * EOI we use the pin number.
+ */
+ apic_ack_irq(irq_data);
+ eoi_ioapic_pin(data->entry.vector, data);
}
-#ifdef CONFIG_INTR_REMAP
-static void ir_ack_apic_edge(unsigned int irq)
+/*
+ * The I/OAPIC is just a device for generating MSI messages from legacy
+ * interrupt pins. Various fields of the RTE translate into bits of the
+ * resulting MSI which had a historical meaning.
+ *
+ * With interrupt remapping, many of those bits have different meanings
+ * in the underlying MSI, but the way that the I/OAPIC transforms them
+ * from its RTE to the MSI message is the same. This function allows
+ * the parent IRQ domain to compose the MSI message, then takes the
+ * relevant bits to put them in the appropriate places in the RTE in
+ * order to generate that message when the IRQ happens.
+ *
+ * The setup here relies on a preconfigured route entry (is_level,
+ * active_low, masked) because the parent domain is merely composing the
+ * generic message routing information which is used for the MSI.
+ */
+static void ioapic_setup_msg_from_msi(struct irq_data *irq_data,
+ struct IO_APIC_route_entry *entry)
{
- ack_APIC_irq();
+ struct msi_msg msg;
+
+ /* Let the parent domain compose the MSI message */
+ irq_chip_compose_msi_msg(irq_data, &msg);
+
+ /*
+ * - Real vector
+ * - DMAR/IR: 8bit subhandle (ioapic.pin)
+ * - AMD/IR: 8bit IRTE index
+ */
+ entry->vector = msg.arch_data.vector;
+ /* Delivery mode (for DMAR/IR all 0) */
+ entry->delivery_mode = msg.arch_data.delivery_mode;
+ /* Destination mode or DMAR/IR index bit 15 */
+ entry->dest_mode_logical = msg.arch_addr_lo.dest_mode_logical;
+ /* DMAR/IR: 1, 0 for all other modes */
+ entry->ir_format = msg.arch_addr_lo.dmar_format;
+ /*
+ * - DMAR/IR: index bit 0-14.
+ *
+ * - Virt: If the host supports x2apic without a virtualized IR
+ * unit then bit 0-6 of dmar_index_0_14 are providing bit
+ * 8-14 of the destination id.
+ *
+ * All other modes have bit 0-6 of dmar_index_0_14 cleared and the
+ * topmost 8 bits are destination id bit 0-7 (entry::destid_0_7).
+ */
+ entry->ir_index_0_14 = msg.arch_addr_lo.dmar_index_0_14;
}
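
For the non-remapped case, the fields copied in ioapic_setup_msg_from_msi() have fixed homes in the MSI address: 0xFEE in the top twelve bits, the destination ID in bits 19:12, and the destination mode at bit 2. A small composer/decoder along those lines (mine, mirroring the documented x86 MSI address layout rather than the kernel's struct x86_msi_addr_lo):

#include <stdint.h>
#include <stdio.h>

#define MSI_BASE	0xFEE00000u

static uint32_t msi_addr_lo(uint8_t destid, int logical)
{
	return MSI_BASE |
	       ((uint32_t)destid << 12) |	/* destination ID, bits 19:12 */
	       ((logical ? 1u : 0u) << 2);	/* destination mode, bit 2 */
}

int main(void)
{
	uint32_t addr = msi_addr_lo(5, 0);	/* physical dest, APIC ID 5 */

	printf("address_lo = %#010x\n", (unsigned)addr);	/* 0xfee05000 */
	printf("destid     = %u\n", (unsigned)((addr >> 12) & 0xff));	/* 5 */
	return 0;
}
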
-static void ir_ack_apic_level(unsigned int irq)
+static void ioapic_configure_entry(struct irq_data *irqd)
{
- struct irq_desc *desc = irq_to_desc(irq);
+ struct mp_chip_data *mpd = irqd->chip_data;
+ struct irq_pin_list *entry;
+
+ ioapic_setup_msg_from_msi(irqd, &mpd->entry);
- ack_APIC_irq();
- eoi_ioapic_irq(desc);
+ for_each_irq_pin(entry, mpd->irq_2_pin)
+ __ioapic_write_entry(entry->apic, entry->pin, mpd->entry);
+}
+
+static int ioapic_set_affinity(struct irq_data *irq_data, const struct cpumask *mask, bool force)
+{
+ struct irq_data *parent = irq_data->parent_data;
+ int ret;
+
+ ret = parent->chip->irq_set_affinity(parent, mask, force);
+
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ if (ret >= 0 && ret != IRQ_SET_MASK_OK_DONE)
+ ioapic_configure_entry(irq_data);
+
+ return ret;
+}
+
+/*
+ * Interrupt shutdown masks the ioapic pin, but the interrupt might already
+ * be in flight, but not yet serviced by the target CPU. That means
+ * __synchronize_hardirq() would return and claim that everything is calmed
+ * down. So free_irq() would proceed and deactivate the interrupt and free
+ * resources.
+ *
+ * Once the target CPU comes around to service it, it will find a cleared
+ * vector and complain. While the spurious interrupt is harmless, the full
+ * release of resources might prevent the interrupt from being acknowledged
+ * which keeps the hardware in a weird state.
+ *
+ * Verify that the corresponding Remote-IRR bits are clear.
+ */
+static int ioapic_irq_get_chip_state(struct irq_data *irqd, enum irqchip_irq_state which,
+ bool *state)
+{
+ struct mp_chip_data *mcd = irqd->chip_data;
+ struct IO_APIC_route_entry rentry;
+ struct irq_pin_list *p;
+
+ if (which != IRQCHIP_STATE_ACTIVE)
+ return -EINVAL;
+
+ *state = false;
+
+ guard(raw_spinlock)(&ioapic_lock);
+ for_each_irq_pin(p, mcd->irq_2_pin) {
+ rentry = __ioapic_read_entry(p->apic, p->pin);
+ /*
+ * The remote IRR is only valid in level trigger mode. Its
+ * meaning is undefined for edge triggered interrupts and
+ * irrelevant because the IO-APIC treats them as fire and
+ * forget.
+ */
+ if (rentry.irr && rentry.is_level) {
+ *state = true;
+ break;
+ }
+ }
+ return 0;
}
-#endif /* CONFIG_INTR_REMAP */
static struct irq_chip ioapic_chip __read_mostly = {
- .name = "IO-APIC",
- .startup = startup_ioapic_irq,
- .mask = mask_IO_APIC_irq,
- .unmask = unmask_IO_APIC_irq,
- .ack = ack_apic_edge,
- .eoi = ack_apic_level,
-#ifdef CONFIG_SMP
- .set_affinity = set_ioapic_affinity_irq,
-#endif
- .retrigger = ioapic_retrigger_irq,
+ .name = "IO-APIC",
+ .irq_startup = startup_ioapic_irq,
+ .irq_mask = mask_ioapic_irq,
+ .irq_unmask = unmask_ioapic_irq,
+ .irq_ack = irq_chip_ack_parent,
+ .irq_eoi = ioapic_ack_level,
+ .irq_set_affinity = ioapic_set_affinity,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_get_irqchip_state = ioapic_irq_get_chip_state,
+ .flags = IRQCHIP_SKIP_SET_WAKE | IRQCHIP_MOVE_DEFERRED |
+ IRQCHIP_AFFINITY_PRE_STARTUP,
};
-static struct irq_chip ir_ioapic_chip __read_mostly = {
- .name = "IR-IO-APIC",
- .startup = startup_ioapic_irq,
- .mask = mask_IO_APIC_irq,
- .unmask = unmask_IO_APIC_irq,
-#ifdef CONFIG_INTR_REMAP
- .ack = ir_ack_apic_edge,
- .eoi = ir_ack_apic_level,
-#ifdef CONFIG_SMP
- .set_affinity = set_ir_ioapic_affinity_irq,
-#endif
-#endif
- .retrigger = ioapic_retrigger_irq,
+static struct irq_chip ioapic_ir_chip __read_mostly = {
+ .name = "IR-IO-APIC",
+ .irq_startup = startup_ioapic_irq,
+ .irq_mask = mask_ioapic_irq,
+ .irq_unmask = unmask_ioapic_irq,
+ .irq_ack = irq_chip_ack_parent,
+ .irq_eoi = ioapic_ir_ack_level,
+ .irq_set_affinity = ioapic_set_affinity,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_get_irqchip_state = ioapic_irq_get_chip_state,
+ .flags = IRQCHIP_SKIP_SET_WAKE |
+ IRQCHIP_AFFINITY_PRE_STARTUP,
};
static inline void init_IO_APIC_traps(void)
{
- int irq;
- struct irq_desc *desc;
struct irq_cfg *cfg;
+ unsigned int irq;
- /*
- * NOTE! The local APIC isn't very good at handling
- * multiple interrupts at the same interrupt level.
- * As the interrupt level is determined by taking the
- * vector number and shifting that right by 4, we
- * want to spread these out a bit so that they don't
- * all fall in the same interrupt level.
- *
- * Also, we've got to be careful not to trash gate
- * 0x80, because int 0x80 is hm, kind of importantish. ;)
- */
- for_each_irq_desc(irq, desc) {
- cfg = desc->chip_data;
+ for_each_active_irq(irq) {
+ cfg = irq_cfg(irq);
if (IO_APIC_IRQ(irq) && cfg && !cfg->vector) {
/*
- * Hmm.. We don't have an entry for this,
- * so default to an old-fashioned 8259
- * interrupt if we can..
+ * Hmm.. We don't have an entry for this, so
+ * default to an old-fashioned 8259 interrupt if we
+ * can. Otherwise set the dummy interrupt chip.
*/
- if (irq < legacy_pic->nr_legacy_irqs)
+ if (irq < nr_legacy_irqs())
legacy_pic->make_irq(irq);
else
- /* Strange. Oh, well.. */
- desc->chip = &no_irq_chip;
+ irq_set_chip(irq, &no_irq_chip);
}
}
}
@@ -2789,58 +1903,36 @@ static inline void init_IO_APIC_traps(void)
/*
* The local APIC irq-chip implementation:
*/
-
-static void mask_lapic_irq(unsigned int irq)
+static void mask_lapic_irq(struct irq_data *data)
{
- unsigned long v;
+ unsigned long v = apic_read(APIC_LVT0);
- v = apic_read(APIC_LVT0);
apic_write(APIC_LVT0, v | APIC_LVT_MASKED);
}
-static void unmask_lapic_irq(unsigned int irq)
+static void unmask_lapic_irq(struct irq_data *data)
{
- unsigned long v;
+ unsigned long v = apic_read(APIC_LVT0);
- v = apic_read(APIC_LVT0);
apic_write(APIC_LVT0, v & ~APIC_LVT_MASKED);
}
-static void ack_lapic_irq(unsigned int irq)
+static void ack_lapic_irq(struct irq_data *data)
{
- ack_APIC_irq();
+ apic_eoi();
}
static struct irq_chip lapic_chip __read_mostly = {
.name = "local-APIC",
- .mask = mask_lapic_irq,
- .unmask = unmask_lapic_irq,
- .ack = ack_lapic_irq,
+ .irq_mask = mask_lapic_irq,
+ .irq_unmask = unmask_lapic_irq,
+ .irq_ack = ack_lapic_irq,
};
-static void lapic_register_intr(int irq, struct irq_desc *desc)
+static void lapic_register_intr(int irq)
{
- desc->status &= ~IRQ_LEVEL;
- set_irq_chip_and_handler_name(irq, &lapic_chip, handle_edge_irq,
- "edge");
-}
-
-static void __init setup_nmi(void)
-{
- /*
- * Dirty trick to enable the NMI watchdog ...
- * We put the 8259A master into AEOI mode and
- * unmask on all local APICs LVT0 as NMI.
- *
- * The idea to use the 8259A in AEOI mode ('8259A Virtual Wire')
- * is from Maciej W. Rozycki - so we do not have to EOI from
- * the NMI handler or the timer interrupt.
- */
- apic_printk(APIC_VERBOSE, KERN_INFO "activating NMI Watchdog ...");
-
- enable_NMI_through_LVT0();
-
- apic_printk(APIC_VERBOSE, " done.\n");
+ irq_clear_status_flags(irq, IRQ_LEVEL);
+ irq_set_chip_and_handler_name(irq, &lapic_chip, handle_edge_irq, "edge");
}
/*
@@ -2852,9 +1944,10 @@ static void __init setup_nmi(void)
*/
static inline void __init unlock_ExtINT_logic(void)
{
- int apic, pin, i;
- struct IO_APIC_route_entry entry0, entry1;
unsigned char save_control, save_freq_select;
+ struct IO_APIC_route_entry entry0, entry1;
+ int apic, pin, i;
+ u32 apic_id;
pin = find_isa_irq_pin(8, mp_INT);
if (pin == -1) {
@@ -2870,14 +1963,16 @@ static inline void __init unlock_ExtINT_logic(void)
entry0 = ioapic_read_entry(apic, pin);
clear_IO_APIC_pin(apic, pin);
+ apic_id = read_apic_id();
memset(&entry1, 0, sizeof(entry1));
- entry1.dest_mode = 0; /* physical delivery */
- entry1.mask = 0; /* unmask IRQ now */
- entry1.dest = hard_smp_processor_id();
- entry1.delivery_mode = dest_ExtINT;
- entry1.polarity = entry0.polarity;
- entry1.trigger = 0;
+ entry1.dest_mode_logical = true;
+ entry1.masked = false;
+ entry1.destid_0_7 = apic_id & 0xFF;
+ entry1.virt_destid_8_14 = apic_id >> 8;
+ entry1.delivery_mode = APIC_DELIVERY_MODE_EXTINT;
+ entry1.active_low = entry0.active_low;
+ entry1.is_level = false;
entry1.vector = 0;
ioapic_write_entry(apic, pin, entry1);
@@ -2911,32 +2006,66 @@ static int __init disable_timer_pin_setup(char *arg)
}
early_param("disable_timer_pin_1", disable_timer_pin_setup);
-int timer_through_8259 __initdata;
+static int __init mp_alloc_timer_irq(int ioapic, int pin)
+{
+ struct irq_domain *domain = mp_ioapic_irqdomain(ioapic);
+ int irq = -1;
+
+ if (domain) {
+ struct irq_alloc_info info;
+
+ ioapic_set_alloc_attr(&info, NUMA_NO_NODE, 0, 0);
+ info.devid = mpc_ioapic_id(ioapic);
+ info.ioapic.pin = pin;
+ guard(mutex)(&ioapic_mutex);
+ irq = alloc_isa_irq_from_domain(domain, 0, ioapic, pin, &info);
+ }
+
+ return irq;
+}
+
+static void __init replace_pin_at_irq_node(struct mp_chip_data *data, int node,
+ int oldapic, int oldpin,
+ int newapic, int newpin)
+{
+ struct irq_pin_list *entry;
+
+ for_each_irq_pin(entry, data->irq_2_pin) {
+ if (entry->apic == oldapic && entry->pin == oldpin) {
+ entry->apic = newapic;
+ entry->pin = newpin;
+ return;
+ }
+ }
+
+ /* Old apic/pin didn't exist, so just add a new one */
+ add_pin_to_irq_node(data, node, newapic, newpin);
+}
/*
* This code may look a bit paranoid, but it's supposed to cooperate with
* a wide range of boards and BIOS bugs. Fortunately only the timer IRQ
* is so screwy. Thanks to Brian Perkins for testing/hacking this beast
* fanatically on his truly buggy board.
- *
- * FIXME: really need to revamp this for all platforms.
*/
static inline void __init check_timer(void)
{
- struct irq_desc *desc = irq_to_desc(0);
- struct irq_cfg *cfg = desc->chip_data;
- int node = cpu_to_node(boot_cpu_id);
+ struct irq_data *irq_data = irq_get_irq_data(0);
+ struct mp_chip_data *data = irq_data->chip_data;
+ struct irq_cfg *cfg = irqd_cfg(irq_data);
+ int node = cpu_to_node(0);
int apic1, pin1, apic2, pin2;
- unsigned long flags;
int no_pin1 = 0;
- local_irq_save(flags);
+ if (!global_clock_event)
+ return;
+
+ local_irq_disable();
/*
* get/set the timer IRQ vector:
*/
- legacy_pic->chip->mask(0);
- assign_irq_vector(0, cfg, apic->target_cpus());
+ legacy_pic->mask(0);
/*
* As IRQ0 is to be enabled in the 8259A, the virtual
@@ -2949,24 +2078,14 @@ static inline void __init check_timer(void)
*/
apic_write(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_EXTINT);
legacy_pic->init(1);
-#ifdef CONFIG_X86_32
- {
- unsigned int ver;
-
- ver = apic_read(APIC_LVR);
- ver = GET_APIC_VERSION(ver);
- timer_ack = (nmi_watchdog == NMI_IO_APIC && !APIC_INTEGRATED(ver));
- }
-#endif
pin1 = find_isa_irq_pin(0, mp_INT);
apic1 = find_isa_irq_apic(0, mp_INT);
pin2 = ioapic_i8259.pin;
apic2 = ioapic_i8259.apic;
- apic_printk(APIC_QUIET, KERN_INFO "..TIMER: vector=0x%02X "
- "apic1=%d pin1=%d apic2=%d pin2=%d\n",
- cfg->vector, apic1, pin1, apic2, pin2);
+ pr_info("..TIMER: vector=0x%02X apic1=%d pin1=%d apic2=%d pin2=%d\n",
+ cfg->vector, apic1, pin1, apic2, pin2);
/*
* Some BIOS writers are clueless and report the ExtINTA
@@ -2976,8 +2095,7 @@ static inline void __init check_timer(void)
* 8259A.
*/
if (pin1 == -1) {
- if (intr_remapping_enabled)
- panic("BIOS bug: timer not connected to IO-APIC");
+ panic_if_irq_remap(FW_BUG "Timer not connected to IO-APIC");
pin1 = pin2;
apic1 = apic2;
no_pin1 = 1;
@@ -2987,113 +2105,90 @@ static inline void __init check_timer(void)
}
if (pin1 != -1) {
- /*
- * Ok, does IRQ0 through the IOAPIC work?
- */
+ /* Ok, does IRQ0 through the IOAPIC work? */
if (no_pin1) {
- add_pin_to_irq_node(cfg, node, apic1, pin1);
- setup_timer_IRQ0_pin(apic1, pin1, cfg->vector);
+ mp_alloc_timer_irq(apic1, pin1);
} else {
- /* for edge trigger, setup_IO_APIC_irq already
- * leave it unmasked.
+ /*
+ * For edge trigger it is already unmasked, so we only
+ * need to unmask if it is level-triggered - do we
+ * really have a level-triggered timer?
*/
- int idx;
- idx = find_irq_entry(apic1, pin1, mp_INT);
- if (idx != -1 && irq_trigger(idx))
- unmask_IO_APIC_irq_desc(desc);
+ int idx = find_irq_entry(apic1, pin1, mp_INT);
+
+ if (idx != -1 && irq_is_level(idx))
+ unmask_ioapic_irq(irq_get_irq_data(0));
}
+ irq_domain_deactivate_irq(irq_data);
+ irq_domain_activate_irq(irq_data, false);
if (timer_irq_works()) {
- if (nmi_watchdog == NMI_IO_APIC) {
- setup_nmi();
- legacy_pic->chip->unmask(0);
- }
if (disable_timer_pin_1 > 0)
clear_IO_APIC_pin(0, pin1);
goto out;
}
- if (intr_remapping_enabled)
- panic("timer doesn't work through Interrupt-remapped IO-APIC");
- local_irq_disable();
+ panic_if_irq_remap("timer doesn't work through Interrupt-remapped IO-APIC");
clear_IO_APIC_pin(apic1, pin1);
if (!no_pin1)
- apic_printk(APIC_QUIET, KERN_ERR "..MP-BIOS bug: "
- "8254 timer not connected to IO-APIC\n");
+ pr_err("..MP-BIOS bug: 8254 timer not connected to IO-APIC\n");
- apic_printk(APIC_QUIET, KERN_INFO "...trying to set up timer "
- "(IRQ0) through the 8259A ...\n");
- apic_printk(APIC_QUIET, KERN_INFO
- "..... (found apic %d pin %d) ...\n", apic2, pin2);
+ pr_info("...trying to set up timer (IRQ0) through the 8259A ...\n");
+ pr_info("..... (found apic %d pin %d) ...\n", apic2, pin2);
/*
* legacy devices should be connected to IO APIC #0
*/
- replace_pin_at_irq_node(cfg, node, apic1, pin1, apic2, pin2);
- setup_timer_IRQ0_pin(apic2, pin2, cfg->vector);
- legacy_pic->chip->unmask(0);
+ replace_pin_at_irq_node(data, node, apic1, pin1, apic2, pin2);
+ irq_domain_deactivate_irq(irq_data);
+ irq_domain_activate_irq(irq_data, false);
+ legacy_pic->unmask(0);
if (timer_irq_works()) {
- apic_printk(APIC_QUIET, KERN_INFO "....... works.\n");
- timer_through_8259 = 1;
- if (nmi_watchdog == NMI_IO_APIC) {
- legacy_pic->chip->mask(0);
- setup_nmi();
- legacy_pic->chip->unmask(0);
- }
+ pr_info("....... works.\n");
goto out;
}
/*
* Cleanup, just in case ...
*/
- local_irq_disable();
- legacy_pic->chip->mask(0);
+ legacy_pic->mask(0);
clear_IO_APIC_pin(apic2, pin2);
- apic_printk(APIC_QUIET, KERN_INFO "....... failed.\n");
+ pr_info("....... failed.\n");
}
- if (nmi_watchdog == NMI_IO_APIC) {
- apic_printk(APIC_QUIET, KERN_WARNING "timer doesn't work "
- "through the IO-APIC - disabling NMI Watchdog!\n");
- nmi_watchdog = NMI_NONE;
- }
-#ifdef CONFIG_X86_32
- timer_ack = 0;
-#endif
+ pr_info("...trying to set up timer as Virtual Wire IRQ...\n");
- apic_printk(APIC_QUIET, KERN_INFO
- "...trying to set up timer as Virtual Wire IRQ...\n");
-
- lapic_register_intr(0, desc);
+ lapic_register_intr(0);
apic_write(APIC_LVT0, APIC_DM_FIXED | cfg->vector); /* Fixed mode */
- legacy_pic->chip->unmask(0);
+ legacy_pic->unmask(0);
if (timer_irq_works()) {
- apic_printk(APIC_QUIET, KERN_INFO "..... works.\n");
+ pr_info("..... works.\n");
goto out;
}
- local_irq_disable();
- legacy_pic->chip->mask(0);
+ legacy_pic->mask(0);
apic_write(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_FIXED | cfg->vector);
- apic_printk(APIC_QUIET, KERN_INFO "..... failed.\n");
+ pr_info("..... failed.\n");
- apic_printk(APIC_QUIET, KERN_INFO
- "...trying to set up timer as ExtINT IRQ...\n");
+ pr_info("...trying to set up timer as ExtINT IRQ...\n");
legacy_pic->init(0);
legacy_pic->make_irq(0);
apic_write(APIC_LVT0, APIC_DM_EXTINT);
+ legacy_pic->unmask(0);
unlock_ExtINT_logic();
if (timer_irq_works()) {
- apic_printk(APIC_QUIET, KERN_INFO "..... works.\n");
+ pr_info("..... works.\n");
goto out;
}
- local_irq_disable();
- apic_printk(APIC_QUIET, KERN_INFO "..... failed :(.\n");
+
+ pr_info("..... failed :\n");
+ if (apic_is_x2apic_enabled()) {
+ pr_info("Perhaps problem with the pre-enabled x2apic mode\n"
+ "Try booting with x2apic and interrupt-remapping disabled in the bios.\n");
+ }
panic("IO-APIC + timer doesn't work! Boot with apic=debug and send a "
"report. Then try booting with the 'noapic' option.\n");
out:
- local_irq_restore(flags);
+ local_irq_enable();
}
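
check_timer() reads most easily as a fixed probe ladder with timer_irq_works() as the oracle at each rung. Summarized as data (the labels are mine):

#include <stdio.h>

/* The fallback ladder check_timer() walks until timer_irq_works()
 * finally reports ticks; the 8259A steps only exist with a legacy PIC. */
static const char * const timer_probe[] = {
	"IRQ0 through IO-APIC pin1 (route from the MP table)",
	"IRQ0 through IO-APIC pin2 (the i8259 ExtINT pin)",
	"timer as local APIC Virtual Wire IRQ (LVT0, fixed mode)",
	"timer as ExtINT IRQ through the 8259A",
	"panic: IO-APIC + timer doesn't work",
};

int main(void)
{
	for (unsigned i = 0; i < sizeof(timer_probe) / sizeof(timer_probe[0]); i++)
		printf("%u. %s\n", i + 1, timer_probe[i]);
	return 0;
}
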
/*
@@ -3115,1049 +2210,319 @@ out:
*/
#define PIC_IRQS (1UL << PIC_CASCADE_IR)
-void __init setup_IO_APIC(void)
-{
-
- /*
- * calling enable_IO_APIC() is moved to setup_local_APIC for BP
- */
- io_apic_irqs = legacy_pic->nr_legacy_irqs ? ~PIC_IRQS : ~0UL;
-
- apic_printk(APIC_VERBOSE, "ENABLING IO-APIC IRQs\n");
- /*
- * Set up IO-APIC IRQ routing.
- */
- x86_init.mpparse.setup_ioapic_ids();
-
- sync_Arb_IDs();
- setup_IO_APIC_irqs();
- init_IO_APIC_traps();
- if (legacy_pic->nr_legacy_irqs)
- check_timer();
-}
-
-/*
- * Called after all the initialization is done. If we didnt find any
- * APIC bugs then we can allow the modify fast path
- */
-
-static int __init io_apic_bug_finalize(void)
+static int mp_irqdomain_create(int ioapic)
{
- if (sis_apic_bug == -1)
- sis_apic_bug = 0;
- return 0;
-}
-
-late_initcall(io_apic_bug_finalize);
-
-struct sysfs_ioapic_data {
- struct sys_device dev;
- struct IO_APIC_route_entry entry[0];
-};
-static struct sysfs_ioapic_data * mp_ioapic_data[MAX_IO_APICS];
-
-static int ioapic_suspend(struct sys_device *dev, pm_message_t state)
-{
- struct IO_APIC_route_entry *entry;
- struct sysfs_ioapic_data *data;
- int i;
-
- data = container_of(dev, struct sysfs_ioapic_data, dev);
- entry = data->entry;
- for (i = 0; i < nr_ioapic_registers[dev->id]; i ++, entry ++ )
- *entry = ioapic_read_entry(dev->id, i);
-
- return 0;
-}
-
-static int ioapic_resume(struct sys_device *dev)
-{
- struct IO_APIC_route_entry *entry;
- struct sysfs_ioapic_data *data;
- unsigned long flags;
- union IO_APIC_reg_00 reg_00;
- int i;
-
- data = container_of(dev, struct sysfs_ioapic_data, dev);
- entry = data->entry;
+ struct mp_ioapic_gsi *gsi_cfg = mp_ioapic_gsi_routing(ioapic);
+ int hwirqs = mp_ioapic_pin_count(ioapic);
+ struct ioapic *ip = &ioapics[ioapic];
+ struct ioapic_domain_cfg *cfg = &ip->irqdomain_cfg;
+ struct irq_domain *parent;
+ struct fwnode_handle *fn;
+ struct irq_fwspec fwspec;
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- reg_00.raw = io_apic_read(dev->id, 0);
- if (reg_00.bits.ID != mp_ioapics[dev->id].apicid) {
- reg_00.bits.ID = mp_ioapics[dev->id].apicid;
- io_apic_write(dev->id, 0, reg_00.raw);
- }
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
- for (i = 0; i < nr_ioapic_registers[dev->id]; i++)
- ioapic_write_entry(dev->id, i, entry[i]);
-
- return 0;
-}
-
-static struct sysdev_class ioapic_sysdev_class = {
- .name = "ioapic",
- .suspend = ioapic_suspend,
- .resume = ioapic_resume,
-};
+ if (cfg->type == IOAPIC_DOMAIN_INVALID)
+ return 0;
-static int __init ioapic_init_sysfs(void)
-{
- struct sys_device * dev;
- int i, size, error;
-
- error = sysdev_class_register(&ioapic_sysdev_class);
- if (error)
- return error;
-
- for (i = 0; i < nr_ioapics; i++ ) {
- size = sizeof(struct sys_device) + nr_ioapic_registers[i]
- * sizeof(struct IO_APIC_route_entry);
- mp_ioapic_data[i] = kzalloc(size, GFP_KERNEL);
- if (!mp_ioapic_data[i]) {
- printk(KERN_ERR "Can't suspend/resume IOAPIC %d\n", i);
- continue;
- }
- dev = &mp_ioapic_data[i]->dev;
- dev->id = i;
- dev->cls = &ioapic_sysdev_class;
- error = sysdev_register(dev);
- if (error) {
- kfree(mp_ioapic_data[i]);
- mp_ioapic_data[i] = NULL;
- printk(KERN_ERR "Can't suspend/resume IOAPIC %d\n", i);
- continue;
- }
+ /* Handle device tree enumerated APICs properly */
+ if (cfg->dev) {
+ fn = of_fwnode_handle(cfg->dev);
+ } else {
+ fn = irq_domain_alloc_named_id_fwnode("IO-APIC", mpc_ioapic_id(ioapic));
+ if (!fn)
+ return -ENOMEM;
}
- return 0;
-}
-
-device_initcall(ioapic_init_sysfs);
-
-/*
- * Dynamic irq allocate and deallocation
- */
-unsigned int create_irq_nr(unsigned int irq_want, int node)
-{
- /* Allocate an unused irq */
- unsigned int irq;
- unsigned int new;
- unsigned long flags;
- struct irq_cfg *cfg_new = NULL;
- struct irq_desc *desc_new = NULL;
-
- irq = 0;
- if (irq_want < nr_irqs_gsi)
- irq_want = nr_irqs_gsi;
-
- raw_spin_lock_irqsave(&vector_lock, flags);
- for (new = irq_want; new < nr_irqs; new++) {
- desc_new = irq_to_desc_alloc_node(new, node);
- if (!desc_new) {
- printk(KERN_INFO "can not get irq_desc for %d\n", new);
- continue;
- }
- cfg_new = desc_new->chip_data;
-
- if (cfg_new->vector != 0)
- continue;
-
- desc_new = move_irq_desc(desc_new, node);
- cfg_new = desc_new->chip_data;
+ fwspec.fwnode = fn;
+ fwspec.param_count = 1;
+ fwspec.param[0] = mpc_ioapic_id(ioapic);
- if (__assign_irq_vector(new, cfg_new, apic->target_cpus()) == 0)
- irq = new;
- break;
+ parent = irq_find_matching_fwspec(&fwspec, DOMAIN_BUS_GENERIC_MSI);
+ if (!parent) {
+ if (!cfg->dev)
+ irq_domain_free_fwnode(fn);
+ return -ENODEV;
}
- raw_spin_unlock_irqrestore(&vector_lock, flags);
-
- if (irq > 0)
- dynamic_irq_init_keep_chip_data(irq);
-
- return irq;
-}
-
-int create_irq(void)
-{
- int node = cpu_to_node(boot_cpu_id);
- unsigned int irq_want;
- int irq;
-
- irq_want = nr_irqs_gsi;
- irq = create_irq_nr(irq_want, node);
-
- if (irq == 0)
- irq = -1;
-
- return irq;
-}
-
-void destroy_irq(unsigned int irq)
-{
- unsigned long flags;
-
- dynamic_irq_cleanup_keep_chip_data(irq);
-
- free_irte(irq);
- raw_spin_lock_irqsave(&vector_lock, flags);
- __clear_irq_vector(irq, get_irq_chip_data(irq));
- raw_spin_unlock_irqrestore(&vector_lock, flags);
-}
-
-/*
- * MSI message composition
- */
-#ifdef CONFIG_PCI_MSI
-static int msi_compose_msg(struct pci_dev *pdev, unsigned int irq,
- struct msi_msg *msg, u8 hpet_id)
-{
- struct irq_cfg *cfg;
- int err;
- unsigned dest;
-
- if (disable_apic)
- return -ENXIO;
-
- cfg = irq_cfg(irq);
- err = assign_irq_vector(irq, cfg, apic->target_cpus());
- if (err)
- return err;
-
- dest = apic->cpu_mask_to_apicid_and(cfg->domain, apic->target_cpus());
-
- if (irq_remapped(irq)) {
- struct irte irte;
- int ir_index;
- u16 sub_handle;
-
- ir_index = map_irq_to_irte_handle(irq, &sub_handle);
- BUG_ON(ir_index == -1);
- memset (&irte, 0, sizeof(irte));
-
- irte.present = 1;
- irte.dst_mode = apic->irq_dest_mode;
- irte.trigger_mode = 0; /* edge */
- irte.dlvry_mode = apic->irq_delivery_mode;
- irte.vector = cfg->vector;
- irte.dest_id = IRTE_DEST(dest);
-
- /* Set source-id of interrupt request */
- if (pdev)
- set_msi_sid(&irte, pdev);
- else
- set_hpet_sid(&irte, hpet_id);
-
- modify_irte(irq, &irte);
-
- msg->address_hi = MSI_ADDR_BASE_HI;
- msg->data = sub_handle;
- msg->address_lo = MSI_ADDR_BASE_LO | MSI_ADDR_IR_EXT_INT |
- MSI_ADDR_IR_SHV |
- MSI_ADDR_IR_INDEX1(ir_index) |
- MSI_ADDR_IR_INDEX2(ir_index);
- } else {
- if (x2apic_enabled())
- msg->address_hi = MSI_ADDR_BASE_HI |
- MSI_ADDR_EXT_DEST_ID(dest);
- else
- msg->address_hi = MSI_ADDR_BASE_HI;
-
- msg->address_lo =
- MSI_ADDR_BASE_LO |
- ((apic->irq_dest_mode == 0) ?
- MSI_ADDR_DEST_MODE_PHYSICAL:
- MSI_ADDR_DEST_MODE_LOGICAL) |
- ((apic->irq_delivery_mode != dest_LowestPrio) ?
- MSI_ADDR_REDIRECTION_CPU:
- MSI_ADDR_REDIRECTION_LOWPRI) |
- MSI_ADDR_DEST_ID(dest);
-
- msg->data =
- MSI_DATA_TRIGGER_EDGE |
- MSI_DATA_LEVEL_ASSERT |
- ((apic->irq_delivery_mode != dest_LowestPrio) ?
- MSI_DATA_DELIVERY_FIXED:
- MSI_DATA_DELIVERY_LOWPRI) |
- MSI_DATA_VECTOR(cfg->vector);
+ ip->irqdomain = irq_domain_create_hierarchy(parent, 0, hwirqs, fn, cfg->ops,
+ (void *)(long)ioapic);
+ if (!ip->irqdomain) {
+ /* Release fw handle if it was allocated above */
+ if (!cfg->dev)
+ irq_domain_free_fwnode(fn);
+ return -ENOMEM;
}
- return err;
-}
-
-#ifdef CONFIG_SMP
-static int set_msi_irq_affinity(unsigned int irq, const struct cpumask *mask)
-{
- struct irq_desc *desc = irq_to_desc(irq);
- struct irq_cfg *cfg;
- struct msi_msg msg;
- unsigned int dest;
-
- if (set_desc_affinity(desc, mask, &dest))
- return -1;
-
- cfg = desc->chip_data;
-
- read_msi_msg_desc(desc, &msg);
- msg.data &= ~MSI_DATA_VECTOR_MASK;
- msg.data |= MSI_DATA_VECTOR(cfg->vector);
- msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
- msg.address_lo |= MSI_ADDR_DEST_ID(dest);
-
- write_msi_msg_desc(desc, &msg);
+ if (cfg->type == IOAPIC_DOMAIN_LEGACY || cfg->type == IOAPIC_DOMAIN_STRICT)
+ ioapic_dynirq_base = max(ioapic_dynirq_base, gsi_cfg->gsi_end + 1);
return 0;
}
-#ifdef CONFIG_INTR_REMAP
-/*
- * Migrate the MSI irq to another cpumask. This migration is
- * done in the process context using interrupt-remapping hardware.
- */
-static int
-ir_set_msi_irq_affinity(unsigned int irq, const struct cpumask *mask)
-{
- struct irq_desc *desc = irq_to_desc(irq);
- struct irq_cfg *cfg = desc->chip_data;
- unsigned int dest;
- struct irte irte;
-
- if (get_irte(irq, &irte))
- return -1;
-
- if (set_desc_affinity(desc, mask, &dest))
- return -1;
-
- irte.vector = cfg->vector;
- irte.dest_id = IRTE_DEST(dest);
-
- /*
- * atomically update the IRTE with the new destination and vector.
- */
- modify_irte(irq, &irte);
-
- /*
- * After this point, all the interrupts will start arriving
- * at the new destination. So, time to cleanup the previous
- * vector allocation.
- */
- if (cfg->move_in_progress)
- send_cleanup_vector(cfg);
-
- return 0;
-}
-
-#endif
-#endif /* CONFIG_SMP */
-
-/*
- * IRQ Chip for MSI PCI/PCI-X/PCI-Express Devices,
- * which implement the MSI or MSI-X Capability Structure.
- */
-static struct irq_chip msi_chip = {
- .name = "PCI-MSI",
- .unmask = unmask_msi_irq,
- .mask = mask_msi_irq,
- .ack = ack_apic_edge,
-#ifdef CONFIG_SMP
- .set_affinity = set_msi_irq_affinity,
-#endif
- .retrigger = ioapic_retrigger_irq,
-};
-static struct irq_chip msi_ir_chip = {
- .name = "IR-PCI-MSI",
- .unmask = unmask_msi_irq,
- .mask = mask_msi_irq,
-#ifdef CONFIG_INTR_REMAP
- .ack = ir_ack_apic_edge,
-#ifdef CONFIG_SMP
- .set_affinity = ir_set_msi_irq_affinity,
-#endif
-#endif
- .retrigger = ioapic_retrigger_irq,
-};
-
-/*
- * Map the PCI dev to the corresponding remapping hardware unit
- * and allocate 'nvec' consecutive interrupt-remapping table entries
- * in it.
- */
-static int msi_alloc_irte(struct pci_dev *dev, int irq, int nvec)
+static void ioapic_destroy_irqdomain(int idx)
{
- struct intel_iommu *iommu;
- int index;
-
- iommu = map_dev_to_ir(dev);
- if (!iommu) {
- printk(KERN_ERR
- "Unable to map PCI %s to iommu\n", pci_name(dev));
- return -ENOENT;
- }
+ struct ioapic_domain_cfg *cfg = &ioapics[idx].irqdomain_cfg;
+ struct fwnode_handle *fn = ioapics[idx].irqdomain->fwnode;
- index = alloc_irte(iommu, irq, nvec);
- if (index < 0) {
- printk(KERN_ERR
- "Unable to allocate %d IRTE for PCI %s\n", nvec,
- pci_name(dev));
- return -ENOSPC;
+ if (ioapics[idx].irqdomain) {
+ irq_domain_remove(ioapics[idx].irqdomain);
+ if (!cfg->dev)
+ irq_domain_free_fwnode(fn);
+ ioapics[idx].irqdomain = NULL;
}
- return index;
}
-static int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc, int irq)
+void __init setup_IO_APIC(void)
{
- int ret;
- struct msi_msg msg;
-
- ret = msi_compose_msg(dev, irq, &msg, -1);
- if (ret < 0)
- return ret;
+ int ioapic;
- set_irq_msi(irq, msidesc);
- write_msi_msg(irq, &msg);
-
- if (irq_remapped(irq)) {
- struct irq_desc *desc = irq_to_desc(irq);
- /*
- * irq migration in process context
- */
- desc->status |= IRQ_MOVE_PCNTXT;
- set_irq_chip_and_handler_name(irq, &msi_ir_chip, handle_edge_irq, "edge");
- } else
- set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq, "edge");
-
- dev_printk(KERN_DEBUG, &dev->dev, "irq %d for MSI/MSI-X\n", irq);
-
- return 0;
-}
+ if (ioapic_is_disabled || !nr_ioapics)
+ return;
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
-{
- unsigned int irq;
- int ret, sub_handle;
- struct msi_desc *msidesc;
- unsigned int irq_want;
- struct intel_iommu *iommu = NULL;
- int index = 0;
- int node;
-
- /* x86 doesn't support multiple MSI yet */
- if (type == PCI_CAP_ID_MSI && nvec > 1)
- return 1;
+ io_apic_irqs = nr_legacy_irqs() ? ~PIC_IRQS : ~0UL;
- node = dev_to_node(&dev->dev);
- irq_want = nr_irqs_gsi;
- sub_handle = 0;
- list_for_each_entry(msidesc, &dev->msi_list, list) {
- irq = create_irq_nr(irq_want, node);
- if (irq == 0)
- return -1;
- irq_want = irq + 1;
- if (!intr_remapping_enabled)
- goto no_ir;
+ apic_pr_verbose("ENABLING IO-APIC IRQs\n");
+ for_each_ioapic(ioapic)
+ BUG_ON(mp_irqdomain_create(ioapic));
- if (!sub_handle) {
- /*
- * allocate the consecutive block of IRTE's
- * for 'nvec'
- */
- index = msi_alloc_irte(dev, irq, nvec);
- if (index < 0) {
- ret = index;
- goto error;
- }
- } else {
- iommu = map_dev_to_ir(dev);
- if (!iommu) {
- ret = -ENOENT;
- goto error;
- }
- /*
- * setup the mapping between the irq and the IRTE
- * base index, the sub_handle pointing to the
- * appropriate interrupt remap table entry.
- */
- set_irte_irq(irq, iommu, index, sub_handle);
- }
-no_ir:
- ret = setup_msi_irq(dev, msidesc, irq);
- if (ret < 0)
- goto error;
- sub_handle++;
- }
- return 0;
+ /* Set up IO-APIC IRQ routing. */
+ x86_init.mpparse.setup_ioapic_ids();
-error:
- destroy_irq(irq);
- return ret;
-}
+ sync_Arb_IDs();
+ setup_IO_APIC_irqs();
+ init_IO_APIC_traps();
+ if (nr_legacy_irqs())
+ check_timer();
-void arch_teardown_msi_irq(unsigned int irq)
-{
- destroy_irq(irq);
+ ioapic_initialized = 1;
}
-#if defined (CONFIG_DMAR) || defined (CONFIG_INTR_REMAP)
-#ifdef CONFIG_SMP
-static int dmar_msi_set_affinity(unsigned int irq, const struct cpumask *mask)
+static void resume_ioapic_id(int ioapic_idx)
{
- struct irq_desc *desc = irq_to_desc(irq);
- struct irq_cfg *cfg;
- struct msi_msg msg;
- unsigned int dest;
-
- if (set_desc_affinity(desc, mask, &dest))
- return -1;
-
- cfg = desc->chip_data;
-
- dmar_msi_read(irq, &msg);
-
- msg.data &= ~MSI_DATA_VECTOR_MASK;
- msg.data |= MSI_DATA_VECTOR(cfg->vector);
- msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
- msg.address_lo |= MSI_ADDR_DEST_ID(dest);
-
- dmar_msi_write(irq, &msg);
+ union IO_APIC_reg_00 reg_00;
- return 0;
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ reg_00.raw = io_apic_read(ioapic_idx, 0);
+ if (reg_00.bits.ID != mpc_ioapic_id(ioapic_idx)) {
+ reg_00.bits.ID = mpc_ioapic_id(ioapic_idx);
+ io_apic_write(ioapic_idx, 0, reg_00.raw);
+ }
}
-#endif /* CONFIG_SMP */
-
-static struct irq_chip dmar_msi_type = {
- .name = "DMAR_MSI",
- .unmask = dmar_msi_unmask,
- .mask = dmar_msi_mask,
- .ack = ack_apic_edge,
-#ifdef CONFIG_SMP
- .set_affinity = dmar_msi_set_affinity,
-#endif
- .retrigger = ioapic_retrigger_irq,
-};
-
-int arch_setup_dmar_msi(unsigned int irq)
+static int ioapic_suspend(void)
{
- int ret;
- struct msi_msg msg;
-
- ret = msi_compose_msg(NULL, irq, &msg, -1);
- if (ret < 0)
- return ret;
- dmar_msi_write(irq, &msg);
- set_irq_chip_and_handler_name(irq, &dmar_msi_type, handle_edge_irq,
- "edge");
- return 0;
+ return save_ioapic_entries();
}
-#endif
-
-#ifdef CONFIG_HPET_TIMER
-#ifdef CONFIG_SMP
-static int hpet_msi_set_affinity(unsigned int irq, const struct cpumask *mask)
+static void ioapic_resume(void)
{
- struct irq_desc *desc = irq_to_desc(irq);
- struct irq_cfg *cfg;
- struct msi_msg msg;
- unsigned int dest;
-
- if (set_desc_affinity(desc, mask, &dest))
- return -1;
+ int ioapic_idx;
- cfg = desc->chip_data;
+ for_each_ioapic_reverse(ioapic_idx)
+ resume_ioapic_id(ioapic_idx);
- hpet_msi_read(irq, &msg);
-
- msg.data &= ~MSI_DATA_VECTOR_MASK;
- msg.data |= MSI_DATA_VECTOR(cfg->vector);
- msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
- msg.address_lo |= MSI_ADDR_DEST_ID(dest);
-
- hpet_msi_write(irq, &msg);
-
- return 0;
+ restore_ioapic_entries();
}
-#endif /* CONFIG_SMP */
-
-static struct irq_chip ir_hpet_msi_type = {
- .name = "IR-HPET_MSI",
- .unmask = hpet_msi_unmask,
- .mask = hpet_msi_mask,
-#ifdef CONFIG_INTR_REMAP
- .ack = ir_ack_apic_edge,
-#ifdef CONFIG_SMP
- .set_affinity = ir_set_msi_irq_affinity,
-#endif
-#endif
- .retrigger = ioapic_retrigger_irq,
+static struct syscore_ops ioapic_syscore_ops = {
+ .suspend = ioapic_suspend,
+ .resume = ioapic_resume,
};
-static struct irq_chip hpet_msi_type = {
- .name = "HPET_MSI",
- .unmask = hpet_msi_unmask,
- .mask = hpet_msi_mask,
- .ack = ack_apic_edge,
-#ifdef CONFIG_SMP
- .set_affinity = hpet_msi_set_affinity,
-#endif
- .retrigger = ioapic_retrigger_irq,
-};
-int arch_setup_hpet_msi(unsigned int irq, unsigned int id)
-{
- int ret;
- struct msi_msg msg;
- struct irq_desc *desc = irq_to_desc(irq);
-
- if (intr_remapping_enabled) {
- struct intel_iommu *iommu = map_hpet_to_ir(id);
- int index;
-
- if (!iommu)
- return -1;
-
- index = alloc_irte(iommu, irq, 1);
- if (index < 0)
- return -1;
- }
-
- ret = msi_compose_msg(NULL, irq, &msg, id);
- if (ret < 0)
- return ret;
-
- hpet_msi_write(irq, &msg);
- desc->status |= IRQ_MOVE_PCNTXT;
- if (irq_remapped(irq))
- set_irq_chip_and_handler_name(irq, &ir_hpet_msi_type,
- handle_edge_irq, "edge");
- else
- set_irq_chip_and_handler_name(irq, &hpet_msi_type,
- handle_edge_irq, "edge");
-
- return 0;
-}
-#endif
-
-#endif /* CONFIG_PCI_MSI */
-/*
- * Hypertransport interrupt support
- */
-#ifdef CONFIG_HT_IRQ
-
-#ifdef CONFIG_SMP
-
-static void target_ht_irq(unsigned int irq, unsigned int dest, u8 vector)
-{
- struct ht_irq_msg msg;
- fetch_ht_irq_msg(irq, &msg);
-
- msg.address_lo &= ~(HT_IRQ_LOW_VECTOR_MASK | HT_IRQ_LOW_DEST_ID_MASK);
- msg.address_hi &= ~(HT_IRQ_HIGH_DEST_ID_MASK);
-
- msg.address_lo |= HT_IRQ_LOW_VECTOR(vector) | HT_IRQ_LOW_DEST_ID(dest);
- msg.address_hi |= HT_IRQ_HIGH_DEST_ID(dest);
-
- write_ht_irq_msg(irq, &msg);
-}
-
-static int set_ht_irq_affinity(unsigned int irq, const struct cpumask *mask)
+static int __init ioapic_init_ops(void)
{
- struct irq_desc *desc = irq_to_desc(irq);
- struct irq_cfg *cfg;
- unsigned int dest;
-
- if (set_desc_affinity(desc, mask, &dest))
- return -1;
-
- cfg = desc->chip_data;
-
- target_ht_irq(irq, dest, cfg->vector);
+ register_syscore_ops(&ioapic_syscore_ops);
return 0;
}
-#endif
-
-static struct irq_chip ht_irq_chip = {
- .name = "PCI-HT",
- .mask = mask_ht_irq,
- .unmask = unmask_ht_irq,
- .ack = ack_apic_edge,
-#ifdef CONFIG_SMP
- .set_affinity = set_ht_irq_affinity,
-#endif
- .retrigger = ioapic_retrigger_irq,
-};
-
-int arch_setup_ht_irq(unsigned int irq, struct pci_dev *dev)
-{
- struct irq_cfg *cfg;
- int err;
-
- if (disable_apic)
- return -ENXIO;
-
- cfg = irq_cfg(irq);
- err = assign_irq_vector(irq, cfg, apic->target_cpus());
- if (!err) {
- struct ht_irq_msg msg;
- unsigned dest;
-
- dest = apic->cpu_mask_to_apicid_and(cfg->domain,
- apic->target_cpus());
-
- msg.address_hi = HT_IRQ_HIGH_DEST_ID(dest);
-
- msg.address_lo =
- HT_IRQ_LOW_BASE |
- HT_IRQ_LOW_DEST_ID(dest) |
- HT_IRQ_LOW_VECTOR(cfg->vector) |
- ((apic->irq_dest_mode == 0) ?
- HT_IRQ_LOW_DM_PHYSICAL :
- HT_IRQ_LOW_DM_LOGICAL) |
- HT_IRQ_LOW_RQEOI_EDGE |
- ((apic->irq_delivery_mode != dest_LowestPrio) ?
- HT_IRQ_LOW_MT_FIXED :
- HT_IRQ_LOW_MT_ARBITRATED) |
- HT_IRQ_LOW_IRQ_MASKED;
-
- write_ht_irq_msg(irq, &msg);
-
- set_irq_chip_and_handler_name(irq, &ht_irq_chip,
- handle_edge_irq, "edge");
-
- dev_printk(KERN_DEBUG, &dev->dev, "irq %d for HT\n", irq);
- }
- return err;
-}
-#endif /* CONFIG_HT_IRQ */
+device_initcall(ioapic_init_ops);
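
/*
 * A minimal, self-contained sketch of the syscore pattern used above,
 * assuming only the documented <linux/syscore_ops.h> API; the
 * "example_" names are hypothetical. The ops structure itself is what
 * gets registered (there is no wrapper object), and the callbacks run
 * with a single CPU online and interrupts disabled, so they must not
 * sleep or take sleeping locks.
 */
#include <linux/syscore_ops.h>
#include <linux/init.h>

static int example_suspend(void)
{
	/* Save hardware state; runs late in the suspend sequence. */
	return 0;
}

static void example_resume(void)
{
	/* Restore the state saved by example_suspend(). */
}

static struct syscore_ops example_syscore_ops = {
	.suspend = example_suspend,
	.resume  = example_resume,
};

static int __init example_syscore_init(void)
{
	register_syscore_ops(&example_syscore_ops);
	return 0;
}
device_initcall(example_syscore_init);
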
-int __init io_apic_get_redir_entries (int ioapic)
+static int io_apic_get_redir_entries(int ioapic)
{
union IO_APIC_reg_01 reg_01;
- unsigned long flags;
- raw_spin_lock_irqsave(&ioapic_lock, flags);
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
reg_01.raw = io_apic_read(ioapic, 1);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
-
- /* The register returns the maximum index redir index
- * supported, which is one less than the total number of redir
- * entries.
- */
- return reg_01.bits.entries + 1;
-}
-
-void __init probe_nr_irqs_gsi(void)
-{
- int nr;
-
- nr = gsi_top + NR_IRQS_LEGACY;
- if (nr > nr_irqs_gsi)
- nr_irqs_gsi = nr;
-
- printk(KERN_DEBUG "nr_irqs_gsi: %d\n", nr_irqs_gsi);
-}
-#ifdef CONFIG_SPARSE_IRQ
-int __init arch_probe_nr_irqs(void)
-{
- int nr;
-
- if (nr_irqs > (NR_VECTORS * nr_cpu_ids))
- nr_irqs = NR_VECTORS * nr_cpu_ids;
-
- nr = nr_irqs_gsi + 8 * nr_cpu_ids;
-#if defined(CONFIG_PCI_MSI) || defined(CONFIG_HT_IRQ)
/*
- * for MSI and HT dyn irq
+ * The register returns the maximum redirection entry index it
+ * supports, which is one less than the total number of entries.
*/
- nr += nr_irqs_gsi * 16;
-#endif
- if (nr < nr_irqs)
- nr_irqs = nr;
-
- return 0;
+ return reg_01.bits.entries + 1;
}
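
/*
 * A standalone illustration of the reg_01 decode above: bits 16-23 of
 * I/O APIC register 1 hold the highest valid redirection-table index
 * (MAXREDIR), so the entry count is that value plus one. The sample
 * register value below mirrors the classic 24-pin 82093AA layout.
 */
#include <stdio.h>
#include <stdint.h>

static unsigned int redir_entries(uint32_t reg_01_raw)
{
	unsigned int max_index = (reg_01_raw >> 16) & 0xff;	/* MAXREDIR field */

	return max_index + 1;
}

int main(void)
{
	printf("%u entries\n", redir_entries(0x00170011));	/* prints "24 entries" */
	return 0;
}
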
-#endif
-static int __io_apic_set_pci_routing(struct device *dev, int irq,
- struct io_apic_irq_attr *irq_attr)
+unsigned int arch_dynirq_lower_bound(unsigned int from)
{
- struct irq_desc *desc;
- struct irq_cfg *cfg;
- int node;
- int ioapic, pin;
- int trigger, polarity;
-
- ioapic = irq_attr->ioapic;
- if (!IO_APIC_IRQ(irq)) {
- apic_printk(APIC_QUIET,KERN_ERR "IOAPIC[%d]: Invalid reference to IRQ 0\n",
- ioapic);
- return -EINVAL;
- }
-
- if (dev)
- node = dev_to_node(dev);
- else
- node = cpu_to_node(boot_cpu_id);
-
- desc = irq_to_desc_alloc_node(irq, node);
- if (!desc) {
- printk(KERN_INFO "can not get irq_desc %d\n", irq);
- return 0;
- }
-
- pin = irq_attr->ioapic_pin;
- trigger = irq_attr->trigger;
- polarity = irq_attr->polarity;
+ unsigned int ret;
/*
- * IRQs < 16 are already in the irq_2_pin[] map
+ * dmar_alloc_hwirq() may be called before setup_IO_APIC(), so use
+ * gsi_top if ioapic_dynirq_base hasn't been initialized yet.
*/
- if (irq >= legacy_pic->nr_legacy_irqs) {
- cfg = desc->chip_data;
- if (add_pin_to_irq_node_nopanic(cfg, node, ioapic, pin)) {
- printk(KERN_INFO "can not add pin %d for irq %d\n",
- pin, irq);
- return 0;
- }
- }
-
- setup_IO_APIC_irq(ioapic, pin, irq, desc, trigger, polarity);
-
- return 0;
-}
+ ret = ioapic_dynirq_base ? : gsi_top;
-int io_apic_set_pci_routing(struct device *dev, int irq,
- struct io_apic_irq_attr *irq_attr)
-{
- int ioapic, pin;
/*
- * Avoid pin reprogramming. PRTs typically include entries
- * with redundant pin->gsi mappings (but unique PCI devices);
- * we only program the IOAPIC on the first.
+ * For DT enabled machines ioapic_dynirq_base is irrelevant and
+ * always 0. gsi_top can be 0 if there is no IO/APIC registered.
+ * 0 is an invalid interrupt number for dynamic allocations. Return
+ * @from instead.
*/
- ioapic = irq_attr->ioapic;
- pin = irq_attr->ioapic_pin;
- if (test_bit(pin, mp_ioapic_routing[ioapic].pin_programmed)) {
- pr_debug("Pin %d-%d already programmed\n",
- mp_ioapics[ioapic].apicid, pin);
- return 0;
- }
- set_bit(pin, mp_ioapic_routing[ioapic].pin_programmed);
-
- return __io_apic_set_pci_routing(dev, irq, irq_attr);
-}
-
-u8 __init io_apic_unique_id(u8 id)
-{
-#ifdef CONFIG_X86_32
- if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) &&
- !APIC_XAPIC(apic_version[boot_cpu_physical_apicid]))
- return io_apic_get_unique_id(nr_ioapics, id);
- else
- return id;
-#else
- int i;
- DECLARE_BITMAP(used, 256);
-
- bitmap_zero(used, 256);
- for (i = 0; i < nr_ioapics; i++) {
- struct mpc_ioapic *ia = &mp_ioapics[i];
- __set_bit(ia->apicid, used);
- }
- if (!test_bit(id, used))
- return id;
- return find_first_zero_bit(used, 256);
-#endif
+ return ret ? : from;
}
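
/*
 * A small sketch of the fallback chain above. "x ? : y" is the GNU C
 * conditional with an omitted middle operand: it evaluates to x when x
 * is non-zero and to y otherwise, so arch_dynirq_lower_bound() falls
 * back from ioapic_dynirq_base to gsi_top to @from. The helper below is
 * a hypothetical standalone mirror of that logic.
 */
#include <stdio.h>

static unsigned int lower_bound(unsigned int dynirq_base,
				unsigned int gsi_top, unsigned int from)
{
	unsigned int ret = dynirq_base ? : gsi_top;

	return ret ? : from;
}

int main(void)
{
	printf("%u\n", lower_bound(0, 0, 8));	/* no IO/APIC at all: 8 */
	printf("%u\n", lower_bound(0, 24, 8));	/* before setup_IO_APIC(): 24 */
	printf("%u\n", lower_bound(56, 24, 8));	/* fully initialized: 56 */
	return 0;
}
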
#ifdef CONFIG_X86_32
-int __init io_apic_get_unique_id(int ioapic, int apic_id)
+static int io_apic_get_unique_id(int ioapic, int apic_id)
{
+ static DECLARE_BITMAP(apic_id_map, MAX_LOCAL_APIC);
+ const u32 broadcast_id = 0xF;
union IO_APIC_reg_00 reg_00;
- static physid_mask_t apic_id_map = PHYSID_MASK_NONE;
- physid_mask_t tmp;
- unsigned long flags;
int i = 0;
- /*
- * The P4 platform supports up to 256 APIC IDs on two separate APIC
- * buses (one for LAPICs, one for IOAPICs), where predecessors only
- * supports up to 16 on one shared APIC bus.
- *
- * TBD: Expand LAPIC/IOAPIC support on P4-class systems to take full
- * advantage of new APIC bus architecture.
- */
-
- if (physids_empty(apic_id_map))
- apic->ioapic_phys_id_map(&phys_cpu_present_map, &apic_id_map);
+ /* Initialize the ID map */
+ if (bitmap_empty(apic_id_map, MAX_LOCAL_APIC))
+ copy_phys_cpu_present_map(apic_id_map);
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- reg_00.raw = io_apic_read(ioapic, 0);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
+ scoped_guard (raw_spinlock_irqsave, &ioapic_lock)
+ reg_00.raw = io_apic_read(ioapic, 0);
- if (apic_id >= get_physical_broadcast()) {
- printk(KERN_WARNING "IOAPIC[%d]: Invalid apic_id %d, trying "
- "%d\n", ioapic, apic_id, reg_00.bits.ID);
+ if (apic_id >= broadcast_id) {
+ pr_warn("IOAPIC[%d]: Invalid apic_id %d, trying %d\n",
+ ioapic, apic_id, reg_00.bits.ID);
apic_id = reg_00.bits.ID;
}
- /*
- * Every APIC in a system must have a unique ID or we get lots of nice
- * 'stuck on smp_invalidate_needed IPI wait' messages.
- */
- if (apic->check_apicid_used(&apic_id_map, apic_id)) {
-
- for (i = 0; i < get_physical_broadcast(); i++) {
- if (!apic->check_apicid_used(&apic_id_map, i))
+ /* Every APIC in a system must have a unique ID */
+ if (test_bit(apic_id, apic_id_map)) {
+ for (i = 0; i < broadcast_id; i++) {
+ if (!test_bit(i, apic_id_map))
break;
}
- if (i == get_physical_broadcast())
+ if (i == broadcast_id)
panic("Max apic_id exceeded!\n");
- printk(KERN_WARNING "IOAPIC[%d]: apic_id %d already used, "
- "trying %d\n", ioapic, apic_id, i);
-
+ pr_warn("IOAPIC[%d]: apic_id %d already used, trying %d\n", ioapic, apic_id, i);
apic_id = i;
}
- apic->apicid_to_cpu_present(apic_id, &tmp);
- physids_or(apic_id_map, apic_id_map, tmp);
+ set_bit(apic_id, apic_id_map);
if (reg_00.bits.ID != apic_id) {
reg_00.bits.ID = apic_id;
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- io_apic_write(ioapic, 0, reg_00.raw);
- reg_00.raw = io_apic_read(ioapic, 0);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
+ scoped_guard (raw_spinlock_irqsave, &ioapic_lock) {
+ io_apic_write(ioapic, 0, reg_00.raw);
+ reg_00.raw = io_apic_read(ioapic, 0);
+ }
/* Sanity check */
if (reg_00.bits.ID != apic_id) {
- printk("IOAPIC[%d]: Unable to change apic_id!\n", ioapic);
+ pr_err("IOAPIC[%d]: Unable to change apic_id!\n", ioapic);
return -1;
}
}
- apic_printk(APIC_VERBOSE, KERN_INFO
- "IOAPIC[%d]: Assigned apic_id %d\n", ioapic, apic_id);
+ apic_pr_verbose("IOAPIC[%d]: Assigned apic_id %d\n", ioapic, apic_id);
return apic_id;
}
-#endif
-int __init io_apic_get_version(int ioapic)
+static u8 io_apic_unique_id(int idx, u8 id)
{
- union IO_APIC_reg_01 reg_01;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&ioapic_lock, flags);
- reg_01.raw = io_apic_read(ioapic, 1);
- raw_spin_unlock_irqrestore(&ioapic_lock, flags);
-
- return reg_01.bits.version;
+ if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && !APIC_XAPIC(boot_cpu_apic_version))
+ return io_apic_get_unique_id(idx, id);
+ return id;
}
-
-int acpi_get_override_irq(u32 gsi, int *trigger, int *polarity)
+#else
+static u8 io_apic_unique_id(int idx, u8 id)
{
- int ioapic, pin, idx;
+ union IO_APIC_reg_00 reg_00;
+ DECLARE_BITMAP(used, 256);
+ u8 new_id;
+ int i;
- if (skip_ioapic_setup)
- return -1;
+ bitmap_zero(used, 256);
+ for_each_ioapic(i)
+ __set_bit(mpc_ioapic_id(i), used);
- ioapic = mp_find_ioapic(gsi);
- if (ioapic < 0)
- return -1;
+ /* Hand out the requested id if available */
+ if (!test_bit(id, used))
+ return id;
- pin = mp_find_ioapic_pin(ioapic, gsi);
- if (pin < 0)
- return -1;
+ /*
+ * Read the current id from the ioapic and keep it if
+ * available.
+ */
+ scoped_guard (raw_spinlock_irqsave, &ioapic_lock)
+ reg_00.raw = io_apic_read(idx, 0);
- idx = find_irq_entry(ioapic, pin, mp_INT);
- if (idx < 0)
- return -1;
+ new_id = reg_00.bits.ID;
+ if (!test_bit(new_id, used)) {
+ apic_pr_verbose("IOAPIC[%d]: Using reg apic_id %d instead of %d\n",
+ idx, new_id, id);
+ return new_id;
+ }
- *trigger = irq_trigger(idx);
- *polarity = irq_polarity(idx);
- return 0;
+ /* Get the next free id and write it to the ioapic. */
+ new_id = find_first_zero_bit(used, 256);
+ reg_00.bits.ID = new_id;
+ scoped_guard (raw_spinlock_irqsave, &ioapic_lock) {
+ io_apic_write(idx, 0, reg_00.raw);
+ reg_00.raw = io_apic_read(idx, 0);
+ }
+ /* Sanity check */
+ BUG_ON(reg_00.bits.ID != new_id);
+
+ return new_id;
}
+#endif
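
/*
 * A standalone sketch of the allocation policy shared by both
 * io_apic_unique_id() variants above: hand out the requested ID when it
 * is free, otherwise fall back to the first free one. A plain C array
 * stands in for the kernel bitmap helpers.
 */
#include <stdio.h>
#include <stdbool.h>

#define MAX_IDS 256

static bool used[MAX_IDS];

static int alloc_unique_id(unsigned int wanted)
{
	unsigned int i;

	if (wanted < MAX_IDS && !used[wanted]) {
		used[wanted] = true;
		return wanted;		/* requested ID was available */
	}
	for (i = 0; i < MAX_IDS; i++) {
		if (!used[i]) {
			used[i] = true;
			return i;	/* first free ID instead */
		}
	}
	return -1;			/* ID space exhausted */
}

int main(void)
{
	printf("%d\n", alloc_unique_id(2));	/* 2 */
	printf("%d\n", alloc_unique_id(2));	/* 0: 2 is already taken */
	return 0;
}
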
-/*
- * This function currently is only a helper for the i386 smp boot process where
- * we need to reprogram the ioredtbls to cater for the cpus which have come online
- * so mask in all cases should simply be apic->target_cpus()
- */
-#ifdef CONFIG_SMP
-void __init setup_ioapic_dest(void)
+static int io_apic_get_version(int ioapic)
{
- int pin, ioapic, irq, irq_entry;
- struct irq_desc *desc;
- const struct cpumask *mask;
-
- if (skip_ioapic_setup == 1)
- return;
-
- for (ioapic = 0; ioapic < nr_ioapics; ioapic++)
- for (pin = 0; pin < nr_ioapic_registers[ioapic]; pin++) {
- irq_entry = find_irq_entry(ioapic, pin, mp_INT);
- if (irq_entry == -1)
- continue;
- irq = pin_2_irq(irq_entry, ioapic, pin);
-
- if ((ioapic > 0) && (irq > 16))
- continue;
-
- desc = irq_to_desc(irq);
-
- /*
- * Honour affinities which have been set in early boot
- */
- if (desc->status &
- (IRQ_NO_BALANCING | IRQ_AFFINITY_SET))
- mask = desc->affinity;
- else
- mask = apic->target_cpus();
+ union IO_APIC_reg_01 reg_01;
- if (intr_remapping_enabled)
- set_ir_ioapic_affinity_irq_desc(desc, mask);
- else
- set_ioapic_affinity_irq_desc(desc, mask);
- }
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ reg_01.raw = io_apic_read(ioapic, 1);
+ return reg_01.bits.version;
}
-#endif
+/*
+ * This function updates target affinity of IOAPIC interrupts to include
+ * the CPUs which came online during SMP bringup.
+ */
#define IOAPIC_RESOURCE_NAME_SIZE 11
static struct resource *ioapic_resources;
-static struct resource * __init ioapic_setup_resources(int nr_ioapics)
+static struct resource * __init ioapic_setup_resources(void)
{
- unsigned long n;
struct resource *res;
+ unsigned long n;
char *mem;
int i;
- if (nr_ioapics <= 0)
+ if (nr_ioapics == 0)
return NULL;
n = IOAPIC_RESOURCE_NAME_SIZE + sizeof(struct resource);
n *= nr_ioapics;
- mem = alloc_bootmem(n);
+ mem = memblock_alloc_or_panic(n, SMP_CACHE_BYTES);
res = (void *)mem;
mem += sizeof(struct resource) * nr_ioapics;
- for (i = 0; i < nr_ioapics; i++) {
+ for_each_ioapic(i) {
res[i].name = mem;
res[i].flags = IORESOURCE_MEM | IORESOURCE_BUSY;
snprintf(mem, IOAPIC_RESOURCE_NAME_SIZE, "IOAPIC %u", i);
mem += IOAPIC_RESOURCE_NAME_SIZE;
+ ioapics[i].iomem_res = &res[i];
}
ioapic_resources = res;
@@ -4165,24 +2530,40 @@ static struct resource * __init ioapic_setup_resources(int nr_ioapics)
return res;
}
-void __init ioapic_init_mappings(void)
+static void io_apic_set_fixmap(enum fixed_addresses idx, phys_addr_t phys)
+{
+ pgprot_t flags = FIXMAP_PAGE_NOCACHE;
+
+ /*
+ * Ensure fixmaps for IO-APIC MMIO respect memory encryption pgprot
+ * bits, just like normal ioremap():
+ */
+ if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
+ if (x86_platform.hyper.is_private_mmio(phys))
+ flags = pgprot_encrypted(flags);
+ else
+ flags = pgprot_decrypted(flags);
+ }
+
+ __set_fixmap(idx, phys, flags);
+}
+
+void __init io_apic_init_mappings(void)
{
unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
struct resource *ioapic_res;
int i;
- ioapic_res = ioapic_setup_resources(nr_ioapics);
- for (i = 0; i < nr_ioapics; i++) {
+ ioapic_res = ioapic_setup_resources();
+ for_each_ioapic(i) {
if (smp_found_config) {
- ioapic_phys = mp_ioapics[i].apicaddr;
+ ioapic_phys = mpc_ioapic_addr(i);
#ifdef CONFIG_X86_32
if (!ioapic_phys) {
- printk(KERN_ERR
- "WARNING: bogus zero IO-APIC "
- "address found in MPTABLE, "
+ pr_err("WARNING: bogus zero IO-APIC address found in MPTABLE, "
"disabling IO/APIC support!\n");
smp_found_config = 0;
- skip_ioapic_setup = 1;
+ ioapic_is_disabled = true;
goto fake_ioapic_page;
}
#endif
@@ -4190,13 +2571,13 @@ void __init ioapic_init_mappings(void)
#ifdef CONFIG_X86_32
fake_ioapic_page:
#endif
- ioapic_phys = (unsigned long)alloc_bootmem_pages(PAGE_SIZE);
+ ioapic_phys = (unsigned long)memblock_alloc_or_panic(PAGE_SIZE,
+ PAGE_SIZE);
ioapic_phys = __pa(ioapic_phys);
}
- set_fixmap_nocache(idx, ioapic_phys);
- apic_printk(APIC_VERBOSE, "mapped IOAPIC to %08lx (%08lx)\n",
- __fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK),
- ioapic_phys);
+ io_apic_set_fixmap(idx, ioapic_phys);
+ apic_pr_verbose("mapped IOAPIC to %08lx (%08lx)\n",
+ __fix_to_virt(idx) + (ioapic_phys & ~PAGE_MASK), ioapic_phys);
idx++;
ioapic_res->start = ioapic_phys;
@@ -4207,17 +2588,16 @@ fake_ioapic_page:
void __init ioapic_insert_resources(void)
{
- int i;
struct resource *r = ioapic_resources;
+ int i;
if (!r) {
if (nr_ioapics > 0)
- printk(KERN_ERR
- "IO APIC resources couldn't be allocated.\n");
+ pr_err("IO APIC resources couldn't be allocated.\n");
return;
}
- for (i = 0; i < nr_ioapics; i++) {
+ for_each_ioapic(i) {
insert_resource(&iomem_resource, r);
r++;
}
@@ -4225,103 +2605,355 @@ void __init ioapic_insert_resources(void)
int mp_find_ioapic(u32 gsi)
{
- int i = 0;
+ int i;
+
+ if (nr_ioapics == 0)
+ return -1;
/* Find the IOAPIC that manages this GSI. */
- for (i = 0; i < nr_ioapics; i++) {
- if ((gsi >= mp_gsi_routing[i].gsi_base)
- && (gsi <= mp_gsi_routing[i].gsi_end))
+ for_each_ioapic(i) {
+ struct mp_ioapic_gsi *gsi_cfg = mp_ioapic_gsi_routing(i);
+
+ if (gsi >= gsi_cfg->gsi_base && gsi <= gsi_cfg->gsi_end)
return i;
}
- printk(KERN_ERR "ERROR: Unable to locate IOAPIC for GSI %d\n", gsi);
+ pr_err("ERROR: Unable to locate IOAPIC for GSI %d\n", gsi);
return -1;
}
int mp_find_ioapic_pin(int ioapic, u32 gsi)
{
- if (WARN_ON(ioapic == -1))
+ struct mp_ioapic_gsi *gsi_cfg;
+
+ if (WARN_ON(ioapic < 0))
return -1;
- if (WARN_ON(gsi > mp_gsi_routing[ioapic].gsi_end))
+
+ gsi_cfg = mp_ioapic_gsi_routing(ioapic);
+ if (WARN_ON(gsi > gsi_cfg->gsi_end))
return -1;
- return gsi - mp_gsi_routing[ioapic].gsi_base;
+ return gsi - gsi_cfg->gsi_base;
}
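
/*
 * A standalone sketch of the two lookups above: every IOAPIC owns a
 * contiguous GSI window [gsi_base, gsi_end], so finding the IOAPIC is a
 * range scan and the pin number is simply the offset into the window.
 * The ranges here are hypothetical sample data.
 */
#include <stdio.h>

struct gsi_range { unsigned int base, end; };

static int find_pin(const struct gsi_range *r, int nr, unsigned int gsi)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (gsi >= r[i].base && gsi <= r[i].end)
			return (int)(gsi - r[i].base);	/* pin within IOAPIC i */
	}
	return -1;	/* no IOAPIC manages this GSI */
}

int main(void)
{
	struct gsi_range ranges[] = { { 0, 23 }, { 24, 55 } };

	printf("%d\n", find_pin(ranges, 2, 30));	/* pin 6 of the second IOAPIC */
	return 0;
}
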
-static int bad_ioapic(unsigned long address)
+static int bad_ioapic_register(int idx)
{
- if (nr_ioapics >= MAX_IO_APICS) {
- printk(KERN_WARNING "WARING: Max # of I/O APICs (%d) exceeded "
- "(found %d), skipping\n", MAX_IO_APICS, nr_ioapics);
- return 1;
- }
- if (!address) {
- printk(KERN_WARNING "WARNING: Bogus (zero) I/O APIC address"
- " found in table, skipping!\n");
+ union IO_APIC_reg_00 reg_00;
+ union IO_APIC_reg_01 reg_01;
+ union IO_APIC_reg_02 reg_02;
+
+ reg_00.raw = io_apic_read(idx, 0);
+ reg_01.raw = io_apic_read(idx, 1);
+ reg_02.raw = io_apic_read(idx, 2);
+
+ if (reg_00.raw == -1 && reg_01.raw == -1 && reg_02.raw == -1) {
+ pr_warn("I/O APIC 0x%x registers return all ones, skipping!\n",
+ mpc_ioapic_addr(idx));
return 1;
}
+
return 0;
}
-void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)
+static int find_free_ioapic_entry(void)
{
- int idx = 0;
- int entries;
+ for (int idx = 0; idx < MAX_IO_APICS; idx++) {
+ if (ioapics[idx].nr_registers == 0)
+ return idx;
+ }
+ return MAX_IO_APICS;
+}
- if (bad_ioapic(address))
- return;
+/**
+ * mp_register_ioapic - Register an IOAPIC device
+ * @id: hardware IOAPIC ID
+ * @address: physical address of IOAPIC register area
+ * @gsi_base: base of GSI associated with the IOAPIC
+ * @cfg: configuration information for the IOAPIC
+ */
+int mp_register_ioapic(int id, u32 address, u32 gsi_base, struct ioapic_domain_cfg *cfg)
+{
+ bool hotplug = !!ioapic_initialized;
+ struct mp_ioapic_gsi *gsi_cfg;
+ int idx, ioapic, entries;
+ u32 gsi_end;
- idx = nr_ioapics;
+ if (!address) {
+ pr_warn("Bogus (zero) I/O APIC address found, skipping!\n");
+ return -EINVAL;
+ }
- mp_ioapics[idx].type = MP_IOAPIC;
- mp_ioapics[idx].flags = MPC_APIC_USABLE;
- mp_ioapics[idx].apicaddr = address;
+ for_each_ioapic(ioapic) {
+ if (ioapics[ioapic].mp_config.apicaddr == address) {
+ pr_warn("address 0x%x conflicts with IOAPIC%d\n", address, ioapic);
+ return -EEXIST;
+ }
+ }
- set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
- mp_ioapics[idx].apicid = io_apic_unique_id(id);
- mp_ioapics[idx].apicver = io_apic_get_version(idx);
+ idx = find_free_ioapic_entry();
+ if (idx >= MAX_IO_APICS) {
+ pr_warn("Max # of I/O APICs (%d) exceeded (found %d), skipping\n",
+ MAX_IO_APICS, idx);
+ return -ENOSPC;
+ }
+
+ ioapics[idx].mp_config.type = MP_IOAPIC;
+ ioapics[idx].mp_config.flags = MPC_APIC_USABLE;
+ ioapics[idx].mp_config.apicaddr = address;
+
+ io_apic_set_fixmap(FIX_IO_APIC_BASE_0 + idx, address);
+ if (bad_ioapic_register(idx)) {
+ clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
+ return -ENODEV;
+ }
+
+ ioapics[idx].mp_config.apicid = io_apic_unique_id(idx, id);
+ ioapics[idx].mp_config.apicver = io_apic_get_version(idx);
/*
* Build basic GSI lookup table to facilitate gsi->io_apic lookups
* and to prevent reprogramming of IOAPIC pins (PCI GSIs).
*/
entries = io_apic_get_redir_entries(idx);
- mp_gsi_routing[idx].gsi_base = gsi_base;
- mp_gsi_routing[idx].gsi_end = gsi_base + entries - 1;
+ gsi_end = gsi_base + entries - 1;
+ for_each_ioapic(ioapic) {
+ gsi_cfg = mp_ioapic_gsi_routing(ioapic);
+ if ((gsi_base >= gsi_cfg->gsi_base &&
+ gsi_base <= gsi_cfg->gsi_end) ||
+ (gsi_end >= gsi_cfg->gsi_base &&
+ gsi_end <= gsi_cfg->gsi_end)) {
+ pr_warn("GSI range [%u-%u] for new IOAPIC conflicts with GSI[%u-%u]\n",
+ gsi_base, gsi_end, gsi_cfg->gsi_base, gsi_cfg->gsi_end);
+ clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
+ return -ENOSPC;
+ }
+ }
+ gsi_cfg = mp_ioapic_gsi_routing(idx);
+ gsi_cfg->gsi_base = gsi_base;
+ gsi_cfg->gsi_end = gsi_end;
+
+ ioapics[idx].irqdomain = NULL;
+ ioapics[idx].irqdomain_cfg = *cfg;
/*
- * The number of IO-APIC IRQ registers (== #pins):
+ * If mp_register_ioapic() is called during the early boot stage while
+ * walking the ACPI/DT tables, it is too early to create the irqdomain
+ * because the bootmem allocator is still in use, so defer creation to
+ * setup_IO_APIC().
*/
- nr_ioapic_registers[idx] = entries;
+ if (hotplug) {
+ if (mp_irqdomain_create(idx)) {
+ clear_fixmap(FIX_IO_APIC_BASE_0 + idx);
+ return -ENOMEM;
+ }
+ alloc_ioapic_saved_registers(idx);
+ }
- if (mp_gsi_routing[idx].gsi_end >= gsi_top)
- gsi_top = mp_gsi_routing[idx].gsi_end + 1;
+ if (gsi_cfg->gsi_end >= gsi_top)
+ gsi_top = gsi_cfg->gsi_end + 1;
+ if (nr_ioapics <= idx)
+ nr_ioapics = idx + 1;
- printk(KERN_INFO "IOAPIC[%d]: apic_id %d, version %d, address 0x%x, "
- "GSI %d-%d\n", idx, mp_ioapics[idx].apicid,
- mp_ioapics[idx].apicver, mp_ioapics[idx].apicaddr,
- mp_gsi_routing[idx].gsi_base, mp_gsi_routing[idx].gsi_end);
+ /* Set nr_registers to mark entry present */
+ ioapics[idx].nr_registers = entries;
- nr_ioapics++;
+ pr_info("IOAPIC[%d]: apic_id %d, version %d, address 0x%x, GSI %d-%d\n",
+ idx, mpc_ioapic_id(idx), mpc_ioapic_ver(idx), mpc_ioapic_addr(idx),
+ gsi_cfg->gsi_base, gsi_cfg->gsi_end);
+
+ return 0;
}
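
/*
 * The registration path above rejects a new IOAPIC whose GSI window
 * collides with an existing one by testing whether either endpoint of
 * the new window lands inside a registered window. For reference, the
 * compact interval test below is equivalent for the endpoint cases and
 * additionally flags full containment of an old window by a new one;
 * it is a standalone sketch, not the in-tree code.
 */
#include <stdio.h>
#include <stdbool.h>

static bool gsi_windows_conflict(unsigned int a_base, unsigned int a_end,
				 unsigned int b_base, unsigned int b_end)
{
	/* Closed intervals overlap iff each starts no later than the other ends. */
	return a_base <= b_end && b_base <= a_end;
}

int main(void)
{
	printf("%d\n", gsi_windows_conflict(0, 23, 24, 55));	/* 0: disjoint */
	printf("%d\n", gsi_windows_conflict(0, 23, 16, 40));	/* 1: overlap */
	printf("%d\n", gsi_windows_conflict(0, 60, 24, 55));	/* 1: containment */
	return 0;
}
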
-/* Enable IOAPIC early just for system timer */
-void __init pre_init_apic_IRQ0(void)
+int mp_unregister_ioapic(u32 gsi_base)
{
- struct irq_cfg *cfg;
- struct irq_desc *desc;
+ int ioapic, pin;
+ int found = 0;
- printk(KERN_INFO "Early APIC setup for system timer0\n");
-#ifndef CONFIG_SMP
- phys_cpu_present_map = physid_mask_of_physid(boot_cpu_physical_apicid);
-#endif
- desc = irq_to_desc_alloc_node(0, 0);
+ for_each_ioapic(ioapic) {
+ if (ioapics[ioapic].gsi_config.gsi_base == gsi_base) {
+ found = 1;
+ break;
+ }
+ }
+
+ if (!found) {
+ pr_warn("can't find IOAPIC for GSI %d\n", gsi_base);
+ return -ENODEV;
+ }
+
+ for_each_pin(ioapic, pin) {
+ u32 gsi = mp_pin_to_gsi(ioapic, pin);
+ int irq = mp_map_gsi_to_irq(gsi, 0, NULL);
+ struct mp_chip_data *data;
+
+ if (irq >= 0) {
+ data = irq_get_chip_data(irq);
+ if (data && data->count) {
+ pr_warn("pin%d on IOAPIC%d is still in use.\n", pin, ioapic);
+ return -EBUSY;
+ }
+ }
+ }
+
+ /* Mark entry not present */
+ ioapics[ioapic].nr_registers = 0;
+ ioapic_destroy_irqdomain(ioapic);
+ free_ioapic_saved_registers(ioapic);
+ if (ioapics[ioapic].iomem_res)
+ release_resource(ioapics[ioapic].iomem_res);
+ clear_fixmap(FIX_IO_APIC_BASE_0 + ioapic);
+ memset(&ioapics[ioapic], 0, sizeof(ioapics[ioapic]));
+
+ return 0;
+}
+
+int mp_ioapic_registered(u32 gsi_base)
+{
+ int ioapic;
+
+ for_each_ioapic(ioapic)
+ if (ioapics[ioapic].gsi_config.gsi_base == gsi_base)
+ return 1;
+
+ return 0;
+}
- setup_local_APIC();
+static void mp_irqdomain_get_attr(u32 gsi, struct mp_chip_data *data,
+ struct irq_alloc_info *info)
+{
+ if (info && info->ioapic.valid) {
+ data->is_level = info->ioapic.is_level;
+ data->active_low = info->ioapic.active_low;
+ } else if (__acpi_get_override_irq(gsi, &data->is_level, &data->active_low) < 0) {
+ /* PCI interrupts are always active low level triggered. */
+ data->is_level = true;
+ data->active_low = true;
+ }
+}
+
+/*
+ * Configure the I/O-APIC specific fields in the routing entry.
+ *
+ * This is important to setup the I/O-APIC specific bits (is_level,
+ * active_low, masked) because the underlying parent domain will only
+ * provide the routing information and is oblivious of the I/O-APIC
+ * specific bits.
+ *
+ * The entry is just preconfigured at this point and not written into the
+ * RTE. This happens later during activation which will fill in the actual
+ * routing information.
+ */
+static void mp_preconfigure_entry(struct mp_chip_data *data)
+{
+ struct IO_APIC_route_entry *entry = &data->entry;
+
+ memset(entry, 0, sizeof(*entry));
+ entry->is_level = data->is_level;
+ entry->active_low = data->active_low;
+ /*
+ * Mask level triggered irqs. Edge triggered irqs are masked
+ * by the irq core code in case they fire.
+ */
+ entry->masked = data->is_level;
+}
+
+int mp_irqdomain_alloc(struct irq_domain *domain, unsigned int virq,
+ unsigned int nr_irqs, void *arg)
+{
+ struct irq_alloc_info *info = arg;
+ struct mp_chip_data *data;
+ struct irq_data *irq_data;
+ int ret, ioapic, pin;
+ unsigned long flags;
+
+ if (!info || nr_irqs > 1)
+ return -EINVAL;
+ irq_data = irq_domain_get_irq_data(domain, virq);
+ if (!irq_data)
+ return -EINVAL;
+
+ ioapic = mp_irqdomain_ioapic_idx(domain);
+ pin = info->ioapic.pin;
+ if (irq_resolve_mapping(domain, (irq_hw_number_t)pin))
+ return -EEXIST;
+
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
+ if (!data)
+ return -ENOMEM;
+
+ ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, info);
+ if (ret < 0)
+ goto free_data;
+
+ INIT_LIST_HEAD(&data->irq_2_pin);
+ irq_data->hwirq = info->ioapic.pin;
+ irq_data->chip = (domain->parent == x86_vector_domain) ?
+ &ioapic_chip : &ioapic_ir_chip;
+ irq_data->chip_data = data;
+ mp_irqdomain_get_attr(mp_pin_to_gsi(ioapic, pin), data, info);
- cfg = irq_cfg(0);
- add_pin_to_irq_node(cfg, 0, 0, 0);
- set_irq_chip_and_handler_name(0, &ioapic_chip, handle_edge_irq, "edge");
+ if (!add_pin_to_irq_node(data, ioapic_alloc_attr_node(info), ioapic, pin)) {
+ ret = -ENOMEM;
+ goto free_irqs;
+ }
+
+ mp_preconfigure_entry(data);
+ mp_register_handler(virq, data->is_level);
+
+ local_irq_save(flags);
+ if (virq < nr_legacy_irqs())
+ legacy_pic->mask(virq);
+ local_irq_restore(flags);
+
+ apic_pr_verbose("IOAPIC[%d]: Preconfigured routing entry (%d-%d -> IRQ %d Level:%i ActiveLow:%i)\n",
+ ioapic, mpc_ioapic_id(ioapic), pin, virq, data->is_level, data->active_low);
+ return 0;
- setup_IO_APIC_irq(0, 0, 0, desc, 0, 0);
+free_irqs:
+ irq_domain_free_irqs_parent(domain, virq, nr_irqs);
+free_data:
+ kfree(data);
+ return ret;
+}
+
+void mp_irqdomain_free(struct irq_domain *domain, unsigned int virq,
+ unsigned int nr_irqs)
+{
+ struct irq_data *irq_data;
+ struct mp_chip_data *data;
+
+ BUG_ON(nr_irqs != 1);
+ irq_data = irq_domain_get_irq_data(domain, virq);
+ if (irq_data && irq_data->chip_data) {
+ data = irq_data->chip_data;
+ __remove_pin_from_irq(data, mp_irqdomain_ioapic_idx(domain), (int)irq_data->hwirq);
+ WARN_ON(!list_empty(&data->irq_2_pin));
+ kfree(irq_data->chip_data);
+ }
+ irq_domain_free_irqs_top(domain, virq, nr_irqs);
}
+
+int mp_irqdomain_activate(struct irq_domain *domain, struct irq_data *irq_data, bool reserve)
+{
+ guard(raw_spinlock_irqsave)(&ioapic_lock);
+ ioapic_configure_entry(irq_data);
+ return 0;
+}
+
+void mp_irqdomain_deactivate(struct irq_domain *domain,
+ struct irq_data *irq_data)
+{
+ /* It won't be called for IRQ with multiple IOAPIC pins associated */
+ ioapic_mask_entry(mp_irqdomain_ioapic_idx(domain), (int)irq_data->hwirq);
+}
+
+int mp_irqdomain_ioapic_idx(struct irq_domain *domain)
+{
+ return (int)(long)domain->host_data;
+}
+
+const struct irq_domain_ops mp_ioapic_irqdomain_ops = {
+ .alloc = mp_irqdomain_alloc,
+ .free = mp_irqdomain_free,
+ .activate = mp_irqdomain_activate,
+ .deactivate = mp_irqdomain_deactivate,
+};
diff --git a/arch/x86/kernel/apic/ipi.c b/arch/x86/kernel/apic/ipi.c
index 08385e090a6f..98a57cb4aa86 100644
--- a/arch/x86/kernel/apic/ipi.c
+++ b/arch/x86/kernel/apic/ipi.c
@@ -1,167 +1,291 @@
-#include <linux/cpumask.h>
-#include <linux/interrupt.h>
-#include <linux/init.h>
+// SPDX-License-Identifier: GPL-2.0
-#include <linux/mm.h>
+#include <linux/cpumask.h>
#include <linux/delay.h>
-#include <linux/spinlock.h>
-#include <linux/kernel_stat.h>
-#include <linux/mc146818rtc.h>
-#include <linux/cache.h>
-#include <linux/cpu.h>
-#include <linux/module.h>
-
-#include <asm/smp.h>
-#include <asm/mtrr.h>
-#include <asm/tlbflush.h>
-#include <asm/mmu_context.h>
-#include <asm/apic.h>
-#include <asm/proto.h>
-#include <asm/ipi.h>
+#include <linux/smp.h>
+#include <linux/string_choices.h>
-void default_send_IPI_mask_sequence_phys(const struct cpumask *mask, int vector)
+#include <asm/io_apic.h>
+
+#include "local.h"
+
+DEFINE_STATIC_KEY_FALSE(apic_use_ipi_shorthand);
+
+#ifdef CONFIG_SMP
+static int apic_ipi_shorthand_off __ro_after_init;
+
+static __init int apic_ipi_shorthand(char *str)
{
- unsigned long query_cpu;
- unsigned long flags;
+ get_option(&str, &apic_ipi_shorthand_off);
+ return 1;
+}
+__setup("no_ipi_broadcast=", apic_ipi_shorthand);
+
+static int __init print_ipi_mode(void)
+{
+ pr_info("IPI shorthand broadcast: %s\n",
+ str_disabled_enabled(apic_ipi_shorthand_off));
+ return 0;
+}
+late_initcall(print_ipi_mode);
+void apic_smt_update(void)
+{
/*
- * Hack. The clustered APIC addressing mode doesn't allow us to send
- * to an arbitrary mask, so I do a unicast to each CPU instead.
- * - mbligh
+ * Do not switch to broadcast mode if:
+ * - Disabled on the command line
+ * - Only a single CPU is online
+ * - Not all present CPUs have been at least booted once
+ *
+ * The latter is important as the local APIC might be in some
+ * random state and a broadcast might cause havoc. That's
+ * especially true for NMI broadcasting.
*/
- local_irq_save(flags);
- for_each_cpu(query_cpu, mask) {
- __default_send_IPI_dest_field(per_cpu(x86_cpu_to_apicid,
- query_cpu), vector, APIC_DEST_PHYSICAL);
+ if (apic_ipi_shorthand_off || num_online_cpus() == 1 ||
+ !cpumask_equal(cpu_present_mask, &cpus_booted_once_mask)) {
+ static_branch_disable(&apic_use_ipi_shorthand);
+ } else {
+ static_branch_enable(&apic_use_ipi_shorthand);
}
- local_irq_restore(flags);
}
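
/*
 * A kernel-style sketch of the static-key mechanism used above, with
 * hypothetical "example_" names: the branch in the hot path is patched
 * at runtime by static_branch_enable()/static_branch_disable(), so the
 * shorthand decision costs no memory load or conditional once patched.
 */
#include <linux/jump_label.h>
#include <linux/types.h>

static DEFINE_STATIC_KEY_FALSE(example_fast_path);
static unsigned long fast_hits, slow_hits;

static void example_hot_path(void)
{
	if (static_branch_likely(&example_fast_path))
		fast_hits++;	/* patched-in branch, e.g. shorthand IPI */
	else
		slow_hits++;	/* fallback, e.g. per-CPU unicast IPIs */
}

static void example_reconfigure(bool fast_ok)
{
	/* Called from slow-path code such as CPU hotplug notifications. */
	if (fast_ok)
		static_branch_enable(&example_fast_path);
	else
		static_branch_disable(&example_fast_path);
}
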
-void default_send_IPI_mask_allbutself_phys(const struct cpumask *mask,
- int vector)
+void apic_send_IPI_allbutself(unsigned int vector)
{
- unsigned int this_cpu = smp_processor_id();
- unsigned int query_cpu;
- unsigned long flags;
+ if (num_online_cpus() < 2)
+ return;
- /* See Hack comment above */
+ if (static_branch_likely(&apic_use_ipi_shorthand))
+ __apic_send_IPI_allbutself(vector);
+ else
+ __apic_send_IPI_mask_allbutself(cpu_online_mask, vector);
+}
- local_irq_save(flags);
- for_each_cpu(query_cpu, mask) {
- if (query_cpu == this_cpu)
- continue;
- __default_send_IPI_dest_field(per_cpu(x86_cpu_to_apicid,
- query_cpu), vector, APIC_DEST_PHYSICAL);
+/*
+ * Send a 'reschedule' IPI to another CPU. It goes straight through and
+ * wastes no time serializing anything. Worst case is that we lose a
+ * reschedule ...
+ */
+void native_smp_send_reschedule(int cpu)
+{
+ if (unlikely(cpu_is_offline(cpu))) {
+ WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu);
+ return;
}
- local_irq_restore(flags);
+ __apic_send_IPI(cpu, RESCHEDULE_VECTOR);
}
-void default_send_IPI_mask_sequence_logical(const struct cpumask *mask,
- int vector)
+void native_send_call_func_single_ipi(int cpu)
{
- unsigned long flags;
- unsigned int query_cpu;
+ __apic_send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR);
+}
+
+void native_send_call_func_ipi(const struct cpumask *mask)
+{
+ if (static_branch_likely(&apic_use_ipi_shorthand)) {
+ unsigned int cpu = smp_processor_id();
+
+ if (!cpumask_or_equal(mask, cpumask_of(cpu), cpu_online_mask))
+ goto sendmask;
+
+ if (cpumask_test_cpu(cpu, mask))
+ __apic_send_IPI_all(CALL_FUNCTION_VECTOR);
+ else if (num_online_cpus() > 1)
+ __apic_send_IPI_allbutself(CALL_FUNCTION_VECTOR);
+ return;
+ }
+
+sendmask:
+ __apic_send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
+}
+
+void apic_send_nmi_to_offline_cpu(unsigned int cpu)
+{
+ if (WARN_ON_ONCE(!apic->nmi_to_offline_cpu))
+ return;
+ if (WARN_ON_ONCE(!cpumask_test_cpu(cpu, &cpus_booted_once_mask)))
+ return;
+ apic->send_IPI(cpu, NMI_VECTOR);
+}
+#endif /* CONFIG_SMP */
+
+static inline int __prepare_ICR2(unsigned int mask)
+{
+ return SET_XAPIC_DEST_FIELD(mask);
+}
+
+u32 apic_mem_wait_icr_idle_timeout(void)
+{
+ int cnt;
+
+ for (cnt = 0; cnt < 1000; cnt++) {
+ if (!(apic_read(APIC_ICR) & APIC_ICR_BUSY))
+ return 0;
+ inc_irq_stat(icr_read_retry_count);
+ udelay(100);
+ }
+ return APIC_ICR_BUSY;
+}
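
/*
 * A standalone sketch of the bounded-poll idiom above: retry a
 * readiness check a fixed number of times with a fixed delay, and hand
 * the stuck busy bit back to the caller if the deadline passes (1000
 * iterations of 100us gives roughly 100ms here). device_busy() and
 * delay_us() are hypothetical stand-ins for the APIC_ICR read and
 * udelay().
 */
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_BUSY_BIT 0x1000u

extern bool device_busy(void);		/* hypothetical hardware poll */
extern void delay_us(unsigned int us);	/* hypothetical delay */

uint32_t wait_idle_timeout(void)
{
	int cnt;

	for (cnt = 0; cnt < 1000; cnt++) {
		if (!device_busy())
			return 0;		/* went idle in time */
		delay_us(100);
	}
	return EXAMPLE_BUSY_BIT;		/* caller decides how to react */
}
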
+
+void apic_mem_wait_icr_idle(void)
+{
+ while (native_apic_mem_read(APIC_ICR) & APIC_ICR_BUSY)
+ cpu_relax();
+}
+
+/*
+ * This is safe against interruption because it only writes the lower 32
+ * bits of the APIC_ICR register. The destination field is ignored for
+ * short hand IPIs.
+ *
+ * wait_icr_idle()
+ * write(ICR2, dest)
+ * NMI
+ * wait_icr_idle()
+ * write(ICR)
+ * wait_icr_idle()
+ * write(ICR)
+ *
+ * This function does not need to disable interrupts as there is no ICR2
+ * interaction. The memory write is direct except when the machine is
+ * affected by the 11AP Pentium erratum, which turns the plain write into
+ * an XCHG operation.
+ */
+static void __default_send_IPI_shortcut(unsigned int shortcut, int vector)
+{
/*
- * Hack. The clustered APIC addressing mode doesn't allow us to send
- * to an arbitrary mask, so I do a unicasts to each CPU instead. This
- * should be modified to do 1 message per cluster ID - mbligh
+ * Wait for the previous ICR command to complete. Use
+ * safe_apic_wait_icr_idle() for the NMI vector as there have been
+ * issues where otherwise the system hangs when the panic CPU tries
+ * to stop the others before launching the kdump kernel.
*/
+ if (unlikely(vector == NMI_VECTOR))
+ apic_mem_wait_icr_idle_timeout();
+ else
+ apic_mem_wait_icr_idle();
+
+ /* Destination field (ICR2) and the destination mode are ignored */
+ native_apic_mem_write(APIC_ICR, __prepare_ICR(shortcut, vector, 0));
+}
+
+/*
+ * This is used to send an IPI with no shorthand notation (the destination is
+ * specified in bits 56 to 63 of the ICR).
+ */
+void __default_send_IPI_dest_field(unsigned int dest_mask, int vector,
+ unsigned int dest_mode)
+{
+ /* See comment in __default_send_IPI_shortcut() */
+ if (unlikely(vector == NMI_VECTOR))
+ apic_mem_wait_icr_idle_timeout();
+ else
+ apic_mem_wait_icr_idle();
+
+ /* Set the IPI destination field in the ICR */
+ native_apic_mem_write(APIC_ICR2, __prepare_ICR2(dest_mask));
+ /* Send it with the proper destination mode */
+ native_apic_mem_write(APIC_ICR, __prepare_ICR(0, vector, dest_mode));
+}
+
+void default_send_IPI_single_phys(int cpu, int vector)
+{
+ unsigned long flags;
local_irq_save(flags);
- for_each_cpu(query_cpu, mask)
- __default_send_IPI_dest_field(
- apic->cpu_to_logical_apicid(query_cpu), vector,
- apic->dest_logical);
+ __default_send_IPI_dest_field(per_cpu(x86_cpu_to_apicid, cpu),
+ vector, APIC_DEST_PHYSICAL);
local_irq_restore(flags);
}
-void default_send_IPI_mask_allbutself_logical(const struct cpumask *mask,
- int vector)
+void default_send_IPI_mask_sequence_phys(const struct cpumask *mask, int vector)
{
unsigned long flags;
- unsigned int query_cpu;
- unsigned int this_cpu = smp_processor_id();
-
- /* See Hack comment above */
+ unsigned long cpu;
local_irq_save(flags);
- for_each_cpu(query_cpu, mask) {
- if (query_cpu == this_cpu)
- continue;
- __default_send_IPI_dest_field(
- apic->cpu_to_logical_apicid(query_cpu), vector,
- apic->dest_logical);
- }
+ for_each_cpu(cpu, mask) {
+ __default_send_IPI_dest_field(per_cpu(x86_cpu_to_apicid,
+ cpu), vector, APIC_DEST_PHYSICAL);
+ }
local_irq_restore(flags);
}
-#ifdef CONFIG_X86_32
-
-/*
- * This is only used on smaller machines.
- */
-void default_send_IPI_mask_logical(const struct cpumask *cpumask, int vector)
+void default_send_IPI_mask_allbutself_phys(const struct cpumask *mask,
+ int vector)
{
- unsigned long mask = cpumask_bits(cpumask)[0];
+ unsigned int cpu, this_cpu = smp_processor_id();
unsigned long flags;
- if (WARN_ONCE(!mask, "empty IPI mask"))
- return;
-
local_irq_save(flags);
- WARN_ON(mask & ~cpumask_bits(cpu_online_mask)[0]);
- __default_send_IPI_dest_field(mask, vector, apic->dest_logical);
+ for_each_cpu(cpu, mask) {
+ if (cpu == this_cpu)
+ continue;
+ __default_send_IPI_dest_field(per_cpu(x86_cpu_to_apicid,
+ cpu), vector, APIC_DEST_PHYSICAL);
+ }
local_irq_restore(flags);
}
-void default_send_IPI_allbutself(int vector)
+/*
+ * Helper function for APICs which insist on cpumasks
+ */
+void default_send_IPI_single(int cpu, int vector)
{
- /*
- * if there are no other CPUs in the system then we get an APIC send
- * error if we try to broadcast, thus avoid sending IPIs in this case.
- */
- if (!(num_online_cpus() > 1))
- return;
+ __apic_send_IPI_mask(cpumask_of(cpu), vector);
+}
- __default_local_send_IPI_allbutself(vector);
+void default_send_IPI_allbutself(int vector)
+{
+ __default_send_IPI_shortcut(APIC_DEST_ALLBUT, vector);
}
void default_send_IPI_all(int vector)
{
- __default_local_send_IPI_all(vector);
+ __default_send_IPI_shortcut(APIC_DEST_ALLINC, vector);
}
void default_send_IPI_self(int vector)
{
- __default_send_IPI_shortcut(APIC_DEST_SELF, vector, apic->dest_logical);
+ __default_send_IPI_shortcut(APIC_DEST_SELF, vector);
}
-/* must come after the send_IPI functions above for inlining */
-static int convert_apicid_to_cpu(int apic_id)
+#ifdef CONFIG_X86_32
+void default_send_IPI_mask_sequence_logical(const struct cpumask *mask, int vector)
{
- int i;
+ unsigned long flags;
+ unsigned int cpu;
- for_each_possible_cpu(i) {
- if (per_cpu(x86_cpu_to_apicid, i) == apic_id)
- return i;
- }
- return -1;
+ local_irq_save(flags);
+ for_each_cpu(cpu, mask)
+ __default_send_IPI_dest_field(1U << cpu, vector, APIC_DEST_LOGICAL);
+ local_irq_restore(flags);
}
-int safe_smp_processor_id(void)
+void default_send_IPI_mask_allbutself_logical(const struct cpumask *mask,
+ int vector)
{
- int apicid, cpuid;
+ unsigned int cpu, this_cpu = smp_processor_id();
+ unsigned long flags;
- if (!cpu_has_apic)
- return 0;
+ local_irq_save(flags);
+ for_each_cpu(cpu, mask) {
+ if (cpu == this_cpu)
+ continue;
+ __default_send_IPI_dest_field(1U << cpu, vector, APIC_DEST_LOGICAL);
+ }
+ local_irq_restore(flags);
+}
- apicid = hard_smp_processor_id();
- if (apicid == BAD_APICID)
- return 0;
+void default_send_IPI_mask_logical(const struct cpumask *cpumask, int vector)
+{
+ unsigned long mask = cpumask_bits(cpumask)[0];
+ unsigned long flags;
- cpuid = convert_apicid_to_cpu(apicid);
+ if (!mask)
+ return;
- return cpuid >= 0 ? cpuid : 0;
+ local_irq_save(flags);
+ WARN_ON(mask & ~cpumask_bits(cpu_online_mask)[0]);
+ __default_send_IPI_dest_field(mask, vector, APIC_DEST_LOGICAL);
+ local_irq_restore(flags);
}
#endif
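
/*
 * A standalone illustration of the flat logical addressing used by the
 * 32-bit helpers above: each CPU owns one bit of the 8-bit logical
 * destination field, so the low word of a cpumask can be written into
 * the ICR destination directly. That is also why flat mode tops out at
 * eight CPUs. The values below are hypothetical sample data.
 */
#include <stdio.h>

int main(void)
{
	unsigned long online = 0x0ful;			/* CPUs 0-3 online */
	unsigned long mask = (1ul << 1) | (1ul << 3);	/* target CPUs 1 and 3 */

	if (mask & ~online)
		printf("mask contains offline CPUs\n");
	printf("logical dest field: 0x%02lx\n", mask);	/* prints 0x0a */
	return 0;
}
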
diff --git a/arch/x86/kernel/apic/local.h b/arch/x86/kernel/apic/local.h
new file mode 100644
index 000000000000..bdcf609eb283
--- /dev/null
+++ b/arch/x86/kernel/apic/local.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Historical copyright notices:
+ *
+ * Copyright 2004 James Cleverdon, IBM.
+ * (c) 1995 Alan Cox, Building #3 <alan@redhat.com>
+ * (c) 1998-99, 2000 Ingo Molnar <mingo@redhat.com>
+ * (c) 2002,2003 Andi Kleen, SuSE Labs.
+ */
+
+#include <linux/jump_label.h>
+
+#include <asm/irq_vectors.h>
+#include <asm/apic.h>
+
+/* X2APIC */
+void __x2apic_send_IPI_dest(unsigned int apicid, int vector, unsigned int dest);
+u32 x2apic_get_apic_id(u32 id);
+
+void x2apic_send_IPI_all(int vector);
+void x2apic_send_IPI_allbutself(int vector);
+void x2apic_send_IPI_self(int vector);
+extern u32 x2apic_max_apicid;
+
+/* IPI */
+
+DECLARE_STATIC_KEY_FALSE(apic_use_ipi_shorthand);
+
+static inline unsigned int __prepare_ICR(unsigned int shortcut, int vector,
+ unsigned int dest)
+{
+ unsigned int icr = shortcut | dest;
+
+ switch (vector) {
+ default:
+ icr |= APIC_DM_FIXED | vector;
+ break;
+ case NMI_VECTOR:
+ icr |= APIC_DM_NMI;
+ break;
+ }
+ return icr;
+}
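
/*
 * A standalone mirror of __prepare_ICR() above, with the xAPIC constant
 * values spelled out: the low ICR word is shortcut | destination bits |
 * delivery mode | vector, and NMI delivery ignores the vector number.
 * The "EX_" constants are local stand-ins for the kernel definitions.
 */
#include <stdio.h>
#include <stdint.h>

#define EX_APIC_DM_FIXED	0x00000u	/* delivery mode 000b */
#define EX_APIC_DM_NMI		0x00400u	/* delivery mode 100b */
#define EX_NMI_VECTOR		0x02

static uint32_t prepare_icr(uint32_t shortcut, int vector, uint32_t dest)
{
	uint32_t icr = shortcut | dest;

	if (vector == EX_NMI_VECTOR)
		icr |= EX_APIC_DM_NMI;		/* vector field is ignored */
	else
		icr |= EX_APIC_DM_FIXED | (uint32_t)vector;
	return icr;
}

int main(void)
{
	printf("0x%08x\n", prepare_icr(0, 0xfd, 0));		/* 0x000000fd */
	printf("0x%08x\n", prepare_icr(0, EX_NMI_VECTOR, 0));	/* 0x00000400 */
	return 0;
}
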
+
+void default_init_apic_ldr(void);
+
+void apic_mem_wait_icr_idle(void);
+u32 apic_mem_wait_icr_idle_timeout(void);
+
+/*
+ * This is used to send an IPI with no shorthand notation (the destination is
+ * specified in bits 56 to 63 of the ICR).
+ */
+void __default_send_IPI_dest_field(unsigned int mask, int vector, unsigned int dest);
+
+void default_send_IPI_single(int cpu, int vector);
+void default_send_IPI_single_phys(int cpu, int vector);
+void default_send_IPI_mask_sequence_phys(const struct cpumask *mask, int vector);
+void default_send_IPI_mask_allbutself_phys(const struct cpumask *mask, int vector);
+void default_send_IPI_allbutself(int vector);
+void default_send_IPI_all(int vector);
+void default_send_IPI_self(int vector);
+
+#ifdef CONFIG_X86_32
+void default_send_IPI_mask_sequence_logical(const struct cpumask *mask, int vector);
+void default_send_IPI_mask_allbutself_logical(const struct cpumask *mask, int vector);
+void default_send_IPI_mask_logical(const struct cpumask *mask, int vector);
+#endif
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
new file mode 100644
index 000000000000..66bc5d3e79db
--- /dev/null
+++ b/arch/x86/kernel/apic/msi.c
@@ -0,0 +1,391 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Support of MSI, HPET and DMAR interrupts.
+ *
+ * Copyright (C) 1997, 1998, 1999, 2000, 2009 Ingo Molnar, Hajnalka Szabo
+ * Moved from arch/x86/kernel/apic/io_apic.c.
+ * Jiang Liu <jiang.liu@linux.intel.com>
+ * Convert to hierarchical irqdomain
+ */
+#include <linux/mm.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/pci.h>
+#include <linux/dmar.h>
+#include <linux/hpet.h>
+#include <linux/msi.h>
+#include <asm/irqdomain.h>
+#include <asm/hpet.h>
+#include <asm/hw_irq.h>
+#include <asm/apic.h>
+#include <asm/irq_remapping.h>
+#include <asm/xen/hypervisor.h>
+
+struct irq_domain *x86_pci_msi_default_domain __ro_after_init;
+
+static void irq_msi_update_msg(struct irq_data *irqd, struct irq_cfg *cfg)
+{
+ struct msi_msg msg[2] = { [1] = { }, };
+
+ __irq_msi_compose_msg(cfg, msg, false);
+ irq_data_get_irq_chip(irqd)->irq_write_msi_msg(irqd, msg);
+}
+
+static int
+msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
+{
+ struct irq_cfg old_cfg, *cfg = irqd_cfg(irqd);
+ struct irq_data *parent = irqd->parent_data;
+ unsigned int cpu;
+ int ret;
+
+ /* Save the current configuration */
+ cpu = cpumask_first(irq_data_get_effective_affinity_mask(irqd));
+ old_cfg = *cfg;
+
+ /* Allocate a new target vector */
+ ret = parent->chip->irq_set_affinity(parent, mask, force);
+ if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
+ return ret;
+
+ /*
+ * For non-maskable and non-remapped MSI interrupts the migration
+ * to a different destination CPU and a different vector has to be
+ * done carefully to handle the possible stray interrupt which can be
+ * caused by the non-atomic update of the address/data pair.
+ *
+ * Direct update is possible when:
+ * - The MSI is maskable (remapped MSI does not use this code path).
+ * The reservation mode bit is set in this case.
+ * - The new vector is the same as the old vector
+ * - The old vector is MANAGED_IRQ_SHUTDOWN_VECTOR (interrupt starts up)
+ * - The interrupt is not yet started up
+ * - The new destination CPU is the same as the old destination CPU
+ */
+ if (!irqd_can_reserve(irqd) ||
+ cfg->vector == old_cfg.vector ||
+ old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
+ !irqd_is_started(irqd) ||
+ cfg->dest_apicid == old_cfg.dest_apicid) {
+ irq_msi_update_msg(irqd, cfg);
+ return ret;
+ }
+
+ /*
+ * Paranoia: Validate that the interrupt target is the local
+ * CPU.
+ */
+ if (WARN_ON_ONCE(cpu != smp_processor_id())) {
+ irq_msi_update_msg(irqd, cfg);
+ return ret;
+ }
+
+ /*
+ * Redirect the interrupt to the new vector on the current CPU
+ * first. This might cause a spurious interrupt on this vector if
+ * the device raises an interrupt right between this update and the
+ * update to the final destination CPU.
+ *
+ * If the vector is in use then the installed device handler will
+ * denote it as spurious which is no harm as this is a rare event
+ * and interrupt handlers have to cope with spurious interrupts
+ * anyway. If the vector is unused, then it is marked so it won't
+ * trigger the 'No irq handler for vector' warning in
+ * common_interrupt().
+ *
+ * This requires to hold vector lock to prevent concurrent updates to
+ * the affected vector.
+ */
+ lock_vector_lock();
+
+ /*
+ * Mark the new target vector on the local CPU if it is currently
+ * unused. Reuse the VECTOR_RETRIGGERED state which is also used in
+ * the CPU hotplug path for a similar purpose. This cannot be
+ * undone here as the current CPU has interrupts disabled and
+ * cannot handle the interrupt before the whole set_affinity()
+ * section is done. In the CPU unplug case, the current CPU is
+ * about to vanish and will not handle any interrupts anymore. The
+ * vector is cleaned up when the CPU comes online again.
+ */
+ if (IS_ERR_OR_NULL(this_cpu_read(vector_irq[cfg->vector])))
+ this_cpu_write(vector_irq[cfg->vector], VECTOR_RETRIGGERED);
+
+ /* Redirect it to the new vector on the local CPU temporarily */
+ old_cfg.vector = cfg->vector;
+ irq_msi_update_msg(irqd, &old_cfg);
+
+ /* Now transition it to the target CPU */
+ irq_msi_update_msg(irqd, cfg);
+
+ /*
+ * All interrupts after this point are now targeted at the new
+ * vector/CPU.
+ *
+ * Drop vector lock before testing whether the temporary assignment
+ * to the local CPU was hit by an interrupt raised in the device,
+ * because the retrigger function acquires vector lock again.
+ */
+ unlock_vector_lock();
+
+ /*
+ * Check whether the transition raced with a device interrupt and
+ * is pending in the local APICs IRR. It is safe to do this outside
+ * of vector lock as the irq_desc::lock of this interrupt is still
+ * held and interrupts are disabled: The check is not accessing the
+ * underlying vector store. It's just checking the local APIC's
+ * IRR.
+ */
+ if (lapic_vector_set_in_irr(cfg->vector))
+ irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
+
+ return ret;
+}
+
+/**
+ * pci_dev_has_default_msi_parent_domain - Check whether the device has the default
+ * MSI parent domain associated
+ * @dev: Pointer to the PCI device
+ */
+bool pci_dev_has_default_msi_parent_domain(struct pci_dev *dev)
+{
+ struct irq_domain *domain = dev_get_msi_domain(&dev->dev);
+
+ if (!domain)
+ domain = dev_get_msi_domain(&dev->bus->dev);
+ if (!domain)
+ return false;
+
+ return domain == x86_vector_domain;
+}
+
+/**
+ * x86_msi_prepare - Setup of msi_alloc_info_t for allocations
+ * @domain: The domain for which this setup happens
+ * @dev: The device for which interrupts are allocated
+ * @nvec: The number of vectors to allocate
+ * @alloc: The allocation info structure to initialize
+ *
+ * This function is to be used for all types of MSI domains above the x86
+ * vector domain and any intermediates. It is always invoked from the
+ * top level interrupt domain. The domain specific allocation
+ * functionality is determined via the @domain's bus token which allows to
+ * map the X86 specific allocation type.
+ */
+static int x86_msi_prepare(struct irq_domain *domain, struct device *dev,
+ int nvec, msi_alloc_info_t *alloc)
+{
+ struct msi_domain_info *info = domain->host_data;
+
+ init_irq_alloc_info(alloc, NULL);
+
+ switch (info->bus_token) {
+ case DOMAIN_BUS_PCI_DEVICE_MSI:
+ alloc->type = X86_IRQ_ALLOC_TYPE_PCI_MSI;
+ return 0;
+ case DOMAIN_BUS_PCI_DEVICE_MSIX:
+ alloc->type = X86_IRQ_ALLOC_TYPE_PCI_MSIX;
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
+/**
+ * x86_init_dev_msi_info - Domain info setup for MSI domains
+ * @dev: The device for which the domain should be created
+ * @domain: The (root) domain providing this callback
+ * @real_parent: The real parent domain of the to initialize domain
+ * @info: The domain info for the to initialize domain
+ *
+ * This function is to be used for all types of MSI domains above the x86
+ * vector domain and any intermediates. The domain specific functionality
+ * is determined via the @real_parent.
+ */
+static bool x86_init_dev_msi_info(struct device *dev, struct irq_domain *domain,
+ struct irq_domain *real_parent, struct msi_domain_info *info)
+{
+ const struct msi_parent_ops *pops = real_parent->msi_parent_ops;
+
+ /* MSI parent domain specific settings */
+ switch (real_parent->bus_token) {
+ case DOMAIN_BUS_ANY:
+ /* Only the vector domain can have the ANY token */
+ if (WARN_ON_ONCE(domain != real_parent))
+ return false;
+ info->chip->irq_set_affinity = msi_set_affinity;
+ info->chip->flags |= IRQCHIP_MOVE_DEFERRED;
+ break;
+ case DOMAIN_BUS_DMAR:
+ case DOMAIN_BUS_AMDVI:
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return false;
+ }
+
+ /* Is the target supported? */
+ switch(info->bus_token) {
+ case DOMAIN_BUS_PCI_DEVICE_MSI:
+ case DOMAIN_BUS_PCI_DEVICE_MSIX:
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return false;
+ }
+
+ /*
+ * Mask out the domain specific MSI feature flags which are not
+ * supported by the real parent.
+ */
+ info->flags &= pops->supported_flags;
+ /* Enforce the required flags */
+ info->flags |= X86_VECTOR_MSI_FLAGS_REQUIRED;
+
+ /* This is always invoked from the top level MSI domain! */
+ info->ops->msi_prepare = x86_msi_prepare;
+
+ info->chip->irq_ack = irq_chip_ack_parent;
+ info->chip->irq_retrigger = irq_chip_retrigger_hierarchy;
+ info->chip->flags |= IRQCHIP_SKIP_SET_WAKE |
+ IRQCHIP_AFFINITY_PRE_STARTUP;
+
+ info->handler = handle_edge_irq;
+ info->handler_name = "edge";
+
+ return true;
+}
+
+static const struct msi_parent_ops x86_vector_msi_parent_ops = {
+ .supported_flags = X86_VECTOR_MSI_FLAGS_SUPPORTED,
+ .init_dev_msi_info = x86_init_dev_msi_info,
+};
+
+struct irq_domain * __init native_create_pci_msi_domain(void)
+{
+ if (apic_is_disabled)
+ return NULL;
+
+ x86_vector_domain->flags |= IRQ_DOMAIN_FLAG_MSI_PARENT;
+ x86_vector_domain->msi_parent_ops = &x86_vector_msi_parent_ops;
+ return x86_vector_domain;
+}
+
+void __init x86_create_pci_msi_domain(void)
+{
+ x86_pci_msi_default_domain = x86_init.irqs.create_pci_msi_domain();
+}
+
+/* Keep around for hyperV */
+int pci_msi_prepare(struct irq_domain *domain, struct device *dev, int nvec,
+ msi_alloc_info_t *arg)
+{
+ init_irq_alloc_info(arg, NULL);
+
+ if (to_pci_dev(dev)->msix_enabled)
+ arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSIX;
+ else
+ arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSI;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(pci_msi_prepare);
+
+#ifdef CONFIG_DMAR_TABLE
+/*
+ * The Intel IOMMU (ab)uses the high bits of the MSI address to contain the
+ * high bits of the destination APIC ID. This can't be done in the general
+ * case for MSIs as it would be targeting real memory above 4GiB not the
+ * APIC.
+ */
+static void dmar_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
+{
+ __irq_msi_compose_msg(irqd_cfg(data), msg, true);
+}
+
+static void dmar_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
+{
+ dmar_msi_write(data->irq, msg);
+}
+
+static struct irq_chip dmar_msi_controller = {
+ .name = "DMAR-MSI",
+ .irq_unmask = dmar_msi_unmask,
+ .irq_mask = dmar_msi_mask,
+ .irq_ack = irq_chip_ack_parent,
+ .irq_set_affinity = msi_domain_set_affinity,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_compose_msi_msg = dmar_msi_compose_msg,
+ .irq_write_msi_msg = dmar_msi_write_msg,
+ .flags = IRQCHIP_SKIP_SET_WAKE | IRQCHIP_MOVE_DEFERRED |
+ IRQCHIP_AFFINITY_PRE_STARTUP,
+};
+
+static int dmar_msi_init(struct irq_domain *domain,
+ struct msi_domain_info *info, unsigned int virq,
+ irq_hw_number_t hwirq, msi_alloc_info_t *arg)
+{
+ irq_domain_set_info(domain, virq, arg->devid, info->chip, NULL,
+ handle_edge_irq, arg->data, "edge");
+
+ return 0;
+}
+
+static struct msi_domain_ops dmar_msi_domain_ops = {
+ .msi_init = dmar_msi_init,
+};
+
+static struct msi_domain_info dmar_msi_domain_info = {
+ .ops = &dmar_msi_domain_ops,
+ .chip = &dmar_msi_controller,
+ .flags = MSI_FLAG_USE_DEF_DOM_OPS,
+};
+
+static struct irq_domain *dmar_get_irq_domain(void)
+{
+ static struct irq_domain *dmar_domain;
+ static DEFINE_MUTEX(dmar_lock);
+ struct fwnode_handle *fn;
+
+ mutex_lock(&dmar_lock);
+ if (dmar_domain)
+ goto out;
+
+ fn = irq_domain_alloc_named_fwnode("DMAR-MSI");
+ if (fn) {
+ dmar_domain = msi_create_irq_domain(fn, &dmar_msi_domain_info,
+ x86_vector_domain);
+ if (!dmar_domain)
+ irq_domain_free_fwnode(fn);
+ }
+out:
+ mutex_unlock(&dmar_lock);
+ return dmar_domain;
+}
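
/*
 * A kernel-style sketch of the create-once pattern above, using a
 * hypothetical object type: a function-local static pointer guarded by
 * a mutex, so concurrent first callers cannot both allocate, and every
 * later call returns the cached object.
 */
#include <linux/mutex.h>
#include <linux/slab.h>

struct example_obj {
	int dummy;
};

static struct example_obj *get_example_obj(void)
{
	static struct example_obj *obj;
	static DEFINE_MUTEX(example_lock);

	mutex_lock(&example_lock);
	if (!obj)
		obj = kzalloc(sizeof(*obj), GFP_KERNEL);	/* first caller creates */
	mutex_unlock(&example_lock);

	return obj;	/* NULL only if the allocation failed */
}
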
+
+int dmar_alloc_hwirq(int id, int node, void *arg)
+{
+ struct irq_domain *domain = dmar_get_irq_domain();
+ struct irq_alloc_info info;
+
+ if (!domain)
+ return -1;
+
+ init_irq_alloc_info(&info, NULL);
+ info.type = X86_IRQ_ALLOC_TYPE_DMAR;
+ info.devid = id;
+ info.hwirq = id;
+ info.data = arg;
+
+ return irq_domain_alloc_irqs(domain, 1, node, &info);
+}
+
+void dmar_free_hwirq(int irq)
+{
+ irq_domain_free_irqs(irq, 1);
+}
+#endif
+
+bool arch_restore_msi_irqs(struct pci_dev *dev)
+{
+ return xen_initdom_restore_msi(dev);
+}
diff --git a/arch/x86/kernel/apic/nmi.c b/arch/x86/kernel/apic/nmi.c
deleted file mode 100644
index 1edaf15c0b8e..000000000000
--- a/arch/x86/kernel/apic/nmi.c
+++ /dev/null
@@ -1,574 +0,0 @@
-/*
- * NMI watchdog support on APIC systems
- *
- * Started by Ingo Molnar <mingo@redhat.com>
- *
- * Fixes:
- * Mikael Pettersson : AMD K7 support for local APIC NMI watchdog.
- * Mikael Pettersson : Power Management for local APIC NMI watchdog.
- * Mikael Pettersson : Pentium 4 support for local APIC NMI watchdog.
- * Pavel Machek and
- * Mikael Pettersson : PM converted to driver model. Disable/enable API.
- */
-
-#include <asm/apic.h>
-
-#include <linux/nmi.h>
-#include <linux/mm.h>
-#include <linux/delay.h>
-#include <linux/interrupt.h>
-#include <linux/module.h>
-#include <linux/slab.h>
-#include <linux/sysdev.h>
-#include <linux/sysctl.h>
-#include <linux/percpu.h>
-#include <linux/kprobes.h>
-#include <linux/cpumask.h>
-#include <linux/kernel_stat.h>
-#include <linux/kdebug.h>
-#include <linux/smp.h>
-
-#include <asm/i8259.h>
-#include <asm/io_apic.h>
-#include <asm/proto.h>
-#include <asm/timer.h>
-
-#include <asm/mce.h>
-
-#include <asm/mach_traps.h>
-
-int unknown_nmi_panic;
-int nmi_watchdog_enabled;
-
-/* For reliability, we're prepared to waste bits here. */
-static DECLARE_BITMAP(backtrace_mask, NR_CPUS) __read_mostly;
-
-/* nmi_active:
- * >0: the lapic NMI watchdog is active, but can be disabled
- * <0: the lapic NMI watchdog has not been set up, and cannot
- * be enabled
- * 0: the lapic NMI watchdog is disabled, but can be enabled
- */
-atomic_t nmi_active = ATOMIC_INIT(0); /* oprofile uses this */
-EXPORT_SYMBOL(nmi_active);
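As a summary of how the code below drives those three states (a reading of this file, not new behaviour): setup_apic_nmi_watchdog() moves the count from 0 towards +n with one atomic_inc() per CPU that arms its watchdog; stop_apic_nmi_watchdog() walks it back down with atomic_dec(); and check_nmi_watchdog() does atomic_set(&nmi_active, -1) when every CPU's watchdog turns out broken, disabling it permanently.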
-
-unsigned int nmi_watchdog = NMI_NONE;
-EXPORT_SYMBOL(nmi_watchdog);
-
-static int panic_on_timeout;
-
-static unsigned int nmi_hz = HZ;
-static DEFINE_PER_CPU(short, wd_enabled);
-static int endflag __initdata;
-
-static inline unsigned int get_nmi_count(int cpu)
-{
- return per_cpu(irq_stat, cpu).__nmi_count;
-}
-
-static inline int mce_in_progress(void)
-{
-#if defined(CONFIG_X86_MCE)
- return atomic_read(&mce_entry) > 0;
-#endif
- return 0;
-}
-
-/*
- * Take the local apic timer and PIT/HPET into account. We don't
- * know which one is active when highres/dyntick is on.
- */
-static inline unsigned int get_timer_irqs(int cpu)
-{
- return per_cpu(irq_stat, cpu).apic_timer_irqs +
- per_cpu(irq_stat, cpu).irq0_irqs;
-}
-
-#ifdef CONFIG_SMP
-/*
- * The performance counters used by NMI_LOCAL_APIC don't trigger when
- * the CPU is idle. To make sure the NMI watchdog really ticks on all
- * CPUs during the test, make them busy.
- */
-static __init void nmi_cpu_busy(void *data)
-{
- local_irq_enable_in_hardirq();
- /*
- * Intentionally don't use cpu_relax here. This is
- * to make sure that the performance counter really ticks,
- * even if there is a simulator or similar that catches the
- * pause instruction. On a real HT machine this is fine because
- * all other CPUs are busy with "useless" delay loops and don't
- * care if they get somewhat fewer cycles.
- */
- while (endflag == 0)
- mb();
-}
-#endif
-
-static void report_broken_nmi(int cpu, unsigned int *prev_nmi_count)
-{
- printk(KERN_CONT "\n");
-
- printk(KERN_WARNING
- "WARNING: CPU#%d: NMI appears to be stuck (%d->%d)!\n",
- cpu, prev_nmi_count[cpu], get_nmi_count(cpu));
-
- printk(KERN_WARNING
- "Please report this to bugzilla.kernel.org,\n");
- printk(KERN_WARNING
- "and attach the output of the 'dmesg' command.\n");
-
- per_cpu(wd_enabled, cpu) = 0;
- atomic_dec(&nmi_active);
-}
-
-static void __acpi_nmi_disable(void *__unused)
-{
- apic_write(APIC_LVT0, APIC_DM_NMI | APIC_LVT_MASKED);
-}
-
-int __init check_nmi_watchdog(void)
-{
- unsigned int *prev_nmi_count;
- int cpu;
-
- if (!nmi_watchdog_active() || !atomic_read(&nmi_active))
- return 0;
-
- prev_nmi_count = kmalloc(nr_cpu_ids * sizeof(int), GFP_KERNEL);
- if (!prev_nmi_count)
- goto error;
-
- printk(KERN_INFO "Testing NMI watchdog ... ");
-
-#ifdef CONFIG_SMP
- if (nmi_watchdog == NMI_LOCAL_APIC)
- smp_call_function(nmi_cpu_busy, (void *)&endflag, 0);
-#endif
-
- for_each_possible_cpu(cpu)
- prev_nmi_count[cpu] = get_nmi_count(cpu);
- local_irq_enable();
- mdelay((20 * 1000) / nmi_hz); /* wait 20 ticks */
-
- for_each_online_cpu(cpu) {
- if (!per_cpu(wd_enabled, cpu))
- continue;
- if (get_nmi_count(cpu) - prev_nmi_count[cpu] <= 5)
- report_broken_nmi(cpu, prev_nmi_count);
- }
- endflag = 1;
- if (!atomic_read(&nmi_active)) {
- kfree(prev_nmi_count);
- atomic_set(&nmi_active, -1);
- goto error;
- }
- printk("OK.\n");
-
- /*
- * now that we know it works, we can reduce the NMI frequency to
- * something more reasonable; this makes a difference in some configs
- */
- if (nmi_watchdog == NMI_LOCAL_APIC)
- nmi_hz = lapic_adjust_nmi_hz(1);
-
- kfree(prev_nmi_count);
- return 0;
-error:
- if (nmi_watchdog == NMI_IO_APIC) {
- if (!timer_through_8259)
- legacy_pic->chip->mask(0);
- on_each_cpu(__acpi_nmi_disable, NULL, 1);
- }
-
-#ifdef CONFIG_X86_32
- timer_ack = 0;
-#endif
- return -1;
-}
-
-static int __init setup_nmi_watchdog(char *str)
-{
- unsigned int nmi;
-
- if (!strncmp(str, "panic", 5)) {
- panic_on_timeout = 1;
- str = strchr(str, ',');
- if (!str)
- return 1;
- ++str;
- }
-
- if (!strncmp(str, "lapic", 5))
- nmi_watchdog = NMI_LOCAL_APIC;
- else if (!strncmp(str, "ioapic", 6))
- nmi_watchdog = NMI_IO_APIC;
- else {
- get_option(&str, &nmi);
- if (nmi >= NMI_INVALID)
- return 0;
- nmi_watchdog = nmi;
- }
-
- return 1;
-}
-__setup("nmi_watchdog=", setup_nmi_watchdog);
-
-/*
- * Suspend/resume support
- */
-#ifdef CONFIG_PM
-
-static int nmi_pm_active; /* nmi_active before suspend */
-
-static int lapic_nmi_suspend(struct sys_device *dev, pm_message_t state)
-{
- /* only CPU0 goes here, other CPUs should be offline */
- nmi_pm_active = atomic_read(&nmi_active);
- stop_apic_nmi_watchdog(NULL);
- BUG_ON(atomic_read(&nmi_active) != 0);
- return 0;
-}
-
-static int lapic_nmi_resume(struct sys_device *dev)
-{
- /* only CPU0 goes here, other CPUs should be offline */
- if (nmi_pm_active > 0) {
- setup_apic_nmi_watchdog(NULL);
- touch_nmi_watchdog();
- }
- return 0;
-}
-
-static struct sysdev_class nmi_sysclass = {
- .name = "lapic_nmi",
- .resume = lapic_nmi_resume,
- .suspend = lapic_nmi_suspend,
-};
-
-static struct sys_device device_lapic_nmi = {
- .id = 0,
- .cls = &nmi_sysclass,
-};
-
-static int __init init_lapic_nmi_sysfs(void)
-{
- int error;
-
- /*
- * should really be a BUG_ON but because this is an
- * init call, it just doesn't work. -dcz
- */
- if (nmi_watchdog != NMI_LOCAL_APIC)
- return 0;
-
- if (atomic_read(&nmi_active) < 0)
- return 0;
-
- error = sysdev_class_register(&nmi_sysclass);
- if (!error)
- error = sysdev_register(&device_lapic_nmi);
- return error;
-}
-
-/* must come after the local APIC's device_initcall() */
-late_initcall(init_lapic_nmi_sysfs);
-
-#endif /* CONFIG_PM */
-
-static void __acpi_nmi_enable(void *__unused)
-{
- apic_write(APIC_LVT0, APIC_DM_NMI);
-}
-
-/*
- * Enable timer based NMIs on all CPUs:
- */
-void acpi_nmi_enable(void)
-{
- if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC)
- on_each_cpu(__acpi_nmi_enable, NULL, 1);
-}
-
-/*
- * Disable timer based NMIs on all CPUs:
- */
-void acpi_nmi_disable(void)
-{
- if (atomic_read(&nmi_active) && nmi_watchdog == NMI_IO_APIC)
- on_each_cpu(__acpi_nmi_disable, NULL, 1);
-}
-
-/*
- * This function is called as soon as the LAPIC NMI watchdog driver has everything
- * in place and it's ready to check if the NMIs belong to the NMI watchdog
- */
-void cpu_nmi_set_wd_enabled(void)
-{
- __get_cpu_var(wd_enabled) = 1;
-}
-
-void setup_apic_nmi_watchdog(void *unused)
-{
- if (__get_cpu_var(wd_enabled))
- return;
-
- /* cheap hack to support suspend/resume */
- /* if cpu0 is not active, neither should the other cpus be */
- if (smp_processor_id() != 0 && atomic_read(&nmi_active) <= 0)
- return;
-
- switch (nmi_watchdog) {
- case NMI_LOCAL_APIC:
- if (lapic_watchdog_init(nmi_hz) < 0) {
- __get_cpu_var(wd_enabled) = 0;
- return;
- }
- /* FALL THROUGH */
- case NMI_IO_APIC:
- __get_cpu_var(wd_enabled) = 1;
- atomic_inc(&nmi_active);
- }
-}
-
-void stop_apic_nmi_watchdog(void *unused)
-{
- /* only support LOCAL and IO APICs for now */
- if (!nmi_watchdog_active())
- return;
- if (__get_cpu_var(wd_enabled) == 0)
- return;
- if (nmi_watchdog == NMI_LOCAL_APIC)
- lapic_watchdog_stop();
- else
- __acpi_nmi_disable(NULL);
- __get_cpu_var(wd_enabled) = 0;
- atomic_dec(&nmi_active);
-}
-
-/*
- * the best way to detect whether a CPU has a 'hard lockup' problem
- * is to check its local APIC timer IRQ counts. If they are not
- * changing then that CPU has some problem.
- *
- * as these watchdog NMI IRQs are generated on every CPU, we only
- * have to check the current processor.
- *
- * since NMIs don't listen to _any_ locks, we have to be extremely
- * careful not to rely on unsafe variables. The printk might lock
- * up though, so we have to break up any console locks first ...
- * [when there will be more tty-related locks, break them up here too!]
- */
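Distilled, the per-CPU check implemented below amounts to the following simplified pseudo-C (the helper names are invented for illustration):

	/* on each watchdog NMI, on the current CPU: */
	if (timer_irq_sum(cpu) == last_seen_sum(cpu)) {	/* no forward progress */
		if (++alert_counter == 5 * nmi_hz)	/* ~5 seconds stuck */
			die_nmi("LOCKUP", regs, panic_on_timeout);
	} else {
		remember_sum(cpu);			/* progress was made */
		alert_counter = 0;
	}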
-
-static DEFINE_PER_CPU(unsigned, last_irq_sum);
-static DEFINE_PER_CPU(long, alert_counter);
-static DEFINE_PER_CPU(int, nmi_touch);
-
-void touch_nmi_watchdog(void)
-{
- if (nmi_watchdog_active()) {
- unsigned cpu;
-
- /*
- * Tell other CPUs to reset their alert counters. We cannot
- * do it ourselves because the alert count increase is not
- * atomic.
- */
- for_each_present_cpu(cpu) {
- if (per_cpu(nmi_touch, cpu) != 1)
- per_cpu(nmi_touch, cpu) = 1;
- }
- }
-
- /*
- * Tickle the softlockup detector too:
- */
- touch_softlockup_watchdog();
-}
-EXPORT_SYMBOL(touch_nmi_watchdog);
-
-notrace __kprobes int
-nmi_watchdog_tick(struct pt_regs *regs, unsigned reason)
-{
- /*
- * Since current_thread_info() is always on the stack, and we
- * always switch the stack NMI-atomically, it's safe to use
- * smp_processor_id().
- */
- unsigned int sum;
- int touched = 0;
- int cpu = smp_processor_id();
- int rc = 0;
-
- /* check for other users first */
- if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)
- == NOTIFY_STOP) {
- rc = 1;
- touched = 1;
- }
-
- sum = get_timer_irqs(cpu);
-
- if (__get_cpu_var(nmi_touch)) {
- __get_cpu_var(nmi_touch) = 0;
- touched = 1;
- }
-
- /* We can be called before check_nmi_watchdog, hence NULL check. */
- if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
- static DEFINE_RAW_SPINLOCK(lock); /* Serialise the printks */
-
- raw_spin_lock(&lock);
- printk(KERN_WARNING "NMI backtrace for cpu %d\n", cpu);
- show_regs(regs);
- dump_stack();
- raw_spin_unlock(&lock);
- cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
-
- rc = 1;
- }
-
- /* Could check oops_in_progress here too, but it's safer not to */
- if (mce_in_progress())
- touched = 1;
-
- /* if none of the timers is firing, this cpu isn't doing much */
- if (!touched && __get_cpu_var(last_irq_sum) == sum) {
- /*
- * Ayiee, looks like this CPU is stuck ...
- * wait a few IRQs (5 seconds) before doing the oops ...
- */
- __this_cpu_inc(alert_counter);
- if (__this_cpu_read(alert_counter) == 5 * nmi_hz)
- /*
- * die_nmi will return ONLY if NOTIFY_STOP happens..
- */
- die_nmi("BUG: NMI Watchdog detected LOCKUP",
- regs, panic_on_timeout);
- } else {
- __get_cpu_var(last_irq_sum) = sum;
- __this_cpu_write(alert_counter, 0);
- }
-
- /* see if the nmi watchdog went off */
- if (!__get_cpu_var(wd_enabled))
- return rc;
- switch (nmi_watchdog) {
- case NMI_LOCAL_APIC:
- rc |= lapic_wd_event(nmi_hz);
- break;
- case NMI_IO_APIC:
- /*
- * Don't know how to accurately check for this;
- * just assume it was a watchdog timer interrupt.
- * This matches the old behaviour.
- */
- rc = 1;
- break;
- }
- return rc;
-}
-
-#ifdef CONFIG_SYSCTL
-
-static void enable_ioapic_nmi_watchdog_single(void *unused)
-{
- __get_cpu_var(wd_enabled) = 1;
- atomic_inc(&nmi_active);
- __acpi_nmi_enable(NULL);
-}
-
-static void enable_ioapic_nmi_watchdog(void)
-{
- on_each_cpu(enable_ioapic_nmi_watchdog_single, NULL, 1);
- touch_nmi_watchdog();
-}
-
-static void disable_ioapic_nmi_watchdog(void)
-{
- on_each_cpu(stop_apic_nmi_watchdog, NULL, 1);
-}
-
-static int __init setup_unknown_nmi_panic(char *str)
-{
- unknown_nmi_panic = 1;
- return 1;
-}
-__setup("unknown_nmi_panic", setup_unknown_nmi_panic);
-
-static int unknown_nmi_panic_callback(struct pt_regs *regs, int cpu)
-{
- unsigned char reason = get_nmi_reason();
- char buf[64];
-
- sprintf(buf, "NMI received for unknown reason %02x\n", reason);
- die_nmi(buf, regs, 1); /* Always panic here */
- return 0;
-}
-
-/*
- * proc handler for /proc/sys/kernel/nmi
- */
-int proc_nmi_enabled(struct ctl_table *table, int write,
- void __user *buffer, size_t *length, loff_t *ppos)
-{
- int old_state;
-
- nmi_watchdog_enabled = (atomic_read(&nmi_active) > 0) ? 1 : 0;
- old_state = nmi_watchdog_enabled;
- proc_dointvec(table, write, buffer, length, ppos);
- if (!!old_state == !!nmi_watchdog_enabled)
- return 0;
-
- if (atomic_read(&nmi_active) < 0 || !nmi_watchdog_active()) {
- printk(KERN_WARNING
- "NMI watchdog is permanently disabled\n");
- return -EIO;
- }
-
- if (nmi_watchdog == NMI_LOCAL_APIC) {
- if (nmi_watchdog_enabled)
- enable_lapic_nmi_watchdog();
- else
- disable_lapic_nmi_watchdog();
- } else if (nmi_watchdog == NMI_IO_APIC) {
- if (nmi_watchdog_enabled)
- enable_ioapic_nmi_watchdog();
- else
- disable_ioapic_nmi_watchdog();
- } else {
- printk(KERN_WARNING
- "NMI watchdog doesn't know what hardware to touch\n");
- return -EIO;
- }
- return 0;
-}
-
-#endif /* CONFIG_SYSCTL */
-
-int do_nmi_callback(struct pt_regs *regs, int cpu)
-{
-#ifdef CONFIG_SYSCTL
- if (unknown_nmi_panic)
- return unknown_nmi_panic_callback(regs, cpu);
-#endif
- return 0;
-}
-
-void arch_trigger_all_cpu_backtrace(void)
-{
- int i;
-
- cpumask_copy(to_cpumask(backtrace_mask), cpu_online_mask);
-
- printk(KERN_INFO "sending NMI to all CPUs:\n");
- apic->send_IPI_all(NMI_VECTOR);
-
- /* Wait for up to 10 seconds for all CPUs to do the backtrace */
- for (i = 0; i < 10 * 1000; i++) {
- if (cpumask_empty(to_cpumask(backtrace_mask)))
- break;
- mdelay(1);
- }
-}
diff --git a/arch/x86/kernel/apic/numaq_32.c b/arch/x86/kernel/apic/numaq_32.c
deleted file mode 100644
index 3e28401f161c..000000000000
--- a/arch/x86/kernel/apic/numaq_32.c
+++ /dev/null
@@ -1,549 +0,0 @@
-/*
- * Written by: Patricia Gaughen, IBM Corporation
- *
- * Copyright (C) 2002, IBM Corp.
- * Copyright (C) 2009, Red Hat, Inc., Ingo Molnar
- *
- * All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT. See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <gone@us.ibm.com>
- */
-#include <linux/nodemask.h>
-#include <linux/topology.h>
-#include <linux/bootmem.h>
-#include <linux/threads.h>
-#include <linux/cpumask.h>
-#include <linux/kernel.h>
-#include <linux/mmzone.h>
-#include <linux/module.h>
-#include <linux/string.h>
-#include <linux/init.h>
-#include <linux/numa.h>
-#include <linux/smp.h>
-#include <linux/io.h>
-#include <linux/mm.h>
-
-#include <asm/processor.h>
-#include <asm/fixmap.h>
-#include <asm/mpspec.h>
-#include <asm/numaq.h>
-#include <asm/setup.h>
-#include <asm/apic.h>
-#include <asm/e820.h>
-#include <asm/ipi.h>
-
-#define MB_TO_PAGES(addr) ((addr) << (20 - PAGE_SHIFT))
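For example, with 4 KiB pages (PAGE_SHIFT == 12) the macro shifts by 8, so MB_TO_PAGES(1) == 256 and MB_TO_PAGES(64) == 16384.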
-
-int found_numaq;
-
-/*
- * Have to match translation table entries to main table entries by counter,
- * hence the mpc_record variable ... can't see a less disgusting way of
- * doing this ...
- */
-struct mpc_trans {
- unsigned char mpc_type;
- unsigned char trans_len;
- unsigned char trans_type;
- unsigned char trans_quad;
- unsigned char trans_global;
- unsigned char trans_local;
- unsigned short trans_reserved;
-};
-
-static int mpc_record;
-
-static struct mpc_trans *translation_table[MAX_MPC_ENTRY];
-
-int mp_bus_id_to_node[MAX_MP_BUSSES];
-int mp_bus_id_to_local[MAX_MP_BUSSES];
-int quad_local_to_mp_bus_id[NR_CPUS/4][4];
-
-
-static inline void numaq_register_node(int node, struct sys_cfg_data *scd)
-{
- struct eachquadmem *eq = scd->eq + node;
-
- node_set_online(node);
-
- /* Convert to pages */
- node_start_pfn[node] =
- MB_TO_PAGES(eq->hi_shrd_mem_start - eq->priv_mem_size);
-
- node_end_pfn[node] =
- MB_TO_PAGES(eq->hi_shrd_mem_start + eq->hi_shrd_mem_size);
-
- e820_register_active_regions(node, node_start_pfn[node],
- node_end_pfn[node]);
-
- memory_present(node, node_start_pfn[node], node_end_pfn[node]);
-
- node_remap_size[node] = node_memmap_size_bytes(node,
- node_start_pfn[node],
- node_end_pfn[node]);
-}
-
-/*
- * Function: smp_dump_qct()
- *
- * Description: gets memory layout from the quad config table. This
- * function also updates node_online_map with the nodes (quads) present.
- */
-static void __init smp_dump_qct(void)
-{
- struct sys_cfg_data *scd;
- int node;
-
- scd = (void *)__va(SYS_CFG_DATA_PRIV_ADDR);
-
- nodes_clear(node_online_map);
- for_each_node(node) {
- if (scd->quads_present31_0 & (1 << node))
- numaq_register_node(node, scd);
- }
-}
-
-void __cpuinit numaq_tsc_disable(void)
-{
- if (!found_numaq)
- return;
-
- if (num_online_nodes() > 1) {
- printk(KERN_DEBUG "NUMAQ: disabling TSC\n");
- setup_clear_cpu_cap(X86_FEATURE_TSC);
- }
-}
-
-static void __init numaq_tsc_init(void)
-{
- numaq_tsc_disable();
-}
-
-static inline int generate_logical_apicid(int quad, int phys_apicid)
-{
- return (quad << 4) + (phys_apicid ? phys_apicid << 1 : 1);
-}
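Two worked values for this encoding (illustrative): quad 2 with physical APIC ID 3 gives (2 << 4) + (3 << 1) = 0x26, while the special-cased physical ID 0 gives (quad << 4) + 1, e.g. 0x21 on quad 2.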
-
-/* x86_quirks member */
-static int mpc_apic_id(struct mpc_cpu *m)
-{
- int quad = translation_table[mpc_record]->trans_quad;
- int logical_apicid = generate_logical_apicid(quad, m->apicid);
-
- printk(KERN_DEBUG
- "Processor #%d %u:%u APIC version %d (quad %d, apic %d)\n",
- m->apicid, (m->cpufeature & CPU_FAMILY_MASK) >> 8,
- (m->cpufeature & CPU_MODEL_MASK) >> 4,
- m->apicver, quad, logical_apicid);
-
- return logical_apicid;
-}
-
-/* x86_quirks member */
-static void mpc_oem_bus_info(struct mpc_bus *m, char *name)
-{
- int quad = translation_table[mpc_record]->trans_quad;
- int local = translation_table[mpc_record]->trans_local;
-
- mp_bus_id_to_node[m->busid] = quad;
- mp_bus_id_to_local[m->busid] = local;
-
- printk(KERN_INFO "Bus #%d is %s (node %d)\n", m->busid, name, quad);
-}
-
-/* x86_quirks member */
-static void mpc_oem_pci_bus(struct mpc_bus *m)
-{
- int quad = translation_table[mpc_record]->trans_quad;
- int local = translation_table[mpc_record]->trans_local;
-
- quad_local_to_mp_bus_id[quad][local] = m->busid;
-}
-
-/*
- * Called from mpparse code.
- * mode = 0: prescan
- * mode = 1: one mpc entry scanned
- */
-static void numaq_mpc_record(unsigned int mode)
-{
- if (!mode)
- mpc_record = 0;
- else
- mpc_record++;
-}
-
-static void __init MP_translation_info(struct mpc_trans *m)
-{
- printk(KERN_INFO
- "Translation: record %d, type %d, quad %d, global %d, local %d\n",
- mpc_record, m->trans_type, m->trans_quad, m->trans_global,
- m->trans_local);
-
- if (mpc_record >= MAX_MPC_ENTRY)
- printk(KERN_ERR "MAX_MPC_ENTRY exceeded!\n");
- else
- translation_table[mpc_record] = m; /* stash this for later */
-
- if (m->trans_quad < MAX_NUMNODES && !node_online(m->trans_quad))
- node_set_online(m->trans_quad);
-}
-
-static int __init mpf_checksum(unsigned char *mp, int len)
-{
- int sum = 0;
-
- while (len--)
- sum += *mp++;
-
- return sum & 0xFF;
-}
-
-/*
- * Read/parse the MPC oem tables
- */
-static void __init smp_read_mpc_oem(struct mpc_table *mpc)
-{
- struct mpc_oemtable *oemtable = (void *)(long)mpc->oemptr;
- int count = sizeof(*oemtable); /* the header size */
- unsigned char *oemptr = ((unsigned char *)oemtable) + count;
-
- mpc_record = 0;
- printk(KERN_INFO
- "Found an OEM MPC table at %8p - parsing it...\n", oemtable);
-
- if (memcmp(oemtable->signature, MPC_OEM_SIGNATURE, 4)) {
- printk(KERN_WARNING
- "SMP mpc oemtable: bad signature [%c%c%c%c]!\n",
- oemtable->signature[0], oemtable->signature[1],
- oemtable->signature[2], oemtable->signature[3]);
- return;
- }
-
- if (mpf_checksum((unsigned char *)oemtable, oemtable->length)) {
- printk(KERN_WARNING "SMP oem mptable: checksum error!\n");
- return;
- }
-
- while (count < oemtable->length) {
- switch (*oemptr) {
- case MP_TRANSLATION:
- {
- struct mpc_trans *m = (void *)oemptr;
-
- MP_translation_info(m);
- oemptr += sizeof(*m);
- count += sizeof(*m);
- ++mpc_record;
- break;
- }
- default:
- printk(KERN_WARNING
- "Unrecognised OEM table entry type! - %d\n",
- (int)*oemptr);
- return;
- }
- }
-}
-
-static __init void early_check_numaq(void)
-{
- /*
- * get boot-time SMP configuration:
- */
- if (smp_found_config)
- early_get_smp_config();
-
- if (found_numaq) {
- x86_init.mpparse.mpc_record = numaq_mpc_record;
- x86_init.mpparse.setup_ioapic_ids = x86_init_noop;
- x86_init.mpparse.mpc_apic_id = mpc_apic_id;
- x86_init.mpparse.smp_read_mpc_oem = smp_read_mpc_oem;
- x86_init.mpparse.mpc_oem_pci_bus = mpc_oem_pci_bus;
- x86_init.mpparse.mpc_oem_bus_info = mpc_oem_bus_info;
- x86_init.timers.tsc_pre_init = numaq_tsc_init;
- x86_init.pci.init = pci_numaq_init;
- }
-}
-
-int __init get_memcfg_numaq(void)
-{
- early_check_numaq();
- if (!found_numaq)
- return 0;
- smp_dump_qct();
-
- return 1;
-}
-
-#define NUMAQ_APIC_DFR_VALUE (APIC_DFR_CLUSTER)
-
-static inline unsigned int numaq_get_apic_id(unsigned long x)
-{
- return (x >> 24) & 0x0F;
-}
-
-static inline void numaq_send_IPI_mask(const struct cpumask *mask, int vector)
-{
- default_send_IPI_mask_sequence_logical(mask, vector);
-}
-
-static inline void numaq_send_IPI_allbutself(int vector)
-{
- default_send_IPI_mask_allbutself_logical(cpu_online_mask, vector);
-}
-
-static inline void numaq_send_IPI_all(int vector)
-{
- numaq_send_IPI_mask(cpu_online_mask, vector);
-}
-
-#define NUMAQ_TRAMPOLINE_PHYS_LOW (0x8)
-#define NUMAQ_TRAMPOLINE_PHYS_HIGH (0xa)
-
-/*
- * Because we use NMIs rather than the INIT-STARTUP sequence to
- * bootstrap the CPUs, the APIC may be in a weird state. Kick it:
- */
-static inline void numaq_smp_callin_clear_local_apic(void)
-{
- clear_local_APIC();
-}
-
-static inline const struct cpumask *numaq_target_cpus(void)
-{
- return cpu_all_mask;
-}
-
-static unsigned long numaq_check_apicid_used(physid_mask_t *map, int apicid)
-{
- return physid_isset(apicid, *map);
-}
-
-static inline unsigned long numaq_check_apicid_present(int bit)
-{
- return physid_isset(bit, phys_cpu_present_map);
-}
-
-static inline int numaq_apic_id_registered(void)
-{
- return 1;
-}
-
-static inline void numaq_init_apic_ldr(void)
-{
- /* Already done in NUMA-Q firmware */
-}
-
-static inline void numaq_setup_apic_routing(void)
-{
- printk(KERN_INFO
- "Enabling APIC mode: NUMA-Q. Using %d I/O APICs\n",
- nr_ioapics);
-}
-
-/*
- * Skip adding the timer int on secondary nodes, which causes
- * a small but painful rift in the time-space continuum.
- */
-static inline int numaq_multi_timer_check(int apic, int irq)
-{
- return apic != 0 && irq == 0;
-}
-
-static inline void numaq_ioapic_phys_id_map(physid_mask_t *phys_map, physid_mask_t *retmap)
-{
- /* We don't have a good way to do this yet - hack */
- return physids_promote(0xFUL, retmap);
-}
-
-static inline int numaq_cpu_to_logical_apicid(int cpu)
-{
- if (cpu >= nr_cpu_ids)
- return BAD_APICID;
- return cpu_2_logical_apicid[cpu];
-}
-
-/*
- * Supporting over 60 cpus on NUMA-Q requires a locality-dependent
- * cpu to APIC ID relation to properly interact with the intelligent
- * mode of the cluster controller.
- */
-static inline int numaq_cpu_present_to_apicid(int mps_cpu)
-{
- if (mps_cpu < 60)
- return ((mps_cpu >> 2) << 4) | (1 << (mps_cpu & 0x3));
- else
- return BAD_APICID;
-}
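Worked example (illustrative): mps_cpu 5 sits on quad 1 as local CPU 1, so this evaluates to ((5 >> 2) << 4) | (1 << (5 & 0x3)) = 0x10 | 0x02 = 0x12, matching the quad-in-high-nibble, one-hot-CPU-in-low-nibble layout produced by generate_logical_apicid() above.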
-
-static inline int numaq_apicid_to_node(int logical_apicid)
-{
- return logical_apicid >> 4;
-}
-
-static void numaq_apicid_to_cpu_present(int logical_apicid, physid_mask_t *retmap)
-{
- int node = numaq_apicid_to_node(logical_apicid);
- int cpu = __ffs(logical_apicid & 0xf);
-
- physid_set_mask_of_physid(cpu + 4*node, retmap);
-}
-
-/* Where the IO area was mapped on multiquad, always 0 otherwise */
-void *xquad_portio;
-
-static inline int numaq_check_phys_apicid_present(int phys_apicid)
-{
- return 1;
-}
-
-/*
- * We use physical apicids here, not logical, so just return the default
- * physical broadcast to stop people from breaking us
- */
-static unsigned int numaq_cpu_mask_to_apicid(const struct cpumask *cpumask)
-{
- return 0x0F;
-}
-
-static inline unsigned int
-numaq_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
- const struct cpumask *andmask)
-{
- return 0x0F;
-}
-
-/* No NUMA-Q box has a HT CPU, but it can't hurt to use the default code. */
-static inline int numaq_phys_pkg_id(int cpuid_apic, int index_msb)
-{
- return cpuid_apic >> index_msb;
-}
-
-static int
-numaq_mps_oem_check(struct mpc_table *mpc, char *oem, char *productid)
-{
- if (strncmp(oem, "IBM NUMA", 8))
- printk(KERN_ERR "Warning! Not a NUMA-Q system!\n");
- else
- found_numaq = 1;
-
- return found_numaq;
-}
-
-static int probe_numaq(void)
-{
- /* already know from get_memcfg_numaq() */
- return found_numaq;
-}
-
-static void numaq_vector_allocation_domain(int cpu, struct cpumask *retmask)
-{
- /* Careful. Some cpus do not strictly honor the set of cpus
- * specified in the interrupt destination when using lowest
- * priority interrupt delivery mode.
- *
- * In particular there was a hyperthreading cpu observed to
- * deliver interrupts to the wrong hyperthread when only one
- * hyperthread was specified in the interrupt destination.
- */
- cpumask_clear(retmask);
- cpumask_bits(retmask)[0] = APIC_ALL_CPUS;
-}
-
-static void numaq_setup_portio_remap(void)
-{
- int num_quads = num_online_nodes();
-
- if (num_quads <= 1)
- return;
-
- printk(KERN_INFO
- "Remapping cross-quad port I/O for %d quads\n", num_quads);
-
- xquad_portio = ioremap(XQUAD_PORTIO_BASE, num_quads*XQUAD_PORTIO_QUAD);
-
- printk(KERN_INFO
- "xquad_portio vaddr 0x%08lx, len %08lx\n",
- (u_long) xquad_portio, (u_long) num_quads*XQUAD_PORTIO_QUAD);
-}
-
-/* Use __refdata to keep false positive warning calm. */
-struct apic __refdata apic_numaq = {
-
- .name = "NUMAQ",
- .probe = probe_numaq,
- .acpi_madt_oem_check = NULL,
- .apic_id_registered = numaq_apic_id_registered,
-
- .irq_delivery_mode = dest_LowestPrio,
- /* physical delivery on LOCAL quad: */
- .irq_dest_mode = 0,
-
- .target_cpus = numaq_target_cpus,
- .disable_esr = 1,
- .dest_logical = APIC_DEST_LOGICAL,
- .check_apicid_used = numaq_check_apicid_used,
- .check_apicid_present = numaq_check_apicid_present,
-
- .vector_allocation_domain = numaq_vector_allocation_domain,
- .init_apic_ldr = numaq_init_apic_ldr,
-
- .ioapic_phys_id_map = numaq_ioapic_phys_id_map,
- .setup_apic_routing = numaq_setup_apic_routing,
- .multi_timer_check = numaq_multi_timer_check,
- .apicid_to_node = numaq_apicid_to_node,
- .cpu_to_logical_apicid = numaq_cpu_to_logical_apicid,
- .cpu_present_to_apicid = numaq_cpu_present_to_apicid,
- .apicid_to_cpu_present = numaq_apicid_to_cpu_present,
- .setup_portio_remap = numaq_setup_portio_remap,
- .check_phys_apicid_present = numaq_check_phys_apicid_present,
- .enable_apic_mode = NULL,
- .phys_pkg_id = numaq_phys_pkg_id,
- .mps_oem_check = numaq_mps_oem_check,
-
- .get_apic_id = numaq_get_apic_id,
- .set_apic_id = NULL,
- .apic_id_mask = 0x0F << 24,
-
- .cpu_mask_to_apicid = numaq_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = numaq_cpu_mask_to_apicid_and,
-
- .send_IPI_mask = numaq_send_IPI_mask,
- .send_IPI_mask_allbutself = NULL,
- .send_IPI_allbutself = numaq_send_IPI_allbutself,
- .send_IPI_all = numaq_send_IPI_all,
- .send_IPI_self = default_send_IPI_self,
-
- .wakeup_secondary_cpu = wakeup_secondary_cpu_via_nmi,
- .trampoline_phys_low = NUMAQ_TRAMPOLINE_PHYS_LOW,
- .trampoline_phys_high = NUMAQ_TRAMPOLINE_PHYS_HIGH,
-
- /* We don't do anything here because we use NMI's to boot instead */
- .wait_for_init_deassert = NULL,
-
- .smp_callin_clear_local_apic = numaq_smp_callin_clear_local_apic,
- .inquire_remote_apic = NULL,
-
- .read = native_apic_mem_read,
- .write = native_apic_mem_write,
- .icr_read = native_apic_icr_read,
- .icr_write = native_apic_icr_write,
- .wait_icr_idle = native_apic_wait_icr_idle,
- .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
-};
diff --git a/arch/x86/kernel/apic/probe_32.c b/arch/x86/kernel/apic/probe_32.c
index 99d2fe016084..87bc9e7ca5d6 100644
--- a/arch/x86/kernel/apic/probe_32.c
+++ b/arch/x86/kernel/apic/probe_32.c
@@ -1,104 +1,31 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Default generic APIC driver. This handles up to 8 CPUs.
*
* Copyright 2003 Andi Kleen, SuSE Labs.
- * Subject to the GNU Public License, v.2
*
* Generic x86 APIC driver probe layer.
*/
-#include <linux/threads.h>
-#include <linux/cpumask.h>
-#include <linux/module.h>
-#include <linux/string.h>
-#include <linux/kernel.h>
-#include <linux/ctype.h>
-#include <linux/init.h>
+#include <linux/export.h>
#include <linux/errno.h>
-#include <asm/fixmap.h>
-#include <asm/mpspec.h>
-#include <asm/apicdef.h>
-#include <asm/apic.h>
-#include <asm/setup.h>
-
#include <linux/smp.h>
-#include <asm/ipi.h>
-
-#include <linux/interrupt.h>
-#include <asm/acpi.h>
-#include <asm/e820.h>
-
-#ifdef CONFIG_HOTPLUG_CPU
-#define DEFAULT_SEND_IPI (1)
-#else
-#define DEFAULT_SEND_IPI (0)
-#endif
-
-int no_broadcast = DEFAULT_SEND_IPI;
-
-static __init int no_ipi_broadcast(char *str)
-{
- get_option(&str, &no_broadcast);
- pr_info("Using %s mode\n",
- no_broadcast ? "No IPI Broadcast" : "IPI Broadcast");
- return 1;
-}
-__setup("no_ipi_broadcast=", no_ipi_broadcast);
-
-static int __init print_ipi_mode(void)
-{
- pr_info("Using IPI %s mode\n",
- no_broadcast ? "No-Shortcut" : "Shortcut");
- return 0;
-}
-late_initcall(print_ipi_mode);
-void __init default_setup_apic_routing(void)
-{
- int version = apic_version[boot_cpu_physical_apicid];
+#include <xen/xen.h>
- if (num_possible_cpus() > 8) {
- switch (boot_cpu_data.x86_vendor) {
- case X86_VENDOR_INTEL:
- if (!APIC_XAPIC(version)) {
- def_to_bigsmp = 0;
- break;
- }
- /* If P4 and above, fall through */
- case X86_VENDOR_AMD:
- def_to_bigsmp = 1;
- }
- }
-
-#ifdef CONFIG_X86_BIGSMP
- generic_bigsmp_probe();
-#endif
+#include <asm/io_apic.h>
+#include <asm/apic.h>
+#include <asm/acpi.h>
- if (apic->setup_apic_routing)
- apic->setup_apic_routing();
-}
+#include "local.h"
-static void setup_apic_flat_routing(void)
+static u32 default_get_apic_id(u32 x)
{
-#ifdef CONFIG_X86_IO_APIC
- printk(KERN_INFO
- "Enabling APIC mode: Flat. Using %d I/O APICs\n",
- nr_ioapics);
-#endif
-}
+ unsigned int ver = GET_APIC_VERSION(apic_read(APIC_LVR));
-static void default_vector_allocation_domain(int cpu, struct cpumask *retmask)
-{
- /*
- * Careful. Some cpus do not strictly honor the set of cpus
- * specified in the interrupt destination when using lowest
- * priority interrupt delivery mode.
- *
- * In particular there was a hyperthreading cpu observed to
- * deliver interrupts to the wrong hyperthread when only one
- * hyperthread was specified in the interrupt destination.
- */
- cpumask_clear(retmask);
- cpumask_bits(retmask)[0] = APIC_ALL_CPUS;
+ if (APIC_XAPIC(ver) || boot_cpu_has(X86_FEATURE_EXTD_APICID))
+ return (x >> 24) & 0xFF;
+ else
+ return (x >> 24) & 0x0F;
}
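An illustrative reading (not part of the patch): an APIC_ID register value of 0x1B000000 decodes to ID 0x1B on xAPIC or when X86_FEATURE_EXTD_APICID is set, but masks down to 0x0B on an older part where only bits 27:24 are architected.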
/* should be called last. */
@@ -107,106 +34,55 @@ static int probe_default(void)
return 1;
}
-struct apic apic_default = {
+static struct apic apic_default __ro_after_init = {
.name = "default",
.probe = probe_default,
- .acpi_madt_oem_check = NULL,
- .apic_id_registered = default_apic_id_registered,
- .irq_delivery_mode = dest_LowestPrio,
- /* logical delivery broadcast to all CPUs: */
- .irq_dest_mode = 1,
+ .dest_mode_logical = true,
- .target_cpus = default_target_cpus,
.disable_esr = 0,
- .dest_logical = APIC_DEST_LOGICAL,
- .check_apicid_used = default_check_apicid_used,
- .check_apicid_present = default_check_apicid_present,
- .vector_allocation_domain = default_vector_allocation_domain,
.init_apic_ldr = default_init_apic_ldr,
-
- .ioapic_phys_id_map = default_ioapic_phys_id_map,
- .setup_apic_routing = setup_apic_flat_routing,
- .multi_timer_check = NULL,
- .apicid_to_node = default_apicid_to_node,
- .cpu_to_logical_apicid = default_cpu_to_logical_apicid,
.cpu_present_to_apicid = default_cpu_present_to_apicid,
- .apicid_to_cpu_present = physid_set_mask_of_physid,
- .setup_portio_remap = NULL,
- .check_phys_apicid_present = default_check_phys_apicid_present,
- .enable_apic_mode = NULL,
- .phys_pkg_id = default_phys_pkg_id,
- .mps_oem_check = NULL,
+ .max_apic_id = 0xFE,
.get_apic_id = default_get_apic_id,
- .set_apic_id = NULL,
- .apic_id_mask = 0x0F << 24,
- .cpu_mask_to_apicid = default_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = default_cpu_mask_to_apicid_and,
+ .calc_dest_apicid = apic_flat_calc_apicid,
+ .send_IPI = default_send_IPI_single,
.send_IPI_mask = default_send_IPI_mask_logical,
.send_IPI_mask_allbutself = default_send_IPI_mask_allbutself_logical,
.send_IPI_allbutself = default_send_IPI_allbutself,
.send_IPI_all = default_send_IPI_all,
.send_IPI_self = default_send_IPI_self,
- .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
- .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
-
- .wait_for_init_deassert = default_wait_for_init_deassert,
-
- .smp_callin_clear_local_apic = NULL,
- .inquire_remote_apic = default_inquire_remote_apic,
-
.read = native_apic_mem_read,
.write = native_apic_mem_write,
+ .eoi = native_apic_mem_eoi,
.icr_read = native_apic_icr_read,
.icr_write = native_apic_icr_write,
- .wait_icr_idle = native_apic_wait_icr_idle,
- .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+ .wait_icr_idle = apic_mem_wait_icr_idle,
+ .safe_wait_icr_idle = apic_mem_wait_icr_idle_timeout,
};
-extern struct apic apic_numaq;
-extern struct apic apic_summit;
-extern struct apic apic_bigsmp;
-extern struct apic apic_es7000;
-extern struct apic apic_es7000_cluster;
+apic_driver(apic_default);
-struct apic *apic = &apic_default;
+struct apic *apic __ro_after_init = &apic_default;
EXPORT_SYMBOL_GPL(apic);
-static struct apic *apic_probe[] __initdata = {
-#ifdef CONFIG_X86_NUMAQ
- &apic_numaq,
-#endif
-#ifdef CONFIG_X86_SUMMIT
- &apic_summit,
-#endif
-#ifdef CONFIG_X86_BIGSMP
- &apic_bigsmp,
-#endif
-#ifdef CONFIG_X86_ES7000
- &apic_es7000,
- &apic_es7000_cluster,
-#endif
- &apic_default, /* must be last */
- NULL,
-};
-
static int cmdline_apic __initdata;
static int __init parse_apic(char *arg)
{
- int i;
+ struct apic **drv;
if (!arg)
return -EINVAL;
- for (i = 0; apic_probe[i]; i++) {
- if (!strcmp(apic_probe[i]->name, arg)) {
- apic = apic_probe[i];
+ for (drv = __apicdrivers; drv < __apicdrivers_end; drv++) {
+ if (!strcmp((*drv)->name, arg)) {
+ apic_install_driver(*drv);
cmdline_apic = 1;
return 0;
}
@@ -217,82 +93,19 @@ static int __init parse_apic(char *arg)
}
early_param("apic", parse_apic);
-void __init generic_bigsmp_probe(void)
-{
-#ifdef CONFIG_X86_BIGSMP
- /*
- * This routine is used to switch to bigsmp mode when
- * - There is no apic= option specified by the user
- * - generic_apic_probe() has chosen apic_default as the sub_arch
- * - we find more than 8 CPUs in acpi LAPIC listing with xAPIC support
- */
-
- if (!cmdline_apic && apic == &apic_default) {
- if (apic_bigsmp.probe()) {
- apic = &apic_bigsmp;
- printk(KERN_INFO "Overriding APIC driver with %s\n",
- apic->name);
- }
- }
-#endif
-}
-
-void __init generic_apic_probe(void)
+void __init x86_32_probe_apic(void)
{
if (!cmdline_apic) {
- int i;
- for (i = 0; apic_probe[i]; i++) {
- if (apic_probe[i]->probe()) {
- apic = apic_probe[i];
+ struct apic **drv;
+
+ for (drv = __apicdrivers; drv < __apicdrivers_end; drv++) {
+ if ((*drv)->probe()) {
+ apic_install_driver(*drv);
break;
}
}
/* Not visible without early console */
- if (!apic_probe[i])
+ if (drv == __apicdrivers_end)
panic("Didn't find an APIC driver");
}
- printk(KERN_INFO "Using APIC driver %s\n", apic->name);
-}
-
-/* These functions can switch the APIC even after the initial ->probe() */
-
-int __init
-generic_mps_oem_check(struct mpc_table *mpc, char *oem, char *productid)
-{
- int i;
-
- for (i = 0; apic_probe[i]; ++i) {
- if (!apic_probe[i]->mps_oem_check)
- continue;
- if (!apic_probe[i]->mps_oem_check(mpc, oem, productid))
- continue;
-
- if (!cmdline_apic) {
- apic = apic_probe[i];
- printk(KERN_INFO "Switched to APIC driver `%s'.\n",
- apic->name);
- }
- return 1;
- }
- return 0;
-}
-
-int __init default_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
-{
- int i;
-
- for (i = 0; apic_probe[i]; ++i) {
- if (!apic_probe[i]->acpi_madt_oem_check)
- continue;
- if (!apic_probe[i]->acpi_madt_oem_check(oem_id, oem_table_id))
- continue;
-
- if (!cmdline_apic) {
- apic = apic_probe[i];
- printk(KERN_INFO "Switched to APIC driver `%s'.\n",
- apic->name);
- }
- return 1;
- }
- return 0;
}
diff --git a/arch/x86/kernel/apic/probe_64.c b/arch/x86/kernel/apic/probe_64.c
index 83e9be4778e2..ecdf0c4121e1 100644
--- a/arch/x86/kernel/apic/probe_64.c
+++ b/arch/x86/kernel/apic/probe_64.c
@@ -1,6 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Copyright 2004 James Cleverdon, IBM.
- * Subject to the GNU Public License, v.2
*
* Generic APIC sub-arch probe layer.
*
@@ -8,99 +8,33 @@
* Martin Bligh, Andi Kleen, James Bottomley, John Stultz, and
* James Cleverdon.
*/
-#include <linux/threads.h>
-#include <linux/cpumask.h>
-#include <linux/string.h>
-#include <linux/module.h>
-#include <linux/kernel.h>
-#include <linux/ctype.h>
-#include <linux/init.h>
-#include <linux/hardirq.h>
-#include <linux/dmar.h>
-
-#include <asm/smp.h>
+#include <linux/thread_info.h>
#include <asm/apic.h>
-#include <asm/ipi.h>
-#include <asm/setup.h>
-
-extern struct apic apic_flat;
-extern struct apic apic_physflat;
-extern struct apic apic_x2apic_uv_x;
-extern struct apic apic_x2apic_phys;
-extern struct apic apic_x2apic_cluster;
-
-struct apic __read_mostly *apic = &apic_flat;
-EXPORT_SYMBOL_GPL(apic);
-
-static struct apic *apic_probe[] __initdata = {
-#ifdef CONFIG_X86_UV
- &apic_x2apic_uv_x,
-#endif
-#ifdef CONFIG_X86_X2APIC
- &apic_x2apic_phys,
- &apic_x2apic_cluster,
-#endif
- &apic_physflat,
- NULL,
-};
-static int apicid_phys_pkg_id(int initial_apic_id, int index_msb)
-{
- return hard_smp_processor_id() >> index_msb;
-}
+#include "local.h"
-/*
- * Check the APIC IDs in bios_cpu_apicid and choose the APIC mode.
- */
-void __init default_setup_apic_routing(void)
+/* Select the appropriate APIC driver */
+void __init x86_64_probe_apic(void)
{
-#ifdef CONFIG_X86_X2APIC
- if (x2apic_mode
-#ifdef CONFIG_X86_UV
- && apic != &apic_x2apic_uv_x
-#endif
- ) {
- if (x2apic_phys)
- apic = &apic_x2apic_phys;
- else
- apic = &apic_x2apic_cluster;
- }
-#endif
-
- if (apic == &apic_flat && num_possible_cpus() > 8)
- apic = &apic_physflat;
+ struct apic **drv;
- printk(KERN_INFO "Setting APIC routing to %s\n", apic->name);
+ enable_IR_x2apic();
- if (is_vsmp_box()) {
- /* need to update phys_pkg_id */
- apic->phys_pkg_id = apicid_phys_pkg_id;
+ for (drv = __apicdrivers; drv < __apicdrivers_end; drv++) {
+ if ((*drv)->probe && (*drv)->probe()) {
+ apic_install_driver(*drv);
+ break;
+ }
}
-
- /*
- * Now that apic routing model is selected, configure the
- * fault handling for intr remapping.
- */
- if (intr_remapping_enabled)
- enable_drhd_fault_handling();
-}
-
-/* Same for both flat and physical. */
-
-void apic_send_IPI_self(int vector)
-{
- __default_send_IPI_shortcut(APIC_DEST_SELF, vector, APIC_DEST_PHYSICAL);
}
int __init default_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
{
- int i;
+ struct apic **drv;
- for (i = 0; apic_probe[i]; ++i) {
- if (apic_probe[i]->acpi_madt_oem_check(oem_id, oem_table_id)) {
- apic = apic_probe[i];
- printk(KERN_INFO "Setting APIC routing to %s.\n",
- apic->name);
+ for (drv = __apicdrivers; drv < __apicdrivers_end; drv++) {
+ if ((*drv)->acpi_madt_oem_check(oem_id, oem_table_id)) {
+ apic_install_driver(*drv);
return 1;
}
}
diff --git a/arch/x86/kernel/apic/summit_32.c b/arch/x86/kernel/apic/summit_32.c
deleted file mode 100644
index 9b419263d90d..000000000000
--- a/arch/x86/kernel/apic/summit_32.c
+++ /dev/null
@@ -1,568 +0,0 @@
-/*
- * IBM Summit-Specific Code
- *
- * Written By: Matthew Dobson, IBM Corporation
- *
- * Copyright (c) 2003 IBM Corp.
- *
- * All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or (at
- * your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT. See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <colpatch@us.ibm.com>
- *
- */
-
-#include <linux/mm.h>
-#include <linux/init.h>
-#include <asm/io.h>
-#include <asm/bios_ebda.h>
-
-/*
- * APIC driver for the IBM "Summit" chipset.
- */
-#include <linux/threads.h>
-#include <linux/cpumask.h>
-#include <asm/mpspec.h>
-#include <asm/apic.h>
-#include <asm/smp.h>
-#include <asm/fixmap.h>
-#include <asm/apicdef.h>
-#include <asm/ipi.h>
-#include <linux/kernel.h>
-#include <linux/string.h>
-#include <linux/gfp.h>
-#include <linux/smp.h>
-
-static unsigned summit_get_apic_id(unsigned long x)
-{
- return (x >> 24) & 0xFF;
-}
-
-static inline void summit_send_IPI_mask(const struct cpumask *mask, int vector)
-{
- default_send_IPI_mask_sequence_logical(mask, vector);
-}
-
-static void summit_send_IPI_allbutself(int vector)
-{
- default_send_IPI_mask_allbutself_logical(cpu_online_mask, vector);
-}
-
-static void summit_send_IPI_all(int vector)
-{
- summit_send_IPI_mask(cpu_online_mask, vector);
-}
-
-#include <asm/tsc.h>
-
-extern int use_cyclone;
-
-#ifdef CONFIG_X86_SUMMIT_NUMA
-static void setup_summit(void);
-#else
-static inline void setup_summit(void) {}
-#endif
-
-static int summit_mps_oem_check(struct mpc_table *mpc, char *oem,
- char *productid)
-{
- if (!strncmp(oem, "IBM ENSW", 8) &&
- (!strncmp(productid, "VIGIL SMP", 9)
- || !strncmp(productid, "EXA", 3)
- || !strncmp(productid, "RUTHLESS SMP", 12))){
- mark_tsc_unstable("Summit based system");
- use_cyclone = 1; /*enable cyclone-timer*/
- setup_summit();
- return 1;
- }
- return 0;
-}
-
-/* Hook from generic ACPI tables.c */
-static int summit_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
-{
- if (!strncmp(oem_id, "IBM", 3) &&
- (!strncmp(oem_table_id, "SERVIGIL", 8)
- || !strncmp(oem_table_id, "EXA", 3))){
- mark_tsc_unstable("Summit based system");
- use_cyclone = 1; /*enable cyclone-timer*/
- setup_summit();
- return 1;
- }
- return 0;
-}
-
-struct rio_table_hdr {
- unsigned char version; /* Version number of this data structure */
- /* Version 3 adds chassis_num & WP_index */
- unsigned char num_scal_dev; /* # of Scalability devices (Twisters for Vigil) */
- unsigned char num_rio_dev; /* # of RIO I/O devices (Cyclones and Winnipegs) */
-} __attribute__((packed));
-
-struct scal_detail {
- unsigned char node_id; /* Scalability Node ID */
- unsigned long CBAR; /* Address of 1MB register space */
- unsigned char port0node; /* Node ID port connected to: 0xFF=None */
- unsigned char port0port; /* Port num port connected to: 0,1,2, or 0xFF=None */
- unsigned char port1node; /* Node ID port connected to: 0xFF = None */
- unsigned char port1port; /* Port num port connected to: 0,1,2, or 0xFF=None */
- unsigned char port2node; /* Node ID port connected to: 0xFF = None */
- unsigned char port2port; /* Port num port connected to: 0,1,2, or 0xFF=None */
- unsigned char chassis_num; /* 1 based Chassis number (1 = boot node) */
-} __attribute__((packed));
-
-struct rio_detail {
- unsigned char node_id; /* RIO Node ID */
- unsigned long BBAR; /* Address of 1MB register space */
- unsigned char type; /* Type of device */
- unsigned char owner_id; /* For WPEG: Node ID of Cyclone that owns this WPEG*/
- /* For CYC: Node ID of Twister that owns this CYC */
- unsigned char port0node; /* Node ID port connected to: 0xFF=None */
- unsigned char port0port; /* Port num port connected to: 0,1,2, or 0xFF=None */
- unsigned char port1node; /* Node ID port connected to: 0xFF=None */
- unsigned char port1port; /* Port num port connected to: 0,1,2, or 0xFF=None */
- unsigned char first_slot; /* For WPEG: Lowest slot number below this WPEG */
- /* For CYC: 0 */
- unsigned char status; /* For WPEG: Bit 0 = 1 : the XAPIC is used */
- /* = 0 : the XAPIC is not used, ie:*/
- /* ints fwded to another XAPIC */
- /* Bits1:7 Reserved */
- /* For CYC: Bits0:7 Reserved */
- unsigned char WP_index; /* For WPEG: WPEG instance index - lower ones have */
- /* lower slot numbers/PCI bus numbers */
- /* For CYC: No meaning */
- unsigned char chassis_num; /* 1 based Chassis number */
- /* For LookOut WPEGs this field indicates the */
- /* Expansion Chassis #, enumerated from Boot */
- /* Node WPEG external port, then Boot Node CYC */
- /* external port, then Next Vigil chassis WPEG */
- /* external port, etc. */
- /* Shared Lookouts have only 1 chassis number (the */
- /* first one assigned) */
-} __attribute__((packed));
-
-
-typedef enum {
- CompatTwister = 0, /* Compatibility Twister */
- AltTwister = 1, /* Alternate Twister of internal 8-way */
- CompatCyclone = 2, /* Compatibility Cyclone */
- AltCyclone = 3, /* Alternate Cyclone of internal 8-way */
- CompatWPEG = 4, /* Compatibility WPEG */
- AltWPEG = 5, /* Second Planar WPEG */
- LookOutAWPEG = 6, /* LookOut WPEG */
- LookOutBWPEG = 7, /* LookOut WPEG */
-} node_type;
-
-static inline int is_WPEG(struct rio_detail *rio){
- return (rio->type == CompatWPEG || rio->type == AltWPEG ||
- rio->type == LookOutAWPEG || rio->type == LookOutBWPEG);
-}
-
-#define SUMMIT_APIC_DFR_VALUE (APIC_DFR_CLUSTER)
-
-static const struct cpumask *summit_target_cpus(void)
-{
- /* CPU_MASK_ALL (0xff) has undefined behaviour with
- * dest_LowestPrio mode logical clustered apic interrupt routing.
- * Just start on cpu 0; IRQ balancing will spread the load.
- */
- return cpumask_of(0);
-}
-
-static unsigned long summit_check_apicid_used(physid_mask_t *map, int apicid)
-{
- return 0;
-}
-
-/* we don't use the phys_cpu_present_map to indicate apicid presence */
-static unsigned long summit_check_apicid_present(int bit)
-{
- return 1;
-}
-
-static void summit_init_apic_ldr(void)
-{
- unsigned long val, id;
- int count = 0;
- u8 my_id = (u8)hard_smp_processor_id();
- u8 my_cluster = APIC_CLUSTER(my_id);
-#ifdef CONFIG_SMP
- u8 lid;
- int i;
-
- /* Create logical APIC IDs by counting CPUs already in cluster. */
- for (count = 0, i = nr_cpu_ids; --i >= 0; ) {
- lid = cpu_2_logical_apicid[i];
- if (lid != BAD_APICID && APIC_CLUSTER(lid) == my_cluster)
- ++count;
- }
-#endif
- /* We only have a 4-wide bitmap in cluster mode. If a deranged
- * BIOS puts 5 CPUs in one APIC cluster, we're hosed. */
- BUG_ON(count >= XAPIC_DEST_CPUS_SHIFT);
- id = my_cluster | (1UL << count);
- apic_write(APIC_DFR, SUMMIT_APIC_DFR_VALUE);
- val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
- val |= SET_APIC_LOGICAL_ID(id);
- apic_write(APIC_LDR, val);
-}
-
-static int summit_apic_id_registered(void)
-{
- return 1;
-}
-
-static void summit_setup_apic_routing(void)
-{
- printk("Enabling APIC mode: Summit. Using %d I/O APICs\n",
- nr_ioapics);
-}
-
-static int summit_apicid_to_node(int logical_apicid)
-{
-#ifdef CONFIG_SMP
- return apicid_2_node[hard_smp_processor_id()];
-#else
- return 0;
-#endif
-}
-
-/* Mapping from cpu number to logical apicid */
-static inline int summit_cpu_to_logical_apicid(int cpu)
-{
-#ifdef CONFIG_SMP
- if (cpu >= nr_cpu_ids)
- return BAD_APICID;
- return cpu_2_logical_apicid[cpu];
-#else
- return logical_smp_processor_id();
-#endif
-}
-
-static int summit_cpu_present_to_apicid(int mps_cpu)
-{
- if (mps_cpu < nr_cpu_ids)
- return (int)per_cpu(x86_bios_cpu_apicid, mps_cpu);
- else
- return BAD_APICID;
-}
-
-static void summit_ioapic_phys_id_map(physid_mask_t *phys_id_map, physid_mask_t *retmap)
-{
- /* For clustered we don't have a good way to do this yet - hack */
- physids_promote(0x0FL, retmap);
-}
-
-static void summit_apicid_to_cpu_present(int apicid, physid_mask_t *retmap)
-{
- physid_set_mask_of_physid(0, retmap);
-}
-
-static int summit_check_phys_apicid_present(int physical_apicid)
-{
- return 1;
-}
-
-static unsigned int summit_cpu_mask_to_apicid(const struct cpumask *cpumask)
-{
- unsigned int round = 0;
- int cpu, apicid = 0;
-
- /*
- * The cpus in the mask must all be in the same APIC cluster.
- */
- for_each_cpu(cpu, cpumask) {
- int new_apicid = summit_cpu_to_logical_apicid(cpu);
-
- if (round && APIC_CLUSTER(apicid) != APIC_CLUSTER(new_apicid)) {
- printk("%s: Not a valid mask!\n", __func__);
- return BAD_APICID;
- }
- apicid |= new_apicid;
- round++;
- }
- return apicid;
-}
-
-static unsigned int summit_cpu_mask_to_apicid_and(const struct cpumask *inmask,
- const struct cpumask *andmask)
-{
- int apicid = summit_cpu_to_logical_apicid(0);
- cpumask_var_t cpumask;
-
- if (!alloc_cpumask_var(&cpumask, GFP_ATOMIC))
- return apicid;
-
- cpumask_and(cpumask, inmask, andmask);
- cpumask_and(cpumask, cpumask, cpu_online_mask);
- apicid = summit_cpu_mask_to_apicid(cpumask);
-
- free_cpumask_var(cpumask);
-
- return apicid;
-}
-
-/*
- * cpuid returns the value latched in the HW at reset, not the APIC ID
- * register's value. For any box whose BIOS changes APIC IDs, like
- * clustered APIC systems, we must use hard_smp_processor_id.
- *
- * See Intel's IA-32 SW Dev's Manual Vol2 under CPUID.
- */
-static int summit_phys_pkg_id(int cpuid_apic, int index_msb)
-{
- return hard_smp_processor_id() >> index_msb;
-}
-
-static int probe_summit(void)
-{
- /* probed later in mptable/ACPI hooks */
- return 0;
-}
-
-static void summit_vector_allocation_domain(int cpu, struct cpumask *retmask)
-{
- /* Careful. Some cpus do not strictly honor the set of cpus
- * specified in the interrupt destination when using lowest
- * priority interrupt delivery mode.
- *
- * In particular there was a hyperthreading cpu observed to
- * deliver interrupts to the wrong hyperthread when only one
- * hyperthread was specified in the interrupt destination.
- */
- cpumask_clear(retmask);
- cpumask_bits(retmask)[0] = APIC_ALL_CPUS;
-}
-
-#ifdef CONFIG_X86_SUMMIT_NUMA
-static struct rio_table_hdr *rio_table_hdr;
-static struct scal_detail *scal_devs[MAX_NUMNODES];
-static struct rio_detail *rio_devs[MAX_NUMNODES*4];
-
-#ifndef CONFIG_X86_NUMAQ
-static int mp_bus_id_to_node[MAX_MP_BUSSES];
-#endif
-
-static int setup_pci_node_map_for_wpeg(int wpeg_num, int last_bus)
-{
- int twister = 0, node = 0;
- int i, bus, num_buses;
-
- for (i = 0; i < rio_table_hdr->num_rio_dev; i++) {
- if (rio_devs[i]->node_id == rio_devs[wpeg_num]->owner_id) {
- twister = rio_devs[i]->owner_id;
- break;
- }
- }
- if (i == rio_table_hdr->num_rio_dev) {
- printk(KERN_ERR "%s: Couldn't find owner Cyclone for Winnipeg!\n", __func__);
- return last_bus;
- }
-
- for (i = 0; i < rio_table_hdr->num_scal_dev; i++) {
- if (scal_devs[i]->node_id == twister) {
- node = scal_devs[i]->node_id;
- break;
- }
- }
- if (i == rio_table_hdr->num_scal_dev) {
- printk(KERN_ERR "%s: Couldn't find owner Twister for Cyclone!\n", __func__);
- return last_bus;
- }
-
- switch (rio_devs[wpeg_num]->type) {
- case CompatWPEG:
- /*
- * The Compatibility Winnipeg controls the 2 legacy buses,
- * the 66MHz PCI bus [2 slots] and the 2 "extra" buses in case
- * a PCI-PCI bridge card is used in either slot: total 5 buses.
- */
- num_buses = 5;
- break;
- case AltWPEG:
- /*
- * The Alternate Winnipeg controls the 2 133MHz buses [1 slot
- * each], their 2 "extra" buses, the 100MHz bus [2 slots] and
- * the "extra" buses for each of those slots: total 7 buses.
- */
- num_buses = 7;
- break;
- case LookOutAWPEG:
- case LookOutBWPEG:
- /*
- * A Lookout Winnipeg controls 3 100MHz buses [2 slots each]
- * & the "extra" buses for each of those slots: total 9 buses.
- */
- num_buses = 9;
- break;
- default:
- printk(KERN_INFO "%s: Unsupported Winnipeg type!\n", __func__);
- return last_bus;
- }
-
- for (bus = last_bus; bus < last_bus + num_buses; bus++)
- mp_bus_id_to_node[bus] = node;
- return bus;
-}
-
-static int build_detail_arrays(void)
-{
- unsigned long ptr;
- int i, scal_detail_size, rio_detail_size;
-
- if (rio_table_hdr->num_scal_dev > MAX_NUMNODES) {
- printk(KERN_WARNING "%s: MAX_NUMNODES too low! Defined as %d, but system has %d nodes.\n", __func__, MAX_NUMNODES, rio_table_hdr->num_scal_dev);
- return 0;
- }
-
- switch (rio_table_hdr->version) {
- default:
- printk(KERN_WARNING "%s: Invalid Rio Grande Table Version: %d\n", __func__, rio_table_hdr->version);
- return 0;
- case 2:
- scal_detail_size = 11;
- rio_detail_size = 13;
- break;
- case 3:
- scal_detail_size = 12;
- rio_detail_size = 15;
- break;
- }
-
- ptr = (unsigned long)rio_table_hdr + 3;
- for (i = 0; i < rio_table_hdr->num_scal_dev; i++, ptr += scal_detail_size)
- scal_devs[i] = (struct scal_detail *)ptr;
-
- for (i = 0; i < rio_table_hdr->num_rio_dev; i++, ptr += rio_detail_size)
- rio_devs[i] = (struct rio_detail *)ptr;
-
- return 1;
-}
-
-void setup_summit(void)
-{
- unsigned long ptr;
- unsigned short offset;
- int i, next_wpeg, next_bus = 0;
-
- /* The pointer to the EBDA is stored in the word @ phys 0x40E(40:0E) */
- ptr = get_bios_ebda();
- ptr = (unsigned long)phys_to_virt(ptr);
-
- rio_table_hdr = NULL;
- offset = 0x180;
- while (offset) {
- /* The block id is stored in the 2nd word */
- if (*((unsigned short *)(ptr + offset + 2)) == 0x4752) {
- /* set the pointer past the offset & block id */
- rio_table_hdr = (struct rio_table_hdr *)(ptr + offset + 4);
- break;
- }
- /* The next offset is stored in the 1st word. 0 means no more */
- offset = *((unsigned short *)(ptr + offset));
- }
- if (!rio_table_hdr) {
- printk(KERN_ERR "%s: Unable to locate Rio Grande Table in EBDA - bailing!\n", __func__);
- return;
- }
-
- if (!build_detail_arrays())
- return;
-
- /* The first Winnipeg we're looking for has an index of 0 */
- next_wpeg = 0;
- do {
- for (i = 0; i < rio_table_hdr->num_rio_dev; i++) {
- if (is_WPEG(rio_devs[i]) && rio_devs[i]->WP_index == next_wpeg) {
- /* It's the Winnipeg we're looking for! */
- next_bus = setup_pci_node_map_for_wpeg(i, next_bus);
- next_wpeg++;
- break;
- }
- }
- /*
- * If we go through all Rio devices and don't find one with
- * the next index, it means we've found all the Winnipegs,
- * and thus all the PCI buses.
- */
- if (i == rio_table_hdr->num_rio_dev)
- next_wpeg = 0;
- } while (next_wpeg != 0);
-}
-#endif
-
-struct apic apic_summit = {
-
- .name = "summit",
- .probe = probe_summit,
- .acpi_madt_oem_check = summit_acpi_madt_oem_check,
- .apic_id_registered = summit_apic_id_registered,
-
- .irq_delivery_mode = dest_LowestPrio,
- /* logical delivery broadcast to all CPUs: */
- .irq_dest_mode = 1,
-
- .target_cpus = summit_target_cpus,
- .disable_esr = 1,
- .dest_logical = APIC_DEST_LOGICAL,
- .check_apicid_used = summit_check_apicid_used,
- .check_apicid_present = summit_check_apicid_present,
-
- .vector_allocation_domain = summit_vector_allocation_domain,
- .init_apic_ldr = summit_init_apic_ldr,
-
- .ioapic_phys_id_map = summit_ioapic_phys_id_map,
- .setup_apic_routing = summit_setup_apic_routing,
- .multi_timer_check = NULL,
- .apicid_to_node = summit_apicid_to_node,
- .cpu_to_logical_apicid = summit_cpu_to_logical_apicid,
- .cpu_present_to_apicid = summit_cpu_present_to_apicid,
- .apicid_to_cpu_present = summit_apicid_to_cpu_present,
- .setup_portio_remap = NULL,
- .check_phys_apicid_present = summit_check_phys_apicid_present,
- .enable_apic_mode = NULL,
- .phys_pkg_id = summit_phys_pkg_id,
- .mps_oem_check = summit_mps_oem_check,
-
- .get_apic_id = summit_get_apic_id,
- .set_apic_id = NULL,
- .apic_id_mask = 0xFF << 24,
-
- .cpu_mask_to_apicid = summit_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = summit_cpu_mask_to_apicid_and,
-
- .send_IPI_mask = summit_send_IPI_mask,
- .send_IPI_mask_allbutself = NULL,
- .send_IPI_allbutself = summit_send_IPI_allbutself,
- .send_IPI_all = summit_send_IPI_all,
- .send_IPI_self = default_send_IPI_self,
-
- .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
- .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
-
- .wait_for_init_deassert = default_wait_for_init_deassert,
-
- .smp_callin_clear_local_apic = NULL,
- .inquire_remote_apic = default_inquire_remote_apic,
-
- .read = native_apic_mem_read,
- .write = native_apic_mem_write,
- .icr_read = native_apic_icr_read,
- .icr_write = native_apic_icr_write,
- .wait_icr_idle = native_apic_wait_icr_idle,
- .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
-};
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
new file mode 100644
index 000000000000..bddc54465399
--- /dev/null
+++ b/arch/x86/kernel/apic/vector.c
@@ -0,0 +1,1387 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Local APIC related interfaces to support IOAPIC, MSI, etc.
+ *
+ * Copyright (C) 1997, 1998, 1999, 2000, 2009 Ingo Molnar, Hajnalka Szabo
+ * Moved from arch/x86/kernel/apic/io_apic.c.
+ * Jiang Liu <jiang.liu@linux.intel.com>
+ * Enable support of hierarchical irqdomains
+ */
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/seq_file.h>
+#include <linux/init.h>
+#include <linux/compiler.h>
+#include <linux/slab.h>
+#include <asm/irqdomain.h>
+#include <asm/hw_irq.h>
+#include <asm/traps.h>
+#include <asm/apic.h>
+#include <asm/i8259.h>
+#include <asm/desc.h>
+#include <asm/irq_remapping.h>
+
+#include <asm/trace/irq_vectors.h>
+
+struct apic_chip_data {
+ struct irq_cfg hw_irq_cfg;
+ unsigned int vector;
+ unsigned int prev_vector;
+ unsigned int cpu;
+ unsigned int prev_cpu;
+ unsigned int irq;
+ struct hlist_node clist;
+ unsigned int move_in_progress : 1,
+ is_managed : 1,
+ can_reserve : 1,
+ has_reserved : 1;
+};
+
+struct irq_domain *x86_vector_domain;
+EXPORT_SYMBOL_GPL(x86_vector_domain);
+static DEFINE_RAW_SPINLOCK(vector_lock);
+static cpumask_var_t vector_searchmask;
+static struct irq_chip lapic_controller;
+static struct irq_matrix *vector_matrix;
+#ifdef CONFIG_SMP
+
+static void vector_cleanup_callback(struct timer_list *tmr);
+
+struct vector_cleanup {
+ struct hlist_head head;
+ struct timer_list timer;
+};
+
+static DEFINE_PER_CPU(struct vector_cleanup, vector_cleanup) = {
+ .head = HLIST_HEAD_INIT,
+ .timer = __TIMER_INITIALIZER(vector_cleanup_callback, TIMER_PINNED),
+};
+#endif
+
+void lock_vector_lock(void)
+{
+ /*
+ * Used to ensure that the online set of cpus does not change
+ * during assign_irq_vector().
+ */
+ raw_spin_lock(&vector_lock);
+}
+
+void unlock_vector_lock(void)
+{
+ raw_spin_unlock(&vector_lock);
+}
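+
+/*
+ * Illustrative sketch (not part of this patch): a CPU bringup path is
+ * expected to bracket its update of the online mask with this lock, so
+ * that assign_irq_vector() never observes a half-updated CPU set:
+ *
+ *	lock_vector_lock();
+ *	set_cpu_online(smp_processor_id(), true);
+ *	lapic_online();
+ *	unlock_vector_lock();
+ */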
+
+void init_irq_alloc_info(struct irq_alloc_info *info,
+ const struct cpumask *mask)
+{
+ memset(info, 0, sizeof(*info));
+ info->mask = mask;
+}
+
+void copy_irq_alloc_info(struct irq_alloc_info *dst, struct irq_alloc_info *src)
+{
+ if (src)
+ *dst = *src;
+ else
+ memset(dst, 0, sizeof(*dst));
+}
+
+static struct apic_chip_data *apic_chip_data(struct irq_data *irqd)
+{
+ if (!irqd)
+ return NULL;
+
+ while (irqd->parent_data)
+ irqd = irqd->parent_data;
+
+ return irqd->chip_data;
+}
+
+struct irq_cfg *irqd_cfg(struct irq_data *irqd)
+{
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+
+ return apicd ? &apicd->hw_irq_cfg : NULL;
+}
+EXPORT_SYMBOL_GPL(irqd_cfg);
+
+struct irq_cfg *irq_cfg(unsigned int irq)
+{
+ return irqd_cfg(irq_get_irq_data(irq));
+}
+
+static struct apic_chip_data *alloc_apic_chip_data(int node)
+{
+ struct apic_chip_data *apicd;
+
+ apicd = kzalloc_node(sizeof(*apicd), GFP_KERNEL, node);
+ if (apicd)
+ INIT_HLIST_NODE(&apicd->clist);
+ return apicd;
+}
+
+static void free_apic_chip_data(struct apic_chip_data *apicd)
+{
+ kfree(apicd);
+}
+
+static void apic_update_irq_cfg(struct irq_data *irqd, unsigned int vector,
+ unsigned int cpu)
+{
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+
+ lockdep_assert_held(&vector_lock);
+
+ apicd->hw_irq_cfg.vector = vector;
+ apicd->hw_irq_cfg.dest_apicid = apic->calc_dest_apicid(cpu);
+
+ apic_update_vector(cpu, vector, true);
+
+ irq_data_update_effective_affinity(irqd, cpumask_of(cpu));
+ trace_vector_config(irqd->irq, vector, cpu, apicd->hw_irq_cfg.dest_apicid);
+}
+
+static void apic_free_vector(unsigned int cpu, unsigned int vector, bool managed)
+{
+ apic_update_vector(cpu, vector, false);
+ irq_matrix_free(vector_matrix, cpu, vector, managed);
+}
+
+static void chip_data_update(struct irq_data *irqd, unsigned int newvec, unsigned int newcpu)
+{
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+ struct irq_desc *desc = irq_data_to_desc(irqd);
+ bool managed = irqd_affinity_is_managed(irqd);
+
+ lockdep_assert_held(&vector_lock);
+
+ trace_vector_update(irqd->irq, newvec, newcpu, apicd->vector,
+ apicd->cpu);
+
+ /*
+ * If there is no vector associated or if the associated vector is
+ * the shutdown vector, which is associated to make PCI/MSI
+ * shutdown mode work, then there is nothing to release. Clear out
+ * prev_vector for this and the offlined target case.
+ */
+ apicd->prev_vector = 0;
+ if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
+ goto setnew;
+ /*
+ * If the target CPU of the previous vector is online, then mark
+ * the vector as move in progress and store it for cleanup when the
+ * first interrupt on the new vector arrives. If the target CPU is
+ * offline then the regular release mechanism via the cleanup
+ * vector is not possible and the vector can be immediately freed
+ * in the underlying matrix allocator.
+ */
+ if (cpu_online(apicd->cpu)) {
+ apicd->move_in_progress = true;
+ apicd->prev_vector = apicd->vector;
+ apicd->prev_cpu = apicd->cpu;
+ WARN_ON_ONCE(apicd->cpu == newcpu);
+ } else {
+ apic_free_vector(apicd->cpu, apicd->vector, managed);
+ }
+
+setnew:
+ apicd->vector = newvec;
+ apicd->cpu = newcpu;
+ BUG_ON(!IS_ERR_OR_NULL(per_cpu(vector_irq, newcpu)[newvec]));
+ per_cpu(vector_irq, newcpu)[newvec] = desc;
+ apic_update_irq_cfg(irqd, newvec, newcpu);
+}
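+
+/*
+ * Illustrative state transition (values invented): an interrupt that
+ * was on vector 33/CPU 2 and is moved to CPU 5 leaves here with
+ * cpu == 5, vector == <newly allocated>, prev_cpu == 2,
+ * prev_vector == 33 and move_in_progress set, until the first
+ * interrupt on the new vector triggers the cleanup.
+ */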
+
+static void vector_assign_managed_shutdown(struct irq_data *irqd)
+{
+ unsigned int cpu = cpumask_first(cpu_online_mask);
+
+ apic_update_irq_cfg(irqd, MANAGED_IRQ_SHUTDOWN_VECTOR, cpu);
+}
+
+static int reserve_managed_vector(struct irq_data *irqd)
+{
+ const struct cpumask *affmsk = irq_data_get_affinity_mask(irqd);
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+ unsigned long flags;
+ int ret;
+
+ raw_spin_lock_irqsave(&vector_lock, flags);
+ apicd->is_managed = true;
+ ret = irq_matrix_reserve_managed(vector_matrix, affmsk);
+ raw_spin_unlock_irqrestore(&vector_lock, flags);
+ trace_vector_reserve_managed(irqd->irq, ret);
+ return ret;
+}
+
+static void reserve_irq_vector_locked(struct irq_data *irqd)
+{
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+
+ irq_matrix_reserve(vector_matrix);
+ apicd->can_reserve = true;
+ apicd->has_reserved = true;
+ irqd_set_can_reserve(irqd);
+ trace_vector_reserve(irqd->irq, 0);
+ vector_assign_managed_shutdown(irqd);
+}
+
+static int reserve_irq_vector(struct irq_data *irqd)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&vector_lock, flags);
+ reserve_irq_vector_locked(irqd);
+ raw_spin_unlock_irqrestore(&vector_lock, flags);
+ return 0;
+}
+
+static int
+assign_vector_locked(struct irq_data *irqd, const struct cpumask *dest)
+{
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+ bool resvd = apicd->has_reserved;
+ unsigned int cpu = apicd->cpu;
+ int vector = apicd->vector;
+
+ lockdep_assert_held(&vector_lock);
+
+ /*
+ * If the current target CPU is online and in the new requested
+ * affinity mask, there is no point in moving the interrupt from
+ * one CPU to another.
+ */
+ if (vector && cpu_online(cpu) && cpumask_test_cpu(cpu, dest))
+ return 0;
+
+ /*
+ * Careful here. @apicd might either have move_in_progress set or
+ * be enqueued for cleanup. Assigning a new vector would either
+ * leave a stale vector on some CPU around or in case of a pending
+ * cleanup corrupt the hlist.
+ */
+ if (apicd->move_in_progress || !hlist_unhashed(&apicd->clist))
+ return -EBUSY;
+
+ vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu);
+ trace_vector_alloc(irqd->irq, vector, resvd, vector);
+ if (vector < 0)
+ return vector;
+ chip_data_update(irqd, vector, cpu);
+
+ return 0;
+}
+
+static int assign_irq_vector(struct irq_data *irqd, const struct cpumask *dest)
+{
+ unsigned long flags;
+ int ret;
+
+ raw_spin_lock_irqsave(&vector_lock, flags);
+ cpumask_and(vector_searchmask, dest, cpu_online_mask);
+ ret = assign_vector_locked(irqd, vector_searchmask);
+ raw_spin_unlock_irqrestore(&vector_lock, flags);
+ return ret;
+}
+
+static int assign_irq_vector_any_locked(struct irq_data *irqd)
+{
+ /* Get the affinity mask - either irq_default_affinity or (user) set */
+ const struct cpumask *affmsk = irq_data_get_affinity_mask(irqd);
+ int node = irq_data_get_node(irqd);
+
+ if (node != NUMA_NO_NODE) {
+ /* Try the intersection of @affmsk and node mask */
+ cpumask_and(vector_searchmask, cpumask_of_node(node), affmsk);
+ if (!assign_vector_locked(irqd, vector_searchmask))
+ return 0;
+ }
+
+ /* Try the full affinity mask */
+ cpumask_and(vector_searchmask, affmsk, cpu_online_mask);
+ if (!assign_vector_locked(irqd, vector_searchmask))
+ return 0;
+
+ if (node != NUMA_NO_NODE) {
+ /* Try the node mask */
+ if (!assign_vector_locked(irqd, cpumask_of_node(node)))
+ return 0;
+ }
+
+ /* Try the full online mask */
+ return assign_vector_locked(irqd, cpu_online_mask);
+}
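+
+/*
+ * The fallback cascade above (node-affine subset of the affinity mask,
+ * then the online affinity mask, then the node mask, then all online
+ * CPUs) is an instance of a generic first-match search. A minimal
+ * sketch of the same pattern, assuming a hypothetical try_assign()
+ * helper:
+ *
+ *	static int assign_first_match(struct irq_data *irqd,
+ *				      const struct cpumask **masks,
+ *				      unsigned int n)
+ *	{
+ *		unsigned int i;
+ *
+ *		for (i = 0; i < n; i++) {
+ *			if (!try_assign(irqd, masks[i]))
+ *				return 0;
+ *		}
+ *		return -ENOSPC;
+ *	}
+ */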
+
+static int
+assign_irq_vector_policy(struct irq_data *irqd, struct irq_alloc_info *info)
+{
+ if (irqd_affinity_is_managed(irqd))
+ return reserve_managed_vector(irqd);
+ if (info->mask)
+ return assign_irq_vector(irqd, info->mask);
+ /*
+ * Make only a global reservation with no guarantee. A real vector
+ * is associated at activation time.
+ */
+ return reserve_irq_vector(irqd);
+}
+
+static int
+assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
+{
+ const struct cpumask *affmsk = irq_data_get_affinity_mask(irqd);
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+ int vector, cpu;
+
+ cpumask_and(vector_searchmask, dest, affmsk);
+
+ /* set_affinity might be called here with nothing to do */
+ if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask))
+ return 0;
+ vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask,
+ &cpu);
+ trace_vector_alloc_managed(irqd->irq, vector, vector);
+ if (vector < 0)
+ return vector;
+ chip_data_update(irqd, vector, cpu);
+
+ return 0;
+}
+
+static void clear_irq_vector(struct irq_data *irqd)
+{
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+ bool managed = irqd_affinity_is_managed(irqd);
+ unsigned int vector = apicd->vector;
+
+ lockdep_assert_held(&vector_lock);
+
+ if (!vector)
+ return;
+
+ trace_vector_clear(irqd->irq, vector, apicd->cpu, apicd->prev_vector,
+ apicd->prev_cpu);
+
+ per_cpu(vector_irq, apicd->cpu)[vector] = VECTOR_SHUTDOWN;
+ apic_free_vector(apicd->cpu, vector, managed);
+ apicd->vector = 0;
+
+ /* Clean up move in progress */
+ vector = apicd->prev_vector;
+ if (!vector)
+ return;
+
+ per_cpu(vector_irq, apicd->prev_cpu)[vector] = VECTOR_SHUTDOWN;
+ apic_free_vector(apicd->prev_cpu, vector, managed);
+ apicd->prev_vector = 0;
+ apicd->move_in_progress = 0;
+ hlist_del_init(&apicd->clist);
+}
+
+static void x86_vector_deactivate(struct irq_domain *dom, struct irq_data *irqd)
+{
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+ unsigned long flags;
+
+ trace_vector_deactivate(irqd->irq, apicd->is_managed,
+ apicd->can_reserve, false);
+
+ /* Regular fixed assigned interrupt */
+ if (!apicd->is_managed && !apicd->can_reserve)
+ return;
+ /* If the interrupt has a global reservation, nothing to do */
+ if (apicd->has_reserved)
+ return;
+
+ raw_spin_lock_irqsave(&vector_lock, flags);
+ clear_irq_vector(irqd);
+ if (apicd->can_reserve)
+ reserve_irq_vector_locked(irqd);
+ else
+ vector_assign_managed_shutdown(irqd);
+ raw_spin_unlock_irqrestore(&vector_lock, flags);
+}
+
+static int activate_reserved(struct irq_data *irqd)
+{
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+ int ret;
+
+ ret = assign_irq_vector_any_locked(irqd);
+ if (!ret) {
+ apicd->has_reserved = false;
+ /*
+ * Core might have disabled reservation mode after
+ * allocating the irq descriptor. Ideally this should
+ * happen before allocation time, but that would require
+ * completely convoluted ways of transporting that
+ * information.
+ */
+ if (!irqd_can_reserve(irqd))
+ apicd->can_reserve = false;
+ }
+
+ /*
+ * Check to ensure that the effective affinity mask is a subset of
+ * the user supplied affinity mask, and warn the user if it is not.
+ */
+ if (!cpumask_subset(irq_data_get_effective_affinity_mask(irqd),
+ irq_data_get_affinity_mask(irqd))) {
+ pr_warn("irq %u: Affinity broken due to vector space exhaustion.\n",
+ irqd->irq);
+ }
+
+ return ret;
+}
+
+static int activate_managed(struct irq_data *irqd)
+{
+ const struct cpumask *dest = irq_data_get_affinity_mask(irqd);
+ int ret;
+
+ cpumask_and(vector_searchmask, dest, cpu_online_mask);
+ if (WARN_ON_ONCE(cpumask_empty(vector_searchmask))) {
+ /* Something in the core code broke! Survive gracefully */
+ pr_err("Managed startup for irq %u, but no CPU\n", irqd->irq);
+ return -EINVAL;
+ }
+
+ ret = assign_managed_vector(irqd, vector_searchmask);
+ /*
+ * This should not happen. The vector reservation got buggered. Handle
+ * it gracefully.
+ */
+ if (WARN_ON_ONCE(ret < 0)) {
+ pr_err("Managed startup irq %u, no vector available\n",
+ irqd->irq);
+ }
+ return ret;
+}
+
+static int x86_vector_activate(struct irq_domain *dom, struct irq_data *irqd,
+ bool reserve)
+{
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+ unsigned long flags;
+ int ret = 0;
+
+ trace_vector_activate(irqd->irq, apicd->is_managed,
+ apicd->can_reserve, reserve);
+
+ raw_spin_lock_irqsave(&vector_lock, flags);
+ if (!apicd->can_reserve && !apicd->is_managed)
+ assign_irq_vector_any_locked(irqd);
+ else if (reserve || irqd_is_managed_and_shutdown(irqd))
+ vector_assign_managed_shutdown(irqd);
+ else if (apicd->is_managed)
+ ret = activate_managed(irqd);
+ else if (apicd->has_reserved)
+ ret = activate_reserved(irqd);
+ raw_spin_unlock_irqrestore(&vector_lock, flags);
+ return ret;
+}
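+
+/*
+ * For illustration, the dispatch above reduces to:
+ *
+ *	!can_reserve && !is_managed	-> assign a real vector now
+ *	reserve or managed-and-shutdown	-> park on the shutdown vector
+ *	is_managed			-> activate_managed()
+ *	has_reserved			-> activate_reserved()
+ */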
+
+static void vector_free_reserved_and_managed(struct irq_data *irqd)
+{
+ const struct cpumask *dest = irq_data_get_affinity_mask(irqd);
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+
+ trace_vector_teardown(irqd->irq, apicd->is_managed,
+ apicd->has_reserved);
+
+ if (apicd->has_reserved)
+ irq_matrix_remove_reserved(vector_matrix);
+ if (apicd->is_managed)
+ irq_matrix_remove_managed(vector_matrix, dest);
+}
+
+static void x86_vector_free_irqs(struct irq_domain *domain,
+ unsigned int virq, unsigned int nr_irqs)
+{
+ struct apic_chip_data *apicd;
+ struct irq_data *irqd;
+ unsigned long flags;
+ int i;
+
+ for (i = 0; i < nr_irqs; i++) {
+ irqd = irq_domain_get_irq_data(x86_vector_domain, virq + i);
+ if (irqd && irqd->chip_data) {
+ raw_spin_lock_irqsave(&vector_lock, flags);
+ clear_irq_vector(irqd);
+ vector_free_reserved_and_managed(irqd);
+ apicd = irqd->chip_data;
+ irq_domain_reset_irq_data(irqd);
+ raw_spin_unlock_irqrestore(&vector_lock, flags);
+ free_apic_chip_data(apicd);
+ }
+ }
+}
+
+static bool vector_configure_legacy(unsigned int virq, struct irq_data *irqd,
+ struct apic_chip_data *apicd)
+{
+ unsigned long flags;
+ bool realloc = false;
+
+ apicd->vector = ISA_IRQ_VECTOR(virq);
+ apicd->cpu = 0;
+
+ raw_spin_lock_irqsave(&vector_lock, flags);
+ /*
+ * If the interrupt is activated, then it must stay at this vector
+ * position. That's usually the timer interrupt (0).
+ */
+ if (irqd_is_activated(irqd)) {
+ trace_vector_setup(virq, true, 0);
+ apic_update_irq_cfg(irqd, apicd->vector, apicd->cpu);
+ } else {
+ /* Release the vector */
+ apicd->can_reserve = true;
+ irqd_set_can_reserve(irqd);
+ clear_irq_vector(irqd);
+ realloc = true;
+ }
+ raw_spin_unlock_irqrestore(&vector_lock, flags);
+ return realloc;
+}
+
+static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
+ unsigned int nr_irqs, void *arg)
+{
+ struct irq_alloc_info *info = arg;
+ struct apic_chip_data *apicd;
+ struct irq_data *irqd;
+ int i, err, node;
+
+ if (apic_is_disabled)
+ return -ENXIO;
+
+ /*
+ * Catch any attempt to touch the cascade interrupt on a PIC
+ * equipped system.
+ */
+ if (WARN_ON_ONCE(info->flags & X86_IRQ_ALLOC_LEGACY &&
+ virq == PIC_CASCADE_IR))
+ return -EINVAL;
+
+ for (i = 0; i < nr_irqs; i++) {
+ irqd = irq_domain_get_irq_data(domain, virq + i);
+ BUG_ON(!irqd);
+ node = irq_data_get_node(irqd);
+ WARN_ON_ONCE(irqd->chip_data);
+ apicd = alloc_apic_chip_data(node);
+ if (!apicd) {
+ err = -ENOMEM;
+ goto error;
+ }
+
+ apicd->irq = virq + i;
+ irqd->chip = &lapic_controller;
+ irqd->chip_data = apicd;
+ irqd->hwirq = virq + i;
+ irqd_set_single_target(irqd);
+ /*
+ * Prevent any of these interrupts from being invoked in
+ * non-interrupt context via e.g. generic_handle_irq(),
+ * as that can corrupt the affinity move state.
+ */
+ irqd_set_handle_enforce_irqctx(irqd);
+
+ /* Don't invoke affinity setter on deactivated interrupts */
+ irqd_set_affinity_on_activate(irqd);
+
+ /*
+ * Legacy vectors are already assigned when the IOAPIC
+ * takes them over. They stay on the same vector. This is
+ * required for check_timer() to work correctly as it might
+ * switch back to legacy mode. Only update the hardware
+ * config.
+ */
+ if (info->flags & X86_IRQ_ALLOC_LEGACY) {
+ if (!vector_configure_legacy(virq + i, irqd, apicd))
+ continue;
+ }
+
+ err = assign_irq_vector_policy(irqd, info);
+ trace_vector_setup(virq + i, false, err);
+ if (err) {
+ irqd->chip_data = NULL;
+ free_apic_chip_data(apicd);
+ goto error;
+ }
+ }
+
+ return 0;
+
+error:
+ x86_vector_free_irqs(domain, virq, i);
+ return err;
+}
+
+#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
+static void x86_vector_debug_show(struct seq_file *m, struct irq_domain *d,
+ struct irq_data *irqd, int ind)
+{
+ struct apic_chip_data apicd;
+ unsigned long flags;
+ int irq;
+
+ if (!irqd) {
+ irq_matrix_debug_show(m, vector_matrix, ind);
+ return;
+ }
+
+ irq = irqd->irq;
+ if (irq < nr_legacy_irqs() && !test_bit(irq, &io_apic_irqs)) {
+ seq_printf(m, "%*sVector: %5d\n", ind, "", ISA_IRQ_VECTOR(irq));
+ seq_printf(m, "%*sTarget: Legacy PIC all CPUs\n", ind, "");
+ return;
+ }
+
+ if (!irqd->chip_data) {
+ seq_printf(m, "%*sVector: Not assigned\n", ind, "");
+ return;
+ }
+
+ raw_spin_lock_irqsave(&vector_lock, flags);
+ memcpy(&apicd, irqd->chip_data, sizeof(apicd));
+ raw_spin_unlock_irqrestore(&vector_lock, flags);
+
+ seq_printf(m, "%*sVector: %5u\n", ind, "", apicd.vector);
+ seq_printf(m, "%*sTarget: %5u\n", ind, "", apicd.cpu);
+ if (apicd.prev_vector) {
+ seq_printf(m, "%*sPrevious vector: %5u\n", ind, "", apicd.prev_vector);
+ seq_printf(m, "%*sPrevious target: %5u\n", ind, "", apicd.prev_cpu);
+ }
+ seq_printf(m, "%*smove_in_progress: %u\n", ind, "", apicd.move_in_progress ? 1 : 0);
+ seq_printf(m, "%*sis_managed: %u\n", ind, "", apicd.is_managed ? 1 : 0);
+ seq_printf(m, "%*scan_reserve: %u\n", ind, "", apicd.can_reserve ? 1 : 0);
+ seq_printf(m, "%*shas_reserved: %u\n", ind, "", apicd.has_reserved ? 1 : 0);
+ seq_printf(m, "%*scleanup_pending: %u\n", ind, "", !hlist_unhashed(&apicd.clist));
+}
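+
+/*
+ * Hypothetical example of the resulting debugfs output (values are
+ * invented for illustration):
+ *
+ *	Vector:              33
+ *	Target:               4
+ *	move_in_progress:     0
+ *	is_managed:           0
+ *	can_reserve:          1
+ *	has_reserved:         0
+ *	cleanup_pending:      0
+ */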
+#endif
+
+int x86_fwspec_is_ioapic(struct irq_fwspec *fwspec)
+{
+ if (fwspec->param_count != 1)
+ return 0;
+
+ if (is_fwnode_irqchip(fwspec->fwnode)) {
+ const char *fwname = fwnode_get_name(fwspec->fwnode);
+ return fwname && !strncmp(fwname, "IO-APIC-", 8) &&
+ simple_strtol(fwname+8, NULL, 10) == fwspec->param[0];
+ }
+ return to_of_node(fwspec->fwnode) &&
+ of_device_is_compatible(to_of_node(fwspec->fwnode),
+ "intel,ce4100-ioapic");
+}
+
+int x86_fwspec_is_hpet(struct irq_fwspec *fwspec)
+{
+ if (fwspec->param_count != 1)
+ return 0;
+
+ if (is_fwnode_irqchip(fwspec->fwnode)) {
+ const char *fwname = fwnode_get_name(fwspec->fwnode);
+ return fwname && !strncmp(fwname, "HPET-MSI-", 9) &&
+ simple_strtol(fwname+9, NULL, 10) == fwspec->param[0];
+ }
+ return 0;
+}
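+
+/*
+ * For illustration: with the naming convention above, an fwnode named
+ * "IO-APIC-0" with fwspec->param[0] == 0 matches x86_fwspec_is_ioapic(),
+ * and one named "HPET-MSI-2" with fwspec->param[0] == 2 matches
+ * x86_fwspec_is_hpet().
+ */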
+
+static int x86_vector_select(struct irq_domain *d, struct irq_fwspec *fwspec,
+ enum irq_domain_bus_token bus_token)
+{
+ /*
+ * HPET and I/OAPIC cannot be parented in the vector domain
+ * if IRQ remapping is enabled. APIC IDs above 15 bits are
+ * only permitted if IRQ remapping is enabled, so check that.
+ */
+ if (apic_id_valid(32768))
+ return 0;
+
+ return x86_fwspec_is_ioapic(fwspec) || x86_fwspec_is_hpet(fwspec);
+}
+
+static const struct irq_domain_ops x86_vector_domain_ops = {
+ .select = x86_vector_select,
+ .alloc = x86_vector_alloc_irqs,
+ .free = x86_vector_free_irqs,
+ .activate = x86_vector_activate,
+ .deactivate = x86_vector_deactivate,
+#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
+ .debug_show = x86_vector_debug_show,
+#endif
+};
+
+int __init arch_probe_nr_irqs(void)
+{
+ int nr;
+
+ if (irq_get_nr_irqs() > NR_VECTORS * nr_cpu_ids)
+ irq_set_nr_irqs(NR_VECTORS * nr_cpu_ids);
+
+ nr = (gsi_top + nr_legacy_irqs()) + 8 * nr_cpu_ids;
+#if defined(CONFIG_PCI_MSI)
+ /* For MSI and HT dynamic irqs */
+ if (gsi_top <= NR_IRQS_LEGACY)
+ nr += 8 * nr_cpu_ids;
+ else
+ nr += gsi_top * 16;
+#endif
+ if (nr < irq_get_nr_irqs())
+ irq_set_nr_irqs(nr);
+
+ /*
+ * We don't know if PIC is present at this point so we need to do
+ * probe() to get the right number of legacy IRQs.
+ */
+ return legacy_pic->probe();
+}
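+
+/*
+ * Worked example with invented numbers: for nr_cpu_ids == 8,
+ * gsi_top == 24 and nr_legacy_irqs() == 16, the code above computes
+ * nr = (24 + 16) + 8 * 8 = 104; since gsi_top > NR_IRQS_LEGACY, the
+ * PCI_MSI branch adds 24 * 16 = 384, giving nr = 488 before the final
+ * clamp against the current number of irqs.
+ */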
+
+void lapic_assign_legacy_vector(unsigned int irq, bool replace)
+{
+ /*
+ * Use assign system here so it won't get accounted as allocated
+ * and movable in the cpu hotplug check, and it prevents managed
+ * irq reservation from touching it.
+ */
+ irq_matrix_assign_system(vector_matrix, ISA_IRQ_VECTOR(irq), replace);
+}
+
+void __init lapic_update_legacy_vectors(void)
+{
+ unsigned int i;
+
+ if (IS_ENABLED(CONFIG_X86_IO_APIC) && nr_ioapics > 0)
+ return;
+
+ /*
+ * If the IO/APIC is disabled via config, kernel command line or
+ * lack of enumeration then all legacy interrupts are routed
+ * through the PIC. Make sure that they are marked as legacy
+ * vectors. PIC_CASCADE_IR has already been marked in
+ * lapic_assign_system_vectors().
+ */
+ for (i = 0; i < nr_legacy_irqs(); i++) {
+ if (i != PIC_CASCADE_IR)
+ lapic_assign_legacy_vector(i, true);
+ }
+}
+
+void __init lapic_assign_system_vectors(void)
+{
+ unsigned int i, vector;
+
+ for_each_set_bit(vector, system_vectors, NR_VECTORS)
+ irq_matrix_assign_system(vector_matrix, vector, false);
+
+ if (nr_legacy_irqs() > 1)
+ lapic_assign_legacy_vector(PIC_CASCADE_IR, false);
+
+ /* System vectors are reserved; bring the vector matrix online */
+ irq_matrix_online(vector_matrix);
+
+ /* Mark the preallocated legacy interrupts */
+ for (i = 0; i < nr_legacy_irqs(); i++) {
+ /*
+ * Don't touch the cascade interrupt. It's unusable
+ * on PIC equipped machines. See the large comment
+ * in the IO/APIC code.
+ */
+ if (i != PIC_CASCADE_IR)
+ irq_matrix_assign(vector_matrix, ISA_IRQ_VECTOR(i));
+ }
+}
+
+int __init arch_early_irq_init(void)
+{
+ struct fwnode_handle *fn;
+
+ fn = irq_domain_alloc_named_fwnode("VECTOR");
+ BUG_ON(!fn);
+ x86_vector_domain = irq_domain_create_tree(fn, &x86_vector_domain_ops,
+ NULL);
+ BUG_ON(x86_vector_domain == NULL);
+ irq_set_default_domain(x86_vector_domain);
+
+ BUG_ON(!alloc_cpumask_var(&vector_searchmask, GFP_KERNEL));
+
+ /*
+ * Allocate the vector matrix allocator data structure and limit the
+ * search area.
+ */
+ vector_matrix = irq_alloc_matrix(NR_VECTORS, FIRST_EXTERNAL_VECTOR,
+ FIRST_SYSTEM_VECTOR);
+ BUG_ON(!vector_matrix);
+
+ return arch_early_ioapic_init();
+}
+
+#ifdef CONFIG_SMP
+
+static struct irq_desc *__setup_vector_irq(int vector)
+{
+ int isairq = vector - ISA_IRQ_VECTOR(0);
+
+ /* Check whether the irq is in the legacy space */
+ if (isairq < 0 || isairq >= nr_legacy_irqs())
+ return VECTOR_UNUSED;
+ /* Check whether the irq is handled by the IOAPIC */
+ if (test_bit(isairq, &io_apic_irqs))
+ return VECTOR_UNUSED;
+ return irq_to_desc(isairq);
+}
+
+/* Online the local APIC infrastructure and initialize the vectors */
+void lapic_online(void)
+{
+ unsigned int vector;
+
+ lockdep_assert_held(&vector_lock);
+
+ /* Online the vector matrix array for this CPU */
+ irq_matrix_online(vector_matrix);
+
+ /*
+ * The interrupt affinity logic never targets interrupts to offline
+ * CPUs. The exceptions are the legacy PIC interrupts. In general
+ * they are only targeted to CPU0, but depending on the platform
+ * they can be distributed to any online CPU in hardware. The
+ * kernel has no influence on that. So all active legacy vectors
+ * must be installed on all CPUs. All non legacy interrupts can be
+ * cleared.
+ */
+ for (vector = 0; vector < NR_VECTORS; vector++)
+ this_cpu_write(vector_irq[vector], __setup_vector_irq(vector));
+}
+
+static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr);
+
+void lapic_offline(void)
+{
+ struct vector_cleanup *cl = this_cpu_ptr(&vector_cleanup);
+
+ lock_vector_lock();
+
+ /* In case the vector cleanup timer has not expired */
+ __vector_cleanup(cl, false);
+
+ irq_matrix_offline(vector_matrix);
+ WARN_ON_ONCE(timer_delete_sync_try(&cl->timer) < 0);
+ WARN_ON_ONCE(!hlist_empty(&cl->head));
+
+ unlock_vector_lock();
+}
+
+static int apic_set_affinity(struct irq_data *irqd,
+ const struct cpumask *dest, bool force)
+{
+ int err;
+
+ if (WARN_ON_ONCE(!irqd_is_activated(irqd)))
+ return -EIO;
+
+ raw_spin_lock(&vector_lock);
+ cpumask_and(vector_searchmask, dest, cpu_online_mask);
+ if (irqd_affinity_is_managed(irqd))
+ err = assign_managed_vector(irqd, vector_searchmask);
+ else
+ err = assign_vector_locked(irqd, vector_searchmask);
+ raw_spin_unlock(&vector_lock);
+ return err ? err : IRQ_SET_MASK_OK;
+}
+
+static void free_moved_vector(struct apic_chip_data *apicd)
+{
+ unsigned int vector = apicd->prev_vector;
+ unsigned int cpu = apicd->prev_cpu;
+ bool managed = apicd->is_managed;
+
+ /*
+ * Managed interrupts are usually not migrated away
+ * from an online CPU, but CPU isolation 'managed_irq'
+ * can make that happen.
+ * 1) Activation does not take the isolation into account
+ * to keep the code simple
+ * 2) Migration away from an isolated CPU can happen when
+ * a non-isolated CPU which is in the calculated
+ * affinity mask comes online.
+ */
+ trace_vector_free_moved(apicd->irq, cpu, vector, managed);
+ apic_free_vector(cpu, vector, managed);
+ per_cpu(vector_irq, cpu)[vector] = VECTOR_UNUSED;
+ hlist_del_init(&apicd->clist);
+ apicd->prev_vector = 0;
+ apicd->move_in_progress = 0;
+}
+
+/*
+ * Called from fixup_irqs() with @desc->lock held and interrupts disabled.
+ */
+static void apic_force_complete_move(struct irq_data *irqd)
+{
+ unsigned int cpu = smp_processor_id();
+ struct apic_chip_data *apicd;
+ unsigned int vector;
+
+ guard(raw_spinlock)(&vector_lock);
+ apicd = apic_chip_data(irqd);
+ if (!apicd)
+ return;
+
+ /*
+ * If prev_vector is empty or the descriptor is neither currently
+ * nor previously on the outgoing CPU, no action is required.
+ */
+ vector = apicd->prev_vector;
+ if (!vector || (apicd->cpu != cpu && apicd->prev_cpu != cpu))
+ return;
+
+ /*
+ * This is tricky. If the cleanup of the old vector has not been
+ * done yet, then the following setaffinity call will fail with
+ * -EBUSY. This can leave the interrupt in a stale state.
+ *
+ * All CPUs are stuck in stop machine with interrupts disabled so
+ * calling __irq_complete_move() would be completely pointless.
+ *
+ * 1) The interrupt is in move_in_progress state. That means that we
+ * have not seen an interrupt since the io_apic was reprogrammed to
+ * the new vector.
+ *
+ * 2) The interrupt has fired on the new vector, but the cleanup IPIs
+ * have not been processed yet.
+ */
+ if (apicd->move_in_progress) {
+ /*
+ * In theory there is a race:
+ *
+ * set_ioapic(new_vector) <-- Interrupt is raised before update
+ * is effective, i.e. it's raised on
+ * the old vector.
+ *
+ * So if the target cpu cannot handle that interrupt before
+ * the old vector is cleaned up, we get a spurious interrupt
+ * and in the worst case the ioapic irq line becomes stale.
+ *
+ * But in case of cpu hotplug this should be a non-issue
+ * because if the affinity update happens right before all
+ * cpus rendezvous in stop machine, there is no way that the
+ * interrupt can be blocked on the target cpu because all cpus
+ * loop first with interrupts enabled in stop machine, so the
+ * old vector is not yet cleaned up when the interrupt fires.
+ *
+ * So the only way to run into this issue is if the delivery
+ * of the interrupt on the apic/system bus would be delayed
+ * beyond the point where the target cpu disables interrupts
+ * in stop machine. I doubt that it can happen, but at least
+ * there is a theoretical chance. Virtualization might be
+ * able to expose this, but AFAICT the IOAPIC emulation is not
+ * as stupid as the real hardware.
+ *
+ * Anyway, there is nothing we can do about that at this point
+ * w/o refactoring the whole fixup_irq() business completely.
+ * We print at least the irq number and the old vector number,
+ * so we have the necessary information when a problem in that
+ * area arises.
+ */
+ pr_warn("IRQ fixup: irq %d move in progress, old vector %d\n",
+ irqd->irq, vector);
+ }
+ free_moved_vector(apicd);
+}
+
+#else
+# define apic_set_affinity NULL
+# define apic_force_complete_move NULL
+#endif
+
+static int apic_retrigger_irq(struct irq_data *irqd)
+{
+ struct apic_chip_data *apicd = apic_chip_data(irqd);
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&vector_lock, flags);
+ __apic_send_IPI(apicd->cpu, apicd->vector);
+ raw_spin_unlock_irqrestore(&vector_lock, flags);
+
+ return 1;
+}
+
+void apic_ack_irq(struct irq_data *irqd)
+{
+ irq_move_irq(irqd);
+ apic_eoi();
+}
+
+void apic_ack_edge(struct irq_data *irqd)
+{
+ irq_complete_move(irqd_cfg(irqd));
+ apic_ack_irq(irqd);
+}
+
+static void x86_vector_msi_compose_msg(struct irq_data *data,
+ struct msi_msg *msg)
+{
+ __irq_msi_compose_msg(irqd_cfg(data), msg, false);
+}
+
+static struct irq_chip lapic_controller = {
+ .name = "APIC",
+ .irq_ack = apic_ack_edge,
+ .irq_set_affinity = apic_set_affinity,
+ .irq_compose_msi_msg = x86_vector_msi_compose_msg,
+ .irq_force_complete_move = apic_force_complete_move,
+ .irq_retrigger = apic_retrigger_irq,
+};
+
+#ifdef CONFIG_SMP
+
+static void __vector_cleanup(struct vector_cleanup *cl, bool check_irr)
+{
+ struct apic_chip_data *apicd;
+ struct hlist_node *tmp;
+ bool rearm = false;
+
+ lockdep_assert_held(&vector_lock);
+
+ hlist_for_each_entry_safe(apicd, tmp, &cl->head, clist) {
+ unsigned int vector = apicd->prev_vector;
+
+ /*
+ * Paranoia: Check if the vector that needs to be cleaned
+ * up is registered in the APIC's IRR. That's clearly a
+ * hardware issue if the vector arrived on the old target
+ * _after_ interrupts were disabled above. Keep @apicd
+ * on the list and schedule the timer again to give the CPU
+ * a chance to handle the pending interrupt.
+ *
+ * Do not check IRR when called from lapic_offline(), because
+ * fixup_irqs() was just called to scan IRR for set bits and
+ * forward them to new destination CPUs via IPIs.
+ */
+ if (check_irr && is_vector_pending(vector)) {
+ pr_warn_once("Moved interrupt pending in old target APIC %u\n", apicd->irq);
+ rearm = true;
+ continue;
+ }
+ free_moved_vector(apicd);
+ }
+
+ /*
+ * Must happen under vector_lock to make the timer_pending() check
+ * in __vector_schedule_cleanup() race free against the rearm here.
+ */
+ if (rearm)
+ mod_timer(&cl->timer, jiffies + 1);
+}
+
+static void vector_cleanup_callback(struct timer_list *tmr)
+{
+ struct vector_cleanup *cl = container_of(tmr, typeof(*cl), timer);
+
+ /* Prevent vectors vanishing under us */
+ raw_spin_lock_irq(&vector_lock);
+ __vector_cleanup(cl, true);
+ raw_spin_unlock_irq(&vector_lock);
+}
+
+static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
+{
+ unsigned int cpu = apicd->prev_cpu;
+
+ raw_spin_lock(&vector_lock);
+ apicd->move_in_progress = 0;
+ if (cpu_online(cpu)) {
+ struct vector_cleanup *cl = per_cpu_ptr(&vector_cleanup, cpu);
+
+ hlist_add_head(&apicd->clist, &cl->head);
+
+ /*
+ * The lockless timer_pending() check is safe here. If it
+ * returns true, then the callback will observe this new
+ * apic data in the hlist as everything is serialized by
+ * vector lock.
+ *
+ * If it returns false then the timer is either not armed
+ * or the other CPU executes the callback, which again
+ * would be blocked on vector lock. Rearming it in the
+ * latter case makes it fire for nothing.
+ *
+ * This is also safe against the callback rearming the timer
+ * because that's serialized via vector lock too.
+ */
+ if (!timer_pending(&cl->timer)) {
+ cl->timer.expires = jiffies + 1;
+ add_timer_on(&cl->timer, cpu);
+ }
+ } else {
+ pr_warn("IRQ %u schedule cleanup for offline CPU %u\n", apicd->irq, cpu);
+ free_moved_vector(apicd);
+ }
+ raw_spin_unlock(&vector_lock);
+}
+
+void vector_schedule_cleanup(struct irq_cfg *cfg)
+{
+ struct apic_chip_data *apicd;
+
+ apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
+ if (apicd->move_in_progress)
+ __vector_schedule_cleanup(apicd);
+}
+
+void irq_complete_move(struct irq_cfg *cfg)
+{
+ struct apic_chip_data *apicd;
+
+ apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
+ if (likely(!apicd->move_in_progress))
+ return;
+
+ /*
+ * If the interrupt arrived on the new target CPU, cleanup the
+ * vector on the old target CPU. A vector check is not required
+ * because an interrupt can never move from one vector to another
+ * on the same CPU.
+ */
+ if (apicd->cpu == smp_processor_id())
+ __vector_schedule_cleanup(apicd);
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+/*
+ * Note, this is not accurate accounting, but at least good enough to
+ * prevent that the actual interrupt move will run out of vectors.
+ */
+int lapic_can_unplug_cpu(void)
+{
+ unsigned int rsvd, avl, tomove, cpu = smp_processor_id();
+ int ret = 0;
+
+ raw_spin_lock(&vector_lock);
+ tomove = irq_matrix_allocated(vector_matrix);
+ avl = irq_matrix_available(vector_matrix, true);
+ if (avl < tomove) {
+ pr_warn("CPU %u has %u vectors, %u available. Cannot disable CPU\n",
+ cpu, tomove, avl);
+ ret = -ENOSPC;
+ goto out;
+ }
+ rsvd = irq_matrix_reserved(vector_matrix);
+ if (avl < rsvd) {
+ pr_warn("Reserved vectors %u > available %u. IRQ request may fail\n",
+ rsvd, avl);
+ }
+out:
+ raw_spin_unlock(&vector_lock);
+ return ret;
+}
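+
+/*
+ * Worked example with invented numbers: if the outgoing CPU still has
+ * tomove == 12 vectors allocated but the remaining CPUs only have
+ * avl == 9 vectors available, the check above fails with -ENOSPC and
+ * the CPU cannot be unplugged.
+ */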
+#endif /* HOTPLUG_CPU */
+#endif /* SMP */
+
+static void __init print_APIC_field(int base)
+{
+ int i;
+
+ printk(KERN_DEBUG);
+
+ for (i = 0; i < 8; i++)
+ pr_cont("%08x", apic_read(base + i*0x10));
+
+ pr_cont("\n");
+}
+
+static void __init print_local_APIC(void *dummy)
+{
+ unsigned int i, v, ver, maxlvt;
+ u64 icr;
+
+ pr_debug("printing local APIC contents on CPU#%d/%d:\n",
+ smp_processor_id(), read_apic_id());
+ v = apic_read(APIC_ID);
+ pr_info("... APIC ID: %08x (%01x)\n", v, read_apic_id());
+ v = apic_read(APIC_LVR);
+ pr_info("... APIC VERSION: %08x\n", v);
+ ver = GET_APIC_VERSION(v);
+ maxlvt = lapic_get_maxlvt();
+
+ v = apic_read(APIC_TASKPRI);
+ pr_debug("... APIC TASKPRI: %08x (%02x)\n", v, v & APIC_TPRI_MASK);
+
+ /* !82489DX */
+ if (APIC_INTEGRATED(ver)) {
+ if (!APIC_XAPIC(ver)) {
+ v = apic_read(APIC_ARBPRI);
+ pr_debug("... APIC ARBPRI: %08x (%02x)\n",
+ v, v & APIC_ARBPRI_MASK);
+ }
+ v = apic_read(APIC_PROCPRI);
+ pr_debug("... APIC PROCPRI: %08x\n", v);
+ }
+
+ /*
+ * Remote read supported only in the 82489DX and local APIC for
+ * Pentium processors.
+ */
+ if (!APIC_INTEGRATED(ver) || maxlvt == 3) {
+ v = apic_read(APIC_RRR);
+ pr_debug("... APIC RRR: %08x\n", v);
+ }
+
+ v = apic_read(APIC_LDR);
+ pr_debug("... APIC LDR: %08x\n", v);
+ if (!x2apic_enabled()) {
+ v = apic_read(APIC_DFR);
+ pr_debug("... APIC DFR: %08x\n", v);
+ }
+ v = apic_read(APIC_SPIV);
+ pr_debug("... APIC SPIV: %08x\n", v);
+
+ pr_debug("... APIC ISR field:\n");
+ print_APIC_field(APIC_ISR);
+ pr_debug("... APIC TMR field:\n");
+ print_APIC_field(APIC_TMR);
+ pr_debug("... APIC IRR field:\n");
+ print_APIC_field(APIC_IRR);
+
+ /* !82489DX */
+ if (APIC_INTEGRATED(ver)) {
+ /* Due to the Pentium erratum 3AP. */
+ if (maxlvt > 3)
+ apic_write(APIC_ESR, 0);
+
+ v = apic_read(APIC_ESR);
+ pr_debug("... APIC ESR: %08x\n", v);
+ }
+
+ icr = apic_icr_read();
+ pr_debug("... APIC ICR: %08x\n", (u32)icr);
+ pr_debug("... APIC ICR2: %08x\n", (u32)(icr >> 32));
+
+ v = apic_read(APIC_LVTT);
+ pr_debug("... APIC LVTT: %08x\n", v);
+
+ if (maxlvt > 3) {
+ /* PC is LVT#4. */
+ v = apic_read(APIC_LVTPC);
+ pr_debug("... APIC LVTPC: %08x\n", v);
+ }
+ v = apic_read(APIC_LVT0);
+ pr_debug("... APIC LVT0: %08x\n", v);
+ v = apic_read(APIC_LVT1);
+ pr_debug("... APIC LVT1: %08x\n", v);
+
+ if (maxlvt > 2) {
+ /* ERR is LVT#3. */
+ v = apic_read(APIC_LVTERR);
+ pr_debug("... APIC LVTERR: %08x\n", v);
+ }
+
+ v = apic_read(APIC_TMICT);
+ pr_debug("... APIC TMICT: %08x\n", v);
+ v = apic_read(APIC_TMCCT);
+ pr_debug("... APIC TMCCT: %08x\n", v);
+ v = apic_read(APIC_TDCR);
+ pr_debug("... APIC TDCR: %08x\n", v);
+
+ if (boot_cpu_has(X86_FEATURE_EXTAPIC)) {
+ v = apic_read(APIC_EFEAT);
+ maxlvt = (v >> 16) & 0xff;
+ pr_debug("... APIC EFEAT: %08x\n", v);
+ v = apic_read(APIC_ECTRL);
+ pr_debug("... APIC ECTRL: %08x\n", v);
+ for (i = 0; i < maxlvt; i++) {
+ v = apic_read(APIC_EILVTn(i));
+ pr_debug("... APIC EILVT%d: %08x\n", i, v);
+ }
+ }
+ pr_cont("\n");
+}
+
+static void __init print_local_APICs(int maxcpu)
+{
+ int cpu;
+
+ if (!maxcpu)
+ return;
+
+ preempt_disable();
+ for_each_online_cpu(cpu) {
+ if (cpu >= maxcpu)
+ break;
+ smp_call_function_single(cpu, print_local_APIC, NULL, 1);
+ }
+ preempt_enable();
+}
+
+static void __init print_PIC(void)
+{
+ unsigned int v;
+ unsigned long flags;
+
+ if (!nr_legacy_irqs())
+ return;
+
+ pr_debug("\nprinting PIC contents\n");
+
+ raw_spin_lock_irqsave(&i8259A_lock, flags);
+
+ v = inb(0xa1) << 8 | inb(0x21);
+ pr_debug("... PIC IMR: %04x\n", v);
+
+ v = inb(0xa0) << 8 | inb(0x20);
+ pr_debug("... PIC IRR: %04x\n", v);
+
+ outb(0x0b, 0xa0);
+ outb(0x0b, 0x20);
+ v = inb(0xa0) << 8 | inb(0x20);
+ outb(0x0a, 0xa0);
+ outb(0x0a, 0x20);
+
+ raw_spin_unlock_irqrestore(&i8259A_lock, flags);
+
+ pr_debug("... PIC ISR: %04x\n", v);
+
+ v = inb(PIC_ELCR2) << 8 | inb(PIC_ELCR1);
+ pr_debug("... PIC ELCR: %04x\n", v);
+}
+
+static int show_lapic __initdata = 1;
+static __init int setup_show_lapic(char *arg)
+{
+ int num = -1;
+
+ if (strcmp(arg, "all") == 0) {
+ show_lapic = CONFIG_NR_CPUS;
+ } else {
+ get_option(&arg, &num);
+ if (num >= 0)
+ show_lapic = num;
+ }
+
+ return 1;
+}
+__setup("show_lapic=", setup_show_lapic);
+
+static int __init print_ICs(void)
+{
+ if (apic_verbosity == APIC_QUIET)
+ return 0;
+
+ print_PIC();
+
+ /* don't print out if apic is not there */
+ if (!boot_cpu_has(X86_FEATURE_APIC) && !apic_from_smp_config())
+ return 0;
+
+ print_local_APICs(show_lapic);
+ print_IO_APICs();
+
+ return 0;
+}
+
+late_initcall(print_ICs);
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index cf69c59f4910..7db83212effb 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -1,244 +1,261 @@
-#include <linux/threads.h>
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/cpuhotplug.h>
#include <linux/cpumask.h>
-#include <linux/string.h>
-#include <linux/kernel.h>
-#include <linux/ctype.h>
-#include <linux/init.h>
-#include <linux/dmar.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
-#include <asm/smp.h>
#include <asm/apic.h>
-#include <asm/ipi.h>
-static DEFINE_PER_CPU(u32, x86_cpu_to_logical_apicid);
+#include "local.h"
-static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
-{
- return x2apic_enabled();
-}
+#define apic_cluster(apicid) ((apicid) >> 4)
/*
- * need to use more than cpu 0, because we need more vectors when
- * MSI-X are used.
+ * __x2apic_send_IPI_mask() possibly needs to read
+ * x86_cpu_to_logical_apicid for all online cpus in a sequential way.
+ * Using a per-cpu variable would cost one cache line per cpu.
*/
-static const struct cpumask *x2apic_target_cpus(void)
-{
- return cpu_online_mask;
-}
+static u32 *x86_cpu_to_logical_apicid __read_mostly;
-/*
- * for now each logical cpu is in its own vector allocation domain.
- */
-static void x2apic_vector_allocation_domain(int cpu, struct cpumask *retmask)
+static DEFINE_PER_CPU(cpumask_var_t, ipi_mask);
+static DEFINE_PER_CPU_READ_MOSTLY(struct cpumask *, cluster_masks);
+
+static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
{
- cpumask_clear(retmask);
- cpumask_set_cpu(cpu, retmask);
+ return x2apic_enabled();
}
-static void
- __x2apic_send_IPI_dest(unsigned int apicid, int vector, unsigned int dest)
+static void x2apic_send_IPI(int cpu, int vector)
{
- unsigned long cfg;
+ u32 dest = x86_cpu_to_logical_apicid[cpu];
- cfg = __prepare_ICR(0, vector, dest);
-
- /*
- * send the IPI.
- */
- native_x2apic_icr_write(cfg, apicid);
+ /* x2apic MSRs are special and need a special fence: */
+ weak_wrmsr_fence();
+ __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
}
-/*
- * for now, we send the IPI's one by one in the cpumask.
- * TBD: Based on the cpu mask, we can send the IPI's to the cluster group
- * at once. We have 16 cpu's in a cluster. This will minimize IPI register
- * writes.
- */
-static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector)
+static void
+__x2apic_send_IPI_mask(const struct cpumask *mask, int vector, int apic_dest)
{
- unsigned long query_cpu;
+ unsigned int cpu, clustercpu;
+ struct cpumask *tmpmsk;
unsigned long flags;
+ u32 dest;
- x2apic_wrmsr_fence();
-
+ /* x2apic MSRs are special and need a special fence: */
+ weak_wrmsr_fence();
local_irq_save(flags);
- for_each_cpu(query_cpu, mask) {
- __x2apic_send_IPI_dest(
- per_cpu(x86_cpu_to_logical_apicid, query_cpu),
- vector, apic->dest_logical);
- }
- local_irq_restore(flags);
-}
-static void
- x2apic_send_IPI_mask_allbutself(const struct cpumask *mask, int vector)
-{
- unsigned long this_cpu = smp_processor_id();
- unsigned long query_cpu;
- unsigned long flags;
+ tmpmsk = this_cpu_cpumask_var_ptr(ipi_mask);
+ cpumask_copy(tmpmsk, mask);
+ /* If IPI should not be sent to self, clear current CPU */
+ if (apic_dest != APIC_DEST_ALLINC)
+ __cpumask_clear_cpu(smp_processor_id(), tmpmsk);
- x2apic_wrmsr_fence();
+ /* Collapse cpus in a cluster so a single IPI per cluster is sent */
+ for_each_cpu(cpu, tmpmsk) {
+ struct cpumask *cmsk = per_cpu(cluster_masks, cpu);
- local_irq_save(flags);
- for_each_cpu(query_cpu, mask) {
- if (query_cpu == this_cpu)
+ dest = 0;
+ for_each_cpu_and(clustercpu, tmpmsk, cmsk)
+ dest |= x86_cpu_to_logical_apicid[clustercpu];
+
+ if (!dest)
continue;
- __x2apic_send_IPI_dest(
- per_cpu(x86_cpu_to_logical_apicid, query_cpu),
- vector, apic->dest_logical);
+
+ __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
+ /* Remove cluster CPUs from tmpmask */
+ cpumask_andnot(tmpmsk, tmpmsk, cmsk);
}
+
local_irq_restore(flags);
}
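+
+/*
+ * Worked example (illustrative): three CPUs in cluster 2 with logical
+ * IDs 0x20001, 0x20002 and 0x20008 collapse above to a single
+ * destination of 0x20001 | 0x20002 | 0x20008 == 0x2000b, so one ICR
+ * write reaches all three instead of three separate IPIs.
+ */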
-static void x2apic_send_IPI_allbutself(int vector)
+static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector)
{
- unsigned long this_cpu = smp_processor_id();
- unsigned long query_cpu;
- unsigned long flags;
-
- x2apic_wrmsr_fence();
+ __x2apic_send_IPI_mask(mask, vector, APIC_DEST_ALLINC);
+}
- local_irq_save(flags);
- for_each_online_cpu(query_cpu) {
- if (query_cpu == this_cpu)
- continue;
- __x2apic_send_IPI_dest(
- per_cpu(x86_cpu_to_logical_apicid, query_cpu),
- vector, apic->dest_logical);
- }
- local_irq_restore(flags);
+static void
+x2apic_send_IPI_mask_allbutself(const struct cpumask *mask, int vector)
+{
+ __x2apic_send_IPI_mask(mask, vector, APIC_DEST_ALLBUT);
}
-static void x2apic_send_IPI_all(int vector)
+static u32 x2apic_calc_apicid(unsigned int cpu)
{
- x2apic_send_IPI_mask(cpu_online_mask, vector);
+ return x86_cpu_to_logical_apicid[cpu];
}
-static int x2apic_apic_id_registered(void)
+static void init_x2apic_ldr(void)
{
- return 1;
+ struct cpumask *cmsk = this_cpu_read(cluster_masks);
+
+ BUG_ON(!cmsk);
+
+ cpumask_set_cpu(smp_processor_id(), cmsk);
}
-static unsigned int x2apic_cpu_mask_to_apicid(const struct cpumask *cpumask)
+/*
+ * As an optimisation during boot, set the cluster_mask for all present
+ * CPUs at once, to prevent each of them having to iterate over the others
+ * to find the existing cluster_mask.
+ */
+static void prefill_clustermask(struct cpumask *cmsk, unsigned int cpu, u32 cluster)
{
- /*
- * We're using fixed IRQ delivery, can only return one logical APIC ID.
- * May as well be the first.
- */
- int cpu = cpumask_first(cpumask);
+ int cpu_i;
+
+ for_each_present_cpu(cpu_i) {
+ struct cpumask **cpu_cmsk = &per_cpu(cluster_masks, cpu_i);
+ u32 apicid = apic->cpu_present_to_apicid(cpu_i);
+
+ if (apicid == BAD_APICID || cpu_i == cpu || apic_cluster(apicid) != cluster)
+ continue;
- if ((unsigned)cpu < nr_cpu_ids)
- return per_cpu(x86_cpu_to_logical_apicid, cpu);
- else
- return BAD_APICID;
+ if (WARN_ON_ONCE(*cpu_cmsk == cmsk))
+ continue;
+
+ BUG_ON(*cpu_cmsk);
+ *cpu_cmsk = cmsk;
+ }
}
-static unsigned int
-x2apic_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
- const struct cpumask *andmask)
+static int alloc_clustermask(unsigned int cpu, u32 cluster, int node)
{
- int cpu;
+ struct cpumask *cmsk = NULL;
+ unsigned int cpu_i;
/*
- * We're using fixed IRQ delivery, can only return one logical APIC ID.
- * May as well be the first.
+ * At boot time, the CPU present mask is stable. The cluster mask is
+ * allocated for the first CPU in the cluster and propagated to all
+ * present siblings in the cluster. If the cluster mask is already set
+ * on entry to this function for a given CPU, there is nothing to do.
*/
- for_each_cpu_and(cpu, cpumask, andmask) {
- if (cpumask_test_cpu(cpu, cpu_online_mask))
- break;
- }
+ if (per_cpu(cluster_masks, cpu))
+ return 0;
- return per_cpu(x86_cpu_to_logical_apicid, cpu);
+ if (system_state < SYSTEM_RUNNING)
+ goto alloc;
+
+ /*
+ * On post boot hotplug for a CPU which was not present at boot time,
+ * iterate over all possible CPUs (even those which are not present
+ * any more) to find any existing cluster mask.
+ */
+ for_each_possible_cpu(cpu_i) {
+ u32 apicid = apic->cpu_present_to_apicid(cpu_i);
+
+ if (apicid != BAD_APICID && apic_cluster(apicid) == cluster) {
+ cmsk = per_cpu(cluster_masks, cpu_i);
+ /*
+ * If the cluster is already initialized, just store
+ * the mask and return. There's no need to propagate.
+ */
+ if (cmsk) {
+ per_cpu(cluster_masks, cpu) = cmsk;
+ return 0;
+ }
+ }
+ }
+ /*
+ * No CPU in the cluster has ever been initialized, so fall through to
+ * the boot time code which will also populate the cluster mask for any
+ * other CPU in the cluster which is (now) present.
+ */
+alloc:
+ cmsk = kzalloc_node(sizeof(*cmsk), GFP_KERNEL, node);
+ if (!cmsk)
+ return -ENOMEM;
+ per_cpu(cluster_masks, cpu) = cmsk;
+ prefill_clustermask(cmsk, cpu, cluster);
+
+ return 0;
}
-static unsigned int x2apic_cluster_phys_get_apic_id(unsigned long x)
+static int x2apic_prepare_cpu(unsigned int cpu)
{
- unsigned int id;
+ u32 phys_apicid = apic->cpu_present_to_apicid(cpu);
+ u32 cluster = apic_cluster(phys_apicid);
+ u32 logical_apicid = (cluster << 16) | (1 << (phys_apicid & 0xf));
+ int node = cpu_to_node(cpu);
- id = x;
- return id;
-}
+ x86_cpu_to_logical_apicid[cpu] = logical_apicid;
-static unsigned long set_apic_id(unsigned int id)
-{
- unsigned long x;
+ if (alloc_clustermask(cpu, cluster, node) < 0)
+ return -ENOMEM;
- x = id;
- return x;
-}
+ if (!zalloc_cpumask_var_node(&per_cpu(ipi_mask, cpu), GFP_KERNEL, node))
+ return -ENOMEM;
-static int x2apic_cluster_phys_pkg_id(int initial_apicid, int index_msb)
-{
- return initial_apicid >> index_msb;
+ return 0;
}
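+
+/*
+ * Worked example: a CPU with physical APIC ID 0x23 lands in cluster 2
+ * (0x23 >> 4) and gets the logical ID
+ * (2 << 16) | (1 << (0x23 & 0xf)) == 0x20008.
+ */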
-static void x2apic_send_IPI_self(int vector)
+static int x2apic_dead_cpu(unsigned int dead_cpu)
{
- apic_write(APIC_SELF_IPI, vector);
+ struct cpumask *cmsk = per_cpu(cluster_masks, dead_cpu);
+
+ if (cmsk)
+ cpumask_clear_cpu(dead_cpu, cmsk);
+ free_cpumask_var(per_cpu(ipi_mask, dead_cpu));
+ return 0;
}
-static void init_x2apic_ldr(void)
+static int x2apic_cluster_probe(void)
{
- int cpu = smp_processor_id();
-
- per_cpu(x86_cpu_to_logical_apicid, cpu) = apic_read(APIC_LDR);
+ u32 slots;
+
+ if (!x2apic_mode)
+ return 0;
+
+ slots = max_t(u32, L1_CACHE_BYTES/sizeof(u32), nr_cpu_ids);
+ x86_cpu_to_logical_apicid = kcalloc(slots, sizeof(u32), GFP_KERNEL);
+ if (!x86_cpu_to_logical_apicid)
+ return 0;
+
+ if (cpuhp_setup_state(CPUHP_X2APIC_PREPARE, "x86/x2apic:prepare",
+ x2apic_prepare_cpu, x2apic_dead_cpu) < 0) {
+ pr_err("Failed to register X2APIC_PREPARE\n");
+ kfree(x86_cpu_to_logical_apicid);
+ x86_cpu_to_logical_apicid = NULL;
+ return 0;
+ }
+ init_x2apic_ldr();
+ return 1;
}
-struct apic apic_x2apic_cluster = {
+static struct apic apic_x2apic_cluster __ro_after_init = {
.name = "cluster x2apic",
- .probe = NULL,
+ .probe = x2apic_cluster_probe,
.acpi_madt_oem_check = x2apic_acpi_madt_oem_check,
- .apic_id_registered = x2apic_apic_id_registered,
- .irq_delivery_mode = dest_LowestPrio,
- .irq_dest_mode = 1, /* logical */
+ .dest_mode_logical = true,
- .target_cpus = x2apic_target_cpus,
.disable_esr = 0,
- .dest_logical = APIC_DEST_LOGICAL,
- .check_apicid_used = NULL,
- .check_apicid_present = NULL,
- .vector_allocation_domain = x2apic_vector_allocation_domain,
.init_apic_ldr = init_x2apic_ldr,
-
- .ioapic_phys_id_map = NULL,
- .setup_apic_routing = NULL,
- .multi_timer_check = NULL,
- .apicid_to_node = NULL,
- .cpu_to_logical_apicid = NULL,
.cpu_present_to_apicid = default_cpu_present_to_apicid,
- .apicid_to_cpu_present = NULL,
- .setup_portio_remap = NULL,
- .check_phys_apicid_present = default_check_phys_apicid_present,
- .enable_apic_mode = NULL,
- .phys_pkg_id = x2apic_cluster_phys_pkg_id,
- .mps_oem_check = NULL,
- .get_apic_id = x2apic_cluster_phys_get_apic_id,
- .set_apic_id = set_apic_id,
- .apic_id_mask = 0xFFFFFFFFu,
+ .max_apic_id = UINT_MAX,
+ .x2apic_set_max_apicid = true,
+ .get_apic_id = x2apic_get_apic_id,
- .cpu_mask_to_apicid = x2apic_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = x2apic_cpu_mask_to_apicid_and,
+ .calc_dest_apicid = x2apic_calc_apicid,
+ .send_IPI = x2apic_send_IPI,
.send_IPI_mask = x2apic_send_IPI_mask,
.send_IPI_mask_allbutself = x2apic_send_IPI_mask_allbutself,
.send_IPI_allbutself = x2apic_send_IPI_allbutself,
.send_IPI_all = x2apic_send_IPI_all,
.send_IPI_self = x2apic_send_IPI_self,
-
- .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
- .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
- .wait_for_init_deassert = NULL,
- .smp_callin_clear_local_apic = NULL,
- .inquire_remote_apic = NULL,
+ .nmi_to_offline_cpu = true,
.read = native_apic_msr_read,
.write = native_apic_msr_write,
+ .eoi = native_apic_msr_eoi,
.icr_read = native_x2apic_icr_read,
.icr_write = native_x2apic_icr_write,
- .wait_icr_idle = native_x2apic_wait_icr_idle,
- .safe_wait_icr_idle = native_safe_x2apic_wait_icr_idle,
};
+
+apic_driver(apic_x2apic_cluster);
diff --git a/arch/x86/kernel/apic/x2apic_phys.c b/arch/x86/kernel/apic/x2apic_phys.c
index 8972f38c5ced..12d4c35547a6 100644
--- a/arch/x86/kernel/apic/x2apic_phys.c
+++ b/arch/x86/kernel/apic/x2apic_phys.c
@@ -1,105 +1,70 @@
-#include <linux/threads.h>
+// SPDX-License-Identifier: GPL-2.0
+
#include <linux/cpumask.h>
-#include <linux/string.h>
-#include <linux/kernel.h>
-#include <linux/ctype.h>
-#include <linux/init.h>
-#include <linux/dmar.h>
+#include <linux/acpi.h>
-#include <asm/smp.h>
-#include <asm/apic.h>
-#include <asm/ipi.h>
+#include "local.h"
int x2apic_phys;
-static int set_x2apic_phys_mode(char *arg)
-{
- x2apic_phys = 1;
- return 0;
-}
-early_param("x2apic_phys", set_x2apic_phys_mode);
+static struct apic apic_x2apic_phys;
+u32 x2apic_max_apicid __ro_after_init = UINT_MAX;
-static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
+void __init x2apic_set_max_apicid(u32 apicid)
{
- if (x2apic_phys)
- return x2apic_enabled();
- else
- return 0;
+ x2apic_max_apicid = apicid;
+ if (apic->x2apic_set_max_apicid)
+ apic->max_apic_id = apicid;
}
-/*
- * need to use more than cpu 0, because we need more vectors when
- * MSI-X are used.
- */
-static const struct cpumask *x2apic_target_cpus(void)
+static int __init set_x2apic_phys_mode(char *arg)
{
- return cpu_online_mask;
+ x2apic_phys = 1;
+ return 0;
}
+early_param("x2apic_phys", set_x2apic_phys_mode);
-static void x2apic_vector_allocation_domain(int cpu, struct cpumask *retmask)
+static bool x2apic_fadt_phys(void)
{
- cpumask_clear(retmask);
- cpumask_set_cpu(cpu, retmask);
+#ifdef CONFIG_ACPI
+ if ((acpi_gbl_FADT.header.revision >= FADT2_REVISION_ID) &&
+ (acpi_gbl_FADT.flags & ACPI_FADT_APIC_PHYSICAL)) {
+ printk(KERN_DEBUG "System requires x2apic physical mode\n");
+ return true;
+ }
+#endif
+ return false;
}
-static void __x2apic_send_IPI_dest(unsigned int apicid, int vector,
- unsigned int dest)
+static int x2apic_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
{
- unsigned long cfg;
-
- cfg = __prepare_ICR(0, vector, dest);
-
- /*
- * send the IPI.
- */
- native_x2apic_icr_write(cfg, apicid);
+ return x2apic_enabled() && (x2apic_phys || x2apic_fadt_phys());
}
-static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector)
+static void x2apic_send_IPI(int cpu, int vector)
{
- unsigned long query_cpu;
- unsigned long flags;
-
- x2apic_wrmsr_fence();
+ u32 dest = per_cpu(x86_cpu_to_apicid, cpu);
- local_irq_save(flags);
- for_each_cpu(query_cpu, mask) {
- __x2apic_send_IPI_dest(per_cpu(x86_cpu_to_apicid, query_cpu),
- vector, APIC_DEST_PHYSICAL);
- }
- local_irq_restore(flags);
+ /* x2apic MSRs are special and need a special fence: */
+ weak_wrmsr_fence();
+ __x2apic_send_IPI_dest(dest, vector, APIC_DEST_PHYSICAL);
}
static void
- x2apic_send_IPI_mask_allbutself(const struct cpumask *mask, int vector)
+__x2apic_send_IPI_mask(const struct cpumask *mask, int vector, int apic_dest)
{
- unsigned long this_cpu = smp_processor_id();
unsigned long query_cpu;
+ unsigned long this_cpu;
unsigned long flags;
- x2apic_wrmsr_fence();
+ /* x2apic MSRs are special and need a special fence: */
+ weak_wrmsr_fence();
local_irq_save(flags);
- for_each_cpu(query_cpu, mask) {
- if (query_cpu != this_cpu)
- __x2apic_send_IPI_dest(
- per_cpu(x86_cpu_to_apicid, query_cpu),
- vector, APIC_DEST_PHYSICAL);
- }
- local_irq_restore(flags);
-}
-
-static void x2apic_send_IPI_allbutself(int vector)
-{
- unsigned long this_cpu = smp_processor_id();
- unsigned long query_cpu;
- unsigned long flags;
- x2apic_wrmsr_fence();
-
- local_irq_save(flags);
- for_each_online_cpu(query_cpu) {
- if (query_cpu == this_cpu)
+ this_cpu = smp_processor_id();
+ for_each_cpu(query_cpu, mask) {
+ if (apic_dest == APIC_DEST_ALLBUT && this_cpu == query_cpu)
continue;
__x2apic_send_IPI_dest(per_cpu(x86_cpu_to_apicid, query_cpu),
vector, APIC_DEST_PHYSICAL);
@@ -107,127 +72,94 @@ static void x2apic_send_IPI_allbutself(int vector)
local_irq_restore(flags);
}
-static void x2apic_send_IPI_all(int vector)
+static void x2apic_send_IPI_mask(const struct cpumask *mask, int vector)
{
- x2apic_send_IPI_mask(cpu_online_mask, vector);
+ __x2apic_send_IPI_mask(mask, vector, APIC_DEST_ALLINC);
}
-static int x2apic_apic_id_registered(void)
+static void
+ x2apic_send_IPI_mask_allbutself(const struct cpumask *mask, int vector)
{
- return 1;
+ __x2apic_send_IPI_mask(mask, vector, APIC_DEST_ALLBUT);
}
-static unsigned int x2apic_cpu_mask_to_apicid(const struct cpumask *cpumask)
+static void __x2apic_send_IPI_shorthand(int vector, u32 which)
{
- /*
- * We're using fixed IRQ delivery, can only return one phys APIC ID.
- * May as well be the first.
- */
- int cpu = cpumask_first(cpumask);
-
- if ((unsigned)cpu < nr_cpu_ids)
- return per_cpu(x86_cpu_to_apicid, cpu);
- else
- return BAD_APICID;
+ unsigned long cfg = __prepare_ICR(which, vector, 0);
+
+ /* x2apic MSRs are special and need a special fence: */
+ weak_wrmsr_fence();
+ native_x2apic_icr_write(cfg, 0);
}
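+
+/*
+ * Illustrative note: with the destination shorthands, one MSR write
+ * such as
+ *
+ *	__x2apic_send_IPI_shorthand(RESCHEDULE_VECTOR, APIC_DEST_ALLBUT);
+ *
+ * replaces a loop of per-CPU ICR writes (RESCHEDULE_VECTOR serves here
+ * purely as an example vector).
+ */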
-static unsigned int
-x2apic_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
- const struct cpumask *andmask)
+void x2apic_send_IPI_allbutself(int vector)
{
- int cpu;
-
- /*
- * We're using fixed IRQ delivery, can only return one phys APIC ID.
- * May as well be the first.
- */
- for_each_cpu_and(cpu, cpumask, andmask) {
- if (cpumask_test_cpu(cpu, cpu_online_mask))
- break;
- }
-
- return per_cpu(x86_cpu_to_apicid, cpu);
+ __x2apic_send_IPI_shorthand(vector, APIC_DEST_ALLBUT);
}
-static unsigned int x2apic_phys_get_apic_id(unsigned long x)
+void x2apic_send_IPI_all(int vector)
{
- return x;
+ __x2apic_send_IPI_shorthand(vector, APIC_DEST_ALLINC);
}
-static unsigned long set_apic_id(unsigned int id)
+void x2apic_send_IPI_self(int vector)
{
- return id;
+ apic_write(APIC_SELF_IPI, vector);
}
-static int x2apic_phys_pkg_id(int initial_apicid, int index_msb)
+void __x2apic_send_IPI_dest(unsigned int apicid, int vector, unsigned int dest)
{
- return initial_apicid >> index_msb;
+ unsigned long cfg = __prepare_ICR(0, vector, dest);
+ native_x2apic_icr_write(cfg, apicid);
}
-static void x2apic_send_IPI_self(int vector)
+static int x2apic_phys_probe(void)
{
- apic_write(APIC_SELF_IPI, vector);
+ if (!x2apic_mode)
+ return 0;
+
+ if (x2apic_phys || x2apic_fadt_phys())
+ return 1;
+
+ return apic == &apic_x2apic_phys;
}
-static void init_x2apic_ldr(void)
+u32 x2apic_get_apic_id(u32 id)
{
+ return id;
}
-struct apic apic_x2apic_phys = {
+static struct apic apic_x2apic_phys __ro_after_init = {
.name = "physical x2apic",
- .probe = NULL,
+ .probe = x2apic_phys_probe,
.acpi_madt_oem_check = x2apic_acpi_madt_oem_check,
- .apic_id_registered = x2apic_apic_id_registered,
- .irq_delivery_mode = dest_Fixed,
- .irq_dest_mode = 0, /* physical */
+ .dest_mode_logical = false,
- .target_cpus = x2apic_target_cpus,
.disable_esr = 0,
- .dest_logical = 0,
- .check_apicid_used = NULL,
- .check_apicid_present = NULL,
-
- .vector_allocation_domain = x2apic_vector_allocation_domain,
- .init_apic_ldr = init_x2apic_ldr,
-
- .ioapic_phys_id_map = NULL,
- .setup_apic_routing = NULL,
- .multi_timer_check = NULL,
- .apicid_to_node = NULL,
- .cpu_to_logical_apicid = NULL,
+
.cpu_present_to_apicid = default_cpu_present_to_apicid,
- .apicid_to_cpu_present = NULL,
- .setup_portio_remap = NULL,
- .check_phys_apicid_present = default_check_phys_apicid_present,
- .enable_apic_mode = NULL,
- .phys_pkg_id = x2apic_phys_pkg_id,
- .mps_oem_check = NULL,
- .get_apic_id = x2apic_phys_get_apic_id,
- .set_apic_id = set_apic_id,
- .apic_id_mask = 0xFFFFFFFFu,
+ .max_apic_id = UINT_MAX,
+ .x2apic_set_max_apicid = true,
+ .get_apic_id = x2apic_get_apic_id,
- .cpu_mask_to_apicid = x2apic_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = x2apic_cpu_mask_to_apicid_and,
+ .calc_dest_apicid = apic_default_calc_apicid,
+ .send_IPI = x2apic_send_IPI,
.send_IPI_mask = x2apic_send_IPI_mask,
.send_IPI_mask_allbutself = x2apic_send_IPI_mask_allbutself,
.send_IPI_allbutself = x2apic_send_IPI_allbutself,
.send_IPI_all = x2apic_send_IPI_all,
.send_IPI_self = x2apic_send_IPI_self,
-
- .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
- .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
- .wait_for_init_deassert = NULL,
- .smp_callin_clear_local_apic = NULL,
- .inquire_remote_apic = NULL,
+ .nmi_to_offline_cpu = true,
.read = native_apic_msr_read,
.write = native_apic_msr_write,
+ .eoi = native_apic_msr_eoi,
.icr_read = native_x2apic_icr_read,
.icr_write = native_x2apic_icr_write,
- .wait_icr_idle = native_x2apic_wait_icr_idle,
- .safe_wait_icr_idle = native_safe_x2apic_wait_icr_idle,
};
+
+apic_driver(apic_x2apic_phys);
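/*
 * Illustrative sketch, not from the patch above: the IPI paths in this
 * file all funnel into native_x2apic_icr_write(cfg, apicid). This is a
 * minimal sketch of the 64-bit value that write produces, assuming the
 * architectural x2APIC ICR layout (destination APIC ID in bits 63:32,
 * command word in bits 31:0). The helper name is hypothetical.
 */
static inline unsigned long long compose_x2apic_icr(unsigned int apicid,
						    unsigned int cfg)
{
	return ((unsigned long long)apicid << 32) | cfg;
}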
diff --git a/arch/x86/kernel/apic/x2apic_savic.c b/arch/x86/kernel/apic/x2apic_savic.c
new file mode 100644
index 000000000000..dbc5678bc3b6
--- /dev/null
+++ b/arch/x86/kernel/apic/x2apic_savic.c
@@ -0,0 +1,428 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD Secure AVIC Support (SEV-SNP Guests)
+ *
+ * Copyright (C) 2024 Advanced Micro Devices, Inc.
+ *
+ * Author: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
+ */
+
+#include <linux/cc_platform.h>
+#include <linux/cpumask.h>
+#include <linux/percpu-defs.h>
+#include <linux/align.h>
+
+#include <asm/apic.h>
+#include <asm/sev.h>
+
+#include "local.h"
+
+struct secure_avic_page {
+ u8 regs[PAGE_SIZE];
+} __aligned(PAGE_SIZE);
+
+static struct secure_avic_page __percpu *savic_page __ro_after_init;
+
+static int savic_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
+{
+ return x2apic_enabled() && cc_platform_has(CC_ATTR_SNP_SECURE_AVIC);
+}
+
+static inline void *get_reg_bitmap(unsigned int cpu, unsigned int offset)
+{
+ return &per_cpu_ptr(savic_page, cpu)->regs[offset];
+}
+
+static inline void update_vector(unsigned int cpu, unsigned int offset,
+ unsigned int vector, bool set)
+{
+ void *bitmap = get_reg_bitmap(cpu, offset);
+
+ if (set)
+ apic_set_vector(vector, bitmap);
+ else
+ apic_clear_vector(vector, bitmap);
+}
+
+#define SAVIC_ALLOWED_IRR 0x204
+
+/*
+ * When Secure AVIC is enabled, RDMSR/WRMSR of the APIC registers
+ * result in a #VC exception (for non-accelerated register accesses)
+ * with the VMEXIT_AVIC_NOACCEL error code. The #VC exception handler
+ * can read/write the x2APIC register in the guest APIC backing page.
+ *
+ * Since doing this would increase the latency of accessing x2APIC
+ * registers, instead of doing RDMSR/WRMSR based accesses and
+ * handling the APIC register reads/writes in the #VC exception handler,
+ * the read() and write() callbacks directly read/write the APIC register
+ * from/to the vCPU's APIC backing page.
+ */
+static u32 savic_read(u32 reg)
+{
+ void *ap = this_cpu_ptr(savic_page);
+
+ switch (reg) {
+ case APIC_LVTT:
+ case APIC_TMICT:
+ case APIC_TMCCT:
+ case APIC_TDCR:
+ case APIC_LVTTHMR:
+ case APIC_LVTPC:
+ case APIC_LVT0:
+ case APIC_LVT1:
+ case APIC_LVTERR:
+ return savic_ghcb_msr_read(reg);
+ case APIC_ID:
+ case APIC_LVR:
+ case APIC_TASKPRI:
+ case APIC_ARBPRI:
+ case APIC_PROCPRI:
+ case APIC_LDR:
+ case APIC_SPIV:
+ case APIC_ESR:
+ case APIC_EFEAT:
+ case APIC_ECTRL:
+ case APIC_SEOI:
+ case APIC_IER:
+ case APIC_EILVTn(0) ... APIC_EILVTn(3):
+ return apic_get_reg(ap, reg);
+ case APIC_ICR:
+ return (u32)apic_get_reg64(ap, reg);
+ case APIC_ISR ... APIC_ISR + 0x70:
+ case APIC_TMR ... APIC_TMR + 0x70:
+ if (WARN_ONCE(!IS_ALIGNED(reg, 16),
+ "APIC register read offset 0x%x not aligned at 16 bytes", reg))
+ return 0;
+ return apic_get_reg(ap, reg);
+ /* IRR and ALLOWED_IRR offset range */
+ case APIC_IRR ... APIC_IRR + 0x74:
+ /*
+ * Valid APIC_IRR/SAVIC_ALLOWED_IRR registers sit at 16-byte strides from
+ * their respective base offset. APIC_IRRs are in the range
+ *
+ * (0x200, 0x210, ..., 0x270)
+ *
+ * while the SAVIC_ALLOWED_IRR range starts 4 bytes later, in the range
+ *
+ * (0x204, 0x214, ..., 0x274).
+ *
+ * Filter out everything else.
+ */
+ if (WARN_ONCE(!(IS_ALIGNED(reg, 16) ||
+ IS_ALIGNED(reg - 4, 16)),
+ "Misaligned APIC_IRR/ALLOWED_IRR APIC register read offset 0x%x", reg))
+ return 0;
+ return apic_get_reg(ap, reg);
+ default:
+ pr_err("Error reading unknown Secure AVIC reg offset 0x%x\n", reg);
+ return 0;
+ }
+}
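/*
 * Illustrative sketch, not from the patch above: the IRR/ISR/TMR
 * families handled in savic_read() are 256-bit vectors stored as eight
 * 32-bit words placed at 16-byte strides from their base offset, so a
 * vector maps to a (register, bit) pair as sketched below. The helper
 * names are hypothetical.
 */
static inline unsigned int savic_vec_reg(unsigned int base, unsigned int vec)
{
	return base + (vec / 32) * 16;	/* register word holding 'vec' */
}

static inline unsigned int savic_vec_bit(unsigned int vec)
{
	return vec % 32;		/* bit within that word */
}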
+
+#define SAVIC_NMI_REQ 0x278
+
+/*
+ * On a WRMSR to the APIC_SELF_IPI register by the guest, Secure AVIC hardware
+ * updates the APIC_IRR in the APIC backing page of the vCPU. In addition,
+ * hardware evaluates the new APIC_IRR update for interrupt injection to
+ * the vCPU. So, self IPIs are hardware-accelerated.
+ */
+static inline void self_ipi_reg_write(unsigned int vector)
+{
+ native_apic_msr_write(APIC_SELF_IPI, vector);
+}
+
+static void send_ipi_dest(unsigned int cpu, unsigned int vector, bool nmi)
+{
+ if (nmi)
+ apic_set_reg(per_cpu_ptr(savic_page, cpu), SAVIC_NMI_REQ, 1);
+ else
+ update_vector(cpu, APIC_IRR, vector, true);
+}
+
+static void send_ipi_allbut(unsigned int vector, bool nmi)
+{
+ unsigned int cpu, src_cpu;
+
+ guard(irqsave)();
+
+ src_cpu = raw_smp_processor_id();
+
+ for_each_cpu(cpu, cpu_online_mask) {
+ if (cpu == src_cpu)
+ continue;
+ send_ipi_dest(cpu, vector, nmi);
+ }
+}
+
+static inline void self_ipi(unsigned int vector, bool nmi)
+{
+ u32 icr_low = APIC_SELF_IPI | vector;
+
+ if (nmi)
+ icr_low |= APIC_DM_NMI;
+
+ native_x2apic_icr_write(icr_low, 0);
+}
+
+static void savic_icr_write(u32 icr_low, u32 icr_high)
+{
+ unsigned int dsh, vector;
+ u64 icr_data;
+ bool nmi;
+
+ dsh = icr_low & APIC_DEST_ALLBUT;
+ vector = icr_low & APIC_VECTOR_MASK;
+ nmi = ((icr_low & APIC_DM_FIXED_MASK) == APIC_DM_NMI);
+
+ switch (dsh) {
+ case APIC_DEST_SELF:
+ self_ipi(vector, nmi);
+ break;
+ case APIC_DEST_ALLINC:
+ self_ipi(vector, nmi);
+ fallthrough;
+ case APIC_DEST_ALLBUT:
+ send_ipi_allbut(vector, nmi);
+ break;
+ default:
+ send_ipi_dest(icr_high, vector, nmi);
+ break;
+ }
+
+ icr_data = ((u64)icr_high) << 32 | icr_low;
+ if (dsh != APIC_DEST_SELF)
+ savic_ghcb_msr_write(APIC_ICR, icr_data);
+ apic_set_reg64(this_cpu_ptr(savic_page), APIC_ICR, icr_data);
+}
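/*
 * Illustrative sketch, not from the patch above: the field extraction
 * that savic_icr_write() performs, spelled out with the <asm/apicdef.h>
 * masks. For example, icr_low = 0x000c04f2 decodes as the
 * APIC_DEST_ALLBUT shorthand with NMI delivery mode. The function name
 * is hypothetical.
 */
static inline void sketch_decode_icr_low(u32 icr_low)
{
	u32 dsh = icr_low & APIC_DEST_ALLBUT;		/* shorthand bits   */
	u32 vector = icr_low & APIC_VECTOR_MASK;	/* vector, bits 7:0 */
	bool nmi = (icr_low & APIC_DM_FIXED_MASK) == APIC_DM_NMI;

	(void)dsh; (void)vector; (void)nmi;
}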
+
+static void savic_write(u32 reg, u32 data)
+{
+ void *ap = this_cpu_ptr(savic_page);
+
+ switch (reg) {
+ case APIC_LVTT:
+ case APIC_TMICT:
+ case APIC_TDCR:
+ case APIC_LVT0:
+ case APIC_LVT1:
+ case APIC_LVTTHMR:
+ case APIC_LVTPC:
+ case APIC_LVTERR:
+ savic_ghcb_msr_write(reg, data);
+ break;
+ case APIC_TASKPRI:
+ case APIC_EOI:
+ case APIC_SPIV:
+ case SAVIC_NMI_REQ:
+ case APIC_ESR:
+ case APIC_ECTRL:
+ case APIC_SEOI:
+ case APIC_IER:
+ case APIC_EILVTn(0) ... APIC_EILVTn(3):
+ apic_set_reg(ap, reg, data);
+ break;
+ case APIC_ICR:
+ savic_icr_write(data, 0);
+ break;
+ case APIC_SELF_IPI:
+ self_ipi_reg_write(data);
+ break;
+ /* ALLOWED_IRR offsets are writable */
+ case SAVIC_ALLOWED_IRR ... SAVIC_ALLOWED_IRR + 0x70:
+ if (IS_ALIGNED(reg - 4, 16)) {
+ apic_set_reg(ap, reg, data);
+ break;
+ }
+ fallthrough;
+ default:
+ pr_err("Error writing unknown Secure AVIC reg offset 0x%x\n", reg);
+ }
+}
+
+static void send_ipi(u32 dest, unsigned int vector, unsigned int dsh)
+{
+ unsigned int icr_low;
+
+ icr_low = __prepare_ICR(dsh, vector, APIC_DEST_PHYSICAL);
+ savic_icr_write(icr_low, dest);
+}
+
+static void savic_send_ipi(int cpu, int vector)
+{
+ u32 dest = per_cpu(x86_cpu_to_apicid, cpu);
+
+ send_ipi(dest, vector, 0);
+}
+
+static void send_ipi_mask(const struct cpumask *mask, unsigned int vector, bool excl_self)
+{
+ unsigned int cpu, this_cpu;
+
+ guard(irqsave)();
+
+ this_cpu = raw_smp_processor_id();
+
+ for_each_cpu(cpu, mask) {
+ if (excl_self && cpu == this_cpu)
+ continue;
+ send_ipi(per_cpu(x86_cpu_to_apicid, cpu), vector, 0);
+ }
+}
+
+static void savic_send_ipi_mask(const struct cpumask *mask, int vector)
+{
+ send_ipi_mask(mask, vector, false);
+}
+
+static void savic_send_ipi_mask_allbutself(const struct cpumask *mask, int vector)
+{
+ send_ipi_mask(mask, vector, true);
+}
+
+static void savic_send_ipi_allbutself(int vector)
+{
+ send_ipi(0, vector, APIC_DEST_ALLBUT);
+}
+
+static void savic_send_ipi_all(int vector)
+{
+ send_ipi(0, vector, APIC_DEST_ALLINC);
+}
+
+static void savic_send_ipi_self(int vector)
+{
+ self_ipi_reg_write(vector);
+}
+
+static void savic_update_vector(unsigned int cpu, unsigned int vector, bool set)
+{
+ update_vector(cpu, SAVIC_ALLOWED_IRR, vector, set);
+}
+
+static void savic_eoi(void)
+{
+ unsigned int cpu;
+ int vec;
+
+ cpu = raw_smp_processor_id();
+ vec = apic_find_highest_vector(get_reg_bitmap(cpu, APIC_ISR));
+ if (WARN_ONCE(vec == -1, "EOI write while no active interrupt in APIC_ISR"))
+ return;
+
+ /* Is this a level-triggered interrupt? */
+ if (apic_test_vector(vec, get_reg_bitmap(cpu, APIC_TMR))) {
+ update_vector(cpu, APIC_ISR, vec, false);
+ /*
+ * Propagate the EOI write to the hypervisor for level-triggered
+ * interrupts. Returning to the guest from the GHCB protocol event
+ * takes care of re-evaluating the interrupt state.
+ */
+ savic_ghcb_msr_write(APIC_EOI, 0);
+ } else {
+ /*
+ * Hardware clears APIC_ISR and re-evaluates the interrupt state
+ * to determine if there is any pending interrupt which can be
+ * delivered to CPU.
+ */
+ native_apic_msr_eoi();
+ }
+}
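/*
 * Illustrative sketch, not from the patch above: the behaviour
 * savic_eoi() assumes from apic_find_highest_vector(), i.e. scan the
 * 256-bit bitmap from the top and return the highest vector whose bit
 * is set, or -1 if none. This sketch takes a plain 8-word array; the
 * real accessor walks words at 16-byte strides in the backing page.
 * The name is hypothetical.
 */
static int sketch_find_highest_vector(const u32 words[8])
{
	int i;

	for (i = 7; i >= 0; i--) {
		if (words[i])
			return i * 32 + fls(words[i]) - 1;
	}
	return -1;
}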
+
+static void savic_teardown(void)
+{
+ /* Disable Secure AVIC */
+ native_wrmsrq(MSR_AMD64_SAVIC_CONTROL, 0);
+ savic_unregister_gpa(NULL);
+}
+
+static void savic_setup(void)
+{
+ void *ap = this_cpu_ptr(savic_page);
+ enum es_result res;
+ unsigned long gpa;
+
+ /*
+ * Before Secure AVIC is enabled, APIC MSR reads are intercepted.
+ * APIC_ID MSR read returns the value from the hypervisor.
+ */
+ apic_set_reg(ap, APIC_ID, native_apic_msr_read(APIC_ID));
+
+ gpa = __pa(ap);
+
+ /*
+ * The NPT entry for a vCPU's APIC backing page must always be
+ * present when the vCPU is running in order for Secure AVIC to
+ * function. A VMEXIT_BUSY is returned on VMRUN and the vCPU cannot
+ * be resumed if the NPT entry for the APIC backing page is not
+ * present. Notify the hypervisor of the GPA of the vCPU's APIC
+ * backing page by calling savic_register_gpa(). Before executing
+ * VMRUN, the hypervisor makes use of this information to make sure
+ * the APIC backing page is mapped in NPT.
+ */
+ res = savic_register_gpa(gpa);
+ if (res != ES_OK)
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_SAVIC_FAIL);
+
+ native_wrmsrq(MSR_AMD64_SAVIC_CONTROL,
+ gpa | MSR_AMD64_SAVIC_EN | MSR_AMD64_SAVIC_ALLOWEDNMI);
+}
+
+static int savic_probe(void)
+{
+ if (!cc_platform_has(CC_ATTR_SNP_SECURE_AVIC))
+ return 0;
+
+ if (!x2apic_mode) {
+ pr_err("Secure AVIC enabled in non x2APIC mode\n");
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_SAVIC_FAIL);
+ /* unreachable */
+ }
+
+ savic_page = alloc_percpu(struct secure_avic_page);
+ if (!savic_page)
+ sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_SAVIC_FAIL);
+
+ return 1;
+}
+
+static struct apic apic_x2apic_savic __ro_after_init = {
+
+ .name = "secure avic x2apic",
+ .probe = savic_probe,
+ .acpi_madt_oem_check = savic_acpi_madt_oem_check,
+ .setup = savic_setup,
+ .teardown = savic_teardown,
+
+ .dest_mode_logical = false,
+
+ .disable_esr = 0,
+
+ .cpu_present_to_apicid = default_cpu_present_to_apicid,
+
+ .max_apic_id = UINT_MAX,
+ .x2apic_set_max_apicid = true,
+ .get_apic_id = x2apic_get_apic_id,
+
+ .calc_dest_apicid = apic_default_calc_apicid,
+
+ .send_IPI = savic_send_ipi,
+ .send_IPI_mask = savic_send_ipi_mask,
+ .send_IPI_mask_allbutself = savic_send_ipi_mask_allbutself,
+ .send_IPI_allbutself = savic_send_ipi_allbutself,
+ .send_IPI_all = savic_send_ipi_all,
+ .send_IPI_self = savic_send_ipi_self,
+
+ .nmi_to_offline_cpu = true,
+
+ .read = savic_read,
+ .write = savic_write,
+ .eoi = savic_eoi,
+ .icr_read = native_x2apic_icr_read,
+ .icr_write = savic_icr_write,
+
+ .update_vector = savic_update_vector,
+};
+
+apic_driver(apic_x2apic_savic);
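/*
 * Editorial sketch, not from the patch above: apic_driver() places a
 * pointer to the driver in a dedicated section that the APIC core scans
 * at boot; the first driver whose ->probe() returns nonzero is selected.
 * A sketch of the shape of that selection loop, with hypothetical names:
 */
static struct apic *sketch_probe_apic(struct apic **drivers, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (drivers[i]->probe && drivers[i]->probe())
			return drivers[i];	/* first match wins */
	}
	return NULL;
}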
diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c b/arch/x86/kernel/apic/x2apic_uv_x.c
index e46f98f36e31..15209f220e1f 100644
--- a/arch/x86/kernel/apic/x2apic_uv_x.c
+++ b/arch/x86/kernel/apic/x2apic_uv_x.c
@@ -5,48 +5,89 @@
*
* SGI UV APIC functions (note: not an Intel compatible APIC)
*
- * Copyright (C) 2007-2009 Silicon Graphics, Inc. All rights reserved.
+ * (C) Copyright 2020 Hewlett Packard Enterprise Development LP
+ * Copyright (C) 2007-2014 Silicon Graphics, Inc. All rights reserved.
*/
+#include <linux/crash_dump.h>
+#include <linux/cpuhotplug.h>
#include <linux/cpumask.h>
-#include <linux/hardirq.h>
#include <linux/proc_fs.h>
-#include <linux/threads.h>
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/string.h>
-#include <linux/ctype.h>
-#include <linux/sched.h>
-#include <linux/timer.h>
-#include <linux/slab.h>
-#include <linux/cpu.h>
-#include <linux/init.h>
-#include <linux/io.h>
+#include <linux/memory.h>
+#include <linux/export.h>
#include <linux/pci.h>
-#include <linux/kdebug.h>
+#include <linux/acpi.h>
+#include <linux/efi.h>
+#include <asm/e820/api.h>
#include <asm/uv/uv_mmrs.h>
#include <asm/uv/uv_hub.h>
-#include <asm/current.h>
-#include <asm/pgtable.h>
#include <asm/uv/bios.h>
#include <asm/uv/uv.h>
#include <asm/apic.h>
-#include <asm/ipi.h>
-#include <asm/smp.h>
-#include <asm/x86_init.h>
-DEFINE_PER_CPU(int, x2apic_extra_bits);
+#include "local.h"
+
+static enum uv_system_type uv_system_type;
+static int uv_hubbed_system;
+static int uv_hubless_system;
+static u64 gru_start_paddr, gru_end_paddr;
+static union uvh_apicid uvh_apicid;
+static int uv_node_id;
+
+/* Unpack AT/OEM/TABLE IDs into NUL-terminated strings */
+static u8 uv_archtype[UV_AT_SIZE + 1];
+static u8 oem_id[ACPI_OEM_ID_SIZE + 1];
+static u8 oem_table_id[ACPI_OEM_TABLE_ID_SIZE + 1];
+
+/* Information derived from CPUID and some UV MMRs */
+static struct {
+ unsigned int apicid_shift;
+ unsigned int apicid_mask;
+ unsigned int socketid_shift; /* aka pnode_shift for UV2/3 */
+ unsigned int pnode_mask;
+ unsigned int nasid_shift;
+ unsigned int gpa_shift;
+ unsigned int gnode_shift;
+ unsigned int m_skt;
+ unsigned int n_skt;
+} uv_cpuid;
+
+static int uv_min_hub_revision_id;
+
+static struct apic apic_x2apic_uv_x;
+static struct uv_hub_info_s uv_hub_info_node0;
+
+/* Set this to use hardware error handler instead of kernel panic: */
+static int disable_uv_undefined_panic = 1;
+
+unsigned long uv_undefined(char *str)
+{
+ if (likely(!disable_uv_undefined_panic))
+ panic("UV: error: undefined MMR: %s\n", str);
+ else
+ pr_crit("UV: error: undefined MMR: %s\n", str);
+
+ /* Cause a machine fault: */
+ return ~0ul;
+}
+EXPORT_SYMBOL(uv_undefined);
-#define PR_DEVEL(fmt, args...) pr_devel("%s: " fmt, __func__, args)
+static unsigned long __init uv_early_read_mmr(unsigned long addr)
+{
+ unsigned long val, *mmr;
+
+ mmr = early_ioremap(UV_LOCAL_MMR_BASE | addr, sizeof(*mmr));
+ val = *mmr;
+ early_iounmap(mmr, sizeof(*mmr));
-static enum uv_system_type uv_system_type;
-static u64 gru_start_paddr, gru_end_paddr;
-int uv_min_hub_revision_id;
-EXPORT_SYMBOL_GPL(uv_min_hub_revision_id);
-static DEFINE_SPINLOCK(uv_nmi_lock);
+ return val;
+}
static inline bool is_GRU_range(u64 start, u64 end)
{
+ if (!gru_start_paddr)
+ return false;
+
return start >= gru_start_paddr && end <= gru_end_paddr;
}
@@ -55,65 +96,415 @@ static bool uv_is_untracked_pat_range(u64 start, u64 end)
return is_ISA_range(start, end) || is_GRU_range(start, end);
}
-static int early_get_nodeid(void)
+static void __init early_get_pnodeid(void)
+{
+ int pnode;
+
+ uv_cpuid.m_skt = 0;
+ if (UVH_RH10_GAM_ADDR_MAP_CONFIG) {
+ union uvh_rh10_gam_addr_map_config_u m_n_config;
+
+ m_n_config.v = uv_early_read_mmr(UVH_RH10_GAM_ADDR_MAP_CONFIG);
+ uv_cpuid.n_skt = m_n_config.s.n_skt;
+ uv_cpuid.nasid_shift = 0;
+ } else if (UVH_RH_GAM_ADDR_MAP_CONFIG) {
+ union uvh_rh_gam_addr_map_config_u m_n_config;
+
+ m_n_config.v = uv_early_read_mmr(UVH_RH_GAM_ADDR_MAP_CONFIG);
+ uv_cpuid.n_skt = m_n_config.s.n_skt;
+ if (is_uv(UV3))
+ uv_cpuid.m_skt = m_n_config.s3.m_skt;
+ if (is_uv(UV2))
+ uv_cpuid.m_skt = m_n_config.s2.m_skt;
+ uv_cpuid.nasid_shift = 1;
+ } else {
+ unsigned long GAM_ADDR_MAP_CONFIG = 0;
+
+ WARN(GAM_ADDR_MAP_CONFIG == 0,
+ "UV: WARN: GAM_ADDR_MAP_CONFIG is not available\n");
+ uv_cpuid.n_skt = 0;
+ uv_cpuid.nasid_shift = 0;
+ }
+
+ if (is_uv(UV4|UVY))
+ uv_cpuid.gnode_shift = 2; /* min partition is 4 sockets */
+
+ uv_cpuid.pnode_mask = (1 << uv_cpuid.n_skt) - 1;
+ pnode = (uv_node_id >> uv_cpuid.nasid_shift) & uv_cpuid.pnode_mask;
+ uv_cpuid.gpa_shift = 46; /* Default unless changed */
+
+ pr_info("UV: n_skt:%d pnmsk:%x pn:%x\n",
+ uv_cpuid.n_skt, uv_cpuid.pnode_mask, pnode);
+}
+
+/* Running on a UV Hubbed system, determine which UV Hub Type it is */
+static int __init early_set_hub_type(void)
{
union uvh_node_id_u node_id;
- unsigned long *mmr;
- mmr = early_ioremap(UV_LOCAL_MMR_BASE | UVH_NODE_ID, sizeof(*mmr));
- node_id.v = *mmr;
- early_iounmap(mmr, sizeof(*mmr));
+ /*
+ * The NODE_ID MMR is always at offset 0.
+ * Contains the chip part # + revision.
+ * The node_id field started out as 15 bits;
+ * it is now 7, with the upper 8 masked to 0.
+ * All blades/nodes have the same part # and hub revision.
+ */
+ node_id.v = uv_early_read_mmr(UVH_NODE_ID);
+ uv_node_id = node_id.sx.node_id;
+
+ switch (node_id.s.part_number) {
- /* Currently, all blades have same revision number */
- uv_min_hub_revision_id = node_id.s.revision;
+ case UV5_HUB_PART_NUMBER:
+ uv_min_hub_revision_id = node_id.s.revision
+ + UV5_HUB_REVISION_BASE;
+ uv_hub_type_set(UV5);
+ break;
- return node_id.s.node_id;
+ /* UV4/4A only have a revision difference */
+ case UV4_HUB_PART_NUMBER:
+ uv_min_hub_revision_id = node_id.s.revision
+ + UV4_HUB_REVISION_BASE - 1;
+ uv_hub_type_set(UV4);
+ if (uv_min_hub_revision_id == UV4A_HUB_REVISION_BASE)
+ uv_hub_type_set(UV4|UV4A);
+ break;
+
+ case UV3_HUB_PART_NUMBER:
+ case UV3_HUB_PART_NUMBER_X:
+ uv_min_hub_revision_id = node_id.s.revision
+ + UV3_HUB_REVISION_BASE;
+ uv_hub_type_set(UV3);
+ break;
+
+ case UV2_HUB_PART_NUMBER:
+ case UV2_HUB_PART_NUMBER_X:
+ uv_min_hub_revision_id = node_id.s.revision
+ + UV2_HUB_REVISION_BASE - 1;
+ uv_hub_type_set(UV2);
+ break;
+
+ default:
+ return 0;
+ }
+
+ pr_info("UV: part#:%x rev:%d rev_id:%d UVtype:0x%x\n",
+ node_id.s.part_number, node_id.s.revision,
+ uv_min_hub_revision_id, is_uv(~0));
+
+ return 1;
}
-static int __init uv_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
+static void __init uv_tsc_check_sync(void)
{
- int nodeid;
+ u64 mmr;
+ int sync_state;
+ int mmr_shift;
+ char *state;
+
+ /* UV5 guarantees synced TSCs; do not zero TSC_ADJUST */
+ if (!is_uv(UV2|UV3|UV4)) {
+ mark_tsc_async_resets("UV5+");
+ return;
+ }
- if (!strcmp(oem_id, "SGI")) {
- nodeid = early_get_nodeid();
- x86_platform.is_untracked_pat_range = uv_is_untracked_pat_range;
- x86_platform.nmi_init = uv_nmi_init;
- if (!strcmp(oem_table_id, "UVL"))
- uv_system_type = UV_LEGACY_APIC;
- else if (!strcmp(oem_table_id, "UVX"))
- uv_system_type = UV_X2APIC;
- else if (!strcmp(oem_table_id, "UVH")) {
- __get_cpu_var(x2apic_extra_bits) =
- nodeid << (UV_APIC_PNODE_SHIFT - 1);
- uv_system_type = UV_NON_UNIQUE_APIC;
- return 1;
- }
+ /* UV2,3,4, UV BIOS TSC sync state available */
+ mmr = uv_early_read_mmr(UVH_TSC_SYNC_MMR);
+ mmr_shift =
+ is_uv2_hub() ? UVH_TSC_SYNC_SHIFT_UV2K : UVH_TSC_SYNC_SHIFT;
+ sync_state = (mmr >> mmr_shift) & UVH_TSC_SYNC_MASK;
+
+ /* Check if TSC is valid for all sockets */
+ switch (sync_state) {
+ case UVH_TSC_SYNC_VALID:
+ state = "in sync";
+ mark_tsc_async_resets("UV BIOS");
+ break;
+
+ /* If BIOS state unknown, don't do anything */
+ case UVH_TSC_SYNC_UNKNOWN:
+ state = "unknown";
+ break;
+
+ /* Otherwise, BIOS indicates problem with TSC */
+ default:
+ state = "unstable";
+ mark_tsc_unstable("UV BIOS");
+ break;
+ }
+ pr_info("UV: TSC sync state from BIOS:0%d(%s)\n", sync_state, state);
+}
+
+/* Selector for (4|4A|5) structs */
+#define uvxy_field(sname, field, undef) ( \
+ is_uv(UV4A) ? sname.s4a.field : \
+ is_uv(UV4) ? sname.s4.field : \
+ is_uv(UV3) ? sname.s3.field : \
+ undef)
+
+static void __init early_get_apic_socketid_shift(void)
+{
+ unsigned int sid_shift = topology_get_domain_shift(TOPO_PKG_DOMAIN);
+
+ if (is_uv2_hub() || is_uv3_hub())
+ uvh_apicid.v = uv_early_read_mmr(UVH_APICID);
+
+ if (sid_shift) {
+ uv_cpuid.apicid_shift = 0;
+ uv_cpuid.apicid_mask = (~(-1 << sid_shift));
+ uv_cpuid.socketid_shift = sid_shift;
+ } else {
+ pr_info("UV: CPU does not have valid CPUID.11\n");
+ }
+
+ pr_info("UV: apicid_shift:%d apicid_mask:0x%x\n", uv_cpuid.apicid_shift, uv_cpuid.apicid_mask);
+ pr_info("UV: socketid_shift:%d pnode_mask:0x%x\n", uv_cpuid.socketid_shift, uv_cpuid.pnode_mask);
+}
+
+static void __init uv_stringify(int len, char *to, char *from)
+{
+ strscpy(to, from, len);
+
+ /* Trim trailing spaces */
+ (void)strim(to);
+}
+
+/* Find UV arch type entry in UVsystab */
+static unsigned long __init early_find_archtype(struct uv_systab *st)
+{
+ int i;
+
+ for (i = 0; st->entry[i].type != UV_SYSTAB_TYPE_UNUSED; i++) {
+ unsigned long ptr = st->entry[i].offset;
+
+ if (!ptr)
+ continue;
+ ptr += (unsigned long)st;
+ if (st->entry[i].type == UV_SYSTAB_TYPE_ARCH_TYPE)
+ return ptr;
+ }
+ return 0;
+}
+
+/* Validate UV arch type field in UVsystab */
+static int __init decode_arch_type(unsigned long ptr)
+{
+ struct uv_arch_type_entry *uv_ate = (struct uv_arch_type_entry *)ptr;
+ int n = strlen(uv_ate->archtype);
+
+ if (n > 0 && n < sizeof(uv_ate->archtype)) {
+ pr_info("UV: UVarchtype received from BIOS\n");
+ uv_stringify(sizeof(uv_archtype), uv_archtype, uv_ate->archtype);
+ return 1;
}
return 0;
}
+/* Determine if UV arch type entry might exist in UVsystab */
+static int __init early_get_arch_type(void)
+{
+ unsigned long uvst_physaddr, uvst_size, ptr;
+ struct uv_systab *st;
+ u32 rev;
+ int ret;
+
+ uvst_physaddr = get_uv_systab_phys(0);
+ if (!uvst_physaddr)
+ return 0;
+
+ st = early_memremap_ro(uvst_physaddr, sizeof(struct uv_systab));
+ if (!st) {
+ pr_err("UV: Cannot access UVsystab, remap failed\n");
+ return 0;
+ }
+
+ rev = st->revision;
+ if (rev < UV_SYSTAB_VERSION_UV5) {
+ early_memunmap(st, sizeof(struct uv_systab));
+ return 0;
+ }
+
+ uvst_size = st->size;
+ early_memunmap(st, sizeof(struct uv_systab));
+ st = early_memremap_ro(uvst_physaddr, uvst_size);
+ if (!st) {
+ pr_err("UV: Cannot access UVarchtype, remap failed\n");
+ return 0;
+ }
+
+ ptr = early_find_archtype(st);
+ if (!ptr) {
+ early_memunmap(st, uvst_size);
+ return 0;
+ }
+
+ ret = decode_arch_type(ptr);
+ early_memunmap(st, uvst_size);
+ return ret;
+}
+
+/* UV system found, check which APIC mode the BIOS already selected */
+static void __init early_set_apic_mode(void)
+{
+ if (x2apic_enabled())
+ uv_system_type = UV_X2APIC;
+ else
+ uv_system_type = UV_LEGACY_APIC;
+}
+
+static int __init uv_set_system_type(char *_oem_id, char *_oem_table_id)
+{
+ /* Save OEM_ID passed from ACPI MADT */
+ uv_stringify(sizeof(oem_id), oem_id, _oem_id);
+
+ /* Check if BIOS sent us a UVarchtype */
+ if (!early_get_arch_type())
+
+ /* If not, use the OEM ID for UVarchtype */
+ uv_stringify(sizeof(uv_archtype), uv_archtype, oem_id);
+
+ /* Check if not hubbed */
+ if (strncmp(uv_archtype, "SGI", 3) != 0) {
+
+ /* (Not hubbed), check if not hubless */
+ if (strncmp(uv_archtype, "NSGI", 4) != 0)
+
+ /* (Not hubless), not a UV */
+ return 0;
+
+ /* Is UV hubless system */
+ uv_hubless_system = 0x01;
+
+ /* UV5 Hubless */
+ if (strncmp(uv_archtype, "NSGI5", 5) == 0)
+ uv_hubless_system |= 0x20;
+
+ /* UV4 Hubless: CH */
+ else if (strncmp(uv_archtype, "NSGI4", 5) == 0)
+ uv_hubless_system |= 0x10;
+
+ /* UV3 Hubless: UV300/MC990X w/o hub */
+ else
+ uv_hubless_system |= 0x8;
+
+ /* Copy OEM Table ID */
+ uv_stringify(sizeof(oem_table_id), oem_table_id, _oem_table_id);
+
+ pr_info("UV: OEM IDs %s/%s, SystemType %d, HUBLESS ID %x\n",
+ oem_id, oem_table_id, uv_system_type, uv_hubless_system);
+
+ return 0;
+ }
+
+ if (numa_off) {
+ pr_err("UV: NUMA is off, disabling UV support\n");
+ return 0;
+ }
+
+ /* Set hubbed type if true */
+ uv_hub_info->hub_revision =
+ !strncmp(uv_archtype, "SGI5", 4) ? UV5_HUB_REVISION_BASE :
+ !strncmp(uv_archtype, "SGI4", 4) ? UV4_HUB_REVISION_BASE :
+ !strncmp(uv_archtype, "SGI3", 4) ? UV3_HUB_REVISION_BASE :
+ !strcmp(uv_archtype, "SGI2") ? UV2_HUB_REVISION_BASE : 0;
+
+ switch (uv_hub_info->hub_revision) {
+ case UV5_HUB_REVISION_BASE:
+ uv_hubbed_system = 0x21;
+ uv_hub_type_set(UV5);
+ break;
+
+ case UV4_HUB_REVISION_BASE:
+ uv_hubbed_system = 0x11;
+ uv_hub_type_set(UV4);
+ break;
+
+ case UV3_HUB_REVISION_BASE:
+ uv_hubbed_system = 0x9;
+ uv_hub_type_set(UV3);
+ break;
+
+ case UV2_HUB_REVISION_BASE:
+ uv_hubbed_system = 0x5;
+ uv_hub_type_set(UV2);
+ break;
+
+ default:
+ return 0;
+ }
+
+ /* Get UV hub chip part number & revision */
+ early_set_hub_type();
+
+ /* Other UV setup functions */
+ early_set_apic_mode();
+ early_get_pnodeid();
+ early_get_apic_socketid_shift();
+ x86_platform.is_untracked_pat_range = uv_is_untracked_pat_range;
+ x86_platform.nmi_init = uv_nmi_init;
+ uv_tsc_check_sync();
+
+ return 1;
+}
+
+/* Called early to probe for the correct APIC driver */
+static int __init uv_acpi_madt_oem_check(char *_oem_id, char *_oem_table_id)
+{
+ /* Set up early hub info fields for Node 0 */
+ uv_cpu_info->p_uv_hub_info = &uv_hub_info_node0;
+
+ /* If not UV, return. */
+ if (uv_set_system_type(_oem_id, _oem_table_id) == 0)
+ return 0;
+
+ /* Save for display of the OEM Table ID */
+ uv_stringify(sizeof(oem_table_id), oem_table_id, _oem_table_id);
+
+ pr_info("UV: OEM IDs %s/%s, System/UVType %d/0x%x, HUB RevID %d\n",
+ oem_id, oem_table_id, uv_system_type, is_uv(UV_ANY),
+ uv_min_hub_revision_id);
+
+ return 0;
+}
+
enum uv_system_type get_uv_system_type(void)
{
return uv_system_type;
}
+int uv_get_hubless_system(void)
+{
+ return uv_hubless_system;
+}
+EXPORT_SYMBOL_GPL(uv_get_hubless_system);
+
+ssize_t uv_get_archtype(char *buf, int len)
+{
+ return scnprintf(buf, len, "%s/%s", uv_archtype, oem_table_id);
+}
+EXPORT_SYMBOL_GPL(uv_get_archtype);
+
int is_uv_system(void)
{
return uv_system_type != UV_NONE;
}
EXPORT_SYMBOL_GPL(is_uv_system);
-DEFINE_PER_CPU(struct uv_hub_info_s, __uv_hub_info);
-EXPORT_PER_CPU_SYMBOL_GPL(__uv_hub_info);
+int is_uv_hubbed(int uvtype)
+{
+ return (uv_hubbed_system & uvtype);
+}
+EXPORT_SYMBOL_GPL(is_uv_hubbed);
-struct uv_blade_info *uv_blade_info;
-EXPORT_SYMBOL_GPL(uv_blade_info);
+static int is_uv_hubless(int uvtype)
+{
+ return (uv_hubless_system & uvtype);
+}
-short *uv_node_to_blade;
-EXPORT_SYMBOL_GPL(uv_node_to_blade);
+void **__uv_hub_info_list;
+EXPORT_SYMBOL_GPL(__uv_hub_info_list);
-short *uv_cpu_to_blade;
-EXPORT_SYMBOL_GPL(uv_cpu_to_blade);
+DEFINE_PER_CPU(struct uv_cpu_info_s, __uv_cpu_info);
+EXPORT_PER_CPU_SYMBOL_GPL(__uv_cpu_info);
short uv_possible_blades;
EXPORT_SYMBOL_GPL(uv_possible_blades);
@@ -121,50 +512,202 @@ EXPORT_SYMBOL_GPL(uv_possible_blades);
unsigned long sn_rtc_cycles_per_second;
EXPORT_SYMBOL(sn_rtc_cycles_per_second);
-static const struct cpumask *uv_target_cpus(void)
+/* The following values are used for the per node hub info struct */
+static __initdata unsigned short _min_socket, _max_socket;
+static __initdata unsigned short _min_pnode, _max_pnode, _gr_table_len;
+static __initdata struct uv_gam_range_entry *uv_gre_table;
+static __initdata struct uv_gam_parameters *uv_gp_table;
+static __initdata unsigned short *_socket_to_node;
+static __initdata unsigned short *_socket_to_pnode;
+static __initdata unsigned short *_pnode_to_socket;
+static __initdata unsigned short *_node_to_socket;
+
+static __initdata struct uv_gam_range_s *_gr_table;
+
+#define SOCK_EMPTY ((unsigned short)~0)
+
+/* Default UV memory block size is 2GB */
+static unsigned long mem_block_size __initdata = (2UL << 30);
+
+/* Kernel parameter to specify UV mem block size */
+static int __init parse_mem_block_size(char *ptr)
{
- return cpu_online_mask;
+ unsigned long size = memparse(ptr, NULL);
+
+ /* Size will be rounded down by set_block_size() below */
+ mem_block_size = size;
+ return 0;
}
+early_param("uv_memblksize", parse_mem_block_size);
-static void uv_vector_allocation_domain(int cpu, struct cpumask *retmask)
+static __init int adj_blksize(u32 lgre)
{
- cpumask_clear(retmask);
- cpumask_set_cpu(cpu, retmask);
+ unsigned long base = (unsigned long)lgre << UV_GAM_RANGE_SHFT;
+ unsigned long size;
+
+ for (size = mem_block_size; size > MIN_MEMORY_BLOCK_SIZE; size >>= 1)
+ if (IS_ALIGNED(base, size))
+ break;
+
+ if (size >= mem_block_size)
+ return 0;
+
+ mem_block_size = size;
+ return 1;
}
-static int __cpuinit uv_wakeup_secondary(int phys_apicid, unsigned long start_rip)
+static __init void set_block_size(void)
+{
+ unsigned int order = ffs(mem_block_size);
+
+ if (order) {
+ /* adjust for ffs return of 1..64 */
+ set_memory_block_size_order(order - 1);
+ pr_info("UV: mem_block_size set to 0x%lx\n", mem_block_size);
+ } else {
+ /* bad or zero value, default to 1UL << 31 (2GB) */
+ pr_err("UV: mem_block_size error with 0x%lx\n", mem_block_size);
+ set_memory_block_size_order(31);
+ }
+}
+
+/* Build GAM range lookup table: */
+static __init void build_uv_gr_table(void)
+{
+ struct uv_gam_range_entry *gre = uv_gre_table;
+ struct uv_gam_range_s *grt;
+ unsigned long last_limit = 0, ram_limit = 0;
+ int bytes, i, sid, lsid = -1, indx = 0, lindx = -1;
+
+ if (!gre)
+ return;
+
+ bytes = _gr_table_len * sizeof(struct uv_gam_range_s);
+ grt = kzalloc(bytes, GFP_KERNEL);
+ if (WARN_ON_ONCE(!grt))
+ return;
+ _gr_table = grt;
+
+ for (; gre->type != UV_GAM_RANGE_TYPE_UNUSED; gre++) {
+ if (gre->type == UV_GAM_RANGE_TYPE_HOLE) {
+ if (!ram_limit) {
+ /* Mark hole between RAM/non-RAM: */
+ ram_limit = last_limit;
+ last_limit = gre->limit;
+ lsid++;
+ continue;
+ }
+ last_limit = gre->limit;
+ pr_info("UV: extra hole in GAM RE table @%d\n", (int)(gre - uv_gre_table));
+ continue;
+ }
+ if (_max_socket < gre->sockid) {
+ pr_err("UV: GAM table sockid(%d) too large(>%d) @%d\n", gre->sockid, _max_socket, (int)(gre - uv_gre_table));
+ continue;
+ }
+ sid = gre->sockid - _min_socket;
+ if (lsid < sid) {
+ /* New range: */
+ grt = &_gr_table[indx];
+ grt->base = lindx;
+ grt->nasid = gre->nasid;
+ grt->limit = last_limit = gre->limit;
+ lsid = sid;
+ lindx = indx++;
+ continue;
+ }
+ /* Update range: */
+ if (lsid == sid && !ram_limit) {
+ /* .. if contiguous: */
+ if (grt->limit == last_limit) {
+ grt->limit = last_limit = gre->limit;
+ continue;
+ }
+ }
+ /* Non-contiguous RAM range: */
+ if (!ram_limit) {
+ grt++;
+ grt->base = lindx;
+ grt->nasid = gre->nasid;
+ grt->limit = last_limit = gre->limit;
+ continue;
+ }
+ /* Non-contiguous/non-RAM: */
+ grt++;
+ /* base is this entry */
+ grt->base = grt - _gr_table;
+ grt->nasid = gre->nasid;
+ grt->limit = last_limit = gre->limit;
+ lsid++;
+ }
+
+ /* Shorten table if possible */
+ grt++;
+ i = grt - _gr_table;
+ if (i < _gr_table_len) {
+ void *ret;
+
+ bytes = i * sizeof(struct uv_gam_range_s);
+ ret = krealloc(_gr_table, bytes, GFP_KERNEL);
+ if (ret) {
+ _gr_table = ret;
+ _gr_table_len = i;
+ }
+ }
+
+ /* Display resultant GAM range table: */
+ for (i = 0, grt = _gr_table; i < _gr_table_len; i++, grt++) {
+ unsigned long start, end;
+ int gb = grt->base;
+
+ start = gb < 0 ? 0 : (unsigned long)_gr_table[gb].limit << UV_GAM_RANGE_SHFT;
+ end = (unsigned long)grt->limit << UV_GAM_RANGE_SHFT;
+
+ pr_info("UV: GAM Range %2d %04x 0x%013lx-0x%013lx (%d)\n", i, grt->nasid, start, end, gb);
+ }
+}
+
+static int uv_wakeup_secondary(u32 phys_apicid, unsigned long start_rip, unsigned int cpu)
{
-#ifdef CONFIG_SMP
unsigned long val;
int pnode;
pnode = uv_apicid_to_pnode(phys_apicid);
+
val = (1UL << UVH_IPI_INT_SEND_SHFT) |
(phys_apicid << UVH_IPI_INT_APIC_ID_SHFT) |
((start_rip << UVH_IPI_INT_VECTOR_SHFT) >> 12) |
APIC_DM_INIT;
+
uv_write_global_mmr64(pnode, UVH_IPI_INT, val);
- mdelay(10);
val = (1UL << UVH_IPI_INT_SEND_SHFT) |
(phys_apicid << UVH_IPI_INT_APIC_ID_SHFT) |
((start_rip << UVH_IPI_INT_VECTOR_SHFT) >> 12) |
APIC_DM_STARTUP;
+
uv_write_global_mmr64(pnode, UVH_IPI_INT, val);
- atomic_set(&init_deasserted, 1);
-#endif
return 0;
}
static void uv_send_IPI_one(int cpu, int vector)
{
- unsigned long apicid;
- int pnode;
+ unsigned long apicid = per_cpu(x86_cpu_to_apicid, cpu);
+ int pnode = uv_apicid_to_pnode(apicid);
+ unsigned long dmode, val;
+
+ if (vector == NMI_VECTOR)
+ dmode = APIC_DELIVERY_MODE_NMI;
+ else
+ dmode = APIC_DELIVERY_MODE_FIXED;
- apicid = per_cpu(x86_cpu_to_apicid, cpu);
- pnode = uv_apicid_to_pnode(apicid);
- uv_hub_send_ipi(pnode, apicid, vector);
+ val = (1UL << UVH_IPI_INT_SEND_SHFT) |
+ (apicid << UVH_IPI_INT_APIC_ID_SHFT) |
+ (dmode << UVH_IPI_INT_DELIVERY_MODE_SHFT) |
+ (vector << UVH_IPI_INT_VECTOR_SHFT);
+
+ uv_write_global_mmr64(pnode, UVH_IPI_INT, val);
}
static void uv_send_IPI_mask(const struct cpumask *mask, int vector)
@@ -202,183 +745,74 @@ static void uv_send_IPI_all(int vector)
uv_send_IPI_mask(cpu_online_mask, vector);
}
-static int uv_apic_id_registered(void)
-{
- return 1;
-}
-
-static void uv_init_apic_ldr(void)
-{
-}
-
-static unsigned int uv_cpu_mask_to_apicid(const struct cpumask *cpumask)
-{
- /*
- * We're using fixed IRQ delivery, can only return one phys APIC ID.
- * May as well be the first.
- */
- int cpu = cpumask_first(cpumask);
-
- if ((unsigned)cpu < nr_cpu_ids)
- return per_cpu(x86_cpu_to_apicid, cpu);
- else
- return BAD_APICID;
-}
-
-static unsigned int
-uv_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
- const struct cpumask *andmask)
-{
- int cpu;
-
- /*
- * We're using fixed IRQ delivery, can only return one phys APIC ID.
- * May as well be the first.
- */
- for_each_cpu_and(cpu, cpumask, andmask) {
- if (cpumask_test_cpu(cpu, cpu_online_mask))
- break;
- }
- return per_cpu(x86_cpu_to_apicid, cpu);
-}
-
-static unsigned int x2apic_get_apic_id(unsigned long x)
-{
- unsigned int id;
-
- WARN_ON(preemptible() && num_online_cpus() > 1);
- id = x | __get_cpu_var(x2apic_extra_bits);
-
- return id;
-}
-
-static unsigned long set_apic_id(unsigned int id)
-{
- unsigned long x;
-
- /* maskout x2apic_extra_bits ? */
- x = id;
- return x;
-}
-
-static unsigned int uv_read_apic_id(void)
-{
-
- return x2apic_get_apic_id(apic_read(APIC_ID));
-}
-
-static int uv_phys_pkg_id(int initial_apicid, int index_msb)
-{
- return uv_read_apic_id() >> index_msb;
-}
-
-static void uv_send_IPI_self(int vector)
+static int uv_probe(void)
{
- apic_write(APIC_SELF_IPI, vector);
+ return apic == &apic_x2apic_uv_x;
}
-struct apic __refdata apic_x2apic_uv_x = {
+static struct apic apic_x2apic_uv_x __ro_after_init = {
.name = "UV large system",
- .probe = NULL,
+ .probe = uv_probe,
.acpi_madt_oem_check = uv_acpi_madt_oem_check,
- .apic_id_registered = uv_apic_id_registered,
- .irq_delivery_mode = dest_Fixed,
- .irq_dest_mode = 0, /* physical */
+ .dest_mode_logical = false,
- .target_cpus = uv_target_cpus,
.disable_esr = 0,
- .dest_logical = APIC_DEST_LOGICAL,
- .check_apicid_used = NULL,
- .check_apicid_present = NULL,
-
- .vector_allocation_domain = uv_vector_allocation_domain,
- .init_apic_ldr = uv_init_apic_ldr,
-
- .ioapic_phys_id_map = NULL,
- .setup_apic_routing = NULL,
- .multi_timer_check = NULL,
- .apicid_to_node = NULL,
- .cpu_to_logical_apicid = NULL,
+
.cpu_present_to_apicid = default_cpu_present_to_apicid,
- .apicid_to_cpu_present = NULL,
- .setup_portio_remap = NULL,
- .check_phys_apicid_present = default_check_phys_apicid_present,
- .enable_apic_mode = NULL,
- .phys_pkg_id = uv_phys_pkg_id,
- .mps_oem_check = NULL,
+ .max_apic_id = UINT_MAX,
.get_apic_id = x2apic_get_apic_id,
- .set_apic_id = set_apic_id,
- .apic_id_mask = 0xFFFFFFFFu,
- .cpu_mask_to_apicid = uv_cpu_mask_to_apicid,
- .cpu_mask_to_apicid_and = uv_cpu_mask_to_apicid_and,
+ .calc_dest_apicid = apic_default_calc_apicid,
+ .send_IPI = uv_send_IPI_one,
.send_IPI_mask = uv_send_IPI_mask,
.send_IPI_mask_allbutself = uv_send_IPI_mask_allbutself,
.send_IPI_allbutself = uv_send_IPI_allbutself,
.send_IPI_all = uv_send_IPI_all,
- .send_IPI_self = uv_send_IPI_self,
+ .send_IPI_self = x2apic_send_IPI_self,
.wakeup_secondary_cpu = uv_wakeup_secondary,
- .trampoline_phys_low = DEFAULT_TRAMPOLINE_PHYS_LOW,
- .trampoline_phys_high = DEFAULT_TRAMPOLINE_PHYS_HIGH,
- .wait_for_init_deassert = NULL,
- .smp_callin_clear_local_apic = NULL,
- .inquire_remote_apic = NULL,
.read = native_apic_msr_read,
.write = native_apic_msr_write,
+ .eoi = native_apic_msr_eoi,
.icr_read = native_x2apic_icr_read,
.icr_write = native_x2apic_icr_write,
- .wait_icr_idle = native_x2apic_wait_icr_idle,
- .safe_wait_icr_idle = native_safe_x2apic_wait_icr_idle,
};
-static __cpuinit void set_x2apic_extra_bits(int pnode)
-{
- __get_cpu_var(x2apic_extra_bits) = (pnode << 6);
-}
-
-/*
- * Called on boot cpu.
- */
-static __init int boot_pnode_to_blade(int pnode)
-{
- int blade;
-
- for (blade = 0; blade < uv_num_possible_blades(); blade++)
- if (pnode == uv_blade_info[blade].pnode)
- return blade;
- BUG();
-}
-
-struct redir_addr {
- unsigned long redirect;
- unsigned long alias;
-};
-
-#define DEST_SHIFT UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_0_MMR_DEST_BASE_SHFT
-
-static __initdata struct redir_addr redir_addrs[] = {
- {UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_0_MMR, UVH_SI_ALIAS0_OVERLAY_CONFIG},
- {UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_1_MMR, UVH_SI_ALIAS1_OVERLAY_CONFIG},
- {UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_2_MMR, UVH_SI_ALIAS2_OVERLAY_CONFIG},
-};
+#define UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_LENGTH 3
+#define DEST_SHIFT UVXH_RH_GAM_ALIAS_0_REDIRECT_CONFIG_DEST_BASE_SHFT
static __init void get_lowmem_redirect(unsigned long *base, unsigned long *size)
{
- union uvh_si_alias0_overlay_config_u alias;
- union uvh_rh_gam_alias210_redirect_config_2_mmr_u redirect;
+ union uvh_rh_gam_alias_2_overlay_config_u alias;
+ union uvh_rh_gam_alias_2_redirect_config_u redirect;
+ unsigned long m_redirect;
+ unsigned long m_overlay;
int i;
- for (i = 0; i < ARRAY_SIZE(redir_addrs); i++) {
- alias.v = uv_read_local_mmr(redir_addrs[i].alias);
+ for (i = 0; i < UVH_RH_GAM_ALIAS210_REDIRECT_CONFIG_LENGTH; i++) {
+ switch (i) {
+ case 0:
+ m_redirect = UVH_RH_GAM_ALIAS_0_REDIRECT_CONFIG;
+ m_overlay = UVH_RH_GAM_ALIAS_0_OVERLAY_CONFIG;
+ break;
+ case 1:
+ m_redirect = UVH_RH_GAM_ALIAS_1_REDIRECT_CONFIG;
+ m_overlay = UVH_RH_GAM_ALIAS_1_OVERLAY_CONFIG;
+ break;
+ case 2:
+ m_redirect = UVH_RH_GAM_ALIAS_2_REDIRECT_CONFIG;
+ m_overlay = UVH_RH_GAM_ALIAS_2_OVERLAY_CONFIG;
+ break;
+ }
+ alias.v = uv_read_local_mmr(m_overlay);
if (alias.s.enable && alias.s.base == 0) {
*size = (1UL << alias.s.m_alias);
- redirect.v = uv_read_local_mmr(redir_addrs[i].redirect);
+ redirect.v = uv_read_local_mmr(m_redirect);
*base = (unsigned long)redirect.s.dest_base << DEST_SHIFT;
return;
}
@@ -387,61 +821,291 @@ static __init void get_lowmem_redirect(unsigned long *base, unsigned long *size)
}
enum map_type {map_wb, map_uc};
+static const char * const mt[] = { "WB", "UC" };
-static __init void map_high(char *id, unsigned long base, int pshift,
- int bshift, int max_pnode, enum map_type map_type)
+static __init void map_high(char *id, unsigned long base, int pshift, int bshift, int max_pnode, enum map_type map_type)
{
unsigned long bytes, paddr;
paddr = base << pshift;
bytes = (1UL << bshift) * (max_pnode + 1);
- printk(KERN_INFO "UV: Map %s_HI 0x%lx - 0x%lx\n", id, paddr,
- paddr + bytes);
+ if (!paddr) {
+ pr_info("UV: Map %s_HI base address NULL\n", id);
+ return;
+ }
if (map_type == map_uc)
init_extra_mapping_uc(paddr, bytes);
else
init_extra_mapping_wb(paddr, bytes);
+ pr_info("UV: Map %s_HI 0x%lx - 0x%lx %s (%d segments)\n",
+ id, paddr, paddr + bytes, mt[map_type], max_pnode + 1);
}
+
static __init void map_gru_high(int max_pnode)
{
- union uvh_rh_gam_gru_overlay_config_mmr_u gru;
- int shift = UVH_RH_GAM_GRU_OVERLAY_CONFIG_MMR_BASE_SHFT;
-
- gru.v = uv_read_local_mmr(UVH_RH_GAM_GRU_OVERLAY_CONFIG_MMR);
- if (gru.s.enable) {
- map_high("GRU", gru.s.base, shift, shift, max_pnode, map_wb);
- gru_start_paddr = ((u64)gru.s.base << shift);
- gru_end_paddr = gru_start_paddr + (1UL << shift) * (max_pnode + 1);
+ union uvh_rh_gam_gru_overlay_config_u gru;
+ unsigned long mask, base;
+ int shift;
+
+ if (UVH_RH_GAM_GRU_OVERLAY_CONFIG) {
+ gru.v = uv_read_local_mmr(UVH_RH_GAM_GRU_OVERLAY_CONFIG);
+ shift = UVH_RH_GAM_GRU_OVERLAY_CONFIG_BASE_SHFT;
+ mask = UVH_RH_GAM_GRU_OVERLAY_CONFIG_BASE_MASK;
+ } else if (UVH_RH10_GAM_GRU_OVERLAY_CONFIG) {
+ gru.v = uv_read_local_mmr(UVH_RH10_GAM_GRU_OVERLAY_CONFIG);
+ shift = UVH_RH10_GAM_GRU_OVERLAY_CONFIG_BASE_SHFT;
+ mask = UVH_RH10_GAM_GRU_OVERLAY_CONFIG_BASE_MASK;
+ } else {
+ pr_err("UV: GRU unavailable (no MMR)\n");
+ return;
+ }
+ if (!gru.s.enable) {
+ pr_info("UV: GRU disabled (by BIOS)\n");
+ return;
}
+
+ base = (gru.v & mask) >> shift;
+ map_high("GRU", base, shift, shift, max_pnode, map_wb);
+ gru_start_paddr = ((u64)base << shift);
+ gru_end_paddr = gru_start_paddr + (1UL << shift) * (max_pnode + 1);
}
static __init void map_mmr_high(int max_pnode)
{
- union uvh_rh_gam_mmr_overlay_config_mmr_u mmr;
- int shift = UVH_RH_GAM_MMR_OVERLAY_CONFIG_MMR_BASE_SHFT;
+ unsigned long base;
+ int shift;
+ bool enable;
+
+ if (UVH_RH10_GAM_MMR_OVERLAY_CONFIG) {
+ union uvh_rh10_gam_mmr_overlay_config_u mmr;
+
+ mmr.v = uv_read_local_mmr(UVH_RH10_GAM_MMR_OVERLAY_CONFIG);
+ enable = mmr.s.enable;
+ base = mmr.s.base;
+ shift = UVH_RH10_GAM_MMR_OVERLAY_CONFIG_BASE_SHFT;
+ } else if (UVH_RH_GAM_MMR_OVERLAY_CONFIG) {
+ union uvh_rh_gam_mmr_overlay_config_u mmr;
+
+ mmr.v = uv_read_local_mmr(UVH_RH_GAM_MMR_OVERLAY_CONFIG);
+ enable = mmr.s.enable;
+ base = mmr.s.base;
+ shift = UVH_RH_GAM_MMR_OVERLAY_CONFIG_BASE_SHFT;
+ } else {
+ pr_err("UV:%s:RH_GAM_MMR_OVERLAY_CONFIG MMR undefined?\n",
+ __func__);
+ return;
+ }
- mmr.v = uv_read_local_mmr(UVH_RH_GAM_MMR_OVERLAY_CONFIG_MMR);
- if (mmr.s.enable)
- map_high("MMR", mmr.s.base, shift, shift, max_pnode, map_uc);
+ if (enable)
+ map_high("MMR", base, shift, shift, max_pnode, map_uc);
+ else
+ pr_info("UV: MMR disabled\n");
}
-static __init void map_mmioh_high(int max_pnode)
+/* Arch specific ENUM cases */
+enum mmioh_arch {
+ UV2_MMIOH = -1,
+ UVY_MMIOH0, UVY_MMIOH1,
+ UVX_MMIOH0, UVX_MMIOH1,
+};
+
+/* Calculate and Map MMIOH Regions */
+static void __init calc_mmioh_map(enum mmioh_arch index,
+ int min_pnode, int max_pnode,
+ int shift, unsigned long base, int m_io, int n_io)
{
- union uvh_rh_gam_mmioh_overlay_config_mmr_u mmioh;
- int shift = UVH_RH_GAM_MMIOH_OVERLAY_CONFIG_MMR_BASE_SHFT;
+ unsigned long mmr, nasid_mask;
+ int nasid, min_nasid, max_nasid, lnasid, mapped;
+ int i, fi, li, n, max_io;
+ char id[8];
+
+ /* One (UV2) mapping */
+ if (index == UV2_MMIOH) {
+ strscpy(id, "MMIOH", sizeof(id));
+ max_io = max_pnode;
+ mapped = 0;
+ goto map_exit;
+ }
- mmioh.v = uv_read_local_mmr(UVH_RH_GAM_MMIOH_OVERLAY_CONFIG_MMR);
- if (mmioh.s.enable)
- map_high("MMIOH", mmioh.s.base, shift, mmioh.s.m_io,
- max_pnode, map_uc);
+ /* small and large MMIOH mappings */
+ switch (index) {
+ case UVY_MMIOH0:
+ mmr = UVH_RH10_GAM_MMIOH_REDIRECT_CONFIG0;
+ nasid_mask = UVYH_RH10_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK;
+ n = UVH_RH10_GAM_MMIOH_REDIRECT_CONFIG0_DEPTH;
+ min_nasid = min_pnode;
+ max_nasid = max_pnode;
+ mapped = 1;
+ break;
+ case UVY_MMIOH1:
+ mmr = UVH_RH10_GAM_MMIOH_REDIRECT_CONFIG1;
+ nasid_mask = UVYH_RH10_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK;
+ n = UVH_RH10_GAM_MMIOH_REDIRECT_CONFIG1_DEPTH;
+ min_nasid = min_pnode;
+ max_nasid = max_pnode;
+ mapped = 1;
+ break;
+ case UVX_MMIOH0:
+ mmr = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG0;
+ nasid_mask = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG0_NASID_MASK;
+ n = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG0_DEPTH;
+ min_nasid = min_pnode * 2;
+ max_nasid = max_pnode * 2;
+ mapped = 1;
+ break;
+ case UVX_MMIOH1:
+ mmr = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG1;
+ nasid_mask = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG1_NASID_MASK;
+ n = UVH_RH_GAM_MMIOH_REDIRECT_CONFIG1_DEPTH;
+ min_nasid = min_pnode * 2;
+ max_nasid = max_pnode * 2;
+ mapped = 1;
+ break;
+ default:
+ pr_err("UV:%s:Invalid mapping type:%d\n", __func__, index);
+ return;
+ }
+
+ /* enum values chosen so (index mod 2) is MMIOH 0/1 (low/high) */
+ snprintf(id, sizeof(id), "MMIOH%d", index%2);
+
+ max_io = lnasid = fi = li = -1;
+ for (i = 0; i < n; i++) {
+ unsigned long m_redirect = mmr + i * 8;
+ unsigned long redirect = uv_read_local_mmr(m_redirect);
+
+ nasid = redirect & nasid_mask;
+ if (i == 0)
+ pr_info("UV: %s redirect base 0x%lx(@0x%lx) 0x%04x\n",
+ id, redirect, m_redirect, nasid);
+
+ /* Invalid NASID check */
+ if (nasid < min_nasid || max_nasid < nasid) {
+ /* Not an error: unused table entries get "poison" values */
+ pr_debug("UV:%s:Invalid NASID(%x):%x (range:%x..%x)\n",
+ __func__, index, nasid, min_nasid, max_nasid);
+ nasid = -1;
+ }
+
+ if (nasid == lnasid) {
+ li = i;
+ /* Last entry check: */
+ if (i != n-1)
+ continue;
+ }
+
+ /* Check if we have a cached (or last) redirect to print: */
+ if (lnasid != -1 || (i == n-1 && nasid != -1)) {
+ unsigned long addr1, addr2;
+ int f, l;
+
+ if (lnasid == -1) {
+ f = l = i;
+ lnasid = nasid;
+ } else {
+ f = fi;
+ l = li;
+ }
+ addr1 = (base << shift) + f * (1ULL << m_io);
+ addr2 = (base << shift) + (l + 1) * (1ULL << m_io);
+ pr_info("UV: %s[%03d..%03d] NASID 0x%04x ADDR 0x%016lx - 0x%016lx\n",
+ id, fi, li, lnasid, addr1, addr2);
+ if (max_io < l)
+ max_io = l;
+ }
+ fi = li = i;
+ lnasid = nasid;
+ }
+
+map_exit:
+ pr_info("UV: %s base:0x%lx shift:%d m_io:%d max_io:%d max_pnode:0x%x\n",
+ id, base, shift, m_io, max_io, max_pnode);
+
+ if (max_io >= 0 && !mapped)
+ map_high(id, base, shift, m_io, max_io, map_uc);
+}
+
+static __init void map_mmioh_high(int min_pnode, int max_pnode)
+{
+ /* UVY flavor */
+ if (UVH_RH10_GAM_MMIOH_OVERLAY_CONFIG0) {
+ union uvh_rh10_gam_mmioh_overlay_config0_u mmioh0;
+ union uvh_rh10_gam_mmioh_overlay_config1_u mmioh1;
+
+ mmioh0.v = uv_read_local_mmr(UVH_RH10_GAM_MMIOH_OVERLAY_CONFIG0);
+ if (unlikely(mmioh0.s.enable == 0))
+ pr_info("UV: MMIOH0 disabled\n");
+ else
+ calc_mmioh_map(UVY_MMIOH0, min_pnode, max_pnode,
+ UVH_RH10_GAM_MMIOH_OVERLAY_CONFIG0_BASE_SHFT,
+ mmioh0.s.base, mmioh0.s.m_io, mmioh0.s.n_io);
+
+ mmioh1.v = uv_read_local_mmr(UVH_RH10_GAM_MMIOH_OVERLAY_CONFIG1);
+ if (unlikely(mmioh1.s.enable == 0))
+ pr_info("UV: MMIOH1 disabled\n");
+ else
+ calc_mmioh_map(UVY_MMIOH1, min_pnode, max_pnode,
+ UVH_RH10_GAM_MMIOH_OVERLAY_CONFIG1_BASE_SHFT,
+ mmioh1.s.base, mmioh1.s.m_io, mmioh1.s.n_io);
+ return;
+ }
+ /* UVX flavor */
+ if (UVH_RH_GAM_MMIOH_OVERLAY_CONFIG0) {
+ union uvh_rh_gam_mmioh_overlay_config0_u mmioh0;
+ union uvh_rh_gam_mmioh_overlay_config1_u mmioh1;
+
+ mmioh0.v = uv_read_local_mmr(UVH_RH_GAM_MMIOH_OVERLAY_CONFIG0);
+ if (unlikely(mmioh0.s.enable == 0))
+ pr_info("UV: MMIOH0 disabled\n");
+ else {
+ unsigned long base = uvxy_field(mmioh0, base, 0);
+ int m_io = uvxy_field(mmioh0, m_io, 0);
+ int n_io = uvxy_field(mmioh0, n_io, 0);
+
+ calc_mmioh_map(UVX_MMIOH0, min_pnode, max_pnode,
+ UVH_RH_GAM_MMIOH_OVERLAY_CONFIG0_BASE_SHFT,
+ base, m_io, n_io);
+ }
+
+ mmioh1.v = uv_read_local_mmr(UVH_RH_GAM_MMIOH_OVERLAY_CONFIG1);
+ if (unlikely(mmioh1.s.enable == 0))
+ pr_info("UV: MMIOH1 disabled\n");
+ else {
+ unsigned long base = uvxy_field(mmioh1, base, 0);
+ int m_io = uvxy_field(mmioh1, m_io, 0);
+ int n_io = uvxy_field(mmioh1, n_io, 0);
+
+ calc_mmioh_map(UVX_MMIOH1, min_pnode, max_pnode,
+ UVH_RH_GAM_MMIOH_OVERLAY_CONFIG1_BASE_SHFT,
+ base, m_io, n_io);
+ }
+ return;
+ }
+
+ /* UV2 flavor */
+ if (UVH_RH_GAM_MMIOH_OVERLAY_CONFIG) {
+ union uvh_rh_gam_mmioh_overlay_config_u mmioh;
+
+ mmioh.v = uv_read_local_mmr(UVH_RH_GAM_MMIOH_OVERLAY_CONFIG);
+ if (unlikely(mmioh.s2.enable == 0))
+ pr_info("UV: MMIOH disabled\n");
+ else
+ calc_mmioh_map(UV2_MMIOH, min_pnode, max_pnode,
+ UV2H_RH_GAM_MMIOH_OVERLAY_CONFIG_BASE_SHFT,
+ mmioh.s2.base, mmioh.s2.m_io, mmioh.s2.n_io);
+ return;
+ }
}
static __init void map_low_mmrs(void)
{
- init_extra_mapping_uc(UV_GLOBAL_MMR32_BASE, UV_GLOBAL_MMR32_SIZE);
- init_extra_mapping_uc(UV_LOCAL_MMR_BASE, UV_LOCAL_MMR_SIZE);
+ if (UV_GLOBAL_MMR32_BASE)
+ init_extra_mapping_uc(UV_GLOBAL_MMR32_BASE, UV_GLOBAL_MMR32_SIZE);
+
+ if (UV_LOCAL_MMR_BASE)
+ init_extra_mapping_uc(UV_LOCAL_MMR_BASE, UV_LOCAL_MMR_SIZE);
}
static __init void uv_rtc_init(void)
@@ -449,315 +1113,708 @@ static __init void uv_rtc_init(void)
long status;
u64 ticks_per_sec;
- status = uv_bios_freq_base(BIOS_FREQ_BASE_REALTIME_CLOCK,
- &ticks_per_sec);
+ status = uv_bios_freq_base(BIOS_FREQ_BASE_REALTIME_CLOCK, &ticks_per_sec);
+
if (status != BIOS_STATUS_SUCCESS || ticks_per_sec < 100000) {
- printk(KERN_WARNING
- "unable to determine platform RTC clock frequency, "
- "guessing.\n");
- /* BIOS gives wrong value for clock freq. so guess */
+ pr_warn("UV: unable to determine platform RTC clock frequency, guessing.\n");
+
+ /* BIOS gives wrong value for clock frequency, so guess: */
sn_rtc_cycles_per_second = 1000000000000UL / 30000UL;
- } else
+ } else {
sn_rtc_cycles_per_second = ticks_per_sec;
+ }
}
-/*
- * percpu heartbeat timer
- */
-static void uv_heartbeat(unsigned long ignored)
+/* Direct Legacy VGA I/O traffic to designated IOH */
+static int uv_set_vga_state(struct pci_dev *pdev, bool decode, unsigned int command_bits, u32 flags)
{
- struct timer_list *timer = &uv_hub_info->scir.timer;
- unsigned char bits = uv_hub_info->scir.state;
+ int domain, bus, rc;
- /* flip heartbeat bit */
- bits ^= SCIR_CPU_HEARTBEAT;
+ if (!(flags & PCI_VGA_STATE_CHANGE_BRIDGE))
+ return 0;
- /* is this cpu idle? */
- if (idle_cpu(raw_smp_processor_id()))
- bits &= ~SCIR_CPU_ACTIVITY;
- else
- bits |= SCIR_CPU_ACTIVITY;
+ if ((command_bits & PCI_COMMAND_IO) == 0)
+ return 0;
- /* update system controller interface reg */
- uv_set_scir_bits(bits);
+ domain = pci_domain_nr(pdev->bus);
+ bus = pdev->bus->number;
- /* enable next timer period */
- mod_timer_pinned(timer, jiffies + SCIR_CPU_HB_INTERVAL);
+ rc = uv_bios_set_legacy_vga_target(decode, domain, bus);
+
+ return rc;
}
-static void __cpuinit uv_heartbeat_enable(int cpu)
+/*
+ * Called on each CPU to initialize the per_cpu UV data area.
+ * FIXME: hotplug not supported yet
+ */
+void uv_cpu_init(void)
{
- while (!uv_cpu_hub_info(cpu)->scir.enabled) {
- struct timer_list *timer = &uv_cpu_hub_info(cpu)->scir.timer;
+ /* CPU 0 initialization will be done via uv_system_init. */
+ if (smp_processor_id() == 0)
+ return;
- uv_set_cpu_scir_bits(cpu, SCIR_CPU_HEARTBEAT|SCIR_CPU_ACTIVITY);
- setup_timer(timer, uv_heartbeat, cpu);
- timer->expires = jiffies + SCIR_CPU_HB_INTERVAL;
- add_timer_on(timer, cpu);
- uv_cpu_hub_info(cpu)->scir.enabled = 1;
+ uv_hub_info->nr_online_cpus++;
+}
+
+struct mn {
+ unsigned char m_val;
+ unsigned char n_val;
+ unsigned char m_shift;
+ unsigned char n_lshift;
+};
- /* also ensure that boot cpu is enabled */
- cpu = 0;
+/* Initialize caller's MN struct and fill in values */
+static void get_mn(struct mn *mnp)
+{
+ memset(mnp, 0, sizeof(*mnp));
+ mnp->n_val = uv_cpuid.n_skt;
+ if (is_uv(UV4|UVY)) {
+ mnp->m_val = 0;
+ mnp->n_lshift = 0;
+ } else if (is_uv3_hub()) {
+ union uvyh_gr0_gam_gr_config_u m_gr_config;
+
+ mnp->m_val = uv_cpuid.m_skt;
+ m_gr_config.v = uv_read_local_mmr(UVH_GR0_GAM_GR_CONFIG);
+ mnp->n_lshift = m_gr_config.s3.m_skt;
+ } else if (is_uv2_hub()) {
+ mnp->m_val = uv_cpuid.m_skt;
+ mnp->n_lshift = mnp->m_val == 40 ? 40 : 39;
}
+ mnp->m_shift = mnp->m_val ? 64 - mnp->m_val : 0;
}
-#ifdef CONFIG_HOTPLUG_CPU
-static void __cpuinit uv_heartbeat_disable(int cpu)
+static void __init uv_init_hub_info(struct uv_hub_info_s *hi)
{
- if (uv_cpu_hub_info(cpu)->scir.enabled) {
- uv_cpu_hub_info(cpu)->scir.enabled = 0;
- del_timer(&uv_cpu_hub_info(cpu)->scir.timer);
+ struct mn mn;
+
+ get_mn(&mn);
+ hi->gpa_mask = mn.m_val ?
+ (1UL << (mn.m_val + mn.n_val)) - 1 :
+ (1UL << uv_cpuid.gpa_shift) - 1;
+
+ hi->m_val = mn.m_val;
+ hi->n_val = mn.n_val;
+ hi->m_shift = mn.m_shift;
+ hi->n_lshift = mn.n_lshift ? mn.n_lshift : 0;
+ hi->hub_revision = uv_hub_info->hub_revision;
+ hi->hub_type = uv_hub_info->hub_type;
+ hi->pnode_mask = uv_cpuid.pnode_mask;
+ hi->nasid_shift = uv_cpuid.nasid_shift;
+ hi->min_pnode = _min_pnode;
+ hi->min_socket = _min_socket;
+ hi->node_to_socket = _node_to_socket;
+ hi->pnode_to_socket = _pnode_to_socket;
+ hi->socket_to_node = _socket_to_node;
+ hi->socket_to_pnode = _socket_to_pnode;
+ hi->gr_table_len = _gr_table_len;
+ hi->gr_table = _gr_table;
+
+ uv_cpuid.gnode_shift = max_t(unsigned int, uv_cpuid.gnode_shift, mn.n_val);
+ hi->gnode_extra = (uv_node_id & ~((1 << uv_cpuid.gnode_shift) - 1)) >> 1;
+ if (mn.m_val)
+ hi->gnode_upper = (u64)hi->gnode_extra << mn.m_val;
+
+ if (uv_gp_table) {
+ hi->global_mmr_base = uv_gp_table->mmr_base;
+ hi->global_mmr_shift = uv_gp_table->mmr_shift;
+ hi->global_gru_base = uv_gp_table->gru_base;
+ hi->global_gru_shift = uv_gp_table->gru_shift;
+ hi->gpa_shift = uv_gp_table->gpa_shift;
+ hi->gpa_mask = (1UL << hi->gpa_shift) - 1;
+ } else {
+ hi->global_mmr_base =
+ uv_read_local_mmr(UVH_RH_GAM_MMR_OVERLAY_CONFIG) &
+ ~UV_MMR_ENABLE;
+ hi->global_mmr_shift = _UV_GLOBAL_MMR64_PNODE_SHIFT;
}
- uv_set_cpu_scir_bits(cpu, 0xff);
+
+ get_lowmem_redirect(&hi->lowmem_remap_base, &hi->lowmem_remap_top);
+
+ hi->apic_pnode_shift = uv_cpuid.socketid_shift;
+
+ /* Show system specific info: */
+ pr_info("UV: N:%d M:%d m_shift:%d n_lshift:%d\n", hi->n_val, hi->m_val, hi->m_shift, hi->n_lshift);
+ pr_info("UV: gpa_mask/shift:0x%lx/%d pnode_mask:0x%x apic_pns:%d\n", hi->gpa_mask, hi->gpa_shift, hi->pnode_mask, hi->apic_pnode_shift);
+ pr_info("UV: mmr_base/shift:0x%lx/%ld\n", hi->global_mmr_base, hi->global_mmr_shift);
+ if (hi->global_gru_base)
+ pr_info("UV: gru_base/shift:0x%lx/%ld\n",
+ hi->global_gru_base, hi->global_gru_shift);
+
+ pr_info("UV: gnode_upper:0x%lx gnode_extra:0x%x\n", hi->gnode_upper, hi->gnode_extra);
}
-/*
- * cpu hotplug notifier
- */
-static __cpuinit int uv_scir_cpu_notify(struct notifier_block *self,
- unsigned long action, void *hcpu)
+static void __init decode_gam_params(unsigned long ptr)
{
- long cpu = (long)hcpu;
+ uv_gp_table = (struct uv_gam_parameters *)ptr;
- switch (action) {
- case CPU_ONLINE:
- uv_heartbeat_enable(cpu);
- break;
- case CPU_DOWN_PREPARE:
- uv_heartbeat_disable(cpu);
- break;
- default:
- break;
- }
- return NOTIFY_OK;
+ pr_info("UV: GAM Params...\n");
+ pr_info("UV: mmr_base/shift:0x%llx/%d gru_base/shift:0x%llx/%d gpa_shift:%d\n",
+ uv_gp_table->mmr_base, uv_gp_table->mmr_shift,
+ uv_gp_table->gru_base, uv_gp_table->gru_shift,
+ uv_gp_table->gpa_shift);
}
-static __init void uv_scir_register_cpu_notifier(void)
+static void __init decode_gam_rng_tbl(unsigned long ptr)
{
- hotcpu_notifier(uv_scir_cpu_notify, 0);
-}
+ struct uv_gam_range_entry *gre = (struct uv_gam_range_entry *)ptr;
+ unsigned long lgre = 0, gend = 0;
+ int index = 0;
+ int sock_min = INT_MAX, pnode_min = INT_MAX;
+ int sock_max = -1, pnode_max = -1;
+
+ uv_gre_table = gre;
+ for (; gre->type != UV_GAM_RANGE_TYPE_UNUSED; gre++) {
+ unsigned long size = ((unsigned long)(gre->limit - lgre)
+ << UV_GAM_RANGE_SHFT);
+ int order = 0;
+ char suffix[] = " KMGTPE";
+ int flag = ' ';
+
+ while (size > 9999 && order < sizeof(suffix)) {
+ size /= 1024;
+ order++;
+ }
-#else /* !CONFIG_HOTPLUG_CPU */
+ /* adjust max block size to current range start */
+ if (gre->type == 1 || gre->type == 2)
+ if (adj_blksize(lgre))
+ flag = '*';
-static __init void uv_scir_register_cpu_notifier(void)
-{
+ if (!index) {
+ pr_info("UV: GAM Range Table...\n");
+ pr_info("UV: # %20s %14s %6s %4s %5s %3s %2s\n", "Range", "", "Size", "Type", "NASID", "SID", "PN");
+ }
+ pr_info("UV: %2d: 0x%014lx-0x%014lx%c %5lu%c %3d %04x %02x %02x\n",
+ index++,
+ (unsigned long)lgre << UV_GAM_RANGE_SHFT,
+ (unsigned long)gre->limit << UV_GAM_RANGE_SHFT,
+ flag, size, suffix[order],
+ gre->type, gre->nasid, gre->sockid, gre->pnode);
+
+ if (gre->type == UV_GAM_RANGE_TYPE_HOLE)
+ gend = (unsigned long)gre->limit << UV_GAM_RANGE_SHFT;
+
+ /* update to next range start */
+ lgre = gre->limit;
+ if (sock_min > gre->sockid)
+ sock_min = gre->sockid;
+ if (sock_max < gre->sockid)
+ sock_max = gre->sockid;
+ if (pnode_min > gre->pnode)
+ pnode_min = gre->pnode;
+ if (pnode_max < gre->pnode)
+ pnode_max = gre->pnode;
+ }
+ _min_socket = sock_min;
+ _max_socket = sock_max;
+ _min_pnode = pnode_min;
+ _max_pnode = pnode_max;
+ _gr_table_len = index;
+
+ pr_info("UV: GRT: %d entries, sockets(min:%x,max:%x), pnodes(min:%x,max:%x), gap_end(%d)\n",
+ index, _min_socket, _max_socket, _min_pnode, _max_pnode, fls64(gend));
}
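
The size-suffix loop above folds a byte count down by powers of 1024 and picks the matching " KMGTPE" character. A minimal stand-alone sketch of the same idea (print_human_size is a hypothetical helper; the loop bound is tightened by one so the string's terminating NUL can never be indexed):

#include <stdio.h>

static void print_human_size(unsigned long size)
{
	static const char suffix[] = " KMGTPE";
	int order = 0;

	/* Divide until the value fits in four digits, as above */
	while (size > 9999 && order < (int)sizeof(suffix) - 2) {
		size /= 1024;
		order++;
	}
	printf("%lu%c\n", size, suffix[order]);
}

int main(void)
{
	print_human_size(123456789UL);	/* prints "117M" */
	return 0;
}
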
-static __init int uv_init_heartbeat(void)
+/* Walk through UVsystab decoding the fields */
+static int __init decode_uv_systab(void)
{
- int cpu;
+ struct uv_systab *st;
+ int i;
- if (is_uv_system())
- for_each_online_cpu(cpu)
- uv_heartbeat_enable(cpu);
+ /* Get mapped UVsystab pointer */
+ st = uv_systab;
+
+ /* If UVsystab is version 1, there is no extended UVsystab */
+ if (st && st->revision == UV_SYSTAB_VERSION_1)
+ return 0;
+
+ if ((!st) || (st->revision < UV_SYSTAB_VERSION_UV4_LATEST)) {
+ int rev = st ? st->revision : 0;
+
+ pr_err("UV: BIOS UVsystab mismatch, (%x < %x)\n",
+ rev, UV_SYSTAB_VERSION_UV4_LATEST);
+ pr_err("UV: Does not support UV, switch to non-UV x86_64\n");
+ uv_system_type = UV_NONE;
+
+ return -EINVAL;
+ }
+
+ for (i = 0; st->entry[i].type != UV_SYSTAB_TYPE_UNUSED; i++) {
+ unsigned long ptr = st->entry[i].offset;
+
+ if (!ptr)
+ continue;
+
+ /* point to payload */
+ ptr += (unsigned long)st;
+
+ switch (st->entry[i].type) {
+ case UV_SYSTAB_TYPE_GAM_PARAMS:
+ decode_gam_params(ptr);
+ break;
+
+ case UV_SYSTAB_TYPE_GAM_RNG_TBL:
+ decode_gam_rng_tbl(ptr);
+ break;
+
+ case UV_SYSTAB_TYPE_ARCH_TYPE:
+ /* already processed in early startup */
+ break;
+
+ default:
+ pr_err("UV:%s:Unrecognized UV_SYSTAB_TYPE:%d, skipped\n",
+ __func__, st->entry[i].type);
+ break;
+ }
+ }
return 0;
}
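
decode_uv_systab() is a classic offset-table walk: each entry stores its payload's offset relative to the start of the table, so the absolute address is the table base plus that offset. A minimal sketch of the pattern with hypothetical types (UV_SYSTAB_TYPE_UNUSED is modeled as 0 here):

#include <stdio.h>
#include <stdint.h>

struct entry { int type; uint32_t offset; };
struct systab { struct entry entry[4]; };

static void walk(struct systab *st)
{
	for (int i = 0; st->entry[i].type != 0; i++) {
		if (!st->entry[i].offset)
			continue;		/* entry has no payload */
		/* Point to payload: table base plus stored offset */
		void *ptr = (char *)st + st->entry[i].offset;
		printf("type %d payload at %p\n", st->entry[i].type, ptr);
	}
}

int main(void)
{
	struct systab st = { .entry = { { 1, sizeof(st) }, { 0, 0 } } };

	walk(&st);
	return 0;
}
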
-late_initcall(uv_init_heartbeat);
+/*
+ * Given a bitmask 'bits' representing present blades, numbered
+ * starting at 'base', masking off unused high bits of the blade
+ * number with 'mask', update the minimum and maximum blade numbers
+ * found so far. (Masking with 'mask' is necessary because of the
+ * BIOS's treatment of system partitioning when creating the table
+ * we are interpreting.)
+ */
+static inline void blade_update_min_max(unsigned long bits, int base, int mask, int *min, int *max)
+{
+ int first, last;
+
+ if (!bits)
+ return;
+ first = (base + __ffs(bits)) & mask;
+ last = (base + __fls(bits)) & mask;
-#endif /* !CONFIG_HOTPLUG_CPU */
+ if (*min > first)
+ *min = first;
+ if (*max < last)
+ *max = last;
+}
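
A stand-alone illustration of the min/max update, assuming a 64-bit unsigned long and modeling the kernel's __ffs()/__fls() with GCC builtins (all names here are hypothetical):

#include <stdio.h>
#include <limits.h>

static void update_min_max(unsigned long bits, int base, int mask,
			   int *min, int *max)
{
	int first, last;

	if (!bits)
		return;
	first = (base + __builtin_ctzl(bits)) & mask;		/* __ffs() */
	last = (base + 63 - __builtin_clzl(bits)) & mask;	/* __fls() */
	if (*min > first)
		*min = first;
	if (*max < last)
		*max = last;
}

int main(void)
{
	int min = INT_MAX, max = -1;

	update_min_max(0x14UL, 0, 0xff, &min, &max);	/* bits 2 and 4 set */
	printf("min=%d max=%d\n", min, max);		/* min=2 max=4 */
	return 0;
}
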
-/* Direct Legacy VGA I/O traffic to designated IOH */
-int uv_set_vga_state(struct pci_dev *pdev, bool decode,
- unsigned int command_bits, bool change_bridge)
+/* Set up physical blade translations from UVH_NODE_PRESENT_TABLE */
+static __init void boot_init_possible_blades(struct uv_hub_info_s *hub_info)
{
- int domain, bus, rc;
+ unsigned long np;
+ int i, uv_pb = 0;
+ int sock_min = INT_MAX, sock_max = -1, s_mask;
+
+ s_mask = (1 << uv_cpuid.n_skt) - 1;
+
+ if (UVH_NODE_PRESENT_TABLE) {
+ pr_info("UV: NODE_PRESENT_DEPTH = %d\n",
+ UVH_NODE_PRESENT_TABLE_DEPTH);
+ for (i = 0; i < UVH_NODE_PRESENT_TABLE_DEPTH; i++) {
+ np = uv_read_local_mmr(UVH_NODE_PRESENT_TABLE + i * 8);
+ pr_info("UV: NODE_PRESENT(%d) = 0x%016lx\n", i, np);
+ blade_update_min_max(np, i * 64, s_mask, &sock_min, &sock_max);
+ }
+ }
+ if (UVH_NODE_PRESENT_0) {
+ np = uv_read_local_mmr(UVH_NODE_PRESENT_0);
+ pr_info("UV: NODE_PRESENT_0 = 0x%016lx\n", np);
+ blade_update_min_max(np, 0, s_mask, &sock_min, &sock_max);
+ }
+ if (UVH_NODE_PRESENT_1) {
+ np = uv_read_local_mmr(UVH_NODE_PRESENT_1);
+ pr_info("UV: NODE_PRESENT_1 = 0x%016lx\n", np);
+ blade_update_min_max(np, 64, s_mask, &sock_min, &sock_max);
+ }
- PR_DEVEL("devfn %x decode %d cmd %x chg_brdg %d\n",
- pdev->devfn, decode, command_bits, change_bridge);
+ /* Only update if we actually found some bits indicating blades present */
+ if (sock_max >= sock_min) {
+ _min_socket = sock_min;
+ _max_socket = sock_max;
+ uv_pb = sock_max - sock_min + 1;
+ }
+ if (uv_possible_blades != uv_pb)
+ uv_possible_blades = uv_pb;
- if (!change_bridge)
- return 0;
+ pr_info("UV: number nodes/possible blades %d (%d - %d)\n",
+ uv_pb, sock_min, sock_max);
+}
- if ((command_bits & PCI_COMMAND_IO) == 0)
- return 0;
+static int __init alloc_conv_table(int num_elem, unsigned short **table)
+{
+ int i;
+ size_t bytes;
+
+ bytes = num_elem * sizeof(*table[0]);
+ *table = kmalloc(bytes, GFP_KERNEL);
+ if (WARN_ON_ONCE(!*table))
+ return -ENOMEM;
+ for (i = 0; i < num_elem; i++)
+ ((unsigned short *)*table)[i] = SOCK_EMPTY;
+ return 0;
+}
- domain = pci_domain_nr(pdev->bus);
- bus = pdev->bus->number;
+/* Remove conversion table if it's 1:1 */
+#define FREE_1_TO_1_TABLE(tbl, min, max, max2) free_1_to_1_table(&tbl, #tbl, min, max, max2)
- rc = uv_bios_set_legacy_vga_target(decode, domain, bus);
- PR_DEVEL("vga decode %d %x:%x, rc: %d\n", decode, domain, bus, rc);
+static void __init free_1_to_1_table(unsigned short **tp, char *tname, int min, int max, int max2)
+{
+ int i;
+ unsigned short *table = *tp;
- return rc;
+ if (table == NULL)
+ return;
+ if (max != max2)
+ return;
+ for (i = 0; i < max; i++) {
+ if (i != table[i])
+ return;
+ }
+ kfree(table);
+ *tp = NULL;
+ pr_info("UV: %s is 1:1, conversion table removed\n", tname);
}
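
The point of free_1_to_1_table() is that an identity table carries no information, so it can be dropped and lookups can fall back to the index itself. A hypothetical userspace sketch of the same trade:

#include <stdio.h>
#include <stdlib.h>

/* If every slot maps to its own index, drop the table */
static unsigned short *maybe_free_identity(unsigned short *tbl, int n)
{
	for (int i = 0; i < n; i++)
		if (tbl[i] != i)
			return tbl;	/* not 1:1, keep it */
	free(tbl);
	return NULL;			/* caller treats NULL as identity */
}

static int lookup(const unsigned short *tbl, int i)
{
	return tbl ? tbl[i] : i;	/* NULL table == identity map */
}

int main(void)
{
	unsigned short *t = malloc(4 * sizeof(*t));

	for (int i = 0; i < 4; i++)
		t[i] = i;
	t = maybe_free_identity(t, 4);
	printf("lookup(2) = %d\n", lookup(t, 2));	/* 2, via identity */
	return 0;
}
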
/*
- * Called on each cpu to initialize the per_cpu UV data area.
- * FIXME: hotplug not supported yet
+ * Build Socket Tables
+ * If there is more than one node per socket, the socket-to-node table
+ * will contain the lowest node number on that socket.
*/
-void __cpuinit uv_cpu_init(void)
+static void __init build_socket_tables(void)
{
- /* CPU 0 initilization will be done via uv_system_init. */
- if (!uv_blade_info)
+ struct uv_gam_range_entry *gre = uv_gre_table;
+ int nums, numn, nump;
+ int i, lnid, apicid;
+ int minsock = _min_socket;
+ int maxsock = _max_socket;
+ int minpnode = _min_pnode;
+ int maxpnode = _max_pnode;
+
+ if (!gre) {
+ if (is_uv2_hub() || is_uv3_hub()) {
+ pr_info("UV: No UVsystab socket table, ignoring\n");
+ return;
+ }
+ pr_err("UV: Error: UVsystab address translations not available!\n");
+ WARN_ON_ONCE(!gre);
return;
+ }
+
+ numn = num_possible_nodes();
+ nump = maxpnode - minpnode + 1;
+ nums = maxsock - minsock + 1;
+
+ /* Allocate and clear tables */
+ if ((alloc_conv_table(nump, &_pnode_to_socket) < 0)
+ || (alloc_conv_table(nums, &_socket_to_pnode) < 0)
+ || (alloc_conv_table(numn, &_node_to_socket) < 0)
+ || (alloc_conv_table(nums, &_socket_to_node) < 0)) {
+ kfree(_pnode_to_socket);
+ kfree(_socket_to_pnode);
+ kfree(_node_to_socket);
+ return;
+ }
+
+ /* Fill in pnode/node/addr conversion list values: */
+ for (; gre->type != UV_GAM_RANGE_TYPE_UNUSED; gre++) {
+ if (gre->type == UV_GAM_RANGE_TYPE_HOLE)
+ continue;
+ i = gre->sockid - minsock;
+ if (_socket_to_pnode[i] == SOCK_EMPTY)
+ _socket_to_pnode[i] = gre->pnode;
+
+ i = gre->pnode - minpnode;
+ if (_pnode_to_socket[i] == SOCK_EMPTY)
+ _pnode_to_socket[i] = gre->sockid;
+
+ pr_info("UV: sid:%02x type:%d nasid:%04x pn:%02x pn2s:%2x\n",
+ gre->sockid, gre->type, gre->nasid,
+ _socket_to_pnode[gre->sockid - minsock],
+ _pnode_to_socket[gre->pnode - minpnode]);
+ }
+
+ /* Set socket -> node values: */
+ lnid = NUMA_NO_NODE;
+ for (apicid = 0; apicid < ARRAY_SIZE(__apicid_to_node); apicid++) {
+ int nid = __apicid_to_node[apicid];
+ int sockid;
+
+ if ((nid == NUMA_NO_NODE) || (lnid == nid))
+ continue;
+ lnid = nid;
+
+ sockid = apicid >> uv_cpuid.socketid_shift;
+
+ if (_socket_to_node[sockid - minsock] == SOCK_EMPTY)
+ _socket_to_node[sockid - minsock] = nid;
+
+ if (_node_to_socket[nid] == SOCK_EMPTY)
+ _node_to_socket[nid] = sockid;
+
+ pr_info("UV: sid:%02x: apicid:%04x socket:%02d node:%03x s2n:%03x\n",
+ sockid,
+ apicid,
+ _node_to_socket[nid],
+ nid,
+ _socket_to_node[sockid - minsock]);
+ }
- uv_blade_info[uv_numa_blade_id()].nr_online_cpus++;
+ /*
+ * If, e.g., socket id == pnode for all pnodes, the system runs
+ * faster when the corresponding conversion table is removed.
+ */
+ FREE_1_TO_1_TABLE(_socket_to_node, _min_socket, nums, numn);
+ FREE_1_TO_1_TABLE(_node_to_socket, _min_socket, nums, numn);
+ FREE_1_TO_1_TABLE(_socket_to_pnode, _min_pnode, nums, nump);
+ FREE_1_TO_1_TABLE(_pnode_to_socket, _min_pnode, nums, nump);
+}
- if (get_uv_system_type() == UV_NON_UNIQUE_APIC)
- set_x2apic_extra_bits(uv_hub_info->pnode);
+/* Check which reboot to use */
+static void check_efi_reboot(void)
+{
+ /* If EFI reboot not available, use ACPI reboot */
+ if (!efi_enabled(EFI_BOOT))
+ reboot_type = BOOT_ACPI;
}
/*
- * When NMI is received, print a stack trace.
+ * User /proc file handling is now deprecated.
+ * Recommend using /sys/firmware/sgi_uv/... instead.
*/
-int uv_handle_nmi(struct notifier_block *self, unsigned long reason, void *data)
+static int __maybe_unused proc_hubbed_show(struct seq_file *file, void *data)
{
- if (reason != DIE_NMI_IPI)
- return NOTIFY_OK;
- /*
- * Use a lock so only one cpu prints at a time
- * to prevent intermixed output.
- */
- spin_lock(&uv_nmi_lock);
- pr_info("NMI stack dump cpu %u:\n", smp_processor_id());
- dump_stack();
- spin_unlock(&uv_nmi_lock);
+ pr_notice_once("%s: using deprecated /proc/sgi_uv/hubbed, use /sys/firmware/sgi_uv/hub_type\n",
+ current->comm);
+ seq_printf(file, "0x%x\n", uv_hubbed_system);
+ return 0;
+}
- return NOTIFY_STOP;
+static int __maybe_unused proc_hubless_show(struct seq_file *file, void *data)
+{
+ pr_notice_once("%s: using deprecated /proc/sgi_uv/hubless, use /sys/firmware/sgi_uv/hubless\n",
+ current->comm);
+ seq_printf(file, "0x%x\n", uv_hubless_system);
+ return 0;
}
-static struct notifier_block uv_dump_stack_nmi_nb = {
- .notifier_call = uv_handle_nmi
-};
+static int __maybe_unused proc_archtype_show(struct seq_file *file, void *data)
+{
+ pr_notice_once("%s: using deprecated /proc/sgi_uv/archtype, use /sys/firmware/sgi_uv/archtype\n",
+ current->comm);
+ seq_printf(file, "%s/%s\n", uv_archtype, oem_table_id);
+ return 0;
+}
-void uv_register_nmi_notifier(void)
+static __init void uv_setup_proc_files(int hubless)
{
- if (register_die_notifier(&uv_dump_stack_nmi_nb))
- printk(KERN_WARNING "UV NMI handler failed to register\n");
+ struct proc_dir_entry *pde;
+
+ pde = proc_mkdir(UV_PROC_NODE, NULL);
+ proc_create_single("archtype", 0, pde, proc_archtype_show);
+ if (hubless)
+ proc_create_single("hubless", 0, pde, proc_hubless_show);
+ else
+ proc_create_single("hubbed", 0, pde, proc_hubbed_show);
}
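
uv_setup_proc_files() leans on proc_create_single(), which wires a single seq_file show callback to a read-only /proc entry with no file_operations boilerplate. A minimal sketch against the in-kernel API (the entry names and the printed value are made up):

#include <linux/init.h>
#include <linux/errno.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

static int demo_show(struct seq_file *m, void *v)
{
	seq_printf(m, "0x%x\n", 0x5);	/* hypothetical value */
	return 0;
}

static int __init demo_proc_init(void)
{
	struct proc_dir_entry *pde = proc_mkdir("demo_uv", NULL);

	if (!pde)
		return -ENOMEM;
	proc_create_single("archtype", 0, pde, demo_show);
	return 0;
}
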
-void uv_nmi_init(void)
+/* Initialize UV hubless systems */
+static __init int uv_system_init_hubless(void)
{
- unsigned int value;
+ int rc;
- /*
- * Unmask NMI on all cpus
- */
- value = apic_read(APIC_LVT1) | APIC_DM_NMI;
- value &= ~APIC_LVT_MASKED;
- apic_write(APIC_LVT1, value);
+ /* Setup PCH NMI handler */
+ uv_nmi_setup_hubless();
+
+ /* Init kernel/BIOS interface */
+ rc = uv_bios_init();
+ if (rc < 0)
+ return rc;
+
+ /* Process UVsystab */
+ rc = decode_uv_systab();
+ if (rc < 0)
+ return rc;
+
+ /* Set section block size for current node memory */
+ set_block_size();
+
+ /* Create user access node */
+ if (rc >= 0)
+ uv_setup_proc_files(1);
+
+ check_efi_reboot();
+
+ return rc;
}
-void __init uv_system_init(void)
+static void __init uv_system_init_hub(void)
{
- union uvh_si_addr_map_config_u m_n_config;
- union uvh_node_id_u node_id;
- unsigned long gnode_upper, lowmem_redir_base, lowmem_redir_size;
- int bytes, nid, cpu, lcpu, pnode, blade, i, j, m_val, n_val;
- int gnode_extra, max_pnode = 0;
- unsigned long mmr_base, present, paddr;
- unsigned short pnode_mask;
+ struct uv_hub_info_s hub_info = {0};
+ int bytes, cpu, nodeid, bid;
+ unsigned short min_pnode = USHRT_MAX, max_pnode = 0;
+ char *hub = is_uv5_hub() ? "UV500" :
+ is_uv4_hub() ? "UV400" :
+ is_uv3_hub() ? "UV300" :
+ is_uv2_hub() ? "UV2000/3000" : NULL;
+ struct uv_hub_info_s **uv_hub_info_list_blade;
+
+ if (!hub) {
+ pr_err("UV: Unknown/unsupported UV hub\n");
+ return;
+ }
+ pr_info("UV: Found %s hub\n", hub);
map_low_mmrs();
- m_n_config.v = uv_read_local_mmr(UVH_SI_ADDR_MAP_CONFIG);
- m_val = m_n_config.s.m_skt;
- n_val = m_n_config.s.n_skt;
- mmr_base =
- uv_read_local_mmr(UVH_RH_GAM_MMR_OVERLAY_CONFIG_MMR) &
- ~UV_MMR_ENABLE;
- pnode_mask = (1 << n_val) - 1;
- node_id.v = uv_read_local_mmr(UVH_NODE_ID);
- gnode_extra = (node_id.s.node_id & ~((1 << n_val) - 1)) >> 1;
- gnode_upper = ((unsigned long)gnode_extra << m_val);
- printk(KERN_DEBUG "UV: N %d, M %d, gnode_upper 0x%lx, gnode_extra 0x%x\n",
- n_val, m_val, gnode_upper, gnode_extra);
-
- printk(KERN_DEBUG "UV: global MMR base 0x%lx\n", mmr_base);
-
- for(i = 0; i < UVH_NODE_PRESENT_TABLE_DEPTH; i++)
- uv_possible_blades +=
- hweight64(uv_read_local_mmr( UVH_NODE_PRESENT_TABLE + i * 8));
- printk(KERN_DEBUG "UV: Found %d blades\n", uv_num_possible_blades());
-
- bytes = sizeof(struct uv_blade_info) * uv_num_possible_blades();
- uv_blade_info = kmalloc(bytes, GFP_KERNEL);
- BUG_ON(!uv_blade_info);
- for (blade = 0; blade < uv_num_possible_blades(); blade++)
- uv_blade_info[blade].memory_nid = -1;
-
- get_lowmem_redirect(&lowmem_redir_base, &lowmem_redir_size);
-
- bytes = sizeof(uv_node_to_blade[0]) * num_possible_nodes();
- uv_node_to_blade = kmalloc(bytes, GFP_KERNEL);
- BUG_ON(!uv_node_to_blade);
- memset(uv_node_to_blade, 255, bytes);
-
- bytes = sizeof(uv_cpu_to_blade[0]) * num_possible_cpus();
- uv_cpu_to_blade = kmalloc(bytes, GFP_KERNEL);
- BUG_ON(!uv_cpu_to_blade);
- memset(uv_cpu_to_blade, 255, bytes);
-
- blade = 0;
- for (i = 0; i < UVH_NODE_PRESENT_TABLE_DEPTH; i++) {
- present = uv_read_local_mmr(UVH_NODE_PRESENT_TABLE + i * 8);
- for (j = 0; j < 64; j++) {
- if (!test_bit(j, &present))
- continue;
- uv_blade_info[blade].pnode = (i * 64 + j);
- uv_blade_info[blade].nr_possible_cpus = 0;
- uv_blade_info[blade].nr_online_cpus = 0;
- blade++;
- }
+ /* Get uv_systab for decoding, setup UV BIOS calls */
+ uv_bios_init();
+
+	/* If there's a UVsystab problem, abort UV init: */
+ if (decode_uv_systab() < 0) {
+ pr_err("UV: Mangled UVsystab format\n");
+ return;
}
- uv_bios_init();
- uv_bios_get_sn_info(0, &uv_type, &sn_partition_id, &sn_coherency_id,
- &sn_region_size, &system_serial_number);
+ build_socket_tables();
+ build_uv_gr_table();
+ set_block_size();
+ uv_init_hub_info(&hub_info);
+	/* On UV2 or UV3 we may need to get the blade count from HW */
+ if (is_uv(UV2|UV3) && !uv_gre_table)
+ boot_init_possible_blades(&hub_info);
+ else
+ /* min/max sockets set in decode_gam_rng_tbl */
+ uv_possible_blades = (_max_socket - _min_socket) + 1;
+
+ /* uv_num_possible_blades() is really the hub count: */
+ pr_info("UV: Found %d hubs, %d nodes, %d CPUs\n", uv_num_possible_blades(), num_possible_nodes(), num_possible_cpus());
+
+ uv_bios_get_sn_info(0, &uv_type, &sn_partition_id, &sn_coherency_id, &sn_region_size, &system_serial_number);
+ hub_info.coherency_domain_number = sn_coherency_id;
uv_rtc_init();
- for_each_present_cpu(cpu) {
+ /*
+ * __uv_hub_info_list[] is indexed by node, but there is only
+ * one hub_info structure per blade. First, allocate one
+ * structure per blade. Further down we create a per-node
+ * table (__uv_hub_info_list[]) pointing to hub_info
+ * structures for the correct blade.
+ */
+
+ bytes = sizeof(void *) * uv_num_possible_blades();
+ uv_hub_info_list_blade = kzalloc(bytes, GFP_KERNEL);
+ if (WARN_ON_ONCE(!uv_hub_info_list_blade))
+ return;
+
+ bytes = sizeof(struct uv_hub_info_s);
+ for_each_possible_blade(bid) {
+ struct uv_hub_info_s *new_hub;
+
+ /* Allocate & fill new per hub info list */
+ new_hub = (bid == 0) ? &uv_hub_info_node0
+ : kzalloc_node(bytes, GFP_KERNEL, uv_blade_to_node(bid));
+ if (WARN_ON_ONCE(!new_hub)) {
+ /* do not kfree() bid 0, which is statically allocated */
+ while (--bid > 0)
+ kfree(uv_hub_info_list_blade[bid]);
+ kfree(uv_hub_info_list_blade);
+ return;
+ }
+
+ uv_hub_info_list_blade[bid] = new_hub;
+ *new_hub = hub_info;
+
+ /* Use information from GAM table if available: */
+ if (uv_gre_table)
+ new_hub->pnode = uv_blade_to_pnode(bid);
+ else /* Or fill in during CPU loop: */
+ new_hub->pnode = 0xffff;
+
+ new_hub->numa_blade_id = bid;
+ new_hub->memory_nid = NUMA_NO_NODE;
+ new_hub->nr_possible_cpus = 0;
+ new_hub->nr_online_cpus = 0;
+ }
+
+ /*
+ * Now populate __uv_hub_info_list[] for each node with the
+ * pointer to the struct for the blade it resides on.
+ */
+
+ bytes = sizeof(void *) * num_possible_nodes();
+ __uv_hub_info_list = kzalloc(bytes, GFP_KERNEL);
+ if (WARN_ON_ONCE(!__uv_hub_info_list)) {
+ for_each_possible_blade(bid)
+ /* bid 0 is statically allocated */
+ if (bid != 0)
+ kfree(uv_hub_info_list_blade[bid]);
+ kfree(uv_hub_info_list_blade);
+ return;
+ }
+
+ for_each_node(nodeid)
+ __uv_hub_info_list[nodeid] = uv_hub_info_list_blade[uv_node_to_blade_id(nodeid)];
+
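
The net effect of the loops above is a many-to-one indirection: one hub_info structure per blade, plus a per-node pointer table so every NUMA node on a blade shares that blade's single structure. A toy model of the layout (all names hypothetical):

#include <stdio.h>

struct hub_info { int pnode; int nr_cpus; };

int main(void)
{
	struct hub_info blade[2] = { { .pnode = 0 }, { .pnode = 1 } };
	/* nodes 0,1 live on blade 0; nodes 2,3 on blade 1 */
	struct hub_info *node_to_hub[4] = {
		&blade[0], &blade[0], &blade[1], &blade[1],
	};

	node_to_hub[3]->nr_cpus++;	/* visible to all of blade 1's nodes */
	printf("node 2 sees nr_cpus=%d\n", node_to_hub[2]->nr_cpus);
	return 0;
}
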
+ /* Initialize per CPU info: */
+ for_each_possible_cpu(cpu) {
int apicid = per_cpu(x86_cpu_to_apicid, cpu);
+ unsigned short bid;
+ unsigned short pnode;
- nid = cpu_to_node(cpu);
pnode = uv_apicid_to_pnode(apicid);
- blade = boot_pnode_to_blade(pnode);
- lcpu = uv_blade_info[blade].nr_possible_cpus;
- uv_blade_info[blade].nr_possible_cpus++;
-
- /* Any node on the blade, else will contain -1. */
- uv_blade_info[blade].memory_nid = nid;
-
- uv_cpu_hub_info(cpu)->lowmem_remap_base = lowmem_redir_base;
- uv_cpu_hub_info(cpu)->lowmem_remap_top = lowmem_redir_size;
- uv_cpu_hub_info(cpu)->m_val = m_val;
- uv_cpu_hub_info(cpu)->n_val = n_val;
- uv_cpu_hub_info(cpu)->numa_blade_id = blade;
- uv_cpu_hub_info(cpu)->blade_processor_id = lcpu;
- uv_cpu_hub_info(cpu)->pnode = pnode;
- uv_cpu_hub_info(cpu)->pnode_mask = pnode_mask;
- uv_cpu_hub_info(cpu)->gpa_mask = (1UL << (m_val + n_val)) - 1;
- uv_cpu_hub_info(cpu)->gnode_upper = gnode_upper;
- uv_cpu_hub_info(cpu)->gnode_extra = gnode_extra;
- uv_cpu_hub_info(cpu)->global_mmr_base = mmr_base;
- uv_cpu_hub_info(cpu)->coherency_domain_number = sn_coherency_id;
- uv_cpu_hub_info(cpu)->scir.offset = uv_scir_offset(apicid);
- uv_node_to_blade[nid] = blade;
- uv_cpu_to_blade[cpu] = blade;
- max_pnode = max(pnode, max_pnode);
+ bid = uv_pnode_to_socket(pnode) - _min_socket;
+
+ uv_cpu_info_per(cpu)->p_uv_hub_info = uv_hub_info_list_blade[bid];
+ uv_cpu_info_per(cpu)->blade_cpu_id = uv_cpu_hub_info(cpu)->nr_possible_cpus++;
+ if (uv_cpu_hub_info(cpu)->memory_nid == NUMA_NO_NODE)
+ uv_cpu_hub_info(cpu)->memory_nid = cpu_to_node(cpu);
+
+ if (uv_cpu_hub_info(cpu)->pnode == 0xffff)
+ uv_cpu_hub_info(cpu)->pnode = pnode;
}
- /* Add blade/pnode info for nodes without cpus */
- for_each_online_node(nid) {
- if (uv_node_to_blade[nid] >= 0)
+ for_each_possible_blade(bid) {
+ unsigned short pnode = uv_hub_info_list_blade[bid]->pnode;
+
+ if (pnode == 0xffff)
continue;
- paddr = node_start_pfn(nid) << PAGE_SHIFT;
- paddr = uv_soc_phys_ram_to_gpa(paddr);
- pnode = (paddr >> m_val) & pnode_mask;
- blade = boot_pnode_to_blade(pnode);
- uv_node_to_blade[nid] = blade;
+
+ min_pnode = min(pnode, min_pnode);
max_pnode = max(pnode, max_pnode);
+ pr_info("UV: HUB:%2d pn:%02x nrcpus:%d\n",
+ bid,
+ uv_hub_info_list_blade[bid]->pnode,
+ uv_hub_info_list_blade[bid]->nr_possible_cpus);
}
+ pr_info("UV: min_pnode:%02x max_pnode:%02x\n", min_pnode, max_pnode);
map_gru_high(max_pnode);
map_mmr_high(max_pnode);
- map_mmioh_high(max_pnode);
+ map_mmioh_high(min_pnode, max_pnode);
+ kfree(uv_hub_info_list_blade);
+ uv_hub_info_list_blade = NULL;
+
+ uv_nmi_setup();
uv_cpu_init();
- uv_scir_register_cpu_notifier();
- uv_register_nmi_notifier();
- proc_mkdir("sgi_uv", NULL);
+ uv_setup_proc_files(0);
- /* register Legacy VGA I/O redirection handler */
+ /* Register Legacy VGA I/O redirection handler: */
pci_register_set_vga_state(uv_set_vga_state);
+
+ check_efi_reboot();
}
+
+/*
+ * A different code path is needed to initialize a UV system that does
+ * not have a "UV HUB" (referred to as "hubless").
+ */
+void __init uv_system_init(void)
+{
+ if (likely(!is_uv_system() && !is_uv_hubless(1)))
+ return;
+
+ if (is_uv_system())
+ uv_system_init_hub();
+ else
+ uv_system_init_hubless();
+}
+
+apic_driver(apic_x2apic_uv_x);
diff --git a/arch/x86/kernel/apm_32.c b/arch/x86/kernel/apm_32.c
index c4f9182ca3ac..b37ab1095707 100644
--- a/arch/x86/kernel/apm_32.c
+++ b/arch/x86/kernel/apm_32.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/* -*- linux-c -*-
* APM BIOS driver for Linux
* Copyright 1994-2001 Stephen Rothwell (sfr@canb.auug.org.au)
@@ -5,16 +6,6 @@
* Initial development of this driver was funded by NEC Australia P/L
* and NEC Corporation
*
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the
- * Free Software Foundation; either version 2, or (at your option) any
- * later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- *
* October 1995, Rik Faith (faith@cs.unc.edu):
* Minor enhancements and updates (to the patch set) for 1.3.x
* Documentation
@@ -66,7 +57,7 @@
* 1.5: Fix segment register reloading (in case of bad segments saved
* across BIOS call).
* Stephen Rothwell
- * 1.6: Cope with complier/assembler differences.
+ * 1.6: Cope with compiler/assembler differences.
* Only try to turn off the first display device.
* Fix OOPS at power off with no APM BIOS by Jan Echternach
* <echter@informatik.uni-rostock.de>
@@ -103,7 +94,7 @@
* Remove APM dependencies in arch/i386/kernel/process.c
* Remove APM dependencies in drivers/char/sysrq.c
* Reset time across standby.
- * Allow more inititialisation on SMP.
+ * Allow more initialisation on SMP.
* Remove CONFIG_APM_POWER_OFF and make it boot time
* configurable (default on).
* Make debug only a boot time parameter (remove APM_DEBUG).
@@ -140,7 +131,7 @@
* is now the way life works).
* Fix thinko in suspend() (wrong return).
* Notify drivers on critical suspend.
- * Make kapmd absorb more idle time (Pavel Machek <pavel@suse.cz>
+ * Make kapmd absorb more idle time (Pavel Machek <pavel@ucw.cz>
* modified by sfr).
* Disable interrupts while we are suspended (Andy Henroid
* <andy_henroid@yahoo.com> fixed by sfr).
@@ -189,8 +180,8 @@
* Intel Order Number 241704-001. Microsoft Part Number 781-110-X01.
*
* [This document is available free from Intel by calling 800.628.8686 (fax
- * 916.356.6100) or 800.548.4725; or via anonymous ftp from
- * ftp://ftp.intel.com/pub/IAL/software_specs/apmv11.doc. It is also
+ * 916.356.6100) or 800.548.4725; or from
+ * 916.356.6100) or 800.548.4725; or from
+ * http://www.microsoft.com/whdc/archive/amp_12.mspx. It is also
* available from Microsoft by calling 206.882.8080.]
*
* APM 1.2 Reference:
@@ -201,6 +192,8 @@
* http://www.microsoft.com/whdc/archive/amp_12.mspx]
*/
+#define pr_fmt(fmt) "apm: " fmt
+
#include <linux/module.h>
#include <linux/poll.h>
@@ -216,7 +209,8 @@
#include <linux/apm_bios.h>
#include <linux/init.h>
#include <linux/time.h>
-#include <linux/sched.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/cputime.h>
#include <linux/pm.h>
#include <linux/capability.h>
#include <linux/device.h>
@@ -227,28 +221,24 @@
#include <linux/suspend.h>
#include <linux/kthread.h>
#include <linux/jiffies.h>
+#include <linux/acpi.h>
+#include <linux/syscore_ops.h>
+#include <linux/i8253.h>
+#include <linux/cpuidle.h>
-#include <asm/system.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
#include <asm/desc.h>
-#include <asm/i8253.h>
#include <asm/olpc.h>
#include <asm/paravirt.h>
#include <asm/reboot.h>
+#include <asm/nospec-branch.h>
+#include <asm/ibt.h>
#if defined(CONFIG_APM_DISPLAY_BLANK) && defined(CONFIG_VT)
extern int (*console_blank_hook)(int);
#endif
/*
- * The apm_bios device is one of the misc char devices.
- * This is its minor number.
- */
-#define APM_MINOR_DEV 134
-
-/*
- * See Documentation/Config.help for the configuration options.
- *
* Various options can be changed at boot time as follows:
* (We allow underscores for compatibility with the modules code)
* apm=on/off enable/disable APM
@@ -365,38 +355,58 @@ struct apm_user {
#endif
#define DEFAULT_IDLE_PERIOD (100 / 3)
+static int apm_cpu_idle(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int index);
+
+static struct cpuidle_driver apm_idle_driver = {
+ .name = "apm_idle",
+ .owner = THIS_MODULE,
+ .states = {
+ { /* entry 0 is for polling */ },
+ { /* entry 1 is for APM idle */
+ .name = "APM",
+ .desc = "APM idle",
+ .exit_latency = 250, /* WAG */
+ .target_residency = 500, /* WAG */
+ .enter = &apm_cpu_idle
+ },
+ },
+ .state_count = 2,
+};
+
+static struct cpuidle_device apm_cpuidle_device;
+
/*
* Local variables
*/
-static struct {
+__visible struct {
unsigned long offset;
unsigned short segment;
} apm_bios_entry;
static int clock_slowed;
static int idle_threshold __read_mostly = DEFAULT_IDLE_THRESHOLD;
static int idle_period __read_mostly = DEFAULT_IDLE_PERIOD;
-static int set_pm_idle;
static int suspends_pending;
static int standbys_pending;
static int ignore_sys_suspend;
static int ignore_normal_resume;
static int bounce_interval __read_mostly = DEFAULT_BOUNCE_INTERVAL;
-static int debug __read_mostly;
-static int smp __read_mostly;
+static bool debug __read_mostly;
+static bool smp __read_mostly;
static int apm_disabled = -1;
#ifdef CONFIG_SMP
-static int power_off;
+static bool power_off;
#else
-static int power_off = 1;
+static bool power_off = 1;
#endif
-static int realmode_power_off;
+static bool realmode_power_off;
#ifdef CONFIG_APM_ALLOW_INTS
-static int allow_ints = 1;
+static bool allow_ints = 1;
#else
-static int allow_ints;
+static bool allow_ints;
#endif
-static int broken_psr;
+static bool broken_psr;
static DECLARE_WAIT_QUEUE_HEAD(apm_waitqueue);
static DECLARE_WAIT_QUEUE_HEAD(apm_suspend_waitqueue);
@@ -410,7 +420,7 @@ static DEFINE_MUTEX(apm_mutex);
* This is for buggy BIOS's that refer to (real mode) segment 0x40
* even though they are called in protected mode.
*/
-static struct desc_struct bad_bios_desc = GDT_ENTRY_INIT(0x4092,
+static struct desc_struct bad_bios_desc = GDT_ENTRY_INIT(DESC_DATA32_BIOS,
(unsigned long)__va(0x400UL), PAGE_SIZE - 0x400 - 1);
static const char driver_version[] = "1.16ac"; /* no spaces */
@@ -485,11 +495,11 @@ static void apm_error(char *str, int err)
if (error_table[i].key == err)
break;
if (i < ERROR_COUNT)
- printk(KERN_NOTICE "apm: %s: %s\n", str, error_table[i].msg);
+ pr_notice("%s: %s\n", str, error_table[i].msg);
else if (err < 0)
- printk(KERN_NOTICE "apm: %s: linux error code %i\n", str, err);
+ pr_notice("%s: linux error code %i\n", str, err);
else
- printk(KERN_NOTICE "apm: %s: unknown error code %#2.2x\n",
+ pr_notice("%s: unknown error code %#2.2x\n",
str, err);
}
@@ -583,19 +593,24 @@ static long __apm_bios_call(void *_call)
struct desc_struct save_desc_40;
struct desc_struct *gdt;
struct apm_bios_call *call = _call;
+ u64 ibt;
cpu = get_cpu();
BUG_ON(cpu != 0);
- gdt = get_cpu_gdt_table(cpu);
+ gdt = get_cpu_gdt_rw(cpu);
save_desc_40 = gdt[0x40 / 8];
gdt[0x40 / 8] = bad_bios_desc;
apm_irq_save(flags);
+ firmware_restrict_branch_speculation_start();
+ ibt = ibt_save(true);
APM_DO_SAVE_SEGS;
apm_bios_call_asm(call->func, call->ebx, call->ecx,
&call->eax, &call->ebx, &call->ecx, &call->edx,
&call->esi);
APM_DO_RESTORE_SEGS;
+ ibt_restore(ibt);
+ firmware_restrict_branch_speculation_end();
apm_irq_restore(flags);
gdt[0x40 / 8] = save_desc_40;
put_cpu();
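
__apm_bios_call() brackets the BIOS entry with strict save/substitute/call/restore steps (the segment 0x40 descriptor, speculation controls, IBT), unwound in reverse order. A generic, userspace-only sketch of that bracket shape (no real descriptors are touched here; everything is illustrative):

#include <stdio.h>

struct desc { unsigned long base, limit; };

static struct desc gdt40;	/* stand-in for gdt[0x40 / 8] */
static const struct desc bad_bios = { 0x400, 0x1000 - 0x400 - 1 };

static void firmware_call(void (*fn)(void))
{
	struct desc saved = gdt40;	/* save clobberable state */

	gdt40 = bad_bios;		/* substitute a safe mapping */
	fn();				/* the untrusted BIOS entry */
	gdt40 = saved;			/* restore in reverse order */
}

static void fake_bios(void)
{
	puts("in BIOS");
}

int main(void)
{
	firmware_call(fake_bios);
	return 0;
}
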
@@ -659,18 +674,23 @@ static long __apm_bios_call_simple(void *_call)
struct desc_struct save_desc_40;
struct desc_struct *gdt;
struct apm_bios_call *call = _call;
+ u64 ibt;
cpu = get_cpu();
BUG_ON(cpu != 0);
- gdt = get_cpu_gdt_table(cpu);
+ gdt = get_cpu_gdt_rw(cpu);
save_desc_40 = gdt[0x40 / 8];
gdt[0x40 / 8] = bad_bios_desc;
apm_irq_save(flags);
+ firmware_restrict_branch_speculation_start();
+ ibt = ibt_save(true);
APM_DO_SAVE_SEGS;
error = apm_bios_call_simple_asm(call->func, call->ebx, call->ecx,
&call->eax);
APM_DO_RESTORE_SEGS;
+ ibt_restore(ibt);
+ firmware_restrict_branch_speculation_end();
apm_irq_restore(flags);
gdt[0x40 / 8] = save_desc_40;
put_cpu();
@@ -747,7 +767,7 @@ static int apm_driver_version(u_short *val)
* not cleared until it is acknowledged.
*
* Additional information is returned in the info pointer, providing
- * that APM 1.2 is in use. If no messges are pending the value 0x80
+ * that APM 1.2 is in use. If no messages are pending the value 0x80
* is returned (No power management events pending).
*/
static int apm_get_event(apm_event_t *event, apm_eventinfo_t *info)
@@ -818,24 +838,12 @@ static int apm_do_idle(void)
u32 eax;
u8 ret = 0;
int idled = 0;
- int polling;
int err = 0;
- polling = !!(current_thread_info()->status & TS_POLLING);
- if (polling) {
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we
- * test NEED_RESCHED:
- */
- smp_mb();
- }
if (!need_resched()) {
idled = 1;
ret = apm_bios_call_simple(APM_FUNC_IDLE, 0, 0, &eax, &err);
}
- if (polling)
- current_thread_info()->status |= TS_POLLING;
if (!idled)
return 0;
@@ -882,8 +890,6 @@ static void apm_do_busy(void)
#define IDLE_CALC_LIMIT (HZ * 100)
#define IDLE_LEAKY_MAX 16
-static void (*original_pm_idle)(void) __read_mostly;
-
/**
* apm_cpu_idle - cpu idling for APM capable Linux
*
@@ -892,34 +898,36 @@ static void (*original_pm_idle)(void) __read_mostly;
* Furthermore it calls the system default idle routine.
*/
-static void apm_cpu_idle(void)
+static int apm_cpu_idle(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int index)
{
static int use_apm_idle; /* = 0 */
static unsigned int last_jiffies; /* = 0 */
- static unsigned int last_stime; /* = 0 */
+ static u64 last_stime; /* = 0 */
+ u64 stime, utime;
int apm_idle_done = 0;
unsigned int jiffies_since_last_check = jiffies - last_jiffies;
unsigned int bucket;
recalc:
+ task_cputime(current, &utime, &stime);
if (jiffies_since_last_check > IDLE_CALC_LIMIT) {
use_apm_idle = 0;
- last_jiffies = jiffies;
- last_stime = current->stime;
} else if (jiffies_since_last_check > idle_period) {
unsigned int idle_percentage;
- idle_percentage = current->stime - last_stime;
+ idle_percentage = nsecs_to_jiffies(stime - last_stime);
idle_percentage *= 100;
idle_percentage /= jiffies_since_last_check;
use_apm_idle = (idle_percentage > idle_threshold);
if (apm_info.forbid_idle)
use_apm_idle = 0;
- last_jiffies = jiffies;
- last_stime = current->stime;
}
+ last_jiffies = jiffies;
+ last_stime = stime;
+
bucket = IDLE_LEAKY_MAX;
while (!need_resched()) {
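
The recalc block above decides between the APM BIOS idle call and default_idle() from the share of the check window accounted as system time (the kernel converts the stime delta from nanoseconds with nsecs_to_jiffies()). A worked example with hypothetical numbers, taking the delta as already converted:

#include <stdio.h>

int main(void)
{
	unsigned int jiffies_since_last_check = 40;	/* window length */
	unsigned int stime_jiffies = 38;	/* stime delta, in jiffies */
	unsigned int idle_threshold = 95;	/* DEFAULT_IDLE_THRESHOLD */
	unsigned int idle_percentage;

	idle_percentage = stime_jiffies * 100 / jiffies_since_last_check;
	/* 95% is not strictly above the 95% threshold: APM idle stays off */
	printf("idle %u%% -> use_apm_idle=%d\n", idle_percentage,
	       idle_percentage > idle_threshold);
	return 0;
}
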
@@ -947,10 +955,7 @@ recalc:
break;
}
}
- if (original_pm_idle)
- original_pm_idle();
- else
- default_idle();
+ default_idle();
local_irq_disable();
jiffies_since_last_check = jiffies - last_jiffies;
if (jiffies_since_last_check > idle_period)
@@ -960,7 +965,7 @@ recalc:
if (apm_idle_done)
apm_do_busy();
- local_irq_enable();
+ return index;
}
/**
@@ -975,20 +980,10 @@ recalc:
static void apm_power_off(void)
{
- unsigned char po_bios_call[] = {
- 0xb8, 0x00, 0x10, /* movw $0x1000,ax */
- 0x8e, 0xd0, /* movw ax,ss */
- 0xbc, 0x00, 0xf0, /* movw $0xf000,sp */
- 0xb8, 0x07, 0x53, /* movw $0x5307,ax */
- 0xbb, 0x01, 0x00, /* movw $0x0001,bx */
- 0xb9, 0x03, 0x00, /* movw $0x0003,cx */
- 0xcd, 0x15 /* int $0x15 */
- };
-
/* Some bioses don't like being called from CPU != 0 */
if (apm_info.realmode_power_off) {
set_cpus_allowed_ptr(current, cpumask_of(0));
- machine_real_restart(po_bios_call, sizeof(po_bios_call));
+ machine_real_restart(MRR_APM);
} else {
(void)set_system_power_state(APM_STATE_OFF);
}
@@ -1031,7 +1026,7 @@ static int apm_enable_power_management(int enable)
* status which gives the rough battery status, and current power
* source. The bat value returned give an estimate as a percentage
* of life and a status value for the battery. The estimated life
- * if reported is a lifetime in secodnds/minutes at current powwer
+ * if reported is a lifetime in seconds/minutes at current power
* consumption.
*/
@@ -1045,8 +1040,11 @@ static int apm_get_power_status(u_short *status, u_short *bat, u_short *life)
if (apm_info.get_power_status_broken)
return APM_32_UNSUPPORTED;
- if (apm_bios_call(&call))
+ if (apm_bios_call(&call)) {
+ if (!call.err)
+ return APM_NO_ERROR;
return call.err;
+ }
*status = call.ebx;
*bat = call.ecx;
if (apm_info.get_power_status_swabinminutes) {
@@ -1057,41 +1055,12 @@ static int apm_get_power_status(u_short *status, u_short *bat, u_short *life)
return APM_SUCCESS;
}
-#if 0
-static int apm_get_battery_status(u_short which, u_short *status,
- u_short *bat, u_short *life, u_short *nbat)
-{
- u32 eax;
- u32 ebx;
- u32 ecx;
- u32 edx;
- u32 esi;
-
- if (apm_info.connection_version < 0x0102) {
- /* pretend we only have one battery. */
- if (which != 1)
- return APM_BAD_DEVICE;
- *nbat = 1;
- return apm_get_power_status(status, bat, life);
- }
-
- if (apm_bios_call(APM_FUNC_GET_STATUS, (0x8000 | (which)), 0, &eax,
- &ebx, &ecx, &edx, &esi))
- return (eax >> 8) & 0xff;
- *status = ebx;
- *bat = ecx;
- *life = edx;
- *nbat = esi;
- return APM_SUCCESS;
-}
-#endif
-
/**
* apm_engage_power_management - enable PM on a device
* @device: identity of device
* @enable: on/off
*
- * Activate or deactive power management on either a specific device
+ * Activate or deactivate power management on either a specific device
* or the entire system (%APM_DEVICE_ALL).
*/
@@ -1193,7 +1162,7 @@ static void queue_event(apm_event_t event, struct apm_user *sender)
static int notified;
if (notified++ == 0)
- printk(KERN_ERR "apm: an event queue overflowed\n");
+ pr_err("an event queue overflowed\n");
if (++as->event_tail >= APM_MAX_EVENTS)
as->event_tail = 0;
}
@@ -1226,11 +1195,11 @@ static void reinit_timer(void)
raw_spin_lock_irqsave(&i8253_lock, flags);
/* set the clock to HZ */
- outb_pit(0x34, PIT_MODE); /* binary, mode 2, LSB/MSB, ch 0 */
+ outb_p(0x34, PIT_MODE); /* binary, mode 2, LSB/MSB, ch 0 */
udelay(10);
- outb_pit(LATCH & 0xff, PIT_CH0); /* LSB */
+ outb_p(LATCH & 0xff, PIT_CH0); /* LSB */
udelay(10);
- outb_pit(LATCH >> 8, PIT_CH0); /* MSB */
+ outb_p(LATCH >> 8, PIT_CH0); /* MSB */
udelay(10);
raw_spin_unlock_irqrestore(&i8253_lock, flags);
#endif
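
reinit_timer() reprograms PIT channel 0 back to the kernel tick rate, writing the LATCH divisor LSB first, then MSB. The worked numbers below assume the canonical 1193182 Hz PIT input clock and HZ=100 (both assumptions, typical for i386 configs):

#include <stdio.h>

#define PIT_TICK_RATE	1193182UL
#define HZ		100

int main(void)
{
	/* Rounded divisor, as in the kernel's LATCH definition */
	unsigned long latch = (PIT_TICK_RATE + HZ / 2) / HZ;

	printf("LATCH=%lu -> LSB=0x%02lx MSB=0x%02lx\n",
	       latch, latch & 0xff, latch >> 8);	/* 11932 -> 0x9c 0x2e */
	return 0;
}
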
@@ -1242,11 +1211,10 @@ static int suspend(int vetoable)
struct apm_user *as;
dpm_suspend_start(PMSG_SUSPEND);
-
- dpm_suspend_noirq(PMSG_SUSPEND);
+ dpm_suspend_end(PMSG_SUSPEND);
local_irq_disable();
- sysdev_suspend(PMSG_SUSPEND);
+ syscore_suspend();
local_irq_enable();
@@ -1264,12 +1232,12 @@ static int suspend(int vetoable)
apm_error("suspend", err);
err = (err == APM_SUCCESS) ? 0 : -EIO;
- sysdev_resume();
+ syscore_resume();
local_irq_enable();
- dpm_resume_noirq(PMSG_RESUME);
-
+ dpm_resume_start(PMSG_RESUME);
dpm_resume_end(PMSG_RESUME);
+
queue_event(APM_NORMAL_RESUME, NULL);
spin_lock(&user_list_lock);
for (as = user_list; as != NULL; as = as->next) {
@@ -1285,10 +1253,10 @@ static void standby(void)
{
int err;
- dpm_suspend_noirq(PMSG_SUSPEND);
+ dpm_suspend_end(PMSG_SUSPEND);
local_irq_disable();
- sysdev_suspend(PMSG_SUSPEND);
+ syscore_suspend();
local_irq_enable();
err = set_system_power_state(APM_STATE_STANDBY);
@@ -1296,10 +1264,10 @@ static void standby(void)
apm_error("standby", err);
local_irq_disable();
- sysdev_resume();
+ syscore_resume();
local_irq_enable();
- dpm_resume_noirq(PMSG_RESUME);
+ dpm_resume_start(PMSG_RESUME);
}
static apm_event_t get_event(void)
@@ -1457,7 +1425,7 @@ static void apm_mainloop(void)
static int check_apm_user(struct apm_user *as, const char *func)
{
if (as == NULL || as->magic != APM_BIOS_MAGIC) {
- printk(KERN_ERR "apm: %s passed bad filp\n", func);
+ pr_err("%s passed bad filp\n", func);
return 1;
}
return 0;
@@ -1506,7 +1474,7 @@ static ssize_t do_read(struct file *fp, char __user *buf, size_t count, loff_t *
return 0;
}
-static unsigned int do_poll(struct file *fp, poll_table *wait)
+static __poll_t do_poll(struct file *fp, poll_table *wait)
{
struct apm_user *as;
@@ -1515,7 +1483,7 @@ static unsigned int do_poll(struct file *fp, poll_table *wait)
return 0;
poll_wait(fp, &apm_waitqueue, wait);
if (!queue_empty(as))
- return POLLIN | POLLRDNORM;
+ return EPOLLIN | EPOLLRDNORM;
return 0;
}
@@ -1596,7 +1564,7 @@ static int do_release(struct inode *inode, struct file *filp)
as1 = as1->next)
;
if (as1 == NULL)
- printk(KERN_ERR "apm: filp not in user list\n");
+ pr_err("filp not in user list\n");
else
as1->next = as->next;
}
@@ -1610,11 +1578,9 @@ static int do_open(struct inode *inode, struct file *filp)
struct apm_user *as;
as = kmalloc(sizeof(*as), GFP_KERNEL);
- if (as == NULL) {
- printk(KERN_ERR "apm: cannot allocate struct of size %d bytes\n",
- sizeof(*as));
+ if (as == NULL)
return -ENOMEM;
- }
+
as->magic = APM_BIOS_MAGIC;
as->event_tail = as->event_head = 0;
as->suspends_pending = as->standbys_pending = 0;
@@ -1637,6 +1603,7 @@ static int do_open(struct inode *inode, struct file *filp)
return 0;
}
+#ifdef CONFIG_PROC_FS
static int proc_apm_show(struct seq_file *m, void *v)
{
unsigned short bx;
@@ -1716,19 +1683,7 @@ static int proc_apm_show(struct seq_file *m, void *v)
units);
return 0;
}
-
-static int proc_apm_open(struct inode *inode, struct file *file)
-{
- return single_open(file, proc_apm_show, NULL);
-}
-
-static const struct file_operations apm_file_ops = {
- .owner = THIS_MODULE,
- .open = proc_apm_open,
- .read = seq_read,
- .llseek = seq_lseek,
- .release = single_release,
-};
+#endif
static int apm(void *unused)
{
@@ -1926,6 +1881,7 @@ static const struct file_operations apm_bios_fops = {
.unlocked_ioctl = do_ioctl,
.open = do_open,
.release = do_release,
+ .llseek = noop_llseek,
};
static struct miscdevice apm_device = {
@@ -2044,7 +2000,7 @@ static int __init swab_apm_power_in_minutes(const struct dmi_system_id *d)
return 0;
}
-static struct dmi_system_id __initdata apm_dmi_table[] = {
+static const struct dmi_system_id apm_dmi_table[] __initconst = {
{
print_if_true,
KERN_WARNING "IBM T23 - BIOS 1.03b+ and controller firmware 1.02+ may be needed for Linux APM.",
@@ -2272,7 +2228,7 @@ static int __init apm_init(void)
dmi_check_system(apm_dmi_table);
- if (apm_info.bios.version == 0 || paravirt_enabled() || machine_is_olpc()) {
+ if (apm_info.bios.version == 0 || machine_is_olpc()) {
printk(KERN_INFO "apm: BIOS not found.\n");
return -ENODEV;
}
@@ -2322,20 +2278,19 @@ static int __init apm_init(void)
}
if (apm_info.disabled) {
- printk(KERN_NOTICE "apm: disabled on user request.\n");
+ pr_notice("disabled on user request.\n");
return -ENODEV;
}
if ((num_online_cpus() > 1) && !power_off && !smp) {
- printk(KERN_NOTICE "apm: disabled - APM is not SMP safe.\n");
+ pr_notice("disabled - APM is not SMP safe.\n");
apm_info.disabled = 1;
return -ENODEV;
}
- if (pm_flags & PM_ACPI) {
- printk(KERN_NOTICE "apm: overridden by ACPI.\n");
+ if (!acpi_disabled) {
+ pr_notice("overridden by ACPI.\n");
apm_info.disabled = 1;
return -ENODEV;
}
- pm_flags |= PM_APM;
/*
* Set up the long jump entry point to the APM BIOS, which is called
@@ -2354,7 +2309,7 @@ static int __init apm_init(void)
* Note we only set APM segments on CPU zero, since we pin the APM
* code to that CPU.
*/
- gdt = get_cpu_gdt_table(0);
+ gdt = get_cpu_gdt_rw(0);
set_desc_base(&gdt[APM_CS >> 3],
(unsigned long)__va((unsigned long)apm_info.bios.cseg << 4));
set_desc_base(&gdt[APM_CS_16 >> 3],
@@ -2362,12 +2317,11 @@ static int __init apm_init(void)
set_desc_base(&gdt[APM_DS >> 3],
(unsigned long)__va((unsigned long)apm_info.bios.dseg << 4));
- proc_create("apm", 0, NULL, &apm_file_ops);
+ proc_create_single("apm", 0, NULL, proc_apm_show);
kapmd_task = kthread_create(apm, NULL, "kapmd");
if (IS_ERR(kapmd_task)) {
- printk(KERN_ERR "apm: disabled - Unable to start kernel "
- "thread.\n");
+ pr_err("disabled - Unable to start kernel thread\n");
err = PTR_ERR(kapmd_task);
kapmd_task = NULL;
remove_proc_entry("apm", NULL);
@@ -2392,9 +2346,10 @@ static int __init apm_init(void)
if (HZ != 100)
idle_period = (idle_period * HZ) / 100;
if (idle_threshold < 100) {
- original_pm_idle = pm_idle;
- pm_idle = apm_cpu_idle;
- set_pm_idle = 1;
+ cpuidle_poll_state_init(&apm_idle_driver);
+ if (!cpuidle_register_driver(&apm_idle_driver))
+ if (cpuidle_register_device(&apm_cpuidle_device))
+ cpuidle_unregister_driver(&apm_idle_driver);
}
return 0;
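
The idle hook is now installed through cpuidle rather than by swapping pm_idle. A sketch of the same register-with-unwind pattern against the in-kernel cpuidle API (the driver and device names are made up):

#include <linux/init.h>
#include <linux/cpuidle.h>

static struct cpuidle_driver demo_idle_driver = {
	.name = "demo_idle",
};
static struct cpuidle_device demo_cpuidle_device;

static int __init demo_idle_init(void)
{
	int ret = cpuidle_register_driver(&demo_idle_driver);

	if (ret)
		return ret;
	ret = cpuidle_register_device(&demo_cpuidle_device);
	if (ret)
		cpuidle_unregister_driver(&demo_idle_driver);	/* unwind */
	return ret;
}
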
@@ -2404,15 +2359,9 @@ static void __exit apm_exit(void)
{
int error;
- if (set_pm_idle) {
- pm_idle = original_pm_idle;
- /*
- * We are about to unload the current idle thread pm callback
- * (pm_idle), Wait for all processors to update cached/local
- * copies of pm_idle before proceeding.
- */
- cpu_idle_wait();
- }
+ cpuidle_unregister_device(&apm_cpuidle_device);
+ cpuidle_unregister_driver(&apm_idle_driver);
+
if (((apm_info.bios.flags & APM_BIOS_DISENGAGED) == 0)
&& (apm_info.connection_version > 0x0100)) {
error = apm_engage_power_management(APM_DEVICE_ALL, 0);
@@ -2427,7 +2376,6 @@ static void __exit apm_exit(void)
kthread_stop(kapmd_task);
kapmd_task = NULL;
}
- pm_flags &= ~PM_APM;
}
module_init(apm_init);
@@ -2455,7 +2403,7 @@ MODULE_PARM_DESC(idle_threshold,
"System idle percentage above which to make APM BIOS idle calls");
module_param(idle_period, int, 0444);
MODULE_PARM_DESC(idle_period,
- "Period (in sec/100) over which to caculate the idle percentage");
+ "Period (in sec/100) over which to calculate the idle percentage");
module_param(smp, bool, 0444);
MODULE_PARM_DESC(smp,
"Set this to enable APM use on an SMP platform. Use with caution on older systems");
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index cfa82c899f47..25fcde525c68 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -1,5 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Generate definitions needed by assembly language modules.
+ * This code generates raw asm output which is post-processed to extract
+ * and format the required data.
+ */
+#define COMPILE_OFFSETS
+
+#include <linux/crypto.h>
+#include <crypto/aria.h>
+#include <linux/sched.h>
+#include <linux/stddef.h>
+#include <linux/hardirq.h>
+#include <linux/suspend.h>
+#include <linux/kbuild.h>
+#include <asm/processor.h>
+#include <asm/thread_info.h>
+#include <asm/sigframe.h>
+#include <asm/bootparam.h>
+#include <asm/suspend.h>
+#include <asm/tlbflush.h>
+#include <asm/tdx.h>
+
+#ifdef CONFIG_XEN
+#include <xen/interface/xen.h>
+#endif
+
#ifdef CONFIG_X86_32
# include "asm-offsets_32.c"
#else
# include "asm-offsets_64.c"
#endif
+
+static void __used common(void)
+{
+ OFFSET(CPUINFO_x86, cpuinfo_x86, x86);
+ OFFSET(CPUINFO_x86_vendor, cpuinfo_x86, x86_vendor);
+ OFFSET(CPUINFO_x86_model, cpuinfo_x86, x86_model);
+ OFFSET(CPUINFO_x86_stepping, cpuinfo_x86, x86_stepping);
+ OFFSET(CPUINFO_cpuid_level, cpuinfo_x86, cpuid_level);
+ OFFSET(CPUINFO_x86_capability, cpuinfo_x86, x86_capability);
+ OFFSET(CPUINFO_x86_vendor_id, cpuinfo_x86, x86_vendor_id);
+
+ BLANK();
+ OFFSET(TASK_threadsp, task_struct, thread.sp);
+#ifdef CONFIG_STACKPROTECTOR
+ OFFSET(TASK_stack_canary, task_struct, stack_canary);
+#endif
+
+ BLANK();
+ OFFSET(pbe_address, pbe, address);
+ OFFSET(pbe_orig_address, pbe, orig_address);
+ OFFSET(pbe_next, pbe, next);
+
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+ BLANK();
+ OFFSET(IA32_SIGCONTEXT_ax, sigcontext_32, ax);
+ OFFSET(IA32_SIGCONTEXT_bx, sigcontext_32, bx);
+ OFFSET(IA32_SIGCONTEXT_cx, sigcontext_32, cx);
+ OFFSET(IA32_SIGCONTEXT_dx, sigcontext_32, dx);
+ OFFSET(IA32_SIGCONTEXT_si, sigcontext_32, si);
+ OFFSET(IA32_SIGCONTEXT_di, sigcontext_32, di);
+ OFFSET(IA32_SIGCONTEXT_bp, sigcontext_32, bp);
+ OFFSET(IA32_SIGCONTEXT_sp, sigcontext_32, sp);
+ OFFSET(IA32_SIGCONTEXT_ip, sigcontext_32, ip);
+
+ BLANK();
+ OFFSET(IA32_RT_SIGFRAME_sigcontext, rt_sigframe_ia32, uc.uc_mcontext);
+#endif
+
+#ifdef CONFIG_XEN
+ BLANK();
+ OFFSET(XEN_vcpu_info_mask, vcpu_info, evtchn_upcall_mask);
+ OFFSET(XEN_vcpu_info_pending, vcpu_info, evtchn_upcall_pending);
+ OFFSET(XEN_vcpu_info_arch_cr2, vcpu_info, arch.cr2);
+#endif
+
+ BLANK();
+ OFFSET(TDX_MODULE_rcx, tdx_module_args, rcx);
+ OFFSET(TDX_MODULE_rdx, tdx_module_args, rdx);
+ OFFSET(TDX_MODULE_r8, tdx_module_args, r8);
+ OFFSET(TDX_MODULE_r9, tdx_module_args, r9);
+ OFFSET(TDX_MODULE_r10, tdx_module_args, r10);
+ OFFSET(TDX_MODULE_r11, tdx_module_args, r11);
+ OFFSET(TDX_MODULE_r12, tdx_module_args, r12);
+ OFFSET(TDX_MODULE_r13, tdx_module_args, r13);
+ OFFSET(TDX_MODULE_r14, tdx_module_args, r14);
+ OFFSET(TDX_MODULE_r15, tdx_module_args, r15);
+ OFFSET(TDX_MODULE_rbx, tdx_module_args, rbx);
+ OFFSET(TDX_MODULE_rdi, tdx_module_args, rdi);
+ OFFSET(TDX_MODULE_rsi, tdx_module_args, rsi);
+
+ BLANK();
+ OFFSET(BP_scratch, boot_params, scratch);
+ OFFSET(BP_secure_boot, boot_params, secure_boot);
+ OFFSET(BP_loadflags, boot_params, hdr.loadflags);
+ OFFSET(BP_hardware_subarch, boot_params, hdr.hardware_subarch);
+ OFFSET(BP_version, boot_params, hdr.version);
+ OFFSET(BP_kernel_alignment, boot_params, hdr.kernel_alignment);
+ OFFSET(BP_init_size, boot_params, hdr.init_size);
+ OFFSET(BP_pref_address, boot_params, hdr.pref_address);
+
+ BLANK();
+ DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
+ OFFSET(C_PTREGS_SIZE, pt_regs, orig_ax);
+
+ /* TLB state for the entry code */
+ OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
+
+ /* Layout info for cpu_entry_area */
+ OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
+ DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
+ DEFINE(MASK_entry_stack, (~(sizeof(struct entry_stack) - 1)));
+
+ /* Offset for fields in tss_struct */
+ OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
+ OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
+ OFFSET(TSS_sp2, tss_struct, x86_tss.sp2);
+#if IS_ENABLED(CONFIG_CRYPTO_ARIA_AESNI_AVX_X86_64)
+ /* Offset for fields in aria_ctx */
+ BLANK();
+ OFFSET(ARIA_CTX_enc_key, aria_ctx, enc_key);
+ OFFSET(ARIA_CTX_dec_key, aria_ctx, dec_key);
+ OFFSET(ARIA_CTX_rounds, aria_ctx, rounds);
+#endif
+
+ BLANK();
+ DEFINE(ALT_INSTR_SIZE, sizeof(struct alt_instr));
+ DEFINE(EXTABLE_SIZE, sizeof(struct exception_table_entry));
+}
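
For readers unfamiliar with asm-offsets: OFFSET()/DEFINE() never run at runtime. Each expands to an asm statement whose immediate operand is a compile-time constant; Kbuild compiles this file with -S and a sed pass turns the "->" marker lines in the generated assembly into #defines consumable from .S files. A userspace mock of the mechanism (the macro bodies are modeled on include/linux/kbuild.h and should be treated as an approximation):

#include <stddef.h>

#define DEFINE(sym, val) \
	asm volatile("\n.ascii \"->" #sym " %0 " #val "\"" : : "i" (val))
#define OFFSET(sym, str, mem) DEFINE(sym, offsetof(struct str, mem))

struct pt_regs_demo { long bx, cx; };

void foo(void);
void foo(void)
{
	OFFSET(DEMO_pt_regs_cx, pt_regs_demo, cx);	/* -> 8 on LP64 */
	DEFINE(DEMO_PTREGS_SIZE, sizeof(struct pt_regs_demo));
}

Compile with "cc -S mock.c" and grep the output for "->" to see the constants that the sed script would harvest.
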
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index dfdbf6403895..e0a292db97b2 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -1,71 +1,17 @@
-/*
- * Generate definitions needed by assembly language modules.
- * This code generates raw asm output which is post-processed
- * to extract and format the required data.
- */
-
-#include <linux/crypto.h>
-#include <linux/sched.h>
-#include <linux/signal.h>
-#include <linux/personality.h>
-#include <linux/suspend.h>
-#include <linux/kbuild.h>
-#include <asm/ucontext.h>
-#include <asm/sigframe.h>
-#include <asm/pgtable.h>
-#include <asm/fixmap.h>
-#include <asm/processor.h>
-#include <asm/thread_info.h>
-#include <asm/bootparam.h>
-#include <asm/elf.h>
-#include <asm/suspend.h>
+// SPDX-License-Identifier: GPL-2.0
+#ifndef __LINUX_KBUILD_H
+# error "Please do not build this file directly, build asm-offsets.c instead"
+#endif
-#include <xen/interface/xen.h>
+#include <linux/efi.h>
-#include <linux/lguest.h>
-#include "../../../drivers/lguest/lg.h"
+#include <asm/ucontext.h>
/* workaround for a warning with -Wmissing-prototypes */
void foo(void);
void foo(void)
{
- OFFSET(IA32_SIGCONTEXT_ax, sigcontext, ax);
- OFFSET(IA32_SIGCONTEXT_bx, sigcontext, bx);
- OFFSET(IA32_SIGCONTEXT_cx, sigcontext, cx);
- OFFSET(IA32_SIGCONTEXT_dx, sigcontext, dx);
- OFFSET(IA32_SIGCONTEXT_si, sigcontext, si);
- OFFSET(IA32_SIGCONTEXT_di, sigcontext, di);
- OFFSET(IA32_SIGCONTEXT_bp, sigcontext, bp);
- OFFSET(IA32_SIGCONTEXT_sp, sigcontext, sp);
- OFFSET(IA32_SIGCONTEXT_ip, sigcontext, ip);
- BLANK();
-
- OFFSET(CPUINFO_x86, cpuinfo_x86, x86);
- OFFSET(CPUINFO_x86_vendor, cpuinfo_x86, x86_vendor);
- OFFSET(CPUINFO_x86_model, cpuinfo_x86, x86_model);
- OFFSET(CPUINFO_x86_mask, cpuinfo_x86, x86_mask);
- OFFSET(CPUINFO_hard_math, cpuinfo_x86, hard_math);
- OFFSET(CPUINFO_cpuid_level, cpuinfo_x86, cpuid_level);
- OFFSET(CPUINFO_x86_capability, cpuinfo_x86, x86_capability);
- OFFSET(CPUINFO_x86_vendor_id, cpuinfo_x86, x86_vendor_id);
- BLANK();
-
- OFFSET(TI_task, thread_info, task);
- OFFSET(TI_exec_domain, thread_info, exec_domain);
- OFFSET(TI_flags, thread_info, flags);
- OFFSET(TI_status, thread_info, status);
- OFFSET(TI_preempt_count, thread_info, preempt_count);
- OFFSET(TI_addr_limit, thread_info, addr_limit);
- OFFSET(TI_restart_block, thread_info, restart_block);
- OFFSET(TI_sysenter_return, thread_info, sysenter_return);
- OFFSET(TI_cpu, thread_info, cpu);
- BLANK();
-
- OFFSET(GDS_size, desc_ptr, size);
- OFFSET(GDS_address, desc_ptr, address);
- BLANK();
-
OFFSET(PT_EBX, pt_regs, bx);
OFFSET(PT_ECX, pt_regs, cx);
OFFSET(PT_EDX, pt_regs, dx);
@@ -85,67 +31,19 @@ void foo(void)
OFFSET(PT_OLDSS, pt_regs, ss);
BLANK();
- OFFSET(EXEC_DOMAIN_handler, exec_domain, handler);
- OFFSET(IA32_RT_SIGFRAME_sigcontext, rt_sigframe, uc.uc_mcontext);
- BLANK();
-
- OFFSET(pbe_address, pbe, address);
- OFFSET(pbe_orig_address, pbe, orig_address);
- OFFSET(pbe_next, pbe, next);
-
- /* Offset from the sysenter stack to tss.sp0 */
- DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) -
- sizeof(struct tss_struct));
-
- DEFINE(PAGE_SIZE_asm, PAGE_SIZE);
- DEFINE(PAGE_SHIFT_asm, PAGE_SHIFT);
- DEFINE(PTRS_PER_PTE, PTRS_PER_PTE);
- DEFINE(PTRS_PER_PMD, PTRS_PER_PMD);
- DEFINE(PTRS_PER_PGD, PTRS_PER_PGD);
-
- OFFSET(crypto_tfm_ctx_offset, crypto_tfm, __crt_ctx);
-
-#ifdef CONFIG_PARAVIRT
+ OFFSET(saved_context_gdt_desc, saved_context, gdt_desc);
BLANK();
- OFFSET(PARAVIRT_enabled, pv_info, paravirt_enabled);
- OFFSET(PARAVIRT_PATCH_pv_cpu_ops, paravirt_patch_template, pv_cpu_ops);
- OFFSET(PARAVIRT_PATCH_pv_irq_ops, paravirt_patch_template, pv_irq_ops);
- OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
- OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
- OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
- OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
- OFFSET(PV_CPU_read_cr0, pv_cpu_ops, read_cr0);
-#endif
-#ifdef CONFIG_XEN
- BLANK();
- OFFSET(XEN_vcpu_info_mask, vcpu_info, evtchn_upcall_mask);
- OFFSET(XEN_vcpu_info_pending, vcpu_info, evtchn_upcall_pending);
-#endif
-
-#if defined(CONFIG_LGUEST) || defined(CONFIG_LGUEST_GUEST) || defined(CONFIG_LGUEST_MODULE)
- BLANK();
- OFFSET(LGUEST_DATA_irq_enabled, lguest_data, irq_enabled);
- OFFSET(LGUEST_DATA_irq_pending, lguest_data, irq_pending);
- OFFSET(LGUEST_DATA_pgdir, lguest_data, pgdir);
-
- BLANK();
- OFFSET(LGUEST_PAGES_host_gdt_desc, lguest_pages, state.host_gdt_desc);
- OFFSET(LGUEST_PAGES_host_idt_desc, lguest_pages, state.host_idt_desc);
- OFFSET(LGUEST_PAGES_host_cr3, lguest_pages, state.host_cr3);
- OFFSET(LGUEST_PAGES_host_sp, lguest_pages, state.host_sp);
- OFFSET(LGUEST_PAGES_guest_gdt_desc, lguest_pages,state.guest_gdt_desc);
- OFFSET(LGUEST_PAGES_guest_idt_desc, lguest_pages,state.guest_idt_desc);
- OFFSET(LGUEST_PAGES_guest_gdt, lguest_pages, state.guest_gdt);
- OFFSET(LGUEST_PAGES_regs_trapnum, lguest_pages, regs.trapnum);
- OFFSET(LGUEST_PAGES_regs_errcode, lguest_pages, regs.errcode);
- OFFSET(LGUEST_PAGES_regs, lguest_pages, regs);
-#endif
+ /*
+ * Offset from the entry stack to task stack stored in TSS. Kernel entry
+ * happens on the per-cpu entry-stack, and the asm code switches to the
+ * task-stack pointer stored in x86_tss.sp1, which is a copy of
+ * task->thread.sp0 where entry code can find it.
+ */
+ DEFINE(TSS_entry2task_stack,
+ offsetof(struct cpu_entry_area, tss.x86_tss.sp1) -
+ offsetofend(struct cpu_entry_area, entry_stack_page.stack));
BLANK();
- OFFSET(BP_scratch, boot_params, scratch);
- OFFSET(BP_loadflags, boot_params, hdr.loadflags);
- OFFSET(BP_hardware_subarch, boot_params, hdr.hardware_subarch);
- OFFSET(BP_version, boot_params, hdr.version);
- OFFSET(BP_kernel_alignment, boot_params, hdr.kernel_alignment);
+ DEFINE(EFI_svam, offsetof(efi_runtime_services_t, set_virtual_address_map));
}
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 4a6aeedcd965..590b6cd0eac0 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -1,94 +1,31 @@
-/*
- * Generate definitions needed by assembly language modules.
- * This code generates raw asm output which is post-processed to extract
- * and format the required data.
- */
-#define COMPILE_OFFSETS
+// SPDX-License-Identifier: GPL-2.0
+#ifndef __LINUX_KBUILD_H
+# error "Please do not build this file directly, build asm-offsets.c instead"
+#endif
-#include <linux/crypto.h>
-#include <linux/sched.h>
-#include <linux/stddef.h>
-#include <linux/errno.h>
-#include <linux/hardirq.h>
-#include <linux/suspend.h>
-#include <linux/kbuild.h>
-#include <asm/processor.h>
-#include <asm/segment.h>
-#include <asm/thread_info.h>
#include <asm/ia32.h>
-#include <asm/bootparam.h>
-#include <asm/suspend.h>
-
-#include <xen/interface/xen.h>
-
-#include <asm/sigframe.h>
-#define __NO_STUBS 1
-#undef __SYSCALL
-#undef _ASM_X86_UNISTD_64_H
-#define __SYSCALL(nr, sym) [nr] = 1,
-static char syscalls[] = {
-#include <asm/unistd.h>
-};
+#if defined(CONFIG_KVM_GUEST)
+#include <asm/kvm_para.h>
+#endif
int main(void)
{
-#define ENTRY(entry) DEFINE(tsk_ ## entry, offsetof(struct task_struct, entry))
- ENTRY(state);
- ENTRY(flags);
- ENTRY(pid);
- BLANK();
-#undef ENTRY
-#define ENTRY(entry) DEFINE(TI_ ## entry, offsetof(struct thread_info, entry))
- ENTRY(flags);
- ENTRY(addr_limit);
- ENTRY(preempt_count);
- ENTRY(status);
-#ifdef CONFIG_IA32_EMULATION
- ENTRY(sysenter_return);
-#endif
- BLANK();
-#undef ENTRY
#ifdef CONFIG_PARAVIRT
+#ifdef CONFIG_PARAVIRT_XXL
+#ifdef CONFIG_DEBUG_ENTRY
+ OFFSET(PV_IRQ_save_fl, paravirt_patch_template, irq.save_fl);
+#endif
+#endif
BLANK();
- OFFSET(PARAVIRT_enabled, pv_info, paravirt_enabled);
- OFFSET(PARAVIRT_PATCH_pv_cpu_ops, paravirt_patch_template, pv_cpu_ops);
- OFFSET(PARAVIRT_PATCH_pv_irq_ops, paravirt_patch_template, pv_irq_ops);
- OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
- OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
- OFFSET(PV_IRQ_adjust_exception_frame, pv_irq_ops, adjust_exception_frame);
- OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
- OFFSET(PV_CPU_usergs_sysret32, pv_cpu_ops, usergs_sysret32);
- OFFSET(PV_CPU_usergs_sysret64, pv_cpu_ops, usergs_sysret64);
- OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
- OFFSET(PV_CPU_swapgs, pv_cpu_ops, swapgs);
- OFFSET(PV_MMU_read_cr2, pv_mmu_ops, read_cr2);
#endif
-
-#ifdef CONFIG_IA32_EMULATION
-#define ENTRY(entry) DEFINE(IA32_SIGCONTEXT_ ## entry, offsetof(struct sigcontext_ia32, entry))
- ENTRY(ax);
- ENTRY(bx);
- ENTRY(cx);
- ENTRY(dx);
- ENTRY(si);
- ENTRY(di);
- ENTRY(bp);
- ENTRY(sp);
- ENTRY(ip);
- BLANK();
-#undef ENTRY
- DEFINE(IA32_RT_SIGFRAME_sigcontext,
- offsetof (struct rt_sigframe_ia32, uc.uc_mcontext));
+#if defined(CONFIG_KVM_GUEST)
+ OFFSET(KVM_STEAL_TIME_preempted, kvm_steal_time, preempted);
BLANK();
#endif
- DEFINE(pbe_address, offsetof(struct pbe, address));
- DEFINE(pbe_orig_address, offsetof(struct pbe, orig_address));
- DEFINE(pbe_next, offsetof(struct pbe, next));
- BLANK();
-#define ENTRY(entry) DEFINE(pt_regs_ ## entry, offsetof(struct pt_regs, entry))
- ENTRY(bx);
+
+#define ENTRY(entry) OFFSET(pt_regs_ ## entry, pt_regs, entry)
ENTRY(bx);
ENTRY(cx);
ENTRY(dx);
@@ -107,34 +44,15 @@ int main(void)
ENTRY(flags);
BLANK();
#undef ENTRY
-#define ENTRY(entry) DEFINE(saved_context_ ## entry, offsetof(struct saved_context, entry))
+
+#define ENTRY(entry) OFFSET(saved_context_ ## entry, saved_context, entry)
ENTRY(cr0);
ENTRY(cr2);
ENTRY(cr3);
ENTRY(cr4);
- ENTRY(cr8);
+ ENTRY(gdt_desc);
BLANK();
#undef ENTRY
- DEFINE(TSS_ist, offsetof(struct tss_struct, x86_tss.ist));
- BLANK();
- DEFINE(crypto_tfm_ctx_offset, offsetof(struct crypto_tfm, __crt_ctx));
- BLANK();
- DEFINE(__NR_syscall_max, sizeof(syscalls) - 1);
-
- BLANK();
- OFFSET(BP_scratch, boot_params, scratch);
- OFFSET(BP_loadflags, boot_params, hdr.loadflags);
- OFFSET(BP_hardware_subarch, boot_params, hdr.hardware_subarch);
- OFFSET(BP_version, boot_params, hdr.version);
- OFFSET(BP_kernel_alignment, boot_params, hdr.kernel_alignment);
- BLANK();
- DEFINE(PAGE_SIZE_asm, PAGE_SIZE);
-#ifdef CONFIG_XEN
- BLANK();
- OFFSET(XEN_vcpu_info_mask, vcpu_info, evtchn_upcall_mask);
- OFFSET(XEN_vcpu_info_pending, vcpu_info, evtchn_upcall_pending);
-#undef ENTRY
-#endif
return 0;
}
diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
index 06d3e5a14d9d..190c120f4285 100644
--- a/arch/x86/kernel/audit_64.c
+++ b/arch/x86/kernel/audit_64.c
@@ -1,7 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/init.h>
#include <linux/types.h>
#include <linux/audit.h>
#include <asm/unistd.h>
+#include <asm/audit.h>
static unsigned dir_class[] = {
#include <asm-generic/audit_dir_write.h>
@@ -40,30 +42,27 @@ int audit_classify_arch(int arch)
int audit_classify_syscall(int abi, unsigned syscall)
{
#ifdef CONFIG_IA32_EMULATION
- extern int ia32_classify_syscall(unsigned);
if (abi == AUDIT_ARCH_I386)
return ia32_classify_syscall(syscall);
#endif
switch(syscall) {
case __NR_open:
- return 2;
+ return AUDITSC_OPEN;
case __NR_openat:
- return 3;
+ return AUDITSC_OPENAT;
case __NR_execve:
- return 5;
+ case __NR_execveat:
+ return AUDITSC_EXECVE;
+ case __NR_openat2:
+ return AUDITSC_OPENAT2;
default:
- return 0;
+ return AUDITSC_NATIVE;
}
}
static int __init audit_classes_init(void)
{
#ifdef CONFIG_IA32_EMULATION
- extern __u32 ia32_dir_class[];
- extern __u32 ia32_write_class[];
- extern __u32 ia32_read_class[];
- extern __u32 ia32_chattr_class[];
- extern __u32 ia32_signal_class[];
audit_register_class(AUDIT_CLASS_WRITE_32, ia32_write_class);
audit_register_class(AUDIT_CLASS_READ_32, ia32_read_class);
audit_register_class(AUDIT_CLASS_DIR_WRITE_32, ia32_dir_class);
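
The bare 2/3/5 return values give way to named classes; the shared enum in include/linux/audit.h looks roughly like this (a sketch, not a verbatim quote):

enum auditsc_class_t {
	AUDITSC_NATIVE = 0,
	AUDITSC_COMPAT,
	AUDITSC_OPEN,
	AUDITSC_OPENAT,
	AUDITSC_SOCKETCALL,
	AUDITSC_EXECVE,
	AUDITSC_OPENAT2,

	AUDITSC_NVALS /* count */
};
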
diff --git a/arch/x86/kernel/bios_uv.c b/arch/x86/kernel/bios_uv.c
deleted file mode 100644
index 8bc57baaa9ad..000000000000
--- a/arch/x86/kernel/bios_uv.c
+++ /dev/null
@@ -1,215 +0,0 @@
-/*
- * BIOS run time interface routines.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- *
- * Copyright (c) 2008-2009 Silicon Graphics, Inc. All Rights Reserved.
- * Copyright (c) Russ Anderson <rja@sgi.com>
- */
-
-#include <linux/efi.h>
-#include <asm/efi.h>
-#include <linux/io.h>
-#include <asm/uv/bios.h>
-#include <asm/uv/uv_hub.h>
-
-static struct uv_systab uv_systab;
-
-s64 uv_bios_call(enum uv_bios_cmd which, u64 a1, u64 a2, u64 a3, u64 a4, u64 a5)
-{
- struct uv_systab *tab = &uv_systab;
- s64 ret;
-
- if (!tab->function)
- /*
- * BIOS does not support UV systab
- */
- return BIOS_STATUS_UNIMPLEMENTED;
-
- ret = efi_call6((void *)__va(tab->function), (u64)which,
- a1, a2, a3, a4, a5);
- return ret;
-}
-EXPORT_SYMBOL_GPL(uv_bios_call);
-
-s64 uv_bios_call_irqsave(enum uv_bios_cmd which, u64 a1, u64 a2, u64 a3,
- u64 a4, u64 a5)
-{
- unsigned long bios_flags;
- s64 ret;
-
- local_irq_save(bios_flags);
- ret = uv_bios_call(which, a1, a2, a3, a4, a5);
- local_irq_restore(bios_flags);
-
- return ret;
-}
-
-s64 uv_bios_call_reentrant(enum uv_bios_cmd which, u64 a1, u64 a2, u64 a3,
- u64 a4, u64 a5)
-{
- s64 ret;
-
- preempt_disable();
- ret = uv_bios_call(which, a1, a2, a3, a4, a5);
- preempt_enable();
-
- return ret;
-}
-
-
-long sn_partition_id;
-EXPORT_SYMBOL_GPL(sn_partition_id);
-long sn_coherency_id;
-EXPORT_SYMBOL_GPL(sn_coherency_id);
-long sn_region_size;
-EXPORT_SYMBOL_GPL(sn_region_size);
-long system_serial_number;
-EXPORT_SYMBOL_GPL(system_serial_number);
-int uv_type;
-EXPORT_SYMBOL_GPL(uv_type);
-
-
-s64 uv_bios_get_sn_info(int fc, int *uvtype, long *partid, long *coher,
- long *region, long *ssn)
-{
- s64 ret;
- u64 v0, v1;
- union partition_info_u part;
-
- ret = uv_bios_call_irqsave(UV_BIOS_GET_SN_INFO, fc,
- (u64)(&v0), (u64)(&v1), 0, 0);
- if (ret != BIOS_STATUS_SUCCESS)
- return ret;
-
- part.val = v0;
- if (uvtype)
- *uvtype = part.hub_version;
- if (partid)
- *partid = part.partition_id;
- if (coher)
- *coher = part.coherence_id;
- if (region)
- *region = part.region_size;
- if (ssn)
- *ssn = v1;
- return ret;
-}
-EXPORT_SYMBOL_GPL(uv_bios_get_sn_info);
-
-int
-uv_bios_mq_watchlist_alloc(unsigned long addr, unsigned int mq_size,
- unsigned long *intr_mmr_offset)
-{
- u64 watchlist;
- s64 ret;
-
- /*
- * bios returns watchlist number or negative error number.
- */
- ret = (int)uv_bios_call_irqsave(UV_BIOS_WATCHLIST_ALLOC, addr,
- mq_size, (u64)intr_mmr_offset,
- (u64)&watchlist, 0);
- if (ret < BIOS_STATUS_SUCCESS)
- return ret;
-
- return watchlist;
-}
-EXPORT_SYMBOL_GPL(uv_bios_mq_watchlist_alloc);
-
-int
-uv_bios_mq_watchlist_free(int blade, int watchlist_num)
-{
- return (int)uv_bios_call_irqsave(UV_BIOS_WATCHLIST_FREE,
- blade, watchlist_num, 0, 0, 0);
-}
-EXPORT_SYMBOL_GPL(uv_bios_mq_watchlist_free);
-
-s64
-uv_bios_change_memprotect(u64 paddr, u64 len, enum uv_memprotect perms)
-{
- return uv_bios_call_irqsave(UV_BIOS_MEMPROTECT, paddr, len,
- perms, 0, 0);
-}
-EXPORT_SYMBOL_GPL(uv_bios_change_memprotect);
-
-s64
-uv_bios_reserved_page_pa(u64 buf, u64 *cookie, u64 *addr, u64 *len)
-{
- s64 ret;
-
- ret = uv_bios_call_irqsave(UV_BIOS_GET_PARTITION_ADDR, (u64)cookie,
- (u64)addr, buf, (u64)len, 0);
- return ret;
-}
-EXPORT_SYMBOL_GPL(uv_bios_reserved_page_pa);
-
-s64 uv_bios_freq_base(u64 clock_type, u64 *ticks_per_second)
-{
- return uv_bios_call(UV_BIOS_FREQ_BASE, clock_type,
- (u64)ticks_per_second, 0, 0, 0);
-}
-EXPORT_SYMBOL_GPL(uv_bios_freq_base);
-
-/*
- * uv_bios_set_legacy_vga_target - Set Legacy VGA I/O Target
- * @decode: true to enable target, false to disable target
- * @domain: PCI domain number
- * @bus: PCI bus number
- *
- * Returns:
- * 0: Success
- * -EINVAL: Invalid domain or bus number
- * -ENOSYS: Capability not available
- * -EBUSY: Legacy VGA I/O cannot be retargeted at this time
- */
-int uv_bios_set_legacy_vga_target(bool decode, int domain, int bus)
-{
- return uv_bios_call(UV_BIOS_SET_LEGACY_VGA_TARGET,
- (u64)decode, (u64)domain, (u64)bus, 0, 0);
-}
-EXPORT_SYMBOL_GPL(uv_bios_set_legacy_vga_target);
-
-
-#ifdef CONFIG_EFI
-void uv_bios_init(void)
-{
- struct uv_systab *tab;
-
- if ((efi.uv_systab == EFI_INVALID_TABLE_ADDR) ||
- (efi.uv_systab == (unsigned long)NULL)) {
- printk(KERN_CRIT "No EFI UV System Table.\n");
- uv_systab.function = (unsigned long)NULL;
- return;
- }
-
- tab = (struct uv_systab *)ioremap(efi.uv_systab,
- sizeof(struct uv_systab));
- if (strncmp(tab->signature, "UVST", 4) != 0)
- printk(KERN_ERR "bad signature in UV system table!");
-
- /*
- * Copy table to permanent spot for later use.
- */
- memcpy(&uv_systab, tab, sizeof(struct uv_systab));
- iounmap(tab);
-
- printk(KERN_INFO "EFI UV System Table Revision %d\n",
- uv_systab.revision);
-}
-#else /* !CONFIG_EFI */
-
-void uv_bios_init(void) { }
-#endif
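
The deleted uv_bios_get_sn_info() shows a pattern worth keeping in mind: one 64-bit BIOS return value is unpacked through a value/bitfield union (partition_info_u). A runnable sketch of the idiom — the field layout below is illustrative only, not the real UV ABI:

#include <stdint.h>
#include <stdio.h>

/* Illustrative overlay; the real layout lives in asm/uv/bios.h. */
union partition_info_u {
	uint64_t val;
	struct {
		uint64_t hub_version  : 8;
		uint64_t partition_id : 16;
		uint64_t coherence_id : 16;
		uint64_t region_size  : 24;
	};
};

int main(void)
{
	union partition_info_u part = { .val = 0x203 };  /* made-up value */

	printf("hub %u part %u\n",
	       (unsigned)part.hub_version, (unsigned)part.partition_id);
	return 0;
}
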
diff --git a/arch/x86/kernel/bootflag.c b/arch/x86/kernel/bootflag.c
index 5de7f4c56971..73274d76ce16 100644
--- a/arch/x86/kernel/bootflag.c
+++ b/arch/x86/kernel/bootflag.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Implement 'Simple Boot Flag Specification 2.0'
*/
@@ -7,6 +8,7 @@
#include <linux/string.h>
#include <linux/spinlock.h>
#include <linux/acpi.h>
+#include <linux/bitops.h>
#include <asm/io.h>
#include <linux/mc146818rtc.h>
@@ -19,27 +21,13 @@
int sbf_port __initdata = -1; /* set via acpi_boot_init() */
-static int __init parity(u8 v)
-{
- int x = 0;
- int i;
-
- for (i = 0; i < 8; i++) {
- x ^= (v & 1);
- v >>= 1;
- }
-
- return x;
-}
-
static void __init sbf_write(u8 v)
{
unsigned long flags;
if (sbf_port != -1) {
- v &= ~SBF_PARITY;
- if (!parity(v))
- v |= SBF_PARITY;
+ if (!parity8(v))
+ v ^= SBF_PARITY;
printk(KERN_INFO "Simple Boot Flag at 0x%x set to 0x%x\n",
sbf_port, v);
@@ -65,14 +53,14 @@ static u8 __init sbf_read(void)
return v;
}
-static int __init sbf_value_valid(u8 v)
+static bool __init sbf_value_valid(u8 v)
{
if (v & SBF_RESERVED) /* Reserved bits */
- return 0;
- if (!parity(v))
- return 0;
+ return false;
+ if (!parity8(v))
+ return false;
- return 1;
+ return true;
}
static int __init sbf_init(void)
@@ -98,4 +86,4 @@ static int __init sbf_init(void)
return 0;
}
-module_init(sbf_init);
+arch_initcall(sbf_init);
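
Two things changed in sbf_write(): the open-coded loop became parity8() from <linux/bitops.h>, and the clear-then-set sequence became a single conditional XOR, which works because flipping SBF_PARITY always toggles the overall parity: if parity over the whole byte (parity bit included) is even, one flip makes it odd. The helper itself is a classic fold-and-lookup; a runnable sketch mirroring the kernel implementation:

#include <assert.h>
#include <stdint.h>

/* Fold 8 bits to 4 by XOR, then index 0x6996, whose bit n is the
 * parity of n — the same trick as parity8() in <linux/bitops.h>. */
static int parity8(uint8_t v)
{
	return (0x6996 >> (((v >> 4) ^ v) & 0xf)) & 1;
}

int main(void)
{
	assert(parity8(0x96) == 0);	/* four bits set: even */
	assert(parity8(0x01) == 1);	/* one bit set: odd */
	return 0;
}
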
diff --git a/arch/x86/kernel/callthunks.c b/arch/x86/kernel/callthunks.c
new file mode 100644
index 000000000000..a951333c5995
--- /dev/null
+++ b/arch/x86/kernel/callthunks.c
@@ -0,0 +1,383 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#define pr_fmt(fmt) "callthunks: " fmt
+
+#include <linux/debugfs.h>
+#include <linux/kallsyms.h>
+#include <linux/memory.h>
+#include <linux/moduleloader.h>
+#include <linux/static_call.h>
+
+#include <asm/alternative.h>
+#include <asm/asm-offsets.h>
+#include <asm/cpu.h>
+#include <asm/ftrace.h>
+#include <asm/insn.h>
+#include <asm/kexec.h>
+#include <asm/nospec-branch.h>
+#include <asm/paravirt.h>
+#include <asm/sections.h>
+#include <asm/switch_to.h>
+#include <asm/sync_core.h>
+#include <asm/text-patching.h>
+#include <asm/xen/hypercall.h>
+
+static int __initdata_or_module debug_callthunks;
+
+#define MAX_PATCH_LEN (255-1)
+
+#define prdbg(fmt, args...) \
+do { \
+ if (debug_callthunks) \
+ printk(KERN_DEBUG pr_fmt(fmt), ##args); \
+} while(0)
+
+static int __init debug_thunks(char *str)
+{
+ debug_callthunks = 1;
+ return 1;
+}
+__setup("debug-callthunks", debug_thunks);
+
+#ifdef CONFIG_CALL_THUNKS_DEBUG
+DEFINE_PER_CPU(u64, __x86_call_count);
+DEFINE_PER_CPU(u64, __x86_ret_count);
+DEFINE_PER_CPU(u64, __x86_stuffs_count);
+DEFINE_PER_CPU(u64, __x86_ctxsw_count);
+EXPORT_PER_CPU_SYMBOL_GPL(__x86_ctxsw_count);
+EXPORT_PER_CPU_SYMBOL_GPL(__x86_call_count);
+#endif
+
+extern s32 __call_sites[], __call_sites_end[];
+
+struct core_text {
+ unsigned long base;
+ unsigned long end;
+ const char *name;
+};
+
+static bool thunks_initialized __ro_after_init;
+
+static const struct core_text builtin_coretext = {
+ .base = (unsigned long)_text,
+ .end = (unsigned long)_etext,
+ .name = "builtin",
+};
+
+asm (
+ ".pushsection .rodata \n"
+ ".global skl_call_thunk_template \n"
+ "skl_call_thunk_template: \n"
+ __stringify(INCREMENT_CALL_DEPTH)" \n"
+ ".global skl_call_thunk_tail \n"
+ "skl_call_thunk_tail: \n"
+ ".popsection \n"
+);
+
+extern u8 skl_call_thunk_template[];
+extern u8 skl_call_thunk_tail[];
+
+#define SKL_TMPL_SIZE \
+ ((unsigned int)(skl_call_thunk_tail - skl_call_thunk_template))
+
+extern void error_entry(void);
+extern void xen_error_entry(void);
+extern void paranoid_entry(void);
+
+static inline bool within_coretext(const struct core_text *ct, void *addr)
+{
+ unsigned long p = (unsigned long)addr;
+
+ return ct->base <= p && p < ct->end;
+}
+
+static inline bool within_module_coretext(void *addr)
+{
+ bool ret = false;
+
+#ifdef CONFIG_MODULES
+ struct module *mod;
+
+ guard(rcu)();
+ mod = __module_address((unsigned long)addr);
+ if (mod && within_module_core((unsigned long)addr, mod))
+ ret = true;
+#endif
+ return ret;
+}
+
+static bool is_coretext(const struct core_text *ct, void *addr)
+{
+ if (ct && within_coretext(ct, addr))
+ return true;
+ if (within_coretext(&builtin_coretext, addr))
+ return true;
+ return within_module_coretext(addr);
+}
+
+static bool skip_addr(void *dest)
+{
+ if (dest == error_entry)
+ return true;
+ if (dest == paranoid_entry)
+ return true;
+ if (dest == xen_error_entry)
+ return true;
+ /* Does FILL_RSB... */
+ if (dest == __switch_to_asm)
+ return true;
+ /* Accounts directly */
+ if (dest == ret_from_fork)
+ return true;
+#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_AMD_MEM_ENCRYPT)
+ if (dest == soft_restart_cpu)
+ return true;
+#endif
+#ifdef CONFIG_FUNCTION_TRACER
+ if (dest == __fentry__)
+ return true;
+#endif
+#ifdef CONFIG_KEXEC_CORE
+# ifdef CONFIG_X86_64
+ if (dest >= (void *)__relocate_kernel_start &&
+ dest < (void *)__relocate_kernel_end)
+ return true;
+# else
+ if (dest >= (void *)relocate_kernel &&
+ dest < (void*)relocate_kernel + KEXEC_CONTROL_CODE_MAX_SIZE)
+ return true;
+# endif
+#endif
+ return false;
+}
+
+static __init_or_module void *call_get_dest(void *addr)
+{
+ struct insn insn;
+ void *dest;
+ int ret;
+
+ ret = insn_decode_kernel(&insn, addr);
+ if (ret)
+ return ERR_PTR(ret);
+
+ /* Patched out call? */
+ if (insn.opcode.bytes[0] != CALL_INSN_OPCODE)
+ return NULL;
+
+ dest = addr + insn.length + insn.immediate.value;
+ if (skip_addr(dest))
+ return NULL;
+ return dest;
+}
+
+static const u8 nops[] = {
+ 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,
+ 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,
+ 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,
+ 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,
+};
+
+static void *patch_dest(void *dest, bool direct)
+{
+ unsigned int tsize = SKL_TMPL_SIZE;
+ u8 insn_buff[MAX_PATCH_LEN];
+ u8 *pad = dest - tsize;
+
+ memcpy(insn_buff, skl_call_thunk_template, tsize);
+ text_poke_apply_relocation(insn_buff, pad, tsize, skl_call_thunk_template, tsize);
+
+ /* Already patched? */
+ if (!bcmp(pad, insn_buff, tsize))
+ return pad;
+
+ /* Ensure there are nops */
+ if (bcmp(pad, nops, tsize)) {
+ pr_warn_once("Invalid padding area for %pS\n", dest);
+ return NULL;
+ }
+
+ if (direct)
+ memcpy(pad, insn_buff, tsize);
+ else
+ text_poke_copy_locked(pad, insn_buff, tsize, true);
+ return pad;
+}
+
+static __init_or_module void patch_call(void *addr, const struct core_text *ct)
+{
+ void *pad, *dest;
+ u8 bytes[8];
+
+ if (!within_coretext(ct, addr))
+ return;
+
+ dest = call_get_dest(addr);
+ if (!dest || WARN_ON_ONCE(IS_ERR(dest)))
+ return;
+
+ if (!is_coretext(ct, dest))
+ return;
+
+ pad = patch_dest(dest, within_coretext(ct, dest));
+ if (!pad)
+ return;
+
+ prdbg("Patch call at: %pS %px to %pS %px -> %px \n", addr, addr,
+ dest, dest, pad);
+ __text_gen_insn(bytes, CALL_INSN_OPCODE, addr, pad, CALL_INSN_SIZE);
+ text_poke_early(addr, bytes, CALL_INSN_SIZE);
+}
+
+static __init_or_module void
+patch_call_sites(s32 *start, s32 *end, const struct core_text *ct)
+{
+ s32 *s;
+
+ for (s = start; s < end; s++)
+ patch_call((void *)s + *s, ct);
+}
+
+static __init_or_module void
+callthunks_setup(struct callthunk_sites *cs, const struct core_text *ct)
+{
+ prdbg("Patching call sites %s\n", ct->name);
+ patch_call_sites(cs->call_start, cs->call_end, ct);
+ prdbg("Patching call sites done%s\n", ct->name);
+}
+
+void __init callthunks_patch_builtin_calls(void)
+{
+ struct callthunk_sites cs = {
+ .call_start = __call_sites,
+ .call_end = __call_sites_end,
+ };
+
+ if (!cpu_feature_enabled(X86_FEATURE_CALL_DEPTH))
+ return;
+
+ pr_info("Setting up call depth tracking\n");
+ mutex_lock(&text_mutex);
+ callthunks_setup(&cs, &builtin_coretext);
+ thunks_initialized = true;
+ mutex_unlock(&text_mutex);
+}
+
+void *callthunks_translate_call_dest(void *dest)
+{
+ void *target;
+
+ lockdep_assert_held(&text_mutex);
+
+ if (!thunks_initialized || skip_addr(dest))
+ return dest;
+
+ if (!is_coretext(NULL, dest))
+ return dest;
+
+ target = patch_dest(dest, false);
+ return target ? : dest;
+}
+
+#ifdef CONFIG_BPF_JIT
+static bool is_callthunk(void *addr)
+{
+ unsigned int tmpl_size = SKL_TMPL_SIZE;
+ u8 insn_buff[MAX_PATCH_LEN];
+ unsigned long dest;
+ u8 *pad;
+
+ dest = roundup((unsigned long)addr, CONFIG_FUNCTION_ALIGNMENT);
+ if (!thunks_initialized || skip_addr((void *)dest))
+ return false;
+
+ pad = (void *)(dest - tmpl_size);
+
+ memcpy(insn_buff, skl_call_thunk_template, tmpl_size);
+ text_poke_apply_relocation(insn_buff, pad, tmpl_size, skl_call_thunk_template, tmpl_size);
+
+ return !bcmp(pad, insn_buff, tmpl_size);
+}
+
+int x86_call_depth_emit_accounting(u8 **pprog, void *func, void *ip)
+{
+ unsigned int tmpl_size = SKL_TMPL_SIZE;
+ u8 insn_buff[MAX_PATCH_LEN];
+
+ if (!thunks_initialized)
+ return 0;
+
+ /* Is function call target a thunk? */
+ if (func && is_callthunk(func))
+ return 0;
+
+ memcpy(insn_buff, skl_call_thunk_template, tmpl_size);
+ text_poke_apply_relocation(insn_buff, ip, tmpl_size, skl_call_thunk_template, tmpl_size);
+
+ memcpy(*pprog, insn_buff, tmpl_size);
+ *pprog += tmpl_size;
+ return tmpl_size;
+}
+#endif
+
+#ifdef CONFIG_MODULES
+void noinline callthunks_patch_module_calls(struct callthunk_sites *cs,
+ struct module *mod)
+{
+ struct core_text ct = {
+ .base = (unsigned long)mod->mem[MOD_TEXT].base,
+ .end = (unsigned long)mod->mem[MOD_TEXT].base + mod->mem[MOD_TEXT].size,
+ .name = mod->name,
+ };
+
+ if (!thunks_initialized)
+ return;
+
+ mutex_lock(&text_mutex);
+ callthunks_setup(cs, &ct);
+ mutex_unlock(&text_mutex);
+}
+#endif /* CONFIG_MODULES */
+
+#if defined(CONFIG_CALL_THUNKS_DEBUG) && defined(CONFIG_DEBUG_FS)
+static int callthunks_debug_show(struct seq_file *m, void *p)
+{
+ unsigned long cpu = (unsigned long)m->private;
+
+ seq_printf(m, "C: %16llu R: %16llu S: %16llu X: %16llu\n,",
+ per_cpu(__x86_call_count, cpu),
+ per_cpu(__x86_ret_count, cpu),
+ per_cpu(__x86_stuffs_count, cpu),
+ per_cpu(__x86_ctxsw_count, cpu));
+ return 0;
+}
+
+static int callthunks_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, callthunks_debug_show, inode->i_private);
+}
+
+static const struct file_operations dfs_ops = {
+ .open = callthunks_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init callthunks_debugfs_init(void)
+{
+ struct dentry *dir;
+ unsigned long cpu;
+
+ dir = debugfs_create_dir("callthunks", NULL);
+ for_each_possible_cpu(cpu) {
+ void *arg = (void *)cpu;
+		char name[10];
+
+ sprintf(name, "cpu%lu", cpu);
+ debugfs_create_file(name, 0644, dir, arg, &dfs_ops);
+ }
+ return 0;
+}
+__initcall(callthunks_debugfs_init);
+#endif
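
One non-obvious detail in patch_call_sites(): __call_sites[] stores self-relative s32 entries, so each call site is recovered as (void *)s + *s, keeping the table small and relocation-free. A userspace toy of the arithmetic (all names made up):

#include <stdint.h>
#include <stdio.h>

/* Each slot stores (target - &slot); adding the slot's own address
 * back yields the target again. */
static int32_t make_rel(const void *target, const int32_t *slot)
{
	return (int32_t)((intptr_t)target - (intptr_t)slot);
}

int main(void)
{
	static char text[64];
	static int32_t table[1];

	table[0] = make_rel(&text[10], &table[0]);

	void *site = (char *)&table[0] + table[0];
	printf("recovered %p, expected %p\n", site, (void *)&text[10]);
	return 0;
}
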
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
new file mode 100644
index 000000000000..99444409c026
--- /dev/null
+++ b/arch/x86/kernel/cet.c
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/ptrace.h>
+#include <asm/bugs.h>
+#include <asm/msr.h>
+#include <asm/traps.h>
+
+enum cp_error_code {
+ CP_EC = (1 << 15) - 1,
+
+ CP_RET = 1,
+ CP_IRET = 2,
+ CP_ENDBR = 3,
+ CP_RSTRORSSP = 4,
+ CP_SETSSBSY = 5,
+
+ CP_ENCL = 1 << 15,
+};
+
+static const char cp_err[][10] = {
+ [0] = "unknown",
+ [1] = "near ret",
+ [2] = "far/iret",
+ [3] = "endbranch",
+ [4] = "rstorssp",
+ [5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
+{
+ unsigned int cpec = error_code & CP_EC;
+
+ if (cpec >= ARRAY_SIZE(cp_err))
+ cpec = 0;
+ return cp_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+ WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+ user_mode(regs) ? "user mode" : "kernel mode",
+ cp_err_string(error_code));
+}
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ struct task_struct *tsk;
+ unsigned long ssp;
+
+ /*
+ * An exception was just taken from userspace. Since interrupts are disabled
+ * here, no scheduling should have messed with the registers yet and they
+ * will be whatever is live in userspace. So read the SSP before enabling
+ * interrupts so locking the fpregs to do it later is not required.
+ */
+ rdmsrq(MSR_IA32_PL3_SSP, ssp);
+
+ cond_local_irq_enable(regs);
+
+ tsk = current;
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_CP;
+
+ /* Ratelimit to prevent log spamming. */
+ if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+ __ratelimit(&cpf_rate)) {
+ pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+ tsk->comm, task_pid_nr(tsk),
+ regs->ip, regs->sp, ssp, error_code,
+ cp_err_string(error_code),
+ error_code & CP_ENCL ? " in enclave" : "");
+ print_vma_addr(KERN_CONT " in ", regs->ip);
+ pr_cont("\n");
+ }
+
+ force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+ cond_local_irq_disable(regs);
+}
+
+static __ro_after_init bool ibt_fatal = true;
+
+/*
+ * By definition, all missing-ENDBRANCH #CPs are a result of WFE && !ENDBR.
+ *
+ * For the kernel IBT no ENDBR selftest where #CPs are deliberately triggered,
+ * the WFE state of the interrupted context needs to be cleared to let execution
+ * continue. Otherwise when the CPU resumes from the instruction that just
+ * caused the previous #CP, another missing-ENDBRANCH #CP is raised and the CPU
+ * enters a dead loop.
+ *
+ * This is not a problem with IDT because it doesn't preserve WFE and IRET doesn't
+ * set WFE. But FRED provides space on the entry stack (in an expanded CS area)
+ * to save and restore the WFE state, thus the WFE state is no longer clobbered,
+ * so software must clear it.
+ */
+static void ibt_clear_fred_wfe(struct pt_regs *regs)
+{
+ /*
+ * No need to do any FRED checks.
+ *
+ * For IDT event delivery, the high-order 48 bits of CS are pushed
+ * as 0s into the stack, and later IRET ignores these bits.
+ *
+ * For FRED, a test to check if fred_cs.wfe is set would be dropped
+ * by compilers.
+ */
+ regs->fred_cs.wfe = 0;
+}
+
+static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ if ((error_code & CP_EC) != CP_ENDBR) {
+ do_unexpected_cp(regs, error_code);
+ return;
+ }
+
+ if (unlikely(regs->ip == (unsigned long)&ibt_selftest_noendbr)) {
+ regs->ax = 0;
+ ibt_clear_fred_wfe(regs);
+ return;
+ }
+
+ pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
+ if (!ibt_fatal) {
+ printk(KERN_DEFAULT CUT_HERE);
+ __warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+ ibt_clear_fred_wfe(regs);
+ return;
+ }
+ BUG();
+}
+
+static int __init ibt_setup(char *str)
+{
+ if (!strcmp(str, "off"))
+ setup_clear_cpu_cap(X86_FEATURE_IBT);
+
+ if (!strcmp(str, "warn"))
+ ibt_fatal = false;
+
+ return 1;
+}
+
+__setup("ibt=", ibt_setup);
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+ if (user_mode(regs)) {
+ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ do_user_cp_fault(regs, error_code);
+ else
+ do_unexpected_cp(regs, error_code);
+ } else {
+ if (cpu_feature_enabled(X86_FEATURE_IBT))
+ do_kernel_cp_fault(regs, error_code);
+ else
+ do_unexpected_cp(regs, error_code);
+ }
+}
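
To make the error-code split concrete: the low 15 bits (CP_EC) select the fault reason and bit 15 (CP_ENCL) marks an SGX-enclave origin. A runnable decoder in the spirit of cp_err_string(), fed a made-up code:

#include <stdio.h>

#define CP_EC	((1u << 15) - 1)
#define CP_ENCL	(1u << 15)

static const char * const cp_err[] = {
	"unknown", "near ret", "far/iret", "endbranch", "rstorssp", "setssbsy",
};

int main(void)
{
	unsigned int code = 3 | CP_ENCL;  /* hypothetical: ENDBR fault in an enclave */
	unsigned int cpec = code & CP_EC;

	if (cpec >= sizeof(cp_err) / sizeof(cp_err[0]))
		cpec = 0;  /* out-of-range reasons fall back to "unknown" */
	printf("#CP: %s%s\n", cp_err[cpec],
	       (code & CP_ENCL) ? " in enclave" : "");
	return 0;
}
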
diff --git a/arch/x86/kernel/cfi.c b/arch/x86/kernel/cfi.c
new file mode 100644
index 000000000000..638eb5c933e0
--- /dev/null
+++ b/arch/x86/kernel/cfi.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Clang Control Flow Integrity (CFI) support.
+ *
+ * Copyright (C) 2022 Google LLC
+ */
+#include <linux/string.h>
+#include <linux/cfi.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
+
+/*
+ * Returns the target address and the expected type when regs->ip points
+ * to a compiler-generated CFI trap.
+ */
+static bool decode_cfi_insn(struct pt_regs *regs, unsigned long *target,
+ u32 *type)
+{
+ char buffer[MAX_INSN_SIZE];
+ struct insn insn;
+ int offset = 0;
+
+ *target = *type = 0;
+
+ /*
+ * The compiler generates the following instruction sequence
+ * for indirect call checks:
+ *
+ *   movl    -<id>, %r10d        ; 6 bytes
+ *   addl    -<pos>(%reg), %r10d ; 4 bytes
+ *   je      .Ltmp1              ; 2 bytes
+ *   ud2                         ; <- regs->ip
+ *   .Ltmp1:
+ *
+ * We can decode the expected type and the target address from the
+ * movl/addl instructions.
+ */
+ if (copy_from_kernel_nofault(buffer, (void *)regs->ip - 12, MAX_INSN_SIZE))
+ return false;
+ if (insn_decode_kernel(&insn, &buffer[offset]))
+ return false;
+ if (insn.opcode.value != 0xBA)
+ return false;
+
+ *type = -(u32)insn.immediate.value;
+
+ if (copy_from_kernel_nofault(buffer, (void *)regs->ip - 6, MAX_INSN_SIZE))
+ return false;
+ if (insn_decode_kernel(&insn, &buffer[offset]))
+ return false;
+ if (insn.opcode.value != 0x3)
+ return false;
+
+ /* Read the target address from the register. */
+ offset = insn_get_modrm_rm_off(&insn, regs);
+ if (offset < 0)
+ return false;
+
+ *target = *(unsigned long *)((void *)regs + offset);
+
+ return true;
+}
+
+/*
+ * Checks if a ud2 trap is because of a CFI failure, and handles the trap
+ * if needed. Returns a bug_trap_type value similarly to report_bug.
+ */
+enum bug_trap_type handle_cfi_failure(struct pt_regs *regs)
+{
+ unsigned long target, addr = regs->ip;
+ u32 type;
+
+ switch (cfi_mode) {
+ case CFI_KCFI:
+ if (!is_cfi_trap(addr))
+ return BUG_TRAP_TYPE_NONE;
+
+ if (!decode_cfi_insn(regs, &target, &type))
+ return report_cfi_failure_noaddr(regs, addr);
+
+ break;
+
+ case CFI_FINEIBT:
+ if (!decode_fineibt_insn(regs, &target, &type))
+ return BUG_TRAP_TYPE_NONE;
+
+ break;
+
+ default:
+ return BUG_TRAP_TYPE_NONE;
+ }
+
+ return report_cfi_failure(regs, addr, &target, type);
+}
+
+/*
+ * Ensure that __kcfi_typeid_ symbols are emitted for functions that may
+ * not be indirectly called with all configurations.
+ */
+__ADDRESSABLE(__memcpy)
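
Note how decode_cfi_insn() recovers the type: the movl immediate carries the negated KCFI hash, so the raw hash never appears verbatim in the instruction stream, and decoding is a 32-bit negation. A tiny runnable check with a made-up hash value:

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint32_t type = 0xdeadbeef;	/* made-up KCFI type hash */
	uint32_t imm = 0u - type;	/* what the movl immediate holds */

	/* mirrors: *type = -(u32)insn.immediate.value; */
	assert(0u - imm == type);
	return 0;
}
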
diff --git a/arch/x86/kernel/check.c b/arch/x86/kernel/check.c
index fc999e6fc46a..5136e6818da8 100644
--- a/arch/x86/kernel/check.c
+++ b/arch/x86/kernel/check.c
@@ -1,9 +1,15 @@
-#include <linux/module.h>
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
#include <linux/sched.h>
#include <linux/kthread.h>
#include <linux/workqueue.h>
-#include <asm/e820.h>
+#include <linux/memblock.h>
+
#include <asm/proto.h>
+#include <asm/setup.h>
/*
* Some BIOSes seem to corrupt the low 64k of memory during events
@@ -18,27 +24,48 @@ static int __read_mostly memory_corruption_check = -1;
static unsigned __read_mostly corruption_check_size = 64*1024;
static unsigned __read_mostly corruption_check_period = 60; /* seconds */
-static struct e820entry scan_areas[MAX_SCAN_AREAS];
+static struct scan_area {
+ u64 addr;
+ u64 size;
+} scan_areas[MAX_SCAN_AREAS];
static int num_scan_areas;
-
static __init int set_corruption_check(char *arg)
{
- char *end;
+ ssize_t ret;
+ unsigned long val;
+
+ if (!arg) {
+ pr_err("memory_corruption_check config string not provided\n");
+ return -EINVAL;
+ }
+
+ ret = kstrtoul(arg, 10, &val);
+ if (ret)
+ return ret;
- memory_corruption_check = simple_strtol(arg, &end, 10);
+ memory_corruption_check = val;
- return (*end == 0) ? 0 : -EINVAL;
+ return 0;
}
early_param("memory_corruption_check", set_corruption_check);
static __init int set_corruption_check_period(char *arg)
{
- char *end;
+ ssize_t ret;
+ unsigned long val;
+
+ if (!arg) {
+ pr_err("memory_corruption_check_period config string not provided\n");
+ return -EINVAL;
+ }
- corruption_check_period = simple_strtoul(arg, &end, 10);
+ ret = kstrtoul(arg, 10, &val);
+ if (ret)
+ return ret;
- return (*end == 0) ? 0 : -EINVAL;
+ corruption_check_period = val;
+ return 0;
}
early_param("memory_corruption_check_period", set_corruption_check_period);
@@ -47,6 +74,11 @@ static __init int set_corruption_check_size(char *arg)
char *end;
unsigned size;
+ if (!arg) {
+ pr_err("memory_corruption_check_size config string not provided\n");
+ return -EINVAL;
+ }
+
size = memparse(arg, &end);
if (*end == '\0')
@@ -59,7 +91,8 @@ early_param("memory_corruption_check_size", set_corruption_check_size);
void __init setup_bios_corruption_check(void)
{
- u64 addr = PAGE_SIZE; /* assume first page is reserved anyway */
+ phys_addr_t start, end;
+ u64 i;
if (memory_corruption_check == -1) {
memory_corruption_check =
@@ -79,37 +112,32 @@ void __init setup_bios_corruption_check(void)
corruption_check_size = round_up(corruption_check_size, PAGE_SIZE);
- while (addr < corruption_check_size && num_scan_areas < MAX_SCAN_AREAS) {
- u64 size;
- addr = find_e820_area_size(addr, &size, PAGE_SIZE);
+ for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end,
+ NULL) {
+ start = clamp_t(phys_addr_t, round_up(start, PAGE_SIZE),
+ PAGE_SIZE, corruption_check_size);
+ end = clamp_t(phys_addr_t, round_down(end, PAGE_SIZE),
+ PAGE_SIZE, corruption_check_size);
+ if (start >= end)
+ continue;
- if (!(addr + 1))
- break;
-
- if (addr >= corruption_check_size)
- break;
-
- if ((addr + size) > corruption_check_size)
- size = corruption_check_size - addr;
-
- e820_update_range(addr, size, E820_RAM, E820_RESERVED);
- scan_areas[num_scan_areas].addr = addr;
- scan_areas[num_scan_areas].size = size;
- num_scan_areas++;
+ memblock_reserve(start, end - start);
+ scan_areas[num_scan_areas].addr = start;
+ scan_areas[num_scan_areas].size = end - start;
/* Assume we've already mapped this early memory */
- memset(__va(addr), 0, size);
+ memset(__va(start), 0, end - start);
- addr += size;
+ if (++num_scan_areas >= MAX_SCAN_AREAS)
+ break;
}
- printk(KERN_INFO "Scanning %d areas for low memory corruption\n",
- num_scan_areas);
- update_e820();
+ if (num_scan_areas)
+ pr_info("Scanning %d areas for low memory corruption\n", num_scan_areas);
}
-void check_for_bios_corruption(void)
+static void check_for_bios_corruption(void)
{
int i;
int corruption = 0;
@@ -124,8 +152,7 @@ void check_for_bios_corruption(void)
for (; size; addr++, size -= sizeof(unsigned long)) {
if (!*addr)
continue;
- printk(KERN_ERR "Corrupted low memory at %p (%lx phys) = %08lx\n",
- addr, __pa(addr), *addr);
+ pr_err("Corrupted low memory at %p (%lx phys) = %08lx\n", addr, __pa(addr), *addr);
corruption = 1;
*addr = 0;
}
@@ -141,21 +168,20 @@ static void check_corruption(struct work_struct *dummy)
{
check_for_bios_corruption();
schedule_delayed_work(&bios_check_work,
- round_jiffies_relative(corruption_check_period*HZ));
+ round_jiffies_relative(corruption_check_period*HZ));
}
static int start_periodic_check_for_corruption(void)
{
- if (!memory_corruption_check || corruption_check_period == 0)
+ if (!num_scan_areas || !memory_corruption_check || corruption_check_period == 0)
return 0;
- printk(KERN_INFO "Scanning for low memory corruption every %d seconds\n",
- corruption_check_period);
+ pr_info("Scanning for low memory corruption every %d seconds\n", corruption_check_period);
/* First time we run the checks right away */
schedule_delayed_work(&bios_check_work, 0);
+
return 0;
}
-
-module_init(start_periodic_check_for_corruption);
+device_initcall(start_periodic_check_for_corruption);
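
The strtol-to-kstrtoul conversions above are behavioral fixes, not just style: kstrtoul() reports overflow and rejects trailing junk before anything is written, whereas the old code stored the parsed value first and simple_strtoul() wraps silently on overflow. A userspace analogue of the stricter semantics (helper name is made up):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Reject empty input, trailing junk and overflow, like kstrtoul(). */
static int parse_ulong_strict(const char *s, unsigned long *out)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(s, &end, 10);
	if (errno || end == s || *end != '\0')
		return -EINVAL;
	*out = v;
	return 0;
}

int main(void)
{
	unsigned long v;

	printf("%d\n", parse_ulong_strict("60", &v));	/* 0 */
	printf("%d\n", parse_ulong_strict("60s", &v));	/* -EINVAL */
	return 0;
}
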
diff --git a/arch/x86/kernel/cpu/.gitignore b/arch/x86/kernel/cpu/.gitignore
index 667df55a4399..0bca7ef7426a 100644
--- a/arch/x86/kernel/cpu/.gitignore
+++ b/arch/x86/kernel/cpu/.gitignore
@@ -1 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
capflags.c
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 3a785da34b6f..2f8a58ef690e 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
#
# Makefile for x86-compatible CPU details, features and quirks
#
@@ -8,37 +9,68 @@ CFLAGS_REMOVE_common.o = -pg
CFLAGS_REMOVE_perf_event.o = -pg
endif
-# Make sure load_percpu_segment has no stackprotector
-nostackp := $(call cc-option, -fno-stack-protector)
-CFLAGS_common.o := $(nostackp)
+# If these files are instrumented, boot hangs during the first second.
+KCOV_INSTRUMENT_common.o := n
+KCOV_INSTRUMENT_perf_event.o := n
+KMSAN_SANITIZE_common.o := n
-obj-y := intel_cacheinfo.o addon_cpuid_features.o
-obj-y += proc.o capflags.o powerflags.o common.o
-obj-y += vmware.o hypervisor.o sched.o mshyperv.o
+# As above, instrumenting secondary CPU boot code causes boot hangs.
+KCSAN_SANITIZE_common.o := n
-obj-$(CONFIG_X86_32) += bugs.o cmpxchg.o
-obj-$(CONFIG_X86_64) += bugs_64.o
+obj-y := cacheinfo.o scattered.o
+obj-y += topology_common.o topology_ext.o topology_amd.o
+obj-y += common.o
+obj-y += rdrand.o
+obj-y += match.o
+obj-y += bugs.o
+obj-y += aperfmperf.o
+obj-y += cpuid-deps.o cpuid_0x2_table.o
+obj-y += umwait.o
+obj-y += capflags.o powerflags.o
-obj-$(CONFIG_CPU_SUP_INTEL) += intel.o
+obj-$(CONFIG_X86_LOCAL_APIC) += topology.o
+
+obj-$(CONFIG_PROC_FS) += proc.o
+
+obj-$(CONFIG_IA32_FEAT_CTL) += feat_ctl.o
+ifdef CONFIG_CPU_SUP_INTEL
+obj-y += intel.o tsx.o
+obj-$(CONFIG_PM) += intel_epb.o
+endif
obj-$(CONFIG_CPU_SUP_AMD) += amd.o
+ifeq ($(CONFIG_AMD_NB)$(CONFIG_SYSFS),yy)
+obj-y += amd_cache_disable.o
+endif
+obj-$(CONFIG_CPU_SUP_HYGON) += hygon.o
obj-$(CONFIG_CPU_SUP_CYRIX_32) += cyrix.o
obj-$(CONFIG_CPU_SUP_CENTAUR) += centaur.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o
+obj-$(CONFIG_CPU_SUP_ZHAOXIN) += zhaoxin.o
+obj-$(CONFIG_CPU_SUP_VORTEX_32) += vortex.o
-obj-$(CONFIG_PERF_EVENTS) += perf_event.o
-
-obj-$(CONFIG_X86_MCE) += mcheck/
+obj-$(CONFIG_X86_MCE) += mce/
obj-$(CONFIG_MTRR) += mtrr/
-obj-$(CONFIG_CPU_FREQ) += cpufreq/
+obj-$(CONFIG_MICROCODE) += microcode/
+obj-$(CONFIG_X86_CPU_RESCTRL) += resctrl/
+obj-$(CONFIG_X86_SGX) += sgx/
obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
+obj-$(CONFIG_HYPERVISOR_GUEST) += vmware.o hypervisor.o mshyperv.o
+obj-$(CONFIG_BHYVE_GUEST) += bhyve.o
+obj-$(CONFIG_ACRN_GUEST) += acrn.o
+
+obj-$(CONFIG_DEBUG_FS) += debugfs.o
+
+obj-$(CONFIG_X86_BUS_LOCK_DETECT) += bus_lock.o
+
quiet_cmd_mkcapflags = MKCAP $@
- cmd_mkcapflags = $(PERL) $(srctree)/$(src)/mkcapflags.pl $< $@
+ cmd_mkcapflags = $(CONFIG_SHELL) $(src)/mkcapflags.sh $@ $^
-cpufeature = $(src)/../../include/asm/cpufeature.h
+cpufeature = $(src)/../../include/asm/cpufeatures.h
+vmxfeature = $(src)/../../include/asm/vmxfeatures.h
-targets += capflags.c
-$(obj)/capflags.c: $(cpufeature) $(src)/mkcapflags.pl FORCE
+$(obj)/capflags.c: $(cpufeature) $(vmxfeature) $(src)/mkcapflags.sh FORCE
$(call if_changed,mkcapflags)
+targets += capflags.c
diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
new file mode 100644
index 000000000000..2c5b51aad91a
--- /dev/null
+++ b/arch/x86/kernel/cpu/acrn.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * ACRN detection support
+ *
+ * Copyright (C) 2019 Intel Corporation. All rights reserved.
+ *
+ * Jason Chen CJ <jason.cj.chen@intel.com>
+ * Zhao Yakui <yakui.zhao@intel.com>
+ *
+ */
+
+#include <linux/interrupt.h>
+
+#include <asm/acrn.h>
+#include <asm/apic.h>
+#include <asm/cpufeatures.h>
+#include <asm/desc.h>
+#include <asm/hypervisor.h>
+#include <asm/idtentry.h>
+#include <asm/irq_regs.h>
+
+static u32 __init acrn_detect(void)
+{
+ return acrn_cpuid_base();
+}
+
+static void __init acrn_init_platform(void)
+{
+ /* Install system interrupt handler for ACRN hypervisor callback */
+ sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
+
+ x86_platform.calibrate_tsc = acrn_get_tsc_khz;
+ x86_platform.calibrate_cpu = acrn_get_tsc_khz;
+}
+
+static bool acrn_x2apic_available(void)
+{
+ return boot_cpu_has(X86_FEATURE_X2APIC);
+}
+
+static void (*acrn_intr_handler)(void);
+
+DEFINE_IDTENTRY_SYSVEC(sysvec_acrn_hv_callback)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+
+ /*
+ * The hypervisor requires that the APIC EOI should be acked.
+ * If the APIC EOI is not acked, the APIC ISR bit for the
+ * HYPERVISOR_CALLBACK_VECTOR will not be cleared and then it
+ * will block the interrupt whose vector is lower than
+ * HYPERVISOR_CALLBACK_VECTOR.
+ */
+ apic_eoi();
+ inc_irq_stat(irq_hv_callback_count);
+
+ if (acrn_intr_handler)
+ acrn_intr_handler();
+
+ set_irq_regs(old_regs);
+}
+
+void acrn_setup_intr_handler(void (*handler)(void))
+{
+ acrn_intr_handler = handler;
+}
+EXPORT_SYMBOL_GPL(acrn_setup_intr_handler);
+
+void acrn_remove_intr_handler(void)
+{
+ acrn_intr_handler = NULL;
+}
+EXPORT_SYMBOL_GPL(acrn_remove_intr_handler);
+
+const __initconst struct hypervisor_x86 x86_hyper_acrn = {
+ .name = "ACRN",
+ .detect = acrn_detect,
+ .type = X86_HYPER_ACRN,
+ .init.init_platform = acrn_init_platform,
+ .init.x2apic_available = acrn_x2apic_available,
+};
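
acrn_detect() delegates to acrn_cpuid_base(), which boils down to the generic hypervisor-leaf scan: walk the 0x40000000 CPUID region in steps of 0x100 until EBX:ECX:EDX spell the vendor signature ("ACRNACRNACRN" for ACRN). A runnable userspace approximation using GCC's <cpuid.h> (prints 0 unless actually running under ACRN):

#include <cpuid.h>
#include <stdio.h>
#include <string.h>

/* Rough analogue of the kernel's hypervisor_cpuid_base(). */
static unsigned int find_hv_base(const char *sig)
{
	unsigned int base, eax, s[3];

	for (base = 0x40000000; base < 0x40010000; base += 0x100) {
		__cpuid(base, eax, s[0], s[1], s[2]);
		if (!memcmp(sig, s, 12))
			return base;
	}
	return 0;
}

int main(void)
{
	printf("ACRN base: %#x\n", find_hv_base("ACRNACRNACRN"));
	return 0;
}
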
diff --git a/arch/x86/kernel/cpu/addon_cpuid_features.c b/arch/x86/kernel/cpu/addon_cpuid_features.c
deleted file mode 100644
index 10fa5684a662..000000000000
--- a/arch/x86/kernel/cpu/addon_cpuid_features.c
+++ /dev/null
@@ -1,147 +0,0 @@
-/*
- * Routines to indentify additional cpu features that are scattered in
- * cpuid space.
- */
-#include <linux/cpu.h>
-
-#include <asm/pat.h>
-#include <asm/processor.h>
-
-#include <asm/apic.h>
-
-struct cpuid_bit {
- u16 feature;
- u8 reg;
- u8 bit;
- u32 level;
-};
-
-enum cpuid_regs {
- CR_EAX = 0,
- CR_ECX,
- CR_EDX,
- CR_EBX
-};
-
-void __cpuinit init_scattered_cpuid_features(struct cpuinfo_x86 *c)
-{
- u32 max_level;
- u32 regs[4];
- const struct cpuid_bit *cb;
-
- static const struct cpuid_bit __cpuinitconst cpuid_bits[] = {
- { X86_FEATURE_IDA, CR_EAX, 1, 0x00000006 },
- { X86_FEATURE_ARAT, CR_EAX, 2, 0x00000006 },
- { X86_FEATURE_APERFMPERF, CR_ECX, 0, 0x00000006 },
- { X86_FEATURE_CPB, CR_EDX, 9, 0x80000007 },
- { X86_FEATURE_NPT, CR_EDX, 0, 0x8000000a },
- { X86_FEATURE_LBRV, CR_EDX, 1, 0x8000000a },
- { X86_FEATURE_SVML, CR_EDX, 2, 0x8000000a },
- { X86_FEATURE_NRIPS, CR_EDX, 3, 0x8000000a },
- { 0, 0, 0, 0 }
- };
-
- for (cb = cpuid_bits; cb->feature; cb++) {
-
- /* Verify that the level is valid */
- max_level = cpuid_eax(cb->level & 0xffff0000);
- if (max_level < cb->level ||
- max_level > (cb->level | 0xffff))
- continue;
-
- cpuid(cb->level, &regs[CR_EAX], &regs[CR_EBX],
- &regs[CR_ECX], &regs[CR_EDX]);
-
- if (regs[cb->reg] & (1 << cb->bit))
- set_cpu_cap(c, cb->feature);
- }
-}
-
-/* leaf 0xb SMT level */
-#define SMT_LEVEL 0
-
-/* leaf 0xb sub-leaf types */
-#define INVALID_TYPE 0
-#define SMT_TYPE 1
-#define CORE_TYPE 2
-
-#define LEAFB_SUBTYPE(ecx) (((ecx) >> 8) & 0xff)
-#define BITS_SHIFT_NEXT_LEVEL(eax) ((eax) & 0x1f)
-#define LEVEL_MAX_SIBLINGS(ebx) ((ebx) & 0xffff)
-
-/*
- * Check for extended topology enumeration cpuid leaf 0xb and if it
- * exists, use it for populating initial_apicid and cpu topology
- * detection.
- */
-void __cpuinit detect_extended_topology(struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_SMP
- unsigned int eax, ebx, ecx, edx, sub_index;
- unsigned int ht_mask_width, core_plus_mask_width;
- unsigned int core_select_mask, core_level_siblings;
- static bool printed;
-
- if (c->cpuid_level < 0xb)
- return;
-
- cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
-
- /*
- * check if the cpuid leaf 0xb is actually implemented.
- */
- if (ebx == 0 || (LEAFB_SUBTYPE(ecx) != SMT_TYPE))
- return;
-
- set_cpu_cap(c, X86_FEATURE_XTOPOLOGY);
-
- /*
- * initial apic id, which also represents 32-bit extended x2apic id.
- */
- c->initial_apicid = edx;
-
- /*
- * Populate HT related information from sub-leaf level 0.
- */
- core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
- core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
-
- sub_index = 1;
- do {
- cpuid_count(0xb, sub_index, &eax, &ebx, &ecx, &edx);
-
- /*
- * Check for the Core type in the implemented sub leaves.
- */
- if (LEAFB_SUBTYPE(ecx) == CORE_TYPE) {
- core_level_siblings = LEVEL_MAX_SIBLINGS(ebx);
- core_plus_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
- break;
- }
-
- sub_index++;
- } while (LEAFB_SUBTYPE(ecx) != INVALID_TYPE);
-
- core_select_mask = (~(-1 << core_plus_mask_width)) >> ht_mask_width;
-
- c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, ht_mask_width)
- & core_select_mask;
- c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, core_plus_mask_width);
- /*
- * Reinit the apicid, now that we have extended initial_apicid.
- */
- c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
-
- c->x86_max_cores = (core_level_siblings / smp_num_siblings);
-
- if (!printed) {
- printk(KERN_INFO "CPU: Physical Processor ID: %d\n",
- c->phys_proc_id);
- if (c->x86_max_cores > 1)
- printk(KERN_INFO "CPU: Processor Core ID: %d\n",
- c->cpu_core_id);
- printed = 1;
- }
- return;
-#endif
-}
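
Most of the deleted detect_extended_topology() is shift-and-mask plumbing: leaf 0xb reports bit widths, and core_select_mask isolates the core bits of the APIC ID once the SMT bits are shifted out. The mask arithmetic, runnable with made-up widths:

#include <assert.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical: 2 SMT siblings (1 bit), 4 cores incl. SMT (3 bits). */
	unsigned int ht_mask_width = 1, core_plus_mask_width = 3;
	/* Same expression as the deleted code. */
	unsigned int core_select_mask =
		(~(-1 << core_plus_mask_width)) >> ht_mask_width;

	assert(core_select_mask == 0x3);

	unsigned int apicid = 5;	/* binary 101: core 2, SMT thread 1 */
	printf("core id: %u\n", (apicid >> ht_mask_width) & core_select_mask);
	return 0;
}
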
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index e485825130d2..bc94ff1e250a 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -1,43 +1,100 @@
-#include <linux/init.h>
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/export.h>
#include <linux/bitops.h>
+#include <linux/elf.h>
#include <linux/mm.h>
-
+#include <linux/kvm_types.h>
#include <linux/io.h>
+#include <linux/sched.h>
+#include <linux/sched/clock.h>
+#include <linux/random.h>
+#include <linux/topology.h>
+#include <linux/platform_data/x86/amd-fch.h>
#include <asm/processor.h>
#include <asm/apic.h>
+#include <asm/cacheinfo.h>
#include <asm/cpu.h>
+#include <asm/cpu_device_id.h>
+#include <asm/spec-ctrl.h>
+#include <asm/smp.h>
+#include <asm/numa.h>
#include <asm/pci-direct.h>
+#include <asm/delay.h>
+#include <asm/debugreg.h>
+#include <asm/resctrl.h>
+#include <asm/msr.h>
+#include <asm/sev.h>
#ifdef CONFIG_X86_64
-# include <asm/numa_64.h>
# include <asm/mmconfig.h>
-# include <asm/cacheflush.h>
#endif
#include "cpu.h"
-#ifdef CONFIG_X86_32
+u16 invlpgb_count_max __ro_after_init = 1;
+
+static inline int rdmsrq_amd_safe(unsigned msr, u64 *p)
+{
+ u32 gprs[8] = { 0 };
+ int err;
+
+ WARN_ONCE((boot_cpu_data.x86 != 0xf),
+ "%s should only be used on K8!\n", __func__);
+
+ gprs[1] = msr;
+ gprs[7] = 0x9c5a203a;
+
+ err = rdmsr_safe_regs(gprs);
+
+ *p = gprs[0] | ((u64)gprs[2] << 32);
+
+ return err;
+}
+
+static inline int wrmsrq_amd_safe(unsigned msr, u64 val)
+{
+ u32 gprs[8] = { 0 };
+
+ WARN_ONCE((boot_cpu_data.x86 != 0xf),
+ "%s should only be used on K8!\n", __func__);
+
+ gprs[0] = (u32)val;
+ gprs[1] = msr;
+ gprs[2] = val >> 32;
+ gprs[7] = 0x9c5a203a;
+
+ return wrmsr_safe_regs(gprs);
+}
+
/*
* B step AMD K6 before B 9730xxxx have hardware bugs that can cause
* misexecution of code under Linux. Owners of such processors should
* contact AMD for precise details and a CPU swap.
*
* See http://www.multimania.com/poulot/k6bug.html
- * http://www.amd.com/K6/k6docs/revgd.html
+ * and section 2.6.2 of "AMD-K6 Processor Revision Guide - Model 6"
+ * (Publication # 21266 Issue Date: August 1998)
*
* The following test is erm.. interesting. AMD neglected to up
* the chip setting when fixing the bug but they also tweaked some
* performance at the same time..
*/
-extern void vide(void);
-__asm__(".align 4\nvide: ret");
+#ifdef CONFIG_X86_32
+extern __visible void vide(void);
+__asm__(".text\n"
+ ".globl vide\n"
+ ".type vide, @function\n"
+ ".align 4\n"
+ "vide: ret\n");
+#endif
-static void __cpuinit init_amd_k5(struct cpuinfo_x86 *c)
+static void init_amd_k5(struct cpuinfo_x86 *c)
{
+#ifdef CONFIG_X86_32
/*
* General Systems BIOSen alias the cpu frequency registers
- * of the Elan at 0x000df000. Unfortuantly, one of the Linux
+ * of the Elan at 0x000df000. Unfortunately, one of the Linux
* drivers subsequently pokes it, and changes the CPU speed.
* Workaround : Remove the unneeded alias.
*/
@@ -48,13 +105,14 @@ static void __cpuinit init_amd_k5(struct cpuinfo_x86 *c)
if (inl(CBAR) & CBAR_ENB)
outl(0 | CBAR_KEY, CBAR);
}
+#endif
}
-
-static void __cpuinit init_amd_k6(struct cpuinfo_x86 *c)
+static void init_amd_k6(struct cpuinfo_x86 *c)
{
+#ifdef CONFIG_X86_32
u32 l, h;
- int mbytes = num_physpages >> (20-PAGE_SHIFT);
+ int mbytes = get_num_physpages() >> (20-PAGE_SHIFT);
if (c->x86_model < 6) {
/* Based on AMD doc 20734R - June 2000 */
@@ -65,13 +123,13 @@ static void __cpuinit init_amd_k6(struct cpuinfo_x86 *c)
return;
}
- if (c->x86_model == 6 && c->x86_mask == 1) {
+ if (c->x86_model == 6 && c->x86_stepping == 1) {
const int K6_BUG_LOOP = 1000000;
int n;
void (*f_vide)(void);
- unsigned long d, d2;
+ u64 d, d2;
- printk(KERN_INFO "AMD K6 stepping B detected - ");
+ pr_info("AMD K6 stepping B detected - ");
/*
* It looks like AMD fixed the 2.6.2 bug and improved indirect
@@ -80,23 +138,22 @@ static void __cpuinit init_amd_k6(struct cpuinfo_x86 *c)
n = K6_BUG_LOOP;
f_vide = vide;
- rdtscl(d);
+ OPTIMIZER_HIDE_VAR(f_vide);
+ d = rdtsc();
while (n--)
f_vide();
- rdtscl(d2);
+ d2 = rdtsc();
d = d2-d;
if (d > 20*K6_BUG_LOOP)
- printk(KERN_CONT
- "system stability may be impaired when more than 32 MB are used.\n");
+ pr_cont("system stability may be impaired when more than 32 MB are used.\n");
else
- printk(KERN_CONT "probably OK (after B9730xxxx).\n");
- printk(KERN_INFO "Please see http://membres.lycos.fr/poulot/k6bug.html\n");
+ pr_cont("probably OK (after B9730xxxx).\n");
}
/* K6 with old style WHCR */
if (c->x86_model < 8 ||
- (c->x86_model == 8 && c->x86_mask < 8)) {
+ (c->x86_model == 8 && c->x86_stepping < 8)) {
/* We can only write allocate on the low 508Mb */
if (mbytes > 508)
mbytes = 508;
@@ -109,13 +166,13 @@ static void __cpuinit init_amd_k6(struct cpuinfo_x86 *c)
wbinvd();
wrmsr(MSR_K6_WHCR, l, h);
local_irq_restore(flags);
- printk(KERN_INFO "Enabling old style K6 write allocation for %d Mb\n",
+ pr_info("Enabling old style K6 write allocation for %d Mb\n",
mbytes);
}
return;
}
- if ((c->x86_model == 8 && c->x86_mask > 7) ||
+ if ((c->x86_model == 8 && c->x86_stepping > 7) ||
c->x86_model == 9 || c->x86_model == 13) {
/* The more serious chips .. */
@@ -130,7 +187,7 @@ static void __cpuinit init_amd_k6(struct cpuinfo_x86 *c)
wbinvd();
wrmsr(MSR_K6_WHCR, l, h);
local_irq_restore(flags);
- printk(KERN_INFO "Enabling new style K6 write allocation for %d Mb\n",
+ pr_info("Enabling new style K6 write allocation for %d Mb\n",
mbytes);
}
@@ -142,13 +199,43 @@ static void __cpuinit init_amd_k6(struct cpuinfo_x86 *c)
/* placeholder for any needed mods */
return;
}
+#endif
}
-static void __cpuinit amd_k7_smp_check(struct cpuinfo_x86 *c)
+static void init_amd_k7(struct cpuinfo_x86 *c)
{
-#ifdef CONFIG_SMP
+#ifdef CONFIG_X86_32
+ u32 l, h;
+
+ /*
+	 * Bit 15 of Athlon-specific MSR 15 needs to be 0
+	 * to enable SSE on Palomino/Morgan/Barton CPUs.
+	 * If the BIOS didn't enable it already, enable it here.
+ */
+ if (c->x86_model >= 6 && c->x86_model <= 10) {
+ if (!cpu_has(c, X86_FEATURE_XMM)) {
+ pr_info("Enabling disabled K7/SSE Support.\n");
+ msr_clear_bit(MSR_K7_HWCR, 15);
+ set_cpu_cap(c, X86_FEATURE_XMM);
+ }
+ }
+
+ /*
+ * It's been determined by AMD that Athlons since model 8 stepping 1
+ * are more robust with CLK_CTL set to 200xxxxx instead of 600xxxxx
+ * As per AMD technical note 27212 0.2
+ */
+ if ((c->x86_model == 8 && c->x86_stepping >= 1) || (c->x86_model > 8)) {
+ rdmsr(MSR_K7_CLK_CTL, l, h);
+ if ((l & 0xfff00000) != 0x20000000) {
+ pr_info("CPU: CLK_CTL MSR was %x. Reprogramming to %x\n",
+ l, ((l & 0x000fffff)|0x20000000));
+ wrmsr(MSR_K7_CLK_CTL, (l & 0x000fffff)|0x20000000, h);
+ }
+ }
+
/* calling is from identify_secondary_cpu() ? */
- if (c->cpu_index == boot_cpu_id)
+ if (!c->cpu_index)
return;
/*
@@ -156,13 +243,13 @@ static void __cpuinit amd_k7_smp_check(struct cpuinfo_x86 *c)
* but they are not certified as MP capable.
*/
/* Athlon 660/661 is valid. */
- if ((c->x86_model == 6) && ((c->x86_mask == 0) ||
- (c->x86_mask == 1)))
- goto valid_k7;
+ if ((c->x86_model == 6) && ((c->x86_stepping == 0) ||
+ (c->x86_stepping == 1)))
+ return;
/* Duron 670 is valid */
- if ((c->x86_model == 7) && (c->x86_mask == 0))
- goto valid_k7;
+ if ((c->x86_model == 7) && (c->x86_stepping == 0))
+ return;
/*
* Athlon 662, Duron 671, and Athlon >model 7 have capability
@@ -171,11 +258,11 @@ static void __cpuinit amd_k7_smp_check(struct cpuinfo_x86 *c)
* See http://www.heise.de/newsticker/data/jow-18.10.01-000 for
* more.
*/
- if (((c->x86_model == 6) && (c->x86_mask >= 2)) ||
- ((c->x86_model == 7) && (c->x86_mask >= 1)) ||
+ if (((c->x86_model == 6) && (c->x86_stepping >= 2)) ||
+ ((c->x86_model == 7) && (c->x86_stepping >= 1)) ||
(c->x86_model > 7))
- if (cpu_has_mp)
- goto valid_k7;
+ if (cpu_has(c, X86_FEATURE_MP))
+ return;
/* If we get here, not a certified SMP capable AMD system. */
@@ -185,66 +272,26 @@ static void __cpuinit amd_k7_smp_check(struct cpuinfo_x86 *c)
*/
WARN_ONCE(1, "WARNING: This combination of AMD"
" processors is not suitable for SMP.\n");
- if (!test_taint(TAINT_UNSAFE_SMP))
- add_taint(TAINT_UNSAFE_SMP);
-
-valid_k7:
- ;
+ add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_NOW_UNRELIABLE);
#endif
}
-static void __cpuinit init_amd_k7(struct cpuinfo_x86 *c)
-{
- u32 l, h;
-
- /*
- * Bit 15 of Athlon specific MSR 15, needs to be 0
- * to enable SSE on Palomino/Morgan/Barton CPU's.
- * If the BIOS didn't enable it already, enable it here.
- */
- if (c->x86_model >= 6 && c->x86_model <= 10) {
- if (!cpu_has(c, X86_FEATURE_XMM)) {
- printk(KERN_INFO "Enabling disabled K7/SSE Support.\n");
- rdmsr(MSR_K7_HWCR, l, h);
- l &= ~0x00008000;
- wrmsr(MSR_K7_HWCR, l, h);
- set_cpu_cap(c, X86_FEATURE_XMM);
- }
- }
-
- /*
- * It's been determined by AMD that Athlons since model 8 stepping 1
- * are more robust with CLK_CTL set to 200xxxxx instead of 600xxxxx
- * As per AMD technical note 27212 0.2
- */
- if ((c->x86_model == 8 && c->x86_mask >= 1) || (c->x86_model > 8)) {
- rdmsr(MSR_K7_CLK_CTL, l, h);
- if ((l & 0xfff00000) != 0x20000000) {
- printk(KERN_INFO
- "CPU: CLK_CTL MSR was %x. Reprogramming to %x\n",
- l, ((l & 0x000fffff)|0x20000000));
- wrmsr(MSR_K7_CLK_CTL, (l & 0x000fffff)|0x20000000, h);
- }
- }
-
- set_cpu_cap(c, X86_FEATURE_K7);
-
- amd_k7_smp_check(c);
-}
-#endif
-
-#if defined(CONFIG_NUMA) && defined(CONFIG_X86_64)
-static int __cpuinit nearby_node(int apicid)
+#ifdef CONFIG_NUMA
+/*
+ * To workaround broken NUMA config. Read the comment in
+ * srat_detect_node().
+ */
+static int nearby_node(int apicid)
{
int i, node;
for (i = apicid - 1; i >= 0; i--) {
- node = apicid_to_node[i];
+ node = __apicid_to_node[i];
if (node != NUMA_NO_NODE && node_online(node))
return node;
}
for (i = apicid + 1; i < MAX_LOCAL_APIC; i++) {
- node = apicid_to_node[i];
+ node = __apicid_to_node[i];
if (node != NUMA_NO_NODE && node_online(node))
return node;
}
@@ -252,138 +299,325 @@ static int __cpuinit nearby_node(int apicid)
}
#endif
-/*
- * Fixup core topology information for AMD multi-node processors.
- * Assumption: Number of cores in each internal node is the same.
- */
-#ifdef CONFIG_X86_HT
-static void __cpuinit amd_fixup_dcm(struct cpuinfo_x86 *c)
+static void srat_detect_node(struct cpuinfo_x86 *c)
{
- unsigned long long value;
- u32 nodes, cores_per_node;
+#ifdef CONFIG_NUMA
int cpu = smp_processor_id();
+ int node;
+ unsigned apicid = c->topo.apicid;
- if (!cpu_has(c, X86_FEATURE_NODEID_MSR))
- return;
-
- /* fixup topology information only once for a core */
- if (cpu_has(c, X86_FEATURE_AMD_DCM))
- return;
-
- rdmsrl(MSR_FAM10H_NODE_ID, value);
-
- nodes = ((value >> 3) & 7) + 1;
- if (nodes == 1)
- return;
+ node = numa_cpu_node(cpu);
+ if (node == NUMA_NO_NODE)
+ node = per_cpu_llc_id(cpu);
- set_cpu_cap(c, X86_FEATURE_AMD_DCM);
- cores_per_node = c->x86_max_cores / nodes;
+ /*
+ * On multi-fabric platform (e.g. Numascale NumaChip) a
+ * platform-specific handler needs to be called to fixup some
+ * IDs of the CPU.
+ */
+ if (x86_cpuinit.fixup_cpu_id)
+ x86_cpuinit.fixup_cpu_id(c, node);
- /* store NodeID, use llc_shared_map to store sibling info */
- per_cpu(cpu_llc_id, cpu) = value & 7;
+ if (!node_online(node)) {
+ /*
+ * Two possibilities here:
+ *
+ * - The CPU is missing memory and no node was created. In
+ * that case try picking one from a nearby CPU.
+ *
+ * - The APIC IDs differ from the HyperTransport node IDs
+ * which the K8 northbridge parsing fills in. Assume
+ * they are all increased by a constant offset, but in
+ * the same order as the HT nodeids. If that doesn't
+ * result in a usable node fall back to the path for the
+ * previous case.
+ *
+ * This workaround operates directly on the mapping between
+ * APIC ID and NUMA node, assuming certain relationship
+ * between APIC ID, HT node ID and NUMA topology. As going
+ * through CPU mapping may alter the outcome, directly
+ * access __apicid_to_node[].
+ */
+ int ht_nodeid = c->topo.initial_apicid;
- /* fixup core id to be in range from 0 to (cores_per_node - 1) */
- c->cpu_core_id = c->cpu_core_id % cores_per_node;
-}
+ if (__apicid_to_node[ht_nodeid] != NUMA_NO_NODE)
+ node = __apicid_to_node[ht_nodeid];
+ /* Pick a nearby node */
+ if (!node_online(node))
+ node = nearby_node(apicid);
+ }
+ numa_set_node(cpu, node);
#endif
+}
-/*
- * On a AMD dual core setup the lower bits of the APIC id distingush the cores.
- * Assumes number of cores is a power of two.
- */
-static void __cpuinit amd_detect_cmp(struct cpuinfo_x86 *c)
+static void bsp_determine_snp(struct cpuinfo_x86 *c)
{
-#ifdef CONFIG_X86_HT
- unsigned bits;
- int cpu = smp_processor_id();
+#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
+ cc_vendor = CC_VENDOR_AMD;
- bits = c->x86_coreid_bits;
- /* Low order bits define the core id (index of core in socket) */
- c->cpu_core_id = c->initial_apicid & ((1 << bits)-1);
- /* Convert the initial APIC ID into the socket ID */
- c->phys_proc_id = c->initial_apicid >> bits;
- /* use socket ID also for last level cache */
- per_cpu(cpu_llc_id, cpu) = c->phys_proc_id;
- /* fixup topology information on multi-node processors */
- if ((c->x86 == 0x10) && (c->x86_model == 9))
- amd_fixup_dcm(c);
+ if (cpu_has(c, X86_FEATURE_SEV_SNP)) {
+ /*
+ * RMP table entry format is not architectural and is defined by the
+ * per-processor PPR. Restrict SNP support on the known CPU models
+ * for which the RMP table entry format is currently defined or for
+ * processors which support the architecturally defined RMPREAD
+ * instruction.
+ */
+ if (!cpu_has(c, X86_FEATURE_HYPERVISOR) &&
+ (cpu_feature_enabled(X86_FEATURE_ZEN3) ||
+ cpu_feature_enabled(X86_FEATURE_ZEN4) ||
+ cpu_feature_enabled(X86_FEATURE_RMPREAD)) &&
+ snp_probe_rmptable_info()) {
+ cc_platform_set(CC_ATTR_HOST_SEV_SNP);
+ } else {
+ setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
+ cc_platform_clear(CC_ATTR_HOST_SEV_SNP);
+ }
+ }
#endif
}
-int amd_get_nb_id(int cpu)
+#define ZEN_MODEL_STEP_UCODE(fam, model, step, ucode) \
+ X86_MATCH_VFM_STEPS(VFM_MAKE(X86_VENDOR_AMD, fam, model), \
+ step, step, ucode)
+
+static const struct x86_cpu_id amd_tsa_microcode[] = {
+ ZEN_MODEL_STEP_UCODE(0x19, 0x01, 0x1, 0x0a0011d7),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x01, 0x2, 0x0a00123b),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x08, 0x2, 0x0a00820d),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x11, 0x1, 0x0a10114c),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x11, 0x2, 0x0a10124c),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x18, 0x1, 0x0a108109),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x21, 0x0, 0x0a20102e),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x21, 0x2, 0x0a201211),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x44, 0x1, 0x0a404108),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x50, 0x0, 0x0a500012),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x61, 0x2, 0x0a60120a),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x74, 0x1, 0x0a704108),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x75, 0x2, 0x0a705208),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x78, 0x0, 0x0a708008),
+ ZEN_MODEL_STEP_UCODE(0x19, 0x7c, 0x0, 0x0a70c008),
+ ZEN_MODEL_STEP_UCODE(0x19, 0xa0, 0x2, 0x0aa00216),
+ {},
+};
+
+static void tsa_init(struct cpuinfo_x86 *c)
{
- int id = 0;
-#ifdef CONFIG_SMP
- id = per_cpu(cpu_llc_id, cpu);
-#endif
- return id;
+ if (cpu_has(c, X86_FEATURE_HYPERVISOR))
+ return;
+
+ if (cpu_has(c, X86_FEATURE_ZEN3) ||
+ cpu_has(c, X86_FEATURE_ZEN4)) {
+ if (x86_match_min_microcode_rev(amd_tsa_microcode))
+ setup_force_cpu_cap(X86_FEATURE_VERW_CLEAR);
+ else
+ pr_debug("%s: current revision: 0x%x\n", __func__, c->microcode);
+ } else {
+ setup_force_cpu_cap(X86_FEATURE_TSA_SQ_NO);
+ setup_force_cpu_cap(X86_FEATURE_TSA_L1_NO);
+ }
}
-EXPORT_SYMBOL_GPL(amd_get_nb_id);
-static void __cpuinit srat_detect_node(struct cpuinfo_x86 *c)
+static void bsp_init_amd(struct cpuinfo_x86 *c)
{
-#if defined(CONFIG_NUMA) && defined(CONFIG_X86_64)
- int cpu = smp_processor_id();
- int node;
- unsigned apicid = c->apicid;
+ if (cpu_has(c, X86_FEATURE_CONSTANT_TSC)) {
- node = per_cpu(cpu_llc_id, cpu);
+ if (c->x86 > 0x10 ||
+ (c->x86 == 0x10 && c->x86_model >= 0x2)) {
+ u64 val;
- if (apicid_to_node[apicid] != NUMA_NO_NODE)
- node = apicid_to_node[apicid];
- if (!node_online(node)) {
- /* Two possibilities here:
- - The CPU is missing memory and no node was created.
- In that case try picking one from a nearby CPU
- - The APIC IDs differ from the HyperTransport node IDs
- which the K8 northbridge parsing fills in.
- Assume they are all increased by a constant offset,
- but in the same order as the HT nodeids.
- If that doesn't result in a usable node fall back to the
- path for the previous case. */
-
- int ht_nodeid = c->initial_apicid;
-
- if (ht_nodeid >= 0 &&
- apicid_to_node[ht_nodeid] != NUMA_NO_NODE)
- node = apicid_to_node[ht_nodeid];
- /* Pick a nearby node */
- if (!node_online(node))
- node = nearby_node(apicid);
+ rdmsrq(MSR_K7_HWCR, val);
+ if (!(val & BIT(24)))
+ pr_warn(FW_BUG "TSC doesn't count with P0 frequency!\n");
+ }
}
- numa_set_node(cpu, node);
-#endif
+
+ if (c->x86 == 0x15) {
+ unsigned long upperbit;
+ u32 cpuid, assoc;
+
+ cpuid = cpuid_edx(0x80000005);
+ assoc = cpuid >> 16 & 0xff;
+ upperbit = ((cpuid >> 24) << 10) / assoc;
+
+ va_align.mask = (upperbit - 1) & PAGE_MASK;
+ va_align.flags = ALIGN_VA_32 | ALIGN_VA_64;
+
+ /* A random value per boot for bit slice [12:upper_bit) */
+ va_align.bits = get_random_u32() & va_align.mask;
+ }
+
+ if (cpu_has(c, X86_FEATURE_MWAITX))
+ use_mwaitx_delay();
+
+ if (!boot_cpu_has(X86_FEATURE_AMD_SSBD) &&
+ !boot_cpu_has(X86_FEATURE_VIRT_SSBD) &&
+ c->x86 >= 0x15 && c->x86 <= 0x17) {
+ unsigned int bit;
+
+ switch (c->x86) {
+ case 0x15: bit = 54; break;
+ case 0x16: bit = 33; break;
+ case 0x17: bit = 10; break;
+ default: return;
+ }
+ /*
+ * Try to cache the base value so further operations can
+ * avoid RMW. If that faults, do not enable SSBD.
+ */
+ if (!rdmsrq_safe(MSR_AMD64_LS_CFG, &x86_amd_ls_cfg_base)) {
+ setup_force_cpu_cap(X86_FEATURE_LS_CFG_SSBD);
+ setup_force_cpu_cap(X86_FEATURE_SSBD);
+ x86_amd_ls_cfg_ssbd_mask = 1ULL << bit;
+ }
+ }
+
+ resctrl_cpu_detect(c);
+
+ /* Figure out Zen generations: */
+ switch (c->x86) {
+ case 0x17:
+ switch (c->x86_model) {
+ case 0x00 ... 0x2f:
+ case 0x50 ... 0x5f:
+ setup_force_cpu_cap(X86_FEATURE_ZEN1);
+ break;
+ case 0x30 ... 0x4f:
+ case 0x60 ... 0x7f:
+ case 0x90 ... 0x91:
+ case 0xa0 ... 0xaf:
+ setup_force_cpu_cap(X86_FEATURE_ZEN2);
+ break;
+ default:
+ goto warn;
+ }
+ break;
+
+ case 0x19:
+ switch (c->x86_model) {
+ case 0x00 ... 0x0f:
+ case 0x20 ... 0x5f:
+ setup_force_cpu_cap(X86_FEATURE_ZEN3);
+ break;
+ case 0x10 ... 0x1f:
+ case 0x60 ... 0xaf:
+ setup_force_cpu_cap(X86_FEATURE_ZEN4);
+ break;
+ default:
+ goto warn;
+ }
+ break;
+
+ case 0x1a:
+ switch (c->x86_model) {
+ case 0x00 ... 0x2f:
+ case 0x40 ... 0x4f:
+ case 0x60 ... 0x7f:
+ setup_force_cpu_cap(X86_FEATURE_ZEN5);
+ break;
+ case 0x50 ... 0x5f:
+ case 0x80 ... 0xaf:
+ case 0xc0 ... 0xcf:
+ setup_force_cpu_cap(X86_FEATURE_ZEN6);
+ break;
+ default:
+ goto warn;
+ }
+ break;
+
+ default:
+ break;
+ }
+
+ bsp_determine_snp(c);
+ tsa_init(c);
+
+ if (cpu_has(c, X86_FEATURE_GP_ON_USER_CPUID))
+ setup_force_cpu_cap(X86_FEATURE_CPUID_FAULT);
+
+ return;
+
+warn:
+ WARN_ONCE(1, "Family 0x%x, model: 0x%x??\n", c->x86, c->x86_model);
}
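
Since bsp_init_amd() keys everything off c->x86 and c->x86_model, it may help to see how those values are derived. The split into base and extended family/model fields is architectural (CPUID Fn0000_0001_EAX); this standalone sketch decodes a real Zen3 example value.

#include <stdint.h>
#include <stdio.h>

static void decode_fms(uint32_t eax, unsigned int *fam, unsigned int *mod,
		       unsigned int *step)
{
	*fam  = (eax >> 8) & 0xf;
	*mod  = (eax >> 4) & 0xf;
	*step = eax & 0xf;

	/* AMD uses the extended fields once the base family saturates */
	if (*fam == 0xf) {
		*fam += (eax >> 20) & 0xff;
		*mod |= ((eax >> 16) & 0xf) << 4;
	}
}

int main(void)
{
	unsigned int fam, mod, step;

	decode_fms(0x00a20f12, &fam, &mod, &step);	/* Zen3 Vermeer */
	printf("family 0x%x model 0x%x stepping 0x%x\n", fam, mod, step);
	return 0;	/* prints family 0x19 model 0x21 stepping 0x2 */
}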
-static void __cpuinit early_init_amd_mc(struct cpuinfo_x86 *c)
+static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
{
-#ifdef CONFIG_X86_HT
- unsigned bits, ecx;
+ u64 msr;
- /* Multi core CPU? */
- if (c->extended_cpuid_level < 0x80000008)
- return;
+ /*
+	 * Mark that WBINVD is needed during kexec on processors that
+	 * support SME. This allows a successful kexec when going from
+	 * SME inactive to SME active (or vice versa).
+ *
+ * The cache must be cleared so that if there are entries with the
+ * same physical address, both with and without the encryption bit,
+ * they don't race each other when flushed and potentially end up
+ * with the wrong entry being committed to memory.
+ *
+ * Test the CPUID bit directly because with mem_encrypt=off the
+ * BSP will clear the X86_FEATURE_SME bit and the APs will not
+ * see it set after that.
+ */
+ if (c->extended_cpuid_level >= 0x8000001f && (cpuid_eax(0x8000001f) & BIT(0)))
+ __this_cpu_write(cache_state_incoherent, true);
- ecx = cpuid_ecx(0x80000008);
+ /*
+ * BIOS support is required for SME and SEV.
+ * For SME: If BIOS has enabled SME then adjust x86_phys_bits by
+ * the SME physical address space reduction value.
+ * If BIOS has not enabled SME then don't advertise the
+ * SME feature (set in scattered.c).
+ * If the kernel has not enabled SME via any means then
+ * don't advertise the SME feature.
+ * For SEV: If BIOS has not enabled SEV then don't advertise SEV and
+ * any additional functionality based on it.
+ *
+ * In all cases, since support for SME and SEV requires long mode,
+ * don't advertise the feature under CONFIG_X86_32.
+ */
+ if (cpu_has(c, X86_FEATURE_SME) || cpu_has(c, X86_FEATURE_SEV)) {
+ /* Check if memory encryption is enabled */
+ rdmsrq(MSR_AMD64_SYSCFG, msr);
+ if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
+ goto clear_all;
+
+ /*
+ * Always adjust physical address bits. Even though this
+ * will be a value above 32-bits this is still done for
+ * CONFIG_X86_32 so that accurate values are reported.
+ */
+ c->x86_phys_bits -= (cpuid_ebx(0x8000001f) >> 6) & 0x3f;
- c->x86_max_cores = (ecx & 0xff) + 1;
+ if (IS_ENABLED(CONFIG_X86_32))
+ goto clear_all;
- /* CPU telling us the core id bits shift? */
- bits = (ecx >> 12) & 0xF;
+ if (!sme_me_mask)
+ setup_clear_cpu_cap(X86_FEATURE_SME);
- /* Otherwise recompute */
- if (bits == 0) {
- while ((1 << bits) < c->x86_max_cores)
- bits++;
- }
+ rdmsrq(MSR_K7_HWCR, msr);
+ if (!(msr & MSR_K7_HWCR_SMMLOCK))
+ goto clear_sev;
- c->x86_coreid_bits = bits;
-#endif
+ return;
+
+clear_all:
+ setup_clear_cpu_cap(X86_FEATURE_SME);
+clear_sev:
+ setup_clear_cpu_cap(X86_FEATURE_SEV);
+ setup_clear_cpu_cap(X86_FEATURE_SEV_ES);
+ setup_clear_cpu_cap(X86_FEATURE_SEV_SNP);
+ }
}
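
The EBX layout used in the x86_phys_bits adjustment above is: bits 5:0 give the C-bit (encryption bit) position and bits 11:6 the physical-address-width reduction that applies when SME is enabled. A worked example with illustrative values:

#include <stdio.h>

int main(void)
{
	unsigned int ebx = 0x0000016f;	/* illustrative: C-bit 47, reduction 5 */
	unsigned int cbit = ebx & 0x3f;
	unsigned int reduction = (ebx >> 6) & 0x3f;
	unsigned int phys_bits = 48u;	/* typical value before adjustment */

	phys_bits -= reduction;
	printf("C-bit %u, usable physical address bits %u\n", cbit, phys_bits);
	return 0;	/* prints C-bit 47, usable physical address bits 43 */
}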
-static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
+static void early_init_amd(struct cpuinfo_x86 *c)
{
- early_init_amd_mc(c);
+ u32 dummy;
+
+ if (c->x86 >= 0xf)
+ set_cpu_cap(c, X86_FEATURE_K8);
+
+ rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
/*
* c->x86_power is 8000_0007 edx. Bit 8 is TSC runs at constant rate
@@ -394,31 +628,94 @@ static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
}
+ /* Bit 12 of 8000_0007 edx is accumulated power mechanism. */
+ if (c->x86_power & BIT(12))
+ set_cpu_cap(c, X86_FEATURE_ACC_POWER);
+
+ /* Bit 14 indicates the Runtime Average Power Limit interface. */
+ if (c->x86_power & BIT(14))
+ set_cpu_cap(c, X86_FEATURE_RAPL);
+
#ifdef CONFIG_X86_64
set_cpu_cap(c, X86_FEATURE_SYSCALL32);
#else
/* Set MTRR capability flag if appropriate */
if (c->x86 == 5)
if (c->x86_model == 13 || c->x86_model == 9 ||
- (c->x86_model == 8 && c->x86_mask >= 8))
+ (c->x86_model == 8 && c->x86_stepping >= 8))
set_cpu_cap(c, X86_FEATURE_K6_MTRR);
#endif
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_PCI)
- /* check CPU config space for extended APIC ID */
- if (cpu_has_apic && c->x86 >= 0xf) {
- unsigned int val;
- val = read_pci_config(0, 24, 0, 0x68);
- if ((val & ((1 << 17) | (1 << 18))) == ((1 << 17) | (1 << 18)))
+ /*
+ * ApicID can always be treated as an 8-bit value for AMD APIC versions
+ * >= 0x10, but even old K8s came out of reset with version 0x10. So, we
+ * can safely set X86_FEATURE_EXTD_APICID unconditionally for families
+ * after 16h.
+ */
+ if (boot_cpu_has(X86_FEATURE_APIC)) {
+ if (c->x86 > 0x16)
set_cpu_cap(c, X86_FEATURE_EXTD_APICID);
+ else if (c->x86 >= 0xf) {
+ /* check CPU config space for extended APIC ID */
+ unsigned int val;
+
+ val = read_pci_config(0, 24, 0, 0x68);
+ if ((val >> 17 & 0x3) == 0x3)
+ set_cpu_cap(c, X86_FEATURE_EXTD_APICID);
+ }
}
#endif
+
+ /*
+ * This is only needed to tell the kernel whether to use VMCALL
+ * and VMMCALL. VMMCALL is never executed except under virt, so
+ * we can set it unconditionally.
+ */
+ set_cpu_cap(c, X86_FEATURE_VMMCALL);
+
+ /* F16h erratum 793, CVE-2013-6885 */
+ if (c->x86 == 0x16 && c->x86_model <= 0xf)
+ msr_set_bit(MSR_AMD64_LS_CFG, 15);
+
+ early_detect_mem_encrypt(c);
+
+ if (!cpu_has(c, X86_FEATURE_HYPERVISOR) && !cpu_has(c, X86_FEATURE_IBPB_BRTYPE)) {
+ if (c->x86 == 0x17 && boot_cpu_has(X86_FEATURE_AMD_IBPB))
+ setup_force_cpu_cap(X86_FEATURE_IBPB_BRTYPE);
+ else if (c->x86 >= 0x19 && !wrmsrq_safe(MSR_IA32_PRED_CMD, PRED_CMD_SBPB)) {
+ setup_force_cpu_cap(X86_FEATURE_IBPB_BRTYPE);
+ setup_force_cpu_cap(X86_FEATURE_SBPB);
+ }
+ }
}
-static void __cpuinit init_amd(struct cpuinfo_x86 *c)
+static void init_amd_k8(struct cpuinfo_x86 *c)
{
-#ifdef CONFIG_SMP
- unsigned long long value;
+ u32 level;
+ u64 value;
+ /* On C+ stepping K8 rep microcode works well for copy/memset */
+ level = cpuid_eax(1);
+ if ((level >= 0x0f48 && level < 0x0f50) || level >= 0x0f58)
+ set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+
+ /*
+ * Some BIOSes incorrectly force this feature, but only K8 revision D
+ * (model = 0x14) and later actually support it.
+ * (AMD Erratum #110, docId: 25759).
+ */
+ if (c->x86_model < 0x14 && cpu_has(c, X86_FEATURE_LAHF_LM) && !cpu_has(c, X86_FEATURE_HYPERVISOR)) {
+ clear_cpu_cap(c, X86_FEATURE_LAHF_LM);
+ if (!rdmsrq_amd_safe(0xc001100d, &value)) {
+ value &= ~BIT_64(32);
+ wrmsrq_amd_safe(0xc001100d, value);
+ }
+ }
+
+ if (!c->x86_model_id[0])
+ strscpy(c->x86_model_id, "Hammer");
+
+#ifdef CONFIG_SMP
/*
* Disable TLB flush filter by setting HWCR.FFDIS on K8
* bit 6 of msr C001_0015
@@ -426,171 +723,544 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
* Errata 63 for SH-B3 steppings
* Errata 122 for all steppings (F+ have it disabled by default)
*/
- if (c->x86 == 0xf) {
- rdmsrl(MSR_K7_HWCR, value);
- value |= 1 << 6;
- wrmsrl(MSR_K7_HWCR, value);
- }
+ msr_set_bit(MSR_K7_HWCR, 6);
#endif
+ set_cpu_bug(c, X86_BUG_SWAPGS_FENCE);
- early_init_amd(c);
+ /*
+ * Check models and steppings affected by erratum 400. This is
+ * used to select the proper idle routine and to enable the
+ * check whether the machine is affected in arch_post_acpi_subsys_init()
+ * which sets the X86_BUG_AMD_APIC_C1E bug depending on the MSR check.
+ */
+ if (c->x86_model > 0x41 ||
+ (c->x86_model == 0x41 && c->x86_stepping >= 0x2))
+ setup_force_cpu_bug(X86_BUG_AMD_E400);
+}
+
+static void init_amd_gh(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_MMCONF_FAM10H
+ /* do this for boot cpu */
+ if (c == &boot_cpu_data)
+ check_enable_amd_mmconf_dmi();
+
+ fam10h_check_enable_mmcfg();
+#endif
/*
- * Bit 31 in normal CPUID used for nonstandard 3DNow ID;
- * 3DNow is IDd by bit 31 in extended CPUID (1*32+31) anyway
+ * Disable GART TLB Walk Errors on Fam10h. We do this here because this
+ * is always needed when GART is enabled, even in a kernel which has no
+ * MCE support built in. BIOS should disable GartTlbWlk Errors already.
+ * If it doesn't, we do it here as suggested by the BKDG.
+ *
+ * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012
*/
- clear_cpu_cap(c, 0*32+31);
+ msr_set_bit(MSR_AMD64_MCx_MASK(4), 10);
-#ifdef CONFIG_X86_64
- /* On C+ stepping K8 rep microcode works well for copy/memset */
- if (c->x86 == 0xf) {
- u32 level;
+ /*
+ * On family 10h BIOS may not have properly enabled WC+ support, causing
+ * it to be converted to CD memtype. This may result in performance
+ * degradation for certain nested-paging guests. Prevent this conversion
+ * by clearing bit 24 in MSR_AMD64_BU_CFG2.
+ *
+ * NOTE: we want to use the _safe accessors so as not to #GP kvm
+ * guests on older kvm hosts.
+ */
+ msr_clear_bit(MSR_AMD64_BU_CFG2, 24);
- level = cpuid_eax(1);
- if ((level >= 0x0f48 && level < 0x0f50) || level >= 0x0f58)
- set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+ set_cpu_bug(c, X86_BUG_AMD_TLB_MMATCH);
- /*
- * Some BIOSes incorrectly force this feature, but only K8
- * revision D (model = 0x14) and later actually support it.
- * (AMD Erratum #110, docId: 25759).
- */
- if (c->x86_model < 0x14 && cpu_has(c, X86_FEATURE_LAHF_LM)) {
- u64 val;
+ /*
+ * Check models and steppings affected by erratum 400. This is
+ * used to select the proper idle routine and to enable the
+ * check whether the machine is affected in arch_post_acpi_subsys_init()
+ * which sets the X86_BUG_AMD_APIC_C1E bug depending on the MSR check.
+ */
+ if (c->x86_model > 0x2 ||
+ (c->x86_model == 0x2 && c->x86_stepping >= 0x1))
+ setup_force_cpu_bug(X86_BUG_AMD_E400);
+}
- clear_cpu_cap(c, X86_FEATURE_LAHF_LM);
- if (!rdmsrl_amd_safe(0xc001100d, &val)) {
- val &= ~(1ULL << 32);
- wrmsrl_amd_safe(0xc001100d, val);
- }
- }
+static void init_amd_ln(struct cpuinfo_x86 *c)
+{
+ /*
+ * Apply erratum 665 fix unconditionally so machines without a BIOS
+ * fix work.
+ */
+ msr_set_bit(MSR_AMD64_DE_CFG, 31);
+}
- }
- if (c->x86 == 0x10 || c->x86 == 0x11)
- set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+static bool rdrand_force;
- /* get apicid instead of initial apic id from cpuid */
- c->apicid = hard_smp_processor_id();
-#else
+static int __init rdrand_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "force"))
+ rdrand_force = true;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+early_param("rdrand", rdrand_cmdline);
+static void clear_rdrand_cpuid_bit(struct cpuinfo_x86 *c)
+{
/*
- * FIXME: We should handle the K5 here. Set up the write
- * range and also turn on MSR 83 bits 4 and 31 (write alloc,
- * no bus pipeline)
+ * Saving of the MSR used to hide the RDRAND support during
+ * suspend/resume is done by arch/x86/power/cpu.c, which is
+ * dependent on CONFIG_PM_SLEEP.
*/
+ if (!IS_ENABLED(CONFIG_PM_SLEEP))
+ return;
- switch (c->x86) {
- case 4:
- init_amd_k5(c);
- break;
- case 5:
- init_amd_k6(c);
- break;
- case 6: /* An Athlon/Duron */
- init_amd_k7(c);
- break;
+ /*
+ * The self-test can clear X86_FEATURE_RDRAND, so check for
+ * RDRAND support using the CPUID function directly.
+ */
+ if (!(cpuid_ecx(1) & BIT(30)) || rdrand_force)
+ return;
+
+ msr_clear_bit(MSR_AMD64_CPUID_FN_1, 62);
+
+ /*
+ * Verify that the CPUID change has occurred in case the kernel is
+ * running virtualized and the hypervisor doesn't support the MSR.
+ */
+ if (cpuid_ecx(1) & BIT(30)) {
+ pr_info_once("BIOS may not properly restore RDRAND after suspend, but hypervisor does not support hiding RDRAND via CPUID.\n");
+ return;
}
- /* K6s reports MCEs but don't actually have all the MSRs */
- if (c->x86 < 6)
- clear_cpu_cap(c, X86_FEATURE_MCE);
-#endif
+ clear_cpu_cap(c, X86_FEATURE_RDRAND);
+ pr_info_once("BIOS may not properly restore RDRAND after suspend, hiding RDRAND via CPUID. Use rdrand=force to reenable.\n");
+}
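
The same direct probe can be reproduced from userspace. Here is a small sketch using the compiler's cpuid.h helper; leaf 1 ECX bit 30 is the architectural RDRAND bit that msr_clear_bit(MSR_AMD64_CPUID_FN_1, 62) hides (bit 62 of the MSR maps to ECX bit 30, since the MSR covers EDX:ECX).

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("RDRAND %s\n", (ecx & (1u << 30)) ? "present" : "hidden");
	return 0;
}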
- /* Enable workaround for FXSAVE leak */
- if (c->x86 >= 6)
- set_cpu_cap(c, X86_FEATURE_FXSAVE_LEAK);
+static void init_amd_jg(struct cpuinfo_x86 *c)
+{
+ /*
+ * Some BIOS implementations do not restore proper RDRAND support
+ * across suspend and resume. Check on whether to hide the RDRAND
+ * instruction support via CPUID.
+ */
+ clear_rdrand_cpuid_bit(c);
+}
- if (!c->x86_model_id[0]) {
- switch (c->x86) {
- case 0xf:
- /* Should distinguish Models here, but this is only
- a fallback anyways. */
- strcpy(c->x86_model_id, "Hammer");
- break;
+static void init_amd_bd(struct cpuinfo_x86 *c)
+{
+ u64 value;
+
+ /*
+	 * The Way Access Filter has a performance penalty on some workloads.
+ * Disable it on the affected CPUs.
+ */
+ if ((c->x86_model >= 0x02) && (c->x86_model < 0x20)) {
+ if (!rdmsrq_safe(MSR_F15H_IC_CFG, &value) && !(value & 0x1E)) {
+ value |= 0x1E;
+ wrmsrq_safe(MSR_F15H_IC_CFG, value);
}
}
- cpu_detect_cache_sizes(c);
+ /*
+ * Some BIOS implementations do not restore proper RDRAND support
+ * across suspend and resume. Check on whether to hide the RDRAND
+ * instruction support via CPUID.
+ */
+ clear_rdrand_cpuid_bit(c);
+}
+
+static const struct x86_cpu_id erratum_1386_microcode[] = {
+ X86_MATCH_VFM_STEPS(VFM_MAKE(X86_VENDOR_AMD, 0x17, 0x01), 0x2, 0x2, 0x0800126e),
+ X86_MATCH_VFM_STEPS(VFM_MAKE(X86_VENDOR_AMD, 0x17, 0x31), 0x0, 0x0, 0x08301052),
+ {}
+};
+
+static void fix_erratum_1386(struct cpuinfo_x86 *c)
+{
+ /*
+ * Work around Erratum 1386. The XSAVES instruction malfunctions in
+ * certain circumstances on Zen1/2 uarch, and not all parts have had
+ * updated microcode at the time of writing (March 2023).
+ *
+ * Affected parts all have no supervisor XSAVE states, meaning that
+ * the XSAVEC instruction (which works fine) is equivalent.
+ *
+ * Clear the feature flag only on microcode revisions which
+ * don't have the fix.
+ */
+ if (x86_match_min_microcode_rev(erratum_1386_microcode))
+ return;
+
+ clear_cpu_cap(c, X86_FEATURE_XSAVES);
+}
+
+void init_spectral_chicken(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_MITIGATION_UNRET_ENTRY
+ u64 value;
- /* Multi core CPU? */
- if (c->extended_cpuid_level >= 0x80000008) {
- amd_detect_cmp(c);
- srat_detect_node(c);
+ /*
+ * On Zen2 we offer this chicken (bit) on the altar of Speculation.
+ *
+ * This suppresses speculation from the middle of a basic block, i.e. it
+ * suppresses non-branch predictions.
+ */
+ if (!cpu_has(c, X86_FEATURE_HYPERVISOR)) {
+ if (!rdmsrq_safe(MSR_ZEN2_SPECTRAL_CHICKEN, &value)) {
+ value |= MSR_ZEN2_SPECTRAL_CHICKEN_BIT;
+ wrmsrq_safe(MSR_ZEN2_SPECTRAL_CHICKEN, value);
+ }
}
+#endif
+}
-#ifdef CONFIG_X86_32
- detect_ht(c);
+static void init_amd_zen_common(void)
+{
+ setup_force_cpu_cap(X86_FEATURE_ZEN);
+#ifdef CONFIG_NUMA
+ node_reclaim_distance = 32;
#endif
+}
- if (c->extended_cpuid_level >= 0x80000006) {
- if ((c->x86 >= 0x0f) && (cpuid_edx(0x80000006) & 0xf000))
- num_cache_leaves = 4;
- else
- num_cache_leaves = 3;
+static void init_amd_zen1(struct cpuinfo_x86 *c)
+{
+ fix_erratum_1386(c);
+
+ /* Fix up CPUID bits, but only if not virtualised. */
+ if (!cpu_has(c, X86_FEATURE_HYPERVISOR)) {
+
+ /* Erratum 1076: CPB feature bit not being set in CPUID. */
+ if (!cpu_has(c, X86_FEATURE_CPB))
+ set_cpu_cap(c, X86_FEATURE_CPB);
}
- if (c->x86 >= 0xf && c->x86 <= 0x11)
- set_cpu_cap(c, X86_FEATURE_K8);
+ pr_notice_once("AMD Zen1 DIV0 bug detected. Disable SMT for full protection.\n");
+ setup_force_cpu_bug(X86_BUG_DIV0);
- if (cpu_has_xmm2) {
- /* MFENCE stops RDTSC speculation */
- set_cpu_cap(c, X86_FEATURE_MFENCE_RDTSC);
+ /*
+ * Turn off the Instructions Retired free counter on machines that are
+ * susceptible to erratum #1054 "Instructions Retired Performance
+ * Counter May Be Inaccurate".
+ */
+ if (c->x86_model < 0x30) {
+ msr_clear_bit(MSR_K7_HWCR, MSR_K7_HWCR_IRPERF_EN_BIT);
+ clear_cpu_cap(c, X86_FEATURE_IRPERF);
}
+}
-#ifdef CONFIG_X86_64
- if (c->x86 == 0x10) {
- /* do this for boot cpu */
- if (c == &boot_cpu_data)
- check_enable_amd_mmconf_dmi();
+static bool cpu_has_zenbleed_microcode(void)
+{
+ u32 good_rev = 0;
+
+ switch (boot_cpu_data.x86_model) {
+ case 0x30 ... 0x3f: good_rev = 0x0830107b; break;
+ case 0x60 ... 0x67: good_rev = 0x0860010c; break;
+ case 0x68 ... 0x6f: good_rev = 0x08608107; break;
+ case 0x70 ... 0x7f: good_rev = 0x08701033; break;
+ case 0xa0 ... 0xaf: good_rev = 0x08a00009; break;
+
+ default:
+ return false;
+ }
- fam10h_check_enable_mmcfg();
+ if (boot_cpu_data.microcode < good_rev)
+ return false;
+
+ return true;
+}
+
+static void zen2_zenbleed_check(struct cpuinfo_x86 *c)
+{
+ if (cpu_has(c, X86_FEATURE_HYPERVISOR))
+ return;
+
+ if (!cpu_has(c, X86_FEATURE_AVX))
+ return;
+
+ if (!cpu_has_zenbleed_microcode()) {
+ pr_notice_once("Zenbleed: please update your microcode for the most optimal fix\n");
+ msr_set_bit(MSR_AMD64_DE_CFG, MSR_AMD64_DE_CFG_ZEN2_FP_BACKUP_FIX_BIT);
+ } else {
+ msr_clear_bit(MSR_AMD64_DE_CFG, MSR_AMD64_DE_CFG_ZEN2_FP_BACKUP_FIX_BIT);
}
+}
- if (c == &boot_cpu_data && c->x86 >= 0xf && c->x86 <= 0x11) {
- unsigned long long tseg;
+static void init_amd_zen2(struct cpuinfo_x86 *c)
+{
+ init_spectral_chicken(c);
+ fix_erratum_1386(c);
+ zen2_zenbleed_check(c);
+
+ /* Disable RDSEED on AMD Cyan Skillfish because of an error. */
+ if (c->x86_model == 0x47 && c->x86_stepping == 0x0) {
+ clear_cpu_cap(c, X86_FEATURE_RDSEED);
+ msr_clear_bit(MSR_AMD64_CPUID_FN_7, 18);
+ pr_emerg("RDSEED is not reliable on this platform; disabling.\n");
+ }
+ /* Correct misconfigured CPUID on some clients. */
+ clear_cpu_cap(c, X86_FEATURE_INVLPGB);
+}
+
+static void init_amd_zen3(struct cpuinfo_x86 *c)
+{
+ if (!cpu_has(c, X86_FEATURE_HYPERVISOR)) {
/*
- * Split up direct mapping around the TSEG SMM area.
- * Don't do it for gbpages because there seems very little
- * benefit in doing so.
+ * Zen3 (Fam19 model < 0x10) parts are not susceptible to
+ * Branch Type Confusion, but predate the allocation of the
+ * BTC_NO bit.
*/
- if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
- printk(KERN_DEBUG "tseg: %010llx\n", tseg);
- if ((tseg>>PMD_SHIFT) <
- (max_low_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) ||
- ((tseg>>PMD_SHIFT) <
- (max_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) &&
- (tseg>>PMD_SHIFT) >= (1ULL<<(32 - PMD_SHIFT))))
- set_memory_4k((unsigned long)__va(tseg), 1);
+ if (!cpu_has(c, X86_FEATURE_BTC_NO))
+ set_cpu_cap(c, X86_FEATURE_BTC_NO);
+ }
+}
+
+static void init_amd_zen4(struct cpuinfo_x86 *c)
+{
+ if (!cpu_has(c, X86_FEATURE_HYPERVISOR))
+ msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT);
+
+ /*
+ * These Zen4 SoCs advertise support for virtualized VMLOAD/VMSAVE
+ * in some BIOS versions but they can lead to random host reboots.
+ */
+ switch (c->x86_model) {
+ case 0x18 ... 0x1f:
+ case 0x60 ... 0x7f:
+ clear_cpu_cap(c, X86_FEATURE_V_VMSAVE_VMLOAD);
+ break;
+ }
+}
+
+static const struct x86_cpu_id zen5_rdseed_microcode[] = {
+ ZEN_MODEL_STEP_UCODE(0x1a, 0x02, 0x1, 0x0b00215a),
+ ZEN_MODEL_STEP_UCODE(0x1a, 0x08, 0x1, 0x0b008121),
+ ZEN_MODEL_STEP_UCODE(0x1a, 0x11, 0x0, 0x0b101054),
+ ZEN_MODEL_STEP_UCODE(0x1a, 0x24, 0x0, 0x0b204037),
+ ZEN_MODEL_STEP_UCODE(0x1a, 0x44, 0x0, 0x0b404035),
+ ZEN_MODEL_STEP_UCODE(0x1a, 0x44, 0x1, 0x0b404108),
+ ZEN_MODEL_STEP_UCODE(0x1a, 0x60, 0x0, 0x0b600037),
+ ZEN_MODEL_STEP_UCODE(0x1a, 0x68, 0x0, 0x0b608038),
+ ZEN_MODEL_STEP_UCODE(0x1a, 0x70, 0x0, 0x0b700037),
+ {},
+};
+
+static void init_amd_zen5(struct cpuinfo_x86 *c)
+{
+ if (!x86_match_min_microcode_rev(zen5_rdseed_microcode)) {
+ clear_cpu_cap(c, X86_FEATURE_RDSEED);
+ msr_clear_bit(MSR_AMD64_CPUID_FN_7, 18);
+ pr_emerg_once("RDSEED32 is broken. Disabling the corresponding CPUID bit.\n");
+ }
+}
+
+static void init_amd(struct cpuinfo_x86 *c)
+{
+ u64 vm_cr;
+
+ early_init_amd(c);
+
+ /*
+ * Bit 31 in normal CPUID used for nonstandard 3DNow ID;
+ * 3DNow is IDd by bit 31 in extended CPUID (1*32+31) anyway
+ */
+ clear_cpu_cap(c, 0*32+31);
+
+ if (c->x86 >= 0x10)
+ set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+
+ /* AMD FSRM also implies FSRS */
+ if (cpu_has(c, X86_FEATURE_FSRM))
+ set_cpu_cap(c, X86_FEATURE_FSRS);
+
+	/* K6s report MCEs but don't actually have all the MSRs */
+ if (c->x86 < 6)
+ clear_cpu_cap(c, X86_FEATURE_MCE);
+
+ switch (c->x86) {
+ case 4: init_amd_k5(c); break;
+ case 5: init_amd_k6(c); break;
+ case 6: init_amd_k7(c); break;
+ case 0xf: init_amd_k8(c); break;
+ case 0x10: init_amd_gh(c); break;
+ case 0x12: init_amd_ln(c); break;
+ case 0x15: init_amd_bd(c); break;
+ case 0x16: init_amd_jg(c); break;
+ }
+
+ /*
+ * Save up on some future enablement work and do common Zen
+ * settings.
+ */
+ if (c->x86 >= 0x17)
+ init_amd_zen_common();
+
+ if (boot_cpu_has(X86_FEATURE_ZEN1))
+ init_amd_zen1(c);
+ else if (boot_cpu_has(X86_FEATURE_ZEN2))
+ init_amd_zen2(c);
+ else if (boot_cpu_has(X86_FEATURE_ZEN3))
+ init_amd_zen3(c);
+ else if (boot_cpu_has(X86_FEATURE_ZEN4))
+ init_amd_zen4(c);
+ else if (boot_cpu_has(X86_FEATURE_ZEN5))
+ init_amd_zen5(c);
+
+ /*
+ * Enable workaround for FXSAVE leak on CPUs
+ * without a XSaveErPtr feature
+ */
+ if ((c->x86 >= 6) && (!cpu_has(c, X86_FEATURE_XSAVEERPTR)))
+ set_cpu_bug(c, X86_BUG_FXSAVE_LEAK);
+
+ cpu_detect_cache_sizes(c);
+
+ srat_detect_node(c);
+
+ init_amd_cacheinfo(c);
+
+ if (cpu_has(c, X86_FEATURE_SVM)) {
+ rdmsrq(MSR_VM_CR, vm_cr);
+ if (vm_cr & SVM_VM_CR_SVM_DIS_MASK) {
+ pr_notice_once("SVM disabled (by BIOS) in MSR_VM_CR\n");
+ clear_cpu_cap(c, X86_FEATURE_SVM);
}
}
-#endif
+
+ if (!cpu_has(c, X86_FEATURE_LFENCE_RDTSC) && cpu_has(c, X86_FEATURE_XMM2)) {
+ /*
+ * Use LFENCE for execution serialization. On families which
+ * don't have that MSR, LFENCE is already serializing.
+ * msr_set_bit() uses the safe accessors, too, even if the MSR
+ * is not present.
+ */
+ msr_set_bit(MSR_AMD64_DE_CFG,
+ MSR_AMD64_DE_CFG_LFENCE_SERIALIZE_BIT);
+
+ /* A serializing LFENCE stops RDTSC speculation */
+ set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
+ }
+
+ /*
+ * Family 0x12 and above processors have APIC timer
+ * running in deep C states.
+ */
+ if (c->x86 > 0x11)
+ set_cpu_cap(c, X86_FEATURE_ARAT);
+
+ /* 3DNow or LM implies PREFETCHW */
+ if (!cpu_has(c, X86_FEATURE_3DNOWPREFETCH))
+ if (cpu_has(c, X86_FEATURE_3DNOW) || cpu_has(c, X86_FEATURE_LM))
+ set_cpu_cap(c, X86_FEATURE_3DNOWPREFETCH);
+
+ /* AMD CPUs don't reset SS attributes on SYSRET, Xen does. */
+ if (!cpu_feature_enabled(X86_FEATURE_XENPV))
+ set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
+
+ /* Enable the Instructions Retired free counter */
+ if (cpu_has(c, X86_FEATURE_IRPERF))
+ msr_set_bit(MSR_K7_HWCR, MSR_K7_HWCR_IRPERF_EN_BIT);
+
+ check_null_seg_clears_base(c);
+
+ /*
+ * Make sure EFER[AIBRSE - Automatic IBRS Enable] is set. The APs are brought up
+ * using the trampoline code and as part of it, MSR_EFER gets prepared there in
+ * order to be replicated onto them. Regardless, set it here again, if not set,
+ * to protect against any future refactoring/code reorganization which might
+ * miss setting this important bit.
+ */
+ if (spectre_v2_in_eibrs_mode(spectre_v2_enabled) &&
+ cpu_has(c, X86_FEATURE_AUTOIBRS))
+ WARN_ON_ONCE(msr_set_bit(MSR_EFER, _EFER_AUTOIBRS) < 0);
+
+ /* AMD CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
+ clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
+
+ /* Enable Translation Cache Extension */
+ if (cpu_has(c, X86_FEATURE_TCE))
+ msr_set_bit(MSR_EFER, _EFER_TCE);
}
#ifdef CONFIG_X86_32
-static unsigned int __cpuinit amd_size_cache(struct cpuinfo_x86 *c,
- unsigned int size)
+static unsigned int amd_size_cache(struct cpuinfo_x86 *c, unsigned int size)
{
/* AMD errata T13 (order #21922) */
- if ((c->x86 == 6)) {
+ if (c->x86 == 6) {
/* Duron Rev A0 */
- if (c->x86_model == 3 && c->x86_mask == 0)
+ if (c->x86_model == 3 && c->x86_stepping == 0)
size = 64;
/* Tbird rev A1/A2 */
if (c->x86_model == 4 &&
- (c->x86_mask == 0 || c->x86_mask == 1))
+ (c->x86_stepping == 0 || c->x86_stepping == 1))
size = 256;
}
return size;
}
#endif
-static const struct cpu_dev __cpuinitconst amd_cpu_dev = {
+static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
+{
+ u32 ebx, eax, ecx, edx;
+ u16 mask = 0xfff;
+
+ if (c->x86 < 0xf)
+ return;
+
+ if (c->extended_cpuid_level < 0x80000006)
+ return;
+
+ cpuid(0x80000006, &eax, &ebx, &ecx, &edx);
+
+ tlb_lld_4k = (ebx >> 16) & mask;
+ tlb_lli_4k = ebx & mask;
+
+ /*
+ * K8 doesn't have 2M/4M entries in the L2 TLB so read out the L1 TLB
+ * characteristics from the CPUID function 0x80000005 instead.
+ */
+ if (c->x86 == 0xf) {
+ cpuid(0x80000005, &eax, &ebx, &ecx, &edx);
+ mask = 0xff;
+ }
+
+ /* Handle DTLB 2M and 4M sizes, fall back to L1 if L2 is disabled */
+ if (!((eax >> 16) & mask))
+ tlb_lld_2m = (cpuid_eax(0x80000005) >> 16) & 0xff;
+ else
+ tlb_lld_2m = (eax >> 16) & mask;
+
+ /* a 4M entry uses two 2M entries */
+ tlb_lld_4m = tlb_lld_2m >> 1;
+
+ /* Handle ITLB 2M and 4M sizes, fall back to L1 if L2 is disabled */
+ if (!(eax & mask)) {
+ /* Erratum 658 */
+ if (c->x86 == 0x15 && c->x86_model <= 0x1f) {
+ tlb_lli_2m = 1024;
+ } else {
+ cpuid(0x80000005, &eax, &ebx, &ecx, &edx);
+ tlb_lli_2m = eax & 0xff;
+ }
+ } else
+ tlb_lli_2m = eax & mask;
+
+ tlb_lli_4m = tlb_lli_2m >> 1;
+
+ /* Max number of pages INVLPGB can invalidate in one shot */
+ if (cpu_has(c, X86_FEATURE_INVLPGB))
+ invlpgb_count_max = (cpuid_edx(0x80000008) & 0xffff) + 1;
+}
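
For reference, the EBX layout consumed at the top of this function: bits 11:0 are the L2 ITLB 4K entry count and bits 27:16 the L2 DTLB 4K entry count (the nibbles above each field encode associativity, which is why the 12-bit mask is applied). A decoding sketch with a made-up register value:

#include <stdio.h>

int main(void)
{
	unsigned int ebx = 0x68006200;	/* made-up leaf 0x80000006 EBX */
	unsigned int lld_4k = (ebx >> 16) & 0xfff;
	unsigned int lli_4k = ebx & 0xfff;

	printf("L2 DTLB 4K: %u entries, L2 ITLB 4K: %u entries\n",
	       lld_4k, lli_4k);	/* 2048 and 512 */
	return 0;
}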
+
+static const struct cpu_dev amd_cpu_dev = {
.c_vendor = "AMD",
.c_ident = { "AuthenticAMD" },
#ifdef CONFIG_X86_32
- .c_models = {
- { .vendor = X86_VENDOR_AMD, .family = 4, .model_names =
+ .legacy_models = {
+ { .family = 4, .model_names =
{
[3] = "486 DX/2",
[7] = "486 DX/2-WB",
@@ -601,11 +1271,136 @@ static const struct cpu_dev __cpuinitconst amd_cpu_dev = {
}
},
},
- .c_size_cache = amd_size_cache,
+ .legacy_cache_size = amd_size_cache,
#endif
.c_early_init = early_init_amd,
+ .c_detect_tlb = cpu_detect_tlb_amd,
+ .c_bsp_init = bsp_init_amd,
.c_init = init_amd,
.c_x86_vendor = X86_VENDOR_AMD,
};
cpu_dev_register(amd_cpu_dev);
+
+static DEFINE_PER_CPU_READ_MOSTLY(unsigned long[4], amd_dr_addr_mask);
+
+static unsigned int amd_msr_dr_addr_masks[] = {
+ MSR_F16H_DR0_ADDR_MASK,
+ MSR_F16H_DR1_ADDR_MASK,
+ MSR_F16H_DR1_ADDR_MASK + 1,
+ MSR_F16H_DR1_ADDR_MASK + 2
+};
+
+void amd_set_dr_addr_mask(unsigned long mask, unsigned int dr)
+{
+ int cpu = smp_processor_id();
+
+ if (!cpu_feature_enabled(X86_FEATURE_BPEXT))
+ return;
+
+ if (WARN_ON_ONCE(dr >= ARRAY_SIZE(amd_msr_dr_addr_masks)))
+ return;
+
+ if (per_cpu(amd_dr_addr_mask, cpu)[dr] == mask)
+ return;
+
+ wrmsrq(amd_msr_dr_addr_masks[dr], mask);
+ per_cpu(amd_dr_addr_mask, cpu)[dr] = mask;
+}
+
+unsigned long amd_get_dr_addr_mask(unsigned int dr)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_BPEXT))
+ return 0;
+
+ if (WARN_ON_ONCE(dr >= ARRAY_SIZE(amd_msr_dr_addr_masks)))
+ return 0;
+
+ return per_cpu(amd_dr_addr_mask[dr], smp_processor_id());
+}
+EXPORT_SYMBOL_FOR_KVM(amd_get_dr_addr_mask);
+
+static void zenbleed_check_cpu(void *unused)
+{
+ struct cpuinfo_x86 *c = &cpu_data(smp_processor_id());
+
+ zen2_zenbleed_check(c);
+}
+
+void amd_check_microcode(void)
+{
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD)
+ return;
+
+ if (cpu_feature_enabled(X86_FEATURE_ZEN2))
+ on_each_cpu(zenbleed_check_cpu, NULL, 1);
+}
+
+static const char * const s5_reset_reason_txt[] = {
+ [0] = "thermal pin BP_THERMTRIP_L was tripped",
+ [1] = "power button was pressed for 4 seconds",
+ [2] = "shutdown pin was tripped",
+ [4] = "remote ASF power off command was received",
+ [9] = "internal CPU thermal limit was tripped",
+ [16] = "system reset pin BP_SYS_RST_L was tripped",
+ [17] = "software issued PCI reset",
+ [18] = "software wrote 0x4 to reset control register 0xCF9",
+ [19] = "software wrote 0x6 to reset control register 0xCF9",
+ [20] = "software wrote 0xE to reset control register 0xCF9",
+ [21] = "ACPI power state transition occurred",
+ [22] = "keyboard reset pin KB_RST_L was tripped",
+ [23] = "internal CPU shutdown event occurred",
+ [24] = "system failed to boot before failed boot timer expired",
+ [25] = "hardware watchdog timer expired",
+ [26] = "remote ASF reset command was received",
+ [27] = "an uncorrected error caused a data fabric sync flood event",
+ [29] = "FCH and MP1 failed warm reset handshake",
+ [30] = "a parity error occurred",
+ [31] = "a software sync flood event occurred",
+};
+
+static __init int print_s5_reset_status_mmio(void)
+{
+ void __iomem *addr;
+ u32 value;
+ int i;
+
+ if (!cpu_feature_enabled(X86_FEATURE_ZEN))
+ return 0;
+
+ addr = ioremap(FCH_PM_BASE + FCH_PM_S5_RESET_STATUS, sizeof(value));
+ if (!addr)
+ return 0;
+
+ value = ioread32(addr);
+
+ /* Value with "all bits set" is an error response and should be ignored. */
+ if (value == U32_MAX) {
+ iounmap(addr);
+ return 0;
+ }
+
+ /*
+ * Clear all reason bits so they won't be retained if the next reset
+ * does not update the register. Besides, some bits are never cleared by
+ * hardware so it's software's responsibility to clear them.
+ *
+ * Writing the value back effectively clears all reason bits as they are
+ * write-1-to-clear.
+ */
+ iowrite32(value, addr);
+ iounmap(addr);
+
+ for (i = 0; i < ARRAY_SIZE(s5_reset_reason_txt); i++) {
+ if (!(value & BIT(i)))
+ continue;
+
+ if (s5_reset_reason_txt[i]) {
+ pr_info("x86/amd: Previous system reset reason [0x%08x]: %s\n",
+ value, s5_reset_reason_txt[i]);
+ }
+ }
+
+ return 0;
+}
+late_initcall(print_s5_reset_status_mmio);
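
The register is write-1-to-clear, which the clearing logic above relies on. A minimal self-contained model of that idiom, with the hardware register simulated by a variable:

#include <stdint.h>
#include <stdio.h>

/* simulated S5/reset status register; bits latch until cleared */
static uint32_t fake_reg = (1u << 21) | (1u << 25);

static uint32_t reg_read(void)
{
	return fake_reg;
}

static void reg_write(uint32_t v)
{
	fake_reg &= ~v;		/* write-1-to-clear semantics */
}

int main(void)
{
	uint32_t value = reg_read();
	int i;

	reg_write(value);	/* writing the value back clears every set bit */

	for (i = 0; i < 32; i++)
		if (value & (1u << i))
			printf("reason bit %d was latched\n", i);

	printf("register after clear: 0x%08x\n", reg_read());
	return 0;
}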
diff --git a/arch/x86/kernel/cpu/amd_cache_disable.c b/arch/x86/kernel/cpu/amd_cache_disable.c
new file mode 100644
index 000000000000..8843b9557aea
--- /dev/null
+++ b/arch/x86/kernel/cpu/amd_cache_disable.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * AMD L3 cache_disable_{0,1} sysfs handling
+ * Documentation/ABI/testing/sysfs-devices-system-cpu
+ */
+
+#include <linux/cacheinfo.h>
+#include <linux/capability.h>
+#include <linux/pci.h>
+#include <linux/sysfs.h>
+
+#include <asm/amd/nb.h>
+
+#include "cpu.h"
+
+/*
+ * L3 cache descriptors
+ */
+static void amd_calc_l3_indices(struct amd_northbridge *nb)
+{
+ struct amd_l3_cache *l3 = &nb->l3_cache;
+ unsigned int sc0, sc1, sc2, sc3;
+ u32 val = 0;
+
+ pci_read_config_dword(nb->misc, 0x1C4, &val);
+
+ /* calculate subcache sizes */
+ l3->subcaches[0] = sc0 = !(val & BIT(0));
+ l3->subcaches[1] = sc1 = !(val & BIT(4));
+
+ if (boot_cpu_data.x86 == 0x15) {
+ l3->subcaches[0] = sc0 += !(val & BIT(1));
+ l3->subcaches[1] = sc1 += !(val & BIT(5));
+ }
+
+ l3->subcaches[2] = sc2 = !(val & BIT(8)) + !(val & BIT(9));
+ l3->subcaches[3] = sc3 = !(val & BIT(12)) + !(val & BIT(13));
+
+ l3->indices = (max(max3(sc0, sc1, sc2), sc3) << 10) - 1;
+}
+
+/*
+ * check whether a slot used for disabling an L3 index is occupied.
+ * @nb: northbridge containing the L3 cache descriptor
+ * @slot: slot number (0..1)
+ *
+ * @returns: the disabled index if used, or a negative value if the slot is free.
+ */
+static int amd_get_l3_disable_slot(struct amd_northbridge *nb, unsigned int slot)
+{
+ unsigned int reg = 0;
+
+ pci_read_config_dword(nb->misc, 0x1BC + slot * 4, &reg);
+
+ /* check whether this slot is activated already */
+ if (reg & (3UL << 30))
+ return reg & 0xfff;
+
+ return -1;
+}
+
+static ssize_t show_cache_disable(struct cacheinfo *ci, char *buf, unsigned int slot)
+{
+ int index;
+ struct amd_northbridge *nb = ci->priv;
+
+ index = amd_get_l3_disable_slot(nb, slot);
+ if (index >= 0)
+ return sysfs_emit(buf, "%d\n", index);
+
+ return sysfs_emit(buf, "FREE\n");
+}
+
+#define SHOW_CACHE_DISABLE(slot) \
+static ssize_t \
+cache_disable_##slot##_show(struct device *dev, \
+ struct device_attribute *attr, char *buf) \
+{ \
+ struct cacheinfo *ci = dev_get_drvdata(dev); \
+ return show_cache_disable(ci, buf, slot); \
+}
+
+SHOW_CACHE_DISABLE(0)
+SHOW_CACHE_DISABLE(1)
+
+static void amd_l3_disable_index(struct amd_northbridge *nb, int cpu,
+ unsigned int slot, unsigned long idx)
+{
+ int i;
+
+ idx |= BIT(30);
+
+ /*
+ * disable index in all 4 subcaches
+ */
+ for (i = 0; i < 4; i++) {
+ u32 reg = idx | (i << 20);
+
+ if (!nb->l3_cache.subcaches[i])
+ continue;
+
+ pci_write_config_dword(nb->misc, 0x1BC + slot * 4, reg);
+
+ /*
+ * We need to WBINVD on a core on the node containing the L3
+		 * cache whose indices we disable; therefore a simple wbinvd()
+ * is not sufficient.
+ */
+ wbinvd_on_cpu(cpu);
+
+ reg |= BIT(31);
+ pci_write_config_dword(nb->misc, 0x1BC + slot * 4, reg);
+ }
+}
+
+/*
+ * disable an L3 cache index by using a disable-slot
+ *
+ * @nb: northbridge containing the L3 cache descriptor
+ * @cpu: A CPU on the node containing the L3 cache
+ * @slot: slot number (0..1)
+ * @index: index to disable
+ *
+ * @return: 0 on success, error status on failure
+ */
+static int amd_set_l3_disable_slot(struct amd_northbridge *nb, int cpu,
+ unsigned int slot, unsigned long index)
+{
+ int ret = 0;
+
+ /* check if @slot is already used or the index is already disabled */
+ ret = amd_get_l3_disable_slot(nb, slot);
+ if (ret >= 0)
+ return -EEXIST;
+
+ if (index > nb->l3_cache.indices)
+ return -EINVAL;
+
+ /* check whether the other slot has disabled the same index already */
+ if (index == amd_get_l3_disable_slot(nb, !slot))
+ return -EEXIST;
+
+ amd_l3_disable_index(nb, cpu, slot, index);
+
+ return 0;
+}
+
+static ssize_t store_cache_disable(struct cacheinfo *ci, const char *buf,
+ size_t count, unsigned int slot)
+{
+ struct amd_northbridge *nb = ci->priv;
+ unsigned long val = 0;
+ int cpu, err = 0;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ cpu = cpumask_first(&ci->shared_cpu_map);
+
+ if (kstrtoul(buf, 10, &val) < 0)
+ return -EINVAL;
+
+ err = amd_set_l3_disable_slot(nb, cpu, slot, val);
+ if (err) {
+ if (err == -EEXIST)
+ pr_warn("L3 slot %d in use/index already disabled!\n",
+ slot);
+ return err;
+ }
+ return count;
+}
+
+#define STORE_CACHE_DISABLE(slot) \
+static ssize_t \
+cache_disable_##slot##_store(struct device *dev, \
+ struct device_attribute *attr, \
+ const char *buf, size_t count) \
+{ \
+ struct cacheinfo *ci = dev_get_drvdata(dev); \
+ return store_cache_disable(ci, buf, count, slot); \
+}
+
+STORE_CACHE_DISABLE(0)
+STORE_CACHE_DISABLE(1)
+
+static ssize_t subcaches_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct cacheinfo *ci = dev_get_drvdata(dev);
+ int cpu = cpumask_first(&ci->shared_cpu_map);
+
+ return sysfs_emit(buf, "%x\n", amd_get_subcaches(cpu));
+}
+
+static ssize_t subcaches_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct cacheinfo *ci = dev_get_drvdata(dev);
+ int cpu = cpumask_first(&ci->shared_cpu_map);
+ unsigned long val;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (kstrtoul(buf, 16, &val) < 0)
+ return -EINVAL;
+
+ if (amd_set_subcaches(cpu, val))
+ return -EINVAL;
+
+ return count;
+}
+
+static DEVICE_ATTR_RW(cache_disable_0);
+static DEVICE_ATTR_RW(cache_disable_1);
+static DEVICE_ATTR_RW(subcaches);
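
A usage sketch for the attributes just defined, assuming the documented sysfs layout where the L3 appears as index3 under a CPU's cacheinfo directory and the attribute is visible (i.e. the northbridge reports AMD_NB_L3_INDEX_DISABLE):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path =
		"/sys/devices/system/cpu/cpu0/cache/index3/cache_disable_0";
	const char *idx = "42\n";	/* cache index to disable */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, idx, strlen(idx)) < 0)
		perror("write");	/* -EEXIST if the slot is in use */
	close(fd);
	return 0;
}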
+
+static umode_t cache_private_attrs_is_visible(struct kobject *kobj,
+ struct attribute *attr, int unused)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct cacheinfo *ci = dev_get_drvdata(dev);
+ umode_t mode = attr->mode;
+
+ if (!ci->priv)
+ return 0;
+
+ if ((attr == &dev_attr_subcaches.attr) &&
+ amd_nb_has_feature(AMD_NB_L3_PARTITIONING))
+ return mode;
+
+ if ((attr == &dev_attr_cache_disable_0.attr ||
+ attr == &dev_attr_cache_disable_1.attr) &&
+ amd_nb_has_feature(AMD_NB_L3_INDEX_DISABLE))
+ return mode;
+
+ return 0;
+}
+
+static struct attribute_group cache_private_group = {
+ .is_visible = cache_private_attrs_is_visible,
+};
+
+static void init_amd_l3_attrs(void)
+{
+ static struct attribute **amd_l3_attrs;
+ int n = 1;
+
+ if (amd_l3_attrs) /* already initialized */
+ return;
+
+ if (amd_nb_has_feature(AMD_NB_L3_INDEX_DISABLE))
+ n += 2;
+ if (amd_nb_has_feature(AMD_NB_L3_PARTITIONING))
+ n += 1;
+
+ amd_l3_attrs = kcalloc(n, sizeof(*amd_l3_attrs), GFP_KERNEL);
+ if (!amd_l3_attrs)
+ return;
+
+ n = 0;
+ if (amd_nb_has_feature(AMD_NB_L3_INDEX_DISABLE)) {
+ amd_l3_attrs[n++] = &dev_attr_cache_disable_0.attr;
+ amd_l3_attrs[n++] = &dev_attr_cache_disable_1.attr;
+ }
+ if (amd_nb_has_feature(AMD_NB_L3_PARTITIONING))
+ amd_l3_attrs[n++] = &dev_attr_subcaches.attr;
+
+ cache_private_group.attrs = amd_l3_attrs;
+}
+
+const struct attribute_group *cache_get_priv_group(struct cacheinfo *ci)
+{
+ struct amd_northbridge *nb = ci->priv;
+
+ if (ci->level < 3 || !nb)
+ return NULL;
+
+ if (nb && nb->l3_cache.indices)
+ init_amd_l3_attrs();
+
+ return &cache_private_group;
+}
+
+struct amd_northbridge *amd_init_l3_cache(int index)
+{
+ struct amd_northbridge *nb;
+ int node;
+
+ /* only for L3, and not in virtualized environments */
+ if (index < 3)
+ return NULL;
+
+ node = topology_amd_node_id(smp_processor_id());
+ nb = node_to_amd_nb(node);
+ if (nb && !nb->l3_cache.indices)
+ amd_calc_l3_indices(nb);
+
+ return nb;
+}
diff --git a/arch/x86/kernel/cpu/aperfmperf.c b/arch/x86/kernel/cpu/aperfmperf.c
new file mode 100644
index 000000000000..7ffc78d5ebf2
--- /dev/null
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -0,0 +1,552 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * x86 APERF/MPERF KHz calculation for
+ * /sys/.../cpufreq/scaling_cur_freq
+ *
+ * Copyright (C) 2017 Intel Corp.
+ * Author: Len Brown <len.brown@intel.com>
+ */
+#include <linux/cpufreq.h>
+#include <linux/delay.h>
+#include <linux/ktime.h>
+#include <linux/math64.h>
+#include <linux/percpu.h>
+#include <linux/rcupdate.h>
+#include <linux/sched/isolation.h>
+#include <linux/sched/topology.h>
+#include <linux/smp.h>
+#include <linux/syscore_ops.h>
+
+#include <asm/cpu.h>
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+#include <asm/msr.h>
+
+#include "cpu.h"
+
+struct aperfmperf {
+ seqcount_t seq;
+ unsigned long last_update;
+ u64 acnt;
+ u64 mcnt;
+ u64 aperf;
+ u64 mperf;
+};
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct aperfmperf, cpu_samples) = {
+ .seq = SEQCNT_ZERO(cpu_samples.seq)
+};
+
+static void init_counter_refs(void *data)
+{
+ u64 aperf, mperf;
+
+ rdmsrq(MSR_IA32_APERF, aperf);
+ rdmsrq(MSR_IA32_MPERF, mperf);
+
+ this_cpu_write(cpu_samples.aperf, aperf);
+ this_cpu_write(cpu_samples.mperf, mperf);
+}
+
+#if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency, corresponding to
+ * freq_curr / freq_max.
+ *
+ * Since the frequency freq_curr on x86 is controlled by a micro-controller and
+ * our P-state setting is little more than a request/hint, we need to observe
+ * the effective frequency 'BusyMHz', i.e. the average frequency over a time
+ * interval after discarding idle time. This is given by:
+ *
+ * BusyMHz = delta_APERF / delta_MPERF * freq_base
+ *
+ * where freq_base is the max non-turbo P-state.
+ *
+ * The freq_max term has to be set to a somewhat arbitrary value, because we
+ * can't know which turbo states will be available at a given point in time:
+ * it all depends on the thermal headroom of the entire package. We set it to
+ * the turbo level with 4 cores active.
+ *
+ * Benchmarks show that's a good compromise between the 1C turbo ratio
+ * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
+ * which would ignore the entire turbo range (a conspicuous part, making
+ * freq_curr/freq_max always maxed out).
+ *
+ * An exception to the heuristic above is the Atom uarch, where we choose the
+ * highest turbo level for freq_max since Atoms are generally oriented towards
+ * power efficiency.
+ *
+ * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
+ * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
+ */
+
+DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
+static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
+
+void arch_set_max_freq_ratio(bool turbo_disabled)
+{
+ arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
+ arch_turbo_freq_ratio;
+}
+EXPORT_SYMBOL_GPL(arch_set_max_freq_ratio);
+
+static bool __init turbo_disabled(void)
+{
+ u64 misc_en;
+ int err;
+
+ err = rdmsrq_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+ if (err)
+ return false;
+
+ return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
+}
+
+static bool __init slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+{
+ int err;
+
+ err = rdmsrq_safe(MSR_ATOM_CORE_RATIOS, base_freq);
+ if (err)
+ return false;
+
+ err = rdmsrq_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 16) & 0x3F; /* max P state */
+ *turbo_freq = *turbo_freq & 0x3F; /* 1C turbo */
+
+ return true;
+}
+
+#define X86_MATCH(vfm) \
+ X86_MATCH_VFM_FEATURE(vfm, X86_FEATURE_APERFMPERF, NULL)
+
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] __initconst = {
+ X86_MATCH(INTEL_XEON_PHI_KNL),
+ X86_MATCH(INTEL_XEON_PHI_KNM),
+ {}
+};
+
+static const struct x86_cpu_id has_skx_turbo_ratio_limits[] __initconst = {
+ X86_MATCH(INTEL_SKYLAKE_X),
+ {}
+};
+
+static const struct x86_cpu_id has_glm_turbo_ratio_limits[] __initconst = {
+ X86_MATCH(INTEL_ATOM_GOLDMONT),
+ X86_MATCH(INTEL_ATOM_GOLDMONT_D),
+ X86_MATCH(INTEL_ATOM_GOLDMONT_PLUS),
+ {}
+};
+
+static bool __init knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
+ int num_delta_fratio)
+{
+ int fratio, delta_fratio, found;
+ int err, i;
+ u64 msr;
+
+ err = rdmsrq_safe(MSR_PLATFORM_INFO, base_freq);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+
+ err = rdmsrq_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+ if (err)
+ return false;
+
+ fratio = (msr >> 8) & 0xFF;
+ i = 16;
+ found = 0;
+ do {
+ if (found >= num_delta_fratio) {
+ *turbo_freq = fratio;
+ return true;
+ }
+
+ delta_fratio = (msr >> (i + 5)) & 0x7;
+
+ if (delta_fratio) {
+ found += 1;
+ fratio -= delta_fratio;
+ }
+
+ i += 8;
+ } while (i < 64);
+
+ return true;
+}
+
+static bool __init skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
+{
+ u64 ratios, counts;
+ u32 group_size;
+ int err, i;
+
+ err = rdmsrq_safe(MSR_PLATFORM_INFO, base_freq);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+
+ err = rdmsrq_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
+ if (err)
+ return false;
+
+ err = rdmsrq_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
+ if (err)
+ return false;
+
+ for (i = 0; i < 64; i += 8) {
+ group_size = (counts >> i) & 0xFF;
+ if (group_size >= size) {
+ *turbo_freq = (ratios >> i) & 0xFF;
+ return true;
+ }
+ }
+
+ return false;
+}
+
+static bool __init core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+{
+ u64 msr;
+ int err;
+
+ err = rdmsrq_safe(MSR_PLATFORM_INFO, base_freq);
+ if (err)
+ return false;
+
+ err = rdmsrq_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+ if (err)
+ return false;
+
+ *base_freq = (*base_freq >> 8) & 0xFF; /* max P state */
+ *turbo_freq = (msr >> 24) & 0xFF; /* 4C turbo */
+
+ /* The CPU may have less than 4 cores */
+ if (!*turbo_freq)
+ *turbo_freq = msr & 0xFF; /* 1C turbo */
+
+ return true;
+}
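
A worked example of the bit-field extraction this function performs, with made-up MSR contents; on these parts the ratios are in units of the 100 MHz bus clock:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t platform_info = 0x2400;	/* base ratio in bits 15:8 */
	uint64_t turbo_limit   = 0x2c2d2e2f;	/* 1C..4C ratios in bytes 0..3 */

	unsigned int base  = (platform_info >> 8) & 0xff;	/* 0x24 = 36 */
	unsigned int turbo = (turbo_limit >> 24) & 0xff;	/* 4C: 0x2c = 44 */

	if (!turbo)
		turbo = turbo_limit & 0xff;	/* <4 cores: use 1C turbo */

	printf("base %u00 MHz, 4C turbo %u00 MHz\n", base, turbo);
	return 0;
}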
+
+static bool __init intel_set_max_freq_ratio(void)
+{
+ u64 base_freq, turbo_freq;
+ u64 turbo_ratio;
+
+ if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
+ goto out;
+
+ if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
+ skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+ goto out;
+
+ if (x86_match_cpu(has_knl_turbo_ratio_limits) &&
+ knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+ goto out;
+
+ if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
+ skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
+ goto out;
+
+ if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
+ goto out;
+
+ return false;
+
+out:
+ /*
+ * Some hypervisors advertise X86_FEATURE_APERFMPERF
+	 * but then fill all MSRs with zeroes.
+ * Some CPUs have turbo boost but don't declare any turbo ratio
+ * in MSR_TURBO_RATIO_LIMIT.
+ */
+ if (!base_freq || !turbo_freq) {
+ pr_debug("Couldn't determine cpu base or turbo frequency, necessary for scale-invariant accounting.\n");
+ return false;
+ }
+
+ turbo_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE, base_freq);
+ if (!turbo_ratio) {
+ pr_debug("Non-zero turbo and base frequencies led to a 0 ratio.\n");
+ return false;
+ }
+
+ arch_turbo_freq_ratio = turbo_ratio;
+ arch_set_max_freq_ratio(turbo_disabled());
+
+ return true;
+}
+
+#ifdef CONFIG_PM_SLEEP
+static const struct syscore_ops freq_invariance_syscore_ops = {
+ .resume = init_counter_refs,
+};
+
+static struct syscore freq_invariance_syscore = {
+ .ops = &freq_invariance_syscore_ops,
+};
+
+static void register_freq_invariance_syscore(void)
+{
+ register_syscore(&freq_invariance_syscore);
+}
+#else
+static inline void register_freq_invariance_syscore(void) {}
+#endif
+
+static void freq_invariance_enable(void)
+{
+ if (static_branch_unlikely(&arch_scale_freq_key)) {
+ WARN_ON_ONCE(1);
+ return;
+ }
+ static_branch_enable_cpuslocked(&arch_scale_freq_key);
+ register_freq_invariance_syscore();
+ pr_info("Estimated ratio of average max frequency by base frequency (times 1024): %llu\n", arch_max_freq_ratio);
+}
+
+void freq_invariance_set_perf_ratio(u64 ratio, bool turbo_disabled)
+{
+ arch_turbo_freq_ratio = ratio;
+ arch_set_max_freq_ratio(turbo_disabled);
+ freq_invariance_enable();
+}
+
+static void __init bp_init_freq_invariance(void)
+{
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ return;
+
+ if (intel_set_max_freq_ratio()) {
+ guard(cpus_read_lock)();
+ freq_invariance_enable();
+ }
+}
+
+static void disable_freq_invariance_workfn(struct work_struct *work)
+{
+ int cpu;
+
+ static_branch_disable(&arch_scale_freq_key);
+
+ /*
+ * Set arch_freq_scale to a default value on all cpus
+ * This negates the effect of scaling
+ */
+ for_each_possible_cpu(cpu)
+ per_cpu(arch_freq_scale, cpu) = SCHED_CAPACITY_SCALE;
+}
+
+static DECLARE_WORK(disable_freq_invariance_work,
+ disable_freq_invariance_workfn);
+
+DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
+EXPORT_PER_CPU_SYMBOL_GPL(arch_freq_scale);
+
+static DEFINE_STATIC_KEY_FALSE(arch_hybrid_cap_scale_key);
+
+struct arch_hybrid_cpu_scale {
+ unsigned long capacity;
+ unsigned long freq_ratio;
+};
+
+static struct arch_hybrid_cpu_scale __percpu *arch_cpu_scale;
+
+/**
+ * arch_enable_hybrid_capacity_scale() - Enable hybrid CPU capacity scaling
+ *
+ * Allocate memory for per-CPU data used by hybrid CPU capacity scaling,
+ * initialize it and set the static key controlling its code paths.
+ *
+ * Must be called before arch_set_cpu_capacity().
+ */
+bool arch_enable_hybrid_capacity_scale(void)
+{
+ int cpu;
+
+ if (static_branch_unlikely(&arch_hybrid_cap_scale_key)) {
+ WARN_ONCE(1, "Hybrid CPU capacity scaling already enabled");
+ return true;
+ }
+
+ arch_cpu_scale = alloc_percpu(struct arch_hybrid_cpu_scale);
+ if (!arch_cpu_scale)
+ return false;
+
+ for_each_possible_cpu(cpu) {
+ per_cpu_ptr(arch_cpu_scale, cpu)->capacity = SCHED_CAPACITY_SCALE;
+ per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio = arch_max_freq_ratio;
+ }
+
+ static_branch_enable(&arch_hybrid_cap_scale_key);
+
+ pr_info("Hybrid CPU capacity scaling enabled\n");
+
+ return true;
+}
+
+/**
+ * arch_set_cpu_capacity() - Set scale-invariance parameters for a CPU
+ * @cpu: Target CPU.
+ * @cap: Capacity of @cpu at its maximum frequency, relative to @max_cap.
+ * @max_cap: System-wide maximum CPU capacity.
+ * @cap_freq: Frequency of @cpu corresponding to @cap.
+ * @base_freq: Frequency of @cpu at which MPERF counts.
+ *
+ * The units in which @cap and @max_cap are expressed do not matter, so long
+ * as they are consistent, because the former is effectively divided by the
+ * latter. Analogously for @cap_freq and @base_freq.
+ *
+ * After calling this function for all CPUs, call arch_rebuild_sched_domains()
+ * to let the scheduler know that capacity-aware scheduling can be used going
+ * forward.
+ */
+void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long max_cap,
+ unsigned long cap_freq, unsigned long base_freq)
+{
+ if (static_branch_likely(&arch_hybrid_cap_scale_key)) {
+ WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity,
+ div_u64(cap << SCHED_CAPACITY_SHIFT, max_cap));
+ WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio,
+ div_u64(cap_freq << SCHED_CAPACITY_SHIFT, base_freq));
+ } else {
+ WARN_ONCE(1, "Hybrid CPU capacity scaling not enabled");
+ }
+}
+
+unsigned long arch_scale_cpu_capacity(int cpu)
+{
+ if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
+ return READ_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity);
+
+ return SCHED_CAPACITY_SCALE;
+}
+EXPORT_SYMBOL_GPL(arch_scale_cpu_capacity);
+
+static void scale_freq_tick(u64 acnt, u64 mcnt)
+{
+ u64 freq_scale, freq_ratio;
+
+ if (!arch_scale_freq_invariant())
+ return;
+
+ if (check_shl_overflow(acnt, 2*SCHED_CAPACITY_SHIFT, &acnt))
+ goto error;
+
+ if (static_branch_unlikely(&arch_hybrid_cap_scale_key))
+ freq_ratio = READ_ONCE(this_cpu_ptr(arch_cpu_scale)->freq_ratio);
+ else
+ freq_ratio = arch_max_freq_ratio;
+
+ if (check_mul_overflow(mcnt, freq_ratio, &mcnt) || !mcnt)
+ goto error;
+
+ freq_scale = div64_u64(acnt, mcnt);
+ if (!freq_scale)
+ goto error;
+
+ if (freq_scale > SCHED_CAPACITY_SCALE)
+ freq_scale = SCHED_CAPACITY_SCALE;
+
+ this_cpu_write(arch_freq_scale, freq_scale);
+ return;
+
+error:
+ pr_warn("Scheduler frequency invariance went wobbly, disabling!\n");
+ schedule_work(&disable_freq_invariance_work);
+}
+#else
+static inline void bp_init_freq_invariance(void) { }
+static inline void scale_freq_tick(u64 acnt, u64 mcnt) { }
+#endif /* CONFIG_X86_64 && CONFIG_SMP */
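
A worked example of the Q10 fixed-point math in scale_freq_tick(), standalone and with made-up counter deltas. With SCHED_CAPACITY_SHIFT = 10, a result of 1024 means the CPU ran at the reference maximum over the interval:

#include <stdint.h>
#include <stdio.h>

#define SHIFT	10			/* SCHED_CAPACITY_SHIFT */
#define SCALE	(1u << SHIFT)		/* 1024 == running at freq_max */

int main(void)
{
	uint64_t acnt = 2000000;	/* APERF delta over one tick */
	uint64_t mcnt = 2000000;	/* MPERF delta over one tick */
	uint64_t freq_ratio = 1331;	/* freq_max/freq_base ~= 1.3 in Q10 */

	/* freq_scale = (acnt/mcnt) / (freq_ratio/1024), kept in Q10 */
	uint64_t scale = (acnt << (2 * SHIFT)) / (mcnt * freq_ratio);

	if (scale > SCALE)
		scale = SCALE;

	/* busy at exactly base frequency -> 1024*1024/1331 = 787 */
	printf("arch_freq_scale = %llu/%u\n", (unsigned long long)scale, SCALE);
	return 0;
}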
+
+void arch_scale_freq_tick(void)
+{
+ struct aperfmperf *s = this_cpu_ptr(&cpu_samples);
+ u64 acnt, mcnt, aperf, mperf;
+
+ if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+ return;
+
+ rdmsrq(MSR_IA32_APERF, aperf);
+ rdmsrq(MSR_IA32_MPERF, mperf);
+ acnt = aperf - s->aperf;
+ mcnt = mperf - s->mperf;
+
+ s->aperf = aperf;
+ s->mperf = mperf;
+
+ raw_write_seqcount_begin(&s->seq);
+ s->last_update = jiffies;
+ s->acnt = acnt;
+ s->mcnt = mcnt;
+ raw_write_seqcount_end(&s->seq);
+
+ scale_freq_tick(acnt, mcnt);
+}
+
+/*
+ * Discard samples older than the defined maximum sample age of 20ms. There
+ * is no point in sending IPIs in such a case. If the scheduler tick was
+ * not running then the CPU is either idle or isolated.
+ */
+#define MAX_SAMPLE_AGE ((unsigned long)HZ / 50)
+
+int arch_freq_get_on_cpu(int cpu)
+{
+ struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
+ unsigned int seq, freq;
+ unsigned long last;
+ u64 acnt, mcnt;
+
+ if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+ goto fallback;
+
+ do {
+ seq = raw_read_seqcount_begin(&s->seq);
+ last = s->last_update;
+ acnt = s->acnt;
+ mcnt = s->mcnt;
+ } while (read_seqcount_retry(&s->seq, seq));
+
+ /*
+ * Bail on invalid count and when the last update was too long ago,
+ * which covers idle and NOHZ full CPUs.
+ */
+ if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
+ goto fallback;
+
+ return div64_u64((cpu_khz * acnt), mcnt);
+
+fallback:
+ freq = cpufreq_quick_get(cpu);
+ return freq ? freq : cpu_khz;
+}
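
The read side above pairs with the sequence counter written in arch_scale_freq_tick(). Below is a compact standalone model of that writer/reader protocol using C11 atomics; note that a production version needs the acquire/release ordering that the kernel's seqcount API supplies, which this sketch glosses over.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic unsigned int seq;
static uint64_t acnt, mcnt;

static void writer_tick(uint64_t a, uint64_t m)
{
	atomic_fetch_add(&seq, 1);	/* odd: update in progress */
	acnt = a;
	mcnt = m;
	atomic_fetch_add(&seq, 1);	/* even: update complete */
}

static void reader(uint64_t *a, uint64_t *m)
{
	unsigned int s;

	do {
		while ((s = atomic_load(&seq)) & 1)
			;		/* writer active: wait */
		*a = acnt;
		*m = mcnt;
	} while (atomic_load(&seq) != s);	/* torn read: retry */
}

int main(void)
{
	uint64_t a, m;

	writer_tick(300, 200);
	reader(&a, &m);
	printf("acnt=%llu mcnt=%llu\n",
	       (unsigned long long)a, (unsigned long long)m);
	return 0;
}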
+
+static int __init bp_init_aperfmperf(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+ return 0;
+
+ init_counter_refs(NULL);
+ bp_init_freq_invariance();
+ return 0;
+}
+early_initcall(bp_init_aperfmperf);
+
+void ap_init_aperfmperf(void)
+{
+ if (cpu_feature_enabled(X86_FEATURE_APERFMPERF))
+ init_counter_refs(NULL);
+}
diff --git a/arch/x86/kernel/cpu/bhyve.c b/arch/x86/kernel/cpu/bhyve.c
new file mode 100644
index 000000000000..f1a8ca3dd1ed
--- /dev/null
+++ b/arch/x86/kernel/cpu/bhyve.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * FreeBSD Bhyve guest enlightenments
+ *
+ * Copyright © 2025 Amazon.com, Inc. or its affiliates.
+ *
+ * Author: David Woodhouse <dwmw2@infradead.org>
+ */
+
+#include <linux/init.h>
+#include <linux/export.h>
+#include <asm/processor.h>
+#include <asm/hypervisor.h>
+
+static uint32_t bhyve_cpuid_base;
+static uint32_t bhyve_cpuid_max;
+
+#define BHYVE_SIGNATURE "bhyve bhyve "
+
+#define CPUID_BHYVE_FEATURES 0x40000001
+
+/* Features advertised in CPUID_BHYVE_FEATURES %eax */
+
+/* MSI Extended Dest ID */
+#define CPUID_BHYVE_FEAT_EXT_DEST_ID (1UL << 0)
+
+static uint32_t __init bhyve_detect(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
+ return 0;
+
+ bhyve_cpuid_base = cpuid_base_hypervisor(BHYVE_SIGNATURE, 0);
+ if (!bhyve_cpuid_base)
+ return 0;
+
+ bhyve_cpuid_max = cpuid_eax(bhyve_cpuid_base);
+ return bhyve_cpuid_max;
+}
+
+static uint32_t bhyve_features(void)
+{
+ unsigned int cpuid_leaf = bhyve_cpuid_base | CPUID_BHYVE_FEATURES;
+
+ if (bhyve_cpuid_max < cpuid_leaf)
+ return 0;
+
+ return cpuid_eax(cpuid_leaf);
+}
+
+static bool __init bhyve_ext_dest_id(void)
+{
+ return !!(bhyve_features() & CPUID_BHYVE_FEAT_EXT_DEST_ID);
+}
+
+static bool __init bhyve_x2apic_available(void)
+{
+ return true;
+}
+
+const struct hypervisor_x86 x86_hyper_bhyve __refconst = {
+ .name = "Bhyve",
+ .detect = bhyve_detect,
+ .init.init_platform = x86_init_noop,
+ .init.x2apic_available = bhyve_x2apic_available,
+ .init.msi_ext_dest_id = bhyve_ext_dest_id,
+};
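
The detection above rests on the hypervisor CPUID range: leaf 0x40000000 returns the vendor signature in EBX/ECX/EDX and the maximum hypervisor leaf in EAX. A userspace sketch, only meaningful when the hypervisor bit is set, which is why bhyve_detect() checks X86_FEATURE_HYPERVISOR first:

#include <cpuid.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	char sig[13] = { 0 };

	__cpuid(0x40000000, eax, ebx, ecx, edx);
	memcpy(sig + 0, &ebx, 4);
	memcpy(sig + 4, &ecx, 4);
	memcpy(sig + 8, &edx, 4);

	printf("signature \"%s\", max hypervisor leaf 0x%x\n", sig, eax);
	printf("bhyve? %s\n", !strcmp(sig, "bhyve bhyve ") ? "yes" : "no");
	return 0;
}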
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index c39576cb3018..d0a2847a4bb0 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (C) 1994 Linus Torvalds
*
@@ -8,162 +9,3730 @@
* - Andrew D. Balsa (code cleanup).
*/
#include <linux/init.h>
-#include <linux/utsname.h>
+#include <linux/cpu.h>
+#include <linux/module.h>
+#include <linux/nospec.h>
+#include <linux/prctl.h>
+#include <linux/sched/smt.h>
+#include <linux/pgtable.h>
+#include <linux/bpf.h>
+#include <linux/kvm_types.h>
+
+#include <asm/spec-ctrl.h>
+#include <asm/cmdline.h>
#include <asm/bugs.h>
#include <asm/processor.h>
#include <asm/processor-flags.h>
-#include <asm/i387.h>
+#include <asm/fpu/api.h>
#include <asm/msr.h>
+#include <asm/vmx.h>
#include <asm/paravirt.h>
-#include <asm/alternative.h>
+#include <asm/cpu_device_id.h>
+#include <asm/e820/api.h>
+#include <asm/hypervisor.h>
+#include <asm/tlbflush.h>
+#include <asm/cpu.h>
+
+#include "cpu.h"
+
+/*
+ * Speculation Vulnerability Handling
+ *
+ * Each vulnerability is handled with the following functions:
+ * <vuln>_select_mitigation() -- Selects a mitigation to use. This should
+ * take into account all relevant command line
+ * options.
+ * <vuln>_update_mitigation() -- This is called after all vulnerabilities have
+ * selected a mitigation, in case the selection
+ * may want to change based on other choices
+ * made. This function is optional.
+ * <vuln>_apply_mitigation() -- Enable the selected mitigation.
+ *
+ * The compile-time mitigation in all cases should be AUTO. An explicit
+ * command-line option can override AUTO. If no such option is
+ * provided, <vuln>_select_mitigation() will override AUTO to the best
+ * mitigation option.
+ */
+
+/* The base value of the SPEC_CTRL MSR without task-specific bits set */
+u64 x86_spec_ctrl_base;
+
+/* The current value of the SPEC_CTRL MSR with task-specific bits set */
+DEFINE_PER_CPU(u64, x86_spec_ctrl_current);
+EXPORT_PER_CPU_SYMBOL_GPL(x86_spec_ctrl_current);
-static int __init no_halt(char *s)
+/*
+ * Set when the CPU has run a potentially malicious guest. An IBPB will
+ * be needed before running userspace. That IBPB will flush the branch
+ * predictor content.
+ */
+DEFINE_PER_CPU(bool, x86_ibpb_exit_to_user);
+EXPORT_PER_CPU_SYMBOL_GPL(x86_ibpb_exit_to_user);
+
+u64 x86_pred_cmd __ro_after_init = PRED_CMD_IBPB;
+
+static u64 __ro_after_init x86_arch_cap_msr;
+
+static DEFINE_MUTEX(spec_ctrl_mutex);
+
+void (*x86_return_thunk)(void) __ro_after_init = __x86_return_thunk;
+
+static void __init set_return_thunk(void *thunk)
{
- boot_cpu_data.hlt_works_ok = 0;
- return 1;
+ x86_return_thunk = thunk;
+
+ pr_info("active return thunk: %ps\n", thunk);
}
-__setup("no-hlt", no_halt);
+/* Update SPEC_CTRL MSR and its cached copy unconditionally */
+static void update_spec_ctrl(u64 val)
+{
+ this_cpu_write(x86_spec_ctrl_current, val);
+ wrmsrq(MSR_IA32_SPEC_CTRL, val);
+}
+
+/*
+ * Keep track of the SPEC_CTRL MSR value for the current task, which may differ
+ * from x86_spec_ctrl_base due to STIBP/SSB in __speculation_ctrl_update().
+ */
+void update_spec_ctrl_cond(u64 val)
+{
+ if (this_cpu_read(x86_spec_ctrl_current) == val)
+ return;
+
+ this_cpu_write(x86_spec_ctrl_current, val);
+
+ /*
+ * With KERNEL_IBRS this MSR is written on return-to-user; unless
+ * forced, the update can be delayed until that time.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS))
+ wrmsrq(MSR_IA32_SPEC_CTRL, val);
+}
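
The idiom above, updating the per-CPU cache unconditionally but skipping the hardware write when nothing changed, is worth seeing in isolation. A minimal sketch, with a stand-in for the WRMSR:

	#include <stdint.h>
	#include <stdio.h>

	static uint64_t cached_val;
	static unsigned int hw_writes;

	/* Stand-in for the expensive WRMSR. */
	static void expensive_write(uint64_t val)
	{
		(void)val;
		hw_writes++;
	}

	/* Skip the hardware write when the cached value already matches. */
	static void write_cond(uint64_t val)
	{
		if (cached_val == val)
			return;
		cached_val = val;
		expensive_write(val);
	}

	int main(void)
	{
		write_cond(1);
		write_cond(1);	/* elided: cache already holds 1 */
		write_cond(2);
		printf("hardware writes: %u (of 3 requests)\n", hw_writes);
		return 0;
	}
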
+
+noinstr u64 spec_ctrl_current(void)
+{
+ return this_cpu_read(x86_spec_ctrl_current);
+}
+EXPORT_SYMBOL_GPL(spec_ctrl_current);
+
+/*
+ * AMD specific MSR info for Speculative Store Bypass control.
+ * x86_amd_ls_cfg_ssbd_mask is initialized in identify_boot_cpu().
+ */
+u64 __ro_after_init x86_amd_ls_cfg_base;
+u64 __ro_after_init x86_amd_ls_cfg_ssbd_mask;
+
+/* Control conditional STIBP in switch_to() */
+DEFINE_STATIC_KEY_FALSE(switch_to_cond_stibp);
+/* Control conditional IBPB in switch_mm() */
+DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb);
+/* Control unconditional IBPB in switch_mm() */
+DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb);
+
+/* Control IBPB on vCPU load */
+DEFINE_STATIC_KEY_FALSE(switch_vcpu_ibpb);
+EXPORT_SYMBOL_FOR_KVM(switch_vcpu_ibpb);
+
+/* Control CPU buffer clear before idling (halt, mwait) */
+DEFINE_STATIC_KEY_FALSE(cpu_buf_idle_clear);
+EXPORT_SYMBOL_GPL(cpu_buf_idle_clear);
-static int __init no_387(char *s)
+/*
+ * Controls whether L1D-flush-based mitigations are enabled, based on
+ * hardware features and the admin setting via the "l1d_flush" boot
+ * parameter; defaults to false.
+ */
+DEFINE_STATIC_KEY_FALSE(switch_mm_cond_l1d_flush);
+
+#undef pr_fmt
+#define pr_fmt(fmt) "mitigations: " fmt
+
+static void __init cpu_print_attack_vectors(void)
{
- boot_cpu_data.hard_math = 0;
- write_cr0(X86_CR0_TS | X86_CR0_EM | X86_CR0_MP | read_cr0());
- return 1;
+ pr_info("Enabled attack vectors: ");
+
+ if (cpu_attack_vector_mitigated(CPU_MITIGATE_USER_KERNEL))
+ pr_cont("user_kernel, ");
+
+ if (cpu_attack_vector_mitigated(CPU_MITIGATE_USER_USER))
+ pr_cont("user_user, ");
+
+ if (cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_HOST))
+ pr_cont("guest_host, ");
+
+ if (cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_GUEST))
+ pr_cont("guest_guest, ");
+
+ pr_cont("SMT mitigations: ");
+
+ switch (smt_mitigations) {
+ case SMT_MITIGATIONS_OFF:
+ pr_cont("off\n");
+ break;
+ case SMT_MITIGATIONS_AUTO:
+ pr_cont("auto\n");
+ break;
+ case SMT_MITIGATIONS_ON:
+ pr_cont("on\n");
+ }
}
-__setup("no387", no_387);
+/*
+ * NOTE: This function is *only* called for SVM, since Intel uses
+ * MSR_IA32_SPEC_CTRL for SSBD.
+ */
+void
+x86_virt_spec_ctrl(u64 guest_virt_spec_ctrl, bool setguest)
+{
+ u64 guestval, hostval;
+ struct thread_info *ti = current_thread_info();
+
+ /*
+ * If SSBD is not handled in MSR_SPEC_CTRL on AMD, update
+ * MSR_AMD64_LS_CFG or MSR_VIRT_SPEC_CTRL if supported.
+ */
+ if (!static_cpu_has(X86_FEATURE_LS_CFG_SSBD) &&
+ !static_cpu_has(X86_FEATURE_VIRT_SSBD))
+ return;
+
+ /*
+ * If the host has SSBD mitigation enabled, force it in the host's
+ * virtual MSR value. If it's not permanently enabled, evaluate
+ * current's TIF_SSBD thread flag.
+ */
+ if (static_cpu_has(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE))
+ hostval = SPEC_CTRL_SSBD;
+ else
+ hostval = ssbd_tif_to_spec_ctrl(ti->flags);
+
+ /* Sanitize the guest value */
+ guestval = guest_virt_spec_ctrl & SPEC_CTRL_SSBD;
+
+ if (hostval != guestval) {
+ unsigned long tif;
+
+ tif = setguest ? ssbd_spec_ctrl_to_tif(guestval) :
+ ssbd_spec_ctrl_to_tif(hostval);
+
+ speculation_ctrl_update(tif);
+ }
+}
+EXPORT_SYMBOL_FOR_KVM(x86_virt_spec_ctrl);
+
+static void x86_amd_ssb_disable(void)
+{
+ u64 msrval = x86_amd_ls_cfg_base | x86_amd_ls_cfg_ssbd_mask;
+
+ if (boot_cpu_has(X86_FEATURE_VIRT_SSBD))
+ wrmsrq(MSR_AMD64_VIRT_SPEC_CTRL, SPEC_CTRL_SSBD);
+ else if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD))
+ wrmsrq(MSR_AMD64_LS_CFG, msrval);
+}
-static double __initdata x = 4195835.0;
-static double __initdata y = 3145727.0;
+#undef pr_fmt
+#define pr_fmt(fmt) "MDS: " fmt
/*
- * This used to check for exceptions..
- * However, it turns out that to support that,
- * the XMM trap handlers basically had to
- * be buggy. So let's have a correct XMM trap
- * handler, and forget about printing out
- * some status at boot.
+ * Returns true if vulnerability should be mitigated based on the
+ * selected attack vector controls.
*
- * We should really only care about bugs here
- * anyway. Not features.
+ * See Documentation/admin-guide/hw-vuln/attack_vector_controls.rst
*/
-static void __init check_fpu(void)
+static bool __init should_mitigate_vuln(unsigned int bug)
{
- s32 fdiv_bug;
+ switch (bug) {
+ /*
+ * The only runtime-selected spectre_v1 mitigations in the kernel are
+ * related to SWAPGS protection on kernel entry. Therefore, protection
+ * is only required for the user->kernel attack vector.
+ */
+ case X86_BUG_SPECTRE_V1:
+ return cpu_attack_vector_mitigated(CPU_MITIGATE_USER_KERNEL);
+
+ case X86_BUG_SPECTRE_V2:
+ case X86_BUG_RETBLEED:
+ case X86_BUG_L1TF:
+ case X86_BUG_ITS:
+ return cpu_attack_vector_mitigated(CPU_MITIGATE_USER_KERNEL) ||
+ cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_HOST);
+
+ case X86_BUG_SPECTRE_V2_USER:
+ return cpu_attack_vector_mitigated(CPU_MITIGATE_USER_USER) ||
+ cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_GUEST);
+
+ /*
+ * All the vulnerabilities below allow potentially leaking data
+ * across address spaces. Therefore, mitigation is required for
+ * any of these 4 attack vectors.
+ */
+ case X86_BUG_MDS:
+ case X86_BUG_TAA:
+ case X86_BUG_MMIO_STALE_DATA:
+ case X86_BUG_RFDS:
+ case X86_BUG_SRBDS:
+ return cpu_attack_vector_mitigated(CPU_MITIGATE_USER_KERNEL) ||
+ cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_HOST) ||
+ cpu_attack_vector_mitigated(CPU_MITIGATE_USER_USER) ||
+ cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_GUEST);
+
+ case X86_BUG_GDS:
+ return cpu_attack_vector_mitigated(CPU_MITIGATE_USER_KERNEL) ||
+ cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_HOST) ||
+ cpu_attack_vector_mitigated(CPU_MITIGATE_USER_USER) ||
+ cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_GUEST) ||
+ (smt_mitigations != SMT_MITIGATIONS_OFF);
+
+ case X86_BUG_SPEC_STORE_BYPASS:
+ return cpu_attack_vector_mitigated(CPU_MITIGATE_USER_USER);
+
+ case X86_BUG_VMSCAPE:
+ return cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_HOST);
+
+ default:
+ WARN(1, "Unknown bug %x\n", bug);
+ return false;
+ }
+}
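
Conceptually, should_mitigate_vuln() intersects each bug's relevant attack vectors with the set the administrator enabled. A self-contained model using a hypothetical bitmask encoding:

	#include <stdbool.h>
	#include <stdio.h>

	/* Hypothetical bit encoding of the four attack vectors. */
	#define VEC_USER_KERNEL	(1u << 0)
	#define VEC_USER_USER	(1u << 1)
	#define VEC_GUEST_HOST	(1u << 2)
	#define VEC_GUEST_GUEST	(1u << 3)

	/* Which vectors are enabled on this boot (would come from the cmdline). */
	static unsigned int enabled_vectors = VEC_USER_KERNEL | VEC_GUEST_HOST;

	/* Mitigate when the bug's relevant vectors intersect the enabled set. */
	static bool should_mitigate(unsigned int relevant)
	{
		return enabled_vectors & relevant;
	}

	int main(void)
	{
		/* Data-sampling bugs such as MDS care about all four vectors. */
		printf("MDS-style bug: %s\n",
		       should_mitigate(VEC_USER_KERNEL | VEC_USER_USER |
				       VEC_GUEST_HOST | VEC_GUEST_GUEST) ?
		       "mitigate" : "skip");
		/* SSB is only selected for the user->user vector. */
		printf("SSB-style bug: %s\n",
		       should_mitigate(VEC_USER_USER) ? "mitigate" : "skip");
		return 0;
	}
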
+
+/* Default mitigation for MDS-affected CPUs */
+static enum mds_mitigations mds_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_MDS) ? MDS_MITIGATION_AUTO : MDS_MITIGATION_OFF;
+static bool mds_nosmt __ro_after_init = false;
+
+static const char * const mds_strings[] = {
+ [MDS_MITIGATION_OFF] = "Vulnerable",
+ [MDS_MITIGATION_FULL] = "Mitigation: Clear CPU buffers",
+ [MDS_MITIGATION_VMWERV] = "Vulnerable: Clear CPU buffers attempted, no microcode",
+};
+
+enum taa_mitigations {
+ TAA_MITIGATION_OFF,
+ TAA_MITIGATION_AUTO,
+ TAA_MITIGATION_UCODE_NEEDED,
+ TAA_MITIGATION_VERW,
+ TAA_MITIGATION_TSX_DISABLED,
+};
+
+/* Default mitigation for TAA-affected CPUs */
+static enum taa_mitigations taa_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_TAA) ? TAA_MITIGATION_AUTO : TAA_MITIGATION_OFF;
+
+enum mmio_mitigations {
+ MMIO_MITIGATION_OFF,
+ MMIO_MITIGATION_AUTO,
+ MMIO_MITIGATION_UCODE_NEEDED,
+ MMIO_MITIGATION_VERW,
+};
+
+/* Default mitigation for Processor MMIO Stale Data vulnerabilities */
+static enum mmio_mitigations mmio_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_MMIO_STALE_DATA) ? MMIO_MITIGATION_AUTO : MMIO_MITIGATION_OFF;
+
+enum rfds_mitigations {
+ RFDS_MITIGATION_OFF,
+ RFDS_MITIGATION_AUTO,
+ RFDS_MITIGATION_VERW,
+ RFDS_MITIGATION_UCODE_NEEDED,
+};
+
+/* Default mitigation for Register File Data Sampling */
+static enum rfds_mitigations rfds_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_RFDS) ? RFDS_MITIGATION_AUTO : RFDS_MITIGATION_OFF;
+
+/*
+ * Set if any of MDS/TAA/MMIO/RFDS are going to enable VERW clearing on exit to
+ * userspace *and* on entry to KVM guests.
+ */
+static bool verw_clear_cpu_buf_mitigation_selected __ro_after_init;
+
+static void __init mds_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_MDS)) {
+ mds_mitigation = MDS_MITIGATION_OFF;
+ return;
+ }
+
+ if (mds_mitigation == MDS_MITIGATION_AUTO) {
+ if (should_mitigate_vuln(X86_BUG_MDS))
+ mds_mitigation = MDS_MITIGATION_FULL;
+ else
+ mds_mitigation = MDS_MITIGATION_OFF;
+ }
+
+ if (mds_mitigation == MDS_MITIGATION_OFF)
+ return;
+
+ verw_clear_cpu_buf_mitigation_selected = true;
+}
+
+static void __init mds_update_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_MDS))
+ return;
+
+ /* If TAA, MMIO, or RFDS are being mitigated, MDS gets mitigated too. */
+ if (verw_clear_cpu_buf_mitigation_selected)
+ mds_mitigation = MDS_MITIGATION_FULL;
+
+ if (mds_mitigation == MDS_MITIGATION_FULL) {
+ if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
+ mds_mitigation = MDS_MITIGATION_VMWERV;
+ }
+
+ pr_info("%s\n", mds_strings[mds_mitigation]);
+}
+
+static void __init mds_apply_mitigation(void)
+{
+ if (mds_mitigation == MDS_MITIGATION_FULL ||
+ mds_mitigation == MDS_MITIGATION_VMWERV) {
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF);
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM);
+ if (!boot_cpu_has(X86_BUG_MSBDS_ONLY) &&
+ (mds_nosmt || smt_mitigations == SMT_MITIGATIONS_ON))
+ cpu_smt_disable(false);
+ }
+}
+
+static int __init mds_cmdline(char *str)
+{
+ if (!boot_cpu_has_bug(X86_BUG_MDS))
+ return 0;
+
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "off"))
+ mds_mitigation = MDS_MITIGATION_OFF;
+ else if (!strcmp(str, "full"))
+ mds_mitigation = MDS_MITIGATION_FULL;
+ else if (!strcmp(str, "full,nosmt")) {
+ mds_mitigation = MDS_MITIGATION_FULL;
+ mds_nosmt = true;
+ }
+
+ return 0;
+}
+early_param("mds", mds_cmdline);
+
+#undef pr_fmt
+#define pr_fmt(fmt) "TAA: " fmt
+
+static bool taa_nosmt __ro_after_init;
+
+static const char * const taa_strings[] = {
+ [TAA_MITIGATION_OFF] = "Vulnerable",
+ [TAA_MITIGATION_UCODE_NEEDED] = "Vulnerable: Clear CPU buffers attempted, no microcode",
+ [TAA_MITIGATION_VERW] = "Mitigation: Clear CPU buffers",
+ [TAA_MITIGATION_TSX_DISABLED] = "Mitigation: TSX disabled",
+};
+
+static bool __init taa_vulnerable(void)
+{
+ return boot_cpu_has_bug(X86_BUG_TAA) && boot_cpu_has(X86_FEATURE_RTM);
+}
+
+static void __init taa_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_TAA)) {
+ taa_mitigation = TAA_MITIGATION_OFF;
+ return;
+ }
+
+ /* TSX previously disabled by tsx=off */
+ if (!boot_cpu_has(X86_FEATURE_RTM)) {
+ taa_mitigation = TAA_MITIGATION_TSX_DISABLED;
+ return;
+ }
+
+ /* Microcode will be checked in taa_update_mitigation(). */
+ if (taa_mitigation == TAA_MITIGATION_AUTO) {
+ if (should_mitigate_vuln(X86_BUG_TAA))
+ taa_mitigation = TAA_MITIGATION_VERW;
+ else
+ taa_mitigation = TAA_MITIGATION_OFF;
+ }
+
+ if (taa_mitigation != TAA_MITIGATION_OFF)
+ verw_clear_cpu_buf_mitigation_selected = true;
+}
+
+static void __init taa_update_mitigation(void)
+{
+ if (!taa_vulnerable())
+ return;
+
+ if (verw_clear_cpu_buf_mitigation_selected)
+ taa_mitigation = TAA_MITIGATION_VERW;
+
+ if (taa_mitigation == TAA_MITIGATION_VERW) {
+ /* Check if the requisite ucode is available. */
+ if (!boot_cpu_has(X86_FEATURE_MD_CLEAR))
+ taa_mitigation = TAA_MITIGATION_UCODE_NEEDED;
+
+ /*
+ * VERW doesn't clear the CPU buffers when MD_CLEAR=1 and MDS_NO=1.
+ * A microcode update fixes this behavior to clear CPU buffers. It also
+ * adds support for MSR_IA32_TSX_CTRL which is enumerated by the
+ * ARCH_CAP_TSX_CTRL_MSR bit.
+ *
+ * On MDS_NO=1 CPUs if ARCH_CAP_TSX_CTRL_MSR is not set, microcode
+ * update is required.
+ */
+ if ((x86_arch_cap_msr & ARCH_CAP_MDS_NO) &&
+ !(x86_arch_cap_msr & ARCH_CAP_TSX_CTRL_MSR))
+ taa_mitigation = TAA_MITIGATION_UCODE_NEEDED;
+ }
+
+ pr_info("%s\n", taa_strings[taa_mitigation]);
+}
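
The microcode check above reduces to a small predicate over two enumeration bits. A self-contained model, where the bit positions are stand-ins rather than the real ARCH_CAP_* encodings:

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/* Stand-in bit positions; the real ARCH_CAP_* values are defined elsewhere. */
	#define CAP_MDS_NO		(1ull << 0)
	#define CAP_TSX_CTRL_MSR	(1ull << 1)

	/* VERW-based TAA mitigation needs new ucode on MDS_NO parts without TSX_CTRL. */
	static bool taa_needs_ucode(uint64_t arch_cap, bool has_md_clear)
	{
		if (!has_md_clear)
			return true;
		return (arch_cap & CAP_MDS_NO) && !(arch_cap & CAP_TSX_CTRL_MSR);
	}

	int main(void)
	{
		printf("%d\n", taa_needs_ucode(CAP_MDS_NO, true));			/* 1: ucode needed */
		printf("%d\n", taa_needs_ucode(CAP_MDS_NO | CAP_TSX_CTRL_MSR, true));	/* 0: ok */
		return 0;
	}
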
+
+static void __init taa_apply_mitigation(void)
+{
+ if (taa_mitigation == TAA_MITIGATION_VERW ||
+ taa_mitigation == TAA_MITIGATION_UCODE_NEEDED) {
+ /*
+ * TSX is enabled, select alternate mitigation for TAA which is
+ * the same as MDS. Enable MDS static branch to clear CPU buffers.
+ *
+ * For guests that can't determine whether the correct microcode is
+ * present on the host, enable the mitigation for UCODE_NEEDED as well.
+ */
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF);
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM);
+
+ if (taa_nosmt || smt_mitigations == SMT_MITIGATIONS_ON)
+ cpu_smt_disable(false);
+ }
+}
+
+static int __init tsx_async_abort_parse_cmdline(char *str)
+{
+ if (!boot_cpu_has_bug(X86_BUG_TAA))
+ return 0;
+
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "off")) {
+ taa_mitigation = TAA_MITIGATION_OFF;
+ } else if (!strcmp(str, "full")) {
+ taa_mitigation = TAA_MITIGATION_VERW;
+ } else if (!strcmp(str, "full,nosmt")) {
+ taa_mitigation = TAA_MITIGATION_VERW;
+ taa_nosmt = true;
+ }
+
+ return 0;
+}
+early_param("tsx_async_abort", tsx_async_abort_parse_cmdline);
+
+#undef pr_fmt
+#define pr_fmt(fmt) "MMIO Stale Data: " fmt
+
+static bool mmio_nosmt __ro_after_init = false;
+
+static const char * const mmio_strings[] = {
+ [MMIO_MITIGATION_OFF] = "Vulnerable",
+ [MMIO_MITIGATION_UCODE_NEEDED] = "Vulnerable: Clear CPU buffers attempted, no microcode",
+ [MMIO_MITIGATION_VERW] = "Mitigation: Clear CPU buffers",
+};
+
+static void __init mmio_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA)) {
+ mmio_mitigation = MMIO_MITIGATION_OFF;
+ return;
+ }
+
+ /* Microcode will be checked in mmio_update_mitigation(). */
+ if (mmio_mitigation == MMIO_MITIGATION_AUTO) {
+ if (should_mitigate_vuln(X86_BUG_MMIO_STALE_DATA))
+ mmio_mitigation = MMIO_MITIGATION_VERW;
+ else
+ mmio_mitigation = MMIO_MITIGATION_OFF;
+ }
+
+ if (mmio_mitigation == MMIO_MITIGATION_OFF)
+ return;
+
+ /*
+ * Enable CPU buffer clear mitigation for host and VMM, if also affected
+ * by MDS or TAA.
+ */
+ if (boot_cpu_has_bug(X86_BUG_MDS) || taa_vulnerable())
+ verw_clear_cpu_buf_mitigation_selected = true;
+}
+
+static void __init mmio_update_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA))
+ return;
+
+ if (verw_clear_cpu_buf_mitigation_selected)
+ mmio_mitigation = MMIO_MITIGATION_VERW;
+
+ if (mmio_mitigation == MMIO_MITIGATION_VERW) {
+ /*
+ * Check if the system has the right microcode.
+ *
+ * CPU Fill buffer clear mitigation is enumerated by either an explicit
+ * FB_CLEAR or by the presence of both MD_CLEAR and L1D_FLUSH on MDS
+ * affected systems.
+ */
+ if (!((x86_arch_cap_msr & ARCH_CAP_FB_CLEAR) ||
+ (boot_cpu_has(X86_FEATURE_MD_CLEAR) &&
+ boot_cpu_has(X86_FEATURE_FLUSH_L1D) &&
+ !(x86_arch_cap_msr & ARCH_CAP_MDS_NO))))
+ mmio_mitigation = MMIO_MITIGATION_UCODE_NEEDED;
+ }
+
+ pr_info("%s\n", mmio_strings[mmio_mitigation]);
+}
+
+static void __init mmio_apply_mitigation(void)
+{
+ if (mmio_mitigation == MMIO_MITIGATION_OFF)
+ return;
+
+ /*
+ * Only enable the MMIO-specific VMM mitigation if the full CPU buffer
+ * clear mitigation is not already in use.
+ */
+ if (verw_clear_cpu_buf_mitigation_selected) {
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF);
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM);
+ } else {
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO);
+ }
+
+ /*
+ * If the Processor MMIO Stale Data bug is present and Fill Buffer data can
+ * be propagated to uncore buffers, clearing the Fill buffers on idle
+ * is required irrespective of SMT state.
+ */
+ if (!(x86_arch_cap_msr & ARCH_CAP_FBSDP_NO))
+ static_branch_enable(&cpu_buf_idle_clear);
+
+ if (mmio_nosmt || smt_mitigations == SMT_MITIGATIONS_ON)
+ cpu_smt_disable(false);
+}
+
+static int __init mmio_stale_data_parse_cmdline(char *str)
+{
+ if (!boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA))
+ return 0;
+
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "off")) {
+ mmio_mitigation = MMIO_MITIGATION_OFF;
+ } else if (!strcmp(str, "full")) {
+ mmio_mitigation = MMIO_MITIGATION_VERW;
+ } else if (!strcmp(str, "full,nosmt")) {
+ mmio_mitigation = MMIO_MITIGATION_VERW;
+ mmio_nosmt = true;
+ }
+
+ return 0;
+}
+early_param("mmio_stale_data", mmio_stale_data_parse_cmdline);
+
+#undef pr_fmt
+#define pr_fmt(fmt) "Register File Data Sampling: " fmt
+
+static const char * const rfds_strings[] = {
+ [RFDS_MITIGATION_OFF] = "Vulnerable",
+ [RFDS_MITIGATION_VERW] = "Mitigation: Clear Register File",
+ [RFDS_MITIGATION_UCODE_NEEDED] = "Vulnerable: No microcode",
+};
+
+static inline bool __init verw_clears_cpu_reg_file(void)
+{
+ return (x86_arch_cap_msr & ARCH_CAP_RFDS_CLEAR);
+}
+
+static void __init rfds_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_RFDS)) {
+ rfds_mitigation = RFDS_MITIGATION_OFF;
+ return;
+ }
+
+ if (rfds_mitigation == RFDS_MITIGATION_AUTO) {
+ if (should_mitigate_vuln(X86_BUG_RFDS))
+ rfds_mitigation = RFDS_MITIGATION_VERW;
+ else
+ rfds_mitigation = RFDS_MITIGATION_OFF;
+ }
+
+ if (rfds_mitigation == RFDS_MITIGATION_OFF)
+ return;
+
+ if (verw_clears_cpu_reg_file())
+ verw_clear_cpu_buf_mitigation_selected = true;
+}
+
+static void __init rfds_update_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_RFDS))
+ return;
+
+ if (verw_clear_cpu_buf_mitigation_selected)
+ rfds_mitigation = RFDS_MITIGATION_VERW;
+
+ if (rfds_mitigation == RFDS_MITIGATION_VERW) {
+ if (!verw_clears_cpu_reg_file())
+ rfds_mitigation = RFDS_MITIGATION_UCODE_NEEDED;
+ }
+
+ pr_info("%s\n", rfds_strings[rfds_mitigation]);
+}
+
+static void __init rfds_apply_mitigation(void)
+{
+ if (rfds_mitigation == RFDS_MITIGATION_VERW) {
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF);
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM);
+ }
+}
+
+static __init int rfds_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!boot_cpu_has_bug(X86_BUG_RFDS))
+ return 0;
+
+ if (!strcmp(str, "off"))
+ rfds_mitigation = RFDS_MITIGATION_OFF;
+ else if (!strcmp(str, "on"))
+ rfds_mitigation = RFDS_MITIGATION_VERW;
+
+ return 0;
+}
+early_param("reg_file_data_sampling", rfds_parse_cmdline);
+
+#undef pr_fmt
+#define pr_fmt(fmt) "SRBDS: " fmt
+
+enum srbds_mitigations {
+ SRBDS_MITIGATION_OFF,
+ SRBDS_MITIGATION_AUTO,
+ SRBDS_MITIGATION_UCODE_NEEDED,
+ SRBDS_MITIGATION_FULL,
+ SRBDS_MITIGATION_TSX_OFF,
+ SRBDS_MITIGATION_HYPERVISOR,
+};
+
+static enum srbds_mitigations srbds_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_SRBDS) ? SRBDS_MITIGATION_AUTO : SRBDS_MITIGATION_OFF;
+
+static const char * const srbds_strings[] = {
+ [SRBDS_MITIGATION_OFF] = "Vulnerable",
+ [SRBDS_MITIGATION_UCODE_NEEDED] = "Vulnerable: No microcode",
+ [SRBDS_MITIGATION_FULL] = "Mitigation: Microcode",
+ [SRBDS_MITIGATION_TSX_OFF] = "Mitigation: TSX disabled",
+ [SRBDS_MITIGATION_HYPERVISOR] = "Unknown: Dependent on hypervisor status",
+};
+
+static bool srbds_off;
+
+void update_srbds_msr(void)
+{
+ u64 mcu_ctrl;
+
+ if (!boot_cpu_has_bug(X86_BUG_SRBDS))
+ return;
+
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ return;
+
+ if (srbds_mitigation == SRBDS_MITIGATION_UCODE_NEEDED)
+ return;
+
+ /*
+ * An MDS_NO CPU for which the SRBDS mitigation is not needed (TSX is
+ * disabled) may not have received the microcode that adds the SRBDS MSR.
+ */
+ if (!boot_cpu_has(X86_FEATURE_SRBDS_CTRL))
+ return;
+
+ rdmsrq(MSR_IA32_MCU_OPT_CTRL, mcu_ctrl);
+
+ switch (srbds_mitigation) {
+ case SRBDS_MITIGATION_OFF:
+ case SRBDS_MITIGATION_TSX_OFF:
+ mcu_ctrl |= RNGDS_MITG_DIS;
+ break;
+ case SRBDS_MITIGATION_FULL:
+ mcu_ctrl &= ~RNGDS_MITG_DIS;
+ break;
+ default:
+ break;
+ }
+
+ wrmsrq(MSR_IA32_MCU_OPT_CTRL, mcu_ctrl);
+}
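
update_srbds_msr() is a read-modify-write of MSR_IA32_MCU_OPT_CTRL. The same register can be inspected from user space through the msr driver; a hedged sketch follows, where the 0x123 MSR address and the bit-0 position of RNGDS_MITG_DIS are assumptions to verify against the SDM:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	#define MCU_OPT_CTRL	0x123		/* MSR_IA32_MCU_OPT_CTRL (assumed) */
	#define RNGDS_MITG_DIS	(1ull << 0)	/* SRBDS mitigation disable bit (assumed) */

	int main(void)
	{
		uint64_t val;
		int fd = open("/dev/cpu/0/msr", O_RDONLY);	/* needs the msr module + root */

		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (pread(fd, &val, sizeof(val), MCU_OPT_CTRL) != sizeof(val)) {
			perror("pread");
			return 1;
		}
		printf("SRBDS mitigation %s\n",
		       (val & RNGDS_MITG_DIS) ? "disabled" : "enabled");
		close(fd);
		return 0;
	}
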
+
+static void __init srbds_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_SRBDS)) {
+ srbds_mitigation = SRBDS_MITIGATION_OFF;
+ return;
+ }
+
+ if (srbds_mitigation == SRBDS_MITIGATION_AUTO) {
+ if (should_mitigate_vuln(X86_BUG_SRBDS))
+ srbds_mitigation = SRBDS_MITIGATION_FULL;
+ else {
+ srbds_mitigation = SRBDS_MITIGATION_OFF;
+ return;
+ }
+ }
+
+ /*
+ * Check to see if this is one of the MDS_NO systems supporting TSX that
+ * are only exposed to SRBDS when TSX is enabled or when CPU is affected
+ * by Processor MMIO Stale Data vulnerability.
+ */
+ if ((x86_arch_cap_msr & ARCH_CAP_MDS_NO) && !boot_cpu_has(X86_FEATURE_RTM) &&
+ !boot_cpu_has_bug(X86_BUG_MMIO_STALE_DATA))
+ srbds_mitigation = SRBDS_MITIGATION_TSX_OFF;
+ else if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ srbds_mitigation = SRBDS_MITIGATION_HYPERVISOR;
+ else if (!boot_cpu_has(X86_FEATURE_SRBDS_CTRL))
+ srbds_mitigation = SRBDS_MITIGATION_UCODE_NEEDED;
+ else if (srbds_off)
+ srbds_mitigation = SRBDS_MITIGATION_OFF;
+
+ pr_info("%s\n", srbds_strings[srbds_mitigation]);
+}
+
+static void __init srbds_apply_mitigation(void)
+{
+ update_srbds_msr();
+}
+
+static int __init srbds_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!boot_cpu_has_bug(X86_BUG_SRBDS))
+ return 0;
+
+ srbds_off = !strcmp(str, "off");
+ return 0;
+}
+early_param("srbds", srbds_parse_cmdline);
+
+#undef pr_fmt
+#define pr_fmt(fmt) "L1D Flush : " fmt
+
+enum l1d_flush_mitigations {
+ L1D_FLUSH_OFF = 0,
+ L1D_FLUSH_ON,
+};
- if (!boot_cpu_data.hard_math) {
-#ifndef CONFIG_MATH_EMULATION
- printk(KERN_EMERG "No coprocessor found and no math emulation present.\n");
- printk(KERN_EMERG "Giving up.\n");
- for (;;) ;
+static enum l1d_flush_mitigations l1d_flush_mitigation __initdata = L1D_FLUSH_OFF;
+
+static void __init l1d_flush_select_mitigation(void)
+{
+ if (!l1d_flush_mitigation || !boot_cpu_has(X86_FEATURE_FLUSH_L1D))
+ return;
+
+ static_branch_enable(&switch_mm_cond_l1d_flush);
+ pr_info("Conditional flush on switch_mm() enabled\n");
+}
+
+static int __init l1d_flush_parse_cmdline(char *str)
+{
+ if (!strcmp(str, "on"))
+ l1d_flush_mitigation = L1D_FLUSH_ON;
+
+ return 0;
+}
+early_param("l1d_flush", l1d_flush_parse_cmdline);
+
+#undef pr_fmt
+#define pr_fmt(fmt) "GDS: " fmt
+
+enum gds_mitigations {
+ GDS_MITIGATION_OFF,
+ GDS_MITIGATION_AUTO,
+ GDS_MITIGATION_UCODE_NEEDED,
+ GDS_MITIGATION_FORCE,
+ GDS_MITIGATION_FULL,
+ GDS_MITIGATION_FULL_LOCKED,
+ GDS_MITIGATION_HYPERVISOR,
+};
+
+static enum gds_mitigations gds_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_GDS) ? GDS_MITIGATION_AUTO : GDS_MITIGATION_OFF;
+
+static const char * const gds_strings[] = {
+ [GDS_MITIGATION_OFF] = "Vulnerable",
+ [GDS_MITIGATION_UCODE_NEEDED] = "Vulnerable: No microcode",
+ [GDS_MITIGATION_FORCE] = "Mitigation: AVX disabled, no microcode",
+ [GDS_MITIGATION_FULL] = "Mitigation: Microcode",
+ [GDS_MITIGATION_FULL_LOCKED] = "Mitigation: Microcode (locked)",
+ [GDS_MITIGATION_HYPERVISOR] = "Unknown: Dependent on hypervisor status",
+};
+
+bool gds_ucode_mitigated(void)
+{
+ return (gds_mitigation == GDS_MITIGATION_FULL ||
+ gds_mitigation == GDS_MITIGATION_FULL_LOCKED);
+}
+EXPORT_SYMBOL_FOR_KVM(gds_ucode_mitigated);
+
+void update_gds_msr(void)
+{
+ u64 mcu_ctrl_after;
+ u64 mcu_ctrl;
+
+ switch (gds_mitigation) {
+ case GDS_MITIGATION_OFF:
+ rdmsrq(MSR_IA32_MCU_OPT_CTRL, mcu_ctrl);
+ mcu_ctrl |= GDS_MITG_DIS;
+ break;
+ case GDS_MITIGATION_FULL_LOCKED:
+ /*
+ * The LOCKED state comes from the boot CPU. APs might not have
+ * the same state. Make sure the mitigation is enabled on all
+ * CPUs.
+ */
+ case GDS_MITIGATION_FULL:
+ rdmsrq(MSR_IA32_MCU_OPT_CTRL, mcu_ctrl);
+ mcu_ctrl &= ~GDS_MITG_DIS;
+ break;
+ case GDS_MITIGATION_FORCE:
+ case GDS_MITIGATION_UCODE_NEEDED:
+ case GDS_MITIGATION_HYPERVISOR:
+ case GDS_MITIGATION_AUTO:
+ return;
+ }
+
+ wrmsrq(MSR_IA32_MCU_OPT_CTRL, mcu_ctrl);
+
+ /*
+ * Check to make sure that the WRMSR value was not ignored. Writes to
+ * GDS_MITG_DIS will be ignored if this processor is locked but the boot
+ * processor was not.
+ */
+ rdmsrq(MSR_IA32_MCU_OPT_CTRL, mcu_ctrl_after);
+ WARN_ON_ONCE(mcu_ctrl != mcu_ctrl_after);
+}
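
The WARN_ON_ONCE() readback above matters because on a GDS-locked part the WRMSR is silently ignored, so the only way to detect a dropped write is to read the register back. The pattern in the abstract:

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	static uint64_t fake_reg;
	static bool reg_locked;	/* models a part that ignores writes */

	static void reg_write(uint64_t val)
	{
		if (!reg_locked)
			fake_reg = val;
	}

	/* Write, then read back and report whether the value actually stuck. */
	static bool write_verified(uint64_t val)
	{
		reg_write(val);
		return fake_reg == val;
	}

	int main(void)
	{
		reg_locked = true;
		if (!write_verified(0x10))
			fprintf(stderr, "write ignored: register is locked\n");
		return 0;
	}
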
+
+static void __init gds_select_mitigation(void)
+{
+ u64 mcu_ctrl;
+
+ if (!boot_cpu_has_bug(X86_BUG_GDS))
+ return;
+
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
+ gds_mitigation = GDS_MITIGATION_HYPERVISOR;
+ return;
+ }
+
+ /* Will verify below that mitigation _can_ be disabled */
+ if (gds_mitigation == GDS_MITIGATION_AUTO) {
+ if (should_mitigate_vuln(X86_BUG_GDS))
+ gds_mitigation = GDS_MITIGATION_FULL;
+ else
+ gds_mitigation = GDS_MITIGATION_OFF;
+ }
+
+ /* No microcode */
+ if (!(x86_arch_cap_msr & ARCH_CAP_GDS_CTRL)) {
+ if (gds_mitigation != GDS_MITIGATION_FORCE)
+ gds_mitigation = GDS_MITIGATION_UCODE_NEEDED;
+ return;
+ }
+
+ /* Microcode has mitigation, use it */
+ if (gds_mitigation == GDS_MITIGATION_FORCE)
+ gds_mitigation = GDS_MITIGATION_FULL;
+
+ rdmsrq(MSR_IA32_MCU_OPT_CTRL, mcu_ctrl);
+ if (mcu_ctrl & GDS_MITG_LOCKED) {
+ if (gds_mitigation == GDS_MITIGATION_OFF)
+ pr_warn("Mitigation locked. Disable failed.\n");
+
+ /*
+ * The mitigation is selected from the boot CPU. All other CPUs
+ * _should_ have the same state. If the boot CPU isn't locked
+ * but others are, then update_gds_msr() will WARN() about the state
+ * mismatch. If the boot CPU is locked update_gds_msr() will
+ * ensure the other CPUs have the mitigation enabled.
+ */
+ gds_mitigation = GDS_MITIGATION_FULL_LOCKED;
+ }
+}
+
+static void __init gds_apply_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_GDS))
+ return;
+
+ /* Microcode is present */
+ if (x86_arch_cap_msr & ARCH_CAP_GDS_CTRL)
+ update_gds_msr();
+ else if (gds_mitigation == GDS_MITIGATION_FORCE) {
+ /*
+ * This only needs to be done on the boot CPU, so do it
+ * here rather than in update_gds_msr().
+ */
+ setup_clear_cpu_cap(X86_FEATURE_AVX);
+ pr_warn("Microcode update needed! Disabling AVX as mitigation.\n");
+ }
+
+ pr_info("%s\n", gds_strings[gds_mitigation]);
+}
+
+static int __init gds_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!boot_cpu_has_bug(X86_BUG_GDS))
+ return 0;
+
+ if (!strcmp(str, "off"))
+ gds_mitigation = GDS_MITIGATION_OFF;
+ else if (!strcmp(str, "force"))
+ gds_mitigation = GDS_MITIGATION_FORCE;
+
+ return 0;
+}
+early_param("gather_data_sampling", gds_parse_cmdline);
+
+#undef pr_fmt
+#define pr_fmt(fmt) "Spectre V1 : " fmt
+
+enum spectre_v1_mitigation {
+ SPECTRE_V1_MITIGATION_NONE,
+ SPECTRE_V1_MITIGATION_AUTO,
+};
+
+static enum spectre_v1_mitigation spectre_v1_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_SPECTRE_V1) ?
+ SPECTRE_V1_MITIGATION_AUTO : SPECTRE_V1_MITIGATION_NONE;
+
+static const char * const spectre_v1_strings[] = {
+ [SPECTRE_V1_MITIGATION_NONE] = "Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers",
+ [SPECTRE_V1_MITIGATION_AUTO] = "Mitigation: usercopy/swapgs barriers and __user pointer sanitization",
+};
+
+/*
+ * Does SMAP provide full mitigation against speculative kernel access to
+ * userspace?
+ */
+static bool smap_works_speculatively(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_SMAP))
+ return false;
+
+ /*
+ * On CPUs which are vulnerable to Meltdown, SMAP does not
+ * prevent speculative access to user data in the L1 cache.
+ * Consider SMAP to be non-functional as a mitigation on these
+ * CPUs.
+ */
+ if (boot_cpu_has(X86_BUG_CPU_MELTDOWN))
+ return false;
+
+ return true;
+}
+
+static void __init spectre_v1_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V1))
+ spectre_v1_mitigation = SPECTRE_V1_MITIGATION_NONE;
+
+ if (!should_mitigate_vuln(X86_BUG_SPECTRE_V1))
+ spectre_v1_mitigation = SPECTRE_V1_MITIGATION_NONE;
+}
+
+static void __init spectre_v1_apply_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V1))
+ return;
+
+ if (spectre_v1_mitigation == SPECTRE_V1_MITIGATION_AUTO) {
+ /*
+ * With Spectre v1, a user can speculatively control either
+ * path of a conditional swapgs with a user-controlled GS
+ * value. The mitigation is to add lfences to both code paths.
+ *
+ * If FSGSBASE is enabled, the user can put a kernel address in
+ * GS, in which case SMAP provides no protection.
+ *
+ * If FSGSBASE is disabled, the user can only put a user space
+ * address in GS. That makes an attack harder, but still
+ * possible if there's no SMAP protection.
+ */
+ if (boot_cpu_has(X86_FEATURE_FSGSBASE) ||
+ !smap_works_speculatively()) {
+ /*
+ * Mitigation can be provided from SWAPGS itself or
+ * PTI as the CR3 write in the Meltdown mitigation
+ * is serializing.
+ *
+ * If neither is there, mitigate with an LFENCE to
+ * stop speculation through swapgs.
+ */
+ if (boot_cpu_has_bug(X86_BUG_SWAPGS) &&
+ !boot_cpu_has(X86_FEATURE_PTI))
+ setup_force_cpu_cap(X86_FEATURE_FENCE_SWAPGS_USER);
+
+ /*
+ * Enable lfences in the kernel entry (non-swapgs)
+ * paths, to prevent user entry from speculatively
+ * skipping swapgs.
+ */
+ setup_force_cpu_cap(X86_FEATURE_FENCE_SWAPGS_KERNEL);
+ }
+ }
+
+ pr_info("%s\n", spectre_v1_strings[spectre_v1_mitigation]);
+}
+
+static int __init nospectre_v1_cmdline(char *str)
+{
+ spectre_v1_mitigation = SPECTRE_V1_MITIGATION_NONE;
+ return 0;
+}
+early_param("nospectre_v1", nospectre_v1_cmdline);
+
+enum spectre_v2_mitigation spectre_v2_enabled __ro_after_init = SPECTRE_V2_NONE;
+
+/* Depends on spectre_v2 mitigation selected already */
+static inline bool cdt_possible(enum spectre_v2_mitigation mode)
+{
+ if (!IS_ENABLED(CONFIG_MITIGATION_CALL_DEPTH_TRACKING) ||
+ !IS_ENABLED(CONFIG_MITIGATION_RETPOLINE))
+ return false;
+
+ if (mode == SPECTRE_V2_RETPOLINE ||
+ mode == SPECTRE_V2_EIBRS_RETPOLINE)
+ return true;
+
+ return false;
+}
+
+#undef pr_fmt
+#define pr_fmt(fmt) "RETBleed: " fmt
+
+enum its_mitigation {
+ ITS_MITIGATION_OFF,
+ ITS_MITIGATION_AUTO,
+ ITS_MITIGATION_VMEXIT_ONLY,
+ ITS_MITIGATION_ALIGNED_THUNKS,
+ ITS_MITIGATION_RETPOLINE_STUFF,
+};
+
+static enum its_mitigation its_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_ITS) ? ITS_MITIGATION_AUTO : ITS_MITIGATION_OFF;
+
+enum retbleed_mitigation {
+ RETBLEED_MITIGATION_NONE,
+ RETBLEED_MITIGATION_AUTO,
+ RETBLEED_MITIGATION_UNRET,
+ RETBLEED_MITIGATION_IBPB,
+ RETBLEED_MITIGATION_IBRS,
+ RETBLEED_MITIGATION_EIBRS,
+ RETBLEED_MITIGATION_STUFF,
+};
+
+static const char * const retbleed_strings[] = {
+ [RETBLEED_MITIGATION_NONE] = "Vulnerable",
+ [RETBLEED_MITIGATION_UNRET] = "Mitigation: untrained return thunk",
+ [RETBLEED_MITIGATION_IBPB] = "Mitigation: IBPB",
+ [RETBLEED_MITIGATION_IBRS] = "Mitigation: IBRS",
+ [RETBLEED_MITIGATION_EIBRS] = "Mitigation: Enhanced IBRS",
+ [RETBLEED_MITIGATION_STUFF] = "Mitigation: Stuffing",
+};
+
+static enum retbleed_mitigation retbleed_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_RETBLEED) ? RETBLEED_MITIGATION_AUTO : RETBLEED_MITIGATION_NONE;
+
+static int __ro_after_init retbleed_nosmt = false;
+
+enum srso_mitigation {
+ SRSO_MITIGATION_NONE,
+ SRSO_MITIGATION_AUTO,
+ SRSO_MITIGATION_UCODE_NEEDED,
+ SRSO_MITIGATION_SAFE_RET_UCODE_NEEDED,
+ SRSO_MITIGATION_MICROCODE,
+ SRSO_MITIGATION_NOSMT,
+ SRSO_MITIGATION_SAFE_RET,
+ SRSO_MITIGATION_IBPB,
+ SRSO_MITIGATION_IBPB_ON_VMEXIT,
+ SRSO_MITIGATION_BP_SPEC_REDUCE,
+};
+
+static enum srso_mitigation srso_mitigation __ro_after_init = SRSO_MITIGATION_AUTO;
+
+static int __init retbleed_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ while (str) {
+ char *next = strchr(str, ',');
+ if (next) {
+ *next = 0;
+ next++;
+ }
+
+ if (!strcmp(str, "off")) {
+ retbleed_mitigation = RETBLEED_MITIGATION_NONE;
+ } else if (!strcmp(str, "auto")) {
+ retbleed_mitigation = RETBLEED_MITIGATION_AUTO;
+ } else if (!strcmp(str, "unret")) {
+ retbleed_mitigation = RETBLEED_MITIGATION_UNRET;
+ } else if (!strcmp(str, "ibpb")) {
+ retbleed_mitigation = RETBLEED_MITIGATION_IBPB;
+ } else if (!strcmp(str, "stuff")) {
+ retbleed_mitigation = RETBLEED_MITIGATION_STUFF;
+ } else if (!strcmp(str, "nosmt")) {
+ retbleed_nosmt = true;
+ } else if (!strcmp(str, "force")) {
+ setup_force_cpu_bug(X86_BUG_RETBLEED);
+ } else {
+ pr_err("Ignoring unknown retbleed option (%s).", str);
+ }
+
+ str = next;
+ }
+
+ return 0;
+}
+early_param("retbleed", retbleed_parse_cmdline);
+
+#define RETBLEED_UNTRAIN_MSG "WARNING: BTB untrained return thunk mitigation is only effective on AMD/Hygon!\n"
+#define RETBLEED_INTEL_MSG "WARNING: Spectre v2 mitigation leaves CPU vulnerable to RETBleed attacks, data leaks possible!\n"
+
+static void __init retbleed_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_RETBLEED)) {
+ retbleed_mitigation = RETBLEED_MITIGATION_NONE;
+ return;
+ }
+
+ switch (retbleed_mitigation) {
+ case RETBLEED_MITIGATION_UNRET:
+ if (!IS_ENABLED(CONFIG_MITIGATION_UNRET_ENTRY)) {
+ retbleed_mitigation = RETBLEED_MITIGATION_AUTO;
+ pr_err("WARNING: kernel not compiled with MITIGATION_UNRET_ENTRY.\n");
+ }
+ break;
+ case RETBLEED_MITIGATION_IBPB:
+ if (!boot_cpu_has(X86_FEATURE_IBPB)) {
+ pr_err("WARNING: CPU does not support IBPB.\n");
+ retbleed_mitigation = RETBLEED_MITIGATION_AUTO;
+ } else if (!IS_ENABLED(CONFIG_MITIGATION_IBPB_ENTRY)) {
+ pr_err("WARNING: kernel not compiled with MITIGATION_IBPB_ENTRY.\n");
+ retbleed_mitigation = RETBLEED_MITIGATION_AUTO;
+ }
+ break;
+ case RETBLEED_MITIGATION_STUFF:
+ if (!IS_ENABLED(CONFIG_MITIGATION_CALL_DEPTH_TRACKING)) {
+ pr_err("WARNING: kernel not compiled with MITIGATION_CALL_DEPTH_TRACKING.\n");
+ retbleed_mitigation = RETBLEED_MITIGATION_AUTO;
+ } else if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) {
+ pr_err("WARNING: retbleed=stuff only supported for Intel CPUs.\n");
+ retbleed_mitigation = RETBLEED_MITIGATION_AUTO;
+ }
+ break;
+ default:
+ break;
+ }
+
+ if (retbleed_mitigation != RETBLEED_MITIGATION_AUTO)
+ return;
+
+ if (!should_mitigate_vuln(X86_BUG_RETBLEED)) {
+ retbleed_mitigation = RETBLEED_MITIGATION_NONE;
+ return;
+ }
+
+ /* Intel mitigation selected in retbleed_update_mitigation() */
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
+ boot_cpu_data.x86_vendor == X86_VENDOR_HYGON) {
+ if (IS_ENABLED(CONFIG_MITIGATION_UNRET_ENTRY))
+ retbleed_mitigation = RETBLEED_MITIGATION_UNRET;
+ else if (IS_ENABLED(CONFIG_MITIGATION_IBPB_ENTRY) &&
+ boot_cpu_has(X86_FEATURE_IBPB))
+ retbleed_mitigation = RETBLEED_MITIGATION_IBPB;
+ else
+ retbleed_mitigation = RETBLEED_MITIGATION_NONE;
+ } else if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
+ /* Final mitigation depends on spectre-v2 selection */
+ if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED))
+ retbleed_mitigation = RETBLEED_MITIGATION_EIBRS;
+ else if (boot_cpu_has(X86_FEATURE_IBRS))
+ retbleed_mitigation = RETBLEED_MITIGATION_IBRS;
+ else
+ retbleed_mitigation = RETBLEED_MITIGATION_NONE;
+ }
+}
+
+static void __init retbleed_update_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_RETBLEED))
+ return;
+
+ /* ITS can also enable stuffing */
+ if (its_mitigation == ITS_MITIGATION_RETPOLINE_STUFF)
+ retbleed_mitigation = RETBLEED_MITIGATION_STUFF;
+
+ /* If SRSO is using IBPB, that works for retbleed too */
+ if (srso_mitigation == SRSO_MITIGATION_IBPB)
+ retbleed_mitigation = RETBLEED_MITIGATION_IBPB;
+
+ if (retbleed_mitigation == RETBLEED_MITIGATION_STUFF &&
+ !cdt_possible(spectre_v2_enabled)) {
+ pr_err("WARNING: retbleed=stuff depends on retpoline\n");
+ retbleed_mitigation = RETBLEED_MITIGATION_NONE;
+ }
+
+ /*
+ * Let IBRS trump all on Intel without affecting the effects of the
+ * retbleed= cmdline option except for call depth based stuffing
+ */
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
+ switch (spectre_v2_enabled) {
+ case SPECTRE_V2_IBRS:
+ retbleed_mitigation = RETBLEED_MITIGATION_IBRS;
+ break;
+ case SPECTRE_V2_EIBRS:
+ case SPECTRE_V2_EIBRS_RETPOLINE:
+ case SPECTRE_V2_EIBRS_LFENCE:
+ retbleed_mitigation = RETBLEED_MITIGATION_EIBRS;
+ break;
+ default:
+ if (retbleed_mitigation != RETBLEED_MITIGATION_STUFF) {
+ if (retbleed_mitigation != RETBLEED_MITIGATION_NONE)
+ pr_err(RETBLEED_INTEL_MSG);
+
+ retbleed_mitigation = RETBLEED_MITIGATION_NONE;
+ }
+ }
+ }
+
+ pr_info("%s\n", retbleed_strings[retbleed_mitigation]);
+}
+
+static void __init retbleed_apply_mitigation(void)
+{
+ bool mitigate_smt = false;
+
+ switch (retbleed_mitigation) {
+ case RETBLEED_MITIGATION_NONE:
+ return;
+
+ case RETBLEED_MITIGATION_UNRET:
+ setup_force_cpu_cap(X86_FEATURE_RETHUNK);
+ setup_force_cpu_cap(X86_FEATURE_UNRET);
+
+ set_return_thunk(retbleed_return_thunk);
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
+ boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
+ pr_err(RETBLEED_UNTRAIN_MSG);
+
+ mitigate_smt = true;
+ break;
+
+ case RETBLEED_MITIGATION_IBPB:
+ setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB);
+ setup_force_cpu_cap(X86_FEATURE_IBPB_ON_VMEXIT);
+ mitigate_smt = true;
+
+ /*
+ * IBPB on entry already obviates the need for
+ * software-based untraining so clear those in case some
+ * other mitigation like SRSO has selected them.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_UNRET);
+ setup_clear_cpu_cap(X86_FEATURE_RETHUNK);
+
+ /*
+ * There is no need for RSB filling: write_ibpb() ensures
+ * all predictions, including the RSB, are invalidated,
+ * regardless of IBPB implementation.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_RSB_VMEXIT);
+
+ break;
+
+ case RETBLEED_MITIGATION_STUFF:
+ setup_force_cpu_cap(X86_FEATURE_RETHUNK);
+ setup_force_cpu_cap(X86_FEATURE_CALL_DEPTH);
+
+ set_return_thunk(call_depth_return_thunk);
+ break;
+
+ default:
+ break;
+ }
+
+ if (mitigate_smt && !boot_cpu_has(X86_FEATURE_STIBP) &&
+ (retbleed_nosmt || smt_mitigations == SMT_MITIGATIONS_ON))
+ cpu_smt_disable(false);
+}
+
+#undef pr_fmt
+#define pr_fmt(fmt) "ITS: " fmt
+
+static const char * const its_strings[] = {
+ [ITS_MITIGATION_OFF] = "Vulnerable",
+ [ITS_MITIGATION_VMEXIT_ONLY] = "Mitigation: Vulnerable, KVM: Not affected",
+ [ITS_MITIGATION_ALIGNED_THUNKS] = "Mitigation: Aligned branch/return thunks",
+ [ITS_MITIGATION_RETPOLINE_STUFF] = "Mitigation: Retpolines, Stuffing RSB",
+};
+
+static int __init its_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!IS_ENABLED(CONFIG_MITIGATION_ITS)) {
+ pr_err("Mitigation disabled at compile time, ignoring option (%s)", str);
+ return 0;
+ }
+
+ if (!strcmp(str, "off")) {
+ its_mitigation = ITS_MITIGATION_OFF;
+ } else if (!strcmp(str, "on")) {
+ its_mitigation = ITS_MITIGATION_ALIGNED_THUNKS;
+ } else if (!strcmp(str, "force")) {
+ its_mitigation = ITS_MITIGATION_ALIGNED_THUNKS;
+ setup_force_cpu_bug(X86_BUG_ITS);
+ } else if (!strcmp(str, "vmexit")) {
+ its_mitigation = ITS_MITIGATION_VMEXIT_ONLY;
+ } else if (!strcmp(str, "stuff")) {
+ its_mitigation = ITS_MITIGATION_RETPOLINE_STUFF;
+ } else {
+ pr_err("Ignoring unknown indirect_target_selection option (%s).", str);
+ }
+
+ return 0;
+}
+early_param("indirect_target_selection", its_parse_cmdline);
+
+static void __init its_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_ITS)) {
+ its_mitigation = ITS_MITIGATION_OFF;
+ return;
+ }
+
+ if (its_mitigation == ITS_MITIGATION_AUTO) {
+ if (should_mitigate_vuln(X86_BUG_ITS))
+ its_mitigation = ITS_MITIGATION_ALIGNED_THUNKS;
+ else
+ its_mitigation = ITS_MITIGATION_OFF;
+ }
+
+ if (its_mitigation == ITS_MITIGATION_OFF)
+ return;
+
+ if (!IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) ||
+ !IS_ENABLED(CONFIG_MITIGATION_RETHUNK)) {
+ pr_err("WARNING: ITS mitigation depends on retpoline and rethunk support\n");
+ its_mitigation = ITS_MITIGATION_OFF;
+ return;
+ }
+
+ if (IS_ENABLED(CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B)) {
+ pr_err("WARNING: ITS mitigation is not compatible with CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B\n");
+ its_mitigation = ITS_MITIGATION_OFF;
+ return;
+ }
+
+ if (its_mitigation == ITS_MITIGATION_RETPOLINE_STUFF &&
+ !IS_ENABLED(CONFIG_MITIGATION_CALL_DEPTH_TRACKING)) {
+ pr_err("RSB stuff mitigation not supported, using default\n");
+ its_mitigation = ITS_MITIGATION_ALIGNED_THUNKS;
+ }
+
+ if (its_mitigation == ITS_MITIGATION_VMEXIT_ONLY &&
+ !boot_cpu_has_bug(X86_BUG_ITS_NATIVE_ONLY))
+ its_mitigation = ITS_MITIGATION_ALIGNED_THUNKS;
+}
+
+static void __init its_update_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_ITS))
+ return;
+
+ switch (spectre_v2_enabled) {
+ case SPECTRE_V2_NONE:
+ if (its_mitigation != ITS_MITIGATION_OFF)
+ pr_err("WARNING: Spectre-v2 mitigation is off, disabling ITS\n");
+ its_mitigation = ITS_MITIGATION_OFF;
+ break;
+ case SPECTRE_V2_RETPOLINE:
+ case SPECTRE_V2_EIBRS_RETPOLINE:
+ /* Retpoline+CDT mitigates ITS */
+ if (retbleed_mitigation == RETBLEED_MITIGATION_STUFF)
+ its_mitigation = ITS_MITIGATION_RETPOLINE_STUFF;
+ break;
+ case SPECTRE_V2_LFENCE:
+ case SPECTRE_V2_EIBRS_LFENCE:
+ pr_err("WARNING: ITS mitigation is not compatible with lfence mitigation\n");
+ its_mitigation = ITS_MITIGATION_OFF;
+ break;
+ default:
+ break;
+ }
+
+ if (its_mitigation == ITS_MITIGATION_RETPOLINE_STUFF &&
+ !cdt_possible(spectre_v2_enabled))
+ its_mitigation = ITS_MITIGATION_ALIGNED_THUNKS;
+
+ pr_info("%s\n", its_strings[its_mitigation]);
+}
+
+static void __init its_apply_mitigation(void)
+{
+ switch (its_mitigation) {
+ case ITS_MITIGATION_OFF:
+ case ITS_MITIGATION_AUTO:
+ case ITS_MITIGATION_VMEXIT_ONLY:
+ break;
+ case ITS_MITIGATION_ALIGNED_THUNKS:
+ if (!boot_cpu_has(X86_FEATURE_RETPOLINE))
+ setup_force_cpu_cap(X86_FEATURE_INDIRECT_THUNK_ITS);
+
+ setup_force_cpu_cap(X86_FEATURE_RETHUNK);
+ set_return_thunk(its_return_thunk);
+ break;
+ case ITS_MITIGATION_RETPOLINE_STUFF:
+ setup_force_cpu_cap(X86_FEATURE_RETHUNK);
+ setup_force_cpu_cap(X86_FEATURE_CALL_DEPTH);
+ set_return_thunk(call_depth_return_thunk);
+ break;
+ }
+}
+
+#undef pr_fmt
+#define pr_fmt(fmt) "Transient Scheduler Attacks: " fmt
+
+enum tsa_mitigations {
+ TSA_MITIGATION_NONE,
+ TSA_MITIGATION_AUTO,
+ TSA_MITIGATION_UCODE_NEEDED,
+ TSA_MITIGATION_USER_KERNEL,
+ TSA_MITIGATION_VM,
+ TSA_MITIGATION_FULL,
+};
+
+static const char * const tsa_strings[] = {
+ [TSA_MITIGATION_NONE] = "Vulnerable",
+ [TSA_MITIGATION_UCODE_NEEDED] = "Vulnerable: No microcode",
+ [TSA_MITIGATION_USER_KERNEL] = "Mitigation: Clear CPU buffers: user/kernel boundary",
+ [TSA_MITIGATION_VM] = "Mitigation: Clear CPU buffers: VM",
+ [TSA_MITIGATION_FULL] = "Mitigation: Clear CPU buffers",
+};
+
+static enum tsa_mitigations tsa_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_TSA) ? TSA_MITIGATION_AUTO : TSA_MITIGATION_NONE;
+
+static int __init tsa_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "off"))
+ tsa_mitigation = TSA_MITIGATION_NONE;
+ else if (!strcmp(str, "on"))
+ tsa_mitigation = TSA_MITIGATION_FULL;
+ else if (!strcmp(str, "user"))
+ tsa_mitigation = TSA_MITIGATION_USER_KERNEL;
+ else if (!strcmp(str, "vm"))
+ tsa_mitigation = TSA_MITIGATION_VM;
+ else
+ pr_err("Ignoring unknown tsa=%s option.\n", str);
+
+ return 0;
+}
+early_param("tsa", tsa_parse_cmdline);
+
+static void __init tsa_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_TSA)) {
+ tsa_mitigation = TSA_MITIGATION_NONE;
+ return;
+ }
+
+ if (tsa_mitigation == TSA_MITIGATION_AUTO) {
+ bool vm = false, uk = false;
+
+ tsa_mitigation = TSA_MITIGATION_NONE;
+
+ if (cpu_attack_vector_mitigated(CPU_MITIGATE_USER_KERNEL) ||
+ cpu_attack_vector_mitigated(CPU_MITIGATE_USER_USER)) {
+ tsa_mitigation = TSA_MITIGATION_USER_KERNEL;
+ uk = true;
+ }
+
+ if (cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_HOST) ||
+ cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_GUEST)) {
+ tsa_mitigation = TSA_MITIGATION_VM;
+ vm = true;
+ }
+
+ if (uk && vm)
+ tsa_mitigation = TSA_MITIGATION_FULL;
+ }
+
+ if (tsa_mitigation == TSA_MITIGATION_NONE)
+ return;
+
+ if (!boot_cpu_has(X86_FEATURE_VERW_CLEAR))
+ tsa_mitigation = TSA_MITIGATION_UCODE_NEEDED;
+
+ /*
+ * No need to set verw_clear_cpu_buf_mitigation_selected: it doesn't
+ * fit all cases here, and it is not needed because TSA is the only
+ * VERW-based mitigation on AMD.
+ */
+ pr_info("%s\n", tsa_strings[tsa_mitigation]);
+}
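
The AUTO branch above composes the final TSA mode from two independent booleans, and the resulting truth table is small enough to check exhaustively:

	#include <stdio.h>

	enum tsa { TSA_NONE, TSA_USER_KERNEL, TSA_VM, TSA_FULL };

	/* uk: user->kernel or user->user mitigated; vm: guest->host or guest->guest. */
	static enum tsa tsa_auto(int uk, int vm)
	{
		if (uk && vm)
			return TSA_FULL;
		if (vm)
			return TSA_VM;
		if (uk)
			return TSA_USER_KERNEL;
		return TSA_NONE;
	}

	int main(void)
	{
		static const char * const names[] = { "none", "user/kernel", "vm", "full" };
		int uk, vm;

		for (uk = 0; uk <= 1; uk++)
			for (vm = 0; vm <= 1; vm++)
				printf("uk=%d vm=%d -> %s\n",
				       uk, vm, names[tsa_auto(uk, vm)]);
		return 0;
	}
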
+
+static void __init tsa_apply_mitigation(void)
+{
+ switch (tsa_mitigation) {
+ case TSA_MITIGATION_USER_KERNEL:
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF);
+ break;
+ case TSA_MITIGATION_VM:
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM);
+ break;
+ case TSA_MITIGATION_FULL:
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF);
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_CPU_BUF_VM);
+ break;
+ default:
+ break;
+ }
+}
+
+#undef pr_fmt
+#define pr_fmt(fmt) "Spectre V2 : " fmt
+
+static enum spectre_v2_user_mitigation spectre_v2_user_stibp __ro_after_init =
+ SPECTRE_V2_USER_NONE;
+static enum spectre_v2_user_mitigation spectre_v2_user_ibpb __ro_after_init =
+ SPECTRE_V2_USER_NONE;
+
+#ifdef CONFIG_MITIGATION_RETPOLINE
+static bool spectre_v2_bad_module;
+
+bool retpoline_module_ok(bool has_retpoline)
+{
+ if (spectre_v2_enabled == SPECTRE_V2_NONE || has_retpoline)
+ return true;
+
+ pr_err("System may be vulnerable to spectre v2\n");
+ spectre_v2_bad_module = true;
+ return false;
+}
+
+static inline const char *spectre_v2_module_string(void)
+{
+ return spectre_v2_bad_module ? " - vulnerable module loaded" : "";
+}
+#else
+static inline const char *spectre_v2_module_string(void) { return ""; }
#endif
+
+#define SPECTRE_V2_LFENCE_MSG "WARNING: LFENCE mitigation is not recommended for this CPU, data leaks possible!\n"
+#define SPECTRE_V2_EIBRS_EBPF_MSG "WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!\n"
+#define SPECTRE_V2_EIBRS_LFENCE_EBPF_SMT_MSG "WARNING: Unprivileged eBPF is enabled with eIBRS+LFENCE mitigation and SMT, data leaks possible via Spectre v2 BHB attacks!\n"
+#define SPECTRE_V2_IBRS_PERF_MSG "WARNING: IBRS mitigation selected on Enhanced IBRS CPU, this may cause unnecessary performance loss\n"
+
+#ifdef CONFIG_BPF_SYSCALL
+void unpriv_ebpf_notify(int new_state)
+{
+ if (new_state)
return;
+
+ /* Unprivileged eBPF is enabled */
+
+ switch (spectre_v2_enabled) {
+ case SPECTRE_V2_EIBRS:
+ pr_err(SPECTRE_V2_EIBRS_EBPF_MSG);
+ break;
+ case SPECTRE_V2_EIBRS_LFENCE:
+ if (sched_smt_active())
+ pr_err(SPECTRE_V2_EIBRS_LFENCE_EBPF_SMT_MSG);
+ break;
+ default:
+ break;
+ }
+}
+#endif
+
+/* The kernel command line selection for spectre v2 */
+enum spectre_v2_mitigation_cmd {
+ SPECTRE_V2_CMD_NONE,
+ SPECTRE_V2_CMD_AUTO,
+ SPECTRE_V2_CMD_FORCE,
+ SPECTRE_V2_CMD_RETPOLINE,
+ SPECTRE_V2_CMD_RETPOLINE_GENERIC,
+ SPECTRE_V2_CMD_RETPOLINE_LFENCE,
+ SPECTRE_V2_CMD_EIBRS,
+ SPECTRE_V2_CMD_EIBRS_RETPOLINE,
+ SPECTRE_V2_CMD_EIBRS_LFENCE,
+ SPECTRE_V2_CMD_IBRS,
+};
+
+static enum spectre_v2_mitigation_cmd spectre_v2_cmd __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_SPECTRE_V2) ? SPECTRE_V2_CMD_AUTO : SPECTRE_V2_CMD_NONE;
+
+enum spectre_v2_user_mitigation_cmd {
+ SPECTRE_V2_USER_CMD_NONE,
+ SPECTRE_V2_USER_CMD_AUTO,
+ SPECTRE_V2_USER_CMD_FORCE,
+ SPECTRE_V2_USER_CMD_PRCTL,
+ SPECTRE_V2_USER_CMD_PRCTL_IBPB,
+ SPECTRE_V2_USER_CMD_SECCOMP,
+ SPECTRE_V2_USER_CMD_SECCOMP_IBPB,
+};
+
+static enum spectre_v2_user_mitigation_cmd spectre_v2_user_cmd __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_SPECTRE_V2) ? SPECTRE_V2_USER_CMD_AUTO : SPECTRE_V2_USER_CMD_NONE;
+
+static const char * const spectre_v2_user_strings[] = {
+ [SPECTRE_V2_USER_NONE] = "User space: Vulnerable",
+ [SPECTRE_V2_USER_STRICT] = "User space: Mitigation: STIBP protection",
+ [SPECTRE_V2_USER_STRICT_PREFERRED] = "User space: Mitigation: STIBP always-on protection",
+ [SPECTRE_V2_USER_PRCTL] = "User space: Mitigation: STIBP via prctl",
+ [SPECTRE_V2_USER_SECCOMP] = "User space: Mitigation: STIBP via seccomp and prctl",
+};
+
+static int __init spectre_v2_user_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "auto"))
+ spectre_v2_user_cmd = SPECTRE_V2_USER_CMD_AUTO;
+ else if (!strcmp(str, "off"))
+ spectre_v2_user_cmd = SPECTRE_V2_USER_CMD_NONE;
+ else if (!strcmp(str, "on"))
+ spectre_v2_user_cmd = SPECTRE_V2_USER_CMD_FORCE;
+ else if (!strcmp(str, "prctl"))
+ spectre_v2_user_cmd = SPECTRE_V2_USER_CMD_PRCTL;
+ else if (!strcmp(str, "prctl,ibpb"))
+ spectre_v2_user_cmd = SPECTRE_V2_USER_CMD_PRCTL_IBPB;
+ else if (!strcmp(str, "seccomp"))
+ spectre_v2_user_cmd = SPECTRE_V2_USER_CMD_SECCOMP;
+ else if (!strcmp(str, "seccomp,ibpb"))
+ spectre_v2_user_cmd = SPECTRE_V2_USER_CMD_SECCOMP_IBPB;
+ else
+ pr_err("Ignoring unknown spectre_v2_user option (%s).", str);
+
+ return 0;
+}
+early_param("spectre_v2_user", spectre_v2_user_parse_cmdline);
+
+static inline bool spectre_v2_in_ibrs_mode(enum spectre_v2_mitigation mode)
+{
+ return spectre_v2_in_eibrs_mode(mode) || mode == SPECTRE_V2_IBRS;
+}
+
+static void __init spectre_v2_user_select_mitigation(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_IBPB) && !boot_cpu_has(X86_FEATURE_STIBP))
+ return;
+
+ switch (spectre_v2_user_cmd) {
+ case SPECTRE_V2_USER_CMD_NONE:
+ return;
+ case SPECTRE_V2_USER_CMD_FORCE:
+ spectre_v2_user_ibpb = SPECTRE_V2_USER_STRICT;
+ spectre_v2_user_stibp = SPECTRE_V2_USER_STRICT;
+ break;
+ case SPECTRE_V2_USER_CMD_AUTO:
+ if (!should_mitigate_vuln(X86_BUG_SPECTRE_V2_USER))
+ break;
+ spectre_v2_user_ibpb = SPECTRE_V2_USER_PRCTL;
+ if (smt_mitigations == SMT_MITIGATIONS_OFF)
+ break;
+ spectre_v2_user_stibp = SPECTRE_V2_USER_PRCTL;
+ break;
+ case SPECTRE_V2_USER_CMD_PRCTL:
+ spectre_v2_user_ibpb = SPECTRE_V2_USER_PRCTL;
+ spectre_v2_user_stibp = SPECTRE_V2_USER_PRCTL;
+ break;
+ case SPECTRE_V2_USER_CMD_PRCTL_IBPB:
+ spectre_v2_user_ibpb = SPECTRE_V2_USER_STRICT;
+ spectre_v2_user_stibp = SPECTRE_V2_USER_PRCTL;
+ break;
+ case SPECTRE_V2_USER_CMD_SECCOMP:
+ if (IS_ENABLED(CONFIG_SECCOMP))
+ spectre_v2_user_ibpb = SPECTRE_V2_USER_SECCOMP;
+ else
+ spectre_v2_user_ibpb = SPECTRE_V2_USER_PRCTL;
+ spectre_v2_user_stibp = spectre_v2_user_ibpb;
+ break;
+ case SPECTRE_V2_USER_CMD_SECCOMP_IBPB:
+ spectre_v2_user_ibpb = SPECTRE_V2_USER_STRICT;
+ if (IS_ENABLED(CONFIG_SECCOMP))
+ spectre_v2_user_stibp = SPECTRE_V2_USER_SECCOMP;
+ else
+ spectre_v2_user_stibp = SPECTRE_V2_USER_PRCTL;
+ break;
}
/*
- * trap_init() enabled FXSR and company _before_ testing for FP
- * problems here.
+ * At this point, an STIBP mode other than "off" has been set.
+ * If STIBP support is not being forced, check if STIBP always-on
+ * is preferred.
+ */
+ if ((spectre_v2_user_stibp == SPECTRE_V2_USER_PRCTL ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_SECCOMP) &&
+ boot_cpu_has(X86_FEATURE_AMD_STIBP_ALWAYS_ON))
+ spectre_v2_user_stibp = SPECTRE_V2_USER_STRICT_PREFERRED;
+
+ if (!boot_cpu_has(X86_FEATURE_IBPB))
+ spectre_v2_user_ibpb = SPECTRE_V2_USER_NONE;
+
+ if (!boot_cpu_has(X86_FEATURE_STIBP))
+ spectre_v2_user_stibp = SPECTRE_V2_USER_NONE;
+}
+
+static void __init spectre_v2_user_update_mitigation(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_IBPB) && !boot_cpu_has(X86_FEATURE_STIBP))
+ return;
+
+ /* The spectre_v2 cmd line can override spectre_v2_user options */
+ if (spectre_v2_cmd == SPECTRE_V2_CMD_NONE) {
+ spectre_v2_user_ibpb = SPECTRE_V2_USER_NONE;
+ spectre_v2_user_stibp = SPECTRE_V2_USER_NONE;
+ } else if (spectre_v2_cmd == SPECTRE_V2_CMD_FORCE) {
+ spectre_v2_user_ibpb = SPECTRE_V2_USER_STRICT;
+ spectre_v2_user_stibp = SPECTRE_V2_USER_STRICT;
+ }
+
+ /*
+ * If no STIBP, Intel enhanced IBRS is enabled, or SMT impossible, STIBP
+ * is not required.
*
- * Test for the divl bug..
+ * Intel's Enhanced IBRS also protects against cross-thread branch target
+ * injection in user-mode as the IBRS bit remains always set which
+ * implicitly enables cross-thread protections. However, in legacy IBRS
+ * mode, the IBRS bit is set only on kernel entry and cleared on return
+ * to userspace. AMD Automatic IBRS also does not protect userspace.
+ * These modes therefore disable the implicit cross-thread protection,
+ * so allow for STIBP to be selected in those cases.
*/
- __asm__("fninit\n\t"
- "fldl %1\n\t"
- "fdivl %2\n\t"
- "fmull %2\n\t"
- "fldl %1\n\t"
- "fsubp %%st,%%st(1)\n\t"
- "fistpl %0\n\t"
- "fwait\n\t"
- "fninit"
- : "=m" (*&fdiv_bug)
- : "m" (*&x), "m" (*&y));
+ if (!boot_cpu_has(X86_FEATURE_STIBP) ||
+ !cpu_smt_possible() ||
+ (spectre_v2_in_eibrs_mode(spectre_v2_enabled) &&
+ !boot_cpu_has(X86_FEATURE_AUTOIBRS))) {
+ spectre_v2_user_stibp = SPECTRE_V2_USER_NONE;
+ return;
+ }
+
+ if (spectre_v2_user_stibp != SPECTRE_V2_USER_NONE &&
+ (retbleed_mitigation == RETBLEED_MITIGATION_UNRET ||
+ retbleed_mitigation == RETBLEED_MITIGATION_IBPB)) {
+ if (spectre_v2_user_stibp != SPECTRE_V2_USER_STRICT &&
+ spectre_v2_user_stibp != SPECTRE_V2_USER_STRICT_PREFERRED)
+ pr_info("Selecting STIBP always-on mode to complement retbleed mitigation\n");
+ spectre_v2_user_stibp = SPECTRE_V2_USER_STRICT_PREFERRED;
+ }
+ pr_info("%s\n", spectre_v2_user_strings[spectre_v2_user_stibp]);
+}
+
+static void __init spectre_v2_user_apply_mitigation(void)
+{
+ /* Initialize Indirect Branch Prediction Barrier */
+ if (spectre_v2_user_ibpb != SPECTRE_V2_USER_NONE) {
+ static_branch_enable(&switch_vcpu_ibpb);
+
+ switch (spectre_v2_user_ibpb) {
+ case SPECTRE_V2_USER_STRICT:
+ static_branch_enable(&switch_mm_always_ibpb);
+ break;
+ case SPECTRE_V2_USER_PRCTL:
+ case SPECTRE_V2_USER_SECCOMP:
+ static_branch_enable(&switch_mm_cond_ibpb);
+ break;
+ default:
+ break;
+ }
+
+ pr_info("mitigation: Enabling %s Indirect Branch Prediction Barrier\n",
+ static_key_enabled(&switch_mm_always_ibpb) ?
+ "always-on" : "conditional");
+ }
+}
+
+static const char * const spectre_v2_strings[] = {
+ [SPECTRE_V2_NONE] = "Vulnerable",
+ [SPECTRE_V2_RETPOLINE] = "Mitigation: Retpolines",
+ [SPECTRE_V2_LFENCE] = "Vulnerable: LFENCE",
+ [SPECTRE_V2_EIBRS] = "Mitigation: Enhanced / Automatic IBRS",
+ [SPECTRE_V2_EIBRS_LFENCE] = "Mitigation: Enhanced / Automatic IBRS + LFENCE",
+ [SPECTRE_V2_EIBRS_RETPOLINE] = "Mitigation: Enhanced / Automatic IBRS + Retpolines",
+ [SPECTRE_V2_IBRS] = "Mitigation: IBRS",
+};
+
+static bool nospectre_v2 __ro_after_init;
- boot_cpu_data.fdiv_bug = fdiv_bug;
- if (boot_cpu_data.fdiv_bug)
- printk(KERN_WARNING "Hmm, FPU with FDIV bug.\n");
+static int __init nospectre_v2_parse_cmdline(char *str)
+{
+ nospectre_v2 = true;
+ spectre_v2_cmd = SPECTRE_V2_CMD_NONE;
+ return 0;
}
+early_param("nospectre_v2", nospectre_v2_parse_cmdline);
-static void __init check_hlt(void)
+static int __init spectre_v2_parse_cmdline(char *str)
{
- if (boot_cpu_data.x86 >= 5 || paravirt_enabled())
+ if (!str)
+ return -EINVAL;
+
+ if (nospectre_v2)
+ return 0;
+
+ if (!strcmp(str, "off")) {
+ spectre_v2_cmd = SPECTRE_V2_CMD_NONE;
+ } else if (!strcmp(str, "on")) {
+ spectre_v2_cmd = SPECTRE_V2_CMD_FORCE;
+ setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
+ setup_force_cpu_bug(X86_BUG_SPECTRE_V2_USER);
+ } else if (!strcmp(str, "retpoline")) {
+ spectre_v2_cmd = SPECTRE_V2_CMD_RETPOLINE;
+ } else if (!strcmp(str, "retpoline,amd") ||
+ !strcmp(str, "retpoline,lfence")) {
+ spectre_v2_cmd = SPECTRE_V2_CMD_RETPOLINE_LFENCE;
+ } else if (!strcmp(str, "retpoline,generic")) {
+ spectre_v2_cmd = SPECTRE_V2_CMD_RETPOLINE_GENERIC;
+ } else if (!strcmp(str, "eibrs")) {
+ spectre_v2_cmd = SPECTRE_V2_CMD_EIBRS;
+ } else if (!strcmp(str, "eibrs,lfence")) {
+ spectre_v2_cmd = SPECTRE_V2_CMD_EIBRS_LFENCE;
+ } else if (!strcmp(str, "eibrs,retpoline")) {
+ spectre_v2_cmd = SPECTRE_V2_CMD_EIBRS_RETPOLINE;
+ } else if (!strcmp(str, "auto")) {
+ spectre_v2_cmd = SPECTRE_V2_CMD_AUTO;
+ } else if (!strcmp(str, "ibrs")) {
+ spectre_v2_cmd = SPECTRE_V2_CMD_IBRS;
+ } else {
+ pr_err("Ignoring unknown spectre_v2 option (%s).", str);
+ }
+
+ return 0;
+}
+early_param("spectre_v2", spectre_v2_parse_cmdline);
+
+static enum spectre_v2_mitigation __init spectre_v2_select_retpoline(void)
+{
+ if (!IS_ENABLED(CONFIG_MITIGATION_RETPOLINE)) {
+ pr_err("Kernel not compiled with retpoline; no mitigation available!");
+ return SPECTRE_V2_NONE;
+ }
+
+ return SPECTRE_V2_RETPOLINE;
+}
+
+static bool __ro_after_init rrsba_disabled;
+
+/* Disable in-kernel use of non-RSB RET predictors */
+static void __init spec_ctrl_disable_kernel_rrsba(void)
+{
+ if (rrsba_disabled)
return;
- printk(KERN_INFO "Checking 'hlt' instruction... ");
- if (!boot_cpu_data.hlt_works_ok) {
- printk("disabled\n");
+ if (!(x86_arch_cap_msr & ARCH_CAP_RRSBA)) {
+ rrsba_disabled = true;
return;
}
- halt();
- halt();
- halt();
- halt();
- printk(KERN_CONT "OK.\n");
+
+ if (!boot_cpu_has(X86_FEATURE_RRSBA_CTRL))
+ return;
+
+ x86_spec_ctrl_base |= SPEC_CTRL_RRSBA_DIS_S;
+ update_spec_ctrl(x86_spec_ctrl_base);
+ rrsba_disabled = true;
+}
+
+static void __init spectre_v2_select_rsb_mitigation(enum spectre_v2_mitigation mode)
+{
+ /*
+ * WARNING! There are many subtleties to consider when changing *any*
+ * code related to RSB-related mitigations. Before doing so, carefully
+ * read the following document, and update if necessary:
+ *
+ * Documentation/admin-guide/hw-vuln/rsb.rst
+ *
+ * In an overly simplified nutshell:
+ *
+ * - User->user RSB attacks are conditionally mitigated during
+ * context switches by cond_mitigation -> write_ibpb().
+ *
+ * - User->kernel and guest->host attacks are mitigated by eIBRS or
+ * RSB filling.
+ *
+ * Note that, depending on the config, other alternative mitigations
+ * may end up being used instead, e.g., IBPB on entry/vmexit, call
+ * depth tracking, or return thunks.
+ */
+
+ switch (mode) {
+ case SPECTRE_V2_NONE:
+ break;
+
+ case SPECTRE_V2_EIBRS:
+ case SPECTRE_V2_EIBRS_LFENCE:
+ case SPECTRE_V2_EIBRS_RETPOLINE:
+ if (boot_cpu_has_bug(X86_BUG_EIBRS_PBRSB)) {
+ pr_info("Spectre v2 / PBRSB-eIBRS: Retire a single CALL on VMEXIT\n");
+ setup_force_cpu_cap(X86_FEATURE_RSB_VMEXIT_LITE);
+ }
+ break;
+
+ case SPECTRE_V2_RETPOLINE:
+ case SPECTRE_V2_LFENCE:
+ case SPECTRE_V2_IBRS:
+ pr_info("Spectre v2 / SpectreRSB: Filling RSB on context switch and VMEXIT\n");
+ setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
+ setup_force_cpu_cap(X86_FEATURE_RSB_VMEXIT);
+ break;
+
+ default:
+ pr_warn_once("Unknown Spectre v2 mode, disabling RSB mitigation\n");
+ dump_stack();
+ break;
+ }
}
/*
- * Most 386 processors have a bug where a POPAD can lock the
- * machine even from user space.
+ * Set BHI_DIS_S to prevent indirect branches in the kernel from being
+ * influenced by branch history in userspace. Not needed if BHI_NO is set.
*/
+static bool __init spec_ctrl_bhi_dis(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_BHI_CTRL))
+ return false;
+
+ x86_spec_ctrl_base |= SPEC_CTRL_BHI_DIS_S;
+ update_spec_ctrl(x86_spec_ctrl_base);
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_HW);
+
+ return true;
+}
+
+enum bhi_mitigations {
+ BHI_MITIGATION_OFF,
+ BHI_MITIGATION_AUTO,
+ BHI_MITIGATION_ON,
+ BHI_MITIGATION_VMEXIT_ONLY,
+};
+
+static enum bhi_mitigations bhi_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_SPECTRE_BHI) ? BHI_MITIGATION_AUTO : BHI_MITIGATION_OFF;
+
+static int __init spectre_bhi_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "off"))
+ bhi_mitigation = BHI_MITIGATION_OFF;
+ else if (!strcmp(str, "on"))
+ bhi_mitigation = BHI_MITIGATION_ON;
+ else if (!strcmp(str, "vmexit"))
+ bhi_mitigation = BHI_MITIGATION_VMEXIT_ONLY;
+ else
+ pr_err("Ignoring unknown spectre_bhi option (%s)", str);
+
+ return 0;
+}
+early_param("spectre_bhi", spectre_bhi_parse_cmdline);
+
+static void __init bhi_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_BHI))
+ bhi_mitigation = BHI_MITIGATION_OFF;
+
+ if (bhi_mitigation != BHI_MITIGATION_AUTO)
+ return;
+
+ if (cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_HOST)) {
+ if (cpu_attack_vector_mitigated(CPU_MITIGATE_USER_KERNEL))
+ bhi_mitigation = BHI_MITIGATION_ON;
+ else
+ bhi_mitigation = BHI_MITIGATION_VMEXIT_ONLY;
+ } else {
+ bhi_mitigation = BHI_MITIGATION_OFF;
+ }
+}
+
+static void __init bhi_update_mitigation(void)
+{
+ if (spectre_v2_cmd == SPECTRE_V2_CMD_NONE)
+ bhi_mitigation = BHI_MITIGATION_OFF;
+}
+
+static void __init bhi_apply_mitigation(void)
+{
+ if (bhi_mitigation == BHI_MITIGATION_OFF)
+ return;
+
+ /* Retpoline mitigates against BHI unless the CPU has RRSBA behavior */
+ if (boot_cpu_has(X86_FEATURE_RETPOLINE) &&
+ !boot_cpu_has(X86_FEATURE_RETPOLINE_LFENCE)) {
+ spec_ctrl_disable_kernel_rrsba();
+ if (rrsba_disabled)
+ return;
+ }
+
+ if (!IS_ENABLED(CONFIG_X86_64))
+ return;
-static void __init check_popad(void)
+ /* Mitigate in hardware if supported */
+ if (spec_ctrl_bhi_dis())
+ return;
+
+ if (bhi_mitigation == BHI_MITIGATION_VMEXIT_ONLY) {
+ pr_info("Spectre BHI mitigation: SW BHB clearing on VM exit only\n");
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_VMEXIT);
+ return;
+ }
+
+ pr_info("Spectre BHI mitigation: SW BHB clearing on syscall and VM exit\n");
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_LOOP);
+ setup_force_cpu_cap(X86_FEATURE_CLEAR_BHB_VMEXIT);
+}
+
+static void __init spectre_v2_select_mitigation(void)
+{
+ if ((spectre_v2_cmd == SPECTRE_V2_CMD_RETPOLINE ||
+ spectre_v2_cmd == SPECTRE_V2_CMD_RETPOLINE_LFENCE ||
+ spectre_v2_cmd == SPECTRE_V2_CMD_RETPOLINE_GENERIC ||
+ spectre_v2_cmd == SPECTRE_V2_CMD_EIBRS_LFENCE ||
+ spectre_v2_cmd == SPECTRE_V2_CMD_EIBRS_RETPOLINE) &&
+ !IS_ENABLED(CONFIG_MITIGATION_RETPOLINE)) {
+ pr_err("RETPOLINE selected but not compiled in. Switching to AUTO select\n");
+ spectre_v2_cmd = SPECTRE_V2_CMD_AUTO;
+ }
+
+ if ((spectre_v2_cmd == SPECTRE_V2_CMD_EIBRS ||
+ spectre_v2_cmd == SPECTRE_V2_CMD_EIBRS_LFENCE ||
+ spectre_v2_cmd == SPECTRE_V2_CMD_EIBRS_RETPOLINE) &&
+ !boot_cpu_has(X86_FEATURE_IBRS_ENHANCED)) {
+ pr_err("EIBRS selected but CPU doesn't have Enhanced or Automatic IBRS. Switching to AUTO select\n");
+ spectre_v2_cmd = SPECTRE_V2_CMD_AUTO;
+ }
+
+ if ((spectre_v2_cmd == SPECTRE_V2_CMD_RETPOLINE_LFENCE ||
+ spectre_v2_cmd == SPECTRE_V2_CMD_EIBRS_LFENCE) &&
+ !boot_cpu_has(X86_FEATURE_LFENCE_RDTSC)) {
+ pr_err("LFENCE selected, but CPU doesn't have a serializing LFENCE. Switching to AUTO select\n");
+ spectre_v2_cmd = SPECTRE_V2_CMD_AUTO;
+ }
+
+ if (spectre_v2_cmd == SPECTRE_V2_CMD_IBRS && !IS_ENABLED(CONFIG_MITIGATION_IBRS_ENTRY)) {
+ pr_err("IBRS selected but not compiled in. Switching to AUTO select\n");
+ spectre_v2_cmd = SPECTRE_V2_CMD_AUTO;
+ }
+
+ if (spectre_v2_cmd == SPECTRE_V2_CMD_IBRS && boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) {
+ pr_err("IBRS selected but not Intel CPU. Switching to AUTO select\n");
+ spectre_v2_cmd = SPECTRE_V2_CMD_AUTO;
+ }
+
+ if (spectre_v2_cmd == SPECTRE_V2_CMD_IBRS && !boot_cpu_has(X86_FEATURE_IBRS)) {
+ pr_err("IBRS selected but CPU doesn't have IBRS. Switching to AUTO select\n");
+ spectre_v2_cmd = SPECTRE_V2_CMD_AUTO;
+ }
+
+ if (spectre_v2_cmd == SPECTRE_V2_CMD_IBRS && cpu_feature_enabled(X86_FEATURE_XENPV)) {
+ pr_err("IBRS selected but running as XenPV guest. Switching to AUTO select\n");
+ spectre_v2_cmd = SPECTRE_V2_CMD_AUTO;
+ }
+
+ if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2)) {
+ spectre_v2_cmd = SPECTRE_V2_CMD_NONE;
+ return;
+ }
+
+ switch (spectre_v2_cmd) {
+ case SPECTRE_V2_CMD_NONE:
+ return;
+
+ case SPECTRE_V2_CMD_AUTO:
+ if (!should_mitigate_vuln(X86_BUG_SPECTRE_V2))
+ break;
+ fallthrough;
+ case SPECTRE_V2_CMD_FORCE:
+ if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED)) {
+ spectre_v2_enabled = SPECTRE_V2_EIBRS;
+ break;
+ }
+
+ spectre_v2_enabled = spectre_v2_select_retpoline();
+ break;
+
+ case SPECTRE_V2_CMD_RETPOLINE_LFENCE:
+ pr_err(SPECTRE_V2_LFENCE_MSG);
+ spectre_v2_enabled = SPECTRE_V2_LFENCE;
+ break;
+
+ case SPECTRE_V2_CMD_RETPOLINE_GENERIC:
+ spectre_v2_enabled = SPECTRE_V2_RETPOLINE;
+ break;
+
+ case SPECTRE_V2_CMD_RETPOLINE:
+ spectre_v2_enabled = spectre_v2_select_retpoline();
+ break;
+
+ case SPECTRE_V2_CMD_IBRS:
+ spectre_v2_enabled = SPECTRE_V2_IBRS;
+ break;
+
+ case SPECTRE_V2_CMD_EIBRS:
+ spectre_v2_enabled = SPECTRE_V2_EIBRS;
+ break;
+
+ case SPECTRE_V2_CMD_EIBRS_LFENCE:
+ spectre_v2_enabled = SPECTRE_V2_EIBRS_LFENCE;
+ break;
+
+ case SPECTRE_V2_CMD_EIBRS_RETPOLINE:
+ spectre_v2_enabled = SPECTRE_V2_EIBRS_RETPOLINE;
+ break;
+ }
+}
+
+static void __init spectre_v2_update_mitigation(void)
{
-#ifndef CONFIG_X86_POPAD_OK
- int res, inp = (int) &res;
+ if (spectre_v2_cmd == SPECTRE_V2_CMD_AUTO &&
+ !spectre_v2_in_eibrs_mode(spectre_v2_enabled)) {
+ if (IS_ENABLED(CONFIG_MITIGATION_IBRS_ENTRY) &&
+ boot_cpu_has_bug(X86_BUG_RETBLEED) &&
+ retbleed_mitigation != RETBLEED_MITIGATION_NONE &&
+ retbleed_mitigation != RETBLEED_MITIGATION_STUFF &&
+ boot_cpu_has(X86_FEATURE_IBRS) &&
+ boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
+ spectre_v2_enabled = SPECTRE_V2_IBRS;
+ }
+ }
+
+ if (boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
+ pr_info("%s\n", spectre_v2_strings[spectre_v2_enabled]);
+}
+
+static void __init spectre_v2_apply_mitigation(void)
+{
+ if (spectre_v2_enabled == SPECTRE_V2_EIBRS && unprivileged_ebpf_enabled())
+ pr_err(SPECTRE_V2_EIBRS_EBPF_MSG);
+
+ if (spectre_v2_in_ibrs_mode(spectre_v2_enabled)) {
+ if (boot_cpu_has(X86_FEATURE_AUTOIBRS)) {
+ msr_set_bit(MSR_EFER, _EFER_AUTOIBRS);
+ } else {
+ x86_spec_ctrl_base |= SPEC_CTRL_IBRS;
+ update_spec_ctrl(x86_spec_ctrl_base);
+ }
+ }
+
+ switch (spectre_v2_enabled) {
+ case SPECTRE_V2_NONE:
+ return;
+
+ case SPECTRE_V2_EIBRS:
+ break;
+
+ case SPECTRE_V2_IBRS:
+ setup_force_cpu_cap(X86_FEATURE_KERNEL_IBRS);
+ if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED))
+ pr_warn(SPECTRE_V2_IBRS_PERF_MSG);
+ break;
+
+ case SPECTRE_V2_LFENCE:
+ case SPECTRE_V2_EIBRS_LFENCE:
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE_LFENCE);
+ fallthrough;
+
+ case SPECTRE_V2_RETPOLINE:
+ case SPECTRE_V2_EIBRS_RETPOLINE:
+ setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+ break;
+ }
- printk(KERN_INFO "Checking for popad bug... ");
- __asm__ __volatile__(
- "movl $12345678,%%eax; movl $0,%%edi; pusha; popa; movl (%%edx,%%edi),%%ecx "
- : "=&a" (res)
- : "d" (inp)
- : "ecx", "edi");
/*
- * If this fails, it means that any user program may lock the
- * CPU hard. Too bad.
+ * Disable alternate RSB predictions in kernel when indirect CALLs and
+ * JMPs get protection against BHI and Intramode-BTI, but RET
+ * prediction from a non-RSB predictor is still a risk.
*/
- if (res != 12345678)
- printk(KERN_CONT "Buggy.\n");
+ if (spectre_v2_enabled == SPECTRE_V2_EIBRS_LFENCE ||
+ spectre_v2_enabled == SPECTRE_V2_EIBRS_RETPOLINE ||
+ spectre_v2_enabled == SPECTRE_V2_RETPOLINE)
+ spec_ctrl_disable_kernel_rrsba();
+
+ spectre_v2_select_rsb_mitigation(spectre_v2_enabled);
+
+ /*
+ * Retpoline protects the kernel, but doesn't protect firmware. IBRS
+ * and Enhanced IBRS protect firmware too, so enable IBRS around
+ * firmware calls only when IBRS / Enhanced / Automatic IBRS aren't
+ * otherwise enabled.
+ *
+ * Use "spectre_v2_enabled" to check Enhanced IBRS instead of
+ * boot_cpu_has(), because the user might select retpoline on the kernel
+ * command line, and if the CPU supports Enhanced IBRS the kernel might
+ * unintentionally not enable IBRS around firmware calls.
+ */
+ if (boot_cpu_has_bug(X86_BUG_RETBLEED) &&
+ boot_cpu_has(X86_FEATURE_IBPB) &&
+ (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
+ boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)) {
+
+ if (retbleed_mitigation != RETBLEED_MITIGATION_IBPB) {
+ setup_force_cpu_cap(X86_FEATURE_USE_IBPB_FW);
+ pr_info("Enabling Speculation Barrier for firmware calls\n");
+ }
+
+ } else if (boot_cpu_has(X86_FEATURE_IBRS) &&
+ !spectre_v2_in_ibrs_mode(spectre_v2_enabled)) {
+ setup_force_cpu_cap(X86_FEATURE_USE_IBRS_FW);
+ pr_info("Enabling Restricted Speculation for firmware calls\n");
+ }
+}
+
+static void update_stibp_msr(void * __unused)
+{
+ u64 val = spec_ctrl_current() | (x86_spec_ctrl_base & SPEC_CTRL_STIBP);
+ update_spec_ctrl(val);
+}
+
+/* Update x86_spec_ctrl_base in case SMT state changed. */
+static void update_stibp_strict(void)
+{
+ u64 mask = x86_spec_ctrl_base & ~SPEC_CTRL_STIBP;
+
+ if (sched_smt_active())
+ mask |= SPEC_CTRL_STIBP;
+
+ if (mask == x86_spec_ctrl_base)
+ return;
+
+ pr_info("Update user space SMT mitigation: STIBP %s\n",
+ mask & SPEC_CTRL_STIBP ? "always-on" : "off");
+ x86_spec_ctrl_base = mask;
+ on_each_cpu(update_stibp_msr, NULL, 1);
+}
+
+/* Update the static key controlling the evaluation of TIF_SPEC_IB */
+static void update_indir_branch_cond(void)
+{
+ if (sched_smt_active())
+ static_branch_enable(&switch_to_cond_stibp);
else
- printk(KERN_CONT "OK.\n");
+ static_branch_disable(&switch_to_cond_stibp);
+}
+
+#undef pr_fmt
+#define pr_fmt(fmt) fmt
+
+/* Update the static key controlling the MDS CPU buffer clear in idle */
+static void update_mds_branch_idle(void)
+{
+ /*
+ * Enable the idle clearing if SMT is active on CPUs which are
+ * affected only by MSBDS and not any other MDS variant.
+ *
+ * The other variants cannot be mitigated when SMT is enabled, so
+ * clearing the buffers on idle just to prevent the Store Buffer
+ * repartitioning leak would be a window dressing exercise.
+ */
+ if (!boot_cpu_has_bug(X86_BUG_MSBDS_ONLY))
+ return;
+
+ if (sched_smt_active()) {
+ static_branch_enable(&cpu_buf_idle_clear);
+ } else if (mmio_mitigation == MMIO_MITIGATION_OFF ||
+ (x86_arch_cap_msr & ARCH_CAP_FBSDP_NO)) {
+ static_branch_disable(&cpu_buf_idle_clear);
+ }
+}
+
+#undef pr_fmt
+#define pr_fmt(fmt) "Speculative Store Bypass: " fmt
+
+static enum ssb_mitigation ssb_mode __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_SSB) ? SPEC_STORE_BYPASS_AUTO : SPEC_STORE_BYPASS_NONE;
+
+static const char * const ssb_strings[] = {
+ [SPEC_STORE_BYPASS_NONE] = "Vulnerable",
+ [SPEC_STORE_BYPASS_DISABLE] = "Mitigation: Speculative Store Bypass disabled",
+ [SPEC_STORE_BYPASS_PRCTL] = "Mitigation: Speculative Store Bypass disabled via prctl",
+ [SPEC_STORE_BYPASS_SECCOMP] = "Mitigation: Speculative Store Bypass disabled via prctl and seccomp",
+};
+
+static bool nossb __ro_after_init;
+
+static int __init nossb_parse_cmdline(char *str)
+{
+ nossb = true;
+ ssb_mode = SPEC_STORE_BYPASS_NONE;
+ return 0;
+}
+early_param("nospec_store_bypass_disable", nossb_parse_cmdline);
+
+static int __init ssb_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (nossb)
+ return 0;
+
+ if (!strcmp(str, "auto"))
+ ssb_mode = SPEC_STORE_BYPASS_AUTO;
+ else if (!strcmp(str, "on"))
+ ssb_mode = SPEC_STORE_BYPASS_DISABLE;
+ else if (!strcmp(str, "off"))
+ ssb_mode = SPEC_STORE_BYPASS_NONE;
+ else if (!strcmp(str, "prctl"))
+ ssb_mode = SPEC_STORE_BYPASS_PRCTL;
+ else if (!strcmp(str, "seccomp"))
+ ssb_mode = IS_ENABLED(CONFIG_SECCOMP) ?
+ SPEC_STORE_BYPASS_SECCOMP : SPEC_STORE_BYPASS_PRCTL;
+ else
+ pr_err("Ignoring unknown spec_store_bypass_disable option (%s).\n",
+ str);
+
+ return 0;
+}
+early_param("spec_store_bypass_disable", ssb_parse_cmdline);
+
+static void __init ssb_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS)) {
+ ssb_mode = SPEC_STORE_BYPASS_NONE;
+ return;
+ }
+
+ if (ssb_mode == SPEC_STORE_BYPASS_AUTO) {
+ if (should_mitigate_vuln(X86_BUG_SPEC_STORE_BYPASS))
+ ssb_mode = SPEC_STORE_BYPASS_PRCTL;
+ else
+ ssb_mode = SPEC_STORE_BYPASS_NONE;
+ }
+
+ if (!boot_cpu_has(X86_FEATURE_SSBD))
+ ssb_mode = SPEC_STORE_BYPASS_NONE;
+
+ pr_info("%s\n", ssb_strings[ssb_mode]);
+}
+
+static void __init ssb_apply_mitigation(void)
+{
+ /*
+ * We have three CPU feature flags that are in play here:
+ * - X86_BUG_SPEC_STORE_BYPASS - CPU is susceptible.
+ * - X86_FEATURE_SSBD - CPU is able to turn off speculative store bypass
+ * - X86_FEATURE_SPEC_STORE_BYPASS_DISABLE - engage the mitigation
+ */
+ if (ssb_mode == SPEC_STORE_BYPASS_DISABLE) {
+ setup_force_cpu_cap(X86_FEATURE_SPEC_STORE_BYPASS_DISABLE);
+ /*
+ * Intel uses the SPEC CTRL MSR Bit(2) for this, while AMD may
+ * use a completely different MSR and bit dependent on family.
+ */
+ if (!static_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD) &&
+ !static_cpu_has(X86_FEATURE_AMD_SSBD)) {
+ x86_amd_ssb_disable();
+ } else {
+ x86_spec_ctrl_base |= SPEC_CTRL_SSBD;
+ update_spec_ctrl(x86_spec_ctrl_base);
+ }
+ }
+}
+
+#undef pr_fmt
+#define pr_fmt(fmt) "Speculation prctl: " fmt
+
+static void task_update_spec_tif(struct task_struct *tsk)
+{
+ /* Force the update of the real TIF bits */
+ set_tsk_thread_flag(tsk, TIF_SPEC_FORCE_UPDATE);
+
+ /*
+ * Immediately update the speculation control MSRs for the current
+ * task, but for a non-current task delay setting the CPU
+ * mitigation until it is scheduled next.
+ *
+ * This can only happen for SECCOMP mitigation. For PRCTL it's
+ * always the current task.
+ */
+ if (tsk == current)
+ speculation_ctrl_update_current();
+}
+
+static int l1d_flush_prctl_set(struct task_struct *task, unsigned long ctrl)
+{
+ if (!static_branch_unlikely(&switch_mm_cond_l1d_flush))
+ return -EPERM;
+
+ switch (ctrl) {
+ case PR_SPEC_ENABLE:
+ set_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+ case PR_SPEC_DISABLE:
+ clear_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH);
+ return 0;
+ default:
+ return -ERANGE;
+ }
+}
+
+static int ssb_prctl_set(struct task_struct *task, unsigned long ctrl)
+{
+ if (ssb_mode != SPEC_STORE_BYPASS_PRCTL &&
+ ssb_mode != SPEC_STORE_BYPASS_SECCOMP)
+ return -ENXIO;
+
+ switch (ctrl) {
+ case PR_SPEC_ENABLE:
+ /* If speculation is force disabled, enable is not allowed */
+ if (task_spec_ssb_force_disable(task))
+ return -EPERM;
+ task_clear_spec_ssb_disable(task);
+ task_clear_spec_ssb_noexec(task);
+ task_update_spec_tif(task);
+ break;
+ case PR_SPEC_DISABLE:
+ task_set_spec_ssb_disable(task);
+ task_clear_spec_ssb_noexec(task);
+ task_update_spec_tif(task);
+ break;
+ case PR_SPEC_FORCE_DISABLE:
+ task_set_spec_ssb_disable(task);
+ task_set_spec_ssb_force_disable(task);
+ task_clear_spec_ssb_noexec(task);
+ task_update_spec_tif(task);
+ break;
+ case PR_SPEC_DISABLE_NOEXEC:
+ if (task_spec_ssb_force_disable(task))
+ return -EPERM;
+ task_set_spec_ssb_disable(task);
+ task_set_spec_ssb_noexec(task);
+ task_update_spec_tif(task);
+ break;
+ default:
+ return -ERANGE;
+ }
+ return 0;
+}
+
+static bool is_spec_ib_user_controlled(void)
+{
+ return spectre_v2_user_ibpb == SPECTRE_V2_USER_PRCTL ||
+ spectre_v2_user_ibpb == SPECTRE_V2_USER_SECCOMP ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_PRCTL ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_SECCOMP;
+}
+
+static int ib_prctl_set(struct task_struct *task, unsigned long ctrl)
+{
+ switch (ctrl) {
+ case PR_SPEC_ENABLE:
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
+ return 0;
+
+ /*
+ * With strict mode for both IBPB and STIBP, the instruction
+ * code paths avoid checking this task flag and instead
+ * unconditionally run the instruction. However, STIBP and IBPB
+ * are independent and either can be set to conditionally
+ * enabled regardless of the mode of the other.
+ *
+ * If either is set to conditional, allow the task flag to be
+ * updated, unless it was force-disabled by a previous prctl
+ * call. Currently, this is possible on an AMD CPU which has the
+ * feature X86_FEATURE_AMD_STIBP_ALWAYS_ON. In this case, if the
+ * kernel is booted with 'spectre_v2_user=seccomp', then
+ * spectre_v2_user_ibpb == SPECTRE_V2_USER_SECCOMP and
+ * spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED.
+ */
+ if (!is_spec_ib_user_controlled() ||
+ task_spec_ib_force_disable(task))
+ return -EPERM;
+
+ task_clear_spec_ib_disable(task);
+ task_update_spec_tif(task);
+ break;
+ case PR_SPEC_DISABLE:
+ case PR_SPEC_FORCE_DISABLE:
+ /*
+ * Indirect branch speculation is always allowed when
+ * mitigation is force disabled.
+ */
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
+ return -EPERM;
+
+ if (!is_spec_ib_user_controlled())
+ return 0;
+
+ task_set_spec_ib_disable(task);
+ if (ctrl == PR_SPEC_FORCE_DISABLE)
+ task_set_spec_ib_force_disable(task);
+ task_update_spec_tif(task);
+ if (task == current)
+ indirect_branch_prediction_barrier();
+ break;
+ default:
+ return -ERANGE;
+ }
+ return 0;
+}
+
+int arch_prctl_spec_ctrl_set(struct task_struct *task, unsigned long which,
+ unsigned long ctrl)
+{
+ switch (which) {
+ case PR_SPEC_STORE_BYPASS:
+ return ssb_prctl_set(task, ctrl);
+ case PR_SPEC_INDIRECT_BRANCH:
+ return ib_prctl_set(task, ctrl);
+ case PR_SPEC_L1D_FLUSH:
+ return l1d_flush_prctl_set(task, ctrl);
+ default:
+ return -ENODEV;
+ }
+}
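+
+/*
+ * A minimal userspace sketch (not kernel code) of how this handler is
+ * reached, via the speculation-control prctl:
+ *
+ *   prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
+ *         PR_SPEC_DISABLE, 0, 0);
+ *
+ * which lands in ssb_prctl_set() above for the calling task.
+ */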
+
+#ifdef CONFIG_SECCOMP
+void arch_seccomp_spec_mitigate(struct task_struct *task)
+{
+ if (ssb_mode == SPEC_STORE_BYPASS_SECCOMP)
+ ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE);
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_SECCOMP ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_SECCOMP)
+ ib_prctl_set(task, PR_SPEC_FORCE_DISABLE);
+}
#endif
+
+static int l1d_flush_prctl_get(struct task_struct *task)
+{
+ if (!static_branch_unlikely(&switch_mm_cond_l1d_flush))
+ return PR_SPEC_FORCE_DISABLE;
+
+ if (test_ti_thread_flag(&task->thread_info, TIF_SPEC_L1D_FLUSH))
+ return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
+ else
+ return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
+}
+
+static int ssb_prctl_get(struct task_struct *task)
+{
+ switch (ssb_mode) {
+ case SPEC_STORE_BYPASS_NONE:
+ if (boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS))
+ return PR_SPEC_ENABLE;
+ return PR_SPEC_NOT_AFFECTED;
+ case SPEC_STORE_BYPASS_DISABLE:
+ return PR_SPEC_DISABLE;
+ case SPEC_STORE_BYPASS_SECCOMP:
+ case SPEC_STORE_BYPASS_PRCTL:
+ case SPEC_STORE_BYPASS_AUTO:
+ if (task_spec_ssb_force_disable(task))
+ return PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE;
+ if (task_spec_ssb_noexec(task))
+ return PR_SPEC_PRCTL | PR_SPEC_DISABLE_NOEXEC;
+ if (task_spec_ssb_disable(task))
+ return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
+ return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
+ }
+ BUG();
}
+static int ib_prctl_get(struct task_struct *task)
+{
+ if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
+ return PR_SPEC_NOT_AFFECTED;
+
+ if (spectre_v2_user_ibpb == SPECTRE_V2_USER_NONE &&
+ spectre_v2_user_stibp == SPECTRE_V2_USER_NONE)
+ return PR_SPEC_ENABLE;
+ else if (is_spec_ib_user_controlled()) {
+ if (task_spec_ib_force_disable(task))
+ return PR_SPEC_PRCTL | PR_SPEC_FORCE_DISABLE;
+ if (task_spec_ib_disable(task))
+ return PR_SPEC_PRCTL | PR_SPEC_DISABLE;
+ return PR_SPEC_PRCTL | PR_SPEC_ENABLE;
+ } else if (spectre_v2_user_ibpb == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED)
+ return PR_SPEC_DISABLE;
+ else
+ return PR_SPEC_NOT_AFFECTED;
+}
+
+int arch_prctl_spec_ctrl_get(struct task_struct *task, unsigned long which)
+{
+ switch (which) {
+ case PR_SPEC_STORE_BYPASS:
+ return ssb_prctl_get(task);
+ case PR_SPEC_INDIRECT_BRANCH:
+ return ib_prctl_get(task);
+ case PR_SPEC_L1D_FLUSH:
+ return l1d_flush_prctl_get(task);
+ default:
+ return -ENODEV;
+ }
+}
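+
+/*
+ * The matching query side, again as a userspace sketch:
+ *
+ *   int state = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
+ *                     0, 0, 0);
+ *
+ * which returns the PR_SPEC_* flags assembled by ib_prctl_get() above.
+ */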
+
+void x86_spec_ctrl_setup_ap(void)
+{
+ if (boot_cpu_has(X86_FEATURE_MSR_SPEC_CTRL))
+ update_spec_ctrl(x86_spec_ctrl_base);
+
+ if (ssb_mode == SPEC_STORE_BYPASS_DISABLE)
+ x86_amd_ssb_disable();
+}
+
+bool itlb_multihit_kvm_mitigation;
+EXPORT_SYMBOL_FOR_KVM(itlb_multihit_kvm_mitigation);
+
+#undef pr_fmt
+#define pr_fmt(fmt) "L1TF: " fmt
+
+/* Default mitigation for L1TF-affected CPUs */
+enum l1tf_mitigations l1tf_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_L1TF) ? L1TF_MITIGATION_AUTO : L1TF_MITIGATION_OFF;
+EXPORT_SYMBOL_FOR_KVM(l1tf_mitigation);
+enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
+EXPORT_SYMBOL_FOR_KVM(l1tf_vmx_mitigation);
+
/*
- * Check whether we are able to run this kernel safely on SMP.
+ * These CPUs all support a 44-bit physical address space internally in the
+ * cache, but CPUID can report a smaller number of physical address bits.
*
- * - In order to run on a i386, we need to be compiled for i386
- * (for due to lack of "invlpg" and working WP on a i386)
- * - In order to run on anything without a TSC, we need to be
- * compiled for a i486.
+ * The L1TF mitigation uses the topmost address bit for the inversion of
+ * non-present PTEs. When the installed memory reaches into the topmost
+ * address bit due to memory holes (observed on machines which report 36
+ * physical address bits but have 32G RAM installed), the mitigation range
+ * check in l1tf_select_mitigation() triggers. This is a false positive
+ * because the mitigation is still possible: the cache uses 44 bits
+ * internally. Use the cache bits instead of the reported physical bits,
+ * and adjust them to 44 on the affected machines if the reported bits are
+ * less than 44.
*/
+static void override_cache_bits(struct cpuinfo_x86 *c)
+{
+ if (c->x86 != 6)
+ return;
-static void __init check_config(void)
+ switch (c->x86_vfm) {
+ case INTEL_NEHALEM:
+ case INTEL_WESTMERE:
+ case INTEL_SANDYBRIDGE:
+ case INTEL_IVYBRIDGE:
+ case INTEL_HASWELL:
+ case INTEL_HASWELL_L:
+ case INTEL_HASWELL_G:
+ case INTEL_BROADWELL:
+ case INTEL_BROADWELL_G:
+ case INTEL_SKYLAKE_L:
+ case INTEL_SKYLAKE:
+ case INTEL_KABYLAKE_L:
+ case INTEL_KABYLAKE:
+ if (c->x86_cache_bits < 44)
+ c->x86_cache_bits = 44;
+ break;
+ }
+}
+
+static void __init l1tf_select_mitigation(void)
{
-/*
- * We'd better not be a i386 if we're configured to use some
- * i486+ only features! (WP works in supervisor mode and the
- * new "invlpg" and "bswap" instructions)
- */
-#if defined(CONFIG_X86_WP_WORKS_OK) || defined(CONFIG_X86_INVLPG) || \
- defined(CONFIG_X86_BSWAP)
- if (boot_cpu_data.x86 == 3)
- panic("Kernel requires i486+ for 'invlpg' and other features");
+ if (!boot_cpu_has_bug(X86_BUG_L1TF)) {
+ l1tf_mitigation = L1TF_MITIGATION_OFF;
+ return;
+ }
+
+ if (l1tf_mitigation != L1TF_MITIGATION_AUTO)
+ return;
+
+ if (!should_mitigate_vuln(X86_BUG_L1TF)) {
+ l1tf_mitigation = L1TF_MITIGATION_OFF;
+ return;
+ }
+
+ if (smt_mitigations == SMT_MITIGATIONS_ON)
+ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;
+ else
+ l1tf_mitigation = L1TF_MITIGATION_FLUSH;
+}
+
+static void __init l1tf_apply_mitigation(void)
+{
+ u64 half_pa;
+
+ if (!boot_cpu_has_bug(X86_BUG_L1TF))
+ return;
+
+ override_cache_bits(&boot_cpu_data);
+
+ switch (l1tf_mitigation) {
+ case L1TF_MITIGATION_OFF:
+ case L1TF_MITIGATION_FLUSH_NOWARN:
+ case L1TF_MITIGATION_FLUSH:
+ case L1TF_MITIGATION_AUTO:
+ break;
+ case L1TF_MITIGATION_FLUSH_NOSMT:
+ case L1TF_MITIGATION_FULL:
+ cpu_smt_disable(false);
+ break;
+ case L1TF_MITIGATION_FULL_FORCE:
+ cpu_smt_disable(true);
+ break;
+ }
+
+#if CONFIG_PGTABLE_LEVELS == 2
+ pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
+ return;
#endif
+
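+ /*
+ * Worked example, assuming the 44 cache bits set up above:
+ * l1tf_pfn_limit() << PAGE_SHIFT == 1ULL << 43, i.e. half of the
+ * 2^44 byte physical address space, or 8 TiB.
+ */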
+ half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT;
+ if (l1tf_mitigation != L1TF_MITIGATION_OFF &&
+ e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {
+ pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n");
+ pr_info("You may make it effective by booting the kernel with mem=%llu parameter.\n",
+ half_pa);
+ pr_info("However, doing so will make a part of your RAM unusable.\n");
+ pr_info("Reading https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html might help you decide.\n");
+ return;
+ }
+
+ setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
+}
+
+static int __init l1tf_cmdline(char *str)
+{
+ if (!boot_cpu_has_bug(X86_BUG_L1TF))
+ return 0;
+
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "off"))
+ l1tf_mitigation = L1TF_MITIGATION_OFF;
+ else if (!strcmp(str, "flush,nowarn"))
+ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOWARN;
+ else if (!strcmp(str, "flush"))
+ l1tf_mitigation = L1TF_MITIGATION_FLUSH;
+ else if (!strcmp(str, "flush,nosmt"))
+ l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;
+ else if (!strcmp(str, "full"))
+ l1tf_mitigation = L1TF_MITIGATION_FULL;
+ else if (!strcmp(str, "full,force"))
+ l1tf_mitigation = L1TF_MITIGATION_FULL_FORCE;
+
+ return 0;
}
+early_param("l1tf", l1tf_cmdline);
+#undef pr_fmt
+#define pr_fmt(fmt) "Speculative Return Stack Overflow: " fmt
-void __init check_bugs(void)
+static const char * const srso_strings[] = {
+ [SRSO_MITIGATION_NONE] = "Vulnerable",
+ [SRSO_MITIGATION_UCODE_NEEDED] = "Vulnerable: No microcode",
+ [SRSO_MITIGATION_SAFE_RET_UCODE_NEEDED] = "Vulnerable: Safe RET, no microcode",
+ [SRSO_MITIGATION_MICROCODE] = "Vulnerable: Microcode, no safe RET",
+ [SRSO_MITIGATION_NOSMT] = "Mitigation: SMT disabled",
+ [SRSO_MITIGATION_SAFE_RET] = "Mitigation: Safe RET",
+ [SRSO_MITIGATION_IBPB] = "Mitigation: IBPB",
+ [SRSO_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT only",
+ [SRSO_MITIGATION_BP_SPEC_REDUCE] = "Mitigation: Reduced Speculation"
+};
+
+static int __init srso_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "off"))
+ srso_mitigation = SRSO_MITIGATION_NONE;
+ else if (!strcmp(str, "microcode"))
+ srso_mitigation = SRSO_MITIGATION_MICROCODE;
+ else if (!strcmp(str, "safe-ret"))
+ srso_mitigation = SRSO_MITIGATION_SAFE_RET;
+ else if (!strcmp(str, "ibpb"))
+ srso_mitigation = SRSO_MITIGATION_IBPB;
+ else if (!strcmp(str, "ibpb-vmexit"))
+ srso_mitigation = SRSO_MITIGATION_IBPB_ON_VMEXIT;
+ else
+ pr_err("Ignoring unknown SRSO option (%s).", str);
+
+ return 0;
+}
+early_param("spec_rstack_overflow", srso_parse_cmdline);
+
+#define SRSO_NOTICE "WARNING: See https://kernel.org/doc/html/latest/admin-guide/hw-vuln/srso.html for mitigation options."
+
+static void __init srso_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_SRSO)) {
+ srso_mitigation = SRSO_MITIGATION_NONE;
+ return;
+ }
+
+ if (srso_mitigation == SRSO_MITIGATION_AUTO) {
+ /*
+ * Use safe-RET if user->kernel or guest->host protection is
+ * required. Otherwise the 'microcode' mitigation is sufficient
+ * to protect the user->user and guest->guest vectors.
+ */
+ if (cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_HOST) ||
+ (cpu_attack_vector_mitigated(CPU_MITIGATE_USER_KERNEL) &&
+ !boot_cpu_has(X86_FEATURE_SRSO_USER_KERNEL_NO))) {
+ srso_mitigation = SRSO_MITIGATION_SAFE_RET;
+ } else if (cpu_attack_vector_mitigated(CPU_MITIGATE_USER_USER) ||
+ cpu_attack_vector_mitigated(CPU_MITIGATE_GUEST_GUEST)) {
+ srso_mitigation = SRSO_MITIGATION_MICROCODE;
+ } else {
+ srso_mitigation = SRSO_MITIGATION_NONE;
+ return;
+ }
+ }
+
+ /* Zen1/2 with SMT off aren't vulnerable to SRSO. */
+ if (boot_cpu_data.x86 < 0x19 && !cpu_smt_possible()) {
+ srso_mitigation = SRSO_MITIGATION_NOSMT;
+ return;
+ }
+
+ if (!boot_cpu_has(X86_FEATURE_IBPB_BRTYPE)) {
+ pr_warn("IBPB-extending microcode not applied!\n");
+ pr_warn(SRSO_NOTICE);
+
+ /*
+ * Safe-RET provides partial protection even without the
+ * microcode, but the other mitigations require the microcode
+ * to have any effect at all.
+ */
+ if (srso_mitigation == SRSO_MITIGATION_SAFE_RET)
+ srso_mitigation = SRSO_MITIGATION_SAFE_RET_UCODE_NEEDED;
+ else
+ srso_mitigation = SRSO_MITIGATION_UCODE_NEEDED;
+ }
+
+ switch (srso_mitigation) {
+ case SRSO_MITIGATION_SAFE_RET:
+ case SRSO_MITIGATION_SAFE_RET_UCODE_NEEDED:
+ if (boot_cpu_has(X86_FEATURE_SRSO_USER_KERNEL_NO)) {
+ srso_mitigation = SRSO_MITIGATION_IBPB_ON_VMEXIT;
+ goto ibpb_on_vmexit;
+ }
+
+ if (!IS_ENABLED(CONFIG_MITIGATION_SRSO)) {
+ pr_err("WARNING: kernel not compiled with MITIGATION_SRSO.\n");
+ srso_mitigation = SRSO_MITIGATION_NONE;
+ }
+ break;
+ibpb_on_vmexit:
+ case SRSO_MITIGATION_IBPB_ON_VMEXIT:
+ if (boot_cpu_has(X86_FEATURE_SRSO_BP_SPEC_REDUCE)) {
+ pr_notice("Reducing speculation to address VM/HV SRSO attack vector.\n");
+ srso_mitigation = SRSO_MITIGATION_BP_SPEC_REDUCE;
+ break;
+ }
+ fallthrough;
+ case SRSO_MITIGATION_IBPB:
+ if (!IS_ENABLED(CONFIG_MITIGATION_IBPB_ENTRY)) {
+ pr_err("WARNING: kernel not compiled with MITIGATION_IBPB_ENTRY.\n");
+ srso_mitigation = SRSO_MITIGATION_NONE;
+ }
+ break;
+ default:
+ break;
+ }
+}
+
+static void __init srso_update_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_SRSO))
+ return;
+
+ /* If retbleed is using IBPB, that works for SRSO as well */
+ if (retbleed_mitigation == RETBLEED_MITIGATION_IBPB &&
+ boot_cpu_has(X86_FEATURE_IBPB_BRTYPE))
+ srso_mitigation = SRSO_MITIGATION_IBPB;
+
+ pr_info("%s\n", srso_strings[srso_mitigation]);
+}
+
+static void __init srso_apply_mitigation(void)
+{
+ /*
+ * Clear the feature flag if this mitigation is not selected as that
+ * feature flag controls the BpSpecReduce MSR bit toggling in KVM.
+ */
+ if (srso_mitigation != SRSO_MITIGATION_BP_SPEC_REDUCE)
+ setup_clear_cpu_cap(X86_FEATURE_SRSO_BP_SPEC_REDUCE);
+
+ if (srso_mitigation == SRSO_MITIGATION_NONE) {
+ if (boot_cpu_has(X86_FEATURE_SBPB))
+ x86_pred_cmd = PRED_CMD_SBPB;
+ return;
+ }
+
+ switch (srso_mitigation) {
+ case SRSO_MITIGATION_SAFE_RET:
+ case SRSO_MITIGATION_SAFE_RET_UCODE_NEEDED:
+ /*
+ * Enable the return thunk for generated code
+ * like ftrace, static_call, etc.
+ */
+ setup_force_cpu_cap(X86_FEATURE_RETHUNK);
+ setup_force_cpu_cap(X86_FEATURE_UNRET);
+
+ if (boot_cpu_data.x86 == 0x19) {
+ setup_force_cpu_cap(X86_FEATURE_SRSO_ALIAS);
+ set_return_thunk(srso_alias_return_thunk);
+ } else {
+ setup_force_cpu_cap(X86_FEATURE_SRSO);
+ set_return_thunk(srso_return_thunk);
+ }
+ break;
+ case SRSO_MITIGATION_IBPB:
+ setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB);
+ /*
+ * IBPB on entry already obviates the need for
+ * software-based untraining, so clear those flags in case some
+ * other mitigation such as Retbleed has selected them.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_UNRET);
+ setup_clear_cpu_cap(X86_FEATURE_RETHUNK);
+ fallthrough;
+ case SRSO_MITIGATION_IBPB_ON_VMEXIT:
+ setup_force_cpu_cap(X86_FEATURE_IBPB_ON_VMEXIT);
+ /*
+ * There is no need for RSB filling: entry_ibpb() ensures
+ * all predictions, including the RSB, are invalidated,
+ * regardless of IBPB implementation.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_RSB_VMEXIT);
+ break;
+ default:
+ break;
+ }
+}
+
+#undef pr_fmt
+#define pr_fmt(fmt) "VMSCAPE: " fmt
+
+enum vmscape_mitigations {
+ VMSCAPE_MITIGATION_NONE,
+ VMSCAPE_MITIGATION_AUTO,
+ VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER,
+ VMSCAPE_MITIGATION_IBPB_ON_VMEXIT,
+};
+
+static const char * const vmscape_strings[] = {
+ [VMSCAPE_MITIGATION_NONE] = "Vulnerable",
+ /* [VMSCAPE_MITIGATION_AUTO] */
+ [VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER] = "Mitigation: IBPB before exit to userspace",
+ [VMSCAPE_MITIGATION_IBPB_ON_VMEXIT] = "Mitigation: IBPB on VMEXIT",
+};
+
+static enum vmscape_mitigations vmscape_mitigation __ro_after_init =
+ IS_ENABLED(CONFIG_MITIGATION_VMSCAPE) ? VMSCAPE_MITIGATION_AUTO : VMSCAPE_MITIGATION_NONE;
+
+static int __init vmscape_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "off")) {
+ vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
+ } else if (!strcmp(str, "ibpb")) {
+ vmscape_mitigation = VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER;
+ } else if (!strcmp(str, "force")) {
+ setup_force_cpu_bug(X86_BUG_VMSCAPE);
+ vmscape_mitigation = VMSCAPE_MITIGATION_AUTO;
+ } else {
+ pr_err("Ignoring unknown vmscape=%s option.\n", str);
+ }
+
+ return 0;
+}
+early_param("vmscape", vmscape_parse_cmdline);
+
+static void __init vmscape_select_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_VMSCAPE) ||
+ !boot_cpu_has(X86_FEATURE_IBPB)) {
+ vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
+ return;
+ }
+
+ if (vmscape_mitigation == VMSCAPE_MITIGATION_AUTO) {
+ if (should_mitigate_vuln(X86_BUG_VMSCAPE))
+ vmscape_mitigation = VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER;
+ else
+ vmscape_mitigation = VMSCAPE_MITIGATION_NONE;
+ }
+}
+
+static void __init vmscape_update_mitigation(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_VMSCAPE))
+ return;
+
+ if (retbleed_mitigation == RETBLEED_MITIGATION_IBPB ||
+ srso_mitigation == SRSO_MITIGATION_IBPB_ON_VMEXIT)
+ vmscape_mitigation = VMSCAPE_MITIGATION_IBPB_ON_VMEXIT;
+
+ pr_info("%s\n", vmscape_strings[vmscape_mitigation]);
+}
+
+static void __init vmscape_apply_mitigation(void)
+{
+ if (vmscape_mitigation == VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER)
+ setup_force_cpu_cap(X86_FEATURE_IBPB_EXIT_TO_USER);
+}
+
+#undef pr_fmt
+#define pr_fmt(fmt) fmt
+
+#define MDS_MSG_SMT "MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.\n"
+#define TAA_MSG_SMT "TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.\n"
+#define MMIO_MSG_SMT "MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.\n"
+#define VMSCAPE_MSG_SMT "VMSCAPE: SMT on, STIBP is required for full protection. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/vmscape.html for more details.\n"
+
+void cpu_bugs_smt_update(void)
+{
+ mutex_lock(&spec_ctrl_mutex);
+
+ if (sched_smt_active() && unprivileged_ebpf_enabled() &&
+ spectre_v2_enabled == SPECTRE_V2_EIBRS_LFENCE)
+ pr_warn_once(SPECTRE_V2_EIBRS_LFENCE_EBPF_SMT_MSG);
+
+ switch (spectre_v2_user_stibp) {
+ case SPECTRE_V2_USER_NONE:
+ break;
+ case SPECTRE_V2_USER_STRICT:
+ case SPECTRE_V2_USER_STRICT_PREFERRED:
+ update_stibp_strict();
+ break;
+ case SPECTRE_V2_USER_PRCTL:
+ case SPECTRE_V2_USER_SECCOMP:
+ update_indir_branch_cond();
+ break;
+ }
+
+ switch (mds_mitigation) {
+ case MDS_MITIGATION_FULL:
+ case MDS_MITIGATION_AUTO:
+ case MDS_MITIGATION_VMWERV:
+ if (sched_smt_active() && !boot_cpu_has(X86_BUG_MSBDS_ONLY))
+ pr_warn_once(MDS_MSG_SMT);
+ update_mds_branch_idle();
+ break;
+ case MDS_MITIGATION_OFF:
+ break;
+ }
+
+ switch (taa_mitigation) {
+ case TAA_MITIGATION_VERW:
+ case TAA_MITIGATION_AUTO:
+ case TAA_MITIGATION_UCODE_NEEDED:
+ if (sched_smt_active())
+ pr_warn_once(TAA_MSG_SMT);
+ break;
+ case TAA_MITIGATION_TSX_DISABLED:
+ case TAA_MITIGATION_OFF:
+ break;
+ }
+
+ switch (mmio_mitigation) {
+ case MMIO_MITIGATION_VERW:
+ case MMIO_MITIGATION_AUTO:
+ case MMIO_MITIGATION_UCODE_NEEDED:
+ if (sched_smt_active())
+ pr_warn_once(MMIO_MSG_SMT);
+ break;
+ case MMIO_MITIGATION_OFF:
+ break;
+ }
+
+ switch (tsa_mitigation) {
+ case TSA_MITIGATION_USER_KERNEL:
+ case TSA_MITIGATION_VM:
+ case TSA_MITIGATION_AUTO:
+ case TSA_MITIGATION_FULL:
+ /*
+ * TSA-SQ can potentially lead to info leakage between
+ * SMT threads.
+ */
+ if (sched_smt_active())
+ static_branch_enable(&cpu_buf_idle_clear);
+ else
+ static_branch_disable(&cpu_buf_idle_clear);
+ break;
+ case TSA_MITIGATION_NONE:
+ case TSA_MITIGATION_UCODE_NEEDED:
+ break;
+ }
+
+ switch (vmscape_mitigation) {
+ case VMSCAPE_MITIGATION_NONE:
+ case VMSCAPE_MITIGATION_AUTO:
+ break;
+ case VMSCAPE_MITIGATION_IBPB_ON_VMEXIT:
+ case VMSCAPE_MITIGATION_IBPB_EXIT_TO_USER:
+ /*
+ * Hypervisors can be attacked across SMT threads; warn for SMT
+ * when STIBP is not already enabled system-wide.
+ *
+ * Intel eIBRS (!AUTOIBRS) implies STIBP on.
+ */
+ if (!sched_smt_active() ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED ||
+ (spectre_v2_in_eibrs_mode(spectre_v2_enabled) &&
+ !boot_cpu_has(X86_FEATURE_AUTOIBRS)))
+ break;
+ pr_warn_once(VMSCAPE_MSG_SMT);
+ break;
+ }
+
+ mutex_unlock(&spec_ctrl_mutex);
+}
+
+void __init cpu_select_mitigations(void)
+{
+ /*
+ * Read the SPEC_CTRL MSR to account for reserved bits which may
+ * have unknown values. AMD64_LS_CFG MSR is cached in the early AMD
+ * init code as it is not enumerated and depends on the family.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_MSR_SPEC_CTRL)) {
+ rdmsrq(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
+
+ /*
+ * A previously running kernel (kexec) may have some controls
+ * turned ON. Clear them and let the mitigations setup below
+ * rediscover them based on configuration.
+ */
+ x86_spec_ctrl_base &= ~SPEC_CTRL_MITIGATIONS_MASK;
+ }
+
+ x86_arch_cap_msr = x86_read_arch_cap_msr();
+
+ cpu_print_attack_vectors();
+
+ /* Select the proper CPU mitigations before patching alternatives: */
+ spectre_v1_select_mitigation();
+ spectre_v2_select_mitigation();
+ retbleed_select_mitigation();
+ spectre_v2_user_select_mitigation();
+ ssb_select_mitigation();
+ l1tf_select_mitigation();
+ mds_select_mitigation();
+ taa_select_mitigation();
+ mmio_select_mitigation();
+ rfds_select_mitigation();
+ srbds_select_mitigation();
+ l1d_flush_select_mitigation();
+ srso_select_mitigation();
+ gds_select_mitigation();
+ its_select_mitigation();
+ bhi_select_mitigation();
+ tsa_select_mitigation();
+ vmscape_select_mitigation();
+
+ /*
+ * After mitigations are selected, some may need to update their
+ * choices.
+ */
+ spectre_v2_update_mitigation();
+ /*
+ * retbleed_update_mitigation() relies on the state set by
+ * spectre_v2_update_mitigation(); specifically it wants to know about
+ * spectre_v2=ibrs.
+ */
+ retbleed_update_mitigation();
+ /*
+ * its_update_mitigation() depends on spectre_v2_update_mitigation()
+ * and retbleed_update_mitigation().
+ */
+ its_update_mitigation();
+
+ /*
+ * spectre_v2_user_update_mitigation() depends on
+ * retbleed_update_mitigation(), specifically the STIBP
+ * selection is forced for UNRET or IBPB.
+ */
+ spectre_v2_user_update_mitigation();
+ mds_update_mitigation();
+ taa_update_mitigation();
+ mmio_update_mitigation();
+ rfds_update_mitigation();
+ bhi_update_mitigation();
+ /* srso_update_mitigation() depends on retbleed_update_mitigation(). */
+ srso_update_mitigation();
+ vmscape_update_mitigation();
+
+ spectre_v1_apply_mitigation();
+ spectre_v2_apply_mitigation();
+ retbleed_apply_mitigation();
+ spectre_v2_user_apply_mitigation();
+ ssb_apply_mitigation();
+ l1tf_apply_mitigation();
+ mds_apply_mitigation();
+ taa_apply_mitigation();
+ mmio_apply_mitigation();
+ rfds_apply_mitigation();
+ srbds_apply_mitigation();
+ srso_apply_mitigation();
+ gds_apply_mitigation();
+ its_apply_mitigation();
+ bhi_apply_mitigation();
+ tsa_apply_mitigation();
+ vmscape_apply_mitigation();
+}
+
+#ifdef CONFIG_SYSFS
+
+#define L1TF_DEFAULT_MSG "Mitigation: PTE Inversion"
+
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+static const char * const l1tf_vmx_states[] = {
+ [VMENTER_L1D_FLUSH_AUTO] = "auto",
+ [VMENTER_L1D_FLUSH_NEVER] = "vulnerable",
+ [VMENTER_L1D_FLUSH_COND] = "conditional cache flushes",
+ [VMENTER_L1D_FLUSH_ALWAYS] = "cache flushes",
+ [VMENTER_L1D_FLUSH_EPT_DISABLED] = "EPT disabled",
+ [VMENTER_L1D_FLUSH_NOT_REQUIRED] = "flush not necessary"
+};
+
+static ssize_t l1tf_show_state(char *buf)
+{
+ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO)
+ return sysfs_emit(buf, "%s\n", L1TF_DEFAULT_MSG);
+
+ if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_EPT_DISABLED ||
+ (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER &&
+ sched_smt_active())) {
+ return sysfs_emit(buf, "%s; VMX: %s\n", L1TF_DEFAULT_MSG,
+ l1tf_vmx_states[l1tf_vmx_mitigation]);
+ }
+
+ return sysfs_emit(buf, "%s; VMX: %s, SMT %s\n", L1TF_DEFAULT_MSG,
+ l1tf_vmx_states[l1tf_vmx_mitigation],
+ sched_smt_active() ? "vulnerable" : "disabled");
+}
+
+static ssize_t itlb_multihit_show_state(char *buf)
+{
+ if (!boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
+ !boot_cpu_has(X86_FEATURE_VMX))
+ return sysfs_emit(buf, "KVM: Mitigation: VMX unsupported\n");
+ else if (!(cr4_read_shadow() & X86_CR4_VMXE))
+ return sysfs_emit(buf, "KVM: Mitigation: VMX disabled\n");
+ else if (itlb_multihit_kvm_mitigation)
+ return sysfs_emit(buf, "KVM: Mitigation: Split huge pages\n");
+ else
+ return sysfs_emit(buf, "KVM: Vulnerable\n");
+}
+#else
+static ssize_t l1tf_show_state(char *buf)
+{
+ return sysfs_emit(buf, "%s\n", L1TF_DEFAULT_MSG);
+}
+
+static ssize_t itlb_multihit_show_state(char *buf)
+{
+ return sysfs_emit(buf, "Processor vulnerable\n");
+}
+#endif
+
+static ssize_t mds_show_state(char *buf)
+{
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
+ return sysfs_emit(buf, "%s; SMT Host state unknown\n",
+ mds_strings[mds_mitigation]);
+ }
+
+ if (boot_cpu_has_bug(X86_BUG_MSBDS_ONLY)) {
+ return sysfs_emit(buf, "%s; SMT %s\n", mds_strings[mds_mitigation],
+ (mds_mitigation == MDS_MITIGATION_OFF ? "vulnerable" :
+ sched_smt_active() ? "mitigated" : "disabled"));
+ }
+
+ return sysfs_emit(buf, "%s; SMT %s\n", mds_strings[mds_mitigation],
+ sched_smt_active() ? "vulnerable" : "disabled");
+}
+
+static ssize_t tsx_async_abort_show_state(char *buf)
+{
+ if ((taa_mitigation == TAA_MITIGATION_TSX_DISABLED) ||
+ (taa_mitigation == TAA_MITIGATION_OFF))
+ return sysfs_emit(buf, "%s\n", taa_strings[taa_mitigation]);
+
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
+ return sysfs_emit(buf, "%s; SMT Host state unknown\n",
+ taa_strings[taa_mitigation]);
+ }
+
+ return sysfs_emit(buf, "%s; SMT %s\n", taa_strings[taa_mitigation],
+ sched_smt_active() ? "vulnerable" : "disabled");
+}
+
+static ssize_t mmio_stale_data_show_state(char *buf)
{
- identify_boot_cpu();
-#ifndef CONFIG_SMP
- printk(KERN_INFO "CPU: ");
- print_cpu_info(&boot_cpu_data);
+ if (mmio_mitigation == MMIO_MITIGATION_OFF)
+ return sysfs_emit(buf, "%s\n", mmio_strings[mmio_mitigation]);
+
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
+ return sysfs_emit(buf, "%s; SMT Host state unknown\n",
+ mmio_strings[mmio_mitigation]);
+ }
+
+ return sysfs_emit(buf, "%s; SMT %s\n", mmio_strings[mmio_mitigation],
+ sched_smt_active() ? "vulnerable" : "disabled");
+}
+
+static ssize_t rfds_show_state(char *buf)
+{
+ return sysfs_emit(buf, "%s\n", rfds_strings[rfds_mitigation]);
+}
+
+static ssize_t old_microcode_show_state(char *buf)
+{
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ return sysfs_emit(buf, "Unknown: running under hypervisor");
+
+ return sysfs_emit(buf, "Vulnerable\n");
+}
+
+static ssize_t its_show_state(char *buf)
+{
+ return sysfs_emit(buf, "%s\n", its_strings[its_mitigation]);
+}
+
+static char *stibp_state(void)
+{
+ if (spectre_v2_in_eibrs_mode(spectre_v2_enabled) &&
+ !boot_cpu_has(X86_FEATURE_AUTOIBRS))
+ return "";
+
+ switch (spectre_v2_user_stibp) {
+ case SPECTRE_V2_USER_NONE:
+ return "; STIBP: disabled";
+ case SPECTRE_V2_USER_STRICT:
+ return "; STIBP: forced";
+ case SPECTRE_V2_USER_STRICT_PREFERRED:
+ return "; STIBP: always-on";
+ case SPECTRE_V2_USER_PRCTL:
+ case SPECTRE_V2_USER_SECCOMP:
+ if (static_key_enabled(&switch_to_cond_stibp))
+ return "; STIBP: conditional";
+ }
+ return "";
+}
+
+static char *ibpb_state(void)
+{
+ if (boot_cpu_has(X86_FEATURE_IBPB)) {
+ if (static_key_enabled(&switch_mm_always_ibpb))
+ return "; IBPB: always-on";
+ if (static_key_enabled(&switch_mm_cond_ibpb))
+ return "; IBPB: conditional";
+ return "; IBPB: disabled";
+ }
+ return "";
+}
+
+static char *pbrsb_eibrs_state(void)
+{
+ if (boot_cpu_has_bug(X86_BUG_EIBRS_PBRSB)) {
+ if (boot_cpu_has(X86_FEATURE_RSB_VMEXIT_LITE) ||
+ boot_cpu_has(X86_FEATURE_RSB_VMEXIT))
+ return "; PBRSB-eIBRS: SW sequence";
+ else
+ return "; PBRSB-eIBRS: Vulnerable";
+ } else {
+ return "; PBRSB-eIBRS: Not affected";
+ }
+}
+
+static const char *spectre_bhi_state(void)
+{
+ if (!boot_cpu_has_bug(X86_BUG_BHI))
+ return "; BHI: Not affected";
+ else if (boot_cpu_has(X86_FEATURE_CLEAR_BHB_HW))
+ return "; BHI: BHI_DIS_S";
+ else if (boot_cpu_has(X86_FEATURE_CLEAR_BHB_LOOP))
+ return "; BHI: SW loop, KVM: SW loop";
+ else if (boot_cpu_has(X86_FEATURE_RETPOLINE) &&
+ !boot_cpu_has(X86_FEATURE_RETPOLINE_LFENCE) &&
+ rrsba_disabled)
+ return "; BHI: Retpoline";
+ else if (boot_cpu_has(X86_FEATURE_CLEAR_BHB_VMEXIT))
+ return "; BHI: Vulnerable, KVM: SW loop";
+
+ return "; BHI: Vulnerable";
+}
+
+static ssize_t spectre_v2_show_state(char *buf)
+{
+ if (spectre_v2_enabled == SPECTRE_V2_EIBRS && unprivileged_ebpf_enabled())
+ return sysfs_emit(buf, "Vulnerable: eIBRS with unprivileged eBPF\n");
+
+ if (sched_smt_active() && unprivileged_ebpf_enabled() &&
+ spectre_v2_enabled == SPECTRE_V2_EIBRS_LFENCE)
+ return sysfs_emit(buf, "Vulnerable: eIBRS+LFENCE with unprivileged eBPF and SMT\n");
+
+ return sysfs_emit(buf, "%s%s%s%s%s%s%s%s\n",
+ spectre_v2_strings[spectre_v2_enabled],
+ ibpb_state(),
+ boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? "; IBRS_FW" : "",
+ stibp_state(),
+ boot_cpu_has(X86_FEATURE_RSB_CTXSW) ? "; RSB filling" : "",
+ pbrsb_eibrs_state(),
+ spectre_bhi_state(),
+ /* this should always be at the end */
+ spectre_v2_module_string());
+}
+
+static ssize_t srbds_show_state(char *buf)
+{
+ return sysfs_emit(buf, "%s\n", srbds_strings[srbds_mitigation]);
+}
+
+static ssize_t retbleed_show_state(char *buf)
+{
+ if (retbleed_mitigation == RETBLEED_MITIGATION_UNRET ||
+ retbleed_mitigation == RETBLEED_MITIGATION_IBPB) {
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
+ boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
+ return sysfs_emit(buf, "Vulnerable: untrained return thunk / IBPB on non-AMD based uarch\n");
+
+ return sysfs_emit(buf, "%s; SMT %s\n", retbleed_strings[retbleed_mitigation],
+ !sched_smt_active() ? "disabled" :
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT ||
+ spectre_v2_user_stibp == SPECTRE_V2_USER_STRICT_PREFERRED ?
+ "enabled with STIBP protection" : "vulnerable");
+ }
+
+ return sysfs_emit(buf, "%s\n", retbleed_strings[retbleed_mitigation]);
+}
+
+static ssize_t srso_show_state(char *buf)
+{
+ return sysfs_emit(buf, "%s\n", srso_strings[srso_mitigation]);
+}
+
+static ssize_t gds_show_state(char *buf)
+{
+ return sysfs_emit(buf, "%s\n", gds_strings[gds_mitigation]);
+}
+
+static ssize_t tsa_show_state(char *buf)
+{
+ return sysfs_emit(buf, "%s\n", tsa_strings[tsa_mitigation]);
+}
+
+static ssize_t vmscape_show_state(char *buf)
+{
+ return sysfs_emit(buf, "%s\n", vmscape_strings[vmscape_mitigation]);
+}
+
+static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,
+ char *buf, unsigned int bug)
+{
+ if (!boot_cpu_has_bug(bug))
+ return sysfs_emit(buf, "Not affected\n");
+
+ switch (bug) {
+ case X86_BUG_CPU_MELTDOWN:
+ if (boot_cpu_has(X86_FEATURE_PTI))
+ return sysfs_emit(buf, "Mitigation: PTI\n");
+
+ if (hypervisor_is_type(X86_HYPER_XEN_PV))
+ return sysfs_emit(buf, "Unknown (XEN PV detected, hypervisor mitigation required)\n");
+
+ break;
+
+ case X86_BUG_SPECTRE_V1:
+ return sysfs_emit(buf, "%s\n", spectre_v1_strings[spectre_v1_mitigation]);
+
+ case X86_BUG_SPECTRE_V2:
+ return spectre_v2_show_state(buf);
+
+ case X86_BUG_SPEC_STORE_BYPASS:
+ return sysfs_emit(buf, "%s\n", ssb_strings[ssb_mode]);
+
+ case X86_BUG_L1TF:
+ if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
+ return l1tf_show_state(buf);
+ break;
+
+ case X86_BUG_MDS:
+ return mds_show_state(buf);
+
+ case X86_BUG_TAA:
+ return tsx_async_abort_show_state(buf);
+
+ case X86_BUG_ITLB_MULTIHIT:
+ return itlb_multihit_show_state(buf);
+
+ case X86_BUG_SRBDS:
+ return srbds_show_state(buf);
+
+ case X86_BUG_MMIO_STALE_DATA:
+ return mmio_stale_data_show_state(buf);
+
+ case X86_BUG_RETBLEED:
+ return retbleed_show_state(buf);
+
+ case X86_BUG_SRSO:
+ return srso_show_state(buf);
+
+ case X86_BUG_GDS:
+ return gds_show_state(buf);
+
+ case X86_BUG_RFDS:
+ return rfds_show_state(buf);
+
+ case X86_BUG_OLD_MICROCODE:
+ return old_microcode_show_state(buf);
+
+ case X86_BUG_ITS:
+ return its_show_state(buf);
+
+ case X86_BUG_TSA:
+ return tsa_show_state(buf);
+
+ case X86_BUG_VMSCAPE:
+ return vmscape_show_state(buf);
+
+ default:
+ break;
+ }
+
+ return sysfs_emit(buf, "Vulnerable\n");
+}
+
+ssize_t cpu_show_meltdown(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_CPU_MELTDOWN);
+}
+
+ssize_t cpu_show_spectre_v1(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_SPECTRE_V1);
+}
+
+ssize_t cpu_show_spectre_v2(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_SPECTRE_V2);
+}
+
+ssize_t cpu_show_spec_store_bypass(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_SPEC_STORE_BYPASS);
+}
+
+ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
+}
+
+ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
+}
+
+ssize_t cpu_show_tsx_async_abort(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_TAA);
+}
+
+ssize_t cpu_show_itlb_multihit(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_ITLB_MULTIHIT);
+}
+
+ssize_t cpu_show_srbds(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_SRBDS);
+}
+
+ssize_t cpu_show_mmio_stale_data(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_MMIO_STALE_DATA);
+}
+
+ssize_t cpu_show_retbleed(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_RETBLEED);
+}
+
+ssize_t cpu_show_spec_rstack_overflow(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_SRSO);
+}
+
+ssize_t cpu_show_gds(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_GDS);
+}
+
+ssize_t cpu_show_reg_file_data_sampling(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_RFDS);
+}
+
+ssize_t cpu_show_old_microcode(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_OLD_MICROCODE);
+}
+
+ssize_t cpu_show_indirect_target_selection(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_ITS);
+}
+
+ssize_t cpu_show_tsa(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_TSA);
+}
+
+ssize_t cpu_show_vmscape(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return cpu_show_common(dev, attr, buf, X86_BUG_VMSCAPE);
+}
#endif
- check_config();
- check_fpu();
- check_hlt();
- check_popad();
- init_utsname()->machine[1] =
- '0' + (boot_cpu_data.x86 > 6 ? 6 : boot_cpu_data.x86);
- alternative_instructions();
+
+void __warn_thunk(void)
+{
+ WARN_ONCE(1, "Unpatched return thunk in use. This should not happen!\n");
}
diff --git a/arch/x86/kernel/cpu/bugs_64.c b/arch/x86/kernel/cpu/bugs_64.c
deleted file mode 100644
index 04f0fe5af83e..000000000000
--- a/arch/x86/kernel/cpu/bugs_64.c
+++ /dev/null
@@ -1,33 +0,0 @@
-/*
- * Copyright (C) 1994 Linus Torvalds
- * Copyright (C) 2000 SuSE
- */
-
-#include <linux/kernel.h>
-#include <linux/init.h>
-#include <asm/alternative.h>
-#include <asm/bugs.h>
-#include <asm/processor.h>
-#include <asm/mtrr.h>
-#include <asm/cacheflush.h>
-
-void __init check_bugs(void)
-{
- identify_boot_cpu();
-#if !defined(CONFIG_SMP)
- printk(KERN_INFO "CPU: ");
- print_cpu_info(&boot_cpu_data);
-#endif
- alternative_instructions();
-
- /*
- * Make sure the first 2MB area is not mapped by huge pages
- * There are typically fixed size MTRRs in there and overlapping
- * MTRRs into large pages causes slow downs.
- *
- * Right now we don't do that with gbpages because there seems
- * very little benefit for that case.
- */
- if (!direct_gbpages)
- set_memory_4k((unsigned long)__va(0), 1);
-}
diff --git a/arch/x86/kernel/cpu/bus_lock.c b/arch/x86/kernel/cpu/bus_lock.c
new file mode 100644
index 000000000000..dbc99a47be45
--- /dev/null
+++ b/arch/x86/kernel/cpu/bus_lock.c
@@ -0,0 +1,433 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt) "x86/split lock detection: " fmt
+
+#include <linux/semaphore.h>
+#include <linux/workqueue.h>
+#include <linux/delay.h>
+#include <linux/cpuhotplug.h>
+#include <linux/kvm_types.h>
+#include <asm/cpu_device_id.h>
+#include <asm/cmdline.h>
+#include <asm/traps.h>
+#include <asm/cpu.h>
+#include <asm/msr.h>
+
+enum split_lock_detect_state {
+ sld_off = 0,
+ sld_warn,
+ sld_fatal,
+ sld_ratelimit,
+};
+
+/*
+ * Default to sld_off because most systems do not support split lock detection.
+ * sld_state_setup() will switch this to sld_warn on systems that support
+ * split lock/bus lock detection, unless there is a command-line override.
+ */
+static enum split_lock_detect_state sld_state __ro_after_init = sld_off;
+static u64 msr_test_ctrl_cache __ro_after_init;
+
+/*
+ * With a name like MSR_TEST_CTRL it should go without saying, but don't touch
+ * MSR_TEST_CTRL unless the CPU is one of the whitelisted models. Writing it
+ * on CPUs that do not support SLD can cause fireworks, even when writing '0'.
+ */
+static bool cpu_model_supports_sld __ro_after_init;
+
+static const struct {
+ const char *option;
+ enum split_lock_detect_state state;
+} sld_options[] __initconst = {
+ { "off", sld_off },
+ { "warn", sld_warn },
+ { "fatal", sld_fatal },
+ { "ratelimit:", sld_ratelimit },
+};
+
+static struct ratelimit_state bld_ratelimit;
+
+static unsigned int sysctl_sld_mitigate = 1;
+static DEFINE_SEMAPHORE(buslock_sem, 1);
+
+#ifdef CONFIG_PROC_SYSCTL
+static const struct ctl_table sld_sysctls[] = {
+ {
+ .procname = "split_lock_mitigate",
+ .data = &sysctl_sld_mitigate,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
+};
+
+static int __init sld_mitigate_sysctl_init(void)
+{
+ register_sysctl_init("kernel", sld_sysctls);
+ return 0;
+}
+
+late_initcall(sld_mitigate_sysctl_init);
+#endif
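+
+/*
+ * The knob registered above is then reachable at runtime, e.g.:
+ *
+ *   echo 0 > /proc/sys/kernel/split_lock_mitigate
+ *
+ * 0 warns only; 1 (the default here) also applies the slowdown mitigation.
+ */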
+
+static inline bool match_option(const char *arg, int arglen, const char *opt)
+{
+ int len = strlen(opt), ratelimit;
+
+ if (strncmp(arg, opt, len))
+ return false;
+
+ /*
+ * Min ratelimit is 1 bus lock/sec.
+ * Max ratelimit is 1000 bus locks/sec.
+ */
+ if (sscanf(arg, "ratelimit:%d", &ratelimit) == 1 &&
+ ratelimit > 0 && ratelimit <= 1000) {
+ ratelimit_state_init(&bld_ratelimit, HZ, ratelimit);
+ ratelimit_set_flags(&bld_ratelimit, RATELIMIT_MSG_ON_RELEASE);
+ return true;
+ }
+
+ return len == arglen;
+}
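+
+/*
+ * Illustrative boot parameters accepted by the table and parser above:
+ *
+ *   split_lock_detect=warn          # the default on capable hardware
+ *   split_lock_detect=ratelimit:10  # allow at most 10 bus locks/sec
+ */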
+
+static bool split_lock_verify_msr(bool on)
+{
+ u64 ctrl, tmp;
+
+ if (rdmsrq_safe(MSR_TEST_CTRL, &ctrl))
+ return false;
+ if (on)
+ ctrl |= MSR_TEST_CTRL_SPLIT_LOCK_DETECT;
+ else
+ ctrl &= ~MSR_TEST_CTRL_SPLIT_LOCK_DETECT;
+ if (wrmsrq_safe(MSR_TEST_CTRL, ctrl))
+ return false;
+ rdmsrq(MSR_TEST_CTRL, tmp);
+ return ctrl == tmp;
+}
+
+static void __init sld_state_setup(void)
+{
+ enum split_lock_detect_state state = sld_warn;
+ char arg[20];
+ int i, ret;
+
+ if (!boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) &&
+ !boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT))
+ return;
+
+ ret = cmdline_find_option(boot_command_line, "split_lock_detect",
+ arg, sizeof(arg));
+ if (ret >= 0) {
+ for (i = 0; i < ARRAY_SIZE(sld_options); i++) {
+ if (match_option(arg, ret, sld_options[i].option)) {
+ state = sld_options[i].state;
+ break;
+ }
+ }
+ }
+ sld_state = state;
+}
+
+static void __init __split_lock_setup(void)
+{
+ if (!split_lock_verify_msr(false)) {
+ pr_info("MSR access failed: Disabled\n");
+ return;
+ }
+
+ rdmsrq(MSR_TEST_CTRL, msr_test_ctrl_cache);
+
+ if (!split_lock_verify_msr(true)) {
+ pr_info("MSR access failed: Disabled\n");
+ return;
+ }
+
+ /* Restore the MSR to its cached value. */
+ wrmsrq(MSR_TEST_CTRL, msr_test_ctrl_cache);
+
+ setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT);
+}
+
+/*
+ * MSR_TEST_CTRL is per core, but we treat it like a per CPU MSR. Locking
+ * is not implemented as one thread could undo the setting of the other
+ * thread immediately after dropping the lock anyway.
+ */
+static void sld_update_msr(bool on)
+{
+ u64 test_ctrl_val = msr_test_ctrl_cache;
+
+ if (on)
+ test_ctrl_val |= MSR_TEST_CTRL_SPLIT_LOCK_DETECT;
+
+ wrmsrq(MSR_TEST_CTRL, test_ctrl_val);
+}
+
+void split_lock_init(void)
+{
+ /*
+	 * The #DB handler for bus lock deals with the ratelimit case; #AC
+	 * for split lock is disabled.
+ */
+ if (sld_state == sld_ratelimit) {
+ split_lock_verify_msr(false);
+ return;
+ }
+
+ if (cpu_model_supports_sld)
+ split_lock_verify_msr(sld_state != sld_off);
+}
+
+static void __split_lock_reenable_unlock(struct work_struct *work)
+{
+ sld_update_msr(true);
+ up(&buslock_sem);
+}
+
+static DECLARE_DELAYED_WORK(sl_reenable_unlock, __split_lock_reenable_unlock);
+
+static void __split_lock_reenable(struct work_struct *work)
+{
+ sld_update_msr(true);
+}
+/*
+ * In order for each CPU to schedule its delayed work independently of the
+ * others, the delayed work struct must be per-CPU. This is not required when
+ * sysctl_sld_mitigate is enabled because of the semaphore that limits
+ * the number of simultaneously scheduled delayed works to 1.
+ */
+static DEFINE_PER_CPU(struct delayed_work, sl_reenable);
+
+/*
+ * Per-CPU delayed_work can't be statically initialized properly because
+ * the struct address is unknown. Thus per-CPU delayed_work structures
+ * have to be initialized during kernel initialization and after calling
+ * setup_per_cpu_areas().
+ */
+static int __init setup_split_lock_delayed_work(void)
+{
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct delayed_work *work = per_cpu_ptr(&sl_reenable, cpu);
+
+ INIT_DELAYED_WORK(work, __split_lock_reenable);
+ }
+
+ return 0;
+}
+pure_initcall(setup_split_lock_delayed_work);
+
+/*
+ * If a CPU goes offline with pending delayed work to re-enable split lock
+ * detection then the delayed work will be executed on some other CPU. That
+ * handles releasing the buslock_sem, but because it executes on a
+ * different CPU it probably won't re-enable split lock detection. This is a
+ * problem on HT systems since the sibling CPU on the same core may then be
+ * left running with split lock detection disabled.
+ *
+ * Unconditionally re-enable detection here.
+ */
+static int splitlock_cpu_offline(unsigned int cpu)
+{
+ sld_update_msr(true);
+
+ return 0;
+}
+
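+/*
+ * Warn once per offending task, optionally apply the "misery" mitigation,
+ * then disable split lock detection on this CPU so the instruction can
+ * complete, and schedule delayed work (2 jiffies) to re-enable it.
+ */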
+static void split_lock_warn(unsigned long ip)
+{
+ struct delayed_work *work;
+ int cpu;
+ unsigned int saved_sld_mitigate = READ_ONCE(sysctl_sld_mitigate);
+
+ if (!current->reported_split_lock)
+ pr_warn_ratelimited("#AC: %s/%d took a split_lock trap at address: 0x%lx\n",
+ current->comm, current->pid, ip);
+ current->reported_split_lock = 1;
+
+ if (saved_sld_mitigate) {
+ /*
+		 * Misery factor #1:
+		 * sleep 10ms before trying to execute the split lock again.
+ */
+ if (msleep_interruptible(10) > 0)
+ return;
+ /*
+ * Misery factor #2:
+		 * only allow one core at a time to run with split lock
+		 * detection disabled.
+ */
+ if (down_interruptible(&buslock_sem) == -EINTR)
+ return;
+ }
+
+ cpu = get_cpu();
+ work = saved_sld_mitigate ? &sl_reenable_unlock : per_cpu_ptr(&sl_reenable, cpu);
+ schedule_delayed_work_on(cpu, work, 2);
+
+ /* Disable split lock detection on this CPU to make progress */
+ sld_update_msr(false);
+ put_cpu();
+}
+
+bool handle_guest_split_lock(unsigned long ip)
+{
+ if (sld_state == sld_warn) {
+ split_lock_warn(ip);
+ return true;
+ }
+
+ pr_warn_once("#AC: %s/%d %s split_lock trap at address: 0x%lx\n",
+ current->comm, current->pid,
+ sld_state == sld_fatal ? "fatal" : "bogus", ip);
+
+ current->thread.error_code = 0;
+ current->thread.trap_nr = X86_TRAP_AC;
+ force_sig_fault(SIGBUS, BUS_ADRALN, NULL);
+ return false;
+}
+EXPORT_SYMBOL_FOR_KVM(handle_guest_split_lock);
+
+void bus_lock_init(void)
+{
+ u64 val;
+
+ if (!boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT))
+ return;
+
+ rdmsrq(MSR_IA32_DEBUGCTLMSR, val);
+
+ if ((boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) &&
+ (sld_state == sld_warn || sld_state == sld_fatal)) ||
+ sld_state == sld_off) {
+ /*
+		 * The warn and fatal modes are handled by #AC when #AC for
+		 * split lock is supported.
+ */
+ val &= ~DEBUGCTLMSR_BUS_LOCK_DETECT;
+ } else {
+ val |= DEBUGCTLMSR_BUS_LOCK_DETECT;
+ }
+
+ wrmsrq(MSR_IA32_DEBUGCTLMSR, val);
+}
+
+bool handle_user_split_lock(struct pt_regs *regs, long error_code)
+{
+ if ((regs->flags & X86_EFLAGS_AC) || sld_state == sld_fatal)
+ return false;
+ split_lock_warn(regs->ip);
+ return true;
+}
+
+void handle_bus_lock(struct pt_regs *regs)
+{
+ switch (sld_state) {
+ case sld_off:
+ break;
+ case sld_ratelimit:
+ /* Enforce no more than bld_ratelimit bus locks/sec. */
+ while (!__ratelimit(&bld_ratelimit))
+ msleep(20);
+ /* Warn on the bus lock. */
+ fallthrough;
+ case sld_warn:
+ pr_warn_ratelimited("#DB: %s/%d took a bus_lock trap at address: 0x%lx\n",
+ current->comm, current->pid, regs->ip);
+ break;
+ case sld_fatal:
+ force_sig_fault(SIGBUS, BUS_ADRALN, NULL);
+ break;
+ }
+}
+
+/*
+ * CPU models that are known to have the per-core split-lock detection
+ * feature even though they do not enumerate IA32_CORE_CAPABILITIES.
+ */
+static const struct x86_cpu_id split_lock_cpu_ids[] __initconst = {
+ X86_MATCH_VFM(INTEL_ICELAKE_X, 0),
+ X86_MATCH_VFM(INTEL_ICELAKE_L, 0),
+ X86_MATCH_VFM(INTEL_ICELAKE_D, 0),
+ {}
+};
+
+static void __init split_lock_setup(struct cpuinfo_x86 *c)
+{
+ const struct x86_cpu_id *m;
+ u64 ia32_core_caps;
+
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ return;
+
+ /* Check for CPUs that have support but do not enumerate it: */
+ m = x86_match_cpu(split_lock_cpu_ids);
+ if (m)
+ goto supported;
+
+ if (!cpu_has(c, X86_FEATURE_CORE_CAPABILITIES))
+ return;
+
+ /*
+ * Not all bits in MSR_IA32_CORE_CAPS are architectural, but
+ * MSR_IA32_CORE_CAPS_SPLIT_LOCK_DETECT is. All CPUs that set
+ * it have split lock detection.
+ */
+ rdmsrq(MSR_IA32_CORE_CAPS, ia32_core_caps);
+ if (ia32_core_caps & MSR_IA32_CORE_CAPS_SPLIT_LOCK_DETECT)
+ goto supported;
+
+ /* CPU is not in the model list and does not have the MSR bit: */
+ return;
+
+supported:
+ cpu_model_supports_sld = true;
+ __split_lock_setup();
+}
+
+static void sld_state_show(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT) &&
+ !boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT))
+ return;
+
+ switch (sld_state) {
+ case sld_off:
+ pr_info("disabled\n");
+ break;
+ case sld_warn:
+ if (boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT)) {
+ pr_info("#AC: crashing the kernel on kernel split_locks and warning on user-space split_locks\n");
+ if (cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+ "x86/splitlock", NULL, splitlock_cpu_offline) < 0)
+ pr_warn("No splitlock CPU offline handler\n");
+ } else if (boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT)) {
+ pr_info("#DB: warning on user-space bus_locks\n");
+ }
+ break;
+ case sld_fatal:
+ if (boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT)) {
+ pr_info("#AC: crashing the kernel on kernel split_locks and sending SIGBUS on user-space split_locks\n");
+ } else if (boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT)) {
+ pr_info("#DB: sending SIGBUS on user-space bus_locks%s\n",
+ boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) ?
+ " from non-WB" : "");
+ }
+ break;
+ case sld_ratelimit:
+ if (boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT))
+ pr_info("#DB: setting system wide bus lock rate limit to %u/sec\n", bld_ratelimit.burst);
+ break;
+ }
+}
+
+void __init sld_setup(struct cpuinfo_x86 *c)
+{
+ split_lock_setup(c);
+ sld_state_setup();
+ sld_state_show();
+}
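
A quick way to exercise the new paths from userspace: the sketch below (not
part of the patch; buf and p are illustrative names) performs a LOCK-prefixed
increment that straddles a 64-byte cache line, the access pattern that
split_lock_warn() reports. The misaligned cast is deliberate and formally
undefined behavior. Compiled with gcc and run with split_lock_detect=warn it
should trigger the "#AC: ... took a split_lock trap" message above; with
ratelimit:N the same access should instead go through the #DB throttling in
handle_bus_lock().

    #include <stdint.h>
    #include <stdio.h>

    static _Alignas(64) unsigned char buf[128];

    int main(void)
    {
            /* A 4-byte counter at offset 62 straddles the first 64-byte line. */
            uint32_t *p = (uint32_t *)(buf + 62);

            for (int i = 0; i < 3; i++)
                    __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST); /* lock addl */

            printf("count = %u\n", *p);
            return 0;
    }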
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
new file mode 100644
index 000000000000..51a95b07831f
--- /dev/null
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -0,0 +1,820 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * x86 CPU caches detection and configuration
+ *
+ * Previous changes
+ * - Venkatesh Pallipadi: Cache identification through CPUID(0x4)
+ * - Ashok Raj <ashok.raj@intel.com>: Work with CPU hotplug infrastructure
+ * - Andi Kleen / Andreas Herrmann: CPUID(0x4) emulation on AMD
+ */
+
+#include <linux/cacheinfo.h>
+#include <linux/cpu.h>
+#include <linux/cpuhotplug.h>
+#include <linux/stop_machine.h>
+
+#include <asm/amd/nb.h>
+#include <asm/cacheinfo.h>
+#include <asm/cpufeature.h>
+#include <asm/cpuid/api.h>
+#include <asm/mtrr.h>
+#include <asm/smp.h>
+#include <asm/tlbflush.h>
+
+#include "cpu.h"
+
+/* Shared last level cache maps */
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
+
+/* Shared L2 cache maps */
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
+
+static cpumask_var_t cpu_cacheinfo_mask;
+
+/* Kernel controls MTRR and/or PAT MSRs. */
+unsigned int memory_caching_control __ro_after_init;
+
+enum _cache_type {
+ CTYPE_NULL = 0,
+ CTYPE_DATA = 1,
+ CTYPE_INST = 2,
+ CTYPE_UNIFIED = 3
+};
+
+union _cpuid4_leaf_eax {
+ struct {
+ enum _cache_type type :5;
+ unsigned int level :3;
+ unsigned int is_self_initializing :1;
+ unsigned int is_fully_associative :1;
+ unsigned int reserved :4;
+ unsigned int num_threads_sharing :12;
+ unsigned int num_cores_on_die :6;
+ } split;
+ u32 full;
+};
+
+union _cpuid4_leaf_ebx {
+ struct {
+ unsigned int coherency_line_size :12;
+ unsigned int physical_line_partition :10;
+ unsigned int ways_of_associativity :10;
+ } split;
+ u32 full;
+};
+
+union _cpuid4_leaf_ecx {
+ struct {
+ unsigned int number_of_sets :32;
+ } split;
+ u32 full;
+};
+
+struct _cpuid4_info {
+ union _cpuid4_leaf_eax eax;
+ union _cpuid4_leaf_ebx ebx;
+ union _cpuid4_leaf_ecx ecx;
+ unsigned int id;
+ unsigned long size;
+};
+
+/* Map CPUID(0x4) EAX.cache_type to <linux/cacheinfo.h> types */
+static const enum cache_type cache_type_map[] = {
+ [CTYPE_NULL] = CACHE_TYPE_NOCACHE,
+ [CTYPE_DATA] = CACHE_TYPE_DATA,
+ [CTYPE_INST] = CACHE_TYPE_INST,
+ [CTYPE_UNIFIED] = CACHE_TYPE_UNIFIED,
+};
+
+/*
+ * Fallback AMD CPUID(0x4) emulation
+ * AMD CPUs with TOPOEXT can just use CPUID(0x8000001d)
+ *
+ * @AMD_L2_L3_INVALID_ASSOC: cache info for the respective L2/L3 cache should
+ * be determined from CPUID(0x8000001d) instead of CPUID(0x80000006).
+ */
+
+#define AMD_CPUID4_FULLY_ASSOCIATIVE 0xffff
+#define AMD_L2_L3_INVALID_ASSOC 0x9
+
+union l1_cache {
+ struct {
+ unsigned line_size :8;
+ unsigned lines_per_tag :8;
+ unsigned assoc :8;
+ unsigned size_in_kb :8;
+ };
+ unsigned int val;
+};
+
+union l2_cache {
+ struct {
+ unsigned line_size :8;
+ unsigned lines_per_tag :4;
+ unsigned assoc :4;
+ unsigned size_in_kb :16;
+ };
+ unsigned int val;
+};
+
+union l3_cache {
+ struct {
+ unsigned line_size :8;
+ unsigned lines_per_tag :4;
+ unsigned assoc :4;
+ unsigned res :2;
+ unsigned size_encoded :14;
+ };
+ unsigned int val;
+};
+
+/* L2/L3 associativity mapping */
+static const unsigned short assocs[] = {
+ [1] = 1,
+ [2] = 2,
+ [3] = 3,
+ [4] = 4,
+ [5] = 6,
+ [6] = 8,
+ [8] = 16,
+ [0xa] = 32,
+ [0xb] = 48,
+ [0xc] = 64,
+ [0xd] = 96,
+ [0xe] = 128,
+ [0xf] = AMD_CPUID4_FULLY_ASSOCIATIVE
+};
+
+static const unsigned char levels[] = { 1, 1, 2, 3 };
+static const unsigned char types[] = { 1, 2, 3, 3 };
+
+static void legacy_amd_cpuid4(int index, union _cpuid4_leaf_eax *eax,
+ union _cpuid4_leaf_ebx *ebx, union _cpuid4_leaf_ecx *ecx)
+{
+ unsigned int dummy, line_size, lines_per_tag, assoc, size_in_kb;
+ union l1_cache l1i, l1d, *l1;
+ union l2_cache l2;
+ union l3_cache l3;
+
+ eax->full = 0;
+ ebx->full = 0;
+ ecx->full = 0;
+
+ cpuid(0x80000005, &dummy, &dummy, &l1d.val, &l1i.val);
+ cpuid(0x80000006, &dummy, &dummy, &l2.val, &l3.val);
+
+ l1 = &l1d;
+ switch (index) {
+ case 1:
+ l1 = &l1i;
+ fallthrough;
+ case 0:
+ if (!l1->val)
+ return;
+
+ assoc = (l1->assoc == 0xff) ? AMD_CPUID4_FULLY_ASSOCIATIVE : l1->assoc;
+ line_size = l1->line_size;
+ lines_per_tag = l1->lines_per_tag;
+ size_in_kb = l1->size_in_kb;
+ break;
+ case 2:
+ if (!l2.assoc || l2.assoc == AMD_L2_L3_INVALID_ASSOC)
+ return;
+
+ /* Use x86_cache_size as it might have K7 errata fixes */
+ assoc = assocs[l2.assoc];
+ line_size = l2.line_size;
+ lines_per_tag = l2.lines_per_tag;
+ size_in_kb = __this_cpu_read(cpu_info.x86_cache_size);
+ break;
+ case 3:
+ if (!l3.assoc || l3.assoc == AMD_L2_L3_INVALID_ASSOC)
+ return;
+
+ assoc = assocs[l3.assoc];
+ line_size = l3.line_size;
+ lines_per_tag = l3.lines_per_tag;
+ size_in_kb = l3.size_encoded * 512;
+ if (boot_cpu_has(X86_FEATURE_AMD_DCM)) {
+ size_in_kb = size_in_kb >> 1;
+ assoc = assoc >> 1;
+ }
+ break;
+ default:
+ return;
+ }
+
+ eax->split.is_self_initializing = 1;
+ eax->split.type = types[index];
+ eax->split.level = levels[index];
+ eax->split.num_threads_sharing = 0;
+ eax->split.num_cores_on_die = topology_num_cores_per_package();
+
+ if (assoc == AMD_CPUID4_FULLY_ASSOCIATIVE)
+ eax->split.is_fully_associative = 1;
+
+ ebx->split.coherency_line_size = line_size - 1;
+ ebx->split.ways_of_associativity = assoc - 1;
+ ebx->split.physical_line_partition = lines_per_tag - 1;
+ ecx->split.number_of_sets = (size_in_kb * 1024) / line_size /
+ (ebx->split.ways_of_associativity + 1) - 1;
+}
+
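+/*
+ * CPUID(0x4) encodes sets, ways, partitions and line size minus one.
+ * E.g. a 32 KB, 8-way L1d with 64-byte lines reports sets=63, ways=7,
+ * partitions=0, line_size=63: (63+1) * (63+1) * (0+1) * (7+1) = 32768.
+ */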
+static int cpuid4_info_fill_done(struct _cpuid4_info *id4, union _cpuid4_leaf_eax eax,
+ union _cpuid4_leaf_ebx ebx, union _cpuid4_leaf_ecx ecx)
+{
+ if (eax.split.type == CTYPE_NULL)
+ return -EIO;
+
+ id4->eax = eax;
+ id4->ebx = ebx;
+ id4->ecx = ecx;
+ id4->size = (ecx.split.number_of_sets + 1) *
+ (ebx.split.coherency_line_size + 1) *
+ (ebx.split.physical_line_partition + 1) *
+ (ebx.split.ways_of_associativity + 1);
+
+ return 0;
+}
+
+static int amd_fill_cpuid4_info(int index, struct _cpuid4_info *id4)
+{
+ union _cpuid4_leaf_eax eax;
+ union _cpuid4_leaf_ebx ebx;
+ union _cpuid4_leaf_ecx ecx;
+ u32 ignored;
+
+ if (boot_cpu_has(X86_FEATURE_TOPOEXT) || boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
+ cpuid_count(0x8000001d, index, &eax.full, &ebx.full, &ecx.full, &ignored);
+ else
+ legacy_amd_cpuid4(index, &eax, &ebx, &ecx);
+
+ return cpuid4_info_fill_done(id4, eax, ebx, ecx);
+}
+
+static int intel_fill_cpuid4_info(int index, struct _cpuid4_info *id4)
+{
+ union _cpuid4_leaf_eax eax;
+ union _cpuid4_leaf_ebx ebx;
+ union _cpuid4_leaf_ecx ecx;
+ u32 ignored;
+
+ cpuid_count(4, index, &eax.full, &ebx.full, &ecx.full, &ignored);
+
+ return cpuid4_info_fill_done(id4, eax, ebx, ecx);
+}
+
+static int fill_cpuid4_info(int index, struct _cpuid4_info *id4)
+{
+ u8 cpu_vendor = boot_cpu_data.x86_vendor;
+
+ return (cpu_vendor == X86_VENDOR_AMD || cpu_vendor == X86_VENDOR_HYGON) ?
+ amd_fill_cpuid4_info(index, id4) :
+ intel_fill_cpuid4_info(index, id4);
+}
+
+static int find_num_cache_leaves(struct cpuinfo_x86 *c)
+{
+ unsigned int eax, ebx, ecx, edx, op;
+ union _cpuid4_leaf_eax cache_eax;
+ int i = -1;
+
+ /* Do a CPUID(op) loop to calculate num_cache_leaves */
+ op = (c->x86_vendor == X86_VENDOR_AMD || c->x86_vendor == X86_VENDOR_HYGON) ? 0x8000001d : 4;
+ do {
+ ++i;
+ cpuid_count(op, i, &eax, &ebx, &ecx, &edx);
+ cache_eax.full = eax;
+ } while (cache_eax.split.type != CTYPE_NULL);
+ return i;
+}
+
+/*
+ * The maximum number of threads sharing this cache comes from CPUID(0x4)
+ * EAX[25:14], with ECX selecting the cache index. Right-shifting the APIC ID
+ * by that number's order yields the cache ID for this cache node. E.g. when
+ * 16 threads share the cache, get_count_order(16) == 4, so the cache ID is
+ * apicid >> 4.
+ */
+static unsigned int get_cache_id(u32 apicid, const struct _cpuid4_info *id4)
+{
+ unsigned long num_threads_sharing;
+ int index_msb;
+
+ num_threads_sharing = 1 + id4->eax.split.num_threads_sharing;
+ index_msb = get_count_order(num_threads_sharing);
+
+ return apicid >> index_msb;
+}
+
+/*
+ * AMD/Hygon CPUs may have multiple LLCs if L3 caches exist.
+ */
+
+void cacheinfo_amd_init_llc_id(struct cpuinfo_x86 *c, u16 die_id)
+{
+ if (!cpuid_amd_hygon_has_l3_cache())
+ return;
+
+ if (c->x86 < 0x17) {
+ /* Pre-Zen: LLC is at the node level */
+ c->topo.llc_id = die_id;
+ } else if (c->x86 == 0x17 && c->x86_model <= 0x1F) {
+ /*
+ * Family 17h up to 1F models: LLC is at the core
+ * complex level. Core complex ID is ApicId[3].
+ */
+ c->topo.llc_id = c->topo.apicid >> 3;
+ } else {
+ /*
+ * Newer families: LLC ID is calculated from the number
+ * of threads sharing the L3 cache.
+ */
+ u32 llc_index = find_num_cache_leaves(c) - 1;
+ struct _cpuid4_info id4 = {};
+
+ if (!amd_fill_cpuid4_info(llc_index, &id4))
+ c->topo.llc_id = get_cache_id(c->topo.apicid, &id4);
+ }
+}
+
+void cacheinfo_hygon_init_llc_id(struct cpuinfo_x86 *c)
+{
+ if (!cpuid_amd_hygon_has_l3_cache())
+ return;
+
+ /*
+ * Hygons are similar to AMD Family 17h up to 1F models: LLC is
+ * at the core complex level. Core complex ID is ApicId[3].
+ */
+ c->topo.llc_id = c->topo.apicid >> 3;
+}
+
+void init_amd_cacheinfo(struct cpuinfo_x86 *c)
+{
+ struct cpu_cacheinfo *ci = get_cpu_cacheinfo(c->cpu_index);
+
+ if (boot_cpu_has(X86_FEATURE_TOPOEXT))
+ ci->num_leaves = find_num_cache_leaves(c);
+ else if (c->extended_cpuid_level >= 0x80000006)
+ ci->num_leaves = (cpuid_edx(0x80000006) & 0xf000) ? 4 : 3;
+}
+
+void init_hygon_cacheinfo(struct cpuinfo_x86 *c)
+{
+ struct cpu_cacheinfo *ci = get_cpu_cacheinfo(c->cpu_index);
+
+ ci->num_leaves = find_num_cache_leaves(c);
+}
+
+static void intel_cacheinfo_done(struct cpuinfo_x86 *c, unsigned int l3,
+ unsigned int l2, unsigned int l1i, unsigned int l1d)
+{
+ /*
+ * If llc_id is still unset, then cpuid_level < 4, which implies
+ * that the only possibility left is SMT. Since CPUID(0x2) doesn't
+ * specify any shared caches and SMT shares all caches, we can
+ * unconditionally set LLC ID to the package ID so that all
+ * threads share it.
+ */
+ if (c->topo.llc_id == BAD_APICID)
+ c->topo.llc_id = c->topo.pkg_id;
+
+ c->x86_cache_size = l3 ? l3 : (l2 ? l2 : l1i + l1d);
+
+ if (!l2)
+ cpu_detect_cache_sizes(c);
+}
+
+/*
+ * Legacy Intel CPUID(0x2) path if CPUID(0x4) is not available.
+ */
+static void intel_cacheinfo_0x2(struct cpuinfo_x86 *c)
+{
+ unsigned int l1i = 0, l1d = 0, l2 = 0, l3 = 0;
+ const struct leaf_0x2_table *desc;
+ union leaf_0x2_regs regs;
+ u8 *ptr;
+
+ if (c->cpuid_level < 2)
+ return;
+
+ cpuid_leaf_0x2(&regs);
+ for_each_cpuid_0x2_desc(regs, ptr, desc) {
+ switch (desc->c_type) {
+ case CACHE_L1_INST: l1i += desc->c_size; break;
+ case CACHE_L1_DATA: l1d += desc->c_size; break;
+ case CACHE_L2: l2 += desc->c_size; break;
+ case CACHE_L3: l3 += desc->c_size; break;
+ }
+ }
+
+ intel_cacheinfo_done(c, l3, l2, l1i, l1d);
+}
+
+static unsigned int calc_cache_topo_id(struct cpuinfo_x86 *c, const struct _cpuid4_info *id4)
+{
+ unsigned int num_threads_sharing;
+ int index_msb;
+
+ num_threads_sharing = 1 + id4->eax.split.num_threads_sharing;
+ index_msb = get_count_order(num_threads_sharing);
+ return c->topo.apicid & ~((1 << index_msb) - 1);
+}
+
+static bool intel_cacheinfo_0x4(struct cpuinfo_x86 *c)
+{
+ struct cpu_cacheinfo *ci = get_cpu_cacheinfo(c->cpu_index);
+ unsigned int l2_id = BAD_APICID, l3_id = BAD_APICID;
+ unsigned int l1d = 0, l1i = 0, l2 = 0, l3 = 0;
+
+ if (c->cpuid_level < 4)
+ return false;
+
+ /*
+ * There should be at least one leaf. A non-zero value means
+ * that the number of leaves has been previously initialized.
+ */
+ if (!ci->num_leaves)
+ ci->num_leaves = find_num_cache_leaves(c);
+
+ if (!ci->num_leaves)
+ return false;
+
+ for (int i = 0; i < ci->num_leaves; i++) {
+ struct _cpuid4_info id4 = {};
+ int ret;
+
+ ret = intel_fill_cpuid4_info(i, &id4);
+ if (ret < 0)
+ continue;
+
+ switch (id4.eax.split.level) {
+ case 1:
+ if (id4.eax.split.type == CTYPE_DATA)
+ l1d = id4.size / 1024;
+ else if (id4.eax.split.type == CTYPE_INST)
+ l1i = id4.size / 1024;
+ break;
+ case 2:
+ l2 = id4.size / 1024;
+ l2_id = calc_cache_topo_id(c, &id4);
+ break;
+ case 3:
+ l3 = id4.size / 1024;
+ l3_id = calc_cache_topo_id(c, &id4);
+ break;
+ default:
+ break;
+ }
+ }
+
+ c->topo.l2c_id = l2_id;
+ c->topo.llc_id = (l3_id == BAD_APICID) ? l2_id : l3_id;
+ intel_cacheinfo_done(c, l3, l2, l1i, l1d);
+ return true;
+}
+
+void init_intel_cacheinfo(struct cpuinfo_x86 *c)
+{
+ /* Don't use CPUID(0x2) if CPUID(0x4) is supported. */
+ if (intel_cacheinfo_0x4(c))
+ return;
+
+ intel_cacheinfo_0x2(c);
+}
+
+/*
+ * <linux/cacheinfo.h> shared_cpu_map setup, AMD/Hygon
+ */
+static int __cache_amd_cpumap_setup(unsigned int cpu, int index,
+ const struct _cpuid4_info *id4)
+{
+ struct cpu_cacheinfo *this_cpu_ci;
+ struct cacheinfo *ci;
+ int i, sibling;
+
+ /*
+ * For L3, always use the pre-calculated cpu_llc_shared_mask
+ * to derive shared_cpu_map.
+ */
+ if (index == 3) {
+ for_each_cpu(i, cpu_llc_shared_mask(cpu)) {
+ this_cpu_ci = get_cpu_cacheinfo(i);
+ if (!this_cpu_ci->info_list)
+ continue;
+
+ ci = this_cpu_ci->info_list + index;
+ for_each_cpu(sibling, cpu_llc_shared_mask(cpu)) {
+ if (!cpu_online(sibling))
+ continue;
+ cpumask_set_cpu(sibling, &ci->shared_cpu_map);
+ }
+ }
+ } else if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
+ unsigned int apicid, nshared, first, last;
+
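+		/* CPUs whose APIC IDs fall in [first, last] share this cache. */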
+ nshared = id4->eax.split.num_threads_sharing + 1;
+ apicid = cpu_data(cpu).topo.apicid;
+ first = apicid - (apicid % nshared);
+ last = first + nshared - 1;
+
+ for_each_online_cpu(i) {
+ this_cpu_ci = get_cpu_cacheinfo(i);
+ if (!this_cpu_ci->info_list)
+ continue;
+
+ apicid = cpu_data(i).topo.apicid;
+ if ((apicid < first) || (apicid > last))
+ continue;
+
+ ci = this_cpu_ci->info_list + index;
+
+ for_each_online_cpu(sibling) {
+ apicid = cpu_data(sibling).topo.apicid;
+ if ((apicid < first) || (apicid > last))
+ continue;
+ cpumask_set_cpu(sibling, &ci->shared_cpu_map);
+ }
+ }
+ } else
+ return 0;
+
+ return 1;
+}
+
+/*
+ * <linux/cacheinfo.h> shared_cpu_map setup, Intel + fallback AMD/Hygon
+ */
+static void __cache_cpumap_setup(unsigned int cpu, int index,
+ const struct _cpuid4_info *id4)
+{
+ struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+ struct cpuinfo_x86 *c = &cpu_data(cpu);
+ struct cacheinfo *ci, *sibling_ci;
+ unsigned long num_threads_sharing;
+ int index_msb, i;
+
+ if (c->x86_vendor == X86_VENDOR_AMD || c->x86_vendor == X86_VENDOR_HYGON) {
+ if (__cache_amd_cpumap_setup(cpu, index, id4))
+ return;
+ }
+
+ ci = this_cpu_ci->info_list + index;
+ num_threads_sharing = 1 + id4->eax.split.num_threads_sharing;
+
+ cpumask_set_cpu(cpu, &ci->shared_cpu_map);
+ if (num_threads_sharing == 1)
+ return;
+
+ index_msb = get_count_order(num_threads_sharing);
+
+ for_each_online_cpu(i)
+ if (cpu_data(i).topo.apicid >> index_msb == c->topo.apicid >> index_msb) {
+ struct cpu_cacheinfo *sib_cpu_ci = get_cpu_cacheinfo(i);
+
+ /* Skip if itself or no cacheinfo */
+ if (i == cpu || !sib_cpu_ci->info_list)
+ continue;
+
+ sibling_ci = sib_cpu_ci->info_list + index;
+ cpumask_set_cpu(i, &ci->shared_cpu_map);
+ cpumask_set_cpu(cpu, &sibling_ci->shared_cpu_map);
+ }
+}
+
+static void ci_info_init(struct cacheinfo *ci, const struct _cpuid4_info *id4,
+ struct amd_northbridge *nb)
+{
+ ci->id = id4->id;
+ ci->attributes = CACHE_ID;
+ ci->level = id4->eax.split.level;
+ ci->type = cache_type_map[id4->eax.split.type];
+ ci->coherency_line_size = id4->ebx.split.coherency_line_size + 1;
+ ci->ways_of_associativity = id4->ebx.split.ways_of_associativity + 1;
+ ci->size = id4->size;
+ ci->number_of_sets = id4->ecx.split.number_of_sets + 1;
+ ci->physical_line_partition = id4->ebx.split.physical_line_partition + 1;
+ ci->priv = nb;
+}
+
+int init_cache_level(unsigned int cpu)
+{
+ struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
+
+ /* There should be at least one leaf. */
+ if (!ci->num_leaves)
+ return -ENOENT;
+
+ return 0;
+}
+
+int populate_cache_leaves(unsigned int cpu)
+{
+ struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+ struct cacheinfo *ci = this_cpu_ci->info_list;
+ u8 cpu_vendor = boot_cpu_data.x86_vendor;
+ u32 apicid = cpu_data(cpu).topo.apicid;
+ struct amd_northbridge *nb = NULL;
+ struct _cpuid4_info id4 = {};
+ int idx, ret;
+
+ for (idx = 0; idx < this_cpu_ci->num_leaves; idx++) {
+ ret = fill_cpuid4_info(idx, &id4);
+ if (ret)
+ return ret;
+
+ id4.id = get_cache_id(apicid, &id4);
+
+ if (cpu_vendor == X86_VENDOR_AMD || cpu_vendor == X86_VENDOR_HYGON)
+ nb = amd_init_l3_cache(idx);
+
+ ci_info_init(ci++, &id4, nb);
+ __cache_cpumap_setup(cpu, idx, &id4);
+ }
+
+ this_cpu_ci->cpu_map_populated = true;
+ return 0;
+}
+
+/*
+ * Disable and enable caches. Needed for changing MTRRs and the PAT MSR.
+ *
+ * Since we are disabling the cache, don't allow any interrupts:
+ * they would run extremely slowly and would only increase the pain.
+ *
+ * The caller must ensure that local interrupts are disabled and
+ * are reenabled after cache_enable() has been called.
+ */
+static unsigned long saved_cr4;
+static DEFINE_RAW_SPINLOCK(cache_disable_lock);
+
+/*
+ * Cache flushing is the most time-consuming step when programming the
+ * MTRRs. On many Intel CPUs without known errata, it can be skipped
+ * if the CPU declares cache self-snooping support.
+ */
+static void maybe_flush_caches(void)
+{
+ if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+ wbinvd();
+}
+
+void cache_disable(void) __acquires(cache_disable_lock)
+{
+ unsigned long cr0;
+
+ /*
+ * This is not ideal since the cache is only flushed/disabled
+ * for this CPU while the MTRRs are changed, but changing this
+ * requires more invasive changes to the way the kernel boots.
+ */
+ raw_spin_lock(&cache_disable_lock);
+
+ /* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
+ cr0 = read_cr0() | X86_CR0_CD;
+ write_cr0(cr0);
+
+ maybe_flush_caches();
+
+ /* Save value of CR4 and clear Page Global Enable (bit 7) */
+ if (cpu_feature_enabled(X86_FEATURE_PGE)) {
+ saved_cr4 = __read_cr4();
+ __write_cr4(saved_cr4 & ~X86_CR4_PGE);
+ }
+
+ /* Flush all TLBs via a mov %cr3, %reg; mov %reg, %cr3 */
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+ flush_tlb_local();
+
+ if (cpu_feature_enabled(X86_FEATURE_MTRR))
+ mtrr_disable();
+
+ maybe_flush_caches();
+}
+
+void cache_enable(void) __releases(cache_disable_lock)
+{
+ /* Flush TLBs (no need to flush caches - they are disabled) */
+ count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
+ flush_tlb_local();
+
+ if (cpu_feature_enabled(X86_FEATURE_MTRR))
+ mtrr_enable();
+
+ /* Enable caches */
+ write_cr0(read_cr0() & ~X86_CR0_CD);
+
+ /* Restore value of CR4 */
+ if (cpu_feature_enabled(X86_FEATURE_PGE))
+ __write_cr4(saved_cr4);
+
+ raw_spin_unlock(&cache_disable_lock);
+}
+
+static void cache_cpu_init(void)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ if (memory_caching_control & CACHE_MTRR) {
+ cache_disable();
+ mtrr_generic_set_state();
+ cache_enable();
+ }
+
+ if (memory_caching_control & CACHE_PAT)
+ pat_cpu_init();
+
+ local_irq_restore(flags);
+}
+
+static bool cache_aps_delayed_init = true;
+
+void set_cache_aps_delayed_init(bool val)
+{
+ cache_aps_delayed_init = val;
+}
+
+bool get_cache_aps_delayed_init(void)
+{
+ return cache_aps_delayed_init;
+}
+
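+/*
+ * Runs on each CPU via stop_machine() so that MTRR/PAT state is
+ * programmed consistently across all CPUs.
+ */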
+static int cache_rendezvous_handler(void *unused)
+{
+ if (get_cache_aps_delayed_init() || !cpu_online(smp_processor_id()))
+ cache_cpu_init();
+
+ return 0;
+}
+
+void __init cache_bp_init(void)
+{
+ mtrr_bp_init();
+ pat_bp_init();
+
+ if (memory_caching_control)
+ cache_cpu_init();
+}
+
+void cache_bp_restore(void)
+{
+ if (memory_caching_control)
+ cache_cpu_init();
+}
+
+static int cache_ap_online(unsigned int cpu)
+{
+ cpumask_set_cpu(cpu, cpu_cacheinfo_mask);
+
+ if (!memory_caching_control || get_cache_aps_delayed_init())
+ return 0;
+
+ /*
+	 * Ideally we would hold mtrr_mutex here to prevent MTRR entries
+	 * from being changed, but this routine is called at CPU boot time
+	 * and holding the lock would break that.
+	 *
+	 * This routine is called in two cases:
+	 *
+	 * 1. very early in software resume, when there are absolutely
+	 * no MTRR entry changes;
+	 *
+	 * 2. at CPU hotadd time. We let mtrr_add/del_page hold the
+	 * cpuhotplug lock to prevent MTRR entry changes.
+ */
+ stop_machine_from_inactive_cpu(cache_rendezvous_handler, NULL,
+ cpu_cacheinfo_mask);
+
+ return 0;
+}
+
+static int cache_ap_offline(unsigned int cpu)
+{
+ cpumask_clear_cpu(cpu, cpu_cacheinfo_mask);
+ return 0;
+}
+
+/*
+ * Delayed cache initialization for all APs
+ */
+void cache_aps_init(void)
+{
+ if (!memory_caching_control || !get_cache_aps_delayed_init())
+ return;
+
+ stop_machine(cache_rendezvous_handler, NULL, cpu_online_mask);
+ set_cache_aps_delayed_init(false);
+}
+
+static int __init cache_ap_register(void)
+{
+ zalloc_cpumask_var(&cpu_cacheinfo_mask, GFP_KERNEL);
+ cpumask_set_cpu(smp_processor_id(), cpu_cacheinfo_mask);
+
+ cpuhp_setup_state_nocalls(CPUHP_AP_CACHECTRL_STARTING,
+ "x86/cachectrl:starting",
+ cache_ap_online, cache_ap_offline);
+ return 0;
+}
+early_initcall(cache_ap_register);
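
A userspace sketch of the same CPUID(0x4) walk that find_num_cache_leaves()
and intel_fill_cpuid4_info() perform (not part of the patch; assumes GCC's
<cpuid.h>): enumerate subleaves until CTYPE_NULL and recover each cache's
size from the minus-one encoded fields. AMD/Hygon parts with TOPOEXT would
use leaf 0x8000001d instead.

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned int eax, ebx, ecx, edx;

            for (unsigned int i = 0; ; i++) {
                    if (!__get_cpuid_count(4, i, &eax, &ebx, &ecx, &edx))
                            break;
                    if ((eax & 0x1f) == 0)  /* CTYPE_NULL: no more leaves */
                            break;

                    unsigned int level = (eax >> 5) & 0x7;
                    unsigned int line  = (ebx & 0xfff) + 1;
                    unsigned int parts = ((ebx >> 12) & 0x3ff) + 1;
                    unsigned int ways  = ((ebx >> 22) & 0x3ff) + 1;

                    printf("index %u: L%u cache, %u KB\n", i, level,
                           ways * parts * line * (ecx + 1) / 1024);
            }
            return 0;
    }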
diff --git a/arch/x86/kernel/cpu/centaur.c b/arch/x86/kernel/cpu/centaur.c
index e58d978e0758..a3b55db35c96 100644
--- a/arch/x86/kernel/cpu/centaur.c
+++ b/arch/x86/kernel/cpu/centaur.c
@@ -1,244 +1,16 @@
-#include <linux/bitops.h>
-#include <linux/kernel.h>
-#include <linux/init.h>
+// SPDX-License-Identifier: GPL-2.0
-#include <asm/processor.h>
-#include <asm/e820.h>
+#include <linux/sched.h>
+#include <linux/sched/clock.h>
+
+#include <asm/cpu.h>
+#include <asm/cpufeature.h>
+#include <asm/e820/api.h>
#include <asm/mtrr.h>
#include <asm/msr.h>
#include "cpu.h"
-#ifdef CONFIG_X86_OOSTORE
-
-static u32 __cpuinit power2(u32 x)
-{
- u32 s = 1;
-
- while (s <= x)
- s <<= 1;
-
- return s >>= 1;
-}
-
-
-/*
- * Set up an actual MCR
- */
-static void __cpuinit centaur_mcr_insert(int reg, u32 base, u32 size, int key)
-{
- u32 lo, hi;
-
- hi = base & ~0xFFF;
- lo = ~(size-1); /* Size is a power of 2 so this makes a mask */
- lo &= ~0xFFF; /* Remove the ctrl value bits */
- lo |= key; /* Attribute we wish to set */
- wrmsr(reg+MSR_IDT_MCR0, lo, hi);
- mtrr_centaur_report_mcr(reg, lo, hi); /* Tell the mtrr driver */
-}
-
-/*
- * Figure what we can cover with MCR's
- *
- * Shortcut: We know you can't put 4Gig of RAM on a winchip
- */
-static u32 __cpuinit ramtop(void)
-{
- u32 clip = 0xFFFFFFFFUL;
- u32 top = 0;
- int i;
-
- for (i = 0; i < e820.nr_map; i++) {
- unsigned long start, end;
-
- if (e820.map[i].addr > 0xFFFFFFFFUL)
- continue;
- /*
- * Don't MCR over reserved space. Ignore the ISA hole
- * we frob around that catastrophe already
- */
- if (e820.map[i].type == E820_RESERVED) {
- if (e820.map[i].addr >= 0x100000UL &&
- e820.map[i].addr < clip)
- clip = e820.map[i].addr;
- continue;
- }
- start = e820.map[i].addr;
- end = e820.map[i].addr + e820.map[i].size;
- if (start >= end)
- continue;
- if (end > top)
- top = end;
- }
- /*
- * Everything below 'top' should be RAM except for the ISA hole.
- * Because of the limited MCR's we want to map NV/ACPI into our
- * MCR range for gunk in RAM
- *
- * Clip might cause us to MCR insufficient RAM but that is an
- * acceptable failure mode and should only bite obscure boxes with
- * a VESA hole at 15Mb
- *
- * The second case Clip sometimes kicks in is when the EBDA is marked
- * as reserved. Again we fail safe with reasonable results
- */
- if (top > clip)
- top = clip;
-
- return top;
-}
-
-/*
- * Compute a set of MCR's to give maximum coverage
- */
-static int __cpuinit centaur_mcr_compute(int nr, int key)
-{
- u32 mem = ramtop();
- u32 root = power2(mem);
- u32 base = root;
- u32 top = root;
- u32 floor = 0;
- int ct = 0;
-
- while (ct < nr) {
- u32 fspace = 0;
- u32 high;
- u32 low;
-
- /*
- * Find the largest block we will fill going upwards
- */
- high = power2(mem-top);
-
- /*
- * Find the largest block we will fill going downwards
- */
- low = base/2;
-
- /*
- * Don't fill below 1Mb going downwards as there
- * is an ISA hole in the way.
- */
- if (base <= 1024*1024)
- low = 0;
-
- /*
- * See how much space we could cover by filling below
- * the ISA hole
- */
-
- if (floor == 0)
- fspace = 512*1024;
- else if (floor == 512*1024)
- fspace = 128*1024;
-
- /* And forget ROM space */
-
- /*
- * Now install the largest coverage we get
- */
- if (fspace > high && fspace > low) {
- centaur_mcr_insert(ct, floor, fspace, key);
- floor += fspace;
- } else if (high > low) {
- centaur_mcr_insert(ct, top, high, key);
- top += high;
- } else if (low > 0) {
- base -= low;
- centaur_mcr_insert(ct, base, low, key);
- } else
- break;
- ct++;
- }
- /*
- * We loaded ct values. We now need to set the mask. The caller
- * must do this bit.
- */
- return ct;
-}
-
-static void __cpuinit centaur_create_optimal_mcr(void)
-{
- int used;
- int i;
-
- /*
- * Allocate up to 6 mcrs to mark as much of ram as possible
- * as write combining and weak write ordered.
- *
- * To experiment with: Linux never uses stack operations for
- * mmio spaces so we could globally enable stack operation wc
- *
- * Load the registers with type 31 - full write combining, all
- * writes weakly ordered.
- */
- used = centaur_mcr_compute(6, 31);
-
- /*
- * Wipe unused MCRs
- */
- for (i = used; i < 8; i++)
- wrmsr(MSR_IDT_MCR0+i, 0, 0);
-}
-
-static void __cpuinit winchip2_create_optimal_mcr(void)
-{
- u32 lo, hi;
- int used;
- int i;
-
- /*
- * Allocate up to 6 mcrs to mark as much of ram as possible
- * as write combining, weak store ordered.
- *
- * Load the registers with type 25
- * 8 - weak write ordering
- * 16 - weak read ordering
- * 1 - write combining
- */
- used = centaur_mcr_compute(6, 25);
-
- /*
- * Mark the registers we are using.
- */
- rdmsr(MSR_IDT_MCR_CTRL, lo, hi);
- for (i = 0; i < used; i++)
- lo |= 1<<(9+i);
- wrmsr(MSR_IDT_MCR_CTRL, lo, hi);
-
- /*
- * Wipe unused MCRs
- */
-
- for (i = used; i < 8; i++)
- wrmsr(MSR_IDT_MCR0+i, 0, 0);
-}
-
-/*
- * Handle the MCR key on the Winchip 2.
- */
-static void __cpuinit winchip2_unprotect_mcr(void)
-{
- u32 lo, hi;
- u32 key;
-
- rdmsr(MSR_IDT_MCR_CTRL, lo, hi);
- lo &= ~0x1C0; /* blank bits 8-6 */
- key = (lo>>17) & 7;
- lo |= key<<6; /* replace with unlock key */
- wrmsr(MSR_IDT_MCR_CTRL, lo, hi);
-}
-
-static void __cpuinit winchip2_protect_mcr(void)
-{
- u32 lo, hi;
-
- rdmsr(MSR_IDT_MCR_CTRL, lo, hi);
- lo &= ~0x1C0; /* blank bits 8-6 */
- wrmsr(MSR_IDT_MCR_CTRL, lo, hi);
-}
-#endif /* CONFIG_X86_OOSTORE */
-
#define ACE_PRESENT (1 << 6)
#define ACE_ENABLED (1 << 7)
#define ACE_FCR (1 << 28) /* MSR_VIA_FCR */
@@ -247,7 +19,7 @@ static void __cpuinit winchip2_protect_mcr(void)
#define RNG_ENABLED (1 << 3)
#define RNG_ENABLE (1 << 6) /* MSR_VIA_RNG */
-static void __cpuinit init_c3(struct cpuinfo_x86 *c)
+static void init_c3(struct cpuinfo_x86 *c)
{
u32 lo, hi;
@@ -260,7 +32,7 @@ static void __cpuinit init_c3(struct cpuinfo_x86 *c)
rdmsr(MSR_VIA_FCR, lo, hi);
lo |= ACE_FCR; /* enable ACE unit */
wrmsr(MSR_VIA_FCR, lo, hi);
- printk(KERN_INFO "CPU: Enabled ACE h/w crypto\n");
+ pr_info("CPU: Enabled ACE h/w crypto\n");
}
/* enable RNG unit, if present and disabled */
@@ -268,17 +40,17 @@ static void __cpuinit init_c3(struct cpuinfo_x86 *c)
rdmsr(MSR_VIA_RNG, lo, hi);
lo |= RNG_ENABLE; /* enable RNG unit */
wrmsr(MSR_VIA_RNG, lo, hi);
- printk(KERN_INFO "CPU: Enabled h/w RNG\n");
+ pr_info("CPU: Enabled h/w RNG\n");
}
/* store Centaur Extended Feature Flags as
* word 5 of the CPU capability bit array
*/
- c->x86_capability[5] = cpuid_edx(0xC0000001);
+ c->x86_capability[CPUID_C000_0001_EDX] = cpuid_edx(0xC0000001);
}
#ifdef CONFIG_X86_32
/* Cyrix III family needs CX8 & PGE explicitly enabled. */
- if (c->x86_model >= 6 && c->x86_model <= 9) {
+ if (c->x86_model >= 6 && c->x86_model <= 13) {
rdmsr(MSR_VIA_FCR, lo, hi);
lo |= (1<<1 | 1<<7);
wrmsr(MSR_VIA_FCR, lo, hi);
@@ -294,7 +66,8 @@ static void __cpuinit init_c3(struct cpuinfo_x86 *c)
set_cpu_cap(c, X86_FEATURE_REP_GOOD);
}
- cpu_detect_cache_sizes(c);
+ if (c->x86 >= 7)
+ set_cpu_cap(c, X86_FEATURE_REP_GOOD);
}
enum {
@@ -318,26 +91,27 @@ enum {
EAMD3D = 1<<20,
};
-static void __cpuinit early_init_centaur(struct cpuinfo_x86 *c)
+static void early_init_centaur(struct cpuinfo_x86 *c)
{
- switch (c->x86) {
#ifdef CONFIG_X86_32
- case 5:
- /* Emulate MTRRs using Centaur's MCR. */
+ /* Emulate MTRRs using Centaur's MCR. */
+ if (c->x86 == 5)
set_cpu_cap(c, X86_FEATURE_CENTAUR_MCR);
- break;
#endif
- case 6:
- if (c->x86_model >= 0xf)
- set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
- break;
- }
+ if ((c->x86 == 6 && c->x86_model >= 0xf) ||
+ (c->x86 >= 7))
+ set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
+
#ifdef CONFIG_X86_64
set_cpu_cap(c, X86_FEATURE_SYSENTER32);
#endif
+ if (c->x86_power & (1 << 8)) {
+ set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
+ set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
+ }
}
-static void __cpuinit init_centaur(struct cpuinfo_x86 *c)
+static void init_centaur(struct cpuinfo_x86 *c)
{
#ifdef CONFIG_X86_32
char *name;
@@ -353,33 +127,32 @@ static void __cpuinit init_centaur(struct cpuinfo_x86 *c)
clear_cpu_cap(c, 0*32+31);
#endif
early_init_centaur(c);
- switch (c->x86) {
+ init_intel_cacheinfo(c);
+
+ if (c->cpuid_level > 9) {
+ unsigned int eax = cpuid_eax(10);
+
+ /*
+		 * Check the version and the number of counters:
+		 * the version (eax[7:0]) can't be 0 and the number of
+		 * counters (eax[15:8]) should be greater than 1.
+ */
+ if ((eax & 0xff) && (((eax >> 8) & 0xff) > 1))
+ set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
+ }
+
#ifdef CONFIG_X86_32
- case 5:
+ if (c->x86 == 5) {
switch (c->x86_model) {
case 4:
name = "C6";
fcr_set = ECX8|DSMC|EDCTLB|EMMX|ERETSTK;
fcr_clr = DPDC;
- printk(KERN_NOTICE "Disabling bugged TSC.\n");
+ pr_notice("Disabling bugged TSC.\n");
clear_cpu_cap(c, X86_FEATURE_TSC);
-#ifdef CONFIG_X86_OOSTORE
- centaur_create_optimal_mcr();
- /*
- * Enable:
- * write combining on non-stack, non-string
- * write combining on string, all types
- * weak write ordering
- *
- * The C6 original lacks weak read order
- *
- * Note 0x120 is write only on Winchip 1
- */
- wrmsr(MSR_IDT_MCR_CTRL, 0x01F0001F, 0);
-#endif
break;
case 8:
- switch (c->x86_mask) {
+ switch (c->x86_stepping) {
default:
name = "2";
break;
@@ -393,40 +166,12 @@ static void __cpuinit init_centaur(struct cpuinfo_x86 *c)
fcr_set = ECX8|DSMC|DTLOCK|EMMX|EBRPRED|ERETSTK|
E2MMX|EAMD3D;
fcr_clr = DPDC;
-#ifdef CONFIG_X86_OOSTORE
- winchip2_unprotect_mcr();
- winchip2_create_optimal_mcr();
- rdmsr(MSR_IDT_MCR_CTRL, lo, hi);
- /*
- * Enable:
- * write combining on non-stack, non-string
- * write combining on string, all types
- * weak write ordering
- */
- lo |= 31;
- wrmsr(MSR_IDT_MCR_CTRL, lo, hi);
- winchip2_protect_mcr();
-#endif
break;
case 9:
name = "3";
fcr_set = ECX8|DSMC|DTLOCK|EMMX|EBRPRED|ERETSTK|
E2MMX|EAMD3D;
fcr_clr = DPDC;
-#ifdef CONFIG_X86_OOSTORE
- winchip2_unprotect_mcr();
- winchip2_create_optimal_mcr();
- rdmsr(MSR_IDT_MCR_CTRL, lo, hi);
- /*
- * Enable:
- * write combining on non-stack, non-string
- * write combining on string, all types
- * weak write ordering
- */
- lo |= 31;
- wrmsr(MSR_IDT_MCR_CTRL, lo, hi);
- winchip2_protect_mcr();
-#endif
break;
default:
name = "??";
@@ -436,11 +181,11 @@ static void __cpuinit init_centaur(struct cpuinfo_x86 *c)
newlo = (lo|fcr_set) & (~fcr_clr);
if (newlo != lo) {
- printk(KERN_INFO "Centaur FCR was 0x%X now 0x%X\n",
+ pr_info("Centaur FCR was 0x%X now 0x%X\n",
lo, newlo);
wrmsr(MSR_IDT_FCR1, newlo, hi);
} else {
- printk(KERN_INFO "Centaur FCR is 0x%X\n", lo);
+ pr_info("Centaur FCR is 0x%X\n", lo);
}
/* Emulate MTRRs using Centaur's MCR. */
set_cpu_cap(c, X86_FEATURE_CENTAUR_MCR);
@@ -457,21 +202,21 @@ static void __cpuinit init_centaur(struct cpuinfo_x86 *c)
c->x86_cache_size = (cc>>24)+(dd>>24);
}
sprintf(c->x86_model_id, "WinChip %s", name);
- break;
+ }
#endif
- case 6:
+ if (c->x86 == 6 || c->x86 >= 7)
init_c3(c);
- break;
- }
#ifdef CONFIG_X86_64
set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
#endif
+
+ init_ia32_feat_ctl(c);
}
-static unsigned int __cpuinit
+#ifdef CONFIG_X86_32
+static unsigned int
centaur_size_cache(struct cpuinfo_x86 *c, unsigned int size)
{
-#ifdef CONFIG_X86_32
/* VIA C3 CPUs (670-68F) need further shifting. */
if ((c->x86 == 6) && ((c->x86_model == 7) || (c->x86_model == 8)))
size >>= 8;
@@ -482,18 +227,20 @@ centaur_size_cache(struct cpuinfo_x86 *c, unsigned int size)
* - Note, it seems this may only be in engineering samples.
*/
if ((c->x86 == 6) && (c->x86_model == 9) &&
- (c->x86_mask == 1) && (size == 65))
+ (c->x86_stepping == 1) && (size == 65))
size -= 1;
-#endif
return size;
}
+#endif
-static const struct cpu_dev __cpuinitconst centaur_cpu_dev = {
+static const struct cpu_dev centaur_cpu_dev = {
.c_vendor = "Centaur",
.c_ident = { "CentaurHauls" },
.c_early_init = early_init_centaur,
.c_init = init_centaur,
- .c_size_cache = centaur_size_cache,
+#ifdef CONFIG_X86_32
+ .legacy_cache_size = centaur_size_cache,
+#endif
.c_x86_vendor = X86_VENDOR_CENTAUR,
};
diff --git a/arch/x86/kernel/cpu/cmpxchg.c b/arch/x86/kernel/cpu/cmpxchg.c
deleted file mode 100644
index 2056ccf572cc..000000000000
--- a/arch/x86/kernel/cpu/cmpxchg.c
+++ /dev/null
@@ -1,72 +0,0 @@
-/*
- * cmpxchg*() fallbacks for CPU not supporting these instructions
- */
-
-#include <linux/kernel.h>
-#include <linux/smp.h>
-#include <linux/module.h>
-
-#ifndef CONFIG_X86_CMPXCHG
-unsigned long cmpxchg_386_u8(volatile void *ptr, u8 old, u8 new)
-{
- u8 prev;
- unsigned long flags;
-
- /* Poor man's cmpxchg for 386. Unsuitable for SMP */
- local_irq_save(flags);
- prev = *(u8 *)ptr;
- if (prev == old)
- *(u8 *)ptr = new;
- local_irq_restore(flags);
- return prev;
-}
-EXPORT_SYMBOL(cmpxchg_386_u8);
-
-unsigned long cmpxchg_386_u16(volatile void *ptr, u16 old, u16 new)
-{
- u16 prev;
- unsigned long flags;
-
- /* Poor man's cmpxchg for 386. Unsuitable for SMP */
- local_irq_save(flags);
- prev = *(u16 *)ptr;
- if (prev == old)
- *(u16 *)ptr = new;
- local_irq_restore(flags);
- return prev;
-}
-EXPORT_SYMBOL(cmpxchg_386_u16);
-
-unsigned long cmpxchg_386_u32(volatile void *ptr, u32 old, u32 new)
-{
- u32 prev;
- unsigned long flags;
-
- /* Poor man's cmpxchg for 386. Unsuitable for SMP */
- local_irq_save(flags);
- prev = *(u32 *)ptr;
- if (prev == old)
- *(u32 *)ptr = new;
- local_irq_restore(flags);
- return prev;
-}
-EXPORT_SYMBOL(cmpxchg_386_u32);
-#endif
-
-#ifndef CONFIG_X86_CMPXCHG64
-unsigned long long cmpxchg_486_u64(volatile void *ptr, u64 old, u64 new)
-{
- u64 prev;
- unsigned long flags;
-
- /* Poor man's cmpxchg8b for 386 and 486. Unsuitable for SMP */
- local_irq_save(flags);
- prev = *(u64 *)ptr;
- if (prev == old)
- *(u64 *)ptr = new;
- local_irq_restore(flags);
- return prev;
-}
-EXPORT_SYMBOL(cmpxchg_486_u64);
-#endif
-
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 68e4a6f2211e..e7ab22fce3b5 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1,64 +1,185 @@
-#include <linux/bootmem.h>
+// SPDX-License-Identifier: GPL-2.0-only
+/* cpu_feature_enabled() cannot be used this early */
+#define USE_EARLY_PGTABLE_L5
+
+#include <linux/memblock.h>
#include <linux/linkage.h>
#include <linux/bitops.h>
#include <linux/kernel.h>
-#include <linux/module.h>
+#include <linux/export.h>
+#include <linux/kvm_types.h>
#include <linux/percpu.h>
#include <linux/string.h>
+#include <linux/ctype.h>
#include <linux/delay.h>
-#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/clock.h>
+#include <linux/sched/task.h>
+#include <linux/sched/smt.h>
#include <linux/init.h>
+#include <linux/kprobes.h>
#include <linux/kgdb.h>
+#include <linux/mem_encrypt.h>
#include <linux/smp.h>
+#include <linux/cpu.h>
#include <linux/io.h>
-
-#include <asm/stackprotector.h>
+#include <linux/syscore_ops.h>
+#include <linux/pgtable.h>
+#include <linux/stackprotector.h>
+#include <linux/utsname.h>
+#include <linux/efi.h>
+
+#include <asm/alternative.h>
+#include <asm/cmdline.h>
+#include <asm/cpuid/api.h>
#include <asm/perf_event.h>
#include <asm/mmu_context.h>
+#include <asm/doublefault.h>
+#include <asm/archrandom.h>
#include <asm/hypervisor.h>
#include <asm/processor.h>
+#include <asm/tlbflush.h>
+#include <asm/debugreg.h>
#include <asm/sections.h>
+#include <asm/vsyscall.h>
#include <linux/topology.h>
#include <linux/cpumask.h>
-#include <asm/pgtable.h>
-#include <asm/atomic.h>
+#include <linux/atomic.h>
#include <asm/proto.h>
#include <asm/setup.h>
#include <asm/apic.h>
#include <asm/desc.h>
-#include <asm/i387.h>
+#include <asm/fpu/api.h>
#include <asm/mtrr.h>
+#include <asm/hwcap2.h>
#include <linux/numa.h>
+#include <asm/numa.h>
#include <asm/asm.h>
+#include <asm/bugs.h>
#include <asm/cpu.h>
#include <asm/mce.h>
#include <asm/msr.h>
-#include <asm/pat.h>
-
-#ifdef CONFIG_X86_LOCAL_APIC
+#include <asm/cacheinfo.h>
+#include <asm/memtype.h>
+#include <asm/microcode.h>
+#include <asm/intel-family.h>
+#include <asm/cpu_device_id.h>
+#include <asm/fred.h>
#include <asm/uv/uv.h>
-#endif
+#include <asm/ia32.h>
+#include <asm/set_memory.h>
+#include <asm/traps.h>
+#include <asm/sev.h>
+#include <asm/tdx.h>
+#include <asm/posted_intr.h>
+#include <asm/runtime-const.h>
#include "cpu.h"
-/* all of these masks are initialized in setup_cpu_local_masks() */
-cpumask_var_t cpu_initialized_mask;
-cpumask_var_t cpu_callout_mask;
-cpumask_var_t cpu_callin_mask;
+DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
+EXPORT_PER_CPU_SYMBOL(cpu_info);
+
+/* Used for modules: built-in code uses runtime constants */
+unsigned long USER_PTR_MAX;
+EXPORT_SYMBOL(USER_PTR_MAX);
+
+u32 elf_hwcap2 __read_mostly;
+
+/* Number of siblings per CPU package */
+unsigned int __max_threads_per_core __ro_after_init = 1;
+EXPORT_SYMBOL(__max_threads_per_core);
+
+unsigned int __max_dies_per_package __ro_after_init = 1;
+EXPORT_SYMBOL(__max_dies_per_package);
+
+unsigned int __max_logical_packages __ro_after_init = 1;
+EXPORT_SYMBOL(__max_logical_packages);
+
+unsigned int __num_cores_per_package __ro_after_init = 1;
+EXPORT_SYMBOL(__num_cores_per_package);
+
+unsigned int __num_threads_per_package __ro_after_init = 1;
+EXPORT_SYMBOL(__num_threads_per_package);
+
+static struct ppin_info {
+ int feature;
+ int msr_ppin_ctl;
+ int msr_ppin;
+} ppin_info[] = {
+ [X86_VENDOR_INTEL] = {
+ .feature = X86_FEATURE_INTEL_PPIN,
+ .msr_ppin_ctl = MSR_PPIN_CTL,
+ .msr_ppin = MSR_PPIN
+ },
+ [X86_VENDOR_AMD] = {
+ .feature = X86_FEATURE_AMD_PPIN,
+ .msr_ppin_ctl = MSR_AMD_PPIN_CTL,
+ .msr_ppin = MSR_AMD_PPIN
+ },
+};
-/* representing cpus for which sibling maps can be computed */
-cpumask_var_t cpu_sibling_setup_mask;
+static const struct x86_cpu_id ppin_cpuids[] = {
+ X86_MATCH_FEATURE(X86_FEATURE_AMD_PPIN, &ppin_info[X86_VENDOR_AMD]),
+ X86_MATCH_FEATURE(X86_FEATURE_INTEL_PPIN, &ppin_info[X86_VENDOR_INTEL]),
+
+ /* Legacy models without CPUID enumeration */
+ X86_MATCH_VFM(INTEL_IVYBRIDGE_X, &ppin_info[X86_VENDOR_INTEL]),
+ X86_MATCH_VFM(INTEL_HASWELL_X, &ppin_info[X86_VENDOR_INTEL]),
+ X86_MATCH_VFM(INTEL_BROADWELL_D, &ppin_info[X86_VENDOR_INTEL]),
+ X86_MATCH_VFM(INTEL_BROADWELL_X, &ppin_info[X86_VENDOR_INTEL]),
+ X86_MATCH_VFM(INTEL_SKYLAKE_X, &ppin_info[X86_VENDOR_INTEL]),
+ X86_MATCH_VFM(INTEL_ICELAKE_X, &ppin_info[X86_VENDOR_INTEL]),
+ X86_MATCH_VFM(INTEL_ICELAKE_D, &ppin_info[X86_VENDOR_INTEL]),
+ X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X, &ppin_info[X86_VENDOR_INTEL]),
+ X86_MATCH_VFM(INTEL_EMERALDRAPIDS_X, &ppin_info[X86_VENDOR_INTEL]),
+ X86_MATCH_VFM(INTEL_XEON_PHI_KNL, &ppin_info[X86_VENDOR_INTEL]),
+ X86_MATCH_VFM(INTEL_XEON_PHI_KNM, &ppin_info[X86_VENDOR_INTEL]),
+
+ {}
+};
-/* correctly size the local cpu masks */
-void __init setup_cpu_local_masks(void)
+static void ppin_init(struct cpuinfo_x86 *c)
{
- alloc_bootmem_cpumask_var(&cpu_initialized_mask);
- alloc_bootmem_cpumask_var(&cpu_callin_mask);
- alloc_bootmem_cpumask_var(&cpu_callout_mask);
- alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
+ const struct x86_cpu_id *id;
+ unsigned long long val;
+ struct ppin_info *info;
+
+ id = x86_match_cpu(ppin_cpuids);
+ if (!id)
+ return;
+
+ /*
+	 * Testing the presence of the MSR is not enough. We also need to
+	 * check that PPIN_CTL allows reading of the PPIN.
+ */
+ info = (struct ppin_info *)id->driver_data;
+
+ if (rdmsrq_safe(info->msr_ppin_ctl, &val))
+ goto clear_ppin;
+
+ if ((val & 3UL) == 1UL) {
+ /* PPIN locked in disabled mode */
+ goto clear_ppin;
+ }
+
+ /* If PPIN is disabled, try to enable */
+ if (!(val & 2UL)) {
+ wrmsrq_safe(info->msr_ppin_ctl, val | 2UL);
+ rdmsrq_safe(info->msr_ppin_ctl, &val);
+ }
+
+ /* Is the enable bit set? */
+ if (val & 2UL) {
+ c->ppin = native_rdmsrq(info->msr_ppin);
+ set_cpu_cap(c, info->feature);
+ return;
+ }
+
+clear_ppin:
+ setup_clear_cpu_cap(info->feature);
}
-static void __cpuinit default_init(struct cpuinfo_x86 *c)
+static void default_init(struct cpuinfo_x86 *c)
{
#ifdef CONFIG_X86_64
cpu_detect_cache_sizes(c);
@@ -75,13 +196,13 @@ static void __cpuinit default_init(struct cpuinfo_x86 *c)
#endif
}
-static const struct cpu_dev __cpuinitconst default_cpu = {
+static const struct cpu_dev default_cpu = {
.c_init = default_init,
.c_vendor = "Unknown",
.c_x86_vendor = X86_VENDOR_UNKNOWN,
};
-static const struct cpu_dev *this_cpu __cpuinitdata = &default_cpu;
+static const struct cpu_dev *this_cpu = &default_cpu;
DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#ifdef CONFIG_X86_64
@@ -93,87 +214,83 @@ DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
* TLS descriptors are currently at a different place compared to i386.
* Hopefully nobody expects them at a fixed place (Wine?)
*/
- [GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
- [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
- [GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
- [GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
- [GDT_ENTRY_DEFAULT_USER_DS] = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
- [GDT_ENTRY_DEFAULT_USER_CS] = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
+ [GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(DESC_CODE32, 0, 0xfffff),
+ [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(DESC_CODE64, 0, 0xfffff),
+ [GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA64, 0, 0xfffff),
+ [GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(DESC_CODE32 | DESC_USER, 0, 0xfffff),
+ [GDT_ENTRY_DEFAULT_USER_DS] = GDT_ENTRY_INIT(DESC_DATA64 | DESC_USER, 0, 0xfffff),
+ [GDT_ENTRY_DEFAULT_USER_CS] = GDT_ENTRY_INIT(DESC_CODE64 | DESC_USER, 0, 0xfffff),
#else
- [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xc09a, 0, 0xfffff),
- [GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc092, 0, 0xfffff),
- [GDT_ENTRY_DEFAULT_USER_CS] = GDT_ENTRY_INIT(0xc0fa, 0, 0xfffff),
- [GDT_ENTRY_DEFAULT_USER_DS] = GDT_ENTRY_INIT(0xc0f2, 0, 0xfffff),
+ [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(DESC_CODE32, 0, 0xfffff),
+ [GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(DESC_DATA32, 0, 0xfffff),
+ [GDT_ENTRY_DEFAULT_USER_CS] = GDT_ENTRY_INIT(DESC_CODE32 | DESC_USER, 0, 0xfffff),
+ [GDT_ENTRY_DEFAULT_USER_DS] = GDT_ENTRY_INIT(DESC_DATA32 | DESC_USER, 0, 0xfffff),
/*
* Segments used for calling PnP BIOS have byte granularity.
	 * The code segments and data segments have fixed 64k limits;
	 * the transfer segment sizes are set at run time.
*/
- /* 32-bit code */
- [GDT_ENTRY_PNPBIOS_CS32] = GDT_ENTRY_INIT(0x409a, 0, 0xffff),
- /* 16-bit code */
- [GDT_ENTRY_PNPBIOS_CS16] = GDT_ENTRY_INIT(0x009a, 0, 0xffff),
- /* 16-bit data */
- [GDT_ENTRY_PNPBIOS_DS] = GDT_ENTRY_INIT(0x0092, 0, 0xffff),
- /* 16-bit data */
- [GDT_ENTRY_PNPBIOS_TS1] = GDT_ENTRY_INIT(0x0092, 0, 0),
- /* 16-bit data */
- [GDT_ENTRY_PNPBIOS_TS2] = GDT_ENTRY_INIT(0x0092, 0, 0),
+ [GDT_ENTRY_PNPBIOS_CS32] = GDT_ENTRY_INIT(DESC_CODE32_BIOS, 0, 0xffff),
+ [GDT_ENTRY_PNPBIOS_CS16] = GDT_ENTRY_INIT(DESC_CODE16, 0, 0xffff),
+ [GDT_ENTRY_PNPBIOS_DS] = GDT_ENTRY_INIT(DESC_DATA16, 0, 0xffff),
+ [GDT_ENTRY_PNPBIOS_TS1] = GDT_ENTRY_INIT(DESC_DATA16, 0, 0),
+ [GDT_ENTRY_PNPBIOS_TS2] = GDT_ENTRY_INIT(DESC_DATA16, 0, 0),
/*
* The APM segments have byte granularity and their bases
* are set at run time. All have 64k limits.
*/
- /* 32-bit code */
- [GDT_ENTRY_APMBIOS_BASE] = GDT_ENTRY_INIT(0x409a, 0, 0xffff),
- /* 16-bit code */
- [GDT_ENTRY_APMBIOS_BASE+1] = GDT_ENTRY_INIT(0x009a, 0, 0xffff),
- /* data */
- [GDT_ENTRY_APMBIOS_BASE+2] = GDT_ENTRY_INIT(0x4092, 0, 0xffff),
-
- [GDT_ENTRY_ESPFIX_SS] = GDT_ENTRY_INIT(0xc092, 0, 0xfffff),
- [GDT_ENTRY_PERCPU] = GDT_ENTRY_INIT(0xc092, 0, 0xfffff),
- GDT_STACK_CANARY_INIT
+ [GDT_ENTRY_APMBIOS_BASE] = GDT_ENTRY_INIT(DESC_CODE32_BIOS, 0, 0xffff),
+ [GDT_ENTRY_APMBIOS_BASE+1] = GDT_ENTRY_INIT(DESC_CODE16, 0, 0xffff),
+ [GDT_ENTRY_APMBIOS_BASE+2] = GDT_ENTRY_INIT(DESC_DATA32_BIOS, 0, 0xffff),
+
+ [GDT_ENTRY_ESPFIX_SS] = GDT_ENTRY_INIT(DESC_DATA32, 0, 0xfffff),
+ [GDT_ENTRY_PERCPU] = GDT_ENTRY_INIT(DESC_DATA32, 0, 0xfffff),
#endif
} };
EXPORT_PER_CPU_SYMBOL_GPL(gdt_page);
+SYM_PIC_ALIAS(gdt_page);
-static int __init x86_xsave_setup(char *s)
+#ifdef CONFIG_X86_64
+static int __init x86_nopcid_setup(char *s)
{
- setup_clear_cpu_cap(X86_FEATURE_XSAVE);
- return 1;
-}
-__setup("noxsave", x86_xsave_setup);
+ /* nopcid doesn't accept parameters */
+ if (s)
+ return -EINVAL;
-#ifdef CONFIG_X86_32
-static int cachesize_override __cpuinitdata = -1;
-static int disable_x86_serial_nr __cpuinitdata = 1;
+ /* do not emit a message if the feature is not present */
+ if (!boot_cpu_has(X86_FEATURE_PCID))
+ return 0;
-static int __init cachesize_setup(char *str)
-{
- get_option(&str, &cachesize_override);
- return 1;
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
+ pr_info("nopcid: PCID feature disabled\n");
+ return 0;
}
-__setup("cachesize=", cachesize_setup);
+early_param("nopcid", x86_nopcid_setup);
+#endif
-static int __init x86_fxsr_setup(char *s)
+static int __init x86_noinvpcid_setup(char *s)
{
- setup_clear_cpu_cap(X86_FEATURE_FXSR);
- setup_clear_cpu_cap(X86_FEATURE_XMM);
- return 1;
-}
-__setup("nofxsr", x86_fxsr_setup);
+ /* noinvpcid doesn't accept parameters */
+ if (s)
+ return -EINVAL;
-static int __init x86_sep_setup(char *s)
-{
- setup_clear_cpu_cap(X86_FEATURE_SEP);
- return 1;
+ /* do not emit a message if the feature is not present */
+ if (!boot_cpu_has(X86_FEATURE_INVPCID))
+ return 0;
+
+ setup_clear_cpu_cap(X86_FEATURE_INVPCID);
+ pr_info("noinvpcid: INVPCID feature disabled\n");
+ return 0;
}
-__setup("nosep", x86_sep_setup);
+early_param("noinvpcid", x86_noinvpcid_setup);
/* Standard macro to see if a specific flag is changeable */
-static inline int flag_is_changeable_p(u32 flag)
+static inline bool flag_is_changeable_p(unsigned long flag)
{
- u32 f1, f2;
+ unsigned long f1, f2;
+
+ if (!IS_ENABLED(CONFIG_X86_32))
+ return true;
/*
* Cyrix and IDT cpus allow disabling of CPUID
@@ -196,16 +313,27 @@ static inline int flag_is_changeable_p(u32 flag)
: "=&r" (f1), "=&r" (f2)
: "ir" (flag));
- return ((f1^f2) & flag) != 0;
+ return (f1 ^ f2) & flag;
+}
+
+#ifdef CONFIG_X86_32
+static int cachesize_override = -1;
+static int disable_x86_serial_nr = 1;
+
+static int __init cachesize_setup(char *str)
+{
+ get_option(&str, &cachesize_override);
+ return 1;
}
+__setup("cachesize=", cachesize_setup);
/* Probe for the CPUID instruction */
-static int __cpuinit have_cpuid_p(void)
+bool cpuid_feature(void)
{
return flag_is_changeable_p(X86_EFLAGS_ID);
}
-static void __cpuinit squash_the_stupid_serial_number(struct cpuinfo_x86 *c)
+static void squash_the_stupid_serial_number(struct cpuinfo_x86 *c)
{
unsigned long lo, hi;
@@ -218,7 +346,7 @@ static void __cpuinit squash_the_stupid_serial_number(struct cpuinfo_x86 *c)
lo |= 0x200000;
wrmsr(MSR_IA32_BBL_CR_CTL, lo, hi);
- printk(KERN_NOTICE "CPU serial number disabled.\n");
+ pr_notice("CPU serial number disabled.\n");
clear_cpu_cap(c, X86_FEATURE_PN);
/* Disabling the serial number may affect the cpuid level */
@@ -232,20 +360,300 @@ static int __init x86_serial_nr_setup(char *s)
}
__setup("serialnumber", x86_serial_nr_setup);
#else
-static inline int flag_is_changeable_p(u32 flag)
+static inline void squash_the_stupid_serial_number(struct cpuinfo_x86 *c)
+{
+}
+#endif
+
+static __always_inline void setup_smep(struct cpuinfo_x86 *c)
+{
+ if (cpu_has(c, X86_FEATURE_SMEP))
+ cr4_set_bits(X86_CR4_SMEP);
+}
+
+static __always_inline void setup_smap(struct cpuinfo_x86 *c)
+{
+ unsigned long eflags = native_save_fl();
+
+ /* This should have been cleared long ago */
+ BUG_ON(eflags & X86_EFLAGS_AC);
+
+ if (cpu_has(c, X86_FEATURE_SMAP))
+ cr4_set_bits(X86_CR4_SMAP);
+}
+
+static __always_inline void setup_umip(struct cpuinfo_x86 *c)
{
+ /* Check the boot processor, plus build option for UMIP. */
+ if (!cpu_feature_enabled(X86_FEATURE_UMIP))
+ goto out;
+
+ /* Check the current processor's cpuid bits. */
+ if (!cpu_has(c, X86_FEATURE_UMIP))
+ goto out;
+
+ cr4_set_bits(X86_CR4_UMIP);
+
+ pr_info_once("x86/cpu: User Mode Instruction Prevention (UMIP) activated\n");
+
+ return;
+
+out:
+ /*
+ * Make sure UMIP is disabled in case it was enabled in a
+ * previous boot (e.g., via kexec).
+ */
+ cr4_clear_bits(X86_CR4_UMIP);
+}
+
+static __always_inline void setup_lass(struct cpuinfo_x86 *c)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_LASS))
+ return;
+
+ /*
+ * Legacy vsyscall page access causes a #GP when LASS is active.
+ * Disable LASS because the #GP handler doesn't support vsyscall
+ * emulation.
+ *
+ * Also disable LASS when running under EFI, as some runtime and
+ * boot services rely on 1:1 mappings in the lower half.
+ */
+ if (IS_ENABLED(CONFIG_X86_VSYSCALL_EMULATION) ||
+ IS_ENABLED(CONFIG_EFI)) {
+ setup_clear_cpu_cap(X86_FEATURE_LASS);
+ return;
+ }
+
+ cr4_set_bits(X86_CR4_LASS);
+}
+
+/* These bits should not change their value after CPU init is finished. */
+static const unsigned long cr4_pinned_mask = X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP |
+ X86_CR4_FSGSBASE | X86_CR4_CET | X86_CR4_FRED;
+static DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
+static unsigned long cr4_pinned_bits __ro_after_init;
+
+void native_write_cr0(unsigned long val)
+{
+ unsigned long bits_missing = 0;
+
+set_register:
+ asm volatile("mov %0,%%cr0": "+r" (val) : : "memory");
+
+ if (static_branch_likely(&cr_pinning)) {
+ if (unlikely((val & X86_CR0_WP) != X86_CR0_WP)) {
+ bits_missing = X86_CR0_WP;
+ val |= bits_missing;
+ goto set_register;
+ }
+ /* Warn after we've set the missing bits. */
+ WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");
+ }
+}
+EXPORT_SYMBOL(native_write_cr0);
+
+void __no_profile native_write_cr4(unsigned long val)
+{
+ unsigned long bits_changed = 0;
+
+set_register:
+ asm volatile("mov %0,%%cr4": "+r" (val) : : "memory");
+
+ if (static_branch_likely(&cr_pinning)) {
+ if (unlikely((val & cr4_pinned_mask) != cr4_pinned_bits)) {
+ bits_changed = (val & cr4_pinned_mask) ^ cr4_pinned_bits;
+ val = (val & ~cr4_pinned_mask) | cr4_pinned_bits;
+ goto set_register;
+ }
+ /* Warn after we've corrected the changed bits. */
+ WARN_ONCE(bits_changed, "pinned CR4 bits changed: 0x%lx!?\n",
+ bits_changed);
+ }
+}
+#if IS_MODULE(CONFIG_LKDTM)
+EXPORT_SYMBOL_GPL(native_write_cr4);
+#endif
+
+void cr4_update_irqsoff(unsigned long set, unsigned long clear)
+{
+ unsigned long newval, cr4 = this_cpu_read(cpu_tlbstate.cr4);
+
+ lockdep_assert_irqs_disabled();
+
+ newval = (cr4 & ~clear) | set;
+ if (newval != cr4) {
+ this_cpu_write(cpu_tlbstate.cr4, newval);
+ __write_cr4(newval);
+ }
+}
+EXPORT_SYMBOL_FOR_KVM(cr4_update_irqsoff);
+
+/* Read the CR4 shadow. */
+unsigned long cr4_read_shadow(void)
+{
+ return this_cpu_read(cpu_tlbstate.cr4);
+}
+EXPORT_SYMBOL_FOR_KVM(cr4_read_shadow);
+
+void cr4_init(void)
+{
+ unsigned long cr4 = __read_cr4();
+
+ if (boot_cpu_has(X86_FEATURE_PCID))
+ cr4 |= X86_CR4_PCIDE;
+ if (static_branch_likely(&cr_pinning))
+ cr4 = (cr4 & ~cr4_pinned_mask) | cr4_pinned_bits;
+
+ __write_cr4(cr4);
+
+ /* Initialize cr4 shadow for this CPU. */
+ this_cpu_write(cpu_tlbstate.cr4, cr4);
+}
+
+/*
+ * Once CPU feature detection is finished (and boot params have been
+ * parsed), record any of the sensitive CR bits that are set, and
+ * enable CR pinning.
+ */
+static void __init setup_cr_pinning(void)
+{
+ cr4_pinned_bits = this_cpu_read(cpu_tlbstate.cr4) & cr4_pinned_mask;
+ static_key_enable(&cr_pinning.key);
+}
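For illustration (editor's sketch, not part of the patch): once setup_cr_pinning() has run, clearing a pinned bit through the native write path is transparently undone by the re-write loop in native_write_cr4() above:

	/* hypothetical rogue write: try to clear SMEP after boot */
	native_write_cr4(native_read_cr4() & ~X86_CR4_SMEP);
	/*
	 * bits_changed == X86_CR4_SMEP, so the code jumps back to
	 * set_register with SMEP restored and emits the one-time
	 * "pinned CR4 bits changed: 0x100000!?" warning.
	 */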
+
+static __init int x86_nofsgsbase_setup(char *arg)
+{
+ /* Require an exact match without trailing characters. */
+ if (strlen(arg))
+ return 0;
+
+ /* Do not emit a message if the feature is not present. */
+ if (!boot_cpu_has(X86_FEATURE_FSGSBASE))
+ return 1;
+
+ setup_clear_cpu_cap(X86_FEATURE_FSGSBASE);
+ pr_info("FSGSBASE disabled via kernel command line\n");
return 1;
}
-/* Probe for the CPUID instruction */
-static inline int have_cpuid_p(void)
+__setup("nofsgsbase", x86_nofsgsbase_setup);
+
+/*
+ * Protection Keys are not available in 32-bit mode.
+ */
+static bool pku_disabled;
+
+static __always_inline void setup_pku(struct cpuinfo_x86 *c)
{
+ if (c == &boot_cpu_data) {
+ if (pku_disabled || !cpu_feature_enabled(X86_FEATURE_PKU))
+ return;
+ /*
+ * Setting CR4.PKE will cause the X86_FEATURE_OSPKE cpuid
+ * bit to be set. Enforce it.
+ */
+ setup_force_cpu_cap(X86_FEATURE_OSPKE);
+
+ } else if (!cpu_feature_enabled(X86_FEATURE_OSPKE)) {
+ return;
+ }
+
+ cr4_set_bits(X86_CR4_PKE);
+ /* Load the default PKRU value */
+ pkru_write_default();
+}
+
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+static __init int setup_disable_pku(char *arg)
+{
+ /*
+ * Do not clear the X86_FEATURE_PKU bit. All of the
+ * runtime checks are against OSPKE so clearing the
+ * bit does nothing.
+ *
+ * This way, we will see "pku" in cpuinfo, but not
+ * "ospke", which is exactly what we want. It shows
+ * that the CPU has PKU, but the OS has not enabled it.
+ * This happens to be exactly how a system would look
+ * if we disabled the config option.
+ */
+ pr_info("x86: 'nopku' specified, disabling Memory Protection Keys\n");
+ pku_disabled = true;
return 1;
}
-static inline void squash_the_stupid_serial_number(struct cpuinfo_x86 *c)
+__setup("nopku", setup_disable_pku);
+#endif
+
+#ifdef CONFIG_X86_KERNEL_IBT
+
+__noendbr u64 ibt_save(bool disable)
+{
+ u64 msr = 0;
+
+ if (cpu_feature_enabled(X86_FEATURE_IBT)) {
+ rdmsrq(MSR_IA32_S_CET, msr);
+ if (disable)
+ wrmsrq(MSR_IA32_S_CET, msr & ~CET_ENDBR_EN);
+ }
+
+ return msr;
+}
+
+__noendbr void ibt_restore(u64 save)
{
+ u64 msr;
+
+ if (cpu_feature_enabled(X86_FEATURE_IBT)) {
+ rdmsrq(MSR_IA32_S_CET, msr);
+ msr &= ~CET_ENDBR_EN;
+ msr |= (save & CET_ENDBR_EN);
+ wrmsrq(MSR_IA32_S_CET, msr);
+ }
}
+
#endif
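A hedged usage sketch for the ibt_save()/ibt_restore() pair above; the calling context is illustrative (think of a suspend- or kexec-style path that must run code without ENDBR landing pads):

	u64 saved;

	saved = ibt_save(true);	/* snapshot MSR_IA32_S_CET, clear ENDBR_EN */
	/* ... run code that is not IBT-instrumented ... */
	ibt_restore(saved);	/* put CET_ENDBR_EN back exactly as it was */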
+static __always_inline void setup_cet(struct cpuinfo_x86 *c)
+{
+ bool user_shstk, kernel_ibt;
+
+ if (!IS_ENABLED(CONFIG_X86_CET))
+ return;
+
+ kernel_ibt = HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT);
+ user_shstk = cpu_feature_enabled(X86_FEATURE_SHSTK) &&
+ IS_ENABLED(CONFIG_X86_USER_SHADOW_STACK);
+
+ if (!kernel_ibt && !user_shstk)
+ return;
+
+ if (user_shstk)
+ set_cpu_cap(c, X86_FEATURE_USER_SHSTK);
+
+ if (kernel_ibt)
+ wrmsrq(MSR_IA32_S_CET, CET_ENDBR_EN);
+ else
+ wrmsrq(MSR_IA32_S_CET, 0);
+
+ cr4_set_bits(X86_CR4_CET);
+
+ if (kernel_ibt && ibt_selftest()) {
+ pr_err("IBT selftest: Failed!\n");
+ wrmsrq(MSR_IA32_S_CET, 0);
+ setup_clear_cpu_cap(X86_FEATURE_IBT);
+ }
+}
+
+__noendbr void cet_disable(void)
+{
+ if (!(cpu_feature_enabled(X86_FEATURE_IBT) ||
+ cpu_feature_enabled(X86_FEATURE_SHSTK)))
+ return;
+
+ wrmsrq(MSR_IA32_S_CET, 0);
+ wrmsrq(MSR_IA32_U_CET, 0);
+}
+
/*
* Some CPU features depend on higher CPUID levels, which may not always
* be available due to CPUID level capping or broken virtualization
@@ -256,15 +664,15 @@ struct cpuid_dependent_feature {
u32 level;
};
-static const struct cpuid_dependent_feature __cpuinitconst
+static const struct cpuid_dependent_feature
cpuid_dependent_features[] = {
- { X86_FEATURE_MWAIT, 0x00000005 },
- { X86_FEATURE_DCA, 0x00000009 },
- { X86_FEATURE_XSAVE, 0x0000000d },
+ { X86_FEATURE_MWAIT, CPUID_LEAF_MWAIT },
+ { X86_FEATURE_DCA, CPUID_LEAF_DCA },
+ { X86_FEATURE_XSAVE, CPUID_LEAF_XSTATE },
{ 0, 0 }
};
-static void __cpuinit filter_cpuid_features(struct cpuinfo_x86 *c, bool warn)
+static void filter_cpuid_features(struct cpuinfo_x86 *c, bool warn)
{
const struct cpuid_dependent_feature *df;
@@ -288,9 +696,8 @@ static void __cpuinit filter_cpuid_features(struct cpuinfo_x86 *c, bool warn)
if (!warn)
continue;
- printk(KERN_WARNING
- "CPU: CPU feature %s disabled, no CPUID level 0x%x\n",
- x86_cap_flags[df->feature], df->level);
+ pr_warn("CPU: CPU feature %s disabled, no CPUID level 0x%x\n",
+ x86_cap_flags[df->feature], df->level);
}
}
@@ -302,9 +709,10 @@ static void __cpuinit filter_cpuid_features(struct cpuinfo_x86 *c, bool warn)
*/
/* Look up CPU names by table lookup. */
-static const char *__cpuinit table_lookup_model(struct cpuinfo_x86 *c)
+static const char *table_lookup_model(struct cpuinfo_x86 *c)
{
- const struct cpu_model_info *info;
+#ifdef CONFIG_X86_32
+ const struct legacy_cpu_model_info *info;
if (c->x86_model >= 16)
return NULL; /* Range check */
@@ -312,52 +720,95 @@ static const char *__cpuinit table_lookup_model(struct cpuinfo_x86 *c)
if (!this_cpu)
return NULL;
- info = this_cpu->c_models;
+ info = this_cpu->legacy_models;
- while (info && info->family) {
+ while (info->family) {
if (info->family == c->x86)
return info->model_names[c->x86_model];
info++;
}
+#endif
return NULL; /* Not found */
}
-__u32 cpu_caps_cleared[NCAPINTS] __cpuinitdata;
-__u32 cpu_caps_set[NCAPINTS] __cpuinitdata;
+/* Aligned to unsigned long to avoid split lock in atomic bitmap ops */
+__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
+__u32 cpu_caps_set[NCAPINTS + NBUGINTS] __aligned(sizeof(unsigned long));
-void load_percpu_segment(int cpu)
-{
#ifdef CONFIG_X86_32
- loadsegment(fs, __KERNEL_PERCPU);
-#else
- loadsegment(gs, 0);
- wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
+/* The 32-bit entry code needs to find cpu_entry_area. */
+DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
#endif
- load_stack_canary_segment();
+
+/* Load the original GDT from the per-cpu structure */
+void load_direct_gdt(int cpu)
+{
+ struct desc_ptr gdt_descr;
+
+ gdt_descr.address = (long)get_cpu_gdt_rw(cpu);
+ gdt_descr.size = GDT_SIZE - 1;
+ load_gdt(&gdt_descr);
}
+EXPORT_SYMBOL_FOR_KVM(load_direct_gdt);
-/*
- * Current gdt points %fs at the "master" per-cpu area: after this,
- * it's on the real one.
- */
-void switch_to_new_gdt(int cpu)
+/* Load a fixmap remapping of the per-cpu GDT */
+void load_fixmap_gdt(int cpu)
{
struct desc_ptr gdt_descr;
- gdt_descr.address = (long)get_cpu_gdt_table(cpu);
+ gdt_descr.address = (long)get_cpu_gdt_ro(cpu);
gdt_descr.size = GDT_SIZE - 1;
load_gdt(&gdt_descr);
- /* Reload the per-cpu base */
+}
+EXPORT_SYMBOL_GPL(load_fixmap_gdt);
+
+/**
+ * switch_gdt_and_percpu_base - Switch to direct GDT and runtime per CPU base
+ * @cpu: The CPU number for which this is invoked
+ *
+ * Invoked during early boot to switch from early GDT and early per CPU to
+ * the direct GDT and the runtime per CPU area. On 32-bit the percpu base
+ * switch is implicit by loading the direct GDT. On 64bit this requires
+ * to update GSBASE.
+ */
+void __init switch_gdt_and_percpu_base(int cpu)
+{
+ load_direct_gdt(cpu);
- load_percpu_segment(cpu);
+#ifdef CONFIG_X86_64
+ /*
+ * No need to load %gs. It is already correct.
+ *
+ * Writing %gs on 64-bit would zero GSBASE, which would make any per
+ * CPU operation fault up to the point of the wrmsrq() below.
+ *
+ * Set GSBASE to the new offset. Until the wrmsrq() happens the
+ * early mapping is still valid. That means the GSBASE update will
+ * lose any prior per CPU data which was not copied over in
+ * setup_per_cpu_areas().
+ *
+ * This works even with stackprotector enabled because the
+ * per CPU stack canary is 0 in both per CPU areas.
+ */
+ wrmsrq(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
+#else
+ /*
+ * %fs is already set to __KERNEL_PERCPU, but after switching GDT
+ * it is required to load FS again so that the 'hidden' part is
+ * updated from the new GDT. Up to this point the early per CPU
+ * translation is active. Any content of the early per CPU data
+ * which was not copied over in setup_per_cpu_areas() is lost.
+ */
+ loadsegment(fs, __KERNEL_PERCPU);
+#endif
}
-static const struct cpu_dev *__cpuinitdata cpu_devs[X86_VENDOR_NUM] = {};
+static const struct cpu_dev *cpu_devs[X86_VENDOR_NUM] = {};
-static void __cpuinit get_model_name(struct cpuinfo_x86 *c)
+static void get_model_name(struct cpuinfo_x86 *c)
{
unsigned int *v;
- char *p, *q;
+ char *p, *q, *s;
if (c->extended_cpuid_level < 0x80000004)
return;
@@ -368,22 +819,24 @@ static void __cpuinit get_model_name(struct cpuinfo_x86 *c)
cpuid(0x80000004, &v[8], &v[9], &v[10], &v[11]);
c->x86_model_id[48] = 0;
- /*
- * Intel chips right-justify this string for some dumb reason;
- * undo that brain damage:
- */
- p = q = &c->x86_model_id[0];
+ /* Trim whitespace */
+ p = q = s = &c->x86_model_id[0];
+
while (*p == ' ')
p++;
- if (p != q) {
- while (*p)
- *q++ = *p++;
- while (q <= &c->x86_model_id[48])
- *q++ = '\0'; /* Zero-pad the rest */
+
+ while (*p) {
+ /* Note the last non-whitespace index */
+ if (!isspace(*p))
+ s = q;
+
+ *q++ = *p++;
}
+
+ *(s + 1) = '\0';
}
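Worked example for the new trim loop (editor's illustration; the brand string is made up): s always points at the last non-whitespace character copied, so leading and trailing blanks are dropped in a single pass:

	/*
	 * In:  "      Intel(R) Xeon(R) CPU   "
	 *      p skips the leading spaces, q copies, s remembers the
	 *      last non-space destination; *(s + 1) = '\0' terminates.
	 * Out: "Intel(R) Xeon(R) CPU"
	 */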
-void __cpuinit cpu_detect_cache_sizes(struct cpuinfo_x86 *c)
+void cpu_detect_cache_sizes(struct cpuinfo_x86 *c)
{
unsigned int n, dummy, ebx, ecx, edx, l2size;
@@ -408,8 +861,8 @@ void __cpuinit cpu_detect_cache_sizes(struct cpuinfo_x86 *c)
c->x86_tlbsize += ((ebx >> 16) & 0xfff) + (ebx & 0xfff);
#else
/* do processor-specific cache resizing */
- if (this_cpu->c_size_cache)
- l2size = this_cpu->c_size_cache(c, l2size);
+ if (this_cpu->legacy_cache_size)
+ l2size = this_cpu->legacy_cache_size(c, l2size);
/* Allow user to override all this if necessary. */
if (cachesize_override != -1)
@@ -422,65 +875,27 @@ void __cpuinit cpu_detect_cache_sizes(struct cpuinfo_x86 *c)
c->x86_cache_size = l2size;
}
-void __cpuinit detect_ht(struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_X86_HT
- u32 eax, ebx, ecx, edx;
- int index_msb, core_bits;
- static bool printed;
-
- if (!cpu_has(c, X86_FEATURE_HT))
- return;
-
- if (cpu_has(c, X86_FEATURE_CMP_LEGACY))
- goto out;
-
- if (cpu_has(c, X86_FEATURE_XTOPOLOGY))
- return;
-
- cpuid(1, &eax, &ebx, &ecx, &edx);
-
- smp_num_siblings = (ebx & 0xff0000) >> 16;
-
- if (smp_num_siblings == 1) {
- printk_once(KERN_INFO "CPU0: Hyper-Threading is disabled\n");
- goto out;
- }
-
- if (smp_num_siblings <= 1)
- goto out;
-
- if (smp_num_siblings > nr_cpu_ids) {
- pr_warning("CPU: Unsupported number of siblings %d",
- smp_num_siblings);
- smp_num_siblings = 1;
- return;
- }
-
- index_msb = get_count_order(smp_num_siblings);
- c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, index_msb);
+u16 __read_mostly tlb_lli_4k;
+u16 __read_mostly tlb_lli_2m;
+u16 __read_mostly tlb_lli_4m;
+u16 __read_mostly tlb_lld_4k;
+u16 __read_mostly tlb_lld_2m;
+u16 __read_mostly tlb_lld_4m;
+u16 __read_mostly tlb_lld_1g;
- smp_num_siblings = smp_num_siblings / c->x86_max_cores;
-
- index_msb = get_count_order(smp_num_siblings);
-
- core_bits = get_count_order(c->x86_max_cores);
+static void cpu_detect_tlb(struct cpuinfo_x86 *c)
+{
+ if (this_cpu->c_detect_tlb)
+ this_cpu->c_detect_tlb(c);
- c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, index_msb) &
- ((1 << core_bits) - 1);
+ pr_info("Last level iTLB entries: 4KB %d, 2MB %d, 4MB %d\n",
+ tlb_lli_4k, tlb_lli_2m, tlb_lli_4m);
-out:
- if (!printed && (c->x86_max_cores * smp_num_siblings) > 1) {
- printk(KERN_INFO "CPU: Physical Processor ID: %d\n",
- c->phys_proc_id);
- printk(KERN_INFO "CPU: Processor Core ID: %d\n",
- c->cpu_core_id);
- printed = 1;
- }
-#endif
+ pr_info("Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n",
+ tlb_lld_4k, tlb_lld_2m, tlb_lld_4m, tlb_lld_1g);
}
-static void __cpuinit get_cpu_vendor(struct cpuinfo_x86 *c)
+void get_cpu_vendor(struct cpuinfo_x86 *c)
{
char *v = c->x86_vendor_id;
int i;
@@ -499,15 +914,14 @@ static void __cpuinit get_cpu_vendor(struct cpuinfo_x86 *c)
}
}
- printk_once(KERN_ERR
- "CPU: vendor_id '%s' unknown, using generic init.\n" \
- "CPU: Your system may be unstable.\n", v);
+ pr_err_once("CPU: vendor_id '%s' unknown, using generic init.\n" \
+ "CPU: Your system may be unstable.\n", v);
c->x86_vendor = X86_VENDOR_UNKNOWN;
this_cpu = &default_cpu;
}
-void __cpuinit cpu_detect(struct cpuinfo_x86 *c)
+void cpu_detect(struct cpuinfo_x86 *c)
{
/* Get vendor name */
cpuid(0x00000000, (unsigned int *)&c->cpuid_level,
@@ -521,14 +935,9 @@ void __cpuinit cpu_detect(struct cpuinfo_x86 *c)
u32 junk, tfms, cap0, misc;
cpuid(0x00000001, &tfms, &misc, &junk, &cap0);
- c->x86 = (tfms >> 8) & 0xf;
- c->x86_model = (tfms >> 4) & 0xf;
- c->x86_mask = tfms & 0xf;
-
- if (c->x86 == 0xf)
- c->x86 += (tfms >> 20) & 0xff;
- if (c->x86 >= 0x6)
- c->x86_model += ((tfms >> 16) & 0xf) << 4;
+ c->x86 = x86_family(tfms);
+ c->x86_model = x86_model(tfms);
+ c->x86_stepping = x86_stepping(tfms);
if (cap0 & (1<<19)) {
c->x86_clflush_size = ((misc >> 8) & 0xff) * 8;
@@ -537,50 +946,172 @@ void __cpuinit cpu_detect(struct cpuinfo_x86 *c)
}
}
-static void __cpuinit get_cpu_cap(struct cpuinfo_x86 *c)
+static void apply_forced_caps(struct cpuinfo_x86 *c)
{
- u32 tfms, xlvl;
- u32 ebx;
+ int i;
+
+ for (i = 0; i < NCAPINTS + NBUGINTS; i++) {
+ c->x86_capability[i] &= ~cpu_caps_cleared[i];
+ c->x86_capability[i] |= cpu_caps_set[i];
+ }
+}
+
+static void init_speculation_control(struct cpuinfo_x86 *c)
+{
+ /*
+ * The Intel SPEC_CTRL CPUID bit implies IBRS and IBPB support,
+ * and they also have a different bit for STIBP support. Also,
+ * a hypervisor might have set the individual AMD bits even on
+ * Intel CPUs, for finer-grained selection of what's available.
+ */
+ if (cpu_has(c, X86_FEATURE_SPEC_CTRL)) {
+ set_cpu_cap(c, X86_FEATURE_IBRS);
+ set_cpu_cap(c, X86_FEATURE_IBPB);
+ set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL);
+ }
+
+ if (cpu_has(c, X86_FEATURE_INTEL_STIBP))
+ set_cpu_cap(c, X86_FEATURE_STIBP);
+
+ if (cpu_has(c, X86_FEATURE_SPEC_CTRL_SSBD) ||
+ cpu_has(c, X86_FEATURE_VIRT_SSBD))
+ set_cpu_cap(c, X86_FEATURE_SSBD);
+
+ if (cpu_has(c, X86_FEATURE_AMD_IBRS)) {
+ set_cpu_cap(c, X86_FEATURE_IBRS);
+ set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL);
+ }
+
+ if (cpu_has(c, X86_FEATURE_AMD_IBPB))
+ set_cpu_cap(c, X86_FEATURE_IBPB);
+
+ if (cpu_has(c, X86_FEATURE_AMD_STIBP)) {
+ set_cpu_cap(c, X86_FEATURE_STIBP);
+ set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL);
+ }
+
+ if (cpu_has(c, X86_FEATURE_AMD_SSBD)) {
+ set_cpu_cap(c, X86_FEATURE_SSBD);
+ set_cpu_cap(c, X86_FEATURE_MSR_SPEC_CTRL);
+ clear_cpu_cap(c, X86_FEATURE_VIRT_SSBD);
+ }
+}
+
+void get_cpu_cap(struct cpuinfo_x86 *c)
+{
+ u32 eax, ebx, ecx, edx;
/* Intel-defined flags: level 0x00000001 */
if (c->cpuid_level >= 0x00000001) {
- u32 capability, excap;
+ cpuid(0x00000001, &eax, &ebx, &ecx, &edx);
- cpuid(0x00000001, &tfms, &ebx, &excap, &capability);
- c->x86_capability[0] = capability;
- c->x86_capability[4] = excap;
+ c->x86_capability[CPUID_1_ECX] = ecx;
+ c->x86_capability[CPUID_1_EDX] = edx;
}
- /* AMD-defined flags: level 0x80000001 */
- xlvl = cpuid_eax(0x80000000);
- c->extended_cpuid_level = xlvl;
-
- if ((xlvl & 0xffff0000) == 0x80000000) {
- if (xlvl >= 0x80000001) {
- c->x86_capability[1] = cpuid_edx(0x80000001);
- c->x86_capability[6] = cpuid_ecx(0x80000001);
+ /* Thermal and Power Management Leaf: level 0x00000006 (eax) */
+ if (c->cpuid_level >= 0x00000006)
+ c->x86_capability[CPUID_6_EAX] = cpuid_eax(0x00000006);
+
+ /* Additional Intel-defined flags: level 0x00000007 */
+ if (c->cpuid_level >= 0x00000007) {
+ cpuid_count(0x00000007, 0, &eax, &ebx, &ecx, &edx);
+ c->x86_capability[CPUID_7_0_EBX] = ebx;
+ c->x86_capability[CPUID_7_ECX] = ecx;
+ c->x86_capability[CPUID_7_EDX] = edx;
+
+ /* Check valid sub-leaf index before accessing it */
+ if (eax >= 1) {
+ cpuid_count(0x00000007, 1, &eax, &ebx, &ecx, &edx);
+ c->x86_capability[CPUID_7_1_EAX] = eax;
}
}
- if (c->extended_cpuid_level >= 0x80000008) {
- u32 eax = cpuid_eax(0x80000008);
+ /* Extended state features: level 0x0000000d */
+ if (c->cpuid_level >= 0x0000000d) {
+ cpuid_count(0x0000000d, 1, &eax, &ebx, &ecx, &edx);
- c->x86_virt_bits = (eax >> 8) & 0xff;
- c->x86_phys_bits = eax & 0xff;
+ c->x86_capability[CPUID_D_1_EAX] = eax;
+ }
+
+ /*
+ * Check if extended CPUID leaves are implemented: Max extended
+ * CPUID leaf must be in the 0x80000001-0x8000ffff range.
+ */
+ eax = cpuid_eax(0x80000000);
+ c->extended_cpuid_level = ((eax & 0xffff0000) == 0x80000000) ? eax : 0;
+
+ if (c->extended_cpuid_level >= 0x80000001) {
+ cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
+
+ c->x86_capability[CPUID_8000_0001_ECX] = ecx;
+ c->x86_capability[CPUID_8000_0001_EDX] = edx;
}
-#ifdef CONFIG_X86_32
- else if (cpu_has(c, X86_FEATURE_PAE) || cpu_has(c, X86_FEATURE_PSE36))
- c->x86_phys_bits = 36;
-#endif
if (c->extended_cpuid_level >= 0x80000007)
c->x86_power = cpuid_edx(0x80000007);
+ if (c->extended_cpuid_level >= 0x80000008) {
+ cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
+ c->x86_capability[CPUID_8000_0008_EBX] = ebx;
+ }
+
+ if (c->extended_cpuid_level >= 0x8000000a)
+ c->x86_capability[CPUID_8000_000A_EDX] = cpuid_edx(0x8000000a);
+
+ if (c->extended_cpuid_level >= 0x8000001f)
+ c->x86_capability[CPUID_8000_001F_EAX] = cpuid_eax(0x8000001f);
+
+ if (c->extended_cpuid_level >= 0x80000021)
+ c->x86_capability[CPUID_8000_0021_EAX] = cpuid_eax(0x80000021);
+
+ init_scattered_cpuid_features(c);
+ init_speculation_control(c);
+
+ /*
+ * Clear/Set all flags overridden by options, after probe.
+ * This needs to happen each time we re-probe, which may happen
+ * several times during CPU initialization.
+ */
+ apply_forced_caps(c);
}
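Editor's note on the extended-leaf check above: leaf 0x80000000 reports the maximum extended leaf in EAX, but CPUs without extended leaves may echo something else there, so only values whose high word is 0x8000 are trusted. For example:

	/*
	 * cpuid_eax(0x80000000) == 0x80000021 -> extended_cpuid_level = 0x80000021
	 * cpuid_eax(0x80000000) == 0x0000000d -> extended_cpuid_level = 0 (none)
	 */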
-static void __cpuinit identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
+void get_cpu_address_sizes(struct cpuinfo_x86 *c)
+{
+ u32 eax, ebx, ecx, edx;
+
+ if (!cpu_has(c, X86_FEATURE_CPUID) ||
+ (c->extended_cpuid_level < 0x80000008)) {
+ if (IS_ENABLED(CONFIG_X86_64)) {
+ c->x86_clflush_size = 64;
+ c->x86_phys_bits = 36;
+ c->x86_virt_bits = 48;
+ } else {
+ c->x86_clflush_size = 32;
+ c->x86_virt_bits = 32;
+ c->x86_phys_bits = 32;
+
+ if (cpu_has(c, X86_FEATURE_PAE) ||
+ cpu_has(c, X86_FEATURE_PSE36))
+ c->x86_phys_bits = 36;
+ }
+ } else {
+ cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
+
+ c->x86_virt_bits = (eax >> 8) & 0xff;
+ c->x86_phys_bits = eax & 0xff;
+
+ /* Provide a sane default if not enumerated: */
+ if (!c->x86_clflush_size)
+ c->x86_clflush_size = 32;
+ }
+
+ c->x86_cache_bits = c->x86_phys_bits;
+ c->x86_cache_alignment = c->x86_clflush_size;
+}
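The address-size fallbacks above, tabulated (editor's summary of the code):

	/*
	 * No usable leaf 0x80000008:
	 *   64-bit: clflush 64, phys 36 bits, virt 48 bits
	 *   32-bit: clflush 32, phys 32 bits (36 with PAE/PSE36), virt 32 bits
	 * Leaf present: EAX[7:0] = phys bits, EAX[15:8] = virt bits,
	 * clflush defaulted to 32 if not enumerated.
	 */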
+
+static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
{
-#ifdef CONFIG_X86_32
int i;
/*
@@ -601,7 +1132,629 @@ static void __cpuinit identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
break;
}
}
+}
+
+#define NO_SPECULATION BIT(0)
+#define NO_MELTDOWN BIT(1)
+#define NO_SSB BIT(2)
+#define NO_L1TF BIT(3)
+#define NO_MDS BIT(4)
+#define MSBDS_ONLY BIT(5)
+#define NO_SWAPGS BIT(6)
+#define NO_ITLB_MULTIHIT BIT(7)
+#define NO_SPECTRE_V2 BIT(8)
+#define NO_MMIO BIT(9)
+#define NO_EIBRS_PBRSB BIT(10)
+#define NO_BHI BIT(11)
+
+#define VULNWL(vendor, family, model, whitelist) \
+ X86_MATCH_VENDOR_FAM_MODEL(vendor, family, model, whitelist)
+
+#define VULNWL_INTEL(vfm, whitelist) \
+ X86_MATCH_VFM(vfm, whitelist)
+
+#define VULNWL_AMD(family, whitelist) \
+ VULNWL(AMD, family, X86_MODEL_ANY, whitelist)
+
+#define VULNWL_HYGON(family, whitelist) \
+ VULNWL(HYGON, family, X86_MODEL_ANY, whitelist)
+
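For readability (editor's illustration): an entry such as VULNWL_AMD(0x12, NO_MELTDOWN | NO_SSB) expands through the helpers above to

	/*
	 * X86_MATCH_VENDOR_FAM_MODEL(AMD, 0x12, X86_MODEL_ANY,
	 *                            NO_MELTDOWN | NO_SSB)
	 * i.e. a vendor+family match with the whitelist flags stashed in
	 * the entry's driver_data, which cpu_matches() below tests.
	 */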
+static const __initconst struct x86_cpu_id cpu_vuln_whitelist[] = {
+ VULNWL(ANY, 4, X86_MODEL_ANY, NO_SPECULATION),
+ VULNWL(CENTAUR, 5, X86_MODEL_ANY, NO_SPECULATION),
+ VULNWL(INTEL, 5, X86_MODEL_ANY, NO_SPECULATION),
+ VULNWL(NSC, 5, X86_MODEL_ANY, NO_SPECULATION),
+ VULNWL(VORTEX, 5, X86_MODEL_ANY, NO_SPECULATION),
+ VULNWL(VORTEX, 6, X86_MODEL_ANY, NO_SPECULATION),
+
+ /* Intel Family 6 */
+ VULNWL_INTEL(INTEL_TIGERLAKE, NO_MMIO),
+ VULNWL_INTEL(INTEL_TIGERLAKE_L, NO_MMIO),
+ VULNWL_INTEL(INTEL_ALDERLAKE, NO_MMIO),
+ VULNWL_INTEL(INTEL_ALDERLAKE_L, NO_MMIO),
+
+ VULNWL_INTEL(INTEL_ATOM_SALTWELL, NO_SPECULATION | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(INTEL_ATOM_SALTWELL_TABLET, NO_SPECULATION | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(INTEL_ATOM_SALTWELL_MID, NO_SPECULATION | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(INTEL_ATOM_BONNELL, NO_SPECULATION | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(INTEL_ATOM_BONNELL_MID, NO_SPECULATION | NO_ITLB_MULTIHIT),
+
+ VULNWL_INTEL(INTEL_ATOM_SILVERMONT, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(INTEL_ATOM_SILVERMONT_D, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(INTEL_ATOM_SILVERMONT_MID, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(INTEL_ATOM_AIRMONT, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(INTEL_XEON_PHI_KNL, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+ VULNWL_INTEL(INTEL_XEON_PHI_KNM, NO_SSB | NO_L1TF | MSBDS_ONLY | NO_SWAPGS | NO_ITLB_MULTIHIT),
+
+ VULNWL_INTEL(INTEL_CORE_YONAH, NO_SSB),
+
+ VULNWL_INTEL(INTEL_ATOM_SILVERMONT_MID2, NO_SSB | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT | MSBDS_ONLY),
+ VULNWL_INTEL(INTEL_ATOM_AIRMONT_NP, NO_SSB | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT),
+
+ VULNWL_INTEL(INTEL_ATOM_GOLDMONT, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO),
+ VULNWL_INTEL(INTEL_ATOM_GOLDMONT_D, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO),
+ VULNWL_INTEL(INTEL_ATOM_GOLDMONT_PLUS, NO_MDS | NO_L1TF | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO | NO_EIBRS_PBRSB),
+
+ /*
+ * Technically, swapgs isn't serializing on AMD (despite it previously
+ * being documented as such in the APM). But according to AMD, %gs is
+ * updated non-speculatively, and the issuing of %gs-relative memory
+ * operands will be blocked until the %gs update completes, which is
+ * good enough for our purposes.
+ */
+
+ VULNWL_INTEL(INTEL_ATOM_TREMONT, NO_EIBRS_PBRSB),
+ VULNWL_INTEL(INTEL_ATOM_TREMONT_L, NO_EIBRS_PBRSB),
+ VULNWL_INTEL(INTEL_ATOM_TREMONT_D, NO_ITLB_MULTIHIT | NO_EIBRS_PBRSB),
+
+ /* AMD Family 0xf - 0x12 */
+ VULNWL_AMD(0x0f, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO | NO_BHI),
+ VULNWL_AMD(0x10, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO | NO_BHI),
+ VULNWL_AMD(0x11, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO | NO_BHI),
+ VULNWL_AMD(0x12, NO_MELTDOWN | NO_SSB | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO | NO_BHI),
+
+ /* FAMILY_ANY must be last, otherwise 0x0f - 0x12 matches won't work */
+ VULNWL_AMD(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO | NO_EIBRS_PBRSB | NO_BHI),
+ VULNWL_HYGON(X86_FAMILY_ANY, NO_MELTDOWN | NO_L1TF | NO_MDS | NO_SWAPGS | NO_ITLB_MULTIHIT | NO_MMIO | NO_EIBRS_PBRSB | NO_BHI),
+
+ /* Zhaoxin Family 7 */
+ VULNWL(CENTAUR, 7, X86_MODEL_ANY, NO_SPECTRE_V2 | NO_SWAPGS | NO_MMIO | NO_BHI),
+ VULNWL(ZHAOXIN, 7, X86_MODEL_ANY, NO_SPECTRE_V2 | NO_SWAPGS | NO_MMIO | NO_BHI),
+ {}
+};
+
+#define VULNBL(vendor, family, model, blacklist) \
+ X86_MATCH_VENDOR_FAM_MODEL(vendor, family, model, blacklist)
+
+#define VULNBL_INTEL_STEPS(vfm, max_stepping, issues) \
+ X86_MATCH_VFM_STEPS(vfm, X86_STEP_MIN, max_stepping, issues)
+
+#define VULNBL_INTEL_TYPE(vfm, cpu_type, issues) \
+ X86_MATCH_VFM_CPU_TYPE(vfm, INTEL_CPU_TYPE_##cpu_type, issues)
+
+#define VULNBL_AMD(family, blacklist) \
+ VULNBL(AMD, family, X86_MODEL_ANY, blacklist)
+
+#define VULNBL_HYGON(family, blacklist) \
+ VULNBL(HYGON, family, X86_MODEL_ANY, blacklist)
+
+#define SRBDS BIT(0)
+/* CPU is affected by X86_BUG_MMIO_STALE_DATA */
+#define MMIO BIT(1)
+/* CPU is affected by Shared Buffers Data Sampling (SBDS), a variant of X86_BUG_MMIO_STALE_DATA */
+#define MMIO_SBDS BIT(2)
+/* CPU is affected by RETbleed, speculating where you would not expect it */
+#define RETBLEED BIT(3)
+/* CPU is affected by SMT (cross-thread) return predictions */
+#define SMT_RSB BIT(4)
+/* CPU is affected by SRSO */
+#define SRSO BIT(5)
+/* CPU is affected by GDS */
+#define GDS BIT(6)
+/* CPU is affected by Register File Data Sampling */
+#define RFDS BIT(7)
+/* CPU is affected by Indirect Target Selection */
+#define ITS BIT(8)
+/* CPU is affected by Indirect Target Selection, but guest-host isolation is not affected */
+#define ITS_NATIVE_ONLY BIT(9)
+/* CPU is affected by Transient Scheduler Attacks */
+#define TSA BIT(10)
+/* CPU is affected by VMSCAPE */
+#define VMSCAPE BIT(11)
+
+static const struct x86_cpu_id cpu_vuln_blacklist[] __initconst = {
+ VULNBL_INTEL_STEPS(INTEL_SANDYBRIDGE_X, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_SANDYBRIDGE, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_IVYBRIDGE_X, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_IVYBRIDGE, X86_STEP_MAX, SRBDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_HASWELL, X86_STEP_MAX, SRBDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_HASWELL_L, X86_STEP_MAX, SRBDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_HASWELL_G, X86_STEP_MAX, SRBDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_HASWELL_X, X86_STEP_MAX, MMIO | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_BROADWELL_D, X86_STEP_MAX, MMIO | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_BROADWELL_X, X86_STEP_MAX, MMIO | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_BROADWELL_G, X86_STEP_MAX, SRBDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_BROADWELL, X86_STEP_MAX, SRBDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_SKYLAKE_X, 0x5, MMIO | RETBLEED | GDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_SKYLAKE_X, X86_STEP_MAX, MMIO | RETBLEED | GDS | ITS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_SKYLAKE_L, X86_STEP_MAX, MMIO | RETBLEED | GDS | SRBDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_SKYLAKE, X86_STEP_MAX, MMIO | RETBLEED | GDS | SRBDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_KABYLAKE_L, 0xb, MMIO | RETBLEED | GDS | SRBDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_KABYLAKE_L, X86_STEP_MAX, MMIO | RETBLEED | GDS | SRBDS | ITS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_KABYLAKE, 0xc, MMIO | RETBLEED | GDS | SRBDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_KABYLAKE, X86_STEP_MAX, MMIO | RETBLEED | GDS | SRBDS | ITS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_CANNONLAKE_L, X86_STEP_MAX, RETBLEED | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_ICELAKE_L, X86_STEP_MAX, MMIO | MMIO_SBDS | RETBLEED | GDS | ITS | ITS_NATIVE_ONLY),
+ VULNBL_INTEL_STEPS(INTEL_ICELAKE_D, X86_STEP_MAX, MMIO | GDS | ITS | ITS_NATIVE_ONLY),
+ VULNBL_INTEL_STEPS(INTEL_ICELAKE_X, X86_STEP_MAX, MMIO | GDS | ITS | ITS_NATIVE_ONLY),
+ VULNBL_INTEL_STEPS(INTEL_COMETLAKE, X86_STEP_MAX, MMIO | MMIO_SBDS | RETBLEED | GDS | ITS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_COMETLAKE_L, 0x0, MMIO | RETBLEED | ITS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_COMETLAKE_L, X86_STEP_MAX, MMIO | MMIO_SBDS | RETBLEED | GDS | ITS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_TIGERLAKE_L, X86_STEP_MAX, GDS | ITS | ITS_NATIVE_ONLY),
+ VULNBL_INTEL_STEPS(INTEL_TIGERLAKE, X86_STEP_MAX, GDS | ITS | ITS_NATIVE_ONLY),
+ VULNBL_INTEL_STEPS(INTEL_LAKEFIELD, X86_STEP_MAX, MMIO | MMIO_SBDS | RETBLEED),
+ VULNBL_INTEL_STEPS(INTEL_ROCKETLAKE, X86_STEP_MAX, MMIO | RETBLEED | GDS | ITS | ITS_NATIVE_ONLY),
+ VULNBL_INTEL_TYPE(INTEL_ALDERLAKE, ATOM, RFDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_ALDERLAKE, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_ALDERLAKE_L, X86_STEP_MAX, RFDS | VMSCAPE),
+ VULNBL_INTEL_TYPE(INTEL_RAPTORLAKE, ATOM, RFDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_RAPTORLAKE, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_RAPTORLAKE_P, X86_STEP_MAX, RFDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_RAPTORLAKE_S, X86_STEP_MAX, RFDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_METEORLAKE_L, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_ARROWLAKE_H, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_ARROWLAKE, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_ARROWLAKE_U, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_LUNARLAKE_M, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_SAPPHIRERAPIDS_X, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_GRANITERAPIDS_X, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_EMERALDRAPIDS_X, X86_STEP_MAX, VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_ATOM_GRACEMONT, X86_STEP_MAX, RFDS | VMSCAPE),
+ VULNBL_INTEL_STEPS(INTEL_ATOM_TREMONT, X86_STEP_MAX, MMIO | MMIO_SBDS | RFDS),
+ VULNBL_INTEL_STEPS(INTEL_ATOM_TREMONT_D, X86_STEP_MAX, MMIO | RFDS),
+ VULNBL_INTEL_STEPS(INTEL_ATOM_TREMONT_L, X86_STEP_MAX, MMIO | MMIO_SBDS | RFDS),
+ VULNBL_INTEL_STEPS(INTEL_ATOM_GOLDMONT, X86_STEP_MAX, RFDS),
+ VULNBL_INTEL_STEPS(INTEL_ATOM_GOLDMONT_D, X86_STEP_MAX, RFDS),
+ VULNBL_INTEL_STEPS(INTEL_ATOM_GOLDMONT_PLUS, X86_STEP_MAX, RFDS),
+ VULNBL_INTEL_STEPS(INTEL_ATOM_CRESTMONT_X, X86_STEP_MAX, VMSCAPE),
+
+ VULNBL_AMD(0x15, RETBLEED),
+ VULNBL_AMD(0x16, RETBLEED),
+ VULNBL_AMD(0x17, RETBLEED | SMT_RSB | SRSO | VMSCAPE),
+ VULNBL_HYGON(0x18, RETBLEED | SMT_RSB | SRSO | VMSCAPE),
+ VULNBL_AMD(0x19, SRSO | TSA | VMSCAPE),
+ VULNBL_AMD(0x1a, SRSO | VMSCAPE),
+ {}
+};
+
+static bool __init cpu_matches(const struct x86_cpu_id *table, unsigned long which)
+{
+ const struct x86_cpu_id *m = x86_match_cpu(table);
+
+ return m && !!(m->driver_data & which);
+}
+
+u64 x86_read_arch_cap_msr(void)
+{
+ u64 x86_arch_cap_msr = 0;
+
+ if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
+ rdmsrq(MSR_IA32_ARCH_CAPABILITIES, x86_arch_cap_msr);
+
+ return x86_arch_cap_msr;
+}
+
+static bool arch_cap_mmio_immune(u64 x86_arch_cap_msr)
+{
+ return (x86_arch_cap_msr & ARCH_CAP_FBSDP_NO &&
+ x86_arch_cap_msr & ARCH_CAP_PSDP_NO &&
+ x86_arch_cap_msr & ARCH_CAP_SBDR_SSDP_NO);
+}
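Editor's note: MMIO immunity requires all three enumeration bits, so the helper above reads as

	/*
	 * immune  <=>  FBSDP_NO && PSDP_NO && SBDR_SSDP_NO
	 * (any single bit clear leaves the MMIO Stale Data bug possible)
	 */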
+
+static bool __init vulnerable_to_rfds(u64 x86_arch_cap_msr)
+{
+ /* The "immunity" bit trumps everything else: */
+ if (x86_arch_cap_msr & ARCH_CAP_RFDS_NO)
+ return false;
+
+ /*
+ * VMMs set ARCH_CAP_RFDS_CLEAR for processors not in the blacklist to
+ * indicate that mitigation is needed because the guest is running on
+ * vulnerable hardware or may migrate to such hardware:
+ */
+ if (x86_arch_cap_msr & ARCH_CAP_RFDS_CLEAR)
+ return true;
+
+ /* Only consult the blacklist when there is no enumeration: */
+ return cpu_matches(cpu_vuln_blacklist, RFDS);
+}
+
+static bool __init vulnerable_to_its(u64 x86_arch_cap_msr)
+{
+ /* The "immunity" bit trumps everything else: */
+ if (x86_arch_cap_msr & ARCH_CAP_ITS_NO)
+ return false;
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ return false;
+
+ /* None of the affected CPUs have BHI_CTRL */
+ if (boot_cpu_has(X86_FEATURE_BHI_CTRL))
+ return false;
+
+ /*
+ * If a VMM did not expose ITS_NO, assume that a guest could
+ * be running on vulnerable hardware or may migrate to such
+ * hardware.
+ */
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ return true;
+
+ if (cpu_matches(cpu_vuln_blacklist, ITS))
+ return true;
+
+ return false;
+}
+
+static struct x86_cpu_id cpu_latest_microcode[] = {
+#include "microcode/intel-ucode-defs.h"
+ {}
+};
+
+static bool __init cpu_has_old_microcode(void)
+{
+ const struct x86_cpu_id *m = x86_match_cpu(cpu_latest_microcode);
+
+ /* Give unknown CPUs a pass: */
+ if (!m) {
+ /* Intel CPUs should be in the list. Warn if not: */
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ pr_info("x86/CPU: Model not found in latest microcode list\n");
+ return false;
+ }
+
+ /*
+ * Hosts usually lie to guests with a super high microcode
+ * version. Just ignore what hosts tell guests:
+ */
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ return false;
+
+ /* Consider all debug microcode to be old: */
+ if (boot_cpu_data.microcode & BIT(31))
+ return true;
+
+ /* Give new microcode a pass: */
+ if (boot_cpu_data.microcode >= m->driver_data)
+ return false;
+
+ /* Uh oh, too old: */
+ return true;
+}
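Decision order of cpu_has_old_microcode(), summarized (editor's note):

	/*
	 * 1. model not in the table      -> not old (pr_info on Intel)
	 * 2. running under a hypervisor  -> not old (hosts lie about versions)
	 * 3. microcode bit 31 set        -> old (debug microcode)
	 * 4. otherwise                   -> old iff version < table entry
	 */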
+
+static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
+{
+ u64 x86_arch_cap_msr = x86_read_arch_cap_msr();
+
+ if (cpu_has_old_microcode()) {
+ pr_warn("x86/CPU: Running old microcode\n");
+ setup_force_cpu_bug(X86_BUG_OLD_MICROCODE);
+ add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
+ }
+
+ /* Set ITLB_MULTIHIT bug if cpu is not in the whitelist and not mitigated */
+ if (!cpu_matches(cpu_vuln_whitelist, NO_ITLB_MULTIHIT) &&
+ !(x86_arch_cap_msr & ARCH_CAP_PSCHANGE_MC_NO))
+ setup_force_cpu_bug(X86_BUG_ITLB_MULTIHIT);
+
+ if (cpu_matches(cpu_vuln_whitelist, NO_SPECULATION))
+ return;
+
+ setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
+
+ if (!cpu_matches(cpu_vuln_whitelist, NO_SPECTRE_V2)) {
+ setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
+ setup_force_cpu_bug(X86_BUG_SPECTRE_V2_USER);
+ }
+
+ if (!cpu_matches(cpu_vuln_whitelist, NO_SSB) &&
+ !(x86_arch_cap_msr & ARCH_CAP_SSB_NO) &&
+ !cpu_has(c, X86_FEATURE_AMD_SSB_NO))
+ setup_force_cpu_bug(X86_BUG_SPEC_STORE_BYPASS);
+
+ /*
+ * AMD's AutoIBRS is equivalent to Intel's eIBRS - use the Intel feature
+ * flag and protect from vendor-specific bugs via the whitelist.
+ *
+ * Don't use AutoIBRS when SNP is enabled because it degrades host
+ * userspace indirect branch performance.
+ */
+ if ((x86_arch_cap_msr & ARCH_CAP_IBRS_ALL) ||
+ (cpu_has(c, X86_FEATURE_AUTOIBRS) &&
+ !cpu_feature_enabled(X86_FEATURE_SEV_SNP))) {
+ setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
+ if (!cpu_matches(cpu_vuln_whitelist, NO_EIBRS_PBRSB) &&
+ !(x86_arch_cap_msr & ARCH_CAP_PBRSB_NO))
+ setup_force_cpu_bug(X86_BUG_EIBRS_PBRSB);
+ }
+
+ if (!cpu_matches(cpu_vuln_whitelist, NO_MDS) &&
+ !(x86_arch_cap_msr & ARCH_CAP_MDS_NO)) {
+ setup_force_cpu_bug(X86_BUG_MDS);
+ if (cpu_matches(cpu_vuln_whitelist, MSBDS_ONLY))
+ setup_force_cpu_bug(X86_BUG_MSBDS_ONLY);
+ }
+
+ if (!cpu_matches(cpu_vuln_whitelist, NO_SWAPGS))
+ setup_force_cpu_bug(X86_BUG_SWAPGS);
+
+ /*
+ * When the CPU is not mitigated for TAA (TAA_NO=0) set TAA bug when:
+ * - TSX is supported or
+ * - TSX_CTRL is present
+ *
+ * The TSX_CTRL check is needed for cases when TSX could have been
+ * disabled before the kernel booted, e.g. by kexec.
+ * The TSX_CTRL check alone is not sufficient when the microcode update
+ * is not present or when running as a guest that doesn't get TSX_CTRL.
+ */
+ if (!(x86_arch_cap_msr & ARCH_CAP_TAA_NO) &&
+ (cpu_has(c, X86_FEATURE_RTM) ||
+ (x86_arch_cap_msr & ARCH_CAP_TSX_CTRL_MSR)))
+ setup_force_cpu_bug(X86_BUG_TAA);
+
+ /*
+ * SRBDS affects CPUs which support RDRAND or RDSEED and are listed
+ * in the vulnerability blacklist.
+ *
+ * Some of the implications and mitigation of Shared Buffers Data
+ * Sampling (SBDS) are similar to SRBDS. Give SBDS same treatment as
+ * SRBDS.
+ */
+ if ((cpu_has(c, X86_FEATURE_RDRAND) ||
+ cpu_has(c, X86_FEATURE_RDSEED)) &&
+ cpu_matches(cpu_vuln_blacklist, SRBDS | MMIO_SBDS))
+ setup_force_cpu_bug(X86_BUG_SRBDS);
+
+ /*
+ * Processor MMIO Stale Data bug enumeration
+ *
+ * Affected CPU list is generally enough to enumerate the vulnerability,
+ * but for the virtualization case also check the ARCH_CAP MSR bits; a
+ * VMM may not want the guest to enumerate the bug.
+ */
+ if (!arch_cap_mmio_immune(x86_arch_cap_msr)) {
+ if (cpu_matches(cpu_vuln_blacklist, MMIO))
+ setup_force_cpu_bug(X86_BUG_MMIO_STALE_DATA);
+ }
+
+ if (!cpu_has(c, X86_FEATURE_BTC_NO)) {
+ if (cpu_matches(cpu_vuln_blacklist, RETBLEED) || (x86_arch_cap_msr & ARCH_CAP_RSBA))
+ setup_force_cpu_bug(X86_BUG_RETBLEED);
+ }
+
+ if (cpu_matches(cpu_vuln_blacklist, SMT_RSB))
+ setup_force_cpu_bug(X86_BUG_SMT_RSB);
+
+ if (!cpu_has(c, X86_FEATURE_SRSO_NO)) {
+ if (cpu_matches(cpu_vuln_blacklist, SRSO))
+ setup_force_cpu_bug(X86_BUG_SRSO);
+ }
+
+ /*
+ * Check if CPU is vulnerable to GDS. If running in a virtual machine on
+ * an affected processor, the VMM may have disabled the use of GATHER by
+ * disabling AVX2. The only way to do this in HW is to clear XCR0[2],
+ * which means that AVX will be disabled.
+ */
+ if (cpu_matches(cpu_vuln_blacklist, GDS) && !(x86_arch_cap_msr & ARCH_CAP_GDS_NO) &&
+ boot_cpu_has(X86_FEATURE_AVX))
+ setup_force_cpu_bug(X86_BUG_GDS);
+
+ if (vulnerable_to_rfds(x86_arch_cap_msr))
+ setup_force_cpu_bug(X86_BUG_RFDS);
+
+ /*
+ * Intel parts with eIBRS are vulnerable to BHI attacks. Parts with
+ * BHI_NO still need to use the BHI mitigation to prevent intra-mode
+ * attacks. When virtualized, eIBRS could be hidden; assume vulnerable.
+ */
+ if (!cpu_matches(cpu_vuln_whitelist, NO_BHI) &&
+ (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) ||
+ boot_cpu_has(X86_FEATURE_HYPERVISOR)))
+ setup_force_cpu_bug(X86_BUG_BHI);
+
+ if (cpu_has(c, X86_FEATURE_AMD_IBPB) && !cpu_has(c, X86_FEATURE_AMD_IBPB_RET))
+ setup_force_cpu_bug(X86_BUG_IBPB_NO_RET);
+
+ if (vulnerable_to_its(x86_arch_cap_msr)) {
+ setup_force_cpu_bug(X86_BUG_ITS);
+ if (cpu_matches(cpu_vuln_blacklist, ITS_NATIVE_ONLY))
+ setup_force_cpu_bug(X86_BUG_ITS_NATIVE_ONLY);
+ }
+
+ if (c->x86_vendor == X86_VENDOR_AMD) {
+ if (!cpu_has(c, X86_FEATURE_TSA_SQ_NO) ||
+ !cpu_has(c, X86_FEATURE_TSA_L1_NO)) {
+ if (cpu_matches(cpu_vuln_blacklist, TSA) ||
+ /* Enable bug on Zen guests to allow for live migration. */
+ (cpu_has(c, X86_FEATURE_HYPERVISOR) && cpu_has(c, X86_FEATURE_ZEN)))
+ setup_force_cpu_bug(X86_BUG_TSA);
+ }
+ }
+
+ /*
+ * Set the bug only on bare-metal. A nested hypervisor should already be
+ * deploying IBPB to isolate itself from nested guests.
+ */
+ if (cpu_matches(cpu_vuln_blacklist, VMSCAPE) &&
+ !boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ setup_force_cpu_bug(X86_BUG_VMSCAPE);
+
+ if (cpu_matches(cpu_vuln_whitelist, NO_MELTDOWN))
+ return;
+
+ /* Rogue Data Cache Load? No! */
+ if (x86_arch_cap_msr & ARCH_CAP_RDCL_NO)
+ return;
+
+ setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
+
+ if (cpu_matches(cpu_vuln_whitelist, NO_L1TF))
+ return;
+
+ setup_force_cpu_bug(X86_BUG_L1TF);
+}
+
+/*
+ * The NOPL instruction is supposed to exist on all CPUs of family >= 6;
+ * unfortunately, that's not true in practice because of early VIA
+ * chips and (more importantly) broken virtualizers that are not easy
+ * to detect. In the latter case it doesn't even *fail* reliably, so
+ * probing for it doesn't even work. Disable it completely on 32-bit
+ * unless we can find a reliable way to detect all the broken cases.
+ * Enable it explicitly on 64-bit for non-constant inputs of cpu_has().
+ */
+static void detect_nopl(void)
+{
+#ifdef CONFIG_X86_32
+ setup_clear_cpu_cap(X86_FEATURE_NOPL);
+#else
+ setup_force_cpu_cap(X86_FEATURE_NOPL);
+#endif
+}
+
+static inline bool parse_set_clear_cpuid(char *arg, bool set)
+{
+ char *opt;
+ int taint = 0;
+
+ while (arg) {
+ bool found __maybe_unused = false;
+ unsigned int bit;
+
+ opt = strsep(&arg, ",");
+
+ /*
+ * Handle naked numbers first for feature flags which don't
+ * have names. It doesn't make sense for a bug not to have a
+ * name so don't handle bug flags here.
+ */
+ if (!kstrtouint(opt, 10, &bit)) {
+ if (bit < NCAPINTS * 32) {
+
+ if (set) {
+ pr_warn("setcpuid: force-enabling CPU feature flag:");
+ setup_force_cpu_cap(bit);
+ } else {
+ pr_warn("clearcpuid: force-disabling CPU feature flag:");
+ setup_clear_cpu_cap(bit);
+ }
+ /* empty-string, i.e., ""-defined feature flags */
+ if (!x86_cap_flags[bit])
+ pr_cont(" %d:%d\n", bit >> 5, bit & 31);
+ else
+ pr_cont(" %s\n", x86_cap_flags[bit]);
+
+ taint++;
+ }
+ /*
+ * The assumption is that there are no feature names consisting only
+ * of numbers, so go to the next argument.
+ */
+ continue;
+ }
+
+ for (bit = 0; bit < 32 * (NCAPINTS + NBUGINTS); bit++) {
+ const char *flag;
+ const char *kind;
+
+ if (bit < 32 * NCAPINTS) {
+ flag = x86_cap_flags[bit];
+ kind = "feature";
+ } else {
+ kind = "bug";
+ flag = x86_bug_flags[bit - (32 * NCAPINTS)];
+ }
+
+ if (!flag)
+ continue;
+
+ if (strcmp(flag, opt))
+ continue;
+
+ if (set) {
+ pr_warn("setcpuid: force-enabling CPU %s flag: %s\n",
+ kind, flag);
+ setup_force_cpu_cap(bit);
+ } else {
+ pr_warn("clearcpuid: force-disabling CPU %s flag: %s\n",
+ kind, flag);
+ setup_clear_cpu_cap(bit);
+ }
+ taint++;
+ found = true;
+ break;
+ }
+
+ if (!found)
+ pr_warn("%s: unknown CPU flag: %s", set ? "setcpuid" : "clearcpuid", opt);
+ }
+
+ return taint;
+}
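Accepted forms, per the parser above (editor's examples; the flag names are only illustrative):

	/*
	 * clearcpuid=440          bare feature-bit number (bit < NCAPINTS*32)
	 * clearcpuid=smep,smap    comma-separated feature names
	 * setcpuid=ibrs           force-enable a named feature
	 * Bug flags can only be matched by name, never by number, and any
	 * successful set/clear taints the kernel.
	 */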
+
+
+/*
+ * We parse cpu parameters early because fpu__init_system() is executed
+ * before parse_early_param().
+ */
+static void __init cpu_parse_early_param(void)
+{
+ bool cpuid_taint = false;
+ char arg[128];
+ int arglen;
+
+#ifdef CONFIG_X86_32
+ if (cmdline_find_option_bool(boot_command_line, "no387"))
+#ifdef CONFIG_MATH_EMULATION
+ setup_clear_cpu_cap(X86_FEATURE_FPU);
+#else
+ pr_err("Option 'no387' required CONFIG_MATH_EMULATION enabled.\n");
#endif
+
+ if (cmdline_find_option_bool(boot_command_line, "nofxsr"))
+ setup_clear_cpu_cap(X86_FEATURE_FXSR);
+#endif
+
+ if (cmdline_find_option_bool(boot_command_line, "noxsave"))
+ setup_clear_cpu_cap(X86_FEATURE_XSAVE);
+
+ if (cmdline_find_option_bool(boot_command_line, "noxsaveopt"))
+ setup_clear_cpu_cap(X86_FEATURE_XSAVEOPT);
+
+ if (cmdline_find_option_bool(boot_command_line, "noxsaves"))
+ setup_clear_cpu_cap(X86_FEATURE_XSAVES);
+
+ if (cmdline_find_option_bool(boot_command_line, "nousershstk"))
+ setup_clear_cpu_cap(X86_FEATURE_USER_SHSTK);
+
+ /* Minimize the window between FRED being available and it being available but disabled. */
+ arglen = cmdline_find_option(boot_command_line, "fred", arg, sizeof(arg));
+ if (arglen != 2 || strncmp(arg, "on", 2))
+ setup_clear_cpu_cap(X86_FEATURE_FRED);
+
+ arglen = cmdline_find_option(boot_command_line, "clearcpuid", arg, sizeof(arg));
+ if (arglen > 0)
+ cpuid_taint |= parse_set_clear_cpuid(arg, false);
+
+ arglen = cmdline_find_option(boot_command_line, "setcpuid", arg, sizeof(arg));
+ if (arglen > 0)
+ cpuid_taint |= parse_set_clear_cpuid(arg, true);
+
+ if (cpuid_taint) {
+ pr_warn("!!! setcpuid=/clearcpuid= in use, this is for TESTING ONLY, may break things horribly. Tainting kernel.\n");
+ add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
+ }
}
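Note on the strict fred= parsing above (editor's example): only the exact two-character value "on" keeps FRED; anything else, including omitting the option, clears the capability early:

	/*
	 * fred=on                          -> keep X86_FEATURE_FRED
	 * fred=off, fred=1, option absent  -> setup_clear_cpu_cap(X86_FEATURE_FRED)
	 */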
/*
@@ -610,56 +1763,82 @@ static void __cpuinit identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
* cache alignment.
* The others are not touched to avoid unwanted side effects.
*
- * WARNING: this function is only called on the BP. Don't add code here
- * that is supposed to run on all CPUs.
+ * WARNING: this function is only called on the boot CPU. Don't add code
+ * here that is supposed to run on all CPUs.
*/
static void __init early_identify_cpu(struct cpuinfo_x86 *c)
{
-#ifdef CONFIG_X86_64
- c->x86_clflush_size = 64;
- c->x86_phys_bits = 36;
- c->x86_virt_bits = 48;
-#else
- c->x86_clflush_size = 32;
- c->x86_phys_bits = 32;
- c->x86_virt_bits = 32;
-#endif
- c->x86_cache_alignment = c->x86_clflush_size;
-
- memset(&c->x86_capability, 0, sizeof c->x86_capability);
+ memset(&c->x86_capability, 0, sizeof(c->x86_capability));
c->extended_cpuid_level = 0;
- if (!have_cpuid_p())
+ if (!cpuid_feature())
identify_cpu_without_cpuid(c);
/* cyrix could have cpuid enabled via c_identify()*/
- if (!have_cpuid_p())
- return;
-
- cpu_detect(c);
+ if (cpuid_feature()) {
+ cpu_detect(c);
+ get_cpu_vendor(c);
+ intel_unlock_cpuid_leafs(c);
+ get_cpu_cap(c);
+ setup_force_cpu_cap(X86_FEATURE_CPUID);
+ get_cpu_address_sizes(c);
+ cpu_parse_early_param();
+
+ cpu_init_topology(c);
+
+ if (this_cpu->c_early_init)
+ this_cpu->c_early_init(c);
+
+ c->cpu_index = 0;
+ filter_cpuid_features(c, false);
+ check_cpufeature_deps(c);
+
+ if (this_cpu->c_bsp_init)
+ this_cpu->c_bsp_init(c);
+ } else {
+ setup_clear_cpu_cap(X86_FEATURE_CPUID);
+ get_cpu_address_sizes(c);
+ cpu_init_topology(c);
+ }
- get_cpu_vendor(c);
+ setup_force_cpu_cap(X86_FEATURE_ALWAYS);
- get_cpu_cap(c);
+ cpu_set_bug_bits(c);
- if (this_cpu->c_early_init)
- this_cpu->c_early_init(c);
+ sld_setup(c);
-#ifdef CONFIG_SMP
- c->cpu_index = boot_cpu_id;
+#ifdef CONFIG_X86_32
+ /*
+ * Regardless of whether PCID is enumerated, the SDM says
+ * that it can't be enabled in 32-bit mode.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_PCID);
#endif
- filter_cpuid_features(c, false);
+
+ /*
+ * Later in the boot process pgtable_l5_enabled() relies on
+ * cpu_feature_enabled(X86_FEATURE_LA57). If 5-level paging is not
+ * enabled by this point, we need to clear the feature bit to avoid
+ * false positives at a later stage.
+ *
+ * pgtable_l5_enabled() can be false here for several reasons:
+ * - 5-level paging is disabled at compile time;
+ * - it's a 32-bit kernel;
+ * - machine doesn't support 5-level paging;
+ * - user specified 'no5lvl' in kernel command line.
+ */
+ if (!pgtable_l5_enabled())
+ setup_clear_cpu_cap(X86_FEATURE_LA57);
+
+ detect_nopl();
+ mca_bsp_init(c);
}
-void __init early_cpu_init(void)
+void __init init_cpu_devs(void)
{
const struct cpu_dev *const *cdev;
int count = 0;
-#ifdef PROCESSOR_SELECT
- printk(KERN_INFO "KERNEL supported cpus:\n");
-#endif
-
for (cdev = __x86_cpu_dev_start; cdev < __x86_cpu_dev_end; cdev++) {
const struct cpu_dev *cpudev = *cdev;
@@ -667,89 +1846,144 @@ void __init early_cpu_init(void)
break;
cpu_devs[count] = cpudev;
count++;
+ }
+}
-#ifdef PROCESSOR_SELECT
- {
- unsigned int j;
+void __init early_cpu_init(void)
+{
+#ifdef CONFIG_PROCESSOR_SELECT
+ unsigned int i, j;
- for (j = 0; j < 2; j++) {
- if (!cpudev->c_ident[j])
- continue;
- printk(KERN_INFO " %s %s\n", cpudev->c_vendor,
- cpudev->c_ident[j]);
- }
- }
+ pr_info("KERNEL supported cpus:\n");
#endif
+
+ init_cpu_devs();
+
+#ifdef CONFIG_PROCESSOR_SELECT
+ for (i = 0; i < X86_VENDOR_NUM && cpu_devs[i]; i++) {
+ for (j = 0; j < 2; j++) {
+ if (!cpu_devs[i]->c_ident[j])
+ continue;
+ pr_info(" %s %s\n", cpu_devs[i]->c_vendor,
+ cpu_devs[i]->c_ident[j]);
+ }
}
+#endif
+
early_identify_cpu(&boot_cpu_data);
}
-/*
- * The NOPL instruction is supposed to exist on all CPUs with
- * family >= 6; unfortunately, that's not true in practice because
- * of early VIA chips and (more importantly) broken virtualizers that
- * are not easy to detect. In the latter case it doesn't even *fail*
- * reliably, so probing for it doesn't even work. Disable it completely
- * unless we can find a reliable way to detect all the broken cases.
- */
-static void __cpuinit detect_nopl(struct cpuinfo_x86 *c)
+static bool detect_null_seg_behavior(void)
{
- clear_cpu_cap(c, X86_FEATURE_NOPL);
+ /*
+ * Empirically, writing zero to a segment selector on AMD does
+ * not clear the base, whereas writing zero to a segment
+ * selector on Intel does clear the base. Intel's behavior
+ * allows slightly faster context switches in the common case
+ * where GS is unused by the prev and next threads.
+ *
+ * Since neither vendor documents this anywhere that I can see,
+ * detect it directly instead of hard-coding the choice by
+ * vendor.
+ *
+ * I've designated AMD's behavior as the "bug" because it's
+ * counterintuitive and less friendly.
+ */
+
+ unsigned long old_base, tmp;
+ rdmsrq(MSR_FS_BASE, old_base);
+ wrmsrq(MSR_FS_BASE, 1);
+ loadsegment(fs, 0);
+ rdmsrq(MSR_FS_BASE, tmp);
+ wrmsrq(MSR_FS_BASE, old_base);
+ return tmp == 0;
+}
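Probe summary (editor's note); it relies on MSR_FS_BASE being directly writable on 64-bit:

	/*
	 * wrmsrq(MSR_FS_BASE, 1);    force a known non-zero base
	 * loadsegment(fs, 0);        write a NULL selector to %fs
	 * rdmsrq(MSR_FS_BASE, tmp);  tmp == 0 -> Intel-style: the NULL
	 *                            write cleared the base
	 *                            tmp == 1 -> AMD-style: base survived
	 */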
+
+void check_null_seg_clears_base(struct cpuinfo_x86 *c)
+{
+ /* BUG_NULL_SEG is only relevant with 64bit userspace */
+ if (!IS_ENABLED(CONFIG_X86_64))
+ return;
+
+ if (cpu_has(c, X86_FEATURE_NULL_SEL_CLR_BASE))
+ return;
+
+ /*
+ * CPUID bit above wasn't set. If this kernel is still running
+ * as an HV guest, then the HV has decided not to advertise
+ * that CPUID bit for whatever reason. For example, one
+ * member of the migration pool might be vulnerable. Which
+ * means, the bug is present: set the BUG flag and return.
+ */
+ if (cpu_has(c, X86_FEATURE_HYPERVISOR)) {
+ set_cpu_bug(c, X86_BUG_NULL_SEG);
+ return;
+ }
+
+ /*
+ * Zen2 CPUs also have this behaviour, but no CPUID bit.
+ * 0x18 is the respective family for Hygon.
+ */
+ if ((c->x86 == 0x17 || c->x86 == 0x18) &&
+ detect_null_seg_behavior())
+ return;
+
+ /* All the remaining ones are affected */
+ set_cpu_bug(c, X86_BUG_NULL_SEG);
}
-static void __cpuinit generic_identify(struct cpuinfo_x86 *c)
+static void generic_identify(struct cpuinfo_x86 *c)
{
c->extended_cpuid_level = 0;
- if (!have_cpuid_p())
+ if (!cpuid_feature())
identify_cpu_without_cpuid(c);
/* cyrix could have cpuid enabled via c_identify()*/
- if (!have_cpuid_p())
+ if (!cpuid_feature())
return;
cpu_detect(c);
get_cpu_vendor(c);
-
+ intel_unlock_cpuid_leafs(c);
get_cpu_cap(c);
- if (c->cpuid_level >= 0x00000001) {
- c->initial_apicid = (cpuid_ebx(1) >> 24) & 0xFF;
-#ifdef CONFIG_X86_32
-# ifdef CONFIG_X86_HT
- c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
-# else
- c->apicid = c->initial_apicid;
-# endif
-#endif
-
-#ifdef CONFIG_X86_HT
- c->phys_proc_id = c->initial_apicid;
-#endif
- }
+ get_cpu_address_sizes(c);
get_model_name(c); /* Default name */
- init_scattered_cpuid_features(c);
- detect_nopl(c);
+ /*
+ * ESPFIX is a strange bug. All real CPUs have it. Paravirt
+ * systems that run Linux at CPL > 0 may or may not have the
+ * issue, but, even if they have the issue, there's absolutely
+ * nothing we can do about it because we can't use the real IRET
+ * instruction.
+ *
+ * NB: For the time being, only 32-bit kernels support
+ * X86_BUG_ESPFIX as such. 64-bit kernels directly choose
+ * whether to apply espfix using paravirt hooks. If any
+ * non-paravirt system ever shows up that does *not* have the
+ * ESPFIX issue, we can change this.
+ */
+#ifdef CONFIG_X86_32
+ set_cpu_bug(c, X86_BUG_ESPFIX);
+#endif
}
/*
* This does the hard work of actually picking apart the CPU stuff...
*/
-static void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
+static void identify_cpu(struct cpuinfo_x86 *c)
{
int i;
c->loops_per_jiffy = loops_per_jiffy;
- c->x86_cache_size = -1;
+ c->x86_cache_size = 0;
c->x86_vendor = X86_VENDOR_UNKNOWN;
- c->x86_model = c->x86_mask = 0; /* So far unknown... */
+ c->x86_model = c->x86_stepping = 0; /* So far unknown... */
c->x86_vendor_id[0] = '\0'; /* Unset */
c->x86_model_id[0] = '\0'; /* Unset */
- c->x86_max_cores = 1;
- c->x86_coreid_bits = 0;
#ifdef CONFIG_X86_64
c->x86_clflush_size = 64;
c->x86_phys_bits = 36;
@@ -761,22 +1995,26 @@ static void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
c->x86_virt_bits = 32;
#endif
c->x86_cache_alignment = c->x86_clflush_size;
- memset(&c->x86_capability, 0, sizeof c->x86_capability);
+ memset(&c->x86_capability, 0, sizeof(c->x86_capability));
+#ifdef CONFIG_X86_VMX_FEATURE_NAMES
+ memset(&c->vmx_capability, 0, sizeof(c->vmx_capability));
+#endif
generic_identify(c);
+ cpu_parse_topology(c);
+
if (this_cpu->c_identify)
this_cpu->c_identify(c);
- /* Clear/Set all flags overriden by options, after probe */
- for (i = 0; i < NCAPINTS; i++) {
- c->x86_capability[i] &= ~cpu_caps_cleared[i];
- c->x86_capability[i] |= cpu_caps_set[i];
- }
+ /* Clear/Set all flags overridden by options, after probe */
+ apply_forced_caps(c);
-#ifdef CONFIG_X86_64
- c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
-#endif
+ /*
+ * Set default APIC and TSC_DEADLINE MSR fencing flag. AMD and
+ * Hygon will clear it in ->c_init() below.
+ */
+ set_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
/*
* Vendor-specific initialization. In this section we
@@ -791,9 +2029,22 @@ static void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
if (this_cpu->c_init)
this_cpu->c_init(c);
+ bus_lock_init();
+
/* Disable the PN if appropriate */
squash_the_stupid_serial_number(c);
+ setup_smep(c);
+ setup_smap(c);
+ setup_umip(c);
+ setup_lass(c);
+
+ /* Enable FSGSBASE instructions if available. */
+ if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
+ cr4_set_bits(X86_CR4_FSGSBASE);
+ elf_hwcap2 |= HWCAP2_FSGSBASE;
+ }
+
/*
* The vendor-specific functions might have changed features.
* Now we do "generic changes."
@@ -802,6 +2053,9 @@ static void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
/* Filter out anything that depends on CPUID levels we don't have */
filter_cpuid_features(c, true);
+ /* Check for unmet dependencies based on the CPUID dependency table */
+ check_cpufeature_deps(c);
+
/* If the model name is still unset, do table lookup. */
if (!c->x86_model_id[0]) {
const char *p;
@@ -814,20 +2068,15 @@ static void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
c->x86, c->x86_model);
}
-#ifdef CONFIG_X86_64
- detect_ht(c);
-#endif
-
- init_hypervisor(c);
+ x86_init_rdrand(c);
+ setup_pku(c);
+ setup_cet(c);
/*
- * Clear/Set all flags overriden by options, need do it
+ * Clear/Set all flags overridden by options, need do it
* before following smp all cpus cap AND.
*/
- for (i = 0; i < NCAPINTS; i++) {
- c->x86_capability[i] &= ~cpu_caps_cleared[i];
- c->x86_capability[i] |= cpu_caps_set[i];
- }
+ apply_forced_caps(c);
/*
* On SMP, boot_cpu_data holds the common feature set between
@@ -839,104 +2088,89 @@ static void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
/* AND the already accumulated flags with these */
for (i = 0; i < NCAPINTS; i++)
boot_cpu_data.x86_capability[i] &= c->x86_capability[i];
+
+ /* OR, i.e. replicate the bug flags */
+ for (i = NCAPINTS; i < NCAPINTS + NBUGINTS; i++)
+ c->x86_capability[i] |= boot_cpu_data.x86_capability[i];
}
+ ppin_init(c);
+
/* Init Machine Check Exception if available. */
mcheck_cpu_init(c);
- select_idle_routine(c);
-
-#if defined(CONFIG_NUMA) && defined(CONFIG_X86_64)
numa_add_cpu(smp_processor_id());
-#endif
}
-#ifdef CONFIG_X86_64
-static void vgetcpu_set_mode(void)
+/*
+ * Set up the CPU state needed to execute SYSENTER/SYSEXIT instructions
+ * on 32-bit kernels:
+ */
+#ifdef CONFIG_X86_32
+void enable_sep_cpu(void)
{
- if (cpu_has(&boot_cpu_data, X86_FEATURE_RDTSCP))
- vgetcpu_mode = VGETCPU_RDTSCP;
- else
- vgetcpu_mode = VGETCPU_LSL;
+ struct tss_struct *tss;
+ int cpu;
+
+ if (!boot_cpu_has(X86_FEATURE_SEP))
+ return;
+
+ cpu = get_cpu();
+ tss = &per_cpu(cpu_tss_rw, cpu);
+
+ /*
+ * We cache MSR_IA32_SYSENTER_CS's value in the TSS's ss1 field --
+ * see the big comment in struct x86_hw_tss's definition.
+ */
+
+ tss->x86_tss.ss1 = __KERNEL_CS;
+ wrmsrq(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1);
+ wrmsrq(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_entry_stack(cpu) + 1));
+ wrmsrq(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32);
+
+ put_cpu();
}
#endif
-void __init identify_boot_cpu(void)
+static __init void identify_boot_cpu(void)
{
identify_cpu(&boot_cpu_data);
- init_c1e_mask();
+ if (HAS_KERNEL_IBT && cpu_feature_enabled(X86_FEATURE_IBT))
+ pr_info("CET detected: Indirect Branch Tracking enabled\n");
#ifdef CONFIG_X86_32
- sysenter_setup();
enable_sep_cpu();
-#else
- vgetcpu_set_mode();
#endif
- init_hw_perf_events();
+ cpu_detect_tlb(&boot_cpu_data);
+ setup_cr_pinning();
+
+ tsx_init();
+ tdx_init();
+ lkgs_init();
}
-void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
+void identify_secondary_cpu(unsigned int cpu)
{
- BUG_ON(c == &boot_cpu_data);
+ struct cpuinfo_x86 *c = &cpu_data(cpu);
+
+ /* Copy boot_cpu_data only on the first bringup */
+ if (!c->initialized)
+ *c = boot_cpu_data;
+ c->cpu_index = cpu;
+
identify_cpu(c);
#ifdef CONFIG_X86_32
enable_sep_cpu();
#endif
- mtrr_ap_init();
-}
-
-struct msr_range {
- unsigned min;
- unsigned max;
-};
-
-static const struct msr_range msr_range_array[] __cpuinitconst = {
- { 0x00000000, 0x00000418},
- { 0xc0000000, 0xc000040b},
- { 0xc0010000, 0xc0010142},
- { 0xc0011000, 0xc001103b},
-};
-
-static void __cpuinit print_cpu_msr(void)
-{
- unsigned index_min, index_max;
- unsigned index;
- u64 val;
- int i;
-
- for (i = 0; i < ARRAY_SIZE(msr_range_array); i++) {
- index_min = msr_range_array[i].min;
- index_max = msr_range_array[i].max;
+ x86_spec_ctrl_setup_ap();
+ update_srbds_msr();
+ if (boot_cpu_has_bug(X86_BUG_GDS))
+ update_gds_msr();
- for (index = index_min; index < index_max; index++) {
- if (rdmsrl_amd_safe(index, &val))
- continue;
- printk(KERN_INFO " MSR%08x: %016llx\n", index, val);
- }
- }
-}
-
-static int show_msr __cpuinitdata;
-
-static __init int setup_show_msr(char *arg)
-{
- int num;
-
- get_option(&arg, &num);
-
- if (num > 0)
- show_msr = num;
- return 1;
-}
-__setup("show_msr=", setup_show_msr);
-
-static __init int setup_noclflush(char *arg)
-{
- setup_clear_cpu_cap(X86_FEATURE_CLFLSH);
- return 1;
+ tsx_ap_init();
+ c->initialized = true;
}
-__setup("noclflush", setup_noclflush);
-void __cpuinit print_cpu_info(struct cpuinfo_x86 *c)
+void print_cpu_info(struct cpuinfo_x86 *c)
{
const char *vendor = NULL;
@@ -948,316 +2182,456 @@ void __cpuinit print_cpu_info(struct cpuinfo_x86 *c)
}
if (vendor && !strstr(c->x86_model_id, vendor))
- printk(KERN_CONT "%s ", vendor);
+ pr_cont("%s ", vendor);
if (c->x86_model_id[0])
- printk(KERN_CONT "%s", c->x86_model_id);
+ pr_cont("%s", c->x86_model_id);
else
- printk(KERN_CONT "%d86", c->x86);
+ pr_cont("%d86", c->x86);
- if (c->x86_mask || c->cpuid_level >= 0)
- printk(KERN_CONT " stepping %02x\n", c->x86_mask);
- else
- printk(KERN_CONT "\n");
+ pr_cont(" (family: 0x%x, model: 0x%x", c->x86, c->x86_model);
-#ifdef CONFIG_SMP
- if (c->cpu_index < show_msr)
- print_cpu_msr();
-#else
- if (show_msr)
- print_cpu_msr();
-#endif
+ if (c->x86_stepping || c->cpuid_level >= 0)
+ pr_cont(", stepping: 0x%x)\n", c->x86_stepping);
+ else
+ pr_cont(")\n");
}
-static __init int setup_disablecpuid(char *arg)
-{
- int bit;
+/*
+ * clearcpuid= and setcpuid= were already parsed in cpu_parse_early_param().
+ * These dummy functions prevent them from becoming environment variables for
+ * init.
+ */
- if (get_option(&arg, &bit) && bit < NCAPINTS*32)
- setup_clear_cpu_cap(bit);
- else
- return 0;
+static __init int setup_clearcpuid(char *arg)
+{
+ return 1;
+}
+__setup("clearcpuid=", setup_clearcpuid);
+static __init int setup_setcpuid(char *arg)
+{
return 1;
}
-__setup("clearcpuid=", setup_disablecpuid);
+__setup("setcpuid=", setup_setcpuid);
-#ifdef CONFIG_X86_64
-struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
+DEFINE_PER_CPU_CACHE_HOT(struct task_struct *, current_task) = &init_task;
+EXPORT_PER_CPU_SYMBOL(current_task);
+EXPORT_PER_CPU_SYMBOL(const_current_task);
+
+DEFINE_PER_CPU_CACHE_HOT(int, __preempt_count) = INIT_PREEMPT_COUNT;
+EXPORT_PER_CPU_SYMBOL(__preempt_count);
-DEFINE_PER_CPU_FIRST(union irq_stack_union,
- irq_stack_union) __aligned(PAGE_SIZE);
+DEFINE_PER_CPU_CACHE_HOT(unsigned long, cpu_current_top_of_stack) = TOP_OF_INIT_STACK;
+#ifdef CONFIG_X86_64
/*
- * The following four percpu variables are hot. Align current_task to
- * cacheline size such that all four fall in the same cacheline.
+ * Note: Do not make this dependent on CONFIG_MITIGATION_CALL_DEPTH_TRACKING
+ * so that this space is reserved in the hot cache section even when the
+ * mitigation is disabled.
*/
-DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned =
- &init_task;
-EXPORT_PER_CPU_SYMBOL(current_task);
-
-DEFINE_PER_CPU(unsigned long, kernel_stack) =
- (unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;
-EXPORT_PER_CPU_SYMBOL(kernel_stack);
+DEFINE_PER_CPU_CACHE_HOT(u64, __x86_call_depth);
+EXPORT_PER_CPU_SYMBOL(__x86_call_depth);
-DEFINE_PER_CPU(char *, irq_stack_ptr) =
- init_per_cpu_var(irq_stack_union.irq_stack) + IRQ_STACK_SIZE - 64;
+static void wrmsrq_cstar(unsigned long val)
+{
+ /*
+ * Intel CPUs do not support 32-bit SYSCALL. Writing to MSR_CSTAR
+ * is so far ignored by the CPU, but raises a #VE trap in a TDX
+ * guest. Avoid the pointless write on all Intel CPUs.
+ */
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ wrmsrq(MSR_CSTAR, val);
+}
-DEFINE_PER_CPU(unsigned int, irq_count) = -1;
+static inline void idt_syscall_init(void)
+{
+ wrmsrq(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
-/*
- * Special IST stacks which the CPU switches to when it calls
- * an IST-marked descriptor entry. Up to 7 stacks (hardware
- * limit), all of them are 4K, except the debug stack which
- * is 8K.
- */
-static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
- [0 ... N_EXCEPTION_STACKS - 1] = EXCEPTION_STKSZ,
- [DEBUG_STACK - 1] = DEBUG_STKSZ
-};
+ if (ia32_enabled()) {
+ wrmsrq_cstar((unsigned long)entry_SYSCALL_compat);
+ /*
+ * This only works on Intel CPUs.
+	 * On AMD CPUs these MSRs are 32-bit, so the CPU truncates MSR_IA32_SYSENTER_EIP.
+ * This does not cause SYSENTER to jump to the wrong location, because
+ * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
+ */
+ wrmsrq_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
+ wrmsrq_safe(MSR_IA32_SYSENTER_ESP,
+ (unsigned long)(cpu_entry_stack(smp_processor_id()) + 1));
+ wrmsrq_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
+ } else {
+ wrmsrq_cstar((unsigned long)entry_SYSCALL32_ignore);
+ wrmsrq_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
+ wrmsrq_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+ wrmsrq_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
+ }
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
- [(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+ /*
+ * Flags to clear on syscall; clear as much as possible
+ * to minimize user space-kernel interference.
+ */
+ wrmsrq(MSR_SYSCALL_MASK,
+ X86_EFLAGS_CF|X86_EFLAGS_PF|X86_EFLAGS_AF|
+ X86_EFLAGS_ZF|X86_EFLAGS_SF|X86_EFLAGS_TF|
+ X86_EFLAGS_IF|X86_EFLAGS_DF|X86_EFLAGS_OF|
+ X86_EFLAGS_IOPL|X86_EFLAGS_NT|X86_EFLAGS_RF|
+ X86_EFLAGS_AC|X86_EFLAGS_ID);
+}
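On SYSCALL the CPU saves RFLAGS into R11 and then clears every flag set in this mask (architecturally named IA32_FMASK), i.e. the kernel entry runs with rflags &= ~fmask. A small illustration using a subset of the architectural RFLAGS bits:

	#include <stdio.h>
	#include <stdint.h>

	#define X86_EFLAGS_TF (1u << 8)
	#define X86_EFLAGS_IF (1u << 9)
	#define X86_EFLAGS_DF (1u << 10)

	int main(void)
	{
		uint64_t user_rflags = 0x246 | X86_EFLAGS_DF;	/* e.g. IF|ZF|PF set, DF set */
		uint64_t fmask = X86_EFLAGS_TF | X86_EFLAGS_IF | X86_EFLAGS_DF;

		/* What the kernel entry code observes after SYSCALL */
		printf("%#llx -> %#llx\n", (unsigned long long)user_rflags,
		       (unsigned long long)(user_rflags & ~fmask));
		return 0;
	}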
/* May not be marked __init: used by software suspend */
void syscall_init(void)
{
+ /* The default user and kernel segments */
+ wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
+
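The single STAR write above packs both selector bases: bits 47:32 seed CS/SS on SYSCALL, bits 63:48 seed them on SYSRET. A sketch of the resulting layout, using illustrative selector values (the real ones come from asm/segment.h):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t kernel_cs = 0x10, user32_cs = 0x23;	/* illustrative only */
		uint64_t star = (user32_cs << 48) | (kernel_cs << 32);

		/* SYSCALL:         CS = STAR[47:32],       SS = CS + 8            */
		/* SYSRET (64-bit): CS = STAR[63:48] + 16,  SS = STAR[63:48] + 8   */
		printf("STAR = %#018llx\n", (unsigned long long)star);
		return 0;
	}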
/*
- * LSTAR and STAR live in a bit strange symbiosis.
- * They both write to the same internal register. STAR allows to
- * set CS/DS but only a 32bit target. LSTAR sets the 64bit rip.
+	 * Except for the IA32_STAR MSR, there is no need to set up the SYSCALL
+	 * and SYSENTER MSRs for FRED, because FRED uses the ring 3 FRED
+	 * entry point for SYSCALL and SYSENTER, and ERETU is the only legitimate
+ * instruction to return to ring 3 (both sysexit and sysret cause
+ * #UD when FRED is enabled).
*/
- wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
- wrmsrl(MSR_LSTAR, system_call);
- wrmsrl(MSR_CSTAR, ignore_sysret);
+ if (!cpu_feature_enabled(X86_FEATURE_FRED))
+ idt_syscall_init();
+}
+#endif /* CONFIG_X86_64 */
-#ifdef CONFIG_IA32_EMULATION
- syscall32_cpu_init();
+#ifdef CONFIG_STACKPROTECTOR
+DEFINE_PER_CPU_CACHE_HOT(unsigned long, __stack_chk_guard);
+#ifndef CONFIG_SMP
+EXPORT_PER_CPU_SYMBOL(__stack_chk_guard);
+#endif
#endif
- /* Flags to clear on syscall */
- wrmsrl(MSR_SYSCALL_MASK,
- X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|X86_EFLAGS_IOPL);
+static void initialize_debug_regs(void)
+{
+ /* Control register first -- to make sure everything is disabled. */
+ set_debugreg(DR7_FIXED_1, 7);
+ set_debugreg(DR6_RESERVED, 6);
+ /* dr5 and dr4 don't exist */
+ set_debugreg(0, 3);
+ set_debugreg(0, 2);
+ set_debugreg(0, 1);
+ set_debugreg(0, 0);
}
-unsigned long kernel_eflags;
-
+#ifdef CONFIG_KGDB
/*
- * Copies of the original ist values from the tss are only accessed during
- * debugging, no special alignment required.
+ * Restore debug regs if using kgdbwait and you have a kernel debugger
+ * connection established.
*/
-DEFINE_PER_CPU(struct orig_ist, orig_ist);
+static void dbg_restore_debug_regs(void)
+{
+ if (unlikely(kgdb_connected && arch_kgdb_ops.correct_hw_break))
+ arch_kgdb_ops.correct_hw_break();
+}
+#else /* ! CONFIG_KGDB */
+#define dbg_restore_debug_regs()
+#endif /* ! CONFIG_KGDB */
-#else /* CONFIG_X86_64 */
+static inline void setup_getcpu(int cpu)
+{
+ unsigned long cpudata = vdso_encode_cpunode(cpu, early_cpu_to_node(cpu));
+ struct desc_struct d = { };
-DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
-EXPORT_PER_CPU_SYMBOL(current_task);
+ if (boot_cpu_has(X86_FEATURE_RDTSCP) || boot_cpu_has(X86_FEATURE_RDPID))
+ wrmsrq(MSR_TSC_AUX, cpudata);
-#ifdef CONFIG_CC_STACKPROTECTOR
-DEFINE_PER_CPU_ALIGNED(struct stack_canary, stack_canary);
-#endif
+ /* Store CPU and node number in limit. */
+ d.limit0 = cpudata;
+ d.limit1 = cpudata >> 16;
+
+ d.type = 5; /* RO data, expand down, accessed */
+ d.dpl = 3; /* Visible to user code */
+ d.s = 1; /* Not a system segment */
+ d.p = 1; /* Present */
+ d.d = 1; /* 32-bit */
-/* Make sure %fs and %gs are initialized properly in idle threads */
-struct pt_regs * __cpuinit idle_regs(struct pt_regs *regs)
+ write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_CPUNODE, &d, DESCTYPE_S);
+}
+
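User space can recover the same cpu/node pair the way the vDSO does, with LSL on this segment. A hedged sketch, assuming the conventional GDT_ENTRY_CPUNODE slot 15 (selector 15*8 + 3 = 123) and the 12-bit CPU field used by vdso_encode_cpunode(); verify both against your kernel before relying on it:

	#include <stdio.h>

	#define VDSO_CPUNODE_BITS 12
	#define VDSO_CPUNODE_MASK 0xfff

	int main(void)
	{
		unsigned long p = 0;

		/* LSL reads the segment limit; ZF checking omitted for brevity */
		asm volatile("lsl %1, %0" : "=r"(p) : "r"(123UL));
		printf("cpu %lu, node %lu\n",
		       p & VDSO_CPUNODE_MASK, p >> VDSO_CPUNODE_BITS);
		return 0;
	}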
+#ifdef CONFIG_X86_64
+static inline void tss_setup_ist(struct tss_struct *tss)
{
- memset(regs, 0, sizeof(struct pt_regs));
- regs->fs = __KERNEL_PERCPU;
- regs->gs = __KERNEL_STACK_CANARY;
+ /* Set up the per-CPU TSS IST stacks */
+ tss->x86_tss.ist[IST_INDEX_DF] = __this_cpu_ist_top_va(DF);
+ tss->x86_tss.ist[IST_INDEX_NMI] = __this_cpu_ist_top_va(NMI);
+ tss->x86_tss.ist[IST_INDEX_DB] = __this_cpu_ist_top_va(DB);
+ tss->x86_tss.ist[IST_INDEX_MCE] = __this_cpu_ist_top_va(MCE);
+ /* Only mapped when SEV-ES is active */
+ tss->x86_tss.ist[IST_INDEX_VC] = __this_cpu_ist_top_va(VC);
+}
+#else /* CONFIG_X86_64 */
+static inline void tss_setup_ist(struct tss_struct *tss) { }
+#endif /* !CONFIG_X86_64 */
+
+static inline void tss_setup_io_bitmap(struct tss_struct *tss)
+{
+ tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
- return regs;
+#ifdef CONFIG_X86_IOPL_IOPERM
+ tss->io_bitmap.prev_max = 0;
+ tss->io_bitmap.prev_sequence = 0;
+ memset(tss->io_bitmap.bitmap, 0xff, sizeof(tss->io_bitmap.bitmap));
+ /*
+	 * Invalidate the extra array entry past the end of the all-permission
+	 * bitmap, as required by the hardware.
+ */
+ tss->io_bitmap.mapall[IO_BITMAP_LONGS] = ~0UL;
+#endif
}
-#endif /* CONFIG_X86_64 */
/*
- * Clear all 6 debug registers:
+ * Set up everything needed to handle exceptions from the IDT, including the IST
+ * exceptions that use paranoid_entry().
*/
-static void clear_all_debug_regs(void)
+void cpu_init_exception_handling(bool boot_cpu)
{
- int i;
+ struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
+ int cpu = raw_smp_processor_id();
- for (i = 0; i < 8; i++) {
- /* Ignore db4, db5 */
- if ((i == 4) || (i == 5))
- continue;
+ /* paranoid_entry() gets the CPU number from the GDT */
+ setup_getcpu(cpu);
+
+ /* For IDT mode, IST vectors need to be set in TSS. */
+ if (!cpu_feature_enabled(X86_FEATURE_FRED))
+ tss_setup_ist(tss);
+ tss_setup_io_bitmap(tss);
+ set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
+
+ load_TR_desc();
+
+ /* GHCB needs to be setup to handle #VC. */
+ setup_ghcb();
- set_debugreg(0, i);
+ if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+ /* The boot CPU has enabled FRED during early boot */
+ if (!boot_cpu)
+ cpu_init_fred_exceptions();
+
+ cpu_init_fred_rsps();
+ } else {
+ load_current_idt();
}
}
-#ifdef CONFIG_KGDB
-/*
- * Restore debug regs if using kgdbwait and you have a kernel debugger
- * connection established.
- */
-static void dbg_restore_debug_regs(void)
+void __init cpu_init_replace_early_idt(void)
{
- if (unlikely(kgdb_connected && arch_kgdb_ops.correct_hw_break))
- arch_kgdb_ops.correct_hw_break();
+ if (cpu_feature_enabled(X86_FEATURE_FRED))
+ cpu_init_fred_exceptions();
+ else
+ idt_setup_early_pf();
}
-#else /* ! CONFIG_KGDB */
-#define dbg_restore_debug_regs()
-#endif /* ! CONFIG_KGDB */
/*
* cpu_init() initializes state that is per-CPU. Some data is already
- * initialized (naturally) in the bootstrap process, such as the GDT
- * and IDT. We reload them nevertheless, this function acts as a
- * 'CPU state barrier', nothing should get across.
- * A lot of state is already set up in PDA init for 64 bit
+ * initialized (naturally) in the bootstrap process, such as the GDT. We
+ * reload it nevertheless; this function acts as a 'CPU state barrier':
+ * nothing should get across.
*/
-#ifdef CONFIG_X86_64
-
-void __cpuinit cpu_init(void)
+void cpu_init(void)
{
- struct orig_ist *oist;
- struct task_struct *me;
- struct tss_struct *t;
- unsigned long v;
- int cpu;
- int i;
-
- cpu = stack_smp_processor_id();
- t = &per_cpu(init_tss, cpu);
- oist = &per_cpu(orig_ist, cpu);
+ struct task_struct *cur = current;
+ int cpu = raw_smp_processor_id();
#ifdef CONFIG_NUMA
- if (cpu != 0 && percpu_read(numa_node) == 0 &&
+ if (this_cpu_read(numa_node) == 0 &&
early_cpu_to_node(cpu) != NUMA_NO_NODE)
set_numa_node(early_cpu_to_node(cpu));
#endif
+ pr_debug("Initializing CPU#%d\n", cpu);
- me = current;
+ if (IS_ENABLED(CONFIG_X86_64) || cpu_feature_enabled(X86_FEATURE_VME) ||
+ boot_cpu_has(X86_FEATURE_TSC) || boot_cpu_has(X86_FEATURE_DE))
+ cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
- if (cpumask_test_and_set_cpu(cpu, cpu_initialized_mask))
- panic("CPU#%d already initialized!\n", cpu);
+ if (IS_ENABLED(CONFIG_X86_64)) {
+ loadsegment(fs, 0);
+ memset(cur->thread.tls_array, 0, GDT_ENTRY_TLS_ENTRIES * 8);
+ syscall_init();
- pr_debug("Initializing CPU#%d\n", cpu);
+ wrmsrq(MSR_FS_BASE, 0);
+ wrmsrq(MSR_KERNEL_GS_BASE, 0);
+ barrier();
+
+ x2apic_setup();
+
+ intel_posted_msi_init();
+ }
- clear_in_cr4(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
+ mmgrab(&init_mm);
+ cur->active_mm = &init_mm;
+ BUG_ON(cur->mm);
+ initialize_tlbstate_and_flush();
+ enter_lazy_tlb(&init_mm, cur);
/*
- * Initialize the per-CPU GDT with the boot GDT,
- * and set up the GDT descriptor:
+ * sp0 points to the entry trampoline stack regardless of what task
+ * is running.
*/
+ load_sp0((unsigned long)(cpu_entry_stack(cpu) + 1));
- switch_to_new_gdt(cpu);
- loadsegment(fs, 0);
-
- load_idt((const struct desc_ptr *)&idt_descr);
+ load_mm_ldt(&init_mm);
- memset(me->thread.tls_array, 0, GDT_ENTRY_TLS_ENTRIES * 8);
- syscall_init();
+ initialize_debug_regs();
+ dbg_restore_debug_regs();
- wrmsrl(MSR_FS_BASE, 0);
- wrmsrl(MSR_KERNEL_GS_BASE, 0);
- barrier();
+ doublefault_init_cpu_tss();
- x86_configure_nx();
- if (cpu != 0)
- enable_x2apic();
+ if (is_uv_system())
+ uv_cpu_init();
- /*
- * set up and load the per-CPU TSS
- */
- if (!oist->ist[0]) {
- char *estacks = per_cpu(exception_stacks, cpu);
+ load_fixmap_gdt(cpu);
+}
- for (v = 0; v < N_EXCEPTION_STACKS; v++) {
- estacks += exception_stack_sizes[v];
- oist->ist[v] = t->x86_tss.ist[v] =
- (unsigned long)estacks;
- }
- }
+#ifdef CONFIG_MICROCODE_LATE_LOADING
+/**
+ * store_cpu_caps() - Store a snapshot of CPU capabilities
+ * @curr_info: Pointer where to store it
+ *
+ * Return: None
+ */
+void store_cpu_caps(struct cpuinfo_x86 *curr_info)
+{
+ /* Reload CPUID max function as it might've changed. */
+ curr_info->cpuid_level = cpuid_eax(0);
- t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+ /* Copy all capability leafs and pick up the synthetic ones. */
+ memcpy(&curr_info->x86_capability, &boot_cpu_data.x86_capability,
+ sizeof(curr_info->x86_capability));
- /*
- * <= is required because the CPU will access up to
- * 8 bits beyond the end of the IO permission bitmap.
- */
- for (i = 0; i <= IO_BITMAP_LONGS; i++)
- t->io_bitmap[i] = ~0UL;
+ /* Get the hardware CPUID leafs */
+ get_cpu_cap(curr_info);
+}
- atomic_inc(&init_mm.mm_count);
- me->active_mm = &init_mm;
- BUG_ON(me->mm);
- enter_lazy_tlb(&init_mm, me);
+/**
+ * microcode_check() - Check if any CPU capabilities changed after an update.
+ * @prev_info: CPU capabilities stored before an update.
+ *
+ * The microcode loader calls this upon late microcode load to recheck features,
+ * but only when the microcode actually has been updated. The caller holds the
+ * CPU hotplug lock.
+ *
+ * Return: None
+ */
+void microcode_check(struct cpuinfo_x86 *prev_info)
+{
+ struct cpuinfo_x86 curr_info;
- load_sp0(t, &current->thread);
- set_tss_desc(cpu, t);
- load_TR_desc();
- load_LDT(&init_mm.context);
+ perf_check_microcode();
- clear_all_debug_regs();
- dbg_restore_debug_regs();
+ amd_check_microcode();
- fpu_init();
+ store_cpu_caps(&curr_info);
- raw_local_save_flags(kernel_eflags);
+ if (!memcmp(&prev_info->x86_capability, &curr_info.x86_capability,
+ sizeof(prev_info->x86_capability)))
+ return;
- if (is_uv_system())
- uv_cpu_init();
+ pr_warn("x86/CPU: CPU features have changed after loading microcode, but might not take effect.\n");
+ pr_warn("x86/CPU: Please consider either early loading through initrd/built-in or a potential BIOS update.\n");
}
+#endif
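A hedged sketch of how a late-loading sequence might use this pair; apply_microcode_update() stands in for the real loader machinery and is hypothetical:

	struct cpuinfo_x86 prev_info;

	store_cpu_caps(&prev_info);	/* snapshot CPUID-derived capabilities */
	apply_microcode_update();	/* hypothetical: write the new patch   */
	microcode_check(&prev_info);	/* re-read caps and warn on any delta  */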
-#else
-
-void __cpuinit cpu_init(void)
+/*
+ * Invoked from core CPU hotplug code after hotplug operations
+ */
+void arch_smt_update(void)
{
- int cpu = smp_processor_id();
- struct task_struct *curr = current;
- struct tss_struct *t = &per_cpu(init_tss, cpu);
- struct thread_struct *thread = &curr->thread;
-
- if (cpumask_test_and_set_cpu(cpu, cpu_initialized_mask)) {
- printk(KERN_WARNING "CPU#%d already initialized!\n", cpu);
- for (;;)
- local_irq_enable();
- }
+ /* Handle the speculative execution misfeatures */
+ cpu_bugs_smt_update();
+ /* Check whether IPI broadcasting can be enabled */
+ apic_smt_update();
+}
- printk(KERN_INFO "Initializing CPU#%d\n", cpu);
+void __init arch_cpu_finalize_init(void)
+{
+ struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info);
- if (cpu_has_vme || cpu_has_tsc || cpu_has_de)
- clear_in_cr4(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);
+ identify_boot_cpu();
- load_idt(&idt_descr);
- switch_to_new_gdt(cpu);
+ select_idle_routine();
/*
- * Set up and load the per-CPU TSS and LDT
+	 * identify_boot_cpu() initialized the SMT support information; let the
+	 * core code know.
*/
- atomic_inc(&init_mm.mm_count);
- curr->active_mm = &init_mm;
- BUG_ON(curr->mm);
- enter_lazy_tlb(&init_mm, curr);
+ cpu_smt_set_num_threads(__max_threads_per_core, __max_threads_per_core);
- load_sp0(t, thread);
- set_tss_desc(cpu, t);
- load_TR_desc();
- load_LDT(&init_mm.context);
+ if (!IS_ENABLED(CONFIG_SMP)) {
+ pr_info("CPU: ");
+ print_cpu_info(&boot_cpu_data);
+ }
- t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+ cpu_select_mitigations();
-#ifdef CONFIG_DOUBLEFAULT
- /* Set up doublefault TSS pointer in the GDT */
- __set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS, &doublefault_tss);
-#endif
+ arch_smt_update();
- clear_all_debug_regs();
- dbg_restore_debug_regs();
+ if (IS_ENABLED(CONFIG_X86_32)) {
+ /*
+		 * Check whether this is a real i386, which is no longer
+		 * supported, and fix up the utsname.
+ */
+ if (boot_cpu_data.x86 < 4)
+ panic("Kernel requires i486+ for 'invlpg' and other features");
+
+ init_utsname()->machine[1] =
+ '0' + (boot_cpu_data.x86 > 6 ? 6 : boot_cpu_data.x86);
+ }
+
+ /*
+ * Must be before alternatives because it might set or clear
+ * feature bits.
+ */
+ fpu__init_system();
+ fpu__init_cpu();
/*
- * Force FPU initialization:
+	 * This needs to follow the FPU initialization, since EFI depends on it.
*/
- current_thread_info()->status = 0;
- clear_used_math();
- mxcsr_feature_mask_init();
+ if (efi_enabled(EFI_RUNTIME_SERVICES))
+ efi_enter_virtual_mode();
/*
- * Boot processor to setup the FP and extended state context info.
+	 * Ensure that access to the per-CPU representation has the initial
+ * boot CPU configuration.
*/
- if (smp_processor_id() == boot_cpu_id)
- init_thread_xstate();
+ *c = boot_cpu_data;
+ c->initialized = true;
+
+ alternative_instructions();
+
+ if (IS_ENABLED(CONFIG_X86_64)) {
+ USER_PTR_MAX = TASK_SIZE_MAX;
- xsave_init();
+ /*
+ * Enable this when LAM is gated on LASS support
+ if (cpu_feature_enabled(X86_FEATURE_LAM))
+ USER_PTR_MAX = (1ul << 63) - PAGE_SIZE;
+ */
+ runtime_const_init(ptr, USER_PTR_MAX);
+
+ /*
+		 * Make sure the first 2MB area is not mapped with huge pages.
+		 * There are typically fixed-size MTRRs in there, and overlapping
+		 * MTRRs with large pages causes slowdowns.
+		 *
+		 * Right now we don't do that for gbpages because there seems to be
+		 * very little benefit in that case.
+ */
+ if (!direct_gbpages)
+ set_memory_4k((unsigned long)__va(0), 1);
+ } else {
+ fpu__init_check_bugs();
+ }
+
+ /*
+ * This needs to be called before any devices perform DMA
+ * operations that might use the SWIOTLB bounce buffers. It will
+ * mark the bounce buffers as decrypted so that their usage will
+ * not cause "plain-text" data to be decrypted when accessed. It
+ * must be called after late_time_init() so that Hyper-V x86/x64
+ * hypercalls work when the SWIOTLB bounce buffers are decrypted.
+ */
+ mem_encrypt_init();
}
-#endif
diff --git a/arch/x86/kernel/cpu/cpu.h b/arch/x86/kernel/cpu/cpu.h
index 3624e8a0f71b..5c7a3a71191a 100644
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -1,12 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
#ifndef ARCH_X86_CPU_H
-
#define ARCH_X86_CPU_H
-struct cpu_model_info {
- int vendor;
- int family;
- const char *model_names[16];
-};
+#include <asm/cpu.h>
+#include <asm/topology.h>
+
+#include "topology.h"
/* attempt to consolidate cpu attributes */
struct cpu_dev {
@@ -15,23 +14,81 @@ struct cpu_dev {
/* some have two possibilities for cpuid string */
const char *c_ident[2];
- struct cpu_model_info c_models[4];
-
void (*c_early_init)(struct cpuinfo_x86 *);
+ void (*c_bsp_init)(struct cpuinfo_x86 *);
void (*c_init)(struct cpuinfo_x86 *);
void (*c_identify)(struct cpuinfo_x86 *);
- unsigned int (*c_size_cache)(struct cpuinfo_x86 *, unsigned int);
+ void (*c_detect_tlb)(struct cpuinfo_x86 *);
int c_x86_vendor;
+#ifdef CONFIG_X86_32
+	/* Optional vendor-specific routine to obtain the cache size. */
+ unsigned int (*legacy_cache_size)(struct cpuinfo_x86 *,
+ unsigned int);
+
+ /* Family/stepping-based lookup table for model names. */
+ struct legacy_cpu_model_info {
+ int family;
+ const char *model_names[16];
+ } legacy_models[5];
+#endif
};
#define cpu_dev_register(cpu_devX) \
static const struct cpu_dev *const __cpu_dev_##cpu_devX __used \
- __attribute__((__section__(".x86_cpu_dev.init"))) = \
+ __section(".x86_cpu_dev.init") = \
&cpu_devX;
extern const struct cpu_dev *const __x86_cpu_dev_start[],
*const __x86_cpu_dev_end[];
+#ifdef CONFIG_CPU_SUP_INTEL
+extern void __init tsx_init(void);
+void tsx_ap_init(void);
+void intel_unlock_cpuid_leafs(struct cpuinfo_x86 *c);
+#else
+static inline void tsx_init(void) { }
+static inline void tsx_ap_init(void) { }
+static inline void intel_unlock_cpuid_leafs(struct cpuinfo_x86 *c) { }
+#endif /* CONFIG_CPU_SUP_INTEL */
+
+extern void init_spectral_chicken(struct cpuinfo_x86 *c);
+
+extern void get_cpu_cap(struct cpuinfo_x86 *c);
+extern void get_cpu_address_sizes(struct cpuinfo_x86 *c);
extern void cpu_detect_cache_sizes(struct cpuinfo_x86 *c);
+extern void init_scattered_cpuid_features(struct cpuinfo_x86 *c);
+extern void init_intel_cacheinfo(struct cpuinfo_x86 *c);
+extern void init_amd_cacheinfo(struct cpuinfo_x86 *c);
+extern void init_hygon_cacheinfo(struct cpuinfo_x86 *c);
+extern void check_null_seg_clears_base(struct cpuinfo_x86 *c);
+
+void cacheinfo_amd_init_llc_id(struct cpuinfo_x86 *c, u16 die_id);
+void cacheinfo_hygon_init_llc_id(struct cpuinfo_x86 *c);
+
+#if defined(CONFIG_AMD_NB) && defined(CONFIG_SYSFS)
+struct amd_northbridge *amd_init_l3_cache(int index);
+#else
+static inline struct amd_northbridge *amd_init_l3_cache(int index)
+{
+ return NULL;
+}
#endif
+
+unsigned int aperfmperf_get_khz(int cpu);
+void cpu_select_mitigations(void);
+
+extern void x86_spec_ctrl_setup_ap(void);
+extern void update_srbds_msr(void);
+extern void update_gds_msr(void);
+
+extern enum spectre_v2_mitigation spectre_v2_enabled;
+
+static inline bool spectre_v2_in_eibrs_mode(enum spectre_v2_mitigation mode)
+{
+ return mode == SPECTRE_V2_EIBRS ||
+ mode == SPECTRE_V2_EIBRS_RETPOLINE ||
+ mode == SPECTRE_V2_EIBRS_LFENCE;
+}
+
+#endif /* ARCH_X86_CPU_H */
diff --git a/arch/x86/kernel/cpu/cpufreq/Kconfig b/arch/x86/kernel/cpu/cpufreq/Kconfig
deleted file mode 100644
index 870e6cc6ad28..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/Kconfig
+++ /dev/null
@@ -1,266 +0,0 @@
-#
-# CPU Frequency scaling
-#
-
-menu "CPU Frequency scaling"
-
-source "drivers/cpufreq/Kconfig"
-
-if CPU_FREQ
-
-comment "CPUFreq processor drivers"
-
-config X86_PCC_CPUFREQ
- tristate "Processor Clocking Control interface driver"
- depends on ACPI && ACPI_PROCESSOR
- help
- This driver adds support for the PCC interface.
-
- For details, take a look at:
- <file:Documentation/cpu-freq/pcc-cpufreq.txt>.
-
- To compile this driver as a module, choose M here: the
- module will be called pcc-cpufreq.
-
- If in doubt, say N.
-
-config X86_ACPI_CPUFREQ
- tristate "ACPI Processor P-States driver"
- select CPU_FREQ_TABLE
- depends on ACPI_PROCESSOR
- help
- This driver adds a CPUFreq driver which utilizes the ACPI
- Processor Performance States.
- This driver also supports Intel Enhanced Speedstep.
-
- To compile this driver as a module, choose M here: the
- module will be called acpi-cpufreq.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-config ELAN_CPUFREQ
- tristate "AMD Elan SC400 and SC410"
- select CPU_FREQ_TABLE
- depends on X86_ELAN
- ---help---
- This adds the CPUFreq driver for AMD Elan SC400 and SC410
- processors.
-
- You need to specify the processor maximum speed as boot
- parameter: elanfreq=maxspeed (in kHz) or as module
- parameter "max_freq".
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-config SC520_CPUFREQ
- tristate "AMD Elan SC520"
- select CPU_FREQ_TABLE
- depends on X86_ELAN
- ---help---
- This adds the CPUFreq driver for AMD Elan SC520 processor.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-
-config X86_POWERNOW_K6
- tristate "AMD Mobile K6-2/K6-3 PowerNow!"
- select CPU_FREQ_TABLE
- depends on X86_32
- help
- This adds the CPUFreq driver for mobile AMD K6-2+ and mobile
- AMD K6-3+ processors.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-config X86_POWERNOW_K7
- tristate "AMD Mobile Athlon/Duron PowerNow!"
- select CPU_FREQ_TABLE
- depends on X86_32
- help
- This adds the CPUFreq driver for mobile AMD K7 mobile processors.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-config X86_POWERNOW_K7_ACPI
- bool
- depends on X86_POWERNOW_K7 && ACPI_PROCESSOR
- depends on !(X86_POWERNOW_K7 = y && ACPI_PROCESSOR = m)
- depends on X86_32
- default y
-
-config X86_POWERNOW_K8
- tristate "AMD Opteron/Athlon64 PowerNow!"
- select CPU_FREQ_TABLE
- depends on ACPI && ACPI_PROCESSOR
- help
- This adds the CPUFreq driver for K8/K10 Opteron/Athlon64 processors.
-
- To compile this driver as a module, choose M here: the
- module will be called powernow-k8.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
-config X86_GX_SUSPMOD
- tristate "Cyrix MediaGX/NatSemi Geode Suspend Modulation"
- depends on X86_32 && PCI
- help
- This add the CPUFreq driver for NatSemi Geode processors which
- support suspend modulation.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-config X86_SPEEDSTEP_CENTRINO
- tristate "Intel Enhanced SpeedStep (deprecated)"
- select CPU_FREQ_TABLE
- select X86_SPEEDSTEP_CENTRINO_TABLE if X86_32
- depends on X86_32 || (X86_64 && ACPI_PROCESSOR)
- help
- This is deprecated and this functionality is now merged into
- acpi_cpufreq (X86_ACPI_CPUFREQ). Use that driver instead of
- speedstep_centrino.
- This adds the CPUFreq driver for Enhanced SpeedStep enabled
- mobile CPUs. This means Intel Pentium M (Centrino) CPUs
- or 64bit enabled Intel Xeons.
-
- To compile this driver as a module, choose M here: the
- module will be called speedstep-centrino.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-config X86_SPEEDSTEP_CENTRINO_TABLE
- bool "Built-in tables for Banias CPUs"
- depends on X86_32 && X86_SPEEDSTEP_CENTRINO
- default y
- help
- Use built-in tables for Banias CPUs if ACPI encoding
- is not available.
-
- If in doubt, say N.
-
-config X86_SPEEDSTEP_ICH
- tristate "Intel Speedstep on ICH-M chipsets (ioport interface)"
- select CPU_FREQ_TABLE
- depends on X86_32
- help
- This adds the CPUFreq driver for certain mobile Intel Pentium III
- (Coppermine), all mobile Intel Pentium III-M (Tualatin) and all
- mobile Intel Pentium 4 P4-M on systems which have an Intel ICH2,
- ICH3 or ICH4 southbridge.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-config X86_SPEEDSTEP_SMI
- tristate "Intel SpeedStep on 440BX/ZX/MX chipsets (SMI interface)"
- select CPU_FREQ_TABLE
- depends on X86_32 && EXPERIMENTAL
- help
- This adds the CPUFreq driver for certain mobile Intel Pentium III
- (Coppermine), all mobile Intel Pentium III-M (Tualatin)
- on systems which have an Intel 440BX/ZX/MX southbridge.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-config X86_P4_CLOCKMOD
- tristate "Intel Pentium 4 clock modulation"
- select CPU_FREQ_TABLE
- help
- This adds the CPUFreq driver for Intel Pentium 4 / XEON
- processors. When enabled it will lower CPU temperature by skipping
- clocks.
-
- This driver should be only used in exceptional
- circumstances when very low power is needed because it causes severe
- slowdowns and noticeable latencies. Normally Speedstep should be used
- instead.
-
- To compile this driver as a module, choose M here: the
- module will be called p4-clockmod.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- Unless you are absolutely sure say N.
-
-config X86_CPUFREQ_NFORCE2
- tristate "nVidia nForce2 FSB changing"
- depends on X86_32 && EXPERIMENTAL
- help
- This adds the CPUFreq driver for FSB changing on nVidia nForce2
- platforms.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-config X86_LONGRUN
- tristate "Transmeta LongRun"
- depends on X86_32
- help
- This adds the CPUFreq driver for Transmeta Crusoe and Efficeon processors
- which support LongRun.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-config X86_LONGHAUL
- tristate "VIA Cyrix III Longhaul"
- select CPU_FREQ_TABLE
- depends on X86_32 && ACPI_PROCESSOR
- help
- This adds the CPUFreq driver for VIA Samuel/CyrixIII,
- VIA Cyrix Samuel/C3, VIA Cyrix Ezra and VIA Cyrix Ezra-T
- processors.
-
- For details, take a look at <file:Documentation/cpu-freq/>.
-
- If in doubt, say N.
-
-config X86_E_POWERSAVER
- tristate "VIA C7 Enhanced PowerSaver (DANGEROUS)"
- select CPU_FREQ_TABLE
- depends on X86_32 && EXPERIMENTAL
- help
- This adds the CPUFreq driver for VIA C7 processors. However, this driver
- does not have any safeguards to prevent operating the CPU out of spec
- and is thus considered dangerous. Please use the regular ACPI cpufreq
- driver, enabled by CONFIG_X86_ACPI_CPUFREQ.
-
- If in doubt, say N.
-
-comment "shared options"
-
-config X86_SPEEDSTEP_LIB
- tristate
- default (X86_SPEEDSTEP_ICH || X86_SPEEDSTEP_SMI || X86_P4_CLOCKMOD)
-
-config X86_SPEEDSTEP_RELAXED_CAP_CHECK
- bool "Relaxed speedstep capability checks"
- depends on X86_32 && (X86_SPEEDSTEP_SMI || X86_SPEEDSTEP_ICH)
- help
- Don't perform all checks for a speedstep capable system which would
- normally be done. Some ancient or strange systems, though speedstep
- capable, don't always indicate that they are speedstep capable. This
- option lets the probing code bypass some of those checks if the
- parameter "relaxed_check=1" is passed to the module.
-
-endif # CPU_FREQ
-
-endmenu
diff --git a/arch/x86/kernel/cpu/cpufreq/Makefile b/arch/x86/kernel/cpu/cpufreq/Makefile
deleted file mode 100644
index bd54bf67e6fb..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/Makefile
+++ /dev/null
@@ -1,21 +0,0 @@
-# Link order matters. K8 is preferred to ACPI because of firmware bugs in early
-# K8 systems. ACPI is preferred to all other hardware-specific drivers.
-# speedstep-* is preferred over p4-clockmod.
-
-obj-$(CONFIG_X86_POWERNOW_K8) += powernow-k8.o mperf.o
-obj-$(CONFIG_X86_ACPI_CPUFREQ) += acpi-cpufreq.o mperf.o
-obj-$(CONFIG_X86_PCC_CPUFREQ) += pcc-cpufreq.o
-obj-$(CONFIG_X86_POWERNOW_K6) += powernow-k6.o
-obj-$(CONFIG_X86_POWERNOW_K7) += powernow-k7.o
-obj-$(CONFIG_X86_LONGHAUL) += longhaul.o
-obj-$(CONFIG_X86_E_POWERSAVER) += e_powersaver.o
-obj-$(CONFIG_ELAN_CPUFREQ) += elanfreq.o
-obj-$(CONFIG_SC520_CPUFREQ) += sc520_freq.o
-obj-$(CONFIG_X86_LONGRUN) += longrun.o
-obj-$(CONFIG_X86_GX_SUSPMOD) += gx-suspmod.o
-obj-$(CONFIG_X86_SPEEDSTEP_ICH) += speedstep-ich.o
-obj-$(CONFIG_X86_SPEEDSTEP_LIB) += speedstep-lib.o
-obj-$(CONFIG_X86_SPEEDSTEP_SMI) += speedstep-smi.o
-obj-$(CONFIG_X86_SPEEDSTEP_CENTRINO) += speedstep-centrino.o
-obj-$(CONFIG_X86_P4_CLOCKMOD) += p4-clockmod.o
-obj-$(CONFIG_X86_CPUFREQ_NFORCE2) += cpufreq-nforce2.o
diff --git a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
deleted file mode 100644
index 1d3cddaa40ee..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
+++ /dev/null
@@ -1,778 +0,0 @@
-/*
- * acpi-cpufreq.c - ACPI Processor P-States Driver
- *
- * Copyright (C) 2001, 2002 Andy Grover <andrew.grover@intel.com>
- * Copyright (C) 2001, 2002 Paul Diefenbaugh <paul.s.diefenbaugh@intel.com>
- * Copyright (C) 2002 - 2004 Dominik Brodowski <linux@brodo.de>
- * Copyright (C) 2006 Denis Sadykov <denis.m.sadykov@intel.com>
- *
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or (at
- * your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License along
- * with this program; if not, write to the Free Software Foundation, Inc.,
- * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
- *
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/smp.h>
-#include <linux/sched.h>
-#include <linux/cpufreq.h>
-#include <linux/compiler.h>
-#include <linux/dmi.h>
-#include <linux/slab.h>
-#include <trace/events/power.h>
-
-#include <linux/acpi.h>
-#include <linux/io.h>
-#include <linux/delay.h>
-#include <linux/uaccess.h>
-
-#include <acpi/processor.h>
-
-#include <asm/msr.h>
-#include <asm/processor.h>
-#include <asm/cpufeature.h>
-#include "mperf.h"
-
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "acpi-cpufreq", msg)
-
-MODULE_AUTHOR("Paul Diefenbaugh, Dominik Brodowski");
-MODULE_DESCRIPTION("ACPI Processor P-States Driver");
-MODULE_LICENSE("GPL");
-
-enum {
- UNDEFINED_CAPABLE = 0,
- SYSTEM_INTEL_MSR_CAPABLE,
- SYSTEM_IO_CAPABLE,
-};
-
-#define INTEL_MSR_RANGE (0xffff)
-
-struct acpi_cpufreq_data {
- struct acpi_processor_performance *acpi_data;
- struct cpufreq_frequency_table *freq_table;
- unsigned int resume;
- unsigned int cpu_feature;
-};
-
-static DEFINE_PER_CPU(struct acpi_cpufreq_data *, acfreq_data);
-
-/* acpi_perf_data is a pointer to percpu data. */
-static struct acpi_processor_performance *acpi_perf_data;
-
-static struct cpufreq_driver acpi_cpufreq_driver;
-
-static unsigned int acpi_pstate_strict;
-
-static int check_est_cpu(unsigned int cpuid)
-{
- struct cpuinfo_x86 *cpu = &cpu_data(cpuid);
-
- return cpu_has(cpu, X86_FEATURE_EST);
-}
-
-static unsigned extract_io(u32 value, struct acpi_cpufreq_data *data)
-{
- struct acpi_processor_performance *perf;
- int i;
-
- perf = data->acpi_data;
-
- for (i = 0; i < perf->state_count; i++) {
- if (value == perf->states[i].status)
- return data->freq_table[i].frequency;
- }
- return 0;
-}
-
-static unsigned extract_msr(u32 msr, struct acpi_cpufreq_data *data)
-{
- int i;
- struct acpi_processor_performance *perf;
-
- msr &= INTEL_MSR_RANGE;
- perf = data->acpi_data;
-
- for (i = 0; data->freq_table[i].frequency != CPUFREQ_TABLE_END; i++) {
- if (msr == perf->states[data->freq_table[i].index].status)
- return data->freq_table[i].frequency;
- }
- return data->freq_table[0].frequency;
-}
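For the record, the MSR path being deleted here matched the low 16 bits of MSR_IA32_PERF_STATUS (INTEL_MSR_RANGE) against the ACPI _PSS status values. A standalone worked example with hypothetical table contents:

	#include <stdio.h>
	#include <stdint.h>

	#define INTEL_MSR_RANGE 0xffff

	int main(void)
	{
		/* Hypothetical _PSS status words and frequencies in kHz */
		uint32_t status[] = { 0x0a1f, 0x081c, 0x0614 };
		unsigned freq[]   = { 2600000, 2100000, 1600000 };
		uint64_t msr = 0xdead081cULL;	/* upper bits are noise */
		uint32_t val = msr & INTEL_MSR_RANGE;

		for (int i = 0; i < 3; i++)
			if (val == status[i])
				printf("current frequency: %u kHz\n", freq[i]);
		return 0;
	}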
-
-static unsigned extract_freq(u32 val, struct acpi_cpufreq_data *data)
-{
- switch (data->cpu_feature) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- return extract_msr(val, data);
- case SYSTEM_IO_CAPABLE:
- return extract_io(val, data);
- default:
- return 0;
- }
-}
-
-struct msr_addr {
- u32 reg;
-};
-
-struct io_addr {
- u16 port;
- u8 bit_width;
-};
-
-struct drv_cmd {
- unsigned int type;
- const struct cpumask *mask;
- union {
- struct msr_addr msr;
- struct io_addr io;
- } addr;
- u32 val;
-};
-
-/* Called via smp_call_function_single(), on the target CPU */
-static void do_drv_read(void *_cmd)
-{
- struct drv_cmd *cmd = _cmd;
- u32 h;
-
- switch (cmd->type) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- rdmsr(cmd->addr.msr.reg, cmd->val, h);
- break;
- case SYSTEM_IO_CAPABLE:
- acpi_os_read_port((acpi_io_address)cmd->addr.io.port,
- &cmd->val,
- (u32)cmd->addr.io.bit_width);
- break;
- default:
- break;
- }
-}
-
-/* Called via smp_call_function_many(), on the target CPUs */
-static void do_drv_write(void *_cmd)
-{
- struct drv_cmd *cmd = _cmd;
- u32 lo, hi;
-
- switch (cmd->type) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- rdmsr(cmd->addr.msr.reg, lo, hi);
- lo = (lo & ~INTEL_MSR_RANGE) | (cmd->val & INTEL_MSR_RANGE);
- wrmsr(cmd->addr.msr.reg, lo, hi);
- break;
- case SYSTEM_IO_CAPABLE:
- acpi_os_write_port((acpi_io_address)cmd->addr.io.port,
- cmd->val,
- (u32)cmd->addr.io.bit_width);
- break;
- default:
- break;
- }
-}
-
-static void drv_read(struct drv_cmd *cmd)
-{
- int err;
- cmd->val = 0;
-
- err = smp_call_function_any(cmd->mask, do_drv_read, cmd, 1);
- WARN_ON_ONCE(err); /* smp_call_function_any() was buggy? */
-}
-
-static void drv_write(struct drv_cmd *cmd)
-{
- int this_cpu;
-
- this_cpu = get_cpu();
- if (cpumask_test_cpu(this_cpu, cmd->mask))
- do_drv_write(cmd);
- smp_call_function_many(cmd->mask, do_drv_write, cmd, 1);
- put_cpu();
-}
-
-static u32 get_cur_val(const struct cpumask *mask)
-{
- struct acpi_processor_performance *perf;
- struct drv_cmd cmd;
-
- if (unlikely(cpumask_empty(mask)))
- return 0;
-
- switch (per_cpu(acfreq_data, cpumask_first(mask))->cpu_feature) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_IA32_PERF_STATUS;
- break;
- case SYSTEM_IO_CAPABLE:
- cmd.type = SYSTEM_IO_CAPABLE;
- perf = per_cpu(acfreq_data, cpumask_first(mask))->acpi_data;
- cmd.addr.io.port = perf->control_register.address;
- cmd.addr.io.bit_width = perf->control_register.bit_width;
- break;
- default:
- return 0;
- }
-
- cmd.mask = mask;
- drv_read(&cmd);
-
- dprintk("get_cur_val = %u\n", cmd.val);
-
- return cmd.val;
-}
-
-static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
-{
- struct acpi_cpufreq_data *data = per_cpu(acfreq_data, cpu);
- unsigned int freq;
- unsigned int cached_freq;
-
- dprintk("get_cur_freq_on_cpu (%d)\n", cpu);
-
- if (unlikely(data == NULL ||
- data->acpi_data == NULL || data->freq_table == NULL)) {
- return 0;
- }
-
- cached_freq = data->freq_table[data->acpi_data->state].frequency;
- freq = extract_freq(get_cur_val(cpumask_of(cpu)), data);
- if (freq != cached_freq) {
- /*
- * The dreaded BIOS frequency change behind our back.
- * Force set the frequency on next target call.
- */
- data->resume = 1;
- }
-
- dprintk("cur freq = %u\n", freq);
-
- return freq;
-}
-
-static unsigned int check_freqs(const struct cpumask *mask, unsigned int freq,
- struct acpi_cpufreq_data *data)
-{
- unsigned int cur_freq;
- unsigned int i;
-
- for (i = 0; i < 100; i++) {
- cur_freq = extract_freq(get_cur_val(mask), data);
- if (cur_freq == freq)
- return 1;
- udelay(10);
- }
- return 0;
-}
-
-static int acpi_cpufreq_target(struct cpufreq_policy *policy,
- unsigned int target_freq, unsigned int relation)
-{
- struct acpi_cpufreq_data *data = per_cpu(acfreq_data, policy->cpu);
- struct acpi_processor_performance *perf;
- struct cpufreq_freqs freqs;
- struct drv_cmd cmd;
- unsigned int next_state = 0; /* Index into freq_table */
- unsigned int next_perf_state = 0; /* Index into perf table */
- unsigned int i;
- int result = 0;
-
- dprintk("acpi_cpufreq_target %d (%d)\n", target_freq, policy->cpu);
-
- if (unlikely(data == NULL ||
- data->acpi_data == NULL || data->freq_table == NULL)) {
- return -ENODEV;
- }
-
- perf = data->acpi_data;
- result = cpufreq_frequency_table_target(policy,
- data->freq_table,
- target_freq,
- relation, &next_state);
- if (unlikely(result)) {
- result = -ENODEV;
- goto out;
- }
-
- next_perf_state = data->freq_table[next_state].index;
- if (perf->state == next_perf_state) {
- if (unlikely(data->resume)) {
- dprintk("Called after resume, resetting to P%d\n",
- next_perf_state);
- data->resume = 0;
- } else {
- dprintk("Already at target state (P%d)\n",
- next_perf_state);
- goto out;
- }
- }
-
- trace_power_frequency(POWER_PSTATE, data->freq_table[next_state].frequency);
-
- switch (data->cpu_feature) {
- case SYSTEM_INTEL_MSR_CAPABLE:
- cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
- cmd.addr.msr.reg = MSR_IA32_PERF_CTL;
- cmd.val = (u32) perf->states[next_perf_state].control;
- break;
- case SYSTEM_IO_CAPABLE:
- cmd.type = SYSTEM_IO_CAPABLE;
- cmd.addr.io.port = perf->control_register.address;
- cmd.addr.io.bit_width = perf->control_register.bit_width;
- cmd.val = (u32) perf->states[next_perf_state].control;
- break;
- default:
- result = -ENODEV;
- goto out;
- }
-
- /* cpufreq holds the hotplug lock, so we are safe from here on */
- if (policy->shared_type != CPUFREQ_SHARED_TYPE_ANY)
- cmd.mask = policy->cpus;
- else
- cmd.mask = cpumask_of(policy->cpu);
-
- freqs.old = perf->states[perf->state].core_frequency * 1000;
- freqs.new = data->freq_table[next_state].frequency;
- for_each_cpu(i, cmd.mask) {
- freqs.cpu = i;
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
- }
-
- drv_write(&cmd);
-
- if (acpi_pstate_strict) {
- if (!check_freqs(cmd.mask, freqs.new, data)) {
- dprintk("acpi_cpufreq_target failed (%d)\n",
- policy->cpu);
- result = -EAGAIN;
- goto out;
- }
- }
-
- for_each_cpu(i, cmd.mask) {
- freqs.cpu = i;
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
- }
- perf->state = next_perf_state;
-
-out:
- return result;
-}
-
-static int acpi_cpufreq_verify(struct cpufreq_policy *policy)
-{
- struct acpi_cpufreq_data *data = per_cpu(acfreq_data, policy->cpu);
-
- dprintk("acpi_cpufreq_verify\n");
-
- return cpufreq_frequency_table_verify(policy, data->freq_table);
-}
-
-static unsigned long
-acpi_cpufreq_guess_freq(struct acpi_cpufreq_data *data, unsigned int cpu)
-{
- struct acpi_processor_performance *perf = data->acpi_data;
-
- if (cpu_khz) {
- /* search the closest match to cpu_khz */
- unsigned int i;
- unsigned long freq;
- unsigned long freqn = perf->states[0].core_frequency * 1000;
-
- for (i = 0; i < (perf->state_count-1); i++) {
- freq = freqn;
- freqn = perf->states[i+1].core_frequency * 1000;
- if ((2 * cpu_khz) > (freqn + freq)) {
- perf->state = i;
- return freq;
- }
- }
- perf->state = perf->state_count-1;
- return freqn;
- } else {
- /* assume CPU is at P0... */
- perf->state = 0;
- return perf->states[0].core_frequency * 1000;
- }
-}
-
-static void free_acpi_perf_data(void)
-{
- unsigned int i;
-
- /* Freeing a NULL pointer is OK, and alloc_percpu zeroes. */
- for_each_possible_cpu(i)
- free_cpumask_var(per_cpu_ptr(acpi_perf_data, i)
- ->shared_cpu_map);
- free_percpu(acpi_perf_data);
-}
-
-/*
- * acpi_cpufreq_early_init - initialize ACPI P-States library
- *
- * Initialize the ACPI P-States library (drivers/acpi/processor_perflib.c)
- * in order to determine correct frequency and voltage pairings. We can
- * do _PDC and _PSD and find out the processor dependency for the
- * actual init that will happen later...
- */
-static int __init acpi_cpufreq_early_init(void)
-{
- unsigned int i;
- dprintk("acpi_cpufreq_early_init\n");
-
- acpi_perf_data = alloc_percpu(struct acpi_processor_performance);
- if (!acpi_perf_data) {
- dprintk("Memory allocation error for acpi_perf_data.\n");
- return -ENOMEM;
- }
- for_each_possible_cpu(i) {
- if (!zalloc_cpumask_var_node(
- &per_cpu_ptr(acpi_perf_data, i)->shared_cpu_map,
- GFP_KERNEL, cpu_to_node(i))) {
-
- /* Freeing a NULL pointer is OK: alloc_percpu zeroes. */
- free_acpi_perf_data();
- return -ENOMEM;
- }
- }
-
- /* Do initialization in ACPI core */
- acpi_processor_preregister_performance(acpi_perf_data);
- return 0;
-}
-
-#ifdef CONFIG_SMP
-/*
- * Some BIOSes do SW_ANY coordination internally, either set it up in hw
- * or do it in BIOS firmware and won't inform about it to OS. If not
- * detected, this has a side effect of making CPU run at a different speed
- * than OS intended it to run at. Detect it and handle it cleanly.
- */
-static int bios_with_sw_any_bug;
-
-static int sw_any_bug_found(const struct dmi_system_id *d)
-{
- bios_with_sw_any_bug = 1;
- return 0;
-}
-
-static const struct dmi_system_id sw_any_bug_dmi_table[] = {
- {
- .callback = sw_any_bug_found,
- .ident = "Supermicro Server X6DLP",
- .matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "Supermicro"),
- DMI_MATCH(DMI_BIOS_VERSION, "080010"),
- DMI_MATCH(DMI_PRODUCT_NAME, "X6DLP"),
- },
- },
- { }
-};
-
-static int acpi_cpufreq_blacklist(struct cpuinfo_x86 *c)
-{
- /* Intel Xeon Processor 7100 Series Specification Update
- * http://www.intel.com/Assets/PDF/specupdate/314554.pdf
- * AL30: A Machine Check Exception (MCE) Occurring during an
- * Enhanced Intel SpeedStep Technology Ratio Change May Cause
- * Both Processor Cores to Lock Up. */
- if (c->x86_vendor == X86_VENDOR_INTEL) {
- if ((c->x86 == 15) &&
- (c->x86_model == 6) &&
- (c->x86_mask == 8)) {
- printk(KERN_INFO "acpi-cpufreq: Intel(R) "
- "Xeon(R) 7100 Errata AL30, processors may "
- "lock up on frequency changes: disabling "
- "acpi-cpufreq.\n");
- return -ENODEV;
- }
- }
- return 0;
-}
-#endif
-
-static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
-{
- unsigned int i;
- unsigned int valid_states = 0;
- unsigned int cpu = policy->cpu;
- struct acpi_cpufreq_data *data;
- unsigned int result = 0;
- struct cpuinfo_x86 *c = &cpu_data(policy->cpu);
- struct acpi_processor_performance *perf;
-#ifdef CONFIG_SMP
- static int blacklisted;
-#endif
-
- dprintk("acpi_cpufreq_cpu_init\n");
-
-#ifdef CONFIG_SMP
- if (blacklisted)
- return blacklisted;
- blacklisted = acpi_cpufreq_blacklist(c);
- if (blacklisted)
- return blacklisted;
-#endif
-
- data = kzalloc(sizeof(struct acpi_cpufreq_data), GFP_KERNEL);
- if (!data)
- return -ENOMEM;
-
- data->acpi_data = per_cpu_ptr(acpi_perf_data, cpu);
- per_cpu(acfreq_data, cpu) = data;
-
- if (cpu_has(c, X86_FEATURE_CONSTANT_TSC))
- acpi_cpufreq_driver.flags |= CPUFREQ_CONST_LOOPS;
-
- result = acpi_processor_register_performance(data->acpi_data, cpu);
- if (result)
- goto err_free;
-
- perf = data->acpi_data;
- policy->shared_type = perf->shared_type;
-
- /*
- * Will let policy->cpus know about dependency only when software
- * coordination is required.
- */
- if (policy->shared_type == CPUFREQ_SHARED_TYPE_ALL ||
- policy->shared_type == CPUFREQ_SHARED_TYPE_ANY) {
- cpumask_copy(policy->cpus, perf->shared_cpu_map);
- }
- cpumask_copy(policy->related_cpus, perf->shared_cpu_map);
-
-#ifdef CONFIG_SMP
- dmi_check_system(sw_any_bug_dmi_table);
- if (bios_with_sw_any_bug && cpumask_weight(policy->cpus) == 1) {
- policy->shared_type = CPUFREQ_SHARED_TYPE_ALL;
- cpumask_copy(policy->cpus, cpu_core_mask(cpu));
- }
-#endif
-
- /* capability check */
- if (perf->state_count <= 1) {
- dprintk("No P-States\n");
- result = -ENODEV;
- goto err_unreg;
- }
-
- if (perf->control_register.space_id != perf->status_register.space_id) {
- result = -ENODEV;
- goto err_unreg;
- }
-
- switch (perf->control_register.space_id) {
- case ACPI_ADR_SPACE_SYSTEM_IO:
- dprintk("SYSTEM IO addr space\n");
- data->cpu_feature = SYSTEM_IO_CAPABLE;
- break;
- case ACPI_ADR_SPACE_FIXED_HARDWARE:
- dprintk("HARDWARE addr space\n");
- if (!check_est_cpu(cpu)) {
- result = -ENODEV;
- goto err_unreg;
- }
- data->cpu_feature = SYSTEM_INTEL_MSR_CAPABLE;
- break;
- default:
- dprintk("Unknown addr space %d\n",
- (u32) (perf->control_register.space_id));
- result = -ENODEV;
- goto err_unreg;
- }
-
- data->freq_table = kmalloc(sizeof(struct cpufreq_frequency_table) *
- (perf->state_count+1), GFP_KERNEL);
- if (!data->freq_table) {
- result = -ENOMEM;
- goto err_unreg;
- }
-
- /* detect transition latency */
- policy->cpuinfo.transition_latency = 0;
- for (i = 0; i < perf->state_count; i++) {
- if ((perf->states[i].transition_latency * 1000) >
- policy->cpuinfo.transition_latency)
- policy->cpuinfo.transition_latency =
- perf->states[i].transition_latency * 1000;
- }
-
- /* Check for high latency (>20uS) from buggy BIOSes, like on T42 */
- if (perf->control_register.space_id == ACPI_ADR_SPACE_FIXED_HARDWARE &&
- policy->cpuinfo.transition_latency > 20 * 1000) {
- policy->cpuinfo.transition_latency = 20 * 1000;
- printk_once(KERN_INFO
- "P-state transition latency capped at 20 uS\n");
- }
-
- /* table init */
- for (i = 0; i < perf->state_count; i++) {
- if (i > 0 && perf->states[i].core_frequency >=
- data->freq_table[valid_states-1].frequency / 1000)
- continue;
-
- data->freq_table[valid_states].index = i;
- data->freq_table[valid_states].frequency =
- perf->states[i].core_frequency * 1000;
- valid_states++;
- }
- data->freq_table[valid_states].frequency = CPUFREQ_TABLE_END;
- perf->state = 0;
-
- result = cpufreq_frequency_table_cpuinfo(policy, data->freq_table);
- if (result)
- goto err_freqfree;
-
- if (perf->states[0].core_frequency * 1000 != policy->cpuinfo.max_freq)
- printk(KERN_WARNING FW_WARN "P-state 0 is not max freq\n");
-
- switch (perf->control_register.space_id) {
- case ACPI_ADR_SPACE_SYSTEM_IO:
- /* Current speed is unknown and not detectable by IO port */
- policy->cur = acpi_cpufreq_guess_freq(data, policy->cpu);
- break;
- case ACPI_ADR_SPACE_FIXED_HARDWARE:
- acpi_cpufreq_driver.get = get_cur_freq_on_cpu;
- policy->cur = get_cur_freq_on_cpu(cpu);
- break;
- default:
- break;
- }
-
- /* notify BIOS that we exist */
- acpi_processor_notify_smm(THIS_MODULE);
-
- /* Check for APERF/MPERF support in hardware */
- if (cpu_has(c, X86_FEATURE_APERFMPERF))
- acpi_cpufreq_driver.getavg = cpufreq_get_measured_perf;
-
- dprintk("CPU%u - ACPI performance management activated.\n", cpu);
- for (i = 0; i < perf->state_count; i++)
- dprintk(" %cP%d: %d MHz, %d mW, %d uS\n",
- (i == perf->state ? '*' : ' '), i,
- (u32) perf->states[i].core_frequency,
- (u32) perf->states[i].power,
- (u32) perf->states[i].transition_latency);
-
- cpufreq_frequency_table_get_attr(data->freq_table, policy->cpu);
-
- /*
- * the first call to ->target() should result in us actually
- * writing something to the appropriate registers.
- */
- data->resume = 1;
-
- return result;
-
-err_freqfree:
- kfree(data->freq_table);
-err_unreg:
- acpi_processor_unregister_performance(perf, cpu);
-err_free:
- kfree(data);
- per_cpu(acfreq_data, cpu) = NULL;
-
- return result;
-}
-
-static int acpi_cpufreq_cpu_exit(struct cpufreq_policy *policy)
-{
- struct acpi_cpufreq_data *data = per_cpu(acfreq_data, policy->cpu);
-
- dprintk("acpi_cpufreq_cpu_exit\n");
-
- if (data) {
- cpufreq_frequency_table_put_attr(policy->cpu);
- per_cpu(acfreq_data, policy->cpu) = NULL;
- acpi_processor_unregister_performance(data->acpi_data,
- policy->cpu);
- kfree(data);
- }
-
- return 0;
-}
-
-static int acpi_cpufreq_resume(struct cpufreq_policy *policy)
-{
- struct acpi_cpufreq_data *data = per_cpu(acfreq_data, policy->cpu);
-
- dprintk("acpi_cpufreq_resume\n");
-
- data->resume = 1;
-
- return 0;
-}
-
-static struct freq_attr *acpi_cpufreq_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- NULL,
-};
-
-static struct cpufreq_driver acpi_cpufreq_driver = {
- .verify = acpi_cpufreq_verify,
- .target = acpi_cpufreq_target,
- .bios_limit = acpi_processor_get_bios_limit,
- .init = acpi_cpufreq_cpu_init,
- .exit = acpi_cpufreq_cpu_exit,
- .resume = acpi_cpufreq_resume,
- .name = "acpi-cpufreq",
- .owner = THIS_MODULE,
- .attr = acpi_cpufreq_attr,
-};
-
-static int __init acpi_cpufreq_init(void)
-{
- int ret;
-
- if (acpi_disabled)
- return 0;
-
- dprintk("acpi_cpufreq_init\n");
-
- ret = acpi_cpufreq_early_init();
- if (ret)
- return ret;
-
- ret = cpufreq_register_driver(&acpi_cpufreq_driver);
- if (ret)
- free_acpi_perf_data();
-
- return ret;
-}
-
-static void __exit acpi_cpufreq_exit(void)
-{
- dprintk("acpi_cpufreq_exit\n");
-
- cpufreq_unregister_driver(&acpi_cpufreq_driver);
-
- free_percpu(acpi_perf_data);
-}
-
-module_param(acpi_pstate_strict, uint, 0644);
-MODULE_PARM_DESC(acpi_pstate_strict,
- "value 0 or non-zero. non-zero -> strict ACPI checks are "
- "performed during frequency changes.");
-
-late_initcall(acpi_cpufreq_init);
-module_exit(acpi_cpufreq_exit);
-
-MODULE_ALIAS("acpi");
diff --git a/arch/x86/kernel/cpu/cpufreq/cpufreq-nforce2.c b/arch/x86/kernel/cpu/cpufreq/cpufreq-nforce2.c
deleted file mode 100644
index 733093d60436..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/cpufreq-nforce2.c
+++ /dev/null
@@ -1,446 +0,0 @@
-/*
- * (C) 2004-2006 Sebastian Witt <se.witt@gmx.net>
- *
- * Licensed under the terms of the GNU GPL License version 2.
- * Based upon reverse engineered information
- *
- * BIG FAT DISCLAIMER: Work in progress code. Possibly *dangerous*
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/moduleparam.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-#include <linux/pci.h>
-#include <linux/delay.h>
-
-#define NFORCE2_XTAL 25
-#define NFORCE2_BOOTFSB 0x48
-#define NFORCE2_PLLENABLE 0xa8
-#define NFORCE2_PLLREG 0xa4
-#define NFORCE2_PLLADR 0xa0
-#define NFORCE2_PLL(mul, div) (0x100000 | (mul << 8) | div)
-
-#define NFORCE2_MIN_FSB 50
-#define NFORCE2_SAFE_DISTANCE 50
-
-/* Delay in ms between FSB changes */
-/* #define NFORCE2_DELAY 10 */
-
-/*
- * nforce2_chipset:
- * FSB is changed using the chipset
- */
-static struct pci_dev *nforce2_dev;
-
-/* fid:
- * multiplier * 10
- */
-static int fid;
-
-/* min_fsb, max_fsb:
- * minimum and maximum FSB (= FSB at boot time)
- */
-static int min_fsb;
-static int max_fsb;
-
-MODULE_AUTHOR("Sebastian Witt <se.witt@gmx.net>");
-MODULE_DESCRIPTION("nForce2 FSB changing cpufreq driver");
-MODULE_LICENSE("GPL");
-
-module_param(fid, int, 0444);
-module_param(min_fsb, int, 0444);
-
-MODULE_PARM_DESC(fid, "CPU multiplier to use (11.5 = 115)");
-MODULE_PARM_DESC(min_fsb,
- "Minimum FSB to use, if not defined: current FSB - 50");
-
-#define PFX "cpufreq-nforce2: "
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "cpufreq-nforce2", msg)
-
-/**
- * nforce2_calc_fsb - calculate FSB
- * @pll: PLL value
- *
- * Calculates FSB from PLL value
- */
-static int nforce2_calc_fsb(int pll)
-{
- unsigned char mul, div;
-
- mul = (pll >> 8) & 0xff;
- div = pll & 0xff;
-
- if (div > 0)
- return NFORCE2_XTAL * mul / div;
-
- return 0;
-}
-
-/**
- * nforce2_calc_pll - calculate PLL value
- * @fsb: FSB
- *
- * Calculate PLL value for given FSB
- */
-static int nforce2_calc_pll(unsigned int fsb)
-{
- unsigned char xmul, xdiv;
- unsigned char mul = 0, div = 0;
- int tried = 0;
-
- /* Try to calculate multiplier and divider up to 4 times */
- while (((mul == 0) || (div == 0)) && (tried <= 3)) {
- for (xdiv = 2; xdiv <= 0x80; xdiv++)
- for (xmul = 1; xmul <= 0xfe; xmul++)
- if (nforce2_calc_fsb(NFORCE2_PLL(xmul, xdiv)) ==
- fsb + tried) {
- mul = xmul;
- div = xdiv;
- }
- tried++;
- }
-
- if ((mul == 0) || (div == 0))
- return -1;
-
- return NFORCE2_PLL(mul, div);
-}
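The arithmetic above is self-contained: the PLL word packs an 8-bit multiplier and divider around the 25 MHz crystal, so FSB = 25 * mul / div, and nforce2_calc_pll() simply brute-forces every (mul, div) pair, retrying at fsb+1..fsb+3 MHz when no exact pair exists. A minimal user-space sketch of the same round trip, with purely illustrative values:

#include <stdio.h>

#define XTAL		25
#define PLL(mul, div)	(0x100000 | ((mul) << 8) | (div))

/* Mirror of nforce2_calc_fsb(): FSB = 25 MHz * mul / div */
static int calc_fsb(int pll)
{
	unsigned char mul = (pll >> 8) & 0xff;
	unsigned char div = pll & 0xff;

	return div ? XTAL * mul / div : 0;
}

int main(void)
{
	int pll = PLL(40, 5);	/* 25 * 40 / 5 = 200 MHz FSB */

	printf("pll=0x%06x -> fsb=%d MHz\n", pll, calc_fsb(pll));
	return 0;
}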
-
-/**
- * nforce2_write_pll - write PLL value to chipset
- * @pll: PLL value
- *
- * Writes new FSB PLL value to chipset
- */
-static void nforce2_write_pll(int pll)
-{
- int temp;
-
- /* Set the pll addr. to 0x00 */
- pci_write_config_dword(nforce2_dev, NFORCE2_PLLADR, 0);
-
- /* Now write the value in all 64 registers */
- for (temp = 0; temp <= 0x3f; temp++)
- pci_write_config_dword(nforce2_dev, NFORCE2_PLLREG, pll);
-
- return;
-}
-
-/**
- * nforce2_fsb_read - Read FSB
- *
- * Read FSB from chipset
- * If bootfsb != 0, return FSB at boot-time
- */
-static unsigned int nforce2_fsb_read(int bootfsb)
-{
- struct pci_dev *nforce2_sub5;
- u32 fsb, temp = 0;
-
- /* Get chipset boot FSB from subdevice 5 (FSB at boot-time) */
- nforce2_sub5 = pci_get_subsys(PCI_VENDOR_ID_NVIDIA, 0x01EF,
- PCI_ANY_ID, PCI_ANY_ID, NULL);
- if (!nforce2_sub5)
- return 0;
-
- pci_read_config_dword(nforce2_sub5, NFORCE2_BOOTFSB, &fsb);
- fsb /= 1000000;
-
- /* Check if PLL register is already set */
- pci_read_config_byte(nforce2_dev, NFORCE2_PLLENABLE, (u8 *)&temp);
-
- if (bootfsb || !temp)
- return fsb;
-
- /* Use PLL register FSB value */
- pci_read_config_dword(nforce2_dev, NFORCE2_PLLREG, &temp);
- fsb = nforce2_calc_fsb(temp);
-
- return fsb;
-}
-
-/**
- * nforce2_set_fsb - set new FSB
- * @fsb: New FSB
- *
- * Sets new FSB
- */
-static int nforce2_set_fsb(unsigned int fsb)
-{
- u32 temp = 0;
- unsigned int tfsb;
- int diff;
- int pll = 0;
-
- if ((fsb > max_fsb) || (fsb < NFORCE2_MIN_FSB)) {
- printk(KERN_ERR PFX "FSB %d is out of range!\n", fsb);
- return -EINVAL;
- }
-
- tfsb = nforce2_fsb_read(0);
- if (!tfsb) {
- printk(KERN_ERR PFX "Error while reading the FSB\n");
- return -EINVAL;
- }
-
- /* First write? Then set actual value */
- pci_read_config_byte(nforce2_dev, NFORCE2_PLLENABLE, (u8 *)&temp);
- if (!temp) {
- pll = nforce2_calc_pll(tfsb);
-
- if (pll < 0)
- return -EINVAL;
-
- nforce2_write_pll(pll);
- }
-
- /* Enable write access */
- temp = 0x01;
- pci_write_config_byte(nforce2_dev, NFORCE2_PLLENABLE, (u8)temp);
-
- diff = tfsb - fsb;
-
- if (!diff)
- return 0;
-
- while ((tfsb != fsb) && (tfsb <= max_fsb) && (tfsb >= min_fsb)) {
- if (diff < 0)
- tfsb++;
- else
- tfsb--;
-
- /* Calculate the PLL reg. value */
- pll = nforce2_calc_pll(tfsb);
- if (pll == -1)
- return -EINVAL;
-
- nforce2_write_pll(pll);
-#ifdef NFORCE2_DELAY
- mdelay(NFORCE2_DELAY);
-#endif
- }
-
- temp = 0x40;
- pci_write_config_byte(nforce2_dev, NFORCE2_PLLADR, (u8)temp);
-
- return 0;
-}
-
-/**
- * nforce2_get - get the CPU frequency
- * @cpu: CPU number
- *
- * Returns the CPU frequency
- */
-static unsigned int nforce2_get(unsigned int cpu)
-{
- if (cpu)
- return 0;
- return nforce2_fsb_read(0) * fid * 100;
-}
-
-/**
- * nforce2_target - set a new CPUFreq policy
- * @policy: new policy
- * @target_freq: the target frequency
- * @relation: how that frequency relates to achieved frequency
- * (CPUFREQ_RELATION_L or CPUFREQ_RELATION_H)
- *
- * Sets a new CPUFreq policy.
- */
-static int nforce2_target(struct cpufreq_policy *policy,
- unsigned int target_freq, unsigned int relation)
-{
-/* unsigned long flags; */
- struct cpufreq_freqs freqs;
- unsigned int target_fsb;
-
- if ((target_freq > policy->max) || (target_freq < policy->min))
- return -EINVAL;
-
- target_fsb = target_freq / (fid * 100);
-
- freqs.old = nforce2_get(policy->cpu);
- freqs.new = target_fsb * fid * 100;
- freqs.cpu = 0; /* Only one CPU on nForce2 platforms */
-
- if (freqs.old == freqs.new)
- return 0;
-
- dprintk("Old CPU frequency %d kHz, new %d kHz\n",
- freqs.old, freqs.new);
-
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
-
- /* Disable IRQs */
- /* local_irq_save(flags); */
-
- if (nforce2_set_fsb(target_fsb) < 0)
- printk(KERN_ERR PFX "Changing FSB to %d failed\n",
- target_fsb);
- else
- dprintk("Changed FSB successfully to %d\n",
- target_fsb);
-
- /* Enable IRQs */
- /* local_irq_restore(flags); */
-
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
-
- return 0;
-}
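Worked example of the fid scaling above, with hypothetical figures: fid = 115 encodes an 11.5x multiplier, so on a 133 MHz FSB the reported clock is 133 * 115 * 100 = 1,529,500 kHz (about 1.53 GHz), and a request for 1,400,000 kHz maps back to target_fsb = 1400000 / (115 * 100) = 121 MHz by integer division.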
-
-/**
- * nforce2_verify - verifies a new CPUFreq policy
- * @policy: new policy
- */
-static int nforce2_verify(struct cpufreq_policy *policy)
-{
- unsigned int fsb_pol_max;
-
- fsb_pol_max = policy->max / (fid * 100);
-
- if (policy->min < (fsb_pol_max * fid * 100))
- policy->max = (fsb_pol_max + 1) * fid * 100;
-
- cpufreq_verify_within_limits(policy,
- policy->cpuinfo.min_freq,
- policy->cpuinfo.max_freq);
- return 0;
-}
-
-static int nforce2_cpu_init(struct cpufreq_policy *policy)
-{
- unsigned int fsb;
- unsigned int rfid;
-
- /* capability check */
- if (policy->cpu != 0)
- return -ENODEV;
-
- /* Get current FSB */
- fsb = nforce2_fsb_read(0);
-
- if (!fsb)
- return -EIO;
-
- /* FIX: Get FID from CPU */
- if (!fid) {
- if (!cpu_khz) {
- printk(KERN_WARNING PFX
- "cpu_khz not set, can't calculate multiplier!\n");
- return -ENODEV;
- }
-
- fid = cpu_khz / (fsb * 100);
- rfid = fid % 5;
-
- if (rfid) {
- if (rfid > 2)
- fid += 5 - rfid;
- else
- fid -= rfid;
- }
- }
-
- printk(KERN_INFO PFX "FSB currently at %i MHz, FID %d.%d\n", fsb,
- fid / 10, fid % 10);
-
- /* Set maximum FSB to FSB at boot time */
- max_fsb = nforce2_fsb_read(1);
-
- if (!max_fsb)
- return -EIO;
-
- if (!min_fsb)
- min_fsb = max_fsb - NFORCE2_SAFE_DISTANCE;
-
- if (min_fsb < NFORCE2_MIN_FSB)
- min_fsb = NFORCE2_MIN_FSB;
-
- /* cpuinfo and default policy values */
- policy->cpuinfo.min_freq = min_fsb * fid * 100;
- policy->cpuinfo.max_freq = max_fsb * fid * 100;
- policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL;
- policy->cur = nforce2_get(policy->cpu);
- policy->min = policy->cpuinfo.min_freq;
- policy->max = policy->cpuinfo.max_freq;
-
- return 0;
-}
-
-static int nforce2_cpu_exit(struct cpufreq_policy *policy)
-{
- return 0;
-}
-
-static struct cpufreq_driver nforce2_driver = {
- .name = "nforce2",
- .verify = nforce2_verify,
- .target = nforce2_target,
- .get = nforce2_get,
- .init = nforce2_cpu_init,
- .exit = nforce2_cpu_exit,
- .owner = THIS_MODULE,
-};
-
-/**
- * nforce2_detect_chipset - detect the Southbridge which contains FSB PLL logic
- *
- * Detects nForce2 A2 and C1 stepping
- *
- */
-static int nforce2_detect_chipset(void)
-{
- nforce2_dev = pci_get_subsys(PCI_VENDOR_ID_NVIDIA,
- PCI_DEVICE_ID_NVIDIA_NFORCE2,
- PCI_ANY_ID, PCI_ANY_ID, NULL);
-
- if (nforce2_dev == NULL)
- return -ENODEV;
-
- printk(KERN_INFO PFX "Detected nForce2 chipset revision %X\n",
- nforce2_dev->revision);
- printk(KERN_INFO PFX
- "FSB changing is maybe unstable and can lead to "
- "crashes and data loss.\n");
-
- return 0;
-}
-
-/**
- * nforce2_init - initializes the nForce2 CPUFreq driver
- *
- * Initializes the nForce2 FSB support. Returns -ENODEV on unsupported
- * devices, -EINVAL on problems during initialization, and zero on
- * success.
- */
-static int __init nforce2_init(void)
-{
- /* TODO: do we need to detect the processor? */
-
- /* detect chipset */
- if (nforce2_detect_chipset()) {
- printk(KERN_INFO PFX "No nForce2 chipset.\n");
- return -ENODEV;
- }
-
- return cpufreq_register_driver(&nforce2_driver);
-}
-
-/**
- * nforce2_exit - unregisters cpufreq module
- *
- * Unregisters nForce2 FSB change support.
- */
-static void __exit nforce2_exit(void)
-{
- cpufreq_unregister_driver(&nforce2_driver);
-}
-
-module_init(nforce2_init);
-module_exit(nforce2_exit);
-
diff --git a/arch/x86/kernel/cpu/cpufreq/e_powersaver.c b/arch/x86/kernel/cpu/cpufreq/e_powersaver.c
deleted file mode 100644
index 35a257dd4bb7..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/e_powersaver.c
+++ /dev/null
@@ -1,367 +0,0 @@
-/*
- * Based on documentation provided by Dave Jones. Thanks!
- *
- * Licensed under the terms of the GNU GPL License version 2.
- *
- * BIG FAT DISCLAIMER: Work in progress code. Possibly *dangerous*
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-#include <linux/ioport.h>
-#include <linux/slab.h>
-#include <linux/timex.h>
-#include <linux/io.h>
-#include <linux/delay.h>
-
-#include <asm/msr.h>
-#include <asm/tsc.h>
-
-#define EPS_BRAND_C7M 0
-#define EPS_BRAND_C7 1
-#define EPS_BRAND_EDEN 2
-#define EPS_BRAND_C3 3
-#define EPS_BRAND_C7D 4
-
-struct eps_cpu_data {
- u32 fsb;
- struct cpufreq_frequency_table freq_table[];
-};
-
-static struct eps_cpu_data *eps_cpu[NR_CPUS];
-
-
-static unsigned int eps_get(unsigned int cpu)
-{
- struct eps_cpu_data *centaur;
- u32 lo, hi;
-
- if (cpu)
- return 0;
- centaur = eps_cpu[cpu];
- if (centaur == NULL)
- return 0;
-
- /* Return current frequency */
- rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
- return centaur->fsb * ((lo >> 8) & 0xff);
-}
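eps_get() depends on the PERF_STATUS encoding used throughout this driver: the low byte of the low word is the VID (voltage in mV = VID * 16 + 700) and bits 8..15 hold the current multiplier. A small user-space sketch of that decode, with a made-up register value and FSB:

#include <stdio.h>

int main(void)
{
	unsigned int lo = 0x0f08;	/* made up: multiplier 15, VID 8 */
	unsigned int mult = (lo >> 8) & 0xff;
	unsigned int mv = (lo & 0xff) * 16 + 700;
	unsigned int fsb = 100000;	/* kHz, as stored in eps_cpu_data */

	printf("freq=%u kHz, voltage=%u mV\n", fsb * mult, mv);
	return 0;
}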
-
-static int eps_set_state(struct eps_cpu_data *centaur,
- unsigned int cpu,
- u32 dest_state)
-{
- struct cpufreq_freqs freqs;
- u32 lo, hi;
- int err = 0;
- int i;
-
- freqs.old = eps_get(cpu);
- freqs.new = centaur->fsb * ((dest_state >> 8) & 0xff);
- freqs.cpu = cpu;
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
-
- /* Wait while CPU is busy */
- rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
- i = 0;
- while (lo & ((1 << 16) | (1 << 17))) {
- udelay(16);
- rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
- i++;
- if (unlikely(i > 64)) {
- err = -ENODEV;
- goto postchange;
- }
- }
- /* Set new multiplier and voltage */
- wrmsr(MSR_IA32_PERF_CTL, dest_state & 0xffff, 0);
- /* Wait until transition end */
- i = 0;
- do {
- udelay(16);
- rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
- i++;
- if (unlikely(i > 64)) {
- err = -ENODEV;
- goto postchange;
- }
- } while (lo & ((1 << 16) | (1 << 17)));
-
- /* Return current frequency */
-postchange:
- rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
- freqs.new = centaur->fsb * ((lo >> 8) & 0xff);
-
-#ifdef DEBUG
- {
- u8 current_multiplier, current_voltage;
-
- /* Print voltage and multiplier */
- rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
- current_voltage = lo & 0xff;
- printk(KERN_INFO "eps: Current voltage = %dmV\n",
- current_voltage * 16 + 700);
- current_multiplier = (lo >> 8) & 0xff;
- printk(KERN_INFO "eps: Current multiplier = %d\n",
- current_multiplier);
- }
-#endif
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
- return err;
-}
-
-static int eps_target(struct cpufreq_policy *policy,
- unsigned int target_freq,
- unsigned int relation)
-{
- struct eps_cpu_data *centaur;
- unsigned int newstate = 0;
- unsigned int cpu = policy->cpu;
- unsigned int dest_state;
- int ret;
-
- if (unlikely(eps_cpu[cpu] == NULL))
- return -ENODEV;
- centaur = eps_cpu[cpu];
-
- if (unlikely(cpufreq_frequency_table_target(policy,
- &eps_cpu[cpu]->freq_table[0],
- target_freq,
- relation,
- &newstate))) {
- return -EINVAL;
- }
-
- /* Make frequency transition */
- dest_state = centaur->freq_table[newstate].index & 0xffff;
- ret = eps_set_state(centaur, cpu, dest_state);
- if (ret)
- printk(KERN_ERR "eps: Timeout!\n");
- return ret;
-}
-
-static int eps_verify(struct cpufreq_policy *policy)
-{
- return cpufreq_frequency_table_verify(policy,
- &eps_cpu[policy->cpu]->freq_table[0]);
-}
-
-static int eps_cpu_init(struct cpufreq_policy *policy)
-{
- unsigned int i;
- u32 lo, hi;
- u64 val;
- u8 current_multiplier, current_voltage;
- u8 max_multiplier, max_voltage;
- u8 min_multiplier, min_voltage;
- u8 brand = 0;
- u32 fsb;
- struct eps_cpu_data *centaur;
- struct cpuinfo_x86 *c = &cpu_data(0);
- struct cpufreq_frequency_table *f_table;
- int k, step, voltage;
- int ret;
- int states;
-
- if (policy->cpu != 0)
- return -ENODEV;
-
- /* Check brand */
- printk(KERN_INFO "eps: Detected VIA ");
-
- switch (c->x86_model) {
- case 10:
- rdmsr(0x1153, lo, hi);
- brand = (((lo >> 2) ^ lo) >> 18) & 3;
- printk(KERN_CONT "Model A ");
- break;
- case 13:
- rdmsr(0x1154, lo, hi);
- brand = (((lo >> 4) ^ (lo >> 2))) & 0x000000ff;
- printk(KERN_CONT "Model D ");
- break;
- }
-
- switch (brand) {
- case EPS_BRAND_C7M:
- printk(KERN_CONT "C7-M\n");
- break;
- case EPS_BRAND_C7:
- printk(KERN_CONT "C7\n");
- break;
- case EPS_BRAND_EDEN:
- printk(KERN_CONT "Eden\n");
- break;
- case EPS_BRAND_C7D:
- printk(KERN_CONT "C7-D\n");
- break;
- case EPS_BRAND_C3:
- printk(KERN_CONT "C3\n");
- return -ENODEV;
- }
- /* Enable Enhanced PowerSaver */
- rdmsrl(MSR_IA32_MISC_ENABLE, val);
- if (!(val & MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP)) {
- val |= MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP;
- wrmsrl(MSR_IA32_MISC_ENABLE, val);
- /* Can be locked at 0 */
- rdmsrl(MSR_IA32_MISC_ENABLE, val);
- if (!(val & MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP)) {
- printk(KERN_INFO "eps: Can't enable Enhanced PowerSaver\n");
- return -ENODEV;
- }
- }
-
- /* Print voltage and multiplier */
- rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
- current_voltage = lo & 0xff;
- printk(KERN_INFO "eps: Current voltage = %dmV\n",
- current_voltage * 16 + 700);
- current_multiplier = (lo >> 8) & 0xff;
- printk(KERN_INFO "eps: Current multiplier = %d\n", current_multiplier);
-
- /* Print limits */
- max_voltage = hi & 0xff;
- printk(KERN_INFO "eps: Highest voltage = %dmV\n",
- max_voltage * 16 + 700);
- max_multiplier = (hi >> 8) & 0xff;
- printk(KERN_INFO "eps: Highest multiplier = %d\n", max_multiplier);
- min_voltage = (hi >> 16) & 0xff;
- printk(KERN_INFO "eps: Lowest voltage = %dmV\n",
- min_voltage * 16 + 700);
- min_multiplier = (hi >> 24) & 0xff;
- printk(KERN_INFO "eps: Lowest multiplier = %d\n", min_multiplier);
-
- /* Sanity checks */
- if (current_multiplier == 0 || max_multiplier == 0
- || min_multiplier == 0)
- return -EINVAL;
- if (current_multiplier > max_multiplier
- || max_multiplier <= min_multiplier)
- return -EINVAL;
- if (current_voltage > 0x1f || max_voltage > 0x1f)
- return -EINVAL;
- if (max_voltage < min_voltage)
- return -EINVAL;
-
- /* Calc FSB speed */
- fsb = cpu_khz / current_multiplier;
- /* Calc number of p-states supported */
- if (brand == EPS_BRAND_C7M)
- states = max_multiplier - min_multiplier + 1;
- else
- states = 2;
-
- /* Allocate private data and frequency table for current cpu */
- centaur = kzalloc(sizeof(struct eps_cpu_data)
- + (states + 1) * sizeof(struct cpufreq_frequency_table),
- GFP_KERNEL);
- if (!centaur)
- return -ENOMEM;
- eps_cpu[0] = centaur;
-
- /* Copy basic values */
- centaur->fsb = fsb;
-
- /* Fill frequency and MSR value table */
- f_table = &centaur->freq_table[0];
- if (brand != EPS_BRAND_C7M) {
- f_table[0].frequency = fsb * min_multiplier;
- f_table[0].index = (min_multiplier << 8) | min_voltage;
- f_table[1].frequency = fsb * max_multiplier;
- f_table[1].index = (max_multiplier << 8) | max_voltage;
- f_table[2].frequency = CPUFREQ_TABLE_END;
- } else {
- k = 0;
- step = ((max_voltage - min_voltage) * 256)
- / (max_multiplier - min_multiplier);
- for (i = min_multiplier; i <= max_multiplier; i++) {
- voltage = (k * step) / 256 + min_voltage;
- f_table[k].frequency = fsb * i;
- f_table[k].index = (i << 8) | voltage;
- k++;
- }
- f_table[k].frequency = CPUFREQ_TABLE_END;
- }
-
- policy->cpuinfo.transition_latency = 140000; /* 844mV -> 700mV transition, in ns */
- policy->cur = fsb * current_multiplier;
-
- ret = cpufreq_frequency_table_cpuinfo(policy, &centaur->freq_table[0]);
- if (ret) {
- kfree(centaur);
- return ret;
- }
-
- cpufreq_frequency_table_get_attr(&centaur->freq_table[0], policy->cpu);
- return 0;
-}
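On the C7-M branch above, the VID for each intermediate multiplier is interpolated linearly in fixed point: step = ((max_voltage - min_voltage) * 256) / (max_multiplier - min_multiplier). A self-contained sketch with invented limits shows the resulting ramp, including the integer truncation at the top end that the driver shares:

#include <stdio.h>

int main(void)
{
	int min_mult = 6, max_mult = 16, min_v = 4, max_v = 20;	/* invented */
	int step = ((max_v - min_v) * 256) / (max_mult - min_mult);
	int i, k = 0;

	for (i = min_mult; i <= max_mult; i++, k++)
		printf("mult %2d -> vid %d\n", i, (k * step) / 256 + min_v);
	return 0;
}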
-
-static int eps_cpu_exit(struct cpufreq_policy *policy)
-{
- unsigned int cpu = policy->cpu;
- struct eps_cpu_data *centaur;
- u32 lo, hi;
-
- if (eps_cpu[cpu] == NULL)
- return -ENODEV;
- centaur = eps_cpu[cpu];
-
- /* Get max frequency */
- rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
- /* Set max frequency */
- eps_set_state(centaur, cpu, hi & 0xffff);
- /* Bye */
- cpufreq_frequency_table_put_attr(policy->cpu);
- kfree(eps_cpu[cpu]);
- eps_cpu[cpu] = NULL;
- return 0;
-}
-
-static struct freq_attr *eps_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- NULL,
-};
-
-static struct cpufreq_driver eps_driver = {
- .verify = eps_verify,
- .target = eps_target,
- .init = eps_cpu_init,
- .exit = eps_cpu_exit,
- .get = eps_get,
- .name = "e_powersaver",
- .owner = THIS_MODULE,
- .attr = eps_attr,
-};
-
-static int __init eps_init(void)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
-
- /* This driver will work only on Centaur C7 processors with
- * Enhanced SpeedStep/PowerSaver registers */
- if (c->x86_vendor != X86_VENDOR_CENTAUR
- || c->x86 != 6 || c->x86_model < 10)
- return -ENODEV;
- if (!cpu_has(c, X86_FEATURE_EST))
- return -ENODEV;
-
- if (cpufreq_register_driver(&eps_driver))
- return -EINVAL;
- return 0;
-}
-
-static void __exit eps_exit(void)
-{
- cpufreq_unregister_driver(&eps_driver);
-}
-
-MODULE_AUTHOR("Rafal Bilski <rafalbilski@interia.pl>");
-MODULE_DESCRIPTION("Enhanced PowerSaver driver for VIA C7 CPU's.");
-MODULE_LICENSE("GPL");
-
-module_init(eps_init);
-module_exit(eps_exit);
diff --git a/arch/x86/kernel/cpu/cpufreq/elanfreq.c b/arch/x86/kernel/cpu/cpufreq/elanfreq.c
deleted file mode 100644
index c587db472a75..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/elanfreq.c
+++ /dev/null
@@ -1,309 +0,0 @@
-/*
- * elanfreq: cpufreq driver for the AMD ELAN family
- *
- * (c) Copyright 2002 Robert Schwebel <r.schwebel@pengutronix.de>
- *
- * Parts of this code are (c) Sven Geggus <sven@geggus.net>
- *
- * All Rights Reserved.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- *
- * 2002-02-13: - initial revision for 2.4.18-pre9 by Robert Schwebel
- *
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/init.h>
-
-#include <linux/delay.h>
-#include <linux/cpufreq.h>
-
-#include <asm/msr.h>
-#include <linux/timex.h>
-#include <linux/io.h>
-
-#define REG_CSCIR 0x22 /* Chip Setup and Control Index Register */
-#define REG_CSCDR 0x23 /* Chip Setup and Control Data Register */
-
-/* Module parameter */
-static int max_freq;
-
-struct s_elan_multiplier {
- int clock; /* frequency in kHz */
- int val40h; /* PMU Force Mode register */
- int val80h; /* CPU Clock Speed Register */
-};
-
-/*
- * It is important that the frequencies
- * are listed in ascending order here!
- */
-static struct s_elan_multiplier elan_multiplier[] = {
- {1000, 0x02, 0x18},
- {2000, 0x02, 0x10},
- {4000, 0x02, 0x08},
- {8000, 0x00, 0x00},
- {16000, 0x00, 0x02},
- {33000, 0x00, 0x04},
- {66000, 0x01, 0x04},
- {99000, 0x01, 0x05}
-};
-
-static struct cpufreq_frequency_table elanfreq_table[] = {
- {0, 1000},
- {1, 2000},
- {2, 4000},
- {3, 8000},
- {4, 16000},
- {5, 33000},
- {6, 66000},
- {7, 99000},
- {0, CPUFREQ_TABLE_END},
-};
-
-
-/**
- * elanfreq_get_cpu_frequency: determine current cpu speed
- *
- * Finds out at which frequency the CPU of the Elan SOC runs
- * at the moment. Frequencies from 1 to 33 MHz are generated
- * the normal way; 66 and 99 MHz are called "Hyperspeed Mode"
- * and have the rest of the chip running at 33 MHz.
- */
-
-static unsigned int elanfreq_get_cpu_frequency(unsigned int cpu)
-{
- u8 clockspeed_reg; /* Clock Speed Register */
-
- local_irq_disable();
- outb_p(0x80, REG_CSCIR);
- clockspeed_reg = inb_p(REG_CSCDR);
- local_irq_enable();
-
- if ((clockspeed_reg & 0xE0) == 0xE0)
- return 0;
-
- /* Are we in CPU clock multiplied mode (66/99 MHz)? */
- if ((clockspeed_reg & 0xE0) == 0xC0) {
- if ((clockspeed_reg & 0x01) == 0)
- return 66000;
- else
- return 99000;
- }
-
- /* 33 MHz is not 32 MHz... */
- if ((clockspeed_reg & 0xE0) == 0xA0)
- return 33000;
-
- return (1<<((clockspeed_reg & 0xE0) >> 5)) * 1000;
-}
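The decode above keys entirely off bits 7..5 of the Clock Speed Register: 0xE0 means invalid, 0xC0 selects Hyperspeed Mode with bit 0 choosing 66 vs 99 MHz, 0xA0 is the 33 MHz special case, and everything else is a power of two in MHz. A user-space replica with invented sample values:

#include <stdio.h>

static unsigned int decode(unsigned char reg)
{
	if ((reg & 0xE0) == 0xE0)
		return 0;				/* invalid */
	if ((reg & 0xE0) == 0xC0)			/* hyperspeed */
		return (reg & 0x01) ? 99000 : 66000;
	if ((reg & 0xE0) == 0xA0)
		return 33000;				/* 33, not 32 */
	return (1 << ((reg & 0xE0) >> 5)) * 1000;	/* 1..16 MHz */
}

int main(void)
{
	unsigned char samples[] = { 0x00, 0x80, 0xA0, 0xC0, 0xC1 };
	int i;

	for (i = 0; i < 5; i++)
		printf("reg=0x%02x -> %u kHz\n", samples[i], decode(samples[i]));
	return 0;
}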
-
-
-/**
- * elanfreq_set_cpu_frequency: Change the CPU core frequency
- * @cpu: cpu number
- * @freq: frequency in kHz
- *
- * This function takes a frequency value and changes the CPU frequency
- * according to this. Note that the frequency has to be checked by
- * elanfreq_verify() for correctness!
- *
- * There is no return value.
- */
-
-static void elanfreq_set_cpu_state(unsigned int state)
-{
- struct cpufreq_freqs freqs;
-
- freqs.old = elanfreq_get_cpu_frequency(0);
- freqs.new = elan_multiplier[state].clock;
- freqs.cpu = 0; /* elanfreq is a UP-only driver */
-
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
-
- printk(KERN_INFO "elanfreq: attempting to set frequency to %i kHz\n",
- elan_multiplier[state].clock);
-
-
- /*
- * Access to the Elan's internal registers is indexed via
- * 0x22: Chip Setup & Control Register Index Register (CSCI)
- * 0x23: Chip Setup & Control Register Data Register (CSCD)
- *
- */
-
- /*
- * 0x40 is the Power Management Unit's Force Mode Register.
- * Bit 6 enables Hyperspeed Mode (66/100 MHz core frequency)
- */
-
- local_irq_disable();
- outb_p(0x40, REG_CSCIR); /* Disable hyperspeed mode */
- outb_p(0x00, REG_CSCDR);
- local_irq_enable(); /* wait till internal pipelines and */
- udelay(1000); /* buffers have cleaned up */
-
- local_irq_disable();
-
- /* now, set the CPU clock speed register (0x80) */
- outb_p(0x80, REG_CSCIR);
- outb_p(elan_multiplier[state].val80h, REG_CSCDR);
-
- /* now, the hyperspeed bit in PMU Force Mode Register (0x40) */
- outb_p(0x40, REG_CSCIR);
- outb_p(elan_multiplier[state].val40h, REG_CSCDR);
- udelay(10000);
- local_irq_enable();
-
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
-};
-
-
-/**
- * elanfreq_verify: test if frequency range is valid
- * @policy: the policy to validate
- *
- * This function checks if a given frequency range in kHz is valid
- * for the hardware supported by the driver.
- */
-
-static int elanfreq_verify(struct cpufreq_policy *policy)
-{
- return cpufreq_frequency_table_verify(policy, &elanfreq_table[0]);
-}
-
-static int elanfreq_target(struct cpufreq_policy *policy,
- unsigned int target_freq,
- unsigned int relation)
-{
- unsigned int newstate = 0;
-
- if (cpufreq_frequency_table_target(policy, &elanfreq_table[0],
- target_freq, relation, &newstate))
- return -EINVAL;
-
- elanfreq_set_cpu_state(newstate);
-
- return 0;
-}
-
-
-/*
- * Module init and exit code
- */
-
-static int elanfreq_cpu_init(struct cpufreq_policy *policy)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
- unsigned int i;
- int result;
-
- /* capability check */
- if ((c->x86_vendor != X86_VENDOR_AMD) ||
- (c->x86 != 4) || (c->x86_model != 10))
- return -ENODEV;
-
- /* max freq */
- if (!max_freq)
- max_freq = elanfreq_get_cpu_frequency(0);
-
- /* table init */
- for (i = 0; (elanfreq_table[i].frequency != CPUFREQ_TABLE_END); i++) {
- if (elanfreq_table[i].frequency > max_freq)
- elanfreq_table[i].frequency = CPUFREQ_ENTRY_INVALID;
- }
-
- /* cpuinfo and default policy values */
- policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL;
- policy->cur = elanfreq_get_cpu_frequency(0);
-
- result = cpufreq_frequency_table_cpuinfo(policy, elanfreq_table);
- if (result)
- return result;
-
- cpufreq_frequency_table_get_attr(elanfreq_table, policy->cpu);
- return 0;
-}
-
-
-static int elanfreq_cpu_exit(struct cpufreq_policy *policy)
-{
- cpufreq_frequency_table_put_attr(policy->cpu);
- return 0;
-}
-
-
-#ifndef MODULE
-/**
- * elanfreq_setup - elanfreq command line parameter parsing
- *
- * elanfreq command line parameter. Use:
- * elanfreq=66000
- * to set the maximum CPU frequency to 66 MHz. Note that in
- * case you do not give this boot parameter, the maximum
- * frequency will fall back to _current_ CPU frequency which
- * might be lower. If you build this as a module, use the
- * max_freq module parameter instead.
- */
-static int __init elanfreq_setup(char *str)
-{
- max_freq = simple_strtoul(str, &str, 0);
- printk(KERN_WARNING "You're using the deprecated elanfreq command line option. Use elanfreq.max_freq instead, please!\n");
- return 1;
-}
-__setup("elanfreq=", elanfreq_setup);
-#endif
-
-
-static struct freq_attr *elanfreq_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- NULL,
-};
-
-
-static struct cpufreq_driver elanfreq_driver = {
- .get = elanfreq_get_cpu_frequency,
- .verify = elanfreq_verify,
- .target = elanfreq_target,
- .init = elanfreq_cpu_init,
- .exit = elanfreq_cpu_exit,
- .name = "elanfreq",
- .owner = THIS_MODULE,
- .attr = elanfreq_attr,
-};
-
-
-static int __init elanfreq_init(void)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
-
- /* Test if we have the right hardware */
- if ((c->x86_vendor != X86_VENDOR_AMD) ||
- (c->x86 != 4) || (c->x86_model != 10)) {
- printk(KERN_INFO "elanfreq: error: no Elan processor found!\n");
- return -ENODEV;
- }
- return cpufreq_register_driver(&elanfreq_driver);
-}
-
-
-static void __exit elanfreq_exit(void)
-{
- cpufreq_unregister_driver(&elanfreq_driver);
-}
-
-
-module_param(max_freq, int, 0444);
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Robert Schwebel <r.schwebel@pengutronix.de>, "
- "Sven Geggus <sven@geggus.net>");
-MODULE_DESCRIPTION("cpufreq driver for AMD's Elan CPUs");
-
-module_init(elanfreq_init);
-module_exit(elanfreq_exit);
diff --git a/arch/x86/kernel/cpu/cpufreq/gx-suspmod.c b/arch/x86/kernel/cpu/cpufreq/gx-suspmod.c
deleted file mode 100644
index 16e3483be9e3..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/gx-suspmod.c
+++ /dev/null
@@ -1,520 +0,0 @@
-/*
- * Cyrix MediaGX and NatSemi Geode Suspend Modulation
- * (C) 2002 Zwane Mwaikambo <zwane@commfireservices.com>
- * (C) 2002 Hiroshi Miura <miura@da-cha.org>
- * All Rights Reserved
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * version 2 as published by the Free Software Foundation
- *
- * The author(s) of this software shall not be held liable for damages
- * of any nature resulting due to the use of this software. This
- * software is provided AS-IS with no warranties.
- *
- * Theoretical note:
- *
- * (see Geode(tm) CS5530 manual (rev.4.1) page.56)
- *
- * CPU frequency control on NatSemi Geode GX1/GXLV processor and CS55x0
- * are based on Suspend Modulation.
- *
- * Suspend Modulation works by asserting and de-asserting the SUSP# pin
- * to CPU(GX1/GXLV) for configurable durations. When asserting SUSP#
- * the CPU enters an idle state. GX1 stops its core clock when SUSP# is
- * asserted then power consumption is reduced.
- *
- * Suspend Modulation's OFF/ON duration are configurable
- * with 'Suspend Modulation OFF Count Register'
- * and 'Suspend Modulation ON Count Register'.
- * These registers are 8bit counters that represent the number of
- * 32us intervals which the SUSP# pin is asserted(ON)/de-asserted(OFF)
- * to the processor.
- *
- * These counters define a ratio which is the effective frequency
- * of operation of the system.
- *
- * F_eff = Fgx * OFF Count / (OFF Count + ON Count)
- *
- * 0 <= On Count, Off Count <= 255
- *
- * From these limits, we can get register values
- *
- * off_duration + on_duration <= MAX_DURATION
- * on_duration = off_duration * (stock_freq - freq) / freq
- *
- * off_duration = (freq * DURATION) / stock_freq
- * on_duration = DURATION - off_duration
- *
- *
- *---------------------------------------------------------------------------
- *
- * ChangeLog:
- * Dec. 12, 2003 Hiroshi Miura <miura@da-cha.org>
- * - fix on/off register mistake
- * - fix cpu_khz calc when it stops cpu modulation.
- *
- * Dec. 11, 2002 Hiroshi Miura <miura@da-cha.org>
- * - rewrite for Cyrix MediaGX Cx5510/5520 and
- * NatSemi Geode Cs5530(A).
- *
- * Jul. ??, 2002 Zwane Mwaikambo <zwane@commfireservices.com>
- * - cs5530_mod patch for 2.4.19-rc1.
- *
- *---------------------------------------------------------------------------
- *
- * Todo
- * Test on machines with 5510, 5530, 5530A
- */
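A standalone sketch of the duty-cycle arithmetic described in the note above, assuming an illustrative 233 MHz stock clock, a 100 MHz request, and the default maximum duration of 255 slots of 32 us each:

#include <stdio.h>

int main(void)
{
	int stock_khz = 233000, target_khz = 100000, duration = 255;
	int off, on, eff;

	off = (target_khz * duration) / stock_khz;	/* SUSP# de-asserted slots */
	on = duration - off;				/* SUSP# asserted slots */
	eff = (stock_khz * off) / (off + on);		/* effective frequency */

	printf("off=%d on=%d -> F_eff=%d kHz\n", off, on, eff);
	return 0;
}

With these numbers off = 109, on = 146, and F_eff comes out near 99,600 kHz, slightly under the request because of integer truncation.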
-
-/************************************************************************
- * Suspend Modulation - Definitions *
- ************************************************************************/
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/smp.h>
-#include <linux/cpufreq.h>
-#include <linux/pci.h>
-#include <linux/errno.h>
-#include <linux/slab.h>
-
-#include <asm/processor-cyrix.h>
-
-/* PCI config registers, all at F0 */
-#define PCI_PMER1 0x80 /* power management enable register 1 */
-#define PCI_PMER2 0x81 /* power management enable register 2 */
-#define PCI_PMER3 0x82 /* power management enable register 3 */
-#define PCI_IRQTC 0x8c /* irq speedup timer counter register:typical 2 to 4ms */
-#define PCI_VIDTC 0x8d /* video speedup timer counter register: typical 50 to 100ms */
-#define PCI_MODOFF 0x94 /* suspend modulation OFF counter register, 1 = 32us */
-#define PCI_MODON 0x95 /* suspend modulation ON counter register */
-#define PCI_SUSCFG 0x96 /* suspend configuration register */
-
-/* PMER1 bits */
-#define GPM (1<<0) /* global power management */
-#define GIT (1<<1) /* globally enable PM device idle timers */
-#define GTR (1<<2) /* globally enable IO traps */
-#define IRQ_SPDUP (1<<3) /* disable clock throttle during interrupt handling */
-#define VID_SPDUP (1<<4) /* disable clock throttle during vga video handling */
-
-/* SUSCFG bits */
-#define SUSMOD (1<<0) /* enable/disable suspend modulation */
-/* the below is supported only with cs5530 (after rev.1.2)/cs5530A */
-#define SMISPDUP (1<<1) /* select how SMI re-enable suspend modulation: */
- /* IRQTC timer or read SMI speedup disable reg.(F1BAR[08-09h]) */
-#define SUSCFG (1<<2) /* enable powering down a GXLV processor. "Special 3Volt Suspend" mode */
-/* the below is supported only with cs5530A */
-#define PWRSVE_ISA (1<<3) /* stop ISA clock */
-#define PWRSVE (1<<4) /* active idle */
-
-struct gxfreq_params {
- u8 on_duration;
- u8 off_duration;
- u8 pci_suscfg;
- u8 pci_pmer1;
- u8 pci_pmer2;
- struct pci_dev *cs55x0;
-};
-
-static struct gxfreq_params *gx_params;
-static int stock_freq;
-
-/* PCI bus clock - defaults to 30.000 if cpu_khz is not available */
-static int pci_busclk;
-module_param(pci_busclk, int, 0444);
-
-/* maximum duration for which the cpu may be suspended
- * (32us * max_duration). If no parameter is given, this defaults
- * to 255.
- * Note that this leads to a maximum of 8 ms(!) where the CPU clock
- * is suspended -- processing power is just 0.39% of what it used to be,
- * though. 781.25 kHz(!) for a 200 MHz processor -- wow. */
-static int max_duration = 255;
-module_param(max_duration, int, 0444);
-
-/* For the default policy, we want at least some processing power
- * - let's say 5%. (min = maxfreq / POLICY_MIN_DIV)
- */
-#define POLICY_MIN_DIV 20
-
-
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "gx-suspmod", msg)
-
-/**
- * We can detect the core multiplier from dir0_lsb
- * from GX1 datasheet p.56,
- * MULT[3:0]:
- * 0000 = SYSCLK multiplied by 4 (test only)
- * 0001 = SYSCLK multiplied by 10
- * 0010 = SYSCLK multiplied by 4
- * 0011 = SYSCLK multiplied by 6
- * 0100 = SYSCLK multiplied by 9
- * 0101 = SYSCLK multiplied by 5
- * 0110 = SYSCLK multiplied by 7
- * 0111 = SYSCLK multiplied by 8
- * where SYSCLK = 33.3 MHz
- **/
-static int gx_freq_mult[16] = {
- 4, 10, 4, 6, 9, 5, 7, 8,
- 0, 0, 0, 0, 0, 0, 0, 0
-};
-
-
-/****************************************************************
- * Low Level chipset interface *
- ****************************************************************/
-static struct pci_device_id gx_chipset_tbl[] __initdata = {
- { PCI_VENDOR_ID_CYRIX, PCI_DEVICE_ID_CYRIX_5530_LEGACY,
- PCI_ANY_ID, PCI_ANY_ID },
- { PCI_VENDOR_ID_CYRIX, PCI_DEVICE_ID_CYRIX_5520,
- PCI_ANY_ID, PCI_ANY_ID },
- { PCI_VENDOR_ID_CYRIX, PCI_DEVICE_ID_CYRIX_5510,
- PCI_ANY_ID, PCI_ANY_ID },
- { 0, },
-};
-
-static void gx_write_byte(int reg, int value)
-{
- pci_write_config_byte(gx_params->cs55x0, reg, value);
-}
-
-/**
- * gx_detect_chipset:
- *
- **/
-static __init struct pci_dev *gx_detect_chipset(void)
-{
- struct pci_dev *gx_pci = NULL;
-
- /* check if CPU is a MediaGX or a Geode. */
- if ((boot_cpu_data.x86_vendor != X86_VENDOR_NSC) &&
- (boot_cpu_data.x86_vendor != X86_VENDOR_CYRIX)) {
- dprintk("error: no MediaGX/Geode processor found!\n");
- return NULL;
- }
-
- /* detect which companion chip is used */
- while ((gx_pci = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, gx_pci)) != NULL) {
- if ((pci_match_id(gx_chipset_tbl, gx_pci)) != NULL)
- return gx_pci;
- }
-
- dprintk("error: no supported chipset found!\n");
- return NULL;
-}
-
-/**
- * gx_get_cpuspeed:
- *
- * Finds out at which effective frequency the Cyrix MediaGX/NatSemi
- * Geode CPU runs.
- */
-static unsigned int gx_get_cpuspeed(unsigned int cpu)
-{
- if ((gx_params->pci_suscfg & SUSMOD) == 0)
- return stock_freq;
-
- return (stock_freq * gx_params->off_duration)
- / (gx_params->on_duration + gx_params->off_duration);
-}
-
-/**
- * gx_validate_speed:
- * find the closest supported speed and the matching on/off durations
- *
- **/
-
-static unsigned int gx_validate_speed(unsigned int khz, u8 *on_duration,
- u8 *off_duration)
-{
- unsigned int i;
- u8 tmp_on, tmp_off;
- int old_tmp_freq = stock_freq;
- int tmp_freq;
-
- *off_duration = 1;
- *on_duration = 0;
-
- for (i = max_duration; i > 0; i--) {
- tmp_off = ((khz * i) / stock_freq) & 0xff;
- tmp_on = i - tmp_off;
- tmp_freq = (stock_freq * tmp_off) / i;
- /* if this relation is closer to khz, use this. If it's equal,
- * prefer it, too - lower latency */
- if (abs(tmp_freq - khz) <= abs(old_tmp_freq - khz)) {
- *on_duration = tmp_on;
- *off_duration = tmp_off;
- old_tmp_freq = tmp_freq;
- }
- }
-
- return old_tmp_freq;
-}
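Worked example for the search above, with illustrative numbers: for stock_freq = 233,000 kHz, a request of 116,500 kHz and max_duration = 255, every even i yields tmp_off = i/2 and tmp_freq = 116,500 exactly; since ties prefer later (smaller i) candidates, the loop ends with on = off = 1, the shortest modulation period and hence the lowest-latency choice the comment refers to.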
-
-
-/**
- * gx_set_cpuspeed:
- * set cpu speed in khz.
- **/
-
-static void gx_set_cpuspeed(unsigned int khz)
-{
- u8 suscfg, pmer1;
- unsigned int new_khz;
- unsigned long flags;
- struct cpufreq_freqs freqs;
-
- freqs.cpu = 0;
- freqs.old = gx_get_cpuspeed(0);
-
- new_khz = gx_validate_speed(khz, &gx_params->on_duration,
- &gx_params->off_duration);
-
- freqs.new = new_khz;
-
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
- local_irq_save(flags);
-
-
-
- if (new_khz != stock_freq) {
- /* the new_khz == stock_freq (100% speed) case is handled below */
- switch (gx_params->cs55x0->device) {
- case PCI_DEVICE_ID_CYRIX_5530_LEGACY:
- pmer1 = gx_params->pci_pmer1 | IRQ_SPDUP | VID_SPDUP;
- /* FIXME: need to test other values -- Zwane,Miura */
- /* typical 2 to 4ms */
- gx_write_byte(PCI_IRQTC, 4);
- /* typical 50 to 100ms */
- gx_write_byte(PCI_VIDTC, 100);
- gx_write_byte(PCI_PMER1, pmer1);
-
- if (gx_params->cs55x0->revision < 0x10) {
- /* CS5530(rev 1.2, 1.3) */
- suscfg = gx_params->pci_suscfg|SUSMOD;
- } else {
- /* CS5530A,B.. */
- suscfg = gx_params->pci_suscfg|SUSMOD|PWRSVE;
- }
- break;
- case PCI_DEVICE_ID_CYRIX_5520:
- case PCI_DEVICE_ID_CYRIX_5510:
- suscfg = gx_params->pci_suscfg | SUSMOD;
- break;
- default:
- local_irq_restore(flags);
- dprintk("fatal: try to set unknown chipset.\n");
- return;
- }
- } else {
- suscfg = gx_params->pci_suscfg & ~(SUSMOD);
- gx_params->off_duration = 0;
- gx_params->on_duration = 0;
- dprintk("suspend modulation disabled: cpu runs 100%% speed.\n");
- }
-
- gx_write_byte(PCI_MODOFF, gx_params->off_duration);
- gx_write_byte(PCI_MODON, gx_params->on_duration);
-
- gx_write_byte(PCI_SUSCFG, suscfg);
- pci_read_config_byte(gx_params->cs55x0, PCI_SUSCFG, &suscfg);
-
- local_irq_restore(flags);
-
- gx_params->pci_suscfg = suscfg;
-
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
-
- dprintk("suspend modulation w/ duration of ON:%d us, OFF:%d us\n",
- gx_params->on_duration * 32, gx_params->off_duration * 32);
- dprintk("suspend modulation w/ clock speed: %d kHz.\n", freqs.new);
-}
-
-/****************************************************************
- * High level functions *
- ****************************************************************/
-
-/*
- * cpufreq_gx_verify: test if frequency range is valid
- *
- * This function checks if a given frequency range in kHz is valid
- * for the hardware supported by the driver.
- */
-
-static int cpufreq_gx_verify(struct cpufreq_policy *policy)
-{
- unsigned int tmp_freq = 0;
- u8 tmp1, tmp2;
-
- if (!stock_freq || !policy)
- return -EINVAL;
-
- policy->cpu = 0;
- cpufreq_verify_within_limits(policy, (stock_freq / max_duration),
- stock_freq);
-
- /* It needs to be ensured that at least one supported frequency is
- * within policy->min and policy->max. If it is not, policy->max
- * needs to be increased until one frequency is supported.
- * policy->min may not be decreased, though. This way we guarantee a
- * specific processing capacity.
- */
- tmp_freq = gx_validate_speed(policy->min, &tmp1, &tmp2);
- if (tmp_freq < policy->min)
- tmp_freq += stock_freq / max_duration;
- policy->min = tmp_freq;
- if (policy->min > policy->max)
- policy->max = tmp_freq;
- tmp_freq = gx_validate_speed(policy->max, &tmp1, &tmp2);
- if (tmp_freq > policy->max)
- tmp_freq -= stock_freq / max_duration;
- policy->max = tmp_freq;
- if (policy->max < policy->min)
- policy->max = policy->min;
- cpufreq_verify_within_limits(policy, (stock_freq / max_duration),
- stock_freq);
-
- return 0;
-}
-
-/*
- * cpufreq_gx_target:
- *
- */
-static int cpufreq_gx_target(struct cpufreq_policy *policy,
- unsigned int target_freq,
- unsigned int relation)
-{
- u8 tmp1, tmp2;
- unsigned int tmp_freq;
-
- if (!stock_freq || !policy)
- return -EINVAL;
-
- policy->cpu = 0;
-
- tmp_freq = gx_validate_speed(target_freq, &tmp1, &tmp2);
- while (tmp_freq < policy->min) {
- tmp_freq += stock_freq / max_duration;
- tmp_freq = gx_validate_speed(tmp_freq, &tmp1, &tmp2);
- }
- while (tmp_freq > policy->max) {
- tmp_freq -= stock_freq / max_duration;
- tmp_freq = gx_validate_speed(tmp_freq, &tmp1, &tmp2);
- }
-
- gx_set_cpuspeed(tmp_freq);
-
- return 0;
-}
-
-static int cpufreq_gx_cpu_init(struct cpufreq_policy *policy)
-{
- unsigned int maxfreq, curfreq;
-
- if (!policy || policy->cpu != 0)
- return -ENODEV;
-
- /* determine maximum frequency */
- if (pci_busclk)
- maxfreq = pci_busclk * gx_freq_mult[getCx86(CX86_DIR1) & 0x0f];
- else if (cpu_khz)
- maxfreq = cpu_khz;
- else
- maxfreq = 30000 * gx_freq_mult[getCx86(CX86_DIR1) & 0x0f];
-
- stock_freq = maxfreq;
- curfreq = gx_get_cpuspeed(0);
-
- dprintk("cpu max frequency is %d.\n", maxfreq);
- dprintk("cpu current frequency is %dkHz.\n", curfreq);
-
- /* setup basic struct for cpufreq API */
- policy->cpu = 0;
-
- if (max_duration < POLICY_MIN_DIV)
- policy->min = maxfreq / max_duration;
- else
- policy->min = maxfreq / POLICY_MIN_DIV;
- policy->max = maxfreq;
- policy->cur = curfreq;
- policy->cpuinfo.min_freq = maxfreq / max_duration;
- policy->cpuinfo.max_freq = maxfreq;
- policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL;
-
- return 0;
-}
-
-/*
- * cpufreq_gx_init:
- * MediaGX/Geode GX initialize cpufreq driver
- */
-static struct cpufreq_driver gx_suspmod_driver = {
- .get = gx_get_cpuspeed,
- .verify = cpufreq_gx_verify,
- .target = cpufreq_gx_target,
- .init = cpufreq_gx_cpu_init,
- .name = "gx-suspmod",
- .owner = THIS_MODULE,
-};
-
-static int __init cpufreq_gx_init(void)
-{
- int ret;
- struct gxfreq_params *params;
- struct pci_dev *gx_pci;
-
- /* Test if we have the right hardware */
- gx_pci = gx_detect_chipset();
- if (gx_pci == NULL)
- return -ENODEV;
-
- /* check whether module parameters are sane */
- if (max_duration > 0xff)
- max_duration = 0xff;
-
- dprintk("geode suspend modulation available.\n");
-
- params = kzalloc(sizeof(struct gxfreq_params), GFP_KERNEL);
- if (params == NULL)
- return -ENOMEM;
-
- params->cs55x0 = gx_pci;
- gx_params = params;
-
- /* keep cs55x0 configurations */
- pci_read_config_byte(params->cs55x0, PCI_SUSCFG, &(params->pci_suscfg));
- pci_read_config_byte(params->cs55x0, PCI_PMER1, &(params->pci_pmer1));
- pci_read_config_byte(params->cs55x0, PCI_PMER2, &(params->pci_pmer2));
- pci_read_config_byte(params->cs55x0, PCI_MODON, &(params->on_duration));
- pci_read_config_byte(params->cs55x0, PCI_MODOFF,
- &(params->off_duration));
-
- ret = cpufreq_register_driver(&gx_suspmod_driver);
- if (ret) {
- kfree(params);
- return ret; /* register error! */
- }
-
- return 0;
-}
-
-static void __exit cpufreq_gx_exit(void)
-{
- cpufreq_unregister_driver(&gx_suspmod_driver);
- pci_dev_put(gx_params->cs55x0);
- kfree(gx_params);
-}
-
-MODULE_AUTHOR("Hiroshi Miura <miura@da-cha.org>");
-MODULE_DESCRIPTION("Cpufreq driver for Cyrix MediaGX and NatSemi Geode");
-MODULE_LICENSE("GPL");
-
-module_init(cpufreq_gx_init);
-module_exit(cpufreq_gx_exit);
-
diff --git a/arch/x86/kernel/cpu/cpufreq/longhaul.c b/arch/x86/kernel/cpu/cpufreq/longhaul.c
deleted file mode 100644
index 7e7eea4f8261..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/longhaul.c
+++ /dev/null
@@ -1,1029 +0,0 @@
-/*
- * (C) 2001-2004 Dave Jones. <davej@redhat.com>
- * (C) 2002 Padraig Brady. <padraig@antefacto.com>
- *
- * Licensed under the terms of the GNU GPL License version 2.
- * Based upon datasheets & sample CPUs kindly provided by VIA.
- *
- * VIA have currently 3 different versions of Longhaul.
- * Version 1 (Longhaul) uses the BCR2 MSR at 0x1147.
- * It is present only in Samuel 1 (C5A), Samuel 2 (C5B) stepping 0.
- * Version 2 of longhaul is backward compatible with v1, but adds
- * LONGHAUL MSR for purpose of both frequency and voltage scaling.
- * Present in Samuel 2 (steppings 1-7 only) (C5B), and Ezra (C5C).
- * Version 3 of longhaul got renamed to Powersaver and redesigned
- * to use only the POWERSAVER MSR at 0x110a.
- * It is present in Ezra-T (C5M), Nehemiah (C5X) and above.
- * Feature-wise it is much the same as longhaul v2, though there is
- * provision for scaling the FSB too; that doesn't work well in
- * practice, so we don't even try to use it.
- *
- * BIG FAT DISCLAIMER: Work in progress code. Possibly *dangerous*
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/moduleparam.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-#include <linux/pci.h>
-#include <linux/slab.h>
-#include <linux/string.h>
-#include <linux/delay.h>
-#include <linux/timex.h>
-#include <linux/io.h>
-#include <linux/acpi.h>
-
-#include <asm/msr.h>
-#include <acpi/processor.h>
-
-#include "longhaul.h"
-
-#define PFX "longhaul: "
-
-#define TYPE_LONGHAUL_V1 1
-#define TYPE_LONGHAUL_V2 2
-#define TYPE_POWERSAVER 3
-
-#define CPU_SAMUEL 1
-#define CPU_SAMUEL2 2
-#define CPU_EZRA 3
-#define CPU_EZRA_T 4
-#define CPU_NEHEMIAH 5
-#define CPU_NEHEMIAH_C 6
-
-/* Flags */
-#define USE_ACPI_C3 (1 << 1)
-#define USE_NORTHBRIDGE (1 << 2)
-
-static int cpu_model;
-static unsigned int numscales = 16;
-static unsigned int fsb;
-
-static const struct mV_pos *vrm_mV_table;
-static const unsigned char *mV_vrm_table;
-
-static unsigned int highest_speed, lowest_speed; /* kHz */
-static unsigned int minmult, maxmult;
-static int can_scale_voltage;
-static struct acpi_processor *pr;
-static struct acpi_processor_cx *cx;
-static u32 acpi_regs_addr;
-static u8 longhaul_flags;
-static unsigned int longhaul_index;
-
-/* Module parameters */
-static int scale_voltage;
-static int disable_acpi_c3;
-static int revid_errata;
-
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "longhaul", msg)
-
-
-/* Clock ratios multiplied by 10 */
-static int mults[32];
-static int eblcr[32];
-static int longhaul_version;
-static struct cpufreq_frequency_table *longhaul_table;
-
-#ifdef CONFIG_CPU_FREQ_DEBUG
-static char speedbuffer[8];
-
-static char *print_speed(int speed)
-{
- if (speed < 1000) {
- snprintf(speedbuffer, sizeof(speedbuffer), "%dMHz", speed);
- return speedbuffer;
- }
-
- if (speed%1000 == 0)
- snprintf(speedbuffer, sizeof(speedbuffer),
- "%dGHz", speed/1000);
- else
- snprintf(speedbuffer, sizeof(speedbuffer),
- "%d.%dGHz", speed/1000, (speed%1000)/100);
-
- return speedbuffer;
-}
-#endif
-
-
-static unsigned int calc_speed(int mult)
-{
- int khz;
- khz = (mult/10)*fsb;
- if (mult%10)
- khz += fsb/2;
- khz *= 1000;
- return khz;
-}
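Worked example: with fsb = 133 and mult = 75 (ratios are stored times ten, so this is 7.5x), calc_speed() computes (75/10) * 133 = 931, adds fsb/2 = 66 for the half step, and scales by 1000, returning 997,000 kHz — 7.5 x 133 MHz in integer arithmetic.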
-
-
-static int longhaul_get_cpu_mult(void)
-{
- unsigned long invalue = 0, lo, hi;
-
- rdmsr(MSR_IA32_EBL_CR_POWERON, lo, hi);
- invalue = (lo & (1<<22|1<<23|1<<24|1<<25))>>22;
- if (longhaul_version == TYPE_LONGHAUL_V2 ||
- longhaul_version == TYPE_POWERSAVER) {
- if (lo & (1<<27))
- invalue += 16;
- }
- return eblcr[invalue];
-}
-
-/* For processor with BCR2 MSR */
-
-static void do_longhaul1(unsigned int mults_index)
-{
- union msr_bcr2 bcr2;
-
- rdmsrl(MSR_VIA_BCR2, bcr2.val);
- /* Enable software clock multiplier */
- bcr2.bits.ESOFTBF = 1;
- bcr2.bits.CLOCKMUL = mults_index & 0xff;
-
- /* Sync to timer tick */
- safe_halt();
- /* Change frequency on next halt or sleep */
- wrmsrl(MSR_VIA_BCR2, bcr2.val);
- /* Invoke transition */
- ACPI_FLUSH_CPU_CACHE();
- halt();
-
- /* Disable software clock multiplier */
- local_irq_disable();
- rdmsrl(MSR_VIA_BCR2, bcr2.val);
- bcr2.bits.ESOFTBF = 0;
- wrmsrl(MSR_VIA_BCR2, bcr2.val);
-}
-
-/* For processor with Longhaul MSR */
-
-static void do_powersaver(int cx_address, unsigned int mults_index,
- unsigned int dir)
-{
- union msr_longhaul longhaul;
- u32 t;
-
- rdmsrl(MSR_VIA_LONGHAUL, longhaul.val);
- /* Setup new frequency */
- if (!revid_errata)
- longhaul.bits.RevisionKey = longhaul.bits.RevisionID;
- else
- longhaul.bits.RevisionKey = 0;
- longhaul.bits.SoftBusRatio = mults_index & 0xf;
- longhaul.bits.SoftBusRatio4 = (mults_index & 0x10) >> 4;
- /* Setup new voltage */
- if (can_scale_voltage)
- longhaul.bits.SoftVID = (mults_index >> 8) & 0x1f;
- /* Sync to timer tick */
- safe_halt();
- /* Raise voltage if necessary */
- if (can_scale_voltage && dir) {
- longhaul.bits.EnableSoftVID = 1;
- wrmsrl(MSR_VIA_LONGHAUL, longhaul.val);
- /* Change voltage */
- if (!cx_address) {
- ACPI_FLUSH_CPU_CACHE();
- halt();
- } else {
- ACPI_FLUSH_CPU_CACHE();
- /* Invoke C3 */
- inb(cx_address);
- /* Dummy op - must do something useless after P_LVL3
- * read */
- t = inl(acpi_gbl_FADT.xpm_timer_block.address);
- }
- longhaul.bits.EnableSoftVID = 0;
- wrmsrl(MSR_VIA_LONGHAUL, longhaul.val);
- }
-
- /* Change frequency on next halt or sleep */
- longhaul.bits.EnableSoftBusRatio = 1;
- wrmsrl(MSR_VIA_LONGHAUL, longhaul.val);
- if (!cx_address) {
- ACPI_FLUSH_CPU_CACHE();
- halt();
- } else {
- ACPI_FLUSH_CPU_CACHE();
- /* Invoke C3 */
- inb(cx_address);
- /* Dummy op - must do something useless after P_LVL3 read */
- t = inl(acpi_gbl_FADT.xpm_timer_block.address);
- }
- /* Disable bus ratio bit */
- longhaul.bits.EnableSoftBusRatio = 0;
- wrmsrl(MSR_VIA_LONGHAUL, longhaul.val);
-
- /* Reduce voltage if necessary */
- if (can_scale_voltage && !dir) {
- longhaul.bits.EnableSoftVID = 1;
- wrmsrl(MSR_VIA_LONGHAUL, longhaul.val);
- /* Change voltage */
- if (!cx_address) {
- ACPI_FLUSH_CPU_CACHE();
- halt();
- } else {
- ACPI_FLUSH_CPU_CACHE();
- /* Invoke C3 */
- inb(cx_address);
- /* Dummy op - must do something useless after P_LVL3
- * read */
- t = inl(acpi_gbl_FADT.xpm_timer_block.address);
- }
- longhaul.bits.EnableSoftVID = 0;
- wrmsrl(MSR_VIA_LONGHAUL, longhaul.val);
- }
-}
-
-/**
- * longhaul_set_cpu_frequency()
- * @mults_index : bitpattern of the new multiplier.
- *
- * Sets a new clock ratio.
- */
-
-static void longhaul_setstate(unsigned int table_index)
-{
- unsigned int mults_index;
- int speed, mult;
- struct cpufreq_freqs freqs;
- unsigned long flags;
- unsigned int pic1_mask, pic2_mask;
- u16 bm_status = 0;
- u32 bm_timeout = 1000;
- unsigned int dir = 0;
-
- mults_index = longhaul_table[table_index].index;
- /* Safety precautions */
- mult = mults[mults_index & 0x1f];
- if (mult == -1)
- return;
- speed = calc_speed(mult);
- if ((speed > highest_speed) || (speed < lowest_speed))
- return;
- /* Voltage transition before frequency transition? */
- if (can_scale_voltage && longhaul_index < table_index)
- dir = 1;
-
- freqs.old = calc_speed(longhaul_get_cpu_mult());
- freqs.new = speed;
- freqs.cpu = 0; /* longhaul is a UP-only driver */
-
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
-
- dprintk("Setting to FSB:%dMHz Mult:%d.%dx (%s)\n",
- fsb, mult/10, mult%10, print_speed(speed/1000));
-retry_loop:
- preempt_disable();
- local_irq_save(flags);
-
- pic2_mask = inb(0xA1);
- pic1_mask = inb(0x21); /* works on C3. save mask. */
- outb(0xFF, 0xA1); /* Overkill */
- outb(0xFE, 0x21); /* TMR0 only */
-
- /* Wait while PCI bus is busy. */
- if (acpi_regs_addr && (longhaul_flags & USE_NORTHBRIDGE
- || ((pr != NULL) && pr->flags.bm_control))) {
- bm_status = inw(acpi_regs_addr);
- bm_status &= 1 << 4;
- while (bm_status && bm_timeout) {
- outw(1 << 4, acpi_regs_addr);
- bm_timeout--;
- bm_status = inw(acpi_regs_addr);
- bm_status &= 1 << 4;
- }
- }
-
- if (longhaul_flags & USE_NORTHBRIDGE) {
- /* Disable AGP and PCI arbiters */
- outb(3, 0x22);
- } else if ((pr != NULL) && pr->flags.bm_control) {
- /* Disable bus master arbitration */
- acpi_write_bit_register(ACPI_BITREG_ARB_DISABLE, 1);
- }
- switch (longhaul_version) {
-
- /*
- * Longhaul v1. (Samuel[C5A] and Samuel2 stepping 0[C5B])
- * Software controlled multipliers only.
- */
- case TYPE_LONGHAUL_V1:
- do_longhaul1(mults_index);
- break;
-
- /*
- * Longhaul v2 appears in Samuel2 Steppings 1->7 [C5B] and Ezra [C5C]
- *
- * Longhaul v3 (aka Powersaver). (Ezra-T [C5M] & Nehemiah [C5N])
- * Nehemiah can do FSB scaling too, but this has never been proven
- * to work in practice.
- */
- case TYPE_LONGHAUL_V2:
- case TYPE_POWERSAVER:
- if (longhaul_flags & USE_ACPI_C3) {
- /* Don't allow wakeup */
- acpi_write_bit_register(ACPI_BITREG_BUS_MASTER_RLD, 0);
- do_powersaver(cx->address, mults_index, dir);
- } else {
- do_powersaver(0, mults_index, dir);
- }
- break;
- }
-
- if (longhaul_flags & USE_NORTHBRIDGE) {
- /* Enable arbiters */
- outb(0, 0x22);
- } else if ((pr != NULL) && pr->flags.bm_control) {
- /* Enable bus master arbitration */
- acpi_write_bit_register(ACPI_BITREG_ARB_DISABLE, 0);
- }
- outb(pic2_mask, 0xA1); /* restore mask */
- outb(pic1_mask, 0x21);
-
- local_irq_restore(flags);
- preempt_enable();
-
- freqs.new = calc_speed(longhaul_get_cpu_mult());
- /* Check if requested frequency is set. */
- if (unlikely(freqs.new != speed)) {
- printk(KERN_INFO PFX "Failed to set requested frequency!\n");
- /* Revision ID = 1 but processor is expecting revision key
- * equal to 0. Jumpers at the bottom of processor will change
- * multiplier and FSB, but will not change bits in Longhaul
- * MSR nor enable voltage scaling. */
- if (!revid_errata) {
- printk(KERN_INFO PFX "Enabling \"Ignore Revision ID\" "
- "option.\n");
- revid_errata = 1;
- msleep(200);
- goto retry_loop;
- }
- /* Why ACPI C3 sometimes doesn't work is a mystery to me.
- * But it does happen. Processor is entering ACPI C3 state,
- * but it doesn't change frequency. I tried poking various
- * bits in northbridge registers, but without success. */
- if (longhaul_flags & USE_ACPI_C3) {
- printk(KERN_INFO PFX "Disabling ACPI C3 support.\n");
- longhaul_flags &= ~USE_ACPI_C3;
- if (revid_errata) {
- printk(KERN_INFO PFX "Disabling \"Ignore "
- "Revision ID\" option.\n");
- revid_errata = 0;
- }
- msleep(200);
- goto retry_loop;
- }
- /* This shouldn't happen. Longhaul ver. 2 was reported not
- * working on processors without voltage scaling, but with
- * RevID = 1. RevID errata will make things right. Just
- * to be 100% sure. */
- if (longhaul_version == TYPE_LONGHAUL_V2) {
- printk(KERN_INFO PFX "Switching to Longhaul ver. 1\n");
- longhaul_version = TYPE_LONGHAUL_V1;
- msleep(200);
- goto retry_loop;
- }
- }
- /* Report true CPU frequency */
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
-
- if (!bm_timeout)
- printk(KERN_INFO PFX "Warning: Timeout while waiting for "
- "idle PCI bus.\n");
-}
-
-/*
- * Centaur decided to make life a little more tricky.
- * Only longhaul v1 is allowed to read EBLCR BSEL[0:1].
- * Samuel2 and above have to try and guess what the FSB is.
- * We do this by assuming we booted at maximum multiplier, and interpolate
- * between that value multiplied by possible FSBs and cpu_mhz which
- * was calculated at boot time. Really ugly, but no other way to do this.
- */
-
-#define ROUNDING 0xf
-
-static int guess_fsb(int mult)
-{
- int speed = cpu_khz / 1000;
- int i;
- int speeds[] = { 666, 1000, 1333, 2000 };
- int f_max, f_min;
-
- for (i = 0; i < 4; i++) {
- f_max = ((speeds[i] * mult) + 50) / 100;
- f_max += (ROUNDING / 2);
- f_min = f_max - ROUNDING;
- if ((speed <= f_max) && (speed >= f_min))
- return speeds[i] / 10;
- }
- return 0;
-}
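Worked example with hypothetical figures: for mult = 130 (a 13x ratio) and cpu_khz near 1,733,000, speed = 1733; testing the 133.3 MHz candidate (speeds[] entry 1333) gives f_max = (1333 * 130 + 50) / 100 + 7 = 1740 and f_min = 1725, so 1733 falls inside the window and guess_fsb() returns 1333 / 10 = 133.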
-
-
-static int __init longhaul_get_ranges(void)
-{
- unsigned int i, j, k = 0;
- unsigned int ratio;
- int mult;
-
- /* Get current frequency */
- mult = longhaul_get_cpu_mult();
- if (mult == -1) {
- printk(KERN_INFO PFX "Invalid (reserved) multiplier!\n");
- return -EINVAL;
- }
- fsb = guess_fsb(mult);
- if (fsb == 0) {
- printk(KERN_INFO PFX "Invalid (reserved) FSB!\n");
- return -EINVAL;
- }
- /* Get max multiplier - as we always did.
- * Longhaul MSR is useful only when voltage scaling is enabled.
- * C3 is booting at max anyway. */
- maxmult = mult;
- /* Get min multiplier */
- switch (cpu_model) {
- case CPU_NEHEMIAH:
- minmult = 50;
- break;
- case CPU_NEHEMIAH_C:
- minmult = 40;
- break;
- default:
- minmult = 30;
- break;
- }
-
- dprintk("MinMult:%d.%dx MaxMult:%d.%dx\n",
- minmult/10, minmult%10, maxmult/10, maxmult%10);
-
- highest_speed = calc_speed(maxmult);
- lowest_speed = calc_speed(minmult);
- dprintk("FSB:%dMHz Lowest speed: %s Highest speed:%s\n", fsb,
- print_speed(lowest_speed/1000),
- print_speed(highest_speed/1000));
-
- if (lowest_speed == highest_speed) {
- printk(KERN_INFO PFX "highestspeed == lowest, aborting.\n");
- return -EINVAL;
- }
- if (lowest_speed > highest_speed) {
- printk(KERN_INFO PFX "nonsense! lowest (%d > %d) !\n",
- lowest_speed, highest_speed);
- return -EINVAL;
- }
-
- longhaul_table = kmalloc((numscales + 1) * sizeof(*longhaul_table),
- GFP_KERNEL);
- if (!longhaul_table)
- return -ENOMEM;
-
- for (j = 0; j < numscales; j++) {
- ratio = mults[j];
- if (ratio == -1)
- continue;
- if (ratio > maxmult || ratio < minmult)
- continue;
- longhaul_table[k].frequency = calc_speed(ratio);
- longhaul_table[k].index = j;
- k++;
- }
- if (k <= 1) {
- kfree(longhaul_table);
- return -ENODEV;
- }
- /* Sort */
- for (j = 0; j < k - 1; j++) {
- unsigned int min_f, min_i;
- min_f = longhaul_table[j].frequency;
- min_i = j;
- for (i = j + 1; i < k; i++) {
- if (longhaul_table[i].frequency < min_f) {
- min_f = longhaul_table[i].frequency;
- min_i = i;
- }
- }
- if (min_i != j) {
- swap(longhaul_table[j].frequency,
- longhaul_table[min_i].frequency);
- swap(longhaul_table[j].index,
- longhaul_table[min_i].index);
- }
- }
-
- longhaul_table[k].frequency = CPUFREQ_TABLE_END;
-
- /* Find index we are running on */
- for (j = 0; j < k; j++) {
- if (mults[longhaul_table[j].index & 0x1f] == mult) {
- longhaul_index = j;
- break;
- }
- }
- return 0;
-}
-
-
-static void __init longhaul_setup_voltagescaling(void)
-{
- union msr_longhaul longhaul;
- struct mV_pos minvid, maxvid, vid;
- unsigned int j, speed, pos, kHz_step, numvscales;
- int min_vid_speed;
-
- rdmsrl(MSR_VIA_LONGHAUL, longhaul.val);
- if (!(longhaul.bits.RevisionID & 1)) {
- printk(KERN_INFO PFX "Voltage scaling not supported by CPU.\n");
- return;
- }
-
- if (!longhaul.bits.VRMRev) {
- printk(KERN_INFO PFX "VRM 8.5\n");
- vrm_mV_table = &vrm85_mV[0];
- mV_vrm_table = &mV_vrm85[0];
- } else {
- printk(KERN_INFO PFX "Mobile VRM\n");
- if (cpu_model < CPU_NEHEMIAH)
- return;
- vrm_mV_table = &mobilevrm_mV[0];
- mV_vrm_table = &mV_mobilevrm[0];
- }
-
- minvid = vrm_mV_table[longhaul.bits.MinimumVID];
- maxvid = vrm_mV_table[longhaul.bits.MaximumVID];
-
- if (minvid.mV == 0 || maxvid.mV == 0 || minvid.mV > maxvid.mV) {
- printk(KERN_INFO PFX "Bogus values Min:%d.%03d Max:%d.%03d. "
- "Voltage scaling disabled.\n",
- minvid.mV/1000, minvid.mV%1000,
- maxvid.mV/1000, maxvid.mV%1000);
- return;
- }
-
- if (minvid.mV == maxvid.mV) {
- printk(KERN_INFO PFX "Claims to support voltage scaling but "
- "min & max are both %d.%03d. "
- "Voltage scaling disabled\n",
- maxvid.mV/1000, maxvid.mV%1000);
- return;
- }
-
- /* How many voltage steps */
- numvscales = maxvid.pos - minvid.pos + 1;
- printk(KERN_INFO PFX
- "Max VID=%d.%03d "
- "Min VID=%d.%03d, "
- "%d possible voltage scales\n",
- maxvid.mV/1000, maxvid.mV%1000,
- minvid.mV/1000, minvid.mV%1000,
- numvscales);
-
- /* Calculate max frequency at min voltage */
- j = longhaul.bits.MinMHzBR;
- if (longhaul.bits.MinMHzBR4)
- j += 16;
- min_vid_speed = eblcr[j];
- if (min_vid_speed == -1)
- return;
- switch (longhaul.bits.MinMHzFSB) {
- case 0:
- min_vid_speed *= 13333;
- break;
- case 1:
- min_vid_speed *= 10000;
- break;
- case 3:
- min_vid_speed *= 6666;
- break;
- default:
- return;
- }
- if (min_vid_speed >= highest_speed)
- return;
- /* Calculate kHz for one voltage step */
- kHz_step = (highest_speed - min_vid_speed) / numvscales;
-
- j = 0;
- while (longhaul_table[j].frequency != CPUFREQ_TABLE_END) {
- speed = longhaul_table[j].frequency;
- if (speed > min_vid_speed)
- pos = (speed - min_vid_speed) / kHz_step + minvid.pos;
- else
- pos = minvid.pos;
- longhaul_table[j].index |= mV_vrm_table[pos] << 8;
- vid = vrm_mV_table[mV_vrm_table[pos]];
- printk(KERN_INFO PFX "f: %d kHz, index: %d, vid: %d mV\n",
- speed, j, vid.mV);
- j++;
- }
-
- can_scale_voltage = 1;
- printk(KERN_INFO PFX "Voltage scaling enabled.\n");
-}
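
The interpolation just built maps each table frequency onto a VID position between minvid.pos and maxvid.pos. Below is a minimal standalone sketch of the same arithmetic (plain userspace C with invented numbers: a 532..1200 MHz range over VID positions 4..12; the real values come from the Longhaul MSR):

#include <stdio.h>

#define MIN_VID_SPEED 532000   /* kHz: max frequency at minimum voltage */
#define HIGHEST_SPEED 1200000  /* kHz */
#define MIN_VID_POS   4
#define MAX_VID_POS   12

int main(void)
{
    unsigned int numvscales = MAX_VID_POS - MIN_VID_POS + 1; /* 9 steps */
    unsigned int khz_step = (HIGHEST_SPEED - MIN_VID_SPEED) / numvscales;
    unsigned int speeds[] = { 400000, 532000, 800000, 1100000 };
    unsigned int i;

    for (i = 0; i < sizeof(speeds) / sizeof(speeds[0]); i++) {
        unsigned int speed = speeds[i], pos;

        /* At or below min_vid_speed the CPU runs at minimum VID;
         * above it, the VID position rises linearly with speed. */
        if (speed > MIN_VID_SPEED)
            pos = (speed - MIN_VID_SPEED) / khz_step + MIN_VID_POS;
        else
            pos = MIN_VID_POS;
        printf("%7u kHz -> VID position %u\n", speed, pos);
    }
    return 0;
}
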
-
-
-static int longhaul_verify(struct cpufreq_policy *policy)
-{
- return cpufreq_frequency_table_verify(policy, longhaul_table);
-}
-
-
-static int longhaul_target(struct cpufreq_policy *policy,
- unsigned int target_freq, unsigned int relation)
-{
- unsigned int table_index = 0;
- unsigned int i;
- unsigned int dir = 0;
- u8 vid, current_vid;
-
- if (cpufreq_frequency_table_target(policy, longhaul_table, target_freq,
- relation, &table_index))
- return -EINVAL;
-
- /* Don't set same frequency again */
- if (longhaul_index == table_index)
- return 0;
-
- if (!can_scale_voltage)
- longhaul_setstate(table_index);
- else {
- /* On the test system, voltage transitions exceeding a
- * single step up or down were turning the motherboard
- * off. Both "ondemand" and "userspace" governors are
- * unsafe. The C7 does this in hardware; the C3 is older,
- * so we need to do it in software. */
- i = longhaul_index;
- current_vid = (longhaul_table[longhaul_index].index >> 8);
- current_vid &= 0x1f;
- if (table_index > longhaul_index)
- dir = 1;
- while (i != table_index) {
- vid = (longhaul_table[i].index >> 8) & 0x1f;
- if (vid != current_vid) {
- longhaul_setstate(i);
- current_vid = vid;
- msleep(200);
- }
- if (dir)
- i++;
- else
- i--;
- }
- longhaul_setstate(table_index);
- }
- longhaul_index = table_index;
- return 0;
-}
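
The walk above is easy to model outside the kernel. A standalone sketch with a hypothetical six-entry table (only the per-entry VIDs matter here; set_state_stub() stands in for longhaul_setstate()) showing that no issued transition ever spans more than one voltage step:

#include <stdio.h>

/* Hypothetical per-entry voltage IDs, as carried in bits 8..12 of
 * longhaul_table[].index once voltage scaling is set up. */
static const unsigned int table_vid[] = { 3, 3, 4, 5, 5, 6 };

static void set_state_stub(int i)
{
    printf("transition to table index %d (vid %u)\n", i, table_vid[i]);
}

int main(void)
{
    int cur = 0, target = 5, i = cur;
    int dir = target > cur ? 1 : -1;
    unsigned int vid, current_vid = table_vid[cur];

    /* Walk one table entry at a time and only issue a transition
     * when the VID changes, so no single jump exceeds one step. */
    while (i != target) {
        vid = table_vid[i];
        if (vid != current_vid) {
            set_state_stub(i);
            current_vid = vid;
            /* the driver also msleep()s 200 ms here */
        }
        i += dir;
    }
    set_state_stub(target);
    return 0;
}
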
-
-
-static unsigned int longhaul_get(unsigned int cpu)
-{
- if (cpu)
- return 0;
- return calc_speed(longhaul_get_cpu_mult());
-}
-
-static acpi_status longhaul_walk_callback(acpi_handle obj_handle,
- u32 nesting_level,
- void *context, void **return_value)
-{
- struct acpi_device *d;
-
- if (acpi_bus_get_device(obj_handle, &d))
- return 0;
-
- *return_value = acpi_driver_data(d);
- return 1;
-}
-
-/* VIA doesn't support the PM2 reg, but has something similar */
-static int enable_arbiter_disable(void)
-{
- struct pci_dev *dev;
- int status = 1;
- int reg;
- u8 pci_cmd;
-
- /* Find PLE133 host bridge */
- reg = 0x78;
- dev = pci_get_device(PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_8601_0,
- NULL);
- /* Find PM133/VT8605 host bridge */
- if (dev == NULL)
- dev = pci_get_device(PCI_VENDOR_ID_VIA,
- PCI_DEVICE_ID_VIA_8605_0, NULL);
- /* Find CLE266 host bridge */
- if (dev == NULL) {
- reg = 0x76;
- dev = pci_get_device(PCI_VENDOR_ID_VIA,
- PCI_DEVICE_ID_VIA_862X_0, NULL);
- /* Find CN400 V-Link host bridge */
- if (dev == NULL)
- dev = pci_get_device(PCI_VENDOR_ID_VIA, 0x7259, NULL);
- }
- if (dev != NULL) {
- /* Enable access to port 0x22 */
- pci_read_config_byte(dev, reg, &pci_cmd);
- if (!(pci_cmd & 1<<7)) {
- pci_cmd |= 1<<7;
- pci_write_config_byte(dev, reg, pci_cmd);
- pci_read_config_byte(dev, reg, &pci_cmd);
- if (!(pci_cmd & 1<<7)) {
- printk(KERN_ERR PFX
- "Can't enable access to port 0x22.\n");
- status = 0;
- }
- }
- pci_dev_put(dev);
- return status;
- }
- return 0;
-}
-
-static int longhaul_setup_southbridge(void)
-{
- struct pci_dev *dev;
- u8 pci_cmd;
-
- /* Find VT8235 southbridge */
- dev = pci_get_device(PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_8235, NULL);
- if (dev == NULL)
- /* Find VT8237 southbridge */
- dev = pci_get_device(PCI_VENDOR_ID_VIA,
- PCI_DEVICE_ID_VIA_8237, NULL);
- if (dev != NULL) {
- /* Set transition time to max */
- pci_read_config_byte(dev, 0xec, &pci_cmd);
- pci_cmd &= ~(1 << 2);
- pci_write_config_byte(dev, 0xec, pci_cmd);
- pci_read_config_byte(dev, 0xe4, &pci_cmd);
- pci_cmd &= ~(1 << 7);
- pci_write_config_byte(dev, 0xe4, pci_cmd);
- pci_read_config_byte(dev, 0xe5, &pci_cmd);
- pci_cmd |= 1 << 7;
- pci_write_config_byte(dev, 0xe5, pci_cmd);
- /* Get address of ACPI registers block */
- pci_read_config_byte(dev, 0x81, &pci_cmd);
- if (pci_cmd & 1 << 7) {
- pci_read_config_dword(dev, 0x88, &acpi_regs_addr);
- acpi_regs_addr &= 0xff00;
- printk(KERN_INFO PFX "ACPI I/O at 0x%x\n",
- acpi_regs_addr);
- }
-
- pci_dev_put(dev);
- return 1;
- }
- return 0;
-}
-
-static int __init longhaul_cpu_init(struct cpufreq_policy *policy)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
- char *cpuname = NULL;
- int ret;
- u32 lo, hi;
-
- /* Check what we have on this motherboard */
- switch (c->x86_model) {
- case 6:
- cpu_model = CPU_SAMUEL;
- cpuname = "C3 'Samuel' [C5A]";
- longhaul_version = TYPE_LONGHAUL_V1;
- memcpy(mults, samuel1_mults, sizeof(samuel1_mults));
- memcpy(eblcr, samuel1_eblcr, sizeof(samuel1_eblcr));
- break;
-
- case 7:
- switch (c->x86_mask) {
- case 0:
- longhaul_version = TYPE_LONGHAUL_V1;
- cpu_model = CPU_SAMUEL2;
- cpuname = "C3 'Samuel 2' [C5B]";
- /* Note: this is not a typo; early Samuel2s had
- * Samuel1 ratios. */
- memcpy(mults, samuel1_mults, sizeof(samuel1_mults));
- memcpy(eblcr, samuel2_eblcr, sizeof(samuel2_eblcr));
- break;
- case 1 ... 15:
- longhaul_version = TYPE_LONGHAUL_V2;
- if (c->x86_mask < 8) {
- cpu_model = CPU_SAMUEL2;
- cpuname = "C3 'Samuel 2' [C5B]";
- } else {
- cpu_model = CPU_EZRA;
- cpuname = "C3 'Ezra' [C5C]";
- }
- memcpy(mults, ezra_mults, sizeof(ezra_mults));
- memcpy(eblcr, ezra_eblcr, sizeof(ezra_eblcr));
- break;
- }
- break;
-
- case 8:
- cpu_model = CPU_EZRA_T;
- cpuname = "C3 'Ezra-T' [C5M]";
- longhaul_version = TYPE_POWERSAVER;
- numscales = 32;
- memcpy(mults, ezrat_mults, sizeof(ezrat_mults));
- memcpy(eblcr, ezrat_eblcr, sizeof(ezrat_eblcr));
- break;
-
- case 9:
- longhaul_version = TYPE_POWERSAVER;
- numscales = 32;
- memcpy(mults, nehemiah_mults, sizeof(nehemiah_mults));
- memcpy(eblcr, nehemiah_eblcr, sizeof(nehemiah_eblcr));
- switch (c->x86_mask) {
- case 0 ... 1:
- cpu_model = CPU_NEHEMIAH;
- cpuname = "C3 'Nehemiah A' [C5XLOE]";
- break;
- case 2 ... 4:
- cpu_model = CPU_NEHEMIAH;
- cpuname = "C3 'Nehemiah B' [C5XLOH]";
- break;
- case 5 ... 15:
- cpu_model = CPU_NEHEMIAH_C;
- cpuname = "C3 'Nehemiah C' [C5P]";
- break;
- }
- break;
-
- default:
- cpuname = "Unknown";
- break;
- }
- /* Check Longhaul ver. 2 */
- if (longhaul_version == TYPE_LONGHAUL_V2) {
- rdmsr(MSR_VIA_LONGHAUL, lo, hi);
- if (lo == 0 && hi == 0)
- /* Looks like MSR isn't present */
- longhaul_version = TYPE_LONGHAUL_V1;
- }
-
- printk(KERN_INFO PFX "VIA %s CPU detected. ", cpuname);
- switch (longhaul_version) {
- case TYPE_LONGHAUL_V1:
- case TYPE_LONGHAUL_V2:
- printk(KERN_CONT "Longhaul v%d supported.\n", longhaul_version);
- break;
- case TYPE_POWERSAVER:
- printk(KERN_CONT "Powersaver supported.\n");
- break;
- }
-
- /* Doesn't hurt */
- longhaul_setup_southbridge();
-
- /* Find ACPI data for processor */
- acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
- ACPI_UINT32_MAX, &longhaul_walk_callback, NULL,
- NULL, (void *)&pr);
-
- /* Check ACPI support for C3 state */
- if (pr != NULL && longhaul_version == TYPE_POWERSAVER) {
- cx = &pr->power.states[ACPI_STATE_C3];
- if (cx->address > 0 && cx->latency <= 1000)
- longhaul_flags |= USE_ACPI_C3;
- }
- /* Disable if it isn't working */
- if (disable_acpi_c3)
- longhaul_flags &= ~USE_ACPI_C3;
- /* Check if northbridge is friendly */
- if (enable_arbiter_disable())
- longhaul_flags |= USE_NORTHBRIDGE;
-
- /* Check ACPI support for bus master arbiter disable */
- if (!(longhaul_flags & USE_ACPI_C3
- || longhaul_flags & USE_NORTHBRIDGE)
- && ((pr == NULL) || !(pr->flags.bm_control))) {
- printk(KERN_ERR PFX
- "No ACPI support. Unsupported northbridge.\n");
- return -ENODEV;
- }
-
- if (longhaul_flags & USE_NORTHBRIDGE)
- printk(KERN_INFO PFX "Using northbridge support.\n");
- if (longhaul_flags & USE_ACPI_C3)
- printk(KERN_INFO PFX "Using ACPI support.\n");
-
- ret = longhaul_get_ranges();
- if (ret != 0)
- return ret;
-
- if ((longhaul_version != TYPE_LONGHAUL_V1) && (scale_voltage != 0))
- longhaul_setup_voltagescaling();
-
- policy->cpuinfo.transition_latency = 200000; /* nsec */
- policy->cur = calc_speed(longhaul_get_cpu_mult());
-
- ret = cpufreq_frequency_table_cpuinfo(policy, longhaul_table);
- if (ret)
- return ret;
-
- cpufreq_frequency_table_get_attr(longhaul_table, policy->cpu);
-
- return 0;
-}
-
-static int __devexit longhaul_cpu_exit(struct cpufreq_policy *policy)
-{
- cpufreq_frequency_table_put_attr(policy->cpu);
- return 0;
-}
-
-static struct freq_attr *longhaul_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- NULL,
-};
-
-static struct cpufreq_driver longhaul_driver = {
- .verify = longhaul_verify,
- .target = longhaul_target,
- .get = longhaul_get,
- .init = longhaul_cpu_init,
- .exit = __devexit_p(longhaul_cpu_exit),
- .name = "longhaul",
- .owner = THIS_MODULE,
- .attr = longhaul_attr,
-};
-
-
-static int __init longhaul_init(void)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
-
- if (c->x86_vendor != X86_VENDOR_CENTAUR || c->x86 != 6)
- return -ENODEV;
-
-#ifdef CONFIG_SMP
- if (num_online_cpus() > 1) {
- printk(KERN_ERR PFX "More than 1 CPU detected, "
- "longhaul disabled.\n");
- return -ENODEV;
- }
-#endif
-#ifdef CONFIG_X86_IO_APIC
- if (cpu_has_apic) {
- printk(KERN_ERR PFX "APIC detected. Longhaul is currently "
- "broken in this configuration.\n");
- return -ENODEV;
- }
-#endif
- switch (c->x86_model) {
- case 6 ... 9:
- return cpufreq_register_driver(&longhaul_driver);
- case 10:
- printk(KERN_ERR PFX "Use acpi-cpufreq driver for VIA C7\n");
- default:
- ;
- }
-
- return -ENODEV;
-}
-
-
-static void __exit longhaul_exit(void)
-{
- int i;
-
- for (i = 0; i < numscales; i++) {
- if (mults[i] == maxmult) {
- longhaul_setstate(i);
- break;
- }
- }
-
- cpufreq_unregister_driver(&longhaul_driver);
- kfree(longhaul_table);
-}
-
-/* Even if the BIOS exports an ACPI C3 state, and it is used
- * successfully when the CPU is idle, this state doesn't
- * trigger frequency transitions in some cases. */
-module_param(disable_acpi_c3, int, 0644);
-MODULE_PARM_DESC(disable_acpi_c3, "Don't use ACPI C3 support");
-/* Change CPU voltage with frequency. Very useful for saving
- * power, but most VIA C3 processors don't support it. */
-module_param(scale_voltage, int, 0644);
-MODULE_PARM_DESC(scale_voltage, "Scale voltage of processor");
-/* Force revision key to 0 for processors which don't
- * support voltage scaling, but present themselves as
- * if they did. */
-module_param(revid_errata, int, 0644);
-MODULE_PARM_DESC(revid_errata, "Ignore CPU Revision ID");
-
-MODULE_AUTHOR("Dave Jones <davej@redhat.com>");
-MODULE_DESCRIPTION("Longhaul driver for VIA Cyrix processors.");
-MODULE_LICENSE("GPL");
-
-late_initcall(longhaul_init);
-module_exit(longhaul_exit);
diff --git a/arch/x86/kernel/cpu/cpufreq/longhaul.h b/arch/x86/kernel/cpu/cpufreq/longhaul.h
deleted file mode 100644
index e2360a469f79..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/longhaul.h
+++ /dev/null
@@ -1,353 +0,0 @@
-/*
- * longhaul.h
- * (C) 2003 Dave Jones.
- *
- * Licensed under the terms of the GNU GPL License version 2.
- *
- * VIA-specific information
- */
-
-union msr_bcr2 {
- struct {
- unsigned Reserved:19, // 18:0
- ESOFTBF:1, // 19
- Reserved2:3, // 22:20
- CLOCKMUL:4, // 26:23
- Reserved3:5; // 31:27
- } bits;
- unsigned long val;
-};
-
-union msr_longhaul {
- struct {
- unsigned RevisionID:4, // 3:0
- RevisionKey:4, // 7:4
- EnableSoftBusRatio:1, // 8
- EnableSoftVID:1, // 9
- EnableSoftBSEL:1, // 10
- Reserved:3, // 13:11
- SoftBusRatio4:1, // 14
- VRMRev:1, // 15
- SoftBusRatio:4, // 19:16
- SoftVID:5, // 24:20
- Reserved2:3, // 27:25
- SoftBSEL:2, // 29:28
- Reserved3:2, // 31:30
- MaxMHzBR:4, // 35:32
- MaximumVID:5, // 40:36
- MaxMHzFSB:2, // 42:41
- MaxMHzBR4:1, // 43
- Reserved4:4, // 47:44
- MinMHzBR:4, // 51:48
- MinimumVID:5, // 56:52
- MinMHzFSB:2, // 58:57
- MinMHzBR4:1, // 59
- Reserved5:4; // 63:60
- } bits;
- unsigned long long val;
-};
-
-/*
- * Clock ratio tables. Div/Mod by 10 to get ratio.
- * The eblcr values specify the ratio read from the CPU.
- * The mults values specify what to write to the CPU.
- */
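
The encoding is easy to exercise in isolation. A minimal standalone sketch (plain userspace C with made-up sample entries, not kernel code) of the div/mod-by-10 decoding that the driver's dprintk lines perform on these tables:

#include <stdio.h>

/* Hypothetical sample entries in the same encoding as mults[]/eblcr[]:
 * the ratio multiplied by 10, with -1 marking RESERVED slots. */
static const int sample_mults[] = { 45, 30, 100, -1 };

int main(void)
{
    unsigned int i;

    for (i = 0; i < sizeof(sample_mults) / sizeof(sample_mults[0]); i++) {
        int m = sample_mults[i];

        if (m == -1) {
            printf("entry %u: RESERVED\n", i);
            continue;
        }
        /* Div/Mod by 10: 45 -> "4.5x", 100 -> "10.0x" */
        printf("entry %u: %d.%dx\n", i, m / 10, m % 10);
    }
    return 0;
}
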
-
-/*
- * VIA C3 Samuel 1 & Samuel 2 (stepping 0)
- */
-static const int __initdata samuel1_mults[16] = {
- -1, /* 0000 -> RESERVED */
- 30, /* 0001 -> 3.0x */
- 40, /* 0010 -> 4.0x */
- -1, /* 0011 -> RESERVED */
- -1, /* 0100 -> RESERVED */
- 35, /* 0101 -> 3.5x */
- 45, /* 0110 -> 4.5x */
- 55, /* 0111 -> 5.5x */
- 60, /* 1000 -> 6.0x */
- 70, /* 1001 -> 7.0x */
- 80, /* 1010 -> 8.0x */
- 50, /* 1011 -> 5.0x */
- 65, /* 1100 -> 6.5x */
- 75, /* 1101 -> 7.5x */
- -1, /* 1110 -> RESERVED */
- -1, /* 1111 -> RESERVED */
-};
-
-static const int __initdata samuel1_eblcr[16] = {
- 50, /* 0000 -> RESERVED */
- 30, /* 0001 -> 3.0x */
- 40, /* 0010 -> 4.0x */
- -1, /* 0011 -> RESERVED */
- 55, /* 0100 -> 5.5x */
- 35, /* 0101 -> 3.5x */
- 45, /* 0110 -> 4.5x */
- -1, /* 0111 -> RESERVED */
- -1, /* 1000 -> RESERVED */
- 70, /* 1001 -> 7.0x */
- 80, /* 1010 -> 8.0x */
- 60, /* 1011 -> 6.0x */
- -1, /* 1100 -> RESERVED */
- 75, /* 1101 -> 7.5x */
- -1, /* 1110 -> RESERVED */
- 65, /* 1111 -> 6.5x */
-};
-
-/*
- * VIA C3 Samuel2 Stepping 1->15
- */
-static const int __initdata samuel2_eblcr[16] = {
- 50, /* 0000 -> 5.0x */
- 30, /* 0001 -> 3.0x */
- 40, /* 0010 -> 4.0x */
- 100, /* 0011 -> 10.0x */
- 55, /* 0100 -> 5.5x */
- 35, /* 0101 -> 3.5x */
- 45, /* 0110 -> 4.5x */
- 110, /* 0111 -> 11.0x */
- 90, /* 1000 -> 9.0x */
- 70, /* 1001 -> 7.0x */
- 80, /* 1010 -> 8.0x */
- 60, /* 1011 -> 6.0x */
- 120, /* 1100 -> 12.0x */
- 75, /* 1101 -> 7.5x */
- 130, /* 1110 -> 13.0x */
- 65, /* 1111 -> 6.5x */
-};
-
-/*
- * VIA C3 Ezra
- */
-static const int __initdata ezra_mults[16] = {
- 100, /* 0000 -> 10.0x */
- 30, /* 0001 -> 3.0x */
- 40, /* 0010 -> 4.0x */
- 90, /* 0011 -> 9.0x */
- 95, /* 0100 -> 9.5x */
- 35, /* 0101 -> 3.5x */
- 45, /* 0110 -> 4.5x */
- 55, /* 0111 -> 5.5x */
- 60, /* 1000 -> 6.0x */
- 70, /* 1001 -> 7.0x */
- 80, /* 1010 -> 8.0x */
- 50, /* 1011 -> 5.0x */
- 65, /* 1100 -> 6.5x */
- 75, /* 1101 -> 7.5x */
- 85, /* 1110 -> 8.5x */
- 120, /* 1111 -> 12.0x */
-};
-
-static const int __initdata ezra_eblcr[16] = {
- 50, /* 0000 -> 5.0x */
- 30, /* 0001 -> 3.0x */
- 40, /* 0010 -> 4.0x */
- 100, /* 0011 -> 10.0x */
- 55, /* 0100 -> 5.5x */
- 35, /* 0101 -> 3.5x */
- 45, /* 0110 -> 4.5x */
- 95, /* 0111 -> 9.5x */
- 90, /* 1000 -> 9.0x */
- 70, /* 1001 -> 7.0x */
- 80, /* 1010 -> 8.0x */
- 60, /* 1011 -> 6.0x */
- 120, /* 1100 -> 12.0x */
- 75, /* 1101 -> 7.5x */
- 85, /* 1110 -> 8.5x */
- 65, /* 1111 -> 6.5x */
-};
-
-/*
- * VIA C3 (Ezra-T) [C5M].
- */
-static const int __initdata ezrat_mults[32] = {
- 100, /* 0000 -> 10.0x */
- 30, /* 0001 -> 3.0x */
- 40, /* 0010 -> 4.0x */
- 90, /* 0011 -> 9.0x */
- 95, /* 0100 -> 9.5x */
- 35, /* 0101 -> 3.5x */
- 45, /* 0110 -> 4.5x */
- 55, /* 0111 -> 5.5x */
- 60, /* 1000 -> 6.0x */
- 70, /* 1001 -> 7.0x */
- 80, /* 1010 -> 8.0x */
- 50, /* 1011 -> 5.0x */
- 65, /* 1100 -> 6.5x */
- 75, /* 1101 -> 7.5x */
- 85, /* 1110 -> 8.5x */
- 120, /* 1111 -> 12.0x */
-
- -1, /* 0000 -> RESERVED (10.0x) */
- 110, /* 0001 -> 11.0x */
- -1, /* 0010 -> 12.0x */
- -1, /* 0011 -> RESERVED (9.0x) */
- 105, /* 0100 -> 10.5x */
- 115, /* 0101 -> 11.5x */
- 125, /* 0110 -> 12.5x */
- 135, /* 0111 -> 13.5x */
- 140, /* 1000 -> 14.0x */
- 150, /* 1001 -> 15.0x */
- 160, /* 1010 -> 16.0x */
- 130, /* 1011 -> 13.0x */
- 145, /* 1100 -> 14.5x */
- 155, /* 1101 -> 15.5x */
- -1, /* 1110 -> RESERVED (13.0x) */
- -1, /* 1111 -> RESERVED (12.0x) */
-};
-
-static const int __initdata ezrat_eblcr[32] = {
- 50, /* 0000 -> 5.0x */
- 30, /* 0001 -> 3.0x */
- 40, /* 0010 -> 4.0x */
- 100, /* 0011 -> 10.0x */
- 55, /* 0100 -> 5.5x */
- 35, /* 0101 -> 3.5x */
- 45, /* 0110 -> 4.5x */
- 95, /* 0111 -> 9.5x */
- 90, /* 1000 -> 9.0x */
- 70, /* 1001 -> 7.0x */
- 80, /* 1010 -> 8.0x */
- 60, /* 1011 -> 6.0x */
- 120, /* 1100 -> 12.0x */
- 75, /* 1101 -> 7.5x */
- 85, /* 1110 -> 8.5x */
- 65, /* 1111 -> 6.5x */
-
- -1, /* 0000 -> RESERVED (9.0x) */
- 110, /* 0001 -> 11.0x */
- 120, /* 0010 -> 12.0x */
- -1, /* 0011 -> RESERVED (10.0x) */
- 135, /* 0100 -> 13.5x */
- 115, /* 0101 -> 11.5x */
- 125, /* 0110 -> 12.5x */
- 105, /* 0111 -> 10.5x */
- 130, /* 1000 -> 13.0x */
- 150, /* 1001 -> 15.0x */
- 160, /* 1010 -> 16.0x */
- 140, /* 1011 -> 14.0x */
- -1, /* 1100 -> RESERVED (12.0x) */
- 155, /* 1101 -> 15.5x */
- -1, /* 1110 -> RESERVED (13.0x) */
- 145, /* 1111 -> 14.5x */
-};
-
-/*
- * VIA C3 Nehemiah
- */
-
-static const int __initdata nehemiah_mults[32] = {
- 100, /* 0000 -> 10.0x */
- -1, /* 0001 -> 16.0x */
- 40, /* 0010 -> 4.0x */
- 90, /* 0011 -> 9.0x */
- 95, /* 0100 -> 9.5x */
- -1, /* 0101 -> RESERVED */
- 45, /* 0110 -> 4.5x */
- 55, /* 0111 -> 5.5x */
- 60, /* 1000 -> 6.0x */
- 70, /* 1001 -> 7.0x */
- 80, /* 1010 -> 8.0x */
- 50, /* 1011 -> 5.0x */
- 65, /* 1100 -> 6.5x */
- 75, /* 1101 -> 7.5x */
- 85, /* 1110 -> 8.5x */
- 120, /* 1111 -> 12.0x */
- -1, /* 0000 -> 10.0x */
- 110, /* 0001 -> 11.0x */
- -1, /* 0010 -> 12.0x */
- -1, /* 0011 -> 9.0x */
- 105, /* 0100 -> 10.5x */
- 115, /* 0101 -> 11.5x */
- 125, /* 0110 -> 12.5x */
- 135, /* 0111 -> 13.5x */
- 140, /* 1000 -> 14.0x */
- 150, /* 1001 -> 15.0x */
- 160, /* 1010 -> 16.0x */
- 130, /* 1011 -> 13.0x */
- 145, /* 1100 -> 14.5x */
- 155, /* 1101 -> 15.5x */
- -1, /* 1110 -> RESERVED (13.0x) */
- -1, /* 1111 -> 12.0x */
-};
-
-static const int __initdata nehemiah_eblcr[32] = {
- 50, /* 0000 -> 5.0x */
- 160, /* 0001 -> 16.0x */
- 40, /* 0010 -> 4.0x */
- 100, /* 0011 -> 10.0x */
- 55, /* 0100 -> 5.5x */
- -1, /* 0101 -> RESERVED */
- 45, /* 0110 -> 4.5x */
- 95, /* 0111 -> 9.5x */
- 90, /* 1000 -> 9.0x */
- 70, /* 1001 -> 7.0x */
- 80, /* 1010 -> 8.0x */
- 60, /* 1011 -> 6.0x */
- 120, /* 1100 -> 12.0x */
- 75, /* 1101 -> 7.5x */
- 85, /* 1110 -> 8.5x */
- 65, /* 1111 -> 6.5x */
- 90, /* 0000 -> 9.0x */
- 110, /* 0001 -> 11.0x */
- 120, /* 0010 -> 12.0x */
- 100, /* 0011 -> 10.0x */
- 135, /* 0100 -> 13.5x */
- 115, /* 0101 -> 11.5x */
- 125, /* 0110 -> 12.5x */
- 105, /* 0111 -> 10.5x */
- 130, /* 1000 -> 13.0x */
- 150, /* 1001 -> 15.0x */
- 160, /* 1010 -> 16.0x */
- 140, /* 1011 -> 14.0x */
- 120, /* 1100 -> 12.0x */
- 155, /* 1101 -> 15.5x */
- -1, /* 1110 -> RESERVED (13.0x) */
- 145 /* 1111 -> 14.5x */
-};
-
-/*
- * Voltage scales. Div/Mod by 1000 to get actual voltage.
- * Which scale to use depends on the VRM type in use.
- */
-
-struct mV_pos {
- unsigned short mV;
- unsigned short pos;
-};
-
-static const struct mV_pos __initdata vrm85_mV[32] = {
- {1250, 8}, {1200, 6}, {1150, 4}, {1100, 2},
- {1050, 0}, {1800, 30}, {1750, 28}, {1700, 26},
- {1650, 24}, {1600, 22}, {1550, 20}, {1500, 18},
- {1450, 16}, {1400, 14}, {1350, 12}, {1300, 10},
- {1275, 9}, {1225, 7}, {1175, 5}, {1125, 3},
- {1075, 1}, {1825, 31}, {1775, 29}, {1725, 27},
- {1675, 25}, {1625, 23}, {1575, 21}, {1525, 19},
- {1475, 17}, {1425, 15}, {1375, 13}, {1325, 11}
-};
-
-static const unsigned char __initdata mV_vrm85[32] = {
- 0x04, 0x14, 0x03, 0x13, 0x02, 0x12, 0x01, 0x11,
- 0x00, 0x10, 0x0f, 0x1f, 0x0e, 0x1e, 0x0d, 0x1d,
- 0x0c, 0x1c, 0x0b, 0x1b, 0x0a, 0x1a, 0x09, 0x19,
- 0x08, 0x18, 0x07, 0x17, 0x06, 0x16, 0x05, 0x15
-};
-
-static const struct mV_pos __initdata mobilevrm_mV[32] = {
- {1750, 31}, {1700, 30}, {1650, 29}, {1600, 28},
- {1550, 27}, {1500, 26}, {1450, 25}, {1400, 24},
- {1350, 23}, {1300, 22}, {1250, 21}, {1200, 20},
- {1150, 19}, {1100, 18}, {1050, 17}, {1000, 16},
- {975, 15}, {950, 14}, {925, 13}, {900, 12},
- {875, 11}, {850, 10}, {825, 9}, {800, 8},
- {775, 7}, {750, 6}, {725, 5}, {700, 4},
- {675, 3}, {650, 2}, {625, 1}, {600, 0}
-};
-
-static const unsigned char __initdata mV_mobilevrm[32] = {
- 0x1f, 0x1e, 0x1d, 0x1c, 0x1b, 0x1a, 0x19, 0x18,
- 0x17, 0x16, 0x15, 0x14, 0x13, 0x12, 0x11, 0x10,
- 0x0f, 0x0e, 0x0d, 0x0c, 0x0b, 0x0a, 0x09, 0x08,
- 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00
-};
-
diff --git a/arch/x86/kernel/cpu/cpufreq/longrun.c b/arch/x86/kernel/cpu/cpufreq/longrun.c
deleted file mode 100644
index e7b559d74c52..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/longrun.c
+++ /dev/null
@@ -1,327 +0,0 @@
-/*
- * (C) 2002 - 2003 Dominik Brodowski <linux@brodo.de>
- *
- * Licensed under the terms of the GNU GPL License version 2.
- *
- * BIG FAT DISCLAIMER: Work in progress code. Possibly *dangerous*
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-#include <linux/timex.h>
-
-#include <asm/msr.h>
-#include <asm/processor.h>
-
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "longrun", msg)
-
-static struct cpufreq_driver longrun_driver;
-
-/**
- * longrun_{low,high}_freq is needed for the conversion of cpufreq kHz
- * values into per cent values. In TMTA microcode, the following is valid:
- * performance_pctg = (current_freq - low_freq)/(high_freq - low_freq)
- */
-static unsigned int longrun_low_freq, longrun_high_freq;
-
-
-/**
- * longrun_get_policy - get the current LongRun policy
- * @policy: struct cpufreq_policy where current policy is written into
- *
- * Reads the current LongRun policy by access to MSR_TMTA_LONGRUN_FLAGS
- * and MSR_TMTA_LONGRUN_CTRL
- */
-static void __init longrun_get_policy(struct cpufreq_policy *policy)
-{
- u32 msr_lo, msr_hi;
-
- rdmsr(MSR_TMTA_LONGRUN_FLAGS, msr_lo, msr_hi);
- dprintk("longrun flags are %x - %x\n", msr_lo, msr_hi);
- if (msr_lo & 0x01)
- policy->policy = CPUFREQ_POLICY_PERFORMANCE;
- else
- policy->policy = CPUFREQ_POLICY_POWERSAVE;
-
- rdmsr(MSR_TMTA_LONGRUN_CTRL, msr_lo, msr_hi);
- dprintk("longrun ctrl is %x - %x\n", msr_lo, msr_hi);
- msr_lo &= 0x0000007F;
- msr_hi &= 0x0000007F;
-
- if (longrun_high_freq <= longrun_low_freq) {
- /* Assume degenerate Longrun table */
- policy->min = policy->max = longrun_high_freq;
- } else {
- policy->min = longrun_low_freq + msr_lo *
- ((longrun_high_freq - longrun_low_freq) / 100);
- policy->max = longrun_low_freq + msr_hi *
- ((longrun_high_freq - longrun_low_freq) / 100);
- }
- policy->cpu = 0;
-}
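
A standalone sketch of the percentage-window arithmetic used above, with invented bounds (300 MHz low, 933 MHz high); in the driver the real bounds come from longrun_determine_freqs():

#include <stdio.h>

/* Invented bounds: a 300..933 MHz window. The MSR fields hold
 * 0..100 percentages of this window. */
#define LOW_FREQ  300000 /* kHz */
#define HIGH_FREQ 933000 /* kHz */

/* same arithmetic as longrun_get_policy() above */
static unsigned int pctg_to_khz(unsigned int pctg)
{
    return LOW_FREQ + pctg * ((HIGH_FREQ - LOW_FREQ) / 100);
}

/* inverse mapping, as used by longrun_set_policy() below */
static unsigned int khz_to_pctg(unsigned int khz)
{
    return (khz - LOW_FREQ) / ((HIGH_FREQ - LOW_FREQ) / 100);
}

int main(void)
{
    printf("50%% -> %u kHz\n", pctg_to_khz(50));         /* 616500 */
    printf("700000 kHz -> %u%%\n", khz_to_pctg(700000)); /* 63 */
    return 0;
}
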
-
-
-/**
- * longrun_set_policy - sets a new CPUFreq policy
- * @policy: new policy
- *
- * Sets a new CPUFreq policy on LongRun-capable processors. This function
- * has to be called with cpufreq_driver locked.
- */
-static int longrun_set_policy(struct cpufreq_policy *policy)
-{
- u32 msr_lo, msr_hi;
- u32 pctg_lo, pctg_hi;
-
- if (!policy)
- return -EINVAL;
-
- if (longrun_high_freq <= longrun_low_freq) {
- /* Assume degenerate Longrun table */
- pctg_lo = pctg_hi = 100;
- } else {
- pctg_lo = (policy->min - longrun_low_freq) /
- ((longrun_high_freq - longrun_low_freq) / 100);
- pctg_hi = (policy->max - longrun_low_freq) /
- ((longrun_high_freq - longrun_low_freq) / 100);
- }
-
- if (pctg_hi > 100)
- pctg_hi = 100;
- if (pctg_lo > pctg_hi)
- pctg_lo = pctg_hi;
-
- /* performance or economy mode */
- rdmsr(MSR_TMTA_LONGRUN_FLAGS, msr_lo, msr_hi);
- msr_lo &= 0xFFFFFFFE;
- switch (policy->policy) {
- case CPUFREQ_POLICY_PERFORMANCE:
- msr_lo |= 0x00000001;
- break;
- case CPUFREQ_POLICY_POWERSAVE:
- break;
- }
- wrmsr(MSR_TMTA_LONGRUN_FLAGS, msr_lo, msr_hi);
-
- /* lower and upper boundary */
- rdmsr(MSR_TMTA_LONGRUN_CTRL, msr_lo, msr_hi);
- msr_lo &= 0xFFFFFF80;
- msr_hi &= 0xFFFFFF80;
- msr_lo |= pctg_lo;
- msr_hi |= pctg_hi;
- wrmsr(MSR_TMTA_LONGRUN_CTRL, msr_lo, msr_hi);
-
- return 0;
-}
-
-
-/**
- * longrun_verify_policy - verifies a new CPUFreq policy
- * @policy: the policy to verify
- *
- * Validates a new CPUFreq policy. This function has to be called with
- * cpufreq_driver locked.
- */
-static int longrun_verify_policy(struct cpufreq_policy *policy)
-{
- if (!policy)
- return -EINVAL;
-
- policy->cpu = 0;
- cpufreq_verify_within_limits(policy,
- policy->cpuinfo.min_freq,
- policy->cpuinfo.max_freq);
-
- if ((policy->policy != CPUFREQ_POLICY_POWERSAVE) &&
- (policy->policy != CPUFREQ_POLICY_PERFORMANCE))
- return -EINVAL;
-
- return 0;
-}
-
-static unsigned int longrun_get(unsigned int cpu)
-{
- u32 eax, ebx, ecx, edx;
-
- if (cpu)
- return 0;
-
- cpuid(0x80860007, &eax, &ebx, &ecx, &edx);
- dprintk("cpuid eax is %u\n", eax);
-
- return eax * 1000;
-}
-
-/**
- * longrun_determine_freqs - determines the lowest and highest possible core frequency
- * @low_freq: an int to put the lowest frequency into
- * @high_freq: an int to put the highest frequency into
- *
- * Determines the lowest and highest possible core frequencies on this CPU.
- * This is necessary to calculate the performance percentage according to
- * TMTA rules:
- * performance_pctg = (target_freq - low_freq)/(high_freq - low_freq)
- */
-static unsigned int __init longrun_determine_freqs(unsigned int *low_freq,
- unsigned int *high_freq)
-{
- u32 msr_lo, msr_hi;
- u32 save_lo, save_hi;
- u32 eax, ebx, ecx, edx;
- u32 try_hi;
- struct cpuinfo_x86 *c = &cpu_data(0);
-
- if (!low_freq || !high_freq)
- return -EINVAL;
-
- if (cpu_has(c, X86_FEATURE_LRTI)) {
- /* if the LongRun Table Interface is present, the
- * detection is a bit easier:
- * For minimum frequency, read out the maximum
- * level (msr_hi), write that into "currently
- * selected level", and read out the frequency.
- * For maximum frequency, read out level zero.
- */
- /* minimum */
- rdmsr(MSR_TMTA_LRTI_READOUT, msr_lo, msr_hi);
- wrmsr(MSR_TMTA_LRTI_READOUT, msr_hi, msr_hi);
- rdmsr(MSR_TMTA_LRTI_VOLT_MHZ, msr_lo, msr_hi);
- *low_freq = msr_lo * 1000; /* to kHz */
-
- /* maximum */
- wrmsr(MSR_TMTA_LRTI_READOUT, 0, msr_hi);
- rdmsr(MSR_TMTA_LRTI_VOLT_MHZ, msr_lo, msr_hi);
- *high_freq = msr_lo * 1000; /* to kHz */
-
- dprintk("longrun table interface told %u - %u kHz\n",
- *low_freq, *high_freq);
-
- if (*low_freq > *high_freq)
- *low_freq = *high_freq;
- return 0;
- }
-
- /* set the upper border to the value determined during TSC init */
- *high_freq = (cpu_khz / 1000);
- *high_freq = *high_freq * 1000;
- dprintk("high frequency is %u kHz\n", *high_freq);
-
- /* get current borders */
- rdmsr(MSR_TMTA_LONGRUN_CTRL, msr_lo, msr_hi);
- save_lo = msr_lo & 0x0000007F;
- save_hi = msr_hi & 0x0000007F;
-
- /* if current perf_pctg is larger than 90%, we need to decrease the
- * upper limit to make the calculation more accurate.
- */
- cpuid(0x80860007, &eax, &ebx, &ecx, &edx);
- /* try decreasing in 10% steps; some processors react only
- * to certain barrier values */
- for (try_hi = 80; try_hi > 0 && ecx > 90; try_hi -= 10) {
- /* set to 0 to try_hi perf_pctg */
- msr_lo &= 0xFFFFFF80;
- msr_hi &= 0xFFFFFF80;
- msr_hi |= try_hi;
- wrmsr(MSR_TMTA_LONGRUN_CTRL, msr_lo, msr_hi);
-
- /* read out current core MHz and current perf_pctg */
- cpuid(0x80860007, &eax, &ebx, &ecx, &edx);
-
- /* restore values */
- wrmsr(MSR_TMTA_LONGRUN_CTRL, save_lo, save_hi);
- }
- dprintk("percentage is %u %%, freq is %u MHz\n", ecx, eax);
-
- /* performance_pctg = (current_freq - low_freq)/(high_freq - low_freq)
- * equals
- * low_freq * (1 - perf_pctg) = (cur_freq - high_freq * perf_pctg)
- *
- * high_freq * perf_pctg is stored temporarily into "ebx".
- */
- ebx = (((cpu_khz / 1000) * ecx) / 100); /* to MHz */
-
- if ((ecx > 95) || (ecx == 0) || (eax < ebx))
- return -EIO;
-
- edx = ((eax - ebx) * 100) / (100 - ecx);
- *low_freq = edx * 1000; /* back to kHz */
-
- dprintk("low frequency is %u kHz\n", *low_freq);
-
- if (*low_freq > *high_freq)
- *low_freq = *high_freq;
-
- return 0;
-}
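
The closing calculation deserves a worked example. Assuming a hypothetical reading of 700 MHz at 60% performance with a 1000 MHz ceiling, the same integer arithmetic recovers a 250 MHz lower bound (check: (700 - 250) / (1000 - 250) = 0.6). A standalone sketch:

#include <stdio.h>

int main(void)
{
    /* Hypothetical CPUID 0x80860007 readings: current core clock in
     * MHz (eax) and current performance percentage (ecx). */
    unsigned int eax = 700;       /* current MHz */
    unsigned int ecx = 60;        /* perf_pctg */
    unsigned int high_mhz = 1000; /* from cpu_khz / 1000 */

    /* Rearranging perf = (cur - low) / (high - low):
     *   low * (1 - perf) = cur - high * perf
     * With perf as an integer percentage this becomes: */
    unsigned int ebx = high_mhz * ecx / 100; /* high * perf */
    unsigned int low_mhz = (eax - ebx) * 100 / (100 - ecx);

    printf("low frequency ~ %u MHz\n", low_mhz); /* 250 here */
    return 0;
}
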
-
-
-static int __init longrun_cpu_init(struct cpufreq_policy *policy)
-{
- int result = 0;
-
- /* capability check */
- if (policy->cpu != 0)
- return -ENODEV;
-
- /* detect low and high frequency */
- result = longrun_determine_freqs(&longrun_low_freq, &longrun_high_freq);
- if (result)
- return result;
-
- /* cpuinfo and default policy values */
- policy->cpuinfo.min_freq = longrun_low_freq;
- policy->cpuinfo.max_freq = longrun_high_freq;
- policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL;
- longrun_get_policy(policy);
-
- return 0;
-}
-
-
-static struct cpufreq_driver longrun_driver = {
- .flags = CPUFREQ_CONST_LOOPS,
- .verify = longrun_verify_policy,
- .setpolicy = longrun_set_policy,
- .get = longrun_get,
- .init = longrun_cpu_init,
- .name = "longrun",
- .owner = THIS_MODULE,
-};
-
-
-/**
- * longrun_init - initializes the Transmeta Crusoe LongRun CPUFreq driver
- *
- * Initializes the LongRun support.
- */
-static int __init longrun_init(void)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
-
- if (c->x86_vendor != X86_VENDOR_TRANSMETA ||
- !cpu_has(c, X86_FEATURE_LONGRUN))
- return -ENODEV;
-
- return cpufreq_register_driver(&longrun_driver);
-}
-
-
-/**
- * longrun_exit - unregisters LongRun support
- */
-static void __exit longrun_exit(void)
-{
- cpufreq_unregister_driver(&longrun_driver);
-}
-
-
-MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>");
-MODULE_DESCRIPTION("LongRun driver for Transmeta Crusoe and "
- "Efficeon processors.");
-MODULE_LICENSE("GPL");
-
-module_init(longrun_init);
-module_exit(longrun_exit);
diff --git a/arch/x86/kernel/cpu/cpufreq/mperf.c b/arch/x86/kernel/cpu/cpufreq/mperf.c
deleted file mode 100644
index 911e193018ae..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/mperf.c
+++ /dev/null
@@ -1,51 +0,0 @@
-#include <linux/kernel.h>
-#include <linux/smp.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-#include <linux/slab.h>
-
-#include "mperf.h"
-
-static DEFINE_PER_CPU(struct aperfmperf, acfreq_old_perf);
-
-/* Called via smp_call_function_single(), on the target CPU */
-static void read_measured_perf_ctrs(void *_cur)
-{
- struct aperfmperf *am = _cur;
-
- get_aperfmperf(am);
-}
-
-/*
- * Return the measured active (C0) frequency on this CPU since last call
- * to this function.
- * Input: cpu number
- * Return: Average CPU frequency in terms of max frequency (zero on error)
- *
- * We use IA32_MPERF and IA32_APERF MSRs to get the measured performance
- * over a period of time, while CPU is in C0 state.
- * IA32_MPERF counts at the rate of max advertised frequency
- * IA32_APERF counts at the rate of actual CPU frequency
- * Only the IA32_APERF/IA32_MPERF ratio is architecturally defined;
- * no meaning should be associated with the absolute values of these MSRs.
- */
-unsigned int cpufreq_get_measured_perf(struct cpufreq_policy *policy,
- unsigned int cpu)
-{
- struct aperfmperf perf;
- unsigned long ratio;
- unsigned int retval;
-
- if (smp_call_function_single(cpu, read_measured_perf_ctrs, &perf, 1))
- return 0;
-
- ratio = calc_aperfmperf_ratio(&per_cpu(acfreq_old_perf, cpu), &perf);
- per_cpu(acfreq_old_perf, cpu) = perf;
-
- retval = (policy->cpuinfo.max_freq * ratio) >> APERFMPERF_SHIFT;
-
- return retval;
-}
-EXPORT_SYMBOL_GPL(cpufreq_get_measured_perf);
-MODULE_LICENSE("GPL");
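
The comment block above describes the whole algorithm; the kernel helper calc_aperfmperf_ratio() (defined elsewhere) implements the fixed-point division. A standalone sketch of the underlying delta-ratio idea, with made-up counter values:

#include <stdio.h>
#include <stdint.h>

struct aperfmperf {
    uint64_t aperf;
    uint64_t mperf;
};

int main(void)
{
    /* Hypothetical counter snapshots from two reads of IA32_APERF
     * and IA32_MPERF; only the delta ratio is meaningful. */
    struct aperfmperf old = { .aperf = 1000000, .mperf = 2000000 };
    struct aperfmperf cur = { .aperf = 1600000, .mperf = 3000000 };
    unsigned int max_freq = 2400000; /* kHz, advertised maximum */

    uint64_t da = cur.aperf - old.aperf; /* actual-frequency ticks */
    uint64_t dm = cur.mperf - old.mperf; /* max-frequency ticks */

    /* Average C0 frequency = max_freq * dAPERF / dMPERF */
    unsigned int measured = (unsigned int)(max_freq * da / dm);

    printf("measured C0 frequency: %u kHz\n", measured); /* 1440000 */
    return 0;
}
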
diff --git a/arch/x86/kernel/cpu/cpufreq/mperf.h b/arch/x86/kernel/cpu/cpufreq/mperf.h
deleted file mode 100644
index 5dbf2950dc22..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/mperf.h
+++ /dev/null
@@ -1,9 +0,0 @@
-/*
- * (c) 2010 Advanced Micro Devices, Inc.
- * Your use of this code is subject to the terms and conditions of the
- * GNU general public license version 2. See "COPYING" or
- * http://www.gnu.org/licenses/gpl.html
- */
-
-unsigned int cpufreq_get_measured_perf(struct cpufreq_policy *policy,
- unsigned int cpu);
diff --git a/arch/x86/kernel/cpu/cpufreq/p4-clockmod.c b/arch/x86/kernel/cpu/cpufreq/p4-clockmod.c
deleted file mode 100644
index 7b8a8ba67b07..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/p4-clockmod.c
+++ /dev/null
@@ -1,336 +0,0 @@
-/*
- * Pentium 4/Xeon CPU on demand clock modulation/speed scaling
- * (C) 2002 - 2003 Dominik Brodowski <linux@brodo.de>
- * (C) 2002 Zwane Mwaikambo <zwane@commfireservices.com>
- * (C) 2002 Arjan van de Ven <arjanv@redhat.com>
- * (C) 2002 Tora T. Engstad
- * All Rights Reserved
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- *
- * The author(s) of this software shall not be held liable for damages
- * of any nature resulting due to the use of this software. This
- * software is provided AS-IS with no warranties.
- *
- * Date Errata Description
- * 20020525 N44, O17 12.5% or 25% DC causes lockup
- *
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/smp.h>
-#include <linux/cpufreq.h>
-#include <linux/cpumask.h>
-#include <linux/timex.h>
-
-#include <asm/processor.h>
-#include <asm/msr.h>
-#include <asm/timer.h>
-
-#include "speedstep-lib.h"
-
-#define PFX "p4-clockmod: "
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "p4-clockmod", msg)
-
-/*
- * Duty Cycle (3 bits). Note that DC_DISABLE is not specified in
- * Intel docs; it is used here simply to mean "modulation disabled".
- */
-enum {
- DC_RESV, DC_DFLT, DC_25PT, DC_38PT, DC_50PT,
- DC_64PT, DC_75PT, DC_88PT, DC_DISABLE
-};
-
-#define DC_ENTRIES 8
-
-
-static int has_N44_O17_errata[NR_CPUS];
-static unsigned int stock_freq;
-static struct cpufreq_driver p4clockmod_driver;
-static unsigned int cpufreq_p4_get(unsigned int cpu);
-
-static int cpufreq_p4_setdc(unsigned int cpu, unsigned int newstate)
-{
- u32 l, h;
-
- if (!cpu_online(cpu) ||
- (newstate > DC_DISABLE) || (newstate == DC_RESV))
- return -EINVAL;
-
- rdmsr_on_cpu(cpu, MSR_IA32_THERM_STATUS, &l, &h);
-
- if (l & 0x01)
- dprintk("CPU#%d currently thermal throttled\n", cpu);
-
- if (has_N44_O17_errata[cpu] &&
- (newstate == DC_25PT || newstate == DC_DFLT))
- newstate = DC_38PT;
-
- rdmsr_on_cpu(cpu, MSR_IA32_THERM_CONTROL, &l, &h);
- if (newstate == DC_DISABLE) {
- dprintk("CPU#%d disabling modulation\n", cpu);
- wrmsr_on_cpu(cpu, MSR_IA32_THERM_CONTROL, l & ~(1<<4), h);
- } else {
- dprintk("CPU#%d setting duty cycle to %d%%\n",
- cpu, ((125 * newstate) / 10));
- /* bits 63 - 5 : reserved
- * bit 4 : enable/disable
- * bits 3-1 : duty cycle
- * bit 0 : reserved
- */
- l = (l & ~14);
- l = l | (1<<4) | ((newstate & 0x7)<<1);
- wrmsr_on_cpu(cpu, MSR_IA32_THERM_CONTROL, l, h);
- }
-
- return 0;
-}
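
A standalone sketch of the THERM_CONTROL encoding done above, for a hypothetical 2.8 GHz part: each duty-cycle state gives state/8 of full speed (12.5% per step), and DC_DISABLE clears the enable bit:

#include <stdio.h>

int main(void)
{
    unsigned int stock_freq = 2800000; /* kHz, hypothetical P4 */
    unsigned int l = 0;                /* IA32_THERM_CONTROL image */
    unsigned int state;

    for (state = 1; state <= 8; state++) {
        if (state == 8) { /* DC_DISABLE */
            l &= ~(1u << 4); /* clear enable bit: modulation off */
            printf("state %u: modulation off, %u kHz\n",
                   state, stock_freq);
            continue;
        }
        /* clear duty-cycle field and enable bit, then set
         * bit 4 = enable, bits 3..1 = duty-cycle state */
        l = (l & ~0x1eu) | (1u << 4) | ((state & 0x7u) << 1);
        printf("state %u: reg 0x%02x, %u kHz (%u.%u%%)\n",
               state, l, stock_freq * state / 8,
               125 * state / 10, 125 * state % 10);
    }
    return 0;
}
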
-
-
-static struct cpufreq_frequency_table p4clockmod_table[] = {
- {DC_RESV, CPUFREQ_ENTRY_INVALID},
- {DC_DFLT, 0},
- {DC_25PT, 0},
- {DC_38PT, 0},
- {DC_50PT, 0},
- {DC_64PT, 0},
- {DC_75PT, 0},
- {DC_88PT, 0},
- {DC_DISABLE, 0},
- {DC_RESV, CPUFREQ_TABLE_END},
-};
-
-
-static int cpufreq_p4_target(struct cpufreq_policy *policy,
- unsigned int target_freq,
- unsigned int relation)
-{
- unsigned int newstate = DC_RESV;
- struct cpufreq_freqs freqs;
- int i;
-
- if (cpufreq_frequency_table_target(policy, &p4clockmod_table[0],
- target_freq, relation, &newstate))
- return -EINVAL;
-
- freqs.old = cpufreq_p4_get(policy->cpu);
- freqs.new = stock_freq * p4clockmod_table[newstate].index / 8;
-
- if (freqs.new == freqs.old)
- return 0;
-
- /* notifiers */
- for_each_cpu(i, policy->cpus) {
- freqs.cpu = i;
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
- }
-
- /* run on each logical CPU,
- * see section 13.15.3 of IA32 Intel Architecture Software
- * Developer's Manual, Volume 3
- */
- for_each_cpu(i, policy->cpus)
- cpufreq_p4_setdc(i, p4clockmod_table[newstate].index);
-
- /* notifiers */
- for_each_cpu(i, policy->cpus) {
- freqs.cpu = i;
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
- }
-
- return 0;
-}
-
-
-static int cpufreq_p4_verify(struct cpufreq_policy *policy)
-{
- return cpufreq_frequency_table_verify(policy, &p4clockmod_table[0]);
-}
-
-
-static unsigned int cpufreq_p4_get_frequency(struct cpuinfo_x86 *c)
-{
- if (c->x86 == 0x06) {
- if (cpu_has(c, X86_FEATURE_EST))
- printk(KERN_WARNING PFX "Warning: EST-capable CPU "
- "detected. The acpi-cpufreq module offers "
- "voltage scaling in addition of frequency "
- "scaling. You should use that instead of "
- "p4-clockmod, if possible.\n");
- switch (c->x86_model) {
- case 0x0E: /* Core */
- case 0x0F: /* Core Duo */
- case 0x16: /* Celeron Core */
- case 0x1C: /* Atom */
- p4clockmod_driver.flags |= CPUFREQ_CONST_LOOPS;
- return speedstep_get_frequency(SPEEDSTEP_CPU_PCORE);
- case 0x0D: /* Pentium M (Dothan) */
- p4clockmod_driver.flags |= CPUFREQ_CONST_LOOPS;
- /* fall through */
- case 0x09: /* Pentium M (Banias) */
- return speedstep_get_frequency(SPEEDSTEP_CPU_PM);
- }
- }
-
- if (c->x86 != 0xF) {
- if (!cpu_has(c, X86_FEATURE_EST))
- printk(KERN_WARNING PFX "Unknown CPU. "
- "Please send an e-mail to "
- "<cpufreq@vger.kernel.org>\n");
- return 0;
- }
-
- /* on P-4s, the TSC runs at a constant frequency independent of
- * whether throttling is active or not. */
- p4clockmod_driver.flags |= CPUFREQ_CONST_LOOPS;
-
- if (speedstep_detect_processor() == SPEEDSTEP_CPU_P4M) {
- printk(KERN_WARNING PFX "Warning: Pentium 4-M detected. "
- "The speedstep-ich or acpi cpufreq modules offer "
- "voltage scaling in addition of frequency scaling. "
- "You should use either one instead of p4-clockmod, "
- "if possible.\n");
- return speedstep_get_frequency(SPEEDSTEP_CPU_P4M);
- }
-
- return speedstep_get_frequency(SPEEDSTEP_CPU_P4D);
-}
-
-
-
-static int cpufreq_p4_cpu_init(struct cpufreq_policy *policy)
-{
- struct cpuinfo_x86 *c = &cpu_data(policy->cpu);
- int cpuid = 0;
- unsigned int i;
-
-#ifdef CONFIG_SMP
- cpumask_copy(policy->cpus, cpu_sibling_mask(policy->cpu));
-#endif
-
- /* Errata workaround */
- cpuid = (c->x86 << 8) | (c->x86_model << 4) | c->x86_mask;
- switch (cpuid) {
- case 0x0f07:
- case 0x0f0a:
- case 0x0f11:
- case 0x0f12:
- has_N44_O17_errata[policy->cpu] = 1;
- dprintk("has errata -- disabling low frequencies\n");
- }
-
- if (speedstep_detect_processor() == SPEEDSTEP_CPU_P4D &&
- c->x86_model < 2) {
- /* switch to maximum frequency and measure result */
- cpufreq_p4_setdc(policy->cpu, DC_DISABLE);
- recalibrate_cpu_khz();
- }
- /* get max frequency */
- stock_freq = cpufreq_p4_get_frequency(c);
- if (!stock_freq)
- return -EINVAL;
-
- /* table init */
- for (i = 1; (p4clockmod_table[i].frequency != CPUFREQ_TABLE_END); i++) {
- if ((i < 2) && (has_N44_O17_errata[policy->cpu]))
- p4clockmod_table[i].frequency = CPUFREQ_ENTRY_INVALID;
- else
- p4clockmod_table[i].frequency = (stock_freq * i)/8;
- }
- cpufreq_frequency_table_get_attr(p4clockmod_table, policy->cpu);
-
- /* cpuinfo and default policy values */
-
- /* the transition latency is set to be 1 higher than the maximum
- * transition latency of the ondemand governor */
- policy->cpuinfo.transition_latency = 10000001;
- policy->cur = stock_freq;
-
- return cpufreq_frequency_table_cpuinfo(policy, &p4clockmod_table[0]);
-}
-
-
-static int cpufreq_p4_cpu_exit(struct cpufreq_policy *policy)
-{
- cpufreq_frequency_table_put_attr(policy->cpu);
- return 0;
-}
-
-static unsigned int cpufreq_p4_get(unsigned int cpu)
-{
- u32 l, h;
-
- rdmsr_on_cpu(cpu, MSR_IA32_THERM_CONTROL, &l, &h);
-
- if (l & 0x10) {
- l = l >> 1;
- l &= 0x7;
- } else
- l = DC_DISABLE;
-
- if (l != DC_DISABLE)
- return stock_freq * l / 8;
-
- return stock_freq;
-}
-
-static struct freq_attr *p4clockmod_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- NULL,
-};
-
-static struct cpufreq_driver p4clockmod_driver = {
- .verify = cpufreq_p4_verify,
- .target = cpufreq_p4_target,
- .init = cpufreq_p4_cpu_init,
- .exit = cpufreq_p4_cpu_exit,
- .get = cpufreq_p4_get,
- .name = "p4-clockmod",
- .owner = THIS_MODULE,
- .attr = p4clockmod_attr,
-};
-
-
-static int __init cpufreq_p4_init(void)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
- int ret;
-
- /*
- * THERM_CONTROL is architectural for IA32 now, so
- * we can rely on the capability checks
- */
- if (c->x86_vendor != X86_VENDOR_INTEL)
- return -ENODEV;
-
- if (!test_cpu_cap(c, X86_FEATURE_ACPI) ||
- !test_cpu_cap(c, X86_FEATURE_ACC))
- return -ENODEV;
-
- ret = cpufreq_register_driver(&p4clockmod_driver);
- if (!ret)
- printk(KERN_INFO PFX "P4/Xeon(TM) CPU On-Demand Clock "
- "Modulation available\n");
-
- return ret;
-}
-
-
-static void __exit cpufreq_p4_exit(void)
-{
- cpufreq_unregister_driver(&p4clockmod_driver);
-}
-
-
-MODULE_AUTHOR("Zwane Mwaikambo <zwane@commfireservices.com>");
-MODULE_DESCRIPTION("cpufreq driver for Pentium(TM) 4/Xeon(TM)");
-MODULE_LICENSE("GPL");
-
-late_initcall(cpufreq_p4_init);
-module_exit(cpufreq_p4_exit);
diff --git a/arch/x86/kernel/cpu/cpufreq/pcc-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/pcc-cpufreq.c
deleted file mode 100644
index a36de5bbb622..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/pcc-cpufreq.c
+++ /dev/null
@@ -1,620 +0,0 @@
-/*
- * pcc-cpufreq.c - Processor Clocking Control firmware cpufreq interface
- *
- * Copyright (C) 2009 Red Hat, Matthew Garrett <mjg@redhat.com>
- * Copyright (C) 2009 Hewlett-Packard Development Company, L.P.
- * Nagananda Chumbalkar <nagananda.chumbalkar@hp.com>
- *
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; version 2 of the License.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or NON
- * INFRINGEMENT. See the GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License along
- * with this program; if not, write to the Free Software Foundation, Inc.,
- * 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/smp.h>
-#include <linux/sched.h>
-#include <linux/cpufreq.h>
-#include <linux/compiler.h>
-#include <linux/slab.h>
-
-#include <linux/acpi.h>
-#include <linux/io.h>
-#include <linux/spinlock.h>
-#include <linux/uaccess.h>
-
-#include <acpi/processor.h>
-
-#define PCC_VERSION "1.00.00"
-#define POLL_LOOPS 300
-
-#define CMD_COMPLETE 0x1
-#define CMD_GET_FREQ 0x0
-#define CMD_SET_FREQ 0x1
-
-#define BUF_SZ 4
-
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "pcc-cpufreq", msg)
-
-struct pcc_register_resource {
- u8 descriptor;
- u16 length;
- u8 space_id;
- u8 bit_width;
- u8 bit_offset;
- u8 access_size;
- u64 address;
-} __attribute__ ((packed));
-
-struct pcc_memory_resource {
- u8 descriptor;
- u16 length;
- u8 space_id;
- u8 resource_usage;
- u8 type_specific;
- u64 granularity;
- u64 minimum;
- u64 maximum;
- u64 translation_offset;
- u64 address_length;
-} __attribute__ ((packed));
-
-static struct cpufreq_driver pcc_cpufreq_driver;
-
-struct pcc_header {
- u32 signature;
- u16 length;
- u8 major;
- u8 minor;
- u32 features;
- u16 command;
- u16 status;
- u32 latency;
- u32 minimum_time;
- u32 maximum_time;
- u32 nominal;
- u32 throttled_frequency;
- u32 minimum_frequency;
-};
-
-static void __iomem *pcch_virt_addr;
-static struct pcc_header __iomem *pcch_hdr;
-
-static DEFINE_SPINLOCK(pcc_lock);
-
-static struct acpi_generic_address doorbell;
-
-static u64 doorbell_preserve;
-static u64 doorbell_write;
-
-static u8 OSC_UUID[16] = {0x63, 0x9B, 0x2C, 0x9F, 0x70, 0x91, 0x49, 0x1f,
- 0xBB, 0x4F, 0xA5, 0x98, 0x2F, 0xA1, 0xB5, 0x46};
-
-struct pcc_cpu {
- u32 input_offset;
- u32 output_offset;
-};
-
-static struct pcc_cpu *pcc_cpu_info;
-
-static int pcc_cpufreq_verify(struct cpufreq_policy *policy)
-{
- cpufreq_verify_within_limits(policy, policy->cpuinfo.min_freq,
- policy->cpuinfo.max_freq);
- return 0;
-}
-
-static inline void pcc_cmd(void)
-{
- u64 doorbell_value;
- int i;
-
- acpi_read(&doorbell_value, &doorbell);
- acpi_write((doorbell_value & doorbell_preserve) | doorbell_write,
- &doorbell);
-
- for (i = 0; i < POLL_LOOPS; i++) {
- if (ioread16(&pcch_hdr->status) & CMD_COMPLETE)
- break;
- }
-}
-
-static inline void pcc_clear_mapping(void)
-{
- if (pcch_virt_addr)
- iounmap(pcch_virt_addr);
- pcch_virt_addr = NULL;
-}
-
-static unsigned int pcc_get_freq(unsigned int cpu)
-{
- struct pcc_cpu *pcc_cpu_data;
- unsigned int curr_freq;
- unsigned int freq_limit;
- u16 status;
- u32 input_buffer;
- u32 output_buffer;
-
- spin_lock(&pcc_lock);
-
- dprintk("get: get_freq for CPU %d\n", cpu);
- pcc_cpu_data = per_cpu_ptr(pcc_cpu_info, cpu);
-
- input_buffer = 0x1;
- iowrite32(input_buffer,
- (pcch_virt_addr + pcc_cpu_data->input_offset));
- iowrite16(CMD_GET_FREQ, &pcch_hdr->command);
-
- pcc_cmd();
-
- output_buffer =
- ioread32(pcch_virt_addr + pcc_cpu_data->output_offset);
-
- /* Clear the input buffer - we are done with the current command */
- memset_io((pcch_virt_addr + pcc_cpu_data->input_offset), 0, BUF_SZ);
-
- status = ioread16(&pcch_hdr->status);
- if (status != CMD_COMPLETE) {
- dprintk("get: FAILED: for CPU %d, status is %d\n",
- cpu, status);
- goto cmd_incomplete;
- }
- iowrite16(0, &pcch_hdr->status);
- curr_freq = (((ioread32(&pcch_hdr->nominal) * (output_buffer & 0xff))
- / 100) * 1000);
-
- dprintk("get: SUCCESS: (virtual) output_offset for cpu %d is "
- "0x%x, contains a value of: 0x%x. Speed is: %d MHz\n",
- cpu, (pcch_virt_addr + pcc_cpu_data->output_offset),
- output_buffer, curr_freq);
-
- freq_limit = (output_buffer >> 8) & 0xff;
- if (freq_limit != 0xff) {
- dprintk("get: frequency for cpu %d is being temporarily"
- " capped at %d\n", cpu, curr_freq);
- }
-
- spin_unlock(&pcc_lock);
- return curr_freq;
-
-cmd_incomplete:
- iowrite16(0, &pcch_hdr->status);
- spin_unlock(&pcc_lock);
- return -EINVAL;
-}
-
-static int pcc_cpufreq_target(struct cpufreq_policy *policy,
- unsigned int target_freq,
- unsigned int relation)
-{
- struct pcc_cpu *pcc_cpu_data;
- struct cpufreq_freqs freqs;
- u16 status;
- u32 input_buffer;
- int cpu;
-
- spin_lock(&pcc_lock);
- cpu = policy->cpu;
- pcc_cpu_data = per_cpu_ptr(pcc_cpu_info, cpu);
-
- dprintk("target: CPU %d should go to target freq: %d "
- "(virtual) input_offset is 0x%x\n",
- cpu, target_freq,
- (pcch_virt_addr + pcc_cpu_data->input_offset));
-
- freqs.new = target_freq;
- freqs.cpu = cpu;
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
-
- input_buffer = 0x1 | (((target_freq * 100)
- / (ioread32(&pcch_hdr->nominal) * 1000)) << 8);
- iowrite32(input_buffer,
- (pcch_virt_addr + pcc_cpu_data->input_offset));
- iowrite16(CMD_SET_FREQ, &pcch_hdr->command);
-
- pcc_cmd();
-
- /* Clear the input buffer - we are done with the current command */
- memset_io((pcch_virt_addr + pcc_cpu_data->input_offset), 0, BUF_SZ);
-
- status = ioread16(&pcch_hdr->status);
- if (status != CMD_COMPLETE) {
- dprintk("target: FAILED for cpu %d, with status: 0x%x\n",
- cpu, status);
- goto cmd_incomplete;
- }
- iowrite16(0, &pcch_hdr->status);
-
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
- dprintk("target: was SUCCESSFUL for cpu %d\n", cpu);
- spin_unlock(&pcc_lock);
-
- return 0;
-
-cmd_incomplete:
- iowrite16(0, &pcch_hdr->status);
- spin_unlock(&pcc_lock);
- return -EINVAL;
-}
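
The input-buffer layout deserves a worked example. A standalone sketch (invented platform values) of the percent-of-nominal encoding built above: bit 0 marks the request valid, bits 8..15 carry the desired frequency as a percentage of the nominal frequency:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical platform: 2000 MHz nominal, asking for 1500 MHz. */
    uint32_t nominal_mhz = 2000; /* from pcch_hdr->nominal */
    uint32_t target_khz = 1500000;

    /* Same arithmetic as pcc_cpufreq_target() above. */
    uint32_t pct = target_khz * 100 / (nominal_mhz * 1000);
    uint32_t input_buffer = 0x1 | (pct << 8);

    /* prints 0x00004b01 (75% of nominal) */
    printf("input buffer: 0x%08x (%u%% of nominal)\n",
           input_buffer, pct);
    return 0;
}
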
-
-static int pcc_get_offset(int cpu)
-{
- acpi_status status;
- struct acpi_buffer buffer = {ACPI_ALLOCATE_BUFFER, NULL};
- union acpi_object *pccp, *offset;
- struct pcc_cpu *pcc_cpu_data;
- struct acpi_processor *pr;
- int ret = 0;
-
- pr = per_cpu(processors, cpu);
- pcc_cpu_data = per_cpu_ptr(pcc_cpu_info, cpu);
-
- status = acpi_evaluate_object(pr->handle, "PCCP", NULL, &buffer);
- if (ACPI_FAILURE(status))
- return -ENODEV;
-
- pccp = buffer.pointer;
- if (!pccp || pccp->type != ACPI_TYPE_PACKAGE) {
- ret = -ENODEV;
- goto out_free;
- }
-
- offset = &(pccp->package.elements[0]);
- if (!offset || offset->type != ACPI_TYPE_INTEGER) {
- ret = -ENODEV;
- goto out_free;
- }
-
- pcc_cpu_data->input_offset = offset->integer.value;
-
- offset = &(pccp->package.elements[1]);
- if (!offset || offset->type != ACPI_TYPE_INTEGER) {
- ret = -ENODEV;
- goto out_free;
- }
-
- pcc_cpu_data->output_offset = offset->integer.value;
-
- memset_io((pcch_virt_addr + pcc_cpu_data->input_offset), 0, BUF_SZ);
- memset_io((pcch_virt_addr + pcc_cpu_data->output_offset), 0, BUF_SZ);
-
- dprintk("pcc_get_offset: for CPU %d: pcc_cpu_data "
- "input_offset: 0x%x, pcc_cpu_data output_offset: 0x%x\n",
- cpu, pcc_cpu_data->input_offset, pcc_cpu_data->output_offset);
-out_free:
- kfree(buffer.pointer);
- return ret;
-}
-
-static int __init pcc_cpufreq_do_osc(acpi_handle *handle)
-{
- acpi_status status;
- struct acpi_object_list input;
- struct acpi_buffer output = {ACPI_ALLOCATE_BUFFER, NULL};
- union acpi_object in_params[4];
- union acpi_object *out_obj;
- u32 capabilities[2];
- u32 errors;
- u32 supported;
- int ret = 0;
-
- input.count = 4;
- input.pointer = in_params;
- in_params[0].type = ACPI_TYPE_BUFFER;
- in_params[0].buffer.length = 16;
- in_params[0].buffer.pointer = OSC_UUID;
- in_params[1].type = ACPI_TYPE_INTEGER;
- in_params[1].integer.value = 1;
- in_params[2].type = ACPI_TYPE_INTEGER;
- in_params[2].integer.value = 2;
- in_params[3].type = ACPI_TYPE_BUFFER;
- in_params[3].buffer.length = 8;
- in_params[3].buffer.pointer = (u8 *)&capabilities;
-
- capabilities[0] = OSC_QUERY_ENABLE;
- capabilities[1] = 0x1;
-
- status = acpi_evaluate_object(*handle, "_OSC", &input, &output);
- if (ACPI_FAILURE(status))
- return -ENODEV;
-
- if (!output.length)
- return -ENODEV;
-
- out_obj = output.pointer;
- if (out_obj->type != ACPI_TYPE_BUFFER) {
- ret = -ENODEV;
- goto out_free;
- }
-
- errors = *((u32 *)out_obj->buffer.pointer) & ~(1 << 0);
- if (errors) {
- ret = -ENODEV;
- goto out_free;
- }
-
- supported = *((u32 *)(out_obj->buffer.pointer + 4));
- if (!(supported & 0x1)) {
- ret = -ENODEV;
- goto out_free;
- }
-
- kfree(output.pointer);
- capabilities[0] = 0x0;
- capabilities[1] = 0x1;
-
- status = acpi_evaluate_object(*handle, "_OSC", &input, &output);
- if (ACPI_FAILURE(status))
- return -ENODEV;
-
- if (!output.length)
- return -ENODEV;
-
- out_obj = output.pointer;
- if (out_obj->type != ACPI_TYPE_BUFFER) {
- ret = -ENODEV;
- goto out_free;
- }
-
- errors = *((u32 *)out_obj->buffer.pointer) & ~(1 << 0);
- if (errors) {
- ret = -ENODEV;
- goto out_free;
- }
-
- supported = *((u32 *)(out_obj->buffer.pointer + 4));
- if (!(supported & 0x1)) {
- ret = -ENODEV;
- goto out_free;
- }
-
-out_free:
- kfree(output.pointer);
- return ret;
-}
-
-static int __init pcc_cpufreq_probe(void)
-{
- acpi_status status;
- struct acpi_buffer output = {ACPI_ALLOCATE_BUFFER, NULL};
- struct pcc_memory_resource *mem_resource;
- struct pcc_register_resource *reg_resource;
- union acpi_object *out_obj, *member;
- acpi_handle handle, osc_handle, pcch_handle;
- int ret = 0;
-
- status = acpi_get_handle(NULL, "\\_SB", &handle);
- if (ACPI_FAILURE(status))
- return -ENODEV;
-
- status = acpi_get_handle(handle, "PCCH", &pcch_handle);
- if (ACPI_FAILURE(status))
- return -ENODEV;
-
- status = acpi_get_handle(handle, "_OSC", &osc_handle);
- if (ACPI_SUCCESS(status)) {
- ret = pcc_cpufreq_do_osc(&osc_handle);
- if (ret)
- dprintk("probe: _OSC evaluation did not succeed\n");
- /* Firmware's use of _OSC is optional */
- ret = 0;
- }
-
- status = acpi_evaluate_object(handle, "PCCH", NULL, &output);
- if (ACPI_FAILURE(status))
- return -ENODEV;
-
- out_obj = output.pointer;
- if (out_obj->type != ACPI_TYPE_PACKAGE) {
- ret = -ENODEV;
- goto out_free;
- }
-
- member = &out_obj->package.elements[0];
- if (member->type != ACPI_TYPE_BUFFER) {
- ret = -ENODEV;
- goto out_free;
- }
-
- mem_resource = (struct pcc_memory_resource *)member->buffer.pointer;
-
- dprintk("probe: mem_resource descriptor: 0x%x,"
- " length: %d, space_id: %d, resource_usage: %d,"
- " type_specific: %d, granularity: 0x%llx,"
- " minimum: 0x%llx, maximum: 0x%llx,"
- " translation_offset: 0x%llx, address_length: 0x%llx\n",
- mem_resource->descriptor, mem_resource->length,
- mem_resource->space_id, mem_resource->resource_usage,
- mem_resource->type_specific, mem_resource->granularity,
- mem_resource->minimum, mem_resource->maximum,
- mem_resource->translation_offset,
- mem_resource->address_length);
-
- if (mem_resource->space_id != ACPI_ADR_SPACE_SYSTEM_MEMORY) {
- ret = -ENODEV;
- goto out_free;
- }
-
- pcch_virt_addr = ioremap_nocache(mem_resource->minimum,
- mem_resource->address_length);
- if (pcch_virt_addr == NULL) {
- dprintk("probe: could not map shared mem region\n");
- ret = -ENOMEM;
- goto out_free;
- }
- pcch_hdr = pcch_virt_addr;
-
- dprintk("probe: PCCH header (virtual) addr: 0x%p\n", pcch_hdr);
- dprintk("probe: PCCH header is at physical address: 0x%llx,"
- " signature: 0x%x, length: %d bytes, major: %d, minor: %d,"
- " supported features: 0x%x, command field: 0x%x,"
- " status field: 0x%x, nominal latency: %d us\n",
- mem_resource->minimum, ioread32(&pcch_hdr->signature),
- ioread16(&pcch_hdr->length), ioread8(&pcch_hdr->major),
- ioread8(&pcch_hdr->minor), ioread32(&pcch_hdr->features),
- ioread16(&pcch_hdr->command), ioread16(&pcch_hdr->status),
- ioread32(&pcch_hdr->latency));
-
- dprintk("probe: min time between commands: %d us,"
- " max time between commands: %d us,"
- " nominal CPU frequency: %d MHz,"
- " minimum CPU frequency: %d MHz,"
- " minimum CPU frequency without throttling: %d MHz\n",
- ioread32(&pcch_hdr->minimum_time),
- ioread32(&pcch_hdr->maximum_time),
- ioread32(&pcch_hdr->nominal),
- ioread32(&pcch_hdr->throttled_frequency),
- ioread32(&pcch_hdr->minimum_frequency));
-
- member = &out_obj->package.elements[1];
- if (member->type != ACPI_TYPE_BUFFER) {
- ret = -ENODEV;
- goto pcch_free;
- }
-
- reg_resource = (struct pcc_register_resource *)member->buffer.pointer;
-
- doorbell.space_id = reg_resource->space_id;
- doorbell.bit_width = reg_resource->bit_width;
- doorbell.bit_offset = reg_resource->bit_offset;
- doorbell.access_width = 64;
- doorbell.address = reg_resource->address;
-
- dprintk("probe: doorbell: space_id is %d, bit_width is %d, "
- "bit_offset is %d, access_width is %d, address is 0x%llx\n",
- doorbell.space_id, doorbell.bit_width, doorbell.bit_offset,
- doorbell.access_width, reg_resource->address);
-
- member = &out_obj->package.elements[2];
- if (member->type != ACPI_TYPE_INTEGER) {
- ret = -ENODEV;
- goto pcch_free;
- }
-
- doorbell_preserve = member->integer.value;
-
- member = &out_obj->package.elements[3];
- if (member->type != ACPI_TYPE_INTEGER) {
- ret = -ENODEV;
- goto pcch_free;
- }
-
- doorbell_write = member->integer.value;
-
- dprintk("probe: doorbell_preserve: 0x%llx,"
- " doorbell_write: 0x%llx\n",
- doorbell_preserve, doorbell_write);
-
- pcc_cpu_info = alloc_percpu(struct pcc_cpu);
- if (!pcc_cpu_info) {
- ret = -ENOMEM;
- goto pcch_free;
- }
-
- printk(KERN_DEBUG "pcc-cpufreq: (v%s) driver loaded with frequency"
- " limits: %d MHz, %d MHz\n", PCC_VERSION,
- ioread32(&pcch_hdr->minimum_frequency),
- ioread32(&pcch_hdr->nominal));
- kfree(output.pointer);
- return ret;
-pcch_free:
- pcc_clear_mapping();
-out_free:
- kfree(output.pointer);
- return ret;
-}
-
-static int pcc_cpufreq_cpu_init(struct cpufreq_policy *policy)
-{
- unsigned int cpu = policy->cpu;
- unsigned int result = 0;
-
- if (!pcch_virt_addr) {
- result = -1;
- goto out;
- }
-
- result = pcc_get_offset(cpu);
- if (result) {
- dprintk("init: PCCP evaluation failed\n");
- goto out;
- }
-
- policy->max = policy->cpuinfo.max_freq =
- ioread32(&pcch_hdr->nominal) * 1000;
- policy->min = policy->cpuinfo.min_freq =
- ioread32(&pcch_hdr->minimum_frequency) * 1000;
- policy->cur = pcc_get_freq(cpu);
-
- if (!policy->cur) {
- dprintk("init: Unable to get current CPU frequency\n");
- result = -EINVAL;
- goto out;
- }
-
- dprintk("init: policy->max is %d, policy->min is %d\n",
- policy->max, policy->min);
-out:
- return result;
-}
-
-static int pcc_cpufreq_cpu_exit(struct cpufreq_policy *policy)
-{
- return 0;
-}
-
-static struct cpufreq_driver pcc_cpufreq_driver = {
- .flags = CPUFREQ_CONST_LOOPS,
- .get = pcc_get_freq,
- .verify = pcc_cpufreq_verify,
- .target = pcc_cpufreq_target,
- .init = pcc_cpufreq_cpu_init,
- .exit = pcc_cpufreq_cpu_exit,
- .name = "pcc-cpufreq",
- .owner = THIS_MODULE,
-};
-
-static int __init pcc_cpufreq_init(void)
-{
- int ret;
-
- if (acpi_disabled)
- return 0;
-
- ret = pcc_cpufreq_probe();
- if (ret) {
- dprintk("pcc_cpufreq_init: PCCH evaluation failed\n");
- return ret;
- }
-
- ret = cpufreq_register_driver(&pcc_cpufreq_driver);
-
- return ret;
-}
-
-static void __exit pcc_cpufreq_exit(void)
-{
- cpufreq_unregister_driver(&pcc_cpufreq_driver);
-
- pcc_clear_mapping();
-
- free_percpu(pcc_cpu_info);
-}
-
-MODULE_AUTHOR("Matthew Garrett, Naga Chumbalkar");
-MODULE_VERSION(PCC_VERSION);
-MODULE_DESCRIPTION("Processor Clocking Control interface driver");
-MODULE_LICENSE("GPL");
-
-late_initcall(pcc_cpufreq_init);
-module_exit(pcc_cpufreq_exit);
diff --git a/arch/x86/kernel/cpu/cpufreq/powernow-k6.c b/arch/x86/kernel/cpu/cpufreq/powernow-k6.c
deleted file mode 100644
index b3379d6a5c57..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/powernow-k6.c
+++ /dev/null
@@ -1,261 +0,0 @@
-/*
- * This file was based upon code in Powertweak Linux (http://powertweak.sf.net)
- * (C) 2000-2003 Dave Jones, Arjan van de Ven, Janne Pänkälä,
- * Dominik Brodowski.
- *
- * Licensed under the terms of the GNU GPL License version 2.
- *
- * BIG FAT DISCLAIMER: Work in progress code. Possibly *dangerous*
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-#include <linux/ioport.h>
-#include <linux/timex.h>
-#include <linux/io.h>
-
-#include <asm/msr.h>
-
-#define POWERNOW_IOPORT 0xfff0 /* it doesn't matter where, as long
- as it is unused */
-
-#define PFX "powernow-k6: "
-static unsigned int busfreq; /* FSB, in 10 kHz */
-static unsigned int max_multiplier;
-
-
-/* Clock ratio multiplied by 10 - see table 27 in AMD#23446 */
-static struct cpufreq_frequency_table clock_ratio[] = {
- {45, /* 000 -> 4.5x */ 0},
- {50, /* 001 -> 5.0x */ 0},
- {40, /* 010 -> 4.0x */ 0},
- {55, /* 011 -> 5.5x */ 0},
- {20, /* 100 -> 2.0x */ 0},
- {30, /* 101 -> 3.0x */ 0},
- {60, /* 110 -> 6.0x */ 0},
- {35, /* 111 -> 3.5x */ 0},
- {0, CPUFREQ_TABLE_END}
-};
-
-
-/**
- * powernow_k6_get_cpu_multiplier - returns the current FSB multiplier
- *
- * Returns the current setting of the frequency multiplier. Core clock
- * speed is the Front-Side Bus frequency multiplied by this value.
- */
-static int powernow_k6_get_cpu_multiplier(void)
-{
- u64 invalue = 0;
- u32 msrval;
-
- msrval = POWERNOW_IOPORT + 0x1;
- wrmsr(MSR_K6_EPMR, msrval, 0); /* enable the PowerNow port */
- invalue = inl(POWERNOW_IOPORT + 0x8);
- msrval = POWERNOW_IOPORT + 0x0;
- wrmsr(MSR_K6_EPMR, msrval, 0); /* disable it again */
-
- return clock_ratio[(invalue >> 5)&7].index;
-}
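
For concreteness, a minimal userspace sketch of the decode step above: the 3-bit ratio code sits in bits 7:5 of the value read from the PowerNow port and indexes the clock_ratio[] table. The sample port value and the 100 MHz bus are assumptions:

#include <stdio.h>

/* clock ratio * 10, indexed by the 3-bit code in bits 7:5 (mirrors clock_ratio[]) */
static const int ratio_x10[8] = { 45, 50, 40, 55, 20, 30, 60, 35 };

int main(void)
{
	unsigned long invalue = 0xc5;	/* assumed sample read from POWERNOW_IOPORT + 0x8 */
	unsigned int busfreq = 10000;	/* FSB in 10 kHz units, i.e. 100 MHz (assumed) */
	unsigned int code = (invalue >> 5) & 7;

	printf("code %u -> %d.%dx, core clock %u kHz\n", code,
	       ratio_x10[code] / 10, ratio_x10[code] % 10,
	       busfreq * ratio_x10[code]);
	return 0;
}

With 0xc5 this prints "code 6 -> 6.0x, core clock 600000 kHz".
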
-
-
-/**
- * powernow_k6_set_state - set the PowerNow! multiplier
- * @best_i: clock_ratio[best_i] is the target multiplier
- *
- * Tries to change the PowerNow! multiplier
- */
-static void powernow_k6_set_state(unsigned int best_i)
-{
- unsigned long outvalue = 0, invalue = 0;
- unsigned long msrval;
- struct cpufreq_freqs freqs;
-
- if (clock_ratio[best_i].index > max_multiplier) {
- printk(KERN_ERR PFX "invalid target frequency\n");
- return;
- }
-
- freqs.old = busfreq * powernow_k6_get_cpu_multiplier();
- freqs.new = busfreq * clock_ratio[best_i].index;
-	freqs.cpu = 0; /* powernow-k6.c is a UP-only driver */
-
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
-
- /* we now need to transform best_i to the BVC format, see AMD#23446 */
-
- outvalue = (1<<12) | (1<<10) | (1<<9) | (best_i<<5);
-
- msrval = POWERNOW_IOPORT + 0x1;
- wrmsr(MSR_K6_EPMR, msrval, 0); /* enable the PowerNow port */
- invalue = inl(POWERNOW_IOPORT + 0x8);
- invalue = invalue & 0xf;
- outvalue = outvalue | invalue;
- outl(outvalue , (POWERNOW_IOPORT + 0x8));
- msrval = POWERNOW_IOPORT + 0x0;
- wrmsr(MSR_K6_EPMR, msrval, 0); /* disable it again */
-
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
-
- return;
-}
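
As a hedged illustration of the BVC word assembled above (bit positions taken from the expression itself, not re-checked against AMD#23446):

#include <stdio.h>

int main(void)
{
	unsigned int best_i = 3;		/* index into clock_ratio[] (011 -> 5.5x) */
	unsigned long invalue = 0xc5 & 0xf;	/* low nibble preserved from the port */

	/* bits 12, 10 and 9 are set; the new ratio code lands in bits 7:5 */
	unsigned long outvalue = (1 << 12) | (1 << 10) | (1 << 9) | (best_i << 5)
				 | invalue;
	printf("BVC word: 0x%04lx\n", outvalue);
	return 0;
}
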
-
-
-/**
- * powernow_k6_verify - verifies a new CPUfreq policy
- * @policy: new policy
- *
- * Policy must be within lowest and highest possible CPU Frequency,
- * and at least one possible state must be within min and max.
- */
-static int powernow_k6_verify(struct cpufreq_policy *policy)
-{
- return cpufreq_frequency_table_verify(policy, &clock_ratio[0]);
-}
-
-
-/**
- * powernow_k6_target - sets a new CPUFreq policy
- * @policy: new policy
- * @target_freq: the target frequency
- * @relation: how that frequency relates to achieved frequency
- * (CPUFREQ_RELATION_L or CPUFREQ_RELATION_H)
- *
- * sets a new CPUFreq policy
- */
-static int powernow_k6_target(struct cpufreq_policy *policy,
- unsigned int target_freq,
- unsigned int relation)
-{
- unsigned int newstate = 0;
-
- if (cpufreq_frequency_table_target(policy, &clock_ratio[0],
- target_freq, relation, &newstate))
- return -EINVAL;
-
- powernow_k6_set_state(newstate);
-
- return 0;
-}
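
A toy model of the L/H relation semantics mentioned in the comment above; this is a sketch, not the real cpufreq_frequency_table_target(), which additionally honours policy->min/max and skips invalid entries:

#include <stdio.h>

#define RELATION_L 0	/* lowest frequency at or above target */
#define RELATION_H 1	/* highest frequency at or below target */

static int pick(const unsigned int *tbl, int n, unsigned int target, int rel)
{
	int best = -1;
	for (int i = 0; i < n; i++) {
		if (rel == RELATION_L && tbl[i] >= target &&
		    (best < 0 || tbl[i] < tbl[best]))
			best = i;
		if (rel == RELATION_H && tbl[i] <= target &&
		    (best < 0 || tbl[i] > tbl[best]))
			best = i;
	}
	return best;
}

int main(void)
{
	unsigned int freqs[] = { 200000, 300000, 450000, 600000 };	/* kHz */

	printf("L: %u kHz\n", freqs[pick(freqs, 4, 400000, RELATION_L)]);	/* 450000 */
	printf("H: %u kHz\n", freqs[pick(freqs, 4, 400000, RELATION_H)]);	/* 300000 */
	return 0;
}
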
-
-
-static int powernow_k6_cpu_init(struct cpufreq_policy *policy)
-{
- unsigned int i, f;
- int result;
-
- if (policy->cpu != 0)
- return -ENODEV;
-
- /* get frequencies */
- max_multiplier = powernow_k6_get_cpu_multiplier();
- busfreq = cpu_khz / max_multiplier;
-
- /* table init */
- for (i = 0; (clock_ratio[i].frequency != CPUFREQ_TABLE_END); i++) {
- f = clock_ratio[i].index;
- if (f > max_multiplier)
- clock_ratio[i].frequency = CPUFREQ_ENTRY_INVALID;
- else
- clock_ratio[i].frequency = busfreq * f;
- }
-
- /* cpuinfo and default policy values */
- policy->cpuinfo.transition_latency = 200000;
- policy->cur = busfreq * max_multiplier;
-
- result = cpufreq_frequency_table_cpuinfo(policy, clock_ratio);
- if (result)
- return result;
-
- cpufreq_frequency_table_get_attr(clock_ratio, policy->cpu);
-
- return 0;
-}
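
A worked example of the bus-frequency derivation above (the 550 MHz part is an assumed sample): busfreq = cpu_khz / max_multiplier lands in 10 kHz units because the multipliers are scaled by 10:

#include <stdio.h>

int main(void)
{
	unsigned int cpu_khz = 550000;		/* assumed: a 550 MHz K6-2+ */
	unsigned int max_multiplier = 55;	/* 5.5x, scaled by 10 as in clock_ratio[] */

	unsigned int busfreq = cpu_khz / max_multiplier;	/* 10000 -> 100 MHz FSB */
	printf("busfreq %u (10 kHz units) = %u MHz\n", busfreq, busfreq / 100);
	return 0;
}
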
-
-
-static int powernow_k6_cpu_exit(struct cpufreq_policy *policy)
-{
- unsigned int i;
- for (i = 0; i < 8; i++) {
- if (i == max_multiplier)
- powernow_k6_set_state(i);
- }
- cpufreq_frequency_table_put_attr(policy->cpu);
- return 0;
-}
-
-static unsigned int powernow_k6_get(unsigned int cpu)
-{
- unsigned int ret;
- ret = (busfreq * powernow_k6_get_cpu_multiplier());
- return ret;
-}
-
-static struct freq_attr *powernow_k6_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- NULL,
-};
-
-static struct cpufreq_driver powernow_k6_driver = {
- .verify = powernow_k6_verify,
- .target = powernow_k6_target,
- .init = powernow_k6_cpu_init,
- .exit = powernow_k6_cpu_exit,
- .get = powernow_k6_get,
- .name = "powernow-k6",
- .owner = THIS_MODULE,
- .attr = powernow_k6_attr,
-};
-
-
-/**
- * powernow_k6_init - initializes the k6 PowerNow! CPUFreq driver
- *
- * Initializes the K6 PowerNow! support. Returns -ENODEV on unsupported
- * devices, -EIO or -EINVAL on problems during initialization, and zero
- * on success.
- */
-static int __init powernow_k6_init(void)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
-
- if ((c->x86_vendor != X86_VENDOR_AMD) || (c->x86 != 5) ||
- ((c->x86_model != 12) && (c->x86_model != 13)))
- return -ENODEV;
-
- if (!request_region(POWERNOW_IOPORT, 16, "PowerNow!")) {
- printk(KERN_INFO PFX "PowerNow IOPORT region already used.\n");
- return -EIO;
- }
-
- if (cpufreq_register_driver(&powernow_k6_driver)) {
- release_region(POWERNOW_IOPORT, 16);
- return -EINVAL;
- }
-
- return 0;
-}
-
-
-/**
- * powernow_k6_exit - unregisters AMD K6-2+/3+ PowerNow! support
- *
- * Unregisters AMD K6-2+ / K6-3+ PowerNow! support.
- */
-static void __exit powernow_k6_exit(void)
-{
- cpufreq_unregister_driver(&powernow_k6_driver);
- release_region(POWERNOW_IOPORT, 16);
-}
-
-
-MODULE_AUTHOR("Arjan van de Ven, Dave Jones <davej@redhat.com>, "
- "Dominik Brodowski <linux@brodo.de>");
-MODULE_DESCRIPTION("PowerNow! driver for AMD K6-2+ / K6-3+ processors.");
-MODULE_LICENSE("GPL");
-
-module_init(powernow_k6_init);
-module_exit(powernow_k6_exit);
diff --git a/arch/x86/kernel/cpu/cpufreq/powernow-k7.c b/arch/x86/kernel/cpu/cpufreq/powernow-k7.c
deleted file mode 100644
index 9a97116f89e5..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/powernow-k7.c
+++ /dev/null
@@ -1,752 +0,0 @@
-/*
- * AMD K7 Powernow driver.
- * (C) 2003 Dave Jones on behalf of SuSE Labs.
- * (C) 2003-2004 Dave Jones <davej@redhat.com>
- *
- * Licensed under the terms of the GNU GPL License version 2.
- * Based upon datasheets & sample CPUs kindly provided by AMD.
- *
- * Errata 5:
- * CPU may fail to execute a FID/VID change in presence of interrupt.
- * - We cli/sti on stepping A0 CPUs around the FID/VID transition.
- * Errata 15:
- * CPU with half frequency multipliers may hang upon wakeup from disconnect.
- * - We disable half multipliers if ACPI is used on A0 stepping CPUs.
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/moduleparam.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-#include <linux/slab.h>
-#include <linux/string.h>
-#include <linux/dmi.h>
-#include <linux/timex.h>
-#include <linux/io.h>
-
-#include <asm/timer.h> /* Needed for recalibrate_cpu_khz() */
-#include <asm/msr.h>
-#include <asm/system.h>
-
-#ifdef CONFIG_X86_POWERNOW_K7_ACPI
-#include <linux/acpi.h>
-#include <acpi/processor.h>
-#endif
-
-#include "powernow-k7.h"
-
-#define PFX "powernow: "
-
-
-struct psb_s {
- u8 signature[10];
- u8 tableversion;
- u8 flags;
- u16 settlingtime;
- u8 reserved1;
- u8 numpst;
-};
-
-struct pst_s {
- u32 cpuid;
- u8 fsbspeed;
- u8 maxfid;
- u8 startvid;
- u8 numpstates;
-};
-
-#ifdef CONFIG_X86_POWERNOW_K7_ACPI
-union powernow_acpi_control_t {
- struct {
- unsigned long fid:5,
- vid:5,
- sgtc:20,
- res1:2;
- } bits;
- unsigned long val;
-};
-#endif
-
-#ifdef CONFIG_CPU_FREQ_DEBUG
-/* divide by 1000 to get VCore voltage in V. */
-static const int mobile_vid_table[32] = {
- 2000, 1950, 1900, 1850, 1800, 1750, 1700, 1650,
- 1600, 1550, 1500, 1450, 1400, 1350, 1300, 0,
- 1275, 1250, 1225, 1200, 1175, 1150, 1125, 1100,
- 1075, 1050, 1025, 1000, 975, 950, 925, 0,
-};
-#endif
-
-/* divide by 10 to get FID. */
-static const int fid_codes[32] = {
- 110, 115, 120, 125, 50, 55, 60, 65,
- 70, 75, 80, 85, 90, 95, 100, 105,
- 30, 190, 40, 200, 130, 135, 140, 210,
- 150, 225, 160, 165, 170, 180, -1, -1,
-};
-
-/* This parameter is used to force ACPI instead of the legacy method for
- * configuration purposes.
- */
-
-static int acpi_force;
-
-static struct cpufreq_frequency_table *powernow_table;
-
-static unsigned int can_scale_bus;
-static unsigned int can_scale_vid;
-static unsigned int minimum_speed = -1;
-static unsigned int maximum_speed;
-static unsigned int number_scales;
-static unsigned int fsb;
-static unsigned int latency;
-static char have_a0;
-
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "powernow-k7", msg)
-
-static int check_fsb(unsigned int fsbspeed)
-{
- int delta;
- unsigned int f = fsb / 1000;
-
- delta = (fsbspeed > f) ? fsbspeed - f : f - fsbspeed;
- return delta < 5;
-}
-
-static int check_powernow(void)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
- unsigned int maxei, eax, ebx, ecx, edx;
-
- if ((c->x86_vendor != X86_VENDOR_AMD) || (c->x86 != 6)) {
-#ifdef MODULE
- printk(KERN_INFO PFX "This module only works with "
- "AMD K7 CPUs\n");
-#endif
- return 0;
- }
-
- /* Get maximum capabilities */
- maxei = cpuid_eax(0x80000000);
- if (maxei < 0x80000007) { /* Any powernow info ? */
-#ifdef MODULE
- printk(KERN_INFO PFX "No powernow capabilities detected\n");
-#endif
- return 0;
- }
-
- if ((c->x86_model == 6) && (c->x86_mask == 0)) {
- printk(KERN_INFO PFX "K7 660[A0] core detected, "
- "enabling errata workarounds\n");
- have_a0 = 1;
- }
-
- cpuid(0x80000007, &eax, &ebx, &ecx, &edx);
-
- /* Check we can actually do something before we say anything.*/
- if (!(edx & (1 << 1 | 1 << 2)))
- return 0;
-
- printk(KERN_INFO PFX "PowerNOW! Technology present. Can scale: ");
-
- if (edx & 1 << 1) {
- printk("frequency");
- can_scale_bus = 1;
- }
-
- if ((edx & (1 << 1 | 1 << 2)) == 0x6)
- printk(" and ");
-
- if (edx & 1 << 2) {
- printk("voltage");
- can_scale_vid = 1;
- }
-
- printk(".\n");
- return 1;
-}
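
The capability test above boils down to two EDX bits of CPUID leaf 0x80000007; a minimal sketch with an assumed sample value:

#include <stdio.h>

int main(void)
{
	unsigned int edx = 0x6;		/* assumed sample CPUID 0x80000007 EDX */

	if (edx & (1 << 1))		/* frequency ID control */
		printf("can scale frequency\n");
	if (edx & (1 << 2))		/* voltage ID control */
		printf("can scale voltage\n");
	return 0;
}
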
-
-#ifdef CONFIG_X86_POWERNOW_K7_ACPI
-static void invalidate_entry(unsigned int entry)
-{
- powernow_table[entry].frequency = CPUFREQ_ENTRY_INVALID;
-}
-#endif
-
-static int get_ranges(unsigned char *pst)
-{
- unsigned int j;
- unsigned int speed;
- u8 fid, vid;
-
- powernow_table = kzalloc((sizeof(struct cpufreq_frequency_table) *
- (number_scales + 1)), GFP_KERNEL);
- if (!powernow_table)
- return -ENOMEM;
-
- for (j = 0 ; j < number_scales; j++) {
- fid = *pst++;
-
- powernow_table[j].frequency = (fsb * fid_codes[fid]) / 10;
- powernow_table[j].index = fid; /* lower 8 bits */
-
- speed = powernow_table[j].frequency;
-
- if ((fid_codes[fid] % 10) == 5) {
-#ifdef CONFIG_X86_POWERNOW_K7_ACPI
- if (have_a0 == 1)
- invalidate_entry(j);
-#endif
- }
-
- if (speed < minimum_speed)
- minimum_speed = speed;
- if (speed > maximum_speed)
- maximum_speed = speed;
-
- vid = *pst++;
- powernow_table[j].index |= (vid << 8); /* upper 8 bits */
-
- dprintk(" FID: 0x%x (%d.%dx [%dMHz]) "
- "VID: 0x%x (%d.%03dV)\n", fid, fid_codes[fid] / 10,
- fid_codes[fid] % 10, speed/1000, vid,
- mobile_vid_table[vid]/1000,
- mobile_vid_table[vid]%1000);
- }
- powernow_table[number_scales].frequency = CPUFREQ_TABLE_END;
- powernow_table[number_scales].index = 0;
-
- return 0;
-}
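
The fid/vid packing used above (fid in the low 8 bits of the table index, vid in the upper 8) round-trips like this; the sample codes are assumptions:

#include <stdio.h>

int main(void)
{
	unsigned char fid = 0x02, vid = 0x11;	/* assumed sample codes */
	unsigned int index = fid | (vid << 8);	/* fid: low 8 bits, vid: upper 8 */

	printf("packed 0x%04x -> fid 0x%x, vid 0x%x\n",
	       index, index & 0xFF, (index >> 8) & 0xFF);
	return 0;
}
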
-
-
-static void change_FID(int fid)
-{
- union msr_fidvidctl fidvidctl;
-
- rdmsrl(MSR_K7_FID_VID_CTL, fidvidctl.val);
- if (fidvidctl.bits.FID != fid) {
- fidvidctl.bits.SGTC = latency;
- fidvidctl.bits.FID = fid;
- fidvidctl.bits.VIDC = 0;
- fidvidctl.bits.FIDC = 1;
- wrmsrl(MSR_K7_FID_VID_CTL, fidvidctl.val);
- }
-}
-
-
-static void change_VID(int vid)
-{
- union msr_fidvidctl fidvidctl;
-
- rdmsrl(MSR_K7_FID_VID_CTL, fidvidctl.val);
- if (fidvidctl.bits.VID != vid) {
- fidvidctl.bits.SGTC = latency;
- fidvidctl.bits.VID = vid;
- fidvidctl.bits.FIDC = 0;
- fidvidctl.bits.VIDC = 1;
- wrmsrl(MSR_K7_FID_VID_CTL, fidvidctl.val);
- }
-}
-
-
-static void change_speed(unsigned int index)
-{
- u8 fid, vid;
- struct cpufreq_freqs freqs;
- union msr_fidvidstatus fidvidstatus;
- int cfid;
-
- /* fid are the lower 8 bits of the index we stored into
- * the cpufreq frequency table in powernow_decode_bios,
- * vid are the upper 8 bits.
- */
-
- fid = powernow_table[index].index & 0xFF;
- vid = (powernow_table[index].index & 0xFF00) >> 8;
-
- freqs.cpu = 0;
-
- rdmsrl(MSR_K7_FID_VID_STATUS, fidvidstatus.val);
- cfid = fidvidstatus.bits.CFID;
- freqs.old = fsb * fid_codes[cfid] / 10;
-
- freqs.new = powernow_table[index].frequency;
-
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
-
- /* Now do the magic poking into the MSRs. */
-
- if (have_a0 == 1) /* A0 errata 5 */
- local_irq_disable();
-
- if (freqs.old > freqs.new) {
- /* Going down, so change FID first */
- change_FID(fid);
- change_VID(vid);
- } else {
- /* Going up, so change VID first */
- change_VID(vid);
- change_FID(fid);
- }
-
-
- if (have_a0 == 1)
- local_irq_enable();
-
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
-}
-
-
-#ifdef CONFIG_X86_POWERNOW_K7_ACPI
-
-static struct acpi_processor_performance *acpi_processor_perf;
-
-static int powernow_acpi_init(void)
-{
- int i;
- int retval = 0;
- union powernow_acpi_control_t pc;
-
- if (acpi_processor_perf != NULL && powernow_table != NULL) {
- retval = -EINVAL;
- goto err0;
- }
-
- acpi_processor_perf = kzalloc(sizeof(struct acpi_processor_performance),
- GFP_KERNEL);
- if (!acpi_processor_perf) {
- retval = -ENOMEM;
- goto err0;
- }
-
- if (!zalloc_cpumask_var(&acpi_processor_perf->shared_cpu_map,
- GFP_KERNEL)) {
- retval = -ENOMEM;
- goto err05;
- }
-
- if (acpi_processor_register_performance(acpi_processor_perf, 0)) {
- retval = -EIO;
- goto err1;
- }
-
- if (acpi_processor_perf->control_register.space_id !=
- ACPI_ADR_SPACE_FIXED_HARDWARE) {
- retval = -ENODEV;
- goto err2;
- }
-
- if (acpi_processor_perf->status_register.space_id !=
- ACPI_ADR_SPACE_FIXED_HARDWARE) {
- retval = -ENODEV;
- goto err2;
- }
-
- number_scales = acpi_processor_perf->state_count;
-
- if (number_scales < 2) {
- retval = -ENODEV;
- goto err2;
- }
-
- powernow_table = kzalloc((sizeof(struct cpufreq_frequency_table) *
- (number_scales + 1)), GFP_KERNEL);
- if (!powernow_table) {
- retval = -ENOMEM;
- goto err2;
- }
-
- pc.val = (unsigned long) acpi_processor_perf->states[0].control;
- for (i = 0; i < number_scales; i++) {
- u8 fid, vid;
- struct acpi_processor_px *state =
- &acpi_processor_perf->states[i];
- unsigned int speed, speed_mhz;
-
- pc.val = (unsigned long) state->control;
- dprintk("acpi: P%d: %d MHz %d mW %d uS control %08x SGTC %d\n",
- i,
- (u32) state->core_frequency,
- (u32) state->power,
- (u32) state->transition_latency,
- (u32) state->control,
- pc.bits.sgtc);
-
- vid = pc.bits.vid;
- fid = pc.bits.fid;
-
- powernow_table[i].frequency = fsb * fid_codes[fid] / 10;
- powernow_table[i].index = fid; /* lower 8 bits */
- powernow_table[i].index |= (vid << 8); /* upper 8 bits */
-
- speed = powernow_table[i].frequency;
- speed_mhz = speed / 1000;
-
- /* processor_perflib will multiply the MHz value by 1000 to
- * get a KHz value (e.g. 1266000). However, powernow-k7 works
- * with true KHz values (e.g. 1266768). To ensure that all
- * powernow frequencies are available, we must ensure that
- * ACPI doesn't restrict them, so we round up the MHz value
- * to ensure that perflib's computed KHz value is greater than
- * or equal to powernow's KHz value.
- */
- if (speed % 1000 > 0)
- speed_mhz++;
-
- if ((fid_codes[fid] % 10) == 5) {
- if (have_a0 == 1)
- invalidate_entry(i);
- }
-
- dprintk(" FID: 0x%x (%d.%dx [%dMHz]) "
- "VID: 0x%x (%d.%03dV)\n", fid, fid_codes[fid] / 10,
- fid_codes[fid] % 10, speed_mhz, vid,
- mobile_vid_table[vid]/1000,
- mobile_vid_table[vid]%1000);
-
- if (state->core_frequency != speed_mhz) {
- state->core_frequency = speed_mhz;
- dprintk(" Corrected ACPI frequency to %d\n",
- speed_mhz);
- }
-
- if (latency < pc.bits.sgtc)
- latency = pc.bits.sgtc;
-
- if (speed < minimum_speed)
- minimum_speed = speed;
- if (speed > maximum_speed)
- maximum_speed = speed;
- }
-
- powernow_table[i].frequency = CPUFREQ_TABLE_END;
- powernow_table[i].index = 0;
-
- /* notify BIOS that we exist */
- acpi_processor_notify_smm(THIS_MODULE);
-
- return 0;
-
-err2:
- acpi_processor_unregister_performance(acpi_processor_perf, 0);
-err1:
- free_cpumask_var(acpi_processor_perf->shared_cpu_map);
-err05:
- kfree(acpi_processor_perf);
-err0:
- printk(KERN_WARNING PFX "ACPI perflib can not be used on "
- "this platform\n");
- acpi_processor_perf = NULL;
- return retval;
-}
-#else
-static int powernow_acpi_init(void)
-{
-	printk(KERN_INFO PFX "no ACPI processor support found."
-		" Please recompile your kernel with ACPI processor support\n");
- return -EINVAL;
-}
-#endif
-
-static void print_pst_entry(struct pst_s *pst, unsigned int j)
-{
- dprintk("PST:%d (@%p)\n", j, pst);
- dprintk(" cpuid: 0x%x fsb: %d maxFID: 0x%x startvid: 0x%x\n",
- pst->cpuid, pst->fsbspeed, pst->maxfid, pst->startvid);
-}
-
-static int powernow_decode_bios(int maxfid, int startvid)
-{
- struct psb_s *psb;
- struct pst_s *pst;
- unsigned int i, j;
- unsigned char *p;
- unsigned int etuple;
- unsigned int ret;
-
- etuple = cpuid_eax(0x80000001);
-
- for (i = 0xC0000; i < 0xffff0 ; i += 16) {
-
- p = phys_to_virt(i);
-
- if (memcmp(p, "AMDK7PNOW!", 10) == 0) {
- dprintk("Found PSB header at %p\n", p);
- psb = (struct psb_s *) p;
- dprintk("Table version: 0x%x\n", psb->tableversion);
- if (psb->tableversion != 0x12) {
- printk(KERN_INFO PFX "Sorry, only v1.2 tables"
- " supported right now\n");
- return -ENODEV;
- }
-
- dprintk("Flags: 0x%x\n", psb->flags);
- if ((psb->flags & 1) == 0)
- dprintk("Mobile voltage regulator\n");
- else
- dprintk("Desktop voltage regulator\n");
-
- latency = psb->settlingtime;
- if (latency < 100) {
- printk(KERN_INFO PFX "BIOS set settling time "
- "to %d microseconds. "
- "Should be at least 100. "
- "Correcting.\n", latency);
- latency = 100;
- }
- dprintk("Settling Time: %d microseconds.\n",
- psb->settlingtime);
- dprintk("Has %d PST tables. (Only dumping ones "
- "relevant to this CPU).\n",
- psb->numpst);
-
- p += sizeof(struct psb_s);
-
- pst = (struct pst_s *) p;
-
- for (j = 0; j < psb->numpst; j++) {
- pst = (struct pst_s *) p;
- number_scales = pst->numpstates;
-
- if ((etuple == pst->cpuid) &&
- check_fsb(pst->fsbspeed) &&
- (maxfid == pst->maxfid) &&
- (startvid == pst->startvid)) {
- print_pst_entry(pst, j);
- p = (char *)pst + sizeof(struct pst_s);
- ret = get_ranges(p);
- return ret;
- } else {
- unsigned int k;
- p = (char *)pst + sizeof(struct pst_s);
- for (k = 0; k < number_scales; k++)
- p += 2;
- }
- }
- printk(KERN_INFO PFX "No PST tables match this cpuid "
- "(0x%x)\n", etuple);
- printk(KERN_INFO PFX "This is indicative of a broken "
- "BIOS.\n");
-
- return -EINVAL;
- }
- p++;
- }
-
- return -ENODEV;
-}
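
The BIOS scan above looks for the "AMDK7PNOW!" signature in the 0xC0000-0xFFFF0 window; a simplified in-memory sketch of the same search (the buffer contents are assumed, and the real loop walks the window in 16-byte steps):

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* assumed stand-in for the mapped BIOS window */
	static const unsigned char bios[64] = "......AMDK7PNOW!........";

	for (size_t i = 0; i + 10 <= sizeof(bios); i++) {
		if (memcmp(bios + i, "AMDK7PNOW!", 10) == 0) {
			printf("found PSB header at offset %zu\n", i);
			break;
		}
	}
	return 0;
}
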
-
-
-static int powernow_target(struct cpufreq_policy *policy,
- unsigned int target_freq,
- unsigned int relation)
-{
- unsigned int newstate;
-
- if (cpufreq_frequency_table_target(policy, powernow_table, target_freq,
- relation, &newstate))
- return -EINVAL;
-
- change_speed(newstate);
-
- return 0;
-}
-
-
-static int powernow_verify(struct cpufreq_policy *policy)
-{
- return cpufreq_frequency_table_verify(policy, powernow_table);
-}
-
-/*
- * We rely on the bus frequency being (roughly) a multiple of
- * 100000/3 kHz, and compute the SGTC from that multiple.
- * That way we more closely match how AMD intends this to work,
- * and get the same kind of behaviour already tested under
- * the "well-known" other OS.
- */
-static int __init fixup_sgtc(void)
-{
- unsigned int sgtc;
- unsigned int m;
-
- m = fsb / 3333;
- if ((m % 10) >= 5)
- m += 5;
-
- m /= 10;
-
- sgtc = 100 * m * latency;
- sgtc = sgtc / 3;
- if (sgtc > 0xfffff) {
- printk(KERN_WARNING PFX "SGTC too large %d\n", sgtc);
- sgtc = 0xfffff;
- }
- return sgtc;
-}
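
A worked run of fixup_sgtc() with assumed inputs: a 100 MHz bus (fsb = 100000 kHz) gives m = 30, which rounds and divides to 3, so with the minimum 100 us settling time SGTC = 100 * 3 * 100 / 3 = 10000:

#include <stdio.h>

int main(void)
{
	unsigned int fsb = 100000;	/* assumed: bus clock in kHz (100 MHz) */
	unsigned int latency = 100;	/* assumed: settling time in microseconds */
	unsigned int m, sgtc;

	m = fsb / 3333;			/* multiples of 100000/3 kHz, scaled by 10 */
	if ((m % 10) >= 5)
		m += 5;			/* round to the nearest multiple */
	m /= 10;

	sgtc = 100 * m * latency / 3;
	if (sgtc > 0xfffff)		/* SGTC is a 20-bit field */
		sgtc = 0xfffff;
	printf("fsb %u kHz, latency %u us -> SGTC %u\n", fsb, latency, sgtc);
	return 0;
}
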
-
-static unsigned int powernow_get(unsigned int cpu)
-{
- union msr_fidvidstatus fidvidstatus;
- unsigned int cfid;
-
- if (cpu)
- return 0;
- rdmsrl(MSR_K7_FID_VID_STATUS, fidvidstatus.val);
- cfid = fidvidstatus.bits.CFID;
-
- return fsb * fid_codes[cfid] / 10;
-}
-
-
-static int __init acer_cpufreq_pst(const struct dmi_system_id *d)
-{
- printk(KERN_WARNING PFX
- "%s laptop with broken PST tables in BIOS detected.\n",
- d->ident);
- printk(KERN_WARNING PFX
- "You need to downgrade to 3A21 (09/09/2002), or try a newer "
- "BIOS than 3A71 (01/20/2003)\n");
- printk(KERN_WARNING PFX
- "cpufreq scaling has been disabled as a result of this.\n");
- return 0;
-}
-
-/*
- * Some Athlon laptops have hopelessly broken PST tables.
- * A BIOS update is all that can save them.
- * Mention this, and disable cpufreq.
- */
-static struct dmi_system_id __initdata powernow_dmi_table[] = {
- {
- .callback = acer_cpufreq_pst,
- .ident = "Acer Aspire",
- .matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "Insyde Software"),
- DMI_MATCH(DMI_BIOS_VERSION, "3A71"),
- },
- },
- { }
-};
-
-static int __init powernow_cpu_init(struct cpufreq_policy *policy)
-{
- union msr_fidvidstatus fidvidstatus;
- int result;
-
- if (policy->cpu != 0)
- return -ENODEV;
-
- rdmsrl(MSR_K7_FID_VID_STATUS, fidvidstatus.val);
-
- recalibrate_cpu_khz();
-
- fsb = (10 * cpu_khz) / fid_codes[fidvidstatus.bits.CFID];
- if (!fsb) {
- printk(KERN_WARNING PFX "can not determine bus frequency\n");
- return -EINVAL;
- }
- dprintk("FSB: %3dMHz\n", fsb/1000);
-
- if (dmi_check_system(powernow_dmi_table) || acpi_force) {
- printk(KERN_INFO PFX "PSB/PST known to be broken. "
- "Trying ACPI instead\n");
- result = powernow_acpi_init();
- } else {
- result = powernow_decode_bios(fidvidstatus.bits.MFID,
- fidvidstatus.bits.SVID);
- if (result) {
- printk(KERN_INFO PFX "Trying ACPI perflib\n");
- maximum_speed = 0;
- minimum_speed = -1;
- latency = 0;
- result = powernow_acpi_init();
- if (result) {
- printk(KERN_INFO PFX
- "ACPI and legacy methods failed\n");
- }
- } else {
- /* SGTC use the bus clock as timer */
- latency = fixup_sgtc();
- printk(KERN_INFO PFX "SGTC: %d\n", latency);
- }
- }
-
- if (result)
- return result;
-
- printk(KERN_INFO PFX "Minimum speed %d MHz. Maximum speed %d MHz.\n",
- minimum_speed/1000, maximum_speed/1000);
-
- policy->cpuinfo.transition_latency =
- cpufreq_scale(2000000UL, fsb, latency);
-
- policy->cur = powernow_get(0);
-
- cpufreq_frequency_table_get_attr(powernow_table, policy->cpu);
-
- return cpufreq_frequency_table_cpuinfo(policy, powernow_table);
-}
-
-static int powernow_cpu_exit(struct cpufreq_policy *policy)
-{
- cpufreq_frequency_table_put_attr(policy->cpu);
-
-#ifdef CONFIG_X86_POWERNOW_K7_ACPI
- if (acpi_processor_perf) {
- acpi_processor_unregister_performance(acpi_processor_perf, 0);
- free_cpumask_var(acpi_processor_perf->shared_cpu_map);
- kfree(acpi_processor_perf);
- }
-#endif
-
- kfree(powernow_table);
- return 0;
-}
-
-static struct freq_attr *powernow_table_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- NULL,
-};
-
-static struct cpufreq_driver powernow_driver = {
- .verify = powernow_verify,
- .target = powernow_target,
- .get = powernow_get,
-#ifdef CONFIG_X86_POWERNOW_K7_ACPI
- .bios_limit = acpi_processor_get_bios_limit,
-#endif
- .init = powernow_cpu_init,
- .exit = powernow_cpu_exit,
- .name = "powernow-k7",
- .owner = THIS_MODULE,
- .attr = powernow_table_attr,
-};
-
-static int __init powernow_init(void)
-{
- if (check_powernow() == 0)
- return -ENODEV;
- return cpufreq_register_driver(&powernow_driver);
-}
-
-
-static void __exit powernow_exit(void)
-{
- cpufreq_unregister_driver(&powernow_driver);
-}
-
-module_param(acpi_force, int, 0444);
-MODULE_PARM_DESC(acpi_force, "Force ACPI to be used.");
-
-MODULE_AUTHOR("Dave Jones <davej@redhat.com>");
-MODULE_DESCRIPTION("Powernow driver for AMD K7 processors.");
-MODULE_LICENSE("GPL");
-
-late_initcall(powernow_init);
-module_exit(powernow_exit);
-
diff --git a/arch/x86/kernel/cpu/cpufreq/powernow-k7.h b/arch/x86/kernel/cpu/cpufreq/powernow-k7.h
deleted file mode 100644
index 35fb4eaf6e1c..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/powernow-k7.h
+++ /dev/null
@@ -1,43 +0,0 @@
-/*
- * (C) 2003 Dave Jones.
- *
- * Licensed under the terms of the GNU GPL License version 2.
- *
- * AMD-specific information
- *
- */
-
-union msr_fidvidctl {
- struct {
- unsigned FID:5, // 4:0
- reserved1:3, // 7:5
- VID:5, // 12:8
- reserved2:3, // 15:13
- FIDC:1, // 16
- VIDC:1, // 17
- reserved3:2, // 19:18
- FIDCHGRATIO:1, // 20
- reserved4:11, // 31-21
- SGTC:20, // 32:51
- reserved5:12; // 63:52
- } bits;
- unsigned long long val;
-};
-
-union msr_fidvidstatus {
- struct {
- unsigned CFID:5, // 4:0
- reserved1:3, // 7:5
- SFID:5, // 12:8
- reserved2:3, // 15:13
- MFID:5, // 20:16
- reserved3:11, // 31:21
- CVID:5, // 36:32
- reserved4:3, // 39:37
- SVID:5, // 44:40
- reserved5:3, // 47:45
- MVID:5, // 52:48
- reserved6:11; // 63:53
- } bits;
- unsigned long long val;
-};
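
A userspace sketch of decoding a raw FID/VID status MSR value through the union above; the raw value is an assumed sample and the layout is copied from the definition (unsigned long long bitfields are a common compiler extension):

#include <stdio.h>

/* minimal copy of union msr_fidvidstatus above, for a standalone demo */
union msr_fidvidstatus {
	struct {
		unsigned long long CFID:5, r1:3, SFID:5, r2:3, MFID:5, r3:11,
				   CVID:5, r4:3, SVID:5, r5:3, MVID:5, r6:11;
	} bits;
	unsigned long long val;
};

int main(void)
{
	union msr_fidvidstatus st;

	st.val = 0x000C0004ULL;		/* assumed sample raw MSR value */
	printf("CFID 0x%llx, MFID 0x%llx\n",
	       (unsigned long long)st.bits.CFID,
	       (unsigned long long)st.bits.MFID);
	return 0;
}
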
diff --git a/arch/x86/kernel/cpu/cpufreq/powernow-k8.c b/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
deleted file mode 100644
index 3e90cce3dc8b..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/powernow-k8.c
+++ /dev/null
@@ -1,1599 +0,0 @@
-/*
- * (c) 2003-2010 Advanced Micro Devices, Inc.
- * Your use of this code is subject to the terms and conditions of the
- * GNU general public license version 2. See "COPYING" or
- * http://www.gnu.org/licenses/gpl.html
- *
- * Support : mark.langsdorf@amd.com
- *
- * Based on the powernow-k7.c module written by Dave Jones.
- * (C) 2003 Dave Jones on behalf of SuSE Labs
- * (C) 2004 Dominik Brodowski <linux@brodo.de>
- * (C) 2004 Pavel Machek <pavel@suse.cz>
- * Licensed under the terms of the GNU GPL License version 2.
- * Based upon datasheets & sample CPUs kindly provided by AMD.
- *
- * Valuable input gratefully received from Dave Jones, Pavel Machek,
- * Dominik Brodowski, Jacob Shin, and others.
- * Originally developed by Paul Devriendt.
- * Processor information obtained from Chapter 9 (Power and Thermal Management)
- * of the "BIOS and Kernel Developer's Guide for the AMD Athlon 64 and AMD
- * Opteron Processors" available for download from www.amd.com
- *
- * Tables for specific CPUs can be inferred from
- * http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/30430.pdf
- */
-
-#include <linux/kernel.h>
-#include <linux/smp.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-#include <linux/slab.h>
-#include <linux/string.h>
-#include <linux/cpumask.h>
-#include <linux/sched.h> /* for current / set_cpus_allowed() */
-#include <linux/io.h>
-#include <linux/delay.h>
-
-#include <asm/msr.h>
-
-#include <linux/acpi.h>
-#include <linux/mutex.h>
-#include <acpi/processor.h>
-
-#define PFX "powernow-k8: "
-#define VERSION "version 2.20.00"
-#include "powernow-k8.h"
-#include "mperf.h"
-
-/* serialize freq changes */
-static DEFINE_MUTEX(fidvid_mutex);
-
-static DEFINE_PER_CPU(struct powernow_k8_data *, powernow_data);
-
-static int cpu_family = CPU_OPTERON;
-
-/* core performance boost */
-static bool cpb_capable, cpb_enabled;
-static struct msr __percpu *msrs;
-
-static struct cpufreq_driver cpufreq_amd64_driver;
-
-#ifndef CONFIG_SMP
-static inline const struct cpumask *cpu_core_mask(int cpu)
-{
- return cpumask_of(0);
-}
-#endif
-
-/* Return a frequency in MHz, given an input fid */
-static u32 find_freq_from_fid(u32 fid)
-{
- return 800 + (fid * 100);
-}
-
-/* Return a frequency in KHz, given an input fid */
-static u32 find_khz_freq_from_fid(u32 fid)
-{
- return 1000 * find_freq_from_fid(fid);
-}
-
-static u32 find_khz_freq_from_pstate(struct cpufreq_frequency_table *data,
- u32 pstate)
-{
- return data[pstate].frequency;
-}
-
-/* Return the vco fid for an input fid
- *
- * Each "low" fid has a corresponding "high" fid, and you can get to "low"
- * fids only from the corresponding high fids. This returns the "high" fid
- * corresponding to a "low" one.
- */
-static u32 convert_fid_to_vco_fid(u32 fid)
-{
- if (fid < HI_FID_TABLE_BOTTOM)
- return 8 + (2 * fid);
- else
- return fid;
-}
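
To make the fid arithmetic concrete, a sketch of both helpers above; HI_FID_TABLE_BOTTOM is defined in powernow-k8.h, which is not part of this hunk, so the value here is an assumption:

#include <stdio.h>

#define HI_FID_TABLE_BOTTOM 0x08	/* assumed; the real value lives in powernow-k8.h */

static unsigned int find_freq_from_fid(unsigned int fid)
{
	return 800 + (fid * 100);	/* MHz, as above */
}

static unsigned int convert_fid_to_vco_fid(unsigned int fid)
{
	return (fid < HI_FID_TABLE_BOTTOM) ? 8 + (2 * fid) : fid;
}

int main(void)
{
	/* a "low" fid maps up into the vco range; a "high" fid maps to itself */
	printf("fid 0x02: %u MHz, vco fid 0x%x\n",
	       find_freq_from_fid(0x02), convert_fid_to_vco_fid(0x02));
	printf("fid 0x0a: %u MHz, vco fid 0x%x\n",
	       find_freq_from_fid(0x0a), convert_fid_to_vco_fid(0x0a));
	return 0;
}
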
-
-/*
- * Return 1 if the pending bit is set. Unless we just instructed the processor
- * to transition to a new state, seeing this bit set is really bad news.
- */
-static int pending_bit_stuck(void)
-{
- u32 lo, hi;
-
- if (cpu_family == CPU_HW_PSTATE)
- return 0;
-
- rdmsr(MSR_FIDVID_STATUS, lo, hi);
- return lo & MSR_S_LO_CHANGE_PENDING ? 1 : 0;
-}
-
-/*
- * Update the global current fid / vid values from the status msr.
- * Returns 1 on error.
- */
-static int query_current_values_with_pending_wait(struct powernow_k8_data *data)
-{
- u32 lo, hi;
- u32 i = 0;
-
- if (cpu_family == CPU_HW_PSTATE) {
- rdmsr(MSR_PSTATE_STATUS, lo, hi);
- i = lo & HW_PSTATE_MASK;
- data->currpstate = i;
-
- /*
- * a workaround for family 11h erratum 311 might cause
-		 * an "out-of-range" Pstate if the core is in Pstate-0
- */
- if ((boot_cpu_data.x86 == 0x11) && (i >= data->numps))
- data->currpstate = HW_PSTATE_0;
-
- return 0;
- }
- do {
- if (i++ > 10000) {
- dprintk("detected change pending stuck\n");
- return 1;
- }
- rdmsr(MSR_FIDVID_STATUS, lo, hi);
- } while (lo & MSR_S_LO_CHANGE_PENDING);
-
- data->currvid = hi & MSR_S_HI_CURRENT_VID;
- data->currfid = lo & MSR_S_LO_CURRENT_FID;
-
- return 0;
-}
-
-/* the isochronous relief time */
-static void count_off_irt(struct powernow_k8_data *data)
-{
- udelay((1 << data->irt) * 10);
- return;
-}
-
-/* the voltage stabilization time */
-static void count_off_vst(struct powernow_k8_data *data)
-{
- udelay(data->vstable * VST_UNITS_20US);
- return;
-}
-
-/* need to init the control msr to a safe value (for each cpu) */
-static void fidvid_msr_init(void)
-{
- u32 lo, hi;
- u8 fid, vid;
-
- rdmsr(MSR_FIDVID_STATUS, lo, hi);
- vid = hi & MSR_S_HI_CURRENT_VID;
- fid = lo & MSR_S_LO_CURRENT_FID;
- lo = fid | (vid << MSR_C_LO_VID_SHIFT);
- hi = MSR_C_HI_STP_GNT_BENIGN;
- dprintk("cpu%d, init lo 0x%x, hi 0x%x\n", smp_processor_id(), lo, hi);
- wrmsr(MSR_FIDVID_CTL, lo, hi);
-}
-
-/* write the new fid value along with the other control fields to the msr */
-static int write_new_fid(struct powernow_k8_data *data, u32 fid)
-{
- u32 lo;
- u32 savevid = data->currvid;
- u32 i = 0;
-
- if ((fid & INVALID_FID_MASK) || (data->currvid & INVALID_VID_MASK)) {
- printk(KERN_ERR PFX "internal error - overflow on fid write\n");
- return 1;
- }
-
- lo = fid;
- lo |= (data->currvid << MSR_C_LO_VID_SHIFT);
- lo |= MSR_C_LO_INIT_FID_VID;
-
- dprintk("writing fid 0x%x, lo 0x%x, hi 0x%x\n",
- fid, lo, data->plllock * PLL_LOCK_CONVERSION);
-
- do {
- wrmsr(MSR_FIDVID_CTL, lo, data->plllock * PLL_LOCK_CONVERSION);
- if (i++ > 100) {
- printk(KERN_ERR PFX
- "Hardware error - pending bit very stuck - "
- "no further pstate changes possible\n");
- return 1;
- }
- } while (query_current_values_with_pending_wait(data));
-
- count_off_irt(data);
-
- if (savevid != data->currvid) {
- printk(KERN_ERR PFX
- "vid change on fid trans, old 0x%x, new 0x%x\n",
- savevid, data->currvid);
- return 1;
- }
-
- if (fid != data->currfid) {
- printk(KERN_ERR PFX
- "fid trans failed, fid 0x%x, curr 0x%x\n", fid,
- data->currfid);
- return 1;
- }
-
- return 0;
-}
-
-/* Write a new vid to the hardware */
-static int write_new_vid(struct powernow_k8_data *data, u32 vid)
-{
- u32 lo;
- u32 savefid = data->currfid;
- int i = 0;
-
- if ((data->currfid & INVALID_FID_MASK) || (vid & INVALID_VID_MASK)) {
- printk(KERN_ERR PFX "internal error - overflow on vid write\n");
- return 1;
- }
-
- lo = data->currfid;
- lo |= (vid << MSR_C_LO_VID_SHIFT);
- lo |= MSR_C_LO_INIT_FID_VID;
-
- dprintk("writing vid 0x%x, lo 0x%x, hi 0x%x\n",
- vid, lo, STOP_GRANT_5NS);
-
- do {
- wrmsr(MSR_FIDVID_CTL, lo, STOP_GRANT_5NS);
- if (i++ > 100) {
- printk(KERN_ERR PFX "internal error - pending bit "
- "very stuck - no further pstate "
- "changes possible\n");
- return 1;
- }
- } while (query_current_values_with_pending_wait(data));
-
- if (savefid != data->currfid) {
- printk(KERN_ERR PFX "fid changed on vid trans, old "
- "0x%x new 0x%x\n",
- savefid, data->currfid);
- return 1;
- }
-
- if (vid != data->currvid) {
- printk(KERN_ERR PFX "vid trans failed, vid 0x%x, "
- "curr 0x%x\n",
- vid, data->currvid);
- return 1;
- }
-
- return 0;
-}
-
-/*
- * Reduce the vid toward reqvid, stepping down by at most "step".
- * Decreasing vid codes represent increasing voltages:
- * vid of 0 is 1.550V, vid of 0x1e is 0.800V, vid of VID_OFF is off.
- */
-static int decrease_vid_code_by_step(struct powernow_k8_data *data,
- u32 reqvid, u32 step)
-{
- if ((data->currvid - reqvid) > step)
- reqvid = data->currvid - step;
-
- if (write_new_vid(data, reqvid))
- return 1;
-
- count_off_vst(data);
-
- return 0;
-}
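
The vid-to-voltage relationship stated in the comment above (vid 0 is 1.550V, vid 0x1e is 0.800V) works out to 25 mV per step, assuming a linear scale between those two endpoints:

#include <stdio.h>

int main(void)
{
	/* 1550 - 30 * 25 = 800, so each vid step is 25 mV downward */
	for (unsigned int vid = 0; vid <= 0x1e; vid += 0x0f)
		printf("vid 0x%02x -> %u mV\n", vid, 1550 - 25 * vid);
	return 0;
}
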
-
-/* Change hardware pstate by single MSR write */
-static int transition_pstate(struct powernow_k8_data *data, u32 pstate)
-{
- wrmsr(MSR_PSTATE_CTRL, pstate, 0);
- data->currpstate = pstate;
- return 0;
-}
-
-/* Change Opteron/Athlon64 fid and vid, by the 3 phases. */
-static int transition_fid_vid(struct powernow_k8_data *data,
- u32 reqfid, u32 reqvid)
-{
- if (core_voltage_pre_transition(data, reqvid, reqfid))
- return 1;
-
- if (core_frequency_transition(data, reqfid))
- return 1;
-
- if (core_voltage_post_transition(data, reqvid))
- return 1;
-
- if (query_current_values_with_pending_wait(data))
- return 1;
-
- if ((reqfid != data->currfid) || (reqvid != data->currvid)) {
- printk(KERN_ERR PFX "failed (cpu%d): req 0x%x 0x%x, "
- "curr 0x%x 0x%x\n",
- smp_processor_id(),
- reqfid, reqvid, data->currfid, data->currvid);
- return 1;
- }
-
- dprintk("transitioned (cpu%d): new fid 0x%x, vid 0x%x\n",
- smp_processor_id(), data->currfid, data->currvid);
-
- return 0;
-}
-
-/* Phase 1 - core voltage transition ... setup voltage */
-static int core_voltage_pre_transition(struct powernow_k8_data *data,
- u32 reqvid, u32 reqfid)
-{
- u32 rvosteps = data->rvo;
- u32 savefid = data->currfid;
- u32 maxvid, lo, rvomult = 1;
-
- dprintk("ph1 (cpu%d): start, currfid 0x%x, currvid 0x%x, "
- "reqvid 0x%x, rvo 0x%x\n",
- smp_processor_id(),
- data->currfid, data->currvid, reqvid, data->rvo);
-
- if ((savefid < LO_FID_TABLE_TOP) && (reqfid < LO_FID_TABLE_TOP))
- rvomult = 2;
- rvosteps *= rvomult;
- rdmsr(MSR_FIDVID_STATUS, lo, maxvid);
- maxvid = 0x1f & (maxvid >> 16);
- dprintk("ph1 maxvid=0x%x\n", maxvid);
- if (reqvid < maxvid) /* lower numbers are higher voltages */
- reqvid = maxvid;
-
- while (data->currvid > reqvid) {
- dprintk("ph1: curr 0x%x, req vid 0x%x\n",
- data->currvid, reqvid);
- if (decrease_vid_code_by_step(data, reqvid, data->vidmvs))
- return 1;
- }
-
- while ((rvosteps > 0) &&
- ((rvomult * data->rvo + data->currvid) > reqvid)) {
- if (data->currvid == maxvid) {
- rvosteps = 0;
- } else {
- dprintk("ph1: changing vid for rvo, req 0x%x\n",
- data->currvid - 1);
- if (decrease_vid_code_by_step(data, data->currvid-1, 1))
- return 1;
- rvosteps--;
- }
- }
-
- if (query_current_values_with_pending_wait(data))
- return 1;
-
- if (savefid != data->currfid) {
- printk(KERN_ERR PFX "ph1 err, currfid changed 0x%x\n",
- data->currfid);
- return 1;
- }
-
- dprintk("ph1 complete, currfid 0x%x, currvid 0x%x\n",
- data->currfid, data->currvid);
-
- return 0;
-}
-
-/* Phase 2 - core frequency transition */
-static int core_frequency_transition(struct powernow_k8_data *data, u32 reqfid)
-{
- u32 vcoreqfid, vcocurrfid, vcofiddiff;
- u32 fid_interval, savevid = data->currvid;
-
- if (data->currfid == reqfid) {
- printk(KERN_ERR PFX "ph2 null fid transition 0x%x\n",
- data->currfid);
- return 0;
- }
-
- dprintk("ph2 (cpu%d): starting, currfid 0x%x, currvid 0x%x, "
- "reqfid 0x%x\n",
- smp_processor_id(),
- data->currfid, data->currvid, reqfid);
-
- vcoreqfid = convert_fid_to_vco_fid(reqfid);
- vcocurrfid = convert_fid_to_vco_fid(data->currfid);
- vcofiddiff = vcocurrfid > vcoreqfid ? vcocurrfid - vcoreqfid
- : vcoreqfid - vcocurrfid;
-
- if ((reqfid <= LO_FID_TABLE_TOP) && (data->currfid <= LO_FID_TABLE_TOP))
- vcofiddiff = 0;
-
- while (vcofiddiff > 2) {
- (data->currfid & 1) ? (fid_interval = 1) : (fid_interval = 2);
-
- if (reqfid > data->currfid) {
- if (data->currfid > LO_FID_TABLE_TOP) {
- if (write_new_fid(data,
- data->currfid + fid_interval))
- return 1;
- } else {
- if (write_new_fid
- (data,
- 2 + convert_fid_to_vco_fid(data->currfid)))
- return 1;
- }
- } else {
- if (write_new_fid(data, data->currfid - fid_interval))
- return 1;
- }
-
- vcocurrfid = convert_fid_to_vco_fid(data->currfid);
- vcofiddiff = vcocurrfid > vcoreqfid ? vcocurrfid - vcoreqfid
- : vcoreqfid - vcocurrfid;
- }
-
- if (write_new_fid(data, reqfid))
- return 1;
-
- if (query_current_values_with_pending_wait(data))
- return 1;
-
- if (data->currfid != reqfid) {
- printk(KERN_ERR PFX
- "ph2: mismatch, failed fid transition, "
- "curr 0x%x, req 0x%x\n",
- data->currfid, reqfid);
- return 1;
- }
-
- if (savevid != data->currvid) {
- printk(KERN_ERR PFX "ph2: vid changed, save 0x%x, curr 0x%x\n",
- savevid, data->currvid);
- return 1;
- }
-
- dprintk("ph2 complete, currfid 0x%x, currvid 0x%x\n",
- data->currfid, data->currvid);
-
- return 0;
-}
-
-/* Phase 3 - core voltage transition flow ... jump to the final vid. */
-static int core_voltage_post_transition(struct powernow_k8_data *data,
- u32 reqvid)
-{
- u32 savefid = data->currfid;
- u32 savereqvid = reqvid;
-
- dprintk("ph3 (cpu%d): starting, currfid 0x%x, currvid 0x%x\n",
- smp_processor_id(),
- data->currfid, data->currvid);
-
- if (reqvid != data->currvid) {
- if (write_new_vid(data, reqvid))
- return 1;
-
- if (savefid != data->currfid) {
- printk(KERN_ERR PFX
- "ph3: bad fid change, save 0x%x, curr 0x%x\n",
- savefid, data->currfid);
- return 1;
- }
-
- if (data->currvid != reqvid) {
- printk(KERN_ERR PFX
-			       "ph3: failed vid transition, "
-			       "req 0x%x, curr 0x%x\n",
- reqvid, data->currvid);
- return 1;
- }
- }
-
- if (query_current_values_with_pending_wait(data))
- return 1;
-
- if (savereqvid != data->currvid) {
- dprintk("ph3 failed, currvid 0x%x\n", data->currvid);
- return 1;
- }
-
- if (savefid != data->currfid) {
- dprintk("ph3 failed, currfid changed 0x%x\n",
- data->currfid);
- return 1;
- }
-
- dprintk("ph3 complete, currfid 0x%x, currvid 0x%x\n",
- data->currfid, data->currvid);
-
- return 0;
-}
-
-static void check_supported_cpu(void *_rc)
-{
- u32 eax, ebx, ecx, edx;
- int *rc = _rc;
-
- *rc = -ENODEV;
-
- if (current_cpu_data.x86_vendor != X86_VENDOR_AMD)
- return;
-
- eax = cpuid_eax(CPUID_PROCESSOR_SIGNATURE);
- if (((eax & CPUID_XFAM) != CPUID_XFAM_K8) &&
- ((eax & CPUID_XFAM) < CPUID_XFAM_10H))
- return;
-
- if ((eax & CPUID_XFAM) == CPUID_XFAM_K8) {
- if (((eax & CPUID_USE_XFAM_XMOD) != CPUID_USE_XFAM_XMOD) ||
- ((eax & CPUID_XMOD) > CPUID_XMOD_REV_MASK)) {
- printk(KERN_INFO PFX
- "Processor cpuid %x not supported\n", eax);
- return;
- }
-
- eax = cpuid_eax(CPUID_GET_MAX_CAPABILITIES);
- if (eax < CPUID_FREQ_VOLT_CAPABILITIES) {
- printk(KERN_INFO PFX
- "No frequency change capabilities detected\n");
- return;
- }
-
- cpuid(CPUID_FREQ_VOLT_CAPABILITIES, &eax, &ebx, &ecx, &edx);
- if ((edx & P_STATE_TRANSITION_CAPABLE)
- != P_STATE_TRANSITION_CAPABLE) {
- printk(KERN_INFO PFX
- "Power state transitions not supported\n");
- return;
- }
- } else { /* must be a HW Pstate capable processor */
- cpuid(CPUID_FREQ_VOLT_CAPABILITIES, &eax, &ebx, &ecx, &edx);
- if ((edx & USE_HW_PSTATE) == USE_HW_PSTATE)
- cpu_family = CPU_HW_PSTATE;
- else
- return;
- }
-
- *rc = 0;
-}
-
-static int check_pst_table(struct powernow_k8_data *data, struct pst_s *pst,
- u8 maxvid)
-{
- unsigned int j;
- u8 lastfid = 0xff;
-
- for (j = 0; j < data->numps; j++) {
- if (pst[j].vid > LEAST_VID) {
- printk(KERN_ERR FW_BUG PFX "vid %d invalid : 0x%x\n",
- j, pst[j].vid);
- return -EINVAL;
- }
- if (pst[j].vid < data->rvo) {
- /* vid + rvo >= 0 */
- printk(KERN_ERR FW_BUG PFX "0 vid exceeded with pstate"
- " %d\n", j);
- return -ENODEV;
- }
- if (pst[j].vid < maxvid + data->rvo) {
- /* vid + rvo >= maxvid */
- printk(KERN_ERR FW_BUG PFX "maxvid exceeded with pstate"
- " %d\n", j);
- return -ENODEV;
- }
- if (pst[j].fid > MAX_FID) {
- printk(KERN_ERR FW_BUG PFX "maxfid exceeded with pstate"
- " %d\n", j);
- return -ENODEV;
- }
- if (j && (pst[j].fid < HI_FID_TABLE_BOTTOM)) {
- /* Only first fid is allowed to be in "low" range */
- printk(KERN_ERR FW_BUG PFX "two low fids - %d : "
- "0x%x\n", j, pst[j].fid);
- return -EINVAL;
- }
- if (pst[j].fid < lastfid)
- lastfid = pst[j].fid;
- }
- if (lastfid & 1) {
- printk(KERN_ERR FW_BUG PFX "lastfid invalid\n");
- return -EINVAL;
- }
- if (lastfid > LO_FID_TABLE_TOP)
- printk(KERN_INFO FW_BUG PFX
- "first fid not from lo freq table\n");
-
- return 0;
-}
-
-static void invalidate_entry(struct cpufreq_frequency_table *powernow_table,
- unsigned int entry)
-{
- powernow_table[entry].frequency = CPUFREQ_ENTRY_INVALID;
-}
-
-static void print_basics(struct powernow_k8_data *data)
-{
- int j;
- for (j = 0; j < data->numps; j++) {
- if (data->powernow_table[j].frequency !=
- CPUFREQ_ENTRY_INVALID) {
- if (cpu_family == CPU_HW_PSTATE) {
- printk(KERN_INFO PFX
- " %d : pstate %d (%d MHz)\n", j,
- data->powernow_table[j].index,
- data->powernow_table[j].frequency/1000);
- } else {
- printk(KERN_INFO PFX
- " %d : fid 0x%x (%d MHz), vid 0x%x\n",
- j,
- data->powernow_table[j].index & 0xff,
- data->powernow_table[j].frequency/1000,
- data->powernow_table[j].index >> 8);
- }
- }
- }
- if (data->batps)
- printk(KERN_INFO PFX "Only %d pstates on battery\n",
- data->batps);
-}
-
-static u32 freq_from_fid_did(u32 fid, u32 did)
-{
- u32 mhz = 0;
-
- if (boot_cpu_data.x86 == 0x10)
- mhz = (100 * (fid + 0x10)) >> did;
- else if (boot_cpu_data.x86 == 0x11)
- mhz = (100 * (fid + 8)) >> did;
- else
- BUG();
-
- return mhz * 1000;
-}
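
A worked example of the family 0x10 frequency formula above; the fid and did values are assumed samples:

#include <stdio.h>

int main(void)
{
	unsigned int fid = 6, did = 1;	/* assumed sample P-state fields */

	/* family 0x10: (100 * (fid + 0x10)) >> did; each did step halves the clock */
	unsigned int mhz = (100 * (fid + 0x10)) >> did;
	printf("fid %u, did %u -> %u MHz (%u kHz)\n", fid, did, mhz, mhz * 1000);
	return 0;
}
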
-
-static int fill_powernow_table(struct powernow_k8_data *data,
- struct pst_s *pst, u8 maxvid)
-{
- struct cpufreq_frequency_table *powernow_table;
- unsigned int j;
-
- if (data->batps) {
- /* use ACPI support to get full speed on mains power */
- printk(KERN_WARNING PFX
- "Only %d pstates usable (use ACPI driver for full "
-		       "range)\n", data->batps);
- data->numps = data->batps;
- }
-
- for (j = 1; j < data->numps; j++) {
- if (pst[j-1].fid >= pst[j].fid) {
- printk(KERN_ERR PFX "PST out of sequence\n");
- return -EINVAL;
- }
- }
-
- if (data->numps < 2) {
- printk(KERN_ERR PFX "no p states to transition\n");
- return -ENODEV;
- }
-
- if (check_pst_table(data, pst, maxvid))
- return -EINVAL;
-
- powernow_table = kmalloc((sizeof(struct cpufreq_frequency_table)
- * (data->numps + 1)), GFP_KERNEL);
- if (!powernow_table) {
- printk(KERN_ERR PFX "powernow_table memory alloc failure\n");
- return -ENOMEM;
- }
-
- for (j = 0; j < data->numps; j++) {
- int freq;
- powernow_table[j].index = pst[j].fid; /* lower 8 bits */
- powernow_table[j].index |= (pst[j].vid << 8); /* upper 8 bits */
- freq = find_khz_freq_from_fid(pst[j].fid);
- powernow_table[j].frequency = freq;
- }
- powernow_table[data->numps].frequency = CPUFREQ_TABLE_END;
- powernow_table[data->numps].index = 0;
-
- if (query_current_values_with_pending_wait(data)) {
- kfree(powernow_table);
- return -EIO;
- }
-
- dprintk("cfid 0x%x, cvid 0x%x\n", data->currfid, data->currvid);
- data->powernow_table = powernow_table;
- if (cpumask_first(cpu_core_mask(data->cpu)) == data->cpu)
- print_basics(data);
-
- for (j = 0; j < data->numps; j++)
- if ((pst[j].fid == data->currfid) &&
- (pst[j].vid == data->currvid))
- return 0;
-
- dprintk("currfid/vid do not match PST, ignoring\n");
- return 0;
-}
-
-/* Find and validate the PSB/PST table in BIOS. */
-static int find_psb_table(struct powernow_k8_data *data)
-{
- struct psb_s *psb;
- unsigned int i;
- u32 mvs;
- u8 maxvid;
- u32 cpst = 0;
- u32 thiscpuid;
-
- for (i = 0xc0000; i < 0xffff0; i += 0x10) {
- /* Scan BIOS looking for the signature. */
-		/* It cannot be at 0xffff0 - the table is too big. */
-
- psb = phys_to_virt(i);
- if (memcmp(psb, PSB_ID_STRING, PSB_ID_STRING_LEN) != 0)
- continue;
-
- dprintk("found PSB header at 0x%p\n", psb);
-
- dprintk("table vers: 0x%x\n", psb->tableversion);
- if (psb->tableversion != PSB_VERSION_1_4) {
- printk(KERN_ERR FW_BUG PFX "PSB table is not v1.4\n");
- return -ENODEV;
- }
-
- dprintk("flags: 0x%x\n", psb->flags1);
- if (psb->flags1) {
- printk(KERN_ERR FW_BUG PFX "unknown flags\n");
- return -ENODEV;
- }
-
- data->vstable = psb->vstable;
- dprintk("voltage stabilization time: %d(*20us)\n",
- data->vstable);
-
- dprintk("flags2: 0x%x\n", psb->flags2);
- data->rvo = psb->flags2 & 3;
- data->irt = ((psb->flags2) >> 2) & 3;
- mvs = ((psb->flags2) >> 4) & 3;
- data->vidmvs = 1 << mvs;
- data->batps = ((psb->flags2) >> 6) & 3;
-
- dprintk("ramp voltage offset: %d\n", data->rvo);
- dprintk("isochronous relief time: %d\n", data->irt);
- dprintk("maximum voltage step: %d - 0x%x\n", mvs, data->vidmvs);
-
- dprintk("numpst: 0x%x\n", psb->num_tables);
- cpst = psb->num_tables;
- if ((psb->cpuid == 0x00000fc0) ||
- (psb->cpuid == 0x00000fe0)) {
- thiscpuid = cpuid_eax(CPUID_PROCESSOR_SIGNATURE);
- if ((thiscpuid == 0x00000fc0) ||
- (thiscpuid == 0x00000fe0))
- cpst = 1;
- }
- if (cpst != 1) {
- printk(KERN_ERR FW_BUG PFX "numpst must be 1\n");
- return -ENODEV;
- }
-
- data->plllock = psb->plllocktime;
- dprintk("plllocktime: 0x%x (units 1us)\n", psb->plllocktime);
- dprintk("maxfid: 0x%x\n", psb->maxfid);
- dprintk("maxvid: 0x%x\n", psb->maxvid);
- maxvid = psb->maxvid;
-
- data->numps = psb->numps;
- dprintk("numpstates: 0x%x\n", data->numps);
- return fill_powernow_table(data,
- (struct pst_s *)(psb+1), maxvid);
- }
- /*
-	 * If you see this message, complain to the BIOS manufacturer.
-	 * If they tell you "we do not support Linux" or some similar
-	 * nonsense, remember that Windows 2000 uses the same legacy
-	 * mechanism that the old Linux PSB driver uses. Tell them it
-	 * is broken with Windows 2000.
- *
- * The reference to the AMD documentation is chapter 9 in the
- * BIOS and Kernel Developer's Guide, which is available on
- * www.amd.com
- */
- printk(KERN_ERR FW_BUG PFX "No PSB or ACPI _PSS objects\n");
- return -ENODEV;
-}
-
-static void powernow_k8_acpi_pst_values(struct powernow_k8_data *data,
- unsigned int index)
-{
- u64 control;
-
- if (!data->acpi_data.state_count || (cpu_family == CPU_HW_PSTATE))
- return;
-
- control = data->acpi_data.states[index].control;
- data->irt = (control >> IRT_SHIFT) & IRT_MASK;
- data->rvo = (control >> RVO_SHIFT) & RVO_MASK;
- data->exttype = (control >> EXT_TYPE_SHIFT) & EXT_TYPE_MASK;
- data->plllock = (control >> PLL_L_SHIFT) & PLL_L_MASK;
- data->vidmvs = 1 << ((control >> MVS_SHIFT) & MVS_MASK);
- data->vstable = (control >> VST_SHIFT) & VST_MASK;
-}
-
-static int powernow_k8_cpu_init_acpi(struct powernow_k8_data *data)
-{
- struct cpufreq_frequency_table *powernow_table;
- int ret_val = -ENODEV;
- u64 control, status;
-
- if (acpi_processor_register_performance(&data->acpi_data, data->cpu)) {
- dprintk("register performance failed: bad ACPI data\n");
- return -EIO;
- }
-
- /* verify the data contained in the ACPI structures */
- if (data->acpi_data.state_count <= 1) {
- dprintk("No ACPI P-States\n");
- goto err_out;
- }
-
- control = data->acpi_data.control_register.space_id;
- status = data->acpi_data.status_register.space_id;
-
- if ((control != ACPI_ADR_SPACE_FIXED_HARDWARE) ||
- (status != ACPI_ADR_SPACE_FIXED_HARDWARE)) {
- dprintk("Invalid control/status registers (%x - %x)\n",
- control, status);
- goto err_out;
- }
-
- /* fill in data->powernow_table */
- powernow_table = kmalloc((sizeof(struct cpufreq_frequency_table)
- * (data->acpi_data.state_count + 1)), GFP_KERNEL);
- if (!powernow_table) {
- dprintk("powernow_table memory alloc failure\n");
- goto err_out;
- }
-
- /* fill in data */
- data->numps = data->acpi_data.state_count;
- powernow_k8_acpi_pst_values(data, 0);
-
- if (cpu_family == CPU_HW_PSTATE)
- ret_val = fill_powernow_table_pstate(data, powernow_table);
- else
- ret_val = fill_powernow_table_fidvid(data, powernow_table);
- if (ret_val)
- goto err_out_mem;
-
- powernow_table[data->acpi_data.state_count].frequency =
- CPUFREQ_TABLE_END;
- powernow_table[data->acpi_data.state_count].index = 0;
- data->powernow_table = powernow_table;
-
- if (cpumask_first(cpu_core_mask(data->cpu)) == data->cpu)
- print_basics(data);
-
- /* notify BIOS that we exist */
- acpi_processor_notify_smm(THIS_MODULE);
-
- if (!zalloc_cpumask_var(&data->acpi_data.shared_cpu_map, GFP_KERNEL)) {
- printk(KERN_ERR PFX
- "unable to alloc powernow_k8_data cpumask\n");
- ret_val = -ENOMEM;
- goto err_out_mem;
- }
-
- return 0;
-
-err_out_mem:
- kfree(powernow_table);
-
-err_out:
- acpi_processor_unregister_performance(&data->acpi_data, data->cpu);
-
- /* data->acpi_data.state_count informs us at ->exit()
- * whether ACPI was used */
- data->acpi_data.state_count = 0;
-
- return ret_val;
-}
-
-static int fill_powernow_table_pstate(struct powernow_k8_data *data,
- struct cpufreq_frequency_table *powernow_table)
-{
- int i;
- u32 hi = 0, lo = 0;
- rdmsr(MSR_PSTATE_CUR_LIMIT, hi, lo);
- data->max_hw_pstate = (hi & HW_PSTATE_MAX_MASK) >> HW_PSTATE_MAX_SHIFT;
-
- for (i = 0; i < data->acpi_data.state_count; i++) {
- u32 index;
-
- index = data->acpi_data.states[i].control & HW_PSTATE_MASK;
- if (index > data->max_hw_pstate) {
- printk(KERN_ERR PFX "invalid pstate %d - "
- "bad value %d.\n", i, index);
- printk(KERN_ERR PFX "Please report to BIOS "
- "manufacturer\n");
- invalidate_entry(powernow_table, i);
- continue;
- }
- rdmsr(MSR_PSTATE_DEF_BASE + index, lo, hi);
- if (!(hi & HW_PSTATE_VALID_MASK)) {
- dprintk("invalid pstate %d, ignoring\n", index);
- invalidate_entry(powernow_table, i);
- continue;
- }
-
- powernow_table[i].index = index;
-
- /* Frequency may be rounded for these */
- if ((boot_cpu_data.x86 == 0x10 && boot_cpu_data.x86_model < 10)
- || boot_cpu_data.x86 == 0x11) {
- powernow_table[i].frequency =
- freq_from_fid_did(lo & 0x3f, (lo >> 6) & 7);
- } else
- powernow_table[i].frequency =
- data->acpi_data.states[i].core_frequency * 1000;
- }
- return 0;
-}
-
-static int fill_powernow_table_fidvid(struct powernow_k8_data *data,
- struct cpufreq_frequency_table *powernow_table)
-{
- int i;
-
- for (i = 0; i < data->acpi_data.state_count; i++) {
- u32 fid;
- u32 vid;
- u32 freq, index;
- u64 status, control;
-
- if (data->exttype) {
- status = data->acpi_data.states[i].status;
- fid = status & EXT_FID_MASK;
- vid = (status >> VID_SHIFT) & EXT_VID_MASK;
- } else {
- control = data->acpi_data.states[i].control;
- fid = control & FID_MASK;
- vid = (control >> VID_SHIFT) & VID_MASK;
- }
-
- dprintk(" %d : fid 0x%x, vid 0x%x\n", i, fid, vid);
-
- index = fid | (vid<<8);
- powernow_table[i].index = index;
-
- freq = find_khz_freq_from_fid(fid);
- powernow_table[i].frequency = freq;
-
- /* verify frequency is OK */
- if ((freq > (MAX_FREQ * 1000)) || (freq < (MIN_FREQ * 1000))) {
- dprintk("invalid freq %u kHz, ignoring\n", freq);
- invalidate_entry(powernow_table, i);
- continue;
- }
-
- /* verify voltage is OK -
- * BIOSs are using "off" to indicate invalid */
- if (vid == VID_OFF) {
- dprintk("invalid vid %u, ignoring\n", vid);
- invalidate_entry(powernow_table, i);
- continue;
- }
-
- if (freq != (data->acpi_data.states[i].core_frequency * 1000)) {
- printk(KERN_INFO PFX "invalid freq entries "
- "%u kHz vs. %u kHz\n", freq,
- (unsigned int)
- (data->acpi_data.states[i].core_frequency
- * 1000));
- invalidate_entry(powernow_table, i);
- continue;
- }
- }
- return 0;
-}
-
-static void powernow_k8_cpu_exit_acpi(struct powernow_k8_data *data)
-{
- if (data->acpi_data.state_count)
- acpi_processor_unregister_performance(&data->acpi_data,
- data->cpu);
- free_cpumask_var(data->acpi_data.shared_cpu_map);
-}
-
-static int get_transition_latency(struct powernow_k8_data *data)
-{
- int max_latency = 0;
- int i;
- for (i = 0; i < data->acpi_data.state_count; i++) {
- int cur_latency = data->acpi_data.states[i].transition_latency
- + data->acpi_data.states[i].bus_master_latency;
- if (cur_latency > max_latency)
- max_latency = cur_latency;
- }
- if (max_latency == 0) {
- /*
- * Fam 11h and later may return 0 as transition latency. This
-		 * is intended and means "very fast". While the cpufreq core and
-		 * governors can currently handle that gracefully, it is better
-		 * to set it to 1 to avoid problems in the future.
- */
- if (boot_cpu_data.x86 < 0x11)
- printk(KERN_ERR FW_WARN PFX "Invalid zero transition "
- "latency\n");
- max_latency = 1;
- }
- /* value in usecs, needs to be in nanoseconds */
- return 1000 * max_latency;
-}
-
-/* Take a frequency, and issue the fid/vid transition command */
-static int transition_frequency_fidvid(struct powernow_k8_data *data,
- unsigned int index)
-{
- u32 fid = 0;
- u32 vid = 0;
- int res, i;
- struct cpufreq_freqs freqs;
-
- dprintk("cpu %d transition to index %u\n", smp_processor_id(), index);
-
- /* fid/vid correctness check for k8 */
- /* fid are the lower 8 bits of the index we stored into
- * the cpufreq frequency table in find_psb_table, vid
- * are the upper 8 bits.
- */
- fid = data->powernow_table[index].index & 0xFF;
- vid = (data->powernow_table[index].index & 0xFF00) >> 8;
-
- dprintk("table matched fid 0x%x, giving vid 0x%x\n", fid, vid);
-
- if (query_current_values_with_pending_wait(data))
- return 1;
-
- if ((data->currvid == vid) && (data->currfid == fid)) {
- dprintk("target matches current values (fid 0x%x, vid 0x%x)\n",
- fid, vid);
- return 0;
- }
-
- dprintk("cpu %d, changing to fid 0x%x, vid 0x%x\n",
- smp_processor_id(), fid, vid);
- freqs.old = find_khz_freq_from_fid(data->currfid);
- freqs.new = find_khz_freq_from_fid(fid);
-
- for_each_cpu(i, data->available_cores) {
- freqs.cpu = i;
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
- }
-
- res = transition_fid_vid(data, fid, vid);
- freqs.new = find_khz_freq_from_fid(data->currfid);
-
- for_each_cpu(i, data->available_cores) {
- freqs.cpu = i;
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
- }
- return res;
-}
-
-/* Take a frequency, and issue the hardware pstate transition command */
-static int transition_frequency_pstate(struct powernow_k8_data *data,
- unsigned int index)
-{
- u32 pstate = 0;
- int res, i;
- struct cpufreq_freqs freqs;
-
- dprintk("cpu %d transition to index %u\n", smp_processor_id(), index);
-
- /* get MSR index for hardware pstate transition */
- pstate = index & HW_PSTATE_MASK;
- if (pstate > data->max_hw_pstate)
- return 0;
- freqs.old = find_khz_freq_from_pstate(data->powernow_table,
- data->currpstate);
- freqs.new = find_khz_freq_from_pstate(data->powernow_table, pstate);
-
- for_each_cpu(i, data->available_cores) {
- freqs.cpu = i;
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
- }
-
- res = transition_pstate(data, pstate);
- freqs.new = find_khz_freq_from_pstate(data->powernow_table, pstate);
-
- for_each_cpu(i, data->available_cores) {
- freqs.cpu = i;
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
- }
- return res;
-}
-
-/* Driver entry point to switch to the target frequency */
-static int powernowk8_target(struct cpufreq_policy *pol,
- unsigned targfreq, unsigned relation)
-{
- cpumask_var_t oldmask;
- struct powernow_k8_data *data = per_cpu(powernow_data, pol->cpu);
- u32 checkfid;
- u32 checkvid;
- unsigned int newstate;
- int ret = -EIO;
-
- if (!data)
- return -EINVAL;
-
- checkfid = data->currfid;
- checkvid = data->currvid;
-
- /* only run on specific CPU from here on. */
- /* This is poor form: use a workqueue or smp_call_function_single */
- if (!alloc_cpumask_var(&oldmask, GFP_KERNEL))
- return -ENOMEM;
-
- cpumask_copy(oldmask, tsk_cpus_allowed(current));
- set_cpus_allowed_ptr(current, cpumask_of(pol->cpu));
-
- if (smp_processor_id() != pol->cpu) {
- printk(KERN_ERR PFX "limiting to cpu %u failed\n", pol->cpu);
- goto err_out;
- }
-
- if (pending_bit_stuck()) {
- printk(KERN_ERR PFX "failing targ, change pending bit set\n");
- goto err_out;
- }
-
- dprintk("targ: cpu %d, %d kHz, min %d, max %d, relation %d\n",
- pol->cpu, targfreq, pol->min, pol->max, relation);
-
- if (query_current_values_with_pending_wait(data))
- goto err_out;
-
- if (cpu_family != CPU_HW_PSTATE) {
- dprintk("targ: curr fid 0x%x, vid 0x%x\n",
- data->currfid, data->currvid);
-
- if ((checkvid != data->currvid) ||
- (checkfid != data->currfid)) {
- printk(KERN_INFO PFX
- "error - out of sync, fix 0x%x 0x%x, "
- "vid 0x%x 0x%x\n",
- checkfid, data->currfid,
- checkvid, data->currvid);
- }
- }
-
- if (cpufreq_frequency_table_target(pol, data->powernow_table,
- targfreq, relation, &newstate))
- goto err_out;
-
- mutex_lock(&fidvid_mutex);
-
- powernow_k8_acpi_pst_values(data, newstate);
-
- if (cpu_family == CPU_HW_PSTATE)
- ret = transition_frequency_pstate(data, newstate);
- else
- ret = transition_frequency_fidvid(data, newstate);
- if (ret) {
- printk(KERN_ERR PFX "transition frequency failed\n");
- ret = 1;
- mutex_unlock(&fidvid_mutex);
- goto err_out;
- }
- mutex_unlock(&fidvid_mutex);
-
- if (cpu_family == CPU_HW_PSTATE)
- pol->cur = find_khz_freq_from_pstate(data->powernow_table,
- newstate);
- else
- pol->cur = find_khz_freq_from_fid(data->currfid);
- ret = 0;
-
-err_out:
- set_cpus_allowed_ptr(current, oldmask);
- free_cpumask_var(oldmask);
- return ret;
-}
-
-/* Driver entry point to verify the policy and range of frequencies */
-static int powernowk8_verify(struct cpufreq_policy *pol)
-{
- struct powernow_k8_data *data = per_cpu(powernow_data, pol->cpu);
-
- if (!data)
- return -EINVAL;
-
- return cpufreq_frequency_table_verify(pol, data->powernow_table);
-}
-
-struct init_on_cpu {
- struct powernow_k8_data *data;
- int rc;
-};
-
-static void __cpuinit powernowk8_cpu_init_on_cpu(void *_init_on_cpu)
-{
- struct init_on_cpu *init_on_cpu = _init_on_cpu;
-
- if (pending_bit_stuck()) {
- printk(KERN_ERR PFX "failing init, change pending bit set\n");
- init_on_cpu->rc = -ENODEV;
- return;
- }
-
- if (query_current_values_with_pending_wait(init_on_cpu->data)) {
- init_on_cpu->rc = -ENODEV;
- return;
- }
-
- if (cpu_family == CPU_OPTERON)
- fidvid_msr_init();
-
- init_on_cpu->rc = 0;
-}
-
-/* per CPU init entry point to the driver */
-static int __cpuinit powernowk8_cpu_init(struct cpufreq_policy *pol)
-{
- static const char ACPI_PSS_BIOS_BUG_MSG[] =
- KERN_ERR FW_BUG PFX "No compatible ACPI _PSS objects found.\n"
- FW_BUG PFX "Try again with latest BIOS.\n";
- struct powernow_k8_data *data;
- struct init_on_cpu init_on_cpu;
- int rc;
- struct cpuinfo_x86 *c = &cpu_data(pol->cpu);
-
- if (!cpu_online(pol->cpu))
- return -ENODEV;
-
- smp_call_function_single(pol->cpu, check_supported_cpu, &rc, 1);
- if (rc)
- return -ENODEV;
-
- data = kzalloc(sizeof(struct powernow_k8_data), GFP_KERNEL);
- if (!data) {
- printk(KERN_ERR PFX "unable to alloc powernow_k8_data");
- return -ENOMEM;
- }
-
- data->cpu = pol->cpu;
- data->currpstate = HW_PSTATE_INVALID;
-
- if (powernow_k8_cpu_init_acpi(data)) {
- /*
- * Use the PSB BIOS structure. This is only available on
- * a UP version, and is deprecated by AMD.
- */
- if (num_online_cpus() != 1) {
- printk_once(ACPI_PSS_BIOS_BUG_MSG);
- goto err_out;
- }
- if (pol->cpu != 0) {
- printk(KERN_ERR FW_BUG PFX "No ACPI _PSS objects for "
- "CPU other than CPU0. Complain to your BIOS "
- "vendor.\n");
- goto err_out;
- }
- rc = find_psb_table(data);
- if (rc)
- goto err_out;
-
- /* Take a crude guess here.
- * The guess is in microseconds, so multiply by 1000 */
- pol->cpuinfo.transition_latency = (
- ((data->rvo + 8) * data->vstable * VST_UNITS_20US) +
- ((1 << data->irt) * 30)) * 1000;
- } else /* ACPI _PSS objects available */
- pol->cpuinfo.transition_latency = get_transition_latency(data);
-
- /* only run on specific CPU from here on */
- init_on_cpu.data = data;
- smp_call_function_single(data->cpu, powernowk8_cpu_init_on_cpu,
- &init_on_cpu, 1);
- rc = init_on_cpu.rc;
- if (rc != 0)
- goto err_out_exit_acpi;
-
- if (cpu_family == CPU_HW_PSTATE)
- cpumask_copy(pol->cpus, cpumask_of(pol->cpu));
- else
- cpumask_copy(pol->cpus, cpu_core_mask(pol->cpu));
- data->available_cores = pol->cpus;
-
- if (cpu_family == CPU_HW_PSTATE)
- pol->cur = find_khz_freq_from_pstate(data->powernow_table,
- data->currpstate);
- else
- pol->cur = find_khz_freq_from_fid(data->currfid);
- dprintk("policy current frequency %d kHz\n", pol->cur);
-
- /* min/max the cpu is capable of */
- if (cpufreq_frequency_table_cpuinfo(pol, data->powernow_table)) {
- printk(KERN_ERR FW_BUG PFX "invalid powernow_table\n");
- powernow_k8_cpu_exit_acpi(data);
- kfree(data->powernow_table);
- kfree(data);
- return -EINVAL;
- }
-
- /* Check for APERF/MPERF support in hardware */
- if (cpu_has(c, X86_FEATURE_APERFMPERF))
- cpufreq_amd64_driver.getavg = cpufreq_get_measured_perf;
-
- cpufreq_frequency_table_get_attr(data->powernow_table, pol->cpu);
-
- if (cpu_family == CPU_HW_PSTATE)
- dprintk("cpu_init done, current pstate 0x%x\n",
- data->currpstate);
- else
- dprintk("cpu_init done, current fid 0x%x, vid 0x%x\n",
- data->currfid, data->currvid);
-
- per_cpu(powernow_data, pol->cpu) = data;
-
- return 0;
-
-err_out_exit_acpi:
- powernow_k8_cpu_exit_acpi(data);
-
-err_out:
- kfree(data);
- return -ENODEV;
-}
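
To make the crude PSB latency estimate above concrete, the same arithmetic can be evaluated for sample field values; the rvo, vstable and irt numbers below are assumptions chosen only for illustration:

#include <stdio.h>

#define VST_UNITS_20US 20 /* vstable is expressed in 20 us units */

int main(void)
{
        unsigned int rvo = 0, vstable = 5, irt = 3; /* assumed sample values */

        /* ((rvo + 8) * vstable * 20 us + 2^irt * 30 us) in ns:
         * (8 * 5 * 20 + 8 * 30) * 1000 = 1040000 ns, about 1 ms */
        unsigned int latency_ns =
                ((rvo + 8) * vstable * VST_UNITS_20US +
                 (1 << irt) * 30) * 1000;

        printf("estimated transition latency: %u ns\n", latency_ns);
        return 0;
}
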
-
-static int __devexit powernowk8_cpu_exit(struct cpufreq_policy *pol)
-{
- struct powernow_k8_data *data = per_cpu(powernow_data, pol->cpu);
-
- if (!data)
- return -EINVAL;
-
- powernow_k8_cpu_exit_acpi(data);
-
- cpufreq_frequency_table_put_attr(pol->cpu);
-
- kfree(data->powernow_table);
- kfree(data);
- per_cpu(powernow_data, pol->cpu) = NULL;
-
- return 0;
-}
-
-static void query_values_on_cpu(void *_err)
-{
- int *err = _err;
- struct powernow_k8_data *data = __get_cpu_var(powernow_data);
-
- *err = query_current_values_with_pending_wait(data);
-}
-
-static unsigned int powernowk8_get(unsigned int cpu)
-{
- struct powernow_k8_data *data = per_cpu(powernow_data, cpu);
- unsigned int khz = 0;
- int err;
-
- if (!data)
- return 0;
-
- smp_call_function_single(cpu, query_values_on_cpu, &err, true);
- if (err)
- goto out;
-
- if (cpu_family == CPU_HW_PSTATE)
- khz = find_khz_freq_from_pstate(data->powernow_table,
- data->currpstate);
- else
- khz = find_khz_freq_from_fid(data->currfid);
-
-out:
- return khz;
-}
-
-static void _cpb_toggle_msrs(bool t)
-{
- int cpu;
-
- get_online_cpus();
-
- rdmsr_on_cpus(cpu_online_mask, MSR_K7_HWCR, msrs);
-
- for_each_cpu(cpu, cpu_online_mask) {
- struct msr *reg = per_cpu_ptr(msrs, cpu);
- if (t)
- reg->l &= ~BIT(25);
- else
- reg->l |= BIT(25);
- }
- wrmsr_on_cpus(cpu_online_mask, MSR_K7_HWCR, msrs);
-
- put_online_cpus();
-}
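
The boost toggle above comes down to one bit in the low word of MSR_K7_HWCR: clearing bit 25 permits boosting, setting it disables boosting. A minimal sketch of that bit manipulation; hwcr_lo_for_boost() is an illustrative name, not a driver function:

#include <stdbool.h>
#include <stdio.h>

#define CPB_DIS (1u << 25) /* boost-disable bit in the low word of MSR_K7_HWCR */

/* Illustrative helper: new low word of MSR_K7_HWCR for a boost state. */
static unsigned int hwcr_lo_for_boost(unsigned int lo, bool enable)
{
        return enable ? (lo & ~CPB_DIS) : (lo | CPB_DIS);
}

int main(void)
{
        printf("enable:  0x%08x\n", hwcr_lo_for_boost(0x02000000, true));
        printf("disable: 0x%08x\n", hwcr_lo_for_boost(0x00000000, false));
        return 0;
}
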
-
-/*
- * Switch on/off core performance boosting.
- *
- * 0 = disable
- * 1 = enable
- */
-static void cpb_toggle(bool t)
-{
- if (!cpb_capable)
- return;
-
- if (t && !cpb_enabled) {
- cpb_enabled = true;
- _cpb_toggle_msrs(t);
- printk(KERN_INFO PFX "Core Boosting enabled.\n");
- } else if (!t && cpb_enabled) {
- cpb_enabled = false;
- _cpb_toggle_msrs(t);
- printk(KERN_INFO PFX "Core Boosting disabled.\n");
- }
-}
-
-static ssize_t store_cpb(struct cpufreq_policy *policy, const char *buf,
- size_t count)
-{
- int ret = -EINVAL;
- unsigned long val = 0;
-
- ret = strict_strtoul(buf, 10, &val);
- if (!ret && (val == 0 || val == 1) && cpb_capable)
- cpb_toggle(val);
- else
- return -EINVAL;
-
- return count;
-}
-
-static ssize_t show_cpb(struct cpufreq_policy *policy, char *buf)
-{
- return sprintf(buf, "%u\n", cpb_enabled);
-}
-
-#define define_one_rw(_name) \
-static struct freq_attr _name = \
-__ATTR(_name, 0644, show_##_name, store_##_name)
-
-define_one_rw(cpb);
-
-static struct freq_attr *powernow_k8_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- &cpb,
- NULL,
-};
-
-static struct cpufreq_driver cpufreq_amd64_driver = {
- .verify = powernowk8_verify,
- .target = powernowk8_target,
- .bios_limit = acpi_processor_get_bios_limit,
- .init = powernowk8_cpu_init,
- .exit = __devexit_p(powernowk8_cpu_exit),
- .get = powernowk8_get,
- .name = "powernow-k8",
- .owner = THIS_MODULE,
- .attr = powernow_k8_attr,
-};
-
-/*
- * Clear the boost-disable flag on the CPU_DOWN path so that this cpu
- * cannot block the remaining ones from boosting. On the CPU_UP path we
- * simply keep the boost-disable flag in sync with the current global
- * state.
- */
-static int cpb_notify(struct notifier_block *nb, unsigned long action,
- void *hcpu)
-{
- unsigned cpu = (long)hcpu;
- u32 lo, hi;
-
- switch (action) {
- case CPU_UP_PREPARE:
- case CPU_UP_PREPARE_FROZEN:
-
- if (!cpb_enabled) {
- rdmsr_on_cpu(cpu, MSR_K7_HWCR, &lo, &hi);
- lo |= BIT(25);
- wrmsr_on_cpu(cpu, MSR_K7_HWCR, lo, hi);
- }
- break;
-
- case CPU_DOWN_PREPARE:
- case CPU_DOWN_PREPARE_FROZEN:
- rdmsr_on_cpu(cpu, MSR_K7_HWCR, &lo, &hi);
- lo &= ~BIT(25);
- wrmsr_on_cpu(cpu, MSR_K7_HWCR, lo, hi);
- break;
-
- default:
- break;
- }
-
- return NOTIFY_OK;
-}
-
-static struct notifier_block cpb_nb = {
- .notifier_call = cpb_notify,
-};
-
-/* driver entry point for init */
-static int __cpuinit powernowk8_init(void)
-{
- unsigned int i, supported_cpus = 0, cpu;
-
- for_each_online_cpu(i) {
- int rc;
- smp_call_function_single(i, check_supported_cpu, &rc, 1);
- if (rc == 0)
- supported_cpus++;
- }
-
- if (supported_cpus != num_online_cpus())
- return -ENODEV;
-
- printk(KERN_INFO PFX "Found %d %s (%d cpu cores) (" VERSION ")\n",
- num_online_nodes(), boot_cpu_data.x86_model_id, supported_cpus);
-
- if (boot_cpu_has(X86_FEATURE_CPB)) {
-
- cpb_capable = true;
-
- register_cpu_notifier(&cpb_nb);
-
- msrs = msrs_alloc();
- if (!msrs) {
- printk(KERN_ERR "%s: Error allocating msrs!\n", __func__);
- return -ENOMEM;
- }
-
- rdmsr_on_cpus(cpu_online_mask, MSR_K7_HWCR, msrs);
-
- for_each_cpu(cpu, cpu_online_mask) {
- struct msr *reg = per_cpu_ptr(msrs, cpu);
- cpb_enabled |= !(reg->l & BIT(25));
- }
-
- printk(KERN_INFO PFX "Core Performance Boosting: %s.\n",
- (cpb_enabled ? "on" : "off"));
- }
-
- return cpufreq_register_driver(&cpufreq_amd64_driver);
-}
-
-/* driver entry point for term */
-static void __exit powernowk8_exit(void)
-{
- dprintk("exit\n");
-
- if (boot_cpu_has(X86_FEATURE_CPB)) {
- msrs_free(msrs);
- msrs = NULL;
-
- unregister_cpu_notifier(&cpb_nb);
- }
-
- cpufreq_unregister_driver(&cpufreq_amd64_driver);
-}
-
-MODULE_AUTHOR("Paul Devriendt <paul.devriendt@amd.com> and "
- "Mark Langsdorf <mark.langsdorf@amd.com>");
-MODULE_DESCRIPTION("AMD Athlon 64 and Opteron processor frequency driver.");
-MODULE_LICENSE("GPL");
-
-late_initcall(powernowk8_init);
-module_exit(powernowk8_exit);
diff --git a/arch/x86/kernel/cpu/cpufreq/powernow-k8.h b/arch/x86/kernel/cpu/cpufreq/powernow-k8.h
deleted file mode 100644
index df3529b1c02d..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/powernow-k8.h
+++ /dev/null
@@ -1,224 +0,0 @@
-/*
- * (c) 2003-2006 Advanced Micro Devices, Inc.
- * Your use of this code is subject to the terms and conditions of the
- * GNU general public license version 2. See "COPYING" or
- * http://www.gnu.org/licenses/gpl.html
- */
-
-enum pstate {
- HW_PSTATE_INVALID = 0xff,
- HW_PSTATE_0 = 0,
- HW_PSTATE_1 = 1,
- HW_PSTATE_2 = 2,
- HW_PSTATE_3 = 3,
- HW_PSTATE_4 = 4,
- HW_PSTATE_5 = 5,
- HW_PSTATE_6 = 6,
- HW_PSTATE_7 = 7,
-};
-
-struct powernow_k8_data {
- unsigned int cpu;
-
- u32 numps; /* number of p-states */
- u32 batps; /* number of p-states supported on battery */
- u32 max_hw_pstate; /* maximum legal hardware pstate */
-
- /* these values are constant when the PSB is used to determine
- * vid/fid pairings, but are modified during the ->target() call
- * when ACPI is used */
- u32 rvo; /* ramp voltage offset */
- u32 irt; /* isochronous relief time */
- u32 vidmvs; /* usable value calculated from mvs */
- u32 vstable; /* voltage stabilization time, units 20 us */
- u32 plllock; /* pll lock time, units 1 us */
- u32 exttype; /* extended interface = 1 */
-
- /* keep track of the current fid / vid or pstate */
- u32 currvid;
- u32 currfid;
- enum pstate currpstate;
-
- /* the powernow_table includes all frequency and vid/fid pairings:
- * fid are the lower 8 bits of the index, vid are the upper 8 bits.
- * frequency is in kHz */
- struct cpufreq_frequency_table *powernow_table;
-
- /* the acpi table needs to be kept. it's only available if ACPI was
- * used to determine valid frequency/vid/fid states */
- struct acpi_processor_performance acpi_data;
-
- /* we need to keep track of associated cores, but let cpufreq
- * handle hotplug events - so just point at cpufreq pol->cpus
- * structure */
- struct cpumask *available_cores;
-};
-
-/* processor's cpuid instruction support */
-#define CPUID_PROCESSOR_SIGNATURE 1 /* function 1 */
-#define CPUID_XFAM 0x0ff00000 /* extended family */
-#define CPUID_XFAM_K8 0
-#define CPUID_XMOD 0x000f0000 /* extended model */
-#define CPUID_XMOD_REV_MASK 0x000c0000
-#define CPUID_XFAM_10H 0x00100000 /* family 0x10 */
-#define CPUID_USE_XFAM_XMOD 0x00000f00
-#define CPUID_GET_MAX_CAPABILITIES 0x80000000
-#define CPUID_FREQ_VOLT_CAPABILITIES 0x80000007
-#define P_STATE_TRANSITION_CAPABLE 6
-
-/* Model Specific Registers for p-state transitions. MSRs are 64-bit. For */
-/* writes (wrmsr - opcode 0f 30), the register number is placed in ecx, and */
-/* the value to write is placed in edx:eax. For reads (rdmsr - opcode 0f 32), */
-/* the register number is placed in ecx, and the data is returned in edx:eax. */
-
-#define MSR_FIDVID_CTL 0xc0010041
-#define MSR_FIDVID_STATUS 0xc0010042
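
The rdmsr/wrmsr convention described above is what the kernel's rdmsr_on_cpu()/wrmsr_on_cpu() helpers wrap. From userspace, the same registers can be read through the msr(4) character device, where the MSR number doubles as the file offset. A sketch, assuming root privileges and a loaded msr module:

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        uint64_t val;
        /* each MSR is exposed at its register number as a file offset */
        int fd = open("/dev/cpu/0/msr", O_RDONLY);

        if (fd < 0 || pread(fd, &val, sizeof(val), 0xc0010042) != sizeof(val)) {
                perror("rdmsr MSR_FIDVID_STATUS");
                return 1;
        }
        printf("MSR_FIDVID_STATUS: 0x%016" PRIx64 "\n", val);
        close(fd);
        return 0;
}
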
-
-/* Field definitions within the FID VID Low Control MSR : */
-#define MSR_C_LO_INIT_FID_VID 0x00010000
-#define MSR_C_LO_NEW_VID 0x00003f00
-#define MSR_C_LO_NEW_FID 0x0000003f
-#define MSR_C_LO_VID_SHIFT 8
-
-/* Field definitions within the FID VID High Control MSR : */
-#define MSR_C_HI_STP_GNT_TO 0x000fffff
-
-/* Field definitions within the FID VID Low Status MSR : */
-#define MSR_S_LO_CHANGE_PENDING 0x80000000 /* cleared when completed */
-#define MSR_S_LO_MAX_RAMP_VID 0x3f000000
-#define MSR_S_LO_MAX_FID 0x003f0000
-#define MSR_S_LO_START_FID 0x00003f00
-#define MSR_S_LO_CURRENT_FID 0x0000003f
-
-/* Field definitions within the FID VID High Status MSR : */
-#define MSR_S_HI_MIN_WORKING_VID 0x3f000000
-#define MSR_S_HI_MAX_WORKING_VID 0x003f0000
-#define MSR_S_HI_START_VID 0x00003f00
-#define MSR_S_HI_CURRENT_VID 0x0000003f
-#define MSR_C_HI_STP_GNT_BENIGN 0x00000001
-
-
-/* Hardware Pstate _PSS and MSR definitions */
-#define USE_HW_PSTATE 0x00000080
-#define HW_PSTATE_MASK 0x00000007
-#define HW_PSTATE_VALID_MASK 0x80000000
-#define HW_PSTATE_MAX_MASK 0x000000f0
-#define HW_PSTATE_MAX_SHIFT 4
-#define MSR_PSTATE_DEF_BASE 0xc0010064 /* base of Pstate MSRs */
-#define MSR_PSTATE_STATUS 0xc0010063 /* Pstate Status MSR */
-#define MSR_PSTATE_CTRL 0xc0010062 /* Pstate control MSR */
-#define MSR_PSTATE_CUR_LIMIT 0xc0010061 /* pstate current limit MSR */
-
-/* define the two driver architectures */
-#define CPU_OPTERON 0
-#define CPU_HW_PSTATE 1
-
-
-/*
- * There are restrictions frequencies have to follow:
- * - only 1 entry in the low fid table ( <=1.4GHz )
- * - lowest entry in the high fid table must be >= 2 * the entry in the
- * low fid table
- * - lowest entry in the high fid table must be <= 200MHz + 2 * the entry
- * in the low fid table
- * - the parts can only step at <= 200 MHz intervals; odd fid values are
- * supported only in revision G and later.
- * - lowest frequency must be >= interprocessor hypertransport link speed
- * (only applies to MP systems obviously)
- */
-
-/* fids (frequency identifiers) are arranged in 2 tables - lo and hi */
-#define LO_FID_TABLE_TOP 7 /* fid values marking the boundary */
-#define HI_FID_TABLE_BOTTOM 8 /* between the low and high tables */
-
-#define LO_VCOFREQ_TABLE_TOP 1400 /* corresponding vco frequency values */
-#define HI_VCOFREQ_TABLE_BOTTOM 1600
-
-#define MIN_FREQ_RESOLUTION 200 /* fids jump by 2 matching freq jumps by 200 */
-
-#define MAX_FID 0x2a /* Spec only gives FID values as far as 5 GHz */
-#define LEAST_VID 0x3e /* Lowest (numerically highest) useful vid value */
-
-#define MIN_FREQ 800 /* Min and max freqs, per spec */
-#define MAX_FREQ 5000
-
-#define INVALID_FID_MASK 0xffffffc0 /* not a valid fid if these bits are set */
-#define INVALID_VID_MASK 0xffffffc0 /* not a valid vid if these bits are set */
-
-#define VID_OFF 0x3f
-
-#define STOP_GRANT_5NS 1 /* min poss memory access latency for voltage change */
-
-#define PLL_LOCK_CONVERSION (1000/5) /* ms to ns, then divide by clock period */
-
-#define MAXIMUM_VID_STEPS 1 /* Current cpus only allow a single step of 25mV */
-#define VST_UNITS_20US 20 /* Voltage Stabilization Time is in units of 20us */
-
-/*
- * Most values of interest are encoded in a single field of the _PSS
- * entries: the "control" value.
- */
-
-#define IRT_SHIFT 30
-#define RVO_SHIFT 28
-#define EXT_TYPE_SHIFT 27
-#define PLL_L_SHIFT 20
-#define MVS_SHIFT 18
-#define VST_SHIFT 11
-#define VID_SHIFT 6
-#define IRT_MASK 3
-#define RVO_MASK 3
-#define EXT_TYPE_MASK 1
-#define PLL_L_MASK 0x7f
-#define MVS_MASK 3
-#define VST_MASK 0x7f
-#define VID_MASK 0x1f
-#define FID_MASK 0x1f
-#define EXT_VID_MASK 0x3f
-#define EXT_FID_MASK 0x3f
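
Putting the shifts and masks above together, a _PSS control word decodes as below; the sample value is arbitrary, chosen only to exercise each field:

#include <stdio.h>

int main(void)
{
        unsigned int control = 0xe1b28c0a; /* arbitrary sample value */

        printf("irt %u rvo %u exttype %u plllock %u mvs %u vst %u vid %u fid %u\n",
               (control >> 30) & 3,    /* IRT */
               (control >> 28) & 3,    /* RVO */
               (control >> 27) & 1,    /* EXT_TYPE */
               (control >> 20) & 0x7f, /* PLL_L */
               (control >> 18) & 3,    /* MVS */
               (control >> 11) & 0x7f, /* VST */
               (control >> 6) & 0x1f,  /* VID */
               control & 0x1f);        /* FID */
        return 0;
}
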
-
-
-/*
- * Version 1.4 of the PSB table. This table is constructed by the BIOS
- * and tells the OS's power management driver which VIDs and FIDs are
- * supported by this particular processor.
- * If the data in the PSB / PST is wrong, then this driver will program the
- * wrong values into hardware, which is very likely to lead to a crash.
- */
-
-#define PSB_ID_STRING "AMDK7PNOW!"
-#define PSB_ID_STRING_LEN 10
-
-#define PSB_VERSION_1_4 0x14
-
-struct psb_s {
- u8 signature[10];
- u8 tableversion;
- u8 flags1;
- u16 vstable;
- u8 flags2;
- u8 num_tables;
- u32 cpuid;
- u8 plllocktime;
- u8 maxfid;
- u8 maxvid;
- u8 numps;
-};
-
-/* Pairs of fid/vid values are appended to the version 1.4 PSB table. */
-struct pst_s {
- u8 fid;
- u8 vid;
-};
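
The driver's find_psb_table() (in a hunk not shown here) locates this structure by scanning BIOS memory for the signature. A hedged sketch of that search over an already-mapped buffer; the function name, stride and buffer handling are assumptions for illustration:

#include <stddef.h>
#include <string.h>

#define PSB_ID_STRING "AMDK7PNOW!"
#define PSB_ID_STRING_LEN 10

/* Illustrative: return the offset of a PSB signature in a ROM image,
 * or (size_t)-1 if none is found. */
static size_t find_psb_offset(const unsigned char *rom, size_t len)
{
        size_t i;

        for (i = 0; i + PSB_ID_STRING_LEN <= len; i += 16)
                if (!memcmp(rom + i, PSB_ID_STRING, PSB_ID_STRING_LEN))
                        return i;
        return (size_t)-1;
}

int main(void)
{
        unsigned char rom[32] = "AMDK7PNOW!";

        return find_psb_offset(rom, sizeof(rom)) == 0 ? 0 : 1;
}
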
-
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, "powernow-k8", msg)
-
-static int core_voltage_pre_transition(struct powernow_k8_data *data,
- u32 reqvid, u32 reqfid);
-static int core_voltage_post_transition(struct powernow_k8_data *data, u32 reqvid);
-static int core_frequency_transition(struct powernow_k8_data *data, u32 reqfid);
-
-static void powernow_k8_acpi_pst_values(struct powernow_k8_data *data, unsigned int index);
-
-static int fill_powernow_table_pstate(struct powernow_k8_data *data, struct cpufreq_frequency_table *powernow_table);
-static int fill_powernow_table_fidvid(struct powernow_k8_data *data, struct cpufreq_frequency_table *powernow_table);
diff --git a/arch/x86/kernel/cpu/cpufreq/sc520_freq.c b/arch/x86/kernel/cpu/cpufreq/sc520_freq.c
deleted file mode 100644
index 435a996a613a..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/sc520_freq.c
+++ /dev/null
@@ -1,194 +0,0 @@
-/*
- * sc520_freq.c: cpufreq driver for the AMD Elan sc520
- *
- * Copyright (C) 2005 Sean Young <sean@mess.org>
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- *
- * Based on elanfreq.c
- *
- * 2005-03-30: - initial revision
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/init.h>
-
-#include <linux/delay.h>
-#include <linux/cpufreq.h>
-#include <linux/timex.h>
-#include <linux/io.h>
-
-#include <asm/msr.h>
-
-#define MMCR_BASE 0xfffef000 /* The default base address */
-#define OFFS_CPUCTL 0x2 /* CPU Control Register */
-
-static __u8 __iomem *cpuctl;
-
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "sc520_freq", msg)
-#define PFX "sc520_freq: "
-
-static struct cpufreq_frequency_table sc520_freq_table[] = {
- {0x01, 100000},
- {0x02, 133000},
- {0, CPUFREQ_TABLE_END},
-};
-
-static unsigned int sc520_freq_get_cpu_frequency(unsigned int cpu)
-{
- u8 clockspeed_reg = *cpuctl;
-
- switch (clockspeed_reg & 0x03) {
- default:
- printk(KERN_ERR PFX "error: cpuctl register has unexpected "
- "value %02x\n", clockspeed_reg);
- case 0x01:
- return 100000;
- case 0x02:
- return 133000;
- }
-}
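
The two low bits of the SC520 CPUCTL byte select the clock, as the switch above shows: 0x01 means 100 MHz, 0x02 means 133 MHz. Setting a speed is a read-modify-write that preserves the upper bits; a tiny sketch with the register byte as a plain variable:

#include <stdio.h>

int main(void)
{
        unsigned char cpuctl = 0xa1; /* assumed register value: 100 MHz */
        unsigned char want = 0x02;   /* select 133 MHz */

        cpuctl = (cpuctl & ~0x03) | want; /* keep upper bits intact */
        printf("cpuctl now 0x%02x\n", cpuctl); /* 0xa2 */
        return 0;
}
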
-
-static void sc520_freq_set_cpu_state(unsigned int state)
-{
-
- struct cpufreq_freqs freqs;
- u8 clockspeed_reg;
-
- freqs.old = sc520_freq_get_cpu_frequency(0);
- freqs.new = sc520_freq_table[state].frequency;
- freqs.cpu = 0; /* AMD Elan is UP */
-
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
-
- dprintk("attempting to set frequency to %i kHz\n",
- sc520_freq_table[state].frequency);
-
- local_irq_disable();
-
- clockspeed_reg = *cpuctl & ~0x03;
- *cpuctl = clockspeed_reg | sc520_freq_table[state].index;
-
- local_irq_enable();
-
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
-}
-
-static int sc520_freq_verify(struct cpufreq_policy *policy)
-{
- return cpufreq_frequency_table_verify(policy, &sc520_freq_table[0]);
-}
-
-static int sc520_freq_target(struct cpufreq_policy *policy,
- unsigned int target_freq,
- unsigned int relation)
-{
- unsigned int newstate = 0;
-
- if (cpufreq_frequency_table_target(policy, sc520_freq_table,
- target_freq, relation, &newstate))
- return -EINVAL;
-
- sc520_freq_set_cpu_state(newstate);
-
- return 0;
-}
-
-
-/*
- * Module init and exit code
- */
-
-static int sc520_freq_cpu_init(struct cpufreq_policy *policy)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
- int result;
-
- /* capability check */
- if (c->x86_vendor != X86_VENDOR_AMD ||
- c->x86 != 4 || c->x86_model != 9)
- return -ENODEV;
-
- /* cpuinfo and default policy values */
- policy->cpuinfo.transition_latency = 1000000; /* 1ms */
- policy->cur = sc520_freq_get_cpu_frequency(0);
-
- result = cpufreq_frequency_table_cpuinfo(policy, sc520_freq_table);
- if (result)
- return result;
-
- cpufreq_frequency_table_get_attr(sc520_freq_table, policy->cpu);
-
- return 0;
-}
-
-
-static int sc520_freq_cpu_exit(struct cpufreq_policy *policy)
-{
- cpufreq_frequency_table_put_attr(policy->cpu);
- return 0;
-}
-
-
-static struct freq_attr *sc520_freq_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- NULL,
-};
-
-
-static struct cpufreq_driver sc520_freq_driver = {
- .get = sc520_freq_get_cpu_frequency,
- .verify = sc520_freq_verify,
- .target = sc520_freq_target,
- .init = sc520_freq_cpu_init,
- .exit = sc520_freq_cpu_exit,
- .name = "sc520_freq",
- .owner = THIS_MODULE,
- .attr = sc520_freq_attr,
-};
-
-
-static int __init sc520_freq_init(void)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
- int err;
-
- /* Test if we have the right hardware */
- if (c->x86_vendor != X86_VENDOR_AMD ||
- c->x86 != 4 || c->x86_model != 9) {
- dprintk("no Elan SC520 processor found!\n");
- return -ENODEV;
- }
- cpuctl = ioremap((unsigned long)(MMCR_BASE + OFFS_CPUCTL), 1);
- if (!cpuctl) {
- printk(KERN_ERR "sc520_freq: error: failed to remap memory\n");
- return -ENOMEM;
- }
-
- err = cpufreq_register_driver(&sc520_freq_driver);
- if (err)
- iounmap(cpuctl);
-
- return err;
-}
-
-
-static void __exit sc520_freq_exit(void)
-{
- cpufreq_unregister_driver(&sc520_freq_driver);
- iounmap(cpuctl);
-}
-
-
-MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Sean Young <sean@mess.org>");
-MODULE_DESCRIPTION("cpufreq driver for AMD's Elan sc520 CPU");
-
-module_init(sc520_freq_init);
-module_exit(sc520_freq_exit);
-
diff --git a/arch/x86/kernel/cpu/cpufreq/speedstep-centrino.c b/arch/x86/kernel/cpu/cpufreq/speedstep-centrino.c
deleted file mode 100644
index 9b1ff37de46a..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/speedstep-centrino.c
+++ /dev/null
@@ -1,636 +0,0 @@
-/*
- * cpufreq driver for Enhanced SpeedStep, as found in Intel's Pentium
- * M (part of the Centrino chipset).
- *
- * Since the original Pentium M, most new Intel CPUs support Enhanced
- * SpeedStep.
- *
- * Despite the "SpeedStep" in the name, this is almost entirely unlike
- * traditional SpeedStep.
- *
- * Modelled on speedstep.c
- *
- * Copyright (C) 2003 Jeremy Fitzhardinge <jeremy@goop.org>
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-#include <linux/sched.h> /* current */
-#include <linux/delay.h>
-#include <linux/compiler.h>
-#include <linux/gfp.h>
-
-#include <asm/msr.h>
-#include <asm/processor.h>
-#include <asm/cpufeature.h>
-
-#define PFX "speedstep-centrino: "
-#define MAINTAINER "cpufreq@vger.kernel.org"
-
-#define dprintk(msg...) \
- cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, "speedstep-centrino", msg)
-
-#define INTEL_MSR_RANGE (0xffff)
-
-struct cpu_id
-{
- __u8 x86; /* CPU family */
- __u8 x86_model; /* model */
- __u8 x86_mask; /* stepping */
-};
-
-enum {
- CPU_BANIAS,
- CPU_DOTHAN_A1,
- CPU_DOTHAN_A2,
- CPU_DOTHAN_B0,
- CPU_MP4HT_D0,
- CPU_MP4HT_E0,
-};
-
-static const struct cpu_id cpu_ids[] = {
- [CPU_BANIAS] = { 6, 9, 5 },
- [CPU_DOTHAN_A1] = { 6, 13, 1 },
- [CPU_DOTHAN_A2] = { 6, 13, 2 },
- [CPU_DOTHAN_B0] = { 6, 13, 6 },
- [CPU_MP4HT_D0] = {15, 3, 4 },
- [CPU_MP4HT_E0] = {15, 4, 1 },
-};
-#define N_IDS ARRAY_SIZE(cpu_ids)
-
-struct cpu_model
-{
- const struct cpu_id *cpu_id;
- const char *model_name;
- unsigned max_freq; /* max clock in kHz */
-
- struct cpufreq_frequency_table *op_points; /* clock/voltage pairs */
-};
-static int centrino_verify_cpu_id(const struct cpuinfo_x86 *c,
- const struct cpu_id *x);
-
-/* Operating points for current CPU */
-static DEFINE_PER_CPU(struct cpu_model *, centrino_model);
-static DEFINE_PER_CPU(const struct cpu_id *, centrino_cpu);
-
-static struct cpufreq_driver centrino_driver;
-
-#ifdef CONFIG_X86_SPEEDSTEP_CENTRINO_TABLE
-
-/* Computes the correct form for IA32_PERF_CTL MSR for a particular
- frequency/voltage operating point; frequency in MHz, volts in mV.
- This is stored as "index" in the structure. */
-#define OP(mhz, mv) \
- { \
- .frequency = (mhz) * 1000, \
- .index = (((mhz)/100) << 8) | ((mv - 700) / 16) \
- }
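
The index encoding in OP() can be checked by hand. For the 1.30GHz/1388mV operating point listed below, (1300/100) << 8 is 0x0d00 and (1388 - 700)/16 is 0x2b, giving a PERF_CTL index of 0x0d2b:

#include <stdio.h>

#define OP_INDEX(mhz, mv) ((((mhz) / 100) << 8) | (((mv) - 700) / 16))

int main(void)
{
        printf("PERF_CTL index: 0x%04x\n", OP_INDEX(1300, 1388)); /* 0x0d2b */
        return 0;
}
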
-
-/*
- * These voltage tables were derived from the Intel Pentium M
- * datasheet, document 25261202.pdf, Table 5. I have verified they
- * are consistent with my IBM ThinkPad X31, which has a 1.3GHz Pentium
- * M.
- */
-
-/* Ultra Low Voltage Intel Pentium M processor 900MHz (Banias) */
-static struct cpufreq_frequency_table banias_900[] =
-{
- OP(600, 844),
- OP(800, 988),
- OP(900, 1004),
- { .frequency = CPUFREQ_TABLE_END }
-};
-
-/* Ultra Low Voltage Intel Pentium M processor 1000MHz (Banias) */
-static struct cpufreq_frequency_table banias_1000[] =
-{
- OP(600, 844),
- OP(800, 972),
- OP(900, 988),
- OP(1000, 1004),
- { .frequency = CPUFREQ_TABLE_END }
-};
-
-/* Low Voltage Intel Pentium M processor 1.10GHz (Banias) */
-static struct cpufreq_frequency_table banias_1100[] =
-{
- OP( 600, 956),
- OP( 800, 1020),
- OP( 900, 1100),
- OP(1000, 1164),
- OP(1100, 1180),
- { .frequency = CPUFREQ_TABLE_END }
-};
-
-
-/* Low Voltage Intel Pentium M processor 1.20GHz (Banias) */
-static struct cpufreq_frequency_table banias_1200[] =
-{
- OP( 600, 956),
- OP( 800, 1004),
- OP( 900, 1020),
- OP(1000, 1100),
- OP(1100, 1164),
- OP(1200, 1180),
- { .frequency = CPUFREQ_TABLE_END }
-};
-
-/* Intel Pentium M processor 1.30GHz (Banias) */
-static struct cpufreq_frequency_table banias_1300[] =
-{
- OP( 600, 956),
- OP( 800, 1260),
- OP(1000, 1292),
- OP(1200, 1356),
- OP(1300, 1388),
- { .frequency = CPUFREQ_TABLE_END }
-};
-
-/* Intel Pentium M processor 1.40GHz (Banias) */
-static struct cpufreq_frequency_table banias_1400[] =
-{
- OP( 600, 956),
- OP( 800, 1180),
- OP(1000, 1308),
- OP(1200, 1436),
- OP(1400, 1484),
- { .frequency = CPUFREQ_TABLE_END }
-};
-
-/* Intel Pentium M processor 1.50GHz (Banias) */
-static struct cpufreq_frequency_table banias_1500[] =
-{
- OP( 600, 956),
- OP( 800, 1116),
- OP(1000, 1228),
- OP(1200, 1356),
- OP(1400, 1452),
- OP(1500, 1484),
- { .frequency = CPUFREQ_TABLE_END }
-};
-
-/* Intel Pentium M processor 1.60GHz (Banias) */
-static struct cpufreq_frequency_table banias_1600[] =
-{
- OP( 600, 956),
- OP( 800, 1036),
- OP(1000, 1164),
- OP(1200, 1276),
- OP(1400, 1420),
- OP(1600, 1484),
- { .frequency = CPUFREQ_TABLE_END }
-};
-
-/* Intel Pentium M processor 1.70GHz (Banias) */
-static struct cpufreq_frequency_table banias_1700[] =
-{
- OP( 600, 956),
- OP( 800, 1004),
- OP(1000, 1116),
- OP(1200, 1228),
- OP(1400, 1308),
- OP(1700, 1484),
- { .frequency = CPUFREQ_TABLE_END }
-};
-#undef OP
-
-#define _BANIAS(cpuid, max, name) \
-{ .cpu_id = cpuid, \
- .model_name = "Intel(R) Pentium(R) M processor " name "MHz", \
- .max_freq = (max)*1000, \
- .op_points = banias_##max, \
-}
-#define BANIAS(max) _BANIAS(&cpu_ids[CPU_BANIAS], max, #max)
-
-/* CPU models, their operating frequency range, and freq/voltage
- operating points */
-static struct cpu_model models[] =
-{
- _BANIAS(&cpu_ids[CPU_BANIAS], 900, " 900"),
- BANIAS(1000),
- BANIAS(1100),
- BANIAS(1200),
- BANIAS(1300),
- BANIAS(1400),
- BANIAS(1500),
- BANIAS(1600),
- BANIAS(1700),
-
- /* NULL model_name is a wildcard */
- { &cpu_ids[CPU_DOTHAN_A1], NULL, 0, NULL },
- { &cpu_ids[CPU_DOTHAN_A2], NULL, 0, NULL },
- { &cpu_ids[CPU_DOTHAN_B0], NULL, 0, NULL },
- { &cpu_ids[CPU_MP4HT_D0], NULL, 0, NULL },
- { &cpu_ids[CPU_MP4HT_E0], NULL, 0, NULL },
-
- { NULL, }
-};
-#undef _BANIAS
-#undef BANIAS
-
-static int centrino_cpu_init_table(struct cpufreq_policy *policy)
-{
- struct cpuinfo_x86 *cpu = &cpu_data(policy->cpu);
- struct cpu_model *model;
-
- for(model = models; model->cpu_id != NULL; model++)
- if (centrino_verify_cpu_id(cpu, model->cpu_id) &&
- (model->model_name == NULL ||
- strcmp(cpu->x86_model_id, model->model_name) == 0))
- break;
-
- if (model->cpu_id == NULL) {
- /* No match at all */
- dprintk("no support for CPU model \"%s\": "
- "send /proc/cpuinfo to " MAINTAINER "\n",
- cpu->x86_model_id);
- return -ENOENT;
- }
-
- if (model->op_points == NULL) {
- /* Matched a non-match */
- dprintk("no table support for CPU model \"%s\"\n",
- cpu->x86_model_id);
- dprintk("try using the acpi-cpufreq driver\n");
- return -ENOENT;
- }
-
- per_cpu(centrino_model, policy->cpu) = model;
-
- dprintk("found \"%s\": max frequency: %dkHz\n",
- model->model_name, model->max_freq);
-
- return 0;
-}
-
-#else
-static inline int centrino_cpu_init_table(struct cpufreq_policy *policy)
-{
- return -ENODEV;
-}
-#endif /* CONFIG_X86_SPEEDSTEP_CENTRINO_TABLE */
-
-static int centrino_verify_cpu_id(const struct cpuinfo_x86 *c,
- const struct cpu_id *x)
-{
- if ((c->x86 == x->x86) &&
- (c->x86_model == x->x86_model) &&
- (c->x86_mask == x->x86_mask))
- return 1;
- return 0;
-}
-
-/* To be called only after centrino_model is initialized */
-static unsigned extract_clock(unsigned msr, unsigned int cpu, int failsafe)
-{
- int i;
-
- /*
- * Extract clock in kHz from PERF_CTL value
- * for centrino, as some DSDTs are buggy.
- * Ideally, this can be done using the acpi_data structure.
- */
- if ((per_cpu(centrino_cpu, cpu) == &cpu_ids[CPU_BANIAS]) ||
- (per_cpu(centrino_cpu, cpu) == &cpu_ids[CPU_DOTHAN_A1]) ||
- (per_cpu(centrino_cpu, cpu) == &cpu_ids[CPU_DOTHAN_B0])) {
- msr = (msr >> 8) & 0xff;
- return msr * 100000;
- }
-
- if ((!per_cpu(centrino_model, cpu)) ||
- (!per_cpu(centrino_model, cpu)->op_points))
- return 0;
-
- msr &= 0xffff;
- for (i = 0;
- per_cpu(centrino_model, cpu)->op_points[i].frequency
- != CPUFREQ_TABLE_END;
- i++) {
- if (msr == per_cpu(centrino_model, cpu)->op_points[i].index)
- return per_cpu(centrino_model, cpu)->
- op_points[i].frequency;
- }
- if (failsafe)
- return per_cpu(centrino_model, cpu)->op_points[i-1].frequency;
- else
- return 0;
-}
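
For the Banias/Dothan-A1/Dothan-B0 fast path above, the bus-ratio byte of the MSR value times 100 MHz gives the clock directly; for example, a low word of 0x0d2b decodes as ratio 0x0d, i.e. 1300000 kHz:

#include <stdio.h>

int main(void)
{
        unsigned int msr = 0x0d2b; /* sample PERF_STATUS low word */

        printf("clock: %u kHz\n", ((msr >> 8) & 0xff) * 100000); /* 1300000 */
        return 0;
}
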
-
-/* Return the current CPU frequency in kHz */
-static unsigned int get_cur_freq(unsigned int cpu)
-{
- unsigned l, h;
- unsigned clock_freq;
-
- rdmsr_on_cpu(cpu, MSR_IA32_PERF_STATUS, &l, &h);
- clock_freq = extract_clock(l, cpu, 0);
-
- if (unlikely(clock_freq == 0)) {
- /*
- * On some CPUs, we can see transient MSR values (which are
- * not present in _PSS), while CPU is doing some automatic
- * P-state transition (like TM2). Get the last freq set
- * in PERF_CTL.
- */
- rdmsr_on_cpu(cpu, MSR_IA32_PERF_CTL, &l, &h);
- clock_freq = extract_clock(l, cpu, 1);
- }
- return clock_freq;
-}
-
-
-static int centrino_cpu_init(struct cpufreq_policy *policy)
-{
- struct cpuinfo_x86 *cpu = &cpu_data(policy->cpu);
- unsigned freq;
- unsigned l, h;
- int ret;
- int i;
-
- /* Only Intel makes Enhanced Speedstep-capable CPUs */
- if (cpu->x86_vendor != X86_VENDOR_INTEL ||
- !cpu_has(cpu, X86_FEATURE_EST))
- return -ENODEV;
-
- if (cpu_has(cpu, X86_FEATURE_CONSTANT_TSC))
- centrino_driver.flags |= CPUFREQ_CONST_LOOPS;
-
- if (policy->cpu != 0)
- return -ENODEV;
-
- for (i = 0; i < N_IDS; i++)
- if (centrino_verify_cpu_id(cpu, &cpu_ids[i]))
- break;
-
- if (i != N_IDS)
- per_cpu(centrino_cpu, policy->cpu) = &cpu_ids[i];
-
- if (!per_cpu(centrino_cpu, policy->cpu)) {
- dprintk("found unsupported CPU with "
- "Enhanced SpeedStep: send /proc/cpuinfo to "
- MAINTAINER "\n");
- return -ENODEV;
- }
-
- if (centrino_cpu_init_table(policy)) {
- return -ENODEV;
- }
-
- /* Check to see if Enhanced SpeedStep is enabled, and try to
- enable it if not. */
- rdmsr(MSR_IA32_MISC_ENABLE, l, h);
-
- if (!(l & MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP)) {
- l |= MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP;
- dprintk("trying to enable Enhanced SpeedStep (%x)\n", l);
- wrmsr(MSR_IA32_MISC_ENABLE, l, h);
-
- /* check to see if it stuck */
- rdmsr(MSR_IA32_MISC_ENABLE, l, h);
- if (!(l & MSR_IA32_MISC_ENABLE_ENHANCED_SPEEDSTEP)) {
- printk(KERN_INFO PFX
- "couldn't enable Enhanced SpeedStep\n");
- return -ENODEV;
- }
- }
-
- freq = get_cur_freq(policy->cpu);
- policy->cpuinfo.transition_latency = 10000;
- /* 10uS transition latency */
- policy->cur = freq;
-
- dprintk("centrino_cpu_init: cur=%dkHz\n", policy->cur);
-
- ret = cpufreq_frequency_table_cpuinfo(policy,
- per_cpu(centrino_model, policy->cpu)->op_points);
- if (ret)
- return (ret);
-
- cpufreq_frequency_table_get_attr(
- per_cpu(centrino_model, policy->cpu)->op_points, policy->cpu);
-
- return 0;
-}
-
-static int centrino_cpu_exit(struct cpufreq_policy *policy)
-{
- unsigned int cpu = policy->cpu;
-
- if (!per_cpu(centrino_model, cpu))
- return -ENODEV;
-
- cpufreq_frequency_table_put_attr(cpu);
-
- per_cpu(centrino_model, cpu) = NULL;
-
- return 0;
-}
-
-/**
- * centrino_verify - verifies a new CPUFreq policy
- * @policy: new policy
- *
- * Limit must be within this model's frequency range, with at least
- * one border included.
- */
-static int centrino_verify (struct cpufreq_policy *policy)
-{
- return cpufreq_frequency_table_verify(policy,
- per_cpu(centrino_model, policy->cpu)->op_points);
-}
-
-/**
- * centrino_target - set a new CPUFreq policy
- * @policy: new policy
- * @target_freq: the target frequency
- * @relation: how that frequency relates to achieved frequency
- * (CPUFREQ_RELATION_L or CPUFREQ_RELATION_H)
- *
- * Sets a new CPUFreq policy.
- */
-static int centrino_target (struct cpufreq_policy *policy,
- unsigned int target_freq,
- unsigned int relation)
-{
- unsigned int newstate = 0;
- unsigned int msr, oldmsr = 0, h = 0, cpu = policy->cpu;
- struct cpufreq_freqs freqs;
- int retval = 0;
- unsigned int j, k, first_cpu, tmp;
- cpumask_var_t covered_cpus;
-
- if (unlikely(!zalloc_cpumask_var(&covered_cpus, GFP_KERNEL)))
- return -ENOMEM;
-
- if (unlikely(per_cpu(centrino_model, cpu) == NULL)) {
- retval = -ENODEV;
- goto out;
- }
-
- if (unlikely(cpufreq_frequency_table_target(policy,
- per_cpu(centrino_model, cpu)->op_points,
- target_freq,
- relation,
- &newstate))) {
- retval = -EINVAL;
- goto out;
- }
-
- first_cpu = 1;
- for_each_cpu(j, policy->cpus) {
- int good_cpu;
-
- /* cpufreq holds the hotplug lock, so we are safe here */
- if (!cpu_online(j))
- continue;
-
- /*
- * Support for SMP systems.
- * Make sure we are running on CPU that wants to change freq
- */
- if (policy->shared_type == CPUFREQ_SHARED_TYPE_ANY)
- good_cpu = cpumask_any_and(policy->cpus,
- cpu_online_mask);
- else
- good_cpu = j;
-
- if (good_cpu >= nr_cpu_ids) {
- dprintk("couldn't limit to CPUs in this domain\n");
- retval = -EAGAIN;
- if (first_cpu) {
- /* We haven't started the transition yet. */
- goto out;
- }
- break;
- }
-
- msr = per_cpu(centrino_model, cpu)->op_points[newstate].index;
-
- if (first_cpu) {
- rdmsr_on_cpu(good_cpu, MSR_IA32_PERF_CTL, &oldmsr, &h);
- if (msr == (oldmsr & 0xffff)) {
- dprintk("no change needed - msr was and needs "
- "to be %x\n", oldmsr);
- retval = 0;
- goto out;
- }
-
- freqs.old = extract_clock(oldmsr, cpu, 0);
- freqs.new = extract_clock(msr, cpu, 0);
-
- dprintk("target=%dkHz old=%d new=%d msr=%04x\n",
- target_freq, freqs.old, freqs.new, msr);
-
- for_each_cpu(k, policy->cpus) {
- if (!cpu_online(k))
- continue;
- freqs.cpu = k;
- cpufreq_notify_transition(&freqs,
- CPUFREQ_PRECHANGE);
- }
-
- first_cpu = 0;
- /* all but 16 LSB are reserved, treat them with care */
- oldmsr &= ~0xffff;
- msr &= 0xffff;
- oldmsr |= msr;
- }
-
- wrmsr_on_cpu(good_cpu, MSR_IA32_PERF_CTL, oldmsr, h);
- if (policy->shared_type == CPUFREQ_SHARED_TYPE_ANY)
- break;
-
- cpumask_set_cpu(j, covered_cpus);
- }
-
- for_each_cpu(k, policy->cpus) {
- if (!cpu_online(k))
- continue;
- freqs.cpu = k;
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
- }
-
- if (unlikely(retval)) {
- /*
- * We have failed halfway through the frequency change.
- * We have sent callbacks to policy->cpus and
- * MSRs have already been written on covered_cpus.
- * Best effort undo..
- */
-
- for_each_cpu(j, covered_cpus)
- wrmsr_on_cpu(j, MSR_IA32_PERF_CTL, oldmsr, h);
-
- tmp = freqs.new;
- freqs.new = freqs.old;
- freqs.old = tmp;
- for_each_cpu(j, policy->cpus) {
- if (!cpu_online(j))
- continue;
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
- }
- }
- retval = 0;
-
-out:
- free_cpumask_var(covered_cpus);
- return retval;
-}
-
-static struct freq_attr* centrino_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- NULL,
-};
-
-static struct cpufreq_driver centrino_driver = {
- .name = "centrino", /* should be speedstep-centrino,
- but there's a 16 char limit */
- .init = centrino_cpu_init,
- .exit = centrino_cpu_exit,
- .verify = centrino_verify,
- .target = centrino_target,
- .get = get_cur_freq,
- .attr = centrino_attr,
- .owner = THIS_MODULE,
-};
-
-
-/**
- * centrino_init - initializes the Enhanced SpeedStep CPUFreq driver
- *
- * Initializes the Enhanced SpeedStep support. Returns -ENODEV on
- * unsupported devices, -ENOENT if there's no voltage table for this
- * particular CPU model, -EINVAL on problems during initialization,
- * and zero on success.
- *
- * This is quite picky. Not only does the CPU have to advertise the
- * "est" flag in the cpuid capability flags, we look for a specific
- * CPU model and stepping, and we need to have the exact model name in
- * our voltage tables. That is, be paranoid about not releasing
- * someone's valuable magic smoke.
- */
-static int __init centrino_init(void)
-{
- struct cpuinfo_x86 *cpu = &cpu_data(0);
-
- if (!cpu_has(cpu, X86_FEATURE_EST))
- return -ENODEV;
-
- return cpufreq_register_driver(&centrino_driver);
-}
-
-static void __exit centrino_exit(void)
-{
- cpufreq_unregister_driver(&centrino_driver);
-}
-
-MODULE_AUTHOR ("Jeremy Fitzhardinge <jeremy@goop.org>");
-MODULE_DESCRIPTION ("Enhanced SpeedStep driver for Intel Pentium M processors.");
-MODULE_LICENSE ("GPL");
-
-late_initcall(centrino_init);
-module_exit(centrino_exit);
diff --git a/arch/x86/kernel/cpu/cpufreq/speedstep-ich.c b/arch/x86/kernel/cpu/cpufreq/speedstep-ich.c
deleted file mode 100644
index 561758e95180..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/speedstep-ich.c
+++ /dev/null
@@ -1,452 +0,0 @@
-/*
- * (C) 2001 Dave Jones, Arjan van de ven.
- * (C) 2002 - 2003 Dominik Brodowski <linux@brodo.de>
- *
- * Licensed under the terms of the GNU GPL License version 2.
- * Based upon reverse engineered information, and on Intel documentation
- * for chipsets ICH2-M and ICH3-M.
- *
- * Many thanks to Ducrot Bruno for finding and fixing the last
- * "missing link" for ICH2-M/ICH3-M support, and to Thomas Winkler
- * for extensive testing.
- *
- * BIG FAT DISCLAIMER: Work in progress code. Possibly *dangerous*
- */
-
-
-/*********************************************************************
- * SPEEDSTEP - DEFINITIONS *
- *********************************************************************/
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-#include <linux/pci.h>
-#include <linux/sched.h>
-
-#include "speedstep-lib.h"
-
-
-/* speedstep_chipset:
- * It is necessary to know which chipset is used. As accesses to
- * this device occur at various places in this module, we need a
- * static struct pci_dev * pointing to that device.
- */
-static struct pci_dev *speedstep_chipset_dev;
-
-
-/* speedstep_processor
- */
-static enum speedstep_processor speedstep_processor;
-
-static u32 pmbase;
-
-/*
- * There are only two frequency states for each processor. Values
- * are in kHz for the time being.
- */
-static struct cpufreq_frequency_table speedstep_freqs[] = {
- {SPEEDSTEP_HIGH, 0},
- {SPEEDSTEP_LOW, 0},
- {0, CPUFREQ_TABLE_END},
-};
-
-
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "speedstep-ich", msg)
-
-
-/**
- * speedstep_find_register - read the PMBASE address
- *
- * Returns: -ENODEV if no register could be found
- */
-static int speedstep_find_register(void)
-{
- if (!speedstep_chipset_dev)
- return -ENODEV;
-
- /* get PMBASE */
- pci_read_config_dword(speedstep_chipset_dev, 0x40, &pmbase);
- if (!(pmbase & 0x01)) {
- printk(KERN_ERR "speedstep-ich: could not find speedstep register\n");
- return -ENODEV;
- }
-
- pmbase &= 0xFFFFFFFE;
- if (!pmbase) {
- printk(KERN_ERR "speedstep-ich: could not find speedstep register\n");
- return -ENODEV;
- }
-
- dprintk("pmbase is 0x%x\n", pmbase);
- return 0;
-}
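
The masking in speedstep_find_register() reflects the PMBASE layout: bit 0 must be set as an I/O-space indicator, and the remaining bits form the port base, so the SpeedStep state register sits at base + 0x50. A small sketch with an assumed raw value:

#include <stdio.h>

int main(void)
{
        unsigned int raw = 0x00001001; /* assumed config-space dword at 0x40 */

        if (!(raw & 0x01)) /* bit 0 must indicate I/O space */
                return 1;

        unsigned int pmbase = raw & 0xFFFFFFFE;
        printf("pmbase 0x%x, state port 0x%x\n", pmbase, pmbase + 0x50);
        return 0;
}
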
-
-/**
- * speedstep_set_state - set the SpeedStep state
- * @state: new processor frequency state (SPEEDSTEP_LOW or SPEEDSTEP_HIGH)
- *
- * Tries to change the SpeedStep state. Can be called from
- * smp_call_function_single.
- */
-static void speedstep_set_state(unsigned int state)
-{
- u8 pm2_blk;
- u8 value;
- unsigned long flags;
-
- if (state > 0x1)
- return;
-
- /* Disable IRQs */
- local_irq_save(flags);
-
- /* read state */
- value = inb(pmbase + 0x50);
-
- dprintk("read at pmbase 0x%x + 0x50 returned 0x%x\n", pmbase, value);
-
- /* write new state */
- value &= 0xFE;
- value |= state;
-
- dprintk("writing 0x%x to pmbase 0x%x + 0x50\n", value, pmbase);
-
- /* Disable bus master arbitration */
- pm2_blk = inb(pmbase + 0x20);
- pm2_blk |= 0x01;
- outb(pm2_blk, (pmbase + 0x20));
-
- /* Actual transition */
- outb(value, (pmbase + 0x50));
-
- /* Restore bus master arbitration */
- pm2_blk &= 0xfe;
- outb(pm2_blk, (pmbase + 0x20));
-
- /* check if transition was successful */
- value = inb(pmbase + 0x50);
-
- /* Enable IRQs */
- local_irq_restore(flags);
-
- dprintk("read at pmbase 0x%x + 0x50 returned 0x%x\n", pmbase, value);
-
- if (state == (value & 0x1))
- dprintk("change to %u MHz succeeded\n",
- speedstep_get_frequency(speedstep_processor) / 1000);
- else
- printk(KERN_ERR "cpufreq: change failed - I/O error\n");
-
- return;
-}
-
-/* Wrapper for smp_call_function_single. */
-static void _speedstep_set_state(void *_state)
-{
- speedstep_set_state(*(unsigned int *)_state);
-}
-
-/**
- * speedstep_activate - activate SpeedStep control in the chipset
- *
- * Tries to activate the SpeedStep status and control registers.
- * Returns -EINVAL on an unsupported chipset, and zero on success.
- */
-static int speedstep_activate(void)
-{
- u16 value = 0;
-
- if (!speedstep_chipset_dev)
- return -EINVAL;
-
- pci_read_config_word(speedstep_chipset_dev, 0x00A0, &value);
- if (!(value & 0x08)) {
- value |= 0x08;
- dprintk("activating SpeedStep (TM) registers\n");
- pci_write_config_word(speedstep_chipset_dev, 0x00A0, value);
- }
-
- return 0;
-}
-
-
-/**
- * speedstep_detect_chipset - detect the Southbridge which contains SpeedStep logic
- *
- * Detects ICH2-M, ICH3-M and ICH4-M so far. The pci_dev points to
- * the LPC bridge / PM module which contains all power-management
- * functions. Returns the generation number (2, 3 or 4 for ICH2-M,
- * ICH3-M or ICH4-M) of the detected chipset, or zero on failure.
- */
-static unsigned int speedstep_detect_chipset(void)
-{
- speedstep_chipset_dev = pci_get_subsys(PCI_VENDOR_ID_INTEL,
- PCI_DEVICE_ID_INTEL_82801DB_12,
- PCI_ANY_ID, PCI_ANY_ID,
- NULL);
- if (speedstep_chipset_dev)
- return 4; /* 4-M */
-
- speedstep_chipset_dev = pci_get_subsys(PCI_VENDOR_ID_INTEL,
- PCI_DEVICE_ID_INTEL_82801CA_12,
- PCI_ANY_ID, PCI_ANY_ID,
- NULL);
- if (speedstep_chipset_dev)
- return 3; /* 3-M */
-
-
- speedstep_chipset_dev = pci_get_subsys(PCI_VENDOR_ID_INTEL,
- PCI_DEVICE_ID_INTEL_82801BA_10,
- PCI_ANY_ID, PCI_ANY_ID,
- NULL);
- if (speedstep_chipset_dev) {
- /* speedstep.c causes lockups on Dell Inspirons 8000 and
- * 8100 which use a pretty old revision of the 82815
- * host bridge. Abort on these systems.
- */
- static struct pci_dev *hostbridge;
-
- hostbridge = pci_get_subsys(PCI_VENDOR_ID_INTEL,
- PCI_DEVICE_ID_INTEL_82815_MC,
- PCI_ANY_ID, PCI_ANY_ID,
- NULL);
-
- if (!hostbridge)
- return 2; /* 2-M */
-
- if (hostbridge->revision < 5) {
- dprintk("hostbridge does not support speedstep\n");
- speedstep_chipset_dev = NULL;
- pci_dev_put(hostbridge);
- return 0;
- }
-
- pci_dev_put(hostbridge);
- return 2; /* 2-M */
- }
-
- return 0;
-}
-
-static void get_freq_data(void *_speed)
-{
- unsigned int *speed = _speed;
-
- *speed = speedstep_get_frequency(speedstep_processor);
-}
-
-static unsigned int speedstep_get(unsigned int cpu)
-{
- unsigned int speed;
-
- /* You're supposed to ensure CPU is online. */
- if (smp_call_function_single(cpu, get_freq_data, &speed, 1) != 0)
- BUG();
-
- dprintk("detected %u kHz as current frequency\n", speed);
- return speed;
-}
-
-/**
- * speedstep_target - set a new CPUFreq policy
- * @policy: new policy
- * @target_freq: the target frequency
- * @relation: how that frequency relates to achieved frequency
- * (CPUFREQ_RELATION_L or CPUFREQ_RELATION_H)
- *
- * Sets a new CPUFreq policy.
- */
-static int speedstep_target(struct cpufreq_policy *policy,
- unsigned int target_freq,
- unsigned int relation)
-{
- unsigned int newstate = 0, policy_cpu;
- struct cpufreq_freqs freqs;
- int i;
-
- if (cpufreq_frequency_table_target(policy, &speedstep_freqs[0],
- target_freq, relation, &newstate))
- return -EINVAL;
-
- policy_cpu = cpumask_any_and(policy->cpus, cpu_online_mask);
- freqs.old = speedstep_get(policy_cpu);
- freqs.new = speedstep_freqs[newstate].frequency;
- freqs.cpu = policy->cpu;
-
- dprintk("transiting from %u to %u kHz\n", freqs.old, freqs.new);
-
- /* no transition necessary */
- if (freqs.old == freqs.new)
- return 0;
-
- for_each_cpu(i, policy->cpus) {
- freqs.cpu = i;
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
- }
-
- smp_call_function_single(policy_cpu, _speedstep_set_state, &newstate,
- true);
-
- for_each_cpu(i, policy->cpus) {
- freqs.cpu = i;
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
- }
-
- return 0;
-}
-
-
-/**
- * speedstep_verify - verifies a new CPUFreq policy
- * @policy: new policy
- *
- * Limit must be within speedstep_low_freq and speedstep_high_freq, with
- * at least one border included.
- */
-static int speedstep_verify(struct cpufreq_policy *policy)
-{
- return cpufreq_frequency_table_verify(policy, &speedstep_freqs[0]);
-}
-
-struct get_freqs {
- struct cpufreq_policy *policy;
- int ret;
-};
-
-static void get_freqs_on_cpu(void *_get_freqs)
-{
- struct get_freqs *get_freqs = _get_freqs;
-
- get_freqs->ret =
- speedstep_get_freqs(speedstep_processor,
- &speedstep_freqs[SPEEDSTEP_LOW].frequency,
- &speedstep_freqs[SPEEDSTEP_HIGH].frequency,
- &get_freqs->policy->cpuinfo.transition_latency,
- &speedstep_set_state);
-}
-
-static int speedstep_cpu_init(struct cpufreq_policy *policy)
-{
- int result;
- unsigned int policy_cpu, speed;
- struct get_freqs gf;
-
- /* only run on CPU to be set, or on its sibling */
-#ifdef CONFIG_SMP
- cpumask_copy(policy->cpus, cpu_sibling_mask(policy->cpu));
-#endif
- policy_cpu = cpumask_any_and(policy->cpus, cpu_online_mask);
-
- /* detect low and high frequency and transition latency */
- gf.policy = policy;
- smp_call_function_single(policy_cpu, get_freqs_on_cpu, &gf, 1);
- if (gf.ret)
- return gf.ret;
-
- /* get current speed setting */
- speed = speedstep_get(policy_cpu);
- if (!speed)
- return -EIO;
-
- dprintk("currently at %s speed setting - %i MHz\n",
- (speed == speedstep_freqs[SPEEDSTEP_LOW].frequency)
- ? "low" : "high",
- (speed / 1000));
-
- /* cpuinfo and default policy values */
- policy->cur = speed;
-
- result = cpufreq_frequency_table_cpuinfo(policy, speedstep_freqs);
- if (result)
- return result;
-
- cpufreq_frequency_table_get_attr(speedstep_freqs, policy->cpu);
-
- return 0;
-}
-
-
-static int speedstep_cpu_exit(struct cpufreq_policy *policy)
-{
- cpufreq_frequency_table_put_attr(policy->cpu);
- return 0;
-}
-
-static struct freq_attr *speedstep_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- NULL,
-};
-
-
-static struct cpufreq_driver speedstep_driver = {
- .name = "speedstep-ich",
- .verify = speedstep_verify,
- .target = speedstep_target,
- .init = speedstep_cpu_init,
- .exit = speedstep_cpu_exit,
- .get = speedstep_get,
- .owner = THIS_MODULE,
- .attr = speedstep_attr,
-};
-
-
-/**
- * speedstep_init - initializes the SpeedStep CPUFreq driver
- *
- * Initializes the SpeedStep support. Returns -ENODEV on unsupported
- * devices, -EINVAL on problems during initialization, and zero on
- * success.
- */
-static int __init speedstep_init(void)
-{
- /* detect processor */
- speedstep_processor = speedstep_detect_processor();
- if (!speedstep_processor) {
- dprintk("Intel(R) SpeedStep(TM) capable processor "
- "not found\n");
- return -ENODEV;
- }
-
- /* detect chipset */
- if (!speedstep_detect_chipset()) {
- dprintk("Intel(R) SpeedStep(TM) for this chipset not "
- "(yet) available.\n");
- return -ENODEV;
- }
-
- /* activate speedstep support */
- if (speedstep_activate()) {
- pci_dev_put(speedstep_chipset_dev);
- return -EINVAL;
- }
-
- if (speedstep_find_register())
- return -ENODEV;
-
- return cpufreq_register_driver(&speedstep_driver);
-}
-
-
-/**
- * speedstep_exit - unregisters SpeedStep support
- *
- * Unregisters SpeedStep support.
- */
-static void __exit speedstep_exit(void)
-{
- pci_dev_put(speedstep_chipset_dev);
- cpufreq_unregister_driver(&speedstep_driver);
-}
-
-
-MODULE_AUTHOR("Dave Jones <davej@redhat.com>, "
- "Dominik Brodowski <linux@brodo.de>");
-MODULE_DESCRIPTION("Speedstep driver for Intel mobile processors on chipsets "
- "with ICH-M southbridges.");
-MODULE_LICENSE("GPL");
-
-module_init(speedstep_init);
-module_exit(speedstep_exit);
diff --git a/arch/x86/kernel/cpu/cpufreq/speedstep-lib.c b/arch/x86/kernel/cpu/cpufreq/speedstep-lib.c
deleted file mode 100644
index a94ec6be69fa..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/speedstep-lib.c
+++ /dev/null
@@ -1,481 +0,0 @@
-/*
- * (C) 2002 - 2003 Dominik Brodowski <linux@brodo.de>
- *
- * Licensed under the terms of the GNU GPL License version 2.
- *
- * Library for common functions for Intel SpeedStep v.1 and v.2 support
- *
- * BIG FAT DISCLAIMER: Work in progress code. Possibly *dangerous*
- */
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/moduleparam.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-
-#include <asm/msr.h>
-#include <asm/tsc.h>
-#include "speedstep-lib.h"
-
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "speedstep-lib", msg)
-
-#define PFX "speedstep-lib: "
-
-#ifdef CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK
-static int relaxed_check;
-#else
-#define relaxed_check 0
-#endif
-
-/*********************************************************************
- * GET PROCESSOR CORE SPEED IN KHZ *
- *********************************************************************/
-
-static unsigned int pentium3_get_frequency(enum speedstep_processor processor)
-{
- /* See table 14 of p3_ds.pdf and table 22 of 29834003.pdf */
- struct {
- unsigned int ratio; /* Frequency Multiplier (x10) */
- u8 bitmap; /* power on configuration bits
- [27, 25:22] (in MSR 0x2a) */
- } msr_decode_mult[] = {
- { 30, 0x01 },
- { 35, 0x05 },
- { 40, 0x02 },
- { 45, 0x06 },
- { 50, 0x00 },
- { 55, 0x04 },
- { 60, 0x0b },
- { 65, 0x0f },
- { 70, 0x09 },
- { 75, 0x0d },
- { 80, 0x0a },
- { 85, 0x26 },
- { 90, 0x20 },
- { 100, 0x2b },
- { 0, 0xff } /* error or unknown value */
- };
-
- /* PIII(-M) FSB settings: see table b1-b of 24547206.pdf */
- struct {
- unsigned int value; /* Front Side Bus speed in MHz */
- u8 bitmap; /* power on configuration bits [19:18]
- (in MSR 0x2a) */
- } msr_decode_fsb[] = {
- { 66, 0x0 },
- { 100, 0x2 },
- { 133, 0x1 },
- { 0, 0xff}
- };
-
- u32 msr_lo, msr_tmp;
- int i = 0, j = 0;
-
- /* read MSR 0x2a - we only need the low 32 bits */
- rdmsr(MSR_IA32_EBL_CR_POWERON, msr_lo, msr_tmp);
- dprintk("P3 - MSR_IA32_EBL_CR_POWERON: 0x%x 0x%x\n", msr_lo, msr_tmp);
- msr_tmp = msr_lo;
-
- /* decode the FSB */
- msr_tmp &= 0x00c0000;
- msr_tmp >>= 18;
- while (msr_tmp != msr_decode_fsb[i].bitmap) {
- if (msr_decode_fsb[i].bitmap == 0xff)
- return 0;
- i++;
- }
-
- /* decode the multiplier */
- if (processor == SPEEDSTEP_CPU_PIII_C_EARLY) {
- dprintk("workaround for early PIIIs\n");
- msr_lo &= 0x03c00000;
- } else
- msr_lo &= 0x0bc00000;
- msr_lo >>= 22;
- while (msr_lo != msr_decode_mult[j].bitmap) {
- if (msr_decode_mult[j].bitmap == 0xff)
- return 0;
- j++;
- }
-
- dprintk("speed is %u\n",
- (msr_decode_mult[j].ratio * msr_decode_fsb[i].value * 100));
-
- return msr_decode_mult[j].ratio * msr_decode_fsb[i].value * 100;
-}
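
A worked instance of the decode above: a PIII reporting multiplier bitmap 0x2b (ratio 100, i.e. a 10.0x multiplier) on a 133 MHz FSB works out to 1330000 kHz:

#include <stdio.h>

int main(void)
{
        unsigned int ratio = 100; /* frequency multiplier x10 (10.0x) */
        unsigned int fsb = 133;   /* front side bus in MHz */

        printf("core speed: %u kHz\n", ratio * fsb * 100); /* 1330000 */
        return 0;
}
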
-
-
-static unsigned int pentiumM_get_frequency(void)
-{
- u32 msr_lo, msr_tmp;
-
- rdmsr(MSR_IA32_EBL_CR_POWERON, msr_lo, msr_tmp);
- dprintk("PM - MSR_IA32_EBL_CR_POWERON: 0x%x 0x%x\n", msr_lo, msr_tmp);
-
- /* see table B-2 of 24547212.pdf */
- if (msr_lo & 0x00040000) {
- printk(KERN_DEBUG PFX "PM - invalid FSB: 0x%x 0x%x\n",
- msr_lo, msr_tmp);
- return 0;
- }
-
- msr_tmp = (msr_lo >> 22) & 0x1f;
- dprintk("bits 22-26 are 0x%x, speed is %u\n",
- msr_tmp, (msr_tmp * 100 * 1000));
-
- return msr_tmp * 100 * 1000;
-}
-
-static unsigned int pentium_core_get_frequency(void)
-{
- u32 fsb = 0;
- u32 msr_lo, msr_tmp;
- int ret;
-
- rdmsr(MSR_FSB_FREQ, msr_lo, msr_tmp);
- /* see table B-2 of 25366920.pdf */
- switch (msr_lo & 0x07) {
- case 5:
- fsb = 100000;
- break;
- case 1:
- fsb = 133333;
- break;
- case 3:
- fsb = 166667;
- break;
- case 2:
- fsb = 200000;
- break;
- case 0:
- fsb = 266667;
- break;
- case 4:
- fsb = 333333;
- break;
- default:
- printk(KERN_ERR "PCORE - MSR_FSB_FREQ undefined value");
- }
-
- rdmsr(MSR_IA32_EBL_CR_POWERON, msr_lo, msr_tmp);
- dprintk("PCORE - MSR_IA32_EBL_CR_POWERON: 0x%x 0x%x\n",
- msr_lo, msr_tmp);
-
- msr_tmp = (msr_lo >> 22) & 0x1f;
- dprintk("bits 22-26 are 0x%x, speed is %u\n",
- msr_tmp, (msr_tmp * fsb));
-
- ret = (msr_tmp * fsb);
- return ret;
-}
-
-
-static unsigned int pentium4_get_frequency(void)
-{
- struct cpuinfo_x86 *c = &boot_cpu_data;
- u32 msr_lo, msr_hi, mult;
- unsigned int fsb = 0;
- unsigned int ret;
- u8 fsb_code;
-
- /* Pentium 4 Model 0 and 1 do not have the Core Clock Frequency
- * to System Bus Frequency Ratio Field in the Processor Frequency
- * Configuration Register of the MSR. Therefore the current
- * frequency cannot be calculated and has to be measured.
- */
- if (c->x86_model < 2)
- return cpu_khz;
-
- rdmsr(0x2c, msr_lo, msr_hi);
-
- dprintk("P4 - MSR_EBC_FREQUENCY_ID: 0x%x 0x%x\n", msr_lo, msr_hi);
-
- /* decode the FSB: see IA-32 Intel (C) Architecture Software
- * Developer's Manual, Volume 3: System Programming Guide,
- * revision #12 in Table B-1: MSRs in the Pentium 4 and
- * Intel Xeon Processors, on page B-4 and B-5.
- */
- fsb_code = (msr_lo >> 16) & 0x7;
- switch (fsb_code) {
- case 0:
- fsb = 100 * 1000;
- break;
- case 1:
- fsb = 13333 * 10;
- break;
- case 2:
- fsb = 200 * 1000;
- break;
- }
-
- if (!fsb)
- printk(KERN_DEBUG PFX "couldn't detect FSB speed. "
- "Please send an e-mail to <linux@brodo.de>\n");
-
- /* Multiplier. */
- mult = msr_lo >> 24;
-
- dprintk("P4 - FSB %u kHz; Multiplier %u; Speed %u kHz\n",
- fsb, mult, (fsb * mult));
-
- ret = (fsb * mult);
- return ret;
-}
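
The Pentium 4 path in numbers: fsb_code 1 maps to 13333 * 10 = 133330 kHz, so with an assumed multiplier of 18 in bits 31:24 the result is 2399940 kHz, a nominal 2.4 GHz part:

#include <stdio.h>

int main(void)
{
        unsigned int fsb = 13333 * 10; /* fsb_code 1: 133 MHz bus, in kHz */
        unsigned int mult = 18;        /* assumed value of msr_lo >> 24 */

        printf("core speed: %u kHz\n", fsb * mult); /* 2399940 */
        return 0;
}
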
-
-
-/* Warning: may get called from smp_call_function_single. */
-unsigned int speedstep_get_frequency(enum speedstep_processor processor)
-{
- switch (processor) {
- case SPEEDSTEP_CPU_PCORE:
- return pentium_core_get_frequency();
- case SPEEDSTEP_CPU_PM:
- return pentiumM_get_frequency();
- case SPEEDSTEP_CPU_P4D:
- case SPEEDSTEP_CPU_P4M:
- return pentium4_get_frequency();
- case SPEEDSTEP_CPU_PIII_T:
- case SPEEDSTEP_CPU_PIII_C:
- case SPEEDSTEP_CPU_PIII_C_EARLY:
- return pentium3_get_frequency(processor);
- default:
- return 0;
- };
- return 0;
-}
-EXPORT_SYMBOL_GPL(speedstep_get_frequency);
-
-
-/*********************************************************************
- * DETECT SPEEDSTEP-CAPABLE PROCESSOR *
- *********************************************************************/
-
-unsigned int speedstep_detect_processor(void)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
- u32 ebx, msr_lo, msr_hi;
-
- dprintk("x86: %x, model: %x\n", c->x86, c->x86_model);
-
- if ((c->x86_vendor != X86_VENDOR_INTEL) ||
- ((c->x86 != 6) && (c->x86 != 0xF)))
- return 0;
-
- if (c->x86 == 0xF) {
- /* Intel Mobile Pentium 4-M
- * or Intel Mobile Pentium 4 with 533 MHz FSB */
- if (c->x86_model != 2)
- return 0;
-
- ebx = cpuid_ebx(0x00000001);
- ebx &= 0x000000FF;
-
- dprintk("ebx value is %x, x86_mask is %x\n", ebx, c->x86_mask);
-
- switch (c->x86_mask) {
- case 4:
- /*
- * B-stepping [M-P4-M]
- * sample has ebx = 0x0f, production has 0x0e.
- */
- if ((ebx == 0x0e) || (ebx == 0x0f))
- return SPEEDSTEP_CPU_P4M;
- break;
- case 7:
- /*
- * C-stepping [M-P4-M]
- * needs to have ebx=0x0e, else it's a celeron:
- * cf. 25130917.pdf / page 7, footnote 5 even
- * though 25072120.pdf / page 7 doesn't say
- * samples are only of B-stepping...
- */
- if (ebx == 0x0e)
- return SPEEDSTEP_CPU_P4M;
- break;
- case 9:
- /*
- * D-stepping [M-P4-M or M-P4/533]
- *
- * this is totally strange: CPUID 0x0F29 is
- * used by M-P4-M, M-P4/533 and(!) Celeron CPUs.
- * The latter need to be sorted out as they don't
- * support speedstep.
- * Celerons with CPUID 0x0F29 may have either
- * ebx=0x8 or 0xf -- 25130917.pdf doesn't say anything
- * specific.
- * M-P4-Ms may have either ebx=0xe or 0xf [see above]
- * M-P4/533 have either ebx=0xe or 0xf. [25317607.pdf]
- * also, M-P4M HTs have ebx=0x8, too
- * For now, they are distinguished by the model_id
- * string
- */
- if ((ebx == 0x0e) ||
- (strstr(c->x86_model_id,
- "Mobile Intel(R) Pentium(R) 4") != NULL))
- return SPEEDSTEP_CPU_P4M;
- break;
- default:
- break;
- }
- return 0;
- }
-
- switch (c->x86_model) {
- case 0x0B: /* Intel PIII [Tualatin] */
- /* cpuid_ebx(1) is 0x04 for desktop PIII,
- * 0x06 for mobile PIII-M */
- ebx = cpuid_ebx(0x00000001);
- dprintk("ebx is %x\n", ebx);
-
- ebx &= 0x000000FF;
-
- if (ebx != 0x06)
- return 0;
-
- /* So far all PIII-M processors support SpeedStep. See
- * Intel's 24540640.pdf of June 2003
- */
- return SPEEDSTEP_CPU_PIII_T;
-
- case 0x08: /* Intel PIII [Coppermine] */
-
- /* all mobile PIII Coppermines have FSB 100 MHz
- * ==> sort out a few desktop PIIIs. */
- rdmsr(MSR_IA32_EBL_CR_POWERON, msr_lo, msr_hi);
- dprintk("Coppermine: MSR_IA32_EBL_CR_POWERON is 0x%x, 0x%x\n",
- msr_lo, msr_hi);
- msr_lo &= 0x00c0000;
- if (msr_lo != 0x0080000)
- return 0;
-
- /*
- * If the processor is a mobile version,
- * platform ID has bit 50 set
- * it has SpeedStep technology if either
- * bit 56 or 57 is set
- */
- rdmsr(MSR_IA32_PLATFORM_ID, msr_lo, msr_hi);
- dprintk("Coppermine: MSR_IA32_PLATFORM ID is 0x%x, 0x%x\n",
- msr_lo, msr_hi);
- if ((msr_hi & (1<<18)) &&
- (relaxed_check ? 1 : (msr_hi & (3<<24)))) {
- if (c->x86_mask == 0x01) {
- dprintk("early PIII version\n");
- return SPEEDSTEP_CPU_PIII_C_EARLY;
- } else
- return SPEEDSTEP_CPU_PIII_C;
- }
-
- default:
- return 0;
- }
-}
-EXPORT_SYMBOL_GPL(speedstep_detect_processor);
-
-
-/*********************************************************************
- * DETECT SPEEDSTEP SPEEDS *
- *********************************************************************/
-
-unsigned int speedstep_get_freqs(enum speedstep_processor processor,
- unsigned int *low_speed,
- unsigned int *high_speed,
- unsigned int *transition_latency,
- void (*set_state) (unsigned int state))
-{
- unsigned int prev_speed;
- unsigned int ret = 0;
- unsigned long flags;
- struct timeval tv1, tv2;
-
- if ((!processor) || (!low_speed) || (!high_speed) || (!set_state))
- return -EINVAL;
-
- dprintk("trying to determine both speeds\n");
-
- /* get current speed */
- prev_speed = speedstep_get_frequency(processor);
- if (!prev_speed)
- return -EIO;
-
- dprintk("previous speed is %u\n", prev_speed);
-
- local_irq_save(flags);
-
- /* switch to low state */
- set_state(SPEEDSTEP_LOW);
- *low_speed = speedstep_get_frequency(processor);
- if (!*low_speed) {
- ret = -EIO;
- goto out;
- }
-
- dprintk("low speed is %u\n", *low_speed);
-
- /* start latency measurement */
- if (transition_latency)
- do_gettimeofday(&tv1);
-
- /* switch to high state */
- set_state(SPEEDSTEP_HIGH);
-
- /* end latency measurement */
- if (transition_latency)
- do_gettimeofday(&tv2);
-
- *high_speed = speedstep_get_frequency(processor);
- if (!*high_speed) {
- ret = -EIO;
- goto out;
- }
-
- dprintk("high speed is %u\n", *high_speed);
-
- if (*low_speed == *high_speed) {
- ret = -ENODEV;
- goto out;
- }
-
- /* switch to previous state, if necessary */
- if (*high_speed != prev_speed)
- set_state(SPEEDSTEP_LOW);
-
- if (transition_latency) {
- *transition_latency = (tv2.tv_sec - tv1.tv_sec) * USEC_PER_SEC +
- tv2.tv_usec - tv1.tv_usec;
- dprintk("transition latency is %u uSec\n", *transition_latency);
-
- /* convert uSec to nSec and add 20% for safety reasons */
- *transition_latency *= 1200;
-
- /* check if the latency measurement is too high or too low
- * and set it to a safe value (500uSec) in that case
- */
- if (*transition_latency > 10000000 ||
- *transition_latency < 50000) {
-			printk(KERN_WARNING PFX "measured frequency "
-					"transition seems out of range "
-					"(%u nSec), falling back to a safe "
-					"value of %u nSec.\n",
-					*transition_latency, 500000);
- *transition_latency = 500000;
- }
- }
-
-out:
- local_irq_restore(flags);
- return ret;
-}
-EXPORT_SYMBOL_GPL(speedstep_get_freqs);
-
-#ifdef CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK
-module_param(relaxed_check, int, 0444);
-MODULE_PARM_DESC(relaxed_check,
- "Don't do all checks for speedstep capability.");
-#endif
-
-MODULE_AUTHOR("Dominik Brodowski <linux@brodo.de>");
-MODULE_DESCRIPTION("Library for Intel SpeedStep 1 or 2 cpufreq drivers.");
-MODULE_LICENSE("GPL");
diff --git a/arch/x86/kernel/cpu/cpufreq/speedstep-lib.h b/arch/x86/kernel/cpu/cpufreq/speedstep-lib.h
deleted file mode 100644
index 70d9cea1219d..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/speedstep-lib.h
+++ /dev/null
@@ -1,49 +0,0 @@
-/*
- * (C) 2002 - 2003 Dominik Brodowski <linux@brodo.de>
- *
- * Licensed under the terms of the GNU GPL License version 2.
- *
- * Library for common functions for Intel SpeedStep v.1 and v.2 support
- *
- * BIG FAT DISCLAIMER: Work in progress code. Possibly *dangerous*
- */
-
-
-
-/* processors */
-enum speedstep_processor {
- SPEEDSTEP_CPU_PIII_C_EARLY = 0x00000001, /* Coppermine core */
- SPEEDSTEP_CPU_PIII_C = 0x00000002, /* Coppermine core */
- SPEEDSTEP_CPU_PIII_T = 0x00000003, /* Tualatin core */
- SPEEDSTEP_CPU_P4M = 0x00000004, /* P4-M */
-/* the following processors are not speedstep-capable and are not auto-detected
- * in speedstep_detect_processor(). However, their speed can be detected using
- * the speedstep_get_frequency() call. */
- SPEEDSTEP_CPU_PM = 0xFFFFFF03, /* Pentium M */
- SPEEDSTEP_CPU_P4D = 0xFFFFFF04, /* desktop P4 */
- SPEEDSTEP_CPU_PCORE = 0xFFFFFF05, /* Core */
-};
-
-/* speedstep states -- only two of them */
-
-#define SPEEDSTEP_HIGH 0x00000000
-#define SPEEDSTEP_LOW 0x00000001
-
-
-/* detect a speedstep-capable processor */
-extern enum speedstep_processor speedstep_detect_processor(void);
-
-/* detect the current speed (in khz) of the processor */
-extern unsigned int speedstep_get_frequency(enum speedstep_processor processor);
-
-
-/* detect the low and high speeds of the processor. The callback
- * set_state"'s first argument is either SPEEDSTEP_HIGH or
- * SPEEDSTEP_LOW; the second argument is zero so that no
- * cpufreq_notify_transition calls are initiated.
- */
-extern unsigned int speedstep_get_freqs(enum speedstep_processor processor,
- unsigned int *low_speed,
- unsigned int *high_speed,
- unsigned int *transition_latency,
- void (*set_state) (unsigned int state));
diff --git a/arch/x86/kernel/cpu/cpufreq/speedstep-smi.c b/arch/x86/kernel/cpu/cpufreq/speedstep-smi.c
deleted file mode 100644
index 8abd869baabf..000000000000
--- a/arch/x86/kernel/cpu/cpufreq/speedstep-smi.c
+++ /dev/null
@@ -1,467 +0,0 @@
-/*
- * Intel SpeedStep SMI driver.
- *
- * (C) 2003 Hiroshi Miura <miura@da-cha.org>
- *
- * Licensed under the terms of the GNU GPL License version 2.
- *
- */
-
-
-/*********************************************************************
- * SPEEDSTEP - DEFINITIONS *
- *********************************************************************/
-
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/moduleparam.h>
-#include <linux/init.h>
-#include <linux/cpufreq.h>
-#include <linux/delay.h>
-#include <linux/io.h>
-#include <asm/ist.h>
-
-#include "speedstep-lib.h"
-
-/* SpeedStep system management interface port/command.
- *
- * These parameters are obtained from the IST-SMI BIOS call.
- * If the user supplies them as module parameters, those values
- * are used instead.
- */
-static int smi_port;
-static int smi_cmd;
-static unsigned int smi_sig;
-
-/* info about the processor */
-static enum speedstep_processor speedstep_processor;
-
-/*
- * There are only two frequency states for each processor. Values
- * are in kHz for the time being.
- */
-static struct cpufreq_frequency_table speedstep_freqs[] = {
- {SPEEDSTEP_HIGH, 0},
- {SPEEDSTEP_LOW, 0},
- {0, CPUFREQ_TABLE_END},
-};
-
-#define GET_SPEEDSTEP_OWNER 0
-#define GET_SPEEDSTEP_STATE 1
-#define SET_SPEEDSTEP_STATE 2
-#define GET_SPEEDSTEP_FREQS 4
-
-/* How often should the SMI call be retried if it fails, e.g. because
- * of ongoing DMA activity? */
-#define SMI_TRIES 5
-
-#define dprintk(msg...) cpufreq_debug_printk(CPUFREQ_DEBUG_DRIVER, \
- "speedstep-smi", msg)
-
-/**
- * speedstep_smi_ownership - try to acquire ownership of the SMI interface
- */
-static int speedstep_smi_ownership(void)
-{
- u32 command, result, magic, dummy;
- u32 function = GET_SPEEDSTEP_OWNER;
- unsigned char magic_data[] = "Copyright (c) 1999 Intel Corporation";
-
- command = (smi_sig & 0xffffff00) | (smi_cmd & 0xff);
- magic = virt_to_phys(magic_data);
-
- dprintk("trying to obtain ownership with command %x at port %x\n",
- command, smi_port);
-
- __asm__ __volatile__(
- "push %%ebp\n"
- "out %%al, (%%dx)\n"
- "pop %%ebp\n"
- : "=D" (result),
- "=a" (dummy), "=b" (dummy), "=c" (dummy), "=d" (dummy),
- "=S" (dummy)
- : "a" (command), "b" (function), "c" (0), "d" (smi_port),
- "D" (0), "S" (magic)
- : "memory"
- );
-
- dprintk("result is %x\n", result);
-
- return result;
-}
-
-/**
- * speedstep_smi_get_freqs - get the low & high SpeedStep frequencies
- * @low: the low frequency value is placed here
- * @high: the high frequency value is placed here
- *
- * Only available on later SpeedStep-enabled systems, returns false results or
- * even hangs [cf. bugme.osdl.org # 1422] on earlier systems. Empirical testing
- * shows that the latter occurs if !(ist_info.event & 0xFFFF).
- */
-static int speedstep_smi_get_freqs(unsigned int *low, unsigned int *high)
-{
- u32 command, result = 0, edi, high_mhz, low_mhz, dummy;
- u32 state = 0;
- u32 function = GET_SPEEDSTEP_FREQS;
-
- if (!(ist_info.event & 0xFFFF)) {
- dprintk("bug #1422 -- can't read freqs from BIOS\n");
- return -ENODEV;
- }
-
- command = (smi_sig & 0xffffff00) | (smi_cmd & 0xff);
-
- dprintk("trying to determine frequencies with command %x at port %x\n",
- command, smi_port);
-
- __asm__ __volatile__(
- "push %%ebp\n"
- "out %%al, (%%dx)\n"
- "pop %%ebp"
- : "=a" (result),
- "=b" (high_mhz),
- "=c" (low_mhz),
- "=d" (state), "=D" (edi), "=S" (dummy)
- : "a" (command),
- "b" (function),
- "c" (state),
- "d" (smi_port), "S" (0), "D" (0)
- );
-
- dprintk("result %x, low_freq %u, high_freq %u\n",
- result, low_mhz, high_mhz);
-
- /* abort if results are obviously incorrect... */
- if ((high_mhz + low_mhz) < 600)
- return -EINVAL;
-
- *high = high_mhz * 1000;
- *low = low_mhz * 1000;
-
- return result;
-}
-
-/**
- * speedstep_get_state - read the current SpeedStep state
- *
- * Returns SPEEDSTEP_LOW or SPEEDSTEP_HIGH.
- */
-static int speedstep_get_state(void)
-{
- u32 function = GET_SPEEDSTEP_STATE;
- u32 result, state, edi, command, dummy;
-
- command = (smi_sig & 0xffffff00) | (smi_cmd & 0xff);
-
- dprintk("trying to determine current setting with command %x "
- "at port %x\n", command, smi_port);
-
- __asm__ __volatile__(
- "push %%ebp\n"
- "out %%al, (%%dx)\n"
- "pop %%ebp\n"
- : "=a" (result),
- "=b" (state), "=D" (edi),
- "=c" (dummy), "=d" (dummy), "=S" (dummy)
- : "a" (command), "b" (function), "c" (0),
- "d" (smi_port), "S" (0), "D" (0)
- );
-
- dprintk("state is %x, result is %x\n", state, result);
-
- return state & 1;
-}
-
-
-/**
- * speedstep_set_state - set the SpeedStep state
- * @state: new processor frequency state (SPEEDSTEP_LOW or SPEEDSTEP_HIGH)
- *
- */
-static void speedstep_set_state(unsigned int state)
-{
- unsigned int result = 0, command, new_state, dummy;
- unsigned long flags;
- unsigned int function = SET_SPEEDSTEP_STATE;
- unsigned int retry = 0;
-
- if (state > 0x1)
- return;
-
- /* Disable IRQs */
- local_irq_save(flags);
-
- command = (smi_sig & 0xffffff00) | (smi_cmd & 0xff);
-
- dprintk("trying to set frequency to state %u "
- "with command %x at port %x\n",
- state, command, smi_port);
-
- do {
- if (retry) {
- dprintk("retry %u, previous result %u, waiting...\n",
- retry, result);
- mdelay(retry * 50);
- }
- retry++;
- __asm__ __volatile__(
- "push %%ebp\n"
- "out %%al, (%%dx)\n"
- "pop %%ebp"
- : "=b" (new_state), "=D" (result),
- "=c" (dummy), "=a" (dummy),
- "=d" (dummy), "=S" (dummy)
- : "a" (command), "b" (function), "c" (state),
- "d" (smi_port), "S" (0), "D" (0)
- );
- } while ((new_state != state) && (retry <= SMI_TRIES));
-
- /* enable IRQs */
- local_irq_restore(flags);
-
- if (new_state == state)
- dprintk("change to %u MHz succeeded after %u tries "
- "with result %u\n",
- (speedstep_freqs[new_state].frequency / 1000),
- retry, result);
- else
- printk(KERN_ERR "cpufreq: change to state %u "
- "failed with new_state %u and result %u\n",
- state, new_state, result);
-
- return;
-}
-
-
-/**
- * speedstep_target - set a new CPUFreq policy
- * @policy: new policy
- * @target_freq: new freq
- * @relation: CPUFREQ_RELATION_* constant selecting how target_freq is rounded
- *
- * Sets a new CPUFreq policy/freq.
- */
-static int speedstep_target(struct cpufreq_policy *policy,
- unsigned int target_freq, unsigned int relation)
-{
- unsigned int newstate = 0;
- struct cpufreq_freqs freqs;
-
- if (cpufreq_frequency_table_target(policy, &speedstep_freqs[0],
- target_freq, relation, &newstate))
- return -EINVAL;
-
- freqs.old = speedstep_freqs[speedstep_get_state()].frequency;
- freqs.new = speedstep_freqs[newstate].frequency;
-	freqs.cpu = 0; /* speedstep-smi is a UP-only driver */
-
- if (freqs.old == freqs.new)
- return 0;
-
- cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE);
- speedstep_set_state(newstate);
- cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE);
-
- return 0;
-}
-
-
-/**
- * speedstep_verify - verifies a new CPUFreq policy
- * @policy: new policy
- *
- * Limits must be within the low and high SpeedStep frequencies, with
- * at least one border included.
- */
-static int speedstep_verify(struct cpufreq_policy *policy)
-{
- return cpufreq_frequency_table_verify(policy, &speedstep_freqs[0]);
-}
-
-
-static int speedstep_cpu_init(struct cpufreq_policy *policy)
-{
- int result;
- unsigned int speed, state;
- unsigned int *low, *high;
-
- /* capability check */
- if (policy->cpu != 0)
- return -ENODEV;
-
- result = speedstep_smi_ownership();
- if (result) {
- dprintk("fails in aquiring ownership of a SMI interface.\n");
- return -EINVAL;
- }
-
- /* detect low and high frequency */
- low = &speedstep_freqs[SPEEDSTEP_LOW].frequency;
- high = &speedstep_freqs[SPEEDSTEP_HIGH].frequency;
-
- result = speedstep_smi_get_freqs(low, high);
- if (result) {
-		/* fall back to the speedstep-lib.c detection mechanism:
-		 * try both states out */
- dprintk("could not detect low and high frequencies "
- "by SMI call.\n");
- result = speedstep_get_freqs(speedstep_processor,
- low, high,
- NULL,
- &speedstep_set_state);
-
- if (result) {
- dprintk("could not detect two different speeds"
- " -- aborting.\n");
- return result;
- } else
- dprintk("workaround worked.\n");
- }
-
- /* get current speed setting */
- state = speedstep_get_state();
- speed = speedstep_freqs[state].frequency;
-
- dprintk("currently at %s speed setting - %i MHz\n",
- (speed == speedstep_freqs[SPEEDSTEP_LOW].frequency)
- ? "low" : "high",
- (speed / 1000));
-
- /* cpuinfo and default policy values */
- policy->cpuinfo.transition_latency = CPUFREQ_ETERNAL;
- policy->cur = speed;
-
- result = cpufreq_frequency_table_cpuinfo(policy, speedstep_freqs);
- if (result)
- return result;
-
- cpufreq_frequency_table_get_attr(speedstep_freqs, policy->cpu);
-
- return 0;
-}
-
-static int speedstep_cpu_exit(struct cpufreq_policy *policy)
-{
- cpufreq_frequency_table_put_attr(policy->cpu);
- return 0;
-}
-
-static unsigned int speedstep_get(unsigned int cpu)
-{
- if (cpu)
- return -ENODEV;
- return speedstep_get_frequency(speedstep_processor);
-}
-
-
-static int speedstep_resume(struct cpufreq_policy *policy)
-{
- int result = speedstep_smi_ownership();
-
- if (result)
- dprintk("fails in re-aquiring ownership of a SMI interface.\n");
-
- return result;
-}
-
-static struct freq_attr *speedstep_attr[] = {
- &cpufreq_freq_attr_scaling_available_freqs,
- NULL,
-};
-
-static struct cpufreq_driver speedstep_driver = {
- .name = "speedstep-smi",
- .verify = speedstep_verify,
- .target = speedstep_target,
- .init = speedstep_cpu_init,
- .exit = speedstep_cpu_exit,
- .get = speedstep_get,
- .resume = speedstep_resume,
- .owner = THIS_MODULE,
- .attr = speedstep_attr,
-};
-
-/**
- * speedstep_init - initializes the SpeedStep CPUFreq driver
- *
- * Initializes the SpeedStep support. Returns -ENODEV on unsupported
- * BIOS, -EINVAL on problems during initialization, and zero on
- * success.
- */
-static int __init speedstep_init(void)
-{
- speedstep_processor = speedstep_detect_processor();
-
- switch (speedstep_processor) {
- case SPEEDSTEP_CPU_PIII_T:
- case SPEEDSTEP_CPU_PIII_C:
- case SPEEDSTEP_CPU_PIII_C_EARLY:
- break;
- default:
- speedstep_processor = 0;
- }
-
- if (!speedstep_processor) {
- dprintk("No supported Intel CPU detected.\n");
- return -ENODEV;
- }
-
- dprintk("signature:0x%.8lx, command:0x%.8lx, "
- "event:0x%.8lx, perf_level:0x%.8lx.\n",
- ist_info.signature, ist_info.command,
- ist_info.event, ist_info.perf_level);
-
-	/* Error out if there is no IST-SMI BIOS and the user did not
-	 * supply both smi_port and smi_cmd;
-	 * sig = 'ISGE' aka 'Intel Speedstep Gate E' */
- if ((ist_info.signature != 0x47534943) && (
- (smi_port == 0) || (smi_cmd == 0)))
- return -ENODEV;
-
- if (smi_sig == 1)
- smi_sig = 0x47534943;
- else
- smi_sig = ist_info.signature;
-
-	/* set up smi_port from the module parameter or the BIOS */
- if ((smi_port > 0xff) || (smi_port < 0))
- return -EINVAL;
- else if (smi_port == 0)
- smi_port = ist_info.command & 0xff;
-
- if ((smi_cmd > 0xff) || (smi_cmd < 0))
- return -EINVAL;
- else if (smi_cmd == 0)
- smi_cmd = (ist_info.command >> 16) & 0xff;
-
- return cpufreq_register_driver(&speedstep_driver);
-}
-
-
-/**
- * speedstep_exit - unregisters SpeedStep support
- *
- * Unregisters SpeedStep support.
- */
-static void __exit speedstep_exit(void)
-{
- cpufreq_unregister_driver(&speedstep_driver);
-}
-
-module_param(smi_port, int, 0444);
-module_param(smi_cmd, int, 0444);
-module_param(smi_sig, uint, 0444);
-
-MODULE_PARM_DESC(smi_port, "Override the BIOS-given IST port with this value "
- "-- Intel's default setting is 0xb2");
-MODULE_PARM_DESC(smi_cmd, "Override the BIOS-given IST command with this value "
- "-- Intel's default setting is 0x82");
-MODULE_PARM_DESC(smi_sig, "Set to 1 to fake the IST signature when using the "
- "SMI interface.");
-
-MODULE_AUTHOR("Hiroshi Miura");
-MODULE_DESCRIPTION("Speedstep driver for IST applet SMI interface.");
-MODULE_LICENSE("GPL");
-
-module_init(speedstep_init);
-module_exit(speedstep_exit);
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
new file mode 100644
index 000000000000..146f6f8b0650
--- /dev/null
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -0,0 +1,192 @@
+/* Declare dependencies between CPUIDs */
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <asm/cpufeature.h>
+
+struct cpuid_dep {
+ unsigned int feature;
+ unsigned int depends;
+};
+
+/*
+ * Table of CPUID features that depend on others.
+ *
+ * This only includes dependencies that can be usefully disabled, not
+ * features part of the base set (like FPU).
+ *
+ * Note this all is not __init / __initdata because it can be
+ * called from cpu hotplug. It shouldn't do anything in this case,
+ * but it's difficult to tell that to the init reference checker.
+ */
+static const struct cpuid_dep cpuid_deps[] = {
+ { X86_FEATURE_FXSR, X86_FEATURE_FPU },
+ { X86_FEATURE_XSAVEOPT, X86_FEATURE_XSAVE },
+ { X86_FEATURE_XSAVEC, X86_FEATURE_XSAVE },
+ { X86_FEATURE_XSAVES, X86_FEATURE_XSAVE },
+ { X86_FEATURE_AVX, X86_FEATURE_XSAVE },
+ { X86_FEATURE_PKU, X86_FEATURE_XSAVE },
+ { X86_FEATURE_MPX, X86_FEATURE_XSAVE },
+ { X86_FEATURE_XGETBV1, X86_FEATURE_XSAVE },
+ { X86_FEATURE_APX, X86_FEATURE_XSAVE },
+ { X86_FEATURE_CMOV, X86_FEATURE_FXSR },
+ { X86_FEATURE_MMX, X86_FEATURE_FXSR },
+ { X86_FEATURE_MMXEXT, X86_FEATURE_MMX },
+ { X86_FEATURE_FXSR_OPT, X86_FEATURE_FXSR },
+ { X86_FEATURE_XSAVE, X86_FEATURE_FXSR },
+ { X86_FEATURE_XMM, X86_FEATURE_FXSR },
+ { X86_FEATURE_XMM2, X86_FEATURE_XMM },
+ { X86_FEATURE_XMM3, X86_FEATURE_XMM2 },
+ { X86_FEATURE_XMM4_1, X86_FEATURE_XMM2 },
+ { X86_FEATURE_XMM4_2, X86_FEATURE_XMM2 },
+ { X86_FEATURE_PCLMULQDQ, X86_FEATURE_XMM2 },
+ { X86_FEATURE_SSSE3, X86_FEATURE_XMM2, },
+ { X86_FEATURE_F16C, X86_FEATURE_XMM2, },
+ { X86_FEATURE_AES, X86_FEATURE_XMM2 },
+ { X86_FEATURE_SHA_NI, X86_FEATURE_XMM2 },
+ { X86_FEATURE_GFNI, X86_FEATURE_XMM2 },
+ { X86_FEATURE_AVX_VNNI, X86_FEATURE_AVX },
+ { X86_FEATURE_FMA, X86_FEATURE_AVX },
+ { X86_FEATURE_VAES, X86_FEATURE_AVX },
+ { X86_FEATURE_VPCLMULQDQ, X86_FEATURE_AVX },
+ { X86_FEATURE_AVX2, X86_FEATURE_AVX, },
+ { X86_FEATURE_AVX512F, X86_FEATURE_AVX, },
+ { X86_FEATURE_AVX512IFMA, X86_FEATURE_AVX512F },
+ { X86_FEATURE_AVX512PF, X86_FEATURE_AVX512F },
+ { X86_FEATURE_AVX512ER, X86_FEATURE_AVX512F },
+ { X86_FEATURE_AVX512CD, X86_FEATURE_AVX512F },
+ { X86_FEATURE_AVX512DQ, X86_FEATURE_AVX512F },
+ { X86_FEATURE_AVX512BW, X86_FEATURE_AVX512F },
+ { X86_FEATURE_AVX512VL, X86_FEATURE_AVX512F },
+ { X86_FEATURE_AVX512VBMI, X86_FEATURE_AVX512F },
+ { X86_FEATURE_AVX512_VBMI2, X86_FEATURE_AVX512VL },
+ { X86_FEATURE_AVX512_VNNI, X86_FEATURE_AVX512VL },
+ { X86_FEATURE_AVX512_BITALG, X86_FEATURE_AVX512VL },
+ { X86_FEATURE_AVX512_4VNNIW, X86_FEATURE_AVX512F },
+ { X86_FEATURE_AVX512_4FMAPS, X86_FEATURE_AVX512F },
+ { X86_FEATURE_AVX512_VPOPCNTDQ, X86_FEATURE_AVX512F },
+ { X86_FEATURE_AVX512_VP2INTERSECT, X86_FEATURE_AVX512VL },
+ { X86_FEATURE_CQM_OCCUP_LLC, X86_FEATURE_CQM_LLC },
+ { X86_FEATURE_CQM_MBM_TOTAL, X86_FEATURE_CQM_LLC },
+ { X86_FEATURE_CQM_MBM_LOCAL, X86_FEATURE_CQM_LLC },
+ { X86_FEATURE_BMEC, X86_FEATURE_CQM_MBM_TOTAL },
+ { X86_FEATURE_BMEC, X86_FEATURE_CQM_MBM_LOCAL },
+ { X86_FEATURE_SDCIAE, X86_FEATURE_CAT_L3 },
+ { X86_FEATURE_AVX512_BF16, X86_FEATURE_AVX512VL },
+ { X86_FEATURE_AVX512_FP16, X86_FEATURE_AVX512BW },
+ { X86_FEATURE_ENQCMD, X86_FEATURE_XSAVES },
+ { X86_FEATURE_PER_THREAD_MBA, X86_FEATURE_MBA },
+ { X86_FEATURE_SGX_LC, X86_FEATURE_SGX },
+ { X86_FEATURE_SGX1, X86_FEATURE_SGX },
+ { X86_FEATURE_SGX2, X86_FEATURE_SGX1 },
+ { X86_FEATURE_SGX_EUPDATESVN, X86_FEATURE_SGX1 },
+ { X86_FEATURE_SGX_EDECCSSA, X86_FEATURE_SGX1 },
+ { X86_FEATURE_XFD, X86_FEATURE_XSAVES },
+ { X86_FEATURE_XFD, X86_FEATURE_XGETBV1 },
+ { X86_FEATURE_AMX_TILE, X86_FEATURE_XFD },
+ { X86_FEATURE_AMX_FP16, X86_FEATURE_AMX_TILE },
+ { X86_FEATURE_AMX_BF16, X86_FEATURE_AMX_TILE },
+ { X86_FEATURE_AMX_INT8, X86_FEATURE_AMX_TILE },
+ { X86_FEATURE_SHSTK, X86_FEATURE_XSAVES },
+ { X86_FEATURE_FRED, X86_FEATURE_LKGS },
+ { X86_FEATURE_SPEC_CTRL_SSBD, X86_FEATURE_SPEC_CTRL },
+ { X86_FEATURE_LASS, X86_FEATURE_SMAP },
+ {}
+};
+
+static inline void clear_feature(struct cpuinfo_x86 *c, unsigned int feature)
+{
+ /*
+ * Note: This could use the non atomic __*_bit() variants, but the
+ * rest of the cpufeature code uses atomics as well, so keep it for
+ * consistency. Cleanup all of it separately.
+ */
+ if (!c) {
+ clear_cpu_cap(&boot_cpu_data, feature);
+ set_bit(feature, (unsigned long *)cpu_caps_cleared);
+ } else {
+ clear_bit(feature, (unsigned long *)c->x86_capability);
+ }
+}
+
+/* Take the capabilities and the BUG bits into account */
+#define MAX_FEATURE_BITS ((NCAPINTS + NBUGINTS) * sizeof(u32) * 8)
+
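+/*
+ * Clear @feature and, by iterating the dependency table until a fixed
+ * point is reached, every feature that directly or transitively
+ * depends on it. E.g. clearing X86_FEATURE_XSAVE also clears AVX and
+ * therefore AVX2, AVX512F and the rest of the AVX-512 family.
+ */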
+static void do_clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int feature)
+{
+ DECLARE_BITMAP(disable, MAX_FEATURE_BITS);
+ const struct cpuid_dep *d;
+ bool changed;
+
+ if (WARN_ON(feature >= MAX_FEATURE_BITS))
+ return;
+
+ if (boot_cpu_has(feature))
+ WARN_ON(alternatives_patched);
+
+ clear_feature(c, feature);
+
+ /* Collect all features to disable, handling dependencies */
+ memset(disable, 0, sizeof(disable));
+ __set_bit(feature, disable);
+
+ /* Loop until we get a stable state. */
+ do {
+ changed = false;
+ for (d = cpuid_deps; d->feature; d++) {
+ if (!test_bit(d->depends, disable))
+ continue;
+ if (__test_and_set_bit(d->feature, disable))
+ continue;
+
+ changed = true;
+ clear_feature(c, d->feature);
+ }
+ } while (changed);
+}
+
+void clear_cpu_cap(struct cpuinfo_x86 *c, unsigned int feature)
+{
+ do_clear_cpu_cap(c, feature);
+}
+
+void setup_clear_cpu_cap(unsigned int feature)
+{
+ do_clear_cpu_cap(NULL, feature);
+}
+
+/*
+ * Return the feature "name" if available, otherwise return the
+ * X86_FEATURE_* encoding ("<word>*32+<bit>") so the feature can
+ * still be identified.
+ */
+static const char *x86_feature_name(unsigned int feature, char *buf)
+{
+ if (x86_cap_flags[feature])
+ return x86_cap_flags[feature];
+
+ snprintf(buf, 16, "%d*32+%2d", feature / 32, feature % 32);
+
+ return buf;
+}
+
+void check_cpufeature_deps(struct cpuinfo_x86 *c)
+{
+ char feature_buf[16], depends_buf[16];
+ const struct cpuid_dep *d;
+
+ for (d = cpuid_deps; d->feature; d++) {
+ if (cpu_has(c, d->feature) && !cpu_has(c, d->depends)) {
+ /*
+ * Only warn about the first unmet dependency on the
+ * first CPU where it is encountered to avoid spamming
+ * the kernel log.
+ */
+ pr_warn_once("x86 CPU feature dependency check failure: CPU%d has '%s' enabled but '%s' disabled. Kernel might be fine, but no guarantees.\n",
+ smp_processor_id(),
+ x86_feature_name(d->feature, feature_buf),
+ x86_feature_name(d->depends, depends_buf));
+ }
+ }
+}
diff --git a/arch/x86/kernel/cpu/cpuid_0x2_table.c b/arch/x86/kernel/cpu/cpuid_0x2_table.c
new file mode 100644
index 000000000000..89bc8db5e9c6
--- /dev/null
+++ b/arch/x86/kernel/cpu/cpuid_0x2_table.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/sizes.h>
+
+#include <asm/cpuid/types.h>
+
+#include "cpu.h"
+
+#define CACHE_ENTRY(_desc, _type, _size) \
+ [_desc] = { \
+ .c_type = (_type), \
+ .c_size = (_size) / SZ_1K, \
+ }
+
+#define TLB_ENTRY(_desc, _type, _entries) \
+ [_desc] = { \
+ .t_type = (_type), \
+ .entries = (_entries), \
+ }
+
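+/*
+ * Indexed by the one-byte descriptors returned in CPUID leaf 0x2.
+ * Cache sizes are stored in KiB (hence the division by SZ_1K above);
+ * TLB coverage is stored as a plain entry count.
+ */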
+const struct leaf_0x2_table cpuid_0x2_table[256] = {
+ CACHE_ENTRY(0x06, CACHE_L1_INST, SZ_8K ), /* 4-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x08, CACHE_L1_INST, SZ_16K ), /* 4-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x09, CACHE_L1_INST, SZ_32K ), /* 4-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x0a, CACHE_L1_DATA, SZ_8K ), /* 2 way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x0c, CACHE_L1_DATA, SZ_16K ), /* 4-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x0d, CACHE_L1_DATA, SZ_16K ), /* 4-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x0e, CACHE_L1_DATA, SZ_24K ), /* 6-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x21, CACHE_L2, SZ_256K ), /* 8-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x22, CACHE_L3, SZ_512K ), /* 4-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x23, CACHE_L3, SZ_1M ), /* 8-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x25, CACHE_L3, SZ_2M ), /* 8-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x29, CACHE_L3, SZ_4M ), /* 8-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x2c, CACHE_L1_DATA, SZ_32K ), /* 8-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x30, CACHE_L1_INST, SZ_32K ), /* 8-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x39, CACHE_L2, SZ_128K ), /* 4-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x3a, CACHE_L2, SZ_192K ), /* 6-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x3b, CACHE_L2, SZ_128K ), /* 2-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x3c, CACHE_L2, SZ_256K ), /* 4-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x3d, CACHE_L2, SZ_384K ), /* 6-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x3e, CACHE_L2, SZ_512K ), /* 4-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x3f, CACHE_L2, SZ_256K ), /* 2-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x41, CACHE_L2, SZ_128K ), /* 4-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x42, CACHE_L2, SZ_256K ), /* 4-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x43, CACHE_L2, SZ_512K ), /* 4-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x44, CACHE_L2, SZ_1M ), /* 4-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x45, CACHE_L2, SZ_2M ), /* 4-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x46, CACHE_L3, SZ_4M ), /* 4-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x47, CACHE_L3, SZ_8M ), /* 8-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x48, CACHE_L2, SZ_3M ), /* 12-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x49, CACHE_L3, SZ_4M ), /* 16-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x4a, CACHE_L3, SZ_6M ), /* 12-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x4b, CACHE_L3, SZ_8M ), /* 16-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x4c, CACHE_L3, SZ_12M ), /* 12-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x4d, CACHE_L3, SZ_16M ), /* 16-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x4e, CACHE_L2, SZ_6M ), /* 24-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x60, CACHE_L1_DATA, SZ_16K ), /* 8-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x66, CACHE_L1_DATA, SZ_8K ), /* 4-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x67, CACHE_L1_DATA, SZ_16K ), /* 4-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x68, CACHE_L1_DATA, SZ_32K ), /* 4-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x78, CACHE_L2, SZ_1M ), /* 4-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x79, CACHE_L2, SZ_128K ), /* 8-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x7a, CACHE_L2, SZ_256K ), /* 8-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x7b, CACHE_L2, SZ_512K ), /* 8-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x7c, CACHE_L2, SZ_1M ), /* 8-way set assoc, sectored cache, 64 byte line size */
+ CACHE_ENTRY(0x7d, CACHE_L2, SZ_2M ), /* 8-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x7f, CACHE_L2, SZ_512K ), /* 2-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x80, CACHE_L2, SZ_512K ), /* 8-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x82, CACHE_L2, SZ_256K ), /* 8-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x83, CACHE_L2, SZ_512K ), /* 8-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x84, CACHE_L2, SZ_1M ), /* 8-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x85, CACHE_L2, SZ_2M ), /* 8-way set assoc, 32 byte line size */
+ CACHE_ENTRY(0x86, CACHE_L2, SZ_512K ), /* 4-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0x87, CACHE_L2, SZ_1M ), /* 8-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xd0, CACHE_L3, SZ_512K ), /* 4-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xd1, CACHE_L3, SZ_1M ), /* 4-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xd2, CACHE_L3, SZ_2M ), /* 4-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xd6, CACHE_L3, SZ_1M ), /* 8-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xd7, CACHE_L3, SZ_2M ), /* 8-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xd8, CACHE_L3, SZ_4M ), /* 12-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xdc, CACHE_L3, SZ_2M ), /* 12-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xdd, CACHE_L3, SZ_4M ), /* 12-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xde, CACHE_L3, SZ_8M ), /* 12-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xe2, CACHE_L3, SZ_2M ), /* 16-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xe3, CACHE_L3, SZ_4M ), /* 16-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xe4, CACHE_L3, SZ_8M ), /* 16-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xea, CACHE_L3, SZ_12M ), /* 24-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xeb, CACHE_L3, SZ_18M ), /* 24-way set assoc, 64 byte line size */
+ CACHE_ENTRY(0xec, CACHE_L3, SZ_24M ), /* 24-way set assoc, 64 byte line size */
+
+ TLB_ENTRY( 0x01, TLB_INST_4K, 32 ), /* TLB_INST 4 KByte pages, 4-way set associative */
+	TLB_ENTRY( 0x02, TLB_INST_4M,		2  ),	/* TLB_INST 4 MByte pages, fully associative */
+ TLB_ENTRY( 0x03, TLB_DATA_4K, 64 ), /* TLB_DATA 4 KByte pages, 4-way set associative */
+ TLB_ENTRY( 0x04, TLB_DATA_4M, 8 ), /* TLB_DATA 4 MByte pages, 4-way set associative */
+ TLB_ENTRY( 0x05, TLB_DATA_4M, 32 ), /* TLB_DATA 4 MByte pages, 4-way set associative */
+ TLB_ENTRY( 0x0b, TLB_INST_4M, 4 ), /* TLB_INST 4 MByte pages, 4-way set associative */
+ TLB_ENTRY( 0x4f, TLB_INST_4K, 32 ), /* TLB_INST 4 KByte pages */
+ TLB_ENTRY( 0x50, TLB_INST_ALL, 64 ), /* TLB_INST 4 KByte and 2-MByte or 4-MByte pages */
+ TLB_ENTRY( 0x51, TLB_INST_ALL, 128 ), /* TLB_INST 4 KByte and 2-MByte or 4-MByte pages */
+ TLB_ENTRY( 0x52, TLB_INST_ALL, 256 ), /* TLB_INST 4 KByte and 2-MByte or 4-MByte pages */
+ TLB_ENTRY( 0x55, TLB_INST_2M_4M, 7 ), /* TLB_INST 2-MByte or 4-MByte pages, fully associative */
+ TLB_ENTRY( 0x56, TLB_DATA0_4M, 16 ), /* TLB_DATA0 4 MByte pages, 4-way set associative */
+ TLB_ENTRY( 0x57, TLB_DATA0_4K, 16 ), /* TLB_DATA0 4 KByte pages, 4-way associative */
+ TLB_ENTRY( 0x59, TLB_DATA0_4K, 16 ), /* TLB_DATA0 4 KByte pages, fully associative */
+ TLB_ENTRY( 0x5a, TLB_DATA0_2M_4M, 32 ), /* TLB_DATA0 2-MByte or 4 MByte pages, 4-way set associative */
+ TLB_ENTRY( 0x5b, TLB_DATA_4K_4M, 64 ), /* TLB_DATA 4 KByte and 4 MByte pages */
+ TLB_ENTRY( 0x5c, TLB_DATA_4K_4M, 128 ), /* TLB_DATA 4 KByte and 4 MByte pages */
+ TLB_ENTRY( 0x5d, TLB_DATA_4K_4M, 256 ), /* TLB_DATA 4 KByte and 4 MByte pages */
+	TLB_ENTRY( 0x61, TLB_INST_4K,		48 ),	/* TLB_INST 4 KByte pages, fully associative */
+ TLB_ENTRY( 0x63, TLB_DATA_1G_2M_4M, 4 ), /* TLB_DATA 1 GByte pages, 4-way set associative
+ * (plus 32 entries TLB_DATA 2 MByte or 4 MByte pages, not encoded here) */
+ TLB_ENTRY( 0x6b, TLB_DATA_4K, 256 ), /* TLB_DATA 4 KByte pages, 8-way associative */
+ TLB_ENTRY( 0x6c, TLB_DATA_2M_4M, 128 ), /* TLB_DATA 2 MByte or 4 MByte pages, 8-way associative */
+ TLB_ENTRY( 0x6d, TLB_DATA_1G, 16 ), /* TLB_DATA 1 GByte pages, fully associative */
+ TLB_ENTRY( 0x76, TLB_INST_2M_4M, 8 ), /* TLB_INST 2-MByte or 4-MByte pages, fully associative */
+ TLB_ENTRY( 0xb0, TLB_INST_4K, 128 ), /* TLB_INST 4 KByte pages, 4-way set associative */
+	TLB_ENTRY( 0xb1, TLB_INST_2M_4M,	4  ),	/* TLB_INST 2M pages, 4-way, 8 entries or 4M pages, 4-way, 4 entries */
+ TLB_ENTRY( 0xb2, TLB_INST_4K, 64 ), /* TLB_INST 4KByte pages, 4-way set associative */
+ TLB_ENTRY( 0xb3, TLB_DATA_4K, 128 ), /* TLB_DATA 4 KByte pages, 4-way set associative */
+ TLB_ENTRY( 0xb4, TLB_DATA_4K, 256 ), /* TLB_DATA 4 KByte pages, 4-way associative */
+ TLB_ENTRY( 0xb5, TLB_INST_4K, 64 ), /* TLB_INST 4 KByte pages, 8-way set associative */
+ TLB_ENTRY( 0xb6, TLB_INST_4K, 128 ), /* TLB_INST 4 KByte pages, 8-way set associative */
+ TLB_ENTRY( 0xba, TLB_DATA_4K, 64 ), /* TLB_DATA 4 KByte pages, 4-way associative */
+ TLB_ENTRY( 0xc0, TLB_DATA_4K_4M, 8 ), /* TLB_DATA 4 KByte and 4 MByte pages, 4-way associative */
+ TLB_ENTRY( 0xc1, STLB_4K_2M, 1024 ), /* STLB 4 KByte and 2 MByte pages, 8-way associative */
+ TLB_ENTRY( 0xc2, TLB_DATA_2M_4M, 16 ), /* TLB_DATA 2 MByte/4MByte pages, 4-way associative */
+ TLB_ENTRY( 0xca, STLB_4K, 512 ), /* STLB 4 KByte pages, 4-way associative */
+};
diff --git a/arch/x86/kernel/cpu/cyrix.c b/arch/x86/kernel/cpu/cyrix.c
index 4fbd384fb645..dfec2c61e354 100644
--- a/arch/x86/kernel/cpu/cyrix.c
+++ b/arch/x86/kernel/cpu/cyrix.c
@@ -1,6 +1,7 @@
-#include <linux/init.h>
+// SPDX-License-Identifier: GPL-2.0
#include <linux/bitops.h>
#include <linux/delay.h>
+#include <linux/isa-dma.h>
#include <linux/pci.h>
#include <asm/dma.h>
#include <linux/io.h>
@@ -9,13 +10,16 @@
#include <linux/timer.h>
#include <asm/pci-direct.h>
#include <asm/tsc.h>
+#include <asm/cpufeature.h>
+#include <linux/sched.h>
+#include <linux/sched/clock.h>
#include "cpu.h"
/*
 * Read NSC/Cyrix DEVID registers (DIR) to get more detailed info about the CPU
*/
-static void __cpuinit __do_cyrix_devid(unsigned char *dir0, unsigned char *dir1)
+static void __do_cyrix_devid(unsigned char *dir0, unsigned char *dir1)
{
unsigned char ccr2, ccr3;
@@ -44,7 +48,7 @@ static void __cpuinit __do_cyrix_devid(unsigned char *dir0, unsigned char *dir1)
}
}
-static void __cpuinit do_cyrix_devid(unsigned char *dir0, unsigned char *dir1)
+static void do_cyrix_devid(unsigned char *dir0, unsigned char *dir1)
{
unsigned long flags;
@@ -59,25 +63,25 @@ static void __cpuinit do_cyrix_devid(unsigned char *dir0, unsigned char *dir1)
* Actually since bugs.h doesn't even reference this perhaps someone should
* fix the documentation ???
*/
-static unsigned char Cx86_dir0_msb __cpuinitdata = 0;
+static unsigned char Cx86_dir0_msb = 0;
-static const char __cpuinitconst Cx86_model[][9] = {
+static const char Cx86_model[][9] = {
"Cx486", "Cx486", "5x86 ", "6x86", "MediaGX ", "6x86MX ",
"M II ", "Unknown"
};
-static const char __cpuinitconst Cx486_name[][5] = {
+static const char Cx486_name[][5] = {
"SLC", "DLC", "SLC2", "DLC2", "SRx", "DRx",
"SRx2", "DRx2"
};
-static const char __cpuinitconst Cx486S_name[][4] = {
+static const char Cx486S_name[][4] = {
"S", "S2", "Se", "S2e"
};
-static const char __cpuinitconst Cx486D_name[][4] = {
+static const char Cx486D_name[][4] = {
"DX", "DX2", "?", "?", "?", "DX4"
};
-static char Cx86_cb[] __cpuinitdata = "?.5x Core/Bus Clock";
-static const char __cpuinitconst cyrix_model_mult1[] = "12??43";
-static const char __cpuinitconst cyrix_model_mult2[] = "12233445";
+static char Cx86_cb[] = "?.5x Core/Bus Clock";
+static const char cyrix_model_mult1[] = "12??43";
+static const char cyrix_model_mult2[] = "12233445";
/*
* Reset the slow-loop (SLOP) bit on the 686(L) which is set by some old
@@ -87,7 +91,7 @@ static const char __cpuinitconst cyrix_model_mult2[] = "12233445";
* FIXME: our newer udelay uses the tsc. We don't need to frob with SLOP
*/
-static void __cpuinit check_cx686_slop(struct cpuinfo_x86 *c)
+static void check_cx686_slop(struct cpuinfo_x86 *c)
{
unsigned long flags;
@@ -104,7 +108,7 @@ static void __cpuinit check_cx686_slop(struct cpuinfo_x86 *c)
local_irq_restore(flags);
if (ccr5 & 2) { /* possible wrong calibration done */
- printk(KERN_INFO "Recalibrating delay loop with SLOP bit reset\n");
+ pr_info("Recalibrating delay loop with SLOP bit reset\n");
calibrate_delay();
c->loops_per_jiffy = loops_per_jiffy;
}
@@ -112,52 +116,52 @@ static void __cpuinit check_cx686_slop(struct cpuinfo_x86 *c)
}
-static void __cpuinit set_cx86_reorder(void)
+static void set_cx86_reorder(void)
{
u8 ccr3;
- printk(KERN_INFO "Enable Memory access reorder on Cyrix/NSC processor.\n");
+ pr_info("Enable Memory access reorder on Cyrix/NSC processor.\n");
ccr3 = getCx86(CX86_CCR3);
setCx86(CX86_CCR3, (ccr3 & 0x0f) | 0x10); /* enable MAPEN */
/* Load/Store Serialize to mem access disable (=reorder it) */
- setCx86_old(CX86_PCR0, getCx86_old(CX86_PCR0) & ~0x80);
+ setCx86(CX86_PCR0, getCx86(CX86_PCR0) & ~0x80);
/* set load/store serialize from 1GB to 4GB */
ccr3 |= 0xe0;
setCx86(CX86_CCR3, ccr3);
}
-static void __cpuinit set_cx86_memwb(void)
+static void set_cx86_memwb(void)
{
- printk(KERN_INFO "Enable Memory-Write-back mode on Cyrix/NSC processor.\n");
+ pr_info("Enable Memory-Write-back mode on Cyrix/NSC processor.\n");
/* CCR2 bit 2: unlock NW bit */
- setCx86_old(CX86_CCR2, getCx86_old(CX86_CCR2) & ~0x04);
+ setCx86(CX86_CCR2, getCx86(CX86_CCR2) & ~0x04);
/* set 'Not Write-through' */
write_cr0(read_cr0() | X86_CR0_NW);
/* CCR2 bit 2: lock NW bit and set WT1 */
- setCx86_old(CX86_CCR2, getCx86_old(CX86_CCR2) | 0x14);
+ setCx86(CX86_CCR2, getCx86(CX86_CCR2) | 0x14);
}
/*
* Configure later MediaGX and/or Geode processor.
*/
-static void __cpuinit geode_configure(void)
+static void geode_configure(void)
{
unsigned long flags;
u8 ccr3;
local_irq_save(flags);
- /* Suspend on halt power saving and enable #SUSP pin */
- setCx86_old(CX86_CCR2, getCx86_old(CX86_CCR2) | 0x88);
+ /* Suspend on halt power saving */
+ setCx86(CX86_CCR2, getCx86(CX86_CCR2) | 0x08);
ccr3 = getCx86(CX86_CCR3);
setCx86(CX86_CCR3, (ccr3 & 0x0f) | 0x10); /* enable MAPEN */
/* FPU fast, DTE cache, Mem bypass */
- setCx86_old(CX86_CCR4, getCx86_old(CX86_CCR4) | 0x38);
+ setCx86(CX86_CCR4, getCx86(CX86_CCR4) | 0x38);
setCx86(CX86_CCR3, ccr3); /* disable MAPEN */
set_cx86_memwb();
@@ -166,7 +170,7 @@ static void __cpuinit geode_configure(void)
local_irq_restore(flags);
}
-static void __cpuinit early_init_cyrix(struct cpuinfo_x86 *c)
+static void early_init_cyrix(struct cpuinfo_x86 *c)
{
unsigned char dir0, dir0_msn, dir1 = 0;
@@ -185,7 +189,7 @@ static void __cpuinit early_init_cyrix(struct cpuinfo_x86 *c)
}
}
-static void __cpuinit init_cyrix(struct cpuinfo_x86 *c)
+static void init_cyrix(struct cpuinfo_x86 *c)
{
unsigned char dir0, dir0_msn, dir0_lsn, dir1 = 0;
char *buf = c->x86_model_id;
@@ -212,7 +216,7 @@ static void __cpuinit init_cyrix(struct cpuinfo_x86 *c)
/* common case step number/rev -- exceptions handled below */
c->x86_model = (dir1 >> 4) + 1;
- c->x86_mask = dir1 & 0xf;
+ c->x86_stepping = dir1 & 0xf;
/* Now cook; the original recipe is by Channing Corn, from Cyrix.
* We do the same thing for each generation: we work out
@@ -249,10 +253,11 @@ static void __cpuinit init_cyrix(struct cpuinfo_x86 *c)
/* Emulate MTRRs using Cyrix's ARRs. */
set_cpu_cap(c, X86_FEATURE_CYRIX_ARR);
/* 6x86's contain this bug */
- c->coma_bug = 1;
+ set_cpu_bug(c, X86_BUG_COMA);
break;
case 4: /* MediaGX/GXm or Geode GXM/GXLV/GX1 */
+ case 11: /* GX1 with inverted Device ID */
#ifdef CONFIG_PCI
{
u32 vendor, device;
@@ -269,7 +274,7 @@ static void __cpuinit init_cyrix(struct cpuinfo_x86 *c)
* VSA1 we work around however.
*/
- printk(KERN_INFO "Working around Cyrix MediaGX virtual DMA bugs.\n");
+ pr_info("Working around Cyrix MediaGX virtual DMA bugs.\n");
isa_dma_bridge_buggy = 2;
/* We do this before the PCI layer is running. However we
@@ -287,12 +292,12 @@ static void __cpuinit init_cyrix(struct cpuinfo_x86 *c)
mark_tsc_unstable("cyrix 5510/5520 detected");
}
#endif
- c->x86_cache_size = 16; /* Yep 16K integrated cache thats it */
+ c->x86_cache_size = 16; /* Yep 16K integrated cache that's it */
/* GXm supports extended cpuid levels 'ala' AMD */
if (c->cpuid_level == 2) {
/* Enable cxMMX extensions (GX1 Datasheet 54) */
- setCx86_old(CX86_CCR7, getCx86_old(CX86_CCR7) | 1);
+ setCx86(CX86_CCR7, getCx86(CX86_CCR7) | 1);
/*
* GXm : 0x30 ... 0x5f GXm datasheet 51
@@ -315,9 +320,10 @@ static void __cpuinit init_cyrix(struct cpuinfo_x86 *c)
if (dir1 > 7) {
dir0_msn++; /* M II */
/* Enable MMX extensions (App note 108) */
- setCx86_old(CX86_CCR7, getCx86_old(CX86_CCR7)|1);
+ setCx86(CX86_CCR7, getCx86(CX86_CCR7)|1);
} else {
- c->coma_bug = 1; /* 6x86MX, it has the bug. */
+ /* A 6x86MX - it has the bug. */
+ set_cpu_bug(c, X86_BUG_COMA);
}
tmp = (!(dir0_lsn & 7) || dir0_lsn & 1) ? 2 : 0;
Cx86_cb[tmp] = cyrix_model_mult2[dir0_lsn & 7];
@@ -332,7 +338,7 @@ static void __cpuinit init_cyrix(struct cpuinfo_x86 *c)
switch (dir0_lsn) {
case 0xd: /* either a 486SLC or DLC w/o DEVID */
dir0_msn = 0;
- p = Cx486_name[(c->hard_math) ? 1 : 0];
+ p = Cx486_name[!!boot_cpu_has(X86_FEATURE_FPU)];
break;
case 0xe: /* a 486S A step */
@@ -355,7 +361,7 @@ static void __cpuinit init_cyrix(struct cpuinfo_x86 *c)
/*
* Handle National Semiconductor branded processors
*/
-static void __cpuinit init_nsc(struct cpuinfo_x86 *c)
+static void init_nsc(struct cpuinfo_x86 *c)
{
/*
* There may be GX1 processors in the wild that are branded
@@ -404,7 +410,7 @@ static inline int test_cyrix_52div(void)
return (unsigned char) (test >> 8) == 0x02;
}
-static void __cpuinit cyrix_identify(struct cpuinfo_x86 *c)
+static void cyrix_identify(struct cpuinfo_x86 *c)
{
/* Detect Cyrix with disabled CPUID */
if (c->x86 == 4 && test_cyrix_52div()) {
@@ -426,13 +432,13 @@ static void __cpuinit cyrix_identify(struct cpuinfo_x86 *c)
if (dir0 == 5 || dir0 == 3) {
unsigned char ccr3;
unsigned long flags;
- printk(KERN_INFO "Enabling CPUID on Cyrix processor.\n");
+ pr_info("Enabling CPUID on Cyrix processor.\n");
local_irq_save(flags);
ccr3 = getCx86(CX86_CCR3);
/* enable MAPEN */
setCx86(CX86_CCR3, (ccr3 & 0x0f) | 0x10);
/* enable cpuid */
- setCx86_old(CX86_CCR4, getCx86_old(CX86_CCR4) | 0x80);
+ setCx86(CX86_CCR4, getCx86(CX86_CCR4) | 0x80);
/* disable MAPEN */
setCx86(CX86_CCR3, ccr3);
local_irq_restore(flags);
@@ -440,7 +446,7 @@ static void __cpuinit cyrix_identify(struct cpuinfo_x86 *c)
}
}
-static const struct cpu_dev __cpuinitconst cyrix_cpu_dev = {
+static const struct cpu_dev cyrix_cpu_dev = {
.c_vendor = "Cyrix",
.c_ident = { "CyrixInstead" },
.c_early_init = early_init_cyrix,
@@ -451,7 +457,7 @@ static const struct cpu_dev __cpuinitconst cyrix_cpu_dev = {
cpu_dev_register(cyrix_cpu_dev);
-static const struct cpu_dev __cpuinitconst nsc_cpu_dev = {
+static const struct cpu_dev nsc_cpu_dev = {
.c_vendor = "NSC",
.c_ident = { "Geode by NSC" },
.c_init = init_nsc,
diff --git a/arch/x86/kernel/cpu/debugfs.c b/arch/x86/kernel/cpu/debugfs.c
new file mode 100644
index 000000000000..1976fef2dfe5
--- /dev/null
+++ b/arch/x86/kernel/cpu/debugfs.c
@@ -0,0 +1,101 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/debugfs.h>
+
+#include <asm/apic.h>
+#include <asm/processor.h>
+
+#include "cpu.h"
+
+static int cpu_debug_show(struct seq_file *m, void *p)
+{
+ unsigned long cpu = (unsigned long)m->private;
+ struct cpuinfo_x86 *c = per_cpu_ptr(&cpu_info, cpu);
+
+ seq_printf(m, "online: %d\n", cpu_online(cpu));
+ if (!c->initialized)
+ return 0;
+
+ seq_printf(m, "initial_apicid: 0x%x\n", c->topo.initial_apicid);
+ seq_printf(m, "apicid: 0x%x\n", c->topo.apicid);
+ seq_printf(m, "pkg_id: %u\n", c->topo.pkg_id);
+ seq_printf(m, "die_id: %u\n", c->topo.die_id);
+ seq_printf(m, "cu_id: %u\n", c->topo.cu_id);
+ seq_printf(m, "core_id: %u\n", c->topo.core_id);
+ seq_printf(m, "cpu_type: %s\n", get_topology_cpu_type_name(c));
+ seq_printf(m, "logical_pkg_id: %u\n", c->topo.logical_pkg_id);
+ seq_printf(m, "logical_die_id: %u\n", c->topo.logical_die_id);
+ seq_printf(m, "logical_core_id: %u\n", c->topo.logical_core_id);
+ seq_printf(m, "llc_id: %u\n", c->topo.llc_id);
+ seq_printf(m, "l2c_id: %u\n", c->topo.l2c_id);
+ seq_printf(m, "amd_node_id: %u\n", c->topo.amd_node_id);
+ seq_printf(m, "amd_nodes_per_pkg: %u\n", topology_amd_nodes_per_pkg());
+ seq_printf(m, "num_threads: %u\n", __num_threads_per_package);
+ seq_printf(m, "num_cores: %u\n", __num_cores_per_package);
+ seq_printf(m, "max_dies_per_pkg: %u\n", __max_dies_per_package);
+ seq_printf(m, "max_threads_per_core:%u\n", __max_threads_per_core);
+ return 0;
+}
+
+static int cpu_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, cpu_debug_show, inode->i_private);
+}
+
+static const struct file_operations dfs_cpu_ops = {
+ .open = cpu_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int dom_debug_show(struct seq_file *m, void *p)
+{
+ static const char *domain_names[TOPO_MAX_DOMAIN] = {
+ [TOPO_SMT_DOMAIN] = "Thread",
+ [TOPO_CORE_DOMAIN] = "Core",
+ [TOPO_MODULE_DOMAIN] = "Module",
+ [TOPO_TILE_DOMAIN] = "Tile",
+ [TOPO_DIE_DOMAIN] = "Die",
+ [TOPO_DIEGRP_DOMAIN] = "DieGrp",
+ [TOPO_PKG_DOMAIN] = "Package",
+ };
+ unsigned int dom, nthreads = 1;
+
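+	/*
+	 * The per-domain sizes multiply up the hierarchy, so nthreads
+	 * accumulates the maximum number of threads covered by each
+	 * successive domain level.
+	 */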
+ for (dom = 0; dom < TOPO_MAX_DOMAIN; dom++) {
+ nthreads *= x86_topo_system.dom_size[dom];
+ seq_printf(m, "domain: %-10s shift: %u dom_size: %5u max_threads: %5u\n",
+ domain_names[dom], x86_topo_system.dom_shifts[dom],
+ x86_topo_system.dom_size[dom], nthreads);
+ }
+ return 0;
+}
+
+static int dom_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, dom_debug_show, inode->i_private);
+}
+
+static const struct file_operations dfs_dom_ops = {
+ .open = dom_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static __init int cpu_init_debugfs(void)
+{
+ struct dentry *dir, *base = debugfs_create_dir("topo", arch_debugfs_dir);
+ unsigned long id;
+ char name[24];
+
+ debugfs_create_file("domains", 0444, base, NULL, &dfs_dom_ops);
+
+ dir = debugfs_create_dir("cpus", base);
+ for_each_possible_cpu(id) {
+ sprintf(name, "%lu", id);
+ debugfs_create_file(name, 0444, dir, (void *)id, &dfs_cpu_ops);
+ }
+ return 0;
+}
+late_initcall(cpu_init_debugfs);
diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
new file mode 100644
index 000000000000..d69757246bde
--- /dev/null
+++ b/arch/x86/kernel/cpu/feat_ctl.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/tboot.h>
+
+#include <asm/cpu.h>
+#include <asm/cpufeature.h>
+#include <asm/msr-index.h>
+#include <asm/msr.h>
+#include <asm/processor.h>
+#include <asm/vmx.h>
+
+#undef pr_fmt
+#define pr_fmt(fmt) "x86/cpu: " fmt
+
+#ifdef CONFIG_X86_VMX_FEATURE_NAMES
+enum vmx_feature_leafs {
+ MISC_FEATURES = 0,
+ PRIMARY_CTLS,
+ SECONDARY_CTLS,
+ TERTIARY_CTLS_LOW,
+ TERTIARY_CTLS_HIGH,
+ NR_VMX_FEATURE_WORDS,
+};
+
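+/*
+ * VMX_FEATURE_* values encode both the feature word and the bit within
+ * it; masking with 0x1f extracts the bit position inside one 32-bit
+ * vmx_capability word.
+ */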
+#define VMX_F(x) BIT(VMX_FEATURE_##x & 0x1f)
+
+static void init_vmx_capabilities(struct cpuinfo_x86 *c)
+{
+ u32 supported, funcs, ept, vpid, ign, low, high;
+
+ BUILD_BUG_ON(NVMXINTS != NR_VMX_FEATURE_WORDS);
+
+ /*
+ * The high bits contain the allowed-1 settings, i.e. features that can
+ * be turned on. The low bits contain the allowed-0 settings, i.e.
+ * features that can be turned off. Ignore the allowed-0 settings,
+ * if a feature can be turned on then it's supported.
+ *
+ * Use raw rdmsr() for primary processor controls and pin controls MSRs
+ * as they exist on any CPU that supports VMX, i.e. we want the WARN if
+ * the RDMSR faults.
+ */
+ rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, ign, supported);
+ c->vmx_capability[PRIMARY_CTLS] = supported;
+
+ rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS2, &ign, &supported);
+ c->vmx_capability[SECONDARY_CTLS] = supported;
+
+ /* All 64 bits of tertiary controls MSR are allowed-1 settings. */
+ rdmsr_safe(MSR_IA32_VMX_PROCBASED_CTLS3, &low, &high);
+ c->vmx_capability[TERTIARY_CTLS_LOW] = low;
+ c->vmx_capability[TERTIARY_CTLS_HIGH] = high;
+
+ rdmsr(MSR_IA32_VMX_PINBASED_CTLS, ign, supported);
+ rdmsr_safe(MSR_IA32_VMX_VMFUNC, &ign, &funcs);
+
+ /*
+ * Except for EPT+VPID, which enumerates support for both in a single
+ * MSR, low for EPT, high for VPID.
+ */
+ rdmsr_safe(MSR_IA32_VMX_EPT_VPID_CAP, &ept, &vpid);
+
+ /* Pin, EPT, VPID and VM-Func are merged into a single word. */
+ WARN_ON_ONCE(supported >> 16);
+ WARN_ON_ONCE(funcs >> 4);
+ c->vmx_capability[MISC_FEATURES] = (supported & 0xffff) |
+ ((vpid & 0x1) << 16) |
+ ((funcs & 0xf) << 28);
+
+ /* EPT bits are full on scattered and must be manually handled. */
+ if (ept & VMX_EPT_EXECUTE_ONLY_BIT)
+ c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_EXECUTE_ONLY);
+ if (ept & VMX_EPT_AD_BIT)
+ c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_AD);
+ if (ept & VMX_EPT_1GB_PAGE_BIT)
+ c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_1GB);
+ if (ept & VMX_EPT_PAGE_WALK_5_BIT)
+ c->vmx_capability[MISC_FEATURES] |= VMX_F(EPT_5LEVEL);
+
+ /* Synthetic APIC features that are aggregates of multiple features. */
+ if ((c->vmx_capability[PRIMARY_CTLS] & VMX_F(VIRTUAL_TPR)) &&
+ (c->vmx_capability[SECONDARY_CTLS] & VMX_F(VIRT_APIC_ACCESSES)))
+ c->vmx_capability[MISC_FEATURES] |= VMX_F(FLEXPRIORITY);
+
+ if ((c->vmx_capability[PRIMARY_CTLS] & VMX_F(VIRTUAL_TPR)) &&
+ (c->vmx_capability[SECONDARY_CTLS] & VMX_F(APIC_REGISTER_VIRT)) &&
+ (c->vmx_capability[SECONDARY_CTLS] & VMX_F(VIRT_INTR_DELIVERY)) &&
+ (c->vmx_capability[MISC_FEATURES] & VMX_F(POSTED_INTR)))
+ c->vmx_capability[MISC_FEATURES] |= VMX_F(APICV);
+
+ /* Set the synthetic cpufeatures to preserve /proc/cpuinfo's ABI. */
+ if (c->vmx_capability[PRIMARY_CTLS] & VMX_F(VIRTUAL_TPR))
+ set_cpu_cap(c, X86_FEATURE_TPR_SHADOW);
+ if (c->vmx_capability[MISC_FEATURES] & VMX_F(FLEXPRIORITY))
+ set_cpu_cap(c, X86_FEATURE_FLEXPRIORITY);
+ if (c->vmx_capability[MISC_FEATURES] & VMX_F(VIRTUAL_NMIS))
+ set_cpu_cap(c, X86_FEATURE_VNMI);
+ if (c->vmx_capability[SECONDARY_CTLS] & VMX_F(EPT))
+ set_cpu_cap(c, X86_FEATURE_EPT);
+ if (c->vmx_capability[MISC_FEATURES] & VMX_F(EPT_AD))
+ set_cpu_cap(c, X86_FEATURE_EPT_AD);
+ if (c->vmx_capability[MISC_FEATURES] & VMX_F(VPID))
+ set_cpu_cap(c, X86_FEATURE_VPID);
+}
+#endif /* CONFIG_X86_VMX_FEATURE_NAMES */
+
+static int __init nosgx(char *str)
+{
+ setup_clear_cpu_cap(X86_FEATURE_SGX);
+
+ return 0;
+}
+
+early_param("nosgx", nosgx);
+
+void init_ia32_feat_ctl(struct cpuinfo_x86 *c)
+{
+ bool enable_sgx_kvm = false, enable_sgx_driver = false;
+ bool tboot = tboot_enabled();
+ bool enable_vmx;
+ u64 msr;
+
+ if (rdmsrq_safe(MSR_IA32_FEAT_CTL, &msr)) {
+ clear_cpu_cap(c, X86_FEATURE_VMX);
+ clear_cpu_cap(c, X86_FEATURE_SGX);
+ return;
+ }
+
+ enable_vmx = cpu_has(c, X86_FEATURE_VMX) &&
+ IS_ENABLED(CONFIG_KVM_INTEL);
+
+ if (cpu_has(c, X86_FEATURE_SGX) && IS_ENABLED(CONFIG_X86_SGX)) {
+ /*
+ * Separate out SGX driver enabling from KVM. This allows KVM
+ * guests to use SGX even if the kernel SGX driver refuses to
+ * use it. This happens if flexible Launch Control is not
+ * available.
+ */
+ enable_sgx_driver = cpu_has(c, X86_FEATURE_SGX_LC);
+ enable_sgx_kvm = enable_vmx && IS_ENABLED(CONFIG_X86_SGX_KVM);
+ }
+
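+	/*
+	 * If the BIOS already set the lock bit, the MSR is read-only
+	 * until reset and the enable bits below cannot be changed;
+	 * just report what was configured.
+	 */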
+ if (msr & FEAT_CTL_LOCKED)
+ goto update_caps;
+
+ /*
+ * Ignore whatever value BIOS left in the MSR to avoid enabling random
+ * features or faulting on the WRMSR.
+ */
+ msr = FEAT_CTL_LOCKED;
+
+ /*
+ * Enable VMX if and only if the kernel may do VMXON at some point,
+ * i.e. KVM is enabled, to avoid unnecessarily adding an attack vector
+ * for the kernel, e.g. using VMX to hide malicious code.
+ */
+ if (enable_vmx) {
+ msr |= FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX;
+
+ if (tboot)
+ msr |= FEAT_CTL_VMX_ENABLED_INSIDE_SMX;
+ }
+
+ if (enable_sgx_kvm || enable_sgx_driver) {
+ msr |= FEAT_CTL_SGX_ENABLED;
+ if (enable_sgx_driver)
+ msr |= FEAT_CTL_SGX_LC_ENABLED;
+ }
+
+ wrmsrq(MSR_IA32_FEAT_CTL, msr);
+
+update_caps:
+ set_cpu_cap(c, X86_FEATURE_MSR_IA32_FEAT_CTL);
+
+ if (!cpu_has(c, X86_FEATURE_VMX))
+ goto update_sgx;
+
+ if ( (tboot && !(msr & FEAT_CTL_VMX_ENABLED_INSIDE_SMX)) ||
+ (!tboot && !(msr & FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX))) {
+ if (IS_ENABLED(CONFIG_KVM_INTEL))
+ pr_err_once("VMX (%s TXT) disabled by BIOS\n",
+ tboot ? "inside" : "outside");
+ clear_cpu_cap(c, X86_FEATURE_VMX);
+ } else {
+#ifdef CONFIG_X86_VMX_FEATURE_NAMES
+ init_vmx_capabilities(c);
+#endif
+ }
+
+update_sgx:
+ if (!(msr & FEAT_CTL_SGX_ENABLED)) {
+ if (enable_sgx_kvm || enable_sgx_driver)
+ pr_err_once("SGX disabled or unsupported by BIOS.\n");
+ clear_cpu_cap(c, X86_FEATURE_SGX);
+ return;
+ }
+
+ /*
+ * VMX feature bit may be cleared due to being disabled in BIOS,
+ * in which case SGX virtualization cannot be supported either.
+ */
+ if (!cpu_has(c, X86_FEATURE_VMX) && enable_sgx_kvm) {
+ pr_err_once("SGX virtualization disabled due to lack of VMX.\n");
+ enable_sgx_kvm = 0;
+ }
+
+ if (!(msr & FEAT_CTL_SGX_LC_ENABLED) && enable_sgx_driver) {
+ if (!enable_sgx_kvm) {
+ pr_err_once("SGX Launch Control is locked. Disable SGX.\n");
+ clear_cpu_cap(c, X86_FEATURE_SGX);
+ } else {
+ pr_err_once("SGX Launch Control is locked. Support SGX virtualization only.\n");
+ clear_cpu_cap(c, X86_FEATURE_SGX_LC);
+ }
+ }
+}
diff --git a/arch/x86/kernel/cpu/hygon.c b/arch/x86/kernel/cpu/hygon.c
new file mode 100644
index 000000000000..1fda6c3a2b65
--- /dev/null
+++ b/arch/x86/kernel/cpu/hygon.c
@@ -0,0 +1,279 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Hygon Processor Support for Linux
+ *
+ * Copyright (C) 2018 Chengdu Haiguang IC Design Co., Ltd.
+ *
+ * Author: Pu Wen <puwen@hygon.cn>
+ */
+#include <linux/io.h>
+
+#include <asm/apic.h>
+#include <asm/cpu.h>
+#include <asm/smp.h>
+#include <asm/numa.h>
+#include <asm/cacheinfo.h>
+#include <asm/spec-ctrl.h>
+#include <asm/delay.h>
+#include <asm/msr.h>
+#include <asm/resctrl.h>
+
+#include "cpu.h"
+
+#ifdef CONFIG_NUMA
+/*
+ * To work around a broken NUMA config. Read the comment in
+ * srat_detect_node().
+ */
+static int nearby_node(int apicid)
+{
+ int i, node;
+
+ for (i = apicid - 1; i >= 0; i--) {
+ node = __apicid_to_node[i];
+ if (node != NUMA_NO_NODE && node_online(node))
+ return node;
+ }
+ for (i = apicid + 1; i < MAX_LOCAL_APIC; i++) {
+ node = __apicid_to_node[i];
+ if (node != NUMA_NO_NODE && node_online(node))
+ return node;
+ }
+ return first_node(node_online_map); /* Shouldn't happen */
+}
+#endif
+
+static void srat_detect_node(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_NUMA
+ int cpu = smp_processor_id();
+ int node;
+ unsigned int apicid = c->topo.apicid;
+
+ node = numa_cpu_node(cpu);
+ if (node == NUMA_NO_NODE)
+ node = c->topo.llc_id;
+
+ /*
+	 * On a multi-fabric platform (e.g. Numascale NumaChip) a
+	 * platform-specific handler needs to be called to fix up some
+	 * of the CPU's IDs.
+ */
+ if (x86_cpuinit.fixup_cpu_id)
+ x86_cpuinit.fixup_cpu_id(c, node);
+
+ if (!node_online(node)) {
+ /*
+ * Two possibilities here:
+ *
+ * - The CPU is missing memory and no node was created. In
+ * that case try picking one from a nearby CPU.
+ *
+ * - The APIC IDs differ from the HyperTransport node IDs.
+ * Assume they are all increased by a constant offset, but
+ * in the same order as the HT nodeids. If that doesn't
+ * result in a usable node fall back to the path for the
+ * previous case.
+ *
+ * This workaround operates directly on the mapping between
+ * APIC ID and NUMA node, assuming certain relationship
+ * between APIC ID, HT node ID and NUMA topology. As going
+ * through CPU mapping may alter the outcome, directly
+ * access __apicid_to_node[].
+ */
+ int ht_nodeid = c->topo.initial_apicid;
+
+ if (__apicid_to_node[ht_nodeid] != NUMA_NO_NODE)
+ node = __apicid_to_node[ht_nodeid];
+ /* Pick a nearby node */
+ if (!node_online(node))
+ node = nearby_node(apicid);
+ }
+ numa_set_node(cpu, node);
+#endif
+}
+
+static void bsp_init_hygon(struct cpuinfo_x86 *c)
+{
+ if (cpu_has(c, X86_FEATURE_CONSTANT_TSC)) {
+ u64 val;
+
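+		/* HWCR bit 24 (TscFreqSel): the TSC increments at the P0 frequency. */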
+ rdmsrq(MSR_K7_HWCR, val);
+ if (!(val & BIT(24)))
+ pr_warn(FW_BUG "TSC doesn't count with P0 frequency!\n");
+ }
+
+ if (cpu_has(c, X86_FEATURE_MWAITX))
+ use_mwaitx_delay();
+
+ if (!boot_cpu_has(X86_FEATURE_AMD_SSBD) &&
+ !boot_cpu_has(X86_FEATURE_VIRT_SSBD)) {
+ /*
+ * Try to cache the base value so further operations can
+ * avoid RMW. If that faults, do not enable SSBD.
+ */
+ if (!rdmsrq_safe(MSR_AMD64_LS_CFG, &x86_amd_ls_cfg_base)) {
+ setup_force_cpu_cap(X86_FEATURE_LS_CFG_SSBD);
+ setup_force_cpu_cap(X86_FEATURE_SSBD);
+ x86_amd_ls_cfg_ssbd_mask = 1ULL << 10;
+ }
+ }
+
+ resctrl_cpu_detect(c);
+}
+
+static void early_init_hygon(struct cpuinfo_x86 *c)
+{
+ u32 dummy;
+
+ set_cpu_cap(c, X86_FEATURE_K8);
+
+ rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
+
+ /*
+	 * c->x86_power is CPUID 8000_0007 EDX. Bit 8 means the TSC runs at a
+	 * constant rate across P/T-state changes and does not stop in deep C-states.
+ */
+ if (c->x86_power & (1 << 8)) {
+ set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
+ set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
+ }
+
+	/* Bit 12 of 8000_0007 edx is the accumulated power mechanism. */
+ if (c->x86_power & BIT(12))
+ set_cpu_cap(c, X86_FEATURE_ACC_POWER);
+
+ /* Bit 14 indicates the Runtime Average Power Limit interface. */
+ if (c->x86_power & BIT(14))
+ set_cpu_cap(c, X86_FEATURE_RAPL);
+
+#ifdef CONFIG_X86_64
+ set_cpu_cap(c, X86_FEATURE_SYSCALL32);
+#endif
+
+#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_PCI)
+ /*
+	 * The APIC ID can always be treated as an 8-bit value on Hygon APICs,
+	 * so we can safely set X86_FEATURE_EXTD_APICID unconditionally.
+ */
+ if (boot_cpu_has(X86_FEATURE_APIC))
+ set_cpu_cap(c, X86_FEATURE_EXTD_APICID);
+#endif
+
+ /*
+ * This is only needed to tell the kernel whether to use VMCALL
+ * and VMMCALL. VMMCALL is never executed except under virt, so
+ * we can set it unconditionally.
+ */
+ set_cpu_cap(c, X86_FEATURE_VMMCALL);
+}
+
+static void init_hygon(struct cpuinfo_x86 *c)
+{
+ u64 vm_cr;
+
+ early_init_hygon(c);
+
+ /*
+	 * Bit 31 of the normal CPUID leaf is used for a nonstandard 3DNow! ID;
+	 * 3DNow! is identified by bit 31 of the extended CPUID leaf (1*32+31) anyway.
+ */
+ clear_cpu_cap(c, 0*32+31);
+
+ set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+
+ /*
+ * XXX someone from Hygon needs to confirm this DTRT
+ *
+ init_spectral_chicken(c);
+ */
+
+ set_cpu_cap(c, X86_FEATURE_ZEN);
+ set_cpu_cap(c, X86_FEATURE_CPB);
+
+ cpu_detect_cache_sizes(c);
+
+ srat_detect_node(c);
+
+ init_hygon_cacheinfo(c);
+
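+	/* The BIOS can disable (and lock out) SVM via MSR_VM_CR; honor that. */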
+ if (cpu_has(c, X86_FEATURE_SVM)) {
+ rdmsrq(MSR_VM_CR, vm_cr);
+ if (vm_cr & SVM_VM_CR_SVM_DIS_MASK) {
+ pr_notice_once("SVM disabled (by BIOS) in MSR_VM_CR\n");
+ clear_cpu_cap(c, X86_FEATURE_SVM);
+ }
+ }
+
+ if (cpu_has(c, X86_FEATURE_XMM2)) {
+ /*
+ * Use LFENCE for execution serialization. On families which
+ * don't have that MSR, LFENCE is already serializing.
+ * msr_set_bit() uses the safe accessors, too, even if the MSR
+ * is not present.
+ */
+ msr_set_bit(MSR_AMD64_DE_CFG,
+ MSR_AMD64_DE_CFG_LFENCE_SERIALIZE_BIT);
+
+ /* A serializing LFENCE stops RDTSC speculation */
+ set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
+ }
+
+	/* Hygon processors keep the APIC timer running in deep C-states. */
+ set_cpu_cap(c, X86_FEATURE_ARAT);
+
+ /* Hygon CPUs don't reset SS attributes on SYSRET, Xen does. */
+ if (!cpu_feature_enabled(X86_FEATURE_XENPV))
+ set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
+
+ check_null_seg_clears_base(c);
+
+ /* Hygon CPUs don't need fencing after x2APIC/TSC_DEADLINE MSR writes. */
+ clear_cpu_cap(c, X86_FEATURE_APIC_MSRS_FENCE);
+}
+
+static void cpu_detect_tlb_hygon(struct cpuinfo_x86 *c)
+{
+ u32 ebx, eax, ecx, edx;
+ u16 mask = 0xfff;
+
+ if (c->extended_cpuid_level < 0x80000006)
+ return;
+
+ cpuid(0x80000006, &eax, &ebx, &ecx, &edx);
+
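+	/*
+	 * CPUID 0x80000006 EBX reports the L2 4K dTLB entry count in bits
+	 * 27:16 and the L2 4K iTLB entry count in bits 11:0; the 0xfff mask
+	 * strips the associativity fields.
+	 */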
+ tlb_lld_4k = (ebx >> 16) & mask;
+ tlb_lli_4k = ebx & mask;
+
+ /* Handle DTLB 2M and 4M sizes, fall back to L1 if L2 is disabled */
+ if (!((eax >> 16) & mask))
+ tlb_lld_2m = (cpuid_eax(0x80000005) >> 16) & 0xff;
+ else
+ tlb_lld_2m = (eax >> 16) & mask;
+
+ /* a 4M entry uses two 2M entries */
+ tlb_lld_4m = tlb_lld_2m >> 1;
+
+ /* Handle ITLB 2M and 4M sizes, fall back to L1 if L2 is disabled */
+ if (!(eax & mask)) {
+ cpuid(0x80000005, &eax, &ebx, &ecx, &edx);
+ tlb_lli_2m = eax & 0xff;
+ } else
+ tlb_lli_2m = eax & mask;
+
+ tlb_lli_4m = tlb_lli_2m >> 1;
+}
+
+static const struct cpu_dev hygon_cpu_dev = {
+ .c_vendor = "Hygon",
+ .c_ident = { "HygonGenuine" },
+ .c_early_init = early_init_hygon,
+ .c_detect_tlb = cpu_detect_tlb_hygon,
+ .c_bsp_init = bsp_init_hygon,
+ .c_init = init_hygon,
+ .c_x86_vendor = X86_VENDOR_HYGON,
+};
+
+cpu_dev_register(hygon_cpu_dev);
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index dd531cc56a8f..f3e9219845e8 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -21,55 +21,92 @@
*
*/
-#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/export.h>
#include <asm/processor.h>
#include <asm/hypervisor.h>
-/*
- * Hypervisor detect order. This is specified explicitly here because
- * some hypervisors might implement compatibility modes for other
- * hypervisors and therefore need to be detected in specific sequence.
- */
static const __initconst struct hypervisor_x86 * const hypervisors[] =
{
+#ifdef CONFIG_XEN_PV
+ &x86_hyper_xen_pv,
+#endif
+#ifdef CONFIG_XEN_PVHVM
+ &x86_hyper_xen_hvm,
+#endif
&x86_hyper_vmware,
&x86_hyper_ms_hyperv,
+#ifdef CONFIG_KVM_GUEST
+ &x86_hyper_kvm,
+#endif
+#ifdef CONFIG_JAILHOUSE_GUEST
+ &x86_hyper_jailhouse,
+#endif
+#ifdef CONFIG_ACRN_GUEST
+ &x86_hyper_acrn,
+#endif
+#ifdef CONFIG_BHYVE_GUEST
+ &x86_hyper_bhyve,
+#endif
};
-const struct hypervisor_x86 *x86_hyper;
-EXPORT_SYMBOL(x86_hyper);
+enum x86_hypervisor_type x86_hyper_type;
+EXPORT_SYMBOL(x86_hyper_type);
+
+bool __initdata nopv;
+static __init int parse_nopv(char *arg)
+{
+ nopv = true;
+ return 0;
+}
+early_param("nopv", parse_nopv);
-static inline void __init
+static inline const struct hypervisor_x86 * __init
detect_hypervisor_vendor(void)
{
- const struct hypervisor_x86 *h, * const *p;
+ const struct hypervisor_x86 *h = NULL, * const *p;
+ uint32_t pri, max_pri = 0;
for (p = hypervisors; p < hypervisors + ARRAY_SIZE(hypervisors); p++) {
- h = *p;
- if (h->detect()) {
- x86_hyper = h;
- printk(KERN_INFO "Hypervisor detected: %s\n", h->name);
- break;
+ if (unlikely(nopv) && !(*p)->ignore_nopv)
+ continue;
+
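+		/*
+		 * detect() returns a priority rather than a boolean; the
+		 * highest priority wins, so a hypervisor that implements a
+		 * compatibility interface for another cannot shadow the
+		 * native one.
+		 */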
+ pri = (*p)->detect();
+ if (pri > max_pri) {
+ max_pri = pri;
+ h = *p;
}
}
+
+ if (h)
+ pr_info("Hypervisor detected: %s\n", h->name);
+
+ return h;
}
-void __cpuinit init_hypervisor(struct cpuinfo_x86 *c)
+static void __init copy_array(const void *src, void *target, unsigned int size)
{
- if (x86_hyper && x86_hyper->set_cpu_features)
- x86_hyper->set_cpu_features(c);
+ unsigned int i, n = size / sizeof(void *);
+ const void * const *from = (const void * const *)src;
+ const void **to = (const void **)target;
+
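+	/* Copy only the ops the hypervisor provides; NULL slots keep the defaults. */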
+ for (i = 0; i < n; i++)
+ if (from[i])
+ to[i] = from[i];
}
void __init init_hypervisor_platform(void)
{
+ const struct hypervisor_x86 *h;
- detect_hypervisor_vendor();
+ h = detect_hypervisor_vendor();
- if (!x86_hyper)
+ if (!h)
return;
- init_hypervisor(&boot_cpu_data);
+ copy_array(&h->init, &x86_init.hyper, sizeof(h->init));
+ copy_array(&h->runtime, &x86_platform.hyper, sizeof(h->runtime));
- if (x86_hyper->init_platform)
- x86_hyper->init_platform();
+ x86_hyper_type = h->type;
+ x86_init.hyper.init_platform();
}
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 85f69cdeae10..98ae4c37c93e 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1,50 +1,226 @@
-#include <linux/init.h>
-#include <linux/kernel.h>
+// SPDX-License-Identifier: GPL-2.0
-#include <linux/string.h>
#include <linux/bitops.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/minmax.h>
#include <linux/smp.h>
-#include <linux/sched.h>
-#include <linux/thread_info.h>
-#include <linux/module.h>
-#include <linux/uaccess.h>
-
-#include <asm/processor.h>
-#include <asm/pgtable.h>
-#include <asm/msr.h>
-#include <asm/bugs.h>
-#include <asm/cpu.h>
+#include <linux/string.h>
+#include <linux/types.h>
#ifdef CONFIG_X86_64
#include <linux/topology.h>
-#include <asm/numa_64.h>
#endif
+#include <asm/bugs.h>
+#include <asm/cpu_device_id.h>
+#include <asm/cpufeature.h>
+#include <asm/cpu.h>
+#include <asm/cpuid/api.h>
+#include <asm/hwcap2.h>
+#include <asm/intel-family.h>
+#include <asm/microcode.h>
+#include <asm/msr.h>
+#include <asm/numa.h>
+#include <asm/resctrl.h>
+#include <asm/thermal.h>
+#include <asm/uaccess.h>
+
#include "cpu.h"
-#ifdef CONFIG_X86_LOCAL_APIC
-#include <asm/mpspec.h>
-#include <asm/apic.h>
-#endif
+/*
+ * Processors which have self-snooping capability can handle conflicting
+ * memory types across CPUs by snooping their own caches. However, there
+ * exist CPU models in which having conflicting memory types still leads to
+ * unpredictable behavior, machine check errors, or hangs. Clear this
+ * feature to prevent its use on machines with known errata.
+ */
+static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
+{
+ switch (c->x86_vfm) {
+ case INTEL_CORE_YONAH:
+ case INTEL_CORE2_MEROM:
+ case INTEL_CORE2_MEROM_L:
+ case INTEL_CORE2_PENRYN:
+ case INTEL_CORE2_DUNNINGTON:
+ case INTEL_NEHALEM:
+ case INTEL_NEHALEM_G:
+ case INTEL_NEHALEM_EP:
+ case INTEL_NEHALEM_EX:
+ case INTEL_WESTMERE:
+ case INTEL_WESTMERE_EP:
+ case INTEL_SANDYBRIDGE:
+ setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
+ }
+}
+
+static bool ring3mwait_disabled __read_mostly;
+
+static int __init ring3mwait_disable(char *__unused)
+{
+ ring3mwait_disabled = true;
+ return 1;
+}
+__setup("ring3mwait=disable", ring3mwait_disable);
-static void __cpuinit early_init_intel(struct cpuinfo_x86 *c)
+static void probe_xeon_phi_r3mwait(struct cpuinfo_x86 *c)
{
- /* Unmask CPUID levels if masked: */
- if (c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xd)) {
- u64 misc_enable;
+ /*
+	 * The ring 3 MONITOR/MWAIT feature cannot be detected without
+	 * comparing the CPU family and model.
+ */
+ if (c->x86 != 6)
+ return;
+ switch (c->x86_vfm) {
+ case INTEL_XEON_PHI_KNL:
+ case INTEL_XEON_PHI_KNM:
+ break;
+ default:
+ return;
+ }
- rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable);
+ if (ring3mwait_disabled)
+ return;
- if (misc_enable & MSR_IA32_MISC_ENABLE_LIMIT_CPUID) {
- misc_enable &= ~MSR_IA32_MISC_ENABLE_LIMIT_CPUID;
- wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable);
- c->cpuid_level = cpuid_eax(0);
- }
+ set_cpu_cap(c, X86_FEATURE_RING3MWAIT);
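+	/*
+	 * Accumulate the enable bit in the per-CPU shadow; it is written to
+	 * MSR_MISC_FEATURES_ENABLES in init_intel_misc_features().
+	 */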
+ this_cpu_or(msr_misc_features_shadow,
+ 1UL << MSR_MISC_FEATURES_ENABLES_RING3MWAIT_BIT);
+
+ if (c == &boot_cpu_data)
+ ELF_HWCAP2 |= HWCAP2_RING3MWAIT;
+}
+
+/*
+ * Early microcode releases for the Spectre v2 mitigation were broken.
+ * Information taken from;
+ * - https://newsroom.intel.com/wp-content/uploads/sites/11/2018/03/microcode-update-guidance.pdf
+ * - https://kb.vmware.com/s/article/52345
+ * - Microcode revisions observed in the wild
+ * - Release note from 20180108 microcode release
+ */
+struct sku_microcode {
+ u32 vfm;
+ u8 stepping;
+ u32 microcode;
+};
+static const struct sku_microcode spectre_bad_microcodes[] = {
+ { INTEL_KABYLAKE, 0x0B, 0x80 },
+ { INTEL_KABYLAKE, 0x0A, 0x80 },
+ { INTEL_KABYLAKE, 0x09, 0x80 },
+ { INTEL_KABYLAKE_L, 0x0A, 0x80 },
+ { INTEL_KABYLAKE_L, 0x09, 0x80 },
+ { INTEL_SKYLAKE_X, 0x03, 0x0100013e },
+ { INTEL_SKYLAKE_X, 0x04, 0x0200003c },
+ { INTEL_BROADWELL, 0x04, 0x28 },
+ { INTEL_BROADWELL_G, 0x01, 0x1b },
+ { INTEL_BROADWELL_D, 0x02, 0x14 },
+ { INTEL_BROADWELL_D, 0x03, 0x07000011 },
+ { INTEL_BROADWELL_X, 0x01, 0x0b000025 },
+ { INTEL_HASWELL_L, 0x01, 0x21 },
+ { INTEL_HASWELL_G, 0x01, 0x18 },
+ { INTEL_HASWELL, 0x03, 0x23 },
+ { INTEL_HASWELL_X, 0x02, 0x3b },
+ { INTEL_HASWELL_X, 0x04, 0x10 },
+ { INTEL_IVYBRIDGE_X, 0x04, 0x42a },
+ /* Observed in the wild */
+ { INTEL_SANDYBRIDGE_X, 0x06, 0x61b },
+ { INTEL_SANDYBRIDGE_X, 0x07, 0x712 },
+};
+
+static bool bad_spectre_microcode(struct cpuinfo_x86 *c)
+{
+ int i;
+
+ /*
+	 * We know that hypervisors lie to us about the microcode version, so
+	 * we may as well hope that they are running the correct version.
+ */
+ if (cpu_has(c, X86_FEATURE_HYPERVISOR))
+ return false;
+
+ for (i = 0; i < ARRAY_SIZE(spectre_bad_microcodes); i++) {
+ if (c->x86_vfm == spectre_bad_microcodes[i].vfm &&
+ c->x86_stepping == spectre_bad_microcodes[i].stepping)
+ return (c->microcode <= spectre_bad_microcodes[i].microcode);
}
+ return false;
+}
- if ((c->x86 == 0xf && c->x86_model >= 0x03) ||
- (c->x86 == 0x6 && c->x86_model >= 0x0e))
- set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
+#define MSR_IA32_TME_ACTIVATE 0x982
+
+/* Helpers to access TME_ACTIVATE MSR */
+#define TME_ACTIVATE_LOCKED(x) (x & 0x1)
+#define TME_ACTIVATE_ENABLED(x) (x & 0x2)
+
+#define TME_ACTIVATE_KEYID_BITS(x) ((x >> 32) & 0xf) /* Bits 35:32 */
+
+static void detect_tme_early(struct cpuinfo_x86 *c)
+{
+ u64 tme_activate;
+ int keyid_bits;
+
+ rdmsrq(MSR_IA32_TME_ACTIVATE, tme_activate);
+
+ if (!TME_ACTIVATE_LOCKED(tme_activate) || !TME_ACTIVATE_ENABLED(tme_activate)) {
+ pr_info_once("x86/tme: not enabled by BIOS\n");
+ clear_cpu_cap(c, X86_FEATURE_TME);
+ return;
+ }
+ pr_info_once("x86/tme: enabled by BIOS\n");
+ keyid_bits = TME_ACTIVATE_KEYID_BITS(tme_activate);
+ if (!keyid_bits)
+ return;
+
+ /*
+ * KeyID bits are set by BIOS and can be present regardless
+ * of whether the kernel is using them. They effectively lower
+ * the number of physical address bits.
+ *
+ * Update cpuinfo_x86::x86_phys_bits accordingly.
+ */
+ c->x86_phys_bits -= keyid_bits;
+ pr_info_once("x86/mktme: BIOS enabled: x86_phys_bits reduced by %d\n",
+ keyid_bits);
+}
+
+void intel_unlock_cpuid_leafs(struct cpuinfo_x86 *c)
+{
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ return;
+
+ if (c->x86_vfm < INTEL_PENTIUM_M_DOTHAN)
+ return;
+
+ /*
+	 * The BIOS may have limited CPUID to leaf 2, which breaks feature
+ * enumeration. Unlock it and update the maximum leaf info.
+ */
+ if (msr_clear_bit(MSR_IA32_MISC_ENABLE, MSR_IA32_MISC_ENABLE_LIMIT_CPUID_BIT) > 0)
+ c->cpuid_level = cpuid_eax(0);
+}
+
+static void early_init_intel(struct cpuinfo_x86 *c)
+{
+ u64 misc_enable;
+
+ if (c->x86 >= 6 && !cpu_has(c, X86_FEATURE_IA64))
+ c->microcode = intel_get_microcode_revision();
+
+ /* Now if any of them are set, check the blacklist and clear the lot */
+ if ((cpu_has(c, X86_FEATURE_SPEC_CTRL) ||
+ cpu_has(c, X86_FEATURE_INTEL_STIBP) ||
+ cpu_has(c, X86_FEATURE_IBRS) || cpu_has(c, X86_FEATURE_IBPB) ||
+ cpu_has(c, X86_FEATURE_STIBP)) && bad_spectre_microcode(c)) {
+ pr_warn("Intel Spectre v2 broken microcode detected; disabling Speculation Control\n");
+ setup_clear_cpu_cap(X86_FEATURE_IBRS);
+ setup_clear_cpu_cap(X86_FEATURE_IBPB);
+ setup_clear_cpu_cap(X86_FEATURE_STIBP);
+ setup_clear_cpu_cap(X86_FEATURE_SPEC_CTRL);
+ setup_clear_cpu_cap(X86_FEATURE_MSR_SPEC_CTRL);
+ setup_clear_cpu_cap(X86_FEATURE_INTEL_STIBP);
+ setup_clear_cpu_cap(X86_FEATURE_SSBD);
+ setup_clear_cpu_cap(X86_FEATURE_SPEC_CTRL_SSBD);
+ }
/*
* Atom erratum AAE44/AAF40/AAG38/AAH41:
@@ -54,17 +230,10 @@ static void __cpuinit early_init_intel(struct cpuinfo_x86 *c)
* need the microcode to have already been loaded... so if it is
* not, recommend a BIOS update and disable large pages.
*/
- if (c->x86 == 6 && c->x86_model == 0x1c && c->x86_mask <= 2) {
- u32 ucode, junk;
-
- wrmsr(MSR_IA32_UCODE_REV, 0, 0);
- sync_core();
- rdmsr(MSR_IA32_UCODE_REV, junk, ucode);
-
- if (ucode < 0x20e) {
- printk(KERN_WARNING "Atom PSE erratum detected, BIOS microcode update recommended\n");
- clear_cpu_cap(c, X86_FEATURE_PSE);
- }
+ if (c->x86_vfm == INTEL_ATOM_BONNELL && c->x86_stepping <= 2 &&
+ c->microcode < 0x20e) {
+ pr_warn("Atom PSE erratum detected, BIOS microcode update recommended\n");
+ clear_cpu_cap(c, X86_FEATURE_PSE);
}
#ifdef CONFIG_X86_64
@@ -76,8 +245,8 @@ static void __cpuinit early_init_intel(struct cpuinfo_x86 *c)
#endif
/* CPUID workaround for 0F33/0F34 CPU */
- if (c->x86 == 0xF && c->x86_model == 0x3
- && (c->x86_mask == 0x3 || c->x86_mask == 0x4))
+ if (c->x86_vfm == INTEL_P4_PRESCOTT &&
+ (c->x86_stepping == 0x3 || c->x86_stepping == 0x4))
c->x86_phys_bits = 36;
/*
@@ -86,49 +255,91 @@ static void __cpuinit early_init_intel(struct cpuinfo_x86 *c)
*
* It is also reliable across cores and sockets. (but not across
* cabinets - we turn it off in that case explicitly.)
+ *
+ * Use a model-specific check for some older CPUs that have invariant
+ * TSC but may not report it architecturally via 8000_0007.
*/
if (c->x86_power & (1 << 8)) {
set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
- if (!check_tsc_unstable())
- sched_clock_stable = 1;
+ } else if ((c->x86_vfm >= INTEL_P4_PRESCOTT && c->x86_vfm <= INTEL_P4_CEDARMILL) ||
+ (c->x86_vfm >= INTEL_CORE_YONAH && c->x86_vfm <= INTEL_IVYBRIDGE)) {
+ set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
+ }
+
+	/* Penwell and Cloverview have a TSC that does not stop across S3 */
+ switch (c->x86_vfm) {
+ case INTEL_ATOM_SALTWELL_MID:
+ case INTEL_ATOM_SALTWELL_TABLET:
+ case INTEL_ATOM_SILVERMONT_MID:
+ case INTEL_ATOM_AIRMONT_NP:
+ set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC_S3);
+ break;
}
/*
- * There is a known erratum on Pentium III and Core Solo
- * and Core Duo CPUs.
- * " Page with PAT set to WC while associated MTRR is UC
- * may consolidate to UC "
- * Because of this erratum, it is better to stick with
- * setting WC in MTRR rather than using PAT on these CPUs.
+ * PAT is broken on early family 6 CPUs, the last of which
+ * is "Yonah" where the erratum is named "AN7":
+ *
+ * Page with PAT (Page Attribute Table) Set to USWC
+ * (Uncacheable Speculative Write Combine) While
+ * Associated MTRR (Memory Type Range Register) Is UC
+ * (Uncacheable) May Consolidate to UC
*
- * Enable PAT WC only on P4, Core 2 or later CPUs.
+ * Disable PAT and fall back to MTRR on these CPUs.
*/
- if (c->x86 == 6 && c->x86_model < 15)
+ if (c->x86_vfm >= INTEL_PENTIUM_PRO &&
+ c->x86_vfm <= INTEL_CORE_YONAH)
clear_cpu_cap(c, X86_FEATURE_PAT);
-#ifdef CONFIG_KMEMCHECK
/*
- * P4s have a "fast strings" feature which causes single-
- * stepping REP instructions to only generate a #DB on
- * cache-line boundaries.
+ * Modern CPUs are generally expected to have a sane fast string
+ * implementation. However, BIOSes typically have a knob to tweak
+ * the architectural MISC_ENABLE.FAST_STRING enable bit.
*
- * Ingo Molnar reported a Pentium D (model 6) and a Xeon
- * (model 2) with the same problem.
+ * Adhere to the preference and program the Linux-defined fast
+ * string flag and enhanced fast string capabilities accordingly.
*/
- if (c->x86 == 15) {
- u64 misc_enable;
-
- rdmsrl(MSR_IA32_MISC_ENABLE, misc_enable);
-
+ if (c->x86_vfm >= INTEL_PENTIUM_M_DOTHAN) {
+ rdmsrq(MSR_IA32_MISC_ENABLE, misc_enable);
if (misc_enable & MSR_IA32_MISC_ENABLE_FAST_STRING) {
- printk(KERN_INFO "kmemcheck: Disabling fast string operations\n");
-
- misc_enable &= ~MSR_IA32_MISC_ENABLE_FAST_STRING;
- wrmsrl(MSR_IA32_MISC_ENABLE, misc_enable);
+ /* X86_FEATURE_ERMS is set based on CPUID */
+ set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+ } else {
+ pr_info("Disabled fast string operations\n");
+ setup_clear_cpu_cap(X86_FEATURE_REP_GOOD);
+ setup_clear_cpu_cap(X86_FEATURE_ERMS);
}
}
-#endif
+
+ /*
+ * Intel Quark Core DevMan_001.pdf section 6.4.11
+ * "The operating system also is required to invalidate (i.e., flush)
+ * the TLB when any changes are made to any of the page table entries.
+ * The operating system must reload CR3 to cause the TLB to be flushed"
+ *
+ * As a result, boot_cpu_has(X86_FEATURE_PGE) in arch/x86/include/asm/tlbflush.h
+ * should be false so that __flush_tlb_all() causes CR3 instead of CR4.PGE
+ * to be modified.
+ */
+ if (c->x86_vfm == INTEL_QUARK_X1000) {
+ pr_info("Disabling PGE capability bit\n");
+ setup_clear_cpu_cap(X86_FEATURE_PGE);
+ }
+
+ check_memory_type_self_snoop_errata(c);
+
+ /*
+ * Adjust the number of physical bits early because it affects the
+ * valid bits of the MTRR mask registers.
+ */
+ if (cpu_has(c, X86_FEATURE_TME))
+ detect_tme_early(c);
+}
+
+static void bsp_init_intel(struct cpuinfo_x86 *c)
+{
+ resctrl_cpu_detect(c);
}
#ifdef CONFIG_X86_32
@@ -138,74 +349,60 @@ static void __cpuinit early_init_intel(struct cpuinfo_x86 *c)
* This is called before we do cpu ident work
*/
-int __cpuinit ppro_with_ram_bug(void)
+int ppro_with_ram_bug(void)
{
/* Uses data from early_cpu_detect now */
- if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
- boot_cpu_data.x86 == 6 &&
- boot_cpu_data.x86_model == 1 &&
- boot_cpu_data.x86_mask < 8) {
- printk(KERN_INFO "Pentium Pro with Errata#50 detected. Taking evasive action.\n");
+ if (boot_cpu_data.x86_vfm == INTEL_PENTIUM_PRO &&
+ boot_cpu_data.x86_stepping < 8) {
+ pr_info("Pentium Pro with Errata#50 detected. Taking evasive action.\n");
return 1;
}
return 0;
}
-#ifdef CONFIG_X86_F00F_BUG
-static void __cpuinit trap_init_f00f_bug(void)
+static void intel_smp_check(struct cpuinfo_x86 *c)
{
- __set_fixmap(FIX_F00F_IDT, __pa(&idt_table), PAGE_KERNEL_RO);
-
- /*
- * Update the IDT descriptor and reload the IDT so that
- * it uses the read-only mapped virtual address.
- */
- idt_descr.address = fix_to_virt(FIX_F00F_IDT);
- load_idt(&idt_descr);
-}
-#endif
-
-static void __cpuinit intel_smp_check(struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_SMP
/* calling is from identify_secondary_cpu() ? */
- if (c->cpu_index == boot_cpu_id)
+ if (!c->cpu_index)
return;
/*
* Mask B, Pentium, but not Pentium MMX
*/
- if (c->x86 == 5 &&
- c->x86_mask >= 1 && c->x86_mask <= 4 &&
- c->x86_model <= 3) {
+ if (c->x86_vfm >= INTEL_FAM5_START && c->x86_vfm < INTEL_PENTIUM_MMX &&
+ c->x86_stepping >= 1 && c->x86_stepping <= 4) {
/*
* Remember we have B step Pentia with bugs
*/
		WARN_ONCE(1, "WARNING: SMP operation may be unreliable "
				"with B stepping processors.\n");
}
-#endif
}
-static void __cpuinit intel_workarounds(struct cpuinfo_x86 *c)
+static int forcepae;
+static int __init forcepae_setup(char *__unused)
{
- unsigned long lo, hi;
+ forcepae = 1;
+ return 1;
+}
+__setup("forcepae", forcepae_setup);
+static void intel_workarounds(struct cpuinfo_x86 *c)
+{
#ifdef CONFIG_X86_F00F_BUG
/*
- * All current models of Pentium and Pentium with MMX technology CPUs
+ * All models of Pentium and Pentium with MMX technology CPUs
* have the F0 0F bug, which lets nonprivileged users lock up the
- * system.
- * Note that the workaround only should be initialized once...
+ * system. Announce that the fault handler will be checking for it.
+ * The Quark is also family 5, but does not have the same bug.
*/
- c->f00f_bug = 0;
- if (!paravirt_enabled() && c->x86 == 5) {
+ clear_cpu_bug(c, X86_BUG_F00F);
+ if (c->x86_vfm >= INTEL_FAM5_START && c->x86_vfm < INTEL_QUARK_X1000) {
static int f00f_workaround_enabled;
- c->f00f_bug = 1;
+ set_cpu_bug(c, X86_BUG_F00F);
if (!f00f_workaround_enabled) {
- trap_init_f00f_bug();
- printk(KERN_NOTICE "Intel Pentium with F0 0F bug - workaround enabled.\n");
+ pr_notice("Intel Pentium with F0 0F bug - workaround enabled.\n");
f00f_workaround_enabled = 1;
}
}
@@ -215,20 +412,30 @@ static void __cpuinit intel_workarounds(struct cpuinfo_x86 *c)
* SEP CPUID bug: Pentium Pro reports SEP but doesn't have it until
* model 3 mask 3
*/
- if ((c->x86<<8 | c->x86_model<<4 | c->x86_mask) < 0x633)
+ if ((c->x86_vfm == INTEL_PENTIUM_II_KLAMATH && c->x86_stepping < 3) ||
+ c->x86_vfm < INTEL_PENTIUM_II_KLAMATH)
clear_cpu_cap(c, X86_FEATURE_SEP);
/*
- * P4 Xeon errata 037 workaround.
+ * PAE CPUID issue: many Pentium M report no PAE but may have a
+ * functionally usable PAE implementation.
+ * Forcefully enable PAE if kernel parameter "forcepae" is present.
+ */
+ if (forcepae) {
+ pr_warn("PAE forced!\n");
+ set_cpu_cap(c, X86_FEATURE_PAE);
+ add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_NOW_UNRELIABLE);
+ }
+
+ /*
+ * P4 Xeon erratum 037 workaround.
* Hardware prefetcher may cause stale data to be loaded into the cache.
*/
- if ((c->x86 == 15) && (c->x86_model == 1) && (c->x86_mask == 1)) {
- rdmsr(MSR_IA32_MISC_ENABLE, lo, hi);
- if ((lo & MSR_IA32_MISC_ENABLE_PREFETCH_DISABLE) == 0) {
- printk (KERN_INFO "CPU: C0 stepping P4 Xeon detected.\n");
- printk (KERN_INFO "CPU: Disabling hardware prefetching (Errata 037)\n");
- lo |= MSR_IA32_MISC_ENABLE_PREFETCH_DISABLE;
- wrmsr(MSR_IA32_MISC_ENABLE, lo, hi);
+ if (c->x86_vfm == INTEL_P4_WILLAMETTE && c->x86_stepping == 1) {
+ if (msr_set_bit(MSR_IA32_MISC_ENABLE,
+ MSR_IA32_MISC_ENABLE_PREFETCH_DISABLE_BIT) > 0) {
+ pr_info("CPU: C0 stepping P4 Xeon detected.\n");
+ pr_info("CPU: Disabling hardware prefetching (Erratum 037)\n");
}
}
@@ -238,54 +445,40 @@ static void __cpuinit intel_workarounds(struct cpuinfo_x86 *c)
* integrated APIC (see 11AP erratum in "Pentium Processor
* Specification Update").
*/
- if (cpu_has_apic && (c->x86<<8 | c->x86_model<<4) == 0x520 &&
- (c->x86_mask < 0x6 || c->x86_mask == 0xb))
- set_cpu_cap(c, X86_FEATURE_11AP);
-
+ if (boot_cpu_has(X86_FEATURE_APIC) && c->x86_vfm == INTEL_PENTIUM_75 &&
+ (c->x86_stepping < 0x6 || c->x86_stepping == 0xb))
+ set_cpu_bug(c, X86_BUG_11AP);
#ifdef CONFIG_X86_INTEL_USERCOPY
/*
- * Set up the preferred alignment for movsl bulk memory moves
+ * MOVSL bulk memory moves can be slow when source and dest are not
+ * both 8-byte aligned. PII/PIII only like MOVSL with 8-byte alignment.
+ *
+ * Set the preferred alignment for Pentium Pro and newer processors, as
+ * it has only been tested on these.
*/
- switch (c->x86) {
- case 4: /* 486: untested */
- break;
- case 5: /* Old Pentia: untested */
- break;
- case 6: /* PII/PIII only like movsl with 8-byte alignment */
+ if (c->x86_vfm >= INTEL_PENTIUM_PRO)
movsl_mask.mask = 7;
- break;
- case 15: /* P4 is OK down to 8-byte alignment */
- movsl_mask.mask = 7;
- break;
- }
-#endif
-
-#ifdef CONFIG_X86_NUMAQ
- numaq_tsc_disable();
#endif
intel_smp_check(c);
}
#else
-static void __cpuinit intel_workarounds(struct cpuinfo_x86 *c)
+static void intel_workarounds(struct cpuinfo_x86 *c)
{
}
#endif
-static void __cpuinit srat_detect_node(struct cpuinfo_x86 *c)
+static void srat_detect_node(struct cpuinfo_x86 *c)
{
-#if defined(CONFIG_NUMA) && defined(CONFIG_X86_64)
+#ifdef CONFIG_NUMA
unsigned node;
int cpu = smp_processor_id();
- int apicid = cpu_has_apic ? hard_smp_processor_id() : c->apicid;
/* Don't do the funky fallback heuristics the AMD version employs
for now. */
- node = apicid_to_node[apicid];
- if (node == NUMA_NO_NODE)
- node = first_node(node_online_map);
- else if (!node_online(node)) {
+ node = numa_cpu_node(cpu);
+ if (node == NUMA_NO_NODE || !node_online(node)) {
/* reuse the value from init_cpu_to_node() */
node = cpu_to_node(cpu);
}
@@ -293,78 +486,61 @@ static void __cpuinit srat_detect_node(struct cpuinfo_x86 *c)
#endif
}
-/*
- * find out the number of processor cores on the die
- */
-static int __cpuinit intel_num_cpu_cores(struct cpuinfo_x86 *c)
+static void init_cpuid_fault(struct cpuinfo_x86 *c)
{
- unsigned int eax, ebx, ecx, edx;
+ u64 msr;
- if (c->cpuid_level < 4)
- return 1;
-
- /* Intel has a non-standard dependency on %ecx for this CPUID level. */
- cpuid_count(4, 0, &eax, &ebx, &ecx, &edx);
- if (eax & 0x1f)
- return (eax >> 26) + 1;
- else
- return 1;
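+	/*
+	 * MSR_PLATFORM_INFO bit 31 advertises CPUID faulting: once enabled,
+	 * CPUID executed at CPL > 0 raises #GP.
+	 */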
+ if (!rdmsrq_safe(MSR_PLATFORM_INFO, &msr)) {
+ if (msr & MSR_PLATFORM_INFO_CPUID_FAULT)
+ set_cpu_cap(c, X86_FEATURE_CPUID_FAULT);
+ }
}
-static void __cpuinit detect_vmx_virtcap(struct cpuinfo_x86 *c)
+static void init_intel_misc_features(struct cpuinfo_x86 *c)
{
- /* Intel VMX MSR indicated features */
-#define X86_VMX_FEATURE_PROC_CTLS_TPR_SHADOW 0x00200000
-#define X86_VMX_FEATURE_PROC_CTLS_VNMI 0x00400000
-#define X86_VMX_FEATURE_PROC_CTLS_2ND_CTLS 0x80000000
-#define X86_VMX_FEATURE_PROC_CTLS2_VIRT_APIC 0x00000001
-#define X86_VMX_FEATURE_PROC_CTLS2_EPT 0x00000002
-#define X86_VMX_FEATURE_PROC_CTLS2_VPID 0x00000020
-
- u32 vmx_msr_low, vmx_msr_high, msr_ctl, msr_ctl2;
-
- clear_cpu_cap(c, X86_FEATURE_TPR_SHADOW);
- clear_cpu_cap(c, X86_FEATURE_VNMI);
- clear_cpu_cap(c, X86_FEATURE_FLEXPRIORITY);
- clear_cpu_cap(c, X86_FEATURE_EPT);
- clear_cpu_cap(c, X86_FEATURE_VPID);
-
- rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
- msr_ctl = vmx_msr_high | vmx_msr_low;
- if (msr_ctl & X86_VMX_FEATURE_PROC_CTLS_TPR_SHADOW)
- set_cpu_cap(c, X86_FEATURE_TPR_SHADOW);
- if (msr_ctl & X86_VMX_FEATURE_PROC_CTLS_VNMI)
- set_cpu_cap(c, X86_FEATURE_VNMI);
- if (msr_ctl & X86_VMX_FEATURE_PROC_CTLS_2ND_CTLS) {
- rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2,
- vmx_msr_low, vmx_msr_high);
- msr_ctl2 = vmx_msr_high | vmx_msr_low;
- if ((msr_ctl2 & X86_VMX_FEATURE_PROC_CTLS2_VIRT_APIC) &&
- (msr_ctl & X86_VMX_FEATURE_PROC_CTLS_TPR_SHADOW))
- set_cpu_cap(c, X86_FEATURE_FLEXPRIORITY);
- if (msr_ctl2 & X86_VMX_FEATURE_PROC_CTLS2_EPT)
- set_cpu_cap(c, X86_FEATURE_EPT);
- if (msr_ctl2 & X86_VMX_FEATURE_PROC_CTLS2_VPID)
- set_cpu_cap(c, X86_FEATURE_VPID);
- }
+ u64 msr;
+
+ if (rdmsrq_safe(MSR_MISC_FEATURES_ENABLES, &msr))
+ return;
+
+ /* Clear all MISC features */
+ this_cpu_write(msr_misc_features_shadow, 0);
+
+ /* Check features and update capabilities and shadow control bits */
+ init_cpuid_fault(c);
+ probe_xeon_phi_r3mwait(c);
+
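+	/* Commit the accumulated shadow value so the MSR matches the cached state. */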
+ msr = this_cpu_read(msr_misc_features_shadow);
+ wrmsrq(MSR_MISC_FEATURES_ENABLES, msr);
}
-static void __cpuinit init_intel(struct cpuinfo_x86 *c)
-{
- unsigned int l2 = 0;
+/*
+ * This is a list of Intel CPUs that are known to suffer from downclocking when
+ * ZMM registers (512-bit vectors) are used. On these CPUs, when the kernel
+ * executes SIMD-optimized code such as cryptography functions or CRCs, it
+ * should prefer 256-bit (YMM) code to 512-bit (ZMM) code.
+ */
+static const struct x86_cpu_id zmm_exclusion_list[] = {
+ X86_MATCH_VFM(INTEL_SKYLAKE_X, 0),
+ X86_MATCH_VFM(INTEL_ICELAKE_X, 0),
+ X86_MATCH_VFM(INTEL_ICELAKE_D, 0),
+ X86_MATCH_VFM(INTEL_ICELAKE, 0),
+ X86_MATCH_VFM(INTEL_ICELAKE_L, 0),
+ X86_MATCH_VFM(INTEL_ICELAKE_NNPI, 0),
+ X86_MATCH_VFM(INTEL_TIGERLAKE_L, 0),
+ X86_MATCH_VFM(INTEL_TIGERLAKE, 0),
+ /* Allow Rocket Lake and later, and Sapphire Rapids and later. */
+ {},
+};
+static void init_intel(struct cpuinfo_x86 *c)
+{
early_init_intel(c);
intel_workarounds(c);
- /*
- * Detect the extended topology information if available. This
- * will reinitialise the initial_apicid which will be used
- * in init_intel_cacheinfo()
- */
- detect_extended_topology(c);
+ init_intel_cacheinfo(c);
- l2 = init_intel_cacheinfo(c);
if (c->cpuid_level > 9) {
unsigned eax = cpuid_eax(10);
/* Check for version and the number of counters */
@@ -372,25 +548,33 @@ static void __cpuinit init_intel(struct cpuinfo_x86 *c)
set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
}
- if (cpu_has_xmm2)
+ if (cpu_has(c, X86_FEATURE_XMM2))
set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
- if (cpu_has_ds) {
- unsigned int l1;
+
+ if (boot_cpu_has(X86_FEATURE_DS)) {
+ unsigned int l1, l2;
+
rdmsr(MSR_IA32_MISC_ENABLE, l1, l2);
- if (!(l1 & (1<<11)))
+ if (!(l1 & MSR_IA32_MISC_ENABLE_BTS_UNAVAIL))
set_cpu_cap(c, X86_FEATURE_BTS);
- if (!(l1 & (1<<12)))
+ if (!(l1 & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
set_cpu_cap(c, X86_FEATURE_PEBS);
}
- if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
- set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
+ if (boot_cpu_has(X86_FEATURE_CLFLUSH) &&
+ (c->x86_vfm == INTEL_CORE2_DUNNINGTON ||
+ c->x86_vfm == INTEL_NEHALEM_EX ||
+ c->x86_vfm == INTEL_WESTMERE_EX))
+ set_cpu_bug(c, X86_BUG_CLFLUSH_MONITOR);
+
+ if (boot_cpu_has(X86_FEATURE_MWAIT) &&
+ (c->x86_vfm == INTEL_ATOM_GOLDMONT ||
+ c->x86_vfm == INTEL_LUNARLAKE_M))
+ set_cpu_bug(c, X86_BUG_MONITOR);
#ifdef CONFIG_X86_64
if (c->x86 == 15)
c->x86_cache_alignment = c->x86_clflush_size * 2;
- if (c->x86 == 6)
- set_cpu_cap(c, X86_FEATURE_REP_GOOD);
#else
/*
* Names for the Pentium II/Celeron processors
@@ -398,22 +582,21 @@ static void __cpuinit init_intel(struct cpuinfo_x86 *c)
* Dixon is NOT a Celeron.
*/
if (c->x86 == 6) {
+ unsigned int l2 = c->x86_cache_size;
char *p = NULL;
switch (c->x86_model) {
case 5:
- if (c->x86_mask == 0) {
- if (l2 == 0)
- p = "Celeron (Covington)";
- else if (l2 == 256)
- p = "Mobile Pentium II (Dixon)";
- }
+ if (l2 == 0)
+ p = "Celeron (Covington)";
+ else if (l2 == 256)
+ p = "Mobile Pentium II (Dixon)";
break;
case 6:
if (l2 == 128)
p = "Celeron (Mendocino)";
- else if (c->x86_mask == 0 || c->x86_mask == 5)
+ else if (c->x86_stepping == 0 || c->x86_stepping == 5)
p = "Celeron-A";
break;
@@ -426,33 +609,25 @@ static void __cpuinit init_intel(struct cpuinfo_x86 *c)
if (p)
strcpy(c->x86_model_id, p);
}
-
- if (c->x86 == 15)
- set_cpu_cap(c, X86_FEATURE_P4);
- if (c->x86 == 6)
- set_cpu_cap(c, X86_FEATURE_P3);
#endif
- if (!cpu_has(c, X86_FEATURE_XTOPOLOGY)) {
- /*
- * let's use the legacy cpuid vector 0x1 and 0x4 for topology
- * detection.
- */
- c->x86_max_cores = intel_num_cpu_cores(c);
-#ifdef CONFIG_X86_32
- detect_ht(c);
-#endif
- }
+ if (x86_match_cpu(zmm_exclusion_list))
+ set_cpu_cap(c, X86_FEATURE_PREFER_YMM);
/* Work around errata */
srat_detect_node(c);
- if (cpu_has(c, X86_FEATURE_VMX))
- detect_vmx_virtcap(c);
+ init_ia32_feat_ctl(c);
+
+ init_intel_misc_features(c);
+
+ split_lock_init();
+
+ intel_init_thermal(c);
}
#ifdef CONFIG_X86_32
-static unsigned int __cpuinit intel_size_cache(struct cpuinfo_x86 *c, unsigned int size)
+static unsigned int intel_size_cache(struct cpuinfo_x86 *c, unsigned int size)
{
/*
* Intel PIII Tualatin. This comes in two flavours.
@@ -460,18 +635,98 @@ static unsigned int __cpuinit intel_size_cache(struct cpuinfo_x86 *c, unsigned i
* to determine which, so we use a boottime override
* for the 512kb model, and assume 256 otherwise.
*/
- if ((c->x86 == 6) && (c->x86_model == 11) && (size == 0))
+ if (c->x86_vfm == INTEL_PENTIUM_III_TUALATIN && size == 0)
size = 256;
+
+ /*
+ * Intel Quark SoC X1000 contains a 4-way set associative
+ * 16K cache with a 16 byte cache line and 256 lines per tag
+ */
+ if (c->x86_vfm == INTEL_QUARK_X1000)
+ size = 16;
return size;
}
#endif
-static const struct cpu_dev __cpuinitconst intel_cpu_dev = {
+static void intel_tlb_lookup(const struct leaf_0x2_table *desc)
+{
+ short entries = desc->entries;
+
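+	/* Keep the largest entry count any descriptor reports for each page size. */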
+ switch (desc->t_type) {
+ case STLB_4K:
+ tlb_lli_4k = max(tlb_lli_4k, entries);
+ tlb_lld_4k = max(tlb_lld_4k, entries);
+ break;
+ case STLB_4K_2M:
+ tlb_lli_4k = max(tlb_lli_4k, entries);
+ tlb_lld_4k = max(tlb_lld_4k, entries);
+ tlb_lli_2m = max(tlb_lli_2m, entries);
+ tlb_lld_2m = max(tlb_lld_2m, entries);
+ tlb_lli_4m = max(tlb_lli_4m, entries);
+ tlb_lld_4m = max(tlb_lld_4m, entries);
+ break;
+ case TLB_INST_ALL:
+ tlb_lli_4k = max(tlb_lli_4k, entries);
+ tlb_lli_2m = max(tlb_lli_2m, entries);
+ tlb_lli_4m = max(tlb_lli_4m, entries);
+ break;
+ case TLB_INST_4K:
+ tlb_lli_4k = max(tlb_lli_4k, entries);
+ break;
+ case TLB_INST_4M:
+ tlb_lli_4m = max(tlb_lli_4m, entries);
+ break;
+ case TLB_INST_2M_4M:
+ tlb_lli_2m = max(tlb_lli_2m, entries);
+ tlb_lli_4m = max(tlb_lli_4m, entries);
+ break;
+ case TLB_DATA_4K:
+ case TLB_DATA0_4K:
+ tlb_lld_4k = max(tlb_lld_4k, entries);
+ break;
+ case TLB_DATA_4M:
+ case TLB_DATA0_4M:
+ tlb_lld_4m = max(tlb_lld_4m, entries);
+ break;
+ case TLB_DATA_2M_4M:
+ case TLB_DATA0_2M_4M:
+ tlb_lld_2m = max(tlb_lld_2m, entries);
+ tlb_lld_4m = max(tlb_lld_4m, entries);
+ break;
+ case TLB_DATA_4K_4M:
+ tlb_lld_4k = max(tlb_lld_4k, entries);
+ tlb_lld_4m = max(tlb_lld_4m, entries);
+ break;
+ case TLB_DATA_1G_2M_4M:
+ tlb_lld_2m = max(tlb_lld_2m, TLB_0x63_2M_4M_ENTRIES);
+ tlb_lld_4m = max(tlb_lld_4m, TLB_0x63_2M_4M_ENTRIES);
+ fallthrough;
+ case TLB_DATA_1G:
+ tlb_lld_1g = max(tlb_lld_1g, entries);
+ break;
+ }
+}
+
+static void intel_detect_tlb(struct cpuinfo_x86 *c)
+{
+ const struct leaf_0x2_table *desc;
+ union leaf_0x2_regs regs;
+ u8 *ptr;
+
+ if (c->cpuid_level < 2)
+ return;
+
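+	/* Leaf 0x2 packs one-byte TLB/cache descriptors into EAX-EDX; translate each one. */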
+ cpuid_leaf_0x2(&regs);
+ for_each_cpuid_0x2_desc(regs, ptr, desc)
+ intel_tlb_lookup(desc);
+}
+
+static const struct cpu_dev intel_cpu_dev = {
.c_vendor = "Intel",
.c_ident = { "GenuineIntel" },
#ifdef CONFIG_X86_32
- .c_models = {
- { .vendor = X86_VENDOR_INTEL, .family = 4, .model_names =
+ .legacy_models = {
+ { .family = 4, .model_names =
{
[0] = "486 DX-25/33",
[1] = "486 DX-50",
@@ -484,7 +739,7 @@ static const struct cpu_dev __cpuinitconst intel_cpu_dev = {
[9] = "486 DX/4-WB"
}
},
- { .vendor = X86_VENDOR_INTEL, .family = 5, .model_names =
+ { .family = 5, .model_names =
{
[0] = "Pentium 60/66 A-step",
[1] = "Pentium 60/66",
@@ -492,10 +747,11 @@ static const struct cpu_dev __cpuinitconst intel_cpu_dev = {
[3] = "OverDrive PODP5V83",
[4] = "Pentium MMX",
[7] = "Mobile Pentium 75 - 200",
- [8] = "Mobile Pentium MMX"
+ [8] = "Mobile Pentium MMX",
+ [9] = "Quark SoC X1000",
}
},
- { .vendor = X86_VENDOR_INTEL, .family = 6, .model_names =
+ { .family = 6, .model_names =
{
[0] = "Pentium Pro A-step",
[1] = "Pentium Pro",
@@ -509,7 +765,7 @@ static const struct cpu_dev __cpuinitconst intel_cpu_dev = {
[11] = "Pentium III (Tualatin)",
}
},
- { .vendor = X86_VENDOR_INTEL, .family = 15, .model_names =
+ { .family = 15, .model_names =
{
[0] = "Pentium 4 (Unknown)",
[1] = "Pentium 4 (Willamette)",
@@ -519,12 +775,13 @@ static const struct cpu_dev __cpuinitconst intel_cpu_dev = {
}
},
},
- .c_size_cache = intel_size_cache,
+ .legacy_cache_size = intel_size_cache,
#endif
+ .c_detect_tlb = intel_detect_tlb,
.c_early_init = early_init_intel,
+ .c_bsp_init = bsp_init_intel,
.c_init = init_intel,
.c_x86_vendor = X86_VENDOR_INTEL,
};
cpu_dev_register(intel_cpu_dev);
-
diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c
deleted file mode 100644
index 33eae2062cf5..000000000000
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ /dev/null
@@ -1,1149 +0,0 @@
-/*
- * Routines to indentify caches on Intel CPU.
- *
- * Changes:
- * Venkatesh Pallipadi : Adding cache identification through cpuid(4)
- * Ashok Raj <ashok.raj@intel.com>: Work with CPU hotplug infrastructure.
- * Andi Kleen / Andreas Herrmann : CPUID4 emulation on AMD.
- */
-
-#include <linux/init.h>
-#include <linux/slab.h>
-#include <linux/device.h>
-#include <linux/compiler.h>
-#include <linux/cpu.h>
-#include <linux/sched.h>
-#include <linux/pci.h>
-
-#include <asm/processor.h>
-#include <linux/smp.h>
-#include <asm/k8.h>
-#include <asm/smp.h>
-
-#define LVL_1_INST 1
-#define LVL_1_DATA 2
-#define LVL_2 3
-#define LVL_3 4
-#define LVL_TRACE 5
-
-struct _cache_table {
- unsigned char descriptor;
- char cache_type;
- short size;
-};
-
-#define MB(x) ((x) * 1024)
-
-/* All the cache descriptor types we care about (no TLB or
- trace cache entries) */
-
-static const struct _cache_table __cpuinitconst cache_table[] =
-{
- { 0x06, LVL_1_INST, 8 }, /* 4-way set assoc, 32 byte line size */
- { 0x08, LVL_1_INST, 16 }, /* 4-way set assoc, 32 byte line size */
- { 0x09, LVL_1_INST, 32 }, /* 4-way set assoc, 64 byte line size */
- { 0x0a, LVL_1_DATA, 8 }, /* 2 way set assoc, 32 byte line size */
- { 0x0c, LVL_1_DATA, 16 }, /* 4-way set assoc, 32 byte line size */
- { 0x0d, LVL_1_DATA, 16 }, /* 4-way set assoc, 64 byte line size */
- { 0x21, LVL_2, 256 }, /* 8-way set assoc, 64 byte line size */
- { 0x22, LVL_3, 512 }, /* 4-way set assoc, sectored cache, 64 byte line size */
- { 0x23, LVL_3, MB(1) }, /* 8-way set assoc, sectored cache, 64 byte line size */
- { 0x25, LVL_3, MB(2) }, /* 8-way set assoc, sectored cache, 64 byte line size */
- { 0x29, LVL_3, MB(4) }, /* 8-way set assoc, sectored cache, 64 byte line size */
- { 0x2c, LVL_1_DATA, 32 }, /* 8-way set assoc, 64 byte line size */
- { 0x30, LVL_1_INST, 32 }, /* 8-way set assoc, 64 byte line size */
- { 0x39, LVL_2, 128 }, /* 4-way set assoc, sectored cache, 64 byte line size */
- { 0x3a, LVL_2, 192 }, /* 6-way set assoc, sectored cache, 64 byte line size */
- { 0x3b, LVL_2, 128 }, /* 2-way set assoc, sectored cache, 64 byte line size */
- { 0x3c, LVL_2, 256 }, /* 4-way set assoc, sectored cache, 64 byte line size */
- { 0x3d, LVL_2, 384 }, /* 6-way set assoc, sectored cache, 64 byte line size */
- { 0x3e, LVL_2, 512 }, /* 4-way set assoc, sectored cache, 64 byte line size */
- { 0x3f, LVL_2, 256 }, /* 2-way set assoc, 64 byte line size */
- { 0x41, LVL_2, 128 }, /* 4-way set assoc, 32 byte line size */
- { 0x42, LVL_2, 256 }, /* 4-way set assoc, 32 byte line size */
- { 0x43, LVL_2, 512 }, /* 4-way set assoc, 32 byte line size */
- { 0x44, LVL_2, MB(1) }, /* 4-way set assoc, 32 byte line size */
- { 0x45, LVL_2, MB(2) }, /* 4-way set assoc, 32 byte line size */
- { 0x46, LVL_3, MB(4) }, /* 4-way set assoc, 64 byte line size */
- { 0x47, LVL_3, MB(8) }, /* 8-way set assoc, 64 byte line size */
- { 0x49, LVL_3, MB(4) }, /* 16-way set assoc, 64 byte line size */
- { 0x4a, LVL_3, MB(6) }, /* 12-way set assoc, 64 byte line size */
- { 0x4b, LVL_3, MB(8) }, /* 16-way set assoc, 64 byte line size */
- { 0x4c, LVL_3, MB(12) }, /* 12-way set assoc, 64 byte line size */
- { 0x4d, LVL_3, MB(16) }, /* 16-way set assoc, 64 byte line size */
- { 0x4e, LVL_2, MB(6) }, /* 24-way set assoc, 64 byte line size */
- { 0x60, LVL_1_DATA, 16 }, /* 8-way set assoc, sectored cache, 64 byte line size */
- { 0x66, LVL_1_DATA, 8 }, /* 4-way set assoc, sectored cache, 64 byte line size */
- { 0x67, LVL_1_DATA, 16 }, /* 4-way set assoc, sectored cache, 64 byte line size */
- { 0x68, LVL_1_DATA, 32 }, /* 4-way set assoc, sectored cache, 64 byte line size */
- { 0x70, LVL_TRACE, 12 }, /* 8-way set assoc */
- { 0x71, LVL_TRACE, 16 }, /* 8-way set assoc */
- { 0x72, LVL_TRACE, 32 }, /* 8-way set assoc */
- { 0x73, LVL_TRACE, 64 }, /* 8-way set assoc */
- { 0x78, LVL_2, MB(1) }, /* 4-way set assoc, 64 byte line size */
- { 0x79, LVL_2, 128 }, /* 8-way set assoc, sectored cache, 64 byte line size */
- { 0x7a, LVL_2, 256 }, /* 8-way set assoc, sectored cache, 64 byte line size */
- { 0x7b, LVL_2, 512 }, /* 8-way set assoc, sectored cache, 64 byte line size */
- { 0x7c, LVL_2, MB(1) }, /* 8-way set assoc, sectored cache, 64 byte line size */
- { 0x7d, LVL_2, MB(2) }, /* 8-way set assoc, 64 byte line size */
- { 0x7f, LVL_2, 512 }, /* 2-way set assoc, 64 byte line size */
- { 0x82, LVL_2, 256 }, /* 8-way set assoc, 32 byte line size */
- { 0x83, LVL_2, 512 }, /* 8-way set assoc, 32 byte line size */
- { 0x84, LVL_2, MB(1) }, /* 8-way set assoc, 32 byte line size */
- { 0x85, LVL_2, MB(2) }, /* 8-way set assoc, 32 byte line size */
- { 0x86, LVL_2, 512 }, /* 4-way set assoc, 64 byte line size */
- { 0x87, LVL_2, MB(1) }, /* 8-way set assoc, 64 byte line size */
- { 0xd0, LVL_3, 512 }, /* 4-way set assoc, 64 byte line size */
- { 0xd1, LVL_3, MB(1) }, /* 4-way set assoc, 64 byte line size */
- { 0xd2, LVL_3, MB(2) }, /* 4-way set assoc, 64 byte line size */
- { 0xd6, LVL_3, MB(1) }, /* 8-way set assoc, 64 byte line size */
- { 0xd7, LVL_3, MB(2) }, /* 8-way set assoc, 64 byte line size */
- { 0xd8, LVL_3, MB(4) }, /* 12-way set assoc, 64 byte line size */
- { 0xdc, LVL_3, MB(2) }, /* 12-way set assoc, 64 byte line size */
- { 0xdd, LVL_3, MB(4) }, /* 12-way set assoc, 64 byte line size */
- { 0xde, LVL_3, MB(8) }, /* 12-way set assoc, 64 byte line size */
- { 0xe2, LVL_3, MB(2) }, /* 16-way set assoc, 64 byte line size */
- { 0xe3, LVL_3, MB(4) }, /* 16-way set assoc, 64 byte line size */
- { 0xe4, LVL_3, MB(8) }, /* 16-way set assoc, 64 byte line size */
- { 0xea, LVL_3, MB(12) }, /* 24-way set assoc, 64 byte line size */
- { 0xeb, LVL_3, MB(18) }, /* 24-way set assoc, 64 byte line size */
- { 0xec, LVL_3, MB(24) }, /* 24-way set assoc, 64 byte line size */
- { 0x00, 0, 0}
-};
-
-
-enum _cache_type {
- CACHE_TYPE_NULL = 0,
- CACHE_TYPE_DATA = 1,
- CACHE_TYPE_INST = 2,
- CACHE_TYPE_UNIFIED = 3
-};
-
-union _cpuid4_leaf_eax {
- struct {
- enum _cache_type type:5;
- unsigned int level:3;
- unsigned int is_self_initializing:1;
- unsigned int is_fully_associative:1;
- unsigned int reserved:4;
- unsigned int num_threads_sharing:12;
- unsigned int num_cores_on_die:6;
- } split;
- u32 full;
-};
-
-union _cpuid4_leaf_ebx {
- struct {
- unsigned int coherency_line_size:12;
- unsigned int physical_line_partition:10;
- unsigned int ways_of_associativity:10;
- } split;
- u32 full;
-};
-
-union _cpuid4_leaf_ecx {
- struct {
- unsigned int number_of_sets:32;
- } split;
- u32 full;
-};
-
-struct amd_l3_cache {
- struct pci_dev *dev;
- bool can_disable;
- unsigned indices;
- u8 subcaches[4];
-};
-
-struct _cpuid4_info {
- union _cpuid4_leaf_eax eax;
- union _cpuid4_leaf_ebx ebx;
- union _cpuid4_leaf_ecx ecx;
- unsigned long size;
- struct amd_l3_cache *l3;
- DECLARE_BITMAP(shared_cpu_map, NR_CPUS);
-};
-
-/* subset of above _cpuid4_info w/o shared_cpu_map */
-struct _cpuid4_info_regs {
- union _cpuid4_leaf_eax eax;
- union _cpuid4_leaf_ebx ebx;
- union _cpuid4_leaf_ecx ecx;
- unsigned long size;
- struct amd_l3_cache *l3;
-};
-
-unsigned short num_cache_leaves;
-
-/* AMD doesn't have CPUID4. Emulate it here to report the same
- information to the user. This makes some assumptions about the machine:
- L2 not shared, no SMT etc. that is currently true on AMD CPUs.
-
- In theory the TLBs could be reported as fake type (they are in "dummy").
- Maybe later */
-union l1_cache {
- struct {
- unsigned line_size:8;
- unsigned lines_per_tag:8;
- unsigned assoc:8;
- unsigned size_in_kb:8;
- };
- unsigned val;
-};
-
-union l2_cache {
- struct {
- unsigned line_size:8;
- unsigned lines_per_tag:4;
- unsigned assoc:4;
- unsigned size_in_kb:16;
- };
- unsigned val;
-};
-
-union l3_cache {
- struct {
- unsigned line_size:8;
- unsigned lines_per_tag:4;
- unsigned assoc:4;
- unsigned res:2;
- unsigned size_encoded:14;
- };
- unsigned val;
-};
-
-static const unsigned short __cpuinitconst assocs[] = {
- [1] = 1,
- [2] = 2,
- [4] = 4,
- [6] = 8,
- [8] = 16,
- [0xa] = 32,
- [0xb] = 48,
- [0xc] = 64,
- [0xd] = 96,
- [0xe] = 128,
- [0xf] = 0xffff /* fully associative - no way to show this currently */
-};
-
-static const unsigned char __cpuinitconst levels[] = { 1, 1, 2, 3 };
-static const unsigned char __cpuinitconst types[] = { 1, 2, 3, 3 };
-
-static void __cpuinit
-amd_cpuid4(int leaf, union _cpuid4_leaf_eax *eax,
- union _cpuid4_leaf_ebx *ebx,
- union _cpuid4_leaf_ecx *ecx)
-{
- unsigned dummy;
- unsigned line_size, lines_per_tag, assoc, size_in_kb;
- union l1_cache l1i, l1d;
- union l2_cache l2;
- union l3_cache l3;
- union l1_cache *l1 = &l1d;
-
- eax->full = 0;
- ebx->full = 0;
- ecx->full = 0;
-
- cpuid(0x80000005, &dummy, &dummy, &l1d.val, &l1i.val);
- cpuid(0x80000006, &dummy, &dummy, &l2.val, &l3.val);
-
- switch (leaf) {
- case 1:
- l1 = &l1i;
- case 0:
- if (!l1->val)
- return;
- assoc = assocs[l1->assoc];
- line_size = l1->line_size;
- lines_per_tag = l1->lines_per_tag;
- size_in_kb = l1->size_in_kb;
- break;
- case 2:
- if (!l2.val)
- return;
- assoc = assocs[l2.assoc];
- line_size = l2.line_size;
- lines_per_tag = l2.lines_per_tag;
- /* cpu_data has errata corrections for K7 applied */
- size_in_kb = current_cpu_data.x86_cache_size;
- break;
- case 3:
- if (!l3.val)
- return;
- assoc = assocs[l3.assoc];
- line_size = l3.line_size;
- lines_per_tag = l3.lines_per_tag;
- size_in_kb = l3.size_encoded * 512;
- if (boot_cpu_has(X86_FEATURE_AMD_DCM)) {
- size_in_kb = size_in_kb >> 1;
- assoc = assoc >> 1;
- }
- break;
- default:
- return;
- }
-
- eax->split.is_self_initializing = 1;
- eax->split.type = types[leaf];
- eax->split.level = levels[leaf];
- eax->split.num_threads_sharing = 0;
- eax->split.num_cores_on_die = current_cpu_data.x86_max_cores - 1;
-
-
- if (assoc == 0xffff)
- eax->split.is_fully_associative = 1;
- ebx->split.coherency_line_size = line_size - 1;
- ebx->split.ways_of_associativity = assoc - 1;
- ebx->split.physical_line_partition = lines_per_tag - 1;
- ecx->split.number_of_sets = (size_in_kb * 1024) / line_size /
- (ebx->split.ways_of_associativity + 1) - 1;
-}
-
-struct _cache_attr {
- struct attribute attr;
- ssize_t (*show)(struct _cpuid4_info *, char *);
- ssize_t (*store)(struct _cpuid4_info *, const char *, size_t count);
-};
-
-#ifdef CONFIG_CPU_SUP_AMD
-
-/*
- * L3 cache descriptors
- */
-static struct amd_l3_cache **__cpuinitdata l3_caches;
-
-static void __cpuinit amd_calc_l3_indices(struct amd_l3_cache *l3)
-{
- unsigned int sc0, sc1, sc2, sc3;
- u32 val = 0;
-
- pci_read_config_dword(l3->dev, 0x1C4, &val);
-
- /* calculate subcache sizes */
- l3->subcaches[0] = sc0 = !(val & BIT(0));
- l3->subcaches[1] = sc1 = !(val & BIT(4));
- l3->subcaches[2] = sc2 = !(val & BIT(8)) + !(val & BIT(9));
- l3->subcaches[3] = sc3 = !(val & BIT(12)) + !(val & BIT(13));
-
- l3->indices = (max(max(max(sc0, sc1), sc2), sc3) << 10) - 1;
-}
-
-static struct amd_l3_cache * __cpuinit amd_init_l3_cache(int node)
-{
- struct amd_l3_cache *l3;
- struct pci_dev *dev = node_to_k8_nb_misc(node);
-
- l3 = kzalloc(sizeof(struct amd_l3_cache), GFP_ATOMIC);
- if (!l3) {
- printk(KERN_WARNING "Error allocating L3 struct\n");
- return NULL;
- }
-
- l3->dev = dev;
-
- amd_calc_l3_indices(l3);
-
- return l3;
-}
-
-static void __cpuinit
-amd_check_l3_disable(int index, struct _cpuid4_info_regs *this_leaf)
-{
- int node;
-
- if (boot_cpu_data.x86 != 0x10)
- return;
-
- if (index < 3)
- return;
-
- /* see errata #382 and #388 */
- if (boot_cpu_data.x86_model < 0x8)
- return;
-
- if ((boot_cpu_data.x86_model == 0x8 ||
- boot_cpu_data.x86_model == 0x9)
- &&
- boot_cpu_data.x86_mask < 0x1)
- return;
-
- /* not in virtualized environments */
- if (num_k8_northbridges == 0)
- return;
-
- /*
- * Strictly speaking, the amount in @size below is leaked since it is
- * never freed but this is done only on shutdown so it doesn't matter.
- */
- if (!l3_caches) {
- int size = num_k8_northbridges * sizeof(struct amd_l3_cache *);
-
- l3_caches = kzalloc(size, GFP_ATOMIC);
- if (!l3_caches)
- return;
- }
-
- node = amd_get_nb_id(smp_processor_id());
-
- if (!l3_caches[node]) {
- l3_caches[node] = amd_init_l3_cache(node);
- l3_caches[node]->can_disable = true;
- }
-
- WARN_ON(!l3_caches[node]);
-
- this_leaf->l3 = l3_caches[node];
-}
-
-static ssize_t show_cache_disable(struct _cpuid4_info *this_leaf, char *buf,
- unsigned int slot)
-{
- struct pci_dev *dev = this_leaf->l3->dev;
- unsigned int reg = 0;
-
- if (!this_leaf->l3 || !this_leaf->l3->can_disable)
- return -EINVAL;
-
- if (!dev)
- return -EINVAL;
-
- pci_read_config_dword(dev, 0x1BC + slot * 4, &reg);
- return sprintf(buf, "0x%08x\n", reg);
-}
-
-#define SHOW_CACHE_DISABLE(slot) \
-static ssize_t \
-show_cache_disable_##slot(struct _cpuid4_info *this_leaf, char *buf) \
-{ \
- return show_cache_disable(this_leaf, buf, slot); \
-}
-SHOW_CACHE_DISABLE(0)
-SHOW_CACHE_DISABLE(1)
-
-static void amd_l3_disable_index(struct amd_l3_cache *l3, int cpu,
- unsigned slot, unsigned long idx)
-{
- int i;
-
- idx |= BIT(30);
-
- /*
- * disable index in all 4 subcaches
- */
- for (i = 0; i < 4; i++) {
- u32 reg = idx | (i << 20);
-
- if (!l3->subcaches[i])
- continue;
-
- pci_write_config_dword(l3->dev, 0x1BC + slot * 4, reg);
-
- /*
- * We need to WBINVD on a core on the node containing the L3
- * cache which indices we disable therefore a simple wbinvd()
- * is not sufficient.
- */
- wbinvd_on_cpu(cpu);
-
- reg |= BIT(31);
- pci_write_config_dword(l3->dev, 0x1BC + slot * 4, reg);
- }
-}
-
-
-static ssize_t store_cache_disable(struct _cpuid4_info *this_leaf,
- const char *buf, size_t count,
- unsigned int slot)
-{
- struct pci_dev *dev = this_leaf->l3->dev;
- int cpu = cpumask_first(to_cpumask(this_leaf->shared_cpu_map));
- unsigned long val = 0;
-
-#define SUBCACHE_MASK (3UL << 20)
-#define SUBCACHE_INDEX 0xfff
-
- if (!this_leaf->l3 || !this_leaf->l3->can_disable)
- return -EINVAL;
-
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
-
- if (!dev)
- return -EINVAL;
-
- if (strict_strtoul(buf, 10, &val) < 0)
- return -EINVAL;
-
- /* do not allow writes outside of allowed bits */
- if ((val & ~(SUBCACHE_MASK | SUBCACHE_INDEX)) ||
- ((val & SUBCACHE_INDEX) > this_leaf->l3->indices))
- return -EINVAL;
-
- amd_l3_disable_index(this_leaf->l3, cpu, slot, val);
-
- return count;
-}
-
-#define STORE_CACHE_DISABLE(slot) \
-static ssize_t \
-store_cache_disable_##slot(struct _cpuid4_info *this_leaf, \
- const char *buf, size_t count) \
-{ \
- return store_cache_disable(this_leaf, buf, count, slot); \
-}
-STORE_CACHE_DISABLE(0)
-STORE_CACHE_DISABLE(1)
-
-static struct _cache_attr cache_disable_0 = __ATTR(cache_disable_0, 0644,
- show_cache_disable_0, store_cache_disable_0);
-static struct _cache_attr cache_disable_1 = __ATTR(cache_disable_1, 0644,
- show_cache_disable_1, store_cache_disable_1);
-
-#else /* CONFIG_CPU_SUP_AMD */
-static void __cpuinit
-amd_check_l3_disable(int index, struct _cpuid4_info_regs *this_leaf)
-{
-};
-#endif /* CONFIG_CPU_SUP_AMD */
-
-static int
-__cpuinit cpuid4_cache_lookup_regs(int index,
- struct _cpuid4_info_regs *this_leaf)
-{
- union _cpuid4_leaf_eax eax;
- union _cpuid4_leaf_ebx ebx;
- union _cpuid4_leaf_ecx ecx;
- unsigned edx;
-
- if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
- amd_cpuid4(index, &eax, &ebx, &ecx);
- amd_check_l3_disable(index, this_leaf);
- } else {
- cpuid_count(4, index, &eax.full, &ebx.full, &ecx.full, &edx);
- }
-
- if (eax.split.type == CACHE_TYPE_NULL)
- return -EIO; /* better error ? */
-
- this_leaf->eax = eax;
- this_leaf->ebx = ebx;
- this_leaf->ecx = ecx;
- this_leaf->size = (ecx.split.number_of_sets + 1) *
- (ebx.split.coherency_line_size + 1) *
- (ebx.split.physical_line_partition + 1) *
- (ebx.split.ways_of_associativity + 1);
- return 0;
-}
-
-static int __cpuinit find_num_cache_leaves(void)
-{
- unsigned int eax, ebx, ecx, edx;
- union _cpuid4_leaf_eax cache_eax;
- int i = -1;
-
- do {
- ++i;
- /* Do cpuid(4) loop to find out num_cache_leaves */
- cpuid_count(4, i, &eax, &ebx, &ecx, &edx);
- cache_eax.full = eax;
- } while (cache_eax.split.type != CACHE_TYPE_NULL);
- return i;
-}
-
-unsigned int __cpuinit init_intel_cacheinfo(struct cpuinfo_x86 *c)
-{
- /* Cache sizes */
- unsigned int trace = 0, l1i = 0, l1d = 0, l2 = 0, l3 = 0;
- unsigned int new_l1d = 0, new_l1i = 0; /* Cache sizes from cpuid(4) */
- unsigned int new_l2 = 0, new_l3 = 0, i; /* Cache sizes from cpuid(4) */
- unsigned int l2_id = 0, l3_id = 0, num_threads_sharing, index_msb;
-#ifdef CONFIG_X86_HT
- unsigned int cpu = c->cpu_index;
-#endif
-
- if (c->cpuid_level > 3) {
- static int is_initialized;
-
- if (is_initialized == 0) {
- /* Init num_cache_leaves from boot CPU */
- num_cache_leaves = find_num_cache_leaves();
- is_initialized++;
- }
-
- /*
- * Whenever possible use cpuid(4), deterministic cache
- * parameters cpuid leaf to find the cache details
- */
- for (i = 0; i < num_cache_leaves; i++) {
- struct _cpuid4_info_regs this_leaf;
- int retval;
-
- retval = cpuid4_cache_lookup_regs(i, &this_leaf);
- if (retval >= 0) {
- switch (this_leaf.eax.split.level) {
- case 1:
- if (this_leaf.eax.split.type ==
- CACHE_TYPE_DATA)
- new_l1d = this_leaf.size/1024;
- else if (this_leaf.eax.split.type ==
- CACHE_TYPE_INST)
- new_l1i = this_leaf.size/1024;
- break;
- case 2:
- new_l2 = this_leaf.size/1024;
- num_threads_sharing = 1 + this_leaf.eax.split.num_threads_sharing;
- index_msb = get_count_order(num_threads_sharing);
- l2_id = c->apicid >> index_msb;
- break;
- case 3:
- new_l3 = this_leaf.size/1024;
- num_threads_sharing = 1 + this_leaf.eax.split.num_threads_sharing;
- index_msb = get_count_order(
- num_threads_sharing);
- l3_id = c->apicid >> index_msb;
- break;
- default:
- break;
- }
- }
- }
- }
- /*
- * Don't use cpuid2 if cpuid4 is supported. For P4, we use cpuid2 for
- * trace cache
- */
- if ((num_cache_leaves == 0 || c->x86 == 15) && c->cpuid_level > 1) {
- /* supports eax=2 call */
- int j, n;
- unsigned int regs[4];
- unsigned char *dp = (unsigned char *)regs;
- int only_trace = 0;
-
- if (num_cache_leaves != 0 && c->x86 == 15)
- only_trace = 1;
-
- /* Number of times to iterate */
- n = cpuid_eax(2) & 0xFF;
-
- for (i = 0 ; i < n ; i++) {
- cpuid(2, &regs[0], &regs[1], &regs[2], &regs[3]);
-
- /* If bit 31 is set, this is an unknown format */
- for (j = 0 ; j < 3 ; j++)
- if (regs[j] & (1 << 31))
- regs[j] = 0;
-
- /* Byte 0 is level count, not a descriptor */
- for (j = 1 ; j < 16 ; j++) {
- unsigned char des = dp[j];
- unsigned char k = 0;
-
- /* look up this descriptor in the table */
- while (cache_table[k].descriptor != 0) {
- if (cache_table[k].descriptor == des) {
- if (only_trace && cache_table[k].cache_type != LVL_TRACE)
- break;
- switch (cache_table[k].cache_type) {
- case LVL_1_INST:
- l1i += cache_table[k].size;
- break;
- case LVL_1_DATA:
- l1d += cache_table[k].size;
- break;
- case LVL_2:
- l2 += cache_table[k].size;
- break;
- case LVL_3:
- l3 += cache_table[k].size;
- break;
- case LVL_TRACE:
- trace += cache_table[k].size;
- break;
- }
-
- break;
- }
-
- k++;
- }
- }
- }
- }
-
- if (new_l1d)
- l1d = new_l1d;
-
- if (new_l1i)
- l1i = new_l1i;
-
- if (new_l2) {
- l2 = new_l2;
-#ifdef CONFIG_X86_HT
- per_cpu(cpu_llc_id, cpu) = l2_id;
-#endif
- }
-
- if (new_l3) {
- l3 = new_l3;
-#ifdef CONFIG_X86_HT
- per_cpu(cpu_llc_id, cpu) = l3_id;
-#endif
- }
-
- c->x86_cache_size = l3 ? l3 : (l2 ? l2 : (l1i+l1d));
-
- return l2;
-}
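
The l2_id/l3_id values above come from shifting off the low APIC-ID bits that enumerate threads inside a sharing domain, so all sharers collapse to one ID. A small sketch with made-up APIC IDs:

    #include <assert.h>

    int main(void)
    {
        /* Two SMT siblings sharing an L2: get_count_order(2) == 1. */
        unsigned int index_msb = 1;

        assert((6 >> index_msb) == (7 >> index_msb));  /* same L2 domain */
        assert((8 >> index_msb) != (6 >> index_msb));  /* different domain */
        return 0;
    }
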
-
-#ifdef CONFIG_SYSFS
-
-/* pointer to _cpuid4_info array (for each cache leaf) */
-static DEFINE_PER_CPU(struct _cpuid4_info *, ici_cpuid4_info);
-#define CPUID4_INFO_IDX(x, y) (&((per_cpu(ici_cpuid4_info, x))[y]))
-
-#ifdef CONFIG_SMP
-static void __cpuinit cache_shared_cpu_map_setup(unsigned int cpu, int index)
-{
- struct _cpuid4_info *this_leaf, *sibling_leaf;
- unsigned long num_threads_sharing;
- int index_msb, i, sibling;
- struct cpuinfo_x86 *c = &cpu_data(cpu);
-
- if ((index == 3) && (c->x86_vendor == X86_VENDOR_AMD)) {
- for_each_cpu(i, c->llc_shared_map) {
- if (!per_cpu(ici_cpuid4_info, i))
- continue;
- this_leaf = CPUID4_INFO_IDX(i, index);
- for_each_cpu(sibling, c->llc_shared_map) {
- if (!cpu_online(sibling))
- continue;
- set_bit(sibling, this_leaf->shared_cpu_map);
- }
- }
- return;
- }
- this_leaf = CPUID4_INFO_IDX(cpu, index);
- num_threads_sharing = 1 + this_leaf->eax.split.num_threads_sharing;
-
- if (num_threads_sharing == 1)
- cpumask_set_cpu(cpu, to_cpumask(this_leaf->shared_cpu_map));
- else {
- index_msb = get_count_order(num_threads_sharing);
-
- for_each_online_cpu(i) {
- if (cpu_data(i).apicid >> index_msb ==
- c->apicid >> index_msb) {
- cpumask_set_cpu(i,
- to_cpumask(this_leaf->shared_cpu_map));
- if (i != cpu && per_cpu(ici_cpuid4_info, i)) {
- sibling_leaf =
- CPUID4_INFO_IDX(i, index);
- cpumask_set_cpu(cpu, to_cpumask(
- sibling_leaf->shared_cpu_map));
- }
- }
- }
- }
-}
-static void __cpuinit cache_remove_shared_cpu_map(unsigned int cpu, int index)
-{
- struct _cpuid4_info *this_leaf, *sibling_leaf;
- int sibling;
-
- this_leaf = CPUID4_INFO_IDX(cpu, index);
- for_each_cpu(sibling, to_cpumask(this_leaf->shared_cpu_map)) {
- sibling_leaf = CPUID4_INFO_IDX(sibling, index);
- cpumask_clear_cpu(cpu,
- to_cpumask(sibling_leaf->shared_cpu_map));
- }
-}
-#else
-static void __cpuinit cache_shared_cpu_map_setup(unsigned int cpu, int index)
-{
-}
-
-static void __cpuinit cache_remove_shared_cpu_map(unsigned int cpu, int index)
-{
-}
-#endif
-
-static void __cpuinit free_cache_attributes(unsigned int cpu)
-{
- int i;
-
- for (i = 0; i < num_cache_leaves; i++)
- cache_remove_shared_cpu_map(cpu, i);
-
- kfree(per_cpu(ici_cpuid4_info, cpu)->l3);
- kfree(per_cpu(ici_cpuid4_info, cpu));
- per_cpu(ici_cpuid4_info, cpu) = NULL;
-}
-
-static int
-__cpuinit cpuid4_cache_lookup(int index, struct _cpuid4_info *this_leaf)
-{
- struct _cpuid4_info_regs *leaf_regs =
- (struct _cpuid4_info_regs *)this_leaf;
-
- return cpuid4_cache_lookup_regs(index, leaf_regs);
-}
-
-static void __cpuinit get_cpu_leaves(void *_retval)
-{
- int j, *retval = _retval, cpu = smp_processor_id();
-
- /* Do cpuid and store the results */
- for (j = 0; j < num_cache_leaves; j++) {
- struct _cpuid4_info *this_leaf;
- this_leaf = CPUID4_INFO_IDX(cpu, j);
- *retval = cpuid4_cache_lookup(j, this_leaf);
- if (unlikely(*retval < 0)) {
- int i;
-
- for (i = 0; i < j; i++)
- cache_remove_shared_cpu_map(cpu, i);
- break;
- }
- cache_shared_cpu_map_setup(cpu, j);
- }
-}
-
-static int __cpuinit detect_cache_attributes(unsigned int cpu)
-{
- int retval;
-
- if (num_cache_leaves == 0)
- return -ENOENT;
-
- per_cpu(ici_cpuid4_info, cpu) = kzalloc(
- sizeof(struct _cpuid4_info) * num_cache_leaves, GFP_KERNEL);
- if (per_cpu(ici_cpuid4_info, cpu) == NULL)
- return -ENOMEM;
-
- smp_call_function_single(cpu, get_cpu_leaves, &retval, true);
- if (retval) {
- kfree(per_cpu(ici_cpuid4_info, cpu));
- per_cpu(ici_cpuid4_info, cpu) = NULL;
- }
-
- return retval;
-}
-
-#include <linux/kobject.h>
-#include <linux/sysfs.h>
-
-extern struct sysdev_class cpu_sysdev_class; /* from drivers/base/cpu.c */
-
-/* pointer to kobject for cpuX/cache */
-static DEFINE_PER_CPU(struct kobject *, ici_cache_kobject);
-
-struct _index_kobject {
- struct kobject kobj;
- unsigned int cpu;
- unsigned short index;
-};
-
-/* pointer to array of kobjects for cpuX/cache/indexY */
-static DEFINE_PER_CPU(struct _index_kobject *, ici_index_kobject);
-#define INDEX_KOBJECT_PTR(x, y) (&((per_cpu(ici_index_kobject, x))[y]))
-
-#define show_one_plus(file_name, object, val) \
-static ssize_t show_##file_name \
- (struct _cpuid4_info *this_leaf, char *buf) \
-{ \
- return sprintf(buf, "%lu\n", (unsigned long)this_leaf->object + val); \
-}
-
-show_one_plus(level, eax.split.level, 0);
-show_one_plus(coherency_line_size, ebx.split.coherency_line_size, 1);
-show_one_plus(physical_line_partition, ebx.split.physical_line_partition, 1);
-show_one_plus(ways_of_associativity, ebx.split.ways_of_associativity, 1);
-show_one_plus(number_of_sets, ecx.split.number_of_sets, 1);
-
-static ssize_t show_size(struct _cpuid4_info *this_leaf, char *buf)
-{
- return sprintf(buf, "%luK\n", this_leaf->size / 1024);
-}
-
-static ssize_t show_shared_cpu_map_func(struct _cpuid4_info *this_leaf,
- int type, char *buf)
-{
- ptrdiff_t len = PTR_ALIGN(buf + PAGE_SIZE - 1, PAGE_SIZE) - buf;
- int n = 0;
-
- if (len > 1) {
- const struct cpumask *mask;
-
- mask = to_cpumask(this_leaf->shared_cpu_map);
- n = type ?
- cpulist_scnprintf(buf, len-2, mask) :
- cpumask_scnprintf(buf, len-2, mask);
- buf[n++] = '\n';
- buf[n] = '\0';
- }
- return n;
-}
-
-static inline ssize_t show_shared_cpu_map(struct _cpuid4_info *leaf, char *buf)
-{
- return show_shared_cpu_map_func(leaf, 0, buf);
-}
-
-static inline ssize_t show_shared_cpu_list(struct _cpuid4_info *leaf, char *buf)
-{
- return show_shared_cpu_map_func(leaf, 1, buf);
-}
-
-static ssize_t show_type(struct _cpuid4_info *this_leaf, char *buf)
-{
- switch (this_leaf->eax.split.type) {
- case CACHE_TYPE_DATA:
- return sprintf(buf, "Data\n");
- case CACHE_TYPE_INST:
- return sprintf(buf, "Instruction\n");
- case CACHE_TYPE_UNIFIED:
- return sprintf(buf, "Unified\n");
- default:
- return sprintf(buf, "Unknown\n");
- }
-}
-
-#define to_object(k) container_of(k, struct _index_kobject, kobj)
-#define to_attr(a) container_of(a, struct _cache_attr, attr)
-
-#define define_one_ro(_name) \
-static struct _cache_attr _name = \
- __ATTR(_name, 0444, show_##_name, NULL)
-
-define_one_ro(level);
-define_one_ro(type);
-define_one_ro(coherency_line_size);
-define_one_ro(physical_line_partition);
-define_one_ro(ways_of_associativity);
-define_one_ro(number_of_sets);
-define_one_ro(size);
-define_one_ro(shared_cpu_map);
-define_one_ro(shared_cpu_list);
-
-#define DEFAULT_SYSFS_CACHE_ATTRS \
- &type.attr, \
- &level.attr, \
- &coherency_line_size.attr, \
- &physical_line_partition.attr, \
- &ways_of_associativity.attr, \
- &number_of_sets.attr, \
- &size.attr, \
- &shared_cpu_map.attr, \
- &shared_cpu_list.attr
-
-static struct attribute *default_attrs[] = {
- DEFAULT_SYSFS_CACHE_ATTRS,
- NULL
-};
-
-static struct attribute *default_l3_attrs[] = {
- DEFAULT_SYSFS_CACHE_ATTRS,
-#ifdef CONFIG_CPU_SUP_AMD
- &cache_disable_0.attr,
- &cache_disable_1.attr,
-#endif
- NULL
-};
-
-static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
-{
- struct _cache_attr *fattr = to_attr(attr);
- struct _index_kobject *this_leaf = to_object(kobj);
- ssize_t ret;
-
- ret = fattr->show ?
- fattr->show(CPUID4_INFO_IDX(this_leaf->cpu, this_leaf->index),
- buf) :
- 0;
- return ret;
-}
-
-static ssize_t store(struct kobject *kobj, struct attribute *attr,
- const char *buf, size_t count)
-{
- struct _cache_attr *fattr = to_attr(attr);
- struct _index_kobject *this_leaf = to_object(kobj);
- ssize_t ret;
-
- ret = fattr->store ?
- fattr->store(CPUID4_INFO_IDX(this_leaf->cpu, this_leaf->index),
- buf, count) :
- 0;
- return ret;
-}
-
-static const struct sysfs_ops sysfs_ops = {
- .show = show,
- .store = store,
-};
-
-static struct kobj_type ktype_cache = {
- .sysfs_ops = &sysfs_ops,
- .default_attrs = default_attrs,
-};
-
-static struct kobj_type ktype_percpu_entry = {
- .sysfs_ops = &sysfs_ops,
-};
-
-static void __cpuinit cpuid4_cache_sysfs_exit(unsigned int cpu)
-{
- kfree(per_cpu(ici_cache_kobject, cpu));
- kfree(per_cpu(ici_index_kobject, cpu));
- per_cpu(ici_cache_kobject, cpu) = NULL;
- per_cpu(ici_index_kobject, cpu) = NULL;
- free_cache_attributes(cpu);
-}
-
-static int __cpuinit cpuid4_cache_sysfs_init(unsigned int cpu)
-{
- int err;
-
- if (num_cache_leaves == 0)
- return -ENOENT;
-
- err = detect_cache_attributes(cpu);
- if (err)
- return err;
-
- /* Allocate all required memory */
- per_cpu(ici_cache_kobject, cpu) =
- kzalloc(sizeof(struct kobject), GFP_KERNEL);
- if (unlikely(per_cpu(ici_cache_kobject, cpu) == NULL))
- goto err_out;
-
- per_cpu(ici_index_kobject, cpu) = kzalloc(
- sizeof(struct _index_kobject) * num_cache_leaves, GFP_KERNEL);
- if (unlikely(per_cpu(ici_index_kobject, cpu) == NULL))
- goto err_out;
-
- return 0;
-
-err_out:
- cpuid4_cache_sysfs_exit(cpu);
- return -ENOMEM;
-}
-
-static DECLARE_BITMAP(cache_dev_map, NR_CPUS);
-
-/* Add/Remove cache interface for CPU device */
-static int __cpuinit cache_add_dev(struct sys_device * sys_dev)
-{
- unsigned int cpu = sys_dev->id;
- unsigned long i, j;
- struct _index_kobject *this_object;
- struct _cpuid4_info *this_leaf;
- int retval;
-
- retval = cpuid4_cache_sysfs_init(cpu);
- if (unlikely(retval < 0))
- return retval;
-
- retval = kobject_init_and_add(per_cpu(ici_cache_kobject, cpu),
- &ktype_percpu_entry,
- &sys_dev->kobj, "%s", "cache");
- if (retval < 0) {
- cpuid4_cache_sysfs_exit(cpu);
- return retval;
- }
-
- for (i = 0; i < num_cache_leaves; i++) {
- this_object = INDEX_KOBJECT_PTR(cpu, i);
- this_object->cpu = cpu;
- this_object->index = i;
-
- this_leaf = CPUID4_INFO_IDX(cpu, i);
-
- if (this_leaf->l3 && this_leaf->l3->can_disable)
- ktype_cache.default_attrs = default_l3_attrs;
- else
- ktype_cache.default_attrs = default_attrs;
-
- retval = kobject_init_and_add(&(this_object->kobj),
- &ktype_cache,
- per_cpu(ici_cache_kobject, cpu),
- "index%1lu", i);
- if (unlikely(retval)) {
- for (j = 0; j < i; j++)
- kobject_put(&(INDEX_KOBJECT_PTR(cpu, j)->kobj));
- kobject_put(per_cpu(ici_cache_kobject, cpu));
- cpuid4_cache_sysfs_exit(cpu);
- return retval;
- }
- kobject_uevent(&(this_object->kobj), KOBJ_ADD);
- }
- cpumask_set_cpu(cpu, to_cpumask(cache_dev_map));
-
- kobject_uevent(per_cpu(ici_cache_kobject, cpu), KOBJ_ADD);
- return 0;
-}
-
-static void __cpuinit cache_remove_dev(struct sys_device * sys_dev)
-{
- unsigned int cpu = sys_dev->id;
- unsigned long i;
-
- if (per_cpu(ici_cpuid4_info, cpu) == NULL)
- return;
- if (!cpumask_test_cpu(cpu, to_cpumask(cache_dev_map)))
- return;
- cpumask_clear_cpu(cpu, to_cpumask(cache_dev_map));
-
- for (i = 0; i < num_cache_leaves; i++)
- kobject_put(&(INDEX_KOBJECT_PTR(cpu, i)->kobj));
- kobject_put(per_cpu(ici_cache_kobject, cpu));
- cpuid4_cache_sysfs_exit(cpu);
-}
-
-static int __cpuinit cacheinfo_cpu_callback(struct notifier_block *nfb,
- unsigned long action, void *hcpu)
-{
- unsigned int cpu = (unsigned long)hcpu;
- struct sys_device *sys_dev;
-
- sys_dev = get_cpu_sysdev(cpu);
- switch (action) {
- case CPU_ONLINE:
- case CPU_ONLINE_FROZEN:
- cache_add_dev(sys_dev);
- break;
- case CPU_DEAD:
- case CPU_DEAD_FROZEN:
- cache_remove_dev(sys_dev);
- break;
- }
- return NOTIFY_OK;
-}
-
-static struct notifier_block __cpuinitdata cacheinfo_cpu_notifier = {
- .notifier_call = cacheinfo_cpu_callback,
-};
-
-static int __cpuinit cache_sysfs_init(void)
-{
- int i;
-
- if (num_cache_leaves == 0)
- return 0;
-
- for_each_online_cpu(i) {
- int err;
- struct sys_device *sys_dev = get_cpu_sysdev(i);
-
- err = cache_add_dev(sys_dev);
- if (err)
- return err;
- }
- register_hotcpu_notifier(&cacheinfo_cpu_notifier);
- return 0;
-}
-
-device_initcall(cache_sysfs_init);
-
-#endif
diff --git a/arch/x86/kernel/cpu/intel_epb.c b/arch/x86/kernel/cpu/intel_epb.c
new file mode 100644
index 000000000000..2c56f8730f59
--- /dev/null
+++ b/arch/x86/kernel/cpu/intel_epb.c
@@ -0,0 +1,244 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Intel Performance and Energy Bias Hint support.
+ *
+ * Copyright (C) 2019 Intel Corporation
+ *
+ * Author:
+ * Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ */
+
+#include <linux/cpuhotplug.h>
+#include <linux/cpu.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/syscore_ops.h>
+#include <linux/pm.h>
+
+#include <asm/cpu_device_id.h>
+#include <asm/cpufeature.h>
+#include <asm/msr.h>
+
+/**
+ * DOC: overview
+ *
+ * The Performance and Energy Bias Hint (EPB) allows software to specify its
+ * preference with respect to the power-performance tradeoffs present in the
+ * processor. Generally, the EPB is expected to be set by user space (directly
+ * via sysfs or with the help of the x86_energy_perf_policy tool), but there are
+ * two reasons for the kernel to update it.
+ *
+ * First, there are systems where the platform firmware resets the EPB during
+ * system-wide transitions from sleep states back into the working state
+ * effectively causing the previous EPB updates by user space to be lost.
+ * Thus the kernel needs to save the current EPB values for all CPUs during
+ * system-wide transitions to sleep states and restore them on the way back to
+ * the working state. That can be achieved by saving EPB for secondary CPUs
+ * when they are taken offline during transitions into system sleep states and
+ * for the boot CPU in a syscore suspend operation, so that it can be restored
+ * for the boot CPU in a syscore resume operation and for the other CPUs when
+ * they are brought back online. However, CPUs that are already offline when
+ * a system-wide PM transition is started are not taken offline again, but their
+ * EPB values may still be reset by the platform firmware during the transition,
+ * so in fact it is necessary to save the EPB of any CPU taken offline and to
+ * restore it when the given CPU goes back online at all times.
+ *
+ * Second, on many systems the initial EPB value coming from the platform
+ * firmware is 0 ('performance') and at least on some of them that is because
+ * the platform firmware does not initialize EPB at all with the assumption that
+ * the OS will do that anyway. That sometimes is problematic, as it may cause
+ * the system battery to drain too fast, for example, so it is better to adjust
+ * it on CPU bring-up and if the initial EPB value for a given CPU is 0, the
+ * kernel changes it to 6 ('normal').
+ */
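
For orientation, the hint managed here can also be inspected from user space through the msr driver. A minimal read-only sketch, assuming the msr module is loaded, root privileges, and the architectural MSR number 0x1b0 for MSR_IA32_ENERGY_PERF_BIAS:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MSR_IA32_ENERGY_PERF_BIAS 0x1b0

    int main(void)
    {
        uint64_t epb;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);

        if (fd < 0 || pread(fd, &epb, sizeof(epb),
                            MSR_IA32_ENERGY_PERF_BIAS) != (ssize_t)sizeof(epb))
            return 1;

        /* Only bits [3:0] carry the hint: 0 performance ... 15 powersave. */
        printf("EPB: %llu\n", (unsigned long long)(epb & 0xf));
        close(fd);
        return 0;
    }
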
+
+static DEFINE_PER_CPU(u8, saved_epb);
+
+#define EPB_MASK 0x0fULL
+#define EPB_SAVED 0x10ULL
+#define MAX_EPB EPB_MASK
+
+enum energy_perf_value_index {
+ EPB_INDEX_PERFORMANCE,
+ EPB_INDEX_BALANCE_PERFORMANCE,
+ EPB_INDEX_NORMAL,
+ EPB_INDEX_BALANCE_POWERSAVE,
+ EPB_INDEX_POWERSAVE,
+};
+
+static u8 energ_perf_values[] = {
+ [EPB_INDEX_PERFORMANCE] = ENERGY_PERF_BIAS_PERFORMANCE,
+ [EPB_INDEX_BALANCE_PERFORMANCE] = ENERGY_PERF_BIAS_BALANCE_PERFORMANCE,
+ [EPB_INDEX_NORMAL] = ENERGY_PERF_BIAS_NORMAL,
+ [EPB_INDEX_BALANCE_POWERSAVE] = ENERGY_PERF_BIAS_BALANCE_POWERSAVE,
+ [EPB_INDEX_POWERSAVE] = ENERGY_PERF_BIAS_POWERSAVE,
+};
+
+static int intel_epb_save(void)
+{
+ u64 epb;
+
+ rdmsrq(MSR_IA32_ENERGY_PERF_BIAS, epb);
+ /*
+ * Ensure that saved_epb will always be nonzero after this write even if
+ * the EPB value read from the MSR is 0.
+ */
+ this_cpu_write(saved_epb, (epb & EPB_MASK) | EPB_SAVED);
+
+ return 0;
+}
+
+static void intel_epb_restore(void)
+{
+ u64 val = this_cpu_read(saved_epb);
+ u64 epb;
+
+ rdmsrq(MSR_IA32_ENERGY_PERF_BIAS, epb);
+ if (val) {
+ val &= EPB_MASK;
+ } else {
+ /*
+ * Because intel_epb_save() has not run for the current CPU yet,
+ * it is going online for the first time, so if its EPB value is
+ * 0 ('performance') at this point, assume that it has not been
+ * initialized by the platform firmware and set it to 6
+ * ('normal').
+ */
+ val = epb & EPB_MASK;
+ if (val == ENERGY_PERF_BIAS_PERFORMANCE) {
+ val = energ_perf_values[EPB_INDEX_NORMAL];
+ pr_warn_once("ENERGY_PERF_BIAS: Set to 'normal', was 'performance'\n");
+ }
+ }
+ wrmsrq(MSR_IA32_ENERGY_PERF_BIAS, (epb & ~EPB_MASK) | val);
+}
+
+static struct syscore_ops intel_epb_syscore_ops = {
+ .suspend = intel_epb_save,
+ .resume = intel_epb_restore,
+};
+
+static const char * const energy_perf_strings[] = {
+ [EPB_INDEX_PERFORMANCE] = "performance",
+ [EPB_INDEX_BALANCE_PERFORMANCE] = "balance-performance",
+ [EPB_INDEX_NORMAL] = "normal",
+ [EPB_INDEX_BALANCE_POWERSAVE] = "balance-power",
+ [EPB_INDEX_POWERSAVE] = "power",
+};
+
+static ssize_t energy_perf_bias_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ unsigned int cpu = dev->id;
+ u64 epb;
+ int ret;
+
+ ret = rdmsrq_on_cpu(cpu, MSR_IA32_ENERGY_PERF_BIAS, &epb);
+ if (ret < 0)
+ return ret;
+
+ return sprintf(buf, "%llu\n", epb);
+}
+
+static ssize_t energy_perf_bias_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned int cpu = dev->id;
+ u64 epb, val;
+ int ret;
+
+ ret = __sysfs_match_string(energy_perf_strings,
+ ARRAY_SIZE(energy_perf_strings), buf);
+ if (ret >= 0)
+ val = energ_perf_values[ret];
+ else if (kstrtou64(buf, 0, &val) || val > MAX_EPB)
+ return -EINVAL;
+
+ ret = rdmsrq_on_cpu(cpu, MSR_IA32_ENERGY_PERF_BIAS, &epb);
+ if (ret < 0)
+ return ret;
+
+ ret = wrmsrq_on_cpu(cpu, MSR_IA32_ENERGY_PERF_BIAS,
+ (epb & ~EPB_MASK) | val);
+ if (ret < 0)
+ return ret;
+
+ return count;
+}
+
+static DEVICE_ATTR_RW(energy_perf_bias);
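
Once the attribute group below is merged under each CPU device's "power" group, the knob should appear as /sys/devices/system/cpu/cpuN/power/energy_perf_bias. A hedged user-space sketch of setting it, assuming that path:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *hint = "balance-power";   /* or a raw "0".."15" string */
        int fd = open("/sys/devices/system/cpu/cpu0/power/energy_perf_bias",
                      O_WRONLY);

        if (fd < 0)
            return 1;
        if (write(fd, hint, strlen(hint)) < 0)
            return 1;
        close(fd);
        return 0;
    }
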
+
+static struct attribute *intel_epb_attrs[] = {
+ &dev_attr_energy_perf_bias.attr,
+ NULL
+};
+
+static const struct attribute_group intel_epb_attr_group = {
+ .name = power_group_name,
+ .attrs = intel_epb_attrs
+};
+
+static int intel_epb_online(unsigned int cpu)
+{
+ struct device *cpu_dev = get_cpu_device(cpu);
+
+ intel_epb_restore();
+ if (!cpuhp_tasks_frozen)
+ sysfs_merge_group(&cpu_dev->kobj, &intel_epb_attr_group);
+
+ return 0;
+}
+
+static int intel_epb_offline(unsigned int cpu)
+{
+ struct device *cpu_dev = get_cpu_device(cpu);
+
+ if (!cpuhp_tasks_frozen)
+ sysfs_unmerge_group(&cpu_dev->kobj, &intel_epb_attr_group);
+
+ intel_epb_save();
+ return 0;
+}
+
+static const struct x86_cpu_id intel_epb_normal[] = {
+ X86_MATCH_VFM(INTEL_ALDERLAKE_L,
+ ENERGY_PERF_BIAS_NORMAL_POWERSAVE),
+ X86_MATCH_VFM(INTEL_ATOM_GRACEMONT,
+ ENERGY_PERF_BIAS_NORMAL_POWERSAVE),
+ X86_MATCH_VFM(INTEL_RAPTORLAKE_P,
+ ENERGY_PERF_BIAS_NORMAL_POWERSAVE),
+ {}
+};
+
+static __init int intel_epb_init(void)
+{
+ const struct x86_cpu_id *id = x86_match_cpu(intel_epb_normal);
+ int ret;
+
+ if (!boot_cpu_has(X86_FEATURE_EPB))
+ return -ENODEV;
+
+ if (id)
+ energ_perf_values[EPB_INDEX_NORMAL] = id->driver_data;
+
+ ret = cpuhp_setup_state(CPUHP_AP_X86_INTEL_EPB_ONLINE,
+ "x86/intel/epb:online", intel_epb_online,
+ intel_epb_offline);
+ if (ret < 0)
+ goto err_out_online;
+
+ register_syscore_ops(&intel_epb_syscore_ops);
+ return 0;
+
+err_out_online:
+ cpuhp_remove_state(CPUHP_AP_X86_INTEL_EPB_ONLINE);
+ return ret;
+}
+late_initcall(intel_epb_init);
diff --git a/arch/x86/kernel/cpu/match.c b/arch/x86/kernel/cpu/match.c
new file mode 100644
index 000000000000..6af1e8baeb0f
--- /dev/null
+++ b/arch/x86/kernel/cpu/match.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <asm/cpu_device_id.h>
+#include <asm/cpufeature.h>
+#include <linux/cpu.h>
+#include <linux/export.h>
+#include <linux/slab.h>
+
+/**
+ * x86_match_vendor_cpu_type - helper function to match the hardware defined
+ * cpu-type for a single entry in the x86_cpu_id
+ * table. Note, this function does not match the
+ * generic cpu-types TOPO_CPU_TYPE_EFFICIENCY and
+ * TOPO_CPU_TYPE_PERFORMANCE.
+ * @c: Pointer to the cpuinfo_x86 structure of the CPU to match.
+ * @m: Pointer to the x86_cpu_id entry to match against.
+ *
+ * Return: true if the cpu-type matches, false otherwise.
+ */
+static bool x86_match_vendor_cpu_type(struct cpuinfo_x86 *c, const struct x86_cpu_id *m)
+{
+ if (m->type == X86_CPU_TYPE_ANY)
+ return true;
+
+ /* Hybrid CPUs are special, they are assumed to match all cpu-types */
+ if (cpu_feature_enabled(X86_FEATURE_HYBRID_CPU))
+ return true;
+
+ if (c->x86_vendor == X86_VENDOR_INTEL)
+ return m->type == c->topo.intel_type;
+ if (c->x86_vendor == X86_VENDOR_AMD)
+ return m->type == c->topo.amd_type;
+
+ return false;
+}
+
+/**
+ * x86_match_cpu - match current CPU against an array of x86_cpu_ids
+ * @match: Pointer to array of x86_cpu_ids. Last entry terminated with
+ * {}.
+ *
+ * Return the entry if the current CPU matches the entries in the
+ * passed x86_cpu_id match table. Otherwise NULL. The match table
+ * contains vendor (X86_VENDOR_*), family, model and feature bits or
+ * respective wildcard entries.
+ *
+ * A typical table entry would be to match a specific CPU
+ *
+ * X86_MATCH_VFM_FEATURE(INTEL_BROADWELL, X86_FEATURE_ANY, NULL);
+ *
+ * Fields can be wildcarded with %X86_VENDOR_ANY, %X86_FAMILY_ANY,
+ * %X86_MODEL_ANY, %X86_FEATURE_ANY (except for vendor)
+ *
+ * asm/cpu_device_id.h contains a set of useful macros which are shortcuts
+ * for various common selections. The above can be shortened to:
+ *
+ * X86_MATCH_VFM(INTEL_BROADWELL, NULL);
+ *
+ * Arrays used to match for this should also be declared using
+ * MODULE_DEVICE_TABLE(x86cpu, ...)
+ *
+ * This always matches against the boot cpu, assuming models and features are
+ * consistent over all CPUs.
+ */
+const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match)
+{
+ const struct x86_cpu_id *m;
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+
+ for (m = match; m->flags & X86_CPU_ID_FLAG_ENTRY_VALID; m++) {
+ if (m->vendor != X86_VENDOR_ANY && c->x86_vendor != m->vendor)
+ continue;
+ if (m->family != X86_FAMILY_ANY && c->x86 != m->family)
+ continue;
+ if (m->model != X86_MODEL_ANY && c->x86_model != m->model)
+ continue;
+ if (m->steppings != X86_STEPPING_ANY &&
+ !(BIT(c->x86_stepping) & m->steppings))
+ continue;
+ if (m->feature != X86_FEATURE_ANY && !cpu_has(c, m->feature))
+ continue;
+ if (!x86_match_vendor_cpu_type(c, m))
+ continue;
+ return m;
+ }
+ return NULL;
+}
+EXPORT_SYMBOL(x86_match_cpu);
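
A sketch of the calling convention the kerneldoc above describes; the driver name and table contents are hypothetical, the macros are the ones named in the comment:

    #include <linux/module.h>
    #include <asm/cpu_device_id.h>

    static const struct x86_cpu_id mydrv_ids[] = {
        X86_MATCH_VFM(INTEL_BROADWELL, NULL),
        X86_MATCH_VENDOR_FAM(AMD, 0x17, NULL),  /* any family 0x17 part */
        {}
    };
    MODULE_DEVICE_TABLE(x86cpu, mydrv_ids);

    static int __init mydrv_init(void)
    {
        const struct x86_cpu_id *id = x86_match_cpu(mydrv_ids);

        if (!id)
            return -ENODEV;
        /* id->driver_data carries the matched entry's payload, if any. */
        return 0;
    }
    module_init(mydrv_init);
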
+
+bool x86_match_min_microcode_rev(const struct x86_cpu_id *table)
+{
+ const struct x86_cpu_id *res = x86_match_cpu(table);
+
+ if (!res || res->driver_data > boot_cpu_data.microcode)
+ return false;
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(x86_match_min_microcode_rev);
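
x86_match_min_microcode_rev() keys off ->driver_data as the lowest acceptable microcode revision, so a caller's table looks like the sketch below (model and revision are made up for illustration):

    static const struct x86_cpu_id ucode_min[] = {
        X86_MATCH_VFM(INTEL_SKYLAKE_X, 0x0b000021),  /* min rev, made up */
        {}
    };

    static bool my_feature_usable(void)
    {
        /* False if the CPU is not listed or runs older microcode. */
        return x86_match_min_microcode_rev(ucode_min);
    }
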
diff --git a/arch/x86/kernel/cpu/mce/Makefile b/arch/x86/kernel/cpu/mce/Makefile
new file mode 100644
index 000000000000..015856abdbb1
--- /dev/null
+++ b/arch/x86/kernel/cpu/mce/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-y = core.o severity.o genpool.o
+
+obj-$(CONFIG_X86_ANCIENT_MCE) += winchip.o p5.o
+obj-$(CONFIG_X86_MCE_INTEL) += intel.o
+obj-$(CONFIG_X86_MCE_AMD) += amd.o
+obj-$(CONFIG_X86_MCE_THRESHOLD) += threshold.o
+
+mce-inject-y := inject.o
+obj-$(CONFIG_X86_MCE_INJECT) += mce-inject.o
+
+obj-$(CONFIG_ACPI_APEI) += apei.o
+
+obj-$(CONFIG_X86_MCELOG_LEGACY) += dev-mcelog.o
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
new file mode 100644
index 000000000000..3f1dda355307
--- /dev/null
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -0,0 +1,1270 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * (c) 2005-2016 Advanced Micro Devices, Inc.
+ *
+ * Written by Jacob Shin - AMD, Inc.
+ * Maintained by: Borislav Petkov <bp@alien8.de>
+ */
+#include <linux/interrupt.h>
+#include <linux/notifier.h>
+#include <linux/kobject.h>
+#include <linux/percpu.h>
+#include <linux/errno.h>
+#include <linux/sched.h>
+#include <linux/sysfs.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/cpu.h>
+#include <linux/smp.h>
+#include <linux/string.h>
+
+#include <asm/traps.h>
+#include <asm/apic.h>
+#include <asm/mce.h>
+#include <asm/msr.h>
+#include <asm/trace/irq_vectors.h>
+
+#include "internal.h"
+
+#define NR_BLOCKS 5
+#define THRESHOLD_MAX 0xFFF
+#define INT_TYPE_APIC 0x00020000
+#define MASK_VALID_HI 0x80000000
+#define MASK_CNTP_HI 0x40000000
+#define MASK_LOCKED_HI 0x20000000
+#define MASK_LVTOFF_HI 0x00F00000
+#define MASK_COUNT_EN_HI 0x00080000
+#define MASK_INT_TYPE_HI 0x00060000
+#define MASK_OVERFLOW_HI 0x00010000
+#define MASK_ERR_COUNT_HI 0x00000FFF
+#define MASK_BLKPTR_LO 0xFF000000
+#define MCG_XBLK_ADDR 0xC0000400
+
+/* Deferred error settings */
+#define MSR_CU_DEF_ERR 0xC0000410
+#define MASK_DEF_LVTOFF 0x000000F0
+
+/* Scalable MCA: */
+
+/* Threshold LVT offset is at MSR0xC0000410[15:12] */
+#define SMCA_THR_LVT_OFF 0xF000
+
+static bool thresholding_irq_en;
+
+struct mce_amd_cpu_data {
+ mce_banks_t thr_intr_banks;
+ mce_banks_t dfr_intr_banks;
+
+ u32 thr_intr_en: 1,
+ dfr_intr_en: 1,
+ __resv: 30;
+};
+
+static DEFINE_PER_CPU_READ_MOSTLY(struct mce_amd_cpu_data, mce_amd_data);
+
+static const char * const th_names[] = {
+ "load_store",
+ "insn_fetch",
+ "combined_unit",
+ "decode_unit",
+ "northbridge",
+ "execution_unit",
+};
+
+static const char * const smca_umc_block_names[] = {
+ "dram_ecc",
+ "misc_umc"
+};
+
+#define HWID_MCATYPE(hwid, mcatype) (((hwid) << 16) | (mcatype))
+
+struct smca_hwid {
+ unsigned int bank_type; /* Use with smca_bank_types for easy indexing. */
+ u32 hwid_mcatype; /* (hwid,mcatype) tuple */
+};
+
+struct smca_bank {
+ const struct smca_hwid *hwid;
+ u32 id; /* Value of MCA_IPID[InstanceId]. */
+ u8 sysfs_id; /* Value used for sysfs name. */
+ u64 paddrv :1, /* Physical Address Valid bit in MCA_CONFIG */
+ __reserved :63;
+};
+
+static DEFINE_PER_CPU_READ_MOSTLY(struct smca_bank[MAX_NR_BANKS], smca_banks);
+static DEFINE_PER_CPU_READ_MOSTLY(u8[N_SMCA_BANK_TYPES], smca_bank_counts);
+
+static const char * const smca_names[] = {
+ [SMCA_LS ... SMCA_LS_V2] = "load_store",
+ [SMCA_IF] = "insn_fetch",
+ [SMCA_L2_CACHE] = "l2_cache",
+ [SMCA_DE] = "decode_unit",
+ [SMCA_RESERVED] = "reserved",
+ [SMCA_EX] = "execution_unit",
+ [SMCA_FP] = "floating_point",
+ [SMCA_L3_CACHE] = "l3_cache",
+ [SMCA_CS ... SMCA_CS_V2] = "coherent_slave",
+ [SMCA_PIE] = "pie",
+
+ /* UMC v2 is separate because both of them can exist in a single system. */
+ [SMCA_UMC] = "umc",
+ [SMCA_UMC_V2] = "umc_v2",
+ [SMCA_MA_LLC] = "ma_llc",
+ [SMCA_PB] = "param_block",
+ [SMCA_PSP ... SMCA_PSP_V2] = "psp",
+ [SMCA_SMU ... SMCA_SMU_V2] = "smu",
+ [SMCA_MP5] = "mp5",
+ [SMCA_MPDMA] = "mpdma",
+ [SMCA_NBIO] = "nbio",
+ [SMCA_PCIE ... SMCA_PCIE_V2] = "pcie",
+ [SMCA_XGMI_PCS] = "xgmi_pcs",
+ [SMCA_NBIF] = "nbif",
+ [SMCA_SHUB] = "shub",
+ [SMCA_SATA] = "sata",
+ [SMCA_USB] = "usb",
+ [SMCA_USR_DP] = "usr_dp",
+ [SMCA_USR_CP] = "usr_cp",
+ [SMCA_GMI_PCS] = "gmi_pcs",
+ [SMCA_XGMI_PHY] = "xgmi_phy",
+ [SMCA_WAFL_PHY] = "wafl_phy",
+ [SMCA_GMI_PHY] = "gmi_phy",
+};
+
+static const char *smca_get_name(enum smca_bank_types t)
+{
+ if (t >= N_SMCA_BANK_TYPES)
+ return NULL;
+
+ return smca_names[t];
+}
+
+enum smca_bank_types smca_get_bank_type(unsigned int cpu, unsigned int bank)
+{
+ struct smca_bank *b;
+
+ if (bank >= MAX_NR_BANKS)
+ return N_SMCA_BANK_TYPES;
+
+ b = &per_cpu(smca_banks, cpu)[bank];
+ if (!b->hwid)
+ return N_SMCA_BANK_TYPES;
+
+ return b->hwid->bank_type;
+}
+EXPORT_SYMBOL_GPL(smca_get_bank_type);
+
+static const struct smca_hwid smca_hwid_mcatypes[] = {
+ /* { bank_type, hwid_mcatype } */
+
+ /* Reserved type */
+ { SMCA_RESERVED, HWID_MCATYPE(0x00, 0x0) },
+
+ /* ZN Core (HWID=0xB0) MCA types */
+ { SMCA_LS, HWID_MCATYPE(0xB0, 0x0) },
+ { SMCA_LS_V2, HWID_MCATYPE(0xB0, 0x10) },
+ { SMCA_IF, HWID_MCATYPE(0xB0, 0x1) },
+ { SMCA_L2_CACHE, HWID_MCATYPE(0xB0, 0x2) },
+ { SMCA_DE, HWID_MCATYPE(0xB0, 0x3) },
+ /* HWID 0xB0 MCATYPE 0x4 is Reserved */
+ { SMCA_EX, HWID_MCATYPE(0xB0, 0x5) },
+ { SMCA_FP, HWID_MCATYPE(0xB0, 0x6) },
+ { SMCA_L3_CACHE, HWID_MCATYPE(0xB0, 0x7) },
+
+ /* Data Fabric MCA types */
+ { SMCA_CS, HWID_MCATYPE(0x2E, 0x0) },
+ { SMCA_PIE, HWID_MCATYPE(0x2E, 0x1) },
+ { SMCA_CS_V2, HWID_MCATYPE(0x2E, 0x2) },
+ { SMCA_MA_LLC, HWID_MCATYPE(0x2E, 0x4) },
+
+ /* Unified Memory Controller MCA type */
+ { SMCA_UMC, HWID_MCATYPE(0x96, 0x0) },
+ { SMCA_UMC_V2, HWID_MCATYPE(0x96, 0x1) },
+
+ /* Parameter Block MCA type */
+ { SMCA_PB, HWID_MCATYPE(0x05, 0x0) },
+
+ /* Platform Security Processor MCA type */
+ { SMCA_PSP, HWID_MCATYPE(0xFF, 0x0) },
+ { SMCA_PSP_V2, HWID_MCATYPE(0xFF, 0x1) },
+
+ /* System Management Unit MCA type */
+ { SMCA_SMU, HWID_MCATYPE(0x01, 0x0) },
+ { SMCA_SMU_V2, HWID_MCATYPE(0x01, 0x1) },
+
+ /* Microprocessor 5 Unit MCA type */
+ { SMCA_MP5, HWID_MCATYPE(0x01, 0x2) },
+
+ /* MPDMA MCA type */
+ { SMCA_MPDMA, HWID_MCATYPE(0x01, 0x3) },
+
+ /* Northbridge IO Unit MCA type */
+ { SMCA_NBIO, HWID_MCATYPE(0x18, 0x0) },
+
+ /* PCI Express Unit MCA type */
+ { SMCA_PCIE, HWID_MCATYPE(0x46, 0x0) },
+ { SMCA_PCIE_V2, HWID_MCATYPE(0x46, 0x1) },
+
+ { SMCA_XGMI_PCS, HWID_MCATYPE(0x50, 0x0) },
+ { SMCA_NBIF, HWID_MCATYPE(0x6C, 0x0) },
+ { SMCA_SHUB, HWID_MCATYPE(0x80, 0x0) },
+ { SMCA_SATA, HWID_MCATYPE(0xA8, 0x0) },
+ { SMCA_USB, HWID_MCATYPE(0xAA, 0x0) },
+ { SMCA_USR_DP, HWID_MCATYPE(0x170, 0x0) },
+ { SMCA_USR_CP, HWID_MCATYPE(0x180, 0x0) },
+ { SMCA_GMI_PCS, HWID_MCATYPE(0x241, 0x0) },
+ { SMCA_XGMI_PHY, HWID_MCATYPE(0x259, 0x0) },
+ { SMCA_WAFL_PHY, HWID_MCATYPE(0x267, 0x0) },
+ { SMCA_GMI_PHY, HWID_MCATYPE(0x269, 0x0) },
+};
+
+/*
+ * In SMCA enabled processors, we can have multiple banks for a given IP type.
+ * So to define a unique name for each bank, we use a temp c-string to append
+ * the MCA_IPID[InstanceId] to type's name in get_name().
+ *
+ * InstanceId is 32 bits which is 8 characters. Make sure MAX_MCATYPE_NAME_LEN
+ * is greater than 8 plus 1 (for underscore) plus length of longest type name.
+ */
+#define MAX_MCATYPE_NAME_LEN 30
+static char buf_mcatype[MAX_MCATYPE_NAME_LEN];
+
+struct threshold_block {
+ /* This block's number within its bank. */
+ unsigned int block;
+ /* MCA bank number that contains this block. */
+ unsigned int bank;
+ /* CPU which controls this block's MCA bank. */
+ unsigned int cpu;
+ /* MCA_MISC MSR address for this block. */
+ u32 address;
+ /* Enable/Disable APIC interrupt. */
+ bool interrupt_enable;
+ /* Bank can generate an interrupt. */
+ bool interrupt_capable;
+ /* Value upon which threshold interrupt is generated. */
+ u16 threshold_limit;
+ /* sysfs object */
+ struct kobject kobj;
+ /* List of threshold blocks within this block's MCA bank. */
+ struct list_head miscj;
+};
+
+struct threshold_bank {
+ struct kobject *kobj;
+ /* List of threshold blocks within this MCA bank. */
+ struct list_head miscj;
+};
+
+static DEFINE_PER_CPU(struct threshold_bank **, threshold_banks);
+
+/*
+ * A list of the banks enabled on each logical CPU. Controls which respective
+ * descriptors to initialize later in mce_threshold_create_device().
+ */
+static DEFINE_PER_CPU(u64, bank_map);
+
+static void amd_threshold_interrupt(void);
+static void amd_deferred_error_interrupt(void);
+
+static void default_deferred_error_interrupt(void)
+{
+ pr_err("Unexpected deferred interrupt at vector %x\n", DEFERRED_ERROR_VECTOR);
+}
+void (*deferred_error_int_vector)(void) = default_deferred_error_interrupt;
+
+static void smca_configure(unsigned int bank, unsigned int cpu)
+{
+ struct mce_amd_cpu_data *data = this_cpu_ptr(&mce_amd_data);
+ u8 *bank_counts = this_cpu_ptr(smca_bank_counts);
+ const struct smca_hwid *s_hwid;
+ unsigned int i, hwid_mcatype;
+ u32 high, low;
+ u32 smca_config = MSR_AMD64_SMCA_MCx_CONFIG(bank);
+
+ /* Set appropriate bits in MCA_CONFIG */
+ if (!rdmsr_safe(smca_config, &low, &high)) {
+ /*
+ * OS is required to set the MCAX bit to acknowledge that it is
+ * now using the new MSR ranges and new registers under each
+ * bank. It also means that the OS will configure deferred
+ * errors in the new MCx_CONFIG register. If the bit is not set,
+ * uncorrectable errors will cause a system panic.
+ *
+ * MCA_CONFIG[MCAX] is bit 32 (0 in the high portion of the MSR.)
+ */
+ high |= BIT(0);
+
+ /*
+ * SMCA sets the Deferred Error Interrupt type per bank.
+ *
+ * MCA_CONFIG[DeferredIntTypeSupported] is bit 5, and tells us
+ * if the DeferredIntType bit field is available.
+ *
+ * MCA_CONFIG[DeferredIntType] is bits [38:37] ([6:5] in the
+ * high portion of the MSR). OS should set this to 0x1 to enable
+ * APIC based interrupt. First, check that no interrupt has been
+ * set.
+ */
+ if ((low & BIT(5)) && !((high >> 5) & 0x3) && data->dfr_intr_en) {
+ __set_bit(bank, data->dfr_intr_banks);
+ high |= BIT(5);
+ }
+
+ /*
+ * SMCA Corrected Error Interrupt
+ *
+ * MCA_CONFIG[IntPresent] is bit 10, and tells us if the bank can
+ * send an MCA Thresholding interrupt without the OS initializing
+ * this feature. This can be used if the threshold limit is managed
+ * by the platform.
+ *
+ * MCA_CONFIG[IntEn] is bit 40 (8 in the high portion of the MSR).
+ * The OS should set this to inform the platform that the OS is ready
+ * to handle the MCA Thresholding interrupt.
+ */
+ if ((low & BIT(10)) && data->thr_intr_en) {
+ __set_bit(bank, data->thr_intr_banks);
+ high |= BIT(8);
+ }
+
+ this_cpu_ptr(mce_banks_array)[bank].lsb_in_status = !!(low & BIT(8));
+
+ if (low & MCI_CONFIG_PADDRV)
+ this_cpu_ptr(smca_banks)[bank].paddrv = 1;
+
+ wrmsr(smca_config, low, high);
+ }
+
+ if (rdmsr_safe(MSR_AMD64_SMCA_MCx_IPID(bank), &low, &high)) {
+ pr_warn("Failed to read MCA_IPID for bank %d\n", bank);
+ return;
+ }
+
+ hwid_mcatype = HWID_MCATYPE(high & MCI_IPID_HWID,
+ (high & MCI_IPID_MCATYPE) >> 16);
+
+ for (i = 0; i < ARRAY_SIZE(smca_hwid_mcatypes); i++) {
+ s_hwid = &smca_hwid_mcatypes[i];
+
+ if (hwid_mcatype == s_hwid->hwid_mcatype) {
+ this_cpu_ptr(smca_banks)[bank].hwid = s_hwid;
+ this_cpu_ptr(smca_banks)[bank].id = low;
+ this_cpu_ptr(smca_banks)[bank].sysfs_id = bank_counts[s_hwid->bank_type]++;
+ break;
+ }
+ }
+}
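
The lookup key built at the end of smca_configure() packs MCA_IPID[43:32] (HardwareID) and MCA_IPID[63:48] (McaType) into one 32-bit value. A stand-alone sketch of that decode, with the mask constants as I recall them from <asm/mce.h>:

    #include <assert.h>
    #include <stdint.h>

    #define MCI_IPID_MCATYPE 0xFFFF0000u              /* IPID[63:48] */
    #define MCI_IPID_HWID    0x00000FFFu              /* IPID[43:32] */
    #define HWID_MCATYPE(hwid, mcatype) (((hwid) << 16) | (mcatype))

    int main(void)
    {
        uint32_t high = 0x00000096;  /* sample UMC bank: HWID 0x96, type 0 */
        uint32_t key  = HWID_MCATYPE(high & MCI_IPID_HWID,
                                     (high & MCI_IPID_MCATYPE) >> 16);

        assert(key == HWID_MCATYPE(0x96, 0x0));  /* matches the SMCA_UMC row */
        return 0;
    }
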
+
+struct thresh_restart {
+ struct threshold_block *b;
+ int set_lvt_off;
+ int lvt_off;
+ u16 old_limit;
+};
+
+static const char *bank4_names(const struct threshold_block *b)
+{
+ switch (b->address) {
+ /* MSR4_MISC0 */
+ case 0x00000413:
+ return "dram";
+
+ case 0xc0000408:
+ return "ht_links";
+
+ case 0xc0000409:
+ return "l3_cache";
+
+ default:
+ WARN(1, "Funny MSR: 0x%08x\n", b->address);
+ return "";
+ }
+};
+
+
+static bool lvt_interrupt_supported(unsigned int bank, u32 msr_high_bits)
+{
+ /*
+ * bank 4 supports APIC LVT interrupts implicitly since forever.
+ */
+ if (bank == 4)
+ return true;
+
+ /*
+ * IntP: interrupt present; if this bit is set, the thresholding
+ * bank can generate APIC LVT interrupts
+ */
+ return msr_high_bits & BIT(28);
+}
+
+static bool lvt_off_valid(struct threshold_block *b, int apic, u32 lo, u32 hi)
+{
+ int msr = (hi & MASK_LVTOFF_HI) >> 20;
+
+ /*
+ * On SMCA CPUs, LVT offset is programmed at a different MSR, and
+ * the BIOS provides the value. The original field where LVT offset
+ * was set is reserved. Return early here:
+ */
+ if (mce_flags.smca)
+ return false;
+
+ if (apic < 0) {
+ pr_err(FW_BUG "cpu %d, failed to setup threshold interrupt "
+ "for bank %d, block %d (MSR%08X=0x%x%08x)\n", b->cpu,
+ b->bank, b->block, b->address, hi, lo);
+ return false;
+ }
+
+ if (apic != msr) {
+ pr_err(FW_BUG "cpu %d, invalid threshold interrupt offset %d "
+ "for bank %d, block %d (MSR%08X=0x%x%08x)\n",
+ b->cpu, apic, b->bank, b->block, b->address, hi, lo);
+ return false;
+ }
+
+ return true;
+};
+
+/* Reprogram MCx_MISC MSR behind this threshold block. */
+static void threshold_restart_block(void *_tr)
+{
+ struct thresh_restart *tr = _tr;
+ u32 hi, lo;
+
+ /* sysfs write might race against an offline operation */
+ if (!this_cpu_read(threshold_banks) && !tr->set_lvt_off)
+ return;
+
+ rdmsr(tr->b->address, lo, hi);
+
+ /*
+ * Reset error count and overflow bit.
+ * This is done during init or after handling an interrupt.
+ */
+ if (hi & MASK_OVERFLOW_HI || tr->set_lvt_off) {
+ hi &= ~(MASK_ERR_COUNT_HI | MASK_OVERFLOW_HI);
+ hi |= THRESHOLD_MAX - tr->b->threshold_limit;
+ } else if (tr->old_limit) { /* change limit w/o reset */
+ int new_count = (hi & THRESHOLD_MAX) +
+ (tr->old_limit - tr->b->threshold_limit);
+
+ hi = (hi & ~MASK_ERR_COUNT_HI) |
+ (new_count & THRESHOLD_MAX);
+ }
+
+ /* clear IntType */
+ hi &= ~MASK_INT_TYPE_HI;
+
+ if (!tr->b->interrupt_capable)
+ goto done;
+
+ if (tr->set_lvt_off) {
+ if (lvt_off_valid(tr->b, tr->lvt_off, lo, hi)) {
+ /* set new lvt offset */
+ hi &= ~MASK_LVTOFF_HI;
+ hi |= tr->lvt_off << 20;
+ }
+ }
+
+ if (tr->b->interrupt_enable)
+ hi |= INT_TYPE_APIC;
+
+ done:
+
+ hi |= MASK_COUNT_EN_HI;
+ wrmsr(tr->b->address, lo, hi);
+}
+
+static void threshold_restart_bank(unsigned int bank, bool intr_en)
+{
+ struct threshold_bank **thr_banks = this_cpu_read(threshold_banks);
+ struct threshold_block *block, *tmp;
+ struct thresh_restart tr;
+
+ if (!thr_banks || !thr_banks[bank])
+ return;
+
+ memset(&tr, 0, sizeof(tr));
+
+ list_for_each_entry_safe(block, tmp, &thr_banks[bank]->miscj, miscj) {
+ tr.b = block;
+ tr.b->interrupt_enable = intr_en;
+ threshold_restart_block(&tr);
+ }
+}
+
+/* Try to use the threshold limit reported through APEI. */
+static u16 get_thr_limit(void)
+{
+ u32 thr_limit = mce_get_apei_thr_limit();
+
+ /* Fallback to old default if APEI limit is not available. */
+ if (!thr_limit)
+ return THRESHOLD_MAX;
+
+ return min(thr_limit, THRESHOLD_MAX);
+}
+
+static void mce_threshold_block_init(struct threshold_block *b, int offset)
+{
+ struct thresh_restart tr = {
+ .b = b,
+ .set_lvt_off = 1,
+ .lvt_off = offset,
+ };
+
+ b->threshold_limit = get_thr_limit();
+ threshold_restart_block(&tr);
+};
+
+static int setup_APIC_mce_threshold(int reserved, int new)
+{
+ if (reserved < 0 && !setup_APIC_eilvt(new, THRESHOLD_APIC_VECTOR,
+ APIC_EILVT_MSG_FIX, 0))
+ return new;
+
+ return reserved;
+}
+
+static u32 get_block_address(u32 current_addr, u32 low, u32 high,
+ unsigned int bank, unsigned int block,
+ unsigned int cpu)
+{
+ u32 addr = 0, offset = 0;
+
+ if ((bank >= per_cpu(mce_num_banks, cpu)) || (block >= NR_BLOCKS))
+ return addr;
+
+ if (mce_flags.smca) {
+ if (!block)
+ return MSR_AMD64_SMCA_MCx_MISC(bank);
+
+ if (!(low & MASK_BLKPTR_LO))
+ return 0;
+
+ return MSR_AMD64_SMCA_MCx_MISCy(bank, block - 1);
+ }
+
+ /* Fall back to method we used for older processors: */
+ switch (block) {
+ case 0:
+ addr = mca_msr_reg(bank, MCA_MISC);
+ break;
+ case 1:
+ offset = ((low & MASK_BLKPTR_LO) >> 21);
+ if (offset)
+ addr = MCG_XBLK_ADDR + offset;
+ break;
+ default:
+ addr = ++current_addr;
+ }
+ return addr;
+}
+
+static int prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr,
+ int offset, u32 misc_high)
+{
+ unsigned int cpu = smp_processor_id();
+ struct threshold_block b;
+ int new;
+
+ if (!block)
+ per_cpu(bank_map, cpu) |= BIT_ULL(bank);
+
+ memset(&b, 0, sizeof(b));
+ b.cpu = cpu;
+ b.bank = bank;
+ b.block = block;
+ b.address = addr;
+ b.interrupt_capable = lvt_interrupt_supported(bank, misc_high);
+
+ if (!b.interrupt_capable)
+ goto done;
+
+ __set_bit(bank, this_cpu_ptr(&mce_amd_data)->thr_intr_banks);
+ b.interrupt_enable = 1;
+
+ if (mce_flags.smca)
+ goto done;
+
+ new = (misc_high & MASK_LVTOFF_HI) >> 20;
+ offset = setup_APIC_mce_threshold(offset, new);
+ if (offset == new)
+ thresholding_irq_en = true;
+
+done:
+ mce_threshold_block_init(&b, offset);
+
+ return offset;
+}
+
+bool amd_filter_mce(struct mce *m)
+{
+ enum smca_bank_types bank_type = smca_get_bank_type(m->extcpu, m->bank);
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+
+ /* See Family 17h Models 10h-2Fh Erratum #1114. */
+ if (c->x86 == 0x17 &&
+ c->x86_model >= 0x10 && c->x86_model <= 0x2F &&
+ bank_type == SMCA_IF && XEC(m->status, 0x3f) == 10)
+ return true;
+
+ /* NB GART TLB error reporting is disabled by default. */
+ if (c->x86 < 0x17) {
+ if (m->bank == 4 && XEC(m->status, 0x1f) == 0x5)
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Turn off thresholding banks for the following conditions:
+ * - MC4_MISC thresholding is not supported on Family 0x15.
+ * - Prevent possible spurious interrupts from the IF bank on Family 0x17
+ * Models 0x10-0x2F due to Erratum #1114.
+ */
+static void disable_err_thresholding(struct cpuinfo_x86 *c, unsigned int bank)
+{
+ int i, num_msrs;
+ u64 hwcr;
+ bool need_toggle;
+ u32 msrs[NR_BLOCKS];
+
+ if (c->x86 == 0x15 && bank == 4) {
+ msrs[0] = 0x00000413; /* MC4_MISC0 */
+ msrs[1] = 0xc0000408; /* MC4_MISC1 */
+ num_msrs = 2;
+ } else if (c->x86 == 0x17 &&
+ (c->x86_model >= 0x10 && c->x86_model <= 0x2F)) {
+
+ if (smca_get_bank_type(smp_processor_id(), bank) != SMCA_IF)
+ return;
+
+ msrs[0] = MSR_AMD64_SMCA_MCx_MISC(bank);
+ num_msrs = 1;
+ } else {
+ return;
+ }
+
+ rdmsrq(MSR_K7_HWCR, hwcr);
+
+ /* McStatusWrEn has to be set */
+ need_toggle = !(hwcr & BIT(18));
+ if (need_toggle)
+ wrmsrq(MSR_K7_HWCR, hwcr | BIT(18));
+
+ /* Clear CntP bit safely */
+ for (i = 0; i < num_msrs; i++)
+ msr_clear_bit(msrs[i], 62);
+
+ /* restore old settings */
+ if (need_toggle)
+ wrmsrq(MSR_K7_HWCR, hwcr);
+}
+
+static void amd_apply_cpu_quirks(struct cpuinfo_x86 *c)
+{
+ struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
+
+ /* This should be disabled by the BIOS, but isn't always */
+ if (c->x86 == 15 && this_cpu_read(mce_num_banks) > 4) {
+ /*
+ * disable GART TBL walk error reporting, which
+ * trips off incorrectly with the IOMMU & 3ware
+ * & Cerberus:
+ */
+ clear_bit(10, (unsigned long *)&mce_banks[4].ctl);
+ }
+
+ /*
+ * Various K7s with broken bank 0 around. Always disable
+ * by default.
+ */
+ if (c->x86 == 6 && this_cpu_read(mce_num_banks))
+ mce_banks[0].ctl = 0;
+}
+
+/*
+ * Enable the APIC LVT interrupt vectors once per-CPU. This should be done before hardware is
+ * ready to send interrupts.
+ *
+ * Individual error sources are enabled later during per-bank init.
+ */
+static void smca_enable_interrupt_vectors(void)
+{
+ struct mce_amd_cpu_data *data = this_cpu_ptr(&mce_amd_data);
+ u64 mca_intr_cfg, offset;
+
+ if (!mce_flags.smca || !mce_flags.succor)
+ return;
+
+ if (rdmsrq_safe(MSR_CU_DEF_ERR, &mca_intr_cfg))
+ return;
+
+ offset = (mca_intr_cfg & SMCA_THR_LVT_OFF) >> 12;
+ if (!setup_APIC_eilvt(offset, THRESHOLD_APIC_VECTOR, APIC_EILVT_MSG_FIX, 0))
+ data->thr_intr_en = 1;
+
+ offset = (mca_intr_cfg & MASK_DEF_LVTOFF) >> 4;
+ if (!setup_APIC_eilvt(offset, DEFERRED_ERROR_VECTOR, APIC_EILVT_MSG_FIX, 0))
+ data->dfr_intr_en = 1;
+}
+
+/* cpu init entry point, called from mce.c with preempt off */
+void mce_amd_feature_init(struct cpuinfo_x86 *c)
+{
+ unsigned int bank, block, cpu = smp_processor_id();
+ u32 low = 0, high = 0, address = 0;
+ int offset = -1;
+
+ amd_apply_cpu_quirks(c);
+
+ mce_flags.amd_threshold = 1;
+
+ smca_enable_interrupt_vectors();
+
+ for (bank = 0; bank < this_cpu_read(mce_num_banks); ++bank) {
+ if (mce_flags.smca) {
+ smca_configure(bank, cpu);
+
+ if (!this_cpu_ptr(&mce_amd_data)->thr_intr_en)
+ continue;
+ }
+
+ disable_err_thresholding(c, bank);
+
+ for (block = 0; block < NR_BLOCKS; ++block) {
+ address = get_block_address(address, low, high, bank, block, cpu);
+ if (!address)
+ break;
+
+ if (rdmsr_safe(address, &low, &high))
+ break;
+
+ if (!(high & MASK_VALID_HI))
+ continue;
+
+ if (!(high & MASK_CNTP_HI) ||
+ (high & MASK_LOCKED_HI))
+ continue;
+
+ offset = prepare_threshold_block(bank, block, address, offset, high);
+ }
+ }
+}
+
+void smca_bsp_init(void)
+{
+ mce_threshold_vector = amd_threshold_interrupt;
+ deferred_error_int_vector = amd_deferred_error_interrupt;
+}
+
+/*
+ * DRAM ECC errors are reported in the Northbridge (bank 4) with
+ * Extended Error Code 8.
+ */
+static bool legacy_mce_is_memory_error(struct mce *m)
+{
+ return m->bank == 4 && XEC(m->status, 0x1f) == 8;
+}
+
+/*
+ * DRAM ECC errors are reported in Unified Memory Controllers with
+ * Extended Error Code 0.
+ */
+static bool smca_mce_is_memory_error(struct mce *m)
+{
+ enum smca_bank_types bank_type;
+
+ if (XEC(m->status, 0x3f))
+ return false;
+
+ bank_type = smca_get_bank_type(m->extcpu, m->bank);
+
+ return bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2;
+}
+
+bool amd_mce_is_memory_error(struct mce *m)
+{
+ if (mce_flags.smca)
+ return smca_mce_is_memory_error(m);
+ else
+ return legacy_mce_is_memory_error(m);
+}
+
+/*
+ * Some AMD systems have an explicit indicator that the value in MCA_ADDR is a
+ * system physical address. Individual cases though, need to be detected for
+ * other systems. Future cases will be added as needed.
+ *
+ * 1) General case
+ * a) Assume address is not usable.
+ * 2) Poison errors
+ * a) Indicated by MCA_STATUS[43]: poison. Defined for all banks except legacy
+ * northbridge (bank 4).
+ * b) Refers to poison consumption in the core. Does not include "no action",
+ * "action optional", or "deferred" error severities.
+ * c) Will include a usable address so that immediate action can be taken.
+ * 3) Northbridge DRAM ECC errors
+ * a) Reported in legacy bank 4 with extended error code (XEC) 8.
+ * b) MCA_STATUS[43] is *not* defined as poison in legacy bank 4. Therefore,
+ * this bit should not be checked.
+ * 4) MCI_STATUS_PADDRVAL is set
+ * a) Will provide a valid system physical address.
+ *
+ * NOTE: SMCA UMC memory errors fall into case #1.
+ */
+bool amd_mce_usable_address(struct mce *m)
+{
+ /* Check special northbridge case 3) first. */
+ if (!mce_flags.smca) {
+ if (legacy_mce_is_memory_error(m))
+ return true;
+ else if (m->bank == 4)
+ return false;
+ }
+
+ if (this_cpu_ptr(smca_banks)[m->bank].paddrv)
+ return m->status & MCI_STATUS_PADDRV;
+
+ /* Check poison bit for all other bank types. */
+ if (m->status & MCI_STATUS_POISON)
+ return true;
+
+ /* Assume address is not usable for all others. */
+ return false;
+}
+
+DEFINE_IDTENTRY_SYSVEC(sysvec_deferred_error)
+{
+ trace_deferred_error_apic_entry(DEFERRED_ERROR_VECTOR);
+ inc_irq_stat(irq_deferred_error_count);
+ deferred_error_int_vector();
+ trace_deferred_error_apic_exit(DEFERRED_ERROR_VECTOR);
+ apic_eoi();
+}
+
+/* APIC interrupt handler for deferred errors */
+static void amd_deferred_error_interrupt(void)
+{
+ machine_check_poll(MCP_TIMESTAMP, &this_cpu_ptr(&mce_amd_data)->dfr_intr_banks);
+}
+
+void mce_amd_handle_storm(unsigned int bank, bool on)
+{
+ threshold_restart_bank(bank, on);
+}
+
+static void amd_reset_thr_limit(unsigned int bank)
+{
+ threshold_restart_bank(bank, true);
+}
+
+/*
+ * Threshold interrupt handler will service THRESHOLD_APIC_VECTOR. The interrupt
+ * goes off when error_count reaches threshold_limit.
+ */
+static void amd_threshold_interrupt(void)
+{
+ machine_check_poll(MCP_TIMESTAMP, &this_cpu_ptr(&mce_amd_data)->thr_intr_banks);
+}
+
+void amd_clear_bank(struct mce *m)
+{
+ amd_reset_thr_limit(m->bank);
+
+ /* Clear MCA_DESTAT for all deferred errors even those logged in MCA_STATUS. */
+ if (m->status & MCI_STATUS_DEFERRED)
+ mce_wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank), 0);
+
+ /* Don't clear MCA_STATUS if MCA_DESTAT was used exclusively. */
+ if (m->kflags & MCE_CHECK_DFR_REGS)
+ return;
+
+ mce_wrmsrq(mca_msr_reg(m->bank, MCA_STATUS), 0);
+}
+
+/*
+ * Sysfs Interface
+ */
+
+struct threshold_attr {
+ struct attribute attr;
+ ssize_t (*show) (struct threshold_block *, char *);
+ ssize_t (*store) (struct threshold_block *, const char *, size_t count);
+};
+
+#define SHOW_FIELDS(name) \
+static ssize_t show_ ## name(struct threshold_block *b, char *buf) \
+{ \
+ return sprintf(buf, "%lu\n", (unsigned long) b->name); \
+}
+SHOW_FIELDS(interrupt_enable)
+SHOW_FIELDS(threshold_limit)
+
+static ssize_t
+store_interrupt_enable(struct threshold_block *b, const char *buf, size_t size)
+{
+ struct thresh_restart tr;
+ unsigned long new;
+
+ if (!b->interrupt_capable)
+ return -EINVAL;
+
+ if (kstrtoul(buf, 0, &new) < 0)
+ return -EINVAL;
+
+ b->interrupt_enable = !!new;
+
+ memset(&tr, 0, sizeof(tr));
+ tr.b = b;
+
+ if (smp_call_function_single(b->cpu, threshold_restart_block, &tr, 1))
+ return -ENODEV;
+
+ return size;
+}
+
+static ssize_t
+store_threshold_limit(struct threshold_block *b, const char *buf, size_t size)
+{
+ struct thresh_restart tr;
+ unsigned long new;
+
+ if (kstrtoul(buf, 0, &new) < 0)
+ return -EINVAL;
+
+ if (new > THRESHOLD_MAX)
+ new = THRESHOLD_MAX;
+ if (new < 1)
+ new = 1;
+
+ memset(&tr, 0, sizeof(tr));
+ tr.old_limit = b->threshold_limit;
+ b->threshold_limit = new;
+ tr.b = b;
+
+ if (smp_call_function_single(b->cpu, threshold_restart_block, &tr, 1))
+ return -ENODEV;
+
+ return size;
+}
+
+static ssize_t show_error_count(struct threshold_block *b, char *buf)
+{
+ u32 lo, hi;
+
+ /* CPU might be offline by now */
+ if (rdmsr_on_cpu(b->cpu, b->address, &lo, &hi))
+ return -ENODEV;
+
+ return sprintf(buf, "%u\n", ((hi & THRESHOLD_MAX) -
+ (THRESHOLD_MAX - b->threshold_limit)));
+}
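
The 12-bit counter is biased at (re)start: threshold_restart_block() programs THRESHOLD_MAX - threshold_limit so the interrupt fires after exactly threshold_limit errors, and show_error_count() removes that bias again. A worked round trip:

    #include <assert.h>

    #define THRESHOLD_MAX 0xFFF

    int main(void)
    {
        unsigned int limit = 10;
        unsigned int field = THRESHOLD_MAX - limit;  /* value programmed */

        field += 3;  /* hardware counted three errors since the restart */
        assert(field - (THRESHOLD_MAX - limit) == 3);  /* error_count view */
        /* The APIC interrupt is raised when the field overflows 0xFFF. */
        return 0;
    }
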
+
+static struct threshold_attr error_count = {
+ .attr = {.name = __stringify(error_count), .mode = 0444 },
+ .show = show_error_count,
+};
+
+#define RW_ATTR(val) \
+static struct threshold_attr val = { \
+ .attr = {.name = __stringify(val), .mode = 0644 }, \
+ .show = show_## val, \
+ .store = store_## val, \
+};
+
+RW_ATTR(interrupt_enable);
+RW_ATTR(threshold_limit);
+
+static struct attribute *default_attrs[] = {
+ &threshold_limit.attr,
+ &error_count.attr,
+ NULL, /* possibly interrupt_enable if supported, see below */
+ NULL,
+};
+ATTRIBUTE_GROUPS(default);
+
+#define to_block(k) container_of(k, struct threshold_block, kobj)
+#define to_attr(a) container_of(a, struct threshold_attr, attr)
+
+static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
+{
+ struct threshold_block *b = to_block(kobj);
+ struct threshold_attr *a = to_attr(attr);
+ ssize_t ret;
+
+ ret = a->show ? a->show(b, buf) : -EIO;
+
+ return ret;
+}
+
+static ssize_t store(struct kobject *kobj, struct attribute *attr,
+ const char *buf, size_t count)
+{
+ struct threshold_block *b = to_block(kobj);
+ struct threshold_attr *a = to_attr(attr);
+ ssize_t ret;
+
+ ret = a->store ? a->store(b, buf, count) : -EIO;
+
+ return ret;
+}
+
+static const struct sysfs_ops threshold_ops = {
+ .show = show,
+ .store = store,
+};
+
+static void threshold_block_release(struct kobject *kobj);
+
+static const struct kobj_type threshold_ktype = {
+ .sysfs_ops = &threshold_ops,
+ .default_groups = default_groups,
+ .release = threshold_block_release,
+};
+
+static const char *get_name(unsigned int cpu, unsigned int bank, struct threshold_block *b)
+{
+ enum smca_bank_types bank_type;
+
+ if (!mce_flags.smca) {
+ if (b && bank == 4)
+ return bank4_names(b);
+
+ return th_names[bank];
+ }
+
+ bank_type = smca_get_bank_type(cpu, bank);
+
+ if (b && (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2)) {
+ if (b->block < ARRAY_SIZE(smca_umc_block_names))
+ return smca_umc_block_names[b->block];
+ }
+
+ if (b && b->block) {
+ snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN, "th_block_%u", b->block);
+ return buf_mcatype;
+ }
+
+ if (bank_type >= N_SMCA_BANK_TYPES) {
+ snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN, "th_bank_%u", bank);
+ return buf_mcatype;
+ }
+
+ if (per_cpu(smca_bank_counts, cpu)[bank_type] == 1)
+ return smca_get_name(bank_type);
+
+ snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN,
+ "%s_%u", smca_get_name(bank_type),
+ per_cpu(smca_banks, cpu)[bank].sysfs_id);
+ return buf_mcatype;
+}
+
+static int allocate_threshold_blocks(unsigned int cpu, struct threshold_bank *tb,
+ unsigned int bank, unsigned int block,
+ u32 address)
+{
+ struct threshold_block *b = NULL;
+ u32 low, high;
+ int err;
+
+ if ((bank >= this_cpu_read(mce_num_banks)) || (block >= NR_BLOCKS))
+ return 0;
+
+ if (rdmsr_safe(address, &low, &high))
+ return 0;
+
+ if (!(high & MASK_VALID_HI)) {
+ if (block)
+ goto recurse;
+ else
+ return 0;
+ }
+
+ if (!(high & MASK_CNTP_HI) ||
+ (high & MASK_LOCKED_HI))
+ goto recurse;
+
+ b = kzalloc(sizeof(struct threshold_block), GFP_KERNEL);
+ if (!b)
+ return -ENOMEM;
+
+ b->block = block;
+ b->bank = bank;
+ b->cpu = cpu;
+ b->address = address;
+ b->interrupt_enable = 0;
+ b->interrupt_capable = lvt_interrupt_supported(bank, high);
+ b->threshold_limit = get_thr_limit();
+
+ if (b->interrupt_capable) {
+ default_attrs[2] = &interrupt_enable.attr;
+ b->interrupt_enable = 1;
+ } else {
+ default_attrs[2] = NULL;
+ }
+
+ list_add(&b->miscj, &tb->miscj);
+
+ mce_threshold_block_init(b, (high & MASK_LVTOFF_HI) >> 20);
+
+ err = kobject_init_and_add(&b->kobj, &threshold_ktype, tb->kobj, get_name(cpu, bank, b));
+ if (err)
+ goto out_free;
+recurse:
+ address = get_block_address(address, low, high, bank, ++block, cpu);
+ if (!address)
+ return 0;
+
+ err = allocate_threshold_blocks(cpu, tb, bank, block, address);
+ if (err)
+ goto out_free;
+
+ if (b)
+ kobject_uevent(&b->kobj, KOBJ_ADD);
+
+ return 0;
+
+out_free:
+ if (b) {
+ list_del(&b->miscj);
+ kobject_put(&b->kobj);
+ }
+ return err;
+}
+
+static int threshold_create_bank(struct threshold_bank **bp, unsigned int cpu,
+ unsigned int bank)
+{
+ struct device *dev = this_cpu_read(mce_device);
+ struct threshold_bank *b = NULL;
+ const char *name = get_name(cpu, bank, NULL);
+ int err = 0;
+
+ if (!dev)
+ return -ENODEV;
+
+ b = kzalloc(sizeof(struct threshold_bank), GFP_KERNEL);
+ if (!b) {
+ err = -ENOMEM;
+ goto out;
+ }
+
+ /* Associate the bank with the per-CPU MCE device */
+ b->kobj = kobject_create_and_add(name, &dev->kobj);
+ if (!b->kobj) {
+ err = -EINVAL;
+ goto out_free;
+ }
+
+ INIT_LIST_HEAD(&b->miscj);
+
+ err = allocate_threshold_blocks(cpu, b, bank, 0, mca_msr_reg(bank, MCA_MISC));
+ if (err)
+ goto out_kobj;
+
+ bp[bank] = b;
+ return 0;
+
+out_kobj:
+ kobject_put(b->kobj);
+out_free:
+ kfree(b);
+out:
+ return err;
+}
+
+static void threshold_block_release(struct kobject *kobj)
+{
+ kfree(to_block(kobj));
+}
+
+static void threshold_remove_bank(struct threshold_bank *bank)
+{
+ struct threshold_block *pos, *tmp;
+
+ list_for_each_entry_safe(pos, tmp, &bank->miscj, miscj) {
+ list_del(&pos->miscj);
+ kobject_put(&pos->kobj);
+ }
+
+ kobject_put(bank->kobj);
+ kfree(bank);
+}
+
+static void __threshold_remove_device(struct threshold_bank **bp)
+{
+ unsigned int bank, numbanks = this_cpu_read(mce_num_banks);
+
+ for (bank = 0; bank < numbanks; bank++) {
+ if (!bp[bank])
+ continue;
+
+ threshold_remove_bank(bp[bank]);
+ bp[bank] = NULL;
+ }
+ kfree(bp);
+}
+
+void mce_threshold_remove_device(unsigned int cpu)
+{
+ struct threshold_bank **bp = this_cpu_read(threshold_banks);
+
+ if (!bp)
+ return;
+
+ /*
+ * Clear the pointer before cleaning up, so that the interrupt
+ * handler won't touch any of this.
+ */
+ this_cpu_write(threshold_banks, NULL);
+
+ __threshold_remove_device(bp);
+}
+
+/**
+ * mce_threshold_create_device - Create the per-CPU MCE threshold device
+ * @cpu: The plugged in CPU
+ *
+ * Create directories and files for all valid threshold banks.
+ *
+ * This is invoked from the CPU hotplug callback which was installed in
+ * mcheck_init_device(). The invocation happens in context of the hotplug
+ * thread running on @cpu. The callback is invoked on all CPUs which are
+ * online when the callback is installed or during a real hotplug event.
+ */
+void mce_threshold_create_device(unsigned int cpu)
+{
+ unsigned int numbanks, bank;
+ struct threshold_bank **bp;
+
+ if (!mce_flags.amd_threshold)
+ return;
+
+ bp = this_cpu_read(threshold_banks);
+ if (bp)
+ return;
+
+ numbanks = this_cpu_read(mce_num_banks);
+ bp = kcalloc(numbanks, sizeof(*bp), GFP_KERNEL);
+ if (!bp)
+ return;
+
+ for (bank = 0; bank < numbanks; ++bank) {
+ if (!(this_cpu_read(bank_map) & BIT_ULL(bank)))
+ continue;
+ if (threshold_create_bank(bp, cpu, bank)) {
+ __threshold_remove_device(bp);
+ return;
+ }
+ }
+ this_cpu_write(threshold_banks, bp);
+
+ if (thresholding_irq_en)
+ mce_threshold_vector = amd_threshold_interrupt;
+}
diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
new file mode 100644
index 000000000000..0a89947e47bc
--- /dev/null
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -0,0 +1,265 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Bridge between MCE and APEI
+ *
+ * On some machines, corrected memory errors are reported via the APEI
+ * generic hardware error source (GHES) instead of a corrected Machine
+ * Check. These corrected memory errors can be reported to user space
+ * through /dev/mcelog by faking a corrected Machine Check, so that
+ * the erroring memory page can be offlined by /sbin/mcelog once the
+ * error count for that page exceeds the threshold.
+ *
+ * For a fatal MCE, save the MCE record into persistent storage via ERST,
+ * so that it can be retrieved and logged after reboot.
+ *
+ * Copyright 2010 Intel Corp.
+ * Author: Huang Ying <ying.huang@intel.com>
+ */
+
+#include <linux/export.h>
+#include <linux/kernel.h>
+#include <linux/acpi.h>
+#include <linux/cper.h>
+#include <acpi/apei.h>
+#include <acpi/ghes.h>
+#include <asm/mce.h>
+
+#include "internal.h"
+
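+/*
+ * Convert a CPER memory error section into a synthetic MCE record: the
+ * physical address becomes MCA_ADDR, the address mask (when valid)
+ * determines the recoverable granularity encoded in MCA_MISC, and the
+ * GHES severity is mapped onto the UC/PCC status bits.
+ */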
+void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
+{
+ struct mce_hw_err err;
+ struct mce *m;
+ int lsb;
+
+ if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
+ return;
+
+ /*
+ * Even if ->validation_bits flags the address mask as valid, be
+ * extra safe: reject an error radius of '0' and fall back to the
+ * default page size.
+ */
+ if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
+ lsb = find_first_bit((void *)&mem_err->physical_addr_mask, PAGE_SHIFT);
+ else
+ lsb = PAGE_SHIFT;
+
+ mce_prep_record(&err);
+ m = &err.m;
+ m->bank = -1;
+ /* Fake a memory read error with unknown channel */
+ m->status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f;
+ m->misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
+
+ if (severity >= GHES_SEV_RECOVERABLE)
+ m->status |= MCI_STATUS_UC;
+
+ if (severity >= GHES_SEV_PANIC) {
+ m->status |= MCI_STATUS_PCC;
+ m->tsc = rdtsc();
+ }
+
+ m->addr = mem_err->physical_addr;
+ mce_log(&err);
+}
+EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
+
+int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
+{
+ const u64 *i_mce = ((const u64 *) (ctx_info + 1));
+ unsigned int cpu, num_regs;
+ bool apicid_found = false;
+ struct mce_hw_err err;
+ struct mce *m;
+
+ if (!boot_cpu_has(X86_FEATURE_SMCA))
+ return -EINVAL;
+
+ /*
+ * The starting address of the register array extracted from BERT must
+ * match the first expected register in the register layout of the
+ * SMCA address space. This address corresponds to the bank's MCA_STATUS
+ * register.
+ *
+ * Match any MCi_STATUS register by turning off bank numbers.
+ */
+ if ((ctx_info->msr_addr & MSR_AMD64_SMCA_MC0_STATUS) !=
+ MSR_AMD64_SMCA_MC0_STATUS)
+ return -EINVAL;
+
+ /*
+ * The number of registers in the register array is determined by
+ * Register Array Size/8 as defined in UEFI spec v2.8, sec N.2.4.2.2.
+ * Sanity-check the array size.
+ */
+ num_regs = ctx_info->reg_arr_size >> 3;
+ if (!num_regs)
+ return -EINVAL;
+
+ for_each_possible_cpu(cpu) {
+ if (cpu_data(cpu).topo.initial_apicid == lapic_id) {
+ apicid_found = true;
+ break;
+ }
+ }
+
+ if (!apicid_found)
+ return -EINVAL;
+
+ m = &err.m;
+ memset(&err, 0, sizeof(struct mce_hw_err));
+ mce_prep_record_common(m);
+ mce_prep_record_per_cpu(cpu, m);
+
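+ /*
+ * SMCA MSR space allocates 16 registers (0x10 addresses) per bank,
+ * so bits [11:4] of the MCA_STATUS address give the bank number.
+ */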
+ m->bank = (ctx_info->msr_addr >> 4) & 0xFF;
+
+ /*
+ * The SMCA register layout is fixed and includes 16 registers.
+ * The end of the array may be variable, but the beginning is known.
+ * Cap the number of registers at the expected maximum (15).
+ */
+ if (num_regs > 15)
+ num_regs = 15;
+
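+ /*
+ * Copy the registers from the tail of the array down: each case
+ * handles one register and falls through, so a larger num_regs
+ * implies that all lower-numbered registers are present too.
+ */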
+ switch (num_regs) {
+ /* MCA_SYND2 */
+ case 15:
+ err.vendor.amd.synd2 = *(i_mce + 14);
+ fallthrough;
+ /* MCA_SYND1 */
+ case 14:
+ err.vendor.amd.synd1 = *(i_mce + 13);
+ fallthrough;
+ /* MCA_MISC4 */
+ case 13:
+ /* MCA_MISC3 */
+ case 12:
+ /* MCA_MISC2 */
+ case 11:
+ /* MCA_MISC1 */
+ case 10:
+ /* MCA_DEADDR */
+ case 9:
+ /* MCA_DESTAT */
+ case 8:
+ /* reserved */
+ case 7:
+ /* MCA_SYND */
+ case 6:
+ m->synd = *(i_mce + 5);
+ fallthrough;
+ /* MCA_IPID */
+ case 5:
+ m->ipid = *(i_mce + 4);
+ fallthrough;
+ /* MCA_CONFIG */
+ case 4:
+ /* MCA_MISC0 */
+ case 3:
+ m->misc = *(i_mce + 2);
+ fallthrough;
+ /* MCA_ADDR */
+ case 2:
+ m->addr = *(i_mce + 1);
+ fallthrough;
+ /* MCA_STATUS */
+ case 1:
+ m->status = *i_mce;
+ }
+
+ mce_log(&err);
+
+ return 0;
+}
+
+#define CPER_CREATOR_MCE \
+ GUID_INIT(0x75a574e3, 0x5052, 0x4b29, 0x8a, 0x8e, 0xbe, 0x2c, \
+ 0x64, 0x90, 0xb8, 0x9d)
+#define CPER_SECTION_TYPE_MCE \
+ GUID_INIT(0xfe08ffbe, 0x95e4, 0x4be7, 0xbc, 0x73, 0x40, 0x96, \
+ 0x04, 0x4a, 0x38, 0xfc)
+
+/*
+ * The CPER specification (UEFI specification 2.3, appendix N) requires
+ * the record to be byte-packed.
+ */
+struct cper_mce_record {
+ struct cper_record_header hdr;
+ struct cper_section_descriptor sec_hdr;
+ struct mce mce;
+} __packed;
+
+int apei_write_mce(struct mce *m)
+{
+ struct cper_mce_record rcd;
+
+ memset(&rcd, 0, sizeof(rcd));
+ memcpy(rcd.hdr.signature, CPER_SIG_RECORD, CPER_SIG_SIZE);
+ rcd.hdr.revision = CPER_RECORD_REV;
+ rcd.hdr.signature_end = CPER_SIG_END;
+ rcd.hdr.section_count = 1;
+ rcd.hdr.error_severity = CPER_SEV_FATAL;
+ /* timestamp, platform_id, partition_id are all invalid */
+ rcd.hdr.validation_bits = 0;
+ rcd.hdr.record_length = sizeof(rcd);
+ rcd.hdr.creator_id = CPER_CREATOR_MCE;
+ rcd.hdr.notification_type = CPER_NOTIFY_MCE;
+ rcd.hdr.record_id = cper_next_record_id();
+ rcd.hdr.flags = CPER_HW_ERROR_FLAGS_PREVERR;
+
+ rcd.sec_hdr.section_offset = (void *)&rcd.mce - (void *)&rcd;
+ rcd.sec_hdr.section_length = sizeof(rcd.mce);
+ rcd.sec_hdr.revision = CPER_SEC_REV;
+ /* fru_id and fru_text are invalid */
+ rcd.sec_hdr.validation_bits = 0;
+ rcd.sec_hdr.flags = CPER_SEC_PRIMARY;
+ rcd.sec_hdr.section_type = CPER_SECTION_TYPE_MCE;
+ rcd.sec_hdr.section_severity = CPER_SEV_FATAL;
+
+ memcpy(&rcd.mce, m, sizeof(*m));
+
+ return erst_write(&rcd.hdr);
+}
+
+ssize_t apei_read_mce(struct mce *m, u64 *record_id)
+{
+ struct cper_mce_record rcd;
+ int rc, pos;
+
+ rc = erst_get_record_id_begin(&pos);
+ if (rc)
+ return rc;
+retry:
+ rc = erst_get_record_id_next(&pos, record_id);
+ if (rc)
+ goto out;
+ /* no more records */
+ if (*record_id == APEI_ERST_INVALID_RECORD_ID)
+ goto out;
+ rc = erst_read_record(*record_id, &rcd.hdr, sizeof(rcd), sizeof(rcd),
+ &CPER_CREATOR_MCE);
+ /* someone else has cleared the record, try the next one */
+ if (rc == -ENOENT)
+ goto retry;
+ else if (rc < 0)
+ goto out;
+
+ memcpy(m, &rcd.mce, sizeof(*m));
+ rc = sizeof(*m);
+out:
+ erst_get_record_id_end();
+
+ return rc;
+}
+
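+/*
+ * Typical consumer flow (sketch): if apei_check_mce() reports pending
+ * records, drain them with repeated apei_read_mce() calls and hand each
+ * consumed record_id to apei_clear_mce() so its ERST slot is reclaimed.
+ */
+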
+/* Check whether there are records in ERST */
+int apei_check_mce(void)
+{
+ return erst_get_record_count();
+}
+
+int apei_clear_mce(u64 record_id)
+{
+ return erst_clear(record_id);
+}
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
new file mode 100644
index 000000000000..34440021e8cf
--- /dev/null
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -0,0 +1,2970 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Machine check handler.
+ *
+ * K8 parts Copyright 2002,2003 Andi Kleen, SuSE Labs.
+ * Rest from unknown author(s).
+ * 2004 Andi Kleen. Rewrote most of it.
+ * Copyright 2008 Intel Corporation
+ * Author: Andi Kleen
+ */
+
+#include <linux/thread_info.h>
+#include <linux/capability.h>
+#include <linux/miscdevice.h>
+#include <linux/ratelimit.h>
+#include <linux/rcupdate.h>
+#include <linux/kobject.h>
+#include <linux/uaccess.h>
+#include <linux/kdebug.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/string.h>
+#include <linux/device.h>
+#include <linux/syscore_ops.h>
+#include <linux/delay.h>
+#include <linux/ctype.h>
+#include <linux/sched.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/kmod.h>
+#include <linux/poll.h>
+#include <linux/nmi.h>
+#include <linux/cpu.h>
+#include <linux/ras.h>
+#include <linux/smp.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/debugfs.h>
+#include <linux/irq_work.h>
+#include <linux/export.h>
+#include <linux/set_memory.h>
+#include <linux/sync_core.h>
+#include <linux/task_work.h>
+#include <linux/hardirq.h>
+#include <linux/kexec.h>
+#include <linux/vmcore_info.h>
+
+#include <asm/fred.h>
+#include <asm/cpu_device_id.h>
+#include <asm/processor.h>
+#include <asm/traps.h>
+#include <asm/tlbflush.h>
+#include <asm/mce.h>
+#include <asm/msr.h>
+#include <asm/reboot.h>
+#include <asm/tdx.h>
+
+#include "internal.h"
+
+/* sysfs synchronization */
+static DEFINE_MUTEX(mce_sysfs_mutex);
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/mce.h>
+
+#define SPINUNIT 100 /* 100ns */
+
+DEFINE_PER_CPU(unsigned, mce_exception_count);
+
+DEFINE_PER_CPU_READ_MOSTLY(unsigned int, mce_num_banks);
+
+DEFINE_PER_CPU_READ_MOSTLY(struct mce_bank[MAX_NR_BANKS], mce_banks_array);
+
+#define ATTR_LEN 16
+/* One object for each MCE bank, shared by all CPUs */
+struct mce_bank_dev {
+ struct device_attribute attr; /* device attribute */
+ char attrname[ATTR_LEN]; /* attribute name */
+ u8 bank; /* bank number */
+};
+static struct mce_bank_dev mce_bank_devs[MAX_NR_BANKS];
+
+struct mce_vendor_flags mce_flags __read_mostly;
+
+struct mca_config mca_cfg __read_mostly = {
+ .bootlog = -1,
+ .monarch_timeout = -1
+};
+
+static DEFINE_PER_CPU(struct mce_hw_err, hw_errs_seen);
+static unsigned long mce_need_notify;
+
+/*
+ * MCA banks polled by the periodic polling timer for corrected events.
+ * With Intel CMCI, this only has MCA banks which do not support CMCI (if any).
+ */
+DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
+ [0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
+};
+
+/*
+ * MCA banks controlled through firmware first for corrected errors.
+ * This is a global list of banks for which we won't enable CMCI and we
+ * won't poll. Firmware controls these banks and is responsible for
+ * reporting corrected errors through GHES. Uncorrected/recoverable
+ * errors are still notified through a machine check.
+ */
+mce_banks_t mce_banks_ce_disabled;
+
+static struct work_struct mce_work;
+static struct irq_work mce_irq_work;
+
+/*
+ * CPU/chipset specific EDAC code can register a notifier call here to print
+ * MCE errors in a human-readable form.
+ */
+BLOCKING_NOTIFIER_HEAD(x86_mce_decoder_chain);
+
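+/* Fields common to all records: CPUID(1), vendor, MCG_CAP and wall time. */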
+void mce_prep_record_common(struct mce *m)
+{
+ m->cpuid = cpuid_eax(1);
+ m->cpuvendor = boot_cpu_data.x86_vendor;
+ m->mcgcap = native_rdmsrq(MSR_IA32_MCG_CAP);
+ /* need the internal __ version to avoid deadlocks */
+ m->time = __ktime_get_real_seconds();
+}
+
+void mce_prep_record_per_cpu(unsigned int cpu, struct mce *m)
+{
+ m->cpu = cpu;
+ m->extcpu = cpu;
+ m->apicid = cpu_data(cpu).topo.initial_apicid;
+ m->microcode = cpu_data(cpu).microcode;
+ m->ppin = topology_ppin(cpu);
+ m->socketid = topology_physical_package_id(cpu);
+}
+
+/* Do initial initialization of struct mce_hw_err */
+void mce_prep_record(struct mce_hw_err *err)
+{
+ struct mce *m = &err->m;
+
+ memset(err, 0, sizeof(struct mce_hw_err));
+ mce_prep_record_common(m);
+ mce_prep_record_per_cpu(smp_processor_id(), m);
+}
+
+DEFINE_PER_CPU(struct mce, injectm);
+EXPORT_PER_CPU_SYMBOL_GPL(injectm);
+
+void mce_log(struct mce_hw_err *err)
+{
+ if (mce_gen_pool_add(err))
+ irq_work_queue(&mce_irq_work);
+}
+EXPORT_SYMBOL_GPL(mce_log);
+
+void mce_register_decode_chain(struct notifier_block *nb)
+{
+ if (WARN_ON(nb->priority < MCE_PRIO_LOWEST ||
+ nb->priority > MCE_PRIO_HIGHEST))
+ return;
+
+ blocking_notifier_chain_register(&x86_mce_decoder_chain, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_decode_chain);
+
+void mce_unregister_decode_chain(struct notifier_block *nb)
+{
+ blocking_notifier_chain_unregister(&x86_mce_decoder_chain, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_decode_chain);
+
+static void __print_mce(struct mce_hw_err *err)
+{
+ struct mce *m = &err->m;
+
+ pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
+ m->extcpu,
+ (m->mcgstatus & MCG_STATUS_MCIP ? " Exception" : ""),
+ m->mcgstatus, m->bank, m->status);
+
+ if (m->ip) {
+ pr_emerg(HW_ERR "RIP%s %02x:<%016Lx> ",
+ !(m->mcgstatus & MCG_STATUS_EIPV) ? " !INEXACT!" : "",
+ m->cs, m->ip);
+
+ if (m->cs == __KERNEL_CS)
+ pr_cont("{%pS}", (void *)(unsigned long)m->ip);
+ pr_cont("\n");
+ }
+
+ pr_emerg(HW_ERR "TSC %llx ", m->tsc);
+ if (m->addr)
+ pr_cont("ADDR %llx ", m->addr);
+ if (m->misc)
+ pr_cont("MISC %llx ", m->misc);
+ if (m->ppin)
+ pr_cont("PPIN %llx ", m->ppin);
+
+ if (mce_flags.smca) {
+ if (m->synd)
+ pr_cont("SYND %llx ", m->synd);
+ if (err->vendor.amd.synd1)
+ pr_cont("SYND1 %llx ", err->vendor.amd.synd1);
+ if (err->vendor.amd.synd2)
+ pr_cont("SYND2 %llx ", err->vendor.amd.synd2);
+ if (m->ipid)
+ pr_cont("IPID %llx ", m->ipid);
+ }
+
+ pr_cont("\n");
+
+ /*
+ * Note this output is parsed by external tools and old fields
+ * should not be changed.
+ */
+ pr_emerg(HW_ERR "PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x microcode %x\n",
+ m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid,
+ m->microcode);
+}
+
+static void print_mce(struct mce_hw_err *err)
+{
+ struct mce *m = &err->m;
+
+ __print_mce(err);
+
+ if (m->cpuvendor != X86_VENDOR_AMD && m->cpuvendor != X86_VENDOR_HYGON)
+ pr_emerg_ratelimited(HW_ERR "Run the above through 'mcelog --ascii'\n");
+}
+
+#define PANIC_TIMEOUT 5 /* 5 seconds */
+
+static atomic_t mce_panicked;
+
+static int fake_panic;
+static atomic_t mce_fake_panicked;
+
+/* Panic in progress. Enable interrupts and wait for final IPI */
+static void wait_for_panic(void)
+{
+ long timeout = PANIC_TIMEOUT*USEC_PER_SEC;
+
+ preempt_disable();
+ local_irq_enable();
+ while (timeout-- > 0)
+ udelay(1);
+ if (panic_timeout == 0)
+ panic_timeout = mca_cfg.panic_timeout;
+ panic("Panicing machine check CPU died");
+}
+
+static const char *mce_dump_aux_info(struct mce *m)
+{
+ if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+ return tdx_dump_mce_info(m);
+
+ return NULL;
+}
+
+static noinstr void mce_panic(const char *msg, struct mce_hw_err *final, char *exp)
+{
+ struct llist_node *pending;
+ struct mce_evt_llist *l;
+ int apei_err = 0;
+ const char *memmsg;
+
+ /*
+ * Allow instrumentation around external facilities usage. Not that it
+ * matters a whole lot since the machine is going to panic anyway.
+ */
+ instrumentation_begin();
+
+ if (!fake_panic) {
+ /*
+ * Make sure only one CPU runs in machine check panic
+ */
+ if (atomic_inc_return(&mce_panicked) > 1)
+ wait_for_panic();
+ barrier();
+
+ bust_spinlocks(1);
+ console_verbose();
+ } else {
+ /* Don't log too much for fake panic */
+ if (atomic_inc_return(&mce_fake_panicked) > 1)
+ goto out;
+ }
+ pending = mce_gen_pool_prepare_records();
+ /* First print corrected ones that are still unlogged */
+ llist_for_each_entry(l, pending, llnode) {
+ struct mce_hw_err *err = &l->err;
+ struct mce *m = &err->m;
+ if (!(m->status & MCI_STATUS_UC)) {
+ print_mce(err);
+ if (!apei_err)
+ apei_err = apei_write_mce(m);
+ }
+ }
+ /* Now print uncorrected but with the final one last */
+ llist_for_each_entry(l, pending, llnode) {
+ struct mce_hw_err *err = &l->err;
+ struct mce *m = &err->m;
+ if (!(m->status & MCI_STATUS_UC))
+ continue;
+ if (!final || mce_cmp(m, &final->m)) {
+ print_mce(err);
+ if (!apei_err)
+ apei_err = apei_write_mce(m);
+ }
+ }
+ if (final) {
+ print_mce(final);
+ if (!apei_err)
+ apei_err = apei_write_mce(&final->m);
+ }
+ if (exp)
+ pr_emerg(HW_ERR "Machine check: %s\n", exp);
+
+ memmsg = mce_dump_aux_info(&final->m);
+ if (memmsg)
+ pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
+
+ if (!fake_panic) {
+ if (panic_timeout == 0)
+ panic_timeout = mca_cfg.panic_timeout;
+
+ /*
+ * Kdump skips the poisoned page in order to avoid
+ * touching the error bits again. Poison the page even
+ * if the error is fatal and the machine is about to
+ * panic.
+ */
+ if (kexec_crash_loaded()) {
+ if (final && (final->m.status & MCI_STATUS_ADDRV)) {
+ struct page *p;
+ p = pfn_to_online_page(final->m.addr >> PAGE_SHIFT);
+ if (p)
+ SetPageHWPoison(p);
+ }
+ }
+ panic(msg);
+ } else
+ pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
+
+out:
+ instrumentation_end();
+}
+
+/* Support code for software error injection */
+
+static int msr_to_offset(u32 msr)
+{
+ unsigned bank = __this_cpu_read(injectm.bank);
+
+ if (msr == mca_cfg.rip_msr)
+ return offsetof(struct mce, ip);
+ if (msr == mca_msr_reg(bank, MCA_STATUS))
+ return offsetof(struct mce, status);
+ if (msr == mca_msr_reg(bank, MCA_ADDR))
+ return offsetof(struct mce, addr);
+ if (msr == mca_msr_reg(bank, MCA_MISC))
+ return offsetof(struct mce, misc);
+ if (msr == MSR_IA32_MCG_STATUS)
+ return offsetof(struct mce, mcgstatus);
+ return -1;
+}
+
+void ex_handler_msr_mce(struct pt_regs *regs, bool wrmsr)
+{
+ if (wrmsr) {
+ pr_emerg("MSR access error: WRMSR to 0x%x (tried to write 0x%08x%08x) at rIP: 0x%lx (%pS)\n",
+ (unsigned int)regs->cx, (unsigned int)regs->dx, (unsigned int)regs->ax,
+ regs->ip, (void *)regs->ip);
+ } else {
+ pr_emerg("MSR access error: RDMSR from 0x%x at rIP: 0x%lx (%pS)\n",
+ (unsigned int)regs->cx, regs->ip, (void *)regs->ip);
+ }
+
+ show_stack_regs(regs);
+
+ panic("MCA architectural violation!\n");
+
+ while (true)
+ cpu_relax();
+}
+
+/* MSR access wrappers used for error injection */
+noinstr u64 mce_rdmsrq(u32 msr)
+{
+ EAX_EDX_DECLARE_ARGS(val, low, high);
+
+ if (__this_cpu_read(injectm.finished)) {
+ int offset;
+ u64 ret;
+
+ instrumentation_begin();
+
+ offset = msr_to_offset(msr);
+ if (offset < 0)
+ ret = 0;
+ else
+ ret = *(u64 *)((char *)this_cpu_ptr(&injectm) + offset);
+
+ instrumentation_end();
+
+ return ret;
+ }
+
+ /*
+ * RDMSR on MCA MSRs should not fault. If they do, this is very much an
+ * architectural violation and needs to be reported to the hardware
+ * vendor. Panic the box to not allow any further progress.
+ */
+ asm volatile("1: rdmsr\n"
+ "2:\n"
+ _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_RDMSR_IN_MCE)
+ : EAX_EDX_RET(val, low, high) : "c" (msr));
+
+ return EAX_EDX_VAL(val, low, high);
+}
+
+noinstr void mce_wrmsrq(u32 msr, u64 v)
+{
+ u32 low, high;
+
+ if (__this_cpu_read(injectm.finished)) {
+ int offset;
+
+ instrumentation_begin();
+
+ offset = msr_to_offset(msr);
+ if (offset >= 0)
+ *(u64 *)((char *)this_cpu_ptr(&injectm) + offset) = v;
+
+ instrumentation_end();
+
+ return;
+ }
+
+ low = (u32)v;
+ high = (u32)(v >> 32);
+
+ /* See comment in mce_rdmsrq() */
+ asm volatile("1: wrmsr\n"
+ "2:\n"
+ _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_WRMSR_IN_MCE)
+ : : "c" (msr), "a"(low), "d" (high) : "memory");
+}
+
+/*
+ * Collect all global (w.r.t. this processor) status about this machine
+ * check into our "mce" struct so that we can use it later to assess
+ * the severity of the problem as we read per-bank specific details.
+ */
+static noinstr void mce_gather_info(struct mce_hw_err *err, struct pt_regs *regs)
+{
+ struct mce *m;
+ /*
+ * Enable instrumentation around mce_prep_record() which calls external
+ * facilities.
+ */
+ instrumentation_begin();
+ mce_prep_record(err);
+ instrumentation_end();
+
+ m = &err->m;
+ m->mcgstatus = mce_rdmsrq(MSR_IA32_MCG_STATUS);
+ if (regs) {
+ /*
+ * Get the address of the instruction at the time of
+ * the machine check error.
+ */
+ if (m->mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) {
+ m->ip = regs->ip;
+ m->cs = regs->cs;
+
+ /*
+ * When in VM86 mode make the cs look like ring 3
+ * always. This is a lie, but it's better than passing
+ * the additional vm86 bit around everywhere.
+ */
+ if (v8086_mode(regs))
+ m->cs |= 3;
+ }
+ /* Use accurate RIP reporting if available. */
+ if (mca_cfg.rip_msr)
+ m->ip = mce_rdmsrq(mca_cfg.rip_msr);
+ }
+}
+
+bool mce_available(struct cpuinfo_x86 *c)
+{
+ if (mca_cfg.disabled)
+ return false;
+ return cpu_has(c, X86_FEATURE_MCE) && cpu_has(c, X86_FEATURE_MCA);
+}
+
+static void mce_schedule_work(void)
+{
+ if (!mce_gen_pool_empty())
+ schedule_work(&mce_work);
+}
+
+static void mce_irq_work_cb(struct irq_work *entry)
+{
+ mce_schedule_work();
+}
+
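+/*
+ * Decide whether m->addr holds a usable physical address: MCA_STATUS must
+ * flag the address as valid, and vendor code may apply further checks.
+ */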
+bool mce_usable_address(struct mce *m)
+{
+ if (!(m->status & MCI_STATUS_ADDRV))
+ return false;
+
+ switch (m->cpuvendor) {
+ case X86_VENDOR_AMD:
+ return amd_mce_usable_address(m);
+
+ case X86_VENDOR_INTEL:
+ case X86_VENDOR_ZHAOXIN:
+ return intel_mce_usable_address(m);
+
+ default:
+ return true;
+ }
+}
+EXPORT_SYMBOL_GPL(mce_usable_address);
+
+bool mce_is_memory_error(struct mce *m)
+{
+ switch (m->cpuvendor) {
+ case X86_VENDOR_AMD:
+ case X86_VENDOR_HYGON:
+ return amd_mce_is_memory_error(m);
+
+ case X86_VENDOR_INTEL:
+ case X86_VENDOR_ZHAOXIN:
+ /*
+ * Intel SDM Volume 3B - 15.9.2 Compound Error Codes
+ *
+ * Bit 7 of the MCACOD field of IA32_MCi_STATUS is used for
+ * indicating a memory error. Bit 8 is used for indicating a
+ * cache hierarchy error. The combination of bit 2 and bit 3
+ * is used for indicating a `generic' cache hierarchy error.
+ * But we can't just blindly check the above bits, because if
+ * bit 11 is set, then it is a bus/interconnect error - and
+ * either way the above bits just give more detail on what
+ * bus/interconnect error happened. Note that bit 12 can be
+ * ignored, as it's the "filter" bit.
+ */
+ return (m->status & 0xef80) == BIT(7) ||
+ (m->status & 0xef00) == BIT(8) ||
+ (m->status & 0xeffc) == 0xc;
+
+ default:
+ return false;
+ }
+}
+EXPORT_SYMBOL_GPL(mce_is_memory_error);
+
+static bool whole_page(struct mce *m)
+{
+ if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+ return true;
+
+ return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
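+/* Deferred errors on AMD/Hygon and any UC error are not correctable. */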
+bool mce_is_correctable(struct mce *m)
+{
+ if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
+ return false;
+
+ if (m->cpuvendor == X86_VENDOR_HYGON && m->status & MCI_STATUS_DEFERRED)
+ return false;
+
+ if (m->status & MCI_STATUS_UC)
+ return false;
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(mce_is_correctable);
+
+/*
+ * Notify the user(s) about new machine check events.
+ * Can be called from interrupt context, but not from machine check/NMI
+ * context.
+ */
+static bool mce_notify_irq(void)
+{
+ /* Not more than two messages every minute */
+ static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
+
+ if (test_and_clear_bit(0, &mce_need_notify)) {
+ mce_work_trigger();
+
+ if (__ratelimit(&ratelimit))
+ pr_info(HW_ERR "Machine check events logged\n");
+
+ return true;
+ }
+
+ return false;
+}
+
+static int mce_early_notifier(struct notifier_block *nb, unsigned long val,
+ void *data)
+{
+ struct mce_hw_err *err = to_mce_hw_err(data);
+
+ if (!err)
+ return NOTIFY_DONE;
+
+ /* Emit the trace record: */
+ trace_mce_record(err);
+
+ set_bit(0, &mce_need_notify);
+
+ mce_notify_irq();
+
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block early_nb = {
+ .notifier_call = mce_early_notifier,
+ .priority = MCE_PRIO_EARLY,
+};
+
+static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,
+ void *data)
+{
+ struct mce *mce = (struct mce *)data;
+ unsigned long pfn;
+
+ if (!mce || !mce_usable_address(mce))
+ return NOTIFY_DONE;
+
+ if (mce->severity != MCE_AO_SEVERITY &&
+ mce->severity != MCE_DEFERRED_SEVERITY)
+ return NOTIFY_DONE;
+
+ pfn = (mce->addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
+ if (!memory_failure(pfn, 0)) {
+ set_mce_nospec(pfn);
+ mce->kflags |= MCE_HANDLED_UC;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block mce_uc_nb = {
+ .notifier_call = uc_decode_notifier,
+ .priority = MCE_PRIO_UC,
+};
+
+static int mce_default_notifier(struct notifier_block *nb, unsigned long val,
+ void *data)
+{
+ struct mce_hw_err *err = to_mce_hw_err(data);
+
+ if (!err)
+ return NOTIFY_DONE;
+
+ if (mca_cfg.print_all || !(err->m.kflags))
+ __print_mce(err);
+
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block mce_default_nb = {
+ .notifier_call = mce_default_notifier,
+ /* lowest prio, we want it to run last. */
+ .priority = MCE_PRIO_LOWEST,
+};
+
+/*
+ * Read ADDR and MISC registers.
+ */
+static noinstr void mce_read_aux(struct mce_hw_err *err, int i)
+{
+ struct mce *m = &err->m;
+
+ if (m->status & MCI_STATUS_MISCV)
+ m->misc = mce_rdmsrq(mca_msr_reg(i, MCA_MISC));
+
+ if (m->status & MCI_STATUS_ADDRV) {
+ if (m->kflags & MCE_CHECK_DFR_REGS)
+ m->addr = mce_rdmsrq(MSR_AMD64_SMCA_MCx_DEADDR(i));
+ else
+ m->addr = mce_rdmsrq(mca_msr_reg(i, MCA_ADDR));
+
+ /*
+ * Mask the reported address by the reported granularity.
+ */
+ if (mca_cfg.ser && (m->status & MCI_STATUS_MISCV)) {
+ u8 shift = MCI_MISC_ADDR_LSB(m->misc);
+ m->addr >>= shift;
+ m->addr <<= shift;
+ }
+
+ smca_extract_err_addr(m);
+ }
+
+ if (mce_flags.smca) {
+ m->ipid = mce_rdmsrq(MSR_AMD64_SMCA_MCx_IPID(i));
+
+ if (m->status & MCI_STATUS_SYNDV) {
+ m->synd = mce_rdmsrq(MSR_AMD64_SMCA_MCx_SYND(i));
+ err->vendor.amd.synd1 = mce_rdmsrq(MSR_AMD64_SMCA_MCx_SYND1(i));
+ err->vendor.amd.synd2 = mce_rdmsrq(MSR_AMD64_SMCA_MCx_SYND2(i));
+ }
+ }
+}
+
+DEFINE_PER_CPU(unsigned, mce_poll_count);
+
+/*
+ * We have three scenarios for checking for Deferred errors:
+ *
+ * 1) Non-SMCA systems check MCA_STATUS and log error if found.
+ * 2) SMCA systems check MCA_STATUS. If error is found then log it and also
+ * clear MCA_DESTAT.
+ * 3) SMCA systems check MCA_DESTAT if no error was found in MCA_STATUS,
+ *    and log it.
+ */
+static bool smca_should_log_poll_error(struct mce *m)
+{
+ if (m->status & MCI_STATUS_VAL)
+ return true;
+
+ m->status = mce_rdmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank));
+ if ((m->status & MCI_STATUS_VAL) && (m->status & MCI_STATUS_DEFERRED)) {
+ m->kflags |= MCE_CHECK_DFR_REGS;
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Newer Intel systems that support software error
+ * recovery need to make additional checks. Other
+ * CPUs should skip over uncorrected errors, but log
+ * everything else.
+ */
+static bool ser_should_log_poll_error(struct mce *m)
+{
+ /* Log "not enabled" (speculative) errors */
+ if (!(m->status & MCI_STATUS_EN))
+ return true;
+
+ /*
+ * Log UCNA (SDM: 15.6.3 "UCR Error Classification")
+ * UC == 1 && PCC == 0 && S == 0
+ */
+ if (!(m->status & MCI_STATUS_PCC) && !(m->status & MCI_STATUS_S))
+ return true;
+
+ return false;
+}
+
+static bool should_log_poll_error(enum mcp_flags flags, struct mce_hw_err *err)
+{
+ struct mce *m = &err->m;
+
+ if (mce_flags.smca)
+ return smca_should_log_poll_error(m);
+
+ /* If this entry is not valid, ignore it. */
+ if (!(m->status & MCI_STATUS_VAL))
+ return false;
+
+ /*
+ * If we are logging everything (at CPU online) or this
+ * is a corrected error, then we must log it.
+ */
+ if ((flags & MCP_UC) || !(m->status & MCI_STATUS_UC))
+ return true;
+
+ if (mca_cfg.ser)
+ return ser_should_log_poll_error(m);
+
+ if (m->status & MCI_STATUS_UC)
+ return false;
+
+ return true;
+}
+
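+/* Vendor hook: AMD may need to clear more than MCA_STATUS (e.g. MCA_DESTAT). */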
+static void clear_bank(struct mce *m)
+{
+ if (m->cpuvendor == X86_VENDOR_AMD)
+ return amd_clear_bank(m);
+
+ mce_wrmsrq(mca_msr_reg(m->bank, MCA_STATUS), 0);
+}
+
+/*
+ * Poll for corrected events or events that happened before reset.
+ * Those are just logged through /dev/mcelog.
+ *
+ * This is executed in standard interrupt context.
+ *
+ * Note: the spec recommends panicking for fatal unsignalled
+ * errors here. However, this would be quite problematic --
+ * we would need to reimplement the Monarch handling and
+ * it would mess up the exclusion between the exception handler
+ * and the poll handler -- so we skip this for now.
+ * These cases should not happen anyway, or only when the CPU
+ * is already totally confused. In that case it is likely the CPU
+ * will not fully execute the machine check handler either.
+ */
+void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
+{
+ struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
+ struct mce_hw_err err;
+ struct mce *m;
+ int i;
+
+ this_cpu_inc(mce_poll_count);
+
+ mce_gather_info(&err, NULL);
+ m = &err.m;
+
+ if (flags & MCP_TIMESTAMP)
+ m->tsc = rdtsc();
+
+ for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
+ if (!mce_banks[i].ctl || !test_bit(i, *b))
+ continue;
+
+ m->misc = 0;
+ m->addr = 0;
+ m->bank = i;
+
+ barrier();
+ m->status = mce_rdmsrq(mca_msr_reg(i, MCA_STATUS));
+
+ /*
+ * Update storm tracking here, before checking for the
+ * MCI_STATUS_VAL bit. Valid corrected errors count
+ * towards declaring, or maintaining, storm status. No
+ * error in a bank counts towards avoiding, or ending,
+ * storm status.
+ */
+ if (!mca_cfg.cmci_disabled)
+ mce_track_storm(m);
+
+ /* Verify that the error should be logged based on hardware conditions. */
+ if (!should_log_poll_error(flags, &err))
+ continue;
+
+ mce_read_aux(&err, i);
+ m->severity = mce_severity(m, NULL, NULL, false);
+ /*
+ * Don't get the IP here because it's unlikely to
+ * have anything to do with the actual error location.
+ */
+
+ if (mca_cfg.dont_log_ce && !mce_usable_address(m))
+ goto clear_it;
+
+ if (flags & MCP_QUEUE_LOG)
+ mce_gen_pool_add(&err);
+ else
+ mce_log(&err);
+
+clear_it:
+ clear_bank(m);
+ }
+
+ /*
+ * Don't clear MCG_STATUS here because it's only defined for
+ * exceptions.
+ */
+
+ sync_core();
+}
+EXPORT_SYMBOL_GPL(machine_check_poll);
+
+/*
+ * During IFU recovery, Sandy Bridge-EP 4S processors set the RIPV and
+ * EIPV bits in MCG_STATUS to zero on the affected logical processor (SDM
+ * Vol 3B Table 15-20). But this confuses both the code that determines
+ * whether the machine check occurred in kernel or user mode, and also
+ * the severity assessment code. Pretend that EIPV was set, and take the
+ * ip/cs values from the pt_regs that mce_gather_info() ignored earlier.
+ */
+static __always_inline void
+quirk_sandybridge_ifu(int bank, struct mce *m, struct pt_regs *regs)
+{
+ if (bank != 0)
+ return;
+ if ((m->mcgstatus & (MCG_STATUS_EIPV|MCG_STATUS_RIPV)) != 0)
+ return;
+ if ((m->status & (MCI_STATUS_OVER|MCI_STATUS_UC|
+ MCI_STATUS_EN|MCI_STATUS_MISCV|MCI_STATUS_ADDRV|
+ MCI_STATUS_PCC|MCI_STATUS_S|MCI_STATUS_AR|
+ MCACOD)) !=
+ (MCI_STATUS_UC|MCI_STATUS_EN|
+ MCI_STATUS_MISCV|MCI_STATUS_ADDRV|MCI_STATUS_S|
+ MCI_STATUS_AR|MCACOD_INSTR))
+ return;
+
+ m->mcgstatus |= MCG_STATUS_EIPV;
+ m->ip = regs->ip;
+ m->cs = regs->cs;
+}
+
+/*
+ * Disable fast string copy and return from the MCE handler upon the first SRAR
+ * MCE on bank 1 due to a CPU erratum on Intel Skylake/Cascade Lake/Cooper Lake
+ * CPUs.
+ * The fast string copy instructions ("REP; MOVS*") could consume an
+ * uncorrectable memory error in the cache line _right after_ the desired region
+ * to copy and raise an MCE with RIP pointing to the instruction _after_ the
+ * "REP; MOVS*".
+ * This mitigation addresses the issue completely, with the caveat of performance
+ * degradation on the affected CPU. This is still better than the OS crashing on
+ * MCEs raised on an irrelevant process due to "REP; MOVS*" accesses from a
+ * kernel context (e.g., copy_page).
+ *
+ * Returns true when fast string copy on CPU has been disabled.
+ */
+static noinstr bool quirk_skylake_repmov(void)
+{
+ u64 mcgstatus = mce_rdmsrq(MSR_IA32_MCG_STATUS);
+ u64 misc_enable = mce_rdmsrq(MSR_IA32_MISC_ENABLE);
+ u64 mc1_status;
+
+ /*
+ * Apply the quirk only to local machine checks, i.e., no broadcast
+ * sync is needed.
+ */
+ if (!(mcgstatus & MCG_STATUS_LMCES) ||
+ !(misc_enable & MSR_IA32_MISC_ENABLE_FAST_STRING))
+ return false;
+
+ mc1_status = mce_rdmsrq(MSR_IA32_MCx_STATUS(1));
+
+ /* Check for a software-recoverable data fetch error. */
+ if ((mc1_status &
+ (MCI_STATUS_VAL | MCI_STATUS_OVER | MCI_STATUS_UC | MCI_STATUS_EN |
+ MCI_STATUS_ADDRV | MCI_STATUS_MISCV | MCI_STATUS_PCC |
+ MCI_STATUS_AR | MCI_STATUS_S)) ==
+ (MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN |
+ MCI_STATUS_ADDRV | MCI_STATUS_MISCV |
+ MCI_STATUS_AR | MCI_STATUS_S)) {
+ misc_enable &= ~MSR_IA32_MISC_ENABLE_FAST_STRING;
+ mce_wrmsrq(MSR_IA32_MISC_ENABLE, misc_enable);
+ mce_wrmsrq(MSR_IA32_MCx_STATUS(1), 0);
+
+ instrumentation_begin();
+ pr_err_once("Erratum detected, disable fast string copy instructions.\n");
+ instrumentation_end();
+
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Some Zen-based Instruction Fetch Units set EIPV=RIPV=0 on poison consumption
+ * errors. This means mce_gather_info() will not save the "ip" and "cs" registers.
+ *
+ * However, the context is still valid, so save the "cs" register for later use.
+ *
+ * The "ip" register is truly unknown, so don't save it or fixup EIPV/RIPV.
+ *
+ * The Instruction Fetch Unit is at MCA bank 1 for all affected systems.
+ */
+static __always_inline void quirk_zen_ifu(int bank, struct mce *m, struct pt_regs *regs)
+{
+ if (bank != 1)
+ return;
+ if (!(m->status & MCI_STATUS_POISON))
+ return;
+
+ m->cs = regs->cs;
+}
+
+/*
+ * Do a quick check if any of the events requires a panic.
+ * This decides if we keep the events around or clear them.
+ */
+static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, unsigned long *validp,
+ struct pt_regs *regs)
+{
+ struct mce *m = &err->m;
+ char *tmp = *msg;
+ int i;
+
+ for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
+ m->status = mce_rdmsrq(mca_msr_reg(i, MCA_STATUS));
+ if (!(m->status & MCI_STATUS_VAL))
+ continue;
+
+ arch___set_bit(i, validp);
+ if (mce_flags.snb_ifu_quirk)
+ quirk_sandybridge_ifu(i, m, regs);
+
+ if (mce_flags.zen_ifu_quirk)
+ quirk_zen_ifu(i, m, regs);
+
+ m->bank = i;
+ if (mce_severity(m, regs, &tmp, true) >= MCE_PANIC_SEVERITY) {
+ mce_read_aux(err, i);
+ *msg = tmp;
+ return 1;
+ }
+ }
+ return 0;
+}
+
+/*
+ * Variable to establish order between CPUs while scanning.
+ * Each CPU initially spins until mce_executing equals its number.
+ */
+static atomic_t mce_executing;
+
+/*
+ * Defines the order of CPUs on entry. The first CPU becomes the Monarch.
+ */
+static atomic_t mce_callin;
+
+/*
+ * Track which CPUs entered the MCA broadcast synchronization and which not in
+ * order to print holdouts.
+ */
+static cpumask_t mce_missing_cpus = CPU_MASK_ALL;
+
+/*
+ * Check if a timeout waiting for other CPUs happened.
+ */
+static noinstr int mce_timed_out(u64 *t, const char *msg)
+{
+ int ret = 0;
+
+ /* Enable instrumentation around calls to external facilities */
+ instrumentation_begin();
+
+ /*
+ * The others already did panic for some reason.
+ * Bail out like in a timeout.
+ * rmb() to tell the compiler that system_state
+ * might have been modified by someone else.
+ */
+ rmb();
+ if (atomic_read(&mce_panicked))
+ wait_for_panic();
+ if (!mca_cfg.monarch_timeout)
+ goto out;
+ if ((s64)*t < SPINUNIT) {
+ if (cpumask_and(&mce_missing_cpus, cpu_online_mask, &mce_missing_cpus))
+ pr_emerg("CPUs not responding to MCE broadcast (may include false positives): %*pbl\n",
+ cpumask_pr_args(&mce_missing_cpus));
+ mce_panic(msg, NULL, NULL);
+
+ ret = 1;
+ goto out;
+ }
+ *t -= SPINUNIT;
+
+out:
+ touch_nmi_watchdog();
+
+ instrumentation_end();
+
+ return ret;
+}
+
+/*
+ * The Monarch's reign. The Monarch is the CPU who entered
+ * the machine check handler first. It waits for the others to
+ * raise the exception too and then grades them. When any
+ * error is fatal, panic. Only then let the others continue.
+ *
+ * The other CPUs entering the MCE handler will be controlled by the
+ * Monarch. They are called Subjects.
+ *
+ * This way we prevent any potential data corruption in an unrecoverable
+ * case and also make sure that all CPUs' errors are examined.
+ *
+ * Also this detects the case of a machine check event coming from outer
+ * space (not detected by any CPU). In this case some external agent wants
+ * us to shut down, so panic too.
+ *
+ * The other CPUs might still decide to panic if the handler happens
+ * in an unrecoverable place, but in that case the system is in a semi-stable
+ * state and won't corrupt anything by itself. It's OK to let the others
+ * continue for a bit first.
+ *
+ * All the spin loops have timeouts; when a timeout happens a CPU
+ * typically elects itself to be Monarch.
+ */
+static void mce_reign(void)
+{
+ struct mce_hw_err *err = NULL;
+ struct mce *m = NULL;
+ int global_worst = 0;
+ char *msg = NULL;
+ int cpu;
+
+ /*
+ * This CPU is the Monarch and the other CPUs have run
+ * through their handlers.
+ * Grade the severity of the errors of all the CPUs.
+ */
+ for_each_possible_cpu(cpu) {
+ struct mce_hw_err *etmp = &per_cpu(hw_errs_seen, cpu);
+ struct mce *mtmp = &etmp->m;
+
+ if (mtmp->severity > global_worst) {
+ global_worst = mtmp->severity;
+ err = &per_cpu(hw_errs_seen, cpu);
+ m = &err->m;
+ }
+ }
+
+ /*
+ * Cannot recover? Panic here then.
+ * This dumps all the mces in the log buffer and stops the
+ * other CPUs.
+ */
+ if (m && global_worst >= MCE_PANIC_SEVERITY) {
+ /* call mce_severity() to get "msg" for panic */
+ mce_severity(m, NULL, &msg, true);
+ mce_panic("Fatal machine check", err, msg);
+ }
+
+ /*
+ * For a UC error somewhere, we let the CPU that detected it handle
+ * it. We must also let the others continue, otherwise the handling
+ * CPU could deadlock on a lock.
+ */
+
+ /*
+ * No machine check event found. Must be some external
+ * source or one CPU is hung. Panic.
+ */
+ if (global_worst <= MCE_KEEP_SEVERITY)
+ mce_panic("Fatal machine check from unknown source", NULL, NULL);
+
+ /*
+ * Now clear all the hw_errs_seen so that they don't reappear on
+ * the next mce.
+ */
+ for_each_possible_cpu(cpu)
+ memset(&per_cpu(hw_errs_seen, cpu), 0, sizeof(struct mce_hw_err));
+}
+
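+/* Accumulated "no way out" votes of all CPUs during the rendezvous. */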
+static atomic_t global_nwo;
+
+/*
+ * Start of Monarch synchronization. This waits until all CPUs have
+ * entered the exception handler and then determines if any of them
+ * saw a fatal event that requires a panic. Then it executes them
+ * in the entry order.
+ * TBD double check parallel CPU hotunplug
+ */
+static noinstr int mce_start(int *no_way_out)
+{
+ u64 timeout = (u64)mca_cfg.monarch_timeout * NSEC_PER_USEC;
+ int order, ret = -1;
+
+ if (!timeout)
+ return ret;
+
+ raw_atomic_add(*no_way_out, &global_nwo);
+ /*
+ * Rely on the implied barrier below, such that global_nwo
+ * is updated before mce_callin.
+ */
+ order = raw_atomic_inc_return(&mce_callin);
+ arch_cpumask_clear_cpu(smp_processor_id(), &mce_missing_cpus);
+
+ /* Enable instrumentation around calls to external facilities */
+ instrumentation_begin();
+
+ /*
+ * Wait for everyone.
+ */
+ while (raw_atomic_read(&mce_callin) != num_online_cpus()) {
+ if (mce_timed_out(&timeout,
+ "Timeout: Not all CPUs entered broadcast exception handler")) {
+ raw_atomic_set(&global_nwo, 0);
+ goto out;
+ }
+ ndelay(SPINUNIT);
+ }
+
+ /*
+ * mce_callin should be read before global_nwo
+ */
+ smp_rmb();
+
+ if (order == 1) {
+ /*
+ * Monarch: Starts executing now, the others wait.
+ */
+ raw_atomic_set(&mce_executing, 1);
+ } else {
+ /*
+ * Subject: Now start the scanning loop one by one in
+ * the original callin order.
+ * This way, an error in a shared bank is seen by only one
+ * CPU before being cleared, avoiding duplicates.
+ */
+ while (raw_atomic_read(&mce_executing) < order) {
+ if (mce_timed_out(&timeout,
+ "Timeout: Subject CPUs unable to finish machine check processing")) {
+ raw_atomic_set(&global_nwo, 0);
+ goto out;
+ }
+ ndelay(SPINUNIT);
+ }
+ }
+
+ /*
+ * Cache the global no_way_out state.
+ */
+ *no_way_out = raw_atomic_read(&global_nwo);
+
+ ret = order;
+
+out:
+ instrumentation_end();
+
+ return ret;
+}
+
+/*
+ * Synchronize between CPUs after main scanning loop.
+ * This invokes the bulk of the Monarch processing.
+ */
+static noinstr int mce_end(int order)
+{
+ u64 timeout = (u64)mca_cfg.monarch_timeout * NSEC_PER_USEC;
+ int ret = -1;
+
+ /* Allow instrumentation around external facilities. */
+ instrumentation_begin();
+
+ if (!timeout)
+ goto reset;
+ if (order < 0)
+ goto reset;
+
+ /*
+ * Allow others to run.
+ */
+ atomic_inc(&mce_executing);
+
+ if (order == 1) {
+ /*
+ * Monarch: Wait for everyone to go through their scanning
+ * loops.
+ */
+ while (atomic_read(&mce_executing) <= num_online_cpus()) {
+ if (mce_timed_out(&timeout,
+ "Timeout: Monarch CPU unable to finish machine check processing"))
+ goto reset;
+ ndelay(SPINUNIT);
+ }
+
+ mce_reign();
+ barrier();
+ ret = 0;
+ } else {
+ /*
+ * Subject: Wait for Monarch to finish.
+ */
+ while (atomic_read(&mce_executing) != 0) {
+ if (mce_timed_out(&timeout,
+ "Timeout: Monarch CPU did not finish machine check processing"))
+ goto reset;
+ ndelay(SPINUNIT);
+ }
+
+ /*
+ * Don't reset anything. That's done by the Monarch.
+ */
+ ret = 0;
+ goto out;
+ }
+
+ /*
+ * Reset all global state.
+ */
+reset:
+ atomic_set(&global_nwo, 0);
+ atomic_set(&mce_callin, 0);
+ cpumask_setall(&mce_missing_cpus);
+ barrier();
+
+ /*
+ * Let others run again.
+ */
+ atomic_set(&mce_executing, 0);
+
+out:
+ instrumentation_end();
+
+ return ret;
+}
+
+static __always_inline void mce_clear_state(unsigned long *toclear)
+{
+ int i;
+
+ for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
+ if (arch_test_bit(i, toclear))
+ mce_wrmsrq(mca_msr_reg(i, MCA_STATUS), 0);
+ }
+}
+
+/*
+ * Cases where we avoid rendezvous handler timeout:
+ * 1) If this CPU is offline.
+ *
+ * 2) If crashing_cpu was set, e.g. we're entering kdump and we need to
+ * skip those CPUs which remain looping in the 1st kernel - see
+ * crash_nmi_callback().
+ *
+ * Note: there still is a small window between kexec-ing and the new,
+ * kdump kernel establishing a new #MC handler where a broadcasted MCE
+ * might not get handled properly.
+ */
+static noinstr bool mce_check_crashing_cpu(void)
+{
+ unsigned int cpu = smp_processor_id();
+
+ if (arch_cpu_is_offline(cpu) ||
+ (crashing_cpu != -1 && crashing_cpu != cpu)) {
+ u64 mcgstatus;
+
+ mcgstatus = native_rdmsrq(MSR_IA32_MCG_STATUS);
+
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_ZHAOXIN) {
+ if (mcgstatus & MCG_STATUS_LMCES)
+ return false;
+ }
+
+ if (mcgstatus & MCG_STATUS_RIPV) {
+ native_wrmsrq(MSR_IA32_MCG_STATUS, 0);
+ return true;
+ }
+ }
+ return false;
+}
+
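+/*
+ * Scan all banks for this exception: log valid errors, track the worst
+ * severity in *worst/*final and mark banks to be cleared in *toclear.
+ * Returns the number of events that warrant tainting the kernel.
+ */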
+static __always_inline int
+__mc_scan_banks(struct mce_hw_err *err, struct pt_regs *regs,
+ struct mce_hw_err *final, unsigned long *toclear,
+ unsigned long *valid_banks, int no_way_out, int *worst)
+{
+ struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
+ struct mca_config *cfg = &mca_cfg;
+ int severity, i, taint = 0;
+ struct mce *m = &err->m;
+
+ for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
+ arch___clear_bit(i, toclear);
+ if (!arch_test_bit(i, valid_banks))
+ continue;
+
+ if (!mce_banks[i].ctl)
+ continue;
+
+ m->misc = 0;
+ m->addr = 0;
+ m->bank = i;
+
+ m->status = mce_rdmsrq(mca_msr_reg(i, MCA_STATUS));
+ if (!(m->status & MCI_STATUS_VAL))
+ continue;
+
+ /*
+ * Corrected or non-signaled errors are handled by
+ * machine_check_poll(). Leave them alone, unless this panics.
+ */
+ if (!(m->status & (cfg->ser ? MCI_STATUS_S : MCI_STATUS_UC)) &&
+ !no_way_out)
+ continue;
+
+ /* Set taint even when machine check was not enabled. */
+ taint++;
+
+ severity = mce_severity(m, regs, NULL, true);
+
+ /*
+ * If the machine check was for a corrected/deferred error,
+ * leave it to the dedicated handler; don't touch it here
+ * unless we're panicking.
+ */
+ if ((severity == MCE_KEEP_SEVERITY ||
+ severity == MCE_UCNA_SEVERITY) && !no_way_out)
+ continue;
+
+ arch___set_bit(i, toclear);
+
+ /* Machine check event was not enabled. Clear, but ignore. */
+ if (severity == MCE_NO_SEVERITY)
+ continue;
+
+ mce_read_aux(err, i);
+
+ /* assuming valid severity level != 0 */
+ m->severity = severity;
+
+ /*
+ * Enable instrumentation around the mce_log() call which is
+ * done in #MC context, where instrumentation is disabled.
+ */
+ instrumentation_begin();
+ mce_log(err);
+ instrumentation_end();
+
+ if (severity > *worst) {
+ *final = *err;
+ *worst = severity;
+ }
+ }
+
+ /* mce_clear_state will clear *final, save locally for use later */
+ *err = *final;
+
+ return taint;
+}
+
+static void kill_me_now(struct callback_head *ch)
+{
+ struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+ p->mce_count = 0;
+ force_sig(SIGBUS);
+}
+
+static void kill_me_maybe(struct callback_head *cb)
+{
+ struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
+ int flags = MF_ACTION_REQUIRED;
+ unsigned long pfn;
+ int ret;
+
+ p->mce_count = 0;
+ pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
+
+ if (!p->mce_ripv)
+ flags |= MF_MUST_KILL;
+
+ pfn = (p->mce_addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
+ ret = memory_failure(pfn, flags);
+ if (!ret) {
+ set_mce_nospec(pfn);
+ sync_core();
+ return;
+ }
+
+ /*
+ * -EHWPOISON from memory_failure() means that it already sent SIGBUS
+ * to the current process with the proper error info,
+ * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
+ *
+ * In both cases, no further processing is required.
+ */
+ if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
+ return;
+
+ pr_err("Memory error not recovered");
+ kill_me_now(cb);
+}
+
+static void kill_me_never(struct callback_head *cb)
+{
+ struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
+ unsigned long pfn;
+
+ p->mce_count = 0;
+ pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
+ pfn = (p->mce_addr & MCI_ADDR_PHYSADDR) >> PAGE_SHIFT;
+ if (!memory_failure(pfn, 0))
+ set_mce_nospec(pfn);
+}
+
+static void queue_task_work(struct mce_hw_err *err, char *msg, void (*func)(struct callback_head *))
+{
+ int count = ++current->mce_count;
+ struct mce *m = &err->m;
+
+ /* First call, save all the details */
+ if (count == 1) {
+ current->mce_addr = m->addr;
+ current->mce_kflags = m->kflags;
+ current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+ current->mce_whole_page = whole_page(m);
+ current->mce_kill_me.func = func;
+ }
+
+ /* Ten is likely overkill. Don't expect more than two faults before task_work() */
+ if (count > 10)
+ mce_panic("Too many consecutive machine checks while accessing user data",
+ err, msg);
+
+ /* Second or later call, make sure page address matches the one from first call */
+ if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+ mce_panic("Consecutive machine checks to different user pages", err, msg);
+
+ /* Do not call task_work_add() more than once */
+ if (count > 1)
+ return;
+
+ task_work_add(current, &current->mce_kill_me, TWA_RESUME);
+}
+
+/* Handle unconfigured int18 (should never happen) */
+static noinstr void unexpected_machine_check(struct pt_regs *regs)
+{
+ instrumentation_begin();
+ pr_err("CPU#%d: Unexpected int18 (Machine Check)\n",
+ smp_processor_id());
+ instrumentation_end();
+}
+
+/*
+ * The actual machine check handler. This only handles real exceptions when
+ * something got corrupted coming in through int 18.
+ *
+ * This is executed in #MC context not subject to normal locking rules.
+ * This implies that most kernel services cannot be safely used. Don't even
+ * think about putting a printk in there!
+ *
+ * On Intel systems this is entered on all CPUs in parallel through
+ * MCE broadcast. However, some CPUs might be broken beyond repair,
+ * so always be careful when synchronizing with the others.
+ *
+ * Tracing and kprobes are disabled: if we interrupted a kernel context
+ * with IF=1, we need to minimize stack usage. There are also recursion
+ * issues: if the machine check was due to a failure of the memory
+ * backing the user stack, tracing that reads the user stack will cause
+ * potentially infinite recursion.
+ *
+ * Currently, the #MC handler calls out to a number of external facilities
+ * and, therefore, allows instrumentation around them. The optimal thing to
+ * have would be to do the absolutely minimal work required in #MC context
+ * and have instrumentation disabled only around that. Further processing can
+ * then happen in process context where instrumentation is allowed. Achieving
+ * that requires careful auditing and modifications. Until then, the code
+ * allows instrumentation temporarily, where required.
+ */
+noinstr void do_machine_check(struct pt_regs *regs)
+{
+ int worst = 0, order, no_way_out, kill_current_task, lmce, taint = 0;
+ DECLARE_BITMAP(valid_banks, MAX_NR_BANKS) = { 0 };
+ DECLARE_BITMAP(toclear, MAX_NR_BANKS) = { 0 };
+ struct mce_hw_err *final;
+ struct mce_hw_err err;
+ char *msg = NULL;
+ struct mce *m;
+
+ if (unlikely(mce_flags.p5))
+ return pentium_machine_check(regs);
+ else if (unlikely(mce_flags.winchip))
+ return winchip_machine_check(regs);
+ else if (unlikely(!mca_cfg.initialized))
+ return unexpected_machine_check(regs);
+
+ if (mce_flags.skx_repmov_quirk && quirk_skylake_repmov())
+ goto clear;
+
+ /*
+ * Establish sequential order between the CPUs entering the machine
+ * check handler.
+ */
+ order = -1;
+
+ /*
+ * If no_way_out gets set, there is no safe way to recover from this
+ * MCE.
+ */
+ no_way_out = 0;
+
+ /*
+ * If kill_current_task is not set, there might be a way to recover from this
+ * error.
+ */
+ kill_current_task = 0;
+
+ /*
+ * MCEs are always local on AMD. Same is determined by MCG_STATUS_LMCES
+ * on Intel.
+ */
+ lmce = 1;
+
+ this_cpu_inc(mce_exception_count);
+
+ mce_gather_info(&err, regs);
+ m = &err.m;
+ m->tsc = rdtsc();
+
+ final = this_cpu_ptr(&hw_errs_seen);
+ *final = err;
+
+ no_way_out = mce_no_way_out(&err, &msg, valid_banks, regs);
+
+ barrier();
+
+ /*
+ * When there is no restart IP we might need to kill the task or panic.
+ * Assume the worst for now, but if we find the
+ * severity is MCE_AR_SEVERITY we have other options.
+ */
+ if (!(m->mcgstatus & MCG_STATUS_RIPV))
+ kill_current_task = 1;
+ /*
+ * Check if this MCE is signaled to only this logical processor,
+ * on Intel, Zhaoxin only.
+ */
+ if (m->cpuvendor == X86_VENDOR_INTEL ||
+ m->cpuvendor == X86_VENDOR_ZHAOXIN)
+ lmce = m->mcgstatus & MCG_STATUS_LMCES;
+
+ /*
+ * Local machine check may already know that we have to panic.
+ * Broadcast machine check begins rendezvous in mce_start().
+ * Go through all banks to the exclusion of the other CPUs. This way we
+ * don't report duplicated events on shared banks because the first one
+ * to see it will clear it.
+ */
+ if (lmce) {
+ if (no_way_out)
+ mce_panic("Fatal local machine check", &err, msg);
+ } else {
+ order = mce_start(&no_way_out);
+ }
+
+ taint = __mc_scan_banks(&err, regs, final, toclear, valid_banks, no_way_out, &worst);
+
+ if (!no_way_out)
+ mce_clear_state(toclear);
+
+ /*
+ * Do most of the synchronization with other CPUs.
+ * When there's any problem use only local no_way_out state.
+ */
+ if (!lmce) {
+ if (mce_end(order) < 0) {
+ if (!no_way_out)
+ no_way_out = worst >= MCE_PANIC_SEVERITY;
+
+ if (no_way_out)
+ mce_panic("Fatal machine check on current CPU", &err, msg);
+ }
+ } else {
+ /*
+ * If there was a fatal machine check we should have
+ * already called mce_panic earlier in this function.
+ * Since we re-read the banks, we might have found
+ * something new. Check again to see if we found a
+ * fatal error. We call "mce_severity()" again to
+ * make sure we have the right "msg".
+ */
+ if (worst >= MCE_PANIC_SEVERITY) {
+ mce_severity(m, regs, &msg, true);
+ mce_panic("Local fatal machine check!", &err, msg);
+ }
+ }
+
+ /*
+ * Enable instrumentation around the external facilities like task_work_add()
+ * (via queue_task_work()), fixup_exception() etc. For now, that is. Fixing this
+ * properly would need a lot more involved reorganization.
+ */
+ instrumentation_begin();
+
+ if (taint)
+ add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
+
+ if (worst != MCE_AR_SEVERITY && !kill_current_task)
+ goto out;
+
+ /* Fault was in user mode and we need to take some action */
+ if ((m->cs & 3) == 3) {
+ /* If this triggers there is no way to recover. Die hard. */
+ BUG_ON(!on_thread_stack() || !user_mode(regs));
+
+ if (!mce_usable_address(m))
+ queue_task_work(&err, msg, kill_me_now);
+ else
+ queue_task_work(&err, msg, kill_me_maybe);
+
+ } else if (m->mcgstatus & MCG_STATUS_SEAM_NR) {
+ /*
+ * Saved RIP on stack makes it look like the machine check
+ * was taken in the kernel on the instruction following
+ * the entry to SEAM mode. But MCG_STATUS_SEAM_NR indicates
+ * that the machine check was taken inside SEAM non-root
+ * mode. CPU core has already marked that guest as dead.
+ * It is OK for the kernel to resume execution at the
+ * apparent point of the machine check as the fault did
+ * not occur there. Mark the page as poisoned so it won't
+ * be added to free list when the guest is terminated.
+ */
+ if (mce_usable_address(m)) {
+ struct page *p = pfn_to_online_page(m->addr >> PAGE_SHIFT);
+
+ if (p)
+ SetPageHWPoison(p);
+ }
+ } else {
+ /*
+ * Handle an MCE which has happened in kernel space but from
+ * which the kernel can recover: ex_has_fault_handler() has
+ * already verified that the rIP at which the error happened is
+ * a rIP from which the kernel can recover (by jumping to
+ * recovery code specified in _ASM_EXTABLE_FAULT()) and the
+ * corresponding exception handler which would do that is the
+ * proper one.
+ */
+ if (m->kflags & MCE_IN_KERNEL_RECOV) {
+ if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
+ mce_panic("Failed kernel mode recovery", &err, msg);
+ }
+
+ if (m->kflags & MCE_IN_KERNEL_COPYIN)
+ queue_task_work(&err, msg, kill_me_never);
+ }
+
+out:
+ /* Given it didn't panic, mark it as recoverable */
+ hwerr_log_error_type(HWERR_RECOV_OTHERS);
+
+ instrumentation_end();
+
+clear:
+ mce_wrmsrq(MSR_IA32_MCG_STATUS, 0);
+}
+EXPORT_SYMBOL_GPL(do_machine_check);
+
+#ifndef CONFIG_MEMORY_FAILURE
+int memory_failure(unsigned long pfn, int flags)
+{
+ /* mce_severity() should not hand us an ACTION_REQUIRED error */
+ BUG_ON(flags & MF_ACTION_REQUIRED);
+ pr_err("Uncorrected memory error in page 0x%lx ignored\n"
+ "Rebuild kernel with CONFIG_MEMORY_FAILURE=y for smarter handling\n",
+ pfn);
+
+ return 0;
+}
+#endif
+
+/*
+ * Periodic polling timer for "silent" machine check errors. If the
+ * poller finds an MCE, poll 2x faster. When the poller finds no more
+ * errors, poll 2x slower (up to check_interval seconds).
+ */
+static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
+
+static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
+static DEFINE_PER_CPU(struct timer_list, mce_timer);
+
+static void __start_timer(struct timer_list *t, unsigned long interval)
+{
+ unsigned long when = jiffies + interval;
+ unsigned long flags;
+
+ local_irq_save(flags);
+
+ if (!timer_pending(t) || time_before(when, t->expires))
+ mod_timer(t, round_jiffies(when));
+
+ local_irq_restore(flags);
+}
+
+static void mc_poll_banks_default(void)
+{
+ machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+}
+
+void (*mc_poll_banks)(void) = mc_poll_banks_default;
+
+static bool should_enable_timer(unsigned long iv)
+{
+ return !mca_cfg.ignore_ce && iv;
+}
+
+static void mce_timer_fn(struct timer_list *t)
+{
+ struct timer_list *cpu_t = this_cpu_ptr(&mce_timer);
+ unsigned long iv;
+
+ WARN_ON(cpu_t != t);
+
+ iv = __this_cpu_read(mce_next_interval);
+
+ if (mce_available(this_cpu_ptr(&cpu_info)))
+ mc_poll_banks();
+
+ /*
+ * Alert userspace if needed. If we logged an MCE, reduce the polling
+ * interval, otherwise increase it.
+ */
+ if (mce_notify_irq())
+ iv = max(iv / 2, (unsigned long) HZ/100);
+ else
+ iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
+
+ if (mce_get_storm_mode()) {
+ __start_timer(t, HZ);
+ } else if (should_enable_timer(iv)) {
+ __this_cpu_write(mce_next_interval, iv);
+ __start_timer(t, iv);
+ }
+}
+
+/*
+ * When a storm starts on any bank on this CPU, switch to polling
+ * once per second. When the storm ends, revert to the default
+ * polling interval.
+ */
+void mce_timer_kick(bool storm)
+{
+ struct timer_list *t = this_cpu_ptr(&mce_timer);
+
+ mce_set_storm_mode(storm);
+
+ if (storm)
+ __start_timer(t, HZ);
+ else
+ __this_cpu_write(mce_next_interval, check_interval * HZ);
+}
+
+/* Must not be called in IRQ context where timer_delete_sync() can deadlock */
+static void mce_timer_delete_all(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ timer_delete_sync(&per_cpu(mce_timer, cpu));
+}
+
+static void __mcheck_cpu_mce_banks_init(void)
+{
+ struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
+ u8 n_banks = this_cpu_read(mce_num_banks);
+ int i;
+
+ for (i = 0; i < n_banks; i++) {
+ struct mce_bank *b = &mce_banks[i];
+
+ /*
+ * Init them all by default.
+ *
+ * The required vendor quirks will be applied before
+ * __mcheck_cpu_init_prepare_banks() does the final bank setup.
+ */
+ b->ctl = -1ULL;
+ b->init = true;
+ }
+}
+
+/*
+ * Initialize Machine Checks for a CPU.
+ */
+static void __mcheck_cpu_cap_init(void)
+{
+ u64 cap;
+ u8 b;
+
+ rdmsrq(MSR_IA32_MCG_CAP, cap);
+
+ b = cap & MCG_BANKCNT_MASK;
+
+ if (b > MAX_NR_BANKS) {
+ pr_warn("CPU%d: Using only %u machine check banks out of %u\n",
+ smp_processor_id(), MAX_NR_BANKS, b);
+ b = MAX_NR_BANKS;
+ }
+
+ this_cpu_write(mce_num_banks, b);
+
+ __mcheck_cpu_mce_banks_init();
+}
+
+static void __mcheck_cpu_init_generic(void)
+{
+ u64 cap;
+
+ rdmsrq(MSR_IA32_MCG_CAP, cap);
+ if (cap & MCG_CTL_P)
+ wrmsr(MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
+}
+
+static void __mcheck_cpu_init_prepare_banks(void)
+{
+ struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
+ u64 msrval;
+ int i;
+
+ /*
+ * Log the machine checks left over from the previous reset. Log them
+ * only, do not start processing them. That will happen in mcheck_late_init()
+ * when all consumers have been registered on the notifier chain.
+ */
+ if (mca_cfg.bootlog) {
+ mce_banks_t all_banks;
+
+ bitmap_fill(all_banks, MAX_NR_BANKS);
+ machine_check_poll(MCP_UC | MCP_QUEUE_LOG, &all_banks);
+ }
+
+ for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
+ struct mce_bank *b = &mce_banks[i];
+
+ if (!b->init)
+ continue;
+
+ wrmsrq(mca_msr_reg(i, MCA_CTL), b->ctl);
+ wrmsrq(mca_msr_reg(i, MCA_STATUS), 0);
+
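+ /*
+ * Read CTL back: if none of the enable bits stuck, treat the
+ * bank as not initialized from here on.
+ */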
+ rdmsrq(mca_msr_reg(i, MCA_CTL), msrval);
+ b->init = !!msrval;
+ }
+}
+
+static void amd_apply_global_quirks(struct cpuinfo_x86 *c)
+{
+ if (c->x86 < 0x11 && mca_cfg.bootlog < 0) {
+ /*
+ * Lots of broken BIOSes around that don't clear the banks
+ * by default and leave junk in there. Don't log:
+ */
+ mca_cfg.bootlog = 0;
+ }
+
+ /*
+ * overflow_recov is supported for F15h Models 00h-0fh
+ * even though we don't have a CPUID bit for it.
+ */
+ if (c->x86 == 0x15 && c->x86_model <= 0xf)
+ mce_flags.overflow_recov = 1;
+
+ if (c->x86 >= 0x17 && c->x86 <= 0x1A)
+ mce_flags.zen_ifu_quirk = 1;
+}
+
+static void intel_apply_global_quirks(struct cpuinfo_x86 *c)
+{
+ /* Older CPUs (prior to family 6) don't need quirks. */
+ if (c->x86_vfm < INTEL_PENTIUM_PRO)
+ return;
+
+ /*
+ * All newer Intel systems support MCE broadcasting. Enable
+ * synchronization with a one second timeout.
+ */
+ if (c->x86_vfm >= INTEL_CORE_YONAH && mca_cfg.monarch_timeout < 0)
+ mca_cfg.monarch_timeout = USEC_PER_SEC;
+
+ /*
+ * There are also broken BIOSes on some Pentium M and
+ * earlier systems:
+ */
+ if (c->x86_vfm < INTEL_CORE_YONAH && mca_cfg.bootlog < 0)
+ mca_cfg.bootlog = 0;
+
+ if (c->x86_vfm == INTEL_SANDYBRIDGE_X)
+ mce_flags.snb_ifu_quirk = 1;
+
+ /*
+ * Skylake, Cascade Lake and Cooper Lake require a quirk on
+ * rep movs.
+ */
+ if (c->x86_vfm == INTEL_SKYLAKE_X)
+ mce_flags.skx_repmov_quirk = 1;
+}
+
+static void zhaoxin_apply_global_quirks(struct cpuinfo_x86 *c)
+{
+ /*
+ * All newer Zhaoxin CPUs support MCE broadcasting. Enable
+ * synchronization with a one second timeout.
+ */
+ if (c->x86 > 6 || (c->x86_model == 0x19 || c->x86_model == 0x1f)) {
+ if (mca_cfg.monarch_timeout < 0)
+ mca_cfg.monarch_timeout = USEC_PER_SEC;
+ }
+}
+
+static bool __mcheck_cpu_ancient_init(struct cpuinfo_x86 *c)
+{
+ if (c->x86 != 5)
+ return false;
+
+ switch (c->x86_vendor) {
+ case X86_VENDOR_INTEL:
+ intel_p5_mcheck_init(c);
+ mce_flags.p5 = 1;
+ return true;
+ case X86_VENDOR_CENTAUR:
+ winchip_mcheck_init(c);
+ mce_flags.winchip = 1;
+ return true;
+ default:
+ return false;
+ }
+}
+
+static void mce_centaur_feature_init(struct cpuinfo_x86 *c)
+{
+ struct mca_config *cfg = &mca_cfg;
+
+ /*
+ * All newer Centaur CPUs support MCE broadcasting. Enable
+ * synchronization with a one second timeout.
+ */
+ if ((c->x86 == 6 && c->x86_model == 0xf && c->x86_stepping >= 0xe) ||
+ c->x86 > 6) {
+ if (cfg->monarch_timeout < 0)
+ cfg->monarch_timeout = USEC_PER_SEC;
+ }
+}
+
+static void mce_zhaoxin_feature_init(struct cpuinfo_x86 *c)
+{
+ struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
+
+ /*
+ * These CPUs have MCA bank 8 which reports only one error type called
+ * SVAD (System View Address Decoder). The reporting of that error is
+ * controlled by IA32_MC8.CTL.0.
+ *
+ * If enabled, prefetching on these CPUs will cause SVAD MCE when
+ * virtual machines start and result in a system panic. Always disable
+ * bank 8 SVAD error by default.
+ */
+ if ((c->x86 == 7 && c->x86_model == 0x1b) ||
+ (c->x86_model == 0x19 || c->x86_model == 0x1f)) {
+ if (this_cpu_read(mce_num_banks) > 8)
+ mce_banks[8].ctl = 0;
+ }
+
+ intel_init_cmci();
+ intel_init_lmce();
+}
+
+static void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
+{
+ intel_clear_lmce();
+}
+
+static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
+{
+ switch (c->x86_vendor) {
+ case X86_VENDOR_INTEL:
+ mce_intel_feature_init(c);
+ break;
+
+ case X86_VENDOR_AMD:
+ case X86_VENDOR_HYGON:
+ mce_amd_feature_init(c);
+ break;
+
+ case X86_VENDOR_CENTAUR:
+ mce_centaur_feature_init(c);
+ break;
+
+ case X86_VENDOR_ZHAOXIN:
+ mce_zhaoxin_feature_init(c);
+ break;
+
+ default:
+ break;
+ }
+}
+
+static void __mcheck_cpu_clear_vendor(struct cpuinfo_x86 *c)
+{
+ switch (c->x86_vendor) {
+ case X86_VENDOR_INTEL:
+ mce_intel_feature_clear(c);
+ break;
+
+ case X86_VENDOR_ZHAOXIN:
+ mce_zhaoxin_feature_clear(c);
+ break;
+
+ default:
+ break;
+ }
+}
+
+static void mce_start_timer(struct timer_list *t)
+{
+ unsigned long iv = check_interval * HZ;
+
+ if (should_enable_timer(iv)) {
+ this_cpu_write(mce_next_interval, iv);
+ __start_timer(t, iv);
+ }
+}
+
+static void __mcheck_cpu_setup_timer(void)
+{
+ struct timer_list *t = this_cpu_ptr(&mce_timer);
+
+ timer_setup(t, mce_timer_fn, TIMER_PINNED);
+}
+
+static void __mcheck_cpu_init_timer(void)
+{
+ struct timer_list *t = this_cpu_ptr(&mce_timer);
+
+ timer_setup(t, mce_timer_fn, TIMER_PINNED);
+ mce_start_timer(t);
+}
+
+bool filter_mce(struct mce *m)
+{
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+ return amd_filter_mce(m);
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ return intel_filter_mce(m);
+
+ return false;
+}
+
+static __always_inline void exc_machine_check_kernel(struct pt_regs *regs)
+{
+ irqentry_state_t irq_state;
+
+ WARN_ON_ONCE(user_mode(regs));
+
+ /*
+ * Only required when from kernel mode. See
+ * mce_check_crashing_cpu() for details.
+ */
+ if (mca_cfg.initialized && mce_check_crashing_cpu())
+ return;
+
+ irq_state = irqentry_nmi_enter(regs);
+
+ do_machine_check(regs);
+
+ irqentry_nmi_exit(regs, irq_state);
+}
+
+static __always_inline void exc_machine_check_user(struct pt_regs *regs)
+{
+ irqentry_enter_from_user_mode(regs);
+
+ do_machine_check(regs);
+
+ irqentry_exit_to_user_mode(regs);
+}
+
+#ifdef CONFIG_X86_64
+/* MCE hit kernel mode */
+DEFINE_IDTENTRY_MCE(exc_machine_check)
+{
+ unsigned long dr7;
+
+ dr7 = local_db_save();
+ exc_machine_check_kernel(regs);
+ local_db_restore(dr7);
+}
+
+/* The user mode variant. */
+DEFINE_IDTENTRY_MCE_USER(exc_machine_check)
+{
+ unsigned long dr7;
+
+ dr7 = local_db_save();
+ exc_machine_check_user(regs);
+ local_db_restore(dr7);
+}
+
+#ifdef CONFIG_X86_FRED
+/*
+ * Depending on the ring level at which it occurred, i.e., from user or
+ * kernel context, a #MC needs to be handled on a different stack: a user
+ * #MC on the current task stack, a kernel #MC on a dedicated stack.
+ *
+ * This is exactly how FRED event delivery invokes an exception handler:
+ * a ring 3 event on the level 0 stack, i.e., the current task stack; a
+ * ring 0 event on the #MC dedicated stack specified in the
+ * IA32_FRED_STKLVLS MSR. So unlike IDT, the FRED machine check entry
+ * stub doesn't do any stack switching.
+ */
+DEFINE_FREDENTRY_MCE(exc_machine_check)
+{
+ unsigned long dr7;
+
+ dr7 = local_db_save();
+ if (user_mode(regs))
+ exc_machine_check_user(regs);
+ else
+ exc_machine_check_kernel(regs);
+ local_db_restore(dr7);
+}
+#endif
+#else
+/* 32bit unified entry point */
+DEFINE_IDTENTRY_RAW(exc_machine_check)
+{
+ unsigned long dr7;
+
+ dr7 = local_db_save();
+ if (user_mode(regs))
+ exc_machine_check_user(regs);
+ else
+ exc_machine_check_kernel(regs);
+ local_db_restore(dr7);
+}
+#endif
+
+void mca_bsp_init(struct cpuinfo_x86 *c)
+{
+ u64 cap;
+
+ if (!mce_available(c))
+ return;
+
+ if (c->x86_vendor == X86_VENDOR_UNKNOWN) {
+ mca_cfg.disabled = 1;
+ pr_info("unknown CPU type - not enabling MCE support\n");
+ return;
+ }
+
+ mce_flags.overflow_recov = cpu_feature_enabled(X86_FEATURE_OVERFLOW_RECOV);
+ mce_flags.succor = cpu_feature_enabled(X86_FEATURE_SUCCOR);
+ mce_flags.smca = cpu_feature_enabled(X86_FEATURE_SMCA);
+
+ if (mce_flags.smca)
+ smca_bsp_init();
+
+ rdmsrq(MSR_IA32_MCG_CAP, cap);
+
+ /* Use accurate RIP reporting if available. */
+ if ((cap & MCG_EXT_P) && MCG_EXT_CNT(cap) >= 9)
+ mca_cfg.rip_msr = MSR_IA32_MCG_EIP;
+
+ if (cap & MCG_SER_P)
+ mca_cfg.ser = 1;
+
+ switch (c->x86_vendor) {
+ case X86_VENDOR_AMD:
+ amd_apply_global_quirks(c);
+ break;
+ case X86_VENDOR_INTEL:
+ intel_apply_global_quirks(c);
+ break;
+ case X86_VENDOR_ZHAOXIN:
+ zhaoxin_apply_global_quirks(c);
+ break;
+ }
+
+ if (mca_cfg.monarch_timeout < 0)
+ mca_cfg.monarch_timeout = 0;
+ if (mca_cfg.bootlog != 0)
+ mca_cfg.panic_timeout = 30;
+}
+
+/*
+ * Called for each booted CPU to set up machine checks.
+ * Must be called with preempt off:
+ */
+void mcheck_cpu_init(struct cpuinfo_x86 *c)
+{
+ if (mca_cfg.disabled)
+ return;
+
+ if (__mcheck_cpu_ancient_init(c))
+ return;
+
+ if (!mce_available(c))
+ return;
+
+ __mcheck_cpu_cap_init();
+
+ if (!mce_gen_pool_init()) {
+ mca_cfg.disabled = 1;
+ pr_emerg("Couldn't allocate MCE records pool!\n");
+ return;
+ }
+
+ mca_cfg.initialized = 1;
+
+ __mcheck_cpu_init_generic();
+ __mcheck_cpu_init_vendor(c);
+ __mcheck_cpu_init_prepare_banks();
+ __mcheck_cpu_setup_timer();
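+ /*
+ * Finally set CR4.MCE: until this bit is set, a machine check puts
+ * the processor into shutdown state instead of raising #MC.
+ */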
+ cr4_set_bits(X86_CR4_MCE);
+}
+
+/*
+ * Called for each booted CPU to clear some machine checks opt-ins
+ */
+void mcheck_cpu_clear(struct cpuinfo_x86 *c)
+{
+ if (mca_cfg.disabled)
+ return;
+
+ if (!mce_available(c))
+ return;
+
+ /*
+ * If general x86 settings ever need clearing, a
+ * __mcheck_cpu_clear_generic(c) call would go here.
+ */
+ __mcheck_cpu_clear_vendor(c);
+}
+
+static void __mce_disable_bank(void *arg)
+{
+ int bank = *((int *)arg);
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ cmci_disable_bank(bank);
+}
+
+void mce_disable_bank(int bank)
+{
+ if (bank >= this_cpu_read(mce_num_banks)) {
+ pr_warn(FW_BUG
+ "Ignoring request to disable invalid MCA bank %d.\n",
+ bank);
+ return;
+ }
+ set_bit(bank, mce_banks_ce_disabled);
+ on_each_cpu(__mce_disable_bank, &bank, 1);
+}
+
+/*
+ * mce=off Disables machine check
+ * mce=no_cmci Disables CMCI
+ * mce=no_lmce Disables LMCE
+ * mce=dont_log_ce Clears corrected events silently, no log created for CEs.
+ * mce=print_all Print all machine check logs to console
+ * mce=ignore_ce Disables polling and CMCI, corrected events are not cleared.
+ * mce=monarchtimeout (number)
+ *	monarchtimeout is how long to wait for other CPUs on machine
+ *	check, or 0 to not wait
+ * mce=bootlog Log MCEs from before booting. Disabled by default on AMD Fam10h
+ *	and older.
+ * mce=nobootlog Don't log MCEs from before booting.
+ * mce=bios_cmci_threshold Don't program the CMCI threshold
+ * mce=recovery force enable copy_mc_fragile()
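+ *
+ * Each "mce=" token carries a single option; repeat the parameter to
+ * combine several, e.g. "mce=no_cmci mce=dont_log_ce".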
+ */
+static int __init mcheck_enable(char *str)
+{
+ struct mca_config *cfg = &mca_cfg;
+
+ if (*str == 0) {
+ enable_p5_mce();
+ return 1;
+ }
+ if (*str == '=')
+ str++;
+ if (!strcmp(str, "off"))
+ cfg->disabled = 1;
+ else if (!strcmp(str, "no_cmci"))
+ cfg->cmci_disabled = true;
+ else if (!strcmp(str, "no_lmce"))
+ cfg->lmce_disabled = 1;
+ else if (!strcmp(str, "dont_log_ce"))
+ cfg->dont_log_ce = true;
+ else if (!strcmp(str, "print_all"))
+ cfg->print_all = true;
+ else if (!strcmp(str, "ignore_ce"))
+ cfg->ignore_ce = true;
+ else if (!strcmp(str, "bootlog") || !strcmp(str, "nobootlog"))
+ cfg->bootlog = (str[0] == 'b');
+ else if (!strcmp(str, "bios_cmci_threshold"))
+ cfg->bios_cmci_threshold = 1;
+ else if (!strcmp(str, "recovery"))
+ cfg->recovery = 1;
+ else if (isdigit(str[0]))
+ get_option(&str, &(cfg->monarch_timeout));
+ else {
+ pr_info("mce argument %s ignored. Please use /sys\n", str);
+ return 0;
+ }
+ return 1;
+}
+__setup("mce", mcheck_enable);
+
+int __init mcheck_init(void)
+{
+ mce_register_decode_chain(&early_nb);
+ mce_register_decode_chain(&mce_uc_nb);
+ mce_register_decode_chain(&mce_default_nb);
+
+ INIT_WORK(&mce_work, mce_gen_pool_process);
+ init_irq_work(&mce_irq_work, mce_irq_work_cb);
+
+ return 0;
+}
+
+/*
+ * mce_syscore: PM support
+ */
+
+/*
+ * Disable machine checks on suspend and shutdown. We can't really handle
+ * them later.
+ */
+static void mce_disable_error_reporting(void)
+{
+ struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
+ int i;
+
+ for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
+ struct mce_bank *b = &mce_banks[i];
+
+ if (b->init)
+ wrmsrq(mca_msr_reg(i, MCA_CTL), 0);
+ }
+}
+
+static void vendor_disable_error_reporting(void)
+{
+ /*
+ * Don't clear on Intel or AMD or Hygon or Zhaoxin CPUs. Some of these
+ * MSRs are socket-wide. Disabling them for just a single offlined CPU
+ * is bad, since it will inhibit reporting for all shared resources on
+ * the socket like the last level cache (LLC), the integrated memory
+ * controller (iMC), etc.
+ */
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL ||
+ boot_cpu_data.x86_vendor == X86_VENDOR_HYGON ||
+ boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
+ boot_cpu_data.x86_vendor == X86_VENDOR_ZHAOXIN)
+ return;
+
+ mce_disable_error_reporting();
+}
+
+static int mce_syscore_suspend(void *data)
+{
+ vendor_disable_error_reporting();
+ return 0;
+}
+
+static void mce_syscore_shutdown(void *data)
+{
+ vendor_disable_error_reporting();
+}
+
+/*
+ * On resume clear all MCE state. Don't want to see leftovers from the BIOS.
+ * Only one CPU is active at this time, the others get re-added later using
+ * CPU hotplug:
+ */
+static void mce_syscore_resume(void *data)
+{
+ __mcheck_cpu_init_generic();
+ __mcheck_cpu_init_vendor(raw_cpu_ptr(&cpu_info));
+ __mcheck_cpu_init_prepare_banks();
+ cr4_set_bits(X86_CR4_MCE);
+}
+
+static const struct syscore_ops mce_syscore_ops = {
+ .suspend = mce_syscore_suspend,
+ .shutdown = mce_syscore_shutdown,
+ .resume = mce_syscore_resume,
+};
+
+static struct syscore mce_syscore = {
+ .ops = &mce_syscore_ops,
+};
+
+/*
+ * mce_device: Sysfs support
+ */
+
+static void mce_cpu_restart(void *data)
+{
+ if (!mce_available(raw_cpu_ptr(&cpu_info)))
+ return;
+ __mcheck_cpu_init_generic();
+ __mcheck_cpu_init_prepare_banks();
+ __mcheck_cpu_init_timer();
+ cr4_set_bits(X86_CR4_MCE);
+}
+
+/* Reinit MCEs after user configuration changes */
+static void mce_restart(void)
+{
+ mce_timer_delete_all();
+ on_each_cpu(mce_cpu_restart, NULL, 1);
+ mce_schedule_work();
+}
+
+/* Toggle features for corrected errors */
+static void mce_disable_cmci(void *data)
+{
+ if (!mce_available(raw_cpu_ptr(&cpu_info)))
+ return;
+ cmci_clear();
+}
+
+static void mce_enable_ce(void *all)
+{
+ if (!mce_available(raw_cpu_ptr(&cpu_info)))
+ return;
+ cmci_reenable();
+ cmci_recheck();
+ if (all)
+ __mcheck_cpu_init_timer();
+}
+
+static const struct bus_type mce_subsys = {
+ .name = "machinecheck",
+ .dev_name = "machinecheck",
+};
+
+DEFINE_PER_CPU(struct device *, mce_device);
+
+static inline struct mce_bank_dev *attr_to_bank(struct device_attribute *attr)
+{
+ return container_of(attr, struct mce_bank_dev, attr);
+}
+
+static ssize_t show_bank(struct device *s, struct device_attribute *attr,
+ char *buf)
+{
+ u8 bank = attr_to_bank(attr)->bank;
+ struct mce_bank *b;
+
+ if (bank >= per_cpu(mce_num_banks, s->id))
+ return -EINVAL;
+
+ b = &per_cpu(mce_banks_array, s->id)[bank];
+
+ if (!b->init)
+ return -ENODEV;
+
+ return sprintf(buf, "%llx\n", b->ctl);
+}
+
+static ssize_t set_bank(struct device *s, struct device_attribute *attr,
+ const char *buf, size_t size)
+{
+ u8 bank = attr_to_bank(attr)->bank;
+ struct mce_bank *b;
+ u64 new;
+
+ if (kstrtou64(buf, 0, &new) < 0)
+ return -EINVAL;
+
+ if (bank >= per_cpu(mce_num_banks, s->id))
+ return -EINVAL;
+
+ b = &per_cpu(mce_banks_array, s->id)[bank];
+ if (!b->init)
+ return -ENODEV;
+
+ b->ctl = new;
+
+ mutex_lock(&mce_sysfs_mutex);
+ mce_restart();
+ mutex_unlock(&mce_sysfs_mutex);
+
+ return size;
+}
+
+static ssize_t set_ignore_ce(struct device *s,
+ struct device_attribute *attr,
+ const char *buf, size_t size)
+{
+ u64 new;
+
+ if (kstrtou64(buf, 0, &new) < 0)
+ return -EINVAL;
+
+ mutex_lock(&mce_sysfs_mutex);
+ if (mca_cfg.ignore_ce ^ !!new) {
+ if (new) {
+ /* disable ce features */
+ mce_timer_delete_all();
+ on_each_cpu(mce_disable_cmci, NULL, 1);
+ mca_cfg.ignore_ce = true;
+ } else {
+ /* enable ce features */
+ mca_cfg.ignore_ce = false;
+ on_each_cpu(mce_enable_ce, (void *)1, 1);
+ }
+ }
+ mutex_unlock(&mce_sysfs_mutex);
+
+ return size;
+}
+
+static ssize_t set_cmci_disabled(struct device *s,
+ struct device_attribute *attr,
+ const char *buf, size_t size)
+{
+ u64 new;
+
+ if (kstrtou64(buf, 0, &new) < 0)
+ return -EINVAL;
+
+ mutex_lock(&mce_sysfs_mutex);
+ if (mca_cfg.cmci_disabled ^ !!new) {
+ if (new) {
+ /* disable cmci */
+ on_each_cpu(mce_disable_cmci, NULL, 1);
+ mca_cfg.cmci_disabled = true;
+ } else {
+ /* enable cmci */
+ mca_cfg.cmci_disabled = false;
+ on_each_cpu(mce_enable_ce, NULL, 1);
+ }
+ }
+ mutex_unlock(&mce_sysfs_mutex);
+
+ return size;
+}
+
+static ssize_t store_int_with_restart(struct device *s,
+ struct device_attribute *attr,
+ const char *buf, size_t size)
+{
+ unsigned long old_check_interval = check_interval;
+ ssize_t ret = device_store_ulong(s, attr, buf, size);
+
+ if (check_interval == old_check_interval)
+ return ret;
+
+ mutex_lock(&mce_sysfs_mutex);
+ mce_restart();
+ mutex_unlock(&mce_sysfs_mutex);
+
+ return ret;
+}
+
+static DEVICE_INT_ATTR(monarch_timeout, 0644, mca_cfg.monarch_timeout);
+static DEVICE_BOOL_ATTR(dont_log_ce, 0644, mca_cfg.dont_log_ce);
+static DEVICE_BOOL_ATTR(print_all, 0644, mca_cfg.print_all);
+
+static struct dev_ext_attribute dev_attr_check_interval = {
+ __ATTR(check_interval, 0644, device_show_int, store_int_with_restart),
+ &check_interval
+};
+
+static struct dev_ext_attribute dev_attr_ignore_ce = {
+ __ATTR(ignore_ce, 0644, device_show_bool, set_ignore_ce),
+ &mca_cfg.ignore_ce
+};
+
+static struct dev_ext_attribute dev_attr_cmci_disabled = {
+ __ATTR(cmci_disabled, 0644, device_show_bool, set_cmci_disabled),
+ &mca_cfg.cmci_disabled
+};
+
+static struct device_attribute *mce_device_attrs[] = {
+ &dev_attr_check_interval.attr,
+#ifdef CONFIG_X86_MCELOG_LEGACY
+ &dev_attr_trigger,
+#endif
+ &dev_attr_monarch_timeout.attr,
+ &dev_attr_dont_log_ce.attr,
+ &dev_attr_print_all.attr,
+ &dev_attr_ignore_ce.attr,
+ &dev_attr_cmci_disabled.attr,
+ NULL
+};
+
+static cpumask_var_t mce_device_initialized;
+
+static void mce_device_release(struct device *dev)
+{
+ kfree(dev);
+}
+
+/* Per CPU device init. All of the CPUs still share the same bank device: */
+static int mce_device_create(unsigned int cpu)
+{
+ struct device *dev;
+ int err;
+ int i, j;
+
+ dev = per_cpu(mce_device, cpu);
+ if (dev)
+ return 0;
+
+ dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ if (!dev)
+ return -ENOMEM;
+ dev->id = cpu;
+ dev->bus = &mce_subsys;
+ dev->release = &mce_device_release;
+
+ err = device_register(dev);
+ if (err) {
+ put_device(dev);
+ return err;
+ }
+
+ for (i = 0; mce_device_attrs[i]; i++) {
+ err = device_create_file(dev, mce_device_attrs[i]);
+ if (err)
+ goto error;
+ }
+ for (j = 0; j < per_cpu(mce_num_banks, cpu); j++) {
+ err = device_create_file(dev, &mce_bank_devs[j].attr);
+ if (err)
+ goto error2;
+ }
+ cpumask_set_cpu(cpu, mce_device_initialized);
+ per_cpu(mce_device, cpu) = dev;
+
+ return 0;
+error2:
+ while (--j >= 0)
+ device_remove_file(dev, &mce_bank_devs[j].attr);
+error:
+ while (--i >= 0)
+ device_remove_file(dev, mce_device_attrs[i]);
+
+ device_unregister(dev);
+
+ return err;
+}
+
+static void mce_device_remove(unsigned int cpu)
+{
+ struct device *dev = per_cpu(mce_device, cpu);
+ int i;
+
+ if (!cpumask_test_cpu(cpu, mce_device_initialized))
+ return;
+
+ for (i = 0; mce_device_attrs[i]; i++)
+ device_remove_file(dev, mce_device_attrs[i]);
+
+ for (i = 0; i < per_cpu(mce_num_banks, cpu); i++)
+ device_remove_file(dev, &mce_bank_devs[i].attr);
+
+ device_unregister(dev);
+ cpumask_clear_cpu(cpu, mce_device_initialized);
+ per_cpu(mce_device, cpu) = NULL;
+}
+
+/* Make sure there are no machine checks on offlined CPUs. */
+static void mce_disable_cpu(void)
+{
+ if (!mce_available(raw_cpu_ptr(&cpu_info)))
+ return;
+
+ if (!cpuhp_tasks_frozen)
+ cmci_clear();
+
+ vendor_disable_error_reporting();
+}
+
+static void mce_reenable_cpu(void)
+{
+ struct mce_bank *mce_banks = this_cpu_ptr(mce_banks_array);
+ int i;
+
+ if (!mce_available(raw_cpu_ptr(&cpu_info)))
+ return;
+
+ if (!cpuhp_tasks_frozen)
+ cmci_reenable();
+ for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
+ struct mce_bank *b = &mce_banks[i];
+
+ if (b->init)
+ wrmsrq(mca_msr_reg(i, MCA_CTL), b->ctl);
+ }
+}
+
+static int mce_cpu_dead(unsigned int cpu)
+{
+ /* intentionally ignoring frozen here */
+ if (!cpuhp_tasks_frozen)
+ cmci_rediscover();
+ return 0;
+}
+
+static int mce_cpu_online(unsigned int cpu)
+{
+ struct timer_list *t = this_cpu_ptr(&mce_timer);
+
+ mce_device_create(cpu);
+ mce_threshold_create_device(cpu);
+ mce_reenable_cpu();
+ mce_start_timer(t);
+ return 0;
+}
+
+static int mce_cpu_pre_down(unsigned int cpu)
+{
+ struct timer_list *t = this_cpu_ptr(&mce_timer);
+
+ mce_disable_cpu();
+ timer_delete_sync(t);
+ mce_threshold_remove_device(cpu);
+ mce_device_remove(cpu);
+ return 0;
+}
+
+static __init void mce_init_banks(void)
+{
+ int i;
+
+ for (i = 0; i < MAX_NR_BANKS; i++) {
+ struct mce_bank_dev *b = &mce_bank_devs[i];
+ struct device_attribute *a = &b->attr;
+
+ b->bank = i;
+
+ sysfs_attr_init(&a->attr);
+ a->attr.name = b->attrname;
+ snprintf(b->attrname, ATTR_LEN, "bank%d", i);
+
+ a->attr.mode = 0644;
+ a->show = show_bank;
+ a->store = set_bank;
+ }
+}
+
+/*
+ * When running on XEN, this initcall is ordered against the XEN mcelog
+ * initcall:
+ *
+ * device_initcall(xen_late_init_mcelog);
+ * device_initcall_sync(mcheck_init_device);
+ */
+static __init int mcheck_init_device(void)
+{
+ int err;
+
+ /*
+ * Check if we have a spare virtual bit. This will only become
+ * a problem if/when we move beyond 5-level page tables.
+ */
+ MAYBE_BUILD_BUG_ON(__VIRTUAL_MASK_SHIFT >= 63);
+
+ if (!mce_available(&boot_cpu_data)) {
+ err = -EIO;
+ goto err_out;
+ }
+
+ if (!zalloc_cpumask_var(&mce_device_initialized, GFP_KERNEL)) {
+ err = -ENOMEM;
+ goto err_out;
+ }
+
+ mce_init_banks();
+
+ err = subsys_system_register(&mce_subsys, NULL);
+ if (err)
+ goto err_out_mem;
+
+ err = cpuhp_setup_state(CPUHP_X86_MCE_DEAD, "x86/mce:dead", NULL,
+ mce_cpu_dead);
+ if (err)
+ goto err_out_mem;
+
+ /*
+ * Invokes mce_cpu_online() on all CPUs which are online when
+ * the state is installed.
+ */
+ err = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/mce:online",
+ mce_cpu_online, mce_cpu_pre_down);
+ if (err < 0)
+ goto err_out_online;
+
+ register_syscore(&mce_syscore);
+
+ return 0;
+
+err_out_online:
+ cpuhp_remove_state(CPUHP_X86_MCE_DEAD);
+
+err_out_mem:
+ free_cpumask_var(mce_device_initialized);
+
+err_out:
+ pr_err("Unable to init MCE device (rc: %d)\n", err);
+
+ return err;
+}
+device_initcall_sync(mcheck_init_device);
+
+/*
+ * Old style boot options parsing. Only for compatibility.
+ */
+static int __init mcheck_disable(char *str)
+{
+ mca_cfg.disabled = 1;
+ return 1;
+}
+__setup("nomce", mcheck_disable);
+
+#ifdef CONFIG_DEBUG_FS
+struct dentry *mce_get_debugfs_dir(void)
+{
+ static struct dentry *dmce;
+
+ if (!dmce)
+ dmce = debugfs_create_dir("mce", NULL);
+
+ return dmce;
+}
+
+static void mce_reset(void)
+{
+ atomic_set(&mce_fake_panicked, 0);
+ atomic_set(&mce_executing, 0);
+ atomic_set(&mce_callin, 0);
+ atomic_set(&global_nwo, 0);
+ cpumask_setall(&mce_missing_cpus);
+}
+
+static int fake_panic_get(void *data, u64 *val)
+{
+ *val = fake_panic;
+ return 0;
+}
+
+static int fake_panic_set(void *data, u64 val)
+{
+ mce_reset();
+ fake_panic = val;
+ return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fake_panic_fops, fake_panic_get, fake_panic_set,
+ "%llu\n");
+
+static void __init mcheck_debugfs_init(void)
+{
+ struct dentry *dmce;
+
+ dmce = mce_get_debugfs_dir();
+ debugfs_create_file_unsafe("fake_panic", 0444, dmce, NULL,
+ &fake_panic_fops);
+}
+#else
+static void __init mcheck_debugfs_init(void) { }
+#endif
+
+static int __init mcheck_late_init(void)
+{
+ if (mca_cfg.recovery)
+ enable_copy_mc_fragile();
+
+ mcheck_debugfs_init();
+
+ /*
+ * Flush out everything that has been logged during early boot, now that
+ * everything has been initialized (workqueues, decoders, ...).
+ */
+ mce_schedule_work();
+
+ return 0;
+}
+late_initcall(mcheck_late_init);
diff --git a/arch/x86/kernel/cpu/mce/dev-mcelog.c b/arch/x86/kernel/cpu/mce/dev-mcelog.c
new file mode 100644
index 000000000000..8d023239ce18
--- /dev/null
+++ b/arch/x86/kernel/cpu/mce/dev-mcelog.c
@@ -0,0 +1,365 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * /dev/mcelog driver
+ *
+ * K8 parts Copyright 2002,2003 Andi Kleen, SuSE Labs.
+ * Rest from unknown author(s).
+ * 2004 Andi Kleen. Rewrote most of it.
+ * Copyright 2008 Intel Corporation
+ * Author: Andi Kleen
+ */
+
+#include <linux/miscdevice.h>
+#include <linux/slab.h>
+#include <linux/kmod.h>
+#include <linux/poll.h>
+
+#include "internal.h"
+
+static BLOCKING_NOTIFIER_HEAD(mce_injector_chain);
+
+static DEFINE_MUTEX(mce_chrdev_read_mutex);
+
+static char mce_helper[128];
+static char *mce_helper_argv[2] = { mce_helper, NULL };
+
+/*
+ * Lockless MCE logging infrastructure.
+ * This avoids deadlocks on printk locks without having to break locks. It
+ * also keeps MCEs separate from kernel messages to avoid bogus bug reports.
+ */
+
+static struct mce_log_buffer *mcelog;
+
+static DECLARE_WAIT_QUEUE_HEAD(mce_chrdev_wait);
+
+static int dev_mce_log(struct notifier_block *nb, unsigned long val,
+ void *data)
+{
+ struct mce *mce = (struct mce *)data;
+ unsigned int entry;
+
+ if (mce->kflags & MCE_HANDLED_CEC)
+ return NOTIFY_DONE;
+
+ mutex_lock(&mce_chrdev_read_mutex);
+
+ entry = mcelog->next;
+
+ /*
+ * When the buffer fills up discard new entries. Assume that the
+ * earlier errors are the more interesting ones:
+ */
+ if (entry >= mcelog->len) {
+ set_bit(MCE_OVERFLOW, (unsigned long *)&mcelog->flags);
+ goto unlock;
+ }
+
+ mcelog->next = entry + 1;
+
+ memcpy(mcelog->entry + entry, mce, sizeof(struct mce));
+ mcelog->entry[entry].finished = 1;
+ mcelog->entry[entry].kflags = 0;
+
+ /* wake processes polling /dev/mcelog */
+ wake_up_interruptible(&mce_chrdev_wait);
+
+unlock:
+ mutex_unlock(&mce_chrdev_read_mutex);
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD)
+ mce->kflags |= MCE_HANDLED_MCELOG;
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block dev_mcelog_nb = {
+ .notifier_call = dev_mce_log,
+ .priority = MCE_PRIO_MCELOG,
+};
+
+static void mce_do_trigger(struct work_struct *work)
+{
+ call_usermodehelper(mce_helper, mce_helper_argv, NULL, UMH_NO_WAIT);
+}
+
+static DECLARE_WORK(mce_trigger_work, mce_do_trigger);
+
+void mce_work_trigger(void)
+{
+ if (mce_helper[0])
+ schedule_work(&mce_trigger_work);
+}
+
+static ssize_t
+show_trigger(struct device *s, struct device_attribute *attr, char *buf)
+{
+ strcpy(buf, mce_helper);
+ strcat(buf, "\n");
+ return strlen(mce_helper) + 1;
+}
+
+static ssize_t set_trigger(struct device *s, struct device_attribute *attr,
+ const char *buf, size_t siz)
+{
+ char *p;
+
+ strscpy(mce_helper, buf, sizeof(mce_helper));
+ p = strchr(mce_helper, '\n');
+
+ if (p)
+ *p = 0;
+
+ return strlen(mce_helper) + !!p;
+}
+
+DEVICE_ATTR(trigger, 0644, show_trigger, set_trigger);
+
+/*
+ * mce_chrdev: Character device /dev/mcelog to read and clear the MCE log.
+ */
+
+static DEFINE_SPINLOCK(mce_chrdev_state_lock);
+static int mce_chrdev_open_count; /* #times opened */
+static int mce_chrdev_open_exclu; /* already open exclusive? */
+
+static int mce_chrdev_open(struct inode *inode, struct file *file)
+{
+ spin_lock(&mce_chrdev_state_lock);
+
+ if (mce_chrdev_open_exclu ||
+ (mce_chrdev_open_count && (file->f_flags & O_EXCL))) {
+ spin_unlock(&mce_chrdev_state_lock);
+
+ return -EBUSY;
+ }
+
+ if (file->f_flags & O_EXCL)
+ mce_chrdev_open_exclu = 1;
+ mce_chrdev_open_count++;
+
+ spin_unlock(&mce_chrdev_state_lock);
+
+ return nonseekable_open(inode, file);
+}
+
+static int mce_chrdev_release(struct inode *inode, struct file *file)
+{
+ spin_lock(&mce_chrdev_state_lock);
+
+ mce_chrdev_open_count--;
+ mce_chrdev_open_exclu = 0;
+
+ spin_unlock(&mce_chrdev_state_lock);
+
+ return 0;
+}
+
+static int mce_apei_read_done;
+
+/* Collect MCE record of previous boot in persistent storage via APEI ERST. */
+static int __mce_read_apei(char __user **ubuf, size_t usize)
+{
+ int rc;
+ u64 record_id;
+ struct mce m;
+
+ if (usize < sizeof(struct mce))
+ return -EINVAL;
+
+ rc = apei_read_mce(&m, &record_id);
+ /* Error or no more MCE record */
+ if (rc <= 0) {
+ mce_apei_read_done = 1;
+ /*
+ * When ERST is disabled, mce_chrdev_read() should return
+ * "no record" instead of "no device."
+ */
+ if (rc == -ENODEV)
+ return 0;
+ return rc;
+ }
+ rc = -EFAULT;
+ if (copy_to_user(*ubuf, &m, sizeof(struct mce)))
+ return rc;
+ /*
+ * Ideally the record would be cleared only after it has been
+ * flushed to disk or sent over the network by /sbin/mcelog, but
+ * we have no interface to support that now, so just clear it
+ * here to avoid duplication.
+ */
+ rc = apei_clear_mce(record_id);
+ if (rc) {
+ mce_apei_read_done = 1;
+ return rc;
+ }
+ *ubuf += sizeof(struct mce);
+
+ return 0;
+}
+
+static ssize_t mce_chrdev_read(struct file *filp, char __user *ubuf,
+ size_t usize, loff_t *off)
+{
+ char __user *buf = ubuf;
+ unsigned next;
+ int i, err;
+
+ mutex_lock(&mce_chrdev_read_mutex);
+
+ if (!mce_apei_read_done) {
+ err = __mce_read_apei(&buf, usize);
+ if (err || buf != ubuf)
+ goto out;
+ }
+
+ /* Only supports full reads right now */
+ err = -EINVAL;
+ if (*off != 0 || usize < mcelog->len * sizeof(struct mce))
+ goto out;
+
+ next = mcelog->next;
+ err = 0;
+
+ for (i = 0; i < next; i++) {
+ struct mce *m = &mcelog->entry[i];
+
+ err |= copy_to_user(buf, m, sizeof(*m));
+ buf += sizeof(*m);
+ }
+
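+ /* Reads are destructive: drain the buffer and reset it for new entries. */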
+ memset(mcelog->entry, 0, next * sizeof(struct mce));
+ mcelog->next = 0;
+
+ if (err)
+ err = -EFAULT;
+
+out:
+ mutex_unlock(&mce_chrdev_read_mutex);
+
+ return err ? err : buf - ubuf;
+}
+
+static __poll_t mce_chrdev_poll(struct file *file, poll_table *wait)
+{
+ poll_wait(file, &mce_chrdev_wait, wait);
+ if (READ_ONCE(mcelog->next))
+ return EPOLLIN | EPOLLRDNORM;
+ if (!mce_apei_read_done && apei_check_mce())
+ return EPOLLIN | EPOLLRDNORM;
+ return 0;
+}
+
+static long mce_chrdev_ioctl(struct file *f, unsigned int cmd,
+ unsigned long arg)
+{
+ int __user *p = (int __user *)arg;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ switch (cmd) {
+ case MCE_GET_RECORD_LEN:
+ return put_user(sizeof(struct mce), p);
+ case MCE_GET_LOG_LEN:
+ return put_user(mcelog->len, p);
+ case MCE_GETCLEAR_FLAGS:
+ return put_user(xchg(&mcelog->flags, 0), p);
+ default:
+ return -ENOTTY;
+ }
+}
+
+void mce_register_injector_chain(struct notifier_block *nb)
+{
+ blocking_notifier_chain_register(&mce_injector_chain, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_injector_chain);
+
+void mce_unregister_injector_chain(struct notifier_block *nb)
+{
+ blocking_notifier_chain_unregister(&mce_injector_chain, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_injector_chain);
+
+static ssize_t mce_chrdev_write(struct file *filp, const char __user *ubuf,
+ size_t usize, loff_t *off)
+{
+ struct mce m;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ /*
+ * Without MCE/MCA support there are cases where real MSR
+ * reads could slip through, so refuse injection outright.
+ */
+ if (!boot_cpu_has(X86_FEATURE_MCE) || !boot_cpu_has(X86_FEATURE_MCA))
+ return -EIO;
+
+ if ((unsigned long)usize > sizeof(struct mce))
+ usize = sizeof(struct mce);
+ if (copy_from_user(&m, ubuf, usize))
+ return -EFAULT;
+
+ if (m.extcpu >= num_possible_cpus() || !cpu_online(m.extcpu))
+ return -EINVAL;
+
+ /*
+ * Need to give user space some time to set everything up,
+ * so do it a jiffy or two later everywhere.
+ */
+ schedule_timeout_uninterruptible(2);
+
+ blocking_notifier_call_chain(&mce_injector_chain, 0, &m);
+
+ return usize;
+}
+
+static const struct file_operations mce_chrdev_ops = {
+ .open = mce_chrdev_open,
+ .release = mce_chrdev_release,
+ .read = mce_chrdev_read,
+ .write = mce_chrdev_write,
+ .poll = mce_chrdev_poll,
+ .unlocked_ioctl = mce_chrdev_ioctl,
+ .compat_ioctl = compat_ptr_ioctl,
+};
+
+static struct miscdevice mce_chrdev_device = {
+ MISC_MCELOG_MINOR,
+ "mcelog",
+ &mce_chrdev_ops,
+};
+
+static __init int dev_mcelog_init_device(void)
+{
+ int mce_log_len;
+ int err;
+
+ mce_log_len = max(MCE_LOG_MIN_LEN, num_online_cpus());
+ mcelog = kzalloc(struct_size(mcelog, entry, mce_log_len), GFP_KERNEL);
+ if (!mcelog)
+ return -ENOMEM;
+
+ memcpy(mcelog->signature, MCE_LOG_SIGNATURE, sizeof(mcelog->signature));
+ mcelog->len = mce_log_len;
+ mcelog->recordlen = sizeof(struct mce);
+
+ /* register character device /dev/mcelog */
+ err = misc_register(&mce_chrdev_device);
+ if (err) {
+ if (err == -EBUSY)
+ /* Xen dom0 might have registered the device already. */
+ pr_info("Unable to init device /dev/mcelog, already registered");
+ else
+ pr_err("Unable to init device /dev/mcelog (rc: %d)\n", err);
+
+ kfree(mcelog);
+ return err;
+ }
+
+ mce_register_decode_chain(&dev_mcelog_nb);
+ return 0;
+}
+device_initcall_sync(dev_mcelog_init_device);
diff --git a/arch/x86/kernel/cpu/mce/genpool.c b/arch/x86/kernel/cpu/mce/genpool.c
new file mode 100644
index 000000000000..3ca9c007a666
--- /dev/null
+++ b/arch/x86/kernel/cpu/mce/genpool.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MCE event pool management in MCE context
+ *
+ * Copyright (C) 2015 Intel Corp.
+ * Author: Chen, Gong <gong.chen@linux.intel.com>
+ */
+#include <linux/smp.h>
+#include <linux/mm.h>
+#include <linux/genalloc.h>
+#include <linux/llist.h>
+#include "internal.h"
+
+/*
+ * printk() is not safe in MCE context. This is a lock-less memory allocator
+ * used to save error information organized in a lock-less list.
+ *
+ * This memory pool is only to be used to save MCE records in MCE context.
+ * MCE events are rare, so a fixed size memory pool should be enough.
+ * Allocate on a sliding scale based on number of CPUs.
+ */
+#define MCE_MIN_ENTRIES 80
+#define MCE_PER_CPU 2
+
+static struct gen_pool *mce_evt_pool;
+static LLIST_HEAD(mce_event_llist);
+
+/*
+ * Compare the record "t" with each of the records on list "l" to see if
+ * an equivalent one is present in the list.
+ */
+static bool is_duplicate_mce_record(struct mce_evt_llist *t, struct mce_evt_llist *l)
+{
+ struct mce_hw_err *err1, *err2;
+ struct mce_evt_llist *node;
+
+ err1 = &t->err;
+
+ llist_for_each_entry(node, &l->llnode, llnode) {
+ err2 = &node->err;
+
+ if (!mce_cmp(&err1->m, &err2->m))
+ return true;
+ }
+ return false;
+}
+
+/*
+ * The system has panicked - we'd like to peruse the list of MCE records
+ * that have been queued, but not seen by anyone yet. The list is in
+ * reverse time order, so we need to reverse it. While doing that we can
+ * also drop duplicate records (these were logged because some banks are
+ * shared between cores or by all threads on a socket).
+ */
+struct llist_node *mce_gen_pool_prepare_records(void)
+{
+ struct llist_node *head;
+ LLIST_HEAD(new_head);
+ struct mce_evt_llist *node, *t;
+
+ head = llist_del_all(&mce_event_llist);
+ if (!head)
+ return NULL;
+
+ /* squeeze out duplicates while reversing order */
+ llist_for_each_entry_safe(node, t, head, llnode) {
+ if (!is_duplicate_mce_record(node, t))
+ llist_add(&node->llnode, &new_head);
+ }
+
+ return new_head.first;
+}
+
+void mce_gen_pool_process(struct work_struct *__unused)
+{
+ struct mce_evt_llist *node, *tmp;
+ struct llist_node *head;
+ struct mce *mce;
+
+ head = llist_del_all(&mce_event_llist);
+ if (!head)
+ return;
+
+ head = llist_reverse_order(head);
+ llist_for_each_entry_safe(node, tmp, head, llnode) {
+ mce = &node->err.m;
+ blocking_notifier_call_chain(&x86_mce_decoder_chain, 0, mce);
+ gen_pool_free(mce_evt_pool, (unsigned long)node, sizeof(*node));
+ }
+}
+
+bool mce_gen_pool_empty(void)
+{
+ return llist_empty(&mce_event_llist);
+}
+
+bool mce_gen_pool_add(struct mce_hw_err *err)
+{
+ struct mce_evt_llist *node;
+
+ if (filter_mce(&err->m))
+ return false;
+
+ if (!mce_evt_pool)
+ return false;
+
+ node = (void *)gen_pool_alloc(mce_evt_pool, sizeof(*node));
+ if (!node) {
+ pr_warn_ratelimited("MCE records pool full!\n");
+ return false;
+ }
+
+ memcpy(&node->err, err, sizeof(*err));
+ llist_add(&node->llnode, &mce_event_llist);
+
+ return true;
+}
+
+static bool mce_gen_pool_create(void)
+{
+ int mce_numrecords, mce_poolsz, order;
+ struct gen_pool *gpool;
+ void *mce_pool;
+
+ order = order_base_2(sizeof(struct mce_evt_llist));
+ gpool = gen_pool_create(order, -1);
+ if (!gpool)
+ return false;
+
+ mce_numrecords = max(MCE_MIN_ENTRIES, num_possible_cpus() * MCE_PER_CPU);
+ mce_poolsz = mce_numrecords * (1 << order);
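+ /* E.g. 128 possible CPUs: max(80, 2 * 128) = 256 records of 2^order bytes each. */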
+ mce_pool = kmalloc(mce_poolsz, GFP_KERNEL);
+ if (!mce_pool) {
+ gen_pool_destroy(gpool);
+ return false;
+ }
+
+ if (gen_pool_add(gpool, (unsigned long)mce_pool, mce_poolsz, -1)) {
+ gen_pool_destroy(gpool);
+ kfree(mce_pool);
+ return false;
+ }
+
+ mce_evt_pool = gpool;
+
+ return true;
+}
+
+bool mce_gen_pool_init(void)
+{
+ /* Just init mce_gen_pool once. */
+ if (mce_evt_pool)
+ return true;
+
+ return mce_gen_pool_create();
+}
diff --git a/arch/x86/kernel/cpu/mce/inject.c b/arch/x86/kernel/cpu/mce/inject.c
new file mode 100644
index 000000000000..d02c4f556cd0
--- /dev/null
+++ b/arch/x86/kernel/cpu/mce/inject.c
@@ -0,0 +1,805 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Machine check injection support.
+ * Copyright 2008 Intel Corporation.
+ *
+ * Authors:
+ * Andi Kleen
+ * Ying Huang
+ *
+ * The AMD part (from mce_amd_inj.c): a simple MCE injection facility
+ * for testing different aspects of the RAS code. This driver should be
+ * built as module so that it can be loaded on production kernels for
+ * testing purposes.
+ *
+ * Copyright (c) 2010-17: Borislav Petkov <bp@alien8.de>
+ * Advanced Micro Devices Inc.
+ */
+
+#include <linux/cpu.h>
+#include <linux/debugfs.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+
+#include <asm/amd/nb.h>
+#include <asm/apic.h>
+#include <asm/irq_vectors.h>
+#include <asm/mce.h>
+#include <asm/msr.h>
+#include <asm/nmi.h>
+#include <asm/smp.h>
+
+#include "internal.h"
+
+static bool hw_injection_possible = true;
+
+/*
+ * Collect all the MCi_XXX settings
+ */
+static struct mce i_mce;
+static struct dentry *dfs_inj;
+
+#define MAX_FLAG_OPT_SIZE 4
+#define NBCFG 0x44
+
+enum injection_type {
+ SW_INJ = 0, /* SW injection, simply decode the error */
+ HW_INJ, /* Trigger a #MC */
+ DFR_INT_INJ, /* Trigger Deferred error interrupt */
+ THR_INT_INJ, /* Trigger threshold interrupt */
+ N_INJ_TYPES,
+};
+
+static const char * const flags_options[] = {
+ [SW_INJ] = "sw",
+ [HW_INJ] = "hw",
+ [DFR_INT_INJ] = "df",
+ [THR_INT_INJ] = "th",
+ NULL
+};
+
+/* Set default injection to SW_INJ */
+static enum injection_type inj_type = SW_INJ;
+
+#define MCE_INJECT_SET(reg) \
+static int inj_##reg##_set(void *data, u64 val) \
+{ \
+ struct mce *m = (struct mce *)data; \
+ \
+ m->reg = val; \
+ return 0; \
+}
+
+MCE_INJECT_SET(status);
+MCE_INJECT_SET(misc);
+MCE_INJECT_SET(addr);
+MCE_INJECT_SET(synd);
+
+#define MCE_INJECT_GET(reg) \
+static int inj_##reg##_get(void *data, u64 *val) \
+{ \
+ struct mce *m = (struct mce *)data; \
+ \
+ *val = m->reg; \
+ return 0; \
+}
+
+MCE_INJECT_GET(status);
+MCE_INJECT_GET(misc);
+MCE_INJECT_GET(addr);
+MCE_INJECT_GET(synd);
+MCE_INJECT_GET(ipid);
+
+DEFINE_SIMPLE_ATTRIBUTE(status_fops, inj_status_get, inj_status_set, "%llx\n");
+DEFINE_SIMPLE_ATTRIBUTE(misc_fops, inj_misc_get, inj_misc_set, "%llx\n");
+DEFINE_SIMPLE_ATTRIBUTE(addr_fops, inj_addr_get, inj_addr_set, "%llx\n");
+DEFINE_SIMPLE_ATTRIBUTE(synd_fops, inj_synd_get, inj_synd_set, "%llx\n");
+
+/* Use the user provided IPID value on a sw injection. */
+static int inj_ipid_set(void *data, u64 val)
+{
+ struct mce *m = (struct mce *)data;
+
+ if (cpu_feature_enabled(X86_FEATURE_SMCA)) {
+ if (inj_type == SW_INJ)
+ m->ipid = val;
+ }
+
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(ipid_fops, inj_ipid_get, inj_ipid_set, "%llx\n");
+
+static void setup_inj_struct(struct mce *m)
+{
+ memset(m, 0, sizeof(struct mce));
+
+ m->cpuvendor = boot_cpu_data.x86_vendor;
+ m->time = ktime_get_real_seconds();
+ m->cpuid = cpuid_eax(1);
+ m->microcode = boot_cpu_data.microcode;
+}
+
+/* Update fake mce registers on current CPU. */
+static void inject_mce(struct mce *m)
+{
+ struct mce *i = &per_cpu(injectm, m->extcpu);
+
+ /* Make sure no one reads partially written injectm */
+ i->finished = 0;
+ mb();
+ m->finished = 0;
+ /* First set the fields after finished */
+ i->extcpu = m->extcpu;
+ mb();
+ /* Now write record in order, finished last (except above) */
+ memcpy(i, m, sizeof(struct mce));
+ /* Finally activate it */
+ mb();
+ i->finished = 1;
+}
+
+static void raise_poll(struct mce *m)
+{
+ unsigned long flags;
+ mce_banks_t b;
+
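+ /* All bits set: poll every bank. */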
+ memset(&b, 0xff, sizeof(mce_banks_t));
+ local_irq_save(flags);
+ machine_check_poll(0, &b);
+ local_irq_restore(flags);
+ m->finished = 0;
+}
+
+static void raise_exception(struct mce *m, struct pt_regs *pregs)
+{
+ struct pt_regs regs;
+ unsigned long flags;
+
+ if (!pregs) {
+ memset(&regs, 0, sizeof(struct pt_regs));
+ regs.ip = m->ip;
+ regs.cs = m->cs;
+ pregs = &regs;
+ }
+ /* do_machine_check() expects interrupts disabled -- at least */
+ local_irq_save(flags);
+ do_machine_check(pregs);
+ local_irq_restore(flags);
+ m->finished = 0;
+}
+
+static cpumask_var_t mce_inject_cpumask;
+static DEFINE_MUTEX(mce_inject_mutex);
+
+static int mce_raise_notify(unsigned int cmd, struct pt_regs *regs)
+{
+ int cpu = smp_processor_id();
+ struct mce *m = this_cpu_ptr(&injectm);
+ if (!cpumask_test_cpu(cpu, mce_inject_cpumask))
+ return NMI_DONE;
+ cpumask_clear_cpu(cpu, mce_inject_cpumask);
+ if (m->inject_flags & MCJ_EXCEPTION)
+ raise_exception(m, regs);
+ else if (m->status)
+ raise_poll(m);
+ return NMI_HANDLED;
+}
+
+static void mce_irq_ipi(void *info)
+{
+ int cpu = smp_processor_id();
+ struct mce *m = this_cpu_ptr(&injectm);
+
+ if (cpumask_test_cpu(cpu, mce_inject_cpumask) &&
+ m->inject_flags & MCJ_EXCEPTION) {
+ cpumask_clear_cpu(cpu, mce_inject_cpumask);
+ raise_exception(m, NULL);
+ }
+}
+
+/* Inject mce on current CPU */
+static int raise_local(void)
+{
+ struct mce *m = this_cpu_ptr(&injectm);
+ int context = MCJ_CTX(m->inject_flags);
+ int ret = 0;
+ int cpu = m->extcpu;
+
+ if (m->inject_flags & MCJ_EXCEPTION) {
+ pr_info("Triggering MCE exception on CPU %d\n", cpu);
+ switch (context) {
+ case MCJ_CTX_IRQ:
+ /*
+ * Could do more to fake interrupts like
+ * calling irq_enter, but the necessary
+ * machinery isn't exported currently.
+ */
+ fallthrough;
+ case MCJ_CTX_PROCESS:
+ raise_exception(m, NULL);
+ break;
+ default:
+ pr_info("Invalid MCE context\n");
+ ret = -EINVAL;
+ }
+ pr_info("MCE exception done on CPU %d\n", cpu);
+ } else if (m->status) {
+ pr_info("Starting machine check poll CPU %d\n", cpu);
+ raise_poll(m);
+ pr_info("Machine check poll done on CPU %d\n", cpu);
+ } else
+ m->finished = 0;
+
+ return ret;
+}
+
+static void __maybe_unused raise_mce(struct mce *m)
+{
+ int context = MCJ_CTX(m->inject_flags);
+
+ inject_mce(m);
+
+ if (context == MCJ_CTX_RANDOM)
+ return;
+
+ if (m->inject_flags & (MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST)) {
+ unsigned long start;
+ int cpu;
+
+ cpus_read_lock();
+ cpumask_copy(mce_inject_cpumask, cpu_online_mask);
+ cpumask_clear_cpu(get_cpu(), mce_inject_cpumask);
+ for_each_online_cpu(cpu) {
+ struct mce *mcpu = &per_cpu(injectm, cpu);
+ if (!mcpu->finished ||
+ MCJ_CTX(mcpu->inject_flags) != MCJ_CTX_RANDOM)
+ cpumask_clear_cpu(cpu, mce_inject_cpumask);
+ }
+ if (!cpumask_empty(mce_inject_cpumask)) {
+ if (m->inject_flags & MCJ_IRQ_BROADCAST) {
+ /*
+ * Don't wait for completion: the remote handlers
+ * must run in sync with the raise_local() below.
+ */
+ preempt_disable();
+ smp_call_function_many(mce_inject_cpumask,
+ mce_irq_ipi, NULL, 0);
+ preempt_enable();
+ } else if (m->inject_flags & MCJ_NMI_BROADCAST)
+ __apic_send_IPI_mask(mce_inject_cpumask, NMI_VECTOR);
+ }
+ start = jiffies;
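+ /* Allow up to 2s for every targeted CPU to consume its injected MCE. */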
+ while (!cpumask_empty(mce_inject_cpumask)) {
+ if (!time_before(jiffies, start + 2*HZ)) {
+ pr_err("Timeout waiting for mce inject %lx\n",
+ *cpumask_bits(mce_inject_cpumask));
+ break;
+ }
+ cpu_relax();
+ }
+ raise_local();
+ put_cpu();
+ cpus_read_unlock();
+ } else {
+ preempt_disable();
+ raise_local();
+ preempt_enable();
+ }
+}
+
+static int mce_inject_raise(struct notifier_block *nb, unsigned long val,
+ void *data)
+{
+ struct mce *m = (struct mce *)data;
+
+ if (!m)
+ return NOTIFY_DONE;
+
+ mutex_lock(&mce_inject_mutex);
+ raise_mce(m);
+ mutex_unlock(&mce_inject_mutex);
+
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block inject_nb = {
+ .notifier_call = mce_inject_raise,
+};
+
+/*
+ * Caller needs to make sure this CPU doesn't disappear
+ * from under us, i.e. via get_cpu()/put_cpu().
+ */
+static int toggle_hw_mce_inject(unsigned int cpu, bool enable)
+{
+ u32 l, h;
+ int err;
+
+ err = rdmsr_on_cpu(cpu, MSR_K7_HWCR, &l, &h);
+ if (err) {
+ pr_err("%s: error reading HWCR\n", __func__);
+ return err;
+ }
+
+ enable ? (l |= BIT(18)) : (l &= ~BIT(18));
+
+ err = wrmsr_on_cpu(cpu, MSR_K7_HWCR, l, h);
+ if (err)
+ pr_err("%s: error writing HWCR\n", __func__);
+
+ return err;
+}
+
+static int __set_inj(const char *buf)
+{
+ int i;
+
+ for (i = 0; i < N_INJ_TYPES; i++) {
+ if (!strncmp(flags_options[i], buf, strlen(flags_options[i]))) {
+ if (i > SW_INJ && !hw_injection_possible)
+ continue;
+ inj_type = i;
+ return 0;
+ }
+ }
+ return -EINVAL;
+}
+
+static ssize_t flags_read(struct file *filp, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ char buf[MAX_FLAG_OPT_SIZE];
+ int n;
+
+ n = sprintf(buf, "%s\n", flags_options[inj_type]);
+
+ return simple_read_from_buffer(ubuf, cnt, ppos, buf, n);
+}
+
+static ssize_t flags_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ char buf[MAX_FLAG_OPT_SIZE], *__buf;
+ int err;
+
+ if (!cnt || cnt > MAX_FLAG_OPT_SIZE)
+ return -EINVAL;
+
+ if (copy_from_user(&buf, ubuf, cnt))
+ return -EFAULT;
+
+ buf[cnt - 1] = 0;
+
+ /* strip whitespace */
+ __buf = strstrip(buf);
+
+ err = __set_inj(__buf);
+ if (err) {
+ pr_err("%s: Invalid flags value: %s\n", __func__, __buf);
+ return err;
+ }
+
+ *ppos += cnt;
+
+ return cnt;
+}
+
+static const struct file_operations flags_fops = {
+ .read = flags_read,
+ .write = flags_write,
+ .llseek = generic_file_llseek,
+};
+
+/*
+ * On which CPU to inject?
+ */
+MCE_INJECT_GET(extcpu);
+
+static int inj_extcpu_set(void *data, u64 val)
+{
+ struct mce *m = (struct mce *)data;
+
+ if (val >= nr_cpu_ids || !cpu_online(val)) {
+ pr_err("%s: Invalid CPU: %llu\n", __func__, val);
+ return -EINVAL;
+ }
+ m->extcpu = val;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(extcpu_fops, inj_extcpu_get, inj_extcpu_set, "%llu\n");
+
+static void trigger_mce(void *info)
+{
+ asm volatile("int $18");
+}
+
+static void trigger_dfr_int(void *info)
+{
+ asm volatile("int %0" :: "i" (DEFERRED_ERROR_VECTOR));
+}
+
+static void trigger_thr_int(void *info)
+{
+ asm volatile("int %0" :: "i" (THRESHOLD_APIC_VECTOR));
+}
+
+static u32 get_nbc_for_node(int node_id)
+{
+ u32 cores_per_node;
+
+ cores_per_node = topology_num_threads_per_package() / topology_amd_nodes_per_pkg();
+ return cores_per_node * node_id;
+}
+
+static void toggle_nb_mca_mst_cpu(u16 nid)
+{
+ struct amd_northbridge *nb;
+ struct pci_dev *F3;
+ u32 val;
+ int err;
+
+ nb = node_to_amd_nb(nid);
+ if (!nb)
+ return;
+
+ F3 = nb->misc;
+ if (!F3)
+ return;
+
+ err = pci_read_config_dword(F3, NBCFG, &val);
+ if (err) {
+ pr_err("%s: Error reading F%dx%03x.\n",
+ __func__, PCI_FUNC(F3->devfn), NBCFG);
+ return;
+ }
+
+ if (val & BIT(27))
+ return;
+
+ pr_err("%s: Set D18F3x44[NbMcaToMstCpuEn] which BIOS hasn't done.\n",
+ __func__);
+
+ val |= BIT(27);
+ err = pci_write_config_dword(F3, NBCFG, val);
+ if (err)
+ pr_err("%s: Error writing F%dx%03x.\n",
+ __func__, PCI_FUNC(F3->devfn), NBCFG);
+}
+
+static void prepare_msrs(void *info)
+{
+ struct mce m = *(struct mce *)info;
+ u8 b = m.bank;
+
+ wrmsrq(MSR_IA32_MCG_STATUS, m.mcgstatus);
+
+ if (boot_cpu_has(X86_FEATURE_SMCA)) {
+ if (m.inject_flags == DFR_INT_INJ) {
+ wrmsrq(MSR_AMD64_SMCA_MCx_DESTAT(b), m.status);
+ wrmsrq(MSR_AMD64_SMCA_MCx_DEADDR(b), m.addr);
+ } else {
+ wrmsrq(MSR_AMD64_SMCA_MCx_STATUS(b), m.status);
+ wrmsrq(MSR_AMD64_SMCA_MCx_ADDR(b), m.addr);
+ }
+
+ wrmsrq(MSR_AMD64_SMCA_MCx_SYND(b), m.synd);
+
+ if (m.misc)
+ wrmsrq(MSR_AMD64_SMCA_MCx_MISC(b), m.misc);
+ } else {
+ wrmsrq(MSR_IA32_MCx_STATUS(b), m.status);
+ wrmsrq(MSR_IA32_MCx_ADDR(b), m.addr);
+
+ if (m.misc)
+ wrmsrq(MSR_IA32_MCx_MISC(b), m.misc);
+ }
+}
+
+static void do_inject(void)
+{
+ unsigned int cpu = i_mce.extcpu;
+ struct mce_hw_err err;
+ u64 mcg_status = 0;
+ u8 b = i_mce.bank;
+
+ i_mce.tsc = rdtsc_ordered();
+
+ i_mce.status |= MCI_STATUS_VAL;
+
+ if (i_mce.misc)
+ i_mce.status |= MCI_STATUS_MISCV;
+
+ if (i_mce.synd)
+ i_mce.status |= MCI_STATUS_SYNDV;
+
+ if (inj_type == SW_INJ) {
+ err.m = i_mce;
+ mce_log(&err);
+ return;
+ }
+
+ /* prep MCE global settings for the injection */
+ mcg_status = MCG_STATUS_MCIP | MCG_STATUS_EIPV;
+
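+ /* PCC (processor context corrupt) clear => the restart IP is still valid. */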
+ if (!(i_mce.status & MCI_STATUS_PCC))
+ mcg_status |= MCG_STATUS_RIPV;
+
+ /*
+ * Ensure necessary status bits for deferred errors:
+ * - MCx_STATUS[Deferred]: make sure it is a deferred error
+ * - MCx_STATUS[UC] cleared: deferred errors are _not_ UC
+ */
+ if (inj_type == DFR_INT_INJ) {
+ i_mce.status |= MCI_STATUS_DEFERRED;
+ i_mce.status &= ~MCI_STATUS_UC;
+ }
+
+ /*
+ * For multi node CPUs, logging and reporting of bank 4 errors happens
+ * only on the node base core. Refer to D18F3x44[NbMcaToMstCpuEn] for
+ * Fam10h and later BKDGs.
+ */
+ if (boot_cpu_has(X86_FEATURE_AMD_DCM) &&
+ b == 4 &&
+ boot_cpu_data.x86 < 0x17) {
+ toggle_nb_mca_mst_cpu(topology_amd_node_id(cpu));
+ cpu = get_nbc_for_node(topology_amd_node_id(cpu));
+ }
+
+ cpus_read_lock();
+ if (!cpu_online(cpu))
+ goto err;
+
+ toggle_hw_mce_inject(cpu, true);
+
+ i_mce.mcgstatus = mcg_status;
+ i_mce.inject_flags = inj_type;
+ smp_call_function_single(cpu, prepare_msrs, &i_mce, 0);
+
+ toggle_hw_mce_inject(cpu, false);
+
+ switch (inj_type) {
+ case DFR_INT_INJ:
+ smp_call_function_single(cpu, trigger_dfr_int, NULL, 0);
+ break;
+ case THR_INT_INJ:
+ smp_call_function_single(cpu, trigger_thr_int, NULL, 0);
+ break;
+ default:
+ smp_call_function_single(cpu, trigger_mce, NULL, 0);
+ }
+
+err:
+ cpus_read_unlock();
+}
+
+/*
+ * This denotes into which bank we're injecting and, at the same time,
+ * triggers the injection.
+ */
+static int inj_bank_set(void *data, u64 val)
+{
+ struct mce *m = (struct mce *)data;
+ u8 n_banks;
+ u64 cap;
+
+ /* Get bank count on target CPU so we can handle non-uniform values. */
+ rdmsrq_on_cpu(m->extcpu, MSR_IA32_MCG_CAP, &cap);
+ n_banks = cap & MCG_BANKCNT_MASK;
+
+ if (val >= n_banks) {
+ pr_err("MCA bank %llu non-existent on CPU%d\n", val, m->extcpu);
+ return -EINVAL;
+ }
+
+ m->bank = val;
+
+ /*
+ * SW-only injection allows writing arbitrary values into the MCA
+ * registers because it tests only the decoding paths.
+ */
+ if (inj_type == SW_INJ)
+ goto inject;
+
+ /*
+ * Read IPID value to determine if a bank is populated on the target
+ * CPU.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SMCA)) {
+ u64 ipid;
+
+ if (rdmsrq_on_cpu(m->extcpu, MSR_AMD64_SMCA_MCx_IPID(val), &ipid)) {
+ pr_err("Error reading IPID on CPU%d\n", m->extcpu);
+ return -EINVAL;
+ }
+
+ if (!ipid) {
+ pr_err("Cannot inject into unpopulated bank %llu\n", val);
+ return -ENODEV;
+ }
+ }
+
+inject:
+ do_inject();
+
+ /* Reset injection struct */
+ setup_inj_struct(&i_mce);
+
+ return 0;
+}
+
+MCE_INJECT_GET(bank);
+
+DEFINE_SIMPLE_ATTRIBUTE(bank_fops, inj_bank_get, inj_bank_set, "%llu\n");
+
+static const char readme_msg[] =
+"Description of the files and their usages:\n"
+"\n"
+"Note1: i refers to the bank number below.\n"
+"Note2: See respective BKDGs for the exact bit definitions of the files below\n"
+"as they mirror the hardware registers.\n"
+"\n"
+"status:\t Set MCi_STATUS: the bits in that MSR control the error type and\n"
+"\t attributes of the error which caused the MCE.\n"
+"\n"
+"misc:\t Set MCi_MISC: provide auxiliary info about the error. It is mostly\n"
+"\t used for error thresholding purposes and its validity is indicated by\n"
+"\t MCi_STATUS[MiscV].\n"
+"\n"
+"synd:\t Set MCi_SYND: provide syndrome info about the error. Only valid on\n"
+"\t Scalable MCA systems, and its validity is indicated by MCi_STATUS[SyndV].\n"
+"\n"
+"addr:\t Error address value to be written to MCi_ADDR. Log address information\n"
+"\t associated with the error.\n"
+"\n"
+"cpu:\t The CPU to inject the error on.\n"
+"\n"
+"bank:\t Specify the bank you want to inject the error into: the number of\n"
+"\t banks in a processor varies and is family/model-specific, therefore, the\n"
+"\t supplied value is sanity-checked. Setting the bank value also triggers the\n"
+"\t injection.\n"
+"\n"
+"flags:\t Injection type to be performed. Writing to this file will trigger a\n"
+"\t real machine check, an APIC interrupt or invoke the error decoder routines\n"
+"\t for AMD processors.\n"
+"\n"
+"\t Allowed error injection types:\n"
+"\t - \"sw\": Software error injection. Decode error to a human-readable \n"
+"\t format only. Safe to use.\n"
+"\t - \"hw\": Hardware error injection. Causes the #MC exception handler to \n"
+"\t handle the error. Be warned: might cause system panic if MCi_STATUS[PCC] \n"
+"\t is set. Therefore, consider setting (debugfs_mountpoint)/mce/fake_panic \n"
+"\t before injecting.\n"
+"\t - \"df\": Trigger APIC interrupt for Deferred error. Causes deferred \n"
+"\t error APIC interrupt handler to handle the error if the feature is \n"
+"\t is present in hardware. \n"
+"\t - \"th\": Trigger APIC interrupt for Threshold errors. Causes threshold \n"
+"\t APIC interrupt handler to handle the error. \n"
+"\n"
+"ipid:\t IPID (AMD-specific)\n"
+"\n";
+
+static ssize_t
+inj_readme_read(struct file *filp, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ return simple_read_from_buffer(ubuf, cnt, ppos,
+ readme_msg, strlen(readme_msg));
+}
+
+static const struct file_operations readme_fops = {
+ .read = inj_readme_read,
+};
+
+static struct dfs_node {
+ char *name;
+ const struct file_operations *fops;
+ umode_t perm;
+} dfs_fls[] = {
+ { .name = "status", .fops = &status_fops, .perm = S_IRUSR | S_IWUSR },
+ { .name = "misc", .fops = &misc_fops, .perm = S_IRUSR | S_IWUSR },
+ { .name = "addr", .fops = &addr_fops, .perm = S_IRUSR | S_IWUSR },
+ { .name = "synd", .fops = &synd_fops, .perm = S_IRUSR | S_IWUSR },
+ { .name = "ipid", .fops = &ipid_fops, .perm = S_IRUSR | S_IWUSR },
+ { .name = "bank", .fops = &bank_fops, .perm = S_IRUSR | S_IWUSR },
+ { .name = "flags", .fops = &flags_fops, .perm = S_IRUSR | S_IWUSR },
+ { .name = "cpu", .fops = &extcpu_fops, .perm = S_IRUSR | S_IWUSR },
+ { .name = "README", .fops = &readme_fops, .perm = S_IRUSR | S_IRGRP | S_IROTH },
+};
+
+static void __init debugfs_init(void)
+{
+ unsigned int i;
+
+ dfs_inj = debugfs_create_dir("mce-inject", NULL);
+
+ for (i = 0; i < ARRAY_SIZE(dfs_fls); i++)
+ debugfs_create_file(dfs_fls[i].name, dfs_fls[i].perm, dfs_inj,
+ &i_mce, dfs_fls[i].fops);
+}
+
+static void check_hw_inj_possible(void)
+{
+ int cpu;
+ u8 bank;
+
+ /*
+ * This behavior exists only on SMCA systems, though it is not
+ * directly related to SMCA.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_SMCA))
+ return;
+
+ cpu = get_cpu();
+
+ for (bank = 0; bank < MAX_NR_BANKS; ++bank) {
+ u64 status = MCI_STATUS_VAL, ipid;
+
+ /* Check whether bank is populated */
+ rdmsrq(MSR_AMD64_SMCA_MCx_IPID(bank), ipid);
+ if (!ipid)
+ continue;
+
+ toggle_hw_mce_inject(cpu, true);
+
+ wrmsrq_safe(mca_msr_reg(bank, MCA_STATUS), status);
+ rdmsrq_safe(mca_msr_reg(bank, MCA_STATUS), &status);
+ wrmsrq_safe(mca_msr_reg(bank, MCA_STATUS), 0);
+
+ if (!status) {
+ hw_injection_possible = false;
+ pr_warn("Platform does not allow *hardware* error injection."
+ "Try using APEI EINJ instead.\n");
+ }
+
+ toggle_hw_mce_inject(cpu, false);
+
+ break;
+ }
+
+ put_cpu();
+}
+
+static int __init inject_init(void)
+{
+ if (!alloc_cpumask_var(&mce_inject_cpumask, GFP_KERNEL))
+ return -ENOMEM;
+
+ check_hw_inj_possible();
+
+ debugfs_init();
+
+ register_nmi_handler(NMI_LOCAL, mce_raise_notify, 0, "mce_notify");
+ mce_register_injector_chain(&inject_nb);
+
+ setup_inj_struct(&i_mce);
+
+ pr_info("Machine check injector initialized\n");
+
+ return 0;
+}
+
+static void __exit inject_exit(void)
+{
+ mce_unregister_injector_chain(&inject_nb);
+ unregister_nmi_handler(NMI_LOCAL, "mce_notify");
+
+ debugfs_remove_recursive(dfs_inj);
+ dfs_inj = NULL;
+
+ memset(&dfs_fls, 0, sizeof(dfs_fls));
+
+ free_cpumask_var(mce_inject_cpumask);
+}
+
+module_init(inject_init);
+module_exit(inject_exit);
+MODULE_DESCRIPTION("Machine check injection support");
+MODULE_LICENSE("GPL");
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
new file mode 100644
index 000000000000..4655223ba560
--- /dev/null
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -0,0 +1,537 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Intel specific MCE features.
+ * Copyright 2004 Zwane Mwaikambo <zwane@linuxpower.ca>
+ * Copyright (C) 2008, 2009 Intel Corporation
+ * Author: Andi Kleen
+ */
+
+#include <linux/gfp.h>
+#include <linux/interrupt.h>
+#include <linux/percpu.h>
+#include <linux/sched.h>
+#include <linux/cpumask.h>
+#include <asm/apic.h>
+#include <asm/cpufeature.h>
+#include <asm/cpu_device_id.h>
+#include <asm/processor.h>
+#include <asm/msr.h>
+#include <asm/mce.h>
+
+#include "internal.h"
+
+/*
+ * Support for Intel Correct Machine Check Interrupts. This allows
+ * the CPU to raise an interrupt when a corrected machine check happened.
+ * Normally we pick those up using a regular polling timer.
+ * Also supports reliable discovery of shared banks.
+ */
+
+/*
+ * CMCI can be delivered to multiple cpus that share a machine check bank
+ * so we need to designate a single cpu to process errors logged in each bank
+ * in the interrupt handler (otherwise we would have many races and potential
+ * double reporting of the same error).
+ * Note that this can change when a cpu is offlined or brought online since
+ * some MCA banks are shared across cpus. When a cpu is offlined, cmci_clear()
+ * disables CMCI on all banks owned by the cpu and clears this bitfield. At
+ * this point, cmci_rediscover() kicks in and a different cpu may end up
+ * taking ownership of some of the shared MCA banks that were previously
+ * owned by the offlined cpu.
+ */
+static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
+
+/*
+ * cmci_discover_lock protects against parallel discovery attempts
+ * which could race against each other.
+ */
+static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
+
+/*
+ * On systems that do support CMCI but it's disabled, polling for MCEs can
+ * cause the same event to be reported multiple times because IA32_MCi_STATUS
+ * is shared by the CPUs in the same package.
+ */
+static DEFINE_SPINLOCK(cmci_poll_lock);
+
+/* Linux non-storm CMCI threshold (may be overridden by BIOS) */
+#define CMCI_THRESHOLD 1
+
+/*
+ * MCi_CTL2 threshold for each bank when there is no storm.
+ * Default value for each bank may have been set by BIOS.
+ */
+static u16 cmci_threshold[MAX_NR_BANKS];
+
+/*
+ * High threshold to limit CMCI rate during storms. Max supported is
+ * 0x7FFF. Use this slightly smaller value so it has a distinctive
+ * signature when someone asks "Why am I not seeing all corrected errors?"
+ * A high threshold is used instead of just disabling CMCI for a
+ * bank because both corrected and uncorrected errors may be logged
+ * in the same bank and signalled with CMCI. The threshold only applies
+ * to corrected errors, so keeping CMCI enabled means that uncorrected
+ * errors will still be processed in a timely fashion.
+ */
+#define CMCI_STORM_THRESHOLD 32749
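+/* 32749 == 0x7fed: just below the architectural maximum of 0x7fff. */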
+
+static bool cmci_supported(int *banks)
+{
+ u64 cap;
+
+ if (mca_cfg.cmci_disabled || mca_cfg.ignore_ce)
+ return false;
+
+ /*
+ * Vendor check is not strictly needed, but the initialization is
+ * vendor-keyed and this makes sure none of the backdoors are
+ * entered otherwise.
+ */
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
+ boot_cpu_data.x86_vendor != X86_VENDOR_ZHAOXIN)
+ return false;
+
+ if (!boot_cpu_has(X86_FEATURE_APIC) || lapic_get_maxlvt() < 6)
+ return false;
+
+ rdmsrq(MSR_IA32_MCG_CAP, cap);
+ *banks = min_t(unsigned, MAX_NR_BANKS, cap & MCG_BANKCNT_MASK);
+ return !!(cap & MCG_CMCI_P);
+}
+
+static bool lmce_supported(void)
+{
+ u64 tmp;
+
+ if (mca_cfg.lmce_disabled)
+ return false;
+
+ rdmsrq(MSR_IA32_MCG_CAP, tmp);
+
+ /*
+ * LMCE depends on recovery support in the processor. Hence both
+ * MCG_SER_P and MCG_LMCE_P should be present in MCG_CAP.
+ */
+ if ((tmp & (MCG_SER_P | MCG_LMCE_P)) !=
+ (MCG_SER_P | MCG_LMCE_P))
+ return false;
+
+ /*
+ * BIOS should indicate support for LMCE by setting bit 20 in
+ * IA32_FEAT_CTL without which touching MCG_EXT_CTL will generate a #GP
+ * fault. The MSR must also be locked for LMCE_ENABLED to take effect.
+ * WARN if the MSR isn't locked as init_ia32_feat_ctl() unconditionally
+ * locks the MSR in the event that it wasn't already locked by BIOS.
+ */
+ rdmsrq(MSR_IA32_FEAT_CTL, tmp);
+ if (WARN_ON_ONCE(!(tmp & FEAT_CTL_LOCKED)))
+ return false;
+
+ return tmp & FEAT_CTL_LMCE_ENABLED;
+}
+
+/*
+ * Set a new CMCI threshold value. Preserve the state of the
+ * MCI_CTL2_CMCI_EN bit in case this happens during a
+ * cmci_rediscover() operation.
+ */
+static void cmci_set_threshold(int bank, int thresh)
+{
+ unsigned long flags;
+ u64 val;
+
+ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ rdmsrq(MSR_IA32_MCx_CTL2(bank), val);
+ val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+ wrmsrq(MSR_IA32_MCx_CTL2(bank), val | thresh);
+ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}
+
+void mce_intel_handle_storm(int bank, bool on)
+{
+ if (on)
+ cmci_set_threshold(bank, CMCI_STORM_THRESHOLD);
+ else
+ cmci_set_threshold(bank, cmci_threshold[bank]);
+}
+
+/*
+ * The interrupt handler. This is called on every event.
+ * Just call the poller directly to log any events.
+ * This could in theory increase the threshold under high load,
+ * but doesn't for now.
+ */
+static void intel_threshold_interrupt(void)
+{
+ machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
+}
+
+/*
+ * Check all the reasons why the current CPU cannot claim
+ * ownership of a bank.
+ * 1: CPU already owns this bank
+ * 2: BIOS owns this bank
+ * 3: Some other CPU owns this bank
+ */
+static bool cmci_skip_bank(int bank, u64 *val)
+{
+ unsigned long *owned = (void *)this_cpu_ptr(&mce_banks_owned);
+
+ if (test_bit(bank, owned))
+ return true;
+
+ /* Skip banks in firmware first mode */
+ if (test_bit(bank, mce_banks_ce_disabled))
+ return true;
+
+ rdmsrq(MSR_IA32_MCx_CTL2(bank), *val);
+
+ /* Already owned by someone else? */
+ if (*val & MCI_CTL2_CMCI_EN) {
+ clear_bit(bank, owned);
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Decide which CMCI interrupt threshold to use:
+ * 1: If this bank is in storm mode from whichever CPU was
+ * the previous owner, stay in storm mode.
+ * 2: If ignoring any threshold set by BIOS, set the Linux default.
+ * 3: Try to honor BIOS threshold (unless buggy BIOS set it at zero).
+ */
+static u64 cmci_pick_threshold(u64 val, int *bios_zero_thresh)
+{
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ return val;
+
+ if (!mca_cfg.bios_cmci_threshold) {
+ val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+ val |= CMCI_THRESHOLD;
+ } else if (!(val & MCI_CTL2_CMCI_THRESHOLD_MASK)) {
+ /*
+ * If bios_cmci_threshold boot option was specified
+ * but the threshold is zero, we'll try to initialize
+ * it to 1.
+ */
+ *bios_zero_thresh = 1;
+ val |= CMCI_THRESHOLD;
+ }
+
+ return val;
+}
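+
+/*
+ * For illustration (threshold value assumed): without bios_cmci_threshold,
+ * a bank whose CTL2 threshold field reads 0x40 is rewritten to
+ * CMCI_THRESHOLD (1); with the option, 0x40 is honored and only a zero
+ * threshold is bumped up to CMCI_THRESHOLD.
+ */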
+
+/*
+ * Try to claim ownership of a bank.
+ */
+static void cmci_claim_bank(int bank, u64 val, int bios_zero_thresh, int *bios_wrong_thresh)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ val |= MCI_CTL2_CMCI_EN;
+ wrmsrq(MSR_IA32_MCx_CTL2(bank), val);
+ rdmsrq(MSR_IA32_MCx_CTL2(bank), val);
+
+ /* If the enable bit did not stick, this bank should be polled. */
+ if (!(val & MCI_CTL2_CMCI_EN)) {
+ WARN_ON(!test_bit(bank, this_cpu_ptr(mce_poll_banks)));
+ storm->banks[bank].poll_only = true;
+ return;
+ }
+
+ /* This CPU successfully set the enable bit. */
+ set_bit(bank, (void *)this_cpu_ptr(&mce_banks_owned));
+
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD) {
+ pr_notice("CPU%d BANK%d CMCI inherited storm\n", smp_processor_id(), bank);
+ mce_inherit_storm(bank);
+ cmci_storm_begin(bank);
+ } else {
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ }
+
+ /*
+ * We are able to set thresholds for some banks that
+ * had a threshold of 0. This means the BIOS has not
+ * set the thresholds properly or does not work with
+ * this boot option. Note down now and report later.
+ */
+ if (mca_cfg.bios_cmci_threshold && bios_zero_thresh &&
+ (val & MCI_CTL2_CMCI_THRESHOLD_MASK))
+ *bios_wrong_thresh = 1;
+
+ /* Save default threshold for each bank */
+ if (cmci_threshold[bank] == 0)
+ cmci_threshold[bank] = val & MCI_CTL2_CMCI_THRESHOLD_MASK;
+}
+
+/*
+ * Enable CMCI (Corrected Machine Check Interrupt) for available MCE banks
+ * on this CPU. Use the algorithm recommended in the SDM to discover shared
+ * banks. Called during initial bootstrap, and also for hotplug CPU operations
+ * to rediscover/reassign machine check banks.
+ */
+static void cmci_discover(int banks)
+{
+ int bios_wrong_thresh = 0;
+ unsigned long flags;
+ int i;
+
+ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ for (i = 0; i < banks; i++) {
+ u64 val;
+ int bios_zero_thresh = 0;
+
+ if (cmci_skip_bank(i, &val))
+ continue;
+
+ val = cmci_pick_threshold(val, &bios_zero_thresh);
+ cmci_claim_bank(i, val, bios_zero_thresh, &bios_wrong_thresh);
+ }
+ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+ if (mca_cfg.bios_cmci_threshold && bios_wrong_thresh) {
+ pr_info_once(
+ "bios_cmci_threshold: Some banks do not have valid thresholds set\n");
+ pr_info_once(
+ "bios_cmci_threshold: Make sure your BIOS supports this boot option\n");
+ }
+}
+
+/*
+ * Just in case we missed an event during initialization, check
+ * all the CMCI owned banks.
+ */
+void cmci_recheck(void)
+{
+ unsigned long flags;
+ int banks;
+
+ if (!mce_available(raw_cpu_ptr(&cpu_info)) || !cmci_supported(&banks))
+ return;
+
+ local_irq_save(flags);
+ machine_check_poll(0, this_cpu_ptr(&mce_banks_owned));
+ local_irq_restore(flags);
+}
+
+/* Caller must hold the lock on cmci_discover_lock */
+static void __cmci_disable_bank(int bank)
+{
+ u64 val;
+
+ if (!test_bit(bank, this_cpu_ptr(mce_banks_owned)))
+ return;
+ rdmsrq(MSR_IA32_MCx_CTL2(bank), val);
+ val &= ~MCI_CTL2_CMCI_EN;
+ wrmsrq(MSR_IA32_MCx_CTL2(bank), val);
+ __clear_bit(bank, this_cpu_ptr(mce_banks_owned));
+
+ if ((val & MCI_CTL2_CMCI_THRESHOLD_MASK) == CMCI_STORM_THRESHOLD)
+ cmci_storm_end(bank);
+}
+
+/*
+ * Disable CMCI on this CPU for all banks it owns when it goes down.
+ * This allows other CPUs to claim the banks on rediscovery.
+ */
+void cmci_clear(void)
+{
+ unsigned long flags;
+ int i;
+ int banks;
+
+ if (!cmci_supported(&banks))
+ return;
+ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ for (i = 0; i < banks; i++)
+ __cmci_disable_bank(i);
+ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}
+
+static void cmci_rediscover_work_func(void *arg)
+{
+ int banks;
+
+ /* Recheck banks in case CPUs don't all have the same number */
+ if (cmci_supported(&banks))
+ cmci_discover(banks);
+}
+
+/* After a CPU went down cycle through all the others and rediscover */
+void cmci_rediscover(void)
+{
+ int banks;
+
+ if (!cmci_supported(&banks))
+ return;
+
+ on_each_cpu(cmci_rediscover_work_func, NULL, 1);
+}
+
+/*
+ * Reenable CMCI on this CPU in case a CPU down failed.
+ */
+void cmci_reenable(void)
+{
+ int banks;
+
+ if (cmci_supported(&banks))
+ cmci_discover(banks);
+}
+
+void cmci_disable_bank(int bank)
+{
+ int banks;
+ unsigned long flags;
+
+ if (!cmci_supported(&banks))
+ return;
+
+ raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+ __cmci_disable_bank(bank);
+ raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}
+
+/* Bank polling function when CMCI is disabled. */
+static void cmci_mc_poll_banks(void)
+{
+ spin_lock(&cmci_poll_lock);
+ machine_check_poll(0, this_cpu_ptr(&mce_poll_banks));
+ spin_unlock(&cmci_poll_lock);
+}
+
+void intel_init_cmci(void)
+{
+ int banks;
+
+ if (!cmci_supported(&banks)) {
+ mc_poll_banks = cmci_mc_poll_banks;
+ return;
+ }
+
+ mce_threshold_vector = intel_threshold_interrupt;
+ cmci_discover(banks);
+ /*
+ * For CPU #0 this runs with the APIC still disabled, but that's
+ * ok because only the vector is set up. We still do another
+ * check for the banks later for CPU #0 just to make sure we
+ * don't miss any events.
+ */
+ apic_write(APIC_LVTCMCI, THRESHOLD_APIC_VECTOR|APIC_DM_FIXED);
+ cmci_recheck();
+}
+
+void intel_init_lmce(void)
+{
+ u64 val;
+
+ if (!lmce_supported())
+ return;
+
+ rdmsrq(MSR_IA32_MCG_EXT_CTL, val);
+
+ if (!(val & MCG_EXT_CTL_LMCE_EN))
+ wrmsrq(MSR_IA32_MCG_EXT_CTL, val | MCG_EXT_CTL_LMCE_EN);
+}
+
+void intel_clear_lmce(void)
+{
+ u64 val;
+
+ if (!lmce_supported())
+ return;
+
+ rdmsrq(MSR_IA32_MCG_EXT_CTL, val);
+ val &= ~MCG_EXT_CTL_LMCE_EN;
+ wrmsrq(MSR_IA32_MCG_EXT_CTL, val);
+}
+
+/*
+ * Enable additional error logs from the integrated
+ * memory controller on processors that support this.
+ */
+static void intel_imc_init(struct cpuinfo_x86 *c)
+{
+ u64 error_control;
+
+ switch (c->x86_vfm) {
+ case INTEL_SANDYBRIDGE_X:
+ case INTEL_IVYBRIDGE_X:
+ case INTEL_HASWELL_X:
+ if (rdmsrq_safe(MSR_ERROR_CONTROL, &error_control))
+ return;
+ error_control |= 2;
+ wrmsrq_safe(MSR_ERROR_CONTROL, error_control);
+ break;
+ }
+}
+
+static void intel_apply_cpu_quirks(struct cpuinfo_x86 *c)
+{
+ /*
+ * SDM documents that on family 6 bank 0 should not be written
+ * because it aliases to another special BIOS controlled
+ * register.
+ * But it's not aliased anymore on model 0x1a+.
+ * Don't ignore bank 0 completely because there could be a
+ * valid event later; merely don't write CTL0.
+ *
+ * Older CPUs (prior to family 6) can't reach this point and already
+ * return early due to the check of __mcheck_cpu_ancient_init().
+ */
+ if (c->x86_vfm < INTEL_NEHALEM_EP && this_cpu_read(mce_num_banks))
+ this_cpu_ptr(mce_banks_array)[0].init = false;
+}
+
+void mce_intel_feature_init(struct cpuinfo_x86 *c)
+{
+ intel_apply_cpu_quirks(c);
+ intel_init_cmci();
+ intel_init_lmce();
+ intel_imc_init(c);
+}
+
+void mce_intel_feature_clear(struct cpuinfo_x86 *c)
+{
+ intel_clear_lmce();
+ cmci_clear();
+}
+
+bool intel_filter_mce(struct mce *m)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+
+ /* MCE errata HSD131, HSM142, HSW131, BDM48 and SKX37 */
+ if ((c->x86_vfm == INTEL_HASWELL ||
+ c->x86_vfm == INTEL_HASWELL_L ||
+ c->x86_vfm == INTEL_BROADWELL ||
+ c->x86_vfm == INTEL_HASWELL_G ||
+ c->x86_vfm == INTEL_SKYLAKE_X) &&
+ (m->bank == 0) &&
+ ((m->status & 0xa0000000ffffffff) == 0x80000000000f0005))
+ return true;
+
+ return false;
+}
+
+/*
+ * Check if the address reported by the CPU is in a format we can parse.
+ * It would be possible to add code for most other cases, but all would
+ * be somewhat complicated (e.g. segment offset would require an instruction
+ * parser). So only support physical addresses up to page granularity for now.
+ */
+bool intel_mce_usable_address(struct mce *m)
+{
+ if (!(m->status & MCI_STATUS_MISCV))
+ return false;
+
+ if (MCI_MISC_ADDR_LSB(m->misc) > PAGE_SHIFT)
+ return false;
+
+ if (MCI_MISC_ADDR_MODE(m->misc) != MCI_MISC_ADDR_PHYS)
+ return false;
+
+ return true;
+}
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
new file mode 100644
index 000000000000..a31cf984619c
--- /dev/null
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -0,0 +1,352 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __X86_MCE_INTERNAL_H__
+#define __X86_MCE_INTERNAL_H__
+
+#undef pr_fmt
+#define pr_fmt(fmt) "mce: " fmt
+
+#include <linux/device.h>
+#include <asm/mce.h>
+
+enum severity_level {
+ MCE_NO_SEVERITY,
+ MCE_DEFERRED_SEVERITY,
+ MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
+ MCE_KEEP_SEVERITY,
+ MCE_SOME_SEVERITY,
+ MCE_AO_SEVERITY,
+ MCE_UC_SEVERITY,
+ MCE_AR_SEVERITY,
+ MCE_PANIC_SEVERITY,
+};
+
+extern struct blocking_notifier_head x86_mce_decoder_chain;
+
+#define INITIAL_CHECK_INTERVAL 5 * 60 /* 5 minutes */
+
+struct mce_evt_llist {
+ struct llist_node llnode;
+ struct mce_hw_err err;
+};
+
+void mce_gen_pool_process(struct work_struct *__unused);
+bool mce_gen_pool_empty(void);
+bool mce_gen_pool_add(struct mce_hw_err *err);
+bool mce_gen_pool_init(void);
+struct llist_node *mce_gen_pool_prepare_records(void);
+
+int mce_severity(struct mce *a, struct pt_regs *regs, char **msg, bool is_excp);
+struct dentry *mce_get_debugfs_dir(void);
+
+extern mce_banks_t mce_banks_ce_disabled;
+
+#ifdef CONFIG_X86_MCE_INTEL
+void mce_intel_handle_storm(int bank, bool on);
+void cmci_disable_bank(int bank);
+void intel_init_cmci(void);
+void intel_init_lmce(void);
+void intel_clear_lmce(void);
+bool intel_filter_mce(struct mce *m);
+bool intel_mce_usable_address(struct mce *m);
+#else
+static inline void mce_intel_handle_storm(int bank, bool on) { }
+static inline void cmci_disable_bank(int bank) { }
+static inline void intel_init_cmci(void) { }
+static inline void intel_init_lmce(void) { }
+static inline void intel_clear_lmce(void) { }
+static inline bool intel_filter_mce(struct mce *m) { return false; }
+static inline bool intel_mce_usable_address(struct mce *m) { return false; }
+#endif
+
+void mce_timer_kick(bool storm);
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+void cmci_storm_begin(unsigned int bank);
+void cmci_storm_end(unsigned int bank);
+void mce_track_storm(struct mce *mce);
+void mce_inherit_storm(unsigned int bank);
+bool mce_get_storm_mode(void);
+void mce_set_storm_mode(bool storm);
+u32 mce_get_apei_thr_limit(void);
+#else
+static inline void cmci_storm_begin(unsigned int bank) {}
+static inline void cmci_storm_end(unsigned int bank) {}
+static inline void mce_track_storm(struct mce *mce) {}
+static inline void mce_inherit_storm(unsigned int bank) {}
+static inline bool mce_get_storm_mode(void) { return false; }
+static inline void mce_set_storm_mode(bool storm) {}
+static inline u32 mce_get_apei_thr_limit(void) { return 0; }
+#endif
+
+/*
+ * history: Bitmask tracking the occurrence of errors. Each set bit
+ * represents an error seen.
+ *
+ * timestamp: Last time (in jiffies) that the bank was polled.
+ * in_storm_mode: Is this bank in storm mode?
+ * poll_only: Bank does not support CMCI, skip storm tracking.
+ */
+struct storm_bank {
+ u64 history;
+ u64 timestamp;
+ bool in_storm_mode;
+ bool poll_only;
+};
+
+#define NUM_HISTORY_BITS (sizeof(u64) * BITS_PER_BYTE)
+
+/* How many errors within the history buffer mark the start of a storm. */
+#define STORM_BEGIN_THRESHOLD 5
+
+/*
+ * How many polls of a machine check bank without an error before declaring
+ * the storm is over. Since it is tracked by the bitmask in the history
+ * field of struct storm_bank, the mask is 30 bits [0 ... 29].
+ */
+#define STORM_END_POLL_THRESHOLD 29
+
+/*
+ * banks: per-cpu, per-bank details
+ * stormy_bank_count: count of MC banks in storm state
+ * poll_mode: CPU is in poll mode
+ */
+struct mca_storm_desc {
+ struct storm_bank banks[MAX_NR_BANKS];
+ u8 stormy_bank_count;
+ bool poll_mode;
+};
+
+DECLARE_PER_CPU(struct mca_storm_desc, storm_desc);
+
+#ifdef CONFIG_ACPI_APEI
+int apei_write_mce(struct mce *m);
+ssize_t apei_read_mce(struct mce *m, u64 *record_id);
+int apei_check_mce(void);
+int apei_clear_mce(u64 record_id);
+#else
+static inline int apei_write_mce(struct mce *m)
+{
+ return -EINVAL;
+}
+static inline ssize_t apei_read_mce(struct mce *m, u64 *record_id)
+{
+ return 0;
+}
+static inline int apei_check_mce(void)
+{
+ return 0;
+}
+static inline int apei_clear_mce(u64 record_id)
+{
+ return -EINVAL;
+}
+#endif
+
+/*
+ * We consider records to be equivalent if bank+status+addr+misc all match.
+ * This is only used when the system is going down because of a fatal error
+ * to avoid cluttering the console log with essentially repeated information.
+ * In normal processing all errors seen are logged.
+ */
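+/* Note the inverted sense: returns true when the two records differ. */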
+static inline bool mce_cmp(struct mce *m1, struct mce *m2)
+{
+ return m1->bank != m2->bank ||
+ m1->status != m2->status ||
+ m1->addr != m2->addr ||
+ m1->misc != m2->misc;
+}
+
+extern struct device_attribute dev_attr_trigger;
+
+#ifdef CONFIG_X86_MCELOG_LEGACY
+void mce_work_trigger(void);
+void mce_register_injector_chain(struct notifier_block *nb);
+void mce_unregister_injector_chain(struct notifier_block *nb);
+#else
+static inline void mce_work_trigger(void) { }
+static inline void mce_register_injector_chain(struct notifier_block *nb) { }
+static inline void mce_unregister_injector_chain(struct notifier_block *nb) { }
+#endif
+
+struct mca_config {
+ __u64 lmce_disabled : 1,
+ disabled : 1,
+ ser : 1,
+ recovery : 1,
+ bios_cmci_threshold : 1,
+ /* Proper #MC exception handler is set */
+ initialized : 1,
+ __reserved : 58;
+
+ bool dont_log_ce;
+ bool cmci_disabled;
+ bool ignore_ce;
+ bool print_all;
+
+ int monarch_timeout;
+ int panic_timeout;
+ u32 rip_msr;
+ s8 bootlog;
+};
+
+extern struct mca_config mca_cfg;
+DECLARE_PER_CPU_READ_MOSTLY(unsigned int, mce_num_banks);
+
+struct mce_vendor_flags {
+ /*
+ * When set, indicates that overflow conditions are not fatal.
+ */
+ __u64 overflow_recov : 1,
+
+ /*
+ * (AMD) SUCCOR stands for S/W UnCorrectable error COntainment and
+ * Recovery. It indicates support for data poisoning in HW and deferred
+ * error interrupts.
+ */
+ succor : 1,
+
+ /*
+ * (AMD) SMCA: This bit indicates support for Scalable MCA which expands
+ * the register space for each MCA bank and also increases the number of
+ * banks. Also, to accommodate the new banks and registers, the MCA
+ * register space is moved to a new MSR range.
+ */
+ smca : 1,
+
+ /* Zen IFU quirk */
+ zen_ifu_quirk : 1,
+
+ /* AMD-style error thresholding banks present. */
+ amd_threshold : 1,
+
+ /* Pentium, family 5-style MCA */
+ p5 : 1,
+
+ /* Centaur Winchip C6-style MCA */
+ winchip : 1,
+
+ /* SandyBridge IFU quirk */
+ snb_ifu_quirk : 1,
+
+ /* Skylake, Cascade Lake, Cooper Lake REP;MOVS* quirk */
+ skx_repmov_quirk : 1,
+
+ __reserved_0 : 55;
+};
+
+extern struct mce_vendor_flags mce_flags;
+
+struct mce_bank {
+ /* subevents to enable */
+ u64 ctl;
+
+ /* initialise bank? */
+ __u64 init : 1,
+
+ /*
+ * (AMD) MCA_CONFIG[McaLsbInStatusSupported]: When set, this bit indicates
+ * the LSB field is found in MCA_STATUS and not in MCA_ADDR.
+ */
+ lsb_in_status : 1,
+
+ __reserved_1 : 62;
+};
+
+DECLARE_PER_CPU_READ_MOSTLY(struct mce_bank[MAX_NR_BANKS], mce_banks_array);
+
+enum mca_msr {
+ MCA_CTL,
+ MCA_STATUS,
+ MCA_ADDR,
+ MCA_MISC,
+};
+
+/* Decide whether to add MCE record to MCE event pool or filter it out. */
+extern bool filter_mce(struct mce *m);
+void mce_prep_record_common(struct mce *m);
+void mce_prep_record_per_cpu(unsigned int cpu, struct mce *m);
+
+#ifdef CONFIG_X86_MCE_AMD
+void mce_threshold_create_device(unsigned int cpu);
+void mce_threshold_remove_device(unsigned int cpu);
+void mce_amd_handle_storm(unsigned int bank, bool on);
+extern bool amd_filter_mce(struct mce *m);
+bool amd_mce_usable_address(struct mce *m);
+void amd_clear_bank(struct mce *m);
+
+/*
+ * If MCA_CONFIG[McaLsbInStatusSupported] is set, extract ErrAddr in bits
+ * [56:0] of MCA_STATUS, else in bits [55:0] of MCA_ADDR.
+ */
+static __always_inline void smca_extract_err_addr(struct mce *m)
+{
+ u8 lsb;
+
+ if (!mce_flags.smca)
+ return;
+
+ if (this_cpu_ptr(mce_banks_array)[m->bank].lsb_in_status) {
+ lsb = (m->status >> 24) & 0x3f;
+
+ m->addr &= GENMASK_ULL(56, lsb);
+
+ return;
+ }
+
+ lsb = (m->addr >> 56) & 0x3f;
+
+ m->addr &= GENMASK_ULL(55, lsb);
+}
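+
+/*
+ * Worked example (values assumed): with lsb_in_status set and
+ * MCA_STATUS[29:24] == 12, the address is valid only at 4K granularity,
+ * so bits [11:0] of m->addr are cleared and bits [56:12] are kept.
+ */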
+
+void smca_bsp_init(void);
+#else
+static inline void mce_threshold_create_device(unsigned int cpu) { }
+static inline void mce_threshold_remove_device(unsigned int cpu) { }
+static inline void mce_amd_handle_storm(unsigned int bank, bool on) { }
+static inline bool amd_filter_mce(struct mce *m) { return false; }
+static inline bool amd_mce_usable_address(struct mce *m) { return false; }
+static inline void amd_clear_bank(struct mce *m) { }
+static inline void smca_extract_err_addr(struct mce *m) { }
+static inline void smca_bsp_init(void) { }
+#endif
+
+#ifdef CONFIG_X86_ANCIENT_MCE
+void intel_p5_mcheck_init(struct cpuinfo_x86 *c);
+void winchip_mcheck_init(struct cpuinfo_x86 *c);
+noinstr void pentium_machine_check(struct pt_regs *regs);
+noinstr void winchip_machine_check(struct pt_regs *regs);
+static inline void enable_p5_mce(void) { mce_p5_enabled = 1; }
+#else
+static __always_inline void intel_p5_mcheck_init(struct cpuinfo_x86 *c) {}
+static __always_inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {}
+static __always_inline void enable_p5_mce(void) {}
+static __always_inline void pentium_machine_check(struct pt_regs *regs) {}
+static __always_inline void winchip_machine_check(struct pt_regs *regs) {}
+#endif
+
+noinstr u64 mce_rdmsrq(u32 msr);
+noinstr void mce_wrmsrq(u32 msr, u64 v);
+
+static __always_inline u32 mca_msr_reg(int bank, enum mca_msr reg)
+{
+ if (cpu_feature_enabled(X86_FEATURE_SMCA)) {
+ switch (reg) {
+ case MCA_CTL: return MSR_AMD64_SMCA_MCx_CTL(bank);
+ case MCA_ADDR: return MSR_AMD64_SMCA_MCx_ADDR(bank);
+ case MCA_MISC: return MSR_AMD64_SMCA_MCx_MISC(bank);
+ case MCA_STATUS: return MSR_AMD64_SMCA_MCx_STATUS(bank);
+ }
+ }
+
+ switch (reg) {
+ case MCA_CTL: return MSR_IA32_MCx_CTL(bank);
+ case MCA_ADDR: return MSR_IA32_MCx_ADDR(bank);
+ case MCA_MISC: return MSR_IA32_MCx_MISC(bank);
+ case MCA_STATUS: return MSR_IA32_MCx_STATUS(bank);
+ }
+
+ return 0;
+}
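+
+/*
+ * For illustration: mca_msr_reg(3, MCA_STATUS) resolves to the legacy
+ * MSR_IA32_MC3_STATUS (0x401 + 4 * bank = 0x40d) or, with SMCA enabled,
+ * to MSR_AMD64_SMCA_MC3_STATUS (0xc0002001 + 0x10 * bank = 0xc0002031).
+ */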
+
+extern void (*mc_poll_banks)(void);
+#endif /* __X86_MCE_INTERNAL_H__ */
diff --git a/arch/x86/kernel/cpu/mcheck/p5.c b/arch/x86/kernel/cpu/mce/p5.c
index 5c0e6533d9bc..2272ad53fc33 100644
--- a/arch/x86/kernel/cpu/mcheck/p5.c
+++ b/arch/x86/kernel/cpu/mce/p5.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* P5 specific Machine Check Exception Reporting
* (C) Copyright 2002 Alan Cox <alan@lxorguk.ukuu.org.uk>
@@ -5,36 +6,39 @@
#include <linux/interrupt.h>
#include <linux/kernel.h>
#include <linux/types.h>
-#include <linux/init.h>
#include <linux/smp.h>
+#include <linux/hardirq.h>
#include <asm/processor.h>
-#include <asm/system.h>
+#include <asm/traps.h>
+#include <asm/tlbflush.h>
#include <asm/mce.h>
#include <asm/msr.h>
+#include "internal.h"
+
/* By default disabled */
int mce_p5_enabled __read_mostly;
/* Machine check handler for Pentium class Intel CPUs: */
-static void pentium_machine_check(struct pt_regs *regs, long error_code)
+noinstr void pentium_machine_check(struct pt_regs *regs)
{
u32 loaddr, hi, lotype;
+ instrumentation_begin();
rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);
- printk(KERN_EMERG
- "CPU#%d: Machine Check Exception: 0x%8X (type 0x%8X).\n",
- smp_processor_id(), loaddr, lotype);
+ pr_emerg("CPU#%d: Machine Check Exception: 0x%8X (type 0x%8X).\n",
+ smp_processor_id(), loaddr, lotype);
if (lotype & (1<<5)) {
- printk(KERN_EMERG
- "CPU#%d: Possible thermal failure (CPU on fire ?).\n",
- smp_processor_id());
+ pr_emerg("CPU#%d: Possible thermal failure (CPU on fire ?).\n",
+ smp_processor_id());
}
- add_taint(TAINT_MACHINE_CHECK);
+ add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
+ instrumentation_end();
}
/* Set up machine check reporting for processors with Intel style MCE: */
@@ -50,19 +54,13 @@ void intel_p5_mcheck_init(struct cpuinfo_x86 *c)
if (!cpu_has(c, X86_FEATURE_MCE))
return;
- machine_check_vector = pentium_machine_check;
- /* Make sure the vector pointer is visible before we enable MCEs: */
- wmb();
-
/* Read registers before enabling: */
rdmsr(MSR_IA32_P5_MC_ADDR, l, h);
rdmsr(MSR_IA32_P5_MC_TYPE, l, h);
- printk(KERN_INFO
- "Intel old style machine check architecture supported.\n");
+ pr_info("Intel old style machine check architecture supported.\n");
/* Enable MCE: */
- set_in_cr4(X86_CR4_MCE);
- printk(KERN_INFO
- "Intel old style machine check reporting enabled on CPU#%d.\n",
- smp_processor_id());
+ cr4_set_bits(X86_CR4_MCE);
+ pr_info("Intel old style machine check reporting enabled on CPU#%d.\n",
+ smp_processor_id());
}
diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
new file mode 100644
index 000000000000..2235a7477436
--- /dev/null
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -0,0 +1,489 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MCE grading rules.
+ * Copyright 2008, 2009 Intel Corporation.
+ *
+ * Author: Andi Kleen
+ */
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+#include <linux/init.h>
+#include <linux/debugfs.h>
+#include <linux/uaccess.h>
+
+#include <asm/mce.h>
+#include <asm/cpu_device_id.h>
+#include <asm/traps.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
+
+#include "internal.h"
+
+/*
+ * Grade an mce by severity. In general the most severe ones are processed
+ * first. Since there are quite a lot of combinations test the bits in a
+ * table-driven way. The rules are simply processed in order, first
+ * match wins.
+ *
+ * Note this is only used for machine check exceptions, the corrected
+ * errors use much simpler rules. The exceptions still check for the corrected
+ * errors, but only to leave them alone for the CMCI handler (except for
+ * panic situations)
+ */
+
+enum context { IN_KERNEL = 1, IN_USER = 2, IN_KERNEL_RECOV = 3 };
+enum ser { SER_REQUIRED = 1, NO_SER = 2 };
+enum exception { EXCP_CONTEXT = 1, NO_EXCP = 2 };
+
+static struct severity {
+ u64 mask;
+ u64 result;
+ unsigned char sev;
+ unsigned short mcgmask;
+ unsigned short mcgres;
+ unsigned char ser;
+ unsigned char context;
+ unsigned char excp;
+ unsigned char covered;
+ unsigned int cpu_vfm;
+ unsigned char cpu_minstepping;
+ unsigned char bank_lo, bank_hi;
+ char *msg;
+} severities[] = {
+#define MCESEV(s, m, c...) { .sev = MCE_ ## s ## _SEVERITY, .msg = m, ## c }
+#define BANK_RANGE(l, h) .bank_lo = l, .bank_hi = h
+#define VFM_STEPPING(m, s) .cpu_vfm = m, .cpu_minstepping = s
+#define KERNEL .context = IN_KERNEL
+#define USER .context = IN_USER
+#define KERNEL_RECOV .context = IN_KERNEL_RECOV
+#define SER .ser = SER_REQUIRED
+#define NOSER .ser = NO_SER
+#define EXCP .excp = EXCP_CONTEXT
+#define NOEXCP .excp = NO_EXCP
+#define BITCLR(x) .mask = x, .result = 0
+#define BITSET(x) .mask = x, .result = x
+#define MCGMASK(x, y) .mcgmask = x, .mcgres = y
+#define MASK(x, y) .mask = x, .result = y
+#define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
+#define MCI_UC_AR (MCI_STATUS_UC|MCI_STATUS_AR)
+#define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
+#define MCI_ADDR (MCI_STATUS_ADDRV|MCI_STATUS_MISCV)
+
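+/*
+ * For illustration, the first entry below expands via the MCESEV/BITCLR
+ * helpers to roughly:
+ *
+ *	{ .sev = MCE_NO_SEVERITY, .msg = "Invalid",
+ *	  .mask = MCI_STATUS_VAL, .result = 0 }
+ *
+ * i.e. it matches any record whose VAL bit is clear.
+ */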
+ MCESEV(
+ NO, "Invalid",
+ BITCLR(MCI_STATUS_VAL)
+ ),
+ MCESEV(
+ NO, "Not enabled",
+ EXCP, BITCLR(MCI_STATUS_EN)
+ ),
+ MCESEV(
+ PANIC, "Processor context corrupt",
+ BITSET(MCI_STATUS_PCC)
+ ),
+ /* When MCIP is not set something is very confused */
+ MCESEV(
+ PANIC, "MCIP not set in MCA handler",
+ EXCP, MCGMASK(MCG_STATUS_MCIP, 0)
+ ),
+ /* Neither return nor error IP -- no chance to recover -> PANIC */
+ MCESEV(
+ PANIC, "Neither restart nor error IP",
+ EXCP, MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, 0)
+ ),
+ MCESEV(
+ PANIC, "In kernel and no restart IP",
+ EXCP, KERNEL, MCGMASK(MCG_STATUS_RIPV, 0)
+ ),
+ MCESEV(
+ PANIC, "In kernel and no restart IP",
+ EXCP, KERNEL_RECOV, MCGMASK(MCG_STATUS_RIPV, 0)
+ ),
+ MCESEV(
+ KEEP, "Corrected error",
+ NOSER, BITCLR(MCI_STATUS_UC)
+ ),
+ /*
+ * known AO MCACODs reported via MCE or CMC:
+ *
+ * SRAO could be signaled either via a machine check exception or
+ * CMCI with the corresponding bit S set to 1 or 0. So we don't need to
+ * check bit S for SRAO.
+ */
+ MCESEV(
+ AO, "Action optional: memory scrubbing error",
+ SER, MASK(MCI_UC_AR|MCACOD_SCRUBMSK, MCI_STATUS_UC|MCACOD_SCRUB)
+ ),
+ MCESEV(
+ AO, "Action optional: last level cache writeback error",
+ SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
+ ),
+ /*
+ * Quirk for Skylake/Cascade Lake. Patrol scrubber may be configured
+ * to report uncorrected errors using CMCI with a special signature.
+ * UC=0, MSCOD=0x0010, MCACOD=binary(000X 0000 1100 XXXX) reported
+ * in one of the memory controller banks.
+ * Set severity to "AO" for same action as normal patrol scrub error.
+ */
+ MCESEV(
+ AO, "Uncorrected Patrol Scrub Error",
+ SER, MASK(MCI_STATUS_UC|MCI_ADDR|0xffffeff0, MCI_ADDR|0x001000c0),
+ VFM_STEPPING(INTEL_SKYLAKE_X, 4), BANK_RANGE(13, 18)
+ ),
+
+ /* ignore OVER for UCNA */
+ MCESEV(
+ UCNA, "Uncorrected no action required",
+ SER, MASK(MCI_UC_SAR, MCI_STATUS_UC)
+ ),
+ MCESEV(
+ PANIC, "Illegal combination (UCNA with AR=1)",
+ SER,
+ MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_UC|MCI_STATUS_AR)
+ ),
+ MCESEV(
+ KEEP, "Non signaled machine check",
+ SER, BITCLR(MCI_STATUS_S)
+ ),
+
+ MCESEV(
+ PANIC, "Action required with lost events",
+ SER, BITSET(MCI_STATUS_OVER|MCI_UC_SAR)
+ ),
+
+ /* known AR MCACODs: */
+#ifdef CONFIG_MEMORY_FAILURE
+ MCESEV(
+ KEEP, "Action required but unaffected thread is continuable",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR, MCI_UC_SAR|MCI_ADDR),
+ MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, MCG_STATUS_RIPV)
+ ),
+ MCESEV(
+ AR, "Action required: data load in error recoverable area of kernel",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
+ KERNEL_RECOV
+ ),
+ MCESEV(
+ AR, "Action required: data load error in a user process",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
+ USER
+ ),
+ MCESEV(
+ AR, "Action required: instruction fetch error in a user process",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_INSTR),
+ USER
+ ),
+ MCESEV(
+ AR, "Data load error in SEAM non-root mode",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
+ MCGMASK(MCG_STATUS_SEAM_NR, MCG_STATUS_SEAM_NR),
+ KERNEL
+ ),
+ MCESEV(
+ AR, "Instruction fetch error in SEAM non-root mode",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_INSTR),
+ MCGMASK(MCG_STATUS_SEAM_NR, MCG_STATUS_SEAM_NR),
+ KERNEL
+ ),
+ MCESEV(
+ PANIC, "Data load in unrecoverable area of kernel",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
+ KERNEL
+ ),
+ MCESEV(
+ PANIC, "Instruction fetch error in kernel",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_INSTR),
+ KERNEL
+ ),
+#endif
+ MCESEV(
+ PANIC, "Action required: unknown MCACOD",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_SAR)
+ ),
+
+ MCESEV(
+ SOME, "Action optional: unknown MCACOD",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S)
+ ),
+ MCESEV(
+ SOME, "Action optional with lost events",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_OVER|MCI_UC_S)
+ ),
+
+ MCESEV(
+ PANIC, "Overflowed uncorrected",
+ BITSET(MCI_STATUS_OVER|MCI_STATUS_UC)
+ ),
+ MCESEV(
+ PANIC, "Uncorrected in kernel",
+ BITSET(MCI_STATUS_UC),
+ KERNEL
+ ),
+ MCESEV(
+ UC, "Uncorrected",
+ BITSET(MCI_STATUS_UC)
+ ),
+ MCESEV(
+ SOME, "No match",
+ BITSET(0)
+ ) /* always matches. keep at end */
+};
+
+#define mc_recoverable(mcg) (((mcg) & (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) == \
+ (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
+
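+/*
+ * Decode the instruction at the faulting IP to decide whether the machine
+ * check hit a copy-from-user style access (a MOV/MOVZX load or REP MOVS)
+ * and, if so, record the user virtual address being accessed.
+ */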
+static bool is_copy_from_user(struct pt_regs *regs)
+{
+ u8 insn_buf[MAX_INSN_SIZE];
+ unsigned long addr;
+ struct insn insn;
+ int ret;
+
+ if (!regs)
+ return false;
+
+ if (copy_from_kernel_nofault(insn_buf, (void *)regs->ip, MAX_INSN_SIZE))
+ return false;
+
+ ret = insn_decode_kernel(&insn, insn_buf);
+ if (ret < 0)
+ return false;
+
+ switch (insn.opcode.value) {
+ /* MOV mem,reg */
+ case 0x8A: case 0x8B:
+ /* MOVZ mem,reg */
+ case 0xB60F: case 0xB70F:
+ addr = (unsigned long)insn_get_addr_ref(&insn, regs);
+ break;
+ /* REP MOVS */
+ case 0xA4: case 0xA5:
+ addr = regs->si;
+ break;
+ default:
+ return false;
+ }
+
+ if (fault_in_kernel_space(addr))
+ return false;
+
+ current->mce_vaddr = (void __user *)addr;
+
+ return true;
+}
+
+/*
+ * If mcgstatus indicated that ip/cs on the stack were
+ * no good, then "m->cs" will be zero and we will have
+ * to assume the worst case (IN_KERNEL) as we actually
+ * have no idea what we were executing when the machine
+ * check hit.
+ * If we do have a good "m->cs" (or a faked one in the
+ * case we were executing in VM86 mode) we can use it to
+ * distinguish an exception taken in user mode from one
+ * taken in the kernel.
+ */
+static noinstr int error_context(struct mce *m, struct pt_regs *regs)
+{
+ int fixup_type;
+ bool copy_user;
+
+ if ((m->cs & 3) == 3)
+ return IN_USER;
+
+ if (!mc_recoverable(m->mcgstatus))
+ return IN_KERNEL;
+
+ /* Allow instrumentation around external facilities usage. */
+ instrumentation_begin();
+ fixup_type = ex_get_fixup_type(m->ip);
+ copy_user = is_copy_from_user(regs);
+ instrumentation_end();
+
+ if (copy_user) {
+ m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
+ return IN_KERNEL_RECOV;
+ }
+
+ switch (fixup_type) {
+ case EX_TYPE_FAULT_MCE_SAFE:
+ case EX_TYPE_DEFAULT_MCE_SAFE:
+ m->kflags |= MCE_IN_KERNEL_RECOV;
+ return IN_KERNEL_RECOV;
+
+ default:
+ return IN_KERNEL;
+ }
+}
+
+/* See AMD PPR(s) section Machine Check Error Handling. */
+static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
+{
+ char *panic_msg = NULL;
+ int ret;
+
+ /*
+ * Default return value: Action required, the error must be handled
+ * immediately.
+ */
+ ret = MCE_AR_SEVERITY;
+
+ /* Processor Context Corrupt, no need to fumble too much, die! */
+ if (m->status & MCI_STATUS_PCC) {
+ panic_msg = "Processor Context Corrupt";
+ ret = MCE_PANIC_SEVERITY;
+ goto out;
+ }
+
+ if (m->status & MCI_STATUS_DEFERRED) {
+ ret = MCE_DEFERRED_SEVERITY;
+ goto out;
+ }
+
+ /*
+ * If the UC bit is not set, the system either corrected or deferred
+ * the error. No action will be required after logging the error.
+ */
+ if (!(m->status & MCI_STATUS_UC)) {
+ ret = MCE_KEEP_SEVERITY;
+ goto out;
+ }
+
+ /*
+ * On MCA overflow, without the MCA overflow recovery feature the
+ * system will not be able to recover, panic.
+ */
+ if ((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov) {
+ panic_msg = "Overflowed uncorrected error without MCA Overflow Recovery";
+ ret = MCE_PANIC_SEVERITY;
+ goto out;
+ }
+
+ if (!mce_flags.succor) {
+ panic_msg = "Uncorrected error without MCA Recovery";
+ ret = MCE_PANIC_SEVERITY;
+ goto out;
+ }
+
+ if (error_context(m, regs) == IN_KERNEL) {
+ panic_msg = "Uncorrected unrecoverable error in kernel context";
+ ret = MCE_PANIC_SEVERITY;
+ }
+
+out:
+ if (msg && panic_msg)
+ *msg = panic_msg;
+
+ return ret;
+}
+
+static noinstr int mce_severity_intel(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
+{
+ enum exception excp = (is_excp ? EXCP_CONTEXT : NO_EXCP);
+ enum context ctx = error_context(m, regs);
+ struct severity *s;
+
+ for (s = severities;; s++) {
+ if ((m->status & s->mask) != s->result)
+ continue;
+ if ((m->mcgstatus & s->mcgmask) != s->mcgres)
+ continue;
+ if (s->ser == SER_REQUIRED && !mca_cfg.ser)
+ continue;
+ if (s->ser == NO_SER && mca_cfg.ser)
+ continue;
+ if (s->context && ctx != s->context)
+ continue;
+ if (s->excp && excp != s->excp)
+ continue;
+ if (s->cpu_vfm && boot_cpu_data.x86_vfm != s->cpu_vfm)
+ continue;
+ if (s->cpu_minstepping && boot_cpu_data.x86_stepping < s->cpu_minstepping)
+ continue;
+ if (s->bank_lo && (m->bank < s->bank_lo || m->bank > s->bank_hi))
+ continue;
+ if (msg)
+ *msg = s->msg;
+ s->covered = 1;
+
+ return s->sev;
+ }
+}
+
+int noinstr mce_severity(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
+{
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD ||
+ boot_cpu_data.x86_vendor == X86_VENDOR_HYGON)
+ return mce_severity_amd(m, regs, msg, is_excp);
+ else
+ return mce_severity_intel(m, regs, msg, is_excp);
+}
+
+#ifdef CONFIG_DEBUG_FS
+static void *s_start(struct seq_file *f, loff_t *pos)
+{
+ if (*pos >= ARRAY_SIZE(severities))
+ return NULL;
+ return &severities[*pos];
+}
+
+static void *s_next(struct seq_file *f, void *data, loff_t *pos)
+{
+ if (++(*pos) >= ARRAY_SIZE(severities))
+ return NULL;
+ return &severities[*pos];
+}
+
+static void s_stop(struct seq_file *f, void *data)
+{
+}
+
+static int s_show(struct seq_file *f, void *data)
+{
+ struct severity *ser = data;
+ seq_printf(f, "%d\t%s\n", ser->covered, ser->msg);
+ return 0;
+}
+
+static const struct seq_operations severities_seq_ops = {
+ .start = s_start,
+ .next = s_next,
+ .stop = s_stop,
+ .show = s_show,
+};
+
+static int severities_coverage_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &severities_seq_ops);
+}
+
+static ssize_t severities_coverage_write(struct file *file,
+ const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(severities); i++)
+ severities[i].covered = 0;
+ return count;
+}
+
+static const struct file_operations severities_coverage_fops = {
+ .open = severities_coverage_open,
+ .release = seq_release,
+ .read = seq_read,
+ .write = severities_coverage_write,
+ .llseek = seq_lseek,
+};
+
+static int __init severities_debugfs_init(void)
+{
+ struct dentry *dmce;
+
+ dmce = mce_get_debugfs_dir();
+
+ debugfs_create_file("severities-coverage", 0444, dmce, NULL,
+ &severities_coverage_fops);
+ return 0;
+}
+late_initcall(severities_debugfs_init);
+#endif /* CONFIG_DEBUG_FS */
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
new file mode 100644
index 000000000000..0d13c9ffcba0
--- /dev/null
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -0,0 +1,163 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Common corrected MCE threshold handler code:
+ */
+#include <linux/interrupt.h>
+#include <linux/kernel.h>
+
+#include <asm/irq_vectors.h>
+#include <asm/traps.h>
+#include <asm/apic.h>
+#include <asm/mce.h>
+#include <asm/trace/irq_vectors.h>
+
+#include "internal.h"
+
+static u32 mce_apei_thr_limit;
+
+void mce_save_apei_thr_limit(u32 thr_limit)
+{
+ mce_apei_thr_limit = thr_limit;
+ pr_info("HEST corrected error threshold limit: %u\n", thr_limit);
+}
+
+u32 mce_get_apei_thr_limit(void)
+{
+ return mce_apei_thr_limit;
+}
+
+static void default_threshold_interrupt(void)
+{
+ pr_err("Unexpected threshold interrupt at vector %x\n",
+ THRESHOLD_APIC_VECTOR);
+}
+
+void (*mce_threshold_vector)(void) = default_threshold_interrupt;
+
+DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
+{
+ trace_threshold_apic_entry(THRESHOLD_APIC_VECTOR);
+ inc_irq_stat(irq_threshold_count);
+ mce_threshold_vector();
+ trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
+ apic_eoi();
+}
+
+DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
+
+void mce_inherit_storm(unsigned int bank)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ /*
+ * Previous CPU owning this bank had put it into storm mode,
+ * but the precise history of that storm is unknown. Assume
+ * the worst (all recent polls of the bank found a valid error
+ * logged). This will avoid the new owner prematurely declaring
+ * the storm has ended.
+ */
+ storm->banks[bank].history = ~0ull;
+ storm->banks[bank].timestamp = jiffies;
+}
+
+bool mce_get_storm_mode(void)
+{
+ return __this_cpu_read(storm_desc.poll_mode);
+}
+
+void mce_set_storm_mode(bool storm)
+{
+ __this_cpu_write(storm_desc.poll_mode, storm);
+}
+
+static void mce_handle_storm(unsigned int bank, bool on)
+{
+ switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_INTEL:
+ mce_intel_handle_storm(bank, on);
+ break;
+ case X86_VENDOR_AMD:
+ mce_amd_handle_storm(bank, on);
+ break;
+ }
+}
+
+void cmci_storm_begin(unsigned int bank)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ __set_bit(bank, this_cpu_ptr(mce_poll_banks));
+ storm->banks[bank].in_storm_mode = true;
+
+ /*
+ * If this is the first bank on this CPU to enter storm mode
+ * start polling.
+ */
+ if (++storm->stormy_bank_count == 1)
+ mce_timer_kick(true);
+}
+
+void cmci_storm_end(unsigned int bank)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+ if (!mce_flags.amd_threshold)
+ __clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+ storm->banks[bank].history = 0;
+ storm->banks[bank].in_storm_mode = false;
+
+ /* If no banks left in storm mode, stop polling. */
+ if (!--storm->stormy_bank_count)
+ mce_timer_kick(false);
+}
+
+void mce_track_storm(struct mce *mce)
+{
+ struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+ unsigned long now = jiffies, delta;
+ unsigned int shift = 1;
+ u64 history = 0;
+
+ /* No tracking needed for banks that do not support CMCI */
+ if (storm->banks[mce->bank].poll_only)
+ return;
+
+ /*
+ * When a bank is in storm mode it is polled once per second and
+ * the history mask will record roughly the last minute of poll results.
+ * If it is not in storm mode, then the bank is only checked when
+ * there is a CMCI interrupt. Check how long it has been since
+ * this bank was last checked, and adjust the amount of "shift"
+ * to apply to history.
+ */
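+	/*
+	 * Worked example (timing assumed): a CMCI arriving ~3 seconds after
+	 * the last check gives delta = 3*HZ, so shift = (3*HZ + HZ)/HZ = 4
+	 * and four stale bits of history are aged out.
+	 */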
+ if (!storm->banks[mce->bank].in_storm_mode) {
+ delta = now - storm->banks[mce->bank].timestamp;
+ shift = (delta + HZ) / HZ;
+ }
+
+ /* If it has been a long time since the last poll, clear history. */
+ if (shift < NUM_HISTORY_BITS)
+ history = storm->banks[mce->bank].history << shift;
+
+ storm->banks[mce->bank].timestamp = now;
+
+ /* History keeps track of corrected errors. VAL=1 && UC=0 */
+ if ((mce->status & MCI_STATUS_VAL) && mce_is_correctable(mce))
+ history |= 1;
+
+ storm->banks[mce->bank].history = history;
+
+ if (storm->banks[mce->bank].in_storm_mode) {
+ if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD, 0))
+ return;
+ printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), mce->bank);
+ mce_handle_storm(mce->bank, false);
+ cmci_storm_end(mce->bank);
+ } else {
+ if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+ return;
+ printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), mce->bank);
+ mce_handle_storm(mce->bank, true);
+ cmci_storm_begin(mce->bank);
+ }
+}
diff --git a/arch/x86/kernel/cpu/mcheck/winchip.c b/arch/x86/kernel/cpu/mce/winchip.c
index 54060f565974..6c99f2941909 100644
--- a/arch/x86/kernel/cpu/mcheck/winchip.c
+++ b/arch/x86/kernel/cpu/mce/winchip.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* IDT Winchip specific Machine Check Exception Reporting
* (C) Copyright 2002 Alan Cox <alan@lxorguk.ukuu.org.uk>
@@ -5,18 +6,23 @@
#include <linux/interrupt.h>
#include <linux/kernel.h>
#include <linux/types.h>
-#include <linux/init.h>
+#include <linux/hardirq.h>
#include <asm/processor.h>
-#include <asm/system.h>
+#include <asm/traps.h>
+#include <asm/tlbflush.h>
#include <asm/mce.h>
#include <asm/msr.h>
+#include "internal.h"
+
/* Machine check handler for WinChip C6: */
-static void winchip_machine_check(struct pt_regs *regs, long error_code)
+noinstr void winchip_machine_check(struct pt_regs *regs)
{
- printk(KERN_EMERG "CPU0: Machine Check Exception.\n");
- add_taint(TAINT_MACHINE_CHECK);
+ instrumentation_begin();
+ pr_emerg("CPU0: Machine Check Exception.\n");
+ add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
+ instrumentation_end();
}
/* Set up machine check reporting on the Winchip C6 series */
@@ -24,17 +30,12 @@ void winchip_mcheck_init(struct cpuinfo_x86 *c)
{
u32 lo, hi;
- machine_check_vector = winchip_machine_check;
- /* Make sure the vector pointer is visible before we enable MCEs: */
- wmb();
-
rdmsr(MSR_IDT_FCR1, lo, hi);
lo |= (1<<2); /* Enable EIERRINT (int 18 MCE) */
lo &= ~(1<<4); /* Enable MCE */
wrmsr(MSR_IDT_FCR1, lo, hi);
- set_in_cr4(X86_CR4_MCE);
+ cr4_set_bits(X86_CR4_MCE);
- printk(KERN_INFO
- "Winchip machine check reporting enabled on CPU#0.\n");
+ pr_info("Winchip machine check reporting enabled on CPU#0.\n");
}
diff --git a/arch/x86/kernel/cpu/mcheck/Makefile b/arch/x86/kernel/cpu/mcheck/Makefile
deleted file mode 100644
index bb34b03af252..000000000000
--- a/arch/x86/kernel/cpu/mcheck/Makefile
+++ /dev/null
@@ -1,11 +0,0 @@
-obj-y = mce.o mce-severity.o
-
-obj-$(CONFIG_X86_ANCIENT_MCE) += winchip.o p5.o
-obj-$(CONFIG_X86_MCE_INTEL) += mce_intel.o
-obj-$(CONFIG_X86_MCE_AMD) += mce_amd.o
-obj-$(CONFIG_X86_MCE_THRESHOLD) += threshold.o
-obj-$(CONFIG_X86_MCE_INJECT) += mce-inject.o
-
-obj-$(CONFIG_X86_THERMAL_VECTOR) += therm_throt.o
-
-obj-$(CONFIG_ACPI_APEI) += mce-apei.o
diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
deleted file mode 100644
index 745b54f9be89..000000000000
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ /dev/null
@@ -1,138 +0,0 @@
-/*
- * Bridge between MCE and APEI
- *
- * On some machine, corrected memory errors are reported via APEI
- * generic hardware error source (GHES) instead of corrected Machine
- * Check. These corrected memory errors can be reported to user space
- * through /dev/mcelog via faking a corrected Machine Check, so that
- * the error memory page can be offlined by /sbin/mcelog if the error
- * count for one page is beyond the threshold.
- *
- * For fatal MCE, save MCE record into persistent storage via ERST, so
- * that the MCE record can be logged after reboot via ERST.
- *
- * Copyright 2010 Intel Corp.
- * Author: Huang Ying <ying.huang@intel.com>
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License version
- * 2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- */
-
-#include <linux/kernel.h>
-#include <linux/acpi.h>
-#include <linux/cper.h>
-#include <acpi/apei.h>
-#include <asm/mce.h>
-
-#include "mce-internal.h"
-
-void apei_mce_report_mem_error(int corrected, struct cper_sec_mem_err *mem_err)
-{
- struct mce m;
-
- /* Only corrected MC is reported */
- if (!corrected)
- return;
-
- mce_setup(&m);
- m.bank = 1;
- /* Fake a memory read corrected error with unknown channel */
- m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 0x9f;
- m.addr = mem_err->physical_addr;
- mce_log(&m);
- mce_notify_irq();
-}
-EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
-
-#define CPER_CREATOR_MCE \
- UUID_LE(0x75a574e3, 0x5052, 0x4b29, 0x8a, 0x8e, 0xbe, 0x2c, \
- 0x64, 0x90, 0xb8, 0x9d)
-#define CPER_SECTION_TYPE_MCE \
- UUID_LE(0xfe08ffbe, 0x95e4, 0x4be7, 0xbc, 0x73, 0x40, 0x96, \
- 0x04, 0x4a, 0x38, 0xfc)
-
-/*
- * CPER specification (in UEFI specification 2.3 appendix N) requires
- * byte-packed.
- */
-struct cper_mce_record {
- struct cper_record_header hdr;
- struct cper_section_descriptor sec_hdr;
- struct mce mce;
-} __packed;
-
-int apei_write_mce(struct mce *m)
-{
- struct cper_mce_record rcd;
-
- memset(&rcd, 0, sizeof(rcd));
- memcpy(rcd.hdr.signature, CPER_SIG_RECORD, CPER_SIG_SIZE);
- rcd.hdr.revision = CPER_RECORD_REV;
- rcd.hdr.signature_end = CPER_SIG_END;
- rcd.hdr.section_count = 1;
- rcd.hdr.error_severity = CPER_SER_FATAL;
- /* timestamp, platform_id, partition_id are all invalid */
- rcd.hdr.validation_bits = 0;
- rcd.hdr.record_length = sizeof(rcd);
- rcd.hdr.creator_id = CPER_CREATOR_MCE;
- rcd.hdr.notification_type = CPER_NOTIFY_MCE;
- rcd.hdr.record_id = cper_next_record_id();
- rcd.hdr.flags = CPER_HW_ERROR_FLAGS_PREVERR;
-
- rcd.sec_hdr.section_offset = (void *)&rcd.mce - (void *)&rcd;
- rcd.sec_hdr.section_length = sizeof(rcd.mce);
- rcd.sec_hdr.revision = CPER_SEC_REV;
- /* fru_id and fru_text is invalid */
- rcd.sec_hdr.validation_bits = 0;
- rcd.sec_hdr.flags = CPER_SEC_PRIMARY;
- rcd.sec_hdr.section_type = CPER_SECTION_TYPE_MCE;
- rcd.sec_hdr.section_severity = CPER_SER_FATAL;
-
- memcpy(&rcd.mce, m, sizeof(*m));
-
- return erst_write(&rcd.hdr);
-}
-
-ssize_t apei_read_mce(struct mce *m, u64 *record_id)
-{
- struct cper_mce_record rcd;
- ssize_t len;
-
- len = erst_read_next(&rcd.hdr, sizeof(rcd));
- if (len <= 0)
- return len;
- /* Can not skip other records in storage via ERST unless clear them */
- else if (len != sizeof(rcd) ||
- uuid_le_cmp(rcd.hdr.creator_id, CPER_CREATOR_MCE)) {
- if (printk_ratelimit())
- pr_warning(
- "MCE-APEI: Can not skip the unknown record in ERST");
- return -EIO;
- }
-
- memcpy(m, &rcd.mce, sizeof(*m));
- *record_id = rcd.hdr.record_id;
-
- return sizeof(*m);
-}
-
-/* Check whether there is record in ERST */
-int apei_check_mce(void)
-{
- return erst_get_record_count();
-}
-
-int apei_clear_mce(u64 record_id)
-{
- return erst_clear(record_id);
-}
diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
deleted file mode 100644
index e7dbde7bfedb..000000000000
--- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
+++ /dev/null
@@ -1,227 +0,0 @@
-/*
- * Machine check injection support.
- * Copyright 2008 Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; version 2
- * of the License.
- *
- * Authors:
- * Andi Kleen
- * Ying Huang
- */
-#include <linux/uaccess.h>
-#include <linux/module.h>
-#include <linux/timer.h>
-#include <linux/kernel.h>
-#include <linux/string.h>
-#include <linux/fs.h>
-#include <linux/smp.h>
-#include <linux/notifier.h>
-#include <linux/kdebug.h>
-#include <linux/cpu.h>
-#include <linux/sched.h>
-#include <linux/gfp.h>
-#include <asm/mce.h>
-#include <asm/apic.h>
-
-/* Update fake mce registers on current CPU. */
-static void inject_mce(struct mce *m)
-{
- struct mce *i = &per_cpu(injectm, m->extcpu);
-
- /* Make sure noone reads partially written injectm */
- i->finished = 0;
- mb();
- m->finished = 0;
- /* First set the fields after finished */
- i->extcpu = m->extcpu;
- mb();
- /* Now write record in order, finished last (except above) */
- memcpy(i, m, sizeof(struct mce));
- /* Finally activate it */
- mb();
- i->finished = 1;
-}
-
-static void raise_poll(struct mce *m)
-{
- unsigned long flags;
- mce_banks_t b;
-
- memset(&b, 0xff, sizeof(mce_banks_t));
- local_irq_save(flags);
- machine_check_poll(0, &b);
- local_irq_restore(flags);
- m->finished = 0;
-}
-
-static void raise_exception(struct mce *m, struct pt_regs *pregs)
-{
- struct pt_regs regs;
- unsigned long flags;
-
- if (!pregs) {
- memset(&regs, 0, sizeof(struct pt_regs));
- regs.ip = m->ip;
- regs.cs = m->cs;
- pregs = &regs;
- }
- /* in mcheck exeception handler, irq will be disabled */
- local_irq_save(flags);
- do_machine_check(pregs, 0);
- local_irq_restore(flags);
- m->finished = 0;
-}
-
-static cpumask_var_t mce_inject_cpumask;
-
-static int mce_raise_notify(struct notifier_block *self,
- unsigned long val, void *data)
-{
- struct die_args *args = (struct die_args *)data;
- int cpu = smp_processor_id();
- struct mce *m = &__get_cpu_var(injectm);
- if (val != DIE_NMI_IPI || !cpumask_test_cpu(cpu, mce_inject_cpumask))
- return NOTIFY_DONE;
- cpumask_clear_cpu(cpu, mce_inject_cpumask);
- if (m->inject_flags & MCJ_EXCEPTION)
- raise_exception(m, args->regs);
- else if (m->status)
- raise_poll(m);
- return NOTIFY_STOP;
-}
-
-static struct notifier_block mce_raise_nb = {
- .notifier_call = mce_raise_notify,
- .priority = 1000,
-};
-
-/* Inject mce on current CPU */
-static int raise_local(void)
-{
- struct mce *m = &__get_cpu_var(injectm);
- int context = MCJ_CTX(m->inject_flags);
- int ret = 0;
- int cpu = m->extcpu;
-
- if (m->inject_flags & MCJ_EXCEPTION) {
- printk(KERN_INFO "Triggering MCE exception on CPU %d\n", cpu);
- switch (context) {
- case MCJ_CTX_IRQ:
- /*
- * Could do more to fake interrupts like
- * calling irq_enter, but the necessary
- * machinery isn't exported currently.
- */
- /*FALL THROUGH*/
- case MCJ_CTX_PROCESS:
- raise_exception(m, NULL);
- break;
- default:
- printk(KERN_INFO "Invalid MCE context\n");
- ret = -EINVAL;
- }
- printk(KERN_INFO "MCE exception done on CPU %d\n", cpu);
- } else if (m->status) {
- printk(KERN_INFO "Starting machine check poll CPU %d\n", cpu);
- raise_poll(m);
- mce_notify_irq();
- printk(KERN_INFO "Machine check poll done on CPU %d\n", cpu);
- } else
- m->finished = 0;
-
- return ret;
-}
-
-static void raise_mce(struct mce *m)
-{
- int context = MCJ_CTX(m->inject_flags);
-
- inject_mce(m);
-
- if (context == MCJ_CTX_RANDOM)
- return;
-
-#ifdef CONFIG_X86_LOCAL_APIC
- if (m->inject_flags & MCJ_NMI_BROADCAST) {
- unsigned long start;
- int cpu;
- get_online_cpus();
- cpumask_copy(mce_inject_cpumask, cpu_online_mask);
- cpumask_clear_cpu(get_cpu(), mce_inject_cpumask);
- for_each_online_cpu(cpu) {
- struct mce *mcpu = &per_cpu(injectm, cpu);
- if (!mcpu->finished ||
- MCJ_CTX(mcpu->inject_flags) != MCJ_CTX_RANDOM)
- cpumask_clear_cpu(cpu, mce_inject_cpumask);
- }
- if (!cpumask_empty(mce_inject_cpumask))
- apic->send_IPI_mask(mce_inject_cpumask, NMI_VECTOR);
- start = jiffies;
- while (!cpumask_empty(mce_inject_cpumask)) {
- if (!time_before(jiffies, start + 2*HZ)) {
- printk(KERN_ERR
- "Timeout waiting for mce inject NMI %lx\n",
- *cpumask_bits(mce_inject_cpumask));
- break;
- }
- cpu_relax();
- }
- raise_local();
- put_cpu();
- put_online_cpus();
- } else
-#endif
- raise_local();
-}
-
-/* Error injection interface */
-static ssize_t mce_write(struct file *filp, const char __user *ubuf,
- size_t usize, loff_t *off)
-{
- struct mce m;
-
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
- /*
- * There are some cases where real MSR reads could slip
- * through.
- */
- if (!boot_cpu_has(X86_FEATURE_MCE) || !boot_cpu_has(X86_FEATURE_MCA))
- return -EIO;
-
- if ((unsigned long)usize > sizeof(struct mce))
- usize = sizeof(struct mce);
- if (copy_from_user(&m, ubuf, usize))
- return -EFAULT;
-
- if (m.extcpu >= num_possible_cpus() || !cpu_online(m.extcpu))
- return -EINVAL;
-
- /*
- * Need to give user space some time to set everything up,
- * so do it a jiffy or two later everywhere.
- */
- schedule_timeout(2);
- raise_mce(&m);
- return usize;
-}
-
-static int inject_init(void)
-{
- if (!alloc_cpumask_var(&mce_inject_cpumask, GFP_KERNEL))
- return -ENOMEM;
- printk(KERN_INFO "Machine check injector initialized\n");
- mce_chrdev_ops.write = mce_write;
- register_die_notifier(&mce_raise_nb);
- return 0;
-}
-
-module_init(inject_init);
-/*
- * Cannot tolerate unloading currently because we cannot
- * guarantee all openers of mce_chrdev will get a reference to us.
- */
-MODULE_LICENSE("GPL");
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
deleted file mode 100644
index fefcc69ee8b5..000000000000
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ /dev/null
@@ -1,53 +0,0 @@
-#include <linux/sysdev.h>
-#include <asm/mce.h>
-
-enum severity_level {
- MCE_NO_SEVERITY,
- MCE_KEEP_SEVERITY,
- MCE_SOME_SEVERITY,
- MCE_AO_SEVERITY,
- MCE_UC_SEVERITY,
- MCE_AR_SEVERITY,
- MCE_PANIC_SEVERITY,
-};
-
-#define ATTR_LEN 16
-
-/* One object for each MCE bank, shared by all CPUs */
-struct mce_bank {
- u64 ctl; /* subevents to enable */
- unsigned char init; /* initialise bank? */
- struct sysdev_attribute attr; /* sysdev attribute */
- char attrname[ATTR_LEN]; /* attribute name */
-};
-
-int mce_severity(struct mce *a, int tolerant, char **msg);
-struct dentry *mce_get_debugfs_dir(void);
-
-extern int mce_ser;
-
-extern struct mce_bank *mce_banks;
-
-#ifdef CONFIG_ACPI_APEI
-int apei_write_mce(struct mce *m);
-ssize_t apei_read_mce(struct mce *m, u64 *record_id);
-int apei_check_mce(void);
-int apei_clear_mce(u64 record_id);
-#else
-static inline int apei_write_mce(struct mce *m)
-{
- return -EINVAL;
-}
-static inline ssize_t apei_read_mce(struct mce *m, u64 *record_id)
-{
- return 0;
-}
-static inline int apei_check_mce(void)
-{
- return 0;
-}
-static inline int apei_clear_mce(u64 record_id)
-{
- return -EINVAL;
-}
-#endif
diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
deleted file mode 100644
index 8a85dd1b1aa1..000000000000
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ /dev/null
@@ -1,216 +0,0 @@
-/*
- * MCE grading rules.
- * Copyright 2008, 2009 Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; version 2
- * of the License.
- *
- * Author: Andi Kleen
- */
-#include <linux/kernel.h>
-#include <linux/seq_file.h>
-#include <linux/init.h>
-#include <linux/debugfs.h>
-#include <asm/mce.h>
-
-#include "mce-internal.h"
-
-/*
- * Grade an mce by severity. In general, the most severe ones are processed
- * first. Since there are quite a lot of combinations, test the bits in a
- * table-driven way. The rules are simply processed in order; the first
- * match wins.
- *
- * Note this is only used for machine check exceptions, the corrected
- * errors use much simpler rules. The exceptions still check for the corrected
- * errors, but only to leave them alone for the CMCI handler (except for
- * panic situations)
- */
-
-enum context { IN_KERNEL = 1, IN_USER = 2 };
-enum ser { SER_REQUIRED = 1, NO_SER = 2 };
-
-static struct severity {
- u64 mask;
- u64 result;
- unsigned char sev;
- unsigned char mcgmask;
- unsigned char mcgres;
- unsigned char ser;
- unsigned char context;
- unsigned char covered;
- char *msg;
-} severities[] = {
-#define KERNEL .context = IN_KERNEL
-#define USER .context = IN_USER
-#define SER .ser = SER_REQUIRED
-#define NOSER .ser = NO_SER
-#define SEV(s) .sev = MCE_ ## s ## _SEVERITY
-#define BITCLR(x, s, m, r...) { .mask = x, .result = 0, SEV(s), .msg = m, ## r }
-#define BITSET(x, s, m, r...) { .mask = x, .result = x, SEV(s), .msg = m, ## r }
-#define MCGMASK(x, res, s, m, r...) \
- { .mcgmask = x, .mcgres = res, SEV(s), .msg = m, ## r }
-#define MASK(x, y, s, m, r...) \
- { .mask = x, .result = y, SEV(s), .msg = m, ## r }
-#define MCI_UC_S (MCI_STATUS_UC|MCI_STATUS_S)
-#define MCI_UC_SAR (MCI_STATUS_UC|MCI_STATUS_S|MCI_STATUS_AR)
-#define MCACOD 0xffff
-
- BITCLR(MCI_STATUS_VAL, NO, "Invalid"),
- BITCLR(MCI_STATUS_EN, NO, "Not enabled"),
- BITSET(MCI_STATUS_PCC, PANIC, "Processor context corrupt"),
- /* When MCIP is not set, something is very confused */
- MCGMASK(MCG_STATUS_MCIP, 0, PANIC, "MCIP not set in MCA handler"),
- /* Neither restart nor error IP -- no chance to recover -> PANIC */
- MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, 0, PANIC,
- "Neither restart nor error IP"),
- MCGMASK(MCG_STATUS_RIPV, 0, PANIC, "In kernel and no restart IP",
- KERNEL),
- BITCLR(MCI_STATUS_UC, KEEP, "Corrected error", NOSER),
- MASK(MCI_STATUS_OVER|MCI_STATUS_UC|MCI_STATUS_EN, MCI_STATUS_UC, SOME,
- "Spurious not enabled", SER),
-
- /* ignore OVER for UCNA */
- MASK(MCI_UC_SAR, MCI_STATUS_UC, KEEP,
- "Uncorrected no action required", SER),
- MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_UC|MCI_STATUS_AR, PANIC,
- "Illegal combination (UCNA with AR=1)", SER),
- MASK(MCI_STATUS_S, 0, KEEP, "Non signalled machine check", SER),
-
- /* AR add known MCACODs here */
- MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_STATUS_OVER|MCI_UC_SAR, PANIC,
- "Action required with lost events", SER),
- MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCACOD, MCI_UC_SAR, PANIC,
- "Action required; unknown MCACOD", SER),
-
- /* known AO MCACODs: */
- MASK(MCI_UC_SAR|MCI_STATUS_OVER|0xfff0, MCI_UC_S|0xc0, AO,
- "Action optional: memory scrubbing error", SER),
- MASK(MCI_UC_SAR|MCI_STATUS_OVER|MCACOD, MCI_UC_S|0x17a, AO,
- "Action optional: last level cache writeback error", SER),
-
- MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S, SOME,
- "Action optional unknown MCACOD", SER),
- MASK(MCI_STATUS_OVER|MCI_UC_SAR, MCI_UC_S|MCI_STATUS_OVER, SOME,
- "Action optional with lost events", SER),
- BITSET(MCI_STATUS_UC|MCI_STATUS_OVER, PANIC, "Overflowed uncorrected"),
- BITSET(MCI_STATUS_UC, UC, "Uncorrected"),
- BITSET(0, SOME, "No match") /* always matches. keep at end */
-};
-
-/*
- * If the EIPV bit is set, it means the saved IP points to the
- * instruction which caused the MCE.
- */
-static int error_context(struct mce *m)
-{
- if (m->mcgstatus & MCG_STATUS_EIPV)
- return (m->ip && (m->cs & 3) == 3) ? IN_USER : IN_KERNEL;
- /* Unknown, assume kernel */
- return IN_KERNEL;
-}
-
-int mce_severity(struct mce *a, int tolerant, char **msg)
-{
- enum context ctx = error_context(a);
- struct severity *s;
-
- for (s = severities;; s++) {
- if ((a->status & s->mask) != s->result)
- continue;
- if ((a->mcgstatus & s->mcgmask) != s->mcgres)
- continue;
- if (s->ser == SER_REQUIRED && !mce_ser)
- continue;
- if (s->ser == NO_SER && mce_ser)
- continue;
- if (s->context && ctx != s->context)
- continue;
- if (msg)
- *msg = s->msg;
- s->covered = 1;
- if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
- if (panic_on_oops || tolerant < 1)
- return MCE_PANIC_SEVERITY;
- }
- return s->sev;
- }
-}
-
-#ifdef CONFIG_DEBUG_FS
-static void *s_start(struct seq_file *f, loff_t *pos)
-{
- if (*pos >= ARRAY_SIZE(severities))
- return NULL;
- return &severities[*pos];
-}
-
-static void *s_next(struct seq_file *f, void *data, loff_t *pos)
-{
- if (++(*pos) >= ARRAY_SIZE(severities))
- return NULL;
- return &severities[*pos];
-}
-
-static void s_stop(struct seq_file *f, void *data)
-{
-}
-
-static int s_show(struct seq_file *f, void *data)
-{
- struct severity *ser = data;
- seq_printf(f, "%d\t%s\n", ser->covered, ser->msg);
- return 0;
-}
-
-static const struct seq_operations severities_seq_ops = {
- .start = s_start,
- .next = s_next,
- .stop = s_stop,
- .show = s_show,
-};
-
-static int severities_coverage_open(struct inode *inode, struct file *file)
-{
- return seq_open(file, &severities_seq_ops);
-}
-
-static ssize_t severities_coverage_write(struct file *file,
- const char __user *ubuf,
- size_t count, loff_t *ppos)
-{
- int i;
- for (i = 0; i < ARRAY_SIZE(severities); i++)
- severities[i].covered = 0;
- return count;
-}
-
-static const struct file_operations severities_coverage_fops = {
- .open = severities_coverage_open,
- .release = seq_release,
- .read = seq_read,
- .write = severities_coverage_write,
-};
-
-static int __init severities_debugfs_init(void)
-{
- struct dentry *dmce = NULL, *fseverities_coverage = NULL;
-
- dmce = mce_get_debugfs_dir();
- if (dmce == NULL)
- goto err_out;
- fseverities_coverage = debugfs_create_file("severities-coverage",
- 0444, dmce, NULL,
- &severities_coverage_fops);
- if (fseverities_coverage == NULL)
- goto err_out;
-
- return 0;
-
-err_out:
- return -ENOMEM;
-}
-late_initcall(severities_debugfs_init);
-#endif
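
To make the table-driven grading above concrete, here is a self-contained toy
version of the first-match walk. This is hypothetical demo code, not from the
patch; only the MCI_STATUS_VAL and MCI_STATUS_UC bit positions are taken from
the real <asm/mce.h>:

	#include <stdio.h>
	#include <stdint.h>

	struct rule { uint64_t mask, result; const char *msg; };

	#define VAL (1ULL << 63)	/* MCI_STATUS_VAL */
	#define UC  (1ULL << 61)	/* MCI_STATUS_UC */

	static const struct rule rules[] = {
		{ VAL, 0,  "Invalid" },		/* BITCLR: matches when VAL is clear */
		{ UC,  UC, "Uncorrected" },	/* BITSET: matches when UC is set */
		{ 0,   0,  "No match" },	/* always matches; keep at end */
	};

	static const char *grade(uint64_t status)
	{
		const struct rule *r;

		/* First match wins, exactly like the walk in mce_severity(). */
		for (r = rules; ; r++)
			if ((status & r->mask) == r->result)
				return r->msg;
	}

	int main(void)
	{
		printf("%s\n", grade(VAL | UC));	/* -> "Uncorrected" */
		printf("%s\n", grade(0));		/* -> "Invalid" */
		return 0;
	}
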
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
deleted file mode 100644
index 18cc42562250..000000000000
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ /dev/null
@@ -1,2221 +0,0 @@
-/*
- * Machine check handler.
- *
- * K8 parts Copyright 2002,2003 Andi Kleen, SuSE Labs.
- * Rest from unknown author(s).
- * 2004 Andi Kleen. Rewrote most of it.
- * Copyright 2008 Intel Corporation
- * Author: Andi Kleen
- */
-#include <linux/thread_info.h>
-#include <linux/capability.h>
-#include <linux/miscdevice.h>
-#include <linux/interrupt.h>
-#include <linux/ratelimit.h>
-#include <linux/kallsyms.h>
-#include <linux/rcupdate.h>
-#include <linux/kobject.h>
-#include <linux/uaccess.h>
-#include <linux/kdebug.h>
-#include <linux/kernel.h>
-#include <linux/percpu.h>
-#include <linux/string.h>
-#include <linux/sysdev.h>
-#include <linux/delay.h>
-#include <linux/ctype.h>
-#include <linux/sched.h>
-#include <linux/sysfs.h>
-#include <linux/types.h>
-#include <linux/slab.h>
-#include <linux/init.h>
-#include <linux/kmod.h>
-#include <linux/poll.h>
-#include <linux/nmi.h>
-#include <linux/cpu.h>
-#include <linux/smp.h>
-#include <linux/fs.h>
-#include <linux/mm.h>
-#include <linux/debugfs.h>
-#include <linux/edac_mce.h>
-
-#include <asm/processor.h>
-#include <asm/hw_irq.h>
-#include <asm/apic.h>
-#include <asm/idle.h>
-#include <asm/ipi.h>
-#include <asm/mce.h>
-#include <asm/msr.h>
-
-#include "mce-internal.h"
-
-static DEFINE_MUTEX(mce_read_mutex);
-
-#define rcu_dereference_check_mce(p) \
- rcu_dereference_check((p), \
- rcu_read_lock_sched_held() || \
- lockdep_is_held(&mce_read_mutex))
-
-#define CREATE_TRACE_POINTS
-#include <trace/events/mce.h>
-
-int mce_disabled __read_mostly;
-
-#define MISC_MCELOG_MINOR 227
-
-#define SPINUNIT 100 /* 100ns */
-
-atomic_t mce_entry;
-
-DEFINE_PER_CPU(unsigned, mce_exception_count);
-
-/*
- * Tolerant levels:
- * 0: always panic on uncorrected errors, log corrected errors
- * 1: panic or SIGBUS on uncorrected errors, log corrected errors
- * 2: SIGBUS or log uncorrected errors (if possible), log corrected errors
- * 3: never panic or SIGBUS, log all errors (for testing only)
- */
-static int tolerant __read_mostly = 1;
-static int banks __read_mostly;
-static int rip_msr __read_mostly;
-static int mce_bootlog __read_mostly = -1;
-static int monarch_timeout __read_mostly = -1;
-static int mce_panic_timeout __read_mostly;
-static int mce_dont_log_ce __read_mostly;
-int mce_cmci_disabled __read_mostly;
-int mce_ignore_ce __read_mostly;
-int mce_ser __read_mostly;
-
-struct mce_bank *mce_banks __read_mostly;
-
-/* User mode helper program triggered by machine check event */
-static unsigned long mce_need_notify;
-static char mce_helper[128];
-static char *mce_helper_argv[2] = { mce_helper, NULL };
-
-static DECLARE_WAIT_QUEUE_HEAD(mce_wait);
-static DEFINE_PER_CPU(struct mce, mces_seen);
-static int cpu_missing;
-
-/*
- * CPU/chipset specific EDAC code can register a notifier call here to print
- * MCE errors in a human-readable form.
- */
-ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
-EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
-
-static int default_decode_mce(struct notifier_block *nb, unsigned long val,
- void *data)
-{
- pr_emerg("No human readable MCE decoding support on this CPU type.\n");
- pr_emerg("Run the message through 'mcelog --ascii' to decode.\n");
-
- return NOTIFY_STOP;
-}
-
-static struct notifier_block mce_dec_nb = {
- .notifier_call = default_decode_mce,
- .priority = -1,
-};
-
-/* MCA banks polled by the period polling timer for corrected events */
-DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
- [0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
-};
-
-static DEFINE_PER_CPU(struct work_struct, mce_work);
-
-/* Do the initial setup of a struct mce */
-void mce_setup(struct mce *m)
-{
- memset(m, 0, sizeof(struct mce));
- m->cpu = m->extcpu = smp_processor_id();
- rdtscll(m->tsc);
- /* We hope get_seconds stays lockless */
- m->time = get_seconds();
- m->cpuvendor = boot_cpu_data.x86_vendor;
- m->cpuid = cpuid_eax(1);
-#ifdef CONFIG_SMP
- m->socketid = cpu_data(m->extcpu).phys_proc_id;
-#endif
- m->apicid = cpu_data(m->extcpu).initial_apicid;
- rdmsrl(MSR_IA32_MCG_CAP, m->mcgcap);
-}
-
-DEFINE_PER_CPU(struct mce, injectm);
-EXPORT_PER_CPU_SYMBOL_GPL(injectm);
-
-/*
- * Lockless MCE logging infrastructure.
- * This avoids deadlocks on printk locks without having to break them. It also
- * separates MCEs from kernel messages to avoid bogus bug reports.
- */
-
-static struct mce_log mcelog = {
- .signature = MCE_LOG_SIGNATURE,
- .len = MCE_LOG_LEN,
- .recordlen = sizeof(struct mce),
-};
-
-void mce_log(struct mce *mce)
-{
- unsigned next, entry;
-
- /* Emit the trace record: */
- trace_mce_record(mce);
-
- mce->finished = 0;
- wmb();
- for (;;) {
- entry = rcu_dereference_check_mce(mcelog.next);
- for (;;) {
- /*
- * If edac_mce is enabled, it will check the error type
- * and process it if it is a known error.
- * Otherwise, the error will be sent through the mcelog
- * interface.
- */
- if (edac_mce_parse(mce))
- return;
-
- /*
- * When the buffer fills up discard new entries.
- * Assume that the earlier errors are the more
- * interesting ones:
- */
- if (entry >= MCE_LOG_LEN) {
- set_bit(MCE_OVERFLOW,
- (unsigned long *)&mcelog.flags);
- return;
- }
- /* Old left over entry. Skip: */
- if (mcelog.entry[entry].finished) {
- entry++;
- continue;
- }
- break;
- }
- smp_rmb();
- next = entry + 1;
- if (cmpxchg(&mcelog.next, entry, next) == entry)
- break;
- }
- memcpy(mcelog.entry + entry, mce, sizeof(struct mce));
- wmb();
- mcelog.entry[entry].finished = 1;
- wmb();
-
- mce->finished = 1;
- set_bit(0, &mce_need_notify);
-}
-
-static void print_mce(struct mce *m)
-{
- pr_emerg("CPU %d: Machine Check Exception: %16Lx Bank %d: %016Lx\n",
- m->extcpu, m->mcgstatus, m->bank, m->status);
-
- if (m->ip) {
- pr_emerg("RIP%s %02x:<%016Lx> ",
- !(m->mcgstatus & MCG_STATUS_EIPV) ? " !INEXACT!" : "",
- m->cs, m->ip);
-
- if (m->cs == __KERNEL_CS)
- print_symbol("{%s}", m->ip);
- pr_cont("\n");
- }
-
- pr_emerg("TSC %llx ", m->tsc);
- if (m->addr)
- pr_cont("ADDR %llx ", m->addr);
- if (m->misc)
- pr_cont("MISC %llx ", m->misc);
-
- pr_cont("\n");
- pr_emerg("PROCESSOR %u:%x TIME %llu SOCKET %u APIC %x\n",
- m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid);
-
- /*
- * Print out human-readable details about the MCE error
- * (if the CPU has an implementation for that).
- */
- atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
-}
-
-static void print_mce_head(void)
-{
- pr_emerg("\nHARDWARE ERROR\n");
-}
-
-static void print_mce_tail(void)
-{
- pr_emerg("This is not a software problem!\n");
-}
-
-#define PANIC_TIMEOUT 5 /* 5 seconds */
-
-static atomic_t mce_paniced;
-
-static int fake_panic;
-static atomic_t mce_fake_paniced;
-
-/* Panic in progress. Enable interrupts and wait for final IPI */
-static void wait_for_panic(void)
-{
- long timeout = PANIC_TIMEOUT*USEC_PER_SEC;
-
- preempt_disable();
- local_irq_enable();
- while (timeout-- > 0)
- udelay(1);
- if (panic_timeout == 0)
- panic_timeout = mce_panic_timeout;
- panic("Panicing machine check CPU died");
-}
-
-static void mce_panic(char *msg, struct mce *final, char *exp)
-{
- int i, apei_err = 0;
-
- if (!fake_panic) {
- /*
- * Make sure only one CPU runs in machine check panic
- */
- if (atomic_inc_return(&mce_paniced) > 1)
- wait_for_panic();
- barrier();
-
- bust_spinlocks(1);
- console_verbose();
- } else {
- /* Don't log too much for fake panic */
- if (atomic_inc_return(&mce_fake_paniced) > 1)
- return;
- }
- print_mce_head();
- /* First print corrected ones that are still unlogged */
- for (i = 0; i < MCE_LOG_LEN; i++) {
- struct mce *m = &mcelog.entry[i];
- if (!(m->status & MCI_STATUS_VAL))
- continue;
- if (!(m->status & MCI_STATUS_UC)) {
- print_mce(m);
- if (!apei_err)
- apei_err = apei_write_mce(m);
- }
- }
- /* Now print uncorrected but with the final one last */
- for (i = 0; i < MCE_LOG_LEN; i++) {
- struct mce *m = &mcelog.entry[i];
- if (!(m->status & MCI_STATUS_VAL))
- continue;
- if (!(m->status & MCI_STATUS_UC))
- continue;
- if (!final || memcmp(m, final, sizeof(struct mce))) {
- print_mce(m);
- if (!apei_err)
- apei_err = apei_write_mce(m);
- }
- }
- if (final) {
- print_mce(final);
- if (!apei_err)
- apei_err = apei_write_mce(final);
- }
- if (cpu_missing)
- printk(KERN_EMERG "Some CPUs didn't answer in synchronization\n");
- print_mce_tail();
- if (exp)
- printk(KERN_EMERG "Machine check: %s\n", exp);
- if (!fake_panic) {
- if (panic_timeout == 0)
- panic_timeout = mce_panic_timeout;
- panic(msg);
- } else
- printk(KERN_EMERG "Fake kernel panic: %s\n", msg);
-}
-
-/* Support code for software error injection */
-
-static int msr_to_offset(u32 msr)
-{
- unsigned bank = __get_cpu_var(injectm.bank);
-
- if (msr == rip_msr)
- return offsetof(struct mce, ip);
- if (msr == MSR_IA32_MCx_STATUS(bank))
- return offsetof(struct mce, status);
- if (msr == MSR_IA32_MCx_ADDR(bank))
- return offsetof(struct mce, addr);
- if (msr == MSR_IA32_MCx_MISC(bank))
- return offsetof(struct mce, misc);
- if (msr == MSR_IA32_MCG_STATUS)
- return offsetof(struct mce, mcgstatus);
- return -1;
-}
-
-/* MSR access wrappers used for error injection */
-static u64 mce_rdmsrl(u32 msr)
-{
- u64 v;
-
- if (__get_cpu_var(injectm).finished) {
- int offset = msr_to_offset(msr);
-
- if (offset < 0)
- return 0;
- return *(u64 *)((char *)&__get_cpu_var(injectm) + offset);
- }
-
- if (rdmsrl_safe(msr, &v)) {
- WARN_ONCE(1, "mce: Unable to read msr %d!\n", msr);
- /*
- * Return zero in case the access faulted. This should
- * not happen normally but can happen if the CPU does
- * something weird, or if the code is buggy.
- */
- v = 0;
- }
-
- return v;
-}
-
-static void mce_wrmsrl(u32 msr, u64 v)
-{
- if (__get_cpu_var(injectm).finished) {
- int offset = msr_to_offset(msr);
-
- if (offset >= 0)
- *(u64 *)((char *)&__get_cpu_var(injectm) + offset) = v;
- return;
- }
- wrmsrl(msr, v);
-}
-
-/*
- * Simple lockless ring to communicate PFNs from the exception handler to the
- * process-context work function. This is vastly simplified because there's
- * only a single reader and a single writer.
- */
-#define MCE_RING_SIZE 16 /* we use one entry less */
-
-struct mce_ring {
- unsigned short start;
- unsigned short end;
- unsigned long ring[MCE_RING_SIZE];
-};
-static DEFINE_PER_CPU(struct mce_ring, mce_ring);
-
-/* Runs with CPU affinity in workqueue */
-static int mce_ring_empty(void)
-{
- struct mce_ring *r = &__get_cpu_var(mce_ring);
-
- return r->start == r->end;
-}
-
-static int mce_ring_get(unsigned long *pfn)
-{
- struct mce_ring *r;
- int ret = 0;
-
- *pfn = 0;
- get_cpu();
- r = &__get_cpu_var(mce_ring);
- if (r->start == r->end)
- goto out;
- *pfn = r->ring[r->start];
- r->start = (r->start + 1) % MCE_RING_SIZE;
- ret = 1;
-out:
- put_cpu();
- return ret;
-}
-
-/* Always runs in MCE context with preempt off */
-static int mce_ring_add(unsigned long pfn)
-{
- struct mce_ring *r = &__get_cpu_var(mce_ring);
- unsigned next;
-
- next = (r->end + 1) % MCE_RING_SIZE;
- if (next == r->start)
- return -1;
- r->ring[r->end] = pfn;
- wmb();
- r->end = next;
- return 0;
-}
-
-int mce_available(struct cpuinfo_x86 *c)
-{
- if (mce_disabled)
- return 0;
- return cpu_has(c, X86_FEATURE_MCE) && cpu_has(c, X86_FEATURE_MCA);
-}
-
-static void mce_schedule_work(void)
-{
- if (!mce_ring_empty()) {
- struct work_struct *work = &__get_cpu_var(mce_work);
- if (!work_pending(work))
- schedule_work(work);
- }
-}
-
-/*
- * Get the address of the instruction at the time of the machine check
- * error.
- */
-static inline void mce_get_rip(struct mce *m, struct pt_regs *regs)
-{
-
- if (regs && (m->mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))) {
- m->ip = regs->ip;
- m->cs = regs->cs;
- } else {
- m->ip = 0;
- m->cs = 0;
- }
- if (rip_msr)
- m->ip = mce_rdmsrl(rip_msr);
-}
-
-#ifdef CONFIG_X86_LOCAL_APIC
-/*
- * Called after interrupts have been reenabled, when
- * an MCE happened during an interrupts-off region
- * in the kernel.
- */
-asmlinkage void smp_mce_self_interrupt(struct pt_regs *regs)
-{
- ack_APIC_irq();
- exit_idle();
- irq_enter();
- mce_notify_irq();
- mce_schedule_work();
- irq_exit();
-}
-#endif
-
-static void mce_report_event(struct pt_regs *regs)
-{
- if (regs->flags & (X86_VM_MASK|X86_EFLAGS_IF)) {
- mce_notify_irq();
- /*
- * Triggering the work queue here is just an insurance
- * policy in case the syscall exit notify handler
- * doesn't run soon enough or ends up running on the
- * wrong CPU (can happen when audit sleeps)
- */
- mce_schedule_work();
- return;
- }
-
-#ifdef CONFIG_X86_LOCAL_APIC
- /*
- * Without APIC do not notify. The event will be picked
- * up eventually.
- */
- if (!cpu_has_apic)
- return;
-
- /*
- * When interrupts are disabled we cannot use
- * kernel services safely. Instead, trigger a self interrupt
- * through the APIC to do the notification
- * after interrupts are reenabled.
- */
- apic->send_IPI_self(MCE_SELF_VECTOR);
-
- /*
- * Wait for idle afterwards again so that we don't leave the
- * APIC in a non-idle state because the normal APIC writes
- * cannot exclude us.
- */
- apic_wait_icr_idle();
-#endif
-}
-
-DEFINE_PER_CPU(unsigned, mce_poll_count);
-
-/*
- * Poll for corrected events or events that happened before reset.
- * Those are just logged through /dev/mcelog.
- *
- * This is executed in standard interrupt context.
- *
- * Note: the spec recommends panicking for fatal unsignalled
- * errors here. However this would be quite problematic --
- * we would need to reimplement the Monarch handling and
- * it would mess up the exclusion between the exception handler
- * and the poll handler -- so we skip this for now.
- * These cases should not happen anyway, or only when the CPU
- * is already totally confused. In this case it's likely it will
- * not fully execute the machine check handler either.
- */
-void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
-{
- struct mce m;
- int i;
-
- percpu_inc(mce_poll_count);
-
- mce_setup(&m);
-
- m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
- for (i = 0; i < banks; i++) {
- if (!mce_banks[i].ctl || !test_bit(i, *b))
- continue;
-
- m.misc = 0;
- m.addr = 0;
- m.bank = i;
- m.tsc = 0;
-
- barrier();
- m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
- if (!(m.status & MCI_STATUS_VAL))
- continue;
-
- /*
- * Uncorrected or signalled events are handled by the exception
- * handler when it is enabled, so don't process those here.
- *
- * TBD do the same check for MCI_STATUS_EN here?
- */
- if (!(flags & MCP_UC) &&
- (m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)))
- continue;
-
- if (m.status & MCI_STATUS_MISCV)
- m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
- if (m.status & MCI_STATUS_ADDRV)
- m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
-
- if (!(flags & MCP_TIMESTAMP))
- m.tsc = 0;
- /*
- * Don't get the IP here because it's unlikely to
- * have anything to do with the actual error location.
- */
- if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) {
- mce_log(&m);
- add_taint(TAINT_MACHINE_CHECK);
- }
-
- /*
- * Clear state for this bank.
- */
- mce_wrmsrl(MSR_IA32_MCx_STATUS(i), 0);
- }
-
- /*
- * Don't clear MCG_STATUS here because it's only defined for
- * exceptions.
- */
-
- sync_core();
-}
-EXPORT_SYMBOL_GPL(machine_check_poll);
-
-/*
- * Do a quick check if any of the events requires a panic.
- * This decides if we keep the events around or clear them.
- */
-static int mce_no_way_out(struct mce *m, char **msg)
-{
- int i;
-
- for (i = 0; i < banks; i++) {
- m->status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
- if (mce_severity(m, tolerant, msg) >= MCE_PANIC_SEVERITY)
- return 1;
- }
- return 0;
-}
-
-/*
- * Variable to establish order between CPUs while scanning.
- * Each CPU spins initially until executing is equal its number.
- */
-static atomic_t mce_executing;
-
-/*
- * Defines order of CPUs on entry. First CPU becomes Monarch.
- */
-static atomic_t mce_callin;
-
-/*
- * Check if a timeout waiting for other CPUs happened.
- */
-static int mce_timed_out(u64 *t)
-{
- /*
- * The others already did panic for some reason.
- * Bail out like in a timeout.
- * rmb() to tell the compiler that system_state
- * might have been modified by someone else.
- */
- rmb();
- if (atomic_read(&mce_paniced))
- wait_for_panic();
- if (!monarch_timeout)
- goto out;
- if ((s64)*t < SPINUNIT) {
- /* CHECKME: Make panic default for 1 too? */
- if (tolerant < 1)
- mce_panic("Timeout synchronizing machine check over CPUs",
- NULL, NULL);
- cpu_missing = 1;
- return 1;
- }
- *t -= SPINUNIT;
-out:
- touch_nmi_watchdog();
- return 0;
-}
-
-/*
- * The Monarch's reign. The Monarch is the CPU who entered
- * the machine check handler first. It waits for the others to
- * raise the exception too and then grades them. When any
- * error is fatal, panic. Only then let the others continue.
- *
- * The other CPUs entering the MCE handler will be controlled by the
- * Monarch. They are called Subjects.
- *
- * This way we prevent any potential data corruption in an unrecoverable case
- * and also make sure that all CPUs' errors are always examined.
- *
- * Also this detects the case of a machine check event coming from outer
- * space (not detected by any CPU). In this case some external agent wants
- * us to shut down, so panic too.
- *
- * The other CPUs might still decide to panic if the handler happens
- * in an unrecoverable place, but in this case the system is in a semi-stable
- * state and won't corrupt anything by itself. It's ok to let the others
- * continue for a bit first.
- *
- * All the spin loops have timeouts; when a timeout happens a CPU
- * typically elects itself to be Monarch.
- */
-static void mce_reign(void)
-{
- int cpu;
- struct mce *m = NULL;
- int global_worst = 0;
- char *msg = NULL;
- char *nmsg = NULL;
-
- /*
- * This CPU is the Monarch and the other CPUs have run
- * through their handlers.
- * Grade the severity of the errors of all the CPUs.
- */
- for_each_possible_cpu(cpu) {
- int severity = mce_severity(&per_cpu(mces_seen, cpu), tolerant,
- &nmsg);
- if (severity > global_worst) {
- msg = nmsg;
- global_worst = severity;
- m = &per_cpu(mces_seen, cpu);
- }
- }
-
- /*
- * Cannot recover? Panic here then.
- * This dumps all the mces in the log buffer and stops the
- * other CPUs.
- */
- if (m && global_worst >= MCE_PANIC_SEVERITY && tolerant < 3)
- mce_panic("Fatal Machine check", m, msg);
-
- /*
- * For UC somewhere we let the CPU who detects it handle it.
- * Also must let continue the others, otherwise the handling
- * CPU could deadlock on a lock.
- */
-
- /*
- * No machine check event found. Must be some external
- * source or one CPU is hung. Panic.
- */
- if (global_worst <= MCE_KEEP_SEVERITY && tolerant < 3)
- mce_panic("Machine check from unknown source", NULL, NULL);
-
- /*
- * Now clear all the mces_seen so that they don't reappear on
- * the next mce.
- */
- for_each_possible_cpu(cpu)
- memset(&per_cpu(mces_seen, cpu), 0, sizeof(struct mce));
-}
-
-static atomic_t global_nwo;
-
-/*
- * Start of Monarch synchronization. This waits until all CPUs have
- * entered the exception handler and then determines if any of them
- * saw a fatal event that requires panic. Then it executes them
- * in the entry order.
- * TBD double check parallel CPU hotunplug
- */
-static int mce_start(int *no_way_out)
-{
- int order;
- int cpus = num_online_cpus();
- u64 timeout = (u64)monarch_timeout * NSEC_PER_USEC;
-
- if (!timeout)
- return -1;
-
- atomic_add(*no_way_out, &global_nwo);
- /*
- * global_nwo should be updated before mce_callin
- */
- smp_wmb();
- order = atomic_inc_return(&mce_callin);
-
- /*
- * Wait for everyone.
- */
- while (atomic_read(&mce_callin) != cpus) {
- if (mce_timed_out(&timeout)) {
- atomic_set(&global_nwo, 0);
- return -1;
- }
- ndelay(SPINUNIT);
- }
-
- /*
- * mce_callin should be read before global_nwo
- */
- smp_rmb();
-
- if (order == 1) {
- /*
- * Monarch: Starts executing now, the others wait.
- */
- atomic_set(&mce_executing, 1);
- } else {
- /*
- * Subject: Now start the scanning loop one by one in
- * the original callin order.
- * This way, any shared bank will be seen by only one
- * CPU before it is cleared, avoiding duplicates.
- */
- while (atomic_read(&mce_executing) < order) {
- if (mce_timed_out(&timeout)) {
- atomic_set(&global_nwo, 0);
- return -1;
- }
- ndelay(SPINUNIT);
- }
- }
-
- /*
- * Cache the global no_way_out state.
- */
- *no_way_out = atomic_read(&global_nwo);
-
- return order;
-}
-
-/*
- * Synchronize between CPUs after main scanning loop.
- * This invokes the bulk of the Monarch processing.
- */
-static int mce_end(int order)
-{
- int ret = -1;
- u64 timeout = (u64)monarch_timeout * NSEC_PER_USEC;
-
- if (!timeout)
- goto reset;
- if (order < 0)
- goto reset;
-
- /*
- * Allow others to run.
- */
- atomic_inc(&mce_executing);
-
- if (order == 1) {
- /* CHECKME: Can this race with a parallel hotplug? */
- int cpus = num_online_cpus();
-
- /*
- * Monarch: Wait for everyone to go through their scanning
- * loops.
- */
- while (atomic_read(&mce_executing) <= cpus) {
- if (mce_timed_out(&timeout))
- goto reset;
- ndelay(SPINUNIT);
- }
-
- mce_reign();
- barrier();
- ret = 0;
- } else {
- /*
- * Subject: Wait for Monarch to finish.
- */
- while (atomic_read(&mce_executing) != 0) {
- if (mce_timed_out(&timeout))
- goto reset;
- ndelay(SPINUNIT);
- }
-
- /*
- * Don't reset anything. That's done by the Monarch.
- */
- return 0;
- }
-
- /*
- * Reset all global state.
- */
-reset:
- atomic_set(&global_nwo, 0);
- atomic_set(&mce_callin, 0);
- barrier();
-
- /*
- * Let others run again.
- */
- atomic_set(&mce_executing, 0);
- return ret;
-}
-
-/*
- * Check if the address reported by the CPU is in a format we can parse.
- * It would be possible to add code for most other cases, but all would
- * be somewhat complicated (e.g. segment offset would require an instruction
- * parser). So only support physical addresses up to page granularity for now.
- */
-static int mce_usable_address(struct mce *m)
-{
- if (!(m->status & MCI_STATUS_MISCV) || !(m->status & MCI_STATUS_ADDRV))
- return 0;
- if ((m->misc & 0x3f) > PAGE_SHIFT)
- return 0;
- if (((m->misc >> 6) & 7) != MCM_ADDR_PHYS)
- return 0;
- return 1;
-}
-
-static void mce_clear_state(unsigned long *toclear)
-{
- int i;
-
- for (i = 0; i < banks; i++) {
- if (test_bit(i, toclear))
- mce_wrmsrl(MSR_IA32_MCx_STATUS(i), 0);
- }
-}
-
-/*
- * The actual machine check handler. This only handles real
- * exceptions when something got corrupted coming in through int 18.
- *
- * This is executed in NMI context not subject to normal locking rules. This
- * implies that most kernel services cannot be safely used. Don't even
- * think about putting a printk in there!
- *
- * On Intel systems this is entered on all CPUs in parallel through
- * MCE broadcast. However some CPUs might be broken beyond repair,
- * so always be careful when synchronizing with others.
- */
-void do_machine_check(struct pt_regs *regs, long error_code)
-{
- struct mce m, *final;
- int i;
- int worst = 0;
- int severity;
- /*
- * Establish sequential order between the CPUs entering the machine
- * check handler.
- */
- int order;
- /*
- * If no_way_out gets set, there is no safe way to recover from this
- * MCE. If tolerant is cranked up, we'll try anyway.
- */
- int no_way_out = 0;
- /*
- * If kill_it gets set, there might be a way to recover from this
- * error.
- */
- int kill_it = 0;
- DECLARE_BITMAP(toclear, MAX_NR_BANKS);
- char *msg = "Unknown";
-
- atomic_inc(&mce_entry);
-
- percpu_inc(mce_exception_count);
-
- if (notify_die(DIE_NMI, "machine check", regs, error_code,
- 18, SIGKILL) == NOTIFY_STOP)
- goto out;
- if (!banks)
- goto out;
-
- mce_setup(&m);
-
- m.mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);
- final = &__get_cpu_var(mces_seen);
- *final = m;
-
- no_way_out = mce_no_way_out(&m, &msg);
-
- barrier();
-
- /*
- * When there is no restart IP, we must always kill or panic.
- */
- if (!(m.mcgstatus & MCG_STATUS_RIPV))
- kill_it = 1;
-
- /*
- * Go through all the banks in exclusion of the other CPUs.
- * This way we don't report duplicated events on shared banks
- * because the first one to see it will clear it.
- */
- order = mce_start(&no_way_out);
- for (i = 0; i < banks; i++) {
- __clear_bit(i, toclear);
- if (!mce_banks[i].ctl)
- continue;
-
- m.misc = 0;
- m.addr = 0;
- m.bank = i;
-
- m.status = mce_rdmsrl(MSR_IA32_MCx_STATUS(i));
- if ((m.status & MCI_STATUS_VAL) == 0)
- continue;
-
- /*
- * Non-uncorrected or non-signaled errors are handled by
- * machine_check_poll. Leave them alone, unless this panics.
- */
- if (!(m.status & (mce_ser ? MCI_STATUS_S : MCI_STATUS_UC)) &&
- !no_way_out)
- continue;
-
- /*
- * Set taint even when machine check was not enabled.
- */
- add_taint(TAINT_MACHINE_CHECK);
-
- severity = mce_severity(&m, tolerant, NULL);
-
- /*
- * When the machine check was for a corrected error, don't
- * touch it unless we're panicking.
- */
- if (severity == MCE_KEEP_SEVERITY && !no_way_out)
- continue;
- __set_bit(i, toclear);
- if (severity == MCE_NO_SEVERITY) {
- /*
- * Machine check event was not enabled. Clear, but
- * ignore.
- */
- continue;
- }
-
- /*
- * Kill on action required.
- */
- if (severity == MCE_AR_SEVERITY)
- kill_it = 1;
-
- if (m.status & MCI_STATUS_MISCV)
- m.misc = mce_rdmsrl(MSR_IA32_MCx_MISC(i));
- if (m.status & MCI_STATUS_ADDRV)
- m.addr = mce_rdmsrl(MSR_IA32_MCx_ADDR(i));
-
- /*
- * Action optional error. Queue address for later processing.
- * When the ring overflows we just ignore the AO error.
- * RED-PEN add some logging mechanism when
- * mce_usable_address() or mce_ring_add() fails.
- * RED-PEN don't ignore overflow for tolerant == 0
- */
- if (severity == MCE_AO_SEVERITY && mce_usable_address(&m))
- mce_ring_add(m.addr >> PAGE_SHIFT);
-
- mce_get_rip(&m, regs);
- mce_log(&m);
-
- if (severity > worst) {
- *final = m;
- worst = severity;
- }
- }
-
- if (!no_way_out)
- mce_clear_state(toclear);
-
- /*
- * Do most of the synchronization with other CPUs.
- * When there's any problem use only local no_way_out state.
- */
- if (mce_end(order) < 0)
- no_way_out = worst >= MCE_PANIC_SEVERITY;
-
- /*
- * If we have decided that we just CAN'T continue, and the user
- * has not set tolerant to an insane level, give up and die.
- *
- * This is mainly used in the case when the system doesn't
- * support MCE broadcasting or it has been disabled.
- */
- if (no_way_out && tolerant < 3)
- mce_panic("Fatal machine check on current CPU", final, msg);
-
- /*
- * If the error seems to be unrecoverable, something should be
- * done. Try to kill as little as possible. If we can kill just
- * one task, do that. If the user has set the tolerance very
- * high, don't try to do anything at all.
- */
-
- if (kill_it && tolerant < 3)
- force_sig(SIGBUS, current);
-
- /* notify userspace ASAP */
- set_thread_flag(TIF_MCE_NOTIFY);
-
- if (worst > 0)
- mce_report_event(regs);
- mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
-out:
- atomic_dec(&mce_entry);
- sync_core();
-}
-EXPORT_SYMBOL_GPL(do_machine_check);
-
-/* dummy to break dependency. actual code is in mm/memory-failure.c */
-void __attribute__((weak)) memory_failure(unsigned long pfn, int vector)
-{
- printk(KERN_ERR "Action optional memory failure at %lx ignored\n", pfn);
-}
-
-/*
- * Called after mce notification in process context. This code
- * is allowed to sleep. Call the high level VM handler to process
- * any corrupted pages.
- * Assume that the work queue code only calls this one at a time
- * per CPU.
- * Note we don't disable preemption, so this code might run on the wrong
- * CPU. In this case the event is picked up by the scheduled work queue.
- * This is merely a fast path to expedite processing in some common
- * cases.
- */
-void mce_notify_process(void)
-{
- unsigned long pfn;
- mce_notify_irq();
- while (mce_ring_get(&pfn))
- memory_failure(pfn, MCE_VECTOR);
-}
-
-static void mce_process_work(struct work_struct *dummy)
-{
- mce_notify_process();
-}
-
-#ifdef CONFIG_X86_MCE_INTEL
-/**
- * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
- * @cpu: The CPU on which the event occurred.
- * @status: Event status information
- *
- * This function should be called by the thermal interrupt after the
- * event has been processed and the decision was made to log the event
- * further.
- *
- * The status parameter will be saved to the 'status' field of 'struct mce'
- * and historically has been the register value of the
- * MSR_IA32_THERMAL_STATUS (Intel) msr.
- */
-void mce_log_therm_throt_event(__u64 status)
-{
- struct mce m;
-
- mce_setup(&m);
- m.bank = MCE_THERMAL_BANK;
- m.status = status;
- mce_log(&m);
-}
-#endif /* CONFIG_X86_MCE_INTEL */
-
-/*
- * Periodic polling timer for "silent" machine check errors. If the
- * poller finds an MCE, poll 2x faster. When the poller finds no more
- * errors, poll 2x slower (up to check_interval seconds).
- */
-static int check_interval = 5 * 60; /* 5 minutes */
-
-static DEFINE_PER_CPU(int, mce_next_interval); /* in jiffies */
-static DEFINE_PER_CPU(struct timer_list, mce_timer);
-
-static void mce_start_timer(unsigned long data)
-{
- struct timer_list *t = &per_cpu(mce_timer, data);
- int *n;
-
- WARN_ON(smp_processor_id() != data);
-
- if (mce_available(&current_cpu_data)) {
- machine_check_poll(MCP_TIMESTAMP,
- &__get_cpu_var(mce_poll_banks));
- }
-
- /*
- * Alert userspace if needed. If we logged an MCE, reduce the
- * polling interval, otherwise increase the polling interval.
- */
- n = &__get_cpu_var(mce_next_interval);
- if (mce_notify_irq())
- *n = max(*n/2, HZ/100);
- else
- *n = min(*n*2, (int)round_jiffies_relative(check_interval*HZ));
-
- t->expires = jiffies + *n;
- add_timer_on(t, smp_processor_id());
-}
-
-static void mce_do_trigger(struct work_struct *work)
-{
- call_usermodehelper(mce_helper, mce_helper_argv, NULL, UMH_NO_WAIT);
-}
-
-static DECLARE_WORK(mce_trigger_work, mce_do_trigger);
-
-/*
- * Notify the user(s) about new machine check events.
- * Can be called from interrupt context, but not from machine check/NMI
- * context.
- */
-int mce_notify_irq(void)
-{
- /* Not more than two messages every minute */
- static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
-
- clear_thread_flag(TIF_MCE_NOTIFY);
-
- if (test_and_clear_bit(0, &mce_need_notify)) {
- wake_up_interruptible(&mce_wait);
-
- /*
- * There is no risk of missing notifications because
- * work_pending is always cleared before the function is
- * executed.
- */
- if (mce_helper[0] && !work_pending(&mce_trigger_work))
- schedule_work(&mce_trigger_work);
-
- if (__ratelimit(&ratelimit))
- printk(KERN_INFO "Machine check events logged\n");
-
- return 1;
- }
- return 0;
-}
-EXPORT_SYMBOL_GPL(mce_notify_irq);
-
-static int __cpuinit __mcheck_cpu_mce_banks_init(void)
-{
- int i;
-
- mce_banks = kzalloc(banks * sizeof(struct mce_bank), GFP_KERNEL);
- if (!mce_banks)
- return -ENOMEM;
- for (i = 0; i < banks; i++) {
- struct mce_bank *b = &mce_banks[i];
-
- b->ctl = -1ULL;
- b->init = 1;
- }
- return 0;
-}
-
-/*
- * Initialize Machine Checks for a CPU.
- */
-static int __cpuinit __mcheck_cpu_cap_init(void)
-{
- unsigned b;
- u64 cap;
-
- rdmsrl(MSR_IA32_MCG_CAP, cap);
-
- b = cap & MCG_BANKCNT_MASK;
- if (!banks)
- printk(KERN_INFO "mce: CPU supports %d MCE banks\n", b);
-
- if (b > MAX_NR_BANKS) {
- printk(KERN_WARNING
- "MCE: Using only %u machine check banks out of %u\n",
- MAX_NR_BANKS, b);
- b = MAX_NR_BANKS;
- }
-
- /* Don't support asymmetric configurations today */
- WARN_ON(banks != 0 && b != banks);
- banks = b;
- if (!mce_banks) {
- int err = __mcheck_cpu_mce_banks_init();
-
- if (err)
- return err;
- }
-
- /* Use accurate RIP reporting if available. */
- if ((cap & MCG_EXT_P) && MCG_EXT_CNT(cap) >= 9)
- rip_msr = MSR_IA32_MCG_EIP;
-
- if (cap & MCG_SER_P)
- mce_ser = 1;
-
- return 0;
-}
-
-static void __mcheck_cpu_init_generic(void)
-{
- mce_banks_t all_banks;
- u64 cap;
- int i;
-
- /*
- * Log the machine checks left over from the previous reset.
- */
- bitmap_fill(all_banks, MAX_NR_BANKS);
- machine_check_poll(MCP_UC|(!mce_bootlog ? MCP_DONTLOG : 0), &all_banks);
-
- set_in_cr4(X86_CR4_MCE);
-
- rdmsrl(MSR_IA32_MCG_CAP, cap);
- if (cap & MCG_CTL_P)
- wrmsr(MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
-
- for (i = 0; i < banks; i++) {
- struct mce_bank *b = &mce_banks[i];
-
- if (!b->init)
- continue;
- wrmsrl(MSR_IA32_MCx_CTL(i), b->ctl);
- wrmsrl(MSR_IA32_MCx_STATUS(i), 0);
- }
-}
-
-/* Add per CPU specific workarounds here */
-static int __cpuinit __mcheck_cpu_apply_quirks(struct cpuinfo_x86 *c)
-{
- if (c->x86_vendor == X86_VENDOR_UNKNOWN) {
- pr_info("MCE: unknown CPU type - not enabling MCE support.\n");
- return -EOPNOTSUPP;
- }
-
- /* This should be disabled by the BIOS, but isn't always */
- if (c->x86_vendor == X86_VENDOR_AMD) {
- if (c->x86 == 15 && banks > 4) {
- /*
- * disable GART TBL walk error reporting, which
- * trips off incorrectly with the IOMMU & 3ware
- * & Cerberus:
- */
- clear_bit(10, (unsigned long *)&mce_banks[4].ctl);
- }
- if (c->x86 <= 17 && mce_bootlog < 0) {
- /*
- * Lots of broken BIOSes around that don't clear them
- * by default and leave crap in there. Don't log:
- */
- mce_bootlog = 0;
- }
- /*
- * Various K7s with broken bank 0 around. Always disable
- * by default.
- */
- if (c->x86 == 6 && banks > 0)
- mce_banks[0].ctl = 0;
- }
-
- if (c->x86_vendor == X86_VENDOR_INTEL) {
- /*
- * SDM documents that on family 6 bank 0 should not be written
- * because it aliases to another special BIOS-controlled
- * register.
- * But it's not aliased anymore on model 0x1a+.
- * Don't ignore bank 0 completely because there could be a
- * valid event later, merely don't write CTL0.
- */
-
- if (c->x86 == 6 && c->x86_model < 0x1A && banks > 0)
- mce_banks[0].init = 0;
-
- /*
- * All newer Intel systems support MCE broadcasting. Enable
- * synchronization with a one second timeout.
- */
- if ((c->x86 > 6 || (c->x86 == 6 && c->x86_model >= 0xe)) &&
- monarch_timeout < 0)
- monarch_timeout = USEC_PER_SEC;
-
- /*
- * There are also broken BIOSes on some Pentium M and
- * earlier systems:
- */
- if (c->x86 == 6 && c->x86_model <= 13 && mce_bootlog < 0)
- mce_bootlog = 0;
- }
- if (monarch_timeout < 0)
- monarch_timeout = 0;
- if (mce_bootlog != 0)
- mce_panic_timeout = 30;
-
- return 0;
-}
-
-static void __cpuinit __mcheck_cpu_ancient_init(struct cpuinfo_x86 *c)
-{
- if (c->x86 != 5)
- return;
- switch (c->x86_vendor) {
- case X86_VENDOR_INTEL:
- intel_p5_mcheck_init(c);
- break;
- case X86_VENDOR_CENTAUR:
- winchip_mcheck_init(c);
- break;
- }
-}
-
-static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
-{
- switch (c->x86_vendor) {
- case X86_VENDOR_INTEL:
- mce_intel_feature_init(c);
- break;
- case X86_VENDOR_AMD:
- mce_amd_feature_init(c);
- break;
- default:
- break;
- }
-}
-
-static void __mcheck_cpu_init_timer(void)
-{
- struct timer_list *t = &__get_cpu_var(mce_timer);
- int *n = &__get_cpu_var(mce_next_interval);
-
- setup_timer(t, mce_start_timer, smp_processor_id());
-
- if (mce_ignore_ce)
- return;
-
- *n = check_interval * HZ;
- if (!*n)
- return;
- t->expires = round_jiffies(jiffies + *n);
- add_timer_on(t, smp_processor_id());
-}
-
-/* Handle unconfigured int18 (should never happen) */
-static void unexpected_machine_check(struct pt_regs *regs, long error_code)
-{
- printk(KERN_ERR "CPU#%d: Unexpected int18 (Machine Check).\n",
- smp_processor_id());
-}
-
-/* Call the installed machine check handler for this CPU setup. */
-void (*machine_check_vector)(struct pt_regs *, long error_code) =
- unexpected_machine_check;
-
-/*
- * Called for each booted CPU to set up machine checks.
- * Must be called with preempt off:
- */
-void __cpuinit mcheck_cpu_init(struct cpuinfo_x86 *c)
-{
- if (mce_disabled)
- return;
-
- __mcheck_cpu_ancient_init(c);
-
- if (!mce_available(c))
- return;
-
- if (__mcheck_cpu_cap_init() < 0 || __mcheck_cpu_apply_quirks(c) < 0) {
- mce_disabled = 1;
- return;
- }
-
- machine_check_vector = do_machine_check;
-
- __mcheck_cpu_init_generic();
- __mcheck_cpu_init_vendor(c);
- __mcheck_cpu_init_timer();
- INIT_WORK(&__get_cpu_var(mce_work), mce_process_work);
-
-}
-
-/*
- * Character device to read and clear the MCE log.
- */
-
-static DEFINE_SPINLOCK(mce_state_lock);
-static int open_count; /* #times opened */
-static int open_exclu; /* already open exclusive? */
-
-static int mce_open(struct inode *inode, struct file *file)
-{
- spin_lock(&mce_state_lock);
-
- if (open_exclu || (open_count && (file->f_flags & O_EXCL))) {
- spin_unlock(&mce_state_lock);
-
- return -EBUSY;
- }
-
- if (file->f_flags & O_EXCL)
- open_exclu = 1;
- open_count++;
-
- spin_unlock(&mce_state_lock);
-
- return nonseekable_open(inode, file);
-}
-
-static int mce_release(struct inode *inode, struct file *file)
-{
- spin_lock(&mce_state_lock);
-
- open_count--;
- open_exclu = 0;
-
- spin_unlock(&mce_state_lock);
-
- return 0;
-}
-
-static void collect_tscs(void *data)
-{
- unsigned long *cpu_tsc = (unsigned long *)data;
-
- rdtscll(cpu_tsc[smp_processor_id()]);
-}
-
-static int mce_apei_read_done;
-
-/* Collect MCE record of previous boot in persistent storage via APEI ERST. */
-static int __mce_read_apei(char __user **ubuf, size_t usize)
-{
- int rc;
- u64 record_id;
- struct mce m;
-
- if (usize < sizeof(struct mce))
- return -EINVAL;
-
- rc = apei_read_mce(&m, &record_id);
- /* Error or no more MCE record */
- if (rc <= 0) {
- mce_apei_read_done = 1;
- return rc;
- }
- rc = -EFAULT;
- if (copy_to_user(*ubuf, &m, sizeof(struct mce)))
- return rc;
- /*
- * In fact, we should have cleared the record only after it has
- * been flushed to disk or sent over the network by
- * /sbin/mcelog, but we have no interface to support that now,
- * so just clear it to avoid duplication.
- */
- rc = apei_clear_mce(record_id);
- if (rc) {
- mce_apei_read_done = 1;
- return rc;
- }
- *ubuf += sizeof(struct mce);
-
- return 0;
-}
-
-static ssize_t mce_read(struct file *filp, char __user *ubuf, size_t usize,
- loff_t *off)
-{
- char __user *buf = ubuf;
- unsigned long *cpu_tsc;
- unsigned prev, next;
- int i, err;
-
- cpu_tsc = kmalloc(nr_cpu_ids * sizeof(long), GFP_KERNEL);
- if (!cpu_tsc)
- return -ENOMEM;
-
- mutex_lock(&mce_read_mutex);
-
- if (!mce_apei_read_done) {
- err = __mce_read_apei(&buf, usize);
- if (err || buf != ubuf)
- goto out;
- }
-
- next = rcu_dereference_check_mce(mcelog.next);
-
- /* Only supports full reads right now */
- err = -EINVAL;
- if (*off != 0 || usize < MCE_LOG_LEN*sizeof(struct mce))
- goto out;
-
- err = 0;
- prev = 0;
- do {
- for (i = prev; i < next; i++) {
- unsigned long start = jiffies;
-
- while (!mcelog.entry[i].finished) {
- if (time_after_eq(jiffies, start + 2)) {
- memset(mcelog.entry + i, 0,
- sizeof(struct mce));
- goto timeout;
- }
- cpu_relax();
- }
- smp_rmb();
- err |= copy_to_user(buf, mcelog.entry + i,
- sizeof(struct mce));
- buf += sizeof(struct mce);
-timeout:
- ;
- }
-
- memset(mcelog.entry + prev, 0,
- (next - prev) * sizeof(struct mce));
- prev = next;
- next = cmpxchg(&mcelog.next, prev, 0);
- } while (next != prev);
-
- synchronize_sched();
-
- /*
- * Collect entries that were still getting written before the
- * synchronize.
- */
- on_each_cpu(collect_tscs, cpu_tsc, 1);
-
- for (i = next; i < MCE_LOG_LEN; i++) {
- if (mcelog.entry[i].finished &&
- mcelog.entry[i].tsc < cpu_tsc[mcelog.entry[i].cpu]) {
- err |= copy_to_user(buf, mcelog.entry+i,
- sizeof(struct mce));
- smp_rmb();
- buf += sizeof(struct mce);
- memset(&mcelog.entry[i], 0, sizeof(struct mce));
- }
- }
-
- if (err)
- err = -EFAULT;
-
-out:
- mutex_unlock(&mce_read_mutex);
- kfree(cpu_tsc);
-
- return err ? err : buf - ubuf;
-}
-
-static unsigned int mce_poll(struct file *file, poll_table *wait)
-{
- poll_wait(file, &mce_wait, wait);
- if (rcu_dereference_check_mce(mcelog.next))
- return POLLIN | POLLRDNORM;
- if (!mce_apei_read_done && apei_check_mce())
- return POLLIN | POLLRDNORM;
- return 0;
-}
-
-static long mce_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
-{
- int __user *p = (int __user *)arg;
-
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
-
- switch (cmd) {
- case MCE_GET_RECORD_LEN:
- return put_user(sizeof(struct mce), p);
- case MCE_GET_LOG_LEN:
- return put_user(MCE_LOG_LEN, p);
- case MCE_GETCLEAR_FLAGS: {
- unsigned flags;
-
- do {
- flags = mcelog.flags;
- } while (cmpxchg(&mcelog.flags, flags, 0) != flags);
-
- return put_user(flags, p);
- }
- default:
- return -ENOTTY;
- }
-}
-
-/* Modified in mce-inject.c, so not static or const */
-struct file_operations mce_chrdev_ops = {
- .open = mce_open,
- .release = mce_release,
- .read = mce_read,
- .poll = mce_poll,
- .unlocked_ioctl = mce_ioctl,
-};
-EXPORT_SYMBOL_GPL(mce_chrdev_ops);
-
-static struct miscdevice mce_log_device = {
- MISC_MCELOG_MINOR,
- "mcelog",
- &mce_chrdev_ops,
-};
-
-/*
- * mce=off Disables machine check
- * mce=no_cmci Disables CMCI
- * mce=dont_log_ce Clears corrected events silently, no log created for CEs.
- * mce=ignore_ce Disables polling and CMCI, corrected events are not cleared.
- * mce=TOLERANCELEVEL[,monarchtimeout] (number, see above)
- * monarchtimeout is how long to wait for other CPUs on machine
- * check, or 0 to not wait
- * mce=bootlog Log MCEs from before booting. Disabled by default on AMD.
- * mce=nobootlog Don't log MCEs from before booting.
- */
-static int __init mcheck_enable(char *str)
-{
- if (*str == 0) {
- enable_p5_mce();
- return 1;
- }
- if (*str == '=')
- str++;
- if (!strcmp(str, "off"))
- mce_disabled = 1;
- else if (!strcmp(str, "no_cmci"))
- mce_cmci_disabled = 1;
- else if (!strcmp(str, "dont_log_ce"))
- mce_dont_log_ce = 1;
- else if (!strcmp(str, "ignore_ce"))
- mce_ignore_ce = 1;
- else if (!strcmp(str, "bootlog") || !strcmp(str, "nobootlog"))
- mce_bootlog = (str[0] == 'b');
- else if (isdigit(str[0])) {
- get_option(&str, &tolerant);
- if (*str == ',') {
- ++str;
- get_option(&str, &monarch_timeout);
- }
- } else {
- printk(KERN_INFO "mce argument %s ignored. Please use /sys\n",
- str);
- return 0;
- }
- return 1;
-}
-__setup("mce", mcheck_enable);
-
-int __init mcheck_init(void)
-{
- atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
-
- mcheck_intel_therm_init();
-
- return 0;
-}
-
-/*
- * Sysfs support
- */
-
-/*
- * Disable machine checks on suspend and shutdown. We can't really handle
- * them later.
- */
-static int mce_disable_error_reporting(void)
-{
- int i;
-
- for (i = 0; i < banks; i++) {
- struct mce_bank *b = &mce_banks[i];
-
- if (b->init)
- wrmsrl(MSR_IA32_MCx_CTL(i), 0);
- }
- return 0;
-}
-
-static int mce_suspend(struct sys_device *dev, pm_message_t state)
-{
- return mce_disable_error_reporting();
-}
-
-static int mce_shutdown(struct sys_device *dev)
-{
- return mce_disable_error_reporting();
-}
-
-/*
- * On resume clear all MCE state. Don't want to see leftovers from the BIOS.
- * Only one CPU is active at this time, the others get re-added later using
- * CPU hotplug:
- */
-static int mce_resume(struct sys_device *dev)
-{
- __mcheck_cpu_init_generic();
- __mcheck_cpu_init_vendor(&current_cpu_data);
-
- return 0;
-}
-
-static void mce_cpu_restart(void *data)
-{
- del_timer_sync(&__get_cpu_var(mce_timer));
- if (!mce_available(&current_cpu_data))
- return;
- __mcheck_cpu_init_generic();
- __mcheck_cpu_init_timer();
-}
-
-/* Reinit MCEs after user configuration changes */
-static void mce_restart(void)
-{
- on_each_cpu(mce_cpu_restart, NULL, 1);
-}
-
-/* Toggle features for corrected errors */
-static void mce_disable_ce(void *all)
-{
- if (!mce_available(&current_cpu_data))
- return;
- if (all)
- del_timer_sync(&__get_cpu_var(mce_timer));
- cmci_clear();
-}
-
-static void mce_enable_ce(void *all)
-{
- if (!mce_available(&current_cpu_data))
- return;
- cmci_reenable();
- cmci_recheck();
- if (all)
- __mcheck_cpu_init_timer();
-}
-
-static struct sysdev_class mce_sysclass = {
- .suspend = mce_suspend,
- .shutdown = mce_shutdown,
- .resume = mce_resume,
- .name = "machinecheck",
-};
-
-DEFINE_PER_CPU(struct sys_device, mce_dev);
-
-__cpuinitdata
-void (*threshold_cpu_callback)(unsigned long action, unsigned int cpu);
-
-static inline struct mce_bank *attr_to_bank(struct sysdev_attribute *attr)
-{
- return container_of(attr, struct mce_bank, attr);
-}
-
-static ssize_t show_bank(struct sys_device *s, struct sysdev_attribute *attr,
- char *buf)
-{
- return sprintf(buf, "%llx\n", attr_to_bank(attr)->ctl);
-}
-
-static ssize_t set_bank(struct sys_device *s, struct sysdev_attribute *attr,
- const char *buf, size_t size)
-{
- u64 new;
-
- if (strict_strtoull(buf, 0, &new) < 0)
- return -EINVAL;
-
- attr_to_bank(attr)->ctl = new;
- mce_restart();
-
- return size;
-}
-
-static ssize_t
-show_trigger(struct sys_device *s, struct sysdev_attribute *attr, char *buf)
-{
- strcpy(buf, mce_helper);
- strcat(buf, "\n");
- return strlen(mce_helper) + 1;
-}
-
-static ssize_t set_trigger(struct sys_device *s, struct sysdev_attribute *attr,
- const char *buf, size_t siz)
-{
- char *p;
-
- strncpy(mce_helper, buf, sizeof(mce_helper));
- mce_helper[sizeof(mce_helper)-1] = 0;
- p = strchr(mce_helper, '\n');
-
- if (p)
- *p = 0;
-
- return strlen(mce_helper) + !!p;
-}
-
-static ssize_t set_ignore_ce(struct sys_device *s,
- struct sysdev_attribute *attr,
- const char *buf, size_t size)
-{
- u64 new;
-
- if (strict_strtoull(buf, 0, &new) < 0)
- return -EINVAL;
-
- if (mce_ignore_ce ^ !!new) {
- if (new) {
- /* disable ce features */
- on_each_cpu(mce_disable_ce, (void *)1, 1);
- mce_ignore_ce = 1;
- } else {
- /* enable ce features */
- mce_ignore_ce = 0;
- on_each_cpu(mce_enable_ce, (void *)1, 1);
- }
- }
- return size;
-}
-
-static ssize_t set_cmci_disabled(struct sys_device *s,
- struct sysdev_attribute *attr,
- const char *buf, size_t size)
-{
- u64 new;
-
- if (strict_strtoull(buf, 0, &new) < 0)
- return -EINVAL;
-
- if (mce_cmci_disabled ^ !!new) {
- if (new) {
- /* disable cmci */
- on_each_cpu(mce_disable_ce, NULL, 1);
- mce_cmci_disabled = 1;
- } else {
- /* enable cmci */
- mce_cmci_disabled = 0;
- on_each_cpu(mce_enable_ce, NULL, 1);
- }
- }
- return size;
-}
-
-static ssize_t store_int_with_restart(struct sys_device *s,
- struct sysdev_attribute *attr,
- const char *buf, size_t size)
-{
- ssize_t ret = sysdev_store_int(s, attr, buf, size);
- mce_restart();
- return ret;
-}
-
-static SYSDEV_ATTR(trigger, 0644, show_trigger, set_trigger);
-static SYSDEV_INT_ATTR(tolerant, 0644, tolerant);
-static SYSDEV_INT_ATTR(monarch_timeout, 0644, monarch_timeout);
-static SYSDEV_INT_ATTR(dont_log_ce, 0644, mce_dont_log_ce);
-
-static struct sysdev_ext_attribute attr_check_interval = {
- _SYSDEV_ATTR(check_interval, 0644, sysdev_show_int,
- store_int_with_restart),
- &check_interval
-};
-
-static struct sysdev_ext_attribute attr_ignore_ce = {
- _SYSDEV_ATTR(ignore_ce, 0644, sysdev_show_int, set_ignore_ce),
- &mce_ignore_ce
-};
-
-static struct sysdev_ext_attribute attr_cmci_disabled = {
- _SYSDEV_ATTR(cmci_disabled, 0644, sysdev_show_int, set_cmci_disabled),
- &mce_cmci_disabled
-};
-
-static struct sysdev_attribute *mce_attrs[] = {
- &attr_tolerant.attr,
- &attr_check_interval.attr,
- &attr_trigger,
- &attr_monarch_timeout.attr,
- &attr_dont_log_ce.attr,
- &attr_ignore_ce.attr,
- &attr_cmci_disabled.attr,
- NULL
-};
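The three ext-attribute definitions above follow one idiom: pair the generic integer show helper (and either the generic store or a custom one) with a pointer to the backing variable. A minimal sketch of the same idiom for a hypothetical knob, names invented:

static int my_knob;	/* hypothetical backing variable */

static struct sysdev_ext_attribute attr_my_knob = {
	_SYSDEV_ATTR(my_knob, 0644, sysdev_show_int, sysdev_store_int),
	&my_knob
};

The &attr_my_knob.attr pointer would then be listed in mce_attrs[] like the entries above.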
-
-static cpumask_var_t mce_dev_initialized;
-
-/* Per cpu sysdev init. All of the cpus still share the same ctrl bank: */
-static __cpuinit int mce_create_device(unsigned int cpu)
-{
- int err;
- int i, j;
-
- if (!mce_available(&boot_cpu_data))
- return -EIO;
-
- memset(&per_cpu(mce_dev, cpu).kobj, 0, sizeof(struct kobject));
- per_cpu(mce_dev, cpu).id = cpu;
- per_cpu(mce_dev, cpu).cls = &mce_sysclass;
-
- err = sysdev_register(&per_cpu(mce_dev, cpu));
- if (err)
- return err;
-
- for (i = 0; mce_attrs[i]; i++) {
- err = sysdev_create_file(&per_cpu(mce_dev, cpu), mce_attrs[i]);
- if (err)
- goto error;
- }
- for (j = 0; j < banks; j++) {
- err = sysdev_create_file(&per_cpu(mce_dev, cpu),
- &mce_banks[j].attr);
- if (err)
- goto error2;
- }
- cpumask_set_cpu(cpu, mce_dev_initialized);
-
- return 0;
-error2:
- while (--j >= 0)
- sysdev_remove_file(&per_cpu(mce_dev, cpu), &mce_banks[j].attr);
-error:
- while (--i >= 0)
- sysdev_remove_file(&per_cpu(mce_dev, cpu), mce_attrs[i]);
-
- sysdev_unregister(&per_cpu(mce_dev, cpu));
-
- return err;
-}
-
-static __cpuinit void mce_remove_device(unsigned int cpu)
-{
- int i;
-
- if (!cpumask_test_cpu(cpu, mce_dev_initialized))
- return;
-
- for (i = 0; mce_attrs[i]; i++)
- sysdev_remove_file(&per_cpu(mce_dev, cpu), mce_attrs[i]);
-
- for (i = 0; i < banks; i++)
- sysdev_remove_file(&per_cpu(mce_dev, cpu), &mce_banks[i].attr);
-
- sysdev_unregister(&per_cpu(mce_dev, cpu));
- cpumask_clear_cpu(cpu, mce_dev_initialized);
-}
-
-/* Make sure there are no machine checks on offlined CPUs. */
-static void __cpuinit mce_disable_cpu(void *h)
-{
- unsigned long action = *(unsigned long *)h;
- int i;
-
- if (!mce_available(&current_cpu_data))
- return;
-
- if (!(action & CPU_TASKS_FROZEN))
- cmci_clear();
- for (i = 0; i < banks; i++) {
- struct mce_bank *b = &mce_banks[i];
-
- if (b->init)
- wrmsrl(MSR_IA32_MCx_CTL(i), 0);
- }
-}
-
-static void __cpuinit mce_reenable_cpu(void *h)
-{
- unsigned long action = *(unsigned long *)h;
- int i;
-
- if (!mce_available(&current_cpu_data))
- return;
-
- if (!(action & CPU_TASKS_FROZEN))
- cmci_reenable();
- for (i = 0; i < banks; i++) {
- struct mce_bank *b = &mce_banks[i];
-
- if (b->init)
- wrmsrl(MSR_IA32_MCx_CTL(i), b->ctl);
- }
-}
-
-/* Get notified when a cpu comes on/off. Be hotplug friendly. */
-static int __cpuinit
-mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
-{
- unsigned int cpu = (unsigned long)hcpu;
- struct timer_list *t = &per_cpu(mce_timer, cpu);
-
- switch (action) {
- case CPU_ONLINE:
- case CPU_ONLINE_FROZEN:
- mce_create_device(cpu);
- if (threshold_cpu_callback)
- threshold_cpu_callback(action, cpu);
- break;
- case CPU_DEAD:
- case CPU_DEAD_FROZEN:
- if (threshold_cpu_callback)
- threshold_cpu_callback(action, cpu);
- mce_remove_device(cpu);
- break;
- case CPU_DOWN_PREPARE:
- case CPU_DOWN_PREPARE_FROZEN:
- del_timer_sync(t);
- smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
- break;
- case CPU_DOWN_FAILED:
- case CPU_DOWN_FAILED_FROZEN:
- if (!mce_ignore_ce && check_interval) {
- t->expires = round_jiffies(jiffies +
- __get_cpu_var(mce_next_interval));
- add_timer_on(t, cpu);
- }
- smp_call_function_single(cpu, mce_reenable_cpu, &action, 1);
- break;
- case CPU_POST_DEAD:
- /* intentionally ignoring frozen here */
- cmci_rediscover(cpu);
- break;
- }
- return NOTIFY_OK;
-}
-
-static struct notifier_block mce_cpu_notifier __cpuinitdata = {
- .notifier_call = mce_cpu_callback,
-};
-
-static __init void mce_init_banks(void)
-{
- int i;
-
- for (i = 0; i < banks; i++) {
- struct mce_bank *b = &mce_banks[i];
- struct sysdev_attribute *a = &b->attr;
-
- sysfs_attr_init(&a->attr);
- a->attr.name = b->attrname;
- snprintf(b->attrname, ATTR_LEN, "bank%d", i);
-
- a->attr.mode = 0644;
- a->show = show_bank;
- a->store = set_bank;
- }
-}
-
-static __init int mcheck_init_device(void)
-{
- int err;
- int i = 0;
-
- if (!mce_available(&boot_cpu_data))
- return -EIO;
-
- zalloc_cpumask_var(&mce_dev_initialized, GFP_KERNEL);
-
- mce_init_banks();
-
- err = sysdev_class_register(&mce_sysclass);
- if (err)
- return err;
-
- for_each_online_cpu(i) {
- err = mce_create_device(i);
- if (err)
- return err;
- }
-
- register_hotcpu_notifier(&mce_cpu_notifier);
- misc_register(&mce_log_device);
-
- return err;
-}
-
-device_initcall(mcheck_init_device);
-
-/*
- * Old-style boot option parsing. Kept only for compatibility.
- */
-static int __init mcheck_disable(char *str)
-{
- mce_disabled = 1;
- return 1;
-}
-__setup("nomce", mcheck_disable);
-
-#ifdef CONFIG_DEBUG_FS
-struct dentry *mce_get_debugfs_dir(void)
-{
- static struct dentry *dmce;
-
- if (!dmce)
- dmce = debugfs_create_dir("mce", NULL);
-
- return dmce;
-}
-
-static void mce_reset(void)
-{
- cpu_missing = 0;
- atomic_set(&mce_fake_paniced, 0);
- atomic_set(&mce_executing, 0);
- atomic_set(&mce_callin, 0);
- atomic_set(&global_nwo, 0);
-}
-
-static int fake_panic_get(void *data, u64 *val)
-{
- *val = fake_panic;
- return 0;
-}
-
-static int fake_panic_set(void *data, u64 val)
-{
- mce_reset();
- fake_panic = val;
- return 0;
-}
-
-DEFINE_SIMPLE_ATTRIBUTE(fake_panic_fops, fake_panic_get,
- fake_panic_set, "%llu\n");
-
-static int __init mcheck_debugfs_init(void)
-{
- struct dentry *dmce, *ffake_panic;
-
- dmce = mce_get_debugfs_dir();
- if (!dmce)
- return -ENOMEM;
- ffake_panic = debugfs_create_file("fake_panic", 0444, dmce, NULL,
- &fake_panic_fops);
- if (!ffake_panic)
- return -ENOMEM;
-
- return 0;
-}
-late_initcall(mcheck_debugfs_init);
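A small user-space sketch reading the knob above, assuming the conventional debugfs mount point:

#include <stdio.h>

int main(void)
{
	/* DEFINE_SIMPLE_ATTRIBUTE above formats the value as "%llu\n". */
	FILE *f = fopen("/sys/kernel/debug/mce/fake_panic", "r");
	unsigned long long val;

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fscanf(f, "%llu", &val) == 1)
		printf("fake_panic = %llu\n", val);
	fclose(f);
	return 0;
}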
-#endif
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
deleted file mode 100644
index 224392d8fe8c..000000000000
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ /dev/null
@@ -1,707 +0,0 @@
-/*
- * (c) 2005, 2006 Advanced Micro Devices, Inc.
- * Your use of this code is subject to the terms and conditions of the
- * GNU general public license version 2. See "COPYING" or
- * http://www.gnu.org/licenses/gpl.html
- *
- * Written by Jacob Shin - AMD, Inc.
- *
- * Support : jacob.shin@amd.com
- *
- * April 2006
- * - added support for AMD Family 0x10 processors
- *
- * All MC4_MISCi registers are shared between the cores of a multi-core CPU.
- */
-#include <linux/interrupt.h>
-#include <linux/notifier.h>
-#include <linux/kobject.h>
-#include <linux/percpu.h>
-#include <linux/sysdev.h>
-#include <linux/errno.h>
-#include <linux/sched.h>
-#include <linux/sysfs.h>
-#include <linux/slab.h>
-#include <linux/init.h>
-#include <linux/cpu.h>
-#include <linux/smp.h>
-
-#include <asm/apic.h>
-#include <asm/idle.h>
-#include <asm/mce.h>
-#include <asm/msr.h>
-
-#define PFX "mce_threshold: "
-#define VERSION "version 1.1.1"
-#define NR_BANKS 6
-#define NR_BLOCKS 9
-#define THRESHOLD_MAX 0xFFF
-#define INT_TYPE_APIC 0x00020000
-#define MASK_VALID_HI 0x80000000
-#define MASK_CNTP_HI 0x40000000
-#define MASK_LOCKED_HI 0x20000000
-#define MASK_LVTOFF_HI 0x00F00000
-#define MASK_COUNT_EN_HI 0x00080000
-#define MASK_INT_TYPE_HI 0x00060000
-#define MASK_OVERFLOW_HI 0x00010000
-#define MASK_ERR_COUNT_HI 0x00000FFF
-#define MASK_BLKPTR_LO 0xFF000000
-#define MCG_XBLK_ADDR 0xC0000400
-
-struct threshold_block {
- unsigned int block;
- unsigned int bank;
- unsigned int cpu;
- u32 address;
- u16 interrupt_enable;
- u16 threshold_limit;
- struct kobject kobj;
- struct list_head miscj;
-};
-
-/* defaults used early on boot */
-static struct threshold_block threshold_defaults = {
- .interrupt_enable = 0,
- .threshold_limit = THRESHOLD_MAX,
-};
-
-struct threshold_bank {
- struct kobject *kobj;
- struct threshold_block *blocks;
- cpumask_var_t cpus;
-};
-static DEFINE_PER_CPU(struct threshold_bank * [NR_BANKS], threshold_banks);
-
-#ifdef CONFIG_SMP
-static unsigned char shared_bank[NR_BANKS] = {
- 0, 0, 0, 0, 1
-};
-#endif
-
-static DEFINE_PER_CPU(unsigned char, bank_map); /* see which banks are on */
-
-static void amd_threshold_interrupt(void);
-
-/*
- * CPU Initialization
- */
-
-struct thresh_restart {
- struct threshold_block *b;
- int reset;
- u16 old_limit;
-};
-
-/*
- * Must run on the CPU whose bank is being updated; called via
- * smp_call_function_single().
- */
-static void threshold_restart_bank(void *_tr)
-{
- struct thresh_restart *tr = _tr;
- u32 mci_misc_hi, mci_misc_lo;
-
- rdmsr(tr->b->address, mci_misc_lo, mci_misc_hi);
-
- if (tr->b->threshold_limit < (mci_misc_hi & THRESHOLD_MAX))
- tr->reset = 1; /* limit cannot be lower than err count */
-
- if (tr->reset) { /* reset err count and overflow bit */
- mci_misc_hi =
- (mci_misc_hi & ~(MASK_ERR_COUNT_HI | MASK_OVERFLOW_HI)) |
- (THRESHOLD_MAX - tr->b->threshold_limit);
- } else if (tr->old_limit) { /* change limit w/o reset */
- int new_count = (mci_misc_hi & THRESHOLD_MAX) +
- (tr->old_limit - tr->b->threshold_limit);
-
- mci_misc_hi = (mci_misc_hi & ~MASK_ERR_COUNT_HI) |
- (new_count & THRESHOLD_MAX);
- }
-
-	if (tr->b->interrupt_enable)
-		mci_misc_hi = (mci_misc_hi & ~MASK_INT_TYPE_HI) | INT_TYPE_APIC;
-	else
-		mci_misc_hi &= ~MASK_INT_TYPE_HI;
-
- mci_misc_hi |= MASK_COUNT_EN_HI;
- wrmsr(tr->b->address, mci_misc_lo, mci_misc_hi);
-}
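To make the bit manipulation above easier to follow, here is a small decoding sketch using the same MASK_* constants; it is an illustration, not part of the original file. Note that the hardware counter holds THRESHOLD_MAX - threshold_limit plus the number of errors seen, so it overflows exactly after threshold_limit errors:

static void __maybe_unused decode_misc_hi(u32 hi)
{
	unsigned int count = hi & MASK_ERR_COUNT_HI;	/* bits 11:0 */

	pr_debug("valid=%d cntp=%d locked=%d ovfl=%d en=%d count=0x%x\n",
		 !!(hi & MASK_VALID_HI), !!(hi & MASK_CNTP_HI),
		 !!(hi & MASK_LOCKED_HI), !!(hi & MASK_OVERFLOW_HI),
		 !!(hi & MASK_COUNT_EN_HI), count);
}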
-
-/* cpu init entry point, called from mce.c with preempt off */
-void mce_amd_feature_init(struct cpuinfo_x86 *c)
-{
- unsigned int cpu = smp_processor_id();
- u32 low = 0, high = 0, address = 0;
- unsigned int bank, block;
- struct thresh_restart tr;
- u8 lvt_off;
-
- for (bank = 0; bank < NR_BANKS; ++bank) {
- for (block = 0; block < NR_BLOCKS; ++block) {
- if (block == 0)
- address = MSR_IA32_MC0_MISC + bank * 4;
- else if (block == 1) {
- address = (low & MASK_BLKPTR_LO) >> 21;
- if (!address)
- break;
- address += MCG_XBLK_ADDR;
- } else
- ++address;
-
- if (rdmsr_safe(address, &low, &high))
- break;
-
- if (!(high & MASK_VALID_HI)) {
- if (block)
- continue;
- else
- break;
- }
-
- if (!(high & MASK_CNTP_HI) ||
- (high & MASK_LOCKED_HI))
- continue;
-
- if (!block)
- per_cpu(bank_map, cpu) |= (1 << bank);
-#ifdef CONFIG_SMP
- if (shared_bank[bank] && c->cpu_core_id)
- break;
-#endif
- lvt_off = setup_APIC_eilvt_mce(THRESHOLD_APIC_VECTOR,
- APIC_EILVT_MSG_FIX, 0);
-
- high &= ~MASK_LVTOFF_HI;
- high |= lvt_off << 20;
- wrmsr(address, low, high);
-
- threshold_defaults.address = address;
- tr.b = &threshold_defaults;
- tr.reset = 0;
- tr.old_limit = 0;
- threshold_restart_bank(&tr);
-
- mce_threshold_vector = amd_threshold_interrupt;
- }
- }
-}
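The block-address walk above reappears verbatim in the interrupt handler below; the rule it implements can be isolated in one helper. A sketch, not part of the original file:

/*
 * MSR address of threshold block @block in @bank. @low is the low half
 * of the previous block's MSR; @prev is the previous block's address.
 * A return of 0 means there is no such block.
 */
static u32 __maybe_unused next_block_address(unsigned int bank,
					     unsigned int block,
					     u32 prev, u32 low)
{
	u32 blkptr;

	if (block == 0)
		return MSR_IA32_MC0_MISC + bank * 4;

	if (block == 1) {
		blkptr = (low & MASK_BLKPTR_LO) >> 21;
		return blkptr ? blkptr + MCG_XBLK_ADDR : 0;
	}

	return prev + 1;
}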
-
-/*
- * APIC Interrupt Handler
- */
-
-/*
- * The threshold interrupt handler services THRESHOLD_APIC_VECTOR.
- * The interrupt fires when error_count reaches threshold_limit.
- * The handler simply logs an mcelog entry with a software-defined bank
- * number.
- */
-static void amd_threshold_interrupt(void)
-{
- u32 low = 0, high = 0, address = 0;
- unsigned int bank, block;
- struct mce m;
-
- mce_setup(&m);
-
- /* assume first bank caused it */
- for (bank = 0; bank < NR_BANKS; ++bank) {
- if (!(per_cpu(bank_map, m.cpu) & (1 << bank)))
- continue;
- for (block = 0; block < NR_BLOCKS; ++block) {
- if (block == 0) {
- address = MSR_IA32_MC0_MISC + bank * 4;
- } else if (block == 1) {
- address = (low & MASK_BLKPTR_LO) >> 21;
- if (!address)
- break;
- address += MCG_XBLK_ADDR;
- } else {
- ++address;
- }
-
- if (rdmsr_safe(address, &low, &high))
- break;
-
- if (!(high & MASK_VALID_HI)) {
- if (block)
- continue;
- else
- break;
- }
-
- if (!(high & MASK_CNTP_HI) ||
- (high & MASK_LOCKED_HI))
- continue;
-
- /*
- * Log the machine check that caused the threshold
- * event.
- */
- machine_check_poll(MCP_TIMESTAMP,
- &__get_cpu_var(mce_poll_banks));
-
- if (high & MASK_OVERFLOW_HI) {
- rdmsrl(address, m.misc);
- rdmsrl(MSR_IA32_MC0_STATUS + bank * 4,
- m.status);
- m.bank = K8_MCE_THRESHOLD_BASE
- + bank * NR_BLOCKS
- + block;
- mce_log(&m);
- return;
- }
- }
- }
-}
-
-/*
- * Sysfs Interface
- */
-
-struct threshold_attr {
- struct attribute attr;
- ssize_t (*show) (struct threshold_block *, char *);
- ssize_t (*store) (struct threshold_block *, const char *, size_t count);
-};
-
-#define SHOW_FIELDS(name) \
-static ssize_t show_ ## name(struct threshold_block *b, char *buf) \
-{ \
- return sprintf(buf, "%lx\n", (unsigned long) b->name); \
-}
-SHOW_FIELDS(interrupt_enable)
-SHOW_FIELDS(threshold_limit)
-
-static ssize_t
-store_interrupt_enable(struct threshold_block *b, const char *buf, size_t size)
-{
- struct thresh_restart tr;
- unsigned long new;
-
- if (strict_strtoul(buf, 0, &new) < 0)
- return -EINVAL;
-
- b->interrupt_enable = !!new;
-
- tr.b = b;
- tr.reset = 0;
- tr.old_limit = 0;
-
- smp_call_function_single(b->cpu, threshold_restart_bank, &tr, 1);
-
- return size;
-}
-
-static ssize_t
-store_threshold_limit(struct threshold_block *b, const char *buf, size_t size)
-{
- struct thresh_restart tr;
- unsigned long new;
-
- if (strict_strtoul(buf, 0, &new) < 0)
- return -EINVAL;
-
- if (new > THRESHOLD_MAX)
- new = THRESHOLD_MAX;
- if (new < 1)
- new = 1;
-
- tr.old_limit = b->threshold_limit;
- b->threshold_limit = new;
- tr.b = b;
- tr.reset = 0;
-
- smp_call_function_single(b->cpu, threshold_restart_bank, &tr, 1);
-
- return size;
-}
-
-struct threshold_block_cross_cpu {
- struct threshold_block *tb;
- long retval;
-};
-
-static void local_error_count_handler(void *_tbcc)
-{
- struct threshold_block_cross_cpu *tbcc = _tbcc;
- struct threshold_block *b = tbcc->tb;
- u32 low, high;
-
- rdmsr(b->address, low, high);
- tbcc->retval = (high & 0xFFF) - (THRESHOLD_MAX - b->threshold_limit);
-}
-
-static ssize_t show_error_count(struct threshold_block *b, char *buf)
-{
- struct threshold_block_cross_cpu tbcc = { .tb = b, };
-
- smp_call_function_single(b->cpu, local_error_count_handler, &tbcc, 1);
- return sprintf(buf, "%lx\n", tbcc.retval);
-}
-
-static ssize_t store_error_count(struct threshold_block *b,
- const char *buf, size_t count)
-{
- struct thresh_restart tr = { .b = b, .reset = 1, .old_limit = 0 };
-
- smp_call_function_single(b->cpu, threshold_restart_bank, &tr, 1);
- return 1;
-}
-
-#define RW_ATTR(val) \
-static struct threshold_attr val = { \
- .attr = {.name = __stringify(val), .mode = 0644 }, \
- .show = show_## val, \
- .store = store_## val, \
-};
-
-RW_ATTR(interrupt_enable);
-RW_ATTR(threshold_limit);
-RW_ATTR(error_count);
-
-static struct attribute *default_attrs[] = {
- &interrupt_enable.attr,
- &threshold_limit.attr,
- &error_count.attr,
- NULL
-};
-
-#define to_block(k) container_of(k, struct threshold_block, kobj)
-#define to_attr(a) container_of(a, struct threshold_attr, attr)
-
-static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
-{
- struct threshold_block *b = to_block(kobj);
- struct threshold_attr *a = to_attr(attr);
- ssize_t ret;
-
- ret = a->show ? a->show(b, buf) : -EIO;
-
- return ret;
-}
-
-static ssize_t store(struct kobject *kobj, struct attribute *attr,
- const char *buf, size_t count)
-{
- struct threshold_block *b = to_block(kobj);
- struct threshold_attr *a = to_attr(attr);
- ssize_t ret;
-
- ret = a->store ? a->store(b, buf, count) : -EIO;
-
- return ret;
-}
-
-static const struct sysfs_ops threshold_ops = {
- .show = show,
- .store = store,
-};
-
-static struct kobj_type threshold_ktype = {
- .sysfs_ops = &threshold_ops,
- .default_attrs = default_attrs,
-};
-
-static __cpuinit int allocate_threshold_blocks(unsigned int cpu,
- unsigned int bank,
- unsigned int block,
- u32 address)
-{
- struct threshold_block *b = NULL;
- u32 low, high;
- int err;
-
- if ((bank >= NR_BANKS) || (block >= NR_BLOCKS))
- return 0;
-
- if (rdmsr_safe_on_cpu(cpu, address, &low, &high))
- return 0;
-
- if (!(high & MASK_VALID_HI)) {
- if (block)
- goto recurse;
- else
- return 0;
- }
-
- if (!(high & MASK_CNTP_HI) ||
- (high & MASK_LOCKED_HI))
- goto recurse;
-
- b = kzalloc(sizeof(struct threshold_block), GFP_KERNEL);
- if (!b)
- return -ENOMEM;
-
- b->block = block;
- b->bank = bank;
- b->cpu = cpu;
- b->address = address;
- b->interrupt_enable = 0;
- b->threshold_limit = THRESHOLD_MAX;
-
- INIT_LIST_HEAD(&b->miscj);
-
- if (per_cpu(threshold_banks, cpu)[bank]->blocks) {
- list_add(&b->miscj,
- &per_cpu(threshold_banks, cpu)[bank]->blocks->miscj);
- } else {
- per_cpu(threshold_banks, cpu)[bank]->blocks = b;
- }
-
- err = kobject_init_and_add(&b->kobj, &threshold_ktype,
- per_cpu(threshold_banks, cpu)[bank]->kobj,
- "misc%i", block);
- if (err)
- goto out_free;
-recurse:
- if (!block) {
- address = (low & MASK_BLKPTR_LO) >> 21;
- if (!address)
- return 0;
- address += MCG_XBLK_ADDR;
- } else {
- ++address;
- }
-
- err = allocate_threshold_blocks(cpu, bank, ++block, address);
- if (err)
- goto out_free;
-
- if (b)
- kobject_uevent(&b->kobj, KOBJ_ADD);
-
- return err;
-
-out_free:
- if (b) {
- kobject_put(&b->kobj);
- kfree(b);
- }
- return err;
-}
-
-static __cpuinit long
-local_allocate_threshold_blocks(int cpu, unsigned int bank)
-{
- return allocate_threshold_blocks(cpu, bank, 0,
- MSR_IA32_MC0_MISC + bank * 4);
-}
-
-/* Symlink sibling shared banks to the first core; the first core owns the dir/files. */
-static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank)
-{
- int i, err = 0;
- struct threshold_bank *b = NULL;
- char name[32];
-#ifdef CONFIG_SMP
- struct cpuinfo_x86 *c = &cpu_data(cpu);
-#endif
-
- sprintf(name, "threshold_bank%i", bank);
-
-#ifdef CONFIG_SMP
- if (cpu_data(cpu).cpu_core_id && shared_bank[bank]) { /* symlink */
- i = cpumask_first(c->llc_shared_map);
-
- /* first core not up yet */
- if (cpu_data(i).cpu_core_id)
- goto out;
-
- /* already linked */
- if (per_cpu(threshold_banks, cpu)[bank])
- goto out;
-
- b = per_cpu(threshold_banks, i)[bank];
-
- if (!b)
- goto out;
-
- err = sysfs_create_link(&per_cpu(mce_dev, cpu).kobj,
- b->kobj, name);
- if (err)
- goto out;
-
- cpumask_copy(b->cpus, c->llc_shared_map);
- per_cpu(threshold_banks, cpu)[bank] = b;
-
- goto out;
- }
-#endif
-
- b = kzalloc(sizeof(struct threshold_bank), GFP_KERNEL);
- if (!b) {
- err = -ENOMEM;
- goto out;
- }
- if (!alloc_cpumask_var(&b->cpus, GFP_KERNEL)) {
- kfree(b);
- err = -ENOMEM;
- goto out;
- }
-
- b->kobj = kobject_create_and_add(name, &per_cpu(mce_dev, cpu).kobj);
- if (!b->kobj)
- goto out_free;
-
-#ifndef CONFIG_SMP
- cpumask_setall(b->cpus);
-#else
- cpumask_copy(b->cpus, c->llc_shared_map);
-#endif
-
- per_cpu(threshold_banks, cpu)[bank] = b;
-
- err = local_allocate_threshold_blocks(cpu, bank);
- if (err)
- goto out_free;
-
- for_each_cpu(i, b->cpus) {
- if (i == cpu)
- continue;
-
- err = sysfs_create_link(&per_cpu(mce_dev, i).kobj,
- b->kobj, name);
- if (err)
- goto out;
-
- per_cpu(threshold_banks, i)[bank] = b;
- }
-
- goto out;
-
-out_free:
- per_cpu(threshold_banks, cpu)[bank] = NULL;
- free_cpumask_var(b->cpus);
- kfree(b);
-out:
- return err;
-}
-
-/* create dir/files for all valid threshold banks */
-static __cpuinit int threshold_create_device(unsigned int cpu)
-{
- unsigned int bank;
- int err = 0;
-
- for (bank = 0; bank < NR_BANKS; ++bank) {
- if (!(per_cpu(bank_map, cpu) & (1 << bank)))
- continue;
- err = threshold_create_bank(cpu, bank);
- if (err)
- goto out;
- }
-out:
- return err;
-}
-
-/*
- * Let's be hotplug friendly.
- * On multi-core processors, the first core always takes ownership of the
- * shared sysfs dir/files, and the rest of the cores are symlinked to it.
- */
-
-static void deallocate_threshold_block(unsigned int cpu,
- unsigned int bank)
-{
- struct threshold_block *pos = NULL;
- struct threshold_block *tmp = NULL;
- struct threshold_bank *head = per_cpu(threshold_banks, cpu)[bank];
-
- if (!head)
- return;
-
- list_for_each_entry_safe(pos, tmp, &head->blocks->miscj, miscj) {
- kobject_put(&pos->kobj);
- list_del(&pos->miscj);
- kfree(pos);
- }
-
- kfree(per_cpu(threshold_banks, cpu)[bank]->blocks);
- per_cpu(threshold_banks, cpu)[bank]->blocks = NULL;
-}
-
-static void threshold_remove_bank(unsigned int cpu, int bank)
-{
- struct threshold_bank *b;
- char name[32];
- int i = 0;
-
- b = per_cpu(threshold_banks, cpu)[bank];
- if (!b)
- return;
- if (!b->blocks)
- goto free_out;
-
- sprintf(name, "threshold_bank%i", bank);
-
-#ifdef CONFIG_SMP
- /* sibling symlink */
- if (shared_bank[bank] && b->blocks->cpu != cpu) {
- sysfs_remove_link(&per_cpu(mce_dev, cpu).kobj, name);
- per_cpu(threshold_banks, cpu)[bank] = NULL;
-
- return;
- }
-#endif
-
- /* remove all sibling symlinks before unregistering */
- for_each_cpu(i, b->cpus) {
- if (i == cpu)
- continue;
-
- sysfs_remove_link(&per_cpu(mce_dev, i).kobj, name);
- per_cpu(threshold_banks, i)[bank] = NULL;
- }
-
- deallocate_threshold_block(cpu, bank);
-
-free_out:
- kobject_del(b->kobj);
- kobject_put(b->kobj);
- free_cpumask_var(b->cpus);
- kfree(b);
- per_cpu(threshold_banks, cpu)[bank] = NULL;
-}
-
-static void threshold_remove_device(unsigned int cpu)
-{
- unsigned int bank;
-
- for (bank = 0; bank < NR_BANKS; ++bank) {
- if (!(per_cpu(bank_map, cpu) & (1 << bank)))
- continue;
- threshold_remove_bank(cpu, bank);
- }
-}
-
-/* get notified when a cpu comes on/off */
-static void __cpuinit
-amd_64_threshold_cpu_callback(unsigned long action, unsigned int cpu)
-{
- switch (action) {
- case CPU_ONLINE:
- case CPU_ONLINE_FROZEN:
- threshold_create_device(cpu);
- break;
- case CPU_DEAD:
- case CPU_DEAD_FROZEN:
- threshold_remove_device(cpu);
- break;
- default:
- break;
- }
-}
-
-static __init int threshold_init_device(void)
-{
- unsigned lcpu = 0;
-
- /* to hit CPUs online before the notifier is up */
- for_each_online_cpu(lcpu) {
- int err = threshold_create_device(lcpu);
-
- if (err)
- return err;
- }
- threshold_cpu_callback = amd_64_threshold_cpu_callback;
-
- return 0;
-}
-device_initcall(threshold_init_device);
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
deleted file mode 100644
index 62b48e40920a..000000000000
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ /dev/null
@@ -1,228 +0,0 @@
-/*
- * Intel specific MCE features.
- * Copyright 2004 Zwane Mwaikambo <zwane@linuxpower.ca>
- * Copyright (C) 2008, 2009 Intel Corporation
- * Author: Andi Kleen
- */
-
-#include <linux/gfp.h>
-#include <linux/init.h>
-#include <linux/interrupt.h>
-#include <linux/percpu.h>
-#include <linux/sched.h>
-#include <asm/apic.h>
-#include <asm/processor.h>
-#include <asm/msr.h>
-#include <asm/mce.h>
-
-/*
- * Support for Intel Corrected Machine Check Interrupts (CMCI). This allows
- * the CPU to raise an interrupt when a corrected machine check occurs.
- * Normally we pick those up using a regular polling timer.
- * Also supports reliable discovery of shared banks.
- */
-
-static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
-
-/*
- * cmci_discover_lock protects against parallel discovery attempts
- * which could race against each other.
- */
-static DEFINE_SPINLOCK(cmci_discover_lock);
-
-#define CMCI_THRESHOLD 1
-
-static int cmci_supported(int *banks)
-{
- u64 cap;
-
- if (mce_cmci_disabled || mce_ignore_ce)
- return 0;
-
-	/*
-	 * The vendor check is not strictly needed, but the initialization is
-	 * vendor-keyed, and this makes sure none of the backdoors are entered
-	 * otherwise.
-	 */
- if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
- return 0;
- if (!cpu_has_apic || lapic_get_maxlvt() < 6)
- return 0;
- rdmsrl(MSR_IA32_MCG_CAP, cap);
- *banks = min_t(unsigned, MAX_NR_BANKS, cap & 0xff);
- return !!(cap & MCG_CMCI_P);
-}
-
-/*
- * The interrupt handler. This is called on every event.
- * Just call the poller directly to log any events.
- * This could in theory increase the threshold under high load,
- * but doesn't for now.
- */
-static void intel_threshold_interrupt(void)
-{
- machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
- mce_notify_irq();
-}
-
-static void print_update(char *type, int *hdr, int num)
-{
- if (*hdr == 0)
- printk(KERN_INFO "CPU %d MCA banks", smp_processor_id());
- *hdr = 1;
- printk(KERN_CONT " %s:%d", type, num);
-}
-
-/*
- * Enable CMCI (Corrected Machine Check Interrupt) for available MCE banks
- * on this CPU. Use the algorithm recommended in the SDM to discover shared
- * banks.
- */
-static void cmci_discover(int banks, int boot)
-{
- unsigned long *owned = (void *)&__get_cpu_var(mce_banks_owned);
- unsigned long flags;
- int hdr = 0;
- int i;
-
- spin_lock_irqsave(&cmci_discover_lock, flags);
- for (i = 0; i < banks; i++) {
- u64 val;
-
- if (test_bit(i, owned))
- continue;
-
- rdmsrl(MSR_IA32_MCx_CTL2(i), val);
-
- /* Already owned by someone else? */
- if (val & CMCI_EN) {
- if (test_and_clear_bit(i, owned) && !boot)
- print_update("SHD", &hdr, i);
- __clear_bit(i, __get_cpu_var(mce_poll_banks));
- continue;
- }
-
- val |= CMCI_EN | CMCI_THRESHOLD;
- wrmsrl(MSR_IA32_MCx_CTL2(i), val);
- rdmsrl(MSR_IA32_MCx_CTL2(i), val);
-
- /* Did the enable bit stick? -- the bank supports CMCI */
- if (val & CMCI_EN) {
- if (!test_and_set_bit(i, owned) && !boot)
- print_update("CMCI", &hdr, i);
- __clear_bit(i, __get_cpu_var(mce_poll_banks));
- } else {
- WARN_ON(!test_bit(i, __get_cpu_var(mce_poll_banks)));
- }
- }
- spin_unlock_irqrestore(&cmci_discover_lock, flags);
- if (hdr)
- printk(KERN_CONT "\n");
-}
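The discovery loop above rests on a single probe: the CMCI_EN bit sticks only for the CPU that owns a bank, so writing it and reading it back answers the ownership question. The probe in isolation, as a sketch:

/* Try to claim CMCI ownership of bank @i; returns true on success. */
static bool __maybe_unused claim_cmci_bank(int i)
{
	u64 val;

	rdmsrl(MSR_IA32_MCx_CTL2(i), val);
	if (val & CMCI_EN)
		return false;			/* already owned elsewhere */

	wrmsrl(MSR_IA32_MCx_CTL2(i), val | CMCI_EN | CMCI_THRESHOLD);
	rdmsrl(MSR_IA32_MCx_CTL2(i), val);

	return !!(val & CMCI_EN);		/* did the enable bit stick? */
}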
-
-/*
- * Just in case we missed an event during initialization, check
- * all the CMCI-owned banks.
- */
-void cmci_recheck(void)
-{
- unsigned long flags;
- int banks;
-
- if (!mce_available(&current_cpu_data) || !cmci_supported(&banks))
- return;
- local_irq_save(flags);
- machine_check_poll(MCP_TIMESTAMP, &__get_cpu_var(mce_banks_owned));
- local_irq_restore(flags);
-}
-
-/*
- * Disable CMCI on this CPU for all banks it owns when it goes down.
- * This allows other CPUs to claim the banks on rediscovery.
- */
-void cmci_clear(void)
-{
- unsigned long flags;
- int i;
- int banks;
- u64 val;
-
- if (!cmci_supported(&banks))
- return;
- spin_lock_irqsave(&cmci_discover_lock, flags);
- for (i = 0; i < banks; i++) {
- if (!test_bit(i, __get_cpu_var(mce_banks_owned)))
- continue;
- /* Disable CMCI */
- rdmsrl(MSR_IA32_MCx_CTL2(i), val);
- val &= ~(CMCI_EN|CMCI_THRESHOLD_MASK);
- wrmsrl(MSR_IA32_MCx_CTL2(i), val);
- __clear_bit(i, __get_cpu_var(mce_banks_owned));
- }
- spin_unlock_irqrestore(&cmci_discover_lock, flags);
-}
-
-/*
- * After a CPU went down, cycle through all the others and rediscover.
- * Must run in process context.
- */
-void cmci_rediscover(int dying)
-{
- int banks;
- int cpu;
- cpumask_var_t old;
-
- if (!cmci_supported(&banks))
- return;
- if (!alloc_cpumask_var(&old, GFP_KERNEL))
- return;
- cpumask_copy(old, &current->cpus_allowed);
-
- for_each_online_cpu(cpu) {
- if (cpu == dying)
- continue;
- if (set_cpus_allowed_ptr(current, cpumask_of(cpu)))
- continue;
-		/* Recheck banks in case CPUs don't all have the same banks */
- if (cmci_supported(&banks))
- cmci_discover(banks, 0);
- }
-
- set_cpus_allowed_ptr(current, old);
- free_cpumask_var(old);
-}
-
-/*
- * Reenable CMCI on this CPU in case a CPU down failed.
- */
-void cmci_reenable(void)
-{
- int banks;
- if (cmci_supported(&banks))
- cmci_discover(banks, 0);
-}
-
-static void intel_init_cmci(void)
-{
- int banks;
-
- if (!cmci_supported(&banks))
- return;
-
- mce_threshold_vector = intel_threshold_interrupt;
- cmci_discover(banks, 1);
-	/*
-	 * For CPU #0 this runs with the APIC still disabled, but that's
-	 * OK because only the vector is set up. We still do another
-	 * check for the banks later for CPU #0 just to make sure we
-	 * don't miss any events.
-	 */
- apic_write(APIC_LVTCMCI, THRESHOLD_APIC_VECTOR|APIC_DM_FIXED);
- cmci_recheck();
-}
-
-void mce_intel_feature_init(struct cpuinfo_x86 *c)
-{
- intel_init_thermal(c);
- intel_init_cmci();
-}
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
deleted file mode 100644
index e1a0a3bf9716..000000000000
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ /dev/null
@@ -1,355 +0,0 @@
-/*
- * Thermal throttle event support code (such as syslog messaging and rate
- * limiting) that was factored out from x86_64 (mce_intel.c) and i386 (p4.c).
- *
- * This allows consistent reporting of CPU thermal throttle events.
- *
- * Maintains a counter in /sys that keeps track of the number of thermal
- * events, such that the user knows how bad the thermal problem might be
- * (since the logging to syslog and mcelog is rate limited).
- *
- * Author: Dmitriy Zavin (dmitriyz@google.com)
- *
- * Credits: Adapted from Zwane Mwaikambo's original code in mce_intel.c.
- * Inspired by Ross Biro's and Al Borchers' counter code.
- */
-#include <linux/interrupt.h>
-#include <linux/notifier.h>
-#include <linux/jiffies.h>
-#include <linux/kernel.h>
-#include <linux/percpu.h>
-#include <linux/sysdev.h>
-#include <linux/types.h>
-#include <linux/init.h>
-#include <linux/smp.h>
-#include <linux/cpu.h>
-
-#include <asm/processor.h>
-#include <asm/system.h>
-#include <asm/apic.h>
-#include <asm/idle.h>
-#include <asm/mce.h>
-#include <asm/msr.h>
-
-/* How long to wait between reporting thermal events */
-#define CHECK_INTERVAL (300 * HZ)
-
-/*
- * Current thermal throttling state:
- */
-struct thermal_state {
- bool is_throttled;
-
- u64 next_check;
- unsigned long throttle_count;
- unsigned long last_throttle_count;
-};
-
-static DEFINE_PER_CPU(struct thermal_state, thermal_state);
-
-static atomic_t therm_throt_en = ATOMIC_INIT(0);
-
-static u32 lvtthmr_init __read_mostly;
-
-#ifdef CONFIG_SYSFS
-#define define_therm_throt_sysdev_one_ro(_name) \
- static SYSDEV_ATTR(_name, 0444, therm_throt_sysdev_show_##_name, NULL)
-
-#define define_therm_throt_sysdev_show_func(name) \
- \
-static ssize_t therm_throt_sysdev_show_##name( \
- struct sys_device *dev, \
- struct sysdev_attribute *attr, \
- char *buf) \
-{ \
- unsigned int cpu = dev->id; \
- ssize_t ret; \
- \
- preempt_disable(); /* CPU hotplug */ \
- if (cpu_online(cpu)) \
- ret = sprintf(buf, "%lu\n", \
- per_cpu(thermal_state, cpu).name); \
- else \
- ret = 0; \
- preempt_enable(); \
- \
- return ret; \
-}
-
-define_therm_throt_sysdev_show_func(throttle_count);
-define_therm_throt_sysdev_one_ro(throttle_count);
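For readability, this is roughly what the two macro invocations above expand to for throttle_count (a mechanical expansion, shown only for illustration):

static ssize_t therm_throt_sysdev_show_throttle_count(struct sys_device *dev,
						      struct sysdev_attribute *attr,
						      char *buf)
{
	unsigned int cpu = dev->id;
	ssize_t ret;

	preempt_disable();	/* CPU hotplug */
	if (cpu_online(cpu))
		ret = sprintf(buf, "%lu\n",
			      per_cpu(thermal_state, cpu).throttle_count);
	else
		ret = 0;
	preempt_enable();

	return ret;
}
static SYSDEV_ATTR(throttle_count, 0444,
		   therm_throt_sysdev_show_throttle_count, NULL);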
-
-static struct attribute *thermal_throttle_attrs[] = {
- &attr_throttle_count.attr,
- NULL
-};
-
-static struct attribute_group thermal_throttle_attr_group = {
- .attrs = thermal_throttle_attrs,
- .name = "thermal_throttle"
-};
-#endif /* CONFIG_SYSFS */
-
-/**
- * therm_throt_process - Process thermal throttling event from interrupt
- * @is_throttled: Whether the throttling condition is currently present
- *	(boolean); the thermal interrupt normally fires both when a thermal
- *	event begins and once it has ended.
- *
- * This function is called by the thermal interrupt after the
- * IRQ has been acknowledged.
- *
- * It will take care of rate limiting and printing messages to the syslog.
- *
- * Returns: 0 : Event should NOT be logged further, i.e. still in
- *              "timeout" from the previous log message.
- *          1 : Event should be logged further, and a message has been
- *              printed to the syslog.
- */
-static int therm_throt_process(bool is_throttled)
-{
- struct thermal_state *state;
- unsigned int this_cpu;
- bool was_throttled;
- u64 now;
-
- this_cpu = smp_processor_id();
- now = get_jiffies_64();
- state = &per_cpu(thermal_state, this_cpu);
-
- was_throttled = state->is_throttled;
- state->is_throttled = is_throttled;
-
- if (is_throttled)
- state->throttle_count++;
-
- if (time_before64(now, state->next_check) &&
- state->throttle_count != state->last_throttle_count)
- return 0;
-
- state->next_check = now + CHECK_INTERVAL;
- state->last_throttle_count = state->throttle_count;
-
- /* if we just entered the thermal event */
- if (is_throttled) {
- printk(KERN_CRIT "CPU%d: Temperature above threshold, cpu clock throttled (total events = %lu)\n", this_cpu, state->throttle_count);
-
- add_taint(TAINT_MACHINE_CHECK);
- return 1;
- }
- if (was_throttled) {
- printk(KERN_INFO "CPU%d: Temperature/speed normal\n", this_cpu);
- return 1;
- }
-
- return 0;
-}
-
-#ifdef CONFIG_SYSFS
-/* Add/Remove thermal_throttle interface for CPU device: */
-static __cpuinit int thermal_throttle_add_dev(struct sys_device *sys_dev)
-{
- return sysfs_create_group(&sys_dev->kobj,
- &thermal_throttle_attr_group);
-}
-
-static __cpuinit void thermal_throttle_remove_dev(struct sys_device *sys_dev)
-{
- sysfs_remove_group(&sys_dev->kobj, &thermal_throttle_attr_group);
-}
-
-/* Mutex protecting device creation against CPU hotplug: */
-static DEFINE_MUTEX(therm_cpu_lock);
-
-/* Get notified when a cpu comes on/off. Be hotplug friendly. */
-static __cpuinit int
-thermal_throttle_cpu_callback(struct notifier_block *nfb,
- unsigned long action,
- void *hcpu)
-{
- unsigned int cpu = (unsigned long)hcpu;
- struct sys_device *sys_dev;
- int err = 0;
-
- sys_dev = get_cpu_sysdev(cpu);
-
- switch (action) {
- case CPU_UP_PREPARE:
- case CPU_UP_PREPARE_FROZEN:
- mutex_lock(&therm_cpu_lock);
- err = thermal_throttle_add_dev(sys_dev);
- mutex_unlock(&therm_cpu_lock);
- WARN_ON(err);
- break;
- case CPU_UP_CANCELED:
- case CPU_UP_CANCELED_FROZEN:
- case CPU_DEAD:
- case CPU_DEAD_FROZEN:
- mutex_lock(&therm_cpu_lock);
- thermal_throttle_remove_dev(sys_dev);
- mutex_unlock(&therm_cpu_lock);
- break;
- }
- return notifier_from_errno(err);
-}
-
-static struct notifier_block thermal_throttle_cpu_notifier __cpuinitdata =
-{
- .notifier_call = thermal_throttle_cpu_callback,
-};
-
-static __init int thermal_throttle_init_device(void)
-{
- unsigned int cpu = 0;
- int err;
-
- if (!atomic_read(&therm_throt_en))
- return 0;
-
- register_hotcpu_notifier(&thermal_throttle_cpu_notifier);
-
-#ifdef CONFIG_HOTPLUG_CPU
- mutex_lock(&therm_cpu_lock);
-#endif
- /* connect live CPUs to sysfs */
- for_each_online_cpu(cpu) {
- err = thermal_throttle_add_dev(get_cpu_sysdev(cpu));
- WARN_ON(err);
- }
-#ifdef CONFIG_HOTPLUG_CPU
- mutex_unlock(&therm_cpu_lock);
-#endif
-
- return 0;
-}
-device_initcall(thermal_throttle_init_device);
-
-#endif /* CONFIG_SYSFS */
-
-/* Thermal transition interrupt handler */
-static void intel_thermal_interrupt(void)
-{
- __u64 msr_val;
-
- rdmsrl(MSR_IA32_THERM_STATUS, msr_val);
- if (therm_throt_process((msr_val & THERM_STATUS_PROCHOT) != 0))
- mce_log_therm_throt_event(msr_val);
-}
-
-static void unexpected_thermal_interrupt(void)
-{
- printk(KERN_ERR "CPU%d: Unexpected LVT TMR interrupt!\n",
- smp_processor_id());
- add_taint(TAINT_MACHINE_CHECK);
-}
-
-static void (*smp_thermal_vector)(void) = unexpected_thermal_interrupt;
-
-asmlinkage void smp_thermal_interrupt(struct pt_regs *regs)
-{
- exit_idle();
- irq_enter();
- inc_irq_stat(irq_thermal_count);
- smp_thermal_vector();
- irq_exit();
- /* Ack only at the end to avoid potential reentry */
- ack_APIC_irq();
-}
-
-/* Thermal monitoring depends on APIC, ACPI and clock modulation */
-static int intel_thermal_supported(struct cpuinfo_x86 *c)
-{
- if (!cpu_has_apic)
- return 0;
- if (!cpu_has(c, X86_FEATURE_ACPI) || !cpu_has(c, X86_FEATURE_ACC))
- return 0;
- return 1;
-}
-
-void __init mcheck_intel_therm_init(void)
-{
-	/*
-	 * This function is only called on the boot CPU. Save the initial
-	 * thermal LVT value on the BSP and use it later to restore the
-	 * thermal LVT entry that the BIOS programmed on the APs.
-	 */
- if (intel_thermal_supported(&boot_cpu_data))
- lvtthmr_init = apic_read(APIC_LVTTHMR);
-}
-
-void intel_init_thermal(struct cpuinfo_x86 *c)
-{
- unsigned int cpu = smp_processor_id();
- int tm2 = 0;
- u32 l, h;
-
- if (!intel_thermal_supported(c))
- return;
-
-	/*
-	 * First check if it's enabled already, in which case there might
-	 * be some SMM goo which handles it, so we can't even put a handler
-	 * since it might be delivered via SMI already:
-	 */
- rdmsr(MSR_IA32_MISC_ENABLE, l, h);
-
-	/*
-	 * The initial value of thermal LVT entries on all APs always reads
-	 * 0x10000 because APs are woken up by the BSP issuing an
-	 * INIT-SIPI-SIPI sequence to them, and LVT registers are reset to 0s
-	 * except for the mask bits, which are set to 1s when APs receive the
-	 * INIT IPI. Always restore the value that the BIOS programmed on the
-	 * AP, based on the BSP's info we saved, since the BIOS always sets
-	 * the same value for all threads/cores.
-	 */
- apic_write(APIC_LVTTHMR, lvtthmr_init);
-
- h = lvtthmr_init;
-
- if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) {
- printk(KERN_DEBUG
- "CPU%d: Thermal monitoring handled by SMI\n", cpu);
- return;
- }
-
- /* Check whether a vector already exists */
- if (h & APIC_VECTOR_MASK) {
- printk(KERN_DEBUG
- "CPU%d: Thermal LVT vector (%#x) already installed\n",
- cpu, (h & APIC_VECTOR_MASK));
- return;
- }
-
-	/* early Pentium M models use a different method for enabling TM2 */
- if (cpu_has(c, X86_FEATURE_TM2)) {
- if (c->x86 == 6 && (c->x86_model == 9 || c->x86_model == 13)) {
- rdmsr(MSR_THERM2_CTL, l, h);
- if (l & MSR_THERM2_CTL_TM_SELECT)
- tm2 = 1;
- } else if (l & MSR_IA32_MISC_ENABLE_TM2)
- tm2 = 1;
- }
-
- /* We'll mask the thermal vector in the lapic till we're ready: */
- h = THERMAL_APIC_VECTOR | APIC_DM_FIXED | APIC_LVT_MASKED;
- apic_write(APIC_LVTTHMR, h);
-
- rdmsr(MSR_IA32_THERM_INTERRUPT, l, h);
- wrmsr(MSR_IA32_THERM_INTERRUPT,
- l | (THERM_INT_LOW_ENABLE | THERM_INT_HIGH_ENABLE), h);
-
- smp_thermal_vector = intel_thermal_interrupt;
-
- rdmsr(MSR_IA32_MISC_ENABLE, l, h);
- wrmsr(MSR_IA32_MISC_ENABLE, l | MSR_IA32_MISC_ENABLE_TM1, h);
-
- /* Unmask the thermal vector: */
- l = apic_read(APIC_LVTTHMR);
- apic_write(APIC_LVTTHMR, l & ~APIC_LVT_MASKED);
-
- printk_once(KERN_INFO "CPU0: Thermal monitoring enabled (%s)\n",
- tm2 ? "TM2" : "TM1");
-
- /* enable thermal throttle processing */
- atomic_set(&therm_throt_en, 1);
-}
diff --git a/arch/x86/kernel/cpu/mcheck/threshold.c b/arch/x86/kernel/cpu/mcheck/threshold.c
deleted file mode 100644
index d746df2909c9..000000000000
--- a/arch/x86/kernel/cpu/mcheck/threshold.c
+++ /dev/null
@@ -1,29 +0,0 @@
-/*
- * Common corrected MCE threshold handler code:
- */
-#include <linux/interrupt.h>
-#include <linux/kernel.h>
-
-#include <asm/irq_vectors.h>
-#include <asm/apic.h>
-#include <asm/idle.h>
-#include <asm/mce.h>
-
-static void default_threshold_interrupt(void)
-{
- printk(KERN_ERR "Unexpected threshold interrupt at vector %x\n",
- THRESHOLD_APIC_VECTOR);
-}
-
-void (*mce_threshold_vector)(void) = default_threshold_interrupt;
-
-asmlinkage void smp_threshold_interrupt(void)
-{
- exit_idle();
- irq_enter();
- inc_irq_stat(irq_threshold_count);
- mce_threshold_vector();
- irq_exit();
- /* Ack only at the end to avoid potential reentry */
- ack_APIC_irq();
-}
diff --git a/arch/x86/kernel/cpu/microcode/Makefile b/arch/x86/kernel/cpu/microcode/Makefile
new file mode 100644
index 000000000000..193d98b33a0a
--- /dev/null
+++ b/arch/x86/kernel/cpu/microcode/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0-only
+microcode-y := core.o
+obj-$(CONFIG_MICROCODE) += microcode.o
+microcode-$(CONFIG_CPU_SUP_INTEL) += intel.o
+microcode-$(CONFIG_CPU_SUP_AMD) += amd.o
diff --git a/arch/x86/kernel/cpu/microcode/amd.c b/arch/x86/kernel/cpu/microcode/amd.c
new file mode 100644
index 000000000000..3821a985f4ff
--- /dev/null
+++ b/arch/x86/kernel/cpu/microcode/amd.c
@@ -0,0 +1,1306 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD CPU Microcode Update Driver for Linux
+ *
+ * This driver allows upgrading microcode on AMD family 0x10 ("F10h")
+ * CPUs and later.
+ *
+ * Copyright (C) 2008-2011 Advanced Micro Devices Inc.
+ * 2013-2018 Borislav Petkov <bp@alien8.de>
+ *
+ * Author: Peter Oruba <peter.oruba@amd.com>
+ *
+ * Based on work by:
+ * Tigran Aivazian <aivazian.tigran@gmail.com>
+ *
+ * early loader:
+ * Copyright (C) 2013 Advanced Micro Devices, Inc.
+ *
+ * Author: Jacob Shin <jacob.shin@amd.com>
+ * Fixes: Borislav Petkov <bp@suse.de>
+ */
+#define pr_fmt(fmt) "microcode: " fmt
+
+#include <linux/earlycpio.h>
+#include <linux/firmware.h>
+#include <linux/bsearch.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+#include <linux/initrd.h>
+#include <linux/kernel.h>
+#include <linux/pci.h>
+
+#include <crypto/sha2.h>
+
+#include <asm/microcode.h>
+#include <asm/processor.h>
+#include <asm/cmdline.h>
+#include <asm/setup.h>
+#include <asm/cpu.h>
+#include <asm/msr.h>
+#include <asm/tlb.h>
+
+#include "internal.h"
+
+struct ucode_patch {
+ struct list_head plist;
+ void *data;
+ unsigned int size;
+ u32 patch_id;
+ u16 equiv_cpu;
+};
+
+static LIST_HEAD(microcode_cache);
+
+#define UCODE_MAGIC 0x00414d44
+#define UCODE_EQUIV_CPU_TABLE_TYPE 0x00000000
+#define UCODE_UCODE_TYPE 0x00000001
+
+#define SECTION_HDR_SIZE 8
+#define CONTAINER_HDR_SZ 12
+
+struct equiv_cpu_entry {
+ u32 installed_cpu;
+ u32 fixed_errata_mask;
+ u32 fixed_errata_compare;
+ u16 equiv_cpu;
+ u16 res;
+} __packed;
+
+struct microcode_header_amd {
+ u32 data_code;
+ u32 patch_id;
+ u16 mc_patch_data_id;
+ u8 mc_patch_data_len;
+ u8 init_flag;
+ u32 mc_patch_data_checksum;
+ u32 nb_dev_id;
+ u32 sb_dev_id;
+ u16 processor_rev_id;
+ u8 nb_rev_id;
+ u8 sb_rev_id;
+ u8 bios_api_rev;
+ u8 reserved1[3];
+ u32 match_reg[8];
+} __packed;
+
+struct microcode_amd {
+ struct microcode_header_amd hdr;
+ unsigned int mpb[];
+};
+
+static struct equiv_cpu_table {
+ unsigned int num_entries;
+ struct equiv_cpu_entry *entry;
+} equiv_table;
+
+union zen_patch_rev {
+ struct {
+ __u32 rev : 8,
+ stepping : 4,
+ model : 4,
+ __reserved : 4,
+ ext_model : 4,
+ ext_fam : 8;
+ };
+ __u32 ucode_rev;
+};
+
+union cpuid_1_eax {
+ struct {
+ __u32 stepping : 4,
+ model : 4,
+ family : 4,
+ __reserved0 : 4,
+ ext_model : 4,
+ ext_fam : 8,
+ __reserved1 : 4;
+ };
+ __u32 full;
+};
+
+/*
+ * This points to the current valid container of microcode patches which we will
+ * save from the initrd/builtin before jettisoning its contents. @mc is the
+ * microcode patch we found to match.
+ */
+struct cont_desc {
+ struct microcode_amd *mc;
+ u32 psize;
+ u8 *data;
+ size_t size;
+};
+
+/*
+ * The microcode patch container file is prepended to the initrd in cpio
+ * format. See Documentation/arch/x86/microcode.rst
+ */
+static const char
+ucode_path[] __maybe_unused = "kernel/x86/microcode/AuthenticAMD.bin";
+
+/*
+ * This is CPUID(1).EAX on the BSP. It is used in two ways:
+ *
+ * 1. To ignore the equivalence table on Zen1 and newer.
+ *
+ * 2. To match which patches to load, because the patch revision ID
+ *    already contains the f/m/s for which the microcode is destined.
+ */
+static u32 bsp_cpuid_1_eax __ro_after_init;
+
+static bool sha_check = true;
+
+struct patch_digest {
+ u32 patch_id;
+ u8 sha256[SHA256_DIGEST_SIZE];
+};
+
+#include "amd_shas.c"
+
+static int cmp_id(const void *key, const void *elem)
+{
+ struct patch_digest *pd = (struct patch_digest *)elem;
+ u32 patch_id = *(u32 *)key;
+
+ if (patch_id == pd->patch_id)
+ return 0;
+ else if (patch_id < pd->patch_id)
+ return -1;
+ else
+ return 1;
+}
+
+static u32 cpuid_to_ucode_rev(unsigned int val)
+{
+ union zen_patch_rev p = {};
+ union cpuid_1_eax c;
+
+ c.full = val;
+
+ p.stepping = c.stepping;
+ p.model = c.model;
+ p.ext_model = c.ext_model;
+ p.ext_fam = c.ext_fam;
+
+ return p.ucode_rev;
+}
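A worked example of the packing, with concrete values chosen for illustration: CPUID(1).EAX = 0x00a00f11 decodes to ext_fam 0xa, ext_model 0, family 0xf, model 1, stepping 1, and repacking those fields through the zen_patch_rev layout yields 0x0a001100, i.e. the 0xa0011xx patch-ID range that also appears in the cutoff table below.

static void __maybe_unused rev_mapping_example(void)
{
	/* Hypothetical Zen3-class CPUID(1).EAX value. */
	u32 eax = 0x00a00f11;

	/* The f/m/s fields repacked into the patch-revision layout. */
	WARN_ON(cpuid_to_ucode_rev(eax) != 0x0a001100);
}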
+
+static u32 get_cutoff_revision(u32 rev)
+{
+	switch (rev >> 8) {
+	case 0x80012: return 0x8001277;
+	case 0x80082: return 0x800820f;
+	case 0x83010: return 0x830107c;
+	case 0x86001: return 0x860010e;
+	case 0x86081: return 0x8608108;
+	case 0x87010: return 0x8701034;
+	case 0x8a000: return 0x8a0000a;
+	case 0xa0010: return 0xa00107a;
+	case 0xa0011: return 0xa0011da;
+	case 0xa0012: return 0xa001243;
+	case 0xa0082: return 0xa00820e;
+	case 0xa1011: return 0xa101153;
+	case 0xa1012: return 0xa10124e;
+	case 0xa1081: return 0xa108109;
+	case 0xa2010: return 0xa20102f;
+	case 0xa2012: return 0xa201212;
+	case 0xa4041: return 0xa404109;
+	case 0xa5000: return 0xa500013;
+	case 0xa6012: return 0xa60120a;
+	case 0xa7041: return 0xa704109;
+	case 0xa7052: return 0xa705208;
+	case 0xa7080: return 0xa708009;
+	case 0xa70c0: return 0xa70c009;
+	case 0xaa001: return 0xaa00116;
+	case 0xaa002: return 0xaa00218;
+	case 0xb0021: return 0xb002146;
+	case 0xb0081: return 0xb008111;
+	case 0xb1010: return 0xb101046;
+	case 0xb2040: return 0xb204031;
+	case 0xb4040: return 0xb404031;
+	case 0xb4041: return 0xb404101;
+	case 0xb6000: return 0xb600031;
+	case 0xb6080: return 0xb608031;
+	case 0xb7000: return 0xb700031;
+	default: break;
+	}
+ return 0;
+}
+
+static bool need_sha_check(u32 cur_rev)
+{
+ u32 cutoff;
+
+ if (!cur_rev) {
+ cur_rev = cpuid_to_ucode_rev(bsp_cpuid_1_eax);
+ pr_info_once("No current revision, generating the lowest one: 0x%x\n", cur_rev);
+ }
+
+ cutoff = get_cutoff_revision(cur_rev);
+ if (cutoff)
+ return cur_rev <= cutoff;
+
+ pr_info("You should not be seeing this. Please send the following couple of lines to x86-<at>-kernel.org\n");
+ pr_info("CPUID(1).EAX: 0x%x, current revision: 0x%x\n", bsp_cpuid_1_eax, cur_rev);
+ return true;
+}
+
+static bool cpu_has_entrysign(void)
+{
+ unsigned int fam = x86_family(bsp_cpuid_1_eax);
+ unsigned int model = x86_model(bsp_cpuid_1_eax);
+
+ if (fam == 0x17 || fam == 0x19)
+ return true;
+
+ if (fam == 0x1a) {
+ if (model <= 0x2f ||
+ (0x40 <= model && model <= 0x4f) ||
+ (0x60 <= model && model <= 0x6f))
+ return true;
+ }
+
+ return false;
+}
+
+static bool verify_sha256_digest(u32 patch_id, u32 cur_rev, const u8 *data, unsigned int len)
+{
+ struct patch_digest *pd = NULL;
+ u8 digest[SHA256_DIGEST_SIZE];
+ int i;
+
+ if (!cpu_has_entrysign())
+ return true;
+
+ if (!need_sha_check(cur_rev))
+ return true;
+
+ if (!sha_check)
+ return true;
+
+ pd = bsearch(&patch_id, phashes, ARRAY_SIZE(phashes), sizeof(struct patch_digest), cmp_id);
+ if (!pd) {
+ pr_err("No sha256 digest for patch ID: 0x%x found\n", patch_id);
+ return false;
+ }
+
+ sha256(data, len, digest);
+
+ if (memcmp(digest, pd->sha256, sizeof(digest))) {
+ pr_err("Patch 0x%x SHA256 digest mismatch!\n", patch_id);
+
+ for (i = 0; i < SHA256_DIGEST_SIZE; i++)
+ pr_cont("0x%x ", digest[i]);
+ pr_info("\n");
+
+ return false;
+ }
+
+ return true;
+}
+
+static union cpuid_1_eax ucode_rev_to_cpuid(unsigned int val)
+{
+ union zen_patch_rev p;
+ union cpuid_1_eax c;
+
+ p.ucode_rev = val;
+ c.full = 0;
+
+ c.stepping = p.stepping;
+ c.model = p.model;
+ c.ext_model = p.ext_model;
+ c.family = 0xf;
+ c.ext_fam = p.ext_fam;
+
+ return c;
+}
+
+static u32 get_patch_level(void)
+{
+ u32 rev, dummy __always_unused;
+
+ if (IS_ENABLED(CONFIG_MICROCODE_DBG)) {
+ int cpu = smp_processor_id();
+
+ if (!microcode_rev[cpu]) {
+ if (!base_rev)
+ base_rev = cpuid_to_ucode_rev(bsp_cpuid_1_eax);
+
+ microcode_rev[cpu] = base_rev;
+
+ ucode_dbg("CPU%d, base_rev: 0x%x\n", cpu, base_rev);
+ }
+
+ return microcode_rev[cpu];
+ }
+
+ native_rdmsr(MSR_AMD64_PATCH_LEVEL, rev, dummy);
+
+ return rev;
+}
+
+static u16 find_equiv_id(struct equiv_cpu_table *et, u32 sig)
+{
+ unsigned int i;
+
+ /* Zen and newer do not need an equivalence table. */
+ if (x86_family(bsp_cpuid_1_eax) >= 0x17)
+ return 0;
+
+ if (!et || !et->num_entries)
+ return 0;
+
+ for (i = 0; i < et->num_entries; i++) {
+ struct equiv_cpu_entry *e = &et->entry[i];
+
+ if (sig == e->installed_cpu)
+ return e->equiv_cpu;
+ }
+ return 0;
+}
+
+/*
+ * Check whether there is a valid microcode container file at the beginning
+ * of @buf of size @buf_size.
+ */
+static bool verify_container(const u8 *buf, size_t buf_size)
+{
+ u32 cont_magic;
+
+ if (buf_size <= CONTAINER_HDR_SZ) {
+ ucode_dbg("Truncated microcode container header.\n");
+ return false;
+ }
+
+ cont_magic = *(const u32 *)buf;
+ if (cont_magic != UCODE_MAGIC) {
+ ucode_dbg("Invalid magic value (0x%08x).\n", cont_magic);
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * Check whether there is a valid, non-truncated CPU equivalence table at the
+ * beginning of @buf of size @buf_size.
+ */
+static bool verify_equivalence_table(const u8 *buf, size_t buf_size)
+{
+ const u32 *hdr = (const u32 *)buf;
+ u32 cont_type, equiv_tbl_len;
+
+ if (!verify_container(buf, buf_size))
+ return false;
+
+ /* Zen and newer do not need an equivalence table. */
+ if (x86_family(bsp_cpuid_1_eax) >= 0x17)
+ return true;
+
+ cont_type = hdr[1];
+ if (cont_type != UCODE_EQUIV_CPU_TABLE_TYPE) {
+ ucode_dbg("Wrong microcode container equivalence table type: %u.\n",
+ cont_type);
+ return false;
+ }
+
+ buf_size -= CONTAINER_HDR_SZ;
+
+ equiv_tbl_len = hdr[2];
+ if (equiv_tbl_len < sizeof(struct equiv_cpu_entry) ||
+ buf_size < equiv_tbl_len) {
+ ucode_dbg("Truncated equivalence table.\n");
+ return false;
+ }
+
+ return true;
+}
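The container header validated by the two functions above is simply three little-endian u32s (CONTAINER_HDR_SZ == 12). A layout sketch, with field names invented for illustration; the code itself indexes them as hdr[0..2]:

struct cont_hdr {
	u32 magic;	/* UCODE_MAGIC == 0x00414d44 */
	u32 type;	/* UCODE_EQUIV_CPU_TABLE_TYPE for the leading section */
	u32 size;	/* equivalence table length in bytes */
} __packed;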
+
+/*
+ * Check whether there is a valid, non-truncated microcode patch section at the
+ * beginning of @buf of size @buf_size.
+ *
+ * On success, @sh_psize returns the patch size according to the section header,
+ * to the caller.
+ */
+static bool __verify_patch_section(const u8 *buf, size_t buf_size, u32 *sh_psize)
+{
+ u32 p_type, p_size;
+ const u32 *hdr;
+
+ if (buf_size < SECTION_HDR_SIZE) {
+ ucode_dbg("Truncated patch section.\n");
+ return false;
+ }
+
+ hdr = (const u32 *)buf;
+ p_type = hdr[0];
+ p_size = hdr[1];
+
+ if (p_type != UCODE_UCODE_TYPE) {
+ ucode_dbg("Invalid type field (0x%x) in container file section header.\n",
+ p_type);
+ return false;
+ }
+
+ if (p_size < sizeof(struct microcode_header_amd)) {
+ ucode_dbg("Patch of size %u too short.\n", p_size);
+ return false;
+ }
+
+ *sh_psize = p_size;
+
+ return true;
+}
+
+/*
+ * Check whether the passed remaining file @buf_size is large enough to contain
+ * a patch of the indicated @sh_psize (and also whether this size does not
+ * exceed the per-family maximum). @sh_psize is the size read from the section
+ * header.
+ */
+static bool __verify_patch_size(u32 sh_psize, size_t buf_size)
+{
+ u8 family = x86_family(bsp_cpuid_1_eax);
+ u32 max_size;
+
+ if (family >= 0x15)
+ goto ret;
+
+#define F1XH_MPB_MAX_SIZE 2048
+#define F14H_MPB_MAX_SIZE 1824
+
+ switch (family) {
+ case 0x10 ... 0x12:
+ max_size = F1XH_MPB_MAX_SIZE;
+ break;
+ case 0x14:
+ max_size = F14H_MPB_MAX_SIZE;
+ break;
+ default:
+ WARN(1, "%s: WTF family: 0x%x\n", __func__, family);
+ return false;
+ }
+
+ if (sh_psize > max_size)
+ return false;
+
+ret:
+	/* Working with the whole remaining buffer, so <= is the right check. */
+ return sh_psize <= buf_size;
+}
+
+/*
+ * Verify the patch in @buf.
+ *
+ * Returns:
+ * negative: on error
+ * positive: patch is not for this family, skip it
+ * 0: success
+ */
+static int verify_patch(const u8 *buf, size_t buf_size, u32 *patch_size)
+{
+ u8 family = x86_family(bsp_cpuid_1_eax);
+ struct microcode_header_amd *mc_hdr;
+ u32 cur_rev, cutoff, patch_rev;
+ u32 sh_psize;
+ u16 proc_id;
+ u8 patch_fam;
+
+ if (!__verify_patch_section(buf, buf_size, &sh_psize))
+ return -1;
+
+ /*
+ * The section header length is not included in this indicated size
+ * but is present in the leftover file length so we need to subtract
+ * it before passing this value to the function below.
+ */
+ buf_size -= SECTION_HDR_SIZE;
+
+ /*
+ * Check if the remaining buffer is big enough to contain a patch of
+ * size sh_psize, as the section claims.
+ */
+ if (buf_size < sh_psize) {
+ ucode_dbg("Patch of size %u truncated.\n", sh_psize);
+ return -1;
+ }
+
+ if (!__verify_patch_size(sh_psize, buf_size)) {
+ ucode_dbg("Per-family patch size mismatch.\n");
+ return -1;
+ }
+
+ *patch_size = sh_psize;
+
+ mc_hdr = (struct microcode_header_amd *)(buf + SECTION_HDR_SIZE);
+ if (mc_hdr->nb_dev_id || mc_hdr->sb_dev_id) {
+ pr_err("Patch-ID 0x%08x: chipset-specific code unsupported.\n", mc_hdr->patch_id);
+ return -1;
+ }
+
+ proc_id = mc_hdr->processor_rev_id;
+ patch_fam = 0xf + (proc_id >> 12);
+
+ if (patch_fam != family)
+ return 1;
+
+ cur_rev = get_patch_level();
+
+ /* No cutoff revision means old/unaffected by signing algorithm weakness => matches */
+ cutoff = get_cutoff_revision(cur_rev);
+ if (!cutoff)
+ goto ok;
+
+ patch_rev = mc_hdr->patch_id;
+
+ ucode_dbg("cur_rev: 0x%x, cutoff: 0x%x, patch_rev: 0x%x\n",
+ cur_rev, cutoff, patch_rev);
+
+ if (cur_rev <= cutoff && patch_rev <= cutoff)
+ goto ok;
+
+ if (cur_rev > cutoff && patch_rev > cutoff)
+ goto ok;
+
+ return 1;
+
+ok:
+ ucode_dbg("Patch-ID 0x%08x: family: 0x%x\n", mc_hdr->patch_id, patch_fam);
+
+ return 0;
+}
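The two goto-ok branches above encode one rule: the current revision and the candidate patch revision must lie on the same side of the cutoff. As a standalone predicate, equivalent to the checks above (a sketch):

static inline bool same_side_of_cutoff(u32 cur_rev, u32 patch_rev, u32 cutoff)
{
	/* Both at or below the cutoff, or both above it. */
	return (cur_rev <= cutoff) == (patch_rev <= cutoff);
}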
+
+static bool mc_patch_matches(struct microcode_amd *mc, u16 eq_id)
+{
+ /* Zen and newer do not need an equivalence table. */
+ if (x86_family(bsp_cpuid_1_eax) >= 0x17)
+ return ucode_rev_to_cpuid(mc->hdr.patch_id).full == bsp_cpuid_1_eax;
+ else
+ return eq_id == mc->hdr.processor_rev_id;
+}
+
+/*
+ * This scans the ucode blob for the proper container as we can have multiple
+ * containers glued together.
+ *
+ * Returns the number of bytes consumed while scanning. @desc contains all the
+ * data we're going to use in later stages of the application.
+ */
+static size_t parse_container(u8 *ucode, size_t size, struct cont_desc *desc)
+{
+ struct equiv_cpu_table table;
+ size_t orig_size = size;
+ u32 *hdr = (u32 *)ucode;
+ u16 eq_id;
+ u8 *buf;
+
+ if (!verify_equivalence_table(ucode, size))
+ return 0;
+
+ buf = ucode;
+
+ table.entry = (struct equiv_cpu_entry *)(buf + CONTAINER_HDR_SZ);
+ table.num_entries = hdr[2] / sizeof(struct equiv_cpu_entry);
+
+ /*
+ * Find the equivalence ID of our CPU in this table. Even if this table
+ * doesn't contain a patch for the CPU, scan through the whole container
+ * so that it can be skipped in case there are other containers appended.
+ */
+ eq_id = find_equiv_id(&table, bsp_cpuid_1_eax);
+
+ buf += hdr[2] + CONTAINER_HDR_SZ;
+ size -= hdr[2] + CONTAINER_HDR_SZ;
+
+ /*
+ * Scan through the rest of the container to find where it ends. We do
+ * some basic sanity-checking too.
+ */
+ while (size > 0) {
+ struct microcode_amd *mc;
+ u32 patch_size;
+ int ret;
+
+ ret = verify_patch(buf, size, &patch_size);
+ if (ret < 0) {
+			/*
+			 * Patch verification failed: skip to the next
+			 * container, if there is one. Before exiting, check
+			 * whether this container has already found a patch;
+			 * if so, use it.
+			 */
+ goto out;
+ } else if (ret > 0) {
+ goto skip;
+ }
+
+ mc = (struct microcode_amd *)(buf + SECTION_HDR_SIZE);
+
+ if (mc_patch_matches(mc, eq_id)) {
+ desc->psize = patch_size;
+ desc->mc = mc;
+
+ ucode_dbg(" match: size: %d\n", patch_size);
+ }
+
+skip:
+ /* Skip patch section header too: */
+ buf += patch_size + SECTION_HDR_SIZE;
+ size -= patch_size + SECTION_HDR_SIZE;
+ }
+
+out:
+	/*
+	 * If we have found a patch (desc->mc), we're looking at the container
+	 * that has a patch for this CPU, so return 0 to signal that @ucode
+	 * already points to the proper container. Otherwise, return the size
+	 * we scanned so that we can advance to the next container in the
+	 * buffer.
+	 */
+ if (desc->mc) {
+ desc->data = ucode;
+ desc->size = orig_size - size;
+
+ return 0;
+ }
+
+ return orig_size - size;
+}
+
+/*
+ * Scan the ucode blob for the proper container as we can have multiple
+ * containers glued together.
+ */
+static void scan_containers(u8 *ucode, size_t size, struct cont_desc *desc)
+{
+ while (size) {
+ size_t s = parse_container(ucode, size, desc);
+ if (!s)
+ return;
+
+ /* catch wraparound */
+ if (size >= s) {
+ ucode += s;
+ size -= s;
+ } else {
+ return;
+ }
+ }
+}
+
+static bool __apply_microcode_amd(struct microcode_amd *mc, u32 *cur_rev,
+ unsigned int psize)
+{
+ unsigned long p_addr = (unsigned long)&mc->hdr.data_code;
+
+ if (!verify_sha256_digest(mc->hdr.patch_id, *cur_rev, (const u8 *)p_addr, psize))
+ return false;
+
+ native_wrmsrq(MSR_AMD64_PATCH_LOADER, p_addr);
+
+ if (x86_family(bsp_cpuid_1_eax) == 0x17) {
+ unsigned long p_addr_end = p_addr + psize - 1;
+
+ invlpg(p_addr);
+
+ /*
+ * Flush next page too if patch image is crossing a page
+ * boundary.
+ */
+ if (p_addr >> PAGE_SHIFT != p_addr_end >> PAGE_SHIFT)
+ invlpg(p_addr_end);
+ }
+
+ if (IS_ENABLED(CONFIG_MICROCODE_DBG))
+ microcode_rev[smp_processor_id()] = mc->hdr.patch_id;
+
+ /* verify patch application was successful */
+ *cur_rev = get_patch_level();
+
+ ucode_dbg("updated rev: 0x%x\n", *cur_rev);
+
+ if (*cur_rev != mc->hdr.patch_id)
+ return false;
+
+ return true;
+}
+
+static bool get_builtin_microcode(struct cpio_data *cp)
+{
+ char fw_name[36] = "amd-ucode/microcode_amd.bin";
+ u8 family = x86_family(bsp_cpuid_1_eax);
+ struct firmware fw;
+
+ if (IS_ENABLED(CONFIG_X86_32))
+ return false;
+
+ if (family >= 0x15)
+ snprintf(fw_name, sizeof(fw_name),
+ "amd-ucode/microcode_amd_fam%02hhxh.bin", family);
+
+ if (firmware_request_builtin(&fw, fw_name)) {
+ cp->size = fw.size;
+ cp->data = (void *)fw.data;
+ return true;
+ }
+
+ return false;
+}
+
+static bool __init find_blobs_in_containers(struct cpio_data *ret)
+{
+ struct cpio_data cp;
+ bool found;
+
+ if (!get_builtin_microcode(&cp))
+ cp = find_microcode_in_initrd(ucode_path);
+
+ found = cp.data && cp.size;
+ if (found)
+ *ret = cp;
+
+ return found;
+}
+
+/*
+ * Early load occurs before we can vmalloc(). So we look for the microcode
+ * patch container file in initrd, traverse the equivalence CPU table, look
+ * for a matching microcode patch, and apply the update, all in place in
+ * initrd memory. When vmalloc() is available for use later -- on 64-bit
+ * during first AP load, and on 32-bit during save_microcode_in_initrd() --
+ * we can call load_microcode_amd() to save the equivalence CPU table and
+ * microcode patches in kernel heap memory.
+ */
+void __init load_ucode_amd_bsp(struct early_load_data *ed, unsigned int cpuid_1_eax)
+{
+ struct cont_desc desc = { };
+ struct microcode_amd *mc;
+ struct cpio_data cp = { };
+ char buf[4];
+ u32 rev;
+
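+ /* "microcode.amd_sha_check=off" disables SHA256 verification of the patch blobs. */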
+ if (cmdline_find_option(boot_command_line, "microcode.amd_sha_check", buf, 4)) {
+ if (!strncmp(buf, "off", 3)) {
+ sha_check = false;
+ pr_warn_once("It is a very very bad idea to disable the blobs SHA check!\n");
+ add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
+ }
+ }
+
+ bsp_cpuid_1_eax = cpuid_1_eax;
+
+ rev = get_patch_level();
+ ed->old_rev = rev;
+
+ /* Needed in load_microcode_amd() */
+ ucode_cpu_info[0].cpu_sig.sig = cpuid_1_eax;
+
+ if (!find_blobs_in_containers(&cp))
+ return;
+
+ scan_containers(cp.data, cp.size, &desc);
+
+ mc = desc.mc;
+ if (!mc)
+ return;
+
+ /*
+ * Allow application of the same revision to pick up SMT-specific
+ * changes even if the revision of the other SMT thread is already
+ * up-to-date.
+ */
+ if (ed->old_rev > mc->hdr.patch_id)
+ return;
+
+ if (__apply_microcode_amd(mc, &rev, desc.psize))
+ ed->new_rev = rev;
+}
+
+static inline bool patch_cpus_equivalent(struct ucode_patch *p,
+ struct ucode_patch *n,
+ bool ignore_stepping)
+{
+ /* Zen and newer hardcode the f/m/s in the patch ID */
+ if (x86_family(bsp_cpuid_1_eax) >= 0x17) {
+ union cpuid_1_eax p_cid = ucode_rev_to_cpuid(p->patch_id);
+ union cpuid_1_eax n_cid = ucode_rev_to_cpuid(n->patch_id);
+
+ if (ignore_stepping) {
+ p_cid.stepping = 0;
+ n_cid.stepping = 0;
+ }
+
+ return p_cid.full == n_cid.full;
+ } else {
+ return p->equiv_cpu == n->equiv_cpu;
+ }
+}
+
+/*
+ * a small, trivial cache of per-family ucode patches
+ */
+static struct ucode_patch *cache_find_patch(struct ucode_cpu_info *uci, u16 equiv_cpu)
+{
+ struct ucode_patch *p;
+ struct ucode_patch n;
+
+ n.equiv_cpu = equiv_cpu;
+ n.patch_id = uci->cpu_sig.rev;
+
+ list_for_each_entry(p, &microcode_cache, plist)
+ if (patch_cpus_equivalent(p, &n, false))
+ return p;
+
+ return NULL;
+}
+
+static inline int patch_newer(struct ucode_patch *p, struct ucode_patch *n)
+{
+ /* Zen and newer hardcode the f/m/s in the patch ID */
+ if (x86_family(bsp_cpuid_1_eax) >= 0x17) {
+ union zen_patch_rev zp, zn;
+
+ zp.ucode_rev = p->patch_id;
+ zn.ucode_rev = n->patch_id;
+
+ if (zn.stepping != zp.stepping)
+ return -1;
+
+ return zn.rev > zp.rev;
+ } else {
+ return n->patch_id > p->patch_id;
+ }
+}
+
+static void update_cache(struct ucode_patch *new_patch)
+{
+ struct ucode_patch *p;
+ int ret;
+
+ list_for_each_entry(p, &microcode_cache, plist) {
+ if (patch_cpus_equivalent(p, new_patch, true)) {
+ ret = patch_newer(p, new_patch);
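+ /* A negative return means a stepping mismatch - not a replacement candidate. */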
+ if (ret < 0)
+ continue;
+ else if (!ret) {
+ /* we already have the latest patch */
+ kfree(new_patch->data);
+ kfree(new_patch);
+ return;
+ }
+
+ list_replace(&p->plist, &new_patch->plist);
+ kfree(p->data);
+ kfree(p);
+ return;
+ }
+ }
+ /* no patch found, add it */
+ list_add_tail(&new_patch->plist, &microcode_cache);
+}
+
+static void free_cache(void)
+{
+ struct ucode_patch *p, *tmp;
+
+ list_for_each_entry_safe(p, tmp, &microcode_cache, plist) {
+ __list_del(p->plist.prev, p->plist.next);
+ kfree(p->data);
+ kfree(p);
+ }
+}
+
+static struct ucode_patch *find_patch(unsigned int cpu)
+{
+ struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
+ u16 equiv_id = 0;
+
+ uci->cpu_sig.rev = get_patch_level();
+
+ if (x86_family(bsp_cpuid_1_eax) < 0x17) {
+ equiv_id = find_equiv_id(&equiv_table, uci->cpu_sig.sig);
+ if (!equiv_id)
+ return NULL;
+ }
+
+ return cache_find_patch(uci, equiv_id);
+}
+
+void reload_ucode_amd(unsigned int cpu)
+{
+ u32 rev, dummy __always_unused;
+ struct microcode_amd *mc;
+ struct ucode_patch *p;
+
+ p = find_patch(cpu);
+ if (!p)
+ return;
+
+ mc = p->data;
+
+ rev = get_patch_level();
+ if (rev < mc->hdr.patch_id) {
+ if (__apply_microcode_amd(mc, &rev, p->size))
+ pr_info_once("reload revision: 0x%08x\n", rev);
+ }
+}
+
+static int collect_cpu_info_amd(int cpu, struct cpu_signature *csig)
+{
+ struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
+ struct ucode_patch *p;
+
+ csig->sig = cpuid_eax(0x00000001);
+ csig->rev = get_patch_level();
+
+ /*
+ * A patch could have been loaded early; set uci->mc so that
+ * microcode_bsp_resume() can call apply_microcode().
+ */
+ p = find_patch(cpu);
+ if (p && (p->patch_id == csig->rev))
+ uci->mc = p->data;
+
+ return 0;
+}
+
+static enum ucode_state apply_microcode_amd(int cpu)
+{
+ struct cpuinfo_x86 *c = &cpu_data(cpu);
+ struct microcode_amd *mc_amd;
+ struct ucode_cpu_info *uci;
+ struct ucode_patch *p;
+ enum ucode_state ret;
+ u32 rev;
+
+ BUG_ON(raw_smp_processor_id() != cpu);
+
+ uci = ucode_cpu_info + cpu;
+
+ p = find_patch(cpu);
+ if (!p)
+ return UCODE_NFOUND;
+
+ rev = uci->cpu_sig.rev;
+
+ mc_amd = p->data;
+ uci->mc = p->data;
+
+ /* need to apply patch? */
+ if (rev > mc_amd->hdr.patch_id) {
+ ret = UCODE_OK;
+ goto out;
+ }
+
+ if (!__apply_microcode_amd(mc_amd, &rev, p->size)) {
+ pr_err("CPU%d: update failed for patch_level=0x%08x\n",
+ cpu, mc_amd->hdr.patch_id);
+ return UCODE_ERROR;
+ }
+
+ rev = mc_amd->hdr.patch_id;
+ ret = UCODE_UPDATED;
+
+out:
+ uci->cpu_sig.rev = rev;
+ c->microcode = rev;
+
+ /* Update boot_cpu_data's revision too, if we're on the BSP: */
+ if (c->cpu_index == boot_cpu_data.cpu_index)
+ boot_cpu_data.microcode = rev;
+
+ return ret;
+}
+
+void load_ucode_amd_ap(unsigned int cpuid_1_eax)
+{
+ unsigned int cpu = smp_processor_id();
+
+ ucode_cpu_info[cpu].cpu_sig.sig = cpuid_1_eax;
+ apply_microcode_amd(cpu);
+}
+
+static size_t install_equiv_cpu_table(const u8 *buf, size_t buf_size)
+{
+ u32 equiv_tbl_len;
+ const u32 *hdr;
+
+ if (!verify_equivalence_table(buf, buf_size))
+ return 0;
+
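+ /* The third container header dword holds the equivalence table length in bytes. */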
+ hdr = (const u32 *)buf;
+ equiv_tbl_len = hdr[2];
+
+ /* Zen and newer do not need an equivalence table. */
+ if (x86_family(bsp_cpuid_1_eax) >= 0x17)
+ goto out;
+
+ equiv_table.entry = vmalloc(equiv_tbl_len);
+ if (!equiv_table.entry) {
+ pr_err("failed to allocate equivalent CPU table\n");
+ return 0;
+ }
+
+ memcpy(equiv_table.entry, buf + CONTAINER_HDR_SZ, equiv_tbl_len);
+ equiv_table.num_entries = equiv_tbl_len / sizeof(struct equiv_cpu_entry);
+
+out:
+ /* add header length */
+ return equiv_tbl_len + CONTAINER_HDR_SZ;
+}
+
+static void free_equiv_cpu_table(void)
+{
+ if (x86_family(bsp_cpuid_1_eax) >= 0x17)
+ return;
+
+ vfree(equiv_table.entry);
+ memset(&equiv_table, 0, sizeof(equiv_table));
+}
+
+static void cleanup(void)
+{
+ free_equiv_cpu_table();
+ free_cache();
+}
+
+/*
+ * Return a non-negative value even if some of the checks failed so that
+ * we can skip over the next patch. If we return a negative value, we
+ * signal a grave error, such as a failed memory allocation, meaning the
+ * driver cannot continue functioning normally. In such cases, we tear
+ * down everything we've used up so far and exit.
+ */
+static int verify_and_add_patch(u8 family, u8 *fw, unsigned int leftover,
+ unsigned int *patch_size)
+{
+ struct microcode_header_amd *mc_hdr;
+ struct ucode_patch *patch;
+ u16 proc_id;
+ int ret;
+
+ ret = verify_patch(fw, leftover, patch_size);
+ if (ret)
+ return ret;
+
+ patch = kzalloc(sizeof(*patch), GFP_KERNEL);
+ if (!patch) {
+ pr_err("Patch allocation failure.\n");
+ return -EINVAL;
+ }
+
+ patch->data = kmemdup(fw + SECTION_HDR_SIZE, *patch_size, GFP_KERNEL);
+ if (!patch->data) {
+ pr_err("Patch data allocation failure.\n");
+ kfree(patch);
+ return -EINVAL;
+ }
+ patch->size = *patch_size;
+
+ mc_hdr = (struct microcode_header_amd *)(fw + SECTION_HDR_SIZE);
+ proc_id = mc_hdr->processor_rev_id;
+
+ INIT_LIST_HEAD(&patch->plist);
+ patch->patch_id = mc_hdr->patch_id;
+ patch->equiv_cpu = proc_id;
+
+ ucode_dbg("%s: Adding patch_id: 0x%08x, proc_id: 0x%04x\n",
+ __func__, patch->patch_id, proc_id);
+
+ /* ... and add to cache. */
+ update_cache(patch);
+
+ return 0;
+}
+
+/* Scan the blob in @data and add microcode patches to the cache. */
+static enum ucode_state __load_microcode_amd(u8 family, const u8 *data, size_t size)
+{
+ u8 *fw = (u8 *)data;
+ size_t offset;
+
+ offset = install_equiv_cpu_table(data, size);
+ if (!offset)
+ return UCODE_ERROR;
+
+ fw += offset;
+ size -= offset;
+
+ if (*(u32 *)fw != UCODE_UCODE_TYPE) {
+ pr_err("invalid type field in container file section header\n");
+ free_equiv_cpu_table();
+ return UCODE_ERROR;
+ }
+
+ while (size > 0) {
+ unsigned int crnt_size = 0;
+ int ret;
+
+ ret = verify_and_add_patch(family, fw, size, &crnt_size);
+ if (ret < 0)
+ return UCODE_ERROR;
+
+ fw += crnt_size + SECTION_HDR_SIZE;
+ size -= (crnt_size + SECTION_HDR_SIZE);
+ }
+
+ return UCODE_OK;
+}
+
+static enum ucode_state _load_microcode_amd(u8 family, const u8 *data, size_t size)
+{
+ enum ucode_state ret;
+
+ /* free old equiv table */
+ free_equiv_cpu_table();
+
+ ret = __load_microcode_amd(family, data, size);
+ if (ret != UCODE_OK)
+ cleanup();
+
+ return ret;
+}
+
+static enum ucode_state load_microcode_amd(u8 family, const u8 *data, size_t size)
+{
+ struct cpuinfo_x86 *c;
+ unsigned int nid, cpu;
+ struct ucode_patch *p;
+ enum ucode_state ret;
+
+ ret = _load_microcode_amd(family, data, size);
+ if (ret != UCODE_OK)
+ return ret;
+
+ for_each_node_with_cpus(nid) {
+ cpu = cpumask_first(cpumask_of_node(nid));
+ c = &cpu_data(cpu);
+
+ p = find_patch(cpu);
+ if (!p)
+ continue;
+
+ if (c->microcode >= p->patch_id)
+ continue;
+
+ ret = UCODE_NEW;
+ }
+
+ return ret;
+}
+
+static int __init save_microcode_in_initrd(void)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+ struct cont_desc desc = { 0 };
+ unsigned int cpuid_1_eax;
+ enum ucode_state ret;
+ struct cpio_data cp;
+
+ if (microcode_loader_disabled() || c->x86_vendor != X86_VENDOR_AMD || c->x86 < 0x10)
+ return 0;
+
+ cpuid_1_eax = native_cpuid_eax(1);
+
+ if (!find_blobs_in_containers(&cp))
+ return -EINVAL;
+
+ scan_containers(cp.data, cp.size, &desc);
+ if (!desc.mc)
+ return -EINVAL;
+
+ ret = _load_microcode_amd(x86_family(cpuid_1_eax), desc.data, desc.size);
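+ /* Anything beyond UCODE_UPDATED in the state enum denotes failure. */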
+ if (ret > UCODE_UPDATED)
+ return -EINVAL;
+
+ return 0;
+}
+early_initcall(save_microcode_in_initrd);
+
+/*
+ * AMD microcode firmware naming convention: up to family 15h, the patches
+ * are in the legacy file:
+ *
+ * amd-ucode/microcode_amd.bin
+ *
+ * This legacy file is always smaller than 2K in size.
+ *
+ * Beginning with family 15h, they are in family-specific firmware files:
+ *
+ * amd-ucode/microcode_amd_fam15h.bin
+ * amd-ucode/microcode_amd_fam16h.bin
+ * ...
+ *
+ * These might be larger than 2K.
+ */
+static enum ucode_state request_microcode_amd(int cpu, struct device *device)
+{
+ char fw_name[36] = "amd-ucode/microcode_amd.bin";
+ struct cpuinfo_x86 *c = &cpu_data(cpu);
+ enum ucode_state ret = UCODE_NFOUND;
+ const struct firmware *fw;
+
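+ /* Minimal revision enforcement is not implemented here, so refuse a forced-minrev late load. */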
+ if (force_minrev)
+ return UCODE_NFOUND;
+
+ if (c->x86 >= 0x15)
+ snprintf(fw_name, sizeof(fw_name), "amd-ucode/microcode_amd_fam%.2xh.bin", c->x86);
+
+ if (request_firmware_direct(&fw, (const char *)fw_name, device)) {
+ ucode_dbg("failed to load file %s\n", fw_name);
+ goto out;
+ }
+
+ ret = UCODE_ERROR;
+ if (!verify_container(fw->data, fw->size))
+ goto fw_release;
+
+ ret = load_microcode_amd(c->x86, fw->data, fw->size);
+
+ fw_release:
+ release_firmware(fw);
+
+ out:
+ return ret;
+}
+
+static void microcode_fini_cpu_amd(int cpu)
+{
+ struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
+
+ uci->mc = NULL;
+}
+
+static void finalize_late_load_amd(int result)
+{
+ if (result)
+ cleanup();
+}
+
+static struct microcode_ops microcode_amd_ops = {
+ .request_microcode_fw = request_microcode_amd,
+ .collect_cpu_info = collect_cpu_info_amd,
+ .apply_microcode = apply_microcode_amd,
+ .microcode_fini_cpu = microcode_fini_cpu_amd,
+ .finalize_late_load = finalize_late_load_amd,
+ .nmi_safe = true,
+};
+
+struct microcode_ops * __init init_amd_microcode(void)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+
+ if (c->x86_vendor != X86_VENDOR_AMD || c->x86 < 0x10) {
+ pr_warn("AMD CPU family 0x%x not supported\n", c->x86);
+ return NULL;
+ }
+ return &microcode_amd_ops;
+}
+
+void __exit exit_amd_microcode(void)
+{
+ cleanup();
+}
diff --git a/arch/x86/kernel/cpu/microcode/amd_shas.c b/arch/x86/kernel/cpu/microcode/amd_shas.c
new file mode 100644
index 000000000000..1fd349cfc802
--- /dev/null
+++ b/arch/x86/kernel/cpu/microcode/amd_shas.c
@@ -0,0 +1,556 @@
+/* Keep 'em sorted. */
+static const struct patch_digest phashes[] = {
+ { 0x8001227, {
+ 0x99,0xc0,0x9b,0x2b,0xcc,0x9f,0x52,0x1b,
+ 0x1a,0x5f,0x1d,0x83,0xa1,0x6c,0xc4,0x46,
+ 0xe2,0x6c,0xda,0x73,0xfb,0x2d,0x23,0xa8,
+ 0x77,0xdc,0x15,0x31,0x33,0x4a,0x46,0x18,
+ }
+ },
+ { 0x8001250, {
+ 0xc0,0x0b,0x6b,0x19,0xfd,0x5c,0x39,0x60,
+ 0xd5,0xc3,0x57,0x46,0x54,0xe4,0xd1,0xaa,
+ 0xa8,0xf7,0x1f,0xa8,0x6a,0x60,0x3e,0xe3,
+ 0x27,0x39,0x8e,0x53,0x30,0xf8,0x49,0x19,
+ }
+ },
+ { 0x800126e, {
+ 0xf3,0x8b,0x2b,0xb6,0x34,0xe3,0xc8,0x2c,
+ 0xef,0xec,0x63,0x6d,0xc8,0x76,0x77,0xb3,
+ 0x25,0x5a,0xb7,0x52,0x8c,0x83,0x26,0xe6,
+ 0x4c,0xbe,0xbf,0xe9,0x7d,0x22,0x6a,0x43,
+ }
+ },
+ { 0x800126f, {
+ 0x2b,0x5a,0xf2,0x9c,0xdd,0xd2,0x7f,0xec,
+ 0xec,0x96,0x09,0x57,0xb0,0x96,0x29,0x8b,
+ 0x2e,0x26,0x91,0xf0,0x49,0x33,0x42,0x18,
+ 0xdd,0x4b,0x65,0x5a,0xd4,0x15,0x3d,0x33,
+ }
+ },
+ { 0x800820d, {
+ 0x68,0x98,0x83,0xcd,0x22,0x0d,0xdd,0x59,
+ 0x73,0x2c,0x5b,0x37,0x1f,0x84,0x0e,0x67,
+ 0x96,0x43,0x83,0x0c,0x46,0x44,0xab,0x7c,
+ 0x7b,0x65,0x9e,0x57,0xb5,0x90,0x4b,0x0e,
+ }
+ },
+ { 0x8301025, {
+ 0xe4,0x7d,0xdb,0x1e,0x14,0xb4,0x5e,0x36,
+ 0x8f,0x3e,0x48,0x88,0x3c,0x6d,0x76,0xa1,
+ 0x59,0xc6,0xc0,0x72,0x42,0xdf,0x6c,0x30,
+ 0x6f,0x0b,0x28,0x16,0x61,0xfc,0x79,0x77,
+ }
+ },
+ { 0x8301055, {
+ 0x81,0x7b,0x99,0x1b,0xae,0x2d,0x4f,0x9a,
+ 0xef,0x13,0xce,0xb5,0x10,0xaf,0x6a,0xea,
+ 0xe5,0xb0,0x64,0x98,0x10,0x68,0x34,0x3b,
+ 0x9d,0x7a,0xd6,0x22,0x77,0x5f,0xb3,0x5b,
+ }
+ },
+ { 0x8301072, {
+ 0xcf,0x76,0xa7,0x1a,0x49,0xdf,0x2a,0x5e,
+ 0x9e,0x40,0x70,0xe5,0xdd,0x8a,0xa8,0x28,
+ 0x20,0xdc,0x91,0xd8,0x2c,0xa6,0xa0,0xb1,
+ 0x2d,0x22,0x26,0x94,0x4b,0x40,0x85,0x30,
+ }
+ },
+ { 0x830107a, {
+ 0x2a,0x65,0x8c,0x1a,0x5e,0x07,0x21,0x72,
+ 0xdf,0x90,0xa6,0x51,0x37,0xd3,0x4b,0x34,
+ 0xc4,0xda,0x03,0xe1,0x8a,0x6c,0xfb,0x20,
+ 0x04,0xb2,0x81,0x05,0xd4,0x87,0xf4,0x0a,
+ }
+ },
+ { 0x830107b, {
+ 0xb3,0x43,0x13,0x63,0x56,0xc1,0x39,0xad,
+ 0x10,0xa6,0x2b,0xcc,0x02,0xe6,0x76,0x2a,
+ 0x1e,0x39,0x58,0x3e,0x23,0x6e,0xa4,0x04,
+ 0x95,0xea,0xf9,0x6d,0xc2,0x8a,0x13,0x19,
+ }
+ },
+ { 0x830107c, {
+ 0x21,0x64,0xde,0xfb,0x9f,0x68,0x96,0x47,
+ 0x70,0x5c,0xe2,0x8f,0x18,0x52,0x6a,0xac,
+ 0xa4,0xd2,0x2e,0xe0,0xde,0x68,0x66,0xc3,
+ 0xeb,0x1e,0xd3,0x3f,0xbc,0x51,0x1d,0x38,
+ }
+ },
+ { 0x860010d, {
+ 0x86,0xb6,0x15,0x83,0xbc,0x3b,0x9c,0xe0,
+ 0xb3,0xef,0x1d,0x99,0x84,0x35,0x15,0xf7,
+ 0x7c,0x2a,0xc6,0x42,0xdb,0x73,0x07,0x5c,
+ 0x7d,0xc3,0x02,0xb5,0x43,0x06,0x5e,0xf8,
+ }
+ },
+ { 0x8608108, {
+ 0x14,0xfe,0x57,0x86,0x49,0xc8,0x68,0xe2,
+ 0x11,0xa3,0xcb,0x6e,0xff,0x6e,0xd5,0x38,
+ 0xfe,0x89,0x1a,0xe0,0x67,0xbf,0xc4,0xcc,
+ 0x1b,0x9f,0x84,0x77,0x2b,0x9f,0xaa,0xbd,
+ }
+ },
+ { 0x8701034, {
+ 0xc3,0x14,0x09,0xa8,0x9c,0x3f,0x8d,0x83,
+ 0x9b,0x4c,0xa5,0xb7,0x64,0x8b,0x91,0x5d,
+ 0x85,0x6a,0x39,0x26,0x1e,0x14,0x41,0xa8,
+ 0x75,0xea,0xa6,0xf9,0xc9,0xd1,0xea,0x2b,
+ }
+ },
+ { 0x8a00008, {
+ 0xd7,0x2a,0x93,0xdc,0x05,0x2f,0xa5,0x6e,
+ 0x0c,0x61,0x2c,0x07,0x9f,0x38,0xe9,0x8e,
+ 0xef,0x7d,0x2a,0x05,0x4d,0x56,0xaf,0x72,
+ 0xe7,0x56,0x47,0x6e,0x60,0x27,0xd5,0x8c,
+ }
+ },
+ { 0x8a0000a, {
+ 0x73,0x31,0x26,0x22,0xd4,0xf9,0xee,0x3c,
+ 0x07,0x06,0xe7,0xb9,0xad,0xd8,0x72,0x44,
+ 0x33,0x31,0xaa,0x7d,0xc3,0x67,0x0e,0xdb,
+ 0x47,0xb5,0xaa,0xbc,0xf5,0xbb,0xd9,0x20,
+ }
+ },
+ { 0xa00104c, {
+ 0x3c,0x8a,0xfe,0x04,0x62,0xd8,0x6d,0xbe,
+ 0xa7,0x14,0x28,0x64,0x75,0xc0,0xa3,0x76,
+ 0xb7,0x92,0x0b,0x97,0x0a,0x8e,0x9c,0x5b,
+ 0x1b,0xc8,0x9d,0x3a,0x1e,0x81,0x3d,0x3b,
+ }
+ },
+ { 0xa00104e, {
+ 0xc4,0x35,0x82,0x67,0xd2,0x86,0xe5,0xb2,
+ 0xfd,0x69,0x12,0x38,0xc8,0x77,0xba,0xe0,
+ 0x70,0xf9,0x77,0x89,0x10,0xa6,0x74,0x4e,
+ 0x56,0x58,0x13,0xf5,0x84,0x70,0x28,0x0b,
+ }
+ },
+ { 0xa001053, {
+ 0x92,0x0e,0xf4,0x69,0x10,0x3b,0xf9,0x9d,
+ 0x31,0x1b,0xa6,0x99,0x08,0x7d,0xd7,0x25,
+ 0x7e,0x1e,0x89,0xba,0x35,0x8d,0xac,0xcb,
+ 0x3a,0xb4,0xdf,0x58,0x12,0xcf,0xc0,0xc3,
+ }
+ },
+ { 0xa001058, {
+ 0x33,0x7d,0xa9,0xb5,0x4e,0x62,0x13,0x36,
+ 0xef,0x66,0xc9,0xbd,0x0a,0xa6,0x3b,0x19,
+ 0xcb,0xf5,0xc2,0xc3,0x55,0x47,0x20,0xec,
+ 0x1f,0x7b,0xa1,0x44,0x0e,0x8e,0xa4,0xb2,
+ }
+ },
+ { 0xa001075, {
+ 0x39,0x02,0x82,0xd0,0x7c,0x26,0x43,0xe9,
+ 0x26,0xa3,0xd9,0x96,0xf7,0x30,0x13,0x0a,
+ 0x8a,0x0e,0xac,0xe7,0x1d,0xdc,0xe2,0x0f,
+ 0xcb,0x9e,0x8d,0xbc,0xd2,0xa2,0x44,0xe0,
+ }
+ },
+ { 0xa001078, {
+ 0x2d,0x67,0xc7,0x35,0xca,0xef,0x2f,0x25,
+ 0x4c,0x45,0x93,0x3f,0x36,0x01,0x8c,0xce,
+ 0xa8,0x5b,0x07,0xd3,0xc1,0x35,0x3c,0x04,
+ 0x20,0xa2,0xfc,0xdc,0xe6,0xce,0x26,0x3e,
+ }
+ },
+ { 0xa001079, {
+ 0x43,0xe2,0x05,0x9c,0xfd,0xb7,0x5b,0xeb,
+ 0x5b,0xe9,0xeb,0x3b,0x96,0xf4,0xe4,0x93,
+ 0x73,0x45,0x3e,0xac,0x8d,0x3b,0xe4,0xdb,
+ 0x10,0x31,0xc1,0xe4,0xa2,0xd0,0x5a,0x8a,
+ }
+ },
+ { 0xa00107a, {
+ 0x5f,0x92,0xca,0xff,0xc3,0x59,0x22,0x5f,
+ 0x02,0xa0,0x91,0x3b,0x4a,0x45,0x10,0xfd,
+ 0x19,0xe1,0x8a,0x6d,0x9a,0x92,0xc1,0x3f,
+ 0x75,0x78,0xac,0x78,0x03,0x1d,0xdb,0x18,
+ }
+ },
+ { 0xa001143, {
+ 0x56,0xca,0xf7,0x43,0x8a,0x4c,0x46,0x80,
+ 0xec,0xde,0xe5,0x9c,0x50,0x84,0x9a,0x42,
+ 0x27,0xe5,0x51,0x84,0x8f,0x19,0xc0,0x8d,
+ 0x0c,0x25,0xb4,0xb0,0x8f,0x10,0xf3,0xf8,
+ }
+ },
+ { 0xa001144, {
+ 0x42,0xd5,0x9b,0xa7,0xd6,0x15,0x29,0x41,
+ 0x61,0xc4,0x72,0x3f,0xf3,0x06,0x78,0x4b,
+ 0x65,0xf3,0x0e,0xfa,0x9c,0x87,0xde,0x25,
+ 0xbd,0xb3,0x9a,0xf4,0x75,0x13,0x53,0xdc,
+ }
+ },
+ { 0xa00115d, {
+ 0xd4,0xc4,0x49,0x36,0x89,0x0b,0x47,0xdd,
+ 0xfb,0x2f,0x88,0x3b,0x5f,0xf2,0x8e,0x75,
+ 0xc6,0x6c,0x37,0x5a,0x90,0x25,0x94,0x3e,
+ 0x36,0x9c,0xae,0x02,0x38,0x6c,0xf5,0x05,
+ }
+ },
+ { 0xa001173, {
+ 0x28,0xbb,0x9b,0xd1,0xa0,0xa0,0x7e,0x3a,
+ 0x59,0x20,0xc0,0xa9,0xb2,0x5c,0xc3,0x35,
+ 0x53,0x89,0xe1,0x4c,0x93,0x2f,0x1d,0xc3,
+ 0xe5,0xf7,0xf3,0xc8,0x9b,0x61,0xaa,0x9e,
+ }
+ },
+ { 0xa0011a8, {
+ 0x97,0xc6,0x16,0x65,0x99,0xa4,0x85,0x3b,
+ 0xf6,0xce,0xaa,0x49,0x4a,0x3a,0xc5,0xb6,
+ 0x78,0x25,0xbc,0x53,0xaf,0x5d,0xcf,0xf4,
+ 0x23,0x12,0xbb,0xb1,0xbc,0x8a,0x02,0x2e,
+ }
+ },
+ { 0xa0011ce, {
+ 0xcf,0x1c,0x90,0xa3,0x85,0x0a,0xbf,0x71,
+ 0x94,0x0e,0x80,0x86,0x85,0x4f,0xd7,0x86,
+ 0xae,0x38,0x23,0x28,0x2b,0x35,0x9b,0x4e,
+ 0xfe,0xb8,0xcd,0x3d,0x3d,0x39,0xc9,0x6a,
+ }
+ },
+ { 0xa0011d1, {
+ 0xdf,0x0e,0xca,0xde,0xf6,0xce,0x5c,0x1e,
+ 0x4c,0xec,0xd7,0x71,0x83,0xcc,0xa8,0x09,
+ 0xc7,0xc5,0xfe,0xb2,0xf7,0x05,0xd2,0xc5,
+ 0x12,0xdd,0xe4,0xf3,0x92,0x1c,0x3d,0xb8,
+ }
+ },
+ { 0xa0011d3, {
+ 0x91,0xe6,0x10,0xd7,0x57,0xb0,0x95,0x0b,
+ 0x9a,0x24,0xee,0xf7,0xcf,0x56,0xc1,0xa6,
+ 0x4a,0x52,0x7d,0x5f,0x9f,0xdf,0xf6,0x00,
+ 0x65,0xf7,0xea,0xe8,0x2a,0x88,0xe2,0x26,
+ }
+ },
+ { 0xa0011d5, {
+ 0xed,0x69,0x89,0xf4,0xeb,0x64,0xc2,0x13,
+ 0xe0,0x51,0x1f,0x03,0x26,0x52,0x7d,0xb7,
+ 0x93,0x5d,0x65,0xca,0xb8,0x12,0x1d,0x62,
+ 0x0d,0x5b,0x65,0x34,0x69,0xb2,0x62,0x21,
+ }
+ },
+ { 0xa0011d7, {
+ 0x35,0x07,0xcd,0x40,0x94,0xbc,0x81,0x6b,
+ 0xfc,0x61,0x56,0x1a,0xe2,0xdb,0x96,0x12,
+ 0x1c,0x1c,0x31,0xb1,0x02,0x6f,0xe5,0xd2,
+ 0xfe,0x1b,0x04,0x03,0x2c,0x8f,0x4c,0x36,
+ }
+ },
+ { 0xa001223, {
+ 0xfb,0x32,0x5f,0xc6,0x83,0x4f,0x8c,0xb8,
+ 0xa4,0x05,0xf9,0x71,0x53,0x01,0x16,0xc4,
+ 0x83,0x75,0x94,0xdd,0xeb,0x7e,0xb7,0x15,
+ 0x8e,0x3b,0x50,0x29,0x8a,0x9c,0xcc,0x45,
+ }
+ },
+ { 0xa001224, {
+ 0x0e,0x0c,0xdf,0xb4,0x89,0xee,0x35,0x25,
+ 0xdd,0x9e,0xdb,0xc0,0x69,0x83,0x0a,0xad,
+ 0x26,0xa9,0xaa,0x9d,0xfc,0x3c,0xea,0xf9,
+ 0x6c,0xdc,0xd5,0x6d,0x8b,0x6e,0x85,0x4a,
+ }
+ },
+ { 0xa001227, {
+ 0xab,0xc6,0x00,0x69,0x4b,0x50,0x87,0xad,
+ 0x5f,0x0e,0x8b,0xea,0x57,0x38,0xce,0x1d,
+ 0x0f,0x75,0x26,0x02,0xf6,0xd6,0x96,0xe9,
+ 0x87,0xb9,0xd6,0x20,0x27,0x7c,0xd2,0xe0,
+ }
+ },
+ { 0xa001229, {
+ 0x7f,0x49,0x49,0x48,0x46,0xa5,0x50,0xa6,
+ 0x28,0x89,0x98,0xe2,0x9e,0xb4,0x7f,0x75,
+ 0x33,0xa7,0x04,0x02,0xe4,0x82,0xbf,0xb4,
+ 0xa5,0x3a,0xba,0x24,0x8d,0x31,0x10,0x1d,
+ }
+ },
+ { 0xa00122e, {
+ 0x56,0x94,0xa9,0x5d,0x06,0x68,0xfe,0xaf,
+ 0xdf,0x7a,0xff,0x2d,0xdf,0x74,0x0f,0x15,
+ 0x66,0xfb,0x00,0xb5,0x51,0x97,0x9b,0xfa,
+ 0xcb,0x79,0x85,0x46,0x25,0xb4,0xd2,0x10,
+ }
+ },
+ { 0xa001231, {
+ 0x0b,0x46,0xa5,0xfc,0x18,0x15,0xa0,0x9e,
+ 0xa6,0xdc,0xb7,0xff,0x17,0xf7,0x30,0x64,
+ 0xd4,0xda,0x9e,0x1b,0xc3,0xfc,0x02,0x3b,
+ 0xe2,0xc6,0x0e,0x41,0x54,0xb5,0x18,0xdd,
+ }
+ },
+ { 0xa001234, {
+ 0x88,0x8d,0xed,0xab,0xb5,0xbd,0x4e,0xf7,
+ 0x7f,0xd4,0x0e,0x95,0x34,0x91,0xff,0xcc,
+ 0xfb,0x2a,0xcd,0xf7,0xd5,0xdb,0x4c,0x9b,
+ 0xd6,0x2e,0x73,0x50,0x8f,0x83,0x79,0x1a,
+ }
+ },
+ { 0xa001236, {
+ 0x3d,0x30,0x00,0xb9,0x71,0xba,0x87,0x78,
+ 0xa8,0x43,0x55,0xc4,0x26,0x59,0xcf,0x9d,
+ 0x93,0xce,0x64,0x0e,0x8b,0x72,0x11,0x8b,
+ 0xa3,0x8f,0x51,0xe9,0xca,0x98,0xaa,0x25,
+ }
+ },
+ { 0xa001238, {
+ 0x72,0xf7,0x4b,0x0c,0x7d,0x58,0x65,0xcc,
+ 0x00,0xcc,0x57,0x16,0x68,0x16,0xf8,0x2a,
+ 0x1b,0xb3,0x8b,0xe1,0xb6,0x83,0x8c,0x7e,
+ 0xc0,0xcd,0x33,0xf2,0x8d,0xf9,0xef,0x59,
+ }
+ },
+ { 0xa00123b, {
+ 0xef,0xa1,0x1e,0x71,0xf1,0xc3,0x2c,0xe2,
+ 0xc3,0xef,0x69,0x41,0x7a,0x54,0xca,0xc3,
+ 0x8f,0x62,0x84,0xee,0xc2,0x39,0xd9,0x28,
+ 0x95,0xa7,0x12,0x49,0x1e,0x30,0x71,0x72,
+ }
+ },
+ { 0xa00820c, {
+ 0xa8,0x0c,0x81,0xc0,0xa6,0x00,0xe7,0xf3,
+ 0x5f,0x65,0xd3,0xb9,0x6f,0xea,0x93,0x63,
+ 0xf1,0x8c,0x88,0x45,0xd7,0x82,0x80,0xd1,
+ 0xe1,0x3b,0x8d,0xb2,0xf8,0x22,0x03,0xe2,
+ }
+ },
+ { 0xa00820d, {
+ 0xf9,0x2a,0xc0,0xf4,0x9e,0xa4,0x87,0xa4,
+ 0x7d,0x87,0x00,0xfd,0xab,0xda,0x19,0xca,
+ 0x26,0x51,0x32,0xc1,0x57,0x91,0xdf,0xc1,
+ 0x05,0xeb,0x01,0x7c,0x5a,0x95,0x21,0xb7,
+ }
+ },
+ { 0xa10113e, {
+ 0x05,0x3c,0x66,0xd7,0xa9,0x5a,0x33,0x10,
+ 0x1b,0xf8,0x9c,0x8f,0xed,0xfc,0xa7,0xa0,
+ 0x15,0xe3,0x3f,0x4b,0x1d,0x0d,0x0a,0xd5,
+ 0xfa,0x90,0xc4,0xed,0x9d,0x90,0xaf,0x53,
+ }
+ },
+ { 0xa101144, {
+ 0xb3,0x0b,0x26,0x9a,0xf8,0x7c,0x02,0x26,
+ 0x35,0x84,0x53,0xa4,0xd3,0x2c,0x7c,0x09,
+ 0x68,0x7b,0x96,0xb6,0x93,0xef,0xde,0xbc,
+ 0xfd,0x4b,0x15,0xd2,0x81,0xd3,0x51,0x47,
+ }
+ },
+ { 0xa101148, {
+ 0x20,0xd5,0x6f,0x40,0x4a,0xf6,0x48,0x90,
+ 0xc2,0x93,0x9a,0xc2,0xfd,0xac,0xef,0x4f,
+ 0xfa,0xc0,0x3d,0x92,0x3c,0x6d,0x01,0x08,
+ 0xf1,0x5e,0xb0,0xde,0xb4,0x98,0xae,0xc4,
+ }
+ },
+ { 0xa10114c, {
+ 0x9e,0xb6,0xa2,0xd9,0x87,0x38,0xc5,0x64,
+ 0xd8,0x88,0xfa,0x78,0x98,0xf9,0x6f,0x74,
+ 0x39,0x90,0x1b,0xa5,0xcf,0x5e,0xb4,0x2a,
+ 0x02,0xff,0xd4,0x8c,0x71,0x8b,0xe2,0xc0,
+ }
+ },
+ { 0xa10123e, {
+ 0x03,0xb9,0x2c,0x76,0x48,0x93,0xc9,0x18,
+ 0xfb,0x56,0xfd,0xf7,0xe2,0x1d,0xca,0x4d,
+ 0x1d,0x13,0x53,0x63,0xfe,0x42,0x6f,0xfc,
+ 0x19,0x0f,0xf1,0xfc,0xa7,0xdd,0x89,0x1b,
+ }
+ },
+ { 0xa101244, {
+ 0x71,0x56,0xb5,0x9f,0x21,0xbf,0xb3,0x3c,
+ 0x8c,0xd7,0x36,0xd0,0x34,0x52,0x1b,0xb1,
+ 0x46,0x2f,0x04,0xf0,0x37,0xd8,0x1e,0x72,
+ 0x24,0xa2,0x80,0x84,0x83,0x65,0x84,0xc0,
+ }
+ },
+ { 0xa101248, {
+ 0xed,0x3b,0x95,0xa6,0x68,0xa7,0x77,0x3e,
+ 0xfc,0x17,0x26,0xe2,0x7b,0xd5,0x56,0x22,
+ 0x2c,0x1d,0xef,0xeb,0x56,0xdd,0xba,0x6e,
+ 0x1b,0x7d,0x64,0x9d,0x4b,0x53,0x13,0x75,
+ }
+ },
+ { 0xa10124c, {
+ 0x29,0xea,0xf1,0x2c,0xb2,0xe4,0xef,0x90,
+ 0xa4,0xcd,0x1d,0x86,0x97,0x17,0x61,0x46,
+ 0xfc,0x22,0xcb,0x57,0x75,0x19,0xc8,0xcc,
+ 0x0c,0xf5,0xbc,0xac,0x81,0x9d,0x9a,0xd2,
+ }
+ },
+ { 0xa108108, {
+ 0xed,0xc2,0xec,0xa1,0x15,0xc6,0x65,0xe9,
+ 0xd0,0xef,0x39,0xaa,0x7f,0x55,0x06,0xc6,
+ 0xf5,0xd4,0x3f,0x7b,0x14,0xd5,0x60,0x2c,
+ 0x28,0x1e,0x9c,0x59,0x69,0x99,0x4d,0x16,
+ }
+ },
+ { 0xa108109, {
+ 0x85,0xb4,0xbd,0x7c,0x49,0xa7,0xbd,0xfa,
+ 0x49,0x36,0x80,0x81,0xc5,0xb7,0x39,0x1b,
+ 0x9a,0xaa,0x50,0xde,0x9b,0xe9,0x32,0x35,
+ 0x42,0x7e,0x51,0x4f,0x52,0x2c,0x28,0x59,
+ }
+ },
+ { 0xa20102d, {
+ 0xf9,0x6e,0xf2,0x32,0xd3,0x0f,0x5f,0x11,
+ 0x59,0xa1,0xfe,0xcc,0xcd,0x9b,0x42,0x89,
+ 0x8b,0x89,0x2f,0xb5,0xbb,0x82,0xef,0x23,
+ 0x8c,0xe9,0x19,0x3e,0xcc,0x3f,0x7b,0xb4,
+ }
+ },
+ { 0xa20102e, {
+ 0xbe,0x1f,0x32,0x04,0x0d,0x3c,0x9c,0xdd,
+ 0xe1,0xa4,0xbf,0x76,0x3a,0xec,0xc2,0xf6,
+ 0x11,0x00,0xa7,0xaf,0x0f,0xe5,0x02,0xc5,
+ 0x54,0x3a,0x1f,0x8c,0x16,0xb5,0xff,0xbe,
+ }
+ },
+ { 0xa201210, {
+ 0xe8,0x6d,0x51,0x6a,0x8e,0x72,0xf3,0xfe,
+ 0x6e,0x16,0xbc,0x62,0x59,0x40,0x17,0xe9,
+ 0x6d,0x3d,0x0e,0x6b,0xa7,0xac,0xe3,0x68,
+ 0xf7,0x55,0xf0,0x13,0xbb,0x22,0xf6,0x41,
+ }
+ },
+ { 0xa201211, {
+ 0x69,0xa1,0x17,0xec,0xd0,0xf6,0x6c,0x95,
+ 0xe2,0x1e,0xc5,0x59,0x1a,0x52,0x0a,0x27,
+ 0xc4,0xed,0xd5,0x59,0x1f,0xbf,0x00,0xff,
+ 0x08,0x88,0xb5,0xe1,0x12,0xb6,0xcc,0x27,
+ }
+ },
+ { 0xa404107, {
+ 0xbb,0x04,0x4e,0x47,0xdd,0x5e,0x26,0x45,
+ 0x1a,0xc9,0x56,0x24,0xa4,0x4c,0x82,0xb0,
+ 0x8b,0x0d,0x9f,0xf9,0x3a,0xdf,0xc6,0x81,
+ 0x13,0xbc,0xc5,0x25,0xe4,0xc5,0xc3,0x99,
+ }
+ },
+ { 0xa404108, {
+ 0x69,0x67,0x43,0x06,0xf8,0x0c,0x62,0xdc,
+ 0xa4,0x21,0x30,0x4f,0x0f,0x21,0x2c,0xcb,
+ 0xcc,0x37,0xf1,0x1c,0xc3,0xf8,0x2f,0x19,
+ 0xdf,0x53,0x53,0x46,0xb1,0x15,0xea,0x00,
+ }
+ },
+ { 0xa500011, {
+ 0x23,0x3d,0x70,0x7d,0x03,0xc3,0xc4,0xf4,
+ 0x2b,0x82,0xc6,0x05,0xda,0x80,0x0a,0xf1,
+ 0xd7,0x5b,0x65,0x3a,0x7d,0xab,0xdf,0xa2,
+ 0x11,0x5e,0x96,0x7e,0x71,0xe9,0xfc,0x74,
+ }
+ },
+ { 0xa500012, {
+ 0xeb,0x74,0x0d,0x47,0xa1,0x8e,0x09,0xe4,
+ 0x93,0x4c,0xad,0x03,0x32,0x4c,0x38,0x16,
+ 0x10,0x39,0xdd,0x06,0xaa,0xce,0xd6,0x0f,
+ 0x62,0x83,0x9d,0x8e,0x64,0x55,0xbe,0x63,
+ }
+ },
+ { 0xa601209, {
+ 0x66,0x48,0xd4,0x09,0x05,0xcb,0x29,0x32,
+ 0x66,0xb7,0x9a,0x76,0xcd,0x11,0xf3,0x30,
+ 0x15,0x86,0xcc,0x5d,0x97,0x0f,0xc0,0x46,
+ 0xe8,0x73,0xe2,0xd6,0xdb,0xd2,0x77,0x1d,
+ }
+ },
+ { 0xa60120a, {
+ 0x0c,0x8b,0x3d,0xfd,0x52,0x52,0x85,0x7d,
+ 0x20,0x3a,0xe1,0x7e,0xa4,0x21,0x3b,0x7b,
+ 0x17,0x86,0xae,0xac,0x13,0xb8,0x63,0x9d,
+ 0x06,0x01,0xd0,0xa0,0x51,0x9a,0x91,0x2c,
+ }
+ },
+ { 0xa704107, {
+ 0xf3,0xc6,0x58,0x26,0xee,0xac,0x3f,0xd6,
+ 0xce,0xa1,0x72,0x47,0x3b,0xba,0x2b,0x93,
+ 0x2a,0xad,0x8e,0x6b,0xea,0x9b,0xb7,0xc2,
+ 0x64,0x39,0x71,0x8c,0xce,0xe7,0x41,0x39,
+ }
+ },
+ { 0xa704108, {
+ 0xd7,0x55,0x15,0x2b,0xfe,0xc4,0xbc,0x93,
+ 0xec,0x91,0xa0,0xae,0x45,0xb7,0xc3,0x98,
+ 0x4e,0xff,0x61,0x77,0x88,0xc2,0x70,0x49,
+ 0xe0,0x3a,0x1d,0x84,0x38,0x52,0xbf,0x5a,
+ }
+ },
+ { 0xa705206, {
+ 0x8d,0xc0,0x76,0xbd,0x58,0x9f,0x8f,0xa4,
+ 0x12,0x9d,0x21,0xfb,0x48,0x21,0xbc,0xe7,
+ 0x67,0x6f,0x04,0x18,0xae,0x20,0x87,0x4b,
+ 0x03,0x35,0xe9,0xbe,0xfb,0x06,0xdf,0xfc,
+ }
+ },
+ { 0xa705208, {
+ 0x30,0x1d,0x55,0x24,0xbc,0x6b,0x5a,0x19,
+ 0x0c,0x7d,0x1d,0x74,0xaa,0xd1,0xeb,0xd2,
+ 0x16,0x62,0xf7,0x5b,0xe1,0x1f,0x18,0x11,
+ 0x5c,0xf0,0x94,0x90,0x26,0xec,0x69,0xff,
+ }
+ },
+ { 0xa708007, {
+ 0x6b,0x76,0xcc,0x78,0xc5,0x8a,0xa3,0xe3,
+ 0x32,0x2d,0x79,0xe4,0xc3,0x80,0xdb,0xb2,
+ 0x07,0xaa,0x3a,0xe0,0x57,0x13,0x72,0x80,
+ 0xdf,0x92,0x73,0x84,0x87,0x3c,0x73,0x93,
+ }
+ },
+ { 0xa708008, {
+ 0x08,0x6e,0xf0,0x22,0x4b,0x8e,0xc4,0x46,
+ 0x58,0x34,0xe6,0x47,0xa2,0x28,0xfd,0xab,
+ 0x22,0x3d,0xdd,0xd8,0x52,0x9e,0x1d,0x16,
+ 0xfa,0x01,0x68,0x14,0x79,0x3e,0xe8,0x6b,
+ }
+ },
+ { 0xa70c005, {
+ 0x88,0x5d,0xfb,0x79,0x64,0xd8,0x46,0x3b,
+ 0x4a,0x83,0x8e,0x77,0x7e,0xcf,0xb3,0x0f,
+ 0x1f,0x1f,0xf1,0x97,0xeb,0xfe,0x56,0x55,
+ 0xee,0x49,0xac,0xe1,0x8b,0x13,0xc5,0x13,
+ }
+ },
+ { 0xa70c008, {
+ 0x0f,0xdb,0x37,0xa1,0x10,0xaf,0xd4,0x21,
+ 0x94,0x0d,0xa4,0xa2,0xe9,0x86,0x6c,0x0e,
+ 0x85,0x7c,0x36,0x30,0xa3,0x3a,0x78,0x66,
+ 0x18,0x10,0x60,0x0d,0x78,0x3d,0x44,0xd0,
+ }
+ },
+ { 0xaa00116, {
+ 0xe8,0x4c,0x2c,0x88,0xa1,0xac,0x24,0x63,
+ 0x65,0xe5,0xaa,0x2d,0x16,0xa9,0xc3,0xf5,
+ 0xfe,0x1d,0x5e,0x65,0xc7,0xaa,0x92,0x4d,
+ 0x91,0xee,0x76,0xbb,0x4c,0x66,0x78,0xc9,
+ }
+ },
+ { 0xaa00212, {
+ 0xbd,0x57,0x5d,0x0a,0x0a,0x30,0xc1,0x75,
+ 0x95,0x58,0x5e,0x93,0x02,0x28,0x43,0x71,
+ 0xed,0x42,0x29,0xc8,0xec,0x34,0x2b,0xb2,
+ 0x1a,0x65,0x4b,0xfe,0x07,0x0f,0x34,0xa1,
+ }
+ },
+ { 0xaa00213, {
+ 0xed,0x58,0xb7,0x76,0x81,0x7f,0xd9,0x3a,
+ 0x1a,0xff,0x8b,0x34,0xb8,0x4a,0x99,0x0f,
+ 0x28,0x49,0x6c,0x56,0x2b,0xdc,0xb7,0xed,
+ 0x96,0xd5,0x9d,0xc1,0x7a,0xd4,0x51,0x9b,
+ }
+ },
+ { 0xaa00215, {
+ 0x55,0xd3,0x28,0xcb,0x87,0xa9,0x32,0xe9,
+ 0x4e,0x85,0x4b,0x7c,0x6b,0xd5,0x7c,0xd4,
+ 0x1b,0x51,0x71,0x3a,0x0e,0x0b,0xdc,0x9b,
+ 0x68,0x2f,0x46,0xee,0xfe,0xc6,0x6d,0xef,
+ }
+ },
+ { 0xaa00216, {
+ 0x79,0xfb,0x5b,0x9f,0xb6,0xe6,0xa8,0xf5,
+ 0x4e,0x7c,0x4f,0x8e,0x1d,0xad,0xd0,0x08,
+ 0xc2,0x43,0x7c,0x8b,0xe6,0xdb,0xd0,0xd2,
+ 0xe8,0x39,0x26,0xc1,0xe5,0x5a,0x48,0xf1,
+ }
+ },
+};
diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
new file mode 100644
index 000000000000..68049f171860
--- /dev/null
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -0,0 +1,926 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * CPU Microcode Update Driver for Linux
+ *
+ * Copyright (C) 2000-2006 Tigran Aivazian <aivazian.tigran@gmail.com>
+ * 2006 Shaohua Li <shaohua.li@intel.com>
+ * 2013-2016 Borislav Petkov <bp@alien8.de>
+ *
+ * X86 CPU microcode early update for Linux:
+ *
+ * Copyright (C) 2012 Fenghua Yu <fenghua.yu@intel.com>
+ * H Peter Anvin" <hpa@zytor.com>
+ * (C) 2015 Borislav Petkov <bp@alien8.de>
+ *
+ * This driver allows updating the microcode on x86 processors.
+ */
+
+#define pr_fmt(fmt) "microcode: " fmt
+
+#include <linux/stop_machine.h>
+#include <linux/device/faux.h>
+#include <linux/syscore_ops.h>
+#include <linux/miscdevice.h>
+#include <linux/capability.h>
+#include <linux/firmware.h>
+#include <linux/cpumask.h>
+#include <linux/kernel.h>
+#include <linux/delay.h>
+#include <linux/mutex.h>
+#include <linux/cpu.h>
+#include <linux/nmi.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+#include <asm/apic.h>
+#include <asm/cpu_device_id.h>
+#include <asm/perf_event.h>
+#include <asm/processor.h>
+#include <asm/cmdline.h>
+#include <asm/msr.h>
+#include <asm/setup.h>
+
+#include "internal.h"
+
+static struct microcode_ops *microcode_ops;
+static bool dis_ucode_ldr;
+
+bool force_minrev = IS_ENABLED(CONFIG_MICROCODE_LATE_FORCE_MINREV);
+
+/*
+ * The variables below should be behind CONFIG_MICROCODE_DBG ifdeffery
+ * but, in order to not uglify the code and to use IS_ENABLED() instead,
+ * they are left in unconditionally. When microcode debugging is not
+ * enabled, they are meaningless anyway.
+ */
+/* base microcode revision for debugging */
+u32 base_rev;
+u32 microcode_rev[NR_CPUS] = {};
+
+/*
+ * Synchronization.
+ *
+ * All non-CPU-hotplug-callback call sites use:
+ *
+ * - cpus_read_lock/unlock() to synchronize with
+ * the CPU-hotplug-callback call sites.
+ *
+ * We guarantee that only a single CPU is being
+ * updated at any given time.
+ */
+struct ucode_cpu_info ucode_cpu_info[NR_CPUS];
+
+/*
+ * Those patch levels cannot be updated to newer ones and thus should be final.
+ */
+static u32 final_levels[] = {
+ 0x01000098,
+ 0x0100009f,
+ 0x010000af,
+ 0, /* T-101 terminator */
+};
+
+struct early_load_data early_data;
+
+/*
+ * Check the current patch level on this CPU.
+ *
+ * Returns:
+ * - true: if update should stop
+ * - false: otherwise
+ */
+static bool amd_check_current_patch_level(void)
+{
+ u32 lvl, dummy, i;
+ u32 *levels;
+
+ if (x86_cpuid_vendor() != X86_VENDOR_AMD)
+ return false;
+
+ native_rdmsr(MSR_AMD64_PATCH_LEVEL, lvl, dummy);
+
+ levels = final_levels;
+
+ for (i = 0; levels[i]; i++) {
+ if (lvl == levels[i])
+ return true;
+ }
+ return false;
+}
+
+bool __init microcode_loader_disabled(void)
+{
+ if (dis_ucode_ldr)
+ return true;
+
+ /*
+ * Disable when:
+ *
+ * 1) The CPU does not support CPUID.
+ *
+ * 2) Bit 31 in CPUID[1]:ECX is set, i.e. the kernel runs under a
+ * hypervisor; the bit is reserved for hypervisor use. This is still
+ * not completely accurate as XEN PV guests don't see that CPUID bit
+ * set, but that's good enough as they don't land on the BSP
+ * path anyway.
+ *
+ * 3) Certain AMD patch levels are not allowed to be
+ * overwritten.
+ */
+ if (!cpuid_feature() ||
+ ((native_cpuid_ecx(1) & BIT(31)) &&
+ !IS_ENABLED(CONFIG_MICROCODE_DBG)) ||
+ amd_check_current_patch_level())
+ dis_ucode_ldr = true;
+
+ return dis_ucode_ldr;
+}
+
+static void __init early_parse_cmdline(void)
+{
+ char cmd_buf[64] = {};
+ char *s, *p = cmd_buf;
+
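+ /* "microcode=" takes a comma-separated list of options. */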
+ if (cmdline_find_option(boot_command_line, "microcode", cmd_buf, sizeof(cmd_buf)) > 0) {
+ while ((s = strsep(&p, ","))) {
+ if (IS_ENABLED(CONFIG_MICROCODE_DBG)) {
+ if (strstr(s, "base_rev=")) {
+ /* advance to the option arg */
+ strsep(&s, "=");
+ if (kstrtouint(s, 16, &base_rev)) { ; }
+ }
+ }
+
+ if (!strcmp("force_minrev", s))
+ force_minrev = true;
+
+ if (!strcmp(s, "dis_ucode_ldr"))
+ dis_ucode_ldr = true;
+ }
+ }
+
+ /* old, compat option */
+ if (cmdline_find_option_bool(boot_command_line, "dis_ucode_ldr") > 0)
+ dis_ucode_ldr = true;
+}
+
+void __init load_ucode_bsp(void)
+{
+ unsigned int cpuid_1_eax;
+ bool intel = true;
+
+ early_parse_cmdline();
+
+ if (microcode_loader_disabled())
+ return;
+
+ cpuid_1_eax = native_cpuid_eax(1);
+
+ switch (x86_cpuid_vendor()) {
+ case X86_VENDOR_INTEL:
+ if (x86_family(cpuid_1_eax) < 6)
+ return;
+ break;
+
+ case X86_VENDOR_AMD:
+ if (x86_family(cpuid_1_eax) < 0x10)
+ return;
+ intel = false;
+ break;
+
+ default:
+ return;
+ }
+
+ if (intel)
+ load_ucode_intel_bsp(&early_data);
+ else
+ load_ucode_amd_bsp(&early_data, cpuid_1_eax);
+}
+
+void load_ucode_ap(void)
+{
+ unsigned int cpuid_1_eax;
+
+ /*
+ * Can't use microcode_loader_disabled() here - .init section
+ * hell. It doesn't need to be used either - the BSP variant must've
+ * parsed the cmdline already anyway.
+ */
+ if (dis_ucode_ldr)
+ return;
+
+ cpuid_1_eax = native_cpuid_eax(1);
+
+ switch (x86_cpuid_vendor()) {
+ case X86_VENDOR_INTEL:
+ if (x86_family(cpuid_1_eax) >= 6)
+ load_ucode_intel_ap();
+ break;
+ case X86_VENDOR_AMD:
+ if (x86_family(cpuid_1_eax) >= 0x10)
+ load_ucode_amd_ap(cpuid_1_eax);
+ break;
+ default:
+ break;
+ }
+}
+
+struct cpio_data __init find_microcode_in_initrd(const char *path)
+{
+#ifdef CONFIG_BLK_DEV_INITRD
+ unsigned long start = 0;
+ size_t size;
+
+#ifdef CONFIG_X86_32
+ size = boot_params.hdr.ramdisk_size;
+ /* Early load on BSP has a temporary mapping. */
+ if (size)
+ start = initrd_start_early;
+
+#else /* CONFIG_X86_64 */
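+ /* Compose the 64-bit ramdisk size and address from the split boot_params fields. */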
+ size = (unsigned long)boot_params.ext_ramdisk_size << 32;
+ size |= boot_params.hdr.ramdisk_size;
+
+ if (size) {
+ start = (unsigned long)boot_params.ext_ramdisk_image << 32;
+ start |= boot_params.hdr.ramdisk_image;
+ start += PAGE_OFFSET;
+ }
+#endif
+
+ /*
+ * Fixup the start address: after reserve_initrd() runs, initrd_start
+ * has the virtual address of the beginning of the initrd. reserve_initrd()
+ * also possibly relocates the ramdisk. In either case, initrd_start
+ * contains the updated address so use that instead.
+ */
+ if (initrd_start)
+ start = initrd_start;
+
+ return find_cpio_data(path, (void *)start, size, NULL);
+#else /* !CONFIG_BLK_DEV_INITRD */
+ return (struct cpio_data){ NULL, 0, "" };
+#endif
+}
+
+static void reload_early_microcode(unsigned int cpu)
+{
+ int vendor, family;
+
+ vendor = x86_cpuid_vendor();
+ family = x86_cpuid_family();
+
+ switch (vendor) {
+ case X86_VENDOR_INTEL:
+ if (family >= 6)
+ reload_ucode_intel();
+ break;
+ case X86_VENDOR_AMD:
+ if (family >= 0x10)
+ reload_ucode_amd(cpu);
+ break;
+ default:
+ break;
+ }
+}
+
+/* fake device for request_firmware */
+static struct faux_device *microcode_fdev;
+
+#ifdef CONFIG_MICROCODE_LATE_LOADING
+/*
+ * Late loading dance. Why the heavy-handed stomp_machine effort?
+ *
+ * - HT siblings must be idle and not execute other code while the other sibling
+ * is loading microcode in order to avoid any negative interactions caused by
+ * the loading.
+ *
+ * - In addition, microcode update on the cores must be serialized until this
+ * requirement can be relaxed in the future. Right now, this is conservative
+ * and good.
+ */
+enum sibling_ctrl {
+ /* Spinwait with timeout */
+ SCTRL_WAIT,
+ /* Invoke the apply_microcode() callback */
+ SCTRL_APPLY,
+ /* Proceed without invoking the apply_microcode() callback */
+ SCTRL_DONE,
+};
+
+struct microcode_ctrl {
+ enum sibling_ctrl ctrl;
+ enum ucode_state result;
+ unsigned int ctrl_cpu;
+ bool nmi_enabled;
+};
+
+DEFINE_STATIC_KEY_FALSE(microcode_nmi_handler_enable);
+static DEFINE_PER_CPU(struct microcode_ctrl, ucode_ctrl);
+static atomic_t late_cpus_in, offline_in_nmi;
+static unsigned int loops_per_usec;
+static cpumask_t cpu_offline_mask;
+
+static noinstr bool wait_for_cpus(atomic_t *cnt)
+{
+ unsigned int timeout, loops;
+
+ WARN_ON_ONCE(raw_atomic_dec_return(cnt) < 0);
+
+ for (timeout = 0; timeout < USEC_PER_SEC; timeout++) {
+ if (!raw_atomic_read(cnt))
+ return true;
+
+ for (loops = 0; loops < loops_per_usec; loops++)
+ cpu_relax();
+
+ /* If invoked directly, tickle the NMI watchdog */
+ if (!microcode_ops->use_nmi && !(timeout % USEC_PER_MSEC)) {
+ instrumentation_begin();
+ touch_nmi_watchdog();
+ instrumentation_end();
+ }
+ }
+ /* Prevent latecomers from making progress and let them time out */
+ raw_atomic_inc(cnt);
+ return false;
+}
+
+static noinstr bool wait_for_ctrl(void)
+{
+ unsigned int timeout, loops;
+
+ for (timeout = 0; timeout < USEC_PER_SEC; timeout++) {
+ if (raw_cpu_read(ucode_ctrl.ctrl) != SCTRL_WAIT)
+ return true;
+
+ for (loops = 0; loops < loops_per_usec; loops++)
+ cpu_relax();
+
+ /* If invoked directly, tickle the NMI watchdog */
+ if (!microcode_ops->use_nmi && !(timeout % USEC_PER_MSEC)) {
+ instrumentation_begin();
+ touch_nmi_watchdog();
+ instrumentation_end();
+ }
+ }
+ return false;
+}
+
+/*
+ * Protected against instrumentation up to the point where the primary
+ * thread completed the update. See microcode_nmi_handler() for details.
+ */
+static noinstr bool load_secondary_wait(unsigned int ctrl_cpu)
+{
+ /* Initial rendezvous to ensure that all CPUs have arrived */
+ if (!wait_for_cpus(&late_cpus_in)) {
+ raw_cpu_write(ucode_ctrl.result, UCODE_TIMEOUT);
+ return false;
+ }
+
+ /*
+ * Wait for primary threads to complete. If one of them hangs due
+ * to the update, there is no way out. This is non-recoverable
+ * because the CPU might hold locks or resources and confuse the
+ * scheduler, watchdogs etc. There is no way to safely evacuate the
+ * machine.
+ */
+ if (wait_for_ctrl())
+ return true;
+
+ instrumentation_begin();
+ panic("Microcode load: Primary CPU %d timed out\n", ctrl_cpu);
+ instrumentation_end();
+}
+
+/*
+ * Protected against instrumentation up to the point where the primary
+ * thread completed the update. See microcode_nmi_handler() for details.
+ */
+static noinstr void load_secondary(unsigned int cpu)
+{
+ unsigned int ctrl_cpu = raw_cpu_read(ucode_ctrl.ctrl_cpu);
+ enum ucode_state ret;
+
+ if (!load_secondary_wait(ctrl_cpu)) {
+ instrumentation_begin();
+ pr_err_once("load: %d CPUs timed out\n",
+ atomic_read(&late_cpus_in) - 1);
+ instrumentation_end();
+ return;
+ }
+
+ /* Primary thread completed. Allow to invoke instrumentable code */
+ instrumentation_begin();
+ /*
+ * If the primary succeeded then invoke the apply() callback,
+ * otherwise copy the state from the primary thread.
+ */
+ if (this_cpu_read(ucode_ctrl.ctrl) == SCTRL_APPLY)
+ ret = microcode_ops->apply_microcode(cpu);
+ else
+ ret = per_cpu(ucode_ctrl.result, ctrl_cpu);
+
+ this_cpu_write(ucode_ctrl.result, ret);
+ this_cpu_write(ucode_ctrl.ctrl, SCTRL_DONE);
+ instrumentation_end();
+}
+
+static void __load_primary(unsigned int cpu)
+{
+ struct cpumask *secondaries = topology_sibling_cpumask(cpu);
+ enum sibling_ctrl ctrl;
+ enum ucode_state ret;
+ unsigned int sibling;
+
+ /* Initial rendezvous to ensure that all CPUs have arrived */
+ if (!wait_for_cpus(&late_cpus_in)) {
+ this_cpu_write(ucode_ctrl.result, UCODE_TIMEOUT);
+ pr_err_once("load: %d CPUs timed out\n", atomic_read(&late_cpus_in) - 1);
+ return;
+ }
+
+ ret = microcode_ops->apply_microcode(cpu);
+ this_cpu_write(ucode_ctrl.result, ret);
+ this_cpu_write(ucode_ctrl.ctrl, SCTRL_DONE);
+
+ /*
+ * If the update was successful, let the siblings run the apply()
+ * callback. If not, tell them it's done. This also covers the
+ * case where the CPU has uniform loading at package or system
+ * scope implemented but does not advertise it.
+ */
+ if (ret == UCODE_UPDATED || ret == UCODE_OK)
+ ctrl = SCTRL_APPLY;
+ else
+ ctrl = SCTRL_DONE;
+
+ for_each_cpu(sibling, secondaries) {
+ if (sibling != cpu)
+ per_cpu(ucode_ctrl.ctrl, sibling) = ctrl;
+ }
+}
+
+static bool kick_offline_cpus(unsigned int nr_offl)
+{
+ unsigned int cpu, timeout;
+
+ for_each_cpu(cpu, &cpu_offline_mask) {
+ /* Enable the rendezvous handler and send NMI */
+ per_cpu(ucode_ctrl.nmi_enabled, cpu) = true;
+ apic_send_nmi_to_offline_cpu(cpu);
+ }
+
+ /* Wait for them to arrive */
+ for (timeout = 0; timeout < (USEC_PER_SEC / 2); timeout++) {
+ if (atomic_read(&offline_in_nmi) == nr_offl)
+ return true;
+ udelay(1);
+ }
+ /* Let the others time out */
+ return false;
+}
+
+static void release_offline_cpus(void)
+{
+ unsigned int cpu;
+
+ for_each_cpu(cpu, &cpu_offline_mask)
+ per_cpu(ucode_ctrl.ctrl, cpu) = SCTRL_DONE;
+}
+
+static void load_primary(unsigned int cpu)
+{
+ unsigned int nr_offl = cpumask_weight(&cpu_offline_mask);
+ bool proceed = true;
+
+ /* Kick soft-offlined SMT siblings if required */
+ if (!cpu && nr_offl)
+ proceed = kick_offline_cpus(nr_offl);
+
+ /* If the soft-offlined CPUs did not respond, abort */
+ if (proceed)
+ __load_primary(cpu);
+
+ /* Unconditionally release soft-offlined SMT siblings if required */
+ if (!cpu && nr_offl)
+ release_offline_cpus();
+}
+
+/*
+ * Minimal stub rendezvous handler for soft-offlined CPUs which participate
+ * in the NMI rendezvous to protect against a concurrent NMI on affected
+ * CPUs.
+ */
+void noinstr microcode_offline_nmi_handler(void)
+{
+ if (!raw_cpu_read(ucode_ctrl.nmi_enabled))
+ return;
+ raw_cpu_write(ucode_ctrl.nmi_enabled, false);
+ raw_cpu_write(ucode_ctrl.result, UCODE_OFFLINE);
+ raw_atomic_inc(&offline_in_nmi);
+ wait_for_ctrl();
+}
+
+static noinstr bool microcode_update_handler(void)
+{
+ unsigned int cpu = raw_smp_processor_id();
+
+ if (raw_cpu_read(ucode_ctrl.ctrl_cpu) == cpu) {
+ instrumentation_begin();
+ load_primary(cpu);
+ instrumentation_end();
+ } else {
+ load_secondary(cpu);
+ }
+
+ instrumentation_begin();
+ touch_nmi_watchdog();
+ instrumentation_end();
+
+ return true;
+}
+
+/*
+ * Protection against instrumentation is required for CPUs which are not
+ * safe against an NMI which is delivered to the secondary SMT sibling
+ * while the primary thread updates the microcode. Instrumentation can end
+ * up in #INT3, #DB and #PF. The IRET from those exceptions reenables NMI
+ * which is the opposite of what the NMI rendezvous is trying to achieve.
+ *
+ * The primary thread is safe versus instrumentation as the actual
+ * microcode update handles this correctly. It's only the sibling code
+ * path which must be NMI safe until the primary thread completed the
+ * update.
+ */
+bool noinstr microcode_nmi_handler(void)
+{
+ if (!raw_cpu_read(ucode_ctrl.nmi_enabled))
+ return false;
+
+ raw_cpu_write(ucode_ctrl.nmi_enabled, false);
+ return microcode_update_handler();
+}
+
+static int load_cpus_stopped(void *unused)
+{
+ if (microcode_ops->use_nmi) {
+ /* Enable the NMI handler and raise NMI */
+ this_cpu_write(ucode_ctrl.nmi_enabled, true);
+ apic->send_IPI(smp_processor_id(), NMI_VECTOR);
+ } else {
+ /* Just invoke the handler directly */
+ microcode_update_handler();
+ }
+ return 0;
+}
+
+static int load_late_stop_cpus(bool is_safe)
+{
+ unsigned int cpu, updated = 0, failed = 0, timedout = 0, siblings = 0;
+ unsigned int nr_offl, offline = 0;
+ int old_rev = boot_cpu_data.microcode;
+ struct cpuinfo_x86 prev_info;
+
+ if (!is_safe) {
+ pr_err("Late microcode loading without minimal revision check.\n");
+ pr_err("You should switch to early loading, if possible.\n");
+ }
+
+ /*
+ * Pre-load the microcode image into a staging device. This
+ * process is preemptible and does not require stopping CPUs.
+ * Successful staging simplifies the subsequent late-loading
+ * process, reducing rendezvous time.
+ *
+ * Even if the transfer fails, the update will proceed as usual.
+ */
+ if (microcode_ops->use_staging)
+ microcode_ops->stage_microcode();
+
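+ /* Arm the rendezvous counter with the number of online CPUs. */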
+ atomic_set(&late_cpus_in, num_online_cpus());
+ atomic_set(&offline_in_nmi, 0);
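+ /* Scale loops_per_jiffy down to busy-wait iterations per microsecond. */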
+ loops_per_usec = loops_per_jiffy / (TICK_NSEC / 1000);
+
+ /*
+ * Take a snapshot before the microcode update in order to compare and
+ * check whether any bits changed after an update.
+ */
+ store_cpu_caps(&prev_info);
+
+ if (microcode_ops->use_nmi)
+ static_branch_enable_cpuslocked(&microcode_nmi_handler_enable);
+
+ stop_machine_cpuslocked(load_cpus_stopped, NULL, cpu_online_mask);
+
+ if (microcode_ops->use_nmi)
+ static_branch_disable_cpuslocked(&microcode_nmi_handler_enable);
+
+ /* Analyze the results */
+ for_each_cpu_and(cpu, cpu_present_mask, &cpus_booted_once_mask) {
+ switch (per_cpu(ucode_ctrl.result, cpu)) {
+ case UCODE_UPDATED: updated++; break;
+ case UCODE_TIMEOUT: timedout++; break;
+ case UCODE_OK: siblings++; break;
+ case UCODE_OFFLINE: offline++; break;
+ default: failed++; break;
+ }
+ }
+
+ if (microcode_ops->finalize_late_load)
+ microcode_ops->finalize_late_load(!updated);
+
+ if (!updated) {
+ /* Nothing changed. */
+ if (!failed && !timedout)
+ return 0;
+
+ nr_offl = cpumask_weight(&cpu_offline_mask);
+ if (offline < nr_offl) {
+ pr_warn("%u offline siblings did not respond.\n",
+ nr_offl - atomic_read(&offline_in_nmi));
+ return -EIO;
+ }
+ pr_err("update failed: %u CPUs failed %u CPUs timed out\n",
+ failed, timedout);
+ return -EIO;
+ }
+
+ if (!is_safe || failed || timedout)
+ add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
+
+ pr_info("load: updated on %u primary CPUs with %u siblings\n", updated, siblings);
+ if (failed || timedout) {
+ pr_err("load incomplete. %u CPUs timed out or failed\n",
+ num_online_cpus() - (updated + siblings));
+ }
+ pr_info("revision: 0x%x -> 0x%x\n", old_rev, boot_cpu_data.microcode);
+ microcode_check(&prev_info);
+
+ return updated + siblings == num_online_cpus() ? 0 : -EIO;
+}
+
+/*
+ * This function does two things:
+ *
+ * 1) Ensure that all required CPUs which are present and have been booted
+ * once are online.
+ *
+ * To pass this check, all primary threads must be online.
+ *
+ * If the microcode load is not safe against NMI then all SMT threads
+ * must be online as well because they still react to NMIs when they are
+ * soft-offlined and parked in one of the play_dead() variants. So if an
+ * NMI hits while the primary thread updates the microcode, the resulting
+ * behaviour is undefined. The default play_dead() implementation on
+ * modern CPUs uses MWAIT, which is also not guaranteed to be safe
+ * against a microcode update which affects MWAIT.
+ *
+ * As soft-offlined CPUs still react on NMIs, the SMT sibling
+ * restriction can be lifted when the vendor driver signals to use NMI
+ * for rendezvous and the APIC provides a mechanism to send an NMI to a
+ * soft-offlined CPU. The soft-offlined CPUs are then able to
+ * participate in the rendezvous in a trivial stub handler.
+ *
+ * 2) Initialize the per CPU control structure and create a cpumask
+ * which contains the "offline" secondary threads, so they can be
+ * handled correctly by a control CPU.
+ */
+static bool setup_cpus(void)
+{
+ struct microcode_ctrl ctrl = { .ctrl = SCTRL_WAIT, .result = -1, };
+ bool allow_smt_offline;
+ unsigned int cpu;
+
+ allow_smt_offline = microcode_ops->nmi_safe ||
+ (microcode_ops->use_nmi && apic->nmi_to_offline_cpu);
+
+ cpumask_clear(&cpu_offline_mask);
+
+ for_each_cpu_and(cpu, cpu_present_mask, &cpus_booted_once_mask) {
+ /*
+ * Offline CPUs sit in one of the play_dead() functions
+ * with interrupts disabled, but they still react on NMIs
+ * and execute arbitrary code. Also MWAIT being updated
+ * while the offline CPU sits there is not necessarily safe
+ * on all CPU variants.
+ *
+ * Mark them in the offline_cpus mask which will be handled
+ * by CPU0 later in the update process.
+ *
+ * Ensure that the primary thread is online so that it is
+ * guaranteed that all cores are updated.
+ */
+ if (!cpu_online(cpu)) {
+ if (topology_is_primary_thread(cpu) || !allow_smt_offline) {
+ pr_err("CPU %u not online, loading aborted\n", cpu);
+ return false;
+ }
+ cpumask_set_cpu(cpu, &cpu_offline_mask);
+ per_cpu(ucode_ctrl, cpu) = ctrl;
+ continue;
+ }
+
+ /*
+ * Initialize the per CPU state. This is core scope for now,
+ * but prepared to take package or system scope into account.
+ */
+ ctrl.ctrl_cpu = cpumask_first(topology_sibling_cpumask(cpu));
+ per_cpu(ucode_ctrl, cpu) = ctrl;
+ }
+ return true;
+}
+
+static int load_late_locked(void)
+{
+ if (!setup_cpus())
+ return -EBUSY;
+
+ switch (microcode_ops->request_microcode_fw(0, &microcode_fdev->dev)) {
+ case UCODE_NEW:
+ return load_late_stop_cpus(false);
+ case UCODE_NEW_SAFE:
+ return load_late_stop_cpus(true);
+ case UCODE_NFOUND:
+ return -ENOENT;
+ case UCODE_OK:
+ return 0;
+ default:
+ return -EBADFD;
+ }
+}
+
+static ssize_t reload_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t size)
+{
+ unsigned long val;
+ ssize_t ret;
+
+ ret = kstrtoul(buf, 0, &val);
+ if (ret || val != 1)
+ return -EINVAL;
+
+ cpus_read_lock();
+ ret = load_late_locked();
+ cpus_read_unlock();
+
+ return ret ? : size;
+}
+
+static DEVICE_ATTR_WO(reload);
+#endif
+
+static ssize_t version_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct ucode_cpu_info *uci = ucode_cpu_info + dev->id;
+
+ return sprintf(buf, "0x%x\n", uci->cpu_sig.rev);
+}
+
+static ssize_t processor_flags_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct ucode_cpu_info *uci = ucode_cpu_info + dev->id;
+
+ return sprintf(buf, "0x%x\n", uci->cpu_sig.pf);
+}
+
+static DEVICE_ATTR_RO(version);
+static DEVICE_ATTR_RO(processor_flags);
+
+static struct attribute *mc_default_attrs[] = {
+ &dev_attr_version.attr,
+ &dev_attr_processor_flags.attr,
+ NULL
+};
+
+static const struct attribute_group mc_attr_group = {
+ .attrs = mc_default_attrs,
+ .name = "microcode",
+};
+
+static void microcode_fini_cpu(int cpu)
+{
+ if (microcode_ops->microcode_fini_cpu)
+ microcode_ops->microcode_fini_cpu(cpu);
+}
+
+/**
+ * microcode_bsp_resume - Update boot CPU microcode during resume.
+ */
+void microcode_bsp_resume(void)
+{
+ int cpu = smp_processor_id();
+ struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
+
+ if (uci->mc)
+ microcode_ops->apply_microcode(cpu);
+ else
+ reload_early_microcode(cpu);
+}
+
+static void microcode_bsp_syscore_resume(void *data)
+{
+ microcode_bsp_resume();
+}
+
+static const struct syscore_ops mc_syscore_ops = {
+ .resume = microcode_bsp_syscore_resume,
+};
+
+static struct syscore mc_syscore = {
+ .ops = &mc_syscore_ops,
+};
+
+static int mc_cpu_online(unsigned int cpu)
+{
+ struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
+ struct device *dev = get_cpu_device(cpu);
+
+ memset(uci, 0, sizeof(*uci));
+
+ microcode_ops->collect_cpu_info(cpu, &uci->cpu_sig);
+ cpu_data(cpu).microcode = uci->cpu_sig.rev;
+ if (!cpu)
+ boot_cpu_data.microcode = uci->cpu_sig.rev;
+
+ if (sysfs_create_group(&dev->kobj, &mc_attr_group))
+ pr_err("Failed to create group for CPU%d\n", cpu);
+ return 0;
+}
+
+static int mc_cpu_down_prep(unsigned int cpu)
+{
+ struct device *dev = get_cpu_device(cpu);
+
+ microcode_fini_cpu(cpu);
+ sysfs_remove_group(&dev->kobj, &mc_attr_group);
+ return 0;
+}
+
+static struct attribute *cpu_root_microcode_attrs[] = {
+#ifdef CONFIG_MICROCODE_LATE_LOADING
+ &dev_attr_reload.attr,
+#endif
+ NULL
+};
+
+static const struct attribute_group cpu_root_microcode_group = {
+ .name = "microcode",
+ .attrs = cpu_root_microcode_attrs,
+};
+
+static int __init microcode_init(void)
+{
+ struct device *dev_root;
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+ int error;
+
+ if (microcode_loader_disabled())
+ return -EINVAL;
+
+ if (c->x86_vendor == X86_VENDOR_INTEL)
+ microcode_ops = init_intel_microcode();
+ else if (c->x86_vendor == X86_VENDOR_AMD)
+ microcode_ops = init_amd_microcode();
+ else
+ pr_err("no support for this CPU vendor\n");
+
+ if (!microcode_ops)
+ return -ENODEV;
+
+ pr_info_once("Current revision: 0x%08x\n", (early_data.new_rev ?: early_data.old_rev));
+
+ if (early_data.new_rev)
+ pr_info_once("Updated early from: 0x%08x\n", early_data.old_rev);
+
+ microcode_fdev = faux_device_create("microcode", NULL, NULL);
+ if (!microcode_fdev)
+ return -ENODEV;
+
+ dev_root = bus_get_dev_root(&cpu_subsys);
+ if (dev_root) {
+ error = sysfs_create_group(&dev_root->kobj, &cpu_root_microcode_group);
+ put_device(dev_root);
+ if (error) {
+ pr_err("Error creating microcode group!\n");
+ goto out_pdev;
+ }
+ }
+
+ register_syscore(&mc_syscore);
+ cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/microcode:online",
+ mc_cpu_online, mc_cpu_down_prep);
+
+ return 0;
+
+ out_pdev:
+ faux_device_destroy(microcode_fdev);
+ return error;
+}
+late_initcall(microcode_init);
diff --git a/arch/x86/kernel/cpu/microcode/intel-ucode-defs.h b/arch/x86/kernel/cpu/microcode/intel-ucode-defs.h
new file mode 100644
index 000000000000..2d48e6593540
--- /dev/null
+++ b/arch/x86/kernel/cpu/microcode/intel-ucode-defs.h
@@ -0,0 +1,160 @@
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x03, .steppings = 0x0004, .driver_data = 0x2 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x05, .steppings = 0x0001, .driver_data = 0x45 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x05, .steppings = 0x0002, .driver_data = 0x40 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x05, .steppings = 0x0004, .driver_data = 0x2c },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x05, .steppings = 0x0008, .driver_data = 0x10 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x06, .steppings = 0x0001, .driver_data = 0xa },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x06, .steppings = 0x0020, .driver_data = 0x3 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x06, .steppings = 0x0400, .driver_data = 0xd },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x06, .steppings = 0x2000, .driver_data = 0x7 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x07, .steppings = 0x0002, .driver_data = 0x14 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x07, .steppings = 0x0004, .driver_data = 0x38 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x07, .steppings = 0x0008, .driver_data = 0x2e },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x08, .steppings = 0x0002, .driver_data = 0x11 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x08, .steppings = 0x0008, .driver_data = 0x8 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x08, .steppings = 0x0040, .driver_data = 0xc },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x08, .steppings = 0x0400, .driver_data = 0x5 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x09, .steppings = 0x0020, .driver_data = 0x47 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0a, .steppings = 0x0001, .driver_data = 0x3 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0a, .steppings = 0x0002, .driver_data = 0x1 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0b, .steppings = 0x0002, .driver_data = 0x1d },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0b, .steppings = 0x0010, .driver_data = 0x2 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0d, .steppings = 0x0040, .driver_data = 0x18 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0e, .steppings = 0x0100, .driver_data = 0x39 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0e, .steppings = 0x1000, .driver_data = 0x59 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0f, .steppings = 0x0004, .driver_data = 0x5d },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0f, .steppings = 0x0040, .driver_data = 0xd2 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0f, .steppings = 0x0080, .driver_data = 0x6b },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0f, .steppings = 0x0400, .driver_data = 0x95 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0f, .steppings = 0x0800, .driver_data = 0xbc },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x0f, .steppings = 0x2000, .driver_data = 0xa4 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x16, .steppings = 0x0002, .driver_data = 0x44 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x17, .steppings = 0x0040, .driver_data = 0x60f },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x17, .steppings = 0x0080, .driver_data = 0x70a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x17, .steppings = 0x0400, .driver_data = 0xa0b },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x1a, .steppings = 0x0010, .driver_data = 0x12 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x1a, .steppings = 0x0020, .driver_data = 0x1d },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x1c, .steppings = 0x0004, .driver_data = 0x219 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x1c, .steppings = 0x0400, .driver_data = 0x107 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x1d, .steppings = 0x0002, .driver_data = 0x29 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x1e, .steppings = 0x0020, .driver_data = 0xa },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x25, .steppings = 0x0004, .driver_data = 0x11 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x25, .steppings = 0x0020, .driver_data = 0x7 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x26, .steppings = 0x0002, .driver_data = 0x105 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x2a, .steppings = 0x0080, .driver_data = 0x2f },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x2c, .steppings = 0x0004, .driver_data = 0x1f },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x2d, .steppings = 0x0040, .driver_data = 0x621 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x2d, .steppings = 0x0080, .driver_data = 0x71a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x2e, .steppings = 0x0040, .driver_data = 0xd },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x2f, .steppings = 0x0004, .driver_data = 0x3b },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x37, .steppings = 0x0100, .driver_data = 0x838 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x37, .steppings = 0x0200, .driver_data = 0x90d },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x3a, .steppings = 0x0200, .driver_data = 0x21 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x3c, .steppings = 0x0008, .driver_data = 0x28 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x3d, .steppings = 0x0010, .driver_data = 0x2f },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x3e, .steppings = 0x0010, .driver_data = 0x42e },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x3e, .steppings = 0x0040, .driver_data = 0x600 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x3e, .steppings = 0x0080, .driver_data = 0x715 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x3f, .steppings = 0x0004, .driver_data = 0x49 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x3f, .steppings = 0x0010, .driver_data = 0x1a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x45, .steppings = 0x0002, .driver_data = 0x26 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x46, .steppings = 0x0002, .driver_data = 0x1c },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x47, .steppings = 0x0002, .driver_data = 0x22 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x4c, .steppings = 0x0008, .driver_data = 0x368 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x4c, .steppings = 0x0010, .driver_data = 0x411 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x4d, .steppings = 0x0100, .driver_data = 0x12d },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x4e, .steppings = 0x0008, .driver_data = 0xf0 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x55, .steppings = 0x0008, .driver_data = 0x1000191 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x55, .steppings = 0x0010, .driver_data = 0x2007006 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x55, .steppings = 0x0020, .driver_data = 0x3000010 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x55, .steppings = 0x0080, .driver_data = 0x5003901 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x55, .steppings = 0x0800, .driver_data = 0x7002b01 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x56, .steppings = 0x0004, .driver_data = 0x1c },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x56, .steppings = 0x0008, .driver_data = 0x700001c },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x56, .steppings = 0x0010, .driver_data = 0xf00001a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x56, .steppings = 0x0020, .driver_data = 0xe000015 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x5c, .steppings = 0x0004, .driver_data = 0x14 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x5c, .steppings = 0x0200, .driver_data = 0x48 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x5c, .steppings = 0x0400, .driver_data = 0x28 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x5e, .steppings = 0x0008, .driver_data = 0xf0 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x5f, .steppings = 0x0002, .driver_data = 0x3e },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x66, .steppings = 0x0008, .driver_data = 0x2a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x6a, .steppings = 0x0020, .driver_data = 0xc0002f0 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x6a, .steppings = 0x0040, .driver_data = 0xd000404 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x6c, .steppings = 0x0002, .driver_data = 0x10002d0 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x7a, .steppings = 0x0002, .driver_data = 0x42 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x7a, .steppings = 0x0100, .driver_data = 0x26 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x7e, .steppings = 0x0020, .driver_data = 0xca },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8a, .steppings = 0x0002, .driver_data = 0x33 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8c, .steppings = 0x0002, .driver_data = 0xbc },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8c, .steppings = 0x0004, .driver_data = 0x3c },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8d, .steppings = 0x0002, .driver_data = 0x56 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8e, .steppings = 0x0200, .driver_data = 0xf6 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8e, .steppings = 0x0400, .driver_data = 0xf6 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8e, .steppings = 0x0800, .driver_data = 0xf6 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8e, .steppings = 0x1000, .driver_data = 0x100 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8f, .steppings = 0x0010, .driver_data = 0x2c0003f7 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8f, .steppings = 0x0020, .driver_data = 0x2c0003f7 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8f, .steppings = 0x0040, .driver_data = 0x2c0003f7 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8f, .steppings = 0x0080, .driver_data = 0x2b000639 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x8f, .steppings = 0x0100, .driver_data = 0x2c0003f7 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x96, .steppings = 0x0002, .driver_data = 0x1a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x97, .steppings = 0x0004, .driver_data = 0x3a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x97, .steppings = 0x0020, .driver_data = 0x3a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x9a, .steppings = 0x0008, .driver_data = 0x437 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x9a, .steppings = 0x0010, .driver_data = 0x437 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x9c, .steppings = 0x0001, .driver_data = 0x24000026 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x9e, .steppings = 0x0200, .driver_data = 0xf8 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x9e, .steppings = 0x0400, .driver_data = 0xfa },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x9e, .steppings = 0x0800, .driver_data = 0xf6 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x9e, .steppings = 0x1000, .driver_data = 0xf8 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0x9e, .steppings = 0x2000, .driver_data = 0x104 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xa5, .steppings = 0x0004, .driver_data = 0x100 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xa5, .steppings = 0x0008, .driver_data = 0x100 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xa5, .steppings = 0x0020, .driver_data = 0x100 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xa6, .steppings = 0x0001, .driver_data = 0x102 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xa6, .steppings = 0x0002, .driver_data = 0x100 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xa7, .steppings = 0x0002, .driver_data = 0x64 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xaa, .steppings = 0x0010, .driver_data = 0x24 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xad, .steppings = 0x0002, .driver_data = 0xa0000d1 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xaf, .steppings = 0x0008, .driver_data = 0x3000341 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xb5, .steppings = 0x0001, .driver_data = 0xa },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xb7, .steppings = 0x0002, .driver_data = 0x12f },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xb7, .steppings = 0x0010, .driver_data = 0x12f },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xba, .steppings = 0x0004, .driver_data = 0x4128 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xba, .steppings = 0x0008, .driver_data = 0x4128 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xba, .steppings = 0x0100, .driver_data = 0x4128 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xbd, .steppings = 0x0002, .driver_data = 0x11f },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xbe, .steppings = 0x0001, .driver_data = 0x1d },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xbf, .steppings = 0x0004, .driver_data = 0x3a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xbf, .steppings = 0x0020, .driver_data = 0x3a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xbf, .steppings = 0x0040, .driver_data = 0x3a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xbf, .steppings = 0x0080, .driver_data = 0x3a },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xc5, .steppings = 0x0004, .driver_data = 0x118 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xc6, .steppings = 0x0004, .driver_data = 0x118 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xc6, .steppings = 0x0010, .driver_data = 0x118 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xca, .steppings = 0x0004, .driver_data = 0x118 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xcf, .steppings = 0x0002, .driver_data = 0x210002a9 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0x6, .model = 0xcf, .steppings = 0x0004, .driver_data = 0x210002a9 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x00, .steppings = 0x0080, .driver_data = 0x12 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x00, .steppings = 0x0400, .driver_data = 0x15 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x01, .steppings = 0x0004, .driver_data = 0x2e },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x02, .steppings = 0x0010, .driver_data = 0x21 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x02, .steppings = 0x0020, .driver_data = 0x2c },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x02, .steppings = 0x0040, .driver_data = 0x10 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x02, .steppings = 0x0080, .driver_data = 0x39 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x02, .steppings = 0x0200, .driver_data = 0x2f },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x03, .steppings = 0x0004, .driver_data = 0xa },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x03, .steppings = 0x0008, .driver_data = 0xc },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x03, .steppings = 0x0010, .driver_data = 0x17 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x04, .steppings = 0x0002, .driver_data = 0x17 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x04, .steppings = 0x0008, .driver_data = 0x5 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x04, .steppings = 0x0010, .driver_data = 0x6 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x04, .steppings = 0x0080, .driver_data = 0x3 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x04, .steppings = 0x0100, .driver_data = 0xe },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x04, .steppings = 0x0200, .driver_data = 0x3 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x04, .steppings = 0x0400, .driver_data = 0x4 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x06, .steppings = 0x0004, .driver_data = 0xf },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x06, .steppings = 0x0010, .driver_data = 0x4 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x06, .steppings = 0x0020, .driver_data = 0x8 },
+{ .flags = X86_CPU_ID_FLAG_ENTRY_VALID, .vendor = X86_VENDOR_INTEL, .family = 0xf, .model = 0x06, .steppings = 0x0100, .driver_data = 0x9 },
diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
new file mode 100644
index 000000000000..8744f3adc2a0
--- /dev/null
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -0,0 +1,1016 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Intel CPU Microcode Update Driver for Linux
+ *
+ * Copyright (C) 2000-2006 Tigran Aivazian <aivazian.tigran@gmail.com>
+ * 2006 Shaohua Li <shaohua.li@intel.com>
+ *
+ * Intel CPU microcode early update for Linux
+ *
+ * Copyright (C) 2012 Fenghua Yu <fenghua.yu@intel.com>
+ * H Peter Anvin" <hpa@zytor.com>
+ */
+#define pr_fmt(fmt) "microcode: " fmt
+#include <linux/earlycpio.h>
+#include <linux/firmware.h>
+#include <linux/pci_ids.h>
+#include <linux/uaccess.h>
+#include <linux/initrd.h>
+#include <linux/kernel.h>
+#include <linux/delay.h>
+#include <linux/slab.h>
+#include <linux/cpu.h>
+#include <linux/uio.h>
+#include <linux/io.h>
+#include <linux/mm.h>
+
+#include <asm/cpu_device_id.h>
+#include <asm/processor.h>
+#include <asm/tlbflush.h>
+#include <asm/setup.h>
+#include <asm/msr.h>
+
+#include "internal.h"
+
+static const char ucode_path[] = "kernel/x86/microcode/GenuineIntel.bin";
+
+#define UCODE_BSP_LOADED ((struct microcode_intel *)0x1UL)
+
+/* Defines for the microcode staging mailbox interface */
+#define MBOX_REG_NUM 4
+#define MBOX_REG_SIZE sizeof(u32)
+
+#define MBOX_CONTROL_OFFSET 0x0
+#define MBOX_STATUS_OFFSET 0x4
+#define MBOX_WRDATA_OFFSET 0x8
+#define MBOX_RDDATA_OFFSET 0xc
+
+#define MASK_MBOX_CTRL_ABORT BIT(0)
+#define MASK_MBOX_CTRL_GO BIT(31)
+
+#define MASK_MBOX_STATUS_ERROR BIT(2)
+#define MASK_MBOX_STATUS_READY BIT(31)
+
+#define MASK_MBOX_RESP_SUCCESS BIT(0)
+#define MASK_MBOX_RESP_PROGRESS BIT(1)
+#define MASK_MBOX_RESP_ERROR BIT(2)
+
+#define MBOX_CMD_LOAD 0x3
+#define MBOX_OBJ_STAGING 0xb
+#define MBOX_HEADER(size) ((PCI_VENDOR_ID_INTEL) | \
+ (MBOX_OBJ_STAGING << 16) | \
+ ((u64)((size) / sizeof(u32)) << 32))
+
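+/*
+ * Worked example (illustrative): a 4096-byte chunk plus the two 8-byte
+ * headers gives size = 4112 bytes = 1028 dwords, so MBOX_HEADER() packs
+ * 0x8086 (Intel vendor ID) into bits 15:0, 0xb (the staging object type)
+ * into bits 31:16 and 0x404 (the dword count) into bits 63:32.
+ */
+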
+/* The size of each mailbox header */
+#define MBOX_HEADER_SIZE sizeof(u64)
+/* The size of staging hardware response */
+#define MBOX_RESPONSE_SIZE sizeof(u64)
+
+#define MBOX_XACTION_TIMEOUT_MS (10 * MSEC_PER_SEC)
+
+/* Current microcode patch used in early patching on the APs. */
+static struct microcode_intel *ucode_patch_va __read_mostly;
+static struct microcode_intel *ucode_patch_late __read_mostly;
+
+/* last level cache size per core */
+static unsigned int llc_size_per_core __ro_after_init;
+
+/* The microcode format is extended from Prescott processors onwards */
+struct extended_signature {
+ unsigned int sig;
+ unsigned int pf;
+ unsigned int cksum;
+};
+
+struct extended_sigtable {
+ unsigned int count;
+ unsigned int cksum;
+ unsigned int reserved[3];
+ struct extended_signature sigs[];
+};
+
+/**
+ * struct staging_state - Track the current staging process state
+ *
+ * @mmio_base: MMIO base address for staging
+ * @ucode_len: Total size of the microcode image
+ * @chunk_size: Size of each data piece
+ * @bytes_sent: Total bytes transmitted so far
+ * @offset: Current offset in the microcode image
+ */
+struct staging_state {
+ void __iomem *mmio_base;
+ unsigned int ucode_len;
+ unsigned int chunk_size;
+ unsigned int bytes_sent;
+ unsigned int offset;
+};
+
+#define DEFAULT_UCODE_TOTALSIZE (DEFAULT_UCODE_DATASIZE + MC_HEADER_SIZE)
+#define EXT_HEADER_SIZE (sizeof(struct extended_sigtable))
+#define EXT_SIGNATURE_SIZE (sizeof(struct extended_signature))
+
+static inline unsigned int get_totalsize(struct microcode_header_intel *hdr)
+{
+ return hdr->datasize ? hdr->totalsize : DEFAULT_UCODE_TOTALSIZE;
+}
+
+static inline unsigned int exttable_size(struct extended_sigtable *et)
+{
+ return et->count * EXT_SIGNATURE_SIZE + EXT_HEADER_SIZE;
+}
+
+void intel_collect_cpu_info(struct cpu_signature *sig)
+{
+ sig->sig = cpuid_eax(1);
+ sig->pf = 0;
+ sig->rev = intel_get_microcode_revision();
+
+ if (IFM(x86_family(sig->sig), x86_model(sig->sig)) >= INTEL_PENTIUM_III_DESCHUTES) {
+ unsigned int val[2];
+
+ /* get processor flags from MSR 0x17 */
+ native_rdmsr(MSR_IA32_PLATFORM_ID, val[0], val[1]);
+ sig->pf = 1 << ((val[1] >> 18) & 7);
+ }
+}
+EXPORT_SYMBOL_GPL(intel_collect_cpu_info);
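+
+/*
+ * Example (illustrative): CPUID(1).EAX = 0x000906ea decodes to family 6,
+ * model 0x9e, stepping 0xa; the pf value is one of eight platform flag
+ * bits selected by bits 52:50 of MSR_IA32_PLATFORM_ID.
+ */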
+
+static inline bool cpu_signatures_match(struct cpu_signature *s1, unsigned int sig2,
+ unsigned int pf2)
+{
+ if (s1->sig != sig2)
+ return false;
+
+ /* Processor flags are either both 0 or they intersect. */
+ return ((!s1->pf && !pf2) || (s1->pf & pf2));
+}
+
+bool intel_find_matching_signature(void *mc, struct cpu_signature *sig)
+{
+ struct microcode_header_intel *mc_hdr = mc;
+ struct extended_signature *ext_sig;
+ struct extended_sigtable *ext_hdr;
+ int i;
+
+ if (cpu_signatures_match(sig, mc_hdr->sig, mc_hdr->pf))
+ return true;
+
+ /* Look for ext. headers: */
+ if (get_totalsize(mc_hdr) <= intel_microcode_get_datasize(mc_hdr) + MC_HEADER_SIZE)
+ return false;
+
+ ext_hdr = mc + intel_microcode_get_datasize(mc_hdr) + MC_HEADER_SIZE;
+ ext_sig = (void *)ext_hdr + EXT_HEADER_SIZE;
+
+ for (i = 0; i < ext_hdr->count; i++) {
+ if (cpu_signatures_match(sig, ext_sig->sig, ext_sig->pf))
+ return true;
+ ext_sig++;
+ }
+ return false;
+}
+EXPORT_SYMBOL_GPL(intel_find_matching_signature);
+
+/**
+ * intel_microcode_sanity_check() - Sanity check microcode file.
+ * @mc: Pointer to the microcode file contents.
+ * @print_err: Display failure reason if true, silent if false.
+ * @hdr_type: Type of file, i.e. normal microcode file or In Field Scan file.
+ * Validate if the microcode header type matches with the type
+ * specified here.
+ *
+ * Validate certain header fields and verify if computed checksum matches
+ * with the one specified in the header.
+ *
+ * Return: 0 if the file passes all the checks, -EINVAL if any of the checks
+ * fail.
+ */
+int intel_microcode_sanity_check(void *mc, bool print_err, int hdr_type)
+{
+ unsigned long total_size, data_size, ext_table_size;
+ struct microcode_header_intel *mc_header = mc;
+ struct extended_sigtable *ext_header = NULL;
+ u32 sum, orig_sum, ext_sigcount = 0, i;
+ struct extended_signature *ext_sig;
+
+ total_size = get_totalsize(mc_header);
+ data_size = intel_microcode_get_datasize(mc_header);
+
+ if (data_size + MC_HEADER_SIZE > total_size) {
+ if (print_err)
+ pr_err("Error: bad microcode data file size.\n");
+ return -EINVAL;
+ }
+
+ if (mc_header->ldrver != 1 || mc_header->hdrver != hdr_type) {
+ if (print_err)
+ pr_err("Error: invalid/unknown microcode update format. Header type %d\n",
+ mc_header->hdrver);
+ return -EINVAL;
+ }
+
+ ext_table_size = total_size - (MC_HEADER_SIZE + data_size);
+ if (ext_table_size) {
+ u32 ext_table_sum = 0;
+ u32 *ext_tablep;
+
+ if (ext_table_size < EXT_HEADER_SIZE ||
+ ((ext_table_size - EXT_HEADER_SIZE) % EXT_SIGNATURE_SIZE)) {
+ if (print_err)
+ pr_err("Error: truncated extended signature table.\n");
+ return -EINVAL;
+ }
+
+ ext_header = mc + MC_HEADER_SIZE + data_size;
+ if (ext_table_size != exttable_size(ext_header)) {
+ if (print_err)
+ pr_err("Error: extended signature table size mismatch.\n");
+ return -EFAULT;
+ }
+
+ ext_sigcount = ext_header->count;
+
+ /*
+ * Check extended table checksum: the sum of all dwords that
+ * comprise a valid table must be 0.
+ */
+ ext_tablep = (u32 *)ext_header;
+
+ i = ext_table_size / sizeof(u32);
+ while (i--)
+ ext_table_sum += ext_tablep[i];
+
+ if (ext_table_sum) {
+ if (print_err)
+ pr_warn("Bad extended signature table checksum, aborting.\n");
+ return -EINVAL;
+ }
+ }
+
+ /*
+ * Calculate the checksum of update data and header. The checksum of
+ * valid update data and header including the extended signature table
+ * must be 0.
+ */
+ orig_sum = 0;
+ i = (MC_HEADER_SIZE + data_size) / sizeof(u32);
+ while (i--)
+ orig_sum += ((u32 *)mc)[i];
+
+ if (orig_sum) {
+ if (print_err)
+ pr_err("Bad microcode data checksum, aborting.\n");
+ return -EINVAL;
+ }
+
+ if (!ext_table_size)
+ return 0;
+
+ /*
+ * Check extended signature checksum: 0 => valid.
+ */
+ for (i = 0; i < ext_sigcount; i++) {
+ ext_sig = (void *)ext_header + EXT_HEADER_SIZE +
+ EXT_SIGNATURE_SIZE * i;
+
+ sum = (mc_header->sig + mc_header->pf + mc_header->cksum) -
+ (ext_sig->sig + ext_sig->pf + ext_sig->cksum);
+ if (sum) {
+ if (print_err)
+ pr_err("Bad extended signature checksum, aborting.\n");
+ return -EINVAL;
+ }
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(intel_microcode_sanity_check);
+
+static void update_ucode_pointer(struct microcode_intel *mc)
+{
+ kvfree(ucode_patch_va);
+
+ /*
+ * Save the virtual address for early loading and for eventual free
+ * on late loading.
+ */
+ ucode_patch_va = mc;
+}
+
+static void save_microcode_patch(struct microcode_intel *patch)
+{
+ unsigned int size = get_totalsize(&patch->hdr);
+ struct microcode_intel *mc;
+
+ mc = kvmemdup(patch, size, GFP_KERNEL);
+ if (mc)
+ update_ucode_pointer(mc);
+ else
+ pr_err("Unable to allocate microcode memory size: %u\n", size);
+}
+
+/* Scan blob for microcode matching the boot CPU's family, model and stepping */
+static __init struct microcode_intel *scan_microcode(void *data, size_t size,
+ struct ucode_cpu_info *uci,
+ bool save)
+{
+ struct microcode_header_intel *mc_header;
+ struct microcode_intel *patch = NULL;
+ u32 cur_rev = uci->cpu_sig.rev;
+ unsigned int mc_size;
+
+ for (; size >= sizeof(struct microcode_header_intel); size -= mc_size, data += mc_size) {
+ mc_header = (struct microcode_header_intel *)data;
+
+ mc_size = get_totalsize(mc_header);
+ if (!mc_size || mc_size > size ||
+ intel_microcode_sanity_check(data, false, MC_HEADER_TYPE_MICROCODE) < 0)
+ break;
+
+ if (!intel_find_matching_signature(data, &uci->cpu_sig))
+ continue;
+
+ /*
+ * For saving the early microcode, find the matching revision which
+ * was loaded on the BSP.
+ *
+ * On the BSP during early boot, find a newer revision than
+ * actually loaded in the CPU.
+ */
+ if (save) {
+ if (cur_rev != mc_header->rev)
+ continue;
+ } else if (cur_rev >= mc_header->rev) {
+ continue;
+ }
+
+ patch = data;
+ cur_rev = mc_header->rev;
+ }
+
+ return size ? NULL : patch;
+}
+
+static inline u32 read_mbox_dword(void __iomem *mmio_base)
+{
+ u32 dword = readl(mmio_base + MBOX_RDDATA_OFFSET);
+
+ /* Acknowledge read completion to the staging hardware */
+ writel(0, mmio_base + MBOX_RDDATA_OFFSET);
+ return dword;
+}
+
+static inline void write_mbox_dword(void __iomem *mmio_base, u32 dword)
+{
+ writel(dword, mmio_base + MBOX_WRDATA_OFFSET);
+}
+
+static inline u64 read_mbox_header(void __iomem *mmio_base)
+{
+ u32 high, low;
+
+ low = read_mbox_dword(mmio_base);
+ high = read_mbox_dword(mmio_base);
+
+ return ((u64)high << 32) | low;
+}
+
+static inline void write_mbox_header(void __iomem *mmio_base, u64 value)
+{
+ write_mbox_dword(mmio_base, value);
+ write_mbox_dword(mmio_base, value >> 32);
+}
+
+static void write_mbox_data(void __iomem *mmio_base, u32 *chunk, unsigned int chunk_bytes)
+{
+ int i;
+
+ /*
+ * The MMIO space is mapped as Uncached (UC). Each write arrives
+ * at the device as an individual transaction in program order.
+ * The device can then reassemble the sequence accordingly.
+ */
+ for (i = 0; i < chunk_bytes / sizeof(u32); i++)
+ write_mbox_dword(mmio_base, chunk[i]);
+}
+
+/*
+ * Prepare for a new microcode transfer: reset hardware and record the
+ * image size.
+ */
+static void init_stage(struct staging_state *ss)
+{
+ ss->ucode_len = get_totalsize(&ucode_patch_late->hdr);
+
+ /*
+ * Abort any ongoing process, effectively resetting the device.
+ * Unlike regular mailbox data processing requests, this
+ * operation does not require a status check.
+ */
+ writel(MASK_MBOX_CTRL_ABORT, ss->mmio_base + MBOX_CONTROL_OFFSET);
+}
+
+/*
+ * Update the chunk size and decide whether another chunk can be sent.
+ * This accounts for remaining data and retry limits.
+ */
+static bool can_send_next_chunk(struct staging_state *ss, int *err)
+{
+ /* One page, or the remaining bytes if this is the final chunk */
+ ss->chunk_size = min(PAGE_SIZE, ss->ucode_len - ss->offset);
+
+ /*
+ * Each microcode image is divided into chunks, each at most
+ * one page size. A 10-chunk image would typically require 10
+ * transactions.
+ *
+ * However, the hardware managing the mailbox has limited
+ * resources and may not cache the entire image, potentially
+ * requesting the same chunk multiple times.
+ *
+ * To tolerate this behavior, allow up to twice the expected
+ * number of transactions (i.e., a 10-chunk image can take up to
+ * 20 attempts).
+ *
+ * If the number of attempts exceeds this limit, treat it as
+ * exceeding the maximum allowed transfer size.
+ */
+ if (ss->bytes_sent + ss->chunk_size > ss->ucode_len * 2) {
+ *err = -EMSGSIZE;
+ return false;
+ }
+
+ *err = 0;
+ return true;
+}
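+
+/*
+ * Illustrative budget: a 64 KiB image is sent as sixteen 4 KiB chunks;
+ * the check above allows up to 128 KiB in total, i.e. roughly twice that
+ * many transactions, before giving up with -EMSGSIZE.
+ */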
+
+/*
+ * The hardware indicates completion by returning a sentinel end offset.
+ */
+static inline bool is_end_offset(u32 offset)
+{
+ return offset == UINT_MAX;
+}
+
+/*
+ * Determine whether staging is complete: either the hardware signaled
+ * the end offset, or no more transactions are permitted (retry limit
+ * reached).
+ */
+static inline bool staging_is_complete(struct staging_state *ss, int *err)
+{
+ return is_end_offset(ss->offset) || !can_send_next_chunk(ss, err);
+}
+
+/*
+ * Wait for the hardware to complete a transaction.
+ * Return 0 on success, or an error code on failure.
+ */
+static int wait_for_transaction(struct staging_state *ss)
+{
+ u32 timeout, status;
+
+ /* Allow time for hardware to complete the operation: */
+ for (timeout = 0; timeout < MBOX_XACTION_TIMEOUT_MS; timeout++) {
+ msleep(1);
+
+ status = readl(ss->mmio_base + MBOX_STATUS_OFFSET);
+ /* Break out early if the hardware is ready: */
+ if (status & MASK_MBOX_STATUS_READY)
+ break;
+ }
+
+ /* Check for explicit error response */
+ if (status & MASK_MBOX_STATUS_ERROR)
+ return -EIO;
+
+ /*
+ * Hardware has neither responded to the action nor signaled any
+ * error. Treat this as a timeout.
+ */
+ if (!(status & MASK_MBOX_STATUS_READY))
+ return -ETIMEDOUT;
+
+ return 0;
+}
+
+/*
+ * Transmit a chunk of the microcode image to the hardware.
+ * Return 0 on success, or an error code on failure.
+ */
+static int send_data_chunk(struct staging_state *ss, void *ucode_ptr)
+{
+ u32 *src_chunk = ucode_ptr + ss->offset;
+ u16 mbox_size;
+
+ /*
+ * Write a 'request' mailbox object in this order:
+ * 1. Mailbox header includes total size
+ * 2. Command header specifies the load operation
+ * 3. Data section contains a microcode chunk
+ *
+ * Thus, the mailbox size is two headers plus the chunk size.
+ */
+ mbox_size = MBOX_HEADER_SIZE * 2 + ss->chunk_size;
+ write_mbox_header(ss->mmio_base, MBOX_HEADER(mbox_size));
+ write_mbox_header(ss->mmio_base, MBOX_CMD_LOAD);
+ write_mbox_data(ss->mmio_base, src_chunk, ss->chunk_size);
+ ss->bytes_sent += ss->chunk_size;
+
+ /* Notify the hardware that the mailbox is ready for processing. */
+ writel(MASK_MBOX_CTRL_GO, ss->mmio_base + MBOX_CONTROL_OFFSET);
+
+ return wait_for_transaction(ss);
+}
+
+/*
+ * Retrieve the next offset from the hardware response.
+ * Return 0 on success, or an error code on failure.
+ */
+static int fetch_next_offset(struct staging_state *ss)
+{
+ const u64 expected_header = MBOX_HEADER(MBOX_HEADER_SIZE + MBOX_RESPONSE_SIZE);
+ u32 offset, status;
+ u64 header;
+
+ /*
+ * The 'response' mailbox returns three fields, in order:
+ * 1. Header
+ * 2. Next offset in the microcode image
+ * 3. Status flags
+ */
+ header = read_mbox_header(ss->mmio_base);
+ offset = read_mbox_dword(ss->mmio_base);
+ status = read_mbox_dword(ss->mmio_base);
+
+ /* All valid responses must start with the expected header. */
+ if (header != expected_header) {
+ pr_err_once("staging: invalid response header (0x%llx)\n", header);
+ return -EBADR;
+ }
+
+ /*
+ * Verify the offset: If not at the end marker, it must not
+ * exceed the microcode image length.
+ */
+ if (!is_end_offset(offset) && offset > ss->ucode_len) {
+ pr_err_once("staging: invalid offset (%u) past the image end (%u)\n",
+ offset, ss->ucode_len);
+ return -EINVAL;
+ }
+
+ /* Hardware may report errors explicitly in the status field */
+ if (status & MASK_MBOX_RESP_ERROR)
+ return -EPROTO;
+
+ ss->offset = offset;
+ return 0;
+}
+
+/*
+ * Handle the staging process using the mailbox MMIO interface. The
+ * microcode image is transferred in chunks until completion.
+ * Return 0 on success or an error code on failure.
+ */
+static int do_stage(u64 mmio_pa)
+{
+ struct staging_state ss = {};
+ int err;
+
+ ss.mmio_base = ioremap(mmio_pa, MBOX_REG_NUM * MBOX_REG_SIZE);
+ if (WARN_ON_ONCE(!ss.mmio_base))
+ return -EADDRNOTAVAIL;
+
+ init_stage(&ss);
+
+ /* Perform the staging process while within the retry limit */
+ while (!staging_is_complete(&ss, &err)) {
+ /* Send a chunk of microcode each time: */
+ err = send_data_chunk(&ss, ucode_patch_late);
+ if (err)
+ break;
+ /*
+ * Then, ask the hardware which piece of the image it
+ * needs next. The same piece may be sent more than once.
+ */
+ err = fetch_next_offset(&ss);
+ if (err)
+ break;
+ }
+
+ iounmap(ss.mmio_base);
+
+ return err;
+}
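+
+/*
+ * Note (illustrative): the mapped window is only MBOX_REG_NUM *
+ * MBOX_REG_SIZE = 16 bytes -- the four dword registers defined at the
+ * top of this file -- so the entire image is streamed through a tiny
+ * MMIO mailbox rather than a shared memory buffer.
+ */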
+
+static void stage_microcode(void)
+{
+ unsigned int pkg_id = UINT_MAX;
+ int cpu, err;
+ u64 mmio_pa;
+
+ if (!IS_ALIGNED(get_totalsize(&ucode_patch_late->hdr), sizeof(u32))) {
+ pr_err("Microcode image 32-bit misaligned (0x%x), staging failed.\n",
+ get_totalsize(&ucode_patch_late->hdr));
+ return;
+ }
+
+ lockdep_assert_cpus_held();
+
+ /*
+ * The MMIO address is unique per package, and all the SMT
+ * primary threads are online here. Find each MMIO space by
+ * their package IDs to avoid duplicate staging.
+ */
+ for_each_cpu(cpu, cpu_primary_thread_mask) {
+ if (topology_logical_package_id(cpu) == pkg_id)
+ continue;
+
+ pkg_id = topology_logical_package_id(cpu);
+
+ err = rdmsrq_on_cpu(cpu, MSR_IA32_MCU_STAGING_MBOX_ADDR, &mmio_pa);
+ if (WARN_ON_ONCE(err))
+ return;
+
+ err = do_stage(mmio_pa);
+ if (err) {
+ pr_err("Error: staging failed (%d) for CPU%d at package %u.\n",
+ err, cpu, pkg_id);
+ return;
+ }
+ }
+
+ pr_info("Staging of patch revision 0x%x succeeded.\n", ucode_patch_late->hdr.rev);
+}
+
+static enum ucode_state __apply_microcode(struct ucode_cpu_info *uci,
+ struct microcode_intel *mc,
+ u32 *cur_rev)
+{
+ u32 rev;
+
+ if (!mc)
+ return UCODE_NFOUND;
+
+ /*
+ * Save us the MSR write below - which is a particularly expensive
+ * operation - when the other hyperthread has updated the microcode
+ * already.
+ */
+ *cur_rev = intel_get_microcode_revision();
+ if (*cur_rev >= mc->hdr.rev) {
+ uci->cpu_sig.rev = *cur_rev;
+ return UCODE_OK;
+ }
+
+ /* write microcode via MSR 0x79 */
+ native_wrmsrq(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits);
+
+ rev = intel_get_microcode_revision();
+ if (rev != mc->hdr.rev)
+ return UCODE_ERROR;
+
+ uci->cpu_sig.rev = rev;
+ return UCODE_UPDATED;
+}
+
+static enum ucode_state apply_microcode_early(struct ucode_cpu_info *uci)
+{
+ struct microcode_intel *mc = uci->mc;
+ u32 cur_rev;
+
+ return __apply_microcode(uci, mc, &cur_rev);
+}
+
+static __init bool load_builtin_intel_microcode(struct cpio_data *cp)
+{
+ unsigned int eax = 1, ebx, ecx = 0, edx;
+ struct firmware fw;
+ char name[30];
+
+ if (IS_ENABLED(CONFIG_X86_32))
+ return false;
+
+ native_cpuid(&eax, &ebx, &ecx, &edx);
+
+ sprintf(name, "intel-ucode/%02x-%02x-%02x",
+ x86_family(eax), x86_model(eax), x86_stepping(eax));
+
+ if (firmware_request_builtin(&fw, name)) {
+ cp->size = fw.size;
+ cp->data = (void *)fw.data;
+ return true;
+ }
+ return false;
+}
+
+static __init struct microcode_intel *get_microcode_blob(struct ucode_cpu_info *uci, bool save)
+{
+ struct cpio_data cp;
+
+ intel_collect_cpu_info(&uci->cpu_sig);
+
+ if (!load_builtin_intel_microcode(&cp))
+ cp = find_microcode_in_initrd(ucode_path);
+
+ if (!(cp.data && cp.size))
+ return NULL;
+
+ return scan_microcode(cp.data, cp.size, uci, save);
+}
+
+/*
+ * Invoked from an early init call to save the microcode blob which was
+ * selected during early boot when mm was not usable. The microcode must be
+ * saved because initrd is going away. It's an early init call so the APs
+ * just can use the pointer and do not have to scan initrd/builtin firmware
+ * again.
+ */
+static int __init save_builtin_microcode(void)
+{
+ struct ucode_cpu_info uci;
+
+ if (xchg(&ucode_patch_va, NULL) != UCODE_BSP_LOADED)
+ return 0;
+
+ if (microcode_loader_disabled() || boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ return 0;
+
+ uci.mc = get_microcode_blob(&uci, true);
+ if (uci.mc)
+ save_microcode_patch(uci.mc);
+ return 0;
+}
+early_initcall(save_builtin_microcode);
+
+/* Load microcode on BSP from initrd or builtin blobs */
+void __init load_ucode_intel_bsp(struct early_load_data *ed)
+{
+ struct ucode_cpu_info uci;
+
+ uci.mc = get_microcode_blob(&uci, false);
+ ed->old_rev = uci.cpu_sig.rev;
+
+ if (uci.mc && apply_microcode_early(&uci) == UCODE_UPDATED) {
+ ucode_patch_va = UCODE_BSP_LOADED;
+ ed->new_rev = uci.cpu_sig.rev;
+ }
+}
+
+void load_ucode_intel_ap(void)
+{
+ struct ucode_cpu_info uci;
+
+ uci.mc = ucode_patch_va;
+ if (uci.mc)
+ apply_microcode_early(&uci);
+}
+
+/* Reload microcode on resume */
+void reload_ucode_intel(void)
+{
+ struct ucode_cpu_info uci = { .mc = ucode_patch_va, };
+
+ if (uci.mc)
+ apply_microcode_early(&uci);
+}
+
+static int collect_cpu_info(int cpu_num, struct cpu_signature *csig)
+{
+ intel_collect_cpu_info(csig);
+ return 0;
+}
+
+static enum ucode_state apply_microcode_late(int cpu)
+{
+ struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
+ struct microcode_intel *mc = ucode_patch_late;
+ enum ucode_state ret;
+ u32 cur_rev;
+
+ if (WARN_ON_ONCE(smp_processor_id() != cpu))
+ return UCODE_ERROR;
+
+ ret = __apply_microcode(uci, mc, &cur_rev);
+ if (ret != UCODE_UPDATED && ret != UCODE_OK)
+ return ret;
+
+ cpu_data(cpu).microcode = uci->cpu_sig.rev;
+ if (!cpu)
+ boot_cpu_data.microcode = uci->cpu_sig.rev;
+
+ return ret;
+}
+
+static bool ucode_validate_minrev(struct microcode_header_intel *mc_header)
+{
+ int cur_rev = boot_cpu_data.microcode;
+
+ /*
+ * When late-loading, ensure the header declares a minimum revision
+ * required to perform a late-load. The previously reserved field
+ * is 0 in older microcode blobs.
+ */
+ if (!mc_header->min_req_ver) {
+ pr_info("Unsafe microcode update: Microcode header does not specify a required min version\n");
+ return false;
+ }
+
+ /*
+ * Check whether the current revision is greater than or equal to the
+ * minimum revision specified in the header.
+ */
+ if (cur_rev < mc_header->min_req_ver) {
+ pr_info("Unsafe microcode update: Current revision 0x%x too old\n", cur_rev);
+ pr_info("Current should be at 0x%x or higher. Use early loading instead\n", mc_header->min_req_ver);
+ return false;
+ }
+ return true;
+}
+
+static enum ucode_state parse_microcode_blobs(int cpu, struct iov_iter *iter)
+{
+ struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
+ bool is_safe, new_is_safe = false;
+ int cur_rev = uci->cpu_sig.rev;
+ unsigned int curr_mc_size = 0;
+ u8 *new_mc = NULL, *mc = NULL;
+
+ while (iov_iter_count(iter)) {
+ struct microcode_header_intel mc_header;
+ unsigned int mc_size, data_size;
+ u8 *data;
+
+ if (!copy_from_iter_full(&mc_header, sizeof(mc_header), iter)) {
+ pr_err("error! Truncated or inaccessible header in microcode data file\n");
+ goto fail;
+ }
+
+ mc_size = get_totalsize(&mc_header);
+ if (mc_size < sizeof(mc_header)) {
+ pr_err("error! Bad data in microcode data file (totalsize too small)\n");
+ goto fail;
+ }
+ data_size = mc_size - sizeof(mc_header);
+ if (data_size > iov_iter_count(iter)) {
+ pr_err("error! Bad data in microcode data file (truncated file?)\n");
+ goto fail;
+ }
+
+ /* For performance reasons, reuse mc area when possible */
+ if (!mc || mc_size > curr_mc_size) {
+ kvfree(mc);
+ mc = kvmalloc(mc_size, GFP_KERNEL);
+ if (!mc)
+ goto fail;
+ curr_mc_size = mc_size;
+ }
+
+ memcpy(mc, &mc_header, sizeof(mc_header));
+ data = mc + sizeof(mc_header);
+ if (!copy_from_iter_full(data, data_size, iter) ||
+ intel_microcode_sanity_check(mc, true, MC_HEADER_TYPE_MICROCODE) < 0)
+ goto fail;
+
+ if (cur_rev >= mc_header.rev)
+ continue;
+
+ if (!intel_find_matching_signature(mc, &uci->cpu_sig))
+ continue;
+
+ is_safe = ucode_validate_minrev(&mc_header);
+ if (force_minrev && !is_safe)
+ continue;
+
+ kvfree(new_mc);
+ cur_rev = mc_header.rev;
+ new_mc = mc;
+ new_is_safe = is_safe;
+ mc = NULL;
+ }
+
+ if (iov_iter_count(iter))
+ goto fail;
+
+ kvfree(mc);
+ if (!new_mc)
+ return UCODE_NFOUND;
+
+ ucode_patch_late = (struct microcode_intel *)new_mc;
+ return new_is_safe ? UCODE_NEW_SAFE : UCODE_NEW;
+
+fail:
+ kvfree(mc);
+ kvfree(new_mc);
+ return UCODE_ERROR;
+}
+
+static bool is_blacklisted(unsigned int cpu)
+{
+ struct cpuinfo_x86 *c = &cpu_data(cpu);
+
+ /*
+ * Late loading on model 79 with microcode revision less than 0x0b000021
+ * and LLC size per core bigger than 2.5MB may result in a system hang.
+ * This behavior is documented in item BDX90, #334165 (Intel Xeon
+ * Processor E7-8800/4800 v4 Product Family).
+ */
+ if (c->x86_vfm == INTEL_BROADWELL_X &&
+ c->x86_stepping == 0x01 &&
+ llc_size_per_core > 2621440 &&
+ c->microcode < 0x0b000021) {
+ pr_err_once("Erratum BDX90: late loading with revision < 0x0b000021 (0x%x) disabled.\n", c->microcode);
+ pr_err_once("Please consider either early loading through initrd/built-in or a potential BIOS update.\n");
+ return true;
+ }
+
+ return false;
+}
+
+static enum ucode_state request_microcode_fw(int cpu, struct device *device)
+{
+ struct cpuinfo_x86 *c = &cpu_data(cpu);
+ const struct firmware *firmware;
+ struct iov_iter iter;
+ enum ucode_state ret;
+ struct kvec kvec;
+ char name[30];
+
+ if (is_blacklisted(cpu))
+ return UCODE_NFOUND;
+
+ sprintf(name, "intel-ucode/%02x-%02x-%02x",
+ c->x86, c->x86_model, c->x86_stepping);
+
+ if (request_firmware_direct(&firmware, name, device)) {
+ pr_debug("data file %s load failed\n", name);
+ return UCODE_NFOUND;
+ }
+
+ kvec.iov_base = (void *)firmware->data;
+ kvec.iov_len = firmware->size;
+ iov_iter_kvec(&iter, ITER_SOURCE, &kvec, 1, firmware->size);
+ ret = parse_microcode_blobs(cpu, &iter);
+
+ release_firmware(firmware);
+
+ return ret;
+}
+
+static void finalize_late_load(int result)
+{
+ if (!result)
+ update_ucode_pointer(ucode_patch_late);
+ else
+ kvfree(ucode_patch_late);
+ ucode_patch_late = NULL;
+}
+
+static struct microcode_ops microcode_intel_ops = {
+ .request_microcode_fw = request_microcode_fw,
+ .collect_cpu_info = collect_cpu_info,
+ .apply_microcode = apply_microcode_late,
+ .finalize_late_load = finalize_late_load,
+ .stage_microcode = stage_microcode,
+ .use_nmi = IS_ENABLED(CONFIG_X86_64),
+};
+
+static __init void calc_llc_size_per_core(struct cpuinfo_x86 *c)
+{
+ u64 llc_size = c->x86_cache_size * 1024ULL;
+
+ do_div(llc_size, topology_num_cores_per_package());
+ llc_size_per_core = (unsigned int)llc_size;
+}
+
+static __init bool staging_available(void)
+{
+ u64 val;
+
+ val = x86_read_arch_cap_msr();
+ if (!(val & ARCH_CAP_MCU_ENUM))
+ return false;
+
+ rdmsrq(MSR_IA32_MCU_ENUMERATION, val);
+ return !!(val & MCU_STAGING);
+}
+
+struct microcode_ops * __init init_intel_microcode(void)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+
+ if (c->x86_vendor != X86_VENDOR_INTEL || c->x86 < 6 ||
+ cpu_has(c, X86_FEATURE_IA64)) {
+ pr_err("Intel CPU family 0x%x not supported\n", c->x86);
+ return NULL;
+ }
+
+ if (staging_available()) {
+ microcode_intel_ops.use_staging = true;
+ pr_info("Enabled staging feature.\n");
+ }
+
+ calc_llc_size_per_core(c);
+
+ return &microcode_intel_ops;
+}
diff --git a/arch/x86/kernel/cpu/microcode/internal.h b/arch/x86/kernel/cpu/microcode/internal.h
new file mode 100644
index 000000000000..a10b547eda1e
--- /dev/null
+++ b/arch/x86/kernel/cpu/microcode/internal.h
@@ -0,0 +1,136 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_MICROCODE_INTERNAL_H
+#define _X86_MICROCODE_INTERNAL_H
+
+#include <linux/earlycpio.h>
+#include <linux/initrd.h>
+
+#include <asm/cpu.h>
+#include <asm/microcode.h>
+
+struct device;
+
+enum ucode_state {
+ UCODE_OK = 0,
+ UCODE_NEW,
+ UCODE_NEW_SAFE,
+ UCODE_UPDATED,
+ UCODE_NFOUND,
+ UCODE_ERROR,
+ UCODE_TIMEOUT,
+ UCODE_OFFLINE,
+};
+
+struct microcode_ops {
+ enum ucode_state (*request_microcode_fw)(int cpu, struct device *dev);
+ void (*microcode_fini_cpu)(int cpu);
+
+ /*
+ * The generic 'microcode_core' part guarantees that the callbacks
+ * below run on a target CPU when they are being called.
+ * See also the "Synchronization" section in microcode_core.c.
+ */
+ enum ucode_state (*apply_microcode)(int cpu);
+ void (*stage_microcode)(void);
+ int (*collect_cpu_info)(int cpu, struct cpu_signature *csig);
+ void (*finalize_late_load)(int result);
+ unsigned int nmi_safe : 1,
+ use_nmi : 1,
+ use_staging : 1;
+};
+
+struct early_load_data {
+ u32 old_rev;
+ u32 new_rev;
+};
+
+extern struct early_load_data early_data;
+extern struct ucode_cpu_info ucode_cpu_info[];
+extern u32 microcode_rev[NR_CPUS];
+extern u32 base_rev;
+
+struct cpio_data find_microcode_in_initrd(const char *path);
+
+#define MAX_UCODE_COUNT 128
+
+#define QCHAR(a, b, c, d) ((a) + ((b) << 8) + ((c) << 16) + ((d) << 24))
+#define CPUID_INTEL1 QCHAR('G', 'e', 'n', 'u')
+#define CPUID_INTEL2 QCHAR('i', 'n', 'e', 'I')
+#define CPUID_INTEL3 QCHAR('n', 't', 'e', 'l')
+#define CPUID_AMD1 QCHAR('A', 'u', 't', 'h')
+#define CPUID_AMD2 QCHAR('e', 'n', 't', 'i')
+#define CPUID_AMD3 QCHAR('c', 'A', 'M', 'D')
+
+#define CPUID_IS(a, b, c, ebx, ecx, edx) \
+ (!(((ebx) ^ (a)) | ((edx) ^ (b)) | ((ecx) ^ (c))))
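+
+/*
+ * Example (illustrative): CPUID leaf 0 returns the vendor string in
+ * EBX, EDX, ECX; "GenuineIntel" appears as EBX = 0x756e6547 ("Genu"),
+ * EDX = 0x49656e69 ("ineI") and ECX = 0x6c65746e ("ntel"), matching
+ * the QCHAR() constants above.
+ */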
+
+/*
+ * In the early microcode loading phase on the BSP, boot_cpu_data is not set
+ * up yet, so x86_cpuid_vendor() is used to get the vendor ID.
+ *
+ * In the 32-bit AP case, accessing boot_cpu_data needs a linear address. To
+ * simplify the code, x86_cpuid_vendor() is used for the APs as well.
+ *
+ * x86_cpuid_vendor() gets vendor information directly from CPUID.
+ */
+static inline int x86_cpuid_vendor(void)
+{
+ u32 eax = 0x00000000;
+ u32 ebx, ecx = 0, edx;
+
+ native_cpuid(&eax, &ebx, &ecx, &edx);
+
+ if (CPUID_IS(CPUID_INTEL1, CPUID_INTEL2, CPUID_INTEL3, ebx, ecx, edx))
+ return X86_VENDOR_INTEL;
+
+ if (CPUID_IS(CPUID_AMD1, CPUID_AMD2, CPUID_AMD3, ebx, ecx, edx))
+ return X86_VENDOR_AMD;
+
+ return X86_VENDOR_UNKNOWN;
+}
+
+static inline unsigned int x86_cpuid_family(void)
+{
+ u32 eax = 0x00000001;
+ u32 ebx, ecx = 0, edx;
+
+ native_cpuid(&eax, &ebx, &ecx, &edx);
+
+ return x86_family(eax);
+}
+
+extern bool force_minrev;
+
+#ifdef CONFIG_CPU_SUP_AMD
+void load_ucode_amd_bsp(struct early_load_data *ed, unsigned int family);
+void load_ucode_amd_ap(unsigned int family);
+void reload_ucode_amd(unsigned int cpu);
+struct microcode_ops *init_amd_microcode(void);
+void exit_amd_microcode(void);
+#else /* CONFIG_CPU_SUP_AMD */
+static inline void load_ucode_amd_bsp(struct early_load_data *ed, unsigned int family) { }
+static inline void load_ucode_amd_ap(unsigned int family) { }
+static inline void reload_ucode_amd(unsigned int cpu) { }
+static inline struct microcode_ops *init_amd_microcode(void) { return NULL; }
+static inline void exit_amd_microcode(void) { }
+#endif /* !CONFIG_CPU_SUP_AMD */
+
+#ifdef CONFIG_CPU_SUP_INTEL
+void load_ucode_intel_bsp(struct early_load_data *ed);
+void load_ucode_intel_ap(void);
+void reload_ucode_intel(void);
+struct microcode_ops *init_intel_microcode(void);
+#else /* CONFIG_CPU_SUP_INTEL */
+static inline void load_ucode_intel_bsp(struct early_load_data *ed) { }
+static inline void load_ucode_intel_ap(void) { }
+static inline void reload_ucode_intel(void) { }
+static inline struct microcode_ops *init_intel_microcode(void) { return NULL; }
+#endif /* !CONFIG_CPU_SUP_INTEL */
+
+#define ucode_dbg(fmt, ...) \
+({ \
+ if (IS_ENABLED(CONFIG_MICROCODE_DBG)) \
+ pr_info(fmt, ##__VA_ARGS__); \
+})
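+
+/* Usage (illustrative): ucode_dbg("CPU%d: new rev: 0x%x\n", cpu, rev); */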
+
+#endif /* _X86_MICROCODE_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/mkcapflags.pl b/arch/x86/kernel/cpu/mkcapflags.pl
deleted file mode 100644
index dfea390e1608..000000000000
--- a/arch/x86/kernel/cpu/mkcapflags.pl
+++ /dev/null
@@ -1,32 +0,0 @@
-#!/usr/bin/perl
-#
-# Generate the x86_cap_flags[] array from include/asm-x86/cpufeature.h
-#
-
-($in, $out) = @ARGV;
-
-open(IN, "< $in\0") or die "$0: cannot open: $in: $!\n";
-open(OUT, "> $out\0") or die "$0: cannot create: $out: $!\n";
-
-print OUT "#include <asm/cpufeature.h>\n\n";
-print OUT "const char * const x86_cap_flags[NCAPINTS*32] = {\n";
-
-while (defined($line = <IN>)) {
- if ($line =~ /^\s*\#\s*define\s+(X86_FEATURE_(\S+))\s+(.*)$/) {
- $macro = $1;
- $feature = $2;
- $tail = $3;
- if ($tail =~ /\/\*\s*\"([^"]*)\".*\*\//) {
- $feature = $1;
- }
-
- if ($feature ne '') {
- printf OUT "\t%-32s = \"%s\",\n",
- "[$macro]", "\L$feature";
- }
- }
-}
-print OUT "};\n";
-
-close(IN);
-close(OUT);
diff --git a/arch/x86/kernel/cpu/mkcapflags.sh b/arch/x86/kernel/cpu/mkcapflags.sh
new file mode 100644
index 000000000000..68f537347466
--- /dev/null
+++ b/arch/x86/kernel/cpu/mkcapflags.sh
@@ -0,0 +1,73 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# Generate the x86_cap/bug_flags[] arrays from include/asm/cpufeatures.h
+#
+
+set -e
+
+OUT=$1
+
+dump_array()
+{
+ ARRAY=$1
+ SIZE=$2
+ PFX=$3
+ POSTFIX=$4
+ IN=$5
+
+ PFX_SZ=$(echo $PFX | wc -c)
+ TABS="$(printf '\t\t\t\t\t')"
+
+ echo "const char * const $ARRAY[$SIZE] = {"
+
+ # Iterate through any input lines starting with #define $PFX
+ sed -n -e 's/\t/ /g' -e "s/^ *# *define *$PFX//p" $IN |
+ while read i
+ do
+ # Name is everything up to the first whitespace
+ NAME="$(echo "$i" | sed 's/ .*//')"
+
+ # If the /* comment */ starts with a quote string, grab that.
+ VALUE="$(echo "$i" | sed -n 's@.*/\* *\("[^"]*"\).*\*/@\1@p')"
+ [ ! "$VALUE" ] && continue
+
+ # Name is uppercase, VALUE is all lowercase
+ VALUE="$(echo "$VALUE" | tr A-Z a-z)"
+
+ if [ -n "$POSTFIX" ]; then
+ T=$(( $PFX_SZ + $(echo $POSTFIX | wc -c) + 2 ))
+ TABS="$(printf '\t\t\t\t\t\t')"
+ TABCOUNT=$(( ( 6*8 - ($T + 1) - $(echo "$NAME" | wc -c) ) / 8 ))
+ printf "\t[%s - %s]%.*s = %s,\n" "$PFX$NAME" "$POSTFIX" "$TABCOUNT" "$TABS" "$VALUE"
+ else
+ TABCOUNT=$(( ( 5*8 - ($PFX_SZ + 1) - $(echo "$NAME" | wc -c) ) / 8 ))
+ printf "\t[%s]%.*s = %s,\n" "$PFX$NAME" "$TABCOUNT" "$TABS" "$VALUE"
+ fi
+ done
+ echo "};"
+}
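+
+# Illustrative: given a header line such as
+#   #define X86_FEATURE_FPU (0*32+ 0) /* "fpu" Onboard FPU */
+# dump_array emits the (tab-aligned) entry
+#   [X86_FEATURE_FPU] = "fpu",
+# and silently skips entries whose comment carries no quoted name.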
+
+trap 'rm "$OUT"' EXIT
+
+(
+ echo "#ifndef _ASM_X86_CPUFEATURES_H"
+ echo "#include <asm/cpufeatures.h>"
+ echo "#endif"
+ echo ""
+
+ dump_array "x86_cap_flags" "NCAPINTS*32" "X86_FEATURE_" "" $2
+ echo ""
+
+ dump_array "x86_bug_flags" "NBUGINTS*32" "X86_BUG_" "NCAPINTS*32" $2
+ echo ""
+
+ echo "#ifdef CONFIG_X86_VMX_FEATURE_NAMES"
+ echo "#ifndef _ASM_X86_VMXFEATURES_H"
+ echo "#include <asm/vmxfeatures.h>"
+ echo "#endif"
+ dump_array "x86_vmx_flags" "NVMXINTS*32" "VMX_FEATURE_" "" $3
+ echo "#endif /* CONFIG_X86_VMX_FEATURE_NAMES */"
+) > $OUT
+
+trap - EXIT
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 16f41bbe46b6..579fb2c64cfd 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -1,55 +1,771 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* HyperV Detection code.
*
* Copyright (C) 2010, Novell, Inc.
* Author : K. Y. Srinivasan <ksrinivasan@novell.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; version 2 of the License.
- *
*/
#include <linux/types.h>
-#include <linux/module.h>
+#include <linux/time.h>
+#include <linux/clocksource.h>
+#include <linux/init.h>
+#include <linux/export.h>
+#include <linux/hardirq.h>
+#include <linux/efi.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/kexec.h>
+#include <linux/random.h>
#include <asm/processor.h>
#include <asm/hypervisor.h>
-#include <asm/hyperv.h>
+#include <hyperv/hvhdk.h>
#include <asm/mshyperv.h>
+#include <asm/desc.h>
+#include <asm/idtentry.h>
+#include <asm/irq_regs.h>
+#include <asm/i8259.h>
+#include <asm/apic.h>
+#include <asm/timer.h>
+#include <asm/reboot.h>
+#include <asm/msr.h>
+#include <asm/nmi.h>
+#include <clocksource/hyperv_timer.h>
+#include <asm/numa.h>
+#include <asm/svm.h>
+/* Is Linux running on a nested Microsoft Hypervisor */
+bool hv_nested;
struct ms_hyperv_info ms_hyperv;
-static bool __init ms_hyperv_platform(void)
+#if IS_ENABLED(CONFIG_HYPERV)
+/*
+ * When running with the paravisor, controls proxying the synthetic interrupts
+ * from the host
+ */
+static bool hv_para_sint_proxy;
+
+static inline unsigned int hv_get_nested_msr(unsigned int reg)
+{
+ if (hv_is_sint_msr(reg))
+ return reg - HV_X64_MSR_SINT0 + HV_X64_MSR_NESTED_SINT0;
+
+ switch (reg) {
+ case HV_X64_MSR_SIMP:
+ return HV_X64_MSR_NESTED_SIMP;
+ case HV_X64_MSR_SIEFP:
+ return HV_X64_MSR_NESTED_SIEFP;
+ case HV_X64_MSR_SVERSION:
+ return HV_X64_MSR_NESTED_SVERSION;
+ case HV_X64_MSR_SCONTROL:
+ return HV_X64_MSR_NESTED_SCONTROL;
+ case HV_X64_MSR_EOM:
+ return HV_X64_MSR_NESTED_EOM;
+ default:
+ return reg;
+ }
+}
+
+u64 hv_get_non_nested_msr(unsigned int reg)
+{
+ u64 value;
+
+ if (hv_is_synic_msr(reg) && ms_hyperv.paravisor_present)
+ hv_ivm_msr_read(reg, &value);
+ else
+ rdmsrq(reg, value);
+ return value;
+}
+EXPORT_SYMBOL_GPL(hv_get_non_nested_msr);
+
+void hv_set_non_nested_msr(unsigned int reg, u64 value)
+{
+ if (hv_is_synic_msr(reg) && ms_hyperv.paravisor_present) {
+ /* The hypervisor will get the intercept. */
+ hv_ivm_msr_write(reg, value);
+
+ /* Using native_wrmsrq() so the following goes to the paravisor. */
+ if (hv_is_sint_msr(reg)) {
+ union hv_synic_sint sint = { .as_uint64 = value };
+
+ sint.proxy = hv_para_sint_proxy;
+ native_wrmsrq(reg, sint.as_uint64);
+ }
+ } else {
+ native_wrmsrq(reg, value);
+ }
+}
+EXPORT_SYMBOL_GPL(hv_set_non_nested_msr);
+
+/*
+ * Enable or disable proxying synthetic interrupts
+ * to the paravisor.
+ */
+void hv_para_set_sint_proxy(bool enable)
+{
+ hv_para_sint_proxy = enable;
+}
+
+/*
+ * Get the SynIC register value from the paravisor.
+ */
+u64 hv_para_get_synic_register(unsigned int reg)
+{
+ if (WARN_ON(!ms_hyperv.paravisor_present || !hv_is_synic_msr(reg)))
+ return ~0ULL;
+ return native_read_msr(reg);
+}
+
+/*
+ * Set the SynIC register value with the paravisor.
+ */
+void hv_para_set_synic_register(unsigned int reg, u64 val)
+{
+ if (WARN_ON(!ms_hyperv.paravisor_present || !hv_is_synic_msr(reg)))
+ return;
+ native_write_msr(reg, val);
+}
+
+u64 hv_get_msr(unsigned int reg)
+{
+ if (hv_nested)
+ reg = hv_get_nested_msr(reg);
+
+ return hv_get_non_nested_msr(reg);
+}
+EXPORT_SYMBOL_GPL(hv_get_msr);
+
+void hv_set_msr(unsigned int reg, u64 value)
+{
+ if (hv_nested)
+ reg = hv_get_nested_msr(reg);
+
+ hv_set_non_nested_msr(reg, value);
+}
+EXPORT_SYMBOL_GPL(hv_set_msr);
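
The hv_get_msr()/hv_set_msr() wrappers keep callers oblivious to nesting: on a nested (L1) hypervisor the SynIC accesses are transparently redirected to the NESTED_* MSR aliases. A minimal usage sketch, illustrative only:

    u64 sint;

    /* On a nested hypervisor this actually reads HV_X64_MSR_NESTED_SINT0. */
    sint = hv_get_msr(HV_X64_MSR_SINT0);
    hv_set_msr(HV_X64_MSR_SINT0, sint);
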
+
+static void (*mshv_handler)(void);
+static void (*vmbus_handler)(void);
+static void (*hv_stimer0_handler)(void);
+static void (*hv_kexec_handler)(void);
+static void (*hv_crash_handler)(struct pt_regs *regs);
+
+DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_callback)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+
+ inc_irq_stat(irq_hv_callback_count);
+ if (mshv_handler)
+ mshv_handler();
+
+ if (vmbus_handler)
+ vmbus_handler();
+
+ if (ms_hyperv.hints & HV_DEPRECATING_AEOI_RECOMMENDED)
+ apic_eoi();
+
+ set_irq_regs(old_regs);
+}
+
+void hv_setup_mshv_handler(void (*handler)(void))
+{
+ mshv_handler = handler;
+}
+
+void hv_setup_vmbus_handler(void (*handler)(void))
+{
+ vmbus_handler = handler;
+}
+
+void hv_remove_vmbus_handler(void)
+{
+ /* We have no way to deallocate the interrupt gate */
+ vmbus_handler = NULL;
+}
+
+/*
+ * Routines to do per-architecture handling of stimer0
+ * interrupts when in Direct Mode
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_hyperv_stimer0)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+
+ inc_irq_stat(hyperv_stimer0_count);
+ if (hv_stimer0_handler)
+ hv_stimer0_handler();
+ add_interrupt_randomness(HYPERV_STIMER0_VECTOR);
+ apic_eoi();
+
+ set_irq_regs(old_regs);
+}
+
+/* For x86/x64, override weak placeholders in hyperv_timer.c */
+void hv_setup_stimer0_handler(void (*handler)(void))
+{
+ hv_stimer0_handler = handler;
+}
+
+void hv_remove_stimer0_handler(void)
+{
+ /* We have no way to deallocate the interrupt gate */
+ hv_stimer0_handler = NULL;
+}
+
+void hv_setup_kexec_handler(void (*handler)(void))
+{
+ hv_kexec_handler = handler;
+}
+
+void hv_remove_kexec_handler(void)
+{
+ hv_kexec_handler = NULL;
+}
+
+void hv_setup_crash_handler(void (*handler)(struct pt_regs *regs))
+{
+ hv_crash_handler = handler;
+}
+
+void hv_remove_crash_handler(void)
+{
+ hv_crash_handler = NULL;
+}
+
+#ifdef CONFIG_KEXEC_CORE
+static void hv_machine_shutdown(void)
+{
+ if (kexec_in_progress && hv_kexec_handler)
+ hv_kexec_handler();
+
+ /*
+ * Call hv_cpu_die() on all the CPUs, otherwise later the hypervisor
+ * corrupts the old VP Assist Pages and can crash the kexec kernel.
+ */
+ if (kexec_in_progress)
+ cpuhp_remove_state(CPUHP_AP_HYPERV_ONLINE);
+
+ /* The function calls stop_other_cpus(). */
+ native_machine_shutdown();
+
+ /* Disable the hypercall page when there is only 1 active CPU. */
+ if (kexec_in_progress)
+ hyperv_cleanup();
+}
+#endif /* CONFIG_KEXEC_CORE */
+
+#ifdef CONFIG_CRASH_DUMP
+static void hv_guest_crash_shutdown(struct pt_regs *regs)
+{
+ if (hv_crash_handler)
+ hv_crash_handler(regs);
+
+ /* The function calls crash_smp_send_stop(). */
+ native_machine_crash_shutdown(regs);
+
+ /* Disable the hypercall page when there is only 1 active CPU. */
+ hyperv_cleanup();
+}
+#endif /* CONFIG_CRASH_DUMP */
+
+static u64 hv_ref_counter_at_suspend;
+static void (*old_save_sched_clock_state)(void);
+static void (*old_restore_sched_clock_state)(void);
+
+/*
+ * Hyper-V clock counter resets during hibernation. Save and restore clock
+ * offset during suspend/resume, while also considering the time passed
+ * before suspend. This makes sure that sched_clock, when using the hv tsc
+ * page based clocksource, proceeds from where it left off at suspend and
+ * shows correct timestamps for kernel messages after resume.
+ */
+static void save_hv_clock_tsc_state(void)
+{
+ hv_ref_counter_at_suspend = hv_read_reference_counter();
+}
+
+static void restore_hv_clock_tsc_state(void)
+{
+ /*
+ * Adjust the offsets used by hv tsc clocksource to
+ * account for the time spent before hibernation.
+ * adjusted value = reference counter (time) at suspend
+ * - reference counter (time) now.
+ */
+ hv_adj_sched_clock_offset(hv_ref_counter_at_suspend - hv_read_reference_counter());
+}
+
+/*
+ * Functions to override save_sched_clock_state and restore_sched_clock_state
+ * functions of x86_platform. The Hyper-V clock counter is reset during
+ * suspend-resume and the offset used to measure time needs to be
+ * corrected, post resume.
+ */
+static void hv_save_sched_clock_state(void)
+{
+ old_save_sched_clock_state();
+ save_hv_clock_tsc_state();
+}
+
+static void hv_restore_sched_clock_state(void)
+{
+ restore_hv_clock_tsc_state();
+ old_restore_sched_clock_state();
+}
+
+static void __init x86_setup_ops_for_tsc_pg_clock(void)
+{
+ if (!(ms_hyperv.features & HV_MSR_REFERENCE_TSC_AVAILABLE))
+ return;
+
+ old_save_sched_clock_state = x86_platform.save_sched_clock_state;
+ x86_platform.save_sched_clock_state = hv_save_sched_clock_state;
+
+ old_restore_sched_clock_state = x86_platform.restore_sched_clock_state;
+ x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state;
+}
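
A worked example of the adjustment above, with illustrative numbers (the Hyper-V reference counter ticks in 100 ns units):

    /*
     * Illustrative values only:
     *   at suspend:   hv_read_reference_counter() == 600,000,000
     *   after resume: hv_read_reference_counter() ==   1,000,000 (reset)
     *
     * restore_hv_clock_tsc_state() then calls
     * hv_adj_sched_clock_offset(600,000,000 - 1,000,000), shifting the
     * clocksource offset so sched_clock() resumes from roughly the
     * pre-suspend value instead of jumping backwards.
     */
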
+
+#ifdef CONFIG_X86_64
+DEFINE_STATIC_CALL(hv_hypercall, hv_std_hypercall);
+EXPORT_STATIC_CALL_TRAMP_GPL(hv_hypercall);
+#define hypercall_update(hc) static_call_update(hv_hypercall, hc)
+#endif
+#endif /* CONFIG_HYPERV */
+
+#ifndef hypercall_update
+#define hypercall_update(hc) (void)hc
+#endif
+
+static uint32_t __init ms_hyperv_platform(void)
{
u32 eax;
u32 hyp_signature[3];
if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
- return false;
+ return 0;
cpuid(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS,
&eax, &hyp_signature[0], &hyp_signature[1], &hyp_signature[2]);
- return eax >= HYPERV_CPUID_MIN &&
- eax <= HYPERV_CPUID_MAX &&
- !memcmp("Microsoft Hv", hyp_signature, 12);
+ if (eax < HYPERV_CPUID_MIN || eax > HYPERV_CPUID_MAX ||
+ memcmp("Microsoft Hv", hyp_signature, 12))
+ return 0;
+
+ /* HYPERCALL and VP_INDEX MSRs are mandatory for all features. */
+ eax = cpuid_eax(HYPERV_CPUID_FEATURES);
+ if (!(eax & HV_MSR_HYPERCALL_AVAILABLE)) {
+ pr_warn("x86/hyperv: HYPERCALL MSR not available.\n");
+ return 0;
+ }
+ if (!(eax & HV_MSR_VP_INDEX_AVAILABLE)) {
+ pr_warn("x86/hyperv: VP_INDEX MSR not available.\n");
+ return 0;
+ }
+
+ return HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
+}
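
For clarity, the 12-byte signature check above compares against the three vendor registers laid out contiguously in memory. A hedged sketch of the layout:

    /*
     * CPUID 0x40000000 (HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS) packs the
     * vendor string little-endian across EBX/ECX/EDX:
     *   hyp_signature[0] (EBX) = 'M' 'i' 'c' 'r'
     *   hyp_signature[1] (ECX) = 'o' 's' 'o' 'f'
     *   hyp_signature[2] (EDX) = 't' ' ' 'H' 'v'
     * so memcmp("Microsoft Hv", hyp_signature, 12) matches byte for byte.
     */
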
+
+#ifdef CONFIG_X86_LOCAL_APIC
+/*
+ * Prior to WS2016, Debug-VM sends NMIs to all CPUs, which makes it
+ * difficult to process CHANNELMSG_UNLOAD in case of a crash. Handle
+ * the unknown NMI on the first CPU that receives it.
+ */
+static int hv_nmi_unknown(unsigned int val, struct pt_regs *regs)
+{
+ static atomic_t nmi_cpu = ATOMIC_INIT(-1);
+ unsigned int old_cpu, this_cpu;
+
+ if (!unknown_nmi_panic)
+ return NMI_DONE;
+
+ old_cpu = -1;
+ this_cpu = raw_smp_processor_id();
+ if (!atomic_try_cmpxchg(&nmi_cpu, &old_cpu, this_cpu))
+ return NMI_HANDLED;
+
+ return NMI_DONE;
+}
+#endif
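
Condensed, the claim pattern above works as follows (a sketch of the intended behavior, not new code):

    /*
     * nmi_cpu starts at -1. The first CPU's try_cmpxchg(-1 -> this_cpu)
     * succeeds, so that CPU returns NMI_DONE and the unknown-NMI path
     * (unknown_nmi_panic) runs exactly once, on that CPU. Every other
     * CPU fails the exchange and returns NMI_HANDLED, swallowing its
     * copy of the broadcast NMI.
     */
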
+
+static unsigned long hv_get_tsc_khz(void)
+{
+ unsigned long freq;
+
+ rdmsrq(HV_X64_MSR_TSC_FREQUENCY, freq);
+
+ return freq / 1000;
+}
+
+#if defined(CONFIG_SMP) && IS_ENABLED(CONFIG_HYPERV)
+static void __init hv_smp_prepare_boot_cpu(void)
+{
+ native_smp_prepare_boot_cpu();
+#if defined(CONFIG_X86_64) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+ hv_init_spinlocks();
+#endif
+}
+
+static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
+{
+#ifdef CONFIG_X86_64
+ int i;
+ int ret;
+#endif
+
+ native_smp_prepare_cpus(max_cpus);
+
+ /*
+ * Override wakeup_secondary_cpu_64 callback for SEV-SNP
+ * enlightened guest.
+ */
+ if (!ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
+ apic->wakeup_secondary_cpu_64 = hv_snp_boot_ap;
+ return;
+ }
+
+#ifdef CONFIG_X86_64
+ for_each_present_cpu(i) {
+ if (i == 0)
+ continue;
+ ret = hv_call_add_logical_proc(numa_cpu_node(i), i, cpu_physical_id(i));
+ BUG_ON(ret);
+ }
+
+ for_each_present_cpu(i) {
+ if (i == 0)
+ continue;
+ ret = hv_call_create_vp(numa_cpu_node(i), hv_current_partition_id, i, i);
+ BUG_ON(ret);
+ }
+#endif
+}
+#endif
+
+/*
+ * When a fully enlightened TDX VM runs on Hyper-V, the firmware sets the
+ * HW_REDUCED flag: refer to acpi_tb_create_local_fadt(). Consequently ttyS0
+ * interrupts can't work because request_irq() -> ... -> irq_to_desc() returns
+ * NULL for ttyS0. This happens because mp_config_acpi_legacy_irqs() sees a
+ * nr_legacy_irqs() of 0, so it doesn't initialize the array 'mp_irqs[]', and
+ * later setup_IO_APIC_irqs() -> find_irq_entry() fails to find the legacy irqs
+ * from the array and hence doesn't create the necessary irq description info.
+ *
+ * Clone arch/x86/kernel/acpi/boot.c: acpi_generic_reduced_hw_init() here,
+ * except don't change 'legacy_pic', which keeps its default value
+ * 'default_legacy_pic'. This way, mp_config_acpi_legacy_irqs() sees a non-zero
+ * nr_legacy_irqs() and eventually the serial console interrupts work properly.
+ */
+static void __init reduced_hw_init(void)
+{
+ x86_init.timers.timer_init = x86_init_noop;
+ x86_init.irqs.pre_vector_init = x86_init_noop;
+}
+
+int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
+{
+ unsigned int hv_max_functions;
+
+ hv_max_functions = cpuid_eax(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS);
+ if (hv_max_functions < HYPERV_CPUID_VERSION) {
+ pr_err("%s: Could not detect Hyper-V version\n", __func__);
+ return -ENODEV;
+ }
+
+ cpuid(HYPERV_CPUID_VERSION, &info->eax, &info->ebx, &info->ecx, &info->edx);
+
+ return 0;
}
+EXPORT_SYMBOL_GPL(hv_get_hypervisor_version);
static void __init ms_hyperv_init_platform(void)
{
+ int hv_max_functions_eax, eax;
+
+#ifdef CONFIG_PARAVIRT
+ pv_info.name = "Hyper-V";
+#endif
+
/*
* Extract the features and hints
*/
ms_hyperv.features = cpuid_eax(HYPERV_CPUID_FEATURES);
+ ms_hyperv.priv_high = cpuid_ebx(HYPERV_CPUID_FEATURES);
+ ms_hyperv.ext_features = cpuid_ecx(HYPERV_CPUID_FEATURES);
+ ms_hyperv.misc_features = cpuid_edx(HYPERV_CPUID_FEATURES);
ms_hyperv.hints = cpuid_eax(HYPERV_CPUID_ENLIGHTMENT_INFO);
- printk(KERN_INFO "HyperV: features 0x%x, hints 0x%x\n",
- ms_hyperv.features, ms_hyperv.hints);
+ hv_max_functions_eax = cpuid_eax(HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS);
+
+ pr_info("Hyper-V: privilege flags low %#x, high %#x, ext %#x, hints %#x, misc %#x\n",
+ ms_hyperv.features, ms_hyperv.priv_high,
+ ms_hyperv.ext_features, ms_hyperv.hints,
+ ms_hyperv.misc_features);
+
+ ms_hyperv.max_vp_index = cpuid_eax(HYPERV_CPUID_IMPLEMENT_LIMITS);
+ ms_hyperv.max_lp_index = cpuid_ebx(HYPERV_CPUID_IMPLEMENT_LIMITS);
+
+ pr_debug("Hyper-V: max %u virtual processors, %u logical processors\n",
+ ms_hyperv.max_vp_index, ms_hyperv.max_lp_index);
+
+ hv_identify_partition_type();
+
+ if (cc_platform_has(CC_ATTR_SNP_SECURE_AVIC))
+ ms_hyperv.hints |= HV_DEPRECATING_AEOI_RECOMMENDED;
+
+ if (ms_hyperv.hints & HV_X64_HYPERV_NESTED) {
+ hv_nested = true;
+ pr_info("Hyper-V: running on a nested hypervisor\n");
+ }
+
+ /*
+ * There is no check against the max function for HYPERV_CPUID_VIRT_STACK_* CPUID
+ * leaves as the hypervisor doesn't handle them. Even a nested root partition (L2
+ * root) will not get them because the nested (L1) hypervisor filters them out.
+ * These are handled through intercept processing by the Windows Hyper-V stack
+ * or the paravisor.
+ */
+ eax = cpuid_eax(HYPERV_CPUID_VIRT_STACK_PROPERTIES);
+ ms_hyperv.confidential_vmbus_available =
+ eax & HYPERV_VS_PROPERTIES_EAX_CONFIDENTIAL_VMBUS_AVAILABLE;
+ ms_hyperv.msi_ext_dest_id =
+ eax & HYPERV_VS_PROPERTIES_EAX_EXTENDED_IOAPIC_RTE;
+
+ if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
+ ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
+ x86_platform.calibrate_tsc = hv_get_tsc_khz;
+ x86_platform.calibrate_cpu = hv_get_tsc_khz;
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+ }
+
+ if (ms_hyperv.priv_high & HV_ISOLATION) {
+ ms_hyperv.isolation_config_a = cpuid_eax(HYPERV_CPUID_ISOLATION_CONFIG);
+ ms_hyperv.isolation_config_b = cpuid_ebx(HYPERV_CPUID_ISOLATION_CONFIG);
+
+ if (ms_hyperv.shared_gpa_boundary_active)
+ ms_hyperv.shared_gpa_boundary =
+ BIT_ULL(ms_hyperv.shared_gpa_boundary_bits);
+
+ pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n",
+ ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b);
+
+ if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) {
+ static_branch_enable(&isolation_type_snp);
+ if (!ms_hyperv.paravisor_present)
+ hypercall_update(hv_snp_hypercall);
+ } else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_TDX) {
+ static_branch_enable(&isolation_type_tdx);
+
+ /* A TDX VM must use x2APIC and doesn't use lazy EOI. */
+ ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED;
+
+ if (!ms_hyperv.paravisor_present) {
+ hypercall_update(hv_tdx_hypercall);
+ /*
+ * Mark the Hyper-V TSC page feature as disabled
+ * in a TDX VM without paravisor so that the
+ * Invariant TSC, which is a better clocksource
+ * anyway, is used instead.
+ */
+ ms_hyperv.features &= ~HV_MSR_REFERENCE_TSC_AVAILABLE;
+
+ /*
+ * The Invariant TSC is expected to be available
+ * in a TDX VM without paravisor, but if not,
+ * print a warning message. The slower Hyper-V MSR-based
+ * Ref Counter should end up being the clocksource.
+ */
+ if (!(ms_hyperv.features & HV_ACCESS_TSC_INVARIANT))
+ pr_warn("Hyper-V: Invariant TSC is unavailable\n");
+
+ /* HV_MSR_CRASH_CTL is unsupported. */
+ ms_hyperv.misc_features &= ~HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
+
+ /* Don't trust Hyper-V's TLB-flushing hypercalls. */
+ ms_hyperv.hints &= ~HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED;
+
+ x86_init.acpi.reduced_hw_early_init = reduced_hw_init;
+ }
+ }
+ }
+
+ if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {
+ ms_hyperv.nested_features =
+ cpuid_eax(HYPERV_CPUID_NESTED_FEATURES);
+ pr_info("Hyper-V: Nested features: 0x%x\n",
+ ms_hyperv.nested_features);
+ }
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ if (ms_hyperv.features & HV_ACCESS_FREQUENCY_MSRS &&
+ ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
+ /*
+ * Get the APIC frequency.
+ */
+ u64 hv_lapic_frequency;
+
+ rdmsrq(HV_X64_MSR_APIC_FREQUENCY, hv_lapic_frequency);
+ hv_lapic_frequency = div_u64(hv_lapic_frequency, HZ);
+ lapic_timer_period = hv_lapic_frequency;
+ pr_info("Hyper-V: LAPIC Timer Frequency: %#x\n",
+ lapic_timer_period);
+ }
+
+ register_nmi_handler(NMI_UNKNOWN, hv_nmi_unknown, NMI_FLAG_FIRST,
+ "hv_nmi_unknown");
+#endif
+
+#ifdef CONFIG_X86_IO_APIC
+ no_timer_check = 1;
+#endif
+
+#if IS_ENABLED(CONFIG_HYPERV)
+ if (hv_root_partition())
+ machine_ops.power_off = hv_machine_power_off;
+#if defined(CONFIG_KEXEC_CORE)
+ machine_ops.shutdown = hv_machine_shutdown;
+#endif
+#if defined(CONFIG_CRASH_DUMP)
+ if (!hv_root_partition())
+ machine_ops.crash_shutdown = hv_guest_crash_shutdown;
+#endif
+#endif
+ /*
+ * HV_ACCESS_TSC_INVARIANT is always zero for the root partition. The
+ * root partition doesn't need to write to the synthetic MSR to enable
+ * the invariant TSC feature: it sees what the hardware provides.
+ */
+ if (ms_hyperv.features & HV_ACCESS_TSC_INVARIANT) {
+ /*
+ * Writing to synthetic MSR 0x40000118 updates/changes the
+ * guest visible CPUIDs. Setting bit 0 of this MSR enables
+ * guests to report invariant TSC feature through CPUID
+ * instruction, CPUID 0x80000007/EDX, bit 8. See code in
+ * early_init_intel() where this bit is examined. The
+ * setting of this MSR bit should happen before init_intel()
+ * is called.
+ */
+ wrmsrq(HV_X64_MSR_TSC_INVARIANT_CONTROL, HV_EXPOSE_INVARIANT_TSC);
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+ }
+
+ /*
+ * Generation 2 instances don't support reading the NMI status from
+ * 0x61 port.
+ */
+ if (efi_enabled(EFI_BOOT))
+ x86_platform.get_nmi_reason = hv_get_nmi_reason;
+
+#if IS_ENABLED(CONFIG_HYPERV)
+ if ((hv_get_isolation_type() == HV_ISOLATION_TYPE_VBS) ||
+ ms_hyperv.paravisor_present)
+ hv_vtom_init();
+ /*
+ * Setup the hook to get control post apic initialization.
+ */
+ x86_platform.apic_post_init = hyperv_init;
+ hyperv_setup_mmu_ops();
+
+ /* Install system interrupt handler for hypervisor callback */
+ sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_hyperv_callback);
+
+ /* Install system interrupt handler for reenlightenment notifications */
+ if (ms_hyperv.features & HV_ACCESS_REENLIGHTENMENT) {
+ sysvec_install(HYPERV_REENLIGHTENMENT_VECTOR, sysvec_hyperv_reenlightenment);
+ }
+
+ /* Install system interrupt handler for stimer0 */
+ if (ms_hyperv.misc_features & HV_STIMER_DIRECT_MODE_AVAILABLE) {
+ sysvec_install(HYPERV_STIMER0_VECTOR, sysvec_hyperv_stimer0);
+ }
+
+# ifdef CONFIG_SMP
+ smp_ops.smp_prepare_boot_cpu = hv_smp_prepare_boot_cpu;
+ if (hv_root_partition() ||
+ (!ms_hyperv.paravisor_present && hv_isolation_type_snp()))
+ smp_ops.smp_prepare_cpus = hv_smp_prepare_cpus;
+# endif
+
+ /*
+ * Hyper-V doesn't provide irq remapping for the IO-APIC. To enable
+ * x2apic, set the x2apic destination mode to physical mode when x2apic
+ * is available; the Hyper-V IOMMU driver makes sure CPUs assigned
+ * IO-APIC irqs have an 8-bit APIC id.
+ */
+# ifdef CONFIG_X86_X2APIC
+ if (x2apic_supported())
+ x2apic_phys = 1;
+# endif
+
+ /* Register Hyper-V specific clocksource */
+ hv_init_clocksource();
+ x86_setup_ops_for_tsc_pg_clock();
+ hv_vtl_init_platform();
+#endif
+ /*
+ * TSC should be marked as unstable only after Hyper-V
+ * clocksource has been initialized. This ensures that the
+ * stability of the sched_clock is not altered.
+ *
+ * HV_ACCESS_TSC_INVARIANT is always zero for the root partition. No
+ * need to check for it.
+ */
+ if (!hv_root_partition() &&
+ !(ms_hyperv.features & HV_ACCESS_TSC_INVARIANT))
+ mark_tsc_unstable("running on Hyper-V");
+
+ hardlockup_detector_disable();
+}
+
+static bool __init ms_hyperv_x2apic_available(void)
+{
+ return x2apic_supported();
+}
+
+/*
+ * If ms_hyperv_msi_ext_dest_id() returns true, hyperv_prepare_irq_remapping()
+ * returns -ENODEV and the Hyper-V IOMMU driver is not used; instead, the
+ * generic support of the 15-bit APIC ID is used: see __irq_msi_compose_msg().
+ *
+ * Note: for a VM on Hyper-V, the I/O-APIC is the only device which
+ * (logically) generates MSIs directly to the system APIC irq domain.
+ * There is no HPET, and PCI MSI/MSI-X interrupts are remapped by the
+ * pci-hyperv host bridge.
+ *
+ * Note: for a Hyper-V root partition, this will always return false.
+ */
+static bool __init ms_hyperv_msi_ext_dest_id(void)
+{
+ return ms_hyperv.msi_ext_dest_id;
+}
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+static void hv_sev_es_hcall_prepare(struct ghcb *ghcb, struct pt_regs *regs)
+{
+ /* RAX and CPL are already in the GHCB */
+ ghcb_set_rcx(ghcb, regs->cx);
+ ghcb_set_rdx(ghcb, regs->dx);
+ ghcb_set_r8(ghcb, regs->r8);
+}
+
+static bool hv_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
+{
+ /* No checking of the return state needed */
+ return true;
}
+#endif
-const __refconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
- .name = "Microsoft HyperV",
+const __initconst struct hypervisor_x86 x86_hyper_ms_hyperv = {
+ .name = "Microsoft Hyper-V",
.detect = ms_hyperv_platform,
- .init_platform = ms_hyperv_init_platform,
+ .type = X86_HYPER_MS_HYPERV,
+ .init.x2apic_available = ms_hyperv_x2apic_available,
+ .init.msi_ext_dest_id = ms_hyperv_msi_ext_dest_id,
+ .init.init_platform = ms_hyperv_init_platform,
+ .init.guest_late_init = ms_hyperv_late_init,
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+ .runtime.sev_es_hcall_prepare = hv_sev_es_hcall_prepare,
+ .runtime.sev_es_hcall_finish = hv_sev_es_hcall_finish,
+#endif
};
-EXPORT_SYMBOL(x86_hyper_ms_hyperv);
diff --git a/arch/x86/kernel/cpu/mtrr/Makefile b/arch/x86/kernel/cpu/mtrr/Makefile
index ad9e5ed81181..aee4bc5ad496 100644
--- a/arch/x86/kernel/cpu/mtrr/Makefile
+++ b/arch/x86/kernel/cpu/mtrr/Makefile
@@ -1,3 +1,4 @@
-obj-y := main.o if.o generic.o cleanup.o
-obj-$(CONFIG_X86_32) += amd.o cyrix.o centaur.o
+# SPDX-License-Identifier: GPL-2.0-only
+obj-y := mtrr.o if.o generic.o cleanup.o
+obj-$(CONFIG_X86_32) += amd.o cyrix.o centaur.o legacy.o
diff --git a/arch/x86/kernel/cpu/mtrr/amd.c b/arch/x86/kernel/cpu/mtrr/amd.c
index 92ba9cd31c9a..ef3e8e42b782 100644
--- a/arch/x86/kernel/cpu/mtrr/amd.c
+++ b/arch/x86/kernel/cpu/mtrr/amd.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/init.h>
#include <linux/mm.h>
#include <asm/mtrr.h>
@@ -108,17 +109,11 @@ amd_validate_add_page(unsigned long base, unsigned long size, unsigned int type)
return 0;
}
-static const struct mtrr_ops amd_mtrr_ops = {
- .vendor = X86_VENDOR_AMD,
+const struct mtrr_ops amd_mtrr_ops = {
+ .var_regs = 2,
.set = amd_set_mtrr,
.get = amd_get_mtrr,
.get_free_region = generic_get_free_region,
.validate_add_page = amd_validate_add_page,
.have_wrcomb = positive_have_wrcomb,
};
-
-int __init amd_init_mtrr(void)
-{
- set_mtrr_ops(&amd_mtrr_ops);
- return 0;
-}
diff --git a/arch/x86/kernel/cpu/mtrr/centaur.c b/arch/x86/kernel/cpu/mtrr/centaur.c
index 316fe3e60a97..6f6c3ae92943 100644
--- a/arch/x86/kernel/cpu/mtrr/centaur.c
+++ b/arch/x86/kernel/cpu/mtrr/centaur.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/init.h>
#include <linux/mm.h>
@@ -44,15 +45,6 @@ centaur_get_free_region(unsigned long base, unsigned long size, int replace_reg)
return -ENOSPC;
}
-/*
- * Report boot time MCR setups
- */
-void mtrr_centaur_report_mcr(int mcr, u32 lo, u32 hi)
-{
- centaur_mcr[mcr].low = lo;
- centaur_mcr[mcr].high = hi;
-}
-
static void
centaur_get_mcr(unsigned int reg, unsigned long *base,
unsigned long *size, mtrr_type * type)
@@ -103,24 +95,18 @@ centaur_validate_add_page(unsigned long base, unsigned long size, unsigned int t
*/
if (type != MTRR_TYPE_WRCOMB &&
(centaur_mcr_type == 0 || type != MTRR_TYPE_UNCACHABLE)) {
- pr_warning("mtrr: only write-combining%s supported\n",
+ pr_warn("mtrr: only write-combining%s supported\n",
centaur_mcr_type ? " and uncacheable are" : " is");
return -EINVAL;
}
return 0;
}
-static const struct mtrr_ops centaur_mtrr_ops = {
- .vendor = X86_VENDOR_CENTAUR,
+const struct mtrr_ops centaur_mtrr_ops = {
+ .var_regs = 8,
.set = centaur_set_mcr,
.get = centaur_get_mcr,
.get_free_region = centaur_get_free_region,
.validate_add_page = centaur_validate_add_page,
.have_wrcomb = positive_have_wrcomb,
};
-
-int __init centaur_init_mtrr(void)
-{
- set_mtrr_ops(&centaur_mtrr_ops);
- return 0;
-}
diff --git a/arch/x86/kernel/cpu/mtrr/cleanup.c b/arch/x86/kernel/cpu/mtrr/cleanup.c
index 06130b52f012..763534d77f59 100644
--- a/arch/x86/kernel/cpu/mtrr/cleanup.c
+++ b/arch/x86/kernel/cpu/mtrr/cleanup.c
@@ -1,23 +1,9 @@
+// SPDX-License-Identifier: LGPL-2.0+
/*
* MTRR (Memory Type Range Register) cleanup
*
* Copyright (C) 2009 Yinghai Lu
- *
- * This library is free software; you can redistribute it and/or
- * modify it under the terms of the GNU Library General Public
- * License as published by the Free Software Foundation; either
- * version 2 of the License, or (at your option) any later version.
- *
- * This library is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * Library General Public License for more details.
- *
- * You should have received a copy of the GNU Library General Public
- * License along with this library; if not, write to the Free
- * Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
*/
-#include <linux/module.h>
#include <linux/init.h>
#include <linux/pci.h>
#include <linux/smp.h>
@@ -28,7 +14,7 @@
#include <linux/range.h>
#include <asm/processor.h>
-#include <asm/e820.h>
+#include <asm/e820/api.h>
#include <asm/mtrr.h>
#include <asm/msr.h>
@@ -56,10 +42,7 @@ static int __initdata nr_range;
static struct var_mtrr_range_state __initdata range_state[RANGE_NUM];
-static int __initdata debug_print;
-#define Dprintk(x...) do { if (debug_print) printk(KERN_DEBUG x); } while (0)
-
-#define BIOS_BUG_MSG KERN_WARNING \
+#define BIOS_BUG_MSG \
"WARNING: BIOS bug: VAR MTRR %d contains strange UC entry under 1M, check with your system vendor!\n"
static int __init
@@ -80,12 +63,11 @@ x86_get_mtrr_mem_range(struct range *range, int nr_range,
nr_range = add_range_with_merge(range, RANGE_NUM, nr_range,
base, base + size);
}
- if (debug_print) {
- printk(KERN_DEBUG "After WB checking\n");
- for (i = 0; i < nr_range; i++)
- printk(KERN_DEBUG "MTRR MAP PFN: %016llx - %016llx\n",
- range[i].start, range[i].end);
- }
+
+ Dprintk("After WB checking\n");
+ for (i = 0; i < nr_range; i++)
+ Dprintk("MTRR MAP PFN: %016llx - %016llx\n",
+ range[i].start, range[i].end);
/* Take out UC ranges: */
for (i = 0; i < num_var_ranges; i++) {
@@ -98,9 +80,10 @@ x86_get_mtrr_mem_range(struct range *range, int nr_range,
continue;
base = range_state[i].base_pfn;
if (base < (1<<(20-PAGE_SHIFT)) && mtrr_state.have_fixed &&
- (mtrr_state.enabled & 1)) {
+ (mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED) &&
+ (mtrr_state.enabled & MTRR_STATE_MTRR_FIXED_ENABLED)) {
/* Var MTRR contains UC entry below 1M? Skip it: */
- printk(BIOS_BUG_MSG, i);
+ pr_warn(BIOS_BUG_MSG, i);
if (base + size <= (1<<(20-PAGE_SHIFT)))
continue;
size -= (1<<(20-PAGE_SHIFT)) - base;
@@ -112,24 +95,22 @@ x86_get_mtrr_mem_range(struct range *range, int nr_range,
subtract_range(range, RANGE_NUM, extra_remove_base,
extra_remove_base + extra_remove_size);
- if (debug_print) {
- printk(KERN_DEBUG "After UC checking\n");
- for (i = 0; i < RANGE_NUM; i++) {
- if (!range[i].end)
- continue;
- printk(KERN_DEBUG "MTRR MAP PFN: %016llx - %016llx\n",
- range[i].start, range[i].end);
- }
+ Dprintk("After UC checking\n");
+ for (i = 0; i < RANGE_NUM; i++) {
+ if (!range[i].end)
+ continue;
+
+ Dprintk("MTRR MAP PFN: %016llx - %016llx\n",
+ range[i].start, range[i].end);
}
/* sort the ranges */
nr_range = clean_sort_range(range, RANGE_NUM);
- if (debug_print) {
- printk(KERN_DEBUG "After sorting\n");
- for (i = 0; i < nr_range; i++)
- printk(KERN_DEBUG "MTRR MAP PFN: %016llx - %016llx\n",
- range[i].start, range[i].end);
- }
+
+ Dprintk("After sorting\n");
+ for (i = 0; i < nr_range; i++)
+ Dprintk("MTRR MAP PFN: %016llx - %016llx\n",
+ range[i].start, range[i].end);
return nr_range;
}
@@ -164,16 +145,9 @@ static int __init enable_mtrr_cleanup_setup(char *str)
}
early_param("enable_mtrr_cleanup", enable_mtrr_cleanup_setup);
-static int __init mtrr_cleanup_debug_setup(char *str)
-{
- debug_print = 1;
- return 0;
-}
-early_param("mtrr_cleanup_debug", mtrr_cleanup_debug_setup);
-
static void __init
set_var_mtrr(unsigned int reg, unsigned long basek, unsigned long sizek,
- unsigned char type, unsigned int address_bits)
+ unsigned char type)
{
u32 base_lo, base_hi, mask_lo, mask_hi;
u64 base, mask;
@@ -183,7 +157,7 @@ set_var_mtrr(unsigned int reg, unsigned long basek, unsigned long sizek,
return;
}
- mask = (1ULL << address_bits) - 1;
+ mask = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
mask &= ~((((u64)sizek) << 10) - 1);
base = ((u64)basek) << 10;
@@ -209,7 +183,7 @@ save_var_mtrr(unsigned int reg, unsigned long basek, unsigned long sizek,
range_state[reg].type = type;
}
-static void __init set_var_mtrr_all(unsigned int address_bits)
+static void __init set_var_mtrr_all(void)
{
unsigned long basek, sizek;
unsigned char type;
@@ -220,7 +194,7 @@ static void __init set_var_mtrr_all(unsigned int address_bits)
sizek = range_state[reg].size_pfn << (PAGE_SHIFT - 10);
type = range_state[reg].type;
- set_var_mtrr(reg, basek, sizek, type, address_bits);
+ set_var_mtrr(reg, basek, sizek, type);
}
}
@@ -258,16 +232,16 @@ range_to_mtrr(unsigned int reg, unsigned long range_startk,
/* Compute the maximum size with which we can make a range: */
if (range_startk)
- max_align = ffs(range_startk) - 1;
+ max_align = __ffs(range_startk);
else
- max_align = 32;
+ max_align = BITS_PER_LONG - 1;
- align = fls(range_sizek) - 1;
+ align = __fls(range_sizek);
if (align > max_align)
align = max_align;
- sizek = 1 << align;
- if (debug_print) {
+ sizek = 1UL << align;
+ if (mtrr_debug) {
char start_factor = 'K', size_factor = 'K';
unsigned long start_base, size_base;
@@ -296,7 +270,7 @@ range_to_mtrr_with_hole(struct var_mtrr_state *state, unsigned long basek,
unsigned long sizek)
{
unsigned long hole_basek, hole_sizek;
- unsigned long second_basek, second_sizek;
+ unsigned long second_sizek;
unsigned long range0_basek, range0_sizek;
unsigned long range_basek, range_sizek;
unsigned long chunk_sizek;
@@ -304,7 +278,6 @@ range_to_mtrr_with_hole(struct var_mtrr_state *state, unsigned long basek,
hole_basek = 0;
hole_sizek = 0;
- second_basek = 0;
second_sizek = 0;
chunk_sizek = state->chunk_sizek;
gran_sizek = state->gran_sizek;
@@ -435,7 +408,7 @@ set_var_mtrr_range(struct var_mtrr_state *state, unsigned long base_pfn,
state->range_sizek = sizek - second_sizek;
}
-/* Mininum size of mtrr block that can take hole: */
+/* Minimum size of mtrr block that can take hole: */
static u64 mtrr_chunk_size __initdata = (256ULL<<20);
static int __init parse_mtrr_chunk_size_opt(char *p)
@@ -538,12 +511,12 @@ static void __init print_out_mtrr_range_state(void)
if (!size_base)
continue;
- size_base = to_size_factor(size_base, &size_factor),
+ size_base = to_size_factor(size_base, &size_factor);
start_base = range_state[i].base_pfn << (PAGE_SHIFT - 10);
- start_base = to_size_factor(start_base, &start_factor),
+ start_base = to_size_factor(start_base, &start_factor);
type = range_state[i].type;
- printk(KERN_DEBUG "reg %d, base: %ld%cB, range: %ld%cB, type %s\n",
+ Dprintk("reg %d, base: %ld%cB, range: %ld%cB, type %s\n",
i, start_base, start_factor,
size_base, size_factor,
(type == MTRR_TYPE_UNCACHABLE) ? "UC" :
@@ -592,9 +565,16 @@ mtrr_calc_range_state(u64 chunk_size, u64 gran_size,
unsigned long x_remove_base,
unsigned long x_remove_size, int i)
{
- static struct range range_new[RANGE_NUM];
+ /*
+ * range_new should really be an automatic variable, but
+ * putting 4096 bytes on the stack is frowned upon, to put it
+ * mildly. It is safe to make it a static __initdata variable,
+ * since mtrr_calc_range_state is only called during init and
+ * there's no way it will call itself recursively.
+ */
+ static struct range range_new[RANGE_NUM] __initdata;
unsigned long range_sums_new;
- static int nr_range_new;
+ int nr_range_new;
int num_reg;
/* Convert ranges to var ranges state: */
@@ -632,9 +612,9 @@ static void __init mtrr_print_out_one_result(int i)
unsigned long gran_base, chunk_base, lose_base;
char gran_factor, chunk_factor, lose_factor;
- gran_base = to_size_factor(result[i].gran_sizek, &gran_factor),
- chunk_base = to_size_factor(result[i].chunk_sizek, &chunk_factor),
- lose_base = to_size_factor(result[i].lose_cover_sizek, &lose_factor),
+ gran_base = to_size_factor(result[i].gran_sizek, &gran_factor);
+ chunk_base = to_size_factor(result[i].chunk_sizek, &chunk_factor);
+ lose_base = to_size_factor(result[i].lose_cover_sizek, &lose_factor);
pr_info("%sgran_size: %ld%c \tchunk_size: %ld%c \t",
result[i].bad ? "*BAD*" : " ",
@@ -674,7 +654,7 @@ static int __init mtrr_search_optimal_index(void)
return index_good;
}
-int __init mtrr_cleanup(unsigned address_bits)
+int __init mtrr_cleanup(void)
{
unsigned long x_remove_base, x_remove_size;
unsigned long base, size, def, dummy;
@@ -683,7 +663,10 @@ int __init mtrr_cleanup(unsigned address_bits)
int index_good;
int i;
- if (!is_cpu(INTEL) || enable_mtrr_cleanup < 1)
+ if (!mtrr_enabled())
+ return 0;
+
+ if (!cpu_feature_enabled(X86_FEATURE_MTRR) || enable_mtrr_cleanup < 1)
return 0;
rdmsr(MSR_MTRRdefType, def, dummy);
@@ -705,7 +688,7 @@ int __init mtrr_cleanup(unsigned address_bits)
return 0;
/* Print original var MTRRs at first, for debugging: */
- printk(KERN_DEBUG "original variable MTRRs\n");
+ Dprintk("original variable MTRRs\n");
print_out_mtrr_range_state();
memset(range, 0, sizeof(range));
@@ -714,18 +697,18 @@ int __init mtrr_cleanup(unsigned address_bits)
if (mtrr_tom2)
x_remove_size = (mtrr_tom2 >> PAGE_SHIFT) - x_remove_base;
- nr_range = x86_get_mtrr_mem_range(range, 0, x_remove_base, x_remove_size);
/*
* [0, 1M) should always be covered by var mtrr with WB
* and fixed mtrrs should take effect before var mtrr for it:
*/
- nr_range = add_range_with_merge(range, RANGE_NUM, nr_range, 0,
+ nr_range = add_range_with_merge(range, RANGE_NUM, 0, 0,
1ULL<<(20 - PAGE_SHIFT));
- /* Sort the ranges: */
- sort_range(range, nr_range);
+ /* add from var mtrr at last */
+ nr_range = x86_get_mtrr_mem_range(range, nr_range,
+ x_remove_base, x_remove_size);
range_sums = sum_ranges(range, nr_range);
- printk(KERN_INFO "total RAM covered: %ldM\n",
+ pr_info("total RAM covered: %ldM\n",
range_sums >> (20 - PAGE_SHIFT));
if (mtrr_chunk_size && mtrr_gran_size) {
@@ -736,13 +719,12 @@ int __init mtrr_cleanup(unsigned address_bits)
mtrr_print_out_one_result(i);
if (!result[i].bad) {
- set_var_mtrr_all(address_bits);
- printk(KERN_DEBUG "New variable MTRRs\n");
+ set_var_mtrr_all();
+ Dprintk("New variable MTRRs\n");
print_out_mtrr_range_state();
return 1;
}
- printk(KERN_INFO "invalid mtrr_gran_size or mtrr_chunk_size, "
- "will find optimal one\n");
+ pr_info("invalid mtrr_gran_size or mtrr_chunk_size, will find optimal one\n");
}
i = 0;
@@ -758,9 +740,9 @@ int __init mtrr_cleanup(unsigned address_bits)
mtrr_calc_range_state(chunk_size, gran_size,
x_remove_base, x_remove_size, i);
- if (debug_print) {
+ if (mtrr_debug) {
mtrr_print_out_one_result(i);
- printk(KERN_INFO "\n");
+ pr_info("\n");
}
i++;
@@ -771,7 +753,7 @@ int __init mtrr_cleanup(unsigned address_bits)
index_good = mtrr_search_optimal_index();
if (index_good != -1) {
- printk(KERN_INFO "Found optimal setting for mtrr clean up\n");
+ pr_info("Found optimal setting for mtrr clean up\n");
i = index_good;
mtrr_print_out_one_result(i);
@@ -781,8 +763,8 @@ int __init mtrr_cleanup(unsigned address_bits)
gran_size = result[i].gran_sizek;
gran_size <<= 10;
x86_setup_var_mtrrs(range, nr_range, chunk_size, gran_size);
- set_var_mtrr_all(address_bits);
- printk(KERN_DEBUG "New variable MTRRs\n");
+ set_var_mtrr_all();
+ Dprintk("New variable MTRRs\n");
print_out_mtrr_range_state();
return 1;
} else {
@@ -791,13 +773,13 @@ int __init mtrr_cleanup(unsigned address_bits)
mtrr_print_out_one_result(i);
}
- printk(KERN_INFO "mtrr_cleanup: can not find optimal value\n");
- printk(KERN_INFO "please specify mtrr_gran_size/mtrr_chunk_size\n");
+ pr_info("mtrr_cleanup: can not find optimal value\n");
+ pr_info("please specify mtrr_gran_size/mtrr_chunk_size\n");
return 0;
}
#else
-int __init mtrr_cleanup(unsigned address_bits)
+int __init mtrr_cleanup(void)
{
return 0;
}
@@ -825,12 +807,13 @@ int __init amd_special_default_mtrr(void)
{
u32 l, h;
- if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD)
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
+ boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
return 0;
- if (boot_cpu_data.x86 < 0xf || boot_cpu_data.x86 > 0x11)
+ if (boot_cpu_data.x86 < 0xf)
return 0;
/* In case some hypervisor doesn't pass SYSCFG through: */
- if (rdmsr_safe(MSR_K8_SYSCFG, &l, &h) < 0)
+ if (rdmsr_safe(MSR_AMD64_SYSCFG, &l, &h) < 0)
return 0;
/*
* Memory between 4GB and top of mem is forced WB by this magic bit.
@@ -854,7 +837,7 @@ real_trim_memory(unsigned long start_pfn, unsigned long limit_pfn)
trim_size <<= PAGE_SHIFT;
trim_size -= trim_start;
- return e820_update_range(trim_start, trim_size, E820_RAM, E820_RESERVED);
+ return e820__range_update(trim_start, trim_size, E820_TYPE_RAM, E820_TYPE_RESERVED);
}
/**
@@ -876,15 +859,18 @@ int __init mtrr_trim_uncached_memory(unsigned long end_pfn)
/* extra one for all 0 */
int num[MTRR_NUM_TYPES + 1];
+ if (!mtrr_enabled())
+ return 0;
+
/*
* Make sure we only trim uncachable memory on machines that
* support the Intel MTRR architecture:
*/
- if (!is_cpu(INTEL) || disable_mtrr_trim)
+ if (!cpu_feature_enabled(X86_FEATURE_MTRR) || disable_mtrr_trim)
return 0;
rdmsr(MSR_MTRRdefType, def, dummy);
- def &= 0xff;
+ def &= MTRR_DEF_TYPE_TYPE;
if (def != MTRR_TYPE_UNCACHABLE)
return 0;
@@ -910,7 +896,7 @@ int __init mtrr_trim_uncached_memory(unsigned long end_pfn)
/* kvm/qemu doesn't have mtrr set right, don't trim them all: */
if (!highest_pfn) {
- printk(KERN_INFO "CPU MTRRs all blank - virtualized system.\n");
+ pr_info("CPU MTRRs all blank - virtualized system.\n");
return 0;
}
@@ -965,13 +951,14 @@ int __init mtrr_trim_uncached_memory(unsigned long end_pfn)
end_pfn);
if (total_trim_size) {
- pr_warning("WARNING: BIOS bug: CPU MTRRs don't cover all of memory, losing %lluMB of RAM.\n", total_trim_size >> 20);
+ pr_warn("WARNING: BIOS bug: CPU MTRRs don't cover all of memory, losing %lluMB of RAM.\n",
+ total_trim_size >> 20);
if (!changed_by_mtrr_cleanup)
WARN_ON(1);
pr_info("update e820 for mtrr\n");
- update_e820();
+ e820__update_table_print();
return 1;
}
diff --git a/arch/x86/kernel/cpu/mtrr/cyrix.c b/arch/x86/kernel/cpu/mtrr/cyrix.c
index 68a3343e5798..238dad57d4d6 100644
--- a/arch/x86/kernel/cpu/mtrr/cyrix.c
+++ b/arch/x86/kernel/cpu/mtrr/cyrix.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/init.h>
#include <linux/io.h>
#include <linux/mm.h>
@@ -97,6 +98,7 @@ cyrix_get_free_region(unsigned long base, unsigned long size, int replace_reg)
case 7:
if (size < 0x40)
break;
+ fallthrough;
case 6:
case 5:
case 4:
@@ -137,9 +139,9 @@ static void prepare_set(void)
u32 cr0;
/* Save value of CR4 and clear Page Global Enable (bit 7) */
- if (cpu_has_pge) {
- cr4 = read_cr4();
- write_cr4(cr4 & ~X86_CR4_PGE);
+ if (boot_cpu_has(X86_FEATURE_PGE)) {
+ cr4 = __read_cr4();
+ __write_cr4(cr4 & ~X86_CR4_PGE);
}
/*
@@ -167,11 +169,11 @@ static void post_set(void)
setCx86(CX86_CCR3, ccr3);
/* Enable caches */
- write_cr0(read_cr0() & 0xbfffffff);
+ write_cr0(read_cr0() & ~X86_CR0_CD);
/* Restore value of CR4 */
- if (cpu_has_pge)
- write_cr4(cr4);
+ if (boot_cpu_has(X86_FEATURE_PGE))
+ __write_cr4(cr4);
}
static void cyrix_set_arr(unsigned int reg, unsigned long base,
@@ -232,51 +234,11 @@ static void cyrix_set_arr(unsigned int reg, unsigned long base,
post_set();
}
-typedef struct {
- unsigned long base;
- unsigned long size;
- mtrr_type type;
-} arr_state_t;
-
-static arr_state_t arr_state[8] = {
- {0UL, 0UL, 0UL}, {0UL, 0UL, 0UL}, {0UL, 0UL, 0UL}, {0UL, 0UL, 0UL},
- {0UL, 0UL, 0UL}, {0UL, 0UL, 0UL}, {0UL, 0UL, 0UL}, {0UL, 0UL, 0UL}
-};
-
-static unsigned char ccr_state[7] = { 0, 0, 0, 0, 0, 0, 0 };
-
-static void cyrix_set_all(void)
-{
- int i;
-
- prepare_set();
-
- /* the CCRs are not contiguous */
- for (i = 0; i < 4; i++)
- setCx86(CX86_CCR0 + i, ccr_state[i]);
- for (; i < 7; i++)
- setCx86(CX86_CCR4 + i, ccr_state[i]);
-
- for (i = 0; i < 8; i++) {
- cyrix_set_arr(i, arr_state[i].base,
- arr_state[i].size, arr_state[i].type);
- }
-
- post_set();
-}
-
-static const struct mtrr_ops cyrix_mtrr_ops = {
- .vendor = X86_VENDOR_CYRIX,
- .set_all = cyrix_set_all,
+const struct mtrr_ops cyrix_mtrr_ops = {
+ .var_regs = 8,
.set = cyrix_set_arr,
.get = cyrix_get_arr,
.get_free_region = cyrix_get_free_region,
.validate_add_page = generic_validate_add_page,
.have_wrcomb = positive_have_wrcomb,
};
-
-int __init cyrix_init_mtrr(void)
-{
- set_mtrr_ops(&cyrix_mtrr_ops);
- return 0;
-}
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index fd31a441c61c..0863733858dc 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -1,21 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* This only handles 32bit MTRR on 32bit hosts. This is strictly wrong
- * because MTRRs can span upto 40 bits (36bits on most modern x86)
+ * because MTRRs can span up to 40 bits (36bits on most modern x86)
*/
-#define DEBUG
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/init.h>
#include <linux/io.h>
#include <linux/mm.h>
-
+#include <linux/cc_platform.h>
+#include <linux/string_choices.h>
#include <asm/processor-flags.h>
+#include <asm/cacheinfo.h>
#include <asm/cpufeature.h>
+#include <asm/cpu_device_id.h>
+#include <asm/hypervisor.h>
+#include <asm/mshyperv.h>
#include <asm/tlbflush.h>
-#include <asm/system.h>
#include <asm/mtrr.h>
#include <asm/msr.h>
-#include <asm/pat.h>
+#include <asm/memtype.h>
#include "mtrr.h"
@@ -31,19 +35,70 @@ static struct fixed_range_block fixed_range_blocks[] = {
{}
};
+struct cache_map {
+ u64 start;
+ u64 end;
+ u64 flags;
+ u64 type:8;
+ u64 fixed:1;
+};
+
+bool mtrr_debug;
+
+static int __init mtrr_param_setup(char *str)
+{
+ int rc = 0;
+
+ if (!str)
+ return -EINVAL;
+ if (!strcmp(str, "debug"))
+ mtrr_debug = true;
+ else
+ rc = -EINVAL;
+
+ return rc;
+}
+early_param("mtrr", mtrr_param_setup);
+
+/*
+ * CACHE_MAP_MAX is the maximum number of memory ranges in cache_map, where
+ * no 2 adjacent ranges have the same cache mode (those would be merged).
+ * The number is based on the worst case:
+ * - no two adjacent fixed MTRRs share the same cache mode
+ * - one variable MTRR is spanning a huge area with mode WB
+ * - 255 variable MTRRs with mode UC all overlap with the WB MTRR, creating 2
+ * additional ranges each (result like "ababababa...aba" with a = WB, b = UC),
+ * accounting for MTRR_MAX_VAR_RANGES * 2 - 1 range entries
+ * - a TOP_MEM2 area (even when overlapping an UC MTRR it can't add 2 range entries
+ * to the possible maximum, as it always starts at 4GB, thus it can't be in
+ * the middle of that MTRR, unless that MTRR starts at 0, which would remove
+ * the initial "a" from the "abababa" pattern above)
+ * The map won't contain ranges with no matching MTRR (those fall back to the
+ * default cache mode).
+ */
+#define CACHE_MAP_MAX (MTRR_NUM_FIXED_RANGES + MTRR_MAX_VAR_RANGES * 2)
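
As a worked example, with the values these macros usually have in the x86 headers (stated here as assumptions, not taken from this patch): MTRR_NUM_FIXED_RANGES == 88 and MTRR_MAX_VAR_RANGES == 256, giving CACHE_MAP_MAX == 88 + 256 * 2 == 600 map entries.
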
+
+static struct cache_map init_cache_map[CACHE_MAP_MAX] __initdata;
+static struct cache_map *cache_map __refdata = init_cache_map;
+static unsigned int cache_map_size = CACHE_MAP_MAX;
+static unsigned int cache_map_n;
+static unsigned int cache_map_fixed;
+
static unsigned long smp_changes_mask;
static int mtrr_state_set;
u64 mtrr_tom2;
struct mtrr_state_type mtrr_state;
-EXPORT_SYMBOL_GPL(mtrr_state);
+
+/* Reserved bits in the high portion of the MTRRphysBaseN MSR. */
+u32 phys_hi_rsvd;
/*
* BIOS is expected to clear MtrrFixDramModEn bit, see for example
* "BIOS and Kernel Developer's Guide for the AMD Athlon 64 and AMD
* Opteron Processors" (26094 Rev. 3.30 February 2006), section
* "13.2.1.2 SYSCFG Register": "The MtrrFixDramModEn bit should be set
- * to 1 during BIOS initalization of the fixed MTRRs, then cleared to
+ * to 1 during BIOS initialization of the fixed MTRRs, then cleared to
* 0 for operation."
*/
static inline void k8_check_syscfg_dram_mod_en(void)
@@ -54,116 +109,448 @@ static inline void k8_check_syscfg_dram_mod_en(void)
(boot_cpu_data.x86 >= 0x0f)))
return;
- rdmsr(MSR_K8_SYSCFG, lo, hi);
+ if (cc_platform_has(CC_ATTR_HOST_SEV_SNP))
+ return;
+
+ rdmsr(MSR_AMD64_SYSCFG, lo, hi);
if (lo & K8_MTRRFIXRANGE_DRAM_MODIFY) {
- printk(KERN_ERR FW_WARN "MTRR: CPU %u: SYSCFG[MtrrFixDramModEn]"
+ pr_err(FW_WARN "MTRR: CPU %u: SYSCFG[MtrrFixDramModEn]"
" not cleared by BIOS, clearing this bit\n",
smp_processor_id());
lo &= ~K8_MTRRFIXRANGE_DRAM_MODIFY;
- mtrr_wrmsr(MSR_K8_SYSCFG, lo, hi);
+ mtrr_wrmsr(MSR_AMD64_SYSCFG, lo, hi);
+ }
+}
+
+/* Get the size of contiguous MTRR range */
+static u64 get_mtrr_size(u64 mask)
+{
+ u64 size;
+
+ mask |= (u64)phys_hi_rsvd << 32;
+ size = -mask;
+
+ return size;
+}
+
+static u8 get_var_mtrr_state(unsigned int reg, u64 *start, u64 *size)
+{
+ struct mtrr_var_range *mtrr = mtrr_state.var_ranges + reg;
+
+ if (!(mtrr->mask_lo & MTRR_PHYSMASK_V))
+ return MTRR_TYPE_INVALID;
+
+ *start = (((u64)mtrr->base_hi) << 32) + (mtrr->base_lo & PAGE_MASK);
+ *size = get_mtrr_size((((u64)mtrr->mask_hi) << 32) +
+ (mtrr->mask_lo & PAGE_MASK));
+
+ return mtrr->base_lo & MTRR_PHYSBASE_TYPE;
+}
+
+static u8 get_effective_type(u8 type1, u8 type2)
+{
+ if (type1 == MTRR_TYPE_UNCACHABLE || type2 == MTRR_TYPE_UNCACHABLE)
+ return MTRR_TYPE_UNCACHABLE;
+
+ if ((type1 == MTRR_TYPE_WRBACK && type2 == MTRR_TYPE_WRTHROUGH) ||
+ (type1 == MTRR_TYPE_WRTHROUGH && type2 == MTRR_TYPE_WRBACK))
+ return MTRR_TYPE_WRTHROUGH;
+
+ if (type1 != type2)
+ return MTRR_TYPE_UNCACHABLE;
+
+ return type1;
+}
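
The merge rules mirror MTRR precedence: UC wins over everything, WB and WT combine to WT, and any other conflicting pair degrades to UC. For example:

    /*
     * get_effective_type(MTRR_TYPE_WRBACK, MTRR_TYPE_UNCACHABLE) == MTRR_TYPE_UNCACHABLE
     * get_effective_type(MTRR_TYPE_WRBACK, MTRR_TYPE_WRTHROUGH)  == MTRR_TYPE_WRTHROUGH
     * get_effective_type(MTRR_TYPE_WRBACK, MTRR_TYPE_WRCOMB)     == MTRR_TYPE_UNCACHABLE
     * get_effective_type(MTRR_TYPE_WRBACK, MTRR_TYPE_WRBACK)     == MTRR_TYPE_WRBACK
     */
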
+
+static void rm_map_entry_at(int idx)
+{
+ cache_map_n--;
+ if (cache_map_n > idx) {
+ memmove(cache_map + idx, cache_map + idx + 1,
+ sizeof(*cache_map) * (cache_map_n - idx));
}
}
/*
- * Returns the effective MTRR type for the region
- * Error returns:
- * - 0xFE - when the range is "not entirely covered" by _any_ var range MTRR
- * - 0xFF - when MTRR is not enabled
+ * Add an entry into cache_map at a specific index. Merges adjacent entries if
+ * appropriate. Return the number of merges for correcting the scan index
+ * (this is needed as merging will reduce the number of entries, which will
+ * result in skipping entries in future iterations if the scan index isn't
+ * corrected).
+ * Note that the corrected index can never go below -1 (resulting in being 0 in
+ * the next scan iteration), as "2" is returned only if the current index is
+ * larger than zero.
*/
-u8 mtrr_type_lookup(u64 start, u64 end)
+static int add_map_entry_at(u64 start, u64 end, u8 type, int idx)
{
+ bool merge_prev = false, merge_next = false;
+
+ if (start >= end)
+ return 0;
+
+ if (idx > 0) {
+ struct cache_map *prev = cache_map + idx - 1;
+
+ if (!prev->fixed && start == prev->end && type == prev->type)
+ merge_prev = true;
+ }
+
+ if (idx < cache_map_n) {
+ struct cache_map *next = cache_map + idx;
+
+ if (!next->fixed && end == next->start && type == next->type)
+ merge_next = true;
+ }
+
+ if (merge_prev && merge_next) {
+ cache_map[idx - 1].end = cache_map[idx].end;
+ rm_map_entry_at(idx);
+ return 2;
+ }
+ if (merge_prev) {
+ cache_map[idx - 1].end = end;
+ return 1;
+ }
+ if (merge_next) {
+ cache_map[idx].start = start;
+ return 1;
+ }
+
+ /* Sanity check: the array should NEVER be too small! */
+ if (cache_map_n == cache_map_size) {
+ WARN(1, "MTRR cache mode memory map exhausted!\n");
+ cache_map_n = cache_map_fixed;
+ return 0;
+ }
+
+ if (cache_map_n > idx) {
+ memmove(cache_map + idx + 1, cache_map + idx,
+ sizeof(*cache_map) * (cache_map_n - idx));
+ }
+
+ cache_map[idx].start = start;
+ cache_map[idx].end = end;
+ cache_map[idx].type = type;
+ cache_map[idx].fixed = 0;
+ cache_map_n++;
+
+ return 0;
+}
+
+/* Clear a part of an entry. Return 1 if start of entry is still valid. */
+static int clr_map_range_at(u64 start, u64 end, int idx)
+{
+ int ret = start != cache_map[idx].start;
+ u64 tmp;
+
+ if (start == cache_map[idx].start && end == cache_map[idx].end) {
+ rm_map_entry_at(idx);
+ } else if (start == cache_map[idx].start) {
+ cache_map[idx].start = end;
+ } else if (end == cache_map[idx].end) {
+ cache_map[idx].end = start;
+ } else {
+ tmp = cache_map[idx].end;
+ cache_map[idx].end = start;
+ add_map_entry_at(end, tmp, cache_map[idx].type, idx + 1);
+ }
+
+ return ret;
+}
+
+/*
+ * Add MTRR to the map. The current map is scanned and each part of the MTRR
+ * either overlapping with an existing entry or with a hole in the map is
+ * handled separately.
+ */
+static void add_map_entry(u64 start, u64 end, u8 type)
+{
+ u8 new_type, old_type;
+ u64 tmp;
int i;
- u64 base, mask;
- u8 prev_match, curr_match;
-
- if (!mtrr_state_set)
- return 0xFF;
-
- if (!mtrr_state.enabled)
- return 0xFF;
-
- /* Make end inclusive end, instead of exclusive */
- end--;
-
- /* Look in fixed ranges. Just return the type as per start */
- if (mtrr_state.have_fixed && (start < 0x100000)) {
- int idx;
-
- if (start < 0x80000) {
- idx = 0;
- idx += (start >> 16);
- return mtrr_state.fixed_ranges[idx];
- } else if (start < 0xC0000) {
- idx = 1 * 8;
- idx += ((start - 0x80000) >> 14);
- return mtrr_state.fixed_ranges[idx];
- } else if (start < 0x1000000) {
- idx = 3 * 8;
- idx += ((start - 0xC0000) >> 12);
- return mtrr_state.fixed_ranges[idx];
+
+ for (i = 0; i < cache_map_n && start < end; i++) {
+ if (start >= cache_map[i].end)
+ continue;
+
+ if (start < cache_map[i].start) {
+ /* Region start has no overlap. */
+ tmp = min(end, cache_map[i].start);
+ i -= add_map_entry_at(start, tmp, type, i);
+ start = tmp;
+ continue;
+ }
+
+ new_type = get_effective_type(type, cache_map[i].type);
+ old_type = cache_map[i].type;
+
+ if (cache_map[i].fixed || new_type == old_type) {
+ /* Cut off start of new entry. */
+ start = cache_map[i].end;
+ continue;
}
+
+ /* Handle only overlapping part of region. */
+ tmp = min(end, cache_map[i].end);
+ i += clr_map_range_at(start, tmp, i);
+ i -= add_map_entry_at(start, tmp, new_type, i);
+ start = tmp;
}
+ /* Add rest of region after last map entry (rest might be empty). */
+ add_map_entry_at(start, end, type, i);
+}
+
+/* Add variable MTRRs to cache map. */
+static void map_add_var(void)
+{
+ u64 start, size;
+ unsigned int i;
+ u8 type;
+
/*
- * Look in variable ranges
- * Look of multiple ranges matching this address and pick type
- * as per MTRR precedence
+ * Add AMD TOP_MEM2 area. Can't be added in mtrr_build_map(), as it
+ * needs to be added again when rebuilding the map due to potentially
+ * having moved as a result of variable MTRRs for memory below 4GB.
*/
- if (!(mtrr_state.enabled & 2))
- return mtrr_state.def_type;
+ if (mtrr_tom2) {
+ add_map_entry(BIT_ULL(32), mtrr_tom2, MTRR_TYPE_WRBACK);
+ cache_map[cache_map_n - 1].fixed = 1;
+ }
- prev_match = 0xFF;
- for (i = 0; i < num_var_ranges; ++i) {
- unsigned short start_state, end_state;
+ for (i = 0; i < num_var_ranges; i++) {
+ type = get_var_mtrr_state(i, &start, &size);
+ if (type != MTRR_TYPE_INVALID)
+ add_map_entry(start, start + size, type);
+ }
+}
- if (!(mtrr_state.var_ranges[i].mask_lo & (1 << 11)))
- continue;
+/*
+ * Rebuild map by replacing variable entries. Needs to be called when MTRR
+ * registers are being changed after boot, as such changes could include
+ * removals of registers, which are complicated to handle without rebuild of
+ * the map.
+ */
+void generic_rebuild_map(void)
+{
+ if (mtrr_if != &generic_mtrr_ops)
+ return;
- base = (((u64)mtrr_state.var_ranges[i].base_hi) << 32) +
- (mtrr_state.var_ranges[i].base_lo & PAGE_MASK);
- mask = (((u64)mtrr_state.var_ranges[i].mask_hi) << 32) +
- (mtrr_state.var_ranges[i].mask_lo & PAGE_MASK);
+ cache_map_n = cache_map_fixed;
- start_state = ((start & mask) == (base & mask));
- end_state = ((end & mask) == (base & mask));
- if (start_state != end_state)
- return 0xFE;
+ map_add_var();
+}
- if ((start & mask) != (base & mask))
- continue;
+static unsigned int __init get_cache_map_size(void)
+{
+ return cache_map_fixed + 2 * num_var_ranges + (mtrr_tom2 != 0);
+}
- curr_match = mtrr_state.var_ranges[i].base_lo & 0xff;
- if (prev_match == 0xFF) {
- prev_match = curr_match;
- continue;
+/* Build the cache_map containing the cache modes per memory range. */
+void __init mtrr_build_map(void)
+{
+ u64 start, end, size;
+ unsigned int i;
+ u8 type;
+
+ /* Add fixed MTRRs, optimize for adjacent entries with same type. */
+ if (mtrr_state.enabled & MTRR_STATE_MTRR_FIXED_ENABLED) {
+ /*
+ * Start with 64k size fixed entries, preset 1st one (hence the
+ * loop below is starting with index 1).
+ */
+ start = 0;
+ end = size = 0x10000;
+ type = mtrr_state.fixed_ranges[0];
+
+ for (i = 1; i < MTRR_NUM_FIXED_RANGES; i++) {
+ /* 8 64k entries, then 16 16k ones, rest 4k. */
+ if (i == 8 || i == 24)
+ size >>= 2;
+
+ if (mtrr_state.fixed_ranges[i] != type) {
+ add_map_entry(start, end, type);
+ start = end;
+ type = mtrr_state.fixed_ranges[i];
+ }
+ end += size;
}
+ add_map_entry(start, end, type);
+ }
- if (prev_match == MTRR_TYPE_UNCACHABLE ||
- curr_match == MTRR_TYPE_UNCACHABLE) {
- return MTRR_TYPE_UNCACHABLE;
+ /* Mark fixed, they take precedence. */
+ for (i = 0; i < cache_map_n; i++)
+ cache_map[i].fixed = 1;
+ cache_map_fixed = cache_map_n;
+
+ map_add_var();
+
+ pr_info("MTRR map: %u entries (%u fixed + %u variable; max %u), built from %u variable MTRRs\n",
+ cache_map_n, cache_map_fixed, cache_map_n - cache_map_fixed,
+ get_cache_map_size(), num_var_ranges + (mtrr_tom2 != 0));
+
+ if (mtrr_debug) {
+ for (i = 0; i < cache_map_n; i++) {
+ pr_info("%3u: %016llx-%016llx %s\n", i,
+ cache_map[i].start, cache_map[i].end - 1,
+ mtrr_attrib_to_str(cache_map[i].type));
}
+ }
+}
- if ((prev_match == MTRR_TYPE_WRBACK &&
- curr_match == MTRR_TYPE_WRTHROUGH) ||
- (prev_match == MTRR_TYPE_WRTHROUGH &&
- curr_match == MTRR_TYPE_WRBACK)) {
- prev_match = MTRR_TYPE_WRTHROUGH;
- curr_match = MTRR_TYPE_WRTHROUGH;
+/* Copy the cache_map from __initdata memory to dynamically allocated one. */
+void __init mtrr_copy_map(void)
+{
+ unsigned int new_size = get_cache_map_size();
+
+ if (!mtrr_state.enabled || !new_size) {
+ cache_map = NULL;
+ return;
+ }
+
+ mutex_lock(&mtrr_mutex);
+
+ cache_map = kcalloc(new_size, sizeof(*cache_map), GFP_KERNEL);
+ if (cache_map) {
+ memmove(cache_map, init_cache_map,
+ cache_map_n * sizeof(*cache_map));
+ cache_map_size = new_size;
+ } else {
+ mtrr_state.enabled = 0;
+ pr_err("MTRRs disabled due to allocation failure for lookup map.\n");
+ }
+
+ mutex_unlock(&mtrr_mutex);
+}
+
+/**
+ * guest_force_mtrr_state - set static MTRR state for a guest
+ *
+ * Used to set MTRR state via different means (e.g. with data obtained from
+ * a hypervisor).
+ * Is allowed only for special cases when running virtualized. Must be called
+ * from the x86_init.hyper.init_platform() hook. It can be called only once.
+ * The MTRR state can't be changed afterwards. To ensure that, X86_FEATURE_MTRR
+ * is cleared.
+ *
+ * @var: MTRR variable range array to use
+ * @num_var: length of the @var array
+ * @def_type: default caching type
+ */
+void guest_force_mtrr_state(struct mtrr_var_range *var, unsigned int num_var,
+ mtrr_type def_type)
+{
+ unsigned int i;
+
+ /* Only allowed to be called once before mtrr_bp_init(). */
+ if (WARN_ON_ONCE(mtrr_state_set))
+ return;
+
+ /* Only allowed when running virtualized. */
+ if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
+ return;
+
+ /*
+ * Only allowed for special virtualization cases:
+ * - when running as Hyper-V, SEV-SNP guest using vTOM
+ * - when running as Xen PV guest
+ * - when running as SEV-SNP or TDX guest to avoid unnecessary
+ * VMM communication/Virtualization exceptions (#VC, #VE)
+ */
+ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
+ !hv_is_isolation_supported() &&
+ !cpu_feature_enabled(X86_FEATURE_XENPV) &&
+ !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+ return;
+
+ /* Disable MTRR in order to disable MTRR modifications. */
+ setup_clear_cpu_cap(X86_FEATURE_MTRR);
+
+ if (var) {
+ if (num_var > MTRR_MAX_VAR_RANGES) {
+ pr_warn("Trying to overwrite MTRR state with %u variable entries\n",
+ num_var);
+ num_var = MTRR_MAX_VAR_RANGES;
}
+ for (i = 0; i < num_var; i++)
+ mtrr_state.var_ranges[i] = var[i];
+ num_var_ranges = num_var;
+ }
+
+ mtrr_state.def_type = def_type;
+ mtrr_state.enabled |= MTRR_STATE_MTRR_ENABLED;
+
+ mtrr_state_set = 1;
+}
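
A hedged sketch of a caller, with an illustrative function name that is not part of this patch:

    /* Hypothetical x86_init.hyper.init_platform() hook forcing a
     * write-back default with no variable ranges. */
    static void __init example_init_platform(void)
    {
            guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK);
    }
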
+
+static u8 type_merge(u8 type, u8 new_type, u8 *uniform)
+{
+ u8 effective_type;
- if (prev_match != curr_match)
- return MTRR_TYPE_UNCACHABLE;
+ if (type == MTRR_TYPE_INVALID)
+ return new_type;
+
+ effective_type = get_effective_type(type, new_type);
+ if (type != effective_type)
+ *uniform = 0;
+
+ return effective_type;
+}
+
+/**
+ * mtrr_type_lookup - look up memory type in MTRR
+ *
+ * @start: Begin of the physical address range
+ * @end: End of the physical address range
+ * @uniform: output argument:
+ * - 1: the returned MTRR type is valid for the whole region
+ * - 0: otherwise
+ *
+ * Return Values:
+ * MTRR_TYPE_(type) - The effective MTRR type for the region
+ * MTRR_TYPE_INVALID - MTRR is disabled
+ */
+u8 mtrr_type_lookup(u64 start, u64 end, u8 *uniform)
+{
+ u8 type = MTRR_TYPE_INVALID;
+ unsigned int i;
+
+ if (!mtrr_state_set) {
+ /* Uniformity is unknown. */
+ *uniform = 0;
+ return MTRR_TYPE_UNCACHABLE;
}
- if (mtrr_tom2) {
- if (start >= (1ULL<<32) && (end < mtrr_tom2))
- return MTRR_TYPE_WRBACK;
+ *uniform = 1;
+
+ if (!(mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED))
+ return MTRR_TYPE_UNCACHABLE;
+
+ for (i = 0; i < cache_map_n && start < end; i++) {
+ /* Region after current map entry? -> continue with next one. */
+ if (start >= cache_map[i].end)
+ continue;
+
+ /* Start of region not covered by current map entry? */
+ if (start < cache_map[i].start) {
+ /* At least some part of region has default type. */
+ type = type_merge(type, mtrr_state.def_type, uniform);
+ /* End of region not covered, too? -> lookup done. */
+ if (end <= cache_map[i].start)
+ return type;
+ }
+
+ /* At least part of region covered by map entry. */
+ type = type_merge(type, cache_map[i].type, uniform);
+
+ start = cache_map[i].end;
}
- if (prev_match != 0xFF)
- return prev_match;
+ /* End of region past last entry in map? -> use default type. */
+ if (start < end)
+ type = type_merge(type, mtrr_state.def_type, uniform);
- return mtrr_state.def_type;
+ return type;
}
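A hypothetical caller, shown only to illustrate the contract (the helper name is made up):

	static bool range_is_uniform_wb(u64 start, u64 end)
	{
		u8 uniform;
		u8 type = mtrr_type_lookup(start, end, &uniform);

		/* Valid for the whole region only if uniform was left at 1. */
		return uniform && type == MTRR_TYPE_WRBACK;
	}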
/* Get the MSR pair relating to a var range */
@@ -205,7 +592,7 @@ static void get_fixed_ranges(mtrr_type *frs)
void mtrr_save_fixed_ranges(void *info)
{
- if (cpu_has_mtrr)
+ if (mtrr_state.have_fixed)
get_fixed_ranges(mtrr_state.fixed_ranges);
}
@@ -218,8 +605,8 @@ static void __init print_fixed_last(void)
if (!last_fixed_end)
return;
- pr_debug(" %05X-%05X %s\n", last_fixed_start,
- last_fixed_end - 1, mtrr_attrib_to_str(last_fixed_type));
+ pr_info(" %05X-%05X %s\n", last_fixed_start,
+ last_fixed_end - 1, mtrr_attrib_to_str(last_fixed_type));
last_fixed_end = 0;
}
@@ -252,19 +639,18 @@ print_fixed(unsigned base, unsigned step, const mtrr_type *types)
}
}
-static void prepare_set(void);
-static void post_set(void);
-
static void __init print_mtrr_state(void)
{
unsigned int i;
int high_width;
- pr_debug("MTRR default type: %s\n",
- mtrr_attrib_to_str(mtrr_state.def_type));
+ pr_info("MTRR default type: %s\n",
+ mtrr_attrib_to_str(mtrr_state.def_type));
if (mtrr_state.have_fixed) {
- pr_debug("MTRR fixed ranges %sabled:\n",
- mtrr_state.enabled & 1 ? "en" : "dis");
+ pr_info("MTRR fixed ranges %s:\n",
+ str_enabled_disabled(
+ (mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED) &&
+ (mtrr_state.enabled & MTRR_STATE_MTRR_FIXED_ENABLED)));
print_fixed(0x00000, 0x10000, mtrr_state.fixed_ranges + 0);
for (i = 0; i < 2; ++i)
print_fixed(0x80000 + i * 0x20000, 0x04000,
@@ -276,44 +662,40 @@ static void __init print_mtrr_state(void)
/* tail */
print_fixed_last();
}
- pr_debug("MTRR variable ranges %sabled:\n",
- mtrr_state.enabled & 2 ? "en" : "dis");
- if (size_or_mask & 0xffffffffUL)
- high_width = ffs(size_or_mask & 0xffffffffUL) - 1;
- else
- high_width = ffs(size_or_mask>>32) + 32 - 1;
- high_width = (high_width - (32 - PAGE_SHIFT) + 3) / 4;
+ pr_info("MTRR variable ranges %s:\n",
+ str_enabled_disabled(mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED));
+ high_width = (boot_cpu_data.x86_phys_bits - (32 - PAGE_SHIFT) + 3) / 4;
for (i = 0; i < num_var_ranges; ++i) {
- if (mtrr_state.var_ranges[i].mask_lo & (1 << 11))
- pr_debug(" %u base %0*X%05X000 mask %0*X%05X000 %s\n",
- i,
- high_width,
- mtrr_state.var_ranges[i].base_hi,
- mtrr_state.var_ranges[i].base_lo >> 12,
- high_width,
- mtrr_state.var_ranges[i].mask_hi,
- mtrr_state.var_ranges[i].mask_lo >> 12,
- mtrr_attrib_to_str(mtrr_state.var_ranges[i].base_lo & 0xff));
+ if (mtrr_state.var_ranges[i].mask_lo & MTRR_PHYSMASK_V)
+ pr_info(" %u base %0*X%05X000 mask %0*X%05X000 %s\n",
+ i,
+ high_width,
+ mtrr_state.var_ranges[i].base_hi,
+ mtrr_state.var_ranges[i].base_lo >> 12,
+ high_width,
+ mtrr_state.var_ranges[i].mask_hi,
+ mtrr_state.var_ranges[i].mask_lo >> 12,
+ mtrr_attrib_to_str(mtrr_state.var_ranges[i].base_lo &
+ MTRR_PHYSBASE_TYPE));
else
- pr_debug(" %u disabled\n", i);
+ pr_info(" %u disabled\n", i);
}
if (mtrr_tom2)
- pr_debug("TOM2: %016llx aka %lldM\n", mtrr_tom2, mtrr_tom2>>20);
+ pr_info("TOM2: %016llx aka %lldM\n", mtrr_tom2, mtrr_tom2>>20);
}
/* Grab all of the MTRR state for this CPU into *state */
-void __init get_mtrr_state(void)
+bool __init get_mtrr_state(void)
{
struct mtrr_var_range *vrs;
- unsigned long flags;
unsigned lo, dummy;
unsigned int i;
vrs = mtrr_state.var_ranges;
rdmsr(MSR_MTRRcap, lo, dummy);
- mtrr_state.have_fixed = (lo >> 8) & 1;
+ mtrr_state.have_fixed = lo & MTRR_CAP_FIX;
for (i = 0; i < num_var_ranges; i++)
get_mtrr_var_range(i, &vrs[i]);
@@ -321,8 +703,8 @@ void __init get_mtrr_state(void)
get_fixed_ranges(mtrr_state.fixed_ranges);
rdmsr(MSR_MTRRdefType, lo, dummy);
- mtrr_state.def_type = (lo & 0xff);
- mtrr_state.enabled = (lo & 0xc00) >> 10;
+ mtrr_state.def_type = lo & MTRR_DEF_TYPE_TYPE;
+ mtrr_state.enabled = (lo & MTRR_DEF_TYPE_ENABLE) >> MTRR_STATE_SHIFT;
if (amd_special_default_mtrr()) {
unsigned low, high;
@@ -335,18 +717,12 @@ void __init get_mtrr_state(void)
mtrr_tom2 &= 0xffffff800000ULL;
}
- print_mtrr_state();
+ if (mtrr_debug)
+ print_mtrr_state();
mtrr_state_set = 1;
- /* PAT setup for BP. We need to go through sync steps here */
- local_irq_save(flags);
- prepare_set();
-
- pat_init();
-
- post_set();
- local_irq_restore(flags);
+ return !!(mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED);
}
/* Some BIOS's are messed up and don't set all MTRRs the same! */
@@ -357,14 +733,14 @@ void __init mtrr_state_warn(void)
if (!mask)
return;
if (mask & MTRR_CHANGE_MASK_FIXED)
- pr_warning("mtrr: your CPUs had inconsistent fixed MTRR settings\n");
+ pr_warn("mtrr: your CPUs had inconsistent fixed MTRR settings\n");
if (mask & MTRR_CHANGE_MASK_VARIABLE)
- pr_warning("mtrr: your CPUs had inconsistent variable MTRR settings\n");
+ pr_warn("mtrr: your CPUs had inconsistent variable MTRR settings\n");
if (mask & MTRR_CHANGE_MASK_DEFTYPE)
- pr_warning("mtrr: your CPUs had inconsistent MTRRdefType settings\n");
+ pr_warn("mtrr: your CPUs had inconsistent MTRRdefType settings\n");
- printk(KERN_INFO "mtrr: probably your BIOS does not setup all CPUs.\n");
- printk(KERN_INFO "mtrr: corrected configuration.\n");
+ pr_info("mtrr: probably your BIOS does not setup all CPUs.\n");
+ pr_info("mtrr: corrected configuration.\n");
}
/*
@@ -375,8 +751,7 @@ void __init mtrr_state_warn(void)
void mtrr_wrmsr(unsigned msr, unsigned a, unsigned b)
{
if (wrmsr_safe(msr, a, b) < 0) {
- printk(KERN_ERR
- "MTRR: CPU %u: Writing MSR %x to %x:%x failed\n",
+ pr_err("MTRR: CPU %u: Writing MSR %x to %x:%x failed\n",
smp_processor_id(), msr, a, b);
}
}
@@ -431,19 +806,19 @@ generic_get_free_region(unsigned long base, unsigned long size, int replace_reg)
static void generic_get_mtrr(unsigned int reg, unsigned long *base,
unsigned long *size, mtrr_type *type)
{
- unsigned int mask_lo, mask_hi, base_lo, base_hi;
- unsigned int tmp, hi;
- int cpu;
+ u32 mask_lo, mask_hi, base_lo, base_hi;
+ unsigned int hi;
+ u64 tmp, mask;
/*
* get_mtrr doesn't need to update mtrr_state, also it could be called
* from any cpu, so try to print it out directly.
*/
- cpu = get_cpu();
+ get_cpu();
rdmsr(MTRRphysMask_MSR(reg), mask_lo, mask_hi);
- if ((mask_lo & 0x800) == 0) {
+ if (!(mask_lo & MTRR_PHYSMASK_V)) {
/* Invalid (i.e. free) range */
*base = 0;
*size = 0;
@@ -454,17 +829,18 @@ static void generic_get_mtrr(unsigned int reg, unsigned long *base,
rdmsr(MTRRphysBase_MSR(reg), base_lo, base_hi);
/* Work out the shifted address mask: */
- tmp = mask_hi << (32 - PAGE_SHIFT) | mask_lo >> PAGE_SHIFT;
- mask_lo = size_or_mask | tmp;
+ tmp = (u64)mask_hi << 32 | (mask_lo & PAGE_MASK);
+ mask = (u64)phys_hi_rsvd << 32 | tmp;
/* Expand tmp with high bits to all 1s: */
- hi = fls(tmp);
+ hi = fls64(tmp);
if (hi > 0) {
- tmp |= ~((1<<(hi - 1)) - 1);
+ tmp |= ~((1ULL<<(hi - 1)) - 1);
- if (tmp != mask_lo) {
- printk(KERN_WARNING "mtrr: your BIOS has configured an incorrect mask, fixing it.\n");
- mask_lo = tmp;
+ if (tmp != mask) {
+ pr_warn("mtrr: your BIOS has configured an incorrect mask, fixing it.\n");
+ add_taint(TAINT_FIRMWARE_WORKAROUND, LOCKDEP_STILL_OK);
+ mask = tmp;
}
}
@@ -472,9 +848,9 @@ static void generic_get_mtrr(unsigned int reg, unsigned long *base,
* This works correctly if size is a power of two, i.e. a
* contiguous range:
*/
- *size = -mask_lo;
- *base = base_hi << (32 - PAGE_SHIFT) | base_lo >> PAGE_SHIFT;
- *type = base_lo & 0xff;
+ *size = -mask >> PAGE_SHIFT;
+ *base = (u64)base_hi << (32 - PAGE_SHIFT) | base_lo >> PAGE_SHIFT;
+ *type = base_lo & MTRR_PHYSBASE_TYPE;
out_put_cpu:
put_cpu();
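Worked example of the mask handling above, assuming 36 physical address bits (so phys_hi_rsvd == GENMASK(31, 4) == 0xFFFFFFF0): a correctly programmed 64 MiB range yields mask == 0xFFFFFFFFFC000000, hence -mask == 0x4000000 (64 MiB) and *size == -mask >> PAGE_SHIFT == 0x4000 pages. If the BIOS left the bits above the range clear, the fls64()-based expansion fills them in, and the kernel warns and taints with TAINT_FIRMWARE_WORKAROUND.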
@@ -512,9 +888,8 @@ static bool set_mtrr_var_ranges(unsigned int index, struct mtrr_var_range *vr)
bool changed = false;
rdmsr(MTRRphysBase_MSR(index), lo, hi);
- if ((vr->base_lo & 0xfffff0ffUL) != (lo & 0xfffff0ffUL)
- || (vr->base_hi & (size_and_mask >> (32 - PAGE_SHIFT))) !=
- (hi & (size_and_mask >> (32 - PAGE_SHIFT)))) {
+ if ((vr->base_lo & ~MTRR_PHYSBASE_RSVD) != (lo & ~MTRR_PHYSBASE_RSVD)
+ || (vr->base_hi & ~phys_hi_rsvd) != (hi & ~phys_hi_rsvd)) {
mtrr_wrmsr(MTRRphysBase_MSR(index), vr->base_lo, vr->base_hi);
changed = true;
@@ -522,9 +897,8 @@ static bool set_mtrr_var_ranges(unsigned int index, struct mtrr_var_range *vr)
rdmsr(MTRRphysMask_MSR(index), lo, hi);
- if ((vr->mask_lo & 0xfffff800UL) != (lo & 0xfffff800UL)
- || (vr->mask_hi & (size_and_mask >> (32 - PAGE_SHIFT))) !=
- (hi & (size_and_mask >> (32 - PAGE_SHIFT)))) {
+ if ((vr->mask_lo & ~MTRR_PHYSMASK_RSVD) != (lo & ~MTRR_PHYSMASK_RSVD)
+ || (vr->mask_hi & ~phys_hi_rsvd) != (hi & ~phys_hi_rsvd)) {
mtrr_wrmsr(MTRRphysMask_MSR(index), vr->mask_lo, vr->mask_hi);
changed = true;
}
@@ -536,7 +910,10 @@ static u32 deftype_lo, deftype_hi;
/**
* set_mtrr_state - Set the MTRR state for this CPU.
*
- * NOTE: The CPU must already be in a safe state for MTRR changes.
+ * NOTE: The CPU must already be in a safe state for MTRR changes, which
+ *	includes ensuring that only a single CPU can be active in
+ *	set_mtrr_state(), so that use of deftype_lo is not subject to races.
+ *	This is accomplished by taking cache_disable_lock.
* RETURNS: 0 if no changes made, else a mask indicating what was changed.
*/
static unsigned long set_mtrr_state(void)
@@ -556,103 +933,46 @@ static unsigned long set_mtrr_state(void)
* Set_mtrr_restore restores the old value of MTRRdefType,
* so to set it we fiddle with the saved value:
*/
- if ((deftype_lo & 0xff) != mtrr_state.def_type
- || ((deftype_lo & 0xc00) >> 10) != mtrr_state.enabled) {
+ if ((deftype_lo & MTRR_DEF_TYPE_TYPE) != mtrr_state.def_type ||
+ ((deftype_lo & MTRR_DEF_TYPE_ENABLE) >> MTRR_STATE_SHIFT) != mtrr_state.enabled) {
- deftype_lo = (deftype_lo & ~0xcff) | mtrr_state.def_type |
- (mtrr_state.enabled << 10);
+ deftype_lo = (deftype_lo & MTRR_DEF_TYPE_DISABLE) |
+ mtrr_state.def_type |
+ (mtrr_state.enabled << MTRR_STATE_SHIFT);
change_mask |= MTRR_CHANGE_MASK_DEFTYPE;
}
return change_mask;
}
-
-static unsigned long cr4;
-static DEFINE_RAW_SPINLOCK(set_atomicity_lock);
-
-/*
- * Since we are disabling the cache don't allow any interrupts,
- * they would run extremely slow and would only increase the pain.
- *
- * The caller must ensure that local interrupts are disabled and
- * are reenabled after post_set() has been called.
- */
-static void prepare_set(void) __acquires(set_atomicity_lock)
+void mtrr_disable(void)
{
- unsigned long cr0;
-
- /*
- * Note that this is not ideal
- * since the cache is only flushed/disabled for this CPU while the
- * MTRRs are changed, but changing this requires more invasive
- * changes to the way the kernel boots
- */
-
- raw_spin_lock(&set_atomicity_lock);
-
- /* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
- cr0 = read_cr0() | X86_CR0_CD;
- write_cr0(cr0);
- wbinvd();
-
- /* Save value of CR4 and clear Page Global Enable (bit 7) */
- if (cpu_has_pge) {
- cr4 = read_cr4();
- write_cr4(cr4 & ~X86_CR4_PGE);
- }
-
- /* Flush all TLBs via a mov %cr3, %reg; mov %reg, %cr3 */
- __flush_tlb();
-
/* Save MTRR state */
rdmsr(MSR_MTRRdefType, deftype_lo, deftype_hi);
/* Disable MTRRs, and set the default type to uncached */
- mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & ~0xcff, deftype_hi);
+ mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & MTRR_DEF_TYPE_DISABLE, deftype_hi);
}
-static void post_set(void) __releases(set_atomicity_lock)
+void mtrr_enable(void)
{
- /* Flush TLBs (no need to flush caches - they are disabled) */
- __flush_tlb();
-
/* Intel (P6) standard MTRRs */
mtrr_wrmsr(MSR_MTRRdefType, deftype_lo, deftype_hi);
-
- /* Enable caches */
- write_cr0(read_cr0() & 0xbfffffff);
-
- /* Restore value of CR4 */
- if (cpu_has_pge)
- write_cr4(cr4);
- raw_spin_unlock(&set_atomicity_lock);
}
-static void generic_set_all(void)
+void mtrr_generic_set_state(void)
{
unsigned long mask, count;
- unsigned long flags;
-
- local_irq_save(flags);
- prepare_set();
/* Actually set the state */
mask = set_mtrr_state();
- /* also set PAT */
- pat_init();
-
- post_set();
- local_irq_restore(flags);
-
/* Use the atomic bitops to update the global mask */
- for (count = 0; count < sizeof mask * 8; ++count) {
+ for (count = 0; count < sizeof(mask) * 8; ++count) {
if (mask & 0x01)
set_bit(count, &smp_changes_mask);
mask >>= 1;
}
-
}
/**
@@ -674,7 +994,7 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
vr = &mtrr_state.var_ranges[reg];
local_irq_save(flags);
- prepare_set();
+ cache_disable();
if (size == 0) {
/*
@@ -685,15 +1005,15 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
memset(vr, 0, sizeof(struct mtrr_var_range));
} else {
vr->base_lo = base << PAGE_SHIFT | type;
- vr->base_hi = (base & size_and_mask) >> (32 - PAGE_SHIFT);
- vr->mask_lo = -size << PAGE_SHIFT | 0x800;
- vr->mask_hi = (-size & size_and_mask) >> (32 - PAGE_SHIFT);
+ vr->base_hi = (base >> (32 - PAGE_SHIFT)) & ~phys_hi_rsvd;
+ vr->mask_lo = -size << PAGE_SHIFT | MTRR_PHYSMASK_V;
+ vr->mask_hi = (-size >> (32 - PAGE_SHIFT)) & ~phys_hi_rsvd;
mtrr_wrmsr(MTRRphysBase_MSR(reg), vr->base_lo, vr->base_hi);
mtrr_wrmsr(MTRRphysMask_MSR(reg), vr->mask_lo, vr->mask_hi);
}
- post_set();
+ cache_enable();
local_irq_restore(flags);
}
@@ -706,17 +1026,16 @@ int generic_validate_add_page(unsigned long base, unsigned long size,
* For Intel PPro stepping <= 7
* must be 4 MiB aligned and not touch 0x70000000 -> 0x7003FFFF
*/
- if (is_cpu(INTEL) && boot_cpu_data.x86 == 6 &&
- boot_cpu_data.x86_model == 1 &&
- boot_cpu_data.x86_mask <= 7) {
+ if (mtrr_if == &generic_mtrr_ops && boot_cpu_data.x86_vfm == INTEL_PENTIUM_PRO &&
+ boot_cpu_data.x86_stepping <= 7) {
if (base & ((1 << (22 - PAGE_SHIFT)) - 1)) {
- pr_warning("mtrr: base(0x%lx000) is not 4 MiB aligned\n", base);
+ pr_warn("mtrr: base(0x%lx000) is not 4 MiB aligned\n", base);
return -EINVAL;
}
if (!(base + size < 0x70000 || base > 0x7003F) &&
(type == MTRR_TYPE_WRCOMB
|| type == MTRR_TYPE_WRBACK)) {
- pr_warning("mtrr: writable mtrr between 0x70000000 and 0x7003FFFF may hang the CPU.\n");
+ pr_warn("mtrr: writable mtrr between 0x70000000 and 0x7003FFFF may hang the CPU.\n");
return -EINVAL;
}
}
@@ -730,7 +1049,7 @@ int generic_validate_add_page(unsigned long base, unsigned long size,
lbase = lbase >> 1, last = last >> 1)
;
if (lbase != last) {
- pr_warning("mtrr: base(0x%lx000) is not aligned on a size(0x%lx000) boundary\n", base, size);
+ pr_warn("mtrr: base(0x%lx000) is not aligned on a size(0x%lx000) boundary\n", base, size);
return -EINVAL;
}
return 0;
@@ -740,7 +1059,7 @@ static int generic_have_wrcomb(void)
{
unsigned long config, dummy;
rdmsr(MSR_MTRRcap, config, dummy);
- return config & (1 << 10);
+ return config & MTRR_CAP_WC;
}
int positive_have_wrcomb(void)
@@ -752,8 +1071,6 @@ int positive_have_wrcomb(void)
* Generic structure...
*/
const struct mtrr_ops generic_mtrr_ops = {
- .use_intel_if = 1,
- .set_all = generic_set_all,
.get = generic_get_mtrr,
.get_free_region = generic_get_free_region,
.set = generic_set_mtrr,
diff --git a/arch/x86/kernel/cpu/mtrr/if.c b/arch/x86/kernel/cpu/mtrr/if.c
index 79289632cb27..4049235b1bfe 100644
--- a/arch/x86/kernel/cpu/mtrr/if.c
+++ b/arch/x86/kernel/cpu/mtrr/if.c
@@ -1,8 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/capability.h>
#include <linux/seq_file.h>
#include <linux/uaccess.h>
#include <linux/proc_fs.h>
-#include <linux/module.h>
#include <linux/ctype.h>
#include <linux/string.h>
#include <linux/slab.h>
@@ -43,7 +43,7 @@ mtrr_file_add(unsigned long base, unsigned long size,
max = num_var_ranges;
if (fcount == NULL) {
- fcount = kzalloc(max * sizeof *fcount, GFP_KERNEL);
+ fcount = kcalloc(max, sizeof(*fcount), GFP_KERNEL);
if (!fcount)
return -ENOMEM;
FILE_FCOUNT(file) = fcount;
@@ -99,28 +99,16 @@ mtrr_write(struct file *file, const char __user *buf, size_t len, loff_t * ppos)
char *ptr;
char line[LINE_SIZE];
int length;
- size_t linelen;
-
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
memset(line, 0, LINE_SIZE);
- length = len;
- length--;
-
- if (length > LINE_SIZE - 1)
- length = LINE_SIZE - 1;
-
+ len = min_t(size_t, len, LINE_SIZE - 1);
+ length = strncpy_from_user(line, buf, len);
if (length < 0)
- return -EINVAL;
+ return length;
- if (copy_from_user(line, buf, length))
- return -EFAULT;
-
- linelen = strlen(line);
- ptr = line + linelen - 1;
- if (linelen && *ptr == '\n')
+ ptr = line + length - 1;
+ if (length && *ptr == '\n')
*ptr = '\0';
if (!strncmp(line, "disable=", 8)) {
@@ -149,17 +137,16 @@ mtrr_write(struct file *file, const char __user *buf, size_t len, loff_t * ppos)
return -EINVAL;
ptr = skip_spaces(ptr + 5);
- for (i = 0; i < MTRR_NUM_TYPES; ++i) {
- if (strcmp(ptr, mtrr_strings[i]))
- continue;
- base >>= PAGE_SHIFT;
- size >>= PAGE_SHIFT;
- err = mtrr_add_page((unsigned long)base, (unsigned long)size, i, true);
- if (err < 0)
- return err;
- return len;
- }
- return -EINVAL;
+ i = match_string(mtrr_strings, MTRR_NUM_TYPES, ptr);
+ if (i < 0)
+ return i;
+
+ base >>= PAGE_SHIFT;
+ size >>= PAGE_SHIFT;
+ err = mtrr_add_page((unsigned long)base, (unsigned long)size, i, true);
+ if (err < 0)
+ return err;
+ return len;
}
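The accepted write syntax is unchanged by the match_string() conversion; for reference, the userspace side looks like:

	echo "base=0xf8000000 size=0x400000 type=write-combining" > /proc/mtrr
	echo "disable=2" > /proc/mtrr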
static long
@@ -167,11 +154,14 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg)
{
int err = 0;
mtrr_type type;
+ unsigned long base;
unsigned long size;
struct mtrr_sentry sentry;
struct mtrr_gentry gentry;
void __user *arg = (void __user *) __arg;
+ memset(&gentry, 0, sizeof(gentry));
+
switch (cmd) {
case MTRRIOC_ADD_ENTRY:
case MTRRIOC_SET_ENTRY:
@@ -181,12 +171,12 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg)
case MTRRIOC_SET_PAGE_ENTRY:
case MTRRIOC_DEL_PAGE_ENTRY:
case MTRRIOC_KILL_PAGE_ENTRY:
- if (copy_from_user(&sentry, arg, sizeof sentry))
+ if (copy_from_user(&sentry, arg, sizeof(sentry)))
return -EFAULT;
break;
case MTRRIOC_GET_ENTRY:
case MTRRIOC_GET_PAGE_ENTRY:
- if (copy_from_user(&gentry, arg, sizeof gentry))
+ if (copy_from_user(&gentry, arg, sizeof(gentry)))
return -EFAULT;
break;
#ifdef CONFIG_COMPAT
@@ -231,8 +221,6 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg)
#ifdef CONFIG_COMPAT
case MTRRIOC32_ADD_ENTRY:
#endif
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
err =
mtrr_file_add(sentry.base, sentry.size, sentry.type, true,
file, 0);
@@ -241,24 +229,18 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg)
#ifdef CONFIG_COMPAT
case MTRRIOC32_SET_ENTRY:
#endif
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
err = mtrr_add(sentry.base, sentry.size, sentry.type, false);
break;
case MTRRIOC_DEL_ENTRY:
#ifdef CONFIG_COMPAT
case MTRRIOC32_DEL_ENTRY:
#endif
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
err = mtrr_file_del(sentry.base, sentry.size, file, 0);
break;
case MTRRIOC_KILL_ENTRY:
#ifdef CONFIG_COMPAT
case MTRRIOC32_KILL_ENTRY:
#endif
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
err = mtrr_del(-1, sentry.base, sentry.size);
break;
case MTRRIOC_GET_ENTRY:
@@ -267,14 +249,14 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg)
#endif
if (gentry.regnum >= num_var_ranges)
return -EINVAL;
- mtrr_if->get(gentry.regnum, &gentry.base, &size, &type);
+ mtrr_if->get(gentry.regnum, &base, &size, &type);
/* Hide entries that go above 4GB */
- if (gentry.base + size - 1 >= (1UL << (8 * sizeof(gentry.size) - PAGE_SHIFT))
+ if (base + size - 1 >= (1UL << (8 * sizeof(gentry.size) - PAGE_SHIFT))
|| size >= (1UL << (8 * sizeof(gentry.size) - PAGE_SHIFT)))
gentry.base = gentry.size = gentry.type = 0;
else {
- gentry.base <<= PAGE_SHIFT;
+ gentry.base = base << PAGE_SHIFT;
gentry.size = size << PAGE_SHIFT;
gentry.type = type;
}
@@ -284,8 +266,6 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg)
#ifdef CONFIG_COMPAT
case MTRRIOC32_ADD_PAGE_ENTRY:
#endif
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
err =
mtrr_file_add(sentry.base, sentry.size, sentry.type, true,
file, 1);
@@ -294,8 +274,6 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg)
#ifdef CONFIG_COMPAT
case MTRRIOC32_SET_PAGE_ENTRY:
#endif
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
err =
mtrr_add_page(sentry.base, sentry.size, sentry.type, false);
break;
@@ -303,16 +281,12 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg)
#ifdef CONFIG_COMPAT
case MTRRIOC32_DEL_PAGE_ENTRY:
#endif
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
err = mtrr_file_del(sentry.base, sentry.size, file, 1);
break;
case MTRRIOC_KILL_PAGE_ENTRY:
#ifdef CONFIG_COMPAT
case MTRRIOC32_KILL_PAGE_ENTRY:
#endif
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
err = mtrr_del_page(-1, sentry.base, sentry.size);
break;
case MTRRIOC_GET_PAGE_ENTRY:
@@ -321,11 +295,12 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg)
#endif
if (gentry.regnum >= num_var_ranges)
return -EINVAL;
- mtrr_if->get(gentry.regnum, &gentry.base, &size, &type);
+ mtrr_if->get(gentry.regnum, &base, &size, &type);
/* Hide entries that would overflow */
if (size != (__typeof__(gentry.size))size)
gentry.base = gentry.size = gentry.type = 0;
else {
+ gentry.base = base;
gentry.size = size;
gentry.type = type;
}
@@ -338,7 +313,7 @@ mtrr_ioctl(struct file *file, unsigned int cmd, unsigned long __arg)
switch (cmd) {
case MTRRIOC_GET_ENTRY:
case MTRRIOC_GET_PAGE_ENTRY:
- if (copy_to_user(arg, &gentry, sizeof gentry))
+ if (copy_to_user(arg, &gentry, sizeof(gentry)))
err = -EFAULT;
break;
#ifdef CONFIG_COMPAT
@@ -377,36 +352,13 @@ static int mtrr_close(struct inode *ino, struct file *file)
return single_release(ino, file);
}
-static int mtrr_seq_show(struct seq_file *seq, void *offset);
-
-static int mtrr_open(struct inode *inode, struct file *file)
-{
- if (!mtrr_if)
- return -EIO;
- if (!mtrr_if->get)
- return -ENXIO;
- return single_open(file, mtrr_seq_show, NULL);
-}
-
-static const struct file_operations mtrr_fops = {
- .owner = THIS_MODULE,
- .open = mtrr_open,
- .read = seq_read,
- .llseek = seq_lseek,
- .write = mtrr_write,
- .unlocked_ioctl = mtrr_ioctl,
- .compat_ioctl = mtrr_ioctl,
- .release = mtrr_close,
-};
-
static int mtrr_seq_show(struct seq_file *seq, void *offset)
{
char factor;
- int i, max, len;
+ int i, max;
mtrr_type type;
unsigned long base, size;
- len = 0;
max = num_var_ranges;
for (i = 0; i < max; i++) {
mtrr_if->get(i, &base, &size, &type);
@@ -423,15 +375,37 @@ static int mtrr_seq_show(struct seq_file *seq, void *offset)
size >>= 20 - PAGE_SHIFT;
}
/* Base can be > 32bit */
- len += seq_printf(seq, "reg%02i: base=0x%06lx000 "
- "(%5luMB), size=%5lu%cB, count=%d: %s\n",
- i, base, base >> (20 - PAGE_SHIFT), size,
- factor, mtrr_usage_table[i],
- mtrr_attrib_to_str(type));
+ seq_printf(seq, "reg%02i: base=0x%06lx000 (%5luMB), size=%5lu%cB, count=%d: %s\n",
+ i, base, base >> (20 - PAGE_SHIFT),
+ size, factor,
+ mtrr_usage_table[i], mtrr_attrib_to_str(type));
}
return 0;
}
+static int mtrr_open(struct inode *inode, struct file *file)
+{
+ if (!mtrr_if)
+ return -EIO;
+ if (!mtrr_if->get)
+ return -ENXIO;
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ return single_open(file, mtrr_seq_show, NULL);
+}
+
+static const struct proc_ops mtrr_proc_ops = {
+ .proc_open = mtrr_open,
+ .proc_read = seq_read,
+ .proc_lseek = seq_lseek,
+ .proc_write = mtrr_write,
+ .proc_ioctl = mtrr_ioctl,
+#ifdef CONFIG_COMPAT
+ .proc_compat_ioctl = mtrr_ioctl,
+#endif
+ .proc_release = mtrr_close,
+};
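(struct proc_ops replaced struct file_operations for procfs entries in v5.6; .proc_compat_ioctl exists only under CONFIG_COMPAT, hence the #ifdef around it.)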
+
static int __init mtrr_if_init(void)
{
struct cpuinfo_x86 *c = &boot_cpu_data;
@@ -442,7 +416,7 @@ static int __init mtrr_if_init(void)
(!cpu_has(c, X86_FEATURE_CENTAUR_MCR)))
return -ENODEV;
- proc_create("mtrr", S_IWUSR | S_IRUGO, NULL, &mtrr_fops);
+ proc_create("mtrr", S_IWUSR | S_IRUGO, NULL, &mtrr_proc_ops);
return 0;
}
arch_initcall(mtrr_if_init);
diff --git a/arch/x86/kernel/cpu/mtrr/legacy.c b/arch/x86/kernel/cpu/mtrr/legacy.c
new file mode 100644
index 000000000000..2415ffaaf02c
--- /dev/null
+++ b/arch/x86/kernel/cpu/mtrr/legacy.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/syscore_ops.h>
+#include <asm/cpufeature.h>
+#include <asm/mtrr.h>
+#include <asm/processor.h>
+#include "mtrr.h"
+
+void mtrr_set_if(void)
+{
+ switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_AMD:
+ /* Pre-Athlon (K6) AMD CPU MTRRs */
+ if (cpu_feature_enabled(X86_FEATURE_K6_MTRR))
+ mtrr_if = &amd_mtrr_ops;
+ break;
+ case X86_VENDOR_CENTAUR:
+ if (cpu_feature_enabled(X86_FEATURE_CENTAUR_MCR))
+ mtrr_if = &centaur_mtrr_ops;
+ break;
+ case X86_VENDOR_CYRIX:
+ if (cpu_feature_enabled(X86_FEATURE_CYRIX_ARR))
+ mtrr_if = &cyrix_mtrr_ops;
+ break;
+ default:
+ break;
+ }
+}
+
+/*
+ * The suspend/resume methods are only for CPUs without MTRRs. CPUs using the
+ * generic MTRR driver don't require this.
+ */
+struct mtrr_value {
+ mtrr_type ltype;
+ unsigned long lbase;
+ unsigned long lsize;
+};
+
+static struct mtrr_value *mtrr_value;
+
+static int mtrr_save(void)
+{
+ int i;
+
+ if (!mtrr_value)
+ return -ENOMEM;
+
+ for (i = 0; i < num_var_ranges; i++) {
+ mtrr_if->get(i, &mtrr_value[i].lbase,
+ &mtrr_value[i].lsize,
+ &mtrr_value[i].ltype);
+ }
+ return 0;
+}
+
+static void mtrr_restore(void)
+{
+ int i;
+
+ for (i = 0; i < num_var_ranges; i++) {
+ if (mtrr_value[i].lsize) {
+ mtrr_if->set(i, mtrr_value[i].lbase,
+ mtrr_value[i].lsize,
+ mtrr_value[i].ltype);
+ }
+ }
+}
+
+static struct syscore_ops mtrr_syscore_ops = {
+	.suspend	= mtrr_save,
+	.resume		= mtrr_restore,
+};
+
+void mtrr_register_syscore(void)
+{
+ mtrr_value = kcalloc(num_var_ranges, sizeof(*mtrr_value), GFP_KERNEL);
+
+ /*
+	 * These CPUs have no MTRRs and seem to not support SMP. They have
+	 * specific drivers; we use a tricky method to support
+	 * suspend/resume for them.
+ *
+ * TBD: is there any system with such CPU which supports
+ * suspend/resume? If no, we should remove the code.
+ */
+	register_syscore_ops(&mtrr_syscore_ops);
+}
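Note: syscore suspend/resume callbacks run late in the suspend sequence, on a single CPU with interrupts disabled, so no locking around mtrr_value is needed here.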
diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index 79556bd9b602..4b3d492afe17 100644
--- a/arch/x86/kernel/cpu/mtrr/main.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -1,22 +1,9 @@
+// SPDX-License-Identifier: LGPL-2.0+
/* Generic MTRR (Memory Type Range Register) driver.
Copyright (C) 1997-2000 Richard Gooch
Copyright (c) 2002 Patrick Mochel
- This library is free software; you can redistribute it and/or
- modify it under the terms of the GNU Library General Public
- License as published by the Free Software Foundation; either
- version 2 of the License, or (at your option) any later version.
-
- This library is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- Library General Public License for more details.
-
- You should have received a copy of the GNU Library General Public
- License along with this library; if not, write to the Free
- Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-
Richard Gooch may be reached by email at rgooch@atnf.csiro.au
The postal address is:
Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
@@ -31,53 +18,50 @@
System Programming Guide; Section 9.11. (1997 edition - PPro).
*/
-#define DEBUG
-
#include <linux/types.h> /* FIXME: kvm_para.h needs this */
+#include <linux/stop_machine.h>
#include <linux/kvm_para.h>
#include <linux/uaccess.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/mutex.h>
#include <linux/init.h>
#include <linux/sort.h>
#include <linux/cpu.h>
#include <linux/pci.h>
#include <linux/smp.h>
+#include <linux/syscore_ops.h>
+#include <linux/rcupdate.h>
-#include <asm/processor.h>
-#include <asm/e820.h>
+#include <asm/cacheinfo.h>
+#include <asm/cpufeature.h>
+#include <asm/e820/api.h>
#include <asm/mtrr.h>
#include <asm/msr.h>
+#include <asm/memtype.h>
#include "mtrr.h"
-u32 num_var_ranges;
+static_assert(X86_MEMTYPE_UC == MTRR_TYPE_UNCACHABLE);
+static_assert(X86_MEMTYPE_WC == MTRR_TYPE_WRCOMB);
+static_assert(X86_MEMTYPE_WT == MTRR_TYPE_WRTHROUGH);
+static_assert(X86_MEMTYPE_WP == MTRR_TYPE_WRPROT);
+static_assert(X86_MEMTYPE_WB == MTRR_TYPE_WRBACK);
-unsigned int mtrr_usage_table[MTRR_MAX_VAR_RANGES];
-static DEFINE_MUTEX(mtrr_mutex);
+/* arch_phys_wc_add returns an MTRR register index plus this offset. */
+#define MTRR_TO_PHYS_WC_OFFSET 1000
-u64 size_or_mask, size_and_mask;
-static bool mtrr_aps_delayed_init;
+u32 num_var_ranges;
-static const struct mtrr_ops *mtrr_ops[X86_VENDOR_NUM];
+unsigned int mtrr_usage_table[MTRR_MAX_VAR_RANGES];
+DEFINE_MUTEX(mtrr_mutex);
const struct mtrr_ops *mtrr_if;
-static void set_mtrr(unsigned int reg, unsigned long base,
- unsigned long size, mtrr_type type);
-
-void set_mtrr_ops(const struct mtrr_ops *ops)
-{
- if (ops->vendor && ops->vendor < X86_VENDOR_NUM)
- mtrr_ops[ops->vendor] = ops;
-}
-
/* Returns non-zero if we have the write-combining memory type */
static int have_wrcomb(void)
{
struct pci_dev *dev;
- u8 rev;
dev = pci_get_class(PCI_CLASS_BRIDGE_HOST << 8, NULL);
if (dev != NULL) {
@@ -87,13 +71,11 @@ static int have_wrcomb(void)
* chipsets to be tagged
*/
if (dev->vendor == PCI_VENDOR_ID_SERVERWORKS &&
- dev->device == PCI_DEVICE_ID_SERVERWORKS_LE) {
- pci_read_config_byte(dev, PCI_CLASS_REVISION, &rev);
- if (rev <= 5) {
- pr_info("mtrr: Serverworks LE rev < 6 detected. Write-combining disabled.\n");
- pci_dev_put(dev);
- return 0;
- }
+ dev->device == PCI_DEVICE_ID_SERVERWORKS_LE &&
+ dev->revision <= 5) {
+ pr_info("Serverworks LE rev < 6 detected. Write-combining disabled.\n");
+ pci_dev_put(dev);
+ return 0;
}
/*
* Intel 450NX errata # 23. Non ascending cacheline evictions to
@@ -101,7 +83,7 @@ static int have_wrcomb(void)
*/
if (dev->vendor == PCI_VENDOR_ID_INTEL &&
dev->device == PCI_DEVICE_ID_INTEL_82451NX) {
- pr_info("mtrr: Intel 450NX MMC detected. Write-combining disabled.\n");
+ pr_info("Intel 450NX MMC detected. Write-combining disabled.\n");
pci_dev_put(dev);
return 0;
}
@@ -110,21 +92,6 @@ static int have_wrcomb(void)
return mtrr_if->have_wrcomb ? mtrr_if->have_wrcomb() : 0;
}
-/* This function returns the number of variable MTRRs */
-static void __init set_num_var_ranges(void)
-{
- unsigned long config = 0, dummy;
-
- if (use_intel())
- rdmsr(MSR_MTRRcap, config, dummy);
- else if (is_cpu(AMD))
- config = 2;
- else if (is_cpu(CYRIX) || is_cpu(CENTAUR))
- config = 8;
-
- num_var_ranges = config & 0xff;
-}
-
static void __init init_table(void)
{
int i, max;
@@ -135,8 +102,6 @@ static void __init init_table(void)
}
struct set_mtrr_data {
- atomic_t count;
- atomic_t gate;
unsigned long smp_base;
unsigned long smp_size;
unsigned int smp_reg;
@@ -144,41 +109,19 @@ struct set_mtrr_data {
};
/**
- * ipi_handler - Synchronisation handler. Executed by "other" CPUs.
+ * mtrr_rendezvous_handler - Work done in the synchronization handler. Executed
+ * by all the CPUs.
* @info: pointer to mtrr configuration data
*
+ * Returns zero (the stop_machine() handler prototype requires an int return).
*/
-static void ipi_handler(void *info)
+static int mtrr_rendezvous_handler(void *info)
{
-#ifdef CONFIG_SMP
struct set_mtrr_data *data = info;
- unsigned long flags;
-
- local_irq_save(flags);
-
- atomic_dec(&data->count);
- while (!atomic_read(&data->gate))
- cpu_relax();
- /* The master has cleared me to execute */
- if (data->smp_reg != ~0U) {
- mtrr_if->set(data->smp_reg, data->smp_base,
- data->smp_size, data->smp_type);
- } else if (mtrr_aps_delayed_init) {
- /*
- * Initialize the MTRRs inaddition to the synchronisation.
- */
- mtrr_if->set_all();
- }
-
- atomic_dec(&data->count);
- while (atomic_read(&data->gate))
- cpu_relax();
-
- atomic_dec(&data->count);
- local_irq_restore(flags);
-#endif
+ mtrr_if->set(data->smp_reg, data->smp_base,
+ data->smp_size, data->smp_type);
+ return 0;
}
static inline int types_compatible(mtrr_type type1, mtrr_type type2)
@@ -198,7 +141,7 @@ static inline int types_compatible(mtrr_type type1, mtrr_type type2)
*
* This is kinda tricky, but fortunately, Intel spelled it out for us cleanly:
*
- * 1. Send IPI to do the following:
+ * 1. Queue work to do the following on all processors:
* 2. Disable Interrupts
* 3. Wait for all procs to do so
* 4. Enter no-fill cache mode
@@ -214,81 +157,27 @@ static inline int types_compatible(mtrr_type type1, mtrr_type type2)
* 14. Wait for buddies to catch up
* 15. Enable interrupts.
*
- * What does that mean for us? Well, first we set data.count to the number
- * of CPUs. As each CPU disables interrupts, it'll decrement it once. We wait
- * until it hits 0 and proceed. We set the data.gate flag and reset data.count.
- * Meanwhile, they are waiting for that flag to be set. Once it's set, each
- * CPU goes through the transition of updating MTRRs.
- * The CPU vendors may each do it differently,
- * so we call mtrr_if->set() callback and let them take care of it.
- * When they're done, they again decrement data->count and wait for data.gate
- * to be reset.
- * When we finish, we wait for data.count to hit 0 and toggle the data.gate flag
- * Everyone then enables interrupts and we all continue on.
+ * What does that mean for us? Well, stop_machine() will ensure that
+ * the rendezvous handler is started on each CPU. And in lockstep they
+ * do the state transition of disabling interrupts, updating MTRRs
+ * (the CPU vendors may each do it differently, so we call the
+ * mtrr_if->set() callback and let them take care of it) and enabling
+ * interrupts.
*
* Note that the mechanism is the same for UP systems, too; all the SMP stuff
* becomes nops.
*/
-static void
-set_mtrr(unsigned int reg, unsigned long base, unsigned long size, mtrr_type type)
+static void set_mtrr(unsigned int reg, unsigned long base, unsigned long size,
+ mtrr_type type)
{
- struct set_mtrr_data data;
- unsigned long flags;
-
- data.smp_reg = reg;
- data.smp_base = base;
- data.smp_size = size;
- data.smp_type = type;
- atomic_set(&data.count, num_booting_cpus() - 1);
+ struct set_mtrr_data data = { .smp_reg = reg,
+ .smp_base = base,
+ .smp_size = size,
+ .smp_type = type
+ };
- /* Make sure data.count is visible before unleashing other CPUs */
- smp_wmb();
- atomic_set(&data.gate, 0);
+ stop_machine_cpuslocked(mtrr_rendezvous_handler, &data, cpu_online_mask);
- /* Start the ball rolling on other CPUs */
- if (smp_call_function(ipi_handler, &data, 0) != 0)
- panic("mtrr: timed out waiting for other CPUs\n");
-
- local_irq_save(flags);
-
- while (atomic_read(&data.count))
- cpu_relax();
-
- /* Ok, reset count and toggle gate */
- atomic_set(&data.count, num_booting_cpus() - 1);
- smp_wmb();
- atomic_set(&data.gate, 1);
-
- /* Do our MTRR business */
-
- /*
- * HACK!
- * We use this same function to initialize the mtrrs on boot.
- * The state of the boot cpu's mtrrs has been saved, and we want
- * to replicate across all the APs.
- * If we're doing that @reg is set to something special...
- */
- if (reg != ~0U)
- mtrr_if->set(reg, base, size, type);
- else if (!mtrr_aps_delayed_init)
- mtrr_if->set_all();
-
- /* Wait for the others */
- while (atomic_read(&data.count))
- cpu_relax();
-
- atomic_set(&data.count, num_booting_cpus() - 1);
- smp_wmb();
- atomic_set(&data.gate, 0);
-
- /*
- * Wait here for everyone to have seen the gate change
- * So we're the last ones to touch 'data'
- */
- while (atomic_read(&data.count))
- cpu_relax();
-
- local_irq_restore(flags);
+ generic_rebuild_map();
}
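Callers must already hold the CPU hotplug lock, because stop_machine_cpuslocked() is used. A sketch of the calling convention (illustrative only; mtrr_add_page()/mtrr_del_page() below are the real callers):

	cpus_read_lock();
	mutex_lock(&mtrr_mutex);
	set_mtrr(reg, base, size, type);	/* rendezvous on all CPUs */
	mutex_unlock(&mtrr_mutex);
	cpus_read_unlock();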
/**
@@ -333,7 +222,7 @@ int mtrr_add_page(unsigned long base, unsigned long size,
int i, replace, error;
mtrr_type ltype;
- if (!mtrr_if)
+ if (!mtrr_enabled())
return -ENXIO;
error = mtrr_if->validate_add_page(base, size, type);
@@ -341,23 +230,24 @@ int mtrr_add_page(unsigned long base, unsigned long size,
return error;
if (type >= MTRR_NUM_TYPES) {
- pr_warning("mtrr: type: %u invalid\n", type);
+ pr_warn("type: %u invalid\n", type);
return -EINVAL;
}
/* If the type is WC, check that this processor supports it */
if ((type == MTRR_TYPE_WRCOMB) && !have_wrcomb()) {
- pr_warning("mtrr: your processor doesn't support write-combining\n");
+ pr_warn("your processor doesn't support write-combining\n");
return -ENOSYS;
}
if (!size) {
- pr_warning("mtrr: zero sized request\n");
+ pr_warn("zero sized request\n");
return -EINVAL;
}
- if (base & size_or_mask || size & size_or_mask) {
- pr_warning("mtrr: base or size exceeds the MTRR width\n");
+ if ((base | (base + size - 1)) >>
+ (boot_cpu_data.x86_phys_bits - PAGE_SHIFT)) {
+ pr_warn("base or size exceeds the MTRR width\n");
return -EINVAL;
}
@@ -365,7 +255,7 @@ int mtrr_add_page(unsigned long base, unsigned long size,
replace = -1;
/* No CPU hotplug when we change MTRR entries */
- get_online_cpus();
+ cpus_read_lock();
/* Search for existing MTRR */
mutex_lock(&mtrr_mutex);
@@ -388,8 +278,7 @@ int mtrr_add_page(unsigned long base, unsigned long size,
} else if (types_compatible(type, ltype))
continue;
}
- pr_warning("mtrr: 0x%lx000,0x%lx000 overlaps existing"
- " 0x%lx000,0x%lx000\n", base, size, lbase,
+ pr_warn("0x%lx000,0x%lx000 overlaps existing 0x%lx000,0x%lx000\n", base, size, lbase,
lsize);
goto out;
}
@@ -397,7 +286,7 @@ int mtrr_add_page(unsigned long base, unsigned long size,
if (ltype != type) {
if (types_compatible(type, ltype))
continue;
- pr_warning("mtrr: type mismatch for %lx000,%lx000 old: %s new: %s\n",
+ pr_warn("type mismatch for %lx000,%lx000 old: %s new: %s\n",
base, size, mtrr_attrib_to_str(ltype),
mtrr_attrib_to_str(type));
goto out;
@@ -423,20 +312,20 @@ int mtrr_add_page(unsigned long base, unsigned long size,
}
}
} else {
- pr_info("mtrr: no more MTRRs available\n");
+ pr_info("no more MTRRs available\n");
}
error = i;
out:
mutex_unlock(&mtrr_mutex);
- put_online_cpus();
+ cpus_read_unlock();
return error;
}
static int mtrr_check(unsigned long base, unsigned long size)
{
if ((base & (PAGE_SIZE - 1)) || (size & (PAGE_SIZE - 1))) {
- pr_warning("mtrr: size and base must be multiples of 4 kiB\n");
- pr_debug("mtrr: size: 0x%lx base: 0x%lx\n", size, base);
+ pr_warn("size and base must be multiples of 4 kiB\n");
+ Dprintk("size: 0x%lx base: 0x%lx\n", size, base);
dump_stack();
return -1;
}
@@ -481,12 +370,13 @@ static int mtrr_check(unsigned long base, unsigned long size)
int mtrr_add(unsigned long base, unsigned long size, unsigned int type,
bool increment)
{
+ if (!mtrr_enabled())
+ return -ENODEV;
if (mtrr_check(base, size))
return -EINVAL;
return mtrr_add_page(base >> PAGE_SHIFT, size >> PAGE_SHIFT, type,
increment);
}
-EXPORT_SYMBOL(mtrr_add);
/**
* mtrr_del_page - delete a memory type region
@@ -509,12 +399,12 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size)
unsigned long lbase, lsize;
int error = -EINVAL;
- if (!mtrr_if)
- return -ENXIO;
+ if (!mtrr_enabled())
+ return -ENODEV;
max = num_var_ranges;
/* No CPU hotplug when we change MTRR entries */
- get_online_cpus();
+ cpus_read_lock();
mutex_lock(&mtrr_mutex);
if (reg < 0) {
/* Search for existing MTRR */
@@ -526,22 +416,21 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size)
}
}
if (reg < 0) {
- pr_debug("mtrr: no MTRR for %lx000,%lx000 found\n",
- base, size);
+ Dprintk("no MTRR for %lx000,%lx000 found\n", base, size);
goto out;
}
}
if (reg >= max) {
- pr_warning("mtrr: register: %d too big\n", reg);
+ pr_warn("register: %d too big\n", reg);
goto out;
}
mtrr_if->get(reg, &lbase, &lsize, &ltype);
if (lsize < 1) {
- pr_warning("mtrr: MTRR %d not used\n", reg);
+ pr_warn("MTRR %d not used\n", reg);
goto out;
}
if (mtrr_usage_table[reg] < 1) {
- pr_warning("mtrr: reg: %d has count=0\n", reg);
+ pr_warn("reg: %d has count=0\n", reg);
goto out;
}
if (--mtrr_usage_table[reg] < 1)
@@ -549,7 +438,7 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size)
error = reg;
out:
mutex_unlock(&mtrr_mutex);
- put_online_cpus();
+ cpus_read_unlock();
return error;
}
@@ -569,240 +458,176 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size)
*/
int mtrr_del(int reg, unsigned long base, unsigned long size)
{
+ if (!mtrr_enabled())
+ return -ENODEV;
if (mtrr_check(base, size))
return -EINVAL;
return mtrr_del_page(reg, base >> PAGE_SHIFT, size >> PAGE_SHIFT);
}
-EXPORT_SYMBOL(mtrr_del);
-/*
- * HACK ALERT!
- * These should be called implicitly, but we can't yet until all the initcall
- * stuff is done...
+/**
+ * arch_phys_wc_add - add a WC MTRR and handle errors if PAT is unavailable
+ * @base: Physical base address
+ * @size: Size of region
+ *
+ * If PAT is available, this does nothing. If PAT is unavailable, it
+ * attempts to add a WC MTRR covering size bytes starting at base and
+ * logs an error if this fails.
+ *
+ * The caller should provide a power of two size on an equivalent
+ * power of two boundary.
+ *
+ * Drivers must store the return value to pass to arch_phys_wc_del(),
+ * but drivers should not try to interpret that return value.
*/
-static void __init init_ifs(void)
+int arch_phys_wc_add(unsigned long base, unsigned long size)
{
-#ifndef CONFIG_X86_64
- amd_init_mtrr();
- cyrix_init_mtrr();
- centaur_init_mtrr();
-#endif
-}
-
-/* The suspend/resume methods are only for CPU without MTRR. CPU using generic
- * MTRR driver doesn't require this
- */
-struct mtrr_value {
- mtrr_type ltype;
- unsigned long lbase;
- unsigned long lsize;
-};
+ int ret;
-static struct mtrr_value mtrr_value[MTRR_MAX_VAR_RANGES];
+ if (pat_enabled() || !mtrr_enabled())
+ return 0; /* Success! (We don't need to do anything.) */
-static int mtrr_save(struct sys_device *sysdev, pm_message_t state)
-{
- int i;
-
- for (i = 0; i < num_var_ranges; i++) {
- mtrr_if->get(i, &mtrr_value[i].lbase,
- &mtrr_value[i].lsize,
- &mtrr_value[i].ltype);
+ ret = mtrr_add(base, size, MTRR_TYPE_WRCOMB, true);
+ if (ret < 0) {
+ pr_warn("Failed to add WC MTRR for [%p-%p]; performance may suffer.",
+ (void *)base, (void *)(base + size - 1));
+ return ret;
}
- return 0;
+ return ret + MTRR_TO_PHYS_WC_OFFSET;
}
+EXPORT_SYMBOL(arch_phys_wc_add);
-static int mtrr_restore(struct sys_device *sysdev)
+/*
+ * arch_phys_wc_del - undoes arch_phys_wc_add
+ * @handle: Return value from arch_phys_wc_add
+ *
+ * This cleans up after arch_phys_wc_add().
+ *
+ * The API guarantees that arch_phys_wc_del(error code) and
+ * arch_phys_wc_del(0) do nothing.
+ */
+void arch_phys_wc_del(int handle)
{
- int i;
-
- for (i = 0; i < num_var_ranges; i++) {
- if (mtrr_value[i].lsize) {
- set_mtrr(i, mtrr_value[i].lbase,
- mtrr_value[i].lsize,
- mtrr_value[i].ltype);
- }
+ if (handle >= 1) {
+ WARN_ON(handle < MTRR_TO_PHYS_WC_OFFSET);
+ mtrr_del(handle - MTRR_TO_PHYS_WC_OFFSET, 0, 0);
}
- return 0;
}
+EXPORT_SYMBOL(arch_phys_wc_del);
-
-
-static struct sysdev_driver mtrr_sysdev_driver = {
- .suspend = mtrr_save,
- .resume = mtrr_restore,
-};
+/*
+ * arch_phys_wc_index - translates arch_phys_wc_add's return value
+ * @handle: Return value from arch_phys_wc_add
+ *
+ * This will turn the return value from arch_phys_wc_add into an mtrr
+ * index suitable for debugging.
+ *
+ * Note: There is no legitimate use for this function, except possibly
+ * in a printk line. Alas, there is an illegitimate use in some ancient
+ * drm ioctls.
+ */
+int arch_phys_wc_index(int handle)
+{
+ if (handle < MTRR_TO_PHYS_WC_OFFSET)
+ return -1;
+ else
+ return handle - MTRR_TO_PHYS_WC_OFFSET;
+}
+EXPORT_SYMBOL_GPL(arch_phys_wc_index);
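A sketch of the intended driver pattern (all names hypothetical):

	static int example_wc_cookie;

	static void example_map_vram(unsigned long phys, unsigned long len)
	{
		/* Returns 0 and does nothing when PAT is available. */
		example_wc_cookie = arch_phys_wc_add(phys, len);
	}

	static void example_unmap_vram(void)
	{
		/* Safe to call with 0 or a negative error code. */
		arch_phys_wc_del(example_wc_cookie);
	}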
int __initdata changed_by_mtrr_cleanup;
/**
- * mtrr_bp_init - initialize mtrrs on the boot CPU
+ * mtrr_bp_init - initialize MTRRs on the boot CPU
*
* This needs to be called early; before any of the other CPUs are
* initialized (i.e. before smp_init()).
- *
*/
void __init mtrr_bp_init(void)
{
- u32 phys_addr;
+ bool generic_mtrrs = cpu_feature_enabled(X86_FEATURE_MTRR);
+ const char *why = "(not available)";
+ unsigned long config, dummy;
- init_ifs();
-
- phys_addr = 32;
-
- if (cpu_has_mtrr) {
- mtrr_if = &generic_mtrr_ops;
- size_or_mask = 0xff000000; /* 36 bits */
- size_and_mask = 0x00f00000;
- phys_addr = 36;
+ phys_hi_rsvd = GENMASK(31, boot_cpu_data.x86_phys_bits - 32);
+ if (!generic_mtrrs && mtrr_state.enabled) {
/*
- * This is an AMD specific MSR, but we assume(hope?) that
- * Intel will implement it to when they extend the address
- * bus of the Xeon.
+ * Software overwrite of MTRR state, only for generic case.
+ * Note that X86_FEATURE_MTRR has been reset in this case.
*/
- if (cpuid_eax(0x80000000) >= 0x80000008) {
- phys_addr = cpuid_eax(0x80000008) & 0xff;
- /* CPUID workaround for Intel 0F33/0F34 CPU */
- if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
- boot_cpu_data.x86 == 0xF &&
- boot_cpu_data.x86_model == 0x3 &&
- (boot_cpu_data.x86_mask == 0x3 ||
- boot_cpu_data.x86_mask == 0x4))
- phys_addr = 36;
-
- size_or_mask = ~((1ULL << (phys_addr - PAGE_SHIFT)) - 1);
- size_and_mask = ~size_or_mask & 0xfffff00000ULL;
- } else if (boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR &&
- boot_cpu_data.x86 == 6) {
- /*
- * VIA C* family have Intel style MTRRs,
- * but don't support PAE
- */
- size_or_mask = 0xfff00000; /* 32 bits */
- size_and_mask = 0;
- phys_addr = 32;
- }
- } else {
- switch (boot_cpu_data.x86_vendor) {
- case X86_VENDOR_AMD:
- if (cpu_has_k6_mtrr) {
- /* Pre-Athlon (K6) AMD CPU MTRRs */
- mtrr_if = mtrr_ops[X86_VENDOR_AMD];
- size_or_mask = 0xfff00000; /* 32 bits */
- size_and_mask = 0;
- }
- break;
- case X86_VENDOR_CENTAUR:
- if (cpu_has_centaur_mcr) {
- mtrr_if = mtrr_ops[X86_VENDOR_CENTAUR];
- size_or_mask = 0xfff00000; /* 32 bits */
- size_and_mask = 0;
- }
- break;
- case X86_VENDOR_CYRIX:
- if (cpu_has_cyrix_arr) {
- mtrr_if = mtrr_ops[X86_VENDOR_CYRIX];
- size_or_mask = 0xfff00000; /* 32 bits */
- size_and_mask = 0;
- }
- break;
- default:
- break;
- }
+ init_table();
+ mtrr_build_map();
+ pr_info("MTRRs set to read-only\n");
+
+ return;
}
- if (mtrr_if) {
- set_num_var_ranges();
- init_table();
- if (use_intel()) {
- get_mtrr_state();
+ if (generic_mtrrs)
+ mtrr_if = &generic_mtrr_ops;
+ else
+ mtrr_set_if();
- if (mtrr_cleanup(phys_addr)) {
- changed_by_mtrr_cleanup = 1;
- mtrr_if->set_all();
+ if (mtrr_enabled()) {
+ /* Get the number of variable MTRR ranges. */
+ if (mtrr_if == &generic_mtrr_ops)
+ rdmsr(MSR_MTRRcap, config, dummy);
+ else
+ config = mtrr_if->var_regs;
+ num_var_ranges = config & MTRR_CAP_VCNT;
+
+ init_table();
+ if (mtrr_if == &generic_mtrr_ops) {
+ /* BIOS may override */
+ if (get_mtrr_state()) {
+ memory_caching_control |= CACHE_MTRR;
+ changed_by_mtrr_cleanup = mtrr_cleanup();
+ mtrr_build_map();
+ } else {
+ mtrr_if = NULL;
+ why = "by BIOS";
}
}
}
-}
-void mtrr_ap_init(void)
-{
- if (!use_intel() || mtrr_aps_delayed_init)
- return;
- /*
- * Ideally we should hold mtrr_mutex here to avoid mtrr entries
- * changed, but this routine will be called in cpu boot time,
- * holding the lock breaks it.
- *
- * This routine is called in two cases:
- *
- * 1. very earily time of software resume, when there absolutely
- * isn't mtrr entry changes;
- *
- * 2. cpu hotadd time. We let mtrr_add/del_page hold cpuhotplug
- * lock to prevent mtrr entry changes
- */
- set_mtrr(~0U, 0, 0, 0);
+ if (!mtrr_enabled())
+ pr_info("MTRRs disabled %s\n", why);
}
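Worked example for the phys_hi_rsvd computation above: with boot_cpu_data.x86_phys_bits == 36, GENMASK(31, 36 - 32) == GENMASK(31, 4) == 0xFFFFFFF0, i.e. address bits 36-63 (bits 4-31 of the MSR high dword) are treated as reserved.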
/**
- * Save current fixed-range MTRR state of the BSP
+ * mtrr_save_state - Save current fixed-range MTRR state of the first
+ * cpu in cpu_online_mask.
*/
void mtrr_save_state(void)
{
- smp_call_function_single(0, mtrr_save_fixed_ranges, NULL, 1);
-}
-
-void set_mtrr_aps_delayed_init(void)
-{
- if (!use_intel())
- return;
-
- mtrr_aps_delayed_init = true;
-}
+ int first_cpu;
-/*
- * MTRR initialization for all AP's
- */
-void mtrr_aps_init(void)
-{
- if (!use_intel())
+ if (!mtrr_enabled() || !mtrr_state.have_fixed)
return;
- set_mtrr(~0U, 0, 0, 0);
- mtrr_aps_delayed_init = false;
+ first_cpu = cpumask_first(cpu_online_mask);
+ smp_call_function_single(first_cpu, mtrr_save_fixed_ranges, NULL, 1);
}
-void mtrr_bp_restore(void)
+static int __init mtrr_init_finalize(void)
{
- if (!use_intel())
- return;
-
- mtrr_if->set_all();
-}
+ /*
+ * Map might exist if guest_force_mtrr_state() has been called or if
+ * mtrr_enabled() returns true.
+ */
+ mtrr_copy_map();
-static int __init mtrr_init_finialize(void)
-{
- if (!mtrr_if)
+ if (!mtrr_enabled())
return 0;
- if (use_intel()) {
+ if (memory_caching_control & CACHE_MTRR) {
if (!changed_by_mtrr_cleanup)
mtrr_state_warn();
return 0;
}
- /*
- * The CPU has no MTRR and seems to not support SMP. They have
- * specific drivers, we use a tricky method to support
- * suspend/resume for them.
- *
- * TBD: is there any system with such CPU which supports
- * suspend/resume? If no, we should remove the code.
- */
- sysdev_driver_register(&cpu_sysdev_class, &mtrr_sysdev_driver);
+ mtrr_register_syscore();
return 0;
}
-subsys_initcall(mtrr_init_finialize);
+subsys_initcall(mtrr_init_finalize);
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.h b/arch/x86/kernel/cpu/mtrr/mtrr.h
index df5e41f31a27..2de3bd2f95d1 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.h
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
/*
* local MTRR defines.
*/
@@ -9,15 +10,15 @@
#define MTRR_CHANGE_MASK_VARIABLE 0x02
#define MTRR_CHANGE_MASK_DEFTYPE 0x04
+extern bool mtrr_debug;
+#define Dprintk(x...) do { if (mtrr_debug) pr_info(x); } while (0)
+
extern unsigned int mtrr_usage_table[MTRR_MAX_VAR_RANGES];
struct mtrr_ops {
- u32 vendor;
- u32 use_intel_if;
+ u32 var_regs;
void (*set)(unsigned int reg, unsigned long base,
unsigned long size, mtrr_type type);
- void (*set_all)(void);
-
void (*get)(unsigned int reg, unsigned long *base,
unsigned long *size, mtrr_type *type);
int (*get_free_region)(unsigned long base, unsigned long size,
@@ -45,34 +46,45 @@ struct set_mtrr_context {
u32 ccr3;
};
-void set_mtrr_done(struct set_mtrr_context *ctxt);
-void set_mtrr_cache_disable(struct set_mtrr_context *ctxt);
-void set_mtrr_prepare_save(struct set_mtrr_context *ctxt);
-
void fill_mtrr_var_range(unsigned int index,
u32 base_lo, u32 base_hi, u32 mask_lo, u32 mask_hi);
-void get_mtrr_state(void);
+bool get_mtrr_state(void);
-extern void set_mtrr_ops(const struct mtrr_ops *ops);
-
-extern u64 size_or_mask, size_and_mask;
extern const struct mtrr_ops *mtrr_if;
-
-#define is_cpu(vnd) (mtrr_if && mtrr_if->vendor == X86_VENDOR_##vnd)
-#define use_intel() (mtrr_if && mtrr_if->use_intel_if == 1)
+extern struct mutex mtrr_mutex;
extern unsigned int num_var_ranges;
extern u64 mtrr_tom2;
extern struct mtrr_state_type mtrr_state;
+extern u32 phys_hi_rsvd;
void mtrr_state_warn(void);
const char *mtrr_attrib_to_str(int x);
void mtrr_wrmsr(unsigned, unsigned, unsigned);
-
-/* CPU specific mtrr init functions */
-int amd_init_mtrr(void);
-int cyrix_init_mtrr(void);
-int centaur_init_mtrr(void);
+#ifdef CONFIG_X86_32
+void mtrr_set_if(void);
+void mtrr_register_syscore(void);
+#else
+static inline void mtrr_set_if(void) { }
+static inline void mtrr_register_syscore(void) { }
+#endif
+void mtrr_build_map(void);
+void mtrr_copy_map(void);
+
+/* CPU specific mtrr_ops vectors. */
+extern const struct mtrr_ops amd_mtrr_ops;
+extern const struct mtrr_ops cyrix_mtrr_ops;
+extern const struct mtrr_ops centaur_mtrr_ops;
extern int changed_by_mtrr_cleanup;
-extern int mtrr_cleanup(unsigned address_bits);
+extern int mtrr_cleanup(void);
+
+/*
+ * Must be used by code which uses mtrr_if to call platform-specific
+ * MTRR manipulation functions.
+ */
+static inline bool mtrr_enabled(void)
+{
+ return !!mtrr_if;
+}
+void generic_rebuild_map(void);
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
deleted file mode 100644
index 5db5b7d65a18..000000000000
--- a/arch/x86/kernel/cpu/perf_event.c
+++ /dev/null
@@ -1,1781 +0,0 @@
-/*
- * Performance events x86 architecture code
- *
- * Copyright (C) 2008 Thomas Gleixner <tglx@linutronix.de>
- * Copyright (C) 2008-2009 Red Hat, Inc., Ingo Molnar
- * Copyright (C) 2009 Jaswinder Singh Rajput
- * Copyright (C) 2009 Advanced Micro Devices, Inc., Robert Richter
- * Copyright (C) 2008-2009 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
- * Copyright (C) 2009 Intel Corporation, <markus.t.metzger@intel.com>
- * Copyright (C) 2009 Google, Inc., Stephane Eranian
- *
- * For licencing details see kernel-base/COPYING
- */
-
-#include <linux/perf_event.h>
-#include <linux/capability.h>
-#include <linux/notifier.h>
-#include <linux/hardirq.h>
-#include <linux/kprobes.h>
-#include <linux/module.h>
-#include <linux/kdebug.h>
-#include <linux/sched.h>
-#include <linux/uaccess.h>
-#include <linux/slab.h>
-#include <linux/highmem.h>
-#include <linux/cpu.h>
-#include <linux/bitops.h>
-
-#include <asm/apic.h>
-#include <asm/stacktrace.h>
-#include <asm/nmi.h>
-#include <asm/compat.h>
-
-#if 0
-#undef wrmsrl
-#define wrmsrl(msr, val) \
-do { \
- trace_printk("wrmsrl(%lx, %lx)\n", (unsigned long)(msr),\
- (unsigned long)(val)); \
- native_write_msr((msr), (u32)((u64)(val)), \
- (u32)((u64)(val) >> 32)); \
-} while (0)
-#endif
-
-/*
- * best effort, GUP based copy_from_user() that assumes IRQ or NMI context
- */
-static unsigned long
-copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
-{
- unsigned long offset, addr = (unsigned long)from;
- int type = in_nmi() ? KM_NMI : KM_IRQ0;
- unsigned long size, len = 0;
- struct page *page;
- void *map;
- int ret;
-
- do {
- ret = __get_user_pages_fast(addr, 1, 0, &page);
- if (!ret)
- break;
-
- offset = addr & (PAGE_SIZE - 1);
- size = min(PAGE_SIZE - offset, n - len);
-
- map = kmap_atomic(page, type);
- memcpy(to, map+offset, size);
- kunmap_atomic(map, type);
- put_page(page);
-
- len += size;
- to += size;
- addr += size;
-
- } while (len < n);
-
- return len;
-}
-
-struct event_constraint {
- union {
- unsigned long idxmsk[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
- u64 idxmsk64;
- };
- u64 code;
- u64 cmask;
- int weight;
-};
-
-struct amd_nb {
- int nb_id; /* NorthBridge id */
- int refcnt; /* reference count */
- struct perf_event *owners[X86_PMC_IDX_MAX];
- struct event_constraint event_constraints[X86_PMC_IDX_MAX];
-};
-
-#define MAX_LBR_ENTRIES 16
-
-struct cpu_hw_events {
- /*
- * Generic x86 PMC bits
- */
- struct perf_event *events[X86_PMC_IDX_MAX]; /* in counter order */
- unsigned long active_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
- int enabled;
-
- int n_events;
- int n_added;
- int n_txn;
- int assign[X86_PMC_IDX_MAX]; /* event to counter assignment */
- u64 tags[X86_PMC_IDX_MAX];
- struct perf_event *event_list[X86_PMC_IDX_MAX]; /* in enabled order */
-
- unsigned int group_flag;
-
- /*
- * Intel DebugStore bits
- */
- struct debug_store *ds;
- u64 pebs_enabled;
-
- /*
- * Intel LBR bits
- */
- int lbr_users;
- void *lbr_context;
- struct perf_branch_stack lbr_stack;
- struct perf_branch_entry lbr_entries[MAX_LBR_ENTRIES];
-
- /*
- * AMD specific bits
- */
- struct amd_nb *amd_nb;
-};
-
-#define __EVENT_CONSTRAINT(c, n, m, w) {\
- { .idxmsk64 = (n) }, \
- .code = (c), \
- .cmask = (m), \
- .weight = (w), \
-}
-
-#define EVENT_CONSTRAINT(c, n, m) \
- __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n))
-
-/*
- * Constraint on the Event code.
- */
-#define INTEL_EVENT_CONSTRAINT(c, n) \
- EVENT_CONSTRAINT(c, n, ARCH_PERFMON_EVENTSEL_EVENT)
-
-/*
- * Constraint on the Event code + UMask + fixed-mask
- *
- * filter mask to validate fixed counter events.
- * the following filters disqualify for fixed counters:
- * - inv
- * - edge
- * - cnt-mask
- * The other filters are supported by fixed counters.
- * The any-thread option is supported starting with v3.
- */
-#define FIXED_EVENT_CONSTRAINT(c, n) \
- EVENT_CONSTRAINT(c, (1ULL << (32+n)), X86_RAW_EVENT_MASK)
-
-/*
- * Constraint on the Event code + UMask
- */
-#define PEBS_EVENT_CONSTRAINT(c, n) \
- EVENT_CONSTRAINT(c, n, INTEL_ARCH_EVENT_MASK)
-
-#define EVENT_CONSTRAINT_END \
- EVENT_CONSTRAINT(0, 0, 0)
-
-#define for_each_event_constraint(e, c) \
- for ((e) = (c); (e)->weight; (e)++)
-
-union perf_capabilities {
- struct {
- u64 lbr_format : 6;
- u64 pebs_trap : 1;
- u64 pebs_arch_reg : 1;
- u64 pebs_format : 4;
- u64 smm_freeze : 1;
- };
- u64 capabilities;
-};
-
-/*
- * struct x86_pmu - generic x86 pmu
- */
-struct x86_pmu {
- /*
- * Generic x86 PMC bits
- */
- const char *name;
- int version;
- int (*handle_irq)(struct pt_regs *);
- void (*disable_all)(void);
- void (*enable_all)(int added);
- void (*enable)(struct perf_event *);
- void (*disable)(struct perf_event *);
- int (*hw_config)(struct perf_event *event);
- int (*schedule_events)(struct cpu_hw_events *cpuc, int n, int *assign);
- unsigned eventsel;
- unsigned perfctr;
- u64 (*event_map)(int);
- int max_events;
- int num_counters;
- int num_counters_fixed;
- int cntval_bits;
- u64 cntval_mask;
- int apic;
- u64 max_period;
- struct event_constraint *
- (*get_event_constraints)(struct cpu_hw_events *cpuc,
- struct perf_event *event);
-
- void (*put_event_constraints)(struct cpu_hw_events *cpuc,
- struct perf_event *event);
- struct event_constraint *event_constraints;
- void (*quirks)(void);
-
- int (*cpu_prepare)(int cpu);
- void (*cpu_starting)(int cpu);
- void (*cpu_dying)(int cpu);
- void (*cpu_dead)(int cpu);
-
- /*
- * Intel Arch Perfmon v2+
- */
- u64 intel_ctrl;
- union perf_capabilities intel_cap;
-
- /*
- * Intel DebugStore bits
- */
- int bts, pebs;
- int pebs_record_size;
- void (*drain_pebs)(struct pt_regs *regs);
- struct event_constraint *pebs_constraints;
-
- /*
- * Intel LBR
- */
- unsigned long lbr_tos, lbr_from, lbr_to; /* MSR base regs */
- int lbr_nr; /* hardware stack size */
-};
-
-static struct x86_pmu x86_pmu __read_mostly;
-
-static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
- .enabled = 1,
-};
-
-static int x86_perf_event_set_period(struct perf_event *event);
-
-/*
- * Generalized hw caching related hw_event table, filled
- * in on a per-model basis. A value of 0 means
- * 'not supported', -1 means 'hw_event makes no sense on
- * this CPU', any other value means the raw hw_event
- * ID.
- */
-
-#define C(x) PERF_COUNT_HW_CACHE_##x
-
-static u64 __read_mostly hw_cache_event_ids
- [PERF_COUNT_HW_CACHE_MAX]
- [PERF_COUNT_HW_CACHE_OP_MAX]
- [PERF_COUNT_HW_CACHE_RESULT_MAX];
-
-/*
- * Propagate event elapsed time into the generic event.
- * Can only be executed on the CPU where the event is active.
- * Returns the delta events processed.
- */
-static u64
-x86_perf_event_update(struct perf_event *event)
-{
- struct hw_perf_event *hwc = &event->hw;
- int shift = 64 - x86_pmu.cntval_bits;
- u64 prev_raw_count, new_raw_count;
- int idx = hwc->idx;
- s64 delta;
-
- if (idx == X86_PMC_IDX_FIXED_BTS)
- return 0;
-
- /*
- * Careful: an NMI might modify the previous event value.
- *
- * Our tactic to handle this is to first atomically read and
- * exchange a new raw count - then add that new-prev delta
- * count to the generic event atomically:
- */
-again:
- prev_raw_count = atomic64_read(&hwc->prev_count);
- rdmsrl(hwc->event_base + idx, new_raw_count);
-
- if (atomic64_cmpxchg(&hwc->prev_count, prev_raw_count,
- new_raw_count) != prev_raw_count)
- goto again;
-
- /*
- * Now we have the new raw value and have updated the prev
- * timestamp already. We can now calculate the elapsed delta
- * (event-)time and add that to the generic event.
- *
- * Careful, not all hw sign-extends above the physical width
- * of the count.
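- *
- * Shifting the raw values up to bit 63 and arithmetic-shifting
- * them back down sign-extends a cntval_bits wide counter to 64 bits.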
- */
- delta = (new_raw_count << shift) - (prev_raw_count << shift);
- delta >>= shift;
-
- atomic64_add(delta, &event->count);
- atomic64_sub(delta, &hwc->period_left);
-
- return new_raw_count;
-}
-
-static atomic_t active_events;
-static DEFINE_MUTEX(pmc_reserve_mutex);
-
-#ifdef CONFIG_X86_LOCAL_APIC
-
-static bool reserve_pmc_hardware(void)
-{
- int i;
-
- if (nmi_watchdog == NMI_LOCAL_APIC)
- disable_lapic_nmi_watchdog();
-
- for (i = 0; i < x86_pmu.num_counters; i++) {
- if (!reserve_perfctr_nmi(x86_pmu.perfctr + i))
- goto perfctr_fail;
- }
-
- for (i = 0; i < x86_pmu.num_counters; i++) {
- if (!reserve_evntsel_nmi(x86_pmu.eventsel + i))
- goto eventsel_fail;
- }
-
- return true;
-
-eventsel_fail:
- for (i--; i >= 0; i--)
- release_evntsel_nmi(x86_pmu.eventsel + i);
-
- i = x86_pmu.num_counters;
-
-perfctr_fail:
- for (i--; i >= 0; i--)
- release_perfctr_nmi(x86_pmu.perfctr + i);
-
- if (nmi_watchdog == NMI_LOCAL_APIC)
- enable_lapic_nmi_watchdog();
-
- return false;
-}
-
-static void release_pmc_hardware(void)
-{
- int i;
-
- for (i = 0; i < x86_pmu.num_counters; i++) {
- release_perfctr_nmi(x86_pmu.perfctr + i);
- release_evntsel_nmi(x86_pmu.eventsel + i);
- }
-
- if (nmi_watchdog == NMI_LOCAL_APIC)
- enable_lapic_nmi_watchdog();
-}
-
-#else
-
-static bool reserve_pmc_hardware(void) { return true; }
-static void release_pmc_hardware(void) {}
-
-#endif
-
-static int reserve_ds_buffers(void);
-static void release_ds_buffers(void);
-
-static void hw_perf_event_destroy(struct perf_event *event)
-{
- if (atomic_dec_and_mutex_lock(&active_events, &pmc_reserve_mutex)) {
- release_pmc_hardware();
- release_ds_buffers();
- mutex_unlock(&pmc_reserve_mutex);
- }
-}
-
-static inline int x86_pmu_initialized(void)
-{
- return x86_pmu.handle_irq != NULL;
-}
-
-static inline int
-set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event_attr *attr)
-{
- unsigned int cache_type, cache_op, cache_result;
- u64 config, val;
-
- config = attr->config;
-
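- /*
- * attr->config packs the cache event: type in bits 0-7,
- * op in bits 8-15 and result in bits 16-23.
- */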
- cache_type = (config >> 0) & 0xff;
- if (cache_type >= PERF_COUNT_HW_CACHE_MAX)
- return -EINVAL;
-
- cache_op = (config >> 8) & 0xff;
- if (cache_op >= PERF_COUNT_HW_CACHE_OP_MAX)
- return -EINVAL;
-
- cache_result = (config >> 16) & 0xff;
- if (cache_result >= PERF_COUNT_HW_CACHE_RESULT_MAX)
- return -EINVAL;
-
- val = hw_cache_event_ids[cache_type][cache_op][cache_result];
-
- if (val == 0)
- return -ENOENT;
-
- if (val == -1)
- return -EINVAL;
-
- hwc->config |= val;
-
- return 0;
-}
-
-static int x86_setup_perfctr(struct perf_event *event)
-{
- struct perf_event_attr *attr = &event->attr;
- struct hw_perf_event *hwc = &event->hw;
- u64 config;
-
- if (!hwc->sample_period) {
- hwc->sample_period = x86_pmu.max_period;
- hwc->last_period = hwc->sample_period;
- atomic64_set(&hwc->period_left, hwc->sample_period);
- } else {
- /*
- * If we have a PMU initialized but no APIC
- * interrupts, we cannot sample hardware
- * events (user-space has to fall back and
- * sample via a hrtimer based software event):
- */
- if (!x86_pmu.apic)
- return -EOPNOTSUPP;
- }
-
- if (attr->type == PERF_TYPE_RAW)
- return 0;
-
- if (attr->type == PERF_TYPE_HW_CACHE)
- return set_ext_hw_attr(hwc, attr);
-
- if (attr->config >= x86_pmu.max_events)
- return -EINVAL;
-
- /*
- * The generic map:
- */
- config = x86_pmu.event_map(attr->config);
-
- if (config == 0)
- return -ENOENT;
-
- if (config == -1LL)
- return -EINVAL;
-
- /*
- * Branch tracing:
- */
- if ((attr->config == PERF_COUNT_HW_BRANCH_INSTRUCTIONS) &&
- (hwc->sample_period == 1)) {
- /* BTS is not supported by this architecture. */
- if (!x86_pmu.bts)
- return -EOPNOTSUPP;
-
- /* BTS is currently only allowed for user-mode. */
- if (!attr->exclude_kernel)
- return -EOPNOTSUPP;
- }
-
- hwc->config |= config;
-
- return 0;
-}
-
-static int x86_pmu_hw_config(struct perf_event *event)
-{
- if (event->attr.precise_ip) {
- int precise = 0;
-
- /* Support for constant skid */
- if (x86_pmu.pebs)
- precise++;
-
- /* Support for IP fixup */
- if (x86_pmu.lbr_nr)
- precise++;
-
- if (event->attr.precise_ip > precise)
- return -EOPNOTSUPP;
- }
-
- /*
- * Generate PMC IRQs:
- * (keep 'enabled' bit clear for now)
- */
- event->hw.config = ARCH_PERFMON_EVENTSEL_INT;
-
- /*
- * Count user and OS events unless requested not to
- */
- if (!event->attr.exclude_user)
- event->hw.config |= ARCH_PERFMON_EVENTSEL_USR;
- if (!event->attr.exclude_kernel)
- event->hw.config |= ARCH_PERFMON_EVENTSEL_OS;
-
- if (event->attr.type == PERF_TYPE_RAW)
- event->hw.config |= event->attr.config & X86_RAW_EVENT_MASK;
-
- return x86_setup_perfctr(event);
-}
-
-/*
- * Set up the hardware configuration for a given attr_type
- */
-static int __hw_perf_event_init(struct perf_event *event)
-{
- int err;
-
- if (!x86_pmu_initialized())
- return -ENODEV;
-
- err = 0;
- if (!atomic_inc_not_zero(&active_events)) {
- mutex_lock(&pmc_reserve_mutex);
- if (atomic_read(&active_events) == 0) {
- if (!reserve_pmc_hardware())
- err = -EBUSY;
- else {
- err = reserve_ds_buffers();
- if (err)
- release_pmc_hardware();
- }
- }
- if (!err)
- atomic_inc(&active_events);
- mutex_unlock(&pmc_reserve_mutex);
- }
- if (err)
- return err;
-
- event->destroy = hw_perf_event_destroy;
-
- event->hw.idx = -1;
- event->hw.last_cpu = -1;
- event->hw.last_tag = ~0ULL;
-
- return x86_pmu.hw_config(event);
-}
-
-static void x86_pmu_disable_all(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- int idx;
-
- for (idx = 0; idx < x86_pmu.num_counters; idx++) {
- u64 val;
-
- if (!test_bit(idx, cpuc->active_mask))
- continue;
- rdmsrl(x86_pmu.eventsel + idx, val);
- if (!(val & ARCH_PERFMON_EVENTSEL_ENABLE))
- continue;
- val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
- wrmsrl(x86_pmu.eventsel + idx, val);
- }
-}
-
-void hw_perf_disable(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- if (!x86_pmu_initialized())
- return;
-
- if (!cpuc->enabled)
- return;
-
- cpuc->n_added = 0;
- cpuc->enabled = 0;
- barrier();
-
- x86_pmu.disable_all();
-}
-
-static void x86_pmu_enable_all(int added)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- int idx;
-
- for (idx = 0; idx < x86_pmu.num_counters; idx++) {
- struct perf_event *event = cpuc->events[idx];
- u64 val;
-
- if (!test_bit(idx, cpuc->active_mask))
- continue;
-
- val = event->hw.config;
- val |= ARCH_PERFMON_EVENTSEL_ENABLE;
- wrmsrl(x86_pmu.eventsel + idx, val);
- }
-}
-
-static const struct pmu pmu;
-
-static inline int is_x86_event(struct perf_event *event)
-{
- return event->pmu == &pmu;
-}
-
-static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
-{
- struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
- unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
- int i, j, w, wmax, num = 0;
- struct hw_perf_event *hwc;
-
- bitmap_zero(used_mask, X86_PMC_IDX_MAX);
-
- for (i = 0; i < n; i++) {
- c = x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
- constraints[i] = c;
- }
-
- /*
- * fastpath, try to reuse previous register
- */
- for (i = 0; i < n; i++) {
- hwc = &cpuc->event_list[i]->hw;
- c = constraints[i];
-
- /* never assigned */
- if (hwc->idx == -1)
- break;
-
- /* constraint still honored */
- if (!test_bit(hwc->idx, c->idxmsk))
- break;
-
- /* not already used */
- if (test_bit(hwc->idx, used_mask))
- break;
-
- __set_bit(hwc->idx, used_mask);
- if (assign)
- assign[i] = hwc->idx;
- }
- if (i == n)
- goto done;
-
- /*
- * begin slow path
- */
-
- bitmap_zero(used_mask, X86_PMC_IDX_MAX);
-
- /*
- * weight = number of possible counters
- *
- * 1 = most constrained, only works on one counter
- * wmax = least constrained, works on any counter
- *
- * assign events to counters starting with most
- * constrained events.
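- *
- * e.g. an event that can only run on one specific counter
- * (weight 1) is assigned before an event that may use any
- * generic counter (weight == num_counters).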
- */
- wmax = x86_pmu.num_counters;
-
- /*
- * when fixed event counters are present,
- * wmax is incremented by 1 to account
- * for one more choice
- */
- if (x86_pmu.num_counters_fixed)
- wmax++;
-
- for (w = 1, num = n; num && w <= wmax; w++) {
- /* for each event */
- for (i = 0; num && i < n; i++) {
- c = constraints[i];
- hwc = &cpuc->event_list[i]->hw;
-
- if (c->weight != w)
- continue;
-
- for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
- if (!test_bit(j, used_mask))
- break;
- }
-
- if (j == X86_PMC_IDX_MAX)
- break;
-
- __set_bit(j, used_mask);
-
- if (assign)
- assign[i] = j;
- num--;
- }
- }
-done:
- /*
- * scheduling failed or is just a simulation,
- * free resources if necessary
- */
- if (!assign || num) {
- for (i = 0; i < n; i++) {
- if (x86_pmu.put_event_constraints)
- x86_pmu.put_event_constraints(cpuc, cpuc->event_list[i]);
- }
- }
- return num ? -ENOSPC : 0;
-}
-
-/*
- * dogrp: true if we must collect sibling events (group)
- * returns the total number of events, or an error code
- */
-static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader, bool dogrp)
-{
- struct perf_event *event;
- int n, max_count;
-
- max_count = x86_pmu.num_counters + x86_pmu.num_counters_fixed;
-
- /* current number of events already accepted */
- n = cpuc->n_events;
-
- if (is_x86_event(leader)) {
- if (n >= max_count)
- return -ENOSPC;
- cpuc->event_list[n] = leader;
- n++;
- }
- if (!dogrp)
- return n;
-
- list_for_each_entry(event, &leader->sibling_list, group_entry) {
- if (!is_x86_event(event) ||
- event->state <= PERF_EVENT_STATE_OFF)
- continue;
-
- if (n >= max_count)
- return -ENOSPC;
-
- cpuc->event_list[n] = event;
- n++;
- }
- return n;
-}
-
-static inline void x86_assign_hw_event(struct perf_event *event,
- struct cpu_hw_events *cpuc, int i)
-{
- struct hw_perf_event *hwc = &event->hw;
-
- hwc->idx = cpuc->assign[i];
- hwc->last_cpu = smp_processor_id();
- hwc->last_tag = ++cpuc->tags[i];
-
- if (hwc->idx == X86_PMC_IDX_FIXED_BTS) {
- hwc->config_base = 0;
- hwc->event_base = 0;
- } else if (hwc->idx >= X86_PMC_IDX_FIXED) {
- hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
- /*
- * We set it so that event_base + idx in wrmsr/rdmsr maps to
- * MSR_ARCH_PERFMON_FIXED_CTR0 ... CTR2:
- */
- hwc->event_base =
- MSR_ARCH_PERFMON_FIXED_CTR0 - X86_PMC_IDX_FIXED;
- } else {
- hwc->config_base = x86_pmu.eventsel;
- hwc->event_base = x86_pmu.perfctr;
- }
-}
-
-static inline int match_prev_assignment(struct hw_perf_event *hwc,
- struct cpu_hw_events *cpuc,
- int i)
-{
- return hwc->idx == cpuc->assign[i] &&
- hwc->last_cpu == smp_processor_id() &&
- hwc->last_tag == cpuc->tags[i];
-}
-
-static int x86_pmu_start(struct perf_event *event);
-static void x86_pmu_stop(struct perf_event *event);
-
-void hw_perf_enable(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct perf_event *event;
- struct hw_perf_event *hwc;
- int i, added = cpuc->n_added;
-
- if (!x86_pmu_initialized())
- return;
-
- if (cpuc->enabled)
- return;
-
- if (cpuc->n_added) {
- int n_running = cpuc->n_events - cpuc->n_added;
- /*
- * apply assignment obtained either from
- * hw_perf_group_sched_in() or x86_pmu_enable()
- *
- * step1: save events moving to new counters
- * step2: reprogram moved events into new counters
- */
- for (i = 0; i < n_running; i++) {
- event = cpuc->event_list[i];
- hwc = &event->hw;
-
- /*
- * we can avoid reprogramming the counter if:
- * - assigned same counter as last time
- * - running on same CPU as last time
- * - no other event has used the counter since
- */
- if (hwc->idx == -1 ||
- match_prev_assignment(hwc, cpuc, i))
- continue;
-
- x86_pmu_stop(event);
- }
-
- for (i = 0; i < cpuc->n_events; i++) {
- event = cpuc->event_list[i];
- hwc = &event->hw;
-
- if (!match_prev_assignment(hwc, cpuc, i))
- x86_assign_hw_event(event, cpuc, i);
- else if (i < n_running)
- continue;
-
- x86_pmu_start(event);
- }
- cpuc->n_added = 0;
- perf_events_lapic_init();
- }
-
- cpuc->enabled = 1;
- barrier();
-
- x86_pmu.enable_all(added);
-}
-
-static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc,
- u64 enable_mask)
-{
- wrmsrl(hwc->config_base + hwc->idx, hwc->config | enable_mask);
-}
-
-static inline void x86_pmu_disable_event(struct perf_event *event)
-{
- struct hw_perf_event *hwc = &event->hw;
-
- wrmsrl(hwc->config_base + hwc->idx, hwc->config);
-}
-
-static DEFINE_PER_CPU(u64 [X86_PMC_IDX_MAX], pmc_prev_left);
-
-/*
- * Set the next IRQ period, based on the hwc->period_left value.
- * To be called with the event disabled in hw:
- */
-static int
-x86_perf_event_set_period(struct perf_event *event)
-{
- struct hw_perf_event *hwc = &event->hw;
- s64 left = atomic64_read(&hwc->period_left);
- s64 period = hwc->sample_period;
- int ret = 0, idx = hwc->idx;
-
- if (idx == X86_PMC_IDX_FIXED_BTS)
- return 0;
-
- /*
- * If we are way outside a reasonable range then just skip forward:
- */
- if (unlikely(left <= -period)) {
- left = period;
- atomic64_set(&hwc->period_left, left);
- hwc->last_period = period;
- ret = 1;
- }
-
- if (unlikely(left <= 0)) {
- left += period;
- atomic64_set(&hwc->period_left, left);
- hwc->last_period = period;
- ret = 1;
- }
- /*
- * Quirk: certain CPUs don't like it if just 1 hw_event is left:
- */
- if (unlikely(left < 2))
- left = 2;
-
- if (left > x86_pmu.max_period)
- left = x86_pmu.max_period;
-
- per_cpu(pmc_prev_left[idx], smp_processor_id()) = left;
-
- /*
- * The hw event starts counting from this event offset;
- * mark it to be able to extract future deltas:
- */
- atomic64_set(&hwc->prev_count, (u64)-left);
-
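- /*
- * Write -left to the counter: it counts upwards and will
- * overflow after 'left' more increments.
- */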
- wrmsrl(hwc->event_base + idx,
- (u64)(-left) & x86_pmu.cntval_mask);
-
- perf_event_update_userpage(event);
-
- return ret;
-}
-
-static void x86_pmu_enable_event(struct perf_event *event)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- if (cpuc->enabled)
- __x86_pmu_enable_event(&event->hw,
- ARCH_PERFMON_EVENTSEL_ENABLE);
-}
-
-/*
- * activate a single event
- *
- * The event is added to the group of enabled events
- * but only if it can be scheduled with existing events.
- *
- * Called with PMU disabled. If successful and return value 1,
- * then guaranteed to call perf_enable() and hw_perf_enable()
- */
-static int x86_pmu_enable(struct perf_event *event)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct hw_perf_event *hwc;
- int assign[X86_PMC_IDX_MAX];
- int n, n0, ret;
-
- hwc = &event->hw;
-
- n0 = cpuc->n_events;
- n = collect_events(cpuc, event, false);
- if (n < 0)
- return n;
-
- /*
- * If a group event scheduling transaction was started,
- * skip the schedulability test here; it will be performed
- * at commit time (->commit_txn) as a whole
- */
- if (cpuc->group_flag & PERF_EVENT_TXN_STARTED)
- goto out;
-
- ret = x86_pmu.schedule_events(cpuc, n, assign);
- if (ret)
- return ret;
- /*
- * copy the new assignment now that we know it is possible;
- * it will be used by hw_perf_enable()
- */
- memcpy(cpuc->assign, assign, n*sizeof(int));
-
-out:
- cpuc->n_events = n;
- cpuc->n_added += n - n0;
- cpuc->n_txn += n - n0;
-
- return 0;
-}
-
-static int x86_pmu_start(struct perf_event *event)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- int idx = event->hw.idx;
-
- if (idx == -1)
- return -EAGAIN;
-
- x86_perf_event_set_period(event);
- cpuc->events[idx] = event;
- __set_bit(idx, cpuc->active_mask);
- x86_pmu.enable(event);
- perf_event_update_userpage(event);
-
- return 0;
-}
-
-static void x86_pmu_unthrottle(struct perf_event *event)
-{
- int ret = x86_pmu_start(event);
- WARN_ON_ONCE(ret);
-}
-
-void perf_event_print_debug(void)
-{
- u64 ctrl, status, overflow, pmc_ctrl, pmc_count, prev_left, fixed;
- u64 pebs;
- struct cpu_hw_events *cpuc;
- unsigned long flags;
- int cpu, idx;
-
- if (!x86_pmu.num_counters)
- return;
-
- local_irq_save(flags);
-
- cpu = smp_processor_id();
- cpuc = &per_cpu(cpu_hw_events, cpu);
-
- if (x86_pmu.version >= 2) {
- rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl);
- rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
- rdmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, overflow);
- rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, fixed);
- rdmsrl(MSR_IA32_PEBS_ENABLE, pebs);
-
- pr_info("\n");
- pr_info("CPU#%d: ctrl: %016llx\n", cpu, ctrl);
- pr_info("CPU#%d: status: %016llx\n", cpu, status);
- pr_info("CPU#%d: overflow: %016llx\n", cpu, overflow);
- pr_info("CPU#%d: fixed: %016llx\n", cpu, fixed);
- pr_info("CPU#%d: pebs: %016llx\n", cpu, pebs);
- }
- pr_info("CPU#%d: active: %016llx\n", cpu, *(u64 *)cpuc->active_mask);
-
- for (idx = 0; idx < x86_pmu.num_counters; idx++) {
- rdmsrl(x86_pmu.eventsel + idx, pmc_ctrl);
- rdmsrl(x86_pmu.perfctr + idx, pmc_count);
-
- prev_left = per_cpu(pmc_prev_left[idx], cpu);
-
- pr_info("CPU#%d: gen-PMC%d ctrl: %016llx\n",
- cpu, idx, pmc_ctrl);
- pr_info("CPU#%d: gen-PMC%d count: %016llx\n",
- cpu, idx, pmc_count);
- pr_info("CPU#%d: gen-PMC%d left: %016llx\n",
- cpu, idx, prev_left);
- }
- for (idx = 0; idx < x86_pmu.num_counters_fixed; idx++) {
- rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + idx, pmc_count);
-
- pr_info("CPU#%d: fixed-PMC%d count: %016llx\n",
- cpu, idx, pmc_count);
- }
- local_irq_restore(flags);
-}
-
-static void x86_pmu_stop(struct perf_event *event)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct hw_perf_event *hwc = &event->hw;
- int idx = hwc->idx;
-
- if (!__test_and_clear_bit(idx, cpuc->active_mask))
- return;
-
- x86_pmu.disable(event);
-
- /*
- * Drain the remaining delta count out of an event
- * that we are disabling:
- */
- x86_perf_event_update(event);
-
- cpuc->events[idx] = NULL;
-}
-
-static void x86_pmu_disable(struct perf_event *event)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- int i;
-
- /*
- * If we're called during a txn, we don't need to do anything.
- * The events never got scheduled and ->cancel_txn will truncate
- * the event_list.
- */
- if (cpuc->group_flag & PERF_EVENT_TXN_STARTED)
- return;
-
- x86_pmu_stop(event);
-
- for (i = 0; i < cpuc->n_events; i++) {
- if (event == cpuc->event_list[i]) {
-
- if (x86_pmu.put_event_constraints)
- x86_pmu.put_event_constraints(cpuc, event);
-
- while (++i < cpuc->n_events)
- cpuc->event_list[i-1] = cpuc->event_list[i];
-
- --cpuc->n_events;
- break;
- }
- }
- perf_event_update_userpage(event);
-}
-
-static int x86_pmu_handle_irq(struct pt_regs *regs)
-{
- struct perf_sample_data data;
- struct cpu_hw_events *cpuc;
- struct perf_event *event;
- struct hw_perf_event *hwc;
- int idx, handled = 0;
- u64 val;
-
- perf_sample_data_init(&data, 0);
-
- cpuc = &__get_cpu_var(cpu_hw_events);
-
- for (idx = 0; idx < x86_pmu.num_counters; idx++) {
- if (!test_bit(idx, cpuc->active_mask))
- continue;
-
- event = cpuc->events[idx];
- hwc = &event->hw;
-
- val = x86_perf_event_update(event);
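- /*
- * Counters are programmed with a negative value; while the
- * top implemented bit (cntval_bits - 1) is still set, the
- * counter has not wrapped, i.e. no overflow has occurred.
- */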
- if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
- continue;
-
- /*
- * event overflow
- */
- handled = 1;
- data.period = event->hw.last_period;
-
- if (!x86_perf_event_set_period(event))
- continue;
-
- if (perf_event_overflow(event, 1, &data, regs))
- x86_pmu_stop(event);
- }
-
- if (handled)
- inc_irq_stat(apic_perf_irqs);
-
- return handled;
-}
-
-void smp_perf_pending_interrupt(struct pt_regs *regs)
-{
- irq_enter();
- ack_APIC_irq();
- inc_irq_stat(apic_pending_irqs);
- perf_event_do_pending();
- irq_exit();
-}
-
-void set_perf_event_pending(void)
-{
-#ifdef CONFIG_X86_LOCAL_APIC
- if (!x86_pmu.apic || !x86_pmu_initialized())
- return;
-
- apic->send_IPI_self(LOCAL_PENDING_VECTOR);
-#endif
-}
-
-void perf_events_lapic_init(void)
-{
- if (!x86_pmu.apic || !x86_pmu_initialized())
- return;
-
- /*
- * Always use NMI for PMU
- */
- apic_write(APIC_LVTPC, APIC_DM_NMI);
-}
-
-static int __kprobes
-perf_event_nmi_handler(struct notifier_block *self,
- unsigned long cmd, void *__args)
-{
- struct die_args *args = __args;
- struct pt_regs *regs;
-
- if (!atomic_read(&active_events))
- return NOTIFY_DONE;
-
- switch (cmd) {
- case DIE_NMI:
- case DIE_NMI_IPI:
- break;
-
- default:
- return NOTIFY_DONE;
- }
-
- regs = args->regs;
-
- apic_write(APIC_LVTPC, APIC_DM_NMI);
- /*
- * Can't rely on the handled return value to say it was our NMI: two
- * events could trigger 'simultaneously', raising two back-to-back NMIs.
- *
- * If the first NMI handles both, the latter will be empty and daze
- * the CPU.
- */
- x86_pmu.handle_irq(regs);
-
- return NOTIFY_STOP;
-}
-
-static __read_mostly struct notifier_block perf_event_nmi_notifier = {
- .notifier_call = perf_event_nmi_handler,
- .next = NULL,
- .priority = 1
-};
-
-static struct event_constraint unconstrained;
-static struct event_constraint emptyconstraint;
-
-static struct event_constraint *
-x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
-{
- struct event_constraint *c;
-
- if (x86_pmu.event_constraints) {
- for_each_event_constraint(c, x86_pmu.event_constraints) {
- if ((event->hw.config & c->cmask) == c->code)
- return c;
- }
- }
-
- return &unconstrained;
-}
-
-#include "perf_event_amd.c"
-#include "perf_event_p6.c"
-#include "perf_event_p4.c"
-#include "perf_event_intel_lbr.c"
-#include "perf_event_intel_ds.c"
-#include "perf_event_intel.c"
-
-static int __cpuinit
-x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
-{
- unsigned int cpu = (long)hcpu;
- int ret = NOTIFY_OK;
-
- switch (action & ~CPU_TASKS_FROZEN) {
- case CPU_UP_PREPARE:
- if (x86_pmu.cpu_prepare)
- ret = x86_pmu.cpu_prepare(cpu);
- break;
-
- case CPU_STARTING:
- if (x86_pmu.cpu_starting)
- x86_pmu.cpu_starting(cpu);
- break;
-
- case CPU_DYING:
- if (x86_pmu.cpu_dying)
- x86_pmu.cpu_dying(cpu);
- break;
-
- case CPU_UP_CANCELED:
- case CPU_DEAD:
- if (x86_pmu.cpu_dead)
- x86_pmu.cpu_dead(cpu);
- break;
-
- default:
- break;
- }
-
- return ret;
-}
-
-static void __init pmu_check_apic(void)
-{
- if (cpu_has_apic)
- return;
-
- x86_pmu.apic = 0;
- pr_info("no APIC, boot with the \"lapic\" boot parameter to force-enable it.\n");
- pr_info("no hardware sampling interrupt available.\n");
-}
-
-void __init init_hw_perf_events(void)
-{
- struct event_constraint *c;
- int err;
-
- pr_info("Performance Events: ");
-
- switch (boot_cpu_data.x86_vendor) {
- case X86_VENDOR_INTEL:
- err = intel_pmu_init();
- break;
- case X86_VENDOR_AMD:
- err = amd_pmu_init();
- break;
- default:
- return;
- }
- if (err != 0) {
- pr_cont("no PMU driver, software events only.\n");
- return;
- }
-
- pmu_check_apic();
-
- pr_cont("%s PMU driver.\n", x86_pmu.name);
-
- if (x86_pmu.quirks)
- x86_pmu.quirks();
-
- if (x86_pmu.num_counters > X86_PMC_MAX_GENERIC) {
- WARN(1, KERN_ERR "hw perf events %d > max(%d), clipping!",
- x86_pmu.num_counters, X86_PMC_MAX_GENERIC);
- x86_pmu.num_counters = X86_PMC_MAX_GENERIC;
- }
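- /*
- * Enable mask for MSR_CORE_PERF_GLOBAL_CTRL (Intel v2+): one bit
- * per generic counter; the fixed counters are added below at bit
- * X86_PMC_IDX_FIXED.
- */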
- x86_pmu.intel_ctrl = (1 << x86_pmu.num_counters) - 1;
- perf_max_events = x86_pmu.num_counters;
-
- if (x86_pmu.num_counters_fixed > X86_PMC_MAX_FIXED) {
- WARN(1, KERN_ERR "hw perf events fixed %d > max(%d), clipping!",
- x86_pmu.num_counters_fixed, X86_PMC_MAX_FIXED);
- x86_pmu.num_counters_fixed = X86_PMC_MAX_FIXED;
- }
-
- x86_pmu.intel_ctrl |=
- ((1LL << x86_pmu.num_counters_fixed)-1) << X86_PMC_IDX_FIXED;
-
- perf_events_lapic_init();
- register_die_notifier(&perf_event_nmi_notifier);
-
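- /* the default constraint: an event may use any generic counter */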
- unconstrained = (struct event_constraint)
- __EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_counters) - 1,
- 0, x86_pmu.num_counters);
-
- if (x86_pmu.event_constraints) {
- for_each_event_constraint(c, x86_pmu.event_constraints) {
- if (c->cmask != X86_RAW_EVENT_MASK)
- continue;
-
- c->idxmsk64 |= (1ULL << x86_pmu.num_counters) - 1;
- c->weight += x86_pmu.num_counters;
- }
- }
-
- pr_info("... version: %d\n", x86_pmu.version);
- pr_info("... bit width: %d\n", x86_pmu.cntval_bits);
- pr_info("... generic registers: %d\n", x86_pmu.num_counters);
- pr_info("... value mask: %016Lx\n", x86_pmu.cntval_mask);
- pr_info("... max period: %016Lx\n", x86_pmu.max_period);
- pr_info("... fixed-purpose events: %d\n", x86_pmu.num_counters_fixed);
- pr_info("... event mask: %016Lx\n", x86_pmu.intel_ctrl);
-
- perf_cpu_notifier(x86_pmu_notifier);
-}
-
-static inline void x86_pmu_read(struct perf_event *event)
-{
- x86_perf_event_update(event);
-}
-
-/*
- * Start group events scheduling transaction
- * Set the flag to make pmu::enable() not perform the
- * schedulability test; it will be performed at commit time
- */
-static void x86_pmu_start_txn(const struct pmu *pmu)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- cpuc->group_flag |= PERF_EVENT_TXN_STARTED;
- cpuc->n_txn = 0;
-}
-
-/*
- * Stop group events scheduling transaction
- * Clear the flag and pmu::enable() will perform the
- * schedulability test.
- */
-static void x86_pmu_cancel_txn(const struct pmu *pmu)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- cpuc->group_flag &= ~PERF_EVENT_TXN_STARTED;
- /*
- * Truncate the collected events.
- */
- cpuc->n_added -= cpuc->n_txn;
- cpuc->n_events -= cpuc->n_txn;
-}
-
-/*
- * Commit group events scheduling transaction
- * Perform the group schedulability test as a whole
- * Return 0 on success
- */
-static int x86_pmu_commit_txn(const struct pmu *pmu)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- int assign[X86_PMC_IDX_MAX];
- int n, ret;
-
- n = cpuc->n_events;
-
- if (!x86_pmu_initialized())
- return -EAGAIN;
-
- ret = x86_pmu.schedule_events(cpuc, n, assign);
- if (ret)
- return ret;
-
- /*
- * copy the new assignment now that we know it is possible;
- * it will be used by hw_perf_enable()
- */
- memcpy(cpuc->assign, assign, n*sizeof(int));
-
- /*
- * Clear out the txn count so that ->cancel_txn() which gets
- * run after ->commit_txn() doesn't undo things.
- */
- cpuc->n_txn = 0;
-
- return 0;
-}
-
-static const struct pmu pmu = {
- .enable = x86_pmu_enable,
- .disable = x86_pmu_disable,
- .start = x86_pmu_start,
- .stop = x86_pmu_stop,
- .read = x86_pmu_read,
- .unthrottle = x86_pmu_unthrottle,
- .start_txn = x86_pmu_start_txn,
- .cancel_txn = x86_pmu_cancel_txn,
- .commit_txn = x86_pmu_commit_txn,
-};
-
-/*
- * validate that we can schedule this event
- */
-static int validate_event(struct perf_event *event)
-{
- struct cpu_hw_events *fake_cpuc;
- struct event_constraint *c;
- int ret = 0;
-
- fake_cpuc = kmalloc(sizeof(*fake_cpuc), GFP_KERNEL | __GFP_ZERO);
- if (!fake_cpuc)
- return -ENOMEM;
-
- c = x86_pmu.get_event_constraints(fake_cpuc, event);
-
- if (!c || !c->weight)
- ret = -ENOSPC;
-
- if (x86_pmu.put_event_constraints)
- x86_pmu.put_event_constraints(fake_cpuc, event);
-
- kfree(fake_cpuc);
-
- return ret;
-}
-
-/*
- * validate a single event group
- *
- * validation includes:
- * - check events are compatible with each other
- * - events do not compete for the same counter
- * - number of events <= number of counters
- *
- * validation ensures the group can be loaded onto the
- * PMU if it was the only group available.
- */
-static int validate_group(struct perf_event *event)
-{
- struct perf_event *leader = event->group_leader;
- struct cpu_hw_events *fake_cpuc;
- int ret, n;
-
- ret = -ENOMEM;
- fake_cpuc = kmalloc(sizeof(*fake_cpuc), GFP_KERNEL | __GFP_ZERO);
- if (!fake_cpuc)
- goto out;
-
- /*
- * the event is not yet connected with its
- * siblings; therefore we must first collect
- * existing siblings, then add the new event
- * before we can simulate the scheduling
- */
- ret = -ENOSPC;
- n = collect_events(fake_cpuc, leader, true);
- if (n < 0)
- goto out_free;
-
- fake_cpuc->n_events = n;
- n = collect_events(fake_cpuc, event, false);
- if (n < 0)
- goto out_free;
-
- fake_cpuc->n_events = n;
-
- ret = x86_pmu.schedule_events(fake_cpuc, n, NULL);
-
-out_free:
- kfree(fake_cpuc);
-out:
- return ret;
-}
-
-const struct pmu *hw_perf_event_init(struct perf_event *event)
-{
- const struct pmu *tmp;
- int err;
-
- err = __hw_perf_event_init(event);
- if (!err) {
- /*
- * we temporarily connect the event to its pmu
- * such that validate_group() can classify
- * it as an x86 event using is_x86_event()
- */
- tmp = event->pmu;
- event->pmu = &pmu;
-
- if (event->group_leader != event)
- err = validate_group(event);
- else
- err = validate_event(event);
-
- event->pmu = tmp;
- }
- if (err) {
- if (event->destroy)
- event->destroy(event);
- return ERR_PTR(err);
- }
-
- return &pmu;
-}
-
-/*
- * callchain support
- */
-
-static inline
-void callchain_store(struct perf_callchain_entry *entry, u64 ip)
-{
- if (entry->nr < PERF_MAX_STACK_DEPTH)
- entry->ip[entry->nr++] = ip;
-}
-
-static DEFINE_PER_CPU(struct perf_callchain_entry, pmc_irq_entry);
-static DEFINE_PER_CPU(struct perf_callchain_entry, pmc_nmi_entry);
-
-
-static void
-backtrace_warning_symbol(void *data, char *msg, unsigned long symbol)
-{
- /* Ignore warnings */
-}
-
-static void backtrace_warning(void *data, char *msg)
-{
- /* Ignore warnings */
-}
-
-static int backtrace_stack(void *data, char *name)
-{
- return 0;
-}
-
-static void backtrace_address(void *data, unsigned long addr, int reliable)
-{
- struct perf_callchain_entry *entry = data;
-
- callchain_store(entry, addr);
-}
-
-static const struct stacktrace_ops backtrace_ops = {
- .warning = backtrace_warning,
- .warning_symbol = backtrace_warning_symbol,
- .stack = backtrace_stack,
- .address = backtrace_address,
- .walk_stack = print_context_stack_bp,
-};
-
-#include "../dumpstack.h"
-
-static void
-perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
-{
- callchain_store(entry, PERF_CONTEXT_KERNEL);
- callchain_store(entry, regs->ip);
-
- dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
-}
-
-#ifdef CONFIG_COMPAT
-static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
-{
- /* 32-bit process in 64-bit kernel. */
- struct stack_frame_ia32 frame;
- const void __user *fp;
-
- if (!test_thread_flag(TIF_IA32))
- return 0;
-
- fp = compat_ptr(regs->bp);
- while (entry->nr < PERF_MAX_STACK_DEPTH) {
- unsigned long bytes;
- frame.next_frame = 0;
- frame.return_address = 0;
-
- bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
- if (bytes != sizeof(frame))
- break;
-
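- /* a sane frame pointer must lie above the current stack pointer */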
- if (fp < compat_ptr(regs->sp))
- break;
-
- callchain_store(entry, frame.return_address);
- fp = compat_ptr(frame.next_frame);
- }
- return 1;
-}
-#else
-static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
-{
- return 0;
-}
-#endif
-
-static void
-perf_callchain_user(struct pt_regs *regs, struct perf_callchain_entry *entry)
-{
- struct stack_frame frame;
- const void __user *fp;
-
- if (!user_mode(regs))
- regs = task_pt_regs(current);
-
- fp = (void __user *)regs->bp;
-
- callchain_store(entry, PERF_CONTEXT_USER);
- callchain_store(entry, regs->ip);
-
- if (perf_callchain_user32(regs, entry))
- return;
-
- while (entry->nr < PERF_MAX_STACK_DEPTH) {
- unsigned long bytes;
- frame.next_frame = NULL;
- frame.return_address = 0;
-
- bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
- if (bytes != sizeof(frame))
- break;
-
- if ((unsigned long)fp < regs->sp)
- break;
-
- callchain_store(entry, frame.return_address);
- fp = frame.next_frame;
- }
-}
-
-static void
-perf_do_callchain(struct pt_regs *regs, struct perf_callchain_entry *entry)
-{
- int is_user;
-
- if (!regs)
- return;
-
- is_user = user_mode(regs);
-
- if (is_user && current->state != TASK_RUNNING)
- return;
-
- if (!is_user)
- perf_callchain_kernel(regs, entry);
-
- if (current->mm)
- perf_callchain_user(regs, entry);
-}
-
-struct perf_callchain_entry *perf_callchain(struct pt_regs *regs)
-{
- struct perf_callchain_entry *entry;
-
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
- /* TODO: We don't support guest os callchain now */
- return NULL;
- }
-
- if (in_nmi())
- entry = &__get_cpu_var(pmc_nmi_entry);
- else
- entry = &__get_cpu_var(pmc_irq_entry);
-
- entry->nr = 0;
-
- perf_do_callchain(regs, entry);
-
- return entry;
-}
-
-void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned long ip, int skip)
-{
- regs->ip = ip;
- /*
- * perf_arch_fetch_caller_regs adds another call; we need to increment
- * the skip level
- */
- regs->bp = rewind_frame_pointer(skip + 1);
- regs->cs = __KERNEL_CS;
- /*
- * We abuse bit 3 to pass exact information, see perf_misc_flags
- * and the comment with PERF_EFLAGS_EXACT.
- */
- regs->flags = 0;
-}
-
-unsigned long perf_instruction_pointer(struct pt_regs *regs)
-{
- unsigned long ip;
-
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
- ip = perf_guest_cbs->get_guest_ip();
- else
- ip = instruction_pointer(regs);
-
- return ip;
-}
-
-unsigned long perf_misc_flags(struct pt_regs *regs)
-{
- int misc = 0;
-
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
- if (perf_guest_cbs->is_user_mode())
- misc |= PERF_RECORD_MISC_GUEST_USER;
- else
- misc |= PERF_RECORD_MISC_GUEST_KERNEL;
- } else {
- if (user_mode(regs))
- misc |= PERF_RECORD_MISC_USER;
- else
- misc |= PERF_RECORD_MISC_KERNEL;
- }
-
- if (regs->flags & PERF_EFLAGS_EXACT)
- misc |= PERF_RECORD_MISC_EXACT_IP;
-
- return misc;
-}
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
deleted file mode 100644
index c2897b7b4a3b..000000000000
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ /dev/null
@@ -1,420 +0,0 @@
-#ifdef CONFIG_CPU_SUP_AMD
-
-static DEFINE_RAW_SPINLOCK(amd_nb_lock);
-
-static __initconst const u64 amd_hw_cache_event_ids
- [PERF_COUNT_HW_CACHE_MAX]
- [PERF_COUNT_HW_CACHE_OP_MAX]
- [PERF_COUNT_HW_CACHE_RESULT_MAX] =
-{
- [ C(L1D) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0040, /* Data Cache Accesses */
- [ C(RESULT_MISS) ] = 0x0041, /* Data Cache Misses */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x0142, /* Data Cache Refills :system */
- [ C(RESULT_MISS) ] = 0,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x0267, /* Data Prefetcher :attempts */
- [ C(RESULT_MISS) ] = 0x0167, /* Data Prefetcher :cancelled */
- },
- },
- [ C(L1I ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0080, /* Instruction cache fetches */
- [ C(RESULT_MISS) ] = 0x0081, /* Instruction cache misses */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x014B, /* Prefetch Instructions :Load */
- [ C(RESULT_MISS) ] = 0,
- },
- },
- [ C(LL ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x037D, /* Requests to L2 Cache :IC+DC */
- [ C(RESULT_MISS) ] = 0x037E, /* L2 Cache Misses : IC+DC */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x017F, /* L2 Fill/Writeback */
- [ C(RESULT_MISS) ] = 0,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0,
- [ C(RESULT_MISS) ] = 0,
- },
- },
- [ C(DTLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0040, /* Data Cache Accesses */
- [ C(RESULT_MISS) ] = 0x0046, /* L1 DTLB and L2 DLTB Miss */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0,
- [ C(RESULT_MISS) ] = 0,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0,
- [ C(RESULT_MISS) ] = 0,
- },
- },
- [ C(ITLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0080, /* Instruction fetches */
- [ C(RESULT_MISS) ] = 0x0085, /* Instr. fetch ITLB misses */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- },
- [ C(BPU ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x00c2, /* Retired Branch Instr. */
- [ C(RESULT_MISS) ] = 0x00c3, /* Retired Mispredicted BI */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- },
-};
-
-/*
- * AMD Performance Monitor K7 and later.
- */
-static const u64 amd_perfmon_event_map[] =
-{
- [PERF_COUNT_HW_CPU_CYCLES] = 0x0076,
- [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,
- [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0080,
- [PERF_COUNT_HW_CACHE_MISSES] = 0x0081,
- [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c2,
- [PERF_COUNT_HW_BRANCH_MISSES] = 0x00c3,
-};
-
-static u64 amd_pmu_event_map(int hw_event)
-{
- return amd_perfmon_event_map[hw_event];
-}
-
-static int amd_pmu_hw_config(struct perf_event *event)
-{
- int ret = x86_pmu_hw_config(event);
-
- if (ret)
- return ret;
-
- if (event->attr.type != PERF_TYPE_RAW)
- return 0;
-
- event->hw.config |= event->attr.config & AMD64_RAW_EVENT_MASK;
-
- return 0;
-}
-
-/*
- * AMD64 events are detected based on their event codes.
- */
-static inline int amd_is_nb_event(struct hw_perf_event *hwc)
-{
- return (hwc->config & 0xe0) == 0xe0;
-}
-
-static inline int amd_has_nb(struct cpu_hw_events *cpuc)
-{
- struct amd_nb *nb = cpuc->amd_nb;
-
- return nb && nb->nb_id != -1;
-}
-
-static void amd_put_event_constraints(struct cpu_hw_events *cpuc,
- struct perf_event *event)
-{
- struct hw_perf_event *hwc = &event->hw;
- struct amd_nb *nb = cpuc->amd_nb;
- int i;
-
- /*
- * only care about NB events
- */
- if (!(amd_has_nb(cpuc) && amd_is_nb_event(hwc)))
- return;
-
- /*
- * need to scan whole list because event may not have
- * been assigned during scheduling
- *
- * no race condition possible because event can only
- * be removed on one CPU at a time AND PMU is disabled
- * when we come here
- */
- for (i = 0; i < x86_pmu.num_counters; i++) {
- if (nb->owners[i] == event) {
- cmpxchg(nb->owners+i, event, NULL);
- break;
- }
- }
-}
-
- /*
- * AMD64 NorthBridge events need special treatment because
- * counter access needs to be synchronized across all cores
- * of a package. Refer to BKDG section 3.12
- *
- * NB events are events measuring L3 cache and HyperTransport
- * traffic. They are identified by an event code >= 0xe00.
- * They measure events on the NorthBridge, which is shared
- * by all cores on a package. NB events are counted on a
- * shared set of counters. When a NB event is programmed
- * in a counter, the data actually comes from a shared
- * counter. Thus, access to those counters needs to be
- * synchronized.
- *
- * We implement the synchronization such that no two cores
- * can be measuring NB events using the same counters. Thus,
- * we maintain a per-NB allocation table. The available slot
- * is propagated using the event_constraint structure.
- *
- * We provide only one choice for each NB event based on
- * the fact that only NB events have restrictions. Consequently,
- * if a counter is available, there is a guarantee the NB event
- * will be assigned to it. If no slot is available, an empty
- * constraint is returned and scheduling will eventually fail
- * for this event.
- *
- * Note that all cores attached to the same NB compete for the same
- * counters to host NB events, this is why we use atomic ops. Some
- * multi-chip CPUs may have more than one NB.
- *
- * Given that resources are allocated (cmpxchg), they must be
- * eventually freed for others to use. This is accomplished by
- * calling amd_put_event_constraints().
- *
- * Non NB events are not impacted by this restriction.
- */
-static struct event_constraint *
-amd_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
-{
- struct hw_perf_event *hwc = &event->hw;
- struct amd_nb *nb = cpuc->amd_nb;
- struct perf_event *old = NULL;
- int max = x86_pmu.num_counters;
- int i, j, k = -1;
-
- /*
- * if not NB event or no NB, then no constraints
- */
- if (!(amd_has_nb(cpuc) && amd_is_nb_event(hwc)))
- return &unconstrained;
-
- /*
- * detect if already present, if so reuse
- *
- * cannot merge with actual allocation
- * because of possible holes
- *
- * event can already be present yet not assigned (in hwc->idx)
- * because of successive calls to x86_schedule_events() from
- * hw_perf_group_sched_in() without hw_perf_enable()
- */
- for (i = 0; i < max; i++) {
- /*
- * keep track of first free slot
- */
- if (k == -1 && !nb->owners[i])
- k = i;
-
- /* already present, reuse */
- if (nb->owners[i] == event)
- goto done;
- }
- /*
- * not present, so grab a new slot
- * starting either at:
- */
- if (hwc->idx != -1) {
- /* previous assignment */
- i = hwc->idx;
- } else if (k != -1) {
- /* start from free slot found */
- i = k;
- } else {
- /*
- * event not found, no slot found in
- * first pass, try again from the
- * beginning
- */
- i = 0;
- }
- j = i;
- do {
- old = cmpxchg(nb->owners+i, NULL, event);
- if (!old)
- break;
- if (++i == max)
- i = 0;
- } while (i != j);
-done:
- if (!old)
- return &nb->event_constraints[i];
-
- return &emptyconstraint;
-}
-
-static struct amd_nb *amd_alloc_nb(int cpu, int nb_id)
-{
- struct amd_nb *nb;
- int i;
-
- nb = kmalloc(sizeof(struct amd_nb), GFP_KERNEL);
- if (!nb)
- return NULL;
-
- memset(nb, 0, sizeof(*nb));
- nb->nb_id = nb_id;
-
- /*
- * initialize all possible NB constraints
- */
- for (i = 0; i < x86_pmu.num_counters; i++) {
- __set_bit(i, nb->event_constraints[i].idxmsk);
- nb->event_constraints[i].weight = 1;
- }
- return nb;
-}
-
-static int amd_pmu_cpu_prepare(int cpu)
-{
- struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
-
- WARN_ON_ONCE(cpuc->amd_nb);
-
- if (boot_cpu_data.x86_max_cores < 2)
- return NOTIFY_OK;
-
- cpuc->amd_nb = amd_alloc_nb(cpu, -1);
- if (!cpuc->amd_nb)
- return NOTIFY_BAD;
-
- return NOTIFY_OK;
-}
-
-static void amd_pmu_cpu_starting(int cpu)
-{
- struct cpu_hw_events *cpuc = &per_cpu(cpu_hw_events, cpu);
- struct amd_nb *nb;
- int i, nb_id;
-
- if (boot_cpu_data.x86_max_cores < 2)
- return;
-
- nb_id = amd_get_nb_id(cpu);
- WARN_ON_ONCE(nb_id == BAD_APICID);
-
- raw_spin_lock(&amd_nb_lock);
-
- for_each_online_cpu(i) {
- nb = per_cpu(cpu_hw_events, i).amd_nb;
- if (WARN_ON_ONCE(!nb))
- continue;
-
- if (nb->nb_id == nb_id) {
- kfree(cpuc->amd_nb);
- cpuc->amd_nb = nb;
- break;
- }
- }
-
- cpuc->amd_nb->nb_id = nb_id;
- cpuc->amd_nb->refcnt++;
-
- raw_spin_unlock(&amd_nb_lock);
-}
-
-static void amd_pmu_cpu_dead(int cpu)
-{
- struct cpu_hw_events *cpuhw;
-
- if (boot_cpu_data.x86_max_cores < 2)
- return;
-
- cpuhw = &per_cpu(cpu_hw_events, cpu);
-
- raw_spin_lock(&amd_nb_lock);
-
- if (cpuhw->amd_nb) {
- struct amd_nb *nb = cpuhw->amd_nb;
-
- if (nb->nb_id == -1 || --nb->refcnt == 0)
- kfree(nb);
-
- cpuhw->amd_nb = NULL;
- }
-
- raw_spin_unlock(&amd_nb_lock);
-}
-
-static __initconst const struct x86_pmu amd_pmu = {
- .name = "AMD",
- .handle_irq = x86_pmu_handle_irq,
- .disable_all = x86_pmu_disable_all,
- .enable_all = x86_pmu_enable_all,
- .enable = x86_pmu_enable_event,
- .disable = x86_pmu_disable_event,
- .hw_config = amd_pmu_hw_config,
- .schedule_events = x86_schedule_events,
- .eventsel = MSR_K7_EVNTSEL0,
- .perfctr = MSR_K7_PERFCTR0,
- .event_map = amd_pmu_event_map,
- .max_events = ARRAY_SIZE(amd_perfmon_event_map),
- .num_counters = 4,
- .cntval_bits = 48,
- .cntval_mask = (1ULL << 48) - 1,
- .apic = 1,
- /* use highest bit to detect overflow */
- .max_period = (1ULL << 47) - 1,
- .get_event_constraints = amd_get_event_constraints,
- .put_event_constraints = amd_put_event_constraints,
-
- .cpu_prepare = amd_pmu_cpu_prepare,
- .cpu_starting = amd_pmu_cpu_starting,
- .cpu_dead = amd_pmu_cpu_dead,
-};
-
-static __init int amd_pmu_init(void)
-{
- /* Performance-monitoring supported from K7 and later: */
- if (boot_cpu_data.x86 < 6)
- return -ENODEV;
-
- x86_pmu = amd_pmu;
-
- /* Events are common for all AMDs */
- memcpy(hw_cache_event_ids, amd_hw_cache_event_ids,
- sizeof(hw_cache_event_ids));
-
- return 0;
-}
-
-#else /* CONFIG_CPU_SUP_AMD */
-
-static int amd_pmu_init(void)
-{
- return 0;
-}
-
-#endif
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
deleted file mode 100644
index 214ac860ebe0..000000000000
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ /dev/null
@@ -1,1008 +0,0 @@
-#ifdef CONFIG_CPU_SUP_INTEL
-
-/*
- * Intel PerfMon, used on Core and later.
- */
-static const u64 intel_perfmon_event_map[] =
-{
- [PERF_COUNT_HW_CPU_CYCLES] = 0x003c,
- [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,
- [PERF_COUNT_HW_CACHE_REFERENCES] = 0x4f2e,
- [PERF_COUNT_HW_CACHE_MISSES] = 0x412e,
- [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c4,
- [PERF_COUNT_HW_BRANCH_MISSES] = 0x00c5,
- [PERF_COUNT_HW_BUS_CYCLES] = 0x013c,
-};
-
-static struct event_constraint intel_core_event_constraints[] =
-{
- INTEL_EVENT_CONSTRAINT(0x11, 0x2), /* FP_ASSIST */
- INTEL_EVENT_CONSTRAINT(0x12, 0x2), /* MUL */
- INTEL_EVENT_CONSTRAINT(0x13, 0x2), /* DIV */
- INTEL_EVENT_CONSTRAINT(0x14, 0x1), /* CYCLES_DIV_BUSY */
- INTEL_EVENT_CONSTRAINT(0x19, 0x2), /* DELAYED_BYPASS */
- INTEL_EVENT_CONSTRAINT(0xc1, 0x1), /* FP_COMP_INSTR_RET */
- EVENT_CONSTRAINT_END
-};
-
-static struct event_constraint intel_core2_event_constraints[] =
-{
- FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */
- FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
- /*
- * Core2 has Fixed Counter 2 listed as CPU_CLK_UNHALTED.REF and event
- * 0x013c as CPU_CLK_UNHALTED.BUS and specifies there is a fixed
- * ratio between these counters.
- */
- /* FIXED_EVENT_CONSTRAINT(0x013c, 2), CPU_CLK_UNHALTED.REF */
- INTEL_EVENT_CONSTRAINT(0x10, 0x1), /* FP_COMP_OPS_EXE */
- INTEL_EVENT_CONSTRAINT(0x11, 0x2), /* FP_ASSIST */
- INTEL_EVENT_CONSTRAINT(0x12, 0x2), /* MUL */
- INTEL_EVENT_CONSTRAINT(0x13, 0x2), /* DIV */
- INTEL_EVENT_CONSTRAINT(0x14, 0x1), /* CYCLES_DIV_BUSY */
- INTEL_EVENT_CONSTRAINT(0x18, 0x1), /* IDLE_DURING_DIV */
- INTEL_EVENT_CONSTRAINT(0x19, 0x2), /* DELAYED_BYPASS */
- INTEL_EVENT_CONSTRAINT(0xa1, 0x1), /* RS_UOPS_DISPATCH_CYCLES */
- INTEL_EVENT_CONSTRAINT(0xc9, 0x1), /* ITLB_MISS_RETIRED (T30-9) */
- INTEL_EVENT_CONSTRAINT(0xcb, 0x1), /* MEM_LOAD_RETIRED */
- EVENT_CONSTRAINT_END
-};
-
-static struct event_constraint intel_nehalem_event_constraints[] =
-{
- FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */
- FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
- /* FIXED_EVENT_CONSTRAINT(0x013c, 2), CPU_CLK_UNHALTED.REF */
- INTEL_EVENT_CONSTRAINT(0x40, 0x3), /* L1D_CACHE_LD */
- INTEL_EVENT_CONSTRAINT(0x41, 0x3), /* L1D_CACHE_ST */
- INTEL_EVENT_CONSTRAINT(0x42, 0x3), /* L1D_CACHE_LOCK */
- INTEL_EVENT_CONSTRAINT(0x43, 0x3), /* L1D_ALL_REF */
- INTEL_EVENT_CONSTRAINT(0x48, 0x3), /* L1D_PEND_MISS */
- INTEL_EVENT_CONSTRAINT(0x4e, 0x3), /* L1D_PREFETCH */
- INTEL_EVENT_CONSTRAINT(0x51, 0x3), /* L1D */
- INTEL_EVENT_CONSTRAINT(0x63, 0x3), /* CACHE_LOCK_CYCLES */
- EVENT_CONSTRAINT_END
-};
-
-static struct event_constraint intel_westmere_event_constraints[] =
-{
- FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */
- FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
- /* FIXED_EVENT_CONSTRAINT(0x013c, 2), CPU_CLK_UNHALTED.REF */
- INTEL_EVENT_CONSTRAINT(0x51, 0x3), /* L1D */
- INTEL_EVENT_CONSTRAINT(0x60, 0x1), /* OFFCORE_REQUESTS_OUTSTANDING */
- INTEL_EVENT_CONSTRAINT(0x63, 0x3), /* CACHE_LOCK_CYCLES */
- INTEL_EVENT_CONSTRAINT(0xb3, 0x1), /* SNOOPQ_REQUEST_OUTSTANDING */
- EVENT_CONSTRAINT_END
-};
-
-static struct event_constraint intel_gen_event_constraints[] =
-{
- FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */
- FIXED_EVENT_CONSTRAINT(0x003c, 1), /* CPU_CLK_UNHALTED.CORE */
- /* FIXED_EVENT_CONSTRAINT(0x013c, 2), CPU_CLK_UNHALTED.REF */
- EVENT_CONSTRAINT_END
-};
-
-static u64 intel_pmu_event_map(int hw_event)
-{
- return intel_perfmon_event_map[hw_event];
-}
-
-static __initconst const u64 westmere_hw_cache_event_ids
- [PERF_COUNT_HW_CACHE_MAX]
- [PERF_COUNT_HW_CACHE_OP_MAX]
- [PERF_COUNT_HW_CACHE_RESULT_MAX] =
-{
- [ C(L1D) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x010b, /* MEM_INST_RETIRED.LOADS */
- [ C(RESULT_MISS) ] = 0x0151, /* L1D.REPL */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x020b, /* MEM_INST_RETIRED.STORES */
- [ C(RESULT_MISS) ] = 0x0251, /* L1D.M_REPL */
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x014e, /* L1D_PREFETCH.REQUESTS */
- [ C(RESULT_MISS) ] = 0x024e, /* L1D_PREFETCH.MISS */
- },
- },
- [ C(L1I ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0380, /* L1I.READS */
- [ C(RESULT_MISS) ] = 0x0280, /* L1I.MISSES */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x0,
- [ C(RESULT_MISS) ] = 0x0,
- },
- },
- [ C(LL ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0324, /* L2_RQSTS.LOADS */
- [ C(RESULT_MISS) ] = 0x0224, /* L2_RQSTS.LD_MISS */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x0c24, /* L2_RQSTS.RFOS */
- [ C(RESULT_MISS) ] = 0x0824, /* L2_RQSTS.RFO_MISS */
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x4f2e, /* LLC Reference */
- [ C(RESULT_MISS) ] = 0x412e, /* LLC Misses */
- },
- },
- [ C(DTLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x010b, /* MEM_INST_RETIRED.LOADS */
- [ C(RESULT_MISS) ] = 0x0108, /* DTLB_LOAD_MISSES.ANY */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x020b, /* MEM_INST_RETIRED.STORES */
- [ C(RESULT_MISS) ] = 0x010c, /* MEM_STORE_RETIRED.DTLB_MISS */
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x0,
- [ C(RESULT_MISS) ] = 0x0,
- },
- },
- [ C(ITLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x01c0, /* INST_RETIRED.ANY_P */
- [ C(RESULT_MISS) ] = 0x0185, /* ITLB_MISSES.ANY */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- },
- [ C(BPU ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x00c4, /* BR_INST_RETIRED.ALL_BRANCHES */
- [ C(RESULT_MISS) ] = 0x03e8, /* BPU_CLEARS.ANY */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- },
-};
-
-static __initconst const u64 nehalem_hw_cache_event_ids
- [PERF_COUNT_HW_CACHE_MAX]
- [PERF_COUNT_HW_CACHE_OP_MAX]
- [PERF_COUNT_HW_CACHE_RESULT_MAX] =
-{
- [ C(L1D) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0f40, /* L1D_CACHE_LD.MESI */
- [ C(RESULT_MISS) ] = 0x0140, /* L1D_CACHE_LD.I_STATE */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x0f41, /* L1D_CACHE_ST.MESI */
- [ C(RESULT_MISS) ] = 0x0141, /* L1D_CACHE_ST.I_STATE */
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x014e, /* L1D_PREFETCH.REQUESTS */
- [ C(RESULT_MISS) ] = 0x024e, /* L1D_PREFETCH.MISS */
- },
- },
- [ C(L1I ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0380, /* L1I.READS */
- [ C(RESULT_MISS) ] = 0x0280, /* L1I.MISSES */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x0,
- [ C(RESULT_MISS) ] = 0x0,
- },
- },
- [ C(LL ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0324, /* L2_RQSTS.LOADS */
- [ C(RESULT_MISS) ] = 0x0224, /* L2_RQSTS.LD_MISS */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x0c24, /* L2_RQSTS.RFOS */
- [ C(RESULT_MISS) ] = 0x0824, /* L2_RQSTS.RFO_MISS */
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x4f2e, /* LLC Reference */
- [ C(RESULT_MISS) ] = 0x412e, /* LLC Misses */
- },
- },
- [ C(DTLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0f40, /* L1D_CACHE_LD.MESI (alias) */
- [ C(RESULT_MISS) ] = 0x0108, /* DTLB_LOAD_MISSES.ANY */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x0f41, /* L1D_CACHE_ST.MESI (alias) */
- [ C(RESULT_MISS) ] = 0x010c, /* MEM_STORE_RETIRED.DTLB_MISS */
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x0,
- [ C(RESULT_MISS) ] = 0x0,
- },
- },
- [ C(ITLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x01c0, /* INST_RETIRED.ANY_P */
- [ C(RESULT_MISS) ] = 0x20c8, /* ITLB_MISS_RETIRED */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- },
- [ C(BPU ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x00c4, /* BR_INST_RETIRED.ALL_BRANCHES */
- [ C(RESULT_MISS) ] = 0x03e8, /* BPU_CLEARS.ANY */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- },
-};
-
-static __initconst const u64 core2_hw_cache_event_ids
- [PERF_COUNT_HW_CACHE_MAX]
- [PERF_COUNT_HW_CACHE_OP_MAX]
- [PERF_COUNT_HW_CACHE_RESULT_MAX] =
-{
- [ C(L1D) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0f40, /* L1D_CACHE_LD.MESI */
- [ C(RESULT_MISS) ] = 0x0140, /* L1D_CACHE_LD.I_STATE */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x0f41, /* L1D_CACHE_ST.MESI */
- [ C(RESULT_MISS) ] = 0x0141, /* L1D_CACHE_ST.I_STATE */
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x104e, /* L1D_PREFETCH.REQUESTS */
- [ C(RESULT_MISS) ] = 0,
- },
- },
- [ C(L1I ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0080, /* L1I.READS */
- [ C(RESULT_MISS) ] = 0x0081, /* L1I.MISSES */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0,
- [ C(RESULT_MISS) ] = 0,
- },
- },
- [ C(LL ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x4f29, /* L2_LD.MESI */
- [ C(RESULT_MISS) ] = 0x4129, /* L2_LD.ISTATE */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x4f2A, /* L2_ST.MESI */
- [ C(RESULT_MISS) ] = 0x412A, /* L2_ST.ISTATE */
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0,
- [ C(RESULT_MISS) ] = 0,
- },
- },
- [ C(DTLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0f40, /* L1D_CACHE_LD.MESI (alias) */
- [ C(RESULT_MISS) ] = 0x0208, /* DTLB_MISSES.MISS_LD */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x0f41, /* L1D_CACHE_ST.MESI (alias) */
- [ C(RESULT_MISS) ] = 0x0808, /* DTLB_MISSES.MISS_ST */
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0,
- [ C(RESULT_MISS) ] = 0,
- },
- },
- [ C(ITLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x00c0, /* INST_RETIRED.ANY_P */
- [ C(RESULT_MISS) ] = 0x1282, /* ITLBMISSES */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- },
- [ C(BPU ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x00c4, /* BR_INST_RETIRED.ANY */
- [ C(RESULT_MISS) ] = 0x00c5, /* BP_INST_RETIRED.MISPRED */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- },
-};
-
-static __initconst const u64 atom_hw_cache_event_ids
- [PERF_COUNT_HW_CACHE_MAX]
- [PERF_COUNT_HW_CACHE_OP_MAX]
- [PERF_COUNT_HW_CACHE_RESULT_MAX] =
-{
- [ C(L1D) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x2140, /* L1D_CACHE.LD */
- [ C(RESULT_MISS) ] = 0,
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x2240, /* L1D_CACHE.ST */
- [ C(RESULT_MISS) ] = 0,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0x0,
- [ C(RESULT_MISS) ] = 0,
- },
- },
- [ C(L1I ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0380, /* L1I.READS */
- [ C(RESULT_MISS) ] = 0x0280, /* L1I.MISSES */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0,
- [ C(RESULT_MISS) ] = 0,
- },
- },
- [ C(LL ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x4f29, /* L2_LD.MESI */
- [ C(RESULT_MISS) ] = 0x4129, /* L2_LD.ISTATE */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x4f2A, /* L2_ST.MESI */
- [ C(RESULT_MISS) ] = 0x412A, /* L2_ST.ISTATE */
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0,
- [ C(RESULT_MISS) ] = 0,
- },
- },
- [ C(DTLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x2140, /* L1D_CACHE_LD.MESI (alias) */
- [ C(RESULT_MISS) ] = 0x0508, /* DTLB_MISSES.MISS_LD */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x2240, /* L1D_CACHE_ST.MESI (alias) */
- [ C(RESULT_MISS) ] = 0x0608, /* DTLB_MISSES.MISS_ST */
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = 0,
- [ C(RESULT_MISS) ] = 0,
- },
- },
- [ C(ITLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x00c0, /* INST_RETIRED.ANY_P */
- [ C(RESULT_MISS) ] = 0x0282, /* ITLB.MISSES */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- },
- [ C(BPU ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x00c4, /* BR_INST_RETIRED.ANY */
- [ C(RESULT_MISS) ] = 0x00c5, /* BP_INST_RETIRED.MISPRED */
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- },
-};
-
-static void intel_pmu_disable_all(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
-
- if (test_bit(X86_PMC_IDX_FIXED_BTS, cpuc->active_mask))
- intel_pmu_disable_bts();
-
- intel_pmu_pebs_disable_all();
- intel_pmu_lbr_disable_all();
-}
-
-static void intel_pmu_enable_all(int added)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- intel_pmu_pebs_enable_all();
- intel_pmu_lbr_enable_all();
- wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, x86_pmu.intel_ctrl);
-
- if (test_bit(X86_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
- struct perf_event *event =
- cpuc->events[X86_PMC_IDX_FIXED_BTS];
-
- if (WARN_ON_ONCE(!event))
- return;
-
- intel_pmu_enable_bts(event->hw.config);
- }
-}
-
-/*
- * Workaround for:
- * Intel Errata AAK100 (model 26)
- * Intel Errata AAP53 (model 30)
- * Intel Errata BD53 (model 44)
- *
- * These chips need to be 'reset' when adding counters by programming
- * the magic three (non-counting) events 0x4300D2, 0x4300B1 and 0x4300B5
- * either in sequence on the same PMC or on different PMCs.
- */
-static void intel_pmu_nhm_enable_all(int added)
-{
- if (added) {
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- int i;
-
- wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 0, 0x4300D2);
- wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 1, 0x4300B1);
- wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + 2, 0x4300B5);
-
- wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x3);
- wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0x0);
-
- for (i = 0; i < 3; i++) {
- struct perf_event *event = cpuc->events[i];
-
- if (!event)
- continue;
-
- __x86_pmu_enable_event(&event->hw,
- ARCH_PERFMON_EVENTSEL_ENABLE);
- }
- }
- intel_pmu_enable_all(added);
-}
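Annotation (not in the source): the reset dance performed above, step by step.

/*
 * 1. program the three magic event codes into EVENTSEL0..2
 * 2. pulse GLOBAL_CTRL: write 0x3 (counters 0-1 on), then 0x0
 * 3. re-enable whatever real events sit on counters 0-2 via
 *    __x86_pmu_enable_event()
 */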
-
-static inline u64 intel_pmu_get_status(void)
-{
- u64 status;
-
- rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
-
- return status;
-}
-
-static inline void intel_pmu_ack_status(u64 ack)
-{
- wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack);
-}
-
-static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
-{
- int idx = hwc->idx - X86_PMC_IDX_FIXED;
- u64 ctrl_val, mask;
-
- mask = 0xfULL << (idx * 4);
-
- rdmsrl(hwc->config_base, ctrl_val);
- ctrl_val &= ~mask;
- wrmsrl(hwc->config_base, ctrl_val);
-}
-
-static void intel_pmu_disable_event(struct perf_event *event)
-{
- struct hw_perf_event *hwc = &event->hw;
-
- if (unlikely(hwc->idx == X86_PMC_IDX_FIXED_BTS)) {
- intel_pmu_disable_bts();
- intel_pmu_drain_bts_buffer();
- return;
- }
-
- if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
- intel_pmu_disable_fixed(hwc);
- return;
- }
-
- x86_pmu_disable_event(event);
-
- if (unlikely(event->attr.precise_ip))
- intel_pmu_pebs_disable(event);
-}
-
-static void intel_pmu_enable_fixed(struct hw_perf_event *hwc)
-{
- int idx = hwc->idx - X86_PMC_IDX_FIXED;
- u64 ctrl_val, bits, mask;
-
- /*
- * Enable IRQ generation (0x8),
- * and enable ring-3 counting (0x2) and ring-0 counting (0x1)
- * if requested:
- */
- bits = 0x8ULL;
- if (hwc->config & ARCH_PERFMON_EVENTSEL_USR)
- bits |= 0x2;
- if (hwc->config & ARCH_PERFMON_EVENTSEL_OS)
- bits |= 0x1;
-
- /*
- * ANY bit is supported in v3 and up
- */
- if (x86_pmu.version > 2 && hwc->config & ARCH_PERFMON_EVENTSEL_ANY)
- bits |= 0x4;
-
- bits <<= (idx * 4);
- mask = 0xfULL << (idx * 4);
-
- rdmsrl(hwc->config_base, ctrl_val);
- ctrl_val &= ~mask;
- ctrl_val |= bits;
- wrmsrl(hwc->config_base, ctrl_val);
-}
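A sketch (annotation; the helper below is illustrative only, not part of the original file) of the 4-bit-per-counter layout in MSR_ARCH_PERFMON_FIXED_CTR_CTRL that intel_pmu_enable_fixed() and intel_pmu_disable_fixed() manipulate:

    /* Illustration only: build the control field for one fixed counter. */
    static u64 fixed_ctrl_field(int idx, bool usr, bool os, bool any)
    {
            u64 bits = 0x8ULL;              /* PMI on overflow */

            if (usr)
                    bits |= 0x2;            /* count in ring 3 */
            if (os)
                    bits |= 0x1;            /* count in ring 0 */
            if (any)
                    bits |= 0x4;            /* ANY-thread, v3 and up */

            return bits << (idx * 4);       /* idx 1, usr+os -> 0xb0 */
    }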
-
-static void intel_pmu_enable_event(struct perf_event *event)
-{
- struct hw_perf_event *hwc = &event->hw;
-
- if (unlikely(hwc->idx == X86_PMC_IDX_FIXED_BTS)) {
- if (!__get_cpu_var(cpu_hw_events).enabled)
- return;
-
- intel_pmu_enable_bts(hwc->config);
- return;
- }
-
- if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
- intel_pmu_enable_fixed(hwc);
- return;
- }
-
- if (unlikely(event->attr.precise_ip))
- intel_pmu_pebs_enable(event);
-
- __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
-}
-
-/*
- * Save and restart an expired event. Called from NMI context,
- * so it has to be careful about preempting normal event ops:
- */
-static int intel_pmu_save_and_restart(struct perf_event *event)
-{
- x86_perf_event_update(event);
- return x86_perf_event_set_period(event);
-}
-
-static void intel_pmu_reset(void)
-{
- struct debug_store *ds = __get_cpu_var(cpu_hw_events).ds;
- unsigned long flags;
- int idx;
-
- if (!x86_pmu.num_counters)
- return;
-
- local_irq_save(flags);
-
- printk("clearing PMU state on CPU#%d\n", smp_processor_id());
-
- for (idx = 0; idx < x86_pmu.num_counters; idx++) {
- checking_wrmsrl(x86_pmu.eventsel + idx, 0ull);
- checking_wrmsrl(x86_pmu.perfctr + idx, 0ull);
- }
- for (idx = 0; idx < x86_pmu.num_counters_fixed; idx++)
- checking_wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + idx, 0ull);
-
- if (ds)
- ds->bts_index = ds->bts_buffer_base;
-
- local_irq_restore(flags);
-}
-
-/*
- * This handler is triggered by the local APIC, so the APIC IRQ handling
- * rules apply:
- */
-static int intel_pmu_handle_irq(struct pt_regs *regs)
-{
- struct perf_sample_data data;
- struct cpu_hw_events *cpuc;
- int bit, loops;
- u64 ack, status;
-
- perf_sample_data_init(&data, 0);
-
- cpuc = &__get_cpu_var(cpu_hw_events);
-
- intel_pmu_disable_all();
- intel_pmu_drain_bts_buffer();
- status = intel_pmu_get_status();
- if (!status) {
- intel_pmu_enable_all(0);
- return 0;
- }
-
- loops = 0;
-again:
- if (++loops > 100) {
- WARN_ONCE(1, "perfevents: irq loop stuck!\n");
- perf_event_print_debug();
- intel_pmu_reset();
- goto done;
- }
-
- inc_irq_stat(apic_perf_irqs);
- ack = status;
-
- intel_pmu_lbr_read();
-
- /*
- * PEBS overflow sets bit 62 in the global status register
- */
- if (__test_and_clear_bit(62, (unsigned long *)&status))
- x86_pmu.drain_pebs(regs);
-
- for_each_set_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
- struct perf_event *event = cpuc->events[bit];
-
- if (!test_bit(bit, cpuc->active_mask))
- continue;
-
- if (!intel_pmu_save_and_restart(event))
- continue;
-
- data.period = event->hw.last_period;
-
- if (perf_event_overflow(event, 1, &data, regs))
- x86_pmu_stop(event);
- }
-
- intel_pmu_ack_status(ack);
-
- /*
- * Repeat if there is more work to be done:
- */
- status = intel_pmu_get_status();
- if (status)
- goto again;
-
-done:
- intel_pmu_enable_all(0);
- return 1;
-}
-
-static struct event_constraint *
-intel_bts_constraints(struct perf_event *event)
-{
- struct hw_perf_event *hwc = &event->hw;
- unsigned int hw_event, bts_event;
-
- hw_event = hwc->config & INTEL_ARCH_EVENT_MASK;
- bts_event = x86_pmu.event_map(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
-
- if (unlikely(hw_event == bts_event && hwc->sample_period == 1))
- return &bts_constraint;
-
- return NULL;
-}
-
-static struct event_constraint *
-intel_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
-{
- struct event_constraint *c;
-
- c = intel_bts_constraints(event);
- if (c)
- return c;
-
- c = intel_pebs_constraints(event);
- if (c)
- return c;
-
- return x86_get_event_constraints(cpuc, event);
-}
-
-static int intel_pmu_hw_config(struct perf_event *event)
-{
- int ret = x86_pmu_hw_config(event);
-
- if (ret)
- return ret;
-
- if (event->attr.type != PERF_TYPE_RAW)
- return 0;
-
- if (!(event->attr.config & ARCH_PERFMON_EVENTSEL_ANY))
- return 0;
-
- if (x86_pmu.version < 3)
- return -EINVAL;
-
- if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
- return -EACCES;
-
- event->hw.config |= ARCH_PERFMON_EVENTSEL_ANY;
-
- return 0;
-}
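Annotation: the checks above gate the ANY bit of the raw event encoding.

/*
 * ARCH_PERFMON_EVENTSEL_ANY is bit 21 of the event select, so a
 * raw config of 0x00003c passes straight through, while 0x20003c
 * (the same event with ANY set) is rejected with -EINVAL on a v2
 * PMU and needs CAP_SYS_ADMIN when perf_paranoid_cpu() applies.
 */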
-
-static __initconst const struct x86_pmu core_pmu = {
- .name = "core",
- .handle_irq = x86_pmu_handle_irq,
- .disable_all = x86_pmu_disable_all,
- .enable_all = x86_pmu_enable_all,
- .enable = x86_pmu_enable_event,
- .disable = x86_pmu_disable_event,
- .hw_config = x86_pmu_hw_config,
- .schedule_events = x86_schedule_events,
- .eventsel = MSR_ARCH_PERFMON_EVENTSEL0,
- .perfctr = MSR_ARCH_PERFMON_PERFCTR0,
- .event_map = intel_pmu_event_map,
- .max_events = ARRAY_SIZE(intel_perfmon_event_map),
- .apic = 1,
- /*
- * Intel PMCs cannot be accessed sanely above 32 bit width,
- * so we install an artificial 1<<31 period regardless of
- * the generic event period:
- */
- .max_period = (1ULL << 31) - 1,
- .get_event_constraints = intel_get_event_constraints,
- .event_constraints = intel_core_event_constraints,
-};
-
-static void intel_pmu_cpu_starting(int cpu)
-{
- init_debug_store_on_cpu(cpu);
- /*
- * Deal with CPUs that don't clear their LBRs on power-up.
- */
- intel_pmu_lbr_reset();
-}
-
-static void intel_pmu_cpu_dying(int cpu)
-{
- fini_debug_store_on_cpu(cpu);
-}
-
-static __initconst const struct x86_pmu intel_pmu = {
- .name = "Intel",
- .handle_irq = intel_pmu_handle_irq,
- .disable_all = intel_pmu_disable_all,
- .enable_all = intel_pmu_enable_all,
- .enable = intel_pmu_enable_event,
- .disable = intel_pmu_disable_event,
- .hw_config = intel_pmu_hw_config,
- .schedule_events = x86_schedule_events,
- .eventsel = MSR_ARCH_PERFMON_EVENTSEL0,
- .perfctr = MSR_ARCH_PERFMON_PERFCTR0,
- .event_map = intel_pmu_event_map,
- .max_events = ARRAY_SIZE(intel_perfmon_event_map),
- .apic = 1,
- /*
- * Intel PMCs cannot be accessed sanely above 32 bit width,
- * so we install an artificial 1<<31 period regardless of
- * the generic event period:
- */
- .max_period = (1ULL << 31) - 1,
- .get_event_constraints = intel_get_event_constraints,
-
- .cpu_starting = intel_pmu_cpu_starting,
- .cpu_dying = intel_pmu_cpu_dying,
-};
-
-static void intel_clovertown_quirks(void)
-{
- /*
- * PEBS is unreliable due to:
- *
- * AJ67 - PEBS may experience CPL leaks
- * AJ68 - PEBS PMI may be delayed by one event
- * AJ69 - GLOBAL_STATUS[62] will only be set when DEBUGCTL[12] is set
- * AJ106 - FREEZE_LBRS_ON_PMI doesn't work in combination with PEBS
- *
- * AJ67 could be worked around by restricting the OS/USR flags.
- * AJ69 could be worked around by setting PMU_FREEZE_ON_PMI.
- *
- * AJ106 could possibly be worked around by not allowing LBR
- * usage from PEBS, including the fixup.
- * AJ68 could possibly be worked around by always programming
- * a pebs_event_reset[0] value and coping with the lost events.
- *
- * But taken together it might just make sense to not enable PEBS on
- * these chips.
- */
- printk(KERN_WARNING "PEBS disabled due to CPU errata.\n");
- x86_pmu.pebs = 0;
- x86_pmu.pebs_constraints = NULL;
-}
-
-static __init int intel_pmu_init(void)
-{
- union cpuid10_edx edx;
- union cpuid10_eax eax;
- unsigned int unused;
- unsigned int ebx;
- int version;
-
- if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
- switch (boot_cpu_data.x86) {
- case 0x6:
- return p6_pmu_init();
- case 0xf:
- return p4_pmu_init();
- }
- return -ENODEV;
- }
-
- /*
- * Check whether the Architectural PerfMon supports
- * Branch Misses Retired hw_event or not.
- */
- cpuid(10, &eax.full, &ebx, &unused, &edx.full);
- if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED)
- return -ENODEV;
-
- version = eax.split.version_id;
- if (version < 2)
- x86_pmu = core_pmu;
- else
- x86_pmu = intel_pmu;
-
- x86_pmu.version = version;
- x86_pmu.num_counters = eax.split.num_counters;
- x86_pmu.cntval_bits = eax.split.bit_width;
- x86_pmu.cntval_mask = (1ULL << eax.split.bit_width) - 1;
-
- /*
- * Quirk: v2 perfmon does not report fixed-purpose events, so
- * assume at least 3 events:
- */
- if (version > 1)
- x86_pmu.num_counters_fixed = max((int)edx.split.num_counters_fixed, 3);
-
- /*
- * v2 and above have a perf capabilities MSR
- */
- if (version > 1) {
- u64 capabilities;
-
- rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
- x86_pmu.intel_cap.capabilities = capabilities;
- }
-
- intel_ds_init();
-
- /*
- * Install the hw-cache-events table:
- */
- switch (boot_cpu_data.x86_model) {
- case 14: /* 65 nm core solo/duo, "Yonah" */
- pr_cont("Core events, ");
- break;
-
- case 15: /* original 65 nm celeron/pentium/core2/xeon, "Merom"/"Conroe" */
- x86_pmu.quirks = intel_clovertown_quirks;
- case 22: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */
- case 23: /* current 45 nm celeron/core2/xeon "Penryn"/"Wolfdale" */
- case 29: /* six-core 45 nm xeon "Dunnington" */
- memcpy(hw_cache_event_ids, core2_hw_cache_event_ids,
- sizeof(hw_cache_event_ids));
-
- intel_pmu_lbr_init_core();
-
- x86_pmu.event_constraints = intel_core2_event_constraints;
- pr_cont("Core2 events, ");
- break;
-
- case 26: /* 45 nm nehalem, "Bloomfield" */
- case 30: /* 45 nm nehalem, "Lynnfield" */
- case 46: /* 45 nm nehalem-ex, "Beckton" */
- memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
- sizeof(hw_cache_event_ids));
-
- intel_pmu_lbr_init_nhm();
-
- x86_pmu.event_constraints = intel_nehalem_event_constraints;
- x86_pmu.enable_all = intel_pmu_nhm_enable_all;
- pr_cont("Nehalem events, ");
- break;
-
- case 28: /* Atom */
- memcpy(hw_cache_event_ids, atom_hw_cache_event_ids,
- sizeof(hw_cache_event_ids));
-
- intel_pmu_lbr_init_atom();
-
- x86_pmu.event_constraints = intel_gen_event_constraints;
- pr_cont("Atom events, ");
- break;
-
- case 37: /* 32 nm nehalem, "Clarkdale" */
- case 44: /* 32 nm nehalem, "Gulftown" */
- memcpy(hw_cache_event_ids, westmere_hw_cache_event_ids,
- sizeof(hw_cache_event_ids));
-
- intel_pmu_lbr_init_nhm();
-
- x86_pmu.event_constraints = intel_westmere_event_constraints;
- x86_pmu.enable_all = intel_pmu_nhm_enable_all;
- pr_cont("Westmere events, ");
- break;
-
- default:
- /*
- * default constraints for v2 and up
- */
- x86_pmu.event_constraints = intel_gen_event_constraints;
- pr_cont("generic architected perfmon, ");
- }
- return 0;
-}
-
-#else /* CONFIG_CPU_SUP_INTEL */
-
-static int intel_pmu_init(void)
-{
- return 0;
-}
-
-#endif /* CONFIG_CPU_SUP_INTEL */
diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
deleted file mode 100644
index 18018d1311cd..000000000000
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ /dev/null
@@ -1,641 +0,0 @@
-#ifdef CONFIG_CPU_SUP_INTEL
-
-/* The maximal number of PEBS events: */
-#define MAX_PEBS_EVENTS 4
-
-/* The size of a BTS record in bytes: */
-#define BTS_RECORD_SIZE 24
-
-#define BTS_BUFFER_SIZE (PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE PAGE_SIZE
-
-/*
- * pebs_record_32 for p4 and core not supported
-
-struct pebs_record_32 {
- u32 flags, ip;
- u32 ax, bc, cx, dx;
- u32 si, di, bp, sp;
-};
-
- */
-
-struct pebs_record_core {
- u64 flags, ip;
- u64 ax, bx, cx, dx;
- u64 si, di, bp, sp;
- u64 r8, r9, r10, r11;
- u64 r12, r13, r14, r15;
-};
-
-struct pebs_record_nhm {
- u64 flags, ip;
- u64 ax, bx, cx, dx;
- u64 si, di, bp, sp;
- u64 r8, r9, r10, r11;
- u64 r12, r13, r14, r15;
- u64 status, dla, dse, lat;
-};
-
-/*
- * A debug store configuration.
- *
- * We only support architectures that use 64bit fields.
- */
-struct debug_store {
- u64 bts_buffer_base;
- u64 bts_index;
- u64 bts_absolute_maximum;
- u64 bts_interrupt_threshold;
- u64 pebs_buffer_base;
- u64 pebs_index;
- u64 pebs_absolute_maximum;
- u64 pebs_interrupt_threshold;
- u64 pebs_event_reset[MAX_PEBS_EVENTS];
-};
-
-static void init_debug_store_on_cpu(int cpu)
-{
- struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
-
- if (!ds)
- return;
-
- wrmsr_on_cpu(cpu, MSR_IA32_DS_AREA,
- (u32)((u64)(unsigned long)ds),
- (u32)((u64)(unsigned long)ds >> 32));
-}
-
-static void fini_debug_store_on_cpu(int cpu)
-{
- if (!per_cpu(cpu_hw_events, cpu).ds)
- return;
-
- wrmsr_on_cpu(cpu, MSR_IA32_DS_AREA, 0, 0);
-}
-
-static void release_ds_buffers(void)
-{
- int cpu;
-
- if (!x86_pmu.bts && !x86_pmu.pebs)
- return;
-
- get_online_cpus();
-
- for_each_online_cpu(cpu)
- fini_debug_store_on_cpu(cpu);
-
- for_each_possible_cpu(cpu) {
- struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
-
- if (!ds)
- continue;
-
- per_cpu(cpu_hw_events, cpu).ds = NULL;
-
- kfree((void *)(unsigned long)ds->pebs_buffer_base);
- kfree((void *)(unsigned long)ds->bts_buffer_base);
- kfree(ds);
- }
-
- put_online_cpus();
-}
-
-static int reserve_ds_buffers(void)
-{
- int cpu, err = 0;
-
- if (!x86_pmu.bts && !x86_pmu.pebs)
- return 0;
-
- get_online_cpus();
-
- for_each_possible_cpu(cpu) {
- struct debug_store *ds;
- void *buffer;
- int max, thresh;
-
- err = -ENOMEM;
- ds = kzalloc(sizeof(*ds), GFP_KERNEL);
- if (unlikely(!ds))
- break;
- per_cpu(cpu_hw_events, cpu).ds = ds;
-
- if (x86_pmu.bts) {
- buffer = kzalloc(BTS_BUFFER_SIZE, GFP_KERNEL);
- if (unlikely(!buffer))
- break;
-
- max = BTS_BUFFER_SIZE / BTS_RECORD_SIZE;
- thresh = max / 16;
-
- ds->bts_buffer_base = (u64)(unsigned long)buffer;
- ds->bts_index = ds->bts_buffer_base;
- ds->bts_absolute_maximum = ds->bts_buffer_base +
- max * BTS_RECORD_SIZE;
- ds->bts_interrupt_threshold = ds->bts_absolute_maximum -
- thresh * BTS_RECORD_SIZE;
- }
-
- if (x86_pmu.pebs) {
- buffer = kzalloc(PEBS_BUFFER_SIZE, GFP_KERNEL);
- if (unlikely(!buffer))
- break;
-
- max = PEBS_BUFFER_SIZE / x86_pmu.pebs_record_size;
-
- ds->pebs_buffer_base = (u64)(unsigned long)buffer;
- ds->pebs_index = ds->pebs_buffer_base;
- ds->pebs_absolute_maximum = ds->pebs_buffer_base +
- max * x86_pmu.pebs_record_size;
- /*
- * Always use single record PEBS
- */
- ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
- x86_pmu.pebs_record_size;
- }
-
- err = 0;
- }
-
- if (err)
- release_ds_buffers();
- else {
- for_each_online_cpu(cpu)
- init_debug_store_on_cpu(cpu);
- }
-
- put_online_cpus();
-
- return err;
-}
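Annotation: the buffer sizing above, worked through assuming PAGE_SIZE == 4096.

/*
 * BTS_BUFFER_SIZE = PAGE_SIZE << 4 = 65536 bytes
 * max    = 65536 / BTS_RECORD_SIZE(24) = 2730 records
 * thresh = 2730 / 16                   =  170 records
 *
 * so the interrupt threshold sits 170 records (~4 KiB) below the
 * absolute maximum, leaving headroom to drain from the PMI.
 */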
-
-/*
- * BTS
- */
-
-static struct event_constraint bts_constraint =
- EVENT_CONSTRAINT(0, 1ULL << X86_PMC_IDX_FIXED_BTS, 0);
-
-static void intel_pmu_enable_bts(u64 config)
-{
- unsigned long debugctlmsr;
-
- debugctlmsr = get_debugctlmsr();
-
- debugctlmsr |= DEBUGCTLMSR_TR;
- debugctlmsr |= DEBUGCTLMSR_BTS;
- debugctlmsr |= DEBUGCTLMSR_BTINT;
-
- if (!(config & ARCH_PERFMON_EVENTSEL_OS))
- debugctlmsr |= DEBUGCTLMSR_BTS_OFF_OS;
-
- if (!(config & ARCH_PERFMON_EVENTSEL_USR))
- debugctlmsr |= DEBUGCTLMSR_BTS_OFF_USR;
-
- update_debugctlmsr(debugctlmsr);
-}
-
-static void intel_pmu_disable_bts(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- unsigned long debugctlmsr;
-
- if (!cpuc->ds)
- return;
-
- debugctlmsr = get_debugctlmsr();
-
- debugctlmsr &=
- ~(DEBUGCTLMSR_TR | DEBUGCTLMSR_BTS | DEBUGCTLMSR_BTINT |
- DEBUGCTLMSR_BTS_OFF_OS | DEBUGCTLMSR_BTS_OFF_USR);
-
- update_debugctlmsr(debugctlmsr);
-}
-
-static void intel_pmu_drain_bts_buffer(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct debug_store *ds = cpuc->ds;
- struct bts_record {
- u64 from;
- u64 to;
- u64 flags;
- };
- struct perf_event *event = cpuc->events[X86_PMC_IDX_FIXED_BTS];
- struct bts_record *at, *top;
- struct perf_output_handle handle;
- struct perf_event_header header;
- struct perf_sample_data data;
- struct pt_regs regs;
-
- if (!event)
- return;
-
- if (!ds)
- return;
-
- at = (struct bts_record *)(unsigned long)ds->bts_buffer_base;
- top = (struct bts_record *)(unsigned long)ds->bts_index;
-
- if (top <= at)
- return;
-
- ds->bts_index = ds->bts_buffer_base;
-
- perf_sample_data_init(&data, 0);
- data.period = event->hw.last_period;
- regs.ip = 0;
-
- /*
- * Prepare a generic sample, i.e. fill in the invariant fields.
- * We will overwrite the from and to address before we output
- * the sample.
- */
- perf_prepare_sample(&header, &data, event, &regs);
-
- if (perf_output_begin(&handle, event, header.size * (top - at), 1, 1))
- return;
-
- for (; at < top; at++) {
- data.ip = at->from;
- data.addr = at->to;
-
- perf_output_sample(&handle, &header, &data, event);
- }
-
- perf_output_end(&handle);
-
- /* There's new data available. */
- event->hw.interrupts++;
- event->pending_kill = POLL_IN;
-}
-
-/*
- * PEBS
- */
-
-static struct event_constraint intel_core_pebs_events[] = {
- PEBS_EVENT_CONSTRAINT(0x00c0, 0x1), /* INSTR_RETIRED.ANY */
- PEBS_EVENT_CONSTRAINT(0xfec1, 0x1), /* X87_OPS_RETIRED.ANY */
- PEBS_EVENT_CONSTRAINT(0x00c5, 0x1), /* BR_INST_RETIRED.MISPRED */
- PEBS_EVENT_CONSTRAINT(0x1fc7, 0x1), /* SIMD_INST_RETIRED.ANY */
- PEBS_EVENT_CONSTRAINT(0x01cb, 0x1), /* MEM_LOAD_RETIRED.L1D_MISS */
- PEBS_EVENT_CONSTRAINT(0x02cb, 0x1), /* MEM_LOAD_RETIRED.L1D_LINE_MISS */
- PEBS_EVENT_CONSTRAINT(0x04cb, 0x1), /* MEM_LOAD_RETIRED.L2_MISS */
- PEBS_EVENT_CONSTRAINT(0x08cb, 0x1), /* MEM_LOAD_RETIRED.L2_LINE_MISS */
- PEBS_EVENT_CONSTRAINT(0x10cb, 0x1), /* MEM_LOAD_RETIRED.DTLB_MISS */
- EVENT_CONSTRAINT_END
-};
-
-static struct event_constraint intel_nehalem_pebs_events[] = {
- PEBS_EVENT_CONSTRAINT(0x00c0, 0xf), /* INSTR_RETIRED.ANY */
- PEBS_EVENT_CONSTRAINT(0xfec1, 0xf), /* X87_OPS_RETIRED.ANY */
- PEBS_EVENT_CONSTRAINT(0x00c5, 0xf), /* BR_INST_RETIRED.MISPRED */
- PEBS_EVENT_CONSTRAINT(0x1fc7, 0xf), /* SIMD_INST_RETIRED.ANY */
- PEBS_EVENT_CONSTRAINT(0x01cb, 0xf), /* MEM_LOAD_RETIRED.L1D_MISS */
- PEBS_EVENT_CONSTRAINT(0x02cb, 0xf), /* MEM_LOAD_RETIRED.L1D_LINE_MISS */
- PEBS_EVENT_CONSTRAINT(0x04cb, 0xf), /* MEM_LOAD_RETIRED.L2_MISS */
- PEBS_EVENT_CONSTRAINT(0x08cb, 0xf), /* MEM_LOAD_RETIRED.L2_LINE_MISS */
- PEBS_EVENT_CONSTRAINT(0x10cb, 0xf), /* MEM_LOAD_RETIRED.DTLB_MISS */
- EVENT_CONSTRAINT_END
-};
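Annotation: the second argument of PEBS_EVENT_CONSTRAINT() is the counter mask, which is the only difference between the two tables.

/*
 * e.g. PEBS_EVENT_CONSTRAINT(0x00c0, 0x1) pins INSTR_RETIRED.ANY
 * to PMC0 on Core, while the 0xf mask on Nehalem lets the
 * scheduler place it on any of PMC0-3 (MAX_PEBS_EVENTS == 4).
 */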
-
-static struct event_constraint *
-intel_pebs_constraints(struct perf_event *event)
-{
- struct event_constraint *c;
-
- if (!event->attr.precise_ip)
- return NULL;
-
- if (x86_pmu.pebs_constraints) {
- for_each_event_constraint(c, x86_pmu.pebs_constraints) {
- if ((event->hw.config & c->cmask) == c->code)
- return c;
- }
- }
-
- return &emptyconstraint;
-}
-
-static void intel_pmu_pebs_enable(struct perf_event *event)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct hw_perf_event *hwc = &event->hw;
-
- hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
-
- cpuc->pebs_enabled |= 1ULL << hwc->idx;
- WARN_ON_ONCE(cpuc->enabled);
-
- if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
- intel_pmu_lbr_enable(event);
-}
-
-static void intel_pmu_pebs_disable(struct perf_event *event)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct hw_perf_event *hwc = &event->hw;
-
- cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
- if (cpuc->enabled)
- wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
-
- hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
-
- if (x86_pmu.intel_cap.pebs_trap && event->attr.precise_ip > 1)
- intel_pmu_lbr_disable(event);
-}
-
-static void intel_pmu_pebs_enable_all(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- if (cpuc->pebs_enabled)
- wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
-}
-
-static void intel_pmu_pebs_disable_all(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- if (cpuc->pebs_enabled)
- wrmsrl(MSR_IA32_PEBS_ENABLE, 0);
-}
-
-#include <asm/insn.h>
-
-static inline bool kernel_ip(unsigned long ip)
-{
-#ifdef CONFIG_X86_32
- return ip > PAGE_OFFSET;
-#else
- return (long)ip < 0;
-#endif
-}
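Annotation: on 64-bit, kernel addresses occupy the upper canonical half, so the sign test suffices.

/*
 * kernel_ip(0xffffffff81000000) -> (long)ip < 0 -> true
 * kernel_ip(0x00007f1234560000) -> positive     -> false
 */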
-
-static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- unsigned long from = cpuc->lbr_entries[0].from;
- unsigned long old_to, to = cpuc->lbr_entries[0].to;
- unsigned long ip = regs->ip;
-
- /*
- * We don't need a fixup if the PEBS assist is fault-like
- */
- if (!x86_pmu.intel_cap.pebs_trap)
- return 1;
-
- /*
- * No LBR entry, no basic block, no rewinding
- */
- if (!cpuc->lbr_stack.nr || !from || !to)
- return 0;
-
- /*
- * Basic blocks should never cross user/kernel boundaries
- */
- if (kernel_ip(ip) != kernel_ip(to))
- return 0;
-
- /*
- * unsigned math, either ip is before the start (impossible) or
- * the basic block is larger than 1 page (sanity)
- */
- if ((ip - to) > PAGE_SIZE)
- return 0;
-
- /*
- * We sampled a branch insn, rewind using the LBR stack
- */
- if (ip == to) {
- regs->ip = from;
- return 1;
- }
-
- do {
- struct insn insn;
- u8 buf[MAX_INSN_SIZE];
- void *kaddr;
-
- old_to = to;
- if (!kernel_ip(ip)) {
- int bytes, size = MAX_INSN_SIZE;
-
- bytes = copy_from_user_nmi(buf, (void __user *)to, size);
- if (bytes != size)
- return 0;
-
- kaddr = buf;
- } else
- kaddr = (void *)to;
-
- kernel_insn_init(&insn, kaddr);
- insn_get_length(&insn);
- to += insn.length;
- } while (to < ip);
-
- if (to == ip) {
- regs->ip = old_to;
- return 1;
- }
-
- /*
- * Even though we decoded the basic block, the instruction stream
- * never matched the given IP, either the TO or the IP got corrupted.
- */
- return 0;
-}
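Annotation: a walkthrough of the rewind loop above, with made-up addresses and instruction lengths.

/*
 * LBR: last branch went to 0x1000; PEBS reports ip = 0x100a.
 * Decoding forward from to = 0x1000:
 *
 *   insn at 0x1000, length 5 -> to = 0x1005 (old_to = 0x1000)
 *   insn at 0x1005, length 5 -> to = 0x100a (old_to = 0x1005)
 *
 * to == ip, so regs->ip is rewound to old_to = 0x1005: the
 * instruction that actually caused the PEBS assist.
 */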
-
-static int intel_pmu_save_and_restart(struct perf_event *event);
-
-static void __intel_pmu_pebs_event(struct perf_event *event,
- struct pt_regs *iregs, void *__pebs)
-{
- /*
- * We cast to pebs_record_core since that is a subset of
- * both formats and we don't use the other fields in this
- * routine.
- */
- struct pebs_record_core *pebs = __pebs;
- struct perf_sample_data data;
- struct pt_regs regs;
-
- if (!intel_pmu_save_and_restart(event))
- return;
-
- perf_sample_data_init(&data, 0);
- data.period = event->hw.last_period;
-
- /*
- * We use the interrupt regs as a base because the PEBS record
- * does not contain a full regs set, specifically it seems to
- * lack segment descriptors, which get used by things like
- * user_mode().
- *
- * In the simple case fix up only the IP and BP,SP regs, for
- * PERF_SAMPLE_IP and PERF_SAMPLE_CALLCHAIN to function properly.
- * A possible PERF_SAMPLE_REGS will have to transfer all regs.
- */
- regs = *iregs;
- regs.ip = pebs->ip;
- regs.bp = pebs->bp;
- regs.sp = pebs->sp;
-
- if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(&regs))
- regs.flags |= PERF_EFLAGS_EXACT;
- else
- regs.flags &= ~PERF_EFLAGS_EXACT;
-
- if (perf_event_overflow(event, 1, &data, &regs))
- x86_pmu_stop(event);
-}
-
-static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct debug_store *ds = cpuc->ds;
- struct perf_event *event = cpuc->events[0]; /* PMC0 only */
- struct pebs_record_core *at, *top;
- int n;
-
- if (!ds || !x86_pmu.pebs)
- return;
-
- at = (struct pebs_record_core *)(unsigned long)ds->pebs_buffer_base;
- top = (struct pebs_record_core *)(unsigned long)ds->pebs_index;
-
- /*
- * Whatever else happens, drain the thing
- */
- ds->pebs_index = ds->pebs_buffer_base;
-
- if (!test_bit(0, cpuc->active_mask))
- return;
-
- WARN_ON_ONCE(!event);
-
- if (!event->attr.precise_ip)
- return;
-
- n = top - at;
- if (n <= 0)
- return;
-
- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ON_ONCE(n > 1);
- at += n - 1;
-
- __intel_pmu_pebs_event(event, iregs, at);
-}
-
-static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct debug_store *ds = cpuc->ds;
- struct pebs_record_nhm *at, *top;
- struct perf_event *event = NULL;
- u64 status = 0;
- int bit, n;
-
- if (!ds || !x86_pmu.pebs)
- return;
-
- at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base;
- top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index;
-
- ds->pebs_index = ds->pebs_buffer_base;
-
- n = top - at;
- if (n <= 0)
- return;
-
- /*
- * Should not happen, we program the threshold at 1 and do not
- * set a reset value.
- */
- WARN_ON_ONCE(n > MAX_PEBS_EVENTS);
-
- for ( ; at < top; at++) {
- for_each_set_bit(bit, (unsigned long *)&at->status, MAX_PEBS_EVENTS) {
- event = cpuc->events[bit];
- if (!test_bit(bit, cpuc->active_mask))
- continue;
-
- WARN_ON_ONCE(!event);
-
- if (!event->attr.precise_ip)
- continue;
-
- if (__test_and_set_bit(bit, (unsigned long *)&status))
- continue;
-
- break;
- }
-
- if (!event || bit >= MAX_PEBS_EVENTS)
- continue;
-
- __intel_pmu_pebs_event(event, iregs, at);
- }
-}
-
-/*
- * BTS, PEBS probe and setup
- */
-
-static void intel_ds_init(void)
-{
- /*
- * No support for 32bit formats
- */
- if (!boot_cpu_has(X86_FEATURE_DTES64))
- return;
-
- x86_pmu.bts = boot_cpu_has(X86_FEATURE_BTS);
- x86_pmu.pebs = boot_cpu_has(X86_FEATURE_PEBS);
- if (x86_pmu.pebs) {
- char pebs_type = x86_pmu.intel_cap.pebs_trap ? '+' : '-';
- int format = x86_pmu.intel_cap.pebs_format;
-
- switch (format) {
- case 0:
- printk(KERN_CONT "PEBS fmt0%c, ", pebs_type);
- x86_pmu.pebs_record_size = sizeof(struct pebs_record_core);
- x86_pmu.drain_pebs = intel_pmu_drain_pebs_core;
- x86_pmu.pebs_constraints = intel_core_pebs_events;
- break;
-
- case 1:
- printk(KERN_CONT "PEBS fmt1%c, ", pebs_type);
- x86_pmu.pebs_record_size = sizeof(struct pebs_record_nhm);
- x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm;
- x86_pmu.pebs_constraints = intel_nehalem_pebs_events;
- break;
-
- default:
- printk(KERN_CONT "no PEBS fmt%d%c, ", format, pebs_type);
- x86_pmu.pebs = 0;
- break;
- }
- }
-}
-
-#else /* CONFIG_CPU_SUP_INTEL */
-
-static int reserve_ds_buffers(void)
-{
- return 0;
-}
-
-static void release_ds_buffers(void)
-{
-}
-
-#endif /* CONFIG_CPU_SUP_INTEL */
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
deleted file mode 100644
index d202c1bece1a..000000000000
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ /dev/null
@@ -1,218 +0,0 @@
-#ifdef CONFIG_CPU_SUP_INTEL
-
-enum {
- LBR_FORMAT_32 = 0x00,
- LBR_FORMAT_LIP = 0x01,
- LBR_FORMAT_EIP = 0x02,
- LBR_FORMAT_EIP_FLAGS = 0x03,
-};
-
-/*
- * We only support LBR implementations that have FREEZE_LBRS_ON_PMI,
- * otherwise it becomes nearly impossible to get a reliable stack.
- */
-
-static void __intel_pmu_lbr_enable(void)
-{
- u64 debugctl;
-
- rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
- debugctl |= (DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
- wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
-}
-
-static void __intel_pmu_lbr_disable(void)
-{
- u64 debugctl;
-
- rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
- debugctl &= ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI);
- wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
-}
-
-static void intel_pmu_lbr_reset_32(void)
-{
- int i;
-
- for (i = 0; i < x86_pmu.lbr_nr; i++)
- wrmsrl(x86_pmu.lbr_from + i, 0);
-}
-
-static void intel_pmu_lbr_reset_64(void)
-{
- int i;
-
- for (i = 0; i < x86_pmu.lbr_nr; i++) {
- wrmsrl(x86_pmu.lbr_from + i, 0);
- wrmsrl(x86_pmu.lbr_to + i, 0);
- }
-}
-
-static void intel_pmu_lbr_reset(void)
-{
- if (!x86_pmu.lbr_nr)
- return;
-
- if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
- intel_pmu_lbr_reset_32();
- else
- intel_pmu_lbr_reset_64();
-}
-
-static void intel_pmu_lbr_enable(struct perf_event *event)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- if (!x86_pmu.lbr_nr)
- return;
-
- WARN_ON_ONCE(cpuc->enabled);
-
- /*
- * Reset the LBR stack if we changed task context to
- * avoid data leaks.
- */
-
- if (event->ctx->task && cpuc->lbr_context != event->ctx) {
- intel_pmu_lbr_reset();
- cpuc->lbr_context = event->ctx;
- }
-
- cpuc->lbr_users++;
-}
-
-static void intel_pmu_lbr_disable(struct perf_event *event)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- if (!x86_pmu.lbr_nr)
- return;
-
- cpuc->lbr_users--;
- WARN_ON_ONCE(cpuc->lbr_users < 0);
-
- if (cpuc->enabled && !cpuc->lbr_users)
- __intel_pmu_lbr_disable();
-}
-
-static void intel_pmu_lbr_enable_all(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- if (cpuc->lbr_users)
- __intel_pmu_lbr_enable();
-}
-
-static void intel_pmu_lbr_disable_all(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- if (cpuc->lbr_users)
- __intel_pmu_lbr_disable();
-}
-
-static inline u64 intel_pmu_lbr_tos(void)
-{
- u64 tos;
-
- rdmsrl(x86_pmu.lbr_tos, tos);
-
- return tos;
-}
-
-static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
-{
- unsigned long mask = x86_pmu.lbr_nr - 1;
- u64 tos = intel_pmu_lbr_tos();
- int i;
-
- for (i = 0; i < x86_pmu.lbr_nr; i++) {
- unsigned long lbr_idx = (tos - i) & mask;
- union {
- struct {
- u32 from;
- u32 to;
- };
- u64 lbr;
- } msr_lastbranch;
-
- rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
-
- cpuc->lbr_entries[i].from = msr_lastbranch.from;
- cpuc->lbr_entries[i].to = msr_lastbranch.to;
- cpuc->lbr_entries[i].flags = 0;
- }
- cpuc->lbr_stack.nr = i;
-}
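Annotation: the (tos - i) & mask walk relies on lbr_nr being a power of two.

/*
 * with lbr_nr = 4 (mask = 3) and tos = 1 the ring is read in the
 * order 1, 0, 3, 2 -- most recent branch first, wrapping around.
 */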
-
-#define LBR_FROM_FLAG_MISPRED (1ULL << 63)
-
-/*
- * Due to lack of segmentation in Linux the effective address (offset)
- * is the same as the linear address, allowing us to merge the LIP and EIP
- * LBR formats.
- */
-static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
-{
- unsigned long mask = x86_pmu.lbr_nr - 1;
- int lbr_format = x86_pmu.intel_cap.lbr_format;
- u64 tos = intel_pmu_lbr_tos();
- int i;
-
- for (i = 0; i < x86_pmu.lbr_nr; i++) {
- unsigned long lbr_idx = (tos - i) & mask;
- u64 from, to, flags = 0;
-
- rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
- rdmsrl(x86_pmu.lbr_to + lbr_idx, to);
-
- if (lbr_format == LBR_FORMAT_EIP_FLAGS) {
- flags = !!(from & LBR_FROM_FLAG_MISPRED);
- from = (u64)((((s64)from) << 1) >> 1);
- }
-
- cpuc->lbr_entries[i].from = from;
- cpuc->lbr_entries[i].to = to;
- cpuc->lbr_entries[i].flags = flags;
- }
- cpuc->lbr_stack.nr = i;
-}
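Annotation: the shift pair in the EIP_FLAGS case strips the flag while keeping the address canonical.

/*
 * bit 63 carries the MISPRED flag, bit 62 the address sign:
 *
 *   0xffff880000001000 (flag set) and
 *   0x7fff880000001000 (flag clear)
 *
 * both run through ((s64)from << 1) >> 1, which drops bit 63 and
 * sign-extends bit 62, yielding the canonical 0xffff880000001000.
 */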
-
-static void intel_pmu_lbr_read(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
- if (!cpuc->lbr_users)
- return;
-
- if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
- intel_pmu_lbr_read_32(cpuc);
- else
- intel_pmu_lbr_read_64(cpuc);
-}
-
-static void intel_pmu_lbr_init_core(void)
-{
- x86_pmu.lbr_nr = 4;
- x86_pmu.lbr_tos = 0x01c9;
- x86_pmu.lbr_from = 0x40;
- x86_pmu.lbr_to = 0x60;
-}
-
-static void intel_pmu_lbr_init_nhm(void)
-{
- x86_pmu.lbr_nr = 16;
- x86_pmu.lbr_tos = 0x01c9;
- x86_pmu.lbr_from = 0x680;
- x86_pmu.lbr_to = 0x6c0;
-}
-
-static void intel_pmu_lbr_init_atom(void)
-{
- x86_pmu.lbr_nr = 8;
- x86_pmu.lbr_tos = 0x01c9;
- x86_pmu.lbr_from = 0x40;
- x86_pmu.lbr_to = 0x60;
-}
-
-#endif /* CONFIG_CPU_SUP_INTEL */
diff --git a/arch/x86/kernel/cpu/perf_event_p4.c b/arch/x86/kernel/cpu/perf_event_p4.c
deleted file mode 100644
index ae85d69644d1..000000000000
--- a/arch/x86/kernel/cpu/perf_event_p4.c
+++ /dev/null
@@ -1,858 +0,0 @@
-/*
- * Netburst Performance Events (P4, old Xeon)
- *
- * Copyright (C) 2010 Parallels, Inc., Cyrill Gorcunov <gorcunov@openvz.org>
- * Copyright (C) 2010 Intel Corporation, Lin Ming <ming.m.lin@intel.com>
- *
- * For licensing details see kernel-base/COPYING
- */
-
-#ifdef CONFIG_CPU_SUP_INTEL
-
-#include <asm/perf_event_p4.h>
-
-#define P4_CNTR_LIMIT 3
-/*
- * array indices: 0,1 - HT threads, used with an HT-enabled cpu
- */
-struct p4_event_bind {
- unsigned int opcode; /* Event code and ESCR selector */
- unsigned int escr_msr[2]; /* ESCR MSR for this event */
- char cntr[2][P4_CNTR_LIMIT]; /* counter index (offset), -1 on absence */
-};
-
-struct p4_cache_event_bind {
- unsigned int metric_pebs;
- unsigned int metric_vert;
-};
-
-#define P4_GEN_CACHE_EVENT_BIND(name) \
- [P4_CACHE__##name] = { \
- .metric_pebs = P4_PEBS__##name, \
- .metric_vert = P4_VERT__##name, \
- }
-
-static struct p4_cache_event_bind p4_cache_event_bind_map[] = {
- P4_GEN_CACHE_EVENT_BIND(1stl_cache_load_miss_retired),
- P4_GEN_CACHE_EVENT_BIND(2ndl_cache_load_miss_retired),
- P4_GEN_CACHE_EVENT_BIND(dtlb_load_miss_retired),
- P4_GEN_CACHE_EVENT_BIND(dtlb_store_miss_retired),
-};
-
-/*
- * Note that we don't use CCCR1 here; there is an
- * exception for P4_BSQ_ALLOCATION, but we simply have
- * no workaround for it
- *
- * consider this binding as the resources a particular
- * event may borrow; it doesn't contain the EventMask,
- * Tags and friends -- they are left to the caller
- */
-static struct p4_event_bind p4_event_bind_map[] = {
- [P4_EVENT_TC_DELIVER_MODE] = {
- .opcode = P4_OPCODE(P4_EVENT_TC_DELIVER_MODE),
- .escr_msr = { MSR_P4_TC_ESCR0, MSR_P4_TC_ESCR1 },
- .cntr = { {4, 5, -1}, {6, 7, -1} },
- },
- [P4_EVENT_BPU_FETCH_REQUEST] = {
- .opcode = P4_OPCODE(P4_EVENT_BPU_FETCH_REQUEST),
- .escr_msr = { MSR_P4_BPU_ESCR0, MSR_P4_BPU_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_ITLB_REFERENCE] = {
- .opcode = P4_OPCODE(P4_EVENT_ITLB_REFERENCE),
- .escr_msr = { MSR_P4_ITLB_ESCR0, MSR_P4_ITLB_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_MEMORY_CANCEL] = {
- .opcode = P4_OPCODE(P4_EVENT_MEMORY_CANCEL),
- .escr_msr = { MSR_P4_DAC_ESCR0, MSR_P4_DAC_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_MEMORY_COMPLETE] = {
- .opcode = P4_OPCODE(P4_EVENT_MEMORY_COMPLETE),
- .escr_msr = { MSR_P4_SAAT_ESCR0 , MSR_P4_SAAT_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_LOAD_PORT_REPLAY] = {
- .opcode = P4_OPCODE(P4_EVENT_LOAD_PORT_REPLAY),
- .escr_msr = { MSR_P4_SAAT_ESCR0, MSR_P4_SAAT_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_STORE_PORT_REPLAY] = {
- .opcode = P4_OPCODE(P4_EVENT_STORE_PORT_REPLAY),
- .escr_msr = { MSR_P4_SAAT_ESCR0 , MSR_P4_SAAT_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_MOB_LOAD_REPLAY] = {
- .opcode = P4_OPCODE(P4_EVENT_MOB_LOAD_REPLAY),
- .escr_msr = { MSR_P4_MOB_ESCR0, MSR_P4_MOB_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_PAGE_WALK_TYPE] = {
- .opcode = P4_OPCODE(P4_EVENT_PAGE_WALK_TYPE),
- .escr_msr = { MSR_P4_PMH_ESCR0, MSR_P4_PMH_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_BSQ_CACHE_REFERENCE] = {
- .opcode = P4_OPCODE(P4_EVENT_BSQ_CACHE_REFERENCE),
- .escr_msr = { MSR_P4_BSU_ESCR0, MSR_P4_BSU_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_IOQ_ALLOCATION] = {
- .opcode = P4_OPCODE(P4_EVENT_IOQ_ALLOCATION),
- .escr_msr = { MSR_P4_FSB_ESCR0, MSR_P4_FSB_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_IOQ_ACTIVE_ENTRIES] = { /* shared ESCR */
- .opcode = P4_OPCODE(P4_EVENT_IOQ_ACTIVE_ENTRIES),
- .escr_msr = { MSR_P4_FSB_ESCR1, MSR_P4_FSB_ESCR1 },
- .cntr = { {2, -1, -1}, {3, -1, -1} },
- },
- [P4_EVENT_FSB_DATA_ACTIVITY] = {
- .opcode = P4_OPCODE(P4_EVENT_FSB_DATA_ACTIVITY),
- .escr_msr = { MSR_P4_FSB_ESCR0, MSR_P4_FSB_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_BSQ_ALLOCATION] = { /* shared ESCR, broken CCCR1 */
- .opcode = P4_OPCODE(P4_EVENT_BSQ_ALLOCATION),
- .escr_msr = { MSR_P4_BSU_ESCR0, MSR_P4_BSU_ESCR0 },
- .cntr = { {0, -1, -1}, {1, -1, -1} },
- },
- [P4_EVENT_BSQ_ACTIVE_ENTRIES] = { /* shared ESCR */
- .opcode = P4_OPCODE(P4_EVENT_BSQ_ACTIVE_ENTRIES),
- .escr_msr = { MSR_P4_BSU_ESCR1 , MSR_P4_BSU_ESCR1 },
- .cntr = { {2, -1, -1}, {3, -1, -1} },
- },
- [P4_EVENT_SSE_INPUT_ASSIST] = {
- .opcode = P4_OPCODE(P4_EVENT_SSE_INPUT_ASSIST),
- .escr_msr = { MSR_P4_FIRM_ESCR0, MSR_P4_FIRM_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_PACKED_SP_UOP] = {
- .opcode = P4_OPCODE(P4_EVENT_PACKED_SP_UOP),
- .escr_msr = { MSR_P4_FIRM_ESCR0, MSR_P4_FIRM_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_PACKED_DP_UOP] = {
- .opcode = P4_OPCODE(P4_EVENT_PACKED_DP_UOP),
- .escr_msr = { MSR_P4_FIRM_ESCR0, MSR_P4_FIRM_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_SCALAR_SP_UOP] = {
- .opcode = P4_OPCODE(P4_EVENT_SCALAR_SP_UOP),
- .escr_msr = { MSR_P4_FIRM_ESCR0, MSR_P4_FIRM_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_SCALAR_DP_UOP] = {
- .opcode = P4_OPCODE(P4_EVENT_SCALAR_DP_UOP),
- .escr_msr = { MSR_P4_FIRM_ESCR0, MSR_P4_FIRM_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_64BIT_MMX_UOP] = {
- .opcode = P4_OPCODE(P4_EVENT_64BIT_MMX_UOP),
- .escr_msr = { MSR_P4_FIRM_ESCR0, MSR_P4_FIRM_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_128BIT_MMX_UOP] = {
- .opcode = P4_OPCODE(P4_EVENT_128BIT_MMX_UOP),
- .escr_msr = { MSR_P4_FIRM_ESCR0, MSR_P4_FIRM_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_X87_FP_UOP] = {
- .opcode = P4_OPCODE(P4_EVENT_X87_FP_UOP),
- .escr_msr = { MSR_P4_FIRM_ESCR0, MSR_P4_FIRM_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_TC_MISC] = {
- .opcode = P4_OPCODE(P4_EVENT_TC_MISC),
- .escr_msr = { MSR_P4_TC_ESCR0, MSR_P4_TC_ESCR1 },
- .cntr = { {4, 5, -1}, {6, 7, -1} },
- },
- [P4_EVENT_GLOBAL_POWER_EVENTS] = {
- .opcode = P4_OPCODE(P4_EVENT_GLOBAL_POWER_EVENTS),
- .escr_msr = { MSR_P4_FSB_ESCR0, MSR_P4_FSB_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_TC_MS_XFER] = {
- .opcode = P4_OPCODE(P4_EVENT_TC_MS_XFER),
- .escr_msr = { MSR_P4_MS_ESCR0, MSR_P4_MS_ESCR1 },
- .cntr = { {4, 5, -1}, {6, 7, -1} },
- },
- [P4_EVENT_UOP_QUEUE_WRITES] = {
- .opcode = P4_OPCODE(P4_EVENT_UOP_QUEUE_WRITES),
- .escr_msr = { MSR_P4_MS_ESCR0, MSR_P4_MS_ESCR1 },
- .cntr = { {4, 5, -1}, {6, 7, -1} },
- },
- [P4_EVENT_RETIRED_MISPRED_BRANCH_TYPE] = {
- .opcode = P4_OPCODE(P4_EVENT_RETIRED_MISPRED_BRANCH_TYPE),
- .escr_msr = { MSR_P4_TBPU_ESCR0 , MSR_P4_TBPU_ESCR0 },
- .cntr = { {4, 5, -1}, {6, 7, -1} },
- },
- [P4_EVENT_RETIRED_BRANCH_TYPE] = {
- .opcode = P4_OPCODE(P4_EVENT_RETIRED_BRANCH_TYPE),
- .escr_msr = { MSR_P4_TBPU_ESCR0 , MSR_P4_TBPU_ESCR1 },
- .cntr = { {4, 5, -1}, {6, 7, -1} },
- },
- [P4_EVENT_RESOURCE_STALL] = {
- .opcode = P4_OPCODE(P4_EVENT_RESOURCE_STALL),
- .escr_msr = { MSR_P4_ALF_ESCR0, MSR_P4_ALF_ESCR1 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
- [P4_EVENT_WC_BUFFER] = {
- .opcode = P4_OPCODE(P4_EVENT_WC_BUFFER),
- .escr_msr = { MSR_P4_DAC_ESCR0, MSR_P4_DAC_ESCR1 },
- .cntr = { {8, 9, -1}, {10, 11, -1} },
- },
- [P4_EVENT_B2B_CYCLES] = {
- .opcode = P4_OPCODE(P4_EVENT_B2B_CYCLES),
- .escr_msr = { MSR_P4_FSB_ESCR0, MSR_P4_FSB_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_BNR] = {
- .opcode = P4_OPCODE(P4_EVENT_BNR),
- .escr_msr = { MSR_P4_FSB_ESCR0, MSR_P4_FSB_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_SNOOP] = {
- .opcode = P4_OPCODE(P4_EVENT_SNOOP),
- .escr_msr = { MSR_P4_FSB_ESCR0, MSR_P4_FSB_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_RESPONSE] = {
- .opcode = P4_OPCODE(P4_EVENT_RESPONSE),
- .escr_msr = { MSR_P4_FSB_ESCR0, MSR_P4_FSB_ESCR1 },
- .cntr = { {0, -1, -1}, {2, -1, -1} },
- },
- [P4_EVENT_FRONT_END_EVENT] = {
- .opcode = P4_OPCODE(P4_EVENT_FRONT_END_EVENT),
- .escr_msr = { MSR_P4_CRU_ESCR2, MSR_P4_CRU_ESCR3 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
- [P4_EVENT_EXECUTION_EVENT] = {
- .opcode = P4_OPCODE(P4_EVENT_EXECUTION_EVENT),
- .escr_msr = { MSR_P4_CRU_ESCR2, MSR_P4_CRU_ESCR3 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
- [P4_EVENT_REPLAY_EVENT] = {
- .opcode = P4_OPCODE(P4_EVENT_REPLAY_EVENT),
- .escr_msr = { MSR_P4_CRU_ESCR2, MSR_P4_CRU_ESCR3 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
- [P4_EVENT_INSTR_RETIRED] = {
- .opcode = P4_OPCODE(P4_EVENT_INSTR_RETIRED),
- .escr_msr = { MSR_P4_CRU_ESCR0, MSR_P4_CRU_ESCR1 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
- [P4_EVENT_UOPS_RETIRED] = {
- .opcode = P4_OPCODE(P4_EVENT_UOPS_RETIRED),
- .escr_msr = { MSR_P4_CRU_ESCR0, MSR_P4_CRU_ESCR1 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
- [P4_EVENT_UOP_TYPE] = {
- .opcode = P4_OPCODE(P4_EVENT_UOP_TYPE),
- .escr_msr = { MSR_P4_RAT_ESCR0, MSR_P4_RAT_ESCR1 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
- [P4_EVENT_BRANCH_RETIRED] = {
- .opcode = P4_OPCODE(P4_EVENT_BRANCH_RETIRED),
- .escr_msr = { MSR_P4_CRU_ESCR2, MSR_P4_CRU_ESCR3 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
- [P4_EVENT_MISPRED_BRANCH_RETIRED] = {
- .opcode = P4_OPCODE(P4_EVENT_MISPRED_BRANCH_RETIRED),
- .escr_msr = { MSR_P4_CRU_ESCR0, MSR_P4_CRU_ESCR1 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
- [P4_EVENT_X87_ASSIST] = {
- .opcode = P4_OPCODE(P4_EVENT_X87_ASSIST),
- .escr_msr = { MSR_P4_CRU_ESCR2, MSR_P4_CRU_ESCR3 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
- [P4_EVENT_MACHINE_CLEAR] = {
- .opcode = P4_OPCODE(P4_EVENT_MACHINE_CLEAR),
- .escr_msr = { MSR_P4_CRU_ESCR2, MSR_P4_CRU_ESCR3 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
- [P4_EVENT_INSTR_COMPLETED] = {
- .opcode = P4_OPCODE(P4_EVENT_INSTR_COMPLETED),
- .escr_msr = { MSR_P4_CRU_ESCR0, MSR_P4_CRU_ESCR1 },
- .cntr = { {12, 13, 16}, {14, 15, 17} },
- },
-};
-
-#define P4_GEN_CACHE_EVENT(event, bit, cache_event) \
- p4_config_pack_escr(P4_ESCR_EVENT(event) | \
- P4_ESCR_EMASK_BIT(event, bit)) | \
- p4_config_pack_cccr(cache_event | \
- P4_CCCR_ESEL(P4_OPCODE_ESEL(P4_OPCODE(event))))
-
-static __initconst const u64 p4_hw_cache_event_ids
- [PERF_COUNT_HW_CACHE_MAX]
- [PERF_COUNT_HW_CACHE_OP_MAX]
- [PERF_COUNT_HW_CACHE_RESULT_MAX] =
-{
- [ C(L1D ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0,
- [ C(RESULT_MISS) ] = P4_GEN_CACHE_EVENT(P4_EVENT_REPLAY_EVENT, NBOGUS,
- P4_CACHE__1stl_cache_load_miss_retired),
- },
- },
- [ C(LL ) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0,
- [ C(RESULT_MISS) ] = P4_GEN_CACHE_EVENT(P4_EVENT_REPLAY_EVENT, NBOGUS,
- P4_CACHE__2ndl_cache_load_miss_retired),
- },
-},
- [ C(DTLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = 0x0,
- [ C(RESULT_MISS) ] = P4_GEN_CACHE_EVENT(P4_EVENT_REPLAY_EVENT, NBOGUS,
- P4_CACHE__dtlb_load_miss_retired),
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = 0x0,
- [ C(RESULT_MISS) ] = P4_GEN_CACHE_EVENT(P4_EVENT_REPLAY_EVENT, NBOGUS,
- P4_CACHE__dtlb_store_miss_retired),
- },
- },
- [ C(ITLB) ] = {
- [ C(OP_READ) ] = {
- [ C(RESULT_ACCESS) ] = P4_GEN_CACHE_EVENT(P4_EVENT_ITLB_REFERENCE, HIT,
- P4_CACHE__itlb_reference_hit),
- [ C(RESULT_MISS) ] = P4_GEN_CACHE_EVENT(P4_EVENT_ITLB_REFERENCE, MISS,
- P4_CACHE__itlb_reference_miss),
- },
- [ C(OP_WRITE) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- [ C(OP_PREFETCH) ] = {
- [ C(RESULT_ACCESS) ] = -1,
- [ C(RESULT_MISS) ] = -1,
- },
- },
-};
-
-static u64 p4_general_events[PERF_COUNT_HW_MAX] = {
- /* non-halted CPU clocks */
- [PERF_COUNT_HW_CPU_CYCLES] =
- p4_config_pack_escr(P4_ESCR_EVENT(P4_EVENT_GLOBAL_POWER_EVENTS) |
- P4_ESCR_EMASK_BIT(P4_EVENT_GLOBAL_POWER_EVENTS, RUNNING)),
-
- /*
- * retired instructions
- * for the sake of simplicity we don't use the FSB tagging
- */
- [PERF_COUNT_HW_INSTRUCTIONS] =
- p4_config_pack_escr(P4_ESCR_EVENT(P4_EVENT_INSTR_RETIRED) |
- P4_ESCR_EMASK_BIT(P4_EVENT_INSTR_RETIRED, NBOGUSNTAG) |
- P4_ESCR_EMASK_BIT(P4_EVENT_INSTR_RETIRED, BOGUSNTAG)),
-
- /* cache hits */
- [PERF_COUNT_HW_CACHE_REFERENCES] =
- p4_config_pack_escr(P4_ESCR_EVENT(P4_EVENT_BSQ_CACHE_REFERENCE) |
- P4_ESCR_EMASK_BIT(P4_EVENT_BSQ_CACHE_REFERENCE, RD_2ndL_HITS) |
- P4_ESCR_EMASK_BIT(P4_EVENT_BSQ_CACHE_REFERENCE, RD_2ndL_HITE) |
- P4_ESCR_EMASK_BIT(P4_EVENT_BSQ_CACHE_REFERENCE, RD_2ndL_HITM) |
- P4_ESCR_EMASK_BIT(P4_EVENT_BSQ_CACHE_REFERENCE, RD_3rdL_HITS) |
- P4_ESCR_EMASK_BIT(P4_EVENT_BSQ_CACHE_REFERENCE, RD_3rdL_HITE) |
- P4_ESCR_EMASK_BIT(P4_EVENT_BSQ_CACHE_REFERENCE, RD_3rdL_HITM)),
-
- /* cache misses */
- [PERF_COUNT_HW_CACHE_MISSES] =
- p4_config_pack_escr(P4_ESCR_EVENT(P4_EVENT_BSQ_CACHE_REFERENCE) |
- P4_ESCR_EMASK_BIT(P4_EVENT_BSQ_CACHE_REFERENCE, RD_2ndL_MISS) |
- P4_ESCR_EMASK_BIT(P4_EVENT_BSQ_CACHE_REFERENCE, RD_3rdL_MISS) |
- P4_ESCR_EMASK_BIT(P4_EVENT_BSQ_CACHE_REFERENCE, WR_2ndL_MISS)),
-
- /* branch instructions retired */
- [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] =
- p4_config_pack_escr(P4_ESCR_EVENT(P4_EVENT_RETIRED_BRANCH_TYPE) |
- P4_ESCR_EMASK_BIT(P4_EVENT_RETIRED_BRANCH_TYPE, CONDITIONAL) |
- P4_ESCR_EMASK_BIT(P4_EVENT_RETIRED_BRANCH_TYPE, CALL) |
- P4_ESCR_EMASK_BIT(P4_EVENT_RETIRED_BRANCH_TYPE, RETURN) |
- P4_ESCR_EMASK_BIT(P4_EVENT_RETIRED_BRANCH_TYPE, INDIRECT)),
-
- /* mispredicted branches retired */
- [PERF_COUNT_HW_BRANCH_MISSES] =
- p4_config_pack_escr(P4_ESCR_EVENT(P4_EVENT_MISPRED_BRANCH_RETIRED) |
- P4_ESCR_EMASK_BIT(P4_EVENT_MISPRED_BRANCH_RETIRED, NBOGUS)),
-
- /* bus ready clocks (cpu is driving #DRDY_DRV\#DRDY_OWN): */
- [PERF_COUNT_HW_BUS_CYCLES] =
- p4_config_pack_escr(P4_ESCR_EVENT(P4_EVENT_FSB_DATA_ACTIVITY) |
- P4_ESCR_EMASK_BIT(P4_EVENT_FSB_DATA_ACTIVITY, DRDY_DRV) |
- P4_ESCR_EMASK_BIT(P4_EVENT_FSB_DATA_ACTIVITY, DRDY_OWN)) |
- p4_config_pack_cccr(P4_CCCR_EDGE | P4_CCCR_COMPARE),
-};
-
-static struct p4_event_bind *p4_config_get_bind(u64 config)
-{
- unsigned int evnt = p4_config_unpack_event(config);
- struct p4_event_bind *bind = NULL;
-
- if (evnt < ARRAY_SIZE(p4_event_bind_map))
- bind = &p4_event_bind_map[evnt];
-
- return bind;
-}
-
-static u64 p4_pmu_event_map(int hw_event)
-{
- struct p4_event_bind *bind;
- unsigned int esel;
- u64 config;
-
- config = p4_general_events[hw_event];
- bind = p4_config_get_bind(config);
- esel = P4_OPCODE_ESEL(bind->opcode);
- config |= p4_config_pack_cccr(P4_CCCR_ESEL(esel));
-
- return config;
-}
-
-static int p4_hw_config(struct perf_event *event)
-{
- int cpu = get_cpu();
- int rc = 0;
- unsigned int evnt;
- u32 escr, cccr;
-
- /*
- * the reason we grab the cpu this early is that if we get scheduled
- * for the first time on the same cpu, we will not need to swap the
- * thread-specific flags in the config (and will save some cpu cycles)
- */
-
- cccr = p4_default_cccr_conf(cpu);
- escr = p4_default_escr_conf(cpu, event->attr.exclude_kernel,
- event->attr.exclude_user);
- event->hw.config = p4_config_pack_escr(escr) |
- p4_config_pack_cccr(cccr);
-
- if (p4_ht_active() && p4_ht_thread(cpu))
- event->hw.config = p4_set_ht_bit(event->hw.config);
-
- if (event->attr.type == PERF_TYPE_RAW) {
-
- /* user data may have an out-of-bounds event index */
- evnt = p4_config_unpack_event(event->attr.config);
- if (evnt >= ARRAY_SIZE(p4_event_bind_map)) {
- rc = -EINVAL;
- goto out;
- }
-
- /*
- * We don't control raw events so it's up to the caller
- * to pass sane values (and we don't count the thread number
- * on an HT machine but allow HT-compatible specifics to be
- * passed on)
- *
- * XXX: HT-wide things should check perf_paranoid_cpu() &&
- * CAP_SYS_ADMIN
- */
- event->hw.config |= event->attr.config &
- (p4_config_pack_escr(P4_ESCR_MASK_HT) |
- p4_config_pack_cccr(P4_CCCR_MASK_HT));
- }
-
- rc = x86_setup_perfctr(event);
-out:
- put_cpu();
- return rc;
-}
-
-static inline int p4_pmu_clear_cccr_ovf(struct hw_perf_event *hwc)
-{
- int overflow = 0;
- u32 low, high;
-
- rdmsr(hwc->config_base + hwc->idx, low, high);
-
- /* we need to check high bit for unflagged overflows */
- if ((low & P4_CCCR_OVF) || !(high & (1 << 31))) {
- overflow = 1;
- (void)checking_wrmsrl(hwc->config_base + hwc->idx,
- ((u64)low) & ~P4_CCCR_OVF);
- }
-
- return overflow;
-}
-
-static inline void p4_pmu_disable_event(struct perf_event *event)
-{
- struct hw_perf_event *hwc = &event->hw;
-
- /*
- * If the event gets disabled while the counter is in an overflowed
- * state we need to clear P4_CCCR_OVF, otherwise the interrupt gets
- * asserted again and again
- */
- (void)checking_wrmsrl(hwc->config_base + hwc->idx,
- (u64)(p4_config_unpack_cccr(hwc->config)) &
- ~P4_CCCR_ENABLE & ~P4_CCCR_OVF & ~P4_CCCR_RESERVED);
-}
-
-static void p4_pmu_disable_all(void)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- int idx;
-
- for (idx = 0; idx < x86_pmu.num_counters; idx++) {
- struct perf_event *event = cpuc->events[idx];
- if (!test_bit(idx, cpuc->active_mask))
- continue;
- p4_pmu_disable_event(event);
- }
-}
-
-static void p4_pmu_enable_event(struct perf_event *event)
-{
- struct hw_perf_event *hwc = &event->hw;
- int thread = p4_ht_config_thread(hwc->config);
- u64 escr_conf = p4_config_unpack_escr(p4_clear_ht_bit(hwc->config));
- unsigned int idx = p4_config_unpack_event(hwc->config);
- unsigned int idx_cache = p4_config_unpack_cache_event(hwc->config);
- struct p4_event_bind *bind;
- struct p4_cache_event_bind *bind_cache;
- u64 escr_addr, cccr;
-
- bind = &p4_event_bind_map[idx];
- escr_addr = (u64)bind->escr_msr[thread];
-
- /*
- * - we don't support cascaded counters yet
- * - and counter 1 is broken (erratum)
- */
- WARN_ON_ONCE(p4_is_event_cascaded(hwc->config));
- WARN_ON_ONCE(hwc->idx == 1);
-
- /* we need a real Event value */
- escr_conf &= ~P4_ESCR_EVENT_MASK;
- escr_conf |= P4_ESCR_EVENT(P4_OPCODE_EVNT(bind->opcode));
-
- cccr = p4_config_unpack_cccr(hwc->config);
-
- /*
- * it could be a cache event, so we need to
- * set the metrics in additional MSRs
- */
- BUILD_BUG_ON(P4_CACHE__MAX > P4_CCCR_CACHE_OPS_MASK);
- if (idx_cache > P4_CACHE__NONE &&
- idx_cache < ARRAY_SIZE(p4_cache_event_bind_map)) {
- bind_cache = &p4_cache_event_bind_map[idx_cache];
- (void)checking_wrmsrl(MSR_IA32_PEBS_ENABLE, (u64)bind_cache->metric_pebs);
- (void)checking_wrmsrl(MSR_P4_PEBS_MATRIX_VERT, (u64)bind_cache->metric_vert);
- }
-
- (void)checking_wrmsrl(escr_addr, escr_conf);
- (void)checking_wrmsrl(hwc->config_base + hwc->idx,
- (cccr & ~P4_CCCR_RESERVED) | P4_CCCR_ENABLE);
-}
-
-static void p4_pmu_enable_all(int added)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- int idx;
-
- for (idx = 0; idx < x86_pmu.num_counters; idx++) {
- struct perf_event *event = cpuc->events[idx];
- if (!test_bit(idx, cpuc->active_mask))
- continue;
- p4_pmu_enable_event(event);
- }
-}
-
-static int p4_pmu_handle_irq(struct pt_regs *regs)
-{
- struct perf_sample_data data;
- struct cpu_hw_events *cpuc;
- struct perf_event *event;
- struct hw_perf_event *hwc;
- int idx, handled = 0;
- u64 val;
-
- data.addr = 0;
- data.raw = NULL;
-
- cpuc = &__get_cpu_var(cpu_hw_events);
-
- for (idx = 0; idx < x86_pmu.num_counters; idx++) {
-
- if (!test_bit(idx, cpuc->active_mask))
- continue;
-
- event = cpuc->events[idx];
- hwc = &event->hw;
-
- WARN_ON_ONCE(hwc->idx != idx);
-
- /* it might be an unflagged overflow */
- handled = p4_pmu_clear_cccr_ovf(hwc);
-
- val = x86_perf_event_update(event);
- if (!handled && (val & (1ULL << (x86_pmu.cntval_bits - 1))))
- continue;
-
- /* event overflow for sure */
- data.period = event->hw.last_period;
-
- if (!x86_perf_event_set_period(event))
- continue;
- if (perf_event_overflow(event, 1, &data, regs))
- p4_pmu_disable_event(event);
- }
-
- if (handled) {
- /* p4 quirk: unmask it again */
- apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED);
- inc_irq_stat(apic_perf_irqs);
- }
-
- return handled;
-}
-
-/*
- * swap thread specific fields according to a thread
- * we are going to run on
- */
-static void p4_pmu_swap_config_ts(struct hw_perf_event *hwc, int cpu)
-{
- u32 escr, cccr;
-
- /*
- * either we are lucky and continue on the same cpu, or there is no HT support
- */
- if (!p4_should_swap_ts(hwc->config, cpu))
- return;
-
- /*
- * the event is migrated from another logical
- * cpu, so we need to swap the thread-specific flags
- */
-
- escr = p4_config_unpack_escr(hwc->config);
- cccr = p4_config_unpack_cccr(hwc->config);
-
- if (p4_ht_thread(cpu)) {
- cccr &= ~P4_CCCR_OVF_PMI_T0;
- cccr |= P4_CCCR_OVF_PMI_T1;
- if (escr & P4_ESCR_T0_OS) {
- escr &= ~P4_ESCR_T0_OS;
- escr |= P4_ESCR_T1_OS;
- }
- if (escr & P4_ESCR_T0_USR) {
- escr &= ~P4_ESCR_T0_USR;
- escr |= P4_ESCR_T1_USR;
- }
- hwc->config = p4_config_pack_escr(escr);
- hwc->config |= p4_config_pack_cccr(cccr);
- hwc->config |= P4_CONFIG_HT;
- } else {
- cccr &= ~P4_CCCR_OVF_PMI_T1;
- cccr |= P4_CCCR_OVF_PMI_T0;
- if (escr & P4_ESCR_T1_OS) {
- escr &= ~P4_ESCR_T1_OS;
- escr |= P4_ESCR_T0_OS;
- }
- if (escr & P4_ESCR_T1_USR) {
- escr &= ~P4_ESCR_T1_USR;
- escr |= P4_ESCR_T0_USR;
- }
- hwc->config = p4_config_pack_escr(escr);
- hwc->config |= p4_config_pack_cccr(cccr);
- hwc->config &= ~P4_CONFIG_HT;
- }
-}
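
A self-contained model of the T0 -> T1 flag migration above. The bit positions here are made up purely for illustration; the move-if-set pattern is the point:

#include <stdint.h>
#include <stdio.h>

#define T0_OS (1u << 3)	/* hypothetical bit positions */
#define T1_OS (1u << 1)

static uint32_t move_flag(uint32_t escr, uint32_t from, uint32_t to)
{
	if (escr & from) {
		escr &= ~from;
		escr |= to;
	}
	return escr;
}

int main(void)
{
	uint32_t escr = move_flag(T0_OS, T0_OS, T1_OS);

	printf("T1_OS set: %d, T0_OS set: %d\n",
	       !!(escr & T1_OS), !!(escr & T0_OS));	/* 1, 0 */
	return 0;
}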
-
-/*
- * ESCR address hashing is tricky: ESCRs are not sequential
- * in the MSR space, but they all start from MSR_P4_BSU_ESCR0
- * (0x03a0) and the low byte of every ESCR address lies in the
- * range [0xa0,0xe1],
- * so we end up with a ~70% filled hash table
- */
-
-#define P4_ESCR_MSR_BASE 0x000003a0
-#define P4_ESCR_MSR_MAX 0x000003e1
-#define P4_ESCR_MSR_TABLE_SIZE (P4_ESCR_MSR_MAX - P4_ESCR_MSR_BASE + 1)
-#define P4_ESCR_MSR_IDX(msr) (msr - P4_ESCR_MSR_BASE)
-#define P4_ESCR_MSR_TABLE_ENTRY(msr) [P4_ESCR_MSR_IDX(msr)] = msr
-
-static const unsigned int p4_escr_table[P4_ESCR_MSR_TABLE_SIZE] = {
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_ALF_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_ALF_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_BPU_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_BPU_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_BSU_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_BSU_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_CRU_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_CRU_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_CRU_ESCR2),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_CRU_ESCR3),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_CRU_ESCR4),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_CRU_ESCR5),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_DAC_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_DAC_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_FIRM_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_FIRM_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_FLAME_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_FLAME_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_FSB_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_FSB_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_IQ_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_IQ_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_IS_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_IS_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_ITLB_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_ITLB_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_IX_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_IX_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_MOB_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_MOB_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_MS_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_MS_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_PMH_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_PMH_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_RAT_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_RAT_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_SAAT_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_SAAT_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_SSU_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_SSU_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_TBPU_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_TBPU_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_TC_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_TC_ESCR1),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_U2L_ESCR0),
- P4_ESCR_MSR_TABLE_ENTRY(MSR_P4_U2L_ESCR1),
-};
-
-static int p4_get_escr_idx(unsigned int addr)
-{
- unsigned int idx = P4_ESCR_MSR_IDX(addr);
-
- if (unlikely(idx >= P4_ESCR_MSR_TABLE_SIZE ||
- !p4_escr_table[idx] ||
- p4_escr_table[idx] != addr)) {
- WARN_ONCE(1, "P4 PMU: Wrong address passed: %x\n", addr);
- return -1;
- }
-
- return idx;
-}
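
The "hash" is really a direct-mapped table over the MSR window, with unused slots left zero so a stale or out-of-window address is caught. A standalone sketch (0x3a0 is MSR_P4_BSU_ESCR0 per the comment above; the second probe address is arbitrary):

#include <stdio.h>

#define BASE 0x3a0
#define MAX  0x3e1
#define SIZE (MAX - BASE + 1)

static const unsigned int table[SIZE] = {
	[0x3a0 - BASE] = 0x3a0,	/* MSR_P4_BSU_ESCR0 */
	/* ... the other ESCRs fill ~70% of the slots ... */
};

static int escr_idx(unsigned int msr)
{
	unsigned int idx = msr - BASE;	/* underflows (huge) if msr < BASE */

	if (idx >= SIZE || table[idx] != msr)
		return -1;	/* hole, or address outside the window */
	return idx;
}

int main(void)
{
	printf("%d\n", escr_idx(0x3a0));	/* 0 */
	printf("%d\n", escr_idx(0x123));	/* -1 */
	return 0;
}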
-
-static int p4_next_cntr(int thread, unsigned long *used_mask,
- struct p4_event_bind *bind)
-{
- int i, j;
-
- for (i = 0; i < P4_CNTR_LIMIT; i++) {
- j = bind->cntr[thread][i];
- if (j != -1 && !test_bit(j, used_mask))
- return j;
- }
-
- return -1;
-}
-
-static int p4_pmu_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
-{
- unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
- unsigned long escr_mask[BITS_TO_LONGS(P4_ESCR_MSR_TABLE_SIZE)];
- int cpu = smp_processor_id();
- struct hw_perf_event *hwc;
- struct p4_event_bind *bind;
- unsigned int i, thread, num;
- int cntr_idx, escr_idx;
-
- bitmap_zero(used_mask, X86_PMC_IDX_MAX);
- bitmap_zero(escr_mask, P4_ESCR_MSR_TABLE_SIZE);
-
- for (i = 0, num = n; i < n; i++, num--) {
-
- hwc = &cpuc->event_list[i]->hw;
- thread = p4_ht_thread(cpu);
- bind = p4_config_get_bind(hwc->config);
- escr_idx = p4_get_escr_idx(bind->escr_msr[thread]);
- if (unlikely(escr_idx == -1))
- goto done;
-
- if (hwc->idx != -1 && !p4_should_swap_ts(hwc->config, cpu)) {
- cntr_idx = hwc->idx;
- if (assign)
- assign[i] = hwc->idx;
- goto reserve;
- }
-
- cntr_idx = p4_next_cntr(thread, used_mask, bind);
- if (cntr_idx == -1 || test_bit(escr_idx, escr_mask))
- goto done;
-
- p4_pmu_swap_config_ts(hwc, cpu);
- if (assign)
- assign[i] = cntr_idx;
-reserve:
- set_bit(cntr_idx, used_mask);
- set_bit(escr_idx, escr_mask);
- }
-
-done:
- return num ? -ENOSPC : 0;
-}
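
The scheduler above is greedy: each event must grab both a free counter and a free ESCR, and a single unresolvable conflict fails the whole group with -ENOSPC; there is no backtracking. A toy version of that double reservation (simplified: the real code also tries up to P4_CNTR_LIMIT alternate counters per event; the resource numbers below are hypothetical):

#include <stdio.h>

/* toy: event i wants counter cntr[i] and ESCR esc[i] */
static int schedule(const int *cntr, const int *esc, int n)
{
	unsigned long used_cntr = 0, used_escr = 0;
	int i;

	for (i = 0; i < n; i++) {
		if ((used_cntr >> cntr[i] & 1) || (used_escr >> esc[i] & 1))
			return -1;	/* -ENOSPC in the real code */
		used_cntr |= 1UL << cntr[i];
		used_escr |= 1UL << esc[i];
	}
	return 0;
}

int main(void)
{
	int c1[] = { 0, 1 }, e1[] = { 2, 3 };	/* disjoint: fits */
	int c2[] = { 0, 1 }, e2[] = { 2, 2 };	/* shared ESCR: fails */

	printf("%d %d\n", schedule(c1, e1, 2), schedule(c2, e2, 2));	/* 0 -1 */
	return 0;
}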
-
-static __initconst const struct x86_pmu p4_pmu = {
- .name = "Netburst P4/Xeon",
- .handle_irq = p4_pmu_handle_irq,
- .disable_all = p4_pmu_disable_all,
- .enable_all = p4_pmu_enable_all,
- .enable = p4_pmu_enable_event,
- .disable = p4_pmu_disable_event,
- .eventsel = MSR_P4_BPU_CCCR0,
- .perfctr = MSR_P4_BPU_PERFCTR0,
- .event_map = p4_pmu_event_map,
- .max_events = ARRAY_SIZE(p4_general_events),
- .get_event_constraints = x86_get_event_constraints,
- /*
- * If HT is disabled we may need to use all
- * ARCH_P4_MAX_CCCR counters simultaneously,
- * though for the moment leave it restricted,
- * assuming HT is on
- */
- .num_counters = ARCH_P4_MAX_CCCR,
- .apic = 1,
- .cntval_bits = 40,
- .cntval_mask = (1ULL << 40) - 1,
- .max_period = (1ULL << 39) - 1,
- .hw_config = p4_hw_config,
- .schedule_events = p4_pmu_schedule_events,
-};
-
-static __init int p4_pmu_init(void)
-{
- unsigned int low, high;
-
- /* If we get stripped -- indexing fails */
- BUILD_BUG_ON(ARCH_P4_MAX_CCCR > X86_PMC_MAX_GENERIC);
-
- rdmsr(MSR_IA32_MISC_ENABLE, low, high);
- if (!(low & (1 << 7))) {
- pr_cont("unsupported Netburst CPU model %d ",
- boot_cpu_data.x86_model);
- return -ENODEV;
- }
-
- memcpy(hw_cache_event_ids, p4_hw_cache_event_ids,
- sizeof(hw_cache_event_ids));
-
- pr_cont("Netburst events, ");
-
- x86_pmu = p4_pmu;
-
- return 0;
-}
-
-#endif /* CONFIG_CPU_SUP_INTEL */
diff --git a/arch/x86/kernel/cpu/perf_event_p6.c b/arch/x86/kernel/cpu/perf_event_p6.c
deleted file mode 100644
index 34ba07be2cda..000000000000
--- a/arch/x86/kernel/cpu/perf_event_p6.c
+++ /dev/null
@@ -1,142 +0,0 @@
-#ifdef CONFIG_CPU_SUP_INTEL
-
-/*
- * Not sure about some of these
- */
-static const u64 p6_perfmon_event_map[] =
-{
- [PERF_COUNT_HW_CPU_CYCLES] = 0x0079,
- [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,
- [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0f2e,
- [PERF_COUNT_HW_CACHE_MISSES] = 0x012e,
- [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c4,
- [PERF_COUNT_HW_BRANCH_MISSES] = 0x00c5,
- [PERF_COUNT_HW_BUS_CYCLES] = 0x0062,
-};
-
-static u64 p6_pmu_event_map(int hw_event)
-{
- return p6_perfmon_event_map[hw_event];
-}
-
-/*
- * Event setting that is specified not to count anything.
- * We use this to effectively disable a counter.
- *
- * L2_RQSTS with 0 MESI unit mask.
- */
-#define P6_NOP_EVENT 0x0000002EULL
-
-static struct event_constraint p6_event_constraints[] =
-{
- INTEL_EVENT_CONSTRAINT(0xc1, 0x1), /* FLOPS */
- INTEL_EVENT_CONSTRAINT(0x10, 0x1), /* FP_COMP_OPS_EXE */
- INTEL_EVENT_CONSTRAINT(0x11, 0x1), /* FP_ASSIST */
- INTEL_EVENT_CONSTRAINT(0x12, 0x2), /* MUL */
- INTEL_EVENT_CONSTRAINT(0x13, 0x2), /* DIV */
- INTEL_EVENT_CONSTRAINT(0x14, 0x1), /* CYCLES_DIV_BUSY */
- EVENT_CONSTRAINT_END
-};
-
-static void p6_pmu_disable_all(void)
-{
- u64 val;
-
- /* p6 only has one enable register */
- rdmsrl(MSR_P6_EVNTSEL0, val);
- val &= ~ARCH_PERFMON_EVENTSEL_ENABLE;
- wrmsrl(MSR_P6_EVNTSEL0, val);
-}
-
-static void p6_pmu_enable_all(int added)
-{
- unsigned long val;
-
- /* p6 only has one enable register */
- rdmsrl(MSR_P6_EVNTSEL0, val);
- val |= ARCH_PERFMON_EVENTSEL_ENABLE;
- wrmsrl(MSR_P6_EVNTSEL0, val);
-}
-
-static inline void
-p6_pmu_disable_event(struct perf_event *event)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct hw_perf_event *hwc = &event->hw;
- u64 val = P6_NOP_EVENT;
-
- if (cpuc->enabled)
- val |= ARCH_PERFMON_EVENTSEL_ENABLE;
-
- (void)checking_wrmsrl(hwc->config_base + hwc->idx, val);
-}
-
-static void p6_pmu_enable_event(struct perf_event *event)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct hw_perf_event *hwc = &event->hw;
- u64 val;
-
- val = hwc->config;
- if (cpuc->enabled)
- val |= ARCH_PERFMON_EVENTSEL_ENABLE;
-
- (void)checking_wrmsrl(hwc->config_base + hwc->idx, val);
-}
-
-static __initconst const struct x86_pmu p6_pmu = {
- .name = "p6",
- .handle_irq = x86_pmu_handle_irq,
- .disable_all = p6_pmu_disable_all,
- .enable_all = p6_pmu_enable_all,
- .enable = p6_pmu_enable_event,
- .disable = p6_pmu_disable_event,
- .hw_config = x86_pmu_hw_config,
- .schedule_events = x86_schedule_events,
- .eventsel = MSR_P6_EVNTSEL0,
- .perfctr = MSR_P6_PERFCTR0,
- .event_map = p6_pmu_event_map,
- .max_events = ARRAY_SIZE(p6_perfmon_event_map),
- .apic = 1,
- .max_period = (1ULL << 31) - 1,
- .version = 0,
- .num_counters = 2,
- /*
- * Events have 40 bits implemented. However they are designed such
- * that bits [32-39] are sign extensions of bit 31. As such the
- * effective width of an event for a P6-like PMU is 32 bits only.
- *
- * See IA-32 Intel Architecture Software developer manual Vol 3B
- */
- .cntval_bits = 32,
- .cntval_mask = (1ULL << 32) - 1,
- .get_event_constraints = x86_get_event_constraints,
- .event_constraints = p6_event_constraints,
-};
-
-static __init int p6_pmu_init(void)
-{
- switch (boot_cpu_data.x86_model) {
- case 1:
- case 3: /* Pentium Pro */
- case 5:
- case 6: /* Pentium II */
- case 7:
- case 8:
- case 11: /* Pentium III */
- case 9:
- case 13:
- /* Pentium M */
- break;
- default:
- pr_cont("unsupported p6 CPU model %d ",
- boot_cpu_data.x86_model);
- return -ENODEV;
- }
-
- x86_pmu = p6_pmu;
-
- return 0;
-}
-
-#endif /* CONFIG_CPU_SUP_INTEL */
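
The 32-bit effective width noted in the p6_pmu comment above can be seen directly: writing a value with bit 31 set makes the hardware replicate that bit through bits 32-39. A quick model of the sign extension, assuming the 40-bit implemented width:

#include <stdint.h>
#include <stdio.h>

/* model: bits [32..39] mirror bit 31 of the written 32-bit value */
static uint64_t p6_counter_write(uint32_t val)
{
	return (uint64_t)(int64_t)(int32_t)val & ((1ULL << 40) - 1);
}

int main(void)
{
	printf("%#llx\n", (unsigned long long)p6_counter_write(0x7fffffff));
	/* 0x7fffffff: bit 31 clear, stored as-is */
	printf("%#llx\n", (unsigned long long)p6_counter_write(0x80000000));
	/* 0xff80000000: bits 32-39 copied from bit 31 */
	return 0;
}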
diff --git a/arch/x86/kernel/cpu/perfctr-watchdog.c b/arch/x86/kernel/cpu/perfctr-watchdog.c
index fb329e9f8494..7aecb2fc3186 100644
--- a/arch/x86/kernel/cpu/perfctr-watchdog.c
+++ b/arch/x86/kernel/cpu/perfctr-watchdog.c
@@ -1,8 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* local apic based NMI watchdog for various CPUs.
*
* This file also handles reservation of performance counters for coordination
- * with other users (like oprofile).
+ * with other users.
*
* Note that these events normally don't tick when the CPU idles. This means
* the frequency varies with CPU load.
@@ -12,36 +13,16 @@
*/
#include <linux/percpu.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/kernel.h>
#include <linux/bitops.h>
#include <linux/smp.h>
-#include <linux/nmi.h>
+#include <asm/nmi.h>
#include <linux/kprobes.h>
#include <asm/apic.h>
#include <asm/perf_event.h>
-struct nmi_watchdog_ctlblk {
- unsigned int cccr_msr;
- unsigned int perfctr_msr; /* the MSR to reset in NMI handler */
- unsigned int evntsel_msr; /* the MSR to select the events to handle */
-};
-
-/* Interface defining a CPU specific perfctr watchdog */
-struct wd_ops {
- int (*reserve)(void);
- void (*unreserve)(void);
- int (*setup)(unsigned nmi_hz);
- void (*rearm)(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz);
- void (*stop)(void);
- unsigned perfctr;
- unsigned evntsel;
- u64 checkbit;
-};
-
-static const struct wd_ops *wd_ops;
-
/*
* this number is calculated from Intel's MSR_P4_CRU_ESCR5 register and it's
* offset from MSR_P4_BSU_ESCR0.
@@ -60,14 +41,15 @@ static const struct wd_ops *wd_ops;
static DECLARE_BITMAP(perfctr_nmi_owner, NMI_MAX_COUNTER_BITS);
static DECLARE_BITMAP(evntsel_nmi_owner, NMI_MAX_COUNTER_BITS);
-static DEFINE_PER_CPU(struct nmi_watchdog_ctlblk, nmi_watchdog_ctlblk);
-
/* converts an msr to an appropriate reservation bit */
static inline unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
{
/* returns the bit offset of the performance counter register */
switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_HYGON:
case X86_VENDOR_AMD:
+ if (msr >= MSR_F15H_PERF_CTR)
+ return (msr - MSR_F15H_PERF_CTR) >> 1;
return msr - MSR_K7_PERFCTR0;
case X86_VENDOR_INTEL:
if (cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON))
@@ -76,9 +58,15 @@ static inline unsigned int nmi_perfctr_msr_to_bit(unsigned int msr)
switch (boot_cpu_data.x86) {
case 6:
return msr - MSR_P6_PERFCTR0;
+ case 11:
+ return msr - MSR_KNC_PERFCTR0;
case 15:
return msr - MSR_P4_BPU_PERFCTR0;
}
+ break;
+ case X86_VENDOR_ZHAOXIN:
+ case X86_VENDOR_CENTAUR:
+ return msr - MSR_ARCH_PERFMON_PERFCTR0;
}
return 0;
}
@@ -91,7 +79,10 @@ static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
{
/* returns the bit offset of the event selection register */
switch (boot_cpu_data.x86_vendor) {
+ case X86_VENDOR_HYGON:
case X86_VENDOR_AMD:
+ if (msr >= MSR_F15H_PERF_CTL)
+ return (msr - MSR_F15H_PERF_CTL) >> 1;
return msr - MSR_K7_EVNTSEL0;
case X86_VENDOR_INTEL:
if (cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON))
@@ -100,23 +91,20 @@ static inline unsigned int nmi_evntsel_msr_to_bit(unsigned int msr)
switch (boot_cpu_data.x86) {
case 6:
return msr - MSR_P6_EVNTSEL0;
+ case 11:
+ return msr - MSR_KNC_EVNTSEL0;
case 15:
return msr - MSR_P4_BSU_ESCR0;
}
+ break;
+ case X86_VENDOR_ZHAOXIN:
+ case X86_VENDOR_CENTAUR:
+ return msr - MSR_ARCH_PERFMON_EVENTSEL0;
}
return 0;
}
-/* checks for a bit availability (hack for oprofile) */
-int avail_to_resrv_perfctr_nmi_bit(unsigned int counter)
-{
- BUG_ON(counter > NMI_MAX_COUNTER_BITS);
-
- return !test_bit(counter, perfctr_nmi_owner);
-}
-EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi_bit);
-
int reserve_perfctr_nmi(unsigned int msr)
{
unsigned int counter;
@@ -172,624 +160,3 @@ void release_evntsel_nmi(unsigned int msr)
clear_bit(counter, evntsel_nmi_owner);
}
EXPORT_SYMBOL(release_evntsel_nmi);
-
-void disable_lapic_nmi_watchdog(void)
-{
- BUG_ON(nmi_watchdog != NMI_LOCAL_APIC);
-
- if (atomic_read(&nmi_active) <= 0)
- return;
-
- on_each_cpu(stop_apic_nmi_watchdog, NULL, 1);
-
- if (wd_ops)
- wd_ops->unreserve();
-
- BUG_ON(atomic_read(&nmi_active) != 0);
-}
-
-void enable_lapic_nmi_watchdog(void)
-{
- BUG_ON(nmi_watchdog != NMI_LOCAL_APIC);
-
- /* are we already enabled */
- if (atomic_read(&nmi_active) != 0)
- return;
-
- /* are we lapic aware */
- if (!wd_ops)
- return;
- if (!wd_ops->reserve()) {
- printk(KERN_ERR "NMI watchdog: cannot reserve perfctrs\n");
- return;
- }
-
- on_each_cpu(setup_apic_nmi_watchdog, NULL, 1);
- touch_nmi_watchdog();
-}
-
-/*
- * Activate the NMI watchdog via the local APIC.
- */
-
-static unsigned int adjust_for_32bit_ctr(unsigned int hz)
-{
- u64 counter_val;
- unsigned int retval = hz;
-
- /*
- * On Intel CPUs with P6/ARCH_PERFMON only 32 bits in the counter
- * are writable, with higher bits sign extending from bit 31.
- * So we can only program the counter with 31-bit values; the
- * 32nd bit must be 1 so that bits 33 and above sign-extend to 1.
- * Find the appropriate nmi_hz.
- */
- counter_val = (u64)cpu_khz * 1000;
- do_div(counter_val, retval);
- if (counter_val > 0x7fffffffULL) {
- u64 count = (u64)cpu_khz * 1000;
- do_div(count, 0x7fffffffUL);
- retval = count + 1;
- }
- return retval;
-}
-
-static void write_watchdog_counter(unsigned int perfctr_msr,
- const char *descr, unsigned nmi_hz)
-{
- u64 count = (u64)cpu_khz * 1000;
-
- do_div(count, nmi_hz);
- if (descr)
- pr_debug("setting %s to -0x%08Lx\n", descr, count);
- wrmsrl(perfctr_msr, 0 - count);
-}
-
-static void write_watchdog_counter32(unsigned int perfctr_msr,
- const char *descr, unsigned nmi_hz)
-{
- u64 count = (u64)cpu_khz * 1000;
-
- do_div(count, nmi_hz);
- if (descr)
- pr_debug("setting %s to -0x%08Lx\n", descr, count);
- wrmsr(perfctr_msr, (u32)(-count), 0);
-}
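
A worked example of the helpers above, assuming a hypothetical 3 GHz CPU (cpu_khz = 3000000) and a requested rate of 1 NMI/s: the raw preset, 3e9 cycles, does not fit in 31 bits, so the rate is bumped exactly as adjust_for_32bit_ctr() does:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t cpu_khz = 3000000;	/* assumed 3 GHz clock */
	unsigned int nmi_hz = 1;
	uint64_t count = cpu_khz * 1000 / nmi_hz;	/* 3e9: > 0x7fffffff */

	if (count > 0x7fffffffULL) {
		nmi_hz = cpu_khz * 1000 / 0x7fffffffULL + 1;	/* -> 2 */
		count = cpu_khz * 1000 / nmi_hz;	/* 1.5e9: fits */
	}
	printf("nmi_hz=%u preset=-0x%08llx\n",
	       nmi_hz, (unsigned long long)count);
	return 0;
}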
-
-/*
- * AMD K7/K8/Family10h/Family11h support.
- * AMD keeps this interface nicely stable so there is not much variety
- */
-#define K7_EVNTSEL_ENABLE (1 << 22)
-#define K7_EVNTSEL_INT (1 << 20)
-#define K7_EVNTSEL_OS (1 << 17)
-#define K7_EVNTSEL_USR (1 << 16)
-#define K7_EVENT_CYCLES_PROCESSOR_IS_RUNNING 0x76
-#define K7_NMI_EVENT K7_EVENT_CYCLES_PROCESSOR_IS_RUNNING
-
-static int setup_k7_watchdog(unsigned nmi_hz)
-{
- unsigned int perfctr_msr, evntsel_msr;
- unsigned int evntsel;
- struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
-
- perfctr_msr = wd_ops->perfctr;
- evntsel_msr = wd_ops->evntsel;
-
- wrmsrl(perfctr_msr, 0UL);
-
- evntsel = K7_EVNTSEL_INT
- | K7_EVNTSEL_OS
- | K7_EVNTSEL_USR
- | K7_NMI_EVENT;
-
- /* setup the timer */
- wrmsr(evntsel_msr, evntsel, 0);
- write_watchdog_counter(perfctr_msr, "K7_PERFCTR0", nmi_hz);
-
- /* initialize the wd struct before enabling */
- wd->perfctr_msr = perfctr_msr;
- wd->evntsel_msr = evntsel_msr;
- wd->cccr_msr = 0; /* unused */
-
- /* ok, everything is initialized, announce that we're set */
- cpu_nmi_set_wd_enabled();
-
- apic_write(APIC_LVTPC, APIC_DM_NMI);
- evntsel |= K7_EVNTSEL_ENABLE;
- wrmsr(evntsel_msr, evntsel, 0);
-
- return 1;
-}
-
-static void single_msr_stop_watchdog(void)
-{
- struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
-
- wrmsr(wd->evntsel_msr, 0, 0);
-}
-
-static int single_msr_reserve(void)
-{
- if (!reserve_perfctr_nmi(wd_ops->perfctr))
- return 0;
-
- if (!reserve_evntsel_nmi(wd_ops->evntsel)) {
- release_perfctr_nmi(wd_ops->perfctr);
- return 0;
- }
- return 1;
-}
-
-static void single_msr_unreserve(void)
-{
- release_evntsel_nmi(wd_ops->evntsel);
- release_perfctr_nmi(wd_ops->perfctr);
-}
-
-static void __kprobes
-single_msr_rearm(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz)
-{
- /* start the cycle over again */
- write_watchdog_counter(wd->perfctr_msr, NULL, nmi_hz);
-}
-
-static const struct wd_ops k7_wd_ops = {
- .reserve = single_msr_reserve,
- .unreserve = single_msr_unreserve,
- .setup = setup_k7_watchdog,
- .rearm = single_msr_rearm,
- .stop = single_msr_stop_watchdog,
- .perfctr = MSR_K7_PERFCTR0,
- .evntsel = MSR_K7_EVNTSEL0,
- .checkbit = 1ULL << 47,
-};
-
-/*
- * Intel Model 6 (PPro+,P2,P3,P-M,Core1)
- */
-#define P6_EVNTSEL0_ENABLE (1 << 22)
-#define P6_EVNTSEL_INT (1 << 20)
-#define P6_EVNTSEL_OS (1 << 17)
-#define P6_EVNTSEL_USR (1 << 16)
-#define P6_EVENT_CPU_CLOCKS_NOT_HALTED 0x79
-#define P6_NMI_EVENT P6_EVENT_CPU_CLOCKS_NOT_HALTED
-
-static int setup_p6_watchdog(unsigned nmi_hz)
-{
- unsigned int perfctr_msr, evntsel_msr;
- unsigned int evntsel;
- struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
-
- perfctr_msr = wd_ops->perfctr;
- evntsel_msr = wd_ops->evntsel;
-
- /* KVM doesn't implement this MSR */
- if (wrmsr_safe(perfctr_msr, 0, 0) < 0)
- return 0;
-
- evntsel = P6_EVNTSEL_INT
- | P6_EVNTSEL_OS
- | P6_EVNTSEL_USR
- | P6_NMI_EVENT;
-
- /* setup the timer */
- wrmsr(evntsel_msr, evntsel, 0);
- nmi_hz = adjust_for_32bit_ctr(nmi_hz);
- write_watchdog_counter32(perfctr_msr, "P6_PERFCTR0", nmi_hz);
-
- /* initialize the wd struct before enabling */
- wd->perfctr_msr = perfctr_msr;
- wd->evntsel_msr = evntsel_msr;
- wd->cccr_msr = 0; /* unused */
-
- /* ok, everything is initialized, announce that we're set */
- cpu_nmi_set_wd_enabled();
-
- apic_write(APIC_LVTPC, APIC_DM_NMI);
- evntsel |= P6_EVNTSEL0_ENABLE;
- wrmsr(evntsel_msr, evntsel, 0);
-
- return 1;
-}
-
-static void __kprobes p6_rearm(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz)
-{
- /*
- * P6-based Pentium M needs to re-unmask
- * the apic vector, but it doesn't hurt
- * other P6 variants.
- * ArchPerfmon/Core Duo also needs this
- */
- apic_write(APIC_LVTPC, APIC_DM_NMI);
-
- /* P6/ARCH_PERFMON has 32 bit counter write */
- write_watchdog_counter32(wd->perfctr_msr, NULL, nmi_hz);
-}
-
-static const struct wd_ops p6_wd_ops = {
- .reserve = single_msr_reserve,
- .unreserve = single_msr_unreserve,
- .setup = setup_p6_watchdog,
- .rearm = p6_rearm,
- .stop = single_msr_stop_watchdog,
- .perfctr = MSR_P6_PERFCTR0,
- .evntsel = MSR_P6_EVNTSEL0,
- .checkbit = 1ULL << 39,
-};
-
-/*
- * Intel P4 performance counters.
- * By far the most complicated of all.
- */
-#define MSR_P4_MISC_ENABLE_PERF_AVAIL (1 << 7)
-#define P4_ESCR_EVENT_SELECT(N) ((N) << 25)
-#define P4_ESCR_OS (1 << 3)
-#define P4_ESCR_USR (1 << 2)
-#define P4_CCCR_OVF_PMI0 (1 << 26)
-#define P4_CCCR_OVF_PMI1 (1 << 27)
-#define P4_CCCR_THRESHOLD(N) ((N) << 20)
-#define P4_CCCR_COMPLEMENT (1 << 19)
-#define P4_CCCR_COMPARE (1 << 18)
-#define P4_CCCR_REQUIRED (3 << 16)
-#define P4_CCCR_ESCR_SELECT(N) ((N) << 13)
-#define P4_CCCR_ENABLE (1 << 12)
-#define P4_CCCR_OVF (1 << 31)
-
-#define P4_CONTROLS 18
-static unsigned int p4_controls[18] = {
- MSR_P4_BPU_CCCR0,
- MSR_P4_BPU_CCCR1,
- MSR_P4_BPU_CCCR2,
- MSR_P4_BPU_CCCR3,
- MSR_P4_MS_CCCR0,
- MSR_P4_MS_CCCR1,
- MSR_P4_MS_CCCR2,
- MSR_P4_MS_CCCR3,
- MSR_P4_FLAME_CCCR0,
- MSR_P4_FLAME_CCCR1,
- MSR_P4_FLAME_CCCR2,
- MSR_P4_FLAME_CCCR3,
- MSR_P4_IQ_CCCR0,
- MSR_P4_IQ_CCCR1,
- MSR_P4_IQ_CCCR2,
- MSR_P4_IQ_CCCR3,
- MSR_P4_IQ_CCCR4,
- MSR_P4_IQ_CCCR5,
-};
-/*
- * Set up IQ_COUNTER0 to behave like a clock, by having IQ_CCCR0 filter
- * CRU_ESCR0 (with any non-null event selector) through a complemented
- * max threshold. [IA32-Vol3, Section 14.9.9]
- */
-static int setup_p4_watchdog(unsigned nmi_hz)
-{
- unsigned int perfctr_msr, evntsel_msr, cccr_msr;
- unsigned int evntsel, cccr_val;
- unsigned int misc_enable, dummy;
- unsigned int ht_num;
- struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
-
- rdmsr(MSR_IA32_MISC_ENABLE, misc_enable, dummy);
- if (!(misc_enable & MSR_P4_MISC_ENABLE_PERF_AVAIL))
- return 0;
-
-#ifdef CONFIG_SMP
- /* detect which hyperthread we are on */
- if (smp_num_siblings == 2) {
- unsigned int ebx, apicid;
-
- ebx = cpuid_ebx(1);
- apicid = (ebx >> 24) & 0xff;
- ht_num = apicid & 1;
- } else
-#endif
- ht_num = 0;
-
- /*
- * performance counters are shared resources;
- * assign each hyperthread its own set
- * (re-use the ESCR0 register, seems safe
- * and keeps the cccr_val the same)
- */
- if (!ht_num) {
- /* logical cpu 0 */
- perfctr_msr = MSR_P4_IQ_PERFCTR0;
- evntsel_msr = MSR_P4_CRU_ESCR0;
- cccr_msr = MSR_P4_IQ_CCCR0;
- cccr_val = P4_CCCR_OVF_PMI0 | P4_CCCR_ESCR_SELECT(4);
-
- /*
- * If we're on the kdump kernel or other situation, we may
- * still have other performance counter registers set to
- * interrupt and they'll keep interrupting forever because
- * of the P4_CCCR_OVF quirk. So we need to ACK all the
- * pending interrupts and disable all the registers here,
- * before reenabling the NMI delivery. Refer to p4_rearm()
- * about the P4_CCCR_OVF quirk.
- */
- if (reset_devices) {
- unsigned int low, high;
- int i;
-
- for (i = 0; i < P4_CONTROLS; i++) {
- rdmsr(p4_controls[i], low, high);
- low &= ~(P4_CCCR_ENABLE | P4_CCCR_OVF);
- wrmsr(p4_controls[i], low, high);
- }
- }
- } else {
- /* logical cpu 1 */
- perfctr_msr = MSR_P4_IQ_PERFCTR1;
- evntsel_msr = MSR_P4_CRU_ESCR0;
- cccr_msr = MSR_P4_IQ_CCCR1;
-
- /* Pentium 4 D processors don't support P4_CCCR_OVF_PMI1 */
- if (boot_cpu_data.x86_model == 4 && boot_cpu_data.x86_mask == 4)
- cccr_val = P4_CCCR_OVF_PMI0;
- else
- cccr_val = P4_CCCR_OVF_PMI1;
- cccr_val |= P4_CCCR_ESCR_SELECT(4);
- }
-
- evntsel = P4_ESCR_EVENT_SELECT(0x3F)
- | P4_ESCR_OS
- | P4_ESCR_USR;
-
- cccr_val |= P4_CCCR_THRESHOLD(15)
- | P4_CCCR_COMPLEMENT
- | P4_CCCR_COMPARE
- | P4_CCCR_REQUIRED;
-
- wrmsr(evntsel_msr, evntsel, 0);
- wrmsr(cccr_msr, cccr_val, 0);
- write_watchdog_counter(perfctr_msr, "P4_IQ_COUNTER0", nmi_hz);
-
- wd->perfctr_msr = perfctr_msr;
- wd->evntsel_msr = evntsel_msr;
- wd->cccr_msr = cccr_msr;
-
- /* ok, everything is initialized, announce that we're set */
- cpu_nmi_set_wd_enabled();
-
- apic_write(APIC_LVTPC, APIC_DM_NMI);
- cccr_val |= P4_CCCR_ENABLE;
- wrmsr(cccr_msr, cccr_val, 0);
- return 1;
-}
-
-static void stop_p4_watchdog(void)
-{
- struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
- wrmsr(wd->cccr_msr, 0, 0);
- wrmsr(wd->evntsel_msr, 0, 0);
-}
-
-static int p4_reserve(void)
-{
- if (!reserve_perfctr_nmi(MSR_P4_IQ_PERFCTR0))
- return 0;
-#ifdef CONFIG_SMP
- if (smp_num_siblings > 1 && !reserve_perfctr_nmi(MSR_P4_IQ_PERFCTR1))
- goto fail1;
-#endif
- if (!reserve_evntsel_nmi(MSR_P4_CRU_ESCR0))
- goto fail2;
- /* RED-PEN why is ESCR1 not reserved here? */
- return 1;
- fail2:
-#ifdef CONFIG_SMP
- if (smp_num_siblings > 1)
- release_perfctr_nmi(MSR_P4_IQ_PERFCTR1);
- fail1:
-#endif
- release_perfctr_nmi(MSR_P4_IQ_PERFCTR0);
- return 0;
-}
-
-static void p4_unreserve(void)
-{
-#ifdef CONFIG_SMP
- if (smp_num_siblings > 1)
- release_perfctr_nmi(MSR_P4_IQ_PERFCTR1);
-#endif
- release_evntsel_nmi(MSR_P4_CRU_ESCR0);
- release_perfctr_nmi(MSR_P4_IQ_PERFCTR0);
-}
-
-static void __kprobes p4_rearm(struct nmi_watchdog_ctlblk *wd, unsigned nmi_hz)
-{
- unsigned dummy;
- /*
- * P4 quirks:
- * - An overflown perfctr will assert its interrupt
- * until the OVF flag in its CCCR is cleared.
- * - LVTPC is masked on interrupt and must be
- * unmasked by the LVTPC handler.
- */
- rdmsrl(wd->cccr_msr, dummy);
- dummy &= ~P4_CCCR_OVF;
- wrmsrl(wd->cccr_msr, dummy);
- apic_write(APIC_LVTPC, APIC_DM_NMI);
- /* start the cycle over again */
- write_watchdog_counter(wd->perfctr_msr, NULL, nmi_hz);
-}
-
-static const struct wd_ops p4_wd_ops = {
- .reserve = p4_reserve,
- .unreserve = p4_unreserve,
- .setup = setup_p4_watchdog,
- .rearm = p4_rearm,
- .stop = stop_p4_watchdog,
- /* RED-PEN this is wrong for the other sibling */
- .perfctr = MSR_P4_BPU_PERFCTR0,
- .evntsel = MSR_P4_BSU_ESCR0,
- .checkbit = 1ULL << 39,
-};
-
-/*
- * Watchdog using the Intel architected PerfMon.
- * Used for Core2 and hopefully all future Intel CPUs.
- */
-#define ARCH_PERFMON_NMI_EVENT_SEL ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL
-#define ARCH_PERFMON_NMI_EVENT_UMASK ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK
-
-static struct wd_ops intel_arch_wd_ops;
-
-static int setup_intel_arch_watchdog(unsigned nmi_hz)
-{
- unsigned int ebx;
- union cpuid10_eax eax;
- unsigned int unused;
- unsigned int perfctr_msr, evntsel_msr;
- unsigned int evntsel;
- struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
-
- /*
- * Check whether the Architectural PerfMon supports
- * Unhalted Core Cycles Event or not.
- * NOTE: Corresponding bit = 0 in ebx indicates event present.
- */
- cpuid(10, &(eax.full), &ebx, &unused, &unused);
- if ((eax.split.mask_length <
- (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX+1)) ||
- (ebx & ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT))
- return 0;
-
- perfctr_msr = wd_ops->perfctr;
- evntsel_msr = wd_ops->evntsel;
-
- wrmsrl(perfctr_msr, 0UL);
-
- evntsel = ARCH_PERFMON_EVENTSEL_INT
- | ARCH_PERFMON_EVENTSEL_OS
- | ARCH_PERFMON_EVENTSEL_USR
- | ARCH_PERFMON_NMI_EVENT_SEL
- | ARCH_PERFMON_NMI_EVENT_UMASK;
-
- /* setup the timer */
- wrmsr(evntsel_msr, evntsel, 0);
- nmi_hz = adjust_for_32bit_ctr(nmi_hz);
- write_watchdog_counter32(perfctr_msr, "INTEL_ARCH_PERFCTR0", nmi_hz);
-
- wd->perfctr_msr = perfctr_msr;
- wd->evntsel_msr = evntsel_msr;
- wd->cccr_msr = 0; /* unused */
-
- /* ok, everything is initialized, announce that we're set */
- cpu_nmi_set_wd_enabled();
-
- apic_write(APIC_LVTPC, APIC_DM_NMI);
- evntsel |= ARCH_PERFMON_EVENTSEL_ENABLE;
- wrmsr(evntsel_msr, evntsel, 0);
- intel_arch_wd_ops.checkbit = 1ULL << (eax.split.bit_width - 1);
- return 1;
-}
-
-static struct wd_ops intel_arch_wd_ops __read_mostly = {
- .reserve = single_msr_reserve,
- .unreserve = single_msr_unreserve,
- .setup = setup_intel_arch_watchdog,
- .rearm = p6_rearm,
- .stop = single_msr_stop_watchdog,
- .perfctr = MSR_ARCH_PERFMON_PERFCTR1,
- .evntsel = MSR_ARCH_PERFMON_EVENTSEL1,
-};
-
-static void probe_nmi_watchdog(void)
-{
- switch (boot_cpu_data.x86_vendor) {
- case X86_VENDOR_AMD:
- if (boot_cpu_data.x86 != 6 && boot_cpu_data.x86 != 15 &&
- boot_cpu_data.x86 != 16 && boot_cpu_data.x86 != 17)
- return;
- wd_ops = &k7_wd_ops;
- break;
- case X86_VENDOR_INTEL:
- /* Work around CPUs where perfctr1 doesn't have a working enable
- * bit, as described in the following errata:
- * AE49 Core Duo and Intel Core Solo 65 nm
- * AN49 Intel Pentium Dual-Core
- * AF49 Dual-Core Intel Xeon Processor LV
- */
- if ((boot_cpu_data.x86 == 6 && boot_cpu_data.x86_model == 14) ||
- ((boot_cpu_data.x86 == 6 && boot_cpu_data.x86_model == 15 &&
- boot_cpu_data.x86_mask == 4))) {
- intel_arch_wd_ops.perfctr = MSR_ARCH_PERFMON_PERFCTR0;
- intel_arch_wd_ops.evntsel = MSR_ARCH_PERFMON_EVENTSEL0;
- }
- if (cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON)) {
- wd_ops = &intel_arch_wd_ops;
- break;
- }
- switch (boot_cpu_data.x86) {
- case 6:
- if (boot_cpu_data.x86_model > 13)
- return;
-
- wd_ops = &p6_wd_ops;
- break;
- case 15:
- wd_ops = &p4_wd_ops;
- break;
- default:
- return;
- }
- break;
- }
-}
-
-/* Interface to nmi.c */
-
-int lapic_watchdog_init(unsigned nmi_hz)
-{
- if (!wd_ops) {
- probe_nmi_watchdog();
- if (!wd_ops) {
- printk(KERN_INFO "NMI watchdog: CPU not supported\n");
- return -1;
- }
-
- if (!wd_ops->reserve()) {
- printk(KERN_ERR
- "NMI watchdog: cannot reserve perfctrs\n");
- return -1;
- }
- }
-
- if (!(wd_ops->setup(nmi_hz))) {
- printk(KERN_ERR "Cannot setup NMI watchdog on CPU %d\n",
- raw_smp_processor_id());
- return -1;
- }
-
- return 0;
-}
-
-void lapic_watchdog_stop(void)
-{
- if (wd_ops)
- wd_ops->stop();
-}
-
-unsigned lapic_adjust_nmi_hz(unsigned hz)
-{
- struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
- if (wd->perfctr_msr == MSR_P6_PERFCTR0 ||
- wd->perfctr_msr == MSR_ARCH_PERFMON_PERFCTR1)
- hz = adjust_for_32bit_ctr(hz);
- return hz;
-}
-
-int __kprobes lapic_wd_event(unsigned nmi_hz)
-{
- struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
- u64 ctr;
-
- rdmsrl(wd->perfctr_msr, ctr);
- if (ctr & wd_ops->checkbit) /* perfctr still running? */
- return 0;
-
- wd_ops->rearm(wd, nmi_hz);
- return 1;
-}
diff --git a/arch/x86/kernel/cpu/powerflags.c b/arch/x86/kernel/cpu/powerflags.c
index 5abbea297e0c..fd6ec2aa0303 100644
--- a/arch/x86/kernel/cpu/powerflags.c
+++ b/arch/x86/kernel/cpu/powerflags.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Strings for the various x86 power flags
*
@@ -11,10 +12,13 @@ const char *const x86_power_flags[32] = {
"fid", /* frequency id control */
"vid", /* voltage id control */
"ttp", /* thermal trip */
- "tm",
- "stc",
- "100mhzsteps",
- "hwpstate",
+ "tm", /* hardware thermal control */
+ "stc", /* software thermal control */
+ "100mhzsteps", /* 100 MHz multiplier control */
+ "hwpstate", /* hardware P-state control */
"", /* tsc invariant mapped to constant_tsc */
- /* nothing */
+ "cpb", /* core performance boost */
+ "eff_freq_ro", /* Readonly aperf/mperf */
+ "proc_feedback", /* processor feedback interface */
+ "acc_power", /* accumulated power mechanism */
};
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 62ac8cb6ba27..6571d432cbe3 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -1,8 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/smp.h>
#include <linux/timex.h>
#include <linux/string.h>
#include <linux/seq_file.h>
#include <linux/cpufreq.h>
+#include <asm/prctl.h>
+#include <linux/proc_fs.h>
+
+#include "cpu.h"
+
+#ifdef CONFIG_X86_VMX_FEATURE_NAMES
+extern const char * const x86_vmx_flags[NVMXINTS*32];
+#endif
/*
* Get CPU information for use by the procfs.
@@ -11,43 +20,33 @@ static void show_cpuinfo_core(struct seq_file *m, struct cpuinfo_x86 *c,
unsigned int cpu)
{
#ifdef CONFIG_SMP
- if (c->x86_max_cores * smp_num_siblings > 1) {
- seq_printf(m, "physical id\t: %d\n", c->phys_proc_id);
- seq_printf(m, "siblings\t: %d\n",
- cpumask_weight(cpu_core_mask(cpu)));
- seq_printf(m, "core id\t\t: %d\n", c->cpu_core_id);
- seq_printf(m, "cpu cores\t: %d\n", c->booted_cores);
- seq_printf(m, "apicid\t\t: %d\n", c->apicid);
- seq_printf(m, "initial apicid\t: %d\n", c->initial_apicid);
- }
+ seq_printf(m, "physical id\t: %d\n", c->topo.pkg_id);
+ seq_printf(m, "siblings\t: %d\n",
+ cpumask_weight(topology_core_cpumask(cpu)));
+ seq_printf(m, "core id\t\t: %d\n", c->topo.core_id);
+ seq_printf(m, "cpu cores\t: %d\n", c->booted_cores);
+ seq_printf(m, "apicid\t\t: %d\n", c->topo.apicid);
+ seq_printf(m, "initial apicid\t: %d\n", c->topo.initial_apicid);
#endif
}
#ifdef CONFIG_X86_32
static void show_cpuinfo_misc(struct seq_file *m, struct cpuinfo_x86 *c)
{
- /*
- * We use exception 16 if we have hardware math and we've either seen
- * it or the CPU claims it is internal
- */
- int fpu_exception = c->hard_math && (ignore_fpu_irq || cpu_has_fpu);
seq_printf(m,
"fdiv_bug\t: %s\n"
- "hlt_bug\t\t: %s\n"
"f00f_bug\t: %s\n"
"coma_bug\t: %s\n"
"fpu\t\t: %s\n"
"fpu_exception\t: %s\n"
"cpuid level\t: %d\n"
- "wp\t\t: %s\n",
- c->fdiv_bug ? "yes" : "no",
- c->hlt_works_ok ? "no" : "yes",
- c->f00f_bug ? "yes" : "no",
- c->coma_bug ? "yes" : "no",
- c->hard_math ? "yes" : "no",
- fpu_exception ? "yes" : "no",
- c->cpuid_level,
- c->wp_works_ok ? "yes" : "no");
+ "wp\t\t: yes\n",
+ str_yes_no(boot_cpu_has_bug(X86_BUG_FDIV)),
+ str_yes_no(boot_cpu_has_bug(X86_BUG_F00F)),
+ str_yes_no(boot_cpu_has_bug(X86_BUG_COMA)),
+ str_yes_no(boot_cpu_has(X86_FEATURE_FPU)),
+ str_yes_no(boot_cpu_has(X86_FEATURE_FPU)),
+ c->cpuid_level);
}
#else
static void show_cpuinfo_misc(struct seq_file *m, struct cpuinfo_x86 *c)
@@ -64,12 +63,10 @@ static void show_cpuinfo_misc(struct seq_file *m, struct cpuinfo_x86 *c)
static int show_cpuinfo(struct seq_file *m, void *v)
{
struct cpuinfo_x86 *c = v;
- unsigned int cpu = 0;
+ unsigned int cpu;
int i;
-#ifdef CONFIG_SMP
cpu = c->cpu_index;
-#endif
seq_printf(m, "processor\t: %u\n"
"vendor_id\t: %s\n"
"cpu family\t: %d\n"
@@ -81,32 +78,53 @@ static int show_cpuinfo(struct seq_file *m, void *v)
c->x86_model,
c->x86_model_id[0] ? c->x86_model_id : "unknown");
- if (c->x86_mask || c->cpuid_level >= 0)
- seq_printf(m, "stepping\t: %d\n", c->x86_mask);
+ if (c->x86_stepping || c->cpuid_level >= 0)
+ seq_printf(m, "stepping\t: %d\n", c->x86_stepping);
else
- seq_printf(m, "stepping\t: unknown\n");
+ seq_puts(m, "stepping\t: unknown\n");
+ if (c->microcode)
+ seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
if (cpu_has(c, X86_FEATURE_TSC)) {
- unsigned int freq = cpufreq_quick_get(cpu);
+ int freq = arch_freq_get_on_cpu(cpu);
- if (!freq)
- freq = cpu_khz;
- seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
- freq / 1000, (freq % 1000));
+ if (freq < 0)
+ seq_puts(m, "cpu MHz\t\t: Unknown\n");
+ else
+ seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000));
}
/* Cache size */
- if (c->x86_cache_size >= 0)
- seq_printf(m, "cache size\t: %d KB\n", c->x86_cache_size);
+ if (c->x86_cache_size)
+ seq_printf(m, "cache size\t: %u KB\n", c->x86_cache_size);
show_cpuinfo_core(m, c, cpu);
show_cpuinfo_misc(m, c);
- seq_printf(m, "flags\t\t:");
+ seq_puts(m, "flags\t\t:");
for (i = 0; i < 32*NCAPINTS; i++)
if (cpu_has(c, i) && x86_cap_flags[i] != NULL)
seq_printf(m, " %s", x86_cap_flags[i]);
+#ifdef CONFIG_X86_VMX_FEATURE_NAMES
+ if (cpu_has(c, X86_FEATURE_VMX) && c->vmx_capability[0]) {
+ seq_puts(m, "\nvmx flags\t:");
+ for (i = 0; i < 32*NVMXINTS; i++) {
+ if (test_bit(i, (unsigned long *)c->vmx_capability) &&
+ x86_vmx_flags[i] != NULL)
+ seq_printf(m, " %s", x86_vmx_flags[i]);
+ }
+ }
+#endif
+
+ seq_puts(m, "\nbugs\t\t:");
+ for (i = 0; i < 32*NBUGINTS; i++) {
+ unsigned int bug_bit = 32*NCAPINTS + i;
+
+ if (cpu_has_bug(c, bug_bit) && x86_bug_flags[i])
+ seq_printf(m, " %s", x86_bug_flags[i]);
+ }
+
seq_printf(m, "\nbogomips\t: %lu.%02lu\n",
c->loops_per_jiffy/(500000/HZ),
(c->loops_per_jiffy/(5000/HZ)) % 100);
@@ -120,7 +138,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
seq_printf(m, "address sizes\t: %u bits physical, %u bits virtual\n",
c->x86_phys_bits, c->x86_virt_bits);
- seq_printf(m, "power management:");
+ seq_puts(m, "power management:");
for (i = 0; i < 32; i++) {
if (c->x86_power & (1 << i)) {
if (i < ARRAY_SIZE(x86_power_flags) &&
@@ -133,17 +151,14 @@ static int show_cpuinfo(struct seq_file *m, void *v)
}
}
- seq_printf(m, "\n\n");
+ seq_puts(m, "\n\n");
return 0;
}
static void *c_start(struct seq_file *m, loff_t *pos)
{
- if (*pos == 0) /* just in case, cpu 0 is not the first */
- *pos = cpumask_first(cpu_online_mask);
- else
- *pos = cpumask_next(*pos - 1, cpu_online_mask);
+ *pos = cpumask_next(*pos - 1, cpu_online_mask);
if ((*pos) < nr_cpu_ids)
return &cpu_data(*pos);
return NULL;
@@ -165,3 +180,24 @@ const struct seq_operations cpuinfo_op = {
.stop = c_stop,
.show = show_cpuinfo,
};
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static void dump_x86_features(struct seq_file *m, unsigned long features)
+{
+ if (features & ARCH_SHSTK_SHSTK)
+ seq_puts(m, "shstk ");
+ if (features & ARCH_SHSTK_WRSS)
+ seq_puts(m, "wrss ");
+}
+
+void arch_proc_pid_thread_features(struct seq_file *m, struct task_struct *task)
+{
+ seq_puts(m, "x86_Thread_features:\t");
+ dump_x86_features(m, task->thread.features);
+ seq_putc(m, '\n');
+
+ seq_puts(m, "x86_Thread_features_locked:\t");
+ dump_x86_features(m, task->thread.features_locked);
+ seq_putc(m, '\n');
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
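
For a thread that has both features enabled and locked, the output produced by the two helpers above would read (illustrative; fields are tab-separated):

	x86_Thread_features:	shstk wrss
	x86_Thread_features_locked:	shstk wrss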
diff --git a/arch/x86/kernel/cpu/rdrand.c b/arch/x86/kernel/cpu/rdrand.c
new file mode 100644
index 000000000000..eeac00d20926
--- /dev/null
+++ b/arch/x86/kernel/cpu/rdrand.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This file is part of the Linux kernel.
+ *
+ * Copyright (c) 2011, Intel Corporation
+ * Authors: Fenghua Yu <fenghua.yu@intel.com>,
+ * H. Peter Anvin <hpa@linux.intel.com>
+ */
+#include <linux/printk.h>
+
+#include <asm/processor.h>
+#include <asm/archrandom.h>
+#include <asm/sections.h>
+
+/*
+ * RDRAND has Built-In-Self-Test (BIST) that runs on every invocation.
+ * Run the instruction a few times as a sanity check. Also make sure
+ * it's not outputting the same value over and over, which has happened
+ * as a result of past CPU bugs.
+ *
+ * If it fails, it is simple to disable RDRAND and RDSEED here.
+ */
+
+void x86_init_rdrand(struct cpuinfo_x86 *c)
+{
+ enum { SAMPLES = 8, MIN_CHANGE = 5 };
+ unsigned long sample, prev;
+ bool failure = false;
+ size_t i, changed;
+
+ if (!cpu_has(c, X86_FEATURE_RDRAND))
+ return;
+
+ for (changed = 0, i = 0; i < SAMPLES; ++i) {
+ if (!rdrand_long(&sample)) {
+ failure = true;
+ break;
+ }
+ changed += i && sample != prev;
+ prev = sample;
+ }
+ if (changed < MIN_CHANGE)
+ failure = true;
+
+ if (failure) {
+ clear_cpu_cap(c, X86_FEATURE_RDRAND);
+ clear_cpu_cap(c, X86_FEATURE_RDSEED);
+ pr_emerg("RDRAND is not reliable on this platform; disabling.\n");
+ }
+}
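
A user-space analogue of the sanity check above: draw SAMPLES values and count how many consecutive pairs differ. A healthy source changes on nearly every draw, while a stuck source scores 0 of 7 and trips the MIN_CHANGE threshold. POSIX random() here merely stands in for the rdrand_long() probe:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

enum { SAMPLES = 8, MIN_CHANGE = 5 };

static bool source_looks_sane(long (*draw)(void))
{
	long sample, prev = 0;
	size_t i, changed = 0;

	for (i = 0; i < SAMPLES; ++i) {
		sample = draw();
		changed += i && sample != prev;	/* first draw has no pair */
		prev = sample;
	}
	return changed >= MIN_CHANGE;
}

static long stuck(void) { return 42; }	/* models a broken RNG */

int main(void)
{
	printf("random: %d\n", source_looks_sane(random));	/* almost surely 1 */
	printf("stuck:  %d\n", source_looks_sane(stuck));	/* 0 */
	return 0;
}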
diff --git a/arch/x86/kernel/cpu/resctrl/Makefile b/arch/x86/kernel/cpu/resctrl/Makefile
new file mode 100644
index 000000000000..d8a04b195da2
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_X86_CPU_RESCTRL) += core.o rdtgroup.o monitor.o
+obj-$(CONFIG_X86_CPU_RESCTRL) += ctrlmondata.o
+obj-$(CONFIG_RESCTRL_FS_PSEUDO_LOCK) += pseudo_lock.o
+
+# To allow define_trace.h's recursive include:
+CFLAGS_pseudo_lock.o = -I$(src)
diff --git a/arch/x86/kernel/cpu/resctrl/core.c b/arch/x86/kernel/cpu/resctrl/core.c
new file mode 100644
index 000000000000..3792ab4819dc
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/core.c
@@ -0,0 +1,1079 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Resource Director Technology(RDT)
+ * - Cache Allocation code.
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Authors:
+ * Fenghua Yu <fenghua.yu@intel.com>
+ * Tony Luck <tony.luck@intel.com>
+ * Vikas Shivappa <vikas.shivappa@intel.com>
+ *
+ * More information about RDT can be found in the Intel(R) x86 Architecture
+ * Software Developer Manual June 2016, volume 3, section 17.17.
+ */
+
+#define pr_fmt(fmt) "resctrl: " fmt
+
+#include <linux/cpu.h>
+#include <linux/slab.h>
+#include <linux/err.h>
+#include <linux/cpuhotplug.h>
+
+#include <asm/cpu_device_id.h>
+#include <asm/msr.h>
+#include <asm/resctrl.h>
+#include "internal.h"
+
+/*
+ * rdt_domain structures are kfree()d when their last CPU goes offline,
+ * and allocated when the first CPU in a new domain comes online.
+ * The rdt_resource's domain list is updated when this happens. Readers of
+ * the domain list must either take cpus_read_lock(), or rely on an RCU
+ * read-side critical section, to avoid observing concurrent modification.
+ * All writers take this mutex:
+ */
+static DEFINE_MUTEX(domain_list_lock);
+
+/*
+ * The cached resctrl_pqr_state is strictly per CPU and can never be
+ * updated from a remote CPU. Functions which modify the state
+ * are called with interrupts disabled and no preemption, which
+ * is sufficient for the protection.
+ */
+DEFINE_PER_CPU(struct resctrl_pqr_state, pqr_state);
+
+/*
+ * Global boolean for rdt_alloc which is true if any
+ * resource allocation is enabled.
+ */
+bool rdt_alloc_capable;
+
+static void mba_wrmsr_intel(struct msr_param *m);
+static void cat_wrmsr(struct msr_param *m);
+static void mba_wrmsr_amd(struct msr_param *m);
+
+#define ctrl_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.ctrl_domains)
+#define mon_domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].r_resctrl.mon_domains)
+
+struct rdt_hw_resource rdt_resources_all[RDT_NUM_RESOURCES] = {
+ [RDT_RESOURCE_L3] =
+ {
+ .r_resctrl = {
+ .name = "L3",
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .mon_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L3),
+ .mon_domains = mon_domain_init(RDT_RESOURCE_L3),
+ .schema_fmt = RESCTRL_SCHEMA_BITMAP,
+ },
+ .msr_base = MSR_IA32_L3_CBM_BASE,
+ .msr_update = cat_wrmsr,
+ },
+ [RDT_RESOURCE_L2] =
+ {
+ .r_resctrl = {
+ .name = "L2",
+ .ctrl_scope = RESCTRL_L2_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_L2),
+ .schema_fmt = RESCTRL_SCHEMA_BITMAP,
+ },
+ .msr_base = MSR_IA32_L2_CBM_BASE,
+ .msr_update = cat_wrmsr,
+ },
+ [RDT_RESOURCE_MBA] =
+ {
+ .r_resctrl = {
+ .name = "MB",
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_MBA),
+ .schema_fmt = RESCTRL_SCHEMA_RANGE,
+ },
+ },
+ [RDT_RESOURCE_SMBA] =
+ {
+ .r_resctrl = {
+ .name = "SMBA",
+ .ctrl_scope = RESCTRL_L3_CACHE,
+ .ctrl_domains = ctrl_domain_init(RDT_RESOURCE_SMBA),
+ .schema_fmt = RESCTRL_SCHEMA_RANGE,
+ },
+ },
+};
+
+u32 resctrl_arch_system_num_rmid_idx(void)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+ /* RMID are independent numbers for x86. num_rmid_idx == num_rmid */
+ return r->mon.num_rmid;
+}
+
+struct rdt_resource *resctrl_arch_get_resource(enum resctrl_res_level l)
+{
+ if (l >= RDT_NUM_RESOURCES)
+ return NULL;
+
+ return &rdt_resources_all[l].r_resctrl;
+}
+
+/*
+ * cache_alloc_hsw_probe() - Have to probe for Intel Haswell server CPUs
+ * as they do not have CPUID enumeration support for Cache allocation.
+ * The check for Vendor/Family/Model is not enough to guarantee that
+ * the MSRs won't #GP fault because only the following SKUs support
+ * CAT:
+ * Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz
+ * Intel(R) Xeon(R) CPU E5-2648L v3 @ 1.80GHz
+ * Intel(R) Xeon(R) CPU E5-2628L v3 @ 2.00GHz
+ * Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz
+ * Intel(R) Xeon(R) CPU E5-2608L v3 @ 2.00GHz
+ * Intel(R) Xeon(R) CPU E5-2658A v3 @ 2.20GHz
+ *
+ * Probe by trying to write the first of the L3 cache mask registers
+ * and checking that the bits stick. Max CLOSids is always 4 and max cbm length
+ * is always 20 on hsw server parts. The minimum cache bitmask length
+ * allowed for HSW server is always 2 bits. Hardcode all of them.
+ */
+static inline void cache_alloc_hsw_probe(void)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_L3];
+ struct rdt_resource *r = &hw_res->r_resctrl;
+ u64 max_cbm = BIT_ULL_MASK(20) - 1, l3_cbm_0;
+
+ if (wrmsrq_safe(MSR_IA32_L3_CBM_BASE, max_cbm))
+ return;
+
+ rdmsrq(MSR_IA32_L3_CBM_BASE, l3_cbm_0);
+
+ /* If all the bits were set in MSR, return success */
+ if (l3_cbm_0 != max_cbm)
+ return;
+
+ hw_res->num_closid = 4;
+ r->cache.cbm_len = 20;
+ r->cache.shareable_bits = 0xc0000;
+ r->cache.min_cbm_bits = 2;
+ r->cache.arch_has_sparse_bitmasks = false;
+ r->alloc_capable = true;
+
+ rdt_alloc_capable = true;
+}
+
+/*
+ * rdt_get_mb_table() - get a mapping between the bandwidth (b/w)
+ * percentage values exposed to the user interface and the
+ * h/w-understandable delay values.
+ *
+ * The non-linear delay values have power-of-two granularity, and the
+ * h/w does not guarantee a curve of configured delay values vs. the
+ * actual b/w enforced.
+ * Hence we need a pre-calibrated mapping so the user can express
+ * the memory b/w as a percentage value.
+ */
+static inline bool rdt_get_mb_table(struct rdt_resource *r)
+{
+ /*
+ * As of now there are no Intel SKUs that support non-linear delay.
+ */
+ pr_info("MBA b/w map not implemented for cpu:%d, model:%d",
+ boot_cpu_data.x86, boot_cpu_data.x86_model);
+
+ return false;
+}
+
+static __init bool __get_mem_config_intel(struct rdt_resource *r)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ union cpuid_0x10_3_eax eax;
+ union cpuid_0x10_x_edx edx;
+ u32 ebx, ecx, max_delay;
+
+ cpuid_count(0x00000010, 3, &eax.full, &ebx, &ecx, &edx.full);
+ hw_res->num_closid = edx.split.cos_max + 1;
+ max_delay = eax.split.max_delay + 1;
+ r->membw.max_bw = MAX_MBA_BW;
+ r->membw.arch_needs_linear = true;
+ if (ecx & MBA_IS_LINEAR) {
+ r->membw.delay_linear = true;
+ r->membw.min_bw = MAX_MBA_BW - max_delay;
+ r->membw.bw_gran = MAX_MBA_BW - max_delay;
+ } else {
+ if (!rdt_get_mb_table(r))
+ return false;
+ r->membw.arch_needs_linear = false;
+ }
+
+ if (boot_cpu_has(X86_FEATURE_PER_THREAD_MBA))
+ r->membw.throttle_mode = THREAD_THROTTLE_PER_THREAD;
+ else
+ r->membw.throttle_mode = THREAD_THROTTLE_MAX;
+
+ r->alloc_capable = true;
+
+ return true;
+}
+
+static __init bool __rdt_get_mem_config_amd(struct rdt_resource *r)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ u32 eax, ebx, ecx, edx, subleaf;
+
+ /*
+ * Query CPUID_Fn80000020_EDX_x01 for MBA and
+ * CPUID_Fn80000020_EDX_x02 for SMBA
+ */
+ subleaf = (r->rid == RDT_RESOURCE_SMBA) ? 2 : 1;
+
+ cpuid_count(0x80000020, subleaf, &eax, &ebx, &ecx, &edx);
+ hw_res->num_closid = edx + 1;
+ r->membw.max_bw = 1 << eax;
+
+ /* AMD does not use delay */
+ r->membw.delay_linear = false;
+ r->membw.arch_needs_linear = false;
+
+ /*
+ * AMD does not use memory delay throttle model to control
+ * the allocation like Intel does.
+ */
+ r->membw.throttle_mode = THREAD_THROTTLE_UNDEFINED;
+ r->membw.min_bw = 0;
+ r->membw.bw_gran = 1;
+
+ r->alloc_capable = true;
+
+ return true;
+}
+
+static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ union cpuid_0x10_1_eax eax;
+ union cpuid_0x10_x_ecx ecx;
+ union cpuid_0x10_x_edx edx;
+ u32 ebx, default_ctrl;
+
+ cpuid_count(0x00000010, idx, &eax.full, &ebx, &ecx.full, &edx.full);
+ hw_res->num_closid = edx.split.cos_max + 1;
+ r->cache.cbm_len = eax.split.cbm_len + 1;
+ default_ctrl = BIT_MASK(eax.split.cbm_len + 1) - 1;
+ r->cache.shareable_bits = ebx & default_ctrl;
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ r->cache.arch_has_sparse_bitmasks = ecx.split.noncont;
+ r->alloc_capable = true;
+}
+
+static void rdt_get_cdp_config(int level)
+{
+ /*
+ * By default, CDP is disabled. CDP can be enabled by mount parameter
+ * "cdp" during resctrl file system mount time.
+ */
+ rdt_resources_all[level].cdp_enabled = false;
+ rdt_resources_all[level].r_resctrl.cdp_capable = true;
+}
+
+static void rdt_set_io_alloc_capable(struct rdt_resource *r)
+{
+ r->cache.io_alloc_capable = true;
+}
+
+static void rdt_get_cdp_l3_config(void)
+{
+ rdt_get_cdp_config(RDT_RESOURCE_L3);
+}
+
+static void rdt_get_cdp_l2_config(void)
+{
+ rdt_get_cdp_config(RDT_RESOURCE_L2);
+}
+
+static void mba_wrmsr_amd(struct msr_param *m)
+{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
+ unsigned int i;
+
+ for (i = m->low; i < m->high; i++)
+ wrmsrq(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
+}
+
+/*
+ * Map the memory b/w percentage value to delay values
+ * that can be written to QOS_MSRs.
+ * There are currently no SKUs which support non-linear delay values.
+ */
+static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
+{
+ if (r->membw.delay_linear)
+ return MAX_MBA_BW - bw;
+
+ pr_warn_once("Non Linear delay-bw map not supported but queried\n");
+ return MAX_MBA_BW;
+}
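
On a linear SKU the mapping above is just a complement: assuming MAX_MBA_BW stands for 100 percent, a schemata request of 70 is written to the QOS MSR as a delay of 30. A trivial model:

#include <stdio.h>

#define MAX_MBA_BW 100	/* assumed: bandwidth expressed in percent */

static unsigned int delay_bw_map_linear(unsigned long bw)
{
	return MAX_MBA_BW - bw;	/* higher requested b/w => lower delay */
}

int main(void)
{
	printf("bw=70%%  -> delay=%u\n", delay_bw_map_linear(70));	/* 30 */
	printf("bw=100%% -> delay=%u\n", delay_bw_map_linear(100));	/* 0 */
	return 0;
}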
+
+static void mba_wrmsr_intel(struct msr_param *m)
+{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
+ unsigned int i;
+
+ /* Write the delay values for mba. */
+ for (i = m->low; i < m->high; i++)
+ wrmsrq(hw_res->msr_base + i, delay_bw_map(hw_dom->ctrl_val[i], m->res));
+}
+
+static void cat_wrmsr(struct msr_param *m)
+{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(m->dom);
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(m->res);
+ unsigned int i;
+
+ for (i = m->low; i < m->high; i++)
+ wrmsrq(hw_res->msr_base + i, hw_dom->ctrl_val[i]);
+}
+
+u32 resctrl_arch_get_num_closid(struct rdt_resource *r)
+{
+ return resctrl_to_arch_res(r)->num_closid;
+}
+
+void rdt_ctrl_update(void *arg)
+{
+ struct rdt_hw_resource *hw_res;
+ struct msr_param *m = arg;
+
+ hw_res = resctrl_to_arch_res(m->res);
+ hw_res->msr_update(m);
+}
+
+static void setup_default_ctrlval(struct rdt_resource *r, u32 *dc)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ int i;
+
+ /*
+ * Initialize the Control MSRs so that they impose no control.
+ * For Cache Allocation: Set all bits in cbm
+ * For Memory Allocation: Set b/w requested to 100%
+ */
+ for (i = 0; i < hw_res->num_closid; i++, dc++)
+ *dc = resctrl_get_default_ctrl(r);
+}
+
+static void ctrl_domain_free(struct rdt_hw_ctrl_domain *hw_dom)
+{
+ kfree(hw_dom->ctrl_val);
+ kfree(hw_dom);
+}
+
+static void mon_domain_free(struct rdt_hw_mon_domain *hw_dom)
+{
+ int idx;
+
+ for_each_mbm_idx(idx)
+ kfree(hw_dom->arch_mbm_states[idx]);
+ kfree(hw_dom);
+}
+
+static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_ctrl_domain *d)
+{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ struct msr_param m;
+ u32 *dc;
+
+ dc = kmalloc_array(hw_res->num_closid, sizeof(*hw_dom->ctrl_val),
+ GFP_KERNEL);
+ if (!dc)
+ return -ENOMEM;
+
+ hw_dom->ctrl_val = dc;
+ setup_default_ctrlval(r, dc);
+
+ m.res = r;
+ m.dom = d;
+ m.low = 0;
+ m.high = hw_res->num_closid;
+ hw_res->msr_update(&m);
+ return 0;
+}
+
+/**
+ * arch_domain_mbm_alloc() - Allocate arch private storage for the MBM counters
+ * @num_rmid: The size of the MBM counter array
+ * @hw_dom: The domain that owns the allocated arrays
+ */
+static int arch_domain_mbm_alloc(u32 num_rmid, struct rdt_hw_mon_domain *hw_dom)
+{
+ size_t tsize = sizeof(*hw_dom->arch_mbm_states[0]);
+ enum resctrl_event_id eventid;
+ int idx;
+
+ for_each_mbm_event_id(eventid) {
+ if (!resctrl_is_mon_event_enabled(eventid))
+ continue;
+ idx = MBM_STATE_IDX(eventid);
+ hw_dom->arch_mbm_states[idx] = kcalloc(num_rmid, tsize, GFP_KERNEL);
+ if (!hw_dom->arch_mbm_states[idx])
+ goto cleanup;
+ }
+
+ return 0;
+cleanup:
+ for_each_mbm_idx(idx) {
+ kfree(hw_dom->arch_mbm_states[idx]);
+ hw_dom->arch_mbm_states[idx] = NULL;
+ }
+
+ return -ENOMEM;
+}
+
+static int get_domain_id_from_scope(int cpu, enum resctrl_scope scope)
+{
+ switch (scope) {
+ case RESCTRL_L2_CACHE:
+ case RESCTRL_L3_CACHE:
+ return get_cpu_cacheinfo_id(cpu, scope);
+ case RESCTRL_L3_NODE:
+ return cpu_to_node(cpu);
+ default:
+ break;
+ }
+
+ return -EINVAL;
+}
+
+static void domain_add_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+ struct rdt_hw_ctrl_domain *hw_dom;
+ struct list_head *add_pos = NULL;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_ctrl_domain *d;
+ int err;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
+ return;
+ }
+
+ hdr = resctrl_find_domain(&r->ctrl_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ if (r->cache.arch_has_per_cpu_cfg)
+ rdt_domain_reconfigure_cdp(r);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->hdr.id = id;
+ d->hdr.type = RESCTRL_CTRL_DOMAIN;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+ rdt_domain_reconfigure_cdp(r);
+
+ if (domain_setup_ctrlval(r, d)) {
+ ctrl_domain_free(hw_dom);
+ return;
+ }
+
+ list_add_tail_rcu(&d->hdr.list, add_pos);
+
+ err = resctrl_online_ctrl_domain(r, d);
+ if (err) {
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ ctrl_domain_free(hw_dom);
+ }
+}
+
+static void domain_add_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct list_head *add_pos = NULL;
+ struct rdt_hw_mon_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_mon_domain *d;
+ struct cacheinfo *ci;
+ int err;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = resctrl_find_domain(&r->mon_domains, id, &add_pos);
+ if (hdr) {
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+ /* Update the mbm_assign_mode state for the CPU if supported */
+ if (r->mon.mbm_cntr_assignable)
+ resctrl_arch_mbm_cntr_assign_set_one(r);
+ return;
+ }
+
+ hw_dom = kzalloc_node(sizeof(*hw_dom), GFP_KERNEL, cpu_to_node(cpu));
+ if (!hw_dom)
+ return;
+
+ d = &hw_dom->d_resctrl;
+ d->hdr.id = id;
+ d->hdr.type = RESCTRL_MON_DOMAIN;
+ ci = get_cpu_cacheinfo_level(cpu, RESCTRL_L3_CACHE);
+ if (!ci) {
+ pr_warn_once("Can't find L3 cache for CPU:%d resource %s\n", cpu, r->name);
+ mon_domain_free(hw_dom);
+ return;
+ }
+ d->ci_id = ci->id;
+ cpumask_set_cpu(cpu, &d->hdr.cpu_mask);
+
+ /* Update the mbm_assign_mode state for the CPU if supported */
+ if (r->mon.mbm_cntr_assignable)
+ resctrl_arch_mbm_cntr_assign_set_one(r);
+
+ arch_mon_domain_online(r, d);
+
+ if (arch_domain_mbm_alloc(r->mon.num_rmid, hw_dom)) {
+ mon_domain_free(hw_dom);
+ return;
+ }
+
+ list_add_tail_rcu(&d->hdr.list, add_pos);
+
+ err = resctrl_online_mon_domain(r, d);
+ if (err) {
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ mon_domain_free(hw_dom);
+ }
+}
+
+static void domain_add_cpu(int cpu, struct rdt_resource *r)
+{
+ if (r->alloc_capable)
+ domain_add_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_add_cpu_mon(cpu, r);
+}
+
+static void domain_remove_cpu_ctrl(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->ctrl_scope);
+ struct rdt_hw_ctrl_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_ctrl_domain *d;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find control domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->ctrl_scope, r->name);
+ return;
+ }
+
+ hdr = resctrl_find_domain(&r->ctrl_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find control domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
+ return;
+ }
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_CTRL_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_ctrl_domain, hdr);
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
+
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
+ resctrl_offline_ctrl_domain(r, d);
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+
+ /*
+ * rdt_ctrl_domain "d" is going to be freed below, so clear
+ * its pointer from pseudo_lock_region struct.
+ */
+ if (d->plr)
+ d->plr->d = NULL;
+ ctrl_domain_free(hw_dom);
+
+ return;
+ }
+}
+
+static void domain_remove_cpu_mon(int cpu, struct rdt_resource *r)
+{
+ int id = get_domain_id_from_scope(cpu, r->mon_scope);
+ struct rdt_hw_mon_domain *hw_dom;
+ struct rdt_domain_hdr *hdr;
+ struct rdt_mon_domain *d;
+
+ lockdep_assert_held(&domain_list_lock);
+
+ if (id < 0) {
+ pr_warn_once("Can't find monitor domain id for CPU:%d scope:%d for resource %s\n",
+ cpu, r->mon_scope, r->name);
+ return;
+ }
+
+ hdr = resctrl_find_domain(&r->mon_domains, id, NULL);
+ if (!hdr) {
+ pr_warn("Can't find monitor domain for id=%d for CPU %d for resource %s\n",
+ id, cpu, r->name);
+ return;
+ }
+
+ if (WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN))
+ return;
+
+ d = container_of(hdr, struct rdt_mon_domain, hdr);
+ hw_dom = resctrl_to_arch_mon_dom(d);
+
+ cpumask_clear_cpu(cpu, &d->hdr.cpu_mask);
+ if (cpumask_empty(&d->hdr.cpu_mask)) {
+ resctrl_offline_mon_domain(r, d);
+ list_del_rcu(&d->hdr.list);
+ synchronize_rcu();
+ mon_domain_free(hw_dom);
+
+ return;
+ }
+}
+
+static void domain_remove_cpu(int cpu, struct rdt_resource *r)
+{
+ if (r->alloc_capable)
+ domain_remove_cpu_ctrl(cpu, r);
+ if (r->mon_capable)
+ domain_remove_cpu_mon(cpu, r);
+}
+
+static void clear_closid_rmid(int cpu)
+{
+ struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+ state->default_closid = RESCTRL_RESERVED_CLOSID;
+ state->default_rmid = RESCTRL_RESERVED_RMID;
+ state->cur_closid = RESCTRL_RESERVED_CLOSID;
+ state->cur_rmid = RESCTRL_RESERVED_RMID;
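+	/*
+	 * IA32_PQR_ASSOC carries the active RMID in its low half and the
+	 * active CLOSID in its high half, matching wrmsr()'s (low, high)
+	 * argument order below.
+	 */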
+ wrmsr(MSR_IA32_PQR_ASSOC, RESCTRL_RESERVED_RMID,
+ RESCTRL_RESERVED_CLOSID);
+}
+
+static int resctrl_arch_online_cpu(unsigned int cpu)
+{
+ struct rdt_resource *r;
+
+ mutex_lock(&domain_list_lock);
+ for_each_capable_rdt_resource(r)
+ domain_add_cpu(cpu, r);
+ mutex_unlock(&domain_list_lock);
+
+ clear_closid_rmid(cpu);
+ resctrl_online_cpu(cpu);
+
+ return 0;
+}
+
+static int resctrl_arch_offline_cpu(unsigned int cpu)
+{
+ struct rdt_resource *r;
+
+ resctrl_offline_cpu(cpu);
+
+ mutex_lock(&domain_list_lock);
+ for_each_capable_rdt_resource(r)
+ domain_remove_cpu(cpu, r);
+ mutex_unlock(&domain_list_lock);
+
+ clear_closid_rmid(cpu);
+
+ return 0;
+}
+
+enum {
+ RDT_FLAG_CMT,
+ RDT_FLAG_MBM_TOTAL,
+ RDT_FLAG_MBM_LOCAL,
+ RDT_FLAG_L3_CAT,
+ RDT_FLAG_L3_CDP,
+ RDT_FLAG_L2_CAT,
+ RDT_FLAG_L2_CDP,
+ RDT_FLAG_MBA,
+ RDT_FLAG_SMBA,
+ RDT_FLAG_BMEC,
+ RDT_FLAG_ABMC,
+ RDT_FLAG_SDCIAE,
+};
+
+#define RDT_OPT(idx, n, f) \
+[idx] = { \
+ .name = n, \
+ .flag = f \
+}
+
+struct rdt_options {
+ char *name;
+ int flag;
+ bool force_off, force_on;
+};
+
+static struct rdt_options rdt_options[] __ro_after_init = {
+ RDT_OPT(RDT_FLAG_CMT, "cmt", X86_FEATURE_CQM_OCCUP_LLC),
+ RDT_OPT(RDT_FLAG_MBM_TOTAL, "mbmtotal", X86_FEATURE_CQM_MBM_TOTAL),
+ RDT_OPT(RDT_FLAG_MBM_LOCAL, "mbmlocal", X86_FEATURE_CQM_MBM_LOCAL),
+ RDT_OPT(RDT_FLAG_L3_CAT, "l3cat", X86_FEATURE_CAT_L3),
+ RDT_OPT(RDT_FLAG_L3_CDP, "l3cdp", X86_FEATURE_CDP_L3),
+ RDT_OPT(RDT_FLAG_L2_CAT, "l2cat", X86_FEATURE_CAT_L2),
+ RDT_OPT(RDT_FLAG_L2_CDP, "l2cdp", X86_FEATURE_CDP_L2),
+ RDT_OPT(RDT_FLAG_MBA, "mba", X86_FEATURE_MBA),
+ RDT_OPT(RDT_FLAG_SMBA, "smba", X86_FEATURE_SMBA),
+ RDT_OPT(RDT_FLAG_BMEC, "bmec", X86_FEATURE_BMEC),
+ RDT_OPT(RDT_FLAG_ABMC, "abmc", X86_FEATURE_ABMC),
+ RDT_OPT(RDT_FLAG_SDCIAE, "sdciae", X86_FEATURE_SDCIAE),
+};
+#define NUM_RDT_OPTIONS ARRAY_SIZE(rdt_options)
+
+static int __init set_rdt_options(char *str)
+{
+ struct rdt_options *o;
+ bool force_off;
+ char *tok;
+
+ if (*str == '=')
+ str++;
+ while ((tok = strsep(&str, ",")) != NULL) {
+ force_off = *tok == '!';
+ if (force_off)
+ tok++;
+ for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
+ if (strcmp(tok, o->name) == 0) {
+ if (force_off)
+ o->force_off = true;
+ else
+ o->force_on = true;
+ break;
+ }
+ }
+ }
+ return 1;
+}
+__setup("rdt", set_rdt_options);
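+
+/*
+ * Example: booting with "rdt=cmt,!l3cat" forces CMT on (overriding any
+ * quirk that would turn it off) and forces L3 CAT off. A feature that the
+ * CPU does not enumerate cannot be forced on - see rdt_cpu_has().
+ */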
+
+bool rdt_cpu_has(int flag)
+{
+ bool ret = boot_cpu_has(flag);
+ struct rdt_options *o;
+
+ if (!ret)
+ return ret;
+
+ for (o = rdt_options; o < &rdt_options[NUM_RDT_OPTIONS]; o++) {
+ if (flag == o->flag) {
+ if (o->force_off)
+ ret = false;
+ if (o->force_on)
+ ret = true;
+ break;
+ }
+ }
+ return ret;
+}
+
+bool resctrl_arch_is_evt_configurable(enum resctrl_event_id evt)
+{
+ if (!rdt_cpu_has(X86_FEATURE_BMEC))
+ return false;
+
+ switch (evt) {
+ case QOS_L3_MBM_TOTAL_EVENT_ID:
+ return rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL);
+ case QOS_L3_MBM_LOCAL_EVENT_ID:
+ return rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL);
+ default:
+ return false;
+ }
+}
+
+static __init bool get_mem_config(void)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_MBA];
+
+ if (!rdt_cpu_has(X86_FEATURE_MBA))
+ return false;
+
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ return __get_mem_config_intel(&hw_res->r_resctrl);
+ else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+ return __rdt_get_mem_config_amd(&hw_res->r_resctrl);
+
+ return false;
+}
+
+static __init bool get_slow_mem_config(void)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_SMBA];
+
+ if (!rdt_cpu_has(X86_FEATURE_SMBA))
+ return false;
+
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+ return __rdt_get_mem_config_amd(&hw_res->r_resctrl);
+
+ return false;
+}
+
+static __init bool get_rdt_alloc_resources(void)
+{
+ struct rdt_resource *r;
+ bool ret = false;
+
+ if (rdt_alloc_capable)
+ return true;
+
+ if (!boot_cpu_has(X86_FEATURE_RDT_A))
+ return false;
+
+ if (rdt_cpu_has(X86_FEATURE_CAT_L3)) {
+ r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ rdt_get_cache_alloc_cfg(1, r);
+ if (rdt_cpu_has(X86_FEATURE_CDP_L3))
+ rdt_get_cdp_l3_config();
+ if (rdt_cpu_has(X86_FEATURE_SDCIAE))
+ rdt_set_io_alloc_capable(r);
+ ret = true;
+ }
+ if (rdt_cpu_has(X86_FEATURE_CAT_L2)) {
+		/* CPUID 0x10.2 fields have the same format as 0x10.1 */
+ r = &rdt_resources_all[RDT_RESOURCE_L2].r_resctrl;
+ rdt_get_cache_alloc_cfg(2, r);
+ if (rdt_cpu_has(X86_FEATURE_CDP_L2))
+ rdt_get_cdp_l2_config();
+ ret = true;
+ }
+
+ if (get_mem_config())
+ ret = true;
+
+ if (get_slow_mem_config())
+ ret = true;
+
+ return ret;
+}
+
+static __init bool get_rdt_mon_resources(void)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+ bool ret = false;
+
+ if (rdt_cpu_has(X86_FEATURE_CQM_OCCUP_LLC)) {
+ resctrl_enable_mon_event(QOS_L3_OCCUP_EVENT_ID);
+ ret = true;
+ }
+ if (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL)) {
+ resctrl_enable_mon_event(QOS_L3_MBM_TOTAL_EVENT_ID);
+ ret = true;
+ }
+ if (rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL)) {
+ resctrl_enable_mon_event(QOS_L3_MBM_LOCAL_EVENT_ID);
+ ret = true;
+ }
+ if (rdt_cpu_has(X86_FEATURE_ABMC))
+ ret = true;
+
+ if (!ret)
+ return false;
+
+ return !rdt_get_mon_l3_config(r);
+}
+
+static __init void __check_quirks_intel(void)
+{
+ switch (boot_cpu_data.x86_vfm) {
+ case INTEL_HASWELL_X:
+ if (!rdt_options[RDT_FLAG_L3_CAT].force_off)
+ cache_alloc_hsw_probe();
+ break;
+ case INTEL_SKYLAKE_X:
+ if (boot_cpu_data.x86_stepping <= 4)
+ set_rdt_options("!cmt,!mbmtotal,!mbmlocal,!l3cat");
+ else
+ set_rdt_options("!l3cat");
+ fallthrough;
+ case INTEL_BROADWELL_X:
+ intel_rdt_mbm_apply_quirk();
+ break;
+ }
+}
+
+static __init void check_quirks(void)
+{
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ __check_quirks_intel();
+}
+
+static __init bool get_rdt_resources(void)
+{
+ rdt_alloc_capable = get_rdt_alloc_resources();
+ rdt_mon_capable = get_rdt_mon_resources();
+
+ return (rdt_mon_capable || rdt_alloc_capable);
+}
+
+static __init void rdt_init_res_defs_intel(void)
+{
+ struct rdt_hw_resource *hw_res;
+ struct rdt_resource *r;
+
+ for_each_rdt_resource(r) {
+ hw_res = resctrl_to_arch_res(r);
+
+ if (r->rid == RDT_RESOURCE_L3 ||
+ r->rid == RDT_RESOURCE_L2) {
+ r->cache.arch_has_per_cpu_cfg = false;
+ r->cache.min_cbm_bits = 1;
+ } else if (r->rid == RDT_RESOURCE_MBA) {
+ hw_res->msr_base = MSR_IA32_MBA_THRTL_BASE;
+ hw_res->msr_update = mba_wrmsr_intel;
+ }
+ }
+}
+
+static __init void rdt_init_res_defs_amd(void)
+{
+ struct rdt_hw_resource *hw_res;
+ struct rdt_resource *r;
+
+ for_each_rdt_resource(r) {
+ hw_res = resctrl_to_arch_res(r);
+
+ if (r->rid == RDT_RESOURCE_L3 ||
+ r->rid == RDT_RESOURCE_L2) {
+ r->cache.arch_has_sparse_bitmasks = true;
+ r->cache.arch_has_per_cpu_cfg = true;
+ r->cache.min_cbm_bits = 0;
+ } else if (r->rid == RDT_RESOURCE_MBA) {
+ hw_res->msr_base = MSR_IA32_MBA_BW_BASE;
+ hw_res->msr_update = mba_wrmsr_amd;
+ } else if (r->rid == RDT_RESOURCE_SMBA) {
+ hw_res->msr_base = MSR_IA32_SMBA_BW_BASE;
+ hw_res->msr_update = mba_wrmsr_amd;
+ }
+ }
+}
+
+static __init void rdt_init_res_defs(void)
+{
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+ rdt_init_res_defs_intel();
+ else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD)
+ rdt_init_res_defs_amd();
+}
+
+static enum cpuhp_state rdt_online;
+
+/* Runs once on the BSP during boot. */
+void resctrl_cpu_detect(struct cpuinfo_x86 *c)
+{
+ if (!cpu_has(c, X86_FEATURE_CQM_LLC) && !cpu_has(c, X86_FEATURE_ABMC)) {
+ c->x86_cache_max_rmid = -1;
+ c->x86_cache_occ_scale = -1;
+ c->x86_cache_mbm_width_offset = -1;
+ return;
+ }
+
+ /* will be overridden if occupancy monitoring exists */
+ c->x86_cache_max_rmid = cpuid_ebx(0xf);
+
+ if (cpu_has(c, X86_FEATURE_CQM_OCCUP_LLC) ||
+ cpu_has(c, X86_FEATURE_CQM_MBM_TOTAL) ||
+ cpu_has(c, X86_FEATURE_CQM_MBM_LOCAL) ||
+ cpu_has(c, X86_FEATURE_ABMC)) {
+ u32 eax, ebx, ecx, edx;
+
+ /* QoS sub-leaf, EAX=0Fh, ECX=1 */
+ cpuid_count(0xf, 1, &eax, &ebx, &ecx, &edx);
+
+ c->x86_cache_max_rmid = ecx;
+ c->x86_cache_occ_scale = ebx;
+ c->x86_cache_mbm_width_offset = eax & 0xff;
+
+ if (c->x86_vendor == X86_VENDOR_AMD && !c->x86_cache_mbm_width_offset)
+ c->x86_cache_mbm_width_offset = MBM_CNTR_WIDTH_OFFSET_AMD;
+ }
+}
+
+static int __init resctrl_arch_late_init(void)
+{
+ struct rdt_resource *r;
+ int state, ret, i;
+
+	/* for_each_rdt_resource() requires every rid to be initialised. */
+ for (i = 0; i < RDT_NUM_RESOURCES; i++)
+ rdt_resources_all[i].r_resctrl.rid = i;
+
+ /*
+	 * Initialize functions (or definitions) that differ between
+	 * vendors here.
+ */
+ rdt_init_res_defs();
+
+ check_quirks();
+
+ if (!get_rdt_resources())
+ return -ENODEV;
+
+ state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+ "x86/resctrl/cat:online:",
+ resctrl_arch_online_cpu,
+ resctrl_arch_offline_cpu);
+ if (state < 0)
+ return state;
+
+ ret = resctrl_init();
+ if (ret) {
+ cpuhp_remove_state(state);
+ return ret;
+ }
+ rdt_online = state;
+
+ for_each_alloc_capable_rdt_resource(r)
+ pr_info("%s allocation detected\n", r->name);
+
+ for_each_mon_capable_rdt_resource(r)
+ pr_info("%s monitoring detected\n", r->name);
+
+ return 0;
+}
+
+late_initcall(resctrl_arch_late_init);
+
+static void __exit resctrl_arch_exit(void)
+{
+ cpuhp_remove_state(rdt_online);
+
+ resctrl_exit();
+}
+
+__exitcall(resctrl_arch_exit);
diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
new file mode 100644
index 000000000000..b20e705606b8
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Resource Director Technology (RDT)
+ * - Cache Allocation code.
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Authors:
+ * Fenghua Yu <fenghua.yu@intel.com>
+ * Tony Luck <tony.luck@intel.com>
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual June 2016, volume 3, section 17.17.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cpu.h>
+
+#include "internal.h"
+
+int resctrl_arch_update_one(struct rdt_resource *r, struct rdt_ctrl_domain *d,
+ u32 closid, enum resctrl_conf_type t, u32 cfg_val)
+{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ u32 idx = resctrl_get_config_index(closid, t);
+ struct msr_param msr_param;
+
+ if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
+ return -EINVAL;
+
+ hw_dom->ctrl_val[idx] = cfg_val;
+
+ msr_param.res = r;
+ msr_param.dom = d;
+ msr_param.low = idx;
+ msr_param.high = idx + 1;
+ hw_res->msr_update(&msr_param);
+
+ return 0;
+}
+
+int resctrl_arch_update_domains(struct rdt_resource *r, u32 closid)
+{
+ struct resctrl_staged_config *cfg;
+ struct rdt_hw_ctrl_domain *hw_dom;
+ struct msr_param msr_param;
+ struct rdt_ctrl_domain *d;
+ enum resctrl_conf_type t;
+ u32 idx;
+
+	/* Walking r->ctrl_domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
+ msr_param.res = NULL;
+ for (t = 0; t < CDP_NUM_TYPES; t++) {
+ cfg = &hw_dom->d_resctrl.staged_config[t];
+ if (!cfg->have_new_ctrl)
+ continue;
+
+ idx = resctrl_get_config_index(closid, t);
+ if (cfg->new_ctrl == hw_dom->ctrl_val[idx])
+ continue;
+ hw_dom->ctrl_val[idx] = cfg->new_ctrl;
+
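+			/*
+			 * Grow the [low, high) MSR index range to cover every
+			 * staged index so a single IPI per domain updates all
+			 * affected MSRs, e.g. staged indices 4 and 9 merge to
+			 * low = 4, high = 10.
+			 */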
+ if (!msr_param.res) {
+ msr_param.low = idx;
+ msr_param.high = msr_param.low + 1;
+ msr_param.res = r;
+ msr_param.dom = d;
+ } else {
+ msr_param.low = min(msr_param.low, idx);
+ msr_param.high = max(msr_param.high, idx + 1);
+ }
+ }
+ if (msr_param.res)
+ smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1);
+ }
+
+ return 0;
+}
+
+u32 resctrl_arch_get_config(struct rdt_resource *r, struct rdt_ctrl_domain *d,
+ u32 closid, enum resctrl_conf_type type)
+{
+ struct rdt_hw_ctrl_domain *hw_dom = resctrl_to_arch_ctrl_dom(d);
+ u32 idx = resctrl_get_config_index(closid, type);
+
+ return hw_dom->ctrl_val[idx];
+}
+
+bool resctrl_arch_get_io_alloc_enabled(struct rdt_resource *r)
+{
+ return resctrl_to_arch_res(r)->sdciae_enabled;
+}
+
+static void resctrl_sdciae_set_one_amd(void *arg)
+{
+ bool *enable = arg;
+
+ if (*enable)
+ msr_set_bit(MSR_IA32_L3_QOS_EXT_CFG, SDCIAE_ENABLE_BIT);
+ else
+ msr_clear_bit(MSR_IA32_L3_QOS_EXT_CFG, SDCIAE_ENABLE_BIT);
+}
+
+static void _resctrl_sdciae_enable(struct rdt_resource *r, bool enable)
+{
+ struct rdt_ctrl_domain *d;
+
+ /* Walking r->ctrl_domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
+	/* Update MSR_IA32_L3_QOS_EXT_CFG on all the CPUs in all domains */
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list)
+ on_each_cpu_mask(&d->hdr.cpu_mask, resctrl_sdciae_set_one_amd, &enable, 1);
+}
+
+int resctrl_arch_io_alloc_enable(struct rdt_resource *r, bool enable)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+
+ if (hw_res->r_resctrl.cache.io_alloc_capable &&
+ hw_res->sdciae_enabled != enable) {
+ _resctrl_sdciae_enable(r, enable);
+ hw_res->sdciae_enabled = enable;
+ }
+
+ return 0;
+}
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
new file mode 100644
index 000000000000..4a916c84a322
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -0,0 +1,225 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_RESCTRL_INTERNAL_H
+#define _ASM_X86_RESCTRL_INTERNAL_H
+
+#include <linux/resctrl.h>
+
+#define L3_QOS_CDP_ENABLE 0x01ULL
+
+#define L2_QOS_CDP_ENABLE 0x01ULL
+
+#define MBM_CNTR_WIDTH_BASE 24
+
+#define MBA_IS_LINEAR 0x4
+
+#define MBM_CNTR_WIDTH_OFFSET_AMD 20
+
+#define RMID_VAL_ERROR BIT_ULL(63)
+
+#define RMID_VAL_UNAVAIL BIT_ULL(62)
+
+/*
+ * With the above fields in use 62 bits remain in MSR_IA32_QM_CTR for
+ * data to be returned. The counter width is discovered from the hardware
+ * as an offset from MBM_CNTR_WIDTH_BASE.
+ */
+#define MBM_CNTR_WIDTH_OFFSET_MAX (62 - MBM_CNTR_WIDTH_BASE)
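+
+/* For example, an enumerated offset of 8 yields a 24 + 8 = 32 bit wide counter. */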
+
+/**
+ * struct arch_mbm_state - values used to compute resctrl_arch_rmid_read()'s
+ * return value.
+ * @chunks: Total data moved (multiply by rdt_group.mon_scale to get bytes)
+ * @prev_msr: Value of IA32_QM_CTR last time it was read for the RMID used to
+ * find this struct.
+ */
+struct arch_mbm_state {
+ u64 chunks;
+ u64 prev_msr;
+};
+
+/* Setting bit 0 in MSR_IA32_L3_QOS_EXT_CFG enables the ABMC feature. */
+#define ABMC_ENABLE_BIT 0
+
+/*
+ * QoS event identifiers.
+ */
+#define ABMC_EXTENDED_EVT_ID BIT(31)
+#define ABMC_EVT_ID BIT(0)
+
+/* Setting bit 1 in MSR_IA32_L3_QOS_EXT_CFG enables the SDCIAE feature. */
+#define SDCIAE_ENABLE_BIT 1
+
+/**
+ * struct rdt_hw_ctrl_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a control function
+ * @d_resctrl: Properties exposed to the resctrl file system
+ * @ctrl_val: array of cache or mem ctrl values (indexed by CLOSID)
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_ctrl_domain {
+ struct rdt_ctrl_domain d_resctrl;
+ u32 *ctrl_val;
+};
+
+/**
+ * struct rdt_hw_mon_domain - Arch private attributes of a set of CPUs that share
+ * a resource for a monitor function
+ * @d_resctrl: Properties exposed to the resctrl file system
+ * @arch_mbm_states: Per-event pointer to the MBM event's saved state.
+ * An MBM event's state is an array of struct arch_mbm_state
+ * indexed by RMID on x86.
+ *
+ * Members of this structure are accessed via helpers that provide abstraction.
+ */
+struct rdt_hw_mon_domain {
+ struct rdt_mon_domain d_resctrl;
+ struct arch_mbm_state *arch_mbm_states[QOS_NUM_L3_MBM_EVENTS];
+};
+
+static inline struct rdt_hw_ctrl_domain *resctrl_to_arch_ctrl_dom(struct rdt_ctrl_domain *r)
+{
+ return container_of(r, struct rdt_hw_ctrl_domain, d_resctrl);
+}
+
+static inline struct rdt_hw_mon_domain *resctrl_to_arch_mon_dom(struct rdt_mon_domain *r)
+{
+ return container_of(r, struct rdt_hw_mon_domain, d_resctrl);
+}
+
+/**
+ * struct msr_param - set a range of MSRs from a domain
+ * @res: The resource to use
+ * @dom: The domain to update
+ * @low: Beginning index from base MSR
+ * @high:	End index (exclusive)
+ */
+struct msr_param {
+ struct rdt_resource *res;
+ struct rdt_ctrl_domain *dom;
+ u32 low;
+ u32 high;
+};
+
+/**
+ * struct rdt_hw_resource - arch private attributes of a resctrl resource
+ * @r_resctrl: Attributes of the resource used directly by resctrl.
+ * @num_closid:	Maximum number of CLOSIDs this hardware can support,
+ * regardless of CDP. This is exposed via
+ * resctrl_arch_get_num_closid() to avoid confusion
+ * with struct resctrl_schema's property of the same name,
+ * which has been corrected for features like CDP.
+ * @msr_base: Base MSR address for CBMs
+ * @msr_update: Function pointer to update QOS MSRs
+ * @mon_scale: cqm counter * mon_scale = occupancy in bytes
+ * @mbm_width: Monitor width, to detect and correct for overflow.
+ * @cdp_enabled: CDP state of this resource
+ * @mbm_cntr_assign_enabled: ABMC feature is enabled
+ * @sdciae_enabled: SDCIAE feature (backing "io_alloc") is enabled.
+ *
+ * Members of this structure are either private to the architecture,
+ * e.g. mbm_width, or accessed via helpers that provide abstraction,
+ * e.g. msr_update and msr_base.
+ */
+struct rdt_hw_resource {
+ struct rdt_resource r_resctrl;
+ u32 num_closid;
+ unsigned int msr_base;
+ void (*msr_update)(struct msr_param *m);
+ unsigned int mon_scale;
+ unsigned int mbm_width;
+ bool cdp_enabled;
+ bool mbm_cntr_assign_enabled;
+ bool sdciae_enabled;
+};
+
+static inline struct rdt_hw_resource *resctrl_to_arch_res(struct rdt_resource *r)
+{
+ return container_of(r, struct rdt_hw_resource, r_resctrl);
+}
+
+extern struct rdt_hw_resource rdt_resources_all[];
+
+void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d);
+
+/* CPUID.(EAX=10H, ECX=ResID=1).EAX */
+union cpuid_0x10_1_eax {
+ struct {
+ unsigned int cbm_len:5;
+ } split;
+ unsigned int full;
+};
+
+/* CPUID.(EAX=10H, ECX=ResID=3).EAX */
+union cpuid_0x10_3_eax {
+ struct {
+ unsigned int max_delay:12;
+ } split;
+ unsigned int full;
+};
+
+/* CPUID.(EAX=10H, ECX=ResID).ECX */
+union cpuid_0x10_x_ecx {
+ struct {
+ unsigned int reserved:3;
+ unsigned int noncont:1;
+ } split;
+ unsigned int full;
+};
+
+/* CPUID.(EAX=10H, ECX=ResID).EDX */
+union cpuid_0x10_x_edx {
+ struct {
+ unsigned int cos_max:16;
+ } split;
+ unsigned int full;
+};
+
+/*
+ * ABMC counters are configured by writing to MSR_IA32_L3_QOS_ABMC_CFG.
+ *
+ * @bw_type : Event configuration that represents the memory
+ * transactions being tracked by the @cntr_id.
+ * @bw_src : Bandwidth source (RMID or CLOSID).
+ * @reserved1 : Reserved.
+ * @is_clos : @bw_src field is a CLOSID (not an RMID).
+ * @cntr_id : Counter identifier.
+ * @reserved : Reserved.
+ * @cntr_en : Counting enable bit.
+ * @cfg_en : Configuration enable bit.
+ *
+ * Configuration and counting:
+ * Counter can be configured across multiple writes to MSR. Configuration
+ * is applied only when @cfg_en = 1. Counter @cntr_id is reset when the
+ * configuration is applied.
+ * @cfg_en = 1, @cntr_en = 0 : Apply @cntr_id configuration but do not
+ * count events.
+ * @cfg_en = 1, @cntr_en = 1 : Apply @cntr_id configuration and start
+ * counting events.
+ */
+union l3_qos_abmc_cfg {
+ struct {
+ unsigned long bw_type :32,
+ bw_src :12,
+ reserved1: 3,
+ is_clos : 1,
+ cntr_id : 5,
+ reserved : 9,
+ cntr_en : 1,
+ cfg_en : 1;
+ } split;
+ unsigned long full;
+};
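+
+/*
+ * For example, to assign counter 3 to RMID 10: set cfg_en = 1, cntr_en = 1,
+ * cntr_id = 3, bw_src = 10 and is_clos = 0. Applying this resets counter 3
+ * and starts it counting the transactions selected by bw_type; see
+ * resctrl_arch_config_cntr() for the code that builds such a value.
+ */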
+
+void rdt_ctrl_update(void *arg);
+
+int rdt_get_mon_l3_config(struct rdt_resource *r);
+
+bool rdt_cpu_has(int flag);
+
+void __init intel_rdt_mbm_apply_quirk(void);
+
+void rdt_domain_reconfigure_cdp(struct rdt_resource *r);
+void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r);
+
+#endif /* _ASM_X86_RESCTRL_INTERNAL_H */
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
new file mode 100644
index 000000000000..dffcc8307500
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -0,0 +1,583 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Resource Director Technology (RDT)
+ * - Monitoring code
+ *
+ * Copyright (C) 2017 Intel Corporation
+ *
+ * Author:
+ * Vikas Shivappa <vikas.shivappa@intel.com>
+ *
+ * This replaces the perf-based cqm.c, but reuses a lot of code and
+ * data structures originally written by Peter Zijlstra and Matt Fleming.
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual June 2016, volume 3, section 17.17.
+ */
+
+#define pr_fmt(fmt) "resctrl: " fmt
+
+#include <linux/cpu.h>
+#include <linux/resctrl.h>
+
+#include <asm/cpu_device_id.h>
+#include <asm/msr.h>
+
+#include "internal.h"
+
+/*
+ * Global boolean for rdt_monitor which is true if any
+ * resource monitoring is enabled.
+ */
+bool rdt_mon_capable;
+
+#define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))
+
+static int snc_nodes_per_l3_cache = 1;
+
+/*
+ * The correction factor table is documented in Documentation/filesystems/resctrl.rst.
+ * If rmid > rmid threshold, MBM total and local values should be multiplied
+ * by the correction factor.
+ *
+ * The original table is modified to simplify the code:
+ *
+ * 1. The threshold 0 is changed to rmid count - 1 so that no correction
+ *    is done for that case.
+ * 2. The MBM total and local correction tables are indexed by a core count
+ *    equal to (x86_cache_max_rmid + 1) / 8 - 1, ranging from 0 up to 27.
+ * 3. The correction factor is normalized to 2^20 (1048576) so the
+ *    corrected value can be calculated quickly by shifting:
+ *    corrected_value = (original_value * correction_factor) >> 20
+ */
+static const struct mbm_correction_factor_table {
+ u32 rmidthreshold;
+ u64 cf;
+} mbm_cf_table[] __initconst = {
+ {7, CF(1.000000)},
+ {15, CF(1.000000)},
+ {15, CF(0.969650)},
+ {31, CF(1.000000)},
+ {31, CF(1.066667)},
+ {31, CF(0.969650)},
+ {47, CF(1.142857)},
+ {63, CF(1.000000)},
+ {63, CF(1.185115)},
+ {63, CF(1.066553)},
+ {79, CF(1.454545)},
+ {95, CF(1.000000)},
+ {95, CF(1.230769)},
+ {95, CF(1.142857)},
+ {95, CF(1.066667)},
+ {127, CF(1.000000)},
+ {127, CF(1.254863)},
+ {127, CF(1.185255)},
+ {151, CF(1.000000)},
+ {127, CF(1.066667)},
+ {167, CF(1.000000)},
+ {159, CF(1.454334)},
+ {183, CF(1.000000)},
+ {127, CF(0.969744)},
+ {191, CF(1.280246)},
+ {191, CF(1.230921)},
+ {215, CF(1.000000)},
+ {191, CF(1.143118)},
+};
+
+static u32 mbm_cf_rmidthreshold __read_mostly = UINT_MAX;
+
+static u64 mbm_cf __read_mostly;
+
+static inline u64 get_corrected_mbm_count(u32 rmid, unsigned long val)
+{
+ /* Correct MBM value. */
+ if (rmid > mbm_cf_rmidthreshold)
+ val = (val * mbm_cf) >> 20;
+
+ return val;
+}
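+
+/*
+ * For example, the factor 1.142857 is stored as CF(1.142857) == 1198372,
+ * so a raw count val is corrected as (val * 1198372) >> 20, i.e. it is
+ * multiplied by ~1.142857 using integer arithmetic only.
+ */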
+
+/*
+ * When Sub-NUMA Cluster (SNC) mode is not enabled (as indicated by
+ * "snc_nodes_per_l3_cache == 1") no translation of the RMID value is
+ * needed. The physical RMID is the same as the logical RMID.
+ *
+ * On a platform with SNC mode enabled, Linux enables RMID sharing mode
+ * via MSR 0xCA0 (see the "RMID Sharing Mode" section in the "Intel
+ * Resource Director Technology Architecture Specification" for a full
+ * description of RMID sharing mode).
+ *
+ * In RMID sharing mode there are fewer "logical RMID" values available
+ * to accumulate data ("physical RMIDs" are divided evenly between SNC
+ * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
+ * each SNC node.
+ *
+ * The value loaded into IA32_PQR_ASSOC is the "logical RMID".
+ *
+ * Data is collected independently on each SNC node and can be retrieved
+ * using the "physical RMID" value computed by this function and loaded
+ * into IA32_QM_EVTSEL. @cpu can be any CPU in the SNC node.
+ *
+ * The scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is at the L3
+ * cache. So a "physical RMID" may be read from any CPU that shares
+ * the L3 cache with the desired SNC node, not just from a CPU in
+ * the specific SNC node.
+ */
+static int logical_rmid_to_physical_rmid(int cpu, int lrmid)
+{
+ struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
+
+ if (snc_nodes_per_l3_cache == 1)
+ return lrmid;
+
+ return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->mon.num_rmid;
+}
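+
+/*
+ * For example, with two SNC nodes per L3 cache and num_rmid == 128,
+ * logical RMID 5 on a CPU in the second SNC node maps to physical
+ * RMID 5 + 1 * 128 = 133.
+ */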
+
+static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
+{
+ u64 msr_val;
+
+ /*
+ * As per the SDM, when IA32_QM_EVTSEL.EvtID (bits 7:0) is configured
+ * with a valid event code for supported resource type and the bits
+ * IA32_QM_EVTSEL.RMID (bits 41:32) are configured with valid RMID,
+ * IA32_QM_CTR.data (bits 61:0) reports the monitored data.
+ * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
+ * are error bits.
+ */
+ wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
+ rdmsrq(MSR_IA32_QM_CTR, msr_val);
+
+ if (msr_val & RMID_VAL_ERROR)
+ return -EIO;
+ if (msr_val & RMID_VAL_UNAVAIL)
+ return -EINVAL;
+
+ *val = msr_val;
+ return 0;
+}
+
+static struct arch_mbm_state *get_arch_mbm_state(struct rdt_hw_mon_domain *hw_dom,
+ u32 rmid,
+ enum resctrl_event_id eventid)
+{
+ struct arch_mbm_state *state;
+
+ if (!resctrl_is_mbm_event(eventid))
+ return NULL;
+
+ state = hw_dom->arch_mbm_states[MBM_STATE_IDX(eventid)];
+
+ return state ? &state[rmid] : NULL;
+}
+
+void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
+ u32 unused, u32 rmid,
+ enum resctrl_event_id eventid)
+{
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ int cpu = cpumask_any(&d->hdr.cpu_mask);
+ struct arch_mbm_state *am;
+ u32 prmid;
+
+ am = get_arch_mbm_state(hw_dom, rmid, eventid);
+ if (am) {
+ memset(am, 0, sizeof(*am));
+
+ prmid = logical_rmid_to_physical_rmid(cpu, rmid);
+ /* Record any initial, non-zero count value. */
+ __rmid_read_phys(prmid, eventid, &am->prev_msr);
+ }
+}
+
+/*
+ * Assumes that hardware counters are also reset and thus that there is
+ * no need to record initial non-zero counts.
+ */
+void resctrl_arch_reset_rmid_all(struct rdt_resource *r, struct rdt_mon_domain *d)
+{
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ enum resctrl_event_id eventid;
+ int idx;
+
+ for_each_mbm_event_id(eventid) {
+ if (!resctrl_is_mon_event_enabled(eventid))
+ continue;
+ idx = MBM_STATE_IDX(eventid);
+ memset(hw_dom->arch_mbm_states[idx], 0,
+ sizeof(*hw_dom->arch_mbm_states[0]) * r->mon.num_rmid);
+ }
+}
+
+static u64 mbm_overflow_count(u64 prev_msr, u64 cur_msr, unsigned int width)
+{
+ u64 shift = 64 - width, chunks;
+
+ chunks = (cur_msr << shift) - (prev_msr << shift);
+ return chunks >> shift;
+}
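+
+/*
+ * For example, with a 24-bit wide counter (shift == 40), prev_msr == 0xfffffe
+ * and cur_msr == 0x4: the subtraction is done in the upper bits so the
+ * counter wraparound is absorbed, and shifting back down yields chunks == 6.
+ */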
+
+static u64 get_corrected_val(struct rdt_resource *r, struct rdt_mon_domain *d,
+ u32 rmid, enum resctrl_event_id eventid, u64 msr_val)
+{
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ struct arch_mbm_state *am;
+ u64 chunks;
+
+ am = get_arch_mbm_state(hw_dom, rmid, eventid);
+ if (am) {
+ am->chunks += mbm_overflow_count(am->prev_msr, msr_val,
+ hw_res->mbm_width);
+ chunks = get_corrected_mbm_count(rmid, am->chunks);
+ am->prev_msr = msr_val;
+ } else {
+ chunks = msr_val;
+ }
+
+ return chunks * hw_res->mon_scale;
+}
+
+int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+ u32 unused, u32 rmid, enum resctrl_event_id eventid,
+ u64 *val, void *ignored)
+{
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ int cpu = cpumask_any(&d->hdr.cpu_mask);
+ struct arch_mbm_state *am;
+ u64 msr_val;
+ u32 prmid;
+ int ret;
+
+ resctrl_arch_rmid_read_context_check();
+
+ prmid = logical_rmid_to_physical_rmid(cpu, rmid);
+ ret = __rmid_read_phys(prmid, eventid, &msr_val);
+
+ if (!ret) {
+ *val = get_corrected_val(r, d, rmid, eventid, msr_val);
+ } else if (ret == -EINVAL) {
+ am = get_arch_mbm_state(hw_dom, rmid, eventid);
+ if (am)
+ am->prev_msr = 0;
+ }
+
+ return ret;
+}
+
+static int __cntr_id_read(u32 cntr_id, u64 *val)
+{
+ u64 msr_val;
+
+ /*
+ * QM_EVTSEL Register definition:
+ * =======================================================
+ * Bits Mnemonic Description
+ * =======================================================
+ * 63:44 -- Reserved
+ * 43:32 RMID RMID or counter ID in ABMC mode
+ * when reading an MBM event
+ * 31 ExtendedEvtID Extended Event Identifier
+ * 30:8 -- Reserved
+ * 7:0 EvtID Event Identifier
+ * =======================================================
+ * The contents of a specific counter can be read by setting the
+ * following fields in QM_EVTSEL.ExtendedEvtID(=1) and
+ * QM_EVTSEL.EvtID = L3CacheABMC (=1) and setting QM_EVTSEL.RMID
+ * to the desired counter ID. Reading the QM_CTR then returns the
+ * contents of the specified counter. The RMID_VAL_ERROR bit is set
+ * if the counter configuration is invalid, or if an invalid counter
+ * ID is set in the QM_EVTSEL.RMID field. The RMID_VAL_UNAVAIL bit
+ * is set if the counter data is unavailable.
+ */
+ wrmsr(MSR_IA32_QM_EVTSEL, ABMC_EXTENDED_EVT_ID | ABMC_EVT_ID, cntr_id);
+	rdmsrq(MSR_IA32_QM_CTR, msr_val);
+
+ if (msr_val & RMID_VAL_ERROR)
+ return -EIO;
+ if (msr_val & RMID_VAL_UNAVAIL)
+ return -EINVAL;
+
+ *val = msr_val;
+ return 0;
+}
+
+void resctrl_arch_reset_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+ u32 unused, u32 rmid, int cntr_id,
+ enum resctrl_event_id eventid)
+{
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ struct arch_mbm_state *am;
+
+ am = get_arch_mbm_state(hw_dom, rmid, eventid);
+ if (am) {
+ memset(am, 0, sizeof(*am));
+
+ /* Record any initial, non-zero count value. */
+ __cntr_id_read(cntr_id, &am->prev_msr);
+ }
+}
+
+int resctrl_arch_cntr_read(struct rdt_resource *r, struct rdt_mon_domain *d,
+ u32 unused, u32 rmid, int cntr_id,
+ enum resctrl_event_id eventid, u64 *val)
+{
+ u64 msr_val;
+ int ret;
+
+ ret = __cntr_id_read(cntr_id, &msr_val);
+ if (ret)
+ return ret;
+
+ *val = get_corrected_val(r, d, rmid, eventid, msr_val);
+
+ return 0;
+}
+
+/*
+ * The power-on reset value of MSR_RMID_SNC_CONFIG is 0x1
+ * which indicates that RMIDs are configured in legacy mode.
+ * This mode is incompatible with Linux resctrl semantics
+ * as RMIDs are partitioned between SNC nodes, which requires
+ * a user to know which RMID is allocated to a task.
+ * Clearing bit 0 reconfigures the RMID counters for use
+ * in RMID sharing mode. This mode is better for Linux.
+ * The RMID space is divided between all SNC nodes with the
+ * RMIDs renumbered to start from zero in each node when
+ * counting operations from tasks. Code to read the counters
+ * must adjust RMID counter numbers based on SNC node. See
+ * logical_rmid_to_physical_rmid() for code that does this.
+ */
+void arch_mon_domain_online(struct rdt_resource *r, struct rdt_mon_domain *d)
+{
+ if (snc_nodes_per_l3_cache > 1)
+ msr_clear_bit(MSR_RMID_SNC_CONFIG, 0);
+}
+
+/* CPU models that support MSR_RMID_SNC_CONFIG */
+static const struct x86_cpu_id snc_cpu_ids[] __initconst = {
+ X86_MATCH_VFM(INTEL_ICELAKE_X, 0),
+ X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X, 0),
+ X86_MATCH_VFM(INTEL_EMERALDRAPIDS_X, 0),
+ X86_MATCH_VFM(INTEL_GRANITERAPIDS_X, 0),
+ X86_MATCH_VFM(INTEL_ATOM_CRESTMONT_X, 0),
+ X86_MATCH_VFM(INTEL_ATOM_DARKMONT_X, 0),
+ {}
+};
+
+/*
+ * There isn't a simple hardware bit that indicates whether a CPU is running
+ * in Sub-NUMA Cluster (SNC) mode. Infer the state by comparing the
+ * number of CPUs sharing the L3 cache with CPU0 to the number of CPUs in
+ * the same NUMA node as CPU0.
+ * It is not possible to accurately determine SNC state if the system is
+ * booted with a maxcpus=N parameter. That distorts the ratio of SNC nodes
+ * to L3 caches. It will be OK if the system is booted with hyperthreading
+ * disabled (since this doesn't affect the ratio).
+ */
+static __init int snc_get_config(void)
+{
+ struct cacheinfo *ci = get_cpu_cacheinfo_level(0, RESCTRL_L3_CACHE);
+ const cpumask_t *node0_cpumask;
+ int cpus_per_node, cpus_per_l3;
+ int ret;
+
+ if (!x86_match_cpu(snc_cpu_ids) || !ci)
+ return 1;
+
+ cpus_read_lock();
+ if (num_online_cpus() != num_present_cpus())
+ pr_warn("Some CPUs offline, SNC detection may be incorrect\n");
+ cpus_read_unlock();
+
+ node0_cpumask = cpumask_of_node(cpu_to_node(0));
+
+ cpus_per_node = cpumask_weight(node0_cpumask);
+ cpus_per_l3 = cpumask_weight(&ci->shared_cpu_map);
+
+ if (!cpus_per_node || !cpus_per_l3)
+ return 1;
+
+ ret = cpus_per_l3 / cpus_per_node;
+
+	/* Sanity check: the only valid results are 1, 2, 3, 4 and 6 */
+ switch (ret) {
+ case 1:
+ break;
+ case 2 ... 4:
+ case 6:
+ pr_info("Sub-NUMA Cluster mode detected with %d nodes per L3 cache\n", ret);
+ rdt_resources_all[RDT_RESOURCE_L3].r_resctrl.mon_scope = RESCTRL_L3_NODE;
+ break;
+ default:
+		pr_warn("Ignoring improbable SNC node count %d\n", ret);
+ ret = 1;
+ break;
+ }
+
+ return ret;
+}
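+
+/*
+ * For example, if 96 CPUs share CPU0's L3 cache but CPU0's NUMA node
+ * holds only 48 of them, cpus_per_l3 / cpus_per_node == 2, i.e. two
+ * SNC nodes per L3 cache.
+ */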
+
+int __init rdt_get_mon_l3_config(struct rdt_resource *r)
+{
+ unsigned int mbm_offset = boot_cpu_data.x86_cache_mbm_width_offset;
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ unsigned int threshold;
+ u32 eax, ebx, ecx, edx;
+
+ snc_nodes_per_l3_cache = snc_get_config();
+
+ resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
+ hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
+ r->mon.num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
+ hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
+
+ if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)
+ hw_res->mbm_width += mbm_offset;
+ else if (mbm_offset > MBM_CNTR_WIDTH_OFFSET_MAX)
+ pr_warn("Ignoring impossible MBM counter offset\n");
+
+ /*
+ * A reasonable upper limit on the max threshold is the number
+ * of lines tagged per RMID if all RMIDs have the same number of
+ * lines tagged in the LLC.
+ *
+ * For a 35MB LLC and 56 RMIDs, this is ~1.8% of the LLC.
+ */
+ threshold = resctrl_rmid_realloc_limit / r->mon.num_rmid;
+
+ /*
+ * Because num_rmid may not be a power of two, round the value
+ * to the nearest multiple of hw_res->mon_scale so it matches a
+ * value the hardware will measure. mon_scale may not be a power of 2.
+ */
+ resctrl_rmid_realloc_threshold = resctrl_arch_round_mon_val(threshold);
+
+ if (rdt_cpu_has(X86_FEATURE_BMEC) || rdt_cpu_has(X86_FEATURE_ABMC)) {
+ /* Detect list of bandwidth sources that can be tracked */
+ cpuid_count(0x80000020, 3, &eax, &ebx, &ecx, &edx);
+ r->mon.mbm_cfg_mask = ecx & MAX_EVT_CONFIG_BITS;
+ }
+
+ /*
+ * resctrl assumes a system that supports assignable counters can
+ * switch to "default" mode. Ensure that there is a "default" mode
+ * to switch to. This enforces a dependency between the independent
+ * X86_FEATURE_ABMC and X86_FEATURE_CQM_MBM_TOTAL/X86_FEATURE_CQM_MBM_LOCAL
+ * hardware features.
+ */
+ if (rdt_cpu_has(X86_FEATURE_ABMC) &&
+ (rdt_cpu_has(X86_FEATURE_CQM_MBM_TOTAL) ||
+ rdt_cpu_has(X86_FEATURE_CQM_MBM_LOCAL))) {
+ r->mon.mbm_cntr_assignable = true;
+ cpuid_count(0x80000020, 5, &eax, &ebx, &ecx, &edx);
+ r->mon.num_mbm_cntrs = (ebx & GENMASK(15, 0)) + 1;
+ hw_res->mbm_cntr_assign_enabled = true;
+ }
+
+ r->mon_capable = true;
+
+ return 0;
+}
+
+void __init intel_rdt_mbm_apply_quirk(void)
+{
+ int cf_index;
+
+ cf_index = (boot_cpu_data.x86_cache_max_rmid + 1) / 8 - 1;
+ if (cf_index >= ARRAY_SIZE(mbm_cf_table)) {
+ pr_info("No MBM correction factor available\n");
+ return;
+ }
+
+ mbm_cf_rmidthreshold = mbm_cf_table[cf_index].rmidthreshold;
+ mbm_cf = mbm_cf_table[cf_index].cf;
+}
+
+static void resctrl_abmc_set_one_amd(void *arg)
+{
+ bool *enable = arg;
+
+ if (*enable)
+ msr_set_bit(MSR_IA32_L3_QOS_EXT_CFG, ABMC_ENABLE_BIT);
+ else
+ msr_clear_bit(MSR_IA32_L3_QOS_EXT_CFG, ABMC_ENABLE_BIT);
+}
+
+/*
+ * ABMC enable/disable requires update of L3_QOS_EXT_CFG MSR on all the CPUs
+ * associated with all monitor domains.
+ */
+static void _resctrl_abmc_enable(struct rdt_resource *r, bool enable)
+{
+ struct rdt_mon_domain *d;
+
+ lockdep_assert_cpus_held();
+
+ list_for_each_entry(d, &r->mon_domains, hdr.list) {
+ on_each_cpu_mask(&d->hdr.cpu_mask, resctrl_abmc_set_one_amd,
+ &enable, 1);
+ resctrl_arch_reset_rmid_all(r, d);
+ }
+}
+
+int resctrl_arch_mbm_cntr_assign_set(struct rdt_resource *r, bool enable)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+
+ if (r->mon.mbm_cntr_assignable &&
+ hw_res->mbm_cntr_assign_enabled != enable) {
+ _resctrl_abmc_enable(r, enable);
+ hw_res->mbm_cntr_assign_enabled = enable;
+ }
+
+ return 0;
+}
+
+bool resctrl_arch_mbm_cntr_assign_enabled(struct rdt_resource *r)
+{
+ return resctrl_to_arch_res(r)->mbm_cntr_assign_enabled;
+}
+
+static void resctrl_abmc_config_one_amd(void *info)
+{
+ union l3_qos_abmc_cfg *abmc_cfg = info;
+
+	wrmsrq(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg->full);
+}
+
+/*
+ * Send an IPI to a CPU in the domain to assign the counter to an
+ * (RMID, event) pair.
+ */
+void resctrl_arch_config_cntr(struct rdt_resource *r, struct rdt_mon_domain *d,
+ enum resctrl_event_id evtid, u32 rmid, u32 closid,
+ u32 cntr_id, bool assign)
+{
+ struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
+ union l3_qos_abmc_cfg abmc_cfg = { 0 };
+ struct arch_mbm_state *am;
+
+ abmc_cfg.split.cfg_en = 1;
+ abmc_cfg.split.cntr_en = assign ? 1 : 0;
+ abmc_cfg.split.cntr_id = cntr_id;
+ abmc_cfg.split.bw_src = rmid;
+ if (assign)
+ abmc_cfg.split.bw_type = resctrl_get_mon_evt_cfg(evtid);
+
+ smp_call_function_any(&d->hdr.cpu_mask, resctrl_abmc_config_one_amd, &abmc_cfg, 1);
+
+ /*
+ * The hardware counter is reset (because cfg_en == 1) so there is no
+ * need to record initial non-zero counts.
+ */
+ am = get_arch_mbm_state(hw_dom, rmid, evtid);
+ if (am)
+ memset(am, 0, sizeof(*am));
+}
+
+void resctrl_arch_mbm_cntr_assign_set_one(struct rdt_resource *r)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+
+ resctrl_abmc_set_one_amd(&hw_res->mbm_cntr_assign_enabled);
+}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
new file mode 100644
index 000000000000..de580eca3363
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -0,0 +1,517 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Resource Director Technology (RDT)
+ *
+ * Pseudo-locking support built on top of Cache Allocation Technology (CAT)
+ *
+ * Copyright (C) 2018 Intel Corporation
+ *
+ * Author: Reinette Chatre <reinette.chatre@intel.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cacheflush.h>
+#include <linux/cpu.h>
+#include <linux/perf_event.h>
+#include <linux/pm_qos.h>
+#include <linux/resctrl.h>
+
+#include <asm/cpu_device_id.h>
+#include <asm/perf_event.h>
+#include <asm/msr.h>
+
+#include "../../events/perf_event.h" /* For X86_CONFIG() */
+#include "internal.h"
+
+#define CREATE_TRACE_POINTS
+
+#include "pseudo_lock_trace.h"
+
+/*
+ * The bits needed to disable hardware prefetching vary based on the
+ * platform. During initialization we will discover which bits to use.
+ */
+static u64 prefetch_disable_bits;
+
+/**
+ * resctrl_arch_get_prefetch_disable_bits - prefetch disable bits of supported
+ * platforms
+ *
+ * Capture the list of platforms that have been validated to support
+ * pseudo-locking. This includes testing to ensure pseudo-locked regions
+ * with low cache miss rates can be created under a variety of load
+ * conditions, and that these pseudo-locked regions can maintain their low
+ * cache miss rates under a variety of load conditions for significant
+ * lengths of time.
+ *
+ * After a platform has been validated to support pseudo-locking its
+ * hardware prefetch disable bits are included here as they are documented
+ * in the SDM.
+ *
+ * When adding a platform here also add support for its cache events to
+ * resctrl_arch_measure_l*_residency()
+ *
+ * Return:
+ * If platform is supported, the bits to disable hardware prefetchers, 0
+ * if platform is not supported.
+ */
+u64 resctrl_arch_get_prefetch_disable_bits(void)
+{
+ prefetch_disable_bits = 0;
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+ boot_cpu_data.x86 != 6)
+ return 0;
+
+ switch (boot_cpu_data.x86_vfm) {
+ case INTEL_BROADWELL_X:
+ /*
+ * SDM defines bits of MSR_MISC_FEATURE_CONTROL register
+ * as:
+ * 0 L2 Hardware Prefetcher Disable (R/W)
+ * 1 L2 Adjacent Cache Line Prefetcher Disable (R/W)
+ * 2 DCU Hardware Prefetcher Disable (R/W)
+ * 3 DCU IP Prefetcher Disable (R/W)
+ * 63:4 Reserved
+ */
+ prefetch_disable_bits = 0xF;
+ break;
+ case INTEL_ATOM_GOLDMONT:
+ case INTEL_ATOM_GOLDMONT_PLUS:
+ /*
+ * SDM defines bits of MSR_MISC_FEATURE_CONTROL register
+ * as:
+ * 0 L2 Hardware Prefetcher Disable (R/W)
+ * 1 Reserved
+ * 2 DCU Hardware Prefetcher Disable (R/W)
+ * 63:3 Reserved
+ */
+ prefetch_disable_bits = 0x5;
+ break;
+ }
+
+ return prefetch_disable_bits;
+}
+
+/**
+ * resctrl_arch_pseudo_lock_fn - Load kernel memory into cache
+ * @_plr: the pseudo-lock region descriptor
+ *
+ * This is the core pseudo-locking flow.
+ *
+ * First we ensure that the kernel memory cannot be found in the cache.
+ * Then, while taking care that there will be as little interference as
+ * possible, the memory to be loaded is accessed while the core is running
+ * with its class of service set to the bitmask of the pseudo-locked region.
+ * After this is complete, no future CAT allocations will be allowed to
+ * overlap with this bitmask.
+ *
+ * Local register variables are utilized to ensure that the memory region
+ * to be locked is the only memory access made during the critical locking
+ * loop.
+ *
+ * Return: 0. Waiter on waitqueue will be woken on completion.
+ */
+int resctrl_arch_pseudo_lock_fn(void *_plr)
+{
+ struct pseudo_lock_region *plr = _plr;
+ u32 rmid_p, closid_p;
+ unsigned long i;
+ u64 saved_msr;
+#ifdef CONFIG_KASAN
+ /*
+ * The registers used for local register variables are also used
+ * when KASAN is active. When KASAN is active we use a regular
+ * variable to ensure we always use a valid pointer, but the cost
+ * is that this variable will enter the cache through evicting the
+ * memory we are trying to lock into the cache. Thus expect lower
+ * pseudo-locking success rate when KASAN is active.
+ */
+ unsigned int line_size;
+ unsigned int size;
+ void *mem_r;
+#else
+ register unsigned int line_size asm("esi");
+ register unsigned int size asm("edi");
+ register void *mem_r asm(_ASM_BX);
+#endif /* CONFIG_KASAN */
+
+ /*
+	 * Make sure none of the allocated memory is cached. If any of it is,
+	 * the loop below will get a cache hit from outside of the
+	 * pseudo-locked region.
+ * wbinvd (as opposed to clflush/clflushopt) is required to
+ * increase likelihood that allocated cache portion will be filled
+ * with associated memory.
+ */
+ wbinvd();
+
+ /*
+	 * Always called with interrupts enabled. Disabling interrupts
+	 * ensures that we will not be preempted during this critical section.
+ */
+ local_irq_disable();
+
+ /*
+ * Call wrmsr and rdmsr as directly as possible to avoid tracing
+ * clobbering local register variables or affecting cache accesses.
+ *
+ * Disable the hardware prefetcher so that when the end of the memory
+ * being pseudo-locked is reached the hardware will not read beyond
+ * the buffer and evict pseudo-locked memory read earlier from the
+ * cache.
+ */
+ saved_msr = native_rdmsrq(MSR_MISC_FEATURE_CONTROL);
+ native_wrmsrq(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits);
+ closid_p = this_cpu_read(pqr_state.cur_closid);
+ rmid_p = this_cpu_read(pqr_state.cur_rmid);
+ mem_r = plr->kmem;
+ size = plr->size;
+ line_size = plr->line_size;
+ /*
+ * Critical section begin: start by writing the closid associated
+ * with the capacity bitmask of the cache region being
+ * pseudo-locked followed by reading of kernel memory to load it
+ * into the cache.
+ */
+ native_wrmsr(MSR_IA32_PQR_ASSOC, rmid_p, plr->closid);
+
+ /*
+ * Cache was flushed earlier. Now access kernel memory to read it
+ * into cache region associated with just activated plr->closid.
+ * Loop over data twice:
+ * - In first loop the cache region is shared with the page walker
+ * as it populates the paging structure caches (including TLB).
+ * - In the second loop the paging structure caches are used and
+ * cache region is populated with the memory being referenced.
+ */
+ for (i = 0; i < size; i += PAGE_SIZE) {
+ /*
+ * Add a barrier to prevent speculative execution of this
+ * loop reading beyond the end of the buffer.
+ */
+ rmb();
+ asm volatile("mov (%0,%1,1), %%eax\n\t"
+ :
+ : "r" (mem_r), "r" (i)
+ : "%eax", "memory");
+ }
+ for (i = 0; i < size; i += line_size) {
+ /*
+ * Add a barrier to prevent speculative execution of this
+ * loop reading beyond the end of the buffer.
+ */
+ rmb();
+ asm volatile("mov (%0,%1,1), %%eax\n\t"
+ :
+ : "r" (mem_r), "r" (i)
+ : "%eax", "memory");
+ }
+ /*
+ * Critical section end: restore closid with capacity bitmask that
+ * does not overlap with pseudo-locked region.
+ */
+ native_wrmsr(MSR_IA32_PQR_ASSOC, rmid_p, closid_p);
+
+ /* Re-enable the hardware prefetcher(s) */
+ wrmsrq(MSR_MISC_FEATURE_CONTROL, saved_msr);
+ local_irq_enable();
+
+ plr->thread_done = 1;
+ wake_up_interruptible(&plr->lock_thread_wq);
+ return 0;
+}
+
+/**
+ * resctrl_arch_measure_cycles_lat_fn - Measure cycle latency to read
+ * pseudo-locked memory
+ * @_plr: pseudo-lock region to measure
+ *
+ * There is no deterministic way to test if a memory region is cached. One
+ * way is to measure how long it takes to read the memory: the speed of
+ * access indicates how close to the CPU the data was. Moreover, if the
+ * prefetcher is disabled and the memory is read at a stride of half the
+ * cache line, then a cache miss is easy to spot since the read of the
+ * first half will be significantly slower than the read of the second
+ * half.
+ *
+ * Return: 0. Waiter on waitqueue will be woken on completion.
+ */
+int resctrl_arch_measure_cycles_lat_fn(void *_plr)
+{
+ struct pseudo_lock_region *plr = _plr;
+ u32 saved_low, saved_high;
+ unsigned long i;
+ u64 start, end;
+ void *mem_r;
+
+ local_irq_disable();
+ /*
+ * Disable hardware prefetchers.
+ */
+ rdmsr(MSR_MISC_FEATURE_CONTROL, saved_low, saved_high);
+ wrmsrq(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits);
+ mem_r = READ_ONCE(plr->kmem);
+ /*
+	 * Do a dummy run of the time measurement to load the needed
+ * instructions into the L1 instruction cache.
+ */
+ start = rdtsc_ordered();
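+	/* The 32 byte stride is half of a (typical) 64 byte cache line. */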
+ for (i = 0; i < plr->size; i += 32) {
+ start = rdtsc_ordered();
+ asm volatile("mov (%0,%1,1), %%eax\n\t"
+ :
+ : "r" (mem_r), "r" (i)
+ : "%eax", "memory");
+ end = rdtsc_ordered();
+ trace_pseudo_lock_mem_latency((u32)(end - start));
+ }
+ wrmsr(MSR_MISC_FEATURE_CONTROL, saved_low, saved_high);
+ local_irq_enable();
+ plr->thread_done = 1;
+ wake_up_interruptible(&plr->lock_thread_wq);
+ return 0;
+}
+
+/*
+ * Create a perf_event_attr for the hit and miss perf events that will
+ * be used during the performance measurement. A perf_event maintains
+ * a pointer to its perf_event_attr so a unique attribute structure is
+ * created for each perf_event.
+ *
+ * The actual configuration of the event is set right before use in order
+ * to use the X86_CONFIG macro.
+ */
+static struct perf_event_attr perf_miss_attr = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(struct perf_event_attr),
+ .pinned = 1,
+ .disabled = 0,
+ .exclude_user = 1,
+};
+
+static struct perf_event_attr perf_hit_attr = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(struct perf_event_attr),
+ .pinned = 1,
+ .disabled = 0,
+ .exclude_user = 1,
+};
+
+struct residency_counts {
+ u64 miss_before, hits_before;
+ u64 miss_after, hits_after;
+};
+
+static int measure_residency_fn(struct perf_event_attr *miss_attr,
+ struct perf_event_attr *hit_attr,
+ struct pseudo_lock_region *plr,
+ struct residency_counts *counts)
+{
+ u64 hits_before = 0, hits_after = 0, miss_before = 0, miss_after = 0;
+ struct perf_event *miss_event, *hit_event;
+ int hit_pmcnum, miss_pmcnum;
+ u32 saved_low, saved_high;
+ unsigned int line_size;
+ unsigned int size;
+ unsigned long i;
+ void *mem_r;
+ u64 tmp;
+
+ miss_event = perf_event_create_kernel_counter(miss_attr, plr->cpu,
+ NULL, NULL, NULL);
+ if (IS_ERR(miss_event))
+ goto out;
+
+ hit_event = perf_event_create_kernel_counter(hit_attr, plr->cpu,
+ NULL, NULL, NULL);
+ if (IS_ERR(hit_event))
+ goto out_miss;
+
+ local_irq_disable();
+ /*
+ * Check any possible error state of events used by performing
+ * one local read.
+ */
+ if (perf_event_read_local(miss_event, &tmp, NULL, NULL)) {
+ local_irq_enable();
+ goto out_hit;
+ }
+ if (perf_event_read_local(hit_event, &tmp, NULL, NULL)) {
+ local_irq_enable();
+ goto out_hit;
+ }
+
+ /*
+ * Disable hardware prefetchers.
+ */
+ rdmsr(MSR_MISC_FEATURE_CONTROL, saved_low, saved_high);
+ wrmsrq(MSR_MISC_FEATURE_CONTROL, prefetch_disable_bits);
+
+ /* Initialize rest of local variables */
+ /*
+ * Performance event has been validated right before this with
+ * interrupts disabled - it is thus safe to read the counter index.
+ */
+ miss_pmcnum = x86_perf_rdpmc_index(miss_event);
+ hit_pmcnum = x86_perf_rdpmc_index(hit_event);
+ line_size = READ_ONCE(plr->line_size);
+ mem_r = READ_ONCE(plr->kmem);
+ size = READ_ONCE(plr->size);
+
+ /*
+	 * Read the counter variables twice - first to load the needed
+	 * instructions into the L1 cache, second to capture an accurate value
+	 * that does not include cache misses incurred by the instruction loads.
+ */
+ hits_before = rdpmc(hit_pmcnum);
+ miss_before = rdpmc(miss_pmcnum);
+ /*
+	 * From the SDM: performing back-to-back fast reads is not guaranteed
+ * to be monotonic.
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
+ hits_before = rdpmc(hit_pmcnum);
+ miss_before = rdpmc(miss_pmcnum);
+ /*
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
+ for (i = 0; i < size; i += line_size) {
+ /*
+ * Add a barrier to prevent speculative execution of this
+ * loop reading beyond the end of the buffer.
+ */
+ rmb();
+ asm volatile("mov (%0,%1,1), %%eax\n\t"
+ :
+ : "r" (mem_r), "r" (i)
+ : "%eax", "memory");
+ }
+ /*
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
+ hits_after = rdpmc(hit_pmcnum);
+ miss_after = rdpmc(miss_pmcnum);
+ /*
+ * Use LFENCE to ensure all previous instructions are retired
+ * before proceeding.
+ */
+ rmb();
+ /* Re-enable hardware prefetchers */
+ wrmsr(MSR_MISC_FEATURE_CONTROL, saved_low, saved_high);
+ local_irq_enable();
+out_hit:
+ perf_event_release_kernel(hit_event);
+out_miss:
+ perf_event_release_kernel(miss_event);
+out:
+ /*
+ * All counts will be zero on failure.
+ */
+ counts->miss_before = miss_before;
+ counts->hits_before = hits_before;
+ counts->miss_after = miss_after;
+ counts->hits_after = hits_after;
+ return 0;
+}
+
+int resctrl_arch_measure_l2_residency(void *_plr)
+{
+ struct pseudo_lock_region *plr = _plr;
+ struct residency_counts counts = {0};
+
+ /*
+ * Non-architectural event for the Goldmont Microarchitecture
+ * from Intel x86 Architecture Software Developer Manual (SDM):
+ * MEM_LOAD_UOPS_RETIRED D1H (event number)
+ * Umask values:
+ * L2_HIT 02H
+ * L2_MISS 10H
+ */
+ switch (boot_cpu_data.x86_vfm) {
+ case INTEL_ATOM_GOLDMONT:
+ case INTEL_ATOM_GOLDMONT_PLUS:
+ perf_miss_attr.config = X86_CONFIG(.event = 0xd1,
+ .umask = 0x10);
+ perf_hit_attr.config = X86_CONFIG(.event = 0xd1,
+ .umask = 0x2);
+ break;
+ default:
+ goto out;
+ }
+
+ measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+ /*
+	 * If a failure prevented the measurements from succeeding,
+	 * tracepoints will still be written and all counts will be zero.
+ */
+ trace_pseudo_lock_l2(counts.hits_after - counts.hits_before,
+ counts.miss_after - counts.miss_before);
+out:
+ plr->thread_done = 1;
+ wake_up_interruptible(&plr->lock_thread_wq);
+ return 0;
+}
+
+int resctrl_arch_measure_l3_residency(void *_plr)
+{
+ struct pseudo_lock_region *plr = _plr;
+ struct residency_counts counts = {0};
+
+ /*
+ * On Broadwell Microarchitecture the MEM_LOAD_UOPS_RETIRED event
+ * has two "no fix" errata associated with it: BDM35 and BDM100. On
+ * this platform the following events are used instead:
+ * LONGEST_LAT_CACHE 2EH (Documented in SDM)
+ * REFERENCE 4FH
+ * MISS 41H
+ */
+
+ switch (boot_cpu_data.x86_vfm) {
+ case INTEL_BROADWELL_X:
+ /* On BDW the hit event counts references, not hits */
+ perf_hit_attr.config = X86_CONFIG(.event = 0x2e,
+ .umask = 0x4f);
+ perf_miss_attr.config = X86_CONFIG(.event = 0x2e,
+ .umask = 0x41);
+ break;
+ default:
+ goto out;
+ }
+
+ measure_residency_fn(&perf_miss_attr, &perf_hit_attr, plr, &counts);
+ /*
+	 * If a failure prevented the measurements from succeeding,
+	 * tracepoints will still be written and all counts will be zero.
+ */
+
+ counts.miss_after -= counts.miss_before;
+ if (boot_cpu_data.x86_vfm == INTEL_BROADWELL_X) {
+ /*
+ * On BDW references and misses are counted, need to adjust.
+ * Sometimes the "hits" counter is a bit more than the
+ * references, for example, x references but x + 1 hits.
+ * To not report invalid hit values in this case we treat
+ * that as misses equal to references.
+ */
+ /* First compute the number of cache references measured */
+ counts.hits_after -= counts.hits_before;
+ /* Next convert references to cache hits */
+ counts.hits_after -= min(counts.miss_after, counts.hits_after);
+ } else {
+ counts.hits_after -= counts.hits_before;
+ }
+
+ trace_pseudo_lock_l3(counts.hits_after, counts.miss_after);
+out:
+ plr->thread_done = 1;
+ wake_up_interruptible(&plr->lock_thread_wq);
+ return 0;
+}
diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock_trace.h b/arch/x86/kernel/cpu/resctrl/pseudo_lock_trace.h
new file mode 100644
index 000000000000..7c8aef08010f
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock_trace.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM resctrl
+
+#if !defined(_X86_RESCTRL_PSEUDO_LOCK_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _X86_RESCTRL_PSEUDO_LOCK_TRACE_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(pseudo_lock_mem_latency,
+ TP_PROTO(u32 latency),
+ TP_ARGS(latency),
+ TP_STRUCT__entry(__field(u32, latency)),
+ TP_fast_assign(__entry->latency = latency),
+ TP_printk("latency=%u", __entry->latency)
+ );
+
+TRACE_EVENT(pseudo_lock_l2,
+ TP_PROTO(u64 l2_hits, u64 l2_miss),
+ TP_ARGS(l2_hits, l2_miss),
+ TP_STRUCT__entry(__field(u64, l2_hits)
+ __field(u64, l2_miss)),
+ TP_fast_assign(__entry->l2_hits = l2_hits;
+ __entry->l2_miss = l2_miss;),
+ TP_printk("hits=%llu miss=%llu",
+ __entry->l2_hits, __entry->l2_miss));
+
+TRACE_EVENT(pseudo_lock_l3,
+ TP_PROTO(u64 l3_hits, u64 l3_miss),
+ TP_ARGS(l3_hits, l3_miss),
+ TP_STRUCT__entry(__field(u64, l3_hits)
+ __field(u64, l3_miss)),
+ TP_fast_assign(__entry->l3_hits = l3_hits;
+ __entry->l3_miss = l3_miss;),
+ TP_printk("hits=%llu miss=%llu",
+ __entry->l3_hits, __entry->l3_miss));
+
+#endif /* _X86_RESCTRL_PSEUDO_LOCK_TRACE_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+
+#define TRACE_INCLUDE_FILE pseudo_lock_trace
+
+#include <trace/define_trace.h>
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
new file mode 100644
index 000000000000..885026468440
--- /dev/null
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -0,0 +1,262 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * User interface for Resource Allocation in Resource Director Technology (RDT)
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Author: Fenghua Yu <fenghua.yu@intel.com>
+ *
+ * More information about RDT can be found in the Intel (R) x86 Architecture
+ * Software Developer Manual.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/cpu.h>
+#include <linux/debugfs.h>
+#include <linux/fs.h>
+#include <linux/fs_parser.h>
+#include <linux/sysfs.h>
+#include <linux/kernfs.h>
+#include <linux/resctrl.h>
+#include <linux/seq_buf.h>
+#include <linux/seq_file.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/slab.h>
+#include <linux/task_work.h>
+#include <linux/user_namespace.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/msr.h>
+#include "internal.h"
+
+DEFINE_STATIC_KEY_FALSE(rdt_enable_key);
+
+DEFINE_STATIC_KEY_FALSE(rdt_mon_enable_key);
+
+DEFINE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
+
+/*
+ * This is safe against resctrl_arch_sched_in() called from __switch_to()
+ * because __switch_to() is executed with interrupts disabled. A local call
+ * from update_closid_rmid() is protected against __switch_to() because
+ * preemption is disabled.
+ */
+void resctrl_arch_sync_cpu_closid_rmid(void *info)
+{
+ struct resctrl_cpu_defaults *r = info;
+
+ if (r) {
+ this_cpu_write(pqr_state.default_closid, r->closid);
+ this_cpu_write(pqr_state.default_rmid, r->rmid);
+ }
+
+ /*
+ * We cannot unconditionally write the MSR because the current
+ * executing task might have its own closid selected. Just reuse
+ * the context switch code.
+ */
+ resctrl_arch_sched_in(current);
+}
+
+#define INVALID_CONFIG_INDEX UINT_MAX
+
+/**
+ * mon_event_config_index_get - get the hardware index for the
+ * configurable event
+ * @evtid: event id.
+ *
+ * Return: 0 for evtid == QOS_L3_MBM_TOTAL_EVENT_ID
+ * 1 for evtid == QOS_L3_MBM_LOCAL_EVENT_ID
+ * INVALID_CONFIG_INDEX for invalid evtid
+ */
+static inline unsigned int mon_event_config_index_get(u32 evtid)
+{
+ switch (evtid) {
+ case QOS_L3_MBM_TOTAL_EVENT_ID:
+ return 0;
+ case QOS_L3_MBM_LOCAL_EVENT_ID:
+ return 1;
+ default:
+ /* Should never reach here */
+ return INVALID_CONFIG_INDEX;
+ }
+}
+
+void resctrl_arch_mon_event_config_read(void *_config_info)
+{
+ struct resctrl_mon_config_info *config_info = _config_info;
+ unsigned int index;
+ u64 msrval;
+
+ index = mon_event_config_index_get(config_info->evtid);
+ if (index == INVALID_CONFIG_INDEX) {
+ pr_warn_once("Invalid event id %d\n", config_info->evtid);
+ return;
+ }
+ rdmsrq(MSR_IA32_EVT_CFG_BASE + index, msrval);
+
+ /* Report only the valid event configuration bits */
+ config_info->mon_config = msrval & MAX_EVT_CONFIG_BITS;
+}
+
+void resctrl_arch_mon_event_config_write(void *_config_info)
+{
+ struct resctrl_mon_config_info *config_info = _config_info;
+ unsigned int index;
+
+ index = mon_event_config_index_get(config_info->evtid);
+ if (index == INVALID_CONFIG_INDEX) {
+ pr_warn_once("Invalid event id %d\n", config_info->evtid);
+ return;
+ }
+ wrmsrq(MSR_IA32_EVT_CFG_BASE + index, config_info->mon_config);
+}
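+
+/*
+ * Illustrative example, not part of this patch: the index returned by
+ * mon_event_config_index_get() selects the per-event configuration MSR
+ * used by the two helpers above:
+ *
+ *	QOS_L3_MBM_TOTAL_EVENT_ID -> MSR_IA32_EVT_CFG_BASE + 0
+ *	QOS_L3_MBM_LOCAL_EVENT_ID -> MSR_IA32_EVT_CFG_BASE + 1
+ */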
+
+static void l3_qos_cfg_update(void *arg)
+{
+ bool *enable = arg;
+
+ wrmsrq(MSR_IA32_L3_QOS_CFG, *enable ? L3_QOS_CDP_ENABLE : 0ULL);
+}
+
+static void l2_qos_cfg_update(void *arg)
+{
+ bool *enable = arg;
+
+ wrmsrq(MSR_IA32_L2_QOS_CFG, *enable ? L2_QOS_CDP_ENABLE : 0ULL);
+}
+
+static int set_cache_qos_cfg(int level, bool enable)
+{
+ void (*update)(void *arg);
+ struct rdt_ctrl_domain *d;
+ struct rdt_resource *r_l;
+ cpumask_var_t cpu_mask;
+ int cpu;
+
+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
+ if (level == RDT_RESOURCE_L3)
+ update = l3_qos_cfg_update;
+ else if (level == RDT_RESOURCE_L2)
+ update = l2_qos_cfg_update;
+ else
+ return -EINVAL;
+
+ if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ r_l = &rdt_resources_all[level].r_resctrl;
+ list_for_each_entry(d, &r_l->ctrl_domains, hdr.list) {
+ if (r_l->cache.arch_has_per_cpu_cfg)
+ /* Pick all the CPUs in the domain instance */
+ for_each_cpu(cpu, &d->hdr.cpu_mask)
+ cpumask_set_cpu(cpu, cpu_mask);
+ else
+ /* Pick one CPU from each domain instance to update MSR */
+ cpumask_set_cpu(cpumask_any(&d->hdr.cpu_mask), cpu_mask);
+ }
+
+ /* Update QOS_CFG MSR on all the CPUs in cpu_mask */
+ on_each_cpu_mask(cpu_mask, update, &enable, 1);
+
+ free_cpumask_var(cpu_mask);
+
+ return 0;
+}
+
+/* Restore the qos cfg state when a domain comes online */
+void rdt_domain_reconfigure_cdp(struct rdt_resource *r)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+
+ if (!r->cdp_capable)
+ return;
+
+ if (r->rid == RDT_RESOURCE_L2)
+ l2_qos_cfg_update(&hw_res->cdp_enabled);
+
+ if (r->rid == RDT_RESOURCE_L3)
+ l3_qos_cfg_update(&hw_res->cdp_enabled);
+}
+
+static int cdp_enable(int level)
+{
+ struct rdt_resource *r_l = &rdt_resources_all[level].r_resctrl;
+ int ret;
+
+ if (!r_l->alloc_capable)
+ return -EINVAL;
+
+ ret = set_cache_qos_cfg(level, true);
+ if (!ret)
+ rdt_resources_all[level].cdp_enabled = true;
+
+ return ret;
+}
+
+static void cdp_disable(int level)
+{
+ struct rdt_hw_resource *r_hw = &rdt_resources_all[level];
+
+ if (r_hw->cdp_enabled) {
+ set_cache_qos_cfg(level, false);
+ r_hw->cdp_enabled = false;
+ }
+}
+
+int resctrl_arch_set_cdp_enabled(enum resctrl_res_level l, bool enable)
+{
+ struct rdt_hw_resource *hw_res = &rdt_resources_all[l];
+
+ if (!hw_res->r_resctrl.cdp_capable)
+ return -EINVAL;
+
+ if (enable)
+ return cdp_enable(l);
+
+ cdp_disable(l);
+
+ return 0;
+}
+
+bool resctrl_arch_get_cdp_enabled(enum resctrl_res_level l)
+{
+ return rdt_resources_all[l].cdp_enabled;
+}
+
+void resctrl_arch_reset_all_ctrls(struct rdt_resource *r)
+{
+ struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
+ struct rdt_hw_ctrl_domain *hw_dom;
+ struct msr_param msr_param;
+ struct rdt_ctrl_domain *d;
+ int i;
+
+ /* Walking r->domains, ensure it can't race with cpuhp */
+ lockdep_assert_cpus_held();
+
+ msr_param.res = r;
+ msr_param.low = 0;
+ msr_param.high = hw_res->num_closid;
+
+ /*
+ * Disable resource control for this resource by setting all
+ * CBMs in all ctrl_domains to the maximum mask value. Pick one CPU
+ * from each domain to update the MSRs below.
+ */
+ list_for_each_entry(d, &r->ctrl_domains, hdr.list) {
+ hw_dom = resctrl_to_arch_ctrl_dom(d);
+
+ for (i = 0; i < hw_res->num_closid; i++)
+ hw_dom->ctrl_val[i] = resctrl_get_default_ctrl(r);
+ msr_param.dom = d;
+ smp_call_function_any(&d->hdr.cpu_mask, rdt_ctrl_update, &msr_param, 1);
+ }
+}
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
new file mode 100644
index 000000000000..42c7eac0c387
--- /dev/null
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -0,0 +1,93 @@
+/*
+ * Routines to identify additional cpu features that are scattered in
+ * cpuid space.
+ */
+#include <linux/cpu.h>
+
+#include <asm/memtype.h>
+#include <asm/apic.h>
+#include <asm/processor.h>
+
+#include "cpu.h"
+
+struct cpuid_bit {
+ u16 feature;
+ u8 reg;
+ u8 bit;
+ u32 level;
+ u32 sub_leaf;
+};
+
+/*
+ * Please keep the list sorted by cpuid_bit.level for faster search.
+ * X86_FEATURE_MBA is supported by both Intel and AMD, but the CPUID
+ * levels differ and there is a separate entry for each.
+ */
+static const struct cpuid_bit cpuid_bits[] = {
+ { X86_FEATURE_APERFMPERF, CPUID_ECX, 0, 0x00000006, 0 },
+ { X86_FEATURE_EPB, CPUID_ECX, 3, 0x00000006, 0 },
+ { X86_FEATURE_INTEL_PPIN, CPUID_EBX, 0, 0x00000007, 1 },
+ { X86_FEATURE_MSR_IMM, CPUID_ECX, 5, 0x00000007, 1 },
+ { X86_FEATURE_APX, CPUID_EDX, 21, 0x00000007, 1 },
+ { X86_FEATURE_RRSBA_CTRL, CPUID_EDX, 2, 0x00000007, 2 },
+ { X86_FEATURE_BHI_CTRL, CPUID_EDX, 4, 0x00000007, 2 },
+ { X86_FEATURE_CQM_LLC, CPUID_EDX, 1, 0x0000000f, 0 },
+ { X86_FEATURE_CQM_OCCUP_LLC, CPUID_EDX, 0, 0x0000000f, 1 },
+ { X86_FEATURE_CQM_MBM_TOTAL, CPUID_EDX, 1, 0x0000000f, 1 },
+ { X86_FEATURE_CQM_MBM_LOCAL, CPUID_EDX, 2, 0x0000000f, 1 },
+ { X86_FEATURE_CAT_L3, CPUID_EBX, 1, 0x00000010, 0 },
+ { X86_FEATURE_CAT_L2, CPUID_EBX, 2, 0x00000010, 0 },
+ { X86_FEATURE_CDP_L3, CPUID_ECX, 2, 0x00000010, 1 },
+ { X86_FEATURE_CDP_L2, CPUID_ECX, 2, 0x00000010, 2 },
+ { X86_FEATURE_MBA, CPUID_EBX, 3, 0x00000010, 0 },
+ { X86_FEATURE_PER_THREAD_MBA, CPUID_ECX, 0, 0x00000010, 3 },
+ { X86_FEATURE_SGX1, CPUID_EAX, 0, 0x00000012, 0 },
+ { X86_FEATURE_SGX2, CPUID_EAX, 1, 0x00000012, 0 },
+ { X86_FEATURE_SGX_EUPDATESVN, CPUID_EAX, 10, 0x00000012, 0 },
+ { X86_FEATURE_SGX_EDECCSSA, CPUID_EAX, 11, 0x00000012, 0 },
+ { X86_FEATURE_OVERFLOW_RECOV, CPUID_EBX, 0, 0x80000007, 0 },
+ { X86_FEATURE_SUCCOR, CPUID_EBX, 1, 0x80000007, 0 },
+ { X86_FEATURE_SMCA, CPUID_EBX, 3, 0x80000007, 0 },
+ { X86_FEATURE_HW_PSTATE, CPUID_EDX, 7, 0x80000007, 0 },
+ { X86_FEATURE_CPB, CPUID_EDX, 9, 0x80000007, 0 },
+ { X86_FEATURE_PROC_FEEDBACK, CPUID_EDX, 11, 0x80000007, 0 },
+ { X86_FEATURE_AMD_FAST_CPPC, CPUID_EDX, 15, 0x80000007, 0 },
+ { X86_FEATURE_MBA, CPUID_EBX, 6, 0x80000008, 0 },
+ { X86_FEATURE_X2AVIC_EXT, CPUID_ECX, 6, 0x8000000a, 0 },
+ { X86_FEATURE_COHERENCY_SFW_NO, CPUID_EBX, 31, 0x8000001f, 0 },
+ { X86_FEATURE_SMBA, CPUID_EBX, 2, 0x80000020, 0 },
+ { X86_FEATURE_BMEC, CPUID_EBX, 3, 0x80000020, 0 },
+ { X86_FEATURE_ABMC, CPUID_EBX, 5, 0x80000020, 0 },
+ { X86_FEATURE_SDCIAE, CPUID_EBX, 6, 0x80000020, 0 },
+ { X86_FEATURE_TSA_SQ_NO, CPUID_ECX, 1, 0x80000021, 0 },
+ { X86_FEATURE_TSA_L1_NO, CPUID_ECX, 2, 0x80000021, 0 },
+ { X86_FEATURE_AMD_WORKLOAD_CLASS, CPUID_EAX, 22, 0x80000021, 0 },
+ { X86_FEATURE_PERFMON_V2, CPUID_EAX, 0, 0x80000022, 0 },
+ { X86_FEATURE_AMD_LBR_V2, CPUID_EAX, 1, 0x80000022, 0 },
+ { X86_FEATURE_AMD_LBR_PMC_FREEZE, CPUID_EAX, 2, 0x80000022, 0 },
+ { X86_FEATURE_AMD_HTR_CORES, CPUID_EAX, 30, 0x80000026, 0 },
+ { 0, 0, 0, 0, 0 }
+};
+
+void init_scattered_cpuid_features(struct cpuinfo_x86 *c)
+{
+ u32 max_level;
+ u32 regs[4];
+ const struct cpuid_bit *cb;
+
+ for (cb = cpuid_bits; cb->feature; cb++) {
+
+ /* Verify that the level is valid */
+ max_level = cpuid_eax(cb->level & 0xffff0000);
+ if (max_level < cb->level ||
+ max_level > (cb->level | 0xffff))
+ continue;
+
+ cpuid_count(cb->level, cb->sub_leaf, &regs[CPUID_EAX],
+ &regs[CPUID_EBX], &regs[CPUID_ECX],
+ &regs[CPUID_EDX]);
+
+ if (regs[cb->reg] & (1 << cb->bit))
+ set_cpu_cap(c, cb->feature);
+ }
+}
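+
+/*
+ * Illustrative example, not part of this patch: for the X86_FEATURE_SMCA
+ * entry above (level 0x80000007) the loop first validates the leaf by
+ * reading the maximum supported level of its range:
+ *
+ *	u32 max = cpuid_eax(0x80000007 & 0xffff0000);  // cpuid_eax(0x80000000)
+ *
+ * The leaf is consulted only when max lies in [0x80000007, 0x8000ffff],
+ * i.e. the CPU actually implements extended leaf 0x80000007.
+ */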
diff --git a/arch/x86/kernel/cpu/sched.c b/arch/x86/kernel/cpu/sched.c
deleted file mode 100644
index a640ae5ad201..000000000000
--- a/arch/x86/kernel/cpu/sched.c
+++ /dev/null
@@ -1,55 +0,0 @@
-#include <linux/sched.h>
-#include <linux/math64.h>
-#include <linux/percpu.h>
-#include <linux/irqflags.h>
-
-#include <asm/cpufeature.h>
-#include <asm/processor.h>
-
-#ifdef CONFIG_SMP
-
-static DEFINE_PER_CPU(struct aperfmperf, old_perf_sched);
-
-static unsigned long scale_aperfmperf(void)
-{
- struct aperfmperf val, *old = &__get_cpu_var(old_perf_sched);
- unsigned long ratio, flags;
-
- local_irq_save(flags);
- get_aperfmperf(&val);
- local_irq_restore(flags);
-
- ratio = calc_aperfmperf_ratio(old, &val);
- *old = val;
-
- return ratio;
-}
-
-unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
-{
- /*
- * do aperf/mperf on the cpu level because it includes things
- * like turbo mode, which are relevant to full cores.
- */
- if (boot_cpu_has(X86_FEATURE_APERFMPERF))
- return scale_aperfmperf();
-
- /*
- * maybe have something cpufreq here
- */
-
- return default_scale_freq_power(sd, cpu);
-}
-
-unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
-{
- /*
- * aperf/mperf already includes the smt gain
- */
- if (boot_cpu_has(X86_FEATURE_APERFMPERF))
- return SCHED_LOAD_SCALE;
-
- return default_scale_smt_power(sd, cpu);
-}
-
-#endif
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
new file mode 100644
index 000000000000..9c1656779b2a
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -0,0 +1,6 @@
+obj-y += \
+ driver.o \
+ encl.o \
+ ioctl.o \
+ main.o
+obj-$(CONFIG_X86_SGX_KVM) += virt.o
diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
new file mode 100644
index 000000000000..a42c7180900b
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-20 Intel Corporation. */
+
+#include <linux/acpi.h>
+#include <linux/miscdevice.h>
+#include <linux/mman.h>
+#include <linux/security.h>
+#include <linux/suspend.h>
+#include <asm/traps.h>
+#include "driver.h"
+#include "encl.h"
+
+u64 sgx_attributes_reserved_mask;
+u64 sgx_xfrm_reserved_mask = ~0x3;
+u32 sgx_misc_reserved_mask;
+
+static int __sgx_open(struct inode *inode, struct file *file)
+{
+ struct sgx_encl *encl;
+ int ret;
+
+ encl = kzalloc(sizeof(*encl), GFP_KERNEL);
+ if (!encl)
+ return -ENOMEM;
+
+ kref_init(&encl->refcount);
+ xa_init(&encl->page_array);
+ mutex_init(&encl->lock);
+ INIT_LIST_HEAD(&encl->va_pages);
+ INIT_LIST_HEAD(&encl->mm_list);
+ spin_lock_init(&encl->mm_lock);
+
+ ret = init_srcu_struct(&encl->srcu);
+ if (ret) {
+ kfree(encl);
+ return ret;
+ }
+
+ file->private_data = encl;
+
+ return 0;
+}
+
+static int sgx_open(struct inode *inode, struct file *file)
+{
+ int ret;
+
+ ret = sgx_inc_usage_count();
+ if (ret)
+ return ret;
+
+ ret = __sgx_open(inode, file);
+ if (ret) {
+ sgx_dec_usage_count();
+ return ret;
+ }
+
+ return 0;
+}
+
+static int sgx_release(struct inode *inode, struct file *file)
+{
+ struct sgx_encl *encl = file->private_data;
+ struct sgx_encl_mm *encl_mm;
+
+ /*
+ * Drain the remaining mm_list entries. At this point the list contains
+ * entries for processes that have closed the enclave file but have not
+ * exited yet. Processes that have exited were already removed from the
+ * list by sgx_mmu_notifier_release().
+ */
+ for ( ; ; ) {
+ spin_lock(&encl->mm_lock);
+
+ if (list_empty(&encl->mm_list)) {
+ encl_mm = NULL;
+ } else {
+ encl_mm = list_first_entry(&encl->mm_list,
+ struct sgx_encl_mm, list);
+ list_del_rcu(&encl_mm->list);
+ }
+
+ spin_unlock(&encl->mm_lock);
+
+ /* The enclave is no longer mapped by any mm. */
+ if (!encl_mm)
+ break;
+
+ synchronize_srcu(&encl->srcu);
+ mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
+ kfree(encl_mm);
+
+ /* 'encl_mm' is gone, put encl_mm->encl reference: */
+ kref_put(&encl->refcount, sgx_encl_release);
+ }
+
+ kref_put(&encl->refcount, sgx_encl_release);
+ return 0;
+}
+
+static int sgx_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct sgx_encl *encl = file->private_data;
+ int ret;
+
+ ret = sgx_encl_may_map(encl, vma->vm_start, vma->vm_end, vma->vm_flags);
+ if (ret)
+ return ret;
+
+ ret = sgx_encl_mm_add(encl, vma->vm_mm);
+ if (ret)
+ return ret;
+
+ vma->vm_ops = &sgx_vm_ops;
+ vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP | VM_IO);
+ vma->vm_private_data = encl;
+
+ return 0;
+}
+
+static unsigned long sgx_get_unmapped_area(struct file *file,
+ unsigned long addr,
+ unsigned long len,
+ unsigned long pgoff,
+ unsigned long flags)
+{
+ if ((flags & MAP_TYPE) == MAP_PRIVATE)
+ return -EINVAL;
+
+ if (flags & MAP_FIXED)
+ return addr;
+
+ return mm_get_unmapped_area(file, addr, len, pgoff, flags);
+}
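+
+/*
+ * Illustrative userspace sketch, not part of this patch: because
+ * MAP_PRIVATE is rejected above, an enclave runtime maps the device
+ * with MAP_SHARED (simplified; a real runtime first sizes and lays
+ * out the enclave range):
+ *
+ *	int fd = open("/dev/sgx_enclave", O_RDWR);
+ *	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ *		       MAP_SHARED, fd, 0);
+ */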
+
+#ifdef CONFIG_COMPAT
+static long sgx_compat_ioctl(struct file *filep, unsigned int cmd,
+ unsigned long arg)
+{
+ return sgx_ioctl(filep, cmd, arg);
+}
+#endif
+
+static const struct file_operations sgx_encl_fops = {
+ .owner = THIS_MODULE,
+ .open = sgx_open,
+ .release = sgx_release,
+ .unlocked_ioctl = sgx_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = sgx_compat_ioctl,
+#endif
+ .mmap = sgx_mmap,
+ .get_unmapped_area = sgx_get_unmapped_area,
+};
+
+static struct miscdevice sgx_dev_enclave = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgx_enclave",
+ .nodename = "sgx_enclave",
+ .fops = &sgx_encl_fops,
+};
+
+int __init sgx_drv_init(void)
+{
+ unsigned int eax, ebx, ecx, edx;
+ u64 attr_mask;
+ u64 xfrm_mask;
+ int ret;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SGX_LC)) {
+ pr_info("SGX disabled: SGX launch control CPU feature is not available, /dev/sgx_enclave disabled.\n");
+ return -ENODEV;
+ }
+
+ cpuid_count(SGX_CPUID, 0, &eax, &ebx, &ecx, &edx);
+
+ if (!(eax & 1)) {
+ pr_info("SGX disabled: SGX1 instruction support not available, /dev/sgx_enclave disabled.\n");
+ return -ENODEV;
+ }
+
+ sgx_misc_reserved_mask = ~ebx | SGX_MISC_RESERVED_MASK;
+
+ cpuid_count(SGX_CPUID, 1, &eax, &ebx, &ecx, &edx);
+
+ attr_mask = (((u64)ebx) << 32) + (u64)eax;
+ sgx_attributes_reserved_mask = ~attr_mask | SGX_ATTR_RESERVED_MASK;
+
+ if (cpu_feature_enabled(X86_FEATURE_OSXSAVE)) {
+ xfrm_mask = (((u64)edx) << 32) + (u64)ecx;
+ sgx_xfrm_reserved_mask = ~xfrm_mask;
+ }
+
+ ret = misc_register(&sgx_dev_enclave);
+ if (ret) {
+ pr_info("SGX disabled: Unable to register the /dev/sgx_enclave driver (%d).\n", ret);
+ return ret;
+ }
+
+ return 0;
+}
diff --git a/arch/x86/kernel/cpu/sgx/driver.h b/arch/x86/kernel/cpu/sgx/driver.h
new file mode 100644
index 000000000000..30f39f92c98f
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/driver.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ARCH_SGX_DRIVER_H__
+#define __ARCH_SGX_DRIVER_H__
+
+#include <linux/kref.h>
+#include <linux/mmu_notifier.h>
+#include <linux/radix-tree.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/workqueue.h>
+#include <uapi/asm/sgx.h>
+#include "sgx.h"
+
+#define SGX_EINIT_SPIN_COUNT 20
+#define SGX_EINIT_SLEEP_COUNT 50
+#define SGX_EINIT_SLEEP_TIME 20
+
+extern u64 sgx_attributes_reserved_mask;
+extern u64 sgx_xfrm_reserved_mask;
+extern u32 sgx_misc_reserved_mask;
+
+extern const struct file_operations sgx_provision_fops;
+
+long sgx_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
+
+int sgx_drv_init(void);
+
+#endif /* __ARCH_SGX_DRIVER_H__ */
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
new file mode 100644
index 000000000000..cf149b9f4916
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -0,0 +1,1326 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-20 Intel Corporation. */
+
+#include <linux/lockdep.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/shmem_fs.h>
+#include <linux/suspend.h>
+#include <linux/sched/mm.h>
+#include <asm/sgx.h>
+#include "encl.h"
+#include "encls.h"
+#include "sgx.h"
+
+static int sgx_encl_lookup_backing(struct sgx_encl *encl, unsigned long page_index,
+ struct sgx_backing *backing);
+
+#define PCMDS_PER_PAGE (PAGE_SIZE / sizeof(struct sgx_pcmd))
+/*
+ * 32 PCMD entries share a PCMD page. PCMD_FIRST_MASK is used to
+ * determine the page index associated with the first PCMD entry
+ * within a PCMD page.
+ */
+#define PCMD_FIRST_MASK GENMASK(4, 0)
+
+/**
+ * reclaimer_writing_to_pcmd() - Query if any enclave page associated with
+ * a PCMD page is in process of being reclaimed.
+ * @encl: Enclave to which PCMD page belongs
+ * @start_addr: Address of enclave page using first entry within the PCMD page
+ *
+ * When an enclave page is reclaimed some Paging Crypto MetaData (PCMD) is
+ * stored. The PCMD data of a reclaimed enclave page contains enough
+ * information for the processor to verify the page at the time
+ * it is loaded back into the Enclave Page Cache (EPC).
+ *
+ * The backing storage to which enclave pages are reclaimed is laid out as
+ * follows:
+ * Encrypted enclave pages:SECS page:PCMD pages
+ *
+ * Each PCMD page contains the PCMD metadata of
+ * PAGE_SIZE/sizeof(struct sgx_pcmd) enclave pages.
+ *
+ * A PCMD page can only be truncated if it is (a) empty, and (b) not in the
+ * process of getting data (and thus soon being non-empty). (b) is tested with
+ * a check if an enclave page sharing the PCMD page is in the process of being
+ * reclaimed.
+ *
+ * The reclaimer sets the SGX_ENCL_PAGE_BEING_RECLAIMED flag when it
+ * intends to reclaim that enclave page - it means that the PCMD page
+ * associated with that enclave page is about to get some data and thus
+ * even if the PCMD page is empty, it should not be truncated.
+ *
+ * Context: Enclave mutex (&sgx_encl->lock) must be held.
+ * Return: 1 if the reclaimer is about to write to the PCMD page
+ * 0 if the reclaimer has no intention to write to the PCMD page
+ */
+static int reclaimer_writing_to_pcmd(struct sgx_encl *encl,
+ unsigned long start_addr)
+{
+ int reclaimed = 0;
+ int i;
+
+ /*
+ * PCMD_FIRST_MASK is based on number of PCMD entries within
+ * PCMD page being 32.
+ */
+ BUILD_BUG_ON(PCMDS_PER_PAGE != 32);
+
+ for (i = 0; i < PCMDS_PER_PAGE; i++) {
+ struct sgx_encl_page *entry;
+ unsigned long addr;
+
+ addr = start_addr + i * PAGE_SIZE;
+
+ /*
+ * Stop when reaching the SECS page - it does not
+ * have a page_array entry and its reclaim is
+ * started and completed with enclave mutex held so
+ * it does not use the SGX_ENCL_PAGE_BEING_RECLAIMED
+ * flag.
+ */
+ if (addr == encl->base + encl->size)
+ break;
+
+ entry = xa_load(&encl->page_array, PFN_DOWN(addr));
+ if (!entry)
+ continue;
+
+ /*
+ * VA page slot ID uses same bit as the flag so it is important
+ * to ensure that the page is not already in backing store.
+ */
+ if (entry->epc_page &&
+ (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)) {
+ reclaimed = 1;
+ break;
+ }
+ }
+
+ return reclaimed;
+}
+
+/*
+ * Calculate byte offset of a PCMD struct associated with an enclave page. PCMDs
+ * follow right after the EPC data in the backing storage. In addition to the
+ * visible enclave pages, there's one extra page slot for SECS, before PCMD
+ * structs.
+ */
+static inline pgoff_t sgx_encl_get_backing_page_pcmd_offset(struct sgx_encl *encl,
+ unsigned long page_index)
+{
+ pgoff_t epc_end_off = encl->size + sizeof(struct sgx_secs);
+
+ return epc_end_off + page_index * sizeof(struct sgx_pcmd);
+}
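+
+/*
+ * Illustrative example, not part of this patch: for a 16-page (64 KiB)
+ * enclave, the backing store holds the 16 encrypted pages, then the
+ * page-sized SECS slot, then the PCMD area, so page_index 3 has its
+ * PCMD entry at byte offset 65536 + 4096 + 3 * sizeof(struct sgx_pcmd).
+ */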
+
+/*
+ * Free a page from the backing storage in the given page index.
+ */
+static inline void sgx_encl_truncate_backing_page(struct sgx_encl *encl, unsigned long page_index)
+{
+ struct inode *inode = file_inode(encl->backing);
+
+ shmem_truncate_range(inode, PFN_PHYS(page_index), PFN_PHYS(page_index) + PAGE_SIZE - 1);
+}
+
+/*
+ * ELDU: Load an EPC page as unblocked. For more info, see "OS Management of EPC
+ * Pages" in the SDM.
+ */
+static int __sgx_encl_eldu(struct sgx_encl_page *encl_page,
+ struct sgx_epc_page *epc_page,
+ struct sgx_epc_page *secs_page)
+{
+ unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK;
+ struct sgx_encl *encl = encl_page->encl;
+ pgoff_t page_index, page_pcmd_off;
+ unsigned long pcmd_first_page;
+ struct sgx_pageinfo pginfo;
+ struct sgx_backing b;
+ bool pcmd_page_empty;
+ u8 *pcmd_page;
+ int ret;
+
+ if (secs_page)
+ page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
+ else
+ page_index = PFN_DOWN(encl->size);
+
+ /*
+ * Address of enclave page using the first entry within the PCMD page.
+ */
+ pcmd_first_page = PFN_PHYS(page_index & ~PCMD_FIRST_MASK) + encl->base;
+
+ page_pcmd_off = sgx_encl_get_backing_page_pcmd_offset(encl, page_index);
+
+ ret = sgx_encl_lookup_backing(encl, page_index, &b);
+ if (ret)
+ return ret;
+
+ pginfo.addr = encl_page->desc & PAGE_MASK;
+ pginfo.contents = (unsigned long)kmap_local_page(b.contents);
+ pcmd_page = kmap_local_page(b.pcmd);
+ pginfo.metadata = (unsigned long)pcmd_page + b.pcmd_offset;
+
+ if (secs_page)
+ pginfo.secs = (u64)sgx_get_epc_virt_addr(secs_page);
+ else
+ pginfo.secs = 0;
+
+ ret = __eldu(&pginfo, sgx_get_epc_virt_addr(epc_page),
+ sgx_get_epc_virt_addr(encl_page->va_page->epc_page) + va_offset);
+ if (ret) {
+ if (encls_failed(ret))
+ ENCLS_WARN(ret, "ELDU");
+
+ ret = -EFAULT;
+ }
+
+ memset(pcmd_page + b.pcmd_offset, 0, sizeof(struct sgx_pcmd));
+ set_page_dirty(b.pcmd);
+
+ /*
+ * The area for the PCMD in the page was zeroed above. Check if the
+ * whole page is now empty, meaning that all PCMDs have been zeroed:
+ */
+ pcmd_page_empty = !memchr_inv(pcmd_page, 0, PAGE_SIZE);
+
+ kunmap_local(pcmd_page);
+ kunmap_local((void *)(unsigned long)pginfo.contents);
+
+ get_page(b.pcmd);
+ sgx_encl_put_backing(&b);
+
+ sgx_encl_truncate_backing_page(encl, page_index);
+
+ if (pcmd_page_empty && !reclaimer_writing_to_pcmd(encl, pcmd_first_page)) {
+ sgx_encl_truncate_backing_page(encl, PFN_DOWN(page_pcmd_off));
+ pcmd_page = kmap_local_page(b.pcmd);
+ if (memchr_inv(pcmd_page, 0, PAGE_SIZE))
+ pr_warn("PCMD page not empty after truncate.\n");
+ kunmap_local(pcmd_page);
+ }
+
+ put_page(b.pcmd);
+
+ return ret;
+}
+
+static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
+ struct sgx_epc_page *secs_page)
+{
+
+ unsigned long va_offset = encl_page->desc & SGX_ENCL_PAGE_VA_OFFSET_MASK;
+ struct sgx_encl *encl = encl_page->encl;
+ struct sgx_epc_page *epc_page;
+ int ret;
+
+ epc_page = sgx_alloc_epc_page(encl_page, false);
+ if (IS_ERR(epc_page))
+ return epc_page;
+
+ ret = __sgx_encl_eldu(encl_page, epc_page, secs_page);
+ if (ret) {
+ sgx_encl_free_epc_page(epc_page);
+ return ERR_PTR(ret);
+ }
+
+ sgx_free_va_slot(encl_page->va_page, va_offset);
+ list_move(&encl_page->va_page->list, &encl->va_pages);
+ encl_page->desc &= ~SGX_ENCL_PAGE_VA_OFFSET_MASK;
+ encl_page->epc_page = epc_page;
+
+ return epc_page;
+}
+
+/*
+ * Ensure the SECS page is not swapped out. Must be called with encl->lock
+ * to protect the enclave states including SECS and ensure the SECS page is
+ * not swapped out again while being used.
+ */
+static struct sgx_epc_page *sgx_encl_load_secs(struct sgx_encl *encl)
+{
+ struct sgx_epc_page *epc_page = encl->secs.epc_page;
+
+ if (!epc_page)
+ epc_page = sgx_encl_eldu(&encl->secs, NULL);
+
+ return epc_page;
+}
+
+static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
+ struct sgx_encl_page *entry)
+{
+ struct sgx_epc_page *epc_page;
+
+ /* Entry successfully located. */
+ if (entry->epc_page) {
+ if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+ return ERR_PTR(-EBUSY);
+
+ return entry;
+ }
+
+ epc_page = sgx_encl_load_secs(encl);
+ if (IS_ERR(epc_page))
+ return ERR_CAST(epc_page);
+
+ epc_page = sgx_encl_eldu(entry, encl->secs.epc_page);
+ if (IS_ERR(epc_page))
+ return ERR_CAST(epc_page);
+
+ encl->secs_child_cnt++;
+ sgx_mark_page_reclaimable(entry->epc_page);
+
+ return entry;
+}
+
+static struct sgx_encl_page *sgx_encl_load_page_in_vma(struct sgx_encl *encl,
+ unsigned long addr,
+ vm_flags_t vm_flags)
+{
+ unsigned long vm_prot_bits = vm_flags & VM_ACCESS_FLAGS;
+ struct sgx_encl_page *entry;
+
+ entry = xa_load(&encl->page_array, PFN_DOWN(addr));
+ if (!entry)
+ return ERR_PTR(-EFAULT);
+
+ /*
+ * Verify that the page has equal or higher build time
+ * permissions than the VMA permissions (i.e. the subset of {VM_READ,
+ * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
+ */
+ if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
+ return ERR_PTR(-EFAULT);
+
+ return __sgx_encl_load_page(encl, entry);
+}
+
+struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
+ unsigned long addr)
+{
+ struct sgx_encl_page *entry;
+
+ entry = xa_load(&encl->page_array, PFN_DOWN(addr));
+ if (!entry)
+ return ERR_PTR(-EFAULT);
+
+ return __sgx_encl_load_page(encl, entry);
+}
+
+/**
+ * sgx_encl_eaug_page() - Dynamically add page to initialized enclave
+ * @vma: VMA obtained from fault info from where page is accessed
+ * @encl: enclave accessing the page
+ * @addr: address that triggered the page fault
+ *
+ * When an initialized enclave accesses a page with no backing EPC page
+ * on a SGX2 system then the EPC can be added dynamically via the SGX2
+ * ENCLS[EAUG] instruction.
+ *
+ * Returns: Appropriate vm_fault_t: VM_FAULT_NOPAGE when PTE was installed
+ * successfully, VM_FAULT_SIGBUS or VM_FAULT_OOM as error otherwise.
+ */
+static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
+ struct sgx_encl *encl, unsigned long addr)
+{
+ vm_fault_t vmret = VM_FAULT_SIGBUS;
+ struct sgx_pageinfo pginfo = {0};
+ struct sgx_encl_page *encl_page;
+ struct sgx_epc_page *epc_page;
+ struct sgx_va_page *va_page;
+ unsigned long phys_addr;
+ u64 secinfo_flags;
+ int ret;
+
+ if (!test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
+ return VM_FAULT_SIGBUS;
+
+ /*
+ * Ignore internal permission checking for dynamically added pages.
+ * They matter only for data added during the pre-initialization
+ * phase. The enclave decides the permissions by the means of
+ * EACCEPT, EACCEPTCOPY and EMODPE.
+ */
+ secinfo_flags = SGX_SECINFO_R | SGX_SECINFO_W | SGX_SECINFO_X;
+ encl_page = sgx_encl_page_alloc(encl, addr - encl->base, secinfo_flags);
+ if (IS_ERR(encl_page))
+ return VM_FAULT_OOM;
+
+ mutex_lock(&encl->lock);
+
+ epc_page = sgx_encl_load_secs(encl);
+ if (IS_ERR(epc_page)) {
+ if (PTR_ERR(epc_page) == -EBUSY)
+ vmret = VM_FAULT_NOPAGE;
+ goto err_out_unlock;
+ }
+
+ epc_page = sgx_alloc_epc_page(encl_page, false);
+ if (IS_ERR(epc_page)) {
+ if (PTR_ERR(epc_page) == -EBUSY)
+ vmret = VM_FAULT_NOPAGE;
+ goto err_out_unlock;
+ }
+
+ va_page = sgx_encl_grow(encl, false);
+ if (IS_ERR(va_page)) {
+ if (PTR_ERR(va_page) == -EBUSY)
+ vmret = VM_FAULT_NOPAGE;
+ goto err_out_epc;
+ }
+
+ if (va_page)
+ list_add(&va_page->list, &encl->va_pages);
+
+ ret = xa_insert(&encl->page_array, PFN_DOWN(encl_page->desc),
+ encl_page, GFP_KERNEL);
+ /*
+ * If ret == -EBUSY then the page was created in another flow while
+ * running without encl->lock.
+ */
+ if (ret)
+ goto err_out_shrink;
+
+ pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
+ pginfo.addr = encl_page->desc & PAGE_MASK;
+ pginfo.metadata = 0;
+
+ ret = __eaug(&pginfo, sgx_get_epc_virt_addr(epc_page));
+ if (ret)
+ goto err_out;
+
+ encl_page->encl = encl;
+ encl_page->epc_page = epc_page;
+ encl_page->type = SGX_PAGE_TYPE_REG;
+ encl->secs_child_cnt++;
+
+ sgx_mark_page_reclaimable(encl_page->epc_page);
+
+ phys_addr = sgx_get_epc_phys_addr(epc_page);
+ /*
+ * Do not undo everything when creating PTE entry fails - next #PF
+ * would find page ready for a PTE.
+ */
+ vmret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
+ if (vmret != VM_FAULT_NOPAGE) {
+ mutex_unlock(&encl->lock);
+ return VM_FAULT_SIGBUS;
+ }
+ mutex_unlock(&encl->lock);
+ return VM_FAULT_NOPAGE;
+
+err_out:
+ xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
+
+err_out_shrink:
+ sgx_encl_shrink(encl, va_page);
+err_out_epc:
+ sgx_encl_free_epc_page(epc_page);
+err_out_unlock:
+ mutex_unlock(&encl->lock);
+ kfree(encl_page);
+
+ return vmret;
+}
+
+static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
+{
+ unsigned long addr = (unsigned long)vmf->address;
+ struct vm_area_struct *vma = vmf->vma;
+ struct sgx_encl_page *entry;
+ unsigned long phys_addr;
+ struct sgx_encl *encl;
+ vm_fault_t ret;
+
+ encl = vma->vm_private_data;
+
+ /*
+ * It's very unlikely but possible that allocating memory for the
+ * mm_list entry of a forked process failed in sgx_vma_open(). When
+ * this happens, vm_private_data is set to NULL.
+ */
+ if (unlikely(!encl))
+ return VM_FAULT_SIGBUS;
+
+ /*
+ * The page_array keeps track of all enclave pages, whether they
+ * are swapped out or not. If there is no entry for this page and
+ * the system supports SGX2 then it is possible to dynamically add
+ * a new enclave page. This is possible only for an initialized
+ * enclave, which is checked for right away.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_SGX2) &&
+ (!xa_load(&encl->page_array, PFN_DOWN(addr))))
+ return sgx_encl_eaug_page(vma, encl, addr);
+
+ mutex_lock(&encl->lock);
+
+ entry = sgx_encl_load_page_in_vma(encl, addr, vma->vm_flags);
+ if (IS_ERR(entry)) {
+ mutex_unlock(&encl->lock);
+
+ if (PTR_ERR(entry) == -EBUSY)
+ return VM_FAULT_NOPAGE;
+
+ return VM_FAULT_SIGBUS;
+ }
+
+ phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
+
+ ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
+ if (ret != VM_FAULT_NOPAGE) {
+ mutex_unlock(&encl->lock);
+
+ return VM_FAULT_SIGBUS;
+ }
+
+ sgx_encl_test_and_clear_young(vma->vm_mm, entry);
+ mutex_unlock(&encl->lock);
+
+ return VM_FAULT_NOPAGE;
+}
+
+static void sgx_vma_open(struct vm_area_struct *vma)
+{
+ struct sgx_encl *encl = vma->vm_private_data;
+
+ /*
+ * It's possible but unlikely that vm_private_data is NULL. This can
+ * happen in a grandchild of a process, when sgx_encl_mm_add() had
+ * failed to allocate memory in this callback.
+ */
+ if (unlikely(!encl))
+ return;
+
+ if (sgx_encl_mm_add(encl, vma->vm_mm))
+ vma->vm_private_data = NULL;
+}
+
+
+/**
+ * sgx_encl_may_map() - Check if a requested VMA mapping is allowed
+ * @encl: an enclave pointer
+ * @start: lower bound of the address range, inclusive
+ * @end: upper bound of the address range, exclusive
+ * @vm_flags: VMA flags
+ *
+ * Iterate through the enclave pages contained within [@start, @end) to verify
+ * that the requested permissions, a subset of {VM_READ, VM_WRITE, VM_EXEC},
+ * do not include any permission that is missing from the build time
+ * permissions of any of the enclave pages within the given address range.
+ *
+ * An enclave creator must declare the strongest permissions that will be
+ * needed for each enclave page. This ensures that mappings have the identical
+ * or weaker permissions than the earlier declared permissions.
+ *
+ * Return: 0 on success, -EACCES otherwise
+ */
+int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
+ unsigned long end, vm_flags_t vm_flags)
+{
+ vm_flags_t vm_prot_bits = vm_flags & VM_ACCESS_FLAGS;
+ struct sgx_encl_page *page;
+ unsigned long count = 0;
+ int ret = 0;
+
+ XA_STATE(xas, &encl->page_array, PFN_DOWN(start));
+
+ /* Disallow mapping outside enclave's address range. */
+ if (test_bit(SGX_ENCL_INITIALIZED, &encl->flags) &&
+ (start < encl->base || end > encl->base + encl->size))
+ return -EACCES;
+
+ /*
+ * Disallow READ_IMPLIES_EXEC tasks as their VMA permissions might
+ * conflict with the enclave page permissions.
+ */
+ if (current->personality & READ_IMPLIES_EXEC)
+ return -EACCES;
+
+ mutex_lock(&encl->lock);
+ xas_lock(&xas);
+ xas_for_each(&xas, page, PFN_DOWN(end - 1)) {
+ if (~page->vm_max_prot_bits & vm_prot_bits) {
+ ret = -EACCES;
+ break;
+ }
+
+ /* Reschedule on every XA_CHECK_SCHED iteration. */
+ if (!(++count % XA_CHECK_SCHED)) {
+ xas_pause(&xas);
+ xas_unlock(&xas);
+ mutex_unlock(&encl->lock);
+
+ cond_resched();
+
+ mutex_lock(&encl->lock);
+ xas_lock(&xas);
+ }
+ }
+ xas_unlock(&xas);
+ mutex_unlock(&encl->lock);
+
+ return ret;
+}
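+
+/*
+ * Illustrative example, not part of this patch: if every page in the
+ * range was added with SECINFO permissions RW, vm_max_prot_bits covers
+ * VM_READ | VM_WRITE, so from userspace:
+ *
+ *	mmap(addr, len, PROT_READ, MAP_SHARED, fd, 0);   // allowed
+ *	mprotect(addr, len, PROT_READ | PROT_EXEC);      // fails, -EACCES
+ *
+ * because VM_EXEC is not contained in the build-time permissions.
+ */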
+
+static int sgx_vma_mprotect(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, unsigned long newflags)
+{
+ return sgx_encl_may_map(vma->vm_private_data, start, end, newflags);
+}
+
+static int sgx_encl_debug_read(struct sgx_encl *encl, struct sgx_encl_page *page,
+ unsigned long addr, void *data)
+{
+ unsigned long offset = addr & ~PAGE_MASK;
+ int ret;
+
+
+ ret = __edbgrd(sgx_get_epc_virt_addr(page->epc_page) + offset, data);
+ if (ret)
+ return -EIO;
+
+ return 0;
+}
+
+static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *page,
+ unsigned long addr, void *data)
+{
+ unsigned long offset = addr & ~PAGE_MASK;
+ int ret;
+
+ ret = __edbgwr(sgx_get_epc_virt_addr(page->epc_page) + offset, data);
+ if (ret)
+ return -EIO;
+
+ return 0;
+}
+
+/*
+ * Load an enclave page to EPC if required, and take encl->lock.
+ */
+static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
+ unsigned long addr,
+ vm_flags_t vm_flags)
+{
+ struct sgx_encl_page *entry;
+
+ for ( ; ; ) {
+ mutex_lock(&encl->lock);
+
+ entry = sgx_encl_load_page_in_vma(encl, addr, vm_flags);
+ if (PTR_ERR(entry) != -EBUSY)
+ break;
+
+ mutex_unlock(&encl->lock);
+ }
+
+ if (IS_ERR(entry))
+ mutex_unlock(&encl->lock);
+
+ return entry;
+}
+
+static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
+ void *buf, int len, int write)
+{
+ struct sgx_encl *encl = vma->vm_private_data;
+ struct sgx_encl_page *entry = NULL;
+ char data[sizeof(unsigned long)];
+ unsigned long align;
+ int offset;
+ int cnt;
+ int ret = 0;
+ int i;
+
+ /*
+ * If process was forked, VMA is still there but vm_private_data is set
+ * to NULL.
+ */
+ if (!encl)
+ return -EFAULT;
+
+ if (!test_bit(SGX_ENCL_DEBUG, &encl->flags))
+ return -EFAULT;
+
+ for (i = 0; i < len; i += cnt) {
+ entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
+ vma->vm_flags);
+ if (IS_ERR(entry)) {
+ ret = PTR_ERR(entry);
+ break;
+ }
+
+ align = ALIGN_DOWN(addr + i, sizeof(unsigned long));
+ offset = (addr + i) & (sizeof(unsigned long) - 1);
+ cnt = sizeof(unsigned long) - offset;
+ cnt = min(cnt, len - i);
+
+ ret = sgx_encl_debug_read(encl, entry, align, data);
+ if (ret)
+ goto out;
+
+ if (write) {
+ memcpy(data + offset, buf + i, cnt);
+ ret = sgx_encl_debug_write(encl, entry, align, data);
+ if (ret)
+ goto out;
+ } else {
+ memcpy(buf + i, data + offset, cnt);
+ }
+
+out:
+ mutex_unlock(&encl->lock);
+
+ if (ret)
+ break;
+ }
+
+ return ret < 0 ? ret : i;
+}
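+
+/*
+ * Illustrative example, not part of this patch: EDBGRD/EDBGWR move
+ * naturally aligned 8-byte quantities, hence the chunking above.
+ * Reading len = 10 bytes at addr = 0x1003 proceeds as:
+ *
+ *	i = 0: align = 0x1000, offset = 3, cnt = 5   (0x1003..0x1007)
+ *	i = 5: align = 0x1008, offset = 0, cnt = 5   (0x1008..0x100c)
+ */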
+
+const struct vm_operations_struct sgx_vm_ops = {
+ .fault = sgx_vma_fault,
+ .mprotect = sgx_vma_mprotect,
+ .open = sgx_vma_open,
+ .access = sgx_vma_access,
+};
+
+/**
+ * sgx_encl_release - Destroy an enclave instance
+ * @ref: address of a kref inside &sgx_encl
+ *
+ * Used together with kref_put(). Frees all the resources associated with the
+ * enclave and the instance itself.
+ */
+void sgx_encl_release(struct kref *ref)
+{
+ struct sgx_encl *encl = container_of(ref, struct sgx_encl, refcount);
+ unsigned long max_page_index = PFN_DOWN(encl->base + encl->size - 1);
+ struct sgx_va_page *va_page;
+ struct sgx_encl_page *entry;
+ unsigned long count = 0;
+
+ XA_STATE(xas, &encl->page_array, PFN_DOWN(encl->base));
+
+ xas_lock(&xas);
+ xas_for_each(&xas, entry, max_page_index) {
+ if (entry->epc_page) {
+ /*
+ * The page and its radix tree entry cannot be freed
+ * if the page is being held by the reclaimer.
+ */
+ if (sgx_unmark_page_reclaimable(entry->epc_page))
+ continue;
+
+ sgx_encl_free_epc_page(entry->epc_page);
+ encl->secs_child_cnt--;
+ entry->epc_page = NULL;
+ }
+
+ kfree(entry);
+ /*
+ * Invoke scheduler on every XA_CHECK_SCHED iteration
+ * to prevent soft lockups.
+ */
+ if (!(++count % XA_CHECK_SCHED)) {
+ xas_pause(&xas);
+ xas_unlock(&xas);
+
+ cond_resched();
+
+ xas_lock(&xas);
+ }
+ }
+ xas_unlock(&xas);
+
+ xa_destroy(&encl->page_array);
+
+ if (!encl->secs_child_cnt && encl->secs.epc_page) {
+ sgx_encl_free_epc_page(encl->secs.epc_page);
+ encl->secs.epc_page = NULL;
+ }
+
+ while (!list_empty(&encl->va_pages)) {
+ va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
+ list);
+ list_del(&va_page->list);
+ sgx_encl_free_epc_page(va_page->epc_page);
+ kfree(va_page);
+ }
+
+ if (encl->backing)
+ fput(encl->backing);
+
+ cleanup_srcu_struct(&encl->srcu);
+
+ WARN_ON_ONCE(!list_empty(&encl->mm_list));
+
+ /* Detect EPC page leaks. */
+ WARN_ON_ONCE(encl->secs_child_cnt);
+ WARN_ON_ONCE(encl->secs.epc_page);
+
+ kfree(encl);
+ sgx_dec_usage_count();
+}
+
+/*
+ * 'mm' is exiting and no longer needs mmu notifications.
+ */
+static void sgx_mmu_notifier_release(struct mmu_notifier *mn,
+ struct mm_struct *mm)
+{
+ struct sgx_encl_mm *encl_mm = container_of(mn, struct sgx_encl_mm, mmu_notifier);
+ struct sgx_encl_mm *tmp = NULL;
+ bool found = false;
+
+ /*
+ * The enclave itself can remove encl_mm. Note, objects can't be moved
+ * off an RCU protected list, but deletion is ok.
+ */
+ spin_lock(&encl_mm->encl->mm_lock);
+ list_for_each_entry(tmp, &encl_mm->encl->mm_list, list) {
+ if (tmp == encl_mm) {
+ list_del_rcu(&encl_mm->list);
+ found = true;
+ break;
+ }
+ }
+ spin_unlock(&encl_mm->encl->mm_lock);
+
+ if (found) {
+ synchronize_srcu(&encl_mm->encl->srcu);
+ mmu_notifier_put(mn);
+ }
+}
+
+static void sgx_mmu_notifier_free(struct mmu_notifier *mn)
+{
+ struct sgx_encl_mm *encl_mm = container_of(mn, struct sgx_encl_mm, mmu_notifier);
+
+ /* 'encl_mm' is going away, put encl_mm->encl reference: */
+ kref_put(&encl_mm->encl->refcount, sgx_encl_release);
+
+ kfree(encl_mm);
+}
+
+static const struct mmu_notifier_ops sgx_mmu_notifier_ops = {
+ .release = sgx_mmu_notifier_release,
+ .free_notifier = sgx_mmu_notifier_free,
+};
+
+static struct sgx_encl_mm *sgx_encl_find_mm(struct sgx_encl *encl,
+ struct mm_struct *mm)
+{
+ struct sgx_encl_mm *encl_mm = NULL;
+ struct sgx_encl_mm *tmp;
+ int idx;
+
+ idx = srcu_read_lock(&encl->srcu);
+
+ list_for_each_entry_rcu(tmp, &encl->mm_list, list) {
+ if (tmp->mm == mm) {
+ encl_mm = tmp;
+ break;
+ }
+ }
+
+ srcu_read_unlock(&encl->srcu, idx);
+
+ return encl_mm;
+}
+
+int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)
+{
+ struct sgx_encl_mm *encl_mm;
+ int ret;
+
+ /*
+ * Even though a single enclave may be mapped into an mm more than once,
+ * each 'mm' only appears once on encl->mm_list. This is guaranteed by
+ * holding the mm's mmap lock for write before an mm can be added to
+ * or removed from an encl->mm_list.
+ */
+ mmap_assert_write_locked(mm);
+
+ /*
+ * It's possible that an entry already exists in the mm_list, because it
+ * is removed only on VFS release or process exit.
+ */
+ if (sgx_encl_find_mm(encl, mm))
+ return 0;
+
+ encl_mm = kzalloc(sizeof(*encl_mm), GFP_KERNEL);
+ if (!encl_mm)
+ return -ENOMEM;
+
+ /* Grab a refcount for the encl_mm->encl reference: */
+ kref_get(&encl->refcount);
+ encl_mm->encl = encl;
+ encl_mm->mm = mm;
+ encl_mm->mmu_notifier.ops = &sgx_mmu_notifier_ops;
+
+ ret = __mmu_notifier_register(&encl_mm->mmu_notifier, mm);
+ if (ret) {
+ kfree(encl_mm);
+ return ret;
+ }
+
+ spin_lock(&encl->mm_lock);
+ list_add_rcu(&encl_mm->list, &encl->mm_list);
+ /* Pairs with smp_rmb() in sgx_zap_enclave_ptes(). */
+ smp_wmb();
+ encl->mm_list_version++;
+ spin_unlock(&encl->mm_lock);
+
+ return 0;
+}
+
+/**
+ * sgx_encl_cpumask() - Query which CPUs might be accessing the enclave
+ * @encl: the enclave
+ *
+ * Some SGX functions require that no cached linear-to-physical address
+ * mappings are present before they can succeed. For example, ENCLS[EWB]
+ * copies a page from the enclave page cache to regular main memory but
+ * it fails if it cannot ensure that there are no cached
+ * linear-to-physical address mappings referring to the page.
+ *
+ * SGX hardware flushes all cached linear-to-physical mappings on a CPU
+ * when an enclave is exited via ENCLU[EEXIT] or an Asynchronous Enclave
+ * Exit (AEX). Exiting an enclave will thus ensure cached linear-to-physical
+ * address mappings are cleared but coordination with the tracking done within
+ * the SGX hardware is needed to support the SGX functions that depend on this
+ * cache clearing.
+ *
+ * When the ENCLS[ETRACK] function is issued on an enclave the hardware
+ * tracks threads operating inside the enclave at that time. The SGX
+ * hardware tracking requires that all the identified threads have
+ * exited the enclave in order to flush the mappings before a function
+ * such as ENCLS[EWB] is permitted.
+ *
+ * The following flow is used to support SGX functions that require that
+ * no cached linear-to-physical address mappings are present:
+ * 1) Execute ENCLS[ETRACK] to initiate hardware tracking.
+ * 2) Use this function (sgx_encl_cpumask()) to query which CPUs might be
+ * accessing the enclave.
+ * 3) Send IPI to identified CPUs, kicking them out of the enclave and
+ * thus flushing all locally cached linear-to-physical address mappings.
+ * 4) Execute SGX function.
+ *
+ * Context: It is required to call this function after ENCLS[ETRACK].
+ * This will ensure that if any new mm appears (racing with
+ * sgx_encl_mm_add()) then the new mm will enter into the
+ * enclave with fresh linear-to-physical address mappings.
+ *
+ * It is required that all IPIs are completed before a new
+ * ENCLS[ETRACK] is issued so be sure to protect steps 1 to 3
+ * of the above flow with the enclave's mutex.
+ *
+ * Return: cpumask of CPUs that might be accessing @encl
+ */
+const cpumask_t *sgx_encl_cpumask(struct sgx_encl *encl)
+{
+ cpumask_t *cpumask = &encl->cpumask;
+ struct sgx_encl_mm *encl_mm;
+ int idx;
+
+ cpumask_clear(cpumask);
+
+ idx = srcu_read_lock(&encl->srcu);
+
+ list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+ if (!mmget_not_zero(encl_mm->mm))
+ continue;
+
+ cpumask_or(cpumask, cpumask, mm_cpumask(encl_mm->mm));
+
+ mmput_async(encl_mm->mm);
+ }
+
+ srcu_read_unlock(&encl->srcu, idx);
+
+ return cpumask;
+}
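+
+/*
+ * Illustrative sketch, not part of this patch, of the four-step flow
+ * documented above (the IPI callback name is hypothetical):
+ *
+ *	mutex_lock(&encl->lock);
+ *	__etrack(sgx_get_epc_virt_addr(encl->secs.epc_page));  // step 1
+ *	on_each_cpu_mask(sgx_encl_cpumask(encl),               // steps 2-3
+ *			 sgx_ipi_cb, NULL, 1);
+ *	// step 4: ENCLS[EWB] may now succeed
+ *	mutex_unlock(&encl->lock);
+ */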
+
+static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl,
+ pgoff_t index)
+{
+ struct address_space *mapping = encl->backing->f_mapping;
+ gfp_t gfpmask = mapping_gfp_mask(mapping);
+
+ return shmem_read_mapping_page_gfp(mapping, index, gfpmask);
+}
+
+/**
+ * __sgx_encl_get_backing() - Pin the backing storage
+ * @encl: an enclave pointer
+ * @page_index: enclave page index
+ * @backing: data for accessing backing storage for the page
+ *
+ * Pin the backing storage pages for storing the encrypted contents and Paging
+ * Crypto MetaData (PCMD) of an enclave page.
+ *
+ * Return:
+ * 0 on success,
+ * -errno otherwise.
+ */
+static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
+ struct sgx_backing *backing)
+{
+ pgoff_t page_pcmd_off = sgx_encl_get_backing_page_pcmd_offset(encl, page_index);
+ struct page *contents;
+ struct page *pcmd;
+
+ contents = sgx_encl_get_backing_page(encl, page_index);
+ if (IS_ERR(contents))
+ return PTR_ERR(contents);
+
+ pcmd = sgx_encl_get_backing_page(encl, PFN_DOWN(page_pcmd_off));
+ if (IS_ERR(pcmd)) {
+ put_page(contents);
+ return PTR_ERR(pcmd);
+ }
+
+ backing->contents = contents;
+ backing->pcmd = pcmd;
+ backing->pcmd_offset = page_pcmd_off & (PAGE_SIZE - 1);
+
+ return 0;
+}
+
+/*
+ * When called from ksgxd, returns the mem_cgroup of a struct mm stored
+ * in the enclave's mm_list. When not called from ksgxd, just returns
+ * the mem_cgroup of the current task.
+ */
+static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
+{
+ struct mem_cgroup *memcg = NULL;
+ struct sgx_encl_mm *encl_mm;
+ int idx;
+
+ /*
+ * If called from normal task context, return the mem_cgroup
+ * of the current task's mm. The remainder of the handling is for
+ * ksgxd.
+ */
+ if (!current_is_ksgxd())
+ return get_mem_cgroup_from_mm(current->mm);
+
+ /*
+ * Search the enclave's mm_list to find an mm associated with
+ * this enclave to charge the allocation to.
+ */
+ idx = srcu_read_lock(&encl->srcu);
+
+ list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+ if (!mmget_not_zero(encl_mm->mm))
+ continue;
+
+ memcg = get_mem_cgroup_from_mm(encl_mm->mm);
+
+ mmput_async(encl_mm->mm);
+
+ break;
+ }
+
+ srcu_read_unlock(&encl->srcu, idx);
+
+ /*
+ * In the rare case that there isn't an mm associated with
+ * the enclave, set memcg to the current active mem_cgroup.
+ * This will be the root mem_cgroup if there is no active
+ * mem_cgroup.
+ */
+ if (!memcg)
+ return get_mem_cgroup_from_mm(NULL);
+
+ return memcg;
+}
+
+/**
+ * sgx_encl_alloc_backing() - create a new backing storage page
+ * @encl: an enclave pointer
+ * @page_index: enclave page index
+ * @backing: data for accessing backing storage for the page
+ *
+ * When called from ksgxd, sets the active memcg from one of the
+ * mms in the enclave's mm_list prior to any backing page allocation,
+ * in order to ensure that shmem page allocations are charged to the
+ * enclave. Create a backing page for loading data back into an EPC page with
+ * ELDU. This function takes a reference on a new backing page which
+ * must be dropped with a corresponding call to sgx_encl_put_backing().
+ *
+ * Return:
+ * 0 on success,
+ * -errno otherwise.
+ */
+int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
+ struct sgx_backing *backing)
+{
+ struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
+ struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
+ int ret;
+
+ ret = __sgx_encl_get_backing(encl, page_index, backing);
+
+ set_active_memcg(memcg);
+ mem_cgroup_put(encl_memcg);
+
+ return ret;
+}
+
+/**
+ * sgx_encl_lookup_backing() - retrieve an existing backing storage page
+ * @encl: an enclave pointer
+ * @page_index: enclave page index
+ * @backing: data for accessing backing storage for the page
+ *
+ * Retrieve a backing page for loading data back into an EPC page with ELDU.
+ * It is the caller's responsibility to ensure that it is appropriate to use
+ * sgx_encl_lookup_backing() rather than sgx_encl_alloc_backing(). If lookup is
+ * not used correctly, this will cause an allocation which is not accounted for.
+ * This function takes a reference on an existing backing page which must be
+ * dropped with a corresponding call to sgx_encl_put_backing().
+ *
+ * Return:
+ * 0 on success,
+ * -errno otherwise.
+ */
+static int sgx_encl_lookup_backing(struct sgx_encl *encl, unsigned long page_index,
+ struct sgx_backing *backing)
+{
+ return __sgx_encl_get_backing(encl, page_index, backing);
+}
+
+/**
+ * sgx_encl_put_backing() - Unpin the backing storage
+ * @backing: data for accessing backing storage for the page
+ */
+void sgx_encl_put_backing(struct sgx_backing *backing)
+{
+ put_page(backing->pcmd);
+ put_page(backing->contents);
+}
+
+static int sgx_encl_test_and_clear_young_cb(pte_t *ptep, unsigned long addr,
+ void *data)
+{
+ pte_t pte;
+ int ret;
+
+ ret = pte_young(*ptep);
+ if (ret) {
+ pte = pte_mkold(*ptep);
+ set_pte_at((struct mm_struct *)data, addr, ptep, pte);
+ }
+
+ return ret;
+}
+
+/**
+ * sgx_encl_test_and_clear_young() - Test and reset the accessed bit
+ * @mm: mm_struct that is checked
+ * @page: enclave page to be tested for recent access
+ *
+ * Checks the Access (A) bit from the PTE corresponding to the enclave page and
+ * clears it.
+ *
+ * Return: 1 if the page has been recently accessed and 0 if not.
+ */
+int sgx_encl_test_and_clear_young(struct mm_struct *mm,
+ struct sgx_encl_page *page)
+{
+ unsigned long addr = page->desc & PAGE_MASK;
+ struct sgx_encl *encl = page->encl;
+ struct vm_area_struct *vma;
+ int ret;
+
+ ret = sgx_encl_find(mm, addr, &vma);
+ if (ret)
+ return 0;
+
+ if (encl != vma->vm_private_data)
+ return 0;
+
+ ret = apply_to_page_range(vma->vm_mm, addr, PAGE_SIZE,
+ sgx_encl_test_and_clear_young_cb, vma->vm_mm);
+ if (ret < 0)
+ return 0;
+
+ return ret;
+}
+
+struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
+ unsigned long offset,
+ u64 secinfo_flags)
+{
+ struct sgx_encl_page *encl_page;
+ unsigned long prot;
+
+ encl_page = kzalloc(sizeof(*encl_page), GFP_KERNEL);
+ if (!encl_page)
+ return ERR_PTR(-ENOMEM);
+
+ encl_page->desc = encl->base + offset;
+ encl_page->encl = encl;
+
+ prot = _calc_vm_trans(secinfo_flags, SGX_SECINFO_R, PROT_READ) |
+ _calc_vm_trans(secinfo_flags, SGX_SECINFO_W, PROT_WRITE) |
+ _calc_vm_trans(secinfo_flags, SGX_SECINFO_X, PROT_EXEC);
+
+ /*
+ * TCS pages must always have RW set for CPU access while the SECINFO
+ * permissions are *always* zero - the CPU ignores the user provided
+ * values and silently overwrites them with zero permissions.
+ */
+ if ((secinfo_flags & SGX_SECINFO_PAGE_TYPE_MASK) == SGX_SECINFO_TCS)
+ prot |= PROT_READ | PROT_WRITE;
+
+ /* Calculate maximum of the VM flags for the page. */
+ encl_page->vm_max_prot_bits = calc_vm_prot_bits(prot, 0);
+
+ return encl_page;
+}
+
+/**
+ * sgx_zap_enclave_ptes() - remove PTEs mapping the address from enclave
+ * @encl: the enclave
+ * @addr: page aligned pointer to single page for which PTEs will be removed
+ *
+ * Multiple VMAs may have an enclave page mapped. Remove the PTE mapping
+ * @addr from each VMA. Ensure that page fault handler is ready to handle
+ * new mappings of @addr before calling this function.
+ */
+void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
+{
+ unsigned long mm_list_version;
+ struct sgx_encl_mm *encl_mm;
+ struct vm_area_struct *vma;
+ int idx, ret;
+
+ do {
+ mm_list_version = encl->mm_list_version;
+
+ /* Pairs with smp_wmb() in sgx_encl_mm_add(). */
+ smp_rmb();
+
+ idx = srcu_read_lock(&encl->srcu);
+
+ list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+ if (!mmget_not_zero(encl_mm->mm))
+ continue;
+
+ mmap_read_lock(encl_mm->mm);
+
+ ret = sgx_encl_find(encl_mm->mm, addr, &vma);
+ if (!ret && encl == vma->vm_private_data)
+ zap_vma_ptes(vma, addr, PAGE_SIZE);
+
+ mmap_read_unlock(encl_mm->mm);
+
+ mmput_async(encl_mm->mm);
+ }
+
+ srcu_read_unlock(&encl->srcu, idx);
+ } while (unlikely(encl->mm_list_version != mm_list_version));
+}
+
+/**
+ * sgx_alloc_va_page() - Allocate a Version Array (VA) page
+ * @reclaim: Reclaim EPC pages directly if none available. Enclave
+ * mutex should not be held if this is set.
+ *
+ * Allocate a free EPC page and convert it to a Version Array (VA) page.
+ *
+ * Return:
+ * a VA page,
+ * -errno otherwise
+ */
+struct sgx_epc_page *sgx_alloc_va_page(bool reclaim)
+{
+ struct sgx_epc_page *epc_page;
+ int ret;
+
+ epc_page = sgx_alloc_epc_page(NULL, reclaim);
+ if (IS_ERR(epc_page))
+ return ERR_CAST(epc_page);
+
+ ret = __epa(sgx_get_epc_virt_addr(epc_page));
+ if (ret) {
+ WARN_ONCE(1, "EPA returned %d (0x%x)", ret, ret);
+ sgx_encl_free_epc_page(epc_page);
+ return ERR_PTR(-EFAULT);
+ }
+
+ return epc_page;
+}
+
+/**
+ * sgx_alloc_va_slot - allocate a VA slot
+ * @va_page: a &struct sgx_va_page instance
+ *
+ * Allocates a slot from a &struct sgx_va_page instance.
+ *
+ * Return: offset of the slot inside the VA page
+ */
+unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page)
+{
+ int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
+
+ if (slot < SGX_VA_SLOT_COUNT)
+ set_bit(slot, va_page->slots);
+
+ return slot << 3;
+}
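+
+/*
+ * Illustrative example, not part of this patch: a VA page holds 512
+ * 8-byte version slots (512 * 8 = PAGE_SIZE), so bit number and byte
+ * offset convert by a 3-bit shift: slot bit 5 maps to byte offset
+ * 5 << 3 = 40, and sgx_free_va_slot() inverts this with offset >> 3.
+ */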
+
+/**
+ * sgx_free_va_slot - free a VA slot
+ * @va_page: a &struct sgx_va_page instance
+ * @offset: offset of the slot inside the VA page
+ *
+ * Frees a slot from a &struct sgx_va_page instance.
+ */
+void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset)
+{
+ clear_bit(offset >> 3, va_page->slots);
+}
+
+/**
+ * sgx_va_page_full - is the VA page full?
+ * @va_page: a &struct sgx_va_page instance
+ *
+ * Return: true if all slots have been taken
+ */
+bool sgx_va_page_full(struct sgx_va_page *va_page)
+{
+ int slot = find_first_zero_bit(va_page->slots, SGX_VA_SLOT_COUNT);
+
+ return slot == SGX_VA_SLOT_COUNT;
+}
+
+/**
+ * sgx_encl_free_epc_page - free an EPC page assigned to an enclave
+ * @page: EPC page to be freed
+ *
+ * Free an EPC page assigned to an enclave. It does EREMOVE for the page, and
+ * only upon success, it puts the page back to free page list. Otherwise, it
+ * gives a WARNING to indicate the page is leaked.
+ */
+void sgx_encl_free_epc_page(struct sgx_epc_page *page)
+{
+ int ret;
+
+ WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+
+ ret = __eremove(sgx_get_epc_virt_addr(page));
+ if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
+ return;
+
+ sgx_free_epc_page(page);
+}
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
new file mode 100644
index 000000000000..8ff47f6652b9
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/**
+ * Copyright(c) 2016-20 Intel Corporation.
+ *
+ * Contains the software defined data structures for enclaves.
+ */
+#ifndef _X86_ENCL_H
+#define _X86_ENCL_H
+
+#include <linux/cpumask.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+#include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/srcu.h>
+#include <linux/workqueue.h>
+#include <linux/xarray.h>
+#include "sgx.h"
+
+/* 'desc' bits holding the offset in the VA (version array) page. */
+#define SGX_ENCL_PAGE_VA_OFFSET_MASK GENMASK_ULL(11, 3)
+
+/* 'desc' bit marking that the page is being reclaimed. */
+#define SGX_ENCL_PAGE_BEING_RECLAIMED BIT(3)
+
+struct sgx_encl_page {
+ unsigned long desc;
+ unsigned long vm_max_prot_bits:8;
+ enum sgx_page_type type:16;
+ struct sgx_epc_page *epc_page;
+ struct sgx_encl *encl;
+ struct sgx_va_page *va_page;
+};
+
+enum sgx_encl_flags {
+ SGX_ENCL_IOCTL = BIT(0),
+ SGX_ENCL_DEBUG = BIT(1),
+ SGX_ENCL_CREATED = BIT(2),
+ SGX_ENCL_INITIALIZED = BIT(3),
+};
+
+struct sgx_encl_mm {
+ struct sgx_encl *encl;
+ struct mm_struct *mm;
+ struct list_head list;
+ struct mmu_notifier mmu_notifier;
+};
+
+struct sgx_encl {
+ unsigned long base;
+ unsigned long size;
+ unsigned long flags;
+ unsigned int page_cnt;
+ unsigned int secs_child_cnt;
+ struct mutex lock;
+ struct xarray page_array;
+ struct sgx_encl_page secs;
+ unsigned long attributes;
+ unsigned long attributes_mask;
+
+ cpumask_t cpumask;
+ struct file *backing;
+ struct kref refcount;
+ struct list_head va_pages;
+ unsigned long mm_list_version;
+ struct list_head mm_list;
+ spinlock_t mm_lock;
+ struct srcu_struct srcu;
+};
+
+#define SGX_VA_SLOT_COUNT 512
+
+struct sgx_va_page {
+ struct sgx_epc_page *epc_page;
+ DECLARE_BITMAP(slots, SGX_VA_SLOT_COUNT);
+ struct list_head list;
+};
+
+struct sgx_backing {
+ struct page *contents;
+ struct page *pcmd;
+ unsigned long pcmd_offset;
+};
+
+extern const struct vm_operations_struct sgx_vm_ops;
+
+static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
+ struct vm_area_struct **vma)
+{
+ struct vm_area_struct *result;
+
+ result = vma_lookup(mm, addr);
+ if (!result || result->vm_ops != &sgx_vm_ops)
+ return -EINVAL;
+
+ *vma = result;
+
+ return 0;
+}
+
+int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
+ unsigned long end, vm_flags_t vm_flags);
+
+bool current_is_ksgxd(void);
+void sgx_encl_release(struct kref *ref);
+int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
+const cpumask_t *sgx_encl_cpumask(struct sgx_encl *encl);
+int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
+ struct sgx_backing *backing);
+void sgx_encl_put_backing(struct sgx_backing *backing);
+int sgx_encl_test_and_clear_young(struct mm_struct *mm,
+ struct sgx_encl_page *page);
+struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl,
+ unsigned long offset,
+ u64 secinfo_flags);
+void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
+struct sgx_epc_page *sgx_alloc_va_page(bool reclaim);
+unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
+void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
+bool sgx_va_page_full(struct sgx_va_page *va_page);
+void sgx_encl_free_epc_page(struct sgx_epc_page *page);
+struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
+ unsigned long addr);
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim);
+void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page);
+
+#endif /* _X86_ENCL_H */
diff --git a/arch/x86/kernel/cpu/sgx/encls.h b/arch/x86/kernel/cpu/sgx/encls.h
new file mode 100644
index 000000000000..74be751199a4
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/encls.h
@@ -0,0 +1,241 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_ENCLS_H
+#define _X86_ENCLS_H
+
+#include <linux/bitops.h>
+#include <linux/err.h>
+#include <linux/io.h>
+#include <linux/rwsem.h>
+#include <linux/types.h>
+#include <asm/asm.h>
+#include <asm/traps.h>
+#include "sgx.h"
+
+/* Retrieve the encoded trapnr from the specified return code. */
+#define ENCLS_TRAPNR(r) ((r) & ~SGX_ENCLS_FAULT_FLAG)
+
+/* Issue a WARN() about an ENCLS function. */
+#define ENCLS_WARN(r, name) \
+ do { \
+ int _r = (r); \
+ WARN_ONCE(_r, "%s returned %d (0x%x)\n", (name), _r, _r); \
+ } while (0)
+
+/*
+ * encls_faulted() - Check if an ENCLS leaf faulted given an error code
+ * @ret: the return value of an ENCLS leaf function call
+ *
+ * Return:
+ * - true: ENCLS leaf faulted.
+ * - false: Otherwise.
+ */
+static inline bool encls_faulted(int ret)
+{
+ return ret & SGX_ENCLS_FAULT_FLAG;
+}
+
+/**
+ * encls_failed() - Check if an ENCLS function failed
+ * @ret: the return value of an ENCLS function call
+ *
+ * Check if an ENCLS function failed. This happens when the function causes a
+ * fault that is not caused by an EPCM conflict or when the function returns a
+ * non-zero value.
+ */
+static inline bool encls_failed(int ret)
+{
+ if (encls_faulted(ret))
+ return ENCLS_TRAPNR(ret) != X86_TRAP_PF;
+
+ return !!ret;
+}
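+
+/*
+ * Typical caller pattern (an illustrative sketch): a wrapper's return value
+ * encodes either a fault or an SGX error code, and callers discriminate with
+ * the helpers above:
+ *
+ * ret = __einit(sigstruct, token, secs);
+ * if (encls_faulted(ret))
+ * trapnr = ENCLS_TRAPNR(ret);
+ * else if (ret)
+ * sgx_err = ret;
+ */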
+
+/**
+ * __encls_ret_N - encode an ENCLS function that returns an error code in EAX
+ * @rax: function number
+ * @inputs: asm inputs for the function
+ *
+ * Emit assembly for an ENCLS function that returns an error code, e.g. EREMOVE.
+ * And because SGX isn't complex enough as it is, functions that return an error
+ * code also modify flags.
+ *
+ * Return:
+ * 0 on success,
+ * SGX error code on failure
+ */
+#define __encls_ret_N(rax, inputs...) \
+ ({ \
+ int ret; \
+ asm volatile( \
+ "1: encls\n" \
+ "2:\n" \
+ _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_FAULT_SGX) \
+ : "=a"(ret) \
+ : "a"(rax), inputs \
+ : "memory", "cc"); \
+ ret; \
+ })
+
+#define __encls_ret_1(rax, rcx) \
+ ({ \
+ __encls_ret_N(rax, "c"(rcx)); \
+ })
+
+#define __encls_ret_2(rax, rbx, rcx) \
+ ({ \
+ __encls_ret_N(rax, "b"(rbx), "c"(rcx)); \
+ })
+
+#define __encls_ret_3(rax, rbx, rcx, rdx) \
+ ({ \
+ __encls_ret_N(rax, "b"(rbx), "c"(rcx), "d"(rdx)); \
+ })
+
+/**
+ * __encls_N - encode an ENCLS function that doesn't return an error code
+ * @rax: function number
+ * @rbx_out: optional output variable
+ * @inputs: asm inputs for the function
+ *
+ * Emit assembly for an ENCLS function that does not return an error code, e.g.
+ * ECREATE. Leaves without error codes either succeed or fault. @rbx_out is an
+ * optional parameter for use by EDGBRD, which returns the requested value in
+ * RBX.
+ *
+ * Return:
+ * 0 on success,
+ * trapnr with SGX_ENCLS_FAULT_FLAG set on fault
+ */
+#define __encls_N(rax, rbx_out, inputs...) \
+ ({ \
+ int ret; \
+ asm volatile( \
+ "1: encls\n\t" \
+ "xor %%eax,%%eax\n" \
+ "2:\n" \
+ _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_FAULT_SGX) \
+ : "=a"(ret), "=b"(rbx_out) \
+ : "a"(rax), inputs \
+ : "memory"); \
+ ret; \
+ })
+
+#define __encls_2(rax, rbx, rcx) \
+ ({ \
+ unsigned long ign_rbx_out; \
+ __encls_N(rax, ign_rbx_out, "b"(rbx), "c"(rcx)); \
+ })
+
+#define __encls_1_1(rax, data, rcx) \
+ ({ \
+ unsigned long rbx_out; \
+ int ret = __encls_N(rax, rbx_out, "c"(rcx)); \
+ if (!ret) \
+ data = rbx_out; \
+ ret; \
+ })
+
+/* Initialize an EPC page into an SGX Enclave Control Structure (SECS) page. */
+static inline int __ecreate(struct sgx_pageinfo *pginfo, void *secs)
+{
+ return __encls_2(ECREATE, pginfo, secs);
+}
+
+/* Hash a 256 byte region of an enclave page to SECS:MRENCLAVE. */
+static inline int __eextend(void *secs, void *addr)
+{
+ return __encls_2(EEXTEND, secs, addr);
+}
+
+/*
+ * Associate an EPC page with an enclave, either as a REG or TCS page
+ * populated with the provided data.
+ */
+static inline int __eadd(struct sgx_pageinfo *pginfo, void *addr)
+{
+ return __encls_2(EADD, pginfo, addr);
+}
+
+/* Finalize enclave build, initialize enclave for user code execution. */
+static inline int __einit(void *sigstruct, void *token, void *secs)
+{
+ return __encls_ret_3(EINIT, sigstruct, secs, token);
+}
+
+/* Disassociate EPC page from its enclave and mark it as unused. */
+static inline int __eremove(void *addr)
+{
+ return __encls_ret_1(EREMOVE, addr);
+}
+
+/* Copy data to an EPC page belonging to a debug enclave. */
+static inline int __edbgwr(void *addr, unsigned long *data)
+{
+ return __encls_2(EDGBWR, *data, addr);
+}
+
+/* Copy data from an EPC page belonging to a debug enclave. */
+static inline int __edbgrd(void *addr, unsigned long *data)
+{
+ return __encls_1_1(EDGBRD, *data, addr);
+}
+
+/* Track that software has completed the required TLB address clears. */
+static inline int __etrack(void *addr)
+{
+ return __encls_ret_1(ETRACK, addr);
+}
+
+/* Load, verify, and unblock an EPC page. */
+static inline int __eldu(struct sgx_pageinfo *pginfo, void *addr,
+ void *va)
+{
+ return __encls_ret_3(ELDU, pginfo, addr, va);
+}
+
+/* Make EPC page inaccessible to enclave, ready to be written to memory. */
+static inline int __eblock(void *addr)
+{
+ return __encls_ret_1(EBLOCK, addr);
+}
+
+/* Initialize an EPC page into a Version Array (VA) page. */
+static inline int __epa(void *addr)
+{
+ unsigned long rbx = SGX_PAGE_TYPE_VA;
+
+ return __encls_2(EPA, rbx, addr);
+}
+
+/* Invalidate an EPC page and write it out to main memory. */
+static inline int __ewb(struct sgx_pageinfo *pginfo, void *addr,
+ void *va)
+{
+ return __encls_ret_3(EWB, pginfo, addr, va);
+}
+
+/* Restrict the EPCM permissions of an EPC page. */
+static inline int __emodpr(struct sgx_secinfo *secinfo, void *addr)
+{
+ return __encls_ret_2(EMODPR, secinfo, addr);
+}
+
+/* Change the type of an EPC page. */
+static inline int __emodt(struct sgx_secinfo *secinfo, void *addr)
+{
+ return __encls_ret_2(EMODT, secinfo, addr);
+}
+
+/* Zero a page of EPC memory and add it to an initialized enclave. */
+static inline int __eaug(struct sgx_pageinfo *pginfo, void *addr)
+{
+ return __encls_2(EAUG, pginfo, addr);
+}
+
+/* Attempt to update CPUSVN at runtime. */
+static inline int __eupdatesvn(void)
+{
+ return __encls_ret_1(EUPDATESVN, "");
+}
+#endif /* _X86_ENCLS_H */
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
new file mode 100644
index 000000000000..66f1efa16fbb
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -0,0 +1,1244 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-20 Intel Corporation. */
+
+#include <asm/mman.h>
+#include <asm/sgx.h>
+#include <crypto/sha2.h>
+#include <linux/mman.h>
+#include <linux/delay.h>
+#include <linux/file.h>
+#include <linux/hashtable.h>
+#include <linux/highmem.h>
+#include <linux/ratelimit.h>
+#include <linux/sched/signal.h>
+#include <linux/shmem_fs.h>
+#include <linux/slab.h>
+#include <linux/suspend.h>
+#include "driver.h"
+#include "encl.h"
+#include "encls.h"
+
+struct sgx_va_page *sgx_encl_grow(struct sgx_encl *encl, bool reclaim)
+{
+ struct sgx_va_page *va_page = NULL;
+ void *err;
+
+ BUILD_BUG_ON(SGX_VA_SLOT_COUNT !=
+ (SGX_ENCL_PAGE_VA_OFFSET_MASK >> 3) + 1);
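+ /*
+ * The BUILD_BUG_ON above holds because GENMASK_ULL(11, 3) >> 3 is
+ * 0x1ff, so the VA offset field in 'desc' can index exactly
+ * 0x1ff + 1 == 512 slots, matching SGX_VA_SLOT_COUNT.
+ */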
+
+ if (!(encl->page_cnt % SGX_VA_SLOT_COUNT)) {
+ va_page = kzalloc(sizeof(*va_page), GFP_KERNEL);
+ if (!va_page)
+ return ERR_PTR(-ENOMEM);
+
+ va_page->epc_page = sgx_alloc_va_page(reclaim);
+ if (IS_ERR(va_page->epc_page)) {
+ err = ERR_CAST(va_page->epc_page);
+ kfree(va_page);
+ return err;
+ }
+
+ WARN_ON_ONCE(encl->page_cnt % SGX_VA_SLOT_COUNT);
+ }
+ encl->page_cnt++;
+ return va_page;
+}
+
+void sgx_encl_shrink(struct sgx_encl *encl, struct sgx_va_page *va_page)
+{
+ encl->page_cnt--;
+
+ if (va_page) {
+ sgx_encl_free_epc_page(va_page->epc_page);
+ list_del(&va_page->list);
+ kfree(va_page);
+ }
+}
+
+static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
+{
+ struct sgx_epc_page *secs_epc;
+ struct sgx_va_page *va_page;
+ struct sgx_pageinfo pginfo;
+ struct sgx_secinfo secinfo;
+ unsigned long encl_size;
+ struct file *backing;
+ long ret;
+
+ /*
+ * ECREATE would detect this too, but checking here also ensures
+ * that the 'encl_size' calculations below can never overflow.
+ */
+ if (!is_power_of_2(secs->size))
+ return -EINVAL;
+
+ va_page = sgx_encl_grow(encl, true);
+ if (IS_ERR(va_page))
+ return PTR_ERR(va_page);
+ else if (va_page)
+ list_add(&va_page->list, &encl->va_pages);
+ /* else the tail page of the VA page list had free slots. */
+
+ /* The extra page goes to SECS. */
+ encl_size = secs->size + PAGE_SIZE;
+
+ backing = shmem_file_setup("SGX backing", encl_size + (encl_size >> 5),
+ VM_NORESERVE);
+ if (IS_ERR(backing)) {
+ ret = PTR_ERR(backing);
+ goto err_out_shrink;
+ }
+
+ encl->backing = backing;
+
+ secs_epc = sgx_alloc_epc_page(&encl->secs, true);
+ if (IS_ERR(secs_epc)) {
+ ret = PTR_ERR(secs_epc);
+ goto err_out_backing;
+ }
+
+ encl->secs.epc_page = secs_epc;
+
+ pginfo.addr = 0;
+ pginfo.contents = (unsigned long)secs;
+ pginfo.metadata = (unsigned long)&secinfo;
+ pginfo.secs = 0;
+ memset(&secinfo, 0, sizeof(secinfo));
+
+ ret = __ecreate((void *)&pginfo, sgx_get_epc_virt_addr(secs_epc));
+ if (ret) {
+ ret = -EIO;
+ goto err_out;
+ }
+
+ if (secs->attributes & SGX_ATTR_DEBUG)
+ set_bit(SGX_ENCL_DEBUG, &encl->flags);
+
+ encl->secs.encl = encl;
+ encl->secs.type = SGX_PAGE_TYPE_SECS;
+ encl->base = secs->base;
+ encl->size = secs->size;
+ encl->attributes = secs->attributes;
+ encl->attributes_mask = SGX_ATTR_UNPRIV_MASK;
+
+ /* Set only after completion, as encl->lock has not been taken. */
+ set_bit(SGX_ENCL_CREATED, &encl->flags);
+
+ return 0;
+
+err_out:
+ sgx_encl_free_epc_page(encl->secs.epc_page);
+ encl->secs.epc_page = NULL;
+
+err_out_backing:
+ fput(encl->backing);
+ encl->backing = NULL;
+
+err_out_shrink:
+ sgx_encl_shrink(encl, va_page);
+
+ return ret;
+}
+
+/**
+ * sgx_ioc_enclave_create() - handler for %SGX_IOC_ENCLAVE_CREATE
+ * @encl: An enclave pointer.
+ * @arg: The ioctl argument.
+ *
+ * Allocate kernel data structures for the enclave and invoke ECREATE.
+ *
+ * Return:
+ * - 0: Success.
+ * - -EIO: ECREATE failed.
+ * - -errno: POSIX error.
+ */
+static long sgx_ioc_enclave_create(struct sgx_encl *encl, void __user *arg)
+{
+ struct sgx_enclave_create create_arg;
+ void *secs;
+ int ret;
+
+ if (test_bit(SGX_ENCL_CREATED, &encl->flags))
+ return -EINVAL;
+
+ if (copy_from_user(&create_arg, arg, sizeof(create_arg)))
+ return -EFAULT;
+
+ secs = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!secs)
+ return -ENOMEM;
+
+ if (copy_from_user(secs, (void __user *)create_arg.src, PAGE_SIZE))
+ ret = -EFAULT;
+ else
+ ret = sgx_encl_create(encl, secs);
+
+ kfree(secs);
+ return ret;
+}
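+
+/*
+ * Illustrative userspace call for the handler above (a sketch; 'encl_fd' and
+ * 'secs_buf' are hypothetical names for an open /dev/sgx_enclave fd and a
+ * page-sized SECS buffer):
+ *
+ * struct sgx_enclave_create create = { .src = (__u64)secs_buf };
+ *
+ * ioctl(encl_fd, SGX_IOC_ENCLAVE_CREATE, &create);
+ */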
+
+static int sgx_validate_secinfo(struct sgx_secinfo *secinfo)
+{
+ u64 perm = secinfo->flags & SGX_SECINFO_PERMISSION_MASK;
+ u64 pt = secinfo->flags & SGX_SECINFO_PAGE_TYPE_MASK;
+
+ if (pt != SGX_SECINFO_REG && pt != SGX_SECINFO_TCS)
+ return -EINVAL;
+
+ if ((perm & SGX_SECINFO_W) && !(perm & SGX_SECINFO_R))
+ return -EINVAL;
+
+ /*
+ * The CPU silently overwrites TCS permissions with zero, which means
+ * that we need to validate them ourselves.
+ */
+ if (pt == SGX_SECINFO_TCS && perm)
+ return -EINVAL;
+
+ if (secinfo->flags & SGX_SECINFO_RESERVED_MASK)
+ return -EINVAL;
+
+ if (memchr_inv(secinfo->reserved, 0, sizeof(secinfo->reserved)))
+ return -EINVAL;
+
+ return 0;
+}
+
+static int __sgx_encl_add_page(struct sgx_encl *encl,
+ struct sgx_encl_page *encl_page,
+ struct sgx_epc_page *epc_page,
+ struct sgx_secinfo *secinfo, unsigned long src)
+{
+ struct sgx_pageinfo pginfo;
+ struct vm_area_struct *vma;
+ struct page *src_page;
+ int ret;
+
+ /* Deny noexec. */
+ vma = find_vma(current->mm, src);
+ if (!vma)
+ return -EFAULT;
+
+ if (!(vma->vm_flags & VM_MAYEXEC))
+ return -EACCES;
+
+ ret = get_user_pages(src, 1, 0, &src_page);
+ if (ret < 1)
+ return -EFAULT;
+
+ pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
+ pginfo.addr = encl_page->desc & PAGE_MASK;
+ pginfo.metadata = (unsigned long)secinfo;
+ pginfo.contents = (unsigned long)kmap_local_page(src_page);
+
+ ret = __eadd(&pginfo, sgx_get_epc_virt_addr(epc_page));
+
+ kunmap_local((void *)pginfo.contents);
+ put_page(src_page);
+
+ return ret ? -EIO : 0;
+}
+
+/*
+ * If the caller requires measurement of the page as proof of its content,
+ * use EEXTEND to add a measurement for 256 bytes of the page. Repeat this
+ * operation until the entire page is measured.
+ */
+static int __sgx_encl_extend(struct sgx_encl *encl,
+ struct sgx_epc_page *epc_page)
+{
+ unsigned long offset;
+ int ret;
+
+ for (offset = 0; offset < PAGE_SIZE; offset += SGX_EEXTEND_BLOCK_SIZE) {
+ ret = __eextend(sgx_get_epc_virt_addr(encl->secs.epc_page),
+ sgx_get_epc_virt_addr(epc_page) + offset);
+ if (ret) {
+ if (encls_failed(ret))
+ ENCLS_WARN(ret, "EEXTEND");
+
+ return -EIO;
+ }
+ }
+
+ return 0;
+}
+
+static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
+ unsigned long offset, struct sgx_secinfo *secinfo,
+ unsigned long flags)
+{
+ struct sgx_encl_page *encl_page;
+ struct sgx_epc_page *epc_page;
+ struct sgx_va_page *va_page;
+ int ret;
+
+ encl_page = sgx_encl_page_alloc(encl, offset, secinfo->flags);
+ if (IS_ERR(encl_page))
+ return PTR_ERR(encl_page);
+
+ epc_page = sgx_alloc_epc_page(encl_page, true);
+ if (IS_ERR(epc_page)) {
+ kfree(encl_page);
+ return PTR_ERR(epc_page);
+ }
+
+ va_page = sgx_encl_grow(encl, true);
+ if (IS_ERR(va_page)) {
+ ret = PTR_ERR(va_page);
+ goto err_out_free;
+ }
+
+ mmap_read_lock(current->mm);
+ mutex_lock(&encl->lock);
+
+ /*
+ * Adding to encl->va_pages must be done under encl->lock. Ditto for
+ * deleting (via sgx_encl_shrink()) in the error path.
+ */
+ if (va_page)
+ list_add(&va_page->list, &encl->va_pages);
+
+ /*
+ * Insert prior to EADD in case of OOM. EADD modifies MRENCLAVE, i.e.
+ * can't be gracefully unwound, while failure on EADD/EXTEND is limited
+ * to userspace errors (or kernel/hardware bugs).
+ */
+ ret = xa_insert(&encl->page_array, PFN_DOWN(encl_page->desc),
+ encl_page, GFP_KERNEL);
+ if (ret)
+ goto err_out_unlock;
+
+ ret = __sgx_encl_add_page(encl, encl_page, epc_page, secinfo,
+ src);
+ if (ret)
+ goto err_out;
+
+ /*
+ * Complete the "add" before doing the "extend" so that the "add"
+ * isn't in a half-baked state in the extremely unlikely scenario
+ * that the enclave is destroyed in response to EEXTEND failure.
+ */
+ encl_page->encl = encl;
+ encl_page->epc_page = epc_page;
+ encl_page->type = (secinfo->flags & SGX_SECINFO_PAGE_TYPE_MASK) >> 8;
+ encl->secs_child_cnt++;
+
+ if (flags & SGX_PAGE_MEASURE) {
+ ret = __sgx_encl_extend(encl, epc_page);
+ if (ret)
+ goto err_out;
+ }
+
+ sgx_mark_page_reclaimable(encl_page->epc_page);
+ mutex_unlock(&encl->lock);
+ mmap_read_unlock(current->mm);
+ return ret;
+
+err_out:
+ xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc));
+
+err_out_unlock:
+ sgx_encl_shrink(encl, va_page);
+ mutex_unlock(&encl->lock);
+ mmap_read_unlock(current->mm);
+
+err_out_free:
+ sgx_encl_free_epc_page(epc_page);
+ kfree(encl_page);
+
+ return ret;
+}
+
+/*
+ * Ensure user provided offset and length values are valid for
+ * an enclave.
+ */
+static int sgx_validate_offset_length(struct sgx_encl *encl,
+ unsigned long offset,
+ unsigned long length)
+{
+ if (!IS_ALIGNED(offset, PAGE_SIZE))
+ return -EINVAL;
+
+ if (!length || !IS_ALIGNED(length, PAGE_SIZE))
+ return -EINVAL;
+
+ if (offset + length < offset)
+ return -EINVAL;
+
+ if (offset + length - PAGE_SIZE >= encl->size)
+ return -EINVAL;
+
+ return 0;
+}
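+
+/*
+ * Worked example (illustrative values): for a 2 MiB enclave, offset 0x1ff000
+ * with length 0x2000 passes the alignment and overflow checks but fails the
+ * last one, since offset + length - PAGE_SIZE == 0x200000 == encl->size, i.e.
+ * the final page would start outside the enclave.
+ */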
+
+/**
+ * sgx_ioc_enclave_add_pages() - The handler for %SGX_IOC_ENCLAVE_ADD_PAGES
+ * @encl: an enclave pointer
+ * @arg: a user pointer to a struct sgx_enclave_add_pages instance
+ *
+ * Add one or more pages to an uninitialized enclave, and optionally extend the
+ * measurement with the contents of the page. The SECINFO and measurement mask
+ * are applied to all pages.
+ *
+ * A SECINFO for a TCS is required to always contain zero permissions because
+ * the CPU silently zeroes them. Allowing anything else would cause a mismatch
+ * in the measurement.
+ *
+ * mmap()'s protection bits are capped by the page permissions. For each page
+ * address, the maximum protection bits are computed with the following
+ * heuristics:
+ *
+ * 1. A regular page: PROT_R, PROT_W and PROT_X match the SECINFO permissions.
+ * 2. A TCS page: PROT_R | PROT_W.
+ *
+ * mmap() is not allowed to surpass the minimum of the maximum protection bits
+ * within the given address range.
+ *
+ * The function deinitializes kernel data structures for the enclave and
+ * returns -EIO in any of the following conditions:
+ *
+ * - Enclave Page Cache (EPC), the physical memory holding enclaves, has
+ * been invalidated. This will cause EADD and EEXTEND to fail.
+ * - If the source address is corrupted somehow when executing EADD.
+ *
+ * Return:
+ * - 0: Success.
+ * - -EACCES: The source page is located in a noexec partition.
+ * - -ENOMEM: Out of EPC pages.
+ * - -EINTR: The call was interrupted before data was processed.
+ * - -EIO: Either EADD or EEXTEND failed because of an invalid source
+ * address or a power cycle.
+ * - -errno: POSIX error.
+ */
+static long sgx_ioc_enclave_add_pages(struct sgx_encl *encl, void __user *arg)
+{
+ struct sgx_enclave_add_pages add_arg;
+ struct sgx_secinfo secinfo;
+ unsigned long c;
+ int ret;
+
+ if (!test_bit(SGX_ENCL_CREATED, &encl->flags) ||
+ test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
+ return -EINVAL;
+
+ if (copy_from_user(&add_arg, arg, sizeof(add_arg)))
+ return -EFAULT;
+
+ if (!IS_ALIGNED(add_arg.src, PAGE_SIZE))
+ return -EINVAL;
+
+ if (sgx_validate_offset_length(encl, add_arg.offset, add_arg.length))
+ return -EINVAL;
+
+ if (copy_from_user(&secinfo, (void __user *)add_arg.secinfo,
+ sizeof(secinfo)))
+ return -EFAULT;
+
+ if (sgx_validate_secinfo(&secinfo))
+ return -EINVAL;
+
+ for (c = 0 ; c < add_arg.length; c += PAGE_SIZE) {
+ if (signal_pending(current)) {
+ if (!c)
+ ret = -ERESTARTSYS;
+
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+ ret = sgx_encl_add_page(encl, add_arg.src + c, add_arg.offset + c,
+ &secinfo, add_arg.flags);
+ if (ret)
+ break;
+ }
+
+ add_arg.count = c;
+
+ if (copy_to_user(arg, &add_arg, sizeof(add_arg)))
+ return -EFAULT;
+
+ return ret;
+}
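+
+/*
+ * Illustrative userspace usage of the ioctl above (UAPI field names; 'encl_fd'
+ * and 'page_buf' are hypothetical):
+ *
+ * struct sgx_enclave_add_pages add = {
+ * .src = (__u64)page_buf,
+ * .offset = 0,
+ * .length = PAGE_SIZE,
+ * .secinfo = (__u64)&secinfo,
+ * .flags = SGX_PAGE_MEASURE,
+ * };
+ *
+ * ioctl(encl_fd, SGX_IOC_ENCLAVE_ADD_PAGES, &add);
+ *
+ * On return, add.count reports how many bytes were processed.
+ */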
+
+static int sgx_encl_init(struct sgx_encl *encl, struct sgx_sigstruct *sigstruct,
+ void *token)
+{
+ u64 mrsigner[4];
+ int i, j;
+ void *addr;
+ int ret;
+
+ /*
+ * Deny initializing enclaves with attributes (namely provisioning)
+ * that have not been explicitly allowed.
+ */
+ if (encl->attributes & ~encl->attributes_mask)
+ return -EACCES;
+
+ /*
+ * Attributes should not be enforced *only* against what's available on
+ * the platform (done in sgx_encl_create) but also checked and enforced
+ * against the enforcement mask in the sigstruct. For example, an enclave
+ * could opt to sign with the AVX bit in xfrm, but still be loadable on a
+ * platform without it if sigstruct->body.attributes_mask does not turn
+ * that bit on.
+ */
+ if (sigstruct->body.attributes & sigstruct->body.attributes_mask &
+ sgx_attributes_reserved_mask)
+ return -EINVAL;
+
+ if (sigstruct->body.miscselect & sigstruct->body.misc_mask &
+ sgx_misc_reserved_mask)
+ return -EINVAL;
+
+ if (sigstruct->body.xfrm & sigstruct->body.xfrm_mask &
+ sgx_xfrm_reserved_mask)
+ return -EINVAL;
+
+ sha256(sigstruct->modulus, SGX_MODULUS_SIZE, (u8 *)mrsigner);
+
+ mutex_lock(&encl->lock);
+
+ /*
+ * ENCLS[EINIT] is interruptible because it has such a high latency,
+ * e.g. 50k+ cycles on success. If an IRQ/NMI/SMI becomes pending,
+ * EINIT may fail with SGX_UNMASKED_EVENT so that the event can be
+ * serviced.
+ */
+ for (i = 0; i < SGX_EINIT_SLEEP_COUNT; i++) {
+ for (j = 0; j < SGX_EINIT_SPIN_COUNT; j++) {
+ addr = sgx_get_epc_virt_addr(encl->secs.epc_page);
+
+ preempt_disable();
+
+ sgx_update_lepubkeyhash(mrsigner);
+
+ ret = __einit(sigstruct, token, addr);
+
+ preempt_enable();
+
+ if (ret == SGX_UNMASKED_EVENT)
+ continue;
+ else
+ break;
+ }
+
+ if (ret != SGX_UNMASKED_EVENT)
+ break;
+
+ msleep_interruptible(SGX_EINIT_SLEEP_TIME);
+
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ goto err_out;
+ }
+ }
+
+ if (encls_faulted(ret)) {
+ if (encls_failed(ret))
+ ENCLS_WARN(ret, "EINIT");
+
+ ret = -EIO;
+ } else if (ret) {
+ pr_debug("EINIT returned %d\n", ret);
+ ret = -EPERM;
+ } else {
+ set_bit(SGX_ENCL_INITIALIZED, &encl->flags);
+ }
+
+err_out:
+ mutex_unlock(&encl->lock);
+ return ret;
+}
+
+/**
+ * sgx_ioc_enclave_init() - handler for %SGX_IOC_ENCLAVE_INIT
+ * @encl: an enclave pointer
+ * @arg: userspace pointer to a struct sgx_enclave_init instance
+ *
+ * Flush any outstanding enqueued EADD operations and perform EINIT. The
+ * Launch Enclave Public Key Hash MSRs are rewritten as necessary to match
+ * the enclave's MRSIGNER, which is calculated from the provided sigstruct.
+ *
+ * Return:
+ * - 0: Success.
+ * - -EPERM: Invalid SIGSTRUCT.
+ * - -EIO: EINIT failed because of a power cycle.
+ * - -errno: POSIX error.
+ */
+static long sgx_ioc_enclave_init(struct sgx_encl *encl, void __user *arg)
+{
+ struct sgx_sigstruct *sigstruct;
+ struct sgx_enclave_init init_arg;
+ void *token;
+ int ret;
+
+ if (!test_bit(SGX_ENCL_CREATED, &encl->flags) ||
+ test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
+ return -EINVAL;
+
+ if (copy_from_user(&init_arg, arg, sizeof(init_arg)))
+ return -EFAULT;
+
+ /*
+ * 'sigstruct' must be on a page boundary and 'token' on a 512 byte
+ * boundary. kmalloc() will give this alignment when allocating
+ * PAGE_SIZE bytes.
+ */
+ sigstruct = kmalloc(PAGE_SIZE, GFP_KERNEL);
+ if (!sigstruct)
+ return -ENOMEM;
+
+ token = (void *)((unsigned long)sigstruct + PAGE_SIZE / 2);
+ memset(token, 0, SGX_LAUNCH_TOKEN_SIZE);
+
+ if (copy_from_user(sigstruct, (void __user *)init_arg.sigstruct,
+ sizeof(*sigstruct))) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ /*
+ * A legacy field used with Intel signed enclaves. These values used to
+ * distinguish regular and architectural enclaves. The CPU accepts only
+ * these values, but they do not have any other meaning.
+ *
+ * Thus, reject any other values.
+ */
+ if (sigstruct->header.vendor != 0x0000 &&
+ sigstruct->header.vendor != 0x8086) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = sgx_encl_init(encl, sigstruct, token);
+
+out:
+ kfree(sigstruct);
+ return ret;
+}
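+
+/*
+ * Illustrative userspace call for the handler above (a sketch; 'sig_buf' is a
+ * hypothetical buffer holding the SIGSTRUCT):
+ *
+ * struct sgx_enclave_init init = { .sigstruct = (__u64)sig_buf };
+ *
+ * ioctl(encl_fd, SGX_IOC_ENCLAVE_INIT, &init);
+ */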
+
+/**
+ * sgx_ioc_enclave_provision() - handler for %SGX_IOC_ENCLAVE_PROVISION
+ * @encl: an enclave pointer
+ * @arg: userspace pointer to a struct sgx_enclave_provision instance
+ *
+ * Allow ATTRIBUTE.PROVISION_KEY for an enclave by providing a file handle to
+ * /dev/sgx_provision.
+ *
+ * Return:
+ * - 0: Success.
+ * - -errno: Otherwise.
+ */
+static long sgx_ioc_enclave_provision(struct sgx_encl *encl, void __user *arg)
+{
+ struct sgx_enclave_provision params;
+
+ if (copy_from_user(&params, arg, sizeof(params)))
+ return -EFAULT;
+
+ return sgx_set_attribute(&encl->attributes_mask, params.fd);
+}
+
+/*
+ * Ensure enclave is ready for SGX2 functions. Readiness is checked
+ * by ensuring the hardware supports SGX2 and the enclave is initialized
+ * and thus able to handle requests to modify pages within it.
+ */
+static int sgx_ioc_sgx2_ready(struct sgx_encl *encl)
+{
+ if (!(cpu_feature_enabled(X86_FEATURE_SGX2)))
+ return -ENODEV;
+
+ if (!test_bit(SGX_ENCL_INITIALIZED, &encl->flags))
+ return -EINVAL;
+
+ return 0;
+}
+
+/*
+ * Some SGX functions require that no cached linear-to-physical address
+ * mappings are present before they can succeed. Collaborate with
+ * hardware via ENCLS[ETRACK] to ensure that all cached
+ * linear-to-physical address mappings belonging to all threads of
+ * the enclave are cleared. See sgx_encl_cpumask() for details.
+ *
+ * Must be called with enclave's mutex held from the time the
+ * SGX function requiring that no cached linear-to-physical mappings
+ * are present is executed until this ETRACK flow is complete.
+ */
+static int sgx_enclave_etrack(struct sgx_encl *encl)
+{
+ void *epc_virt;
+ int ret;
+
+ epc_virt = sgx_get_epc_virt_addr(encl->secs.epc_page);
+ ret = __etrack(epc_virt);
+ if (ret) {
+ /*
+ * ETRACK only fails when there is an OS issue. For
+ * example, two consecutive ETRACKs were issued without
+ * a completing IPI between them.
+ */
+ pr_err_once("ETRACK returned %d (0x%x)", ret, ret);
+ /*
+ * Send IPIs to kick CPUs out of the enclave and
+ * try ETRACK again.
+ */
+ on_each_cpu_mask(sgx_encl_cpumask(encl), sgx_ipi_cb, NULL, 1);
+ ret = __etrack(epc_virt);
+ if (ret) {
+ pr_err_once("ETRACK repeat returned %d (0x%x)",
+ ret, ret);
+ return -EFAULT;
+ }
+ }
+ on_each_cpu_mask(sgx_encl_cpumask(encl), sgx_ipi_cb, NULL, 1);
+
+ return 0;
+}
+
+/**
+ * sgx_enclave_restrict_permissions() - Restrict EPCM permissions
+ * @encl: Enclave to which the pages belong.
+ * @modp: Checked parameters from user on which pages need modifying and
+ * their new permissions.
+ *
+ * Return:
+ * - 0: Success.
+ * - -errno: Otherwise.
+ */
+static long
+sgx_enclave_restrict_permissions(struct sgx_encl *encl,
+ struct sgx_enclave_restrict_permissions *modp)
+{
+ struct sgx_encl_page *entry;
+ struct sgx_secinfo secinfo;
+ unsigned long addr;
+ unsigned long c;
+ void *epc_virt;
+ int ret;
+
+ memset(&secinfo, 0, sizeof(secinfo));
+ secinfo.flags = modp->permissions & SGX_SECINFO_PERMISSION_MASK;
+
+ for (c = 0 ; c < modp->length; c += PAGE_SIZE) {
+ addr = encl->base + modp->offset + c;
+
+ sgx_reclaim_direct();
+
+ mutex_lock(&encl->lock);
+
+ entry = sgx_encl_load_page(encl, addr);
+ if (IS_ERR(entry)) {
+ ret = PTR_ERR(entry) == -EBUSY ? -EAGAIN : -EFAULT;
+ goto out_unlock;
+ }
+
+ /*
+ * Changing EPCM permissions is only supported on regular
+ * SGX pages. Attempting this change on other pages will
+ * result in #PF.
+ */
+ if (entry->type != SGX_PAGE_TYPE_REG) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ /*
+ * Apart from ensuring that read-access remains, do not verify
+ * the permission bits requested. The kernel has no control over
+ * how EPCM permissions can be relaxed from within the enclave.
+ * ENCLS[EMODPR] can only remove existing EPCM permissions;
+ * attempts to set new permissions are ignored by the
+ * hardware.
+ */
+
+ /* Change EPCM permissions. */
+ epc_virt = sgx_get_epc_virt_addr(entry->epc_page);
+ ret = __emodpr(&secinfo, epc_virt);
+ if (encls_faulted(ret)) {
+ /*
+ * All possible faults should be avoidable:
+ * parameters have been checked, will only change
+ * permissions of a regular page, and no concurrent
+ * SGX1/SGX2 ENCLS instructions since these
+ * are protected with mutex.
+ */
+ pr_err_once("EMODPR encountered exception %d\n",
+ ENCLS_TRAPNR(ret));
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+ if (encls_failed(ret)) {
+ modp->result = ret;
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+
+ ret = sgx_enclave_etrack(encl);
+ if (ret) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+
+ mutex_unlock(&encl->lock);
+ }
+
+ ret = 0;
+ goto out;
+
+out_unlock:
+ mutex_unlock(&encl->lock);
+out:
+ modp->count = c;
+
+ return ret;
+}
+
+/**
+ * sgx_ioc_enclave_restrict_permissions() - handler for
+ * %SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
+ * @encl: an enclave pointer
+ * @arg: userspace pointer to a &struct sgx_enclave_restrict_permissions
+ * instance
+ *
+ * SGX2 distinguishes between relaxing and restricting the enclave page
+ * permissions maintained by the hardware (EPCM permissions) of pages
+ * belonging to an initialized enclave (after SGX_IOC_ENCLAVE_INIT).
+ *
+ * EPCM permissions cannot be restricted from within the enclave; the enclave
+ * requires the kernel to run the privileged level 0 instructions ENCLS[EMODPR]
+ * and ENCLS[ETRACK]. An attempt to relax EPCM permissions with this call
+ * will be ignored by the hardware.
+ *
+ * Return:
+ * - 0: Success
+ * - -errno: Otherwise
+ */
+static long sgx_ioc_enclave_restrict_permissions(struct sgx_encl *encl,
+ void __user *arg)
+{
+ struct sgx_enclave_restrict_permissions params;
+ long ret;
+
+ ret = sgx_ioc_sgx2_ready(encl);
+ if (ret)
+ return ret;
+
+ if (copy_from_user(&params, arg, sizeof(params)))
+ return -EFAULT;
+
+ if (sgx_validate_offset_length(encl, params.offset, params.length))
+ return -EINVAL;
+
+ if (params.permissions & ~SGX_SECINFO_PERMISSION_MASK)
+ return -EINVAL;
+
+ /*
+ * Fail early if invalid permissions requested to prevent ENCLS[EMODPR]
+ * from faulting later when the CPU does the same check.
+ */
+ if ((params.permissions & SGX_SECINFO_W) &&
+ !(params.permissions & SGX_SECINFO_R))
+ return -EINVAL;
+
+ if (params.result || params.count)
+ return -EINVAL;
+
+ ret = sgx_enclave_restrict_permissions(encl, &params);
+
+ if (copy_to_user(arg, &params, sizeof(params)))
+ return -EFAULT;
+
+ return ret;
+}
+
+/**
+ * sgx_enclave_modify_types() - Modify type of SGX enclave pages
+ * @encl: Enclave to which the pages belong.
+ * @modt: Checked parameters from user about which pages need modifying
+ * and their new page type.
+ *
+ * Return:
+ * - 0: Success
+ * - -errno: Otherwise
+ */
+static long sgx_enclave_modify_types(struct sgx_encl *encl,
+ struct sgx_enclave_modify_types *modt)
+{
+ unsigned long max_prot_restore;
+ enum sgx_page_type page_type;
+ struct sgx_encl_page *entry;
+ struct sgx_secinfo secinfo;
+ unsigned long prot;
+ unsigned long addr;
+ unsigned long c;
+ void *epc_virt;
+ int ret;
+
+ page_type = modt->page_type & SGX_PAGE_TYPE_MASK;
+
+ /*
+ * The only new page types allowed by hardware are PT_TCS and PT_TRIM.
+ */
+ if (page_type != SGX_PAGE_TYPE_TCS && page_type != SGX_PAGE_TYPE_TRIM)
+ return -EINVAL;
+
+ memset(&secinfo, 0, sizeof(secinfo));
+
+ secinfo.flags = page_type << 8;
+
+ for (c = 0 ; c < modt->length; c += PAGE_SIZE) {
+ addr = encl->base + modt->offset + c;
+
+ sgx_reclaim_direct();
+
+ mutex_lock(&encl->lock);
+
+ entry = sgx_encl_load_page(encl, addr);
+ if (IS_ERR(entry)) {
+ ret = PTR_ERR(entry) == -EBUSY ? -EAGAIN : -EFAULT;
+ goto out_unlock;
+ }
+
+ /*
+ * Borrow the logic from the Intel SDM. Regular pages
+ * (SGX_PAGE_TYPE_REG) can change type to SGX_PAGE_TYPE_TCS
+ * or SGX_PAGE_TYPE_TRIM, but TCS pages can only be trimmed.
+ * CET pages are not supported yet.
+ */
+ if (!(entry->type == SGX_PAGE_TYPE_REG ||
+ (entry->type == SGX_PAGE_TYPE_TCS &&
+ page_type == SGX_PAGE_TYPE_TRIM))) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ max_prot_restore = entry->vm_max_prot_bits;
+
+ /*
+ * Once a regular page becomes a TCS page it cannot be
+ * changed back. So the maximum allowed protection reflects
+ * the TCS page that is always RW from the kernel's perspective
+ * but inaccessible from within the enclave. Before doing
+ * so, make sure that the new page type continues to
+ * respect the originally vetted page permissions.
+ */
+ if (entry->type == SGX_PAGE_TYPE_REG &&
+ page_type == SGX_PAGE_TYPE_TCS) {
+ if (~entry->vm_max_prot_bits & (VM_READ | VM_WRITE)) {
+ ret = -EPERM;
+ goto out_unlock;
+ }
+ prot = PROT_READ | PROT_WRITE;
+ entry->vm_max_prot_bits = calc_vm_prot_bits(prot, 0);
+
+ /*
+ * Prevent page from being reclaimed while mutex
+ * is released.
+ */
+ if (sgx_unmark_page_reclaimable(entry->epc_page)) {
+ ret = -EAGAIN;
+ goto out_entry_changed;
+ }
+
+ /*
+ * Do not keep encl->lock because of dependency on
+ * mmap_lock acquired in sgx_zap_enclave_ptes().
+ */
+ mutex_unlock(&encl->lock);
+
+ sgx_zap_enclave_ptes(encl, addr);
+
+ mutex_lock(&encl->lock);
+
+ sgx_mark_page_reclaimable(entry->epc_page);
+ }
+
+ /* Change EPC type */
+ epc_virt = sgx_get_epc_virt_addr(entry->epc_page);
+ ret = __emodt(&secinfo, epc_virt);
+ if (encls_faulted(ret)) {
+ /*
+ * All possible faults should be avoidable:
+ * parameters have been checked, will only change
+ * valid page types, and no concurrent
+ * SGX1/SGX2 ENCLS instructions since these are
+ * protected with mutex.
+ */
+ pr_err_once("EMODT encountered exception %d\n",
+ ENCLS_TRAPNR(ret));
+ ret = -EFAULT;
+ goto out_entry_changed;
+ }
+ if (encls_failed(ret)) {
+ modt->result = ret;
+ ret = -EFAULT;
+ goto out_entry_changed;
+ }
+
+ ret = sgx_enclave_etrack(encl);
+ if (ret) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+
+ entry->type = page_type;
+
+ mutex_unlock(&encl->lock);
+ }
+
+ ret = 0;
+ goto out;
+
+out_entry_changed:
+ entry->vm_max_prot_bits = max_prot_restore;
+out_unlock:
+ mutex_unlock(&encl->lock);
+out:
+ modt->count = c;
+
+ return ret;
+}
+
+/**
+ * sgx_ioc_enclave_modify_types() - handler for %SGX_IOC_ENCLAVE_MODIFY_TYPES
+ * @encl: an enclave pointer
+ * @arg: userspace pointer to a &struct sgx_enclave_modify_types instance
+ *
+ * Ability to change the enclave page type supports the following use cases:
+ *
+ * * It is possible to add TCS pages to an enclave by changing the type of
+ * regular pages (%SGX_PAGE_TYPE_REG) to TCS (%SGX_PAGE_TYPE_TCS) pages.
+ * With this support the number of threads supported by an initialized
+ * enclave can be increased dynamically.
+ *
+ * * Regular or TCS pages can dynamically be removed from an initialized
+ * enclave by changing the page type to %SGX_PAGE_TYPE_TRIM. Changing the
+ * page type to %SGX_PAGE_TYPE_TRIM marks the page for removal with actual
+ * removal done by handler of %SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl() called
+ * after ENCLU[EACCEPT] is run on %SGX_PAGE_TYPE_TRIM page from within the
+ * enclave.
+ *
+ * Return:
+ * - 0: Success
+ * - -errno: Otherwise
+ */
+static long sgx_ioc_enclave_modify_types(struct sgx_encl *encl,
+ void __user *arg)
+{
+ struct sgx_enclave_modify_types params;
+ long ret;
+
+ ret = sgx_ioc_sgx2_ready(encl);
+ if (ret)
+ return ret;
+
+ if (copy_from_user(&params, arg, sizeof(params)))
+ return -EFAULT;
+
+ if (sgx_validate_offset_length(encl, params.offset, params.length))
+ return -EINVAL;
+
+ if (params.page_type & ~SGX_PAGE_TYPE_MASK)
+ return -EINVAL;
+
+ if (params.result || params.count)
+ return -EINVAL;
+
+ ret = sgx_enclave_modify_types(encl, &params);
+
+ if (copy_to_user(arg, &params, sizeof(params)))
+ return -EFAULT;
+
+ return ret;
+}
+
+/**
+ * sgx_encl_remove_pages() - Remove trimmed pages from SGX enclave
+ * @encl: Enclave to which the pages belong
+ * @params: Checked parameters from user on which pages need to be removed
+ *
+ * Return:
+ * - 0: Success.
+ * - -errno: Otherwise.
+ */
+static long sgx_encl_remove_pages(struct sgx_encl *encl,
+ struct sgx_enclave_remove_pages *params)
+{
+ struct sgx_encl_page *entry;
+ struct sgx_secinfo secinfo;
+ unsigned long addr;
+ unsigned long c;
+ void *epc_virt;
+ int ret;
+
+ memset(&secinfo, 0, sizeof(secinfo));
+ secinfo.flags = SGX_SECINFO_R | SGX_SECINFO_W | SGX_SECINFO_X;
+
+ for (c = 0 ; c < params->length; c += PAGE_SIZE) {
+ addr = encl->base + params->offset + c;
+
+ sgx_reclaim_direct();
+
+ mutex_lock(&encl->lock);
+
+ entry = sgx_encl_load_page(encl, addr);
+ if (IS_ERR(entry)) {
+ ret = PTR_ERR(entry) == -EBUSY ? -EAGAIN : -EFAULT;
+ goto out_unlock;
+ }
+
+ if (entry->type != SGX_PAGE_TYPE_TRIM) {
+ ret = -EPERM;
+ goto out_unlock;
+ }
+
+ /*
+ * ENCLS[EMODPR] with RWX is used here as a no-op probe to learn
+ * whether ENCLU[EACCEPT] was run from within the enclave. If
+ * ENCLS[EMODPR] is run with RWX on a trimmed page that has not
+ * yet been accepted, it returns %SGX_PAGE_NOT_MODIFIABLE; after
+ * the trimmed page is accepted, the instruction encounters a
+ * page fault.
+ */
+ epc_virt = sgx_get_epc_virt_addr(entry->epc_page);
+ ret = __emodpr(&secinfo, epc_virt);
+ if (!encls_faulted(ret) || ENCLS_TRAPNR(ret) != X86_TRAP_PF) {
+ ret = -EPERM;
+ goto out_unlock;
+ }
+
+ if (sgx_unmark_page_reclaimable(entry->epc_page)) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ /*
+ * Do not keep encl->lock because of dependency on
+ * mmap_lock acquired in sgx_zap_enclave_ptes().
+ */
+ mutex_unlock(&encl->lock);
+
+ sgx_zap_enclave_ptes(encl, addr);
+
+ mutex_lock(&encl->lock);
+
+ sgx_encl_free_epc_page(entry->epc_page);
+ encl->secs_child_cnt--;
+ entry->epc_page = NULL;
+ xa_erase(&encl->page_array, PFN_DOWN(entry->desc));
+ sgx_encl_shrink(encl, NULL);
+ kfree(entry);
+
+ mutex_unlock(&encl->lock);
+ }
+
+ ret = 0;
+ goto out;
+
+out_unlock:
+ mutex_unlock(&encl->lock);
+out:
+ params->count = c;
+
+ return ret;
+}
+
+/**
+ * sgx_ioc_enclave_remove_pages() - handler for %SGX_IOC_ENCLAVE_REMOVE_PAGES
+ * @encl: an enclave pointer
+ * @arg: userspace pointer to &struct sgx_enclave_remove_pages instance
+ *
+ * Final step of the flow removing pages from an initialized enclave. The
+ * complete flow is:
+ *
+ * 1) User changes the type of the pages to be removed to %SGX_PAGE_TYPE_TRIM
+ * using the %SGX_IOC_ENCLAVE_MODIFY_TYPES ioctl().
+ * 2) User approves the page removal by running ENCLU[EACCEPT] from within
+ * the enclave.
+ * 3) User initiates actual page removal using the
+ * %SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl() that is handled here.
+ *
+ * First remove any page table entries pointing to the page and then proceed
+ * with the actual removal of the enclave page and data in support of it.
+ *
+ * VA pages are not affected by this removal. It is thus possible that the
+ * enclave may end up with more VA pages than needed to support all its
+ * pages.
+ *
+ * Return:
+ * - 0: Success
+ * - -errno: Otherwise
+ */
+static long sgx_ioc_enclave_remove_pages(struct sgx_encl *encl,
+ void __user *arg)
+{
+ struct sgx_enclave_remove_pages params;
+ long ret;
+
+ ret = sgx_ioc_sgx2_ready(encl);
+ if (ret)
+ return ret;
+
+ if (copy_from_user(&params, arg, sizeof(params)))
+ return -EFAULT;
+
+ if (sgx_validate_offset_length(encl, params.offset, params.length))
+ return -EINVAL;
+
+ if (params.count)
+ return -EINVAL;
+
+ ret = sgx_encl_remove_pages(encl, &params);
+
+ if (copy_to_user(arg, &params, sizeof(params)))
+ return -EFAULT;
+
+ return ret;
+}
+
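+/*
+ * Typical build flow through this dispatcher (for orientation, not new
+ * semantics): SGX_IOC_ENCLAVE_CREATE once, SGX_IOC_ENCLAVE_ADD_PAGES per
+ * region, then SGX_IOC_ENCLAVE_INIT; the SGX2 ioctls apply only to an
+ * initialized enclave. The SGX_ENCL_IOCTL flag serializes these per enclave.
+ */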
+long sgx_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
+{
+ struct sgx_encl *encl = filep->private_data;
+ int ret;
+
+ if (test_and_set_bit(SGX_ENCL_IOCTL, &encl->flags))
+ return -EBUSY;
+
+ switch (cmd) {
+ case SGX_IOC_ENCLAVE_CREATE:
+ ret = sgx_ioc_enclave_create(encl, (void __user *)arg);
+ break;
+ case SGX_IOC_ENCLAVE_ADD_PAGES:
+ ret = sgx_ioc_enclave_add_pages(encl, (void __user *)arg);
+ break;
+ case SGX_IOC_ENCLAVE_INIT:
+ ret = sgx_ioc_enclave_init(encl, (void __user *)arg);
+ break;
+ case SGX_IOC_ENCLAVE_PROVISION:
+ ret = sgx_ioc_enclave_provision(encl, (void __user *)arg);
+ break;
+ case SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS:
+ ret = sgx_ioc_enclave_restrict_permissions(encl,
+ (void __user *)arg);
+ break;
+ case SGX_IOC_ENCLAVE_MODIFY_TYPES:
+ ret = sgx_ioc_enclave_modify_types(encl, (void __user *)arg);
+ break;
+ case SGX_IOC_ENCLAVE_REMOVE_PAGES:
+ ret = sgx_ioc_enclave_remove_pages(encl, (void __user *)arg);
+ break;
+ default:
+ ret = -ENOIOCTLCMD;
+ break;
+ }
+
+ clear_bit(SGX_ENCL_IOCTL, &encl->flags);
+ return ret;
+}
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
new file mode 100644
index 000000000000..dc73194416ac
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -0,0 +1,1072 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2016-20 Intel Corporation. */
+
+#include <linux/file.h>
+#include <linux/freezer.h>
+#include <linux/highmem.h>
+#include <linux/kthread.h>
+#include <linux/kvm_types.h>
+#include <linux/miscdevice.h>
+#include <linux/node.h>
+#include <linux/pagemap.h>
+#include <linux/ratelimit.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/sysfs.h>
+#include <linux/vmalloc.h>
+#include <asm/msr.h>
+#include <asm/sgx.h>
+#include <asm/archrandom.h>
+#include "driver.h"
+#include "encl.h"
+#include "encls.h"
+
+struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
+static int sgx_nr_epc_sections;
+static struct task_struct *ksgxd_tsk;
+static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
+static DEFINE_XARRAY(sgx_epc_address_space);
+
+/*
+ * These variables are part of the state of the reclaimer, and must be accessed
+ * with sgx_reclaimer_lock acquired.
+ */
+static LIST_HEAD(sgx_active_page_list);
+static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+
+static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
+
+/* Nodes with one or more EPC sections. */
+static nodemask_t sgx_numa_mask;
+
+/*
+ * Array with one list_head for each possible NUMA node. Each
+ * list contains all the sgx_epc_section's which are on that
+ * node.
+ */
+static struct sgx_numa_node *sgx_numa_nodes;
+
+static LIST_HEAD(sgx_dirty_page_list);
+
+/*
+ * Reset post-kexec EPC pages to the uninitialized state. The pages are removed
+ * from the input list, and made available for the page allocator. SECS pages
+ * that precede their child pages in the input list are left intact.
+ *
+ * Return 0 when sanitization was successful or kthread was stopped, and the
+ * number of unsanitized pages otherwise.
+ */
+static unsigned long __sgx_sanitize_pages(struct list_head *dirty_page_list)
+{
+ unsigned long left_dirty = 0;
+ struct sgx_epc_page *page;
+ LIST_HEAD(dirty);
+ int ret;
+
+ /* dirty_page_list is thread-local, no need for a lock: */
+ while (!list_empty(dirty_page_list)) {
+ if (kthread_should_stop())
+ return 0;
+
+ page = list_first_entry(dirty_page_list, struct sgx_epc_page, list);
+
+ /*
+ * Checking page->poison without holding the node->lock
+ * is racy, but losing the race (i.e. poison is set just
+ * after the check) just means __eremove() will be uselessly
+ * called for a page that sgx_free_epc_page() will put onto
+ * the node->sgx_poison_page_list later.
+ */
+ if (page->poison) {
+ struct sgx_epc_section *section = &sgx_epc_sections[page->section];
+ struct sgx_numa_node *node = section->node;
+
+ spin_lock(&node->lock);
+ list_move(&page->list, &node->sgx_poison_page_list);
+ spin_unlock(&node->lock);
+
+ continue;
+ }
+
+ ret = __eremove(sgx_get_epc_virt_addr(page));
+ if (!ret) {
+ /*
+ * page is now sanitized. Make it available via the SGX
+ * page allocator:
+ */
+ list_del(&page->list);
+ sgx_free_epc_page(page);
+ } else {
+ /* The page is not yet clean - move to the dirty list. */
+ list_move_tail(&page->list, &dirty);
+ left_dirty++;
+ }
+
+ cond_resched();
+ }
+
+ list_splice(&dirty, dirty_page_list);
+ return left_dirty;
+}
+
+static bool sgx_reclaimer_age(struct sgx_epc_page *epc_page)
+{
+ struct sgx_encl_page *page = epc_page->owner;
+ struct sgx_encl *encl = page->encl;
+ struct sgx_encl_mm *encl_mm;
+ bool ret = true;
+ int idx;
+
+ idx = srcu_read_lock(&encl->srcu);
+
+ list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+ if (!mmget_not_zero(encl_mm->mm))
+ continue;
+
+ mmap_read_lock(encl_mm->mm);
+ ret = !sgx_encl_test_and_clear_young(encl_mm->mm, page);
+ mmap_read_unlock(encl_mm->mm);
+
+ mmput_async(encl_mm->mm);
+
+ if (!ret)
+ break;
+ }
+
+ srcu_read_unlock(&encl->srcu, idx);
+
+ if (!ret)
+ return false;
+
+ return true;
+}
+
+static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
+{
+ struct sgx_encl_page *page = epc_page->owner;
+ unsigned long addr = page->desc & PAGE_MASK;
+ struct sgx_encl *encl = page->encl;
+ int ret;
+
+ sgx_zap_enclave_ptes(encl, addr);
+
+ mutex_lock(&encl->lock);
+
+ ret = __eblock(sgx_get_epc_virt_addr(epc_page));
+ if (encls_failed(ret))
+ ENCLS_WARN(ret, "EBLOCK");
+
+ mutex_unlock(&encl->lock);
+}
+
+static int __sgx_encl_ewb(struct sgx_epc_page *epc_page, void *va_slot,
+ struct sgx_backing *backing)
+{
+ struct sgx_pageinfo pginfo;
+ int ret;
+
+ pginfo.addr = 0;
+ pginfo.secs = 0;
+
+ pginfo.contents = (unsigned long)kmap_local_page(backing->contents);
+ pginfo.metadata = (unsigned long)kmap_local_page(backing->pcmd) +
+ backing->pcmd_offset;
+
+ ret = __ewb(&pginfo, sgx_get_epc_virt_addr(epc_page), va_slot);
+ set_page_dirty(backing->pcmd);
+ set_page_dirty(backing->contents);
+
+ kunmap_local((void *)(unsigned long)(pginfo.metadata -
+ backing->pcmd_offset));
+ kunmap_local((void *)(unsigned long)pginfo.contents);
+
+ return ret;
+}
+
+void sgx_ipi_cb(void *info)
+{
+}
+
+/*
+ * Swap a page out to regular memory after it has been transformed to the
+ * blocked state with EBLOCK, which means that it can no longer be referenced
+ * (no new TLB entries).
+ *
+ * The first attempt just tries to write the page, assuming that some other
+ * thread has already reset the thread count for the enclave with ETRACK and
+ * the previous count has been zeroed out. The second attempt calls ETRACK
+ * before EWB. If that fails too, kick all the HW threads out, and then do EWB,
+ * which should be guaranteed to succeed.
+ */
+static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
+ struct sgx_backing *backing)
+{
+ struct sgx_encl_page *encl_page = epc_page->owner;
+ struct sgx_encl *encl = encl_page->encl;
+ struct sgx_va_page *va_page;
+ unsigned int va_offset;
+ void *va_slot;
+ int ret;
+
+ encl_page->desc &= ~SGX_ENCL_PAGE_BEING_RECLAIMED;
+
+ va_page = list_first_entry(&encl->va_pages, struct sgx_va_page,
+ list);
+ va_offset = sgx_alloc_va_slot(va_page);
+ va_slot = sgx_get_epc_virt_addr(va_page->epc_page) + va_offset;
+ if (sgx_va_page_full(va_page))
+ list_move_tail(&va_page->list, &encl->va_pages);
+
+ ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+ if (ret == SGX_NOT_TRACKED) {
+ ret = __etrack(sgx_get_epc_virt_addr(encl->secs.epc_page));
+ if (ret) {
+ if (encls_failed(ret))
+ ENCLS_WARN(ret, "ETRACK");
+ }
+
+ ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+ if (ret == SGX_NOT_TRACKED) {
+ /*
+ * Slow path, send IPIs to kick cpus out of the
+ * enclave. Note, it's imperative that the cpu
+ * mask is generated *after* ETRACK, else we'll
+ * miss cpus that entered the enclave between
+ * generating the mask and incrementing epoch.
+ */
+ on_each_cpu_mask(sgx_encl_cpumask(encl),
+ sgx_ipi_cb, NULL, 1);
+ ret = __sgx_encl_ewb(epc_page, va_slot, backing);
+ }
+ }
+
+ if (ret) {
+ if (encls_failed(ret))
+ ENCLS_WARN(ret, "EWB");
+
+ sgx_free_va_slot(va_page, va_offset);
+ } else {
+ encl_page->desc |= va_offset;
+ encl_page->va_page = va_page;
+ }
+}
+
+static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
+ struct sgx_backing *backing)
+{
+ struct sgx_encl_page *encl_page = epc_page->owner;
+ struct sgx_encl *encl = encl_page->encl;
+ struct sgx_backing secs_backing;
+ int ret;
+
+ mutex_lock(&encl->lock);
+
+ sgx_encl_ewb(epc_page, backing);
+ encl_page->epc_page = NULL;
+ encl->secs_child_cnt--;
+ sgx_encl_put_backing(backing);
+
+ if (!encl->secs_child_cnt && test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) {
+ ret = sgx_encl_alloc_backing(encl, PFN_DOWN(encl->size),
+ &secs_backing);
+ if (ret)
+ goto out;
+
+ sgx_encl_ewb(encl->secs.epc_page, &secs_backing);
+
+ sgx_encl_free_epc_page(encl->secs.epc_page);
+ encl->secs.epc_page = NULL;
+
+ sgx_encl_put_backing(&secs_backing);
+ }
+
+out:
+ mutex_unlock(&encl->lock);
+}
+
+/*
+ * Take a fixed number of pages from the head of the active page pool and
+ * reclaim them to the enclave's private shmem files. Skip pages that have
+ * been accessed since the last scan, moving them to the tail of the active
+ * page pool so that pages get scanned in an LRU-like fashion.
+ *
+ * Batch-process a chunk of pages (at the moment 16) in order to reduce the
+ * number of IPIs and ETRACKs potentially required. sgx_encl_ewb() amortizes
+ * the cost among the HW threads with a three-stage EWB pipeline (EWB, ETRACK
+ * + EWB and IPI + EWB), but not sufficiently. Reclaiming one page at a time
+ * would also be problematic, as it would increase lock contention too much
+ * and halt forward progress.
+ */
+static void sgx_reclaim_pages(void)
+{
+ struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
+ struct sgx_backing backing[SGX_NR_TO_SCAN];
+ struct sgx_encl_page *encl_page;
+ struct sgx_epc_page *epc_page;
+ pgoff_t page_index;
+ int cnt = 0;
+ int ret;
+ int i;
+
+ spin_lock(&sgx_reclaimer_lock);
+ for (i = 0; i < SGX_NR_TO_SCAN; i++) {
+ if (list_empty(&sgx_active_page_list))
+ break;
+
+ epc_page = list_first_entry(&sgx_active_page_list,
+ struct sgx_epc_page, list);
+ list_del_init(&epc_page->list);
+ encl_page = epc_page->owner;
+
+ if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
+ chunk[cnt++] = epc_page;
+ else
+ /* The owner is freeing the page. No need to add the
+ * page back to the list of reclaimable pages.
+ */
+ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+ }
+ spin_unlock(&sgx_reclaimer_lock);
+
+ for (i = 0; i < cnt; i++) {
+ epc_page = chunk[i];
+ encl_page = epc_page->owner;
+
+ if (!sgx_reclaimer_age(epc_page))
+ goto skip;
+
+ page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
+
+ mutex_lock(&encl_page->encl->lock);
+ ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i]);
+ if (ret) {
+ mutex_unlock(&encl_page->encl->lock);
+ goto skip;
+ }
+
+ encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
+ mutex_unlock(&encl_page->encl->lock);
+ continue;
+
+skip:
+ spin_lock(&sgx_reclaimer_lock);
+ list_add_tail(&epc_page->list, &sgx_active_page_list);
+ spin_unlock(&sgx_reclaimer_lock);
+
+ kref_put(&encl_page->encl->refcount, sgx_encl_release);
+
+ chunk[i] = NULL;
+ }
+
+ for (i = 0; i < cnt; i++) {
+ epc_page = chunk[i];
+ if (epc_page)
+ sgx_reclaimer_block(epc_page);
+ }
+
+ for (i = 0; i < cnt; i++) {
+ epc_page = chunk[i];
+ if (!epc_page)
+ continue;
+
+ encl_page = epc_page->owner;
+ sgx_reclaimer_write(epc_page, &backing[i]);
+
+ kref_put(&encl_page->encl->refcount, sgx_encl_release);
+ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+
+ sgx_free_epc_page(epc_page);
+ }
+}
+
+static bool sgx_should_reclaim(unsigned long watermark)
+{
+ return atomic_long_read(&sgx_nr_free_pages) < watermark &&
+ !list_empty(&sgx_active_page_list);
+}
+
+/*
+ * sgx_reclaim_direct() should be called (without enclave's mutex held)
+ * in locations where SGX memory resources might be low and might be
+ * needed in order to make forward progress.
+ */
+void sgx_reclaim_direct(void)
+{
+ if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+ sgx_reclaim_pages();
+}
+
+static int ksgxd(void *p)
+{
+ set_freezable();
+
+ /*
+ * Sanitize pages in order to recover from kexec(). The 2nd pass is
+ * required for SECS pages, whose child pages blocked EREMOVE.
+ */
+ __sgx_sanitize_pages(&sgx_dirty_page_list);
+ WARN_ON(__sgx_sanitize_pages(&sgx_dirty_page_list));
+
+ while (!kthread_should_stop()) {
+ if (try_to_freeze())
+ continue;
+
+ wait_event_freezable(ksgxd_waitq,
+ kthread_should_stop() ||
+ sgx_should_reclaim(SGX_NR_HIGH_PAGES));
+
+ if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
+ sgx_reclaim_pages();
+
+ cond_resched();
+ }
+
+ return 0;
+}
+
+static bool __init sgx_page_reclaimer_init(void)
+{
+ struct task_struct *tsk;
+
+ tsk = kthread_run(ksgxd, NULL, "ksgxd");
+ if (IS_ERR(tsk))
+ return false;
+
+ ksgxd_tsk = tsk;
+
+ return true;
+}
+
+bool current_is_ksgxd(void)
+{
+ return current == ksgxd_tsk;
+}
+
+static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
+{
+ struct sgx_numa_node *node = &sgx_numa_nodes[nid];
+ struct sgx_epc_page *page = NULL;
+
+ spin_lock(&node->lock);
+
+ if (list_empty(&node->free_page_list)) {
+ spin_unlock(&node->lock);
+ return NULL;
+ }
+
+ page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list);
+ list_del_init(&page->list);
+ page->flags = 0;
+
+ spin_unlock(&node->lock);
+ atomic_long_dec(&sgx_nr_free_pages);
+
+ return page;
+}
+
+/**
+ * __sgx_alloc_epc_page() - Allocate an EPC page
+ *
+ * Iterate through NUMA nodes and reserve a free EPC page for the caller. Start
+ * from the NUMA node where the caller is executing.
+ *
+ * Return:
+ * - an EPC page: a free EPC page was available.
+ * - ERR_PTR(-ENOMEM): out of EPC pages.
+ */
+struct sgx_epc_page *__sgx_alloc_epc_page(void)
+{
+ struct sgx_epc_page *page;
+ int nid_of_current = numa_node_id();
+ int nid_start, nid;
+
+ /*
+ * Try local node first. If it doesn't have an EPC section,
+ * fall back to the non-local NUMA nodes.
+ */
+ if (node_isset(nid_of_current, sgx_numa_mask))
+ nid_start = nid_of_current;
+ else
+ nid_start = next_node_in(nid_of_current, sgx_numa_mask);
+
+ nid = nid_start;
+ do {
+ page = __sgx_alloc_epc_page_from_node(nid);
+ if (page)
+ return page;
+
+ nid = next_node_in(nid, sgx_numa_mask);
+ } while (nid != nid_start);
+
+ return ERR_PTR(-ENOMEM);
+}
+
+/**
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * @page: EPC page
+ *
+ * Mark a page as reclaimable and add it to the active page list. Pages
+ * are automatically removed from the active list when freed.
+ */
+void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
+{
+ spin_lock(&sgx_reclaimer_lock);
+ page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
+ list_add_tail(&page->list, &sgx_active_page_list);
+ spin_unlock(&sgx_reclaimer_lock);
+}
+
+/**
+ * sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
+ * @page: EPC page
+ *
+ * Clear the reclaimable flag and remove the page from the active page list.
+ *
+ * Return:
+ * 0 on success,
+ * -EBUSY if the page is in the process of being reclaimed
+ */
+int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
+{
+ spin_lock(&sgx_reclaimer_lock);
+ if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
+ /* The page is being reclaimed. */
+ if (list_empty(&page->list)) {
+ spin_unlock(&sgx_reclaimer_lock);
+ return -EBUSY;
+ }
+
+ list_del(&page->list);
+ page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+ }
+ spin_unlock(&sgx_reclaimer_lock);
+
+ return 0;
+}
+
+/**
+ * sgx_alloc_epc_page() - Allocate an EPC page
+ * @owner: the owner of the EPC page
+ * @reclaim: reclaim pages if necessary
+ *
+ * Iterate through EPC sections and borrow a free EPC page for the caller. When a
+ * page is no longer needed it must be released with sgx_free_epc_page(). If
+ * @reclaim is set to true, directly reclaim pages when we are out of pages. No
+ * mm's can be locked when @reclaim is set to true.
+ *
+ * Finally, wake up ksgxd when the number of pages goes below the watermark
+ * before returning to the caller.
+ *
+ * Return:
+ * an EPC page,
+ * -errno on error
+ */
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
+{
+ struct sgx_epc_page *page;
+
+ for ( ; ; ) {
+ page = __sgx_alloc_epc_page();
+ if (!IS_ERR(page)) {
+ page->owner = owner;
+ break;
+ }
+
+ if (list_empty(&sgx_active_page_list))
+ return ERR_PTR(-ENOMEM);
+
+ if (!reclaim) {
+ page = ERR_PTR(-EBUSY);
+ break;
+ }
+
+ if (signal_pending(current)) {
+ page = ERR_PTR(-ERESTARTSYS);
+ break;
+ }
+
+ sgx_reclaim_pages();
+ cond_resched();
+ }
+
+ if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+ wake_up(&ksgxd_waitq);
+
+ return page;
+}
+
+/**
+ * sgx_free_epc_page() - Free an EPC page
+ * @page: an EPC page
+ *
+ * Put the EPC page back on the list of free pages. It's the caller's
+ * responsibility to make sure that the page is in an uninitialized state. In
+ * other words, do EREMOVE, EWB or whatever operation is necessary before
+ * calling this function.
+ */
+void sgx_free_epc_page(struct sgx_epc_page *page)
+{
+ struct sgx_epc_section *section = &sgx_epc_sections[page->section];
+ struct sgx_numa_node *node = section->node;
+
+ spin_lock(&node->lock);
+
+ page->owner = NULL;
+ if (page->poison)
+ list_add(&page->list, &node->sgx_poison_page_list);
+ else
+ list_add_tail(&page->list, &node->free_page_list);
+ page->flags = SGX_EPC_PAGE_IS_FREE;
+
+ spin_unlock(&node->lock);
+ atomic_long_inc(&sgx_nr_free_pages);
+}
+
+static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
+ unsigned long index,
+ struct sgx_epc_section *section)
+{
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ unsigned long i;
+
+ section->virt_addr = memremap(phys_addr, size, MEMREMAP_WB);
+ if (!section->virt_addr)
+ return false;
+
+ section->pages = vmalloc_array(nr_pages, sizeof(struct sgx_epc_page));
+ if (!section->pages) {
+ memunmap(section->virt_addr);
+ return false;
+ }
+
+ section->phys_addr = phys_addr;
+ xa_store_range(&sgx_epc_address_space, section->phys_addr,
+ phys_addr + size - 1, section, GFP_KERNEL);
+
+ for (i = 0; i < nr_pages; i++) {
+ section->pages[i].section = index;
+ section->pages[i].flags = 0;
+ section->pages[i].owner = NULL;
+ section->pages[i].poison = 0;
+ list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
+ }
+
+ return true;
+}
+
+bool arch_is_platform_page(u64 paddr)
+{
+ return !!xa_load(&sgx_epc_address_space, paddr);
+}
+EXPORT_SYMBOL_GPL(arch_is_platform_page);
+
+static struct sgx_epc_page *sgx_paddr_to_page(u64 paddr)
+{
+ struct sgx_epc_section *section;
+
+ section = xa_load(&sgx_epc_address_space, paddr);
+ if (!section)
+ return NULL;
+
+ return &section->pages[PFN_DOWN(paddr - section->phys_addr)];
+}
+
+/*
+ * Called in process context to handle a hardware reported
+ * error in an SGX EPC page.
+ * If the MF_ACTION_REQUIRED bit is set in flags, then the
+ * context is the task that consumed the poison data. Otherwise
+ * this is called from a kernel thread unrelated to the page.
+ */
+int arch_memory_failure(unsigned long pfn, int flags)
+{
+ struct sgx_epc_page *page = sgx_paddr_to_page(pfn << PAGE_SHIFT);
+ struct sgx_epc_section *section;
+ struct sgx_numa_node *node;
+
+ /*
+ * mm/memory-failure.c calls this routine for all errors
+ * where there isn't a "struct page" for the address. But that
+ * includes other address ranges besides SGX.
+ */
+ if (!page)
+ return -ENXIO;
+
+ /*
+ * If the poison was consumed synchronously, send a SIGBUS to
+ * the task. Hardware has already exited the SGX enclave and
+ * will not allow re-entry to an enclave that has a memory
+ * error. The signal may help the task understand why the
+ * enclave is broken.
+ */
+ if (flags & MF_ACTION_REQUIRED)
+ force_sig(SIGBUS);
+
+ section = &sgx_epc_sections[page->section];
+ node = section->node;
+
+ spin_lock(&node->lock);
+
+ /* Already poisoned? Nothing more to do */
+ if (page->poison)
+ goto out;
+
+ page->poison = 1;
+
+ /*
+ * If the page is on a free list, move it to the per-node
+ * poison page list.
+ */
+ if (page->flags & SGX_EPC_PAGE_IS_FREE) {
+ list_move(&page->list, &node->sgx_poison_page_list);
+ goto out;
+ }
+
+ sgx_unmark_page_reclaimable(page);
+
+ /*
+ * TBD: Add additional plumbing to enable pre-emptive
+ * action for asynchronous poison notification. Until
+ * then just hope that the poison:
+ * a) is not accessed - sgx_free_epc_page() will deal with it
+ * when the user gives it back
+ * b) results in a recoverable machine check rather than
+ * a fatal one
+ */
+out:
+ spin_unlock(&node->lock);
+ return 0;
+}
+
+/*
+ * A section metric is concatenated from two registers: bits 12-31 of @low
+ * provide bits 12-31 of the metric, and bits 0-19 of @high provide bits
+ * 32-51 of the metric.
+ */
+static inline u64 __init sgx_calc_section_metric(u64 low, u64 high)
+{
+ return (low & GENMASK_ULL(31, 12)) +
+ ((high & GENMASK_ULL(19, 0)) << 32);
+}
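+
+/*
+ * Worked example (values are illustrative only): with @low = 0x40000000 and
+ * @high = 0x1, the masked @low contributes bits 12-31 (0x40000000) and @high
+ * contributes bit 32 (0x1 << 32), so the resulting metric is 0x140000000.
+ */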
+
+#ifdef CONFIG_NUMA
+static ssize_t sgx_total_bytes_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%lu\n", sgx_numa_nodes[dev->id].size);
+}
+static DEVICE_ATTR_RO(sgx_total_bytes);
+
+static umode_t arch_node_attr_is_visible(struct kobject *kobj,
+ struct attribute *attr, int idx)
+{
+ /* Make all x86/ attributes invisible when SGX is not initialized: */
+ if (nodes_empty(sgx_numa_mask))
+ return 0;
+
+ return attr->mode;
+}
+
+static struct attribute *arch_node_dev_attrs[] = {
+ &dev_attr_sgx_total_bytes.attr,
+ NULL,
+};
+
+const struct attribute_group arch_node_dev_group = {
+ .name = "x86",
+ .attrs = arch_node_dev_attrs,
+ .is_visible = arch_node_attr_is_visible,
+};
+
+static void __init arch_update_sysfs_visibility(int nid)
+{
+ struct node *node = node_devices[nid];
+ int ret;
+
+ ret = sysfs_update_group(&node->dev.kobj, &arch_node_dev_group);
+
+ if (ret)
+ pr_err("sysfs update failed (%d), files may be invisible", ret);
+}
+#else /* !CONFIG_NUMA */
+static void __init arch_update_sysfs_visibility(int nid) {}
+#endif
+
+static bool __init sgx_page_cache_init(void)
+{
+ u32 eax, ebx, ecx, edx, type;
+ u64 pa, size;
+ int nid;
+ int i;
+
+ sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);
+ if (!sgx_numa_nodes)
+ return false;
+
+ for (i = 0; i < ARRAY_SIZE(sgx_epc_sections); i++) {
+ cpuid_count(SGX_CPUID, i + SGX_CPUID_EPC, &eax, &ebx, &ecx, &edx);
+
+ type = eax & SGX_CPUID_EPC_MASK;
+ if (type == SGX_CPUID_EPC_INVALID)
+ break;
+
+ if (type != SGX_CPUID_EPC_SECTION) {
+ pr_err_once("Unknown EPC section type: %u\n", type);
+ break;
+ }
+
+ pa = sgx_calc_section_metric(eax, ebx);
+ size = sgx_calc_section_metric(ecx, edx);
+
+ pr_info("EPC section 0x%llx-0x%llx\n", pa, pa + size - 1);
+
+ if (!sgx_setup_epc_section(pa, size, i, &sgx_epc_sections[i])) {
+ pr_err("No free memory for an EPC section\n");
+ break;
+ }
+
+ nid = numa_map_to_online_node(phys_to_target_node(pa));
+ if (nid == NUMA_NO_NODE) {
+ /* The physical address is already printed above. */
+ pr_warn(FW_BUG "Unable to map EPC section to online node. Falling back to NUMA node 0.\n");
+ nid = 0;
+ }
+
+ if (!node_isset(nid, sgx_numa_mask)) {
+ spin_lock_init(&sgx_numa_nodes[nid].lock);
+ INIT_LIST_HEAD(&sgx_numa_nodes[nid].free_page_list);
+ INIT_LIST_HEAD(&sgx_numa_nodes[nid].sgx_poison_page_list);
+ node_set(nid, sgx_numa_mask);
+ sgx_numa_nodes[nid].size = 0;
+
+ /* Make SGX-specific node sysfs files visible: */
+ arch_update_sysfs_visibility(nid);
+ }
+
+ sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
+ sgx_numa_nodes[nid].size += size;
+
+ sgx_nr_epc_sections++;
+ }
+
+ if (!sgx_nr_epc_sections) {
+ pr_err("There are zero EPC sections.\n");
+ return false;
+ }
+
+ for_each_online_node(nid) {
+ if (!node_isset(nid, sgx_numa_mask) &&
+ node_state(nid, N_MEMORY) && node_state(nid, N_CPU))
+ pr_info("node%d has both CPUs and memory but doesn't have an EPC section\n",
+ nid);
+ }
+
+ return true;
+}
+
+/*
+ * Update the SGX_LEPUBKEYHASH MSRs to the values specified by the caller.
+ * The bare-metal driver must update them to the hash of the enclave's
+ * signer before EINIT. KVM needs to update them to the guest's virtual
+ * MSR values before doing EINIT on behalf of the guest.
+ */
+void sgx_update_lepubkeyhash(u64 *lepubkeyhash)
+{
+ int i;
+
+ WARN_ON_ONCE(preemptible());
+
+ for (i = 0; i < 4; i++)
+ wrmsrq(MSR_IA32_SGXLEPUBKEYHASH0 + i, lepubkeyhash[i]);
+}
+
+const struct file_operations sgx_provision_fops = {
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice sgx_dev_provision = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgx_provision",
+ .nodename = "sgx_provision",
+ .fops = &sgx_provision_fops,
+};
+
+/**
+ * sgx_set_attribute() - Update allowed attributes given file descriptor
+ * @allowed_attributes: Pointer to allowed enclave attributes
+ * @attribute_fd: File descriptor for specific attribute
+ *
+ * Append the enclave attribute indicated by the file descriptor to the
+ * allowed attributes. Currently only SGX_ATTR_PROVISIONKEY, indicated by
+ * /dev/sgx_provision, is supported.
+ *
+ * Return:
+ * - 0:       SGX_ATTR_PROVISIONKEY is appended to allowed_attributes
+ * - -EINVAL: Invalid or unsupported file descriptor
+ */
+int sgx_set_attribute(unsigned long *allowed_attributes,
+ unsigned int attribute_fd)
+{
+ CLASS(fd, f)(attribute_fd);
+
+ if (fd_empty(f))
+ return -EINVAL;
+
+ if (fd_file(f)->f_op != &sgx_provision_fops)
+ return -EINVAL;
+
+ *allowed_attributes |= SGX_ATTR_PROVISIONKEY;
+ return 0;
+}
+EXPORT_SYMBOL_FOR_KVM(sgx_set_attribute);
+
+/* Counter to count the active SGX users */
+static int sgx_usage_count;
+
+/**
+ * sgx_update_svn() - Attempt to call ENCLS[EUPDATESVN].
+ *
+ * This instruction attempts to update CPUSVN to the
+ * currently loaded microcode update SVN and generate new
+ * cryptographic assets.
+ *
+ * Return:
+ * * %0: - Success or not supported
+ * * %-EAGAIN: - Can be safely retried; the failure is due to lack of
+ *   entropy in the RNG
+ * * %-EIO: - Unexpected error, retries are not advisable
+ */
+static int sgx_update_svn(void)
+{
+ int ret;
+
+ /*
+ * If EUPDATESVN is not available, it is ok to
+ * silently skip it to comply with legacy behavior.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_SGX_EUPDATESVN))
+ return 0;
+
+ /*
+ * EPC is guaranteed to be empty when there are no users.
+ * Warn if this is not the first user before proceeding further.
+ */
+ WARN(sgx_usage_count, "Elevated usage count when calling EUPDATESVN\n");
+
+ for (int i = 0; i < RDRAND_RETRY_LOOPS; i++) {
+ ret = __eupdatesvn();
+
+ /* Stop on success or unexpected errors: */
+ if (ret != SGX_INSUFFICIENT_ENTROPY)
+ break;
+ }
+
+ switch (ret) {
+ case 0:
+ /*
+ * SVN successfully updated.
+ * Let users know when the update was successful.
+ */
+ pr_info("SVN updated successfully\n");
+ return 0;
+ case SGX_NO_UPDATE:
+ /*
+ * SVN update failed since the current SVN is
+ * not newer than CPUSVN. This is the most
+ * common case and indicates no harm.
+ */
+ return 0;
+ case SGX_INSUFFICIENT_ENTROPY:
+ /*
+ * SVN update failed due to lack of entropy in DRNG.
+ * Indicate to userspace that it should retry.
+ */
+ return -EAGAIN;
+ default:
+ break;
+ }
+
+ /*
+ * EUPDATESVN was called when EPC is empty, all other error
+ * codes are unexpected.
+ */
+ ENCLS_WARN(ret, "EUPDATESVN");
+ return -EIO;
+}
+
+/* Mutex to ensure no concurrent EPC accesses during EUPDATESVN */
+static DEFINE_MUTEX(sgx_svn_lock);
+
+int sgx_inc_usage_count(void)
+{
+ int ret;
+
+ guard(mutex)(&sgx_svn_lock);
+
+ if (!sgx_usage_count) {
+ ret = sgx_update_svn();
+ if (ret)
+ return ret;
+ }
+
+ sgx_usage_count++;
+
+ return 0;
+}
+
+void sgx_dec_usage_count(void)
+{
+ guard(mutex)(&sgx_svn_lock);
+ sgx_usage_count--;
+}
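+
+/*
+ * These helpers are expected to be paired: each successful
+ * sgx_inc_usage_count() on an open() path must be balanced by exactly one
+ * sgx_dec_usage_count() on the corresponding release() path. See
+ * sgx_vepc_open() and sgx_vepc_release() in virt.c for the pattern.
+ */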
+
+static int __init sgx_init(void)
+{
+ int ret;
+ int i;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SGX))
+ return -ENODEV;
+
+ if (!sgx_page_cache_init())
+ return -ENOMEM;
+
+ if (!sgx_page_reclaimer_init()) {
+ ret = -ENOMEM;
+ goto err_page_cache;
+ }
+
+ ret = misc_register(&sgx_dev_provision);
+ if (ret)
+ goto err_kthread;
+
+ /*
+ * Always try to initialize the native *and* KVM drivers.
+ * The KVM driver is less picky than the native one and
+ * can function if the native one is not supported on the
+ * current system or fails to initialize.
+ *
+ * Error out only if both fail to initialize.
+ */
+ ret = sgx_drv_init();
+
+ if (sgx_vepc_init() && ret)
+ goto err_provision;
+
+ return 0;
+
+err_provision:
+ misc_deregister(&sgx_dev_provision);
+
+err_kthread:
+ kthread_stop(ksgxd_tsk);
+
+err_page_cache:
+ for (i = 0; i < sgx_nr_epc_sections; i++) {
+ vfree(sgx_epc_sections[i].pages);
+ memunmap(sgx_epc_sections[i].virt_addr);
+ }
+
+ return ret;
+}
+
+device_initcall(sgx_init);
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
new file mode 100644
index 000000000000..f5940393d9bd
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -0,0 +1,110 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _X86_SGX_H
+#define _X86_SGX_H
+
+#include <linux/bitops.h>
+#include <linux/err.h>
+#include <linux/io.h>
+#include <linux/rwsem.h>
+#include <linux/types.h>
+#include <asm/asm.h>
+#include <asm/sgx.h>
+
+#undef pr_fmt
+#define pr_fmt(fmt) "sgx: " fmt
+
+#define EREMOVE_ERROR_MESSAGE \
+ "EREMOVE returned %d (0x%x) and an EPC page was leaked. SGX may become unusable. " \
+ "Refer to Documentation/arch/x86/sgx.rst for more information."
+
+#define SGX_MAX_EPC_SECTIONS 8
+#define SGX_EEXTEND_BLOCK_SIZE 256
+#define SGX_NR_TO_SCAN 16
+#define SGX_NR_LOW_PAGES 32
+#define SGX_NR_HIGH_PAGES 64
+
+/* Pages that are being tracked by the page reclaimer. */
+#define SGX_EPC_PAGE_RECLAIMER_TRACKED BIT(0)
+
+/* Pages on free list */
+#define SGX_EPC_PAGE_IS_FREE BIT(1)
+
+struct sgx_epc_page {
+ unsigned int section;
+ u16 flags;
+ u16 poison;
+ struct sgx_encl_page *owner;
+ struct list_head list;
+};
+
+/*
+ * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
+ * the free page list local to the node is stored here.
+ */
+struct sgx_numa_node {
+ struct list_head free_page_list;
+ struct list_head sgx_poison_page_list;
+ unsigned long size;
+ spinlock_t lock;
+};
+
+/*
+ * The firmware can define multiple chunks of EPC in different areas of
+ * physical memory, e.g. one per NUMA node. This structure is used to store
+ * the EPC pages of one EPC section and the virtual memory area where the
+ * pages have been mapped.
+ */
+struct sgx_epc_section {
+ unsigned long phys_addr;
+ void *virt_addr;
+ struct sgx_epc_page *pages;
+ struct sgx_numa_node *node;
+};
+
+extern struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
+
+static inline unsigned long sgx_get_epc_phys_addr(struct sgx_epc_page *page)
+{
+ struct sgx_epc_section *section = &sgx_epc_sections[page->section];
+ unsigned long index;
+
+ index = ((unsigned long)page - (unsigned long)section->pages) / sizeof(*page);
+
+ return section->phys_addr + index * PAGE_SIZE;
+}
+
+static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
+{
+ struct sgx_epc_section *section = &sgx_epc_sections[page->section];
+ unsigned long index;
+
+ index = ((unsigned long)page - (unsigned long)section->pages) / sizeof(*page);
+
+ return section->virt_addr + index * PAGE_SIZE;
+}
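+
+/*
+ * Both helpers above compute the page index by pointer arithmetic within
+ * the section's page array. Illustrative example: for the third entry
+ * (index 2) of a section with phys_addr 0x80000000 and 4K pages,
+ * sgx_get_epc_phys_addr() returns 0x80000000 + 2 * 4096 = 0x80002000.
+ */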
+
+struct sgx_epc_page *__sgx_alloc_epc_page(void);
+void sgx_free_epc_page(struct sgx_epc_page *page);
+
+void sgx_reclaim_direct(void);
+void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
+int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
+struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+
+void sgx_ipi_cb(void *info);
+
+#ifdef CONFIG_X86_SGX_KVM
+int __init sgx_vepc_init(void);
+#else
+static inline int __init sgx_vepc_init(void)
+{
+ return -ENODEV;
+}
+#endif
+
+int sgx_inc_usage_count(void);
+void sgx_dec_usage_count(void);
+
+void sgx_update_lepubkeyhash(u64 *lepubkeyhash);
+
+#endif /* _X86_SGX_H */
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
new file mode 100644
index 000000000000..8de1f1a755f2
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -0,0 +1,454 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Device driver to expose SGX enclave memory to KVM guests.
+ *
+ * Copyright(c) 2021 Intel Corporation.
+ */
+
+#include <linux/kvm_types.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/xarray.h>
+#include <asm/sgx.h>
+#include <uapi/asm/sgx.h>
+
+#include "encls.h"
+#include "sgx.h"
+
+struct sgx_vepc {
+ struct xarray page_array;
+ struct mutex lock;
+};
+
+/*
+ * Temporary SECS pages that cannot be EREMOVE'd due to having children in
+ * other virtual EPC instances, and the lock to protect the list.
+ */
+static struct mutex zombie_secs_pages_lock;
+static struct list_head zombie_secs_pages;
+
+static int __sgx_vepc_fault(struct sgx_vepc *vepc,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ struct sgx_epc_page *epc_page;
+ unsigned long index, pfn;
+ int ret;
+
+ WARN_ON(!mutex_is_locked(&vepc->lock));
+
+ /* Calculate index of EPC page in virtual EPC's page_array */
+ index = vma->vm_pgoff + PFN_DOWN(addr - vma->vm_start);
+
+ epc_page = xa_load(&vepc->page_array, index);
+ if (epc_page)
+ return 0;
+
+ epc_page = sgx_alloc_epc_page(vepc, false);
+ if (IS_ERR(epc_page))
+ return PTR_ERR(epc_page);
+
+ ret = xa_err(xa_store(&vepc->page_array, index, epc_page, GFP_KERNEL));
+ if (ret)
+ goto err_free;
+
+ pfn = PFN_DOWN(sgx_get_epc_phys_addr(epc_page));
+
+ ret = vmf_insert_pfn(vma, addr, pfn);
+ if (ret != VM_FAULT_NOPAGE) {
+ ret = -EFAULT;
+ goto err_delete;
+ }
+
+ return 0;
+
+err_delete:
+ xa_erase(&vepc->page_array, index);
+err_free:
+ sgx_free_epc_page(epc_page);
+ return ret;
+}
+
+static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct sgx_vepc *vepc = vma->vm_private_data;
+ int ret;
+
+ mutex_lock(&vepc->lock);
+ ret = __sgx_vepc_fault(vepc, vma, vmf->address);
+ mutex_unlock(&vepc->lock);
+
+ if (!ret)
+ return VM_FAULT_NOPAGE;
+
+ if (ret == -EBUSY && (vmf->flags & FAULT_FLAG_ALLOW_RETRY)) {
+ mmap_read_unlock(vma->vm_mm);
+ return VM_FAULT_RETRY;
+ }
+
+ return VM_FAULT_SIGBUS;
+}
+
+static const struct vm_operations_struct sgx_vepc_vm_ops = {
+ .fault = sgx_vepc_fault,
+};
+
+static int sgx_vepc_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct sgx_vepc *vepc = file->private_data;
+
+ if (!(vma->vm_flags & VM_SHARED))
+ return -EINVAL;
+
+ vma->vm_ops = &sgx_vepc_vm_ops;
+ /* Don't copy VMA in fork() */
+ vm_flags_set(vma, VM_PFNMAP | VM_IO | VM_DONTDUMP | VM_DONTCOPY);
+ vma->vm_private_data = vepc;
+
+ return 0;
+}
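+
+/*
+ * A rough sketch of how userspace (typically a VMM) is expected to map a
+ * vEPC region; MAP_SHARED is mandatory because of the VM_SHARED check in
+ * sgx_vepc_mmap() above (error handling omitted):
+ *
+ *	int fd = open("/dev/sgx_vepc", O_RDWR);
+ *	void *epc = mmap(NULL, size, PROT_READ | PROT_WRITE,
+ *			 MAP_SHARED, fd, 0);
+ */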
+
+static int sgx_vepc_remove_page(struct sgx_epc_page *epc_page)
+{
+ /*
+ * Take a previously guest-owned EPC page and return it to the
+ * general EPC page pool.
+ *
+ * Guests can not be trusted to have left this page in a good
+ * state, so run EREMOVE on the page unconditionally. In the
+ * case that a guest properly EREMOVE'd this page, a superfluous
+ * EREMOVE is harmless.
+ */
+ return __eremove(sgx_get_epc_virt_addr(epc_page));
+}
+
+static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
+{
+ int ret = sgx_vepc_remove_page(epc_page);
+ if (ret) {
+ /*
+ * Only SGX_CHILD_PRESENT is expected, which happens when
+ * EREMOVE'ing an SECS that still has children. This can be
+ * handled by EREMOVE'ing the SECS again after all pages in
+ * the virtual EPC have been EREMOVE'd. See the comments below
+ * in sgx_vepc_release().
+ *
+ * The user of the virtual EPC (KVM) needs to guarantee that no
+ * logical processor is still running in the enclave in the
+ * guest, otherwise EREMOVE will get SGX_ENCLAVE_ACT which
+ * cannot be handled here.
+ */
+ WARN_ONCE(ret != SGX_CHILD_PRESENT, EREMOVE_ERROR_MESSAGE,
+ ret, ret);
+ return ret;
+ }
+
+ sgx_free_epc_page(epc_page);
+ return 0;
+}
+
+static long sgx_vepc_remove_all(struct sgx_vepc *vepc)
+{
+ struct sgx_epc_page *entry;
+ unsigned long index;
+ long failures = 0;
+
+ xa_for_each(&vepc->page_array, index, entry) {
+ int ret = sgx_vepc_remove_page(entry);
+ if (ret) {
+ if (ret == SGX_CHILD_PRESENT) {
+ /* The page is a SECS, userspace will retry. */
+ failures++;
+ } else {
+ /*
+ * Report errors due to #GP or SGX_ENCLAVE_ACT; do not
+ * WARN, as userspace can induce said failures by
+ * calling the ioctl concurrently on multiple vEPCs or
+ * while one or more CPUs is running the enclave. Only
+ * a #PF on EREMOVE indicates a kernel/hardware issue.
+ */
+ WARN_ON_ONCE(encls_faulted(ret) &&
+ ENCLS_TRAPNR(ret) != X86_TRAP_GP);
+ return -EBUSY;
+ }
+ }
+ cond_resched();
+ }
+
+ /*
+ * Return the number of SECS pages that failed to be removed, so
+ * userspace knows that it has to retry.
+ */
+ return failures;
+}
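+
+/*
+ * Since the return value counts the SECS pages which could not yet be
+ * removed, a userspace caller would typically loop on the ioctl until it
+ * stops reporting failures, e.g. (sketch, "fd" being an open vEPC
+ * descriptor):
+ *
+ *	while (ioctl(fd, SGX_IOC_VEPC_REMOVE_ALL, 0) > 0)
+ *		;
+ */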
+
+static int sgx_vepc_release(struct inode *inode, struct file *file)
+{
+ struct sgx_vepc *vepc = file->private_data;
+ struct sgx_epc_page *epc_page, *tmp, *entry;
+ unsigned long index;
+
+ LIST_HEAD(secs_pages);
+
+ xa_for_each(&vepc->page_array, index, entry) {
+ /*
+ * Remove all normal, child pages. sgx_vepc_free_page()
+ * will fail if EREMOVE fails, but this is OK and expected on
+ * SECS pages. Those can only be EREMOVE'd *after* all their
+ * child pages. Retries below will clean them up.
+ */
+ if (sgx_vepc_free_page(entry))
+ continue;
+
+ xa_erase(&vepc->page_array, index);
+ cond_resched();
+ }
+
+ /*
+ * Retry EREMOVE'ing pages. This will clean up any SECS pages that
+ * only had children in this 'epc' area.
+ */
+ xa_for_each(&vepc->page_array, index, entry) {
+ epc_page = entry;
+ /*
+ * An EREMOVE failure here means that the SECS page still
+ * has children. But, since all children in this 'sgx_vepc'
+ * have been removed, the SECS page must have a child on
+ * another instance.
+ */
+ if (sgx_vepc_free_page(epc_page))
+ list_add_tail(&epc_page->list, &secs_pages);
+
+ xa_erase(&vepc->page_array, index);
+ cond_resched();
+ }
+
+ /*
+ * SECS pages are "pinned" by child pages, and "unpinned" once all
+ * children have been EREMOVE'd. A child page in this instance
+ * may have pinned an SECS page encountered in an earlier release(),
+ * creating a zombie. Since some children were EREMOVE'd above,
+ * try to EREMOVE all zombies in the hopes that one was unpinned.
+ */
+ mutex_lock(&zombie_secs_pages_lock);
+ list_for_each_entry_safe(epc_page, tmp, &zombie_secs_pages, list) {
+ /*
+ * Speculatively remove the page from the list of zombies,
+ * if the page is successfully EREMOVE'd it will be added to
+ * the list of free pages. If EREMOVE fails, throw the page
+ * on the local list, which will be spliced on at the end.
+ */
+ list_del(&epc_page->list);
+
+ if (sgx_vepc_free_page(epc_page))
+ list_add_tail(&epc_page->list, &secs_pages);
+ cond_resched();
+ }
+
+ if (!list_empty(&secs_pages))
+ list_splice_tail(&secs_pages, &zombie_secs_pages);
+ mutex_unlock(&zombie_secs_pages_lock);
+
+ xa_destroy(&vepc->page_array);
+ kfree(vepc);
+
+ sgx_dec_usage_count();
+ return 0;
+}
+
+static int __sgx_vepc_open(struct inode *inode, struct file *file)
+{
+ struct sgx_vepc *vepc;
+
+ vepc = kzalloc(sizeof(struct sgx_vepc), GFP_KERNEL);
+ if (!vepc)
+ return -ENOMEM;
+ mutex_init(&vepc->lock);
+ xa_init(&vepc->page_array);
+
+ file->private_data = vepc;
+
+ return 0;
+}
+
+static int sgx_vepc_open(struct inode *inode, struct file *file)
+{
+ int ret;
+
+ ret = sgx_inc_usage_count();
+ if (ret)
+ return ret;
+
+ ret = __sgx_vepc_open(inode, file);
+ if (ret) {
+ sgx_dec_usage_count();
+ return ret;
+ }
+
+ return 0;
+}
+
+static long sgx_vepc_ioctl(struct file *file,
+ unsigned int cmd, unsigned long arg)
+{
+ struct sgx_vepc *vepc = file->private_data;
+
+ switch (cmd) {
+ case SGX_IOC_VEPC_REMOVE_ALL:
+ if (arg)
+ return -EINVAL;
+ return sgx_vepc_remove_all(vepc);
+
+ default:
+ return -ENOTTY;
+ }
+}
+
+static const struct file_operations sgx_vepc_fops = {
+ .owner = THIS_MODULE,
+ .open = sgx_vepc_open,
+ .unlocked_ioctl = sgx_vepc_ioctl,
+ .compat_ioctl = sgx_vepc_ioctl,
+ .release = sgx_vepc_release,
+ .mmap = sgx_vepc_mmap,
+};
+
+static struct miscdevice sgx_vepc_dev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgx_vepc",
+ .nodename = "sgx_vepc",
+ .fops = &sgx_vepc_fops,
+};
+
+int __init sgx_vepc_init(void)
+{
+ /* SGX virtualization requires KVM to work */
+ if (!cpu_feature_enabled(X86_FEATURE_VMX))
+ return -ENODEV;
+
+ INIT_LIST_HEAD(&zombie_secs_pages);
+ mutex_init(&zombie_secs_pages_lock);
+
+ return misc_register(&sgx_vepc_dev);
+}
+
+/**
+ * sgx_virt_ecreate() - Run ECREATE on behalf of guest
+ * @pageinfo: Pointer to PAGEINFO structure
+ * @secs: Userspace pointer to SECS page
+ * @trapnr: trap number injected to guest in case of ECREATE error
+ *
+ * Run ECREATE on behalf of the guest after KVM traps ECREATE for the purpose
+ * of enforcing policies on the guest's enclaves, and return the trap number
+ * which should be injected into the guest in case of any ECREATE error.
+ *
+ * Return:
+ * - 0: ECREATE was successful.
+ * - <0: on error.
+ */
+int sgx_virt_ecreate(struct sgx_pageinfo *pageinfo, void __user *secs,
+ int *trapnr)
+{
+ int ret;
+
+ /*
+ * @secs is an untrusted, userspace-provided address. It comes from
+ * KVM and is assumed to be a valid pointer which points somewhere in
+ * userspace. This can fault and invoke SGX or other fault handlers
+ * when the userspace mapping for @secs doesn't exist.
+ *
+ * Add a WARN() to make sure @secs is a valid userspace pointer from
+ * the caller (KVM), which should already have handled the invalid
+ * pointer case (for instance, one made up by a malicious guest). All
+ * other checks, such as the alignment of @secs, are deferred to ENCLS
+ * itself.
+ */
+ if (WARN_ON_ONCE(!access_ok(secs, PAGE_SIZE)))
+ return -EINVAL;
+
+ __uaccess_begin();
+ ret = __ecreate(pageinfo, (void *)secs);
+ __uaccess_end();
+
+ if (encls_faulted(ret)) {
+ *trapnr = ENCLS_TRAPNR(ret);
+ return -EFAULT;
+ }
+
+ /* ECREATE doesn't return an error code, it faults or succeeds. */
+ WARN_ON_ONCE(ret);
+ return 0;
+}
+EXPORT_SYMBOL_FOR_KVM(sgx_virt_ecreate);
+
+static int __sgx_virt_einit(void __user *sigstruct, void __user *token,
+ void __user *secs)
+{
+ int ret;
+
+ /*
+ * Make sure all userspace pointers from the caller (KVM) are valid.
+ * All other checks deferred to ENCLS itself. Also see comment
+ * for @secs in sgx_virt_ecreate().
+ */
+#define SGX_EINITTOKEN_SIZE 304
+ if (WARN_ON_ONCE(!access_ok(sigstruct, sizeof(struct sgx_sigstruct)) ||
+ !access_ok(token, SGX_EINITTOKEN_SIZE) ||
+ !access_ok(secs, PAGE_SIZE)))
+ return -EINVAL;
+
+ __uaccess_begin();
+ ret = __einit((void *)sigstruct, (void *)token, (void *)secs);
+ __uaccess_end();
+
+ return ret;
+}
+
+/**
+ * sgx_virt_einit() - Run EINIT on behalf of guest
+ * @sigstruct: Userspace pointer to SIGSTRUCT structure
+ * @token: Userspace pointer to EINITTOKEN structure
+ * @secs: Userspace pointer to SECS page
+ * @lepubkeyhash: Pointer to guest's *virtual* SGX_LEPUBKEYHASH MSR values
+ * @trapnr: trap number injected to guest in case of EINIT error
+ *
+ * Run EINIT on behalf of the guest after KVM traps EINIT. If SGX_LC is
+ * available in the host, the SGX driver may rewrite the hardware values at
+ * will, therefore KVM needs to update the hardware values to the guest's
+ * virtual MSR values in order to ensure EINIT is executed with the expected
+ * hardware values.
+ *
+ * Return:
+ * - 0: EINIT was successful.
+ * - <0: on error.
+ */
+int sgx_virt_einit(void __user *sigstruct, void __user *token,
+ void __user *secs, u64 *lepubkeyhash, int *trapnr)
+{
+ int ret;
+
+ if (!cpu_feature_enabled(X86_FEATURE_SGX_LC)) {
+ ret = __sgx_virt_einit(sigstruct, token, secs);
+ } else {
+ preempt_disable();
+
+ sgx_update_lepubkeyhash(lepubkeyhash);
+
+ ret = __sgx_virt_einit(sigstruct, token, secs);
+ preempt_enable();
+ }
+
+ /* Propagate up the error from the WARN_ON_ONCE in __sgx_virt_einit() */
+ if (ret == -EINVAL)
+ return ret;
+
+ if (encls_faulted(ret)) {
+ *trapnr = ENCLS_TRAPNR(ret);
+ return -EFAULT;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_FOR_KVM(sgx_virt_einit);
diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
new file mode 100644
index 000000000000..f55ea3cdbf88
--- /dev/null
+++ b/arch/x86/kernel/cpu/topology.c
@@ -0,0 +1,581 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * CPU/APIC topology
+ *
+ * The APIC IDs describe the system topology in multiple domain levels.
+ * The CPUID topology parser provides the information which part of the
+ * APIC ID is associated to the individual levels:
+ *
+ * [PACKAGE][DIEGRP][DIE][TILE][MODULE][CORE][THREAD]
+ *
+ * The root space contains the package (socket) IDs.
+ *
+ * Levels which are not enumerated consume 0 bits in the APIC ID, but
+ * conceptually they are always represented. If e.g. only CORE and THREAD
+ * levels are enumerated then DIE, MODULE and TILE have the same physical
+ * ID as the PACKAGE.
+ *
+ * If SMT is not supported, then the THREAD domain is still used. It then
+ * has the same physical ID as the CORE domain and is the only child of
+ * the core domain.
+ *
+ * This allows a unified view on the system independent of the enumerated
+ * domain levels without requiring any conditionals in the code.
+ */
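+
+/*
+ * As an illustrative example, assume an SMT shift of 1 and a core shift
+ * of 4 with no further levels enumerated: a hypothetical APIC ID 0x2b
+ * then decodes to thread 1 (bit 0), core 5 (bits 1-3) and package 2
+ * (bits 4 and up).
+ */
+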
+#define pr_fmt(fmt) "CPU topo: " fmt
+#include <linux/cpu.h>
+
+#include <xen/xen.h>
+
+#include <asm/apic.h>
+#include <asm/hypervisor.h>
+#include <asm/io_apic.h>
+#include <asm/mpspec.h>
+#include <asm/msr.h>
+#include <asm/smp.h>
+
+#include "cpu.h"
+
+/*
+ * Map cpu index to physical APIC ID
+ */
+DEFINE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_apicid, BAD_APICID);
+DEFINE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_acpiid, CPU_ACPIID_INVALID);
+EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_apicid);
+EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_acpiid);
+
+/* Bitmap of physically present CPUs. */
+DECLARE_BITMAP(phys_cpu_present_map, MAX_LOCAL_APIC) __read_mostly;
+
+/* Used for CPU number allocation and parallel CPU bringup */
+u32 cpuid_to_apicid[] __ro_after_init = { [0 ... NR_CPUS - 1] = BAD_APICID, };
+
+/* Bitmaps to mark registered APICs at each topology domain */
+static struct { DECLARE_BITMAP(map, MAX_LOCAL_APIC); } apic_maps[TOPO_MAX_DOMAIN] __ro_after_init;
+
+/*
+ * Keep track of assigned, disabled and rejected CPUs. The assigned count
+ * is preset to 1 because CPU #0 is reserved for the boot CPU.
+ */
+static struct {
+ unsigned int nr_assigned_cpus;
+ unsigned int nr_disabled_cpus;
+ unsigned int nr_rejected_cpus;
+ u32 boot_cpu_apic_id;
+ u32 real_bsp_apic_id;
+} topo_info __ro_after_init = {
+ .nr_assigned_cpus = 1,
+ .boot_cpu_apic_id = BAD_APICID,
+ .real_bsp_apic_id = BAD_APICID,
+};
+
+#define domain_weight(_dom) bitmap_weight(apic_maps[_dom].map, MAX_LOCAL_APIC)
+
+bool arch_match_cpu_phys_id(int cpu, u64 phys_id)
+{
+ return phys_id == (u64)cpuid_to_apicid[cpu];
+}
+
+static void cpu_mark_primary_thread(unsigned int cpu, unsigned int apicid)
+{
+ if (!(apicid & (__max_threads_per_core - 1)))
+ cpumask_set_cpu(cpu, &__cpu_primary_thread_mask);
+}
+
+/*
+ * Convert the APIC ID to a domain level ID by masking out the low bits
+ * below the domain level @dom.
+ */
+static inline u32 topo_apicid(u32 apicid, enum x86_topology_domains dom)
+{
+ if (dom == TOPO_SMT_DOMAIN)
+ return apicid;
+ return apicid & (UINT_MAX << x86_topo_system.dom_shifts[dom - 1]);
+}
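+
+/*
+ * Continuing the illustrative example above (SMT shift 1, core shift 4):
+ * topo_apicid(0x2b, TOPO_CORE_DOMAIN) clears the SMT bit and yields 0x2a,
+ * while topo_apicid(0x2b, TOPO_PKG_DOMAIN) clears everything below the
+ * package level and yields 0x20.
+ */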
+
+static int topo_lookup_cpuid(u32 apic_id)
+{
+ int i;
+
+ /* CPU# to APICID mapping is persistent once it is established */
+ for (i = 0; i < topo_info.nr_assigned_cpus; i++) {
+ if (cpuid_to_apicid[i] == apic_id)
+ return i;
+ }
+ return -ENODEV;
+}
+
+static __init int topo_get_cpunr(u32 apic_id)
+{
+ int cpu = topo_lookup_cpuid(apic_id);
+
+ if (cpu >= 0)
+ return cpu;
+
+ return topo_info.nr_assigned_cpus++;
+}
+
+static void topo_set_cpuids(unsigned int cpu, u32 apic_id, u32 acpi_id)
+{
+#if defined(CONFIG_SMP) || defined(CONFIG_X86_64)
+ early_per_cpu(x86_cpu_to_apicid, cpu) = apic_id;
+ early_per_cpu(x86_cpu_to_acpiid, cpu) = acpi_id;
+#endif
+ set_cpu_present(cpu, true);
+}
+
+static __init bool check_for_real_bsp(u32 apic_id)
+{
+ bool is_bsp = false, has_apic_base = boot_cpu_data.x86 >= 6;
+ u64 msr;
+
+ /*
+ * There is no really good way to detect whether this is a kdump
+ * kernel, but except for the Voyager SMP monstrosity, which is no
+ * longer supported, the real BSP APIC ID is the first one which is
+ * enumerated by firmware. That allows detecting whether the boot
+ * CPU is the real BSP. If it is not, then do not register the APIC
+ * because sending INIT to the real BSP would reset the whole
+ * system.
+ *
+ * The first APIC ID which is enumerated by firmware is detectable
+ * because the boot CPU APIC ID is registered before that without
+ * invoking this code.
+ */
+ if (topo_info.real_bsp_apic_id != BAD_APICID)
+ return false;
+
+ /*
+ * Check whether the enumeration order is broken by evaluating the
+ * BSP bit in the APICBASE MSR. If the CPU does not have the
+ * APICBASE MSR then the BSP detection is not possible and the
+ * kernel must rely on the firmware enumeration order.
+ */
+ if (has_apic_base) {
+ rdmsrq(MSR_IA32_APICBASE, msr);
+ is_bsp = !!(msr & MSR_IA32_APICBASE_BSP);
+ }
+
+ if (apic_id == topo_info.boot_cpu_apic_id) {
+ /*
+ * If the boot CPU has the APIC BSP bit set then the
+ * firmware enumeration agrees. If the CPU does not
+ * have the APICBASE MSR then the only choice is to trust
+ * the enumeration order.
+ */
+ if (is_bsp || !has_apic_base) {
+ topo_info.real_bsp_apic_id = apic_id;
+ return false;
+ }
+ /*
+ * If the boot APIC is enumerated first, but the APICBASE
+ * MSR does not have the BSP bit set, then there is no way
+ * to discover the real BSP here. Assume a crash kernel and
+ * limit the number of CPUs to 1 as an INIT to the real BSP
+ * would reset the machine.
+ */
+ pr_warn("Enumerated BSP APIC %x is not marked in APICBASE MSR\n", apic_id);
+ pr_warn("Assuming crash kernel. Limiting to one CPU to prevent machine INIT\n");
+ set_nr_cpu_ids(1);
+ goto fwbug;
+ }
+
+ pr_warn("Boot CPU APIC ID not the first enumerated APIC ID: %x != %x\n",
+ topo_info.boot_cpu_apic_id, apic_id);
+
+ if (is_bsp) {
+ /*
+ * The boot CPU has the APIC BSP bit set. Use it and complain
+ * about the broken firmware enumeration.
+ */
+ topo_info.real_bsp_apic_id = topo_info.boot_cpu_apic_id;
+ goto fwbug;
+ }
+
+ pr_warn("Crash kernel detected. Disabling real BSP to prevent machine INIT\n");
+
+ topo_info.real_bsp_apic_id = apic_id;
+ return true;
+
+fwbug:
+ pr_warn(FW_BUG "APIC enumeration order not specification compliant\n");
+ return false;
+}
+
+static unsigned int topo_unit_count(u32 lvlid, enum x86_topology_domains at_level,
+ unsigned long *map)
+{
+ unsigned int id, end, cnt = 0;
+
+ /* Calculate the exclusive end */
+ end = lvlid + (1U << x86_topo_system.dom_shifts[at_level]);
+
+ /* Unfortunately there is no bitmap_weight_range() */
+ for (id = find_next_bit(map, end, lvlid); id < end; id = find_next_bit(map, end, ++id))
+ cnt++;
+ return cnt;
+}
+
+static __init void topo_register_apic(u32 apic_id, u32 acpi_id, bool present)
+{
+ int cpu, dom;
+
+ if (present) {
+ set_bit(apic_id, phys_cpu_present_map);
+
+ /*
+ * Double registration is valid in case of the boot CPU
+ * APIC because that is registered before the enumeration
+ * of the APICs via firmware parsers or VM guest
+ * mechanisms.
+ */
+ if (apic_id == topo_info.boot_cpu_apic_id)
+ cpu = 0;
+ else
+ cpu = topo_get_cpunr(apic_id);
+
+ cpuid_to_apicid[cpu] = apic_id;
+ topo_set_cpuids(cpu, apic_id, acpi_id);
+ } else {
+ u32 pkgid = topo_apicid(apic_id, TOPO_PKG_DOMAIN);
+
+ /*
+ * Check for present APICs in the same package when running
+ * on bare metal. Allow the bogosity in a guest.
+ */
+ if (hypervisor_is_type(X86_HYPER_NATIVE) &&
+ topo_unit_count(pkgid, TOPO_PKG_DOMAIN, phys_cpu_present_map)) {
+ pr_info_once("Ignoring hot-pluggable APIC ID %x in present package.\n",
+ apic_id);
+ topo_info.nr_rejected_cpus++;
+ return;
+ }
+
+ topo_info.nr_disabled_cpus++;
+ }
+
+ /*
+ * Register present and possible CPUs in the domain
+ * maps. cpu_possible_map will be updated in
+ * topology_init_possible_cpus() after enumeration is done.
+ */
+ for (dom = TOPO_SMT_DOMAIN; dom < TOPO_MAX_DOMAIN; dom++)
+ set_bit(topo_apicid(apic_id, dom), apic_maps[dom].map);
+}
+
+/**
+ * topology_register_apic - Register an APIC in early topology maps
+ * @apic_id: The APIC ID to set up
+ * @acpi_id: The ACPI ID associated to the APIC
+ * @present: True if the corresponding CPU is present
+ */
+void __init topology_register_apic(u32 apic_id, u32 acpi_id, bool present)
+{
+ if (apic_id >= MAX_LOCAL_APIC) {
+ pr_err_once("APIC ID %x exceeds kernel limit of: %x\n", apic_id, MAX_LOCAL_APIC - 1);
+ topo_info.nr_rejected_cpus++;
+ return;
+ }
+
+ if (check_for_real_bsp(apic_id)) {
+ topo_info.nr_rejected_cpus++;
+ return;
+ }
+
+ /* CPU numbers exhausted? */
+ if (apic_id != topo_info.boot_cpu_apic_id && topo_info.nr_assigned_cpus >= nr_cpu_ids) {
+ pr_warn_once("CPU limit of %d reached. Ignoring further CPUs\n", nr_cpu_ids);
+ topo_info.nr_rejected_cpus++;
+ return;
+ }
+
+ topo_register_apic(apic_id, acpi_id, present);
+}
+
+/**
+ * topology_register_boot_apic - Register the boot CPU APIC
+ * @apic_id: The APIC ID to set up
+ *
+ * Separate so CPU #0 can be assigned
+ */
+void __init topology_register_boot_apic(u32 apic_id)
+{
+ WARN_ON_ONCE(topo_info.boot_cpu_apic_id != BAD_APICID);
+
+ topo_info.boot_cpu_apic_id = apic_id;
+ topo_register_apic(apic_id, CPU_ACPIID_INVALID, true);
+}
+
+/**
+ * topology_get_logical_id - Retrieve the logical ID at a given topology domain level
+ * @apicid: The APIC ID for which to lookup the logical ID
+ * @at_level: The topology domain level to use
+ *
+ * @apicid must be a full APIC ID, not the normalized variant. It's valid for
+ * all bits below the domain level specified by @at_level to be clear. So both
+ * real APIC IDs and backshifted normalized APIC IDs work correctly.
+ *
+ * Returns:
+ * - >= 0: The requested logical ID
+ * - -ERANGE: @apicid is out of range
+ * - -ENODEV: @apicid is not registered
+ */
+int topology_get_logical_id(u32 apicid, enum x86_topology_domains at_level)
+{
+ /* Remove the bits below @at_level to get the proper level ID of @apicid */
+ unsigned int lvlid = topo_apicid(apicid, at_level);
+
+ if (lvlid >= MAX_LOCAL_APIC)
+ return -ERANGE;
+ if (!test_bit(lvlid, apic_maps[at_level].map))
+ return -ENODEV;
+ /* Get the number of set bits before @lvlid. */
+ return bitmap_weight(apic_maps[at_level].map, lvlid);
+}
+EXPORT_SYMBOL_GPL(topology_get_logical_id);
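+
+/*
+ * The logical ID is the rank of the level ID among all registered IDs at
+ * that domain. Illustrative example: if only package IDs 0x00, 0x20 and
+ * 0x40 are registered, topology_get_logical_id(0x40, TOPO_PKG_DOMAIN)
+ * returns 2, because two set bits precede level ID 0x40 in the map.
+ */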
+
+/**
+ * topology_unit_count - Retrieve the count of specified units at a given topology domain level
+ * @apicid: The APIC ID which specifies the search range
+ * @which_units: The domain level specifying the units to count
+ * @at_level: The domain level at which @which_units have to be counted
+ *
+ * This returns the number of possible units according to the enumerated
+ * information.
+ *
+ * E.g. topology_unit_count(apicid, TOPO_CORE_DOMAIN, TOPO_PKG_DOMAIN)
+ * counts the number of possible cores in the package to which @apicid
+ * belongs.
+ *
+ * @at_level must obviously be greater than @which_units to produce useful
+ * results. If @at_level is equal to @which_units the result is
+ * unsurprisingly 1. If @at_level is less than @which_units the result
+ * is by definition undefined and the function returns 0.
+ */
+unsigned int topology_unit_count(u32 apicid, enum x86_topology_domains which_units,
+ enum x86_topology_domains at_level)
+{
+ /* Remove the bits below @at_level to get the proper level ID of @apicid */
+ unsigned int lvlid = topo_apicid(apicid, at_level);
+
+ if (lvlid >= MAX_LOCAL_APIC)
+ return 0;
+ if (!test_bit(lvlid, apic_maps[at_level].map))
+ return 0;
+ if (which_units > at_level)
+ return 0;
+ if (which_units == at_level)
+ return 1;
+ return topo_unit_count(lvlid, at_level, apic_maps[which_units].map);
+}
+
+#ifdef CONFIG_SMP
+int topology_get_primary_thread(unsigned int cpu)
+{
+ u32 apic_id = cpuid_to_apicid[cpu];
+
+ /*
+ * Get the core domain level APIC ID, which belongs to the primary
+ * thread, and return the CPU number assigned to it.
+ */
+ return topo_lookup_cpuid(topo_apicid(apic_id, TOPO_CORE_DOMAIN));
+}
+#endif
+
+#ifdef CONFIG_ACPI_HOTPLUG_CPU
+/**
+ * topology_hotplug_apic - Handle a physical hotplugged APIC after boot
+ * @apic_id: The APIC ID to set up
+ * @acpi_id: The ACPI ID associated to the APIC
+ */
+int topology_hotplug_apic(u32 apic_id, u32 acpi_id)
+{
+ int cpu;
+
+ if (apic_id >= MAX_LOCAL_APIC)
+ return -EINVAL;
+
+ /* Reject if the APIC ID was not registered during enumeration. */
+ if (!test_bit(apic_id, apic_maps[TOPO_SMT_DOMAIN].map))
+ return -ENODEV;
+
+ cpu = topo_lookup_cpuid(apic_id);
+ if (cpu < 0)
+ return -ENOSPC;
+
+ set_bit(apic_id, phys_cpu_present_map);
+ topo_set_cpuids(cpu, apic_id, acpi_id);
+ cpu_mark_primary_thread(cpu, apic_id);
+ return cpu;
+}
+
+/**
+ * topology_hotunplug_apic - Remove a physical hotplugged APIC after boot
+ * @cpu: The CPU number for which the APIC ID is removed
+ */
+void topology_hotunplug_apic(unsigned int cpu)
+{
+ u32 apic_id = cpuid_to_apicid[cpu];
+
+ if (apic_id == BAD_APICID)
+ return;
+
+ per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
+ clear_bit(apic_id, phys_cpu_present_map);
+ set_cpu_present(cpu, false);
+}
+#endif
+
+#ifdef CONFIG_X86_LOCAL_APIC
+static unsigned int max_possible_cpus __initdata = NR_CPUS;
+
+/**
+ * topology_apply_cmdline_limits_early - Apply topology command line limits early
+ *
+ * Ensure that command line limits are in effect before firmware parsing
+ * takes place.
+ */
+void __init topology_apply_cmdline_limits_early(void)
+{
+ unsigned int possible = nr_cpu_ids;
+
+ /* 'maxcpus=0' 'nosmp' 'nolapic' */
+ if (!setup_max_cpus || apic_is_disabled)
+ possible = 1;
+
+ /* 'possible_cpus=N' */
+ possible = min_t(unsigned int, max_possible_cpus, possible);
+
+ if (possible < nr_cpu_ids) {
+ pr_info("Limiting to %u possible CPUs\n", possible);
+ set_nr_cpu_ids(possible);
+ }
+}
+
+static __init bool restrict_to_up(void)
+{
+ if (!smp_found_config)
+ return true;
+ /*
+ * XEN PV is special as it does not advertise the local APIC
+ * properly, but provides a fake topology for it so that the
+ * infrastructure works. So don't apply the restrictions vs. APIC
+ * here.
+ */
+ if (xen_pv_domain())
+ return false;
+
+ return apic_is_disabled;
+}
+
+void __init topology_init_possible_cpus(void)
+{
+ unsigned int assigned = topo_info.nr_assigned_cpus;
+ unsigned int disabled = topo_info.nr_disabled_cpus;
+ unsigned int cnta, cntb, cpu, allowed = 1;
+ unsigned int total = assigned + disabled;
+ u32 apicid, firstid;
+
+ /*
+ * If there was no APIC registered, then fake one so that the
+ * topology bitmap is populated. That ensures that the code below
+ * is valid and the various query interfaces can be used
+ * unconditionally. This does not affect the actual APIC code in
+ * any way because either the local APIC address has not been
+ * registered or the local APIC was disabled on the command line.
+ */
+ if (topo_info.boot_cpu_apic_id == BAD_APICID)
+ topology_register_boot_apic(0);
+
+ if (!restrict_to_up()) {
+ if (WARN_ON_ONCE(assigned > nr_cpu_ids)) {
+ disabled += assigned - nr_cpu_ids;
+ assigned = nr_cpu_ids;
+ }
+ allowed = min_t(unsigned int, total, nr_cpu_ids);
+ }
+
+ if (total > allowed)
+ pr_warn("%u possible CPUs exceed the limit of %u\n", total, allowed);
+
+ assigned = min_t(unsigned int, allowed, assigned);
+ disabled = allowed - assigned;
+
+ topo_info.nr_assigned_cpus = assigned;
+ topo_info.nr_disabled_cpus = disabled;
+
+ total_cpus = allowed;
+ set_nr_cpu_ids(allowed);
+
+ cnta = domain_weight(TOPO_PKG_DOMAIN);
+ cntb = domain_weight(TOPO_DIE_DOMAIN);
+ __max_logical_packages = cnta;
+ __max_dies_per_package = 1U << (get_count_order(cntb) - get_count_order(cnta));
+
+ pr_info("Max. logical packages: %3u\n", cnta);
+ pr_info("Max. logical dies: %3u\n", cntb);
+ pr_info("Max. dies per package: %3u\n", __max_dies_per_package);
+
+ cnta = domain_weight(TOPO_CORE_DOMAIN);
+ cntb = domain_weight(TOPO_SMT_DOMAIN);
+ /*
+ * Can't use the order delta here as order(cnta) can be equal to
+ * order(cntb) even if cnta != cntb.
+ */
+ __max_threads_per_core = DIV_ROUND_UP(cntb, cnta);
+ pr_info("Max. threads per core: %3u\n", __max_threads_per_core);
+
+ firstid = find_first_bit(apic_maps[TOPO_SMT_DOMAIN].map, MAX_LOCAL_APIC);
+ __num_cores_per_package = topology_unit_count(firstid, TOPO_CORE_DOMAIN, TOPO_PKG_DOMAIN);
+ pr_info("Num. cores per package: %3u\n", __num_cores_per_package);
+ __num_threads_per_package = topology_unit_count(firstid, TOPO_SMT_DOMAIN, TOPO_PKG_DOMAIN);
+ pr_info("Num. threads per package: %3u\n", __num_threads_per_package);
+
+ pr_info("Allowing %u present CPUs plus %u hotplug CPUs\n", assigned, disabled);
+ if (topo_info.nr_rejected_cpus)
+ pr_info("Rejected CPUs %u\n", topo_info.nr_rejected_cpus);
+
+ init_cpu_present(cpumask_of(0));
+ init_cpu_possible(cpumask_of(0));
+
+ /* Assign CPU numbers to non-present CPUs */
+ for (apicid = 0; disabled; disabled--, apicid++) {
+ apicid = find_next_andnot_bit(apic_maps[TOPO_SMT_DOMAIN].map, phys_cpu_present_map,
+ MAX_LOCAL_APIC, apicid);
+ if (apicid >= MAX_LOCAL_APIC)
+ break;
+ cpuid_to_apicid[topo_info.nr_assigned_cpus++] = apicid;
+ }
+
+ for (cpu = 0; cpu < allowed; cpu++) {
+ apicid = cpuid_to_apicid[cpu];
+
+ set_cpu_possible(cpu, true);
+
+ if (apicid == BAD_APICID)
+ continue;
+
+ cpu_mark_primary_thread(cpu, apicid);
+ set_cpu_present(cpu, test_bit(apicid, phys_cpu_present_map));
+ }
+}
+
+/*
+ * Late SMP disable after sizing CPU masks when APIC/IOAPIC setup failed.
+ */
+void __init topology_reset_possible_cpus_up(void)
+{
+ init_cpu_present(cpumask_of(0));
+ init_cpu_possible(cpumask_of(0));
+
+ bitmap_zero(phys_cpu_present_map, MAX_LOCAL_APIC);
+ if (topo_info.boot_cpu_apic_id != BAD_APICID)
+ set_bit(topo_info.boot_cpu_apic_id, phys_cpu_present_map);
+}
+
+static int __init setup_possible_cpus(char *str)
+{
+ get_option(&str, &max_possible_cpus);
+ return 0;
+}
+early_param("possible_cpus", setup_possible_cpus);
+#endif
diff --git a/arch/x86/kernel/cpu/topology.h b/arch/x86/kernel/cpu/topology.h
new file mode 100644
index 000000000000..37326297f80c
--- /dev/null
+++ b/arch/x86/kernel/cpu/topology.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ARCH_X86_TOPOLOGY_H
+#define ARCH_X86_TOPOLOGY_H
+
+struct topo_scan {
+ struct cpuinfo_x86 *c;
+ unsigned int dom_shifts[TOPO_MAX_DOMAIN];
+ unsigned int dom_ncpus[TOPO_MAX_DOMAIN];
+
+ /* Legacy CPUID[1]:EBX[23:16] number of logical processors */
+ unsigned int ebx1_nproc_shift;
+
+ /* AMD specific node ID which cannot be mapped into APIC space. */
+ u16 amd_nodes_per_pkg;
+ u16 amd_node_id;
+};
+
+void cpu_init_topology(struct cpuinfo_x86 *c);
+void cpu_parse_topology(struct cpuinfo_x86 *c);
+void topology_set_dom(struct topo_scan *tscan, enum x86_topology_domains dom,
+ unsigned int shift, unsigned int ncpus);
+bool cpu_parse_topology_ext(struct topo_scan *tscan);
+void cpu_parse_topology_amd(struct topo_scan *tscan);
+void cpu_topology_fixup_amd(struct topo_scan *tscan);
+
+static inline u32 topo_shift_apicid(u32 apicid, enum x86_topology_domains dom)
+{
+ if (dom == TOPO_SMT_DOMAIN)
+ return apicid;
+ return apicid >> x86_topo_system.dom_shifts[dom - 1];
+}
+
+static inline u32 topo_relative_domain_id(u32 apicid, enum x86_topology_domains dom)
+{
+ if (dom != TOPO_SMT_DOMAIN)
+ apicid >>= x86_topo_system.dom_shifts[dom - 1];
+ return apicid & (x86_topo_system.dom_size[dom] - 1);
+}
+
+static inline u32 topo_domain_mask(enum x86_topology_domains dom)
+{
+ return (1U << x86_topo_system.dom_shifts[dom]) - 1;
+}
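+
+/*
+ * Illustrative example, assuming an SMT shift of 1 and a core shift of 4:
+ * topo_shift_apicid(0x2b, TOPO_CORE_DOMAIN) yields 0x15,
+ * topo_relative_domain_id(0x2b, TOPO_CORE_DOMAIN) yields 5, and
+ * topo_domain_mask(TOPO_SMT_DOMAIN) yields 0x1.
+ */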
+
+/*
+ * Update a domain level after the fact without propagating. Used to fix up
+ * broken CPUID enumerations.
+ */
+static inline void topology_update_dom(struct topo_scan *tscan, enum x86_topology_domains dom,
+ unsigned int shift, unsigned int ncpus)
+{
+ tscan->dom_shifts[dom] = shift;
+ tscan->dom_ncpus[dom] = ncpus;
+}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+unsigned int topology_unit_count(u32 apicid, enum x86_topology_domains which_units,
+ enum x86_topology_domains at_level);
+#else
+static inline unsigned int topology_unit_count(u32 apicid, enum x86_topology_domains which_units,
+ enum x86_topology_domains at_level)
+{
+ return 1;
+}
+#endif
+
+#endif /* ARCH_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/cpu/topology_amd.c b/arch/x86/kernel/cpu/topology_amd.c
new file mode 100644
index 000000000000..6ac097e13106
--- /dev/null
+++ b/arch/x86/kernel/cpu/topology_amd.c
@@ -0,0 +1,227 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpu.h>
+
+#include <asm/apic.h>
+#include <asm/memtype.h>
+#include <asm/msr.h>
+#include <asm/processor.h>
+
+#include "cpu.h"
+
+static bool parse_8000_0008(struct topo_scan *tscan)
+{
+ struct {
+ // ecx
+ u32 cpu_nthreads : 8, // Number of physical threads - 1
+ : 4, // Reserved
+ apicid_coreid_len : 4, // Number of thread core ID bits (shift) in APIC ID
+ perf_tsc_len : 2, // Performance time-stamp counter size
+ : 14; // Reserved
+ } ecx;
+ unsigned int sft;
+
+ if (tscan->c->extended_cpuid_level < 0x80000008)
+ return false;
+
+ cpuid_leaf_reg(0x80000008, CPUID_ECX, &ecx);
+
+ /* If the core ID bits are 0, then derive the shift from ecx.cpu_nthreads */
+ sft = ecx.apicid_coreid_len;
+ if (!sft)
+ sft = get_count_order(ecx.cpu_nthreads + 1);
+
+ /*
+ * cpu_nthreads describes the number of threads in the package;
+ * sft is the number of APIC ID bits per package.
+ *
+ * As the number of actual threads per core is not described in
+ * this leaf, just set the CORE domain shift and let the later
+ * parsers set SMT shift. Assume one thread per core by default
+ * which is correct if there are no other CPUID leafs to parse.
+ */
+ topology_update_dom(tscan, TOPO_SMT_DOMAIN, 0, 1);
+ topology_set_dom(tscan, TOPO_CORE_DOMAIN, sft, ecx.cpu_nthreads + 1);
+ return true;
+}
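+
+/*
+ * Illustrative example: for a part reporting cpu_nthreads = 15 and
+ * apicid_coreid_len = 0, the shift is derived as get_count_order(16) = 4,
+ * i.e. the 16 CPUs of the package are packed into 4 APIC ID bits.
+ */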
+
+static void store_node(struct topo_scan *tscan, u16 nr_nodes, u16 node_id)
+{
+ /*
+ * Starting with Fam 17h the DIE domain could probably be used to
+ * retrieve the node info on AMD/HYGON. Analysis of CPUID dumps
+ * suggests it's the topmost bit(s) of the CPU cores area, but
+ * that's guesswork and neither enumerated nor documented.
+ *
+ * Up to Fam 16h this does not work at all and the legacy node ID
+ * has to be used.
+ */
+ tscan->amd_nodes_per_pkg = nr_nodes;
+ tscan->amd_node_id = node_id;
+}
+
+static bool parse_8000_001e(struct topo_scan *tscan)
+{
+ struct {
+ // eax
+ u32 ext_apic_id : 32; // Extended APIC ID
+ // ebx
+ u32 core_id : 8, // Unique per-socket logical core unit ID
+ core_nthreads : 8, // #Threads per core (zero-based)
+ : 16; // Reserved
+ // ecx
+ u32 node_id : 8, // Node (die) ID of invoking logical CPU
+ nnodes_per_socket : 3, // #nodes in invoking logical CPU's package/socket
+ : 21; // Reserved
+ // edx
+ u32 : 32; // Reserved
+ } leaf;
+
+ if (!boot_cpu_has(X86_FEATURE_TOPOEXT))
+ return false;
+
+ cpuid_leaf(0x8000001e, &leaf);
+
+ /*
+ * If leaf 0xb/0x26 is available, then the APIC ID and the domain
+ * shifts are set already.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_XTOPOLOGY)) {
+ tscan->c->topo.initial_apicid = leaf.ext_apic_id;
+
+ /*
+ * Leaf 0x80000008 sets the CORE domain shift but not the
+ * SMT domain shift. On CPUs with family >= 0x17, there
+ * might be hyperthreads.
+ */
+ if (tscan->c->x86 >= 0x17) {
+ /* Update the SMT domain, but do not propagate it. */
+ unsigned int nthreads = leaf.core_nthreads + 1;
+
+ topology_update_dom(tscan, TOPO_SMT_DOMAIN,
+ get_count_order(nthreads), nthreads);
+ }
+ }
+
+ store_node(tscan, leaf.nnodes_per_socket + 1, leaf.node_id);
+
+ if (tscan->c->x86_vendor == X86_VENDOR_AMD) {
+ if (tscan->c->x86 == 0x15)
+ tscan->c->topo.cu_id = leaf.core_id;
+
+ cacheinfo_amd_init_llc_id(tscan->c, leaf.node_id);
+ } else {
+ /*
+ * Package ID is ApicId[6..] on certain Hygon CPUs. See
+ * commit e0ceeae708ce for explanation. The topology info
+ * is screwed up: The package shift is always 6 and the
+ * node ID is in bits [4:5].
+ */
+ if (!boot_cpu_has(X86_FEATURE_HYPERVISOR) && tscan->c->x86_model <= 0x3) {
+ topology_set_dom(tscan, TOPO_CORE_DOMAIN, 6,
+ tscan->dom_ncpus[TOPO_CORE_DOMAIN]);
+ }
+ cacheinfo_hygon_init_llc_id(tscan->c);
+ }
+ return true;
+}
+
+static void parse_fam10h_node_id(struct topo_scan *tscan)
+{
+ union {
+ struct {
+ u64 node_id : 3,
+ nodes_per_pkg : 3,
+ unused : 58;
+ };
+ u64 msr;
+ } nid;
+
+ if (!boot_cpu_has(X86_FEATURE_NODEID_MSR))
+ return;
+
+ rdmsrq(MSR_FAM10H_NODE_ID, nid.msr);
+ store_node(tscan, nid.nodes_per_pkg + 1, nid.node_id);
+ tscan->c->topo.llc_id = nid.node_id;
+}
+
+static void legacy_set_llc(struct topo_scan *tscan)
+{
+ unsigned int apicid = tscan->c->topo.initial_apicid;
+
+ /* If none of the parsers set LLC ID then use the die ID for it. */
+ if (tscan->c->topo.llc_id == BAD_APICID)
+ tscan->c->topo.llc_id = apicid >> tscan->dom_shifts[TOPO_CORE_DOMAIN];
+}
+
+static void topoext_fixup(struct topo_scan *tscan)
+{
+ struct cpuinfo_x86 *c = tscan->c;
+ u64 msrval;
+
+ /* Try to re-enable TopologyExtensions if switched off by BIOS */
+ if (cpu_has(c, X86_FEATURE_TOPOEXT) || c->x86_vendor != X86_VENDOR_AMD ||
+ c->x86 != 0x15 || c->x86_model < 0x10 || c->x86_model > 0x6f)
+ return;
+
+ if (msr_set_bit(MSR_AMD64_CPUID_EXT_FEAT,
+ MSR_AMD64_CPUID_EXT_FEAT_TOPOEXT_BIT) <= 0)
+ return;
+
+ rdmsrq(MSR_AMD64_CPUID_EXT_FEAT, msrval);
+ if (msrval & MSR_AMD64_CPUID_EXT_FEAT_TOPOEXT) {
+ set_cpu_cap(c, X86_FEATURE_TOPOEXT);
+ pr_info_once(FW_INFO "CPU: Re-enabling disabled Topology Extensions Support.\n");
+ }
+}
+
+static void parse_topology_amd(struct topo_scan *tscan)
+{
+ if (cpu_feature_enabled(X86_FEATURE_AMD_HTR_CORES))
+ tscan->c->topo.cpu_type = cpuid_ebx(0x80000026);
+
+ /*
+ * Try to get SMT, CORE, TILE, and DIE shifts from extended
+ * CPUID leaf 0x8000_0026 on supported processors first. If
+ * extended CPUID leaf 0x8000_0026 is not supported, try to
+ * get SMT and CORE shift from leaf 0xb. If either leaf is
+ * available, cpu_parse_topology_ext() will return true.
+ *
+ * If XTOPOLOGY leaves (0x26/0xb) are not available, try to
+ * get the CORE shift from leaf 0x8000_0008 first.
+ */
+ if (!cpu_parse_topology_ext(tscan) && !parse_8000_0008(tscan))
+ return;
+
+ /*
+ * Prefer leaf 0x8000001e if available to get the SMT shift and
+ * the initial APIC ID if XTOPOLOGY leaves are not available.
+ */
+ if (parse_8000_001e(tscan))
+ return;
+
+ /* Try the NODEID MSR */
+ parse_fam10h_node_id(tscan);
+}
+
+void cpu_parse_topology_amd(struct topo_scan *tscan)
+{
+ tscan->amd_nodes_per_pkg = 1;
+ topoext_fixup(tscan);
+ parse_topology_amd(tscan);
+ legacy_set_llc(tscan);
+
+ if (tscan->amd_nodes_per_pkg > 1)
+ set_cpu_cap(tscan->c, X86_FEATURE_AMD_DCM);
+}
+
+void cpu_topology_fixup_amd(struct topo_scan *tscan)
+{
+ struct cpuinfo_x86 *c = tscan->c;
+
+ /*
+ * Adjust the core_id relative to the node when there is more than
+ * one node.
+ */
+ if (tscan->c->x86 < 0x17 && tscan->amd_nodes_per_pkg > 1)
+ c->topo.core_id %= tscan->dom_ncpus[TOPO_CORE_DOMAIN] / tscan->amd_nodes_per_pkg;
+}
diff --git a/arch/x86/kernel/cpu/topology_common.c b/arch/x86/kernel/cpu/topology_common.c
new file mode 100644
index 000000000000..71625795d711
--- /dev/null
+++ b/arch/x86/kernel/cpu/topology_common.c
@@ -0,0 +1,258 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpu.h>
+
+#include <xen/xen.h>
+
+#include <asm/intel-family.h>
+#include <asm/apic.h>
+#include <asm/processor.h>
+#include <asm/smp.h>
+
+#include "cpu.h"
+
+struct x86_topology_system x86_topo_system __ro_after_init;
+EXPORT_SYMBOL_GPL(x86_topo_system);
+
+unsigned int __amd_nodes_per_pkg __ro_after_init;
+EXPORT_SYMBOL_GPL(__amd_nodes_per_pkg);
+
+/* CPUs which are the primary SMT threads */
+struct cpumask __cpu_primary_thread_mask __read_mostly;
+
+void topology_set_dom(struct topo_scan *tscan, enum x86_topology_domains dom,
+ unsigned int shift, unsigned int ncpus)
+{
+ topology_update_dom(tscan, dom, shift, ncpus);
+
+ /* Propagate to the upper levels */
+ for (dom++; dom < TOPO_MAX_DOMAIN; dom++) {
+ tscan->dom_shifts[dom] = tscan->dom_shifts[dom - 1];
+ tscan->dom_ncpus[dom] = tscan->dom_ncpus[dom - 1];
+ }
+}
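+
+/*
+ * Illustrative example: topology_set_dom(tscan, TOPO_CORE_DOMAIN, 4, 8)
+ * stores shift 4 / ncpus 8 at the CORE level and propagates the same
+ * values to MODULE, TILE, DIE, DIEGRP and PKG, so levels which are never
+ * enumerated inherit the values of the last enumerated level below them.
+ */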
+
+enum x86_topology_cpu_type get_topology_cpu_type(struct cpuinfo_x86 *c)
+{
+ if (c->x86_vendor == X86_VENDOR_INTEL) {
+ switch (c->topo.intel_type) {
+ case INTEL_CPU_TYPE_ATOM: return TOPO_CPU_TYPE_EFFICIENCY;
+ case INTEL_CPU_TYPE_CORE: return TOPO_CPU_TYPE_PERFORMANCE;
+ }
+ }
+ if (c->x86_vendor == X86_VENDOR_AMD) {
+ switch (c->topo.amd_type) {
+ case 0: return TOPO_CPU_TYPE_PERFORMANCE;
+ case 1: return TOPO_CPU_TYPE_EFFICIENCY;
+ }
+ }
+
+ return TOPO_CPU_TYPE_UNKNOWN;
+}
+
+const char *get_topology_cpu_type_name(struct cpuinfo_x86 *c)
+{
+ switch (get_topology_cpu_type(c)) {
+ case TOPO_CPU_TYPE_PERFORMANCE:
+ return "performance";
+ case TOPO_CPU_TYPE_EFFICIENCY:
+ return "efficiency";
+ default:
+ return "unknown";
+ }
+}
+
+static unsigned int __maybe_unused parse_num_cores_legacy(struct cpuinfo_x86 *c)
+{
+ struct {
+ u32 cache_type : 5,
+ unused : 21,
+ ncores : 6;
+ } eax;
+
+ if (c->cpuid_level < 4)
+ return 1;
+
+ cpuid_subleaf_reg(4, 0, CPUID_EAX, &eax);
+ if (!eax.cache_type)
+ return 1;
+
+ return eax.ncores + 1;
+}
+
+static void parse_legacy(struct topo_scan *tscan)
+{
+ unsigned int cores, core_shift, smt_shift = 0;
+ struct cpuinfo_x86 *c = tscan->c;
+
+ cores = parse_num_cores_legacy(c);
+ core_shift = get_count_order(cores);
+
+ if (cpu_has(c, X86_FEATURE_HT)) {
+ if (!WARN_ON_ONCE(tscan->ebx1_nproc_shift < core_shift))
+ smt_shift = tscan->ebx1_nproc_shift - core_shift;
+ /*
+ * The parser expects leaf 0xb/0x1f format, which means
+ * the number of logical processors at core level is
+ * counting threads.
+ */
+ core_shift += smt_shift;
+ cores <<= smt_shift;
+ }
+
+ topology_set_dom(tscan, TOPO_SMT_DOMAIN, smt_shift, 1U << smt_shift);
+ topology_set_dom(tscan, TOPO_CORE_DOMAIN, core_shift, cores);
+}
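+
+/*
+ * Illustrative example: with CPUID leaf 4 reporting 4 cores and leaf 1
+ * reporting 8 logical processors (ebx1_nproc_shift = 3), the SMT shift
+ * becomes 3 - 2 = 1, so the core level is recorded as shift 3 with
+ * 4 << 1 = 8 threads, matching the leaf 0xb/0x1f convention.
+ */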
+
+static bool fake_topology(struct topo_scan *tscan)
+{
+ /*
+ * Preset the CORE level shift for CPUID-less systems and XEN_PV,
+ * which has useless CPUID information.
+ */
+ topology_set_dom(tscan, TOPO_SMT_DOMAIN, 0, 1);
+ topology_set_dom(tscan, TOPO_CORE_DOMAIN, 0, 1);
+
+ return tscan->c->cpuid_level < 1;
+}
+
+static void parse_topology(struct topo_scan *tscan, bool early)
+{
+ const struct cpuinfo_topology topo_defaults = {
+ .cu_id = 0xff,
+ .llc_id = BAD_APICID,
+ .l2c_id = BAD_APICID,
+ .cpu_type = TOPO_CPU_TYPE_UNKNOWN,
+ };
+ struct cpuinfo_x86 *c = tscan->c;
+ struct {
+ u32 unused0 : 16,
+ nproc : 8,
+ apicid : 8;
+ } ebx;
+
+ c->topo = topo_defaults;
+
+ if (fake_topology(tscan))
+ return;
+
+ /* Preset Initial APIC ID from CPUID leaf 1 */
+ cpuid_leaf_reg(1, CPUID_EBX, &ebx);
+ c->topo.initial_apicid = ebx.apicid;
+
+ /*
+ * The initial invocation from early_identify_cpu() happens before
+ * the APIC is mapped or X2APIC enabled. For establishing the
+ * topology, that's not required. Use the initial APIC ID.
+ */
+ if (early)
+ c->topo.apicid = c->topo.initial_apicid;
+ else
+ c->topo.apicid = read_apic_id();
+
+ /* The above is sufficient for UP */
+ if (!IS_ENABLED(CONFIG_SMP))
+ return;
+
+ tscan->ebx1_nproc_shift = get_count_order(ebx.nproc);
+
+ switch (c->x86_vendor) {
+ case X86_VENDOR_AMD:
+ if (IS_ENABLED(CONFIG_CPU_SUP_AMD))
+ cpu_parse_topology_amd(tscan);
+ break;
+ case X86_VENDOR_CENTAUR:
+ case X86_VENDOR_ZHAOXIN:
+ parse_legacy(tscan);
+ break;
+ case X86_VENDOR_INTEL:
+ if (!IS_ENABLED(CONFIG_CPU_SUP_INTEL) || !cpu_parse_topology_ext(tscan))
+ parse_legacy(tscan);
+ if (c->cpuid_level >= 0x1a)
+ c->topo.cpu_type = cpuid_eax(0x1a);
+ break;
+ case X86_VENDOR_HYGON:
+ if (IS_ENABLED(CONFIG_CPU_SUP_HYGON))
+ cpu_parse_topology_amd(tscan);
+ break;
+ }
+}
+
+static void topo_set_ids(struct topo_scan *tscan, bool early)
+{
+ struct cpuinfo_x86 *c = tscan->c;
+ u32 apicid = c->topo.apicid;
+
+ c->topo.pkg_id = topo_shift_apicid(apicid, TOPO_PKG_DOMAIN);
+ c->topo.die_id = topo_shift_apicid(apicid, TOPO_DIE_DOMAIN);
+
+ if (!early) {
+ c->topo.logical_pkg_id = topology_get_logical_id(apicid, TOPO_PKG_DOMAIN);
+ c->topo.logical_die_id = topology_get_logical_id(apicid, TOPO_DIE_DOMAIN);
+ c->topo.logical_core_id = topology_get_logical_id(apicid, TOPO_CORE_DOMAIN);
+ }
+
+ /* Package relative core ID */
+ c->topo.core_id = (apicid & topo_domain_mask(TOPO_PKG_DOMAIN)) >>
+ x86_topo_system.dom_shifts[TOPO_SMT_DOMAIN];
+
+ c->topo.amd_node_id = tscan->amd_node_id;
+
+ if (c->x86_vendor == X86_VENDOR_AMD)
+ cpu_topology_fixup_amd(tscan);
+}
+
+void cpu_parse_topology(struct cpuinfo_x86 *c)
+{
+ unsigned int dom, cpu = smp_processor_id();
+ struct topo_scan tscan = { .c = c, };
+
+ parse_topology(&tscan, false);
+
+ if (IS_ENABLED(CONFIG_X86_LOCAL_APIC)) {
+ if (c->topo.initial_apicid != c->topo.apicid) {
+ pr_err(FW_BUG "CPU%4u: APIC ID mismatch. CPUID: 0x%04x APIC: 0x%04x\n",
+ cpu, c->topo.initial_apicid, c->topo.apicid);
+ }
+
+ if (c->topo.apicid != cpuid_to_apicid[cpu]) {
+ pr_err(FW_BUG "CPU%4u: APIC ID mismatch. Firmware: 0x%04x APIC: 0x%04x\n",
+ cpu, cpuid_to_apicid[cpu], c->topo.apicid);
+ }
+ }
+
+ for (dom = TOPO_SMT_DOMAIN; dom < TOPO_MAX_DOMAIN; dom++) {
+ if (tscan.dom_shifts[dom] == x86_topo_system.dom_shifts[dom])
+ continue;
+ pr_err(FW_BUG "CPU%d: Topology domain %u shift %u != %u\n", cpu, dom,
+ tscan.dom_shifts[dom], x86_topo_system.dom_shifts[dom]);
+ }
+
+ topo_set_ids(&tscan, false);
+}
+
+void __init cpu_init_topology(struct cpuinfo_x86 *c)
+{
+ struct topo_scan tscan = { .c = c, };
+ unsigned int dom, sft;
+
+ parse_topology(&tscan, true);
+
+ /* Copy the shift values and calculate the unit sizes. */
+ memcpy(x86_topo_system.dom_shifts, tscan.dom_shifts, sizeof(x86_topo_system.dom_shifts));
+
+ dom = TOPO_SMT_DOMAIN;
+ x86_topo_system.dom_size[dom] = 1U << x86_topo_system.dom_shifts[dom];
+
+ for (dom++; dom < TOPO_MAX_DOMAIN; dom++) {
+ sft = x86_topo_system.dom_shifts[dom] - x86_topo_system.dom_shifts[dom - 1];
+ x86_topo_system.dom_size[dom] = 1U << sft;
+ }
+
+ topo_set_ids(&tscan, true);
+
+ /*
+ * AMD systems have Nodes per package which cannot be mapped to
+ * APIC ID.
+ */
+ __amd_nodes_per_pkg = tscan.amd_nodes_per_pkg;
+}
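
The sizes computed above are relative: each domain's dom_size is two to the power of its shift minus the next lower domain's shift. A minimal sketch with invented shifts, assuming the intermediate domains collapse into the package level as the propagation loop in topology_set_dom() arranges:

#include <stdio.h>

int main(void)
{
	/* Invented cumulative shifts: SMT ends at bit 1, CORE at bit 4 */
	unsigned int shifts[] = { 1, 4 };
	unsigned int size[2];

	size[0] = 1u << shifts[0];			/* threads per core: 2 */
	size[1] = 1u << (shifts[1] - shifts[0]);	/* cores per package: 8 */

	printf("threads/core=%u cores/pkg=%u\n", size[0], size[1]);
	return 0;
}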
diff --git a/arch/x86/kernel/cpu/topology_ext.c b/arch/x86/kernel/cpu/topology_ext.c
new file mode 100644
index 000000000000..467b0326bf1a
--- /dev/null
+++ b/arch/x86/kernel/cpu/topology_ext.c
@@ -0,0 +1,145 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpu.h>
+
+#include <asm/apic.h>
+#include <asm/memtype.h>
+#include <asm/processor.h>
+
+#include "cpu.h"
+
+enum topo_types {
+ INVALID_TYPE = 0,
+ SMT_TYPE = 1,
+ CORE_TYPE = 2,
+ MAX_TYPE_0B = 3,
+ MODULE_TYPE = 3,
+ AMD_CCD_TYPE = 3,
+ TILE_TYPE = 4,
+ AMD_SOCKET_TYPE = 4,
+ MAX_TYPE_80000026 = 5,
+ DIE_TYPE = 5,
+ DIEGRP_TYPE = 6,
+ MAX_TYPE_1F = 7,
+};
+
+/*
+ * Use a lookup table for the case that there are future types > 6 which
+ * describe an intermediate domain level which does not exist today.
+ */
+static const unsigned int topo_domain_map_0b_1f[MAX_TYPE_1F] = {
+ [SMT_TYPE] = TOPO_SMT_DOMAIN,
+ [CORE_TYPE] = TOPO_CORE_DOMAIN,
+ [MODULE_TYPE] = TOPO_MODULE_DOMAIN,
+ [TILE_TYPE] = TOPO_TILE_DOMAIN,
+ [DIE_TYPE] = TOPO_DIE_DOMAIN,
+ [DIEGRP_TYPE] = TOPO_DIEGRP_DOMAIN,
+};
+
+static const unsigned int topo_domain_map_80000026[MAX_TYPE_80000026] = {
+ [SMT_TYPE] = TOPO_SMT_DOMAIN,
+ [CORE_TYPE] = TOPO_CORE_DOMAIN,
+ [AMD_CCD_TYPE] = TOPO_TILE_DOMAIN,
+ [AMD_SOCKET_TYPE] = TOPO_DIE_DOMAIN,
+};
+
+static inline bool topo_subleaf(struct topo_scan *tscan, u32 leaf, u32 subleaf,
+ unsigned int *last_dom)
+{
+ unsigned int dom, maxtype;
+ const unsigned int *map;
+ struct {
+ // eax
+ u32 x2apic_shift : 5, // Number of bits to shift APIC ID right
+ // for the topology ID at the next level
+ : 27; // Reserved
+ // ebx
+ u32 num_processors : 16, // Number of processors at current level
+ : 16; // Reserved
+ // ecx
+ u32 level : 8, // Current topology level. Same as sub leaf number
+ type : 8, // Level type. If 0, invalid
+ : 16; // Reserved
+ // edx
+ u32 x2apic_id : 32; // X2APIC ID of the current logical processor
+ } sl;
+
+ switch (leaf) {
+ case 0x0b: maxtype = MAX_TYPE_0B; map = topo_domain_map_0b_1f; break;
+ case 0x1f: maxtype = MAX_TYPE_1F; map = topo_domain_map_0b_1f; break;
+ case 0x80000026: maxtype = MAX_TYPE_80000026; map = topo_domain_map_80000026; break;
+ default: return false;
+ }
+
+ cpuid_subleaf(leaf, subleaf, &sl);
+
+ if (!sl.num_processors || sl.type == INVALID_TYPE)
+ return false;
+
+ if (sl.type >= maxtype) {
+ pr_err_once("Topology: leaf 0x%x:%d Unknown domain type %u\n",
+ leaf, subleaf, sl.type);
+ /*
+ * It really would have been too obvious to make the domain
+ * type space sparse and leave a few reserved types between
+ * the points which might change instead of following the
+ * usual "this can be fixed in software" principle.
+ */
+ dom = *last_dom + 1;
+ } else {
+ dom = map[sl.type];
+ *last_dom = dom;
+ }
+
+ if (!dom) {
+ tscan->c->topo.initial_apicid = sl.x2apic_id;
+ } else if (tscan->c->topo.initial_apicid != sl.x2apic_id) {
+ pr_warn_once(FW_BUG "CPUID leaf 0x%x subleaf %d APIC ID mismatch %x != %x\n",
+ leaf, subleaf, tscan->c->topo.initial_apicid, sl.x2apic_id);
+ }
+
+ topology_set_dom(tscan, dom, sl.x2apic_shift, sl.num_processors);
+ return true;
+}
+
+static bool parse_topology_leaf(struct topo_scan *tscan, u32 leaf)
+{
+ unsigned int last_dom;
+ u32 subleaf;
+
+ /* Read all available subleafs and populate the levels */
+ for (subleaf = 0, last_dom = 0; topo_subleaf(tscan, leaf, subleaf, &last_dom); subleaf++);
+
+ /* If subleaf 0 failed to parse, give up */
+ if (!subleaf)
+ return false;
+
+ /*
+ * There are machines in the wild which have shift 0 in subleaf 0,
+ * but advertise 2 logical processors at that level. They are truly
+ * SMT.
+ */
+ if (!tscan->dom_shifts[TOPO_SMT_DOMAIN] && tscan->dom_ncpus[TOPO_SMT_DOMAIN] > 1) {
+ unsigned int sft = get_count_order(tscan->dom_ncpus[TOPO_SMT_DOMAIN]);
+
+ pr_warn_once(FW_BUG "CPUID leaf 0x%x subleaf 0 has shift level 0 but %u CPUs. Fixing it up.\n",
+ leaf, tscan->dom_ncpus[TOPO_SMT_DOMAIN]);
+ topology_update_dom(tscan, TOPO_SMT_DOMAIN, sft, tscan->dom_ncpus[TOPO_SMT_DOMAIN]);
+ }
+
+ set_cpu_cap(tscan->c, X86_FEATURE_XTOPOLOGY);
+ return true;
+}
+
+bool cpu_parse_topology_ext(struct topo_scan *tscan)
+{
+ /* Intel: Try leaf 0x1F first. */
+ if (tscan->c->cpuid_level >= 0x1f && parse_topology_leaf(tscan, 0x1f))
+ return true;
+
+ /* AMD: Try leaf 0x80000026 first. */
+ if (tscan->c->extended_cpuid_level >= 0x80000026 && parse_topology_leaf(tscan, 0x80000026))
+ return true;
+
+ /* Intel/AMD: Fall back to leaf 0xB if available */
+ return tscan->c->cpuid_level >= 0x0b && parse_topology_leaf(tscan, 0x0b);
+}
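
The same enumeration can be observed from userspace with the compiler's CPUID intrinsics. A hedged sketch mirroring the field layout of the struct above (type in ECX[15:8], count in EBX[15:0], shift in EAX[4:0]); this is an illustration, not the kernel code path, and it bails out on CPUs without leaf 0x1F:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx, subleaf;

	if (__get_cpuid_max(0, NULL) < 0x1f) {
		puts("CPUID leaf 0x1f not supported here");
		return 1;
	}

	for (subleaf = 0; ; subleaf++) {
		__cpuid_count(0x1f, subleaf, eax, ebx, ecx, edx);

		unsigned int type  = (ecx >> 8) & 0xff;	/* 0 = invalid */
		unsigned int nproc = ebx & 0xffff;
		unsigned int shift = eax & 0x1f;

		if (!nproc || !type)
			break;
		printf("subleaf %u: type %u shift %u nproc %u x2apic_id 0x%x\n",
		       subleaf, type, shift, nproc, edx);
	}
	return 0;
}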
diff --git a/arch/x86/kernel/cpu/transmeta.c b/arch/x86/kernel/cpu/transmeta.c
index 28000743bbb0..42c939827621 100644
--- a/arch/x86/kernel/cpu/transmeta.c
+++ b/arch/x86/kernel/cpu/transmeta.c
@@ -1,11 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/clock.h>
#include <linux/mm.h>
-#include <linux/init.h>
-#include <asm/processor.h>
+#include <asm/cpufeature.h>
#include <asm/msr.h>
#include "cpu.h"
-static void __cpuinit early_init_transmeta(struct cpuinfo_x86 *c)
+static void early_init_transmeta(struct cpuinfo_x86 *c)
{
u32 xlvl;
@@ -13,11 +15,11 @@ static void __cpuinit early_init_transmeta(struct cpuinfo_x86 *c)
xlvl = cpuid_eax(0x80860000);
if ((xlvl & 0xffff0000) == 0x80860000) {
if (xlvl >= 0x80860001)
- c->x86_capability[2] = cpuid_edx(0x80860001);
+ c->x86_capability[CPUID_8086_0001_EDX] = cpuid_edx(0x80860001);
}
}
-static void __cpuinit init_transmeta(struct cpuinfo_x86 *c)
+static void init_transmeta(struct cpuinfo_x86 *c)
{
unsigned int cap_mask, uk, max, dummy;
unsigned int cms_rev1, cms_rev2;
@@ -34,7 +36,7 @@ static void __cpuinit init_transmeta(struct cpuinfo_x86 *c)
if (max >= 0x80860001) {
cpuid(0x80860001, &dummy, &cpu_rev, &cpu_freq, &cpu_flags);
if (cpu_rev != 0x02000000) {
- printk(KERN_INFO "CPU: Processor revision %u.%u.%u.%u, %u MHz\n",
+ pr_info("CPU: Processor revision %u.%u.%u.%u, %u MHz\n",
(cpu_rev >> 24) & 0xff,
(cpu_rev >> 16) & 0xff,
(cpu_rev >> 8) & 0xff,
@@ -45,10 +47,10 @@ static void __cpuinit init_transmeta(struct cpuinfo_x86 *c)
if (max >= 0x80860002) {
cpuid(0x80860002, &new_cpu_rev, &cms_rev1, &cms_rev2, &dummy);
if (cpu_rev == 0x02000000) {
- printk(KERN_INFO "CPU: Processor revision %08X, %u MHz\n",
+ pr_info("CPU: Processor revision %08X, %u MHz\n",
new_cpu_rev, cpu_freq);
}
- printk(KERN_INFO "CPU: Code Morphing Software revision %u.%u.%u-%u-%u\n",
+ pr_info("CPU: Code Morphing Software revision %u.%u.%u-%u-%u\n",
(cms_rev1 >> 24) & 0xff,
(cms_rev1 >> 16) & 0xff,
(cms_rev1 >> 8) & 0xff,
@@ -77,13 +79,13 @@ static void __cpuinit init_transmeta(struct cpuinfo_x86 *c)
(void *)&cpu_info[56],
(void *)&cpu_info[60]);
cpu_info[64] = '\0';
- printk(KERN_INFO "CPU: %s\n", cpu_info);
+ pr_info("CPU: %s\n", cpu_info);
}
/* Unhide possibly hidden capability flags */
rdmsr(0x80860004, cap_mask, uk);
wrmsr(0x80860004, ~0, uk);
- c->x86_capability[0] = cpuid_edx(0x00000001);
+ c->x86_capability[CPUID_1_EDX] = cpuid_edx(0x00000001);
wrmsr(0x80860004, cap_mask, uk);
/* All Transmeta CPUs have a constant TSC */
@@ -98,7 +100,7 @@ static void __cpuinit init_transmeta(struct cpuinfo_x86 *c)
#endif
}
-static const struct cpu_dev __cpuinitconst transmeta_cpu_dev = {
+static const struct cpu_dev transmeta_cpu_dev = {
.c_vendor = "Transmeta",
.c_ident = { "GenuineTMx86", "TransmetaCPU" },
.c_early_init = early_init_transmeta,
diff --git a/arch/x86/kernel/cpu/tsx.c b/arch/x86/kernel/cpu/tsx.c
new file mode 100644
index 000000000000..209b5a22d880
--- /dev/null
+++ b/arch/x86/kernel/cpu/tsx.c
@@ -0,0 +1,267 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Intel Transactional Synchronization Extensions (TSX) control.
+ *
+ * Copyright (C) 2019-2021 Intel Corporation
+ *
+ * Author:
+ * Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
+ */
+
+#include <linux/cpufeature.h>
+
+#include <asm/cmdline.h>
+#include <asm/cpu.h>
+#include <asm/msr.h>
+
+#include "cpu.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "tsx: " fmt
+
+enum tsx_ctrl_states {
+ TSX_CTRL_AUTO,
+ TSX_CTRL_ENABLE,
+ TSX_CTRL_DISABLE,
+ TSX_CTRL_RTM_ALWAYS_ABORT,
+ TSX_CTRL_NOT_SUPPORTED,
+};
+
+static enum tsx_ctrl_states tsx_ctrl_state __ro_after_init =
+ IS_ENABLED(CONFIG_X86_INTEL_TSX_MODE_AUTO) ? TSX_CTRL_AUTO :
+ IS_ENABLED(CONFIG_X86_INTEL_TSX_MODE_OFF) ? TSX_CTRL_DISABLE : TSX_CTRL_ENABLE;
+
+static void tsx_disable(void)
+{
+ u64 tsx;
+
+ rdmsrq(MSR_IA32_TSX_CTRL, tsx);
+
+ /* Force all transactions to immediately abort */
+ tsx |= TSX_CTRL_RTM_DISABLE;
+
+ /*
+ * Ensure TSX support is not enumerated in CPUID.
+ * This is visible to userspace and will ensure they
+ * do not waste resources trying TSX transactions that
+ * will always abort.
+ */
+ tsx |= TSX_CTRL_CPUID_CLEAR;
+
+ wrmsrq(MSR_IA32_TSX_CTRL, tsx);
+}
+
+static void tsx_enable(void)
+{
+ u64 tsx;
+
+ rdmsrq(MSR_IA32_TSX_CTRL, tsx);
+
+ /* Enable the RTM feature in the cpu */
+ tsx &= ~TSX_CTRL_RTM_DISABLE;
+
+ /*
+ * Ensure TSX support is enumerated in CPUID.
+ * This is visible to userspace and will ensure they
+ * can enumerate and use the TSX feature.
+ */
+ tsx &= ~TSX_CTRL_CPUID_CLEAR;
+
+ wrmsrq(MSR_IA32_TSX_CTRL, tsx);
+}
+
+static enum tsx_ctrl_states x86_get_tsx_auto_mode(void)
+{
+ if (boot_cpu_has_bug(X86_BUG_TAA))
+ return TSX_CTRL_DISABLE;
+
+ return TSX_CTRL_ENABLE;
+}
+
+/*
+ * Disabling TSX is not a trivial business.
+ *
+ * First of all, there's a CPUID bit: X86_FEATURE_RTM_ALWAYS_ABORT
+ * which says that TSX is practically disabled (all transactions are
+ * aborted by default). When that bit is set, the kernel unconditionally
+ * disables TSX.
+ *
+ * In order to do that, however, it needs to dance a bit:
+ *
+ * 1. The first method to disable it is through MSR_TSX_FORCE_ABORT and
+ * the MSR is present only when *two* CPUID bits are set:
+ *
+ * - X86_FEATURE_RTM_ALWAYS_ABORT
+ * - X86_FEATURE_TSX_FORCE_ABORT
+ *
+ * 2. The second method is for CPUs which do not have the above-mentioned
+ * MSR: those use a different MSR - MSR_IA32_TSX_CTRL and disable TSX
+ * through that one. Those CPUs can also have the initially mentioned
+ * CPUID bit X86_FEATURE_RTM_ALWAYS_ABORT set and for those the same strategy
+ * applies: TSX gets disabled unconditionally.
+ *
+ * When either of the two methods is present, the kernel disables TSX and
+ * clears the respective RTM and HLE feature flags.
+ *
+ * An additional twist in the whole thing is late microcode loading,
+ * which, when done, may cause the X86_FEATURE_RTM_ALWAYS_ABORT CPUID
+ * bit to be set after the update.
+ *
+ * A subsequent hotplug operation on any logical CPU except the BSP will
+ * cause the supported CPUID feature bits to be re-detected. If RTM and
+ * HLE suddenly read as cleared, but userspace consulted them before the
+ * update, funny explosions will happen. Long story short: the kernel
+ * doesn't modify CPUID feature bits after booting.
+ *
+ * That's why this function's call in init_intel() doesn't clear the
+ * feature flags.
+ */
+static void tsx_clear_cpuid(void)
+{
+ u64 msr;
+
+ /*
+ * MSR_TFA_TSX_CPUID_CLEAR bit is only present when both CPUID
+ * bits RTM_ALWAYS_ABORT and TSX_FORCE_ABORT are present.
+ */
+ if (boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT) &&
+ boot_cpu_has(X86_FEATURE_TSX_FORCE_ABORT)) {
+ rdmsrq(MSR_TSX_FORCE_ABORT, msr);
+ msr |= MSR_TFA_TSX_CPUID_CLEAR;
+ wrmsrq(MSR_TSX_FORCE_ABORT, msr);
+ } else if (cpu_feature_enabled(X86_FEATURE_MSR_TSX_CTRL)) {
+ rdmsrq(MSR_IA32_TSX_CTRL, msr);
+ msr |= TSX_CTRL_CPUID_CLEAR;
+ wrmsrq(MSR_IA32_TSX_CTRL, msr);
+ }
+}
+
+/*
+ * Disable TSX development mode
+ *
+ * When the microcode released in Feb 2022 is applied, TSX will be disabled by
+ * default on some processors. MSR 0x122 (TSX_CTRL) and MSR 0x123
+ * (IA32_MCU_OPT_CTRL) can be used to re-enable TSX for development, doing so is
+ * not recommended for production deployments. In particular, applying MD_CLEAR
+ * flows for mitigation of the Intel TSX Asynchronous Abort (TAA) transient
+ * execution attack may not be effective on these processors when Intel TSX is
+ * enabled with updated microcode.
+ */
+static void tsx_dev_mode_disable(void)
+{
+ u64 mcu_opt_ctrl;
+
+ /* Check if RTM_ALLOW exists */
+ if (!boot_cpu_has_bug(X86_BUG_TAA) ||
+ !cpu_feature_enabled(X86_FEATURE_MSR_TSX_CTRL) ||
+ !cpu_feature_enabled(X86_FEATURE_SRBDS_CTRL))
+ return;
+
+ rdmsrq(MSR_IA32_MCU_OPT_CTRL, mcu_opt_ctrl);
+
+ if (mcu_opt_ctrl & RTM_ALLOW) {
+ mcu_opt_ctrl &= ~RTM_ALLOW;
+ wrmsrq(MSR_IA32_MCU_OPT_CTRL, mcu_opt_ctrl);
+ setup_force_cpu_cap(X86_FEATURE_RTM_ALWAYS_ABORT);
+ }
+}
+
+static int __init tsx_parse_cmdline(char *str)
+{
+ if (!str)
+ return -EINVAL;
+
+ if (!strcmp(str, "on")) {
+ tsx_ctrl_state = TSX_CTRL_ENABLE;
+ } else if (!strcmp(str, "off")) {
+ tsx_ctrl_state = TSX_CTRL_DISABLE;
+ } else if (!strcmp(str, "auto")) {
+ tsx_ctrl_state = TSX_CTRL_AUTO;
+ } else {
+ tsx_ctrl_state = TSX_CTRL_DISABLE;
+ pr_err("invalid option, defaulting to off\n");
+ }
+
+ return 0;
+}
+early_param("tsx", tsx_parse_cmdline);
+
+void __init tsx_init(void)
+{
+ tsx_dev_mode_disable();
+
+ /*
+ * Hardware will always abort a TSX transaction when the CPUID bit
+ * RTM_ALWAYS_ABORT is set. In this case, it is better not to enumerate
+ * CPUID.RTM and CPUID.HLE bits. Clear them here.
+ */
+ if (boot_cpu_has(X86_FEATURE_RTM_ALWAYS_ABORT)) {
+ tsx_ctrl_state = TSX_CTRL_RTM_ALWAYS_ABORT;
+ tsx_clear_cpuid();
+ setup_clear_cpu_cap(X86_FEATURE_RTM);
+ setup_clear_cpu_cap(X86_FEATURE_HLE);
+ return;
+ }
+
+ /*
+ * TSX is controlled via MSR_IA32_TSX_CTRL. However, support for this
+ * MSR is enumerated by ARCH_CAP_TSX_MSR bit in MSR_IA32_ARCH_CAPABILITIES.
+ *
+ * TSX control (aka MSR_IA32_TSX_CTRL) is only available after a
+ * microcode update on CPUs that have their MSR_IA32_ARCH_CAPABILITIES
+ * bit MDS_NO=1. CPUs with MDS_NO=0 are not planned to get
+ * MSR_IA32_TSX_CTRL support even after a microcode update. Thus,
+ * tsx= cmdline requests will do nothing on CPUs without
+ * MSR_IA32_TSX_CTRL support.
+ */
+ if (x86_read_arch_cap_msr() & ARCH_CAP_TSX_CTRL_MSR) {
+ setup_force_cpu_cap(X86_FEATURE_MSR_TSX_CTRL);
+ } else {
+ tsx_ctrl_state = TSX_CTRL_NOT_SUPPORTED;
+ return;
+ }
+
+ if (tsx_ctrl_state == TSX_CTRL_AUTO)
+ tsx_ctrl_state = x86_get_tsx_auto_mode();
+
+ if (tsx_ctrl_state == TSX_CTRL_DISABLE) {
+ tsx_disable();
+
+ /*
+ * tsx_disable() will change the state of the RTM and HLE CPUID
+ * bits. Clear them here since they are now expected to be not
+ * set.
+ */
+ setup_clear_cpu_cap(X86_FEATURE_RTM);
+ setup_clear_cpu_cap(X86_FEATURE_HLE);
+ } else if (tsx_ctrl_state == TSX_CTRL_ENABLE) {
+
+ /*
+ * HW defaults TSX to be enabled at bootup.
+ * We may still need the TSX enable support
+ * during init for special cases like
+ * kexec after TSX is disabled.
+ */
+ tsx_enable();
+
+ /*
+ * tsx_enable() will change the state of the RTM and HLE CPUID
+ * bits. Force them here since they are now expected to be set.
+ */
+ setup_force_cpu_cap(X86_FEATURE_RTM);
+ setup_force_cpu_cap(X86_FEATURE_HLE);
+ }
+}
+
+void tsx_ap_init(void)
+{
+ tsx_dev_mode_disable();
+
+ if (tsx_ctrl_state == TSX_CTRL_ENABLE)
+ tsx_enable();
+ else if (tsx_ctrl_state == TSX_CTRL_DISABLE)
+ tsx_disable();
+ else if (tsx_ctrl_state == TSX_CTRL_RTM_ALWAYS_ABORT)
+ /* See comment over that function for more details. */
+ tsx_clear_cpuid();
+}
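
For reference, the two bits toggled throughout this file sit in MSR_IA32_TSX_CTRL (0x122): bit 0 (TSX_CTRL_RTM_DISABLE) forces transactions to abort and bit 1 (TSX_CTRL_CPUID_CLEAR) hides RTM/HLE from CPUID. A rough userspace decode via the msr driver (assumes root, a loaded msr module, and a CPU that actually has the MSR; error handling kept minimal):

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	/* MSR 0x122 = IA32_TSX_CTRL; the read fails (EIO) if it is absent */
	if (fd < 0 || pread(fd, &val, sizeof(val), 0x122) != sizeof(val)) {
		perror("MSR_IA32_TSX_CTRL");
		return 1;
	}
	printf("RTM_DISABLE=%u CPUID_CLEAR=%u\n",
	       (unsigned int)(val & 1), (unsigned int)((val >> 1) & 1));
	close(fd);
	return 0;
}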
diff --git a/arch/x86/kernel/cpu/umc.c b/arch/x86/kernel/cpu/umc.c
index fd2c37bf7acb..65a58a390fc3 100644
--- a/arch/x86/kernel/cpu/umc.c
+++ b/arch/x86/kernel/cpu/umc.c
@@ -1,5 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/kernel.h>
-#include <linux/init.h>
#include <asm/processor.h>
#include "cpu.h"
@@ -8,11 +8,11 @@
* so no special init takes place.
*/
-static const struct cpu_dev __cpuinitconst umc_cpu_dev = {
+static const struct cpu_dev umc_cpu_dev = {
.c_vendor = "UMC",
.c_ident = { "UMC UMC UMC" },
- .c_models = {
- { .vendor = X86_VENDOR_UMC, .family = 4, .model_names =
+ .legacy_models = {
+ { .family = 4, .model_names =
{
[1] = "U5D",
[2] = "U5S",
diff --git a/arch/x86/kernel/cpu/umwait.c b/arch/x86/kernel/cpu/umwait.c
new file mode 100644
index 000000000000..e4a31c536642
--- /dev/null
+++ b/arch/x86/kernel/cpu/umwait.c
@@ -0,0 +1,246 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/syscore_ops.h>
+#include <linux/suspend.h>
+#include <linux/cpu.h>
+
+#include <asm/msr.h>
+#include <asm/mwait.h>
+
+#define UMWAIT_C02_ENABLE 0
+
+#define UMWAIT_CTRL_VAL(max_time, c02_disable) \
+ (((max_time) & MSR_IA32_UMWAIT_CONTROL_TIME_MASK) | \
+ ((c02_disable) & MSR_IA32_UMWAIT_CONTROL_C02_DISABLE))
+
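A quick sanity check of the packing macro above: 100000 is 0x186a0, whose low two bits are already zero, so the default control value is 0x186a0 with C0.2 enabled. A standalone mirror of the macro, with mask values copied from the kernel definitions:

#include <stdio.h>
#include <stdint.h>

#define TIME_MASK	(~0x03u)	/* MSR_IA32_UMWAIT_CONTROL_TIME_MASK */
#define C02_DISABLE	0x01u		/* MSR_IA32_UMWAIT_CONTROL_C02_DISABLE */

static uint32_t umwait_ctrl_val(uint32_t max_time, uint32_t c02_disable)
{
	return (max_time & TIME_MASK) | (c02_disable & C02_DISABLE);
}

int main(void)
{
	printf("0x%x\n", umwait_ctrl_val(100000, 0));	/* 0x186a0, C0.2 on */
	printf("0x%x\n", umwait_ctrl_val(100000, 1));	/* 0x186a1, C0.2 off */
	return 0;
}
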
+/*
+ * Cache IA32_UMWAIT_CONTROL MSR. This is a systemwide control. By default,
+ * umwait max time is 100000 in TSC-quanta and C0.2 is enabled
+ */
+static u32 umwait_control_cached = UMWAIT_CTRL_VAL(100000, UMWAIT_C02_ENABLE);
+
+/*
+ * Cache the original IA32_UMWAIT_CONTROL MSR value which is configured by
+ * hardware or BIOS before kernel boot.
+ */
+static u32 orig_umwait_control_cached __ro_after_init;
+
+/*
+ * Serialize access to umwait_control_cached and IA32_UMWAIT_CONTROL MSR in
+ * the sysfs write functions.
+ */
+static DEFINE_MUTEX(umwait_lock);
+
+static void umwait_update_control_msr(void *unused)
+{
+ lockdep_assert_irqs_disabled();
+ wrmsrq(MSR_IA32_UMWAIT_CONTROL, READ_ONCE(umwait_control_cached));
+}
+
+/*
+ * The CPU hotplug callback sets the control MSR to the global control
+ * value.
+ *
+ * Disable interrupts so the read of umwait_control_cached and the WRMSR
+ * are protected against a concurrent sysfs write. Otherwise the sysfs
+ * write could update the cached value after it had been read on this CPU
+ * and issue the IPI before the old value had been written. The IPI would
+ * interrupt, write the new value and after return from IPI the previous
+ * value would be written by this CPU.
+ *
+ * With interrupts disabled the upcoming CPU either sees the new control
+ * value or the IPI is updating this CPU to the new control value after
+ * interrupts have been reenabled.
+ */
+static int umwait_cpu_online(unsigned int cpu)
+{
+ local_irq_disable();
+ umwait_update_control_msr(NULL);
+ local_irq_enable();
+ return 0;
+}
+
+/*
+ * The CPU hotplug callback sets the control MSR to the original control
+ * value.
+ */
+static int umwait_cpu_offline(unsigned int cpu)
+{
+ /*
+ * This code is protected by the CPU hotplug already and
+ * orig_umwait_control_cached is never changed after it caches
+ * the original control MSR value in umwait_init(). So there
+ * is no race condition here.
+ */
+ wrmsrq(MSR_IA32_UMWAIT_CONTROL, orig_umwait_control_cached);
+
+ return 0;
+}
+
+/*
+ * On resume, restore IA32_UMWAIT_CONTROL MSR on the boot processor which
+ * is the only active CPU at this time. The MSR is set up on the APs via the
+ * CPU hotplug callback.
+ *
+ * This function is invoked on resume from suspend and hibernation. On
+ * resume from suspend the restore should be not required, but we neither
+ * trust the firmware nor does it matter if the same value is written
+ * again.
+ */
+static void umwait_syscore_resume(void *data)
+{
+ umwait_update_control_msr(NULL);
+}
+
+static const struct syscore_ops umwait_syscore_ops = {
+ .resume = umwait_syscore_resume,
+};
+
+static struct syscore umwait_syscore = {
+ .ops = &umwait_syscore_ops,
+};
+
+/* sysfs interface */
+
+/*
+ * When bit 0 in IA32_UMWAIT_CONTROL MSR is 1, C0.2 is disabled.
+ * Otherwise, C0.2 is enabled.
+ */
+static inline bool umwait_ctrl_c02_enabled(u32 ctrl)
+{
+ return !(ctrl & MSR_IA32_UMWAIT_CONTROL_C02_DISABLE);
+}
+
+static inline u32 umwait_ctrl_max_time(u32 ctrl)
+{
+ return ctrl & MSR_IA32_UMWAIT_CONTROL_TIME_MASK;
+}
+
+static inline void umwait_update_control(u32 maxtime, bool c02_enable)
+{
+ u32 ctrl = maxtime & MSR_IA32_UMWAIT_CONTROL_TIME_MASK;
+
+ if (!c02_enable)
+ ctrl |= MSR_IA32_UMWAIT_CONTROL_C02_DISABLE;
+
+ WRITE_ONCE(umwait_control_cached, ctrl);
+ /* Propagate to all CPUs */
+ on_each_cpu(umwait_update_control_msr, NULL, 1);
+}
+
+static ssize_t
+enable_c02_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ u32 ctrl = READ_ONCE(umwait_control_cached);
+
+ return sprintf(buf, "%d\n", umwait_ctrl_c02_enabled(ctrl));
+}
+
+static ssize_t enable_c02_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ bool c02_enable;
+ u32 ctrl;
+ int ret;
+
+ ret = kstrtobool(buf, &c02_enable);
+ if (ret)
+ return ret;
+
+ mutex_lock(&umwait_lock);
+
+ ctrl = READ_ONCE(umwait_control_cached);
+ if (c02_enable != umwait_ctrl_c02_enabled(ctrl))
+ umwait_update_control(ctrl, c02_enable);
+
+ mutex_unlock(&umwait_lock);
+
+ return count;
+}
+static DEVICE_ATTR_RW(enable_c02);
+
+static ssize_t
+max_time_show(struct device *kobj, struct device_attribute *attr, char *buf)
+{
+ u32 ctrl = READ_ONCE(umwait_control_cached);
+
+ return sprintf(buf, "%u\n", umwait_ctrl_max_time(ctrl));
+}
+
+static ssize_t max_time_store(struct device *kobj,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ u32 max_time, ctrl;
+ int ret;
+
+ ret = kstrtou32(buf, 0, &max_time);
+ if (ret)
+ return ret;
+
+ /* bits[1:0] must be zero */
+ if (max_time & ~MSR_IA32_UMWAIT_CONTROL_TIME_MASK)
+ return -EINVAL;
+
+ mutex_lock(&umwait_lock);
+
+ ctrl = READ_ONCE(umwait_control_cached);
+ if (max_time != umwait_ctrl_max_time(ctrl))
+ umwait_update_control(max_time, umwait_ctrl_c02_enabled(ctrl));
+
+ mutex_unlock(&umwait_lock);
+
+ return count;
+}
+static DEVICE_ATTR_RW(max_time);
+
+static struct attribute *umwait_attrs[] = {
+ &dev_attr_enable_c02.attr,
+ &dev_attr_max_time.attr,
+ NULL
+};
+
+static struct attribute_group umwait_attr_group = {
+ .attrs = umwait_attrs,
+ .name = "umwait_control",
+};
+
+static int __init umwait_init(void)
+{
+ struct device *dev;
+ int ret;
+
+ if (!boot_cpu_has(X86_FEATURE_WAITPKG))
+ return -ENODEV;
+
+ /*
+ * Cache the original control MSR value before the control MSR is
+ * changed. This is the only place where orig_umwait_control_cached
+ * is modified.
+ */
+ rdmsrq(MSR_IA32_UMWAIT_CONTROL, orig_umwait_control_cached);
+
+ ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "umwait:online",
+ umwait_cpu_online, umwait_cpu_offline);
+ if (ret < 0) {
+ /*
+ * On failure, the control MSR on all CPUs has the
+ * original control value.
+ */
+ return ret;
+ }
+
+ register_syscore(&umwait_syscore);
+
+ /*
+ * Add umwait control interface. Ignore failure, so at least the
+ * default values are set up in case the machine manages to boot.
+ */
+ dev = bus_get_dev_root(&cpu_subsys);
+ if (dev) {
+ ret = sysfs_create_group(&dev->kobj, &umwait_attr_group);
+ put_device(dev);
+ }
+ return ret;
+}
+device_initcall(umwait_init);
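
With the attribute group registered on the cpu subsystem root, the files typically land under /sys/devices/system/cpu/umwait_control/. A small reader sketch (path inferred from the group name above; the directory only exists on WAITPKG-capable hardware):

#include <stdio.h>

static void show(const char *path)
{
	char buf[32];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	show("/sys/devices/system/cpu/umwait_control/max_time");
	show("/sys/devices/system/cpu/umwait_control/enable_c02");
	return 0;
}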
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index b9d1ff588445..cb3f900c46fc 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -22,106 +22,572 @@
*/
#include <linux/dmi.h>
-#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/export.h>
+#include <linux/clocksource.h>
+#include <linux/cpu.h>
+#include <linux/efi.h>
+#include <linux/reboot.h>
+#include <linux/static_call.h>
#include <asm/div64.h>
#include <asm/x86_init.h>
#include <asm/hypervisor.h>
+#include <asm/timer.h>
+#include <asm/apic.h>
+#include <asm/vmware.h>
+#include <asm/svm.h>
-#define CPUID_VMWARE_INFO_LEAF 0x40000000
-#define VMWARE_HYPERVISOR_MAGIC 0x564D5868
-#define VMWARE_HYPERVISOR_PORT 0x5658
+#undef pr_fmt
+#define pr_fmt(fmt) "vmware: " fmt
-#define VMWARE_PORT_CMD_GETVERSION 10
-#define VMWARE_PORT_CMD_GETHZ 45
+#define CPUID_VMWARE_INFO_LEAF 0x40000000
+#define CPUID_VMWARE_FEATURES_LEAF 0x40000010
-#define VMWARE_PORT(cmd, eax, ebx, ecx, edx) \
- __asm__("inl (%%dx)" : \
- "=a"(eax), "=c"(ecx), "=d"(edx), "=b"(ebx) : \
- "0"(VMWARE_HYPERVISOR_MAGIC), \
- "1"(VMWARE_PORT_CMD_##cmd), \
- "2"(VMWARE_HYPERVISOR_PORT), "3"(UINT_MAX) : \
- "memory");
+#define GETVCPU_INFO_LEGACY_X2APIC BIT(3)
+#define GETVCPU_INFO_VCPU_RESERVED BIT(31)
+
+#define STEALCLOCK_NOT_AVAILABLE (-1)
+#define STEALCLOCK_DISABLED 0
+#define STEALCLOCK_ENABLED 1
+
+struct vmware_steal_time {
+ union {
+ u64 clock; /* stolen time counter in units of vtsc */
+ struct {
+ /* only for little-endian */
+ u32 clock_low;
+ u32 clock_high;
+ };
+ };
+ u64 reserved[7];
+};
+
+static unsigned long vmware_tsc_khz __ro_after_init;
+static u8 vmware_hypercall_mode __ro_after_init;
+
+unsigned long vmware_hypercall_slow(unsigned long cmd,
+ unsigned long in1, unsigned long in3,
+ unsigned long in4, unsigned long in5,
+ u32 *out1, u32 *out2, u32 *out3,
+ u32 *out4, u32 *out5)
+{
+ unsigned long out0, rbx, rcx, rdx, rsi, rdi;
+
+ switch (vmware_hypercall_mode) {
+ case CPUID_VMWARE_FEATURES_ECX_VMCALL:
+ asm_inline volatile ("vmcall"
+ : "=a" (out0), "=b" (rbx), "=c" (rcx),
+ "=d" (rdx), "=S" (rsi), "=D" (rdi)
+ : "a" (VMWARE_HYPERVISOR_MAGIC),
+ "b" (in1),
+ "c" (cmd),
+ "d" (in3),
+ "S" (in4),
+ "D" (in5)
+ : "cc", "memory");
+ break;
+ case CPUID_VMWARE_FEATURES_ECX_VMMCALL:
+ asm_inline volatile ("vmmcall"
+ : "=a" (out0), "=b" (rbx), "=c" (rcx),
+ "=d" (rdx), "=S" (rsi), "=D" (rdi)
+ : "a" (VMWARE_HYPERVISOR_MAGIC),
+ "b" (in1),
+ "c" (cmd),
+ "d" (in3),
+ "S" (in4),
+ "D" (in5)
+ : "cc", "memory");
+ break;
+ default:
+ asm_inline volatile ("movw %[port], %%dx; inl (%%dx), %%eax"
+ : "=a" (out0), "=b" (rbx), "=c" (rcx),
+ "=d" (rdx), "=S" (rsi), "=D" (rdi)
+ : [port] "i" (VMWARE_HYPERVISOR_PORT),
+ "a" (VMWARE_HYPERVISOR_MAGIC),
+ "b" (in1),
+ "c" (cmd),
+ "d" (in3),
+ "S" (in4),
+ "D" (in5)
+ : "cc", "memory");
+ break;
+ }
+
+ if (out1)
+ *out1 = rbx;
+ if (out2)
+ *out2 = rcx;
+ if (out3)
+ *out3 = rdx;
+ if (out4)
+ *out4 = rsi;
+ if (out5)
+ *out5 = rdi;
+
+ return out0;
+}
static inline int __vmware_platform(void)
{
- uint32_t eax, ebx, ecx, edx;
- VMWARE_PORT(GETVERSION, eax, ebx, ecx, edx);
- return eax != (uint32_t)-1 && ebx == VMWARE_HYPERVISOR_MAGIC;
+ u32 eax, ebx, ecx;
+
+ eax = vmware_hypercall3(VMWARE_CMD_GETVERSION, 0, &ebx, &ecx);
+ return eax != UINT_MAX && ebx == VMWARE_HYPERVISOR_MAGIC;
}
static unsigned long vmware_get_tsc_khz(void)
{
- uint64_t tsc_hz;
- uint32_t eax, ebx, ecx, edx;
+ return vmware_tsc_khz;
+}
+
+#ifdef CONFIG_PARAVIRT
+static struct cyc2ns_data vmware_cyc2ns __ro_after_init;
+static bool vmw_sched_clock __initdata = true;
+static DEFINE_PER_CPU_DECRYPTED(struct vmware_steal_time, vmw_steal_time) __aligned(64);
+static bool has_steal_clock;
+static bool steal_acc __initdata = true; /* steal time accounting */
+
+static __init int setup_vmw_sched_clock(char *s)
+{
+ vmw_sched_clock = false;
+ return 0;
+}
+early_param("no-vmw-sched-clock", setup_vmw_sched_clock);
+
+static __init int parse_no_stealacc(char *arg)
+{
+ steal_acc = false;
+ return 0;
+}
+early_param("no-steal-acc", parse_no_stealacc);
+
+static noinstr u64 vmware_sched_clock(void)
+{
+ unsigned long long ns;
+
+ ns = mul_u64_u32_shr(rdtsc(), vmware_cyc2ns.cyc2ns_mul,
+ vmware_cyc2ns.cyc2ns_shift);
+ ns -= vmware_cyc2ns.cyc2ns_offset;
+ return ns;
+}
+
+static void __init vmware_cyc2ns_setup(void)
+{
+ struct cyc2ns_data *d = &vmware_cyc2ns;
+ unsigned long long tsc_now = rdtsc();
+
+ clocks_calc_mult_shift(&d->cyc2ns_mul, &d->cyc2ns_shift,
+ vmware_tsc_khz, NSEC_PER_MSEC, 0);
+ d->cyc2ns_offset = mul_u64_u32_shr(tsc_now, d->cyc2ns_mul,
+ d->cyc2ns_shift);
+
+ pr_info("using clock offset of %llu ns\n", d->cyc2ns_offset);
+}
+
+static int vmware_cmd_stealclock(u32 addr_hi, u32 addr_lo)
+{
+ u32 info;
+
+ return vmware_hypercall5(VMWARE_CMD_STEALCLOCK, 0, 0, addr_hi, addr_lo,
+ &info);
+}
+
+static bool stealclock_enable(phys_addr_t pa)
+{
+ return vmware_cmd_stealclock(upper_32_bits(pa),
+ lower_32_bits(pa)) == STEALCLOCK_ENABLED;
+}
+
+static int __stealclock_disable(void)
+{
+ return vmware_cmd_stealclock(0, 1);
+}
+
+static void stealclock_disable(void)
+{
+ __stealclock_disable();
+}
+
+static bool vmware_is_stealclock_available(void)
+{
+ return __stealclock_disable() != STEALCLOCK_NOT_AVAILABLE;
+}
+
+/**
+ * vmware_steal_clock() - read the per-cpu steal clock
+ * @cpu: the cpu number whose steal clock we want to read
+ *
+ * The function reads the steal clock if we are on a 64-bit system, otherwise
+ * reads it in parts, checking that the high part didn't change in the
+ * meantime.
+ *
+ * Return:
+ * The steal clock reading in ns.
+ */
+static u64 vmware_steal_clock(int cpu)
+{
+ struct vmware_steal_time *steal = &per_cpu(vmw_steal_time, cpu);
+ u64 clock;
+
+ if (IS_ENABLED(CONFIG_64BIT))
+ clock = READ_ONCE(steal->clock);
+ else {
+ u32 initial_high, low, high;
+
+ do {
+ initial_high = READ_ONCE(steal->clock_high);
+ /* Do not reorder initial_high and high readings */
+ virt_rmb();
+ low = READ_ONCE(steal->clock_low);
+ /* Keep low reading in between */
+ virt_rmb();
+ high = READ_ONCE(steal->clock_high);
+ } while (initial_high != high);
+
+ clock = ((u64)high << 32) | low;
+ }
+
+ return mul_u64_u32_shr(clock, vmware_cyc2ns.cyc2ns_mul,
+ vmware_cyc2ns.cyc2ns_shift);
+}
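
The 32-bit branch above is the classic torn-read guard for a 64-bit counter updated behind the reader's back: sample high, then low, then high again, and retry if the high word moved. Stripped of the VMware specifics, the pattern looks roughly like this (a sketch; the acquire fences stand in for virt_rmb()):

#include <stdint.h>

/* 64-bit counter updated by another agent, stored as two 32-bit halves */
struct split64 {
	volatile uint32_t low;
	volatile uint32_t high;
};

/* Retry until the high word is stable across the low-word read */
static uint64_t read_split64(const struct split64 *c)
{
	uint32_t initial_high, low, high;

	do {
		initial_high = c->high;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
		low = c->low;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
		high = c->high;
	} while (initial_high != high);

	return ((uint64_t)high << 32) | low;
}

int main(void)
{
	struct split64 c = { .low = 0x89abcdefu, .high = 0x01234567u };

	return read_split64(&c) != 0x0123456789abcdefULL;
}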
+
+static void vmware_register_steal_time(void)
+{
+ int cpu = smp_processor_id();
+ struct vmware_steal_time *st = &per_cpu(vmw_steal_time, cpu);
+
+ if (!has_steal_clock)
+ return;
+
+ if (!stealclock_enable(slow_virt_to_phys(st))) {
+ has_steal_clock = false;
+ return;
+ }
+
+ pr_info("vmware-stealtime: cpu %d, pa %llx\n",
+ cpu, (unsigned long long) slow_virt_to_phys(st));
+}
+
+static void vmware_disable_steal_time(void)
+{
+ if (!has_steal_clock)
+ return;
+
+ stealclock_disable();
+}
+
+static void vmware_guest_cpu_init(void)
+{
+ if (has_steal_clock)
+ vmware_register_steal_time();
+}
+
+static void vmware_pv_guest_cpu_reboot(void *unused)
+{
+ vmware_disable_steal_time();
+}
+
+static int vmware_pv_reboot_notify(struct notifier_block *nb,
+ unsigned long code, void *unused)
+{
+ if (code == SYS_RESTART)
+ on_each_cpu(vmware_pv_guest_cpu_reboot, NULL, 1);
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block vmware_pv_reboot_nb = {
+ .notifier_call = vmware_pv_reboot_notify,
+};
+
+#ifdef CONFIG_SMP
+static void __init vmware_smp_prepare_boot_cpu(void)
+{
+ vmware_guest_cpu_init();
+ native_smp_prepare_boot_cpu();
+}
+
+static int vmware_cpu_online(unsigned int cpu)
+{
+ local_irq_disable();
+ vmware_guest_cpu_init();
+ local_irq_enable();
+ return 0;
+}
+
+static int vmware_cpu_down_prepare(unsigned int cpu)
+{
+ local_irq_disable();
+ vmware_disable_steal_time();
+ local_irq_enable();
+ return 0;
+}
+#endif
+
+static __init int activate_jump_labels(void)
+{
+ if (has_steal_clock) {
+ static_key_slow_inc(&paravirt_steal_enabled);
+ if (steal_acc)
+ static_key_slow_inc(&paravirt_steal_rq_enabled);
+ }
+
+ return 0;
+}
+arch_initcall(activate_jump_labels);
+
+static void __init vmware_paravirt_ops_setup(void)
+{
+ pv_info.name = "VMware hypervisor";
+ pv_ops.cpu.io_delay = paravirt_nop;
+
+ if (vmware_tsc_khz == 0)
+ return;
+
+ vmware_cyc2ns_setup();
+
+ if (vmw_sched_clock)
+ paravirt_set_sched_clock(vmware_sched_clock);
+
+ if (vmware_is_stealclock_available()) {
+ has_steal_clock = true;
+ static_call_update(pv_steal_clock, vmware_steal_clock);
- VMWARE_PORT(GETHZ, eax, ebx, ecx, edx);
+ /* We use reboot notifier only to disable steal clock */
+ register_reboot_notifier(&vmware_pv_reboot_nb);
- tsc_hz = eax | (((uint64_t)ebx) << 32);
- do_div(tsc_hz, 1000);
- BUG_ON(tsc_hz >> 32);
- printk(KERN_INFO "TSC freq read from hypervisor : %lu.%03lu MHz\n",
- (unsigned long) tsc_hz / 1000,
- (unsigned long) tsc_hz % 1000);
- return tsc_hz;
+#ifdef CONFIG_SMP
+ smp_ops.smp_prepare_boot_cpu =
+ vmware_smp_prepare_boot_cpu;
+ if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+ "x86/vmware:online",
+ vmware_cpu_online,
+ vmware_cpu_down_prepare) < 0)
+ pr_err("vmware_guest: Failed to install cpu hotplug callbacks\n");
+#else
+ vmware_guest_cpu_init();
+#endif
+ }
+}
+#else
+#define vmware_paravirt_ops_setup() do {} while (0)
+#endif
+
+/*
+ * VMware hypervisor takes care of exporting a reliable TSC to the guest.
+ * Still, due to timing difference when running on virtual cpus, the TSC can
+ * be marked as unstable in some cases. For example, the TSC sync check at
+ * bootup can fail due to a marginal offset between vcpus' TSCs (though the
+ * TSCs do not drift from each other). Also, the ACPI PM timer clocksource
+ * is not suitable as a watchdog when running on a hypervisor because the
+ * kernel may miss a wrap of the counter if the vcpu is descheduled for a
+ * long time. To skip these checks at runtime we set these capability bits,
+ * so that the kernel could just trust the hypervisor with providing a
+ * reliable virtual TSC that is suitable for timekeeping.
+ */
+static void __init vmware_set_capabilities(void)
+{
+ setup_force_cpu_cap(X86_FEATURE_CONSTANT_TSC);
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+ if (vmware_tsc_khz)
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+ if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMCALL)
+ setup_force_cpu_cap(X86_FEATURE_VMCALL);
+ else if (vmware_hypercall_mode == CPUID_VMWARE_FEATURES_ECX_VMMCALL)
+ setup_force_cpu_cap(X86_FEATURE_VMW_VMMCALL);
}
static void __init vmware_platform_setup(void)
{
- uint32_t eax, ebx, ecx, edx;
+ u32 eax, ebx, ecx;
+ u64 lpj, tsc_khz;
- VMWARE_PORT(GETHZ, eax, ebx, ecx, edx);
+ eax = vmware_hypercall3(VMWARE_CMD_GETHZ, UINT_MAX, &ebx, &ecx);
- if (ebx != UINT_MAX)
+ if (ebx != UINT_MAX) {
+ lpj = tsc_khz = eax | (((u64)ebx) << 32);
+ do_div(tsc_khz, 1000);
+ WARN_ON(tsc_khz >> 32);
+ pr_info("TSC freq read from hypervisor : %lu.%03lu MHz\n",
+ (unsigned long) tsc_khz / 1000,
+ (unsigned long) tsc_khz % 1000);
+
+ if (!preset_lpj) {
+ do_div(lpj, HZ);
+ preset_lpj = lpj;
+ }
+
+ vmware_tsc_khz = tsc_khz;
x86_platform.calibrate_tsc = vmware_get_tsc_khz;
- else
- printk(KERN_WARNING
- "Failed to get TSC freq from the hypervisor\n");
+ x86_platform.calibrate_cpu = vmware_get_tsc_khz;
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ /* Skip lapic calibration since we know the bus frequency. */
+ lapic_timer_period = ecx / HZ;
+ pr_info("Host bus clock speed read from hypervisor : %u Hz\n",
+ ecx);
+#endif
+ } else {
+ pr_warn("Failed to get TSC freq from the hypervisor\n");
+ }
+
+ if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !efi_enabled(EFI_BOOT))
+ x86_init.mpparse.find_mptable = mpparse_find_mptable;
+
+ vmware_paravirt_ops_setup();
+
+#ifdef CONFIG_X86_IO_APIC
+ no_timer_check = 1;
+#endif
+
+ vmware_set_capabilities();
+}
+
+static u8 __init vmware_select_hypercall(void)
+{
+ int eax, ebx, ecx, edx;
+
+ cpuid(CPUID_VMWARE_FEATURES_LEAF, &eax, &ebx, &ecx, &edx);
+ return (ecx & (CPUID_VMWARE_FEATURES_ECX_VMMCALL |
+ CPUID_VMWARE_FEATURES_ECX_VMCALL));
}
/*
- * While checking the dmi string infomation, just checking the product
+ * While checking the dmi string information, just checking the product
* serial key should be enough, as this will always have a VMware
* specific string when running under VMware hypervisor.
+ * If !boot_cpu_has(X86_FEATURE_HYPERVISOR), vmware_hypercall_mode
+ * intentionally defaults to 0.
*/
-static bool __init vmware_platform(void)
+static u32 __init vmware_platform(void)
{
- if (cpu_has_hypervisor) {
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
unsigned int eax;
unsigned int hyper_vendor_id[3];
cpuid(CPUID_VMWARE_INFO_LEAF, &eax, &hyper_vendor_id[0],
&hyper_vendor_id[1], &hyper_vendor_id[2]);
- if (!memcmp(hyper_vendor_id, "VMwareVMware", 12))
- return true;
+ if (!memcmp(hyper_vendor_id, "VMwareVMware", 12)) {
+ if (eax >= CPUID_VMWARE_FEATURES_LEAF)
+ vmware_hypercall_mode =
+ vmware_select_hypercall();
+
+ pr_info("hypercall mode: 0x%02x\n",
+ (unsigned int) vmware_hypercall_mode);
+
+ return CPUID_VMWARE_INFO_LEAF;
+ }
} else if (dmi_available && dmi_name_in_serial("VMware") &&
__vmware_platform())
- return true;
+ return 1;
- return false;
+ return 0;
}
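
The signature test has a direct userspace analogue: when CPUID.1:ECX bit 31 is set, leaf 0x40000000 returns a 12-byte vendor string in EBX/ECX/EDX ("VMwareVMware" here). A minimal sketch using the compiler's intrinsics, illustration only:

#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	char sig[13] = { 0 };

	__cpuid(1, eax, ebx, ecx, edx);
	if (!(ecx & (1u << 31))) {	/* hypervisor present bit */
		puts("bare metal");
		return 0;
	}

	__cpuid(0x40000000, eax, ebx, ecx, edx);
	memcpy(sig, &ebx, 4);
	memcpy(sig + 4, &ecx, 4);
	memcpy(sig + 8, &edx, 4);
	printf("hypervisor signature: %.12s (max leaf 0x%x)\n", sig, eax);
	return 0;
}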
+/* Checks if hypervisor supports x2apic without VT-D interrupt remapping. */
+static bool __init vmware_legacy_x2apic_available(void)
+{
+ u32 eax;
+
+ eax = vmware_hypercall1(VMWARE_CMD_GETVCPU_INFO, 0);
+ return !(eax & GETVCPU_INFO_VCPU_RESERVED) &&
+ (eax & GETVCPU_INFO_LEGACY_X2APIC);
+}
+
+#ifdef CONFIG_INTEL_TDX_GUEST
/*
- * VMware hypervisor takes care of exporting a reliable TSC to the guest.
- * Still, due to timing difference when running on virtual cpus, the TSC can
- * be marked as unstable in some cases. For example, the TSC sync check at
- * bootup can fail due to a marginal offset between vcpus' TSCs (though the
- * TSCs do not drift from each other). Also, the ACPI PM timer clocksource
- * is not suitable as a watchdog when running on a hypervisor because the
- * kernel may miss a wrap of the counter if the vcpu is descheduled for a
- * long time. To skip these checks at runtime we set these capability bits,
- * so that the kernel could just trust the hypervisor with providing a
- * reliable virtual TSC that is suitable for timekeeping.
+ * TDCALL[TDG.VP.VMCALL] uses %rax (arg0) and %rcx (arg2). Therefore,
+ * we remap those registers to %r12 and %r13, respectively.
*/
-static void __cpuinit vmware_set_cpu_features(struct cpuinfo_x86 *c)
+unsigned long vmware_tdx_hypercall(unsigned long cmd,
+ unsigned long in1, unsigned long in3,
+ unsigned long in4, unsigned long in5,
+ u32 *out1, u32 *out2, u32 *out3,
+ u32 *out4, u32 *out5)
+{
+ struct tdx_module_args args = {};
+
+ if (!hypervisor_is_type(X86_HYPER_VMWARE)) {
+ pr_warn_once("Incorrect usage\n");
+ return ULONG_MAX;
+ }
+
+ if (cmd & ~VMWARE_CMD_MASK) {
+ pr_warn_once("Out of range command %lx\n", cmd);
+ return ULONG_MAX;
+ }
+
+ args.rbx = in1;
+ args.rdx = in3;
+ args.rsi = in4;
+ args.rdi = in5;
+ args.r10 = VMWARE_TDX_VENDOR_LEAF;
+ args.r11 = VMWARE_TDX_HCALL_FUNC;
+ args.r12 = VMWARE_HYPERVISOR_MAGIC;
+ args.r13 = cmd;
+ /* CPL */
+ args.r15 = 0;
+
+ __tdx_hypercall(&args);
+
+ if (out1)
+ *out1 = args.rbx;
+ if (out2)
+ *out2 = args.r13;
+ if (out3)
+ *out3 = args.rdx;
+ if (out4)
+ *out4 = args.rsi;
+ if (out5)
+ *out5 = args.rdi;
+
+ return args.r12;
+}
+EXPORT_SYMBOL_GPL(vmware_tdx_hypercall);
+#endif
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+static void vmware_sev_es_hcall_prepare(struct ghcb *ghcb,
+ struct pt_regs *regs)
{
- set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
- set_cpu_cap(c, X86_FEATURE_TSC_RELIABLE);
+	/* Copy VMware-specific hypercall parameters to the GHCB */
+ ghcb_set_rip(ghcb, regs->ip);
+ ghcb_set_rbx(ghcb, regs->bx);
+ ghcb_set_rcx(ghcb, regs->cx);
+ ghcb_set_rdx(ghcb, regs->dx);
+ ghcb_set_rsi(ghcb, regs->si);
+ ghcb_set_rdi(ghcb, regs->di);
+ ghcb_set_rbp(ghcb, regs->bp);
+}
+
+static bool vmware_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
+{
+ if (!(ghcb_rbx_is_valid(ghcb) &&
+ ghcb_rcx_is_valid(ghcb) &&
+ ghcb_rdx_is_valid(ghcb) &&
+ ghcb_rsi_is_valid(ghcb) &&
+ ghcb_rdi_is_valid(ghcb) &&
+ ghcb_rbp_is_valid(ghcb)))
+ return false;
+
+ regs->bx = ghcb_get_rbx(ghcb);
+ regs->cx = ghcb_get_rcx(ghcb);
+ regs->dx = ghcb_get_rdx(ghcb);
+ regs->si = ghcb_get_rsi(ghcb);
+ regs->di = ghcb_get_rdi(ghcb);
+ regs->bp = ghcb_get_rbp(ghcb);
+
+ return true;
}
+#endif
-const __refconst struct hypervisor_x86 x86_hyper_vmware = {
- .name = "VMware",
- .detect = vmware_platform,
- .set_cpu_features = vmware_set_cpu_features,
- .init_platform = vmware_platform_setup,
+const __initconst struct hypervisor_x86 x86_hyper_vmware = {
+ .name = "VMware",
+ .detect = vmware_platform,
+ .type = X86_HYPER_VMWARE,
+ .init.init_platform = vmware_platform_setup,
+ .init.x2apic_available = vmware_legacy_x2apic_available,
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+ .runtime.sev_es_hcall_prepare = vmware_sev_es_hcall_prepare,
+ .runtime.sev_es_hcall_finish = vmware_sev_es_hcall_finish,
+#endif
};
-EXPORT_SYMBOL(x86_hyper_vmware);
diff --git a/arch/x86/kernel/cpu/vortex.c b/arch/x86/kernel/cpu/vortex.c
new file mode 100644
index 000000000000..e2685470ba94
--- /dev/null
+++ b/arch/x86/kernel/cpu/vortex.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <asm/processor.h>
+#include "cpu.h"
+
+/*
+ * No special init required for Vortex processors.
+ */
+
+static const struct cpu_dev vortex_cpu_dev = {
+ .c_vendor = "Vortex",
+ .c_ident = { "Vortex86 SoC" },
+ .legacy_models = {
+ {
+ .family = 5,
+ .model_names = {
+ [2] = "Vortex86DX",
+ [8] = "Vortex86MX",
+ },
+ },
+ {
+ .family = 6,
+ .model_names = {
+ /*
+ * Both the Vortex86EX and the Vortex86EX2
+ * have the same family and model id.
+ *
+ * However, the -EX2 supports the product name
+ * CPUID call, so this name will only be used
+ * for the -EX, which does not.
+ */
+ [0] = "Vortex86EX",
+ },
+ },
+ },
+ .c_x86_vendor = X86_VENDOR_VORTEX,
+};
+
+cpu_dev_register(vortex_cpu_dev);
diff --git a/arch/x86/kernel/cpu/zhaoxin.c b/arch/x86/kernel/cpu/zhaoxin.c
new file mode 100644
index 000000000000..89b1c8a70fe8
--- /dev/null
+++ b/arch/x86/kernel/cpu/zhaoxin.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/sched.h>
+#include <linux/sched/clock.h>
+
+#include <asm/cpu.h>
+#include <asm/cpufeature.h>
+#include <asm/msr.h>
+
+#include "cpu.h"
+
+#define MSR_ZHAOXIN_FCR57 0x00001257
+
+#define ACE_PRESENT (1 << 6)
+#define ACE_ENABLED (1 << 7)
+#define ACE_FCR (1 << 7) /* MSR_ZHAOXIN_FCR */
+
+#define RNG_PRESENT (1 << 2)
+#define RNG_ENABLED (1 << 3)
+#define RNG_ENABLE (1 << 8) /* MSR_ZHAOXIN_RNG */
+
+static void init_zhaoxin_cap(struct cpuinfo_x86 *c)
+{
+ u32 lo, hi;
+
+ /* Test for Extended Feature Flags presence */
+ if (cpuid_eax(0xC0000000) >= 0xC0000001) {
+ u32 tmp = cpuid_edx(0xC0000001);
+
+ /* Enable ACE unit, if present and disabled */
+ if ((tmp & (ACE_PRESENT | ACE_ENABLED)) == ACE_PRESENT) {
+ rdmsr(MSR_ZHAOXIN_FCR57, lo, hi);
+ /* Enable ACE unit */
+ lo |= ACE_FCR;
+ wrmsr(MSR_ZHAOXIN_FCR57, lo, hi);
+ pr_info("CPU: Enabled ACE h/w crypto\n");
+ }
+
+ /* Enable RNG unit, if present and disabled */
+ if ((tmp & (RNG_PRESENT | RNG_ENABLED)) == RNG_PRESENT) {
+ rdmsr(MSR_ZHAOXIN_FCR57, lo, hi);
+ /* Enable RNG unit */
+ lo |= RNG_ENABLE;
+ wrmsr(MSR_ZHAOXIN_FCR57, lo, hi);
+ pr_info("CPU: Enabled h/w RNG\n");
+ }
+
+ /*
+ * Store Extended Feature Flags as word 5 of the CPU
+ * capability bit array
+ */
+ c->x86_capability[CPUID_C000_0001_EDX] = cpuid_edx(0xC0000001);
+ }
+
+ if (c->x86 >= 0x6)
+ set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+}
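
The test `(tmp & (ACE_PRESENT | ACE_ENABLED)) == ACE_PRESENT` above is the usual "present but not yet enabled" idiom: mask out both bits and require exactly the present bit. A tiny illustration:

#include <stdio.h>

#define ACE_PRESENT	(1 << 6)
#define ACE_ENABLED	(1 << 7)

/* True only when the unit is advertised but not yet switched on */
static int needs_enabling(unsigned int flags)
{
	return (flags & (ACE_PRESENT | ACE_ENABLED)) == ACE_PRESENT;
}

int main(void)
{
	printf("%d\n", needs_enabling(ACE_PRESENT));			/* 1 */
	printf("%d\n", needs_enabling(ACE_PRESENT | ACE_ENABLED));	/* 0 */
	printf("%d\n", needs_enabling(0));				/* 0 */
	return 0;
}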
+
+static void early_init_zhaoxin(struct cpuinfo_x86 *c)
+{
+ if (c->x86 >= 0x6)
+ set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
+#ifdef CONFIG_X86_64
+ set_cpu_cap(c, X86_FEATURE_SYSENTER32);
+#endif
+ if (c->x86_power & (1 << 8)) {
+ set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
+ set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
+ }
+}
+
+static void init_zhaoxin(struct cpuinfo_x86 *c)
+{
+ early_init_zhaoxin(c);
+ init_intel_cacheinfo(c);
+
+ if (c->cpuid_level > 9) {
+ unsigned int eax = cpuid_eax(10);
+
+ /*
+		 * Check the version and the number of counters:
+		 * Version (eax[7:0]) must not be 0;
+		 * Counters (eax[15:8]) must be greater than 1.
+ */
+ if ((eax & 0xff) && (((eax >> 8) & 0xff) > 1))
+ set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
+ }
+
+ if (c->x86 >= 0x6)
+ init_zhaoxin_cap(c);
+#ifdef CONFIG_X86_64
+ set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
+#endif
+
+ init_ia32_feat_ctl(c);
+}
+
+#ifdef CONFIG_X86_32
+static unsigned int
+zhaoxin_size_cache(struct cpuinfo_x86 *c, unsigned int size)
+{
+ return size;
+}
+#endif
+
+static const struct cpu_dev zhaoxin_cpu_dev = {
+ .c_vendor = "zhaoxin",
+ .c_ident = { " Shanghai " },
+ .c_early_init = early_init_zhaoxin,
+ .c_init = init_zhaoxin,
+#ifdef CONFIG_X86_32
+ .legacy_cache_size = zhaoxin_size_cache,
+#endif
+ .c_x86_vendor = X86_VENDOR_ZHAOXIN,
+};
+
+cpu_dev_register(zhaoxin_cpu_dev);
diff --git a/arch/x86/kernel/cpuid.c b/arch/x86/kernel/cpuid.c
index 1b7b31ab7d86..dae436253de4 100644
--- a/arch/x86/kernel/cpuid.c
+++ b/arch/x86/kernel/cpuid.c
@@ -1,13 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/* ----------------------------------------------------------------------- *
*
* Copyright 2000-2008 H. Peter Anvin - All Rights Reserved
*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge MA 02139,
- * USA; either version 2 of the License, or (at your option) any later
- * version; incorporated herein by reference.
- *
* ----------------------------------------------------------------------- */
/*
@@ -33,7 +28,6 @@
#include <linux/init.h>
#include <linux/poll.h>
#include <linux/smp.h>
-#include <linux/smp_lock.h>
#include <linux/major.h>
#include <linux/fs.h>
#include <linux/device.h>
@@ -41,53 +35,35 @@
#include <linux/notifier.h>
#include <linux/uaccess.h>
#include <linux/gfp.h>
+#include <linux/completion.h>
#include <asm/processor.h>
#include <asm/msr.h>
-#include <asm/system.h>
-static struct class *cpuid_class;
+static enum cpuhp_state cpuhp_cpuid_state;
-struct cpuid_regs {
- u32 eax, ebx, ecx, edx;
+struct cpuid_regs_done {
+ struct cpuid_regs regs;
+ struct completion done;
};
static void cpuid_smp_cpuid(void *cmd_block)
{
- struct cpuid_regs *cmd = (struct cpuid_regs *)cmd_block;
+ struct cpuid_regs_done *cmd = cmd_block;
- cpuid_count(cmd->eax, cmd->ecx,
- &cmd->eax, &cmd->ebx, &cmd->ecx, &cmd->edx);
-}
+ cpuid_count(cmd->regs.eax, cmd->regs.ecx,
+ &cmd->regs.eax, &cmd->regs.ebx,
+ &cmd->regs.ecx, &cmd->regs.edx);
-static loff_t cpuid_seek(struct file *file, loff_t offset, int orig)
-{
- loff_t ret;
- struct inode *inode = file->f_mapping->host;
-
- mutex_lock(&inode->i_mutex);
- switch (orig) {
- case 0:
- file->f_pos = offset;
- ret = file->f_pos;
- break;
- case 1:
- file->f_pos += offset;
- ret = file->f_pos;
- break;
- default:
- ret = -EINVAL;
- }
- mutex_unlock(&inode->i_mutex);
- return ret;
+ complete(&cmd->done);
}
static ssize_t cpuid_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
char __user *tmp = buf;
- struct cpuid_regs cmd;
- int cpu = iminor(file->f_path.dentry->d_inode);
+ struct cpuid_regs_done cmd;
+ int cpu = iminor(file_inode(file));
u64 pos = *ppos;
ssize_t bytes = 0;
int err = 0;
@@ -95,19 +71,27 @@ static ssize_t cpuid_read(struct file *file, char __user *buf,
if (count % 16)
return -EINVAL; /* Invalid chunk size */
+ init_completion(&cmd.done);
for (; count; count -= 16) {
- cmd.eax = pos;
- cmd.ecx = pos >> 32;
- err = smp_call_function_single(cpu, cpuid_smp_cpuid, &cmd, 1);
+ call_single_data_t csd;
+
+ INIT_CSD(&csd, cpuid_smp_cpuid, &cmd);
+
+ cmd.regs.eax = pos;
+ cmd.regs.ecx = pos >> 32;
+
+ err = smp_call_function_single_async(cpu, &csd);
if (err)
break;
- if (copy_to_user(tmp, &cmd, 16)) {
+ wait_for_completion(&cmd.done);
+ if (copy_to_user(tmp, &cmd.regs, 16)) {
err = -EFAULT;
break;
}
tmp += 16;
bytes += 16;
*ppos = ++pos;
+ reinit_completion(&cmd.done);
}
return bytes ? bytes : err;
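
From userspace, the driver's encoding means the file offset selects the query: the low 32 bits are the leaf (EAX), the high 32 bits the subleaf (ECX), and every read returns 16-byte EAX/EBX/ECX/EDX records. A hedged sketch (assumes the cpuid driver is loaded and the node is readable):

#define _FILE_OFFSET_BITS 64	/* 64-bit offsets even on 32-bit builds */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	uint32_t regs[4];
	uint32_t leaf = 0, subleaf = 0;
	off_t pos = ((off_t)subleaf << 32) | leaf;
	int fd = open("/dev/cpu/0/cpuid", O_RDONLY);

	if (fd < 0 || pread(fd, regs, sizeof(regs), pos) != sizeof(regs)) {
		perror("cpuid");
		return 1;
	}
	printf("eax=%08x ebx=%08x ecx=%08x edx=%08x\n",
	       regs[0], regs[1], regs[2], regs[3]);
	close(fd);
	return 0;
}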
@@ -118,7 +102,7 @@ static int cpuid_open(struct inode *inode, struct file *file)
unsigned int cpu;
struct cpuinfo_x86 *c;
- cpu = iminor(file->f_path.dentry->d_inode);
+ cpu = iminor(file_inode(file));
if (cpu >= nr_cpu_ids || !cpu_online(cpu))
return -ENXIO; /* No such CPU */
@@ -134,107 +118,72 @@ static int cpuid_open(struct inode *inode, struct file *file)
*/
static const struct file_operations cpuid_fops = {
.owner = THIS_MODULE,
- .llseek = cpuid_seek,
+ .llseek = no_seek_end_llseek,
.read = cpuid_read,
.open = cpuid_open,
};
-static __cpuinit int cpuid_device_create(int cpu)
+static char *cpuid_devnode(const struct device *dev, umode_t *mode)
{
- struct device *dev;
-
- dev = device_create(cpuid_class, NULL, MKDEV(CPUID_MAJOR, cpu), NULL,
- "cpu%d", cpu);
- return IS_ERR(dev) ? PTR_ERR(dev) : 0;
+ return kasprintf(GFP_KERNEL, "cpu/%u/cpuid", MINOR(dev->devt));
}
-static void cpuid_device_destroy(int cpu)
-{
- device_destroy(cpuid_class, MKDEV(CPUID_MAJOR, cpu));
-}
+static const struct class cpuid_class = {
+ .name = "cpuid",
+ .devnode = cpuid_devnode,
+};
-static int __cpuinit cpuid_class_cpu_callback(struct notifier_block *nfb,
- unsigned long action,
- void *hcpu)
+static int cpuid_device_create(unsigned int cpu)
{
- unsigned int cpu = (unsigned long)hcpu;
- int err = 0;
+ struct device *dev;
- switch (action) {
- case CPU_UP_PREPARE:
- err = cpuid_device_create(cpu);
- break;
- case CPU_UP_CANCELED:
- case CPU_UP_CANCELED_FROZEN:
- case CPU_DEAD:
- cpuid_device_destroy(cpu);
- break;
- }
- return notifier_from_errno(err);
+ dev = device_create(&cpuid_class, NULL, MKDEV(CPUID_MAJOR, cpu), NULL,
+ "cpu%d", cpu);
+ return PTR_ERR_OR_ZERO(dev);
}
-static struct notifier_block __refdata cpuid_class_cpu_notifier =
-{
- .notifier_call = cpuid_class_cpu_callback,
-};
-
-static char *cpuid_devnode(struct device *dev, mode_t *mode)
+static int cpuid_device_destroy(unsigned int cpu)
{
- return kasprintf(GFP_KERNEL, "cpu/%u/cpuid", MINOR(dev->devt));
+ device_destroy(&cpuid_class, MKDEV(CPUID_MAJOR, cpu));
+ return 0;
}
static int __init cpuid_init(void)
{
- int i, err = 0;
- i = 0;
+ int err;
if (__register_chrdev(CPUID_MAJOR, 0, NR_CPUS,
"cpu/cpuid", &cpuid_fops)) {
printk(KERN_ERR "cpuid: unable to get major %d for cpuid\n",
CPUID_MAJOR);
- err = -EBUSY;
- goto out;
+ return -EBUSY;
}
- cpuid_class = class_create(THIS_MODULE, "cpuid");
- if (IS_ERR(cpuid_class)) {
- err = PTR_ERR(cpuid_class);
+ err = class_register(&cpuid_class);
+ if (err)
goto out_chrdev;
- }
- cpuid_class->devnode = cpuid_devnode;
- for_each_online_cpu(i) {
- err = cpuid_device_create(i);
- if (err != 0)
- goto out_class;
- }
- register_hotcpu_notifier(&cpuid_class_cpu_notifier);
- err = 0;
- goto out;
+ err = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/cpuid:online",
+ cpuid_device_create, cpuid_device_destroy);
+ if (err < 0)
+ goto out_class;
+
+ cpuhp_cpuid_state = err;
+ return 0;
out_class:
- i = 0;
- for_each_online_cpu(i) {
- cpuid_device_destroy(i);
- }
- class_destroy(cpuid_class);
+ class_unregister(&cpuid_class);
out_chrdev:
__unregister_chrdev(CPUID_MAJOR, 0, NR_CPUS, "cpu/cpuid");
-out:
return err;
}
+module_init(cpuid_init);
static void __exit cpuid_exit(void)
{
- int cpu = 0;
-
- for_each_online_cpu(cpu)
- cpuid_device_destroy(cpu);
- class_destroy(cpuid_class);
+ cpuhp_remove_state(cpuhp_cpuid_state);
+ class_unregister(&cpuid_class);
__unregister_chrdev(CPUID_MAJOR, 0, NR_CPUS, "cpu/cpuid");
- unregister_hotcpu_notifier(&cpuid_class_cpu_notifier);
}
-
-module_init(cpuid_init);
module_exit(cpuid_exit);
MODULE_AUTHOR("H. Peter Anvin <hpa@zytor.com>");
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index ebd4c51d096a..335fd2ee9766 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -1,13 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Architecture specific (i386/x86_64) functions for kexec based crash dumps.
*
* Created by: Hariprasad Nellitheertha (hari@in.ibm.com)
*
* Copyright (C) IBM Corporation, 2004. All rights reserved.
+ * Copyright (C) Red Hat Inc., 2014. All rights reserved.
+ * Authors:
+ * Vivek Goyal <vgoyal@redhat.com>
*
*/
-#include <linux/init.h>
+#define pr_fmt(fmt) "kexec: " fmt
+
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/smp.h>
@@ -16,58 +21,76 @@
#include <linux/delay.h>
#include <linux/elf.h>
#include <linux/elfcore.h>
+#include <linux/export.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/memblock.h>
+#include <asm/bootparam.h>
#include <asm/processor.h>
#include <asm/hardirq.h>
#include <asm/nmi.h>
#include <asm/hw_irq.h>
#include <asm/apic.h>
+#include <asm/e820/types.h>
+#include <asm/io_apic.h>
#include <asm/hpet.h>
#include <linux/kdebug.h>
#include <asm/cpu.h>
#include <asm/reboot.h>
-#include <asm/virtext.h>
+#include <asm/intel_pt.h>
+#include <asm/crash.h>
+#include <asm/cmdline.h>
+#include <asm/sev.h>
+
+/* Used while preparing memory map entries for the second kernel */
+struct crash_memmap_data {
+ struct boot_params *params;
+ /* Type of memory */
+ unsigned int type;
+};
#if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
-static void kdump_nmi_callback(int cpu, struct die_args *args)
+static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
{
- struct pt_regs *regs;
-#ifdef CONFIG_X86_32
- struct pt_regs fixed_regs;
-#endif
-
- regs = args->regs;
-
-#ifdef CONFIG_X86_32
- if (!user_mode_vm(regs)) {
- crash_fixup_ss_esp(&fixed_regs, regs);
- regs = &fixed_regs;
- }
-#endif
crash_save_cpu(regs, cpu);
- /* Disable VMX or SVM if needed.
- *
- * We need to disable virtualization on all CPUs.
- * Having VMX or SVM enabled on any CPU may break rebooting
- * after the kdump kernel has finished its task.
+ /*
+ * Disable Intel PT to stop its logging
*/
- cpu_emergency_vmxoff();
- cpu_emergency_svm_disable();
+ cpu_emergency_stop_pt();
+
+ kdump_sev_callback();
disable_local_APIC();
}
-static void kdump_nmi_shootdown_cpus(void)
+void kdump_nmi_shootdown_cpus(void)
{
nmi_shootdown_cpus(kdump_nmi_callback);
disable_local_APIC();
}
+/* Override the weak function in kernel/panic.c */
+void crash_smp_send_stop(void)
+{
+ static int cpus_stopped;
+
+ if (cpus_stopped)
+ return;
+
+ if (smp_ops.crash_stop_other_cpus)
+ smp_ops.crash_stop_other_cpus();
+ else
+ smp_send_stop();
+
+ cpus_stopped = 1;
+}
+
#else
-static void kdump_nmi_shootdown_cpus(void)
+void crash_smp_send_stop(void)
{
/* There are no cpus to shootdown */
}
@@ -86,21 +109,463 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
/* The kernel is broken so disable interrupts */
local_irq_disable();
- kdump_nmi_shootdown_cpus();
+ crash_smp_send_stop();
- /* Booting kdump kernel with VMX or SVM enabled won't work,
- * because (among other limitations) we can't disable paging
- * with the virt flags.
+ cpu_emergency_disable_virtualization();
+
+ /*
+ * Disable Intel PT to stop its logging
*/
- cpu_emergency_vmxoff();
- cpu_emergency_svm_disable();
+ cpu_emergency_stop_pt();
- lapic_shutdown();
-#if defined(CONFIG_X86_IO_APIC)
- disable_IO_APIC();
+#ifdef CONFIG_X86_IO_APIC
+ /* Prevent crash_kexec() from deadlocking on ioapic_lock. */
+ ioapic_zap_locks();
+ clear_IO_APIC();
#endif
+ lapic_shutdown();
+ restore_boot_irq_mode();
#ifdef CONFIG_HPET_TIMER
hpet_disable();
#endif
- crash_save_cpu(regs, safe_smp_processor_id());
+
+ /*
+ * Non-crash kexec calls enc_kexec_begin() while scheduling is still
+ * active. This allows the callback to wait until all in-flight
+ * shared<->private conversions are complete. In a crash scenario,
+ * enc_kexec_begin() gets called after all but one CPU have been shut
+ * down and interrupts have been disabled. This allows the callback to
+ * detect a race with the conversion and report it.
+ */
+ x86_platform.guest.enc_kexec_begin();
+ x86_platform.guest.enc_kexec_finish();
+
+ crash_save_cpu(regs, smp_processor_id());
+}
+
+#if defined(CONFIG_KEXEC_FILE) || defined(CONFIG_CRASH_HOTPLUG)
+static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
+{
+ unsigned int *nr_ranges = arg;
+
+ (*nr_ranges)++;
+ return 0;
+}
+
+/* Gather all the required information to prepare elf headers for ram regions */
+static struct crash_mem *fill_up_crash_elf_data(void)
+{
+ unsigned int nr_ranges = 0;
+ struct crash_mem *cmem;
+
+ walk_system_ram_res(0, -1, &nr_ranges, get_nr_ram_ranges_callback);
+ if (!nr_ranges)
+ return NULL;
+
+ /*
+ * Exclusion of crash region, crashk_low_res and/or crashk_cma_ranges
+ * may cause range splits. So add extra slots here.
+ *
+ * Excluding the low 1M should not cause another range split: a new
+ * region is split off only when both the start and the end of the
+ * excluded range fall strictly inside an existing region in cmem,
+ * without touching that region's start or end, and the excluded
+ * range here is [0, 1M], whose start of 0 can never meet that
+ * condition.
+ *
+ * But in case the low 1M range changes in the future (e.g. becomes
+ * [start, 1M]), add an extra slot anyway.
+ */
+ nr_ranges += 3 + crashk_cma_cnt;
+ cmem = vzalloc(struct_size(cmem, ranges, nr_ranges));
+ if (!cmem)
+ return NULL;
+
+ cmem->max_nr_ranges = nr_ranges;
+
+ return cmem;
+}
+
+/*
+ * Look for any unwanted ranges between mstart and mend and remove them. This
+ * might split existing ranges; the resulting split ranges are put in the
+ * cmem->ranges[] array.
+ */
+static int elf_header_exclude_ranges(struct crash_mem *cmem)
+{
+ int ret = 0;
+ int i;
+
+ /* Exclude the low 1M because it is always reserved */
+ ret = crash_exclude_mem_range(cmem, 0, SZ_1M - 1);
+ if (ret)
+ return ret;
+
+ /* Exclude crashkernel region */
+ ret = crash_exclude_mem_range(cmem, crashk_res.start, crashk_res.end);
+ if (ret)
+ return ret;
+
+ if (crashk_low_res.end)
+ ret = crash_exclude_mem_range(cmem, crashk_low_res.start,
+ crashk_low_res.end);
+ if (ret)
+ return ret;
+
+ for (i = 0; i < crashk_cma_cnt; ++i) {
+ ret = crash_exclude_mem_range(cmem, crashk_cma_ranges[i].start,
+ crashk_cma_ranges[i].end);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
+{
+ struct crash_mem *cmem = arg;
+
+ cmem->ranges[cmem->nr_ranges].start = res->start;
+ cmem->ranges[cmem->nr_ranges].end = res->end;
+ cmem->nr_ranges++;
+
+ return 0;
+}
+
+/* Prepare elf headers. Return addr and size */
+static int prepare_elf_headers(void **addr, unsigned long *sz,
+ unsigned long *nr_mem_ranges)
+{
+ struct crash_mem *cmem;
+ int ret;
+
+ cmem = fill_up_crash_elf_data();
+ if (!cmem)
+ return -ENOMEM;
+
+ ret = walk_system_ram_res(0, -1, cmem, prepare_elf64_ram_headers_callback);
+ if (ret)
+ goto out;
+
+ /* Exclude unwanted mem ranges */
+ ret = elf_header_exclude_ranges(cmem);
+ if (ret)
+ goto out;
+
+ /* Return the computed number of memory ranges, for hotplug usage */
+ *nr_mem_ranges = cmem->nr_ranges;
+
+ /* By default prepare 64bit headers */
+ ret = crash_prepare_elf64_headers(cmem, IS_ENABLED(CONFIG_X86_64), addr, sz);
+
+out:
+ vfree(cmem);
+ return ret;
+}
+#endif
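
The extra-slot arithmetic in fill_up_crash_elf_data() is easier to see with a toy model of crash_exclude_mem_range(): punching a hole strictly inside an existing region yields two regions, so every exclusion can grow the array by one entry. A simplified userspace sketch (it assumes the hole lies within the region; the real helper also handles partial overlaps):

#include <stdio.h>

struct range { unsigned long start, end; };

/*
 * Exclude [estart, eend] from the single range r, writing the surviving
 * pieces to out[]. A hole strictly inside r produces two pieces -- the
 * reason the kernel reserves an extra slot per possible exclusion.
 */
static int exclude(struct range r, unsigned long estart, unsigned long eend,
		   struct range *out)
{
	int n = 0;

	if (estart > r.start)
		out[n++] = (struct range){ r.start, estart - 1 };
	if (eend < r.end)
		out[n++] = (struct range){ eend + 1, r.end };
	return n;
}

int main(void)
{
	struct range out[2];
	int n = exclude((struct range){ 0, 0xffff }, 0x4000, 0x7fff, out);

	for (int i = 0; i < n; i++)	/* prints the two split ranges */
		printf("[%#lx-%#lx]\n", out[i].start, out[i].end);
	return 0;
}
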
+
+#ifdef CONFIG_KEXEC_FILE
+static int add_e820_entry(struct boot_params *params, struct e820_entry *entry)
+{
+ unsigned int nr_e820_entries;
+
+ nr_e820_entries = params->e820_entries;
+ if (nr_e820_entries >= E820_MAX_ENTRIES_ZEROPAGE)
+ return 1;
+
+ memcpy(&params->e820_table[nr_e820_entries], entry, sizeof(struct e820_entry));
+ params->e820_entries++;
+ return 0;
}
+
+static int memmap_entry_callback(struct resource *res, void *arg)
+{
+ struct crash_memmap_data *cmd = arg;
+ struct boot_params *params = cmd->params;
+ struct e820_entry ei;
+
+ ei.addr = res->start;
+ ei.size = resource_size(res);
+ ei.type = cmd->type;
+ add_e820_entry(params, &ei);
+
+ return 0;
+}
+
+static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem,
+ unsigned long long mstart,
+ unsigned long long mend)
+{
+ unsigned long start, end;
+ int ret;
+
+ cmem->ranges[0].start = mstart;
+ cmem->ranges[0].end = mend;
+ cmem->nr_ranges = 1;
+
+ /* Exclude elf header region */
+ start = image->elf_load_addr;
+ end = start + image->elf_headers_sz - 1;
+ ret = crash_exclude_mem_range(cmem, start, end);
+
+ if (ret)
+ return ret;
+
+ /* Exclude dm crypt keys region */
+ if (image->dm_crypt_keys_addr) {
+ start = image->dm_crypt_keys_addr;
+ end = start + image->dm_crypt_keys_sz - 1;
+ return crash_exclude_mem_range(cmem, start, end);
+ }
+
+ return ret;
+}
+
+/* Prepare memory map for crash dump kernel */
+int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
+{
+ unsigned int nr_ranges = 0;
+ int i, ret = 0;
+ unsigned long flags;
+ struct e820_entry ei;
+ struct crash_memmap_data cmd;
+ struct crash_mem *cmem;
+
+ /*
+ * In the current x86 code, the elf header is always allocated at
+ * crashk_res.start, but the resulting ranges depend on where within
+ * crashk_res it ends up. To avoid a potential out-of-bounds access
+ * in the future, add an extra slot.
+ *
+ * Using an arbitrarily placed kexec_buf for passing dm-crypt keys may
+ * cause a range split too, so add another extra slot here.
+ */
+ nr_ranges = 3;
+ cmem = vzalloc(struct_size(cmem, ranges, nr_ranges));
+ if (!cmem)
+ return -ENOMEM;
+
+ cmem->max_nr_ranges = nr_ranges;
+
+ memset(&cmd, 0, sizeof(struct crash_memmap_data));
+ cmd.params = params;
+
+ /* Add the low 1M */
+ cmd.type = E820_TYPE_RAM;
+ flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+ walk_iomem_res_desc(IORES_DESC_NONE, flags, 0, (1<<20)-1, &cmd,
+ memmap_entry_callback);
+
+ /* Add ACPI tables */
+ cmd.type = E820_TYPE_ACPI;
+ flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+ walk_iomem_res_desc(IORES_DESC_ACPI_TABLES, flags, 0, -1, &cmd,
+ memmap_entry_callback);
+
+ /* Add ACPI Non-volatile Storage */
+ cmd.type = E820_TYPE_NVS;
+ walk_iomem_res_desc(IORES_DESC_ACPI_NV_STORAGE, flags, 0, -1, &cmd,
+ memmap_entry_callback);
+
+ /* Add e820 reserved ranges */
+ cmd.type = E820_TYPE_RESERVED;
+ flags = IORESOURCE_MEM;
+ walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd,
+ memmap_entry_callback);
+
+ /* Add crashk_low_res region */
+ if (crashk_low_res.end) {
+ ei.addr = crashk_low_res.start;
+ ei.size = resource_size(&crashk_low_res);
+ ei.type = E820_TYPE_RAM;
+ add_e820_entry(params, &ei);
+ }
+
+ /* Exclude some ranges from crashk_res and add rest to memmap */
+ ret = memmap_exclude_ranges(image, cmem, crashk_res.start, crashk_res.end);
+ if (ret)
+ goto out;
+
+ for (i = 0; i < cmem->nr_ranges; i++) {
+ ei.size = cmem->ranges[i].end - cmem->ranges[i].start + 1;
+
+ /* If entry is less than a page, skip it */
+ if (ei.size < PAGE_SIZE)
+ continue;
+ ei.addr = cmem->ranges[i].start;
+ ei.type = E820_TYPE_RAM;
+ add_e820_entry(params, &ei);
+ }
+
+ for (i = 0; i < crashk_cma_cnt; ++i) {
+ ei.addr = crashk_cma_ranges[i].start;
+ ei.size = crashk_cma_ranges[i].end -
+ crashk_cma_ranges[i].start + 1;
+ ei.type = E820_TYPE_RAM;
+ add_e820_entry(params, &ei);
+ }
+
+out:
+ vfree(cmem);
+ return ret;
+}
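
crash_setup_memmap_entries() is built around the resource-walker callback shape: a small context struct (here crash_memmap_data) carries the destination and the e820 type to stamp, and the walker calls back once per matching resource. A hedged sketch of that visitor pattern, with invented stand-ins for walk_iomem_res_desc():

#include <stdio.h>

struct res { unsigned long start, size; };

typedef int (*res_cb)(const struct res *r, void *arg);

struct memmap_ctx {
	unsigned int type;	/* tag stamped on each emitted entry */
};

/* plays the role of memmap_entry_callback() */
static int emit_entry(const struct res *r, void *arg)
{
	struct memmap_ctx *ctx = arg;

	printf("entry: start=%#lx size=%#lx type=%u\n",
	       r->start, r->size, ctx->type);
	return 0;	/* non-zero would abort the walk */
}

/* plays the role of walk_iomem_res_desc(): iterate, calling back */
static void walk(const struct res *table, int n, res_cb cb, void *arg)
{
	for (int i = 0; i < n; i++)
		if (cb(&table[i], arg))
			break;
}

int main(void)
{
	struct res ram[] = { { 0x0, 0x100000 }, { 0x100000, 0x7ff00000 } };
	struct memmap_ctx ctx = { .type = 1 };	/* pretend E820_TYPE_RAM */

	walk(ram, 2, emit_entry, &ctx);
	return 0;
}
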
+
+int crash_load_segments(struct kimage *image)
+{
+ int ret;
+ unsigned long pnum = 0;
+ struct kexec_buf kbuf = { .image = image, .buf_min = 0,
+ .buf_max = ULONG_MAX, .top_down = false };
+
+ /* Prepare elf headers and add a segment */
+ ret = prepare_elf_headers(&kbuf.buffer, &kbuf.bufsz, &pnum);
+ if (ret)
+ return ret;
+
+ image->elf_headers = kbuf.buffer;
+ image->elf_headers_sz = kbuf.bufsz;
+ kbuf.memsz = kbuf.bufsz;
+
+#ifdef CONFIG_CRASH_HOTPLUG
+ /*
+ * The elfcorehdr segment size accounts for VMCOREINFO, kernel_map,
+ * maximum CPUs and maximum memory ranges.
+ */
+ if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
+ pnum = 2 + CONFIG_NR_CPUS_DEFAULT + CONFIG_CRASH_MAX_MEMORY_RANGES;
+ else
+ pnum += 2 + CONFIG_NR_CPUS_DEFAULT;
+
+ if (pnum < (unsigned long)PN_XNUM) {
+ kbuf.memsz = pnum * sizeof(Elf64_Phdr);
+ kbuf.memsz += sizeof(Elf64_Ehdr);
+
+ image->elfcorehdr_index = image->nr_segments;
+
+ /* Mark as usable by the crash kernel, else the crash kernel fails to boot */
+ image->elf_headers_sz = kbuf.memsz;
+ } else {
+ pr_err("number of Phdrs %lu exceeds max\n", pnum);
+ }
+#endif
+
+ kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
+ kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
+ ret = kexec_add_buffer(&kbuf);
+ if (ret)
+ return ret;
+ image->elf_load_addr = kbuf.mem;
+ kexec_dprintk("Loaded ELF headers at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ image->elf_load_addr, kbuf.bufsz, kbuf.memsz);
+
+ return ret;
+}
+#endif /* CONFIG_KEXEC_FILE */
+
+#ifdef CONFIG_CRASH_HOTPLUG
+
+#undef pr_fmt
+#define pr_fmt(fmt) "crash hp: " fmt
+
+int arch_crash_hotplug_support(struct kimage *image, unsigned long kexec_flags)
+{
+#ifdef CONFIG_KEXEC_FILE
+ if (image->file_mode)
+ return 1;
+#endif
+ /*
+ * Initially, crash hotplug support for kexec_load was added
+ * with the KEXEC_UPDATE_ELFCOREHDR flag. Later, this
+ * functionality was expanded to accommodate multiple kexec
+ * segment updates, leading to the introduction of the
+ * KEXEC_CRASH_HOTPLUG_SUPPORT kexec flag bit. Consequently,
+ * when the kexec tool sends either of these flags, it indicates
+ * that the required kexec segment (elfcorehdr) is excluded from
+ * the SHA calculation.
+ */
+ return (kexec_flags & KEXEC_UPDATE_ELFCOREHDR ||
+ kexec_flags & KEXEC_CRASH_HOTPLUG_SUPPORT);
+}
+
+unsigned int arch_crash_get_elfcorehdr_size(void)
+{
+ unsigned int sz;
+
+ /* kernel_map, VMCOREINFO and maximum CPUs */
+ sz = 2 + CONFIG_NR_CPUS_DEFAULT;
+ if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
+ sz += CONFIG_CRASH_MAX_MEMORY_RANGES;
+ sz *= sizeof(Elf64_Phdr);
+ return sz;
+}
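
For a sense of scale: sizeof(Elf64_Phdr) is 56 bytes, so with illustrative (not upstream-default) values of CONFIG_NR_CPUS_DEFAULT=64 and CONFIG_CRASH_MAX_MEMORY_RANGES=8192, the function returns (2 + 64 + 8192) * 56 = 462448 bytes, a bit under half a megabyte of elfcorehdr headroom:

#include <elf.h>
#include <stdio.h>

int main(void)
{
	/* illustrative values, not taken from any particular .config */
	unsigned int nr_cpus = 64, max_mem_ranges = 8192;
	unsigned int sz = (2 + nr_cpus + max_mem_ranges) * sizeof(Elf64_Phdr);

	printf("elfcorehdr phdr area: %u bytes\n", sz);	/* 462448 */
	return 0;
}
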
+
+/**
+ * arch_crash_handle_hotplug_event() - Handle hotplug elfcorehdr changes
+ * @image: a pointer to kexec_crash_image
+ * @arg: struct memory_notify handler for memory hotplug case and
+ * NULL for CPU hotplug case.
+ *
+ * Prepare the new elfcorehdr and replace the existing elfcorehdr.
+ */
+void arch_crash_handle_hotplug_event(struct kimage *image, void *arg)
+{
+ void *elfbuf = NULL, *old_elfcorehdr;
+ unsigned long nr_mem_ranges;
+ unsigned long mem, memsz;
+ unsigned long elfsz = 0;
+
+ /*
+ * As crash_prepare_elf64_headers() has already described all
+ * possible CPUs, there is no need to update the elfcorehdr
+ * for additional CPU changes.
+ */
+ if ((image->file_mode || image->elfcorehdr_updated) &&
+ ((image->hp_action == KEXEC_CRASH_HP_ADD_CPU) ||
+ (image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU)))
+ return;
+
+ /*
+ * Create the new elfcorehdr reflecting the changes to CPU and/or
+ * memory resources.
+ */
+ if (prepare_elf_headers(&elfbuf, &elfsz, &nr_mem_ranges)) {
+ pr_err("unable to create new elfcorehdr");
+ goto out;
+ }
+
+ /*
+ * Obtain address and size of the elfcorehdr segment, and
+ * check it against the new elfcorehdr buffer.
+ */
+ mem = image->segment[image->elfcorehdr_index].mem;
+ memsz = image->segment[image->elfcorehdr_index].memsz;
+ if (elfsz > memsz) {
+ pr_err("update elfcorehdr elfsz %lu > memsz %lu",
+ elfsz, memsz);
+ goto out;
+ }
+
+ /*
+ * Copy new elfcorehdr over the old elfcorehdr at destination.
+ */
+ old_elfcorehdr = kmap_local_page(pfn_to_page(mem >> PAGE_SHIFT));
+ if (!old_elfcorehdr) {
+ pr_err("mapping elfcorehdr segment failed\n");
+ goto out;
+ }
+
+ /*
+ * Temporarily invalidate the crash image while the
+ * elfcorehdr is updated.
+ */
+ xchg(&kexec_crash_image, NULL);
+ memcpy_flushcache(old_elfcorehdr, elfbuf, elfsz);
+ xchg(&kexec_crash_image, image);
+ kunmap_local(old_elfcorehdr);
+ pr_debug("updated elfcorehdr\n");
+
+out:
+ vfree(elfbuf);
+}
+#endif
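
The xchg() pair in arch_crash_handle_hotplug_event() is a publish/unpublish dance: hide the globally visible image pointer so a crash landing mid-update finds no half-written elfcorehdr, rewrite the segment, then republish. A hedged userspace rendering of the same idea with C11 atomics (struct and names invented):

#include <stdatomic.h>
#include <string.h>

struct image { char hdr[64]; };

static _Atomic(struct image *) live_image;

/* update the header while hiding the image from concurrent consumers */
static void update_hdr(struct image *img, const char *new_hdr, size_t len)
{
	/* unpublish: a reader during the memcpy sees no image at all */
	atomic_exchange(&live_image, NULL);
	memcpy(img->hdr, new_hdr, len);
	/* republish the now-consistent image */
	atomic_exchange(&live_image, img);
}

int main(void)
{
	static struct image img;

	update_hdr(&img, "ELF", 3);
	return 0;
}
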
diff --git a/arch/x86/kernel/crash_dump_32.c b/arch/x86/kernel/crash_dump_32.c
index 67414550c3cc..5f4ae5476e19 100644
--- a/arch/x86/kernel/crash_dump_32.c
+++ b/arch/x86/kernel/crash_dump_32.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Memory preserving reboot related code.
*
@@ -9,13 +10,7 @@
#include <linux/errno.h>
#include <linux/highmem.h>
#include <linux/crash_dump.h>
-
-#include <asm/uaccess.h>
-
-static void *kdump_buf_page;
-
-/* Stores the physical address of elf header of crash image. */
-unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;
+#include <linux/uio.h>
static inline bool is_crashed_pfn_valid(unsigned long pfn)
{
@@ -33,25 +28,8 @@ static inline bool is_crashed_pfn_valid(unsigned long pfn)
#endif
}
-/**
- * copy_oldmem_page - copy one page from "oldmem"
- * @pfn: page frame number to be copied
- * @buf: target memory address for the copy; this can be in kernel address
- * space or user address space (see @userbuf)
- * @csize: number of bytes to copy
- * @offset: offset in bytes into the page (based on pfn) to begin the copy
- * @userbuf: if set, @buf is in user address space, use copy_to_user(),
- * otherwise @buf is in kernel address space, use memcpy().
- *
- * Copy a page from "oldmem". For this page, there is no pte mapped
- * in the current kernel. We stitch up a pte, similar to kmap_atomic.
- *
- * Calling copy_to_user() in atomic context is not desirable. Hence first
- * copying the data to a pre-allocated kernel page and then copying to user
- * space in non-atomic context.
- */
-ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
- size_t csize, unsigned long offset, int userbuf)
+ssize_t copy_oldmem_page(struct iov_iter *iter, unsigned long pfn, size_t csize,
+ unsigned long offset)
{
void *vaddr;
@@ -61,38 +39,9 @@ ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
if (!is_crashed_pfn_valid(pfn))
return -EFAULT;
- vaddr = kmap_atomic_pfn(pfn, KM_PTE0);
-
- if (!userbuf) {
- memcpy(buf, (vaddr + offset), csize);
- kunmap_atomic(vaddr, KM_PTE0);
- } else {
- if (!kdump_buf_page) {
- printk(KERN_WARNING "Kdump: Kdump buffer page not"
- " allocated\n");
- kunmap_atomic(vaddr, KM_PTE0);
- return -EFAULT;
- }
- copy_page(kdump_buf_page, vaddr);
- kunmap_atomic(vaddr, KM_PTE0);
- if (copy_to_user(buf, (kdump_buf_page + offset), csize))
- return -EFAULT;
- }
+ vaddr = kmap_local_pfn(pfn);
+ csize = copy_to_iter(vaddr + offset, csize, iter);
+ kunmap_local(vaddr);
return csize;
}
-
-static int __init kdump_buf_page_init(void)
-{
- int ret = 0;
-
- kdump_buf_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
- if (!kdump_buf_page) {
- printk(KERN_WARNING "Kdump: Failed to allocate kdump buffer"
- " page\n");
- ret = -ENOMEM;
- }
-
- return ret;
-}
-arch_initcall(kdump_buf_page_init);
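
The conversion above replaces copy_oldmem_page()'s user/kernel branching (and its bounce page) with one copy_to_iter() call: the iov_iter destination abstracts over where the bytes land. A rough plain-C analogue of that destination abstraction, with a function pointer standing in for the iterator (all names invented):

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* toy stand-in for iov_iter: one destination type, pluggable backends */
struct sink {
	size_t (*copy)(struct sink *s, const void *src, size_t n);
	void *dst;
};

static size_t copy_to_mem(struct sink *s, const void *src, size_t n)
{
	memcpy(s->dst, src, n);	/* "kernel buffer" backend */
	return n;
}

/* backend-agnostic caller, like the rewritten copy_oldmem_page() */
static size_t copy_page_out(struct sink *s, const void *page,
			    size_t off, size_t n)
{
	return s->copy(s, (const char *)page + off, n);
}

int main(void)
{
	char page[4096] = "oldmem contents", buf[32];
	struct sink s = { .copy = copy_to_mem, .dst = buf };

	copy_page_out(&s, page, 0, 16);
	printf("%s\n", buf);
	return 0;
}
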
diff --git a/arch/x86/kernel/crash_dump_64.c b/arch/x86/kernel/crash_dump_64.c
index 045b36cada65..32d710f7eb84 100644
--- a/arch/x86/kernel/crash_dump_64.c
+++ b/arch/x86/kernel/crash_dump_64.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Memory preserving reboot related code.
*
@@ -7,45 +8,57 @@
#include <linux/errno.h>
#include <linux/crash_dump.h>
-#include <linux/uaccess.h>
+#include <linux/uio.h>
#include <linux/io.h>
+#include <linux/cc_platform.h>
-/* Stores the physical address of elf header of crash image. */
-unsigned long long elfcorehdr_addr = ELFCORE_ADDR_MAX;
-
-/**
- * copy_oldmem_page - copy one page from "oldmem"
- * @pfn: page frame number to be copied
- * @buf: target memory address for the copy; this can be in kernel address
- * space or user address space (see @userbuf)
- * @csize: number of bytes to copy
- * @offset: offset in bytes into the page (based on pfn) to begin the copy
- * @userbuf: if set, @buf is in user address space, use copy_to_user(),
- * otherwise @buf is in kernel address space, use memcpy().
- *
- * Copy a page from "oldmem". For this page, there is no pte mapped
- * in the current kernel. We stitch up a pte, similar to kmap_atomic.
- */
-ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
- size_t csize, unsigned long offset, int userbuf)
+static ssize_t __copy_oldmem_page(struct iov_iter *iter, unsigned long pfn,
+ size_t csize, unsigned long offset,
+ bool encrypted)
{
void *vaddr;
if (!csize)
return 0;
- vaddr = ioremap(pfn << PAGE_SHIFT, PAGE_SIZE);
+ if (encrypted)
+ vaddr = (__force void *)ioremap_encrypted(pfn << PAGE_SHIFT, PAGE_SIZE);
+ else
+ vaddr = (__force void *)ioremap_cache(pfn << PAGE_SHIFT, PAGE_SIZE);
+
if (!vaddr)
return -ENOMEM;
- if (userbuf) {
- if (copy_to_user(buf, vaddr + offset, csize)) {
- iounmap(vaddr);
- return -EFAULT;
- }
- } else
- memcpy(buf, vaddr + offset, csize);
+ csize = copy_to_iter(vaddr + offset, csize, iter);
- iounmap(vaddr);
+ iounmap((void __iomem *)vaddr);
return csize;
}
+
+ssize_t copy_oldmem_page(struct iov_iter *iter, unsigned long pfn, size_t csize,
+ unsigned long offset)
+{
+ return __copy_oldmem_page(iter, pfn, csize, offset, false);
+}
+
+/*
+ * copy_oldmem_page_encrypted - same as copy_oldmem_page() above, but ioremaps
+ * the memory with the encryption mask set, to accommodate kdump on SME-enabled
+ * machines.
+ */
+ssize_t copy_oldmem_page_encrypted(struct iov_iter *iter, unsigned long pfn,
+ size_t csize, unsigned long offset)
+{
+ return __copy_oldmem_page(iter, pfn, csize, offset, true);
+}
+
+ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
+{
+ struct kvec kvec = { .iov_base = buf, .iov_len = count };
+ struct iov_iter iter;
+
+ iov_iter_kvec(&iter, ITER_DEST, &kvec, 1, count);
+
+ return read_from_oldmem(&iter, count, ppos,
+ cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT));
+}
diff --git a/arch/x86/kernel/devicetree.c b/arch/x86/kernel/devicetree.c
new file mode 100644
index 000000000000..dd8748c45529
--- /dev/null
+++ b/arch/x86/kernel/devicetree.c
@@ -0,0 +1,319 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Architecture specific OF callbacks.
+ */
+#include <linux/acpi.h>
+#include <linux/export.h>
+#include <linux/io.h>
+#include <linux/interrupt.h>
+#include <linux/list.h>
+#include <linux/of.h>
+#include <linux/of_fdt.h>
+#include <linux/of_address.h>
+#include <linux/of_platform.h>
+#include <linux/of_irq.h>
+#include <linux/libfdt.h>
+#include <linux/slab.h>
+#include <linux/pci.h>
+#include <linux/of_pci.h>
+#include <linux/initrd.h>
+
+#include <asm/irqdomain.h>
+#include <asm/hpet.h>
+#include <asm/apic.h>
+#include <asm/io_apic.h>
+#include <asm/pci_x86.h>
+#include <asm/setup.h>
+#include <asm/i8259.h>
+#include <asm/numa.h>
+#include <asm/prom.h>
+
+__initdata u64 initial_dtb;
+char __initdata cmd_line[COMMAND_LINE_SIZE];
+
+int __initdata of_ioapic;
+
+void __init add_dtb(u64 data)
+{
+ initial_dtb = data + offsetof(struct setup_data, data);
+}
+
+/*
+ * CE4100 ids. Will be moved to machine_device_initcall() once we have it.
+ */
+static struct of_device_id __initdata ce4100_ids[] = {
+ { .compatible = "intel,ce4100-cp", },
+ { .compatible = "isa", },
+ { .compatible = "pci", },
+ {},
+};
+
+static int __init add_bus_probe(void)
+{
+ if (!of_have_populated_dt())
+ return 0;
+
+ return of_platform_bus_probe(NULL, ce4100_ids, NULL);
+}
+device_initcall(add_bus_probe);
+
+#ifdef CONFIG_PCI
+struct device_node *pcibios_get_phb_of_node(struct pci_bus *bus)
+{
+ struct device_node *np;
+
+ for_each_node_by_type(np, "pci") {
+ const void *prop;
+ unsigned int bus_min;
+
+ prop = of_get_property(np, "bus-range", NULL);
+ if (!prop)
+ continue;
+ bus_min = be32_to_cpup(prop);
+ if (bus->number == bus_min)
+ return np;
+ }
+ return NULL;
+}
+
+static int x86_of_pci_irq_enable(struct pci_dev *dev)
+{
+ u32 virq;
+ int ret;
+ u8 pin;
+
+ ret = pci_read_config_byte(dev, PCI_INTERRUPT_PIN, &pin);
+ if (ret)
+ return pcibios_err_to_errno(ret);
+ if (!pin)
+ return 0;
+
+ virq = of_irq_parse_and_map_pci(dev, 0, 0);
+ if (virq == 0)
+ return -EINVAL;
+ dev->irq = virq;
+ return 0;
+}
+
+static void x86_of_pci_irq_disable(struct pci_dev *dev)
+{
+}
+
+void x86_of_pci_init(void)
+{
+ pcibios_enable_irq = x86_of_pci_irq_enable;
+ pcibios_disable_irq = x86_of_pci_irq_disable;
+}
+#endif
+
+static void __init dtb_setup_hpet(void)
+{
+#ifdef CONFIG_HPET_TIMER
+ struct device_node *dn;
+ struct resource r;
+ int ret;
+
+ dn = of_find_compatible_node(NULL, NULL, "intel,ce4100-hpet");
+ if (!dn)
+ return;
+ ret = of_address_to_resource(dn, 0, &r);
+ if (ret) {
+ WARN_ON(1);
+ return;
+ }
+ hpet_address = r.start;
+#endif
+}
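
dtb_setup_hpet() is the canonical probe-by-compatible shape: find the node, translate its first reg entry to an address, and return quietly when the hardware simply isn't described. The same control flow against a toy in-memory "device tree" (data and names invented for illustration):

#include <stdio.h>
#include <string.h>

struct dt_node { const char *compatible; unsigned long base; };

static const struct dt_node nodes[] = {
	{ "intel,ce4100-lapic", 0xfee00000 },
	{ "intel,ce4100-hpet",  0xfed00000 },
};

/* analogous to of_find_compatible_node() + of_address_to_resource() */
static int lookup(const char *compat, unsigned long *base)
{
	for (size_t i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++) {
		if (!strcmp(nodes[i].compatible, compat)) {
			*base = nodes[i].base;
			return 0;
		}
	}
	return -1;	/* absent hardware: caller just returns */
}

int main(void)
{
	unsigned long hpet;

	if (!lookup("intel,ce4100-hpet", &hpet))
		printf("hpet at %#lx\n", hpet);
	return 0;
}
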
+
+#ifdef CONFIG_X86_LOCAL_APIC
+
+static void __init dtb_cpu_setup(void)
+{
+ struct device_node *dn;
+ u32 apic_id;
+
+ for_each_of_cpu_node(dn) {
+ apic_id = of_get_cpu_hwid(dn, 0);
+ if (apic_id == ~0U) {
+ pr_warn("%pOF: missing local APIC ID\n", dn);
+ continue;
+ }
+ topology_register_apic(apic_id, CPU_ACPIID_INVALID, true);
+ set_apicid_to_node(apic_id, of_node_to_nid(dn));
+ }
+}
+
+static void __init dtb_lapic_setup(void)
+{
+ struct device_node *dn;
+ struct resource r;
+ unsigned long lapic_addr = APIC_DEFAULT_PHYS_BASE;
+ int ret;
+
+ dn = of_find_compatible_node(NULL, NULL, "intel,ce4100-lapic");
+ if (dn) {
+ ret = of_address_to_resource(dn, 0, &r);
+ if (WARN_ON(ret))
+ return;
+ lapic_addr = r.start;
+ }
+
+ /* Did the boot loader set up the local APIC? */
+ if (!boot_cpu_has(X86_FEATURE_APIC)) {
+ /* Try force enabling, which registers the APIC address */
+ if (!apic_force_enable(lapic_addr))
+ return;
+ } else {
+ register_lapic_address(lapic_addr);
+ }
+ smp_found_config = 1;
+ pic_mode = !of_property_read_bool(dn, "intel,virtual-wire-mode");
+ pr_info("%s compatibility mode.\n", pic_mode ? "IMCR and PIC" : "Virtual Wire");
+}
+
+#endif /* CONFIG_X86_LOCAL_APIC */
+
+#ifdef CONFIG_X86_IO_APIC
+static unsigned int ioapic_id;
+
+struct of_ioapic_type {
+ u32 out_type;
+ u32 is_level;
+ u32 active_low;
+};
+
+static struct of_ioapic_type of_ioapic_type[] =
+{
+ {
+ .out_type = IRQ_TYPE_EDGE_FALLING,
+ .is_level = 0,
+ .active_low = 1,
+ },
+ {
+ .out_type = IRQ_TYPE_LEVEL_HIGH,
+ .is_level = 1,
+ .active_low = 0,
+ },
+ {
+ .out_type = IRQ_TYPE_LEVEL_LOW,
+ .is_level = 1,
+ .active_low = 1,
+ },
+ {
+ .out_type = IRQ_TYPE_EDGE_RISING,
+ .is_level = 0,
+ .active_low = 0,
+ },
+};
+
+static int dt_irqdomain_alloc(struct irq_domain *domain, unsigned int virq,
+ unsigned int nr_irqs, void *arg)
+{
+ struct irq_fwspec *fwspec = (struct irq_fwspec *)arg;
+ struct of_ioapic_type *it;
+ struct irq_alloc_info tmp;
+ int type_index;
+
+ if (WARN_ON(fwspec->param_count < 2))
+ return -EINVAL;
+
+ type_index = fwspec->param[1];
+ if (type_index >= ARRAY_SIZE(of_ioapic_type))
+ return -EINVAL;
+
+ it = &of_ioapic_type[type_index];
+ ioapic_set_alloc_attr(&tmp, NUMA_NO_NODE, it->is_level, it->active_low);
+ tmp.devid = mpc_ioapic_id(mp_irqdomain_ioapic_idx(domain));
+ tmp.ioapic.pin = fwspec->param[0];
+
+ return mp_irqdomain_alloc(domain, virq, nr_irqs, &tmp);
+}
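
dt_irqdomain_alloc() decodes a two-cell devicetree interrupt specifier: param[0] names the IO-APIC pin and param[1] indexes the four-entry trigger/polarity table above. A small sketch of just that decode step, mirroring the table order shown (edge-falling, level-high, level-low, edge-rising):

#include <stdio.h>

struct irq_type { int is_level, active_low; };

static const struct irq_type types[] = {
	{ 0, 1 }, { 1, 0 }, { 1, 1 }, { 0, 0 },
};

static int decode(const unsigned int *param, int count)
{
	if (count < 2 || param[1] >= sizeof(types) / sizeof(types[0]))
		return -1;	/* malformed specifier */

	printf("pin %u: level=%d active_low=%d\n", param[0],
	       types[param[1]].is_level, types[param[1]].active_low);
	return 0;
}

int main(void)
{
	unsigned int spec[2] = { 9, 2 };	/* pin 9, level-low */

	return decode(spec, 2) ? 1 : 0;
}
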
+
+static const struct irq_domain_ops ioapic_irq_domain_ops = {
+ .alloc = dt_irqdomain_alloc,
+ .free = mp_irqdomain_free,
+ .activate = mp_irqdomain_activate,
+ .deactivate = mp_irqdomain_deactivate,
+};
+
+static void __init dtb_add_ioapic(struct device_node *dn)
+{
+ struct resource r;
+ int ret;
+ struct ioapic_domain_cfg cfg = {
+ .type = IOAPIC_DOMAIN_DYNAMIC,
+ .ops = &ioapic_irq_domain_ops,
+ .dev = dn,
+ };
+
+ ret = of_address_to_resource(dn, 0, &r);
+ if (ret) {
+ pr_err("Can't obtain address from device node %pOF.\n", dn);
+ return;
+ }
+ mp_register_ioapic(++ioapic_id, r.start, gsi_top, &cfg);
+}
+
+static void __init dtb_ioapic_setup(void)
+{
+ struct device_node *dn;
+
+ for_each_compatible_node(dn, NULL, "intel,ce4100-ioapic")
+ dtb_add_ioapic(dn);
+
+ if (nr_ioapics) {
+ of_ioapic = 1;
+ return;
+ }
+ pr_err("Error: No information about IO-APIC in OF.\n");
+}
+#else
+static void __init dtb_ioapic_setup(void) {}
+#endif
+
+static void __init dtb_apic_setup(void)
+{
+#ifdef CONFIG_X86_LOCAL_APIC
+ dtb_lapic_setup();
+ dtb_cpu_setup();
+#endif
+ dtb_ioapic_setup();
+}
+
+static void __init x86_dtb_parse_smp_config(void)
+{
+ if (!of_have_populated_dt())
+ return;
+
+ dtb_setup_hpet();
+ dtb_apic_setup();
+}
+
+void __init x86_flattree_get_config(void)
+{
+#ifdef CONFIG_OF_EARLY_FLATTREE
+ u32 size, map_len;
+ void *dt;
+
+ if (initial_dtb) {
+ map_len = max(PAGE_SIZE - (initial_dtb & ~PAGE_MASK), (u64)128);
+
+ dt = early_memremap(initial_dtb, map_len);
+ size = fdt_totalsize(dt);
+ if (map_len < size) {
+ early_memunmap(dt, map_len);
+ dt = early_memremap(initial_dtb, size);
+ map_len = size;
+ }
+
+ early_init_dt_verify(dt, __pa(dt));
+ }
+
+ unflatten_and_copy_device_tree();
+
+ if (initial_dtb)
+ early_memunmap(dt, map_len);
+#endif
+ if (acpi_disabled && of_have_populated_dt())
+ x86_init.mpparse.parse_smp_cfg = x86_dtb_parse_smp_config;
+}
diff --git a/arch/x86/kernel/doublefault_32.c b/arch/x86/kernel/doublefault_32.c
index 37250fe490b1..6eaf9a6bc02f 100644
--- a/arch/x86/kernel/doublefault_32.c
+++ b/arch/x86/kernel/doublefault_32.c
@@ -1,69 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/mm.h>
#include <linux/sched.h>
-#include <linux/init.h>
+#include <linux/sched/debug.h>
#include <linux/init_task.h>
#include <linux/fs.h>
-#include <asm/uaccess.h>
-#include <asm/pgtable.h>
+#include <linux/uaccess.h>
#include <asm/processor.h>
#include <asm/desc.h>
-
-#define DOUBLEFAULT_STACKSIZE (1024)
-static unsigned long doublefault_stack[DOUBLEFAULT_STACKSIZE];
-#define STACK_START (unsigned long)(doublefault_stack+DOUBLEFAULT_STACKSIZE)
+#include <asm/traps.h>
+#include <asm/doublefault.h>
#define ptr_ok(x) ((x) > PAGE_OFFSET && (x) < PAGE_OFFSET + MAXMEM)
-static void doublefault_fn(void)
+#define TSS(x) this_cpu_read(cpu_tss_rw.x86_tss.x)
+
+static void set_df_gdt_entry(unsigned int cpu);
+
+/*
+ * Called by double_fault with CR0.TS and EFLAGS.NT cleared. The CPU thinks
+ * we're running the doublefault task. Cannot return.
+ */
+asmlinkage noinstr void __noreturn doublefault_shim(void)
{
- struct desc_ptr gdt_desc = {0, 0};
- unsigned long gdt, tss;
+ unsigned long cr2;
+ struct pt_regs regs;
- store_gdt(&gdt_desc);
- gdt = gdt_desc.address;
+ BUILD_BUG_ON(sizeof(struct doublefault_stack) != PAGE_SIZE);
- printk(KERN_EMERG "PANIC: double fault, gdt at %08lx [%d bytes]\n", gdt, gdt_desc.size);
+ cr2 = native_read_cr2();
- if (ptr_ok(gdt)) {
- gdt += GDT_ENTRY_TSS << 3;
- tss = get_desc_base((struct desc_struct *)gdt);
- printk(KERN_EMERG "double fault, tss at %08lx\n", tss);
+ /* Reset back to the normal kernel task. */
+ force_reload_TR();
+ set_df_gdt_entry(smp_processor_id());
- if (ptr_ok(tss)) {
- struct x86_hw_tss *t = (struct x86_hw_tss *)tss;
+ trace_hardirqs_off();
- printk(KERN_EMERG "eip = %08lx, esp = %08lx\n",
- t->ip, t->sp);
+ /*
+ * Fill in pt_regs. A downside of doing this in C is that the unwinder
+ * won't see it (no ENCODE_FRAME_POINTER), so a nested stack dump
+ * won't successfully unwind to the source of the double fault.
+ * The main dump from exc_double_fault() is fine, though, since it
+ * uses these regs directly.
+ *
+ * If anyone ever cares, this could be moved to asm.
+ */
+ regs.ss = TSS(ss);
+ regs.__ssh = 0;
+ regs.sp = TSS(sp);
+ regs.flags = TSS(flags);
+ regs.cs = TSS(cs);
+ /* We won't go through the entry asm, so we can leave __csh as 0. */
+ regs.__csh = 0;
+ regs.ip = TSS(ip);
+ regs.orig_ax = 0;
+ regs.gs = TSS(gs);
+ regs.__gsh = 0;
+ regs.fs = TSS(fs);
+ regs.__fsh = 0;
+ regs.es = TSS(es);
+ regs.__esh = 0;
+ regs.ds = TSS(ds);
+ regs.__dsh = 0;
+ regs.ax = TSS(ax);
+ regs.bp = TSS(bp);
+ regs.di = TSS(di);
+ regs.si = TSS(si);
+ regs.dx = TSS(dx);
+ regs.cx = TSS(cx);
+ regs.bx = TSS(bx);
- printk(KERN_EMERG "eax = %08lx, ebx = %08lx, ecx = %08lx, edx = %08lx\n",
- t->ax, t->bx, t->cx, t->dx);
- printk(KERN_EMERG "esi = %08lx, edi = %08lx\n",
- t->si, t->di);
- }
- }
+ exc_double_fault(&regs, 0, cr2);
- for (;;)
- cpu_relax();
+ /*
+ * x86_32 does not save the original CR3 anywhere on a task switch.
+ * This means that, even if we wanted to return, we would need to find
+ * some way to reconstruct CR3. We could make a credible guess based
+ * on cpu_tlbstate, but that would be racy and would not account for
+ * PTI.
+ */
+ panic("cannot return from double fault\n");
}
-struct tss_struct doublefault_tss __cacheline_aligned = {
- .x86_tss = {
- .sp0 = STACK_START,
- .ss0 = __KERNEL_DS,
+DEFINE_PER_CPU_PAGE_ALIGNED(struct doublefault_stack, doublefault_stack) = {
+ .tss = {
+ /*
+ * No sp0 or ss0 -- we never run CPL != 0 with this TSS
+ * active. sp is filled in later.
+ */
.ldt = 0,
- .io_bitmap_base = INVALID_IO_BITMAP_OFFSET,
+ .io_bitmap_base = IO_BITMAP_OFFSET_INVALID,
- .ip = (unsigned long) doublefault_fn,
- /* 0x2 bit is always set */
- .flags = X86_EFLAGS_SF | 0x2,
- .sp = STACK_START,
+ .ip = (unsigned long) asm_exc_double_fault,
+ .flags = X86_EFLAGS_FIXED,
.es = __USER_DS,
.cs = __KERNEL_CS,
.ss = __KERNEL_DS,
.ds = __USER_DS,
.fs = __KERNEL_PERCPU,
+ .gs = 0,
.__cr3 = __pa_nodebug(swapper_pg_dir),
- }
+ },
};
+
+static void set_df_gdt_entry(unsigned int cpu)
+{
+ /* Set up doublefault TSS pointer in the GDT */
+ __set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS,
+ &get_cpu_entry_area(cpu)->doublefault_stack.tss);
+}
+
+void doublefault_init_cpu_tss(void)
+{
+ unsigned int cpu = smp_processor_id();
+ struct cpu_entry_area *cea = get_cpu_entry_area(cpu);
+
+ /*
+ * The linker isn't smart enough to initialize percpu variables that
+ * point to other places in percpu space.
+ */
+ this_cpu_write(doublefault_stack.tss.sp,
+ (unsigned long)&cea->doublefault_stack.stack +
+ sizeof(doublefault_stack.stack));
+
+ set_df_gdt_entry(cpu);
+}
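
doublefault_init_cpu_tss() exists because, as its comment notes, a static initializer cannot make one percpu object point into another percpu area: each CPU's copy needs its own address, known only at runtime. The same constraint reproduced with C11 thread-local storage (purely illustrative):

#include <stdio.h>

static _Thread_local char df_stack[256];
/* = df_stack + sizeof(df_stack) would not compile: not a constant */
static _Thread_local char *df_sp;

/* runtime fixup, analogous to doublefault_init_cpu_tss() */
static void init_cpu(void)
{
	df_sp = df_stack + sizeof(df_stack);
}

int main(void)
{
	init_cpu();
	printf("sp offset: %td\n", df_sp - df_stack);	/* 256 */
	return 0;
}
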
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index c89a386930b7..b10684dedc58 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -10,216 +10,343 @@
#include <linux/kdebug.h>
#include <linux/module.h>
#include <linux/ptrace.h>
+#include <linux/sched/debug.h>
+#include <linux/sched/task_stack.h>
#include <linux/ftrace.h>
#include <linux/kexec.h>
#include <linux/bug.h>
#include <linux/nmi.h>
#include <linux/sysfs.h>
+#include <linux/kasan.h>
+#include <asm/cpu_entry_area.h>
#include <asm/stacktrace.h>
+#include <asm/unwind.h>
-#include "dumpstack.h"
-
-int panic_on_unrecovered_nmi;
-int panic_on_io_nmi;
-unsigned int code_bytes = 64;
-int kstack_depth_to_print = 3 * STACKSLOTS_PER_LINE;
static int die_counter;
-void printk_address(unsigned long address, int reliable)
-{
- printk(" [<%p>] %s%pS\n", (void *) address,
- reliable ? "" : "? ", (void *) address);
-}
+static struct pt_regs exec_summary_regs;
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-static void
-print_ftrace_graph_addr(unsigned long addr, void *data,
- const struct stacktrace_ops *ops,
- struct thread_info *tinfo, int *graph)
+bool noinstr in_task_stack(unsigned long *stack, struct task_struct *task,
+ struct stack_info *info)
{
- struct task_struct *task = tinfo->task;
- unsigned long ret_addr;
- int index = task->curr_ret_stack;
+ unsigned long *begin = task_stack_page(task);
+ unsigned long *end = task_stack_page(task) + THREAD_SIZE;
- if (addr != (unsigned long)return_to_handler)
- return;
+ if (stack < begin || stack >= end)
+ return false;
- if (!task->ret_stack || index < *graph)
- return;
+ info->type = STACK_TYPE_TASK;
+ info->begin = begin;
+ info->end = end;
+ info->next_sp = NULL;
- index -= *graph;
- ret_addr = task->ret_stack[index].ret;
+ return true;
+}
- ops->address(data, ret_addr, 1);
+/* Called from get_stack_info_noinstr - so must be noinstr too */
+bool noinstr in_entry_stack(unsigned long *stack, struct stack_info *info)
+{
+ struct entry_stack *ss = cpu_entry_stack(smp_processor_id());
- (*graph)++;
-}
-#else
-static inline void
-print_ftrace_graph_addr(unsigned long addr, void *data,
- const struct stacktrace_ops *ops,
- struct thread_info *tinfo, int *graph)
-{ }
-#endif
+ void *begin = ss;
+ void *end = ss + 1;
-/*
- * x86-64 can have up to three kernel stacks:
- * process stack
- * interrupt stack
- * severe exception (double fault, nmi, stack fault, debug, mce) hardware stack
- */
+ if ((void *)stack < begin || (void *)stack >= end)
+ return false;
-static inline int valid_stack_ptr(struct thread_info *tinfo,
- void *p, unsigned int size, void *end)
-{
- void *t = tinfo;
- if (end) {
- if (p < end && p >= (end-THREAD_SIZE))
- return 1;
- else
- return 0;
- }
- return p > t && p < t + THREAD_SIZE - size;
+ info->type = STACK_TYPE_ENTRY;
+ info->begin = begin;
+ info->end = end;
+ info->next_sp = NULL;
+
+ return true;
}
-unsigned long
-print_context_stack(struct thread_info *tinfo,
- unsigned long *stack, unsigned long bp,
- const struct stacktrace_ops *ops, void *data,
- unsigned long *end, int *graph)
+static void printk_stack_address(unsigned long address, int reliable,
+ const char *log_lvl)
{
- struct stack_frame *frame = (struct stack_frame *)bp;
-
- while (valid_stack_ptr(tinfo, stack, sizeof(*stack), end)) {
- unsigned long addr;
-
- addr = *stack;
- if (__kernel_text_address(addr)) {
- if ((unsigned long) stack == bp + sizeof(long)) {
- ops->address(data, addr, 1);
- frame = frame->next_frame;
- bp = (unsigned long) frame;
- } else {
- ops->address(data, addr, 0);
- }
- print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
- }
- stack++;
- }
- return bp;
+ touch_nmi_watchdog();
+ printk("%s %s%pBb\n", log_lvl, reliable ? "" : "? ", (void *)address);
}
-EXPORT_SYMBOL_GPL(print_context_stack);
-unsigned long
-print_context_stack_bp(struct thread_info *tinfo,
- unsigned long *stack, unsigned long bp,
- const struct stacktrace_ops *ops, void *data,
- unsigned long *end, int *graph)
+static int copy_code(struct pt_regs *regs, u8 *buf, unsigned long src,
+ unsigned int nbytes)
{
- struct stack_frame *frame = (struct stack_frame *)bp;
- unsigned long *ret_addr = &frame->return_address;
+ if (!user_mode(regs))
+ return copy_from_kernel_nofault(buf, (u8 *)src, nbytes);
- while (valid_stack_ptr(tinfo, ret_addr, sizeof(*ret_addr), end)) {
- unsigned long addr = *ret_addr;
+ /* The user space code from other tasks cannot be accessed. */
+ if (regs != task_pt_regs(current))
+ return -EPERM;
- if (!__kernel_text_address(addr))
- break;
+ /*
+ * Even if named copy_from_user_nmi() this can be invoked from
+ * other contexts and will not try to resolve a pagefault, which is
+ * the correct thing to do here as this code can be called from any
+ * context.
+ */
+ return copy_from_user_nmi(buf, (void __user *)src, nbytes);
+}
- ops->address(data, addr, 1);
- frame = frame->next_frame;
- ret_addr = &frame->return_address;
- print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
+/*
+ * There are a couple of reasons for the 2/3rds prologue, courtesy of Linus:
+ *
+ * In the case where we don't have the exact kernel image (which, if we did, we
+ * could simply disassemble and navigate to the RIP), the purpose of the bigger
+ * prologue is to have more context and to be able to correlate the code from
+ * the different toolchains better.
+ *
+ * In addition, it helps in recreating the register allocation of the failing
+ * kernel and thus in making sense of the register dump.
+ *
+ * What is more, the additional complication of a variable-length insn arch
+ * like x86 warrants having a longer byte sequence before rIP so that the
+ * disassembler can "sync" up properly and find instruction boundaries when
+ * decoding the opcode bytes.
+ *
+ * Thus, the 2/3rds prologue and 64 byte OPCODE_BUFSIZE is just a random
+ * guesstimate in an attempt to achieve all of the above.
+ */
+void show_opcodes(struct pt_regs *regs, const char *loglvl)
+{
+#define PROLOGUE_SIZE 42
+#define EPILOGUE_SIZE 21
+#define OPCODE_BUFSIZE (PROLOGUE_SIZE + 1 + EPILOGUE_SIZE)
+ u8 opcodes[OPCODE_BUFSIZE];
+ unsigned long prologue = regs->ip - PROLOGUE_SIZE;
+
+ switch (copy_code(regs, opcodes, prologue, sizeof(opcodes))) {
+ case 0:
+ printk("%sCode: %" __stringify(PROLOGUE_SIZE) "ph <%02x> %"
+ __stringify(EPILOGUE_SIZE) "ph\n", loglvl, opcodes,
+ opcodes[PROLOGUE_SIZE], opcodes + PROLOGUE_SIZE + 1);
+ break;
+ case -EPERM:
+ /* No access to the user space stack of other tasks. Ignore. */
+ break;
+ default:
+ printk("%sCode: Unable to access opcode bytes at 0x%lx.\n",
+ loglvl, prologue);
+ break;
}
-
- return (unsigned long)frame;
}
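
The resulting Code: line (42 prologue bytes, the byte at regs->ip in angle brackets, then 21 epilogue bytes) leans on printk's %ph extension; outside the kernel the same layout takes an explicit loop. A hedged userspace rendering:

#include <stdio.h>

#define PROLOGUE 42
#define EPILOGUE 21

/* print buf as "Code: .. .. <fault-byte> .. .." */
static void show_code(const unsigned char *buf)
{
	int i;

	printf("Code: ");
	for (i = 0; i < PROLOGUE; i++)
		printf("%02x ", buf[i]);
	printf("<%02x> ", buf[PROLOGUE]);	/* byte at the faulting ip */
	for (i = 0; i < EPILOGUE; i++)
		printf("%02x ", buf[PROLOGUE + 1 + i]);
	printf("\n");
}

int main(void)
{
	unsigned char buf[PROLOGUE + 1 + EPILOGUE] = { 0x0f, 0x0b /* ud2 */ };

	show_code(buf);
	return 0;
}
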
-EXPORT_SYMBOL_GPL(print_context_stack_bp);
-
-static void
-print_trace_warning_symbol(void *data, char *msg, unsigned long symbol)
+void show_ip(struct pt_regs *regs, const char *loglvl)
{
- printk(data);
- print_symbol(msg, symbol);
- printk("\n");
+#ifdef CONFIG_X86_32
+ printk("%sEIP: %pS\n", loglvl, (void *)regs->ip);
+#else
+ printk("%sRIP: %04x:%pS\n", loglvl, (int)regs->cs, (void *)regs->ip);
+#endif
+ show_opcodes(regs, loglvl);
}
-static void print_trace_warning(void *data, char *msg)
+void show_iret_regs(struct pt_regs *regs, const char *log_lvl)
{
- printk("%s%s\n", (char *)data, msg);
+ show_ip(regs, log_lvl);
+ printk("%sRSP: %04x:%016lx EFLAGS: %08lx", log_lvl, (int)regs->ss,
+ regs->sp, regs->flags);
}
-static int print_trace_stack(void *data, char *name)
+static void show_regs_if_on_stack(struct stack_info *info, struct pt_regs *regs,
+ bool partial, const char *log_lvl)
{
- printk("%s <%s> ", (char *)data, name);
- return 0;
+ /*
+ * These on_stack() checks aren't strictly necessary: the unwind code
+ * has already validated the 'regs' pointer. The checks are done for
+ * ordering reasons: if the registers are on the next stack, we don't
+ * want to print them out yet. Otherwise they'll be shown as part of
+ * the wrong stack. Later, when show_trace_log_lvl() switches to the
+ * next stack, this function will be called again with the same regs so
+ * they can be printed in the right context.
+ */
+ if (!partial && on_stack(info, regs, sizeof(*regs))) {
+ __show_regs(regs, SHOW_REGS_SHORT, log_lvl);
+
+ } else if (partial && on_stack(info, (void *)regs + IRET_FRAME_OFFSET,
+ IRET_FRAME_SIZE)) {
+ /*
+ * When an interrupt or exception occurs in entry code, the
+ * full pt_regs might not have been saved yet. In that case
+ * just print the iret frame.
+ */
+ show_iret_regs(regs, log_lvl);
+ }
}
/*
- * Print one address/symbol entries per line.
+ * This function reads pointers from the stack and dereferences them. The
+ * pointers may not have their KMSAN shadow set up properly, which may result
+ * in false positive reports. Disable instrumentation to avoid those.
*/
-static void print_trace_address(void *data, unsigned long addr, int reliable)
+__no_kmsan_checks
+static void __show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
+ unsigned long *stack, const char *log_lvl)
{
- touch_nmi_watchdog();
- printk(data);
- printk_address(addr, reliable);
-}
+ struct unwind_state state;
+ struct stack_info stack_info = {0};
+ unsigned long visit_mask = 0;
+ int graph_idx = 0;
+ bool partial = false;
-static const struct stacktrace_ops print_trace_ops = {
- .warning = print_trace_warning,
- .warning_symbol = print_trace_warning_symbol,
- .stack = print_trace_stack,
- .address = print_trace_address,
- .walk_stack = print_context_stack,
-};
-
-void
-show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
- unsigned long *stack, unsigned long bp, char *log_lvl)
-{
printk("%sCall Trace:\n", log_lvl);
- dump_trace(task, regs, stack, bp, &print_trace_ops, log_lvl);
-}
-void show_trace(struct task_struct *task, struct pt_regs *regs,
- unsigned long *stack, unsigned long bp)
-{
- show_trace_log_lvl(task, regs, stack, bp, "");
+ unwind_start(&state, task, regs, stack);
+ stack = stack ?: get_stack_pointer(task, regs);
+ regs = unwind_get_entry_regs(&state, &partial);
+
+ /*
+ * Iterate through the stacks, starting with the current stack pointer.
+ * Each stack has a pointer to the next one.
+ *
+ * x86-64 can have several stacks:
+ * - task stack
+ * - interrupt stack
+ * - HW exception stacks (double fault, nmi, debug, mce)
+ * - entry stack
+ *
+ * x86-32 can have up to four stacks:
+ * - task stack
+ * - softirq stack
+ * - hardirq stack
+ * - entry stack
+ */
+ for (; stack; stack = stack_info.next_sp) {
+ const char *stack_name;
+
+ stack = PTR_ALIGN(stack, sizeof(long));
+
+ if (get_stack_info(stack, task, &stack_info, &visit_mask)) {
+ /*
+ * We weren't on a valid stack. It's possible that
+ * we overflowed a valid stack into a guard page.
+ * See if the next page up is valid so that we can
+ * generate some kind of backtrace if this happens.
+ */
+ stack = (unsigned long *)PAGE_ALIGN((unsigned long)stack);
+ if (get_stack_info(stack, task, &stack_info, &visit_mask))
+ break;
+ }
+
+ stack_name = stack_type_name(stack_info.type);
+ if (stack_name)
+ printk("%s <%s>\n", log_lvl, stack_name);
+
+ if (regs)
+ show_regs_if_on_stack(&stack_info, regs, partial, log_lvl);
+
+ /*
+ * Scan the stack, printing any text addresses we find. At the
+ * same time, follow proper stack frames with the unwinder.
+ *
+ * Addresses found during the scan which are not reported by
+ * the unwinder are considered to be additional clues which are
+ * sometimes useful for debugging and are prefixed with '?'.
+ * This also serves as a failsafe option in case the unwinder
+ * goes off in the weeds.
+ */
+ for (; stack < stack_info.end; stack++) {
+ unsigned long real_addr;
+ int reliable = 0;
+ unsigned long addr = READ_ONCE_NOCHECK(*stack);
+ unsigned long *ret_addr_p =
+ unwind_get_return_address_ptr(&state);
+
+ if (!__kernel_text_address(addr))
+ continue;
+
+ /*
+ * Don't print regs->ip again if it was already printed
+ * by show_regs_if_on_stack().
+ */
+ if (regs && stack == &regs->ip)
+ goto next;
+
+ if (stack == ret_addr_p)
+ reliable = 1;
+
+ /*
+ * When function graph tracing is enabled for a
+ * function, its return address on the stack is
+ * replaced with the address of an ftrace handler
+ * (return_to_handler). In that case, before printing
+ * the "real" address, we want to print the handler
+ * address as an "unreliable" hint that function graph
+ * tracing was involved.
+ */
+ real_addr = ftrace_graph_ret_addr(task, &graph_idx,
+ addr, stack);
+ if (real_addr != addr)
+ printk_stack_address(addr, 0, log_lvl);
+ printk_stack_address(real_addr, reliable, log_lvl);
+
+ if (!reliable)
+ continue;
+
+next:
+ /*
+ * Get the next frame from the unwinder. No need to
+ * check for an error: if anything goes wrong, the rest
+ * of the addresses will just be printed as unreliable.
+ */
+ unwind_next_frame(&state);
+
+ /* if the frame has entry regs, print them */
+ regs = unwind_get_entry_regs(&state, &partial);
+ if (regs)
+ show_regs_if_on_stack(&stack_info, regs, partial, log_lvl);
+ }
+
+ if (stack_name)
+ printk("%s </%s>\n", log_lvl, stack_name);
+ }
}
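
The loop above pairs a brute-force scan for text-looking words with the unwinder's authoritative frame list: everything plausible is printed, but only addresses the unwinder vouches for lose the '?' prefix. A toy version of that cross-check, with the "unwinder" simulated by a fixed list (all values invented):

#include <stdio.h>

/* pretend anything in [0x1000, 0x2000) is kernel text */
static int is_text(unsigned long a) { return a >= 0x1000 && a < 0x2000; }

int main(void)
{
	/* fake stack words: data mixed with return addresses */
	unsigned long stack[] = { 0xdead, 0x1234, 0x42, 0x1a00, 0x1234 };
	/* frames the simulated unwinder reports, in order */
	unsigned long reliable[] = { 0x1234, 0x1a00 };
	int r = 0;

	for (size_t i = 0; i < sizeof(stack) / sizeof(stack[0]); i++) {
		if (!is_text(stack[i]))
			continue;	/* skip plain data words */
		int ok = r < 2 && stack[i] == reliable[r];
		if (ok)
			r++;
		printf(" %s%#lx\n", ok ? "" : "? ", stack[i]);
	}
	return 0;
}
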
-void show_stack(struct task_struct *task, unsigned long *sp)
+static void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
+ unsigned long *stack, const char *log_lvl)
{
- show_stack_log_lvl(task, NULL, sp, 0, "");
+ /*
+ * Disable KASAN to avoid false positives during walking another
+ * task's stacks, as values on these stacks may change concurrently
+ * with task execution.
+ */
+ bool disable_kasan = task && task != current;
+
+ if (disable_kasan)
+ kasan_disable_current();
+
+ __show_trace_log_lvl(task, regs, stack, log_lvl);
+
+ if (disable_kasan)
+ kasan_enable_current();
}
-/*
- * The architecture-independent dump_stack generator
- */
-void dump_stack(void)
+void show_stack(struct task_struct *task, unsigned long *sp,
+ const char *loglvl)
{
- unsigned long bp = 0;
- unsigned long stack;
+ task = task ? : current;
-#ifdef CONFIG_FRAME_POINTER
- if (!bp)
- get_bp(bp);
-#endif
+ /*
+ * Stack frames below this one aren't interesting. Don't show them
+ * if we're printing for %current.
+ */
+ if (!sp && task == current)
+ sp = get_stack_pointer(current, NULL);
- printk("Pid: %d, comm: %.20s %s %s %.*s\n",
- current->pid, current->comm, print_tainted(),
- init_utsname()->release,
- (int)strcspn(init_utsname()->version, " "),
- init_utsname()->version);
- show_trace(NULL, NULL, &stack, bp);
+ show_trace_log_lvl(task, NULL, sp, loglvl);
+}
+
+void show_stack_regs(struct pt_regs *regs)
+{
+ show_trace_log_lvl(current, regs, NULL, KERN_DEFAULT);
}
-EXPORT_SYMBOL(dump_stack);
static arch_spinlock_t die_lock = __ARCH_SPIN_LOCK_UNLOCKED;
static int die_owner = -1;
static unsigned int die_nest_count;
-unsigned __kprobes long oops_begin(void)
+unsigned long oops_begin(void)
{
int cpu;
unsigned long flags;
@@ -241,15 +368,18 @@ unsigned __kprobes long oops_begin(void)
bust_spinlocks(1);
return flags;
}
+NOKPROBE_SYMBOL(oops_begin);
+
+void __noreturn rewind_stack_and_make_dead(int signr);
-void __kprobes oops_end(unsigned long flags, struct pt_regs *regs, int signr)
+void oops_end(unsigned long flags, struct pt_regs *regs, int signr)
{
if (regs && kexec_should_crash(current))
crash_kexec(regs);
bust_spinlocks(0);
die_owner = -1;
- add_taint(TAINT_DIE);
+ add_taint(TAINT_DIE, LOCKDEP_NOW_UNRELIABLE);
die_nest_count--;
if (!die_nest_count)
/* Nest count reaches zero, release the lock. */
@@ -257,57 +387,64 @@ void __kprobes oops_end(unsigned long flags, struct pt_regs *regs, int signr)
raw_local_irq_restore(flags);
oops_exit();
+ /* Executive summary in case the oops scrolled away */
+ __show_regs(&exec_summary_regs, SHOW_REGS_ALL, KERN_DEFAULT);
+
if (!signr)
return;
if (in_interrupt())
panic("Fatal exception in interrupt");
if (panic_on_oops)
panic("Fatal exception");
- do_exit(signr);
+
+ /*
+ * We're not going to return, but we might be on an IST stack or
+ * have very little stack space left. Rewind the stack and kill
+ * the task.
+ * Before we rewind the stack, we have to tell KASAN that we're going to
+ * reuse the task stack and that existing poisons are invalid.
+ */
+ kasan_unpoison_task_stack(current);
+ rewind_stack_and_make_dead(signr);
}
+NOKPROBE_SYMBOL(oops_end);
-int __kprobes __die(const char *str, struct pt_regs *regs, long err)
+static void __die_header(const char *str, struct pt_regs *regs, long err)
{
-#ifdef CONFIG_X86_32
- unsigned short ss;
- unsigned long sp;
-#endif
- printk(KERN_EMERG "%s: %04lx [#%d] ", str, err & 0xffff, ++die_counter);
-#ifdef CONFIG_PREEMPT
- printk("PREEMPT ");
-#endif
-#ifdef CONFIG_SMP
- printk("SMP ");
-#endif
-#ifdef CONFIG_DEBUG_PAGEALLOC
- printk("DEBUG_PAGEALLOC");
-#endif
- printk("\n");
- sysfs_printk_last_file();
+ /* Save the regs of the first oops for the executive summary later. */
+ if (!die_counter)
+ exec_summary_regs = *regs;
+
+ printk(KERN_DEFAULT
+ "Oops: %s: %04lx [#%d]%s%s%s%s\n", str, err & 0xffff,
+ ++die_counter,
+ IS_ENABLED(CONFIG_SMP) ? " SMP" : "",
+ debug_pagealloc_enabled() ? " DEBUG_PAGEALLOC" : "",
+ IS_ENABLED(CONFIG_KASAN) ? " KASAN" : "",
+ IS_ENABLED(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION) ?
+ (boot_cpu_has(X86_FEATURE_PTI) ? " PTI" : " NOPTI") : "");
+}
+NOKPROBE_SYMBOL(__die_header);
+
+static int __die_body(const char *str, struct pt_regs *regs, long err)
+{
+ show_regs(regs);
+ print_modules();
+
if (notify_die(DIE_OOPS, str, regs, err,
- current->thread.trap_no, SIGSEGV) == NOTIFY_STOP)
+ current->thread.trap_nr, SIGSEGV) == NOTIFY_STOP)
return 1;
- show_registers(regs);
-#ifdef CONFIG_X86_32
- if (user_mode_vm(regs)) {
- sp = regs->sp;
- ss = regs->ss & 0xffff;
- } else {
- sp = kernel_stack_pointer(regs);
- savesegment(ss, ss);
- }
- printk(KERN_EMERG "EIP: [<%08lx>] ", regs->ip);
- print_symbol("%s", regs->ip);
- printk(" SS:ESP %04x:%08lx\n", ss, sp);
-#else
- /* Executive summary in case the oops scrolled away */
- printk(KERN_ALERT "RIP ");
- printk_address(regs->ip, 1);
- printk(" RSP <%016lx>\n", regs->sp);
-#endif
return 0;
}
+NOKPROBE_SYMBOL(__die_body);
+
+int __die(const char *str, struct pt_regs *regs, long err)
+{
+ __die_header(str, regs, err);
+ return __die_body(str, regs, err);
+}
+NOKPROBE_SYMBOL(__die);
/*
* This is gone through when something in the kernel has done something bad
@@ -318,64 +455,36 @@ void die(const char *str, struct pt_regs *regs, long err)
unsigned long flags = oops_begin();
int sig = SIGSEGV;
- if (!user_mode_vm(regs))
- report_bug(regs->ip, regs);
-
if (__die(str, regs, err))
sig = 0;
oops_end(flags, regs, sig);
}
-void notrace __kprobes
-die_nmi(char *str, struct pt_regs *regs, int do_panic)
+void die_addr(const char *str, struct pt_regs *regs, long err, long gp_addr)
{
- unsigned long flags;
-
- if (notify_die(DIE_NMIWATCHDOG, str, regs, 0, 2, SIGINT) == NOTIFY_STOP)
- return;
+ unsigned long flags = oops_begin();
+ int sig = SIGSEGV;
- /*
- * We are in trouble anyway, lets at least try
- * to get a message out.
- */
- flags = oops_begin();
- printk(KERN_EMERG "%s", str);
- printk(" on CPU%d, ip %08lx, registers:\n",
- smp_processor_id(), regs->ip);
- show_registers(regs);
- oops_end(flags, regs, 0);
- if (do_panic || panic_on_oops)
- panic("Non maskable interrupt");
- nmi_exit();
- local_irq_enable();
- do_exit(SIGBUS);
+ __die_header(str, regs, err);
+ if (gp_addr)
+ kasan_non_canonical_hook(gp_addr);
+ if (__die_body(str, regs, err))
+ sig = 0;
+ oops_end(flags, regs, sig);
}
-static int __init oops_setup(char *s)
+void show_regs(struct pt_regs *regs)
{
- if (!s)
- return -EINVAL;
- if (!strcmp(s, "panic"))
- panic_on_oops = 1;
- return 0;
-}
-early_param("oops", oops_setup);
+ enum show_regs_mode print_kernel_regs;
-static int __init kstack_setup(char *s)
-{
- if (!s)
- return -EINVAL;
- kstack_depth_to_print = simple_strtoul(s, NULL, 0);
- return 0;
-}
-early_param("kstack", kstack_setup);
+ show_regs_print_info(KERN_DEFAULT);
-static int __init code_bytes_setup(char *s)
-{
- code_bytes = simple_strtoul(s, NULL, 0);
- if (code_bytes > 8192)
- code_bytes = 8192;
+ print_kernel_regs = user_mode(regs) ? SHOW_REGS_USER : SHOW_REGS_ALL;
+ __show_regs(regs, print_kernel_regs, KERN_DEFAULT);
- return 1;
+ /*
+ * When in-kernel, we also print out the stack at the time of the fault.
+ */
+ if (!user_mode(regs))
+ show_trace_log_lvl(current, regs, NULL, KERN_DEFAULT);
}
-__setup("code_bytes=", code_bytes_setup);
diff --git a/arch/x86/kernel/dumpstack.h b/arch/x86/kernel/dumpstack.h
deleted file mode 100644
index e1a93be4fd44..000000000000
--- a/arch/x86/kernel/dumpstack.h
+++ /dev/null
@@ -1,56 +0,0 @@
-/*
- * Copyright (C) 1991, 1992 Linus Torvalds
- * Copyright (C) 2000, 2001, 2002 Andi Kleen, SuSE Labs
- */
-
-#ifndef DUMPSTACK_H
-#define DUMPSTACK_H
-
-#ifdef CONFIG_X86_32
-#define STACKSLOTS_PER_LINE 8
-#define get_bp(bp) asm("movl %%ebp, %0" : "=r" (bp) :)
-#else
-#define STACKSLOTS_PER_LINE 4
-#define get_bp(bp) asm("movq %%rbp, %0" : "=r" (bp) :)
-#endif
-
-#include <linux/uaccess.h>
-
-extern void
-show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
- unsigned long *stack, unsigned long bp, char *log_lvl);
-
-extern void
-show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
- unsigned long *sp, unsigned long bp, char *log_lvl);
-
-extern unsigned int code_bytes;
-
-/* The form of the top of the frame on the stack */
-struct stack_frame {
- struct stack_frame *next_frame;
- unsigned long return_address;
-};
-
-struct stack_frame_ia32 {
- u32 next_frame;
- u32 return_address;
-};
-
-static inline unsigned long rewind_frame_pointer(int n)
-{
- struct stack_frame *frame;
-
- get_bp(frame);
-
-#ifdef CONFIG_FRAME_POINTER
- while (n--) {
- if (probe_kernel_address(&frame->next_frame, frame))
- break;
- }
-#endif
-
- return (unsigned long)frame;
-}
-
-#endif /* DUMPSTACK_H */
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index 11540a189d93..722fd712e1cf 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -1,13 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (C) 1991, 1992 Linus Torvalds
* Copyright (C) 2000, 2001, 2002 Andi Kleen, SuSE Labs
*/
+#include <linux/sched/debug.h>
#include <linux/kallsyms.h>
#include <linux/kprobes.h>
#include <linux/uaccess.h>
#include <linux/hardirq.h>
#include <linux/kdebug.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/ptrace.h>
#include <linux/kexec.h>
#include <linux/sysfs.h>
@@ -16,138 +18,138 @@
#include <asm/stacktrace.h>
-#include "dumpstack.h"
+const char *stack_type_name(enum stack_type type)
+{
+ if (type == STACK_TYPE_IRQ)
+ return "IRQ";
+
+ if (type == STACK_TYPE_SOFTIRQ)
+ return "SOFTIRQ";
+
+ if (type == STACK_TYPE_ENTRY)
+ return "ENTRY_TRAMPOLINE";
+ if (type == STACK_TYPE_EXCEPTION)
+ return "#DF";
+
+ return NULL;
+}
-void dump_trace(struct task_struct *task, struct pt_regs *regs,
- unsigned long *stack, unsigned long bp,
- const struct stacktrace_ops *ops, void *data)
+static bool in_hardirq_stack(unsigned long *stack, struct stack_info *info)
{
- int graph = 0;
+ unsigned long *begin = (unsigned long *)this_cpu_read(hardirq_stack_ptr);
+ unsigned long *end = begin + (THREAD_SIZE / sizeof(long));
- if (!task)
- task = current;
+ /*
+ * This is a software stack, so 'end' can be a valid stack pointer.
+ * It just means the stack is empty.
+ */
+ if (stack < begin || stack > end)
+ return false;
- if (!stack) {
- unsigned long dummy;
+ info->type = STACK_TYPE_IRQ;
+ info->begin = begin;
+ info->end = end;
- stack = &dummy;
- if (task && task != current)
- stack = (unsigned long *)task->thread.sp;
- }
+ /*
+ * See irq_32.c -- the next stack pointer is stored at the beginning of
+ * the stack.
+ */
+ info->next_sp = (unsigned long *)*begin;
-#ifdef CONFIG_FRAME_POINTER
- if (!bp) {
- if (task == current) {
- /* Grab bp right from our regs */
- get_bp(bp);
- } else {
- /* bp is the last reg pushed by switch_to */
- bp = *(unsigned long *) task->thread.sp;
- }
- }
-#endif
+ return true;
+}
- for (;;) {
- struct thread_info *context;
+static bool in_softirq_stack(unsigned long *stack, struct stack_info *info)
+{
+ unsigned long *begin = (unsigned long *)this_cpu_read(softirq_stack_ptr);
+ unsigned long *end = begin + (THREAD_SIZE / sizeof(long));
- context = (struct thread_info *)
- ((unsigned long)stack & (~(THREAD_SIZE - 1)));
- bp = ops->walk_stack(context, stack, bp, ops, data, NULL, &graph);
+ /*
+ * This is a software stack, so 'end' can be a valid stack pointer.
+ * It just means the stack is empty.
+ */
+ if (stack < begin || stack > end)
+ return false;
- stack = (unsigned long *)context->previous_esp;
- if (!stack)
- break;
- if (ops->stack(data, "IRQ") < 0)
- break;
- touch_nmi_watchdog();
- }
+ info->type = STACK_TYPE_SOFTIRQ;
+ info->begin = begin;
+ info->end = end;
+
+ /*
+ * The next stack pointer is stored at the beginning of the stack.
+ * See irq_32.c.
+ */
+ info->next_sp = (unsigned long *)*begin;
+
+ return true;
}
-EXPORT_SYMBOL(dump_trace);
-void
-show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
- unsigned long *sp, unsigned long bp, char *log_lvl)
+static bool in_doublefault_stack(unsigned long *stack, struct stack_info *info)
{
- unsigned long *stack;
- int i;
-
- if (sp == NULL) {
- if (task)
- sp = (unsigned long *)task->thread.sp;
- else
- sp = (unsigned long *)&sp;
- }
+ struct cpu_entry_area *cea = get_cpu_entry_area(raw_smp_processor_id());
+ struct doublefault_stack *ss = &cea->doublefault_stack;
- stack = sp;
- for (i = 0; i < kstack_depth_to_print; i++) {
- if (kstack_end(stack))
- break;
- if (i && ((i % STACKSLOTS_PER_LINE) == 0))
- printk("\n%s", log_lvl);
- printk(" %08lx", *stack++);
- touch_nmi_watchdog();
- }
- printk("\n");
- show_trace_log_lvl(task, regs, sp, bp, log_lvl);
+ void *begin = ss->stack;
+ void *end = begin + sizeof(ss->stack);
+
+ if ((void *)stack < begin || (void *)stack >= end)
+ return false;
+
+ info->type = STACK_TYPE_EXCEPTION;
+ info->begin = begin;
+ info->end = end;
+ info->next_sp = (unsigned long *)this_cpu_read(cpu_tss_rw.x86_tss.sp);
+
+ return true;
}
-void show_registers(struct pt_regs *regs)
+int get_stack_info(unsigned long *stack, struct task_struct *task,
+ struct stack_info *info, unsigned long *visit_mask)
{
- int i;
+ if (!stack)
+ goto unknown;
+
+ task = task ? : current;
+
+ if (in_task_stack(stack, task, info))
+ goto recursion_check;
- print_modules();
- __show_regs(regs, 0);
+ if (task != current)
+ goto unknown;
- printk(KERN_EMERG "Process %.*s (pid: %d, ti=%p task=%p task.ti=%p)\n",
- TASK_COMM_LEN, current->comm, task_pid_nr(current),
- current_thread_info(), current, task_thread_info(current));
+ if (in_entry_stack(stack, info))
+ goto recursion_check;
+
+ if (in_hardirq_stack(stack, info))
+ goto recursion_check;
+
+ if (in_softirq_stack(stack, info))
+ goto recursion_check;
+
+ if (in_doublefault_stack(stack, info))
+ goto recursion_check;
+
+ goto unknown;
+
+recursion_check:
/*
- * When in-kernel, we also print out the stack and code at the
- * time of the fault..
+ * Make sure we don't iterate through any given stack more than once.
+ * If it comes up a second time then there's something wrong going on:
+ * just break out and report an unknown stack type.
*/
- if (!user_mode_vm(regs)) {
- unsigned int code_prologue = code_bytes * 43 / 64;
- unsigned int code_len = code_bytes;
- unsigned char c;
- u8 *ip;
-
- printk(KERN_EMERG "Stack:\n");
- show_stack_log_lvl(NULL, regs, &regs->sp,
- 0, KERN_EMERG);
-
- printk(KERN_EMERG "Code: ");
-
- ip = (u8 *)regs->ip - code_prologue;
- if (ip < (u8 *)PAGE_OFFSET || probe_kernel_address(ip, c)) {
- /* try starting at IP */
- ip = (u8 *)regs->ip;
- code_len = code_len - code_prologue + 1;
- }
- for (i = 0; i < code_len; i++, ip++) {
- if (ip < (u8 *)PAGE_OFFSET ||
- probe_kernel_address(ip, c)) {
- printk(" Bad EIP value.");
- break;
- }
- if (ip == (u8 *)regs->ip)
- printk("<%02x> ", c);
- else
- printk("%02x ", c);
+ if (visit_mask) {
+ if (*visit_mask & (1UL << info->type)) {
+ printk_deferred_once(KERN_WARNING "WARNING: stack recursion on stack type %d\n", info->type);
+ goto unknown;
}
+ *visit_mask |= 1UL << info->type;
}
- printk("\n");
-}
-
-int is_valid_bugaddr(unsigned long ip)
-{
- unsigned short ud2;
- if (ip < PAGE_OFFSET)
- return 0;
- if (probe_kernel_address((unsigned short *)ip, ud2))
- return 0;
+ return 0;
- return ud2 == 0x0b0f;
+unknown:
+ info->type = STACK_TYPE_UNKNOWN;
+ return -EINVAL;
}
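
The visit_mask bookkeeping above is just a per-type bitmask; a standalone model (with hypothetical type values, not the kernel enum) shows how a second visit of the same stack type is rejected:

	#include <stdio.h>

	enum stack_type { STACK_TYPE_TASK = 1, STACK_TYPE_IRQ, STACK_TYPE_SOFTIRQ };

	/* Returns 0 the first time a type is seen, -1 on recursion. */
	static int visit(unsigned long *visit_mask, int type)
	{
		if (*visit_mask & (1UL << type))
			return -1;	/* same stack seen twice: bail */
		*visit_mask |= 1UL << type;
		return 0;
	}

	int main(void)
	{
		unsigned long mask = 0;

		printf("%d\n", visit(&mask, STACK_TYPE_IRQ));	/* 0  */
		printf("%d\n", visit(&mask, STACK_TYPE_TASK));	/* 0  */
		printf("%d\n", visit(&mask, STACK_TYPE_IRQ));	/* -1 */
		return 0;
	}
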
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 272c9f1f05f3..6c5defd6569a 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -1,345 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (C) 1991, 1992 Linus Torvalds
* Copyright (C) 2000, 2001, 2002 Andi Kleen, SuSE Labs
*/
+#include <linux/sched/debug.h>
#include <linux/kallsyms.h>
#include <linux/kprobes.h>
#include <linux/uaccess.h>
#include <linux/hardirq.h>
#include <linux/kdebug.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/ptrace.h>
#include <linux/kexec.h>
#include <linux/sysfs.h>
#include <linux/bug.h>
#include <linux/nmi.h>
+#include <asm/cpu_entry_area.h>
#include <asm/stacktrace.h>
-#include "dumpstack.h"
-
-#define N_EXCEPTION_STACKS_END \
- (N_EXCEPTION_STACKS + DEBUG_STKSZ/EXCEPTION_STKSZ - 2)
-
-static char x86_stack_ids[][8] = {
- [ DEBUG_STACK-1 ] = "#DB",
- [ NMI_STACK-1 ] = "NMI",
- [ DOUBLEFAULT_STACK-1 ] = "#DF",
- [ STACKFAULT_STACK-1 ] = "#SS",
- [ MCE_STACK-1 ] = "#MC",
-#if DEBUG_STKSZ > EXCEPTION_STKSZ
- [ N_EXCEPTION_STACKS ...
- N_EXCEPTION_STACKS_END ] = "#DB[?]"
-#endif
+static const char * const exception_stack_names[] = {
+ [ ESTACK_DF ] = "#DF",
+ [ ESTACK_NMI ] = "NMI",
+ [ ESTACK_DB ] = "#DB",
+ [ ESTACK_MCE ] = "#MC",
+ [ ESTACK_VC ] = "#VC",
+ [ ESTACK_VC2 ] = "#VC2",
};
-static unsigned long *in_exception_stack(unsigned cpu, unsigned long stack,
- unsigned *usedp, char **idp)
+const char *stack_type_name(enum stack_type type)
{
- unsigned k;
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
- /*
- * Iterate over all exception stacks, and figure out whether
- * 'stack' is in one of them:
- */
- for (k = 0; k < N_EXCEPTION_STACKS; k++) {
- unsigned long end = per_cpu(orig_ist, cpu).ist[k];
- /*
- * Is 'stack' above this exception frame's end?
- * If yes then skip to the next frame.
- */
- if (stack >= end)
- continue;
- /*
- * Is 'stack' above this exception frame's start address?
- * If yes then we found the right frame.
- */
- if (stack >= end - EXCEPTION_STKSZ) {
- /*
- * Make sure we only iterate through an exception
- * stack once. If it comes up for the second time
- * then there's something wrong going on - just
- * break out and return NULL:
- */
- if (*usedp & (1U << k))
- break;
- *usedp |= 1U << k;
- *idp = x86_stack_ids[k];
- return (unsigned long *)end;
- }
+ if (type == STACK_TYPE_TASK)
+ return "TASK";
+
+ if (type == STACK_TYPE_IRQ)
+ return "IRQ";
+
+ if (type == STACK_TYPE_SOFTIRQ)
+ return "SOFTIRQ";
+
+ if (type == STACK_TYPE_ENTRY) {
/*
- * If this is a debug stack, and if it has a larger size than
- * the usual exception stacks, then 'stack' might still
- * be within the lower portion of the debug stack:
+ * On 64-bit, we have a generic entry stack that we
+ * use for all the kernel entry points, including
+ * SYSENTER.
*/
-#if DEBUG_STKSZ > EXCEPTION_STKSZ
- if (k == DEBUG_STACK - 1 && stack >= end - DEBUG_STKSZ) {
- unsigned j = N_EXCEPTION_STACKS - 1;
-
- /*
- * Black magic. A large debug stack is composed of
- * multiple exception stack entries, which we
- * iterate through now. Dont look:
- */
- do {
- ++j;
- end -= EXCEPTION_STKSZ;
- x86_stack_ids[j][4] = '1' +
- (j - N_EXCEPTION_STACKS);
- } while (stack < end - EXCEPTION_STKSZ);
- if (*usedp & (1U << j))
- break;
- *usedp |= 1U << j;
- *idp = x86_stack_ids[j];
- return (unsigned long *)end;
- }
-#endif
+ return "ENTRY_TRAMPOLINE";
}
+
+ if (type >= STACK_TYPE_EXCEPTION && type <= STACK_TYPE_EXCEPTION_LAST)
+ return exception_stack_names[type - STACK_TYPE_EXCEPTION];
+
return NULL;
}
-static inline int
-in_irq_stack(unsigned long *stack, unsigned long *irq_stack,
- unsigned long *irq_stack_end)
-{
- return (stack >= irq_stack && stack < irq_stack_end);
-}
+/**
+ * struct estack_pages - Page descriptor for exception stacks
+ * @offs: Offset from the start of the exception stack area
+ * @size: Size of the exception stack
+ * @type: Type to store in the stack_info struct
+ */
+struct estack_pages {
+ u32 offs;
+ u16 size;
+ u16 type;
+};
+
+#define EPAGERANGE(st) \
+ [PFN_DOWN(CEA_ESTACK_OFFS(st)) ... \
+ PFN_DOWN(CEA_ESTACK_OFFS(st) + CEA_ESTACK_SIZE(st) - 1)] = { \
+ .offs = CEA_ESTACK_OFFS(st), \
+ .size = CEA_ESTACK_SIZE(st), \
+ .type = STACK_TYPE_EXCEPTION + ESTACK_ ##st, }
/*
- * We are returning from the irq stack and go to the previous one.
- * If the previous stack is also in the irq stack, then bp in the first
- * frame of the irq stack points to the previous, interrupted one.
- * Otherwise we have another level of indirection: We first save
- * the bp of the previous stack, then we switch the stack to the irq one
- * and save a new bp that links to the previous one.
- * (See save_args())
+ * Array of exception stack page descriptors. If the stack is larger than
+ * PAGE_SIZE, all pages covering a particular stack will have the same
+ * info. The guard pages, including the unmapped DB2 stack, are zeroed
+ * out.
*/
-static inline unsigned long
-fixup_bp_irq_link(unsigned long bp, unsigned long *stack,
- unsigned long *irq_stack, unsigned long *irq_stack_end)
+static const
+struct estack_pages estack_pages[CEA_ESTACK_PAGES] ____cacheline_aligned = {
+ EPAGERANGE(DF),
+ EPAGERANGE(NMI),
+ EPAGERANGE(DB),
+ EPAGERANGE(MCE),
+ EPAGERANGE(VC),
+ EPAGERANGE(VC2),
+};
+
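
EPAGERANGE() uses GCC's designated range initializers so that every page of a given exception stack maps to one shared descriptor, while untouched guard pages stay zeroed. A standalone sketch of the same trick; the offset and size (0x3000/0x2000) are made-up illustrative values:

	#include <stdio.h>

	#define PAGE_SHIFT	12
	#define PFN_DOWN(x)	((x) >> PAGE_SHIFT)

	struct desc { unsigned int offs; unsigned short size; unsigned short type; };

	/* Assume a hypothetical 8KB stack at offset 0x3000 inside the area: */
	#define STK_OFFS	0x3000UL
	#define STK_SIZE	0x2000UL

	static const struct desc pages[8] = {
		[PFN_DOWN(STK_OFFS) ... PFN_DOWN(STK_OFFS + STK_SIZE - 1)] = {
			.offs = STK_OFFS, .size = STK_SIZE, .type = 42,
		},
	};

	int main(void)
	{
		/* Pages 3 and 4 share one descriptor; the rest are zeroed guards. */
		for (int i = 0; i < 8; i++)
			printf("page %d: size=%u\n", i, (unsigned)pages[i].size);
		return 0;
	}
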
+static __always_inline bool in_exception_stack(unsigned long *stack, struct stack_info *info)
{
-#ifdef CONFIG_FRAME_POINTER
- struct stack_frame *frame = (struct stack_frame *)bp;
- unsigned long next;
-
- if (!in_irq_stack(stack, irq_stack, irq_stack_end)) {
- if (!probe_kernel_address(&frame->next_frame, next))
- return next;
- else
- WARN_ONCE(1, "Perf: bad frame pointer = %p in "
- "callchain\n", &frame->next_frame);
- }
-#endif
- return bp;
-}
+ unsigned long begin, end, stk = (unsigned long)stack;
+ const struct estack_pages *ep;
+ struct pt_regs *regs;
+ unsigned int k;
-/*
- * x86-64 can have up to three kernel stacks:
- * process stack
- * interrupt stack
- * severe exception (double fault, nmi, stack fault, debug, mce) hardware stack
- */
+ BUILD_BUG_ON(N_EXCEPTION_STACKS != 6);
-void dump_trace(struct task_struct *task, struct pt_regs *regs,
- unsigned long *stack, unsigned long bp,
- const struct stacktrace_ops *ops, void *data)
+ begin = (unsigned long)__this_cpu_read(cea_exception_stacks);
+ /*
+	 * Handle the case where the stack trace is collected _before_
+	 * cea_exception_stacks has been initialized.
+ */
+ if (!begin)
+ return false;
+
+ end = begin + sizeof(struct cea_exception_stacks);
+ /* Bail if @stack is outside the exception stack area. */
+ if (stk < begin || stk >= end)
+ return false;
+
+ /* Calc page offset from start of exception stacks */
+ k = (stk - begin) >> PAGE_SHIFT;
+ /* Lookup the page descriptor */
+ ep = &estack_pages[k];
+ /* Guard page? */
+ if (!ep->size)
+ return false;
+
+ begin += (unsigned long)ep->offs;
+ end = begin + (unsigned long)ep->size;
+ regs = (struct pt_regs *)end - 1;
+
+ info->type = ep->type;
+ info->begin = (unsigned long *)begin;
+ info->end = (unsigned long *)end;
+ info->next_sp = (unsigned long *)regs->sp;
+ return true;
+}
+
+static __always_inline bool in_irq_stack(unsigned long *stack, struct stack_info *info)
{
- const unsigned cpu = get_cpu();
- unsigned long *irq_stack_end =
- (unsigned long *)per_cpu(irq_stack_ptr, cpu);
- unsigned used = 0;
- struct thread_info *tinfo;
- int graph = 0;
-
- if (!task)
- task = current;
-
- if (!stack) {
- unsigned long dummy;
- stack = &dummy;
- if (task && task != current)
- stack = (unsigned long *)task->thread.sp;
- }
+ unsigned long *end = (unsigned long *)this_cpu_read(hardirq_stack_ptr);
+ unsigned long *begin;
-#ifdef CONFIG_FRAME_POINTER
- if (!bp) {
- if (task == current) {
- /* Grab bp right from our regs */
- get_bp(bp);
- } else {
- /* bp is the last reg pushed by switch_to */
- bp = *(unsigned long *) task->thread.sp;
- }
- }
-#endif
+ /*
+	 * @end points directly to the topmost stack entry to avoid a -8

+ * adjustment in the stack switch hotpath. Adjust it back before
+ * calculating @begin.
+ */
+ end++;
+ begin = end - (IRQ_STACK_SIZE / sizeof(long));
/*
- * Print function call entries in all stacks, starting at the
- * current stack address. If the stacks consist of nested
- * exceptions
+	 * Due to the switching logic, RSP can never be == @end because the
+ * final operation is 'popq %rsp' which means after that RSP points
+ * to the original stack and not to @end.
*/
- tinfo = task_thread_info(task);
- for (;;) {
- char *id;
- unsigned long *estack_end;
- estack_end = in_exception_stack(cpu, (unsigned long)stack,
- &used, &id);
-
- if (estack_end) {
- if (ops->stack(data, id) < 0)
- break;
-
- bp = ops->walk_stack(tinfo, stack, bp, ops,
- data, estack_end, &graph);
- ops->stack(data, "<EOE>");
- /*
- * We link to the next stack via the
- * second-to-last pointer (index -2 to end) in the
- * exception stack:
- */
- stack = (unsigned long *) estack_end[-2];
- continue;
- }
- if (irq_stack_end) {
- unsigned long *irq_stack;
- irq_stack = irq_stack_end -
- (IRQ_STACK_SIZE - 64) / sizeof(*irq_stack);
-
- if (in_irq_stack(stack, irq_stack, irq_stack_end)) {
- if (ops->stack(data, "IRQ") < 0)
- break;
- bp = ops->walk_stack(tinfo, stack, bp,
- ops, data, irq_stack_end, &graph);
- /*
- * We link to the next stack (which would be
- * the process stack normally) the last
- * pointer (index -1 to end) in the IRQ stack:
- */
- stack = (unsigned long *) (irq_stack_end[-1]);
- bp = fixup_bp_irq_link(bp, stack, irq_stack,
- irq_stack_end);
- irq_stack_end = NULL;
- ops->stack(data, "EOI");
- continue;
- }
- }
- break;
- }
+ if (stack < begin || stack >= end)
+ return false;
+
+ info->type = STACK_TYPE_IRQ;
+ info->begin = begin;
+ info->end = end;
/*
- * This handles the process stack:
+ * The next stack pointer is stored at the top of the irq stack
+ * before switching to the irq stack. Actual stack entries are all
+ * below that.
*/
- bp = ops->walk_stack(tinfo, stack, bp, ops, data, NULL, &graph);
- put_cpu();
+ info->next_sp = (unsigned long *)*(end - 1);
+
+ return true;
}
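
The end adjustment and the *(end - 1) read above can be mirrored in userspace; the sizes and addresses here are arbitrary stand-ins:

	#include <stdio.h>

	#define IRQ_WORDS 512

	static unsigned long irq_stack[IRQ_WORDS];

	int main(void)
	{
		unsigned long task_stack[8];

		/* hardirq_stack_ptr points at the topmost entry (end - 1): */
		unsigned long *ptr = &irq_stack[IRQ_WORDS - 1];

		/* Entry code parks the old RSP in that top slot: */
		*ptr = (unsigned long)&task_stack[7];

		/* The unwinder adjusts back to one-past-the-end, then reads it: */
		unsigned long *end = ptr + 1;
		unsigned long *begin = end - IRQ_WORDS;
		unsigned long *next_sp = (unsigned long *)*(end - 1);

		printf("spans whole array: %d, next_sp recovered: %d\n",
		       begin == irq_stack, next_sp == &task_stack[7]);
		return 0;
	}
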
-EXPORT_SYMBOL(dump_trace);
-void
-show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
- unsigned long *sp, unsigned long bp, char *log_lvl)
+bool noinstr get_stack_info_noinstr(unsigned long *stack, struct task_struct *task,
+ struct stack_info *info)
{
- unsigned long *irq_stack_end;
- unsigned long *irq_stack;
- unsigned long *stack;
- int cpu;
- int i;
+ if (in_task_stack(stack, task, info))
+ return true;
- preempt_disable();
- cpu = smp_processor_id();
+ if (task != current)
+ return false;
- irq_stack_end = (unsigned long *)(per_cpu(irq_stack_ptr, cpu));
- irq_stack = (unsigned long *)(per_cpu(irq_stack_ptr, cpu) - IRQ_STACK_SIZE);
+ if (in_exception_stack(stack, info))
+ return true;
- /*
- * Debugging aid: "show_stack(NULL, NULL);" prints the
- * back trace for this cpu:
- */
- if (sp == NULL) {
- if (task)
- sp = (unsigned long *)task->thread.sp;
- else
- sp = (unsigned long *)&sp;
- }
+ if (in_irq_stack(stack, info))
+ return true;
- stack = sp;
- for (i = 0; i < kstack_depth_to_print; i++) {
- if (stack >= irq_stack && stack <= irq_stack_end) {
- if (stack == irq_stack_end) {
- stack = (unsigned long *) (irq_stack_end[-1]);
- printk(" <EOI> ");
- }
- } else {
- if (((long) stack & (THREAD_SIZE-1)) == 0)
- break;
- }
- if (i && ((i % STACKSLOTS_PER_LINE) == 0))
- printk("\n%s", log_lvl);
- printk(" %016lx", *stack++);
- touch_nmi_watchdog();
- }
- preempt_enable();
+ if (in_entry_stack(stack, info))
+ return true;
- printk("\n");
- show_trace_log_lvl(task, regs, sp, bp, log_lvl);
+ return false;
}
-void show_registers(struct pt_regs *regs)
+int get_stack_info(unsigned long *stack, struct task_struct *task,
+ struct stack_info *info, unsigned long *visit_mask)
{
- int i;
- unsigned long sp;
- const int cpu = smp_processor_id();
- struct task_struct *cur = current;
-
- sp = regs->sp;
- printk("CPU %d ", cpu);
- print_modules();
- __show_regs(regs, 1);
- printk("Process %s (pid: %d, threadinfo %p, task %p)\n",
- cur->comm, cur->pid, task_thread_info(cur), cur);
+ task = task ? : current;
+
+ if (!stack)
+ goto unknown;
+
+ if (!get_stack_info_noinstr(stack, task, info))
+ goto unknown;
/*
- * When in-kernel, we also print out the stack and code at the
- * time of the fault..
+ * Make sure we don't iterate through any given stack more than once.
+ * If it comes up a second time then there's something wrong going on:
+ * just break out and report an unknown stack type.
*/
- if (!user_mode(regs)) {
- unsigned int code_prologue = code_bytes * 43 / 64;
- unsigned int code_len = code_bytes;
- unsigned char c;
- u8 *ip;
-
- printk(KERN_EMERG "Stack:\n");
- show_stack_log_lvl(NULL, regs, (unsigned long *)sp,
- regs->bp, KERN_EMERG);
-
- printk(KERN_EMERG "Code: ");
-
- ip = (u8 *)regs->ip - code_prologue;
- if (ip < (u8 *)PAGE_OFFSET || probe_kernel_address(ip, c)) {
- /* try starting at IP */
- ip = (u8 *)regs->ip;
- code_len = code_len - code_prologue + 1;
- }
- for (i = 0; i < code_len; i++, ip++) {
- if (ip < (u8 *)PAGE_OFFSET ||
- probe_kernel_address(ip, c)) {
- printk(" Bad RIP value.");
- break;
- }
- if (ip == (u8 *)regs->ip)
- printk("<%02x> ", c);
- else
- printk("%02x ", c);
+ if (visit_mask) {
+ if (*visit_mask & (1UL << info->type)) {
+ if (task == current)
+ printk_deferred_once(KERN_WARNING "WARNING: stack recursion on stack type %d\n", info->type);
+ goto unknown;
}
+ *visit_mask |= 1UL << info->type;
}
- printk("\n");
-}
-
-int is_valid_bugaddr(unsigned long ip)
-{
- unsigned short ud2;
- if (__copy_from_user(&ud2, (const void __user *) ip, sizeof(ud2)))
- return 0;
+ return 0;
- return ud2 == 0x0b0f;
+unknown:
+ info->type = STACK_TYPE_UNKNOWN;
+ return -EINVAL;
}
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 0d6fc71bedb1..b15b97d3cb52 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1,41 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
- * Handle the memory map.
- * The functions here do the job until bootmem takes over.
+ * Low level x86 E820 memory map handling functions.
*
- * Getting sanitize_e820_map() in sync with i386 version by applying change:
- * - Provisions for empty E820 memory regions (reported by certain BIOSes).
- * Alex Achenbach <xela@slit.de>, December 2002.
- * Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
+ * The firmware and bootloader pass us the "E820 table", which is the primary
+ * physical memory layout description available on x86 systems.
*
+ * The kernel takes the E820 memory layout and optionally modifies it with
+ * quirks and other tweaks, and feeds that into the generic Linux memory
+ * allocation code routines via a platform independent interface (memblock, etc.).
*/
-#include <linux/kernel.h>
-#include <linux/types.h>
-#include <linux/init.h>
-#include <linux/bootmem.h>
-#include <linux/pfn.h>
+#include <linux/crash_dump.h>
+#include <linux/memblock.h>
#include <linux/suspend.h>
+#include <linux/acpi.h>
#include <linux/firmware-map.h>
+#include <linux/sort.h>
+#include <linux/memory_hotplug.h>
+#include <linux/kvm_types.h>
-#include <asm/e820.h>
-#include <asm/proto.h>
+#include <asm/e820/api.h>
#include <asm/setup.h>
/*
- * The e820 map is the map that gets modified e.g. with command line parameters
- * and that is also registered with modifications in the kernel resource tree
- * with the iomem_resource as parent.
+ * We organize the E820 table into three main data structures:
*
- * The e820_saved is directly saved after the BIOS-provided memory map is
- * copied. It doesn't get modified afterwards. It's registered for the
- * /sys/firmware/memmap interface.
+ * - 'e820_table_firmware': the original firmware version passed to us by the
+ * bootloader - not modified by the kernel. It is composed of two parts:
+ * the first 128 E820 memory entries in boot_params.e820_table and the remaining
+ * (if any) entries of the SETUP_E820_EXT nodes. We use this to:
*
- * That memory map is not modified and is used as base for kexec. The kexec'd
- * kernel should get the same memory map as the firmware provides. Then the
- * user can e.g. boot the original kernel with mem=1G while still booting the
- * next kernel with full memory.
+ * - the hibernation code uses it to generate a kernel-independent CRC32
+ * checksum of the physical memory layout of a system.
+ *
+ * - 'e820_table_kexec': a slightly modified (by the kernel) firmware version
+ * passed to us by the bootloader - the major difference between
+ * e820_table_firmware[] and this one is that e820_table_kexec[]
+ * might be modified by the kexec itself to fake an mptable.
+ * We use this to:
+ *
+ * - kexec, which is a bootloader in disguise, uses the original E820
+ * layout to pass to the kexec-ed kernel. This way the original kernel
+ * can have a restricted E820 map while the kexec()-ed kexec-kernel
+ * can have access to full memory - etc.
+ *
+ * Export the memory layout via /sys/firmware/memmap. kexec-tools uses
+ * the entries to create an E820 table for the kexec kernel.
+ *
+ * kexec_file_load in-kernel code uses the table for the kexec kernel.
+ *
+ * - 'e820_table': this is the main E820 table that is massaged by the
+ * low level x86 platform code, or modified by boot parameters, before
+ * passed on to higher level MM layers.
+ *
+ * Once the E820 map has been converted to the standard Linux memory layout
+ * information, its role stops - modifying it has no effect and does not get
+ * re-propagated. So its main role is temporary bootstrap storage of
+ * firmware-specific memory layout data during early bootup.
*/
-struct e820map e820;
-struct e820map e820_saved;
+static struct e820_table e820_table_init __initdata;
+static struct e820_table e820_table_kexec_init __initdata;
+static struct e820_table e820_table_firmware_init __initdata;
+
+struct e820_table *e820_table __refdata = &e820_table_init;
+struct e820_table *e820_table_kexec __refdata = &e820_table_kexec_init;
+struct e820_table *e820_table_firmware __refdata = &e820_table_firmware_init;
/* For PCI or other memory-mapped resources */
unsigned long pci_mem_start = 0xaeedbabe;
@@ -47,141 +75,166 @@ EXPORT_SYMBOL(pci_mem_start);
* This function checks if any part of the range <start,end> is mapped
* with type.
*/
-int
-e820_any_mapped(u64 start, u64 end, unsigned type)
+static bool _e820__mapped_any(struct e820_table *table,
+ u64 start, u64 end, enum e820_type type)
{
int i;
- for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *ei = &e820.map[i];
+ for (i = 0; i < table->nr_entries; i++) {
+ struct e820_entry *entry = &table->entries[i];
- if (type && ei->type != type)
+ if (type && entry->type != type)
continue;
- if (ei->addr >= end || ei->addr + ei->size <= start)
+ if (entry->addr >= end || entry->addr + entry->size <= start)
continue;
- return 1;
+ return true;
}
- return 0;
+ return false;
}
-EXPORT_SYMBOL_GPL(e820_any_mapped);
+
+bool e820__mapped_raw_any(u64 start, u64 end, enum e820_type type)
+{
+ return _e820__mapped_any(e820_table_firmware, start, end, type);
+}
+EXPORT_SYMBOL_FOR_KVM(e820__mapped_raw_any);
+
+bool e820__mapped_any(u64 start, u64 end, enum e820_type type)
+{
+ return _e820__mapped_any(e820_table, start, end, type);
+}
+EXPORT_SYMBOL_GPL(e820__mapped_any);
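
The skip test above — entry->addr >= end || entry->addr + entry->size <= start — treats entries as half-open intervals. A small self-contained check of the edge cases (the values are arbitrary):

	#include <stdbool.h>
	#include <stdio.h>

	/* Half-open [addr, addr+size) vs. [start, end) — negation of the skip test: */
	static bool overlaps(unsigned long long addr, unsigned long long size,
			     unsigned long long start, unsigned long long end)
	{
		return !(addr >= end || addr + size <= start);
	}

	int main(void)
	{
		/* Entry [0x1000, 0x3000) vs. three query ranges: */
		printf("%d\n", overlaps(0x1000, 0x2000, 0x0000, 0x1000)); /* 0: meets only at edge */
		printf("%d\n", overlaps(0x1000, 0x2000, 0x2fff, 0x5000)); /* 1: one byte shared  */
		printf("%d\n", overlaps(0x1000, 0x2000, 0x3000, 0x4000)); /* 0: starts at entry end */
		return 0;
	}
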
/*
- * This function checks if the entire range <start,end> is mapped with type.
+ * This function checks if the entire <start,end> range is mapped with 'type'.
*
- * Note: this function only works correct if the e820 table is sorted and
- * not-overlapping, which is the case
+ * Note: this function only works correctly once the E820 table is sorted and
+ * non-overlapping (at least for the range specified), which is normally the case.
*/
-int __init e820_all_mapped(u64 start, u64 end, unsigned type)
+static struct e820_entry *__e820__mapped_all(u64 start, u64 end,
+ enum e820_type type)
{
int i;
- for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *ei = &e820.map[i];
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ struct e820_entry *entry = &e820_table->entries[i];
- if (type && ei->type != type)
+ if (type && entry->type != type)
continue;
- /* is the region (part) in overlap with the current region ?*/
- if (ei->addr >= end || ei->addr + ei->size <= start)
+
+ /* Is the region (part) in overlap with the current region? */
+ if (entry->addr >= end || entry->addr + entry->size <= start)
continue;
- /* if the region is at the beginning of <start,end> we move
- * start to the end of the region since it's ok until there
+ /*
+ * If the region is at the beginning of <start,end> we move
+ * 'start' to the end of the region since it's ok until there
*/
- if (ei->addr <= start)
- start = ei->addr + ei->size;
+ if (entry->addr <= start)
+ start = entry->addr + entry->size;
+
/*
- * if start is now at or beyond end, we're done, full
- * coverage
+ * If 'start' is now at or beyond 'end', we're done, full
+ * coverage of the desired range exists:
*/
if (start >= end)
- return 1;
+ return entry;
}
- return 0;
+
+ return NULL;
+}
+
+/*
+ * This function checks if the entire range <start,end> is mapped with type.
+ */
+bool __init e820__mapped_all(u64 start, u64 end, enum e820_type type)
+{
+ return __e820__mapped_all(start, end, type);
+}
+
+/*
+ * This function returns the type associated with the range <start,end>.
+ */
+int e820__get_entry_type(u64 start, u64 end)
+{
+ struct e820_entry *entry = __e820__mapped_all(start, end, 0);
+
+ return entry ? entry->type : -EINVAL;
}
/*
- * Add a memory region to the kernel e820 map.
+ * Add a memory region to the kernel E820 map.
*/
-static void __init __e820_add_region(struct e820map *e820x, u64 start, u64 size,
- int type)
+static void __init __e820__range_add(struct e820_table *table, u64 start, u64 size, enum e820_type type)
{
- int x = e820x->nr_map;
+ int x = table->nr_entries;
- if (x >= ARRAY_SIZE(e820x->map)) {
- printk(KERN_ERR "Ooops! Too many entries in the memory map!\n");
+ if (x >= ARRAY_SIZE(table->entries)) {
+ pr_err("too many entries; ignoring [mem %#010llx-%#010llx]\n",
+ start, start + size - 1);
return;
}
- e820x->map[x].addr = start;
- e820x->map[x].size = size;
- e820x->map[x].type = type;
- e820x->nr_map++;
+ table->entries[x].addr = start;
+ table->entries[x].size = size;
+ table->entries[x].type = type;
+ table->nr_entries++;
}
-void __init e820_add_region(u64 start, u64 size, int type)
+void __init e820__range_add(u64 start, u64 size, enum e820_type type)
{
- __e820_add_region(&e820, start, size, type);
+ __e820__range_add(e820_table, start, size, type);
}
-static void __init e820_print_type(u32 type)
+static void __init e820_print_type(enum e820_type type)
{
switch (type) {
- case E820_RAM:
- case E820_RESERVED_KERN:
- printk(KERN_CONT "(usable)");
- break;
- case E820_RESERVED:
- printk(KERN_CONT "(reserved)");
- break;
- case E820_ACPI:
- printk(KERN_CONT "(ACPI data)");
- break;
- case E820_NVS:
- printk(KERN_CONT "(ACPI NVS)");
- break;
- case E820_UNUSABLE:
- printk(KERN_CONT "(unusable)");
- break;
- default:
- printk(KERN_CONT "type %u", type);
- break;
+ case E820_TYPE_RAM: pr_cont("usable"); break;
+ case E820_TYPE_RESERVED: pr_cont("reserved"); break;
+ case E820_TYPE_SOFT_RESERVED: pr_cont("soft reserved"); break;
+ case E820_TYPE_ACPI: pr_cont("ACPI data"); break;
+ case E820_TYPE_NVS: pr_cont("ACPI NVS"); break;
+ case E820_TYPE_UNUSABLE: pr_cont("unusable"); break;
+ case E820_TYPE_PMEM: /* Fall through: */
+ case E820_TYPE_PRAM: pr_cont("persistent (type %u)", type); break;
+ default: pr_cont("type %u", type); break;
}
}
-void __init e820_print_map(char *who)
+void __init e820__print_table(char *who)
{
int i;
- for (i = 0; i < e820.nr_map; i++) {
- printk(KERN_INFO " %s: %016Lx - %016Lx ", who,
- (unsigned long long) e820.map[i].addr,
- (unsigned long long)
- (e820.map[i].addr + e820.map[i].size));
- e820_print_type(e820.map[i].type);
- printk(KERN_CONT "\n");
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ pr_info("%s: [mem %#018Lx-%#018Lx] ",
+ who,
+ e820_table->entries[i].addr,
+ e820_table->entries[i].addr + e820_table->entries[i].size - 1);
+
+ e820_print_type(e820_table->entries[i].type);
+ pr_cont("\n");
}
}
/*
- * Sanitize the BIOS e820 map.
+ * Sanitize an E820 map.
*
- * Some e820 responses include overlapping entries. The following
- * replaces the original e820 map with a new one, removing overlaps,
+ * Some E820 layouts include overlapping entries. The following
+ * replaces the original E820 map with a new one, removing overlaps,
* and resolving conflicting memory types in favor of highest
* numbered type.
*
- * The input parameter biosmap points to an array of 'struct
- * e820entry' which on entry has elements in the range [0, *pnr_map)
- * valid, and which has space for up to max_nr_map entries.
- * On return, the resulting sanitized e820 map entries will be in
- * overwritten in the same location, starting at biosmap.
+ * The input parameter 'entries' points to an array of 'struct
+ * e820_entry' which on entry has elements in the range [0, *nr_entries)
+ * valid, and which has space for up to max_nr_entries entries.
+ * On return, the resulting sanitized E820 map entries will be
+ * overwritten in the same location, starting at 'entries'.
*
- * The integer pointed to by pnr_map must be valid on entry (the
- * current number of valid entries located at biosmap) and will
- * be updated on return, with the new number of valid entries
- * (something no more than max_nr_map.)
+ * The integer pointed to by nr_entries must be valid on entry (the
+ * current number of valid entries located at 'entries'). If the
+ * sanitizing succeeds the *nr_entries will be updated with the new
+ * number of valid entries (something no more than max_nr_entries).
*
- * The return value from sanitize_e820_map() is zero if it
+ * The return value from e820__update_table() is zero if it
* successfully 'sanitized' the map entries passed in, and is -1
* if it did nothing, which can happen if either of (1) it was
* only passed one map entry, or (2) any of the input map entries
@@ -223,188 +276,173 @@ void __init e820_print_map(char *who)
* ____________________33__
* ______________________4_
*/
+struct change_member {
+ /* Pointer to the original entry: */
+ struct e820_entry *entry;
+ /* Address for this change point: */
+ unsigned long long addr;
+};
+
+static struct change_member change_point_list[2*E820_MAX_ENTRIES] __initdata;
+static struct change_member *change_point[2*E820_MAX_ENTRIES] __initdata;
+static struct e820_entry *overlap_list[E820_MAX_ENTRIES] __initdata;
+static struct e820_entry new_entries[E820_MAX_ENTRIES] __initdata;
+
+static int __init cpcompare(const void *a, const void *b)
+{
+ struct change_member * const *app = a, * const *bpp = b;
+ const struct change_member *ap = *app, *bp = *bpp;
+
+ /*
+ * Inputs are pointers to two elements of change_point[]. If their
+ * addresses are not equal, their difference dominates. If the addresses
+ * are equal, then consider one that represents the end of its region
+ * to be greater than one that does not.
+ */
+ if (ap->addr != bp->addr)
+ return ap->addr > bp->addr ? 1 : -1;
+
+ return (ap->addr != ap->entry->addr) - (bp->addr != bp->entry->addr);
+}
+
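
The comparator's tie-break — at equal addresses a region start sorts before a region end — is what keeps adjacent regions from producing a spurious zero-size segment between them. A runnable model of the same ordering, using qsort() in place of the kernel's sort():

	#include <stdio.h>
	#include <stdlib.h>

	struct entry { unsigned long long addr, size; };
	struct cp { struct entry *entry; unsigned long long addr; };

	/* Same ordering rule as cpcompare() above: */
	static int cpcmp(const void *a, const void *b)
	{
		const struct cp *ap = *(struct cp * const *)a;
		const struct cp *bp = *(struct cp * const *)b;

		if (ap->addr != bp->addr)
			return ap->addr > bp->addr ? 1 : -1;
		/* Equal addresses: a region end sorts after a region start. */
		return (ap->addr != ap->entry->addr) - (bp->addr != bp->entry->addr);
	}

	int main(void)
	{
		struct entry e1 = { 0x1000, 0x1000 };	/* [0x1000, 0x2000) */
		struct entry e2 = { 0x2000, 0x1000 };	/* [0x2000, 0x3000) */
		struct cp pts[4] = {
			{ &e2, 0x3000 }, { &e1, 0x1000 },
			{ &e2, 0x2000 }, { &e1, 0x2000 },
		};
		struct cp *p[4] = { &pts[0], &pts[1], &pts[2], &pts[3] };

		qsort(p, 4, sizeof(*p), cpcmp);
		for (int i = 0; i < 4; i++)	/* 1000 start, 2000 start, 2000 end, 3000 end */
			printf("%llx %s\n", p[i]->addr,
			       p[i]->addr == p[i]->entry->addr ? "start" : "end");
		return 0;
	}
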
+static bool e820_nomerge(enum e820_type type)
+{
+ /*
+ * These types may indicate distinct platform ranges aligned to
+ * numa node, protection domain, performance domain, or other
+ * boundaries. Do not merge them.
+ */
+ if (type == E820_TYPE_PRAM)
+ return true;
+ if (type == E820_TYPE_SOFT_RESERVED)
+ return true;
+ return false;
+}
-int __init sanitize_e820_map(struct e820entry *biosmap, int max_nr_map,
- u32 *pnr_map)
+int __init e820__update_table(struct e820_table *table)
{
- struct change_member {
- struct e820entry *pbios; /* pointer to original bios entry */
- unsigned long long addr; /* address for this change point */
- };
- static struct change_member change_point_list[2*E820_X_MAX] __initdata;
- static struct change_member *change_point[2*E820_X_MAX] __initdata;
- static struct e820entry *overlap_list[E820_X_MAX] __initdata;
- static struct e820entry new_bios[E820_X_MAX] __initdata;
- struct change_member *change_tmp;
- unsigned long current_type, last_type;
+ struct e820_entry *entries = table->entries;
+ u32 max_nr_entries = ARRAY_SIZE(table->entries);
+ enum e820_type current_type, last_type;
unsigned long long last_addr;
- int chgidx, still_changing;
- int overlap_entries;
- int new_bios_entry;
- int old_nr, new_nr, chg_nr;
- int i;
+ u32 new_nr_entries, overlap_entries;
+ u32 i, chg_idx, chg_nr;
- /* if there's only one memory region, don't bother */
- if (*pnr_map < 2)
+ /* If there's only one memory region, don't bother: */
+ if (table->nr_entries < 2)
return -1;
- old_nr = *pnr_map;
- BUG_ON(old_nr > max_nr_map);
+ BUG_ON(table->nr_entries > max_nr_entries);
- /* bail out if we find any unreasonable addresses in bios map */
- for (i = 0; i < old_nr; i++)
- if (biosmap[i].addr + biosmap[i].size < biosmap[i].addr)
+ /* Bail out if we find any unreasonable addresses in the map: */
+ for (i = 0; i < table->nr_entries; i++) {
+ if (entries[i].addr + entries[i].size < entries[i].addr)
return -1;
+ }
- /* create pointers for initial change-point information (for sorting) */
- for (i = 0; i < 2 * old_nr; i++)
+ /* Create pointers for initial change-point information (for sorting): */
+ for (i = 0; i < 2 * table->nr_entries; i++)
change_point[i] = &change_point_list[i];
- /* record all known change-points (starting and ending addresses),
- omitting those that are for empty memory regions */
- chgidx = 0;
- for (i = 0; i < old_nr; i++) {
- if (biosmap[i].size != 0) {
- change_point[chgidx]->addr = biosmap[i].addr;
- change_point[chgidx++]->pbios = &biosmap[i];
- change_point[chgidx]->addr = biosmap[i].addr +
- biosmap[i].size;
- change_point[chgidx++]->pbios = &biosmap[i];
- }
- }
- chg_nr = chgidx;
-
- /* sort change-point list by memory addresses (low -> high) */
- still_changing = 1;
- while (still_changing) {
- still_changing = 0;
- for (i = 1; i < chg_nr; i++) {
- unsigned long long curaddr, lastaddr;
- unsigned long long curpbaddr, lastpbaddr;
-
- curaddr = change_point[i]->addr;
- lastaddr = change_point[i - 1]->addr;
- curpbaddr = change_point[i]->pbios->addr;
- lastpbaddr = change_point[i - 1]->pbios->addr;
-
- /*
- * swap entries, when:
- *
- * curaddr > lastaddr or
- * curaddr == lastaddr and curaddr == curpbaddr and
- * lastaddr != lastpbaddr
- */
- if (curaddr < lastaddr ||
- (curaddr == lastaddr && curaddr == curpbaddr &&
- lastaddr != lastpbaddr)) {
- change_tmp = change_point[i];
- change_point[i] = change_point[i-1];
- change_point[i-1] = change_tmp;
- still_changing = 1;
- }
+ /*
+ * Record all known change-points (starting and ending addresses),
+ * omitting empty memory regions:
+ */
+ chg_idx = 0;
+ for (i = 0; i < table->nr_entries; i++) {
+ if (entries[i].size != 0) {
+ change_point[chg_idx]->addr = entries[i].addr;
+ change_point[chg_idx++]->entry = &entries[i];
+ change_point[chg_idx]->addr = entries[i].addr + entries[i].size;
+ change_point[chg_idx++]->entry = &entries[i];
}
}
-
- /* create a new bios memory map, removing overlaps */
- overlap_entries = 0; /* number of entries in the overlap table */
- new_bios_entry = 0; /* index for creating new bios map entries */
- last_type = 0; /* start with undefined memory type */
- last_addr = 0; /* start with 0 as last starting address */
-
- /* loop through change-points, determining affect on the new bios map */
- for (chgidx = 0; chgidx < chg_nr; chgidx++) {
- /* keep track of all overlapping bios entries */
- if (change_point[chgidx]->addr ==
- change_point[chgidx]->pbios->addr) {
- /*
- * add map entry to overlap list (> 1 entry
- * implies an overlap)
- */
- overlap_list[overlap_entries++] =
- change_point[chgidx]->pbios;
+ chg_nr = chg_idx;
+
+ /* Sort change-point list by memory addresses (low -> high): */
+ sort(change_point, chg_nr, sizeof(*change_point), cpcompare, NULL);
+
+ /* Create a new memory map, removing overlaps: */
+ overlap_entries = 0; /* Number of entries in the overlap table */
+ new_nr_entries = 0; /* Index for creating new map entries */
+ last_type = 0; /* Start with undefined memory type */
+ last_addr = 0; /* Start with 0 as last starting address */
+
+ /* Loop through change-points, determining effect on the new map: */
+ for (chg_idx = 0; chg_idx < chg_nr; chg_idx++) {
+ /* Keep track of all overlapping entries */
+ if (change_point[chg_idx]->addr == change_point[chg_idx]->entry->addr) {
+ /* Add map entry to overlap list (> 1 entry implies an overlap) */
+ overlap_list[overlap_entries++] = change_point[chg_idx]->entry;
} else {
- /*
- * remove entry from list (order independent,
- * so swap with last)
- */
+ /* Remove entry from list (order independent, so swap with last): */
for (i = 0; i < overlap_entries; i++) {
- if (overlap_list[i] ==
- change_point[chgidx]->pbios)
- overlap_list[i] =
- overlap_list[overlap_entries-1];
+ if (overlap_list[i] == change_point[chg_idx]->entry)
+ overlap_list[i] = overlap_list[overlap_entries-1];
}
overlap_entries--;
}
/*
- * if there are overlapping entries, decide which
+ * If there are overlapping entries, decide which
* "type" to use (larger value takes precedence --
* 1=usable, 2,3,4,4+=unusable)
*/
current_type = 0;
- for (i = 0; i < overlap_entries; i++)
+ for (i = 0; i < overlap_entries; i++) {
if (overlap_list[i]->type > current_type)
current_type = overlap_list[i]->type;
- /*
- * continue building up new bios map based on this
- * information
- */
- if (current_type != last_type) {
- if (last_type != 0) {
- new_bios[new_bios_entry].size =
- change_point[chgidx]->addr - last_addr;
- /*
- * move forward only if the new size
- * was non-zero
- */
- if (new_bios[new_bios_entry].size != 0)
- /*
- * no more space left for new
- * bios entries ?
- */
- if (++new_bios_entry >= max_nr_map)
+ }
+
+ /* Continue building up new map based on this information: */
+ if (current_type != last_type || e820_nomerge(current_type)) {
+ if (last_type) {
+ new_entries[new_nr_entries].size = change_point[chg_idx]->addr - last_addr;
+ /* Move forward only if the new size was non-zero: */
+ if (new_entries[new_nr_entries].size != 0)
+ /* No more space left for new entries? */
+ if (++new_nr_entries >= max_nr_entries)
break;
}
- if (current_type != 0) {
- new_bios[new_bios_entry].addr =
- change_point[chgidx]->addr;
- new_bios[new_bios_entry].type = current_type;
- last_addr = change_point[chgidx]->addr;
+ if (current_type) {
+ new_entries[new_nr_entries].addr = change_point[chg_idx]->addr;
+ new_entries[new_nr_entries].type = current_type;
+ last_addr = change_point[chg_idx]->addr;
}
last_type = current_type;
}
}
- /* retain count for new bios entries */
- new_nr = new_bios_entry;
- /* copy new bios mapping into original location */
- memcpy(biosmap, new_bios, new_nr * sizeof(struct e820entry));
- *pnr_map = new_nr;
+ /* Copy the new entries into the original location: */
+ memcpy(entries, new_entries, new_nr_entries*sizeof(*entries));
+ table->nr_entries = new_nr_entries;
return 0;
}
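
Worked through on a hypothetical two-entry overlapping map (the higher type number wins inside the overlap):

	Input (types: 1=RAM, 2=RESERVED):

	    entry A: [0x0000, 0x4000) type 1
	    entry B: [0x2000, 0x6000) type 2

	Sorted change points:  0x0000 (A start), 0x2000 (B start),
	                       0x4000 (A end),   0x6000 (B end)
	Overlap list walk:     {A} -> {A,B} -> {B} -> {}
	Type in force:          1       2       2     -

	Output:

	    [0x0000, 0x2000) type 1
	    [0x2000, 0x6000) type 2
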
-static int __init __append_e820_map(struct e820entry *biosmap, int nr_map)
+static int __init __append_e820_table(struct boot_e820_entry *entries, u32 nr_entries)
{
- while (nr_map) {
- u64 start = biosmap->addr;
- u64 size = biosmap->size;
- u64 end = start + size;
- u32 type = biosmap->type;
-
- /* Overflow in 64 bits? Ignore the memory map. */
- if (start > end)
+ struct boot_e820_entry *entry = entries;
+
+ while (nr_entries) {
+ u64 start = entry->addr;
+ u64 size = entry->size;
+ u64 end = start + size - 1;
+ u32 type = entry->type;
+
+ /* Ignore the entry on 64-bit overflow: */
+ if (start > end && likely(size))
return -1;
- e820_add_region(start, size, type);
+ e820__range_add(start, size, type);
- biosmap++;
- nr_map--;
+ entry++;
+ nr_entries--;
}
return 0;
}
/*
- * Copy the BIOS e820 map into a safe place.
+ * Copy the BIOS E820 map into a safe place.
*
* Sanity-check it while we're at it..
*
@@ -412,18 +450,17 @@ static int __init __append_e820_map(struct e820entry *biosmap, int nr_map)
* will have given us a memory map that we can use to properly
* set up memory. If we aren't, we'll fake a memory map.
*/
-static int __init append_e820_map(struct e820entry *biosmap, int nr_map)
+static int __init append_e820_table(struct boot_e820_entry *entries, u32 nr_entries)
{
/* Only one memory region (or negative)? Ignore it */
- if (nr_map < 2)
+ if (nr_entries < 2)
return -1;
- return __append_e820_map(biosmap, nr_map);
+ return __append_e820_table(entries, nr_entries);
}
-static u64 __init __e820_update_range(struct e820map *e820x, u64 start,
- u64 size, unsigned old_type,
- unsigned new_type)
+static u64 __init
+__e820__range_update(struct e820_table *table, u64 start, u64 size, enum e820_type old_type, enum e820_type new_type)
{
u64 end;
unsigned int i;
@@ -435,78 +472,74 @@ static u64 __init __e820_update_range(struct e820map *e820x, u64 start,
size = ULLONG_MAX - start;
end = start + size;
- printk(KERN_DEBUG "e820 update range: %016Lx - %016Lx ",
- (unsigned long long) start,
- (unsigned long long) end);
+ printk(KERN_DEBUG "e820: update [mem %#010Lx-%#010Lx] ", start, end - 1);
e820_print_type(old_type);
- printk(KERN_CONT " ==> ");
+ pr_cont(" ==> ");
e820_print_type(new_type);
- printk(KERN_CONT "\n");
+ pr_cont("\n");
- for (i = 0; i < e820x->nr_map; i++) {
- struct e820entry *ei = &e820x->map[i];
+ for (i = 0; i < table->nr_entries; i++) {
+ struct e820_entry *entry = &table->entries[i];
u64 final_start, final_end;
- u64 ei_end;
+ u64 entry_end;
- if (ei->type != old_type)
+ if (entry->type != old_type)
continue;
- ei_end = ei->addr + ei->size;
- /* totally covered by new range? */
- if (ei->addr >= start && ei_end <= end) {
- ei->type = new_type;
- real_updated_size += ei->size;
+ entry_end = entry->addr + entry->size;
+
+ /* Completely covered by new range? */
+ if (entry->addr >= start && entry_end <= end) {
+ entry->type = new_type;
+ real_updated_size += entry->size;
continue;
}
- /* new range is totally covered? */
- if (ei->addr < start && ei_end > end) {
- __e820_add_region(e820x, start, size, new_type);
- __e820_add_region(e820x, end, ei_end - end, ei->type);
- ei->size = start - ei->addr;
+ /* New range is completely covered? */
+ if (entry->addr < start && entry_end > end) {
+ __e820__range_add(table, start, size, new_type);
+ __e820__range_add(table, end, entry_end - end, entry->type);
+ entry->size = start - entry->addr;
real_updated_size += size;
continue;
}
- /* partially covered */
- final_start = max(start, ei->addr);
- final_end = min(end, ei_end);
+ /* Partially covered: */
+ final_start = max(start, entry->addr);
+ final_end = min(end, entry_end);
if (final_start >= final_end)
continue;
- __e820_add_region(e820x, final_start, final_end - final_start,
- new_type);
+ __e820__range_add(table, final_start, final_end - final_start, new_type);
real_updated_size += final_end - final_start;
/*
- * left range could be head or tail, so need to update
- * size at first.
+ * Left range could be head or tail, so need to update
+ * its size first:
*/
- ei->size -= final_end - final_start;
- if (ei->addr < final_start)
+ entry->size -= final_end - final_start;
+ if (entry->addr < final_start)
continue;
- ei->addr = final_end;
+
+ entry->addr = final_end;
}
return real_updated_size;
}
-u64 __init e820_update_range(u64 start, u64 size, unsigned old_type,
- unsigned new_type)
+u64 __init e820__range_update(u64 start, u64 size, enum e820_type old_type, enum e820_type new_type)
{
- return __e820_update_range(&e820, start, size, old_type, new_type);
+ return __e820__range_update(e820_table, start, size, old_type, new_type);
}
-static u64 __init e820_update_range_saved(u64 start, u64 size,
- unsigned old_type, unsigned new_type)
+u64 __init e820__range_update_table(struct e820_table *t, u64 start, u64 size,
+ enum e820_type old_type, enum e820_type new_type)
{
- return __e820_update_range(&e820_saved, start, size, old_type,
- new_type);
+ return __e820__range_update(t, start, size, old_type, new_type);
}
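
The three coverage cases handled by __e820__range_update() — entry fully inside the range, entry containing the range, and partial overlap — behave as follows on hypothetical entries, for an update of [0x2000, 0x5000) from RAM to RESERVED:

	entry [0x2500, 0x3000) RAM   fully inside     -> retyped to RESERVED in place
	entry [0x1000, 0x6000) RAM   contains range   -> [0x2000, 0x5000) RESERVED and
	                                                 [0x5000, 0x6000) RAM added,
	                                                 entry shrunk to [0x1000, 0x2000)
	entry [0x4000, 0x7000) RAM   partial overlap  -> [0x4000, 0x5000) RESERVED added,
	                                                 entry trimmed to [0x5000, 0x7000)
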
-/* make e820 not cover the range */
-u64 __init e820_remove_range(u64 start, u64 size, unsigned old_type,
- int checktype)
+/* Remove a range of memory from the E820 table: */
+u64 __init e820__range_remove(u64 start, u64 size, enum e820_type old_type, bool check_type)
{
int i;
u64 end;
@@ -516,99 +549,89 @@ u64 __init e820_remove_range(u64 start, u64 size, unsigned old_type,
size = ULLONG_MAX - start;
end = start + size;
- printk(KERN_DEBUG "e820 remove range: %016Lx - %016Lx ",
- (unsigned long long) start,
- (unsigned long long) end);
- if (checktype)
+ printk(KERN_DEBUG "e820: remove [mem %#010Lx-%#010Lx] ", start, end - 1);
+ if (check_type)
e820_print_type(old_type);
- printk(KERN_CONT "\n");
+ pr_cont("\n");
- for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *ei = &e820.map[i];
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ struct e820_entry *entry = &e820_table->entries[i];
u64 final_start, final_end;
- u64 ei_end;
+ u64 entry_end;
- if (checktype && ei->type != old_type)
+ if (check_type && entry->type != old_type)
continue;
- ei_end = ei->addr + ei->size;
- /* totally covered? */
- if (ei->addr >= start && ei_end <= end) {
- real_removed_size += ei->size;
- memset(ei, 0, sizeof(struct e820entry));
+ entry_end = entry->addr + entry->size;
+
+ /* Completely covered? */
+ if (entry->addr >= start && entry_end <= end) {
+ real_removed_size += entry->size;
+ memset(entry, 0, sizeof(*entry));
continue;
}
- /* new range is totally covered? */
- if (ei->addr < start && ei_end > end) {
- e820_add_region(end, ei_end - end, ei->type);
- ei->size = start - ei->addr;
+ /* Is the new range completely covered? */
+ if (entry->addr < start && entry_end > end) {
+ e820__range_add(end, entry_end - end, entry->type);
+ entry->size = start - entry->addr;
real_removed_size += size;
continue;
}
- /* partially covered */
- final_start = max(start, ei->addr);
- final_end = min(end, ei_end);
+ /* Partially covered: */
+ final_start = max(start, entry->addr);
+ final_end = min(end, entry_end);
if (final_start >= final_end)
continue;
+
real_removed_size += final_end - final_start;
/*
- * left range could be head or tail, so need to update
- * size at first.
+ * Left range could be head or tail, so need to update
+ * the size first:
*/
- ei->size -= final_end - final_start;
- if (ei->addr < final_start)
+ entry->size -= final_end - final_start;
+ if (entry->addr < final_start)
continue;
- ei->addr = final_end;
+
+ entry->addr = final_end;
}
return real_removed_size;
}
-void __init update_e820(void)
+void __init e820__update_table_print(void)
{
- u32 nr_map;
-
- nr_map = e820.nr_map;
- if (sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &nr_map))
+ if (e820__update_table(e820_table))
return;
- e820.nr_map = nr_map;
- printk(KERN_INFO "modified physical RAM map:\n");
- e820_print_map("modified");
+
+ pr_info("modified physical RAM map:\n");
+ e820__print_table("modified");
}
-static void __init update_e820_saved(void)
-{
- u32 nr_map;
- nr_map = e820_saved.nr_map;
- if (sanitize_e820_map(e820_saved.map, ARRAY_SIZE(e820_saved.map), &nr_map))
- return;
- e820_saved.nr_map = nr_map;
+static void __init e820__update_table_kexec(void)
+{
+ e820__update_table(e820_table_kexec);
}
+
#define MAX_GAP_END 0x100000000ull
+
/*
- * Search for a gap in the e820 memory space from start_addr to end_addr.
+ * Search for a gap in the E820 memory space from 0 to MAX_GAP_END (4GB).
*/
-__init int e820_search_gap(unsigned long *gapstart, unsigned long *gapsize,
- unsigned long start_addr, unsigned long long end_addr)
+static int __init e820_search_gap(unsigned long *gapstart, unsigned long *gapsize)
{
- unsigned long long last;
- int i = e820.nr_map;
+ unsigned long long last = MAX_GAP_END;
+ int i = e820_table->nr_entries;
int found = 0;
- last = (end_addr && end_addr < MAX_GAP_END) ? end_addr : MAX_GAP_END;
-
while (--i >= 0) {
- unsigned long long start = e820.map[i].addr;
- unsigned long long end = start + e820.map[i].size;
-
- if (end < start_addr)
- continue;
+ unsigned long long start = e820_table->entries[i].addr;
+ unsigned long long end = start + e820_table->entries[i].size;
/*
* Since "last" is at most 4GB, we know we'll
- * fit in 32 bits if this condition is true
+ * fit in 32 bits if this condition is true:
*/
if (last > end) {
unsigned long gap = last - end;
@@ -626,216 +649,168 @@ __init int e820_search_gap(unsigned long *gapstart, unsigned long *gapsize,
}
/*
- * Search for the biggest gap in the low 32 bits of the e820
- * memory space. We pass this space to PCI to assign MMIO resources
- * for hotplug or unconfigured devices in.
+ * Search for the biggest gap in the low 32 bits of the E820
+ * memory space. We pass this space to the PCI subsystem, so
+ * that it can assign MMIO resources to hotplug or
+ * unconfigured devices.
+ *
+ * Hopefully the BIOS left enough space.
*/
-__init void e820_setup_gap(void)
+__init void e820__setup_pci_gap(void)
{
unsigned long gapstart, gapsize;
int found;
- gapstart = 0x10000000;
gapsize = 0x400000;
- found = e820_search_gap(&gapstart, &gapsize, 0, MAX_GAP_END);
+ found = e820_search_gap(&gapstart, &gapsize);
-#ifdef CONFIG_X86_64
if (!found) {
+#ifdef CONFIG_X86_64
gapstart = (max_pfn << PAGE_SHIFT) + 1024*1024;
- printk(KERN_ERR
- "PCI: Warning: Cannot find a gap in the 32bit address range\n"
- "PCI: Unassigned devices with 32bit resource registers may break!\n");
- }
+ pr_err("Cannot find an available gap in the 32-bit address range\n");
+ pr_err("PCI devices with unassigned 32-bit BARs may not work!\n");
+#else
+ gapstart = 0x10000000;
#endif
+ }
/*
- * e820_reserve_resources_late protect stolen RAM already
+ * e820__reserve_resources_late() protects stolen RAM already:
*/
pci_mem_start = gapstart;
- printk(KERN_INFO
- "Allocating PCI resources starting at %lx (gap: %lx:%lx)\n",
- pci_mem_start, gapstart, gapsize);
-}
-
-/**
- * Because of the size limitation of struct boot_params, only first
- * 128 E820 memory entries are passed to kernel via
- * boot_params.e820_map, others are passed via SETUP_E820_EXT node of
- * linked list of struct setup_data, which is parsed here.
- */
-void __init parse_e820_ext(struct setup_data *sdata, unsigned long pa_data)
-{
- u32 map_len;
- int entries;
- struct e820entry *extmap;
-
- entries = sdata->len / sizeof(struct e820entry);
- map_len = sdata->len + sizeof(struct setup_data);
- if (map_len > PAGE_SIZE)
- sdata = early_ioremap(pa_data, map_len);
- extmap = (struct e820entry *)(sdata->data);
- __append_e820_map(extmap, entries);
- sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
- if (map_len > PAGE_SIZE)
- early_iounmap(sdata, map_len);
- printk(KERN_INFO "extended physical RAM map:\n");
- e820_print_map("extended");
+ pr_info("[mem %#010lx-%#010lx] available for PCI devices\n",
+ gapstart, gapstart + gapsize - 1);
}
-#if defined(CONFIG_X86_64) || \
- (defined(CONFIG_X86_32) && defined(CONFIG_HIBERNATION))
-/**
- * Find the ranges of physical addresses that do not correspond to
- * e820 RAM areas and mark the corresponding pages as nosave for
- * hibernation (32 bit) or software suspend and suspend to RAM (64 bit).
+/*
+ * Called late during init, in free_initmem().
*
- * This function requires the e820 map to be sorted and without any
- * overlapping entries and assumes the first e820 area to be RAM.
+ * Initial e820_table and e820_table_kexec are largish __initdata arrays.
+ *
+ * Copy them to a (usually much smaller) dynamically allocated area that is
+ * sized precisely after the number of e820 entries.
+ *
+ * This is done after we've performed all the fixes and tweaks to the tables.
+ * All functions which modify them are __init functions, which won't exist
+ * after free_initmem().
*/
-void __init e820_mark_nosave_regions(unsigned long limit_pfn)
+__init void e820__reallocate_tables(void)
{
- int i;
- unsigned long pfn;
-
- pfn = PFN_DOWN(e820.map[0].addr + e820.map[0].size);
- for (i = 1; i < e820.nr_map; i++) {
- struct e820entry *ei = &e820.map[i];
-
- if (pfn < PFN_UP(ei->addr))
- register_nosave_region(pfn, PFN_UP(ei->addr));
-
- pfn = PFN_DOWN(ei->addr + ei->size);
- if (ei->type != E820_RAM && ei->type != E820_RESERVED_KERN)
- register_nosave_region(PFN_UP(ei->addr), pfn);
-
- if (pfn >= limit_pfn)
- break;
- }
+ struct e820_table *n;
+ int size;
+
+ size = offsetof(struct e820_table, entries) + sizeof(struct e820_entry)*e820_table->nr_entries;
+ n = kmemdup(e820_table, size, GFP_KERNEL);
+ BUG_ON(!n);
+ e820_table = n;
+
+ size = offsetof(struct e820_table, entries) + sizeof(struct e820_entry)*e820_table_kexec->nr_entries;
+ n = kmemdup(e820_table_kexec, size, GFP_KERNEL);
+ BUG_ON(!n);
+ e820_table_kexec = n;
+
+ size = offsetof(struct e820_table, entries) + sizeof(struct e820_entry)*e820_table_firmware->nr_entries;
+ n = kmemdup(e820_table_firmware, size, GFP_KERNEL);
+ BUG_ON(!n);
+ e820_table_firmware = n;
}
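
The sizing arithmetic — offsetof() up to the entries[] array plus only the populated entries — is an ordinary C pattern; a userspace sketch with a made-up table layout:

	#include <stdio.h>
	#include <stdlib.h>
	#include <stddef.h>
	#include <string.h>

	struct entry { unsigned long long addr, size; int type; };
	struct table { unsigned int nr_entries; struct entry entries[128]; };

	int main(void)
	{
		static struct table big = { .nr_entries = 3 };

		/* Copy only the populated prefix, as e820__reallocate_tables() does: */
		size_t sz = offsetof(struct table, entries) +
			    sizeof(struct entry) * big.nr_entries;
		struct table *n = malloc(sz);
		if (!n)
			return 1;
		memcpy(n, &big, sz);

		printf("full: %zu bytes, trimmed: %zu bytes\n", sizeof(big), sz);
		free(n);
		return 0;
	}
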
-#endif
-#ifdef CONFIG_HIBERNATION
-/**
- * Mark ACPI NVS memory region, so that we can save/restore it during
- * hibernation and the subsequent resume.
+/*
+ * Because of the small fixed size of struct boot_params, only the first
+ * 128 E820 memory entries are passed to the kernel via boot_params.e820_table,
+ * the remaining (if any) entries are passed via the SETUP_E820_EXT node of
+ * struct setup_data, which is parsed here.
*/
-static int __init e820_mark_nvs_memory(void)
+void __init e820__memory_setup_extended(u64 phys_addr, u32 data_len)
{
- int i;
+ int entries;
+ struct boot_e820_entry *extmap;
+ struct setup_data *sdata;
- for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *ei = &e820.map[i];
+ sdata = early_memremap(phys_addr, data_len);
+ entries = sdata->len / sizeof(*extmap);
+ extmap = (struct boot_e820_entry *)(sdata->data);
- if (ei->type == E820_NVS)
- suspend_nvs_register(ei->addr, ei->size);
- }
+ __append_e820_table(extmap, entries);
+ e820__update_table(e820_table);
- return 0;
+ memcpy(e820_table_kexec, e820_table, sizeof(*e820_table_kexec));
+ memcpy(e820_table_firmware, e820_table, sizeof(*e820_table_firmware));
+
+ early_memunmap(sdata, data_len);
+ pr_info("extended physical RAM map:\n");
+ e820__print_table("extended");
}
-core_initcall(e820_mark_nvs_memory);
-#endif
/*
- * Find a free area with specified alignment in a specific range.
+ * Find the ranges of physical addresses that do not correspond to
+ * E820 RAM areas and register the corresponding pages as 'nosave' for
+ * hibernation (32-bit) or software suspend and suspend to RAM (64-bit).
+ *
+ * This function requires the E820 map to be sorted and without any
+ * overlapping entries.
*/
-u64 __init find_e820_area(u64 start, u64 end, u64 size, u64 align)
+void __init e820__register_nosave_regions(unsigned long limit_pfn)
{
int i;
+ u64 last_addr = 0;
- for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *ei = &e820.map[i];
- u64 addr;
- u64 ei_start, ei_last;
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ struct e820_entry *entry = &e820_table->entries[i];
- if (ei->type != E820_RAM)
+ if (entry->type != E820_TYPE_RAM)
continue;
- ei_last = ei->addr + ei->size;
- ei_start = ei->addr;
- addr = find_early_area(ei_start, ei_last, start, end,
- size, align);
+ if (last_addr < entry->addr)
+ register_nosave_region(PFN_DOWN(last_addr), PFN_UP(entry->addr));
- if (addr != -1ULL)
- return addr;
+ last_addr = entry->addr + entry->size;
}
- return -1ULL;
-}
-u64 __init find_fw_memmap_area(u64 start, u64 end, u64 size, u64 align)
-{
- return find_e820_area(start, end, size, align);
+ register_nosave_region(PFN_DOWN(last_addr), limit_pfn);
}
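
Worked through on a hypothetical map (classic PC layout numbers, for illustration only), with a 4KB page size:

	RAM entries:  [0x00000000, 0x0009f000)  and  [0x00100000, 0x40000000)
	limit_pfn:    0x40000

	iter 1: last_addr(0) == entry.addr(0)    -> nothing registered
	        last_addr = 0x9f000
	iter 2: last_addr(0x9f000) < 0x100000    -> nosave pfns [0x9f, 0x100)
	        last_addr = 0x40000000
	final:  register_nosave_region(0x40000, limit_pfn)  -> empty if equal

Everything between the two RAM entries — the EBDA/ROM hole in this example — is marked nosave, regardless of what non-RAM entries sit there.
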
-u64 __init get_max_mapped(void)
-{
- u64 end = max_pfn_mapped;
-
- end <<= PAGE_SHIFT;
-
- return end;
-}
+#ifdef CONFIG_ACPI
/*
- * Find next free range after *start
+ * Register ACPI NVS memory regions, so that we can save/restore them during
+ * hibernation and the subsequent resume:
*/
-u64 __init find_e820_area_size(u64 start, u64 *sizep, u64 align)
+static int __init e820__register_nvs_regions(void)
{
int i;
- for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *ei = &e820.map[i];
- u64 addr;
- u64 ei_start, ei_last;
-
- if (ei->type != E820_RAM)
- continue;
-
- ei_last = ei->addr + ei->size;
- ei_start = ei->addr;
- addr = find_early_area_size(ei_start, ei_last, start,
- sizep, align);
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ struct e820_entry *entry = &e820_table->entries[i];
- if (addr != -1ULL)
- return addr;
+ if (entry->type == E820_TYPE_NVS)
+ acpi_nvs_register(entry->addr, entry->size);
}
- return -1ULL;
+ return 0;
}
+core_initcall(e820__register_nvs_regions);
+#endif
/*
- * pre allocated 4k and reserved it in e820
+ * Allocate the requested number of bytes with the requested alignment
+ * and return (the physical address) to the caller. Also register this
+ * range in the 'kexec' E820 table as a reserved range.
+ *
+ * This allows kexec to fake a new mptable, as if it came from the real
+ * system.
*/
-u64 __init early_reserve_e820(u64 startt, u64 sizet, u64 align)
+u64 __init e820__memblock_alloc_reserved(u64 size, u64 align)
{
- u64 size = 0;
u64 addr;
- u64 start;
- for (start = startt; ; start += size) {
- start = find_e820_area_size(start, &size, align);
- if (!(start + 1))
- return 0;
- if (size >= sizet)
- break;
+ addr = memblock_phys_alloc(size, align);
+ if (addr) {
+ e820__range_update_table(e820_table_kexec, addr, size, E820_TYPE_RAM, E820_TYPE_RESERVED);
+ pr_info("update e820_table_kexec for e820__memblock_alloc_reserved()\n");
+ e820__update_table_kexec();
}
-#ifdef CONFIG_X86_32
- if (start >= MAXMEM)
- return 0;
- if (start + size > MAXMEM)
- size = MAXMEM - start;
-#endif
-
- addr = round_down(start + size - sizet, align);
- if (addr < start)
- return 0;
- e820_update_range(addr, sizet, E820_RAM, E820_RESERVED);
- e820_update_range_saved(addr, sizet, E820_RAM, E820_RESERVED);
- printk(KERN_INFO "update e820 for early_reserve_e820\n");
- update_e820();
- update_e820_saved();
-
return addr;
}
@@ -852,22 +827,23 @@ u64 __init early_reserve_e820(u64 startt, u64 sizet, u64 align)
/*
* Find the highest page frame number we have available
*/
-static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
+static unsigned long __init e820__end_ram_pfn(unsigned long limit_pfn)
{
int i;
unsigned long last_pfn = 0;
unsigned long max_arch_pfn = MAX_ARCH_PFN;
- for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *ei = &e820.map[i];
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ struct e820_entry *entry = &e820_table->entries[i];
unsigned long start_pfn;
unsigned long end_pfn;
- if (ei->type != type)
+ if (entry->type != E820_TYPE_RAM &&
+ entry->type != E820_TYPE_ACPI)
continue;
- start_pfn = ei->addr >> PAGE_SHIFT;
- end_pfn = (ei->addr + ei->size) >> PAGE_SHIFT;
+ start_pfn = entry->addr >> PAGE_SHIFT;
+ end_pfn = (entry->addr + entry->size) >> PAGE_SHIFT;
if (start_pfn >= limit_pfn)
continue;
@@ -882,89 +858,22 @@ static unsigned long __init e820_end_pfn(unsigned long limit_pfn, unsigned type)
if (last_pfn > max_arch_pfn)
last_pfn = max_arch_pfn;
- printk(KERN_INFO "last_pfn = %#lx max_arch_pfn = %#lx\n",
- last_pfn, max_arch_pfn);
+ pr_info("last_pfn = %#lx max_arch_pfn = %#lx\n",
+ last_pfn, max_arch_pfn);
return last_pfn;
}
-unsigned long __init e820_end_of_ram_pfn(void)
-{
- return e820_end_pfn(MAX_ARCH_PFN, E820_RAM);
-}
-unsigned long __init e820_end_of_low_ram_pfn(void)
-{
- return e820_end_pfn(1UL<<(32 - PAGE_SHIFT), E820_RAM);
-}
-/*
- * Finds an active region in the address range from start_pfn to last_pfn and
- * returns its range in ei_startpfn and ei_endpfn for the e820 entry.
- */
-int __init e820_find_active_region(const struct e820entry *ei,
- unsigned long start_pfn,
- unsigned long last_pfn,
- unsigned long *ei_startpfn,
- unsigned long *ei_endpfn)
+unsigned long __init e820__end_of_ram_pfn(void)
{
- u64 align = PAGE_SIZE;
-
- *ei_startpfn = round_up(ei->addr, align) >> PAGE_SHIFT;
- *ei_endpfn = round_down(ei->addr + ei->size, align) >> PAGE_SHIFT;
-
- /* Skip map entries smaller than a page */
- if (*ei_startpfn >= *ei_endpfn)
- return 0;
-
- /* Skip if map is outside the node */
- if (ei->type != E820_RAM || *ei_endpfn <= start_pfn ||
- *ei_startpfn >= last_pfn)
- return 0;
-
- /* Check for overlaps */
- if (*ei_startpfn < start_pfn)
- *ei_startpfn = start_pfn;
- if (*ei_endpfn > last_pfn)
- *ei_endpfn = last_pfn;
-
- return 1;
+ return e820__end_ram_pfn(MAX_ARCH_PFN);
}
-/* Walk the e820 map and register active regions within a node */
-void __init e820_register_active_regions(int nid, unsigned long start_pfn,
- unsigned long last_pfn)
+unsigned long __init e820__end_of_low_ram_pfn(void)
{
- unsigned long ei_startpfn;
- unsigned long ei_endpfn;
- int i;
-
- for (i = 0; i < e820.nr_map; i++)
- if (e820_find_active_region(&e820.map[i],
- start_pfn, last_pfn,
- &ei_startpfn, &ei_endpfn))
- add_active_range(nid, ei_startpfn, ei_endpfn);
+ return e820__end_ram_pfn(1UL << (32 - PAGE_SHIFT));
}
-/*
- * Find the hole size (in bytes) in the memory range.
- * @start: starting address of the memory range to scan
- * @end: ending address of the memory range to scan
- */
-u64 __init e820_hole_size(u64 start, u64 end)
-{
- unsigned long start_pfn = start >> PAGE_SHIFT;
- unsigned long last_pfn = end >> PAGE_SHIFT;
- unsigned long ei_startpfn, ei_endpfn, ram = 0;
- int i;
-
- for (i = 0; i < e820.nr_map; i++) {
- if (e820_find_active_region(&e820.map[i],
- start_pfn, last_pfn,
- &ei_startpfn, &ei_endpfn))
- ram += ei_endpfn - ei_startpfn;
- }
- return end - start - ((u64)ram << PAGE_SHIFT);
-}
-
-static void early_panic(char *msg)
+static void __init early_panic(char *msg)
{
early_printk(msg);
panic(msg);
@@ -972,7 +881,7 @@ static void early_panic(char *msg)
static int userdef __initdata;
-/* "mem=nopentium" disables the 4MB page tables. */
+/* The "mem=nopentium" boot option disables 4MB page tables on 32-bit kernels: */
static int __init parse_memopt(char *p)
{
u64 mem_size;
@@ -980,22 +889,34 @@ static int __init parse_memopt(char *p)
if (!p)
return -EINVAL;
-#ifdef CONFIG_X86_32
if (!strcmp(p, "nopentium")) {
+#ifdef CONFIG_X86_32
setup_clear_cpu_cap(X86_FEATURE_PSE);
return 0;
- }
+#else
+ pr_warn("mem=nopentium ignored! (only supported on x86_32)\n");
+ return -EINVAL;
#endif
+ }
userdef = 1;
mem_size = memparse(p, &p);
- e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
+
+	/* Don't remove all memory when handling a "mem={invalid}" parameter: */
+ if (mem_size == 0)
+ return -EINVAL;
+
+ e820__range_remove(mem_size, ULLONG_MAX - mem_size, E820_TYPE_RAM, 1);
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+ max_mem_size = mem_size;
+#endif
return 0;
}
early_param("mem", parse_memopt);
-static int __init parse_memmap_opt(char *p)
+static int __init parse_memmap_one(char *p)
{
char *oldp;
u64 start_at, mem_size;
@@ -1004,15 +925,7 @@ static int __init parse_memmap_opt(char *p)
return -EINVAL;
if (!strncmp(p, "exactmap", 8)) {
-#ifdef CONFIG_CRASH_DUMP
- /*
- * If we are doing a crash dump, we still need to know
- * the real mem size before original memory map is
- * reset.
- */
- saved_max_pfn = e820_end_of_ram_pfn();
-#endif
- e820.nr_map = 0;
+ e820_table->nr_entries = 0;
userdef = 1;
return 0;
}
@@ -1025,92 +938,197 @@ static int __init parse_memmap_opt(char *p)
userdef = 1;
if (*p == '@') {
start_at = memparse(p+1, &p);
- e820_add_region(start_at, mem_size, E820_RAM);
+ e820__range_add(start_at, mem_size, E820_TYPE_RAM);
} else if (*p == '#') {
start_at = memparse(p+1, &p);
- e820_add_region(start_at, mem_size, E820_ACPI);
+ e820__range_add(start_at, mem_size, E820_TYPE_ACPI);
} else if (*p == '$') {
start_at = memparse(p+1, &p);
- e820_add_region(start_at, mem_size, E820_RESERVED);
- } else
- e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);
+ e820__range_add(start_at, mem_size, E820_TYPE_RESERVED);
+ } else if (*p == '!') {
+ start_at = memparse(p+1, &p);
+ e820__range_add(start_at, mem_size, E820_TYPE_PRAM);
+ } else if (*p == '%') {
+ enum e820_type from = 0, to = 0;
+
+ start_at = memparse(p + 1, &p);
+ if (*p == '-')
+ from = simple_strtoull(p + 1, &p, 0);
+ if (*p == '+')
+ to = simple_strtoull(p + 1, &p, 0);
+ if (*p != '\0')
+ return -EINVAL;
+ if (from && to)
+ e820__range_update(start_at, mem_size, from, to);
+ else if (to)
+ e820__range_add(start_at, mem_size, to);
+ else if (from)
+ e820__range_remove(start_at, mem_size, from, 1);
+ else
+ e820__range_remove(start_at, mem_size, 0, 0);
+ } else {
+ e820__range_remove(mem_size, ULLONG_MAX - mem_size, E820_TYPE_RAM, 1);
+ }
return *p == '\0' ? 0 : -EINVAL;
}
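+
+/*
+ * Illustrative examples of the syntax handled above (all values made up);
+ * several comma-separated regions may be combined in a single option:
+ *
+ *   memmap=512M@0x10000000   - add 512M of RAM at 256M
+ *   memmap=64K#0x7f000000    - mark a range as ACPI data
+ *   memmap=1G$2G             - reserve 1G starting at 2G
+ *   memmap=1G!4G             - mark legacy persistent memory (PRAM)
+ *   memmap=256M%1G-1+2       - re-type a range from type 1 (RAM) to type 2 (reserved)
+ *   memmap=2G                - remove all RAM above 2G
+ */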
+
+static int __init parse_memmap_opt(char *str)
+{
+ while (str) {
+ char *k = strchr(str, ',');
+
+ if (k)
+ *k++ = 0;
+
+ parse_memmap_one(str);
+ str = k;
+ }
+
+ return 0;
+}
early_param("memmap", parse_memmap_opt);
-void __init finish_e820_parsing(void)
+/*
+ * Called after parse_early_param(), after early parameters (such as mem=)
+ * have been processed, in which case we already have an E820 table filled in
+ * via the parameter callback function(s), but it's not sorted and printed yet:
+ */
+void __init e820__finish_early_params(void)
{
if (userdef) {
- u32 nr = e820.nr_map;
-
- if (sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &nr) < 0)
+ if (e820__update_table(e820_table) < 0)
early_panic("Invalid user supplied memory map");
- e820.nr_map = nr;
- printk(KERN_INFO "user-defined physical RAM map:\n");
- e820_print_map("user");
+ pr_info("user-defined physical RAM map:\n");
+ e820__print_table("user");
+ }
+}
+
+static const char *__init e820_type_to_string(struct e820_entry *entry)
+{
+ switch (entry->type) {
+ case E820_TYPE_RAM: return "System RAM";
+ case E820_TYPE_ACPI: return "ACPI Tables";
+ case E820_TYPE_NVS: return "ACPI Non-volatile Storage";
+ case E820_TYPE_UNUSABLE: return "Unusable memory";
+ case E820_TYPE_PRAM: return "Persistent Memory (legacy)";
+ case E820_TYPE_PMEM: return "Persistent Memory";
+ case E820_TYPE_RESERVED: return "Reserved";
+ case E820_TYPE_SOFT_RESERVED: return "Soft Reserved";
+ default: return "Unknown E820 type";
}
}
-static inline const char *e820_type_to_string(int e820_type)
+static unsigned long __init e820_type_to_iomem_type(struct e820_entry *entry)
{
- switch (e820_type) {
- case E820_RESERVED_KERN:
- case E820_RAM: return "System RAM";
- case E820_ACPI: return "ACPI Tables";
- case E820_NVS: return "ACPI Non-volatile Storage";
- case E820_UNUSABLE: return "Unusable memory";
- default: return "reserved";
+ switch (entry->type) {
+ case E820_TYPE_RAM: return IORESOURCE_SYSTEM_RAM;
+ case E820_TYPE_ACPI: /* Fall-through: */
+ case E820_TYPE_NVS: /* Fall-through: */
+ case E820_TYPE_UNUSABLE: /* Fall-through: */
+ case E820_TYPE_PRAM: /* Fall-through: */
+ case E820_TYPE_PMEM: /* Fall-through: */
+ case E820_TYPE_RESERVED: /* Fall-through: */
+ case E820_TYPE_SOFT_RESERVED: /* Fall-through: */
+ default: return IORESOURCE_MEM;
+ }
+}
+
+static unsigned long __init e820_type_to_iores_desc(struct e820_entry *entry)
+{
+ switch (entry->type) {
+ case E820_TYPE_ACPI: return IORES_DESC_ACPI_TABLES;
+ case E820_TYPE_NVS: return IORES_DESC_ACPI_NV_STORAGE;
+ case E820_TYPE_PMEM: return IORES_DESC_PERSISTENT_MEMORY;
+ case E820_TYPE_PRAM: return IORES_DESC_PERSISTENT_MEMORY_LEGACY;
+ case E820_TYPE_RESERVED: return IORES_DESC_RESERVED;
+ case E820_TYPE_SOFT_RESERVED: return IORES_DESC_SOFT_RESERVED;
+ case E820_TYPE_RAM: /* Fall-through: */
+ case E820_TYPE_UNUSABLE: /* Fall-through: */
+ default: return IORES_DESC_NONE;
+ }
+}
+
+static bool __init do_mark_busy(enum e820_type type, struct resource *res)
+{
+ /* this is the legacy bios/dos rom-shadow + mmio region */
+ if (res->start < (1ULL<<20))
+ return true;
+
+ /*
+ * Treat persistent memory and other special memory ranges like
+ * device memory, i.e. reserve it for exclusive use of a driver
+ */
+ switch (type) {
+ case E820_TYPE_RESERVED:
+ case E820_TYPE_SOFT_RESERVED:
+ case E820_TYPE_PRAM:
+ case E820_TYPE_PMEM:
+ return false;
+ case E820_TYPE_RAM:
+ case E820_TYPE_ACPI:
+ case E820_TYPE_NVS:
+ case E820_TYPE_UNUSABLE:
+ default:
+ return true;
}
}
/*
- * Mark e820 reserved areas as busy for the resource manager.
+ * Mark E820 reserved areas as busy for the resource manager:
*/
+
static struct resource __initdata *e820_res;
-void __init e820_reserve_resources(void)
+
+void __init e820__reserve_resources(void)
{
int i;
struct resource *res;
u64 end;
- res = alloc_bootmem(sizeof(struct resource) * e820.nr_map);
+ res = memblock_alloc_or_panic(sizeof(*res) * e820_table->nr_entries,
+ SMP_CACHE_BYTES);
e820_res = res;
- for (i = 0; i < e820.nr_map; i++) {
- end = e820.map[i].addr + e820.map[i].size - 1;
+
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ struct e820_entry *entry = e820_table->entries + i;
+
+ end = entry->addr + entry->size - 1;
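+		/* Skip entries whose end does not fit in resource_size_t (32-bit overflow): */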
if (end != (resource_size_t)end) {
res++;
continue;
}
- res->name = e820_type_to_string(e820.map[i].type);
- res->start = e820.map[i].addr;
- res->end = end;
-
- res->flags = IORESOURCE_MEM;
+ res->start = entry->addr;
+ res->end = end;
+ res->name = e820_type_to_string(entry);
+ res->flags = e820_type_to_iomem_type(entry);
+ res->desc = e820_type_to_iores_desc(entry);
/*
- * don't register the region that could be conflicted with
- * pci device BAR resource and insert them later in
- * pcibios_resource_survey()
+ * Don't register the region that could be conflicted with
+ * PCI device BAR resources and insert them later in
+ * pcibios_resource_survey():
*/
- if (e820.map[i].type != E820_RESERVED || res->start < (1ULL<<20)) {
+ if (do_mark_busy(entry->type, res)) {
res->flags |= IORESOURCE_BUSY;
insert_resource(&iomem_resource, res);
}
res++;
}
- for (i = 0; i < e820_saved.nr_map; i++) {
- struct e820entry *entry = &e820_saved.map[i];
- firmware_map_add_early(entry->addr,
- entry->addr + entry->size - 1,
- e820_type_to_string(entry->type));
+	/* Expose the kexec E820 table via sysfs: */
+ for (i = 0; i < e820_table_kexec->nr_entries; i++) {
+ struct e820_entry *entry = e820_table_kexec->entries + i;
+
+ firmware_map_add_early(entry->addr, entry->addr + entry->size, e820_type_to_string(entry));
}
}
-/* How much should we pad RAM ending depending on where it is? */
-static unsigned long ram_alignment(resource_size_t pos)
+/*
+ * How much should we pad the end of RAM, depending on where it is?
+ */
+static unsigned long __init ram_alignment(resource_size_t pos)
{
unsigned long mb = pos >> 20;
@@ -1128,63 +1146,59 @@ static unsigned long ram_alignment(resource_size_t pos)
#define MAX_RESOURCE_SIZE ((resource_size_t)-1)
-void __init e820_reserve_resources_late(void)
+void __init e820__reserve_resources_late(void)
{
int i;
struct resource *res;
res = e820_res;
- for (i = 0; i < e820.nr_map; i++) {
+ for (i = 0; i < e820_table->nr_entries; i++) {
if (!res->parent && res->end)
insert_resource_expand_to_fit(&iomem_resource, res);
res++;
}
/*
- * Try to bump up RAM regions to reasonable boundaries to
+ * Try to bump up RAM regions to reasonable boundaries, to
* avoid stolen RAM:
*/
- for (i = 0; i < e820.nr_map; i++) {
- struct e820entry *entry = &e820.map[i];
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ struct e820_entry *entry = &e820_table->entries[i];
u64 start, end;
- if (entry->type != E820_RAM)
+ if (entry->type != E820_TYPE_RAM)
continue;
+
start = entry->addr + entry->size;
end = round_up(start, ram_alignment(start)) - 1;
if (end > MAX_RESOURCE_SIZE)
end = MAX_RESOURCE_SIZE;
if (start >= end)
continue;
- printk(KERN_DEBUG "reserve RAM buffer: %016llx - %016llx ",
- start, end);
- reserve_region_with_split(&iomem_resource, start, end,
- "RAM buffer");
+
+ printk(KERN_DEBUG "e820: reserve RAM buffer [mem %#010llx-%#010llx]\n", start, end);
+ reserve_region_with_split(&iomem_resource, start, end, "RAM buffer");
}
}
-char *__init default_machine_specific_memory_setup(void)
+/*
+ * Pass the firmware (bootloader) E820 map to the kernel and process it:
+ */
+char *__init e820__memory_setup_default(void)
{
char *who = "BIOS-e820";
- u32 new_nr;
+
/*
* Try to copy the BIOS-supplied E820-map.
*
* Otherwise fake a memory map; one section from 0k->640k,
* the next section from 1mb->appropriate_mem_k
*/
- new_nr = boot_params.e820_entries;
- sanitize_e820_map(boot_params.e820_map,
- ARRAY_SIZE(boot_params.e820_map),
- &new_nr);
- boot_params.e820_entries = new_nr;
- if (append_e820_map(boot_params.e820_map, boot_params.e820_entries)
- < 0) {
+ if (append_e820_table(boot_params.e820_table, boot_params.e820_entries) < 0) {
u64 mem_size;
- /* compare results from other methods and take the greater */
- if (boot_params.alt_mem_k
- < boot_params.screen_info.ext_mem_k) {
+ /* Compare results from other methods and take the one that gives more RAM: */
+ if (boot_params.alt_mem_k < boot_params.screen_info.ext_mem_k) {
mem_size = boot_params.screen_info.ext_mem_k;
who = "BIOS-88";
} else {
@@ -1192,21 +1206,128 @@ char *__init default_machine_specific_memory_setup(void)
who = "BIOS-e801";
}
- e820.nr_map = 0;
- e820_add_region(0, LOWMEMSIZE(), E820_RAM);
- e820_add_region(HIGH_MEMORY, mem_size << 10, E820_RAM);
+ e820_table->nr_entries = 0;
+ e820__range_add(0, LOWMEMSIZE(), E820_TYPE_RAM);
+ e820__range_add(HIGH_MEMORY, mem_size << 10, E820_TYPE_RAM);
}
- /* In case someone cares... */
+ /* We just appended a lot of ranges, sanitize the table: */
+ e820__update_table(e820_table);
+
return who;
}
-void __init setup_memory_map(void)
+/*
+ * Calls e820__memory_setup_default() in essence to pick up the firmware/bootloader
+ * E820 map - with an optional platform quirk available for virtual platforms
+ * to override this method of boot environment processing:
+ */
+void __init e820__memory_setup(void)
{
char *who;
+ /* This is a firmware interface ABI - make sure we don't break it: */
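+	/* (A boot_e820_entry packs an 8-byte addr, an 8-byte size and a 4-byte type.) */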
+ BUILD_BUG_ON(sizeof(struct boot_e820_entry) != 20);
+
who = x86_init.resources.memory_setup();
- memcpy(&e820_saved, &e820, sizeof(struct e820map));
- printk(KERN_INFO "BIOS-provided physical RAM map:\n");
- e820_print_map(who);
+
+ memcpy(e820_table_kexec, e820_table, sizeof(*e820_table_kexec));
+ memcpy(e820_table_firmware, e820_table, sizeof(*e820_table_firmware));
+
+ pr_info("BIOS-provided physical RAM map:\n");
+ e820__print_table(who);
+}
+
+void __init e820__memblock_setup(void)
+{
+ int i;
+ u64 end;
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+ /*
+ * Memory used by the kernel cannot be hot-removed because Linux
+ * cannot migrate the kernel pages. When memory hotplug is
+ * enabled, we should prevent memblock from allocating memory
+ * for the kernel.
+ *
+ * ACPI SRAT records all hotpluggable memory ranges. But before
+ * SRAT is parsed, we don't know about it.
+ *
+	 * The kernel image is loaded into memory very early. We cannot
+	 * prevent that in any case. So on a NUMA system, we mark any
+	 * node the kernel resides in as un-hotpluggable.
+ *
+	 * Since one node on a modern server can hold tens of gigabytes
+	 * of memory, we can assume the memory around the kernel image
+	 * is also un-hotpluggable. So before SRAT is parsed, just
+	 * allocate memory near the kernel image to do our best to keep
+	 * the kernel away from hotpluggable memory.
+ */
+ if (movable_node_is_enabled())
+ memblock_set_bottom_up(true);
+#endif
+
+ /*
+	 * At this point only the first megabyte is mapped for sure; the
+	 * rest of the memory cannot yet be used for memblock resizing.
+ */
+ memblock_set_current_limit(ISA_END_ADDRESS);
+
+ /*
+ * The bootstrap memblock region count maximum is 128 entries
+ * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
+ * than that - so allow memblock resizing.
+ *
+ * This is safe, because this call happens pretty late during x86 setup,
+ * so we know about reserved memory regions already. (This is important
+	 * so that memblock resizing does not stomp over reserved areas.)
+ */
+ memblock_allow_resize();
+
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ struct e820_entry *entry = &e820_table->entries[i];
+
+ end = entry->addr + entry->size;
+ if (end != (resource_size_t)end)
+ continue;
+
+ if (entry->type == E820_TYPE_SOFT_RESERVED)
+ memblock_reserve(entry->addr, entry->size);
+
+ if (entry->type != E820_TYPE_RAM)
+ continue;
+
+ memblock_add(entry->addr, entry->size);
+ }
+
+ /*
+ * At this point memblock is only allowed to allocate from memory
+ * below 1M (aka ISA_END_ADDRESS) up until direct map is completely set
+ * up in init_mem_mapping().
+ *
+ * KHO kernels are special and use only scratch memory for memblock
+ * allocations, but memory below 1M is ignored by kernel after early
+ * boot and cannot be naturally marked as scratch.
+ *
+ * To allow allocation of the real-mode trampoline and a few (if any)
+ * other very early allocations from below 1M forcibly mark the memory
+ * below 1M as scratch.
+ *
+ * After real mode trampoline is allocated, we clear that scratch
+ * marking.
+ */
+ memblock_mark_kho_scratch(0, SZ_1M);
+
+ /*
+	 * 32-bit systems are limited to 4GB of memory even with HIGHMEM and
+ * to even less without it.
+ * Discard memory after max_pfn - the actual limit detected at runtime.
+ */
+ if (IS_ENABLED(CONFIG_X86_32))
+ memblock_remove(PFN_PHYS(max_pfn), -1);
+
+ /* Throw away partial pages: */
+ memblock_trim_memory(PAGE_SIZE);
+
+ memblock_dump_all();
}
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index e5cc7e82e60d..6b6f32f40cbe 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/* Various workarounds for chipset bugs.
This code runs very early and can't use the regular PCI subsystem
The entries are keyed to PCI bridges which usually identify chipsets
@@ -11,14 +12,22 @@
#include <linux/pci.h>
#include <linux/acpi.h>
+#include <linux/delay.h>
#include <linux/pci_ids.h>
+#include <linux/bcma/bcma.h>
+#include <linux/bcma/bcma_regs.h>
+#include <linux/platform_data/x86/apple.h>
+#include <drm/intel/i915_drm.h>
+#include <drm/intel/pciids.h>
#include <asm/pci-direct.h>
#include <asm/dma.h>
#include <asm/io_apic.h>
#include <asm/apic.h>
+#include <asm/hpet.h>
#include <asm/iommu.h>
#include <asm/gart.h>
-#include <asm/hpet.h>
+#include <asm/irq_remapping.h>
+#include <asm/early_ioremap.h>
static void __init fix_hypertransport_config(int num, int slot, int func)
{
@@ -74,6 +83,13 @@ static void __init nvidia_bugs(int num, int slot, int func)
#ifdef CONFIG_ACPI
#ifdef CONFIG_X86_IO_APIC
/*
+ * Only applies to Nvidia root ports (bus 0) and not to
+ * Nvidia graphics cards with PCI ports on secondary buses.
+ */
+ if (num)
+ return;
+
+ /*
* All timer overrides on Nvidia are
* wrong unless HPET is enabled.
* Unfortunately that's not true on many Asus boards.
@@ -98,7 +114,6 @@ static void __init nvidia_bugs(int num, int slot, int func)
}
#if defined(CONFIG_ACPI) && defined(CONFIG_X86_IO_APIC)
-#if defined(CONFIG_ACPI) && defined(CONFIG_X86_IO_APIC)
static u32 __init ati_ixp4x0_rev(int num, int slot, int func)
{
u32 d;
@@ -116,7 +131,6 @@ static u32 __init ati_ixp4x0_rev(int num, int slot, int func)
d &= 0xff;
return d;
}
-#endif
static void __init ati_bugs(int num, int slot, int func)
{
@@ -146,15 +160,10 @@ static void __init ati_bugs(int num, int slot, int func)
static u32 __init ati_sbx00_rev(int num, int slot, int func)
{
- u32 old, d;
+ u32 d;
- d = read_pci_config(num, slot, func, 0x70);
- old = d;
- d &= ~(1<<8);
- write_pci_config(num, slot, func, 0x70, d);
d = read_pci_config(num, slot, func, 0x8);
d &= 0xff;
- write_pci_config(num, slot, func, 0x70, old);
return d;
}
@@ -163,11 +172,19 @@ static void __init ati_bugs_contd(int num, int slot, int func)
{
u32 d, rev;
- if (acpi_use_timer_override)
+ rev = ati_sbx00_rev(num, slot, func);
+ if (rev >= 0x40)
+ acpi_fix_pin2_polarity = 1;
+
+ /*
+ * SB600: revisions 0x11, 0x12, 0x13, 0x14, ...
+ * SB700: revisions 0x39, 0x3a, ...
+ * SB800: revisions 0x40, 0x41, ...
+ */
+ if (rev >= 0x39)
return;
- rev = ati_sbx00_rev(num, slot, func);
- if (rev > 0x13)
+ if (acpi_use_timer_override)
return;
/* check for IRQ0 interrupt swap */
@@ -192,21 +209,481 @@ static void __init ati_bugs_contd(int num, int slot, int func)
}
#endif
+static void __init intel_remapping_check(int num, int slot, int func)
+{
+ u8 revision;
+ u16 device;
+
+ device = read_pci_config_16(num, slot, func, PCI_DEVICE_ID);
+ revision = read_pci_config_byte(num, slot, func, PCI_REVISION_ID);
+
+ /*
+	 * Revisions <= 0x13 of all device IDs triggering this quirk
+	 * have a problem draining interrupts when IRQ remapping is
+	 * enabled, and should be flagged as broken. Additionally,
+	 * revision 0x22 of device ID 0x3405 has this problem.
+ */
+ if (revision <= 0x13)
+ set_irq_remapping_broken();
+ else if (device == 0x3405 && revision == 0x22)
+ set_irq_remapping_broken();
+}
+
/*
- * Force the read back of the CMP register in hpet_next_event()
- * to work around the problem that the CMP register write seems to be
- * delayed. See hpet_next_event() for details.
- *
- * We do this on all SMBUS incarnations for now until we have more
- * information about the affected chipsets.
+ * Systems with Intel graphics controllers set aside memory exclusively
+ * for gfx driver use. This memory is not marked in the E820 as reserved
+ * or as RAM, and so is subject to overlap from E820 manipulation later
+ * in the boot process. On some systems, MMIO space is allocated on top,
+ * despite the efforts of the "RAM buffer" approach, which simply rounds
+ * memory boundaries up to 64M to try to catch space that may decode
+ * as RAM and so is not suitable for MMIO.
*/
-static void __init ati_hpet_bugs(int num, int slot, int func)
+
+#define KB(x) ((x) * 1024UL)
+#define MB(x) (KB (KB (x)))
+
+static resource_size_t __init i830_tseg_size(void)
+{
+ u8 esmramc = read_pci_config_byte(0, 0, 0, I830_ESMRAMC);
+
+ if (!(esmramc & TSEG_ENABLE))
+ return 0;
+
+ if (esmramc & I830_TSEG_SIZE_1M)
+ return MB(1);
+ else
+ return KB(512);
+}
+
+static resource_size_t __init i845_tseg_size(void)
+{
+ u8 esmramc = read_pci_config_byte(0, 0, 0, I845_ESMRAMC);
+ u8 tseg_size = esmramc & I845_TSEG_SIZE_MASK;
+
+ if (!(esmramc & TSEG_ENABLE))
+ return 0;
+
+ switch (tseg_size) {
+ case I845_TSEG_SIZE_512K: return KB(512);
+ case I845_TSEG_SIZE_1M: return MB(1);
+ default:
+ WARN(1, "Unknown ESMRAMC value: %x!\n", esmramc);
+ }
+ return 0;
+}
+
+static resource_size_t __init i85x_tseg_size(void)
+{
+ u8 esmramc = read_pci_config_byte(0, 0, 0, I85X_ESMRAMC);
+
+ if (!(esmramc & TSEG_ENABLE))
+ return 0;
+
+ return MB(1);
+}
+
+static resource_size_t __init i830_mem_size(void)
+{
+ return read_pci_config_byte(0, 0, 0, I830_DRB3) * MB(32);
+}
+
+static resource_size_t __init i85x_mem_size(void)
+{
+ return read_pci_config_byte(0, 0, 1, I85X_DRB3) * MB(32);
+}
+
+/*
+ * On 830/845/85x the stolen memory base isn't available in any
+ * register. We need to calculate it as TOM-TSEG_SIZE-stolen_size.
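+ * (Illustrative numbers: TOM == 256MB, TSEG == 1MB and an 8MB stolen
+ * size give a stolen base of 247MB.)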
+ */
+static resource_size_t __init i830_stolen_base(int num, int slot, int func,
+ resource_size_t stolen_size)
+{
+ return i830_mem_size() - i830_tseg_size() - stolen_size;
+}
+
+static resource_size_t __init i845_stolen_base(int num, int slot, int func,
+ resource_size_t stolen_size)
+{
+ return i830_mem_size() - i845_tseg_size() - stolen_size;
+}
+
+static resource_size_t __init i85x_stolen_base(int num, int slot, int func,
+ resource_size_t stolen_size)
+{
+ return i85x_mem_size() - i85x_tseg_size() - stolen_size;
+}
+
+static resource_size_t __init i865_stolen_base(int num, int slot, int func,
+ resource_size_t stolen_size)
+{
+ u16 toud = 0;
+
+ toud = read_pci_config_16(0, 0, 0, I865_TOUD);
+
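+	/* TOUD is in 64KB units: e.g. (illustrative) toud == 0x4000 puts it at 1GB. */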
+ return toud * KB(64) + i845_tseg_size();
+}
+
+static resource_size_t __init gen3_stolen_base(int num, int slot, int func,
+ resource_size_t stolen_size)
+{
+ u32 bsm;
+
+ /* Almost universally we can find the Graphics Base of Stolen Memory
+ * at register BSM (0x5c) in the igfx configuration space. On a few
+ * (desktop) machines this is also mirrored in the bridge device at
+ * different locations, or in the MCHBAR.
+ */
+ bsm = read_pci_config(num, slot, func, INTEL_BSM);
+
+ return bsm & INTEL_BSM_MASK;
+}
+
+static resource_size_t __init gen11_stolen_base(int num, int slot, int func,
+ resource_size_t stolen_size)
+{
+ u64 bsm;
+
+ bsm = read_pci_config(num, slot, func, INTEL_GEN11_BSM_DW0);
+ bsm &= INTEL_BSM_MASK;
+ bsm |= (u64)read_pci_config(num, slot, func, INTEL_GEN11_BSM_DW1) << 32;
+
+ return bsm;
+}
+
+static resource_size_t __init i830_stolen_size(int num, int slot, int func)
+{
+ u16 gmch_ctrl;
+ u16 gms;
+
+ gmch_ctrl = read_pci_config_16(0, 0, 0, I830_GMCH_CTRL);
+ gms = gmch_ctrl & I830_GMCH_GMS_MASK;
+
+ switch (gms) {
+ case I830_GMCH_GMS_STOLEN_512: return KB(512);
+ case I830_GMCH_GMS_STOLEN_1024: return MB(1);
+ case I830_GMCH_GMS_STOLEN_8192: return MB(8);
+ /* local memory isn't part of the normal address space */
+ case I830_GMCH_GMS_LOCAL: return 0;
+ default:
+ WARN(1, "Unknown GMCH_CTRL value: %x!\n", gmch_ctrl);
+ }
+
+ return 0;
+}
+
+static resource_size_t __init gen3_stolen_size(int num, int slot, int func)
+{
+ u16 gmch_ctrl;
+ u16 gms;
+
+ gmch_ctrl = read_pci_config_16(0, 0, 0, I830_GMCH_CTRL);
+ gms = gmch_ctrl & I855_GMCH_GMS_MASK;
+
+ switch (gms) {
+ case I855_GMCH_GMS_STOLEN_1M: return MB(1);
+ case I855_GMCH_GMS_STOLEN_4M: return MB(4);
+ case I855_GMCH_GMS_STOLEN_8M: return MB(8);
+ case I855_GMCH_GMS_STOLEN_16M: return MB(16);
+ case I855_GMCH_GMS_STOLEN_32M: return MB(32);
+ case I915_GMCH_GMS_STOLEN_48M: return MB(48);
+ case I915_GMCH_GMS_STOLEN_64M: return MB(64);
+ case G33_GMCH_GMS_STOLEN_128M: return MB(128);
+ case G33_GMCH_GMS_STOLEN_256M: return MB(256);
+ case INTEL_GMCH_GMS_STOLEN_96M: return MB(96);
+ case INTEL_GMCH_GMS_STOLEN_160M:return MB(160);
+ case INTEL_GMCH_GMS_STOLEN_224M:return MB(224);
+ case INTEL_GMCH_GMS_STOLEN_352M:return MB(352);
+ default:
+ WARN(1, "Unknown GMCH_CTRL value: %x!\n", gmch_ctrl);
+ }
+
+ return 0;
+}
+
+static resource_size_t __init gen6_stolen_size(int num, int slot, int func)
+{
+ u16 gmch_ctrl;
+ u16 gms;
+
+ gmch_ctrl = read_pci_config_16(num, slot, func, SNB_GMCH_CTRL);
+ gms = (gmch_ctrl >> SNB_GMCH_GMS_SHIFT) & SNB_GMCH_GMS_MASK;
+
+ return gms * MB(32);
+}
+
+static resource_size_t __init gen8_stolen_size(int num, int slot, int func)
+{
+ u16 gmch_ctrl;
+ u16 gms;
+
+ gmch_ctrl = read_pci_config_16(num, slot, func, SNB_GMCH_CTRL);
+ gms = (gmch_ctrl >> BDW_GMCH_GMS_SHIFT) & BDW_GMCH_GMS_MASK;
+
+ return gms * MB(32);
+}
+
+static resource_size_t __init chv_stolen_size(int num, int slot, int func)
+{
+ u16 gmch_ctrl;
+ u16 gms;
+
+ gmch_ctrl = read_pci_config_16(num, slot, func, SNB_GMCH_CTRL);
+ gms = (gmch_ctrl >> SNB_GMCH_GMS_SHIFT) & SNB_GMCH_GMS_MASK;
+
+ /*
+ * 0x0 to 0x10: 32MB increments starting at 0MB
+ * 0x11 to 0x16: 4MB increments starting at 8MB
+	 * 0x17 to 0x1d: 4MB increments starting at 36MB
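+	 *
+	 * Worked example (illustrative): gms == 0x13 falls in the second
+	 * range, so the size is (0x13 - 0x11) * 4MB + 8MB == 16MB.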
+ */
+ if (gms < 0x11)
+ return gms * MB(32);
+ else if (gms < 0x17)
+ return (gms - 0x11) * MB(4) + MB(8);
+ else
+ return (gms - 0x17) * MB(4) + MB(36);
+}
+
+static resource_size_t __init gen9_stolen_size(int num, int slot, int func)
+{
+ u16 gmch_ctrl;
+ u16 gms;
+
+ gmch_ctrl = read_pci_config_16(num, slot, func, SNB_GMCH_CTRL);
+ gms = (gmch_ctrl >> BDW_GMCH_GMS_SHIFT) & BDW_GMCH_GMS_MASK;
+
+ /* 0x0 to 0xef: 32MB increments starting at 0MB */
+ /* 0xf0 to 0xfe: 4MB increments starting at 4MB */
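+	/* (Illustrative: gms == 0xf2 decodes to (0xf2 - 0xf0) * 4MB + 4MB == 12MB.) */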
+ if (gms < 0xf0)
+ return gms * MB(32);
+ else
+ return (gms - 0xf0) * MB(4) + MB(4);
+}
+
+struct intel_early_ops {
+ resource_size_t (*stolen_size)(int num, int slot, int func);
+ resource_size_t (*stolen_base)(int num, int slot, int func,
+ resource_size_t size);
+};
+
+static const struct intel_early_ops i830_early_ops __initconst = {
+ .stolen_base = i830_stolen_base,
+ .stolen_size = i830_stolen_size,
+};
+
+static const struct intel_early_ops i845_early_ops __initconst = {
+ .stolen_base = i845_stolen_base,
+ .stolen_size = i830_stolen_size,
+};
+
+static const struct intel_early_ops i85x_early_ops __initconst = {
+ .stolen_base = i85x_stolen_base,
+ .stolen_size = gen3_stolen_size,
+};
+
+static const struct intel_early_ops i865_early_ops __initconst = {
+ .stolen_base = i865_stolen_base,
+ .stolen_size = gen3_stolen_size,
+};
+
+static const struct intel_early_ops gen3_early_ops __initconst = {
+ .stolen_base = gen3_stolen_base,
+ .stolen_size = gen3_stolen_size,
+};
+
+static const struct intel_early_ops gen6_early_ops __initconst = {
+ .stolen_base = gen3_stolen_base,
+ .stolen_size = gen6_stolen_size,
+};
+
+static const struct intel_early_ops gen8_early_ops __initconst = {
+ .stolen_base = gen3_stolen_base,
+ .stolen_size = gen8_stolen_size,
+};
+
+static const struct intel_early_ops gen9_early_ops __initconst = {
+ .stolen_base = gen3_stolen_base,
+ .stolen_size = gen9_stolen_size,
+};
+
+static const struct intel_early_ops chv_early_ops __initconst = {
+ .stolen_base = gen3_stolen_base,
+ .stolen_size = chv_stolen_size,
+};
+
+static const struct intel_early_ops gen11_early_ops __initconst = {
+ .stolen_base = gen11_stolen_base,
+ .stolen_size = gen9_stolen_size,
+};
+
+/* Intel integrated GPUs for which we need to reserve "stolen memory" */
+static const struct pci_device_id intel_early_ids[] __initconst = {
+ INTEL_I830_IDS(INTEL_VGA_DEVICE, &i830_early_ops),
+ INTEL_I845G_IDS(INTEL_VGA_DEVICE, &i845_early_ops),
+ INTEL_I85X_IDS(INTEL_VGA_DEVICE, &i85x_early_ops),
+ INTEL_I865G_IDS(INTEL_VGA_DEVICE, &i865_early_ops),
+ INTEL_I915G_IDS(INTEL_VGA_DEVICE, &gen3_early_ops),
+ INTEL_I915GM_IDS(INTEL_VGA_DEVICE, &gen3_early_ops),
+ INTEL_I945G_IDS(INTEL_VGA_DEVICE, &gen3_early_ops),
+ INTEL_I945GM_IDS(INTEL_VGA_DEVICE, &gen3_early_ops),
+ INTEL_VLV_IDS(INTEL_VGA_DEVICE, &gen6_early_ops),
+ INTEL_PNV_IDS(INTEL_VGA_DEVICE, &gen3_early_ops),
+ INTEL_I965G_IDS(INTEL_VGA_DEVICE, &gen3_early_ops),
+ INTEL_G33_IDS(INTEL_VGA_DEVICE, &gen3_early_ops),
+ INTEL_I965GM_IDS(INTEL_VGA_DEVICE, &gen3_early_ops),
+ INTEL_GM45_IDS(INTEL_VGA_DEVICE, &gen3_early_ops),
+ INTEL_G45_IDS(INTEL_VGA_DEVICE, &gen3_early_ops),
+ INTEL_ILK_IDS(INTEL_VGA_DEVICE, &gen3_early_ops),
+ INTEL_SNB_IDS(INTEL_VGA_DEVICE, &gen6_early_ops),
+ INTEL_IVB_IDS(INTEL_VGA_DEVICE, &gen6_early_ops),
+ INTEL_HSW_IDS(INTEL_VGA_DEVICE, &gen6_early_ops),
+ INTEL_BDW_IDS(INTEL_VGA_DEVICE, &gen8_early_ops),
+ INTEL_CHV_IDS(INTEL_VGA_DEVICE, &chv_early_ops),
+ INTEL_SKL_IDS(INTEL_VGA_DEVICE, &gen9_early_ops),
+ INTEL_BXT_IDS(INTEL_VGA_DEVICE, &gen9_early_ops),
+ INTEL_KBL_IDS(INTEL_VGA_DEVICE, &gen9_early_ops),
+ INTEL_CFL_IDS(INTEL_VGA_DEVICE, &gen9_early_ops),
+ INTEL_WHL_IDS(INTEL_VGA_DEVICE, &gen9_early_ops),
+ INTEL_CML_IDS(INTEL_VGA_DEVICE, &gen9_early_ops),
+ INTEL_GLK_IDS(INTEL_VGA_DEVICE, &gen9_early_ops),
+ INTEL_CNL_IDS(INTEL_VGA_DEVICE, &gen9_early_ops),
+ INTEL_ICL_IDS(INTEL_VGA_DEVICE, &gen11_early_ops),
+ INTEL_EHL_IDS(INTEL_VGA_DEVICE, &gen11_early_ops),
+ INTEL_JSL_IDS(INTEL_VGA_DEVICE, &gen11_early_ops),
+ INTEL_TGL_IDS(INTEL_VGA_DEVICE, &gen11_early_ops),
+ INTEL_RKL_IDS(INTEL_VGA_DEVICE, &gen11_early_ops),
+ INTEL_ADLS_IDS(INTEL_VGA_DEVICE, &gen11_early_ops),
+ INTEL_ADLP_IDS(INTEL_VGA_DEVICE, &gen11_early_ops),
+ INTEL_ADLN_IDS(INTEL_VGA_DEVICE, &gen11_early_ops),
+ INTEL_RPLS_IDS(INTEL_VGA_DEVICE, &gen11_early_ops),
+ INTEL_RPLU_IDS(INTEL_VGA_DEVICE, &gen11_early_ops),
+ INTEL_RPLP_IDS(INTEL_VGA_DEVICE, &gen11_early_ops),
+};
+
+struct resource intel_graphics_stolen_res __ro_after_init = DEFINE_RES_MEM(0, 0);
+EXPORT_SYMBOL(intel_graphics_stolen_res);
+
+static void __init
+intel_graphics_stolen(int num, int slot, int func,
+ const struct intel_early_ops *early_ops)
+{
+ resource_size_t base, size;
+ resource_size_t end;
+
+ size = early_ops->stolen_size(num, slot, func);
+ base = early_ops->stolen_base(num, slot, func, size);
+
+ if (!size || !base)
+ return;
+
+ end = base + size - 1;
+
+ intel_graphics_stolen_res.start = base;
+ intel_graphics_stolen_res.end = end;
+
+ printk(KERN_INFO "Reserving Intel graphics memory at %pR\n",
+ &intel_graphics_stolen_res);
+
+ /* Mark this space as reserved */
+ e820__range_add(base, size, E820_TYPE_RESERVED);
+ e820__update_table(e820_table);
+}
+
+static void __init intel_graphics_quirks(int num, int slot, int func)
+{
+ const struct intel_early_ops *early_ops;
+ u16 device;
+ int i;
+
+ /*
+ * Reserve "stolen memory" for an integrated GPU. If we've already
+ * found one, there's nothing to do for other (discrete) GPUs.
+ */
+ if (resource_size(&intel_graphics_stolen_res))
+ return;
+
+ device = read_pci_config_16(num, slot, func, PCI_DEVICE_ID);
+
+ for (i = 0; i < ARRAY_SIZE(intel_early_ids); i++) {
+ kernel_ulong_t driver_data = intel_early_ids[i].driver_data;
+
+ if (intel_early_ids[i].device != device)
+ continue;
+
+ early_ops = (typeof(early_ops))driver_data;
+
+ intel_graphics_stolen(num, slot, func, early_ops);
+
+ return;
+ }
+}
+
+static void __init force_disable_hpet(int num, int slot, int func)
{
#ifdef CONFIG_HPET_TIMER
- hpet_readback_cmp = 1;
+ boot_hpet_disable = true;
+ pr_info("x86/hpet: Will disable the HPET for this platform because it's not reliable\n");
#endif
}
+#define BCM4331_MMIO_SIZE 16384
+#define BCM4331_PM_CAP 0x40
+#define bcma_aread32(reg) ioread32(mmio + 1 * BCMA_CORE_SIZE + reg)
+#define bcma_awrite32(reg, val) iowrite32(val, mmio + 1 * BCMA_CORE_SIZE + reg)
+
+static void __init apple_airport_reset(int bus, int slot, int func)
+{
+ void __iomem *mmio;
+ u16 pmcsr;
+ u64 addr;
+ int i;
+
+ if (!x86_apple_machine)
+ return;
+
+ /* Card may have been put into PCI_D3hot by grub quirk */
+ pmcsr = read_pci_config_16(bus, slot, func, BCM4331_PM_CAP + PCI_PM_CTRL);
+
+ if ((pmcsr & PCI_PM_CTRL_STATE_MASK) != PCI_D0) {
+ pmcsr &= ~PCI_PM_CTRL_STATE_MASK;
+ write_pci_config_16(bus, slot, func, BCM4331_PM_CAP + PCI_PM_CTRL, pmcsr);
+ mdelay(10);
+
+ pmcsr = read_pci_config_16(bus, slot, func, BCM4331_PM_CAP + PCI_PM_CTRL);
+ if ((pmcsr & PCI_PM_CTRL_STATE_MASK) != PCI_D0) {
+ pr_err("pci 0000:%02x:%02x.%d: Cannot power up Apple AirPort card\n",
+ bus, slot, func);
+ return;
+ }
+ }
+
+ addr = read_pci_config(bus, slot, func, PCI_BASE_ADDRESS_0);
+ addr |= (u64)read_pci_config(bus, slot, func, PCI_BASE_ADDRESS_1) << 32;
+ addr &= PCI_BASE_ADDRESS_MEM_MASK;
+
+ mmio = early_ioremap(addr, BCM4331_MMIO_SIZE);
+ if (!mmio) {
+ pr_err("pci 0000:%02x:%02x.%d: Cannot iomap Apple AirPort card\n",
+ bus, slot, func);
+ return;
+ }
+
+ pr_info("Resetting Apple AirPort card (left enabled by EFI)\n");
+
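+	/* Wait up to 300us (30 * 10us) for a pending core reset to finish: */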
+ for (i = 0; bcma_aread32(BCMA_RESET_ST) && i < 30; i++)
+ udelay(10);
+
+ bcma_awrite32(BCMA_RESET_CTL, BCMA_RESET_CTL_RESET);
+ bcma_aread32(BCMA_RESET_CTL);
+ udelay(1);
+
+ bcma_awrite32(BCMA_RESET_CTL, 0);
+ bcma_aread32(BCMA_RESET_CTL);
+ udelay(10);
+
+ early_iounmap(mmio, BCM4331_MMIO_SIZE);
+}
+
#define QFLAG_APPLY_ONCE 0x1
#define QFLAG_APPLIED 0x2
#define QFLAG_DONE (QFLAG_APPLY_ONCE|QFLAG_APPLIED)
@@ -219,12 +696,6 @@ struct chipset {
void (*f)(int num, int slot, int func);
};
-/*
- * Only works for devices on the root bus. If you add any devices
- * not on bus 0 readd another loop level in early_quirks(). But
- * be careful because at least the Nvidia quirk here relies on
- * only matching on bus 0.
- */
static struct chipset early_qrk[] __initdata = {
{ PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, QFLAG_APPLY_ONCE, nvidia_bugs },
@@ -236,11 +707,31 @@ static struct chipset early_qrk[] __initdata = {
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs },
{ PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_bugs_contd },
- { PCI_VENDOR_ID_ATI, PCI_ANY_ID,
- PCI_CLASS_SERIAL_SMBUS, PCI_ANY_ID, 0, ati_hpet_bugs },
+ { PCI_VENDOR_ID_INTEL, 0x3403, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
+ { PCI_VENDOR_ID_INTEL, 0x3405, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
+ { PCI_VENDOR_ID_INTEL, 0x3406, PCI_CLASS_BRIDGE_HOST,
+ PCI_BASE_CLASS_BRIDGE, 0, intel_remapping_check },
+ { PCI_VENDOR_ID_INTEL, PCI_ANY_ID, PCI_CLASS_DISPLAY_VGA, PCI_ANY_ID,
+ 0, intel_graphics_quirks },
+ /*
+ * HPET on the current version of the Baytrail platform has accuracy
+	 * problems: it halts in deep idle states - so we disable it.
+ *
+ * More details can be found in section 18.10.1.3 of the datasheet:
+ *
+ * http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/atom-z8000-datasheet-vol-1.pdf
+ */
+ { PCI_VENDOR_ID_INTEL, 0x0f00,
+ PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
+ { PCI_VENDOR_ID_BROADCOM, 0x4331,
+ PCI_CLASS_NETWORK_OTHER, PCI_ANY_ID, 0, apple_airport_reset},
{}
};
+static void __init early_pci_scan_bus(int bus);
+
/**
* check_dev_quirk - apply early quirks to a given PCI device
* @num: bus number
@@ -249,7 +740,7 @@ static struct chipset early_qrk[] __initdata = {
*
* Check the vendor & device ID against the early quirks table.
*
- * If the device is single function, let early_quirks() know so we don't
+ * If the device is single function, let early_pci_scan_bus() know so we don't
* poke at this device again.
*/
static int __init check_dev_quirk(int num, int slot, int func)
@@ -258,6 +749,7 @@ static int __init check_dev_quirk(int num, int slot, int func)
u16 vendor;
u16 device;
u8 type;
+ u8 sec;
int i;
class = read_pci_config_16(num, slot, func, PCI_CLASS_DEVICE);
@@ -285,25 +777,36 @@ static int __init check_dev_quirk(int num, int slot, int func)
type = read_pci_config_byte(num, slot, func,
PCI_HEADER_TYPE);
- if (!(type & 0x80))
+
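+	/* Recurse into buses behind bridges; 'sec > num' avoids rescanning loops: */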
+ if ((type & PCI_HEADER_TYPE_MASK) == PCI_HEADER_TYPE_BRIDGE) {
+ sec = read_pci_config_byte(num, slot, func, PCI_SECONDARY_BUS);
+ if (sec > num)
+ early_pci_scan_bus(sec);
+ }
+
+ if (!(type & PCI_HEADER_TYPE_MFD))
return -1;
return 0;
}
-void __init early_quirks(void)
+static void __init early_pci_scan_bus(int bus)
{
int slot, func;
- if (!early_pci_allowed())
- return;
-
/* Poor man's PCI discovery */
- /* Only scan the root bus */
for (slot = 0; slot < 32; slot++)
for (func = 0; func < 8; func++) {
/* Only probe function 0 on single fn devices */
- if (check_dev_quirk(0, slot, func))
+ if (check_dev_quirk(bus, slot, func))
break;
}
}
+
+void __init early_quirks(void)
+{
+ if (!early_pci_allowed())
+ return;
+
+ early_pci_scan_bus(0);
+}
diff --git a/arch/x86/kernel/early_printk.c b/arch/x86/kernel/early_printk.c
index fa99bae75ace..cba75306e5b6 100644
--- a/arch/x86/kernel/early_printk.c
+++ b/arch/x86/kernel/early_printk.c
@@ -1,5 +1,7 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/console.h>
#include <linux/kernel.h>
+#include <linux/kexec.h>
#include <linux/init.h>
#include <linux/string.h>
#include <linux/screen_info.h>
@@ -7,6 +9,7 @@
#include <linux/pci_regs.h>
#include <linux/pci_ids.h>
#include <linux/errno.h>
+#include <linux/pgtable.h>
#include <asm/io.h>
#include <asm/processor.h>
#include <asm/fcntl.h>
@@ -14,8 +17,10 @@
#include <xen/hvc-console.h>
#include <asm/pci-direct.h>
#include <asm/fixmap.h>
-#include <asm/pgtable.h>
#include <linux/usb/ehci_def.h>
+#include <linux/usb/xhci-dbgp.h>
+#include <asm/pci_x86.h>
+#include <linux/static_call.h>
/* Simple VGA output */
#define VGABASE (__ISA_IO_base + 0xb8000)
@@ -73,7 +78,7 @@ static struct console early_vga_console = {
/* Serial functions loosely based on a similar package from Klaus P. Gerlicher */
-static int early_serial_base = 0x3f8; /* ttyS0 */
+static unsigned long early_serial_base = 0x3f8; /* ttyS0 */
#define XMTRDY 0x20
@@ -91,13 +96,28 @@ static int early_serial_base = 0x3f8; /* ttyS0 */
#define DLL 0 /* Divisor Latch Low */
#define DLH 1 /* Divisor latch High */
+static __noendbr unsigned int io_serial_in(unsigned long addr, int offset)
+{
+ return inb(addr + offset);
+}
+ANNOTATE_NOENDBR_SYM(io_serial_in);
+
+static __noendbr void io_serial_out(unsigned long addr, int offset, int value)
+{
+ outb(value, addr + offset);
+}
+ANNOTATE_NOENDBR_SYM(io_serial_out);
+
+DEFINE_STATIC_CALL(serial_in, io_serial_in);
+DEFINE_STATIC_CALL(serial_out, io_serial_out);
+
static int early_serial_putc(unsigned char ch)
{
unsigned timeout = 0xffff;
- while ((inb(early_serial_base + LSR) & XMTRDY) == 0 && --timeout)
+ while ((static_call(serial_in)(early_serial_base, LSR) & XMTRDY) == 0 && --timeout)
cpu_relax();
- outb(ch, early_serial_base + TXR);
+ static_call(serial_out)(early_serial_base, TXR, ch);
return timeout ? 0 : -1;
}
@@ -111,13 +131,33 @@ static void early_serial_write(struct console *con, const char *s, unsigned n)
}
}
+static __init void early_serial_hw_init(unsigned divisor)
+{
+ unsigned char c;
+
+ static_call(serial_out)(early_serial_base, LCR, 0x3); /* 8n1 */
+ static_call(serial_out)(early_serial_base, IER, 0); /* no interrupt */
+ static_call(serial_out)(early_serial_base, FCR, 0); /* no fifo */
+ static_call(serial_out)(early_serial_base, MCR, 0x3); /* DTR + RTS */
+
+ c = static_call(serial_in)(early_serial_base, LCR);
+ static_call(serial_out)(early_serial_base, LCR, c | DLAB);
+ static_call(serial_out)(early_serial_base, DLL, divisor & 0xff);
+ static_call(serial_out)(early_serial_base, DLH, (divisor >> 8) & 0xff);
+ static_call(serial_out)(early_serial_base, LCR, c & ~DLAB);
+
+#if defined(CONFIG_KEXEC_CORE) && defined(CONFIG_X86_64)
+ if (static_call_query(serial_in) == io_serial_in)
+ kexec_debug_8250_port = early_serial_base;
+#endif
+}
+
#define DEFAULT_BAUD 9600
static __init void early_serial_init(char *s)
{
- unsigned char c;
unsigned divisor;
- unsigned baud = DEFAULT_BAUD;
+ unsigned long baud = DEFAULT_BAUD;
char *e;
if (*s == ',')
@@ -142,24 +182,186 @@ static __init void early_serial_init(char *s)
s++;
}
- outb(0x3, early_serial_base + LCR); /* 8n1 */
- outb(0, early_serial_base + IER); /* no interrupt */
- outb(0, early_serial_base + FCR); /* no fifo */
- outb(0x3, early_serial_base + MCR); /* DTR + RTS */
-
if (*s) {
- baud = simple_strtoul(s, &e, 0);
+ baud = simple_strtoull(s, &e, 0);
+
if (baud == 0 || s == e)
baud = DEFAULT_BAUD;
}
+ /* Convert from baud to divisor value */
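+	/* (e.g. baud == 9600 gives divisor == 12, baud == 115200 gives divisor == 1) */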
divisor = 115200 / baud;
- c = inb(early_serial_base + LCR);
- outb(c | DLAB, early_serial_base + LCR);
- outb(divisor & 0xff, early_serial_base + DLL);
- outb((divisor >> 8) & 0xff, early_serial_base + DLH);
- outb(c & ~DLAB, early_serial_base + LCR);
+
+ /* Set up the HW */
+ early_serial_hw_init(divisor);
+}
+
+static __noendbr void mem32_serial_out(unsigned long addr, int offset, int value)
+{
+ u32 __iomem *vaddr = (u32 __iomem *)addr;
+ /* shift implied by pointer type */
+ writel(value, vaddr + offset);
}
+ANNOTATE_NOENDBR_SYM(mem32_serial_out);
+
+static __noendbr unsigned int mem32_serial_in(unsigned long addr, int offset)
+{
+ u32 __iomem *vaddr = (u32 __iomem *)addr;
+ /* shift implied by pointer type */
+ return readl(vaddr + offset);
+}
+ANNOTATE_NOENDBR_SYM(mem32_serial_in);
+
+/*
+ * early_mmio_serial_init() - Initialize MMIO-based early serial console.
+ * @s: MMIO-based serial specification.
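+ *
+ * An illustrative invocation (MMIO address made up):
+ *
+ *	earlyprintk=mmio32,0xfedc9000,115200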
+ */
+static __init void early_mmio_serial_init(char *s)
+{
+ unsigned long baudrate;
+ unsigned long membase;
+ char *e;
+
+ if (*s == ',')
+ s++;
+
+ if (!strncmp(s, "0x", 2)) {
+ /* NB: only 32-bit addresses are supported. */
+ membase = simple_strtoul(s, &e, 16);
+ early_serial_base = (unsigned long)early_ioremap(membase, PAGE_SIZE);
+
+ static_call_update(serial_in, mem32_serial_in);
+ static_call_update(serial_out, mem32_serial_out);
+
+ s += strcspn(s, ",");
+ if (*s == ',')
+ s++;
+ }
+
+ if (!strncmp(s, "nocfg", 5)) {
+ baudrate = 0;
+ } else {
+ baudrate = simple_strtoul(s, &e, 0);
+ if (baudrate == 0 || s == e)
+ baudrate = DEFAULT_BAUD;
+ }
+
+ if (baudrate)
+ early_serial_hw_init(115200 / baudrate);
+}
+
+#ifdef CONFIG_PCI
+/*
+ * early_pci_serial_init()
+ *
+ * This function is invoked when the early_printk param starts with "pciserial"
+ * The rest of the param should be "[force],B:D.F,baud", where B, D & F describe
+ * the location of a PCI device that must be a UART device. "force" is optional
+ * and forces the use of a UART device even if it has a wrong PCI class code.
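+ *
+ * An illustrative invocation (device address made up):
+ *
+ *	earlyprintk=pciserial,force,00:18.1,115200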
+ */
+static __init void early_pci_serial_init(char *s)
+{
+ unsigned divisor;
+ unsigned long baud = DEFAULT_BAUD;
+ u8 bus, slot, func;
+ u32 classcode, bar0;
+ u16 cmdreg;
+ char *e;
+ int force = 0;
+
+ if (*s == ',')
+ ++s;
+
+ if (*s == 0)
+ return;
+
+	/* Force the use of a UART device with a wrong class code */
+ if (!strncmp(s, "force,", 6)) {
+ force = 1;
+ s += 6;
+ }
+
+ /*
+	 * Parse the param to get the BDF values
+ */
+ bus = (u8)simple_strtoul(s, &e, 16);
+ s = e;
+ if (*s != ':')
+ return;
+ ++s;
+ slot = (u8)simple_strtoul(s, &e, 16);
+ s = e;
+ if (*s != '.')
+ return;
+ ++s;
+ func = (u8)simple_strtoul(s, &e, 16);
+ s = e;
+
+ /* A baud might be following */
+ if (*s == ',')
+ s++;
+
+ /*
+ * Find the device from the BDF
+ */
+ cmdreg = read_pci_config(bus, slot, func, PCI_COMMAND);
+ classcode = read_pci_config(bus, slot, func, PCI_CLASS_REVISION);
+ bar0 = read_pci_config(bus, slot, func, PCI_BASE_ADDRESS_0);
+
+ /*
+ * Verify it is a 16550-UART type device
+ */
+ if (((classcode >> 16 != PCI_CLASS_COMMUNICATION_MODEM) &&
+ (classcode >> 16 != PCI_CLASS_COMMUNICATION_SERIAL)) ||
+ (((classcode >> 8) & 0xff) != PCI_SERIAL_16550_COMPATIBLE)) {
+ if (!force)
+ return;
+ }
+
+ /*
+ * Determine if it is IO or memory mapped
+ */
+ if ((bar0 & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
+ /* it is IO mapped */
+ early_serial_base = bar0 & PCI_BASE_ADDRESS_IO_MASK;
+ write_pci_config(bus, slot, func, PCI_COMMAND,
+ cmdreg|PCI_COMMAND_IO);
+ } else {
+ /* It is memory mapped - assume 32-bit alignment */
+ static_call_update(serial_in, mem32_serial_in);
+ static_call_update(serial_out, mem32_serial_out);
+ /* WARNING! assuming the address is always in the first 4G */
+ early_serial_base =
+ (unsigned long)early_ioremap(bar0 & PCI_BASE_ADDRESS_MEM_MASK, 0x10);
+#if defined(CONFIG_KEXEC_CORE) && defined(CONFIG_X86_64)
+ kexec_debug_8250_mmio32 = bar0 & PCI_BASE_ADDRESS_MEM_MASK;
+#endif
+ write_pci_config(bus, slot, func, PCI_COMMAND,
+ cmdreg|PCI_COMMAND_MEMORY);
+ }
+
+ /*
+ * Initialize the hardware
+ */
+ if (*s) {
+		/*
+		 * Sometimes, we want to leave the UART alone and assume
+		 * the BIOS has set it up correctly. "nocfg" tells us this
+		 * is the case, and we should do no more setup.
+		 */
+		if (strcmp(s, "nocfg") == 0)
+			return;
+ if (kstrtoul(s, 0, &baud) < 0 || baud == 0)
+ baud = DEFAULT_BAUD;
+ }
+
+ /* Convert from baud to divisor value */
+ divisor = 115200 / baud;
+
+ /* Set up the HW */
+ early_serial_hw_init(divisor);
+}
+#endif
static struct console early_serial_console = {
.name = "earlyser",
@@ -168,25 +370,9 @@ static struct console early_serial_console = {
.index = -1,
};
-/* Direct interface for emergencies */
-static struct console *early_console = &early_vga_console;
-static int __initdata early_console_initialized;
-
-asmlinkage void early_printk(const char *fmt, ...)
+static void early_console_register(struct console *con, int keep_early)
{
- char buf[512];
- int n;
- va_list ap;
-
- va_start(ap, fmt);
- n = vscnprintf(buf, sizeof(buf), fmt, ap);
- early_console->write(early_console, buf, n);
- va_end(ap);
-}
-
-static inline void early_console_register(struct console *con, int keep_early)
-{
- if (early_console->index != -1) {
+ if (con->index != -1) {
printk(KERN_CRIT "ERROR: earlyprintk= %s already used\n",
con->name);
return;
@@ -206,13 +392,17 @@ static int __init setup_early_printk(char *buf)
if (!buf)
return 0;
- if (early_console_initialized)
+ if (early_console)
return 0;
- early_console_initialized = 1;
keep = (strstr(buf, "keep") != NULL);
while (*buf != '\0') {
+ if (!strncmp(buf, "mmio32", 6)) {
+ buf += 6;
+ early_mmio_serial_init(buf);
+ early_console_register(&early_serial_console, keep);
+ }
if (!strncmp(buf, "serial", 6)) {
buf += 6;
early_serial_init(buf);
@@ -224,6 +414,13 @@ static int __init setup_early_printk(char *buf)
early_serial_init(buf + 4);
early_console_register(&early_serial_console, keep);
}
+#ifdef CONFIG_PCI
+ if (!strncmp(buf, "pciserial", 9)) {
+		buf += 9; /* Keep the "serial" check below from matching "pciserial" */
+ early_pci_serial_init(buf);
+ early_console_register(&early_serial_console, keep);
+ }
+#endif
if (!strncmp(buf, "vga", 3) &&
boot_params.screen_info.orig_video_isVGA == 1) {
max_xpos = boot_params.screen_info.orig_video_cols;
@@ -239,6 +436,11 @@ static int __init setup_early_printk(char *buf)
if (!strncmp(buf, "xen", 3))
early_console_register(&xenboot_console, keep);
#endif
+#ifdef CONFIG_EARLY_PRINTK_USB_XDBC
+ if (!strncmp(buf, "xdbc", 4))
+ early_xdbc_parse_parameter(buf + 4, keep);
+#endif
+
buf++;
}
return 0;
diff --git a/arch/x86/kernel/ebda.c b/arch/x86/kernel/ebda.c
new file mode 100644
index 000000000000..38e7d597b660
--- /dev/null
+++ b/arch/x86/kernel/ebda.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/memblock.h>
+
+#include <asm/setup.h>
+#include <asm/bios_ebda.h>
+
+/*
+ * This function reserves all conventional PC system BIOS related
+ * firmware memory areas (some of which are data, some of which
+ * are code), that must not be used by the kernel as available
+ * RAM.
+ *
+ * The BIOS places the EBDA/XBDA at the top of conventional
+ * memory, and usually decreases the reported amount of
+ * conventional memory (int 0x12) too.
+ *
+ * This means that as a first approximation on most systems we can
+ * guess the reserved BIOS area by looking at the low BIOS RAM size
+ * value and assume that everything above that value (up to 1MB) is
+ * reserved.
+ *
+ * But life in firmware country is not that simple:
+ *
+ * - This code also contains a quirk for Dell systems that neglect
+ * to reserve the EBDA area in the 'RAM size' value ...
+ *
+ * - The same quirk also avoids a problem with the AMD768MPX
+ * chipset: reserve a page before VGA to prevent PCI prefetch
+ * into it (errata #56). (Usually the page is reserved anyways,
+ * unless you have no PS/2 mouse plugged in.)
+ *
+ * - Plus paravirt systems don't have a reliable value in the
+ * 'BIOS RAM size' pointer we can rely on, so we must quirk
+ * them too.
+ *
+ * Due to those various problems this function is deliberately
+ * very conservative and tries to err on the side of reserving
+ * too much, to not risk reserving too little.
+ *
+ * Losing a small amount of memory in the bottom megabyte is
+ * rarely a problem, as long as we have enough memory to install
+ * the SMP bootup trampoline which *must* be in this area.
+ *
+ * Using memory that is in use by the BIOS or by some DMA device
+ * the BIOS didn't shut down *is* a big problem to the kernel,
+ * obviously.
+ */
+
+#define BIOS_RAM_SIZE_KB_PTR 0x413
+
+#define BIOS_START_MIN 0x20000U /* 128K, less than this is insane */
+#define BIOS_START_MAX 0x9f000U /* 640K, absolute maximum */
+
+void __init reserve_bios_regions(void)
+{
+ unsigned int bios_start, ebda_start;
+
+ /*
+ * NOTE: In a paravirtual environment the BIOS reserved
+ * area is absent. We'll just have to assume that the
+ * paravirt case can handle memory setup correctly,
+ * without our help.
+ */
+ if (!x86_platform.legacy.reserve_bios_regions)
+ return;
+
+ /*
+ * BIOS RAM size is encoded in kilobytes, convert it
+ * to bytes to get a first guess at where the BIOS
+ * firmware area starts:
+ */
+ bios_start = *(unsigned short *)__va(BIOS_RAM_SIZE_KB_PTR);
+ bios_start <<= 10;
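+	/* (Illustrative: a reported 639 KB yields bios_start == 0x9fc00.) */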
+
+ /*
+ * If bios_start is less than 128K, assume it is bogus
+ * and bump it up to 640K. Similarly, if bios_start is above 640K,
+ * don't trust it.
+ */
+ if (bios_start < BIOS_START_MIN || bios_start > BIOS_START_MAX)
+ bios_start = BIOS_START_MAX;
+
+ /* Get the start address of the EBDA page: */
+ ebda_start = get_bios_ebda();
+
+ /*
+ * If the EBDA start address is sane and is below the BIOS region,
+ * then also reserve everything from the EBDA start address up to
+ * the BIOS region.
+ */
+ if (ebda_start >= BIOS_START_MIN && ebda_start < bios_start)
+ bios_start = ebda_start;
+
+ /* Reserve all memory between bios_start and the 1MB mark: */
+ memblock_reserve(bios_start, 0x100000 - bios_start);
+}
diff --git a/arch/x86/kernel/efi.c b/arch/x86/kernel/efi.c
deleted file mode 100644
index c2fa9b8b497e..000000000000
--- a/arch/x86/kernel/efi.c
+++ /dev/null
@@ -1,612 +0,0 @@
-/*
- * Common EFI (Extensible Firmware Interface) support functions
- * Based on Extensible Firmware Interface Specification version 1.0
- *
- * Copyright (C) 1999 VA Linux Systems
- * Copyright (C) 1999 Walt Drummond <drummond@valinux.com>
- * Copyright (C) 1999-2002 Hewlett-Packard Co.
- * David Mosberger-Tang <davidm@hpl.hp.com>
- * Stephane Eranian <eranian@hpl.hp.com>
- * Copyright (C) 2005-2008 Intel Co.
- * Fenghua Yu <fenghua.yu@intel.com>
- * Bibo Mao <bibo.mao@intel.com>
- * Chandramouli Narayanan <mouli@linux.intel.com>
- * Huang Ying <ying.huang@intel.com>
- *
- * Copied from efi_32.c to eliminate the duplicated code between EFI
- * 32/64 support code. --ying 2007-10-26
- *
- * All EFI Runtime Services are not implemented yet as EFI only
- * supports physical mode addressing on SoftSDV. This is to be fixed
- * in a future version. --drummond 1999-07-20
- *
- * Implemented EFI runtime services and virtual mode calls. --davidm
- *
- * Goutham Rao: <goutham.rao@intel.com>
- * Skip non-WB memory and ignore empty memory ranges.
- */
-
-#include <linux/kernel.h>
-#include <linux/init.h>
-#include <linux/efi.h>
-#include <linux/bootmem.h>
-#include <linux/spinlock.h>
-#include <linux/uaccess.h>
-#include <linux/time.h>
-#include <linux/io.h>
-#include <linux/reboot.h>
-#include <linux/bcd.h>
-
-#include <asm/setup.h>
-#include <asm/efi.h>
-#include <asm/time.h>
-#include <asm/cacheflush.h>
-#include <asm/tlbflush.h>
-#include <asm/x86_init.h>
-
-#define EFI_DEBUG 1
-#define PFX "EFI: "
-
-int efi_enabled;
-EXPORT_SYMBOL(efi_enabled);
-
-struct efi efi;
-EXPORT_SYMBOL(efi);
-
-struct efi_memory_map memmap;
-
-static struct efi efi_phys __initdata;
-static efi_system_table_t efi_systab __initdata;
-
-static int __init setup_noefi(char *arg)
-{
- efi_enabled = 0;
- return 0;
-}
-early_param("noefi", setup_noefi);
-
-int add_efi_memmap;
-EXPORT_SYMBOL(add_efi_memmap);
-
-static int __init setup_add_efi_memmap(char *arg)
-{
- add_efi_memmap = 1;
- return 0;
-}
-early_param("add_efi_memmap", setup_add_efi_memmap);
-
-
-static efi_status_t virt_efi_get_time(efi_time_t *tm, efi_time_cap_t *tc)
-{
- return efi_call_virt2(get_time, tm, tc);
-}
-
-static efi_status_t virt_efi_set_time(efi_time_t *tm)
-{
- return efi_call_virt1(set_time, tm);
-}
-
-static efi_status_t virt_efi_get_wakeup_time(efi_bool_t *enabled,
- efi_bool_t *pending,
- efi_time_t *tm)
-{
- return efi_call_virt3(get_wakeup_time,
- enabled, pending, tm);
-}
-
-static efi_status_t virt_efi_set_wakeup_time(efi_bool_t enabled, efi_time_t *tm)
-{
- return efi_call_virt2(set_wakeup_time,
- enabled, tm);
-}
-
-static efi_status_t virt_efi_get_variable(efi_char16_t *name,
- efi_guid_t *vendor,
- u32 *attr,
- unsigned long *data_size,
- void *data)
-{
- return efi_call_virt5(get_variable,
- name, vendor, attr,
- data_size, data);
-}
-
-static efi_status_t virt_efi_get_next_variable(unsigned long *name_size,
- efi_char16_t *name,
- efi_guid_t *vendor)
-{
- return efi_call_virt3(get_next_variable,
- name_size, name, vendor);
-}
-
-static efi_status_t virt_efi_set_variable(efi_char16_t *name,
- efi_guid_t *vendor,
- unsigned long attr,
- unsigned long data_size,
- void *data)
-{
- return efi_call_virt5(set_variable,
- name, vendor, attr,
- data_size, data);
-}
-
-static efi_status_t virt_efi_get_next_high_mono_count(u32 *count)
-{
- return efi_call_virt1(get_next_high_mono_count, count);
-}
-
-static void virt_efi_reset_system(int reset_type,
- efi_status_t status,
- unsigned long data_size,
- efi_char16_t *data)
-{
- efi_call_virt4(reset_system, reset_type, status,
- data_size, data);
-}
-
-static efi_status_t virt_efi_set_virtual_address_map(
- unsigned long memory_map_size,
- unsigned long descriptor_size,
- u32 descriptor_version,
- efi_memory_desc_t *virtual_map)
-{
- return efi_call_virt4(set_virtual_address_map,
- memory_map_size, descriptor_size,
- descriptor_version, virtual_map);
-}
-
-static efi_status_t __init phys_efi_set_virtual_address_map(
- unsigned long memory_map_size,
- unsigned long descriptor_size,
- u32 descriptor_version,
- efi_memory_desc_t *virtual_map)
-{
- efi_status_t status;
-
- efi_call_phys_prelog();
- status = efi_call_phys4(efi_phys.set_virtual_address_map,
- memory_map_size, descriptor_size,
- descriptor_version, virtual_map);
- efi_call_phys_epilog();
- return status;
-}
-
-static efi_status_t __init phys_efi_get_time(efi_time_t *tm,
- efi_time_cap_t *tc)
-{
- efi_status_t status;
-
- efi_call_phys_prelog();
- status = efi_call_phys2(efi_phys.get_time, tm, tc);
- efi_call_phys_epilog();
- return status;
-}
-
-int efi_set_rtc_mmss(unsigned long nowtime)
-{
- int real_seconds, real_minutes;
- efi_status_t status;
- efi_time_t eft;
- efi_time_cap_t cap;
-
- status = efi.get_time(&eft, &cap);
- if (status != EFI_SUCCESS) {
- printk(KERN_ERR "Oops: efitime: can't read time!\n");
- return -1;
- }
-
- real_seconds = nowtime % 60;
- real_minutes = nowtime / 60;
- if (((abs(real_minutes - eft.minute) + 15)/30) & 1)
- real_minutes += 30;
- real_minutes %= 60;
- eft.minute = real_minutes;
- eft.second = real_seconds;
-
- status = efi.set_time(&eft);
- if (status != EFI_SUCCESS) {
- printk(KERN_ERR "Oops: efitime: can't write time!\n");
- return -1;
- }
- return 0;
-}
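
The rounding expression above is the classic set_rtc_mmss() idiom: (x + 15) / 30 is odd exactly when x mod 60 falls in [15, 45), so the test fires when the RTC minute is roughly half an hour away from the system minute, i.e. when the RTC appears to keep a half-hour-offset local time. A minimal stand-alone sketch of just that predicate (function and variable names are invented for illustration):

	#include <stdlib.h>
	#include <assert.h>

	/* Returns 1 when the RTC minute is ~30 min (mod 60) away from the
	 * system's minute count, i.e. a half-hour-offset clock. */
	static int wants_half_hour_shift(long sys_minutes, int rtc_minute)
	{
		return ((labs(sys_minutes - rtc_minute) + 15) / 30) & 1;
	}

	int main(void)
	{
		assert(wants_half_hour_shift(90, 0) == 1);	/* 30 min apart */
		assert(wants_half_hour_shift(62, 3) == 0);	/* ~1 min apart */
		return 0;
	}
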
-
-unsigned long efi_get_time(void)
-{
- efi_status_t status;
- efi_time_t eft;
- efi_time_cap_t cap;
-
- status = efi.get_time(&eft, &cap);
- if (status != EFI_SUCCESS)
- printk(KERN_ERR "Oops: efitime: can't read time!\n");
-
- return mktime(eft.year, eft.month, eft.day, eft.hour,
- eft.minute, eft.second);
-}
-
-/*
- * Tell the kernel about the EFI memory map. This might include
- * more than the max 128 entries that can fit in the e820 legacy
- * (zeropage) memory map.
- */
-
-static void __init do_add_efi_memmap(void)
-{
- void *p;
-
- for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
- efi_memory_desc_t *md = p;
- unsigned long long start = md->phys_addr;
- unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;
- int e820_type;
-
- switch (md->type) {
- case EFI_LOADER_CODE:
- case EFI_LOADER_DATA:
- case EFI_BOOT_SERVICES_CODE:
- case EFI_BOOT_SERVICES_DATA:
- case EFI_CONVENTIONAL_MEMORY:
- if (md->attribute & EFI_MEMORY_WB)
- e820_type = E820_RAM;
- else
- e820_type = E820_RESERVED;
- break;
- case EFI_ACPI_RECLAIM_MEMORY:
- e820_type = E820_ACPI;
- break;
- case EFI_ACPI_MEMORY_NVS:
- e820_type = E820_NVS;
- break;
- case EFI_UNUSABLE_MEMORY:
- e820_type = E820_UNUSABLE;
- break;
- default:
- /*
- * EFI_RESERVED_TYPE, EFI_RUNTIME_SERVICES_CODE,
- * EFI_RUNTIME_SERVICES_DATA, EFI_MEMORY_MAPPED_IO,
- * EFI_MEMORY_MAPPED_IO_PORT_SPACE, EFI_PAL_CODE
- */
- e820_type = E820_RESERVED;
- break;
- }
- e820_add_region(start, size, e820_type);
- }
- sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
-}
-
-void __init efi_reserve_early(void)
-{
- unsigned long pmap;
-
-#ifdef CONFIG_X86_32
- pmap = boot_params.efi_info.efi_memmap;
-#else
- pmap = (boot_params.efi_info.efi_memmap |
- ((__u64)boot_params.efi_info.efi_memmap_hi<<32));
-#endif
- memmap.phys_map = (void *)pmap;
- memmap.nr_map = boot_params.efi_info.efi_memmap_size /
- boot_params.efi_info.efi_memdesc_size;
- memmap.desc_version = boot_params.efi_info.efi_memdesc_version;
- memmap.desc_size = boot_params.efi_info.efi_memdesc_size;
- reserve_early(pmap, pmap + memmap.nr_map * memmap.desc_size,
- "EFI memmap");
-}
-
-#if EFI_DEBUG
-static void __init print_efi_memmap(void)
-{
- efi_memory_desc_t *md;
- void *p;
- int i;
-
- for (p = memmap.map, i = 0;
- p < memmap.map_end;
- p += memmap.desc_size, i++) {
- md = p;
- printk(KERN_INFO PFX "mem%02u: type=%u, attr=0x%llx, "
- "range=[0x%016llx-0x%016llx) (%lluMB)\n",
- i, md->type, md->attribute, md->phys_addr,
- md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT),
- (md->num_pages >> (20 - EFI_PAGE_SHIFT)));
- }
-}
-#endif /* EFI_DEBUG */
-
-void __init efi_init(void)
-{
- efi_config_table_t *config_tables;
- efi_runtime_services_t *runtime;
- efi_char16_t *c16;
- char vendor[100] = "unknown";
- int i = 0;
- void *tmp;
-
-#ifdef CONFIG_X86_32
- efi_phys.systab = (efi_system_table_t *)boot_params.efi_info.efi_systab;
-#else
- efi_phys.systab = (efi_system_table_t *)
- (boot_params.efi_info.efi_systab |
- ((__u64)boot_params.efi_info.efi_systab_hi<<32));
-#endif
-
- efi.systab = early_ioremap((unsigned long)efi_phys.systab,
- sizeof(efi_system_table_t));
- if (efi.systab == NULL)
- printk(KERN_ERR "Couldn't map the EFI system table!\n");
- memcpy(&efi_systab, efi.systab, sizeof(efi_system_table_t));
- early_iounmap(efi.systab, sizeof(efi_system_table_t));
- efi.systab = &efi_systab;
-
- /*
- * Verify the EFI Table
- */
- if (efi.systab->hdr.signature != EFI_SYSTEM_TABLE_SIGNATURE)
- printk(KERN_ERR "EFI system table signature incorrect!\n");
- if ((efi.systab->hdr.revision >> 16) == 0)
- printk(KERN_ERR "Warning: EFI system table version "
- "%d.%02d, expected 1.00 or greater!\n",
- efi.systab->hdr.revision >> 16,
- efi.systab->hdr.revision & 0xffff);
-
- /*
- * Show what we know for posterity
- */
- c16 = tmp = early_ioremap(efi.systab->fw_vendor, 2);
- if (c16) {
- for (i = 0; i < sizeof(vendor) - 1 && *c16; ++i)
- vendor[i] = *c16++;
- vendor[i] = '\0';
- } else
- printk(KERN_ERR PFX "Could not map the firmware vendor!\n");
- early_iounmap(tmp, 2);
-
- printk(KERN_INFO "EFI v%u.%.02u by %s\n",
- efi.systab->hdr.revision >> 16,
- efi.systab->hdr.revision & 0xffff, vendor);
-
- /*
- * Let's see what config tables the firmware passed to us.
- */
- config_tables = early_ioremap(
- efi.systab->tables,
- efi.systab->nr_tables * sizeof(efi_config_table_t));
- if (config_tables == NULL)
- printk(KERN_ERR "Could not map EFI Configuration Table!\n");
-
- printk(KERN_INFO);
- for (i = 0; i < efi.systab->nr_tables; i++) {
- if (!efi_guidcmp(config_tables[i].guid, MPS_TABLE_GUID)) {
- efi.mps = config_tables[i].table;
- printk(" MPS=0x%lx ", config_tables[i].table);
- } else if (!efi_guidcmp(config_tables[i].guid,
- ACPI_20_TABLE_GUID)) {
- efi.acpi20 = config_tables[i].table;
- printk(" ACPI 2.0=0x%lx ", config_tables[i].table);
- } else if (!efi_guidcmp(config_tables[i].guid,
- ACPI_TABLE_GUID)) {
- efi.acpi = config_tables[i].table;
- printk(" ACPI=0x%lx ", config_tables[i].table);
- } else if (!efi_guidcmp(config_tables[i].guid,
- SMBIOS_TABLE_GUID)) {
- efi.smbios = config_tables[i].table;
- printk(" SMBIOS=0x%lx ", config_tables[i].table);
-#ifdef CONFIG_X86_UV
- } else if (!efi_guidcmp(config_tables[i].guid,
- UV_SYSTEM_TABLE_GUID)) {
- efi.uv_systab = config_tables[i].table;
- printk(" UVsystab=0x%lx ", config_tables[i].table);
-#endif
- } else if (!efi_guidcmp(config_tables[i].guid,
- HCDP_TABLE_GUID)) {
- efi.hcdp = config_tables[i].table;
- printk(" HCDP=0x%lx ", config_tables[i].table);
- } else if (!efi_guidcmp(config_tables[i].guid,
- UGA_IO_PROTOCOL_GUID)) {
- efi.uga = config_tables[i].table;
- printk(" UGA=0x%lx ", config_tables[i].table);
- }
- }
- printk("\n");
- early_iounmap(config_tables,
- efi.systab->nr_tables * sizeof(efi_config_table_t));
-
- /*
- * Check out the runtime services table. We need to map it so that
- * we can grab the physical addresses of the EFI runtime functions
- * needed to put the firmware into virtual mode.
- */
- runtime = early_ioremap((unsigned long)efi.systab->runtime,
- sizeof(efi_runtime_services_t));
- if (runtime != NULL) {
- /*
- * We will only need *early* access to the following
- * two EFI runtime services before set_virtual_address_map
- * is invoked.
- */
- efi_phys.get_time = (efi_get_time_t *)runtime->get_time;
- efi_phys.set_virtual_address_map =
- (efi_set_virtual_address_map_t *)
- runtime->set_virtual_address_map;
- /*
- * Make sure efi_get_time can be called before entering
- * virtual mode.
- */
- efi.get_time = phys_efi_get_time;
- } else
- printk(KERN_ERR "Could not map the EFI runtime service "
- "table!\n");
- early_iounmap(runtime, sizeof(efi_runtime_services_t));
-
- /* Map the EFI memory map */
- memmap.map = early_ioremap((unsigned long)memmap.phys_map,
- memmap.nr_map * memmap.desc_size);
- if (memmap.map == NULL)
- printk(KERN_ERR "Could not map the EFI memory map!\n");
- memmap.map_end = memmap.map + (memmap.nr_map * memmap.desc_size);
-
- if (memmap.desc_size != sizeof(efi_memory_desc_t))
- printk(KERN_WARNING
- "Kernel-defined memdesc doesn't match the one from EFI!\n");
-
- if (add_efi_memmap)
- do_add_efi_memmap();
-
-#ifdef CONFIG_X86_32
- x86_platform.get_wallclock = efi_get_time;
- x86_platform.set_wallclock = efi_set_rtc_mmss;
-#endif
-
- /* Setup for EFI runtime service */
- reboot_type = BOOT_EFI;
-
-#if EFI_DEBUG
- print_efi_memmap();
-#endif
-}
-
-static void __init runtime_code_page_mkexec(void)
-{
- efi_memory_desc_t *md;
- void *p;
- u64 addr, npages;
-
- /* Make EFI runtime service code area executable */
- for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
- md = p;
-
- if (md->type != EFI_RUNTIME_SERVICES_CODE)
- continue;
-
- addr = md->virt_addr;
- npages = md->num_pages;
- memrange_efi_to_native(&addr, &npages);
- set_memory_x(addr, npages);
- }
-}
-
-/*
- * This function will switch the EFI runtime services to virtual mode.
- * Essentially, look through the EFI memmap and map every region that
- * has the runtime attribute bit set in its memory descriptor and update
- * that memory descriptor with the virtual address obtained from ioremap().
- * This enables the runtime services to be called without having to
- * thunk back into physical mode for every invocation.
- */
-void __init efi_enter_virtual_mode(void)
-{
- efi_memory_desc_t *md;
- efi_status_t status;
- unsigned long size;
- u64 end, systab, addr, npages, end_pfn;
- void *p, *va;
-
- efi.systab = NULL;
- for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
- md = p;
- if (!(md->attribute & EFI_MEMORY_RUNTIME))
- continue;
-
- size = md->num_pages << EFI_PAGE_SHIFT;
- end = md->phys_addr + size;
-
- end_pfn = PFN_UP(end);
- if (end_pfn <= max_low_pfn_mapped
- || (end_pfn > (1UL << (32 - PAGE_SHIFT))
- && end_pfn <= max_pfn_mapped))
- va = __va(md->phys_addr);
- else
- va = efi_ioremap(md->phys_addr, size, md->type);
-
- md->virt_addr = (u64) (unsigned long) va;
-
- if (!va) {
- printk(KERN_ERR PFX "ioremap of 0x%llX failed!\n",
- (unsigned long long)md->phys_addr);
- continue;
- }
-
- if (!(md->attribute & EFI_MEMORY_WB)) {
- addr = md->virt_addr;
- npages = md->num_pages;
- memrange_efi_to_native(&addr, &npages);
- set_memory_uc(addr, npages);
- }
-
- systab = (u64) (unsigned long) efi_phys.systab;
- if (md->phys_addr <= systab && systab < end) {
- systab += md->virt_addr - md->phys_addr;
- efi.systab = (efi_system_table_t *) (unsigned long) systab;
- }
- }
-
- BUG_ON(!efi.systab);
-
- status = phys_efi_set_virtual_address_map(
- memmap.desc_size * memmap.nr_map,
- memmap.desc_size,
- memmap.desc_version,
- memmap.phys_map);
-
- if (status != EFI_SUCCESS) {
- printk(KERN_ALERT "Unable to switch EFI into virtual mode "
- "(status=%lx)!\n", status);
- panic("EFI call to SetVirtualAddressMap() failed!");
- }
-
- /*
- * Now that EFI is in virtual mode, update the function
- * pointers in the runtime service table to the new virtual addresses.
- *
- * Call EFI services through wrapper functions.
- */
- efi.get_time = virt_efi_get_time;
- efi.set_time = virt_efi_set_time;
- efi.get_wakeup_time = virt_efi_get_wakeup_time;
- efi.set_wakeup_time = virt_efi_set_wakeup_time;
- efi.get_variable = virt_efi_get_variable;
- efi.get_next_variable = virt_efi_get_next_variable;
- efi.set_variable = virt_efi_set_variable;
- efi.get_next_high_mono_count = virt_efi_get_next_high_mono_count;
- efi.reset_system = virt_efi_reset_system;
- efi.set_virtual_address_map = virt_efi_set_virtual_address_map;
- if (__supported_pte_mask & _PAGE_NX)
- runtime_code_page_mkexec();
- early_iounmap(memmap.map, memmap.nr_map * memmap.desc_size);
- memmap.map = NULL;
-}
-
-/*
- * Convenience functions to obtain memory types and attributes
- */
-u32 efi_mem_type(unsigned long phys_addr)
-{
- efi_memory_desc_t *md;
- void *p;
-
- for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
- md = p;
- if ((md->phys_addr <= phys_addr) &&
- (phys_addr < (md->phys_addr +
- (md->num_pages << EFI_PAGE_SHIFT))))
- return md->type;
- }
- return 0;
-}
-
-u64 efi_mem_attributes(unsigned long phys_addr)
-{
- efi_memory_desc_t *md;
- void *p;
-
- for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
- md = p;
- if ((md->phys_addr <= phys_addr) &&
- (phys_addr < (md->phys_addr +
- (md->num_pages << EFI_PAGE_SHIFT))))
- return md->attribute;
- }
- return 0;
-}
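
Throughout efi.c the memory map is walked by byte stride (p += memmap.desc_size) rather than by array index, because the firmware's descriptor size may be larger than the kernel's efi_memory_desc_t. A self-contained user-space sketch of that idiom, with simplified stand-in types (struct my_desc and walk_map are hypothetical, not kernel API):

	#include <stdint.h>
	#include <stddef.h>
	#include <stdio.h>

	/* Simplified stand-in for efi_memory_desc_t. */
	struct my_desc {
		uint32_t type;
		uint64_t phys_addr;
		uint64_t num_pages;
		uint64_t attribute;
	};

	#define MY_PAGE_SHIFT 12

	/* Walk a packed descriptor buffer by desc_size, as the kernel does;
	 * desc_size may exceed sizeof(struct my_desc). */
	static void walk_map(void *map, void *map_end, size_t desc_size)
	{
		void *p;

		for (p = map; p < map_end; p = (char *)p + desc_size) {
			struct my_desc *md = p;
			uint64_t size = md->num_pages << MY_PAGE_SHIFT;

			printf("type=%u range=[%#llx-%#llx)\n", md->type,
			       (unsigned long long)md->phys_addr,
			       (unsigned long long)(md->phys_addr + size));
		}
	}

	int main(void)
	{
		struct my_desc d = { 7, 0x100000, 256, 0x8 };

		walk_map(&d, (char *)&d + sizeof(d), sizeof(d));
		return 0;
	}
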
diff --git a/arch/x86/kernel/efi_32.c b/arch/x86/kernel/efi_32.c
deleted file mode 100644
index 5cab48ee61a4..000000000000
--- a/arch/x86/kernel/efi_32.c
+++ /dev/null
@@ -1,112 +0,0 @@
-/*
- * Extensible Firmware Interface
- *
- * Based on Extensible Firmware Interface Specification version 1.0
- *
- * Copyright (C) 1999 VA Linux Systems
- * Copyright (C) 1999 Walt Drummond <drummond@valinux.com>
- * Copyright (C) 1999-2002 Hewlett-Packard Co.
- * David Mosberger-Tang <davidm@hpl.hp.com>
- * Stephane Eranian <eranian@hpl.hp.com>
- *
- * Not all EFI Runtime Services are implemented yet, as EFI only
- * supports physical mode addressing on SoftSDV. This is to be fixed
- * in a future version. --drummond 1999-07-20
- *
- * Implemented EFI runtime services and virtual mode calls. --davidm
- *
- * Goutham Rao: <goutham.rao@intel.com>
- * Skip non-WB memory and ignore empty memory ranges.
- */
-
-#include <linux/kernel.h>
-#include <linux/types.h>
-#include <linux/ioport.h>
-#include <linux/efi.h>
-
-#include <asm/io.h>
-#include <asm/page.h>
-#include <asm/pgtable.h>
-#include <asm/tlbflush.h>
-#include <asm/efi.h>
-
-/*
- * To call an EFI runtime service in physical addressing mode we need a
- * prelog/epilog around the invocation to disable interrupts, to claim
- * the EFI runtime service handler exclusively, and to duplicate the
- * kernel mapping into the low virtual address space, say 0 - 3G.
- */
-
-static unsigned long efi_rt_eflags;
-static pgd_t efi_bak_pg_dir_pointer[2];
-
-void efi_call_phys_prelog(void)
-{
- unsigned long cr4;
- unsigned long temp;
- struct desc_ptr gdt_descr;
-
- local_irq_save(efi_rt_eflags);
-
- /*
- * Without PAE we have to duplicate two entries in the page
- * directory; with PAE, duplicating one entry is enough.
- */
- cr4 = read_cr4_safe();
-
- if (cr4 & X86_CR4_PAE) {
- efi_bak_pg_dir_pointer[0].pgd =
- swapper_pg_dir[pgd_index(0)].pgd;
- swapper_pg_dir[0].pgd =
- swapper_pg_dir[pgd_index(PAGE_OFFSET)].pgd;
- } else {
- efi_bak_pg_dir_pointer[0].pgd =
- swapper_pg_dir[pgd_index(0)].pgd;
- efi_bak_pg_dir_pointer[1].pgd =
- swapper_pg_dir[pgd_index(0x400000)].pgd;
- swapper_pg_dir[pgd_index(0)].pgd =
- swapper_pg_dir[pgd_index(PAGE_OFFSET)].pgd;
- temp = PAGE_OFFSET + 0x400000;
- swapper_pg_dir[pgd_index(0x400000)].pgd =
- swapper_pg_dir[pgd_index(temp)].pgd;
- }
-
- /*
- * Flush the TLB so the low-memory 1:1 mapping just installed
- * takes effect.
- */
- __flush_tlb_all();
-
- gdt_descr.address = __pa(get_cpu_gdt_table(0));
- gdt_descr.size = GDT_SIZE - 1;
- load_gdt(&gdt_descr);
-}
-
-void efi_call_phys_epilog(void)
-{
- unsigned long cr4;
- struct desc_ptr gdt_descr;
-
- gdt_descr.address = (unsigned long)get_cpu_gdt_table(0);
- gdt_descr.size = GDT_SIZE - 1;
- load_gdt(&gdt_descr);
-
- cr4 = read_cr4_safe();
-
- if (cr4 & X86_CR4_PAE) {
- swapper_pg_dir[pgd_index(0)].pgd =
- efi_bak_pg_dir_pointer[0].pgd;
- } else {
- swapper_pg_dir[pgd_index(0)].pgd =
- efi_bak_pg_dir_pointer[0].pgd;
- swapper_pg_dir[pgd_index(0x400000)].pgd =
- efi_bak_pg_dir_pointer[1].pgd;
- }
-
- /*
- * The original page table has been restored above; flush the
- * stale low-memory 1:1 mapping out of the TLB.
- */
- __flush_tlb_all();
-
- local_irq_restore(efi_rt_eflags);
-}
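
The prelog/epilog pair above is only ever used as a bracket around a single physical-mode firmware call, as phys_efi_get_time() in efi.c shows. A sketch of what one more such wrapper would look like (hypothetical: it assumes efi_phys.get_wakeup_time had been captured the way get_time is, and that an efi_call_phys3 macro exists alongside the efi_call_phys2/efi_call_phys4 used above):

	static efi_status_t __init phys_efi_get_wakeup_time(efi_bool_t *enabled,
							    efi_bool_t *pending,
							    efi_time_t *tm)
	{
		efi_status_t status;

		efi_call_phys_prelog();		/* install 1:1 mapping, switch GDT */
		status = efi_call_phys3(efi_phys.get_wakeup_time,
					enabled, pending, tm);
		efi_call_phys_epilog();		/* restore page table and GDT */
		return status;
	}
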
diff --git a/arch/x86/kernel/efi_64.c b/arch/x86/kernel/efi_64.c
deleted file mode 100644
index ac0621a7ac3d..000000000000
--- a/arch/x86/kernel/efi_64.c
+++ /dev/null
@@ -1,114 +0,0 @@
-/*
- * x86_64 specific EFI support functions
- * Based on Extensible Firmware Interface Specification version 1.0
- *
- * Copyright (C) 2005-2008 Intel Co.
- * Fenghua Yu <fenghua.yu@intel.com>
- * Bibo Mao <bibo.mao@intel.com>
- * Chandramouli Narayanan <mouli@linux.intel.com>
- * Huang Ying <ying.huang@intel.com>
- *
- * Code to convert the EFI memory map to an E820 map has been implemented
- * in the elilo bootloader, based on an EFI patch by Edgar Hucek. Based on
- * the E820 map, the page table is set up appropriately for EFI runtime code.
- * - mouli 06/14/2007.
- *
- */
-
-#include <linux/kernel.h>
-#include <linux/init.h>
-#include <linux/mm.h>
-#include <linux/types.h>
-#include <linux/spinlock.h>
-#include <linux/bootmem.h>
-#include <linux/ioport.h>
-#include <linux/module.h>
-#include <linux/efi.h>
-#include <linux/uaccess.h>
-#include <linux/io.h>
-#include <linux/reboot.h>
-
-#include <asm/setup.h>
-#include <asm/page.h>
-#include <asm/e820.h>
-#include <asm/pgtable.h>
-#include <asm/tlbflush.h>
-#include <asm/proto.h>
-#include <asm/efi.h>
-#include <asm/cacheflush.h>
-#include <asm/fixmap.h>
-
-static pgd_t save_pgd __initdata;
-static unsigned long efi_flags __initdata;
-
-static void __init early_mapping_set_exec(unsigned long start,
- unsigned long end,
- int executable)
-{
- unsigned long num_pages;
-
- start &= PMD_MASK;
- end = (end + PMD_SIZE - 1) & PMD_MASK;
- num_pages = (end - start) >> PAGE_SHIFT;
- if (executable)
- set_memory_x((unsigned long)__va(start), num_pages);
- else
- set_memory_nx((unsigned long)__va(start), num_pages);
-}
-
-static void __init early_runtime_code_mapping_set_exec(int executable)
-{
- efi_memory_desc_t *md;
- void *p;
-
- if (!(__supported_pte_mask & _PAGE_NX))
- return;
-
- /* Make EFI runtime service code area executable */
- for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
- md = p;
- if (md->type == EFI_RUNTIME_SERVICES_CODE) {
- unsigned long end;
- end = md->phys_addr + (md->num_pages << EFI_PAGE_SHIFT);
- early_mapping_set_exec(md->phys_addr, end, executable);
- }
- }
-}
-
-void __init efi_call_phys_prelog(void)
-{
- unsigned long vaddress;
-
- early_runtime_code_mapping_set_exec(1);
- local_irq_save(efi_flags);
- vaddress = (unsigned long)__va(0x0UL);
- save_pgd = *pgd_offset_k(0x0UL);
- set_pgd(pgd_offset_k(0x0UL), *pgd_offset_k(vaddress));
- __flush_tlb_all();
-}
-
-void __init efi_call_phys_epilog(void)
-{
- /*
- * Restore the original page table and flush the stale 1:1
- * mapping from the TLB.
- */
- set_pgd(pgd_offset_k(0x0UL), save_pgd);
- __flush_tlb_all();
- local_irq_restore(efi_flags);
- early_runtime_code_mapping_set_exec(0);
-}
-
-void __iomem *__init efi_ioremap(unsigned long phys_addr, unsigned long size,
- u32 type)
-{
- unsigned long last_map_pfn;
-
- if (type == EFI_MEMORY_MAPPED_IO)
- return ioremap(phys_addr, size);
-
- last_map_pfn = init_memory_mapping(phys_addr, phys_addr + size);
- if ((last_map_pfn << PAGE_SHIFT) < phys_addr + size)
- return NULL;
-
- return (void __iomem *)__va(phys_addr);
-}
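
early_mapping_set_exec() above widens the range to whole PMD entries before changing NX, since the early kernel mapping uses large pages and a narrower range could not be expressed. The rounding is plain align-down/align-up arithmetic; a stand-alone sketch with literal constants (the 2 MB PMD size is the usual x86_64 value, assumed here):

	#include <stdint.h>
	#include <assert.h>

	#define SK_PMD_SIZE	(2ULL << 20)		/* 2 MB large page */
	#define SK_PMD_MASK	(~(SK_PMD_SIZE - 1))
	#define SK_PAGE_SHIFT	12

	/* Round [start, end) out to whole PMDs; return the 4 KB page count. */
	static uint64_t pmd_rounded_pages(uint64_t start, uint64_t end)
	{
		start &= SK_PMD_MASK;
		end = (end + SK_PMD_SIZE - 1) & SK_PMD_MASK;
		return (end - start) >> SK_PAGE_SHIFT;
	}

	int main(void)
	{
		/* Even a single 4 KB page costs a whole 2 MB PMD: */
		assert(pmd_rounded_pages(0x200000, 0x201000) == 512);
		return 0;
	}
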
diff --git a/arch/x86/kernel/efi_stub_32.S b/arch/x86/kernel/efi_stub_32.S
deleted file mode 100644
index fbe66e626c09..000000000000
--- a/arch/x86/kernel/efi_stub_32.S
+++ /dev/null
@@ -1,123 +0,0 @@
-/*
- * EFI call stub for IA32.
- *
- * This stub allows us to make EFI calls in physical mode with interrupts
- * turned off.
- */
-
-#include <linux/linkage.h>
-#include <asm/page_types.h>
-
-/*
- * efi_call_phys(void *, ...) is a function with variable parameters.
- * All of its callers ensure that every parameter is 4 bytes wide.
- */
-
-/*
- * In the gcc calling convention, EBX, ESP, EBP, ESI and EDI are all callee
- * saved, so we save all of them at the beginning of this function and
- * restore them at the end no matter how many we use, because we cannot
- * assume that EFI runtime service functions comply with the gcc calling
- * convention.
- */
-
-.text
-ENTRY(efi_call_phys)
- /*
- * 0. This function can only be called from the Linux kernel, so CS is
- * 0x0010 and DS and SS are 0x0018. Under EFI the values of these
- * registers, and the corresponding GDT entries, turn out to be the
- * same, so nothing needs to be done about the segment registers or the
- * GDT entries; only the GDT base register is changed, in the prelog
- * and epilog.
- */
-
- /*
- * 1. We are currently running with EIP = <physical address> +
- * PAGE_OFFSET. To make the switch from virtual mode to flat mode
- * smooth, the mapping of lower virtual memory has been set up in the
- * prelog (and is torn down again in the epilog).
- */
- movl $1f, %edx
- subl $__PAGE_OFFSET, %edx
- jmp *%edx
-1:
-
- /*
- * 2. Now on the top of stack is the return
- * address in the caller of efi_call_phys(), then parameter 1,
- * parameter 2, ..., param n. To make things easy, we save the return
- * address of efi_call_phys in a global variable.
- */
- popl %edx
- movl %edx, saved_return_addr
- /* get the function pointer into ECX*/
- popl %ecx
- movl %ecx, efi_rt_function_ptr
- movl $2f, %edx
- subl $__PAGE_OFFSET, %edx
- pushl %edx
-
- /*
- * 3. Clear PG bit in %CR0.
- */
- movl %cr0, %edx
- andl $0x7fffffff, %edx
- movl %edx, %cr0
- jmp 1f
-1:
-
- /*
- * 4. Adjust stack pointer.
- */
- subl $__PAGE_OFFSET, %esp
-
- /*
- * 5. Call the physical function.
- */
- jmp *%ecx
-
-2:
- /*
- * 6. After EFI runtime service returns, control will return to
- * following instruction. We'd better readjust stack pointer first.
- */
- addl $__PAGE_OFFSET, %esp
-
- /*
- * 7. Restore PG bit
- */
- movl %cr0, %edx
- orl $0x80000000, %edx
- movl %edx, %cr0
- jmp 1f
-1:
- /*
- * 8. Now restore virtual mode from flat mode by adding
- * PAGE_OFFSET to EIP.
- */
- movl $1f, %edx
- jmp *%edx
-1:
-
- /*
- * 9. Balance the stack. Because EAX contains the return value,
- * we must not clobber it.
- */
- leal efi_rt_function_ptr, %edx
- movl (%edx), %ecx
- pushl %ecx
-
- /*
- * 10. Push the saved return address onto the stack and return.
- */
- leal saved_return_addr, %edx
- movl (%edx), %ecx
- pushl %ecx
- ret
-ENDPROC(efi_call_phys)
-.previous
-
-.data
-saved_return_addr:
- .long 0
-efi_rt_function_ptr:
- .long 0
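
The recurring movl $label, %edx / subl $__PAGE_OFFSET, %edx pairs in the stub above are the lowmem virtual-to-physical conversion done by hand, because paging is about to be (or has just been) turned off. In C the same arithmetic is the familiar __pa()/__va() pair; a trivial sketch assuming the common 3G/1G split (PAGE_OFFSET = 0xC0000000 is an assumption here):

	#include <stdint.h>

	#define SK_PAGE_OFFSET	0xC0000000UL	/* assumed 3G/1G split */

	/* What the stub's subl $__PAGE_OFFSET does to a code address. */
	static inline uintptr_t sk_pa(uintptr_t virt)
	{
		return virt - SK_PAGE_OFFSET;
	}

	/* The inverse, as in step 6's addl that readjusts %esp. */
	static inline uintptr_t sk_va(uintptr_t phys)
	{
		return phys + SK_PAGE_OFFSET;
	}
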
diff --git a/arch/x86/kernel/efi_stub_64.S b/arch/x86/kernel/efi_stub_64.S
deleted file mode 100644
index 4c07ccab8146..000000000000
--- a/arch/x86/kernel/efi_stub_64.S
+++ /dev/null
@@ -1,116 +0,0 @@
-/*
- * Function calling ABI conversion from Linux to EFI for x86_64
- *
- * Copyright (C) 2007 Intel Corp
- * Bibo Mao <bibo.mao@intel.com>
- * Huang Ying <ying.huang@intel.com>
- */
-
-#include <linux/linkage.h>
-
-#define SAVE_XMM \
- mov %rsp, %rax; \
- subq $0x70, %rsp; \
- and $~0xf, %rsp; \
- mov %rax, (%rsp); \
- mov %cr0, %rax; \
- clts; \
- mov %rax, 0x8(%rsp); \
- movaps %xmm0, 0x60(%rsp); \
- movaps %xmm1, 0x50(%rsp); \
- movaps %xmm2, 0x40(%rsp); \
- movaps %xmm3, 0x30(%rsp); \
- movaps %xmm4, 0x20(%rsp); \
- movaps %xmm5, 0x10(%rsp)
-
-#define RESTORE_XMM \
- movaps 0x60(%rsp), %xmm0; \
- movaps 0x50(%rsp), %xmm1; \
- movaps 0x40(%rsp), %xmm2; \
- movaps 0x30(%rsp), %xmm3; \
- movaps 0x20(%rsp), %xmm4; \
- movaps 0x10(%rsp), %xmm5; \
- mov 0x8(%rsp), %rsi; \
- mov %rsi, %cr0; \
- mov (%rsp), %rsp
-
-ENTRY(efi_call0)
- SAVE_XMM
- subq $32, %rsp
- call *%rdi
- addq $32, %rsp
- RESTORE_XMM
- ret
-ENDPROC(efi_call0)
-
-ENTRY(efi_call1)
- SAVE_XMM
- subq $32, %rsp
- mov %rsi, %rcx
- call *%rdi
- addq $32, %rsp
- RESTORE_XMM
- ret
-ENDPROC(efi_call1)
-
-ENTRY(efi_call2)
- SAVE_XMM
- subq $32, %rsp
- mov %rsi, %rcx
- call *%rdi
- addq $32, %rsp
- RESTORE_XMM
- ret
-ENDPROC(efi_call2)
-
-ENTRY(efi_call3)
- SAVE_XMM
- subq $32, %rsp
- mov %rcx, %r8
- mov %rsi, %rcx
- call *%rdi
- addq $32, %rsp
- RESTORE_XMM
- ret
-ENDPROC(efi_call3)
-
-ENTRY(efi_call4)
- SAVE_XMM
- subq $32, %rsp
- mov %r8, %r9
- mov %rcx, %r8
- mov %rsi, %rcx
- call *%rdi
- addq $32, %rsp
- RESTORE_XMM
- ret
-ENDPROC(efi_call4)
-
-ENTRY(efi_call5)
- SAVE_XMM
- subq $48, %rsp
- mov %r9, 32(%rsp)
- mov %r8, %r9
- mov %rcx, %r8
- mov %rsi, %rcx
- call *%rdi
- addq $48, %rsp
- RESTORE_XMM
- ret
-ENDPROC(efi_call5)
-
-ENTRY(efi_call6)
- SAVE_XMM
- mov (%rsp), %rax
- mov 8(%rax), %rax
- subq $48, %rsp
- mov %r9, 32(%rsp)
- mov %rax, 40(%rsp)
- mov %r8, %r9
- mov %rcx, %r8
- mov %rsi, %rcx
- call *%rdi
- addq $48, %rsp
- RESTORE_XMM
- ret
-ENDPROC(efi_call6)
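
Each efi_callN above shuffles the System V argument registers (%rdi, %rsi, %rdx, %rcx, %r8, %r9) into the Microsoft x64 convention (%rcx, %rdx, %r8, %r9, then stack) and reserves the 32-byte shadow space the callee is entitled to. Later toolchains can generate this conversion from C; a minimal sketch using GCC's ms_abi function attribute (the type and function names here are invented, and this replaces the hand-written stubs only conceptually):

	/* Declare the firmware entry point with the Microsoft x64 calling
	 * convention; GCC then emits the register shuffle and the 32-byte
	 * shadow space that efi_call3 performs by hand above. */
	typedef unsigned long (__attribute__((ms_abi)) *sk_efi_fn3_t)(void *a,
								      void *b,
								      void *c);

	static unsigned long sk_efi_call3(void *fn, void *a, void *b, void *c)
	{
		return ((sk_efi_fn3_t)fn)(a, b, c);
	}
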
diff --git a/arch/x86/kernel/eisa.c b/arch/x86/kernel/eisa.c
new file mode 100644
index 000000000000..9535a6507db7
--- /dev/null
+++ b/arch/x86/kernel/eisa.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * EISA specific code
+ */
+#include <linux/cc_platform.h>
+#include <linux/ioport.h>
+#include <linux/eisa.h>
+#include <linux/io.h>
+
+#include <xen/xen.h>
+
+static __init int eisa_bus_probe(void)
+{
+ u32 *p;
+
+ if ((xen_pv_domain() && !xen_initial_domain()) || cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
+ return 0;
+
+ p = memremap(0x0FFFD9, 4, MEMREMAP_WB);
+ if (p && *p == 'E' + ('I' << 8) + ('S' << 16) + ('A' << 24))
+ EISA_bus = 1;
+ memunmap(p);
+ return 0;
+}
+subsys_initcall(eisa_bus_probe);
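
The probe above reads four bytes at physical 0x0FFFD9 and compares them with 'EISA' packed little-endian into a u32. A host-side sketch of just the signature arithmetic (no memremap(); a local buffer stands in for the BIOS area, and little-endian byte order is assumed):

	#include <stdint.h>
	#include <string.h>
	#include <assert.h>

	int main(void)
	{
		const char sig[4] = { 'E', 'I', 'S', 'A' };
		uint32_t v;

		memcpy(&v, sig, sizeof(v));
		/* On a little-endian machine this equals 0x41534945,
		 * the constant the kernel builds with shifts. */
		assert(v == ('E' + ('I' << 8) + ('S' << 16) + ('A' << 24)));
		return 0;
	}
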
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
deleted file mode 100644
index cd49141cf153..000000000000
--- a/arch/x86/kernel/entry_32.S
+++ /dev/null
@@ -1,1500 +0,0 @@
-/*
- *
- * Copyright (C) 1991, 1992 Linus Torvalds
- */
-
-/*
- * entry.S contains the system-call and fault low-level handling routines.
- * This also contains the timer-interrupt handler, as well as all interrupts
- * and faults that can result in a task-switch.
- *
- * NOTE: This code handles signal-recognition, which happens every time
- * after a timer-interrupt and after each system call.
- *
- * I changed all the .align's to 4 (16 byte alignment), as that's faster
- * on a 486.
- *
- * Stack layout in 'syscall_exit':
- * ptrace needs to have all regs on the stack.
- * if the order here is changed, it needs to be
- * updated in fork.c:copy_process, signal.c:do_signal,
- * ptrace.c and ptrace.h
- *
- * 0(%esp) - %ebx
- * 4(%esp) - %ecx
- * 8(%esp) - %edx
- * C(%esp) - %esi
- * 10(%esp) - %edi
- * 14(%esp) - %ebp
- * 18(%esp) - %eax
- * 1C(%esp) - %ds
- * 20(%esp) - %es
- * 24(%esp) - %fs
- * 28(%esp) - %gs saved iff !CONFIG_X86_32_LAZY_GS
- * 2C(%esp) - orig_eax
- * 30(%esp) - %eip
- * 34(%esp) - %cs
- * 38(%esp) - %eflags
- * 3C(%esp) - %oldesp
- * 40(%esp) - %oldss
- *
- * "current" is in register %ebx during any slow entries.
- */
-
-#include <linux/linkage.h>
-#include <asm/thread_info.h>
-#include <asm/irqflags.h>
-#include <asm/errno.h>
-#include <asm/segment.h>
-#include <asm/smp.h>
-#include <asm/page_types.h>
-#include <asm/percpu.h>
-#include <asm/dwarf2.h>
-#include <asm/processor-flags.h>
-#include <asm/ftrace.h>
-#include <asm/irq_vectors.h>
-#include <asm/cpufeature.h>
-
-/* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this. */
-#include <linux/elf-em.h>
-#define AUDIT_ARCH_I386 (EM_386|__AUDIT_ARCH_LE)
-#define __AUDIT_ARCH_LE 0x40000000
-
-#ifndef CONFIG_AUDITSYSCALL
-#define sysenter_audit syscall_trace_entry
-#define sysexit_audit syscall_exit_work
-#endif
-
-/*
- * We use macros for low-level operations which need to be overridden
- * for paravirtualization. The following will never clobber any registers:
- * INTERRUPT_RETURN (aka. "iret")
- * GET_CR0_INTO_EAX (aka. "movl %cr0, %eax")
- * ENABLE_INTERRUPTS_SYSEXIT (aka "sti; sysexit").
- *
- * For DISABLE_INTERRUPTS/ENABLE_INTERRUPTS (aka "cli"/"sti"), you must
- * specify what registers can be overwritten (CLBR_NONE, CLBR_EAX/EDX/ECX/ANY).
- * Allowing a register to be clobbered can shrink the paravirt replacement
- * enough to patch inline, increasing performance.
- */
-
-#define nr_syscalls ((syscall_table_size)/4)
-
-#ifdef CONFIG_PREEMPT
-#define preempt_stop(clobbers) DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
-#else
-#define preempt_stop(clobbers)
-#define resume_kernel restore_all
-#endif
-
-.macro TRACE_IRQS_IRET
-#ifdef CONFIG_TRACE_IRQFLAGS
- testl $X86_EFLAGS_IF,PT_EFLAGS(%esp) # interrupts off?
- jz 1f
- TRACE_IRQS_ON
-1:
-#endif
-.endm
-
-#ifdef CONFIG_VM86
-#define resume_userspace_sig check_userspace
-#else
-#define resume_userspace_sig resume_userspace
-#endif
-
-/*
- * User gs save/restore
- *
- * %gs is used for userland TLS, and the kernel only uses it for the
- * stack canary, which gcc requires to be at %gs:20. Read the comment
- * at the top of stackprotector.h for more info.
- *
- * Local labels 98 and 99 are used.
- */
-#ifdef CONFIG_X86_32_LAZY_GS
-
- /* unfortunately push/pop can't be a no-op */
-.macro PUSH_GS
- pushl $0
- CFI_ADJUST_CFA_OFFSET 4
-.endm
-.macro POP_GS pop=0
- addl $(4 + \pop), %esp
- CFI_ADJUST_CFA_OFFSET -(4 + \pop)
-.endm
-.macro POP_GS_EX
-.endm
-
- /* all the rest are no-op */
-.macro PTGS_TO_GS
-.endm
-.macro PTGS_TO_GS_EX
-.endm
-.macro GS_TO_REG reg
-.endm
-.macro REG_TO_PTGS reg
-.endm
-.macro SET_KERNEL_GS reg
-.endm
-
-#else /* CONFIG_X86_32_LAZY_GS */
-
-.macro PUSH_GS
- pushl %gs
- CFI_ADJUST_CFA_OFFSET 4
- /*CFI_REL_OFFSET gs, 0*/
-.endm
-
-.macro POP_GS pop=0
-98: popl %gs
- CFI_ADJUST_CFA_OFFSET -4
- /*CFI_RESTORE gs*/
- .if \pop <> 0
- add $\pop, %esp
- CFI_ADJUST_CFA_OFFSET -\pop
- .endif
-.endm
-.macro POP_GS_EX
-.pushsection .fixup, "ax"
-99: movl $0, (%esp)
- jmp 98b
-.section __ex_table, "a"
- .align 4
- .long 98b, 99b
-.popsection
-.endm
-
-.macro PTGS_TO_GS
-98: mov PT_GS(%esp), %gs
-.endm
-.macro PTGS_TO_GS_EX
-.pushsection .fixup, "ax"
-99: movl $0, PT_GS(%esp)
- jmp 98b
-.section __ex_table, "a"
- .align 4
- .long 98b, 99b
-.popsection
-.endm
-
-.macro GS_TO_REG reg
- movl %gs, \reg
- /*CFI_REGISTER gs, \reg*/
-.endm
-.macro REG_TO_PTGS reg
- movl \reg, PT_GS(%esp)
- /*CFI_REL_OFFSET gs, PT_GS*/
-.endm
-.macro SET_KERNEL_GS reg
- movl $(__KERNEL_STACK_CANARY), \reg
- movl \reg, %gs
-.endm
-
-#endif /* CONFIG_X86_32_LAZY_GS */
-
-.macro SAVE_ALL
- cld
- PUSH_GS
- pushl %fs
- CFI_ADJUST_CFA_OFFSET 4
- /*CFI_REL_OFFSET fs, 0;*/
- pushl %es
- CFI_ADJUST_CFA_OFFSET 4
- /*CFI_REL_OFFSET es, 0;*/
- pushl %ds
- CFI_ADJUST_CFA_OFFSET 4
- /*CFI_REL_OFFSET ds, 0;*/
- pushl %eax
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET eax, 0
- pushl %ebp
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET ebp, 0
- pushl %edi
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET edi, 0
- pushl %esi
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET esi, 0
- pushl %edx
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET edx, 0
- pushl %ecx
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET ecx, 0
- pushl %ebx
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET ebx, 0
- movl $(__USER_DS), %edx
- movl %edx, %ds
- movl %edx, %es
- movl $(__KERNEL_PERCPU), %edx
- movl %edx, %fs
- SET_KERNEL_GS %edx
-.endm
-
-.macro RESTORE_INT_REGS
- popl %ebx
- CFI_ADJUST_CFA_OFFSET -4
- CFI_RESTORE ebx
- popl %ecx
- CFI_ADJUST_CFA_OFFSET -4
- CFI_RESTORE ecx
- popl %edx
- CFI_ADJUST_CFA_OFFSET -4
- CFI_RESTORE edx
- popl %esi
- CFI_ADJUST_CFA_OFFSET -4
- CFI_RESTORE esi
- popl %edi
- CFI_ADJUST_CFA_OFFSET -4
- CFI_RESTORE edi
- popl %ebp
- CFI_ADJUST_CFA_OFFSET -4
- CFI_RESTORE ebp
- popl %eax
- CFI_ADJUST_CFA_OFFSET -4
- CFI_RESTORE eax
-.endm
-
-.macro RESTORE_REGS pop=0
- RESTORE_INT_REGS
-1: popl %ds
- CFI_ADJUST_CFA_OFFSET -4
- /*CFI_RESTORE ds;*/
-2: popl %es
- CFI_ADJUST_CFA_OFFSET -4
- /*CFI_RESTORE es;*/
-3: popl %fs
- CFI_ADJUST_CFA_OFFSET -4
- /*CFI_RESTORE fs;*/
- POP_GS \pop
-.pushsection .fixup, "ax"
-4: movl $0, (%esp)
- jmp 1b
-5: movl $0, (%esp)
- jmp 2b
-6: movl $0, (%esp)
- jmp 3b
-.section __ex_table, "a"
- .align 4
- .long 1b, 4b
- .long 2b, 5b
- .long 3b, 6b
-.popsection
- POP_GS_EX
-.endm
-
-.macro RING0_INT_FRAME
- CFI_STARTPROC simple
- CFI_SIGNAL_FRAME
- CFI_DEF_CFA esp, 3*4
- /*CFI_OFFSET cs, -2*4;*/
- CFI_OFFSET eip, -3*4
-.endm
-
-.macro RING0_EC_FRAME
- CFI_STARTPROC simple
- CFI_SIGNAL_FRAME
- CFI_DEF_CFA esp, 4*4
- /*CFI_OFFSET cs, -2*4;*/
- CFI_OFFSET eip, -3*4
-.endm
-
-.macro RING0_PTREGS_FRAME
- CFI_STARTPROC simple
- CFI_SIGNAL_FRAME
- CFI_DEF_CFA esp, PT_OLDESP-PT_EBX
- /*CFI_OFFSET cs, PT_CS-PT_OLDESP;*/
- CFI_OFFSET eip, PT_EIP-PT_OLDESP
- /*CFI_OFFSET es, PT_ES-PT_OLDESP;*/
- /*CFI_OFFSET ds, PT_DS-PT_OLDESP;*/
- CFI_OFFSET eax, PT_EAX-PT_OLDESP
- CFI_OFFSET ebp, PT_EBP-PT_OLDESP
- CFI_OFFSET edi, PT_EDI-PT_OLDESP
- CFI_OFFSET esi, PT_ESI-PT_OLDESP
- CFI_OFFSET edx, PT_EDX-PT_OLDESP
- CFI_OFFSET ecx, PT_ECX-PT_OLDESP
- CFI_OFFSET ebx, PT_EBX-PT_OLDESP
-.endm
-
-ENTRY(ret_from_fork)
- CFI_STARTPROC
- pushl %eax
- CFI_ADJUST_CFA_OFFSET 4
- call schedule_tail
- GET_THREAD_INFO(%ebp)
- popl %eax
- CFI_ADJUST_CFA_OFFSET -4
- pushl $0x0202 # Reset kernel eflags
- CFI_ADJUST_CFA_OFFSET 4
- popfl
- CFI_ADJUST_CFA_OFFSET -4
- jmp syscall_exit
- CFI_ENDPROC
-END(ret_from_fork)
-
-/*
- * Interrupt exit functions should be protected against kprobes
- */
- .pushsection .kprobes.text, "ax"
-/*
- * Return to user mode is not as complex as all this looks,
- * but we want the default path for a system call return to
- * go as quickly as possible which is why some of this is
- * less clear than it otherwise should be.
- */
-
- # userspace resumption stub bypassing syscall exit tracing
- ALIGN
- RING0_PTREGS_FRAME
-ret_from_exception:
- preempt_stop(CLBR_ANY)
-ret_from_intr:
- GET_THREAD_INFO(%ebp)
-check_userspace:
- movl PT_EFLAGS(%esp), %eax # mix EFLAGS and CS
- movb PT_CS(%esp), %al
- andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
- cmpl $USER_RPL, %eax
- jb resume_kernel # not returning to v8086 or userspace
-
-ENTRY(resume_userspace)
- LOCKDEP_SYS_EXIT
- DISABLE_INTERRUPTS(CLBR_ANY) # make sure we don't miss an interrupt
- # setting need_resched or sigpending
- # between sampling and the iret
- TRACE_IRQS_OFF
- movl TI_flags(%ebp), %ecx
- andl $_TIF_WORK_MASK, %ecx # is there any work to be done on
- # int/exception return?
- jne work_pending
- jmp restore_all
-END(ret_from_exception)
-
-#ifdef CONFIG_PREEMPT
-ENTRY(resume_kernel)
- DISABLE_INTERRUPTS(CLBR_ANY)
- cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ?
- jnz restore_all
-need_resched:
- movl TI_flags(%ebp), %ecx # need_resched set ?
- testb $_TIF_NEED_RESCHED, %cl
- jz restore_all
- testl $X86_EFLAGS_IF,PT_EFLAGS(%esp) # interrupts off (exception path) ?
- jz restore_all
- call preempt_schedule_irq
- jmp need_resched
-END(resume_kernel)
-#endif
- CFI_ENDPROC
-/*
- * End of kprobes section
- */
- .popsection
-
-/* SYSENTER_RETURN points to after the "sysenter" instruction in
- the vsyscall page. See vsyscall-sysentry.S, which defines the symbol. */
-
- # sysenter call handler stub
-ENTRY(ia32_sysenter_target)
- CFI_STARTPROC simple
- CFI_SIGNAL_FRAME
- CFI_DEF_CFA esp, 0
- CFI_REGISTER esp, ebp
- movl TSS_sysenter_sp0(%esp),%esp
-sysenter_past_esp:
- /*
- * Interrupts are disabled here, but we can't trace it until
- * enough kernel state to call TRACE_IRQS_OFF can be called - but
- * we immediately enable interrupts at that point anyway.
- */
- pushl $(__USER_DS)
- CFI_ADJUST_CFA_OFFSET 4
- /*CFI_REL_OFFSET ss, 0*/
- pushl %ebp
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET esp, 0
- pushfl
- orl $X86_EFLAGS_IF, (%esp)
- CFI_ADJUST_CFA_OFFSET 4
- pushl $(__USER_CS)
- CFI_ADJUST_CFA_OFFSET 4
- /*CFI_REL_OFFSET cs, 0*/
- /*
- * Push current_thread_info()->sysenter_return to the stack.
- * A tiny bit of offset fixup is necessary - 4*4 means the 4 words
- * pushed above; +8 corresponds to copy_thread's esp0 setting.
- */
- pushl (TI_sysenter_return-THREAD_SIZE+8+4*4)(%esp)
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET eip, 0
-
- pushl %eax
- CFI_ADJUST_CFA_OFFSET 4
- SAVE_ALL
- ENABLE_INTERRUPTS(CLBR_NONE)
-
-/*
- * Load the potential sixth argument from user stack.
- * Careful about security.
- */
- cmpl $__PAGE_OFFSET-3,%ebp
- jae syscall_fault
-1: movl (%ebp),%ebp
- movl %ebp,PT_EBP(%esp)
-.section __ex_table,"a"
- .align 4
- .long 1b,syscall_fault
-.previous
-
- GET_THREAD_INFO(%ebp)
-
- testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
- jnz sysenter_audit
-sysenter_do_call:
- cmpl $(nr_syscalls), %eax
- jae syscall_badsys
- call *sys_call_table(,%eax,4)
- movl %eax,PT_EAX(%esp)
- LOCKDEP_SYS_EXIT
- DISABLE_INTERRUPTS(CLBR_ANY)
- TRACE_IRQS_OFF
- movl TI_flags(%ebp), %ecx
- testl $_TIF_ALLWORK_MASK, %ecx
- jne sysexit_audit
-sysenter_exit:
-/* if something modifies registers it must also disable sysexit */
- movl PT_EIP(%esp), %edx
- movl PT_OLDESP(%esp), %ecx
- xorl %ebp,%ebp
- TRACE_IRQS_ON
-1: mov PT_FS(%esp), %fs
- PTGS_TO_GS
- ENABLE_INTERRUPTS_SYSEXIT
-
-#ifdef CONFIG_AUDITSYSCALL
-sysenter_audit:
- testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
- jnz syscall_trace_entry
- addl $4,%esp
- CFI_ADJUST_CFA_OFFSET -4
- /* %esi already in 8(%esp) 6th arg: 4th syscall arg */
- /* %edx already in 4(%esp) 5th arg: 3rd syscall arg */
- /* %ecx already in 0(%esp) 4th arg: 2nd syscall arg */
- movl %ebx,%ecx /* 3rd arg: 1st syscall arg */
- movl %eax,%edx /* 2nd arg: syscall number */
- movl $AUDIT_ARCH_I386,%eax /* 1st arg: audit arch */
- call audit_syscall_entry
- pushl %ebx
- CFI_ADJUST_CFA_OFFSET 4
- movl PT_EAX(%esp),%eax /* reload syscall number */
- jmp sysenter_do_call
-
-sysexit_audit:
- testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), %ecx
- jne syscall_exit_work
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_ANY)
- movl %eax,%edx /* second arg, syscall return value */
- cmpl $0,%eax /* is it < 0? */
- setl %al /* 1 if so, 0 if not */
- movzbl %al,%eax /* zero-extend that */
- inc %eax /* first arg, 0->1(AUDITSC_SUCCESS), 1->2(AUDITSC_FAILURE) */
- call audit_syscall_exit
- DISABLE_INTERRUPTS(CLBR_ANY)
- TRACE_IRQS_OFF
- movl TI_flags(%ebp), %ecx
- testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT), %ecx
- jne syscall_exit_work
- movl PT_EAX(%esp),%eax /* reload syscall return value */
- jmp sysenter_exit
-#endif
-
- CFI_ENDPROC
-.pushsection .fixup,"ax"
-2: movl $0,PT_FS(%esp)
- jmp 1b
-.section __ex_table,"a"
- .align 4
- .long 1b,2b
-.popsection
- PTGS_TO_GS_EX
-ENDPROC(ia32_sysenter_target)
-
-/*
- * syscall stub including irq exit should be protected against kprobes
- */
- .pushsection .kprobes.text, "ax"
- # system call handler stub
-ENTRY(system_call)
- RING0_INT_FRAME # can't unwind into user space anyway
- pushl %eax # save orig_eax
- CFI_ADJUST_CFA_OFFSET 4
- SAVE_ALL
- GET_THREAD_INFO(%ebp)
- # system call tracing in operation / emulation
- testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
- jnz syscall_trace_entry
- cmpl $(nr_syscalls), %eax
- jae syscall_badsys
-syscall_call:
- call *sys_call_table(,%eax,4)
- movl %eax,PT_EAX(%esp) # store the return value
-syscall_exit:
- LOCKDEP_SYS_EXIT
- DISABLE_INTERRUPTS(CLBR_ANY) # make sure we don't miss an interrupt
- # setting need_resched or sigpending
- # between sampling and the iret
- TRACE_IRQS_OFF
- movl TI_flags(%ebp), %ecx
- testl $_TIF_ALLWORK_MASK, %ecx # current->work
- jne syscall_exit_work
-
-restore_all:
- TRACE_IRQS_IRET
-restore_all_notrace:
- movl PT_EFLAGS(%esp), %eax # mix EFLAGS, SS and CS
- # Warning: PT_OLDSS(%esp) contains the wrong/random values if we
- # are returning to the kernel.
- # See comments in process.c:copy_thread() for details.
- movb PT_OLDSS(%esp), %ah
- movb PT_CS(%esp), %al
- andl $(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
- cmpl $((SEGMENT_LDT << 8) | USER_RPL), %eax
- CFI_REMEMBER_STATE
- je ldt_ss # returning to user-space with LDT SS
-restore_nocheck:
- RESTORE_REGS 4 # skip orig_eax/error_code
- CFI_ADJUST_CFA_OFFSET -4
-irq_return:
- INTERRUPT_RETURN
-.section .fixup,"ax"
-ENTRY(iret_exc)
- pushl $0 # no error code
- pushl $do_iret_error
- jmp error_code
-.previous
-.section __ex_table,"a"
- .align 4
- .long irq_return,iret_exc
-.previous
-
- CFI_RESTORE_STATE
-ldt_ss:
- larl PT_OLDSS(%esp), %eax
- jnz restore_nocheck
- testl $0x00400000, %eax # returning to 32bit stack?
- jnz restore_nocheck # all right, normal return
-
-#ifdef CONFIG_PARAVIRT
- /*
- * The kernel can't run on a non-flat stack if paravirt mode
- * is active. Rather than try to fix up the high bits of
- * ESP, bypass this code entirely. This may break DOSemu
- * and/or Wine support in a paravirt VM, although the option
- * is still available to implement the setting of the high
- * 16-bits in the INTERRUPT_RETURN paravirt-op.
- */
- cmpl $0, pv_info+PARAVIRT_enabled
- jne restore_nocheck
-#endif
-
-/*
- * Setup and switch to ESPFIX stack
- *
- * We're returning to userspace with a 16 bit stack. The CPU will not
- * restore the high word of ESP for us on executing iret... This is an
- * "official" bug of all the x86-compatible CPUs, which we can work
- * around to make dosemu and wine happy. We do this by preloading the
- * high word of ESP with the high word of the userspace ESP while
- * compensating for the offset by changing to the ESPFIX segment with
- * a base address that matches for the difference.
- */
- mov %esp, %edx /* load kernel esp */
- mov PT_OLDESP(%esp), %eax /* load userspace esp */
- mov %dx, %ax /* eax: new kernel esp */
- sub %eax, %edx /* offset (low word is 0) */
- PER_CPU(gdt_page, %ebx)
- shr $16, %edx
- mov %dl, GDT_ENTRY_ESPFIX_SS * 8 + 4(%ebx) /* bits 16..23 */
- mov %dh, GDT_ENTRY_ESPFIX_SS * 8 + 7(%ebx) /* bits 24..31 */
- pushl $__ESPFIX_SS
- CFI_ADJUST_CFA_OFFSET 4
- push %eax /* new kernel esp */
- CFI_ADJUST_CFA_OFFSET 4
- /* Disable interrupts, but do not irqtrace this section: we
- * will soon execute iret and the tracer was already set to
- * the irqstate after the iret */
- DISABLE_INTERRUPTS(CLBR_EAX)
- lss (%esp), %esp /* switch to espfix segment */
- CFI_ADJUST_CFA_OFFSET -8
- jmp restore_nocheck
- CFI_ENDPROC
-ENDPROC(system_call)
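
The word-splicing in the ESPFIX setup above (mov %dx, %ax) is easier to read in C: the new stack pointer keeps the user stack's high word and takes the kernel stack's low word, and the difference, whose low word is zero by construction, becomes the segment base offset programmed into the GDT. A small sketch (names invented):

	#include <stdint.h>
	#include <assert.h>

	/* Mirror the asm: edx = kernel esp, eax = user esp, mov %dx,%ax. */
	static void espfix_split(uint32_t kernel_esp, uint32_t user_esp,
				 uint32_t *new_esp, uint32_t *seg_base)
	{
		*new_esp = (user_esp & 0xffff0000u) | (kernel_esp & 0xffffu);
		*seg_base = kernel_esp - *new_esp;	/* low word is 0 */
		assert((*seg_base & 0xffffu) == 0);
	}

	int main(void)
	{
		uint32_t esp, base;

		espfix_split(0xc1234567u, 0x00004567u, &esp, &base);
		assert(esp == 0x00004567u && base == 0xc1230000u);
		return 0;
	}
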
-
- # perform work that needs to be done immediately before resumption
- ALIGN
- RING0_PTREGS_FRAME # can't unwind into user space anyway
-work_pending:
- testb $_TIF_NEED_RESCHED, %cl
- jz work_notifysig
-work_resched:
- call schedule
- LOCKDEP_SYS_EXIT
- DISABLE_INTERRUPTS(CLBR_ANY) # make sure we don't miss an interrupt
- # setting need_resched or sigpending
- # between sampling and the iret
- TRACE_IRQS_OFF
- movl TI_flags(%ebp), %ecx
- andl $_TIF_WORK_MASK, %ecx # is there any work to be done other
- # than syscall tracing?
- jz restore_all
- testb $_TIF_NEED_RESCHED, %cl
- jnz work_resched
-
-work_notifysig: # deal with pending signals and
- # notify-resume requests
-#ifdef CONFIG_VM86
- testl $X86_EFLAGS_VM, PT_EFLAGS(%esp)
- movl %esp, %eax
- jne work_notifysig_v86 # returning to kernel-space or
- # vm86-space
- xorl %edx, %edx
- call do_notify_resume
- jmp resume_userspace_sig
-
- ALIGN
-work_notifysig_v86:
- pushl %ecx # save ti_flags for do_notify_resume
- CFI_ADJUST_CFA_OFFSET 4
- call save_v86_state # %eax contains pt_regs pointer
- popl %ecx
- CFI_ADJUST_CFA_OFFSET -4
- movl %eax, %esp
-#else
- movl %esp, %eax
-#endif
- xorl %edx, %edx
- call do_notify_resume
- jmp resume_userspace_sig
-END(work_pending)
-
- # perform syscall exit tracing
- ALIGN
-syscall_trace_entry:
- movl $-ENOSYS,PT_EAX(%esp)
- movl %esp, %eax
- call syscall_trace_enter
- /* What it returned is what we'll actually use. */
- cmpl $(nr_syscalls), %eax
- jnae syscall_call
- jmp syscall_exit
-END(syscall_trace_entry)
-
- # perform syscall exit tracing
- ALIGN
-syscall_exit_work:
- testl $_TIF_WORK_SYSCALL_EXIT, %ecx
- jz work_pending
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_ANY) # could let syscall_trace_leave() call
- # schedule() instead
- movl %esp, %eax
- call syscall_trace_leave
- jmp resume_userspace
-END(syscall_exit_work)
- CFI_ENDPROC
-
- RING0_INT_FRAME # can't unwind into user space anyway
-syscall_fault:
- GET_THREAD_INFO(%ebp)
- movl $-EFAULT,PT_EAX(%esp)
- jmp resume_userspace
-END(syscall_fault)
-
-syscall_badsys:
- movl $-ENOSYS,PT_EAX(%esp)
- jmp resume_userspace
-END(syscall_badsys)
- CFI_ENDPROC
-/*
- * End of kprobes section
- */
- .popsection
-
-/*
- * System calls that need a pt_regs pointer.
- */
-#define PTREGSCALL0(name) \
- ALIGN; \
-ptregs_##name: \
- leal 4(%esp),%eax; \
- jmp sys_##name;
-
-#define PTREGSCALL1(name) \
- ALIGN; \
-ptregs_##name: \
- leal 4(%esp),%edx; \
- movl (PT_EBX+4)(%esp),%eax; \
- jmp sys_##name;
-
-#define PTREGSCALL2(name) \
- ALIGN; \
-ptregs_##name: \
- leal 4(%esp),%ecx; \
- movl (PT_ECX+4)(%esp),%edx; \
- movl (PT_EBX+4)(%esp),%eax; \
- jmp sys_##name;
-
-#define PTREGSCALL3(name) \
- ALIGN; \
-ptregs_##name: \
- leal 4(%esp),%eax; \
- pushl %eax; \
- movl PT_EDX(%eax),%ecx; \
- movl PT_ECX(%eax),%edx; \
- movl PT_EBX(%eax),%eax; \
- call sys_##name; \
- addl $4,%esp; \
- ret
-
-PTREGSCALL1(iopl)
-PTREGSCALL0(fork)
-PTREGSCALL0(vfork)
-PTREGSCALL3(execve)
-PTREGSCALL2(sigaltstack)
-PTREGSCALL0(sigreturn)
-PTREGSCALL0(rt_sigreturn)
-PTREGSCALL2(vm86)
-PTREGSCALL1(vm86old)
-
-/* Clone is an oddball. The 4th arg is in %edi */
- ALIGN;
-ptregs_clone:
- leal 4(%esp),%eax
- pushl %eax
- pushl PT_EDI(%eax)
- movl PT_EDX(%eax),%ecx
- movl PT_ECX(%eax),%edx
- movl PT_EBX(%eax),%eax
- call sys_clone
- addl $8,%esp
- ret
-
-.macro FIXUP_ESPFIX_STACK
-/*
- * Switch back from the ESPFIX stack to the normal zero-based stack.
- *
- * We can't call C functions using the ESPFIX stack. This code reads
- * the high word of the segment base from the GDT, switches to the
- * normal stack, and adjusts ESP with the matching offset.
- */
- /* fixup the stack */
- PER_CPU(gdt_page, %ebx)
- mov GDT_ENTRY_ESPFIX_SS * 8 + 4(%ebx), %al /* bits 16..23 */
- mov GDT_ENTRY_ESPFIX_SS * 8 + 7(%ebx), %ah /* bits 24..31 */
- shl $16, %eax
- addl %esp, %eax /* the adjusted stack pointer */
- pushl $__KERNEL_DS
- CFI_ADJUST_CFA_OFFSET 4
- pushl %eax
- CFI_ADJUST_CFA_OFFSET 4
- lss (%esp), %esp /* switch to the normal stack segment */
- CFI_ADJUST_CFA_OFFSET -8
-.endm
-.macro UNWIND_ESPFIX_STACK
- movl %ss, %eax
- /* see if on espfix stack */
- cmpw $__ESPFIX_SS, %ax
- jne 27f
- movl $__KERNEL_DS, %eax
- movl %eax, %ds
- movl %eax, %es
- /* switch to normal stack */
- FIXUP_ESPFIX_STACK
-27:
-.endm
-
-/*
- * Build the entry stubs and pointer table with some assembler magic.
- * We pack 7 stubs into a single 32-byte chunk, which will fit in a
- * single cache line on all modern x86 implementations.
- */
-.section .init.rodata,"a"
-ENTRY(interrupt)
-.text
- .p2align 5
- .p2align CONFIG_X86_L1_CACHE_SHIFT
-ENTRY(irq_entries_start)
- RING0_INT_FRAME
-vector=FIRST_EXTERNAL_VECTOR
-.rept (NR_VECTORS-FIRST_EXTERNAL_VECTOR+6)/7
- .balign 32
- .rept 7
- .if vector < NR_VECTORS
- .if vector <> FIRST_EXTERNAL_VECTOR
- CFI_ADJUST_CFA_OFFSET -4
- .endif
-1: pushl $(~vector+0x80) /* Note: always in signed byte range */
- CFI_ADJUST_CFA_OFFSET 4
- .if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6
- jmp 2f
- .endif
- .previous
- .long 1b
- .text
-vector=vector+1
- .endif
- .endr
-2: jmp common_interrupt
-.endr
-END(irq_entries_start)
-
-.previous
-END(interrupt)
-.previous
-
-/*
- * the CPU automatically disables interrupts when executing an IRQ vector,
- * so IRQ-flags tracing has to follow that:
- */
- .p2align CONFIG_X86_L1_CACHE_SHIFT
-common_interrupt:
- addl $-0x80,(%esp) /* Adjust vector into the [-256,-1] range */
- SAVE_ALL
- TRACE_IRQS_OFF
- movl %esp,%eax
- call do_IRQ
- jmp ret_from_intr
-ENDPROC(common_interrupt)
- CFI_ENDPROC
-
-/*
- * Irq entries should be protected against kprobes
- */
- .pushsection .kprobes.text, "ax"
-#define BUILD_INTERRUPT3(name, nr, fn) \
-ENTRY(name) \
- RING0_INT_FRAME; \
- pushl $~(nr); \
- CFI_ADJUST_CFA_OFFSET 4; \
- SAVE_ALL; \
- TRACE_IRQS_OFF \
- movl %esp,%eax; \
- call fn; \
- jmp ret_from_intr; \
- CFI_ENDPROC; \
-ENDPROC(name)
-
-#define BUILD_INTERRUPT(name, nr) BUILD_INTERRUPT3(name, nr, smp_##name)
-
-/* The include is where all of the SMP etc. interrupts come from */
-#include <asm/entry_arch.h>
-
-ENTRY(coprocessor_error)
- RING0_INT_FRAME
- pushl $0
- CFI_ADJUST_CFA_OFFSET 4
- pushl $do_coprocessor_error
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(coprocessor_error)
-
-ENTRY(simd_coprocessor_error)
- RING0_INT_FRAME
- pushl $0
- CFI_ADJUST_CFA_OFFSET 4
-#ifdef CONFIG_X86_INVD_BUG
- /* AMD 486 bug: invd from userspace calls exception 19 instead of #GP */
-661: pushl $do_general_protection
-662:
-.section .altinstructions,"a"
- .balign 4
- .long 661b
- .long 663f
- .byte X86_FEATURE_XMM
- .byte 662b-661b
- .byte 664f-663f
-.previous
-.section .altinstr_replacement,"ax"
-663: pushl $do_simd_coprocessor_error
-664:
-.previous
-#else
- pushl $do_simd_coprocessor_error
-#endif
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(simd_coprocessor_error)
-
-ENTRY(device_not_available)
- RING0_INT_FRAME
- pushl $-1 # mark this as an int
- CFI_ADJUST_CFA_OFFSET 4
- pushl $do_device_not_available
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(device_not_available)
-
-#ifdef CONFIG_PARAVIRT
-ENTRY(native_iret)
- iret
-.section __ex_table,"a"
- .align 4
- .long native_iret, iret_exc
-.previous
-END(native_iret)
-
-ENTRY(native_irq_enable_sysexit)
- sti
- sysexit
-END(native_irq_enable_sysexit)
-#endif
-
-ENTRY(overflow)
- RING0_INT_FRAME
- pushl $0
- CFI_ADJUST_CFA_OFFSET 4
- pushl $do_overflow
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(overflow)
-
-ENTRY(bounds)
- RING0_INT_FRAME
- pushl $0
- CFI_ADJUST_CFA_OFFSET 4
- pushl $do_bounds
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(bounds)
-
-ENTRY(invalid_op)
- RING0_INT_FRAME
- pushl $0
- CFI_ADJUST_CFA_OFFSET 4
- pushl $do_invalid_op
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(invalid_op)
-
-ENTRY(coprocessor_segment_overrun)
- RING0_INT_FRAME
- pushl $0
- CFI_ADJUST_CFA_OFFSET 4
- pushl $do_coprocessor_segment_overrun
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(coprocessor_segment_overrun)
-
-ENTRY(invalid_TSS)
- RING0_EC_FRAME
- pushl $do_invalid_TSS
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(invalid_TSS)
-
-ENTRY(segment_not_present)
- RING0_EC_FRAME
- pushl $do_segment_not_present
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(segment_not_present)
-
-ENTRY(stack_segment)
- RING0_EC_FRAME
- pushl $do_stack_segment
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(stack_segment)
-
-ENTRY(alignment_check)
- RING0_EC_FRAME
- pushl $do_alignment_check
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(alignment_check)
-
-ENTRY(divide_error)
- RING0_INT_FRAME
- pushl $0 # no error code
- CFI_ADJUST_CFA_OFFSET 4
- pushl $do_divide_error
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(divide_error)
-
-#ifdef CONFIG_X86_MCE
-ENTRY(machine_check)
- RING0_INT_FRAME
- pushl $0
- CFI_ADJUST_CFA_OFFSET 4
- pushl machine_check_vector
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(machine_check)
-#endif
-
-ENTRY(spurious_interrupt_bug)
- RING0_INT_FRAME
- pushl $0
- CFI_ADJUST_CFA_OFFSET 4
- pushl $do_spurious_interrupt_bug
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(spurious_interrupt_bug)
-/*
- * End of kprobes section
- */
- .popsection
-
-ENTRY(kernel_thread_helper)
- pushl $0 # fake return address for unwinder
- CFI_STARTPROC
- movl %edi,%eax
- call *%esi
- call do_exit
- ud2 # padding for call trace
- CFI_ENDPROC
-ENDPROC(kernel_thread_helper)
-
-#ifdef CONFIG_XEN
-/* Xen doesn't set %esp to be precisely what the normal sysenter
- entrypoint expects, so fix it up before using the normal path. */
-ENTRY(xen_sysenter_target)
- RING0_INT_FRAME
- addl $5*4, %esp /* remove xen-provided frame */
- CFI_ADJUST_CFA_OFFSET -5*4
- jmp sysenter_past_esp
- CFI_ENDPROC
-
-ENTRY(xen_hypervisor_callback)
- CFI_STARTPROC
- pushl $0
- CFI_ADJUST_CFA_OFFSET 4
- SAVE_ALL
- TRACE_IRQS_OFF
-
- /* Check to see if we got the event in the critical
- region in xen_iret_direct, after we've reenabled
- events and checked for pending events. This simulates the
- iret instruction's behaviour of delivering a pending
- interrupt when interrupts are enabled. */
- movl PT_EIP(%esp),%eax
- cmpl $xen_iret_start_crit,%eax
- jb 1f
- cmpl $xen_iret_end_crit,%eax
- jae 1f
-
- jmp xen_iret_crit_fixup
-
-ENTRY(xen_do_upcall)
-1: mov %esp, %eax
- call xen_evtchn_do_upcall
- jmp ret_from_intr
- CFI_ENDPROC
-ENDPROC(xen_hypervisor_callback)
-
-# Hypervisor uses this for application faults while it executes.
-# We get here for two reasons:
-# 1. Fault while reloading DS, ES, FS or GS
-# 2. Fault while executing IRET
-# Category 1 we fix up by reattempting the load, and zeroing the segment
-# register if the load fails.
-# Category 2 we fix up by jumping to do_iret_error. We cannot use the
-# normal Linux return path in this case because if we use the IRET hypercall
-# to pop the stack frame we end up in an infinite loop of failsafe callbacks.
-# We distinguish between categories by maintaining a status value in EAX.
-ENTRY(xen_failsafe_callback)
- CFI_STARTPROC
- pushl %eax
- CFI_ADJUST_CFA_OFFSET 4
- movl $1,%eax
-1: mov 4(%esp),%ds
-2: mov 8(%esp),%es
-3: mov 12(%esp),%fs
-4: mov 16(%esp),%gs
- testl %eax,%eax
- popl %eax
- CFI_ADJUST_CFA_OFFSET -4
- lea 16(%esp),%esp
- CFI_ADJUST_CFA_OFFSET -16
- jz 5f
- addl $16,%esp
- jmp iret_exc # EAX != 0 => Category 2 (Bad IRET)
-5: pushl $0 # EAX == 0 => Category 1 (Bad segment)
- CFI_ADJUST_CFA_OFFSET 4
- SAVE_ALL
- jmp ret_from_exception
- CFI_ENDPROC
-
-.section .fixup,"ax"
-6: xorl %eax,%eax
- movl %eax,4(%esp)
- jmp 1b
-7: xorl %eax,%eax
- movl %eax,8(%esp)
- jmp 2b
-8: xorl %eax,%eax
- movl %eax,12(%esp)
- jmp 3b
-9: xorl %eax,%eax
- movl %eax,16(%esp)
- jmp 4b
-.previous
-.section __ex_table,"a"
- .align 4
- .long 1b,6b
- .long 2b,7b
- .long 3b,8b
- .long 4b,9b
-.previous
-ENDPROC(xen_failsafe_callback)
-
-#endif /* CONFIG_XEN */
-
-#ifdef CONFIG_FUNCTION_TRACER
-#ifdef CONFIG_DYNAMIC_FTRACE
-
-ENTRY(mcount)
- ret
-END(mcount)
-
-ENTRY(ftrace_caller)
- cmpl $0, function_trace_stop
- jne ftrace_stub
-
- pushl %eax
- pushl %ecx
- pushl %edx
- movl 0xc(%esp), %eax
- movl 0x4(%ebp), %edx
- subl $MCOUNT_INSN_SIZE, %eax
-
-.globl ftrace_call
-ftrace_call:
- call ftrace_stub
-
- popl %edx
- popl %ecx
- popl %eax
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-.globl ftrace_graph_call
-ftrace_graph_call:
- jmp ftrace_stub
-#endif
-
-.globl ftrace_stub
-ftrace_stub:
- ret
-END(ftrace_caller)
-
-#else /* ! CONFIG_DYNAMIC_FTRACE */
-
-ENTRY(mcount)
- cmpl $0, function_trace_stop
- jne ftrace_stub
-
- cmpl $ftrace_stub, ftrace_trace_function
- jnz trace
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
- cmpl $ftrace_stub, ftrace_graph_return
- jnz ftrace_graph_caller
-
- cmpl $ftrace_graph_entry_stub, ftrace_graph_entry
- jnz ftrace_graph_caller
-#endif
-.globl ftrace_stub
-ftrace_stub:
- ret
-
- /* taken from glibc */
-trace:
- pushl %eax
- pushl %ecx
- pushl %edx
- movl 0xc(%esp), %eax
- movl 0x4(%ebp), %edx
- subl $MCOUNT_INSN_SIZE, %eax
-
- call *ftrace_trace_function
-
- popl %edx
- popl %ecx
- popl %eax
- jmp ftrace_stub
-END(mcount)
-#endif /* CONFIG_DYNAMIC_FTRACE */
-#endif /* CONFIG_FUNCTION_TRACER */
-
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-ENTRY(ftrace_graph_caller)
- cmpl $0, function_trace_stop
- jne ftrace_stub
-
- pushl %eax
- pushl %ecx
- pushl %edx
- movl 0xc(%esp), %edx
- lea 0x4(%ebp), %eax
- movl (%ebp), %ecx
- subl $MCOUNT_INSN_SIZE, %edx
- call prepare_ftrace_return
- popl %edx
- popl %ecx
- popl %eax
- ret
-END(ftrace_graph_caller)
-
-.globl return_to_handler
-return_to_handler:
- pushl %eax
- pushl %edx
- movl %ebp, %eax
- call ftrace_return_to_handler
- movl %eax, %ecx
- popl %edx
- popl %eax
- jmp *%ecx
-#endif
-
-.section .rodata,"a"
-#include "syscall_table_32.S"
-
-syscall_table_size=(.-sys_call_table)
-
-/*
- * Some functions should be protected against kprobes
- */
- .pushsection .kprobes.text, "ax"
-
-ENTRY(page_fault)
- RING0_EC_FRAME
- pushl $do_page_fault
- CFI_ADJUST_CFA_OFFSET 4
- ALIGN
-error_code:
- /* the function address is in %gs's slot on the stack */
- pushl %fs
- CFI_ADJUST_CFA_OFFSET 4
- /*CFI_REL_OFFSET fs, 0*/
- pushl %es
- CFI_ADJUST_CFA_OFFSET 4
- /*CFI_REL_OFFSET es, 0*/
- pushl %ds
- CFI_ADJUST_CFA_OFFSET 4
- /*CFI_REL_OFFSET ds, 0*/
- pushl %eax
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET eax, 0
- pushl %ebp
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET ebp, 0
- pushl %edi
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET edi, 0
- pushl %esi
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET esi, 0
- pushl %edx
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET edx, 0
- pushl %ecx
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET ecx, 0
- pushl %ebx
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET ebx, 0
- cld
- movl $(__KERNEL_PERCPU), %ecx
- movl %ecx, %fs
- UNWIND_ESPFIX_STACK
- GS_TO_REG %ecx
- movl PT_GS(%esp), %edi # get the function address
- movl PT_ORIG_EAX(%esp), %edx # get the error code
- movl $-1, PT_ORIG_EAX(%esp) # no syscall to restart
- REG_TO_PTGS %ecx
- SET_KERNEL_GS %ecx
- movl $(__USER_DS), %ecx
- movl %ecx, %ds
- movl %ecx, %es
- TRACE_IRQS_OFF
- movl %esp,%eax # pt_regs pointer
- call *%edi
- jmp ret_from_exception
- CFI_ENDPROC
-END(page_fault)
-
-/*
- * Debug traps and NMI can happen at the one SYSENTER instruction
- * that sets up the real kernel stack. Check here, since we can't
- * allow the wrong stack to be used.
- *
- * "TSS_sysenter_sp0+12" is because the NMI/debug handler will have
- * already pushed 3 words if it hits on the sysenter instruction:
- * eflags, cs and eip.
- *
- * We just load the right stack, and push the three (known) values
- * by hand onto the new stack - while updating the return eip past
- * the instruction that would have done it for sysenter.
- */
-.macro FIX_STACK offset ok label
- cmpw $__KERNEL_CS, 4(%esp)
- jne \ok
-\label:
- movl TSS_sysenter_sp0 + \offset(%esp), %esp
- CFI_DEF_CFA esp, 0
- CFI_UNDEFINED eip
- pushfl
- CFI_ADJUST_CFA_OFFSET 4
- pushl $__KERNEL_CS
- CFI_ADJUST_CFA_OFFSET 4
- pushl $sysenter_past_esp
- CFI_ADJUST_CFA_OFFSET 4
- CFI_REL_OFFSET eip, 0
-.endm
-
-ENTRY(debug)
- RING0_INT_FRAME
- cmpl $ia32_sysenter_target,(%esp)
- jne debug_stack_correct
- FIX_STACK 12, debug_stack_correct, debug_esp_fix_insn
-debug_stack_correct:
- pushl $-1 # mark this as an int
- CFI_ADJUST_CFA_OFFSET 4
- SAVE_ALL
- TRACE_IRQS_OFF
- xorl %edx,%edx # error code 0
- movl %esp,%eax # pt_regs pointer
- call do_debug
- jmp ret_from_exception
- CFI_ENDPROC
-END(debug)
-
-/*
- * NMI is doubly nasty. It can happen _while_ we're handling
- * a debug fault, and the debug fault hasn't yet been able to
- * clear up the stack. So we first check whether we got an
- * NMI on the sysenter entry path, but after that we need to
- * check whether we got an NMI on the debug path where the debug
- * fault happened on the sysenter path.
- */
-ENTRY(nmi)
- RING0_INT_FRAME
- pushl %eax
- CFI_ADJUST_CFA_OFFSET 4
- movl %ss, %eax
- cmpw $__ESPFIX_SS, %ax
- popl %eax
- CFI_ADJUST_CFA_OFFSET -4
- je nmi_espfix_stack
- cmpl $ia32_sysenter_target,(%esp)
- je nmi_stack_fixup
- pushl %eax
- CFI_ADJUST_CFA_OFFSET 4
- movl %esp,%eax
-	/* Do not access memory above the end of our stack page;
- * it might not exist.
- */
- andl $(THREAD_SIZE-1),%eax
- cmpl $(THREAD_SIZE-20),%eax
- popl %eax
- CFI_ADJUST_CFA_OFFSET -4
- jae nmi_stack_correct
- cmpl $ia32_sysenter_target,12(%esp)
- je nmi_debug_stack_check
-nmi_stack_correct:
- /* We have a RING0_INT_FRAME here */
- pushl %eax
- CFI_ADJUST_CFA_OFFSET 4
- SAVE_ALL
- xorl %edx,%edx # zero error code
- movl %esp,%eax # pt_regs pointer
- call do_nmi
- jmp restore_all_notrace
- CFI_ENDPROC
-
-nmi_stack_fixup:
- RING0_INT_FRAME
- FIX_STACK 12, nmi_stack_correct, 1
- jmp nmi_stack_correct
-
-nmi_debug_stack_check:
- /* We have a RING0_INT_FRAME here */
- cmpw $__KERNEL_CS,16(%esp)
- jne nmi_stack_correct
- cmpl $debug,(%esp)
- jb nmi_stack_correct
- cmpl $debug_esp_fix_insn,(%esp)
- ja nmi_stack_correct
- FIX_STACK 24, nmi_stack_correct, 1
- jmp nmi_stack_correct
-
-nmi_espfix_stack:
- /* We have a RING0_INT_FRAME here.
- *
- * create the far pointer that lss will use to switch back
- */
- pushl %ss
- CFI_ADJUST_CFA_OFFSET 4
- pushl %esp
- CFI_ADJUST_CFA_OFFSET 4
- addl $4, (%esp)
- /* copy the iret frame of 12 bytes */
- .rept 3
- pushl 16(%esp)
- CFI_ADJUST_CFA_OFFSET 4
- .endr
- pushl %eax
- CFI_ADJUST_CFA_OFFSET 4
- SAVE_ALL
- FIXUP_ESPFIX_STACK # %eax == %esp
- xorl %edx,%edx # zero error code
- call do_nmi
- RESTORE_REGS
- lss 12+4(%esp), %esp # back to espfix stack
- CFI_ADJUST_CFA_OFFSET -24
- jmp irq_return
- CFI_ENDPROC
-END(nmi)
-
-ENTRY(int3)
- RING0_INT_FRAME
- pushl $-1 # mark this as an int
- CFI_ADJUST_CFA_OFFSET 4
- SAVE_ALL
- TRACE_IRQS_OFF
- xorl %edx,%edx # zero error code
- movl %esp,%eax # pt_regs pointer
- call do_int3
- jmp ret_from_exception
- CFI_ENDPROC
-END(int3)
-
-ENTRY(general_protection)
- RING0_EC_FRAME
- pushl $do_general_protection
- CFI_ADJUST_CFA_OFFSET 4
- jmp error_code
- CFI_ENDPROC
-END(general_protection)
-
-/*
- * End of kprobes section
- */
- .popsection
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
deleted file mode 100644
index 4db7c4d12ffa..000000000000
--- a/arch/x86/kernel/entry_64.S
+++ /dev/null
@@ -1,1558 +0,0 @@
-/*
- * linux/arch/x86_64/entry.S
- *
- * Copyright (C) 1991, 1992 Linus Torvalds
- * Copyright (C) 2000, 2001, 2002 Andi Kleen SuSE Labs
- * Copyright (C) 2000 Pavel Machek <pavel@suse.cz>
- */
-
-/*
- * entry.S contains the system-call and fault low-level handling routines.
- *
- * NOTE: This code handles signal-recognition, which happens every time
- * after an interrupt and after each system call.
- *
- * Normal syscalls and interrupts don't save a full stack frame; this is
- * only done for syscall tracing, signals or fork/exec et al.
- *
- * A note on terminology:
- * - top of stack: Architecture defined interrupt frame from SS to RIP
- * at the top of the kernel process stack.
- * - partial stack frame: partially saved registers up to R11.
- * - full stack frame: Like partial stack frame, but all register saved.
- *
- * Some macro usage:
- * - CFI macros are used to generate dwarf2 unwind information for better
- * backtraces. They don't change any code.
- * - SAVE_ALL/RESTORE_ALL - Save/restore all registers
- * - SAVE_ARGS/RESTORE_ARGS - Save/restore registers that C functions modify.
- * There are unfortunately lots of special cases where some registers
- * are not touched. The macro is a big mess that should be cleaned up.
- * - SAVE_REST/RESTORE_REST - Handle the registers not saved by SAVE_ARGS.
- * Gives a full stack frame.
- * - ENTRY/END Define functions in the symbol table.
- * - FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK - Fix up the hardware stack
- * frame that is otherwise undefined after a SYSCALL
- * - TRACE_IRQ_* - Trace hard interrupt state for lock debugging.
- * - errorentry/paranoidentry/zeroentry - Define exception entry points.
- */
-
-#include <linux/linkage.h>
-#include <asm/segment.h>
-#include <asm/cache.h>
-#include <asm/errno.h>
-#include <asm/dwarf2.h>
-#include <asm/calling.h>
-#include <asm/asm-offsets.h>
-#include <asm/msr.h>
-#include <asm/unistd.h>
-#include <asm/thread_info.h>
-#include <asm/hw_irq.h>
-#include <asm/page_types.h>
-#include <asm/irqflags.h>
-#include <asm/paravirt.h>
-#include <asm/ftrace.h>
-#include <asm/percpu.h>
-
-/* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this. */
-#include <linux/elf-em.h>
-#define AUDIT_ARCH_X86_64 (EM_X86_64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
-#define __AUDIT_ARCH_64BIT 0x80000000
-#define __AUDIT_ARCH_LE 0x40000000
-
- .code64
-#ifdef CONFIG_FUNCTION_TRACER
-#ifdef CONFIG_DYNAMIC_FTRACE
-ENTRY(mcount)
- retq
-END(mcount)
-
-ENTRY(ftrace_caller)
- cmpl $0, function_trace_stop
- jne ftrace_stub
-
- MCOUNT_SAVE_FRAME
-
- movq 0x38(%rsp), %rdi
- movq 8(%rbp), %rsi
- subq $MCOUNT_INSN_SIZE, %rdi
-
-GLOBAL(ftrace_call)
- call ftrace_stub
-
- MCOUNT_RESTORE_FRAME
-
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-GLOBAL(ftrace_graph_call)
- jmp ftrace_stub
-#endif
-
-GLOBAL(ftrace_stub)
- retq
-END(ftrace_caller)
-
-#else /* ! CONFIG_DYNAMIC_FTRACE */
-ENTRY(mcount)
- cmpl $0, function_trace_stop
- jne ftrace_stub
-
- cmpq $ftrace_stub, ftrace_trace_function
- jnz trace
-
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
- cmpq $ftrace_stub, ftrace_graph_return
- jnz ftrace_graph_caller
-
- cmpq $ftrace_graph_entry_stub, ftrace_graph_entry
- jnz ftrace_graph_caller
-#endif
-
-GLOBAL(ftrace_stub)
- retq
-
-trace:
- MCOUNT_SAVE_FRAME
-
- movq 0x38(%rsp), %rdi
- movq 8(%rbp), %rsi
- subq $MCOUNT_INSN_SIZE, %rdi
-
- call *ftrace_trace_function
-
- MCOUNT_RESTORE_FRAME
-
- jmp ftrace_stub
-END(mcount)
-#endif /* CONFIG_DYNAMIC_FTRACE */
-#endif /* CONFIG_FUNCTION_TRACER */
-
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-ENTRY(ftrace_graph_caller)
- cmpl $0, function_trace_stop
- jne ftrace_stub
-
- MCOUNT_SAVE_FRAME
-
- leaq 8(%rbp), %rdi
- movq 0x38(%rsp), %rsi
- movq (%rbp), %rdx
- subq $MCOUNT_INSN_SIZE, %rsi
-
- call prepare_ftrace_return
-
- MCOUNT_RESTORE_FRAME
-
- retq
-END(ftrace_graph_caller)
-
-GLOBAL(return_to_handler)
- subq $24, %rsp
-
- /* Save the return values */
- movq %rax, (%rsp)
- movq %rdx, 8(%rsp)
- movq %rbp, %rdi
-
- call ftrace_return_to_handler
-
- movq %rax, %rdi
- movq 8(%rsp), %rdx
- movq (%rsp), %rax
- addq $24, %rsp
- jmp *%rdi
-#endif
-
-
-#ifndef CONFIG_PREEMPT
-#define retint_kernel retint_restore_args
-#endif
-
-#ifdef CONFIG_PARAVIRT
-ENTRY(native_usergs_sysret64)
- swapgs
- sysretq
-ENDPROC(native_usergs_sysret64)
-#endif /* CONFIG_PARAVIRT */
-
-
-.macro TRACE_IRQS_IRETQ offset=ARGOFFSET
-#ifdef CONFIG_TRACE_IRQFLAGS
- bt $9,EFLAGS-\offset(%rsp) /* interrupts off? */
- jnc 1f
- TRACE_IRQS_ON
-1:
-#endif
-.endm
-
-/*
- * C code is not supposed to know about the undefined top of stack. Every time
- * a C function with a pt_regs argument is called from the SYSCALL-based
- * fast path, FIXUP_TOP_OF_STACK is needed.
- * RESTORE_TOP_OF_STACK syncs the syscall state after any possible ptregs
- * manipulation.
- */
-
- /* %rsp:at FRAMEEND */
- .macro FIXUP_TOP_OF_STACK tmp offset=0
- movq PER_CPU_VAR(old_rsp),\tmp
- movq \tmp,RSP+\offset(%rsp)
- movq $__USER_DS,SS+\offset(%rsp)
- movq $__USER_CS,CS+\offset(%rsp)
- movq $-1,RCX+\offset(%rsp)
- movq R11+\offset(%rsp),\tmp /* get eflags */
- movq \tmp,EFLAGS+\offset(%rsp)
- .endm
-
- .macro RESTORE_TOP_OF_STACK tmp offset=0
- movq RSP+\offset(%rsp),\tmp
- movq \tmp,PER_CPU_VAR(old_rsp)
- movq EFLAGS+\offset(%rsp),\tmp
- movq \tmp,R11+\offset(%rsp)
- .endm
-
- .macro FAKE_STACK_FRAME child_rip
- /* push in order ss, rsp, eflags, cs, rip */
- xorl %eax, %eax
- pushq $__KERNEL_DS /* ss */
- CFI_ADJUST_CFA_OFFSET 8
- /*CFI_REL_OFFSET ss,0*/
- pushq %rax /* rsp */
- CFI_ADJUST_CFA_OFFSET 8
- CFI_REL_OFFSET rsp,0
- pushq $X86_EFLAGS_IF /* eflags - interrupts on */
- CFI_ADJUST_CFA_OFFSET 8
- /*CFI_REL_OFFSET rflags,0*/
- pushq $__KERNEL_CS /* cs */
- CFI_ADJUST_CFA_OFFSET 8
- /*CFI_REL_OFFSET cs,0*/
- pushq \child_rip /* rip */
- CFI_ADJUST_CFA_OFFSET 8
- CFI_REL_OFFSET rip,0
- pushq %rax /* orig rax */
- CFI_ADJUST_CFA_OFFSET 8
- .endm
-
- .macro UNFAKE_STACK_FRAME
- addq $8*6, %rsp
- CFI_ADJUST_CFA_OFFSET -(6*8)
- .endm
-
-/*
- * initial frame state for interrupts (and exceptions without error code)
- */
- .macro EMPTY_FRAME start=1 offset=0
- .if \start
- CFI_STARTPROC simple
- CFI_SIGNAL_FRAME
- CFI_DEF_CFA rsp,8+\offset
- .else
- CFI_DEF_CFA_OFFSET 8+\offset
- .endif
- .endm
-
-/*
- * initial frame state for interrupts (and exceptions without error code)
- */
- .macro INTR_FRAME start=1 offset=0
- EMPTY_FRAME \start, SS+8+\offset-RIP
- /*CFI_REL_OFFSET ss, SS+\offset-RIP*/
- CFI_REL_OFFSET rsp, RSP+\offset-RIP
- /*CFI_REL_OFFSET rflags, EFLAGS+\offset-RIP*/
- /*CFI_REL_OFFSET cs, CS+\offset-RIP*/
- CFI_REL_OFFSET rip, RIP+\offset-RIP
- .endm
-
-/*
- * initial frame state for exceptions with error code (and interrupts
- * with vector already pushed)
- */
- .macro XCPT_FRAME start=1 offset=0
- INTR_FRAME \start, RIP+\offset-ORIG_RAX
- /*CFI_REL_OFFSET orig_rax, ORIG_RAX-ORIG_RAX*/
- .endm
-
-/*
- * frame that enables calling into C.
- */
- .macro PARTIAL_FRAME start=1 offset=0
- XCPT_FRAME \start, ORIG_RAX+\offset-ARGOFFSET
- CFI_REL_OFFSET rdi, RDI+\offset-ARGOFFSET
- CFI_REL_OFFSET rsi, RSI+\offset-ARGOFFSET
- CFI_REL_OFFSET rdx, RDX+\offset-ARGOFFSET
- CFI_REL_OFFSET rcx, RCX+\offset-ARGOFFSET
- CFI_REL_OFFSET rax, RAX+\offset-ARGOFFSET
- CFI_REL_OFFSET r8, R8+\offset-ARGOFFSET
- CFI_REL_OFFSET r9, R9+\offset-ARGOFFSET
- CFI_REL_OFFSET r10, R10+\offset-ARGOFFSET
- CFI_REL_OFFSET r11, R11+\offset-ARGOFFSET
- .endm
-
-/*
- * frame that enables passing a complete pt_regs to a C function.
- */
- .macro DEFAULT_FRAME start=1 offset=0
- PARTIAL_FRAME \start, R11+\offset-R15
- CFI_REL_OFFSET rbx, RBX+\offset
- CFI_REL_OFFSET rbp, RBP+\offset
- CFI_REL_OFFSET r12, R12+\offset
- CFI_REL_OFFSET r13, R13+\offset
- CFI_REL_OFFSET r14, R14+\offset
- CFI_REL_OFFSET r15, R15+\offset
- .endm
-
-/* save partial stack frame */
-ENTRY(save_args)
- XCPT_FRAME
- cld
- movq_cfi rdi, RDI+16-ARGOFFSET
- movq_cfi rsi, RSI+16-ARGOFFSET
- movq_cfi rdx, RDX+16-ARGOFFSET
- movq_cfi rcx, RCX+16-ARGOFFSET
- movq_cfi rax, RAX+16-ARGOFFSET
- movq_cfi r8, R8+16-ARGOFFSET
- movq_cfi r9, R9+16-ARGOFFSET
- movq_cfi r10, R10+16-ARGOFFSET
- movq_cfi r11, R11+16-ARGOFFSET
-
- leaq -ARGOFFSET+16(%rsp),%rdi /* arg1 for handler */
- movq_cfi rbp, 8 /* push %rbp */
- leaq 8(%rsp), %rbp /* mov %rsp, %ebp */
- testl $3, CS(%rdi)
- je 1f
- SWAPGS
- /*
- * irq_count is used to check if a CPU is already on an interrupt stack
- * or not. While this is essentially redundant with preempt_count it is
- * a little cheaper to use a separate counter in the PDA (short of
- * moving irq_enter into assembly, which would be too much work)
- */
-1: incl PER_CPU_VAR(irq_count)
- jne 2f
- popq_cfi %rax /* move return address... */
- mov PER_CPU_VAR(irq_stack_ptr),%rsp
- EMPTY_FRAME 0
- pushq_cfi %rbp /* backlink for unwinder */
- pushq_cfi %rax /* ... to the new stack */
- /*
- * We entered an interrupt context - irqs are off:
- */
-2: TRACE_IRQS_OFF
- ret
- CFI_ENDPROC
-END(save_args)
-
-ENTRY(save_rest)
- PARTIAL_FRAME 1 REST_SKIP+8
- movq 5*8+16(%rsp), %r11 /* save return address */
- movq_cfi rbx, RBX+16
- movq_cfi rbp, RBP+16
- movq_cfi r12, R12+16
- movq_cfi r13, R13+16
- movq_cfi r14, R14+16
- movq_cfi r15, R15+16
- movq %r11, 8(%rsp) /* return address */
- FIXUP_TOP_OF_STACK %r11, 16
- ret
- CFI_ENDPROC
-END(save_rest)
-
-/* save complete stack frame */
- .pushsection .kprobes.text, "ax"
-ENTRY(save_paranoid)
- XCPT_FRAME 1 RDI+8
- cld
- movq_cfi rdi, RDI+8
- movq_cfi rsi, RSI+8
- movq_cfi rdx, RDX+8
- movq_cfi rcx, RCX+8
- movq_cfi rax, RAX+8
- movq_cfi r8, R8+8
- movq_cfi r9, R9+8
- movq_cfi r10, R10+8
- movq_cfi r11, R11+8
- movq_cfi rbx, RBX+8
- movq_cfi rbp, RBP+8
- movq_cfi r12, R12+8
- movq_cfi r13, R13+8
- movq_cfi r14, R14+8
- movq_cfi r15, R15+8
- movl $1,%ebx
- movl $MSR_GS_BASE,%ecx
- rdmsr
- testl %edx,%edx
- js 1f /* negative -> in kernel */
- SWAPGS
- xorl %ebx,%ebx
-1: ret
- CFI_ENDPROC
-END(save_paranoid)
- .popsection
-
-/*
- * A newly forked process directly context switches into this address.
- *
- * rdi: prev task we switched from
- */
-ENTRY(ret_from_fork)
- DEFAULT_FRAME
-
- LOCK ; btr $TIF_FORK,TI_flags(%r8)
-
- push kernel_eflags(%rip)
- CFI_ADJUST_CFA_OFFSET 8
- popf # reset kernel eflags
- CFI_ADJUST_CFA_OFFSET -8
-
- call schedule_tail # rdi: 'prev' task parameter
-
- GET_THREAD_INFO(%rcx)
-
- RESTORE_REST
-
- testl $3, CS-ARGOFFSET(%rsp) # from kernel_thread?
- je int_ret_from_sys_call
-
- testl $_TIF_IA32, TI_flags(%rcx) # 32-bit compat task needs IRET
- jnz int_ret_from_sys_call
-
- RESTORE_TOP_OF_STACK %rdi, -ARGOFFSET
- jmp ret_from_sys_call # go to the SYSRET fastpath
-
- CFI_ENDPROC
-END(ret_from_fork)
-
-/*
- * System call entry. Up to 6 arguments in registers are supported.
- *
- * SYSCALL does not save anything on the stack and does not change the
- * stack pointer.
- */
-
-/*
- * Register setup:
- * rax system call number
- * rdi arg0
- * rcx return address for syscall/sysret, C arg3
- * rsi arg1
- * rdx arg2
- * r10 arg3 (--> moved to rcx for C)
- * r8 arg4
- * r9 arg5
- * r11 eflags for syscall/sysret, temporary for C
- * r12-r15,rbp,rbx saved by C code, not touched.
- *
- * Interrupts are off on entry.
- * Only called from user space.
- *
- * XXX if we had a free scratch register we could save the RSP into the stack frame
- * and report it properly in ps. Unfortunately we have none.
- *
- * When the user can change the frames, always force IRET. That is because
- * IRET deals with non-canonical addresses better. SYSRET has trouble
- * with them due to bugs in both AMD and Intel CPUs.
- */
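For readers who do not have the ABI memorized, the register table above is the same convention user space follows when issuing a raw SYSCALL. A minimal user-space sketch of it (illustrative only, not part of this patch; assumes x86-64 Linux, where __NR_write is 1):

#include <stddef.h>

/* Raw write(2) via SYSCALL: nr in rax, args in rdi/rsi/rdx,
 * return value back in rax, exactly as documented above. */
static long raw_write(int fd, const void *buf, size_t len)
{
	long ret;

	__asm__ volatile("syscall"
			 : "=a" (ret)
			 : "a" (1L /* __NR_write */), "D" ((long)fd),
			   "S" (buf), "d" (len)
			 : "rcx", "r11", "memory");
	return ret;
}

The rcx/r11 clobbers correspond directly to the "return address for syscall/sysret" and "eflags for syscall/sysret" rows in the table: the CPU overwrites both, so the compiler must not keep anything live there.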
-
-ENTRY(system_call)
- CFI_STARTPROC simple
- CFI_SIGNAL_FRAME
- CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET
- CFI_REGISTER rip,rcx
- /*CFI_REGISTER rflags,r11*/
- SWAPGS_UNSAFE_STACK
- /*
- * A hypervisor implementation might want to use a label
- * after the swapgs, so that it can do the swapgs
- * for the guest and jump here on syscall.
- */
-ENTRY(system_call_after_swapgs)
-
- movq %rsp,PER_CPU_VAR(old_rsp)
- movq PER_CPU_VAR(kernel_stack),%rsp
- /*
- * No need to follow this irqs off/on section - it's straight
- * and short:
- */
- ENABLE_INTERRUPTS(CLBR_NONE)
- SAVE_ARGS 8,1
- movq %rax,ORIG_RAX-ARGOFFSET(%rsp)
- movq %rcx,RIP-ARGOFFSET(%rsp)
- CFI_REL_OFFSET rip,RIP-ARGOFFSET
- GET_THREAD_INFO(%rcx)
- testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%rcx)
- jnz tracesys
-system_call_fastpath:
- cmpq $__NR_syscall_max,%rax
- ja badsys
- movq %r10,%rcx
- call *sys_call_table(,%rax,8) # XXX: rip relative
- movq %rax,RAX-ARGOFFSET(%rsp)
-/*
- * Syscall return path ending with SYSRET (fast path)
- * Has incomplete stack frame and undefined top of stack.
- */
-ret_from_sys_call:
- movl $_TIF_ALLWORK_MASK,%edi
- /* edi: flagmask */
-sysret_check:
- LOCKDEP_SYS_EXIT
- GET_THREAD_INFO(%rcx)
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- movl TI_flags(%rcx),%edx
- andl %edi,%edx
- jnz sysret_careful
- CFI_REMEMBER_STATE
- /*
- * sysretq will re-enable interrupts:
- */
- TRACE_IRQS_ON
- movq RIP-ARGOFFSET(%rsp),%rcx
- CFI_REGISTER rip,rcx
- RESTORE_ARGS 0,-ARG_SKIP,1
- /*CFI_REGISTER rflags,r11*/
- movq PER_CPU_VAR(old_rsp), %rsp
- USERGS_SYSRET64
-
- CFI_RESTORE_STATE
- /* Handle reschedules */
- /* edx: work, edi: workmask */
-sysret_careful:
- bt $TIF_NEED_RESCHED,%edx
- jnc sysret_signal
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- pushq %rdi
- CFI_ADJUST_CFA_OFFSET 8
- call schedule
- popq %rdi
- CFI_ADJUST_CFA_OFFSET -8
- jmp sysret_check
-
- /* Handle a signal */
-sysret_signal:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
-#ifdef CONFIG_AUDITSYSCALL
- bt $TIF_SYSCALL_AUDIT,%edx
- jc sysret_audit
-#endif
- /*
- * We have a signal, or exit tracing or single-step.
- * These all wind up with the iret return path anyway,
- * so just join that path right now.
- */
- FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
- jmp int_check_syscall_exit_work
-
-badsys:
- movq $-ENOSYS,RAX-ARGOFFSET(%rsp)
- jmp ret_from_sys_call
-
-#ifdef CONFIG_AUDITSYSCALL
- /*
- * Fast path for syscall audit without full syscall trace.
- * We just call audit_syscall_entry() directly, and then
- * jump back to the normal fast path.
- */
-auditsys:
- movq %r10,%r9 /* 6th arg: 4th syscall arg */
- movq %rdx,%r8 /* 5th arg: 3rd syscall arg */
- movq %rsi,%rcx /* 4th arg: 2nd syscall arg */
- movq %rdi,%rdx /* 3rd arg: 1st syscall arg */
- movq %rax,%rsi /* 2nd arg: syscall number */
- movl $AUDIT_ARCH_X86_64,%edi /* 1st arg: audit arch */
- call audit_syscall_entry
- LOAD_ARGS 0 /* reload call-clobbered registers */
- jmp system_call_fastpath
-
- /*
- * Return fast path for syscall audit. Call audit_syscall_exit()
- * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
- * masked off.
- */
-sysret_audit:
- movq RAX-ARGOFFSET(%rsp),%rsi /* second arg, syscall return value */
- cmpq $0,%rsi /* is it < 0? */
- setl %al /* 1 if so, 0 if not */
- movzbl %al,%edi /* zero-extend that into %edi */
- inc %edi /* first arg, 0->1(AUDITSC_SUCCESS), 1->2(AUDITSC_FAILURE) */
- call audit_syscall_exit
- movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
- jmp sysret_check
-#endif /* CONFIG_AUDITSYSCALL */
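The setl/inc pair in sysret_audit is compact enough to misread; in C terms it computes the following (a sketch only, with the constant names taken from the comment in that block):

/* Map a syscall return value to the audit result code:
 * 0 -> 1 (AUDITSC_SUCCESS), 1 -> 2 (AUDITSC_FAILURE). */
static int audit_result(long ret)
{
	return (ret < 0) + 1;
}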
-
- /* Do syscall tracing */
-tracesys:
-#ifdef CONFIG_AUDITSYSCALL
- testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags(%rcx)
- jz auditsys
-#endif
- SAVE_REST
- movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
- FIXUP_TOP_OF_STACK %rdi
- movq %rsp,%rdi
- call syscall_trace_enter
- /*
- * Reload arg registers from stack in case ptrace changed them.
- * We don't reload %rax because syscall_trace_enter() returned
- * the value it wants us to use in the table lookup.
- */
- LOAD_ARGS ARGOFFSET, 1
- RESTORE_REST
- cmpq $__NR_syscall_max,%rax
- ja int_ret_from_sys_call /* RAX(%rsp) set to -ENOSYS above */
- movq %r10,%rcx /* fixup for C */
- call *sys_call_table(,%rax,8)
- movq %rax,RAX-ARGOFFSET(%rsp)
- /* Use IRET because user could have changed frame */
-
-/*
- * Syscall return path ending with IRET.
- * Has correct top of stack, but partial stack frame.
- */
-GLOBAL(int_ret_from_sys_call)
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- testl $3,CS-ARGOFFSET(%rsp)
- je retint_restore_args
- movl $_TIF_ALLWORK_MASK,%edi
- /* edi: mask to check */
-GLOBAL(int_with_check)
- LOCKDEP_SYS_EXIT_IRQ
- GET_THREAD_INFO(%rcx)
- movl TI_flags(%rcx),%edx
- andl %edi,%edx
- jnz int_careful
- andl $~TS_COMPAT,TI_status(%rcx)
- jmp retint_swapgs
-
- /* Either reschedule or signal or syscall exit tracking needed. */
- /* First do a reschedule test. */
- /* edx: work, edi: workmask */
-int_careful:
- bt $TIF_NEED_RESCHED,%edx
- jnc int_very_careful
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- pushq %rdi
- CFI_ADJUST_CFA_OFFSET 8
- call schedule
- popq %rdi
- CFI_ADJUST_CFA_OFFSET -8
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp int_with_check
-
- /* handle signals and tracing -- both require a full stack frame */
-int_very_careful:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
-int_check_syscall_exit_work:
- SAVE_REST
- /* Check for syscall exit trace */
- testl $_TIF_WORK_SYSCALL_EXIT,%edx
- jz int_signal
- pushq %rdi
- CFI_ADJUST_CFA_OFFSET 8
- leaq 8(%rsp),%rdi # &ptregs -> arg1
- call syscall_trace_leave
- popq %rdi
- CFI_ADJUST_CFA_OFFSET -8
- andl $~(_TIF_WORK_SYSCALL_EXIT|_TIF_SYSCALL_EMU),%edi
- jmp int_restore_rest
-
-int_signal:
- testl $_TIF_DO_NOTIFY_MASK,%edx
- jz 1f
- movq %rsp,%rdi # &ptregs -> arg1
- xorl %esi,%esi # oldset -> arg2
- call do_notify_resume
-1: movl $_TIF_WORK_MASK,%edi
-int_restore_rest:
- RESTORE_REST
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp int_with_check
- CFI_ENDPROC
-END(system_call)
-
-/*
- * Certain special system calls that need to save a complete full stack frame.
- */
- .macro PTREGSCALL label,func,arg
-ENTRY(\label)
- PARTIAL_FRAME 1 8 /* offset 8: return address */
- subq $REST_SKIP, %rsp
- CFI_ADJUST_CFA_OFFSET REST_SKIP
- call save_rest
- DEFAULT_FRAME 0 8 /* offset 8: return address */
- leaq 8(%rsp), \arg /* pt_regs pointer */
- call \func
- jmp ptregscall_common
- CFI_ENDPROC
-END(\label)
- .endm
-
- PTREGSCALL stub_clone, sys_clone, %r8
- PTREGSCALL stub_fork, sys_fork, %rdi
- PTREGSCALL stub_vfork, sys_vfork, %rdi
- PTREGSCALL stub_sigaltstack, sys_sigaltstack, %rdx
- PTREGSCALL stub_iopl, sys_iopl, %rsi
-
-ENTRY(ptregscall_common)
- DEFAULT_FRAME 1 8 /* offset 8: return address */
- RESTORE_TOP_OF_STACK %r11, 8
- movq_cfi_restore R15+8, r15
- movq_cfi_restore R14+8, r14
- movq_cfi_restore R13+8, r13
- movq_cfi_restore R12+8, r12
- movq_cfi_restore RBP+8, rbp
- movq_cfi_restore RBX+8, rbx
- ret $REST_SKIP /* pop extended registers */
- CFI_ENDPROC
-END(ptregscall_common)
-
-ENTRY(stub_execve)
- CFI_STARTPROC
- popq %r11
- CFI_ADJUST_CFA_OFFSET -8
- CFI_REGISTER rip, r11
- SAVE_REST
- FIXUP_TOP_OF_STACK %r11
- movq %rsp, %rcx
- call sys_execve
- RESTORE_TOP_OF_STACK %r11
- movq %rax,RAX(%rsp)
- RESTORE_REST
- jmp int_ret_from_sys_call
- CFI_ENDPROC
-END(stub_execve)
-
-/*
- * sigreturn is special because it needs to restore all registers on return.
- * This cannot be done with SYSRET, so use the IRET return path instead.
- */
-ENTRY(stub_rt_sigreturn)
- CFI_STARTPROC
- addq $8, %rsp
- CFI_ADJUST_CFA_OFFSET -8
- SAVE_REST
- movq %rsp,%rdi
- FIXUP_TOP_OF_STACK %r11
- call sys_rt_sigreturn
- movq %rax,RAX(%rsp) # fixme, this could be done at the higher layer
- RESTORE_REST
- jmp int_ret_from_sys_call
- CFI_ENDPROC
-END(stub_rt_sigreturn)
-
-/*
- * Build the entry stubs and pointer table with some assembler magic.
- * We pack 7 stubs into a single 32-byte chunk, which will fit in a
- * single cache line on all modern x86 implementations.
- */
- .section .init.rodata,"a"
-ENTRY(interrupt)
- .text
- .p2align 5
- .p2align CONFIG_X86_L1_CACHE_SHIFT
-ENTRY(irq_entries_start)
- INTR_FRAME
-vector=FIRST_EXTERNAL_VECTOR
-.rept (NR_VECTORS-FIRST_EXTERNAL_VECTOR+6)/7
- .balign 32
- .rept 7
- .if vector < NR_VECTORS
- .if vector <> FIRST_EXTERNAL_VECTOR
- CFI_ADJUST_CFA_OFFSET -8
- .endif
-1: pushq $(~vector+0x80) /* Note: always in signed byte range */
- CFI_ADJUST_CFA_OFFSET 8
- .if ((vector-FIRST_EXTERNAL_VECTOR)%7) <> 6
- jmp 2f
- .endif
- .previous
- .quad 1b
- .text
-vector=vector+1
- .endif
- .endr
-2: jmp common_interrupt
-.endr
- CFI_ENDPROC
-END(irq_entries_start)
-
-.previous
-END(interrupt)
-.previous
-
-/*
- * Interrupt entry/exit.
- *
- * Interrupt entry points save only callee-clobbered registers in the fast path.
- *
- * Entry runs with interrupts off.
- */
-
-/* 0(%rsp): ~(interrupt number) */
- .macro interrupt func
- subq $10*8, %rsp
- CFI_ADJUST_CFA_OFFSET 10*8
- call save_args
- PARTIAL_FRAME 0
- call \func
- .endm
-
-/*
- * Interrupt entry/exit should be protected against kprobes
- */
- .pushsection .kprobes.text, "ax"
- /*
- * The interrupt stubs push (~vector+0x80) onto the stack and
- * then jump to common_interrupt.
- */
- .p2align CONFIG_X86_L1_CACHE_SHIFT
-common_interrupt:
- XCPT_FRAME
- addq $-0x80,(%rsp) /* Adjust vector to [-256,-1] range */
- interrupt do_IRQ
- /* 0(%rsp): old_rsp-ARGOFFSET */
-ret_from_intr:
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- decl PER_CPU_VAR(irq_count)
- leaveq
- CFI_DEF_CFA_REGISTER rsp
- CFI_ADJUST_CFA_OFFSET -8
-exit_intr:
- GET_THREAD_INFO(%rcx)
- testl $3,CS-ARGOFFSET(%rsp)
- je retint_kernel
-
- /* Interrupt came from user space */
- /*
- * Has a correct top of stack, but a partial stack frame
- * %rcx: thread info. Interrupts off.
- */
-retint_with_reschedule:
- movl $_TIF_WORK_MASK,%edi
-retint_check:
- LOCKDEP_SYS_EXIT_IRQ
- movl TI_flags(%rcx),%edx
- andl %edi,%edx
- CFI_REMEMBER_STATE
- jnz retint_careful
-
-retint_swapgs: /* return to user-space */
- /*
- * The iretq could re-enable interrupts:
- */
- DISABLE_INTERRUPTS(CLBR_ANY)
- TRACE_IRQS_IRETQ
- SWAPGS
- jmp restore_args
-
-retint_restore_args: /* return to kernel space */
- DISABLE_INTERRUPTS(CLBR_ANY)
- /*
- * The iretq could re-enable interrupts:
- */
- TRACE_IRQS_IRETQ
-restore_args:
- RESTORE_ARGS 0,8,0
-
-irq_return:
- INTERRUPT_RETURN
-
- .section __ex_table, "a"
- .quad irq_return, bad_iret
- .previous
-
-#ifdef CONFIG_PARAVIRT
-ENTRY(native_iret)
- iretq
-
- .section __ex_table,"a"
- .quad native_iret, bad_iret
- .previous
-#endif
-
- .section .fixup,"ax"
-bad_iret:
- /*
- * The iret traps when the %cs or %ss being restored is bogus.
- * We've lost the original trap vector and error code.
- * #GPF is the most likely one to get for an invalid selector.
- * So pretend we completed the iret and took the #GPF in user mode.
- *
- * We are now running with the kernel GS after exception recovery.
- * But error_entry expects us to have user GS to match the user %cs,
- * so swap back.
- */
- pushq $0
-
- SWAPGS
- jmp general_protection
-
- .previous
-
- /* edi: workmask, edx: work */
-retint_careful:
- CFI_RESTORE_STATE
- bt $TIF_NEED_RESCHED,%edx
- jnc retint_signal
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- pushq %rdi
- CFI_ADJUST_CFA_OFFSET 8
- call schedule
- popq %rdi
- CFI_ADJUST_CFA_OFFSET -8
- GET_THREAD_INFO(%rcx)
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp retint_check
-
-retint_signal:
- testl $_TIF_DO_NOTIFY_MASK,%edx
- jz retint_swapgs
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- SAVE_REST
- movq $-1,ORIG_RAX(%rsp)
- xorl %esi,%esi # oldset
- movq %rsp,%rdi # &pt_regs
- call do_notify_resume
- RESTORE_REST
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- GET_THREAD_INFO(%rcx)
- jmp retint_with_reschedule
-
-#ifdef CONFIG_PREEMPT
- /* Returning to kernel space. Check if we need preemption */
- /* rcx: threadinfo. interrupts off. */
-ENTRY(retint_kernel)
- cmpl $0,TI_preempt_count(%rcx)
- jnz retint_restore_args
- bt $TIF_NEED_RESCHED,TI_flags(%rcx)
- jnc retint_restore_args
- bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */
- jnc retint_restore_args
- call preempt_schedule_irq
- jmp exit_intr
-#endif
-
- CFI_ENDPROC
-END(common_interrupt)
-/*
- * End of kprobes section
- */
- .popsection
-
-/*
- * APIC interrupts.
- */
-.macro apicinterrupt num sym do_sym
-ENTRY(\sym)
- INTR_FRAME
- pushq $~(\num)
- CFI_ADJUST_CFA_OFFSET 8
- interrupt \do_sym
- jmp ret_from_intr
- CFI_ENDPROC
-END(\sym)
-.endm
-
-#ifdef CONFIG_SMP
-apicinterrupt IRQ_MOVE_CLEANUP_VECTOR \
- irq_move_cleanup_interrupt smp_irq_move_cleanup_interrupt
-apicinterrupt REBOOT_VECTOR \
- reboot_interrupt smp_reboot_interrupt
-#endif
-
-#ifdef CONFIG_X86_UV
-apicinterrupt UV_BAU_MESSAGE \
- uv_bau_message_intr1 uv_bau_message_interrupt
-#endif
-apicinterrupt LOCAL_TIMER_VECTOR \
- apic_timer_interrupt smp_apic_timer_interrupt
-apicinterrupt X86_PLATFORM_IPI_VECTOR \
- x86_platform_ipi smp_x86_platform_ipi
-
-#ifdef CONFIG_SMP
-apicinterrupt INVALIDATE_TLB_VECTOR_START+0 \
- invalidate_interrupt0 smp_invalidate_interrupt
-apicinterrupt INVALIDATE_TLB_VECTOR_START+1 \
- invalidate_interrupt1 smp_invalidate_interrupt
-apicinterrupt INVALIDATE_TLB_VECTOR_START+2 \
- invalidate_interrupt2 smp_invalidate_interrupt
-apicinterrupt INVALIDATE_TLB_VECTOR_START+3 \
- invalidate_interrupt3 smp_invalidate_interrupt
-apicinterrupt INVALIDATE_TLB_VECTOR_START+4 \
- invalidate_interrupt4 smp_invalidate_interrupt
-apicinterrupt INVALIDATE_TLB_VECTOR_START+5 \
- invalidate_interrupt5 smp_invalidate_interrupt
-apicinterrupt INVALIDATE_TLB_VECTOR_START+6 \
- invalidate_interrupt6 smp_invalidate_interrupt
-apicinterrupt INVALIDATE_TLB_VECTOR_START+7 \
- invalidate_interrupt7 smp_invalidate_interrupt
-#endif
-
-apicinterrupt THRESHOLD_APIC_VECTOR \
- threshold_interrupt smp_threshold_interrupt
-apicinterrupt THERMAL_APIC_VECTOR \
- thermal_interrupt smp_thermal_interrupt
-
-#ifdef CONFIG_X86_MCE
-apicinterrupt MCE_SELF_VECTOR \
- mce_self_interrupt smp_mce_self_interrupt
-#endif
-
-#ifdef CONFIG_SMP
-apicinterrupt CALL_FUNCTION_SINGLE_VECTOR \
- call_function_single_interrupt smp_call_function_single_interrupt
-apicinterrupt CALL_FUNCTION_VECTOR \
- call_function_interrupt smp_call_function_interrupt
-apicinterrupt RESCHEDULE_VECTOR \
- reschedule_interrupt smp_reschedule_interrupt
-#endif
-
-apicinterrupt ERROR_APIC_VECTOR \
- error_interrupt smp_error_interrupt
-apicinterrupt SPURIOUS_APIC_VECTOR \
- spurious_interrupt smp_spurious_interrupt
-
-#ifdef CONFIG_PERF_EVENTS
-apicinterrupt LOCAL_PENDING_VECTOR \
- perf_pending_interrupt smp_perf_pending_interrupt
-#endif
-
-/*
- * Exception entry points.
- */
-.macro zeroentry sym do_sym
-ENTRY(\sym)
- INTR_FRAME
- PARAVIRT_ADJUST_EXCEPTION_FRAME
- pushq_cfi $-1 /* ORIG_RAX: no syscall to restart */
- subq $15*8,%rsp
- CFI_ADJUST_CFA_OFFSET 15*8
- call error_entry
- DEFAULT_FRAME 0
- movq %rsp,%rdi /* pt_regs pointer */
- xorl %esi,%esi /* no error code */
- call \do_sym
- jmp error_exit /* %ebx: no swapgs flag */
- CFI_ENDPROC
-END(\sym)
-.endm
-
-.macro paranoidzeroentry sym do_sym
-ENTRY(\sym)
- INTR_FRAME
- PARAVIRT_ADJUST_EXCEPTION_FRAME
- pushq $-1 /* ORIG_RAX: no syscall to restart */
- CFI_ADJUST_CFA_OFFSET 8
- subq $15*8, %rsp
- call save_paranoid
- TRACE_IRQS_OFF
- movq %rsp,%rdi /* pt_regs pointer */
- xorl %esi,%esi /* no error code */
- call \do_sym
- jmp paranoid_exit /* %ebx: no swapgs flag */
- CFI_ENDPROC
-END(\sym)
-.endm
-
-.macro paranoidzeroentry_ist sym do_sym ist
-ENTRY(\sym)
- INTR_FRAME
- PARAVIRT_ADJUST_EXCEPTION_FRAME
- pushq $-1 /* ORIG_RAX: no syscall to restart */
- CFI_ADJUST_CFA_OFFSET 8
- subq $15*8, %rsp
- call save_paranoid
- TRACE_IRQS_OFF
- movq %rsp,%rdi /* pt_regs pointer */
- xorl %esi,%esi /* no error code */
- PER_CPU(init_tss, %r12)
- subq $EXCEPTION_STKSZ, TSS_ist + (\ist - 1) * 8(%r12)
- call \do_sym
- addq $EXCEPTION_STKSZ, TSS_ist + (\ist - 1) * 8(%r12)
- jmp paranoid_exit /* %ebx: no swapgs flag */
- CFI_ENDPROC
-END(\sym)
-.endm
-
-.macro errorentry sym do_sym
-ENTRY(\sym)
- XCPT_FRAME
- PARAVIRT_ADJUST_EXCEPTION_FRAME
- subq $15*8,%rsp
- CFI_ADJUST_CFA_OFFSET 15*8
- call error_entry
- DEFAULT_FRAME 0
- movq %rsp,%rdi /* pt_regs pointer */
- movq ORIG_RAX(%rsp),%rsi /* get error code */
- movq $-1,ORIG_RAX(%rsp) /* no syscall to restart */
- call \do_sym
- jmp error_exit /* %ebx: no swapgs flag */
- CFI_ENDPROC
-END(\sym)
-.endm
-
- /* error code is on the stack already */
-.macro paranoiderrorentry sym do_sym
-ENTRY(\sym)
- XCPT_FRAME
- PARAVIRT_ADJUST_EXCEPTION_FRAME
- subq $15*8,%rsp
- CFI_ADJUST_CFA_OFFSET 15*8
- call save_paranoid
- DEFAULT_FRAME 0
- TRACE_IRQS_OFF
- movq %rsp,%rdi /* pt_regs pointer */
- movq ORIG_RAX(%rsp),%rsi /* get error code */
- movq $-1,ORIG_RAX(%rsp) /* no syscall to restart */
- call \do_sym
- jmp paranoid_exit /* %ebx: no swapgs flag */
- CFI_ENDPROC
-END(\sym)
-.endm
-
-zeroentry divide_error do_divide_error
-zeroentry overflow do_overflow
-zeroentry bounds do_bounds
-zeroentry invalid_op do_invalid_op
-zeroentry device_not_available do_device_not_available
-paranoiderrorentry double_fault do_double_fault
-zeroentry coprocessor_segment_overrun do_coprocessor_segment_overrun
-errorentry invalid_TSS do_invalid_TSS
-errorentry segment_not_present do_segment_not_present
-zeroentry spurious_interrupt_bug do_spurious_interrupt_bug
-zeroentry coprocessor_error do_coprocessor_error
-errorentry alignment_check do_alignment_check
-zeroentry simd_coprocessor_error do_simd_coprocessor_error
-
- /* Reload gs selector with exception handling */
- /* edi: new selector */
-ENTRY(native_load_gs_index)
- CFI_STARTPROC
- pushf
- CFI_ADJUST_CFA_OFFSET 8
- DISABLE_INTERRUPTS(CLBR_ANY & ~CLBR_RDI)
- SWAPGS
-gs_change:
- movl %edi,%gs
-2: mfence /* workaround */
- SWAPGS
- popf
- CFI_ADJUST_CFA_OFFSET -8
- ret
- CFI_ENDPROC
-END(native_load_gs_index)
-
- .section __ex_table,"a"
- .align 8
- .quad gs_change,bad_gs
- .previous
- .section .fixup,"ax"
- /* running with kernelgs */
-bad_gs:
- SWAPGS /* switch back to user gs */
- xorl %eax,%eax
- movl %eax,%gs
- jmp 2b
- .previous
-
-ENTRY(kernel_thread_helper)
- pushq $0 # fake return address
- CFI_STARTPROC
- /*
- * Here we are in the child and the registers are set as they were
- * at kernel_thread() invocation in the parent.
- */
- call *%rsi
- # exit
- mov %eax, %edi
- call do_exit
- ud2 # padding for call trace
- CFI_ENDPROC
-END(kernel_thread_helper)
-
-/*
- * execve(). This function needs to use IRET, not SYSRET, to set up all state properly.
- *
- * C extern interface:
- * extern long execve(char *name, char **argv, char **envp)
- *
- * asm input arguments:
- * rdi: name, rsi: argv, rdx: envp
- *
- * We want to fall back into:
- * extern long sys_execve(char *name, char **argv, char **envp, struct pt_regs *regs)
- *
- * do_sys_execve asm fallback arguments:
- * rdi: name, rsi: argv, rdx: envp, rcx: fake frame on the stack
- */
-ENTRY(kernel_execve)
- CFI_STARTPROC
- FAKE_STACK_FRAME $0
- SAVE_ALL
- movq %rsp,%rcx
- call sys_execve
- movq %rax, RAX(%rsp)
- RESTORE_REST
- testq %rax,%rax
- je int_ret_from_sys_call
- RESTORE_ARGS
- UNFAKE_STACK_FRAME
- ret
- CFI_ENDPROC
-END(kernel_execve)
-
-/* Call softirq on interrupt stack. Interrupts are off. */
-ENTRY(call_softirq)
- CFI_STARTPROC
- push %rbp
- CFI_ADJUST_CFA_OFFSET 8
- CFI_REL_OFFSET rbp,0
- mov %rsp,%rbp
- CFI_DEF_CFA_REGISTER rbp
- incl PER_CPU_VAR(irq_count)
- cmove PER_CPU_VAR(irq_stack_ptr),%rsp
- push %rbp # backlink for old unwinder
- call __do_softirq
- leaveq
- CFI_DEF_CFA_REGISTER rsp
- CFI_ADJUST_CFA_OFFSET -8
- decl PER_CPU_VAR(irq_count)
- ret
- CFI_ENDPROC
-END(call_softirq)
-
-#ifdef CONFIG_XEN
-zeroentry xen_hypervisor_callback xen_do_hypervisor_callback
-
-/*
- * A note on the "critical region" in our callback handler.
- * We want to avoid stacking callback handlers due to events occurring
- * during handling of the last event. To do this, we keep events disabled
- * until we've done all processing. HOWEVER, we must enable events before
- * popping the stack frame (can't be done atomically) and so it would still
- * be possible to get enough handler activations to overflow the stack.
- * Although unlikely, bugs of that kind are hard to track down, so we'd
- * like to avoid the possibility.
- * So, on entry to the handler we detect whether we interrupted an
- * existing activation in its critical region -- if so, we pop the current
- * activation and restart the handler using the previous one.
- */
-ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct pt_regs *)
- CFI_STARTPROC
-/*
- * Since we don't modify %rdi, xen_evtchn_do_upcall(struct pt_regs *) will
- * see the correct pointer to the pt_regs.
- */
- movq %rdi, %rsp # we don't return, adjust the stack frame
- CFI_ENDPROC
- DEFAULT_FRAME
-11: incl PER_CPU_VAR(irq_count)
- movq %rsp,%rbp
- CFI_DEF_CFA_REGISTER rbp
- cmovzq PER_CPU_VAR(irq_stack_ptr),%rsp
- pushq %rbp # backlink for old unwinder
- call xen_evtchn_do_upcall
- popq %rsp
- CFI_DEF_CFA_REGISTER rsp
- decl PER_CPU_VAR(irq_count)
- jmp error_exit
- CFI_ENDPROC
-END(xen_do_hypervisor_callback)
-
-/*
- * Hypervisor uses this for application faults while it executes.
- * We get here for two reasons:
- * 1. Fault while reloading DS, ES, FS or GS
- * 2. Fault while executing IRET
- * Category 1 we do not need to fix up as Xen has already reloaded all segment
- * registers that could be reloaded and zeroed the others.
- * Category 2 we fix up by killing the current process. We cannot use the
- * normal Linux return path in this case because if we use the IRET hypercall
- * to pop the stack frame we end up in an infinite loop of failsafe callbacks.
- * We distinguish between categories by comparing each saved segment register
- * with its current contents: any discrepancy means we are in category 1.
- */
-ENTRY(xen_failsafe_callback)
- INTR_FRAME 1 (6*8)
- /*CFI_REL_OFFSET gs,GS*/
- /*CFI_REL_OFFSET fs,FS*/
- /*CFI_REL_OFFSET es,ES*/
- /*CFI_REL_OFFSET ds,DS*/
- CFI_REL_OFFSET r11,8
- CFI_REL_OFFSET rcx,0
- movw %ds,%cx
- cmpw %cx,0x10(%rsp)
- CFI_REMEMBER_STATE
- jne 1f
- movw %es,%cx
- cmpw %cx,0x18(%rsp)
- jne 1f
- movw %fs,%cx
- cmpw %cx,0x20(%rsp)
- jne 1f
- movw %gs,%cx
- cmpw %cx,0x28(%rsp)
- jne 1f
- /* All segments match their saved values => Category 2 (Bad IRET). */
- movq (%rsp),%rcx
- CFI_RESTORE rcx
- movq 8(%rsp),%r11
- CFI_RESTORE r11
- addq $0x30,%rsp
- CFI_ADJUST_CFA_OFFSET -0x30
- pushq_cfi $0 /* RIP */
- pushq_cfi %r11
- pushq_cfi %rcx
- jmp general_protection
- CFI_RESTORE_STATE
-1: /* Segment mismatch => Category 1 (Bad segment). Retry the IRET. */
- movq (%rsp),%rcx
- CFI_RESTORE rcx
- movq 8(%rsp),%r11
- CFI_RESTORE r11
- addq $0x30,%rsp
- CFI_ADJUST_CFA_OFFSET -0x30
- pushq_cfi $0
- SAVE_ALL
- jmp error_exit
- CFI_ENDPROC
-END(xen_failsafe_callback)
-
-#endif /* CONFIG_XEN */
-
-/*
- * Some functions should be protected against kprobes
- */
- .pushsection .kprobes.text, "ax"
-
-paranoidzeroentry_ist debug do_debug DEBUG_STACK
-paranoidzeroentry_ist int3 do_int3 DEBUG_STACK
-paranoiderrorentry stack_segment do_stack_segment
-#ifdef CONFIG_XEN
-zeroentry xen_debug do_debug
-zeroentry xen_int3 do_int3
-errorentry xen_stack_segment do_stack_segment
-#endif
-errorentry general_protection do_general_protection
-errorentry page_fault do_page_fault
-#ifdef CONFIG_X86_MCE
-paranoidzeroentry machine_check *machine_check_vector(%rip)
-#endif
-
- /*
- * "Paranoid" exit path from exception stack.
- * Paranoid because this is used by NMIs and cannot take
- * any kernel state for granted.
- * We don't do kernel preemption checks here, because only
- * NMI should be common and it does not enable IRQs and
- * cannot get reschedule ticks.
- *
- * "trace" is 0 for the NMI handler only, because irq-tracing
- * is fundamentally NMI-unsafe. (we cannot change the soft and
- * hard flags at once, atomically)
- */
-
- /* ebx: no swapgs flag */
-ENTRY(paranoid_exit)
- INTR_FRAME
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- testl %ebx,%ebx /* swapgs needed? */
- jnz paranoid_restore
- testl $3,CS(%rsp)
- jnz paranoid_userspace
-paranoid_swapgs:
- TRACE_IRQS_IRETQ 0
- SWAPGS_UNSAFE_STACK
- RESTORE_ALL 8
- jmp irq_return
-paranoid_restore:
- TRACE_IRQS_IRETQ 0
- RESTORE_ALL 8
- jmp irq_return
-paranoid_userspace:
- GET_THREAD_INFO(%rcx)
- movl TI_flags(%rcx),%ebx
- andl $_TIF_WORK_MASK,%ebx
- jz paranoid_swapgs
- movq %rsp,%rdi /* &pt_regs */
- call sync_regs
- movq %rax,%rsp /* switch stack for scheduling */
- testl $_TIF_NEED_RESCHED,%ebx
- jnz paranoid_schedule
- movl %ebx,%edx /* arg3: thread flags */
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- xorl %esi,%esi /* arg2: oldset */
- movq %rsp,%rdi /* arg1: &pt_regs */
- call do_notify_resume
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- jmp paranoid_userspace
-paranoid_schedule:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_ANY)
- call schedule
- DISABLE_INTERRUPTS(CLBR_ANY)
- TRACE_IRQS_OFF
- jmp paranoid_userspace
- CFI_ENDPROC
-END(paranoid_exit)
-
-/*
- * Exception entry point. This expects an error code/orig_rax on the stack.
- * returns in "no swapgs flag" in %ebx.
- */
-ENTRY(error_entry)
- XCPT_FRAME
- CFI_ADJUST_CFA_OFFSET 15*8
- /* oldrax contains error code */
- cld
- movq_cfi rdi, RDI+8
- movq_cfi rsi, RSI+8
- movq_cfi rdx, RDX+8
- movq_cfi rcx, RCX+8
- movq_cfi rax, RAX+8
- movq_cfi r8, R8+8
- movq_cfi r9, R9+8
- movq_cfi r10, R10+8
- movq_cfi r11, R11+8
- movq_cfi rbx, RBX+8
- movq_cfi rbp, RBP+8
- movq_cfi r12, R12+8
- movq_cfi r13, R13+8
- movq_cfi r14, R14+8
- movq_cfi r15, R15+8
- xorl %ebx,%ebx
- testl $3,CS+8(%rsp)
- je error_kernelspace
-error_swapgs:
- SWAPGS
-error_sti:
- TRACE_IRQS_OFF
- ret
- CFI_ENDPROC
-
-/*
- * There are two places in the kernel that can potentially fault with
- * usergs. Handle them here. The exception handlers after iret run with
- * kernel gs again, so don't set the user space flag. B-stepping K8s
- * sometimes report a truncated RIP for IRET exceptions returning to
- * compat mode. Check for these here too.
- */
-error_kernelspace:
- incl %ebx
- leaq irq_return(%rip),%rcx
- cmpq %rcx,RIP+8(%rsp)
- je error_swapgs
- movl %ecx,%eax /* zero extend */
- cmpq %rax,RIP+8(%rsp)
- je bstep_iret
- cmpq $gs_change,RIP+8(%rsp)
- je error_swapgs
- jmp error_sti
-
-bstep_iret:
- /* Fix truncated RIP */
- movq %rcx,RIP+8(%rsp)
- jmp error_swapgs
-END(error_entry)
-
-
-/* ebx: no swapgs flag (1: don't need swapgs, 0: need it) */
-ENTRY(error_exit)
- DEFAULT_FRAME
- movl %ebx,%eax
- RESTORE_REST
- DISABLE_INTERRUPTS(CLBR_NONE)
- TRACE_IRQS_OFF
- GET_THREAD_INFO(%rcx)
- testl %eax,%eax
- jne retint_kernel
- LOCKDEP_SYS_EXIT_IRQ
- movl TI_flags(%rcx),%edx
- movl $_TIF_WORK_MASK,%edi
- andl %edi,%edx
- jnz retint_careful
- jmp retint_swapgs
- CFI_ENDPROC
-END(error_exit)
-
-
- /* runs on exception stack */
-ENTRY(nmi)
- INTR_FRAME
- PARAVIRT_ADJUST_EXCEPTION_FRAME
- pushq_cfi $-1
- subq $15*8, %rsp
- CFI_ADJUST_CFA_OFFSET 15*8
- call save_paranoid
- DEFAULT_FRAME 0
- /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
- movq %rsp,%rdi
- movq $-1,%rsi
- call do_nmi
-#ifdef CONFIG_TRACE_IRQFLAGS
- /* paranoidexit; without TRACE_IRQS_OFF */
- /* ebx: no swapgs flag */
- DISABLE_INTERRUPTS(CLBR_NONE)
- testl %ebx,%ebx /* swapgs needed? */
- jnz nmi_restore
- testl $3,CS(%rsp)
- jnz nmi_userspace
-nmi_swapgs:
- SWAPGS_UNSAFE_STACK
-nmi_restore:
- RESTORE_ALL 8
- jmp irq_return
-nmi_userspace:
- GET_THREAD_INFO(%rcx)
- movl TI_flags(%rcx),%ebx
- andl $_TIF_WORK_MASK,%ebx
- jz nmi_swapgs
- movq %rsp,%rdi /* &pt_regs */
- call sync_regs
- movq %rax,%rsp /* switch stack for scheduling */
- testl $_TIF_NEED_RESCHED,%ebx
- jnz nmi_schedule
- movl %ebx,%edx /* arg3: thread flags */
- ENABLE_INTERRUPTS(CLBR_NONE)
- xorl %esi,%esi /* arg2: oldset */
- movq %rsp,%rdi /* arg1: &pt_regs */
- call do_notify_resume
- DISABLE_INTERRUPTS(CLBR_NONE)
- jmp nmi_userspace
-nmi_schedule:
- ENABLE_INTERRUPTS(CLBR_ANY)
- call schedule
- DISABLE_INTERRUPTS(CLBR_ANY)
- jmp nmi_userspace
- CFI_ENDPROC
-#else
- jmp paranoid_exit
- CFI_ENDPROC
-#endif
-END(nmi)
-
-ENTRY(ignore_sysret)
- CFI_STARTPROC
- mov $-ENOSYS,%eax
- sysret
- CFI_ENDPROC
-END(ignore_sysret)
-
-/*
- * End of kprobes section
- */
- .popsection
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
new file mode 100644
index 000000000000..6726e0473d0b
--- /dev/null
+++ b/arch/x86/kernel/espfix_64.c
@@ -0,0 +1,205 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* ----------------------------------------------------------------------- *
+ *
+ * Copyright 2014 Intel Corporation; author: H. Peter Anvin
+ *
+ * ----------------------------------------------------------------------- */
+
+/*
+ * The IRET instruction, when returning to a 16-bit segment, only
+ * restores the bottom 16 bits of the user space stack pointer. This
+ * causes some 16-bit software to break, but it also leaks kernel state
+ * to user space.
+ *
+ * The kernel works around this by creating percpu "ministacks", each of
+ * which is mapped 2^16 times, 64K apart. When we detect that the return
+ * SS is on the LDT, we copy the IRET frame to the ministack and use the
+ * relevant alias to return to userspace. The ministacks are mapped
+ * readonly, so if the IRET faults we promote #GP to #DF, which is an IST
+ * vector and thus has its own stack; we then do the fixup in the #DF
+ * handler.
+ *
+ * This file sets up the ministacks and the related page tables. The
+ * actual ministack invocation is in entry_64.S.
+ */
+
+#include <linux/init.h>
+#include <linux/init_task.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/gfp.h>
+#include <linux/random.h>
+#include <linux/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/setup.h>
+#include <asm/espfix.h>
+
+/*
+ * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
+ * it up to a cache line to avoid unnecessary sharing.
+ */
+#define ESPFIX_STACK_SIZE (8*8UL)
+#define ESPFIX_STACKS_PER_PAGE (PAGE_SIZE/ESPFIX_STACK_SIZE)
+
+/* There is address space for how many espfix pages? */
+#define ESPFIX_PAGE_SPACE (1UL << (P4D_SHIFT-PAGE_SHIFT-16))
+
+#define ESPFIX_MAX_CPUS (ESPFIX_STACKS_PER_PAGE * ESPFIX_PAGE_SPACE)
+#if CONFIG_NR_CPUS > ESPFIX_MAX_CPUS
+# error "Need more virtual address space for the ESPFIX hack"
+#endif
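As a quick sanity check on these limits, here is a stand-alone sketch of the arithmetic (the 39 and 12 are assumptions for 4-level paging, where P4D_SHIFT is 39 and PAGE_SHIFT is 12):

#include <stdio.h>

int main(void)
{
	unsigned long page_space = 1UL << (39 - 12 - 16); /* ESPFIX_PAGE_SPACE: 2048 pages */
	unsigned long per_page   = 4096UL / 64;           /* ESPFIX_STACKS_PER_PAGE: 64 stacks */

	printf("espfix capacity: %lu CPUs\n", page_space * per_page); /* 131072 */
	return 0;
}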
+
+#define PGALLOC_GFP (GFP_KERNEL | __GFP_ZERO)
+
+/* This contains the *bottom* address of the espfix stack */
+DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
+DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
+
+/* Initialization mutex - should this be a spinlock? */
+static DEFINE_MUTEX(espfix_init_mutex);
+
+/* Page allocation bitmap - each page serves ESPFIX_STACKS_PER_PAGE CPUs */
+#define ESPFIX_MAX_PAGES DIV_ROUND_UP(CONFIG_NR_CPUS, ESPFIX_STACKS_PER_PAGE)
+static void *espfix_pages[ESPFIX_MAX_PAGES];
+
+static __page_aligned_bss pud_t espfix_pud_page[PTRS_PER_PUD]
+ __aligned(PAGE_SIZE);
+
+static unsigned int page_random, slot_random;
+
+/*
+ * This returns the bottom address of the espfix stack for a specific CPU.
+ * The math allows for a non-power-of-two ESPFIX_STACK_SIZE, in which case
+ * we have to account for some amount of padding at the end of each page.
+ */
+static inline unsigned long espfix_base_addr(unsigned int cpu)
+{
+ unsigned long page, slot;
+ unsigned long addr;
+
+ page = (cpu / ESPFIX_STACKS_PER_PAGE) ^ page_random;
+ slot = (cpu + slot_random) % ESPFIX_STACKS_PER_PAGE;
+ addr = (page << PAGE_SHIFT) + (slot * ESPFIX_STACK_SIZE);
+ addr = (addr & 0xffffUL) | ((addr & ~0xffffUL) << 16);
+ addr += ESPFIX_BASE_ADDR;
+ return addr;
+}
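The bit splice on the second-to-last line is the heart of the aliasing scheme: the low 16 bits of the ministack address stay in place (they become the low 16 bits IRET will expose), while everything above them is shifted up past the 64K alias window. A stand-alone sketch of the same math, with assumed constants (4 KiB pages, 64-byte stacks, randomization disabled):

#include <stdio.h>

#define SKETCH_PAGE_SHIFT      12
#define SKETCH_STACK_SIZE      64UL
#define SKETCH_STACKS_PER_PAGE (4096UL / SKETCH_STACK_SIZE)

static unsigned long sketch_base_addr(unsigned int cpu)
{
	unsigned long page = cpu / SKETCH_STACKS_PER_PAGE;
	unsigned long slot = cpu % SKETCH_STACKS_PER_PAGE;
	unsigned long addr = (page << SKETCH_PAGE_SHIFT) + slot * SKETCH_STACK_SIZE;

	/* keep the low 16 bits, push everything else above the 64K window */
	return (addr & 0xffffUL) | ((addr & ~0xffffUL) << 16);
}

int main(void)
{
	unsigned int cpus[] = { 0, 1, 64, 1024 };
	unsigned int i;

	for (i = 0; i < 4; i++)	/* offsets relative to ESPFIX_BASE_ADDR */
		printf("cpu %4u -> +0x%010lx\n", cpus[i], sketch_base_addr(cpus[i]));
	return 0;
}

CPUs 0 through 1023 land inside the first 64K window (pages 0-15); CPU 1024 is the first to spill past it, and the splice pushes it 4 GiB up rather than into the next 64K, leaving the intermediate addresses free for the 2^16 aliases of each page.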
+
+#define PTE_STRIDE (65536/PAGE_SIZE)
+#define ESPFIX_PTE_CLONES (PTRS_PER_PTE/PTE_STRIDE)
+#define ESPFIX_PMD_CLONES PTRS_PER_PMD
+#define ESPFIX_PUD_CLONES (65536/(ESPFIX_PTE_CLONES*ESPFIX_PMD_CLONES))
+
+#define PGTABLE_PROT ((_KERNPG_TABLE & ~_PAGE_RW) | _PAGE_NX)
+
+static void init_espfix_random(void)
+{
+ unsigned long rand = get_random_long();
+
+ slot_random = rand % ESPFIX_STACKS_PER_PAGE;
+ page_random = (rand / ESPFIX_STACKS_PER_PAGE)
+ & (ESPFIX_PAGE_SPACE - 1);
+}
+
+void __init init_espfix_bsp(void)
+{
+ pgd_t *pgd;
+ p4d_t *p4d;
+
+ /* FRED systems always restore the full value of %rsp */
+ if (cpu_feature_enabled(X86_FEATURE_FRED))
+ return;
+
+ /* Install the espfix pud into the kernel page directory */
+ pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
+ p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
+ p4d_populate(&init_mm, p4d, espfix_pud_page);
+
+ /* Randomize the locations */
+ init_espfix_random();
+
+ /* The rest is the same as for any other processor */
+ init_espfix_ap(0);
+}
+
+void init_espfix_ap(int cpu)
+{
+ unsigned int page;
+ unsigned long addr;
+ pud_t pud, *pud_p;
+ pmd_t pmd, *pmd_p;
+ pte_t pte, *pte_p;
+ int n, node;
+ void *stack_page;
+ pteval_t ptemask;
+
+ /* FRED systems always restore the full value of %rsp */
+ if (cpu_feature_enabled(X86_FEATURE_FRED))
+ return;
+
+ /* We only have to do this once... */
+ if (likely(per_cpu(espfix_stack, cpu)))
+ return; /* Already initialized */
+
+ addr = espfix_base_addr(cpu);
+ page = cpu/ESPFIX_STACKS_PER_PAGE;
+
+ /* Did another CPU already set this up? */
+ stack_page = READ_ONCE(espfix_pages[page]);
+ if (likely(stack_page))
+ goto done;
+
+ mutex_lock(&espfix_init_mutex);
+
+ /* Did we race on the lock? */
+ stack_page = READ_ONCE(espfix_pages[page]);
+ if (stack_page)
+ goto unlock_done;
+
+ node = cpu_to_node(cpu);
+ ptemask = __supported_pte_mask;
+
+ pud_p = &espfix_pud_page[pud_index(addr)];
+ pud = *pud_p;
+ if (!pud_present(pud)) {
+ struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+ pmd_p = (pmd_t *)page_address(page);
+ pud = __pud(__pa(pmd_p) | (PGTABLE_PROT & ptemask));
+ paravirt_alloc_pmd(&init_mm, __pa(pmd_p) >> PAGE_SHIFT);
+ for (n = 0; n < ESPFIX_PUD_CLONES; n++)
+ set_pud(&pud_p[n], pud);
+ }
+
+ pmd_p = pmd_offset(&pud, addr);
+ pmd = *pmd_p;
+ if (!pmd_present(pmd)) {
+ struct page *page = alloc_pages_node(node, PGALLOC_GFP, 0);
+
+ pte_p = (pte_t *)page_address(page);
+ pmd = __pmd(__pa(pte_p) | (PGTABLE_PROT & ptemask));
+ paravirt_alloc_pte(&init_mm, __pa(pte_p) >> PAGE_SHIFT);
+ for (n = 0; n < ESPFIX_PMD_CLONES; n++)
+ set_pmd(&pmd_p[n], pmd);
+ }
+
+ pte_p = pte_offset_kernel(&pmd, addr);
+ stack_page = page_address(alloc_pages_node(node, GFP_KERNEL, 0));
+ /*
+ * __PAGE_KERNEL_* includes _PAGE_GLOBAL, which we want since
+ * this is mapped to userspace.
+ */
+ pte = __pte(__pa(stack_page) | ((__PAGE_KERNEL_RO | _PAGE_ENC) & ptemask));
+ for (n = 0; n < ESPFIX_PTE_CLONES; n++)
+ set_pte(&pte_p[n*PTE_STRIDE], pte);
+
+ /* Job is done for this CPU and any CPU which shares this page */
+ WRITE_ONCE(espfix_pages[page], stack_page);
+
+unlock_done:
+ mutex_unlock(&espfix_init_mutex);
+done:
+ per_cpu(espfix_stack, cpu) = addr;
+ per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
+ + (addr & ~PAGE_MASK);
+}
diff --git a/arch/x86/kernel/fpu/Makefile b/arch/x86/kernel/fpu/Makefile
new file mode 100644
index 000000000000..78c5621457d4
--- /dev/null
+++ b/arch/x86/kernel/fpu/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Build rules for the FPU support code:
+#
+
+obj-y += init.o bugs.o core.o regset.o signal.o xstate.o
diff --git a/arch/x86/kernel/fpu/bugs.c b/arch/x86/kernel/fpu/bugs.c
new file mode 100644
index 000000000000..edbafc5940e3
--- /dev/null
+++ b/arch/x86/kernel/fpu/bugs.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * x86 FPU bug checks:
+ */
+#include <linux/printk.h>
+
+#include <asm/cpufeature.h>
+#include <asm/fpu/api.h>
+
+/*
+ * Boot time CPU/FPU FDIV bug detection code:
+ */
+
+static double __initdata x = 4195835.0;
+static double __initdata y = 3145727.0;
+
+/*
+ * This used to check for exceptions..
+ * However, it turns out that to support that,
+ * the XMM trap handlers basically had to
+ * be buggy. So let's have a correct XMM trap
+ * handler, and forget about printing out
+ * some status at boot.
+ *
+ * We should really only care about bugs here
+ * anyway. Not features.
+ */
+void __init fpu__init_check_bugs(void)
+{
+ s32 fdiv_bug;
+
+ /* kernel_fpu_begin/end() relies on patched alternative instructions. */
+ if (!boot_cpu_has(X86_FEATURE_FPU))
+ return;
+
+ kernel_fpu_begin();
+
+ /*
+ * trap_init() enabled FXSR and company _before_ testing for FP
+ * problems here.
+ *
+ * Test for the FDIV bug: http://en.wikipedia.org/wiki/Fdiv_bug
+ */
+ __asm__("fninit\n\t"
+ "fldl %1\n\t"
+ "fdivl %2\n\t"
+ "fmull %2\n\t"
+ "fldl %1\n\t"
+ "fsubp %%st,%%st(1)\n\t"
+ "fistpl %0\n\t"
+ "fwait\n\t"
+ "fninit"
+ : "=m" (*&fdiv_bug)
+ : "m" (*&x), "m" (*&y));
+
+ kernel_fpu_end();
+
+ if (fdiv_bug) {
+ set_cpu_bug(&boot_cpu_data, X86_BUG_FDIV);
+ pr_warn("Hmm, FPU with FDIV bug\n");
+ }
+}
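For reference, the x87 sequence above is the in-kernel equivalent of this user-space sketch (illustrative only; the kernel uses inline assembly because it is testing the very FPU context that kernel_fpu_begin() handed it, and must leave the FPU reinitialized):

#include <stdio.h>

int main(void)
{
	volatile double x = 4195835.0, y = 3145727.0;
	/* Collapse ulp noise to an integer, much as fistpl does in the
	 * kernel version: 0 on a correct FPU, roughly -256 on a flawed
	 * Pentium. */
	long residue = (long)((x / y) * y - x);

	printf("fdiv residue: %ld -> %s\n",
	       residue, residue ? "FDIV bug present" : "ok");
	return 0;
}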
diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h
new file mode 100644
index 000000000000..10d0a720659c
--- /dev/null
+++ b/arch/x86/kernel/fpu/context.h
@@ -0,0 +1,82 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __X86_KERNEL_FPU_CONTEXT_H
+#define __X86_KERNEL_FPU_CONTEXT_H
+
+#include <asm/fpu/xstate.h>
+#include <asm/trace/fpu.h>
+
+/* Functions related to FPU context tracking */
+
+/*
+ * The in-register FPU state for an FPU context on a CPU is assumed to be
+ * valid if the fpu->last_cpu matches the CPU, and the fpu_fpregs_owner_ctx
+ * matches the FPU.
+ *
+ * If the FPU register state is valid, the kernel can skip restoring the
+ * FPU state from memory.
+ *
+ * Any code that clobbers the FPU registers or updates the in-memory
+ * FPU state for a task MUST let the rest of the kernel know that the
+ * FPU registers are no longer valid for this task.
+ *
+ * Invalidate a resource you control: CPU if using the CPU for something else
+ * (with preemption disabled), FPU for the current task, or a task that
+ * is prevented from running by the current task.
+ */
+static inline void __cpu_invalidate_fpregs_state(void)
+{
+ __this_cpu_write(fpu_fpregs_owner_ctx, NULL);
+}
+
+static inline void __fpu_invalidate_fpregs_state(struct fpu *fpu)
+{
+ fpu->last_cpu = -1;
+}
+
+static inline int fpregs_state_valid(struct fpu *fpu, unsigned int cpu)
+{
+ return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu;
+}
+
+static inline void fpregs_deactivate(struct fpu *fpu)
+{
+ __this_cpu_write(fpu_fpregs_owner_ctx, NULL);
+ trace_x86_fpu_regs_deactivated(fpu);
+}
+
+static inline void fpregs_activate(struct fpu *fpu)
+{
+ __this_cpu_write(fpu_fpregs_owner_ctx, fpu);
+ trace_x86_fpu_regs_activated(fpu);
+}
+
+/* Internal helper for switch_fpu_return() and signal frame setup */
+static inline void fpregs_restore_userregs(void)
+{
+ struct fpu *fpu = x86_task_fpu(current);
+ int cpu = smp_processor_id();
+
+ if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_USER_WORKER)))
+ return;
+
+ if (!fpregs_state_valid(fpu, cpu)) {
+ /*
+ * This restores _all_ xstate which has not been
+ * established yet.
+ *
+ * If PKRU is enabled, then the PKRU value is already
+ * correct because it was either set in switch_to() or in
+ * flush_thread(). So it is excluded because it might be
+ * not up to date in current->thread.fpu->xsave state.
+ *
+ * XFD state is handled in restore_fpregs_from_fpstate().
+ */
+ restore_fpregs_from_fpstate(fpu->fpstate, XFEATURE_MASK_FPSTATE);
+
+ fpregs_activate(fpu);
+ fpu->last_cpu = cpu;
+ }
+ clear_thread_flag(TIF_NEED_FPU_LOAD);
+}
+
+#endif
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
new file mode 100644
index 000000000000..da233f20ae6f
--- /dev/null
+++ b/arch/x86/kernel/fpu/core.c
@@ -0,0 +1,989 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 1994 Linus Torvalds
+ *
+ * Pentium III FXSR, SSE support
+ * General FPU state handling cleanups
+ * Gareth Hughes <gareth@valinux.com>, May 2000
+ */
+#include <asm/fpu/api.h>
+#include <asm/fpu/regset.h>
+#include <asm/fpu/sched.h>
+#include <asm/fpu/signal.h>
+#include <asm/fpu/types.h>
+#include <asm/msr.h>
+#include <asm/traps.h>
+#include <asm/irq_regs.h>
+
+#include <uapi/asm/kvm.h>
+
+#include <linux/hardirq.h>
+#include <linux/kvm_types.h>
+#include <linux/pkeys.h>
+#include <linux/vmalloc.h>
+
+#include "context.h"
+#include "internal.h"
+#include "legacy.h"
+#include "xstate.h"
+
+#define CREATE_TRACE_POINTS
+#include <asm/trace/fpu.h>
+
+#ifdef CONFIG_X86_64
+DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
+DEFINE_PER_CPU(u64, xfd_state);
+#endif
+
+/* The FPU state configuration data for kernel and user space */
+struct fpu_state_config fpu_kernel_cfg __ro_after_init;
+struct fpu_state_config fpu_user_cfg __ro_after_init;
+struct vcpu_fpu_config guest_default_cfg __ro_after_init;
+
+/*
+ * Represents the initial FPU state. It's mostly (but not completely) zeroes,
+ * depending on the FPU hardware format:
+ */
+struct fpstate init_fpstate __ro_after_init;
+
+/*
+ * Track FPU initialization and kernel-mode usage. 'true' means the FPU is
+ * initialized and is not currently being used by the kernel:
+ */
+DEFINE_PER_CPU(bool, kernel_fpu_allowed);
+
+/*
+ * Track which context is using the FPU on the CPU:
+ */
+DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
+
+#ifdef CONFIG_X86_DEBUG_FPU
+struct fpu *x86_task_fpu(struct task_struct *task)
+{
+ if (WARN_ON_ONCE(task->flags & PF_KTHREAD))
+ return NULL;
+
+ return (void *)task + sizeof(*task);
+}
+#endif
+
+/*
+ * Can we use the FPU in kernel mode with the
+ * whole "kernel_fpu_begin/end()" sequence?
+ */
+bool irq_fpu_usable(void)
+{
+ if (WARN_ON_ONCE(in_nmi()))
+ return false;
+
+ /*
+ * Return false in the following cases:
+ *
+ * - FPU is not yet initialized. This can happen only when the call is
+ * coming from CPU onlining, for example for microcode checksumming.
+ * - The kernel is already using the FPU, either because of explicit
+ * nesting (which should never be done), or because of implicit
+ * nesting when a hardirq interrupted a kernel-mode FPU section.
+ *
+ * The single boolean check below handles both cases:
+ */
+ if (!this_cpu_read(kernel_fpu_allowed))
+ return false;
+
+ /*
+ * When not in NMI or hard interrupt context, FPU can be used in:
+ *
+ * - Task context except from within fpregs_lock()'ed critical
+ * regions.
+ *
+ * - Soft interrupt processing context which cannot happen
+ * while in a fpregs_lock()'ed critical region.
+ */
+ if (!in_hardirq())
+ return true;
+
+ /*
+ * In hard interrupt context it's safe when soft interrupts
+ * are enabled, which means the interrupt did not hit in
+ * a fpregs_lock()'ed critical region.
+ */
+ return !softirq_count();
+}
+EXPORT_SYMBOL(irq_fpu_usable);
+
+/*
+ * Track AVX512 state use because it is known to slow the max clock
+ * speed of the core.
+ */
+static void update_avx_timestamp(struct fpu *fpu)
+{
+
+#define AVX512_TRACKING_MASK (XFEATURE_MASK_ZMM_Hi256 | XFEATURE_MASK_Hi16_ZMM)
+
+ if (fpu->fpstate->regs.xsave.header.xfeatures & AVX512_TRACKING_MASK)
+ fpu->avx512_timestamp = jiffies;
+}
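+
+/*
+ * Editor's note (hedged): the timestamp recorded above is consumed by the
+ * proc interface; mainline reports it as "AVX512_elapsed_ms" in
+ * /proc/<pid>/arch_status, allowing tools to spot recent AVX-512 usage.
+ */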
+
+/*
+ * Save the FPU register state in fpu->fpstate->regs. The register state is
+ * preserved.
+ *
+ * Must be called with fpregs_lock() held.
+ *
+ * The legacy FNSAVE instruction clears all FPU state unconditionally, so
+ * register state has to be reloaded. That might be a pointless exercise
+ * when the FPU is going to be used by another task right after that. But
+ * this only affects 20+ year old 32-bit systems and avoids conditionals all
+ * over the place.
+ *
+ * FXSAVE and all XSAVE variants preserve the FPU register state.
+ */
+void save_fpregs_to_fpstate(struct fpu *fpu)
+{
+ if (likely(use_xsave())) {
+ os_xsave(fpu->fpstate);
+ update_avx_timestamp(fpu);
+ return;
+ }
+
+ if (likely(use_fxsr())) {
+ fxsave(&fpu->fpstate->regs.fxsave);
+ return;
+ }
+
+ /*
+ * Legacy FPU register saving, FNSAVE always clears FPU registers,
+ * so we have to reload them from the memory state.
+ */
+ asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->fpstate->regs.fsave));
+ frstor(&fpu->fpstate->regs.fsave);
+}
+
+void restore_fpregs_from_fpstate(struct fpstate *fpstate, u64 mask)
+{
+ /*
+ * AMD K7/K8 and later CPUs up to Zen don't save/restore
+ * FDP/FIP/FOP unless an exception is pending. Clear the x87 state
+ * here by setting it to fixed values. "m" is an arbitrary memory
+ * operand that should already be resident in the L1 cache.
+ */
+ if (unlikely(static_cpu_has_bug(X86_BUG_FXSAVE_LEAK))) {
+ asm volatile(
+ "fnclex\n\t"
+ "emms\n\t"
+ "fildl %[addr]" /* set F?P to defined value */
+ : : [addr] "m" (*fpstate));
+ }
+
+ if (use_xsave()) {
+ /*
+ * Dynamically enabled features are enabled in XCR0, but
+ * usage requires also that the corresponding bits in XFD
+ * are cleared. If the bits are set then using a related
+ * instruction will raise #NM. This allows the larger FPU
+ * buffer to be allocated lazily from the #NM handler, or the
+ * task to be killed if it has no permission; disabling the
+ * feature in XCR0 instead would raise #UD.
+ *
+ * XFD state follows the same lifetime rules as XSTATE, and
+ * to restore state correctly XFD has to be updated before
+ * XRSTORS, otherwise the component would
+ * stay in or go into init state even if the bits are set
+ * in fpstate::regs::xsave::xfeatures.
+ */
+ xfd_update_state(fpstate);
+
+ /*
+ * Restoring state always needs to modify all features
+ * which are in @mask even if the current task cannot use
+ * extended features.
+ *
+ * So fpstate->xfeatures cannot be used here, because then
+ * a feature for which the task has no permission but was
+ * used by the previous task would not go into init state.
+ */
+ mask = fpu_kernel_cfg.max_features & mask;
+
+ os_xrstor(fpstate, mask);
+ } else {
+ if (use_fxsr())
+ fxrstor(&fpstate->regs.fxsave);
+ else
+ frstor(&fpstate->regs.fsave);
+ }
+}
+
+void fpu_reset_from_exception_fixup(void)
+{
+ restore_fpregs_from_fpstate(&init_fpstate, XFEATURE_MASK_FPSTATE);
+}
+
+#if IS_ENABLED(CONFIG_KVM)
+static void __fpstate_reset(struct fpstate *fpstate);
+
+static void fpu_lock_guest_permissions(void)
+{
+ struct fpu_state_perm *fpuperm;
+ u64 perm;
+
+ if (!IS_ENABLED(CONFIG_X86_64))
+ return;
+
+ spin_lock_irq(&current->sighand->siglock);
+ fpuperm = &x86_task_fpu(current->group_leader)->guest_perm;
+ perm = fpuperm->__state_perm;
+
+ /* First fpstate allocation locks down permissions. */
+ WRITE_ONCE(fpuperm->__state_perm, perm | FPU_GUEST_PERM_LOCKED);
+
+ spin_unlock_irq(&current->sighand->siglock);
+}
+
+bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
+{
+ struct fpstate *fpstate;
+ unsigned int size;
+
+ size = guest_default_cfg.size + ALIGN(offsetof(struct fpstate, regs), 64);
+
+ fpstate = vzalloc(size);
+ if (!fpstate)
+ return false;
+
+ /* Initialize indicators to reflect properties of the fpstate */
+ fpstate->is_valloc = true;
+ fpstate->is_guest = true;
+
+ __fpstate_reset(fpstate);
+ fpstate_init_user(fpstate);
+
+ gfpu->fpstate = fpstate;
+ gfpu->xfeatures = guest_default_cfg.features;
+
+ /*
+ * KVM sets the FP+SSE bits in the XSAVE header when copying FPU state
+ * to userspace, even when XSAVE is unsupported, so that restoring FPU
+ * state on a different CPU that does support XSAVE can cleanly load
+ * the incoming state using its natural XSAVE. In other words, KVM's
+ * uABI size may be larger than this host's default size. Conversely,
+ * the default size should never be larger than KVM's base uABI size;
+ * all features that can expand the uABI size must be opt-in.
+ */
+ gfpu->uabi_size = sizeof(struct kvm_xsave);
+ if (WARN_ON_ONCE(fpu_user_cfg.default_size > gfpu->uabi_size))
+ gfpu->uabi_size = fpu_user_cfg.default_size;
+
+ fpu_lock_guest_permissions();
+
+ return true;
+}
+EXPORT_SYMBOL_FOR_KVM(fpu_alloc_guest_fpstate);
+
+void fpu_free_guest_fpstate(struct fpu_guest *gfpu)
+{
+ struct fpstate *fpstate = gfpu->fpstate;
+
+ if (!fpstate)
+ return;
+
+ if (WARN_ON_ONCE(!fpstate->is_valloc || !fpstate->is_guest || fpstate->in_use))
+ return;
+
+ gfpu->fpstate = NULL;
+ vfree(fpstate);
+}
+EXPORT_SYMBOL_FOR_KVM(fpu_free_guest_fpstate);
+
+/**
+ * fpu_enable_guest_xfd_features - Check xfeatures against guest perm and enable
+ * @guest_fpu: Pointer to the guest FPU container
+ * @xfeatures: Features requested by guest CPUID
+ *
+ * Enable all dynamic xfeatures according to guest perm and requested CPUID.
+ *
+ * Return: 0 on success, error code otherwise
+ */
+int fpu_enable_guest_xfd_features(struct fpu_guest *guest_fpu, u64 xfeatures)
+{
+ lockdep_assert_preemption_enabled();
+
+ /* Nothing to do if all requested features are already enabled. */
+ xfeatures &= ~guest_fpu->xfeatures;
+ if (!xfeatures)
+ return 0;
+
+ return __xfd_enable_feature(xfeatures, guest_fpu);
+}
+EXPORT_SYMBOL_FOR_KVM(fpu_enable_guest_xfd_features);
+
+#ifdef CONFIG_X86_64
+void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
+{
+ fpregs_lock();
+ guest_fpu->fpstate->xfd = xfd;
+ if (guest_fpu->fpstate->in_use)
+ xfd_update_state(guest_fpu->fpstate);
+ fpregs_unlock();
+}
+EXPORT_SYMBOL_FOR_KVM(fpu_update_guest_xfd);
+
+/**
+ * fpu_sync_guest_vmexit_xfd_state - Synchronize XFD MSR and software state
+ *
+ * Must be invoked from KVM after a VMEXIT before enabling interrupts when
+ * XFD write emulation is disabled. This is required because the guest can
+ * freely modify XFD and the state at VMEXIT is not guaranteed to be the
+ * same as the state on VMENTER. So software state has to be updated before
+ * any operation which depends on it can take place.
+ *
+ * Note: It can be invoked unconditionally even when write emulation is
+ * enabled, at the price of a pointless MSR read in that case.
+ */
+void fpu_sync_guest_vmexit_xfd_state(void)
+{
+ struct fpstate *fpstate = x86_task_fpu(current)->fpstate;
+
+ lockdep_assert_irqs_disabled();
+ if (fpu_state_size_dynamic()) {
+ rdmsrq(MSR_IA32_XFD, fpstate->xfd);
+ __this_cpu_write(xfd_state, fpstate->xfd);
+ }
+}
+EXPORT_SYMBOL_FOR_KVM(fpu_sync_guest_vmexit_xfd_state);
+#endif /* CONFIG_X86_64 */
+
+int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
+{
+ struct fpstate *guest_fps = guest_fpu->fpstate;
+ struct fpu *fpu = x86_task_fpu(current);
+ struct fpstate *cur_fps = fpu->fpstate;
+
+ fpregs_lock();
+ if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD))
+ save_fpregs_to_fpstate(fpu);
+
+ /* Swap fpstate */
+ if (enter_guest) {
+ fpu->__task_fpstate = cur_fps;
+ fpu->fpstate = guest_fps;
+ guest_fps->in_use = true;
+ } else {
+ guest_fps->in_use = false;
+ fpu->fpstate = fpu->__task_fpstate;
+ fpu->__task_fpstate = NULL;
+ }
+
+ cur_fps = fpu->fpstate;
+
+ if (!cur_fps->is_confidential) {
+ /* Includes XFD update */
+ restore_fpregs_from_fpstate(cur_fps, XFEATURE_MASK_FPSTATE);
+ } else {
+ /*
+ * XSTATE is restored by firmware from encrypted
+ * memory. Make sure XFD state is correct while
+ * running with guest fpstate
+ */
+ xfd_update_state(cur_fps);
+ }
+
+ fpregs_mark_activate();
+ fpregs_unlock();
+ return 0;
+}
+EXPORT_SYMBOL_FOR_KVM(fpu_swap_kvm_fpstate);
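+
+/*
+ * Usage sketch (editor's addition, not part of this patch): a KVM-style
+ * caller brackets guest execution with fpstate swaps:
+ *
+ *     fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, true);   <- enter guest
+ *     ... run the vCPU ...
+ *     fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, false);  <- back to host
+ *
+ * 'vcpu->arch.guest_fpu' is KVM's container; any struct fpu_guest set up
+ * via fpu_alloc_guest_fpstate() is expected to work the same way.
+ */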
+
+void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
+ unsigned int size, u64 xfeatures, u32 pkru)
+{
+ struct fpstate *kstate = gfpu->fpstate;
+ union fpregs_state *ustate = buf;
+ struct membuf mb = { .p = buf, .left = size };
+
+ if (cpu_feature_enabled(X86_FEATURE_XSAVE)) {
+ __copy_xstate_to_uabi_buf(mb, kstate, xfeatures, pkru,
+ XSTATE_COPY_XSAVE);
+ } else {
+ memcpy(&ustate->fxsave, &kstate->regs.fxsave,
+ sizeof(ustate->fxsave));
+ /* Make it restorable on a XSAVE enabled host */
+ ustate->xsave.header.xfeatures = XFEATURE_MASK_FPSSE;
+ }
+}
+EXPORT_SYMBOL_FOR_KVM(fpu_copy_guest_fpstate_to_uabi);
+
+int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
+ u64 xcr0, u32 *vpkru)
+{
+ struct fpstate *kstate = gfpu->fpstate;
+ const union fpregs_state *ustate = buf;
+
+ if (!cpu_feature_enabled(X86_FEATURE_XSAVE)) {
+ if (ustate->xsave.header.xfeatures & ~XFEATURE_MASK_FPSSE)
+ return -EINVAL;
+ if (ustate->fxsave.mxcsr & ~mxcsr_feature_mask)
+ return -EINVAL;
+ memcpy(&kstate->regs.fxsave, &ustate->fxsave, sizeof(ustate->fxsave));
+ return 0;
+ }
+
+ if (ustate->xsave.header.xfeatures & ~xcr0)
+ return -EINVAL;
+
+ /*
+ * Nullify @vpkru to preserve its current value if PKRU's bit isn't set
+ * in the header. KVM's odd ABI is to leave PKRU untouched in this
+ * case (all other components are eventually re-initialized).
+ */
+ if (!(ustate->xsave.header.xfeatures & XFEATURE_MASK_PKRU))
+ vpkru = NULL;
+
+ return copy_uabi_from_kernel_to_xstate(kstate, ustate, vpkru);
+}
+EXPORT_SYMBOL_FOR_KVM(fpu_copy_uabi_to_guest_fpstate);
+#endif /* CONFIG_KVM */
+
+void kernel_fpu_begin_mask(unsigned int kfpu_mask)
+{
+ if (!irqs_disabled())
+ fpregs_lock();
+
+ WARN_ON_FPU(!irq_fpu_usable());
+
+ /* Toggle kernel_fpu_allowed to false: */
+ WARN_ON_FPU(!this_cpu_read(kernel_fpu_allowed));
+ this_cpu_write(kernel_fpu_allowed, false);
+
+ if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
+ !test_thread_flag(TIF_NEED_FPU_LOAD)) {
+ set_thread_flag(TIF_NEED_FPU_LOAD);
+ save_fpregs_to_fpstate(x86_task_fpu(current));
+ }
+ __cpu_invalidate_fpregs_state();
+
+ /* Put sane initial values into the control registers. */
+ if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM))
+ ldmxcsr(MXCSR_DEFAULT);
+
+ if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU))
+ asm volatile ("fninit");
+}
+EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
+
+void kernel_fpu_end(void)
+{
+ /* Toggle kernel_fpu_allowed back to true: */
+ WARN_ON_FPU(this_cpu_read(kernel_fpu_allowed));
+ this_cpu_write(kernel_fpu_allowed, true);
+
+ if (!irqs_disabled())
+ fpregs_unlock();
+}
+EXPORT_SYMBOL_GPL(kernel_fpu_end);
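+
+/*
+ * Usage sketch (editor's addition, hedged): kernel-mode FPU sections are
+ * bracketed by begin/end, and callers which may run in interrupt context
+ * check irq_fpu_usable() first:
+ *
+ *     if (irq_fpu_usable()) {
+ *             kernel_fpu_begin();
+ *             ... SSE/AVX code ...
+ *             kernel_fpu_end();
+ *     }
+ *
+ * kernel_fpu_begin() is expected to be a thin wrapper around
+ * kernel_fpu_begin_mask(); on 64-bit kernels it passes KFPU_MXCSR only.
+ */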
+
+/*
+ * Sync the FPU register state to current's memory register state when the
+ * current task owns the FPU. The hardware register state is preserved.
+ */
+void fpu_sync_fpstate(struct fpu *fpu)
+{
+ WARN_ON_FPU(fpu != x86_task_fpu(current));
+
+ fpregs_lock();
+ trace_x86_fpu_before_save(fpu);
+
+ if (!test_thread_flag(TIF_NEED_FPU_LOAD))
+ save_fpregs_to_fpstate(fpu);
+
+ trace_x86_fpu_after_save(fpu);
+ fpregs_unlock();
+}
+
+static inline unsigned int init_fpstate_copy_size(void)
+{
+ if (!use_xsave())
+ return fpu_kernel_cfg.default_size;
+
+ /* XSAVE(S) just needs the legacy and the xstate header part */
+ return sizeof(init_fpstate.regs.xsave);
+}
+
+static inline void fpstate_init_fxstate(struct fpstate *fpstate)
+{
+ fpstate->regs.fxsave.cwd = 0x37f;
+ fpstate->regs.fxsave.mxcsr = MXCSR_DEFAULT;
+}
+
+/*
+ * Legacy x87 fpstate state init:
+ */
+static inline void fpstate_init_fstate(struct fpstate *fpstate)
+{
+ fpstate->regs.fsave.cwd = 0xffff037fu;
+ fpstate->regs.fsave.swd = 0xffff0000u;
+ fpstate->regs.fsave.twd = 0xffffffffu;
+ fpstate->regs.fsave.fos = 0xffff0000u;
+}
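+
+/*
+ * Editor's note on the init values above (hedged): a control word of
+ * 0x37f masks all x87 exceptions and selects 64-bit extended precision;
+ * an all-ones tag word marks every stack register as empty. The 0xffff
+ * upper bits reflect how these 16-bit registers read back through the
+ * 32-bit fsave layout.
+ */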
+
+/*
+ * Used in two places:
+ * 1) Early boot, to set up init_fpstate for non-XSAVE systems
+ * 2) fpu_alloc_guest_fpstate() which is invoked from KVM
+ */
+void fpstate_init_user(struct fpstate *fpstate)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_FPU)) {
+ fpstate_init_soft(&fpstate->regs.soft);
+ return;
+ }
+
+ xstate_init_xcomp_bv(&fpstate->regs.xsave, fpstate->xfeatures);
+
+ if (cpu_feature_enabled(X86_FEATURE_FXSR))
+ fpstate_init_fxstate(fpstate);
+ else
+ fpstate_init_fstate(fpstate);
+}
+
+static void __fpstate_reset(struct fpstate *fpstate)
+{
+ /*
+ * Supervisor features (and thus sizes) may diverge between guest
+ * FPUs and host FPUs, as some supervisor features are supported
+ * for guests despite not being utilized by the host. User
+ * features and sizes are always identical, which allows for
+ * common guest and userspace ABI.
+ *
+ * For the host, set XFD to the kernel's desired initialization
+ * value. For guests, set XFD to its architectural RESET value.
+ */
+ if (fpstate->is_guest) {
+ fpstate->size = guest_default_cfg.size;
+ fpstate->xfeatures = guest_default_cfg.features;
+ fpstate->xfd = 0;
+ } else {
+ fpstate->size = fpu_kernel_cfg.default_size;
+ fpstate->xfeatures = fpu_kernel_cfg.default_features;
+ fpstate->xfd = init_fpstate.xfd;
+ }
+
+ fpstate->user_size = fpu_user_cfg.default_size;
+ fpstate->user_xfeatures = fpu_user_cfg.default_features;
+}
+
+void fpstate_reset(struct fpu *fpu)
+{
+ /* Set the fpstate pointer to the default fpstate */
+ fpu->fpstate = &fpu->__fpstate;
+ __fpstate_reset(fpu->fpstate);
+
+ /* Initialize the permission related info in fpu */
+ fpu->perm.__state_perm = fpu_kernel_cfg.default_features;
+ fpu->perm.__state_size = fpu_kernel_cfg.default_size;
+ fpu->perm.__user_state_size = fpu_user_cfg.default_size;
+
+ fpu->guest_perm.__state_perm = guest_default_cfg.features;
+ fpu->guest_perm.__state_size = guest_default_cfg.size;
+ /*
+ * User features and sizes are always identical between host and
+ * guest FPUs, which allows for common guest and userspace ABI.
+ */
+ fpu->guest_perm.__user_state_size = fpu_user_cfg.default_size;
+}
+
+static inline void fpu_inherit_perms(struct fpu *dst_fpu)
+{
+ if (fpu_state_size_dynamic()) {
+ struct fpu *src_fpu = x86_task_fpu(current->group_leader);
+
+ spin_lock_irq(&current->sighand->siglock);
+ /* Fork also inherits the permissions of the parent */
+ dst_fpu->perm = src_fpu->perm;
+ dst_fpu->guest_perm = src_fpu->guest_perm;
+ spin_unlock_irq(&current->sighand->siglock);
+ }
+}
+
+/* A passed ssp of zero will not cause any update */
+static int update_fpu_shstk(struct task_struct *dst, unsigned long ssp)
+{
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ struct cet_user_state *xstate;
+
+ /* Nothing to do if no ssp update is needed. */
+ if (!ssp)
+ return 0;
+
+ xstate = get_xsave_addr(&x86_task_fpu(dst)->fpstate->regs.xsave,
+ XFEATURE_CET_USER);
+
+ /*
+ * If there is a non-zero ssp, then 'dst' must be configured with a shadow
+ * stack and the fpu state should be up to date since it was just copied
+ * from the parent in fpu_clone(). So there must be a valid non-init CET
+ * state location in the buffer.
+ */
+ if (WARN_ON_ONCE(!xstate))
+ return 1;
+
+ xstate->user_ssp = (u64)ssp;
+#endif
+ return 0;
+}
+
+/* Clone current's FPU state on fork */
+int fpu_clone(struct task_struct *dst, u64 clone_flags, bool minimal,
+ unsigned long ssp)
+{
+ /*
+ * We allocate the new FPU structure right after the end of the task struct.
+ * The task allocation size already takes this into account.
+ *
+ * This is safe because task_struct size is a multiple of cacheline size,
+ * thus x86_task_fpu() will always be cacheline aligned as well.
+ */
+ struct fpu *dst_fpu = (void *)dst + sizeof(*dst);
+
+ BUILD_BUG_ON(sizeof(*dst) % SMP_CACHE_BYTES != 0);
+
+ /* The new task's FPU state cannot be valid in the hardware. */
+ dst_fpu->last_cpu = -1;
+
+ fpstate_reset(dst_fpu);
+
+ if (!cpu_feature_enabled(X86_FEATURE_FPU))
+ return 0;
+
+ /*
+ * Enforce reload for user space tasks and prevent kernel threads
+ * from trying to save the FPU registers on context switch.
+ */
+ set_tsk_thread_flag(dst, TIF_NEED_FPU_LOAD);
+
+ /*
+ * No FPU state inheritance for kernel threads and IO
+ * worker threads.
+ */
+ if (minimal) {
+ /* Clear out the minimal state */
+ memcpy(&dst_fpu->fpstate->regs, &init_fpstate.regs,
+ init_fpstate_copy_size());
+ return 0;
+ }
+
+ /*
+ * If a new feature is added, ensure all dynamic features are
+ * caller-saved from here!
+ */
+ BUILD_BUG_ON(XFEATURE_MASK_USER_DYNAMIC != XFEATURE_MASK_XTILE_DATA);
+
+ /*
+ * Save the default portion of the current FPU state into the
+ * clone. Assume all dynamic features to be defined as caller-
+ * saved, which enables skipping both the expansion of fpstate
+ * and the copying of any dynamic state.
+ *
+ * Do not use memcpy() when TIF_NEED_FPU_LOAD is set because
+ * copying is not valid when current uses non-default states.
+ */
+ fpregs_lock();
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ fpregs_restore_userregs();
+ save_fpregs_to_fpstate(dst_fpu);
+ fpregs_unlock();
+ if (!(clone_flags & CLONE_THREAD))
+ fpu_inherit_perms(dst_fpu);
+
+ /*
+ * Children never inherit PASID state.
+ * Force it to have its init value:
+ */
+ if (use_xsave())
+ dst_fpu->fpstate->regs.xsave.header.xfeatures &= ~XFEATURE_MASK_PASID;
+
+ /*
+ * Update shadow stack pointer, in case it changed during clone.
+ */
+ if (update_fpu_shstk(dst, ssp))
+ return 1;
+
+ trace_x86_fpu_copy_dst(dst_fpu);
+
+ return 0;
+}
+
+/*
+ * While struct fpu is no longer part of struct thread_struct, it is still
+ * allocated after struct task_struct in the "task_struct" kmem cache. But
+ * since FPU is expected to be part of struct thread_struct, we have to
+ * adjust for it here.
+ */
+void fpu_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
+{
+ /* The allocation follows struct task_struct. */
+ *offset = sizeof(struct task_struct) - offsetof(struct task_struct, thread);
+ *offset += offsetof(struct fpu, __fpstate.regs);
+ *size = fpu_kernel_cfg.default_size;
+}
+
+/*
+ * Drops current FPU state: deactivates the fpregs and
+ * the fpstate. NOTE: it still leaves previous contents
+ * in the fpregs in the eager-FPU case.
+ *
+ * This function can be used in cases where we know that
+ * a state-restore is coming: either an explicit one,
+ * or a reschedule.
+ */
+void fpu__drop(struct task_struct *tsk)
+{
+ struct fpu *fpu;
+
+ if (test_tsk_thread_flag(tsk, TIF_NEED_FPU_LOAD))
+ return;
+
+ fpu = x86_task_fpu(tsk);
+
+ preempt_disable();
+
+ if (fpu == x86_task_fpu(current)) {
+ /* Ignore delayed exceptions from user space */
+ asm volatile("1: fwait\n"
+ "2:\n"
+ _ASM_EXTABLE(1b, 2b));
+ fpregs_deactivate(fpu);
+ }
+
+ trace_x86_fpu_dropped(fpu);
+
+ preempt_enable();
+}
+
+/*
+ * Clear FPU registers by setting them up from the init fpstate.
+ * Caller must do fpregs_[un]lock() around it.
+ */
+static inline void restore_fpregs_from_init_fpstate(u64 features_mask)
+{
+ if (use_xsave())
+ os_xrstor(&init_fpstate, features_mask);
+ else if (use_fxsr())
+ fxrstor(&init_fpstate.regs.fxsave);
+ else
+ frstor(&init_fpstate.regs.fsave);
+
+ pkru_write_default();
+}
+
+/*
+ * Reset current->fpu memory state to the init values.
+ */
+static void fpu_reset_fpstate_regs(void)
+{
+ struct fpu *fpu = x86_task_fpu(current);
+
+ fpregs_lock();
+ __fpu_invalidate_fpregs_state(fpu);
+ /*
+ * This does not change the actual hardware registers. It just
+ * resets the memory image and sets TIF_NEED_FPU_LOAD so a
+ * subsequent return to usermode will reload the registers from the
+ * task's memory image.
+ *
+ * Do not use fpstate_init() here. Just copy init_fpstate which has
+ * the correct content already except for PKRU.
+ *
+ * PKRU handling does not rely on the xstate when restoring for
+ * user space as PKRU is eagerly written in switch_to() and
+ * flush_thread().
+ */
+ memcpy(&fpu->fpstate->regs, &init_fpstate.regs, init_fpstate_copy_size());
+ set_thread_flag(TIF_NEED_FPU_LOAD);
+ fpregs_unlock();
+}
+
+/*
+ * Reset current's user FPU states to the init states. current's
+ * supervisor states, if any, are not modified by this function. The
+ * caller guarantees that the XSTATE header in memory is intact.
+ */
+void fpu__clear_user_states(struct fpu *fpu)
+{
+ WARN_ON_FPU(fpu != x86_task_fpu(current));
+
+ fpregs_lock();
+ if (!cpu_feature_enabled(X86_FEATURE_FPU)) {
+ fpu_reset_fpstate_regs();
+ fpregs_unlock();
+ return;
+ }
+
+ /*
+ * Ensure that current's supervisor states are loaded into their
+ * corresponding registers.
+ */
+ if (xfeatures_mask_supervisor() &&
+ !fpregs_state_valid(fpu, smp_processor_id()))
+ os_xrstor_supervisor(fpu->fpstate);
+
+ /* Ensure XFD state is in sync before reloading XSTATE */
+ xfd_update_state(fpu->fpstate);
+
+ /* Reset user states in registers. */
+ restore_fpregs_from_init_fpstate(XFEATURE_MASK_USER_RESTORE);
+
+ /*
+ * Now all FPU registers have their desired values. Inform the FPU
+ * state machine that current's FPU registers are in the hardware
+ * registers. The memory image does not need to be updated because
+ * any operation relying on it has to save the registers first when
+ * current's FPU is marked active.
+ */
+ fpregs_mark_activate();
+ fpregs_unlock();
+}
+
+void fpu_flush_thread(void)
+{
+ fpstate_reset(x86_task_fpu(current));
+ fpu_reset_fpstate_regs();
+}
+
+/*
+ * Load FPU context before returning to userspace.
+ */
+void switch_fpu_return(void)
+{
+ if (!static_cpu_has(X86_FEATURE_FPU))
+ return;
+
+ fpregs_restore_userregs();
+}
+EXPORT_SYMBOL_FOR_KVM(switch_fpu_return);
+
+void fpregs_lock_and_load(void)
+{
+ /*
+ * fpregs_lock() only disables preemption (mostly). So modifying state
+ * in an interrupt could screw up an in-progress fpregs operation.
+ * Warn about it.
+ */
+ WARN_ON_ONCE(!irq_fpu_usable());
+ WARN_ON_ONCE(current->flags & PF_KTHREAD);
+
+ fpregs_lock();
+
+ fpregs_assert_state_consistent();
+
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ fpregs_restore_userregs();
+}
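+
+/*
+ * Usage sketch (editor's addition): a typical caller which modifies the
+ * current task's FPU registers in place does:
+ *
+ *     fpregs_lock_and_load();
+ *     ... modify FPU registers ...
+ *     fpregs_unlock();
+ *
+ * After fpregs_lock_and_load() the hardware registers are guaranteed to
+ * hold current's state, so in-register modifications are not lost.
+ */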
+
+#ifdef CONFIG_X86_DEBUG_FPU
+/*
+ * If current FPU state according to its tracking (loaded FPU context on this
+ * CPU) is not valid then we must have TIF_NEED_FPU_LOAD set so the context is
+ * loaded on return to userland.
+ */
+void fpregs_assert_state_consistent(void)
+{
+ struct fpu *fpu = x86_task_fpu(current);
+
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ return;
+
+ WARN_ON_FPU(!fpregs_state_valid(fpu, smp_processor_id()));
+}
+EXPORT_SYMBOL_FOR_KVM(fpregs_assert_state_consistent);
+#endif
+
+void fpregs_mark_activate(void)
+{
+ struct fpu *fpu = x86_task_fpu(current);
+
+ fpregs_activate(fpu);
+ fpu->last_cpu = smp_processor_id();
+ clear_thread_flag(TIF_NEED_FPU_LOAD);
+}
+
+/*
+ * x87 math exception handling:
+ */
+
+int fpu__exception_code(struct fpu *fpu, int trap_nr)
+{
+ int err;
+
+ if (trap_nr == X86_TRAP_MF) {
+ unsigned short cwd, swd;
+ /*
+ * (~cwd & swd) masks out exceptions that are masked in the control
+ * word. 0x3f covers the exception bits in these regs, 0x200 is the
+ * C1 bit needed in case of a stack fault, and 0x040 is the stack
+ * fault bit. We should only be taking one exception at a time,
+ * so if this combination doesn't produce any single exception,
+ * then we have a bad program that isn't synchronizing its FPU usage
+ * and it will suffer the consequences since we won't be able to
+ * fully reproduce the context of the exception.
+ */
+ if (boot_cpu_has(X86_FEATURE_FXSR)) {
+ cwd = fpu->fpstate->regs.fxsave.cwd;
+ swd = fpu->fpstate->regs.fxsave.swd;
+ } else {
+ cwd = (unsigned short)fpu->fpstate->regs.fsave.cwd;
+ swd = (unsigned short)fpu->fpstate->regs.fsave.swd;
+ }
+
+ err = swd & ~cwd;
+ } else {
+ /*
+ * The SIMD FPU exceptions are handled a little differently, as there
+ * is only a single status/control register. Thus, to determine which
+ * unmasked exception was caught we must mask the exception mask bits
+ * at 0x1f80, and then use these to mask the exception bits at 0x3f.
+ */
+ unsigned short mxcsr = MXCSR_DEFAULT;
+
+ if (boot_cpu_has(X86_FEATURE_XMM))
+ mxcsr = fpu->fpstate->regs.fxsave.mxcsr;
+
+ err = ~(mxcsr >> 7) & mxcsr;
+ }
+
+ if (err & 0x001) { /* Invalid op */
+ /*
+ * swd & 0x240 == 0x040: Stack Underflow
+ * swd & 0x240 == 0x240: Stack Overflow
+ * User must clear the SF bit (0x40) if set
+ */
+ return FPE_FLTINV;
+ } else if (err & 0x004) { /* Divide by Zero */
+ return FPE_FLTDIV;
+ } else if (err & 0x008) { /* Overflow */
+ return FPE_FLTOVF;
+ } else if (err & 0x012) { /* Denormal, Underflow */
+ return FPE_FLTUND;
+ } else if (err & 0x020) { /* Precision */
+ return FPE_FLTRES;
+ }
+
+ /*
+ * If we're using IRQ 13, or supposedly even some trap
+ * X86_TRAP_MF implementations, it's possible
+ * we get a spurious trap, which is not an error.
+ */
+ return 0;
+}
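+
+/*
+ * Worked example (editor's addition): assume MXCSR = 0x1e01, i.e. IM
+ * (bit 7) clear so invalid-operation is unmasked, and IE (bit 0) set.
+ * Then mxcsr >> 7 = 0x3c and err = ~0x3c & 0x1e01 keeps bit 0 set, so
+ * fpu__exception_code() returns FPE_FLTINV.
+ */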
+
+/*
+ * Initialize register state that may prevent the CPU from entering low-power idle.
+ * This function will be invoked from the cpuidle driver only when needed.
+ */
+noinstr void fpu_idle_fpregs(void)
+{
+ /* Note: AMX_TILE being enabled implies XGETBV1 support */
+ if (cpu_feature_enabled(X86_FEATURE_AMX_TILE) &&
+ (xfeatures_in_use() & XFEATURE_MASK_XTILE)) {
+ tile_release();
+ __this_cpu_write(fpu_fpregs_owner_ctx, NULL);
+ }
+}
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
new file mode 100644
index 000000000000..ff988b9ea39f
--- /dev/null
+++ b/arch/x86/kernel/fpu/init.c
@@ -0,0 +1,229 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * x86 FPU boot time init code:
+ */
+#include <asm/fpu/api.h>
+#include <asm/tlbflush.h>
+#include <asm/setup.h>
+
+#include <linux/sched.h>
+#include <linux/sched/task.h>
+#include <linux/init.h>
+
+#include "internal.h"
+#include "legacy.h"
+#include "xstate.h"
+
+/*
+ * Initialize the registers found in all CPUs, CR0 and CR4:
+ */
+static void fpu__init_cpu_generic(void)
+{
+ unsigned long cr0;
+ unsigned long cr4_mask = 0;
+
+ if (boot_cpu_has(X86_FEATURE_FXSR))
+ cr4_mask |= X86_CR4_OSFXSR;
+ if (boot_cpu_has(X86_FEATURE_XMM))
+ cr4_mask |= X86_CR4_OSXMMEXCPT;
+ if (cr4_mask)
+ cr4_set_bits(cr4_mask);
+
+ cr0 = read_cr0();
+ cr0 &= ~(X86_CR0_TS|X86_CR0_EM); /* clear TS and EM */
+ if (!boot_cpu_has(X86_FEATURE_FPU))
+ cr0 |= X86_CR0_EM;
+ write_cr0(cr0);
+
+ /* Flush out any pending x87 state: */
+#ifdef CONFIG_MATH_EMULATION
+ if (!boot_cpu_has(X86_FEATURE_FPU))
+ fpstate_init_soft(&x86_task_fpu(current)->fpstate->regs.soft);
+ else
+#endif
+ asm volatile ("fninit");
+}
+
+/*
+ * Enable all supported FPU features. Called when a CPU is brought online:
+ */
+void fpu__init_cpu(void)
+{
+ fpu__init_cpu_generic();
+ fpu__init_cpu_xstate();
+
+ /* Start allowing kernel-mode FPU: */
+ this_cpu_write(kernel_fpu_allowed, true);
+}
+
+static bool __init fpu__probe_without_cpuid(void)
+{
+ unsigned long cr0;
+ u16 fsw, fcw;
+
+ fsw = fcw = 0xffff;
+
+ cr0 = read_cr0();
+ cr0 &= ~(X86_CR0_TS | X86_CR0_EM);
+ write_cr0(cr0);
+
+ asm volatile("fninit ; fnstsw %0 ; fnstcw %1" : "+m" (fsw), "+m" (fcw));
+
+ pr_info("x86/fpu: Probing for FPU: FSW=0x%04hx FCW=0x%04hx\n", fsw, fcw);
+
+ return fsw == 0 && (fcw & 0x103f) == 0x003f;
+}
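+
+/*
+ * Editor's note (hedged): after FNINIT a real x87 reports FSW == 0 and
+ * FCW == 0x037f, so the check above accepts exactly the architectural
+ * reset values: 0x103f selects the relevant FCW bits, of which the six
+ * exception-mask bits (0x003f) must read back as set.
+ */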
+
+static void __init fpu__init_system_early_generic(void)
+{
+ set_thread_flag(TIF_NEED_FPU_LOAD);
+
+ if (!boot_cpu_has(X86_FEATURE_CPUID) &&
+ !test_bit(X86_FEATURE_FPU, (unsigned long *)cpu_caps_cleared)) {
+ if (fpu__probe_without_cpuid())
+ setup_force_cpu_cap(X86_FEATURE_FPU);
+ else
+ setup_clear_cpu_cap(X86_FEATURE_FPU);
+ }
+
+#ifndef CONFIG_MATH_EMULATION
+ if (!test_cpu_cap(&boot_cpu_data, X86_FEATURE_FPU)) {
+ pr_emerg("x86/fpu: Giving up, no FPU found and no math emulation present\n");
+ for (;;)
+ asm volatile("hlt");
+ }
+#endif
+}
+
+/*
+ * Boot time FPU feature detection code:
+ */
+unsigned int mxcsr_feature_mask __ro_after_init = 0xffffffffu;
+
+static void __init fpu__init_system_mxcsr(void)
+{
+ unsigned int mask = 0;
+
+ if (boot_cpu_has(X86_FEATURE_FXSR)) {
+ /* Static because GCC does not get 16-byte stack alignment right: */
+ static struct fxregs_state fxregs __initdata;
+
+ asm volatile("fxsave %0" : "+m" (fxregs));
+
+ mask = fxregs.mxcsr_mask;
+
+ /*
+ * If zero then use the default features mask,
+ * which has all features set, except the
+ * denormals-are-zero feature bit:
+ */
+ if (mask == 0)
+ mask = 0x0000ffbf;
+ }
+ mxcsr_feature_mask &= mask;
+}
+
+/*
+ * Once per bootup FPU initialization sequences that will run on most x86 CPUs:
+ */
+static void __init fpu__init_system_generic(void)
+{
+ /*
+ * Set up the legacy init FPU context. Will be updated when the
+ * CPU supports XSAVE[S].
+ */
+ fpstate_init_user(&init_fpstate);
+
+ fpu__init_system_mxcsr();
+}
+
+/*
+ * Enforce that 'MEMBER' is the last field of 'TYPE'.
+ *
+ * Align the computed size with the alignment of the TYPE,
+ * because that's how C aligns structs.
+ */
+#define CHECK_MEMBER_AT_END_OF(TYPE, MEMBER) \
+ BUILD_BUG_ON(sizeof(TYPE) != \
+ ALIGN(offsetofend(TYPE, MEMBER), _Alignof(TYPE)))
+
+/*
+ * We append the 'struct fpu' to the task_struct:
+ */
+static void __init fpu__init_task_struct_size(void)
+{
+ int task_size = sizeof(struct task_struct);
+
+ task_size += sizeof(struct fpu);
+
+ /*
+ * Subtract off the static size of the register state.
+ * It potentially has a bunch of padding.
+ */
+ task_size -= sizeof(union fpregs_state);
+
+ /*
+ * Add back the dynamically-calculated register state
+ * size.
+ */
+ task_size += fpu_kernel_cfg.default_size;
+
+ /*
+ * We dynamically size 'struct fpu', so we require that
+ * '__fpstate' be at the end of it:
+ */
+ CHECK_MEMBER_AT_END_OF(struct fpu, __fpstate);
+
+ arch_task_struct_size = task_size;
+}
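+
+/*
+ * Resulting layout sketch (editor's addition, sizes illustrative):
+ *
+ *     [ struct task_struct          ]
+ *     [ struct fpu (header fields)  ]
+ *     [ fpstate register buffer ... ]  <- fpu_kernel_cfg.default_size
+ *
+ * x86_task_fpu() relies on this placement: the FPU block starts at
+ * (void *)task + sizeof(*task).
+ */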
+
+/*
+ * Set up the user and kernel xstate sizes based on the legacy FPU context size.
+ *
+ * We set this up first, and later it will be overwritten by
+ * fpu__init_system_xstate() if the CPU knows about xstates.
+ */
+static void __init fpu__init_system_xstate_size_legacy(void)
+{
+ unsigned int size;
+
+ /*
+ * Note that the size configuration might be overwritten later
+ * during fpu__init_system_xstate().
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_FPU)) {
+ size = sizeof(struct swregs_state);
+ } else if (cpu_feature_enabled(X86_FEATURE_FXSR)) {
+ size = sizeof(struct fxregs_state);
+ fpu_user_cfg.legacy_features = XFEATURE_MASK_FPSSE;
+ } else {
+ size = sizeof(struct fregs_state);
+ fpu_user_cfg.legacy_features = XFEATURE_MASK_FP;
+ }
+
+ fpu_kernel_cfg.max_size = size;
+ fpu_kernel_cfg.default_size = size;
+ fpu_user_cfg.max_size = size;
+ fpu_user_cfg.default_size = size;
+ guest_default_cfg.size = size;
+}
+
+/*
+ * Called on the boot CPU once per system bootup, to set up the initial
+ * FPU state that is later cloned into all processes:
+ */
+void __init fpu__init_system(void)
+{
+ fpu__init_system_early_generic();
+
+ /*
+ * The FPU has to be operational for some of the
+ * later FPU init activities:
+ */
+ fpu__init_cpu();
+
+ fpu__init_system_generic();
+ fpu__init_system_xstate_size_legacy();
+ fpu__init_system_xstate(fpu_kernel_cfg.max_size);
+ fpu__init_task_struct_size();
+}
diff --git a/arch/x86/kernel/fpu/internal.h b/arch/x86/kernel/fpu/internal.h
new file mode 100644
index 000000000000..975de070c9c9
--- /dev/null
+++ b/arch/x86/kernel/fpu/internal.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __X86_KERNEL_FPU_INTERNAL_H
+#define __X86_KERNEL_FPU_INTERNAL_H
+
+extern struct fpstate init_fpstate;
+
+/* CPU feature check wrappers */
+static __always_inline __pure bool use_xsave(void)
+{
+ return cpu_feature_enabled(X86_FEATURE_XSAVE);
+}
+
+static __always_inline __pure bool use_fxsr(void)
+{
+ return cpu_feature_enabled(X86_FEATURE_FXSR);
+}
+
+#ifdef CONFIG_X86_DEBUG_FPU
+# define WARN_ON_FPU(x) WARN_ON_ONCE(x)
+#else
+# define WARN_ON_FPU(x) ({ BUILD_BUG_ON_INVALID(x); 0; })
+#endif
+
+/* Used in init.c */
+extern void fpstate_init_user(struct fpstate *fpstate);
+extern void fpstate_reset(struct fpu *fpu);
+
+#endif
diff --git a/arch/x86/kernel/fpu/legacy.h b/arch/x86/kernel/fpu/legacy.h
new file mode 100644
index 000000000000..098f367bb8a7
--- /dev/null
+++ b/arch/x86/kernel/fpu/legacy.h
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __X86_KERNEL_FPU_LEGACY_H
+#define __X86_KERNEL_FPU_LEGACY_H
+
+#include <asm/fpu/types.h>
+
+extern unsigned int mxcsr_feature_mask;
+
+static inline void ldmxcsr(u32 mxcsr)
+{
+ asm volatile("ldmxcsr %0" :: "m" (mxcsr));
+}
+
+/*
+ * Returns 0 on success or the trap number when the operation raises an
+ * exception.
+ */
+#define user_insn(insn, output, input...) \
+({ \
+ int err; \
+ \
+ might_fault(); \
+ \
+ asm volatile(ASM_STAC "\n" \
+ "1: " #insn "\n" \
+ "2: " ASM_CLAC "\n" \
+ _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_FAULT_MCE_SAFE) \
+ : [err] "=a" (err), output \
+ : "0"(0), input); \
+ err; \
+})
+
+#define kernel_insn_err(insn, output, input...) \
+({ \
+ int err; \
+ asm volatile("1:" #insn "\n\t" \
+ "2:\n" \
+ _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_EFAULT_REG, %[err]) \
+ : [err] "=r" (err), output \
+ : "0"(0), input); \
+ err; \
+})
+
+#define kernel_insn(insn, output, input...) \
+ asm volatile("1:" #insn "\n\t" \
+ "2:\n" \
+ _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_FPU_RESTORE) \
+ : output : input)
+
+static inline int fnsave_to_user_sigframe(struct fregs_state __user *fx)
+{
+ return user_insn(fnsave %[fx]; fwait, [fx] "=m" (*fx), "m" (*fx));
+}
+
+static inline int fxsave_to_user_sigframe(struct fxregs_state __user *fx)
+{
+ if (IS_ENABLED(CONFIG_X86_32))
+ return user_insn(fxsave %[fx], [fx] "=m" (*fx), "m" (*fx));
+ else
+ return user_insn(fxsaveq %[fx], [fx] "=m" (*fx), "m" (*fx));
+}
+
+static inline void fxrstor(struct fxregs_state *fx)
+{
+ if (IS_ENABLED(CONFIG_X86_32))
+ kernel_insn(fxrstor %[fx], "=m" (*fx), [fx] "m" (*fx));
+ else
+ kernel_insn(fxrstorq %[fx], "=m" (*fx), [fx] "m" (*fx));
+}
+
+static inline int fxrstor_safe(struct fxregs_state *fx)
+{
+ if (IS_ENABLED(CONFIG_X86_32))
+ return kernel_insn_err(fxrstor %[fx], "=m" (*fx), [fx] "m" (*fx));
+ else
+ return kernel_insn_err(fxrstorq %[fx], "=m" (*fx), [fx] "m" (*fx));
+}
+
+static inline int fxrstor_from_user_sigframe(struct fxregs_state __user *fx)
+{
+ if (IS_ENABLED(CONFIG_X86_32))
+ return user_insn(fxrstor %[fx], "=m" (*fx), [fx] "m" (*fx));
+ else
+ return user_insn(fxrstorq %[fx], "=m" (*fx), [fx] "m" (*fx));
+}
+
+static inline void frstor(struct fregs_state *fx)
+{
+ kernel_insn(frstor %[fx], "=m" (*fx), [fx] "m" (*fx));
+}
+
+static inline int frstor_safe(struct fregs_state *fx)
+{
+ return kernel_insn_err(frstor %[fx], "=m" (*fx), [fx] "m" (*fx));
+}
+
+static inline int frstor_from_user_sigframe(struct fregs_state __user *fx)
+{
+ return user_insn(frstor %[fx], "=m" (*fx), [fx] "m" (*fx));
+}
+
+static inline void fxsave(struct fxregs_state *fx)
+{
+ if (IS_ENABLED(CONFIG_X86_32))
+ asm volatile( "fxsave %[fx]" : [fx] "=m" (*fx));
+ else
+ asm volatile("fxsaveq %[fx]" : [fx] "=m" (*fx));
+}
+
+#endif
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
new file mode 100644
index 000000000000..0986c2200adc
--- /dev/null
+++ b/arch/x86/kernel/fpu/regset.c
@@ -0,0 +1,468 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Regset abstraction for the FPU registers: ptrace, core dumps, etc.
+ */
+#include <linux/sched/task_stack.h>
+#include <linux/vmalloc.h>
+
+#include <asm/fpu/api.h>
+#include <asm/fpu/signal.h>
+#include <asm/fpu/regset.h>
+#include <asm/prctl.h>
+
+#include "context.h"
+#include "internal.h"
+#include "legacy.h"
+#include "xstate.h"
+
+/*
+ * The xstateregs_active() routine is the same as the regset_fpregs_active()
+ * routine, as "regset->n" for the xstate regset is set up based on the
+ * feature capabilities supported by XSAVE.
+ */
+int regset_fpregs_active(struct task_struct *target, const struct user_regset *regset)
+{
+ return regset->n;
+}
+
+int regset_xregset_fpregs_active(struct task_struct *target, const struct user_regset *regset)
+{
+ if (boot_cpu_has(X86_FEATURE_FXSR))
+ return regset->n;
+ else
+ return 0;
+}
+
+/*
+ * The regset get() functions are invoked from:
+ *
+ * - coredump to dump the current task's fpstate. If the current task
+ * owns the FPU then the memory state has to be synchronized and the
+ * FPU register state preserved. Otherwise fpstate is already in sync.
+ *
+ * - ptrace to dump fpstate of a stopped task, in which case the registers
+ * have already been saved to fpstate on context switch.
+ */
+static void sync_fpstate(struct fpu *fpu)
+{
+ if (fpu == x86_task_fpu(current))
+ fpu_sync_fpstate(fpu);
+}
+
+/*
+ * Invalidate cached FPU registers before modifying the stopped target
+ * task's fpstate.
+ *
+ * This forces the target task on resume to restore the FPU registers from
+ * modified fpstate. Otherwise the task might skip the restore and operate
+ * with the cached FPU registers which discards the modifications.
+ */
+static void fpu_force_restore(struct fpu *fpu)
+{
+ /*
+ * Only stopped child tasks can be used to modify the FPU
+ * state in the fpstate buffer:
+ */
+ WARN_ON_FPU(fpu == x86_task_fpu(current));
+
+ __fpu_invalidate_fpregs_state(fpu);
+}
+
+int xfpregs_get(struct task_struct *target, const struct user_regset *regset,
+ struct membuf to)
+{
+ struct fpu *fpu = x86_task_fpu(target);
+
+ if (!cpu_feature_enabled(X86_FEATURE_FXSR))
+ return -ENODEV;
+
+ sync_fpstate(fpu);
+
+ if (!use_xsave()) {
+ return membuf_write(&to, &fpu->fpstate->regs.fxsave,
+ sizeof(fpu->fpstate->regs.fxsave));
+ }
+
+ copy_xstate_to_uabi_buf(to, target, XSTATE_COPY_FX);
+ return 0;
+}
+
+int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ const void *kbuf, const void __user *ubuf)
+{
+ struct fpu *fpu = x86_task_fpu(target);
+ struct fxregs_state newstate;
+ int ret;
+
+ if (!cpu_feature_enabled(X86_FEATURE_FXSR))
+ return -ENODEV;
+
+ /* No funny business with partial or oversized writes is permitted. */
+ if (pos != 0 || count != sizeof(newstate))
+ return -EINVAL;
+
+ ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &newstate, 0, -1);
+ if (ret)
+ return ret;
+
+ /* Do not allow an invalid MXCSR value. */
+ if (newstate.mxcsr & ~mxcsr_feature_mask)
+ return -EINVAL;
+
+ fpu_force_restore(fpu);
+
+ /* Copy the state */
+ memcpy(&fpu->fpstate->regs.fxsave, &newstate, sizeof(newstate));
+
+ /* Clear xmm8..15 for 32-bit callers */
+ BUILD_BUG_ON(sizeof(fpu->__fpstate.regs.fxsave.xmm_space) != 16 * 16);
+ if (in_ia32_syscall())
+ memset(&fpu->fpstate->regs.fxsave.xmm_space[8*4], 0, 8 * 16);
+
+ /* Mark FP and SSE as in use when XSAVE is enabled */
+ if (use_xsave())
+ fpu->fpstate->regs.xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
+
+ return 0;
+}
+
+int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
+ struct membuf to)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_XSAVE))
+ return -ENODEV;
+
+ sync_fpstate(x86_task_fpu(target));
+
+ copy_xstate_to_uabi_buf(to, target, XSTATE_COPY_XSAVE);
+ return 0;
+}
+
+int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ const void *kbuf, const void __user *ubuf)
+{
+ struct fpu *fpu = x86_task_fpu(target);
+ struct xregs_state *tmpbuf = NULL;
+ int ret;
+
+ if (!cpu_feature_enabled(X86_FEATURE_XSAVE))
+ return -ENODEV;
+
+ /*
+ * A whole standard-format XSAVE buffer is needed:
+ */
+ if (pos != 0 || count != fpu_user_cfg.max_size)
+ return -EFAULT;
+
+ if (!kbuf) {
+ tmpbuf = vmalloc(count);
+ if (!tmpbuf)
+ return -ENOMEM;
+
+ if (copy_from_user(tmpbuf, ubuf, count)) {
+ ret = -EFAULT;
+ goto out;
+ }
+ }
+
+ fpu_force_restore(fpu);
+ ret = copy_uabi_from_kernel_to_xstate(fpu->fpstate, kbuf ?: tmpbuf, &target->thread.pkru);
+
+out:
+ vfree(tmpbuf);
+ return ret;
+}
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+int ssp_active(struct task_struct *target, const struct user_regset *regset)
+{
+ if (target->thread.features & ARCH_SHSTK_SHSTK)
+ return regset->n;
+
+ return 0;
+}
+
+int ssp_get(struct task_struct *target, const struct user_regset *regset,
+ struct membuf to)
+{
+ struct fpu *fpu = x86_task_fpu(target);
+ struct cet_user_state *cetregs;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+ !ssp_active(target, regset))
+ return -ENODEV;
+
+ sync_fpstate(fpu);
+ cetregs = get_xsave_addr(&fpu->fpstate->regs.xsave, XFEATURE_CET_USER);
+ if (WARN_ON(!cetregs)) {
+ /*
+ * This shouldn't ever be NULL because shadow stack was
+ * verified to be enabled above. This means
+ * MSR_IA32_U_CET.CET_SHSTK_EN should be 1 and so
+ * XFEATURE_CET_USER should not be in the init state.
+ */
+ return -ENODEV;
+ }
+
+ return membuf_write(&to, (unsigned long *)&cetregs->user_ssp,
+ sizeof(cetregs->user_ssp));
+}
+
+int ssp_set(struct task_struct *target, const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ const void *kbuf, const void __user *ubuf)
+{
+ struct fpu *fpu = x86_task_fpu(target);
+ struct xregs_state *xsave = &fpu->fpstate->regs.xsave;
+ struct cet_user_state *cetregs;
+ unsigned long user_ssp;
+ int r;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+ !ssp_active(target, regset))
+ return -ENODEV;
+
+ if (pos != 0 || count != sizeof(user_ssp))
+ return -EINVAL;
+
+ r = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &user_ssp, 0, -1);
+ if (r)
+ return r;
+
+ /*
+ * Some kernel instructions (IRET, etc) can cause exceptions in the case
+ * of disallowed CET register values. Just prevent invalid values.
+ */
+ if (user_ssp >= TASK_SIZE_MAX || !IS_ALIGNED(user_ssp, 8))
+ return -EINVAL;
+
+ fpu_force_restore(fpu);
+
+ cetregs = get_xsave_addr(xsave, XFEATURE_CET_USER);
+ if (WARN_ON(!cetregs)) {
+ /*
+ * This shouldn't ever be NULL because shadow stack was
+ * verified to be enabled above. This means
+ * MSR_IA32_U_CET.CET_SHSTK_EN should be 1 and so
+ * XFEATURE_CET_USER should not be in the init state.
+ */
+ return -ENODEV;
+ }
+
+ cetregs->user_ssp = user_ssp;
+ return 0;
+}
+#endif /* CONFIG_X86_USER_SHADOW_STACK */
+
+#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
+
+/*
+ * FPU tag word conversions.
+ */
+
+static inline unsigned short twd_i387_to_fxsr(unsigned short twd)
+{
+ unsigned int tmp; /* to avoid 16 bit prefixes in the code */
+
+ /* Transform each pair of bits into 01 (valid) or 00 (empty) */
+ tmp = ~twd;
+ tmp = (tmp | (tmp>>1)) & 0x5555; /* 0V0V0V0V0V0V0V0V */
+ /* and move the valid bits to the lower byte. */
+ tmp = (tmp | (tmp >> 1)) & 0x3333; /* 00VV00VV00VV00VV */
+ tmp = (tmp | (tmp >> 2)) & 0x0f0f; /* 0000VVVV0000VVVV */
+ tmp = (tmp | (tmp >> 4)) & 0x00ff; /* 00000000VVVVVVVV */
+
+ return tmp;
+}
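+
+/*
+ * Worked example (editor's addition): twd = 0xfff2 encodes reg0 as
+ * "special" (10b), reg1 as "valid" (00b) and regs 2-7 as "empty" (11b).
+ * ~twd = 0x000d; folding each pair yields 0x0005, and the shift/mask
+ * compaction produces 0x03: an FXSR tag byte with bits 0 and 1 set for
+ * the two non-empty registers.
+ */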
+
+#define FPREG_ADDR(f, n) ((void *)&(f)->st_space + (n) * 16)
+#define FP_EXP_TAG_VALID 0
+#define FP_EXP_TAG_ZERO 1
+#define FP_EXP_TAG_SPECIAL 2
+#define FP_EXP_TAG_EMPTY 3
+
+static inline u32 twd_fxsr_to_i387(struct fxregs_state *fxsave)
+{
+ struct _fpxreg *st;
+ u32 tos = (fxsave->swd >> 11) & 7;
+ u32 twd = (unsigned long) fxsave->twd;
+ u32 tag;
+ u32 ret = 0xffff0000u;
+ int i;
+
+ for (i = 0; i < 8; i++, twd >>= 1) {
+ if (twd & 0x1) {
+ st = FPREG_ADDR(fxsave, (i - tos) & 7);
+
+ switch (st->exponent & 0x7fff) {
+ case 0x7fff:
+ tag = FP_EXP_TAG_SPECIAL;
+ break;
+ case 0x0000:
+ if (!st->significand[0] &&
+ !st->significand[1] &&
+ !st->significand[2] &&
+ !st->significand[3])
+ tag = FP_EXP_TAG_ZERO;
+ else
+ tag = FP_EXP_TAG_SPECIAL;
+ break;
+ default:
+ if (st->significand[3] & 0x8000)
+ tag = FP_EXP_TAG_VALID;
+ else
+ tag = FP_EXP_TAG_SPECIAL;
+ break;
+ }
+ } else {
+ tag = FP_EXP_TAG_EMPTY;
+ }
+ ret |= tag << (2 * i);
+ }
+ return ret;
+}
+
+/*
+ * FXSR floating point environment conversions.
+ */
+
+static void __convert_from_fxsr(struct user_i387_ia32_struct *env,
+ struct task_struct *tsk,
+ struct fxregs_state *fxsave)
+{
+ struct _fpreg *to = (struct _fpreg *) &env->st_space[0];
+ struct _fpxreg *from = (struct _fpxreg *) &fxsave->st_space[0];
+ int i;
+
+ env->cwd = fxsave->cwd | 0xffff0000u;
+ env->swd = fxsave->swd | 0xffff0000u;
+ env->twd = twd_fxsr_to_i387(fxsave);
+
+#ifdef CONFIG_X86_64
+ env->fip = fxsave->rip;
+ env->foo = fxsave->rdp;
+ /*
+ * These should actually be ds/cs at FPU exception time, but
+ * that information is not available in 64-bit mode.
+ */
+ env->fcs = task_pt_regs(tsk)->cs;
+ if (tsk == current) {
+ savesegment(ds, env->fos);
+ } else {
+ env->fos = tsk->thread.ds;
+ }
+ env->fos |= 0xffff0000;
+#else
+ env->fip = fxsave->fip;
+ env->fcs = (u16) fxsave->fcs | ((u32) fxsave->fop << 16);
+ env->foo = fxsave->foo;
+ env->fos = fxsave->fos;
+#endif
+
+ for (i = 0; i < 8; ++i)
+ memcpy(&to[i], &from[i], sizeof(to[0]));
+}
+
+void
+convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk)
+{
+ __convert_from_fxsr(env, tsk, &x86_task_fpu(tsk)->fpstate->regs.fxsave);
+}
+
+void convert_to_fxsr(struct fxregs_state *fxsave,
+ const struct user_i387_ia32_struct *env)
+{
+ struct _fpreg *from = (struct _fpreg *) &env->st_space[0];
+ struct _fpxreg *to = (struct _fpxreg *) &fxsave->st_space[0];
+ int i;
+
+ fxsave->cwd = env->cwd;
+ fxsave->swd = env->swd;
+ fxsave->twd = twd_i387_to_fxsr(env->twd);
+ fxsave->fop = (u16) ((u32) env->fcs >> 16);
+#ifdef CONFIG_X86_64
+ fxsave->rip = env->fip;
+ fxsave->rdp = env->foo;
+ /* cs and ds ignored */
+#else
+ fxsave->fip = env->fip;
+ fxsave->fcs = (env->fcs & 0xffff);
+ fxsave->foo = env->foo;
+ fxsave->fos = env->fos;
+#endif
+
+ for (i = 0; i < 8; ++i)
+ memcpy(&to[i], &from[i], sizeof(from[0]));
+}
+
+int fpregs_get(struct task_struct *target, const struct user_regset *regset,
+ struct membuf to)
+{
+ struct fpu *fpu = x86_task_fpu(target);
+ struct user_i387_ia32_struct env;
+ struct fxregs_state fxsave, *fx;
+
+ sync_fpstate(fpu);
+
+ if (!cpu_feature_enabled(X86_FEATURE_FPU))
+ return fpregs_soft_get(target, regset, to);
+
+ if (!cpu_feature_enabled(X86_FEATURE_FXSR)) {
+ return membuf_write(&to, &fpu->fpstate->regs.fsave,
+ sizeof(struct fregs_state));
+ }
+
+ if (use_xsave()) {
+ struct membuf mb = { .p = &fxsave, .left = sizeof(fxsave) };
+
+ /* Handle init state optimized xstate correctly */
+ copy_xstate_to_uabi_buf(mb, target, XSTATE_COPY_FP);
+ fx = &fxsave;
+ } else {
+ fx = &fpu->fpstate->regs.fxsave;
+ }
+
+ __convert_from_fxsr(&env, target, fx);
+ return membuf_write(&to, &env, sizeof(env));
+}
+
+int fpregs_set(struct task_struct *target, const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ const void *kbuf, const void __user *ubuf)
+{
+ struct fpu *fpu = x86_task_fpu(target);
+ struct user_i387_ia32_struct env;
+ int ret;
+
+ /* No funny business with partial or oversized writes is permitted. */
+ if (pos != 0 || count != sizeof(struct user_i387_ia32_struct))
+ return -EINVAL;
+
+ if (!cpu_feature_enabled(X86_FEATURE_FPU))
+ return fpregs_soft_set(target, regset, pos, count, kbuf, ubuf);
+
+ ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &env, 0, -1);
+ if (ret)
+ return ret;
+
+ fpu_force_restore(fpu);
+
+ if (cpu_feature_enabled(X86_FEATURE_FXSR))
+ convert_to_fxsr(&fpu->fpstate->regs.fxsave, &env);
+ else
+ memcpy(&fpu->fpstate->regs.fsave, &env, sizeof(env));
+
+ /*
+ * Update the header bit in the xsave header, indicating the
+ * presence of FP.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_XSAVE))
+ fpu->fpstate->regs.xsave.header.xfeatures |= XFEATURE_MASK_FP;
+
+ return 0;
+}
+
+#endif /* CONFIG_X86_32 || CONFIG_IA32_EMULATION */
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
new file mode 100644
index 000000000000..c3ec2512f2bb
--- /dev/null
+++ b/arch/x86/kernel/fpu/signal.c
@@ -0,0 +1,526 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * FPU signal frame handling routines.
+ */
+
+#include <linux/compat.h>
+#include <linux/cpu.h>
+#include <linux/pagemap.h>
+
+#include <asm/fpu/signal.h>
+#include <asm/fpu/regset.h>
+#include <asm/fpu/xstate.h>
+
+#include <asm/sigframe.h>
+#include <asm/trapnr.h>
+#include <asm/trace/fpu.h>
+
+#include "context.h"
+#include "internal.h"
+#include "legacy.h"
+#include "xstate.h"
+
+/*
+ * Check for the presence of extended state information in the
+ * user fpstate pointer in the sigcontext.
+ */
+static inline bool check_xstate_in_sigframe(struct fxregs_state __user *fxbuf,
+ struct _fpx_sw_bytes *fx_sw)
+{
+ void __user *fpstate = fxbuf;
+ unsigned int magic2;
+
+ if (__copy_from_user(fx_sw, &fxbuf->sw_reserved[0], sizeof(*fx_sw)))
+ return false;
+
+ /* Check for the first magic field */
+ if (fx_sw->magic1 != FP_XSTATE_MAGIC1)
+ goto setfx;
+
+ /*
+ * Check for the presence of second magic word at the end of memory
+ * layout. This detects the case where the user just copied the legacy
+ * fpstate layout without copying the extended state information
+ * in the memory layout.
+ */
+ if (__get_user(magic2, (__u32 __user *)(fpstate + x86_task_fpu(current)->fpstate->user_size)))
+ return false;
+
+ if (likely(magic2 == FP_XSTATE_MAGIC2))
+ return true;
+setfx:
+ trace_x86_fpu_xstate_check_failed(x86_task_fpu(current));
+
+ /* Set the parameters for fx only state */
+ fx_sw->magic1 = 0;
+ fx_sw->xstate_size = sizeof(struct fxregs_state);
+ fx_sw->xfeatures = XFEATURE_MASK_FPSSE;
+ return true;
+}
+
+/*
+ * Signal frame handlers.
+ */
+static inline bool save_fsave_header(struct task_struct *tsk, void __user *buf)
+{
+ if (use_fxsr()) {
+ struct xregs_state *xsave = &x86_task_fpu(tsk)->fpstate->regs.xsave;
+ struct user_i387_ia32_struct env;
+ struct _fpstate_32 __user *fp = buf;
+
+ fpregs_lock();
+ if (!test_thread_flag(TIF_NEED_FPU_LOAD))
+ fxsave(&x86_task_fpu(tsk)->fpstate->regs.fxsave);
+ fpregs_unlock();
+
+ convert_from_fxsr(&env, tsk);
+
+ if (__copy_to_user(buf, &env, sizeof(env)) ||
+ __put_user(xsave->i387.swd, &fp->status) ||
+ __put_user(X86_FXSR_MAGIC, &fp->magic))
+ return false;
+ } else {
+ struct fregs_state __user *fp = buf;
+ u32 swd;
+
+ if (__get_user(swd, &fp->swd) || __put_user(swd, &fp->status))
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * Prepare the SW reserved portion of the fxsave memory layout, indicating
+ * the presence of the extended state information in the memory layout
+ * pointed to by the fpstate pointer in the sigcontext.
+ * This is saved whenever the FP and extended state context is
+ * saved on the user stack during signal delivery.
+ */
+static inline void save_sw_bytes(struct _fpx_sw_bytes *sw_bytes, bool ia32_frame,
+ struct fpstate *fpstate)
+{
+ sw_bytes->magic1 = FP_XSTATE_MAGIC1;
+ sw_bytes->extended_size = fpstate->user_size + FP_XSTATE_MAGIC2_SIZE;
+ sw_bytes->xfeatures = fpstate->user_xfeatures;
+ sw_bytes->xstate_size = fpstate->user_size;
+
+ if (ia32_frame)
+ sw_bytes->extended_size += sizeof(struct fregs_state);
+}
+
+static inline bool save_xstate_epilog(void __user *buf, int ia32_frame,
+ struct fpstate *fpstate)
+{
+ struct xregs_state __user *x = buf;
+ struct _fpx_sw_bytes sw_bytes = {};
+ int err;
+
+ /* Setup the bytes not touched by the [f]xsave and reserved for SW. */
+ save_sw_bytes(&sw_bytes, ia32_frame, fpstate);
+ err = __copy_to_user(&x->i387.sw_reserved, &sw_bytes, sizeof(sw_bytes));
+
+ if (!use_xsave())
+ return !err;
+
+ err |= __put_user(FP_XSTATE_MAGIC2,
+ (__u32 __user *)(buf + fpstate->user_size));
+
+ /*
+ * For legacy compatibility, we always set the FP/SSE bits in the bit
+ * vector while saving the state to the user context. This enables
+ * us to capture any changes (during sigreturn) to the FP/SSE bits
+ * made by legacy applications which don't touch xfeatures in the
+ * xsave header.
+ *
+ * xsave aware apps can change the xfeatures in the xsave
+ * header as well as change any contents in the memory layout.
+ * xrestore as part of sigreturn will capture all the changes.
+ */
+ err |= set_xfeature_in_sigframe(x, XFEATURE_MASK_FPSSE);
+
+ return !err;
+}
+
+static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf, u32 pkru)
+{
+ if (use_xsave())
+ return xsave_to_user_sigframe(buf, pkru);
+
+ if (use_fxsr())
+ return fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
+ else
+ return fnsave_to_user_sigframe((struct fregs_state __user *) buf);
+}
+
+/*
+ * Save the fpu, extended register state to the user signal frame.
+ *
+ * 'buf_fx' is the 64-byte aligned pointer at which the [f|fx|x]save
+ * state is copied.
+ * 'buf' points to the 'buf_fx' or to the fsave header followed by 'buf_fx'.
+ *
+ * buf == buf_fx for 64-bit frames and 32-bit fsave frame.
+ * buf != buf_fx for 32-bit frames with fxstate.
+ *
+ * Save it directly to the user frame with disabled page fault handler. If
+ * that faults, try to clear the frame which handles the page fault.
+ *
+ * If this is a 32-bit frame with fxstate, put a fsave header before
+ * the aligned state at 'buf_fx'.
+ *
+ * For [f]xsave state, update the SW reserved fields in the [f]xsave frame
+ * indicating the absence/presence of the extended state to the user.
+ */
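+/*
+ * Frame layout sketch (editor's addition, 32-bit fxstate case):
+ *
+ *     buf    -> [ legacy fsave header            ]
+ *     buf_fx -> [ 64-byte aligned [f]xsave image ]
+ *               [ FP_XSTATE_MAGIC2 at user_size  ]
+ *
+ * The _fpx_sw_bytes descriptor lives in the sw_reserved area of the
+ * fxsave image itself. For 64-bit frames buf == buf_fx and the fsave
+ * header is absent.
+ */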
+bool copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size, u32 pkru)
+{
+ struct task_struct *tsk = current;
+ struct fpstate *fpstate = x86_task_fpu(tsk)->fpstate;
+ bool ia32_fxstate = (buf != buf_fx);
+ int ret;
+
+ ia32_fxstate &= (IS_ENABLED(CONFIG_X86_32) ||
+ IS_ENABLED(CONFIG_IA32_EMULATION));
+
+ if (!static_cpu_has(X86_FEATURE_FPU)) {
+ struct user_i387_ia32_struct fp;
+
+ fpregs_soft_get(current, NULL, (struct membuf){.p = &fp,
+ .left = sizeof(fp)});
+ return !copy_to_user(buf, &fp, sizeof(fp));
+ }
+
+ if (!access_ok(buf, size))
+ return false;
+
+ if (use_xsave()) {
+ struct xregs_state __user *xbuf = buf_fx;
+
+ /*
+ * Clear the xsave header first, so that reserved fields are
+ * initialized to zero.
+ */
+ if (__clear_user(&xbuf->header, sizeof(xbuf->header)))
+ return false;
+ }
+retry:
+ /*
+ * Load the FPU registers if they are not valid for the current task.
+ * With a valid FPU state we can attempt to save the state directly to
+ * userland's stack frame which will likely succeed. If it does not,
+ * resolve the fault in the user memory and try again.
+ */
+ fpregs_lock();
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ fpregs_restore_userregs();
+
+ pagefault_disable();
+ ret = copy_fpregs_to_sigframe(buf_fx, pkru);
+ pagefault_enable();
+ fpregs_unlock();
+
+ if (ret) {
+ if (!__clear_user(buf_fx, fpstate->user_size))
+ goto retry;
+ return false;
+ }
+
+ /* Save the fsave header for the 32-bit frames. */
+ if ((ia32_fxstate || !use_fxsr()) && !save_fsave_header(tsk, buf))
+ return false;
+
+ if (use_fxsr() && !save_xstate_epilog(buf_fx, ia32_fxstate, fpstate))
+ return false;
+
+ return true;
+}
+
+static int __restore_fpregs_from_user(void __user *buf, u64 ufeatures,
+ u64 xrestore, bool fx_only)
+{
+ if (use_xsave()) {
+ u64 init_bv = ufeatures & ~xrestore;
+ int ret;
+
+ if (likely(!fx_only))
+ ret = xrstor_from_user_sigframe(buf, xrestore);
+ else
+ ret = fxrstor_from_user_sigframe(buf);
+
+ if (!ret && unlikely(init_bv))
+ os_xrstor(&init_fpstate, init_bv);
+ return ret;
+ } else if (use_fxsr()) {
+ return fxrstor_from_user_sigframe(buf);
+ } else {
+ return frstor_from_user_sigframe(buf);
+ }
+}
+
+/*
+ * Attempt to restore the FPU registers directly from user memory.
+ * Pagefaults are handled and any errors returned are fatal.
+ */
+static bool restore_fpregs_from_user(void __user *buf, u64 xrestore, bool fx_only)
+{
+ struct fpu *fpu = x86_task_fpu(current);
+ int ret;
+
+ /* Restore enabled features only. */
+ xrestore &= fpu->fpstate->user_xfeatures;
+retry:
+ fpregs_lock();
+ /* Ensure that XFD is up to date */
+ xfd_update_state(fpu->fpstate);
+ pagefault_disable();
+ ret = __restore_fpregs_from_user(buf, fpu->fpstate->user_xfeatures,
+ xrestore, fx_only);
+ pagefault_enable();
+
+ if (unlikely(ret)) {
+ /*
+ * The above did an FPU restore operation, restricted to
+ * the user portion of the registers, and failed, but the
+ * microcode might have modified the FPU registers
+ * nevertheless.
+ *
+ * If the FPU registers do not belong to current, then
+ * invalidate the FPU register state otherwise the task
+ * might preempt current and return to user space with
+ * corrupted FPU registers.
+ */
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ __cpu_invalidate_fpregs_state();
+ fpregs_unlock();
+
+ /* Try to handle #PF, but anything else is fatal. */
+ if (ret != X86_TRAP_PF)
+ return false;
+
+ if (!fault_in_readable(buf, fpu->fpstate->user_size))
+ goto retry;
+ return false;
+ }
+
+ /*
+ * Restore supervisor states: the previous context switch etc. has done
+ * XSAVES and saved the supervisor states in the kernel buffer from
+ * which they can be restored now.
+ *
+ * It would be optimal to handle this with a single XRSTORS, but
+ * this does not work because the rest of the FPU registers have
+ * been restored from a user buffer directly.
+ */
+ if (test_thread_flag(TIF_NEED_FPU_LOAD) && xfeatures_mask_supervisor())
+ os_xrstor_supervisor(fpu->fpstate);
+
+ fpregs_mark_activate();
+ fpregs_unlock();
+ return true;
+}
+
+static bool __fpu_restore_sig(void __user *buf, void __user *buf_fx,
+ bool ia32_fxstate)
+{
+ struct task_struct *tsk = current;
+ struct fpu *fpu = x86_task_fpu(tsk);
+ struct user_i387_ia32_struct env;
+ bool success, fx_only = false;
+ union fpregs_state *fpregs;
+ u64 user_xfeatures = 0;
+
+ if (use_xsave()) {
+ struct _fpx_sw_bytes fx_sw_user;
+
+ if (!check_xstate_in_sigframe(buf_fx, &fx_sw_user))
+ return false;
+
+ fx_only = !fx_sw_user.magic1;
+ user_xfeatures = fx_sw_user.xfeatures;
+ } else {
+ user_xfeatures = XFEATURE_MASK_FPSSE;
+ }
+
+ if (likely(!ia32_fxstate)) {
+ /* Restore the FPU registers directly from user memory. */
+ return restore_fpregs_from_user(buf_fx, user_xfeatures, fx_only);
+ }
+
+ /*
+ * Copy the legacy state because the FP portion of the FX frame has
+ * to be ignored for historical reasons. The legacy state is folded
+ * in once the larger state has been copied.
+ */
+ if (__copy_from_user(&env, buf, sizeof(env)))
+ return false;
+
+ /*
+ * By setting TIF_NEED_FPU_LOAD it is ensured that our xstate is
+ * not modified on context switch and that the xstate is considered
+ * to be loaded again on return to userland (overriding last_cpu avoids
+ * the optimisation).
+ */
+ fpregs_lock();
+ if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
+ /*
+ * If supervisor states are available then save the
+ * hardware state in current's fpstate so that the
+ * supervisor state is preserved. Save the full state for
+ * simplicity. There is no point in optimizing this by only
+ * saving the supervisor states and then shuffle them to
+ * the right place in memory. It's ia32 mode. Shrug.
+ */
+ if (xfeatures_mask_supervisor())
+ os_xsave(fpu->fpstate);
+ set_thread_flag(TIF_NEED_FPU_LOAD);
+ }
+ __fpu_invalidate_fpregs_state(fpu);
+ __cpu_invalidate_fpregs_state();
+ fpregs_unlock();
+
+ fpregs = &fpu->fpstate->regs;
+ if (use_xsave() && !fx_only) {
+ if (copy_sigframe_from_user_to_xstate(tsk, buf_fx))
+ return false;
+ } else {
+ if (__copy_from_user(&fpregs->fxsave, buf_fx,
+ sizeof(fpregs->fxsave)))
+ return false;
+
+ if (IS_ENABLED(CONFIG_X86_64)) {
+ /* Reject invalid MXCSR values. */
+ if (fpregs->fxsave.mxcsr & ~mxcsr_feature_mask)
+ return false;
+ } else {
+ /* Mask invalid bits out for historical reasons (broken hardware). */
+ fpregs->fxsave.mxcsr &= mxcsr_feature_mask;
+ }
+
+ /* Enforce XFEATURE_MASK_FPSSE when XSAVE is enabled */
+ if (use_xsave())
+ fpregs->xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
+ }
+
+ /* Fold the legacy FP storage */
+ convert_to_fxsr(&fpregs->fxsave, &env);
+
+ fpregs_lock();
+ if (use_xsave()) {
+ /*
+ * Remove all UABI feature bits not set in user_xfeatures
+ * from the memory xstate header which makes the full
+ * restore below bring them into init state. This works for
+ * fx_only mode as well because that has only FP and SSE
+ * set in user_xfeatures.
+ *
+ * Preserve supervisor states!
+ */
+ u64 mask = user_xfeatures | xfeatures_mask_supervisor();
+
+ fpregs->xsave.header.xfeatures &= mask;
+ success = !os_xrstor_safe(fpu->fpstate,
+ fpu_kernel_cfg.max_features);
+ } else {
+ success = !fxrstor_safe(&fpregs->fxsave);
+ }
+
+ if (likely(success))
+ fpregs_mark_activate();
+
+ fpregs_unlock();
+ return success;
+}
+
+static inline unsigned int xstate_sigframe_size(struct fpstate *fpstate)
+{
+ unsigned int size = fpstate->user_size;
+
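+ /*
+ * XSAVE-based frames carry a trailing FP_XSTATE_MAGIC2 marker so
+ * that sigreturn can recognize a complete extended frame; legacy
+ * fsave/fxsave frames do not.
+ */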
+ return use_xsave() ? size + FP_XSTATE_MAGIC2_SIZE : size;
+}
+
+/*
+ * Restore FPU state from a sigframe:
+ */
+bool fpu__restore_sig(void __user *buf, int ia32_frame)
+{
+ struct fpu *fpu = x86_task_fpu(current);
+ void __user *buf_fx = buf;
+ bool ia32_fxstate = false;
+ bool success = false;
+ unsigned int size;
+
+ if (unlikely(!buf)) {
+ fpu__clear_user_states(fpu);
+ return true;
+ }
+
+ size = xstate_sigframe_size(fpu->fpstate);
+
+ ia32_frame &= (IS_ENABLED(CONFIG_X86_32) ||
+ IS_ENABLED(CONFIG_IA32_EMULATION));
+
+ /*
+ * Only FXSR enabled systems need the FX state quirk.
+ * FRSTOR does not need it and can use the fast path.
+ */
+ if (ia32_frame && use_fxsr()) {
+ buf_fx = buf + sizeof(struct fregs_state);
+ size += sizeof(struct fregs_state);
+ ia32_fxstate = true;
+ }
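+ /* For the ia32 quirk, 'buf' now covers the fsave header plus the fx area at 'buf_fx'. */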
+
+ if (!access_ok(buf, size))
+ goto out;
+
+ if (!IS_ENABLED(CONFIG_X86_64) && !cpu_feature_enabled(X86_FEATURE_FPU)) {
+ success = !fpregs_soft_set(current, NULL, 0,
+ sizeof(struct user_i387_ia32_struct),
+ NULL, buf);
+ } else {
+ success = __fpu_restore_sig(buf, buf_fx, ia32_fxstate);
+ }
+
+out:
+ if (unlikely(!success))
+ fpu__clear_user_states(fpu);
+ return success;
+}
+
+unsigned long
+fpu__alloc_mathframe(unsigned long sp, int ia32_frame,
+ unsigned long *buf_fx, unsigned long *size)
+{
+ unsigned long frame_size = xstate_sigframe_size(x86_task_fpu(current)->fpstate);
+
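+ /*
+ * XSAVE areas must be 64-byte aligned; rounding the stack pointer
+ * down to 64 below also satisfies FXSAVE's 16-byte requirement.
+ */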
+ *buf_fx = sp = round_down(sp - frame_size, 64);
+ if (ia32_frame && use_fxsr()) {
+ frame_size += sizeof(struct fregs_state);
+ sp -= sizeof(struct fregs_state);
+ }
+
+ *size = frame_size;
+
+ return sp;
+}
+
+unsigned long __init fpu__get_fpstate_size(void)
+{
+ unsigned long ret = fpu_user_cfg.max_size;
+
+ if (use_xsave())
+ ret += FP_XSTATE_MAGIC2_SIZE;
+
+ /*
+ * This space is needed on (most) 32-bit kernels, or when a 32-bit
+ * app is running on a 64-bit kernel. To keep things simple, just
+ * assume the worst case and always include space for 'fregs_state',
+ * even for 64-bit apps on 64-bit kernels. This wastes a bit of
+ * space, but keeps the code simple.
+ */
+ if ((IS_ENABLED(CONFIG_IA32_EMULATION) ||
+ IS_ENABLED(CONFIG_X86_32)) && use_fxsr())
+ ret += sizeof(struct fregs_state);
+
+ return ret;
+}
+
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
new file mode 100644
index 000000000000..48113c5193aa
--- /dev/null
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -0,0 +1,2012 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * xsave/xrstor support.
+ *
+ * Author: Suresh Siddha <suresh.b.siddha@intel.com>
+ */
+#include <linux/bitops.h>
+#include <linux/compat.h>
+#include <linux/cpu.h>
+#include <linux/mman.h>
+#include <linux/kvm_types.h>
+#include <linux/nospec.h>
+#include <linux/pkeys.h>
+#include <linux/seq_file.h>
+#include <linux/proc_fs.h>
+#include <linux/vmalloc.h>
+#include <linux/coredump.h>
+#include <linux/sort.h>
+
+#include <asm/fpu/api.h>
+#include <asm/fpu/regset.h>
+#include <asm/fpu/signal.h>
+#include <asm/fpu/xcr.h>
+
+#include <asm/cpuid/api.h>
+#include <asm/msr.h>
+#include <asm/tlbflush.h>
+#include <asm/prctl.h>
+#include <asm/elf.h>
+
+#include <uapi/asm/elf.h>
+
+#include "context.h"
+#include "internal.h"
+#include "legacy.h"
+#include "xstate.h"
+
+#define for_each_extended_xfeature(bit, mask) \
+ (bit) = FIRST_EXTENDED_XFEATURE; \
+ for_each_set_bit_from(bit, (unsigned long *)&(mask), 8 * sizeof(mask))
+
+/*
+ * Although we spell it out in here, the Processor Trace
+ * xfeature is completely unused. We use other mechanisms
+ * to save/restore PT state in Linux.
+ */
+static const char *xfeature_names[] =
+{
+ "x87 floating point registers",
+ "SSE registers",
+ "AVX registers",
+ "MPX bounds registers",
+ "MPX CSR",
+ "AVX-512 opmask",
+ "AVX-512 Hi256",
+ "AVX-512 ZMM_Hi256",
+ "Processor Trace (unused)",
+ "Protection Keys User registers",
+ "PASID state",
+ "Control-flow User registers",
+ "Control-flow Kernel registers (KVM only)",
+ "unknown xstate feature",
+ "unknown xstate feature",
+ "unknown xstate feature",
+ "unknown xstate feature",
+ "AMX Tile config",
+ "AMX Tile data",
+ "APX registers",
+ "unknown xstate feature",
+};
+
+static unsigned short xsave_cpuid_features[] __initdata = {
+ [XFEATURE_FP] = X86_FEATURE_FPU,
+ [XFEATURE_SSE] = X86_FEATURE_XMM,
+ [XFEATURE_YMM] = X86_FEATURE_AVX,
+ [XFEATURE_BNDREGS] = X86_FEATURE_MPX,
+ [XFEATURE_BNDCSR] = X86_FEATURE_MPX,
+ [XFEATURE_OPMASK] = X86_FEATURE_AVX512F,
+ [XFEATURE_ZMM_Hi256] = X86_FEATURE_AVX512F,
+ [XFEATURE_Hi16_ZMM] = X86_FEATURE_AVX512F,
+ [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT,
+ [XFEATURE_PKRU] = X86_FEATURE_OSPKE,
+ [XFEATURE_PASID] = X86_FEATURE_ENQCMD,
+ [XFEATURE_CET_USER] = X86_FEATURE_SHSTK,
+ [XFEATURE_CET_KERNEL] = X86_FEATURE_SHSTK,
+ [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE,
+ [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE,
+ [XFEATURE_APX] = X86_FEATURE_APX,
+};
+
+static unsigned int xstate_offsets[XFEATURE_MAX] __ro_after_init =
+ { [ 0 ... XFEATURE_MAX - 1] = -1};
+static unsigned int xstate_sizes[XFEATURE_MAX] __ro_after_init =
+ { [ 0 ... XFEATURE_MAX - 1] = -1};
+static unsigned int xstate_flags[XFEATURE_MAX] __ro_after_init;
+
+/*
+ * Ordering of xstate components in uncompacted format: The xfeature
+ * number does not necessarily indicate its position in the XSAVE buffer.
+ * This array defines the traversal order of xstate features.
+ */
+static unsigned int xfeature_uncompact_order[XFEATURE_MAX] __ro_after_init =
+ { [ 0 ... XFEATURE_MAX - 1] = -1};
+
+static inline unsigned int next_xfeature_order(unsigned int i, u64 mask)
+{
+ for (; xfeature_uncompact_order[i] != -1; i++) {
+ if (mask & BIT_ULL(xfeature_uncompact_order[i]))
+ break;
+ }
+
+ return i;
+}
+
+/* Iterate xstate features in uncompacted order: */
+#define for_each_extended_xfeature_in_order(i, mask) \
+ for (i = 0; \
+ i = next_xfeature_order(i, mask), \
+ xfeature_uncompact_order[i] != -1; \
+ i++)
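+/*
+ * Illustrative: with mask = YMM|PKRU this visits both features in
+ * ascending user-buffer offset order rather than in bit order.
+ */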
+
+#define XSTATE_FLAG_SUPERVISOR BIT(0)
+#define XSTATE_FLAG_ALIGNED64 BIT(1)
+
+/*
+ * Return whether the system supports a given xfeature.
+ *
+ * Also return the name of the (most advanced) feature that the caller requested:
+ */
+int cpu_has_xfeatures(u64 xfeatures_needed, const char **feature_name)
+{
+ u64 xfeatures_missing = xfeatures_needed & ~fpu_kernel_cfg.max_features;
+
+ if (unlikely(feature_name)) {
+ long xfeature_idx, max_idx;
+ u64 xfeatures_print;
+ /*
+ * We use fls64() here to be able to print the most advanced
+ * feature that was requested but is missing. So if a driver
+ * asks about "XFEATURE_MASK_SSE | XFEATURE_MASK_YMM" we'll print the
+ * missing AVX feature - this is the most informative message
+ * to users:
+ */
+ if (xfeatures_missing)
+ xfeatures_print = xfeatures_missing;
+ else
+ xfeatures_print = xfeatures_needed;
+
+ xfeature_idx = fls64(xfeatures_print)-1;
+ max_idx = ARRAY_SIZE(xfeature_names)-1;
+ xfeature_idx = min(xfeature_idx, max_idx);
+
+ *feature_name = xfeature_names[xfeature_idx];
+ }
+
+ if (xfeatures_missing)
+ return 0;
+
+ return 1;
+}
+EXPORT_SYMBOL_GPL(cpu_has_xfeatures);
+
+static bool xfeature_is_aligned64(int xfeature_nr)
+{
+ return xstate_flags[xfeature_nr] & XSTATE_FLAG_ALIGNED64;
+}
+
+static bool xfeature_is_supervisor(int xfeature_nr)
+{
+ return xstate_flags[xfeature_nr] & XSTATE_FLAG_SUPERVISOR;
+}
+
+static unsigned int xfeature_get_offset(u64 xcomp_bv, int xfeature)
+{
+ unsigned int offs, i;
+
+ /*
+ * Non-compacted format and legacy features use the cached fixed
+ * offsets.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_XCOMPACTED) ||
+ xfeature <= XFEATURE_SSE)
+ return xstate_offsets[xfeature];
+
+ /*
+ * Compacted format offsets depend on the actual content of the
+ * compacted xsave area which is determined by the xcomp_bv header
+ * field.
+ */
+ offs = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+ for_each_extended_xfeature(i, xcomp_bv) {
+ if (xfeature_is_aligned64(i))
+ offs = ALIGN(offs, 64);
+ if (i == xfeature)
+ break;
+ offs += xstate_sizes[i];
+ }
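+ /*
+ * 'offs' has now accumulated the sizes (plus any 64-byte alignment
+ * gaps) of all enabled components preceding 'xfeature'.
+ */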
+ return offs;
+}
+
+/*
+ * Enable the extended processor state save/restore feature.
+ * Called once per CPU onlining.
+ */
+void fpu__init_cpu_xstate(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_XSAVE) || !fpu_kernel_cfg.max_features)
+ return;
+
+ cr4_set_bits(X86_CR4_OSXSAVE);
+
+ /*
+ * Must happen after CR4 setup and before xsetbv() to allow KVM
+ * lazy passthrough. Write independent of the dynamic state static
+ * key as that does not work on the boot CPU. This also ensures
+ * that any stale state is wiped out from XFD. Reset the per CPU
+ * xfd cache too.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_XFD))
+ xfd_set_state(init_fpstate.xfd);
+
+ /*
+ * XCR_XFEATURE_ENABLED_MASK (aka. XCR0) sets user features
+ * managed by XSAVE{C, OPT, S} and XRSTOR{S}. Only XSAVE user
+ * states can be set here.
+ */
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, fpu_user_cfg.max_features);
+
+ /*
+ * MSR_IA32_XSS sets supervisor states managed by XSAVES.
+ */
+ if (boot_cpu_has(X86_FEATURE_XSAVES)) {
+ wrmsrq(MSR_IA32_XSS, xfeatures_mask_supervisor() |
+ xfeatures_mask_independent());
+ }
+}
+
+static bool xfeature_enabled(enum xfeature xfeature)
+{
+ return fpu_kernel_cfg.max_features & BIT_ULL(xfeature);
+}
+
+static int compare_xstate_offsets(const void *xfeature1, const void *xfeature2)
+{
+ return xstate_offsets[*(unsigned int *)xfeature1] -
+ xstate_offsets[*(unsigned int *)xfeature2];
+}
+
+/*
+ * Record the offsets and sizes of various xstates contained
+ * in the XSAVE state memory layout. Also, create an ordered
+ * list of xfeatures for handling out-of-order offsets.
+ */
+static void __init setup_xstate_cache(void)
+{
+ u32 eax, ebx, ecx, edx, xfeature, i = 0;
+ /*
+ * The FP xstates and SSE xstates are legacy states. They are always
+ * in the fixed offsets in the xsave area in either compacted form
+ * or standard form.
+ */
+ xstate_offsets[XFEATURE_FP] = 0;
+ xstate_sizes[XFEATURE_FP] = offsetof(struct fxregs_state,
+ xmm_space);
+
+ xstate_offsets[XFEATURE_SSE] = xstate_sizes[XFEATURE_FP];
+ xstate_sizes[XFEATURE_SSE] = sizeof_field(struct fxregs_state,
+ xmm_space);
+
+ for_each_extended_xfeature(xfeature, fpu_kernel_cfg.max_features) {
+ cpuid_count(CPUID_LEAF_XSTATE, xfeature, &eax, &ebx, &ecx, &edx);
+
+ xstate_sizes[xfeature] = eax;
+ xstate_flags[xfeature] = ecx;
+
+ /*
+ * If an xfeature is a supervisor state, the offset in EBX is
+ * invalid; leave it at -1.
+ */
+ if (xfeature_is_supervisor(xfeature))
+ continue;
+
+ xstate_offsets[xfeature] = ebx;
+
+ /* Populate the list of xfeatures before sorting */
+ xfeature_uncompact_order[i++] = xfeature;
+ }
+
+ /*
+ * Sort xfeatures by their offsets to support out-of-order
+ * offsets in the uncompacted format.
+ */
+ sort(xfeature_uncompact_order, i, sizeof(unsigned int), compare_xstate_offsets, NULL);
+}
+
+/*
+ * Print out all the supported xstate features:
+ */
+static void __init print_xstate_features(void)
+{
+ int i;
+
+ for (i = 0; i < XFEATURE_MAX; i++) {
+ u64 mask = BIT_ULL(i);
+ const char *name;
+
+ if (cpu_has_xfeatures(mask, &name))
+ pr_info("x86/fpu: Supporting XSAVE feature 0x%03Lx: '%s'\n", mask, name);
+ }
+}
+
+/*
+ * This check is important because it is easy to get XSTATE_*
+ * confused with XSTATE_BIT_*.
+ */
+#define CHECK_XFEATURE(nr) do { \
+ WARN_ON(nr < FIRST_EXTENDED_XFEATURE); \
+ WARN_ON(nr >= XFEATURE_MAX); \
+} while (0)
+
+/*
+ * Print out xstate component offsets and sizes
+ */
+static void __init print_xstate_offset_size(void)
+{
+ int i;
+
+ for_each_extended_xfeature(i, fpu_kernel_cfg.max_features) {
+ pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n",
+ i, xfeature_get_offset(fpu_kernel_cfg.max_features, i),
+ i, xstate_sizes[i]);
+ }
+}
+
+/*
+ * This function is called only during boot, when x86 capabilities are not
+ * yet set up and alternatives cannot be used.
+ */
+static __init void os_xrstor_booting(struct xregs_state *xstate)
+{
+ u64 mask = fpu_kernel_cfg.max_features & XFEATURE_MASK_FPSTATE;
+ u32 lmask = mask;
+ u32 hmask = mask >> 32;
+ int err;
+
+ if (cpu_feature_enabled(X86_FEATURE_XSAVES))
+ XSTATE_OP(XRSTORS, xstate, lmask, hmask, err);
+ else
+ XSTATE_OP(XRSTOR, xstate, lmask, hmask, err);
+
+ /*
+ * We should never fault when copying from a kernel buffer, and the FPU
+ * state we set at boot time should be valid.
+ */
+ WARN_ON_FPU(err);
+}
+
+/*
+ * All supported features have either init state all zeros or are
+ * handled in setup_init_fpu_buf() individually. This is an explicit
+ * feature list and does not use XFEATURE_MASK*SUPPORTED to catch
+ * newly added supported features at build time and make people
+ * actually look at the init state for the new feature.
+ */
+#define XFEATURES_INIT_FPSTATE_HANDLED \
+ (XFEATURE_MASK_FP | \
+ XFEATURE_MASK_SSE | \
+ XFEATURE_MASK_YMM | \
+ XFEATURE_MASK_OPMASK | \
+ XFEATURE_MASK_ZMM_Hi256 | \
+ XFEATURE_MASK_Hi16_ZMM | \
+ XFEATURE_MASK_PKRU | \
+ XFEATURE_MASK_BNDREGS | \
+ XFEATURE_MASK_BNDCSR | \
+ XFEATURE_MASK_PASID | \
+ XFEATURE_MASK_CET_USER | \
+ XFEATURE_MASK_CET_KERNEL | \
+ XFEATURE_MASK_XTILE | \
+ XFEATURE_MASK_APX)
+
+/*
+ * setup the xstate image representing the init state
+ */
+static void __init setup_init_fpu_buf(void)
+{
+ BUILD_BUG_ON((XFEATURE_MASK_USER_SUPPORTED |
+ XFEATURE_MASK_SUPERVISOR_SUPPORTED) !=
+ XFEATURES_INIT_FPSTATE_HANDLED);
+
+ if (!boot_cpu_has(X86_FEATURE_XSAVE))
+ return;
+
+ print_xstate_features();
+
+ xstate_init_xcomp_bv(&init_fpstate.regs.xsave, init_fpstate.xfeatures);
+
+ /*
+ * Init all the features state with header.xfeatures being 0x0
+ */
+ os_xrstor_booting(&init_fpstate.regs.xsave);
+
+ /*
+ * All components are now in init state. Read the state back so
+ * that init_fpstate contains all non-zero init state. This only
+ * works with XSAVE, but not with XSAVEOPT and XSAVEC/S because
+ * those use the init optimization which skips writing data for
+ * components in init state.
+ *
+ * XSAVE could be used, but that would require to reshuffle the
+ * data when XSAVEC/S is available because XSAVEC/S uses xstate
+ * compaction. But doing so is a pointless exercise because most
+ * components have an all zeros init state except for the legacy
+ * ones (FP and SSE). Those can be saved with FXSAVE into the
+ * legacy area. Adding new features requires ensuring that their init
+ * state is all zeroes, or otherwise adding the necessary handling
+ * here.
+ */
+ fxsave(&init_fpstate.regs.fxsave);
+}
+
+int xfeature_size(int xfeature_nr)
+{
+ u32 eax, ebx, ecx, edx;
+
+ CHECK_XFEATURE(xfeature_nr);
+ cpuid_count(CPUID_LEAF_XSTATE, xfeature_nr, &eax, &ebx, &ecx, &edx);
+ return eax;
+}
+
+/* Validate an xstate header supplied by userspace (ptrace or sigreturn) */
+static int validate_user_xstate_header(const struct xstate_header *hdr,
+ struct fpstate *fpstate)
+{
+ /* No unknown or supervisor features may be set */
+ if (hdr->xfeatures & ~fpstate->user_xfeatures)
+ return -EINVAL;
+
+ /* Userspace must use the uncompacted format */
+ if (hdr->xcomp_bv)
+ return -EINVAL;
+
+ /*
+ * If 'reserved' is shrunk to add a new field, make sure to validate
+ * that new field here!
+ */
+ BUILD_BUG_ON(sizeof(hdr->reserved) != 48);
+
+ /* No reserved bits may be set */
+ if (memchr_inv(hdr->reserved, 0, sizeof(hdr->reserved)))
+ return -EINVAL;
+
+ return 0;
+}
+
+static void __init __xstate_dump_leaves(void)
+{
+ int i;
+ u32 eax, ebx, ecx, edx;
+ static int should_dump = 1;
+
+ if (!should_dump)
+ return;
+ should_dump = 0;
+ /*
+ * Dump out a few leaves past the ones that we support
+ * just in case there are some goodies up there
+ */
+ for (i = 0; i < XFEATURE_MAX + 10; i++) {
+ cpuid_count(CPUID_LEAF_XSTATE, i, &eax, &ebx, &ecx, &edx);
+ pr_warn("CPUID[%02x, %02x]: eax=%08x ebx=%08x ecx=%08x edx=%08x\n",
+ CPUID_LEAF_XSTATE, i, eax, ebx, ecx, edx);
+ }
+}
+
+#define XSTATE_WARN_ON(x, fmt, ...) do { \
+ if (WARN_ONCE(x, "XSAVE consistency problem: " fmt, ##__VA_ARGS__)) { \
+ __xstate_dump_leaves(); \
+ } \
+} while (0)
+
+#define XCHECK_SZ(sz, nr, __struct) ({ \
+ if (WARN_ONCE(sz != sizeof(__struct), \
+ "[%s]: struct is %zu bytes, cpu state %d bytes\n", \
+ xfeature_names[nr], sizeof(__struct), sz)) { \
+ __xstate_dump_leaves(); \
+ } \
+ true; \
+})
+
+
+/**
+ * check_xtile_data_against_struct - Check tile data state size.
+ *
+ * Calculate the state size by multiplying the single tile size, which is
+ * recorded in a C struct, by the number of tiles that the CPU reports.
+ * Compare the provided size with the calculation.
+ *
+ * @size: The tile data state size
+ *
+ * Returns: 0 on success, -EINVAL on mismatch.
+ */
+static int __init check_xtile_data_against_struct(int size)
+{
+ u32 max_palid, palid, state_size;
+ u32 eax, ebx, ecx, edx;
+ u16 max_tile;
+
+ /*
+ * Check the maximum palette id:
+ * eax: the highest numbered palette subleaf.
+ */
+ cpuid_count(CPUID_LEAF_TILE, 0, &max_palid, &ebx, &ecx, &edx);
+
+ /*
+ * Cross-check each tile size and find the maximum number of
+ * supported tiles.
+ */
+ for (palid = 1, max_tile = 0; palid <= max_palid; palid++) {
+ u16 tile_size, max;
+
+ /*
+ * Check the tile size info:
+ * eax[31:16]: bytes per tile
+ * ebx[31:16]: the max names (or max number of tiles)
+ */
+ cpuid_count(CPUID_LEAF_TILE, palid, &eax, &ebx, &ecx, &edx);
+ tile_size = eax >> 16;
+ max = ebx >> 16;
+
+ if (tile_size != sizeof(struct xtile_data)) {
+ pr_err("%s: struct is %zu bytes, cpu xtile %d bytes\n",
+ __stringify(XFEATURE_XTILE_DATA),
+ sizeof(struct xtile_data), tile_size);
+ __xstate_dump_leaves();
+ return -EINVAL;
+ }
+
+ if (max > max_tile)
+ max_tile = max;
+ }
+
+ state_size = sizeof(struct xtile_data) * max_tile;
+ if (size != state_size) {
+ pr_err("%s: calculated size is %u bytes, cpu state %d bytes\n",
+ __stringify(XFEATURE_XTILE_DATA), state_size, size);
+ __xstate_dump_leaves();
+ return -EINVAL;
+ }
+ return 0;
+}
+
+/*
+ * We have a C struct for each 'xstate'. We need to ensure
+ * that our software representation matches what the CPU
+ * tells us about the state's size.
+ */
+static bool __init check_xstate_against_struct(int nr)
+{
+ /*
+ * Ask the CPU for the size of the state.
+ */
+ int sz = xfeature_size(nr);
+
+ /*
+ * Match each CPU state with the corresponding software
+ * structure.
+ */
+ switch (nr) {
+ case XFEATURE_YMM: return XCHECK_SZ(sz, nr, struct ymmh_struct);
+ case XFEATURE_BNDREGS: return XCHECK_SZ(sz, nr, struct mpx_bndreg_state);
+ case XFEATURE_BNDCSR: return XCHECK_SZ(sz, nr, struct mpx_bndcsr_state);
+ case XFEATURE_OPMASK: return XCHECK_SZ(sz, nr, struct avx_512_opmask_state);
+ case XFEATURE_ZMM_Hi256: return XCHECK_SZ(sz, nr, struct avx_512_zmm_uppers_state);
+ case XFEATURE_Hi16_ZMM: return XCHECK_SZ(sz, nr, struct avx_512_hi16_state);
+ case XFEATURE_PKRU: return XCHECK_SZ(sz, nr, struct pkru_state);
+ case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
+ case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg);
+ case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state);
+ case XFEATURE_CET_KERNEL: return XCHECK_SZ(sz, nr, struct cet_supervisor_state);
+ case XFEATURE_APX: return XCHECK_SZ(sz, nr, struct apx_state);
+ case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
+ default:
+ XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
+ return false;
+ }
+
+ return true;
+}
+
+static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
+{
+ unsigned int topmost = fls64(xfeatures) - 1;
+ unsigned int offset, i;
+
+ if (topmost <= XFEATURE_SSE)
+ return sizeof(struct xregs_state);
+
+ if (compacted) {
+ offset = xfeature_get_offset(xfeatures, topmost);
+ } else {
+ /* Walk through the xfeature order to pick the last */
+ for_each_extended_xfeature_in_order(i, xfeatures)
+ topmost = xfeature_uncompact_order[i];
+ offset = xstate_offsets[topmost];
+ }
+
+ return offset + xstate_sizes[topmost];
+}
+
+/*
+ * This essentially double-checks what the cpu told us about
+ * how large the XSAVE buffer needs to be. We are recalculating
+ * it to be safe.
+ *
+ * Independent XSAVE features allocate their own buffers and are not
+ * covered by these checks. Only the size of the buffer for task->fpu
+ * is checked here.
+ */
+static bool __init paranoid_xstate_size_valid(unsigned int kernel_size)
+{
+ bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+ bool xsaves = cpu_feature_enabled(X86_FEATURE_XSAVES);
+ unsigned int size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+ int i;
+
+ for_each_extended_xfeature(i, fpu_kernel_cfg.max_features) {
+ if (!check_xstate_against_struct(i))
+ return false;
+ /*
+ * Supervisor state components can be managed only by
+ * XSAVES.
+ */
+ if (!xsaves && xfeature_is_supervisor(i)) {
+ XSTATE_WARN_ON(1, "Got supervisor feature %d, but XSAVES not advertised\n", i);
+ return false;
+ }
+ }
+ size = xstate_calculate_size(fpu_kernel_cfg.max_features, compacted);
+ XSTATE_WARN_ON(size != kernel_size,
+ "size %u != kernel_size %u\n", size, kernel_size);
+ return size == kernel_size;
+}
+
+/*
+ * Get total size of enabled xstates in XCR0 | IA32_XSS.
+ *
+ * Note the SDM's wording here. "sub-function 0" only enumerates
+ * the size of the *user* states. If we use it to size a buffer
+ * that we use 'XSAVES' on, we could potentially overflow the
+ * buffer because 'XSAVES' saves system states too.
+ *
+ * This also takes compaction into account. So this works for
+ * XSAVEC as well.
+ */
+static unsigned int __init get_compacted_size(void)
+{
+ unsigned int eax, ebx, ecx, edx;
+ /*
+ * - CPUID function 0DH, sub-function 1:
+ * EBX enumerates the size (in bytes) required by
+ * the XSAVES instruction for an XSAVE area
+ * containing all the state components
+ * corresponding to bits currently set in
+ * XCR0 | IA32_XSS.
+ *
+ * When XSAVES is not available but XSAVEC is (virt), then there
+ * are no supervisor states, but XSAVEC still uses compacted
+ * format.
+ */
+ cpuid_count(CPUID_LEAF_XSTATE, 1, &eax, &ebx, &ecx, &edx);
+ return ebx;
+}
+
+/*
+ * Get the total size of the enabled xstates without the independent supervisor
+ * features.
+ */
+static unsigned int __init get_xsave_compacted_size(void)
+{
+ u64 mask = xfeatures_mask_independent();
+ unsigned int size;
+
+ if (!mask)
+ return get_compacted_size();
+
+ /* Disable independent features. */
+ wrmsrq(MSR_IA32_XSS, xfeatures_mask_supervisor());
+
+ /*
+ * Ask the hardware what size is required of the buffer.
+ * This is the size required for the task->fpu buffer.
+ */
+ size = get_compacted_size();
+
+ /* Re-enable independent features so XSAVES will work on them again. */
+ wrmsrq(MSR_IA32_XSS, xfeatures_mask_supervisor() | mask);
+
+ return size;
+}
+
+static unsigned int __init get_xsave_size_user(void)
+{
+ unsigned int eax, ebx, ecx, edx;
+ /*
+ * - CPUID function 0DH, sub-function 0:
+ * EBX enumerates the size (in bytes) required by
+ * the XSAVE instruction for an XSAVE area
+ * containing all the *user* state components
+ * corresponding to bits currently set in XCR0.
+ */
+ cpuid_count(CPUID_LEAF_XSTATE, 0, &eax, &ebx, &ecx, &edx);
+ return ebx;
+}
+
+static int __init init_xstate_size(void)
+{
+ /* Recompute the context size for enabled features: */
+ unsigned int user_size, kernel_size, kernel_default_size;
+ bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+
+ /* Uncompacted user space size */
+ user_size = get_xsave_size_user();
+
+ /*
+ * XSAVES kernel size includes supervisor states and uses compacted
+ * format. XSAVEC uses compacted format, but does not save
+ * supervisor states.
+ *
+ * XSAVE[OPT] do not support supervisor states so kernel and user
+ * size is identical.
+ */
+ if (compacted)
+ kernel_size = get_xsave_compacted_size();
+ else
+ kernel_size = user_size;
+
+ kernel_default_size =
+ xstate_calculate_size(fpu_kernel_cfg.default_features, compacted);
+
+ if (!paranoid_xstate_size_valid(kernel_size))
+ return -EINVAL;
+
+ fpu_kernel_cfg.max_size = kernel_size;
+ fpu_user_cfg.max_size = user_size;
+
+ fpu_kernel_cfg.default_size = kernel_default_size;
+ fpu_user_cfg.default_size =
+ xstate_calculate_size(fpu_user_cfg.default_features, false);
+
+ guest_default_cfg.size =
+ xstate_calculate_size(guest_default_cfg.features, compacted);
+
+ return 0;
+}
+
+/*
+ * We enabled the XSAVE hardware, but something went wrong and
+ * we cannot use it. Disable it.
+ */
+static void __init fpu__init_disable_system_xstate(unsigned int legacy_size)
+{
+ pr_info("x86/fpu: XSAVE disabled\n");
+
+ fpu_kernel_cfg.max_features = 0;
+ cr4_clear_bits(X86_CR4_OSXSAVE);
+ setup_clear_cpu_cap(X86_FEATURE_XSAVE);
+
+ /* Restore the legacy size.*/
+ fpu_kernel_cfg.max_size = legacy_size;
+ fpu_kernel_cfg.default_size = legacy_size;
+ fpu_user_cfg.max_size = legacy_size;
+ fpu_user_cfg.default_size = legacy_size;
+ guest_default_cfg.size = legacy_size;
+
+ /*
+ * Prevent enabling the static branch which enables writes to the
+ * XFD MSR.
+ */
+ init_fpstate.xfd = 0;
+
+ fpstate_reset(x86_task_fpu(current));
+}
+
+static u64 __init host_default_mask(void)
+{
+ /*
+ * Exclude dynamic features (require userspace opt-in) and features
+ * that are supported only for KVM guests.
+ */
+ return ~((u64)XFEATURE_MASK_USER_DYNAMIC | XFEATURE_MASK_GUEST_SUPERVISOR);
+}
+
+static u64 __init guest_default_mask(void)
+{
+ /*
+ * Exclude dynamic features, which require userspace opt-in even
+ * for KVM guests.
+ */
+ return ~(u64)XFEATURE_MASK_USER_DYNAMIC;
+}
+
+/*
+ * Enable and initialize the xsave feature.
+ * Called once per system bootup.
+ */
+void __init fpu__init_system_xstate(unsigned int legacy_size)
+{
+ unsigned int eax, ebx, ecx, edx;
+ u64 xfeatures;
+ int err;
+ int i;
+
+ if (!boot_cpu_has(X86_FEATURE_FPU)) {
+ pr_info("x86/fpu: No FPU detected\n");
+ return;
+ }
+
+ if (!boot_cpu_has(X86_FEATURE_XSAVE)) {
+ pr_info("x86/fpu: x87 FPU will use %s\n",
+ boot_cpu_has(X86_FEATURE_FXSR) ? "FXSAVE" : "FSAVE");
+ return;
+ }
+
+ /*
+ * Find user xstates supported by the processor.
+ */
+ cpuid_count(CPUID_LEAF_XSTATE, 0, &eax, &ebx, &ecx, &edx);
+ fpu_kernel_cfg.max_features = eax + ((u64)edx << 32);
+
+ /*
+ * Find supervisor xstates supported by the processor.
+ */
+ cpuid_count(CPUID_LEAF_XSTATE, 1, &eax, &ebx, &ecx, &edx);
+ fpu_kernel_cfg.max_features |= ecx + ((u64)edx << 32);
+
+ if ((fpu_kernel_cfg.max_features & XFEATURE_MASK_FPSSE) != XFEATURE_MASK_FPSSE) {
+ /*
+ * This indicates that something really unexpected happened
+ * with the enumeration. Disable XSAVE and try to continue
+ * booting without it. This is too early to BUG().
+ */
+ pr_err("x86/fpu: FP/SSE not present amongst the CPU's xstate features: 0x%llx.\n",
+ fpu_kernel_cfg.max_features);
+ goto out_disable;
+ }
+
+ if (fpu_kernel_cfg.max_features & XFEATURE_MASK_APX &&
+ fpu_kernel_cfg.max_features & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR)) {
+ /*
+ * This is a problematic CPU configuration where two
+ * conflicting state components are both enumerated.
+ */
+ pr_err("x86/fpu: Both APX/MPX present in the CPU's xstate features: 0x%llx.\n",
+ fpu_kernel_cfg.max_features);
+ goto out_disable;
+ }
+
+ fpu_kernel_cfg.independent_features = fpu_kernel_cfg.max_features &
+ XFEATURE_MASK_INDEPENDENT;
+
+ /*
+ * Clear XSAVE features that are disabled in the normal CPUID.
+ */
+ for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
+ unsigned short cid = xsave_cpuid_features[i];
+
+ /* Careful: X86_FEATURE_FPU is 0! */
+ if ((i != XFEATURE_FP && !cid) || !boot_cpu_has(cid))
+ fpu_kernel_cfg.max_features &= ~BIT_ULL(i);
+ }
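+ /*
+ * Illustrative: if CPUID lacks X86_FEATURE_AVX512F, the OPMASK,
+ * ZMM_Hi256 and Hi16_ZMM xstate bits are all cleared here.
+ */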
+
+ if (!cpu_feature_enabled(X86_FEATURE_XFD))
+ fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;
+
+ if (!cpu_feature_enabled(X86_FEATURE_XSAVES))
+ fpu_kernel_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED;
+ else
+ fpu_kernel_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED |
+ XFEATURE_MASK_SUPERVISOR_SUPPORTED;
+
+ fpu_user_cfg.max_features = fpu_kernel_cfg.max_features;
+ fpu_user_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED;
+
+ /*
+ * Now, given maximum feature set, determine default values by
+ * applying default masks.
+ */
+ fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features & host_default_mask();
+ fpu_user_cfg.default_features = fpu_user_cfg.max_features & host_default_mask();
+ guest_default_cfg.features = fpu_kernel_cfg.max_features & guest_default_mask();
+
+ /* Store it for paranoia check at the end */
+ xfeatures = fpu_kernel_cfg.max_features;
+
+ /*
+ * Initialize the default XFD state in init_fpstate and enable the
+ * dynamic sizing mechanism if dynamic states are available. The
+ * static key cannot be enabled here because this runs before
+ * jump_label_init(). This is delayed to an initcall.
+ */
+ init_fpstate.xfd = fpu_user_cfg.max_features & XFEATURE_MASK_USER_DYNAMIC;
+
+ /* Set up compaction feature bit */
+ if (cpu_feature_enabled(X86_FEATURE_XSAVEC) ||
+ cpu_feature_enabled(X86_FEATURE_XSAVES))
+ setup_force_cpu_cap(X86_FEATURE_XCOMPACTED);
+
+ /* Enable xstate instructions to be able to continue with initialization: */
+ fpu__init_cpu_xstate();
+
+ /* Cache size, offset and flags for initialization */
+ setup_xstate_cache();
+
+ err = init_xstate_size();
+ if (err)
+ goto out_disable;
+
+ /*
+ * Update info used for ptrace frames; use standard-format size and no
+ * supervisor xstates:
+ */
+ update_regset_xstate_info(fpu_user_cfg.max_size,
+ fpu_user_cfg.max_features);
+
+ /*
+ * init_fpstate excludes dynamic states as they are large but init
+ * state is zero.
+ */
+ init_fpstate.size = fpu_kernel_cfg.default_size;
+ init_fpstate.xfeatures = fpu_kernel_cfg.default_features;
+
+ if (init_fpstate.size > sizeof(init_fpstate.regs)) {
+ pr_warn("x86/fpu: init_fpstate buffer too small (%zu < %d)\n",
+ sizeof(init_fpstate.regs), init_fpstate.size);
+ goto out_disable;
+ }
+
+ setup_init_fpu_buf();
+
+ /*
+ * Paranoia check whether something in the setup modified the
+ * xfeatures mask.
+ */
+ if (xfeatures != fpu_kernel_cfg.max_features) {
+ pr_err("x86/fpu: xfeatures modified from 0x%016llx to 0x%016llx during init\n",
+ xfeatures, fpu_kernel_cfg.max_features);
+ goto out_disable;
+ }
+
+ /*
+ * CPU capabilities initialization runs before FPU init. So
+ * X86_FEATURE_OSXSAVE is not set. Now that XSAVE is completely
+ * functional, set the feature bit so depending code works.
+ */
+ setup_force_cpu_cap(X86_FEATURE_OSXSAVE);
+
+ print_xstate_offset_size();
+ pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n",
+ fpu_kernel_cfg.max_features,
+ fpu_kernel_cfg.max_size,
+ boot_cpu_has(X86_FEATURE_XCOMPACTED) ? "compacted" : "standard");
+ return;
+
+out_disable:
+ /* something went wrong, try to boot without any XSAVE support */
+ fpu__init_disable_system_xstate(legacy_size);
+}
+
+/*
+ * Restore minimal FPU state after suspend:
+ */
+void fpu__resume_cpu(void)
+{
+ /*
+ * Restore XCR0 on xsave capable CPUs:
+ */
+ if (cpu_feature_enabled(X86_FEATURE_XSAVE))
+ xsetbv(XCR_XFEATURE_ENABLED_MASK, fpu_user_cfg.max_features);
+
+ /*
+ * Restore IA32_XSS. The same CPUID bit enumerates support
+ * of XSAVES and MSR_IA32_XSS.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_XSAVES)) {
+ wrmsrq(MSR_IA32_XSS, xfeatures_mask_supervisor() |
+ xfeatures_mask_independent());
+ }
+
+ if (fpu_state_size_dynamic())
+ wrmsrq(MSR_IA32_XFD, x86_task_fpu(current)->fpstate->xfd);
+}
+
+/*
+ * Given an xstate feature nr, calculate where in the xsave
+ * buffer the state is. Callers should ensure that the buffer
+ * is valid.
+ */
+static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+{
+ u64 xcomp_bv = xsave->header.xcomp_bv;
+
+ if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+ return NULL;
+
+ if (cpu_feature_enabled(X86_FEATURE_XCOMPACTED)) {
+ if (WARN_ON_ONCE(!(xcomp_bv & BIT_ULL(xfeature_nr))))
+ return NULL;
+ }
+
+ return (void *)xsave + xfeature_get_offset(xcomp_bv, xfeature_nr);
+}
+
+/*
+ * Given the xsave area and a state inside, this function returns the
+ * address of the state.
+ *
+ * This is the API that is called to get xstate address in either
+ * standard format or compacted format of xsave area.
+ *
+ * Note that if there is no data for the field in the xsave buffer
+ * this will return NULL.
+ *
+ * Inputs:
+ * xstate: the thread's storage area for all FPU data
+ * xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
+ * XFEATURE_SSE, etc...)
+ * Output:
+ * address of the state in the xsave area, or NULL if the
+ * field is not present in the xsave buffer.
+ */
+void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+{
+ /*
+ * Do we even *have* xsave state?
+ */
+ if (!boot_cpu_has(X86_FEATURE_XSAVE))
+ return NULL;
+
+ /*
+ * We should not ever be requesting features that we
+ * have not enabled.
+ */
+ if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+ return NULL;
+
+ /*
+ * This assumes that the last 'xsave*' instruction
+ * requested that 'xfeature_nr' be saved.
+ * If it did not, we might be seeing an old value
+ * of the field in the buffer.
+ *
+ * This can happen because the last 'xsave' did not
+ * request that this feature be saved (unlikely)
+ * or because the "init optimization" caused it
+ * to not be saved.
+ */
+ if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr)))
+ return NULL;
+
+ return __raw_xsave_addr(xsave, xfeature_nr);
+}
+EXPORT_SYMBOL_FOR_KVM(get_xsave_addr);
+
+/*
+ * Given an xstate feature nr, calculate where in the xsave buffer the state is.
+ * The xsave buffer should be in standard format, not compacted (e.g. user mode
+ * signal frames).
+ */
+void __user *get_xsave_addr_user(struct xregs_state __user *xsave, int xfeature_nr)
+{
+ if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
+ return NULL;
+
+ return (void __user *)xsave + xstate_offsets[xfeature_nr];
+}
+
+#ifdef CONFIG_ARCH_HAS_PKEYS
+
+/*
+ * This will go out and modify PKRU register to set the access
+ * rights for @pkey to @init_val.
+ */
+int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
+ unsigned long init_val)
+{
+ u32 old_pkru, new_pkru_bits = 0;
+ int pkey_shift;
+
+ /*
+ * This check implies XSAVE support. OSPKE only gets
+ * set if we enable XSAVE and we enable PKU in XCR0.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
+ return -EINVAL;
+
+ /*
+ * This code should only be called with valid 'pkey'
+ * values originating from in-kernel users. Complain
+ * if a bad value is observed.
+ */
+ if (WARN_ON_ONCE(pkey >= arch_max_pkey()))
+ return -EINVAL;
+
+ /* Set the bits we need in PKRU: */
+ if (init_val & PKEY_DISABLE_ACCESS)
+ new_pkru_bits |= PKRU_AD_BIT;
+ if (init_val & PKEY_DISABLE_WRITE)
+ new_pkru_bits |= PKRU_WD_BIT;
+
+ /* Shift the bits in to the correct place in PKRU for pkey: */
+ pkey_shift = pkey * PKRU_BITS_PER_PKEY;
+ new_pkru_bits <<= pkey_shift;
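+ /*
+ * Example: pkey = 1 with PKEY_DISABLE_WRITE yields
+ * PKRU_WD_BIT << 2, i.e. bit 3 of PKRU.
+ */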
+
+ /* Get old PKRU and mask off any old bits in place: */
+ old_pkru = read_pkru();
+ old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift);
+
+ /* Write old part along with new part: */
+ write_pkru(old_pkru | new_pkru_bits);
+
+ return 0;
+}
+#endif /* CONFIG_ARCH_HAS_PKEYS */
+
+static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
+ void *init_xstate, unsigned int size)
+{
+ membuf_write(to, from_xstate ? xstate : init_xstate, size);
+}
+
+/**
+ * __copy_xstate_to_uabi_buf - Copy kernel saved xstate to a UABI buffer
+ * @to: membuf descriptor
+ * @fpstate: The fpstate buffer from which to copy
+ * @xfeatures: The mask of xfeatures to save (XSAVE mode only)
+ * @pkru_val: The PKRU value to store in the PKRU component
+ * @copy_mode: The requested copy mode
+ *
+ * Converts from kernel XSAVE or XSAVES compacted format to UABI conforming
+ * format, i.e. from the kernel internal hardware dependent storage format
+ * to the requested @copy_mode. UABI XSTATE is always uncompacted!
+ *
+ * It supports partial copy but @to.pos always starts from zero.
+ */
+void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
+ u64 xfeatures, u32 pkru_val,
+ enum xstate_copy_mode copy_mode)
+{
+ const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
+ struct xregs_state *xinit = &init_fpstate.regs.xsave;
+ struct xregs_state *xsave = &fpstate->regs.xsave;
+ unsigned int zerofrom, i, xfeature;
+ struct xstate_header header;
+ u64 mask;
+
+ memset(&header, 0, sizeof(header));
+ header.xfeatures = xsave->header.xfeatures;
+
+ /* Mask out the feature bits depending on copy mode */
+ switch (copy_mode) {
+ case XSTATE_COPY_FP:
+ header.xfeatures &= XFEATURE_MASK_FP;
+ break;
+
+ case XSTATE_COPY_FX:
+ header.xfeatures &= XFEATURE_MASK_FP | XFEATURE_MASK_SSE;
+ break;
+
+ case XSTATE_COPY_XSAVE:
+ header.xfeatures &= fpstate->user_xfeatures & xfeatures;
+ break;
+ }
+
+ /* Copy FP state up to MXCSR */
+ copy_feature(header.xfeatures & XFEATURE_MASK_FP, &to, &xsave->i387,
+ &xinit->i387, off_mxcsr);
+
+ /* Copy MXCSR when SSE or YMM are set in the feature mask */
+ copy_feature(header.xfeatures & (XFEATURE_MASK_SSE | XFEATURE_MASK_YMM),
+ &to, &xsave->i387.mxcsr, &xinit->i387.mxcsr,
+ MXCSR_AND_FLAGS_SIZE);
+
+ /* Copy the remaining FP state */
+ copy_feature(header.xfeatures & XFEATURE_MASK_FP,
+ &to, &xsave->i387.st_space, &xinit->i387.st_space,
+ sizeof(xsave->i387.st_space));
+
+ /* Copy the SSE state - shared with YMM, but independently managed */
+ copy_feature(header.xfeatures & XFEATURE_MASK_SSE,
+ &to, &xsave->i387.xmm_space, &xinit->i387.xmm_space,
+ sizeof(xsave->i387.xmm_space));
+
+ if (copy_mode != XSTATE_COPY_XSAVE)
+ goto out;
+
+ /* Zero the padding area */
+ membuf_zero(&to, sizeof(xsave->i387.padding));
+
+ /* Copy xsave->i387.sw_reserved */
+ membuf_write(&to, xstate_fx_sw_bytes, sizeof(xsave->i387.sw_reserved));
+
+ /* Copy the user space relevant state of @xsave->header */
+ membuf_write(&to, &header, sizeof(header));
+
+ zerofrom = offsetof(struct xregs_state, extended_state_area);
+
+ /*
+ * This 'mask' indicates which states to copy from fpstate.
+ * Those extended states that are not present in fpstate are
+ * either disabled or initialized:
+ *
+ * In non-compacted format, disabled features still occupy
+ * state space but there is no state to copy from in the
+ * compacted init_fpstate. The gap tracking will zero these
+ * states.
+ *
+ * The extended features have an all zeroes init state. Thus,
+ * remove them from 'mask' to zero those features in the user
+ * buffer instead of retrieving them from init_fpstate.
+ */
+ mask = header.xfeatures;
+
+ for_each_extended_xfeature_in_order(i, mask) {
+ xfeature = xfeature_uncompact_order[i];
+ /*
+ * If there was a feature or alignment gap, zero the space
+ * in the destination buffer.
+ */
+ if (zerofrom < xstate_offsets[xfeature])
+ membuf_zero(&to, xstate_offsets[xfeature] - zerofrom);
+
+ if (xfeature == XFEATURE_PKRU) {
+ struct pkru_state pkru = {0};
+ /*
+ * PKRU is not necessarily up to date in the
+ * XSAVE buffer. Use the provided value.
+ */
+ pkru.pkru = pkru_val;
+ membuf_write(&to, &pkru, sizeof(pkru));
+ } else {
+ membuf_write(&to,
+ __raw_xsave_addr(xsave, xfeature),
+ xstate_sizes[xfeature]);
+ }
+ /*
+ * Keep track of the last copied state in the non-compacted
+ * target buffer for gap zeroing.
+ */
+ zerofrom = xstate_offsets[xfeature] + xstate_sizes[xfeature];
+ }
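+ /*
+ * Anything between the last copied component and the end of the
+ * buffer is zeroed below via the remaining 'to.left'.
+ */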
+
+out:
+ if (to.left)
+ membuf_zero(&to, to.left);
+}
+
+/**
+ * copy_xstate_to_uabi_buf - Copy kernel saved xstate to a UABI buffer
+ * @to: membuf descriptor
+ * @tsk: The task from which to copy the saved xstate
+ * @copy_mode: The requested copy mode
+ *
+ * Converts from kernel XSAVE or XSAVES compacted format to UABI conforming
+ * format, i.e. from the kernel internal hardware dependent storage format
+ * to the requested @copy_mode. UABI XSTATE is always uncompacted!
+ *
+ * It supports partial copy but @to.pos always starts from zero.
+ */
+void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
+ enum xstate_copy_mode copy_mode)
+{
+ __copy_xstate_to_uabi_buf(to, x86_task_fpu(tsk)->fpstate,
+ x86_task_fpu(tsk)->fpstate->user_xfeatures,
+ tsk->thread.pkru, copy_mode);
+}
+
+static int copy_from_buffer(void *dst, unsigned int offset, unsigned int size,
+ const void *kbuf, const void __user *ubuf)
+{
+ if (kbuf) {
+ memcpy(dst, kbuf + offset, size);
+ } else {
+ if (copy_from_user(dst, ubuf + offset, size))
+ return -EFAULT;
+ }
+ return 0;
+}
+
+
+/**
+ * copy_uabi_to_xstate - Copy a UABI format buffer to the kernel xstate
+ * @fpstate: The fpstate buffer to copy to
+ * @kbuf: The UABI format buffer, if it comes from the kernel
+ * @ubuf: The UABI format buffer, if it comes from userspace
+ * @pkru: The location to write the PKRU value to
+ *
+ * Converts from the UABI format into the kernel internal hardware
+ * dependent format.
+ *
+ * This function ultimately has three different callers with distinct PKRU
+ * behavior.
+ * 1. When called from sigreturn the PKRU register will be restored from
+ * @fpstate via an XRSTOR. Correctly copying the UABI format buffer to
+ * @fpstate is sufficient to cover this case, but the caller will also
+ * pass a pointer to the thread_struct's pkru field in @pkru and updating
+ * it is harmless.
+ * 2. When called from ptrace the PKRU register will be restored from the
+ * thread_struct's pkru field. A pointer to that is passed in @pkru.
+ * The kernel will restore it manually, so the XRSTOR behavior that resets
+ * the PKRU register to the hardware init value (0) if the corresponding
+ * xfeatures bit is not set is emulated here.
+ * 3. When called from KVM the PKRU register will be restored from the vcpu's
+ * pkru field. A pointer to that is passed in @pkru. KVM hasn't used
+ * XRSTOR and hasn't had the PKRU resetting behavior described above. To
+ * preserve that KVM behavior, it passes NULL for @pkru if the xfeatures
+ * bit is not set.
+ */
+static int copy_uabi_to_xstate(struct fpstate *fpstate, const void *kbuf,
+ const void __user *ubuf, u32 *pkru)
+{
+ struct xregs_state *xsave = &fpstate->regs.xsave;
+ unsigned int offset, size;
+ struct xstate_header hdr;
+ u64 mask;
+ int i;
+
+ offset = offsetof(struct xregs_state, header);
+ if (copy_from_buffer(&hdr, offset, sizeof(hdr), kbuf, ubuf))
+ return -EFAULT;
+
+ if (validate_user_xstate_header(&hdr, fpstate))
+ return -EINVAL;
+
+ /* Validate MXCSR when any of the related features is in use */
+ mask = XFEATURE_MASK_FP | XFEATURE_MASK_SSE | XFEATURE_MASK_YMM;
+ if (hdr.xfeatures & mask) {
+ u32 mxcsr[2];
+
+ offset = offsetof(struct fxregs_state, mxcsr);
+ if (copy_from_buffer(mxcsr, offset, sizeof(mxcsr), kbuf, ubuf))
+ return -EFAULT;
+
+ /* Reserved bits in MXCSR must be zero. */
+ if (mxcsr[0] & ~mxcsr_feature_mask)
+ return -EINVAL;
+
+ /* SSE and YMM require MXCSR even when FP is not in use. */
+ if (!(hdr.xfeatures & XFEATURE_MASK_FP)) {
+ xsave->i387.mxcsr = mxcsr[0];
+ xsave->i387.mxcsr_mask = mxcsr[1];
+ }
+ }
+
+ for (i = 0; i < XFEATURE_MAX; i++) {
+ mask = BIT_ULL(i);
+
+ if (hdr.xfeatures & mask) {
+ void *dst = __raw_xsave_addr(xsave, i);
+
+ offset = xstate_offsets[i];
+ size = xstate_sizes[i];
+
+ if (copy_from_buffer(dst, offset, size, kbuf, ubuf))
+ return -EFAULT;
+ }
+ }
+
+ if (hdr.xfeatures & XFEATURE_MASK_PKRU) {
+ struct pkru_state *xpkru;
+
+ xpkru = __raw_xsave_addr(xsave, XFEATURE_PKRU);
+ *pkru = xpkru->pkru;
+ } else {
+ /*
+ * KVM may pass NULL here to indicate that it does not need
+ * PKRU updated.
+ */
+ if (pkru)
+ *pkru = 0;
+ }
+
+ /*
+ * The state that came in from userspace was user-state only.
+ * Mask all the user states out of 'xfeatures':
+ */
+ xsave->header.xfeatures &= XFEATURE_MASK_SUPERVISOR_ALL;
+
+ /*
+ * Add back in the features that came in from userspace:
+ */
+ xsave->header.xfeatures |= hdr.xfeatures;
+
+ return 0;
+}
+
+/*
+ * Convert from a ptrace standard-format kernel buffer to kernel XSAVE[S]
+ * format and copy to the target thread. Used by ptrace and KVM.
+ */
+int copy_uabi_from_kernel_to_xstate(struct fpstate *fpstate, const void *kbuf, u32 *pkru)
+{
+ return copy_uabi_to_xstate(fpstate, kbuf, NULL, pkru);
+}
+
+/*
+ * Convert from a sigreturn standard-format user-space buffer to kernel
+ * XSAVE[S] format and copy to the target thread. This is called from the
+ * sigreturn() and rt_sigreturn() system calls.
+ */
+int copy_sigframe_from_user_to_xstate(struct task_struct *tsk,
+ const void __user *ubuf)
+{
+ return copy_uabi_to_xstate(x86_task_fpu(tsk)->fpstate, NULL, ubuf, &tsk->thread.pkru);
+}
+
+static bool validate_independent_components(u64 mask)
+{
+ u64 xchk;
+
+ if (WARN_ON_FPU(!cpu_feature_enabled(X86_FEATURE_XSAVES)))
+ return false;
+
+ xchk = ~xfeatures_mask_independent();
+
+ if (WARN_ON_ONCE(!mask || mask & xchk))
+ return false;
+
+ return true;
+}
+
+/**
+ * xsaves - Save selected components to a kernel xstate buffer
+ * @xstate: Pointer to the buffer
+ * @mask: Feature mask to select the components to save
+ *
+ * The @xstate buffer must be 64 byte aligned and correctly initialized as
+ * XSAVES does not write the full xstate header. Before first use the
+ * buffer should be zeroed otherwise a consecutive XRSTORS from that buffer
+ * can #GP.
+ *
+ * The feature mask must be a subset of the independent features.
+ */
+void xsaves(struct xregs_state *xstate, u64 mask)
+{
+ int err;
+
+ if (!validate_independent_components(mask))
+ return;
+
+ XSTATE_OP(XSAVES, xstate, (u32)mask, (u32)(mask >> 32), err);
+ WARN_ON_ONCE(err);
+}
+
+/**
+ * xrstors - Restore selected components from a kernel xstate buffer
+ * @xstate: Pointer to the buffer
+ * @mask: Feature mask to select the components to restore
+ *
+ * The @xstate buffer must be 64 byte aligned and correctly initialized
+ * otherwise XRSTORS from that buffer can #GP.
+ *
+ * Proper usage is to restore the state which was saved with
+ * xsaves() into @xstate.
+ *
+ * The feature mask must be a subset of the independent features.
+ */
+void xrstors(struct xregs_state *xstate, u64 mask)
+{
+ int err;
+
+ if (!validate_independent_components(mask))
+ return;
+
+ XSTATE_OP(XRSTORS, xstate, (u32)mask, (u32)(mask >> 32), err);
+ WARN_ON_ONCE(err);
+}
+
+#if IS_ENABLED(CONFIG_KVM)
+void fpstate_clear_xstate_component(struct fpstate *fpstate, unsigned int xfeature)
+{
+ void *addr = get_xsave_addr(&fpstate->regs.xsave, xfeature);
+
+ if (addr)
+ memset(addr, 0, xstate_sizes[xfeature]);
+}
+EXPORT_SYMBOL_FOR_KVM(fpstate_clear_xstate_component);
+#endif
+
+#ifdef CONFIG_X86_64
+
+#ifdef CONFIG_X86_DEBUG_FPU
+/*
+ * Ensure that a subsequent XSAVE* or XRSTOR* instruction with RFBM=@mask
+ * can safely operate on the @fpstate buffer.
+ */
+static bool xstate_op_valid(struct fpstate *fpstate, u64 mask, bool rstor)
+{
+ u64 xfd = __this_cpu_read(xfd_state);
+
+ if (fpstate->xfd == xfd)
+ return true;
+
+ /*
+ * The XFD MSR does not match fpstate->xfd. That's invalid when
+ * the passed in fpstate is current's fpstate.
+ */
+ if (fpstate->xfd == x86_task_fpu(current)->fpstate->xfd)
+ return false;
+
+ /*
+ * XRSTOR(S) from init_fpstate are always correct as it will just
+ * bring all components into init state and not read from the
+ * buffer. XSAVE(S) raises #PF after init.
+ */
+ if (fpstate == &init_fpstate)
+ return rstor;
+
+ /*
+ * XSAVE(S): clone(), fpu_swap_kvm_fpstate()
+ * XRSTOR(S): fpu_swap_kvm_fpstate()
+ */
+
+ /*
+ * No XSAVE/XRSTOR instructions (except XSAVE itself) touch
+ * the buffer area for XFD-disabled state components.
+ */
+ mask &= ~xfd;
+
+ /*
+ * Remove features which are valid in fpstate. They
+ * have space allocated in fpstate.
+ */
+ mask &= ~fpstate->xfeatures;
+
+ /*
+ * Any remaining state components in 'mask' might be written
+ * by XSAVE/XRSTOR. Fail validation if any are found.
+ */
+ return !mask;
+}
+
+void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rstor)
+{
+ WARN_ON_ONCE(!xstate_op_valid(fpstate, mask, rstor));
+}
+#endif /* CONFIG_X86_DEBUG_FPU */
+
+static int __init xfd_update_static_branch(void)
+{
+ /*
+ * If init_fpstate.xfd has bits set then dynamic features are
+ * available and the dynamic sizing must be enabled.
+ */
+ if (init_fpstate.xfd)
+ static_branch_enable(&__fpu_state_size_dynamic);
+ return 0;
+}
+arch_initcall(xfd_update_static_branch)
+
+void fpstate_free(struct fpu *fpu)
+{
+ if (fpu->fpstate && fpu->fpstate != &fpu->__fpstate)
+ vfree(fpu->fpstate);
+}
+
+/**
+ * fpstate_realloc - Reallocate struct fpstate for the requested new features
+ *
+ * @xfeatures: A bitmap of xstate features which extend the enabled features
+ * of that task
+ * @ksize: The required size for the kernel buffer
+ * @usize: The required size for user space buffers
+ * @guest_fpu: Pointer to a guest FPU container. NULL for host allocations
+ *
+ * Note vs. vmalloc(): If the task with a vzalloc()-allocated buffer
+ * terminates quickly, vfree()-induced IPIs may be a concern, but tasks
+ * with large states are likely to live longer.
+ *
+ * Returns: 0 on success, -ENOMEM on allocation error.
+ */
+static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
+ unsigned int usize, struct fpu_guest *guest_fpu)
+{
+ struct fpu *fpu = x86_task_fpu(current);
+ struct fpstate *curfps, *newfps = NULL;
+ unsigned int fpsize;
+ bool in_use;
+
+ fpsize = ksize + ALIGN(offsetof(struct fpstate, regs), 64);
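+ /*
+ * vzalloc() returns page-aligned memory, so placing 'regs' at
+ * ALIGN(offsetof(struct fpstate, regs), 64) keeps the XSAVE area
+ * 64-byte aligned.
+ */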
+
+ newfps = vzalloc(fpsize);
+ if (!newfps)
+ return -ENOMEM;
+ newfps->size = ksize;
+ newfps->user_size = usize;
+ newfps->is_valloc = true;
+
+ /*
+ * When a guest FPU is supplied, use @guest_fpu->fpstate
+ * as the reference, independent of whether it is in use or not.
+ */
+ curfps = guest_fpu ? guest_fpu->fpstate : fpu->fpstate;
+
+ /* Determine whether @curfps is the active fpstate */
+ in_use = fpu->fpstate == curfps;
+
+ if (guest_fpu) {
+ newfps->is_guest = true;
+ newfps->is_confidential = curfps->is_confidential;
+ newfps->in_use = curfps->in_use;
+ guest_fpu->xfeatures |= xfeatures;
+ guest_fpu->uabi_size = usize;
+ }
+
+ fpregs_lock();
+ /*
+ * If @curfps is in use, ensure that the current state is in the
+ * registers before swapping fpstate as that might invalidate it
+ * due to layout changes.
+ */
+ if (in_use && test_thread_flag(TIF_NEED_FPU_LOAD))
+ fpregs_restore_userregs();
+
+ newfps->xfeatures = curfps->xfeatures | xfeatures;
+ newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
+ newfps->xfd = curfps->xfd & ~xfeatures;
+
+ /* Do the final updates within the locked region */
+ xstate_init_xcomp_bv(&newfps->regs.xsave, newfps->xfeatures);
+
+ if (guest_fpu) {
+ guest_fpu->fpstate = newfps;
+ /* If curfps is active, update the FPU fpstate pointer */
+ if (in_use)
+ fpu->fpstate = newfps;
+ } else {
+ fpu->fpstate = newfps;
+ }
+
+ if (in_use)
+ xfd_update_state(fpu->fpstate);
+ fpregs_unlock();
+
+ /* Only free valloc'ed state */
+ if (curfps && curfps->is_valloc)
+ vfree(curfps);
+
+ return 0;
+}
+
+static int validate_sigaltstack(unsigned int usize)
+{
+ struct task_struct *thread, *leader = current->group_leader;
+ unsigned long framesize = get_sigframe_size();
+
+ lockdep_assert_held(&current->sighand->siglock);
+
+ /* get_sigframe_size() is based on fpu_user_cfg.max_size */
+ framesize -= fpu_user_cfg.max_size;
+ framesize += usize;
+ for_each_thread(leader, thread) {
+ if (thread->sas_ss_size && thread->sas_ss_size < framesize)
+ return -ENOSPC;
+ }
+ return 0;
+}
+
+static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
+{
+ /*
+ * This deliberately does not exclude !XSAVES as we still might
+ * decide to optionally context switch XCR0 or talk the silicon
+ * vendors into extending XFD for the pre-AMX states, especially
+ * AVX512.
+ */
+ bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+ struct fpu *fpu = x86_task_fpu(current->group_leader);
+ struct fpu_state_perm *perm;
+ unsigned int ksize, usize;
+ u64 mask;
+ int ret = 0;
+
+ /* Check whether fully enabled */
+ if ((permitted & requested) == requested)
+ return 0;
+
+ /*
+ * Calculate the resulting kernel state size. Note, @permitted also
+ * contains supervisor xfeatures even though supervisor are always
+ * permitted for kernel and guest FPUs, and never permitted for user
+ * FPUs.
+ */
+ mask = permitted | requested;
+ ksize = xstate_calculate_size(mask, compacted);
+
+ /*
+ * Calculate the resulting user state size. Take care not to clobber
+ * the supervisor xfeatures in the new mask!
+ */
+ usize = xstate_calculate_size(mask & XFEATURE_MASK_USER_SUPPORTED, false);
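+ /*
+ * Illustrative: a request for XFEATURE_MASK_XTILE_DATA grows both
+ * sizes by the AMX tile data state (8 KiB on current hardware).
+ */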
+
+ if (!guest) {
+ ret = validate_sigaltstack(usize);
+ if (ret)
+ return ret;
+ }
+
+ perm = guest ? &fpu->guest_perm : &fpu->perm;
+ /* Pairs with the READ_ONCE() in xstate_get_group_perm() */
+ WRITE_ONCE(perm->__state_perm, mask);
+ /* Protected by sighand lock */
+ perm->__state_size = ksize;
+ perm->__user_state_size = usize;
+ return ret;
+}
+
+/*
+ * Permissions array to map facilities with more than one component
+ */
+static const u64 xstate_prctl_req[XFEATURE_MAX] = {
+ [XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE_DATA,
+};
+
+static int xstate_request_perm(unsigned long idx, bool guest)
+{
+ u64 permitted, requested;
+ int ret;
+
+ if (idx >= XFEATURE_MAX)
+ return -EINVAL;
+
+ /*
+ * Look up the facility mask which can require more than
+ * one xstate component.
+ */
+ idx = array_index_nospec(idx, ARRAY_SIZE(xstate_prctl_req));
+ requested = xstate_prctl_req[idx];
+ if (!requested)
+ return -EOPNOTSUPP;
+
+ if ((fpu_user_cfg.max_features & requested) != requested)
+ return -EOPNOTSUPP;
+
+ /* Lockless quick check */
+ permitted = xstate_get_group_perm(guest);
+ if ((permitted & requested) == requested)
+ return 0;
+
+ /* Protect against concurrent modifications */
+ spin_lock_irq(&current->sighand->siglock);
+ permitted = xstate_get_group_perm(guest);
+
+ /* First vCPU allocation locks the permissions. */
+ if (guest && (permitted & FPU_GUEST_PERM_LOCKED))
+ ret = -EBUSY;
+ else
+ ret = __xstate_request_perm(permitted, requested, guest);
+ spin_unlock_irq(&current->sighand->siglock);
+ return ret;
+}
+
+int __xfd_enable_feature(u64 xfd_err, struct fpu_guest *guest_fpu)
+{
+ u64 xfd_event = xfd_err & XFEATURE_MASK_USER_DYNAMIC;
+ struct fpu_state_perm *perm;
+ unsigned int ksize, usize;
+ struct fpu *fpu;
+
+ if (!xfd_event) {
+ if (!guest_fpu)
+ pr_err_once("XFD: Invalid xfd error: %016llx\n", xfd_err);
+ return 0;
+ }
+
+ /* Protect against concurrent modifications */
+ spin_lock_irq(&current->sighand->siglock);
+
+ /* If not permitted, let it die */
+ if ((xstate_get_group_perm(!!guest_fpu) & xfd_event) != xfd_event) {
+ spin_unlock_irq(&current->sighand->siglock);
+ return -EPERM;
+ }
+
+ fpu = x86_task_fpu(current->group_leader);
+ perm = guest_fpu ? &fpu->guest_perm : &fpu->perm;
+ ksize = perm->__state_size;
+ usize = perm->__user_state_size;
+
+ /*
+ * The feature is permitted and the state size is sufficient. Dropping
+ * the lock is safe here: even if more features are added by another
+ * task, the retrieved buffer sizes remain valid for the currently
+ * requested feature(s).
+ */
+ spin_unlock_irq(&current->sighand->siglock);
+
+ /*
+ * Try to allocate a new fpstate. If that fails there is no way
+ * out.
+ */
+ if (fpstate_realloc(xfd_event, ksize, usize, guest_fpu))
+ return -EFAULT;
+ return 0;
+}
+
+int xfd_enable_feature(u64 xfd_err)
+{
+ return __xfd_enable_feature(xfd_err, NULL);
+}
+
+#else /* CONFIG_X86_64 */
+static inline int xstate_request_perm(unsigned long idx, bool guest)
+{
+ return -EPERM;
+}
+#endif /* !CONFIG_X86_64 */
+
+u64 xstate_get_guest_group_perm(void)
+{
+ return xstate_get_group_perm(true);
+}
+EXPORT_SYMBOL_FOR_KVM(xstate_get_guest_group_perm);
+
+/**
+ * fpu_xstate_prctl - xstate permission operations
+ * @option: A subfunction of arch_prctl()
+ * @arg2: option argument
+ * Return: 0 if successful; otherwise, an error code
+ *
+ * Option arguments:
+ *
+ * ARCH_GET_XCOMP_SUPP: Pointer to user space u64 to store the info
+ * ARCH_GET_XCOMP_PERM: Pointer to user space u64 to store the info
+ * ARCH_REQ_XCOMP_PERM: Facility number requested
+ *
+ * For facilities which require more than one XSTATE component, the request
+ * must be the highest state component number related to that facility,
+ * e.g. for AMX which requires XFEATURE_XTILE_CFG(17) and
+ * XFEATURE_XTILE_DATA(18) this would be XFEATURE_XTILE_DATA(18).
+ */
+long fpu_xstate_prctl(int option, unsigned long arg2)
+{
+ u64 __user *uptr = (u64 __user *)arg2;
+ u64 permitted, supported;
+ unsigned long idx = arg2;
+ bool guest = false;
+
+ switch (option) {
+ case ARCH_GET_XCOMP_SUPP:
+ supported = fpu_user_cfg.max_features | fpu_user_cfg.legacy_features;
+ return put_user(supported, uptr);
+
+ case ARCH_GET_XCOMP_PERM:
+ /*
+ * Lockless snapshot as it can also change right after dropping
+ * the lock.
+ */
+ permitted = xstate_get_host_group_perm();
+ permitted &= XFEATURE_MASK_USER_SUPPORTED;
+ return put_user(permitted, uptr);
+
+ case ARCH_GET_XCOMP_GUEST_PERM:
+ permitted = xstate_get_guest_group_perm();
+ permitted &= XFEATURE_MASK_USER_SUPPORTED;
+ return put_user(permitted, uptr);
+
+ case ARCH_REQ_XCOMP_GUEST_PERM:
+ guest = true;
+ fallthrough;
+
+ case ARCH_REQ_XCOMP_PERM:
+ if (!IS_ENABLED(CONFIG_X86_64))
+ return -EOPNOTSUPP;
+
+ return xstate_request_perm(idx, guest);
+
+ default:
+ return -EINVAL;
+ }
+}
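
As a concrete illustration of the ABI documented above (not part of this patch), a minimal userspace sketch requesting AMX permission; the ARCH_* constant values are assumed from asm/prctl.h at the time of writing:

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define ARCH_GET_XCOMP_SUPP 0x1021
    #define ARCH_GET_XCOMP_PERM 0x1022
    #define ARCH_REQ_XCOMP_PERM 0x1023
    #define XFEATURE_XTILE_DATA 18

    int main(void)
    {
            unsigned long long supp = 0, perm = 0;

            if (syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &supp))
                    return 1;
            if (!(supp & (1ULL << XFEATURE_XTILE_DATA)))
                    return 1;       /* AMX not supported */

            /* Pass the highest component number of the facility, per above */
            if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILE_DATA))
                    return 1;

            syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &perm);
            printf("permitted mask: %#llx\n", perm);
            return 0;
    }
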
+
+#ifdef CONFIG_PROC_PID_ARCH_STATUS
+/*
+ * Report the time in milliseconds elapsed since the task last used
+ * AVX-512. Report -1 if there was no AVX-512 usage.
+ */
+static void avx512_status(struct seq_file *m, struct task_struct *task)
+{
+ unsigned long timestamp;
+ long delta = -1;
+
+ /* AVX-512 usage is not tracked for kernel threads. Don't report anything. */
+ if (task->flags & (PF_KTHREAD | PF_USER_WORKER))
+ return;
+
+ timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
+
+ if (timestamp) {
+ delta = (long)(jiffies - timestamp);
+ /*
+ * Cap to LONG_MAX if time difference > LONG_MAX
+ */
+ if (delta < 0)
+ delta = LONG_MAX;
+ delta = jiffies_to_msecs(delta);
+ }
+
+ seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta);
+ seq_putc(m, '\n');
+}
+
+/*
+ * Report architecture specific information
+ */
+int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
+{
+ /*
+ * Report AVX-512 usage if both the processor and the kernel build support it.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_AVX512F))
+ avx512_status(m, task);
+
+ return 0;
+}
+#endif /* CONFIG_PROC_PID_ARCH_STATUS */
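
A hedged sketch (not part of this patch) of consuming the field emitted above from /proc/<pid>/arch_status, available when CONFIG_PROC_PID_ARCH_STATUS is set:

    #include <stdio.h>

    /* Returns the AVX512_elapsed_ms value for the current task, or -1
     * if AVX-512 was never used or the field is absent. */
    static long avx512_elapsed_ms(void)
    {
            char line[128];
            long ms = -1;
            FILE *f = fopen("/proc/self/arch_status", "r");

            if (!f)
                    return -1;
            while (fgets(line, sizeof(line), f)) {
                    if (sscanf(line, "AVX512_elapsed_ms: %ld", &ms) == 1)
                            break;
            }
            fclose(f);
            return ms;
    }
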
+
+#ifdef CONFIG_COREDUMP
+static const char owner_name[] = "LINUX";
+
+/*
+ * Dump type, size, offset and flag values for every xfeature that is present.
+ */
+static int dump_xsave_layout_desc(struct coredump_params *cprm)
+{
+ int num_records = 0;
+ int i;
+
+ for_each_extended_xfeature(i, fpu_user_cfg.max_features) {
+ struct x86_xfeat_component xc = {
+ .type = i,
+ .size = xstate_sizes[i],
+ .offset = xstate_offsets[i],
+ /* reserved for future use */
+ .flags = 0,
+ };
+
+ if (!dump_emit(cprm, &xc, sizeof(xc)))
+ return 0;
+
+ num_records++;
+ }
+ return num_records;
+}
+
+static u32 get_xsave_desc_size(void)
+{
+ u32 cnt = 0;
+ u32 i;
+
+ for_each_extended_xfeature(i, fpu_user_cfg.max_features)
+ cnt++;
+
+ return cnt * (sizeof(struct x86_xfeat_component));
+}
+
+int elf_coredump_extra_notes_write(struct coredump_params *cprm)
+{
+ int num_records = 0;
+ struct elf_note en;
+
+ if (!fpu_user_cfg.max_features)
+ return 0;
+
+ en.n_namesz = sizeof(owner_name);
+ en.n_descsz = get_xsave_desc_size();
+ en.n_type = NT_X86_XSAVE_LAYOUT;
+
+ if (!dump_emit(cprm, &en, sizeof(en)))
+ return 1;
+ if (!dump_emit(cprm, owner_name, en.n_namesz))
+ return 1;
+ if (!dump_align(cprm, 4))
+ return 1;
+
+ num_records = dump_xsave_layout_desc(cprm);
+ if (!num_records)
+ return 1;
+
+ /* The size of the emitted records must match the advertised descriptor size */
+ if ((sizeof(struct x86_xfeat_component) * num_records) != en.n_descsz)
+ return 1;
+
+ return 0;
+}
+
+int elf_coredump_extra_notes_size(void)
+{
+ int size;
+
+ if (!fpu_user_cfg.max_features)
+ return 0;
+
+ /* .note header */
+ size = sizeof(struct elf_note);
+ /* Name plus alignment to 4 bytes */
+ size += roundup(sizeof(owner_name), 4);
+ size += get_xsave_desc_size();
+
+ return size;
+}
+#endif /* CONFIG_COREDUMP */
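
A hedged sketch of a consumer for the note emitted above (not part of this patch): the descriptor is an array of x86_xfeat_component records mirroring the struct used by dump_xsave_layout_desc(); the field layout is assumed to match the uapi definition:

    #include <stdint.h>
    #include <stdio.h>

    struct x86_xfeat_component {
            uint32_t type;
            uint32_t size;
            uint32_t offset;
            uint32_t flags;
    };

    /* Walk the NT_X86_XSAVE_LAYOUT note descriptor of a core file */
    static void print_xsave_layout(const void *desc, uint32_t descsz)
    {
            const struct x86_xfeat_component *xc = desc;
            uint32_t i, n = descsz / sizeof(*xc);

            for (i = 0; i < n; i++)
                    printf("xfeature %u: size %u offset %u\n",
                           xc[i].type, xc[i].size, xc[i].offset);
    }
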
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
new file mode 100644
index 000000000000..52ce19289989
--- /dev/null
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -0,0 +1,368 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __X86_KERNEL_FPU_XSTATE_H
+#define __X86_KERNEL_FPU_XSTATE_H
+
+#include <asm/cpufeature.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/xcr.h>
+#include <asm/msr.h>
+
+#ifdef CONFIG_X86_64
+DECLARE_PER_CPU(u64, xfd_state);
+#endif
+
+static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
+{
+ /*
+ * XRSTORS requires these bits set in xcomp_bv, or it will
+ * trigger #GP:
+ */
+ if (cpu_feature_enabled(X86_FEATURE_XCOMPACTED))
+ xsave->header.xcomp_bv = mask | XCOMP_BV_COMPACTED_FORMAT;
+}
+
+static inline u64 xstate_get_group_perm(bool guest)
+{
+ struct fpu *fpu = x86_task_fpu(current->group_leader);
+ struct fpu_state_perm *perm;
+
+ /* Pairs with WRITE_ONCE() in xstate_request_perm() */
+ perm = guest ? &fpu->guest_perm : &fpu->perm;
+ return READ_ONCE(perm->__state_perm);
+}
+
+static inline u64 xstate_get_host_group_perm(void)
+{
+ return xstate_get_group_perm(false);
+}
+
+enum xstate_copy_mode {
+ XSTATE_COPY_FP,
+ XSTATE_COPY_FX,
+ XSTATE_COPY_XSAVE,
+};
+
+struct membuf;
+extern void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
+ u64 xfeatures, u32 pkru_val,
+ enum xstate_copy_mode copy_mode);
+extern void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
+ enum xstate_copy_mode mode);
+extern int copy_uabi_from_kernel_to_xstate(struct fpstate *fpstate, const void *kbuf, u32 *pkru);
+extern int copy_sigframe_from_user_to_xstate(struct task_struct *tsk, const void __user *ubuf);
+
+
+extern void fpu__init_cpu_xstate(void);
+extern void fpu__init_system_xstate(unsigned int legacy_size);
+
+extern void __user *get_xsave_addr_user(struct xregs_state __user *xsave, int xfeature_nr);
+
+static inline u64 xfeatures_mask_supervisor(void)
+{
+ return fpu_kernel_cfg.max_features & XFEATURE_MASK_SUPERVISOR_SUPPORTED;
+}
+
+static inline u64 xfeatures_mask_independent(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_ARCH_LBR))
+ return fpu_kernel_cfg.independent_features & ~XFEATURE_MASK_LBR;
+
+ return fpu_kernel_cfg.independent_features;
+}
+
+static inline int set_xfeature_in_sigframe(struct xregs_state __user *xbuf, u64 mask)
+{
+ u64 xfeatures;
+ int err;
+
+ /* Read the xfeatures value already saved in the user buffer */
+ err = __get_user(xfeatures, &xbuf->header.xfeatures);
+ xfeatures |= mask;
+ err |= __put_user(xfeatures, &xbuf->header.xfeatures);
+
+ return err;
+}
+
+/*
+ * Update the value of PKRU register that was already pushed onto the signal frame.
+ */
+static inline int update_pkru_in_sigframe(struct xregs_state __user *buf, u32 pkru)
+{
+ int err;
+
+ if (unlikely(!cpu_feature_enabled(X86_FEATURE_OSPKE)))
+ return 0;
+
+ /* Mark PKRU as in-use so that it is restored correctly. */
+ err = set_xfeature_in_sigframe(buf, XFEATURE_MASK_PKRU);
+ if (err)
+ return err;
+
+ /* Update PKRU value in the userspace xsave buffer. */
+ return __put_user(pkru, (unsigned int __user *)get_xsave_addr_user(buf, XFEATURE_PKRU));
+}
+
+/* XSAVE/XRSTOR wrapper functions */
+
+#ifdef CONFIG_X86_64
+#define REX_SUFFIX "64"
+#else
+#define REX_SUFFIX
+#endif
+
+#define XSAVE "xsave" REX_SUFFIX " %[xa]"
+#define XSAVEOPT "xsaveopt" REX_SUFFIX " %[xa]"
+#define XSAVEC "xsavec" REX_SUFFIX " %[xa]"
+#define XSAVES "xsaves" REX_SUFFIX " %[xa]"
+#define XRSTOR "xrstor" REX_SUFFIX " %[xa]"
+#define XRSTORS "xrstors" REX_SUFFIX " %[xa]"
+
+/*
+ * After this @err contains 0 on success or the trap number when the
+ * operation raises an exception.
+ *
+ * The [xa] input parameter below represents the struct xregs_state pointer
+ * and the asm symbolic name for the argument used in the XSAVE/XRSTOR insns
+ * above.
+ */
+#define XSTATE_OP(op, st, lmask, hmask, err) \
+ asm volatile("1:" op "\n\t" \
+ "xor %[err], %[err]\n" \
+ "2:\n" \
+ _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_FAULT_MCE_SAFE) \
+ : [err] "=a" (err) \
+ : [xa] "m" (*(st)), "a" (lmask), "d" (hmask) \
+ : "memory")
+
+/*
+ * If XSAVES is enabled, it is preferred over XSAVEC because it additionally
+ * supports supervisor states.
+ *
+ * Otherwise, if XSAVEC is enabled, it is preferred over XSAVEOPT because it
+ * additionally supports the compacted storage format.
+ *
+ * Otherwise, if XSAVEOPT is enabled, it is preferred over XSAVE because it
+ * supports the modified optimization which XSAVE does not.
+ *
+ * Use XSAVE as a fallback.
+ */
+#define XSTATE_XSAVE(st, lmask, hmask, err) \
+ asm volatile("1: " ALTERNATIVE_3(XSAVE, \
+ XSAVEOPT, X86_FEATURE_XSAVEOPT, \
+ XSAVEC, X86_FEATURE_XSAVEC, \
+ XSAVES, X86_FEATURE_XSAVES) \
+ "\n\t" \
+ "xor %[err], %[err]\n" \
+ "3:\n" \
+ _ASM_EXTABLE_TYPE_REG(1b, 3b, EX_TYPE_EFAULT_REG, %[err]) \
+ : [err] "=r" (err) \
+ : [xa] "m" (*(st)), "a" (lmask), "d" (hmask) \
+ : "memory")
+
+/*
+ * Use XRSTORS to restore context if it is enabled. XRSTORS supports the
+ * compacted XSAVE area format.
+ */
+#define XSTATE_XRESTORE(st, lmask, hmask) \
+ asm volatile("1: " ALTERNATIVE(XRSTOR, \
+ XRSTORS, X86_FEATURE_XSAVES) \
+ "\n" \
+ "3:\n" \
+ _ASM_EXTABLE_TYPE(1b, 3b, EX_TYPE_FPU_RESTORE) \
+ : \
+ : [xa] "m" (*(st)), "a" (lmask), "d" (hmask) \
+ : "memory")
+
+#if defined(CONFIG_X86_64) && defined(CONFIG_X86_DEBUG_FPU)
+extern void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rstor);
+#else
+static inline void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rstor) { }
+#endif
+
+#ifdef CONFIG_X86_64
+static inline void xfd_set_state(u64 xfd)
+{
+ wrmsrq(MSR_IA32_XFD, xfd);
+ __this_cpu_write(xfd_state, xfd);
+}
+
+static inline void xfd_update_state(struct fpstate *fpstate)
+{
+ if (fpu_state_size_dynamic()) {
+ u64 xfd = fpstate->xfd;
+
+ if (__this_cpu_read(xfd_state) != xfd)
+ xfd_set_state(xfd);
+ }
+}
+
+extern int __xfd_enable_feature(u64 which, struct fpu_guest *guest_fpu);
+#else
+static inline void xfd_set_state(u64 xfd) { }
+
+static inline void xfd_update_state(struct fpstate *fpstate) { }
+
+static inline int __xfd_enable_feature(u64 which, struct fpu_guest *guest_fpu) {
+ return -EPERM;
+}
+#endif
+
+/*
+ * Save processor xstate to xsave area.
+ *
+ * Uses XSAVE, XSAVEOPT, XSAVEC or XSAVES depending on the CPU features
+ * and command line options. The choice is permanent until the next reboot.
+ */
+static inline void os_xsave(struct fpstate *fpstate)
+{
+ u64 mask = fpstate->xfeatures;
+ u32 lmask = mask;
+ u32 hmask = mask >> 32;
+ int err;
+
+ WARN_ON_FPU(!alternatives_patched);
+ xfd_validate_state(fpstate, mask, false);
+
+ XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
+
+ /* We should never fault when copying to a kernel buffer: */
+ WARN_ON_FPU(err);
+}
+
+/*
+ * Restore processor xstate from xsave area.
+ *
+ * Uses XRSTORS when XSAVES is used, XRSTOR otherwise.
+ */
+static inline void os_xrstor(struct fpstate *fpstate, u64 mask)
+{
+ u32 lmask = mask;
+ u32 hmask = mask >> 32;
+
+ xfd_validate_state(fpstate, mask, true);
+ XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
+}
+
+/* Restore of supervisor state. Does not require XFD */
+static inline void os_xrstor_supervisor(struct fpstate *fpstate)
+{
+ u64 mask = xfeatures_mask_supervisor();
+ u32 lmask = mask;
+ u32 hmask = mask >> 32;
+
+ XSTATE_XRESTORE(&fpstate->regs.xsave, lmask, hmask);
+}
+
+/*
+ * XSAVE itself always writes all requested xfeatures. Removing features
+ * from the request bitmap reduces the features which are written.
+ * Generate a mask of features which must be written to a sigframe. The
+ * unset features can be optimized away and not written.
+ *
+ * This optimization is user-visible. Only use for states where
+ * uninitialized sigframe contents are tolerable, like dynamic features.
+ *
+ * Users of buffers produced with this optimization must check XSTATE_BV
+ * to determine which features have been optimized out.
+ */
+static inline u64 xfeatures_need_sigframe_write(void)
+{
+ u64 xfeatures_to_write;
+
+ /* In-use features must be written: */
+ xfeatures_to_write = xfeatures_in_use();
+
+ /* Also write all non-optimizable sigframe features: */
+ xfeatures_to_write |= XFEATURE_MASK_USER_SUPPORTED &
+ ~XFEATURE_MASK_SIGFRAME_INITOPT;
+
+ return xfeatures_to_write;
+}
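
Because optimized-out features leave their sigframe slots uninitialized, a signal handler must check XSTATE_BV before consuming dynamic state, as the comment above requires. A hedged userspace sketch: it assumes the kernel sigframe layout in which uc_mcontext.fpregs points at the XSAVE area with the xstate header at byte offset 512, and is not verified against every libc's ucontext definitions:

    #include <signal.h>
    #include <stdint.h>
    #include <ucontext.h>

    #define XSAVE_HDR_OFFSET    512
    #define XFEATURE_XTILE_DATA 18

    /* Call from a SA_SIGINFO handler with the third (void *) argument */
    static int sigframe_has_tile_data(void *ucontext)
    {
            ucontext_t *uc = ucontext;
            uint8_t *xsave = (uint8_t *)uc->uc_mcontext.fpregs;
            uint64_t xstate_bv;

            /* xstate_bv is the first u64 of the xstate header */
            __builtin_memcpy(&xstate_bv, xsave + XSAVE_HDR_OFFSET,
                             sizeof(xstate_bv));
            return !!(xstate_bv & (1ULL << XFEATURE_XTILE_DATA));
    }
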
+
+/*
+ * Save xstate to user space xsave area.
+ *
+ * We don't use modified optimization because xrstor/xrstors might track
+ * a different application.
+ *
+ * We don't use compacted format xsave area for backward compatibility for
+ * old applications which don't understand the compacted format of the
+ * xsave area.
+ *
+ * The caller has to zero buf::header before calling this because XSAVE*
+ * does not touch the reserved fields in the header.
+ */
+static inline int xsave_to_user_sigframe(struct xregs_state __user *buf, u32 pkru)
+{
+ /*
+ * Include the features which are not xsaved/rstored by the kernel
+ * internally, e.g. PKRU. That's user space ABI and also required
+ * to allow the signal handler to modify PKRU.
+ */
+ struct fpstate *fpstate = x86_task_fpu(current)->fpstate;
+ u64 mask = fpstate->user_xfeatures;
+ u32 lmask;
+ u32 hmask;
+ int err;
+
+ /* Optimize away writing unnecessary xfeatures: */
+ if (fpu_state_size_dynamic())
+ mask &= xfeatures_need_sigframe_write();
+
+ lmask = mask;
+ hmask = mask >> 32;
+ xfd_validate_state(fpstate, mask, false);
+
+ stac();
+ XSTATE_OP(XSAVE, buf, lmask, hmask, err);
+ clac();
+
+ if (!err)
+ err = update_pkru_in_sigframe(buf, pkru);
+
+ return err;
+}
+
+/*
+ * Restore xstate from user space xsave area.
+ */
+static inline int xrstor_from_user_sigframe(struct xregs_state __user *buf, u64 mask)
+{
+ struct xregs_state *xstate = ((__force struct xregs_state *)buf);
+ u32 lmask = mask;
+ u32 hmask = mask >> 32;
+ int err;
+
+ xfd_validate_state(x86_task_fpu(current)->fpstate, mask, true);
+
+ stac();
+ XSTATE_OP(XRSTOR, xstate, lmask, hmask, err);
+ clac();
+
+ return err;
+}
+
+/*
+ * Restore xstate from kernel space xsave area, return an error code instead of
+ * an exception.
+ */
+static inline int os_xrstor_safe(struct fpstate *fpstate, u64 mask)
+{
+ struct xregs_state *xstate = &fpstate->regs.xsave;
+ u32 lmask = mask;
+ u32 hmask = mask >> 32;
+ int err;
+
+ /* Ensure that XFD is up to date */
+ xfd_update_state(fpstate);
+
+ if (cpu_feature_enabled(X86_FEATURE_XSAVES))
+ XSTATE_OP(XRSTORS, xstate, lmask, hmask, err);
+ else
+ XSTATE_OP(XRSTOR, xstate, lmask, hmask, err);
+
+ return err;
+}
+
+
+#endif
diff --git a/arch/x86/kernel/fred.c b/arch/x86/kernel/fred.c
new file mode 100644
index 000000000000..816187da3a47
--- /dev/null
+++ b/arch/x86/kernel/fred.c
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/kernel.h>
+
+#include <asm/desc.h>
+#include <asm/fred.h>
+#include <asm/msr.h>
+#include <asm/tlbflush.h>
+#include <asm/traps.h>
+
+/* #DB in the kernel would imply the use of a kernel debugger. */
+#define FRED_DB_STACK_LEVEL 1UL
+#define FRED_NMI_STACK_LEVEL 2UL
+#define FRED_MC_STACK_LEVEL 2UL
+/*
+ * #DF is the highest level because a #DF means "something went wrong
+ * *while delivering an exception*." The number of cases for which that
+ * can happen with FRED is drastically reduced and basically amounts to
+ * "the stack you pointed me to is broken." Thus, always change stacks
+ * on #DF, which means it should be at the highest level.
+ */
+#define FRED_DF_STACK_LEVEL 3UL
+
+#define FRED_STKLVL(vector, lvl) ((lvl) << (2 * (vector)))
+
+DEFINE_PER_CPU(unsigned long, fred_rsp0);
+EXPORT_PER_CPU_SYMBOL(fred_rsp0);
+
+void cpu_init_fred_exceptions(void)
+{
+ /* When FRED is enabled by default, remove this log message */
+ pr_info("Initialize FRED on CPU%d\n", smp_processor_id());
+
+ /*
+ * If a kernel event is delivered before a CPU goes to user level for
+ * the first time, its SS is NULL thus NULL is pushed into the SS field
+ * of the FRED stack frame. But before ERETS is executed, the CPU may
+ * context switch to another task and go to user level. Then when the
+ * CPU comes back to kernel mode, SS is changed to __KERNEL_DS. Later
+ * when ERETS is executed to return from the kernel event handler, a #GP
+ * fault is generated because SS doesn't match the SS saved in the FRED
+ * stack frame.
+ *
+ * Initialize SS to __KERNEL_DS when enabling FRED to avoid such #GPs.
+ */
+ loadsegment(ss, __KERNEL_DS);
+
+ wrmsrq(MSR_IA32_FRED_CONFIG,
+ /* Reserve for CALL emulation */
+ FRED_CONFIG_REDZONE |
+ FRED_CONFIG_INT_STKLVL(0) |
+ FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user));
+
+ wrmsrq(MSR_IA32_FRED_STKLVLS, 0);
+
+ /*
+ * After a CPU offline/online cycle, the FRED RSP0 MSR should be
+ * resynchronized with its per-CPU cache.
+ */
+ wrmsrq(MSR_IA32_FRED_RSP0, __this_cpu_read(fred_rsp0));
+
+ wrmsrq(MSR_IA32_FRED_RSP1, 0);
+ wrmsrq(MSR_IA32_FRED_RSP2, 0);
+ wrmsrq(MSR_IA32_FRED_RSP3, 0);
+
+ /* Enable FRED */
+ cr4_set_bits(X86_CR4_FRED);
+ /* Any further IDT use is a bug */
+ idt_invalidate();
+
+ /* Use int $0x80 for 32-bit system calls in FRED mode */
+ setup_clear_cpu_cap(X86_FEATURE_SYSENTER32);
+ setup_clear_cpu_cap(X86_FEATURE_SYSCALL32);
+}
+
+/* Must be called after setup_cpu_entry_areas() */
+void cpu_init_fred_rsps(void)
+{
+ /*
+ * The purpose of separate stacks for NMI, #DB and #MC *in the kernel*
+ * (remember that user space faults are always taken on stack level 0)
+ * is to avoid overflowing the kernel stack.
+ */
+ wrmsrq(MSR_IA32_FRED_STKLVLS,
+ FRED_STKLVL(X86_TRAP_DB, FRED_DB_STACK_LEVEL) |
+ FRED_STKLVL(X86_TRAP_NMI, FRED_NMI_STACK_LEVEL) |
+ FRED_STKLVL(X86_TRAP_MC, FRED_MC_STACK_LEVEL) |
+ FRED_STKLVL(X86_TRAP_DF, FRED_DF_STACK_LEVEL));
+
+ /* The FRED equivalents to IST stacks... */
+ wrmsrq(MSR_IA32_FRED_RSP1, __this_cpu_ist_top_va(DB));
+ wrmsrq(MSR_IA32_FRED_RSP2, __this_cpu_ist_top_va(NMI));
+ wrmsrq(MSR_IA32_FRED_RSP3, __this_cpu_ist_top_va(DF));
+}
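
A worked example (illustrative, not kernel code) of the FRED_STKLVL() packing used above, assuming the standard x86 vector numbers from asm/trapnr.h (#DB = 1, NMI = 2, #DF = 8, #MC = 18); each vector owns a 2-bit stack-level field at bit position 2 * vector:

    static const unsigned long long fred_stklvls_example =
            (1ULL << (2 * 1))  |    /* #DB  -> level 1: 0x4          */
            (2ULL << (2 * 2))  |    /* NMI  -> level 2: 0x20         */
            (3ULL << (2 * 8))  |    /* #DF  -> level 3: 0x30000      */
            (2ULL << (2 * 18));     /* #MC  -> level 2: 0x2000000000 */
    /* fred_stklvls_example == 0x2000030024 */
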
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index cd37469b54ee..0543b57f54ee 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -1,5 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
/*
- * Code for replacing ftrace calls with jumps.
+ * Dynamic function tracing support.
*
* Copyright (C) 2007-2008 Steven Rostedt <srostedt@redhat.com>
*
@@ -17,496 +18,652 @@
#include <linux/ftrace.h>
#include <linux/percpu.h>
#include <linux/sched.h>
+#include <linux/slab.h>
#include <linux/init.h>
#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/memory.h>
+#include <linux/vmalloc.h>
+#include <linux/set_memory.h>
+#include <linux/execmem.h>
#include <trace/syscall.h>
-#include <asm/cacheflush.h>
+#include <asm/kprobes.h>
#include <asm/ftrace.h>
#include <asm/nops.h>
-#include <asm/nmi.h>
-
+#include <asm/text-patching.h>
#ifdef CONFIG_DYNAMIC_FTRACE
-/*
- * modifying_code is set to notify NMIs that they need to use
- * memory barriers when entering or exiting. But we don't want
- * to burden NMIs with unnecessary memory barriers when code
- * modification is not being done (which is most of the time).
- *
- * A mutex is already held when ftrace_arch_code_modify_prepare
- * and post_process are called. No locks need to be taken here.
- *
- * Stop machine will make sure currently running NMIs are done
- * and new NMIs will see the updated variable before we need
- * to worry about NMIs doing memory barriers.
- */
-static int modifying_code __read_mostly;
-static DEFINE_PER_CPU(int, save_modifying_code);
+static int ftrace_poke_late = 0;
-int ftrace_arch_code_modify_prepare(void)
+void ftrace_arch_code_modify_prepare(void)
+ __acquires(&text_mutex)
{
- set_kernel_text_rw();
- modifying_code = 1;
- return 0;
+ /*
+ * Need to grab text_mutex to prevent a race where module loading or
+ * live kernel patching changes the text permissions while ftrace has
+ * them set to "read/write".
+ */
+ mutex_lock(&text_mutex);
+ ftrace_poke_late = 1;
}
-int ftrace_arch_code_modify_post_process(void)
+void ftrace_arch_code_modify_post_process(void)
+ __releases(&text_mutex)
{
- modifying_code = 0;
- set_kernel_text_ro();
- return 0;
+ /*
+ * ftrace_make_{call,nop}() may be called during module load, and we
+ * need to finish here the smp_text_poke_batch_add() requests that
+ * they queued.
+ */
+ smp_text_poke_batch_finish();
+ ftrace_poke_late = 0;
+ mutex_unlock(&text_mutex);
}
-union ftrace_code_union {
- char code[MCOUNT_INSN_SIZE];
- struct {
- char e8;
- int offset;
- } __attribute__((packed));
-};
-
-static int ftrace_calc_offset(long ip, long addr)
+static const char *ftrace_nop_replace(void)
{
- return (int)(addr - ip);
+ return x86_nops[5];
}
-static unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr)
+static const char *ftrace_call_replace(unsigned long ip, unsigned long addr)
{
- static union ftrace_code_union calc;
+ /*
+ * No need to translate into a callthunk. The trampoline does
+ * the depth accounting itself.
+ */
+ if (ftrace_is_jmp(addr)) {
+ addr = ftrace_jmp_get(addr);
+ return text_gen_insn(JMP32_INSN_OPCODE, (void *)ip, (void *)addr);
+ } else {
+ return text_gen_insn(CALL_INSN_OPCODE, (void *)ip, (void *)addr);
+ }
+}
- calc.e8 = 0xe8;
- calc.offset = ftrace_calc_offset(ip + MCOUNT_INSN_SIZE, addr);
+static int ftrace_verify_code(unsigned long ip, const char *old_code)
+{
+ char cur_code[MCOUNT_INSN_SIZE];
/*
- * No locking needed, this must be called via kstop_machine
- * which in essence is like running on a uniprocessor machine.
+ * Note:
+ * We are paranoid about modifying text, as if a bug were to happen, it
+ * could cause us to read or write to someplace that could cause harm.
+ * Carefully read the code with copy_from_kernel_nofault(), and make
+ * sure what we read is what we expected it to be before modifying it.
*/
- return calc.code;
+ /* read the text we want to modify */
+ if (copy_from_kernel_nofault(cur_code, (void *)ip, MCOUNT_INSN_SIZE)) {
+ WARN_ON(1);
+ return -EFAULT;
+ }
+
+ /* Make sure it is what we expect it to be */
+ if (memcmp(cur_code, old_code, MCOUNT_INSN_SIZE) != 0) {
+ ftrace_expected = old_code;
+ WARN_ON(1);
+ return -EINVAL;
+ }
+
+ return 0;
}
/*
- * Modifying code must take extra care. On an SMP machine, if
- * the code being modified is also being executed on another CPU
- * that CPU will have undefined results and possibly take a GPF.
- * We use kstop_machine to stop other CPUS from exectuing code.
- * But this does not stop NMIs from happening. We still need
- * to protect against that. We separate out the modification of
- * the code to take care of this.
- *
- * Two buffers are added: An IP buffer and a "code" buffer.
- *
- * 1) Put the instruction pointer into the IP buffer
- * and the new code into the "code" buffer.
- * 2) Wait for any running NMIs to finish and set a flag that says
- * we are modifying code, it is done in an atomic operation.
- * 3) Write the code
- * 4) clear the flag.
- * 5) Wait for any running NMIs to finish.
- *
- * If an NMI is executed, the first thing it does is to call
- * "ftrace_nmi_enter". This will check if the flag is set to write
- * and if it is, it will write what is in the IP and "code" buffers.
- *
- * The trick is, it does not matter if everyone is writing the same
- * content to the code location. Also, if a CPU is executing code
- * it is OK to write to that code location if the contents being written
- * are the same as what exists.
+ * Marked __ref because it calls text_poke_early() which is .init.text. That is
+ * ok because that call will happen early, during boot, when .init sections are
+ * still present.
*/
-
-#define MOD_CODE_WRITE_FLAG (1 << 31) /* set when NMI should do the write */
-static atomic_t nmi_running = ATOMIC_INIT(0);
-static int mod_code_status; /* holds return value of text write */
-static void *mod_code_ip; /* holds the IP to write to */
-static void *mod_code_newcode; /* holds the text to write to the IP */
-
-static unsigned nmi_wait_count;
-static atomic_t nmi_update_count = ATOMIC_INIT(0);
-
-int ftrace_arch_read_dyn_info(char *buf, int size)
+static int __ref
+ftrace_modify_code_direct(unsigned long ip, const char *old_code,
+ const char *new_code)
{
- int r;
+ int ret = ftrace_verify_code(ip, old_code);
+ if (ret)
+ return ret;
- r = snprintf(buf, size, "%u %u",
- nmi_wait_count,
- atomic_read(&nmi_update_count));
- return r;
+ /* replace the text with the new text */
+ if (ftrace_poke_late)
+ smp_text_poke_batch_add((void *)ip, new_code, MCOUNT_INSN_SIZE, NULL);
+ else
+ text_poke_early((void *)ip, new_code, MCOUNT_INSN_SIZE);
+ return 0;
}
-static void clear_mod_flag(void)
+int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, unsigned long addr)
{
- int old = atomic_read(&nmi_running);
-
- for (;;) {
- int new = old & ~MOD_CODE_WRITE_FLAG;
-
- if (old == new)
- break;
+ unsigned long ip = rec->ip;
+ const char *new, *old;
- old = atomic_cmpxchg(&nmi_running, old, new);
- }
-}
+ old = ftrace_call_replace(ip, addr);
+ new = ftrace_nop_replace();
-static void ftrace_mod_code(void)
-{
/*
- * Yes, more than one CPU process can be writing to mod_code_status.
- * (and the code itself)
- * But if one were to fail, then they all should, and if one were
- * to succeed, then they all should.
+ * On boot up, and when modules are loaded, the MCOUNT_ADDR
+ * is converted to a nop, and will never become MCOUNT_ADDR
+ * again. This code is either running before SMP (on boot up)
+ * or before the code will ever be executed (module load).
+ * We do not want to use the breakpoint version in this case,
+ * just modify the code directly.
*/
- mod_code_status = probe_kernel_write(mod_code_ip, mod_code_newcode,
- MCOUNT_INSN_SIZE);
+ if (addr == MCOUNT_ADDR)
+ return ftrace_modify_code_direct(ip, old, new);
- /* if we fail, then kill any new writers */
- if (mod_code_status)
- clear_mod_flag();
+ /*
+ * x86 overrides ftrace_replace_code -- this function will never be used
+ * in this case.
+ */
+ WARN_ONCE(1, "invalid use of ftrace_make_nop");
+ return -EINVAL;
}
-void ftrace_nmi_enter(void)
+int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
{
- __get_cpu_var(save_modifying_code) = modifying_code;
+ unsigned long ip = rec->ip;
+ const char *new, *old;
- if (!__get_cpu_var(save_modifying_code))
- return;
+ old = ftrace_nop_replace();
+ new = ftrace_call_replace(ip, addr);
- if (atomic_inc_return(&nmi_running) & MOD_CODE_WRITE_FLAG) {
- smp_rmb();
- ftrace_mod_code();
- atomic_inc(&nmi_update_count);
- }
- /* Must have previous changes seen before executions */
- smp_mb();
+ /* Should only be called when module is loaded */
+ return ftrace_modify_code_direct(rec->ip, old, new);
}
-void ftrace_nmi_exit(void)
+/*
+ * Should never be called:
+ * It is only called by __ftrace_replace_code(), which is reached via
+ * ftrace_replace_code() (which x86 overrides) and ftrace_update_code().
+ * Those paths turn mcount sites into nops or nops into function calls,
+ * but never convert a function from not using regs to one that uses
+ * regs, which is what ftrace_modify_call() is for.
+ */
+int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
+ unsigned long addr)
{
- if (!__get_cpu_var(save_modifying_code))
- return;
-
- /* Finish all executions before clearing nmi_running */
- smp_mb();
- atomic_dec(&nmi_running);
+ WARN_ON(1);
+ return -EINVAL;
}
-static void wait_for_nmi_and_set_mod_flag(void)
+int ftrace_update_ftrace_func(ftrace_func_t func)
{
- if (!atomic_cmpxchg(&nmi_running, 0, MOD_CODE_WRITE_FLAG))
- return;
+ unsigned long ip;
+ const char *new;
- do {
- cpu_relax();
- } while (atomic_cmpxchg(&nmi_running, 0, MOD_CODE_WRITE_FLAG));
+ ip = (unsigned long)(&ftrace_call);
+ new = ftrace_call_replace(ip, (unsigned long)func);
+ smp_text_poke_single((void *)ip, new, MCOUNT_INSN_SIZE, NULL);
+
+ ip = (unsigned long)(&ftrace_regs_call);
+ new = ftrace_call_replace(ip, (unsigned long)func);
+ smp_text_poke_single((void *)ip, new, MCOUNT_INSN_SIZE, NULL);
- nmi_wait_count++;
+ return 0;
}
-static void wait_for_nmi(void)
+void ftrace_replace_code(int enable)
{
- if (!atomic_read(&nmi_running))
- return;
+ struct ftrace_rec_iter *iter;
+ struct dyn_ftrace *rec;
+ const char *new, *old;
+ int ret;
- do {
- cpu_relax();
- } while (atomic_read(&nmi_running));
+ for_ftrace_rec_iter(iter) {
+ rec = ftrace_rec_iter_record(iter);
- nmi_wait_count++;
-}
+ switch (ftrace_test_record(rec, enable)) {
+ case FTRACE_UPDATE_IGNORE:
+ default:
+ continue;
-static inline int
-within(unsigned long addr, unsigned long start, unsigned long end)
-{
- return addr >= start && addr < end;
-}
+ case FTRACE_UPDATE_MAKE_CALL:
+ old = ftrace_nop_replace();
+ break;
-static int
-do_ftrace_mod_code(unsigned long ip, void *new_code)
-{
- /*
- * On x86_64, kernel text mappings are mapped read-only with
- * CONFIG_DEBUG_RODATA. So we use the kernel identity mapping instead
- * of the kernel text mapping to modify the kernel text.
- *
- * For 32bit kernels, these mappings are same and we can use
- * kernel identity mapping to modify code.
- */
- if (within(ip, (unsigned long)_text, (unsigned long)_etext))
- ip = (unsigned long)__va(__pa(ip));
+ case FTRACE_UPDATE_MODIFY_CALL:
+ case FTRACE_UPDATE_MAKE_NOP:
+ old = ftrace_call_replace(rec->ip, ftrace_get_addr_curr(rec));
+ break;
+ }
+
+ ret = ftrace_verify_code(rec->ip, old);
+ if (ret) {
+ ftrace_expected = old;
+ ftrace_bug(ret, rec);
+ ftrace_expected = NULL;
+ return;
+ }
+ }
- mod_code_ip = (void *)ip;
- mod_code_newcode = new_code;
+ for_ftrace_rec_iter(iter) {
+ rec = ftrace_rec_iter_record(iter);
- /* The buffers need to be visible before we let NMIs write them */
- smp_mb();
+ switch (ftrace_test_record(rec, enable)) {
+ case FTRACE_UPDATE_IGNORE:
+ default:
+ continue;
- wait_for_nmi_and_set_mod_flag();
+ case FTRACE_UPDATE_MAKE_CALL:
+ case FTRACE_UPDATE_MODIFY_CALL:
+ new = ftrace_call_replace(rec->ip, ftrace_get_addr_new(rec));
+ break;
- /* Make sure all running NMIs have finished before we write the code */
- smp_mb();
+ case FTRACE_UPDATE_MAKE_NOP:
+ new = ftrace_nop_replace();
+ break;
+ }
- ftrace_mod_code();
+ smp_text_poke_batch_add((void *)rec->ip, new, MCOUNT_INSN_SIZE, NULL);
+ ftrace_update_record(rec, enable);
+ }
+ smp_text_poke_batch_finish();
+}
- /* Make sure the write happens before clearing the bit */
- smp_mb();
+void arch_ftrace_update_code(int command)
+{
+ ftrace_modify_all_code(command);
+}
- clear_mod_flag();
- wait_for_nmi();
+/* Currently only x86_64 supports dynamic trampolines */
+#ifdef CONFIG_X86_64
- return mod_code_status;
+static inline void *alloc_tramp(unsigned long size)
+{
+ return execmem_alloc_rw(EXECMEM_FTRACE, size);
+}
+static inline void tramp_free(void *tramp)
+{
+ execmem_free(tramp);
}
+/* Defined as markers to the end of the ftrace default trampolines */
+extern void ftrace_regs_caller_end(void);
+extern void ftrace_caller_end(void);
+extern void ftrace_caller_op_ptr(void);
+extern void ftrace_regs_caller_op_ptr(void);
+extern void ftrace_regs_caller_jmp(void);
+/* movq function_trace_op(%rip), %rdx */
+/* 0x48 0x8b 0x15 <offset-to-function_trace_op (4 bytes)> */
+#define OP_REF_SIZE 7
+/*
+ * The ftrace_ops is passed to the function callback. Since the
+ * trampoline only services a single ftrace_ops, we can pass in
+ * that ops directly.
+ *
+ * The ftrace_op_code_union is used to create a pointer to the
+ * ftrace_ops that will be passed to the callback function.
+ */
+union ftrace_op_code_union {
+ char code[OP_REF_SIZE];
+ struct {
+ char op[3];
+ int offset;
+ } __attribute__((packed));
+};
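
An illustrative sketch (not kernel code) of the displacement math applied further below in create_trampoline(): RIP-relative addressing is relative to the end of the instruction, so the rel32 stored in the 7-byte movq is the distance from the byte after the instruction to the ops pointer slot at the trampoline's tail:

    static int compute_op_disp(unsigned long insn_addr, unsigned long ops_slot)
    {
            const unsigned int op_ref_size = 7;  /* 48 8b 15 <rel32> */

            /* rel32 = target - address of the next instruction */
            return (int)(ops_slot - (insn_addr + op_ref_size));
    }
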
-static unsigned char ftrace_nop[MCOUNT_INSN_SIZE];
+#define RET_SIZE \
+ (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) ? 5 : 1 + IS_ENABLED(CONFIG_MITIGATION_SLS))
-static unsigned char *ftrace_nop_replace(void)
+static unsigned long
+create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
{
- return ftrace_nop;
-}
+ unsigned long start_offset;
+ unsigned long end_offset;
+ unsigned long op_offset;
+ unsigned long call_offset;
+ unsigned long jmp_offset;
+ unsigned long offset;
+ unsigned long npages;
+ unsigned long size;
+ unsigned long *ptr;
+ void *trampoline;
+ void *ip, *dest;
+ /* 48 8b 15 <offset> is movq <offset>(%rip), %rdx */
+ unsigned const char op_ref[] = { 0x48, 0x8b, 0x15 };
+ unsigned const char retq[] = { RET_INSN_OPCODE, INT3_INSN_OPCODE };
+ union ftrace_op_code_union op_ptr;
+ int ret;
-static int
-ftrace_modify_code(unsigned long ip, unsigned char *old_code,
- unsigned char *new_code)
-{
- unsigned char replaced[MCOUNT_INSN_SIZE];
+ if (ops->flags & FTRACE_OPS_FL_SAVE_REGS) {
+ start_offset = (unsigned long)ftrace_regs_caller;
+ end_offset = (unsigned long)ftrace_regs_caller_end;
+ op_offset = (unsigned long)ftrace_regs_caller_op_ptr;
+ call_offset = (unsigned long)ftrace_regs_call;
+ jmp_offset = (unsigned long)ftrace_regs_caller_jmp;
+ } else {
+ start_offset = (unsigned long)ftrace_caller;
+ end_offset = (unsigned long)ftrace_caller_end;
+ op_offset = (unsigned long)ftrace_caller_op_ptr;
+ call_offset = (unsigned long)ftrace_call;
+ jmp_offset = 0;
+ }
+
+ size = end_offset - start_offset;
/*
- * Note: Due to modules and __init, code can
- * disappear and change, we need to protect against faulting
- * as well as code changing. We do this by using the
- * probe_kernel_* functions.
- *
- * No real locking needed, this code is run through
- * kstop_machine, or before SMP starts.
+ * Allocate enough space to store the ftrace_caller code, the return
+ * instruction, as well as the address of the ftrace_ops this
+ * trampoline is used for.
*/
+ trampoline = alloc_tramp(size + RET_SIZE + sizeof(void *));
+ if (!trampoline)
+ return 0;
+
+ *tramp_size = size + RET_SIZE + sizeof(void *);
+ npages = DIV_ROUND_UP(*tramp_size, PAGE_SIZE);
+
+ /* Copy ftrace_caller onto the trampoline memory */
+ ret = copy_from_kernel_nofault(trampoline, (void *)start_offset, size);
+ if (WARN_ON(ret < 0))
+ goto fail;
+
+ ip = trampoline + size;
+ if (cpu_wants_rethunk_at(ip))
+ __text_gen_insn(ip, JMP32_INSN_OPCODE, ip, x86_return_thunk, JMP32_INSN_SIZE);
+ else
+ memcpy(ip, retq, sizeof(retq));
+
+ /* No need to test direct calls on created trampolines */
+ if (ops->flags & FTRACE_OPS_FL_SAVE_REGS) {
+ /* NOP the jnz 1f; but make sure it's a 2 byte jnz */
+ ip = trampoline + (jmp_offset - start_offset);
+ if (WARN_ON(*(char *)ip != 0x75))
+ goto fail;
+ ret = copy_from_kernel_nofault(ip, x86_nops[2], 2);
+ if (ret < 0)
+ goto fail;
+ }
- /* read the text we want to modify */
- if (probe_kernel_read(replaced, (void *)ip, MCOUNT_INSN_SIZE))
- return -EFAULT;
+ /*
+ * The address of the ftrace_ops that is used for this trampoline
+ * is stored at the end of the trampoline. This will be used to
+ * load the third parameter for the callback. Basically, that
+ * location at the end of the trampoline takes the place of
+ * the global function_trace_op variable.
+ */
- /* Make sure it is what we expect it to be */
- if (memcmp(replaced, old_code, MCOUNT_INSN_SIZE) != 0)
- return -EINVAL;
+ ptr = (unsigned long *)(trampoline + size + RET_SIZE);
+ *ptr = (unsigned long)ops;
- /* replace the text with the new text */
- if (do_ftrace_mod_code(ip, new_code))
- return -EPERM;
+ op_offset -= start_offset;
+ memcpy(&op_ptr, trampoline + op_offset, OP_REF_SIZE);
- sync_core();
+ /* Are we pointing to the reference? */
+ if (WARN_ON(memcmp(op_ptr.op, op_ref, 3) != 0))
+ goto fail;
+ /* Load the contents of ptr into the callback parameter */
+ offset = (unsigned long)ptr;
+ offset -= (unsigned long)trampoline + op_offset + OP_REF_SIZE;
+
+ op_ptr.offset = offset;
+
+ /* put in the new offset to the ftrace_ops */
+ memcpy(trampoline + op_offset, &op_ptr, OP_REF_SIZE);
+
+ /* put in the call to the function */
+ mutex_lock(&text_mutex);
+ call_offset -= start_offset;
+ /*
+ * No need to translate into a callthunk. The trampoline does
+ * the depth accounting before the call already.
+ */
+ dest = ftrace_ops_get_func(ops);
+ memcpy(trampoline + call_offset,
+ text_gen_insn(CALL_INSN_OPCODE, trampoline + call_offset, dest),
+ CALL_INSN_SIZE);
+ mutex_unlock(&text_mutex);
+
+ /* The ALLOC_TRAMP flag lets us know we created it */
+ ops->flags |= FTRACE_OPS_FL_ALLOC_TRAMP;
+
+ set_memory_rox((unsigned long)trampoline, npages);
+ return (unsigned long)trampoline;
+fail:
+ tramp_free(trampoline);
return 0;
}
-int ftrace_make_nop(struct module *mod,
- struct dyn_ftrace *rec, unsigned long addr)
+void set_ftrace_ops_ro(void)
{
- unsigned char *new, *old;
- unsigned long ip = rec->ip;
+ struct ftrace_ops *ops;
+ unsigned long start_offset;
+ unsigned long end_offset;
+ unsigned long npages;
+ unsigned long size;
+
+ do_for_each_ftrace_op(ops, ftrace_ops_list) {
+ if (!(ops->flags & FTRACE_OPS_FL_ALLOC_TRAMP))
+ continue;
+
+ if (ops->flags & FTRACE_OPS_FL_SAVE_REGS) {
+ start_offset = (unsigned long)ftrace_regs_caller;
+ end_offset = (unsigned long)ftrace_regs_caller_end;
+ } else {
+ start_offset = (unsigned long)ftrace_caller;
+ end_offset = (unsigned long)ftrace_caller_end;
+ }
+ size = end_offset - start_offset;
+ size = size + RET_SIZE + sizeof(void *);
+ npages = DIV_ROUND_UP(size, PAGE_SIZE);
+ set_memory_ro((unsigned long)ops->trampoline, npages);
+ } while_for_each_ftrace_op(ops);
+}
- old = ftrace_call_replace(ip, addr);
- new = ftrace_nop_replace();
+static unsigned long calc_trampoline_call_offset(bool save_regs)
+{
+ unsigned long start_offset;
+ unsigned long call_offset;
+
+ if (save_regs) {
+ start_offset = (unsigned long)ftrace_regs_caller;
+ call_offset = (unsigned long)ftrace_regs_call;
+ } else {
+ start_offset = (unsigned long)ftrace_caller;
+ call_offset = (unsigned long)ftrace_call;
+ }
- return ftrace_modify_code(rec->ip, old, new);
+ return call_offset - start_offset;
}
-int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
+void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
{
- unsigned char *new, *old;
- unsigned long ip = rec->ip;
+ ftrace_func_t func;
+ unsigned long offset;
+ unsigned long ip;
+ unsigned int size;
+ const char *new;
+
+ if (!ops->trampoline) {
+ ops->trampoline = create_trampoline(ops, &size);
+ if (!ops->trampoline)
+ return;
+ ops->trampoline_size = size;
+ return;
+ }
- old = ftrace_nop_replace();
- new = ftrace_call_replace(ip, addr);
+ /*
+ * The ftrace_ops caller may set up its own trampoline.
+ * In such a case, this code must not modify it.
+ */
+ if (!(ops->flags & FTRACE_OPS_FL_ALLOC_TRAMP))
+ return;
- return ftrace_modify_code(rec->ip, old, new);
+ offset = calc_trampoline_call_offset(ops->flags & FTRACE_OPS_FL_SAVE_REGS);
+ ip = ops->trampoline + offset;
+ func = ftrace_ops_get_func(ops);
+
+ mutex_lock(&text_mutex);
+ /* Do a safe modify in case the trampoline is executing */
+ new = ftrace_call_replace(ip, (unsigned long)func);
+ smp_text_poke_single((void *)ip, new, MCOUNT_INSN_SIZE, NULL);
+ mutex_unlock(&text_mutex);
}
-int ftrace_update_ftrace_func(ftrace_func_t func)
+/* Return the address of the function the trampoline calls */
+static void *addr_from_call(void *ptr)
{
- unsigned long ip = (unsigned long)(&ftrace_call);
- unsigned char old[MCOUNT_INSN_SIZE], *new;
+ union text_poke_insn call;
int ret;
- memcpy(old, &ftrace_call, MCOUNT_INSN_SIZE);
- new = ftrace_call_replace(ip, (unsigned long)func);
- ret = ftrace_modify_code(ip, old, new);
+ ret = copy_from_kernel_nofault(&call, ptr, CALL_INSN_SIZE);
+ if (WARN_ON_ONCE(ret < 0))
+ return NULL;
+
+ /* Make sure this is a call */
+ if (WARN_ON_ONCE(call.opcode != CALL_INSN_OPCODE)) {
+ pr_warn("Expected E8, got %x\n", call.opcode);
+ return NULL;
+ }
- return ret;
+ return ptr + CALL_INSN_SIZE + call.disp;
}
-int __init ftrace_dyn_arch_init(void *data)
+/*
+ * If the ops->trampoline was not allocated, then it probably
+ * has a static trampoline func, or is the ftrace caller itself.
+ */
+static void *static_tramp_func(struct ftrace_ops *ops, struct dyn_ftrace *rec)
{
- extern const unsigned char ftrace_test_p6nop[];
- extern const unsigned char ftrace_test_nop5[];
- extern const unsigned char ftrace_test_jmp[];
- int faulted = 0;
-
- /*
- * There is no good nop for all x86 archs.
- * We will default to using the P6_NOP5, but first we
- * will test to make sure that the nop will actually
- * work on this CPU. If it faults, we will then
- * go to a lesser efficient 5 byte nop. If that fails
- * we then just use a jmp as our nop. This isn't the most
- * efficient nop, but we can not use a multi part nop
- * since we would then risk being preempted in the middle
- * of that nop, and if we enabled tracing then, it might
- * cause a system crash.
- *
- * TODO: check the cpuid to determine the best nop.
- */
- asm volatile (
- "ftrace_test_jmp:"
- "jmp ftrace_test_p6nop\n"
- "nop\n"
- "nop\n"
- "nop\n" /* 2 byte jmp + 3 bytes */
- "ftrace_test_p6nop:"
- P6_NOP5
- "jmp 1f\n"
- "ftrace_test_nop5:"
- ".byte 0x66,0x66,0x66,0x66,0x90\n"
- "1:"
- ".section .fixup, \"ax\"\n"
- "2: movl $1, %0\n"
- " jmp ftrace_test_nop5\n"
- "3: movl $2, %0\n"
- " jmp 1b\n"
- ".previous\n"
- _ASM_EXTABLE(ftrace_test_p6nop, 2b)
- _ASM_EXTABLE(ftrace_test_nop5, 3b)
- : "=r"(faulted) : "0" (faulted));
-
- switch (faulted) {
- case 0:
- pr_info("converting mcount calls to 0f 1f 44 00 00\n");
- memcpy(ftrace_nop, ftrace_test_p6nop, MCOUNT_INSN_SIZE);
- break;
- case 1:
- pr_info("converting mcount calls to 66 66 66 66 90\n");
- memcpy(ftrace_nop, ftrace_test_nop5, MCOUNT_INSN_SIZE);
- break;
- case 2:
- pr_info("converting mcount calls to jmp . + 5\n");
- memcpy(ftrace_nop, ftrace_test_jmp, MCOUNT_INSN_SIZE);
- break;
+ unsigned long offset;
+ bool save_regs = rec->flags & FTRACE_FL_REGS_EN;
+ void *ptr;
+
+ if (ops && ops->trampoline) {
+#if !defined(CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS) && \
+ defined(CONFIG_FUNCTION_GRAPH_TRACER)
+ /*
+ * The function graph tracer is the only case we know of that sets
+ * a static trampoline.
+ */
+ if (ops->trampoline == FTRACE_GRAPH_ADDR)
+ return (void *)prepare_ftrace_return;
+#endif
+ return NULL;
}
- /* The return code is retured via data */
- *(unsigned long *)data = 0;
+ offset = calc_trampoline_call_offset(save_regs);
- return 0;
+ if (save_regs)
+ ptr = (void *)FTRACE_REGS_ADDR + offset;
+ else
+ ptr = (void *)FTRACE_ADDR + offset;
+
+ return addr_from_call(ptr);
}
-#endif
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+void *arch_ftrace_trampoline_func(struct ftrace_ops *ops, struct dyn_ftrace *rec)
+{
+ unsigned long offset;
-#ifdef CONFIG_DYNAMIC_FTRACE
-extern void ftrace_graph_call(void);
+ /* If we didn't allocate this trampoline, consider it static */
+ if (!ops || !(ops->flags & FTRACE_OPS_FL_ALLOC_TRAMP))
+ return static_tramp_func(ops, rec);
-static int ftrace_mod_jmp(unsigned long ip,
- int old_offset, int new_offset)
+ offset = calc_trampoline_call_offset(ops->flags & FTRACE_OPS_FL_SAVE_REGS);
+ return addr_from_call((void *)ops->trampoline + offset);
+}
+
+void arch_ftrace_trampoline_free(struct ftrace_ops *ops)
{
- unsigned char code[MCOUNT_INSN_SIZE];
+ if (!ops || !(ops->flags & FTRACE_OPS_FL_ALLOC_TRAMP))
+ return;
- if (probe_kernel_read(code, (void *)ip, MCOUNT_INSN_SIZE))
- return -EFAULT;
+ tramp_free((void *)ops->trampoline);
+ ops->trampoline = 0;
+}
- if (code[0] != 0xe9 || old_offset != *(int *)(&code[1]))
- return -EINVAL;
+#endif /* CONFIG_X86_64 */
+#endif /* CONFIG_DYNAMIC_FTRACE */
- *(int *)(&code[1]) = new_offset;
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
- if (do_ftrace_mod_code(ip, &code))
- return -EPERM;
+#if defined(CONFIG_DYNAMIC_FTRACE) && !defined(CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS)
+extern void ftrace_graph_call(void);
+static const char *ftrace_jmp_replace(unsigned long ip, unsigned long addr)
+{
+ return text_gen_insn(JMP32_INSN_OPCODE, (void *)ip, (void *)addr);
+}
+static int ftrace_mod_jmp(unsigned long ip, void *func)
+{
+ const char *new;
+
+ new = ftrace_jmp_replace(ip, (unsigned long)func);
+ smp_text_poke_single((void *)ip, new, MCOUNT_INSN_SIZE, NULL);
return 0;
}
int ftrace_enable_ftrace_graph_caller(void)
{
unsigned long ip = (unsigned long)(&ftrace_graph_call);
- int old_offset, new_offset;
-
- old_offset = (unsigned long)(&ftrace_stub) - (ip + MCOUNT_INSN_SIZE);
- new_offset = (unsigned long)(&ftrace_graph_caller) - (ip + MCOUNT_INSN_SIZE);
- return ftrace_mod_jmp(ip, old_offset, new_offset);
+ return ftrace_mod_jmp(ip, &ftrace_graph_caller);
}
int ftrace_disable_ftrace_graph_caller(void)
{
unsigned long ip = (unsigned long)(&ftrace_graph_call);
- int old_offset, new_offset;
-
- old_offset = (unsigned long)(&ftrace_graph_caller) - (ip + MCOUNT_INSN_SIZE);
- new_offset = (unsigned long)(&ftrace_stub) - (ip + MCOUNT_INSN_SIZE);
- return ftrace_mod_jmp(ip, old_offset, new_offset);
+ return ftrace_mod_jmp(ip, &ftrace_stub);
}
+#endif /* CONFIG_DYNAMIC_FTRACE && !CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS */
-#endif /* !CONFIG_DYNAMIC_FTRACE */
+static inline bool skip_ftrace_return(void)
+{
+ /*
+ * When resuming from suspend-to-ram, this function can be indirectly
+ * called from early CPU startup code while the CPU is in real mode,
+ * which would fail miserably. Make sure the stack pointer is a
+ * virtual address.
+ *
+ * This check isn't as accurate as virt_addr_valid(), but it should be
+ * good enough for this purpose, and it's fast.
+ */
+ if ((long)__builtin_frame_address(0) >= 0)
+ return true;
+
+ if (ftrace_graph_is_dead())
+ return true;
+
+ if (atomic_read(&current->tracing_graph_pause))
+ return true;
+ return false;
+}
/*
* Hook the return address and push it in the stack of return addrs
* in current thread info.
*/
-void prepare_ftrace_return(unsigned long *parent, unsigned long self_addr,
+void prepare_ftrace_return(unsigned long ip, unsigned long *parent,
unsigned long frame_pointer)
{
- unsigned long old;
- int faulted;
- struct ftrace_graph_ent trace;
- unsigned long return_hooker = (unsigned long)
- &return_to_handler;
+ unsigned long return_hooker = (unsigned long)&return_to_handler;
- if (unlikely(atomic_read(&current->tracing_graph_pause)))
+ if (unlikely(skip_ftrace_return()))
return;
- /*
- * Protect against fault, even if it shouldn't
- * happen. This tool is too much intrusive to
- * ignore such a protection.
- */
- asm volatile(
- "1: " _ASM_MOV " (%[parent]), %[old]\n"
- "2: " _ASM_MOV " %[return_hooker], (%[parent])\n"
- " movl $0, %[faulted]\n"
- "3:\n"
-
- ".section .fixup, \"ax\"\n"
- "4: movl $1, %[faulted]\n"
- " jmp 3b\n"
- ".previous\n"
-
- _ASM_EXTABLE(1b, 4b)
- _ASM_EXTABLE(2b, 4b)
-
- : [old] "=&r" (old), [faulted] "=r" (faulted)
- : [parent] "r" (parent), [return_hooker] "r" (return_hooker)
- : "memory"
- );
-
- if (unlikely(faulted)) {
- ftrace_graph_stop();
- WARN_ON(1);
- return;
- }
+ if (!function_graph_enter(*parent, ip, frame_pointer, parent))
+ *parent = return_hooker;
+}
- if (ftrace_push_return_trace(old, self_addr, &trace.depth,
- frame_pointer) == -EBUSY) {
- *parent = old;
+#ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
+void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
+ struct ftrace_ops *op, struct ftrace_regs *fregs)
+{
+ struct pt_regs *regs = &arch_ftrace_regs(fregs)->regs;
+ unsigned long *stack = (unsigned long *)kernel_stack_pointer(regs);
+ unsigned long return_hooker = (unsigned long)&return_to_handler;
+ unsigned long *parent = (unsigned long *)stack;
+
+ if (unlikely(skip_ftrace_return()))
return;
- }
- trace.func = self_addr;
- /* Only trace if the calling function expects to */
- if (!ftrace_graph_entry(&trace)) {
- current->curr_ret_stack--;
- *parent = old;
- }
+ if (!function_graph_enter_regs(*parent, ip, 0, parent, fregs))
+ *parent = return_hooker;
}
+#endif
+
#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
diff --git a/arch/x86/kernel/ftrace_32.S b/arch/x86/kernel/ftrace_32.S
new file mode 100644
index 000000000000..f4e0c3361234
--- /dev/null
+++ b/arch/x86/kernel/ftrace_32.S
@@ -0,0 +1,201 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2017 Steven Rostedt, VMware Inc.
+ */
+
+#include <linux/export.h>
+#include <linux/linkage.h>
+#include <asm/page_types.h>
+#include <asm/segment.h>
+#include <asm/ftrace.h>
+#include <asm/nospec-branch.h>
+#include <asm/frame.h>
+#include <asm/asm-offsets.h>
+
+#ifdef CONFIG_FRAME_POINTER
+# define MCOUNT_FRAME 1 /* using frame = true */
+#else
+# define MCOUNT_FRAME 0 /* using frame = false */
+#endif
+
+SYM_FUNC_START(__fentry__)
+ RET
+SYM_FUNC_END(__fentry__)
+EXPORT_SYMBOL(__fentry__)
+
+SYM_CODE_START(ftrace_caller)
+
+#ifdef CONFIG_FRAME_POINTER
+ /*
+ * Frame pointer entries consist of ip followed by bp.
+ * Since fentry is an immediate jump, we are left with
+ * parent-ip, function-ip. We need to add a frame with
+ * parent-ip followed by ebp.
+ */
+ pushl 4(%esp) /* parent ip */
+ pushl %ebp
+ movl %esp, %ebp
+ pushl 2*4(%esp) /* function ip */
+
+ /* For mcount, the function ip is directly above */
+ pushl %ebp
+ movl %esp, %ebp
+#endif
+ pushl %eax
+ pushl %ecx
+ pushl %edx
+ pushl $0 /* Pass NULL as regs pointer */
+
+#ifdef CONFIG_FRAME_POINTER
+ /* Load parent ebp into edx */
+ movl 4*4(%esp), %edx
+#else
+ /* There's no frame pointer, load the appropriate stack addr instead */
+ lea 4*4(%esp), %edx
+#endif
+
+ movl (MCOUNT_FRAME+4)*4(%esp), %eax /* load the rip */
+ /* Get the parent ip */
+ movl 4(%edx), %edx /* edx has ebp */
+
+ movl function_trace_op, %ecx
+ subl $MCOUNT_INSN_SIZE, %eax
+
+.globl ftrace_call
+ftrace_call:
+ call ftrace_stub
+
+ addl $4, %esp /* skip NULL pointer */
+ popl %edx
+ popl %ecx
+ popl %eax
+#ifdef CONFIG_FRAME_POINTER
+ popl %ebp
+ addl $4,%esp /* skip function ip */
+ popl %ebp /* this is the orig bp */
+ addl $4, %esp /* skip parent ip */
+#endif
+.Lftrace_ret:
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+.globl ftrace_graph_call
+ftrace_graph_call:
+ jmp ftrace_stub
+#endif
+
+/* This is weak to keep gas from relaxing the jumps */
+SYM_INNER_LABEL_ALIGN(ftrace_stub, SYM_L_WEAK)
+ RET
+SYM_CODE_END(ftrace_caller)
+
+SYM_CODE_START(ftrace_regs_caller)
+ /*
+ * We're here from an mcount/fentry CALL, and the stack frame looks like:
+ *
+ * <previous context>
+ * RET-IP
+ *
+ * The purpose of this function is to call out in an emulated INT3
+ * environment with a stack frame like:
+ *
+ * <previous context>
+ * gap / RET-IP
+ * gap
+ * gap
+ * gap
+ * pt_regs
+ *
+ * We do _NOT_ restore: ss, flags, cs, gs, fs, es, ds
+ */
+ subl $3*4, %esp # RET-IP + 3 gaps
+ pushl %ss # ss
+ pushl %esp # points at ss
+ addl $5*4, (%esp) # make it point at <previous context>
+ pushfl # flags
+ pushl $__KERNEL_CS # cs
+ pushl 7*4(%esp) # ip <- RET-IP
+ pushl $0 # orig_eax
+
+ pushl %gs
+ pushl %fs
+ pushl %es
+ pushl %ds
+
+ pushl %eax
+ pushl %ebp
+ pushl %edi
+ pushl %esi
+ pushl %edx
+ pushl %ecx
+ pushl %ebx
+
+ ENCODE_FRAME_POINTER
+
+ movl PT_EIP(%esp), %eax # 1st argument: IP
+ subl $MCOUNT_INSN_SIZE, %eax
+ movl 21*4(%esp), %edx # 2nd argument: parent ip
+ movl function_trace_op, %ecx # 3rd argument: ftrace_ops
+ pushl %esp # 4th argument: pt_regs
+
+SYM_INNER_LABEL(ftrace_regs_call, SYM_L_GLOBAL)
+ call ftrace_stub
+
+ addl $4, %esp # skip 4th argument
+
+ /* place IP below the new SP */
+ movl PT_OLDESP(%esp), %eax
+ movl PT_EIP(%esp), %ecx
+ movl %ecx, -4(%eax)
+
+ /* place EAX below that */
+ movl PT_EAX(%esp), %ecx
+ movl %ecx, -8(%eax)
+
+ popl %ebx
+ popl %ecx
+ popl %edx
+ popl %esi
+ popl %edi
+ popl %ebp
+
+ lea -8(%eax), %esp
+ popl %eax
+
+ jmp .Lftrace_ret
+SYM_CODE_END(ftrace_regs_caller)
+
+SYM_FUNC_START(ftrace_stub_direct_tramp)
+ CALL_DEPTH_ACCOUNT
+ RET
+SYM_FUNC_END(ftrace_stub_direct_tramp)
+
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+SYM_CODE_START(ftrace_graph_caller)
+ pushl %eax
+ pushl %ecx
+ pushl %edx
+ movl 3*4(%esp), %eax
+ /* Even with frame pointers, fentry doesn't have one here */
+ lea 4*4(%esp), %edx
+ movl $0, %ecx
+ subl $MCOUNT_INSN_SIZE, %eax
+ call prepare_ftrace_return
+ popl %edx
+ popl %ecx
+ popl %eax
+ RET
+SYM_CODE_END(ftrace_graph_caller)
+
+.globl return_to_handler
+return_to_handler:
+ subl $(PTREGS_SIZE), %esp
+ movl $0, PT_EBP(%esp)
+ movl %edx, PT_EDX(%esp)
+ movl %eax, PT_EAX(%esp)
+ movl %esp, %eax
+ call ftrace_return_to_handler
+ movl %eax, %ecx
+ movl PT_EAX(%esp), %eax
+ movl PT_EDX(%esp), %edx
+ addl $(PTREGS_SIZE), %esp
+ JMP_NOSPEC ecx
+#endif
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
new file mode 100644
index 000000000000..a132608265f6
--- /dev/null
+++ b/arch/x86/kernel/ftrace_64.S
@@ -0,0 +1,402 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2014 Steven Rostedt, Red Hat Inc
+ */
+
+#include <linux/export.h>
+#include <linux/cfi_types.h>
+#include <linux/linkage.h>
+#include <asm/asm-offsets.h>
+#include <asm/ptrace.h>
+#include <asm/ftrace.h>
+#include <asm/nospec-branch.h>
+#include <asm/unwind_hints.h>
+#include <asm/frame.h>
+
+ .code64
+ .section .text, "ax"
+
+#ifdef CONFIG_FRAME_POINTER
+/* Save parent and function stack frames (rip and rbp) */
+# define MCOUNT_FRAME_SIZE (8+16*2)
+#else
+/* No need to save a stack frame */
+# define MCOUNT_FRAME_SIZE 0
+#endif /* CONFIG_FRAME_POINTER */
+
+/* Size of stack used to save mcount regs in save_mcount_regs */
+#define MCOUNT_REG_SIZE (FRAME_SIZE + MCOUNT_FRAME_SIZE)
+
+/*
+ * gcc -pg option adds a call to 'mcount' in most functions.
+ * When -mfentry is used, the call is to 'fentry' and not 'mcount'
+ * and is done before the function's stack frame is set up.
+ * They both require a set of regs to be saved before calling
+ * any C code and restored before returning back to the function.
+ *
+ * On boot up, all these calls are converted into nops. When tracing
+ * is enabled, the call can jump to either ftrace_caller or
+ * ftrace_regs_caller. Callbacks (tracing functions) that require
+ * ftrace_regs_caller (like kprobes) need to have pt_regs passed to
+ * them. For this reason, a pt_regs-sized frame will be
+ * allocated on the stack and the required mcount registers will
+ * be saved in the locations that pt_regs defines for them.
+ */
+
+/*
+ * @added: the amount of stack added before calling this
+ *
+ * After this is called, the following registers contain:
+ *
+ * %rdi - holds the address that called the trampoline
+ * %rsi - holds the parent function (traced function's return address)
+ * %rdx - holds the original %rbp
+ */
+.macro save_mcount_regs added=0
+
+#ifdef CONFIG_FRAME_POINTER
+ /* Save the original rbp */
+ pushq %rbp
+
+ /*
+ * Stack traces will stop at the ftrace trampoline if the frame pointer
+ * is not set up properly. If fentry is used, we need to save a frame
+ * pointer for the parent as well as the function traced, because the
+ * fentry is called before the stack frame is set up, whereas mcount
+ * is called afterward.
+ */
+
+ /* Save the parent pointer (skip orig rbp and our return address) */
+ pushq \added+8*2(%rsp)
+ pushq %rbp
+ movq %rsp, %rbp
+ /* Save the return address (now skip orig rbp, rbp and parent) */
+ pushq \added+8*3(%rsp)
+ pushq %rbp
+ movq %rsp, %rbp
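+
+ /*
+ * Stack layout at this point (fentry with frame pointers), matching
+ * MCOUNT_FRAME_SIZE = 8+16*2, from higher to lower addresses:
+ * orig rbp | parent RET-IP | rbp | function RET-IP | rbp <- %rbp
+ */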
+#endif /* CONFIG_FRAME_POINTER */
+
+ /*
+ * We add enough stack to save all regs.
+ */
+ subq $(FRAME_SIZE), %rsp
+ movq %rax, RAX(%rsp)
+ movq %rcx, RCX(%rsp)
+ movq %rdx, RDX(%rsp)
+ movq %rsi, RSI(%rsp)
+ movq %rdi, RDI(%rsp)
+ movq %r8, R8(%rsp)
+ movq %r9, R9(%rsp)
+ movq $0, ORIG_RAX(%rsp)
+ /*
+ * Save the original RBP. Even though the mcount ABI does not
+ * require this, it helps out callers.
+ */
+#ifdef CONFIG_FRAME_POINTER
+ movq MCOUNT_REG_SIZE-8(%rsp), %rdx
+#else
+ movq %rbp, %rdx
+#endif
+ movq %rdx, RBP(%rsp)
+
+ /* Copy the parent address into %rsi (second parameter) */
+ movq MCOUNT_REG_SIZE+8+\added(%rsp), %rsi
+
+ /* Move RIP to its proper location */
+ movq MCOUNT_REG_SIZE+\added(%rsp), %rdi
+ movq %rdi, RIP(%rsp)
+
+ /*
+ * Now %rdi (the first parameter) has the return address of
+ * where ftrace_call returns. But the callbacks expect the
+ * address of the call itself.
+ */
+ subq $MCOUNT_INSN_SIZE, %rdi
+ .endm
+
+.macro restore_mcount_regs save=0
+
+ /* ftrace_regs_caller or frame pointers require this */
+ movq RBP(%rsp), %rbp
+
+ movq R9(%rsp), %r9
+ movq R8(%rsp), %r8
+ movq RDI(%rsp), %rdi
+ movq RSI(%rsp), %rsi
+ movq RDX(%rsp), %rdx
+ movq RCX(%rsp), %rcx
+ movq RAX(%rsp), %rax
+
+ addq $MCOUNT_REG_SIZE-\save, %rsp
+
+ .endm
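+
+/*
+ * Note: \save bytes of the frame are left in place; the direct-call path in
+ * ftrace_regs_caller passes save=8 to leave one word on the stack for RET.
+ */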
+
+SYM_TYPED_FUNC_START(ftrace_stub)
+ CALL_DEPTH_ACCOUNT
+ RET
+SYM_FUNC_END(ftrace_stub)
+
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+SYM_TYPED_FUNC_START(ftrace_stub_graph)
+ CALL_DEPTH_ACCOUNT
+ RET
+SYM_FUNC_END(ftrace_stub_graph)
+#endif
+
+#ifdef CONFIG_DYNAMIC_FTRACE
+
+SYM_FUNC_START(__fentry__)
+ ANNOTATE_NOENDBR
+ CALL_DEPTH_ACCOUNT
+ RET
+SYM_FUNC_END(__fentry__)
+EXPORT_SYMBOL(__fentry__)
+
+SYM_FUNC_START(ftrace_caller)
+ ANNOTATE_NOENDBR
+ /* save_mcount_regs fills in first two parameters */
+ save_mcount_regs
+
+ CALL_DEPTH_ACCOUNT
+
+ /* Stack - skipping return address of ftrace_caller */
+ leaq MCOUNT_REG_SIZE+8(%rsp), %rcx
+ movq %rcx, RSP(%rsp)
+
+SYM_INNER_LABEL(ftrace_caller_op_ptr, SYM_L_GLOBAL)
+ ANNOTATE_NOENDBR
+ /* Load the ftrace_ops into the 3rd parameter */
+ movq function_trace_op(%rip), %rdx
+
+ /* regs go into 4th parameter */
+ leaq (%rsp), %rcx
+
+ /* Only ops with REGS flag set should have CS register set */
+ movq $0, CS(%rsp)
+
+ /* Account for the function call below */
+ CALL_DEPTH_ACCOUNT
+
+SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
+ ANNOTATE_NOENDBR
+ call ftrace_stub
+
+ /* Handlers can change the RIP */
+ movq RIP(%rsp), %rax
+ movq %rax, MCOUNT_REG_SIZE(%rsp)
+
+ restore_mcount_regs
+
+ /*
+ * The code up to this label is copied into trampolines so
+ * think twice before adding any new code or changing the
+ * layout here.
+ */
+SYM_INNER_LABEL(ftrace_caller_end, SYM_L_GLOBAL)
+ ANNOTATE_NOENDBR
+ RET
+SYM_FUNC_END(ftrace_caller);
+STACK_FRAME_NON_STANDARD_FP(ftrace_caller)
+
+SYM_FUNC_START(ftrace_regs_caller)
+ ANNOTATE_NOENDBR
+ /* Save the current flags before any operations that can change them */
+ pushfq
+
+ /* added 8 bytes to save flags */
+ save_mcount_regs 8
+ /* save_mcount_regs fills in first two parameters */
+
+ CALL_DEPTH_ACCOUNT
+
+SYM_INNER_LABEL(ftrace_regs_caller_op_ptr, SYM_L_GLOBAL)
+ ANNOTATE_NOENDBR
+ /* Load the ftrace_ops into the 3rd parameter */
+ movq function_trace_op(%rip), %rdx
+
+ /* Save the rest of pt_regs */
+ movq %r15, R15(%rsp)
+ movq %r14, R14(%rsp)
+ movq %r13, R13(%rsp)
+ movq %r12, R12(%rsp)
+ movq %r11, R11(%rsp)
+ movq %r10, R10(%rsp)
+ movq %rbx, RBX(%rsp)
+ /* Copy saved flags */
+ movq MCOUNT_REG_SIZE(%rsp), %rcx
+ movq %rcx, EFLAGS(%rsp)
+ /* Kernel segments */
+ movq $__KERNEL_DS, %rcx
+ movq %rcx, SS(%rsp)
+ movq $__KERNEL_CS, %rcx
+ movq %rcx, CS(%rsp)
+ /* Stack - skipping return address and flags */
+ leaq MCOUNT_REG_SIZE+8*2(%rsp), %rcx
+ movq %rcx, RSP(%rsp)
+
+ ENCODE_FRAME_POINTER
+
+ /* regs go into 4th parameter */
+ leaq (%rsp), %rcx
+
+ /* Account for the function call below */
+ CALL_DEPTH_ACCOUNT
+
+SYM_INNER_LABEL(ftrace_regs_call, SYM_L_GLOBAL)
+ ANNOTATE_NOENDBR
+ call ftrace_stub
+
+ /* Copy flags back to SS, to restore them */
+ movq EFLAGS(%rsp), %rax
+ movq %rax, MCOUNT_REG_SIZE(%rsp)
+
+ /* Handlers can change the RIP */
+ movq RIP(%rsp), %rax
+ movq %rax, MCOUNT_REG_SIZE+8(%rsp)
+
+ /* restore the rest of pt_regs */
+ movq R15(%rsp), %r15
+ movq R14(%rsp), %r14
+ movq R13(%rsp), %r13
+ movq R12(%rsp), %r12
+ movq R10(%rsp), %r10
+ movq RBX(%rsp), %rbx
+
+ movq ORIG_RAX(%rsp), %rax
+ movq %rax, MCOUNT_REG_SIZE-8(%rsp)
+
+ /*
+ * If ORIG_RAX is anything but zero, make this a call to that.
+ * See arch_ftrace_set_direct_caller().
+ */
+ testq %rax, %rax
+SYM_INNER_LABEL(ftrace_regs_caller_jmp, SYM_L_GLOBAL)
+ ANNOTATE_NOENDBR
+ jnz 1f
+
+ restore_mcount_regs
+ /* Restore flags */
+ popfq
+
+ /*
+ * The trampoline will add the return.
+ */
+SYM_INNER_LABEL(ftrace_regs_caller_end, SYM_L_GLOBAL)
+ ANNOTATE_NOENDBR
+ RET
+
+1:
+ testb $1, %al
+ jz 2f
+ andq $0xfffffffffffffffe, %rax
+ movq %rax, MCOUNT_REG_SIZE+8(%rsp)
+ restore_mcount_regs
+ /* Restore flags */
+ popfq
+ RET
+
+ /* Swap the flags with orig_rax */
+2: movq MCOUNT_REG_SIZE(%rsp), %rdi
+ movq %rdi, MCOUNT_REG_SIZE-8(%rsp)
+ movq %rax, MCOUNT_REG_SIZE(%rsp)
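+ /*
+ * After restore_mcount_regs 8 and the popfq below, the direct-call
+ * address is all that remains on top of the frame, with the original
+ * return address still beneath it.
+ */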
+
+ restore_mcount_regs 8
+ /* Restore flags */
+ popfq
+ UNWIND_HINT_FUNC
+
+ /*
+ * The above left an extra return value on the stack; effectively
+ * doing a tail-call without using a register. This PUSH;RET
+ * pattern unbalances the RSB; inject a pointless CALL to rebalance it.
+ */
+ ANNOTATE_INTRA_FUNCTION_CALL
+ CALL .Ldo_rebalance
+ int3
+.Ldo_rebalance:
+ add $8, %rsp
+ ALTERNATIVE __stringify(RET), \
+ __stringify(ANNOTATE_UNRET_SAFE; ret; int3), \
+ X86_FEATURE_CALL_DEPTH
+
+SYM_FUNC_END(ftrace_regs_caller)
+STACK_FRAME_NON_STANDARD_FP(ftrace_regs_caller)
+
+SYM_FUNC_START(ftrace_stub_direct_tramp)
+ ANNOTATE_NOENDBR
+ CALL_DEPTH_ACCOUNT
+ RET
+SYM_FUNC_END(ftrace_stub_direct_tramp)
+
+#else /* ! CONFIG_DYNAMIC_FTRACE */
+
+SYM_FUNC_START(__fentry__)
+ ANNOTATE_NOENDBR
+ CALL_DEPTH_ACCOUNT
+
+ cmpq $ftrace_stub, ftrace_trace_function
+ jnz trace
+ RET
+
+trace:
+ /* save_mcount_regs fills in first two parameters */
+ save_mcount_regs
+
+ /*
+ * When DYNAMIC_FTRACE is not defined, ARCH_SUPPORTS_FTRACE_OPS is not
+ * set (see include/asm/ftrace.h and include/linux/ftrace.h). Only the
+ * ip and parent ip are used and the list function is called when
+ * function tracing is enabled.
+ */
+ movq ftrace_trace_function, %r8
+ CALL_NOSPEC r8
+ restore_mcount_regs
+
+ jmp ftrace_stub
+SYM_FUNC_END(__fentry__)
+EXPORT_SYMBOL(__fentry__)
+STACK_FRAME_NON_STANDARD_FP(__fentry__)
+
+#endif /* CONFIG_DYNAMIC_FTRACE */
+
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+SYM_CODE_START(return_to_handler)
+ UNWIND_HINT_UNDEFINED
+ ANNOTATE_NOENDBR
+
+ /* Restore return_to_handler value that got eaten by previous ret instruction. */
+ subq $8, %rsp
+ UNWIND_HINT_FUNC
+
+ /* Save ftrace_regs for function exit context */
+ subq $(FRAME_SIZE), %rsp
+
+ movq %rax, RAX(%rsp)
+ movq %rdx, RDX(%rsp)
+ movq %rbp, RBP(%rsp)
+ movq %rsp, RSP(%rsp)
+ movq %rsp, %rdi
+
+ call ftrace_return_to_handler
+
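+ /* %rax now holds the original return address; move it out of the way. */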
+ movq %rax, %rdi
+ movq RDX(%rsp), %rdx
+ movq RAX(%rsp), %rax
+
+ addq $(FRAME_SIZE) + 8, %rsp
+
+ /*
+ * Jump back to the old return address. This cannot be JMP_NOSPEC rdi
+ * since IBT would demand that it contain ENDBR, which simply isn't so for
+ * return addresses. Use a retpoline here to keep the RSB balanced.
+ */
+ ANNOTATE_INTRA_FUNCTION_CALL
+ call .Ldo_rop
+ int3
+.Ldo_rop:
+ mov %rdi, (%rsp)
+ ALTERNATIVE __stringify(RET), \
+ __stringify(ANNOTATE_UNRET_SAFE; ret; int3), \
+ X86_FEATURE_CALL_DEPTH
+SYM_CODE_END(return_to_handler)
+#endif
diff --git a/arch/x86/kernel/head.c b/arch/x86/kernel/head.c
deleted file mode 100644
index 3e66bd364a9d..000000000000
--- a/arch/x86/kernel/head.c
+++ /dev/null
@@ -1,55 +0,0 @@
-#include <linux/kernel.h>
-#include <linux/init.h>
-
-#include <asm/setup.h>
-#include <asm/bios_ebda.h>
-
-#define BIOS_LOWMEM_KILOBYTES 0x413
-
-/*
- * The BIOS places the EBDA/XBDA at the top of conventional
- * memory, and usually decreases the reported amount of
- * conventional memory (int 0x12) too. This also contains a
- * workaround for Dell systems that neglect to reserve EBDA.
- * The same workaround also avoids a problem with the AMD768MPX
- * chipset: reserve a page before VGA to prevent PCI prefetch
- * into it (errata #56). Usually the page is reserved anyways,
- * unless you have no PS/2 mouse plugged in.
- */
-void __init reserve_ebda_region(void)
-{
- unsigned int lowmem, ebda_addr;
-
- /* To determine the position of the EBDA and the */
- /* end of conventional memory, we need to look at */
- /* the BIOS data area. In a paravirtual environment */
- /* that area is absent. We'll just have to assume */
- /* that the paravirt case can handle memory setup */
- /* correctly, without our help. */
- if (paravirt_enabled())
- return;
-
- /* end of low (conventional) memory */
- lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
- lowmem <<= 10;
-
- /* start of EBDA area */
- ebda_addr = get_bios_ebda();
-
- /* Fixup: bios puts an EBDA in the top 64K segment */
- /* of conventional memory, but does not adjust lowmem. */
- if ((lowmem - ebda_addr) <= 0x10000)
- lowmem = ebda_addr;
-
- /* Fixup: bios does not report an EBDA at all. */
- /* Some old Dells seem to need 4k anyhow (bugzilla 2990) */
- if ((ebda_addr == 0) && (lowmem >= 0x9f000))
- lowmem = 0x9f000;
-
- /* Paranoia: should never happen, but... */
- if ((lowmem == 0) || (lowmem >= 0x100000))
- lowmem = 0x9f000;
-
- /* reserve all memory between lowmem and the 1MB mark */
- reserve_early_overlap_ok(lowmem, 0x100000, "BIOS reserved");
-}
diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index b2e246037392..375f2d7f1762 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* linux/arch/i386/kernel/head32.c -- prepare to run common code
*
@@ -8,66 +9,162 @@
#include <linux/init.h>
#include <linux/start_kernel.h>
#include <linux/mm.h>
+#include <linux/memblock.h>
+#include <asm/desc.h>
#include <asm/setup.h>
#include <asm/sections.h>
-#include <asm/e820.h>
+#include <asm/e820/api.h>
#include <asm/page.h>
-#include <asm/trampoline.h>
#include <asm/apic.h>
#include <asm/io_apic.h>
#include <asm/bios_ebda.h>
+#include <asm/microcode.h>
+#include <asm/tlbflush.h>
+#include <asm/bootparam_utils.h>
static void __init i386_default_early_setup(void)
{
- /* Initilize 32bit specific setup functions */
- x86_init.resources.probe_roms = probe_roms;
+ /* Initialize 32bit specific setup functions */
x86_init.resources.reserve_resources = i386_reserve_resources;
x86_init.mpparse.setup_ioapic_ids = setup_ioapic_ids_from_mpc;
-
- reserve_ebda_region();
}
-void __init i386_start_kernel(void)
+#ifdef CONFIG_MICROCODE_INITRD32
+unsigned long __initdata initrd_start_early;
+static pte_t __initdata *initrd_pl2p_start, *initrd_pl2p_end;
+
+static void zap_early_initrd_mapping(void)
{
-#ifdef CONFIG_X86_TRAMPOLINE
- /*
- * But first pinch a few for the stack/trampoline stuff
- * FIXME: Don't need the extra page at 4K, but need to fix
- * trampoline before removing it. (see the GDT stuff)
- */
- reserve_early_overlap_ok(PAGE_SIZE, PAGE_SIZE + PAGE_SIZE,
- "EX TRAMPOLINE");
-#endif
+ pte_t *pl2p = initrd_pl2p_start;
- reserve_early(__pa_symbol(&_text), __pa_symbol(&__bss_stop), "TEXT DATA BSS");
+ for (; pl2p < initrd_pl2p_end; pl2p++) {
+ *pl2p = (pte_t){ .pte = 0 };
-#ifdef CONFIG_BLK_DEV_INITRD
- /* Reserve INITRD */
- if (boot_params.hdr.type_of_loader && boot_params.hdr.ramdisk_image) {
- /* Assume only end is not page aligned */
- u64 ramdisk_image = boot_params.hdr.ramdisk_image;
- u64 ramdisk_size = boot_params.hdr.ramdisk_size;
- u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
- reserve_early(ramdisk_image, ramdisk_end, "RAMDISK");
+ if (!IS_ENABLED(CONFIG_X86_PAE))
+ *(pl2p + ((PAGE_OFFSET >> PGDIR_SHIFT))) = (pte_t) {.pte = 0};
}
+}
+#else
+static inline void zap_early_initrd_mapping(void) { }
#endif
+asmlinkage __visible void __init __noreturn i386_start_kernel(void)
+{
+ /* Make sure IDT is set up before any exception happens */
+ idt_setup_early_handler();
+
+ load_ucode_bsp();
+ zap_early_initrd_mapping();
+
+ cr4_init_shadow();
+
+ sanitize_boot_params(&boot_params);
+
+ x86_early_init_platform_quirks();
+
/* Call the subarch specific early setup function */
switch (boot_params.hdr.hardware_subarch) {
- case X86_SUBARCH_MRST:
- x86_mrst_early_setup();
+ case X86_SUBARCH_INTEL_MID:
+ x86_intel_mid_early_setup();
+ break;
+ case X86_SUBARCH_CE4100:
+ x86_ce4100_early_setup();
break;
default:
i386_default_early_setup();
break;
}
- /*
- * At this point everything still needed from the boot loader
- * or BIOS or kernel text should be early reserved or marked not
- * RAM in e820. All other memory is free game.
- */
-
start_kernel();
}
+
+/*
+ * Initialize page tables. This creates a PDE and a set of page
+ * tables, which are located immediately beyond __brk_base. The variable
+ * _brk_end is set up to point to the first "safe" location.
+ * Mappings are created both at virtual address 0 (identity mapping)
+ * and PAGE_OFFSET for up to _end.
+ *
+ * In PAE mode initial_page_table is statically defined to contain
+ * enough entries to cover the VMSPLIT option (that is the top 1, 2 or 3
+ * entries). The identity mapping is handled by pointing two PGD entries
+ * to the first kernel PMD. Note the upper half of each PMD or PTE are
+ * always zero at this stage.
+ */
+#ifdef CONFIG_X86_PAE
+typedef pmd_t pl2_t;
+#define pl2_base initial_pg_pmd
+#define SET_PL2(val) { .pmd = (val), }
+#else
+typedef pgd_t pl2_t;
+#define pl2_base initial_page_table
+#define SET_PL2(val) { .pgd = (val), }
+#endif
+
+static __init __no_stack_protector pte_t init_map(pte_t pte, pte_t **ptep, pl2_t **pl2p,
+ const unsigned long limit)
+{
+ while ((pte.pte & PTE_PFN_MASK) < limit) {
+ pl2_t pl2 = SET_PL2((unsigned long)*ptep | PDE_IDENT_ATTR);
+ int i;
+
+ **pl2p = pl2;
+ if (!IS_ENABLED(CONFIG_X86_PAE)) {
+ /* Kernel PDE entry */
+ *(*pl2p + ((PAGE_OFFSET >> PGDIR_SHIFT))) = pl2;
+ }
+
+ for (i = 0; i < PTRS_PER_PTE; i++) {
+ **ptep = pte;
+ pte.pte += PAGE_SIZE;
+ (*ptep)++;
+ }
+ (*pl2p)++;
+ }
+ return pte;
+}
+
+void __init __no_stack_protector mk_early_pgtbl_32(void)
+{
+ /* Enough space to fit pagetables for the low memory linear map */
+ unsigned long limit = __pa_nodebug(_end) + (PAGE_TABLE_SIZE(LOWMEM_PAGES) << PAGE_SHIFT);
+ pte_t pte, *ptep = (pte_t *)__pa_nodebug(__brk_base);
+ struct boot_params __maybe_unused *params;
+ pl2_t *pl2p = (pl2_t *)__pa_nodebug(pl2_base);
+ unsigned long *ptr;
+
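+ /*
+ * This runs before paging is enabled, so every global has to be
+ * accessed through its physical address (hence the __pa_nodebug()s).
+ */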
+ pte.pte = PTE_IDENT_ATTR;
+ pte = init_map(pte, &ptep, &pl2p, limit);
+
+ ptr = (unsigned long *)__pa_nodebug(&max_pfn_mapped);
+ /* Can't use pte_pfn() since it's a call with CONFIG_PARAVIRT */
+ *ptr = (pte.pte & PTE_PFN_MASK) >> PAGE_SHIFT;
+
+ ptr = (unsigned long *)__pa_nodebug(&_brk_end);
+ *ptr = (unsigned long)ptep + PAGE_OFFSET;
+
+#ifdef CONFIG_MICROCODE_INITRD32
+ params = (struct boot_params *)__pa_nodebug(&boot_params);
+ if (!params->hdr.ramdisk_size || !params->hdr.ramdisk_image)
+ return;
+
+ /* Save the virtual start address */
+ ptr = (unsigned long *)__pa_nodebug(&initrd_start_early);
+ *ptr = (pte.pte & PTE_PFN_MASK) + PAGE_OFFSET;
+ *ptr += ((unsigned long)params->hdr.ramdisk_image) & ~PAGE_MASK;
+
+ /* Save pl2p start for cleanup */
+ ptr = (unsigned long *)__pa_nodebug(&initrd_pl2p_start);
+ *ptr = (unsigned long)pl2p + PAGE_OFFSET;
+
+ limit = (unsigned long)params->hdr.ramdisk_image;
+ pte.pte = PTE_IDENT_ATTR | PFN_ALIGN(limit);
+ limit = (unsigned long)params->hdr.ramdisk_image + params->hdr.ramdisk_size;
+
+ init_map(pte, &ptep, &pl2p, limit);
+
+ ptr = (unsigned long *)__pa_nodebug(&initrd_pl2p_end);
+ *ptr = (unsigned long)pl2p + PAGE_OFFSET;
+#endif
+}
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 7147143fd614..fd28b53dbac5 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -1,9 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* prepare to run common code
*
* Copyright (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
*/
+/* cpu_feature_enabled() cannot be used this early */
+#define USE_EARLY_PGTABLE_L5
+
#include <linux/init.h>
#include <linux/linkage.h>
#include <linux/types.h>
@@ -12,112 +16,308 @@
#include <linux/percpu.h>
#include <linux/start_kernel.h>
#include <linux/io.h>
+#include <linux/memblock.h>
+#include <linux/cc_platform.h>
+#include <linux/pgtable.h>
+#include <asm/asm.h>
+#include <asm/page_64.h>
#include <asm/processor.h>
#include <asm/proto.h>
#include <asm/smp.h>
#include <asm/setup.h>
#include <asm/desc.h>
-#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <asm/sections.h>
#include <asm/kdebug.h>
-#include <asm/e820.h>
-#include <asm/trampoline.h>
+#include <asm/e820/api.h>
#include <asm/bios_ebda.h>
+#include <asm/bootparam_utils.h>
+#include <asm/microcode.h>
+#include <asm/kasan.h>
+#include <asm/fixmap.h>
+#include <asm/realmode.h>
+#include <asm/extable.h>
+#include <asm/trapnr.h>
+#include <asm/sev.h>
+#include <asm/tdx.h>
+#include <asm/init.h>
+
+/*
+ * Manage page tables very early on.
+ */
+extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
+unsigned int __initdata next_early_pgt;
+SYM_PIC_ALIAS(next_early_pgt);
+pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
+
+unsigned int __pgtable_l5_enabled __ro_after_init;
+SYM_PIC_ALIAS(__pgtable_l5_enabled);
+unsigned int pgdir_shift __ro_after_init = 39;
+EXPORT_SYMBOL(pgdir_shift);
+SYM_PIC_ALIAS(pgdir_shift);
+unsigned int ptrs_per_p4d __ro_after_init = 1;
+EXPORT_SYMBOL(ptrs_per_p4d);
+SYM_PIC_ALIAS(ptrs_per_p4d);
-static void __init zap_identity_mappings(void)
+unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
+EXPORT_SYMBOL(page_offset_base);
+unsigned long vmalloc_base __ro_after_init = __VMALLOC_BASE_L4;
+EXPORT_SYMBOL(vmalloc_base);
+unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
+EXPORT_SYMBOL(vmemmap_base);
+
+/* Wipe all early page tables except for the kernel symbol map */
+static void __init reset_early_page_tables(void)
{
- pgd_t *pgd = pgd_offset_k(0UL);
- pgd_clear(pgd);
- __flush_tlb_all();
+ memset(early_top_pgt, 0, sizeof(pgd_t)*(PTRS_PER_PGD-1));
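+ /* All slots except the last one, which maps the kernel image. */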
+ next_early_pgt = 0;
+ write_cr3(__sme_pa_nodebug(early_top_pgt));
+}
+
+/* Create a new PMD entry */
+bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd)
+{
+ unsigned long physaddr = address - __PAGE_OFFSET;
+ pgdval_t pgd, *pgd_p;
+ p4dval_t p4d, *p4d_p;
+ pudval_t pud, *pud_p;
+ pmdval_t *pmd_p;
+
+ /* Invalid address, or early pgt is done? */
+ if (physaddr >= MAXMEM || read_cr3_pa() != __pa_nodebug(early_top_pgt))
+ return false;
+
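+ /*
+ * If the EARLY_DYNAMIC_PAGE_TABLES pool runs out below, the early
+ * page tables are reset wholesale and rebuilt on demand, one page
+ * fault at a time.
+ */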
+again:
+ pgd_p = &early_top_pgt[pgd_index(address)].pgd;
+ pgd = *pgd_p;
+
+ /*
+ * The use of __START_KERNEL_map rather than __PAGE_OFFSET here is
+ * critical -- __PAGE_OFFSET would point us back into the dynamic
+ * range and we might end up looping forever...
+ */
+ if (!pgtable_l5_enabled())
+ p4d_p = pgd_p;
+ else if (pgd)
+ p4d_p = (p4dval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+ else {
+ if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+ reset_early_page_tables();
+ goto again;
+ }
+
+ p4d_p = (p4dval_t *)early_dynamic_pgts[next_early_pgt++];
+ memset(p4d_p, 0, sizeof(*p4d_p) * PTRS_PER_P4D);
+ *pgd_p = (pgdval_t)p4d_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+ }
+ p4d_p += p4d_index(address);
+ p4d = *p4d_p;
+
+ if (p4d)
+ pud_p = (pudval_t *)((p4d & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+ else {
+ if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+ reset_early_page_tables();
+ goto again;
+ }
+
+ pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
+ memset(pud_p, 0, sizeof(*pud_p) * PTRS_PER_PUD);
+ *p4d_p = (p4dval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+ }
+ pud_p += pud_index(address);
+ pud = *pud_p;
+
+ if (pud)
+ pmd_p = (pmdval_t *)((pud & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
+ else {
+ if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
+ reset_early_page_tables();
+ goto again;
+ }
+
+ pmd_p = (pmdval_t *)early_dynamic_pgts[next_early_pgt++];
+ memset(pmd_p, 0, sizeof(*pmd_p) * PTRS_PER_PMD);
+ *pud_p = (pudval_t)pmd_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
+ }
+ pmd_p[pmd_index(address)] = pmd;
+
+ return true;
+}
+
+static bool __init early_make_pgtable(unsigned long address)
+{
+ unsigned long physaddr = address - __PAGE_OFFSET;
+ pmdval_t pmd;
+
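+ /* One large (2M) PMD entry covers the whole area around the fault. */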
+ pmd = (physaddr & PMD_MASK) + early_pmd_flags;
+
+ return __early_make_pgtable(address, pmd);
+}
+
+void __init do_early_exception(struct pt_regs *regs, int trapnr)
+{
+ if (trapnr == X86_TRAP_PF &&
+ early_make_pgtable(native_read_cr2()))
+ return;
+
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT) &&
+ trapnr == X86_TRAP_VC && handle_vc_boot_ghcb(regs))
+ return;
+
+ if (trapnr == X86_TRAP_VE && tdx_early_handle_ve(regs))
+ return;
+
+ early_fixup_exception(regs, trapnr);
}
/* Don't add a printk in there. printk relies on the PDA which is not initialized
yet. */
-static void __init clear_bss(void)
+void __init clear_bss(void)
{
memset(__bss_start, 0,
(unsigned long) __bss_stop - (unsigned long) __bss_start);
+ memset(__brk_base, 0,
+ (unsigned long) __brk_limit - (unsigned long) __brk_base);
+}
+
+static unsigned long get_cmd_line_ptr(void)
+{
+ unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
+
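+ /* The setup header holds only the low 32 bits; ext_cmd_line_ptr has the rest. */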
+ cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32;
+
+ return cmd_line_ptr;
}
static void __init copy_bootdata(char *real_mode_data)
{
char * command_line;
+ unsigned long cmd_line_ptr;
+
+ /*
+ * If SME is active, this will create decrypted mappings of the
+ * boot data in advance of the copy operations.
+ */
+ sme_map_bootdata(real_mode_data);
- memcpy(&boot_params, real_mode_data, sizeof boot_params);
- if (boot_params.hdr.cmd_line_ptr) {
- command_line = __va(boot_params.hdr.cmd_line_ptr);
+ memcpy(&boot_params, real_mode_data, sizeof(boot_params));
+ sanitize_boot_params(&boot_params);
+ cmd_line_ptr = get_cmd_line_ptr();
+ if (cmd_line_ptr) {
+ command_line = __va(cmd_line_ptr);
memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
}
+
+ /*
+ * The old boot data is no longer needed and won't be reserved,
+ * freeing up that memory for use by the system. If SME is active,
+ * we need to remove the mappings that were created so that the
+ * memory doesn't remain mapped as decrypted.
+ */
+ sme_unmap_bootdata(real_mode_data);
}
-void __init x86_64_start_kernel(char * real_mode_data)
+asmlinkage __visible void __init __noreturn x86_64_start_kernel(char * real_mode_data)
{
- int i;
-
/*
* Build-time sanity checks on the kernel image and module
* area mappings. (these are purely build-time and produce no code)
*/
- BUILD_BUG_ON(MODULES_VADDR < KERNEL_IMAGE_START);
- BUILD_BUG_ON(MODULES_VADDR-KERNEL_IMAGE_START < KERNEL_IMAGE_SIZE);
+ BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
+ BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
- BUILD_BUG_ON((KERNEL_IMAGE_START & ~PMD_MASK) != 0);
+ BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
- BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
+ MAYBE_BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) ==
(__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
- /* clear bss before set_intr_gate with early_idt_handler */
+ cr4_init_shadow();
+
+ /* Kill off the identity-map trampoline */
+ reset_early_page_tables();
+
+ if (pgtable_l5_enabled()) {
+ page_offset_base = __PAGE_OFFSET_BASE_L5;
+ vmalloc_base = __VMALLOC_BASE_L5;
+ vmemmap_base = __VMEMMAP_BASE_L5;
+ }
+
clear_bss();
- /* Make NULL pointers segfault */
- zap_identity_mappings();
+ /*
+ * This needs to happen *before* kasan_early_init() because the latter maps stuff
+ * into that page.
+ */
+ clear_page(init_top_pgt);
- /* Cleanup the over mapped high alias */
- cleanup_highmap();
+ /*
+ * SME support may update early_pmd_flags to include the memory
+ * encryption mask, so it needs to be called before anything
+ * that may generate a page fault.
+ */
+ sme_early_init();
- for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
-#ifdef CONFIG_EARLY_PRINTK
- set_intr_gate(i, &early_idt_handlers[i]);
-#else
- set_intr_gate(i, early_idt_handler);
-#endif
- }
- load_idt((const struct desc_ptr *)&idt_descr);
+ kasan_early_init();
- if (console_loglevel == 10)
- early_printk("Kernel alive\n");
+ /*
+ * Flush global TLB entries which could be left over from the trampoline page
+ * table.
+ *
+ * This needs to happen *after* kasan_early_init() as KASAN-enabled .configs
+ * instrument native_write_cr4() so KASAN must be initialized for that
+ * instrumentation to work.
+ */
+ __native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));
+
+ idt_setup_early_handler();
+
+ /* Needed before cc_platform_has() can be used for TDX */
+ tdx_early_init();
+
+ copy_bootdata(__va(real_mode_data));
+
+ /*
+ * Load microcode early on BSP.
+ */
+ load_ucode_bsp();
+
+ /* Set the init_top_pgt kernel high mapping */
+ init_top_pgt[511] = early_top_pgt[511];
x86_64_start_reservations(real_mode_data);
}
-void __init x86_64_start_reservations(char *real_mode_data)
+void __init __noreturn x86_64_start_reservations(char *real_mode_data)
{
- copy_bootdata(__va(real_mode_data));
+ /* version is never zero once boot_params has been copied */
+ if (!boot_params.hdr.version)
+ copy_bootdata(__va(real_mode_data));
- reserve_early(__pa_symbol(&_text), __pa_symbol(&__bss_stop), "TEXT DATA BSS");
+ x86_early_init_platform_quirks();
-#ifdef CONFIG_BLK_DEV_INITRD
- /* Reserve INITRD */
- if (boot_params.hdr.type_of_loader && boot_params.hdr.ramdisk_image) {
- /* Assume only end is not page aligned */
- unsigned long ramdisk_image = boot_params.hdr.ramdisk_image;
- unsigned long ramdisk_size = boot_params.hdr.ramdisk_size;
- unsigned long ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
- reserve_early(ramdisk_image, ramdisk_end, "RAMDISK");
+ switch (boot_params.hdr.hardware_subarch) {
+ case X86_SUBARCH_INTEL_MID:
+ x86_intel_mid_early_setup();
+ break;
+ default:
+ break;
}
-#endif
- reserve_ebda_region();
+ start_kernel();
+}
- /*
- * At this point everything still needed from the boot loader
- * or BIOS or kernel text should be early reserved or marked not
- * RAM in e820. All other memory is free game.
- */
+void early_setup_idt(void)
+{
+ void *handler = NULL;
- start_kernel();
+ if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+ setup_ghcb();
+ handler = vc_boot_ghcb;
+ }
+
+ __pi_startup_64_load_idt(handler);
}
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 37c3d4b17d85..80ef5d386b03 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
/*
*
* Copyright (C) 1991, 1992 Linus Torvalds
@@ -7,6 +8,7 @@
*/
.text
+#include <linux/export.h>
#include <linux/threads.h>
#include <linux/init.h>
#include <linux/linkage.h>
@@ -19,8 +21,12 @@
#include <asm/setup.h>
#include <asm/processor-flags.h>
#include <asm/msr-index.h>
-#include <asm/cpufeature.h>
+#include <asm/cpufeatures.h>
#include <asm/percpu.h>
+#include <asm/nops.h>
+#include <asm/nospec-branch.h>
+#include <asm/bootparam.h>
+#include <asm/pgtable_32.h>
/* Physical address */
#define pa(X) ((X) - __PAGE_OFFSET)
@@ -32,46 +38,20 @@
#define X86 new_cpu_data+CPUINFO_x86
#define X86_VENDOR new_cpu_data+CPUINFO_x86_vendor
#define X86_MODEL new_cpu_data+CPUINFO_x86_model
-#define X86_MASK new_cpu_data+CPUINFO_x86_mask
+#define X86_STEPPING new_cpu_data+CPUINFO_x86_stepping
#define X86_HARD_MATH new_cpu_data+CPUINFO_hard_math
#define X86_CPUID new_cpu_data+CPUINFO_cpuid_level
#define X86_CAPABILITY new_cpu_data+CPUINFO_x86_capability
#define X86_VENDOR_ID new_cpu_data+CPUINFO_x86_vendor_id
/*
- * This is how much memory in addition to the memory covered up to
- * and including _end we need mapped initially.
- * We need:
- * (KERNEL_IMAGE_SIZE/4096) / 1024 pages (worst case, non PAE)
- * (KERNEL_IMAGE_SIZE/4096) / 512 + 4 pages (worst case for PAE)
- *
- * Modulo rounding, each megabyte assigned here requires a kilobyte of
- * memory, which is currently unreclaimed.
- *
- * This should be a multiple of a page.
- *
- * KERNEL_IMAGE_SIZE should be greater than pa(_end)
- * and small than max_low_pfn, otherwise will waste some page table entries
- */
-
-#if PTRS_PER_PMD > 1
-#define PAGE_TABLE_SIZE(pages) (((pages) / PTRS_PER_PMD) + PTRS_PER_PGD)
-#else
-#define PAGE_TABLE_SIZE(pages) ((pages) / PTRS_PER_PGD)
-#endif
-
-/* Enough space to fit pagetables for the low memory linear map */
-MAPPING_BEYOND_END = \
- PAGE_TABLE_SIZE(((1<<32) - __PAGE_OFFSET) >> PAGE_SHIFT) << PAGE_SHIFT
-
-/*
* Worst-case size of the kernel mapping we need to make:
- * the worst-case size of the kernel itself, plus the extra we need
- * to map for the linear map.
+ * a relocatable kernel can live anywhere in lowmem, so we need to be able
+ * to map all of lowmem.
*/
-KERNEL_PAGES = (KERNEL_IMAGE_SIZE + MAPPING_BEYOND_END)>>PAGE_SHIFT
+KERNEL_PAGES = LOWMEM_PAGES
-INIT_MAP_SIZE = PAGE_TABLE_SIZE(KERNEL_PAGES) * PAGE_SIZE_asm
+INIT_MAP_SIZE = PAGE_TABLE_SIZE(KERNEL_PAGES) * PAGE_SIZE
RESERVE_BRK(pagetables, INIT_MAP_SIZE)
/*
@@ -81,13 +61,10 @@ RESERVE_BRK(pagetables, INIT_MAP_SIZE)
* any particular GDT layout, because we load our own as soon as we
* can.
*/
-__HEAD
-ENTRY(startup_32)
- /* test KEEP_SEGMENTS flag to see if the bootloader is asking
- us to not reload segments */
- testb $(1<<6), BP_loadflags(%esi)
- jnz 2f
-
+ __INIT
+SYM_CODE_START(startup_32)
+ movl pa(initial_stack),%ecx
+
/*
* Set segments to known values.
*/
@@ -97,7 +74,8 @@ ENTRY(startup_32)
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
-2:
+ movl %eax,%ss
+ leal -__PAGE_OFFSET(%ecx),%esp
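+ /* initial_stack holds a virtual address; rebase it while paging is still off. */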
/*
* Clear BSS first so that there are no surprises...
@@ -108,7 +86,7 @@ ENTRY(startup_32)
movl $pa(__bss_stop),%ecx
subl %edi,%ecx
shrl $2,%ecx
- rep ; stosl
+ rep stosl
/*
* Copy bootup parameters out of the way.
* Note: %esi still has the pointer to the real-mode data.
@@ -120,141 +98,36 @@ ENTRY(startup_32)
movl $pa(boot_params),%edi
movl $(PARAM_SIZE/4),%ecx
cld
- rep
- movsl
+ rep movsl
movl pa(boot_params) + NEW_CL_POINTER,%esi
andl %esi,%esi
- jz 1f # No comand line
+ jz 1f # No command line
movl $pa(boot_command_line),%edi
movl $(COMMAND_LINE_SIZE/4),%ecx
- rep
- movsl
+ rep movsl
1:
-#ifdef CONFIG_PARAVIRT
- /* This is can only trip for a broken bootloader... */
- cmpw $0x207, pa(boot_params + BP_version)
- jb default_entry
-
- /* Paravirt-compatible boot parameters. Look to see what architecture
- we're booting under. */
- movl pa(boot_params + BP_hardware_subarch), %eax
- cmpl $num_subarch_entries, %eax
- jae bad_subarch
-
- movl pa(subarch_entries)(,%eax,4), %eax
- subl $__PAGE_OFFSET, %eax
- jmp *%eax
-
-bad_subarch:
-WEAK(lguest_entry)
-WEAK(xen_entry)
- /* Unknown implementation; there's really
- nothing we can do at this point. */
- ud2a
-
- __INITDATA
-
-subarch_entries:
- .long default_entry /* normal x86/PC */
- .long lguest_entry /* lguest hypervisor */
- .long xen_entry /* Xen hypervisor */
- .long default_entry /* Moorestown MID */
-num_subarch_entries = (. - subarch_entries) / 4
-.previous
-#endif /* CONFIG_PARAVIRT */
-
-/*
- * Initialize page tables. This creates a PDE and a set of page
- * tables, which are located immediately beyond __brk_base. The variable
- * _brk_end is set up to point to the first "safe" location.
- * Mappings are created both at virtual address 0 (identity mapping)
- * and PAGE_OFFSET for up to _end.
- *
- * Note that the stack is not yet set up!
- */
-default_entry:
-#ifdef CONFIG_X86_PAE
+#ifdef CONFIG_OLPC
+ /* save OFW's pgdir table for later use when calling into OFW */
+ movl %cr3, %eax
+ movl %eax, pa(olpc_ofw_pgd)
+#endif
- /*
- * In PAE mode swapper_pg_dir is statically defined to contain enough
- * entries to cover the VMSPLIT option (that is the top 1, 2 or 3
- * entries). The identity mapping is handled by pointing two PGD
- * entries to the first kernel PMD.
- *
- * Note the upper half of each PMD or PTE are always zero at
- * this stage.
- */
+ /* Create early pagetables. */
+ call mk_early_pgtbl_32
+ /* Do early initialization of the fixmap area */
+ movl $pa(initial_pg_fixmap)+PDE_IDENT_ATTR,%eax
+#ifdef CONFIG_X86_PAE
#define KPMDS (((-__PAGE_OFFSET) >> 30) & 3) /* Number of kernel PMDs */
+ movl %eax,pa(initial_pg_pmd+0x1000*KPMDS-8)
+#else
+ movl %eax,pa(initial_page_table+0xffc)
+#endif
- xorl %ebx,%ebx /* %ebx is kept at zero */
-
- movl $pa(__brk_base), %edi
- movl $pa(swapper_pg_pmd), %edx
- movl $PTE_IDENT_ATTR, %eax
-10:
- leal PDE_IDENT_ATTR(%edi),%ecx /* Create PMD entry */
- movl %ecx,(%edx) /* Store PMD entry */
- /* Upper half already zero */
- addl $8,%edx
- movl $512,%ecx
-11:
- stosl
- xchgl %eax,%ebx
- stosl
- xchgl %eax,%ebx
- addl $0x1000,%eax
- loop 11b
-
- /*
- * End condition: we must map up to the end + MAPPING_BEYOND_END.
- */
- movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
- cmpl %ebp,%eax
- jb 10b
-1:
- addl $__PAGE_OFFSET, %edi
- movl %edi, pa(_brk_end)
- shrl $12, %eax
- movl %eax, pa(max_pfn_mapped)
-
- /* Do early initialization of the fixmap area */
- movl $pa(swapper_pg_fixmap)+PDE_IDENT_ATTR,%eax
- movl %eax,pa(swapper_pg_pmd+0x1000*KPMDS-8)
-#else /* Not PAE */
-
-page_pde_offset = (__PAGE_OFFSET >> 20);
-
- movl $pa(__brk_base), %edi
- movl $pa(swapper_pg_dir), %edx
- movl $PTE_IDENT_ATTR, %eax
-10:
- leal PDE_IDENT_ATTR(%edi),%ecx /* Create PDE entry */
- movl %ecx,(%edx) /* Store identity PDE entry */
- movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
- addl $4,%edx
- movl $1024, %ecx
-11:
- stosl
- addl $0x1000,%eax
- loop 11b
- /*
- * End condition: we must map up to the end + MAPPING_BEYOND_END.
- */
- movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
- cmpl %ebp,%eax
- jb 10b
- addl $__PAGE_OFFSET, %edi
- movl %edi, pa(_brk_end)
- shrl $12, %eax
- movl %eax, pa(max_pfn_mapped)
+ jmp .Ldefault_entry
+SYM_CODE_END(startup_32)
- /* Do early initialization of the fixmap area */
- movl $pa(swapper_pg_fixmap)+PDE_IDENT_ATTR,%eax
- movl %eax,pa(swapper_pg_dir+0xffc)
-#endif
- jmp 3f
/*
* Non-boot CPU entry point; entered from trampoline.S
* We can't lgdt here, because lgdt itself uses a data segment, but
@@ -263,44 +136,64 @@ page_pde_offset = (__PAGE_OFFSET >> 20);
* If cpu hotplug is not supported then this code can go in init section
* which will be freed later
*/
-
-__CPUINIT
-
-#ifdef CONFIG_SMP
-ENTRY(startup_32_smp)
+#ifdef CONFIG_HOTPLUG_CPU
+ .text
+#endif
+SYM_FUNC_START(startup_32_smp)
cld
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
-#endif /* CONFIG_SMP */
-3:
+ movl pa(initial_stack),%ecx
+ movl %eax,%ss
+ leal -__PAGE_OFFSET(%ecx),%esp
+
+.Ldefault_entry:
+ movl $(CR0_STATE & ~X86_CR0_PG),%eax
+ movl %eax,%cr0
/*
- * New page tables may be in 4Mbyte page mode and may
- * be using the global pages.
- *
- * NOTE! If we are on a 486 we may have no cr4 at all!
- * So we do not try to touch it unless we really have
- * some bits in it to set. This won't work if the BSP
- * implements cr4 but this AP does not -- very unlikely
- * but be warned! The same applies to the pse feature
- * if not equally supported. --macro
+ * We want to start out with EFLAGS unambiguously cleared. Some BIOSes leave
+ * bits like NT set. This would confuse the debugger if this code is traced. So
+ * initialize them properly now before switching to protected mode. That means
+ * DF in particular (even though we have cleared it earlier after copying the
+ * command line) because GCC expects it.
+ */
+ pushl $0
+ popfl
+
+/*
+ * New page tables may be in 4Mbyte page mode and may be using the global pages.
*
- * NOTE! We have to correct for the fact that we're
- * not yet offset PAGE_OFFSET..
+ * NOTE! If we are on a 486 we may have no cr4 at all! Specifically, cr4 exists
+ * if and only if CPUID exists and has flags other than the FPU flag set.
*/
-#define cr4_bits pa(mmu_cr4_features)
- movl cr4_bits,%edx
- andl %edx,%edx
- jz 6f
- movl %cr4,%eax # Turn on paging options (PSE,PAE,..)
- orl %edx,%eax
+ movl $-1,pa(X86_CPUID) # preset CPUID level
+ movl $X86_EFLAGS_ID,%ecx
+ pushl %ecx
+ popfl # set EFLAGS=ID
+ pushfl
+ popl %eax # get EFLAGS
+ testl $X86_EFLAGS_ID,%eax # did EFLAGS.ID remain set?
+ jz .Lenable_paging # hw disallowed setting of ID bit
+ # which means no CPUID and no CR4
+
+ xorl %eax,%eax
+ cpuid
+ movl %eax,pa(X86_CPUID) # save largest std CPUID function
+
+ movl $1,%eax
+ cpuid
+ andl $~1,%edx # Ignore CPUID.FPU
+ jz .Lenable_paging # No flags or only CPUID.FPU = no CR4
+
+ movl pa(mmu_cr4_features),%eax
movl %eax,%cr4
testb $X86_CR4_PAE, %al # check if PAE is enabled
- jz 6f
+ jz .Lenable_paging
/* Check if extended functions are implemented */
movl $0x80000000, %eax
@@ -308,12 +201,16 @@ ENTRY(startup_32_smp)
/* Value must be in the range 0x80000001 to 0x8000ffff */
subl $0x80000001, %eax
cmpl $(0x8000ffff-0x80000001), %eax
- ja 6f
+ ja .Lenable_paging
+
+ /* Clear bogus XD_DISABLE bits */
+ call verify_cpu
+
mov $0x80000001, %eax
cpuid
/* Execute Disable bit supported? */
btl $(X86_FEATURE_NX & 31), %edx
- jnc 6f
+ jnc .Lenable_paging
/* Setup EFER (Extended Feature Enable Register) */
movl $MSR_EFER, %ecx
@@ -323,71 +220,26 @@ ENTRY(startup_32_smp)
/* Make changes effective */
wrmsr
-6:
+.Lenable_paging:
/*
* Enable paging
*/
- movl $pa(swapper_pg_dir),%eax
+ movl $pa(initial_page_table), %eax
movl %eax,%cr3 /* set the page table pointer.. */
- movl %cr0,%eax
- orl $X86_CR0_PG,%eax
+ movl $CR0_STATE,%eax
movl %eax,%cr0 /* ..and set paging (PG) bit */
ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */
1:
- /* Set up the stack pointer */
- lss stack_start,%esp
-
-/*
- * Initialize eflags. Some BIOS's leave bits like NT set. This would
- * confuse the debugger if this code is traced.
- * XXX - best to initialize before switching to protected mode.
- */
- pushl $0
- popfl
-
-#ifdef CONFIG_SMP
- cmpb $0, ready
- jz 1f /* Initial CPU cleans BSS */
- jmp checkCPUtype
-1:
-#endif /* CONFIG_SMP */
-
-/*
- * start system 32-bit setup. We need to re-do some of the things done
- * in 16-bit mode for the "real" operations.
- */
- call setup_idt
-
-checkCPUtype:
-
- movl $-1,X86_CPUID # -1 for no CPUID initially
+ /* Shift the stack pointer to a virtual address */
+ addl $__PAGE_OFFSET, %esp
-/* check if it is 486 or 386. */
/*
- * XXX - this does a lot of unnecessary setup. Alignment checks don't
- * apply at our cpl of 0 and the stack ought to be aligned already, and
- * we don't need to preserve eflags.
+ * Check if it is 486
*/
-
- movb $3,X86 # at least 386
- pushfl # push EFLAGS
- popl %eax # get EFLAGS
- movl %eax,%ecx # save original EFLAGS
- xorl $0x240000,%eax # flip AC and ID bits in EFLAGS
- pushl %eax # copy to EFLAGS
- popfl # set EFLAGS
- pushfl # get new EFLAGS
- popl %eax # put it in eax
- xorl %ecx,%eax # change in flags
- pushl %ecx # restore original EFLAGS
- popfl
- testl $0x40000,%eax # check if AC bit changed
- je is386
-
- movb $4,X86 # at least 486
- testl $0x200000,%eax # check if ID bit changed
- je is486
+ movb $4,X86 # at least 486
+ cmpl $-1,X86_CPUID
+ je .Lis486
/* get vendor info */
xorl %eax,%eax # call CPUID with 0 -> return vendor ID
@@ -398,7 +250,7 @@ checkCPUtype:
movl %ecx,X86_VENDOR_ID+8 # last 4 chars
orl %eax,%eax # do we have processor info as well?
- je is486
+ je .Lis486
movl $1,%eax # Use the CPUID instruction to get CPU type
cpuid
@@ -409,21 +261,17 @@ checkCPUtype:
shrb $4,%al
movb %al,X86_MODEL
 andb $0x0f,%cl # mask revision
- movb %cl,X86_MASK
+ movb %cl,X86_STEPPING
movl %edx,X86_CAPABILITY
-is486: movl $0x50022,%ecx # set AM, WP, NE and MP
- jmp 2f
-
-is386: movl $2,%ecx # set MP
-2: movl %cr0,%eax
+.Lis486:
+ movl $0x50022,%ecx # set AM, WP, NE and MP
+ movl %cr0,%eax
andl $0x80000011,%eax # Save PG,PE,ET
orl %ecx,%eax
movl %eax,%cr0
- call check_x87
lgdt early_gdt_descr
- lidt idt_descr
ljmp $(__KERNEL_CS),$1f
1: movl $(__KERNEL_DS),%eax # reload all the segment registers
movl %eax,%ss # after changing gdt.
@@ -435,144 +283,91 @@ is386: movl $2,%ecx # set MP
movl $(__KERNEL_PERCPU), %eax
movl %eax,%fs # set this cpu's percpu
-#ifdef CONFIG_CC_STACKPROTECTOR
- /*
- * The linker can't handle this by relocation. Manually set
- * base address in stack canary segment descriptor.
- */
- cmpb $0,ready
- jne 1f
- movl $gdt_page,%eax
- movl $stack_canary,%ecx
- movw %cx, 8 * GDT_ENTRY_STACK_CANARY + 2(%eax)
- shrl $16, %ecx
- movb %cl, 8 * GDT_ENTRY_STACK_CANARY + 4(%eax)
- movb %ch, 8 * GDT_ENTRY_STACK_CANARY + 7(%eax)
-1:
-#endif
- movl $(__KERNEL_STACK_CANARY),%eax
- movl %eax,%gs
+ xorl %eax,%eax
+ movl %eax,%gs # clear possible garbage in %gs
xorl %eax,%eax # Clear LDT
lldt %ax
- cld # gcc2 wants the direction flag cleared at all times
- pushl $0 # fake return address for unwinder
-#ifdef CONFIG_SMP
- movb ready, %cl
- movb $1, ready
- cmpb $0,%cl # the first CPU calls start_kernel
- je 1f
- movl (stack_start), %esp
-1:
-#endif /* CONFIG_SMP */
- jmp *(initial_code)
+ call *(initial_code)
+1: jmp 1b
+SYM_FUNC_END(startup_32_smp)
+
+#include "verify_cpu.S"
+
+__INIT
+SYM_FUNC_START(early_idt_handler_array)
+ # 36(%esp) %eflags
+ # 32(%esp) %cs
+ # 28(%esp) %eip
+ # 24(%esp) error code
+ i = 0
+ .rept NUM_EXCEPTION_VECTORS
+ .if ((EXCEPTION_ERRCODE_MASK >> i) & 1) == 0
+ pushl $0 # Dummy error code, to make stack frame uniform
+ .endif
+ pushl $i # 20(%esp) Vector number
+ jmp early_idt_handler_common
+ i = i + 1
+ .fill early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc
+ .endr
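+ # Each stub is padded to EARLY_IDT_HANDLER_SIZE bytes, so entry i sits
+ # at early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE.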
+SYM_FUNC_END(early_idt_handler_array)
+
+SYM_CODE_START_LOCAL(early_idt_handler_common)
+ /*
+ * The stack is the hardware frame, an error code or zero, and the
+ * vector number.
+ */
+ cld
-/*
- * We depend on ET to be correct. This checks for 287/387.
- */
-check_x87:
- movb $0,X86_HARD_MATH
- clts
- fninit
- fstsw %ax
- cmpb $0,%al
- je 1f
- movl %cr0,%eax /* no coprocessor: have to set bits */
- xorl $4,%eax /* set EM */
- movl %eax,%cr0
- ret
- ALIGN
-1: movb $1,X86_HARD_MATH
- .byte 0xDB,0xE4 /* fsetpm for 287, ignored by 387 */
- ret
+ incl %ss:early_recursion_flag
+
+ /* The vector number is in pt_regs->gs */
-/*
- * setup_idt
- *
- * sets up a idt with 256 entries pointing to
- * ignore_int, interrupt gates. It doesn't actually load
- * idt - that can be done only after paging has been enabled
- * and the kernel moved to PAGE_OFFSET. Interrupts
- * are enabled elsewhere, when we can be relatively
- * sure everything is ok.
- *
- * Warning: %esi is live across this function.
- */
-setup_idt:
- lea ignore_int,%edx
- movl $(__KERNEL_CS << 16),%eax
- movw %dx,%ax /* selector = 0x0010 = cs */
- movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
-
- lea idt_table,%edi
- mov $256,%ecx
-rp_sidt:
- movl %eax,(%edi)
- movl %edx,4(%edi)
- addl $8,%edi
- dec %ecx
- jne rp_sidt
-
-.macro set_early_handler handler,trapno
- lea \handler,%edx
- movl $(__KERNEL_CS << 16),%eax
- movw %dx,%ax
- movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
- lea idt_table,%edi
- movl %eax,8*\trapno(%edi)
- movl %edx,8*\trapno+4(%edi)
-.endm
-
- set_early_handler handler=early_divide_err,trapno=0
- set_early_handler handler=early_illegal_opcode,trapno=6
- set_early_handler handler=early_protection_fault,trapno=13
- set_early_handler handler=early_page_fault,trapno=14
-
- ret
-
-early_divide_err:
- xor %edx,%edx
- pushl $0 /* fake errcode */
- jmp early_fault
-
-early_illegal_opcode:
- movl $6,%edx
- pushl $0 /* fake errcode */
- jmp early_fault
-
-early_protection_fault:
- movl $13,%edx
- jmp early_fault
-
-early_page_fault:
- movl $14,%edx
- jmp early_fault
-
-early_fault:
cld
-#ifdef CONFIG_PRINTK
- pusha
- movl $(__KERNEL_DS),%eax
- movl %eax,%ds
- movl %eax,%es
- cmpl $2,early_recursion_flag
- je hlt_loop
- incl early_recursion_flag
- movl %cr2,%eax
- pushl %eax
- pushl %edx /* trapno */
- pushl $fault_msg
- call printk
-#endif
- call dump_stack
-hlt_loop:
- hlt
- jmp hlt_loop
+ pushl %fs /* pt_regs->fs (__fsh varies by model) */
+ pushl %es /* pt_regs->es (__esh varies by model) */
+ pushl %ds /* pt_regs->ds (__dsh varies by model) */
+ pushl %eax /* pt_regs->ax */
+ pushl %ebp /* pt_regs->bp */
+ pushl %edi /* pt_regs->di */
+ pushl %esi /* pt_regs->si */
+ pushl %edx /* pt_regs->dx */
+ pushl %ecx /* pt_regs->cx */
+ pushl %ebx /* pt_regs->bx */
+
+ /* Fix up DS and ES */
+ movl $(__KERNEL_DS), %ecx
+ movl %ecx, %ds
+ movl %ecx, %es
+
+ /* Load the vector number into EDX */
+ movl PT_GS(%esp), %edx
+
+ /* Load GS into pt_regs->gs (and maybe clobber __gsh) */
+ movw %gs, PT_GS(%esp)
+
+ movl %esp, %eax /* args are pt_regs (EAX), trapnr (EDX) */
+ call early_fixup_exception
+
+ popl %ebx /* pt_regs->bx */
+ popl %ecx /* pt_regs->cx */
+ popl %edx /* pt_regs->dx */
+ popl %esi /* pt_regs->si */
+ popl %edi /* pt_regs->di */
+ popl %ebp /* pt_regs->bp */
+ popl %eax /* pt_regs->ax */
+ popl %ds /* pt_regs->ds (always ignores __dsh) */
+ popl %es /* pt_regs->es (always ignores __esh) */
+ popl %fs /* pt_regs->fs (always ignores __fsh) */
+ popl %gs /* pt_regs->gs (always ignores __gsh) */
+ decl %ss:early_recursion_flag
+ addl $4, %esp /* pop pt_regs->orig_ax */
+ iret
+SYM_CODE_END(early_idt_handler_common)
/* This is the default interrupt "handler" :-) */
- ALIGN
-ignore_int:
+SYM_FUNC_START(early_ignore_irq)
cld
#ifdef CONFIG_PRINTK
pushl %eax
@@ -591,7 +386,7 @@ ignore_int:
pushl 32(%esp)
pushl 40(%esp)
pushl $int_msg
- call printk
+ call _printk
call dump_stack
@@ -604,27 +399,52 @@ ignore_int:
#endif
iret
- __REFDATA
-.align 4
-ENTRY(initial_code)
- .long i386_start_kernel
+hlt_loop:
+ hlt
+ jmp hlt_loop
+SYM_FUNC_END(early_ignore_irq)
+
+__INITDATA
+ .align 4
+SYM_DATA(early_recursion_flag, .long 0)
+
+__REFDATA
+ .align 4
+SYM_DATA(initial_code, .long i386_start_kernel)
+#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
+#define PGD_ALIGN (2 * PAGE_SIZE)
+#define PTI_USER_PGD_FILL 1024
+#else
+#define PGD_ALIGN (PAGE_SIZE)
+#define PTI_USER_PGD_FILL 0
+#endif
/*
* BSS section
*/
__PAGE_ALIGNED_BSS
- .align PAGE_SIZE_asm
+ .align PGD_ALIGN
#ifdef CONFIG_X86_PAE
-swapper_pg_pmd:
+.globl initial_pg_pmd
+initial_pg_pmd:
.fill 1024*KPMDS,4,0
#else
-ENTRY(swapper_pg_dir)
+.globl initial_page_table
+initial_page_table:
.fill 1024,4,0
#endif
-swapper_pg_fixmap:
+ .align PGD_ALIGN
+initial_pg_fixmap:
.fill 1024,4,0
-ENTRY(empty_zero_page)
+.globl swapper_pg_dir
+ .align PGD_ALIGN
+swapper_pg_dir:
+ .fill 1024,4,0
+ .fill PTI_USER_PGD_FILL,4,0
+.globl empty_zero_page
+empty_zero_page:
.fill 4096,1,0
+EXPORT_SYMBOL(empty_zero_page)
/*
* This starts the data section.
@@ -632,53 +452,47 @@ ENTRY(empty_zero_page)
#ifdef CONFIG_X86_PAE
__PAGE_ALIGNED_DATA
/* Page-aligned for the benefit of paravirt? */
- .align PAGE_SIZE_asm
-ENTRY(swapper_pg_dir)
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
+ .align PGD_ALIGN
+SYM_DATA_START(initial_page_table)
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0 /* low identity map */
# if KPMDS == 3
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR),0
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR+0x1000),0
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR+0x2000),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x2000),0
# elif KPMDS == 2
.long 0,0
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR),0
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR+0x1000),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR+0x1000),0
# elif KPMDS == 1
.long 0,0
.long 0,0
- .long pa(swapper_pg_pmd+PGD_IDENT_ATTR),0
+ .long pa(initial_pg_pmd+PGD_IDENT_ATTR),0
# else
# error "Kernel PMDs should be 1, 2 or 3"
# endif
- .align PAGE_SIZE_asm /* needs to be page-sized too */
-#endif
+ .align PAGE_SIZE /* needs to be page-sized too */
-.data
-ENTRY(stack_start)
- .long init_thread_union+THREAD_SIZE
- .long __BOOT_DS
+#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
+ /*
+ * PTI needs another page so sync_initial_pagetable() works correctly
+ * and does not scribble over the data which is placed behind the
+ * actual initial_page_table. See clone_pgd_range().
+ */
+ .fill 1024, 4, 0
+#endif
-ready: .byte 0
+SYM_DATA_END(initial_page_table)
+#endif
-early_recursion_flag:
- .long 0
+.data
+.balign 4
+SYM_DATA(initial_stack, .long __top_init_kernel_stack)
+__INITRODATA
int_msg:
.asciz "Unknown interrupt or fault at: %p %p %p\n"
-fault_msg:
-/* fault info: */
- .ascii "BUG: Int %d: CR2 %p\n"
-/* pusha regs: */
- .ascii " EDI %p ESI %p EBP %p ESP %p\n"
- .ascii " EBX %p EDX %p ECX %p EAX %p\n"
-/* fault frame: */
- .ascii " err %p EIP %p CS %p flg %p\n"
- .ascii "Stack: %p %p %p %p %p %p %p %p\n"
- .ascii " %p %p %p %p %p %p %p %p\n"
- .asciz " %p %p %p %p %p %p %p %p\n"
-
-#include "../../x86/xen/xen-head.S"
+#include "../xen/xen-head.S"
/*
* The IDT and GDT 'descriptors' are a strange 48-bit object
@@ -687,33 +501,29 @@ fault_msg:
* segment size, and 32-bit linear address value:
*/
-.globl boot_gdt_descr
-.globl idt_descr
-
+ .data
ALIGN
# early boot GDT descriptor (must use 1:1 address mapping)
.word 0 # 32 bit align gdt_desc.address
-boot_gdt_descr:
+SYM_DATA_START_LOCAL(boot_gdt_descr)
.word __BOOT_DS+7
.long boot_gdt - __PAGE_OFFSET
-
- .word 0 # 32-bit align idt_desc.address
-idt_descr:
- .word IDT_ENTRIES*8-1 # idt contains 256 entries
- .long idt_table
+SYM_DATA_END(boot_gdt_descr)
# boot GDT descriptor (later on used by CPU#0):
.word 0 # 32 bit align gdt_desc.address
-ENTRY(early_gdt_descr)
+SYM_DATA_START(early_gdt_descr)
.word GDT_ENTRIES*8-1
.long gdt_page /* Overwritten for secondary CPUs */
+SYM_DATA_END(early_gdt_descr)
/*
* The boot_gdt must mirror the equivalent in setup.S and is
* used only for booting.
*/
.align L1_CACHE_BYTES
-ENTRY(boot_gdt)
+SYM_DATA_START(boot_gdt)
.fill GDT_ENTRY_BOOT_CS,8,0
.quad 0x00cf9a000000ffff /* kernel 4GB code at 0x00000000 */
.quad 0x00cf92000000ffff /* kernel 4GB data at 0x00000000 */
+SYM_DATA_END(boot_gdt)
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 3d1e6f16b7a6..21816b48537c 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -1,5 +1,6 @@
+/* SPDX-License-Identifier: GPL-2.0 */
/*
- * linux/arch/x86_64/kernel/head.S -- start in 32bit and switch to 64bit
+ * linux/arch/x86/kernel/head_64.S -- start in 32bit and switch to 64bit
*
* Copyright (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
* Copyright (C) 2000 Pavel Machek <pavel@suse.cz>
@@ -8,53 +9,46 @@
* Copyright (C) 2005 Eric Biederman <ebiederm@xmission.com>
*/
-
+#include <linux/export.h>
#include <linux/linkage.h>
#include <linux/threads.h>
#include <linux/init.h>
+#include <linux/pgtable.h>
#include <asm/segment.h>
-#include <asm/pgtable.h>
#include <asm/page.h>
#include <asm/msr.h>
#include <asm/cache.h>
#include <asm/processor-flags.h>
#include <asm/percpu.h>
+#include <asm/nops.h>
+#include "../entry/calling.h"
+#include <asm/nospec-branch.h>
+#include <asm/apicdef.h>
+#include <asm/fixmap.h>
+#include <asm/smp.h>
+#include <asm/thread_info.h>
-#ifdef CONFIG_PARAVIRT
-#include <asm/asm-offsets.h>
-#include <asm/paravirt.h>
-#else
-#define GET_CR2_INTO_RCX movq %cr2, %rcx
-#endif
-
-/* we are not able to switch in one step to the final KERNEL ADDRESS SPACE
+/*
+ * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
* because we need identity-mapped pages.
- *
*/
-#define pud_index(x) (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
-
-L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET)
-L3_PAGE_OFFSET = pud_index(__PAGE_OFFSET)
-L4_START_KERNEL = pgd_index(__START_KERNEL_map)
-L3_START_KERNEL = pud_index(__START_KERNEL_map)
-
- .text
- __HEAD
+ __INIT
.code64
- .globl startup_64
-startup_64:
-
+SYM_CODE_START_NOALIGN(startup_64)
+ UNWIND_HINT_END_OF_STACK
/*
- * At this point the CPU runs in 64bit mode CS.L = 1 CS.D = 1,
+ * At this point the CPU runs in 64bit mode CS.L = 1 CS.D = 0,
* and someone has loaded an identity mapped page table
* for us. These identity mapped page tables map all of the
* kernel pages and possibly all of memory.
*
- * %esi holds a physical pointer to real_mode_data.
+ * %RSI holds the physical address of the boot_params structure
+ * provided by the bootloader. Preserve it in %R15 so C function calls
+ * will not clobber it.
*
* We come here either directly from a 64bit bootloader, or from
- * arch/x86_64/boot/compressed/head.S.
+ * arch/x86/boot/compressed/head_64.S.
*
 * We only come here initially at boot; nothing else comes here.
*
@@ -62,97 +56,101 @@ startup_64:
* compiled to run at we first fixup the physical addresses in our page
* tables and then reload them.
*/
+ mov %rsi, %r15
- /* Compute the delta between the address I am compiled to run at and the
- * address I am actually running at.
- */
- leaq _text(%rip), %rbp
- subq $_text - __START_KERNEL_map, %rbp
-
- /* Is the address not 2M aligned? */
- movq %rbp, %rax
- andl $~PMD_PAGE_MASK, %eax
- testl %eax, %eax
- jnz bad_address
-
- /* Is the address too large? */
- leaq _text(%rip), %rdx
- movq $PGDIR_SIZE, %rax
- cmpq %rax, %rdx
- jae bad_address
-
- /* Fixup the physical addresses in the page table
- */
- addq %rbp, init_level4_pgt + 0(%rip)
- addq %rbp, init_level4_pgt + (L4_PAGE_OFFSET*8)(%rip)
- addq %rbp, init_level4_pgt + (L4_START_KERNEL*8)(%rip)
+ /* Set up the stack for verify_cpu() */
+ leaq __top_init_kernel_stack(%rip), %rsp
- addq %rbp, level3_ident_pgt + 0(%rip)
+ /*
+ * Set up GSBASE.
+ * Note that on SMP the boot CPU uses the init data section until
+ * the per-CPU areas are set up.
+ */
+ movl $MSR_GS_BASE, %ecx
+ xorl %eax, %eax
+ xorl %edx, %edx
+ wrmsr
- addq %rbp, level3_kernel_pgt + (510*8)(%rip)
- addq %rbp, level3_kernel_pgt + (511*8)(%rip)
+ call __pi_startup_64_setup_gdt_idt
- addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
+ /* Now switch to __KERNEL_CS so IRET works reliably */
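+ /* (CS cannot be loaded with MOV; the far return below reloads it.) */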
+ pushq $__KERNEL_CS
+ leaq .Lon_kernel_cs(%rip), %rax
+ pushq %rax
+ lretq
- /* Add an Identity mapping if I am above 1G */
- leaq _text(%rip), %rdi
- andq $PMD_PAGE_MASK, %rdi
+.Lon_kernel_cs:
+ ANNOTATE_NOENDBR
+ UNWIND_HINT_END_OF_STACK
- movq %rdi, %rax
- shrq $PUD_SHIFT, %rax
- andq $(PTRS_PER_PUD - 1), %rax
- jz ident_complete
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+ /*
+ * Activate SEV/SME memory encryption if supported/enabled. This needs to
+ * be done now, since this also includes setup of the SEV-SNP CPUID table,
+ * which needs to be done before any CPUID instructions are executed in
+ * subsequent code. Pass the boot_params pointer as the first argument.
+ */
+ movq %r15, %rdi
+ call __pi_sme_enable
+#endif
- leaq (level2_spare_pgt - __START_KERNEL_map + _KERNPG_TABLE)(%rbp), %rdx
- leaq level3_ident_pgt(%rip), %rbx
- movq %rdx, 0(%rbx, %rax, 8)
+ /* Sanitize CPU configuration */
+ call verify_cpu
- movq %rdi, %rax
- shrq $PMD_SHIFT, %rax
- andq $(PTRS_PER_PMD - 1), %rax
- leaq __PAGE_KERNEL_IDENT_LARGE_EXEC(%rdi), %rdx
- leaq level2_spare_pgt(%rip), %rbx
- movq %rdx, 0(%rbx, %rax, 8)
-ident_complete:
+ /*
+ * Derive the kernel's physical-to-virtual offset from the physical and
+ * virtual addresses of common_startup_64().
+ */
+ leaq common_startup_64(%rip), %rdi
+ subq .Lcommon_startup_64(%rip), %rdi
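
In C terms, the two instructions above compute the relocation delta: the RIP-relative address of common_startup_64 is its run-time physical location, while .Lcommon_startup_64 holds its link-time virtual address. A minimal sketch of the idea (the helper name is hypothetical, not kernel API):

#include <stdint.h>

/* Hypothetical illustration: runtime_pa comes from the RIP-relative
 * leaq, link_va from the .Lcommon_startup_64 literal. The difference
 * is the physical-to-virtual offset handed to __pi___startup_64().
 */
static inline uint64_t phys_virt_delta(uint64_t runtime_pa, uint64_t link_va)
{
	return runtime_pa - link_va;
}
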
/*
- * Fixup the kernel text+data virtual addresses. Note that
- * we might write invalid pmds, when the kernel is relocated
- * cleanup_highmap() fixes this up along with the mappings
- * beyond _end.
+ * Perform pagetable fixups. Additionally, if SME is active, encrypt
+ * the kernel and retrieve the modifier (SME encryption mask if SME
+ * is active) to be added to the initial pgdir entry that will be
+ * programmed into CR3.
*/
+ movq %r15, %rsi
+ call __pi___startup_64
+
+ /* Form the CR3 value being sure to include the CR3 modifier */
+ leaq early_top_pgt(%rip), %rcx
+ addq %rcx, %rax
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+ mov %rax, %rdi
- leaq level2_kernel_pgt(%rip), %rdi
- leaq 4096(%rdi), %r8
- /* See if it is a valid page table entry */
-1: testq $1, 0(%rdi)
- jz 2f
- addq %rbp, 0(%rdi)
- /* Go to the next page */
-2: addq $8, %rdi
- cmp %r8, %rdi
- jne 1b
-
- /* Fixup phys_base */
- addq %rbp, phys_base(%rip)
-
-#ifdef CONFIG_X86_TRAMPOLINE
- addq %rbp, trampoline_level4_pgt + 0(%rip)
- addq %rbp, trampoline_level4_pgt + (511*8)(%rip)
+ /*
+ * For SEV guests: Verify that the C-bit is correct. A malicious
+ * hypervisor could lie about the C-bit position to perform a ROP
+ * attack on the guest by writing to the unencrypted stack and wait for
+ * the next RET instruction.
+ */
+ call sev_verify_cbit
#endif
- /* Due to ENTRY(), sometimes the empty space gets filled with
- * zeros. Better take a jmp than relying on empty space being
- * filled with 0x90 (nop)
+ /*
+ * Switch to early_top_pgt which still has the identity mappings
+ * present.
*/
- jmp secondary_startup_64
-ENTRY(secondary_startup_64)
+ movq %rax, %cr3
+
+ /* Branch to the common startup code at its kernel virtual address */
+ ANNOTATE_RETPOLINE_SAFE
+ jmp *.Lcommon_startup_64(%rip)
+SYM_CODE_END(startup_64)
+
+ __INITRODATA
+SYM_DATA_LOCAL(.Lcommon_startup_64, .quad common_startup_64)
+
+ .text
+SYM_CODE_START(secondary_startup_64)
+ UNWIND_HINT_END_OF_STACK
+ ANNOTATE_NOENDBR
/*
- * At this point the CPU runs in 64bit mode CS.L = 1 CS.D = 1,
+	 * At this point the CPU runs in 64bit mode (CS.L = 1, CS.D = 0),
* and someone has loaded a mapped page table.
*
- * %esi holds a physical pointer to real_mode_data.
- *
* We come here either from startup_64 (using physical addresses)
* or from trampoline.S (using virtual addresses).
*
@@ -161,56 +159,194 @@ ENTRY(secondary_startup_64)
* after the boot processor executes this code.
*/
- /* Enable PAE mode and PGE */
- movl $(X86_CR4_PAE | X86_CR4_PGE), %eax
- movq %rax, %cr4
+ /* Sanitize CPU configuration */
+ call verify_cpu
+
+ /*
+ * The secondary_startup_64_no_verify entry point is only used by
+ * SEV-ES guests. In those guests the call to verify_cpu() would cause
+	 * #VC exceptions which cannot be handled at this stage of secondary
+	 * CPU bringup.
+	 *
+	 * All non-SEV-ES systems, especially Intel systems, need to execute
+ * verify_cpu() above to make sure NX is enabled.
+ */
+SYM_INNER_LABEL(secondary_startup_64_no_verify, SYM_L_GLOBAL)
+ UNWIND_HINT_END_OF_STACK
+ ANNOTATE_NOENDBR
+
+ /* Clear %R15 which holds the boot_params pointer on the boot CPU */
+ xorl %r15d, %r15d
+
+ /* Derive the runtime physical address of init_top_pgt[] */
+ movq phys_base(%rip), %rax
+ addq $(init_top_pgt - __START_KERNEL_map), %rax
- /* Setup early boot stage 4 level pagetables. */
- movq $(init_level4_pgt - __START_KERNEL_map), %rax
- addq phys_base(%rip), %rax
+ /*
+ * Retrieve the modifier (SME encryption mask if SME is active) to be
+ * added to the initial pgdir entry that will be programmed into CR3.
+ */
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+ addq sme_me_mask(%rip), %rax
+#endif
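
Both boot paths compose CR3 the same way: the physical address of the top-level page table plus the SME mask when memory encryption is active. A hedged sketch (symbol names from the code above; the helper itself is illustrative):

#include <stdint.h>

/* Illustrative CR3 composition:
 *   pgt_phys = phys_base + (init_top_pgt - __START_KERNEL_map)
 *   cr3      = pgt_phys + sme_me_mask   (mask is 0 without SME)
 * The addition mirrors the addq above; the C-bit sits well above the
 * page-frame bits, so add and or are equivalent here.
 */
static inline uint64_t compose_cr3(uint64_t pgt_phys, uint64_t sme_me_mask)
{
	return pgt_phys + sme_me_mask;
}
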
+ /*
+ * Switch to the init_top_pgt here, away from the trampoline_pgd and
+ * unmap the identity mapped ranges.
+ */
movq %rax, %cr3
- /* Ensure I am executing from virtual addresses */
- movq $1f, %rax
- jmp *%rax
-1:
+SYM_INNER_LABEL(common_startup_64, SYM_L_LOCAL)
+ UNWIND_HINT_END_OF_STACK
+ ANNOTATE_NOENDBR
- /* Check if nx is implemented */
- movl $0x80000001, %eax
- cpuid
- movl %edx,%edi
+ /*
+ * Create a mask of CR4 bits to preserve. Omit PGE in order to flush
+ * global 1:1 translations from the TLBs.
+ *
+ * From the SDM:
+ * "If CR4.PGE is changing from 0 to 1, there were no global TLB
+ * entries before the execution; if CR4.PGE is changing from 1 to 0,
+ * there will be no global TLB entries after the execution."
+ */
+ movl $(X86_CR4_PAE | X86_CR4_LA57), %edx
+#ifdef CONFIG_X86_MCE
+ /*
+ * Preserve CR4.MCE if the kernel will enable #MC support.
+ * Clearing MCE may fault in some environments (that also force #MC
+ * support). Any machine check that occurs before #MC support is fully
+ * configured will crash the system regardless of the CR4.MCE value set
+ * here.
+ */
+ orl $X86_CR4_MCE, %edx
+#endif
+ movq %cr4, %rcx
+ andl %edx, %ecx
- /* Setup EFER (Extended Feature Enable Register) */
- movl $MSR_EFER, %ecx
+ /* Even if ignored in long mode, set PSE uniformly on all logical CPUs. */
+ btsl $X86_CR4_PSE_BIT, %ecx
+ movq %rcx, %cr4
+
+ /*
+ * Set CR4.PGE to re-enable global translations.
+ */
+ btsl $X86_CR4_PGE_BIT, %ecx
+ movq %rcx, %cr4
+
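
The double CR4 write is the actual TLB flush: dropping PGE invalidates all global 1:1 translations, and setting it again re-enables global pages, exactly as the SDM quote above states. A sketch of the sequence (write_cr4() stands in for the movq instructions):

#define X86_CR4_PGE	(1UL << 7)	/* Page Global Enable */

extern void write_cr4(unsigned long val);	/* stand-in for movq %rcx, %cr4 */

static void toggle_pge_to_flush_tlb(unsigned long cr4)
{
	write_cr4(cr4 & ~X86_CR4_PGE);	/* 1 -> 0: global entries dropped */
	write_cr4(cr4 |  X86_CR4_PGE);	/* 0 -> 1: global pages back on  */
}
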
+#ifdef CONFIG_SMP
+ /*
+ * For parallel boot, the APIC ID is read from the APIC, and then
+ * used to look up the CPU number. For booting a single CPU, the
+ * CPU number is encoded in smpboot_control.
+ *
+ * Bit 31 STARTUP_READ_APICID (Read APICID from APIC)
+ * Bit 0-23 CPU# if STARTUP_xx flags are not set
+ */
+ movl smpboot_control(%rip), %ecx
+ testl $STARTUP_READ_APICID, %ecx
+ jnz .Lread_apicid
+ /*
+ * No control bit set, single CPU bringup. CPU number is provided
+ * in bit 0-23. This is also the boot CPU case (CPU number 0).
+ */
+ andl $(~STARTUP_PARALLEL_MASK), %ecx
+ jmp .Lsetup_cpu
+
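
The smpboot_control word is decoded exactly as the comment above lays out: bit 31 requests an APIC ID read, otherwise the low 24 bits carry a literal CPU number. A small C rendering (the exact STARTUP_PARALLEL_MASK value is an assumption here, taken as the top control bits):

#include <stdint.h>
#include <stdbool.h>

#define STARTUP_READ_APICID	(1u << 31)
#define STARTUP_PARALLEL_MASK	0xff000000u	/* assumption: top 8 control bits */

static inline bool startup_reads_apicid(uint32_t ctrl)
{
	return ctrl & STARTUP_READ_APICID;
}

static inline uint32_t startup_cpu_number(uint32_t ctrl)
{
	return ctrl & ~STARTUP_PARALLEL_MASK;	/* bits 0-23: CPU# */
}
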
+.Lread_apicid:
+ /* Check whether X2APIC mode is already enabled */
+ mov $MSR_IA32_APICBASE, %ecx
rdmsr
- btsl $_EFER_SCE, %eax /* Enable System Call */
- btl $20,%edi /* No Execute supported? */
- jnc 1f
- btsl $_EFER_NX, %eax
-1: wrmsr /* Make changes effective */
+ testl $X2APIC_ENABLE, %eax
+ jnz .Lread_apicid_msr
- /* Setup cr0 */
-#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
- X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
- X86_CR0_PG)
- movl $CR0_STATE, %eax
- /* Make changes effective */
- movq %rax, %cr0
+#ifdef CONFIG_X86_X2APIC
+ /*
+ * If system is in X2APIC mode then MMIO base might not be
+ * mapped causing the MMIO read below to fault. Faults can't
+ * be handled at that point.
+ */
+ cmpl $0, x2apic_mode(%rip)
+ jz .Lread_apicid_mmio
- /* Setup a boot time stack */
- movq stack_start(%rip),%rsp
+ /* Force the AP into X2APIC mode. */
+ orl $X2APIC_ENABLE, %eax
+ wrmsr
+ jmp .Lread_apicid_msr
+#endif
- /* zero EFLAGS after setting rsp */
- pushq $0
- popfq
+.Lread_apicid_mmio:
+ /* Read the APIC ID from the fix-mapped MMIO space. */
+ movq apic_mmio_base(%rip), %rcx
+ addq $APIC_ID, %rcx
+ movl (%rcx), %eax
+ shr $24, %eax
+ jmp .Llookup_AP
+
+.Lread_apicid_msr:
+ mov $APIC_X2APIC_ID_MSR, %ecx
+ rdmsr
+
+.Llookup_AP:
+ /* EAX contains the APIC ID of the current CPU */
+ xorl %ecx, %ecx
+ leaq cpuid_to_apicid(%rip), %rbx
+
+.Lfind_cpunr:
+ cmpl (%rbx,%rcx,4), %eax
+ jz .Lsetup_cpu
+ inc %ecx
+#ifdef CONFIG_FORCE_NR_CPUS
+ cmpl $NR_CPUS, %ecx
+#else
+ cmpl nr_cpu_ids(%rip), %ecx
+#endif
+ jb .Lfind_cpunr
+
+ /* APIC ID not found in the table. Drop the trampoline lock and bail. */
+ movq trampoline_lock(%rip), %rax
+ movl $0, (%rax)
+
+1: cli
+ hlt
+ jmp 1b
+
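
The .Lfind_cpunr loop is a plain linear search of cpuid_to_apicid[]. A hedged C equivalent (the -1 return corresponds to the halt path above, which first drops the trampoline lock):

#include <stdint.h>

static int lookup_cpu_number(const uint32_t *cpuid_to_apicid,
			     uint32_t nr_cpu_ids, uint32_t apicid)
{
	for (uint32_t cpu = 0; cpu < nr_cpu_ids; cpu++) {
		if (cpuid_to_apicid[cpu] == apicid)
			return (int)cpu;	/* .Lsetup_cpu */
	}
	return -1;	/* not found: release trampoline lock, cli; hlt */
}
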
+.Lsetup_cpu:
+ /* Get the per cpu offset for the given CPU# which is in ECX */
+ movq __per_cpu_offset(,%rcx,8), %rdx
+#else
+ xorl %edx, %edx /* zero-extended to clear all of RDX */
+#endif /* CONFIG_SMP */
+
+ /*
+ * Setup a boot time stack - Any secondary CPU will have lost its stack
+ * by now because the cr3-switch above unmaps the real-mode stack.
+ *
+ * RDX contains the per-cpu offset
+ */
+ movq current_task(%rdx), %rax
+ movq TASK_threadsp(%rax), %rsp
/*
+ * Now that this CPU is running on its own stack, drop the realmode
+ * protection. For the boot CPU the pointer is NULL!
+ */
+ movq trampoline_lock(%rip), %rax
+ testq %rax, %rax
+ jz .Lsetup_gdt
+ movl $0, (%rax)
+
+.Lsetup_gdt:
+ /*
* We must switch to a new descriptor in kernel space for the GDT
* because soon the kernel won't have access anymore to the userspace
* addresses where we're currently running on. We have to do that here
* because in 32bit we couldn't load a 64bit linear address.
*/
- lgdt early_gdt_descr(%rip)
+ subq $16, %rsp
+ movw $(GDT_SIZE-1), (%rsp)
+ leaq gdt_page(%rdx), %rax
+ movq %rax, 2(%rsp)
+ lgdt (%rsp)
+ addq $16, %rsp
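
The six instructions above assemble a 10-byte GDT descriptor in a 16-byte stack slot: a 16-bit limit at (%rsp) and the 64-bit per-CPU gdt_page address at 2(%rsp), then hand it to lgdt. The equivalent layout as a packed C struct (illustrative, mirroring the kernel's struct desc_ptr):

#include <stdint.h>

/* Layout built on the stack above:
 *   size    = GDT_SIZE - 1          -> movw to (%rsp)
 *   address = &gdt_page (per-CPU)   -> movq to 2(%rsp)
 */
struct gdt_descr {
	uint16_t size;
	uint64_t address;
} __attribute__((packed));
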
/* set up data segments */
xorl %eax,%eax
@@ -226,200 +362,360 @@ ENTRY(secondary_startup_64)
movl %eax,%fs
movl %eax,%gs
- /* Set up %gs.
- *
- * The base of %gs always points to the bottom of the irqstack
- * union. If the stack protector canary is enabled, it is
- * located at %gs:40. Note that, on SMP, the boot cpu uses
- * init data section till per cpu areas are set up.
+ /*
+ * Set up GSBASE.
+	 * Note that, on SMP, the boot CPU uses the init data section until
+	 * the per-CPU areas are set up.
*/
movl $MSR_GS_BASE,%ecx
- movq initial_gs(%rip),%rax
- movq %rax,%rdx
- shrq $32,%rdx
- wrmsr
-
- /* esi is pointer to real mode structure with interesting info.
- pass it to C */
- movl %esi, %edi
-
- /* Finally jump to run C code and to be on real kernel address
- * Since we are running on identity-mapped space we have to jump
- * to the full 64bit address, this is only possible as indirect
- * jump. In addition we need to ensure %cs is set so we make this
- * a far return.
+ movl %edx, %eax
+ shrq $32, %rdx
+ wrmsr
+
+ /* Setup and Load IDT */
+ call early_setup_idt
+
+ /* Check if nx is implemented */
+ movl $0x80000001, %eax
+ cpuid
+ movl %edx,%edi
+
+ /* Setup EFER (Extended Feature Enable Register) */
+ movl $MSR_EFER, %ecx
+ rdmsr
+ /*
+ * Preserve current value of EFER for comparison and to skip
+ * EFER writes if no change was made (for TDX guest)
*/
- movq initial_code(%rip),%rax
- pushq $0 # fake return address to stop unwinder
- pushq $__KERNEL_CS # set correct cs
- pushq %rax # target address in negative space
- lretq
+ movl %eax, %edx
+ btsl $_EFER_SCE, %eax /* Enable System Call */
+ btl $20,%edi /* No Execute supported? */
+ jnc 1f
+ btsl $_EFER_NX, %eax
+ btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
+
+ /* Avoid writing EFER if no change was made (for TDX guest) */
+1: cmpl %edx, %eax
+ je 1f
+ xor %edx, %edx
+ wrmsr /* Make changes effective */
+1:
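
The EFER sequence is a read-modify-write that skips the wrmsr entirely when nothing changed, since a TDX guest would trap on the unnecessary MSR write. A hedged C sketch of the same control flow (rdmsr()/wrmsr() are stand-ins for the instructions):

#include <stdint.h>
#include <stdbool.h>

#define _EFER_SCE	0	/* SYSCALL enable    */
#define _EFER_NX	11	/* No-Execute enable */

extern uint64_t rdmsr(uint32_t msr);		/* stand-in */
extern void wrmsr(uint32_t msr, uint64_t val);	/* stand-in */

static void setup_efer(uint32_t msr_efer, bool nx_supported)
{
	uint64_t efer = rdmsr(msr_efer);
	uint64_t old  = efer;

	efer |= 1ull << _EFER_SCE;
	if (nx_supported)
		efer |= 1ull << _EFER_NX;

	if (efer != old)	/* avoid the write on TDX guests */
		wrmsr(msr_efer, efer);
}
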
+ /* Setup cr0 */
+ movl $CR0_STATE, %eax
+ /* Make changes effective */
+ movq %rax, %cr0
+
+ /* zero EFLAGS after setting rsp */
+ pushq $0
+ popfq
- /* SMP bootup changes these two */
+ /* Pass the boot_params pointer as first argument */
+ movq %r15, %rdi
+
+.Ljump_to_C_code:
+ xorl %ebp, %ebp # clear frame pointer
+ ANNOTATE_RETPOLINE_SAFE
+ callq *initial_code(%rip)
+ ud2
+SYM_CODE_END(secondary_startup_64)
+
+#include "verify_cpu.S"
+#include "sev_verify_cbit.S"
+
+#if defined(CONFIG_HOTPLUG_CPU) && defined(CONFIG_AMD_MEM_ENCRYPT)
+/*
+ * Entry point for soft restart of a CPU. Invoked from xxx_play_dead() for
+ * restarting the boot CPU or for restarting SEV guest CPUs after CPU hot
+ * unplug. Everything is set up already except the stack.
+ */
+SYM_CODE_START(soft_restart_cpu)
+ ANNOTATE_NOENDBR
+ UNWIND_HINT_END_OF_STACK
+
+ /* Find the idle task stack */
+ movq PER_CPU_VAR(current_task), %rcx
+ movq TASK_threadsp(%rcx), %rsp
+
+ jmp .Ljump_to_C_code
+SYM_CODE_END(soft_restart_cpu)
+#endif
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+/*
+ * VC Exception handler used during early boot when running on kernel
+ * addresses, but before the switch to the idt_table can be made.
+ * The early_idt_handler_array can't be used here because it calls into a lot
+ * of __init code and this handler is also used during CPU offlining/onlining.
+ * Therefore this handler ends up in the .text section so that it stays around
+ * when .init.text is freed.
+ */
+SYM_CODE_START_NOALIGN(vc_boot_ghcb)
+ UNWIND_HINT_IRET_REGS offset=8
+ ENDBR
+
+ /* Build pt_regs */
+ PUSH_AND_CLEAR_REGS
+
+ /* Call C handler */
+ movq %rsp, %rdi
+ movq ORIG_RAX(%rsp), %rsi
+ movq initial_vc_handler(%rip), %rax
+ ANNOTATE_RETPOLINE_SAFE
+ call *%rax
+
+ /* Unwind pt_regs */
+ POP_REGS
+
+ /* Remove Error Code */
+ addq $8, %rsp
+
+ iretq
+SYM_CODE_END(vc_boot_ghcb)
+#endif
+
+ /* Both SMP bootup and ACPI suspend change these variables */
__REFDATA
- .align 8
- ENTRY(initial_code)
- .quad x86_64_start_kernel
- ENTRY(initial_gs)
- .quad INIT_PER_CPU_VAR(irq_stack_union)
-
- ENTRY(stack_start)
- .quad init_thread_union+THREAD_SIZE-8
- .word 0
- __FINITDATA
+ .balign 8
+SYM_DATA(initial_code, .quad x86_64_start_kernel)
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+SYM_DATA(initial_vc_handler, .quad handle_vc_boot_ghcb)
+#endif
-bad_address:
- jmp bad_address
+SYM_DATA(trampoline_lock, .quad 0);
+ __FINITDATA
- .section ".init.text","ax"
-#ifdef CONFIG_EARLY_PRINTK
- .globl early_idt_handlers
-early_idt_handlers:
+ __INIT
+SYM_CODE_START(early_idt_handler_array)
i = 0
.rept NUM_EXCEPTION_VECTORS
- movl $i, %esi
- jmp early_idt_handler
+ .if ((EXCEPTION_ERRCODE_MASK >> i) & 1) == 0
+ UNWIND_HINT_IRET_REGS
+ ENDBR
+ pushq $0 # Dummy error code, to make stack frame uniform
+ .else
+ UNWIND_HINT_IRET_REGS offset=8
+ ENDBR
+ .endif
+ pushq $i # 72(%rsp) Vector number
+ jmp early_idt_handler_common
+ UNWIND_HINT_IRET_REGS
i = i + 1
+ .fill early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc
.endr
-#endif
+SYM_CODE_END(early_idt_handler_array)
+ ANNOTATE_NOENDBR // early_idt_handler_array[NUM_EXCEPTION_VECTORS]
+
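
Each stub in the array is padded with 0xcc to a fixed EARLY_IDT_HANDLER_SIZE so IDT setup can compute stub addresses by simple multiplication, and vectors without a hardware error code push a dummy zero to keep the frame uniform. What one stub does, in illustrative pseudo-C (push()/jump() stand in for pushq/jmp):

extern const unsigned long EXCEPTION_ERRCODE_MASK;	/* stand-in */
extern void push(unsigned long val), jump(void *target);
extern void early_idt_handler_common(void);

static void early_idt_stub(unsigned int vector)
{
	if (!((EXCEPTION_ERRCODE_MASK >> vector) & 1))
		push(0);	/* dummy error code: uniform stack frame */
	push(vector);		/* consumed by early_idt_handler_common */
	jump(early_idt_handler_common);
}
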
+SYM_CODE_START_LOCAL(early_idt_handler_common)
+ UNWIND_HINT_IRET_REGS offset=16
+ /*
+ * The stack is the hardware frame, an error code or zero, and the
+ * vector number.
+ */
+ cld
-ENTRY(early_idt_handler)
-#ifdef CONFIG_EARLY_PRINTK
- cmpl $2,early_recursion_flag(%rip)
- jz 1f
incl early_recursion_flag(%rip)
- GET_CR2_INTO_RCX
- movq %rcx,%r9
- xorl %r8d,%r8d # zero for error code
- movl %esi,%ecx # get vector number
- # Test %ecx against mask of vectors that push error code.
- cmpl $31,%ecx
- ja 0f
- movl $1,%eax
- salq %cl,%rax
- testl $0x27d00,%eax
- je 0f
- popq %r8 # get error code
-0: movq 0(%rsp),%rcx # get ip
- movq 8(%rsp),%rdx # get cs
- xorl %eax,%eax
- leaq early_idt_msg(%rip),%rdi
- call early_printk
- cmpl $2,early_recursion_flag(%rip)
- jz 1f
- call dump_stack
-#ifdef CONFIG_KALLSYMS
- leaq early_idt_ripmsg(%rip),%rdi
- movq 0(%rsp),%rsi # get rip again
- call __print_symbol
+
+ /* The vector number is currently in the pt_regs->di slot. */
+ pushq %rsi /* pt_regs->si */
+ movq 8(%rsp), %rsi /* RSI = vector number */
+ movq %rdi, 8(%rsp) /* pt_regs->di = RDI */
+ pushq %rdx /* pt_regs->dx */
+ pushq %rcx /* pt_regs->cx */
+ pushq %rax /* pt_regs->ax */
+ pushq %r8 /* pt_regs->r8 */
+ pushq %r9 /* pt_regs->r9 */
+ pushq %r10 /* pt_regs->r10 */
+ pushq %r11 /* pt_regs->r11 */
+ pushq %rbx /* pt_regs->bx */
+ pushq %rbp /* pt_regs->bp */
+ pushq %r12 /* pt_regs->r12 */
+ pushq %r13 /* pt_regs->r13 */
+ pushq %r14 /* pt_regs->r14 */
+ pushq %r15 /* pt_regs->r15 */
+ UNWIND_HINT_REGS
+
+ movq %rsp,%rdi /* RDI = pt_regs; RSI is already trapnr */
+ call do_early_exception
+
+ decl early_recursion_flag(%rip)
+ jmp restore_regs_and_return_to_kernel
+SYM_CODE_END(early_idt_handler_common)
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+/*
+ * VC Exception handler used during very early boot. The
+ * early_idt_handler_array can't be used because it returns via the
+ * paravirtualized INTERRUPT_RETURN and pv-ops don't work that early.
+ *
+ * XXX it does, fix this.
+ *
+ * This handler will end up in the .init.text section and not be
+ * available to boot secondary CPUs.
+ */
+SYM_CODE_START_NOALIGN(vc_no_ghcb)
+ UNWIND_HINT_IRET_REGS offset=8
+ ENDBR
+
+ /* Build pt_regs */
+ PUSH_AND_CLEAR_REGS
+
+ /* Call C handler */
+ movq %rsp, %rdi
+ movq ORIG_RAX(%rsp), %rsi
+ call __pi_do_vc_no_ghcb
+
+ /* Unwind pt_regs */
+ POP_REGS
+
+ /* Remove Error Code */
+ addq $8, %rsp
+
+ /* Pure iret required here - don't use INTERRUPT_RETURN */
+ iretq
+SYM_CODE_END(vc_no_ghcb)
+SYM_PIC_ALIAS(vc_no_ghcb);
#endif
-#endif /* EARLY_PRINTK */
-1: hlt
- jmp 1b
-
-#ifdef CONFIG_EARLY_PRINTK
-early_recursion_flag:
- .long 0
-
-early_idt_msg:
- .asciz "PANIC: early exception %02lx rip %lx:%lx error %lx cr2 %lx\n"
-early_idt_ripmsg:
- .asciz "RIP %s\n"
-#endif /* CONFIG_EARLY_PRINTK */
- .previous
-
-#define NEXT_PAGE(name) \
- .balign PAGE_SIZE; \
-ENTRY(name)
-
-/* Automate the creation of 1 to 1 mapping pmd entries */
-#define PMDS(START, PERM, COUNT) \
- i = 0 ; \
- .rept (COUNT) ; \
- .quad (START) + (i << PMD_SHIFT) + (PERM) ; \
- i = i + 1 ; \
- .endr
- .data
- /*
- * This default setting generates an ident mapping at address 0x100000
- * and a mapping for the kernel that precisely maps virtual address
- * 0xffffffff80000000 to physical address 0x000000. (always using
- * 2Mbyte large pages provided by PAE mode)
- */
-NEXT_PAGE(init_level4_pgt)
- .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_level4_pgt + L4_PAGE_OFFSET*8, 0
- .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .org init_level4_pgt + L4_START_KERNEL*8, 0
- /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
- .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
+#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
+/*
+ * Each PGD needs to be 8k long and 8k aligned. We do not
+ * ever go out to userspace with these, so we do not
+ * strictly *need* the second page, but this allows us to
+ * have a single set_pgd() implementation that does not
+ * need to worry about whether it has 4k or 8k to work
+ * with.
+ *
+ * This ensures PGDs are 8k long:
+ */
+#define PTI_USER_PGD_FILL 512
+/* This ensures they are 8k-aligned: */
+#define SYM_DATA_START_PTI_ALIGNED(name) \
+ SYM_START(name, SYM_L_GLOBAL, .balign 2 * PAGE_SIZE)
+#else
+#define SYM_DATA_START_PTI_ALIGNED(name) \
+ SYM_DATA_START_PAGE_ALIGNED(name)
+#define PTI_USER_PGD_FILL 0
+#endif
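
With page-table isolation each top-level PGD becomes an 8 KiB pair, kernel half first and user half one page later, which is what PTI_USER_PGD_FILL and the 2 * PAGE_SIZE alignment above arrange. A sketch of the resulting layout:

#include <stdint.h>

#define PTRS_PER_PGD	512

/* Illustration only: two 4 KiB tables, 8 KiB aligned, so the user half
 * is reachable from the kernel half by flipping one address bit.
 */
struct pti_pgd_pair {
	uint64_t kernel_half[PTRS_PER_PGD];	/* offset 0x0000 */
	uint64_t user_half[PTRS_PER_PGD];	/* the PTI_USER_PGD_FILL page */
} __attribute__((aligned(2 * 4096)));
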
+
+ __INITDATA
+ .balign 4
-NEXT_PAGE(level3_ident_pgt)
- .quad level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
+SYM_DATA_START_PTI_ALIGNED(early_top_pgt)
.fill 511,8,0
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+ .fill PTI_USER_PGD_FILL,8,0
+SYM_DATA_END(early_top_pgt)
+SYM_PIC_ALIAS(early_top_pgt)
-NEXT_PAGE(level3_kernel_pgt)
- .fill L3_START_KERNEL,8,0
- /* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
- .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
+SYM_DATA_START_PAGE_ALIGNED(early_dynamic_pgts)
+ .fill 512*EARLY_DYNAMIC_PAGE_TABLES,8,0
+SYM_DATA_END(early_dynamic_pgts)
+SYM_PIC_ALIAS(early_dynamic_pgts);
-NEXT_PAGE(level2_fixmap_pgt)
- .fill 506,8,0
- .quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
- /* 8MB reserved for vsyscalls + a 2MB hole = 4 + 1 entries */
- .fill 5,8,0
+SYM_DATA(early_recursion_flag, .long 0)
-NEXT_PAGE(level1_fixmap_pgt)
- .fill 512,8,0
+ .data
-NEXT_PAGE(level2_ident_pgt)
- /* Since I easily can, map the first 1G.
+#if defined(CONFIG_XEN_PV) || defined(CONFIG_PVH)
+SYM_DATA_START_PTI_ALIGNED(init_top_pgt)
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+ .org init_top_pgt + L4_PAGE_OFFSET*8, 0
+ .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+ .org init_top_pgt + L4_START_KERNEL*8, 0
+ /* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+ .fill PTI_USER_PGD_FILL,8,0
+SYM_DATA_END(init_top_pgt)
+
+SYM_DATA_START_PAGE_ALIGNED(level3_ident_pgt)
+ .quad level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+ .fill 511, 8, 0
+SYM_DATA_END(level3_ident_pgt)
+SYM_DATA_START_PAGE_ALIGNED(level2_ident_pgt)
+ /*
+ * Since I easily can, map the first 1G.
* Don't set NX because code runs from these pages.
+ *
+	 * Note: This sets _PAGE_GLOBAL regardless of whether
+	 * the CPU supports it or whether it is enabled. But
+	 * the CPU should ignore the bit.
*/
PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
+SYM_DATA_END(level2_ident_pgt)
+#else
+SYM_DATA_START_PTI_ALIGNED(init_top_pgt)
+ .fill 512,8,0
+ .fill PTI_USER_PGD_FILL,8,0
+SYM_DATA_END(init_top_pgt)
+#endif
+
+SYM_DATA_START_PAGE_ALIGNED(level4_kernel_pgt)
+ .fill 511,8,0
+ .quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+SYM_DATA_END(level4_kernel_pgt)
+SYM_PIC_ALIAS(level4_kernel_pgt)
+
+SYM_DATA_START_PAGE_ALIGNED(level3_kernel_pgt)
+ .fill L3_START_KERNEL,8,0
+ /* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
+ .quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
+ .quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+SYM_DATA_END(level3_kernel_pgt)
+SYM_PIC_ALIAS(level3_kernel_pgt)
-NEXT_PAGE(level2_kernel_pgt)
+SYM_DATA_START_PAGE_ALIGNED(level2_kernel_pgt)
/*
- * 512 MB kernel mapping. We spend a full page on this pagetable
- * anyway.
+ * Kernel high mapping.
+ *
+ * The kernel code+data+bss must be located below KERNEL_IMAGE_SIZE in
+ * virtual address space, which is 1 GiB if RANDOMIZE_BASE is enabled,
+ * 512 MiB otherwise.
*
- * The kernel code+data+bss must not be bigger than that.
+ * (NOTE: after that starts the module area, see MODULES_VADDR.)
*
- * (NOTE: at +512MB starts the module area, see MODULES_VADDR.
- * If you want to increase this then increase MODULES_VADDR
- * too.)
+ * This table is eventually used by the kernel during normal runtime.
+ * Care must be taken to clear out undesired bits later, like _PAGE_RW
+ * or _PAGE_GLOBAL in some cases.
*/
- PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
- KERNEL_IMAGE_SIZE/PMD_SIZE)
-
-NEXT_PAGE(level2_spare_pgt)
- .fill 512, 8, 0
+ PMDS(0, __PAGE_KERNEL_LARGE_EXEC, KERNEL_IMAGE_SIZE/PMD_SIZE)
+SYM_DATA_END(level2_kernel_pgt)
+SYM_PIC_ALIAS(level2_kernel_pgt)
+
+SYM_DATA_START_PAGE_ALIGNED(level2_fixmap_pgt)
+ .fill (512 - 4 - FIXMAP_PMD_NUM),8,0
+ pgtno = 0
+ .rept (FIXMAP_PMD_NUM)
+ .quad level1_fixmap_pgt + (pgtno << PAGE_SHIFT) - __START_KERNEL_map \
+ + _PAGE_TABLE_NOENC;
+ pgtno = pgtno + 1
+ .endr
+ /* 6 MB reserved space + a 2MB hole */
+ .fill 4,8,0
+SYM_DATA_END(level2_fixmap_pgt)
+SYM_PIC_ALIAS(level2_fixmap_pgt)
-#undef PMDS
-#undef NEXT_PAGE
+SYM_DATA_START_PAGE_ALIGNED(level1_fixmap_pgt)
+ .rept (FIXMAP_PMD_NUM)
+ .fill 512,8,0
+ .endr
+SYM_DATA_END(level1_fixmap_pgt)
.data
.align 16
- .globl early_gdt_descr
-early_gdt_descr:
- .word GDT_ENTRIES*8-1
-early_gdt_descr_base:
- .quad INIT_PER_CPU_VAR(gdt_page)
-
-ENTRY(phys_base)
- /* This must match the first entry in level2_kernel_pgt */
- .quad 0x0000000000000000
-
-#include "../../x86/xen/xen-head.S"
-
- .section .bss, "aw", @nobits
- .align L1_CACHE_BYTES
-ENTRY(idt_table)
- .skip IDT_ENTRIES * 16
+
+SYM_DATA(smpboot_control, .long 0)
+
+ .align 16
+/* This must match the first entry in level2_kernel_pgt */
+SYM_DATA(phys_base, .quad 0x0)
+SYM_PIC_ALIAS(phys_base);
+EXPORT_SYMBOL(phys_base)
+
+#include "../xen/xen-head.S"
__PAGE_ALIGNED_BSS
- .align PAGE_SIZE
-ENTRY(empty_zero_page)
+SYM_DATA_START_PAGE_ALIGNED(empty_zero_page)
.skip PAGE_SIZE
+SYM_DATA_END(empty_zero_page)
+EXPORT_SYMBOL(empty_zero_page)
+
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index ba390d731175..d6387dde3ff9 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -1,56 +1,80 @@
-#include <linux/clocksource.h>
+// SPDX-License-Identifier: GPL-2.0-only
#include <linux/clockchips.h>
#include <linux/interrupt.h>
-#include <linux/sysdev.h>
+#include <linux/export.h>
#include <linux/delay.h>
-#include <linux/errno.h>
-#include <linux/slab.h>
#include <linux/hpet.h>
-#include <linux/init.h>
#include <linux/cpu.h>
-#include <linux/pm.h>
-#include <linux/io.h>
+#include <linux/irq.h>
-#include <asm/fixmap.h>
-#include <asm/i8253.h>
+#include <asm/cpuid/api.h>
+#include <asm/irq_remapping.h>
#include <asm/hpet.h>
+#include <asm/time.h>
+#include <asm/mwait.h>
+#include <asm/msr.h>
+
+#undef pr_fmt
+#define pr_fmt(fmt) "hpet: " fmt
+
+enum hpet_mode {
+ HPET_MODE_UNUSED,
+ HPET_MODE_LEGACY,
+ HPET_MODE_CLOCKEVT,
+ HPET_MODE_DEVICE,
+};
-#define HPET_MASK CLOCKSOURCE_MASK(32)
-#define HPET_SHIFT 22
+struct hpet_channel {
+ struct clock_event_device evt;
+ unsigned int num;
+ unsigned int cpu;
+ unsigned int irq;
+ unsigned int in_use;
+ enum hpet_mode mode;
+ unsigned int boot_cfg;
+ char name[10];
+};
-/* FSEC = 10^-15
- NSEC = 10^-9 */
-#define FSEC_PER_NSEC 1000000L
+struct hpet_base {
+ unsigned int nr_channels;
+ unsigned int nr_clockevents;
+ unsigned int boot_cfg;
+ struct hpet_channel *channels;
+};
-#define HPET_DEV_USED_BIT 2
-#define HPET_DEV_USED (1 << HPET_DEV_USED_BIT)
-#define HPET_DEV_VALID 0x8
-#define HPET_DEV_FSB_CAP 0x1000
-#define HPET_DEV_PERI_CAP 0x2000
+#define HPET_MASK CLOCKSOURCE_MASK(32)
-#define EVT_TO_HPET_DEV(evt) container_of(evt, struct hpet_dev, evt)
+#define HPET_MIN_CYCLES 128
+#define HPET_MIN_PROG_DELTA (HPET_MIN_CYCLES + (HPET_MIN_CYCLES >> 1))
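
For scale, the new minimum works out as follows (the frequency is an assumption; 14.318 MHz is merely a common HPET clock):

/* HPET_MIN_PROG_DELTA = 128 + 128/2 = 192 counter cycles.
 * At an assumed 14.318 MHz HPET clock (~69.8 ns per cycle):
 *   192 * 69.8 ns ~= 13.4 us minimum programming delta,
 * well clear of the comparator write-latency issues described in
 * hpet_clkevt_set_next_event() below.
 */
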
/*
* HPET address is set in acpi/boot.c, when an ACPI entry exists
*/
unsigned long hpet_address;
u8 hpet_blockid; /* OS timer block num */
-u8 hpet_msi_disable;
-u8 hpet_readback_cmp;
+bool hpet_msi_disable;
-#ifdef CONFIG_PCI_MSI
-static unsigned long hpet_num_timers;
+#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_GENERIC_MSI_IRQ)
+static DEFINE_PER_CPU(struct hpet_channel *, cpu_hpet_channel);
+static struct irq_domain *hpet_domain;
#endif
+
static void __iomem *hpet_virt_address;
-struct hpet_dev {
- struct clock_event_device evt;
- unsigned int num;
- int cpu;
- unsigned int irq;
- unsigned int flags;
- char name[10];
-};
+static struct hpet_base hpet_base;
+
+static bool hpet_legacy_int_enabled;
+static unsigned long hpet_freq;
+
+bool boot_hpet_disable;
+bool hpet_force_user;
+static bool hpet_verbose;
+
+static inline
+struct hpet_channel *clockevent_to_channel(struct clock_event_device *evt)
+{
+ return container_of(evt, struct hpet_channel, evt);
+}
inline unsigned int hpet_readl(unsigned int a)
{
@@ -62,16 +86,9 @@ static inline void hpet_writel(unsigned int d, unsigned int a)
writel(d, hpet_virt_address + a);
}
-#ifdef CONFIG_X86_64
-#include <asm/pgtable.h>
-#endif
-
static inline void hpet_set_mapping(void)
{
- hpet_virt_address = ioremap_nocache(hpet_address, HPET_MMAP_SIZE);
-#ifdef CONFIG_X86_64
- __set_fixmap(VSYSCALL_HPET, hpet_address, PAGE_KERNEL_VSYSCALL_NOCACHE);
-#endif
+ hpet_virt_address = ioremap(hpet_address, HPET_MMAP_SIZE);
}
static inline void hpet_clear_mapping(void)
@@ -83,19 +100,20 @@ static inline void hpet_clear_mapping(void)
/*
* HPET command line enable / disable
*/
-static int boot_hpet_disable;
-int hpet_force_user;
-static int hpet_verbose;
-
static int __init hpet_setup(char *str)
{
- if (str) {
+ while (str) {
+ char *next = strchr(str, ',');
+
+ if (next)
+ *next++ = 0;
if (!strncmp("disable", str, 7))
- boot_hpet_disable = 1;
+ boot_hpet_disable = true;
if (!strncmp("force", str, 5))
- hpet_force_user = 1;
+ hpet_force_user = true;
if (!strncmp("verbose", str, 7))
- hpet_verbose = 1;
+ hpet_verbose = true;
+ str = next;
}
return 1;
}
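
With the comma-splitting loop above, several options can now be combined in a single parameter; an illustrative command line:

    hpet=force,verbose	# force-enable the HPET and log its configuration
    hpet=disable	# same effect as nohpet
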
@@ -103,7 +121,7 @@ __setup("hpet=", hpet_setup);
static int __init disable_hpet(char *str)
{
- boot_hpet_disable = 1;
+ boot_hpet_disable = true;
return 1;
}
__setup("nohpet", disable_hpet);
@@ -113,13 +131,8 @@ static inline int is_hpet_capable(void)
return !boot_hpet_disable && hpet_address;
}
-/*
- * HPET timer interrupt enable / disable
- */
-static int hpet_legacy_int_enabled;
-
/**
- * is_hpet_enabled - check whether the hpet timer interrupt is enabled
+ * is_hpet_enabled - Check whether the legacy HPET timer interrupt is enabled
*/
int is_hpet_enabled(void)
{
@@ -129,67 +142,60 @@ EXPORT_SYMBOL_GPL(is_hpet_enabled);
static void _hpet_print_config(const char *function, int line)
{
- u32 i, timers, l, h;
- printk(KERN_INFO "hpet: %s(%d):\n", function, line);
- l = hpet_readl(HPET_ID);
- h = hpet_readl(HPET_PERIOD);
- timers = ((l & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT) + 1;
- printk(KERN_INFO "hpet: ID: 0x%x, PERIOD: 0x%x\n", l, h);
- l = hpet_readl(HPET_CFG);
- h = hpet_readl(HPET_STATUS);
- printk(KERN_INFO "hpet: CFG: 0x%x, STATUS: 0x%x\n", l, h);
+ u32 i, id, period, cfg, status, channels, l, h;
+
+ pr_info("%s(%d):\n", function, line);
+
+ id = hpet_readl(HPET_ID);
+ period = hpet_readl(HPET_PERIOD);
+ pr_info("ID: 0x%x, PERIOD: 0x%x\n", id, period);
+
+ cfg = hpet_readl(HPET_CFG);
+ status = hpet_readl(HPET_STATUS);
+ pr_info("CFG: 0x%x, STATUS: 0x%x\n", cfg, status);
+
l = hpet_readl(HPET_COUNTER);
h = hpet_readl(HPET_COUNTER+4);
- printk(KERN_INFO "hpet: COUNTER_l: 0x%x, COUNTER_h: 0x%x\n", l, h);
+ pr_info("COUNTER_l: 0x%x, COUNTER_h: 0x%x\n", l, h);
+
+ channels = ((id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT) + 1;
- for (i = 0; i < timers; i++) {
+ for (i = 0; i < channels; i++) {
l = hpet_readl(HPET_Tn_CFG(i));
h = hpet_readl(HPET_Tn_CFG(i)+4);
- printk(KERN_INFO "hpet: T%d: CFG_l: 0x%x, CFG_h: 0x%x\n",
- i, l, h);
+ pr_info("T%d: CFG_l: 0x%x, CFG_h: 0x%x\n", i, l, h);
+
l = hpet_readl(HPET_Tn_CMP(i));
h = hpet_readl(HPET_Tn_CMP(i)+4);
- printk(KERN_INFO "hpet: T%d: CMP_l: 0x%x, CMP_h: 0x%x\n",
- i, l, h);
+ pr_info("T%d: CMP_l: 0x%x, CMP_h: 0x%x\n", i, l, h);
+
l = hpet_readl(HPET_Tn_ROUTE(i));
h = hpet_readl(HPET_Tn_ROUTE(i)+4);
- printk(KERN_INFO "hpet: T%d ROUTE_l: 0x%x, ROUTE_h: 0x%x\n",
- i, l, h);
+ pr_info("T%d ROUTE_l: 0x%x, ROUTE_h: 0x%x\n", i, l, h);
}
}
#define hpet_print_config() \
do { \
if (hpet_verbose) \
- _hpet_print_config(__FUNCTION__, __LINE__); \
+ _hpet_print_config(__func__, __LINE__); \
} while (0)
/*
- * When the hpet driver (/dev/hpet) is enabled, we need to reserve
+ * When the HPET driver (/dev/hpet) is enabled, we need to reserve
* timer 0 and timer 1 in case of RTC emulation.
*/
#ifdef CONFIG_HPET
-static void hpet_reserve_msi_timers(struct hpet_data *hd);
-
-static void hpet_reserve_platform_timers(unsigned int id)
+static void __init hpet_reserve_platform_timers(void)
{
- struct hpet __iomem *hpet = hpet_virt_address;
- struct hpet_timer __iomem *timer = &hpet->hpet_timers[2];
- unsigned int nrtimers, i;
struct hpet_data hd;
-
- nrtimers = ((id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT) + 1;
+ unsigned int i;
memset(&hd, 0, sizeof(hd));
hd.hd_phys_address = hpet_address;
- hd.hd_address = hpet;
- hd.hd_nirqs = nrtimers;
- hpet_reserve_timer(&hd, 0);
-
-#ifdef CONFIG_HPET_EMULATE_RTC
- hpet_reserve_timer(&hd, 1);
-#endif
+ hd.hd_address = hpet_virt_address;
+ hd.hd_nirqs = hpet_base.nr_channels;
/*
* NOTE that hd_irq[] reflects IOAPIC input pins (LEGACY_8254
@@ -199,46 +205,52 @@ static void hpet_reserve_platform_timers(unsigned int id)
hd.hd_irq[0] = HPET_LEGACY_8254;
hd.hd_irq[1] = HPET_LEGACY_RTC;
- for (i = 2; i < nrtimers; timer++, i++) {
- hd.hd_irq[i] = (readl(&timer->hpet_config) &
- Tn_INT_ROUTE_CNF_MASK) >> Tn_INT_ROUTE_CNF_SHIFT;
- }
+ for (i = 0; i < hpet_base.nr_channels; i++) {
+ struct hpet_channel *hc = hpet_base.channels + i;
- hpet_reserve_msi_timers(&hd);
+ if (i >= 2)
+ hd.hd_irq[i] = hc->irq;
- hpet_alloc(&hd);
+ switch (hc->mode) {
+ case HPET_MODE_UNUSED:
+ case HPET_MODE_DEVICE:
+ hc->mode = HPET_MODE_DEVICE;
+ break;
+ case HPET_MODE_CLOCKEVT:
+ case HPET_MODE_LEGACY:
+ hpet_reserve_timer(&hd, hc->num);
+ break;
+ }
+ }
+ hpet_alloc(&hd);
}
-#else
-static void hpet_reserve_platform_timers(unsigned int id) { }
-#endif
-/*
- * Common hpet info
- */
-static unsigned long hpet_period;
+static void __init hpet_select_device_channel(void)
+{
+ int i;
-static void hpet_legacy_set_mode(enum clock_event_mode mode,
- struct clock_event_device *evt);
-static int hpet_legacy_next_event(unsigned long delta,
- struct clock_event_device *evt);
+ for (i = 0; i < hpet_base.nr_channels; i++) {
+ struct hpet_channel *hc = hpet_base.channels + i;
-/*
- * The hpet clock event device
- */
-static struct clock_event_device hpet_clockevent = {
- .name = "hpet",
- .features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,
- .set_mode = hpet_legacy_set_mode,
- .set_next_event = hpet_legacy_next_event,
- .shift = 32,
- .irq = 0,
- .rating = 50,
-};
+ /* Associate the first unused channel to /dev/hpet */
+ if (hc->mode == HPET_MODE_UNUSED) {
+ hc->mode = HPET_MODE_DEVICE;
+ return;
+ }
+ }
+}
+#else
+static inline void hpet_reserve_platform_timers(void) { }
+static inline void hpet_select_device_channel(void) {}
+#endif
+
+/* Common HPET functions */
static void hpet_stop_counter(void)
{
- unsigned long cfg = hpet_readl(HPET_CFG);
+ u32 cfg = hpet_readl(HPET_CFG);
+
cfg &= ~HPET_CFG_ENABLE;
hpet_writel(cfg, HPET_CFG);
}
@@ -252,6 +264,7 @@ static void hpet_reset_counter(void)
static void hpet_start_counter(void)
{
unsigned int cfg = hpet_readl(HPET_CFG);
+
cfg |= HPET_CFG_ENABLE;
hpet_writel(cfg, HPET_CFG);
}
@@ -280,505 +293,559 @@ static void hpet_enable_legacy_int(void)
cfg |= HPET_CFG_LEGACY;
hpet_writel(cfg, HPET_CFG);
- hpet_legacy_int_enabled = 1;
+ hpet_legacy_int_enabled = true;
}
-static void hpet_legacy_clockevent_register(void)
+static int hpet_clkevt_set_state_periodic(struct clock_event_device *evt)
{
- /* Start HPET legacy interrupts */
- hpet_enable_legacy_int();
+ unsigned int channel = clockevent_to_channel(evt)->num;
+ unsigned int cfg, cmp, now;
+ uint64_t delta;
+ hpet_stop_counter();
+ delta = ((uint64_t)(NSEC_PER_SEC / HZ)) * evt->mult;
+ delta >>= evt->shift;
+ now = hpet_readl(HPET_COUNTER);
+ cmp = now + (unsigned int)delta;
+ cfg = hpet_readl(HPET_Tn_CFG(channel));
+ cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
+ HPET_TN_32BIT;
+ hpet_writel(cfg, HPET_Tn_CFG(channel));
+ hpet_writel(cmp, HPET_Tn_CMP(channel));
+ udelay(1);
/*
- * The mult factor is defined as (include/linux/clockchips.h)
- * mult/2^shift = cyc/ns (in contrast to ns/cyc in clocksource.h)
- * hpet_period is in units of femtoseconds (per cycle), so
- * mult/2^shift = cyc/ns = 10^6/hpet_period
- * mult = (10^6 * 2^shift)/hpet_period
- * mult = (FSEC_PER_NSEC << hpet_clockevent.shift)/hpet_period
+ * HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL
+ * cleared) to T0_CMP to set the period. The HPET_TN_SETVAL
+ * bit is automatically cleared after the first write.
+ * (See AMD-8111 HyperTransport I/O Hub Data Sheet,
+ * Publication # 24674)
*/
- hpet_clockevent.mult = div_sc((unsigned long) FSEC_PER_NSEC,
- hpet_period, hpet_clockevent.shift);
- /* Calculate the min / max delta */
- hpet_clockevent.max_delta_ns = clockevent_delta2ns(0x7FFFFFFF,
- &hpet_clockevent);
- /* 5 usec minimum reprogramming delta. */
- hpet_clockevent.min_delta_ns = 5000;
+ hpet_writel((unsigned int)delta, HPET_Tn_CMP(channel));
+ hpet_start_counter();
+ hpet_print_config();
- /*
- * Start hpet with the boot cpu mask and make it
- * global after the IO_APIC has been initialized.
- */
- hpet_clockevent.cpumask = cpumask_of(smp_processor_id());
- clockevents_register_device(&hpet_clockevent);
- global_clock_event = &hpet_clockevent;
- printk(KERN_DEBUG "hpet clockevent registered\n");
+ return 0;
}
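
The comparator period is the tick length converted to HPET cycles through the clockevent mult/shift pair, i.e. delta = ((NSEC_PER_SEC / HZ) * mult) >> shift. A self-contained version of that conversion (the numbers in the comment are illustrative):

#include <stdint.h>

/* Example: HZ = 250 gives 4,000,000 ns per tick; mult/shift scale
 * nanoseconds to HPET cycles as the clockevents core configured them.
 */
static inline uint32_t tick_to_hpet_cycles(uint32_t nsec_per_tick,
					   uint32_t mult, uint32_t shift)
{
	return (uint32_t)(((uint64_t)nsec_per_tick * mult) >> shift);
}
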
-static int hpet_setup_msi_irq(unsigned int irq);
+static int hpet_clkevt_set_state_oneshot(struct clock_event_device *evt)
+{
+ unsigned int channel = clockevent_to_channel(evt)->num;
+ unsigned int cfg;
+
+ cfg = hpet_readl(HPET_Tn_CFG(channel));
+ cfg &= ~HPET_TN_PERIODIC;
+ cfg |= HPET_TN_ENABLE | HPET_TN_32BIT;
+ hpet_writel(cfg, HPET_Tn_CFG(channel));
+
+ return 0;
+}
-static void hpet_set_mode(enum clock_event_mode mode,
- struct clock_event_device *evt, int timer)
+static int hpet_clkevt_set_state_shutdown(struct clock_event_device *evt)
{
- unsigned int cfg, cmp, now;
- uint64_t delta;
+ unsigned int channel = clockevent_to_channel(evt)->num;
+ unsigned int cfg;
- switch (mode) {
- case CLOCK_EVT_MODE_PERIODIC:
- hpet_stop_counter();
- delta = ((uint64_t)(NSEC_PER_SEC/HZ)) * evt->mult;
- delta >>= evt->shift;
- now = hpet_readl(HPET_COUNTER);
- cmp = now + (unsigned int) delta;
- cfg = hpet_readl(HPET_Tn_CFG(timer));
- /* Make sure we use edge triggered interrupts */
- cfg &= ~HPET_TN_LEVEL;
- cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC |
- HPET_TN_SETVAL | HPET_TN_32BIT;
- hpet_writel(cfg, HPET_Tn_CFG(timer));
- hpet_writel(cmp, HPET_Tn_CMP(timer));
- udelay(1);
- /*
- * HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL
- * cleared) to T0_CMP to set the period. The HPET_TN_SETVAL
- * bit is automatically cleared after the first write.
- * (See AMD-8111 HyperTransport I/O Hub Data Sheet,
- * Publication # 24674)
- */
- hpet_writel((unsigned int) delta, HPET_Tn_CMP(timer));
- hpet_start_counter();
- hpet_print_config();
- break;
-
- case CLOCK_EVT_MODE_ONESHOT:
- cfg = hpet_readl(HPET_Tn_CFG(timer));
- cfg &= ~HPET_TN_PERIODIC;
- cfg |= HPET_TN_ENABLE | HPET_TN_32BIT;
- hpet_writel(cfg, HPET_Tn_CFG(timer));
- break;
-
- case CLOCK_EVT_MODE_UNUSED:
- case CLOCK_EVT_MODE_SHUTDOWN:
- cfg = hpet_readl(HPET_Tn_CFG(timer));
- cfg &= ~HPET_TN_ENABLE;
- hpet_writel(cfg, HPET_Tn_CFG(timer));
- break;
-
- case CLOCK_EVT_MODE_RESUME:
- if (timer == 0) {
- hpet_enable_legacy_int();
- } else {
- struct hpet_dev *hdev = EVT_TO_HPET_DEV(evt);
- hpet_setup_msi_irq(hdev->irq);
- disable_irq(hdev->irq);
- irq_set_affinity(hdev->irq, cpumask_of(hdev->cpu));
- enable_irq(hdev->irq);
- }
- hpet_print_config();
- break;
- }
+ cfg = hpet_readl(HPET_Tn_CFG(channel));
+ cfg &= ~HPET_TN_ENABLE;
+ hpet_writel(cfg, HPET_Tn_CFG(channel));
+
+ return 0;
+}
+
+static int hpet_clkevt_legacy_resume(struct clock_event_device *evt)
+{
+ hpet_enable_legacy_int();
+ hpet_print_config();
+ return 0;
}
-static int hpet_next_event(unsigned long delta,
- struct clock_event_device *evt, int timer)
+static int
+hpet_clkevt_set_next_event(unsigned long delta, struct clock_event_device *evt)
{
+ unsigned int channel = clockevent_to_channel(evt)->num;
u32 cnt;
+ s32 res;
cnt = hpet_readl(HPET_COUNTER);
cnt += (u32) delta;
- hpet_writel(cnt, HPET_Tn_CMP(timer));
+ hpet_writel(cnt, HPET_Tn_CMP(channel));
/*
- * We need to read back the CMP register on certain HPET
- * implementations (ATI chipsets) which seem to delay the
- * transfer of the compare register into the internal compare
- * logic. With small deltas this might actually be too late as
- * the counter could already be higher than the compare value
- * at that point and we would wait for the next hpet interrupt
- * forever. We found out that reading the CMP register back
- * forces the transfer so we can rely on the comparison with
- * the counter register below.
- *
- * That works fine on those ATI chipsets, but on newer Intel
- * chipsets (ICH9...) this triggers due to an erratum: Reading
- * the comparator immediately following a write is returning
- * the old value.
- *
- * We restrict the read back to the affected ATI chipsets (set
- * by quirks) and also run it with hpet=verbose for debugging
- * purposes.
+ * HPETs are a complete disaster. The compare register is
+	 * based on an equal comparison and provides neither a
+	 * less-than-or-equal function (which would require taking
+	 * the wraparound into account) nor a simple count-down event
+	 * mode. Further, the write to the comparator register is
+ * delayed internally up to two HPET clock cycles in certain
+ * chipsets (ATI, ICH9,10). Some newer AMD chipsets have even
+ * longer delays. We worked around that by reading back the
+ * compare register, but that required another workaround for
+ * ICH9,10 chips where the first readout after write can
+ * return the old stale value. We already had a minimum
+	 * programming delta of 5us enforced, but an NMI or SMI hitting
+ * between the counter readout and the comparator write can
+ * move us behind that point easily. Now instead of reading
+ * the compare register back several times, we make the ETIME
+ * decision based on the following: Return ETIME if the
+ * counter value after the write is less than HPET_MIN_CYCLES
+ * away from the event or if the counter is already ahead of
+ * the event. The minimum programming delta for the generic
+ * clockevents code is set to 1.5 * HPET_MIN_CYCLES.
*/
- if (hpet_readback_cmp || hpet_verbose) {
- u32 cmp = hpet_readl(HPET_Tn_CMP(timer));
+ res = (s32)(cnt - hpet_readl(HPET_COUNTER));
- if (cmp != cnt)
- printk_once(KERN_WARNING
- "hpet: compare register read back failed.\n");
- }
-
- return (s32)(hpet_readl(HPET_COUNTER) - cnt) >= 0 ? -ETIME : 0;
+ return res < HPET_MIN_CYCLES ? -ETIME : 0;
}
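
The signed 32-bit subtraction carries the whole ETIME decision: if an SMI/NMI delayed the CPU long enough that the counter already passed, or nearly reached, the freshly written comparator, the difference falls below HPET_MIN_CYCLES and the core gets -ETIME to retry. A standalone sketch:

#include <stdint.h>

#define HPET_MIN_CYCLES	128

/* cnt: comparator value just written; counter: HPET counter read back.
 * The signed difference handles 32-bit wraparound naturally.
 */
static inline int hpet_event_in_time(uint32_t cnt, uint32_t counter)
{
	int32_t res = (int32_t)(cnt - counter);

	return res < HPET_MIN_CYCLES ? -1 /* -ETIME */ : 0;
}
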
-static void hpet_legacy_set_mode(enum clock_event_mode mode,
- struct clock_event_device *evt)
+static void hpet_init_clockevent(struct hpet_channel *hc, unsigned int rating)
{
- hpet_set_mode(mode, evt, 0);
+ struct clock_event_device *evt = &hc->evt;
+
+ evt->rating = rating;
+ evt->irq = hc->irq;
+ evt->name = hc->name;
+ evt->cpumask = cpumask_of(hc->cpu);
+ evt->set_state_oneshot = hpet_clkevt_set_state_oneshot;
+ evt->set_next_event = hpet_clkevt_set_next_event;
+ evt->set_state_shutdown = hpet_clkevt_set_state_shutdown;
+
+ evt->features = CLOCK_EVT_FEAT_ONESHOT;
+ if (hc->boot_cfg & HPET_TN_PERIODIC) {
+ evt->features |= CLOCK_EVT_FEAT_PERIODIC;
+ evt->set_state_periodic = hpet_clkevt_set_state_periodic;
+ }
}
-static int hpet_legacy_next_event(unsigned long delta,
- struct clock_event_device *evt)
+static void __init hpet_legacy_clockevent_register(struct hpet_channel *hc)
{
- return hpet_next_event(delta, evt, 0);
+ /*
+ * Start HPET with the boot CPU's cpumask and make it global after
+ * the IO_APIC has been initialized.
+ */
+ hc->cpu = boot_cpu_data.cpu_index;
+ strscpy(hc->name, "hpet", sizeof(hc->name));
+ hpet_init_clockevent(hc, 50);
+
+ hc->evt.tick_resume = hpet_clkevt_legacy_resume;
+
+ /*
+ * Legacy horrors and sins from the past. HPET used periodic mode
+ * unconditionally forever on the legacy channel 0. Removing the
+ * below hack and using the conditional in hpet_init_clockevent()
+ * makes at least Qemu and one hardware machine fail to boot.
+ * There are two issues which cause the boot failure:
+ *
+ * #1 After the timer delivery test in IOAPIC and the IOAPIC setup
+ * the next interrupt is not delivered despite the HPET channel
+ * being programmed correctly. Reprogramming the HPET after
+ * switching to IOAPIC makes it work again. After fixing this,
+ * the next issue surfaces:
+ *
+ * #2 Due to the unconditional periodic mode availability the Local
+ * APIC timer calibration can hijack the global clockevents
+ * event handler without causing damage. Using oneshot at this
+	 * stage makes it hang because the HPET does not get
+ * reprogrammed due to the handler hijacking. Duh, stupid me!
+ *
+	 * Both issues require major surgery and especially kicking the HPET
+	 * again after enabling the IOAPIC results in really nasty hackery.
+ * This 'assume periodic works' magic has survived since HPET
+ * support got added, so it's questionable whether this should be
+ * fixed. Both Qemu and the failing hardware machine support
+ * periodic mode despite the fact that both don't advertise it in
+ * the configuration register and both need that extra kick after
+ * switching to IOAPIC. Seems to be a feature...
+ */
+ hc->evt.features |= CLOCK_EVT_FEAT_PERIODIC;
+ hc->evt.set_state_periodic = hpet_clkevt_set_state_periodic;
+
+ /* Start HPET legacy interrupts */
+ hpet_enable_legacy_int();
+
+ clockevents_config_and_register(&hc->evt, hpet_freq,
+ HPET_MIN_PROG_DELTA, 0x7FFFFFFF);
+ global_clock_event = &hc->evt;
+ pr_debug("Clockevent registered\n");
}
/*
* HPET MSI Support
*/
-#ifdef CONFIG_PCI_MSI
-
-static DEFINE_PER_CPU(struct hpet_dev *, cpu_hpet_dev);
-static struct hpet_dev *hpet_devs;
-
-void hpet_msi_unmask(unsigned int irq)
+#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_GENERIC_MSI_IRQ)
+static void hpet_msi_unmask(struct irq_data *data)
{
- struct hpet_dev *hdev = get_irq_data(irq);
+ struct hpet_channel *hc = irq_data_get_irq_handler_data(data);
unsigned int cfg;
- /* unmask it */
- cfg = hpet_readl(HPET_Tn_CFG(hdev->num));
- cfg |= HPET_TN_FSB;
- hpet_writel(cfg, HPET_Tn_CFG(hdev->num));
+ cfg = hpet_readl(HPET_Tn_CFG(hc->num));
+ cfg |= HPET_TN_ENABLE | HPET_TN_FSB;
+ hpet_writel(cfg, HPET_Tn_CFG(hc->num));
}
-void hpet_msi_mask(unsigned int irq)
+static void hpet_msi_mask(struct irq_data *data)
{
+ struct hpet_channel *hc = irq_data_get_irq_handler_data(data);
unsigned int cfg;
- struct hpet_dev *hdev = get_irq_data(irq);
- /* mask it */
- cfg = hpet_readl(HPET_Tn_CFG(hdev->num));
- cfg &= ~HPET_TN_FSB;
- hpet_writel(cfg, HPET_Tn_CFG(hdev->num));
+ cfg = hpet_readl(HPET_Tn_CFG(hc->num));
+ cfg &= ~(HPET_TN_ENABLE | HPET_TN_FSB);
+ hpet_writel(cfg, HPET_Tn_CFG(hc->num));
}
-void hpet_msi_write(unsigned int irq, struct msi_msg *msg)
+static void hpet_msi_write(struct hpet_channel *hc, struct msi_msg *msg)
{
- struct hpet_dev *hdev = get_irq_data(irq);
-
- hpet_writel(msg->data, HPET_Tn_ROUTE(hdev->num));
- hpet_writel(msg->address_lo, HPET_Tn_ROUTE(hdev->num) + 4);
+ hpet_writel(msg->data, HPET_Tn_ROUTE(hc->num));
+ hpet_writel(msg->address_lo, HPET_Tn_ROUTE(hc->num) + 4);
}
-void hpet_msi_read(unsigned int irq, struct msi_msg *msg)
+static void hpet_msi_write_msg(struct irq_data *data, struct msi_msg *msg)
{
- struct hpet_dev *hdev = get_irq_data(irq);
-
- msg->data = hpet_readl(HPET_Tn_ROUTE(hdev->num));
- msg->address_lo = hpet_readl(HPET_Tn_ROUTE(hdev->num) + 4);
- msg->address_hi = 0;
+ hpet_msi_write(irq_data_get_irq_handler_data(data), msg);
}
-static void hpet_msi_set_mode(enum clock_event_mode mode,
- struct clock_event_device *evt)
-{
- struct hpet_dev *hdev = EVT_TO_HPET_DEV(evt);
- hpet_set_mode(mode, evt, hdev->num);
-}
+static struct irq_chip hpet_msi_controller __ro_after_init = {
+ .name = "HPET-MSI",
+ .irq_unmask = hpet_msi_unmask,
+ .irq_mask = hpet_msi_mask,
+ .irq_ack = irq_chip_ack_parent,
+ .irq_set_affinity = msi_domain_set_affinity,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .irq_write_msi_msg = hpet_msi_write_msg,
+ .flags = IRQCHIP_SKIP_SET_WAKE | IRQCHIP_AFFINITY_PRE_STARTUP,
+};
-static int hpet_msi_next_event(unsigned long delta,
- struct clock_event_device *evt)
+static int hpet_msi_init(struct irq_domain *domain,
+ struct msi_domain_info *info, unsigned int virq,
+ irq_hw_number_t hwirq, msi_alloc_info_t *arg)
{
- struct hpet_dev *hdev = EVT_TO_HPET_DEV(evt);
- return hpet_next_event(delta, evt, hdev->num);
+ irq_domain_set_info(domain, virq, arg->hwirq, info->chip, NULL,
+ handle_edge_irq, arg->data, "edge");
+
+ return 0;
}
-static int hpet_setup_msi_irq(unsigned int irq)
+static struct msi_domain_ops hpet_msi_domain_ops = {
+ .msi_init = hpet_msi_init,
+};
+
+static struct msi_domain_info hpet_msi_domain_info = {
+ .ops = &hpet_msi_domain_ops,
+ .chip = &hpet_msi_controller,
+ .flags = MSI_FLAG_USE_DEF_DOM_OPS,
+};
+
+static struct irq_domain *hpet_create_irq_domain(int hpet_id)
{
- if (arch_setup_hpet_msi(irq, hpet_blockid)) {
- destroy_irq(irq);
- return -EINVAL;
+ struct msi_domain_info *domain_info;
+ struct irq_domain *parent, *d;
+ struct fwnode_handle *fn;
+ struct irq_fwspec fwspec;
+
+ if (x86_vector_domain == NULL)
+ return NULL;
+
+ domain_info = kzalloc(sizeof(*domain_info), GFP_KERNEL);
+ if (!domain_info)
+ return NULL;
+
+ *domain_info = hpet_msi_domain_info;
+ domain_info->data = (void *)(long)hpet_id;
+
+ fn = irq_domain_alloc_named_id_fwnode(hpet_msi_controller.name,
+ hpet_id);
+ if (!fn) {
+ kfree(domain_info);
+ return NULL;
}
- return 0;
+
+ fwspec.fwnode = fn;
+ fwspec.param_count = 1;
+ fwspec.param[0] = hpet_id;
+
+ parent = irq_find_matching_fwspec(&fwspec, DOMAIN_BUS_GENERIC_MSI);
+ if (!parent) {
+ irq_domain_free_fwnode(fn);
+ kfree(domain_info);
+ return NULL;
+ }
+ if (parent != x86_vector_domain)
+ hpet_msi_controller.name = "IR-HPET-MSI";
+
+ d = msi_create_irq_domain(fn, domain_info, parent);
+ if (!d) {
+ irq_domain_free_fwnode(fn);
+ kfree(domain_info);
+ }
+ return d;
}
-static int hpet_assign_irq(struct hpet_dev *dev)
+static inline int hpet_dev_id(struct irq_domain *domain)
{
- unsigned int irq;
+ struct msi_domain_info *info = msi_get_domain_info(domain);
+
+ return (int)(long)info->data;
+}
- irq = create_irq();
- if (!irq)
- return -EINVAL;
+static int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
+ int dev_num)
+{
+ struct irq_alloc_info info;
- set_irq_data(irq, dev);
+ init_irq_alloc_info(&info, NULL);
+ info.type = X86_IRQ_ALLOC_TYPE_HPET;
+ info.data = hc;
+ info.devid = hpet_dev_id(domain);
+ info.hwirq = dev_num;
- if (hpet_setup_msi_irq(irq))
- return -EINVAL;
+ return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &info);
+}
- dev->irq = irq;
+static int hpet_clkevt_msi_resume(struct clock_event_device *evt)
+{
+ struct hpet_channel *hc = clockevent_to_channel(evt);
+ struct irq_data *data = irq_get_irq_data(hc->irq);
+ struct msi_msg msg;
+
+ /* Restore the MSI msg and unmask the interrupt */
+ irq_chip_compose_msi_msg(data, &msg);
+ hpet_msi_write(hc, &msg);
+ hpet_msi_unmask(data);
return 0;
}
-static irqreturn_t hpet_interrupt_handler(int irq, void *data)
+static irqreturn_t hpet_msi_interrupt_handler(int irq, void *data)
{
- struct hpet_dev *dev = (struct hpet_dev *)data;
- struct clock_event_device *hevt = &dev->evt;
+ struct hpet_channel *hc = data;
+ struct clock_event_device *evt = &hc->evt;
- if (!hevt->event_handler) {
- printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
- dev->num);
+ if (!evt->event_handler) {
+ pr_info("Spurious interrupt HPET channel %d\n", hc->num);
return IRQ_HANDLED;
}
- hevt->event_handler(hevt);
+ evt->event_handler(evt);
return IRQ_HANDLED;
}
-static int hpet_setup_irq(struct hpet_dev *dev)
+static int hpet_setup_msi_irq(struct hpet_channel *hc)
{
-
- if (request_irq(dev->irq, hpet_interrupt_handler,
- IRQF_TIMER | IRQF_DISABLED | IRQF_NOBALANCING,
- dev->name, dev))
+ if (request_irq(hc->irq, hpet_msi_interrupt_handler,
+ IRQF_TIMER | IRQF_NOBALANCING,
+ hc->name, hc))
return -1;
- disable_irq(dev->irq);
- irq_set_affinity(dev->irq, cpumask_of(dev->cpu));
- enable_irq(dev->irq);
+ disable_irq(hc->irq);
+ irq_set_affinity(hc->irq, cpumask_of(hc->cpu));
+ enable_irq(hc->irq);
- printk(KERN_DEBUG "hpet: %s irq %d for MSI\n",
- dev->name, dev->irq);
+ pr_debug("%s irq %u for MSI\n", hc->name, hc->irq);
return 0;
}
-/* This should be called in specific @cpu */
-static void init_one_hpet_msi_clockevent(struct hpet_dev *hdev, int cpu)
+/* Invoked from the hotplug callback on @cpu */
+static void init_one_hpet_msi_clockevent(struct hpet_channel *hc, int cpu)
{
- struct clock_event_device *evt = &hdev->evt;
- uint64_t hpet_freq;
+ struct clock_event_device *evt = &hc->evt;
- WARN_ON(cpu != smp_processor_id());
- if (!(hdev->flags & HPET_DEV_VALID))
- return;
+ hc->cpu = cpu;
+ per_cpu(cpu_hpet_channel, cpu) = hc;
+ hpet_setup_msi_irq(hc);
- if (hpet_setup_msi_irq(hdev->irq))
- return;
+ hpet_init_clockevent(hc, 110);
+ evt->tick_resume = hpet_clkevt_msi_resume;
- hdev->cpu = cpu;
- per_cpu(cpu_hpet_dev, cpu) = hdev;
- evt->name = hdev->name;
- hpet_setup_irq(hdev);
- evt->irq = hdev->irq;
+ clockevents_config_and_register(evt, hpet_freq, HPET_MIN_PROG_DELTA,
+ 0x7FFFFFFF);
+}
- evt->rating = 110;
- evt->features = CLOCK_EVT_FEAT_ONESHOT;
- if (hdev->flags & HPET_DEV_PERI_CAP)
- evt->features |= CLOCK_EVT_FEAT_PERIODIC;
+static struct hpet_channel *hpet_get_unused_clockevent(void)
+{
+ int i;
- evt->set_mode = hpet_msi_set_mode;
- evt->set_next_event = hpet_msi_next_event;
- evt->shift = 32;
+ for (i = 0; i < hpet_base.nr_channels; i++) {
+ struct hpet_channel *hc = hpet_base.channels + i;
- /*
- * The period is a femto seconds value. We need to calculate the
- * scaled math multiplication factor for nanosecond to hpet tick
- * conversion.
- */
- hpet_freq = 1000000000000000ULL;
- do_div(hpet_freq, hpet_period);
- evt->mult = div_sc((unsigned long) hpet_freq,
- NSEC_PER_SEC, evt->shift);
- /* Calculate the max delta */
- evt->max_delta_ns = clockevent_delta2ns(0x7FFFFFFF, evt);
- /* 5 usec minimum reprogramming delta. */
- evt->min_delta_ns = 5000;
-
- evt->cpumask = cpumask_of(hdev->cpu);
- clockevents_register_device(evt);
+ if (hc->mode != HPET_MODE_CLOCKEVT || hc->in_use)
+ continue;
+ hc->in_use = 1;
+ return hc;
+ }
+ return NULL;
}
-#ifdef CONFIG_HPET
-/* Reserve at least one timer for userspace (/dev/hpet) */
-#define RESERVE_TIMERS 1
-#else
-#define RESERVE_TIMERS 0
-#endif
+static int hpet_cpuhp_online(unsigned int cpu)
+{
+ struct hpet_channel *hc = hpet_get_unused_clockevent();
+
+ if (hc)
+ init_one_hpet_msi_clockevent(hc, cpu);
+ return 0;
+}
-static void hpet_msi_capability_lookup(unsigned int start_timer)
+static int hpet_cpuhp_dead(unsigned int cpu)
{
- unsigned int id;
- unsigned int num_timers;
- unsigned int num_timers_used = 0;
- int i;
+ struct hpet_channel *hc = per_cpu(cpu_hpet_channel, cpu);
- if (hpet_msi_disable)
- return;
+ if (!hc)
+ return 0;
+ free_irq(hc->irq, hc);
+ hc->in_use = 0;
+ per_cpu(cpu_hpet_channel, cpu) = NULL;
+ return 0;
+}
+
+static void __init hpet_select_clockevents(void)
+{
+ unsigned int i;
- if (boot_cpu_has(X86_FEATURE_ARAT))
+ hpet_base.nr_clockevents = 0;
+
+ /* No point if MSI is disabled or CPU has an Always Running APIC Timer */
+ if (hpet_msi_disable || boot_cpu_has(X86_FEATURE_ARAT))
return;
- id = hpet_readl(HPET_ID);
- num_timers = ((id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT);
- num_timers++; /* Value read out starts from 0 */
hpet_print_config();
- hpet_devs = kzalloc(sizeof(struct hpet_dev) * num_timers, GFP_KERNEL);
- if (!hpet_devs)
+ hpet_domain = hpet_create_irq_domain(hpet_blockid);
+ if (!hpet_domain)
return;
- hpet_num_timers = num_timers;
+ for (i = 0; i < hpet_base.nr_channels; i++) {
+ struct hpet_channel *hc = hpet_base.channels + i;
+ int irq;
- for (i = start_timer; i < num_timers - RESERVE_TIMERS; i++) {
- struct hpet_dev *hdev = &hpet_devs[num_timers_used];
- unsigned int cfg = hpet_readl(HPET_Tn_CFG(i));
+ if (hc->mode != HPET_MODE_UNUSED)
+ continue;
- /* Only consider HPET timer with MSI support */
- if (!(cfg & HPET_TN_FSB_CAP))
+ /* Only consider HPET channel with MSI support */
+ if (!(hc->boot_cfg & HPET_TN_FSB_CAP))
continue;
- hdev->flags = 0;
- if (cfg & HPET_TN_PERIODIC_CAP)
- hdev->flags |= HPET_DEV_PERI_CAP;
- hdev->num = i;
+ sprintf(hc->name, "hpet%d", i);
- sprintf(hdev->name, "hpet%d", i);
- if (hpet_assign_irq(hdev))
+ irq = hpet_assign_irq(hpet_domain, hc, hc->num);
+ if (irq <= 0)
continue;
- hdev->flags |= HPET_DEV_FSB_CAP;
- hdev->flags |= HPET_DEV_VALID;
- num_timers_used++;
- if (num_timers_used == num_possible_cpus())
+ hc->irq = irq;
+ hc->mode = HPET_MODE_CLOCKEVT;
+
+ if (++hpet_base.nr_clockevents == num_possible_cpus())
break;
}
- printk(KERN_INFO "HPET: %d timers in total, %d timers will be used for per-cpu timer\n",
- num_timers, num_timers_used);
+ pr_info("%d channels of %d reserved for per-cpu timers\n",
+		hpet_base.nr_clockevents, hpet_base.nr_channels);
}
-#ifdef CONFIG_HPET
-static void hpet_reserve_msi_timers(struct hpet_data *hd)
-{
- int i;
-
- if (!hpet_devs)
- return;
+#else
- for (i = 0; i < hpet_num_timers; i++) {
- struct hpet_dev *hdev = &hpet_devs[i];
+static inline void hpet_select_clockevents(void) { }
- if (!(hdev->flags & HPET_DEV_VALID))
- continue;
+#define hpet_cpuhp_online NULL
+#define hpet_cpuhp_dead NULL
- hd->hd_irq[hdev->num] = hdev->irq;
- hpet_reserve_timer(hd, hdev->num);
- }
-}
#endif
-static struct hpet_dev *hpet_get_unused_timer(void)
-{
- int i;
-
- if (!hpet_devs)
- return NULL;
-
- for (i = 0; i < hpet_num_timers; i++) {
- struct hpet_dev *hdev = &hpet_devs[i];
-
- if (!(hdev->flags & HPET_DEV_VALID))
- continue;
- if (test_and_set_bit(HPET_DEV_USED_BIT,
- (unsigned long *)&hdev->flags))
- continue;
- return hdev;
- }
- return NULL;
-}
+/*
+ * Clock source related code
+ */
+#if defined(CONFIG_SMP) && defined(CONFIG_64BIT)
+/*
+ * Reading the HPET counter is a very slow operation. If a large number of
+ * CPUs are trying to access the HPET counter simultaneously, it can cause
+ * massive delays and slow down system performance dramatically. This may
+ * happen when HPET is the default clock source instead of TSC. For a
+ * really large system with hundreds of CPUs, the slowdown may be so
+ * severe that it can actually crash the system because of an NMI watchdog
+ * soft lockup, for example.
+ *
+ * If multiple CPUs are trying to access the HPET counter at the same time,
+ * we don't actually need to read the counter multiple times. Instead, the
+ * other CPUs can use the counter value read by the first CPU in the group.
+ *
+ * This special feature is only enabled on x86-64 systems. It is unlikely
+ * that 32-bit x86 systems will have enough CPUs to require this feature
+ * with its associated locking overhead. We also need a 64-bit atomic read.
+ *
+ * The lock and the HPET value are stored together and can be read in a
+ * single atomic 64-bit read. It is explicitly assumed that arch_spinlock_t
+ * is 32 bits in size.
+ */
+union hpet_lock {
+ struct {
+ arch_spinlock_t lock;
+ u32 value;
+ };
+ u64 lockval;
+};
-struct hpet_work_struct {
- struct delayed_work work;
- struct completion complete;
+static union hpet_lock hpet __cacheline_aligned = {
+ { .lock = __ARCH_SPIN_LOCK_UNLOCKED, },
};
-static void hpet_work(struct work_struct *w)
+static u64 read_hpet(struct clocksource *cs)
{
- struct hpet_dev *hdev;
- int cpu = smp_processor_id();
- struct hpet_work_struct *hpet_work;
+ unsigned long flags;
+ union hpet_lock old, new;
- hpet_work = container_of(w, struct hpet_work_struct, work.work);
+ BUILD_BUG_ON(sizeof(union hpet_lock) != 8);
- hdev = hpet_get_unused_timer();
- if (hdev)
- init_one_hpet_msi_clockevent(hdev, cpu);
+ /*
+ * Read HPET directly if in NMI.
+ */
+ if (in_nmi())
+ return (u64)hpet_readl(HPET_COUNTER);
- complete(&hpet_work->complete);
-}
+ /*
+ * Read the current state of the lock and HPET value atomically.
+ */
+ old.lockval = READ_ONCE(hpet.lockval);
-static int hpet_cpuhp_notify(struct notifier_block *n,
- unsigned long action, void *hcpu)
-{
- unsigned long cpu = (unsigned long)hcpu;
- struct hpet_work_struct work;
- struct hpet_dev *hdev = per_cpu(cpu_hpet_dev, cpu);
-
- switch (action & 0xf) {
- case CPU_ONLINE:
- INIT_DELAYED_WORK_ON_STACK(&work.work, hpet_work);
- init_completion(&work.complete);
- /* FIXME: add schedule_work_on() */
- schedule_delayed_work_on(cpu, &work.work, 0);
- wait_for_completion(&work.complete);
- destroy_timer_on_stack(&work.work.timer);
- break;
- case CPU_DEAD:
- if (hdev) {
- free_irq(hdev->irq, hdev);
- hdev->flags &= ~HPET_DEV_USED;
- per_cpu(cpu_hpet_dev, cpu) = NULL;
- }
- break;
- }
- return NOTIFY_OK;
-}
-#else
+ if (arch_spin_is_locked(&old.lock))
+ goto contended;
-static int hpet_setup_msi_irq(unsigned int irq)
-{
- return 0;
-}
-static void hpet_msi_capability_lookup(unsigned int start_timer)
-{
- return;
-}
+ local_irq_save(flags);
+ if (arch_spin_trylock(&hpet.lock)) {
+ new.value = hpet_readl(HPET_COUNTER);
+ /*
+ * Use WRITE_ONCE() to prevent store tearing.
+ */
+ WRITE_ONCE(hpet.value, new.value);
+ arch_spin_unlock(&hpet.lock);
+ local_irq_restore(flags);
+ return (u64)new.value;
+ }
+ local_irq_restore(flags);
-#ifdef CONFIG_HPET
-static void hpet_reserve_msi_timers(struct hpet_data *hd)
-{
- return;
-}
-#endif
+contended:
+ /*
+ * Contended case
+ * --------------
+ * Wait until the HPET value changes or the lock is freed, either of
+ * which indicates that its value is up-to-date.
+ *
+ * It is possible that old.value has already contained the latest
+ * HPET value while the lock holder was in the process of releasing
+ * the lock. Checking for lock state change will enable us to return
+ * the value immediately instead of waiting for the next HPET reader
+ * to come along.
+ */
+ do {
+ cpu_relax();
+ new.lockval = READ_ONCE(hpet.lockval);
+ } while ((new.value == old.value) && arch_spin_is_locked(&new.lock));
-static int hpet_cpuhp_notify(struct notifier_block *n,
- unsigned long action, void *hcpu)
-{
- return NOTIFY_OK;
+ return (u64)new.value;
}
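The pattern above is self-contained enough to demonstrate outside the kernel. Below is a minimal user-space sketch of the same idea, one thread performing the slow hardware read while contending threads spin until either the published value changes or the lock clears, using C11 atomics in place of arch_spinlock_t. The bit layout differs from the kernel's (bit 0 serves as the lock here), and hw_read_counter() is a hypothetical stand-in for the slow MMIO read.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the slow MMIO counter read. */
    static uint32_t hw_read_counter(void)
    {
        static _Atomic uint32_t fake;
        return atomic_fetch_add(&fake, 1);
    }

    /* Bit 0: lock; high 32 bits: last published counter value. */
    static _Atomic uint64_t lockval;

    static uint32_t shared_counter_read(void)
    {
        uint64_t old = atomic_load(&lockval);

        if (!(old & 1)) {
            uint64_t expected = old;
            /* Try to become the one reader by setting the lock bit. */
            if (atomic_compare_exchange_strong(&lockval, &expected, old | 1)) {
                uint32_t v = hw_read_counter();
                /* Publish the new value and drop the lock in one store. */
                atomic_store(&lockval, (uint64_t)v << 32);
                return v;
            }
        }

        /* Contended: wait for a fresh value or for the lock to clear. */
        uint64_t cur;
        do {
            cur = atomic_load(&lockval);
        } while ((cur >> 32) == (old >> 32) && (cur & 1));

        return (uint32_t)(cur >> 32);
    }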
-
-#endif
-
+#else
/*
- * Clock source related code
+ * For UP or 32-bit.
*/
-static cycle_t read_hpet(struct clocksource *cs)
-{
- return (cycle_t)hpet_readl(HPET_COUNTER);
-}
-
-#ifdef CONFIG_X86_64
-static cycle_t __vsyscall_fn vread_hpet(void)
+static u64 read_hpet(struct clocksource *cs)
{
- return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
+ return (u64)hpet_readl(HPET_COUNTER);
}
#endif
@@ -787,25 +854,46 @@ static struct clocksource clocksource_hpet = {
.rating = 250,
.read = read_hpet,
.mask = HPET_MASK,
- .shift = HPET_SHIFT,
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
.resume = hpet_resume_counter,
-#ifdef CONFIG_X86_64
- .vread = vread_hpet,
-#endif
};
-static int hpet_clocksource_register(void)
+/*
+ * AMD SB700 based systems with spread spectrum enabled use an SMM-based
+ * HPET emulation to provide proper frequency setting.
+ *
+ * On such systems the SMM code is initialized with the first HPET register
+ * access and takes some time to complete. During this time the config
+ * register reads 0xffffffff. We check up to 1000 times whether the
+ * config register reads a non-0xffffffff value, to make sure that the
+ * HPET is up and running before we proceed any further.
+ *
+ * A counting loop is safe, as the HPET access takes thousands of CPU cycles.
+ *
+ * On non-SB700 based machines this check is only done once and has no
+ * side effects.
+ */
+static bool __init hpet_cfg_working(void)
+{
+ int i;
+
+ for (i = 0; i < 1000; i++) {
+ if (hpet_readl(HPET_CFG) != 0xFFFFFFFF)
+ return true;
+ }
+
+ pr_warn("Config register invalid. Disabling HPET\n");
+ return false;
+}
+
+static bool __init hpet_counting(void)
{
- u64 start, now;
- cycle_t t1;
+ u64 start, now, t1;
- /* Start the counter */
hpet_restart_counter();
- /* Verify whether hpet counter works */
t1 = hpet_readl(HPET_COUNTER);
- rdtscll(start);
+ start = rdtsc();
/*
* We don't know the TSC frequency yet, but waiting for
@@ -814,29 +902,87 @@ static int hpet_clocksource_register(void)
* 1 GHz == 200us
*/
do {
- rep_nop();
- rdtscll(now);
+ if (t1 != hpet_readl(HPET_COUNTER))
+ return true;
+ now = rdtsc();
} while ((now - start) < 200000UL);
- if (t1 == hpet_readl(HPET_COUNTER)) {
- printk(KERN_WARNING
- "HPET counter not counting. HPET disabled\n");
- return -ENODEV;
- }
+ pr_warn("Counter not counting. HPET disabled\n");
+ return false;
+}
- /*
- * The definition of mult is (include/linux/clocksource.h)
- * mult/2^shift = ns/cyc and hpet_period is in units of fsec/cyc
- * so we first need to convert hpet_period to ns/cyc units:
- * mult/2^shift = ns/cyc = hpet_period/10^6
- * mult = (hpet_period * 2^shift)/10^6
- * mult = (hpet_period << shift)/FSEC_PER_NSEC
- */
- clocksource_hpet.mult = div_sc(hpet_period, FSEC_PER_NSEC, HPET_SHIFT);
+static bool __init mwait_pc10_supported(void)
+{
+ unsigned int eax, ebx, ecx, mwait_substates;
- clocksource_register(&clocksource_hpet);
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ return false;
- return 0;
+ if (!cpu_feature_enabled(X86_FEATURE_MWAIT))
+ return false;
+
+ cpuid(CPUID_LEAF_MWAIT, &eax, &ebx, &ecx, &mwait_substates);
+
+ return (ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) &&
+ (ecx & CPUID5_ECX_INTERRUPT_BREAK) &&
+ (mwait_substates & (0xF << 28));
+}
+
+/*
+ * Check whether the system supports PC10. If so force disable HPET as that
+ * stops counting in PC10. This check is overbroad as it does not take any
+ * of the following into account:
+ *
+ * - ACPI tables
+ * - Enablement of intel_idle
+ * - Command line arguments which limit intel_idle C-state support
+ *
+ * That's perfectly fine. HPET is a piece of hardware designed by committee
+ * and the only reason it is still in use on modern systems is that
+ * it is impossible to reliably query the TSC and CPU frequency via
+ * CPUID or firmware.
+ *
+ * If HPET is functional it is useful for calibrating TSC, but this can be
+ * done via PMTIMER as well which seems to be the last remaining timer on
+ * X86/INTEL platforms that has not been completely wrecked by feature
+ * creep.
+ *
+ * In theory HPET support should be removed altogether, but there are older
+ * systems out there which depend on it because TSC and APIC timer are
+ * dysfunctional in deeper C-states.
+ *
+ * It's only 20 years now that hardware people have been asked to provide
+ * reliable and discoverable facilities which can be used for timekeeping
+ * and per-CPU timer interrupts.
+ *
+ * The probability that this problem is going to be solved in the
+ * foreseeable future is close to zero, so the kernel has to be cluttered
+ * with heuristics to keep up with the ever growing amount of hardware and
+ * firmware trainwrecks. Hopefully some day hardware people will understand
+ * that the approach of "This can be fixed in software" is not sustainable.
+ * Hope dies last...
+ */
+static bool __init hpet_is_pc10_damaged(void)
+{
+ unsigned long long pcfg;
+
+ /* Check whether PC10 substates are supported */
+ if (!mwait_pc10_supported())
+ return false;
+
+ /* Check whether PC10 is enabled in PKG C-state limit */
+ rdmsrq(MSR_PKG_CST_CONFIG_CONTROL, pcfg);
+ if ((pcfg & 0xF) < 8)
+ return false;
+
+ if (hpet_force_user) {
+ pr_warn("HPET force enabled via command line, but dysfunctional in PC10.\n");
+ return false;
+ }
+
+ pr_info("HPET dysfunctional in PC10. Force disabled.\n");
+ boot_hpet_disable = true;
+ return true;
}
/**
@@ -844,44 +990,37 @@ static int hpet_clocksource_register(void)
*/
int __init hpet_enable(void)
{
- unsigned int id;
- int i;
+ u32 hpet_period, cfg, id, irq;
+ unsigned int i, channels;
+ struct hpet_channel *hc;
+ u64 freq;
if (!is_hpet_capable())
return 0;
+ if (hpet_is_pc10_damaged())
+ return 0;
+
hpet_set_mapping();
+ if (!hpet_virt_address)
+ return 0;
+
+ /* Validate that the config register is working */
+ if (!hpet_cfg_working())
+ goto out_nohpet;
/*
* Read the period and check for a sane value:
*/
hpet_period = hpet_readl(HPET_PERIOD);
-
- /*
- * AMD SB700 based systems with spread spectrum enabled use a
- * SMM based HPET emulation to provide proper frequency
- * setting. The SMM code is initialized with the first HPET
- * register access and takes some time to complete. During
- * this time the config register reads 0xffffffff. We check
- * for max. 1000 loops whether the config register reads a non
- * 0xffffffff value to make sure that HPET is up and running
- * before we go further. A counting loop is safe, as the HPET
- * access takes thousands of CPU cycles. On non SB700 based
- * machines this check is only done once and has no side
- * effects.
- */
- for (i = 0; hpet_readl(HPET_CFG) == 0xFFFFFFFF; i++) {
- if (i == 1000) {
- printk(KERN_WARNING
- "HPET config register value = 0xFFFFFFFF. "
- "Disabling HPET\n");
- goto out_nohpet;
- }
- }
-
if (hpet_period < HPET_MIN_PERIOD || hpet_period > HPET_MAX_PERIOD)
goto out_nohpet;
+ /* The period is a femtoseconds value. Convert it to a frequency. */
+ freq = FSEC_PER_SEC;
+ do_div(freq, hpet_period);
+ hpet_freq = freq;
+
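As a worked example (typical value, assumed for illustration): a common HPET period is 69841279 fs, so freq = 10^15 / 69841279 ≈ 14318180 Hz, i.e. the classic 14.31818 MHz rate; the period itself must already have passed the HPET_MIN_PERIOD/HPET_MAX_PERIOD sanity check above.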
/*
* Read the HPET ID register to retrieve the IRQ routing
* information and the number of channels
@@ -889,42 +1028,99 @@ int __init hpet_enable(void)
id = hpet_readl(HPET_ID);
hpet_print_config();
-#ifdef CONFIG_HPET_EMULATE_RTC
+ /* The ID field holds the zero-based index of the last channel, hence the +1 */
+ channels = ((id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT) + 1;
+
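For example, an ID register whose HPET_ID_NUMBER field reads 2 exposes channels 0-2, i.e. channels = 3.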
/*
* The legacy routing mode needs at least two channels, tick timer
* and the rtc emulation channel.
*/
- if (!(id & HPET_ID_NUMBER))
+ if (IS_ENABLED(CONFIG_HPET_EMULATE_RTC) && channels < 2)
goto out_nohpet;
-#endif
- if (hpet_clocksource_register())
+ hc = kcalloc(channels, sizeof(*hc), GFP_KERNEL);
+ if (!hc) {
+ pr_warn("Disabling HPET.\n");
+ goto out_nohpet;
+ }
+ hpet_base.channels = hc;
+ hpet_base.nr_channels = channels;
+
+ /* Read, store and sanitize the global configuration */
+ cfg = hpet_readl(HPET_CFG);
+ hpet_base.boot_cfg = cfg;
+ cfg &= ~(HPET_CFG_ENABLE | HPET_CFG_LEGACY);
+ hpet_writel(cfg, HPET_CFG);
+ if (cfg)
+ pr_warn("Global config: Unknown bits %#x\n", cfg);
+
+ /* Read, store and sanitize the per channel configuration */
+ for (i = 0; i < channels; i++, hc++) {
+ hc->num = i;
+
+ cfg = hpet_readl(HPET_Tn_CFG(i));
+ hc->boot_cfg = cfg;
+ irq = (cfg & Tn_INT_ROUTE_CNF_MASK) >> Tn_INT_ROUTE_CNF_SHIFT;
+ hc->irq = irq;
+
+ cfg &= ~(HPET_TN_ENABLE | HPET_TN_LEVEL | HPET_TN_FSB);
+ hpet_writel(cfg, HPET_Tn_CFG(i));
+
+ cfg &= ~(HPET_TN_PERIODIC | HPET_TN_PERIODIC_CAP
+ | HPET_TN_64BIT_CAP | HPET_TN_32BIT | HPET_TN_ROUTE
+ | HPET_TN_FSB | HPET_TN_FSB_CAP);
+ if (cfg)
+ pr_warn("Channel #%u config: Unknown bits %#x\n", i, cfg);
+ }
+ hpet_print_config();
+
+ /*
+ * Validate that the counter is counting. This needs to be done
+ * after sanitizing the config registers to properly deal with
+ * force enabled HPETs.
+ */
+ if (!hpet_counting())
goto out_nohpet;
+ if (tsc_clocksource_watchdog_disabled())
+ clocksource_hpet.flags |= CLOCK_SOURCE_MUST_VERIFY;
+ clocksource_register_hz(&clocksource_hpet, (u32)hpet_freq);
+
if (id & HPET_ID_LEGSUP) {
- hpet_legacy_clockevent_register();
+ hpet_legacy_clockevent_register(&hpet_base.channels[0]);
+ hpet_base.channels[0].mode = HPET_MODE_LEGACY;
+ if (IS_ENABLED(CONFIG_HPET_EMULATE_RTC))
+ hpet_base.channels[1].mode = HPET_MODE_LEGACY;
return 1;
}
return 0;
out_nohpet:
+ kfree(hpet_base.channels);
+ hpet_base.channels = NULL;
+ hpet_base.nr_channels = 0;
hpet_clear_mapping();
hpet_address = 0;
return 0;
}
/*
- * Needs to be late, as the reserve_timer code calls kalloc !
+ * The late initialization runs after the PCI quirks have been invoked,
+ * which might have detected a system on which the HPET can be enforced.
*
- * Not a problem on i386 as hpet_enable is called from late_time_init,
- * but on x86_64 it is necessary !
+ * Also, the MSI machinery is not working yet when the HPET is initialized
+ * early.
+ *
+ * If the HPET is enabled, then:
+ *
+ * 1) Reserve one channel for /dev/hpet if CONFIG_HPET=y
+ * 2) Reserve up to num_possible_cpus() channels as per-CPU clockevents
+ * 3) Setup /dev/hpet if CONFIG_HPET=y
+ * 4) Register hotplug callbacks when clockevents are available
*/
static __init int hpet_late_init(void)
{
- int cpu;
-
- if (boot_hpet_disable)
- return -ENODEV;
+ int ret;
if (!hpet_address) {
if (!force_hpet_address)
@@ -937,64 +1133,76 @@ static __init int hpet_late_init(void)
if (!hpet_virt_address)
return -ENODEV;
- if (hpet_readl(HPET_ID) & HPET_ID_LEGSUP)
- hpet_msi_capability_lookup(2);
- else
- hpet_msi_capability_lookup(0);
-
- hpet_reserve_platform_timers(hpet_readl(HPET_ID));
+ hpet_select_device_channel();
+ hpet_select_clockevents();
+ hpet_reserve_platform_timers();
hpet_print_config();
- if (hpet_msi_disable)
- return 0;
-
- if (boot_cpu_has(X86_FEATURE_ARAT))
+ if (!hpet_base.nr_clockevents)
return 0;
- for_each_online_cpu(cpu) {
- hpet_cpuhp_notify(NULL, CPU_ONLINE, (void *)(long)cpu);
- }
-
- /* This notifier should be called after workqueue is ready */
- hotcpu_notifier(hpet_cpuhp_notify, -20);
-
+ ret = cpuhp_setup_state(CPUHP_AP_X86_HPET_ONLINE, "x86/hpet:online",
+ hpet_cpuhp_online, NULL);
+ if (ret)
+ return ret;
+ ret = cpuhp_setup_state(CPUHP_X86_HPET_DEAD, "x86/hpet:dead", NULL,
+ hpet_cpuhp_dead);
+ if (ret)
+ goto err_cpuhp;
return 0;
+
+err_cpuhp:
+ cpuhp_remove_state(CPUHP_AP_X86_HPET_ONLINE);
+ return ret;
}
fs_initcall(hpet_late_init);
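Note the hotplug registration pattern: cpuhp_setup_state() takes a startup and a teardown callback, and passing NULL for one of them registers the state for only that direction, so CPUHP_AP_X86_HPET_ONLINE runs hpet_cpuhp_online when a CPU comes up while CPUHP_X86_HPET_DEAD runs hpet_cpuhp_dead only after a CPU has died. cpuhp_setup_state() also invokes the startup callback for CPUs that are already online at registration time, which is why the removed for_each_online_cpu() loop is no longer needed.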
void hpet_disable(void)
{
- if (is_hpet_capable() && hpet_virt_address) {
- unsigned int cfg = hpet_readl(HPET_CFG);
+ unsigned int i;
+ u32 cfg;
- if (hpet_legacy_int_enabled) {
- cfg &= ~HPET_CFG_LEGACY;
- hpet_legacy_int_enabled = 0;
- }
- cfg &= ~HPET_CFG_ENABLE;
- hpet_writel(cfg, HPET_CFG);
- }
+ if (!is_hpet_capable() || !hpet_virt_address)
+ return;
+
+ /* Restore boot configuration with the enable bit cleared */
+ cfg = hpet_base.boot_cfg;
+ cfg &= ~HPET_CFG_ENABLE;
+ hpet_writel(cfg, HPET_CFG);
+
+ /* Restore the channel boot configuration */
+ for (i = 0; i < hpet_base.nr_channels; i++)
+ hpet_writel(hpet_base.channels[i].boot_cfg, HPET_Tn_CFG(i));
+
+ /* If the HPET was enabled at boot time, reenable it */
+ if (hpet_base.boot_cfg & HPET_CFG_ENABLE)
+ hpet_writel(hpet_base.boot_cfg, HPET_CFG);
}
#ifdef CONFIG_HPET_EMULATE_RTC
-/* HPET in LegacyReplacement Mode eats up RTC interrupt line. When, HPET
+/*
+ * HPET in LegacyReplacement mode eats up the RTC interrupt line. When HPET
* is enabled, we support RTC interrupt functionality in software.
+ *
* RTC has 3 kinds of interrupts:
- * 1) Update Interrupt - generate an interrupt, every sec, when RTC clock
- * is updated
- * 2) Alarm Interrupt - generate an interrupt at a specific time of day
- * 3) Periodic Interrupt - generate periodic interrupt, with frequencies
- * 2Hz-8192Hz (2Hz-64Hz for non-root user) (all freqs in powers of 2)
- * (1) and (2) above are implemented using polling at a frequency of
- * 64 Hz. The exact frequency is a tradeoff between accuracy and interrupt
- * overhead. (DEFAULT_RTC_INT_FREQ)
- * For (3), we use interrupts at 64Hz or user specified periodic
- * frequency, whichever is higher.
+ *
+ * 1) Update Interrupt - generate an interrupt every second, when the
+ * RTC clock is updated
+ * 2) Alarm Interrupt - generate an interrupt at a specific time of day
+ * 3) Periodic Interrupt - generate periodic interrupt, with frequencies
+ * 2Hz-8192Hz (2Hz-64Hz for non-root user) (all frequencies in powers of 2)
+ *
+ * (1) and (2) above are implemented using polling at a frequency of 64 Hz:
+ * DEFAULT_RTC_INT_FREQ.
+ *
+ * The exact frequency is a tradeoff between accuracy and interrupt overhead.
+ *
+ * For (3), we use interrupts at 64 Hz or the user-specified periodic
+ * frequency, whichever is higher.
*/
#include <linux/mc146818rtc.h>
#include <linux/rtc.h>
-#include <asm/rtc.h>
#define DEFAULT_RTC_INT_FREQ 64
#define DEFAULT_RTC_SHIFT 6
@@ -1012,7 +1220,7 @@ static unsigned long hpet_pie_limit;
static rtc_irq_handler irq_handler;
/*
- * Check that the hpet counter c1 is ahead of the c2
+ * Check that the HPET counter c1 is ahead of c2
*/
static inline int hpet_cnt_ahead(u32 c1, u32 c2)
{
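The body of hpet_cnt_ahead() is unchanged by this hunk and therefore elided from the diff. For reference, such a wraparound-safe comparison is conventionally written with the signed-difference idiom (a sketch, not necessarily the exact body):

    return (s32)(c2 - c1) < 0;

The cast makes the comparison correct even when the 32-bit counter has wrapped between the two samples.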
@@ -1050,8 +1258,8 @@ void hpet_unregister_irq_handler(rtc_irq_handler handler)
EXPORT_SYMBOL_GPL(hpet_unregister_irq_handler);
/*
- * Timer 1 for RTC emulation. We use one shot mode, as periodic mode
- * is not supported by all HPET implementations for timer 1.
+ * Channel 1 for RTC emulation. We use one shot mode, as periodic mode
+ * is not supported by all HPET implementations for channel 1.
*
* hpet_rtc_timer_init() is called when the rtc is initialized.
*/
@@ -1064,10 +1272,11 @@ int hpet_rtc_timer_init(void)
return 0;
if (!hpet_default_delta) {
+ struct clock_event_device *evt = &hpet_base.channels[0].evt;
uint64_t clc;
- clc = (uint64_t) hpet_clockevent.mult * NSEC_PER_SEC;
- clc >>= hpet_clockevent.shift + DEFAULT_RTC_SHIFT;
+ clc = (uint64_t) evt->mult * NSEC_PER_SEC;
+ clc >>= evt->shift + DEFAULT_RTC_SHIFT;
hpet_default_delta = clc;
}
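For the clockevent conversion above: delta_cycles = (delta_ns * mult) >> shift, so (mult * NSEC_PER_SEC) >> shift is one second's worth of HPET cycles, and the extra DEFAULT_RTC_SHIFT (6) divides that by 64 to yield the counter delta for one tick of the 64 Hz polling clock. At the typical 14.318 MHz HPET rate (assumed for illustration) that is about 14318180 / 64 ≈ 223722 counter ticks per poll.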
@@ -1093,6 +1302,14 @@ int hpet_rtc_timer_init(void)
}
EXPORT_SYMBOL_GPL(hpet_rtc_timer_init);
+static void hpet_disable_rtc_channel(void)
+{
+ u32 cfg = hpet_readl(HPET_T1_CFG);
+
+ cfg &= ~HPET_TN_ENABLE;
+ hpet_writel(cfg, HPET_T1_CFG);
+}
+
/*
* The functions below are called from rtc driver.
* Return 0 if HPET is not being used.
@@ -1104,6 +1321,9 @@ int hpet_mask_rtc_irq_bit(unsigned long bit_mask)
return 0;
hpet_rtc_flags &= ~bit_mask;
+ if (unlikely(!hpet_rtc_flags))
+ hpet_disable_rtc_channel();
+
return 1;
}
EXPORT_SYMBOL_GPL(hpet_mask_rtc_irq_bit);
@@ -1127,8 +1347,7 @@ int hpet_set_rtc_irq_bit(unsigned long bit_mask)
}
EXPORT_SYMBOL_GPL(hpet_set_rtc_irq_bit);
-int hpet_set_alarm_time(unsigned char hrs, unsigned char min,
- unsigned char sec)
+int hpet_set_alarm_time(unsigned char hrs, unsigned char min, unsigned char sec)
{
if (!is_hpet_enabled())
return 0;
@@ -1148,36 +1367,29 @@ int hpet_set_periodic_freq(unsigned long freq)
if (!is_hpet_enabled())
return 0;
- if (freq <= DEFAULT_RTC_INT_FREQ)
+ if (freq <= DEFAULT_RTC_INT_FREQ) {
hpet_pie_limit = DEFAULT_RTC_INT_FREQ / freq;
- else {
- clc = (uint64_t) hpet_clockevent.mult * NSEC_PER_SEC;
+ } else {
+ struct clock_event_device *evt = &hpet_base.channels[0].evt;
+
+ clc = (uint64_t) evt->mult * NSEC_PER_SEC;
do_div(clc, freq);
- clc >>= hpet_clockevent.shift;
+ clc >>= evt->shift;
hpet_pie_delta = clc;
hpet_pie_limit = 0;
}
+
return 1;
}
EXPORT_SYMBOL_GPL(hpet_set_periodic_freq);
-int hpet_rtc_dropped_irq(void)
-{
- return is_hpet_enabled();
-}
-EXPORT_SYMBOL_GPL(hpet_rtc_dropped_irq);
-
static void hpet_rtc_timer_reinit(void)
{
- unsigned int cfg, delta;
+ unsigned int delta;
int lost_ints = -1;
- if (unlikely(!hpet_rtc_flags)) {
- cfg = hpet_readl(HPET_T1_CFG);
- cfg &= ~HPET_TN_ENABLE;
- hpet_writel(cfg, HPET_T1_CFG);
- return;
- }
+ if (unlikely(!hpet_rtc_flags))
+ hpet_disable_rtc_channel();
if (!(hpet_rtc_flags & RTC_PIE) || hpet_pie_limit)
delta = hpet_default_delta;
@@ -1198,8 +1410,7 @@ static void hpet_rtc_timer_reinit(void)
if (hpet_rtc_flags & RTC_PIE)
hpet_pie_count += lost_ints;
if (printk_ratelimit())
- printk(KERN_WARNING "hpet1: lost %d rtc interrupts\n",
- lost_ints);
+ pr_warn("Lost %d RTC interrupts\n", lost_ints);
}
}
@@ -1211,8 +1422,12 @@ irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id)
hpet_rtc_timer_reinit();
memset(&curr_time, 0, sizeof(struct rtc_time));
- if (hpet_rtc_flags & (RTC_UIE | RTC_AIE))
- get_rtc_time(&curr_time);
+ if (hpet_rtc_flags & (RTC_UIE | RTC_AIE)) {
+ if (unlikely(mc146818_get_time(&curr_time, 10) < 0)) {
+ pr_err_ratelimited("unable to read current time from RTC\n");
+ return IRQ_HANDLED;
+ }
+ }
if (hpet_rtc_flags & RTC_UIE &&
curr_time.tm_sec != hpet_prev_update_sec) {
@@ -1221,8 +1436,7 @@ irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id)
hpet_prev_update_sec = curr_time.tm_sec;
}
- if (hpet_rtc_flags & RTC_PIE &&
- ++hpet_pie_count >= hpet_pie_limit) {
+ if (hpet_rtc_flags & RTC_PIE && ++hpet_pie_count >= hpet_pie_limit) {
rtc_int_flag |= RTC_PF;
hpet_pie_count = 0;
}
@@ -1231,7 +1445,7 @@ irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id)
(curr_time.tm_sec == hpet_alarm_time.tm_sec) &&
(curr_time.tm_min == hpet_alarm_time.tm_min) &&
(curr_time.tm_hour == hpet_alarm_time.tm_hour))
- rtc_int_flag |= RTC_AF;
+ rtc_int_flag |= RTC_AF;
if (rtc_int_flag) {
rtc_int_flag |= (RTC_IRQF | (RTC_NUM_INTS << 8));
diff --git a/arch/x86/kernel/hw_breakpoint.c b/arch/x86/kernel/hw_breakpoint.c
index a8f1b803d2fd..f846c15f21ca 100644
--- a/arch/x86/kernel/hw_breakpoint.c
+++ b/arch/x86/kernel/hw_breakpoint.c
@@ -1,17 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*
* Copyright (C) 2007 Alan Stern
* Copyright (C) 2009 IBM Corporation
@@ -36,14 +24,17 @@
#include <linux/percpu.h>
#include <linux/kdebug.h>
#include <linux/kernel.h>
-#include <linux/module.h>
+#include <linux/kvm_types.h>
+#include <linux/export.h>
#include <linux/sched.h>
-#include <linux/init.h>
#include <linux/smp.h>
#include <asm/hw_breakpoint.h>
#include <asm/processor.h>
#include <asm/debugreg.h>
+#include <asm/user.h>
+#include <asm/desc.h>
+#include <asm/tlbflush.h>
/* Per cpu debug control register value */
DEFINE_PER_CPU(unsigned long, cpu_dr7);
@@ -109,8 +100,10 @@ int arch_install_hw_breakpoint(struct perf_event *bp)
unsigned long *dr7;
int i;
+ lockdep_assert_irqs_disabled();
+
for (i = 0; i < HBP_NUM; i++) {
- struct perf_event **slot = &__get_cpu_var(bp_per_reg[i]);
+ struct perf_event **slot = this_cpu_ptr(&bp_per_reg[i]);
if (!*slot) {
*slot = bp;
@@ -122,12 +115,20 @@ int arch_install_hw_breakpoint(struct perf_event *bp)
return -EBUSY;
set_debugreg(info->address, i);
- __get_cpu_var(cpu_debugreg[i]) = info->address;
+ __this_cpu_write(cpu_debugreg[i], info->address);
- dr7 = &__get_cpu_var(cpu_dr7);
+ dr7 = this_cpu_ptr(&cpu_dr7);
*dr7 |= encode_dr7(i, info->len, info->type);
+ /*
+ * Ensure we first write cpu_dr7 before we set the DR7 register.
+ * This ensures an NMI never sees a zero cpu_dr7 while DR7 is non-zero.
+ */
+ barrier();
+
set_debugreg(*dr7, 7);
+ if (info->mask)
+ amd_set_dr_addr_mask(info->mask, i);
return 0;
}
@@ -144,11 +145,13 @@ int arch_install_hw_breakpoint(struct perf_event *bp)
void arch_uninstall_hw_breakpoint(struct perf_event *bp)
{
struct arch_hw_breakpoint *info = counter_arch_bp(bp);
- unsigned long *dr7;
+ unsigned long dr7;
int i;
+ lockdep_assert_irqs_disabled();
+
for (i = 0; i < HBP_NUM; i++) {
- struct perf_event **slot = &__get_cpu_var(bp_per_reg[i]);
+ struct perf_event **slot = this_cpu_ptr(&bp_per_reg[i]);
if (*slot == bp) {
*slot = NULL;
@@ -159,78 +162,54 @@ void arch_uninstall_hw_breakpoint(struct perf_event *bp)
if (WARN_ONCE(i == HBP_NUM, "Can't find any breakpoint slot"))
return;
- dr7 = &__get_cpu_var(cpu_dr7);
- *dr7 &= ~__encode_dr7(i, info->len, info->type);
-
- set_debugreg(*dr7, 7);
-}
+ dr7 = this_cpu_read(cpu_dr7);
+ dr7 &= ~__encode_dr7(i, info->len, info->type);
-static int get_hbp_len(u8 hbp_len)
-{
- unsigned int len_in_bytes = 0;
-
- switch (hbp_len) {
- case X86_BREAKPOINT_LEN_1:
- len_in_bytes = 1;
- break;
- case X86_BREAKPOINT_LEN_2:
- len_in_bytes = 2;
- break;
- case X86_BREAKPOINT_LEN_4:
- len_in_bytes = 4;
- break;
-#ifdef CONFIG_X86_64
- case X86_BREAKPOINT_LEN_8:
- len_in_bytes = 8;
- break;
-#endif
- }
- return len_in_bytes;
-}
-
-/*
- * Check for virtual address in kernel space.
- */
-int arch_check_bp_in_kernelspace(struct perf_event *bp)
-{
- unsigned int len;
- unsigned long va;
- struct arch_hw_breakpoint *info = counter_arch_bp(bp);
+ set_debugreg(dr7, 7);
+ if (info->mask)
+ amd_set_dr_addr_mask(0, i);
- va = info->address;
- len = get_hbp_len(info->len);
+ /*
+ * Ensure the write to cpu_dr7 is after we've set the DR7 register.
+ * This ensures an NMI never sees a zero cpu_dr7 while DR7 is non-zero.
+ */
+ barrier();
- return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+ this_cpu_write(cpu_dr7, dr7);
}
-int arch_bp_generic_fields(int x86_len, int x86_type,
- int *gen_len, int *gen_type)
+static int arch_bp_generic_len(int x86_len)
{
- /* Len */
switch (x86_len) {
case X86_BREAKPOINT_LEN_1:
- *gen_len = HW_BREAKPOINT_LEN_1;
- break;
+ return HW_BREAKPOINT_LEN_1;
case X86_BREAKPOINT_LEN_2:
- *gen_len = HW_BREAKPOINT_LEN_2;
- break;
+ return HW_BREAKPOINT_LEN_2;
case X86_BREAKPOINT_LEN_4:
- *gen_len = HW_BREAKPOINT_LEN_4;
- break;
+ return HW_BREAKPOINT_LEN_4;
#ifdef CONFIG_X86_64
case X86_BREAKPOINT_LEN_8:
- *gen_len = HW_BREAKPOINT_LEN_8;
- break;
+ return HW_BREAKPOINT_LEN_8;
#endif
default:
return -EINVAL;
}
+}
+
+int arch_bp_generic_fields(int x86_len, int x86_type,
+ int *gen_len, int *gen_type)
+{
+ int len;
/* Type */
switch (x86_type) {
case X86_BREAKPOINT_EXECUTE:
+ if (x86_len != X86_BREAKPOINT_LEN_X)
+ return -EINVAL;
+
*gen_type = HW_BREAKPOINT_X;
- break;
+ *gen_len = sizeof(long);
+ return 0;
case X86_BREAKPOINT_WRITE:
*gen_type = HW_BREAKPOINT_W;
break;
@@ -241,72 +220,223 @@ int arch_bp_generic_fields(int x86_len, int x86_type,
return -EINVAL;
}
+ /* Len */
+ len = arch_bp_generic_len(x86_len);
+ if (len < 0)
+ return -EINVAL;
+ *gen_len = len;
+
return 0;
}
+/*
+ * Check for virtual address in kernel space.
+ */
+int arch_check_bp_in_kernelspace(struct arch_hw_breakpoint *hw)
+{
+ unsigned long va;
+ int len;
+
+ va = hw->address;
+ len = arch_bp_generic_len(hw->len);
+ WARN_ON_ONCE(len < 0);
+
+ /*
+ * We don't need to worry about va + len - 1 overflowing:
+ * we already require that va is aligned to a multiple of len.
+ */
+ return (va >= TASK_SIZE_MAX) || ((va + len - 1) >= TASK_SIZE_MAX);
+}
-static int arch_build_bp_info(struct perf_event *bp)
+/*
+ * Checks whether the range [addr, end] overlaps the area [base, base + size).
+ */
+static inline bool within_area(unsigned long addr, unsigned long end,
+ unsigned long base, unsigned long size)
{
- struct arch_hw_breakpoint *info = counter_arch_bp(bp);
+ return end >= base && addr < (base + size);
+}
+
+/*
+ * Checks whether the range from addr to end, inclusive, overlaps the fixed
+ * mapped CPU entry area range or other ranges used for CPU entry.
+ */
+static inline bool within_cpu_entry(unsigned long addr, unsigned long end)
+{
+ int cpu;
+
+ /* CPU entry area is always used for CPU entry */
+ if (within_area(addr, end, CPU_ENTRY_AREA_BASE,
+ CPU_ENTRY_AREA_MAP_SIZE))
+ return true;
+
+ /*
+ * When FSGSBASE is enabled, paranoid_entry() fetches the per-CPU
+ * GSBASE value via __per_cpu_offset or pcpu_unit_offsets.
+ */
+#ifdef CONFIG_SMP
+ if (within_area(addr, end, (unsigned long)__per_cpu_offset,
+ sizeof(unsigned long) * nr_cpu_ids))
+ return true;
+#else
+ if (within_area(addr, end, (unsigned long)&pcpu_unit_offsets,
+ sizeof(pcpu_unit_offsets)))
+ return true;
+#endif
+
+ for_each_possible_cpu(cpu) {
+ /* The original rw GDT is being used after load_direct_gdt() */
+ if (within_area(addr, end, (unsigned long)get_cpu_gdt_rw(cpu),
+ GDT_SIZE))
+ return true;
+
+ /*
+ * cpu_tss_rw is not directly referenced by hardware, but it
+ * is also used by the CPU entry code.
+ */
+ if (within_area(addr, end,
+ (unsigned long)&per_cpu(cpu_tss_rw, cpu),
+ sizeof(struct tss_struct)))
+ return true;
- info->address = bp->attr.bp_addr;
+ /*
+ * cpu_tlbstate.user_pcid_flush_mask is used for CPU entry.
+ * A data breakpoint on it would cause an unwanted #DB.
+ * Protect the full cpu_tlbstate structure to be sure.
+ */
+ if (within_area(addr, end,
+ (unsigned long)&per_cpu(cpu_tlbstate, cpu),
+ sizeof(struct tlb_state)))
+ return true;
+
+ /*
+ * When in guest (X86_FEATURE_HYPERVISOR), local_db_save()
+ * will read the per-CPU cpu_dr7 before clearing the DR7 register.
+ */
+ if (within_area(addr, end, (unsigned long)&per_cpu(cpu_dr7, cpu),
+ sizeof(cpu_dr7)))
+ return true;
+ }
+
+ return false;
+}
+
+static int arch_build_bp_info(struct perf_event *bp,
+ const struct perf_event_attr *attr,
+ struct arch_hw_breakpoint *hw)
+{
+ unsigned long bp_end;
+
+ bp_end = attr->bp_addr + attr->bp_len - 1;
+ if (bp_end < attr->bp_addr)
+ return -EINVAL;
+
+ /*
+ * Prevent any breakpoint of any type that overlaps the CPU
+ * entry area and data. This protects the IST stacks and also
+ * reduces the chance that we ever find out what happens if
+ * there's a data breakpoint on the GDT, IDT, or TSS.
+ */
+ if (within_cpu_entry(attr->bp_addr, bp_end))
+ return -EINVAL;
+
+ hw->address = attr->bp_addr;
+ hw->mask = 0;
+
+ /* Type */
+ switch (attr->bp_type) {
+ case HW_BREAKPOINT_W:
+ hw->type = X86_BREAKPOINT_WRITE;
+ break;
+ case HW_BREAKPOINT_W | HW_BREAKPOINT_R:
+ hw->type = X86_BREAKPOINT_RW;
+ break;
+ case HW_BREAKPOINT_X:
+ /*
+ * We don't allow kernel breakpoints in places that are not
+ * acceptable for kprobes. On non-kprobes kernels, we don't
+ * allow kernel breakpoints at all.
+ */
+ if (attr->bp_addr >= TASK_SIZE_MAX) {
+ if (within_kprobe_blacklist(attr->bp_addr))
+ return -EINVAL;
+ }
+
+ hw->type = X86_BREAKPOINT_EXECUTE;
+ /*
+ * x86 inst breakpoints need to have a specific undefined len.
+ * But we still need to check that userspace is not trying to set up
+ * an unsupported length, to get a range breakpoint for example.
+ */
+ if (attr->bp_len == sizeof(long)) {
+ hw->len = X86_BREAKPOINT_LEN_X;
+ return 0;
+ }
+ fallthrough;
+ default:
+ return -EINVAL;
+ }
/* Len */
- switch (bp->attr.bp_len) {
+ switch (attr->bp_len) {
case HW_BREAKPOINT_LEN_1:
- info->len = X86_BREAKPOINT_LEN_1;
+ hw->len = X86_BREAKPOINT_LEN_1;
break;
case HW_BREAKPOINT_LEN_2:
- info->len = X86_BREAKPOINT_LEN_2;
+ hw->len = X86_BREAKPOINT_LEN_2;
break;
case HW_BREAKPOINT_LEN_4:
- info->len = X86_BREAKPOINT_LEN_4;
+ hw->len = X86_BREAKPOINT_LEN_4;
break;
#ifdef CONFIG_X86_64
case HW_BREAKPOINT_LEN_8:
- info->len = X86_BREAKPOINT_LEN_8;
+ hw->len = X86_BREAKPOINT_LEN_8;
break;
#endif
default:
- return -EINVAL;
- }
+ /* AMD range breakpoint */
+ if (!is_power_of_2(attr->bp_len))
+ return -EINVAL;
+ if (attr->bp_addr & (attr->bp_len - 1))
+ return -EINVAL;
- /* Type */
- switch (bp->attr.bp_type) {
- case HW_BREAKPOINT_W:
- info->type = X86_BREAKPOINT_WRITE;
- break;
- case HW_BREAKPOINT_W | HW_BREAKPOINT_R:
- info->type = X86_BREAKPOINT_RW;
- break;
- case HW_BREAKPOINT_X:
- info->type = X86_BREAKPOINT_EXECUTE;
- break;
- default:
- return -EINVAL;
+ if (!boot_cpu_has(X86_FEATURE_BPEXT))
+ return -EOPNOTSUPP;
+
+ /*
+ * It's impossible to use a range breakpoint to fake out
+ * user vs kernel detection because bp_len - 1 can't
+ * have the high bit set. If we ever allow range instruction
+ * breakpoints, then we'll have to check for kprobe-blacklisted
+ * addresses anywhere in the range.
+ */
+ hw->mask = attr->bp_len - 1;
+ hw->len = X86_BREAKPOINT_LEN_1;
}
return 0;
}
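Worked example for the AMD range-breakpoint path (values assumed for illustration): attr->bp_len = 16 with a 16-byte aligned attr->bp_addr passes both checks, so hw->mask becomes 0xF while hw->len is reported as X86_BREAKPOINT_LEN_1. arch_install_hw_breakpoint() later programs that mask via amd_set_dr_addr_mask(), widening the nominal 1-byte breakpoint to cover the whole 16-byte aligned region.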
+
/*
* Validate the arch-specific HW Breakpoint register settings
*/
-int arch_validate_hwbkpt_settings(struct perf_event *bp)
+int hw_breakpoint_arch_parse(struct perf_event *bp,
+ const struct perf_event_attr *attr,
+ struct arch_hw_breakpoint *hw)
{
- struct arch_hw_breakpoint *info = counter_arch_bp(bp);
unsigned int align;
int ret;
- ret = arch_build_bp_info(bp);
+ ret = arch_build_bp_info(bp, attr, hw);
if (ret)
return ret;
- ret = -EINVAL;
-
- switch (info->len) {
+ switch (hw->len) {
case X86_BREAKPOINT_LEN_1:
align = 0;
+ if (hw->mask)
+ align = hw->mask;
break;
case X86_BREAKPOINT_LEN_2:
align = 1;
@@ -320,56 +450,21 @@ int arch_validate_hwbkpt_settings(struct perf_event *bp)
break;
#endif
default:
- return ret;
+ WARN_ON_ONCE(1);
+ return -EINVAL;
}
/*
* Check that the low-order bits of the address are appropriate
* for the alignment implied by len.
*/
- if (info->address & align)
+ if (hw->address & align)
return -EINVAL;
return 0;
}
/*
- * Dump the debug register contents to the user.
- * We can't dump our per cpu values because it
- * may contain cpu wide breakpoint, something that
- * doesn't belong to the current task.
- *
- * TODO: include non-ptrace user breakpoints (perf)
- */
-void aout_dump_debugregs(struct user *dump)
-{
- int i;
- int dr7 = 0;
- struct perf_event *bp;
- struct arch_hw_breakpoint *info;
- struct thread_struct *thread = &current->thread;
-
- for (i = 0; i < HBP_NUM; i++) {
- bp = thread->ptrace_bps[i];
-
- if (bp && !bp->attr.disabled) {
- dump->u_debugreg[i] = bp->attr.bp_addr;
- info = counter_arch_bp(bp);
- dr7 |= encode_dr7(i, info->len, info->type);
- } else {
- dump->u_debugreg[i] = 0;
- }
- }
-
- dump->u_debugreg[4] = 0;
- dump->u_debugreg[5] = 0;
- dump->u_debugreg[6] = current->thread.debugreg6;
-
- dump->u_debugreg[7] = dr7;
-}
-EXPORT_SYMBOL_GPL(aout_dump_debugregs);
-
-/*
* Release the user breakpoints used by ptrace
*/
void flush_ptrace_hw_breakpoint(struct task_struct *tsk)
@@ -381,18 +476,21 @@ void flush_ptrace_hw_breakpoint(struct task_struct *tsk)
unregister_hw_breakpoint(t->ptrace_bps[i]);
t->ptrace_bps[i] = NULL;
}
+
+ t->virtual_dr6 = 0;
+ t->ptrace_dr7 = 0;
}
void hw_breakpoint_restore(void)
{
- set_debugreg(__get_cpu_var(cpu_debugreg[0]), 0);
- set_debugreg(__get_cpu_var(cpu_debugreg[1]), 1);
- set_debugreg(__get_cpu_var(cpu_debugreg[2]), 2);
- set_debugreg(__get_cpu_var(cpu_debugreg[3]), 3);
- set_debugreg(current->thread.debugreg6, 6);
- set_debugreg(__get_cpu_var(cpu_dr7), 7);
+ set_debugreg(__this_cpu_read(cpu_debugreg[0]), 0);
+ set_debugreg(__this_cpu_read(cpu_debugreg[1]), 1);
+ set_debugreg(__this_cpu_read(cpu_debugreg[2]), 2);
+ set_debugreg(__this_cpu_read(cpu_debugreg[3]), 3);
+ set_debugreg(DR6_RESERVED, 6);
+ set_debugreg(__this_cpu_read(cpu_dr7), 7);
}
-EXPORT_SYMBOL_GPL(hw_breakpoint_restore);
+EXPORT_SYMBOL_FOR_KVM(hw_breakpoint_restore);
/*
* Handle debug exception notifications.
@@ -410,12 +508,13 @@ EXPORT_SYMBOL_GPL(hw_breakpoint_restore);
* NOTIFY_STOP returned for all other cases
*
*/
-static int __kprobes hw_breakpoint_handler(struct die_args *args)
+static int hw_breakpoint_handler(struct die_args *args)
{
- int i, cpu, rc = NOTIFY_STOP;
+ int i, rc = NOTIFY_STOP;
struct perf_event *bp;
- unsigned long dr7, dr6;
unsigned long *dr6_p;
+ unsigned long dr6;
+ bool bpx;
/* The DR6 value is pointed by args->err */
dr6_p = (unsigned long *)ERR_PTR(args->err);
@@ -425,68 +524,61 @@ static int __kprobes hw_breakpoint_handler(struct die_args *args)
if ((dr6 & DR_TRAP_BITS) == 0)
return NOTIFY_DONE;
- get_debugreg(dr7, 7);
- /* Disable breakpoints during exception handling */
- set_debugreg(0UL, 7);
- /*
- * Assert that local interrupts are disabled
- * Reset the DRn bits in the virtualized register value.
- * The ptrace trigger routine will add in whatever is needed.
- */
- current->thread.debugreg6 &= ~DR_TRAP_BITS;
- cpu = get_cpu();
-
/* Handle all the breakpoints that were triggered */
for (i = 0; i < HBP_NUM; ++i) {
if (likely(!(dr6 & (DR_TRAP0 << i))))
continue;
+ bp = this_cpu_read(bp_per_reg[i]);
+ if (!bp)
+ continue;
+
+ bpx = bp->hw.info.type == X86_BREAKPOINT_EXECUTE;
+
/*
- * The counter may be concurrently released but that can only
- * occur from a call_rcu() path. We can then safely fetch
- * the breakpoint, use its callback, touch its counter
- * while we are in an rcu_read_lock() path.
+ * TF and data breakpoints are traps and can be merged; instruction
+ * breakpoints are faults and will be raised separately.
+ *
+ * However, DR6 can indicate both TF and instruction
+ * breakpoints. In that case take TF, as that has precedence, and
+ * delay the instruction breakpoint to the next exception.
*/
- rcu_read_lock();
+ if (bpx && (dr6 & DR_STEP))
+ continue;
- bp = per_cpu(bp_per_reg[i], cpu);
/*
* Reset the 'i'th TRAP bit in dr6 to denote completion of
* exception handling
*/
(*dr6_p) &= ~(DR_TRAP0 << i);
- /*
- * bp can be NULL due to lazy debug register switching
- * or due to concurrent perf counter removing.
- */
- if (!bp) {
- rcu_read_unlock();
- break;
- }
perf_bp_event(bp, args->regs);
- rcu_read_unlock();
+ /*
+ * Set up the resume flag to avoid breakpoint recursion when
+ * returning to the origin.
+ */
+ if (bpx)
+ args->regs->flags |= X86_EFLAGS_RF;
}
+
/*
* Further processing in do_debug() is needed for a) user-space
* breakpoints (to generate signals) and b) when the system has
* taken exception due to multiple causes
*/
- if ((current->thread.debugreg6 & DR_TRAP_BITS) ||
+ if ((current->thread.virtual_dr6 & DR_TRAP_BITS) ||
(dr6 & (~DR_TRAP_BITS)))
rc = NOTIFY_DONE;
- set_debugreg(dr7, 7);
- put_cpu();
-
return rc;
}
/*
* Handle debug exception notifications.
*/
-int __kprobes hw_breakpoint_exceptions_notify(
+int hw_breakpoint_exceptions_notify(
struct notifier_block *unused, unsigned long val, void *data)
{
if (val != DIE_DEBUG)
diff --git a/arch/x86/kernel/i386_ksyms_32.c b/arch/x86/kernel/i386_ksyms_32.c
deleted file mode 100644
index 9c3bd4a2050e..000000000000
--- a/arch/x86/kernel/i386_ksyms_32.c
+++ /dev/null
@@ -1,38 +0,0 @@
-#include <linux/module.h>
-
-#include <asm/checksum.h>
-#include <asm/pgtable.h>
-#include <asm/desc.h>
-#include <asm/ftrace.h>
-
-#ifdef CONFIG_FUNCTION_TRACER
-/* mcount is defined in assembly */
-EXPORT_SYMBOL(mcount);
-#endif
-
-/*
- * Note, this is a prototype to get at the symbol for
- * the export, but dont use it from C code, it is used
- * by assembly code and is not using C calling convention!
- */
-#ifndef CONFIG_X86_CMPXCHG64
-extern void cmpxchg8b_emu(void);
-EXPORT_SYMBOL(cmpxchg8b_emu);
-#endif
-
-/* Networking helper routines. */
-EXPORT_SYMBOL(csum_partial_copy_generic);
-
-EXPORT_SYMBOL(__get_user_1);
-EXPORT_SYMBOL(__get_user_2);
-EXPORT_SYMBOL(__get_user_4);
-
-EXPORT_SYMBOL(__put_user_1);
-EXPORT_SYMBOL(__put_user_2);
-EXPORT_SYMBOL(__put_user_4);
-EXPORT_SYMBOL(__put_user_8);
-
-EXPORT_SYMBOL(strstr);
-
-EXPORT_SYMBOL(csum_partial);
-EXPORT_SYMBOL(empty_zero_page);
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
deleted file mode 100644
index 86cef6b32253..000000000000
--- a/arch/x86/kernel/i387.c
+++ /dev/null
@@ -1,730 +0,0 @@
-/*
- * Copyright (C) 1994 Linus Torvalds
- *
- * Pentium III FXSR, SSE support
- * General FPU state handling cleanups
- * Gareth Hughes <gareth@valinux.com>, May 2000
- */
-#include <linux/module.h>
-#include <linux/regset.h>
-#include <linux/sched.h>
-#include <linux/slab.h>
-
-#include <asm/sigcontext.h>
-#include <asm/processor.h>
-#include <asm/math_emu.h>
-#include <asm/uaccess.h>
-#include <asm/ptrace.h>
-#include <asm/i387.h>
-#include <asm/user.h>
-
-#ifdef CONFIG_X86_64
-# include <asm/sigcontext32.h>
-# include <asm/user32.h>
-#else
-# define save_i387_xstate_ia32 save_i387_xstate
-# define restore_i387_xstate_ia32 restore_i387_xstate
-# define _fpstate_ia32 _fpstate
-# define _xstate_ia32 _xstate
-# define sig_xstate_ia32_size sig_xstate_size
-# define fx_sw_reserved_ia32 fx_sw_reserved
-# define user_i387_ia32_struct user_i387_struct
-# define user32_fxsr_struct user_fxsr_struct
-#endif
-
-#ifdef CONFIG_MATH_EMULATION
-# define HAVE_HWFP (boot_cpu_data.hard_math)
-#else
-# define HAVE_HWFP 1
-#endif
-
-static unsigned int mxcsr_feature_mask __read_mostly = 0xffffffffu;
-unsigned int xstate_size;
-unsigned int sig_xstate_ia32_size = sizeof(struct _fpstate_ia32);
-static struct i387_fxsave_struct fx_scratch __cpuinitdata;
-
-void __cpuinit mxcsr_feature_mask_init(void)
-{
- unsigned long mask = 0;
-
- clts();
- if (cpu_has_fxsr) {
- memset(&fx_scratch, 0, sizeof(struct i387_fxsave_struct));
- asm volatile("fxsave %0" : : "m" (fx_scratch));
- mask = fx_scratch.mxcsr_mask;
- if (mask == 0)
- mask = 0x0000ffbf;
- }
- mxcsr_feature_mask &= mask;
- stts();
-}
-
-void __cpuinit init_thread_xstate(void)
-{
- if (!HAVE_HWFP) {
- xstate_size = sizeof(struct i387_soft_struct);
- return;
- }
-
- if (cpu_has_xsave) {
- xsave_cntxt_init();
- return;
- }
-
- if (cpu_has_fxsr)
- xstate_size = sizeof(struct i387_fxsave_struct);
-#ifdef CONFIG_X86_32
- else
- xstate_size = sizeof(struct i387_fsave_struct);
-#endif
-}
-
-#ifdef CONFIG_X86_64
-/*
- * Called at bootup to set up the initial FPU state that is later cloned
- * into all processes.
- */
-void __cpuinit fpu_init(void)
-{
- unsigned long oldcr0 = read_cr0();
-
- set_in_cr4(X86_CR4_OSFXSR);
- set_in_cr4(X86_CR4_OSXMMEXCPT);
-
- write_cr0(oldcr0 & ~(X86_CR0_TS|X86_CR0_EM)); /* clear TS and EM */
-
- /*
- * Boot processor to setup the FP and extended state context info.
- */
- if (!smp_processor_id())
- init_thread_xstate();
- xsave_init();
-
- mxcsr_feature_mask_init();
- /* clean state in init */
- current_thread_info()->status = 0;
- clear_used_math();
-}
-#endif /* CONFIG_X86_64 */
-
-static void fpu_finit(struct fpu *fpu)
-{
-#ifdef CONFIG_X86_32
- if (!HAVE_HWFP) {
- finit_soft_fpu(&fpu->state->soft);
- return;
- }
-#endif
-
- if (cpu_has_fxsr) {
- struct i387_fxsave_struct *fx = &fpu->state->fxsave;
-
- memset(fx, 0, xstate_size);
- fx->cwd = 0x37f;
- if (cpu_has_xmm)
- fx->mxcsr = MXCSR_DEFAULT;
- } else {
- struct i387_fsave_struct *fp = &fpu->state->fsave;
- memset(fp, 0, xstate_size);
- fp->cwd = 0xffff037fu;
- fp->swd = 0xffff0000u;
- fp->twd = 0xffffffffu;
- fp->fos = 0xffff0000u;
- }
-}
-
-/*
- * The _current_ task is using the FPU for the first time
- * so initialize it and set the mxcsr to its default
- * value at reset if we support XMM instructions and then
- * remeber the current task has used the FPU.
- */
-int init_fpu(struct task_struct *tsk)
-{
- int ret;
-
- if (tsk_used_math(tsk)) {
- if (HAVE_HWFP && tsk == current)
- unlazy_fpu(tsk);
- return 0;
- }
-
- /*
- * Memory allocation at the first usage of the FPU and other state.
- */
- ret = fpu_alloc(&tsk->thread.fpu);
- if (ret)
- return ret;
-
- fpu_finit(&tsk->thread.fpu);
-
- set_stopped_child_used_math(tsk);
- return 0;
-}
-
-/*
- * The xstateregs_active() routine is the same as the fpregs_active() routine,
- * as the "regset->n" for the xstate regset will be updated based on the feature
- * capabilites supported by the xsave.
- */
-int fpregs_active(struct task_struct *target, const struct user_regset *regset)
-{
- return tsk_used_math(target) ? regset->n : 0;
-}
-
-int xfpregs_active(struct task_struct *target, const struct user_regset *regset)
-{
- return (cpu_has_fxsr && tsk_used_math(target)) ? regset->n : 0;
-}
-
-int xfpregs_get(struct task_struct *target, const struct user_regset *regset,
- unsigned int pos, unsigned int count,
- void *kbuf, void __user *ubuf)
-{
- int ret;
-
- if (!cpu_has_fxsr)
- return -ENODEV;
-
- ret = init_fpu(target);
- if (ret)
- return ret;
-
- return user_regset_copyout(&pos, &count, &kbuf, &ubuf,
- &target->thread.fpu.state->fxsave, 0, -1);
-}
-
-int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
- unsigned int pos, unsigned int count,
- const void *kbuf, const void __user *ubuf)
-{
- int ret;
-
- if (!cpu_has_fxsr)
- return -ENODEV;
-
- ret = init_fpu(target);
- if (ret)
- return ret;
-
- ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
- &target->thread.fpu.state->fxsave, 0, -1);
-
- /*
- * mxcsr reserved bits must be masked to zero for security reasons.
- */
- target->thread.fpu.state->fxsave.mxcsr &= mxcsr_feature_mask;
-
- /*
- * update the header bits in the xsave header, indicating the
- * presence of FP and SSE state.
- */
- if (cpu_has_xsave)
- target->thread.fpu.state->xsave.xsave_hdr.xstate_bv |= XSTATE_FPSSE;
-
- return ret;
-}
-
-int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
- unsigned int pos, unsigned int count,
- void *kbuf, void __user *ubuf)
-{
- int ret;
-
- if (!cpu_has_xsave)
- return -ENODEV;
-
- ret = init_fpu(target);
- if (ret)
- return ret;
-
- /*
- * Copy the 48bytes defined by the software first into the xstate
- * memory layout in the thread struct, so that we can copy the entire
- * xstateregs to the user using one user_regset_copyout().
- */
- memcpy(&target->thread.fpu.state->fxsave.sw_reserved,
- xstate_fx_sw_bytes, sizeof(xstate_fx_sw_bytes));
-
- /*
- * Copy the xstate memory layout.
- */
- ret = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
- &target->thread.fpu.state->xsave, 0, -1);
- return ret;
-}
-
-int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
- unsigned int pos, unsigned int count,
- const void *kbuf, const void __user *ubuf)
-{
- int ret;
- struct xsave_hdr_struct *xsave_hdr;
-
- if (!cpu_has_xsave)
- return -ENODEV;
-
- ret = init_fpu(target);
- if (ret)
- return ret;
-
- ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
- &target->thread.fpu.state->xsave, 0, -1);
-
- /*
- * mxcsr reserved bits must be masked to zero for security reasons.
- */
- target->thread.fpu.state->fxsave.mxcsr &= mxcsr_feature_mask;
-
- xsave_hdr = &target->thread.fpu.state->xsave.xsave_hdr;
-
- xsave_hdr->xstate_bv &= pcntxt_mask;
- /*
- * These bits must be zero.
- */
- xsave_hdr->reserved1[0] = xsave_hdr->reserved1[1] = 0;
-
- return ret;
-}
-
-#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
-
-/*
- * FPU tag word conversions.
- */
-
-static inline unsigned short twd_i387_to_fxsr(unsigned short twd)
-{
- unsigned int tmp; /* to avoid 16 bit prefixes in the code */
-
- /* Transform each pair of bits into 01 (valid) or 00 (empty) */
- tmp = ~twd;
- tmp = (tmp | (tmp>>1)) & 0x5555; /* 0V0V0V0V0V0V0V0V */
- /* and move the valid bits to the lower byte. */
- tmp = (tmp | (tmp >> 1)) & 0x3333; /* 00VV00VV00VV00VV */
- tmp = (tmp | (tmp >> 2)) & 0x0f0f; /* 0000VVVV0000VVVV */
- tmp = (tmp | (tmp >> 4)) & 0x00ff; /* 00000000VVVVVVVV */
-
- return tmp;
-}
-
-#define FPREG_ADDR(f, n) ((void *)&(f)->st_space + (n) * 16);
-#define FP_EXP_TAG_VALID 0
-#define FP_EXP_TAG_ZERO 1
-#define FP_EXP_TAG_SPECIAL 2
-#define FP_EXP_TAG_EMPTY 3
-
-static inline u32 twd_fxsr_to_i387(struct i387_fxsave_struct *fxsave)
-{
- struct _fpxreg *st;
- u32 tos = (fxsave->swd >> 11) & 7;
- u32 twd = (unsigned long) fxsave->twd;
- u32 tag;
- u32 ret = 0xffff0000u;
- int i;
-
- for (i = 0; i < 8; i++, twd >>= 1) {
- if (twd & 0x1) {
- st = FPREG_ADDR(fxsave, (i - tos) & 7);
-
- switch (st->exponent & 0x7fff) {
- case 0x7fff:
- tag = FP_EXP_TAG_SPECIAL;
- break;
- case 0x0000:
- if (!st->significand[0] &&
- !st->significand[1] &&
- !st->significand[2] &&
- !st->significand[3])
- tag = FP_EXP_TAG_ZERO;
- else
- tag = FP_EXP_TAG_SPECIAL;
- break;
- default:
- if (st->significand[3] & 0x8000)
- tag = FP_EXP_TAG_VALID;
- else
- tag = FP_EXP_TAG_SPECIAL;
- break;
- }
- } else {
- tag = FP_EXP_TAG_EMPTY;
- }
- ret |= tag << (2 * i);
- }
- return ret;
-}
-
-/*
- * FXSR floating point environment conversions.
- */
-
-static void
-convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk)
-{
- struct i387_fxsave_struct *fxsave = &tsk->thread.fpu.state->fxsave;
- struct _fpreg *to = (struct _fpreg *) &env->st_space[0];
- struct _fpxreg *from = (struct _fpxreg *) &fxsave->st_space[0];
- int i;
-
- env->cwd = fxsave->cwd | 0xffff0000u;
- env->swd = fxsave->swd | 0xffff0000u;
- env->twd = twd_fxsr_to_i387(fxsave);
-
-#ifdef CONFIG_X86_64
- env->fip = fxsave->rip;
- env->foo = fxsave->rdp;
- if (tsk == current) {
- /*
- * should be actually ds/cs at fpu exception time, but
- * that information is not available in 64bit mode.
- */
- asm("mov %%ds, %[fos]" : [fos] "=r" (env->fos));
- asm("mov %%cs, %[fcs]" : [fcs] "=r" (env->fcs));
- } else {
- struct pt_regs *regs = task_pt_regs(tsk);
-
- env->fos = 0xffff0000 | tsk->thread.ds;
- env->fcs = regs->cs;
- }
-#else
- env->fip = fxsave->fip;
- env->fcs = (u16) fxsave->fcs | ((u32) fxsave->fop << 16);
- env->foo = fxsave->foo;
- env->fos = fxsave->fos;
-#endif
-
- for (i = 0; i < 8; ++i)
- memcpy(&to[i], &from[i], sizeof(to[0]));
-}
-
-static void convert_to_fxsr(struct task_struct *tsk,
- const struct user_i387_ia32_struct *env)
-
-{
- struct i387_fxsave_struct *fxsave = &tsk->thread.fpu.state->fxsave;
- struct _fpreg *from = (struct _fpreg *) &env->st_space[0];
- struct _fpxreg *to = (struct _fpxreg *) &fxsave->st_space[0];
- int i;
-
- fxsave->cwd = env->cwd;
- fxsave->swd = env->swd;
- fxsave->twd = twd_i387_to_fxsr(env->twd);
- fxsave->fop = (u16) ((u32) env->fcs >> 16);
-#ifdef CONFIG_X86_64
- fxsave->rip = env->fip;
- fxsave->rdp = env->foo;
- /* cs and ds ignored */
-#else
- fxsave->fip = env->fip;
- fxsave->fcs = (env->fcs & 0xffff);
- fxsave->foo = env->foo;
- fxsave->fos = env->fos;
-#endif
-
- for (i = 0; i < 8; ++i)
- memcpy(&to[i], &from[i], sizeof(from[0]));
-}
-
-int fpregs_get(struct task_struct *target, const struct user_regset *regset,
- unsigned int pos, unsigned int count,
- void *kbuf, void __user *ubuf)
-{
- struct user_i387_ia32_struct env;
- int ret;
-
- ret = init_fpu(target);
- if (ret)
- return ret;
-
- if (!HAVE_HWFP)
- return fpregs_soft_get(target, regset, pos, count, kbuf, ubuf);
-
- if (!cpu_has_fxsr) {
- return user_regset_copyout(&pos, &count, &kbuf, &ubuf,
- &target->thread.fpu.state->fsave, 0,
- -1);
- }
-
- if (kbuf && pos == 0 && count == sizeof(env)) {
- convert_from_fxsr(kbuf, target);
- return 0;
- }
-
- convert_from_fxsr(&env, target);
-
- return user_regset_copyout(&pos, &count, &kbuf, &ubuf, &env, 0, -1);
-}
-
-int fpregs_set(struct task_struct *target, const struct user_regset *regset,
- unsigned int pos, unsigned int count,
- const void *kbuf, const void __user *ubuf)
-{
- struct user_i387_ia32_struct env;
- int ret;
-
- ret = init_fpu(target);
- if (ret)
- return ret;
-
- if (!HAVE_HWFP)
- return fpregs_soft_set(target, regset, pos, count, kbuf, ubuf);
-
- if (!cpu_has_fxsr) {
- return user_regset_copyin(&pos, &count, &kbuf, &ubuf,
- &target->thread.fpu.state->fsave, 0, -1);
- }
-
- if (pos > 0 || count < sizeof(env))
- convert_from_fxsr(&env, target);
-
- ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &env, 0, -1);
- if (!ret)
- convert_to_fxsr(target, &env);
-
- /*
- * update the header bit in the xsave header, indicating the
- * presence of FP.
- */
- if (cpu_has_xsave)
- target->thread.fpu.state->xsave.xsave_hdr.xstate_bv |= XSTATE_FP;
- return ret;
-}
-
-/*
- * Signal frame handlers.
- */
-
-static inline int save_i387_fsave(struct _fpstate_ia32 __user *buf)
-{
- struct task_struct *tsk = current;
- struct i387_fsave_struct *fp = &tsk->thread.fpu.state->fsave;
-
- fp->status = fp->swd;
- if (__copy_to_user(buf, fp, sizeof(struct i387_fsave_struct)))
- return -1;
- return 1;
-}
-
-static int save_i387_fxsave(struct _fpstate_ia32 __user *buf)
-{
- struct task_struct *tsk = current;
- struct i387_fxsave_struct *fx = &tsk->thread.fpu.state->fxsave;
- struct user_i387_ia32_struct env;
- int err = 0;
-
- convert_from_fxsr(&env, tsk);
- if (__copy_to_user(buf, &env, sizeof(env)))
- return -1;
-
- err |= __put_user(fx->swd, &buf->status);
- err |= __put_user(X86_FXSR_MAGIC, &buf->magic);
- if (err)
- return -1;
-
- if (__copy_to_user(&buf->_fxsr_env[0], fx, xstate_size))
- return -1;
- return 1;
-}
-
-static int save_i387_xsave(void __user *buf)
-{
- struct task_struct *tsk = current;
- struct _fpstate_ia32 __user *fx = buf;
- int err = 0;
-
- /*
- * For legacy compatible, we always set FP/SSE bits in the bit
- * vector while saving the state to the user context.
- * This will enable us capturing any changes(during sigreturn) to
- * the FP/SSE bits by the legacy applications which don't touch
- * xstate_bv in the xsave header.
- *
- * xsave aware applications can change the xstate_bv in the xsave
- * header as well as change any contents in the memory layout.
- * xrestore as part of sigreturn will capture all the changes.
- */
- tsk->thread.fpu.state->xsave.xsave_hdr.xstate_bv |= XSTATE_FPSSE;
-
- if (save_i387_fxsave(fx) < 0)
- return -1;
-
- err = __copy_to_user(&fx->sw_reserved, &fx_sw_reserved_ia32,
- sizeof(struct _fpx_sw_bytes));
- err |= __put_user(FP_XSTATE_MAGIC2,
- (__u32 __user *) (buf + sig_xstate_ia32_size
- - FP_XSTATE_MAGIC2_SIZE));
- if (err)
- return -1;
-
- return 1;
-}
-
-int save_i387_xstate_ia32(void __user *buf)
-{
- struct _fpstate_ia32 __user *fp = (struct _fpstate_ia32 __user *) buf;
- struct task_struct *tsk = current;
-
- if (!used_math())
- return 0;
-
- if (!access_ok(VERIFY_WRITE, buf, sig_xstate_ia32_size))
- return -EACCES;
- /*
- * This will cause a "finit" to be triggered by the next
- * attempted FPU operation by the 'current' process.
- */
- clear_used_math();
-
- if (!HAVE_HWFP) {
- return fpregs_soft_get(current, NULL,
- 0, sizeof(struct user_i387_ia32_struct),
- NULL, fp) ? -1 : 1;
- }
-
- unlazy_fpu(tsk);
-
- if (cpu_has_xsave)
- return save_i387_xsave(fp);
- if (cpu_has_fxsr)
- return save_i387_fxsave(fp);
- else
- return save_i387_fsave(fp);
-}
-
-static inline int restore_i387_fsave(struct _fpstate_ia32 __user *buf)
-{
- struct task_struct *tsk = current;
-
- return __copy_from_user(&tsk->thread.fpu.state->fsave, buf,
- sizeof(struct i387_fsave_struct));
-}
-
-static int restore_i387_fxsave(struct _fpstate_ia32 __user *buf,
- unsigned int size)
-{
- struct task_struct *tsk = current;
- struct user_i387_ia32_struct env;
- int err;
-
- err = __copy_from_user(&tsk->thread.fpu.state->fxsave, &buf->_fxsr_env[0],
- size);
- /* mxcsr reserved bits must be masked to zero for security reasons */
- tsk->thread.fpu.state->fxsave.mxcsr &= mxcsr_feature_mask;
- if (err || __copy_from_user(&env, buf, sizeof(env)))
- return 1;
- convert_to_fxsr(tsk, &env);
-
- return 0;
-}
-
-static int restore_i387_xsave(void __user *buf)
-{
- struct _fpx_sw_bytes fx_sw_user;
- struct _fpstate_ia32 __user *fx_user =
- ((struct _fpstate_ia32 __user *) buf);
- struct i387_fxsave_struct __user *fx =
- (struct i387_fxsave_struct __user *) &fx_user->_fxsr_env[0];
- struct xsave_hdr_struct *xsave_hdr =
- &current->thread.fpu.state->xsave.xsave_hdr;
- u64 mask;
- int err;
-
- if (check_for_xstate(fx, buf, &fx_sw_user))
- goto fx_only;
-
- mask = fx_sw_user.xstate_bv;
-
- err = restore_i387_fxsave(buf, fx_sw_user.xstate_size);
-
- xsave_hdr->xstate_bv &= pcntxt_mask;
- /*
- * These bits must be zero.
- */
- xsave_hdr->reserved1[0] = xsave_hdr->reserved1[1] = 0;
-
- /*
- * Init the state that is not present in the memory layout
- * and enabled by the OS.
- */
- mask = ~(pcntxt_mask & ~mask);
- xsave_hdr->xstate_bv &= mask;
-
- return err;
-fx_only:
- /*
- * Couldn't find the extended state information in the memory
- * layout. Restore the FP/SSE and init the other extended state
- * enabled by the OS.
- */
- xsave_hdr->xstate_bv = XSTATE_FPSSE;
- return restore_i387_fxsave(buf, sizeof(struct i387_fxsave_struct));
-}
-
-int restore_i387_xstate_ia32(void __user *buf)
-{
- int err;
- struct task_struct *tsk = current;
- struct _fpstate_ia32 __user *fp = (struct _fpstate_ia32 __user *) buf;
-
- if (HAVE_HWFP)
- clear_fpu(tsk);
-
- if (!buf) {
- if (used_math()) {
- clear_fpu(tsk);
- clear_used_math();
- }
-
- return 0;
- } else
- if (!access_ok(VERIFY_READ, buf, sig_xstate_ia32_size))
- return -EACCES;
-
- if (!used_math()) {
- err = init_fpu(tsk);
- if (err)
- return err;
- }
-
- if (HAVE_HWFP) {
- if (cpu_has_xsave)
- err = restore_i387_xsave(buf);
- else if (cpu_has_fxsr)
- err = restore_i387_fxsave(fp, sizeof(struct
- i387_fxsave_struct));
- else
- err = restore_i387_fsave(fp);
- } else {
- err = fpregs_soft_set(current, NULL,
- 0, sizeof(struct user_i387_ia32_struct),
- NULL, fp) != 0;
- }
- set_used_math();
-
- return err;
-}
-
-/*
- * FPU state for core dumps.
- * This is only used for a.out dumps now.
- * It is declared generically using elf_fpregset_t (which is
- * struct user_i387_struct) but is in fact only used for 32-bit
- * dumps, so on 64-bit it is really struct user_i387_ia32_struct.
- */
-int dump_fpu(struct pt_regs *regs, struct user_i387_struct *fpu)
-{
- struct task_struct *tsk = current;
- int fpvalid;
-
- fpvalid = !!used_math();
- if (fpvalid)
- fpvalid = !fpregs_get(tsk, NULL,
- 0, sizeof(struct user_i387_ia32_struct),
- fpu, NULL);
-
- return fpvalid;
-}
-EXPORT_SYMBOL(dump_fpu);
-
-#endif /* CONFIG_X86_32 || CONFIG_IA32_EMULATION */
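The removed restore_i387_fxsave() above masks the user-supplied MXCSR against
mxcsr_feature_mask before it can ever reach FXRSTOR, since loading reserved
MXCSR bits raises #GP. A minimal user-space sketch of the same sanitization,
assuming the architectural default mask 0xffbf (used when the FXSAVE image
reports a zero mask field):

#include <stdint.h>
#include <stdio.h>

/* Architectural default when the FXSAVE image reports a zero MXCSR mask. */
#define MXCSR_DEFAULT_MASK	0x0000ffbfu

static uint32_t sanitize_mxcsr(uint32_t mxcsr)
{
	/* Reserved bits must be cleared; loading them would raise #GP. */
	return mxcsr & MXCSR_DEFAULT_MASK;
}

int main(void)
{
	uint32_t from_frame = 0xdead1f80u;	/* hypothetical signal-frame value */

	printf("sanitized mxcsr: %#x\n", sanitize_mxcsr(from_frame));
	return 0;
}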
diff --git a/arch/x86/kernel/i8237.c b/arch/x86/kernel/i8237.c
index b42ca694dc68..896d46b44284 100644
--- a/arch/x86/kernel/i8237.c
+++ b/arch/x86/kernel/i8237.c
@@ -1,18 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/*
* 8237A DMA controller suspend functions.
*
* Written by Pierre Ossman, 2005.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or (at
- * your option) any later version.
*/
+#include <linux/dmi.h>
#include <linux/init.h>
-#include <linux/sysdev.h>
+#include <linux/syscore_ops.h>
#include <asm/dma.h>
+#include <asm/x86_init.h>
/*
* This module just handles suspend/resume issues with the
@@ -21,7 +19,7 @@
* in asm/dma.h.
*/
-static int i8237A_resume(struct sys_device *dev)
+static void i8237A_resume(void)
{
unsigned long flags;
int i;
@@ -41,31 +39,42 @@ static int i8237A_resume(struct sys_device *dev)
enable_dma(4);
release_dma_lock(flags);
-
- return 0;
-}
-
-static int i8237A_suspend(struct sys_device *dev, pm_message_t state)
-{
- return 0;
}
-static struct sysdev_class i8237_sysdev_class = {
- .name = "i8237",
- .suspend = i8237A_suspend,
+static struct syscore_ops i8237_syscore_ops = {
.resume = i8237A_resume,
};
-static struct sys_device device_i8237A = {
-	.id	= 0,
-	.cls	= &i8237_sysdev_class,
-};
-static int __init i8237A_init_sysfs(void)
+static int __init i8237A_init_ops(void)
{
- int error = sysdev_class_register(&i8237_sysdev_class);
- if (!error)
- error = sysdev_register(&device_i8237A);
- return error;
+ /*
+ * From SKL PCH onwards, the legacy DMA device is removed in which the
+ * I/O ports (81h-83h, 87h, 89h-8Bh, 8Fh) related to it are removed
+	 * as well. All removed ports must return 0xff for an inb() request.
+ *
+ * Note: DMA_PAGE_2 (port 0x81) should not be checked for detecting
+ * the presence of DMA device since it may be used by BIOS to decode
+ * LPC traffic for POST codes. Original LPC only decodes one byte of
+ * port 0x80 but some BIOS may choose to enhance PCH LPC port 0x8x
+ * decoding.
+ */
+ if (dma_inb(DMA_PAGE_0) == 0xFF)
+ return -ENODEV;
+
+ /*
+	 * It is not required to load this driver as newer SoCs may not
+ * support 8237 DMA or bus mastering from LPC. Platform firmware
+ * must announce the support for such legacy devices via
+ * ACPI_FADT_LEGACY_DEVICES field in FADT table.
+ */
+ if (x86_pnpbios_disabled() && dmi_get_bios_year() >= 2017)
+ return -ENODEV;
+
+	register_syscore_ops(&i8237_syscore_ops);
+ return 0;
}
-device_initcall(i8237A_init_sysfs);
+device_initcall(i8237A_init_ops);
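The probe in i8237A_init_ops() relies on removed legacy I/O ports reading back
0xff. A hedged user-space analogue of that detection (the port number matches
DMA_PAGE_0 from asm/dma.h; needs root for ioperm() and x86 glibc's sys/io.h):

#include <stdio.h>
#include <sys/io.h>

#define DMA_PAGE_0_PORT 0x87	/* page register for DMA channel 0 */

int main(void)
{
	if (ioperm(DMA_PAGE_0_PORT, 1, 1)) {
		perror("ioperm");
		return 1;
	}
	if (inb(DMA_PAGE_0_PORT) == 0xff)
		puts("port floats high: legacy DMA device likely absent");
	else
		puts("legacy 8237 DMA page register responds");
	return 0;
}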
diff --git a/arch/x86/kernel/i8253.c b/arch/x86/kernel/i8253.c
index 2dfd31597443..cb9852ad6098 100644
--- a/arch/x86/kernel/i8253.c
+++ b/arch/x86/kernel/i8253.c
@@ -1,24 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* 8253/PIT functions
*
*/
#include <linux/clockchips.h>
-#include <linux/interrupt.h>
-#include <linux/spinlock.h>
-#include <linux/jiffies.h>
-#include <linux/module.h>
-#include <linux/timex.h>
-#include <linux/delay.h>
#include <linux/init.h>
-#include <linux/io.h>
+#include <linux/timex.h>
+#include <linux/i8253.h>
-#include <asm/i8253.h>
+#include <asm/hypervisor.h>
+#include <asm/apic.h>
#include <asm/hpet.h>
+#include <asm/time.h>
#include <asm/smp.h>
-DEFINE_RAW_SPINLOCK(i8253_lock);
-EXPORT_SYMBOL(i8253_lock);
-
/*
* HPET replaces the PIT, when enabled. So we need to know, which of
* the two timers is used
@@ -26,172 +21,41 @@ EXPORT_SYMBOL(i8253_lock);
struct clock_event_device *global_clock_event;
/*
- * Initialize the PIT timer.
- *
- * This is also called after resume to bring the PIT into operation again.
- */
-static void init_pit_timer(enum clock_event_mode mode,
- struct clock_event_device *evt)
-{
- raw_spin_lock(&i8253_lock);
-
- switch (mode) {
- case CLOCK_EVT_MODE_PERIODIC:
- /* binary, mode 2, LSB/MSB, ch 0 */
- outb_pit(0x34, PIT_MODE);
- outb_pit(LATCH & 0xff , PIT_CH0); /* LSB */
- outb_pit(LATCH >> 8 , PIT_CH0); /* MSB */
- break;
-
- case CLOCK_EVT_MODE_SHUTDOWN:
- case CLOCK_EVT_MODE_UNUSED:
- if (evt->mode == CLOCK_EVT_MODE_PERIODIC ||
- evt->mode == CLOCK_EVT_MODE_ONESHOT) {
- outb_pit(0x30, PIT_MODE);
- outb_pit(0, PIT_CH0);
- outb_pit(0, PIT_CH0);
- }
- break;
-
- case CLOCK_EVT_MODE_ONESHOT:
- /* One shot setup */
- outb_pit(0x38, PIT_MODE);
- break;
-
- case CLOCK_EVT_MODE_RESUME:
- /* Nothing to do here */
- break;
- }
- raw_spin_unlock(&i8253_lock);
-}
-
-/*
- * Program the next event in oneshot mode
- *
- * Delta is given in PIT ticks
- */
-static int pit_next_event(unsigned long delta, struct clock_event_device *evt)
-{
- raw_spin_lock(&i8253_lock);
- outb_pit(delta & 0xff , PIT_CH0); /* LSB */
- outb_pit(delta >> 8 , PIT_CH0); /* MSB */
- raw_spin_unlock(&i8253_lock);
-
- return 0;
-}
-
-/*
- * On UP the PIT can serve all of the possible timer functions. On SMP systems
- * it can be solely used for the global tick.
+ * Modern chipsets can disable the PIT clock which makes it unusable. It
+ * would be possible to enable the clock but the registers are chipset
+ * specific and not discoverable. Avoid the whack-a-mole game.
*
- * The profiling and update capabilities are switched off once the local apic is
- * registered. This mechanism replaces the previous #ifdef LOCAL_APIC -
- * !using_apic_timer decisions in do_timer_interrupt_hook()
- */
-static struct clock_event_device pit_ce = {
- .name = "pit",
- .features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,
- .set_mode = init_pit_timer,
- .set_next_event = pit_next_event,
- .shift = 32,
- .irq = 0,
-};
-
-/*
- * Initialize the conversion factor and the min/max deltas of the clock event
- * structure and register the clock event source with the framework.
+ * These platforms have discoverable TSC/CPU frequencies but this also
+ * requires knowing the local APIC timer frequency, as it is normally
+ * calibrated against the PIT interrupt.
*/
-void __init setup_pit_timer(void)
+static bool __init use_pit(void)
{
- /*
- * Start pit with the boot cpu mask and make it global after the
- * IO_APIC has been initialized.
- */
- pit_ce.cpumask = cpumask_of(smp_processor_id());
- pit_ce.mult = div_sc(CLOCK_TICK_RATE, NSEC_PER_SEC, pit_ce.shift);
- pit_ce.max_delta_ns = clockevent_delta2ns(0x7FFF, &pit_ce);
- pit_ce.min_delta_ns = clockevent_delta2ns(0xF, &pit_ce);
+ if (!IS_ENABLED(CONFIG_X86_TSC) || !boot_cpu_has(X86_FEATURE_TSC))
+ return true;
- clockevents_register_device(&pit_ce);
- global_clock_event = &pit_ce;
+ /* This also returns true when APIC is disabled */
+ return apic_needs_pit();
}
-#ifndef CONFIG_X86_64
-/*
- * Since the PIT overflows every tick, its not very useful
- * to just read by itself. So use jiffies to emulate a free
- * running counter:
- */
-static cycle_t pit_read(struct clocksource *cs)
+bool __init pit_timer_init(void)
{
- static int old_count;
- static u32 old_jifs;
- unsigned long flags;
- int count;
- u32 jifs;
-
- raw_spin_lock_irqsave(&i8253_lock, flags);
- /*
- * Although our caller may have the read side of xtime_lock,
- * this is now a seqlock, and we are cheating in this routine
- * by having side effects on state that we cannot undo if
- * there is a collision on the seqlock and our caller has to
- * retry. (Namely, old_jifs and old_count.) So we must treat
- * jiffies as volatile despite the lock. We read jiffies
- * before latching the timer count to guarantee that although
- * the jiffies value might be older than the count (that is,
- * the counter may underflow between the last point where
- * jiffies was incremented and the point where we latch the
- * count), it cannot be newer.
- */
- jifs = jiffies;
- outb_pit(0x00, PIT_MODE); /* latch the count ASAP */
- count = inb_pit(PIT_CH0); /* read the latched count */
- count |= inb_pit(PIT_CH0) << 8;
-
- /* VIA686a test code... reset the latch if count > max + 1 */
- if (count > LATCH) {
- outb_pit(0x34, PIT_MODE);
- outb_pit(LATCH & 0xff, PIT_CH0);
- outb_pit(LATCH >> 8, PIT_CH0);
- count = LATCH - 1;
+ if (!use_pit()) {
+ /*
+ * Don't just ignore the PIT. Ensure it's stopped, because
+ * VMMs otherwise steal CPU time just to pointlessly waggle
+ * the (masked) IRQ.
+ */
+ scoped_guard(irq)
+ clockevent_i8253_disable();
+ return false;
}
-
- /*
- * It's possible for count to appear to go the wrong way for a
- * couple of reasons:
- *
- * 1. The timer counter underflows, but we haven't handled the
- * resulting interrupt and incremented jiffies yet.
- * 2. Hardware problem with the timer, not giving us continuous time,
- * the counter does small "jumps" upwards on some Pentium systems,
- * (see c't 95/10 page 335 for Neptun bug.)
- *
- * Previous attempts to handle these cases intelligently were
- * buggy, so we just do the simple thing now.
- */
- if (count > old_count && jifs == old_jifs)
- count = old_count;
-
- old_count = count;
- old_jifs = jifs;
-
- raw_spin_unlock_irqrestore(&i8253_lock, flags);
-
- count = (LATCH - 1) - count;
-
- return (cycle_t)(jifs * LATCH) + count;
+ clockevent_i8253_init(true);
+ global_clock_event = &i8253_clockevent;
+ return true;
}
-static struct clocksource pit_cs = {
- .name = "pit",
- .rating = 110,
- .read = pit_read,
- .mask = CLOCKSOURCE_MASK(32),
- .mult = 0,
- .shift = 20,
-};
-
+#ifndef CONFIG_X86_64
static int __init init_pit_clocksource(void)
{
/*
@@ -202,13 +66,10 @@ static int __init init_pit_clocksource(void)
* - when local APIC timer is active (PIT is switched off)
*/
if (num_possible_cpus() > 1 || is_hpet_enabled() ||
- pit_ce.mode != CLOCK_EVT_MODE_PERIODIC)
+ !clockevent_state_periodic(&i8253_clockevent))
return 0;
- pit_cs.mult = clocksource_hz2mult(CLOCK_TICK_RATE, pit_cs.shift);
-
- return clocksource_register(&pit_cs);
+ return clocksource_i8253_init();
}
arch_initcall(init_pit_clocksource);
-
#endif /* !CONFIG_X86_64 */
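For reference, the periodic setup that the deleted init_pit_timer() performed,
and which the shared i8253 clockevent driver now carries, boils down to three
port writes. A user-space sketch with an assumed tick rate, illustration only
(actually poking the PIT would fight the running kernel for the hardware):

#include <stdio.h>
#include <sys/io.h>

#define PIT_MODE 0x43
#define PIT_CH0  0x40
#define PIT_FREQ 1193182u	/* PIT input clock in Hz */
#define HZ_DEMO  250u		/* assumed tick rate for the example */
#define LATCH    ((PIT_FREQ + HZ_DEMO / 2) / HZ_DEMO)

/* Mirrors the removed CLOCK_EVT_MODE_PERIODIC branch. */
static void pit_program_periodic(void)
{
	outb(0x34, PIT_MODE);		/* binary, mode 2, LSB/MSB, channel 0 */
	outb(LATCH & 0xff, PIT_CH0);	/* LSB */
	outb(LATCH >> 8, PIT_CH0);	/* MSB */
}

int main(void)
{
	printf("mode 2 reload value for %u Hz: %u\n", HZ_DEMO, (unsigned)LATCH);
	(void)pit_program_periodic;	/* do not run against live hardware */
	return 0;
}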
diff --git a/arch/x86/kernel/i8259.c b/arch/x86/kernel/i8259.c
index cafa7c80ac95..f67063df6723 100644
--- a/arch/x86/kernel/i8259.c
+++ b/arch/x86/kernel/i8259.c
@@ -1,27 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/linkage.h>
#include <linux/errno.h>
#include <linux/signal.h>
#include <linux/sched.h>
#include <linux/ioport.h>
#include <linux/interrupt.h>
+#include <linux/irq.h>
#include <linux/timex.h>
#include <linux/random.h>
#include <linux/init.h>
#include <linux/kernel_stat.h>
-#include <linux/sysdev.h>
+#include <linux/syscore_ops.h>
#include <linux/bitops.h>
#include <linux/acpi.h>
#include <linux/io.h>
#include <linux/delay.h>
+#include <linux/pgtable.h>
-#include <asm/atomic.h>
-#include <asm/system.h>
+#include <linux/atomic.h>
#include <asm/timer.h>
#include <asm/hw_irq.h>
-#include <asm/pgtable.h>
#include <asm/desc.h>
#include <asm/apic.h>
#include <asm/i8259.h>
+#include <asm/io_apic.h>
/*
* This is the 'legacy' 8259A Programmable Interrupt Controller,
@@ -29,24 +31,11 @@
* plus some generic x86 specific things if generic specifics makes
* any sense at all.
*/
+static void init_8259A(int auto_eoi);
+static bool pcat_compat __ro_after_init;
static int i8259A_auto_eoi;
DEFINE_RAW_SPINLOCK(i8259A_lock);
-static void mask_and_ack_8259A(unsigned int);
-static void mask_8259A(void);
-static void unmask_8259A(void);
-static void disable_8259A_irq(unsigned int irq);
-static void enable_8259A_irq(unsigned int irq);
-static void init_8259A(int auto_eoi);
-static int i8259A_irq_pending(unsigned int irq);
-
-struct irq_chip i8259A_chip = {
- .name = "XT-PIC",
- .mask = disable_8259A_irq,
- .disable = disable_8259A_irq,
- .unmask = enable_8259A_irq,
- .mask_ack = mask_and_ack_8259A,
-};
/*
* 8259A PIC functions to handle ISA devices:
@@ -68,7 +57,7 @@ unsigned int cached_irq_mask = 0xffff;
*/
unsigned long io_apic_irqs;
-static void disable_8259A_irq(unsigned int irq)
+static void mask_8259A_irq(unsigned int irq)
{
unsigned int mask = 1 << irq;
unsigned long flags;
@@ -82,7 +71,12 @@ static void disable_8259A_irq(unsigned int irq)
raw_spin_unlock_irqrestore(&i8259A_lock, flags);
}
-static void enable_8259A_irq(unsigned int irq)
+static void disable_8259A_irq(struct irq_data *data)
+{
+ mask_8259A_irq(data->irq);
+}
+
+static void unmask_8259A_irq(unsigned int irq)
{
unsigned int mask = ~(1 << irq);
unsigned long flags;
@@ -96,6 +90,11 @@ static void enable_8259A_irq(unsigned int irq)
raw_spin_unlock_irqrestore(&i8259A_lock, flags);
}
+static void enable_8259A_irq(struct irq_data *data)
+{
+ unmask_8259A_irq(data->irq);
+}
+
static int i8259A_irq_pending(unsigned int irq)
{
unsigned int mask = 1<<irq;
@@ -116,9 +115,10 @@ static void make_8259A_irq(unsigned int irq)
{
disable_irq_nosync(irq);
io_apic_irqs &= ~(1<<irq);
- set_irq_chip_and_handler_name(irq, &i8259A_chip, handle_level_irq,
- "XT");
+ irq_set_chip_and_handler(irq, &i8259A_chip, handle_level_irq);
+ irq_set_status_flags(irq, IRQ_LEVEL);
enable_irq(irq);
+ lapic_assign_legacy_vector(irq, true);
}
/*
@@ -150,8 +150,9 @@ static inline int i8259A_irq_real(unsigned int irq)
* first, _then_ send the EOI, and the order of EOI
* to the two 8259s is important!
*/
-static void mask_and_ack_8259A(unsigned int irq)
+static void mask_and_ack_8259A(struct irq_data *data)
{
+ unsigned int irq = data->irq;
unsigned int irqmask = 1 << irq;
unsigned long flags;
@@ -209,7 +210,7 @@ spurious_8259A_irq:
* lets ACK and report it. [once per IRQ]
*/
if (!(spurious_irq_mask & irqmask)) {
- printk(KERN_DEBUG
+ printk_deferred(KERN_DEBUG
"spurious 8259A interrupt: IRQ%d.\n", irq);
spurious_irq_mask |= irqmask;
}
@@ -223,57 +224,59 @@ spurious_8259A_irq:
}
}
+struct irq_chip i8259A_chip = {
+ .name = "XT-PIC",
+ .irq_mask = disable_8259A_irq,
+ .irq_disable = disable_8259A_irq,
+ .irq_unmask = enable_8259A_irq,
+ .irq_mask_ack = mask_and_ack_8259A,
+};
+
static char irq_trigger[2];
-/**
- * ELCR registers (0x4d0, 0x4d1) control edge/level of IRQ
- */
+/* ELCR registers (0x4d0, 0x4d1) control edge/level of IRQ */
static void restore_ELCR(char *trigger)
{
- outb(trigger[0], 0x4d0);
- outb(trigger[1], 0x4d1);
+ outb(trigger[0], PIC_ELCR1);
+ outb(trigger[1], PIC_ELCR2);
}
static void save_ELCR(char *trigger)
{
/* IRQ 0,1,2,8,13 are marked as reserved */
- trigger[0] = inb(0x4d0) & 0xF8;
- trigger[1] = inb(0x4d1) & 0xDE;
+ trigger[0] = inb(PIC_ELCR1) & 0xF8;
+ trigger[1] = inb(PIC_ELCR2) & 0xDE;
}
-static int i8259A_resume(struct sys_device *dev)
+static void i8259A_resume(void)
{
init_8259A(i8259A_auto_eoi);
restore_ELCR(irq_trigger);
- return 0;
}
-static int i8259A_suspend(struct sys_device *dev, pm_message_t state)
+static int i8259A_suspend(void)
{
save_ELCR(irq_trigger);
return 0;
}
-static int i8259A_shutdown(struct sys_device *dev)
+static void i8259A_shutdown(void)
{
/* Put the i8259A into a quiescent state that
* the kernel initialization code can get it
* out of.
*/
outb(0xff, PIC_MASTER_IMR); /* mask all of 8259A-1 */
- outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-1 */
- return 0;
+ outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */
}
-static struct sysdev_class i8259_sysdev_class = {
- .name = "i8259",
+static struct syscore_ops i8259_syscore_ops = {
.suspend = i8259A_suspend,
.resume = i8259A_resume,
.shutdown = i8259A_shutdown,
};
-static struct sys_device device_i8259A = {
-	.id	= 0,
-	.cls	= &i8259_sysdev_class,
-};
static void mask_8259A(void)
@@ -300,6 +303,49 @@ static void unmask_8259A(void)
raw_spin_unlock_irqrestore(&i8259A_lock, flags);
}
+static int probe_8259A(void)
+{
+ unsigned char new_val, probe_val = ~(1 << PIC_CASCADE_IR);
+ unsigned long flags;
+
+ /*
+ * If MADT has the PCAT_COMPAT flag set, then do not bother probing
+ * for the PIC. Some BIOSes leave the PIC uninitialized and probing
+ * fails.
+ *
+	 * Right now this causes problems as quite a bit of code depends on
+	 * nr_legacy_irqs() > 0 or has_legacy_pic() == true. This is silly
+	 * when the system has an IO/APIC because then the PIC is not required
+ * at all, except for really old machines where the timer interrupt
+ * must be routed through the PIC. So just pretend that the PIC is
+ * there and let legacy_pic->init() initialize it for nothing.
+ *
+ * Alternatively this could just try to initialize the PIC and
+ * repeat the probe, but for cases where there is no PIC that's
+ * just pointless.
+ */
+ if (pcat_compat)
+ return nr_legacy_irqs();
+
+ /*
+ * Check to see if we have a PIC. Mask all except the cascade and
+ * read back the value we just wrote. If we don't have a PIC, we
+ * will read 0xff as opposed to the value we wrote.
+ */
+ raw_spin_lock_irqsave(&i8259A_lock, flags);
+
+ outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */
+ outb(probe_val, PIC_MASTER_IMR);
+ new_val = inb(PIC_MASTER_IMR);
+ if (new_val != probe_val) {
+ printk(KERN_INFO "Using NULL legacy PIC\n");
+ legacy_pic = &null_legacy_pic;
+ }
+
+ raw_spin_unlock_irqrestore(&i8259A_lock, flags);
+ return nr_legacy_irqs();
+}
+
static void init_8259A(int auto_eoi)
{
unsigned long flags;
@@ -309,16 +355,14 @@ static void init_8259A(int auto_eoi)
raw_spin_lock_irqsave(&i8259A_lock, flags);
outb(0xff, PIC_MASTER_IMR); /* mask all of 8259A-1 */
- outb(0xff, PIC_SLAVE_IMR); /* mask all of 8259A-2 */
/*
* outb_pic - this has to work on a wide range of PC hardware.
*/
outb_pic(0x11, PIC_MASTER_CMD); /* ICW1: select 8259A-1 init */
- /* ICW2: 8259A-1 IR0-7 mapped to 0x30-0x37 on x86-64,
- to 0x20-0x27 on i386 */
- outb_pic(IRQ0_VECTOR, PIC_MASTER_IMR);
+ /* ICW2: 8259A-1 IR0-7 mapped to ISA_IRQ_VECTOR(0) */
+ outb_pic(ISA_IRQ_VECTOR(0), PIC_MASTER_IMR);
/* 8259A-1 (the master) has a slave on IR2 */
outb_pic(1U << PIC_CASCADE_IR, PIC_MASTER_IMR);
@@ -330,8 +374,8 @@ static void init_8259A(int auto_eoi)
outb_pic(0x11, PIC_SLAVE_CMD); /* ICW1: select 8259A-2 init */
- /* ICW2: 8259A-2 IR0-7 mapped to IRQ8_VECTOR */
- outb_pic(IRQ8_VECTOR, PIC_SLAVE_IMR);
+ /* ICW2: 8259A-2 IR0-7 mapped to ISA_IRQ_VECTOR(8) */
+ outb_pic(ISA_IRQ_VECTOR(8), PIC_SLAVE_IMR);
/* 8259A-2 is a slave on master's IR2 */
outb_pic(PIC_CASCADE_IR, PIC_SLAVE_IMR);
/* (slave's support for AEOI in flat mode is to be investigated) */
@@ -342,9 +386,9 @@ static void init_8259A(int auto_eoi)
* In AEOI mode we just have to mask the interrupt
* when acking.
*/
- i8259A_chip.mask_ack = disable_8259A_irq;
+ i8259A_chip.irq_mask_ack = disable_8259A_irq;
else
- i8259A_chip.mask_ack = mask_and_ack_8259A;
+ i8259A_chip.irq_mask_ack = mask_and_ack_8259A;
udelay(100); /* wait for 8259A to initialize */
@@ -363,52 +407,54 @@ static void init_8259A(int auto_eoi)
static void legacy_pic_noop(void) { };
static void legacy_pic_uint_noop(unsigned int unused) { };
static void legacy_pic_int_noop(int unused) { };
-
-static struct irq_chip dummy_pic_chip = {
- .name = "dummy pic",
- .mask = legacy_pic_uint_noop,
- .unmask = legacy_pic_uint_noop,
- .disable = legacy_pic_uint_noop,
- .mask_ack = legacy_pic_uint_noop,
-};
static int legacy_pic_irq_pending_noop(unsigned int irq)
{
return 0;
}
+static int legacy_pic_probe(void)
+{
+ return 0;
+}
struct legacy_pic null_legacy_pic = {
.nr_legacy_irqs = 0,
- .chip = &dummy_pic_chip,
+ .chip = &dummy_irq_chip,
+ .mask = legacy_pic_uint_noop,
+ .unmask = legacy_pic_uint_noop,
.mask_all = legacy_pic_noop,
.restore_mask = legacy_pic_noop,
.init = legacy_pic_int_noop,
+ .probe = legacy_pic_probe,
.irq_pending = legacy_pic_irq_pending_noop,
.make_irq = legacy_pic_uint_noop,
};
-struct legacy_pic default_legacy_pic = {
+static struct legacy_pic default_legacy_pic = {
.nr_legacy_irqs = NR_IRQS_LEGACY,
.chip = &i8259A_chip,
- .mask_all = mask_8259A,
+ .mask = mask_8259A_irq,
+ .unmask = unmask_8259A_irq,
+ .mask_all = mask_8259A,
.restore_mask = unmask_8259A,
.init = init_8259A,
+ .probe = probe_8259A,
.irq_pending = i8259A_irq_pending,
.make_irq = make_8259A_irq,
};
struct legacy_pic *legacy_pic = &default_legacy_pic;
+EXPORT_SYMBOL(legacy_pic);
-static int __init i8259A_init_sysfs(void)
+static int __init i8259A_init_ops(void)
{
- int error;
-
- if (legacy_pic != &default_legacy_pic)
- return 0;
+ if (legacy_pic == &default_legacy_pic)
+		register_syscore_ops(&i8259_syscore_ops);
- error = sysdev_class_register(&i8259_sysdev_class);
- if (!error)
- error = sysdev_register(&device_i8259A);
- return error;
+ return 0;
}
+device_initcall(i8259A_init_ops);
-device_initcall(i8259A_init_sysfs);
+void __init legacy_pic_pcat_compat(void)
+{
+ pcat_compat = true;
+}
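The write/read-back idea in probe_8259A() can also be demonstrated from user
space. A hedged sketch using the standard master PIC ports; this is intrusive,
so treat it as illustration only:

#include <stdio.h>
#include <sys/io.h>

#define PIC_MASTER_IMR 0x21
#define PIC_CASCADE_IR 2

int main(void)
{
	unsigned char probe_val = (unsigned char)~(1 << PIC_CASCADE_IR);
	unsigned char saved, new_val;

	if (ioperm(0x20, 2, 1))
		return 1;
	saved = inb(PIC_MASTER_IMR);
	outb(probe_val, PIC_MASTER_IMR);	/* mask all except the cascade */
	new_val = inb(PIC_MASTER_IMR);		/* 0xff if no PIC responds */
	outb(saved, PIC_MASTER_IMR);		/* restore the original mask */
	printf("PIC %s\n", new_val == probe_val ? "present" : "absent");
	return 0;
}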
diff --git a/arch/x86/kernel/ibt_selftest.S b/arch/x86/kernel/ibt_selftest.S
new file mode 100644
index 000000000000..c43c4ed28a9c
--- /dev/null
+++ b/arch/x86/kernel/ibt_selftest.S
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <linux/objtool.h>
+#include <asm/nospec-branch.h>
+
+SYM_CODE_START(ibt_selftest_noendbr)
+ ANNOTATE_NOENDBR
+ UNWIND_HINT_FUNC
+ /* #CP handler sets %ax to 0 */
+ RET
+SYM_CODE_END(ibt_selftest_noendbr)
+
+SYM_FUNC_START(ibt_selftest)
+ lea ibt_selftest_noendbr(%rip), %rax
+ ANNOTATE_RETPOLINE_SAFE
+ jmp *%rax
+SYM_FUNC_END(ibt_selftest)
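The selftest takes an indirect jump to a label that deliberately lacks ENDBR;
with IBT enforced, the #CP fixup zeroes %ax before execution resumes at
ibt_selftest_noendbr. A kernel-context sketch of how a caller might consume
that result (the prototype here is an assumption; the real declaration lives
in asm/ibt.h):

#include <linux/init.h>
#include <linux/printk.h>

extern unsigned long ibt_selftest(void);	/* assumed prototype */

static void __init ibt_check(void)
{
	/* Zero means #CP intercepted the jump; nonzero means it slipped through. */
	if (ibt_selftest())
		pr_err("IBT selftest: Failed!\n");
}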
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
new file mode 100644
index 000000000000..f445bec516a0
--- /dev/null
+++ b/arch/x86/kernel/idt.c
@@ -0,0 +1,353 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Interrupt descriptor table related code
+ */
+#include <linux/interrupt.h>
+
+#include <asm/cpu_entry_area.h>
+#include <asm/set_memory.h>
+#include <asm/traps.h>
+#include <asm/proto.h>
+#include <asm/desc.h>
+#include <asm/hw_irq.h>
+#include <asm/ia32.h>
+#include <asm/idtentry.h>
+
+#define DPL0 0x0
+#define DPL3 0x3
+
+#define DEFAULT_STACK 0
+
+#define G(_vector, _addr, _ist, _type, _dpl, _segment) \
+ { \
+ .vector = _vector, \
+ .bits.ist = _ist, \
+ .bits.type = _type, \
+ .bits.dpl = _dpl, \
+ .bits.p = 1, \
+ .addr = _addr, \
+ .segment = _segment, \
+ }
+
+/* Interrupt gate */
+#define INTG(_vector, _addr) \
+ G(_vector, _addr, DEFAULT_STACK, GATE_INTERRUPT, DPL0, __KERNEL_CS)
+
+/* System interrupt gate */
+#define SYSG(_vector, _addr) \
+ G(_vector, _addr, DEFAULT_STACK, GATE_INTERRUPT, DPL3, __KERNEL_CS)
+
+#ifdef CONFIG_X86_64
+/*
+ * Interrupt gate with interrupt stack. The _ist index is the index in
+ * the tss.ist[] array, but for the descriptor it needs to start at 1.
+ */
+#define ISTG(_vector, _addr, _ist) \
+ G(_vector, _addr, _ist + 1, GATE_INTERRUPT, DPL0, __KERNEL_CS)
+#else
+#define ISTG(_vector, _addr, _ist) INTG(_vector, _addr)
+#endif
+
+/* Task gate */
+#define TSKG(_vector, _gdt) \
+ G(_vector, NULL, DEFAULT_STACK, GATE_TASK, DPL0, _gdt << 3)
+
+#define IDT_TABLE_SIZE (IDT_ENTRIES * sizeof(gate_desc))
+
+static bool idt_setup_done __initdata;
+
+/*
+ * Early traps running on the DEFAULT_STACK because the other interrupt
+ * stacks work only after cpu_init().
+ */
+static const __initconst struct idt_data early_idts[] = {
+ INTG(X86_TRAP_DB, asm_exc_debug),
+ SYSG(X86_TRAP_BP, asm_exc_int3),
+
+#ifdef CONFIG_X86_32
+ /*
+ * Not possible on 64-bit. See idt_setup_early_pf() for details.
+ */
+ INTG(X86_TRAP_PF, asm_exc_page_fault),
+#endif
+#ifdef CONFIG_INTEL_TDX_GUEST
+ INTG(X86_TRAP_VE, asm_exc_virtualization_exception),
+#endif
+};
+
+/*
+ * The default IDT entries which are set up in trap_init() before
+ * cpu_init() is invoked. Interrupt stacks cannot be used at that point and
+ * the traps which use them are reinitialized with IST after cpu_init() has
+ * set up TSS.
+ */
+static const __initconst struct idt_data def_idts[] = {
+ INTG(X86_TRAP_DE, asm_exc_divide_error),
+ ISTG(X86_TRAP_NMI, asm_exc_nmi, IST_INDEX_NMI),
+ INTG(X86_TRAP_BR, asm_exc_bounds),
+ INTG(X86_TRAP_UD, asm_exc_invalid_op),
+ INTG(X86_TRAP_NM, asm_exc_device_not_available),
+ INTG(X86_TRAP_OLD_MF, asm_exc_coproc_segment_overrun),
+ INTG(X86_TRAP_TS, asm_exc_invalid_tss),
+ INTG(X86_TRAP_NP, asm_exc_segment_not_present),
+ INTG(X86_TRAP_SS, asm_exc_stack_segment),
+ INTG(X86_TRAP_GP, asm_exc_general_protection),
+ INTG(X86_TRAP_SPURIOUS, asm_exc_spurious_interrupt_bug),
+ INTG(X86_TRAP_MF, asm_exc_coprocessor_error),
+ INTG(X86_TRAP_AC, asm_exc_alignment_check),
+ INTG(X86_TRAP_XF, asm_exc_simd_coprocessor_error),
+
+#ifdef CONFIG_X86_32
+ TSKG(X86_TRAP_DF, GDT_ENTRY_DOUBLEFAULT_TSS),
+#else
+ ISTG(X86_TRAP_DF, asm_exc_double_fault, IST_INDEX_DF),
+#endif
+ ISTG(X86_TRAP_DB, asm_exc_debug, IST_INDEX_DB),
+
+#ifdef CONFIG_X86_MCE
+ ISTG(X86_TRAP_MC, asm_exc_machine_check, IST_INDEX_MCE),
+#endif
+
+#ifdef CONFIG_X86_CET
+ INTG(X86_TRAP_CP, asm_exc_control_protection),
+#endif
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+ ISTG(X86_TRAP_VC, asm_exc_vmm_communication, IST_INDEX_VC),
+#endif
+
+ SYSG(X86_TRAP_OF, asm_exc_overflow),
+};
+
+static const struct idt_data ia32_idt[] __initconst = {
+#if defined(CONFIG_IA32_EMULATION)
+ SYSG(IA32_SYSCALL_VECTOR, asm_int80_emulation),
+#elif defined(CONFIG_X86_32)
+ SYSG(IA32_SYSCALL_VECTOR, entry_INT80_32),
+#endif
+};
+
+/*
+ * The APIC and SMP idt entries
+ */
+static const __initconst struct idt_data apic_idts[] = {
+#ifdef CONFIG_SMP
+ INTG(RESCHEDULE_VECTOR, asm_sysvec_reschedule_ipi),
+ INTG(CALL_FUNCTION_VECTOR, asm_sysvec_call_function),
+ INTG(CALL_FUNCTION_SINGLE_VECTOR, asm_sysvec_call_function_single),
+ INTG(REBOOT_VECTOR, asm_sysvec_reboot),
+#endif
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+ INTG(THERMAL_APIC_VECTOR, asm_sysvec_thermal),
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+ INTG(THRESHOLD_APIC_VECTOR, asm_sysvec_threshold),
+#endif
+
+#ifdef CONFIG_X86_MCE_AMD
+ INTG(DEFERRED_ERROR_VECTOR, asm_sysvec_deferred_error),
+#endif
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ INTG(LOCAL_TIMER_VECTOR, asm_sysvec_apic_timer_interrupt),
+ INTG(X86_PLATFORM_IPI_VECTOR, asm_sysvec_x86_platform_ipi),
+# if IS_ENABLED(CONFIG_KVM)
+ INTG(POSTED_INTR_VECTOR, asm_sysvec_kvm_posted_intr_ipi),
+ INTG(POSTED_INTR_WAKEUP_VECTOR, asm_sysvec_kvm_posted_intr_wakeup_ipi),
+ INTG(POSTED_INTR_NESTED_VECTOR, asm_sysvec_kvm_posted_intr_nested_ipi),
+# endif
+# ifdef CONFIG_IRQ_WORK
+ INTG(IRQ_WORK_VECTOR, asm_sysvec_irq_work),
+# endif
+ INTG(SPURIOUS_APIC_VECTOR, asm_sysvec_spurious_apic_interrupt),
+ INTG(ERROR_APIC_VECTOR, asm_sysvec_error_interrupt),
+# ifdef CONFIG_X86_POSTED_MSI
+ INTG(POSTED_MSI_NOTIFICATION_VECTOR, asm_sysvec_posted_msi_notification),
+# endif
+#endif
+};
+
+/* Must be page-aligned because the real IDT is used in the cpu entry area */
+static gate_desc idt_table[IDT_ENTRIES] __page_aligned_bss;
+
+static struct desc_ptr idt_descr __ro_after_init = {
+ .size = IDT_TABLE_SIZE - 1,
+ .address = (unsigned long) idt_table,
+};
+
+void load_current_idt(void)
+{
+ lockdep_assert_irqs_disabled();
+ load_idt(&idt_descr);
+}
+
+#ifdef CONFIG_X86_F00F_BUG
+bool idt_is_f00f_address(unsigned long address)
+{
+ return ((address - idt_descr.address) >> 3) == 6;
+}
+#endif
+
+static __init void
+idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sys)
+{
+ gate_desc desc;
+
+ for (; size > 0; t++, size--) {
+ idt_init_desc(&desc, t);
+ write_idt_entry(idt, t->vector, &desc);
+ if (sys)
+ set_bit(t->vector, system_vectors);
+ }
+}
+
+static __init void set_intr_gate(unsigned int n, const void *addr)
+{
+ struct idt_data data;
+
+ init_idt_data(&data, n, addr);
+
+ idt_setup_from_table(idt_table, &data, 1, false);
+}
+
+/**
+ * idt_setup_early_traps - Initialize the idt table with early traps
+ *
+ * On x86_64 these traps do not use interrupt stacks as they can't work
+ * before cpu_init() is invoked and sets up TSS. The IST variants are
+ * installed after that.
+ */
+void __init idt_setup_early_traps(void)
+{
+ idt_setup_from_table(idt_table, early_idts, ARRAY_SIZE(early_idts),
+ true);
+ load_idt(&idt_descr);
+}
+
+/**
+ * idt_setup_traps - Initialize the idt table with default traps
+ */
+void __init idt_setup_traps(void)
+{
+ idt_setup_from_table(idt_table, def_idts, ARRAY_SIZE(def_idts), true);
+
+ if (ia32_enabled())
+ idt_setup_from_table(idt_table, ia32_idt, ARRAY_SIZE(ia32_idt), true);
+}
+
+#ifdef CONFIG_X86_64
+/*
+ * Early traps running on the DEFAULT_STACK because the other interrupt
+ * stacks work only after cpu_init().
+ */
+static const __initconst struct idt_data early_pf_idts[] = {
+ INTG(X86_TRAP_PF, asm_exc_page_fault),
+};
+
+/**
+ * idt_setup_early_pf - Initialize the idt table with early pagefault handler
+ *
+ * On x86_64 this does not use interrupt stacks as they can't work before
+ * cpu_init() is invoked and sets up TSS. The IST variant is installed
+ * after that.
+ *
+ * Note that x86_64 cannot install the real #PF handler in
+ * idt_setup_early_traps() because the memory initialization needs the #PF
+ * handler from the early_idt_handler_array to initialize the early page
+ * tables.
+ */
+void __init idt_setup_early_pf(void)
+{
+ idt_setup_from_table(idt_table, early_pf_idts,
+ ARRAY_SIZE(early_pf_idts), true);
+}
+#endif
+
+static void __init idt_map_in_cea(void)
+{
+ /*
+ * Set the IDT descriptor to a fixed read-only location in the cpu
+ * entry area, so that the "sidt" instruction will not leak the
+ * location of the kernel, and to defend the IDT against arbitrary
+ * memory write vulnerabilities.
+ */
+ cea_set_pte(CPU_ENTRY_AREA_RO_IDT_VADDR, __pa_symbol(idt_table),
+ PAGE_KERNEL_RO);
+ idt_descr.address = CPU_ENTRY_AREA_RO_IDT;
+}
+
+/**
+ * idt_setup_apic_and_irq_gates - Setup APIC/SMP and normal interrupt gates
+ */
+void __init idt_setup_apic_and_irq_gates(void)
+{
+ int i = FIRST_EXTERNAL_VECTOR;
+ void *entry;
+
+ idt_setup_from_table(idt_table, apic_idts, ARRAY_SIZE(apic_idts), true);
+
+ for_each_clear_bit_from(i, system_vectors, FIRST_SYSTEM_VECTOR) {
+ entry = irq_entries_start + IDT_ALIGN * (i - FIRST_EXTERNAL_VECTOR);
+ set_intr_gate(i, entry);
+ }
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ for_each_clear_bit_from(i, system_vectors, NR_VECTORS) {
+ /*
+		 * Don't set the non-assigned system vectors in the
+ * system_vectors bitmap. Otherwise they show up in
+ * /proc/interrupts.
+ */
+ entry = spurious_entries_start + IDT_ALIGN * (i - FIRST_SYSTEM_VECTOR);
+ set_intr_gate(i, entry);
+ }
+#endif
+ /* Map IDT into CPU entry area and reload it. */
+ idt_map_in_cea();
+ load_idt(&idt_descr);
+
+ /* Make the IDT table read only */
+ set_memory_ro((unsigned long)&idt_table, 1);
+
+ idt_setup_done = true;
+}
+
+/**
+ * idt_setup_early_handler - Initializes the idt table with early handlers
+ */
+void __init idt_setup_early_handler(void)
+{
+ int i;
+
+ for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
+ set_intr_gate(i, early_idt_handler_array[i]);
+#ifdef CONFIG_X86_32
+ for ( ; i < NR_VECTORS; i++)
+ set_intr_gate(i, early_ignore_irq);
+#endif
+ load_idt(&idt_descr);
+}
+
+/**
+ * idt_invalidate - Invalidate interrupt descriptor table
+ */
+void idt_invalidate(void)
+{
+ static const struct desc_ptr idt = { .address = 0, .size = 0 };
+
+ load_idt(&idt);
+}
+
+void __init idt_install_sysvec(unsigned int n, const void *function)
+{
+ if (WARN_ON(n < FIRST_SYSTEM_VECTOR))
+ return;
+
+ if (WARN_ON(idt_setup_done))
+ return;
+
+ if (!WARN_ON(test_and_set_bit(n, system_vectors)))
+ set_intr_gate(n, function);
+}
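A hedged usage sketch of idt_install_sysvec(): a platform component claiming
its own system vector. The vector number and handler names below are
hypothetical; real users follow the DEFINE_IDTENTRY_SYSVEC / asm_sysvec_*
pattern visible in apic_idts[] above:

#include <linux/init.h>
#include <asm/idtentry.h>
#include <asm/apic.h>

#define MY_PLATFORM_VECTOR	0xf5	/* hypothetical, >= FIRST_SYSTEM_VECTOR */

/* Normally placed in asm/idtentry.h so the asm stub is generated too. */
DECLARE_IDTENTRY_SYSVEC(MY_PLATFORM_VECTOR, sysvec_my_platform_event);

DEFINE_IDTENTRY_SYSVEC(sysvec_my_platform_event)
{
	apic_eoi();			/* ack the APIC like the handlers above */
	/* platform-specific event handling would go here */
}

static void __init my_platform_setup(void)
{
	/* Must run before idt_setup_apic_and_irq_gates() marks setup done. */
	idt_install_sysvec(MY_PLATFORM_VECTOR, asm_sysvec_my_platform_event);
}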
diff --git a/arch/x86/kernel/init_task.c b/arch/x86/kernel/init_task.c
deleted file mode 100644
index 43e9ccf44947..000000000000
--- a/arch/x86/kernel/init_task.c
+++ /dev/null
@@ -1,42 +0,0 @@
-#include <linux/mm.h>
-#include <linux/module.h>
-#include <linux/sched.h>
-#include <linux/init.h>
-#include <linux/init_task.h>
-#include <linux/fs.h>
-#include <linux/mqueue.h>
-
-#include <asm/uaccess.h>
-#include <asm/pgtable.h>
-#include <asm/desc.h>
-
-static struct signal_struct init_signals = INIT_SIGNALS(init_signals);
-static struct sighand_struct init_sighand = INIT_SIGHAND(init_sighand);
-
-/*
- * Initial thread structure.
- *
- * We need to make sure that this is THREAD_SIZE aligned due to the
- * way process stacks are handled. This is done by having a special
- * "init_task" linker map entry..
- */
-union thread_union init_thread_union __init_task_data =
- { INIT_THREAD_INFO(init_task) };
-
-/*
- * Initial task structure.
- *
- * All other task structs will be allocated on slabs in fork.c
- */
-struct task_struct init_task = INIT_TASK(init_task);
-EXPORT_SYMBOL(init_task);
-
-/*
- * per-CPU TSS segments. Threads are completely 'soft' on Linux,
- * no more per-task TSS's. The TSS size is kept cacheline-aligned
- * so they are allowed to end up in the .data..cacheline_aligned
- * section. Since TSS's are completely CPU-local, we want them
- * on exact cacheline boundaries, to eliminate cacheline ping-pong.
- */
-DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, init_tss) = INIT_TSS;
-
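The deleted comment alludes to the classic trick: because the stack and the
thread bookkeeping share one THREAD_SIZE-aligned block, the latter can be
recovered by masking any stack address. A small user-space illustration of
the idea (the size is an assumption for the demo):

#include <stdint.h>
#include <stdio.h>

#define THREAD_SIZE 8192u

union thread_union_demo {
	struct { int flags; } info;		/* stand-in for thread_info */
	unsigned char stack[THREAD_SIZE];
} __attribute__((aligned(THREAD_SIZE)));

static union thread_union_demo demo;

int main(void)
{
	uintptr_t sp = (uintptr_t)&demo.stack[THREAD_SIZE - 64];

	/* Masking the stack pointer recovers the base of the union. */
	printf("thread_info at %#lx\n",
	       (unsigned long)(sp & ~(uintptr_t)(THREAD_SIZE - 1)));
	return 0;
}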
diff --git a/arch/x86/kernel/io_delay.c b/arch/x86/kernel/io_delay.c
index a979b5bd2fc0..fdb6506ceaaa 100644
--- a/arch/x86/kernel/io_delay.c
+++ b/arch/x86/kernel/io_delay.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* I/O delay strategies for inb_p/outb_p
*
@@ -6,13 +7,28 @@
* outb_p/inb_p API uses.
*/
#include <linux/kernel.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/delay.h>
#include <linux/init.h>
#include <linux/dmi.h>
#include <linux/io.h>
-int io_delay_type __read_mostly = CONFIG_DEFAULT_IO_DELAY_TYPE;
+#define IO_DELAY_TYPE_0X80 0
+#define IO_DELAY_TYPE_0XED 1
+#define IO_DELAY_TYPE_UDELAY 2
+#define IO_DELAY_TYPE_NONE 3
+
+#if defined(CONFIG_IO_DELAY_0X80)
+#define DEFAULT_IO_DELAY_TYPE IO_DELAY_TYPE_0X80
+#elif defined(CONFIG_IO_DELAY_0XED)
+#define DEFAULT_IO_DELAY_TYPE IO_DELAY_TYPE_0XED
+#elif defined(CONFIG_IO_DELAY_UDELAY)
+#define DEFAULT_IO_DELAY_TYPE IO_DELAY_TYPE_UDELAY
+#elif defined(CONFIG_IO_DELAY_NONE)
+#define DEFAULT_IO_DELAY_TYPE IO_DELAY_TYPE_NONE
+#endif
+
+int io_delay_type __read_mostly = DEFAULT_IO_DELAY_TYPE;
static int __initdata io_delay_override;
@@ -23,13 +39,13 @@ void native_io_delay(void)
{
switch (io_delay_type) {
default:
- case CONFIG_IO_DELAY_TYPE_0X80:
+ case IO_DELAY_TYPE_0X80:
asm volatile ("outb %al, $0x80");
break;
- case CONFIG_IO_DELAY_TYPE_0XED:
+ case IO_DELAY_TYPE_0XED:
asm volatile ("outb %al, $0xed");
break;
- case CONFIG_IO_DELAY_TYPE_UDELAY:
+ case IO_DELAY_TYPE_UDELAY:
/*
* 2 usecs is an upper-bound for the outb delay but
* note that udelay doesn't have the bus-level
@@ -38,7 +54,8 @@ void native_io_delay(void)
* are shorter until calibrated):
*/
udelay(2);
- case CONFIG_IO_DELAY_TYPE_NONE:
+ break;
+ case IO_DELAY_TYPE_NONE:
break;
}
}
@@ -46,9 +63,9 @@ EXPORT_SYMBOL(native_io_delay);
static int __init dmi_io_delay_0xed_port(const struct dmi_system_id *id)
{
- if (io_delay_type == CONFIG_IO_DELAY_TYPE_0X80) {
+ if (io_delay_type == IO_DELAY_TYPE_0X80) {
pr_notice("%s: using 0xed I/O delay port\n", id->ident);
- io_delay_type = CONFIG_IO_DELAY_TYPE_0XED;
+ io_delay_type = IO_DELAY_TYPE_0XED;
}
return 0;
@@ -58,7 +75,7 @@ static int __init dmi_io_delay_0xed_port(const struct dmi_system_id *id)
* Quirk table for systems that misbehave (lock up, etc.) if port
* 0x80 is used:
*/
-static struct dmi_system_id __initdata io_delay_0xed_port_dmi_table[] = {
+static const struct dmi_system_id io_delay_0xed_port_dmi_table[] __initconst = {
{
.callback = dmi_io_delay_0xed_port,
.ident = "Compaq Presario V6000",
@@ -114,13 +131,13 @@ static int __init io_delay_param(char *s)
return -EINVAL;
if (!strcmp(s, "0x80"))
- io_delay_type = CONFIG_IO_DELAY_TYPE_0X80;
+ io_delay_type = IO_DELAY_TYPE_0X80;
else if (!strcmp(s, "0xed"))
- io_delay_type = CONFIG_IO_DELAY_TYPE_0XED;
+ io_delay_type = IO_DELAY_TYPE_0XED;
else if (!strcmp(s, "udelay"))
- io_delay_type = CONFIG_IO_DELAY_TYPE_UDELAY;
+ io_delay_type = IO_DELAY_TYPE_UDELAY;
else if (!strcmp(s, "none"))
- io_delay_type = CONFIG_IO_DELAY_TYPE_NONE;
+ io_delay_type = IO_DELAY_TYPE_NONE;
else
return -EINVAL;
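The accepted strings come from the kernel command line (io_delay=0x80, 0xed,
udelay or none). A trivial user-space sketch of the same string-to-type
dispatch, for illustration:

#include <stdio.h>
#include <string.h>

enum { TYPE_0X80, TYPE_0XED, TYPE_UDELAY, TYPE_NONE, TYPE_INVALID };

static int parse_io_delay(const char *s)
{
	if (!strcmp(s, "0x80"))   return TYPE_0X80;
	if (!strcmp(s, "0xed"))   return TYPE_0XED;
	if (!strcmp(s, "udelay")) return TYPE_UDELAY;
	if (!strcmp(s, "none"))   return TYPE_NONE;
	return TYPE_INVALID;	/* mirrors the -EINVAL case above */
}

int main(void)
{
	printf("io_delay=0xed -> %d\n", parse_io_delay("0xed"));
	return 0;
}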
diff --git a/arch/x86/kernel/ioport.c b/arch/x86/kernel/ioport.c
index 8eec0ec59af2..ff40f09ad911 100644
--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -1,47 +1,83 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* This contains the io-permission bitmap code - written by obz, with changes
* by Linus. 32/64 bits code unification by Miguel Botón.
*/
-
-#include <linux/sched.h>
-#include <linux/kernel.h>
#include <linux/capability.h>
-#include <linux/errno.h>
-#include <linux/types.h>
+#include <linux/security.h>
+#include <linux/syscalls.h>
+#include <linux/bitmap.h>
#include <linux/ioport.h>
-#include <linux/smp.h>
-#include <linux/stddef.h>
+#include <linux/sched.h>
#include <linux/slab.h>
-#include <linux/thread_info.h>
-#include <linux/syscalls.h>
+
+#include <asm/io_bitmap.h>
+#include <asm/desc.h>
#include <asm/syscalls.h>
-/* Set EXTENT bits starting at BASE in BITMAP to value TURN_ON. */
-static void set_bitmap(unsigned long *bitmap, unsigned int base,
- unsigned int extent, int new_value)
+#ifdef CONFIG_X86_IOPL_IOPERM
+
+static atomic64_t io_bitmap_sequence;
+
+void io_bitmap_share(struct task_struct *tsk)
{
- unsigned int i;
+ /* Can be NULL when current->thread.iopl_emul == 3 */
+ if (current->thread.io_bitmap) {
+ /*
+ * Take a refcount on current's bitmap. It can be used by
+ * both tasks as long as none of them changes the bitmap.
+ */
+ refcount_inc(&current->thread.io_bitmap->refcnt);
+ tsk->thread.io_bitmap = current->thread.io_bitmap;
+ }
+ set_tsk_thread_flag(tsk, TIF_IO_BITMAP);
+}
- for (i = base; i < base + extent; i++) {
- if (new_value)
- __set_bit(i, bitmap);
- else
- __clear_bit(i, bitmap);
+static void task_update_io_bitmap(void)
+{
+ struct task_struct *tsk = current;
+ struct thread_struct *t = &tsk->thread;
+
+ if (t->iopl_emul == 3 || t->io_bitmap) {
+ /* TSS update is handled on exit to user space */
+ set_tsk_thread_flag(tsk, TIF_IO_BITMAP);
+ } else {
+ clear_tsk_thread_flag(tsk, TIF_IO_BITMAP);
+ /* Invalidate TSS */
+ preempt_disable();
+ tss_update_io_bitmap();
+ preempt_enable();
}
}
+void io_bitmap_exit(struct task_struct *tsk)
+{
+ struct io_bitmap *iobm = tsk->thread.io_bitmap;
+
+ tsk->thread.io_bitmap = NULL;
+ /*
+ * Don't touch the TSS when invoked on a failed fork(). TSS
+ * reflects the state of @current and not the state of @tsk.
+ */
+ if (tsk == current)
+ task_update_io_bitmap();
+ if (iobm && refcount_dec_and_test(&iobm->refcnt))
+ kfree(iobm);
+}
+
/*
- * this changes the io permissions bitmap in the current task.
+ * This changes the io permissions bitmap in the current task.
*/
-asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int turn_on)
+long ksys_ioperm(unsigned long from, unsigned long num, int turn_on)
{
struct thread_struct *t = &current->thread;
- struct tss_struct *tss;
- unsigned int i, max_long, bytes, bytes_updated;
+ unsigned int i, max_long;
+ struct io_bitmap *iobm;
if ((from + num <= from) || (from + num > IO_BITMAP_BITS))
return -EINVAL;
- if (turn_on && !capable(CAP_SYS_RAWIO))
+ if (turn_on && (!capable(CAP_SYS_RAWIO) ||
+ security_locked_down(LOCKDOWN_IOPORT)))
return -EPERM;
/*
@@ -49,75 +85,136 @@ asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int turn_on)
* IO bitmap up. ioperm() is much less timing critical than clone(),
* this is why we delay this operation until now:
*/
- if (!t->io_bitmap_ptr) {
- unsigned long *bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
-
- if (!bitmap)
+ iobm = t->io_bitmap;
+ if (!iobm) {
+ /* No point to allocate a bitmap just to clear permissions */
+ if (!turn_on)
+ return 0;
+ iobm = kmalloc(sizeof(*iobm), GFP_KERNEL);
+ if (!iobm)
return -ENOMEM;
- memset(bitmap, 0xff, IO_BITMAP_BYTES);
- t->io_bitmap_ptr = bitmap;
- set_thread_flag(TIF_IO_BITMAP);
+ memset(iobm->bitmap, 0xff, sizeof(iobm->bitmap));
+ refcount_set(&iobm->refcnt, 1);
+ }
+
+ /*
+ * If the bitmap is not shared, then nothing can take a refcount as
+	 * current can obviously not fork at the same time. If it's shared,
+	 * duplicate it and drop the refcount on the original one.
+ */
+ if (refcount_read(&iobm->refcnt) > 1) {
+ iobm = kmemdup(iobm, sizeof(*iobm), GFP_KERNEL);
+ if (!iobm)
+ return -ENOMEM;
+ refcount_set(&iobm->refcnt, 1);
+ io_bitmap_exit(current);
}
/*
- * do it in the per-thread copy and in the TSS ...
- *
- * Disable preemption via get_cpu() - we must not switch away
- * because the ->io_bitmap_max value must match the bitmap
- * contents:
+ * Store the bitmap pointer (might be the same if the task already
+	 * had one). Must be done here so freeing the bitmap when all
+ * permissions are dropped has the pointer set up.
*/
- tss = &per_cpu(init_tss, get_cpu());
+ t->io_bitmap = iobm;
+ /* Mark it active for context switching and exit to user mode */
+ set_thread_flag(TIF_IO_BITMAP);
- set_bitmap(t->io_bitmap_ptr, from, num, !turn_on);
+ /*
+ * Update the tasks bitmap. The update of the TSS bitmap happens on
+ * exit to user mode. So this needs no protection.
+ */
+ if (turn_on)
+ bitmap_clear(iobm->bitmap, from, num);
+ else
+ bitmap_set(iobm->bitmap, from, num);
/*
* Search for a (possibly new) maximum. This is simple and stupid,
* to keep it obviously correct:
*/
- max_long = 0;
- for (i = 0; i < IO_BITMAP_LONGS; i++)
- if (t->io_bitmap_ptr[i] != ~0UL)
+ max_long = UINT_MAX;
+ for (i = 0; i < IO_BITMAP_LONGS; i++) {
+ if (iobm->bitmap[i] != ~0UL)
max_long = i;
+ }
+ /* All permissions dropped? */
+ if (max_long == UINT_MAX) {
+ io_bitmap_exit(current);
+ return 0;
+ }
- bytes = (max_long + 1) * sizeof(unsigned long);
- bytes_updated = max(bytes, t->io_bitmap_max);
-
- t->io_bitmap_max = bytes;
-
- /* Update the TSS: */
- memcpy(tss->io_bitmap, t->io_bitmap_ptr, bytes_updated);
+ iobm->max = (max_long + 1) * sizeof(unsigned long);
- put_cpu();
+ /*
+ * Update the sequence number to force a TSS update on return to
+ * user mode.
+ */
+ iobm->sequence = atomic64_inc_return(&io_bitmap_sequence);
return 0;
}
+SYSCALL_DEFINE3(ioperm, unsigned long, from, unsigned long, num, int, turn_on)
+{
+ return ksys_ioperm(from, num, turn_on);
+}
+
/*
- * sys_iopl has to be used when you want to access the IO ports
- * beyond the 0x3ff range: to get the full 65536 ports bitmapped
- * you'd need 8kB of bitmaps/process, which is a bit excessive.
+ * The sys_iopl functionality depends on the level argument, which if
+ * granted for the task is used to enable access to all 65536 I/O ports.
+ *
+ * This does not use the IOPL mechanism provided by the CPU as that would
+ * also allow the user space task to use the CLI/STI instructions.
+ *
+ * Disabling interrupts in a user space task is dangerous as it might lock
+ * up the machine and the semantics vs. syscalls and exceptions is
+ * undefined.
+ *
+ * Setting IOPL to level 0-2 is disabling I/O permissions. Level 3
+ * enables them.
*
- * Here we just change the flags value on the stack: we allow
- * only the super-user to do it. This depends on the stack-layout
- * on system-call entry - see also fork() and the signal handling
- * code.
+ * IOPL is strictly per thread and inherited on fork.
*/
-long sys_iopl(unsigned int level, struct pt_regs *regs)
+SYSCALL_DEFINE1(iopl, unsigned int, level)
{
- unsigned int old = (regs->flags >> 12) & 3;
struct thread_struct *t = &current->thread;
+ unsigned int old;
if (level > 3)
return -EINVAL;
+
+ old = t->iopl_emul;
+
+ /* No point in going further if nothing changes */
+ if (level == old)
+ return 0;
+
/* Trying to gain more privileges? */
if (level > old) {
- if (!capable(CAP_SYS_RAWIO))
+ if (!capable(CAP_SYS_RAWIO) ||
+ security_locked_down(LOCKDOWN_IOPORT))
return -EPERM;
}
- regs->flags = (regs->flags & ~X86_EFLAGS_IOPL) | (level << 12);
- t->iopl = level << 12;
- set_iopl_mask(t->iopl);
+ t->iopl_emul = level;
+ task_update_io_bitmap();
return 0;
}
+
+#else /* CONFIG_X86_IOPL_IOPERM */
+
+long ksys_ioperm(unsigned long from, unsigned long num, int turn_on)
+{
+ return -ENOSYS;
+}
+SYSCALL_DEFINE3(ioperm, unsigned long, from, unsigned long, num, int, turn_on)
+{
+ return -ENOSYS;
+}
+
+SYSCALL_DEFINE1(iopl, unsigned int, level)
+{
+ return -ENOSYS;
+}
+#endif
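From user space these paths are reached through glibc's ioperm()/iopl()
wrappers. A minimal example exercising the bitmap path above: grant three
legacy parallel-port addresses, touch them, then drop the grant again (which,
per the max_long scan above, frees the bitmap when no bits remain). Needs
CAP_SYS_RAWIO and no ioport lockdown:

#include <stdio.h>
#include <sys/io.h>

int main(void)
{
	if (ioperm(0x378, 3, 1)) {	/* turn_on = 1: allow 0x378-0x37a */
		perror("ioperm");
		return 1;
	}
	outb(0x00, 0x378);			/* data register */
	printf("status: %#x\n", inb(0x379));	/* status register */
	ioperm(0x378, 3, 0);	/* drop: all permissions gone, bitmap freed */
	return 0;
}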
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 91fd0c70a18a..86f4e574de02 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -1,24 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Common interrupt code for 32 and 64 bit
*/
#include <linux/cpu.h>
#include <linux/interrupt.h>
#include <linux/kernel_stat.h>
+#include <linux/of.h>
#include <linux/seq_file.h>
#include <linux/smp.h>
#include <linux/ftrace.h>
+#include <linux/delay.h>
+#include <linux/export.h>
+#include <linux/irq.h>
+#include <linux/kvm_types.h>
+#include <asm/irq_stack.h>
#include <asm/apic.h>
#include <asm/io_apic.h>
#include <asm/irq.h>
-#include <asm/idle.h>
#include <asm/mce.h>
#include <asm/hw_irq.h>
+#include <asm/desc.h>
+#include <asm/traps.h>
+#include <asm/thermal.h>
+#include <asm/posted_intr.h>
+#include <asm/irq_remapping.h>
+
+#if defined(CONFIG_X86_LOCAL_APIC) || defined(CONFIG_X86_THERMAL_VECTOR)
+#define CREATE_TRACE_POINTS
+#include <asm/trace/irq_vectors.h>
+#endif
-atomic_t irq_err_count;
+DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
+EXPORT_PER_CPU_SYMBOL(irq_stat);
-/* Function pointer for generic interrupt vector handling */
-void (*x86_platform_ipi_callback)(void) = NULL;
+DEFINE_PER_CPU_CACHE_HOT(u16, __softirq_pending);
+EXPORT_PER_CPU_SYMBOL(__softirq_pending);
+
+DEFINE_PER_CPU_CACHE_HOT(struct irq_stack *, hardirq_stack_ptr);
+
+atomic_t irq_err_count;
/*
* 'what should we do if we get a hw irq event on an illegal vector'.
@@ -38,139 +59,146 @@ void ack_bad_irq(unsigned int irq)
* completely.
* But only ack when the APIC is enabled -AK
*/
- ack_APIC_irq();
+ apic_eoi();
}
#define irq_stats(x) (&per_cpu(irq_stat, x))
/*
- * /proc/interrupts printing:
+ * /proc/interrupts printing for arch specific interrupts
*/
-static int show_other_interrupts(struct seq_file *p, int prec)
+int arch_show_interrupts(struct seq_file *p, int prec)
{
int j;
seq_printf(p, "%*s: ", prec, "NMI");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->__nmi_count);
- seq_printf(p, " Non-maskable interrupts\n");
+ seq_puts(p, " Non-maskable interrupts\n");
#ifdef CONFIG_X86_LOCAL_APIC
seq_printf(p, "%*s: ", prec, "LOC");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
- seq_printf(p, " Local timer interrupts\n");
+ seq_puts(p, " Local timer interrupts\n");
seq_printf(p, "%*s: ", prec, "SPU");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_spurious_count);
- seq_printf(p, " Spurious interrupts\n");
+ seq_puts(p, " Spurious interrupts\n");
seq_printf(p, "%*s: ", prec, "PMI");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
- seq_printf(p, " Performance monitoring interrupts\n");
- seq_printf(p, "%*s: ", prec, "PND");
+ seq_puts(p, " Performance monitoring interrupts\n");
+ seq_printf(p, "%*s: ", prec, "IWI");
for_each_online_cpu(j)
- seq_printf(p, "%10u ", irq_stats(j)->apic_pending_irqs);
- seq_printf(p, " Performance pending work\n");
-#endif
+ seq_printf(p, "%10u ", irq_stats(j)->apic_irq_work_irqs);
+ seq_puts(p, " IRQ work interrupts\n");
+ seq_printf(p, "%*s: ", prec, "RTR");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ", irq_stats(j)->icr_read_retry_count);
+ seq_puts(p, " APIC ICR read retries\n");
if (x86_platform_ipi_callback) {
seq_printf(p, "%*s: ", prec, "PLT");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->x86_platform_ipis);
- seq_printf(p, " Platform interrupts\n");
+ seq_puts(p, " Platform interrupts\n");
}
+#endif
#ifdef CONFIG_SMP
seq_printf(p, "%*s: ", prec, "RES");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_resched_count);
- seq_printf(p, " Rescheduling interrupts\n");
+ seq_puts(p, " Rescheduling interrupts\n");
seq_printf(p, "%*s: ", prec, "CAL");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_call_count);
- seq_printf(p, " Function call interrupts\n");
+ seq_puts(p, " Function call interrupts\n");
seq_printf(p, "%*s: ", prec, "TLB");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_tlb_count);
- seq_printf(p, " TLB shootdowns\n");
+ seq_puts(p, " TLB shootdowns\n");
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
seq_printf(p, "%*s: ", prec, "TRM");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_thermal_count);
- seq_printf(p, " Thermal event interrupts\n");
+ seq_puts(p, " Thermal event interrupts\n");
#endif
#ifdef CONFIG_X86_MCE_THRESHOLD
seq_printf(p, "%*s: ", prec, "THR");
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->irq_threshold_count);
- seq_printf(p, " Threshold APIC interrupts\n");
+ seq_puts(p, " Threshold APIC interrupts\n");
+#endif
+#ifdef CONFIG_X86_MCE_AMD
+ seq_printf(p, "%*s: ", prec, "DFR");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ", irq_stats(j)->irq_deferred_error_count);
+ seq_puts(p, " Deferred Error APIC interrupts\n");
#endif
#ifdef CONFIG_X86_MCE
seq_printf(p, "%*s: ", prec, "MCE");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(mce_exception_count, j));
- seq_printf(p, " Machine check exceptions\n");
+ seq_puts(p, " Machine check exceptions\n");
seq_printf(p, "%*s: ", prec, "MCP");
for_each_online_cpu(j)
seq_printf(p, "%10u ", per_cpu(mce_poll_count, j));
- seq_printf(p, " Machine check polls\n");
+ seq_puts(p, " Machine check polls\n");
+#endif
+#ifdef CONFIG_X86_HV_CALLBACK_VECTOR
+ if (test_bit(HYPERVISOR_CALLBACK_VECTOR, system_vectors)) {
+ seq_printf(p, "%*s: ", prec, "HYP");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ",
+ irq_stats(j)->irq_hv_callback_count);
+ seq_puts(p, " Hypervisor callback interrupts\n");
+ }
+#endif
+#if IS_ENABLED(CONFIG_HYPERV)
+ if (test_bit(HYPERV_REENLIGHTENMENT_VECTOR, system_vectors)) {
+ seq_printf(p, "%*s: ", prec, "HRE");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ",
+ irq_stats(j)->irq_hv_reenlightenment_count);
+ seq_puts(p, " Hyper-V reenlightenment interrupts\n");
+ }
+ if (test_bit(HYPERV_STIMER0_VECTOR, system_vectors)) {
+ seq_printf(p, "%*s: ", prec, "HVS");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ",
+ irq_stats(j)->hyperv_stimer0_count);
+ seq_puts(p, " Hyper-V stimer0 interrupts\n");
+ }
#endif
seq_printf(p, "%*s: %10u\n", prec, "ERR", atomic_read(&irq_err_count));
#if defined(CONFIG_X86_IO_APIC)
seq_printf(p, "%*s: %10u\n", prec, "MIS", atomic_read(&irq_mis_count));
#endif
- return 0;
-}
-
-int show_interrupts(struct seq_file *p, void *v)
-{
- unsigned long flags, any_count = 0;
- int i = *(loff_t *) v, j, prec;
- struct irqaction *action;
- struct irq_desc *desc;
-
- if (i > nr_irqs)
- return 0;
-
- for (prec = 3, j = 1000; prec < 10 && j <= nr_irqs; ++prec)
- j *= 10;
-
- if (i == nr_irqs)
- return show_other_interrupts(p, prec);
-
- /* print header */
- if (i == 0) {
- seq_printf(p, "%*s", prec + 8, "");
- for_each_online_cpu(j)
- seq_printf(p, "CPU%-8d", j);
- seq_putc(p, '\n');
- }
-
- desc = irq_to_desc(i);
- if (!desc)
- return 0;
-
- raw_spin_lock_irqsave(&desc->lock, flags);
+#if IS_ENABLED(CONFIG_KVM)
+ seq_printf(p, "%*s: ", prec, "PIN");
for_each_online_cpu(j)
- any_count |= kstat_irqs_cpu(i, j);
- action = desc->action;
- if (!action && !any_count)
- goto out;
+ seq_printf(p, "%10u ", irq_stats(j)->kvm_posted_intr_ipis);
+ seq_puts(p, " Posted-interrupt notification event\n");
- seq_printf(p, "%*d: ", prec, i);
+ seq_printf(p, "%*s: ", prec, "NPI");
for_each_online_cpu(j)
- seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
- seq_printf(p, " %8s", desc->chip->name);
- seq_printf(p, "-%-8s", desc->name);
-
- if (action) {
- seq_printf(p, " %s", action->name);
- while ((action = action->next) != NULL)
- seq_printf(p, ", %s", action->name);
- }
+ seq_printf(p, "%10u ",
+ irq_stats(j)->kvm_posted_intr_nested_ipis);
+ seq_puts(p, " Nested posted-interrupt event\n");
- seq_putc(p, '\n');
-out:
- raw_spin_unlock_irqrestore(&desc->lock, flags);
+ seq_printf(p, "%*s: ", prec, "PIW");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ",
+ irq_stats(j)->kvm_posted_intr_wakeup_ipis);
+ seq_puts(p, " Posted-interrupt wakeup event\n");
+#endif
+#ifdef CONFIG_X86_POSTED_MSI
+ seq_printf(p, "%*s: ", prec, "PMN");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ",
+ irq_stats(j)->posted_msi_notification_count);
+ seq_puts(p, " Posted MSI notification event\n");
+#endif
return 0;
}
@@ -185,14 +213,14 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
sum += irq_stats(cpu)->apic_timer_irqs;
sum += irq_stats(cpu)->irq_spurious_count;
sum += irq_stats(cpu)->apic_perf_irqs;
- sum += irq_stats(cpu)->apic_pending_irqs;
-#endif
+ sum += irq_stats(cpu)->apic_irq_work_irqs;
+ sum += irq_stats(cpu)->icr_read_retry_count;
if (x86_platform_ipi_callback)
sum += irq_stats(cpu)->x86_platform_ipis;
+#endif
#ifdef CONFIG_SMP
sum += irq_stats(cpu)->irq_resched_count;
sum += irq_stats(cpu)->irq_call_count;
- sum += irq_stats(cpu)->irq_tlb_count;
#endif
#ifdef CONFIG_X86_THERMAL_VECTOR
sum += irq_stats(cpu)->irq_thermal_count;
@@ -200,6 +228,13 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
#ifdef CONFIG_X86_MCE_THRESHOLD
sum += irq_stats(cpu)->irq_threshold_count;
#endif
+#ifdef CONFIG_X86_HV_CALLBACK_VECTOR
+ sum += irq_stats(cpu)->irq_hv_callback_count;
+#endif
+#if IS_ENABLED(CONFIG_HYPERV)
+ sum += irq_stats(cpu)->irq_hv_reenlightenment_count;
+ sum += irq_stats(cpu)->hyperv_stimer0_count;
+#endif
#ifdef CONFIG_X86_MCE
sum += per_cpu(mce_exception_count, cpu);
sum += per_cpu(mce_poll_count, cpu);
@@ -210,132 +245,252 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
u64 arch_irq_stat(void)
{
u64 sum = atomic_read(&irq_err_count);
-
-#ifdef CONFIG_X86_IO_APIC
- sum += atomic_read(&irq_mis_count);
-#endif
return sum;
}
-
-/*
- * do_IRQ handles all normal device IRQ's (the special
- * SMP cross-CPU interrupts have their own specific
- * handlers).
- */
-unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
+static __always_inline void handle_irq(struct irq_desc *desc,
+ struct pt_regs *regs)
{
- struct pt_regs *old_regs = set_irq_regs(regs);
+ if (IS_ENABLED(CONFIG_X86_64))
+ generic_handle_irq_desc(desc);
+ else
+ __handle_irq(desc, regs);
+}
- /* high bit used in ret_from_ code */
- unsigned vector = ~regs->orig_ax;
- unsigned irq;
+static struct irq_desc *reevaluate_vector(int vector)
+{
+ struct irq_desc *desc = __this_cpu_read(vector_irq[vector]);
- exit_idle();
- irq_enter();
+ if (!IS_ERR_OR_NULL(desc))
+ return desc;
- irq = __get_cpu_var(vector_irq)[vector];
+ if (desc == VECTOR_UNUSED)
+ pr_emerg_ratelimited("No irq handler for %d.%u\n", smp_processor_id(), vector);
+ else
+ __this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
+ return NULL;
+}
- if (!handle_irq(irq, regs)) {
- ack_APIC_irq();
+static __always_inline bool call_irq_handler(int vector, struct pt_regs *regs)
+{
+ struct irq_desc *desc = __this_cpu_read(vector_irq[vector]);
- if (printk_ratelimit())
- pr_emerg("%s: %d.%d No irq handler for vector (irq %d)\n",
- __func__, smp_processor_id(), vector, irq);
+ if (likely(!IS_ERR_OR_NULL(desc))) {
+ handle_irq(desc, regs);
+ return true;
}
- irq_exit();
+ /*
+ * Reevaluate with vector_lock held to prevent a race against
+ * request_irq() setting up the vector:
+ *
+ * CPU0 CPU1
+ * interrupt is raised in APIC IRR
+ * but not handled
+ * free_irq()
+ * per_cpu(vector_irq, CPU1)[vector] = VECTOR_SHUTDOWN;
+ *
+ * request_irq() common_interrupt()
+ * d = this_cpu_read(vector_irq[vector]);
+ *
+ * per_cpu(vector_irq, CPU1)[vector] = desc;
+ *
+ * if (d == VECTOR_SHUTDOWN)
+ * this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
+ *
+ * This requires that the same vector on the same target CPU is
+ * handed out or that a spurious interrupt hits that CPU/vector.
+ */
+ lock_vector_lock();
+ desc = reevaluate_vector(vector);
+ unlock_vector_lock();
- set_irq_regs(old_regs);
- return 1;
+ if (!desc)
+ return false;
+
+ handle_irq(desc, regs);
+ return true;
}
/*
- * Handler for X86_PLATFORM_IPI_VECTOR.
+ * common_interrupt() handles all normal device IRQ's (the special SMP
+ * cross-CPU interrupts have their own entry points).
*/
-void smp_x86_platform_ipi(struct pt_regs *regs)
+DEFINE_IDTENTRY_IRQ(common_interrupt)
{
struct pt_regs *old_regs = set_irq_regs(regs);
- ack_APIC_irq();
+ /* entry code tells RCU that we're not quiescent. Check it. */
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
- exit_idle();
+ if (unlikely(!call_irq_handler(vector, regs)))
+ apic_eoi();
- irq_enter();
+ set_irq_regs(old_regs);
+}
- inc_irq_stat(x86_platform_ipis);
+#ifdef CONFIG_X86_LOCAL_APIC
+/* Function pointer for generic interrupt vector handling */
+void (*x86_platform_ipi_callback)(void) = NULL;
+/*
+ * Handler for X86_PLATFORM_IPI_VECTOR.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_x86_platform_ipi)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+ apic_eoi();
+ trace_x86_platform_ipi_entry(X86_PLATFORM_IPI_VECTOR);
+ inc_irq_stat(x86_platform_ipis);
if (x86_platform_ipi_callback)
x86_platform_ipi_callback();
+ trace_x86_platform_ipi_exit(X86_PLATFORM_IPI_VECTOR);
+ set_irq_regs(old_regs);
+}
+#endif
- irq_exit();
+#if IS_ENABLED(CONFIG_KVM)
+static void dummy_handler(void) {}
+static void (*kvm_posted_intr_wakeup_handler)(void) = dummy_handler;
- set_irq_regs(old_regs);
+void kvm_set_posted_intr_wakeup_handler(void (*handler)(void))
+{
+ if (handler)
+ kvm_posted_intr_wakeup_handler = handler;
+ else {
+ kvm_posted_intr_wakeup_handler = dummy_handler;
+ synchronize_rcu();
+ }
}
+EXPORT_SYMBOL_FOR_KVM(kvm_set_posted_intr_wakeup_handler);
-EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq);
+/*
+ * Handler for POSTED_INTERRUPT_VECTOR.
+ */
+DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_ipi)
+{
+ apic_eoi();
+ inc_irq_stat(kvm_posted_intr_ipis);
+}
-#ifdef CONFIG_HOTPLUG_CPU
-/* A cpu has been removed from cpu_online_mask. Reset irq affinities. */
-void fixup_irqs(void)
+/*
+ * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_posted_intr_wakeup_ipi)
{
- unsigned int irq, vector;
- static int warned;
- struct irq_desc *desc;
+ apic_eoi();
+ inc_irq_stat(kvm_posted_intr_wakeup_ipis);
+ kvm_posted_intr_wakeup_handler();
+}
- for_each_irq_desc(irq, desc) {
- int break_affinity = 0;
- int set_affinity = 1;
- const struct cpumask *affinity;
+/*
+ * Handler for POSTED_INTERRUPT_NESTED_VECTOR.
+ */
+DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
+{
+ apic_eoi();
+ inc_irq_stat(kvm_posted_intr_nested_ipis);
+}
+#endif
- if (!desc)
- continue;
- if (irq == 2)
- continue;
+#ifdef CONFIG_X86_POSTED_MSI
- /* interrupt's are disabled at this point */
- raw_spin_lock(&desc->lock);
+/* Posted Interrupt Descriptors for coalesced MSIs to be posted */
+DEFINE_PER_CPU_ALIGNED(struct pi_desc, posted_msi_pi_desc);
- affinity = desc->affinity;
- if (!irq_has_action(irq) ||
- cpumask_equal(affinity, cpu_online_mask)) {
- raw_spin_unlock(&desc->lock);
- continue;
- }
+void intel_posted_msi_init(void)
+{
+ u32 destination;
+ u32 apic_id;
- /*
- * Complete the irq move. This cpu is going down and for
- * non intr-remapping case, we can't wait till this interrupt
- * arrives at this cpu before completing the irq move.
- */
- irq_force_complete_move(irq);
+ this_cpu_write(posted_msi_pi_desc.nv, POSTED_MSI_NOTIFICATION_VECTOR);
- if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
- break_affinity = 1;
- affinity = cpu_all_mask;
- }
+ /*
+	 * The APIC destination ID is stored in bits 8:15 while in xAPIC mode.
+	 * See VT-d spec. ch. 9.11.
+ */
+ apic_id = this_cpu_read(x86_cpu_to_apicid);
+ destination = x2apic_enabled() ? apic_id : apic_id << 8;
+ this_cpu_write(posted_msi_pi_desc.ndst, destination);
+}
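
The destination encoding above can be checked in isolation. A hedged sketch (plain C; encode_destination() is a hypothetical helper mirroring the xAPIC/x2APIC distinction, not a kernel function):

    #include <stdint.h>
    #include <stdio.h>

    /* In x2APIC mode the 32-bit APIC ID is used as-is; in xAPIC mode the
     * 8-bit ID sits in bits 8:15 of the destination field. */
    static uint32_t encode_destination(uint32_t apic_id, int x2apic)
    {
            return x2apic ? apic_id : apic_id << 8;
    }

    int main(void)
    {
            printf("xAPIC  id 3 -> 0x%03x\n", encode_destination(3, 0)); /* 0x300 */
            printf("x2APIC id 3 -> 0x%03x\n", encode_destination(3, 1)); /* 0x003 */
            return 0;
    }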
- if (!(desc->status & IRQ_MOVE_PCNTXT) && desc->chip->mask)
- desc->chip->mask(irq);
+static __always_inline bool handle_pending_pir(unsigned long *pir, struct pt_regs *regs)
+{
+ unsigned long pir_copy[NR_PIR_WORDS];
+ int vec = FIRST_EXTERNAL_VECTOR;
- if (desc->chip->set_affinity)
- desc->chip->set_affinity(irq, affinity);
- else if (!(warned++))
- set_affinity = 0;
+ if (!pi_harvest_pir(pir, pir_copy))
+ return false;
- if (!(desc->status & IRQ_MOVE_PCNTXT) && desc->chip->unmask)
- desc->chip->unmask(irq);
+ for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
+ call_irq_handler(vec, regs);
- raw_spin_unlock(&desc->lock);
+ return true;
+}
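
The snapshot-then-scan shape of handle_pending_pir() — copy the live bitmap once, then walk its set bits — can be sketched in user space as follows (plain C using the GCC/clang __builtin_ctzl builtin; scan_pending() and dispatch() are illustrative stand-ins):

    #include <stdio.h>
    #include <string.h>

    #define NR_WORDS      4
    #define BITS_PER_LONG ((int)(8 * sizeof(unsigned long)))

    static void dispatch(int vec)               /* call_irq_handler() stand-in */
    {
            printf("vector %d\n", vec);
    }

    static int scan_pending(const unsigned long *live)
    {
            unsigned long copy[NR_WORDS];
            int found = 0, i;

            memcpy(copy, live, sizeof(copy));   /* pi_harvest_pir() analogue:
                                                 * snapshot once, scan the copy */

            for (i = 0; i < NR_WORDS; i++) {
                    unsigned long w = copy[i];

                    while (w) {
                            int bit = __builtin_ctzl(w);    /* lowest set bit */

                            dispatch(i * BITS_PER_LONG + bit);
                            w &= w - 1;                     /* clear that bit */
                            found = 1;
                    }
            }
            return found;
    }

    int main(void)
    {
            unsigned long live[NR_WORDS] = { 0x30UL, 0, 0x1UL, 0 };

            return !scan_pending(live);
    }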
+
+/*
+ * Performance data shows that 3 is good enough to harvest 90+% of the benefit
+ * on high IRQ rate workload.
+ * on high-IRQ-rate workloads.
+#define MAX_POSTED_MSI_COALESCING_LOOP 3
- if (break_affinity && set_affinity)
- printk("Broke affinity for irq %i\n", irq);
- else if (!set_affinity)
- printk("Cannot set affinity for irq %i\n", irq);
+/*
+ * For MSIs that are delivered as posted interrupts, the CPU notifications
+ * can be coalesced if the MSIs arrive in high frequency bursts.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+ struct pi_desc *pid;
+ int i = 0;
+
+ pid = this_cpu_ptr(&posted_msi_pi_desc);
+
+ inc_irq_stat(posted_msi_notification_count);
+ irq_enter();
+
+ /*
+ * Max coalescing count includes the extra round of handle_pending_pir
+ * after clearing the outstanding notification bit. Hence, at most
+	 * MAX_POSTED_MSI_COALESCING_LOOP - 1 iterations run here.
+ */
+ while (++i < MAX_POSTED_MSI_COALESCING_LOOP) {
+ if (!handle_pending_pir(pid->pir, regs))
+ break;
}
/*
- * We can remove mdelay() and then send spuriuous interrupts to
+	 * Clear the outstanding notification bit to allow new IRQ notifications;
+	 * do this last to maximize the window of interrupt coalescing.
+ */
+ pi_clear_on(pid);
+
+ /*
+	 * There could be a race between the PI notification and the clearing of
+	 * the ON bit; process the PIR bits one last time so that handling of the
+	 * new interrupts is not delayed until the next IRQ.
+ */
+ handle_pending_pir(pid->pir, regs);
+
+ apic_eoi();
+ irq_exit();
+ set_irq_regs(old_regs);
+}
+#endif /* X86_POSTED_MSI */
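
The control flow of the notification handler — a bounded number of drain passes, then clearing the notification bit, then one final drain to close the race — can be modelled like this (plain C; pending, notify_on and drain() are illustrative stand-ins for the PIR contents, the ON bit and handle_pending_pir()):

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_LOOPS 3

    static int  pending   = 5;     /* PIR contents stand-in */
    static bool notify_on = true;  /* ON bit stand-in */

    static bool drain(void)        /* handle_pending_pir() stand-in */
    {
            if (!pending)
                    return false;
            printf("drained %d events\n", pending);
            pending = 0;
            return true;
    }

    int main(void)
    {
            int i = 0;

            /* At most MAX_LOOPS - 1 passes here ... */
            while (++i < MAX_LOOPS) {
                    if (!drain())
                            break;
                    pending = (i == 1);     /* simulate one more burst */
            }

            notify_on = false;              /* pi_clear_on() analogue, done late
                                             * to widen the coalescing window */

            /* ... plus one final pass to catch events that slipped in just
             * before the ON bit was cleared. */
            drain();
            return notify_on;
    }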
+
+#ifdef CONFIG_HOTPLUG_CPU
+/* A cpu has been removed from cpu_online_mask. Reset irq affinities. */
+void fixup_irqs(void)
+{
+ unsigned int vector;
+ struct irq_desc *desc;
+ struct irq_data *data;
+ struct irq_chip *chip;
+
+ irq_migrate_all_off_this_cpu();
+
+ /*
+ * We can remove mdelay() and then send spurious interrupts to
* new cpu targets for all the irqs that were handled previously by
* this cpu. While it works, I have seen spurious interrupt messages
* (nothing wrong but still...).
@@ -345,22 +500,49 @@ void fixup_irqs(void)
*/
mdelay(1);
+ /*
+ * We can walk the vector array of this cpu without holding
+ * vector_lock because the cpu is already marked !online, so
+ * nothing else will touch it.
+ */
for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
- unsigned int irr;
-
- if (__get_cpu_var(vector_irq)[vector] < 0)
+ if (IS_ERR_OR_NULL(__this_cpu_read(vector_irq[vector])))
continue;
- irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
- if (irr & (1 << (vector % 32))) {
- irq = __get_cpu_var(vector_irq)[vector];
+ if (is_vector_pending(vector)) {
+ desc = __this_cpu_read(vector_irq[vector]);
- desc = irq_to_desc(irq);
raw_spin_lock(&desc->lock);
- if (desc->chip->retrigger)
- desc->chip->retrigger(irq);
+ data = irq_desc_get_irq_data(desc);
+ chip = irq_data_get_irq_chip(data);
+ if (chip->irq_retrigger) {
+ chip->irq_retrigger(data);
+ __this_cpu_write(vector_irq[vector], VECTOR_RETRIGGERED);
+ }
raw_spin_unlock(&desc->lock);
}
+ if (__this_cpu_read(vector_irq[vector]) != VECTOR_RETRIGGERED)
+ __this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
}
}
#endif
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+static void smp_thermal_vector(void)
+{
+ if (x86_thermal_enabled())
+ intel_thermal_interrupt();
+ else
+ pr_err("CPU%d: Unexpected LVT thermal interrupt!\n",
+ smp_processor_id());
+}
+
+DEFINE_IDTENTRY_SYSVEC(sysvec_thermal)
+{
+ trace_thermal_apic_entry(THERMAL_APIC_VECTOR);
+ inc_irq_stat(irq_thermal_count);
+ smp_thermal_vector();
+ trace_thermal_apic_exit(THERMAL_APIC_VECTOR);
+ apic_eoi();
+}
+#endif
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 10709f29d166..c7a5d2960d57 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (C) 1992, 1998 Linus Torvalds, Ingo Molnar
*
@@ -8,32 +9,29 @@
* io_apic.c.)
*/
-#include <linux/module.h>
#include <linux/seq_file.h>
#include <linux/interrupt.h>
+#include <linux/irq.h>
#include <linux/kernel_stat.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/delay.h>
#include <linux/uaccess.h>
#include <linux/percpu.h>
+#include <linux/mm.h>
#include <asm/apic.h>
+#include <asm/nospec-branch.h>
+#include <asm/softirq_stack.h>
-DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
-EXPORT_PER_CPU_SYMBOL(irq_stat);
+#ifdef CONFIG_DEBUG_STACKOVERFLOW
-DEFINE_PER_CPU(struct pt_regs *, irq_regs);
-EXPORT_PER_CPU_SYMBOL(irq_regs);
+int sysctl_panic_on_stackoverflow __read_mostly;
-#ifdef CONFIG_DEBUG_STACKOVERFLOW
/* Debugging check for stack overflow: is there less than 1KB free? */
-static int check_stack_overflow(void)
+static bool check_stack_overflow(void)
{
- long sp;
-
- __asm__ __volatile__("andl %%esp,%0" :
- "=r" (sp) : "0" (THREAD_SIZE - 1));
+ unsigned long sp = current_stack_pointer & (THREAD_SIZE - 1);
return sp < (sizeof(struct thread_info) + STACK_WARN);
}
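
The masking trick in check_stack_overflow() relies on the stack being THREAD_SIZE-aligned, so sp & (THREAD_SIZE - 1) is the offset of the stack pointer within its own stack. A standalone sketch (plain C; the THREAD_SIZE and STACK_WARN values are assumptions for illustration):

    #include <stdint.h>
    #include <stdio.h>

    #define THREAD_SIZE 8192        /* assumed power-of-two stack size */
    #define STACK_WARN  1024

    /* With a THREAD_SIZE-aligned stack, sp & (THREAD_SIZE - 1) is the offset
     * of sp inside its own stack, i.e. the bytes still free below it on a
     * downward-growing stack. */
    static int stack_low(uintptr_t sp)
    {
            return (sp & (THREAD_SIZE - 1)) < STACK_WARN;
    }

    int main(void)
    {
            printf("%d\n", stack_low(0x7000));      /* offset 0x1000: fine */
            printf("%d\n", stack_low(0x6100));      /* offset 0x100: low   */
            return 0;
    }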
@@ -42,47 +40,39 @@ static void print_stack_overflow(void)
{
printk(KERN_WARNING "low stack detected by irq handler\n");
dump_stack();
+ if (sysctl_panic_on_stackoverflow)
+ panic("low stack detected by irq handler - check messages\n");
}
#else
-static inline int check_stack_overflow(void) { return 0; }
+static inline bool check_stack_overflow(void) { return false; }
static inline void print_stack_overflow(void) { }
#endif
-#ifdef CONFIG_4KSTACKS
-/*
- * per-CPU IRQ handling contexts (thread information and stack)
- */
-union irq_ctx {
- struct thread_info tinfo;
- u32 stack[THREAD_SIZE/sizeof(u32)];
-} __attribute__((aligned(PAGE_SIZE)));
-
-static DEFINE_PER_CPU(union irq_ctx *, hardirq_ctx);
-static DEFINE_PER_CPU(union irq_ctx *, softirq_ctx);
-
-static DEFINE_PER_CPU_PAGE_ALIGNED(union irq_ctx, hardirq_stack);
-static DEFINE_PER_CPU_PAGE_ALIGNED(union irq_ctx, softirq_stack);
+DEFINE_PER_CPU_CACHE_HOT(struct irq_stack *, softirq_stack_ptr);
static void call_on_stack(void *func, void *stack)
{
- asm volatile("xchgl %%ebx,%%esp \n"
- "call *%%edi \n"
- "movl %%ebx,%%esp \n"
- : "=b" (stack)
- : "0" (stack),
- "D"(func)
+ asm volatile("xchgl %[sp], %%esp\n"
+ CALL_NOSPEC
+ "movl %[sp], %%esp"
+ : [sp] "+b" (stack)
+ : [thunk_target] "D" (func)
: "memory", "cc", "edx", "ecx", "eax");
}
-static inline int
-execute_on_irq_stack(int overflow, struct irq_desc *desc, int irq)
+static inline void *current_stack(void)
+{
+ return (void *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+}
+
+static inline bool execute_on_irq_stack(bool overflow, struct irq_desc *desc)
{
- union irq_ctx *curctx, *irqctx;
- u32 *isp, arg1, arg2;
+ struct irq_stack *curstk, *irqstk;
+ u32 *isp, *prev_esp;
- curctx = (union irq_ctx *) current_thread_info();
- irqctx = __get_cpu_var(hardirq_ctx);
+ curstk = (struct irq_stack *) current_stack();
+ irqstk = __this_cpu_read(hardirq_stack_ptr);
/*
* this is where we switch to the IRQ stack. However, if we are
@@ -90,124 +80,78 @@ execute_on_irq_stack(int overflow, struct irq_desc *desc, int irq)
* handler) we can't do that and just have to keep using the
* current stack (which is the irq stack already after all)
*/
- if (unlikely(curctx == irqctx))
- return 0;
+ if (unlikely(curstk == irqstk))
+ return false;
- /* build the stack frame on the IRQ stack */
- isp = (u32 *) ((char *)irqctx + sizeof(*irqctx));
- irqctx->tinfo.task = curctx->tinfo.task;
- irqctx->tinfo.previous_esp = current_stack_pointer;
+ isp = (u32 *) ((char *)irqstk + sizeof(*irqstk));
- /*
- * Copy the softirq bits in preempt_count so that the
- * softirq checks work in the hardirq context.
- */
- irqctx->tinfo.preempt_count =
- (irqctx->tinfo.preempt_count & ~SOFTIRQ_MASK) |
- (curctx->tinfo.preempt_count & SOFTIRQ_MASK);
+ /* Save the next esp at the bottom of the stack */
+ prev_esp = (u32 *)irqstk;
+ *prev_esp = current_stack_pointer;
if (unlikely(overflow))
call_on_stack(print_stack_overflow, isp);
- asm volatile("xchgl %%ebx,%%esp \n"
- "call *%%edi \n"
- "movl %%ebx,%%esp \n"
- : "=a" (arg1), "=d" (arg2), "=b" (isp)
- : "0" (irq), "1" (desc), "2" (isp),
- "D" (desc->handle_irq)
- : "memory", "cc", "ecx");
- return 1;
+ asm volatile("xchgl %[sp], %%esp\n"
+ CALL_NOSPEC
+ "movl %[sp], %%esp"
+ : "+a" (desc), [sp] "+b" (isp)
+ : [thunk_target] "D" (desc->handle_irq)
+ : "memory", "cc", "edx", "ecx");
+ return true;
}
/*
- * allocate per-cpu stacks for hardirq and for softirq processing
+ * Allocate per-cpu stacks for hardirq and softirq processing
*/
-void __cpuinit irq_ctx_init(int cpu)
+int irq_init_percpu_irqstack(unsigned int cpu)
{
- union irq_ctx *irqctx;
-
- if (per_cpu(hardirq_ctx, cpu))
- return;
+ int node = cpu_to_node(cpu);
+ struct page *ph, *ps;
- irqctx = &per_cpu(hardirq_stack, cpu);
- irqctx->tinfo.task = NULL;
- irqctx->tinfo.exec_domain = NULL;
- irqctx->tinfo.cpu = cpu;
- irqctx->tinfo.preempt_count = HARDIRQ_OFFSET;
- irqctx->tinfo.addr_limit = MAKE_MM_SEG(0);
-
- per_cpu(hardirq_ctx, cpu) = irqctx;
-
- irqctx = &per_cpu(softirq_stack, cpu);
- irqctx->tinfo.task = NULL;
- irqctx->tinfo.exec_domain = NULL;
- irqctx->tinfo.cpu = cpu;
- irqctx->tinfo.preempt_count = 0;
- irqctx->tinfo.addr_limit = MAKE_MM_SEG(0);
+ if (per_cpu(hardirq_stack_ptr, cpu))
+ return 0;
- per_cpu(softirq_ctx, cpu) = irqctx;
+ ph = alloc_pages_node(node, THREADINFO_GFP, THREAD_SIZE_ORDER);
+ if (!ph)
+ return -ENOMEM;
+ ps = alloc_pages_node(node, THREADINFO_GFP, THREAD_SIZE_ORDER);
+ if (!ps) {
+ __free_pages(ph, THREAD_SIZE_ORDER);
+ return -ENOMEM;
+ }
- printk(KERN_DEBUG "CPU %u irqstacks, hard=%p soft=%p\n",
- cpu, per_cpu(hardirq_ctx, cpu), per_cpu(softirq_ctx, cpu));
+ per_cpu(hardirq_stack_ptr, cpu) = page_address(ph);
+ per_cpu(softirq_stack_ptr, cpu) = page_address(ps);
+ return 0;
}
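
The error handling in irq_init_percpu_irqstack() follows the usual all-or-nothing pattern for paired allocations: if the second allocation fails, the first is freed before returning. A user-space sketch of the same shape (plain C; alloc_pair() is an illustrative name):

    #include <errno.h>
    #include <stdlib.h>

    static int alloc_pair(void **hard, void **soft, size_t size)
    {
            *hard = malloc(size);
            if (!*hard)
                    return -ENOMEM;

            *soft = malloc(size);
            if (!*soft) {
                    free(*hard);            /* roll back the first allocation */
                    *hard = NULL;
                    return -ENOMEM;
            }
            return 0;
    }

    int main(void)
    {
            void *h, *s;

            if (alloc_pair(&h, &s, 4096))
                    return 1;
            free(h);
            free(s);
            return 0;
    }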
-void irq_ctx_exit(int cpu)
+#ifdef CONFIG_SOFTIRQ_ON_OWN_STACK
+void do_softirq_own_stack(void)
{
- per_cpu(hardirq_ctx, cpu) = NULL;
-}
+ struct irq_stack *irqstk;
+ u32 *isp, *prev_esp;
-asmlinkage void do_softirq(void)
-{
- unsigned long flags;
- struct thread_info *curctx;
- union irq_ctx *irqctx;
- u32 *isp;
-
- if (in_interrupt())
- return;
-
- local_irq_save(flags);
-
- if (local_softirq_pending()) {
- curctx = current_thread_info();
- irqctx = __get_cpu_var(softirq_ctx);
- irqctx->tinfo.task = curctx->task;
- irqctx->tinfo.previous_esp = current_stack_pointer;
-
- /* build the stack frame on the softirq stack */
- isp = (u32 *) ((char *)irqctx + sizeof(*irqctx));
-
- call_on_stack(__do_softirq, isp);
- /*
- * Shouldnt happen, we returned above if in_interrupt():
- */
- WARN_ON_ONCE(softirq_count());
- }
+ irqstk = __this_cpu_read(softirq_stack_ptr);
- local_irq_restore(flags);
-}
+ /* build the stack frame on the softirq stack */
+ isp = (u32 *) ((char *)irqstk + sizeof(*irqstk));
-#else
-static inline int
-execute_on_irq_stack(int overflow, struct irq_desc *desc, int irq) { return 0; }
+ /* Push the previous esp onto the stack */
+ prev_esp = (u32 *)irqstk;
+ *prev_esp = current_stack_pointer;
+
+ call_on_stack(__do_softirq, isp);
+}
#endif
-bool handle_irq(unsigned irq, struct pt_regs *regs)
+void __handle_irq(struct irq_desc *desc, struct pt_regs *regs)
{
- struct irq_desc *desc;
- int overflow;
-
- overflow = check_stack_overflow();
+ bool overflow = check_stack_overflow();
- desc = irq_to_desc(irq);
- if (unlikely(!desc))
- return false;
-
- if (!execute_on_irq_stack(overflow, desc, irq)) {
+ if (user_mode(regs) || !execute_on_irq_stack(overflow, desc)) {
if (unlikely(overflow))
print_stack_overflow();
- desc->handle_irq(irq, desc);
+ generic_handle_irq_desc(desc);
}
-
- return true;
}
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index acf8fbf8fbda..ca78dce39361 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (C) 1992, 1998 Linus Torvalds, Ingo Molnar
*
@@ -10,75 +11,67 @@
#include <linux/kernel_stat.h>
#include <linux/interrupt.h>
+#include <linux/irq.h>
#include <linux/seq_file.h>
-#include <linux/module.h>
#include <linux/delay.h>
#include <linux/ftrace.h>
#include <linux/uaccess.h>
#include <linux/smp.h>
+#include <linux/sched/task_stack.h>
+#include <linux/vmalloc.h>
+
+#include <asm/cpu_entry_area.h>
+#include <asm/softirq_stack.h>
+#include <asm/irq_stack.h>
#include <asm/io_apic.h>
-#include <asm/idle.h>
#include <asm/apic.h>
-DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
-EXPORT_PER_CPU_SYMBOL(irq_stat);
-
-DEFINE_PER_CPU(struct pt_regs *, irq_regs);
-EXPORT_PER_CPU_SYMBOL(irq_regs);
+DEFINE_PER_CPU_CACHE_HOT(bool, hardirq_stack_inuse);
+DEFINE_PER_CPU_PAGE_ALIGNED(struct irq_stack, irq_stack_backing_store) __visible;
+#ifdef CONFIG_VMAP_STACK
/*
- * Probabilistic stack overflow check:
- *
- * Only check the stack in process context, because everything else
- * runs on the big interrupt stacks. Checking reliably is too expensive,
- * so we just check from interrupts.
+ * VMAP the backing store with guard pages
*/
-static inline void stack_overflow_check(struct pt_regs *regs)
+static int map_irq_stack(unsigned int cpu)
{
-#ifdef CONFIG_DEBUG_STACKOVERFLOW
- u64 curbase = (u64)task_stack_page(current);
+ char *stack = (char *)per_cpu_ptr(&irq_stack_backing_store, cpu);
+ struct page *pages[IRQ_STACK_SIZE / PAGE_SIZE];
+ void *va;
+ int i;
- WARN_ONCE(regs->sp >= curbase &&
- regs->sp <= curbase + THREAD_SIZE &&
- regs->sp < curbase + sizeof(struct thread_info) +
- sizeof(struct pt_regs) + 128,
+ for (i = 0; i < IRQ_STACK_SIZE / PAGE_SIZE; i++) {
+ phys_addr_t pa = per_cpu_ptr_to_phys(stack + (i << PAGE_SHIFT));
- "do_IRQ: %s near stack overflow (cur:%Lx,sp:%lx)\n",
- current->comm, curbase, regs->sp);
-#endif
-}
-
-bool handle_irq(unsigned irq, struct pt_regs *regs)
-{
- struct irq_desc *desc;
-
- stack_overflow_check(regs);
+ pages[i] = pfn_to_page(pa >> PAGE_SHIFT);
+ }
- desc = irq_to_desc(irq);
- if (unlikely(!desc))
- return false;
+ va = vmap(pages, IRQ_STACK_SIZE / PAGE_SIZE, VM_MAP, PAGE_KERNEL);
+ if (!va)
+ return -ENOMEM;
- generic_handle_irq_desc(irq, desc);
- return true;
+ /* Store actual TOS to avoid adjustment in the hotpath */
+ per_cpu(hardirq_stack_ptr, cpu) = va + IRQ_STACK_SIZE - 8;
+ return 0;
}
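
The point of the vmap() variant is that the remapped stack is surrounded by guard pages, so an overflow faults instead of silently corrupting the next per-cpu area. A user-space analogue using POSIX mmap()/mprotect() (illustrative only; the kernel uses vmap() as shown above):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            size_t usable = 4 * psz;
            char *base;

            /* Reserve the usable region plus one PROT_NONE page per side. */
            base = mmap(NULL, usable + 2 * psz, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (base == MAP_FAILED)
                    return 1;

            /* Open up only the middle; first and last page stay as guards. */
            if (mprotect(base + psz, usable, PROT_READ | PROT_WRITE))
                    return 1;

            base[psz] = 1;          /* fine: inside the usable region */
            printf("guarded region at %p\n", (void *)(base + psz));
            /* base[0] = 1; would fault - that is the guard page working */
            return 0;
    }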
-
-
-extern void call_softirq(void);
-
-asmlinkage void do_softirq(void)
+#else
+/*
+ * If VMAP stacks are disabled due to KASAN, just use the per cpu
+ * backing store without guard pages.
+ */
+static int map_irq_stack(unsigned int cpu)
{
- __u32 pending;
- unsigned long flags;
+ void *va = per_cpu_ptr(&irq_stack_backing_store, cpu);
- if (in_interrupt())
- return;
+ /* Store actual TOS to avoid adjustment in the hotpath */
+ per_cpu(hardirq_stack_ptr, cpu) = va + IRQ_STACK_SIZE - 8;
+ return 0;
+}
+#endif
- local_irq_save(flags);
- pending = local_softirq_pending();
- /* Switch to interrupt stack */
- if (pending) {
- call_softirq();
- WARN_ON_ONCE(softirq_count());
- }
- local_irq_restore(flags);
+int irq_init_percpu_irqstack(unsigned int cpu)
+{
+ if (per_cpu(hardirq_stack_ptr, cpu))
+ return 0;
+ return map_irq_stack(cpu);
}
diff --git a/arch/x86/kernel/irq_work.c b/arch/x86/kernel/irq_work.c
new file mode 100644
index 000000000000..b0a24deab4a1
--- /dev/null
+++ b/arch/x86/kernel/irq_work.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * x86 specific code for irq_work
+ *
+ * Copyright (C) 2010 Red Hat, Inc., Peter Zijlstra
+ */
+
+#include <linux/kernel.h>
+#include <linux/irq_work.h>
+#include <linux/hardirq.h>
+#include <asm/apic.h>
+#include <asm/idtentry.h>
+#include <asm/trace/irq_vectors.h>
+#include <linux/interrupt.h>
+
+#ifdef CONFIG_X86_LOCAL_APIC
+DEFINE_IDTENTRY_SYSVEC(sysvec_irq_work)
+{
+ apic_eoi();
+ trace_irq_work_entry(IRQ_WORK_VECTOR);
+ inc_irq_stat(apic_irq_work_irqs);
+ irq_work_run();
+ trace_irq_work_exit(IRQ_WORK_VECTOR);
+}
+
+void arch_irq_work_raise(void)
+{
+ if (!arch_irq_work_has_interrupt())
+ return;
+
+ __apic_send_IPI_self(IRQ_WORK_VECTOR);
+ apic_wait_icr_idle();
+}
+#endif
diff --git a/arch/x86/kernel/irqflags.S b/arch/x86/kernel/irqflags.S
new file mode 100644
index 000000000000..fdabd5dda154
--- /dev/null
+++ b/arch/x86/kernel/irqflags.S
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <asm/asm.h>
+#include <linux/export.h>
+#include <linux/linkage.h>
+
+/*
+ * unsigned long native_save_fl(void)
+ */
+.pushsection .noinstr.text, "ax"
+SYM_FUNC_START(native_save_fl)
+ ENDBR
+ pushf
+ pop %_ASM_AX
+ RET
+SYM_FUNC_END(native_save_fl)
+.popsection
+EXPORT_SYMBOL(native_save_fl)
diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c
index 990ae7cfc578..6ab9eac64670 100644
--- a/arch/x86/kernel/irqinit.c
+++ b/arch/x86/kernel/irqinit.c
@@ -1,30 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/linkage.h>
#include <linux/errno.h>
#include <linux/signal.h>
#include <linux/sched.h>
#include <linux/ioport.h>
#include <linux/interrupt.h>
+#include <linux/irq.h>
#include <linux/timex.h>
#include <linux/random.h>
#include <linux/kprobes.h>
#include <linux/init.h>
#include <linux/kernel_stat.h>
-#include <linux/sysdev.h>
+#include <linux/device.h>
#include <linux/bitops.h>
#include <linux/acpi.h>
#include <linux/io.h>
#include <linux/delay.h>
+#include <linux/pgtable.h>
-#include <asm/atomic.h>
-#include <asm/system.h>
+#include <linux/atomic.h>
#include <asm/timer.h>
#include <asm/hw_irq.h>
-#include <asm/pgtable.h>
#include <asm/desc.h>
+#include <asm/io_apic.h>
+#include <asm/acpi.h>
#include <asm/apic.h>
#include <asm/setup.h>
#include <asm/i8259.h>
#include <asm/traps.h>
+#include <asm/fred.h>
+#include <asm/prom.h>
/*
* ISA PIC or low IO-APIC triggered (INTA-cycle or APIC) interrupts:
@@ -42,83 +47,28 @@
* (these are usually mapped into the 0x30-0xff vector range)
*/
-#ifdef CONFIG_X86_32
-/*
- * Note that on a 486, we don't want to do a SIGFPE on an irq13
- * as the irq is unreliable, and exception 16 works correctly
- * (ie as explained in the intel literature). On a 386, you
- * can't use exception 16 due to bad IBM design, so we have to
- * rely on the less exact irq13.
- *
- * Careful.. Not only is IRQ13 unreliable, but it is also
- * leads to races. IBM designers who came up with it should
- * be shot.
- */
-
-static irqreturn_t math_error_irq(int cpl, void *dev_id)
-{
- outb(0, 0xF0);
- if (ignore_fpu_irq || !boot_cpu_data.hard_math)
- return IRQ_NONE;
- math_error(get_irq_regs(), 0, 16);
- return IRQ_HANDLED;
-}
-
-/*
- * New motherboards sometimes make IRQ 13 be a PCI interrupt,
- * so allow interrupt sharing.
- */
-static struct irqaction fpu_irq = {
- .handler = math_error_irq,
- .name = "fpu",
-};
-#endif
-
-/*
- * IRQ2 is cascade interrupt to second interrupt controller
- */
-static struct irqaction irq2 = {
- .handler = no_action,
- .name = "cascade",
-};
-
DEFINE_PER_CPU(vector_irq_t, vector_irq) = {
- [0 ... NR_VECTORS - 1] = -1,
+ [0 ... NR_VECTORS - 1] = VECTOR_UNUSED,
};
-int vector_used_by_percpu_irq(unsigned int vector)
-{
- int cpu;
-
- for_each_online_cpu(cpu) {
- if (per_cpu(vector_irq, cpu)[vector] != -1)
- return 1;
- }
-
- return 0;
-}
-
void __init init_ISA_irqs(void)
{
+ struct irq_chip *chip = legacy_pic->chip;
int i;
-#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
- init_bsp_APIC();
-#endif
- legacy_pic->init(0);
-
/*
- * 16 old-style INTA-cycle interrupts:
+ * Try to set up the through-local-APIC virtual wire mode earlier.
+ *
+	 * On some 32-bit UP machines whose APIC has been disabled by the BIOS
+	 * and then re-enabled by "lapic", boot hangs without this.
*/
- for (i = 0; i < legacy_pic->nr_legacy_irqs; i++) {
- struct irq_desc *desc = irq_to_desc(i);
+ init_bsp_APIC();
- desc->status = IRQ_DISABLED;
- desc->action = NULL;
- desc->depth = 1;
+ legacy_pic->init(0);
- set_irq_chip_and_handler_name(i, &i8259A_chip,
- handle_level_irq, "XT");
+ for (i = 0; i < nr_legacy_irqs(); i++) {
+ irq_set_chip_and_handler(i, chip, handle_level_irq);
+ irq_set_status_flags(i, IRQ_LEVEL);
}
}
@@ -127,142 +77,38 @@ void __init init_IRQ(void)
int i;
/*
- * On cpu 0, Assign IRQ0_VECTOR..IRQ15_VECTOR's to IRQ 0..15.
+ * On cpu 0, Assign ISA_IRQ_VECTOR(irq) to IRQ 0..15.
* If these IRQ's are handled by legacy interrupt-controllers like PIC,
* then this configuration will likely be static after the boot. If
- * these IRQ's are handled by more mordern controllers like IO-APIC,
+ * these IRQs are handled by more modern controllers like IO-APIC,
* then this vector space can be freed and re-used dynamically as the
* irq's migrate etc.
*/
- for (i = 0; i < legacy_pic->nr_legacy_irqs; i++)
- per_cpu(vector_irq, 0)[IRQ0_VECTOR + i] = i;
-
- x86_init.irqs.intr_init();
-}
-
-/*
- * Setup the vector to irq mappings.
- */
-void setup_vector_irq(int cpu)
-{
-#ifndef CONFIG_X86_IO_APIC
- int irq;
-
- /*
- * On most of the platforms, legacy PIC delivers the interrupts on the
- * boot cpu. But there are certain platforms where PIC interrupts are
- * delivered to multiple cpu's. If the legacy IRQ is handled by the
- * legacy PIC, for the new cpu that is coming online, setup the static
- * legacy vector to irq mapping:
- */
- for (irq = 0; irq < legacy_pic->nr_legacy_irqs; irq++)
- per_cpu(vector_irq, cpu)[IRQ0_VECTOR + irq] = irq;
-#endif
-
- __setup_vector_irq(cpu);
-}
-
-static void __init smp_intr_init(void)
-{
-#ifdef CONFIG_SMP
-#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
- /*
- * The reschedule interrupt is a CPU-to-CPU reschedule-helper
- * IPI, driven by wakeup.
- */
- alloc_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt);
-
- /* IPIs for invalidation */
- alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+0, invalidate_interrupt0);
- alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+1, invalidate_interrupt1);
- alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+2, invalidate_interrupt2);
- alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+3, invalidate_interrupt3);
- alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+4, invalidate_interrupt4);
- alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+5, invalidate_interrupt5);
- alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+6, invalidate_interrupt6);
- alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+7, invalidate_interrupt7);
-
- /* IPI for generic function call */
- alloc_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
-
- /* IPI for generic single function call */
- alloc_intr_gate(CALL_FUNCTION_SINGLE_VECTOR,
- call_function_single_interrupt);
-
- /* Low priority IPI to cleanup after moving an irq */
- set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt);
- set_bit(IRQ_MOVE_CLEANUP_VECTOR, used_vectors);
-
- /* IPI used for rebooting/stopping */
- alloc_intr_gate(REBOOT_VECTOR, reboot_interrupt);
-#endif
-#endif /* CONFIG_SMP */
-}
+ for (i = 0; i < nr_legacy_irqs(); i++)
+ per_cpu(vector_irq, 0)[ISA_IRQ_VECTOR(i)] = irq_to_desc(i);
-static void __init apic_intr_init(void)
-{
- smp_intr_init();
-
-#ifdef CONFIG_X86_THERMAL_VECTOR
- alloc_intr_gate(THERMAL_APIC_VECTOR, thermal_interrupt);
-#endif
-#ifdef CONFIG_X86_MCE_THRESHOLD
- alloc_intr_gate(THRESHOLD_APIC_VECTOR, threshold_interrupt);
-#endif
-#if defined(CONFIG_X86_MCE) && defined(CONFIG_X86_LOCAL_APIC)
- alloc_intr_gate(MCE_SELF_VECTOR, mce_self_interrupt);
-#endif
-
-#if defined(CONFIG_X86_64) || defined(CONFIG_X86_LOCAL_APIC)
- /* self generated IPI for local APIC timer */
- alloc_intr_gate(LOCAL_TIMER_VECTOR, apic_timer_interrupt);
-
- /* IPI for X86 platform specific use */
- alloc_intr_gate(X86_PLATFORM_IPI_VECTOR, x86_platform_ipi);
+ BUG_ON(irq_init_percpu_irqstack(smp_processor_id()));
- /* IPI vectors for APIC spurious and error interrupts */
- alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
- alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
-
- /* Performance monitoring interrupts: */
-# ifdef CONFIG_PERF_EVENTS
- alloc_intr_gate(LOCAL_PENDING_VECTOR, perf_pending_interrupt);
-# endif
-
-#endif
+ x86_init.irqs.intr_init();
}
void __init native_init_IRQ(void)
{
- int i;
-
/* Execute any quirks before the call gates are initialised: */
x86_init.irqs.pre_vector_init();
- apic_intr_init();
-
- /*
- * Cover the whole vector space, no vector can escape
- * us. (some of these will be overridden and become
- * 'special' SMP interrupts)
- */
- for (i = FIRST_EXTERNAL_VECTOR; i < NR_VECTORS; i++) {
- /* IA32_SYSCALL_VECTOR could be used in trap_init already. */
- if (!test_bit(i, used_vectors))
- set_intr_gate(i, interrupt[i-FIRST_EXTERNAL_VECTOR]);
- }
+ /* FRED's IRQ path may be used even if FRED isn't fully enabled. */
+ if (IS_ENABLED(CONFIG_X86_FRED))
+ fred_complete_exception_setup();
- if (!acpi_ioapic)
- setup_irq(2, &irq2);
+ if (!cpu_feature_enabled(X86_FEATURE_FRED))
+ idt_setup_apic_and_irq_gates();
-#ifdef CONFIG_X86_32
- /*
- * External FPU? Set up irq13 if so, for
- * original braindamaged IBM FERR coupling.
- */
- if (boot_cpu_data.hard_math && !cpu_has_fpu)
- setup_irq(FPU_IRQ, &fpu_irq);
+ lapic_assign_system_vectors();
- irq_ctx_init(smp_processor_id());
-#endif
+ if (!acpi_ioapic && !of_ioapic && nr_legacy_irqs()) {
+ /* IRQ2 is cascade interrupt to second interrupt controller */
+ if (request_irq(2, no_action, IRQF_NO_THREAD, "cascade", NULL))
+ pr_err("%s: request_irq() failed\n", "cascade");
+ }
}
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
new file mode 100644
index 000000000000..243a769fdd97
--- /dev/null
+++ b/arch/x86/kernel/itmt.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * itmt.c: Support Intel Turbo Boost Max Technology 3.0
+ *
+ * (C) Copyright 2016 Intel Corporation
+ * Author: Tim Chen <tim.c.chen@linux.intel.com>
+ *
+ * On platforms supporting Intel Turbo Boost Max Technology 3.0, (ITMT),
+ * the maximum turbo frequencies of some cores in a CPU package may be
+ * higher than for the other cores in the same package. In that case,
+ * better performance can be achieved by making the scheduler prefer
+ * to run tasks on the CPUs with higher max turbo frequencies.
+ *
+ * This file provides functions and data structures for enabling the
+ * scheduler to favor scheduling on cores that can be boosted to a higher
+ * frequency under ITMT.
+ */
+
+#include <linux/sched.h>
+#include <linux/cpumask.h>
+#include <linux/cpuset.h>
+#include <linux/debugfs.h>
+#include <linux/mutex.h>
+#include <linux/sysctl.h>
+#include <linux/nodemask.h>
+
+static DEFINE_MUTEX(itmt_update_mutex);
+DEFINE_PER_CPU_READ_MOSTLY(int, sched_core_priority);
+
+/* Boolean to track if system has ITMT capabilities */
+static bool __read_mostly sched_itmt_capable;
+
+/*
+ * Boolean to control whether we want to move processes to CPUs capable
+ * of higher turbo frequency on systems supporting Intel Turbo Boost Max
+ * Technology 3.0.
+ *
+ * It can be set via /sys/kernel/debug/x86/sched_itmt_enabled
+ */
+bool __read_mostly sysctl_sched_itmt_enabled;
+
+static ssize_t sched_itmt_enabled_write(struct file *filp,
+ const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ ssize_t result;
+ bool orig;
+
+ guard(mutex)(&itmt_update_mutex);
+
+ orig = sysctl_sched_itmt_enabled;
+ result = debugfs_write_file_bool(filp, ubuf, cnt, ppos);
+
+ if (sysctl_sched_itmt_enabled != orig) {
+ x86_topology_update = true;
+ rebuild_sched_domains();
+ }
+
+ return result;
+}
+
+static int sched_core_priority_show(struct seq_file *s, void *unused)
+{
+ int cpu;
+
+ seq_puts(s, "CPU #\tPriority\n");
+ for_each_possible_cpu(cpu)
+ seq_printf(s, "%d\t%d\n", cpu, arch_asym_cpu_priority(cpu));
+
+ return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(sched_core_priority);
+
+static const struct file_operations dfs_sched_itmt_fops = {
+ .read = debugfs_read_file_bool,
+ .write = sched_itmt_enabled_write,
+ .open = simple_open,
+ .llseek = default_llseek,
+};
+
+static struct dentry *dfs_sched_itmt;
+static struct dentry *dfs_sched_core_prio;
+
+/**
+ * sched_set_itmt_support() - Indicate platform supports ITMT
+ *
+ * This function is used by the OS to indicate to the scheduler that the
+ * platform is capable of supporting the ITMT feature.
+ *
+ * In the current scheme, the pstate driver detects whether the system
+ * is ITMT capable and calls sched_set_itmt_support().
+ *
+ * This must be done only after sched_set_itmt_core_prio()
+ * has been called to set the CPUs' priorities.
+ * It must not be called with the CPU hotplug lock
+ * held, as we need to acquire the lock to rebuild sched domains
+ * later.
+ *
+ * Return: 0 on success
+ */
+int sched_set_itmt_support(void)
+{
+ guard(mutex)(&itmt_update_mutex);
+
+ if (sched_itmt_capable)
+ return 0;
+
+ dfs_sched_itmt = debugfs_create_file_unsafe("sched_itmt_enabled",
+ 0644,
+ arch_debugfs_dir,
+ &sysctl_sched_itmt_enabled,
+ &dfs_sched_itmt_fops);
+ if (IS_ERR_OR_NULL(dfs_sched_itmt)) {
+ dfs_sched_itmt = NULL;
+ return -ENOMEM;
+ }
+
+ dfs_sched_core_prio = debugfs_create_file("sched_core_priority", 0644,
+ arch_debugfs_dir, NULL,
+ &sched_core_priority_fops);
+ if (IS_ERR_OR_NULL(dfs_sched_core_prio)) {
+ dfs_sched_core_prio = NULL;
+ return -ENOMEM;
+ }
+
+ sched_itmt_capable = true;
+
+ sysctl_sched_itmt_enabled = 1;
+
+ x86_topology_update = true;
+ rebuild_sched_domains();
+
+ return 0;
+}
+
+/**
+ * sched_clear_itmt_support() - Revoke platform's support of ITMT
+ *
+ * This function is used by the OS to indicate that it has
+ * revoked the platform's support of ITMT feature.
+ *
+ * It must not be called with the CPU hotplug lock
+ * held, as we need to acquire the lock to rebuild sched domains
+ * later.
+ */
+void sched_clear_itmt_support(void)
+{
+ guard(mutex)(&itmt_update_mutex);
+
+ if (!sched_itmt_capable)
+ return;
+
+ sched_itmt_capable = false;
+
+ debugfs_remove(dfs_sched_itmt);
+ dfs_sched_itmt = NULL;
+ debugfs_remove(dfs_sched_core_prio);
+ dfs_sched_core_prio = NULL;
+
+ if (sysctl_sched_itmt_enabled) {
+ /* disable sched_itmt if we are no longer ITMT capable */
+ sysctl_sched_itmt_enabled = 0;
+ x86_topology_update = true;
+ rebuild_sched_domains();
+ }
+}
+
+int arch_asym_cpu_priority(int cpu)
+{
+ return per_cpu(sched_core_priority, cpu);
+}
+
+/**
+ * sched_set_itmt_core_prio() - Set CPU priority based on ITMT
+ * @prio: Priority of @cpu
+ * @cpu: The CPU number
+ *
+ * The pstate driver will find out the max boost frequency
+ * and call this function to set a priority proportional
+ * to the max boost frequency. CPUs with higher boost
+ * frequency will receive higher priority.
+ *
+ * No need to rebuild sched domain after updating
+ * the CPU priorities. The sched domains have no
+ * dependency on CPU priorities.
+ */
+void sched_set_itmt_core_prio(int prio, int cpu)
+{
+ per_cpu(sched_core_priority, cpu) = prio;
+}
diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
new file mode 100644
index 000000000000..9e9a591a5fec
--- /dev/null
+++ b/arch/x86/kernel/jailhouse.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Jailhouse paravirt_ops implementation
+ *
+ * Copyright (c) Siemens AG, 2015-2017
+ *
+ * Authors:
+ * Jan Kiszka <jan.kiszka@siemens.com>
+ */
+
+#include <linux/acpi_pmtmr.h>
+#include <linux/kernel.h>
+#include <linux/reboot.h>
+#include <linux/serial_8250.h>
+#include <linux/acpi.h>
+#include <asm/apic.h>
+#include <asm/io_apic.h>
+#include <asm/acpi.h>
+#include <asm/cpu.h>
+#include <asm/hypervisor.h>
+#include <asm/i8259.h>
+#include <asm/irqdomain.h>
+#include <asm/pci_x86.h>
+#include <asm/reboot.h>
+#include <asm/setup.h>
+#include <asm/jailhouse_para.h>
+
+static struct jailhouse_setup_data setup_data;
+#define SETUP_DATA_V1_LEN (sizeof(setup_data.hdr) + sizeof(setup_data.v1))
+#define SETUP_DATA_V2_LEN (SETUP_DATA_V1_LEN + sizeof(setup_data.v2))
+
+static unsigned int precalibrated_tsc_khz;
+
+static void jailhouse_setup_irq(unsigned int irq)
+{
+ struct mpc_intsrc mp_irq = {
+ .type = MP_INTSRC,
+ .irqtype = mp_INT,
+ .irqflag = MP_IRQPOL_ACTIVE_HIGH | MP_IRQTRIG_EDGE,
+ .srcbusirq = irq,
+ .dstirq = irq,
+ };
+ mp_save_irq(&mp_irq);
+}
+
+static uint32_t jailhouse_cpuid_base(void)
+{
+ if (boot_cpu_data.cpuid_level < 0 ||
+ !boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ return 0;
+
+ return cpuid_base_hypervisor("Jailhouse\0\0\0", 0);
+}
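
Hypervisor detection via CPUID, as used by jailhouse_cpuid_base(), can be reproduced in user space. A hedged sketch for x86 with GCC/clang's <cpuid.h> (the leaf numbers are architectural; the program itself is only an illustration):

    #include <cpuid.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            unsigned int eax, ebx, ecx, edx, sig[3];
            char name[13] = { 0 };

            /* First check CPUID.1:ECX[31], the "running under a hypervisor"
             * bit - the same check the detection routine above performs via
             * X86_FEATURE_HYPERVISOR. */
            __cpuid(1, eax, ebx, ecx, edx);
            if (!(ecx & (1u << 31))) {
                    puts("no hypervisor");
                    return 0;
            }

            /* The vendor signature lives at leaf 0x40000000 in ebx/ecx/edx. */
            __cpuid(0x40000000, eax, sig[0], sig[1], sig[2]);
            memcpy(name, sig, sizeof(sig));
            printf("hypervisor signature: \"%s\"\n", name);
            return 0;
    }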
+
+static uint32_t __init jailhouse_detect(void)
+{
+ return jailhouse_cpuid_base();
+}
+
+static void jailhouse_get_wallclock(struct timespec64 *now)
+{
+ memset(now, 0, sizeof(*now));
+}
+
+static void __init jailhouse_timer_init(void)
+{
+ lapic_timer_period = setup_data.v1.apic_khz * (1000 / HZ);
+}
+
+static unsigned long jailhouse_get_tsc(void)
+{
+ return precalibrated_tsc_khz;
+}
+
+static void __init jailhouse_x2apic_init(void)
+{
+#ifdef CONFIG_X86_X2APIC
+ if (!x2apic_enabled())
+ return;
+ /*
+ * We do not have access to IR inside Jailhouse non-root cells. So
+ * we have to run in physical mode.
+ */
+ x2apic_phys = 1;
+ /*
+ * This will trigger the switch to apic_x2apic_phys. Empty OEM IDs
+ * ensure that only this APIC driver picks up the call.
+ */
+ default_acpi_madt_oem_check("", "");
+#endif
+}
+
+static void __init jailhouse_parse_smp_config(void)
+{
+ struct ioapic_domain_cfg ioapic_cfg = {
+ .type = IOAPIC_DOMAIN_STRICT,
+ .ops = &mp_ioapic_irqdomain_ops,
+ };
+ unsigned int cpu;
+
+ jailhouse_x2apic_init();
+
+ register_lapic_address(0xfee00000);
+
+ for (cpu = 0; cpu < setup_data.v1.num_cpus; cpu++)
+ topology_register_apic(setup_data.v1.cpu_ids[cpu], CPU_ACPIID_INVALID, true);
+
+ smp_found_config = 1;
+
+ if (setup_data.v1.standard_ioapic) {
+ mp_register_ioapic(0, 0xfec00000, gsi_top, &ioapic_cfg);
+
+ if (IS_ENABLED(CONFIG_SERIAL_8250) &&
+ setup_data.hdr.version < 2) {
+ /* Register 1:1 mapping for legacy UART IRQs 3 and 4 */
+ jailhouse_setup_irq(3);
+ jailhouse_setup_irq(4);
+ }
+ }
+}
+
+static void jailhouse_no_restart(void)
+{
+ pr_notice("Jailhouse: Restart not supported, halting\n");
+ machine_halt();
+}
+
+static int __init jailhouse_pci_arch_init(void)
+{
+ pci_direct_init(1);
+
+ /*
+ * There are no bridges on the virtual PCI root bus under Jailhouse,
+	 * so the only way to discover all devices is a full scan.
+ * Respect any overrides via the command line, though.
+ */
+ if (pcibios_last_bus < 0)
+ pcibios_last_bus = 0xff;
+
+#ifdef CONFIG_PCI_MMCONFIG
+ if (setup_data.v1.pci_mmconfig_base) {
+ pci_mmconfig_add(0, 0, pcibios_last_bus,
+ setup_data.v1.pci_mmconfig_base);
+ pci_mmcfg_arch_init();
+ }
+#endif
+
+ return 0;
+}
+
+#ifdef CONFIG_SERIAL_8250
+static inline bool jailhouse_uart_enabled(unsigned int uart_nr)
+{
+ return setup_data.v2.flags & BIT(uart_nr);
+}
+
+static void jailhouse_serial_fixup(int port, struct uart_port *up,
+ u32 *capabilities)
+{
+ static const u16 pcuart_base[] = {0x3f8, 0x2f8, 0x3e8, 0x2e8};
+ unsigned int n;
+
+ for (n = 0; n < ARRAY_SIZE(pcuart_base); n++) {
+ if (pcuart_base[n] != up->iobase)
+ continue;
+
+ if (jailhouse_uart_enabled(n)) {
+ pr_info("Enabling UART%u (port 0x%lx)\n", n,
+ up->iobase);
+ jailhouse_setup_irq(up->irq);
+ } else {
+ /* Deactivate UART if access isn't allowed */
+ up->iobase = 0;
+ }
+ break;
+ }
+}
+
+static void __init jailhouse_serial_workaround(void)
+{
+ /*
+ * There are flags inside setup_data that indicate availability of
+ * platform UARTs since setup data version 2.
+ *
+	 * In case of version 1, we don't know which UARTs belong to Linux. In
+	 * this case, unconditionally register a 1:1 mapping for legacy UART IRQs
+ * 3 and 4.
+ */
+ if (setup_data.hdr.version > 1)
+ serial8250_set_isa_configurator(jailhouse_serial_fixup);
+}
+#else /* !CONFIG_SERIAL_8250 */
+static inline void jailhouse_serial_workaround(void)
+{
+}
+#endif /* CONFIG_SERIAL_8250 */
+
+static void __init jailhouse_init_platform(void)
+{
+ u64 pa_data = boot_params.hdr.setup_data;
+ unsigned long setup_data_len;
+ struct setup_data header;
+ void *mapping;
+
+ x86_init.irqs.pre_vector_init = x86_init_noop;
+ x86_init.timers.timer_init = jailhouse_timer_init;
+ x86_init.mpparse.find_mptable = x86_init_noop;
+ x86_init.mpparse.early_parse_smp_cfg = x86_init_noop;
+ x86_init.mpparse.parse_smp_cfg = jailhouse_parse_smp_config;
+ x86_init.pci.arch_init = jailhouse_pci_arch_init;
+
+ x86_platform.calibrate_cpu = jailhouse_get_tsc;
+ x86_platform.calibrate_tsc = jailhouse_get_tsc;
+ x86_platform.get_wallclock = jailhouse_get_wallclock;
+ x86_platform.legacy.rtc = 0;
+ x86_platform.legacy.warm_reset = 0;
+ x86_platform.legacy.i8042 = X86_LEGACY_I8042_PLATFORM_ABSENT;
+
+ legacy_pic = &null_legacy_pic;
+
+ machine_ops.emergency_restart = jailhouse_no_restart;
+
+ while (pa_data) {
+ mapping = early_memremap(pa_data, sizeof(header));
+ memcpy(&header, mapping, sizeof(header));
+ early_memunmap(mapping, sizeof(header));
+
+ if (header.type == SETUP_JAILHOUSE)
+ break;
+
+ pa_data = header.next;
+ }
+
+ if (!pa_data)
+ panic("Jailhouse: No valid setup data found");
+
+ /* setup data must at least contain the header */
+ if (header.len < sizeof(setup_data.hdr))
+ goto unsupported;
+
+ pa_data += offsetof(struct setup_data, data);
+ setup_data_len = min_t(unsigned long, sizeof(setup_data),
+ (unsigned long)header.len);
+ mapping = early_memremap(pa_data, setup_data_len);
+ memcpy(&setup_data, mapping, setup_data_len);
+ early_memunmap(mapping, setup_data_len);
+
+ if (setup_data.hdr.version == 0 ||
+ setup_data.hdr.compatible_version !=
+ JAILHOUSE_SETUP_REQUIRED_VERSION ||
+ (setup_data.hdr.version == 1 && header.len < SETUP_DATA_V1_LEN) ||
+ (setup_data.hdr.version >= 2 && header.len < SETUP_DATA_V2_LEN))
+ goto unsupported;
+
+ pmtmr_ioport = setup_data.v1.pm_timer_address;
+ pr_debug("Jailhouse: PM-Timer IO Port: %#x\n", pmtmr_ioport);
+
+ precalibrated_tsc_khz = setup_data.v1.tsc_khz;
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+ pci_probe = 0;
+
+ /*
+	 * Prevent the kernel from complaining about missing ACPI tables - there
+ * are none in a non-root cell.
+ */
+ disable_acpi();
+
+ jailhouse_serial_workaround();
+ return;
+
+unsupported:
+ panic("Jailhouse: Unsupported setup data structure");
+}
+
+bool jailhouse_paravirt(void)
+{
+ return jailhouse_cpuid_base() != 0;
+}
+
+static bool __init jailhouse_x2apic_available(void)
+{
+ /*
+ * The x2APIC is only available if the root cell enabled it. Jailhouse
+ * does not support switching between xAPIC and x2APIC.
+ */
+ return x2apic_enabled();
+}
+
+const struct hypervisor_x86 x86_hyper_jailhouse __refconst = {
+ .name = "Jailhouse",
+ .detect = jailhouse_detect,
+ .init.init_platform = jailhouse_init_platform,
+ .init.x2apic_available = jailhouse_x2apic_available,
+ .ignore_nopv = true,
+};
diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
new file mode 100644
index 000000000000..a7949a54a0ff
--- /dev/null
+++ b/arch/x86/kernel/jump_label.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * jump label x86 support
+ *
+ * Copyright (C) 2009 Jason Baron <jbaron@redhat.com>
+ *
+ */
+#include <linux/jump_label.h>
+#include <linux/memory.h>
+#include <linux/uaccess.h>
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/jhash.h>
+#include <linux/cpu.h>
+#include <asm/kprobes.h>
+#include <asm/alternative.h>
+#include <asm/text-patching.h>
+#include <asm/insn.h>
+
+int arch_jump_entry_size(struct jump_entry *entry)
+{
+ struct insn insn = {};
+
+ insn_decode_kernel(&insn, (void *)jump_entry_code(entry));
+ BUG_ON(insn.length != 2 && insn.length != 5);
+
+ return insn.length;
+}
+
+struct jump_label_patch {
+ const void *code;
+ int size;
+};
+
+static struct jump_label_patch
+__jump_label_patch(struct jump_entry *entry, enum jump_label_type type)
+{
+ const void *expect, *code, *nop;
+ const void *addr, *dest;
+ int size;
+
+ addr = (void *)jump_entry_code(entry);
+ dest = (void *)jump_entry_target(entry);
+
+ size = arch_jump_entry_size(entry);
+ switch (size) {
+ case JMP8_INSN_SIZE:
+ code = text_gen_insn(JMP8_INSN_OPCODE, addr, dest);
+ nop = x86_nops[size];
+ break;
+
+ case JMP32_INSN_SIZE:
+ code = text_gen_insn(JMP32_INSN_OPCODE, addr, dest);
+ nop = x86_nops[size];
+ break;
+
+ default: BUG();
+ }
+
+ if (type == JUMP_LABEL_JMP)
+ expect = nop;
+ else
+ expect = code;
+
+ if (memcmp(addr, expect, size)) {
+ /*
+ * The location is not an op that we were expecting.
+ * Something went wrong. Crash the box, as something could be
+ * corrupting the kernel.
+ */
+		pr_crit("jump_label: Fatal kernel bug, unexpected op at %pS [%p] (%5ph != %5ph) size:%d type:%d\n",
+ addr, addr, addr, expect, size, type);
+ BUG();
+ }
+
+ if (type == JUMP_LABEL_NOP)
+ code = nop;
+
+ return (struct jump_label_patch){.code = code, .size = size};
+}
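
The defensive core of __jump_label_patch() is the verify-before-patch step: compare the bytes at the patch site against the expected instruction and refuse to continue on a mismatch. A minimal sketch (plain C; the five NOP bytes are the conventional x86 5-byte NOP, the JMP rel32 offset is illustrative):

    #include <stdio.h>
    #include <string.h>

    static int patch_site(unsigned char *site, const unsigned char *expect,
                          const unsigned char *repl, size_t len)
    {
            if (memcmp(site, expect, len)) {
                    fprintf(stderr, "unexpected bytes at patch site, aborting\n");
                    return -1;
            }
            memcpy(site, repl, len);
            return 0;
    }

    int main(void)
    {
            unsigned char text[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 }; /* NOP5 */
            unsigned char nop[5]  = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
            unsigned char jmp[5]  = { 0xe9, 0x10, 0x00, 0x00, 0x00 };

            return patch_site(text, nop, jmp, sizeof(text)) ? 1 : 0;
    }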
+
+static __always_inline void
+__jump_label_transform(struct jump_entry *entry,
+ enum jump_label_type type,
+ int init)
+{
+ const struct jump_label_patch jlp = __jump_label_patch(entry, type);
+
+ /*
+ * As long as only a single processor is running and the code is still
+	 * not marked as RO, text_poke_early() can be used; checking that
+	 * system_state is SYSTEM_BOOTING guarantees it. It will be set to
+	 * SYSTEM_SCHEDULING before other cores are awakened and before the
+ * code is write-protected.
+ *
+ * At the time the change is being done, just ignore whether we
+ * are doing nop -> jump or jump -> nop transition, and assume
+ * always nop being the 'currently valid' instruction
+ */
+ if (init || system_state == SYSTEM_BOOTING) {
+ text_poke_early((void *)jump_entry_code(entry), jlp.code, jlp.size);
+ return;
+ }
+
+ smp_text_poke_single((void *)jump_entry_code(entry), jlp.code, jlp.size, NULL);
+}
+
+static void __ref jump_label_transform(struct jump_entry *entry,
+ enum jump_label_type type,
+ int init)
+{
+ mutex_lock(&text_mutex);
+ __jump_label_transform(entry, type, init);
+ mutex_unlock(&text_mutex);
+}
+
+void arch_jump_label_transform(struct jump_entry *entry,
+ enum jump_label_type type)
+{
+ jump_label_transform(entry, type, 0);
+}
+
+bool arch_jump_label_transform_queue(struct jump_entry *entry,
+ enum jump_label_type type)
+{
+ struct jump_label_patch jlp;
+
+ if (system_state == SYSTEM_BOOTING) {
+ /*
+ * Fallback to the non-batching mode.
+ */
+ arch_jump_label_transform(entry, type);
+ return true;
+ }
+
+ mutex_lock(&text_mutex);
+ jlp = __jump_label_patch(entry, type);
+ smp_text_poke_batch_add((void *)jump_entry_code(entry), jlp.code, jlp.size, NULL);
+ mutex_unlock(&text_mutex);
+ return true;
+}
+
+void arch_jump_label_transform_apply(void)
+{
+ mutex_lock(&text_mutex);
+ smp_text_poke_batch_finish();
+ mutex_unlock(&text_mutex);
+}
diff --git a/arch/x86/kernel/k8.c b/arch/x86/kernel/k8.c
deleted file mode 100644
index 0f7bc20cfcde..000000000000
--- a/arch/x86/kernel/k8.c
+++ /dev/null
@@ -1,137 +0,0 @@
-/*
- * Shared support code for AMD K8 northbridges and derivates.
- * Copyright 2006 Andi Kleen, SUSE Labs. Subject to GPLv2.
- */
-#include <linux/types.h>
-#include <linux/slab.h>
-#include <linux/init.h>
-#include <linux/errno.h>
-#include <linux/module.h>
-#include <linux/spinlock.h>
-#include <asm/k8.h>
-
-int num_k8_northbridges;
-EXPORT_SYMBOL(num_k8_northbridges);
-
-static u32 *flush_words;
-
-struct pci_device_id k8_nb_ids[] = {
- { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_K8_NB_MISC) },
- { PCI_DEVICE(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_10H_NB_MISC) },
- {}
-};
-EXPORT_SYMBOL(k8_nb_ids);
-
-struct pci_dev **k8_northbridges;
-EXPORT_SYMBOL(k8_northbridges);
-
-static struct pci_dev *next_k8_northbridge(struct pci_dev *dev)
-{
- do {
- dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev);
- if (!dev)
- break;
- } while (!pci_match_id(&k8_nb_ids[0], dev));
- return dev;
-}
-
-int cache_k8_northbridges(void)
-{
- int i;
- struct pci_dev *dev;
-
- if (num_k8_northbridges)
- return 0;
-
- dev = NULL;
- while ((dev = next_k8_northbridge(dev)) != NULL)
- num_k8_northbridges++;
-
- k8_northbridges = kmalloc((num_k8_northbridges + 1) * sizeof(void *),
- GFP_KERNEL);
- if (!k8_northbridges)
- return -ENOMEM;
-
- if (!num_k8_northbridges) {
- k8_northbridges[0] = NULL;
- return 0;
- }
-
- flush_words = kmalloc(num_k8_northbridges * sizeof(u32), GFP_KERNEL);
- if (!flush_words) {
- kfree(k8_northbridges);
- return -ENOMEM;
- }
-
- dev = NULL;
- i = 0;
- while ((dev = next_k8_northbridge(dev)) != NULL) {
- k8_northbridges[i] = dev;
- pci_read_config_dword(dev, 0x9c, &flush_words[i++]);
- }
- k8_northbridges[i] = NULL;
- return 0;
-}
-EXPORT_SYMBOL_GPL(cache_k8_northbridges);
-
-/* Ignores subdevice/subvendor but as far as I can figure out
- they're useless anyways */
-int __init early_is_k8_nb(u32 device)
-{
- struct pci_device_id *id;
- u32 vendor = device & 0xffff;
- device >>= 16;
- for (id = k8_nb_ids; id->vendor; id++)
- if (vendor == id->vendor && device == id->device)
- return 1;
- return 0;
-}
-
-void k8_flush_garts(void)
-{
- int flushed, i;
- unsigned long flags;
- static DEFINE_SPINLOCK(gart_lock);
-
- /* Avoid races between AGP and IOMMU. In theory it's not needed
- but I'm not sure if the hardware won't lose flush requests
- when another is pending. This whole thing is so expensive anyways
- that it doesn't matter to serialize more. -AK */
- spin_lock_irqsave(&gart_lock, flags);
- flushed = 0;
- for (i = 0; i < num_k8_northbridges; i++) {
- pci_write_config_dword(k8_northbridges[i], 0x9c,
- flush_words[i]|1);
- flushed++;
- }
- for (i = 0; i < num_k8_northbridges; i++) {
- u32 w;
- /* Make sure the hardware actually executed the flush*/
- for (;;) {
- pci_read_config_dword(k8_northbridges[i],
- 0x9c, &w);
- if (!(w & 1))
- break;
- cpu_relax();
- }
- }
- spin_unlock_irqrestore(&gart_lock, flags);
- if (!flushed)
- printk("nothing to flush?\n");
-}
-EXPORT_SYMBOL_GPL(k8_flush_garts);
-
-static __init int init_k8_nbs(void)
-{
- int err = 0;
-
- err = cache_k8_northbridges();
-
- if (err < 0)
- printk(KERN_NOTICE "K8 NB: Cannot enumerate AMD northbridges.\n");
-
- return err;
-}
-
-/* This has to go after the PCI subsystem */
-fs_initcall(init_k8_nbs);
diff --git a/arch/x86/kernel/kdebugfs.c b/arch/x86/kernel/kdebugfs.c
index 8afd9f321f10..e2e89bebcbc3 100644
--- a/arch/x86/kernel/kdebugfs.c
+++ b/arch/x86/kernel/kdebugfs.c
@@ -1,14 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Architecture specific debugfs files
*
* Copyright (C) 2007, Intel Corp.
* Huang Ying <ying.huang@intel.com>
- *
- * This file is released under the GPLv2.
*/
#include <linux/debugfs.h>
#include <linux/uaccess.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/stat.h>
@@ -33,7 +32,6 @@ static ssize_t setup_data_read(struct file *file, char __user *user_buf,
struct setup_data_node *node = file->private_data;
unsigned long remain;
loff_t pos = *ppos;
- struct page *pg;
void *p;
u64 pa;
@@ -46,19 +44,19 @@ static ssize_t setup_data_read(struct file *file, char __user *user_buf,
if (count > node->len - pos)
count = node->len - pos;
- pa = node->paddr + sizeof(struct setup_data) + pos;
- pg = pfn_to_page((pa + count - 1) >> PAGE_SHIFT);
- if (PageHighMem(pg)) {
- p = ioremap_cache(pa, count);
- if (!p)
- return -ENXIO;
- } else
- p = __va(pa);
+ pa = node->paddr + pos;
+
+ /* Is it direct data or invalid indirect one? */
+ if (!(node->type & SETUP_INDIRECT) || node->type == SETUP_INDIRECT)
+ pa += sizeof(struct setup_data);
+
+ p = memremap(pa, count, MEMREMAP_WB);
+ if (!p)
+ return -ENOMEM;
remain = copy_to_user(user_buf, p, count);
- if (PageHighMem(pg))
- iounmap(p);
+ memunmap(p);
if (remain)
return -EFAULT;
@@ -68,96 +66,94 @@ static ssize_t setup_data_read(struct file *file, char __user *user_buf,
return count;
}
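
The pos/count clamping in setup_data_read() is the standard bounded-read pattern: never copy past the end of the backing object and report the number of bytes actually copied. A standalone sketch (plain C; bounded_read() is an illustrative name):

    #include <stdio.h>
    #include <string.h>

    static long bounded_read(char *dst, size_t count, long pos,
                             const char *src, size_t src_len)
    {
            if (pos < 0)
                    return -1;
            if ((size_t)pos >= src_len)
                    return 0;                   /* EOF */
            if (count > src_len - pos)
                    count = src_len - pos;      /* clamp to what is left */
            memcpy(dst, src + pos, count);
            return (long)count;
    }

    int main(void)
    {
            const char data[] = "setup_data";
            char buf[32];
            long n = bounded_read(buf, sizeof(buf), 6, data, strlen(data));

            printf("read %ld bytes: %.*s\n", n, (int)n, buf);
            return 0;
    }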
-static int setup_data_open(struct inode *inode, struct file *file)
-{
- file->private_data = inode->i_private;
-
- return 0;
-}
-
static const struct file_operations fops_setup_data = {
.read = setup_data_read,
- .open = setup_data_open,
+ .open = simple_open,
+ .llseek = default_llseek,
};
-static int __init
+static void __init
create_setup_data_node(struct dentry *parent, int no,
struct setup_data_node *node)
{
- struct dentry *d, *type, *data;
+ struct dentry *d;
char buf[16];
sprintf(buf, "%d", no);
d = debugfs_create_dir(buf, parent);
- if (!d)
- return -ENOMEM;
-
- type = debugfs_create_x32("type", S_IRUGO, d, &node->type);
- if (!type)
- goto err_dir;
- data = debugfs_create_file("data", S_IRUGO, d, node, &fops_setup_data);
- if (!data)
- goto err_type;
-
- return 0;
-
-err_type:
- debugfs_remove(type);
-err_dir:
- debugfs_remove(d);
- return -ENOMEM;
+ debugfs_create_x32("type", S_IRUGO, d, &node->type);
+ debugfs_create_file("data", S_IRUGO, d, node, &fops_setup_data);
}
static int __init create_setup_data_nodes(struct dentry *parent)
{
+ struct setup_indirect *indirect;
struct setup_data_node *node;
struct setup_data *data;
- int error = -ENOMEM;
+ u64 pa_data, pa_next;
struct dentry *d;
- struct page *pg;
- u64 pa_data;
+ int error;
+ u32 len;
int no = 0;
d = debugfs_create_dir("setup_data", parent);
- if (!d)
- return -ENOMEM;
pa_data = boot_params.hdr.setup_data;
while (pa_data) {
node = kmalloc(sizeof(*node), GFP_KERNEL);
- if (!node)
+ if (!node) {
+ error = -ENOMEM;
+ goto err_dir;
+ }
+
+ data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
+ if (!data) {
+ kfree(node);
+ error = -ENOMEM;
goto err_dir;
+ }
+ pa_next = data->next;
- pg = pfn_to_page((pa_data+sizeof(*data)-1) >> PAGE_SHIFT);
- if (PageHighMem(pg)) {
- data = ioremap_cache(pa_data, sizeof(*data));
+ if (data->type == SETUP_INDIRECT) {
+ len = sizeof(*data) + data->len;
+ memunmap(data);
+ data = memremap(pa_data, len, MEMREMAP_WB);
if (!data) {
kfree(node);
- error = -ENXIO;
+ error = -ENOMEM;
goto err_dir;
}
- } else
- data = __va(pa_data);
-
- node->paddr = pa_data;
- node->type = data->type;
- node->len = data->len;
- error = create_setup_data_node(d, no, node);
- pa_data = data->next;
-
- if (PageHighMem(pg))
- iounmap(data);
- if (error)
- goto err_dir;
+
+ indirect = (struct setup_indirect *)data->data;
+
+ if (indirect->type != SETUP_INDIRECT) {
+ node->paddr = indirect->addr;
+ node->type = indirect->type;
+ node->len = indirect->len;
+ } else {
+ node->paddr = pa_data;
+ node->type = data->type;
+ node->len = data->len;
+ }
+ } else {
+ node->paddr = pa_data;
+ node->type = data->type;
+ node->len = data->len;
+ }
+
+ create_setup_data_node(d, no, node);
+ pa_data = pa_next;
+
+ memunmap(data);
no++;
}
return 0;
err_dir:
- debugfs_remove(d);
+ debugfs_remove_recursive(d);
return error;
}
@@ -168,35 +164,18 @@ static struct debugfs_blob_wrapper boot_params_blob = {
static int __init boot_params_kdebugfs_init(void)
{
- struct dentry *dbp, *version, *data;
- int error = -ENOMEM;
-
- dbp = debugfs_create_dir("boot_params", NULL);
- if (!dbp)
- return -ENOMEM;
+ struct dentry *dbp;
+ int error;
- version = debugfs_create_x16("version", S_IRUGO, dbp,
- &boot_params.hdr.version);
- if (!version)
- goto err_dir;
+ dbp = debugfs_create_dir("boot_params", arch_debugfs_dir);
- data = debugfs_create_blob("data", S_IRUGO, dbp,
- &boot_params_blob);
- if (!data)
- goto err_version;
+ debugfs_create_x16("version", S_IRUGO, dbp, &boot_params.hdr.version);
+ debugfs_create_blob("data", S_IRUGO, dbp, &boot_params_blob);
error = create_setup_data_nodes(dbp);
if (error)
- goto err_data;
+ debugfs_remove_recursive(dbp);
- return 0;
-
-err_data:
- debugfs_remove(data);
-err_version:
- debugfs_remove(version);
-err_dir:
- debugfs_remove(dbp);
return error;
}
#endif /* CONFIG_DEBUG_BOOT_PARAMS */
@@ -206,8 +185,6 @@ static int __init arch_kdebugfs_init(void)
int error = 0;
arch_debugfs_dir = debugfs_create_dir("x86", NULL);
- if (!arch_debugfs_dir)
- return -ENOMEM;
#ifdef CONFIG_DEBUG_BOOT_PARAMS
error = boot_params_kdebugfs_init();
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
new file mode 100644
index 000000000000..c3244ac680d1
--- /dev/null
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -0,0 +1,716 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Kexec bzImage loader
+ *
+ * Copyright (C) 2014 Red Hat Inc.
+ * Authors:
+ * Vivek Goyal <vgoyal@redhat.com>
+ */
+
+#define pr_fmt(fmt) "kexec-bzImage64: " fmt
+
+#include <linux/string.h>
+#include <linux/printk.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/kexec.h>
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/libfdt.h>
+#include <linux/of_fdt.h>
+#include <linux/efi.h>
+#include <linux/random.h>
+
+#include <asm/bootparam.h>
+#include <asm/setup.h>
+#include <asm/crash.h>
+#include <asm/efi.h>
+#include <asm/e820/api.h>
+#include <asm/kexec-bzimage64.h>
+
+#define MAX_ELFCOREHDR_STR_LEN 30 /* elfcorehdr=0x<64bit-value> */
+#define MAX_DMCRYPTKEYS_STR_LEN 31 /* dmcryptkeys=0x<64bit-value> */
+
+
+/*
+ * Defines lowest physical address for various segments. Not sure where
+ * exactly these limits came from. Current bzimage64 loader in kexec-tools
+ * uses these, so I am retaining them. They can be changed over time as we gain
+ * more insight.
+ */
+#define MIN_PURGATORY_ADDR 0x3000
+#define MIN_BOOTPARAM_ADDR 0x3000
+#define MIN_KERNEL_LOAD_ADDR 0x100000
+#define MIN_INITRD_LOAD_ADDR 0x1000000
+
+/*
+ * This is a placeholder for all boot-loader-specific data structures, which
+ * get allocated in one call but freed much later during cleanup
+ * time. Right now there is only one field, but it can grow as needed.
+ */
+struct bzimage64_data {
+ /*
+ * Temporary buffer to hold bootparams buffer. This should be
+ * freed once the bootparam segment has been loaded.
+ */
+ void *bootparams_buf;
+};
+
+static int setup_initrd(struct boot_params *params,
+ unsigned long initrd_load_addr, unsigned long initrd_len)
+{
+ params->hdr.ramdisk_image = initrd_load_addr & 0xffffffffUL;
+ params->hdr.ramdisk_size = initrd_len & 0xffffffffUL;
+
+ params->ext_ramdisk_image = initrd_load_addr >> 32;
+ params->ext_ramdisk_size = initrd_len >> 32;
+
+ return 0;
+}
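
[Editorial aside, not part of the patch: setup_initrd() publishes a 64-bit initrd address through two 32-bit halves, the legacy ramdisk_image/ramdisk_size header fields plus the ext_* fields for the upper bits. A minimal sketch of the round trip, assuming nothing beyond standard C; the variable names are illustrative.]

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t addr = 0x123456789000ULL;       /* illustrative load address */
        uint32_t lo = addr & 0xffffffffUL;       /* -> hdr.ramdisk_image */
        uint32_t hi = addr >> 32;                /* -> ext_ramdisk_image */

        /* what a boot-protocol consumer reads back */
        uint64_t joined = ((uint64_t)hi << 32) | lo;
        assert(joined == addr);
        return 0;
    }
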
+
+static int setup_cmdline(struct kimage *image, struct boot_params *params,
+ unsigned long bootparams_load_addr,
+ unsigned long cmdline_offset, char *cmdline,
+ unsigned long cmdline_len)
+{
+ char *cmdline_ptr = ((char *)params) + cmdline_offset;
+ unsigned long cmdline_ptr_phys, len = 0;
+ uint32_t cmdline_low_32, cmdline_ext_32;
+
+ if (image->type == KEXEC_TYPE_CRASH) {
+ len = sprintf(cmdline_ptr,
+ "elfcorehdr=0x%lx ", image->elf_load_addr);
+
+ if (image->dm_crypt_keys_addr != 0)
+ len += sprintf(cmdline_ptr + len,
+ "dmcryptkeys=0x%lx ", image->dm_crypt_keys_addr);
+ }
+ memcpy(cmdline_ptr + len, cmdline, cmdline_len);
+ cmdline_len += len;
+
+ cmdline_ptr[cmdline_len - 1] = '\0';
+
+ kexec_dprintk("Final command line is: %s\n", cmdline_ptr);
+ cmdline_ptr_phys = bootparams_load_addr + cmdline_offset;
+ cmdline_low_32 = cmdline_ptr_phys & 0xffffffffUL;
+ cmdline_ext_32 = cmdline_ptr_phys >> 32;
+
+ params->hdr.cmd_line_ptr = cmdline_low_32;
+ if (cmdline_ext_32)
+ params->ext_cmd_line_ptr = cmdline_ext_32;
+
+ return 0;
+}
+
+static int setup_e820_entries(struct boot_params *params)
+{
+ unsigned int nr_e820_entries;
+
+ nr_e820_entries = e820_table_kexec->nr_entries;
+
+ /* TODO: Pass entries beyond E820_MAX_ENTRIES_ZEROPAGE via bootparams setup_data */
+ if (nr_e820_entries > E820_MAX_ENTRIES_ZEROPAGE)
+ nr_e820_entries = E820_MAX_ENTRIES_ZEROPAGE;
+
+ params->e820_entries = nr_e820_entries;
+ memcpy(&params->e820_table, &e820_table_kexec->entries, nr_e820_entries*sizeof(struct e820_entry));
+
+ return 0;
+}
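
[Editorial aside: the TODO above could be satisfied with a SETUP_E820_EXT setup_data node, which the boot protocol defines for exactly this overflow case. A hedged sketch only; the helper name is hypothetical, the node buffer sd is assumed to be reserved already, and this loader does not implement it.]

    static void fill_e820_ext(struct setup_data *sd)
    {
        unsigned int extra = e820_table_kexec->nr_entries -
                             E820_MAX_ENTRIES_ZEROPAGE;

        sd->type = SETUP_E820_EXT;
        sd->len  = extra * sizeof(struct e820_entry);
        memcpy(sd->data, &e820_table_kexec->entries[E820_MAX_ENTRIES_ZEROPAGE],
               sd->len);
        /* then chain sd into params->hdr.setup_data as the helpers below do */
    }
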
+
+enum { RNG_SEED_LENGTH = 32 };
+
+static void
+setup_rng_seed(struct boot_params *params, unsigned long params_load_addr,
+ unsigned int rng_seed_setup_data_offset)
+{
+ struct setup_data *sd = (void *)params + rng_seed_setup_data_offset;
+ unsigned long setup_data_phys;
+
+ if (!rng_is_initialized())
+ return;
+
+ sd->type = SETUP_RNG_SEED;
+ sd->len = RNG_SEED_LENGTH;
+ get_random_bytes(sd->data, RNG_SEED_LENGTH);
+ setup_data_phys = params_load_addr + rng_seed_setup_data_offset;
+ sd->next = params->hdr.setup_data;
+ params->hdr.setup_data = setup_data_phys;
+}
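
[Editorial aside: setup_rng_seed() and the helpers below all grow boot_params the same way, by prepending a node to the hdr.setup_data singly linked list of physical addresses. A minimal consumer-side sketch; identity-mapped pointers are assumed for brevity, whereas the real kernel walks this list with early ioremap in parse_setup_data().]

    static void walk_setup_data(struct boot_params *bp)
    {
        struct setup_data *sd = (void *)(unsigned long)bp->hdr.setup_data;

        while (sd) {
            pr_info("setup_data: type=%u len=%u\n", sd->type, sd->len);
            sd = (void *)(unsigned long)sd->next;   /* next physical node */
        }
    }
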
+
+#ifdef CONFIG_EFI
+static int setup_efi_info_memmap(struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int efi_map_offset,
+ unsigned int efi_map_sz)
+{
+ void *efi_map = (void *)params + efi_map_offset;
+ unsigned long efi_map_phys_addr = params_load_addr + efi_map_offset;
+ struct efi_info *ei = &params->efi_info;
+
+ if (!efi_map_sz)
+ return 0;
+
+ efi_runtime_map_copy(efi_map, efi_map_sz);
+
+ ei->efi_memmap = efi_map_phys_addr & 0xffffffff;
+ ei->efi_memmap_hi = efi_map_phys_addr >> 32;
+ ei->efi_memmap_size = efi_map_sz;
+
+ return 0;
+}
+
+static int
+prepare_add_efi_setup_data(struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int efi_setup_data_offset)
+{
+ unsigned long setup_data_phys;
+ struct setup_data *sd = (void *)params + efi_setup_data_offset;
+ struct efi_setup_data *esd = (void *)sd + sizeof(struct setup_data);
+
+ esd->fw_vendor = efi_fw_vendor;
+ esd->tables = efi_config_table;
+ esd->smbios = efi.smbios;
+
+ sd->type = SETUP_EFI;
+ sd->len = sizeof(struct efi_setup_data);
+
+ /* Add setup data */
+ setup_data_phys = params_load_addr + efi_setup_data_offset;
+ sd->next = params->hdr.setup_data;
+ params->hdr.setup_data = setup_data_phys;
+
+ return 0;
+}
+
+static int
+setup_efi_state(struct boot_params *params, unsigned long params_load_addr,
+ unsigned int efi_map_offset, unsigned int efi_map_sz,
+ unsigned int efi_setup_data_offset)
+{
+ struct efi_info *current_ei = &boot_params.efi_info;
+ struct efi_info *ei = &params->efi_info;
+
+ if (!efi_enabled(EFI_RUNTIME_SERVICES))
+ return 0;
+
+ if (!current_ei->efi_memmap_size)
+ return 0;
+
+ params->secure_boot = boot_params.secure_boot;
+ ei->efi_loader_signature = current_ei->efi_loader_signature;
+ ei->efi_systab = current_ei->efi_systab;
+ ei->efi_systab_hi = current_ei->efi_systab_hi;
+
+ ei->efi_memdesc_version = current_ei->efi_memdesc_version;
+ ei->efi_memdesc_size = efi_get_runtime_map_desc_size();
+
+ setup_efi_info_memmap(params, params_load_addr, efi_map_offset,
+ efi_map_sz);
+ prepare_add_efi_setup_data(params, params_load_addr,
+ efi_setup_data_offset);
+ return 0;
+}
+#endif /* CONFIG_EFI */
+
+#ifdef CONFIG_OF_FLATTREE
+static void setup_dtb(struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int dtb_setup_data_offset)
+{
+ struct setup_data *sd = (void *)params + dtb_setup_data_offset;
+ unsigned long setup_data_phys, dtb_len;
+
+ dtb_len = fdt_totalsize(initial_boot_params);
+ sd->type = SETUP_DTB;
+ sd->len = dtb_len;
+
+ /* Carry over current boot DTB with setup_data */
+ memcpy(sd->data, initial_boot_params, dtb_len);
+
+ /* Add setup data */
+ setup_data_phys = params_load_addr + dtb_setup_data_offset;
+ sd->next = params->hdr.setup_data;
+ params->hdr.setup_data = setup_data_phys;
+}
+#endif /* CONFIG_OF_FLATTREE */
+
+static void
+setup_ima_state(const struct kimage *image, struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int ima_setup_data_offset)
+{
+#ifdef CONFIG_IMA_KEXEC
+ struct setup_data *sd = (void *)params + ima_setup_data_offset;
+ unsigned long setup_data_phys;
+ struct ima_setup_data *ima;
+
+ if (!image->ima_buffer_size)
+ return;
+
+ sd->type = SETUP_IMA;
+ sd->len = sizeof(*ima);
+
+ ima = (void *)sd + sizeof(struct setup_data);
+ ima->addr = image->ima_buffer_addr;
+ ima->size = image->ima_buffer_size;
+
+ /* Add setup data */
+ setup_data_phys = params_load_addr + ima_setup_data_offset;
+ sd->next = params->hdr.setup_data;
+ params->hdr.setup_data = setup_data_phys;
+#endif /* CONFIG_IMA_KEXEC */
+}
+
+static void setup_kho(const struct kimage *image, struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int setup_data_offset)
+{
+ struct setup_data *sd = (void *)params + setup_data_offset;
+ struct kho_data *kho = (void *)sd + sizeof(*sd);
+
+ if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER))
+ return;
+
+ sd->type = SETUP_KEXEC_KHO;
+ sd->len = sizeof(struct kho_data);
+
+ /* Only add if we have all KHO images in place */
+ if (!image->kho.fdt || !image->kho.scratch)
+ return;
+
+ /* Add setup data */
+ kho->fdt_addr = image->kho.fdt;
+ kho->fdt_size = PAGE_SIZE;
+ kho->scratch_addr = image->kho.scratch->mem;
+ kho->scratch_size = image->kho.scratch->bufsz;
+ sd->next = params->hdr.setup_data;
+ params->hdr.setup_data = params_load_addr + setup_data_offset;
+}
+
+static int
+setup_boot_parameters(struct kimage *image, struct boot_params *params,
+ unsigned long params_load_addr,
+ unsigned int efi_map_offset, unsigned int efi_map_sz,
+ unsigned int setup_data_offset)
+{
+ unsigned int nr_e820_entries;
+ unsigned long long mem_k, start, end;
+ int i, ret = 0;
+
+ /* Get subarch from existing bootparams */
+ params->hdr.hardware_subarch = boot_params.hdr.hardware_subarch;
+
+ /* Copying screen_info will do? */
+ memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));
+
+ /* Fill in memsize later */
+ params->screen_info.ext_mem_k = 0;
+ params->alt_mem_k = 0;
+
+ /* Always fill in RSDP: it is either 0 or a valid value */
+ params->acpi_rsdp_addr = boot_params.acpi_rsdp_addr;
+
+ /* Default APM info */
+ memset(&params->apm_bios_info, 0, sizeof(params->apm_bios_info));
+
+ /* Default drive info */
+ memset(&params->hd0_info, 0, sizeof(params->hd0_info));
+ memset(&params->hd1_info, 0, sizeof(params->hd1_info));
+
+#ifdef CONFIG_CRASH_DUMP
+ if (image->type == KEXEC_TYPE_CRASH) {
+ ret = crash_setup_memmap_entries(image, params);
+ if (ret)
+ return ret;
+ } else
+#endif
+ setup_e820_entries(params);
+
+ nr_e820_entries = params->e820_entries;
+
+ kexec_dprintk("E820 memmap:\n");
+ for (i = 0; i < nr_e820_entries; i++) {
+ kexec_dprintk("%016llx-%016llx (%d)\n",
+ params->e820_table[i].addr,
+ params->e820_table[i].addr + params->e820_table[i].size - 1,
+ params->e820_table[i].type);
+ if (params->e820_table[i].type != E820_TYPE_RAM)
+ continue;
+ start = params->e820_table[i].addr;
+ end = params->e820_table[i].addr + params->e820_table[i].size - 1;
+
+ if ((start <= 0x100000) && end > 0x100000) {
+ mem_k = (end >> 10) - (0x100000 >> 10);
+ params->screen_info.ext_mem_k = mem_k;
+ params->alt_mem_k = mem_k;
+ if (mem_k > 0xfc00)
+ params->screen_info.ext_mem_k = 0xfc00; /* 64M */
+ if (mem_k > 0xffffffff)
+ params->alt_mem_k = 0xffffffff;
+ }
+ }
+
+#ifdef CONFIG_EFI
+ /* Setup EFI state */
+ setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
+ setup_data_offset);
+ setup_data_offset += sizeof(struct setup_data) +
+ sizeof(struct efi_setup_data);
+#endif
+
+#ifdef CONFIG_OF_FLATTREE
+ if (image->force_dtb && initial_boot_params) {
+ setup_dtb(params, params_load_addr, setup_data_offset);
+ setup_data_offset += sizeof(struct setup_data) +
+ fdt_totalsize(initial_boot_params);
+ } else {
+ pr_debug("Not carrying over DTB, force_dtb = %d\n",
+ image->force_dtb);
+ }
+#endif
+
+ if (IS_ENABLED(CONFIG_IMA_KEXEC)) {
+ /* Setup IMA log buffer state */
+ setup_ima_state(image, params, params_load_addr,
+ setup_data_offset);
+ setup_data_offset += sizeof(struct setup_data) +
+ sizeof(struct ima_setup_data);
+ }
+
+ if (IS_ENABLED(CONFIG_KEXEC_HANDOVER)) {
+ /* Setup space to store preservation metadata */
+ setup_kho(image, params, params_load_addr, setup_data_offset);
+ setup_data_offset += sizeof(struct setup_data) +
+ sizeof(struct kho_data);
+ }
+
+ /* Setup RNG seed */
+ setup_rng_seed(params, params_load_addr, setup_data_offset);
+
+ /* Setup EDD info */
+ memcpy(params->eddbuf, boot_params.eddbuf,
+ EDDMAXNR * sizeof(struct edd_info));
+ params->eddbuf_entries = boot_params.eddbuf_entries;
+
+ memcpy(params->edd_mbr_sig_buffer, boot_params.edd_mbr_sig_buffer,
+ EDD_MBR_SIG_MAX * sizeof(unsigned int));
+
+ return ret;
+}
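
[Editorial worked example for the E820 loop above, with a single RAM range spanning 1 MiB..8 GiB, i.e. end = 0x1ffffffff:]

    mem_k     = (0x1ffffffff >> 10) - (0x100000 >> 10)
              = 0x7fffff - 0x400 = 8387583 KiB    (memory above 1 MiB)
    ext_mem_k = 0xfc00                            (legacy 16-bit field saturates
                                                   at 63 MiB above 1 MiB)
    alt_mem_k = 8387583                           (fits in 32 bits, no clamp)
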
+
+static int bzImage64_probe(const char *buf, unsigned long len)
+{
+ int ret = -ENOEXEC;
+ struct setup_header *header;
+
+ /* kernel should be at least two sectors long */
+ if (len < 2 * 512) {
+ pr_err("File is too short to be a bzImage\n");
+ return ret;
+ }
+
+ header = (struct setup_header *)(buf + offsetof(struct boot_params, hdr));
+ if (memcmp((char *)&header->header, "HdrS", 4) != 0) {
+ pr_err("Not a bzImage\n");
+ return ret;
+ }
+
+ if (header->boot_flag != 0xAA55) {
+ pr_err("No x86 boot sector present\n");
+ return ret;
+ }
+
+ if (header->version < 0x020C) {
+ pr_err("Must be at least protocol version 2.12\n");
+ return ret;
+ }
+
+ if (!(header->loadflags & LOADED_HIGH)) {
+ pr_err("zImage not a bzImage\n");
+ return ret;
+ }
+
+ if (!(header->xloadflags & XLF_KERNEL_64)) {
+ pr_err("Not a bzImage64. XLF_KERNEL_64 is not set.\n");
+ return ret;
+ }
+
+ if (!(header->xloadflags & XLF_CAN_BE_LOADED_ABOVE_4G)) {
+ pr_err("XLF_CAN_BE_LOADED_ABOVE_4G is not set.\n");
+ return ret;
+ }
+
+ /*
+ * Can't handle 32-bit EFI, as it does not allow loading the kernel
+ * above 4G. This should be handled by a 32-bit bzImage loader.
+ */
+ if (efi_enabled(EFI_RUNTIME_SERVICES) && !efi_enabled(EFI_64BIT)) {
+ pr_debug("EFI is 32 bit. Can't load kernel above 4G.\n");
+ return ret;
+ }
+
+ if (!(header->xloadflags & XLF_5LEVEL) && pgtable_l5_enabled()) {
+ pr_err("bzImage cannot handle 5-level paging mode.\n");
+ return ret;
+ }
+
+ /* I've got a bzImage */
+ pr_debug("It's a relocatable bzImage64\n");
+ ret = 0;
+
+ return ret;
+}
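
[Editorial aside: the same header checks can be reproduced from userspace to sanity-check an image before handing it to kexec_file_load(). A hypothetical checker follows; the offsets come from the boot protocol (Documentation/arch/x86/boot.rst), and the function covers only a simplified subset of bzImage64_probe()'s tests.]

    #include <stdint.h>
    #include <string.h>

    static int looks_like_bzimage64(const uint8_t *buf, unsigned long len)
    {
        uint16_t version, xloadflags;

        if (len < 2 * 512)
            return 0;                        /* shorter than two sectors */
        if (memcmp(buf + 0x202, "HdrS", 4))
            return 0;                        /* no setup header magic */
        if (buf[0x1fe] != 0x55 || buf[0x1ff] != 0xaa)
            return 0;                        /* no boot sector signature */
        version = buf[0x206] | (buf[0x207] << 8);
        if (version < 0x020c)
            return 0;                        /* need protocol >= 2.12 */
        if (!(buf[0x211] & 0x01))
            return 0;                        /* LOADED_HIGH not set */
        xloadflags = buf[0x236] | (buf[0x237] << 8);
        /* need XLF_KERNEL_64 (bit 0) and XLF_CAN_BE_LOADED_ABOVE_4G (bit 1) */
        return (xloadflags & 0x01) && (xloadflags & 0x02);
    }
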
+
+static void *bzImage64_load(struct kimage *image, char *kernel,
+ unsigned long kernel_len, char *initrd,
+ unsigned long initrd_len, char *cmdline,
+ unsigned long cmdline_len)
+{
+
+ struct setup_header *header;
+ int setup_sects, kern16_size, ret = 0;
+ unsigned long setup_header_size, params_cmdline_sz;
+ struct boot_params *params;
+ unsigned long bootparam_load_addr, kernel_load_addr, initrd_load_addr;
+ struct bzimage64_data *ldata;
+ struct kexec_entry64_regs regs64;
+ void *stack;
+ unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
+ unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
+ struct kexec_buf kbuf = { .image = image, .buf_max = ULONG_MAX,
+ .top_down = true };
+ struct kexec_buf pbuf = { .image = image, .buf_min = MIN_PURGATORY_ADDR,
+ .buf_max = ULONG_MAX, .top_down = true };
+
+ header = (struct setup_header *)(kernel + setup_hdr_offset);
+ setup_sects = header->setup_sects;
+ if (setup_sects == 0)
+ setup_sects = 4;
+
+ kern16_size = (setup_sects + 1) * 512;
+ if (kernel_len < kern16_size) {
+ pr_err("bzImage truncated\n");
+ return ERR_PTR(-ENOEXEC);
+ }
+
+ if (cmdline_len > header->cmdline_size) {
+ pr_err("Kernel command line too long\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ /*
+ * In case of crash dump, we will append elfcorehdr=<addr> to
+ * command line. Make sure it does not overflow
+ */
+ if (cmdline_len + MAX_ELFCOREHDR_STR_LEN > header->cmdline_size) {
+ pr_err("Appending elfcorehdr=<addr> to command line exceeds maximum allowed length\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+#ifdef CONFIG_CRASH_DUMP
+ /* Allocate and load backup region */
+ if (image->type == KEXEC_TYPE_CRASH) {
+ ret = crash_load_segments(image);
+ if (ret)
+ return ERR_PTR(ret);
+ ret = crash_load_dm_crypt_keys(image);
+ if (ret == -ENOENT) {
+ kexec_dprintk("No dm crypt key to load\n");
+ } else if (ret) {
+ pr_err("Failed to load dm crypt keys\n");
+ return ERR_PTR(ret);
+ }
+ if (image->dm_crypt_keys_addr &&
+ cmdline_len + MAX_ELFCOREHDR_STR_LEN + MAX_DMCRYPTKEYS_STR_LEN >
+ header->cmdline_size) {
+ pr_err("Appending dmcryptkeys=<addr> to command line exceeds maximum allowed length\n");
+ return ERR_PTR(-EINVAL);
+ }
+ }
+#endif
+
+ /*
+ * Load purgatory. For a 64-bit entry point, the purgatory code can
+ * be anywhere.
+ */
+ ret = kexec_load_purgatory(image, &pbuf);
+ if (ret) {
+ pr_err("Loading purgatory failed\n");
+ return ERR_PTR(ret);
+ }
+
+ kexec_dprintk("Loaded purgatory at 0x%lx\n", pbuf.mem);
+
+ /*
+ * Load bootparams, the cmdline, and space for the EFI data.
+ *
+ * Allocate memory for multiple data structures together so
+ * that they can all go in a single area/segment and we don't
+ * have to create a separate segment for each. Keeps things a
+ * little bit simpler.
+ */
+ efi_map_sz = efi_get_runtime_map_size();
+ params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
+ MAX_ELFCOREHDR_STR_LEN;
+ if (image->dm_crypt_keys_addr)
+ params_cmdline_sz += MAX_DMCRYPTKEYS_STR_LEN;
+ params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
+ kbuf.bufsz = params_cmdline_sz + ALIGN(efi_map_sz, 16) +
+ sizeof(struct setup_data) +
+ sizeof(struct efi_setup_data) +
+ sizeof(struct setup_data) +
+ RNG_SEED_LENGTH;
+
+#ifdef CONFIG_OF_FLATTREE
+ if (image->force_dtb && initial_boot_params)
+ kbuf.bufsz += sizeof(struct setup_data) +
+ fdt_totalsize(initial_boot_params);
+#endif
+
+ if (IS_ENABLED(CONFIG_IMA_KEXEC))
+ kbuf.bufsz += sizeof(struct setup_data) +
+ sizeof(struct ima_setup_data);
+
+ if (IS_ENABLED(CONFIG_KEXEC_HANDOVER))
+ kbuf.bufsz += sizeof(struct setup_data) +
+ sizeof(struct kho_data);
+
+ params = kvzalloc(kbuf.bufsz, GFP_KERNEL);
+ if (!params)
+ return ERR_PTR(-ENOMEM);
+ efi_map_offset = params_cmdline_sz;
+ efi_setup_data_offset = efi_map_offset + ALIGN(efi_map_sz, 16);
+
+ /*
+ * Copy setup header onto bootparams. The byte at offset 0x0201 holds
+ * the displacement of the jump at 0x0200, so the setup header ends at
+ * 0x0202 + that byte; see Documentation/arch/x86/boot.rst.
+ */
+ setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
+
+ /* Is there a limit on setup header size? */
+ memcpy(&params->hdr, (kernel + setup_hdr_offset), setup_header_size);
+
+ kbuf.buffer = params;
+ kbuf.memsz = kbuf.bufsz;
+ kbuf.buf_align = 16;
+ kbuf.buf_min = MIN_BOOTPARAM_ADDR;
+ ret = kexec_add_buffer(&kbuf);
+ if (ret)
+ goto out_free_params;
+ bootparam_load_addr = kbuf.mem;
+ kexec_dprintk("Loaded boot_param, command line and misc at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ bootparam_load_addr, kbuf.bufsz, kbuf.memsz);
+
+ /* Load kernel */
+ kbuf.buffer = kernel + kern16_size;
+ kbuf.bufsz = kernel_len - kern16_size;
+ kbuf.memsz = PAGE_ALIGN(header->init_size);
+ kbuf.buf_align = header->kernel_alignment;
+ if (header->pref_address < MIN_KERNEL_LOAD_ADDR)
+ kbuf.buf_min = MIN_KERNEL_LOAD_ADDR;
+ else
+ kbuf.buf_min = header->pref_address;
+ kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
+ ret = kexec_add_buffer(&kbuf);
+ if (ret)
+ goto out_free_params;
+ kernel_load_addr = kbuf.mem;
+
+ kexec_dprintk("Loaded 64bit kernel at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ kernel_load_addr, kbuf.bufsz, kbuf.memsz);
+
+ /* Load initrd high */
+ if (initrd) {
+ kbuf.buffer = initrd;
+ kbuf.bufsz = kbuf.memsz = initrd_len;
+ kbuf.buf_align = PAGE_SIZE;
+ kbuf.buf_min = MIN_INITRD_LOAD_ADDR;
+ kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
+ ret = kexec_add_buffer(&kbuf);
+ if (ret)
+ goto out_free_params;
+ initrd_load_addr = kbuf.mem;
+
+ kexec_dprintk("Loaded initrd at 0x%lx bufsz=0x%lx memsz=0x%lx\n",
+ initrd_load_addr, initrd_len, initrd_len);
+
+ setup_initrd(params, initrd_load_addr, initrd_len);
+ }
+
+ setup_cmdline(image, params, bootparam_load_addr,
+ sizeof(struct boot_params), cmdline, cmdline_len);
+
+ /* bootloader info. Do we need a separate ID for kexec kernel loader? */
+ params->hdr.type_of_loader = 0x0D << 4;
+ params->hdr.loadflags = 0;
+
+ /* Setup purgatory regs for entry */
+ ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+ sizeof(regs64), 1);
+ if (ret)
+ goto out_free_params;
+
+ regs64.rbx = 0; /* Bootstrap Processor */
+ regs64.rsi = bootparam_load_addr;
+ regs64.rip = kernel_load_addr + 0x200;
+ stack = kexec_purgatory_get_symbol_addr(image, "stack_end");
+ if (IS_ERR(stack)) {
+ pr_err("Could not find address of symbol stack_end\n");
+ ret = -EINVAL;
+ goto out_free_params;
+ }
+
+ regs64.rsp = (unsigned long)stack;
+ ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+ sizeof(regs64), 0);
+ if (ret)
+ goto out_free_params;
+
+ ret = setup_boot_parameters(image, params, bootparam_load_addr,
+ efi_map_offset, efi_map_sz,
+ efi_setup_data_offset);
+ if (ret)
+ goto out_free_params;
+
+ /* Allocate loader specific data */
+ ldata = kzalloc(sizeof(struct bzimage64_data), GFP_KERNEL);
+ if (!ldata) {
+ ret = -ENOMEM;
+ goto out_free_params;
+ }
+
+ /*
+ * Store a pointer to params so that it can be freed once the params
+ * segment has been loaded and its contents have been copied
+ * somewhere else.
+ */
+ ldata->bootparams_buf = params;
+ return ldata;
+
+out_free_params:
+ kvfree(params);
+ return ERR_PTR(ret);
+}
+
+/* This cleanup function is called after various segments have been loaded */
+static int bzImage64_cleanup(void *loader_data)
+{
+ struct bzimage64_data *ldata = loader_data;
+
+ if (!ldata)
+ return 0;
+
+ kvfree(ldata->bootparams_buf);
+ ldata->bootparams_buf = NULL;
+
+ return 0;
+}
+
+const struct kexec_file_ops kexec_bzImage64_ops = {
+ .probe = bzImage64_probe,
+ .load = bzImage64_load,
+ .cleanup = bzImage64_cleanup,
+#ifdef CONFIG_KEXEC_BZIMAGE_VERIFY_SIG
+ .verify_sig = kexec_kernel_verify_pe_sig,
+#endif
+};
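
[Editorial note: kexec_file_load() discovers this loader through the per-architecture loader table rather than by a direct call. In current trees it is registered roughly like this, from arch/x86/kernel/machine_kexec_64.c; shown here for context only, not as part of this patch.]

    const struct kexec_file_ops * const kexec_file_loaders[] = {
            &kexec_bzImage64_ops,
            NULL
    };
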
diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 01ab17ae2ae7..8b1a9733d13e 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -1,14 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/*
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License as published by the
- * Free Software Foundation; either version 2, or (at your option) any
- * later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- *
*/
/*
@@ -26,7 +17,7 @@
* Updated by: Tom Rini <trini@kernel.crashing.org>
* Updated by: Jason Wessel <jason.wessel@windriver.com>
* Modified for 386 by Jim Kingdon, Cygnus Support.
- * Origianl kgdb, compatibility with 2.1.xx kernel by
+ * Original kgdb, compatibility with 2.1.xx kernel by
* David Grothe <dave@gcom.com>
* Integrated into 2.2.5 kernel by Tigran Aivazian <tigran@sco.com>
* X86_64 changes from Andi Kleen's patch merged by Jim Houston
@@ -39,65 +30,101 @@
#include <linux/sched.h>
#include <linux/delay.h>
#include <linux/kgdb.h>
-#include <linux/init.h>
#include <linux/smp.h>
#include <linux/nmi.h>
#include <linux/hw_breakpoint.h>
+#include <linux/uaccess.h>
+#include <linux/memory.h>
+#include <asm/text-patching.h>
#include <asm/debugreg.h>
#include <asm/apicdef.h>
-#include <asm/system.h>
#include <asm/apic.h>
+#include <asm/nmi.h>
+#include <asm/switch_to.h>
-/**
- * pt_regs_to_gdb_regs - Convert ptrace regs to GDB regs
- * @gdb_regs: A pointer to hold the registers in the order GDB wants.
- * @regs: The &struct pt_regs of the current process.
- *
- * Convert the pt_regs in @regs into the format for registers that
- * GDB expects, stored in @gdb_regs.
- */
-void pt_regs_to_gdb_regs(unsigned long *gdb_regs, struct pt_regs *regs)
+struct dbg_reg_def_t dbg_reg_def[DBG_MAX_REG_NUM] =
{
-#ifndef CONFIG_X86_32
- u32 *gdb_regs32 = (u32 *)gdb_regs;
+#ifdef CONFIG_X86_32
+ { "ax", 4, offsetof(struct pt_regs, ax) },
+ { "cx", 4, offsetof(struct pt_regs, cx) },
+ { "dx", 4, offsetof(struct pt_regs, dx) },
+ { "bx", 4, offsetof(struct pt_regs, bx) },
+ { "sp", 4, offsetof(struct pt_regs, sp) },
+ { "bp", 4, offsetof(struct pt_regs, bp) },
+ { "si", 4, offsetof(struct pt_regs, si) },
+ { "di", 4, offsetof(struct pt_regs, di) },
+ { "ip", 4, offsetof(struct pt_regs, ip) },
+ { "flags", 4, offsetof(struct pt_regs, flags) },
+ { "cs", 4, offsetof(struct pt_regs, cs) },
+ { "ss", 4, offsetof(struct pt_regs, ss) },
+ { "ds", 4, offsetof(struct pt_regs, ds) },
+ { "es", 4, offsetof(struct pt_regs, es) },
+#else
+ { "ax", 8, offsetof(struct pt_regs, ax) },
+ { "bx", 8, offsetof(struct pt_regs, bx) },
+ { "cx", 8, offsetof(struct pt_regs, cx) },
+ { "dx", 8, offsetof(struct pt_regs, dx) },
+ { "si", 8, offsetof(struct pt_regs, si) },
+ { "di", 8, offsetof(struct pt_regs, di) },
+ { "bp", 8, offsetof(struct pt_regs, bp) },
+ { "sp", 8, offsetof(struct pt_regs, sp) },
+ { "r8", 8, offsetof(struct pt_regs, r8) },
+ { "r9", 8, offsetof(struct pt_regs, r9) },
+ { "r10", 8, offsetof(struct pt_regs, r10) },
+ { "r11", 8, offsetof(struct pt_regs, r11) },
+ { "r12", 8, offsetof(struct pt_regs, r12) },
+ { "r13", 8, offsetof(struct pt_regs, r13) },
+ { "r14", 8, offsetof(struct pt_regs, r14) },
+ { "r15", 8, offsetof(struct pt_regs, r15) },
+ { "ip", 8, offsetof(struct pt_regs, ip) },
+ { "flags", 4, offsetof(struct pt_regs, flags) },
+ { "cs", 4, offsetof(struct pt_regs, cs) },
+ { "ss", 4, offsetof(struct pt_regs, ss) },
+ { "ds", 4, -1 },
+ { "es", 4, -1 },
#endif
- gdb_regs[GDB_AX] = regs->ax;
- gdb_regs[GDB_BX] = regs->bx;
- gdb_regs[GDB_CX] = regs->cx;
- gdb_regs[GDB_DX] = regs->dx;
- gdb_regs[GDB_SI] = regs->si;
- gdb_regs[GDB_DI] = regs->di;
- gdb_regs[GDB_BP] = regs->bp;
- gdb_regs[GDB_PC] = regs->ip;
+ { "fs", 4, -1 },
+ { "gs", 4, -1 },
+};
+
+int dbg_set_reg(int regno, void *mem, struct pt_regs *regs)
+{
+ if (
#ifdef CONFIG_X86_32
- gdb_regs[GDB_PS] = regs->flags;
- gdb_regs[GDB_DS] = regs->ds;
- gdb_regs[GDB_ES] = regs->es;
- gdb_regs[GDB_CS] = regs->cs;
- gdb_regs[GDB_FS] = 0xFFFF;
- gdb_regs[GDB_GS] = 0xFFFF;
- if (user_mode_vm(regs)) {
- gdb_regs[GDB_SS] = regs->ss;
- gdb_regs[GDB_SP] = regs->sp;
- } else {
- gdb_regs[GDB_SS] = __KERNEL_DS;
- gdb_regs[GDB_SP] = kernel_stack_pointer(regs);
+ regno == GDB_SS || regno == GDB_FS || regno == GDB_GS ||
+#endif
+ regno == GDB_SP || regno == GDB_ORIG_AX)
+ return 0;
+
+ if (dbg_reg_def[regno].offset != -1)
+ memcpy((void *)regs + dbg_reg_def[regno].offset, mem,
+ dbg_reg_def[regno].size);
+ return 0;
+}
+
+char *dbg_get_reg(int regno, void *mem, struct pt_regs *regs)
+{
+ if (regno == GDB_ORIG_AX) {
+ memcpy(mem, &regs->orig_ax, sizeof(regs->orig_ax));
+ return "orig_ax";
+ }
+ if (regno >= DBG_MAX_REG_NUM || regno < 0)
+ return NULL;
+
+ if (dbg_reg_def[regno].offset != -1)
+ memcpy(mem, (void *)regs + dbg_reg_def[regno].offset,
+ dbg_reg_def[regno].size);
+
+#ifdef CONFIG_X86_32
+ switch (regno) {
+ case GDB_GS:
+ case GDB_FS:
+ *(unsigned long *)mem = 0xFFFF;
+ break;
}
-#else
- gdb_regs[GDB_R8] = regs->r8;
- gdb_regs[GDB_R9] = regs->r9;
- gdb_regs[GDB_R10] = regs->r10;
- gdb_regs[GDB_R11] = regs->r11;
- gdb_regs[GDB_R12] = regs->r12;
- gdb_regs[GDB_R13] = regs->r13;
- gdb_regs[GDB_R14] = regs->r14;
- gdb_regs[GDB_R15] = regs->r15;
- gdb_regs32[GDB_PS] = regs->flags;
- gdb_regs32[GDB_CS] = regs->cs;
- gdb_regs32[GDB_SS] = regs->ss;
- gdb_regs[GDB_SP] = kernel_stack_pointer(regs);
#endif
+ return dbg_reg_def[regno].name;
}
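
[Editorial aside: the dbg_reg_def[] table replaces the old hand-written converters, so the kgdb core can serialize all registers generically. A simplified sketch of such a consumer, modeled on kernel/debug/gdbstub.c; the function name and loop are illustrative, not the exact core code.]

    static void regs_to_gdb_buf(char *buf, struct pt_regs *regs)
    {
        int i;

        for (i = 0; i < DBG_MAX_REG_NUM; i++) {
            dbg_get_reg(i, buf, regs);   /* copies dbg_reg_def[i].size bytes */
            buf += dbg_reg_def[i].size;
        }
    }
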
/**
@@ -123,21 +150,19 @@ void sleeping_thread_to_gdb_regs(unsigned long *gdb_regs, struct task_struct *p)
gdb_regs[GDB_DX] = 0;
gdb_regs[GDB_SI] = 0;
gdb_regs[GDB_DI] = 0;
- gdb_regs[GDB_BP] = *(unsigned long *)p->thread.sp;
+ gdb_regs[GDB_BP] = ((struct inactive_task_frame *)p->thread.sp)->bp;
#ifdef CONFIG_X86_32
gdb_regs[GDB_DS] = __KERNEL_DS;
gdb_regs[GDB_ES] = __KERNEL_DS;
gdb_regs[GDB_PS] = 0;
gdb_regs[GDB_CS] = __KERNEL_CS;
- gdb_regs[GDB_PC] = p->thread.ip;
gdb_regs[GDB_SS] = __KERNEL_DS;
gdb_regs[GDB_FS] = 0xFFFF;
gdb_regs[GDB_GS] = 0xFFFF;
#else
- gdb_regs32[GDB_PS] = *(unsigned long *)(p->thread.sp + 8);
+ gdb_regs32[GDB_PS] = 0;
gdb_regs32[GDB_CS] = __KERNEL_CS;
gdb_regs32[GDB_SS] = __KERNEL_DS;
- gdb_regs[GDB_PC] = 0;
gdb_regs[GDB_R8] = 0;
gdb_regs[GDB_R9] = 0;
gdb_regs[GDB_R10] = 0;
@@ -147,57 +172,17 @@ void sleeping_thread_to_gdb_regs(unsigned long *gdb_regs, struct task_struct *p)
gdb_regs[GDB_R14] = 0;
gdb_regs[GDB_R15] = 0;
#endif
+ gdb_regs[GDB_PC] = 0;
gdb_regs[GDB_SP] = p->thread.sp;
}
-/**
- * gdb_regs_to_pt_regs - Convert GDB regs to ptrace regs.
- * @gdb_regs: A pointer to hold the registers we've received from GDB.
- * @regs: A pointer to a &struct pt_regs to hold these values in.
- *
- * Convert the GDB regs in @gdb_regs into the pt_regs, and store them
- * in @regs.
- */
-void gdb_regs_to_pt_regs(unsigned long *gdb_regs, struct pt_regs *regs)
-{
-#ifndef CONFIG_X86_32
- u32 *gdb_regs32 = (u32 *)gdb_regs;
-#endif
- regs->ax = gdb_regs[GDB_AX];
- regs->bx = gdb_regs[GDB_BX];
- regs->cx = gdb_regs[GDB_CX];
- regs->dx = gdb_regs[GDB_DX];
- regs->si = gdb_regs[GDB_SI];
- regs->di = gdb_regs[GDB_DI];
- regs->bp = gdb_regs[GDB_BP];
- regs->ip = gdb_regs[GDB_PC];
-#ifdef CONFIG_X86_32
- regs->flags = gdb_regs[GDB_PS];
- regs->ds = gdb_regs[GDB_DS];
- regs->es = gdb_regs[GDB_ES];
- regs->cs = gdb_regs[GDB_CS];
-#else
- regs->r8 = gdb_regs[GDB_R8];
- regs->r9 = gdb_regs[GDB_R9];
- regs->r10 = gdb_regs[GDB_R10];
- regs->r11 = gdb_regs[GDB_R11];
- regs->r12 = gdb_regs[GDB_R12];
- regs->r13 = gdb_regs[GDB_R13];
- regs->r14 = gdb_regs[GDB_R14];
- regs->r15 = gdb_regs[GDB_R15];
- regs->flags = gdb_regs32[GDB_PS];
- regs->cs = gdb_regs32[GDB_CS];
- regs->ss = gdb_regs32[GDB_SS];
-#endif
-}
-
static struct hw_breakpoint {
unsigned enabled;
unsigned long addr;
int len;
int type;
- struct perf_event **pev;
-} breakinfo[4];
+ struct perf_event * __percpu *pev;
+} breakinfo[HBP_NUM];
static unsigned long early_dr7;
@@ -205,7 +190,7 @@ static void kgdb_correct_hw_break(void)
{
int breakno;
- for (breakno = 0; breakno < 4; breakno++) {
+ for (breakno = 0; breakno < HBP_NUM; breakno++) {
struct perf_event *bp;
struct arch_hw_breakpoint *info;
int val;
@@ -279,7 +264,7 @@ static int hw_break_release_slot(int breakno)
pevent = per_cpu_ptr(breakinfo[breakno].pev, cpu);
if (dbg_release_bp_slot(*pevent))
/*
- * The debugger is responisble for handing the retry on
+ * The debugger is responsible for handling the retry on
* remove failure.
*/
return -1;
@@ -292,10 +277,10 @@ kgdb_remove_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
{
int i;
- for (i = 0; i < 4; i++)
+ for (i = 0; i < HBP_NUM; i++)
if (breakinfo[i].addr == addr && breakinfo[i].enabled)
break;
- if (i == 4)
+ if (i == HBP_NUM)
return -1;
if (hw_break_release_slot(i)) {
@@ -313,18 +298,22 @@ static void kgdb_remove_all_hw_break(void)
int cpu = raw_smp_processor_id();
struct perf_event *bp;
- for (i = 0; i < 4; i++) {
+ for (i = 0; i < HBP_NUM; i++) {
if (!breakinfo[i].enabled)
continue;
bp = *per_cpu_ptr(breakinfo[i].pev, cpu);
- if (bp->attr.disabled == 1)
+ if (!bp->attr.disabled) {
+ arch_uninstall_hw_breakpoint(bp);
+ bp->attr.disabled = 1;
continue;
+ }
if (dbg_is_early)
early_dr7 &= ~encode_dr7(i, breakinfo[i].len,
breakinfo[i].type);
- else
- arch_uninstall_hw_breakpoint(bp);
- bp->attr.disabled = 1;
+ else if (hw_break_release_slot(i))
+ printk(KERN_ERR "KGDB: hw bpt remove failed %lx\n",
+ breakinfo[i].addr);
+ breakinfo[i].enabled = 0;
}
}
@@ -333,10 +322,10 @@ kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
{
int i;
- for (i = 0; i < 4; i++)
+ for (i = 0; i < HBP_NUM; i++)
if (!breakinfo[i].enabled)
break;
- if (i == 4)
+ if (i == HBP_NUM)
return -1;
switch (bptype) {
@@ -389,15 +378,15 @@ kgdb_set_hw_break(unsigned long addr, int len, enum kgdb_bptype bptype)
* disable hardware debugging while it is processing gdb packets or
* handling exception.
*/
-void kgdb_disable_hw_debug(struct pt_regs *regs)
+static void kgdb_disable_hw_debug(struct pt_regs *regs)
{
int i;
int cpu = raw_smp_processor_id();
struct perf_event *bp;
/* Disable hardware debugging while we are in kgdb: */
- set_debugreg(0UL, 7);
- for (i = 0; i < 4; i++) {
+ set_debugreg(DR7_FIXED_1, 7);
+ for (i = 0; i < HBP_NUM; i++) {
if (!breakinfo[i].enabled)
continue;
if (dbg_is_early) {
@@ -416,34 +405,29 @@ void kgdb_disable_hw_debug(struct pt_regs *regs)
#ifdef CONFIG_SMP
/**
* kgdb_roundup_cpus - Get other CPUs into a holding pattern
- * @flags: Current IRQ state
*
* On SMP systems, we need to get the attention of the other CPUs
 * and get them into a known state. This should do what is needed
* to get the other CPUs to call kgdb_wait(). Note that on some arches,
* the NMI approach is not used for rounding up all the CPUs. For example,
- * in case of MIPS, smp_call_function() is used to roundup CPUs. In
- * this case, we have to make sure that interrupts are enabled before
- * calling smp_call_function(). The argument to this function is
- * the flags that will be used when restoring the interrupts. There is
- * local_irq_save() call before kgdb_roundup_cpus().
+ * in case of MIPS, smp_call_function() is used to roundup CPUs.
*
* On non-SMP systems, this is not called.
*/
-void kgdb_roundup_cpus(unsigned long flags)
+void kgdb_roundup_cpus(void)
{
- apic->send_IPI_allbutself(APIC_DM_NMI);
+ apic_send_IPI_allbutself(NMI_VECTOR);
}
#endif
/**
* kgdb_arch_handle_exception - Handle architecture specific GDB packets.
- * @vector: The error vector of the exception that happened.
+ * @e_vector: The error vector of the exception that happened.
* @signo: The signal number of the exception that happened.
* @err_code: The error code of the exception that happened.
- * @remcom_in_buffer: The buffer of the packet we have read.
- * @remcom_out_buffer: The buffer of %BUFMAX bytes to write a packet into.
- * @regs: The &struct pt_regs of the current process.
+ * @remcomInBuffer: The buffer of the packet we have read.
+ * @remcomOutBuffer: The buffer of %BUFMAX bytes to write a packet into.
+ * @linux_regs: The &struct pt_regs of the current process.
*
* This function MUST handle the 'c' and 's' command packets,
 * as well as packets to set / remove a hardware breakpoint, if used.
@@ -458,7 +442,6 @@ int kgdb_arch_handle_exception(int e_vector, int signo, int err_code,
{
unsigned long addr;
char *ptr;
- int newPC;
switch (remcomInBuffer[0]) {
case 'c':
@@ -467,10 +450,9 @@ int kgdb_arch_handle_exception(int e_vector, int signo, int err_code,
ptr = &remcomInBuffer[1];
if (kgdb_hex2long(&ptr, &addr))
linux_regs->ip = addr;
+ fallthrough;
case 'D':
case 'k':
- newPC = linux_regs->ip;
-
/* clear the trace bit */
linux_regs->flags &= ~X86_EFLAGS_TF;
atomic_set(&kgdb_cpu_doing_single_step, -1);
@@ -482,8 +464,6 @@ int kgdb_arch_handle_exception(int e_vector, int signo, int err_code,
raw_smp_processor_id());
}
- kgdb_correct_hw_break();
-
return 0;
}
@@ -511,43 +491,44 @@ single_step_cont(struct pt_regs *regs, struct die_args *args)
return NOTIFY_STOP;
}
-static int was_in_debug_nmi[NR_CPUS];
+static DECLARE_BITMAP(was_in_debug_nmi, NR_CPUS);
-static int __kgdb_notify(struct die_args *args, unsigned long cmd)
+static int kgdb_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
- struct pt_regs *regs = args->regs;
+ int cpu;
switch (cmd) {
- case DIE_NMI:
+ case NMI_LOCAL:
if (atomic_read(&kgdb_active) != -1) {
/* KGDB CPU roundup */
- kgdb_nmicallback(raw_smp_processor_id(), regs);
- was_in_debug_nmi[raw_smp_processor_id()] = 1;
+ cpu = raw_smp_processor_id();
+ kgdb_nmicallback(cpu, regs);
+ set_bit(cpu, was_in_debug_nmi);
touch_nmi_watchdog();
- return NOTIFY_STOP;
+
+ return NMI_HANDLED;
}
- return NOTIFY_DONE;
+ break;
- case DIE_NMI_IPI:
- /* Just ignore, we will handle the roundup on DIE_NMI. */
- return NOTIFY_DONE;
+ case NMI_UNKNOWN:
+ cpu = raw_smp_processor_id();
- case DIE_NMIUNKNOWN:
- if (was_in_debug_nmi[raw_smp_processor_id()]) {
- was_in_debug_nmi[raw_smp_processor_id()] = 0;
- return NOTIFY_STOP;
- }
- return NOTIFY_DONE;
+ if (__test_and_clear_bit(cpu, was_in_debug_nmi))
+ return NMI_HANDLED;
- case DIE_NMIWATCHDOG:
- if (atomic_read(&kgdb_active) != -1) {
- /* KGDB CPU roundup: */
- kgdb_nmicallback(raw_smp_processor_id(), regs);
- return NOTIFY_STOP;
- }
- /* Enter debugger: */
break;
+ default:
+ /* do nothing */
+ break;
+ }
+ return NMI_DONE;
+}
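
[Editorial worked trace of the roundup logic above; CPU numbers are illustrative. The per-CPU bitmap exists because the roundup NMI may be re-delivered as an unknown NMI, and exactly one such event per CPU should be swallowed rather than reported.]

    CPU0: hits breakpoint             -> kgdb_active != -1
    CPU0: kgdb_roundup_cpus()         -> sends NMI_VECTOR to all but self
    CPU1: NMI_LOCAL                   -> kgdb_nmicallback(), set_bit(), NMI_HANDLED
    CPU1: stray NMI_UNKNOWN (if any)  -> __test_and_clear_bit() true, NMI_HANDLED
    CPU1: later unrelated NMI_UNKNOWN -> bit already clear, NMI_DONE
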
+
+static int __kgdb_notify(struct die_args *args, unsigned long cmd)
+{
+ struct pt_regs *regs = args->regs;
+ switch (cmd) {
case DIE_DEBUG:
if (atomic_read(&kgdb_cpu_doing_single_step) != -1) {
if (user_mode(regs))
@@ -558,7 +539,7 @@ static int __kgdb_notify(struct die_args *args, unsigned long cmd)
* a system call which should be ignored
*/
return NOTIFY_DONE;
- /* fall through */
+ fallthrough;
default:
if (user_mode(regs))
return NOTIFY_DONE;
@@ -605,28 +586,53 @@ kgdb_notify(struct notifier_block *self, unsigned long cmd, void *ptr)
static struct notifier_block kgdb_notifier = {
.notifier_call = kgdb_notify,
-
- /*
- * Lowest-prio notifier priority, we want to be notified last:
- */
- .priority = -INT_MAX,
};
/**
- * kgdb_arch_init - Perform any architecture specific initalization.
+ * kgdb_arch_init - Perform any architecture specific initialization.
*
- * This function will handle the initalization of any architecture
+ * This function will handle the initialization of any architecture
* specific callbacks.
*/
int kgdb_arch_init(void)
{
- return register_die_notifier(&kgdb_notifier);
+ int retval;
+
+ retval = register_die_notifier(&kgdb_notifier);
+ if (retval)
+ goto out;
+
+ retval = register_nmi_handler(NMI_LOCAL, kgdb_nmi_handler,
+ 0, "kgdb");
+ if (retval)
+ goto out1;
+
+ retval = register_nmi_handler(NMI_UNKNOWN, kgdb_nmi_handler,
+ 0, "kgdb");
+
+ if (retval)
+ goto out2;
+
+ return retval;
+
+out2:
+ unregister_nmi_handler(NMI_LOCAL, "kgdb");
+out1:
+ unregister_die_notifier(&kgdb_notifier);
+out:
+ return retval;
}
-static void kgdb_hw_overflow_handler(struct perf_event *event, int nmi,
+static void kgdb_hw_overflow_handler(struct perf_event *event,
struct perf_sample_data *data, struct pt_regs *regs)
{
- kgdb_ll_trap(DIE_DEBUG, "debug", regs, 0, 0, SIGTRAP);
+ struct task_struct *tsk = current;
+ int i;
+
+ for (i = 0; i < 4; i++) {
+ if (breakinfo[i].enabled)
+ tsk->thread.virtual_dr6 |= (DR_TRAP0 << i);
+ }
}
void kgdb_arch_late(void)
@@ -636,7 +642,7 @@ void kgdb_arch_late(void)
struct perf_event **pevent;
/*
- * Pre-allocate the hw breakpoint structions in the non-atomic
+ * Pre-allocate the hw breakpoint structures in the non-atomic
 * portion of kgdb because this operation requires mutexes to
* complete.
*/
@@ -645,11 +651,11 @@ void kgdb_arch_late(void)
attr.bp_len = HW_BREAKPOINT_LEN_1;
attr.bp_type = HW_BREAKPOINT_W;
attr.disabled = 1;
- for (i = 0; i < 4; i++) {
+ for (i = 0; i < HBP_NUM; i++) {
if (breakinfo[i].pev)
continue;
- breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL);
- if (IS_ERR(breakinfo[i].pev)) {
+ breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL, NULL);
+ if (IS_ERR_PCPU(breakinfo[i].pev)) {
printk(KERN_ERR "kgdb: Could not allocate hw"
"breakpoints\nDisabling the kernel debugger\n");
breakinfo[i].pev = NULL;
@@ -683,11 +689,12 @@ void kgdb_arch_exit(void)
breakinfo[i].pev = NULL;
}
}
+ unregister_nmi_handler(NMI_UNKNOWN, "kgdb");
+ unregister_nmi_handler(NMI_LOCAL, "kgdb");
unregister_die_notifier(&kgdb_notifier);
}
/**
- *
* kgdb_skipexception - Bail out of KGDB when we've been triggered.
* @exception: Exception vector number
* @regs: Current &struct pt_regs.
@@ -720,12 +727,58 @@ void kgdb_arch_set_pc(struct pt_regs *regs, unsigned long ip)
regs->ip = ip;
}
-struct kgdb_arch arch_kgdb_ops = {
+int kgdb_arch_set_breakpoint(struct kgdb_bkpt *bpt)
+{
+ int err;
+
+ bpt->type = BP_BREAKPOINT;
+ err = copy_from_kernel_nofault(bpt->saved_instr, (char *)bpt->bpt_addr,
+ BREAK_INSTR_SIZE);
+ if (err)
+ return err;
+ err = copy_to_kernel_nofault((char *)bpt->bpt_addr,
+ arch_kgdb_ops.gdb_bpt_instr, BREAK_INSTR_SIZE);
+ if (!err)
+ return err;
+ /*
+ * It is safe to call text_poke_kgdb() because normal kernel execution
+ * is stopped on all cores, so long as the text_mutex is not locked.
+ */
+ if (mutex_is_locked(&text_mutex))
+ return -EBUSY;
+ text_poke_kgdb((void *)bpt->bpt_addr, arch_kgdb_ops.gdb_bpt_instr,
+ BREAK_INSTR_SIZE);
+ bpt->type = BP_POKE_BREAKPOINT;
+
+ return 0;
+}
+
+int kgdb_arch_remove_breakpoint(struct kgdb_bkpt *bpt)
+{
+ if (bpt->type != BP_POKE_BREAKPOINT)
+ goto knl_write;
+ /*
+ * It is safe to call text_poke_kgdb() because normal kernel execution
+ * is stopped on all cores, so long as the text_mutex is not locked.
+ */
+ if (mutex_is_locked(&text_mutex))
+ goto knl_write;
+ text_poke_kgdb((void *)bpt->bpt_addr, bpt->saved_instr,
+ BREAK_INSTR_SIZE);
+ return 0;
+
+knl_write:
+ return copy_to_kernel_nofault((char *)bpt->bpt_addr,
+ (char *)bpt->saved_instr, BREAK_INSTR_SIZE);
+}
+
+const struct kgdb_arch arch_kgdb_ops = {
/* Breakpoint instruction: */
.gdb_bpt_instr = { 0xcc },
.flags = KGDB_HW_BREAKPOINT,
.set_hw_breakpoint = kgdb_set_hw_break,
.remove_hw_breakpoint = kgdb_remove_hw_break,
+ .disable_hw_break = kgdb_disable_hw_debug,
.remove_all_hw_break = kgdb_remove_all_hw_break,
.correct_hw_break = kgdb_correct_hw_break,
};
diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
deleted file mode 100644
index 675879b65ce6..000000000000
--- a/arch/x86/kernel/kprobes.c
+++ /dev/null
@@ -1,1450 +0,0 @@
-/*
- * Kernel Probes (KProbes)
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
- *
- * Copyright (C) IBM Corporation, 2002, 2004
- *
- * 2002-Oct Created by Vamsi Krishna S <vamsi_krishna@in.ibm.com> Kernel
- * Probes initial implementation ( includes contributions from
- * Rusty Russell).
- * 2004-July Suparna Bhattacharya <suparna@in.ibm.com> added jumper probes
- * interface to access function arguments.
- * 2004-Oct Jim Keniston <jkenisto@us.ibm.com> and Prasanna S Panchamukhi
- * <prasanna@in.ibm.com> adapted for x86_64 from i386.
- * 2005-Mar Roland McGrath <roland@redhat.com>
- * Fixed to handle %rip-relative addressing mode correctly.
- * 2005-May Hien Nguyen <hien@us.ibm.com>, Jim Keniston
- * <jkenisto@us.ibm.com> and Prasanna S Panchamukhi
- * <prasanna@in.ibm.com> added function-return probes.
- * 2005-May Rusty Lynch <rusty.lynch@intel.com>
- * Added function return probes functionality
- * 2006-Feb Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp> added
- * kprobe-booster and kretprobe-booster for i386.
- * 2007-Dec Masami Hiramatsu <mhiramat@redhat.com> added kprobe-booster
- * and kretprobe-booster for x86-64
- * 2007-Dec Masami Hiramatsu <mhiramat@redhat.com>, Arjan van de Ven
- * <arjan@infradead.org> and Jim Keniston <jkenisto@us.ibm.com>
- * unified x86 kprobes code.
- */
-
-#include <linux/kprobes.h>
-#include <linux/ptrace.h>
-#include <linux/string.h>
-#include <linux/slab.h>
-#include <linux/hardirq.h>
-#include <linux/preempt.h>
-#include <linux/module.h>
-#include <linux/kdebug.h>
-#include <linux/kallsyms.h>
-#include <linux/ftrace.h>
-
-#include <asm/cacheflush.h>
-#include <asm/desc.h>
-#include <asm/pgtable.h>
-#include <asm/uaccess.h>
-#include <asm/alternative.h>
-#include <asm/insn.h>
-#include <asm/debugreg.h>
-
-void jprobe_return_end(void);
-
-DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
-DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
-
-#define stack_addr(regs) ((unsigned long *)kernel_stack_pointer(regs))
-
-#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
- (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
- (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
- (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
- (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
- << (row % 32))
- /*
- * Undefined/reserved opcodes, conditional jump, Opcode Extension
- * Groups, and some special opcodes can not boost.
- */
-static const u32 twobyte_is_boostable[256 / 32] = {
- /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
- /* ---------------------------------------------- */
- W(0x00, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0) | /* 00 */
- W(0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 10 */
- W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 20 */
- W(0x30, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */
- W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
- W(0x50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 50 */
- W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1) | /* 60 */
- W(0x70, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */
- W(0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 80 */
- W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
- W(0xa0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) | /* a0 */
- W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1) , /* b0 */
- W(0xc0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
- W(0xd0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) , /* d0 */
- W(0xe0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) | /* e0 */
- W(0xf0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0) /* f0 */
- /* ----------------------------------------------- */
- /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
-};
-#undef W
-
-struct kretprobe_blackpoint kretprobe_blacklist[] = {
- {"__switch_to", }, /* This function switches only current task, but
- doesn't switch kernel stack.*/
- {NULL, NULL} /* Terminator */
-};
-const int kretprobe_blacklist_size = ARRAY_SIZE(kretprobe_blacklist);
-
-static void __kprobes __synthesize_relative_insn(void *from, void *to, u8 op)
-{
- struct __arch_relative_insn {
- u8 op;
- s32 raddr;
- } __attribute__((packed)) *insn;
-
- insn = (struct __arch_relative_insn *)from;
- insn->raddr = (s32)((long)(to) - ((long)(from) + 5));
- insn->op = op;
-}
-
-/* Insert a jump instruction at address 'from', which jumps to address 'to'.*/
-static void __kprobes synthesize_reljump(void *from, void *to)
-{
- __synthesize_relative_insn(from, to, RELATIVEJUMP_OPCODE);
-}
-
-/*
- * Check for the REX prefix which can only exist on X86_64
- * X86_32 always returns 0
- */
-static int __kprobes is_REX_prefix(kprobe_opcode_t *insn)
-{
-#ifdef CONFIG_X86_64
- if ((*insn & 0xf0) == 0x40)
- return 1;
-#endif
- return 0;
-}
-
-/*
- * Returns non-zero if opcode is boostable.
- * RIP relative instructions are adjusted at copying time in 64 bits mode
- */
-static int __kprobes can_boost(kprobe_opcode_t *opcodes)
-{
- kprobe_opcode_t opcode;
- kprobe_opcode_t *orig_opcodes = opcodes;
-
- if (search_exception_tables((unsigned long)opcodes))
- return 0; /* Page fault may occur on this address. */
-
-retry:
- if (opcodes - orig_opcodes > MAX_INSN_SIZE - 1)
- return 0;
- opcode = *(opcodes++);
-
- /* 2nd-byte opcode */
- if (opcode == 0x0f) {
- if (opcodes - orig_opcodes > MAX_INSN_SIZE - 1)
- return 0;
- return test_bit(*opcodes,
- (unsigned long *)twobyte_is_boostable);
- }
-
- switch (opcode & 0xf0) {
-#ifdef CONFIG_X86_64
- case 0x40:
- goto retry; /* REX prefix is boostable */
-#endif
- case 0x60:
- if (0x63 < opcode && opcode < 0x67)
- goto retry; /* prefixes */
- /* can't boost Address-size override and bound */
- return (opcode != 0x62 && opcode != 0x67);
- case 0x70:
- return 0; /* can't boost conditional jump */
- case 0xc0:
- /* can't boost software-interruptions */
- return (0xc1 < opcode && opcode < 0xcc) || opcode == 0xcf;
- case 0xd0:
- /* can boost AA* and XLAT */
- return (opcode == 0xd4 || opcode == 0xd5 || opcode == 0xd7);
- case 0xe0:
- /* can boost in/out and absolute jmps */
- return ((opcode & 0x04) || opcode == 0xea);
- case 0xf0:
- if ((opcode & 0x0c) == 0 && opcode != 0xf1)
- goto retry; /* lock/rep(ne) prefix */
- /* clear and set flags are boostable */
- return (opcode == 0xf5 || (0xf7 < opcode && opcode < 0xfe));
- default:
- /* segment override prefixes are boostable */
- if (opcode == 0x26 || opcode == 0x36 || opcode == 0x3e)
- goto retry; /* prefixes */
- /* CS override prefix and call are not boostable */
- return (opcode != 0x2e && opcode != 0x9a);
- }
-}
-
-/* Recover the probed instruction at addr for further analysis. */
-static int recover_probed_instruction(kprobe_opcode_t *buf, unsigned long addr)
-{
- struct kprobe *kp;
- kp = get_kprobe((void *)addr);
- if (!kp)
- return -EINVAL;
-
- /*
- * Basically, kp->ainsn.insn has an original instruction.
- * However, RIP-relative instruction can not do single-stepping
- * at different place, __copy_instruction() tweaks the displacement of
- * that instruction. In that case, we can't recover the instruction
- * from the kp->ainsn.insn.
- *
- * On the other hand, kp->opcode has a copy of the first byte of
- * the probed instruction, which is overwritten by int3. And
- * the instruction at kp->addr is not modified by kprobes except
- * for the first byte, we can recover the original instruction
- * from it and kp->opcode.
- */
- memcpy(buf, kp->addr, MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
- buf[0] = kp->opcode;
- return 0;
-}
-
-/* Dummy buffers for kallsyms_lookup */
-static char __dummy_buf[KSYM_NAME_LEN];
-
-/* Check if paddr is at an instruction boundary */
-static int __kprobes can_probe(unsigned long paddr)
-{
- int ret;
- unsigned long addr, offset = 0;
- struct insn insn;
- kprobe_opcode_t buf[MAX_INSN_SIZE];
-
- if (!kallsyms_lookup(paddr, NULL, &offset, NULL, __dummy_buf))
- return 0;
-
- /* Decode instructions */
- addr = paddr - offset;
- while (addr < paddr) {
- kernel_insn_init(&insn, (void *)addr);
- insn_get_opcode(&insn);
-
- /*
- * Check if the instruction has been modified by another
- * kprobe, in which case we replace the breakpoint by the
- * original instruction in our buffer.
- */
- if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
- ret = recover_probed_instruction(buf, addr);
- if (ret)
- /*
- * Another debugging subsystem might insert
- * this breakpoint. In that case, we can't
- * recover it.
- */
- return 0;
- kernel_insn_init(&insn, buf);
- }
- insn_get_length(&insn);
- addr += insn.length;
- }
-
- return (addr == paddr);
-}
-
-/*
- * Returns non-zero if opcode modifies the interrupt flag.
- */
-static int __kprobes is_IF_modifier(kprobe_opcode_t *insn)
-{
- switch (*insn) {
- case 0xfa: /* cli */
- case 0xfb: /* sti */
- case 0xcf: /* iret/iretd */
- case 0x9d: /* popf/popfd */
- return 1;
- }
-
- /*
- * on X86_64, 0x40-0x4f are REX prefixes so we need to look
- * at the next byte instead.. but of course not recurse infinitely
- */
- if (is_REX_prefix(insn))
- return is_IF_modifier(++insn);
-
- return 0;
-}
-
-/*
- * Copy an instruction and adjust the displacement if the instruction
- * uses the %rip-relative addressing mode.
- * If it does, Return the address of the 32-bit displacement word.
- * If not, return null.
- * Only applicable to 64-bit x86.
- */
-static int __kprobes __copy_instruction(u8 *dest, u8 *src, int recover)
-{
- struct insn insn;
- int ret;
- kprobe_opcode_t buf[MAX_INSN_SIZE];
-
- kernel_insn_init(&insn, src);
- if (recover) {
- insn_get_opcode(&insn);
- if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
- ret = recover_probed_instruction(buf,
- (unsigned long)src);
- if (ret)
- return 0;
- kernel_insn_init(&insn, buf);
- }
- }
- insn_get_length(&insn);
- memcpy(dest, insn.kaddr, insn.length);
-
-#ifdef CONFIG_X86_64
- if (insn_rip_relative(&insn)) {
- s64 newdisp;
- u8 *disp;
- kernel_insn_init(&insn, dest);
- insn_get_displacement(&insn);
- /*
- * The copied instruction uses the %rip-relative addressing
- * mode. Adjust the displacement for the difference between
- * the original location of this instruction and the location
- * of the copy that will actually be run. The tricky bit here
- * is making sure that the sign extension happens correctly in
- * this calculation, since we need a signed 32-bit result to
- * be sign-extended to 64 bits when it's added to the %rip
- * value and yield the same 64-bit result that the sign-
- * extension of the original signed 32-bit displacement would
- * have given.
- */
- newdisp = (u8 *) src + (s64) insn.displacement.value -
- (u8 *) dest;
- BUG_ON((s64) (s32) newdisp != newdisp); /* Sanity check. */
- disp = (u8 *) dest + insn_offset_displacement(&insn);
- *(s32 *) disp = (s32) newdisp;
- }
-#endif
- return insn.length;
-}
-
-static void __kprobes arch_copy_kprobe(struct kprobe *p)
-{
- /*
- * Copy an instruction without recovering int3, because it will be
- * put by another subsystem.
- */
- __copy_instruction(p->ainsn.insn, p->addr, 0);
-
- if (can_boost(p->addr))
- p->ainsn.boostable = 0;
- else
- p->ainsn.boostable = -1;
-
- p->opcode = *p->addr;
-}
-
-int __kprobes arch_prepare_kprobe(struct kprobe *p)
-{
- if (alternatives_text_reserved(p->addr, p->addr))
- return -EINVAL;
-
- if (!can_probe((unsigned long)p->addr))
- return -EILSEQ;
- /* insn: must be on special executable page on x86. */
- p->ainsn.insn = get_insn_slot();
- if (!p->ainsn.insn)
- return -ENOMEM;
- arch_copy_kprobe(p);
- return 0;
-}
-
-void __kprobes arch_arm_kprobe(struct kprobe *p)
-{
- text_poke(p->addr, ((unsigned char []){BREAKPOINT_INSTRUCTION}), 1);
-}
-
-void __kprobes arch_disarm_kprobe(struct kprobe *p)
-{
- text_poke(p->addr, &p->opcode, 1);
-}
-
-void __kprobes arch_remove_kprobe(struct kprobe *p)
-{
- if (p->ainsn.insn) {
- free_insn_slot(p->ainsn.insn, (p->ainsn.boostable == 1));
- p->ainsn.insn = NULL;
- }
-}
-
-static void __kprobes save_previous_kprobe(struct kprobe_ctlblk *kcb)
-{
- kcb->prev_kprobe.kp = kprobe_running();
- kcb->prev_kprobe.status = kcb->kprobe_status;
- kcb->prev_kprobe.old_flags = kcb->kprobe_old_flags;
- kcb->prev_kprobe.saved_flags = kcb->kprobe_saved_flags;
-}
-
-static void __kprobes restore_previous_kprobe(struct kprobe_ctlblk *kcb)
-{
- __get_cpu_var(current_kprobe) = kcb->prev_kprobe.kp;
- kcb->kprobe_status = kcb->prev_kprobe.status;
- kcb->kprobe_old_flags = kcb->prev_kprobe.old_flags;
- kcb->kprobe_saved_flags = kcb->prev_kprobe.saved_flags;
-}
-
-static void __kprobes set_current_kprobe(struct kprobe *p, struct pt_regs *regs,
- struct kprobe_ctlblk *kcb)
-{
- __get_cpu_var(current_kprobe) = p;
- kcb->kprobe_saved_flags = kcb->kprobe_old_flags
- = (regs->flags & (X86_EFLAGS_TF | X86_EFLAGS_IF));
- if (is_IF_modifier(p->ainsn.insn))
- kcb->kprobe_saved_flags &= ~X86_EFLAGS_IF;
-}
-
-static void __kprobes clear_btf(void)
-{
- if (test_thread_flag(TIF_BLOCKSTEP)) {
- unsigned long debugctl = get_debugctlmsr();
-
- debugctl &= ~DEBUGCTLMSR_BTF;
- update_debugctlmsr(debugctl);
- }
-}
-
-static void __kprobes restore_btf(void)
-{
- if (test_thread_flag(TIF_BLOCKSTEP)) {
- unsigned long debugctl = get_debugctlmsr();
-
- debugctl |= DEBUGCTLMSR_BTF;
- update_debugctlmsr(debugctl);
- }
-}
-
-void __kprobes arch_prepare_kretprobe(struct kretprobe_instance *ri,
- struct pt_regs *regs)
-{
- unsigned long *sara = stack_addr(regs);
-
- ri->ret_addr = (kprobe_opcode_t *) *sara;
-
- /* Replace the return addr with trampoline addr */
- *sara = (unsigned long) &kretprobe_trampoline;
-}
-
-#ifdef CONFIG_OPTPROBES
-static int __kprobes setup_detour_execution(struct kprobe *p,
- struct pt_regs *regs,
- int reenter);
-#else
-#define setup_detour_execution(p, regs, reenter) (0)
-#endif
-
-static void __kprobes setup_singlestep(struct kprobe *p, struct pt_regs *regs,
- struct kprobe_ctlblk *kcb, int reenter)
-{
- if (setup_detour_execution(p, regs, reenter))
- return;
-
-#if !defined(CONFIG_PREEMPT)
- if (p->ainsn.boostable == 1 && !p->post_handler) {
- /* Boost up -- we can execute copied instructions directly */
- if (!reenter)
- reset_current_kprobe();
- /*
- * Reentering boosted probe doesn't reset current_kprobe,
- * nor set current_kprobe, because it doesn't use single
- * stepping.
- */
- regs->ip = (unsigned long)p->ainsn.insn;
- preempt_enable_no_resched();
- return;
- }
-#endif
- if (reenter) {
- save_previous_kprobe(kcb);
- set_current_kprobe(p, regs, kcb);
- kcb->kprobe_status = KPROBE_REENTER;
- } else
- kcb->kprobe_status = KPROBE_HIT_SS;
- /* Prepare real single stepping */
- clear_btf();
- regs->flags |= X86_EFLAGS_TF;
- regs->flags &= ~X86_EFLAGS_IF;
- /* single step inline if the instruction is an int3 */
- if (p->opcode == BREAKPOINT_INSTRUCTION)
- regs->ip = (unsigned long)p->addr;
- else
- regs->ip = (unsigned long)p->ainsn.insn;
-}
-
-/*
- * We have reentered the kprobe_handler(), since another probe was hit while
- * within the handler. We save the original kprobes variables and just single
- * step on the instruction of the new probe without calling any user handlers.
- */
-static int __kprobes reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
- struct kprobe_ctlblk *kcb)
-{
- switch (kcb->kprobe_status) {
- case KPROBE_HIT_SSDONE:
- case KPROBE_HIT_ACTIVE:
- kprobes_inc_nmissed_count(p);
- setup_singlestep(p, regs, kcb, 1);
- break;
- case KPROBE_HIT_SS:
- /* A probe has been hit in the codepath leading up to, or just
- * after, single-stepping of a probed instruction. This entire
- * codepath should strictly reside in .kprobes.text section.
- * Raise a BUG or we'll continue in an endless reentering loop
- * and eventually a stack overflow.
- */
- printk(KERN_WARNING "Unrecoverable kprobe detected at %p.\n",
- p->addr);
- dump_kprobe(p);
- BUG();
- default:
- /* impossible cases */
- WARN_ON(1);
- return 0;
- }
-
- return 1;
-}
-
-/*
- * Interrupts are disabled on entry as trap3 is an interrupt gate and they
- * remain disabled throughout this function.
- */
-static int __kprobes kprobe_handler(struct pt_regs *regs)
-{
- kprobe_opcode_t *addr;
- struct kprobe *p;
- struct kprobe_ctlblk *kcb;
-
- addr = (kprobe_opcode_t *)(regs->ip - sizeof(kprobe_opcode_t));
- /*
- * We don't want to be preempted for the entire
- * duration of kprobe processing. We conditionally
- * re-enable preemption at the end of this function,
- * and also in reenter_kprobe() and setup_singlestep().
- */
- preempt_disable();
-
- kcb = get_kprobe_ctlblk();
- p = get_kprobe(addr);
-
- if (p) {
- if (kprobe_running()) {
- if (reenter_kprobe(p, regs, kcb))
- return 1;
- } else {
- set_current_kprobe(p, regs, kcb);
- kcb->kprobe_status = KPROBE_HIT_ACTIVE;
-
- /*
- * If we have no pre-handler or it returned 0, we
- * continue with normal processing. If we have a
- * pre-handler and it returned non-zero, it prepped
- * for calling the break_handler below on re-entry
- * for jprobe processing, so get out doing nothing
- * more here.
- */
- if (!p->pre_handler || !p->pre_handler(p, regs))
- setup_singlestep(p, regs, kcb, 0);
- return 1;
- }
- } else if (*addr != BREAKPOINT_INSTRUCTION) {
- /*
- * The breakpoint instruction was removed right
- * after we hit it. Another cpu has removed
- * either a probepoint or a debugger breakpoint
- * at this address. In either case, no further
- * handling of this interrupt is appropriate.
- * Back up over the (now missing) int3 and run
- * the original instruction.
- */
- regs->ip = (unsigned long)addr;
- preempt_enable_no_resched();
- return 1;
- } else if (kprobe_running()) {
- p = __get_cpu_var(current_kprobe);
- if (p->break_handler && p->break_handler(p, regs)) {
- setup_singlestep(p, regs, kcb, 0);
- return 1;
- }
- } /* else: not a kprobe fault; let the kernel handle it */
-
- preempt_enable_no_resched();
- return 0;
-}
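
As a rough illustration of how this handler is reached in practice, the sketch below registers a kprobe from a module; when the int3 at the probed address fires, kprobe_handler() above ends up invoking the pre-handler. This is a minimal sketch, not part of the patch: the probed symbol name is an arbitrary assumption (it differs across kernel versions) and error handling is elided.

#include <linux/module.h>
#include <linux/kprobes.h>

/* Probed symbol is an assumption; pick any non-blacklisted kernel symbol. */
static struct kprobe kp = {
	.symbol_name = "kernel_clone",
};

static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	/* Runs from the breakpoint handler with preemption disabled. */
	pr_info("kprobe pre-handler: ip = %lx\n", regs->ip);
	return 0;	/* 0: fall through to single-stepping */
}

static int __init kp_example_init(void)
{
	kp.pre_handler = handler_pre;
	return register_kprobe(&kp);
}

static void __exit kp_example_exit(void)
{
	unregister_kprobe(&kp);
}

module_init(kp_example_init);
module_exit(kp_example_exit);
MODULE_LICENSE("GPL");
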
-
-#ifdef CONFIG_X86_64
-#define SAVE_REGS_STRING \
- /* Skip cs, ip, orig_ax. */ \
- " subq $24, %rsp\n" \
- " pushq %rdi\n" \
- " pushq %rsi\n" \
- " pushq %rdx\n" \
- " pushq %rcx\n" \
- " pushq %rax\n" \
- " pushq %r8\n" \
- " pushq %r9\n" \
- " pushq %r10\n" \
- " pushq %r11\n" \
- " pushq %rbx\n" \
- " pushq %rbp\n" \
- " pushq %r12\n" \
- " pushq %r13\n" \
- " pushq %r14\n" \
- " pushq %r15\n"
-#define RESTORE_REGS_STRING \
- " popq %r15\n" \
- " popq %r14\n" \
- " popq %r13\n" \
- " popq %r12\n" \
- " popq %rbp\n" \
- " popq %rbx\n" \
- " popq %r11\n" \
- " popq %r10\n" \
- " popq %r9\n" \
- " popq %r8\n" \
- " popq %rax\n" \
- " popq %rcx\n" \
- " popq %rdx\n" \
- " popq %rsi\n" \
- " popq %rdi\n" \
- /* Skip orig_ax, ip, cs */ \
- " addq $24, %rsp\n"
-#else
-#define SAVE_REGS_STRING \
- /* Skip cs, ip, orig_ax and gs. */ \
- " subl $16, %esp\n" \
- " pushl %fs\n" \
- " pushl %es\n" \
- " pushl %ds\n" \
- " pushl %eax\n" \
- " pushl %ebp\n" \
- " pushl %edi\n" \
- " pushl %esi\n" \
- " pushl %edx\n" \
- " pushl %ecx\n" \
- " pushl %ebx\n"
-#define RESTORE_REGS_STRING \
- " popl %ebx\n" \
- " popl %ecx\n" \
- " popl %edx\n" \
- " popl %esi\n" \
- " popl %edi\n" \
- " popl %ebp\n" \
- " popl %eax\n" \
-	/* Skip ds, es, fs, gs, orig_ax, and ip. Note: don't pop cs here */\
- " addl $24, %esp\n"
-#endif
-
-/*
- * When a retprobed function returns, this code saves registers and
- * calls trampoline_handler(), which in turn calls the kretprobe's handler.
- */
-static void __used __kprobes kretprobe_trampoline_holder(void)
-{
- asm volatile (
- ".global kretprobe_trampoline\n"
- "kretprobe_trampoline: \n"
-#ifdef CONFIG_X86_64
- /* We don't bother saving the ss register */
- " pushq %rsp\n"
- " pushfq\n"
- SAVE_REGS_STRING
- " movq %rsp, %rdi\n"
- " call trampoline_handler\n"
- /* Replace saved sp with true return address. */
- " movq %rax, 152(%rsp)\n"
- RESTORE_REGS_STRING
- " popfq\n"
-#else
- " pushf\n"
- SAVE_REGS_STRING
- " movl %esp, %eax\n"
- " call trampoline_handler\n"
- /* Move flags to cs */
- " movl 56(%esp), %edx\n"
- " movl %edx, 52(%esp)\n"
- /* Replace saved flags with true return address. */
- " movl %eax, 56(%esp)\n"
- RESTORE_REGS_STRING
- " popf\n"
-#endif
- " ret\n");
-}
-
-/*
- * Called from kretprobe_trampoline
- */
-static __used __kprobes void *trampoline_handler(struct pt_regs *regs)
-{
- struct kretprobe_instance *ri = NULL;
- struct hlist_head *head, empty_rp;
- struct hlist_node *node, *tmp;
- unsigned long flags, orig_ret_address = 0;
- unsigned long trampoline_address = (unsigned long)&kretprobe_trampoline;
-
- INIT_HLIST_HEAD(&empty_rp);
- kretprobe_hash_lock(current, &head, &flags);
- /* fixup registers */
-#ifdef CONFIG_X86_64
- regs->cs = __KERNEL_CS;
-#else
- regs->cs = __KERNEL_CS | get_kernel_rpl();
- regs->gs = 0;
-#endif
- regs->ip = trampoline_address;
- regs->orig_ax = ~0UL;
-
-	/*
-	 * It is possible to have multiple instances associated with a given
-	 * task, either because multiple functions in the call path have
-	 * return probes installed on them, or because more than one
-	 * return probe was registered for a target function.
- *
- * We can handle this because:
- * - instances are always pushed into the head of the list
- * - when multiple return probes are registered for the same
- * function, the (chronologically) first instance's ret_addr
- * will be the real return address, and all the rest will
- * point to kretprobe_trampoline.
- */
- hlist_for_each_entry_safe(ri, node, tmp, head, hlist) {
- if (ri->task != current)
- /* another task is sharing our hash bucket */
- continue;
-
- if (ri->rp && ri->rp->handler) {
- __get_cpu_var(current_kprobe) = &ri->rp->kp;
- get_kprobe_ctlblk()->kprobe_status = KPROBE_HIT_ACTIVE;
- ri->rp->handler(ri, regs);
- __get_cpu_var(current_kprobe) = NULL;
- }
-
- orig_ret_address = (unsigned long)ri->ret_addr;
- recycle_rp_inst(ri, &empty_rp);
-
- if (orig_ret_address != trampoline_address)
- /*
- * This is the real return address. Any other
- * instances associated with this task are for
- * other calls deeper on the call stack
- */
- break;
- }
-
- kretprobe_assert(ri, orig_ret_address, trampoline_address);
-
- kretprobe_hash_unlock(current, &flags);
-
- hlist_for_each_entry_safe(ri, node, tmp, &empty_rp, hlist) {
- hlist_del(&ri->hlist);
- kfree(ri);
- }
- return (void *)orig_ret_address;
-}
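
The trampoline above is what makes return probes work. The following minimal sketch (again not part of the patch, with an assumed example symbol name) registers a kretprobe whose handler is called from trampoline_handler() with the fixed-up registers, so regs_return_value() yields the probed function's return value.

#include <linux/module.h>
#include <linux/kprobes.h>

static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
	/* Invoked from trampoline_handler() with the fixed-up registers. */
	pr_info("probed function returned %ld\n", regs_return_value(regs));
	return 0;
}

static struct kretprobe my_kretprobe = {
	.handler	 = ret_handler,
	.kp.symbol_name	 = "kernel_clone",	/* assumed example symbol */
	.maxactive	 = 16,	/* instances to pre-allocate in the hash */
};

static int __init krp_example_init(void)
{
	return register_kretprobe(&my_kretprobe);
}

static void __exit krp_example_exit(void)
{
	unregister_kretprobe(&my_kretprobe);
}

module_init(krp_example_init);
module_exit(krp_example_exit);
MODULE_LICENSE("GPL");
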
-
-/*
- * Called after single-stepping. p->addr is the address of the
- * instruction whose first byte has been replaced by the "int 3"
- * instruction. To avoid the SMP problems that can occur when we
- * temporarily put back the original opcode to single-step, we
- * single-stepped a copy of the instruction. The address of this
- * copy is p->ainsn.insn.
- *
- * This function prepares to return from the post-single-step
- * interrupt. We have to fix up the stack as follows:
- *
- * 0) Except in the case of absolute or indirect jump or call instructions,
- * the new ip is relative to the copied instruction. We need to make
- * it relative to the original instruction.
- *
- * 1) If the single-stepped instruction was pushfl, then the TF and IF
- * flags are set in the just-pushed flags, and may need to be cleared.
- *
- * 2) If the single-stepped instruction was a call, the return address
- * that is atop the stack is the address following the copied instruction.
- * We need to make it the address following the original instruction.
- *
- * If this is the first time we've single-stepped the instruction at
- * this probepoint, and the instruction is boostable, boost it: add a
- * jump instruction after the copied instruction, that jumps to the next
- * instruction after the probepoint.
- */
-static void __kprobes resume_execution(struct kprobe *p,
- struct pt_regs *regs, struct kprobe_ctlblk *kcb)
-{
- unsigned long *tos = stack_addr(regs);
- unsigned long copy_ip = (unsigned long)p->ainsn.insn;
- unsigned long orig_ip = (unsigned long)p->addr;
- kprobe_opcode_t *insn = p->ainsn.insn;
-
- /*skip the REX prefix*/
- if (is_REX_prefix(insn))
- insn++;
-
- regs->flags &= ~X86_EFLAGS_TF;
- switch (*insn) {
- case 0x9c: /* pushfl */
- *tos &= ~(X86_EFLAGS_TF | X86_EFLAGS_IF);
- *tos |= kcb->kprobe_old_flags;
- break;
- case 0xc2: /* iret/ret/lret */
- case 0xc3:
- case 0xca:
- case 0xcb:
- case 0xcf:
- case 0xea: /* jmp absolute -- ip is correct */
- /* ip is already adjusted, no more changes required */
- p->ainsn.boostable = 1;
- goto no_change;
- case 0xe8: /* call relative - Fix return addr */
- *tos = orig_ip + (*tos - copy_ip);
- break;
-#ifdef CONFIG_X86_32
-	case 0x9a: /* call absolute -- fix return addr, same as indirect call */
- *tos = orig_ip + (*tos - copy_ip);
- goto no_change;
-#endif
- case 0xff:
- if ((insn[1] & 0x30) == 0x10) {
- /*
- * call absolute, indirect
- * Fix return addr; ip is correct.
- * But this is not boostable
- */
- *tos = orig_ip + (*tos - copy_ip);
- goto no_change;
- } else if (((insn[1] & 0x31) == 0x20) ||
- ((insn[1] & 0x31) == 0x21)) {
- /*
- * jmp near and far, absolute indirect
- * ip is correct. And this is boostable
- */
- p->ainsn.boostable = 1;
- goto no_change;
- }
- default:
- break;
- }
-
- if (p->ainsn.boostable == 0) {
- if ((regs->ip > copy_ip) &&
- (regs->ip - copy_ip) + 5 < MAX_INSN_SIZE) {
-			/*
-			 * These instructions can be executed directly,
-			 * provided we add a jump back to the correct address.
-			 */
- synthesize_reljump((void *)regs->ip,
- (void *)orig_ip + (regs->ip - copy_ip));
- p->ainsn.boostable = 1;
- } else {
- p->ainsn.boostable = -1;
- }
- }
-
- regs->ip += orig_ip - copy_ip;
-
-no_change:
- restore_btf();
-}
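
The stack and ip fix-ups above are plain offset translation between the copied out-of-line slot and the original text. A small stand-alone sketch of the same arithmetic, with made-up addresses:

#include <stdio.h>

/*
 * Sketch of the fix-up in resume_execution(): an address produced while
 * running the copied instruction (near copy_ip) is translated back into
 * the original function (near orig_ip).
 */
static unsigned long fix_addr(unsigned long addr,
			      unsigned long copy_ip, unsigned long orig_ip)
{
	return orig_ip + (addr - copy_ip);
}

int main(void)
{
	unsigned long copy_ip = 0x1000;	/* made-up out-of-line slot */
	unsigned long orig_ip = 0x5000;	/* made-up probed address */

	/* A call pushed copy_ip + 5; the caller must see orig_ip + 5. */
	printf("fixed return address: %#lx\n",
	       fix_addr(copy_ip + 5, copy_ip, orig_ip));
	return 0;
}
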
-
-/*
- * Interrupts are disabled on entry as trap1 is an interrupt gate and they
- * remain disabled throughout this function.
- */
-static int __kprobes post_kprobe_handler(struct pt_regs *regs)
-{
- struct kprobe *cur = kprobe_running();
- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
-
- if (!cur)
- return 0;
-
- resume_execution(cur, regs, kcb);
- regs->flags |= kcb->kprobe_saved_flags;
-
- if ((kcb->kprobe_status != KPROBE_REENTER) && cur->post_handler) {
- kcb->kprobe_status = KPROBE_HIT_SSDONE;
- cur->post_handler(cur, regs, 0);
- }
-
- /* Restore back the original saved kprobes variables and continue. */
- if (kcb->kprobe_status == KPROBE_REENTER) {
- restore_previous_kprobe(kcb);
- goto out;
- }
- reset_current_kprobe();
-out:
- preempt_enable_no_resched();
-
- /*
- * if somebody else is singlestepping across a probe point, flags
- * will have TF set, in which case, continue the remaining processing
- * of do_debug, as if this is not a probe hit.
- */
- if (regs->flags & X86_EFLAGS_TF)
- return 0;
-
- return 1;
-}
-
-int __kprobes kprobe_fault_handler(struct pt_regs *regs, int trapnr)
-{
- struct kprobe *cur = kprobe_running();
- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
-
- switch (kcb->kprobe_status) {
- case KPROBE_HIT_SS:
- case KPROBE_REENTER:
- /*
- * We are here because the instruction being single
- * stepped caused a page fault. We reset the current
- * kprobe and the ip points back to the probe address
- * and allow the page fault handler to continue as a
- * normal page fault.
- */
- regs->ip = (unsigned long)cur->addr;
- regs->flags |= kcb->kprobe_old_flags;
- if (kcb->kprobe_status == KPROBE_REENTER)
- restore_previous_kprobe(kcb);
- else
- reset_current_kprobe();
- preempt_enable_no_resched();
- break;
- case KPROBE_HIT_ACTIVE:
- case KPROBE_HIT_SSDONE:
-		/*
-		 * We increment the nmissed count for accounting;
-		 * the npre/npostfault counts could also be used to
-		 * account for these specific fault cases.
-		 */
- kprobes_inc_nmissed_count(cur);
-
-		/*
-		 * We come here because instructions in the pre/post
-		 * handler caused the page fault. This could happen
-		 * if the handler tries to access user space via
-		 * copy_from_user(), get_user(), etc. Let the
-		 * user-specified handler try to fix it first.
-		 */
- if (cur->fault_handler && cur->fault_handler(cur, regs, trapnr))
- return 1;
-
- /*
- * In case the user-specified fault handler returned
- * zero, try to fix up.
- */
- if (fixup_exception(regs))
- return 1;
-
- /*
- * fixup routine could not handle it,
- * Let do_page_fault() fix it.
- */
- break;
- default:
- break;
- }
- return 0;
-}
-
-/*
- * Wrapper routine for handling exceptions.
- */
-int __kprobes kprobe_exceptions_notify(struct notifier_block *self,
- unsigned long val, void *data)
-{
- struct die_args *args = data;
- int ret = NOTIFY_DONE;
-
- if (args->regs && user_mode_vm(args->regs))
- return ret;
-
- switch (val) {
- case DIE_INT3:
- if (kprobe_handler(args->regs))
- ret = NOTIFY_STOP;
- break;
- case DIE_DEBUG:
- if (post_kprobe_handler(args->regs)) {
- /*
- * Reset the BS bit in dr6 (pointed by args->err) to
- * denote completion of processing
- */
- (*(unsigned long *)ERR_PTR(args->err)) &= ~DR_STEP;
- ret = NOTIFY_STOP;
- }
- break;
- case DIE_GPF:
-		/*
-		 * To be potentially processing a kprobe fault and to
-		 * trust the result from kprobe_running(), we have to
-		 * be non-preemptible.
-		 */
- if (!preemptible() && kprobe_running() &&
- kprobe_fault_handler(args->regs, args->trapnr))
- ret = NOTIFY_STOP;
- break;
- default:
- break;
- }
- return ret;
-}
-
-int __kprobes setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs)
-{
- struct jprobe *jp = container_of(p, struct jprobe, kp);
- unsigned long addr;
- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
-
- kcb->jprobe_saved_regs = *regs;
- kcb->jprobe_saved_sp = stack_addr(regs);
- addr = (unsigned long)(kcb->jprobe_saved_sp);
-
- /*
- * As Linus pointed out, gcc assumes that the callee
- * owns the argument space and could overwrite it, e.g.
- * tailcall optimization. So, to be absolutely safe
- * we also save and restore enough stack bytes to cover
- * the argument area.
- */
- memcpy(kcb->jprobes_stack, (kprobe_opcode_t *)addr,
- MIN_STACK_SIZE(addr));
- regs->flags &= ~X86_EFLAGS_IF;
- trace_hardirqs_off();
- regs->ip = (unsigned long)(jp->entry);
- return 1;
-}
-
-void __kprobes jprobe_return(void)
-{
- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
-
- asm volatile (
-#ifdef CONFIG_X86_64
- " xchg %%rbx,%%rsp \n"
-#else
- " xchgl %%ebx,%%esp \n"
-#endif
- " int3 \n"
- " .globl jprobe_return_end\n"
- " jprobe_return_end: \n"
- " nop \n"::"b"
- (kcb->jprobe_saved_sp):"memory");
-}
-
-int __kprobes longjmp_break_handler(struct kprobe *p, struct pt_regs *regs)
-{
- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
- u8 *addr = (u8 *) (regs->ip - 1);
- struct jprobe *jp = container_of(p, struct jprobe, kp);
-
- if ((addr > (u8 *) jprobe_return) &&
- (addr < (u8 *) jprobe_return_end)) {
- if (stack_addr(regs) != kcb->jprobe_saved_sp) {
- struct pt_regs *saved_regs = &kcb->jprobe_saved_regs;
- printk(KERN_ERR
- "current sp %p does not match saved sp %p\n",
- stack_addr(regs), kcb->jprobe_saved_sp);
- printk(KERN_ERR "Saved registers for jprobe %p\n", jp);
- show_registers(saved_regs);
- printk(KERN_ERR "Current registers\n");
- show_registers(regs);
- BUG();
- }
- *regs = kcb->jprobe_saved_regs;
- memcpy((kprobe_opcode_t *)(kcb->jprobe_saved_sp),
- kcb->jprobes_stack,
- MIN_STACK_SIZE(kcb->jprobe_saved_sp));
- preempt_enable_no_resched();
- return 1;
- }
- return 0;
-}
-
-
-#ifdef CONFIG_OPTPROBES
-
-/* Insert a call instruction at address 'from', which calls address 'to'.*/
-static void __kprobes synthesize_relcall(void *from, void *to)
-{
- __synthesize_relative_insn(from, to, RELATIVECALL_OPCODE);
-}
-
-/* Insert a move instruction which sets a pointer to eax/rdi (1st arg). */
-static void __kprobes synthesize_set_arg1(kprobe_opcode_t *addr,
- unsigned long val)
-{
-#ifdef CONFIG_X86_64
- *addr++ = 0x48;
- *addr++ = 0xbf;
-#else
- *addr++ = 0xb8;
-#endif
- *(unsigned long *)addr = val;
-}
-
-void __kprobes kprobes_optinsn_template_holder(void)
-{
- asm volatile (
- ".global optprobe_template_entry\n"
- "optprobe_template_entry: \n"
-#ifdef CONFIG_X86_64
- /* We don't bother saving the ss register */
- " pushq %rsp\n"
- " pushfq\n"
- SAVE_REGS_STRING
- " movq %rsp, %rsi\n"
- ".global optprobe_template_val\n"
- "optprobe_template_val: \n"
- ASM_NOP5
- ASM_NOP5
- ".global optprobe_template_call\n"
- "optprobe_template_call: \n"
- ASM_NOP5
- /* Move flags to rsp */
- " movq 144(%rsp), %rdx\n"
- " movq %rdx, 152(%rsp)\n"
- RESTORE_REGS_STRING
- /* Skip flags entry */
- " addq $8, %rsp\n"
- " popfq\n"
-#else /* CONFIG_X86_32 */
- " pushf\n"
- SAVE_REGS_STRING
- " movl %esp, %edx\n"
- ".global optprobe_template_val\n"
- "optprobe_template_val: \n"
- ASM_NOP5
- ".global optprobe_template_call\n"
- "optprobe_template_call: \n"
- ASM_NOP5
- RESTORE_REGS_STRING
- " addl $4, %esp\n" /* skip cs */
- " popf\n"
-#endif
- ".global optprobe_template_end\n"
- "optprobe_template_end: \n");
-}
-
-#define TMPL_MOVE_IDX \
- ((long)&optprobe_template_val - (long)&optprobe_template_entry)
-#define TMPL_CALL_IDX \
- ((long)&optprobe_template_call - (long)&optprobe_template_entry)
-#define TMPL_END_IDX \
- ((long)&optprobe_template_end - (long)&optprobe_template_entry)
-
-#define INT3_SIZE sizeof(kprobe_opcode_t)
-
-/* Optimized kprobe callback function: called from optinsn */
-static void __kprobes optimized_callback(struct optimized_kprobe *op,
- struct pt_regs *regs)
-{
- struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
-
- preempt_disable();
- if (kprobe_running()) {
- kprobes_inc_nmissed_count(&op->kp);
- } else {
- /* Save skipped registers */
-#ifdef CONFIG_X86_64
- regs->cs = __KERNEL_CS;
-#else
- regs->cs = __KERNEL_CS | get_kernel_rpl();
- regs->gs = 0;
-#endif
- regs->ip = (unsigned long)op->kp.addr + INT3_SIZE;
- regs->orig_ax = ~0UL;
-
- __get_cpu_var(current_kprobe) = &op->kp;
- kcb->kprobe_status = KPROBE_HIT_ACTIVE;
- opt_pre_handler(&op->kp, regs);
- __get_cpu_var(current_kprobe) = NULL;
- }
- preempt_enable_no_resched();
-}
-
-static int __kprobes copy_optimized_instructions(u8 *dest, u8 *src)
-{
- int len = 0, ret;
-
- while (len < RELATIVEJUMP_SIZE) {
- ret = __copy_instruction(dest + len, src + len, 1);
- if (!ret || !can_boost(dest + len))
- return -EINVAL;
- len += ret;
- }
- /* Check whether the address range is reserved */
- if (ftrace_text_reserved(src, src + len - 1) ||
- alternatives_text_reserved(src, src + len - 1))
- return -EBUSY;
-
- return len;
-}
-
-/* Check whether insn is an indirect jump */
-static int __kprobes insn_is_indirect_jump(struct insn *insn)
-{
- return ((insn->opcode.bytes[0] == 0xff &&
- (X86_MODRM_REG(insn->modrm.value) & 6) == 4) || /* Jump */
- insn->opcode.bytes[0] == 0xea); /* Segment based jump */
-}
-
-/* Check whether insn jumps into specified address range */
-static int insn_jump_into_range(struct insn *insn, unsigned long start, int len)
-{
- unsigned long target = 0;
-
- switch (insn->opcode.bytes[0]) {
- case 0xe0: /* loopne */
- case 0xe1: /* loope */
- case 0xe2: /* loop */
- case 0xe3: /* jcxz */
- case 0xe9: /* near relative jump */
- case 0xeb: /* short relative jump */
- break;
- case 0x0f:
- if ((insn->opcode.bytes[1] & 0xf0) == 0x80) /* jcc near */
- break;
- return 0;
- default:
- if ((insn->opcode.bytes[0] & 0xf0) == 0x70) /* jcc short */
- break;
- return 0;
- }
- target = (unsigned long)insn->next_byte + insn->immediate.value;
-
- return (start <= target && target <= start + len);
-}
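
The target computation above boils down to the byte after the instruction plus its signed immediate. A stand-alone sketch of the same range test, assuming the immediate has already been sign-extended:

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of insn_jump_into_range(): a relative branch lands at the byte
 * following the instruction plus its (sign-extended) immediate.
 */
static bool jumps_into_range(unsigned long next_byte, long imm,
			     unsigned long start, int len)
{
	unsigned long target = next_byte + imm;

	return start <= target && target <= start + len;
}

int main(void)
{
	/* e.g. a 2-byte jmp ending at 0x102 with imm -4: target 0xfe */
	printf("%d\n", jumps_into_range(0x102, -4, 0xf0, 0x10));	/* 1 */
	return 0;
}
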
-
-/* Decode the whole function to ensure no instruction jumps into the target */
-static int __kprobes can_optimize(unsigned long paddr)
-{
- int ret;
- unsigned long addr, size = 0, offset = 0;
- struct insn insn;
- kprobe_opcode_t buf[MAX_INSN_SIZE];
- /* Dummy buffers for lookup_symbol_attrs */
- static char __dummy_buf[KSYM_NAME_LEN];
-
- /* Lookup symbol including addr */
- if (!kallsyms_lookup(paddr, &size, &offset, NULL, __dummy_buf))
- return 0;
-
- /* Check there is enough space for a relative jump. */
- if (size - offset < RELATIVEJUMP_SIZE)
- return 0;
-
- /* Decode instructions */
- addr = paddr - offset;
- while (addr < paddr - offset + size) { /* Decode until function end */
- if (search_exception_tables(addr))
-			/*
-			 * Since some fixup code may jump into this function,
-			 * we can't optimize a kprobe in this function.
-			 */
- return 0;
- kernel_insn_init(&insn, (void *)addr);
- insn_get_opcode(&insn);
- if (insn.opcode.bytes[0] == BREAKPOINT_INSTRUCTION) {
- ret = recover_probed_instruction(buf, addr);
- if (ret)
- return 0;
- kernel_insn_init(&insn, buf);
- }
- insn_get_length(&insn);
- /* Recover address */
- insn.kaddr = (void *)addr;
- insn.next_byte = (void *)(addr + insn.length);
- /* Check any instructions don't jump into target */
- if (insn_is_indirect_jump(&insn) ||
- insn_jump_into_range(&insn, paddr + INT3_SIZE,
- RELATIVE_ADDR_SIZE))
- return 0;
- addr += insn.length;
- }
-
- return 1;
-}
-
-/* Check optimized_kprobe can actually be optimized. */
-int __kprobes arch_check_optimized_kprobe(struct optimized_kprobe *op)
-{
- int i;
- struct kprobe *p;
-
- for (i = 1; i < op->optinsn.size; i++) {
- p = get_kprobe(op->kp.addr + i);
- if (p && !kprobe_disabled(p))
- return -EEXIST;
- }
-
- return 0;
-}
-
-/* Check the addr is within the optimized instructions. */
-int __kprobes arch_within_optimized_kprobe(struct optimized_kprobe *op,
- unsigned long addr)
-{
- return ((unsigned long)op->kp.addr <= addr &&
- (unsigned long)op->kp.addr + op->optinsn.size > addr);
-}
-
-/* Free optimized instruction slot */
-static __kprobes
-void __arch_remove_optimized_kprobe(struct optimized_kprobe *op, int dirty)
-{
- if (op->optinsn.insn) {
- free_optinsn_slot(op->optinsn.insn, dirty);
- op->optinsn.insn = NULL;
- op->optinsn.size = 0;
- }
-}
-
-void __kprobes arch_remove_optimized_kprobe(struct optimized_kprobe *op)
-{
- __arch_remove_optimized_kprobe(op, 1);
-}
-
-/*
- * Copy replacing target instructions
- * Target instructions MUST be relocatable (checked inside)
- */
-int __kprobes arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
-{
- u8 *buf;
- int ret;
- long rel;
-
- if (!can_optimize((unsigned long)op->kp.addr))
- return -EILSEQ;
-
- op->optinsn.insn = get_optinsn_slot();
- if (!op->optinsn.insn)
- return -ENOMEM;
-
-	/*
-	 * Verify that the address gap is within the +/-2GB range, because
-	 * this uses a relative jump.
-	 */
- rel = (long)op->optinsn.insn - (long)op->kp.addr + RELATIVEJUMP_SIZE;
- if (abs(rel) > 0x7fffffff)
- return -ERANGE;
-
- buf = (u8 *)op->optinsn.insn;
-
- /* Copy instructions into the out-of-line buffer */
- ret = copy_optimized_instructions(buf + TMPL_END_IDX, op->kp.addr);
- if (ret < 0) {
- __arch_remove_optimized_kprobe(op, 0);
- return ret;
- }
- op->optinsn.size = ret;
-
- /* Copy arch-dep-instance from template */
- memcpy(buf, &optprobe_template_entry, TMPL_END_IDX);
-
- /* Set probe information */
- synthesize_set_arg1(buf + TMPL_MOVE_IDX, (unsigned long)op);
-
- /* Set probe function call */
- synthesize_relcall(buf + TMPL_CALL_IDX, optimized_callback);
-
- /* Set returning jmp instruction at the tail of out-of-line buffer */
- synthesize_reljump(buf + TMPL_END_IDX + op->optinsn.size,
- (u8 *)op->kp.addr + op->optinsn.size);
-
- flush_icache_range((unsigned long) buf,
- (unsigned long) buf + TMPL_END_IDX +
- op->optinsn.size + RELATIVEJUMP_SIZE);
- return 0;
-}
-
-/* Replace a breakpoint (int3) with a relative jump. */
-int __kprobes arch_optimize_kprobe(struct optimized_kprobe *op)
-{
- unsigned char jmp_code[RELATIVEJUMP_SIZE];
- s32 rel = (s32)((long)op->optinsn.insn -
- ((long)op->kp.addr + RELATIVEJUMP_SIZE));
-
- /* Backup instructions which will be replaced by jump address */
- memcpy(op->optinsn.copied_insn, op->kp.addr + INT3_SIZE,
- RELATIVE_ADDR_SIZE);
-
- jmp_code[0] = RELATIVEJUMP_OPCODE;
- *(s32 *)(&jmp_code[1]) = rel;
-
-	/*
-	 * text_poke_smp doesn't support modifying code that may run in
-	 * NMI/MCE context. However, since kprobes itself also doesn't
-	 * support probing NMI/MCE code, this is not a problem.
-	 */
- text_poke_smp(op->kp.addr, jmp_code, RELATIVEJUMP_SIZE);
- return 0;
-}
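
The jump written by arch_optimize_kprobe() is the usual 5-byte near jmp whose rel32 is measured from the end of the jump itself. A stand-alone sketch of the encoding, with made-up addresses standing in for kp.addr and the optinsn slot:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define RELJMP_OPCODE	0xe9	/* near jmp rel32, as synthesized above */
#define RELJMP_SIZE	5

/* Build the 5-byte jump that replaces the probe's int3. */
static void make_reljump(uint8_t buf[RELJMP_SIZE], uint64_t from, uint64_t to)
{
	int32_t rel = (int32_t)(to - (from + RELJMP_SIZE));

	buf[0] = RELJMP_OPCODE;
	memcpy(buf + 1, &rel, sizeof(rel));
}

int main(void)
{
	uint8_t jmp[RELJMP_SIZE];
	int i;

	/* made-up addresses; the gap must stay within +/-2GB */
	make_reljump(jmp, 0xffffffff81000100ULL, 0xffffffff81200000ULL);
	for (i = 0; i < RELJMP_SIZE; i++)
		printf("%02x ", jmp[i]);
	printf("\n");	/* e9 fb fe 1f 00 */
	return 0;
}
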
-
-/* Replace a relative jump with a breakpoint (int3). */
-void __kprobes arch_unoptimize_kprobe(struct optimized_kprobe *op)
-{
- u8 buf[RELATIVEJUMP_SIZE];
-
- /* Set int3 to first byte for kprobes */
- buf[0] = BREAKPOINT_INSTRUCTION;
- memcpy(buf + 1, op->optinsn.copied_insn, RELATIVE_ADDR_SIZE);
- text_poke_smp(op->kp.addr, buf, RELATIVEJUMP_SIZE);
-}
-
-static int __kprobes setup_detour_execution(struct kprobe *p,
- struct pt_regs *regs,
- int reenter)
-{
- struct optimized_kprobe *op;
-
- if (p->flags & KPROBE_FLAG_OPTIMIZED) {
- /* This kprobe is really able to run optimized path. */
- op = container_of(p, struct optimized_kprobe, kp);
- /* Detour through copied instructions */
- regs->ip = (unsigned long)op->optinsn.insn + TMPL_END_IDX;
- if (!reenter)
- reset_current_kprobe();
- preempt_enable_no_resched();
- return 1;
- }
- return 0;
-}
-#endif
-
-int __init arch_init_kprobes(void)
-{
- return 0;
-}
-
-int __kprobes arch_trampoline_kprobe(struct kprobe *p)
-{
- return 0;
-}
diff --git a/arch/x86/kernel/kprobes/Makefile b/arch/x86/kernel/kprobes/Makefile
new file mode 100644
index 000000000000..8a753432b2d4
--- /dev/null
+++ b/arch/x86/kernel/kprobes/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Makefile for kernel probes
+#
+
+obj-$(CONFIG_KPROBES) += core.o
+obj-$(CONFIG_OPTPROBES) += opt.o
+obj-$(CONFIG_KPROBES_ON_FTRACE) += ftrace.o
diff --git a/arch/x86/kernel/kprobes/common.h b/arch/x86/kernel/kprobes/common.h
new file mode 100644
index 000000000000..e772276f5aa9
--- /dev/null
+++ b/arch/x86/kernel/kprobes/common.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __X86_KERNEL_KPROBES_COMMON_H
+#define __X86_KERNEL_KPROBES_COMMON_H
+
+/* Kprobes and Optprobes common header */
+
+#include <asm/asm.h>
+#include <asm/frame.h>
+#include <asm/insn.h>
+
+#ifdef CONFIG_X86_64
+
+#define SAVE_REGS_STRING \
+ /* Skip cs, ip, orig_ax. */ \
+ " subq $24, %rsp\n" \
+ " pushq %rdi\n" \
+ " pushq %rsi\n" \
+ " pushq %rdx\n" \
+ " pushq %rcx\n" \
+ " pushq %rax\n" \
+ " pushq %r8\n" \
+ " pushq %r9\n" \
+ " pushq %r10\n" \
+ " pushq %r11\n" \
+ " pushq %rbx\n" \
+ " pushq %rbp\n" \
+ " pushq %r12\n" \
+ " pushq %r13\n" \
+ " pushq %r14\n" \
+ " pushq %r15\n" \
+ ENCODE_FRAME_POINTER
+
+#define RESTORE_REGS_STRING \
+ " popq %r15\n" \
+ " popq %r14\n" \
+ " popq %r13\n" \
+ " popq %r12\n" \
+ " popq %rbp\n" \
+ " popq %rbx\n" \
+ " popq %r11\n" \
+ " popq %r10\n" \
+ " popq %r9\n" \
+ " popq %r8\n" \
+ " popq %rax\n" \
+ " popq %rcx\n" \
+ " popq %rdx\n" \
+ " popq %rsi\n" \
+ " popq %rdi\n" \
+ /* Skip orig_ax, ip, cs */ \
+ " addq $24, %rsp\n"
+#else
+
+#define SAVE_REGS_STRING \
+ /* Skip cs, ip, orig_ax and gs. */ \
+ " subl $4*4, %esp\n" \
+ " pushl %fs\n" \
+ " pushl %es\n" \
+ " pushl %ds\n" \
+ " pushl %eax\n" \
+ " pushl %ebp\n" \
+ " pushl %edi\n" \
+ " pushl %esi\n" \
+ " pushl %edx\n" \
+ " pushl %ecx\n" \
+ " pushl %ebx\n" \
+ ENCODE_FRAME_POINTER
+
+#define RESTORE_REGS_STRING \
+ " popl %ebx\n" \
+ " popl %ecx\n" \
+ " popl %edx\n" \
+ " popl %esi\n" \
+ " popl %edi\n" \
+ " popl %ebp\n" \
+ " popl %eax\n" \
+ /* Skip ds, es, fs, gs, orig_ax, ip, and cs. */\
+ " addl $7*4, %esp\n"
+#endif
+
+/* Check whether the instruction can be boosted */
+extern bool can_boost(struct insn *insn, void *orig_addr);
+/* Recover instruction if given address is probed */
+extern unsigned long recover_probed_instruction(kprobe_opcode_t *buf,
+ unsigned long addr);
+/*
+ * Copy an instruction and adjust the displacement if the instruction
+ * uses the %rip-relative addressing mode.
+ */
+extern int __copy_instruction(u8 *dest, u8 *src, u8 *real, struct insn *insn);
+
+/* Generate a relative-jump/call instruction */
+extern void synthesize_reljump(void *dest, void *from, void *to);
+extern void synthesize_relcall(void *dest, void *from, void *to);
+
+#ifdef CONFIG_OPTPROBES
+extern int setup_detour_execution(struct kprobe *p, struct pt_regs *regs, int reenter);
+extern unsigned long __recover_optprobed_insn(kprobe_opcode_t *buf, unsigned long addr);
+#else /* !CONFIG_OPTPROBES */
+static inline int setup_detour_execution(struct kprobe *p, struct pt_regs *regs, int reenter)
+{
+ return 0;
+}
+static inline unsigned long __recover_optprobed_insn(kprobe_opcode_t *buf, unsigned long addr)
+{
+ return addr;
+}
+#endif
+
+#endif
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
new file mode 100644
index 000000000000..c1fac3a9fecc
--- /dev/null
+++ b/arch/x86/kernel/kprobes/core.c
@@ -0,0 +1,1081 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Kernel Probes (KProbes)
+ *
+ * Copyright (C) IBM Corporation, 2002, 2004
+ *
+ * 2002-Oct Created by Vamsi Krishna S <vamsi_krishna@in.ibm.com> Kernel
+ * Probes initial implementation ( includes contributions from
+ * Rusty Russell).
+ * 2004-July Suparna Bhattacharya <suparna@in.ibm.com> added jumper probes
+ * interface to access function arguments.
+ * 2004-Oct Jim Keniston <jkenisto@us.ibm.com> and Prasanna S Panchamukhi
+ * <prasanna@in.ibm.com> adapted for x86_64 from i386.
+ * 2005-Mar Roland McGrath <roland@redhat.com>
+ * Fixed to handle %rip-relative addressing mode correctly.
+ * 2005-May Hien Nguyen <hien@us.ibm.com>, Jim Keniston
+ * <jkenisto@us.ibm.com> and Prasanna S Panchamukhi
+ * <prasanna@in.ibm.com> added function-return probes.
+ * 2005-May Rusty Lynch <rusty.lynch@intel.com>
+ * Added function return probes functionality
+ * 2006-Feb Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp> added
+ * kprobe-booster and kretprobe-booster for i386.
+ * 2007-Dec Masami Hiramatsu <mhiramat@redhat.com> added kprobe-booster
+ * and kretprobe-booster for x86-64
+ * 2007-Dec Masami Hiramatsu <mhiramat@redhat.com>, Arjan van de Ven
+ * <arjan@infradead.org> and Jim Keniston <jkenisto@us.ibm.com>
+ * unified x86 kprobes code.
+ */
+#include <linux/kprobes.h>
+#include <linux/ptrace.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/hardirq.h>
+#include <linux/preempt.h>
+#include <linux/sched/debug.h>
+#include <linux/perf_event.h>
+#include <linux/extable.h>
+#include <linux/kdebug.h>
+#include <linux/kallsyms.h>
+#include <linux/kgdb.h>
+#include <linux/ftrace.h>
+#include <linux/kasan.h>
+#include <linux/objtool.h>
+#include <linux/vmalloc.h>
+#include <linux/pgtable.h>
+#include <linux/set_memory.h>
+#include <linux/cfi.h>
+#include <linux/execmem.h>
+
+#include <asm/text-patching.h>
+#include <asm/cacheflush.h>
+#include <asm/desc.h>
+#include <linux/uaccess.h>
+#include <asm/alternative.h>
+#include <asm/insn.h>
+#include <asm/debugreg.h>
+#include <asm/ibt.h>
+
+#include "common.h"
+
+DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
+DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
+
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+ (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
+ (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
+ (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
+ (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
+ << (row % 32))
+	/*
+	 * Undefined/reserved opcodes, conditional jumps, opcode extension
+	 * groups, and some special opcodes can not be boosted.
+	 * This is non-const and volatile to keep gcc from statically
+	 * optimizing it out, as variable_test_bit makes gcc think only
+	 * *(unsigned long*) is used.
+	 */
+static volatile u32 twobyte_is_boostable[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0) | /* 00 */
+ W(0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 20 */
+ W(0x30, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 50 */
+ W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1) | /* 60 */
+ W(0x70, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */
+ W(0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
+ W(0xd0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) , /* d0 */
+ W(0xe0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) | /* e0 */
+ W(0xf0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0) /* f0 */
+ /* ----------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+#undef W
+
+struct kretprobe_blackpoint kretprobe_blacklist[] = {
+ {"__switch_to", }, /* This function switches only current task, but
+ doesn't switch kernel stack.*/
+ {NULL, NULL} /* Terminator */
+};
+
+const int kretprobe_blacklist_size = ARRAY_SIZE(kretprobe_blacklist);
+
+static nokprobe_inline void
+__synthesize_relative_insn(void *dest, void *from, void *to, u8 op)
+{
+ struct __arch_relative_insn {
+ u8 op;
+ s32 raddr;
+ } __packed *insn;
+
+ insn = (struct __arch_relative_insn *)dest;
+ insn->raddr = (s32)((long)(to) - ((long)(from) + 5));
+ insn->op = op;
+}
+
+/* Insert a jump instruction at address 'from', which jumps to address 'to'.*/
+void synthesize_reljump(void *dest, void *from, void *to)
+{
+ __synthesize_relative_insn(dest, from, to, JMP32_INSN_OPCODE);
+}
+NOKPROBE_SYMBOL(synthesize_reljump);
+
+/* Insert a call instruction at address 'from', which calls address 'to'.*/
+void synthesize_relcall(void *dest, void *from, void *to)
+{
+ __synthesize_relative_insn(dest, from, to, CALL_INSN_OPCODE);
+}
+NOKPROBE_SYMBOL(synthesize_relcall);
+
+/*
+ * Returns true if INSN is boostable.
+ * RIP-relative instructions are adjusted at copy time in 64-bit mode.
+ */
+bool can_boost(struct insn *insn, void *addr)
+{
+ kprobe_opcode_t opcode;
+ insn_byte_t prefix;
+
+ if (search_exception_tables((unsigned long)addr))
+ return false; /* Page fault may occur on this address. */
+
+ /* 2nd-byte opcode */
+ if (insn->opcode.nbytes == 2)
+ return test_bit(insn->opcode.bytes[1],
+ (unsigned long *)twobyte_is_boostable);
+
+ if (insn->opcode.nbytes != 1)
+ return false;
+
+ for_each_insn_prefix(insn, prefix) {
+ insn_attr_t attr;
+
+ attr = inat_get_opcode_attribute(prefix);
+		/* Can't boost address-size override or CS override prefixes */
+ if (prefix == 0x2e || inat_is_address_size_prefix(attr))
+ return false;
+ }
+
+ opcode = insn->opcode.bytes[0];
+
+ switch (opcode) {
+ case 0x62: /* bound */
+ case 0x70 ... 0x7f: /* Conditional jumps */
+ case 0x9a: /* Call far */
+ case 0xcc ... 0xce: /* software exceptions */
+ case 0xd6: /* (UD) */
+ case 0xd8 ... 0xdf: /* ESC */
+ case 0xe0 ... 0xe3: /* LOOP*, JCXZ */
+ case 0xe8 ... 0xe9: /* near Call, JMP */
+ case 0xeb: /* Short JMP */
+ case 0xf0 ... 0xf4: /* LOCK/REP, HLT */
+ /* ... are not boostable */
+ return false;
+ case 0xc0 ... 0xc1: /* Grp2 */
+ case 0xd0 ... 0xd3: /* Grp2 */
+ /*
+ * AMD uses nnn == 110 as SHL/SAL, but Intel makes it reserved.
+ */
+ return X86_MODRM_REG(insn->modrm.bytes[0]) != 0b110;
+ case 0xf6 ... 0xf7: /* Grp3 */
+ /* AMD uses nnn == 001 as TEST, but Intel makes it reserved. */
+ return X86_MODRM_REG(insn->modrm.bytes[0]) != 0b001;
+ case 0xfe: /* Grp4 */
+ /* Only INC and DEC are boostable */
+ return X86_MODRM_REG(insn->modrm.bytes[0]) == 0b000 ||
+ X86_MODRM_REG(insn->modrm.bytes[0]) == 0b001;
+ case 0xff: /* Grp5 */
+ /* Only INC, DEC, and indirect JMP are boostable */
+ return X86_MODRM_REG(insn->modrm.bytes[0]) == 0b000 ||
+ X86_MODRM_REG(insn->modrm.bytes[0]) == 0b001 ||
+ X86_MODRM_REG(insn->modrm.bytes[0]) == 0b100;
+ default:
+ return true;
+ }
+}
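
The twobyte_is_boostable[] table consulted above packs one decision bit per second-byte opcode into eight 32-bit words; test_bit() simply indexes word opcode/32 and bit opcode%32. A stand-alone sketch of the same lookup:

#include <stdio.h>
#include <stdint.h>

/* One bit per second-byte opcode: 256 bits in eight 32-bit words. */
static int opcode_is_boostable(const uint32_t map[8], uint8_t opcode)
{
	return (map[opcode / 32] >> (opcode % 32)) & 1;
}

int main(void)
{
	uint32_t map[8] = { 0 };

	/* Mark 0x0f 0x90 (seto) as boostable, the way row 0x90 does. */
	map[0x90 / 32] |= UINT32_C(1) << (0x90 % 32);

	printf("0x90: %d\n", opcode_is_boostable(map, 0x90));	/* 1 */
	printf("0x80: %d\n", opcode_is_boostable(map, 0x80));	/* 0 */
	return 0;
}
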
+
+static unsigned long
+__recover_probed_insn(kprobe_opcode_t *buf, unsigned long addr)
+{
+ struct kprobe *kp;
+ bool faddr;
+
+ kp = get_kprobe((void *)addr);
+ faddr = ftrace_location(addr) == addr;
+ /*
+ * Use the current code if it is not modified by Kprobe
+ * and it cannot be modified by ftrace.
+ */
+ if (!kp && !faddr)
+ return addr;
+
+	/*
+	 * Basically, kp->ainsn.insn holds the original instruction.
+	 * However, a RIP-relative instruction can not be single-stepped
+	 * at a different place, so __copy_instruction() tweaks its
+	 * displacement. In that case, we can't recover the original
+	 * instruction from kp->ainsn.insn.
+	 *
+	 * On the other hand, for a normal kprobe, kp->opcode holds a copy
+	 * of the first byte of the probed instruction, which int3
+	 * overwrites. Since the instruction at kp->addr is not modified
+	 * by kprobes except for that first byte, we can recover the
+	 * original instruction from it and kp->opcode.
+	 *
+	 * In case of kprobes using ftrace, we do not have a copy of
+	 * the original instruction. In fact, the ftrace location might
+	 * be modified at any time and could even be in an inconsistent
+	 * state. Fortunately, we know that the original code is the
+	 * ideal 5-byte-long NOP.
+	 */
+ if (copy_from_kernel_nofault(buf, (void *)addr,
+ MAX_INSN_SIZE * sizeof(kprobe_opcode_t)))
+ return 0UL;
+
+ if (faddr)
+ memcpy(buf, x86_nops[5], 5);
+ else
+ buf[0] = kp->opcode;
+ return (unsigned long)buf;
+}
+
+/*
+ * Recover the probed instruction at addr for further analysis.
+ * Caller must lock kprobes by kprobe_mutex, or disable preemption
+ * for preventing to release referencing kprobes.
+ * Returns zero if the instruction can not get recovered (or access failed).
+ */
+unsigned long recover_probed_instruction(kprobe_opcode_t *buf, unsigned long addr)
+{
+ unsigned long __addr;
+
+ __addr = __recover_optprobed_insn(buf, addr);
+ if (__addr != addr)
+ return __addr;
+
+ return __recover_probed_insn(buf, addr);
+}
+
+/* Check if insn is INT or UD */
+static inline bool is_exception_insn(struct insn *insn)
+{
+ /* UD uses 0f escape */
+ if (insn->opcode.bytes[0] == 0x0f) {
+ /* UD0 / UD1 / UD2 */
+ return insn->opcode.bytes[1] == 0xff ||
+ insn->opcode.bytes[1] == 0xb9 ||
+ insn->opcode.bytes[1] == 0x0b;
+ }
+
+ /* INT3 / INT n / INTO / INT1 */
+ return insn->opcode.bytes[0] == 0xcc ||
+ insn->opcode.bytes[0] == 0xcd ||
+ insn->opcode.bytes[0] == 0xce ||
+ insn->opcode.bytes[0] == 0xf1;
+}
+
+/*
+ * Check if paddr is at an instruction boundary and that instruction can
+ * be probed
+ */
+static bool can_probe(unsigned long paddr)
+{
+ unsigned long addr, __addr, offset = 0;
+ struct insn insn;
+ kprobe_opcode_t buf[MAX_INSN_SIZE];
+
+ if (!kallsyms_lookup_size_offset(paddr, NULL, &offset))
+ return false;
+
+ /* Decode instructions */
+ addr = paddr - offset;
+ while (addr < paddr) {
+ /*
+ * Check if the instruction has been modified by another
+ * kprobe, in which case we replace the breakpoint by the
+ * original instruction in our buffer.
+ * Also, jump optimization will change the breakpoint to
+ * relative-jump. Since the relative-jump itself is
+ * normally used, we just go through if there is no kprobe.
+ */
+ __addr = recover_probed_instruction(buf, addr);
+ if (!__addr)
+ return false;
+
+ if (insn_decode_kernel(&insn, (void *)__addr) < 0)
+ return false;
+
+#ifdef CONFIG_KGDB
+ /*
+ * If there is a dynamically installed kgdb sw breakpoint,
+ * this function should not be probed.
+ */
+ if (insn.opcode.bytes[0] == INT3_INSN_OPCODE &&
+ kgdb_has_hit_break(addr))
+ return false;
+#endif
+ addr += insn.length;
+ }
+
+ /* Check if paddr is at an instruction boundary */
+ if (addr != paddr)
+ return false;
+
+ __addr = recover_probed_instruction(buf, addr);
+ if (!__addr)
+ return false;
+
+ if (insn_decode_kernel(&insn, (void *)__addr) < 0)
+ return false;
+
+ /* INT and UD are special and should not be kprobed */
+ if (is_exception_insn(&insn))
+ return false;
+
+ if (IS_ENABLED(CONFIG_CFI)) {
+ /*
+ * The compiler generates the following instruction sequence
+ * for indirect call checks and cfi.c decodes this;
+ *
+		 *   movl -<id>, %r10d ; 6 bytes
+ * addl -4(%reg), %r10d ; 4 bytes
+ * je .Ltmp1 ; 2 bytes
+ * ud2 ; <- regs->ip
+ * .Ltmp1:
+ *
+ * Also, these movl and addl are used for showing expected
+ * type. So those must not be touched.
+ */
+ if (insn.opcode.value == 0xBA)
+ offset = 12;
+ else if (insn.opcode.value == 0x3)
+ offset = 6;
+ else
+ goto out;
+
+ /* This movl/addl is used for decoding CFI. */
+ if (is_cfi_trap(addr + offset))
+ return false;
+ }
+
+out:
+ return true;
+}
+
+/* If x86 supports IBT (ENDBR) it must be skipped. */
+kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offset,
+ bool *on_func_entry)
+{
+ if (is_endbr((u32 *)addr)) {
+ *on_func_entry = !offset || offset == 4;
+ if (*on_func_entry)
+ offset = 4;
+
+ } else {
+ *on_func_entry = !offset;
+ }
+
+ return (kprobe_opcode_t *)(addr + offset);
+}
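
On kernels built with IBT, functions begin with a 4-byte ENDBR that a probe must not overwrite, which is why entry probes are pushed to offset 4 above. A simplified stand-alone sketch of the adjustment, with the ENDBR detection reduced to a boolean:

#include <stdbool.h>
#include <stdio.h>

#define ENDBR_SIZE 4	/* endbr64 occupies 4 bytes at function entry */

/*
 * Simplified sketch of arch_adjust_kprobe_addr(): when the function
 * starts with ENDBR, an "entry" probe really lives at offset 4.
 */
static unsigned long adjust_probe_addr(unsigned long addr, unsigned long offset,
				       bool has_endbr, bool *on_func_entry)
{
	if (has_endbr) {
		*on_func_entry = !offset || offset == ENDBR_SIZE;
		if (*on_func_entry)
			offset = ENDBR_SIZE;
	} else {
		*on_func_entry = !offset;
	}
	return addr + offset;
}

int main(void)
{
	bool entry;

	printf("%#lx entry=%d\n",
	       adjust_probe_addr(0x1000, 0, true, &entry), entry);	/* 0x1004 */
	return 0;
}
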
+
+/*
+ * Copy an instruction with recovering modified instruction by kprobes
+ * and adjust the displacement if the instruction uses the %rip-relative
+ * addressing mode. Note that since @real will be the final place of copied
+ * instruction, displacement must be adjust by @real, not @dest.
+ * This returns the length of copied instruction, or 0 if it has an error.
+ */
+int __copy_instruction(u8 *dest, u8 *src, u8 *real, struct insn *insn)
+{
+ kprobe_opcode_t buf[MAX_INSN_SIZE];
+ unsigned long recovered_insn = recover_probed_instruction(buf, (unsigned long)src);
+ int ret;
+
+ if (!recovered_insn || !insn)
+ return 0;
+
+ /* This can access kernel text if given address is not recovered */
+ if (copy_from_kernel_nofault(dest, (void *)recovered_insn,
+ MAX_INSN_SIZE))
+ return 0;
+
+ ret = insn_decode_kernel(insn, dest);
+ if (ret < 0)
+ return 0;
+
+	/* We can not probe an instruction with a forced-emulation prefix */
+ if (insn_has_emulate_prefix(insn))
+ return 0;
+
+	/* Another subsystem put a breakpoint there; we failed to recover it */
+ if (insn->opcode.bytes[0] == INT3_INSN_OPCODE)
+ return 0;
+
+ /* We should not singlestep on the exception masking instructions */
+ if (insn_masking_exception(insn))
+ return 0;
+
+#ifdef CONFIG_X86_64
+ /* Only x86_64 has RIP relative instructions */
+ if (insn_rip_relative(insn)) {
+ s64 newdisp;
+ u8 *disp;
+ /*
+ * The copied instruction uses the %rip-relative addressing
+ * mode. Adjust the displacement for the difference between
+ * the original location of this instruction and the location
+ * of the copy that will actually be run. The tricky bit here
+ * is making sure that the sign extension happens correctly in
+ * this calculation, since we need a signed 32-bit result to
+ * be sign-extended to 64 bits when it's added to the %rip
+ * value and yield the same 64-bit result that the sign-
+ * extension of the original signed 32-bit displacement would
+ * have given.
+ */
+ newdisp = (u8 *) src + (s64) insn->displacement.value
+ - (u8 *) real;
+ if ((s64) (s32) newdisp != newdisp) {
+ pr_err("Kprobes error: new displacement does not fit into s32 (%llx)\n", newdisp);
+ return 0;
+ }
+ disp = (u8 *) dest + insn_offset_displacement(insn);
+ *(s32 *) disp = (s32) newdisp;
+ }
+#endif
+ return insn->length;
+}
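
The displacement fix-up above preserves the effective address: src + insn_len + disp must equal real + insn_len + newdisp, hence newdisp = src + disp - real, which must still fit in a signed 32-bit field. A stand-alone sketch of that check, with made-up addresses:

#include <stdio.h>
#include <stdint.h>

/*
 * Sketch of the %rip-relative fix-up: the effective address
 * src + len + disp must not change when the copy runs at 'real',
 * so newdisp = src + disp - real (the lengths cancel out).
 */
static int retarget_disp(uint64_t src, uint64_t real, int32_t disp,
			 int32_t *newdisp)
{
	int64_t nd = (int64_t)(src - real) + disp;

	if ((int64_t)(int32_t)nd != nd)
		return -1;	/* does not fit in the rel32 field */
	*newdisp = (int32_t)nd;
	return 0;
}

int main(void)
{
	int32_t nd;

	if (!retarget_disp(0x401000, 0x7f1000, 0x100, &nd))
		printf("newdisp = %d\n", nd);	/* 0x100 - 0x3f0000 */
	return 0;
}
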
+
+/* Prepare reljump or int3 right after instruction */
+static int prepare_singlestep(kprobe_opcode_t *buf, struct kprobe *p,
+ struct insn *insn)
+{
+ int len = insn->length;
+
+ if (!IS_ENABLED(CONFIG_PREEMPTION) &&
+ !p->post_handler && can_boost(insn, p->addr) &&
+ MAX_INSN_SIZE - len >= JMP32_INSN_SIZE) {
+		/*
+		 * These instructions can be executed directly, provided
+		 * we add a jump back to the correct address.
+		 */
+ synthesize_reljump(buf + len, p->ainsn.insn + len,
+ p->addr + insn->length);
+ len += JMP32_INSN_SIZE;
+ p->ainsn.boostable = 1;
+ } else {
+ /* Otherwise, put an int3 for trapping singlestep */
+ if (MAX_INSN_SIZE - len < INT3_INSN_SIZE)
+ return -ENOSPC;
+
+ buf[len] = INT3_INSN_OPCODE;
+ len += INT3_INSN_SIZE;
+ }
+
+ return len;
+}
+
+/* Kprobe x86 instruction emulation - only regs->ip or IF flag modifiers */
+
+static void kprobe_emulate_ifmodifiers(struct kprobe *p, struct pt_regs *regs)
+{
+ switch (p->ainsn.opcode) {
+ case 0xfa: /* cli */
+ regs->flags &= ~(X86_EFLAGS_IF);
+ break;
+ case 0xfb: /* sti */
+ regs->flags |= X86_EFLAGS_IF;
+ break;
+ case 0x9c: /* pushf */
+ int3_emulate_push(regs, regs->flags);
+ break;
+ case 0x9d: /* popf */
+ regs->flags = int3_emulate_pop(regs);
+ break;
+ }
+ regs->ip = regs->ip - INT3_INSN_SIZE + p->ainsn.size;
+}
+NOKPROBE_SYMBOL(kprobe_emulate_ifmodifiers);
+
+static void kprobe_emulate_ret(struct kprobe *p, struct pt_regs *regs)
+{
+ int3_emulate_ret(regs);
+}
+NOKPROBE_SYMBOL(kprobe_emulate_ret);
+
+static void kprobe_emulate_call(struct kprobe *p, struct pt_regs *regs)
+{
+ unsigned long func = regs->ip - INT3_INSN_SIZE + p->ainsn.size;
+
+ func += p->ainsn.rel32;
+ int3_emulate_call(regs, func);
+}
+NOKPROBE_SYMBOL(kprobe_emulate_call);
+
+static void kprobe_emulate_jmp(struct kprobe *p, struct pt_regs *regs)
+{
+ unsigned long ip = regs->ip - INT3_INSN_SIZE + p->ainsn.size;
+
+ ip += p->ainsn.rel32;
+ int3_emulate_jmp(regs, ip);
+}
+NOKPROBE_SYMBOL(kprobe_emulate_jmp);
+
+static void kprobe_emulate_jcc(struct kprobe *p, struct pt_regs *regs)
+{
+ unsigned long ip = regs->ip - INT3_INSN_SIZE + p->ainsn.size;
+
+ int3_emulate_jcc(regs, p->ainsn.jcc.type, ip, p->ainsn.rel32);
+}
+NOKPROBE_SYMBOL(kprobe_emulate_jcc);
+
+static void kprobe_emulate_loop(struct kprobe *p, struct pt_regs *regs)
+{
+ unsigned long ip = regs->ip - INT3_INSN_SIZE + p->ainsn.size;
+ bool match;
+
+ if (p->ainsn.loop.type != 3) { /* LOOP* */
+ if (p->ainsn.loop.asize == 32)
+ match = ((*(u32 *)&regs->cx)--) != 0;
+#ifdef CONFIG_X86_64
+ else if (p->ainsn.loop.asize == 64)
+ match = ((*(u64 *)&regs->cx)--) != 0;
+#endif
+ else
+ match = ((*(u16 *)&regs->cx)--) != 0;
+ } else { /* JCXZ */
+ if (p->ainsn.loop.asize == 32)
+ match = *(u32 *)(&regs->cx) == 0;
+#ifdef CONFIG_X86_64
+ else if (p->ainsn.loop.asize == 64)
+ match = *(u64 *)(&regs->cx) == 0;
+#endif
+ else
+ match = *(u16 *)(&regs->cx) == 0;
+ }
+
+ if (p->ainsn.loop.type == 0) /* LOOPNE */
+ match = match && !(regs->flags & X86_EFLAGS_ZF);
+ else if (p->ainsn.loop.type == 1) /* LOOPE */
+ match = match && (regs->flags & X86_EFLAGS_ZF);
+
+ if (match)
+ ip += p->ainsn.rel32;
+ int3_emulate_jmp(regs, ip);
+}
+NOKPROBE_SYMBOL(kprobe_emulate_loop);
+
+static const int addrmode_regoffs[] = {
+ offsetof(struct pt_regs, ax),
+ offsetof(struct pt_regs, cx),
+ offsetof(struct pt_regs, dx),
+ offsetof(struct pt_regs, bx),
+ offsetof(struct pt_regs, sp),
+ offsetof(struct pt_regs, bp),
+ offsetof(struct pt_regs, si),
+ offsetof(struct pt_regs, di),
+#ifdef CONFIG_X86_64
+ offsetof(struct pt_regs, r8),
+ offsetof(struct pt_regs, r9),
+ offsetof(struct pt_regs, r10),
+ offsetof(struct pt_regs, r11),
+ offsetof(struct pt_regs, r12),
+ offsetof(struct pt_regs, r13),
+ offsetof(struct pt_regs, r14),
+ offsetof(struct pt_regs, r15),
+#endif
+};
+
+static void kprobe_emulate_call_indirect(struct kprobe *p, struct pt_regs *regs)
+{
+ unsigned long offs = addrmode_regoffs[p->ainsn.indirect.reg];
+
+ int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + p->ainsn.size);
+ int3_emulate_jmp(regs, regs_get_register(regs, offs));
+}
+NOKPROBE_SYMBOL(kprobe_emulate_call_indirect);
+
+static void kprobe_emulate_jmp_indirect(struct kprobe *p, struct pt_regs *regs)
+{
+ unsigned long offs = addrmode_regoffs[p->ainsn.indirect.reg];
+
+ int3_emulate_jmp(regs, regs_get_register(regs, offs));
+}
+NOKPROBE_SYMBOL(kprobe_emulate_jmp_indirect);
+
+static int prepare_emulation(struct kprobe *p, struct insn *insn)
+{
+ insn_byte_t opcode = insn->opcode.bytes[0];
+
+ switch (opcode) {
+ case 0xfa: /* cli */
+ case 0xfb: /* sti */
+ case 0x9c: /* pushfl */
+ case 0x9d: /* popf/popfd */
+		/*
+		 * IF-modifying instructions must be emulated, since
+		 * single-stepping them over int3 could enable interrupts
+		 * during the step.
+		 */
+ p->ainsn.emulate_op = kprobe_emulate_ifmodifiers;
+ p->ainsn.opcode = opcode;
+ break;
+ case 0xc2: /* ret/lret */
+ case 0xc3:
+ case 0xca:
+ case 0xcb:
+ p->ainsn.emulate_op = kprobe_emulate_ret;
+ break;
+ case 0x9a: /* far call absolute -- segment is not supported */
+ case 0xea: /* far jmp absolute -- segment is not supported */
+ case 0xcc: /* int3 */
+ case 0xcf: /* iret -- in-kernel IRET is not supported */
+ return -EOPNOTSUPP;
+ case 0xe8: /* near call relative */
+ p->ainsn.emulate_op = kprobe_emulate_call;
+ if (insn->immediate.nbytes == 2)
+ p->ainsn.rel32 = *(s16 *)&insn->immediate.value;
+ else
+ p->ainsn.rel32 = *(s32 *)&insn->immediate.value;
+ break;
+ case 0xeb: /* short jump relative */
+ case 0xe9: /* near jump relative */
+ p->ainsn.emulate_op = kprobe_emulate_jmp;
+ if (insn->immediate.nbytes == 1)
+ p->ainsn.rel32 = *(s8 *)&insn->immediate.value;
+ else if (insn->immediate.nbytes == 2)
+ p->ainsn.rel32 = *(s16 *)&insn->immediate.value;
+ else
+ p->ainsn.rel32 = *(s32 *)&insn->immediate.value;
+ break;
+ case 0x70 ... 0x7f:
+ /* 1 byte conditional jump */
+ p->ainsn.emulate_op = kprobe_emulate_jcc;
+ p->ainsn.jcc.type = opcode & 0xf;
+ p->ainsn.rel32 = insn->immediate.value;
+ break;
+ case 0x0f:
+ opcode = insn->opcode.bytes[1];
+ if ((opcode & 0xf0) == 0x80) {
+ /* 2 bytes Conditional Jump */
+ p->ainsn.emulate_op = kprobe_emulate_jcc;
+ p->ainsn.jcc.type = opcode & 0xf;
+ if (insn->immediate.nbytes == 2)
+ p->ainsn.rel32 = *(s16 *)&insn->immediate.value;
+ else
+ p->ainsn.rel32 = *(s32 *)&insn->immediate.value;
+ } else if (opcode == 0x01 &&
+ X86_MODRM_REG(insn->modrm.bytes[0]) == 0 &&
+ X86_MODRM_MOD(insn->modrm.bytes[0]) == 3) {
+ /* VM extensions - not supported */
+ return -EOPNOTSUPP;
+ }
+ break;
+ case 0xe0: /* Loop NZ */
+ case 0xe1: /* Loop */
+ case 0xe2: /* Loop */
+ case 0xe3: /* J*CXZ */
+ p->ainsn.emulate_op = kprobe_emulate_loop;
+ p->ainsn.loop.type = opcode & 0x3;
+ p->ainsn.loop.asize = insn->addr_bytes * 8;
+ p->ainsn.rel32 = *(s8 *)&insn->immediate.value;
+ break;
+ case 0xff:
+		/*
+		 * Since 0xff is an extended group opcode, the actual
+		 * instruction is determined by the ModRM byte.
+		 */
+ opcode = insn->modrm.bytes[0];
+ switch (X86_MODRM_REG(opcode)) {
+ case 0b010: /* FF /2, call near, absolute indirect */
+ p->ainsn.emulate_op = kprobe_emulate_call_indirect;
+ break;
+ case 0b100: /* FF /4, jmp near, absolute indirect */
+ p->ainsn.emulate_op = kprobe_emulate_jmp_indirect;
+ break;
+ case 0b011: /* FF /3, call far, absolute indirect */
+ case 0b101: /* FF /5, jmp far, absolute indirect */
+ return -EOPNOTSUPP;
+ }
+
+ if (!p->ainsn.emulate_op)
+ break;
+
+ if (insn->addr_bytes != sizeof(unsigned long))
+ return -EOPNOTSUPP; /* Don't support different size */
+ if (X86_MODRM_MOD(opcode) != 3)
+ return -EOPNOTSUPP; /* TODO: support memory addressing */
+
+ p->ainsn.indirect.reg = X86_MODRM_RM(opcode);
+#ifdef CONFIG_X86_64
+ if (X86_REX_B(insn->rex_prefix.value))
+ p->ainsn.indirect.reg += 8;
+#endif
+ break;
+ default:
+ break;
+ }
+ p->ainsn.size = insn->length;
+
+ return 0;
+}
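
The decoding of the 0xff group above hinges on the ModRM byte layout: mod in bits 7-6, reg in bits 5-3, rm in bits 2-0. A stand-alone sketch of the field extraction (the macro names here are local stand-ins for the kernel's X86_MODRM_* helpers):

#include <stdio.h>
#include <stdint.h>

/* ModRM byte layout: mod[7:6] reg[5:3] rm[2:0]. */
#define MODRM_MOD(b)	(((b) >> 6) & 0x3)
#define MODRM_REG(b)	(((b) >> 3) & 0x7)
#define MODRM_RM(b)	((b) & 0x7)

int main(void)
{
	uint8_t modrm = 0xd2;	/* e.g. ff d2 = call *%rdx */

	printf("mod=%d reg=%d rm=%d\n",
	       MODRM_MOD(modrm), MODRM_REG(modrm), MODRM_RM(modrm));
	/* reg == 2 selects "FF /2", the near indirect call handled above */
	return 0;
}
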
+
+static int arch_copy_kprobe(struct kprobe *p)
+{
+ struct insn insn;
+ kprobe_opcode_t buf[MAX_INSN_SIZE];
+ int ret, len;
+
+ /* Copy an instruction with recovering if other optprobe modifies it.*/
+ len = __copy_instruction(buf, p->addr, p->ainsn.insn, &insn);
+ if (!len)
+ return -EINVAL;
+
+ /* Analyze the opcode and setup emulate functions */
+ ret = prepare_emulation(p, &insn);
+ if (ret < 0)
+ return ret;
+
+ /* Add int3 for single-step or booster jmp */
+ len = prepare_singlestep(buf, p, &insn);
+ if (len < 0)
+ return len;
+
+ /* Also, displacement change doesn't affect the first byte */
+ p->opcode = buf[0];
+
+ p->ainsn.tp_len = len;
+ perf_event_text_poke(p->ainsn.insn, NULL, 0, buf, len);
+
+ /* OK, write back the instruction(s) into ROX insn buffer */
+ text_poke(p->ainsn.insn, buf, len);
+
+ return 0;
+}
+
+int arch_prepare_kprobe(struct kprobe *p)
+{
+ int ret;
+
+ if (alternatives_text_reserved(p->addr, p->addr))
+ return -EINVAL;
+
+ if (!can_probe((unsigned long)p->addr))
+ return -EILSEQ;
+
+ memset(&p->ainsn, 0, sizeof(p->ainsn));
+
+ /* insn: must be on special executable page on x86. */
+ p->ainsn.insn = get_insn_slot();
+ if (!p->ainsn.insn)
+ return -ENOMEM;
+
+ ret = arch_copy_kprobe(p);
+ if (ret) {
+ free_insn_slot(p->ainsn.insn, 0);
+ p->ainsn.insn = NULL;
+ }
+
+ return ret;
+}
+
+void arch_arm_kprobe(struct kprobe *p)
+{
+ u8 int3 = INT3_INSN_OPCODE;
+
+ text_poke(p->addr, &int3, 1);
+ smp_text_poke_sync_each_cpu();
+ perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
+}
+
+void arch_disarm_kprobe(struct kprobe *p)
+{
+ u8 int3 = INT3_INSN_OPCODE;
+
+ perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
+ text_poke(p->addr, &p->opcode, 1);
+ smp_text_poke_sync_each_cpu();
+}
+
+void arch_remove_kprobe(struct kprobe *p)
+{
+ if (p->ainsn.insn) {
+ /* Record the perf event before freeing the slot */
+ perf_event_text_poke(p->ainsn.insn, p->ainsn.insn,
+ p->ainsn.tp_len, NULL, 0);
+ free_insn_slot(p->ainsn.insn, p->ainsn.boostable);
+ p->ainsn.insn = NULL;
+ }
+}
+
+static nokprobe_inline void
+save_previous_kprobe(struct kprobe_ctlblk *kcb)
+{
+ kcb->prev_kprobe.kp = kprobe_running();
+ kcb->prev_kprobe.status = kcb->kprobe_status;
+ kcb->prev_kprobe.old_flags = kcb->kprobe_old_flags;
+ kcb->prev_kprobe.saved_flags = kcb->kprobe_saved_flags;
+}
+
+static nokprobe_inline void
+restore_previous_kprobe(struct kprobe_ctlblk *kcb)
+{
+ __this_cpu_write(current_kprobe, kcb->prev_kprobe.kp);
+ kcb->kprobe_status = kcb->prev_kprobe.status;
+ kcb->kprobe_old_flags = kcb->prev_kprobe.old_flags;
+ kcb->kprobe_saved_flags = kcb->prev_kprobe.saved_flags;
+}
+
+static nokprobe_inline void
+set_current_kprobe(struct kprobe *p, struct pt_regs *regs,
+ struct kprobe_ctlblk *kcb)
+{
+ __this_cpu_write(current_kprobe, p);
+ kcb->kprobe_saved_flags = kcb->kprobe_old_flags
+ = (regs->flags & X86_EFLAGS_IF);
+}
+
+static void kprobe_post_process(struct kprobe *cur, struct pt_regs *regs,
+ struct kprobe_ctlblk *kcb)
+{
+ /* Restore back the original saved kprobes variables and continue. */
+ if (kcb->kprobe_status == KPROBE_REENTER) {
+ /* This will restore both kcb and current_kprobe */
+ restore_previous_kprobe(kcb);
+ } else {
+		/*
+		 * Always update the kcb status because
+		 * reset_current_kprobe() doesn't update kcb.
+		 */
+ kcb->kprobe_status = KPROBE_HIT_SSDONE;
+ if (cur->post_handler)
+ cur->post_handler(cur, regs, 0);
+ reset_current_kprobe();
+ }
+}
+NOKPROBE_SYMBOL(kprobe_post_process);
+
+static void setup_singlestep(struct kprobe *p, struct pt_regs *regs,
+ struct kprobe_ctlblk *kcb, int reenter)
+{
+ if (setup_detour_execution(p, regs, reenter))
+ return;
+
+#if !defined(CONFIG_PREEMPTION)
+ if (p->ainsn.boostable) {
+ /* Boost up -- we can execute copied instructions directly */
+ if (!reenter)
+ reset_current_kprobe();
+ /*
+ * Reentering boosted probe doesn't reset current_kprobe,
+ * nor set current_kprobe, because it doesn't use single
+ * stepping.
+ */
+ regs->ip = (unsigned long)p->ainsn.insn;
+ return;
+ }
+#endif
+ if (reenter) {
+ save_previous_kprobe(kcb);
+ set_current_kprobe(p, regs, kcb);
+ kcb->kprobe_status = KPROBE_REENTER;
+ } else
+ kcb->kprobe_status = KPROBE_HIT_SS;
+
+ if (p->ainsn.emulate_op) {
+ p->ainsn.emulate_op(p, regs);
+ kprobe_post_process(p, regs, kcb);
+ return;
+ }
+
+ /* Disable interrupt, and set ip register on trampoline */
+ regs->flags &= ~X86_EFLAGS_IF;
+ regs->ip = (unsigned long)p->ainsn.insn;
+}
+NOKPROBE_SYMBOL(setup_singlestep);
+
+/*
+ * Called after single-stepping. p->addr is the address of the
+ * instruction whose first byte has been replaced by the "int3"
+ * instruction. To avoid the SMP problems that can occur when we
+ * temporarily put back the original opcode to single-step, we
+ * single-stepped a copy of the instruction. The address of this
+ * copy is p->ainsn.insn. Instead of the trap flag, we use another
+ * "int3" placed right after the copied instruction.
+ * Unlike trap-based single-stepping, "int3" single-stepping can not
+ * handle instructions that change the ip register, e.g. jmp, call,
+ * and conditional jmp, nor instructions that change the IF flag,
+ * because interrupts must stay disabled around the single-stepping.
+ * Such instructions are emulated in software; the others are
+ * single-stepped using "int3".
+ *
+ * When the second "int3" is handled, regs->ip and regs->flags need
+ * to be adjusted so that we can resume execution at the correct code.
+ */
+static void resume_singlestep(struct kprobe *p, struct pt_regs *regs,
+ struct kprobe_ctlblk *kcb)
+{
+ unsigned long copy_ip = (unsigned long)p->ainsn.insn;
+ unsigned long orig_ip = (unsigned long)p->addr;
+
+ /* Restore saved interrupt flag and ip register */
+ regs->flags |= kcb->kprobe_saved_flags;
+	/* Note that regs->ip points past the executed int3, so step back */
+ regs->ip += (orig_ip - copy_ip) - INT3_INSN_SIZE;
+}
+NOKPROBE_SYMBOL(resume_singlestep);
+
+/*
+ * We have reentered the kprobe_handler(), since another probe was hit while
+ * within the handler. We save the original kprobes variables and just single
+ * step on the instruction of the new probe without calling any user handlers.
+ */
+static int reenter_kprobe(struct kprobe *p, struct pt_regs *regs,
+ struct kprobe_ctlblk *kcb)
+{
+ switch (kcb->kprobe_status) {
+ case KPROBE_HIT_SSDONE:
+ case KPROBE_HIT_ACTIVE:
+ case KPROBE_HIT_SS:
+ kprobes_inc_nmissed_count(p);
+ setup_singlestep(p, regs, kcb, 1);
+ break;
+ case KPROBE_REENTER:
+ /* A probe has been hit in the codepath leading up to, or just
+ * after, single-stepping of a probed instruction. This entire
+ * codepath should strictly reside in .kprobes.text section.
+ * Raise a BUG or we'll continue in an endless reentering loop
+ * and eventually a stack overflow.
+ */
+ pr_err("Unrecoverable kprobe detected.\n");
+ dump_kprobe(p);
+ BUG();
+ default:
+ /* impossible cases */
+ WARN_ON(1);
+ return 0;
+ }
+
+ return 1;
+}
+NOKPROBE_SYMBOL(reenter_kprobe);
+
+static nokprobe_inline int kprobe_is_ss(struct kprobe_ctlblk *kcb)
+{
+ return (kcb->kprobe_status == KPROBE_HIT_SS ||
+ kcb->kprobe_status == KPROBE_REENTER);
+}
+
+/*
+ * Interrupts are disabled on entry as trap3 is an interrupt gate and they
+ * remain disabled throughout this function.
+ */
+int kprobe_int3_handler(struct pt_regs *regs)
+{
+ kprobe_opcode_t *addr;
+ struct kprobe *p;
+ struct kprobe_ctlblk *kcb;
+
+ if (user_mode(regs))
+ return 0;
+
+ addr = (kprobe_opcode_t *)(regs->ip - sizeof(kprobe_opcode_t));
+	/*
+	 * We don't want to be preempted for the entire duration of kprobe
+	 * processing. Since int3 and the debug trap disable irqs and we
+	 * clear IF while single-stepping, this code must not be preemptible.
+	 */
+
+ kcb = get_kprobe_ctlblk();
+ p = get_kprobe(addr);
+
+ if (p) {
+ if (kprobe_running()) {
+ if (reenter_kprobe(p, regs, kcb))
+ return 1;
+ } else {
+ set_current_kprobe(p, regs, kcb);
+ kcb->kprobe_status = KPROBE_HIT_ACTIVE;
+
+ /*
+ * If we have no pre-handler or it returned 0, we
+ * continue with normal processing. If we have a
+ * pre-handler and it returned non-zero, the user
+ * handler set up the registers to exit to another
+ * instruction, so we must skip the single-stepping.
+ */
+ if (!p->pre_handler || !p->pre_handler(p, regs))
+ setup_singlestep(p, regs, kcb, 0);
+ else
+ reset_current_kprobe();
+ return 1;
+ }
+ } else if (kprobe_is_ss(kcb)) {
+ p = kprobe_running();
+ if ((unsigned long)p->ainsn.insn < regs->ip &&
+ (unsigned long)p->ainsn.insn + MAX_INSN_SIZE > regs->ip) {
+ /* Most probably this is the second int3 of the single-step */
+ resume_singlestep(p, regs, kcb);
+ kprobe_post_process(p, regs, kcb);
+ return 1;
+ }
+ } /* else: not a kprobe fault; let the kernel handle it */
+
+ return 0;
+}
+NOKPROBE_SYMBOL(kprobe_int3_handler);
+
+int kprobe_fault_handler(struct pt_regs *regs, int trapnr)
+{
+ struct kprobe *cur = kprobe_running();
+ struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
+
+ if (unlikely(regs->ip == (unsigned long)cur->ainsn.insn)) {
+ /* This must happen on single-stepping */
+ WARN_ON(kcb->kprobe_status != KPROBE_HIT_SS &&
+ kcb->kprobe_status != KPROBE_REENTER);
+ /*
+ * We are here because the instruction being single
+ * stepped caused a page fault. We reset the current
+ * kprobe and the ip points back to the probe address
+ * and allow the page fault handler to continue as a
+ * normal page fault.
+ */
+ regs->ip = (unsigned long)cur->addr;
+
+ /*
+ * If the IF flag was set before the kprobe hit,
+ * don't touch it:
+ */
+ regs->flags |= kcb->kprobe_old_flags;
+
+ if (kcb->kprobe_status == KPROBE_REENTER)
+ restore_previous_kprobe(kcb);
+ else
+ reset_current_kprobe();
+ }
+
+ return 0;
+}
+NOKPROBE_SYMBOL(kprobe_fault_handler);
+
+int __init arch_populate_kprobe_blacklist(void)
+{
+ return kprobe_add_area_blacklist((unsigned long)__entry_text_start,
+ (unsigned long)__entry_text_end);
+}
+
+int __init arch_init_kprobes(void)
+{
+ return 0;
+}
+
+int arch_trampoline_kprobe(struct kprobe *p)
+{
+ return 0;
+}
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
new file mode 100644
index 000000000000..2be55ec3f392
--- /dev/null
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Dynamic Ftrace based Kprobes Optimization
+ *
+ * Copyright (C) Hitachi Ltd., 2012
+ */
+#include <linux/kprobes.h>
+#include <linux/ptrace.h>
+#include <linux/hardirq.h>
+#include <linux/preempt.h>
+#include <linux/ftrace.h>
+#include <asm/text-patching.h>
+
+#include "common.h"
+
+/* Ftrace callback handler for kprobes -- called with preemption disabled */
+void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
+ struct ftrace_ops *ops, struct ftrace_regs *fregs)
+{
+ struct pt_regs *regs = ftrace_get_regs(fregs);
+ struct kprobe *p;
+ struct kprobe_ctlblk *kcb;
+ int bit;
+
+ if (unlikely(kprobe_ftrace_disabled))
+ return;
+
+ bit = ftrace_test_recursion_trylock(ip, parent_ip);
+ if (bit < 0)
+ return;
+
+ p = get_kprobe((kprobe_opcode_t *)ip);
+ if (unlikely(!p) || kprobe_disabled(p))
+ goto out;
+
+ kcb = get_kprobe_ctlblk();
+ if (kprobe_running()) {
+ kprobes_inc_nmissed_count(p);
+ } else {
+ unsigned long orig_ip = instruction_pointer(regs);
+
+ /* Kprobe handlers expect regs->ip = ip + 1, as if a breakpoint had been hit */
+ instruction_pointer_set(regs, ip + INT3_INSN_SIZE);
+
+ __this_cpu_write(current_kprobe, p);
+ kcb->kprobe_status = KPROBE_HIT_ACTIVE;
+ if (!p->pre_handler || !p->pre_handler(p, regs)) {
+ if (unlikely(p->post_handler)) {
+ /*
+ * Emulate single-stepping (and also recover regs->ip)
+ * as if there were a 5-byte NOP
+ */
+ instruction_pointer_set(regs, ip + MCOUNT_INSN_SIZE);
+ kcb->kprobe_status = KPROBE_HIT_SSDONE;
+ p->post_handler(p, regs, 0);
+ }
+ /* Recover IP address */
+ instruction_pointer_set(regs, orig_ip);
+ }
+ /*
+ * If the pre_handler returned non-zero, it changed regs->ip, so
+ * we have to skip emulating the post_handler.
+ */
+ __this_cpu_write(current_kprobe, NULL);
+ }
+out:
+ ftrace_test_recursion_unlock(bit);
+}
+NOKPROBE_SYMBOL(kprobe_ftrace_handler);
+
+int arch_prepare_kprobe_ftrace(struct kprobe *p)
+{
+ p->ainsn.insn = NULL;
+ p->ainsn.boostable = false;
+ return 0;
+}
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
new file mode 100644
index 000000000000..6f826a00eca2
--- /dev/null
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -0,0 +1,551 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Kernel Probes Jump Optimization (Optprobes)
+ *
+ * Copyright (C) IBM Corporation, 2002, 2004
+ * Copyright (C) Hitachi Ltd., 2012
+ */
+#include <linux/kprobes.h>
+#include <linux/perf_event.h>
+#include <linux/ptrace.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/hardirq.h>
+#include <linux/preempt.h>
+#include <linux/extable.h>
+#include <linux/kdebug.h>
+#include <linux/kallsyms.h>
+#include <linux/kgdb.h>
+#include <linux/ftrace.h>
+#include <linux/objtool.h>
+#include <linux/pgtable.h>
+#include <linux/static_call.h>
+
+#include <asm/text-patching.h>
+#include <asm/cacheflush.h>
+#include <asm/desc.h>
+#include <linux/uaccess.h>
+#include <asm/alternative.h>
+#include <asm/insn.h>
+#include <asm/debugreg.h>
+#include <asm/set_memory.h>
+#include <asm/sections.h>
+#include <asm/nospec-branch.h>
+
+#include "common.h"
+
+unsigned long __recover_optprobed_insn(kprobe_opcode_t *buf, unsigned long addr)
+{
+ struct optimized_kprobe *op;
+ struct kprobe *kp;
+ long offs;
+ int i;
+
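+ /*
+ * Scan back over a JMP32-sized window: addr may point into the middle
+ * of the 5-byte jump that an optprobe wrote over the original
+ * instructions, in which case the covering kprobe starts a few bytes
+ * earlier.
+ */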
+ for (i = 0; i < JMP32_INSN_SIZE; i++) {
+ kp = get_kprobe((void *)addr - i);
+ /* This function only handles jump-optimized kprobes */
+ if (kp && kprobe_optimized(kp)) {
+ op = container_of(kp, struct optimized_kprobe, kp);
+ /* If op is optimized or under unoptimizing */
+ if (list_empty(&op->list) || optprobe_queued_unopt(op))
+ goto found;
+ }
+ }
+
+ return addr;
+found:
+ /*
+ * If the kprobe is optimized, its original bytes may have been
+ * overwritten by the jump destination address. In this case, the
+ * original bytes must be recovered from the op->optinsn.copied_insn
+ * buffer.
+ */
+ if (copy_from_kernel_nofault(buf, (void *)addr,
+ MAX_INSN_SIZE * sizeof(kprobe_opcode_t)))
+ return 0UL;
+
+ if (addr == (unsigned long)kp->addr) {
+ buf[0] = kp->opcode;
+ memcpy(buf + 1, op->optinsn.copied_insn, DISP32_SIZE);
+ } else {
+ offs = addr - (unsigned long)kp->addr - 1;
+ memcpy(buf, op->optinsn.copied_insn + offs, DISP32_SIZE - offs);
+ }
+
+ return (unsigned long)buf;
+}
+
+static void synthesize_clac(kprobe_opcode_t *addr)
+{
+ /*
+ * Can't be static_cpu_has() due to how objtool treats this feature bit.
+ * This isn't a fast path anyway.
+ */
+ if (!boot_cpu_has(X86_FEATURE_SMAP))
+ return;
+
+ /* Replace the NOP3 with CLAC */
+ addr[0] = 0x0f;
+ addr[1] = 0x01;
+ addr[2] = 0xca;
+}
+
+/* Insert a move instruction which sets a pointer to eax/rdi (1st arg). */
+static void synthesize_set_arg1(kprobe_opcode_t *addr, unsigned long val)
+{
+#ifdef CONFIG_X86_64
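+ /* movabs $val, %rdi: REX.W prefix (0x48) + opcode 0xbf, imm64 follows */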
+ *addr++ = 0x48;
+ *addr++ = 0xbf;
+#else
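+ /* mov $val, %eax: opcode 0xb8, imm32 follows */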
+ *addr++ = 0xb8;
+#endif
+ *(unsigned long *)addr = val;
+}
+
+asm (
+ ".pushsection .rodata\n"
+ ".global optprobe_template_entry\n"
+ "optprobe_template_entry:\n"
+#ifdef CONFIG_X86_64
+ " pushq $" __stringify(__KERNEL_DS) "\n"
+ /* Save the 'sp - 8', this will be fixed later. */
+ " pushq %rsp\n"
+ " pushfq\n"
+ ".global optprobe_template_clac\n"
+ "optprobe_template_clac:\n"
+ ASM_NOP3
+ SAVE_REGS_STRING
+ " movq %rsp, %rsi\n"
+ ".global optprobe_template_val\n"
+ "optprobe_template_val:\n"
+ ASM_NOP5
+ ASM_NOP5
+ ".global optprobe_template_call\n"
+ "optprobe_template_call:\n"
+ ASM_NOP5
+ /* Copy 'regs->flags' into 'regs->ss'. */
+ " movq 18*8(%rsp), %rdx\n"
+ " movq %rdx, 20*8(%rsp)\n"
+ RESTORE_REGS_STRING
+ /* Skip 'regs->flags' and 'regs->sp'. */
+ " addq $16, %rsp\n"
+ /* And pop flags register from 'regs->ss'. */
+ " popfq\n"
+#else /* CONFIG_X86_32 */
+ " pushl %ss\n"
+ /* Save the 'sp - 4', this will be fixed later. */
+ " pushl %esp\n"
+ " pushfl\n"
+ ".global optprobe_template_clac\n"
+ "optprobe_template_clac:\n"
+ ASM_NOP3
+ SAVE_REGS_STRING
+ " movl %esp, %edx\n"
+ ".global optprobe_template_val\n"
+ "optprobe_template_val:\n"
+ ASM_NOP5
+ ".global optprobe_template_call\n"
+ "optprobe_template_call:\n"
+ ASM_NOP5
+ /* Copy 'regs->flags' into 'regs->ss'. */
+ " movl 14*4(%esp), %edx\n"
+ " movl %edx, 16*4(%esp)\n"
+ RESTORE_REGS_STRING
+ /* Skip 'regs->flags' and 'regs->sp'. */
+ " addl $8, %esp\n"
+ /* And pop flags register from 'regs->ss'. */
+ " popfl\n"
+#endif
+ ".global optprobe_template_end\n"
+ "optprobe_template_end:\n"
+ ".popsection\n");
+
+#define TMPL_CLAC_IDX \
+ ((long)optprobe_template_clac - (long)optprobe_template_entry)
+#define TMPL_MOVE_IDX \
+ ((long)optprobe_template_val - (long)optprobe_template_entry)
+#define TMPL_CALL_IDX \
+ ((long)optprobe_template_call - (long)optprobe_template_entry)
+#define TMPL_END_IDX \
+ ((long)optprobe_template_end - (long)optprobe_template_entry)
+
+/* Optimized kprobe call back function: called from optinsn */
+static void
+optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs)
+{
+ /* This is possible if op is under delayed unoptimization */
+ if (kprobe_disabled(&op->kp))
+ return;
+
+ preempt_disable();
+ if (kprobe_running()) {
+ kprobes_inc_nmissed_count(&op->kp);
+ } else {
+ struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
+ /* Adjust stack pointer */
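+ /* (undo the 'sp - sizeof(long)' saved by optprobe_template_entry) */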
+ regs->sp += sizeof(long);
+ /* Save skipped registers */
+ regs->cs = __KERNEL_CS;
+#ifdef CONFIG_X86_32
+ regs->gs = 0;
+#endif
+ regs->ip = (unsigned long)op->kp.addr + INT3_INSN_SIZE;
+ regs->orig_ax = ~0UL;
+
+ __this_cpu_write(current_kprobe, &op->kp);
+ kcb->kprobe_status = KPROBE_HIT_ACTIVE;
+ opt_pre_handler(&op->kp, regs);
+ __this_cpu_write(current_kprobe, NULL);
+ }
+ preempt_enable();
+}
+NOKPROBE_SYMBOL(optimized_callback);
+
+static int copy_optimized_instructions(u8 *dest, u8 *src, u8 *real)
+{
+ struct insn insn;
+ int len = 0, ret;
+
+ while (len < JMP32_INSN_SIZE) {
+ ret = __copy_instruction(dest + len, src + len, real + len, &insn);
+ if (!ret || !can_boost(&insn, src + len))
+ return -EINVAL;
+ len += ret;
+ }
+ /* Check whether the address range is reserved */
+ if (ftrace_text_reserved(src, src + len - 1) ||
+ alternatives_text_reserved(src, src + len - 1) ||
+ jump_label_text_reserved(src, src + len - 1) ||
+ static_call_text_reserved(src, src + len - 1))
+ return -EBUSY;
+
+ return len;
+}
+
+/* Check whether insn is indirect jump */
+static int insn_is_indirect_jump(struct insn *insn)
+{
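+ /* Opcode 0xff with ModRM reg field 4 (near) or 5 (far); '& 6' matches both. */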
+ return ((insn->opcode.bytes[0] == 0xff &&
+ (X86_MODRM_REG(insn->modrm.value) & 6) == 4) || /* Jump */
+ insn->opcode.bytes[0] == 0xea); /* Segment based jump */
+}
+
+/* Check whether insn jumps into specified address range */
+static int insn_jump_into_range(struct insn *insn, unsigned long start, int len)
+{
+ unsigned long target = 0;
+
+ switch (insn->opcode.bytes[0]) {
+ case 0xe0: /* loopne */
+ case 0xe1: /* loope */
+ case 0xe2: /* loop */
+ case 0xe3: /* jcxz */
+ case 0xe9: /* near relative jump */
+ case 0xeb: /* short relative jump */
+ break;
+ case 0x0f:
+ if ((insn->opcode.bytes[1] & 0xf0) == 0x80) /* jcc near */
+ break;
+ return 0;
+ default:
+ if ((insn->opcode.bytes[0] & 0xf0) == 0x70) /* jcc short */
+ break;
+ return 0;
+ }
+ target = (unsigned long)insn->next_byte + insn->immediate.value;
+
+ return (start <= target && target <= start + len);
+}
+
+/* Decode the whole function to ensure no instruction jumps into the target */
+static int can_optimize(unsigned long paddr)
+{
+ unsigned long addr, size = 0, offset = 0;
+ struct insn insn;
+ kprobe_opcode_t buf[MAX_INSN_SIZE];
+
+ /* Lookup symbol including addr */
+ if (!kallsyms_lookup_size_offset(paddr, &size, &offset))
+ return 0;
+
+ /*
+ * Do not optimize in the entry code due to the unstable
+ * stack handling and register setup.
+ */
+ if (((paddr >= (unsigned long)__entry_text_start) &&
+ (paddr < (unsigned long)__entry_text_end)))
+ return 0;
+
+ /* Check there is enough space for a relative jump. */
+ if (size - offset < JMP32_INSN_SIZE)
+ return 0;
+
+ /* Decode instructions */
+ addr = paddr - offset;
+ while (addr < paddr - offset + size) { /* Decode until function end */
+ unsigned long recovered_insn;
+ int ret;
+
+ if (search_exception_tables(addr))
+ /*
+ * Since some fixup code will jump into this function,
+ * we can't optimize kprobes in this function.
+ */
+ return 0;
+ recovered_insn = recover_probed_instruction(buf, addr);
+ if (!recovered_insn)
+ return 0;
+
+ ret = insn_decode_kernel(&insn, (void *)recovered_insn);
+ if (ret < 0)
+ return 0;
+#ifdef CONFIG_KGDB
+ /*
+ * If there is a dynamically installed kgdb sw breakpoint,
+ * this function should not be probed.
+ */
+ if (insn.opcode.bytes[0] == INT3_INSN_OPCODE &&
+ kgdb_has_hit_break(addr))
+ return 0;
+#endif
+ /* Recover address */
+ insn.kaddr = (void *)addr;
+ insn.next_byte = (void *)(addr + insn.length);
+ /*
+ * Check that no instructions jump into the target, indirectly
+ * or directly.
+ *
+ * The indirect case is present to handle code with jump
+ * tables. When the kernel uses retpolines, the check should in
+ * theory additionally look for jumps to indirect thunks.
+ * However, the kernel built with retpolines or IBT has jump
+ * tables disabled so the check can be skipped altogether.
+ */
+ if (!IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) &&
+ !IS_ENABLED(CONFIG_X86_KERNEL_IBT) &&
+ insn_is_indirect_jump(&insn))
+ return 0;
+ if (insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
+ DISP32_SIZE))
+ return 0;
+ addr += insn.length;
+ }
+
+ return 1;
+}
+
+/* Check whether the optimized_kprobe can actually be optimized. */
+int arch_check_optimized_kprobe(struct optimized_kprobe *op)
+{
+ int i;
+ struct kprobe *p;
+
+ for (i = 1; i < op->optinsn.size; i++) {
+ p = get_kprobe(op->kp.addr + i);
+ if (p && !kprobe_disarmed(p))
+ return -EEXIST;
+ }
+
+ return 0;
+}
+
+/* Check whether addr is within the optimized instructions. */
+int arch_within_optimized_kprobe(struct optimized_kprobe *op,
+ kprobe_opcode_t *addr)
+{
+ return (op->kp.addr <= addr &&
+ op->kp.addr + op->optinsn.size > addr);
+}
+
+/* Free optimized instruction slot */
+static
+void __arch_remove_optimized_kprobe(struct optimized_kprobe *op, int dirty)
+{
+ u8 *slot = op->optinsn.insn;
+ if (slot) {
+ int len = TMPL_END_IDX + op->optinsn.size + JMP32_INSN_SIZE;
+
+ /* Record the perf event before freeing the slot */
+ if (dirty)
+ perf_event_text_poke(slot, slot, len, NULL, 0);
+
+ free_optinsn_slot(slot, dirty);
+ op->optinsn.insn = NULL;
+ op->optinsn.size = 0;
+ }
+}
+
+void arch_remove_optimized_kprobe(struct optimized_kprobe *op)
+{
+ __arch_remove_optimized_kprobe(op, 1);
+}
+
+/*
+ * Copy the instructions that will be replaced by the jump.
+ * Target instructions MUST be relocatable (checked inside).
+ * This is called when a new aggr(opt)probe is allocated or reused.
+ */
+int arch_prepare_optimized_kprobe(struct optimized_kprobe *op,
+ struct kprobe *__unused)
+{
+ u8 *buf = NULL, *slot;
+ int ret, len;
+ long rel;
+
+ if (!can_optimize((unsigned long)op->kp.addr))
+ return -EILSEQ;
+
+ buf = kzalloc(MAX_OPTINSN_SIZE, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ op->optinsn.insn = slot = get_optinsn_slot();
+ if (!slot) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /*
+ * Verify that the address gap is within the 2GB range, because
+ * this uses a relative jump.
+ */
+ rel = (long)slot - (long)op->kp.addr + JMP32_INSN_SIZE;
+ if (abs(rel) > 0x7fffffff) {
+ ret = -ERANGE;
+ goto err;
+ }
+
+ /* Copy arch-dep-instance from template */
+ memcpy(buf, optprobe_template_entry, TMPL_END_IDX);
+
+ /* Copy instructions into the out-of-line buffer */
+ ret = copy_optimized_instructions(buf + TMPL_END_IDX, op->kp.addr,
+ slot + TMPL_END_IDX);
+ if (ret < 0)
+ goto err;
+ op->optinsn.size = ret;
+ len = TMPL_END_IDX + op->optinsn.size;
+
+ synthesize_clac(buf + TMPL_CLAC_IDX);
+
+ /* Set probe information */
+ synthesize_set_arg1(buf + TMPL_MOVE_IDX, (unsigned long)op);
+
+ /* Set probe function call */
+ synthesize_relcall(buf + TMPL_CALL_IDX,
+ slot + TMPL_CALL_IDX, optimized_callback);
+
+ /* Set returning jmp instruction at the tail of out-of-line buffer */
+ synthesize_reljump(buf + len, slot + len,
+ (u8 *)op->kp.addr + op->optinsn.size);
+ len += JMP32_INSN_SIZE;
+
+ /*
+ * Note len = TMPL_END_IDX + op->optinsn.size + JMP32_INSN_SIZE is also
+ * used in __arch_remove_optimized_kprobe().
+ */
+
+ /* We have to use text_poke() for the instruction buffer because it is RO */
+ perf_event_text_poke(slot, NULL, 0, buf, len);
+ text_poke(slot, buf, len);
+
+ ret = 0;
+out:
+ kfree(buf);
+ return ret;
+
+err:
+ __arch_remove_optimized_kprobe(op, 0);
+ goto out;
+}
+
+/*
+ * Replace breakpoints (INT3) with relative jumps (JMP.d32).
+ * Caller must hold kprobe_mutex and text_mutex.
+ *
+ * The caller will have installed a regular kprobe and after that issued
+ * synchronize_rcu_tasks(); this ensures that the instruction(s) that live in
+ * the 4 bytes after the INT3 are unused and can now be overwritten.
+ */
+void arch_optimize_kprobes(struct list_head *oplist)
+{
+ struct optimized_kprobe *op, *tmp;
+ u8 insn_buff[JMP32_INSN_SIZE];
+
+ list_for_each_entry_safe(op, tmp, oplist, list) {
+ s32 rel = (s32)((long)op->optinsn.insn -
+ ((long)op->kp.addr + JMP32_INSN_SIZE));
+
+ WARN_ON(kprobe_disabled(&op->kp));
+
+ /* Backup instructions which will be replaced by jump address */
+ memcpy(op->optinsn.copied_insn, op->kp.addr + INT3_INSN_SIZE,
+ DISP32_SIZE);
+
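+ /* Assemble "jmp.d32": opcode 0xe9 followed by the rel32 displacement. */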
+ insn_buff[0] = JMP32_INSN_OPCODE;
+ *(s32 *)(&insn_buff[1]) = rel;
+
+ smp_text_poke_single(op->kp.addr, insn_buff, JMP32_INSN_SIZE, NULL);
+
+ list_del_init(&op->list);
+ }
+}
+
+/*
+ * Replace a relative jump (JMP.d32) with a breakpoint (INT3).
+ *
+ * After that, we can restore the 4 bytes after the INT3 to undo what
+ * arch_optimize_kprobes() scribbled. This is safe since those bytes will be
+ * unused once the INT3 lands.
+ */
+void arch_unoptimize_kprobe(struct optimized_kprobe *op)
+{
+ u8 new[JMP32_INSN_SIZE] = { INT3_INSN_OPCODE, };
+ u8 old[JMP32_INSN_SIZE];
+ u8 *addr = op->kp.addr;
+
+ memcpy(old, op->kp.addr, JMP32_INSN_SIZE);
+ memcpy(new + INT3_INSN_SIZE,
+ op->optinsn.copied_insn,
+ JMP32_INSN_SIZE - INT3_INSN_SIZE);
+
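+ /*
+ * Two-stage poke: first land the INT3 byte and serialize all CPUs on
+ * it, then restore the remaining four displacement bytes and serialize
+ * again, so that no CPU can observe a half-written instruction.
+ */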
+ text_poke(addr, new, INT3_INSN_SIZE);
+ smp_text_poke_sync_each_cpu();
+ text_poke(addr + INT3_INSN_SIZE,
+ new + INT3_INSN_SIZE,
+ JMP32_INSN_SIZE - INT3_INSN_SIZE);
+ smp_text_poke_sync_each_cpu();
+
+ perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
+}
+
+/*
+ * Recover original instructions and breakpoints from relative jumps.
+ * Caller must hold kprobe_mutex.
+ */
+void arch_unoptimize_kprobes(struct list_head *oplist,
+ struct list_head *done_list)
+{
+ struct optimized_kprobe *op, *tmp;
+
+ list_for_each_entry_safe(op, tmp, oplist, list) {
+ arch_unoptimize_kprobe(op);
+ list_move(&op->list, done_list);
+ }
+}
+
+int setup_detour_execution(struct kprobe *p, struct pt_regs *regs, int reenter)
+{
+ struct optimized_kprobe *op;
+
+ if (p->flags & KPROBE_FLAG_OPTIMIZED) {
+ /* This kprobe is really able to run optimized path. */
+ op = container_of(p, struct optimized_kprobe, kp);
+ /* Detour through copied instructions */
+ regs->ip = (unsigned long)op->optinsn.insn + TMPL_END_IDX;
+ if (!reenter)
+ reset_current_kprobe();
+ return 1;
+ }
+ return 0;
+}
+NOKPROBE_SYMBOL(setup_detour_execution);
diff --git a/arch/x86/kernel/ksysfs.c b/arch/x86/kernel/ksysfs.c
new file mode 100644
index 000000000000..d547de9b3ed8
--- /dev/null
+++ b/arch/x86/kernel/ksysfs.c
@@ -0,0 +1,401 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Architecture specific sysfs attributes in /sys/kernel
+ *
+ * Copyright (C) 2007, Intel Corp.
+ * Huang Ying <ying.huang@intel.com>
+ * Copyright (C) 2013 Red Hat, Inc.
+ * Dave Young <dyoung@redhat.com>
+ */
+
+#include <linux/kobject.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <linux/stat.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+
+#include <asm/setup.h>
+
+static ssize_t version_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sprintf(buf, "0x%04x\n", boot_params.hdr.version);
+}
+
+static struct kobj_attribute boot_params_version_attr = __ATTR_RO(version);
+
+static ssize_t boot_params_data_read(struct file *fp, struct kobject *kobj,
+ const struct bin_attribute *bin_attr,
+ char *buf, loff_t off, size_t count)
+{
+ memcpy(buf, (void *)&boot_params + off, count);
+ return count;
+}
+
+static const struct bin_attribute boot_params_data_attr = {
+ .attr = {
+ .name = "data",
+ .mode = S_IRUGO,
+ },
+ .read = boot_params_data_read,
+ .size = sizeof(boot_params),
+};
+
+static struct attribute *boot_params_version_attrs[] = {
+ &boot_params_version_attr.attr,
+ NULL,
+};
+
+static const struct bin_attribute *const boot_params_data_attrs[] = {
+ &boot_params_data_attr,
+ NULL,
+};
+
+static const struct attribute_group boot_params_attr_group = {
+ .attrs = boot_params_version_attrs,
+ .bin_attrs = boot_params_data_attrs,
+};
+
+static int kobj_to_setup_data_nr(struct kobject *kobj, int *nr)
+{
+ const char *name;
+
+ name = kobject_name(kobj);
+ return kstrtoint(name, 10, nr);
+}
+
+static int get_setup_data_paddr(int nr, u64 *paddr)
+{
+ int i = 0;
+ struct setup_data *data;
+ u64 pa_data = boot_params.hdr.setup_data;
+
+ while (pa_data) {
+ if (nr == i) {
+ *paddr = pa_data;
+ return 0;
+ }
+ data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
+ if (!data)
+ return -ENOMEM;
+
+ pa_data = data->next;
+ memunmap(data);
+ i++;
+ }
+ return -EINVAL;
+}
+
+static int __init get_setup_data_size(int nr, size_t *size)
+{
+ u64 pa_data = boot_params.hdr.setup_data, pa_next;
+ struct setup_indirect *indirect;
+ struct setup_data *data;
+ int i = 0;
+ u32 len;
+
+ while (pa_data) {
+ data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
+ if (!data)
+ return -ENOMEM;
+ pa_next = data->next;
+
+ if (nr == i) {
+ if (data->type == SETUP_INDIRECT) {
+ len = sizeof(*data) + data->len;
+ memunmap(data);
+ data = memremap(pa_data, len, MEMREMAP_WB);
+ if (!data)
+ return -ENOMEM;
+
+ indirect = (struct setup_indirect *)data->data;
+
+ if (indirect->type != SETUP_INDIRECT)
+ *size = indirect->len;
+ else
+ *size = data->len;
+ } else {
+ *size = data->len;
+ }
+
+ memunmap(data);
+ return 0;
+ }
+
+ pa_data = pa_next;
+ memunmap(data);
+ i++;
+ }
+ return -EINVAL;
+}
+
+static ssize_t type_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ struct setup_indirect *indirect;
+ struct setup_data *data;
+ int nr, ret;
+ u64 paddr;
+ u32 len;
+
+ ret = kobj_to_setup_data_nr(kobj, &nr);
+ if (ret)
+ return ret;
+
+ ret = get_setup_data_paddr(nr, &paddr);
+ if (ret)
+ return ret;
+ data = memremap(paddr, sizeof(*data), MEMREMAP_WB);
+ if (!data)
+ return -ENOMEM;
+
+ if (data->type == SETUP_INDIRECT) {
+ len = sizeof(*data) + data->len;
+ memunmap(data);
+ data = memremap(paddr, len, MEMREMAP_WB);
+ if (!data)
+ return -ENOMEM;
+
+ indirect = (struct setup_indirect *)data->data;
+
+ ret = sprintf(buf, "0x%x\n", indirect->type);
+ } else {
+ ret = sprintf(buf, "0x%x\n", data->type);
+ }
+
+ memunmap(data);
+ return ret;
+}
+
+static ssize_t setup_data_data_read(struct file *fp,
+ struct kobject *kobj,
+ const struct bin_attribute *bin_attr,
+ char *buf,
+ loff_t off, size_t count)
+{
+ struct setup_indirect *indirect;
+ struct setup_data *data;
+ int nr, ret = 0;
+ u64 paddr, len;
+ void *p;
+
+ ret = kobj_to_setup_data_nr(kobj, &nr);
+ if (ret)
+ return ret;
+
+ ret = get_setup_data_paddr(nr, &paddr);
+ if (ret)
+ return ret;
+ data = memremap(paddr, sizeof(*data), MEMREMAP_WB);
+ if (!data)
+ return -ENOMEM;
+
+ if (data->type == SETUP_INDIRECT) {
+ len = sizeof(*data) + data->len;
+ memunmap(data);
+ data = memremap(paddr, len, MEMREMAP_WB);
+ if (!data)
+ return -ENOMEM;
+
+ indirect = (struct setup_indirect *)data->data;
+
+ if (indirect->type != SETUP_INDIRECT) {
+ paddr = indirect->addr;
+ len = indirect->len;
+ } else {
+ /*
+ * Even though this is technically undefined, return
+ * the data as though it is a normal setup_data struct.
+ * This will at least allow it to be inspected.
+ */
+ paddr += sizeof(*data);
+ len = data->len;
+ }
+ } else {
+ paddr += sizeof(*data);
+ len = data->len;
+ }
+
+ if (off > len) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (count > len - off)
+ count = len - off;
+
+ if (!count)
+ goto out;
+
+ ret = count;
+ p = memremap(paddr, len, MEMREMAP_WB);
+ if (!p) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ memcpy(buf, p + off, count);
+ memunmap(p);
+out:
+ memunmap(data);
+ return ret;
+}
+
+static struct kobj_attribute type_attr = __ATTR_RO(type);
+
+static struct bin_attribute data_attr __ro_after_init = {
+ .attr = {
+ .name = "data",
+ .mode = S_IRUGO,
+ },
+ .read = setup_data_data_read,
+};
+
+static struct attribute *setup_data_type_attrs[] = {
+ &type_attr.attr,
+ NULL,
+};
+
+static const struct bin_attribute *const setup_data_data_attrs[] = {
+ &data_attr,
+ NULL,
+};
+
+static const struct attribute_group setup_data_attr_group = {
+ .attrs = setup_data_type_attrs,
+ .bin_attrs = setup_data_data_attrs,
+};
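+
+/*
+ * Each node is exposed as /sys/kernel/boot_params/setup_data/<nr>/ with a
+ * "type" attribute and a raw, read-only "data" blob.
+ */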
+
+static int __init create_setup_data_node(struct kobject *parent,
+ struct kobject **kobjp, int nr)
+{
+ int ret = 0;
+ size_t size;
+ struct kobject *kobj;
+ char name[16]; /* should be enough for setup_data node numbers */
+
+ snprintf(name, 16, "%d", nr);
+
+ kobj = kobject_create_and_add(name, parent);
+ if (!kobj)
+ return -ENOMEM;
+
+ ret = get_setup_data_size(nr, &size);
+ if (ret)
+ goto out_kobj;
+
+ data_attr.size = size;
+ ret = sysfs_create_group(kobj, &setup_data_attr_group);
+ if (ret)
+ goto out_kobj;
+ *kobjp = kobj;
+
+ return 0;
+out_kobj:
+ kobject_put(kobj);
+ return ret;
+}
+
+static void __init cleanup_setup_data_node(struct kobject *kobj)
+{
+ sysfs_remove_group(kobj, &setup_data_attr_group);
+ kobject_put(kobj);
+}
+
+static int __init get_setup_data_total_num(u64 pa_data, int *nr)
+{
+ int ret = 0;
+ struct setup_data *data;
+
+ *nr = 0;
+ while (pa_data) {
+ *nr += 1;
+ data = memremap(pa_data, sizeof(*data), MEMREMAP_WB);
+ if (!data) {
+ ret = -ENOMEM;
+ goto out;
+ }
+ pa_data = data->next;
+ memunmap(data);
+ }
+
+out:
+ return ret;
+}
+
+static int __init create_setup_data_nodes(struct kobject *parent)
+{
+ struct kobject *setup_data_kobj, **kobjp;
+ u64 pa_data;
+ int i, j, nr, ret = 0;
+
+ pa_data = boot_params.hdr.setup_data;
+ if (!pa_data)
+ return 0;
+
+ setup_data_kobj = kobject_create_and_add("setup_data", parent);
+ if (!setup_data_kobj) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = get_setup_data_total_num(pa_data, &nr);
+ if (ret)
+ goto out_setup_data_kobj;
+
+ kobjp = kmalloc_array(nr, sizeof(*kobjp), GFP_KERNEL);
+ if (!kobjp) {
+ ret = -ENOMEM;
+ goto out_setup_data_kobj;
+ }
+
+ for (i = 0; i < nr; i++) {
+ ret = create_setup_data_node(setup_data_kobj, kobjp + i, i);
+ if (ret)
+ goto out_clean_nodes;
+ }
+
+ kfree(kobjp);
+ return 0;
+
+out_clean_nodes:
+ for (j = i - 1; j >= 0; j--)
+ cleanup_setup_data_node(*(kobjp + j));
+ kfree(kobjp);
+out_setup_data_kobj:
+ kobject_put(setup_data_kobj);
+out:
+ return ret;
+}
+
+static int __init boot_params_ksysfs_init(void)
+{
+ int ret;
+ struct kobject *boot_params_kobj;
+
+ boot_params_kobj = kobject_create_and_add("boot_params",
+ kernel_kobj);
+ if (!boot_params_kobj) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = sysfs_create_group(boot_params_kobj, &boot_params_attr_group);
+ if (ret)
+ goto out_boot_params_kobj;
+
+ ret = create_setup_data_nodes(boot_params_kobj);
+ if (ret)
+ goto out_create_group;
+
+ return 0;
+out_create_group:
+ sysfs_remove_group(boot_params_kobj, &boot_params_attr_group);
+out_boot_params_kobj:
+ kobject_put(boot_params_kobj);
+out:
+ return ret;
+}
+
+arch_initcall(boot_params_ksysfs_init);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 63b0ec8d3d4a..df78ddee0abb 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -1,48 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/*
* KVM paravirt_ops implementation
*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
- *
* Copyright (C) 2007, Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
* Copyright IBM Corporation, 2007
* Authors: Anthony Liguori <aliguori@us.ibm.com>
*/
-#include <linux/module.h>
+#define pr_fmt(fmt) "kvm-guest: " fmt
+
+#include <linux/context_tracking.h>
+#include <linux/init.h>
+#include <linux/irq.h>
#include <linux/kernel.h>
#include <linux/kvm_para.h>
#include <linux/cpu.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/hardirq.h>
+#include <linux/notifier.h>
+#include <linux/reboot.h>
+#include <linux/hash.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/kprobes.h>
+#include <linux/nmi.h>
+#include <linux/swait.h>
+#include <linux/syscore_ops.h>
+#include <linux/cc_platform.h>
+#include <linux/efi.h>
+#include <linux/kvm_types.h>
#include <asm/timer.h>
+#include <asm/cpu.h>
+#include <asm/traps.h>
+#include <asm/desc.h>
+#include <asm/tlbflush.h>
+#include <asm/apic.h>
+#include <asm/apicdef.h>
+#include <asm/hypervisor.h>
+#include <asm/mtrr.h>
+#include <asm/tlb.h>
+#include <asm/cpuidle_haltpoll.h>
+#include <asm/msr.h>
+#include <asm/ptrace.h>
+#include <asm/reboot.h>
+#include <asm/svm.h>
+#include <asm/e820/api.h>
-#define MMU_QUEUE_SIZE 1024
+DEFINE_STATIC_KEY_FALSE_RO(kvm_async_pf_enabled);
-struct kvm_para_state {
- u8 mmu_queue[MMU_QUEUE_SIZE];
- int mmu_queue_len;
-};
+static int kvmapf = 1;
-static DEFINE_PER_CPU(struct kvm_para_state, para_state);
+static int __init parse_no_kvmapf(char *arg)
+{
+ kvmapf = 0;
+ return 0;
+}
+
+early_param("no-kvmapf", parse_no_kvmapf);
-static struct kvm_para_state *kvm_para_state(void)
+static int steal_acc = 1;
+static int __init parse_no_stealacc(char *arg)
{
- return &per_cpu(para_state, raw_smp_processor_id());
+ steal_acc = 0;
+ return 0;
}
+early_param("no-steal-acc", parse_no_stealacc);
+
+static DEFINE_PER_CPU_READ_MOSTLY(bool, async_pf_enabled);
+static DEFINE_PER_CPU_DECRYPTED(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
+DEFINE_PER_CPU_DECRYPTED(struct kvm_steal_time, steal_time) __aligned(64) __visible;
+static int has_steal_clock = 0;
+
+static int has_guest_poll = 0;
/*
* No need for any "IO delay" on KVM
*/
@@ -50,191 +81,1099 @@ static void kvm_io_delay(void)
{
}
-static void kvm_mmu_op(void *buffer, unsigned len)
+#define KVM_TASK_SLEEP_HASHBITS 8
+#define KVM_TASK_SLEEP_HASHSIZE (1<<KVM_TASK_SLEEP_HASHBITS)
+
+struct kvm_task_sleep_node {
+ struct hlist_node link;
+ struct swait_queue_head wq;
+ u32 token;
+ int cpu;
+};
+
+static struct kvm_task_sleep_head {
+ raw_spinlock_t lock;
+ struct hlist_head list;
+} async_pf_sleepers[KVM_TASK_SLEEP_HASHSIZE];
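+
+/* Tokens hash into one of KVM_TASK_SLEEP_HASHSIZE (256) buckets; each bucket is a raw-spinlock-protected hlist. */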
+
+static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
+ u32 token)
{
- int r;
- unsigned long a1, a2;
+ struct hlist_node *p;
- do {
- a1 = __pa(buffer);
- a2 = 0; /* on i386 __pa() always returns <4G */
- r = kvm_hypercall3(KVM_HC_MMU_OP, len, a1, a2);
- buffer += r;
- len -= r;
- } while (len);
+ hlist_for_each(p, &b->list) {
+ struct kvm_task_sleep_node *n =
+ hlist_entry(p, typeof(*n), link);
+ if (n->token == token)
+ return n;
+ }
+
+ return NULL;
}
-static void mmu_queue_flush(struct kvm_para_state *state)
+static bool kvm_async_pf_queue_task(u32 token, struct kvm_task_sleep_node *n)
{
- if (state->mmu_queue_len) {
- kvm_mmu_op(state->mmu_queue, state->mmu_queue_len);
- state->mmu_queue_len = 0;
+ u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+ struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+ struct kvm_task_sleep_node *e;
+
+ raw_spin_lock(&b->lock);
+ e = _find_apf_task(b, token);
+ if (e) {
+ /* dummy entry exists -> wakeup was delivered ahead of the #PF */
+ hlist_del(&e->link);
+ raw_spin_unlock(&b->lock);
+ kfree(e);
+ return false;
}
+
+ n->token = token;
+ n->cpu = smp_processor_id();
+ init_swait_queue_head(&n->wq);
+ hlist_add_head(&n->link, &b->list);
+ raw_spin_unlock(&b->lock);
+ return true;
}
-static void kvm_deferred_mmu_op(void *buffer, int len)
+/*
+ * kvm_async_pf_task_wait_schedule - Wait for pagefault to be handled
+ * @token: Token to identify the sleep node entry
+ *
+ * Invoked from the async pagefault handling code or from the VM exit page
+ * fault handler. In both cases RCU is watching.
+ */
+void kvm_async_pf_task_wait_schedule(u32 token)
{
- struct kvm_para_state *state = kvm_para_state();
+ struct kvm_task_sleep_node n;
+ DECLARE_SWAITQUEUE(wait);
- if (paravirt_get_lazy_mode() != PARAVIRT_LAZY_MMU) {
- kvm_mmu_op(buffer, len);
+ lockdep_assert_irqs_disabled();
+
+ if (!kvm_async_pf_queue_task(token, &n))
return;
+
+ for (;;) {
+ prepare_to_swait_exclusive(&n.wq, &wait, TASK_UNINTERRUPTIBLE);
+ if (hlist_unhashed(&n.link))
+ break;
+
+ local_irq_enable();
+ schedule();
+ local_irq_disable();
}
- if (state->mmu_queue_len + len > sizeof state->mmu_queue)
- mmu_queue_flush(state);
- memcpy(state->mmu_queue + state->mmu_queue_len, buffer, len);
- state->mmu_queue_len += len;
+ finish_swait(&n.wq, &wait);
}
+EXPORT_SYMBOL_FOR_KVM(kvm_async_pf_task_wait_schedule);
-static void kvm_mmu_write(void *dest, u64 val)
+static void apf_task_wake_one(struct kvm_task_sleep_node *n)
{
- __u64 pte_phys;
- struct kvm_mmu_op_write_pte wpte;
+ hlist_del_init(&n->link);
+ if (swq_has_sleeper(&n->wq))
+ swake_up_one(&n->wq);
+}
-#ifdef CONFIG_HIGHPTE
- struct page *page;
- unsigned long dst = (unsigned long) dest;
+static void apf_task_wake_all(void)
+{
+ int i;
- page = kmap_atomic_to_page(dest);
- pte_phys = page_to_pfn(page);
- pte_phys <<= PAGE_SHIFT;
- pte_phys += (dst & ~(PAGE_MASK));
-#else
- pte_phys = (unsigned long)__pa(dest);
+ for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++) {
+ struct kvm_task_sleep_head *b = &async_pf_sleepers[i];
+ struct kvm_task_sleep_node *n;
+ struct hlist_node *p, *next;
+
+ raw_spin_lock(&b->lock);
+ hlist_for_each_safe(p, next, &b->list) {
+ n = hlist_entry(p, typeof(*n), link);
+ if (n->cpu == smp_processor_id())
+ apf_task_wake_one(n);
+ }
+ raw_spin_unlock(&b->lock);
+ }
+}
+
+static void kvm_async_pf_task_wake(u32 token)
+{
+ u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
+ struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
+ struct kvm_task_sleep_node *n, *dummy = NULL;
+
+ if (token == ~0) {
+ apf_task_wake_all();
+ return;
+ }
+
+again:
+ raw_spin_lock(&b->lock);
+ n = _find_apf_task(b, token);
+ if (!n) {
+ /*
+ * Async #PF not yet handled, add a dummy entry for the token.
+ * Allocating the dummy entry must be done outside of the raw lock
+ * as the allocator is preemptible on PREEMPT_RT kernels.
+ */
+ if (!dummy) {
+ raw_spin_unlock(&b->lock);
+ dummy = kzalloc(sizeof(*dummy), GFP_ATOMIC);
+
+ /*
+ * Continue looping on allocation failure, eventually
+ * the async #PF will be handled and allocating a new
+ * node will be unnecessary.
+ */
+ if (!dummy)
+ cpu_relax();
+
+ /*
+ * Recheck for async #PF completion before enqueueing
+ * the dummy token to avoid duplicate list entries.
+ */
+ goto again;
+ }
+ dummy->token = token;
+ dummy->cpu = smp_processor_id();
+ init_swait_queue_head(&dummy->wq);
+ hlist_add_head(&dummy->link, &b->list);
+ dummy = NULL;
+ } else {
+ apf_task_wake_one(n);
+ }
+ raw_spin_unlock(&b->lock);
+
+ /* A dummy token might be allocated and ultimately not used. */
+ kfree(dummy);
+}
+
+noinstr u32 kvm_read_and_reset_apf_flags(void)
+{
+ u32 flags = 0;
+
+ if (__this_cpu_read(async_pf_enabled)) {
+ flags = __this_cpu_read(apf_reason.flags);
+ __this_cpu_write(apf_reason.flags, 0);
+ }
+
+ return flags;
+}
+EXPORT_SYMBOL_FOR_KVM(kvm_read_and_reset_apf_flags);
+
+noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
+{
+ u32 flags = kvm_read_and_reset_apf_flags();
+ irqentry_state_t state;
+
+ if (!flags)
+ return false;
+
+ state = irqentry_enter(regs);
+ instrumentation_begin();
+
+ /*
+ * If the host managed to inject an async #PF into an interrupt
+ * disabled region, then die hard as this is not going to end well
+ * and the host side is seriously broken.
+ */
+ if (unlikely(!(regs->flags & X86_EFLAGS_IF)))
+ panic("Host injected async #PF in interrupt disabled region\n");
+
+ if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
+ if (unlikely(!(user_mode(regs))))
+ panic("Host injected async #PF in kernel mode\n");
+ /* Page is swapped out by the host. */
+ kvm_async_pf_task_wait_schedule(token);
+ } else {
+ WARN_ONCE(1, "Unexpected async PF flags: %x\n", flags);
+ }
+
+ instrumentation_end();
+ irqentry_exit(regs, state);
+ return true;
+}
+
+DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_asyncpf_interrupt)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+ u32 token;
+
+ apic_eoi();
+
+ inc_irq_stat(irq_hv_callback_count);
+
+ if (__this_cpu_read(async_pf_enabled)) {
+ token = __this_cpu_read(apf_reason.token);
+ kvm_async_pf_task_wake(token);
+ __this_cpu_write(apf_reason.token, 0);
+ wrmsrq(MSR_KVM_ASYNC_PF_ACK, 1);
+ }
+
+ set_irq_regs(old_regs);
+}
+
+static void __init paravirt_ops_setup(void)
+{
+ pv_info.name = "KVM";
+
+ if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
+ pv_ops.cpu.io_delay = kvm_io_delay;
+
+#ifdef CONFIG_X86_IO_APIC
+ no_timer_check = 1;
#endif
- wpte.header.op = KVM_MMU_OP_WRITE_PTE;
- wpte.pte_val = val;
- wpte.pte_phys = pte_phys;
+}
+
+static void kvm_register_steal_time(void)
+{
+ int cpu = smp_processor_id();
+ struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
+
+ if (!has_steal_clock)
+ return;
+
+ wrmsrq(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
+ pr_debug("stealtime: cpu %d, msr %llx\n", cpu,
+ (unsigned long long) slow_virt_to_phys(st));
+}
+
+static DEFINE_PER_CPU_DECRYPTED(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
+
+static notrace __maybe_unused void kvm_guest_apic_eoi_write(void)
+{
+ /*
+ * This relies on __test_and_clear_bit to modify the memory
+ * in a way that is atomic with respect to the local CPU.
+ * The hypervisor only accesses this memory from the local CPU, so
+ * there's no need for locks or memory barriers.
+ * An optimization barrier is implied in the apic write.
+ */
+ if (__test_and_clear_bit(KVM_PV_EOI_BIT, this_cpu_ptr(&kvm_apic_eoi)))
+ return;
+ apic_native_eoi();
+}
- kvm_deferred_mmu_op(&wpte, sizeof wpte);
+static void kvm_guest_cpu_init(void)
+{
+ if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_INT) && kvmapf) {
+ u64 pa;
+
+ WARN_ON_ONCE(!static_branch_likely(&kvm_async_pf_enabled));
+
+ pa = slow_virt_to_phys(this_cpu_ptr(&apf_reason));
+ pa |= KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT;
+
+ if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_VMEXIT))
+ pa |= KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
+
+ wrmsrq(MSR_KVM_ASYNC_PF_INT, HYPERVISOR_CALLBACK_VECTOR);
+
+ wrmsrq(MSR_KVM_ASYNC_PF_EN, pa);
+ __this_cpu_write(async_pf_enabled, true);
+ pr_debug("setup async PF for cpu %d\n", smp_processor_id());
+ }
+
+ if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) {
+ unsigned long pa;
+
+ /* Size alignment is implied but just to make it explicit. */
+ BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
+ __this_cpu_write(kvm_apic_eoi, 0);
+ pa = slow_virt_to_phys(this_cpu_ptr(&kvm_apic_eoi))
+ | KVM_MSR_ENABLED;
+ wrmsrq(MSR_KVM_PV_EOI_EN, pa);
+ }
+
+ if (has_steal_clock)
+ kvm_register_steal_time();
+}
+
+static void kvm_pv_disable_apf(void)
+{
+ if (!__this_cpu_read(async_pf_enabled))
+ return;
+
+ wrmsrq(MSR_KVM_ASYNC_PF_EN, 0);
+ __this_cpu_write(async_pf_enabled, false);
+
+ pr_debug("disable async PF for cpu %d\n", smp_processor_id());
+}
+
+static void kvm_disable_steal_time(void)
+{
+ if (!has_steal_clock)
+ return;
+
+ wrmsrq(MSR_KVM_STEAL_TIME, 0);
+}
+
+static u64 kvm_steal_clock(int cpu)
+{
+ u64 steal;
+ struct kvm_steal_time *src;
+ int version;
+
+ src = &per_cpu(steal_time, cpu);
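+ /*
+ * Seqcount-like retry loop: the host increments 'version' before and
+ * after updating the record, so an odd value, or a change across the
+ * two reads, means we raced with an update and must retry.
+ */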
+ do {
+ version = src->version;
+ virt_rmb();
+ steal = src->steal;
+ virt_rmb();
+ } while ((version & 1) || (version != src->version));
+
+ return steal;
+}
+
+static inline __init void __set_percpu_decrypted(void *ptr, unsigned long size)
+{
+ early_set_memory_decrypted((unsigned long) ptr, size);
}
/*
- * We only need to hook operations that are MMU writes. We hook these so that
- * we can use lazy MMU mode to batch these operations. We could probably
- * improve the performance of the host code if we used some of the information
- * here to simplify processing of batched writes.
+ * Iterate through all possible CPUs and map the memory regions pointed
+ * to by apf_reason, steal_time and kvm_apic_eoi as decrypted, all at once.
+ *
+ * Note: we iterate through all possible CPUs to ensure that hotplugged
+ * CPUs will have their per-CPU variables already mapped as decrypted.
*/
-static void kvm_set_pte(pte_t *ptep, pte_t pte)
+static void __init sev_map_percpu_data(void)
{
- kvm_mmu_write(ptep, pte_val(pte));
+ int cpu;
+
+ if (cc_vendor != CC_VENDOR_AMD ||
+ !cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+ return;
+
+ for_each_possible_cpu(cpu) {
+ __set_percpu_decrypted(&per_cpu(apf_reason, cpu), sizeof(apf_reason));
+ __set_percpu_decrypted(&per_cpu(steal_time, cpu), sizeof(steal_time));
+ __set_percpu_decrypted(&per_cpu(kvm_apic_eoi, cpu), sizeof(kvm_apic_eoi));
+ }
}
-static void kvm_set_pte_at(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte)
+static void kvm_guest_cpu_offline(bool shutdown)
{
- kvm_mmu_write(ptep, pte_val(pte));
+ kvm_disable_steal_time();
+ if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
+ wrmsrq(MSR_KVM_PV_EOI_EN, 0);
+ if (kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL))
+ wrmsrq(MSR_KVM_MIGRATION_CONTROL, 0);
+ kvm_pv_disable_apf();
+ if (!shutdown)
+ apf_task_wake_all();
+ kvmclock_disable();
}
-static void kvm_set_pmd(pmd_t *pmdp, pmd_t pmd)
+static int kvm_cpu_online(unsigned int cpu)
{
- kvm_mmu_write(pmdp, pmd_val(pmd));
+ unsigned long flags;
+
+ local_irq_save(flags);
+ kvm_guest_cpu_init();
+ local_irq_restore(flags);
+ return 0;
}
-#if PAGETABLE_LEVELS >= 3
-#ifdef CONFIG_X86_PAE
-static void kvm_set_pte_atomic(pte_t *ptep, pte_t pte)
+#ifdef CONFIG_SMP
+
+static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
+
+static bool pv_tlb_flush_supported(void)
{
- kvm_mmu_write(ptep, pte_val(pte));
+ return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
+ !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
+ kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
+ !boot_cpu_has(X86_FEATURE_MWAIT) &&
+ (num_possible_cpus() != 1));
}
-static void kvm_pte_clear(struct mm_struct *mm,
- unsigned long addr, pte_t *ptep)
+static bool pv_ipi_supported(void)
{
- kvm_mmu_write(ptep, 0);
+ return (kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI) &&
+ (num_possible_cpus() != 1));
}
-static void kvm_pmd_clear(pmd_t *pmdp)
+static bool pv_sched_yield_supported(void)
{
- kvm_mmu_write(pmdp, 0);
+ return (kvm_para_has_feature(KVM_FEATURE_PV_SCHED_YIELD) &&
+ !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
+ kvm_para_has_feature(KVM_FEATURE_STEAL_TIME) &&
+ !boot_cpu_has(X86_FEATURE_MWAIT) &&
+ (num_possible_cpus() != 1));
}
+
+#define KVM_IPI_CLUSTER_SIZE (2 * BITS_PER_LONG)
+
+static void __send_ipi_mask(const struct cpumask *mask, int vector)
+{
+ unsigned long flags;
+ int cpu, min = 0, max = 0;
+#ifdef CONFIG_X86_64
+ __uint128_t ipi_bitmap = 0;
+#else
+ u64 ipi_bitmap = 0;
#endif
+ u32 apic_id, icr;
+ long ret;
+
+ if (cpumask_empty(mask))
+ return;
+
+ local_irq_save(flags);
+
+ switch (vector) {
+ default:
+ icr = APIC_DM_FIXED | vector;
+ break;
+ case NMI_VECTOR:
+ icr = APIC_DM_NMI;
+ break;
+ }
+
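+ /*
+ * Pack APIC IDs into a sliding window bitmap of KVM_IPI_CLUSTER_SIZE
+ * bits starting at 'min'. When an ID falls outside the current window,
+ * flush the accumulated bitmap with a KVM_HC_SEND_IPI hypercall and
+ * start a new window at that ID.
+ */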
+ for_each_cpu(cpu, mask) {
+ apic_id = per_cpu(x86_cpu_to_apicid, cpu);
+ if (!ipi_bitmap) {
+ min = max = apic_id;
+ } else if (apic_id < min && max - apic_id < KVM_IPI_CLUSTER_SIZE) {
+ ipi_bitmap <<= min - apic_id;
+ min = apic_id;
+ } else if (apic_id > min && apic_id < min + KVM_IPI_CLUSTER_SIZE) {
+ max = apic_id < max ? max : apic_id;
+ } else {
+ ret = kvm_hypercall4(KVM_HC_SEND_IPI, (unsigned long)ipi_bitmap,
+ (unsigned long)(ipi_bitmap >> BITS_PER_LONG), min, icr);
+ WARN_ONCE(ret < 0, "kvm-guest: failed to send PV IPI: %ld",
+ ret);
+ min = max = apic_id;
+ ipi_bitmap = 0;
+ }
+ __set_bit(apic_id - min, (unsigned long *)&ipi_bitmap);
+ }
+
+ if (ipi_bitmap) {
+ ret = kvm_hypercall4(KVM_HC_SEND_IPI, (unsigned long)ipi_bitmap,
+ (unsigned long)(ipi_bitmap >> BITS_PER_LONG), min, icr);
+ WARN_ONCE(ret < 0, "kvm-guest: failed to send PV IPI: %ld",
+ ret);
+ }
+
+ local_irq_restore(flags);
+}
+
+static void kvm_send_ipi_mask(const struct cpumask *mask, int vector)
+{
+ __send_ipi_mask(mask, vector);
+}
+
+static void kvm_send_ipi_mask_allbutself(const struct cpumask *mask, int vector)
+{
+ unsigned int this_cpu = smp_processor_id();
+ struct cpumask *new_mask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
+ const struct cpumask *local_mask;
+
+ cpumask_copy(new_mask, mask);
+ cpumask_clear_cpu(this_cpu, new_mask);
+ local_mask = new_mask;
+ __send_ipi_mask(local_mask, vector);
+}
+
+static int __init setup_efi_kvm_sev_migration(void)
+{
+ efi_char16_t efi_sev_live_migration_enabled[] = L"SevLiveMigrationEnabled";
+ efi_guid_t efi_variable_guid = AMD_SEV_MEM_ENCRYPT_GUID;
+ efi_status_t status;
+ unsigned long size;
+ bool enabled;
+
+ if (!cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) ||
+ !kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL))
+ return 0;
+
+ if (!efi_enabled(EFI_BOOT))
+ return 0;
+
+ if (!efi_enabled(EFI_RUNTIME_SERVICES)) {
+ pr_info("%s : EFI runtime services are not enabled\n", __func__);
+ return 0;
+ }
+
+ size = sizeof(enabled);
+
+ /* Get variable contents into buffer */
+ status = efi.get_variable(efi_sev_live_migration_enabled,
+ &efi_variable_guid, NULL, &size, &enabled);
+
+ if (status == EFI_NOT_FOUND) {
+ pr_info("%s : EFI live migration variable not found\n", __func__);
+ return 0;
+ }
+
+ if (status != EFI_SUCCESS) {
+ pr_info("%s : EFI variable retrieval failed\n", __func__);
+ return 0;
+ }
+
+ if (enabled == 0) {
+ pr_info("%s: live migration disabled in EFI\n", __func__);
+ return 0;
+ }
+
+ pr_info("%s : live migration enabled in EFI\n", __func__);
+ wrmsrq(MSR_KVM_MIGRATION_CONTROL, KVM_MIGRATION_READY);
+
+ return 1;
+}
+
+late_initcall(setup_efi_kvm_sev_migration);
+
+/*
+ * Set the IPI entry points
+ */
+static __init void kvm_setup_pv_ipi(void)
+{
+ apic_update_callback(send_IPI_mask, kvm_send_ipi_mask);
+ apic_update_callback(send_IPI_mask_allbutself, kvm_send_ipi_mask_allbutself);
+ pr_info("setup PV IPIs\n");
+}
+
+static void kvm_smp_send_call_func_ipi(const struct cpumask *mask)
+{
+ int cpu;
-static void kvm_set_pud(pud_t *pudp, pud_t pud)
+ native_send_call_func_ipi(mask);
+
+ /* Make sure other vCPUs get a chance to run if they need to. */
+ for_each_cpu(cpu, mask) {
+ if (!idle_cpu(cpu) && vcpu_is_preempted(cpu)) {
+ kvm_hypercall1(KVM_HC_SCHED_YIELD, per_cpu(x86_cpu_to_apicid, cpu));
+ break;
+ }
+ }
+}
+
+static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
+ const struct flush_tlb_info *info)
+{
+ u8 state;
+ int cpu;
+ struct kvm_steal_time *src;
+ struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
+
+ cpumask_copy(flushmask, cpumask);
+ /*
+ * We have to call flush only on online vCPUs, and
+ * queue flush_on_enter for preempted vCPUs.
+ */
+ for_each_cpu(cpu, flushmask) {
+ /*
+ * The local vCPU is never preempted, so we do not explicitly
+ * skip check for local vCPU - it will never be cleared from
+ * flushmask.
+ */
+ src = &per_cpu(steal_time, cpu);
+ state = READ_ONCE(src->preempted);
+ if ((state & KVM_VCPU_PREEMPTED)) {
+ if (try_cmpxchg(&src->preempted, &state,
+ state | KVM_VCPU_FLUSH_TLB))
+ __cpumask_clear_cpu(cpu, flushmask);
+ }
+ }
+
+ native_flush_tlb_multi(flushmask, info);
+}
+
+static __init int kvm_alloc_cpumask(void)
+{
+ int cpu;
+
+ if (!kvm_para_available() || nopv)
+ return 0;
+
+ if (pv_tlb_flush_supported() || pv_ipi_supported())
+ for_each_possible_cpu(cpu) {
+ zalloc_cpumask_var_node(per_cpu_ptr(&__pv_cpu_mask, cpu),
+ GFP_KERNEL, cpu_to_node(cpu));
+ }
+
+ return 0;
+}
+arch_initcall(kvm_alloc_cpumask);
+
+static void __init kvm_smp_prepare_boot_cpu(void)
{
- kvm_mmu_write(pudp, pud_val(pud));
+ /*
+ * Map the per-cpu variables as decrypted before kvm_guest_cpu_init()
+ * shares the guest physical address with the hypervisor.
+ */
+ sev_map_percpu_data();
+
+ kvm_guest_cpu_init();
+ native_smp_prepare_boot_cpu();
+ kvm_spinlock_init();
}
-#if PAGETABLE_LEVELS == 4
-static void kvm_set_pgd(pgd_t *pgdp, pgd_t pgd)
+static int kvm_cpu_down_prepare(unsigned int cpu)
{
- kvm_mmu_write(pgdp, pgd_val(pgd));
+ unsigned long flags;
+
+ local_irq_save(flags);
+ kvm_guest_cpu_offline(false);
+ local_irq_restore(flags);
+ return 0;
}
+
#endif
-#endif /* PAGETABLE_LEVELS >= 3 */
-static void kvm_flush_tlb(void)
+static int kvm_suspend(void *data)
{
- struct kvm_mmu_op_flush_tlb ftlb = {
- .header.op = KVM_MMU_OP_FLUSH_TLB,
- };
+ u64 val = 0;
- kvm_deferred_mmu_op(&ftlb, sizeof ftlb);
+ kvm_guest_cpu_offline(false);
+
+#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
+ if (kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
+ rdmsrq(MSR_KVM_POLL_CONTROL, val);
+ has_guest_poll = !(val & 1);
+#endif
+ return 0;
}
-static void kvm_release_pt(unsigned long pfn)
+static void kvm_resume(void *data)
{
- struct kvm_mmu_op_release_pt rpt = {
- .header.op = KVM_MMU_OP_RELEASE_PT,
- .pt_phys = (u64)pfn << PAGE_SHIFT,
- };
+ kvm_cpu_online(raw_smp_processor_id());
- kvm_mmu_op(&rpt, sizeof rpt);
+#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
+ if (kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL) && has_guest_poll)
+ wrmsrq(MSR_KVM_POLL_CONTROL, 0);
+#endif
}
-static void kvm_enter_lazy_mmu(void)
+static const struct syscore_ops kvm_syscore_ops = {
+ .suspend = kvm_suspend,
+ .resume = kvm_resume,
+};
+
+static struct syscore kvm_syscore = {
+ .ops = &kvm_syscore_ops,
+};
+
+static void kvm_pv_guest_cpu_reboot(void *unused)
{
- paravirt_enter_lazy_mmu();
+ kvm_guest_cpu_offline(true);
}
-static void kvm_leave_lazy_mmu(void)
+static int kvm_pv_reboot_notify(struct notifier_block *nb,
+ unsigned long code, void *unused)
{
- struct kvm_para_state *state = kvm_para_state();
+ if (code == SYS_RESTART)
+ on_each_cpu(kvm_pv_guest_cpu_reboot, NULL, 1);
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block kvm_pv_reboot_nb = {
+ .notifier_call = kvm_pv_reboot_notify,
+};
- mmu_queue_flush(state);
- paravirt_leave_lazy_mmu();
+/*
+ * After a PV feature is registered, the host will keep writing to the
+ * registered memory location. If the guest happens to shut down, this memory
+ * won't be valid. In cases like kexec, in which you install a new kernel,
+ * this means a random memory location will keep being written to.
+ */
+#ifdef CONFIG_CRASH_DUMP
+static void kvm_crash_shutdown(struct pt_regs *regs)
+{
+ kvm_guest_cpu_offline(true);
+ native_machine_crash_shutdown(regs);
}
+#endif
-static void __init paravirt_ops_setup(void)
+#if defined(CONFIG_X86_32) || !defined(CONFIG_SMP)
+bool __kvm_vcpu_is_preempted(long cpu);
+
+__visible bool __kvm_vcpu_is_preempted(long cpu)
{
- pv_info.name = "KVM";
- pv_info.paravirt_enabled = 1;
+ struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
- if (kvm_para_has_feature(KVM_FEATURE_NOP_IO_DELAY))
- pv_cpu_ops.io_delay = kvm_io_delay;
-
- if (kvm_para_has_feature(KVM_FEATURE_MMU_OP)) {
- pv_mmu_ops.set_pte = kvm_set_pte;
- pv_mmu_ops.set_pte_at = kvm_set_pte_at;
- pv_mmu_ops.set_pmd = kvm_set_pmd;
-#if PAGETABLE_LEVELS >= 3
-#ifdef CONFIG_X86_PAE
- pv_mmu_ops.set_pte_atomic = kvm_set_pte_atomic;
- pv_mmu_ops.pte_clear = kvm_pte_clear;
- pv_mmu_ops.pmd_clear = kvm_pmd_clear;
+ return !!(src->preempted & KVM_VCPU_PREEMPTED);
+}
+PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
+
+#else
+
+#include <asm/asm-offsets.h>
+
+extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
+
+/*
+ * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
+ * restoring to/from the stack.
+ */
+#define PV_VCPU_PREEMPTED_ASM \
+ "movq __per_cpu_offset(,%rdi,8), %rax\n\t" \
+ "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax)\n\t" \
+ "setne %al\n\t"
+
+DEFINE_ASM_FUNC(__raw_callee_save___kvm_vcpu_is_preempted,
+ PV_VCPU_PREEMPTED_ASM, .text);
+#endif
+
+static void __init kvm_guest_init(void)
+{
+ int i;
+
+ paravirt_ops_setup();
+ register_reboot_notifier(&kvm_pv_reboot_nb);
+ for (i = 0; i < KVM_TASK_SLEEP_HASHSIZE; i++)
+ raw_spin_lock_init(&async_pf_sleepers[i].lock);
+
+ if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+ has_steal_clock = 1;
+ static_call_update(pv_steal_clock, kvm_steal_clock);
+
+ pv_ops.lock.vcpu_is_preempted =
+ PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
+ }
+
+ if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
+ apic_update_callback(eoi, kvm_guest_apic_eoi_write);
+
+ if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_INT) && kvmapf) {
+ static_branch_enable(&kvm_async_pf_enabled);
+ sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_kvm_asyncpf_interrupt);
+ }
+
+#ifdef CONFIG_SMP
+ if (pv_tlb_flush_supported()) {
+ pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
+ pr_info("KVM setup pv remote TLB flush\n");
+ }
+
+ smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
+ if (pv_sched_yield_supported()) {
+ smp_ops.send_call_func_ipi = kvm_smp_send_call_func_ipi;
+ pr_info("setup PV sched yield\n");
+ }
+ if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "x86/kvm:online",
+ kvm_cpu_online, kvm_cpu_down_prepare) < 0)
+ pr_err("failed to install cpu hotplug callbacks\n");
+#else
+ sev_map_percpu_data();
+ kvm_guest_cpu_init();
#endif
- pv_mmu_ops.set_pud = kvm_set_pud;
-#if PAGETABLE_LEVELS == 4
- pv_mmu_ops.set_pgd = kvm_set_pgd;
+
+#ifdef CONFIG_CRASH_DUMP
+ machine_ops.crash_shutdown = kvm_crash_shutdown;
#endif
+
+ register_syscore(&kvm_syscore);
+
+ /*
+ * Hard lockup detection is enabled by default. Disable it, as guests
+ * can get false positives too easily, for example if the host is
+ * overcommitted.
+ */
+ hardlockup_detector_disable();
+}
+
+static noinline uint32_t __kvm_cpuid_base(void)
+{
+ if (boot_cpu_data.cpuid_level < 0)
+ return 0; /* So we don't blow up on old processors */
+
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ return cpuid_base_hypervisor(KVM_SIGNATURE, 0);
+
+ return 0;
+}
+
+static inline uint32_t kvm_cpuid_base(void)
+{
+ static int kvm_cpuid_base = -1;
+
+ if (kvm_cpuid_base == -1)
+ kvm_cpuid_base = __kvm_cpuid_base();
+
+ return kvm_cpuid_base;
+}
+
+bool kvm_para_available(void)
+{
+ return kvm_cpuid_base() != 0;
+}
+EXPORT_SYMBOL_GPL(kvm_para_available);
+
+unsigned int kvm_arch_para_features(void)
+{
+ return cpuid_eax(kvm_cpuid_base() | KVM_CPUID_FEATURES);
+}
+
+unsigned int kvm_arch_para_hints(void)
+{
+ return cpuid_edx(kvm_cpuid_base() | KVM_CPUID_FEATURES);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_para_hints);
+
+static uint32_t __init kvm_detect(void)
+{
+ return kvm_cpuid_base();
+}
+
+static void __init kvm_apic_init(void)
+{
+#ifdef CONFIG_SMP
+ if (pv_ipi_supported())
+ kvm_setup_pv_ipi();
#endif
- pv_mmu_ops.flush_tlb_user = kvm_flush_tlb;
- pv_mmu_ops.release_pte = kvm_release_pt;
- pv_mmu_ops.release_pmd = kvm_release_pt;
- pv_mmu_ops.release_pud = kvm_release_pt;
+}
+
+static bool __init kvm_msi_ext_dest_id(void)
+{
+ return kvm_para_has_feature(KVM_FEATURE_MSI_EXT_DEST_ID);
+}
+
+static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
+{
+ kvm_sev_hypercall3(KVM_HC_MAP_GPA_RANGE, pfn << PAGE_SHIFT, npages,
+ KVM_MAP_GPA_RANGE_ENC_STAT(enc) | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
+}
+
+static void __init kvm_init_platform(void)
+{
+ u64 tolud = PFN_PHYS(e820__end_of_low_ram_pfn());
+ /*
+ * Note, hardware requires variable MTRR ranges to be power-of-2 sized
+ * and naturally aligned. But when forcing guest MTRR state, Linux
+ * doesn't program the forced ranges into hardware. Don't bother doing
+ * the math to generate a technically-legal range.
+ */
+ struct mtrr_var_range pci_hole = {
+ .base_lo = tolud | X86_MEMTYPE_UC,
+ .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | MTRR_PHYSMASK_V,
+ .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
+ };
+
+ if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
+ kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL)) {
+ unsigned long nr_pages;
+ int i;
+
+ pv_ops.mmu.notify_page_enc_status_changed =
+ kvm_sev_hc_page_enc_status;
+
+ /*
+ * Reset the host's shared pages list for kernel-specific page
+ * encryption status settings before we load a new kernel via
+ * kexec. Reset the page encryption status during early boot
+ * instead of just before kexec to avoid SMP races during
+ * kvm_pv_guest_cpu_reboot().
+ * NOTE: We cannot reset the complete shared pages list here
+ * as we need to retain the UEFI/OVMF firmware-specific
+ * settings.
+ */
+
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ struct e820_entry *entry = &e820_table->entries[i];
+
+ if (entry->type != E820_TYPE_RAM)
+ continue;
+
+ nr_pages = DIV_ROUND_UP(entry->size, PAGE_SIZE);
- pv_mmu_ops.lazy_mode.enter = kvm_enter_lazy_mmu;
- pv_mmu_ops.lazy_mode.leave = kvm_leave_lazy_mmu;
+ kvm_sev_hypercall3(KVM_HC_MAP_GPA_RANGE, entry->addr,
+ nr_pages,
+ KVM_MAP_GPA_RANGE_ENCRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
+ }
+
+ /*
+ * Ensure that the _bss_decrypted section is marked as decrypted in the
+ * shared pages list.
+ */
+ early_set_mem_enc_dec_hypercall((unsigned long)__start_bss_decrypted,
+ __end_bss_decrypted - __start_bss_decrypted, 0);
+
+ /*
+ * If not booted using EFI, enable live migration support.
+ */
+ if (!efi_enabled(EFI_BOOT))
+ wrmsrq(MSR_KVM_MIGRATION_CONTROL,
+ KVM_MIGRATION_READY);
}
-#ifdef CONFIG_X86_IO_APIC
- no_timer_check = 1;
+ kvmclock_init();
+ x86_platform.apic_post_init = kvm_apic_init;
+
+ /*
+ * Set WB as the default cache mode for SEV-SNP and TDX, with a single
+ * UC range for the legacy PCI hole, e.g. so that devices that expect
+ * to get UC/WC mappings don't get surprised with WB.
+ */
+ guest_force_mtrr_state(&pci_hole, 1, MTRR_TYPE_WRBACK);
+}
+
+#if defined(CONFIG_AMD_MEM_ENCRYPT)
+static void kvm_sev_es_hcall_prepare(struct ghcb *ghcb, struct pt_regs *regs)
+{
+ /* RAX and CPL are already in the GHCB */
+ ghcb_set_rbx(ghcb, regs->bx);
+ ghcb_set_rcx(ghcb, regs->cx);
+ ghcb_set_rdx(ghcb, regs->dx);
+ ghcb_set_rsi(ghcb, regs->si);
+}
+
+static bool kvm_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
+{
+ /* No checking of the return state needed */
+ return true;
+}
+#endif
+
+const __initconst struct hypervisor_x86 x86_hyper_kvm = {
+ .name = "KVM",
+ .detect = kvm_detect,
+ .type = X86_HYPER_KVM,
+ .init.guest_late_init = kvm_guest_init,
+ .init.x2apic_available = kvm_para_available,
+ .init.msi_ext_dest_id = kvm_msi_ext_dest_id,
+ .init.init_platform = kvm_init_platform,
+#if defined(CONFIG_AMD_MEM_ENCRYPT)
+ .runtime.sev_es_hcall_prepare = kvm_sev_es_hcall_prepare,
+ .runtime.sev_es_hcall_finish = kvm_sev_es_hcall_finish,
#endif
+};
+
+static __init int activate_jump_labels(void)
+{
+ if (has_steal_clock) {
+ static_key_slow_inc(&paravirt_steal_enabled);
+ if (steal_acc)
+ static_key_slow_inc(&paravirt_steal_rq_enabled);
+ }
+
+ return 0;
+}
+arch_initcall(activate_jump_labels);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
+static void kvm_kick_cpu(int cpu)
+{
+ unsigned long flags = 0;
+ u32 apicid;
+
+ apicid = per_cpu(x86_cpu_to_apicid, cpu);
+ kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
}
-void __init kvm_guest_init(void)
+#include <asm/qspinlock.h>
+
+static void kvm_wait(u8 *ptr, u8 val)
{
- if (!kvm_para_available())
+ if (in_nmi())
return;
- paravirt_ops_setup();
+ /*
+ * Halt until it's our turn and we get kicked. Note that we do a safe
+ * halt for the IRQ-enabled case to avoid a hang when the lock info is
+ * overwritten in the IRQ spinlock slowpath and no spurious interrupt
+ * occurs to save us.
+ */
+ if (irqs_disabled()) {
+ if (READ_ONCE(*ptr) == val)
+ halt();
+ } else {
+ local_irq_disable();
+
+ /* safe_halt() will enable IRQ */
+ if (READ_ONCE(*ptr) == val)
+ safe_halt();
+ else
+ local_irq_enable();
+ }
}
+
+/*
+ * Set up pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
+ */
+void __init kvm_spinlock_init(void)
+{
+ /*
+ * Disable PV spinlocks and use native qspinlock when dedicated pCPUs
+ * are available.
+ */
+ if (kvm_para_has_hint(KVM_HINTS_REALTIME)) {
+ pr_info("PV spinlocks disabled with KVM_HINTS_REALTIME hints\n");
+ goto out;
+ }
+
+ if (num_possible_cpus() == 1) {
+ pr_info("PV spinlocks disabled, single CPU\n");
+ goto out;
+ }
+
+ if (nopvspin) {
+ pr_info("PV spinlocks disabled, forced by \"nopvspin\" parameter\n");
+ goto out;
+ }
+
+ /*
+ * If the host doesn't support KVM_FEATURE_PV_UNHALT, there is still an
+ * advantage in keeping virt_spin_lock_key enabled: virt_spin_lock() is
+ * preferred over native qspinlock when the vCPU is preempted.
+ */
+ if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) {
+ pr_info("PV spinlocks disabled, no host support\n");
+ return;
+ }
+
+ pr_info("PV spinlocks enabled\n");
+
+ __pv_init_lock_hash();
+ pv_ops.lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
+ pv_ops.lock.queued_spin_unlock =
+ PV_CALLEE_SAVE(__pv_queued_spin_unlock);
+ pv_ops.lock.wait = kvm_wait;
+ pv_ops.lock.kick = kvm_kick_cpu;
+
+ /*
+ * When PV spinlocks are enabled, they are preferred over
+ * virt_spin_lock(), so virt_spin_lock_key's value is meaningless.
+ * Just disable it anyway.
+ */
+out:
+ static_branch_disable(&virt_spin_lock_key);
+}
+
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#ifdef CONFIG_ARCH_CPUIDLE_HALTPOLL
+
+static void kvm_disable_host_haltpoll(void *i)
+{
+ wrmsrq(MSR_KVM_POLL_CONTROL, 0);
+}
+
+static void kvm_enable_host_haltpoll(void *i)
+{
+ wrmsrq(MSR_KVM_POLL_CONTROL, 1);
+}
+
+void arch_haltpoll_enable(unsigned int cpu)
+{
+ if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL)) {
+ pr_err_once("host does not support poll control\n");
+ pr_err_once("host upgrade recommended\n");
+ return;
+ }
+
+ /* Enabling guest halt poll disables host halt poll */
+ smp_call_function_single(cpu, kvm_disable_host_haltpoll, NULL, 1);
+}
+EXPORT_SYMBOL_GPL(arch_haltpoll_enable);
+
+void arch_haltpoll_disable(unsigned int cpu)
+{
+ if (!kvm_para_has_feature(KVM_FEATURE_POLL_CONTROL))
+ return;
+
+ /* Disabling guest halt poll re-enables host halt poll */
+ smp_call_function_single(cpu, kvm_enable_host_haltpoll, NULL, 1);
+}
+EXPORT_SYMBOL_GPL(arch_haltpoll_disable);
+#endif
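
The detection helpers above (__kvm_cpuid_base(), kvm_para_available(), kvm_arch_para_features()) all boil down to probing the hypervisor CPUID range. As a rough illustration, the same probe can be done from guest userspace; this is a minimal sketch assuming a GCC-style toolchain on an x86 guest, with 0x40000000/0x40000001 being the documented KVM CPUID leaves:

#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	char sig[13];

	/* Hypervisor vendor leaf: the signature lives in EBX:ECX:EDX. */
	__cpuid(0x40000000, eax, ebx, ecx, edx);
	memcpy(sig + 0, &ebx, 4);
	memcpy(sig + 4, &ecx, 4);
	memcpy(sig + 8, &edx, 4);
	sig[12] = '\0';

	if (strcmp(sig, "KVMKVMKVM") != 0) {
		puts("not running on KVM");
		return 1;
	}

	/* Feature/hint words, as read by kvm_arch_para_features()/_hints(). */
	__cpuid(0x40000001, eax, ebx, ecx, edx);
	printf("KVM para features: %#x, hints: %#x\n", eax, edx);
	return 0;
}

The in-kernel variant additionally checks X86_FEATURE_HYPERVISOR first, since leaves in the 0x40000000 range are only defined when running under a hypervisor.
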
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index eb9b76c716c2..ca0a49eeac4a 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -1,19 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/* KVM paravirtual clock driver. A clocksource implementation
Copyright (C) 2008 Glauber de Oliveira Costa, Red Hat Inc.
-
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation; either version 2 of the License, or
- (at your option) any later version.
-
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
-
- You should have received a copy of the GNU General Public License
- along with this program; if not, write to the Free Software
- Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include <linux/clocksource.h>
@@ -22,71 +9,102 @@
#include <asm/msr.h>
#include <asm/apic.h>
#include <linux/percpu.h>
-
+#include <linux/hardirq.h>
+#include <linux/cpuhotplug.h>
+#include <linux/sched.h>
+#include <linux/sched/clock.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/set_memory.h>
+#include <linux/cc_platform.h>
+
+#include <asm/hypervisor.h>
#include <asm/x86_init.h>
-#include <asm/reboot.h>
-
-#define KVM_SCALE 22
+#include <asm/kvmclock.h>
-static int kvmclock = 1;
-static int msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
-static int msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+static int kvmclock __initdata = 1;
+static int kvmclock_vsyscall __initdata = 1;
+static int msr_kvm_system_time __ro_after_init;
+static int msr_kvm_wall_clock __ro_after_init;
+static u64 kvm_sched_clock_offset __ro_after_init;
-static int parse_no_kvmclock(char *arg)
+static int __init parse_no_kvmclock(char *arg)
{
kvmclock = 0;
return 0;
}
early_param("no-kvmclock", parse_no_kvmclock);
-/* The hypervisor will put information about time periodically here */
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock);
-static struct pvclock_wall_clock wall_clock;
+static int __init parse_no_kvmclock_vsyscall(char *arg)
+{
+ kvmclock_vsyscall = 0;
+ return 0;
+}
+early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
+
+/* Aligned to page sizes to match what's mapped via vsyscalls to userspace */
+#define HVC_BOOT_ARRAY_SIZE \
+ (PAGE_SIZE / sizeof(struct pvclock_vsyscall_time_info))
+
+static struct pvclock_vsyscall_time_info
+ hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __bss_decrypted __aligned(PAGE_SIZE);
+static struct pvclock_wall_clock wall_clock __bss_decrypted;
+static struct pvclock_vsyscall_time_info *hvclock_mem;
+DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
+EXPORT_PER_CPU_SYMBOL_GPL(hv_clock_per_cpu);
/*
 * The wallclock is the time of day when we booted. Some time may have
 * elapsed since the hypervisor wrote the data, so we try to account
 * for that with the system time.
*/
-static unsigned long kvm_get_wallclock(void)
+static void kvm_get_wallclock(struct timespec64 *now)
{
- struct pvclock_vcpu_time_info *vcpu_time;
- struct timespec ts;
- int low, high;
-
- low = (int)__pa_symbol(&wall_clock);
- high = ((u64)__pa_symbol(&wall_clock) >> 32);
-
- native_write_msr(msr_kvm_wall_clock, low, high);
-
- vcpu_time = &get_cpu_var(hv_clock);
- pvclock_read_wallclock(&wall_clock, vcpu_time, &ts);
- put_cpu_var(hv_clock);
-
- return ts.tv_sec;
+ wrmsrq(msr_kvm_wall_clock, slow_virt_to_phys(&wall_clock));
+ preempt_disable();
+ pvclock_read_wallclock(&wall_clock, this_cpu_pvti(), now);
+ preempt_enable();
}
-static int kvm_set_wallclock(unsigned long now)
+static int kvm_set_wallclock(const struct timespec64 *now)
{
- return -1;
+ return -ENODEV;
}
-static cycle_t kvm_clock_read(void)
+static u64 kvm_clock_read(void)
{
- struct pvclock_vcpu_time_info *src;
- cycle_t ret;
+ u64 ret;
- src = &get_cpu_var(hv_clock);
- ret = pvclock_clocksource_read(src);
- put_cpu_var(hv_clock);
+ preempt_disable_notrace();
+ ret = pvclock_clocksource_read_nowd(this_cpu_pvti());
+ preempt_enable_notrace();
return ret;
}
-static cycle_t kvm_clock_get_cycles(struct clocksource *cs)
+static u64 kvm_clock_get_cycles(struct clocksource *cs)
{
return kvm_clock_read();
}
+static noinstr u64 kvm_sched_clock_read(void)
+{
+ return pvclock_clocksource_read_nowd(this_cpu_pvti()) - kvm_sched_clock_offset;
+}
+
+static inline void kvm_sched_clock_init(bool stable)
+{
+ if (!stable)
+ clear_sched_clock_stable();
+ kvm_sched_clock_offset = kvm_clock_read();
+ paravirt_set_sched_clock(kvm_sched_clock_read);
+
+ pr_info("kvm-clock: using sched offset of %llu cycles",
+ kvm_sched_clock_offset);
+
+ BUILD_BUG_ON(sizeof(kvm_sched_clock_offset) >
+ sizeof(((struct pvclock_vcpu_time_info *)NULL)->system_time));
+}
+
/*
* If we don't do that, there is the possibility that the guest
* will calibrate under heavy load - thus, getting a lower lpj -
@@ -98,12 +116,11 @@ static cycle_t kvm_clock_get_cycles(struct clocksource *cs)
*/
static unsigned long kvm_get_tsc_khz(void)
{
- struct pvclock_vcpu_time_info *src;
- src = &per_cpu(hv_clock, 0);
- return pvclock_tsc_khz(src);
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+ return pvclock_tsc_khz(this_cpu_pvti());
}
-static void kvm_get_preset_lpj(void)
+static void __init kvm_get_preset_lpj(void)
{
unsigned long khz;
u64 lpj;
@@ -115,107 +132,218 @@ static void kvm_get_preset_lpj(void)
preset_lpj = lpj;
}
+bool kvm_check_and_clear_guest_paused(void)
+{
+ struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
+ bool ret = false;
+
+ if (!src)
+ return ret;
+
+ if ((src->pvti.flags & PVCLOCK_GUEST_STOPPED) != 0) {
+ src->pvti.flags &= ~PVCLOCK_GUEST_STOPPED;
+ pvclock_touch_watchdogs();
+ ret = true;
+ }
+ return ret;
+}
+
+static int kvm_cs_enable(struct clocksource *cs)
+{
+ vclocks_set_used(VDSO_CLOCKMODE_PVCLOCK);
+ return 0;
+}
+
static struct clocksource kvm_clock = {
- .name = "kvm-clock",
- .read = kvm_clock_get_cycles,
- .rating = 400,
- .mask = CLOCKSOURCE_MASK(64),
- .mult = 1 << KVM_SCALE,
- .shift = KVM_SCALE,
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .name = "kvm-clock",
+ .read = kvm_clock_get_cycles,
+ .rating = 400,
+ .mask = CLOCKSOURCE_MASK(64),
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS,
+ .id = CSID_X86_KVM_CLK,
+ .enable = kvm_cs_enable,
};
-static int kvm_register_clock(char *txt)
+static void kvm_register_clock(char *txt)
{
- int cpu = smp_processor_id();
- int low, high;
- low = (int)__pa(&per_cpu(hv_clock, cpu)) | 1;
- high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
- printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
- cpu, high, low, txt);
+ struct pvclock_vsyscall_time_info *src = this_cpu_hvclock();
+ u64 pa;
- return native_write_msr_safe(msr_kvm_system_time, low, high);
+ if (!src)
+ return;
+
+ pa = slow_virt_to_phys(&src->pvti) | 0x01ULL;
+ wrmsrq(msr_kvm_system_time, pa);
+ pr_debug("kvm-clock: cpu %d, msr %llx, %s", smp_processor_id(), pa, txt);
+}
+
+static void kvm_save_sched_clock_state(void)
+{
+}
+
+static void kvm_restore_sched_clock_state(void)
+{
+ kvm_register_clock("primary cpu clock, resume");
}
#ifdef CONFIG_X86_LOCAL_APIC
-static void __cpuinit kvm_setup_secondary_clock(void)
+static void kvm_setup_secondary_clock(void)
{
- /*
- * Now that the first cpu already had this clocksource initialized,
- * we shouldn't fail.
- */
- WARN_ON(kvm_register_clock("secondary cpu clock"));
- /* ok, done with our trickery, call native */
- setup_secondary_APIC_clock();
+ kvm_register_clock("secondary cpu clock");
}
#endif
-#ifdef CONFIG_SMP
-static void __init kvm_smp_prepare_boot_cpu(void)
+void kvmclock_disable(void)
{
- WARN_ON(kvm_register_clock("primary cpu clock"));
- native_smp_prepare_boot_cpu();
+ if (msr_kvm_system_time)
+ native_write_msr(msr_kvm_system_time, 0);
}
-#endif
-/*
- * After the clock is registered, the host will keep writing to the
- * registered memory location. If the guest happens to shutdown, this memory
- * won't be valid. In cases like kexec, in which you install a new kernel, this
- * means a random memory location will be kept being written. So before any
- * kind of shutdown from our side, we unregister the clock by writting anything
- * that does not have the 'enable' bit set in the msr
- */
-#ifdef CONFIG_KEXEC
-static void kvm_crash_shutdown(struct pt_regs *regs)
+static void __init kvmclock_init_mem(void)
{
- native_write_msr(msr_kvm_system_time, 0, 0);
- native_machine_crash_shutdown(regs);
+ unsigned long ncpus;
+ unsigned int order;
+ struct page *p;
+ int r;
+
+ if (HVC_BOOT_ARRAY_SIZE >= num_possible_cpus())
+ return;
+
+ ncpus = num_possible_cpus() - HVC_BOOT_ARRAY_SIZE;
+ order = get_order(ncpus * sizeof(*hvclock_mem));
+
+ p = alloc_pages(GFP_KERNEL, order);
+ if (!p) {
+ pr_warn("%s: failed to alloc %d pages", __func__, (1U << order));
+ return;
+ }
+
+ hvclock_mem = page_address(p);
+
+ /*
+ * hvclock is shared between the guest and the hypervisor, so it
+ * must be mapped decrypted.
+ */
+ if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
+ r = set_memory_decrypted((unsigned long) hvclock_mem,
+ 1UL << order);
+ if (r) {
+ __free_pages(p, order);
+ hvclock_mem = NULL;
+ pr_warn("kvmclock: set_memory_decrypted() failed. Disabling\n");
+ return;
+ }
+ }
+
+ memset(hvclock_mem, 0, PAGE_SIZE << order);
}
+
+static int __init kvm_setup_vsyscall_timeinfo(void)
+{
+ if (!kvm_para_available() || !kvmclock || nopv)
+ return 0;
+
+ kvmclock_init_mem();
+
+#ifdef CONFIG_X86_64
+ if (per_cpu(hv_clock_per_cpu, 0) && kvmclock_vsyscall) {
+ u8 flags;
+
+ flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+ if (!(flags & PVCLOCK_TSC_STABLE_BIT))
+ return 0;
+
+ kvm_clock.vdso_clock_mode = VDSO_CLOCKMODE_PVCLOCK;
+ }
#endif
-static void kvm_shutdown(void)
+ return 0;
+}
+early_initcall(kvm_setup_vsyscall_timeinfo);
+
+static int kvmclock_setup_percpu(unsigned int cpu)
{
- native_write_msr(msr_kvm_system_time, 0, 0);
- native_machine_shutdown();
+ struct pvclock_vsyscall_time_info *p = per_cpu(hv_clock_per_cpu, cpu);
+
+ /*
+ * The per-CPU area setup replicates the CPU0 data to all CPU
+ * pointers, so check carefully. CPU0 has already been set up
+ * during init.
+ */
+ if (!cpu || (p && p != per_cpu(hv_clock_per_cpu, 0)))
+ return 0;
+
+ /* Use the static page for the first CPUs, allocate otherwise */
+ if (cpu < HVC_BOOT_ARRAY_SIZE)
+ p = &hv_clock_boot[cpu];
+ else if (hvclock_mem)
+ p = hvclock_mem + cpu - HVC_BOOT_ARRAY_SIZE;
+ else
+ return -ENOMEM;
+
+ per_cpu(hv_clock_per_cpu, cpu) = p;
+ return p ? 0 : -ENOMEM;
}
void __init kvmclock_init(void)
{
- if (!kvm_para_available())
+ u8 flags;
+
+ if (!kvm_para_available() || !kvmclock)
return;
- if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
+ if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
- } else if (!(kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)))
+ } else if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
+ msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
+ msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
+ } else {
+ return;
+ }
+
+ if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
+ kvmclock_setup_percpu, NULL) < 0) {
return;
+ }
- printk(KERN_INFO "kvm-clock: Using msrs %x and %x",
+ pr_info("kvm-clock: Using msrs %x and %x",
msr_kvm_system_time, msr_kvm_wall_clock);
- if (kvm_register_clock("boot clock"))
- return;
- pv_time_ops.sched_clock = kvm_clock_read;
+ this_cpu_write(hv_clock_per_cpu, &hv_clock_boot[0]);
+ kvm_register_clock("primary cpu clock");
+ pvclock_set_pvti_cpu0_va(hv_clock_boot);
+
+ if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
+ pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+
+ flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
+ kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
+
x86_platform.calibrate_tsc = kvm_get_tsc_khz;
+ x86_platform.calibrate_cpu = kvm_get_tsc_khz;
x86_platform.get_wallclock = kvm_get_wallclock;
x86_platform.set_wallclock = kvm_set_wallclock;
#ifdef CONFIG_X86_LOCAL_APIC
- x86_cpuinit.setup_percpu_clockev =
- kvm_setup_secondary_clock;
-#endif
-#ifdef CONFIG_SMP
- smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
-#endif
- machine_ops.shutdown = kvm_shutdown;
-#ifdef CONFIG_KEXEC
- machine_ops.crash_shutdown = kvm_crash_shutdown;
+ x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
#endif
+ x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
+ x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
kvm_get_preset_lpj();
- clocksource_register(&kvm_clock);
- pv_info.paravirt_enabled = 1;
- pv_info.name = "KVM";
- if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT))
- pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT);
+ /*
+ * X86_FEATURE_NONSTOP_TSC means the TSC runs at a constant rate
+ * across P/T states and does not stop in deep C-states.
+ *
+ * An invariant TSC exposed by the host means kvmclock is not
+ * necessary: the TSC can be used as the clocksource instead.
+ */
+ if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+ boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+ !check_tsc_unstable())
+ kvm_clock.rating = 299;
+
+ clocksource_register_hz(&kvm_clock, NSEC_PER_SEC);
+ pv_info.name = "KVM";
}
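
For orientation, the read path that pvclock_clocksource_read_nowd() ultimately performs follows the pvclock version protocol: the host bumps the version field to an odd value while it updates the record, and the guest retries until it observes the same even version before and after reading the payload. A condensed sketch of that loop, with field names following struct pvclock_vcpu_time_info (the real code lives in arch/x86/kernel/pvclock.c):

/* Sketch only, not the exact kernel implementation. */
static u64 pvclock_read_sketch(const struct pvclock_vcpu_time_info *pvti)
{
	u64 delta, ns;
	u32 version;

	do {
		version = READ_ONCE(pvti->version);
		smp_rmb();		/* read the payload after the version */

		delta = rdtsc_ordered() - pvti->tsc_timestamp;
		if (pvti->tsc_shift >= 0)
			delta <<= pvti->tsc_shift;
		else
			delta >>= -pvti->tsc_shift;

		/* Scale the TSC delta to nanoseconds: (delta * mul) >> 32 */
		ns = pvti->system_time +
		     mul_u64_u32_shr(delta, pvti->tsc_to_system_mul, 32);

		smp_rmb();		/* re-read the version last */
	} while ((version & 1) || version != READ_ONCE(pvti->version));

	return ns;
}

kvm_sched_clock_read() then just subtracts kvm_sched_clock_offset from this value so that sched_clock() starts near zero at boot.
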
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index ea697263b373..0f19ef355f5f 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -1,9 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (C) 1992 Krishna Balasubramanian and Linus Torvalds
* Copyright (C) 1999 Ingo Molnar <mingo@redhat.com>
* Copyright (C) 2002 Andi Kleen
*
* This handles calls from both 32bit and 64bit mode.
+ *
+ * Lock order:
+ * context.ldt_usr_sem
+ * mmap_lock
+ * context.lock
*/
#include <linux/errno.h>
@@ -12,110 +18,468 @@
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/smp.h>
+#include <linux/syscalls.h>
+#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/uaccess.h>
-#include <asm/system.h>
#include <asm/ldt.h>
+#include <asm/tlb.h>
#include <asm/desc.h>
#include <asm/mmu_context.h>
-#include <asm/syscalls.h>
+#include <asm/pgtable_areas.h>
+
+#include <xen/xen.h>
+
+/* This is a multiple of PAGE_SIZE. */
+#define LDT_SLOT_STRIDE (LDT_ENTRIES * LDT_ENTRY_SIZE)
-#ifdef CONFIG_SMP
-static void flush_ldt(void *current_mm)
+static inline void *ldt_slot_va(int slot)
{
- if (current->active_mm == current_mm)
- load_LDT(&current->active_mm->context);
+ return (void *)(LDT_BASE_ADDR + LDT_SLOT_STRIDE * slot);
}
-#endif
-static int alloc_ldt(mm_context_t *pc, int mincount, int reload)
+void load_mm_ldt(struct mm_struct *mm)
{
- void *oldldt, *newldt;
- int oldsize;
+ struct ldt_struct *ldt;
- if (mincount <= pc->size)
- return 0;
- oldsize = pc->size;
- mincount = (mincount + (PAGE_SIZE / LDT_ENTRY_SIZE - 1)) &
- (~(PAGE_SIZE / LDT_ENTRY_SIZE - 1));
- if (mincount * LDT_ENTRY_SIZE > PAGE_SIZE)
- newldt = vmalloc(mincount * LDT_ENTRY_SIZE);
- else
- newldt = (void *)__get_free_page(GFP_KERNEL);
+ /* READ_ONCE synchronizes with smp_store_release */
+ ldt = READ_ONCE(mm->context.ldt);
- if (!newldt)
- return -ENOMEM;
+ /*
+ * Any change to mm->context.ldt is followed by an IPI to all
+ * CPUs with the mm active. The LDT will not be freed until
+ * after the IPI is handled by all such CPUs. This means that
+ * if the ldt_struct changes before we return, the values we see
+ * will be safe, and the new values will be loaded before we run
+ * any user code.
+ *
+ * NB: don't try to convert this to use RCU without extreme care.
+ * We would still need IRQs off, because we don't want to change
+ * the local LDT after an IPI loaded a newer value than the one
+ * that we can see.
+ */
- if (oldsize)
- memcpy(newldt, pc->ldt, oldsize * LDT_ENTRY_SIZE);
- oldldt = pc->ldt;
- memset(newldt + oldsize * LDT_ENTRY_SIZE, 0,
- (mincount - oldsize) * LDT_ENTRY_SIZE);
+ if (unlikely(ldt)) {
+ if (static_cpu_has(X86_FEATURE_PTI)) {
+ if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {
+ /*
+ * Whoops -- either the new LDT isn't mapped
+ * (if slot == -1) or is mapped into a bogus
+ * slot (if slot > 1).
+ */
+ clear_LDT();
+ return;
+ }
- paravirt_alloc_ldt(newldt, mincount);
+ /*
+ * If page table isolation is enabled, ldt->entries
+ * will not be mapped in the userspace pagetables.
+ * Tell the CPU to access the LDT through the alias
+ * at ldt_slot_va(ldt->slot).
+ */
+ set_ldt(ldt_slot_va(ldt->slot), ldt->nr_entries);
+ } else {
+ set_ldt(ldt->entries, ldt->nr_entries);
+ }
+ } else {
+ clear_LDT();
+ }
+}
+
+void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
+{
+ /*
+ * Load the LDT if either the old or new mm had an LDT.
+ *
+ * An mm will never go from having an LDT to not having an LDT. Two
+ * mms never share an LDT, so we don't gain anything by checking to
+ * see whether the LDT changed. There's also no guarantee that
+ * prev->context.ldt actually matches LDTR, but, if LDTR is non-NULL,
+ * then prev->context.ldt will also be non-NULL.
+ *
+ * If we really cared, we could optimize the case where prev == next
+ * and we're exiting lazy mode. Most of the time, if this happens,
+ * we don't actually need to reload LDTR, but modify_ldt() is mostly
+ * used by legacy code and emulators where we don't need this level of
+ * performance.
+ *
+ * This uses | instead of || because it generates better code.
+ */
+ if (unlikely((unsigned long)prev->context.ldt |
+ (unsigned long)next->context.ldt))
+ load_mm_ldt(next);
+ DEBUG_LOCKS_WARN_ON(preemptible());
+}
+
+static void refresh_ldt_segments(void)
+{
#ifdef CONFIG_X86_64
- /* CHECKME: Do we really need this ? */
- wmb();
-#endif
- pc->ldt = newldt;
- wmb();
- pc->size = mincount;
- wmb();
-
- if (reload) {
-#ifdef CONFIG_SMP
- preempt_disable();
- load_LDT(pc);
- if (!cpumask_equal(mm_cpumask(current->mm),
- cpumask_of(smp_processor_id())))
- smp_call_function(flush_ldt, current->mm, 1);
- preempt_enable();
-#else
- load_LDT(pc);
+ unsigned short sel;
+
+ /*
+ * Make sure that the cached DS and ES descriptors match the updated
+ * LDT.
+ */
+ savesegment(ds, sel);
+ if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT)
+ loadsegment(ds, sel);
+
+ savesegment(es, sel);
+ if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT)
+ loadsegment(es, sel);
#endif
+}
+
+/* context.lock is held by the task which issued the smp function call */
+static void flush_ldt(void *__mm)
+{
+ struct mm_struct *mm = __mm;
+
+ if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
+ return;
+
+ load_mm_ldt(mm);
+
+ refresh_ldt_segments();
+}
+
+/* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
+static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
+{
+ struct ldt_struct *new_ldt;
+ unsigned int alloc_size;
+
+ if (num_entries > LDT_ENTRIES)
+ return NULL;
+
+ new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);
+ if (!new_ldt)
+ return NULL;
+
+ BUILD_BUG_ON(LDT_ENTRY_SIZE != sizeof(struct desc_struct));
+ alloc_size = num_entries * LDT_ENTRY_SIZE;
+
+ /*
+ * Xen is very picky: it requires a page-aligned LDT that has no
+ * trailing nonzero bytes in any page that contains LDT descriptors.
+ * Keep it simple: zero the whole allocation and never allocate less
+ * than PAGE_SIZE.
+ */
+ if (alloc_size > PAGE_SIZE)
+ new_ldt->entries = __vmalloc(alloc_size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ else
+ new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+
+ if (!new_ldt->entries) {
+ kfree(new_ldt);
+ return NULL;
+ }
+
+ /* The new LDT isn't aliased for PTI yet. */
+ new_ldt->slot = -1;
+
+ new_ldt->nr_entries = num_entries;
+ return new_ldt;
+}
+
+#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
+
+static void do_sanity_check(struct mm_struct *mm,
+ bool had_kernel_mapping,
+ bool had_user_mapping)
+{
+ if (mm->context.ldt) {
+ /*
+ * We already had an LDT. The top-level entry should already
+ * have been allocated and synchronized with the usermode
+ * tables.
+ */
+ WARN_ON(!had_kernel_mapping);
+ if (boot_cpu_has(X86_FEATURE_PTI))
+ WARN_ON(!had_user_mapping);
+ } else {
+ /*
+ * This is the first time we're mapping an LDT for this process.
+ * Sync the pgd to the usermode tables.
+ */
+ WARN_ON(had_kernel_mapping);
+ if (boot_cpu_has(X86_FEATURE_PTI))
+ WARN_ON(had_user_mapping);
}
- if (oldsize) {
- paravirt_free_ldt(oldldt, oldsize);
- if (oldsize * LDT_ENTRY_SIZE > PAGE_SIZE)
- vfree(oldldt);
- else
- put_page(virt_to_page(oldldt));
+}
+
+#ifdef CONFIG_X86_PAE
+
+static pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
+{
+ p4d_t *p4d;
+ pud_t *pud;
+
+ if (pgd->pgd == 0)
+ return NULL;
+
+ p4d = p4d_offset(pgd, va);
+ if (p4d_none(*p4d))
+ return NULL;
+
+ pud = pud_offset(p4d, va);
+ if (pud_none(*pud))
+ return NULL;
+
+ return pmd_offset(pud, va);
+}
+
+static void map_ldt_struct_to_user(struct mm_struct *mm)
+{
+ pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
+ pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+ pmd_t *k_pmd, *u_pmd;
+
+ k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
+ u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
+
+ if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
+ set_pmd(u_pmd, *k_pmd);
+}
+
+static void sanity_check_ldt_mapping(struct mm_struct *mm)
+{
+ pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
+ pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+ bool had_kernel, had_user;
+ pmd_t *k_pmd, *u_pmd;
+
+ k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
+ u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
+ had_kernel = (k_pmd->pmd != 0);
+ had_user = (u_pmd->pmd != 0);
+
+ do_sanity_check(mm, had_kernel, had_user);
+}
+
+#else /* !CONFIG_X86_PAE */
+
+static void map_ldt_struct_to_user(struct mm_struct *mm)
+{
+ pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
+
+ if (boot_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
+ set_pgd(kernel_to_user_pgdp(pgd), *pgd);
+}
+
+static void sanity_check_ldt_mapping(struct mm_struct *mm)
+{
+ pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
+ bool had_kernel = (pgd->pgd != 0);
+ bool had_user = (kernel_to_user_pgdp(pgd)->pgd != 0);
+
+ do_sanity_check(mm, had_kernel, had_user);
+}
+
+#endif /* CONFIG_X86_PAE */
+
+/*
+ * If PTI is enabled, this maps the LDT into the kernelmode and
+ * usermode tables for the given mm.
+ */
+static int
+map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
+{
+ unsigned long va;
+ bool is_vmalloc;
+ spinlock_t *ptl;
+ int i, nr_pages;
+
+ if (!boot_cpu_has(X86_FEATURE_PTI))
+ return 0;
+
+ /*
+ * Any given ldt_struct should have map_ldt_struct() called at most
+ * once.
+ */
+ WARN_ON(ldt->slot != -1);
+
+ /* Check if the current mappings are sane */
+ sanity_check_ldt_mapping(mm);
+
+ is_vmalloc = is_vmalloc_addr(ldt->entries);
+
+ nr_pages = DIV_ROUND_UP(ldt->nr_entries * LDT_ENTRY_SIZE, PAGE_SIZE);
+
+ for (i = 0; i < nr_pages; i++) {
+ unsigned long offset = i << PAGE_SHIFT;
+ const void *src = (char *)ldt->entries + offset;
+ unsigned long pfn;
+ pgprot_t pte_prot;
+ pte_t pte, *ptep;
+
+ va = (unsigned long)ldt_slot_va(slot) + offset;
+ pfn = is_vmalloc ? vmalloc_to_pfn(src) :
+ page_to_pfn(virt_to_page(src));
+ /*
+ * Treat the PTI LDT range as a *userspace* range.
+ * get_locked_pte() will allocate all needed pagetables
+ * and account for them in this mm.
+ */
+ ptep = get_locked_pte(mm, va, &ptl);
+ if (!ptep)
+ return -ENOMEM;
+ /*
+ * Map it RO so the easy-to-find address is not a primary
+ * target via some kernel interface that misses a
+ * permission check.
+ */
+ pte_prot = __pgprot(__PAGE_KERNEL_RO & ~_PAGE_GLOBAL);
+ /* Filter out unsupported __PAGE_KERNEL* bits: */
+ pgprot_val(pte_prot) &= __supported_pte_mask;
+ pte = pfn_pte(pfn, pte_prot);
+ set_pte_at(mm, va, ptep, pte);
+ pte_unmap_unlock(ptep, ptl);
}
+
+ /* Propagate LDT mapping to the user page-table */
+ map_ldt_struct_to_user(mm);
+
+ ldt->slot = slot;
return 0;
}
-static inline int copy_ldt(mm_context_t *new, mm_context_t *old)
+static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
{
- int err = alloc_ldt(new, old->size, 0);
- int i;
+ unsigned long va;
+ int i, nr_pages;
+
+ if (!ldt)
+ return;
- if (err < 0)
- return err;
+ /* LDT map/unmap is only required for PTI */
+ if (!boot_cpu_has(X86_FEATURE_PTI))
+ return;
- for (i = 0; i < old->size; i++)
- write_ldt_entry(new->ldt, i, old->ldt + i * LDT_ENTRY_SIZE);
+ nr_pages = DIV_ROUND_UP(ldt->nr_entries * LDT_ENTRY_SIZE, PAGE_SIZE);
+
+ for (i = 0; i < nr_pages; i++) {
+ unsigned long offset = i << PAGE_SHIFT;
+ spinlock_t *ptl;
+ pte_t *ptep;
+
+ va = (unsigned long)ldt_slot_va(ldt->slot) + offset;
+ ptep = get_locked_pte(mm, va, &ptl);
+ if (!WARN_ON_ONCE(!ptep)) {
+ pte_clear(mm, va, ptep);
+ pte_unmap_unlock(ptep, ptl);
+ }
+ }
+
+ va = (unsigned long)ldt_slot_va(ldt->slot);
+ flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false);
+}
+
+#else /* !CONFIG_MITIGATION_PAGE_TABLE_ISOLATION */
+
+static int
+map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
+{
return 0;
}
+static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
+{
+}
+#endif /* CONFIG_MITIGATION_PAGE_TABLE_ISOLATION */
+
+static void free_ldt_pgtables(struct mm_struct *mm)
+{
+#ifdef CONFIG_MITIGATION_PAGE_TABLE_ISOLATION
+ struct mmu_gather tlb;
+ unsigned long start = LDT_BASE_ADDR;
+ unsigned long end = LDT_END_ADDR;
+
+ if (!boot_cpu_has(X86_FEATURE_PTI))
+ return;
+
+ /*
+ * Although free_pgd_range() is intended for freeing user
+ * page-tables, it also works out for kernel mappings on x86.
+ * We use tlb_gather_mmu_fullmm() to avoid confusing the
+ * range-tracking logic in __tlb_adjust_range().
+ */
+ tlb_gather_mmu_fullmm(&tlb, mm);
+ free_pgd_range(&tlb, start, end, start, end);
+ tlb_finish_mmu(&tlb);
+#endif
+}
+
+/* After calling this, the LDT is immutable. */
+static void finalize_ldt_struct(struct ldt_struct *ldt)
+{
+ paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
+}
+
+static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt)
+{
+ mutex_lock(&mm->context.lock);
+
+ /* Synchronizes with READ_ONCE in load_mm_ldt. */
+ smp_store_release(&mm->context.ldt, ldt);
+
+ /* Activate the LDT for all CPUs using current's mm. */
+ on_each_cpu_mask(mm_cpumask(mm), flush_ldt, mm, true);
+
+ mutex_unlock(&mm->context.lock);
+}
+
+static void free_ldt_struct(struct ldt_struct *ldt)
+{
+ if (likely(!ldt))
+ return;
+
+ paravirt_free_ldt(ldt->entries, ldt->nr_entries);
+ if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
+ vfree_atomic(ldt->entries);
+ else
+ free_page((unsigned long)ldt->entries);
+ kfree(ldt);
+}
+
/*
- * we do not have to muck with descriptors here, that is
- * done in switch_mm() as needed.
+ * Called on fork from arch_dup_mmap(). Just copy the current LDT state;
+ * the new task is not running, so nothing can be installed.
*/
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
+int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
{
- struct mm_struct *old_mm;
+ struct ldt_struct *new_ldt;
int retval = 0;
- mutex_init(&mm->context.lock);
- mm->context.size = 0;
- old_mm = current->mm;
- if (old_mm && old_mm->context.size > 0) {
- mutex_lock(&old_mm->context.lock);
- retval = copy_ldt(&mm->context, &old_mm->context);
- mutex_unlock(&old_mm->context.lock);
+ if (!old_mm)
+ return 0;
+
+ mutex_lock(&old_mm->context.lock);
+ if (!old_mm->context.ldt)
+ goto out_unlock;
+
+ new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
+ if (!new_ldt) {
+ retval = -ENOMEM;
+ goto out_unlock;
}
+
+ memcpy(new_ldt->entries, old_mm->context.ldt->entries,
+ new_ldt->nr_entries * LDT_ENTRY_SIZE);
+ finalize_ldt_struct(new_ldt);
+
+ retval = map_ldt_struct(mm, new_ldt, 0);
+ if (retval) {
+ free_ldt_pgtables(mm);
+ free_ldt_struct(new_ldt);
+ goto out_unlock;
+ }
+ mm->context.ldt = new_ldt;
+
+out_unlock:
+ mutex_unlock(&old_mm->context.lock);
return retval;
}
@@ -124,55 +488,54 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
*
* 64bit: Don't touch the LDT register - we're already in the next thread.
*/
-void destroy_context(struct mm_struct *mm)
+void destroy_context_ldt(struct mm_struct *mm)
{
- if (mm->context.size) {
-#ifdef CONFIG_X86_32
- /* CHECKME: Can this ever happen ? */
- if (mm == current->active_mm)
- clear_LDT();
-#endif
- paravirt_free_ldt(mm->context.ldt, mm->context.size);
- if (mm->context.size * LDT_ENTRY_SIZE > PAGE_SIZE)
- vfree(mm->context.ldt);
- else
- put_page(virt_to_page(mm->context.ldt));
- mm->context.size = 0;
- }
+ free_ldt_struct(mm->context.ldt);
+ mm->context.ldt = NULL;
+}
+
+void ldt_arch_exit_mmap(struct mm_struct *mm)
+{
+ free_ldt_pgtables(mm);
}
static int read_ldt(void __user *ptr, unsigned long bytecount)
{
- int err;
- unsigned long size;
struct mm_struct *mm = current->mm;
+ unsigned long entries_size;
+ int retval;
+
+ down_read(&mm->context.ldt_usr_sem);
+
+ if (!mm->context.ldt) {
+ retval = 0;
+ goto out_unlock;
+ }
- if (!mm->context.size)
- return 0;
if (bytecount > LDT_ENTRY_SIZE * LDT_ENTRIES)
bytecount = LDT_ENTRY_SIZE * LDT_ENTRIES;
- mutex_lock(&mm->context.lock);
- size = mm->context.size * LDT_ENTRY_SIZE;
- if (size > bytecount)
- size = bytecount;
+ entries_size = mm->context.ldt->nr_entries * LDT_ENTRY_SIZE;
+ if (entries_size > bytecount)
+ entries_size = bytecount;
- err = 0;
- if (copy_to_user(ptr, mm->context.ldt, size))
- err = -EFAULT;
- mutex_unlock(&mm->context.lock);
- if (err < 0)
- goto error_return;
- if (size != bytecount) {
- /* zero-fill the rest */
- if (clear_user(ptr + size, bytecount - size) != 0) {
- err = -EFAULT;
- goto error_return;
+ if (copy_to_user(ptr, mm->context.ldt->entries, entries_size)) {
+ retval = -EFAULT;
+ goto out_unlock;
+ }
+
+ if (entries_size != bytecount) {
+ /* Zero-fill the rest and pretend we read bytecount bytes. */
+ if (clear_user(ptr + entries_size, bytecount - entries_size)) {
+ retval = -EFAULT;
+ goto out_unlock;
}
}
- return bytecount;
-error_return:
- return err;
+ retval = bytecount;
+
+out_unlock:
+ up_read(&mm->context.ldt_usr_sem);
+ return retval;
}
static int read_default_ldt(void __user *ptr, unsigned long bytecount)
@@ -190,12 +553,36 @@ static int read_default_ldt(void __user *ptr, unsigned long bytecount)
return bytecount;
}
+static bool allow_16bit_segments(void)
+{
+ if (!IS_ENABLED(CONFIG_X86_16BIT))
+ return false;
+
+#ifdef CONFIG_XEN_PV
+ /*
+ * Xen PV does not implement ESPFIX64, which means that 16-bit
+ * segments will not work correctly. Until either Xen PV implements
+ * ESPFIX64 and can signal this fact to the guest or unless someone
+ * provides compelling evidence that allowing broken 16-bit segments
+ * is worthwhile, disallow 16-bit segments under Xen PV.
+ */
+ if (xen_pv_domain()) {
+ pr_info_once("Warning: 16-bit segments do not work correctly in a Xen PV guest\n");
+ return false;
+ }
+#endif
+
+ return true;
+}
+
static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
{
struct mm_struct *mm = current->mm;
+ struct ldt_struct *new_ldt, *old_ldt;
+ unsigned int old_nr_entries, new_nr_entries;
+ struct user_desc ldt_info;
struct desc_struct ldt;
int error;
- struct user_desc ldt_info;
error = -EINVAL;
if (bytecount != sizeof(ldt_info))
@@ -214,39 +601,71 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
goto out;
}
- mutex_lock(&mm->context.lock);
- if (ldt_info.entry_number >= mm->context.size) {
- error = alloc_ldt(&current->mm->context,
- ldt_info.entry_number + 1, 1);
- if (error < 0)
- goto out_unlock;
- }
-
- /* Allow LDTs to be cleared by the user. */
- if (ldt_info.base_addr == 0 && ldt_info.limit == 0) {
- if (oldmode || LDT_empty(&ldt_info)) {
- memset(&ldt, 0, sizeof(ldt));
- goto install;
+ if ((oldmode && !ldt_info.base_addr && !ldt_info.limit) ||
+ LDT_empty(&ldt_info)) {
+ /* The user wants to clear the entry. */
+ memset(&ldt, 0, sizeof(ldt));
+ } else {
+ if (!ldt_info.seg_32bit && !allow_16bit_segments()) {
+ error = -EINVAL;
+ goto out;
}
+
+ fill_ldt(&ldt, &ldt_info);
+ if (oldmode)
+ ldt.avl = 0;
}
- fill_ldt(&ldt, &ldt_info);
- if (oldmode)
- ldt.avl = 0;
+ if (down_write_killable(&mm->context.ldt_usr_sem))
+ return -EINTR;
+
+ old_ldt = mm->context.ldt;
+ old_nr_entries = old_ldt ? old_ldt->nr_entries : 0;
+ new_nr_entries = max(ldt_info.entry_number + 1, old_nr_entries);
+
+ error = -ENOMEM;
+ new_ldt = alloc_ldt_struct(new_nr_entries);
+ if (!new_ldt)
+ goto out_unlock;
- /* Install the new entry ... */
-install:
- write_ldt_entry(mm->context.ldt, ldt_info.entry_number, &ldt);
+ if (old_ldt)
+ memcpy(new_ldt->entries, old_ldt->entries, old_nr_entries * LDT_ENTRY_SIZE);
+
+ new_ldt->entries[ldt_info.entry_number] = ldt;
+ finalize_ldt_struct(new_ldt);
+
+ /*
+ * If we are using PTI, map the new LDT into the userspace pagetables.
+ * If there is already an LDT, use the other slot so that other CPUs
+ * will continue to use the old LDT until install_ldt() switches
+ * them over to the new LDT.
+ */
+ error = map_ldt_struct(mm, new_ldt, old_ldt ? !old_ldt->slot : 0);
+ if (error) {
+ /*
+ * This can only fail for the first LDT setup. If an LDT is
+ * already installed then the PTE page is already
+ * populated. Mop up a half-populated page table.
+ */
+ if (!WARN_ON_ONCE(old_ldt))
+ free_ldt_pgtables(mm);
+ free_ldt_struct(new_ldt);
+ goto out_unlock;
+ }
+
+ install_ldt(mm, new_ldt);
+ unmap_ldt_struct(mm, old_ldt);
+ free_ldt_struct(old_ldt);
error = 0;
out_unlock:
- mutex_unlock(&mm->context.lock);
+ up_write(&mm->context.ldt_usr_sem);
out:
return error;
}
-asmlinkage int sys_modify_ldt(int func, void __user *ptr,
- unsigned long bytecount)
+SYSCALL_DEFINE3(modify_ldt, int , func , void __user * , ptr ,
+ unsigned long , bytecount)
{
int ret = -ENOSYS;
@@ -264,5 +683,14 @@ asmlinkage int sys_modify_ldt(int func, void __user *ptr,
ret = write_ldt(ptr, bytecount, 0);
break;
}
- return ret;
+ /*
+ * The SYSCALL_DEFINE() macros give us an 'unsigned long'
+ * return type, but the ABI for sys_modify_ldt() expects
+ * 'int'. This cast gives us an int-sized value in %rax
+ * for the return code. The 'unsigned' is necessary so
+ * the compiler does not try to sign-extend the negative
+ * return codes into the high half of the register when
+ * taking the value from int->long.
+ */
+ return (unsigned int)ret;
}
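
From userspace, the interface reworked above is still reached through the classic modify_ldt(2) call. A minimal sketch that installs one flat 32-bit data segment; func 0x11 is the user_desc-format write handled by write_ldt() above, and the entry number and descriptor contents here are purely illustrative:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/ldt.h>	/* struct user_desc, MODIFY_LDT_CONTENTS_* */

int main(void)
{
	struct user_desc desc;

	memset(&desc, 0, sizeof(desc));
	desc.entry_number   = 0;
	desc.base_addr      = 0;	/* flat base */
	desc.limit          = 0xfffff;	/* 4 GiB with limit_in_pages */
	desc.seg_32bit      = 1;
	desc.contents       = MODIFY_LDT_CONTENTS_DATA;
	desc.limit_in_pages = 1;
	desc.useable        = 1;

	/* func 0x11 == write an LDT entry in user_desc format. */
	if (syscall(SYS_modify_ldt, 0x11, &desc, sizeof(desc)) != 0) {
		perror("modify_ldt");
		return 1;
	}
	printf("installed LDT entry %u\n", desc.entry_number);
	return 0;
}

Each such call goes through the full alloc/copy/map/install sequence above, which is why the comments note that modify_ldt() is a slow path used mainly by legacy code and emulators.
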
diff --git a/arch/x86/kernel/machine_kexec_32.c b/arch/x86/kernel/machine_kexec_32.c
index a3fa43ba5d3b..1f325304c4a8 100644
--- a/arch/x86/kernel/machine_kexec_32.c
+++ b/arch/x86/kernel/machine_kexec_32.c
@@ -1,55 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* handle transition of Linux booting another kernel
* Copyright (C) 2002-2005 Eric Biederman <ebiederm@xmission.com>
- *
- * This source code is licensed under the GNU General Public License,
- * Version 2. See the file COPYING for more details.
*/
#include <linux/mm.h>
#include <linux/kexec.h>
#include <linux/delay.h>
-#include <linux/init.h>
#include <linux/numa.h>
#include <linux/ftrace.h>
#include <linux/suspend.h>
#include <linux/gfp.h>
#include <linux/io.h>
-#include <asm/pgtable.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
#include <asm/apic.h>
+#include <asm/io_apic.h>
#include <asm/cpufeature.h>
#include <asm/desc.h>
-#include <asm/system.h>
-#include <asm/cacheflush.h>
+#include <asm/set_memory.h>
#include <asm/debugreg.h>
-static void set_idt(void *newidt, __u16 limit)
-{
- struct desc_ptr curidt;
-
- /* ia32 supports unaliged loads & stores */
- curidt.size = limit;
- curidt.address = (unsigned long)newidt;
-
- load_idt(&curidt);
-}
-
-
-static void set_gdt(void *newgdt, __u16 limit)
-{
- struct desc_ptr curgdt;
-
- /* ia32 supports unaligned loads & stores */
- curgdt.size = limit;
- curgdt.address = (unsigned long)newgdt;
-
- load_gdt(&curgdt);
-}
-
static void load_segments(void)
{
#define __STR(X) #X
@@ -61,8 +34,6 @@ static void load_segments(void)
"\tmovl $"STR(__KERNEL_DS)",%%eax\n"
"\tmovl %%eax,%%ds\n"
"\tmovl %%eax,%%es\n"
- "\tmovl %%eax,%%fs\n"
- "\tmovl %%eax,%%gs\n"
"\tmovl %%eax,%%ss\n"
: : : "eax", "memory");
#undef STR
@@ -71,18 +42,24 @@ static void load_segments(void)
static void machine_kexec_free_page_tables(struct kimage *image)
{
- free_page((unsigned long)image->arch.pgd);
+ free_pages((unsigned long)image->arch.pgd, pgd_allocation_order());
+ image->arch.pgd = NULL;
#ifdef CONFIG_X86_PAE
free_page((unsigned long)image->arch.pmd0);
+ image->arch.pmd0 = NULL;
free_page((unsigned long)image->arch.pmd1);
+ image->arch.pmd1 = NULL;
#endif
free_page((unsigned long)image->arch.pte0);
+ image->arch.pte0 = NULL;
free_page((unsigned long)image->arch.pte1);
+ image->arch.pte1 = NULL;
}
static int machine_kexec_alloc_page_tables(struct kimage *image)
{
- image->arch.pgd = (pgd_t *)get_zeroed_page(GFP_KERNEL);
+ image->arch.pgd = (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ pgd_allocation_order());
#ifdef CONFIG_X86_PAE
image->arch.pmd0 = (pmd_t *)get_zeroed_page(GFP_KERNEL);
image->arch.pmd1 = (pmd_t *)get_zeroed_page(GFP_KERNEL);
@@ -94,7 +71,6 @@ static int machine_kexec_alloc_page_tables(struct kimage *image)
!image->arch.pmd0 || !image->arch.pmd1 ||
#endif
!image->arch.pte0 || !image->arch.pte1) {
- machine_kexec_free_page_tables(image);
return -ENOMEM;
}
return 0;
@@ -104,6 +80,7 @@ static void machine_kexec_page_table_set_one(
pgd_t *pgd, pmd_t *pmd, pte_t *pte,
unsigned long vaddr, unsigned long paddr)
{
+ p4d_t *p4d;
pud_t *pud;
pgd += pgd_index(vaddr);
@@ -111,7 +88,8 @@ static void machine_kexec_page_table_set_one(
if (!(pgd_val(*pgd) & _PAGE_PRESENT))
set_pgd(pgd, __pgd(__pa(pmd) | _PAGE_PRESENT));
#endif
- pud = pud_offset(pgd, vaddr);
+ p4d = p4d_offset(pgd, vaddr);
+ pud = pud_offset(p4d, vaddr);
pmd = pmd_offset(pud, vaddr);
if (!(pmd_val(*pmd) & _PAGE_PRESENT))
set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
@@ -158,7 +136,7 @@ int machine_kexec_prepare(struct kimage *image)
{
int error;
- set_pages_x(image->control_code_page, 1);
+ set_memory_x((unsigned long)page_address(image->control_code_page), 1);
error = machine_kexec_alloc_page_tables(image);
if (error)
return error;
@@ -172,7 +150,7 @@ int machine_kexec_prepare(struct kimage *image)
*/
void machine_kexec_cleanup(struct kimage *image)
{
- set_pages_nx(image->control_code_page, 1);
+ set_memory_nx((unsigned long)page_address(image->control_code_page), 1);
machine_kexec_free_page_tables(image);
}
@@ -182,15 +160,10 @@ void machine_kexec_cleanup(struct kimage *image)
*/
void machine_kexec(struct kimage *image)
{
+ relocate_kernel_fn *relocate_kernel_ptr;
unsigned long page_list[PAGES_NR];
void *control_page;
int save_ftrace_enabled;
- asmlinkage unsigned long
- (*relocate_kernel_ptr)(unsigned long indirection_page,
- unsigned long control_page,
- unsigned long start_address,
- unsigned int has_pae,
- unsigned int preserve_context);
#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
@@ -208,11 +181,11 @@ void machine_kexec(struct kimage *image)
/*
* We need to put APICs in legacy mode so that we can
* get timer interrupts in second kernel. kexec/kdump
- * paths already have calls to disable_IO_APIC() in
- * one form or other. kexec jump path also need
- * one.
+ * paths already have calls to restore_boot_irq_mode()
+ * in one form or another. The kexec jump path also needs one.
*/
- disable_IO_APIC();
+ clear_IO_APIC();
+ restore_boot_irq_mode();
#endif
}
@@ -243,13 +216,14 @@ void machine_kexec(struct kimage *image)
* The gdt & idt are now invalid.
* If you want to load them you must set up your own idt & gdt.
*/
- set_gdt(phys_to_virt(0), 0);
- set_idt(phys_to_virt(0), 0);
+ native_idt_invalidate();
+ native_gdt_invalidate();
/* now call it */
image->start = relocate_kernel_ptr((unsigned long)image->head,
(unsigned long)page_list,
- image->start, cpu_has_pae,
+ image->start,
+ boot_cpu_has(X86_FEATURE_PAE),
image->preserve_context);
#ifdef CONFIG_KEXEC_JUMP
@@ -259,15 +233,3 @@ void machine_kexec(struct kimage *image)
__ftrace_enabled_restore(save_ftrace_enabled);
}
-
-void arch_crash_save_vmcoreinfo(void)
-{
-#ifdef CONFIG_NUMA
- VMCOREINFO_SYMBOL(node_data);
- VMCOREINFO_LENGTH(node_data, MAX_NUMNODES);
-#endif
-#ifdef CONFIG_X86_PAE
- VMCOREINFO_CONFIG(X86_PAE);
-#endif
-}
-
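
For context, machine_kexec() above sits at the end of a path that userspace enters with kexec_file_load(2) followed by reboot(RB_KEXEC). A hedged sketch of that driver side, assuming an x86-64 userspace where SYS_kexec_file_load is available; the /boot paths are placeholders, CAP_SYS_BOOT is required, and error handling is minimal:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/reboot.h>
#include <sys/syscall.h>

int main(void)
{
	const char *cmdline = "root=/dev/sda1 ro";	/* placeholder */
	int kernel_fd = open("/boot/vmlinuz", O_RDONLY);
	int initrd_fd = open("/boot/initrd.img", O_RDONLY);

	if (kernel_fd < 0 || initrd_fd < 0) {
		perror("open");
		return 1;
	}

	/* Stage the new kernel; the in-kernel loader builds the segments. */
	if (syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
		    strlen(cmdline) + 1, cmdline, 0UL) != 0) {
		perror("kexec_file_load");
		return 1;
	}

	/* Jump into it; this is what ends up in machine_kexec(). */
	reboot(RB_KEXEC);
	return 0;
}

The cmdline length must include the terminating NUL, and kexec_file_load() is only available on configs that select CONFIG_KEXEC_FILE (cf. kexec_file_loaders in machine_kexec_64.c below).
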
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 035c8c529181..201137b98fb8 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -1,11 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* handle transition of Linux booting another kernel
* Copyright (C) 2002-2005 Eric Biederman <ebiederm@xmission.com>
- *
- * This source code is licensed under the GNU General Public License,
- * Version 2. See the file COPYING for more details.
*/
+#define pr_fmt(fmt) "kexec: " fmt
+
#include <linux/mm.h>
#include <linux/kexec.h>
#include <linux/string.h>
@@ -15,152 +15,186 @@
#include <linux/ftrace.h>
#include <linux/io.h>
#include <linux/suspend.h>
+#include <linux/vmalloc.h>
+#include <linux/efi.h>
+#include <linux/cc_platform.h>
-#include <asm/pgtable.h>
+#include <asm/init.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
+#include <asm/io_apic.h>
#include <asm/debugreg.h>
+#include <asm/kexec-bzimage64.h>
+#include <asm/setup.h>
+#include <asm/set_memory.h>
+#include <asm/cpu.h>
+#include <asm/efi.h>
+#include <asm/processor.h>
+
+#ifdef CONFIG_ACPI
+/*
+ * Used while adding mappings for ACPI tables.
+ * Can be reused when other iomem regions need to be mapped.
+ */
+struct init_pgtable_data {
+ struct x86_mapping_info *info;
+ pgd_t *level4p;
+};
-static int init_one_level2_page(struct kimage *image, pgd_t *pgd,
- unsigned long addr)
+static int mem_region_callback(struct resource *res, void *arg)
{
- pud_t *pud;
- pmd_t *pmd;
- struct page *page;
- int result = -ENOMEM;
+ struct init_pgtable_data *data = arg;
- addr &= PMD_MASK;
- pgd += pgd_index(addr);
- if (!pgd_present(*pgd)) {
- page = kimage_alloc_control_pages(image, 0);
- if (!page)
- goto out;
- pud = (pud_t *)page_address(page);
- memset(pud, 0, PAGE_SIZE);
- set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
- }
- pud = pud_offset(pgd, addr);
- if (!pud_present(*pud)) {
- page = kimage_alloc_control_pages(image, 0);
- if (!page)
- goto out;
- pmd = (pmd_t *)page_address(page);
- memset(pmd, 0, PAGE_SIZE);
- set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE));
- }
- pmd = pmd_offset(pud, addr);
- if (!pmd_present(*pmd))
- set_pmd(pmd, __pmd(addr | __PAGE_KERNEL_LARGE_EXEC));
- result = 0;
-out:
- return result;
+ return kernel_ident_mapping_init(data->info, data->level4p,
+ res->start, res->end + 1);
}
-static void init_level2_page(pmd_t *level2p, unsigned long addr)
+static int
+map_acpi_tables(struct x86_mapping_info *info, pgd_t *level4p)
{
- unsigned long end_addr;
+ struct init_pgtable_data data;
+ unsigned long flags;
+ int ret;
- addr &= PAGE_MASK;
- end_addr = addr + PUD_SIZE;
- while (addr < end_addr) {
- set_pmd(level2p++, __pmd(addr | __PAGE_KERNEL_LARGE_EXEC));
- addr += PMD_SIZE;
- }
+ data.info = info;
+ data.level4p = level4p;
+ flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+
+ ret = walk_iomem_res_desc(IORES_DESC_ACPI_TABLES, flags, 0, -1,
+ &data, mem_region_callback);
+ if (ret && ret != -EINVAL)
+ return ret;
+
+ /* ACPI tables could be located in ACPI Non-volatile Storage region */
+ ret = walk_iomem_res_desc(IORES_DESC_ACPI_NV_STORAGE, flags, 0, -1,
+ &data, mem_region_callback);
+ if (ret && ret != -EINVAL)
+ return ret;
+
+ return 0;
}
+#else
+static int map_acpi_tables(struct x86_mapping_info *info, pgd_t *level4p) { return 0; }
+#endif
-static int init_level3_page(struct kimage *image, pud_t *level3p,
- unsigned long addr, unsigned long last_addr)
+static int map_mmio_serial(struct x86_mapping_info *info, pgd_t *level4p)
{
- unsigned long end_addr;
- int result;
+ unsigned long mstart, mend;
- result = 0;
- addr &= PAGE_MASK;
- end_addr = addr + PGDIR_SIZE;
- while ((addr < last_addr) && (addr < end_addr)) {
- struct page *page;
- pmd_t *level2p;
-
- page = kimage_alloc_control_pages(image, 0);
- if (!page) {
- result = -ENOMEM;
- goto out;
- }
- level2p = (pmd_t *)page_address(page);
- init_level2_page(level2p, addr);
- set_pud(level3p++, __pud(__pa(level2p) | _KERNPG_TABLE));
- addr += PUD_SIZE;
- }
- /* clear the unused entries */
- while (addr < end_addr) {
- pud_clear(level3p++);
- addr += PUD_SIZE;
- }
-out:
- return result;
+ if (!kexec_debug_8250_mmio32)
+ return 0;
+
+ mstart = kexec_debug_8250_mmio32 & PAGE_MASK;
+ mend = (kexec_debug_8250_mmio32 + PAGE_SIZE + 23) & PAGE_MASK;
+ pr_info("Map PCI serial at %lx - %lx\n", mstart, mend);
+ return kernel_ident_mapping_init(info, level4p, mstart, mend);
}
+#ifdef CONFIG_KEXEC_FILE
+const struct kexec_file_ops * const kexec_file_loaders[] = {
+ &kexec_bzImage64_ops,
+ NULL
+};
+#endif
-static int init_level4_page(struct kimage *image, pgd_t *level4p,
- unsigned long addr, unsigned long last_addr)
+static int
+map_efi_systab(struct x86_mapping_info *info, pgd_t *level4p)
{
- unsigned long end_addr;
- int result;
+#ifdef CONFIG_EFI
+ unsigned long mstart, mend;
+ void *kaddr;
+ int ret;
- result = 0;
- addr &= PAGE_MASK;
- end_addr = addr + (PTRS_PER_PGD * PGDIR_SIZE);
- while ((addr < last_addr) && (addr < end_addr)) {
- struct page *page;
- pud_t *level3p;
-
- page = kimage_alloc_control_pages(image, 0);
- if (!page) {
- result = -ENOMEM;
- goto out;
- }
- level3p = (pud_t *)page_address(page);
- result = init_level3_page(image, level3p, addr, last_addr);
- if (result)
- goto out;
- set_pgd(level4p++, __pgd(__pa(level3p) | _KERNPG_TABLE));
- addr += PGDIR_SIZE;
+ if (!efi_enabled(EFI_BOOT))
+ return 0;
+
+ mstart = (boot_params.efi_info.efi_systab |
+ ((u64)boot_params.efi_info.efi_systab_hi<<32));
+
+ if (efi_enabled(EFI_64BIT))
+ mend = mstart + sizeof(efi_system_table_64_t);
+ else
+ mend = mstart + sizeof(efi_system_table_32_t);
+
+ if (!mstart)
+ return 0;
+
+ ret = kernel_ident_mapping_init(info, level4p, mstart, mend);
+ if (ret)
+ return ret;
+
+ kaddr = memremap(mstart, mend - mstart, MEMREMAP_WB);
+ if (!kaddr) {
+ pr_err("Could not map UEFI system table\n");
+ return -ENOMEM;
}
- /* clear the unused entries */
- while (addr < end_addr) {
- pgd_clear(level4p++);
- addr += PGDIR_SIZE;
+
+ mstart = efi_config_table;
+
+ if (efi_enabled(EFI_64BIT)) {
+ efi_system_table_64_t *stbl = (efi_system_table_64_t *)kaddr;
+
+ mend = mstart + sizeof(efi_config_table_64_t) * stbl->nr_tables;
+ } else {
+ efi_system_table_32_t *stbl = (efi_system_table_32_t *)kaddr;
+
+ mend = mstart + sizeof(efi_config_table_32_t) * stbl->nr_tables;
}
-out:
- return result;
+
+ memunmap(kaddr);
+
+ return kernel_ident_mapping_init(info, level4p, mstart, mend);
+#endif
+ return 0;
}
static void free_transition_pgtable(struct kimage *image)
{
+ free_page((unsigned long)image->arch.p4d);
+ image->arch.p4d = NULL;
free_page((unsigned long)image->arch.pud);
+ image->arch.pud = NULL;
free_page((unsigned long)image->arch.pmd);
+ image->arch.pmd = NULL;
free_page((unsigned long)image->arch.pte);
+ image->arch.pte = NULL;
}
-static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
+static int init_transition_pgtable(struct kimage *image, pgd_t *pgd,
+ unsigned long control_page)
{
+ pgprot_t prot = PAGE_KERNEL_EXEC_NOENC;
+ unsigned long vaddr, paddr;
+ int result = -ENOMEM;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
- unsigned long vaddr, paddr;
- int result = -ENOMEM;
- vaddr = (unsigned long)relocate_kernel;
- paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
+ /*
+ * For the transition to the identity mapped page tables, the control
+ * code page also needs to be mapped at the virtual address it starts
+ * off running from.
+ */
+ vaddr = (unsigned long)__va(control_page);
+ paddr = control_page;
pgd += pgd_index(vaddr);
if (!pgd_present(*pgd)) {
+ p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
+ if (!p4d)
+ goto err;
+ image->arch.p4d = p4d;
+ set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE));
+ }
+ p4d = p4d_offset(pgd, vaddr);
+ if (!p4d_present(*p4d)) {
pud = (pud_t *)get_zeroed_page(GFP_KERNEL);
if (!pud)
goto err;
image->arch.pud = pud;
- set_pgd(pgd, __pgd(__pa(pud) | _KERNPG_TABLE));
+ set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE));
}
- pud = pud_offset(pgd, vaddr);
+ pud = pud_offset(p4d, vaddr);
if (!pud_present(*pud)) {
pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL);
if (!pmd)
@@ -177,60 +211,104 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE));
}
pte = pte_offset_kernel(pmd, vaddr);
- set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
+
+ if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT))
+ prot = PAGE_KERNEL_EXEC;
+
+ set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot));
return 0;
err:
- free_transition_pgtable(image);
return result;
}
+static void *alloc_pgt_page(void *data)
+{
+ struct kimage *image = (struct kimage *)data;
+ struct page *page;
+ void *p = NULL;
+
+ page = kimage_alloc_control_pages(image, 0);
+ if (page) {
+ p = page_address(page);
+ clear_page(p);
+ }
-static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
+ return p;
+}
+
+static int init_pgtable(struct kimage *image, unsigned long control_page)
{
- pgd_t *level4p;
+ struct x86_mapping_info info = {
+ .alloc_pgt_page = alloc_pgt_page,
+ .context = image,
+ .page_flag = __PAGE_KERNEL_LARGE_EXEC,
+ .kernpg_flag = _KERNPG_TABLE_NOENC,
+ };
+ unsigned long mstart, mend;
int result;
- level4p = (pgd_t *)__va(start_pgtable);
- result = init_level4_page(image, level4p, 0, max_pfn << PAGE_SHIFT);
- if (result)
- return result;
+ int i;
+
+ image->arch.pgd = alloc_pgt_page(image);
+ if (!image->arch.pgd)
+ return -ENOMEM;
+
+ if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
+ info.page_flag |= _PAGE_ENC;
+ info.kernpg_flag |= _PAGE_ENC;
+ }
+
+ if (direct_gbpages)
+ info.direct_gbpages = true;
+
+ for (i = 0; i < nr_pfn_mapped; i++) {
+ mstart = pfn_mapped[i].start << PAGE_SHIFT;
+ mend = pfn_mapped[i].end << PAGE_SHIFT;
+
+ result = kernel_ident_mapping_init(&info, image->arch.pgd,
+ mstart, mend);
+ if (result)
+ return result;
+ }
+
/*
- * image->start may be outside 0 ~ max_pfn, for example when
- * jump back to original kernel from kexeced kernel
+ * The segments' memory ranges could be outside 0 ~ max_pfn, for
+ * example when jumping back to the original kernel from the kexeced
+ * kernel, or when the first kernel was booted with a user memory map
+ * and the second kernel is loaded outside that range.
*/
- result = init_one_level2_page(image, level4p, image->start);
- if (result)
- return result;
- return init_transition_pgtable(image, level4p);
-}
-
-static void set_idt(void *newidt, u16 limit)
-{
- struct desc_ptr curidt;
+ for (i = 0; i < image->nr_segments; i++) {
+ mstart = image->segment[i].mem;
+ mend = mstart + image->segment[i].memsz;
- /* x86-64 supports unaligned loads & stores */
- curidt.size = limit;
- curidt.address = (unsigned long)newidt;
+ result = kernel_ident_mapping_init(&info, image->arch.pgd,
+ mstart, mend);
- __asm__ __volatile__ (
- "lidtq %0\n"
- : : "m" (curidt)
- );
-};
+ if (result)
+ return result;
+ }
+ /*
+ * Prepare EFI systab and ACPI tables for kexec kernel since they are
+ * not covered by pfn_mapped.
+ */
+ result = map_efi_systab(&info, image->arch.pgd);
+ if (result)
+ return result;
-static void set_gdt(void *newgdt, u16 limit)
-{
- struct desc_ptr curgdt;
+ result = map_acpi_tables(&info, image->arch.pgd);
+ if (result)
+ return result;
- /* x86-64 supports unaligned loads & stores */
- curgdt.size = limit;
- curgdt.address = (unsigned long)newgdt;
+ result = map_mmio_serial(&info, image->arch.pgd);
+ if (result)
+ return result;
- __asm__ __volatile__ (
- "lgdtq %0\n"
- : : "m" (curgdt)
- );
-};
+ /*
+ * This must be last because the intermediate page table pages it
+ * allocates will not be control pages and may overlap the image.
+ */
+ return init_transition_pgtable(image, image->arch.pgd, control_page);
+}
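/*
 * Editor's illustrative sketch, not part of this patch: the
 * allocator-callback pattern used by x86_mapping_info above -- a function
 * pointer plus an opaque context that the callback casts back to its real
 * type, exactly as alloc_pgt_page() recovers the kimage. All names are
 * example-local; the userspace calloc() stands in for get_zeroed_page().
 */
#include <stdio.h>
#include <stdlib.h>

struct ex_image { int pages_allocated; };

struct ex_mapping_info {
	void *(*alloc_pgt_page)(void *context);
	void *context;
};

static void *ex_alloc_pgt_page(void *context)
{
	struct ex_image *image = context;	/* recover the typed context */

	image->pages_allocated++;
	return calloc(1, 4096);			/* a zeroed "page" */
}

int main(void)
{
	struct ex_image image = { 0 };
	struct ex_mapping_info info = {
		.alloc_pgt_page = ex_alloc_pgt_page,
		.context = &image,
	};
	void *pgt = info.alloc_pgt_page(info.context);

	printf("pages_allocated=%d pgt=%p\n", image.pages_allocated, pgt);
	free(pgt);
	return 0;
}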
static void load_segments(void)
{
@@ -244,24 +322,74 @@ static void load_segments(void)
);
}
+static void prepare_debug_idt(unsigned long control_page, unsigned long vec_ofs)
+{
+ gate_desc idtentry = { 0 };
+ int i;
+
+ idtentry.bits.p = 1;
+ idtentry.bits.type = GATE_TRAP;
+ idtentry.segment = __KERNEL_CS;
+ idtentry.offset_low = (control_page & 0xFFFF) + vec_ofs;
+ idtentry.offset_middle = (control_page >> 16) & 0xFFFF;
+ idtentry.offset_high = control_page >> 32;
+
+ for (i = 0; i < 16; i++) {
+ kexec_debug_idt[i] = idtentry;
+ idtentry.offset_low += KEXEC_DEBUG_EXC_HANDLER_SIZE;
+ }
+}
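/*
 * Editor's illustrative sketch, not part of this patch: splitting a 64-bit
 * handler address into the three offset fields of an x86-64 gate
 * descriptor, as prepare_debug_idt() does. The struct is a simplified
 * stand-in for the kernel's gate_desc; the round trip verifies the split.
 */
#include <stdio.h>
#include <stdint.h>

struct ex_gate {
	uint16_t offset_low;
	uint16_t offset_middle;
	uint32_t offset_high;
};

static struct ex_gate ex_make_gate(uint64_t handler)
{
	struct ex_gate g = {
		.offset_low    = handler & 0xFFFF,
		.offset_middle = (handler >> 16) & 0xFFFF,
		.offset_high   = handler >> 32,
	};
	return g;
}

int main(void)
{
	uint64_t handler = 0xffffffff81234567ULL;
	struct ex_gate g = ex_make_gate(handler);
	uint64_t back = (uint64_t)g.offset_high << 32 |
			(uint64_t)g.offset_middle << 16 | g.offset_low;

	printf("handler=%#llx reassembled=%#llx\n",
	       (unsigned long long)handler, (unsigned long long)back);
	return 0;
}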
+
int machine_kexec_prepare(struct kimage *image)
{
- unsigned long start_pgtable;
+ void *control_page = page_address(image->control_code_page);
+ unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
+ unsigned long reloc_end = (unsigned long)__relocate_kernel_end;
int result;
- /* Calculate the offsets */
- start_pgtable = page_to_pfn(image->control_code_page) << PAGE_SHIFT;
+ /*
+ * Some early TDX-capable platforms have an erratum: a partial
+ * write from the kernel (a write transaction of less than a
+ * cacheline arriving at the memory controller) to TDX private
+ * memory poisons that memory, and a subsequent read triggers a
+ * machine check.
+ *
+ * On those platforms the old kernel must reset TDX private
+ * memory before jumping to the new kernel, otherwise the new
+ * kernel may see unexpected machine checks. For simplicity,
+ * just fail kexec/kdump on those platforms.
+ */
+ if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE)) {
+ pr_info_once("Not allowed on platform with tdx_pw_mce bug\n");
+ return -EOPNOTSUPP;
+ }
/* Setup the identity mapped 64bit page table */
- result = init_pgtable(image, start_pgtable);
+ result = init_pgtable(image, __pa(control_page));
if (result)
return result;
+ kexec_va_control_page = (unsigned long)control_page;
+ kexec_pa_table_page = (unsigned long)__pa(image->arch.pgd);
+
+ if (image->type == KEXEC_TYPE_DEFAULT)
+ kexec_pa_swap_page = page_to_pfn(image->swap_page) << PAGE_SHIFT;
+
+ prepare_debug_idt((unsigned long)__pa(control_page),
+ (unsigned long)kexec_debug_exc_vectors - reloc_start);
+
+ __memcpy(control_page, __relocate_kernel_start, reloc_end - reloc_start);
+
+ set_memory_rox((unsigned long)control_page, 1);
return 0;
}
void machine_kexec_cleanup(struct kimage *image)
{
+ void *control_page = page_address(image->control_code_page);
+
+ set_memory_nx((unsigned long)control_page, 1);
+ set_memory_rw((unsigned long)control_page, 1);
+
free_transition_pgtable(image);
}
@@ -269,11 +397,13 @@ void machine_kexec_cleanup(struct kimage *image)
* Do not allocate memory (or fail in any way) in machine_kexec().
* We are past the point of no return, committed to rebooting now.
*/
-void machine_kexec(struct kimage *image)
+void __nocfi machine_kexec(struct kimage *image)
{
- unsigned long page_list[PAGES_NR];
- void *control_page;
+ unsigned long reloc_start = (unsigned long)__relocate_kernel_start;
+ relocate_kernel_fn *relocate_kernel_ptr;
+ unsigned int relocate_kernel_flags;
int save_ftrace_enabled;
+ void *control_page;
#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
@@ -285,31 +415,39 @@ void machine_kexec(struct kimage *image)
/* Interrupts aren't acceptable while we reboot */
local_irq_disable();
hw_breakpoint_disable();
+ cet_disable();
if (image->preserve_context) {
#ifdef CONFIG_X86_IO_APIC
/*
* We need to put APICs in legacy mode so that we can
* get timer interrupts in the second kernel. kexec/kdump
- * paths already have calls to disable_IO_APIC() in
- * one form or other. kexec jump path also need
- * one.
+ * paths already have calls to restore_boot_irq_mode()
+ * in one form or another. The kexec jump path also needs one.
*/
- disable_IO_APIC();
+ clear_IO_APIC();
+ restore_boot_irq_mode();
#endif
}
- control_page = page_address(image->control_code_page) + PAGE_SIZE;
- memcpy(control_page, relocate_kernel, KEXEC_CONTROL_CODE_MAX_SIZE);
+ control_page = page_address(image->control_code_page);
- page_list[PA_CONTROL_PAGE] = virt_to_phys(control_page);
- page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;
- page_list[PA_TABLE_PAGE] =
- (unsigned long)__pa(page_address(image->control_code_page));
+ /*
+ * Allow for the possibility that relocate_kernel might not be at
+ * the very start of the page.
+ */
+ relocate_kernel_ptr = control_page + (unsigned long)relocate_kernel - reloc_start;
- if (image->type == KEXEC_TYPE_DEFAULT)
- page_list[PA_SWAP_PAGE] = (page_to_pfn(image->swap_page)
- << PAGE_SHIFT);
+ relocate_kernel_flags = 0;
+ if (image->preserve_context)
+ relocate_kernel_flags |= RELOC_KERNEL_PRESERVE_CONTEXT;
+
+ /*
+ * This must be done before load_segments(), which resets GS
+ * to 0; percpu data needs the correct GS to work.
+ */
+ if (this_cpu_read(cache_state_incoherent))
+ relocate_kernel_flags |= RELOC_KERNEL_CACHE_INCOHERENT;
/*
* The segment registers are funny things, they have both a
@@ -318,22 +456,21 @@ void machine_kexec(struct kimage *image)
* with from a table in memory. At no other time is the
* descriptor table in memory accessed.
*
- * I take advantage of this here by force loading the
- * segments, before I zap the gdt with an invalid value.
+ * Take advantage of this here by force loading the segments,
+ * before the GDT is zapped with an invalid value.
+ *
+ * load_segments() resets GS to 0. Don't make any function call
+ * after here since call depth tracking uses percpu variables to
+ * operate (relocate_kernel() is explicitly ignored by call depth
+ * tracking).
*/
load_segments();
- /*
- * The gdt & idt are now invalid.
- * If you want to load them you must set up your own idt & gdt.
- */
- set_gdt(phys_to_virt(0), 0);
- set_idt(phys_to_virt(0), 0);
/* now call it */
- image->start = relocate_kernel((unsigned long)image->head,
- (unsigned long)page_list,
- image->start,
- image->preserve_context);
+ image->start = relocate_kernel_ptr((unsigned long)image->head,
+ virt_to_phys(control_page),
+ image->start,
+ relocate_kernel_flags);
#ifdef CONFIG_KEXEC_JUMP
if (image->preserve_context)
@@ -342,15 +479,250 @@ void machine_kexec(struct kimage *image)
__ftrace_enabled_restore(save_ftrace_enabled);
}
+/*
+ * Handover to the next kernel, no CFI concern.
+ */
+ANNOTATE_NOCFI_SYM(machine_kexec);
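/*
 * Editor's illustrative sketch, not part of this patch: the offset
 * arithmetic machine_kexec() uses for relocate_kernel_ptr -- a function's
 * address inside a copied buffer is the copy's base plus the function's
 * offset within the original region. All names are example-local.
 */
#include <stdio.h>
#include <string.h>

static const char region[] = "....ENTRY....";	/* stands in for __relocate_kernel_start..end */

int main(void)
{
	char control_page[sizeof(region)];
	const char *entry = region + 4;		/* stands in for relocate_kernel */

	memcpy(control_page, region, sizeof(region));

	/* same offset from the start of the copy as from the original */
	const char *entry_in_copy = control_page + (entry - region);

	printf("entry in copy: %.5s\n", entry_in_copy);
	return 0;
}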
+
+/* arch-dependent functionality related to kexec file-based syscall */
+
+#ifdef CONFIG_KEXEC_FILE
+/*
+ * Apply purgatory relocations.
+ *
+ * @pi: Purgatory to be relocated.
+ * @section: Section relocations applying to.
+ * @relsec: Section containing RELAs.
+ * @symtabsec: Corresponding symtab.
+ *
+ * TODO: Some of this code belongs in generic code. Move it to kexec.c.
+ */
+int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
+ Elf_Shdr *section, const Elf_Shdr *relsec,
+ const Elf_Shdr *symtabsec)
+{
+ unsigned int i;
+ Elf64_Rela *rel;
+ Elf64_Sym *sym;
+ void *location;
+ unsigned long address, sec_base, value;
+ const char *strtab, *name, *shstrtab;
+ const Elf_Shdr *sechdrs;
+
+ /* String & section header string table */
+ sechdrs = (void *)pi->ehdr + pi->ehdr->e_shoff;
+ strtab = (char *)pi->ehdr + sechdrs[symtabsec->sh_link].sh_offset;
+ shstrtab = (char *)pi->ehdr + sechdrs[pi->ehdr->e_shstrndx].sh_offset;
+
+ rel = (void *)pi->ehdr + relsec->sh_offset;
+
+ pr_debug("Applying relocate section %s to %u\n",
+ shstrtab + relsec->sh_name, relsec->sh_info);
+
+ for (i = 0; i < relsec->sh_size / sizeof(*rel); i++) {
+
+ /*
+ * rel[i].r_offset contains the byte offset from the beginning
+ * of the section to the storage unit affected.
+ *
+ * This is the location to update. It lies in the temporary
+ * buffer where the section is currently loaded; the section
+ * will finally be loaded at a different address later, pointed
+ * to by ->sh_addr. kexec takes care of moving it
+ * (kexec_load_segment()).
+ */
+ location = pi->purgatory_buf;
+ location += section->sh_offset;
+ location += rel[i].r_offset;
+
+ /* Final address of the location */
+ address = section->sh_addr + rel[i].r_offset;
+
+ /*
+ * rel[i].r_info encodes the symbol table index against which
+ * the relocation must be made and the type of relocation to
+ * apply. The ELF64_R_SYM() and ELF64_R_TYPE() macros extract
+ * these respectively.
+ */
+ sym = (void *)pi->ehdr + symtabsec->sh_offset;
+ sym += ELF64_R_SYM(rel[i].r_info);
+
+ if (sym->st_name)
+ name = strtab + sym->st_name;
+ else
+ name = shstrtab + sechdrs[sym->st_shndx].sh_name;
+
+ pr_debug("Symbol: %s info: %02x shndx: %02x value=%llx size: %llx\n",
+ name, sym->st_info, sym->st_shndx, sym->st_value,
+ sym->st_size);
+
+ if (sym->st_shndx == SHN_UNDEF) {
+ pr_err("Undefined symbol: %s\n", name);
+ return -ENOEXEC;
+ }
+
+ if (sym->st_shndx == SHN_COMMON) {
+ pr_err("symbol '%s' in common section\n", name);
+ return -ENOEXEC;
+ }
+
+ if (sym->st_shndx == SHN_ABS)
+ sec_base = 0;
+ else if (sym->st_shndx >= pi->ehdr->e_shnum) {
+ pr_err("Invalid section %d for symbol %s\n",
+ sym->st_shndx, name);
+ return -ENOEXEC;
+ } else
+ sec_base = pi->sechdrs[sym->st_shndx].sh_addr;
+
+ value = sym->st_value;
+ value += sec_base;
+ value += rel[i].r_addend;
+
+ switch (ELF64_R_TYPE(rel[i].r_info)) {
+ case R_X86_64_NONE:
+ break;
+ case R_X86_64_64:
+ *(u64 *)location = value;
+ break;
+ case R_X86_64_32:
+ *(u32 *)location = value;
+ if (value != *(u32 *)location)
+ goto overflow;
+ break;
+ case R_X86_64_32S:
+ *(s32 *)location = value;
+ if ((s64)value != *(s32 *)location)
+ goto overflow;
+ break;
+ case R_X86_64_PC32:
+ case R_X86_64_PLT32:
+ value -= (u64)address;
+ *(u32 *)location = value;
+ break;
+ default:
+ pr_err("Unknown rela relocation: %llu\n",
+ ELF64_R_TYPE(rel[i].r_info));
+ return -ENOEXEC;
+ }
+ }
+ return 0;
+
+overflow:
+ pr_err("Overflow in relocation type %d value 0x%lx\n",
+ (int)ELF64_R_TYPE(rel[i].r_info), value);
+ return -ENOEXEC;
+}
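/*
 * Editor's illustrative sketch, not part of this patch: the arithmetic
 * behind two of the relocation cases above. In ELF terms the computed
 * value is S + A (symbol value plus addend); PC-relative types then
 * subtract P, the final address of the patched location.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t S = 0x1000;	/* st_value + section base */
	int64_t  A = 8;		/* rel->r_addend */
	uint64_t P = 0x2000;	/* section->sh_addr + rel->r_offset */

	uint64_t abs64 = S + A;				/* R_X86_64_64 */
	uint32_t pc32  = (uint32_t)(S + A - P);		/* R_X86_64_PC32 */

	printf("R_X86_64_64   = %#llx\n", (unsigned long long)abs64);
	printf("R_X86_64_PC32 = %#x\n", pc32);
	return 0;
}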
-void arch_crash_save_vmcoreinfo(void)
+int arch_kimage_file_post_load_cleanup(struct kimage *image)
{
- VMCOREINFO_SYMBOL(phys_base);
- VMCOREINFO_SYMBOL(init_level4_pgt);
+ vfree(image->elf_headers);
+ image->elf_headers = NULL;
+ image->elf_headers_sz = 0;
+
+ return kexec_image_post_load_cleanup_default(image);
+}
+#endif /* CONFIG_KEXEC_FILE */
+
+#ifdef CONFIG_CRASH_DUMP
+
+static int
+kexec_mark_range(unsigned long start, unsigned long end, bool protect)
+{
+ struct page *page;
+ unsigned int nr_pages;
+
+ /*
+ * Operates on the inclusive physical range [start, end]. We must
+ * skip an unassigned crashk resource, whose "end" member is zero.
+ */
+ if (!end || start > end)
+ return 0;
+
+ page = pfn_to_page(start >> PAGE_SHIFT);
+ nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
+ if (protect)
+ return set_pages_ro(page, nr_pages);
+ else
+ return set_pages_rw(page, nr_pages);
+}
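/*
 * Editor's illustrative sketch, not part of this patch: the page-count
 * arithmetic kexec_mark_range() applies to an inclusive physical range
 * [start, end]. Example-local constants; PAGE_SHIFT is the usual 12.
 */
#include <stdio.h>

#define EX_PAGE_SHIFT 12

int main(void)
{
	unsigned long start = 0x100000;		/* 1 MiB */
	unsigned long end   = 0x102fff;		/* inclusive last byte */
	unsigned long first_pfn = start >> EX_PAGE_SHIFT;
	unsigned long last_pfn  = end >> EX_PAGE_SHIFT;
	unsigned int nr_pages   = last_pfn - first_pfn + 1;

	printf("pfns %lu..%lu -> %u pages\n", first_pfn, last_pfn, nr_pages);
	return 0;
}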
+
+static void kexec_mark_crashkres(bool protect)
+{
+ unsigned long control;
+
+ kexec_mark_range(crashk_low_res.start, crashk_low_res.end, protect);
-#ifdef CONFIG_NUMA
- VMCOREINFO_SYMBOL(node_data);
- VMCOREINFO_LENGTH(node_data, MAX_NUMNODES);
+ /* Don't touch the control code page used in crash_kexec(). */
+ control = PFN_PHYS(page_to_pfn(kexec_crash_image->control_code_page));
+ kexec_mark_range(crashk_res.start, control - 1, protect);
+ control += KEXEC_CONTROL_PAGE_SIZE;
+ kexec_mark_range(control, crashk_res.end, protect);
+}
+
+/* Make the memory storing dm-crypt keys inaccessible or accessible */
+static void kexec_mark_dm_crypt_keys(bool protect)
+{
+ unsigned long start_paddr, end_paddr;
+ unsigned int nr_pages;
+
+ if (kexec_crash_image->dm_crypt_keys_addr) {
+ start_paddr = kexec_crash_image->dm_crypt_keys_addr;
+ end_paddr = start_paddr + kexec_crash_image->dm_crypt_keys_sz - 1;
+ nr_pages = (PAGE_ALIGN(end_paddr) - PAGE_ALIGN_DOWN(start_paddr))/PAGE_SIZE;
+ if (protect)
+ set_memory_np((unsigned long)phys_to_virt(start_paddr), nr_pages);
+ else
+ __set_memory_prot(
+ (unsigned long)phys_to_virt(start_paddr),
+ nr_pages,
+ __pgprot(_PAGE_PRESENT | _PAGE_NX | _PAGE_RW));
+ }
+}
+
+void arch_kexec_protect_crashkres(void)
+{
+ kexec_mark_crashkres(true);
+ kexec_mark_dm_crypt_keys(true);
+}
+
+void arch_kexec_unprotect_crashkres(void)
+{
+ kexec_mark_dm_crypt_keys(false);
+ kexec_mark_crashkres(false);
+}
#endif
+
+/*
+ * During a traditional boot under SME, SME will encrypt the kernel,
+ * so the SME kexec kernel also needs to be un-encrypted in order to
+ * replicate a normal SME boot.
+ *
+ * During a traditional boot under SEV, the kernel has already been
+ * loaded encrypted, so the SEV kexec kernel needs to be encrypted in
+ * order to replicate a normal SEV boot.
+ */
+int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, gfp_t gfp)
+{
+ if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+ return 0;
+
+ /*
+ * If host memory encryption is active we need to be sure that kexec
+ * pages are not encrypted because when we boot to the new kernel the
+ * pages won't be accessed encrypted (initially).
+ */
+ return set_memory_decrypted((unsigned long)vaddr, pages);
}
+void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages)
+{
+ if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+ return;
+
+ /*
+ * If host memory encryption is active we need to reset the pages back
+ * to being an encrypted mapping before freeing them.
+ */
+ set_memory_encrypted((unsigned long)vaddr, pages);
+}
diff --git a/arch/x86/kernel/mca_32.c b/arch/x86/kernel/mca_32.c
deleted file mode 100644
index 63eaf6596233..000000000000
--- a/arch/x86/kernel/mca_32.c
+++ /dev/null
@@ -1,477 +0,0 @@
-/*
- * Written by Martin Kolinek, February 1996
- *
- * Changes:
- *
- * Chris Beauregard July 28th, 1996
- * - Fixed up integrated SCSI detection
- *
- * Chris Beauregard August 3rd, 1996
- * - Made mca_info local
- * - Made integrated registers accessible through standard function calls
- * - Added name field
- * - More sanity checking
- *
- * Chris Beauregard August 9th, 1996
- * - Rewrote /proc/mca
- *
- * Chris Beauregard January 7th, 1997
- * - Added basic NMI-processing
- * - Added more information to mca_info structure
- *
- * David Weinehall October 12th, 1998
- * - Made a lot of cleaning up in the source
- * - Added use of save_flags / restore_flags
- * - Added the 'driver_loaded' flag in MCA_adapter
- * - Added an alternative implemention of ZP Gu's mca_find_unused_adapter
- *
- * David Weinehall March 24th, 1999
- * - Fixed the output of 'Driver Installed' in /proc/mca/pos
- * - Made the Integrated Video & SCSI show up even if they have id 0000
- *
- * Alexander Viro November 9th, 1999
- * - Switched to regular procfs methods
- *
- * Alfred Arnold & David Weinehall August 23rd, 2000
- * - Added support for Planar POS-registers
- */
-
-#include <linux/module.h>
-#include <linux/types.h>
-#include <linux/errno.h>
-#include <linux/kernel.h>
-#include <linux/mca.h>
-#include <linux/kprobes.h>
-#include <linux/slab.h>
-#include <asm/system.h>
-#include <asm/io.h>
-#include <linux/proc_fs.h>
-#include <linux/mman.h>
-#include <linux/mm.h>
-#include <linux/pagemap.h>
-#include <linux/ioport.h>
-#include <asm/uaccess.h>
-#include <linux/init.h>
-
-static unsigned char which_scsi;
-
-int MCA_bus;
-EXPORT_SYMBOL(MCA_bus);
-
-/*
- * Motherboard register spinlock. Untested on SMP at the moment, but
- * are there any MCA SMP boxes?
- *
- * Yes - Alan
- */
-static DEFINE_SPINLOCK(mca_lock);
-
-/* Build the status info for the adapter */
-
-static void mca_configure_adapter_status(struct mca_device *mca_dev)
-{
- mca_dev->status = MCA_ADAPTER_NONE;
-
- mca_dev->pos_id = mca_dev->pos[0]
- + (mca_dev->pos[1] << 8);
-
- if (!mca_dev->pos_id && mca_dev->slot < MCA_MAX_SLOT_NR) {
-
- /*
- * id = 0x0000 usually indicates hardware failure,
- * however, ZP Gu <zpg@castle.net> reports that his 9556
- * has 0x0000 as id and everything still works. There
- * also seems to be an adapter with id = 0x0000; the
- * NCR Parallel Bus Memory Card. Until this is confirmed,
- * however, this code will stay.
- */
-
- mca_dev->status = MCA_ADAPTER_ERROR;
-
- return;
- } else if (mca_dev->pos_id != 0xffff) {
-
- /*
- * 0xffff usually indicates that there's no adapter,
- * however, some integrated adapters may have 0xffff as
- * their id and still be valid. Examples are on-board
- * VGA of the 55sx, the integrated SCSI of the 56 & 57,
- * and possibly also the 95 ULTIMEDIA.
- */
-
- mca_dev->status = MCA_ADAPTER_NORMAL;
- }
-
- if ((mca_dev->pos_id == 0xffff ||
- mca_dev->pos_id == 0x0000) && mca_dev->slot >= MCA_MAX_SLOT_NR) {
- int j;
-
- for (j = 2; j < 8; j++) {
- if (mca_dev->pos[j] != 0xff) {
- mca_dev->status = MCA_ADAPTER_NORMAL;
- break;
- }
- }
- }
-
- if (!(mca_dev->pos[2] & MCA_ENABLED)) {
-
- /* enabled bit is in POS 2 */
-
- mca_dev->status = MCA_ADAPTER_DISABLED;
- }
-} /* mca_configure_adapter_status */
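/*
 * Editor's illustrative sketch, not part of the deleted code: how the
 * adapter ID is assembled from the first two POS bytes and classified,
 * mirroring mca_configure_adapter_status() above. Example-local names;
 * the sample bytes are arbitrary.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint8_t pos[8] = { 0x5f, 0x8e, 0x01, 0, 0, 0, 0, 0 };
	uint16_t pos_id = pos[0] | (pos[1] << 8);	/* little-endian ID */

	if (pos_id == 0x0000)
		printf("id 0x0000: usually a hardware failure\n");
	else if (pos_id == 0xffff)
		printf("id 0xffff: usually no adapter present\n");
	else
		printf("adapter id %#06x\n", pos_id);
	return 0;
}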
-
-/*--------------------------------------------------------------------*/
-
-static struct resource mca_standard_resources[] = {
- { .start = 0x60, .end = 0x60, .name = "system control port B (MCA)" },
- { .start = 0x90, .end = 0x90, .name = "arbitration (MCA)" },
- { .start = 0x91, .end = 0x91, .name = "card Select Feedback (MCA)" },
- { .start = 0x92, .end = 0x92, .name = "system Control port A (MCA)" },
- { .start = 0x94, .end = 0x94, .name = "system board setup (MCA)" },
- { .start = 0x96, .end = 0x97, .name = "POS (MCA)" },
- { .start = 0x100, .end = 0x107, .name = "POS (MCA)" }
-};
-
-#define MCA_STANDARD_RESOURCES ARRAY_SIZE(mca_standard_resources)
-
-/*
- * mca_read_and_store_pos - read the POS registers into a memory buffer
- * @pos: a char pointer to 8 bytes, contains the POS register value on
- * successful return
- *
- * Returns 1 if a card actually exists (i.e. the pos isn't
- * all 0xff) or 0 otherwise
- */
-static int mca_read_and_store_pos(unsigned char *pos)
-{
- int j;
- int found = 0;
-
- for (j = 0; j < 8; j++) {
- pos[j] = inb_p(MCA_POS_REG(j));
- if (pos[j] != 0xff) {
- /* 0xff all across means no device. 0x00 means
- * something's broken, but a device is
- * probably there. However, if you get 0x00
- * from a motherboard register it won't matter
- * what we find. For the record, on the
- * 57SLC, the integrated SCSI adapter has
- * 0xffff for the adapter ID, but nonzero for
- * other registers. */
-
- found = 1;
- }
- }
- return found;
-}
-
-static unsigned char mca_pc_read_pos(struct mca_device *mca_dev, int reg)
-{
- unsigned char byte;
- unsigned long flags;
-
- if (reg < 0 || reg >= 8)
- return 0;
-
- spin_lock_irqsave(&mca_lock, flags);
- if (mca_dev->pos_register) {
- /* Disable adapter setup, enable motherboard setup */
-
- outb_p(0, MCA_ADAPTER_SETUP_REG);
- outb_p(mca_dev->pos_register, MCA_MOTHERBOARD_SETUP_REG);
-
- byte = inb_p(MCA_POS_REG(reg));
- outb_p(0xff, MCA_MOTHERBOARD_SETUP_REG);
- } else {
-
- /* Make sure motherboard setup is off */
-
- outb_p(0xff, MCA_MOTHERBOARD_SETUP_REG);
-
- /* Read the appropriate register */
-
- outb_p(0x8|(mca_dev->slot & 0xf), MCA_ADAPTER_SETUP_REG);
- byte = inb_p(MCA_POS_REG(reg));
- outb_p(0, MCA_ADAPTER_SETUP_REG);
- }
- spin_unlock_irqrestore(&mca_lock, flags);
-
- mca_dev->pos[reg] = byte;
-
- return byte;
-}
-
-static void mca_pc_write_pos(struct mca_device *mca_dev, int reg,
- unsigned char byte)
-{
- unsigned long flags;
-
- if (reg < 0 || reg >= 8)
- return;
-
- spin_lock_irqsave(&mca_lock, flags);
-
- /* Make sure motherboard setup is off */
-
- outb_p(0xff, MCA_MOTHERBOARD_SETUP_REG);
-
- /* Read in the appropriate register */
-
- outb_p(0x8|(mca_dev->slot&0xf), MCA_ADAPTER_SETUP_REG);
- outb_p(byte, MCA_POS_REG(reg));
- outb_p(0, MCA_ADAPTER_SETUP_REG);
-
- spin_unlock_irqrestore(&mca_lock, flags);
-
- /* Update the global register list, while we have the byte */
-
- mca_dev->pos[reg] = byte;
-
-}
-
-/* for the primary MCA bus, we have identity transforms */
-static int mca_dummy_transform_irq(struct mca_device *mca_dev, int irq)
-{
- return irq;
-}
-
-static int mca_dummy_transform_ioport(struct mca_device *mca_dev, int port)
-{
- return port;
-}
-
-static void *mca_dummy_transform_memory(struct mca_device *mca_dev, void *mem)
-{
- return mem;
-}
-
-
-static int __init mca_init(void)
-{
- unsigned int i, j;
- struct mca_device *mca_dev;
- unsigned char pos[8];
- short mca_builtin_scsi_ports[] = {0xf7, 0xfd, 0x00};
- struct mca_bus *bus;
-
- /*
- * WARNING: Be careful when making changes here. Putting an adapter
- * and the motherboard simultaneously into setup mode may result in
- * damage to chips (according to The Indispensable PC Hardware Book
- * by Hans-Peter Messmer). Also, we disable system interrupts (so
- * that we are not disturbed in the middle of this).
- */
-
- /* Make sure the MCA bus is present */
-
- if (mca_system_init()) {
- printk(KERN_ERR "MCA bus system initialisation failed\n");
- return -ENODEV;
- }
-
- if (!MCA_bus)
- return -ENODEV;
-
- printk(KERN_INFO "Micro Channel bus detected.\n");
-
- /* All MCA systems have at least a primary bus */
- bus = mca_attach_bus(MCA_PRIMARY_BUS);
- if (!bus)
- goto out_nomem;
- bus->default_dma_mask = 0xffffffffLL;
- bus->f.mca_write_pos = mca_pc_write_pos;
- bus->f.mca_read_pos = mca_pc_read_pos;
- bus->f.mca_transform_irq = mca_dummy_transform_irq;
- bus->f.mca_transform_ioport = mca_dummy_transform_ioport;
- bus->f.mca_transform_memory = mca_dummy_transform_memory;
-
- /* get the motherboard device */
- mca_dev = kzalloc(sizeof(struct mca_device), GFP_KERNEL);
- if (unlikely(!mca_dev))
- goto out_nomem;
-
- /*
- * We do not expect many MCA interrupts during initialization,
- * but let us be safe:
- */
- spin_lock_irq(&mca_lock);
-
- /* Make sure adapter setup is off */
-
- outb_p(0, MCA_ADAPTER_SETUP_REG);
-
- /* Read motherboard POS registers */
-
- mca_dev->pos_register = 0x7f;
- outb_p(mca_dev->pos_register, MCA_MOTHERBOARD_SETUP_REG);
- mca_dev->name[0] = 0;
- mca_read_and_store_pos(mca_dev->pos);
- mca_configure_adapter_status(mca_dev);
- /* fake POS and slot for a motherboard */
- mca_dev->pos_id = MCA_MOTHERBOARD_POS;
- mca_dev->slot = MCA_MOTHERBOARD;
- mca_register_device(MCA_PRIMARY_BUS, mca_dev);
-
- mca_dev = kzalloc(sizeof(struct mca_device), GFP_ATOMIC);
- if (unlikely(!mca_dev))
- goto out_unlock_nomem;
-
- /* Put motherboard into video setup mode, read integrated video
- * POS registers, and turn motherboard setup off.
- */
-
- mca_dev->pos_register = 0xdf;
- outb_p(mca_dev->pos_register, MCA_MOTHERBOARD_SETUP_REG);
- mca_dev->name[0] = 0;
- mca_read_and_store_pos(mca_dev->pos);
- mca_configure_adapter_status(mca_dev);
- /* fake POS and slot for the integrated video */
- mca_dev->pos_id = MCA_INTEGVIDEO_POS;
- mca_dev->slot = MCA_INTEGVIDEO;
- mca_register_device(MCA_PRIMARY_BUS, mca_dev);
-
- /*
- * Put motherboard into scsi setup mode, read integrated scsi
- * POS registers, and turn motherboard setup off.
- *
- * It seems there are two possible SCSI registers. Martin says that
- * for the 56,57, 0xf7 is the one, but fails on the 76.
- * Alfredo (apena@vnet.ibm.com) says
- * 0xfd works on his machine. We'll try both of them. I figure it's
- * a good bet that only one could be valid at a time. This could
- * screw up though if one is used for something else on the other
- * machine.
- */
-
- for (i = 0; (which_scsi = mca_builtin_scsi_ports[i]) != 0; i++) {
- outb_p(which_scsi, MCA_MOTHERBOARD_SETUP_REG);
- if (mca_read_and_store_pos(pos))
- break;
- }
- if (which_scsi) {
- /* found a scsi card */
- mca_dev = kzalloc(sizeof(struct mca_device), GFP_ATOMIC);
- if (unlikely(!mca_dev))
- goto out_unlock_nomem;
-
- for (j = 0; j < 8; j++)
- mca_dev->pos[j] = pos[j];
-
- mca_configure_adapter_status(mca_dev);
- /* fake POS and slot for integrated SCSI controller */
- mca_dev->pos_id = MCA_INTEGSCSI_POS;
- mca_dev->slot = MCA_INTEGSCSI;
- mca_dev->pos_register = which_scsi;
- mca_register_device(MCA_PRIMARY_BUS, mca_dev);
- }
-
- /* Turn off motherboard setup */
-
- outb_p(0xff, MCA_MOTHERBOARD_SETUP_REG);
-
- /*
- * Now loop over MCA slots: put each adapter into setup mode, and
- * read its POS registers. Then put adapter setup off.
- */
-
- for (i = 0; i < MCA_MAX_SLOT_NR; i++) {
- outb_p(0x8|(i&0xf), MCA_ADAPTER_SETUP_REG);
- if (!mca_read_and_store_pos(pos))
- continue;
-
- mca_dev = kzalloc(sizeof(struct mca_device), GFP_ATOMIC);
- if (unlikely(!mca_dev))
- goto out_unlock_nomem;
-
- for (j = 0; j < 8; j++)
- mca_dev->pos[j] = pos[j];
-
- mca_dev->driver_loaded = 0;
- mca_dev->slot = i;
- mca_dev->pos_register = 0;
- mca_configure_adapter_status(mca_dev);
- mca_register_device(MCA_PRIMARY_BUS, mca_dev);
- }
- outb_p(0, MCA_ADAPTER_SETUP_REG);
-
- /* Enable interrupts and return memory start */
- spin_unlock_irq(&mca_lock);
-
- for (i = 0; i < MCA_STANDARD_RESOURCES; i++)
- request_resource(&ioport_resource, mca_standard_resources + i);
-
- mca_do_proc_init();
-
- return 0;
-
- out_unlock_nomem:
- spin_unlock_irq(&mca_lock);
- out_nomem:
- printk(KERN_EMERG "Failed memory allocation in MCA setup!\n");
- return -ENOMEM;
-}
-
-subsys_initcall(mca_init);
-
-/*--------------------------------------------------------------------*/
-
-static __kprobes void
-mca_handle_nmi_device(struct mca_device *mca_dev, int check_flag)
-{
- int slot = mca_dev->slot;
-
- if (slot == MCA_INTEGSCSI) {
- printk(KERN_CRIT "NMI: caused by MCA integrated SCSI adapter (%s)\n",
- mca_dev->name);
- } else if (slot == MCA_INTEGVIDEO) {
- printk(KERN_CRIT "NMI: caused by MCA integrated video adapter (%s)\n",
- mca_dev->name);
- } else if (slot == MCA_MOTHERBOARD) {
- printk(KERN_CRIT "NMI: caused by motherboard (%s)\n",
- mca_dev->name);
- }
-
- /* More info available in POS 6 and 7? */
-
- if (check_flag) {
- unsigned char pos6, pos7;
-
- pos6 = mca_device_read_pos(mca_dev, 6);
- pos7 = mca_device_read_pos(mca_dev, 7);
-
- printk(KERN_CRIT "NMI: POS 6 = 0x%x, POS 7 = 0x%x\n", pos6, pos7);
- }
-
-} /* mca_handle_nmi_slot */
-
-/*--------------------------------------------------------------------*/
-
-static int __kprobes mca_handle_nmi_callback(struct device *dev, void *data)
-{
- struct mca_device *mca_dev = to_mca_device(dev);
- unsigned char pos5;
-
- pos5 = mca_device_read_pos(mca_dev, 5);
-
- if (!(pos5 & 0x80)) {
- /*
- * Bit 7 of POS 5 is reset when this adapter has a hardware
- * error. Bit 6 is reset if there's error information
- * available in POS 6 and 7.
- */
- mca_handle_nmi_device(mca_dev, !(pos5 & 0x40));
- return 1;
- }
- return 0;
-}
-
-void __kprobes mca_handle_nmi(void)
-{
- /*
- * First try - scan the various adapters and see if a specific
- * adapter was responsible for the error.
- */
- bus_for_each_dev(&mca_bus_type, NULL, NULL, mca_handle_nmi_callback);
-}
diff --git a/arch/x86/kernel/microcode_amd.c b/arch/x86/kernel/microcode_amd.c
deleted file mode 100644
index e1af7c055c7d..000000000000
--- a/arch/x86/kernel/microcode_amd.c
+++ /dev/null
@@ -1,345 +0,0 @@
-/*
- * AMD CPU Microcode Update Driver for Linux
- * Copyright (C) 2008 Advanced Micro Devices Inc.
- *
- * Author: Peter Oruba <peter.oruba@amd.com>
- *
- * Based on work by:
- * Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
- *
- * This driver allows to upgrade microcode on AMD
- * family 0x10 and 0x11 processors.
- *
- * Licensed under the terms of the GNU General Public
- * License version 2. See file COPYING for details.
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include <linux/firmware.h>
-#include <linux/pci_ids.h>
-#include <linux/uaccess.h>
-#include <linux/vmalloc.h>
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/pci.h>
-
-#include <asm/microcode.h>
-#include <asm/processor.h>
-#include <asm/msr.h>
-
-MODULE_DESCRIPTION("AMD Microcode Update Driver");
-MODULE_AUTHOR("Peter Oruba");
-MODULE_LICENSE("GPL v2");
-
-#define UCODE_MAGIC 0x00414d44
-#define UCODE_EQUIV_CPU_TABLE_TYPE 0x00000000
-#define UCODE_UCODE_TYPE 0x00000001
-
-struct equiv_cpu_entry {
- u32 installed_cpu;
- u32 fixed_errata_mask;
- u32 fixed_errata_compare;
- u16 equiv_cpu;
- u16 res;
-} __attribute__((packed));
-
-struct microcode_header_amd {
- u32 data_code;
- u32 patch_id;
- u16 mc_patch_data_id;
- u8 mc_patch_data_len;
- u8 init_flag;
- u32 mc_patch_data_checksum;
- u32 nb_dev_id;
- u32 sb_dev_id;
- u16 processor_rev_id;
- u8 nb_rev_id;
- u8 sb_rev_id;
- u8 bios_api_rev;
- u8 reserved1[3];
- u32 match_reg[8];
-} __attribute__((packed));
-
-struct microcode_amd {
- struct microcode_header_amd hdr;
- unsigned int mpb[0];
-};
-
-#define UCODE_MAX_SIZE 2048
-#define UCODE_CONTAINER_SECTION_HDR 8
-#define UCODE_CONTAINER_HEADER_SIZE 12
-
-static struct equiv_cpu_entry *equiv_cpu_table;
-
-static int collect_cpu_info_amd(int cpu, struct cpu_signature *csig)
-{
- struct cpuinfo_x86 *c = &cpu_data(cpu);
- u32 dummy;
-
- memset(csig, 0, sizeof(*csig));
- if (c->x86_vendor != X86_VENDOR_AMD || c->x86 < 0x10) {
- pr_warning("microcode: CPU%d: AMD CPU family 0x%x not "
- "supported\n", cpu, c->x86);
- return -1;
- }
- rdmsr(MSR_AMD64_PATCH_LEVEL, csig->rev, dummy);
- pr_info("CPU%d: patch_level=0x%x\n", cpu, csig->rev);
- return 0;
-}
-
-static int get_matching_microcode(int cpu, void *mc, int rev)
-{
- struct microcode_header_amd *mc_header = mc;
- unsigned int current_cpu_id;
- u16 equiv_cpu_id = 0;
- unsigned int i = 0;
-
- BUG_ON(equiv_cpu_table == NULL);
- current_cpu_id = cpuid_eax(0x00000001);
-
- while (equiv_cpu_table[i].installed_cpu != 0) {
- if (current_cpu_id == equiv_cpu_table[i].installed_cpu) {
- equiv_cpu_id = equiv_cpu_table[i].equiv_cpu;
- break;
- }
- i++;
- }
-
- if (!equiv_cpu_id)
- return 0;
-
- if (mc_header->processor_rev_id != equiv_cpu_id)
- return 0;
-
- /* ucode might be chipset specific -- currently we don't support this */
- if (mc_header->nb_dev_id || mc_header->sb_dev_id) {
- pr_err("CPU%d: loading of chipset specific code not yet supported\n",
- cpu);
- return 0;
- }
-
- if (mc_header->patch_id <= rev)
- return 0;
-
- return 1;
-}
-
-static int apply_microcode_amd(int cpu)
-{
- u32 rev, dummy;
- int cpu_num = raw_smp_processor_id();
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu_num;
- struct microcode_amd *mc_amd = uci->mc;
-
- /* We should bind the task to the CPU */
- BUG_ON(cpu_num != cpu);
-
- if (mc_amd == NULL)
- return 0;
-
- wrmsrl(MSR_AMD64_PATCH_LOADER, (u64)(long)&mc_amd->hdr.data_code);
- /* get patch id after patching */
- rdmsr(MSR_AMD64_PATCH_LEVEL, rev, dummy);
-
- /* check current patch id and patch's id for match */
- if (rev != mc_amd->hdr.patch_id) {
- pr_err("CPU%d: update failed (for patch_level=0x%x)\n",
- cpu, mc_amd->hdr.patch_id);
- return -1;
- }
-
- pr_info("CPU%d: updated (new patch_level=0x%x)\n", cpu, rev);
- uci->cpu_sig.rev = rev;
-
- return 0;
-}
-
-static int get_ucode_data(void *to, const u8 *from, size_t n)
-{
- memcpy(to, from, n);
- return 0;
-}
-
-static void *
-get_next_ucode(const u8 *buf, unsigned int size, unsigned int *mc_size)
-{
- unsigned int total_size;
- u8 section_hdr[UCODE_CONTAINER_SECTION_HDR];
- void *mc;
-
- if (get_ucode_data(section_hdr, buf, UCODE_CONTAINER_SECTION_HDR))
- return NULL;
-
- if (section_hdr[0] != UCODE_UCODE_TYPE) {
- pr_err("error: invalid type field in container file section header\n");
- return NULL;
- }
-
- total_size = (unsigned long) (section_hdr[4] + (section_hdr[5] << 8));
-
- if (total_size > size || total_size > UCODE_MAX_SIZE) {
- pr_err("error: size mismatch\n");
- return NULL;
- }
-
- mc = vmalloc(UCODE_MAX_SIZE);
- if (mc) {
- memset(mc, 0, UCODE_MAX_SIZE);
- if (get_ucode_data(mc, buf + UCODE_CONTAINER_SECTION_HDR,
- total_size)) {
- vfree(mc);
- mc = NULL;
- } else
- *mc_size = total_size + UCODE_CONTAINER_SECTION_HDR;
- }
- return mc;
-}
-
-static int install_equiv_cpu_table(const u8 *buf)
-{
- u8 *container_hdr[UCODE_CONTAINER_HEADER_SIZE];
- unsigned int *buf_pos = (unsigned int *)container_hdr;
- unsigned long size;
-
- if (get_ucode_data(&container_hdr, buf, UCODE_CONTAINER_HEADER_SIZE))
- return 0;
-
- size = buf_pos[2];
-
- if (buf_pos[1] != UCODE_EQUIV_CPU_TABLE_TYPE || !size) {
- pr_err("error: invalid type field in container file section header\n");
- return 0;
- }
-
- equiv_cpu_table = (struct equiv_cpu_entry *) vmalloc(size);
- if (!equiv_cpu_table) {
- pr_err("failed to allocate equivalent CPU table\n");
- return 0;
- }
-
- buf += UCODE_CONTAINER_HEADER_SIZE;
- if (get_ucode_data(equiv_cpu_table, buf, size)) {
- vfree(equiv_cpu_table);
- return 0;
- }
-
- return size + UCODE_CONTAINER_HEADER_SIZE; /* add header length */
-}
-
-static void free_equiv_cpu_table(void)
-{
- vfree(equiv_cpu_table);
- equiv_cpu_table = NULL;
-}
-
-static enum ucode_state
-generic_load_microcode(int cpu, const u8 *data, size_t size)
-{
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
- const u8 *ucode_ptr = data;
- void *new_mc = NULL;
- void *mc;
- int new_rev = uci->cpu_sig.rev;
- unsigned int leftover;
- unsigned long offset;
- enum ucode_state state = UCODE_OK;
-
- offset = install_equiv_cpu_table(ucode_ptr);
- if (!offset) {
- pr_err("failed to create equivalent cpu table\n");
- return UCODE_ERROR;
- }
-
- ucode_ptr += offset;
- leftover = size - offset;
-
- while (leftover) {
- unsigned int uninitialized_var(mc_size);
- struct microcode_header_amd *mc_header;
-
- mc = get_next_ucode(ucode_ptr, leftover, &mc_size);
- if (!mc)
- break;
-
- mc_header = (struct microcode_header_amd *)mc;
- if (get_matching_microcode(cpu, mc, new_rev)) {
- vfree(new_mc);
- new_rev = mc_header->patch_id;
- new_mc = mc;
- } else
- vfree(mc);
-
- ucode_ptr += mc_size;
- leftover -= mc_size;
- }
-
- if (new_mc) {
- if (!leftover) {
- vfree(uci->mc);
- uci->mc = new_mc;
- pr_debug("CPU%d found a matching microcode update with version 0x%x (current=0x%x)\n",
- cpu, new_rev, uci->cpu_sig.rev);
- } else {
- vfree(new_mc);
- state = UCODE_ERROR;
- }
- } else
- state = UCODE_NFOUND;
-
- free_equiv_cpu_table();
-
- return state;
-}
-
-static enum ucode_state request_microcode_fw(int cpu, struct device *device)
-{
- const char *fw_name = "amd-ucode/microcode_amd.bin";
- const struct firmware *firmware;
- enum ucode_state ret;
-
- if (request_firmware(&firmware, fw_name, device)) {
- printk(KERN_ERR "microcode: failed to load file %s\n", fw_name);
- return UCODE_NFOUND;
- }
-
- if (*(u32 *)firmware->data != UCODE_MAGIC) {
- pr_err("invalid UCODE_MAGIC (0x%08x)\n",
- *(u32 *)firmware->data);
- return UCODE_ERROR;
- }
-
- ret = generic_load_microcode(cpu, firmware->data, firmware->size);
-
- release_firmware(firmware);
-
- return ret;
-}
-
-static enum ucode_state
-request_microcode_user(int cpu, const void __user *buf, size_t size)
-{
- pr_info("AMD microcode update via /dev/cpu/microcode not supported\n");
- return UCODE_ERROR;
-}
-
-static void microcode_fini_cpu_amd(int cpu)
-{
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
-
- vfree(uci->mc);
- uci->mc = NULL;
-}
-
-static struct microcode_ops microcode_amd_ops = {
- .request_microcode_user = request_microcode_user,
- .request_microcode_fw = request_microcode_fw,
- .collect_cpu_info = collect_cpu_info_amd,
- .apply_microcode = apply_microcode_amd,
- .microcode_fini_cpu = microcode_fini_cpu_amd,
-};
-
-struct microcode_ops * __init init_amd_microcode(void)
-{
- return &microcode_amd_ops;
-}
diff --git a/arch/x86/kernel/microcode_core.c b/arch/x86/kernel/microcode_core.c
deleted file mode 100644
index fa6551d36c10..000000000000
--- a/arch/x86/kernel/microcode_core.c
+++ /dev/null
@@ -1,571 +0,0 @@
-/*
- * Intel CPU Microcode Update Driver for Linux
- *
- * Copyright (C) 2000-2006 Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
- * 2006 Shaohua Li <shaohua.li@intel.com>
- *
- * This driver allows to upgrade microcode on Intel processors
- * belonging to IA-32 family - PentiumPro, Pentium II,
- * Pentium III, Xeon, Pentium 4, etc.
- *
- * Reference: Section 8.11 of Volume 3a, IA-32 Intel® Architecture
- * Software Developer's Manual
- * Order Number 253668 or free download from:
- *
- * http://developer.intel.com/design/pentium4/manuals/253668.htm
- *
- * For more information, go to http://www.urbanmyth.org/microcode
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- *
- * 1.0 16 Feb 2000, Tigran Aivazian <tigran@sco.com>
- * Initial release.
- * 1.01 18 Feb 2000, Tigran Aivazian <tigran@sco.com>
- * Added read() support + cleanups.
- * 1.02 21 Feb 2000, Tigran Aivazian <tigran@sco.com>
- * Added 'device trimming' support. open(O_WRONLY) zeroes
- * and frees the saved copy of applied microcode.
- * 1.03 29 Feb 2000, Tigran Aivazian <tigran@sco.com>
- * Made to use devfs (/dev/cpu/microcode) + cleanups.
- * 1.04 06 Jun 2000, Simon Trimmer <simon@veritas.com>
- * Added misc device support (now uses both devfs and misc).
- * Added MICROCODE_IOCFREE ioctl to clear memory.
- * 1.05 09 Jun 2000, Simon Trimmer <simon@veritas.com>
- * Messages for error cases (non Intel & no suitable microcode).
- * 1.06 03 Aug 2000, Tigran Aivazian <tigran@veritas.com>
- * Removed ->release(). Removed exclusive open and status bitmap.
- * Added microcode_rwsem to serialize read()/write()/ioctl().
- * Removed global kernel lock usage.
- * 1.07 07 Sep 2000, Tigran Aivazian <tigran@veritas.com>
- * Write 0 to 0x8B msr and then cpuid before reading revision,
- * so that it works even if there were no update done by the
- * BIOS. Otherwise, reading from 0x8B gives junk (which happened
- * to be 0 on my machine which is why it worked even when I
- * disabled update by the BIOS)
- * Thanks to Eric W. Biederman <ebiederman@lnxi.com> for the fix.
- * 1.08 11 Dec 2000, Richard Schaal <richard.schaal@intel.com> and
- * Tigran Aivazian <tigran@veritas.com>
- * Intel Pentium 4 processor support and bugfixes.
- * 1.09 30 Oct 2001, Tigran Aivazian <tigran@veritas.com>
- * Bugfix for HT (Hyper-Threading) enabled processors
- * whereby processor resources are shared by all logical processors
- * in a single CPU package.
- * 1.10 28 Feb 2002 Asit K Mallick <asit.k.mallick@intel.com> and
- * Tigran Aivazian <tigran@veritas.com>,
- * Serialize updates as required on HT processors due to
- * speculative nature of implementation.
- * 1.11 22 Mar 2002 Tigran Aivazian <tigran@veritas.com>
- * Fix the panic when writing zero-length microcode chunk.
- * 1.12 29 Sep 2003 Nitin Kamble <nitin.a.kamble@intel.com>,
- * Jun Nakajima <jun.nakajima@intel.com>
- * Support for the microcode updates in the new format.
- * 1.13 10 Oct 2003 Tigran Aivazian <tigran@veritas.com>
- * Removed ->read() method and obsoleted MICROCODE_IOCFREE ioctl
- * because we no longer hold a copy of applied microcode
- * in kernel memory.
- * 1.14 25 Jun 2004 Tigran Aivazian <tigran@veritas.com>
- * Fix sigmatch() macro to handle old CPUs with pf == 0.
- * Thanks to Stuart Swales for pointing out this bug.
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include <linux/platform_device.h>
-#include <linux/miscdevice.h>
-#include <linux/capability.h>
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/mutex.h>
-#include <linux/cpu.h>
-#include <linux/fs.h>
-#include <linux/mm.h>
-
-#include <asm/microcode.h>
-#include <asm/processor.h>
-
-MODULE_DESCRIPTION("Microcode Update Driver");
-MODULE_AUTHOR("Tigran Aivazian <tigran@aivazian.fsnet.co.uk>");
-MODULE_LICENSE("GPL");
-
-#define MICROCODE_VERSION "2.00"
-
-static struct microcode_ops *microcode_ops;
-
-/*
- * Synchronization.
- *
- * All non cpu-hotplug-callback call sites use:
- *
- * - microcode_mutex to synchronize with each other;
- * - get/put_online_cpus() to synchronize with
- * the cpu-hotplug-callback call sites.
- *
- * We guarantee that only a single cpu is being
- * updated at any particular moment of time.
- */
-static DEFINE_MUTEX(microcode_mutex);
-
-struct ucode_cpu_info ucode_cpu_info[NR_CPUS];
-EXPORT_SYMBOL_GPL(ucode_cpu_info);
-
-/*
- * Operations that are run on a target cpu:
- */
-
-struct cpu_info_ctx {
- struct cpu_signature *cpu_sig;
- int err;
-};
-
-static void collect_cpu_info_local(void *arg)
-{
- struct cpu_info_ctx *ctx = arg;
-
- ctx->err = microcode_ops->collect_cpu_info(smp_processor_id(),
- ctx->cpu_sig);
-}
-
-static int collect_cpu_info_on_target(int cpu, struct cpu_signature *cpu_sig)
-{
- struct cpu_info_ctx ctx = { .cpu_sig = cpu_sig, .err = 0 };
- int ret;
-
- ret = smp_call_function_single(cpu, collect_cpu_info_local, &ctx, 1);
- if (!ret)
- ret = ctx.err;
-
- return ret;
-}
-
-static int collect_cpu_info(int cpu)
-{
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
- int ret;
-
- memset(uci, 0, sizeof(*uci));
-
- ret = collect_cpu_info_on_target(cpu, &uci->cpu_sig);
- if (!ret)
- uci->valid = 1;
-
- return ret;
-}
-
-struct apply_microcode_ctx {
- int err;
-};
-
-static void apply_microcode_local(void *arg)
-{
- struct apply_microcode_ctx *ctx = arg;
-
- ctx->err = microcode_ops->apply_microcode(smp_processor_id());
-}
-
-static int apply_microcode_on_target(int cpu)
-{
- struct apply_microcode_ctx ctx = { .err = 0 };
- int ret;
-
- ret = smp_call_function_single(cpu, apply_microcode_local, &ctx, 1);
- if (!ret)
- ret = ctx.err;
-
- return ret;
-}
-
-#ifdef CONFIG_MICROCODE_OLD_INTERFACE
-static int do_microcode_update(const void __user *buf, size_t size)
-{
- int error = 0;
- int cpu;
-
- for_each_online_cpu(cpu) {
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
- enum ucode_state ustate;
-
- if (!uci->valid)
- continue;
-
- ustate = microcode_ops->request_microcode_user(cpu, buf, size);
- if (ustate == UCODE_ERROR) {
- error = -1;
- break;
- } else if (ustate == UCODE_OK)
- apply_microcode_on_target(cpu);
- }
-
- return error;
-}
-
-static int microcode_open(struct inode *inode, struct file *file)
-{
- return capable(CAP_SYS_RAWIO) ? nonseekable_open(inode, file) : -EPERM;
-}
-
-static ssize_t microcode_write(struct file *file, const char __user *buf,
- size_t len, loff_t *ppos)
-{
- ssize_t ret = -EINVAL;
-
- if ((len >> PAGE_SHIFT) > totalram_pages) {
- pr_err("too much data (max %ld pages)\n", totalram_pages);
- return ret;
- }
-
- get_online_cpus();
- mutex_lock(&microcode_mutex);
-
- if (do_microcode_update(buf, len) == 0)
- ret = (ssize_t)len;
-
- mutex_unlock(&microcode_mutex);
- put_online_cpus();
-
- return ret;
-}
-
-static const struct file_operations microcode_fops = {
- .owner = THIS_MODULE,
- .write = microcode_write,
- .open = microcode_open,
-};
-
-static struct miscdevice microcode_dev = {
- .minor = MICROCODE_MINOR,
- .name = "microcode",
- .nodename = "cpu/microcode",
- .fops = &microcode_fops,
-};
-
-static int __init microcode_dev_init(void)
-{
- int error;
-
- error = misc_register(&microcode_dev);
- if (error) {
- pr_err("can't misc_register on minor=%d\n", MICROCODE_MINOR);
- return error;
- }
-
- return 0;
-}
-
-static void microcode_dev_exit(void)
-{
- misc_deregister(&microcode_dev);
-}
-
-MODULE_ALIAS_MISCDEV(MICROCODE_MINOR);
-MODULE_ALIAS("devname:cpu/microcode");
-#else
-#define microcode_dev_init() 0
-#define microcode_dev_exit() do { } while (0)
-#endif
-
-/* fake device for request_firmware */
-static struct platform_device *microcode_pdev;
-
-static int reload_for_cpu(int cpu)
-{
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
- int err = 0;
-
- mutex_lock(&microcode_mutex);
- if (uci->valid) {
- enum ucode_state ustate;
-
- ustate = microcode_ops->request_microcode_fw(cpu, &microcode_pdev->dev);
- if (ustate == UCODE_OK)
- apply_microcode_on_target(cpu);
- else
- if (ustate == UCODE_ERROR)
- err = -EINVAL;
- }
- mutex_unlock(&microcode_mutex);
-
- return err;
-}
-
-static ssize_t reload_store(struct sys_device *dev,
- struct sysdev_attribute *attr,
- const char *buf, size_t size)
-{
- unsigned long val;
- int cpu = dev->id;
- int ret = 0;
- char *end;
-
- val = simple_strtoul(buf, &end, 0);
- if (end == buf)
- return -EINVAL;
-
- if (val == 1) {
- get_online_cpus();
- if (cpu_online(cpu))
- ret = reload_for_cpu(cpu);
- put_online_cpus();
- }
-
- if (!ret)
- ret = size;
-
- return ret;
-}
-
-static ssize_t version_show(struct sys_device *dev,
- struct sysdev_attribute *attr, char *buf)
-{
- struct ucode_cpu_info *uci = ucode_cpu_info + dev->id;
-
- return sprintf(buf, "0x%x\n", uci->cpu_sig.rev);
-}
-
-static ssize_t pf_show(struct sys_device *dev,
- struct sysdev_attribute *attr, char *buf)
-{
- struct ucode_cpu_info *uci = ucode_cpu_info + dev->id;
-
- return sprintf(buf, "0x%x\n", uci->cpu_sig.pf);
-}
-
-static SYSDEV_ATTR(reload, 0200, NULL, reload_store);
-static SYSDEV_ATTR(version, 0400, version_show, NULL);
-static SYSDEV_ATTR(processor_flags, 0400, pf_show, NULL);
-
-static struct attribute *mc_default_attrs[] = {
- &attr_reload.attr,
- &attr_version.attr,
- &attr_processor_flags.attr,
- NULL
-};
-
-static struct attribute_group mc_attr_group = {
- .attrs = mc_default_attrs,
- .name = "microcode",
-};
-
-static void microcode_fini_cpu(int cpu)
-{
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
-
- microcode_ops->microcode_fini_cpu(cpu);
- uci->valid = 0;
-}
-
-static enum ucode_state microcode_resume_cpu(int cpu)
-{
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
-
- if (!uci->mc)
- return UCODE_NFOUND;
-
- pr_debug("CPU%d updated upon resume\n", cpu);
- apply_microcode_on_target(cpu);
-
- return UCODE_OK;
-}
-
-static enum ucode_state microcode_init_cpu(int cpu)
-{
- enum ucode_state ustate;
-
- if (collect_cpu_info(cpu))
- return UCODE_ERROR;
-
- /* --dimm. Trigger a delayed update? */
- if (system_state != SYSTEM_RUNNING)
- return UCODE_NFOUND;
-
- ustate = microcode_ops->request_microcode_fw(cpu, &microcode_pdev->dev);
-
- if (ustate == UCODE_OK) {
- pr_debug("CPU%d updated upon init\n", cpu);
- apply_microcode_on_target(cpu);
- }
-
- return ustate;
-}
-
-static enum ucode_state microcode_update_cpu(int cpu)
-{
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
- enum ucode_state ustate;
-
- if (uci->valid)
- ustate = microcode_resume_cpu(cpu);
- else
- ustate = microcode_init_cpu(cpu);
-
- return ustate;
-}
-
-static int mc_sysdev_add(struct sys_device *sys_dev)
-{
- int err, cpu = sys_dev->id;
-
- if (!cpu_online(cpu))
- return 0;
-
- pr_debug("CPU%d added\n", cpu);
-
- err = sysfs_create_group(&sys_dev->kobj, &mc_attr_group);
- if (err)
- return err;
-
- if (microcode_init_cpu(cpu) == UCODE_ERROR)
- err = -EINVAL;
-
- return err;
-}
-
-static int mc_sysdev_remove(struct sys_device *sys_dev)
-{
- int cpu = sys_dev->id;
-
- if (!cpu_online(cpu))
- return 0;
-
- pr_debug("CPU%d removed\n", cpu);
- microcode_fini_cpu(cpu);
- sysfs_remove_group(&sys_dev->kobj, &mc_attr_group);
- return 0;
-}
-
-static int mc_sysdev_resume(struct sys_device *dev)
-{
- int cpu = dev->id;
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
-
- if (!cpu_online(cpu))
- return 0;
-
- /*
- * All non-bootup cpus are still disabled,
- * so only CPU 0 will apply ucode here.
- *
- * Moreover, there can be no concurrent
- * updates from any other places at this point.
- */
- WARN_ON(cpu != 0);
-
- if (uci->valid && uci->mc)
- microcode_ops->apply_microcode(cpu);
-
- return 0;
-}
-
-static struct sysdev_driver mc_sysdev_driver = {
- .add = mc_sysdev_add,
- .remove = mc_sysdev_remove,
- .resume = mc_sysdev_resume,
-};
-
-static __cpuinit int
-mc_cpu_callback(struct notifier_block *nb, unsigned long action, void *hcpu)
-{
- unsigned int cpu = (unsigned long)hcpu;
- struct sys_device *sys_dev;
-
- sys_dev = get_cpu_sysdev(cpu);
- switch (action) {
- case CPU_ONLINE:
- case CPU_ONLINE_FROZEN:
- microcode_update_cpu(cpu);
- case CPU_DOWN_FAILED:
- case CPU_DOWN_FAILED_FROZEN:
- pr_debug("CPU%d added\n", cpu);
- if (sysfs_create_group(&sys_dev->kobj, &mc_attr_group))
- pr_err("Failed to create group for CPU%d\n", cpu);
- break;
- case CPU_DOWN_PREPARE:
- case CPU_DOWN_PREPARE_FROZEN:
- /* Suspend is in progress, only remove the interface */
- sysfs_remove_group(&sys_dev->kobj, &mc_attr_group);
- pr_debug("CPU%d removed\n", cpu);
- break;
- case CPU_DEAD:
- case CPU_UP_CANCELED_FROZEN:
- /* The CPU refused to come up during a system resume */
- microcode_fini_cpu(cpu);
- break;
- }
- return NOTIFY_OK;
-}
-
-static struct notifier_block __refdata mc_cpu_notifier = {
- .notifier_call = mc_cpu_callback,
-};
-
-static int __init microcode_init(void)
-{
- struct cpuinfo_x86 *c = &cpu_data(0);
- int error;
-
- if (c->x86_vendor == X86_VENDOR_INTEL)
- microcode_ops = init_intel_microcode();
- else if (c->x86_vendor == X86_VENDOR_AMD)
- microcode_ops = init_amd_microcode();
-
- if (!microcode_ops) {
- pr_err("no support for this CPU vendor\n");
- return -ENODEV;
- }
-
- microcode_pdev = platform_device_register_simple("microcode", -1,
- NULL, 0);
- if (IS_ERR(microcode_pdev)) {
- microcode_dev_exit();
- return PTR_ERR(microcode_pdev);
- }
-
- get_online_cpus();
- mutex_lock(&microcode_mutex);
-
- error = sysdev_driver_register(&cpu_sysdev_class, &mc_sysdev_driver);
-
- mutex_unlock(&microcode_mutex);
- put_online_cpus();
-
- if (error) {
- platform_device_unregister(microcode_pdev);
- return error;
- }
-
- error = microcode_dev_init();
- if (error)
- return error;
-
- register_hotcpu_notifier(&mc_cpu_notifier);
-
- pr_info("Microcode Update Driver: v" MICROCODE_VERSION
- " <tigran@aivazian.fsnet.co.uk>, Peter Oruba\n");
-
- return 0;
-}
-module_init(microcode_init);
-
-static void __exit microcode_exit(void)
-{
- microcode_dev_exit();
-
- unregister_hotcpu_notifier(&mc_cpu_notifier);
-
- get_online_cpus();
- mutex_lock(&microcode_mutex);
-
- sysdev_driver_unregister(&cpu_sysdev_class, &mc_sysdev_driver);
-
- mutex_unlock(&microcode_mutex);
- put_online_cpus();
-
- platform_device_unregister(microcode_pdev);
-
- microcode_ops = NULL;
-
- pr_info("Microcode Update Driver: v" MICROCODE_VERSION " removed.\n");
-}
-module_exit(microcode_exit);
diff --git a/arch/x86/kernel/microcode_intel.c b/arch/x86/kernel/microcode_intel.c
deleted file mode 100644
index 356170262a93..000000000000
--- a/arch/x86/kernel/microcode_intel.c
+++ /dev/null
@@ -1,478 +0,0 @@
-/*
- * Intel CPU Microcode Update Driver for Linux
- *
- * Copyright (C) 2000-2006 Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
- * 2006 Shaohua Li <shaohua.li@intel.com>
- *
- * This driver allows to upgrade microcode on Intel processors
- * belonging to IA-32 family - PentiumPro, Pentium II,
- * Pentium III, Xeon, Pentium 4, etc.
- *
- * Reference: Section 8.11 of Volume 3a, IA-32 Intel® Architecture
- * Software Developer's Manual
- * Order Number 253668 or free download from:
- *
- * http://developer.intel.com/design/pentium4/manuals/253668.htm
- *
- * For more information, go to http://www.urbanmyth.org/microcode
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- *
- * 1.0 16 Feb 2000, Tigran Aivazian <tigran@sco.com>
- * Initial release.
- * 1.01 18 Feb 2000, Tigran Aivazian <tigran@sco.com>
- * Added read() support + cleanups.
- * 1.02 21 Feb 2000, Tigran Aivazian <tigran@sco.com>
- * Added 'device trimming' support. open(O_WRONLY) zeroes
- * and frees the saved copy of applied microcode.
- * 1.03 29 Feb 2000, Tigran Aivazian <tigran@sco.com>
- * Made to use devfs (/dev/cpu/microcode) + cleanups.
- * 1.04 06 Jun 2000, Simon Trimmer <simon@veritas.com>
- * Added misc device support (now uses both devfs and misc).
- * Added MICROCODE_IOCFREE ioctl to clear memory.
- * 1.05 09 Jun 2000, Simon Trimmer <simon@veritas.com>
- * Messages for error cases (non Intel & no suitable microcode).
- * 1.06 03 Aug 2000, Tigran Aivazian <tigran@veritas.com>
- * Removed ->release(). Removed exclusive open and status bitmap.
- * Added microcode_rwsem to serialize read()/write()/ioctl().
- * Removed global kernel lock usage.
- * 1.07 07 Sep 2000, Tigran Aivazian <tigran@veritas.com>
- * Write 0 to 0x8B msr and then cpuid before reading revision,
- * so that it works even if there were no update done by the
- * BIOS. Otherwise, reading from 0x8B gives junk (which happened
- * to be 0 on my machine which is why it worked even when I
- * disabled update by the BIOS)
- * Thanks to Eric W. Biederman <ebiederman@lnxi.com> for the fix.
- * 1.08 11 Dec 2000, Richard Schaal <richard.schaal@intel.com> and
- * Tigran Aivazian <tigran@veritas.com>
- * Intel Pentium 4 processor support and bugfixes.
- * 1.09 30 Oct 2001, Tigran Aivazian <tigran@veritas.com>
- * Bugfix for HT (Hyper-Threading) enabled processors
- * whereby processor resources are shared by all logical processors
- * in a single CPU package.
- * 1.10 28 Feb 2002 Asit K Mallick <asit.k.mallick@intel.com> and
- * Tigran Aivazian <tigran@veritas.com>,
- * Serialize updates as required on HT processors due to
- * speculative nature of implementation.
- * 1.11 22 Mar 2002 Tigran Aivazian <tigran@veritas.com>
- * Fix the panic when writing zero-length microcode chunk.
- * 1.12 29 Sep 2003 Nitin Kamble <nitin.a.kamble@intel.com>,
- * Jun Nakajima <jun.nakajima@intel.com>
- * Support for the microcode updates in the new format.
- * 1.13 10 Oct 2003 Tigran Aivazian <tigran@veritas.com>
- * Removed ->read() method and obsoleted MICROCODE_IOCFREE ioctl
- * because we no longer hold a copy of applied microcode
- * in kernel memory.
- * 1.14 25 Jun 2004 Tigran Aivazian <tigran@veritas.com>
- * Fix sigmatch() macro to handle old CPUs with pf == 0.
- * Thanks to Stuart Swales for pointing out this bug.
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include <linux/firmware.h>
-#include <linux/uaccess.h>
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/vmalloc.h>
-
-#include <asm/microcode.h>
-#include <asm/processor.h>
-#include <asm/msr.h>
-
-MODULE_DESCRIPTION("Microcode Update Driver");
-MODULE_AUTHOR("Tigran Aivazian <tigran@aivazian.fsnet.co.uk>");
-MODULE_LICENSE("GPL");
-
-struct microcode_header_intel {
- unsigned int hdrver;
- unsigned int rev;
- unsigned int date;
- unsigned int sig;
- unsigned int cksum;
- unsigned int ldrver;
- unsigned int pf;
- unsigned int datasize;
- unsigned int totalsize;
- unsigned int reserved[3];
-};
-
-struct microcode_intel {
- struct microcode_header_intel hdr;
- unsigned int bits[0];
-};
-
-/* microcode format is extended from prescott processors */
-struct extended_signature {
- unsigned int sig;
- unsigned int pf;
- unsigned int cksum;
-};
-
-struct extended_sigtable {
- unsigned int count;
- unsigned int cksum;
- unsigned int reserved[3];
- struct extended_signature sigs[0];
-};
-
-#define DEFAULT_UCODE_DATASIZE (2000)
-#define MC_HEADER_SIZE (sizeof(struct microcode_header_intel))
-#define DEFAULT_UCODE_TOTALSIZE (DEFAULT_UCODE_DATASIZE + MC_HEADER_SIZE)
-#define EXT_HEADER_SIZE (sizeof(struct extended_sigtable))
-#define EXT_SIGNATURE_SIZE (sizeof(struct extended_signature))
-#define DWSIZE (sizeof(u32))
-
-#define get_totalsize(mc) \
- (((struct microcode_intel *)mc)->hdr.totalsize ? \
- ((struct microcode_intel *)mc)->hdr.totalsize : \
- DEFAULT_UCODE_TOTALSIZE)
-
-#define get_datasize(mc) \
- (((struct microcode_intel *)mc)->hdr.datasize ? \
- ((struct microcode_intel *)mc)->hdr.datasize : DEFAULT_UCODE_DATASIZE)
-
-#define sigmatch(s1, s2, p1, p2) \
- (((s1) == (s2)) && (((p1) & (p2)) || (((p1) == 0) && ((p2) == 0))))
-
-#define exttable_size(et) ((et)->count * EXT_SIGNATURE_SIZE + EXT_HEADER_SIZE)
-
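/*
 * Editor's illustrative sketch, not part of the deleted code: the
 * sigmatch() rule above as a plain function -- signatures must be equal,
 * and the processor-flag masks must intersect, except that two zero pf
 * values (old CPUs, per the 1.14 changelog) also match.
 */
#include <stdio.h>

static int ex_sigmatch(unsigned s1, unsigned s2, unsigned p1, unsigned p2)
{
	return s1 == s2 && ((p1 & p2) || (p1 == 0 && p2 == 0));
}

int main(void)
{
	printf("%d\n", ex_sigmatch(0x106a5, 0x106a5, 0x02, 0x12));	/* 1: flags overlap */
	printf("%d\n", ex_sigmatch(0x106a5, 0x106a5, 0x02, 0x10));	/* 0: no overlap */
	printf("%d\n", ex_sigmatch(0x106a5, 0x106a5, 0x00, 0x00));	/* 1: legacy pf == 0 */
	return 0;
}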
-static int collect_cpu_info(int cpu_num, struct cpu_signature *csig)
-{
- struct cpuinfo_x86 *c = &cpu_data(cpu_num);
- unsigned int val[2];
-
- memset(csig, 0, sizeof(*csig));
-
- if (c->x86_vendor != X86_VENDOR_INTEL || c->x86 < 6 ||
- cpu_has(c, X86_FEATURE_IA64)) {
- pr_err("CPU%d not a capable Intel processor\n", cpu_num);
- return -1;
- }
-
- csig->sig = cpuid_eax(0x00000001);
-
- if ((c->x86_model >= 5) || (c->x86 > 6)) {
- /* get processor flags from MSR 0x17 */
- rdmsr(MSR_IA32_PLATFORM_ID, val[0], val[1]);
- csig->pf = 1 << ((val[1] >> 18) & 7);
- }
-
- wrmsr(MSR_IA32_UCODE_REV, 0, 0);
- /* see notes above for revision 1.07. Apparent chip bug */
- sync_core();
- /* get the current revision from MSR 0x8B */
- rdmsr(MSR_IA32_UCODE_REV, val[0], csig->rev);
-
- pr_info("CPU%d sig=0x%x, pf=0x%x, revision=0x%x\n",
- cpu_num, csig->sig, csig->pf, csig->rev);
-
- return 0;
-}
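
(Aside, not part of the patch: collect_cpu_info() turns bits 52:50 of MSR_IA32_PLATFORM_ID, which land in bits 20:18 of the high dword read by rdmsr, into a one-hot platform-flag mask. A sketch of just that derivation; the sample MSR value is invented:)

/* Editor's sketch: one-hot pf mask from the high half of MSR 0x17. */
#include <stdio.h>

static unsigned pf_from_platform_id_high(unsigned hi)
{
	/* MSR bits 52:50 are bits 20:18 of the high dword */
	return 1u << ((hi >> 18) & 7);
}

int main(void)
{
	unsigned hi = 0x00140000;	/* hypothetical: platform id = 5 */
	printf("pf = 0x%x\n", pf_from_platform_id_high(hi));	/* 0x20 */
	return 0;
}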
-
-static inline int update_match_cpu(struct cpu_signature *csig, int sig, int pf)
-{
- return (!sigmatch(sig, csig->sig, pf, csig->pf)) ? 0 : 1;
-}
-
-static inline int
-update_match_revision(struct microcode_header_intel *mc_header, int rev)
-{
- return (mc_header->rev <= rev) ? 0 : 1;
-}
-
-static int microcode_sanity_check(void *mc)
-{
- unsigned long total_size, data_size, ext_table_size;
- struct microcode_header_intel *mc_header = mc;
- struct extended_sigtable *ext_header = NULL;
- int sum, orig_sum, ext_sigcount = 0, i;
- struct extended_signature *ext_sig;
-
- total_size = get_totalsize(mc_header);
- data_size = get_datasize(mc_header);
-
- if (data_size + MC_HEADER_SIZE > total_size) {
- pr_err("error! Bad data size in microcode data file\n");
- return -EINVAL;
- }
-
- if (mc_header->ldrver != 1 || mc_header->hdrver != 1) {
- pr_err("error! Unknown microcode update format\n");
- return -EINVAL;
- }
- ext_table_size = total_size - (MC_HEADER_SIZE + data_size);
- if (ext_table_size) {
- if ((ext_table_size < EXT_HEADER_SIZE)
- || ((ext_table_size - EXT_HEADER_SIZE) % EXT_SIGNATURE_SIZE)) {
- pr_err("error! Small exttable size in microcode data file\n");
- return -EINVAL;
- }
- ext_header = mc + MC_HEADER_SIZE + data_size;
- if (ext_table_size != exttable_size(ext_header)) {
- pr_err("error! Bad exttable size in microcode data file\n");
- return -EFAULT;
- }
- ext_sigcount = ext_header->count;
- }
-
- /* check extended table checksum */
- if (ext_table_size) {
- int ext_table_sum = 0;
- int *ext_tablep = (int *)ext_header;
-
- i = ext_table_size / DWSIZE;
- while (i--)
- ext_table_sum += ext_tablep[i];
- if (ext_table_sum) {
- pr_warning("aborting, bad extended signature table checksum\n");
- return -EINVAL;
- }
- }
-
- /* calculate the checksum */
- orig_sum = 0;
- i = (MC_HEADER_SIZE + data_size) / DWSIZE;
- while (i--)
- orig_sum += ((int *)mc)[i];
- if (orig_sum) {
- pr_err("aborting, bad checksum\n");
- return -EINVAL;
- }
- if (!ext_table_size)
- return 0;
- /* check extended signature checksum */
- for (i = 0; i < ext_sigcount; i++) {
- ext_sig = (void *)ext_header + EXT_HEADER_SIZE +
- EXT_SIGNATURE_SIZE * i;
- sum = orig_sum
- - (mc_header->sig + mc_header->pf + mc_header->cksum)
- + (ext_sig->sig + ext_sig->pf + ext_sig->cksum);
- if (sum) {
- pr_err("aborting, bad checksum\n");
- return -EINVAL;
- }
- }
- return 0;
-}
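
(Aside, not part of the patch: the sanity check relies on updates being built so that all 32-bit words of the header-plus-data region sum to zero modulo 2^32; swapping in an extended-table signature only changes three dwords, which is why the per-entry check can reuse orig_sum with a small adjustment. A minimal user-space sketch of the zero-sum property on a toy buffer:)

/* Editor's sketch: the mod-2^32 zero-sum check, on a fabricated buffer
 * whose last word is chosen as the two's-complement fix-up. */
#include <stdio.h>
#include <stdint.h>

static uint32_t dword_sum(const uint32_t *p, size_t ndwords)
{
	uint32_t sum = 0;
	while (ndwords--)
		sum += p[ndwords];	/* wraps mod 2^32, like the int sum above */
	return sum;
}

int main(void)
{
	uint32_t buf[4] = { 0x11111111, 0x22222222, 0x33333333, 0 };

	buf[3] = -(buf[0] + buf[1] + buf[2]);	/* checksum fix-up word */
	printf("sum = 0x%x\n", (unsigned)dword_sum(buf, 4));	/* prints 0 */
	return 0;
}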
-
-/*
- * return 0 - no update found
- * return 1 - found update
- */
-static int
-get_matching_microcode(struct cpu_signature *cpu_sig, void *mc, int rev)
-{
- struct microcode_header_intel *mc_header = mc;
- struct extended_sigtable *ext_header;
- unsigned long total_size = get_totalsize(mc_header);
- int ext_sigcount, i;
- struct extended_signature *ext_sig;
-
- if (!update_match_revision(mc_header, rev))
- return 0;
-
- if (update_match_cpu(cpu_sig, mc_header->sig, mc_header->pf))
- return 1;
-
- /* Look for ext. headers: */
- if (total_size <= get_datasize(mc_header) + MC_HEADER_SIZE)
- return 0;
-
- ext_header = mc + get_datasize(mc_header) + MC_HEADER_SIZE;
- ext_sigcount = ext_header->count;
- ext_sig = (void *)ext_header + EXT_HEADER_SIZE;
-
- for (i = 0; i < ext_sigcount; i++) {
- if (update_match_cpu(cpu_sig, ext_sig->sig, ext_sig->pf))
- return 1;
- ext_sig++;
- }
- return 0;
-}
-
-static int apply_microcode(int cpu)
-{
- struct microcode_intel *mc_intel;
- struct ucode_cpu_info *uci;
- unsigned int val[2];
- int cpu_num;
-
- cpu_num = raw_smp_processor_id();
- uci = ucode_cpu_info + cpu;
- mc_intel = uci->mc;
-
- /* We should bind the task to the CPU */
- BUG_ON(cpu_num != cpu);
-
- if (mc_intel == NULL)
- return 0;
-
- /* write microcode via MSR 0x79 */
- wrmsr(MSR_IA32_UCODE_WRITE,
- (unsigned long) mc_intel->bits,
- (unsigned long) mc_intel->bits >> 16 >> 16);
- wrmsr(MSR_IA32_UCODE_REV, 0, 0);
-
- /* see notes above for revision 1.07. Apparent chip bug */
- sync_core();
-
- /* get the current revision from MSR 0x8B */
- rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
-
- if (val[1] != mc_intel->hdr.rev) {
- pr_err("CPU%d update to revision 0x%x failed\n",
- cpu_num, mc_intel->hdr.rev);
- return -1;
- }
- pr_info("CPU%d updated to revision 0x%x, date = %04x-%02x-%02x\n",
- cpu_num, val[1],
- mc_intel->hdr.date & 0xffff,
- mc_intel->hdr.date >> 24,
- (mc_intel->hdr.date >> 16) & 0xff);
-
- uci->cpu_sig.rev = val[1];
-
- return 0;
-}
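
(Aside, not part of the patch: two details above are easy to miss. MSR_IA32_UCODE_WRITE takes the linear address of the update split across edx:eax, and the ">> 16 >> 16" double shift gives a well-defined 0 for the high half on a 32-bit build, where a single ">> 32" on a 32-bit long would be undefined behaviour. The hdr.date field packs hex digits as 0xMMDDYYYY, which is what the %04x-%02x-%02x printk above decodes. A sketch of both, with invented values:)

/* Editor's sketch: UB-free address split plus hdr.date decoding. */
#include <stdio.h>

int main(void)
{
	unsigned long addr = 0xc0123456UL;	/* hypothetical ucode address */

	/* 0 on 32-bit longs, the real high half on 64-bit */
	unsigned lo = (unsigned)addr;
	unsigned hi = (unsigned)(addr >> 16 >> 16);
	printf("edx:eax = %08x:%08x\n", hi, lo);

	/* 0xMMDDYYYY in hex digits: 0x06152004 -> 2004-06-15 */
	unsigned date = 0x06152004;
	printf("%04x-%02x-%02x\n", date & 0xffff, date >> 24, (date >> 16) & 0xff);
	return 0;
}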
-
-static enum ucode_state generic_load_microcode(int cpu, void *data, size_t size,
- int (*get_ucode_data)(void *, const void *, size_t))
-{
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
- u8 *ucode_ptr = data, *new_mc = NULL, *mc = NULL;
- int new_rev = uci->cpu_sig.rev;
- unsigned int leftover = size;
- enum ucode_state state = UCODE_OK;
- unsigned int curr_mc_size = 0;
-
- while (leftover) {
- struct microcode_header_intel mc_header;
- unsigned int mc_size;
-
- if (get_ucode_data(&mc_header, ucode_ptr, sizeof(mc_header)))
- break;
-
- mc_size = get_totalsize(&mc_header);
- if (!mc_size || mc_size > leftover) {
- pr_err("error! Bad data in microcode data file\n");
- break;
- }
-
- /* For performance reasons, reuse mc area when possible */
- if (!mc || mc_size > curr_mc_size) {
- if (mc)
- vfree(mc);
- mc = vmalloc(mc_size);
- if (!mc)
- break;
- curr_mc_size = mc_size;
- }
-
- if (get_ucode_data(mc, ucode_ptr, mc_size) ||
- microcode_sanity_check(mc) < 0) {
- vfree(mc);
- break;
- }
-
- if (get_matching_microcode(&uci->cpu_sig, mc, new_rev)) {
- if (new_mc)
- vfree(new_mc);
- new_rev = mc_header.rev;
- new_mc = mc;
- mc = NULL; /* trigger new vmalloc */
- }
-
- ucode_ptr += mc_size;
- leftover -= mc_size;
- }
-
- if (mc)
- vfree(mc);
-
- if (leftover) {
- if (new_mc)
- vfree(new_mc);
- state = UCODE_ERROR;
- goto out;
- }
-
- if (!new_mc) {
- state = UCODE_NFOUND;
- goto out;
- }
-
- if (uci->mc)
- vfree(uci->mc);
- uci->mc = (struct microcode_intel *)new_mc;
-
- pr_debug("CPU%d found a matching microcode update with version 0x%x (current=0x%x)\n",
- cpu, new_rev, uci->cpu_sig.rev);
-out:
- return state;
-}
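
(Aside, not part of the patch: the loop above is a keep-the-best-candidate scan over a file that may concatenate several updates; each blob's totalsize advances the cursor, the scratch buffer is reused when big enough, and a better match steals the buffer by nulling mc. A sketch of the same cursor walk over length-prefixed records; the record layout here is invented and assumes little-endian:)

/* Editor's sketch: walking concatenated records, header-peek first. */
#include <stdio.h>
#include <string.h>

static void walk(const unsigned char *p, size_t left)
{
	while (left >= 4) {
		unsigned size;

		memcpy(&size, p, 4);		/* peek the header, like mc_header */
		if (!size || size > left)	/* truncated or bogus record */
			break;
		printf("record of %u bytes\n", size);
		p += size;			/* size covers header + payload */
		left -= size;
	}
}

int main(void)
{
	/* little-endian sizes: one 8-byte record, one 4-byte record */
	unsigned char buf[12] = { 8, 0, 0, 0, 'a', 'b', 'c', 'd', 4, 0, 0, 0 };
	walk(buf, sizeof(buf));
	return 0;
}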
-
-static int get_ucode_fw(void *to, const void *from, size_t n)
-{
- memcpy(to, from, n);
- return 0;
-}
-
-static enum ucode_state request_microcode_fw(int cpu, struct device *device)
-{
- char name[30];
- struct cpuinfo_x86 *c = &cpu_data(cpu);
- const struct firmware *firmware;
- enum ucode_state ret;
-
- sprintf(name, "intel-ucode/%02x-%02x-%02x",
- c->x86, c->x86_model, c->x86_mask);
-
- if (request_firmware(&firmware, name, device)) {
- pr_debug("data file %s load failed\n", name);
- return UCODE_NFOUND;
- }
-
- ret = generic_load_microcode(cpu, (void *)firmware->data,
- firmware->size, &get_ucode_fw);
-
- release_firmware(firmware);
-
- return ret;
-}
-
-static int get_ucode_user(void *to, const void *from, size_t n)
-{
- return copy_from_user(to, from, n);
-}
-
-static enum ucode_state
-request_microcode_user(int cpu, const void __user *buf, size_t size)
-{
- return generic_load_microcode(cpu, (void *)buf, size, &get_ucode_user);
-}
-
-static void microcode_fini_cpu(int cpu)
-{
- struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
-
- vfree(uci->mc);
- uci->mc = NULL;
-}
-
-static struct microcode_ops microcode_intel_ops = {
- .request_microcode_user = request_microcode_user,
- .request_microcode_fw = request_microcode_fw,
- .collect_cpu_info = collect_cpu_info,
- .apply_microcode = apply_microcode,
- .microcode_fini_cpu = microcode_fini_cpu,
-};
-
-struct microcode_ops * __init init_intel_microcode(void)
-{
- return &microcode_intel_ops;
-}
-
diff --git a/arch/x86/kernel/mmconf-fam10h_64.c b/arch/x86/kernel/mmconf-fam10h_64.c
index 71825806cd44..ef6104e7cc72 100644
--- a/arch/x86/kernel/mmconf-fam10h_64.c
+++ b/arch/x86/kernel/mmconf-fam10h_64.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* AMD Family 10h mmconfig enablement
*/
@@ -8,6 +9,7 @@
#include <linux/pci.h>
#include <linux/dmi.h>
#include <linux/range.h>
+#include <linux/acpi.h>
#include <asm/pci-direct.h>
#include <linux/sort.h>
@@ -24,15 +26,14 @@ struct pci_hostbridge_probe {
u32 device;
};
-static u64 __cpuinitdata fam10h_pci_mmconf_base;
-static int __cpuinitdata fam10h_pci_mmconf_base_status;
+static u64 fam10h_pci_mmconf_base;
-static struct pci_hostbridge_probe pci_probes[] __cpuinitdata = {
+static struct pci_hostbridge_probe pci_probes[] = {
{ 0, 0x18, PCI_VENDOR_ID_AMD, 0x1200 },
{ 0xff, 0, PCI_VENDOR_ID_AMD, 0x1200 },
};
-static int __cpuinit cmp_range(const void *x1, const void *x2)
+static int cmp_range(const void *x1, const void *x2)
{
const struct range *r1 = x1;
const struct range *r2 = x2;
@@ -44,11 +45,13 @@ static int __cpuinit cmp_range(const void *x1, const void *x2)
return start1 - start2;
}
-/*[47:0] */
-/* need to avoid (0xfd<<32) and (0xfe<<32), ht used space */
+#define MMCONF_UNIT (1ULL << FAM10H_MMIO_CONF_BASE_SHIFT)
+#define MMCONF_MASK (~(MMCONF_UNIT - 1))
+#define MMCONF_SIZE (MMCONF_UNIT << 8)
+/* need to avoid (0xfd<<32), (0xfe<<32), and (0xff<<32), ht used space */
#define FAM10H_PCI_MMCONF_BASE (0xfcULL<<32)
-#define BASE_VALID(b) ((b != (0xfdULL << 32)) && (b != (0xfeULL << 32)))
-static void __cpuinit get_fam10h_pci_mmconf_base(void)
+#define BASE_VALID(b) ((b) + MMCONF_SIZE <= (0xfdULL<<32) || (b) >= (1ULL<<40))
+static void get_fam10h_pci_mmconf_base(void)
{
int i;
unsigned bus;
@@ -64,12 +67,11 @@ static void __cpuinit get_fam10h_pci_mmconf_base(void)
struct range range[8];
/* only try to get setting from BSP */
- /* -1 or 1 */
- if (fam10h_pci_mmconf_base_status)
+ if (fam10h_pci_mmconf_base)
return;
if (!early_pci_allowed())
- goto fail;
+ return;
found = 0;
for (i = 0; i < ARRAY_SIZE(pci_probes); i++) {
@@ -91,24 +93,24 @@ static void __cpuinit get_fam10h_pci_mmconf_base(void)
}
if (!found)
- goto fail;
+ return;
/* SYS_CFG */
- address = MSR_K8_SYSCFG;
- rdmsrl(address, val);
+ address = MSR_AMD64_SYSCFG;
+ rdmsrq(address, val);
/* TOP_MEM2 is not enabled? */
if (!(val & (1<<21))) {
- tom2 = 0;
+ tom2 = 1ULL << 32;
} else {
/* TOP_MEM2 */
address = MSR_K8_TOP_MEM2;
- rdmsrl(address, val);
- tom2 = val & (0xffffULL<<32);
+ rdmsrq(address, val);
+ tom2 = max(val & 0xffffff800000ULL, 1ULL << 32);
}
if (base <= tom2)
- base = tom2 + (1ULL<<32);
+ base = (tom2 + 2 * MMCONF_UNIT - 1) & MMCONF_MASK;
/*
* need to check if the range is in the high mmio range that is
@@ -123,11 +125,11 @@ static void __cpuinit get_fam10h_pci_mmconf_base(void)
if (!(reg & 3))
continue;
- start = (((u64)reg) << 8) & (0xffULL << 32); /* 39:16 on 31:8*/
+ start = (u64)(reg & 0xffffff00) << 8; /* 39:16 on 31:8*/
reg = read_pci_config(bus, slot, 1, 0x84 + (i << 3));
- end = (((u64)reg) << 8) & (0xffULL << 32); /* 39:16 on 31:8*/
+ end = ((u64)(reg & 0xffffff00) << 8) | 0xffff; /* 39:16 on 31:8*/
- if (!end)
+ if (end < tom2)
continue;
range[hi_mmio_num].start = start;
@@ -143,35 +145,30 @@ static void __cpuinit get_fam10h_pci_mmconf_base(void)
if (range[hi_mmio_num - 1].end < base)
goto out;
- if (range[0].start > base)
+ if (range[0].start > base + MMCONF_SIZE)
goto out;
/* need to find one window */
- base = range[0].start - (1ULL << 32);
+ base = (range[0].start & MMCONF_MASK) - MMCONF_UNIT;
if ((base > tom2) && BASE_VALID(base))
goto out;
- base = range[hi_mmio_num - 1].end + (1ULL << 32);
- if ((base > tom2) && BASE_VALID(base))
+ base = (range[hi_mmio_num - 1].end + MMCONF_UNIT) & MMCONF_MASK;
+ if (BASE_VALID(base))
goto out;
/* need to find window between ranges */
- if (hi_mmio_num > 1)
- for (i = 0; i < hi_mmio_num - 1; i++) {
- if (range[i + 1].start > (range[i].end + (1ULL << 32))) {
- base = range[i].end + (1ULL << 32);
- if ((base > tom2) && BASE_VALID(base))
- goto out;
- }
+ for (i = 1; i < hi_mmio_num; i++) {
+ base = (range[i - 1].end + MMCONF_UNIT) & MMCONF_MASK;
+ val = range[i].start & MMCONF_MASK;
+ if (val >= base + MMCONF_SIZE && BASE_VALID(base))
+ goto out;
}
-
-fail:
- fam10h_pci_mmconf_base_status = -1;
return;
+
out:
fam10h_pci_mmconf_base = base;
- fam10h_pci_mmconf_base_status = 1;
}
-void __cpuinit fam10h_check_enable_mmcfg(void)
+void fam10h_check_enable_mmcfg(void)
{
u64 val;
u32 address;
@@ -180,7 +177,7 @@ void __cpuinit fam10h_check_enable_mmcfg(void)
return;
address = MSR_FAM10H_MMIO_CONF_BASE;
- rdmsrl(address, val);
+ rdmsrq(address, val);
/* try to make sure that AP's setting is identical to BSP setting */
if (val & FAM10H_MMIO_CONF_ENABLE) {
@@ -190,11 +187,10 @@ void __cpuinit fam10h_check_enable_mmcfg(void)
/* only trust the one handle 256 buses, if acpi=off */
if (!acpi_pci_disabled || busnbits >= 8) {
- u64 base;
- base = val & (0xffffULL << 32);
- if (fam10h_pci_mmconf_base_status <= 0) {
+ u64 base = val & MMCONF_MASK;
+
+ if (!fam10h_pci_mmconf_base) {
fam10h_pci_mmconf_base = base;
- fam10h_pci_mmconf_base_status = 1;
return;
} else if (fam10h_pci_mmconf_base == base)
return;
@@ -206,24 +202,26 @@ void __cpuinit fam10h_check_enable_mmcfg(void)
* with 256 buses
*/
get_fam10h_pci_mmconf_base();
- if (fam10h_pci_mmconf_base_status <= 0)
+ if (!fam10h_pci_mmconf_base) {
+ pci_probe &= ~PCI_CHECK_ENABLE_AMD_MMCONF;
return;
+ }
printk(KERN_INFO "Enable MMCONFIG on AMD Family 10h\n");
val &= ~((FAM10H_MMIO_CONF_BASE_MASK<<FAM10H_MMIO_CONF_BASE_SHIFT) |
(FAM10H_MMIO_CONF_BUSRANGE_MASK<<FAM10H_MMIO_CONF_BUSRANGE_SHIFT));
val |= fam10h_pci_mmconf_base | (8 << FAM10H_MMIO_CONF_BUSRANGE_SHIFT) |
FAM10H_MMIO_CONF_ENABLE;
- wrmsrl(address, val);
+ wrmsrq(address, val);
}
-static int __devinit set_check_enable_amd_mmconf(const struct dmi_system_id *d)
+static int __init set_check_enable_amd_mmconf(const struct dmi_system_id *d)
{
pci_probe |= PCI_CHECK_ENABLE_AMD_MMCONF;
return 0;
}
-static const struct dmi_system_id __cpuinitconst mmconf_dmi_table[] = {
+static const struct dmi_system_id __initconst mmconf_dmi_table[] = {
{
.callback = set_check_enable_amd_mmconf,
.ident = "Sun Microsystems Machine",
@@ -234,7 +232,8 @@ static const struct dmi_system_id __cpuinitconst mmconf_dmi_table[] = {
{}
};
-void __cpuinit check_enable_amd_mmconf_dmi(void)
+/* Called from a non __init function, but only on the BSP. */
+void __ref check_enable_amd_mmconf_dmi(void)
{
dmi_check_system(mmconf_dmi_table);
}
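
(Aside, not part of the patch: the rework replaces ad-hoc 1ULL<<32 arithmetic with unit-based rounding. Assuming FAM10H_MMIO_CONF_BASE_SHIFT is 20, MMCONF_UNIT is 1 MiB and the 256-bus window MMCONF_SIZE is 256 MiB; candidate bases are produced with the usual round-up idiom (x + unit - 1) & ~(unit - 1). A standalone sketch with a made-up top-of-memory value:)

/* Editor's sketch: the round-up-to-unit idiom used for base selection. */
#include <stdio.h>
#include <stdint.h>

#define UNIT	(1ULL << 20)
#define MASK	(~(UNIT - 1))
#define SIZE	(UNIT << 8)

int main(void)
{
	uint64_t tom2 = 0x123456789ULL;	/* hypothetical TOP_MEM2 */

	/* round up and keep at least one full unit of slack above tom2 */
	uint64_t base = (tom2 + 2 * UNIT - 1) & MASK;
	printf("base = %#llx, end = %#llx\n",
	       (unsigned long long)base,
	       (unsigned long long)(base + SIZE));
	return 0;
}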
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index e0bc186d7501..11c45ce42694 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -1,73 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/* Kernel module help for x86.
Copyright (C) 2001 Rusty Russell.
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation; either version 2 of the License, or
- (at your option) any later version.
+*/
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
- You should have received a copy of the GNU General Public License
- along with this program; if not, write to the Free Software
- Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
-*/
#include <linux/moduleloader.h>
#include <linux/elf.h>
#include <linux/vmalloc.h>
#include <linux/fs.h>
#include <linux/string.h>
#include <linux/kernel.h>
+#include <linux/kasan.h>
#include <linux/bug.h>
#include <linux/mm.h>
#include <linux/gfp.h>
+#include <linux/jump_label.h>
+#include <linux/random.h>
+#include <linux/memory.h>
+#include <linux/stackprotector.h>
-#include <asm/system.h>
+#include <asm/text-patching.h>
#include <asm/page.h>
-#include <asm/pgtable.h>
+#include <asm/setup.h>
+#include <asm/unwind.h>
#if 0
-#define DEBUGP printk
+#define DEBUGP(fmt, ...) \
+ printk(KERN_DEBUG fmt, ##__VA_ARGS__)
#else
-#define DEBUGP(fmt...)
+#define DEBUGP(fmt, ...) \
+do { \
+ if (0) \
+ printk(KERN_DEBUG fmt, ##__VA_ARGS__); \
+} while (0)
#endif
-void *module_alloc(unsigned long size)
-{
- struct vm_struct *area;
-
- if (!size)
- return NULL;
- size = PAGE_ALIGN(size);
- if (size > MODULES_LEN)
- return NULL;
-
- area = __get_vm_area(size, VM_ALLOC, MODULES_VADDR, MODULES_END);
- if (!area)
- return NULL;
-
- return __vmalloc_area(area, GFP_KERNEL | __GFP_HIGHMEM,
- PAGE_KERNEL_EXEC);
-}
-
-/* Free memory returned from module_alloc */
-void module_free(struct module *mod, void *module_region)
-{
- vfree(module_region);
-}
-
-/* We don't need anything special. */
-int module_frob_arch_sections(Elf_Ehdr *hdr,
- Elf_Shdr *sechdrs,
- char *secstrings,
- struct module *mod)
-{
- return 0;
-}
-
#ifdef CONFIG_X86_32
int apply_relocate(Elf32_Shdr *sechdrs,
const char *strtab,
@@ -80,8 +49,8 @@ int apply_relocate(Elf32_Shdr *sechdrs,
Elf32_Sym *sym;
uint32_t *location;
- DEBUGP("Applying relocate section %u to %u\n", relsec,
- sechdrs[relsec].sh_info);
+ DEBUGP("Applying relocate section %u to %u\n",
+ relsec, sechdrs[relsec].sh_info);
for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {
/* This is where to make the change */
location = (void *)sechdrs[sechdrs[relsec].sh_info].sh_addr
@@ -97,44 +66,41 @@ int apply_relocate(Elf32_Shdr *sechdrs,
*location += sym->st_value;
break;
case R_386_PC32:
- /* Add the value, subtract its postition */
+ case R_386_PLT32:
+ /* Add the value, subtract its position */
*location += sym->st_value - (uint32_t)location;
break;
default:
- printk(KERN_ERR "module %s: Unknown relocation: %u\n",
+ pr_err("%s: Unknown relocation: %u\n",
me->name, ELF32_R_TYPE(rel[i].r_info));
return -ENOEXEC;
}
}
return 0;
}
-
-int apply_relocate_add(Elf32_Shdr *sechdrs,
- const char *strtab,
- unsigned int symindex,
- unsigned int relsec,
- struct module *me)
-{
- printk(KERN_ERR "module %s: ADD RELOCATION unsupported\n",
- me->name);
- return -ENOEXEC;
-}
#else /*X86_64*/
-int apply_relocate_add(Elf64_Shdr *sechdrs,
+static int __write_relocate_add(Elf64_Shdr *sechdrs,
const char *strtab,
unsigned int symindex,
unsigned int relsec,
- struct module *me)
+ struct module *me,
+ void *(*write)(void *dest, const void *src, size_t len),
+ bool apply)
{
unsigned int i;
Elf64_Rela *rel = (void *)sechdrs[relsec].sh_addr;
Elf64_Sym *sym;
void *loc;
u64 val;
+ u64 zero = 0ULL;
+
+ DEBUGP("%s relocate section %u to %u\n",
+ apply ? "Applying" : "Clearing",
+ relsec, sechdrs[relsec].sh_info);
- DEBUGP("Applying relocate section %u to %u\n", relsec,
- sechdrs[relsec].sh_info);
for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rel); i++) {
+ size_t size;
+
/* This is where to make the change */
loc = (void *)sechdrs[sechdrs[relsec].sh_info].sh_addr
+ rel[i].r_offset;
@@ -145,60 +111,130 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
+ ELF64_R_SYM(rel[i].r_info);
DEBUGP("type %d st_value %Lx r_addend %Lx loc %Lx\n",
- (int)ELF64_R_TYPE(rel[i].r_info),
- sym->st_value, rel[i].r_addend, (u64)loc);
+ (int)ELF64_R_TYPE(rel[i].r_info),
+ sym->st_value, rel[i].r_addend, (u64)loc);
val = sym->st_value + rel[i].r_addend;
switch (ELF64_R_TYPE(rel[i].r_info)) {
case R_X86_64_NONE:
- break;
+ continue; /* nothing to write */
case R_X86_64_64:
- *(u64 *)loc = val;
+ size = 8;
break;
case R_X86_64_32:
- *(u32 *)loc = val;
- if (val != *(u32 *)loc)
+ if (val != *(u32 *)&val)
goto overflow;
+ size = 4;
break;
case R_X86_64_32S:
- *(s32 *)loc = val;
- if ((s64)val != *(s32 *)loc)
+ if ((s64)val != *(s32 *)&val)
goto overflow;
+ size = 4;
break;
+#if defined(CONFIG_STACKPROTECTOR) && \
+ defined(CONFIG_CC_IS_CLANG) && CONFIG_CLANG_VERSION < 170000
+ case R_X86_64_REX_GOTPCRELX: {
+ static unsigned long __percpu *const addr = &__stack_chk_guard;
+
+ if (sym->st_value != (u64)addr) {
+ pr_err("%s: Unsupported GOTPCREL relocation\n", me->name);
+ return -ENOEXEC;
+ }
+
+ val = (u64)&addr + rel[i].r_addend;
+ fallthrough;
+ }
+#endif
case R_X86_64_PC32:
+ case R_X86_64_PLT32:
val -= (u64)loc;
- *(u32 *)loc = val;
-#if 0
- if ((s64)val != *(s32 *)loc)
- goto overflow;
-#endif
+ size = 4;
+ break;
+ case R_X86_64_PC64:
+ val -= (u64)loc;
+ size = 8;
break;
default:
- printk(KERN_ERR "module %s: Unknown rela relocation: %llu\n",
+ pr_err("%s: Unknown rela relocation: %llu\n",
me->name, ELF64_R_TYPE(rel[i].r_info));
return -ENOEXEC;
}
+
+ if (apply) {
+ if (memcmp(loc, &zero, size)) {
+ pr_err("x86/modules: Invalid relocation target, existing value is nonzero for sec %u, idx %u, type %d, loc %lx, val %llx\n",
+ relsec, i, (int)ELF64_R_TYPE(rel[i].r_info),
+ (unsigned long)loc, val);
+ return -ENOEXEC;
+ }
+ write(loc, &val, size);
+ } else {
+ if (memcmp(loc, &val, size)) {
+ pr_warn("x86/modules: Invalid relocation target, existing value does not match expected value for sec %u, idx %u, type %d, loc %lx, val %llx\n",
+ relsec, i, (int)ELF64_R_TYPE(rel[i].r_info),
+ (unsigned long)loc, val);
+ return -ENOEXEC;
+ }
+ write(loc, &zero, size);
+ }
}
return 0;
overflow:
- printk(KERN_ERR "overflow in relocation type %d val %Lx\n",
- (int)ELF64_R_TYPE(rel[i].r_info), val);
- printk(KERN_ERR "`%s' likely not compiled with -mcmodel=kernel\n",
+ pr_err("overflow in relocation type %d val %llx sec %u idx %d\n",
+ (int)ELF64_R_TYPE(rel[i].r_info), val, relsec, i);
+ pr_err("`%s' likely not compiled with -mcmodel=kernel\n",
me->name);
return -ENOEXEC;
}
-int apply_relocate(Elf_Shdr *sechdrs,
+static int write_relocate_add(Elf64_Shdr *sechdrs,
+ const char *strtab,
+ unsigned int symindex,
+ unsigned int relsec,
+ struct module *me,
+ bool apply)
+{
+ int ret;
+ bool early = me->state == MODULE_STATE_UNFORMED;
+ void *(*write)(void *, const void *, size_t) = memcpy;
+
+ if (!early) {
+ write = text_poke;
+ mutex_lock(&text_mutex);
+ }
+
+ ret = __write_relocate_add(sechdrs, strtab, symindex, relsec, me,
+ write, apply);
+
+ if (!early) {
+ smp_text_poke_sync_each_cpu();
+ mutex_unlock(&text_mutex);
+ }
+
+ return ret;
+}
+
+int apply_relocate_add(Elf64_Shdr *sechdrs,
const char *strtab,
unsigned int symindex,
unsigned int relsec,
struct module *me)
{
- printk(KERN_ERR "non add relocation not supported\n");
- return -ENOSYS;
+ return write_relocate_add(sechdrs, strtab, symindex, relsec, me, true);
+}
+
+#ifdef CONFIG_LIVEPATCH
+void clear_relocate_add(Elf64_Shdr *sechdrs,
+ const char *strtab,
+ unsigned int symindex,
+ unsigned int relsec,
+ struct module *me)
+{
+ write_relocate_add(sechdrs, strtab, symindex, relsec, me, false);
}
+#endif
#endif
@@ -206,44 +242,97 @@ int module_finalize(const Elf_Ehdr *hdr,
const Elf_Shdr *sechdrs,
struct module *me)
{
- const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
- *para = NULL;
+ const Elf_Shdr *s, *alt = NULL, *locks = NULL,
+ *orc = NULL, *orc_ip = NULL,
+ *retpolines = NULL, *returns = NULL, *ibt_endbr = NULL,
+ *calls = NULL, *cfi = NULL;
char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;
for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
- if (!strcmp(".text", secstrings + s->sh_name))
- text = s;
if (!strcmp(".altinstructions", secstrings + s->sh_name))
alt = s;
if (!strcmp(".smp_locks", secstrings + s->sh_name))
locks = s;
- if (!strcmp(".parainstructions", secstrings + s->sh_name))
- para = s;
+ if (!strcmp(".orc_unwind", secstrings + s->sh_name))
+ orc = s;
+ if (!strcmp(".orc_unwind_ip", secstrings + s->sh_name))
+ orc_ip = s;
+ if (!strcmp(".retpoline_sites", secstrings + s->sh_name))
+ retpolines = s;
+ if (!strcmp(".return_sites", secstrings + s->sh_name))
+ returns = s;
+ if (!strcmp(".call_sites", secstrings + s->sh_name))
+ calls = s;
+ if (!strcmp(".cfi_sites", secstrings + s->sh_name))
+ cfi = s;
+ if (!strcmp(".ibt_endbr_seal", secstrings + s->sh_name))
+ ibt_endbr = s;
}
+ its_init_mod(me);
+
+ if (retpolines || cfi) {
+ void *rseg = NULL, *cseg = NULL;
+ unsigned int rsize = 0, csize = 0;
+
+ if (retpolines) {
+ rseg = (void *)retpolines->sh_addr;
+ rsize = retpolines->sh_size;
+ }
+
+ if (cfi) {
+ cseg = (void *)cfi->sh_addr;
+ csize = cfi->sh_size;
+ }
+
+ apply_fineibt(rseg, rseg + rsize, cseg, cseg + csize);
+ }
+ if (retpolines) {
+ void *rseg = (void *)retpolines->sh_addr;
+ apply_retpolines(rseg, rseg + retpolines->sh_size);
+ }
+
+ its_fini_mod(me);
+
+ if (returns) {
+ void *rseg = (void *)returns->sh_addr;
+ apply_returns(rseg, rseg + returns->sh_size);
+ }
+ if (calls) {
+ struct callthunk_sites cs = {};
+
+ cs.call_start = (void *)calls->sh_addr;
+ cs.call_end = (void *)calls->sh_addr + calls->sh_size;
+
+ callthunks_patch_module_calls(&cs, me);
+ }
if (alt) {
/* patch .altinstructions */
void *aseg = (void *)alt->sh_addr;
apply_alternatives(aseg, aseg + alt->sh_size);
}
- if (locks && text) {
+ if (ibt_endbr) {
+ void *iseg = (void *)ibt_endbr->sh_addr;
+ apply_seal_endbr(iseg, iseg + ibt_endbr->sh_size);
+ }
+ if (locks) {
void *lseg = (void *)locks->sh_addr;
- void *tseg = (void *)text->sh_addr;
+ void *text = me->mem[MOD_TEXT].base;
+ void *text_end = text + me->mem[MOD_TEXT].size;
alternatives_smp_module_add(me, me->name,
lseg, lseg + locks->sh_size,
- tseg, tseg + text->sh_size);
+ text, text_end);
}
- if (para) {
- void *pseg = (void *)para->sh_addr;
- apply_paravirt(pseg, pseg + para->sh_size);
- }
+ if (orc && orc_ip)
+ unwind_module_init(me, (void *)orc_ip->sh_addr, orc_ip->sh_size,
+ (void *)orc->sh_addr, orc->sh_size);
- return module_bug_finalize(hdr, sechdrs, me);
+ return 0;
}
void module_arch_cleanup(struct module *mod)
{
alternatives_smp_module_del(mod);
- module_bug_cleanup(mod);
+ its_free_mod(mod);
}
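
(Aside, not part of the patch: the module.c rewrite funnels every relocation type through one size/value computation: absolute types store sym->st_value + addend in 4 or 8 bytes, with truncation checks for the 32-bit forms, while PC-relative types subtract the patch location first; the memcmp guards then make apply/clear verifiable, and text_poke replaces memcpy once the module is live so the text mapping stays read-only. A sketch of the core arithmetic with the ELF plumbing stripped away; addresses are invented:)

/* Editor's sketch: relocation value computation, standalone. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t sym = 0xffffffffa0001000ULL;	/* hypothetical symbol */
	int64_t addend = -4;			/* typical for call fixups */
	uint64_t loc = 0xffffffffa0100020ULL;	/* hypothetical patch site */

	uint64_t val = sym + addend;		/* R_X86_64_64 stores this */

	/* R_X86_64_32 check: the value must survive truncation */
	if (val != (uint32_t)val)
		printf("would overflow R_X86_64_32\n");

	/* R_X86_64_PC32 / PLT32: PC-relative, 4 bytes */
	int32_t rel = (int32_t)(val - loc);
	printf("pc32 = %d\n", rel);
	return 0;
}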
diff --git a/arch/x86/kernel/mpparse.c b/arch/x86/kernel/mpparse.c
index d86dbf7e54be..4a1b1b28abf9 100644
--- a/arch/x86/kernel/mpparse.c
+++ b/arch/x86/kernel/mpparse.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Intel Multiprocessor Specification 1.1 and 1.4
* compliant MP-table parsing routines.
@@ -10,23 +11,23 @@
#include <linux/mm.h>
#include <linux/init.h>
#include <linux/delay.h>
-#include <linux/bootmem.h>
+#include <linux/memblock.h>
#include <linux/kernel_stat.h>
#include <linux/mc146818rtc.h>
#include <linux/bitops.h>
#include <linux/acpi.h>
-#include <linux/module.h>
#include <linux/smp.h>
#include <linux/pci.h>
+#include <asm/i8259.h>
+#include <asm/io_apic.h>
+#include <asm/acpi.h>
+#include <asm/irqdomain.h>
#include <asm/mtrr.h>
#include <asm/mpspec.h>
-#include <asm/pgalloc.h>
-#include <asm/io_apic.h>
#include <asm/proto.h>
#include <asm/bios_ebda.h>
-#include <asm/e820.h>
-#include <asm/trampoline.h>
+#include <asm/e820/api.h>
#include <asm/setup.h>
#include <asm/smp.h>
@@ -35,6 +36,8 @@
* Checksum an MP configuration block.
*/
+static unsigned int num_procs __initdata;
+
static int __init mpf_checksum(unsigned char *mp, int len)
{
int sum = 0;
@@ -45,175 +48,88 @@ static int __init mpf_checksum(unsigned char *mp, int len)
return sum & 0xFF;
}
-int __init default_mpc_apic_id(struct mpc_cpu *m)
-{
- return m->apicid;
-}
-
static void __init MP_processor_info(struct mpc_cpu *m)
{
- int apicid;
char *bootup_cpu = "";
- if (!(m->cpuflag & CPU_ENABLED)) {
- disabled_cpus++;
+ topology_register_apic(m->apicid, CPU_ACPIID_INVALID, m->cpuflag & CPU_ENABLED);
+ if (!(m->cpuflag & CPU_ENABLED))
return;
- }
- apicid = x86_init.mpparse.mpc_apic_id(m);
-
- if (m->cpuflag & CPU_BOOTPROCESSOR) {
+ if (m->cpuflag & CPU_BOOTPROCESSOR)
bootup_cpu = " (Bootup-CPU)";
- boot_cpu_physical_apicid = m->apicid;
- }
- printk(KERN_INFO "Processor #%d%s\n", m->apicid, bootup_cpu);
- generic_processor_info(apicid, m->apicver);
+ pr_info("Processor #%d%s\n", m->apicid, bootup_cpu);
+ num_procs++;
}
#ifdef CONFIG_X86_IO_APIC
-void __init default_mpc_oem_bus_info(struct mpc_bus *m, char *str)
+static void __init mpc_oem_bus_info(struct mpc_bus *m, char *str)
{
memcpy(str, m->bustype, 6);
str[6] = 0;
- apic_printk(APIC_VERBOSE, "Bus #%d is %s\n", m->busid, str);
+ apic_pr_verbose("Bus #%d is %s\n", m->busid, str);
}
static void __init MP_bus_info(struct mpc_bus *m)
{
char str[7];
- x86_init.mpparse.mpc_oem_bus_info(m, str);
+ mpc_oem_bus_info(m, str);
#if MAX_MP_BUSSES < 256
if (m->busid >= MAX_MP_BUSSES) {
- printk(KERN_WARNING "MP table busid value (%d) for bustype %s "
- " is too large, max. supported is %d\n",
- m->busid, str, MAX_MP_BUSSES - 1);
+ pr_warn("MP table busid value (%d) for bustype %s is too large, max. supported is %d\n",
+ m->busid, str, MAX_MP_BUSSES - 1);
return;
}
#endif
+ set_bit(m->busid, mp_bus_not_pci);
if (strncmp(str, BUSTYPE_ISA, sizeof(BUSTYPE_ISA) - 1) == 0) {
- set_bit(m->busid, mp_bus_not_pci);
-#if defined(CONFIG_EISA) || defined(CONFIG_MCA)
+#ifdef CONFIG_EISA
mp_bus_id_to_type[m->busid] = MP_BUS_ISA;
#endif
} else if (strncmp(str, BUSTYPE_PCI, sizeof(BUSTYPE_PCI) - 1) == 0) {
- if (x86_init.mpparse.mpc_oem_pci_bus)
- x86_init.mpparse.mpc_oem_pci_bus(m);
-
clear_bit(m->busid, mp_bus_not_pci);
-#if defined(CONFIG_EISA) || defined(CONFIG_MCA)
+#ifdef CONFIG_EISA
mp_bus_id_to_type[m->busid] = MP_BUS_PCI;
} else if (strncmp(str, BUSTYPE_EISA, sizeof(BUSTYPE_EISA) - 1) == 0) {
mp_bus_id_to_type[m->busid] = MP_BUS_EISA;
- } else if (strncmp(str, BUSTYPE_MCA, sizeof(BUSTYPE_MCA) - 1) == 0) {
- mp_bus_id_to_type[m->busid] = MP_BUS_MCA;
#endif
} else
- printk(KERN_WARNING "Unknown bustype %s - ignoring\n", str);
+ pr_warn("Unknown bustype %s - ignoring\n", str);
}
static void __init MP_ioapic_info(struct mpc_ioapic *m)
{
- if (!(m->flags & MPC_APIC_USABLE))
- return;
-
- printk(KERN_INFO "I/O APIC #%d Version %d at 0x%X.\n",
- m->apicid, m->apicver, m->apicaddr);
+ struct ioapic_domain_cfg cfg = {
+ .type = IOAPIC_DOMAIN_LEGACY,
+ .ops = &mp_ioapic_irqdomain_ops,
+ };
- mp_register_ioapic(m->apicid, m->apicaddr, gsi_top);
-}
-
-static void print_MP_intsrc_info(struct mpc_intsrc *m)
-{
- apic_printk(APIC_VERBOSE, "Int: type %d, pol %d, trig %d, bus %02x,"
- " IRQ %02x, APIC ID %x, APIC INT %02x\n",
- m->irqtype, m->irqflag & 3, (m->irqflag >> 2) & 3, m->srcbus,
- m->srcbusirq, m->dstapic, m->dstirq);
+ if (m->flags & MPC_APIC_USABLE)
+ mp_register_ioapic(m->apicid, m->apicaddr, gsi_top, &cfg);
}
static void __init print_mp_irq_info(struct mpc_intsrc *mp_irq)
{
- apic_printk(APIC_VERBOSE, "Int: type %d, pol %d, trig %d, bus %02x,"
- " IRQ %02x, APIC ID %x, APIC INT %02x\n",
+ apic_printk(APIC_VERBOSE,
+ "Int: type %d, pol %d, trig %d, bus %02x, IRQ %02x, APIC ID %x, APIC INT %02x\n",
mp_irq->irqtype, mp_irq->irqflag & 3,
(mp_irq->irqflag >> 2) & 3, mp_irq->srcbus,
mp_irq->srcbusirq, mp_irq->dstapic, mp_irq->dstirq);
}
-static void __init assign_to_mp_irq(struct mpc_intsrc *m,
- struct mpc_intsrc *mp_irq)
-{
- mp_irq->dstapic = m->dstapic;
- mp_irq->type = m->type;
- mp_irq->irqtype = m->irqtype;
- mp_irq->irqflag = m->irqflag;
- mp_irq->srcbus = m->srcbus;
- mp_irq->srcbusirq = m->srcbusirq;
- mp_irq->dstirq = m->dstirq;
-}
-
-static void __init assign_to_mpc_intsrc(struct mpc_intsrc *mp_irq,
- struct mpc_intsrc *m)
-{
- m->dstapic = mp_irq->dstapic;
- m->type = mp_irq->type;
- m->irqtype = mp_irq->irqtype;
- m->irqflag = mp_irq->irqflag;
- m->srcbus = mp_irq->srcbus;
- m->srcbusirq = mp_irq->srcbusirq;
- m->dstirq = mp_irq->dstirq;
-}
-
-static int __init mp_irq_mpc_intsrc_cmp(struct mpc_intsrc *mp_irq,
- struct mpc_intsrc *m)
-{
- if (mp_irq->dstapic != m->dstapic)
- return 1;
- if (mp_irq->type != m->type)
- return 2;
- if (mp_irq->irqtype != m->irqtype)
- return 3;
- if (mp_irq->irqflag != m->irqflag)
- return 4;
- if (mp_irq->srcbus != m->srcbus)
- return 5;
- if (mp_irq->srcbusirq != m->srcbusirq)
- return 6;
- if (mp_irq->dstirq != m->dstirq)
- return 7;
-
- return 0;
-}
-
-static void __init MP_intsrc_info(struct mpc_intsrc *m)
-{
- int i;
-
- print_MP_intsrc_info(m);
-
- for (i = 0; i < mp_irq_entries; i++) {
- if (!mp_irq_mpc_intsrc_cmp(&mp_irqs[i], m))
- return;
- }
-
- assign_to_mp_irq(m, &mp_irqs[mp_irq_entries]);
- if (++mp_irq_entries == MAX_IRQ_SOURCES)
- panic("Max # of irq sources exceeded!!\n");
-}
#else /* CONFIG_X86_IO_APIC */
static inline void __init MP_bus_info(struct mpc_bus *m) {}
static inline void __init MP_ioapic_info(struct mpc_ioapic *m) {}
-static inline void __init MP_intsrc_info(struct mpc_intsrc *m) {}
#endif /* CONFIG_X86_IO_APIC */
-
static void __init MP_lintsrc_info(struct mpc_lintsrc *m)
{
- apic_printk(APIC_VERBOSE, "Lint: type %d, pol %d, trig %d, bus %02x,"
- " IRQ %02x, APIC ID %x, APIC LINT %02x\n",
+ apic_printk(APIC_VERBOSE,
+ "Lint: type %d, pol %d, trig %d, bus %02x, IRQ %02x, APIC ID %x, APIC LINT %02x\n",
m->irqtype, m->irqflag & 3, (m->irqflag >> 2) & 3, m->srcbusid,
m->srcbusirq, m->destapic, m->destapiclint);
}
@@ -221,39 +137,37 @@ static void __init MP_lintsrc_info(struct mpc_lintsrc *m)
/*
* Read/parse the MPC
*/
-
static int __init smp_check_mpc(struct mpc_table *mpc, char *oem, char *str)
{
if (memcmp(mpc->signature, MPC_SIGNATURE, 4)) {
- printk(KERN_ERR "MPTABLE: bad signature [%c%c%c%c]!\n",
+ pr_err("MPTABLE: bad signature [%c%c%c%c]!\n",
mpc->signature[0], mpc->signature[1],
mpc->signature[2], mpc->signature[3]);
return 0;
}
if (mpf_checksum((unsigned char *)mpc, mpc->length)) {
- printk(KERN_ERR "MPTABLE: checksum error!\n");
+ pr_err("MPTABLE: checksum error!\n");
return 0;
}
if (mpc->spec != 0x01 && mpc->spec != 0x04) {
- printk(KERN_ERR "MPTABLE: bad table version (%d)!!\n",
- mpc->spec);
+ pr_err("MPTABLE: bad table version (%d)!!\n", mpc->spec);
return 0;
}
if (!mpc->lapic) {
- printk(KERN_ERR "MPTABLE: null local APIC address!\n");
+ pr_err("MPTABLE: null local APIC address!\n");
return 0;
}
memcpy(oem, mpc->oem, 8);
oem[8] = 0;
- printk(KERN_INFO "MPTABLE: OEM ID: %s\n", oem);
+ pr_info("MPTABLE: OEM ID: %s\n", oem);
memcpy(str, mpc->productid, 12);
str[12] = 0;
- printk(KERN_INFO "MPTABLE: Product ID: %s\n", str);
+ pr_info("MPTABLE: Product ID: %s\n", str);
- printk(KERN_INFO "MPTABLE: APIC at: 0x%X\n", mpc->lapic);
+ pr_info("MPTABLE: APIC at: 0x%X\n", mpc->lapic);
return 1;
}
@@ -266,14 +180,12 @@ static void skip_entry(unsigned char **ptr, int *count, int size)
static void __init smp_dump_mptable(struct mpc_table *mpc, unsigned char *mpt)
{
- printk(KERN_ERR "Your mptable is wrong, contact your HW vendor!\n"
- "type %x\n", *mpt);
+ pr_err("Your mptable is wrong, contact your HW vendor!\n");
+ pr_cont("type %x\n", *mpt);
print_hex_dump(KERN_ERR, " ", DUMP_PREFIX_ADDRESS, 16,
1, mpc, mpc->length, 1);
}
-void __init default_smp_read_mpc_oem(struct mpc_table *mpc) { }
-
static int __init smp_read_mpc(struct mpc_table *mpc, unsigned early)
{
char str[16];
@@ -285,24 +197,14 @@ static int __init smp_read_mpc(struct mpc_table *mpc, unsigned early)
if (!smp_check_mpc(mpc, oem, str))
return 0;
-#ifdef CONFIG_X86_32
- generic_mps_oem_check(mpc, oem, str);
-#endif
- /* save the local APIC address, it might be non-default */
- if (!acpi_lapic)
- mp_lapic_addr = mpc->lapic;
-
- if (early)
+ if (early) {
+ /* Initialize the lapic mapping */
+ if (!acpi_lapic)
+ register_lapic_address(mpc->lapic);
return 1;
+ }
- if (mpc->oemptr)
- x86_init.mpparse.smp_read_mpc_oem(mpc);
-
- /*
- * Now process the configuration blocks.
- */
- x86_init.mpparse.mpc_record(0);
-
+ /* Now process the configuration blocks. */
while (count < mpc->length) {
switch (*mpt) {
case MP_PROCESSOR:
@@ -320,7 +222,7 @@ static int __init smp_read_mpc(struct mpc_table *mpc, unsigned early)
skip_entry(&mpt, &count, sizeof(struct mpc_ioapic));
break;
case MP_INTSRC:
- MP_intsrc_info((struct mpc_intsrc *)mpt);
+ mp_save_irq((struct mpc_intsrc *)mpt);
skip_entry(&mpt, &count, sizeof(struct mpc_intsrc));
break;
case MP_LINTSRC:
@@ -333,12 +235,11 @@ static int __init smp_read_mpc(struct mpc_table *mpc, unsigned early)
count = mpc->length;
break;
}
- x86_init.mpparse.mpc_record(1);
}
- if (!num_processors)
- printk(KERN_ERR "MPTABLE: no processors registered!\n");
- return num_processors;
+ if (!num_procs && !acpi_lapic)
+ pr_err("MPTABLE: no processors registered!\n");
+ return num_procs || acpi_lapic;
}
#ifdef CONFIG_X86_IO_APIC
@@ -347,7 +248,7 @@ static int __init ELCR_trigger(unsigned int irq)
{
unsigned int port;
- port = 0x4d0 + (irq >> 3);
+ port = PIC_ELCR1 + (irq >> 3);
return (inb(port) >> (irq & 7)) & 1;
}
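
(Aside, not part of the patch: ELCR_trigger() reads the edge/level control registers behind the legacy PICs; IRQs 0-7 live in the byte at PIC_ELCR1, historically port 0x4d0, IRQs 8-15 in the next port, one bit per line, 1 meaning level-triggered. A tiny worked example of the indexing:)

/* Editor's sketch: ELCR port/bit indexing for a few sample IRQs. */
#include <stdio.h>

int main(void)
{
	for (int irq = 0; irq < 16; irq += 5)
		printf("irq %2d -> port 0x%x bit %d\n",
		       irq, 0x4d0 + (irq >> 3), irq & 7);
	return 0;	/* e.g. irq 11 lands at port 0x4d1, bit 3 */
}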
@@ -358,9 +259,9 @@ static void __init construct_default_ioirq_mptable(int mpc_default_type)
int ELCR_fallback = 0;
intsrc.type = MP_INTSRC;
- intsrc.irqflag = 0; /* conforming */
+ intsrc.irqflag = MP_IRQTRIG_DEFAULT | MP_IRQPOL_DEFAULT;
intsrc.srcbus = 0;
- intsrc.dstapic = mp_ioapics[0].apicid;
+ intsrc.dstapic = mpc_ioapic_id(0);
intsrc.irqtype = mp_INT;
@@ -373,16 +274,13 @@ static void __init construct_default_ioirq_mptable(int mpc_default_type)
* If it does, we assume it's valid.
*/
if (mpc_default_type == 5) {
- printk(KERN_INFO "ISA/PCI bus type with no IRQ information... "
- "falling back to ELCR\n");
+ pr_info("ISA/PCI bus type with no IRQ information... falling back to ELCR\n");
if (ELCR_trigger(0) || ELCR_trigger(1) || ELCR_trigger(2) ||
ELCR_trigger(13))
- printk(KERN_ERR "ELCR contains invalid data... "
- "not using ELCR\n");
+ pr_err("ELCR contains invalid data... not using ELCR\n");
else {
- printk(KERN_INFO
- "Using ELCR to identify PCI interrupts\n");
+ pr_info("Using ELCR to identify PCI interrupts\n");
ELCR_fallback = 1;
}
}
@@ -392,7 +290,7 @@ static void __init construct_default_ioirq_mptable(int mpc_default_type)
case 2:
if (i == 0 || i == 13)
continue; /* IRQ0 & IRQ13 not connected */
- /* fall through */
+ fallthrough;
default:
if (i == 2)
continue; /* IRQ2 is never connected */
@@ -404,21 +302,24 @@ static void __init construct_default_ioirq_mptable(int mpc_default_type)
* copy that information over to the MP table in the
* irqflag field (level sensitive, active high polarity).
*/
- if (ELCR_trigger(i))
- intsrc.irqflag = 13;
- else
- intsrc.irqflag = 0;
+ if (ELCR_trigger(i)) {
+ intsrc.irqflag = MP_IRQTRIG_LEVEL |
+ MP_IRQPOL_ACTIVE_HIGH;
+ } else {
+ intsrc.irqflag = MP_IRQTRIG_DEFAULT |
+ MP_IRQPOL_DEFAULT;
+ }
}
intsrc.srcbusirq = i;
intsrc.dstirq = i ? i : 2; /* IRQ0 to INTIN2 */
- MP_intsrc_info(&intsrc);
+ mp_save_irq(&intsrc);
}
intsrc.irqtype = mp_ExtINT;
intsrc.srcbusirq = 0;
intsrc.dstirq = 0; /* 8259A to INTIN0 */
- MP_intsrc_info(&intsrc);
+ mp_save_irq(&intsrc);
}
@@ -431,9 +332,9 @@ static void __init construct_ioapic_table(int mpc_default_type)
bus.busid = 0;
switch (mpc_default_type) {
default:
- printk(KERN_ERR "???\nUnknown standard configuration %d\n",
+ pr_err("???\nUnknown standard configuration %d\n",
mpc_default_type);
- /* fall through */
+ fallthrough;
case 1:
case 5:
memcpy(bus.bustype, "ISA ", 6);
@@ -443,9 +344,6 @@ static void __init construct_ioapic_table(int mpc_default_type)
case 3:
memcpy(bus.bustype, "EISA ", 6);
break;
- case 4:
- case 7:
- memcpy(bus.bustype, "MCA ", 6);
}
MP_bus_info(&bus);
if (mpc_default_type > 4) {
@@ -478,11 +376,6 @@ static inline void __init construct_default_ISA_mptable(int mpc_default_type)
int i;
/*
- * local APIC has default address
- */
- mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
-
- /*
* 2 CPUs, numbered 0 & 1.
*/
processor.type = MP_PROCESSOR;
@@ -490,8 +383,8 @@ static inline void __init construct_default_ISA_mptable(int mpc_default_type)
processor.apicver = mpc_default_type > 4 ? 0x10 : 0x01;
processor.cpuflag = CPU_ENABLED;
processor.cpufeature = (boot_cpu_data.x86 << 8) |
- (boot_cpu_data.x86_model << 4) | boot_cpu_data.x86_mask;
- processor.featureflag = boot_cpu_data.x86_capability[0];
+ (boot_cpu_data.x86_model << 4) | boot_cpu_data.x86_stepping;
+ processor.featureflag = boot_cpu_data.x86_capability[CPUID_1_EDX];
processor.reserved[0] = 0;
processor.reserved[1] = 0;
for (i = 0; i < 2; i++) {
@@ -502,7 +395,7 @@ static inline void __init construct_default_ISA_mptable(int mpc_default_type)
construct_ioapic_table(mpc_default_type);
lintsrc.type = MP_LINTSRC;
- lintsrc.irqflag = 0; /* conforming */
+ lintsrc.irqflag = MP_IRQTRIG_DEFAULT | MP_IRQPOL_DEFAULT;
lintsrc.srcbusid = 0;
lintsrc.srcbusirq = 0;
lintsrc.destapic = MP_APIC_ALL;
@@ -513,17 +406,18 @@ static inline void __init construct_default_ISA_mptable(int mpc_default_type)
}
}
-static struct mpf_intel *mpf_found;
+static unsigned long mpf_base;
+static bool mpf_found;
static unsigned long __init get_mpc_size(unsigned long physptr)
{
struct mpc_table *mpc;
unsigned long size;
- mpc = early_ioremap(physptr, PAGE_SIZE);
+ mpc = early_memremap(physptr, PAGE_SIZE);
size = mpc->length;
- early_iounmap(mpc, PAGE_SIZE);
- apic_printk(APIC_VERBOSE, " mpc: %lx-%lx\n", physptr, physptr + size);
+ early_memunmap(mpc, PAGE_SIZE);
+ apic_pr_verbose(" mpc: %lx-%lx\n", physptr, physptr + size);
return size;
}
@@ -534,7 +428,8 @@ static int __init check_physptr(struct mpf_intel *mpf, unsigned int early)
unsigned long size;
size = get_mpc_size(mpf->physptr);
- mpc = early_ioremap(mpf->physptr, size);
+ mpc = early_memremap(mpf->physptr, size);
+
/*
* Read the physical hardware table. Anything here will
* override the defaults.
@@ -543,12 +438,12 @@ static int __init check_physptr(struct mpf_intel *mpf, unsigned int early)
#ifdef CONFIG_X86_LOCAL_APIC
smp_found_config = 0;
#endif
- printk(KERN_ERR "BIOS bug, MP table errors detected!...\n"
- "... disabling SMP support. (tell your hw vendor)\n");
- early_iounmap(mpc, size);
+ pr_err("BIOS bug, MP table errors detected!...\n");
+ pr_cont("... disabling SMP support. (tell your hw vendor)\n");
+ early_memunmap(mpc, size);
return -1;
}
- early_iounmap(mpc, size);
+ early_memunmap(mpc, size);
if (early)
return -1;
@@ -562,8 +457,7 @@ static int __init check_physptr(struct mpf_intel *mpf, unsigned int early)
if (!mp_irq_entries) {
struct mpc_bus bus;
- printk(KERN_ERR "BIOS bug, no explicit IRQ entries, "
- "using default mptable. (tell your hw vendor)\n");
+ pr_err("BIOS bug, no explicit IRQ entries, using default mptable. (tell your hw vendor)\n");
bus.type = MP_BUS;
bus.busid = 0;
@@ -580,11 +474,14 @@ static int __init check_physptr(struct mpf_intel *mpf, unsigned int early)
/*
* Scan the memory blocks for an SMP configuration block.
*/
-void __init default_get_smp_config(unsigned int early)
+static __init void mpparse_get_smp_config(unsigned int early)
{
- struct mpf_intel *mpf = mpf_found;
+ struct mpf_intel *mpf;
+
+ if (!smp_found_config)
+ return;
- if (!mpf)
+ if (!mpf_found)
return;
if (acpi_lapic && early)
@@ -597,64 +494,77 @@ void __init default_get_smp_config(unsigned int early)
if (acpi_lapic && acpi_ioapic)
return;
- printk(KERN_INFO "Intel MultiProcessor Specification v1.%d\n",
- mpf->specification);
+ mpf = early_memremap(mpf_base, sizeof(*mpf));
+ if (!mpf) {
+ pr_err("MPTABLE: error mapping MP table\n");
+ return;
+ }
+
+ pr_info("Intel MultiProcessor Specification v1.%d\n",
+ mpf->specification);
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_32)
if (mpf->feature2 & (1 << 7)) {
- printk(KERN_INFO " IMCR and PIC compatibility mode.\n");
+ pr_info(" IMCR and PIC compatibility mode.\n");
pic_mode = 1;
} else {
- printk(KERN_INFO " Virtual Wire compatibility mode.\n");
+ pr_info(" Virtual Wire compatibility mode.\n");
pic_mode = 0;
}
#endif
/*
* Now see if we need to read further.
*/
- if (mpf->feature1 != 0) {
+ if (mpf->feature1) {
if (early) {
- /*
- * local APIC has default address
- */
- mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
- return;
+ /* Local APIC has default address */
+ register_lapic_address(APIC_DEFAULT_PHYS_BASE);
+ goto out;
}
- printk(KERN_INFO "Default MP configuration #%d\n",
- mpf->feature1);
+ pr_info("Default MP configuration #%d\n", mpf->feature1);
construct_default_ISA_mptable(mpf->feature1);
} else if (mpf->physptr) {
if (check_physptr(mpf, early))
- return;
+ goto out;
} else
BUG();
- if (!early)
- printk(KERN_INFO "Processors: %d\n", num_processors);
+ if (!early && !acpi_lapic)
+ pr_info("Processors: %d\n", num_procs);
/*
* Only use the first configuration found.
*/
+out:
+ early_memunmap(mpf, sizeof(*mpf));
}
-static void __init smp_reserve_memory(struct mpf_intel *mpf)
+void __init mpparse_parse_early_smp_config(void)
{
- unsigned long size = get_mpc_size(mpf->physptr);
+ mpparse_get_smp_config(true);
+}
- reserve_early_overlap_ok(mpf->physptr, mpf->physptr+size, "MP-table mpc");
+void __init mpparse_parse_smp_config(void)
+{
+ mpparse_get_smp_config(false);
+}
+
+static void __init smp_reserve_memory(struct mpf_intel *mpf)
+{
+ memblock_reserve(mpf->physptr, get_mpc_size(mpf->physptr));
}
static int __init smp_scan_config(unsigned long base, unsigned long length)
{
- unsigned int *bp = phys_to_virt(base);
+ unsigned int *bp;
struct mpf_intel *mpf;
- unsigned long mem;
+ int ret = 0;
- apic_printk(APIC_VERBOSE, "Scan SMP from %p for %ld bytes.\n",
- bp, length);
+ apic_pr_verbose("Scan for SMP in [mem %#010lx-%#010lx]\n", base, base + length - 1);
BUILD_BUG_ON(sizeof(*mpf) != 16);
while (length > 0) {
+ bp = early_memremap(base, length);
mpf = (struct mpf_intel *)bp;
if ((*bp == SMP_MAGIC_IDENT) &&
(mpf->length == 1) &&
@@ -664,25 +574,30 @@ static int __init smp_scan_config(unsigned long base, unsigned long length)
#ifdef CONFIG_X86_LOCAL_APIC
smp_found_config = 1;
#endif
- mpf_found = mpf;
+ mpf_base = base;
+ mpf_found = true;
- printk(KERN_INFO "found SMP MP-table at [%p] %llx\n",
- mpf, (u64)virt_to_phys(mpf));
+ pr_info("found SMP MP-table at [mem %#010lx-%#010lx]\n",
+ base, base + sizeof(*mpf) - 1);
- mem = virt_to_phys(mpf);
- reserve_early_overlap_ok(mem, mem + sizeof(*mpf), "MP-table mpf");
+ memblock_reserve(base, sizeof(*mpf));
if (mpf->physptr)
smp_reserve_memory(mpf);
- return 1;
+ ret = 1;
}
- bp += 4;
+ early_memunmap(bp, length);
+
+ if (ret)
+ break;
+
+ base += 16;
length -= 16;
}
- return 0;
+ return ret;
}
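
(Aside, not part of the patch: the scan now remaps each candidate region with early_memremap() and walks it in 16-byte steps, because the MP floating pointer structure, whose "_MP_" signature is what SMP_MAGIC_IDENT encodes, is specified to sit on a 16-byte boundary with a whole-structure byte checksum of zero. A user-space sketch of the same walk over a fabricated buffer:)

/* Editor's sketch: 16-byte-step signature scan, fake memory. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	unsigned char mem[64] = { 0 };

	memcpy(mem + 32, "_MP_", 4);	/* plant a 16-byte-aligned signature */

	for (size_t off = 0; off + 16 <= sizeof(mem); off += 16) {
		if (!memcmp(mem + off, "_MP_", 4)) {
			printf("found floating pointer at offset %zu\n", off);
			break;	/* the kernel would also verify the checksum */
		}
	}
	return 0;
}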
-void __init default_find_smp_config(void)
+void __init mpparse_find_mptable(void)
{
unsigned int address;
@@ -700,7 +615,7 @@ void __init default_find_smp_config(void)
return;
/*
* If it is an SMP machine we should know now, unless the
- * configuration is in an EISA/MCA bus machine with an
+ * configuration is in an EISA bus machine with an
* extended bios data area.
*
* there is a real-mode segmented pointer pointing to the
@@ -730,7 +645,7 @@ static int __init get_MP_intsrc_index(struct mpc_intsrc *m)
if (m->irqtype != mp_INT)
return 0;
- if (m->irqflag != 0x0f)
+ if (m->irqflag != (MP_IRQTRIG_LEVEL | MP_IRQPOL_ACTIVE_LOW))
return 0;
/* not legacy */
@@ -739,7 +654,8 @@ static int __init get_MP_intsrc_index(struct mpc_intsrc *m)
if (mp_irqs[i].irqtype != mp_INT)
continue;
- if (mp_irqs[i].irqflag != 0x0f)
+ if (mp_irqs[i].irqflag != (MP_IRQTRIG_LEVEL |
+ MP_IRQPOL_ACTIVE_LOW))
continue;
if (mp_irqs[i].srcbus != m->srcbus)
@@ -766,13 +682,13 @@ static void __init check_irq_src(struct mpc_intsrc *m, int *nr_m_spare)
{
int i;
- apic_printk(APIC_VERBOSE, "OLD ");
- print_MP_intsrc_info(m);
+ apic_pr_verbose("OLD ");
+ print_mp_irq_info(m);
i = get_MP_intsrc_index(m);
if (i > 0) {
- assign_to_mpc_intsrc(&mp_irqs[i], m);
- apic_printk(APIC_VERBOSE, "NEW ");
+ memcpy(m, &mp_irqs[i], sizeof(*m));
+ apic_pr_verbose("NEW ");
print_mp_irq_info(&mp_irqs[i]);
return;
}
@@ -789,23 +705,21 @@ static void __init check_irq_src(struct mpc_intsrc *m, int *nr_m_spare)
*nr_m_spare += 1;
}
}
-#else /* CONFIG_X86_IO_APIC */
-static
-inline void __init check_irq_src(struct mpc_intsrc *m, int *nr_m_spare) {}
-#endif /* CONFIG_X86_IO_APIC */
-static int
+static int __init
check_slot(unsigned long mpc_new_phys, unsigned long mpc_new_length, int count)
{
- int ret = 0;
-
if (!mpc_new_phys || count <= mpc_new_length) {
WARN(1, "update_mptable: No spare slots (length: %x)\n", count);
return -1;
}
- return ret;
+ return 0;
}
+#else /* CONFIG_X86_IO_APIC */
+static
+inline void __init check_irq_src(struct mpc_intsrc *m, int *nr_m_spare) {}
+#endif /* CONFIG_X86_IO_APIC */
static int __init replace_intsrc_all(struct mpc_table *mpc,
unsigned long mpc_new_phys,
@@ -818,7 +732,7 @@ static int __init replace_intsrc_all(struct mpc_table *mpc,
int nr_m_spare = 0;
unsigned char *mpt = ((unsigned char *)mpc) + count;
- printk(KERN_INFO "mpc_length %x\n", mpc->length);
+ pr_info("mpc_length %x\n", mpc->length);
while (count < mpc->length) {
switch (*mpt) {
case MP_PROCESSOR:
@@ -852,20 +766,21 @@ static int __init replace_intsrc_all(struct mpc_table *mpc,
if (mp_irqs[i].irqtype != mp_INT)
continue;
- if (mp_irqs[i].irqflag != 0x0f)
+ if (mp_irqs[i].irqflag != (MP_IRQTRIG_LEVEL |
+ MP_IRQPOL_ACTIVE_LOW))
continue;
if (nr_m_spare > 0) {
- apic_printk(APIC_VERBOSE, "*NEW* found\n");
+ apic_pr_verbose("*NEW* found\n");
nr_m_spare--;
- assign_to_mpc_intsrc(&mp_irqs[i], m_spare[nr_m_spare]);
+ memcpy(m_spare[nr_m_spare], &mp_irqs[i], sizeof(mp_irqs[i]));
m_spare[nr_m_spare] = NULL;
} else {
struct mpc_intsrc *m = (struct mpc_intsrc *)mpt;
count += sizeof(struct mpc_intsrc);
if (check_slot(mpc_new_phys, mpc_new_length, count) < 0)
goto out;
- assign_to_mpc_intsrc(&mp_irqs[i], m);
+ memcpy(m, &mp_irqs[i], sizeof(*m));
mpc->length = count;
mpt += sizeof(struct mpc_intsrc);
}
@@ -911,12 +826,10 @@ static int __init parse_alloc_mptable_opt(char *p)
}
early_param("alloc_mptable", parse_alloc_mptable_opt);
-void __init early_reserve_e820_mpc_new(void)
+void __init e820__memblock_alloc_reserved_mpc_new(void)
{
- if (enable_update_mptable && alloc_mptable) {
- u64 startt = 0;
- mpc_new_phys = early_reserve_e820(startt, mpc_new_length, 4);
- }
+ if (enable_update_mptable && alloc_mptable)
+ mpc_new_phys = e820__memblock_alloc_reserved(mpc_new_length, 4);
}
static int __init update_mp_table(void)
@@ -925,67 +838,89 @@ static int __init update_mp_table(void)
char oem[10];
struct mpf_intel *mpf;
struct mpc_table *mpc, *mpc_new;
+ unsigned long size;
if (!enable_update_mptable)
return 0;
- mpf = mpf_found;
- if (!mpf)
+ if (!mpf_found)
+ return 0;
+
+ mpf = early_memremap(mpf_base, sizeof(*mpf));
+ if (!mpf) {
+ pr_err("MPTABLE: mpf early_memremap() failed\n");
return 0;
+ }
/*
* Now see if we need to go further.
*/
- if (mpf->feature1 != 0)
- return 0;
+ if (mpf->feature1)
+ goto do_unmap_mpf;
if (!mpf->physptr)
- return 0;
+ goto do_unmap_mpf;
- mpc = phys_to_virt(mpf->physptr);
+ size = get_mpc_size(mpf->physptr);
+ mpc = early_memremap(mpf->physptr, size);
+ if (!mpc) {
+ pr_err("MPTABLE: mpc early_memremap() failed\n");
+ goto do_unmap_mpf;
+ }
if (!smp_check_mpc(mpc, oem, str))
- return 0;
+ goto do_unmap_mpc;
- printk(KERN_INFO "mpf: %llx\n", (u64)virt_to_phys(mpf));
- printk(KERN_INFO "physptr: %x\n", mpf->physptr);
+ pr_info("mpf: %llx\n", (u64)mpf_base);
+ pr_info("physptr: %x\n", mpf->physptr);
if (mpc_new_phys && mpc->length > mpc_new_length) {
mpc_new_phys = 0;
- printk(KERN_INFO "mpc_new_length is %ld, please use alloc_mptable=8k\n",
- mpc_new_length);
+ pr_info("mpc_new_length is %ld, please use alloc_mptable=8k\n",
+ mpc_new_length);
}
if (!mpc_new_phys) {
unsigned char old, new;
- /* check if we can change the postion */
+ /* check if we can change the position */
mpc->checksum = 0;
old = mpf_checksum((unsigned char *)mpc, mpc->length);
mpc->checksum = 0xff;
new = mpf_checksum((unsigned char *)mpc, mpc->length);
if (old == new) {
- printk(KERN_INFO "mpc is readonly, please try alloc_mptable instead\n");
- return 0;
+ pr_info("mpc is readonly, please try alloc_mptable instead\n");
+ goto do_unmap_mpc;
}
- printk(KERN_INFO "use in-positon replacing\n");
+ pr_info("use in-position replacing\n");
} else {
+ mpc_new = early_memremap(mpc_new_phys, mpc_new_length);
+ if (!mpc_new) {
+ pr_err("MPTABLE: new mpc early_memremap() failed\n");
+ goto do_unmap_mpc;
+ }
mpf->physptr = mpc_new_phys;
- mpc_new = phys_to_virt(mpc_new_phys);
memcpy(mpc_new, mpc, mpc->length);
+ early_memunmap(mpc, size);
mpc = mpc_new;
+ size = mpc_new_length;
/* check if we can modify that */
if (mpc_new_phys - mpf->physptr) {
struct mpf_intel *mpf_new;
/* steal 16 bytes from [0, 1k) */
- printk(KERN_INFO "mpf new: %x\n", 0x400 - 16);
- mpf_new = phys_to_virt(0x400 - 16);
+ mpf_new = early_memremap(0x400 - 16, sizeof(*mpf_new));
+ if (!mpf_new) {
+ pr_err("MPTABLE: new mpf early_memremap() failed\n");
+ goto do_unmap_mpc;
+ }
+ pr_info("mpf new: %x\n", 0x400 - 16);
memcpy(mpf_new, mpf, 16);
+ early_memunmap(mpf, sizeof(*mpf));
mpf = mpf_new;
mpf->physptr = mpc_new_phys;
}
mpf->checksum = 0;
mpf->checksum -= mpf_checksum((unsigned char *)mpf, 16);
- printk(KERN_INFO "physptr new: %x\n", mpf->physptr);
+ pr_info("physptr new: %x\n", mpf->physptr);
}
/*
@@ -996,6 +931,12 @@ static int __init update_mp_table(void)
*/
replace_intsrc_all(mpc, mpc_new_phys, mpc_new_length);
+do_unmap_mpc:
+ early_memunmap(mpc, size);
+
+do_unmap_mpf:
+ early_memunmap(mpf, sizeof(*mpf));
+
return 0;
}
diff --git a/arch/x86/kernel/mrst.c b/arch/x86/kernel/mrst.c
deleted file mode 100644
index 5915e0b33303..000000000000
--- a/arch/x86/kernel/mrst.c
+++ /dev/null
@@ -1,252 +0,0 @@
-/*
- * mrst.c: Intel Moorestown platform specific setup code
- *
- * (C) Copyright 2008 Intel Corporation
- * Author: Jacob Pan (jacob.jun.pan@intel.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; version 2
- * of the License.
- */
-#include <linux/init.h>
-#include <linux/kernel.h>
-#include <linux/sfi.h>
-#include <linux/irq.h>
-#include <linux/module.h>
-
-#include <asm/setup.h>
-#include <asm/mpspec_def.h>
-#include <asm/hw_irq.h>
-#include <asm/apic.h>
-#include <asm/io_apic.h>
-#include <asm/mrst.h>
-#include <asm/io.h>
-#include <asm/i8259.h>
-#include <asm/apb_timer.h>
-
-static u32 sfi_mtimer_usage[SFI_MTMR_MAX_NUM];
-static struct sfi_timer_table_entry sfi_mtimer_array[SFI_MTMR_MAX_NUM];
-int sfi_mtimer_num;
-
-struct sfi_rtc_table_entry sfi_mrtc_array[SFI_MRTC_MAX];
-EXPORT_SYMBOL_GPL(sfi_mrtc_array);
-int sfi_mrtc_num;
-
-static inline void assign_to_mp_irq(struct mpc_intsrc *m,
- struct mpc_intsrc *mp_irq)
-{
- memcpy(mp_irq, m, sizeof(struct mpc_intsrc));
-}
-
-static inline int mp_irq_cmp(struct mpc_intsrc *mp_irq,
- struct mpc_intsrc *m)
-{
- return memcmp(mp_irq, m, sizeof(struct mpc_intsrc));
-}
-
-static void save_mp_irq(struct mpc_intsrc *m)
-{
- int i;
-
- for (i = 0; i < mp_irq_entries; i++) {
- if (!mp_irq_cmp(&mp_irqs[i], m))
- return;
- }
-
- assign_to_mp_irq(m, &mp_irqs[mp_irq_entries]);
- if (++mp_irq_entries == MAX_IRQ_SOURCES)
- panic("Max # of irq sources exceeded!!\n");
-}
-
-/* parse all the mtimer info to a static mtimer array */
-static int __init sfi_parse_mtmr(struct sfi_table_header *table)
-{
- struct sfi_table_simple *sb;
- struct sfi_timer_table_entry *pentry;
- struct mpc_intsrc mp_irq;
- int totallen;
-
- sb = (struct sfi_table_simple *)table;
- if (!sfi_mtimer_num) {
- sfi_mtimer_num = SFI_GET_NUM_ENTRIES(sb,
- struct sfi_timer_table_entry);
- pentry = (struct sfi_timer_table_entry *) sb->pentry;
- totallen = sfi_mtimer_num * sizeof(*pentry);
- memcpy(sfi_mtimer_array, pentry, totallen);
- }
-
- printk(KERN_INFO "SFI: MTIMER info (num = %d):\n", sfi_mtimer_num);
- pentry = sfi_mtimer_array;
- for (totallen = 0; totallen < sfi_mtimer_num; totallen++, pentry++) {
- printk(KERN_INFO "timer[%d]: paddr = 0x%08x, freq = %dHz,"
- " irq = %d\n", totallen, (u32)pentry->phys_addr,
- pentry->freq_hz, pentry->irq);
- if (!pentry->irq)
- continue;
- mp_irq.type = MP_IOAPIC;
- mp_irq.irqtype = mp_INT;
-/* triggering mode edge bit 2-3, active high polarity bit 0-1 */
- mp_irq.irqflag = 5;
- mp_irq.srcbus = 0;
- mp_irq.srcbusirq = pentry->irq; /* IRQ */
- mp_irq.dstapic = MP_APIC_ALL;
- mp_irq.dstirq = pentry->irq;
- save_mp_irq(&mp_irq);
- }
-
- return 0;
-}
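
(Aside, not part of the removed file: the mp_irq.irqflag = 5 above packs the MP-spec interrupt flags, polarity in bits 1:0 and trigger mode in bits 3:2, so 5, binary 0101, means active-high and edge-triggered; the 0x0f value tested elsewhere in this series is active-low and level-triggered. A small decode sketch:)

/* Editor's sketch: decoding the MP-spec irqflag nibble. */
#include <stdio.h>

static const char *pol[] = { "conforming", "active-high", "reserved", "active-low" };
static const char *trig[] = { "conforming", "edge", "reserved", "level" };

int main(void)
{
	unsigned flags[] = { 0x5, 0xf, 0x0 };

	for (int i = 0; i < 3; i++)
		printf("irqflag %#x: %s, %s\n", flags[i],
		       pol[flags[i] & 3], trig[(flags[i] >> 2) & 3]);
	return 0;
}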
-
-struct sfi_timer_table_entry *sfi_get_mtmr(int hint)
-{
- int i;
- if (hint < sfi_mtimer_num) {
- if (!sfi_mtimer_usage[hint]) {
- pr_debug("hint taken for timer %d irq %d\n",\
- hint, sfi_mtimer_array[hint].irq);
- sfi_mtimer_usage[hint] = 1;
- return &sfi_mtimer_array[hint];
- }
- }
- /* take the first timer available */
- for (i = 0; i < sfi_mtimer_num;) {
- if (!sfi_mtimer_usage[i]) {
- sfi_mtimer_usage[i] = 1;
- return &sfi_mtimer_array[i];
- }
- i++;
- }
- return NULL;
-}
-
-void sfi_free_mtmr(struct sfi_timer_table_entry *mtmr)
-{
- int i;
- for (i = 0; i < sfi_mtimer_num;) {
- if (mtmr->irq == sfi_mtimer_array[i].irq) {
- sfi_mtimer_usage[i] = 0;
- return;
- }
- i++;
- }
-}
-
-/* parse all the mrtc info to a global mrtc array */
-int __init sfi_parse_mrtc(struct sfi_table_header *table)
-{
- struct sfi_table_simple *sb;
- struct sfi_rtc_table_entry *pentry;
- struct mpc_intsrc mp_irq;
-
- int totallen;
-
- sb = (struct sfi_table_simple *)table;
- if (!sfi_mrtc_num) {
- sfi_mrtc_num = SFI_GET_NUM_ENTRIES(sb,
- struct sfi_rtc_table_entry);
- pentry = (struct sfi_rtc_table_entry *)sb->pentry;
- totallen = sfi_mrtc_num * sizeof(*pentry);
- memcpy(sfi_mrtc_array, pentry, totallen);
- }
-
- printk(KERN_INFO "SFI: RTC info (num = %d):\n", sfi_mrtc_num);
- pentry = sfi_mrtc_array;
- for (totallen = 0; totallen < sfi_mrtc_num; totallen++, pentry++) {
- printk(KERN_INFO "RTC[%d]: paddr = 0x%08x, irq = %d\n",
- totallen, (u32)pentry->phys_addr, pentry->irq);
- mp_irq.type = MP_IOAPIC;
- mp_irq.irqtype = mp_INT;
- mp_irq.irqflag = 0;
- mp_irq.srcbus = 0;
- mp_irq.srcbusirq = pentry->irq; /* IRQ */
- mp_irq.dstapic = MP_APIC_ALL;
- mp_irq.dstirq = pentry->irq;
- save_mp_irq(&mp_irq);
- }
- return 0;
-}
-
-/*
- * the secondary clock in Moorestown can be APBT or LAPIC clock, default to
- * APBT but cmdline option can also override it.
- */
-static void __cpuinit mrst_setup_secondary_clock(void)
-{
- /* restore default lapic clock if disabled by cmdline */
- if (disable_apbt_percpu)
- return setup_secondary_APIC_clock();
- apbt_setup_secondary_clock();
-}
-
-static unsigned long __init mrst_calibrate_tsc(void)
-{
- unsigned long flags, fast_calibrate;
-
- local_irq_save(flags);
- fast_calibrate = apbt_quick_calibrate();
- local_irq_restore(flags);
-
- if (fast_calibrate)
- return fast_calibrate;
-
- return 0;
-}
-
-void __init mrst_time_init(void)
-{
- sfi_table_parse(SFI_SIG_MTMR, NULL, NULL, sfi_parse_mtmr);
- pre_init_apic_IRQ0();
- apbt_time_init();
-}
-
-void __init mrst_rtc_init(void)
-{
- sfi_table_parse(SFI_SIG_MRTC, NULL, NULL, sfi_parse_mrtc);
-}
-
-/*
- * if we use per cpu apb timer, the bootclock already setup. if we use lapic
- * timer and one apbt timer for broadcast, we need to set up lapic boot clock.
- */
-static void __init mrst_setup_boot_clock(void)
-{
- pr_info("%s: per cpu apbt flag %d \n", __func__, disable_apbt_percpu);
- if (disable_apbt_percpu)
- setup_boot_APIC_clock();
-};
-
-/* MID systems don't have i8042 controller */
-static int mrst_i8042_detect(void)
-{
- return 0;
-}
-
-/*
- * Moorestown specific x86_init function overrides and early setup
- * calls.
- */
-void __init x86_mrst_early_setup(void)
-{
- x86_init.resources.probe_roms = x86_init_noop;
- x86_init.resources.reserve_resources = x86_init_noop;
-
- x86_init.timers.timer_init = mrst_time_init;
- x86_init.timers.setup_percpu_clockev = mrst_setup_boot_clock;
-
- x86_init.irqs.pre_vector_init = x86_init_noop;
-
- x86_cpuinit.setup_percpu_clockev = mrst_setup_secondary_clock;
-
- x86_platform.calibrate_tsc = mrst_calibrate_tsc;
- x86_platform.i8042_detect = mrst_i8042_detect;
- x86_init.pci.init = pci_mrst_init;
- x86_init.pci.fixup_irqs = x86_init_noop;
-
- legacy_pic = &null_legacy_pic;
-
- /* Avoid searching for BIOS MP tables */
- x86_init.mpparse.find_smp_config = x86_init_noop;
- x86_init.mpparse.get_smp_config = x86_init_uint_noop;
-
-}
diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c
index 7bf2dc4c8f70..4469c784eaa0 100644
--- a/arch/x86/kernel/msr.c
+++ b/arch/x86/kernel/msr.c
@@ -1,14 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/* ----------------------------------------------------------------------- *
*
* Copyright 2000-2008 H. Peter Anvin - All Rights Reserved
* Copyright 2009 Intel Corporation; author: H. Peter Anvin
*
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge MA 02139,
- * USA; either version 2 of the License, or (at your option) any later
- * version; incorporated herein by reference.
- *
* ----------------------------------------------------------------------- */
/*
@@ -22,6 +17,8 @@
* an SMP box will direct the access to CPU %d.
*/
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
#include <linux/module.h>
#include <linux/types.h>
@@ -30,7 +27,6 @@
#include <linux/init.h>
#include <linux/poll.h>
#include <linux/smp.h>
-#include <linux/smp_lock.h>
#include <linux/major.h>
#include <linux/fs.h>
#include <linux/device.h>
@@ -38,34 +34,20 @@
#include <linux/notifier.h>
#include <linux/uaccess.h>
#include <linux/gfp.h>
+#include <linux/security.h>
-#include <asm/processor.h>
+#include <asm/cpufeature.h>
#include <asm/msr.h>
-#include <asm/system.h>
-static struct class *msr_class;
+static enum cpuhp_state cpuhp_msr_state;
-static loff_t msr_seek(struct file *file, loff_t offset, int orig)
-{
- loff_t ret;
- struct inode *inode = file->f_mapping->host;
-
- mutex_lock(&inode->i_mutex);
- switch (orig) {
- case 0:
- file->f_pos = offset;
- ret = file->f_pos;
- break;
- case 1:
- file->f_pos += offset;
- ret = file->f_pos;
- break;
- default:
- ret = -EINVAL;
- }
- mutex_unlock(&inode->i_mutex);
- return ret;
-}
+enum allow_write_msrs {
+ MSR_WRITES_ON,
+ MSR_WRITES_OFF,
+ MSR_WRITES_DEFAULT,
+};
+
+static enum allow_write_msrs allow_writes = MSR_WRITES_DEFAULT;
static ssize_t msr_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
@@ -73,7 +55,7 @@ static ssize_t msr_read(struct file *file, char __user *buf,
u32 __user *tmp = (u32 __user *) buf;
u32 data[2];
u32 reg = *ppos;
- int cpu = iminor(file->f_path.dentry->d_inode);
+ int cpu = iminor(file_inode(file));
int err = 0;
ssize_t bytes = 0;
@@ -95,16 +77,52 @@ static ssize_t msr_read(struct file *file, char __user *buf,
return bytes ? bytes : err;
}
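
The read path above selects the CPU via the device minor and uses the file
offset as the MSR index, transferring data in 8-byte chunks. A minimal
userspace sketch (illustrative, not part of this patch; it assumes the msr
module is loaded and the caller has CAP_SYS_RAWIO plus read permission on
the node) that reads IA32_TIME_STAMP_COUNTER (0x10) on CPU 0:

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);	/* minor 0 -> CPU 0 */

	if (fd < 0) {
		perror("open /dev/cpu/0/msr");
		return 1;
	}
	/* The file offset is the MSR number; reads move in 8-byte chunks. */
	if (pread(fd, &val, sizeof(val), 0x10) != sizeof(val)) {
		perror("pread");
		return 1;
	}
	printf("TSC on CPU 0: 0x%" PRIx64 "\n", val);
	close(fd);
	return 0;
}
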
+static int filter_write(u32 reg)
+{
+ /*
+ * MSR writes usually happen all at once, and can easily saturate kmsg.
+ * Only allow one message every 30 seconds.
+ *
+ * It's possible to be smarter here and do it (for example) per-MSR, but
+ * it would certainly be more complex, and this is enough at least to
+ * avoid saturating the ring buffer.
+ */
+ static DEFINE_RATELIMIT_STATE(fw_rs, 30 * HZ, 1);
+
+ switch (allow_writes) {
+ case MSR_WRITES_ON: return 0;
+ case MSR_WRITES_OFF: return -EPERM;
+ default: break;
+ }
+
+ if (!__ratelimit(&fw_rs))
+ return 0;
+
+ pr_warn("Write to unrecognized MSR 0x%x by %s (pid: %d), tainting CPU_OUT_OF_SPEC.\n",
+ reg, current->comm, current->pid);
+ pr_warn("See https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about for details.\n");
+
+ return 0;
+}
+
static ssize_t msr_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
const u32 __user *tmp = (const u32 __user *)buf;
u32 data[2];
u32 reg = *ppos;
- int cpu = iminor(file->f_path.dentry->d_inode);
+ int cpu = iminor(file_inode(file));
int err = 0;
ssize_t bytes = 0;
+ err = security_locked_down(LOCKDOWN_MSR);
+ if (err)
+ return err;
+
+ err = filter_write(reg);
+ if (err)
+ return err;
+
if (count % 8)
return -EINVAL; /* Invalid chunk size */
@@ -113,9 +131,13 @@ static ssize_t msr_write(struct file *file, const char __user *buf,
err = -EFAULT;
break;
}
+
+ add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
+
err = wrmsr_safe_on_cpu(cpu, reg, data[0], data[1]);
if (err)
break;
+
tmp += 2;
bytes += 8;
}
@@ -127,7 +149,7 @@ static long msr_ioctl(struct file *file, unsigned int ioc, unsigned long arg)
{
u32 __user *uregs = (u32 __user *)arg;
u32 regs[8];
- int cpu = iminor(file->f_path.dentry->d_inode);
+ int cpu = iminor(file_inode(file));
int err;
switch (ioc) {
@@ -136,14 +158,14 @@ static long msr_ioctl(struct file *file, unsigned int ioc, unsigned long arg)
err = -EBADF;
break;
}
- if (copy_from_user(&regs, uregs, sizeof regs)) {
+ if (copy_from_user(&regs, uregs, sizeof(regs))) {
err = -EFAULT;
break;
}
err = rdmsr_safe_regs_on_cpu(cpu, regs);
if (err)
break;
- if (copy_to_user(uregs, &regs, sizeof regs))
+ if (copy_to_user(uregs, &regs, sizeof(regs)))
err = -EFAULT;
break;
@@ -152,14 +174,24 @@ static long msr_ioctl(struct file *file, unsigned int ioc, unsigned long arg)
err = -EBADF;
break;
}
- if (copy_from_user(&regs, uregs, sizeof regs)) {
+ if (copy_from_user(&regs, uregs, sizeof(regs))) {
err = -EFAULT;
break;
}
+ err = security_locked_down(LOCKDOWN_MSR);
+ if (err)
+ break;
+
+ err = filter_write(regs[1]);
+ if (err)
+ return err;
+
+ add_taint(TAINT_CPU_OUT_OF_SPEC, LOCKDEP_STILL_OK);
+
err = wrmsr_safe_regs_on_cpu(cpu, regs);
if (err)
break;
- if (copy_to_user(uregs, &regs, sizeof regs))
+ if (copy_to_user(uregs, &regs, sizeof(regs)))
err = -EFAULT;
break;
@@ -173,10 +205,12 @@ static long msr_ioctl(struct file *file, unsigned int ioc, unsigned long arg)
static int msr_open(struct inode *inode, struct file *file)
{
- unsigned int cpu;
+ unsigned int cpu = iminor(file_inode(file));
struct cpuinfo_x86 *c;
- cpu = iminor(file->f_path.dentry->d_inode);
+ if (!capable(CAP_SYS_RAWIO))
+ return -EPERM;
+
if (cpu >= nr_cpu_ids || !cpu_online(cpu))
return -ENXIO; /* No such CPU */
@@ -192,7 +226,7 @@ static int msr_open(struct inode *inode, struct file *file)
*/
static const struct file_operations msr_fops = {
.owner = THIS_MODULE,
- .llseek = msr_seek,
+ .llseek = no_seek_end_llseek,
.read = msr_read,
.write = msr_write,
.open = msr_open,
@@ -200,99 +234,101 @@ static const struct file_operations msr_fops = {
.compat_ioctl = msr_ioctl,
};
-static int __cpuinit msr_device_create(int cpu)
+static char *msr_devnode(const struct device *dev, umode_t *mode)
{
- struct device *dev;
-
- dev = device_create(msr_class, NULL, MKDEV(MSR_MAJOR, cpu), NULL,
- "msr%d", cpu);
- return IS_ERR(dev) ? PTR_ERR(dev) : 0;
+ return kasprintf(GFP_KERNEL, "cpu/%u/msr", MINOR(dev->devt));
}
-static void msr_device_destroy(int cpu)
-{
- device_destroy(msr_class, MKDEV(MSR_MAJOR, cpu));
-}
+static const struct class msr_class = {
+ .name = "msr",
+ .devnode = msr_devnode,
+};
-static int __cpuinit msr_class_cpu_callback(struct notifier_block *nfb,
- unsigned long action, void *hcpu)
+static int msr_device_create(unsigned int cpu)
{
- unsigned int cpu = (unsigned long)hcpu;
- int err = 0;
+ struct device *dev;
- switch (action) {
- case CPU_UP_PREPARE:
- err = msr_device_create(cpu);
- break;
- case CPU_UP_CANCELED:
- case CPU_UP_CANCELED_FROZEN:
- case CPU_DEAD:
- msr_device_destroy(cpu);
- break;
- }
- return notifier_from_errno(err);
+ dev = device_create(&msr_class, NULL, MKDEV(MSR_MAJOR, cpu), NULL,
+ "msr%d", cpu);
+ return PTR_ERR_OR_ZERO(dev);
}
-static struct notifier_block __refdata msr_class_cpu_notifier = {
- .notifier_call = msr_class_cpu_callback,
-};
-
-static char *msr_devnode(struct device *dev, mode_t *mode)
+static int msr_device_destroy(unsigned int cpu)
{
- return kasprintf(GFP_KERNEL, "cpu/%u/msr", MINOR(dev->devt));
+ device_destroy(&msr_class, MKDEV(MSR_MAJOR, cpu));
+ return 0;
}
static int __init msr_init(void)
{
- int i, err = 0;
- i = 0;
+ int err;
if (__register_chrdev(MSR_MAJOR, 0, NR_CPUS, "cpu/msr", &msr_fops)) {
- printk(KERN_ERR "msr: unable to get major %d for msr\n",
- MSR_MAJOR);
- err = -EBUSY;
- goto out;
+ pr_err("unable to get major %d for msr\n", MSR_MAJOR);
+ return -EBUSY;
}
- msr_class = class_create(THIS_MODULE, "msr");
- if (IS_ERR(msr_class)) {
- err = PTR_ERR(msr_class);
+ err = class_register(&msr_class);
+ if (err)
goto out_chrdev;
- }
- msr_class->devnode = msr_devnode;
- for_each_online_cpu(i) {
- err = msr_device_create(i);
- if (err != 0)
- goto out_class;
- }
- register_hotcpu_notifier(&msr_class_cpu_notifier);
- err = 0;
- goto out;
+ err = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/msr:online",
+ msr_device_create, msr_device_destroy);
+ if (err < 0)
+ goto out_class;
+ cpuhp_msr_state = err;
+ return 0;
out_class:
- i = 0;
- for_each_online_cpu(i)
- msr_device_destroy(i);
- class_destroy(msr_class);
+ class_unregister(&msr_class);
out_chrdev:
__unregister_chrdev(MSR_MAJOR, 0, NR_CPUS, "cpu/msr");
-out:
return err;
}
+module_init(msr_init);
static void __exit msr_exit(void)
{
- int cpu = 0;
- for_each_online_cpu(cpu)
- msr_device_destroy(cpu);
- class_destroy(msr_class);
+ cpuhp_remove_state(cpuhp_msr_state);
+ class_unregister(&msr_class);
__unregister_chrdev(MSR_MAJOR, 0, NR_CPUS, "cpu/msr");
- unregister_hotcpu_notifier(&msr_class_cpu_notifier);
}
-
-module_init(msr_init);
module_exit(msr_exit)
+static int set_allow_writes(const char *val, const struct kernel_param *cp)
+{
+ /* val is NUL-terminated, see kernfs_fop_write() */
+ char *s = strstrip((char *)val);
+
+ if (!strcmp(s, "on"))
+ allow_writes = MSR_WRITES_ON;
+ else if (!strcmp(s, "off"))
+ allow_writes = MSR_WRITES_OFF;
+ else
+ allow_writes = MSR_WRITES_DEFAULT;
+
+ return 0;
+}
+
+static int get_allow_writes(char *buf, const struct kernel_param *kp)
+{
+ const char *res;
+
+ switch (allow_writes) {
+ case MSR_WRITES_ON: res = "on"; break;
+ case MSR_WRITES_OFF: res = "off"; break;
+ default: res = "default"; break;
+ }
+
+ return sprintf(buf, "%s\n", res);
+}
+
+static const struct kernel_param_ops allow_writes_ops = {
+ .set = set_allow_writes,
+ .get = get_allow_writes
+};
+
+module_param_cb(allow_writes, &allow_writes_ops, NULL, 0600);
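
With the parameter callback registered above, the policy can be flipped at
runtime through the usual module-parameter sysfs file (or set at boot with
msr.allow_writes=). A small userspace sketch, assuming the standard
/sys/module/msr/parameters/allow_writes path:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* mode is "on", "off" or "default", matching set_allow_writes() above */
static int msr_set_allow_writes(const char *mode)
{
	int fd = open("/sys/module/msr/parameters/allow_writes", O_WRONLY);

	if (fd < 0) {
		perror("open allow_writes");
		return -1;
	}
	if (write(fd, mode, strlen(mode)) < 0) {
		perror("write allow_writes");
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	return msr_set_allow_writes("off") ? 1 : 0;
}
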
+
MODULE_AUTHOR("H. Peter Anvin <hpa@zytor.com>");
MODULE_DESCRIPTION("x86 generic MSR driver");
MODULE_LICENSE("GPL");
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
new file mode 100644
index 000000000000..3d239ed12744
--- /dev/null
+++ b/arch/x86/kernel/nmi.c
@@ -0,0 +1,752 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ * Copyright (C) 2000, 2001, 2002 Andi Kleen, SuSE Labs
+ * Copyright (C) 2011 Don Zickus Red Hat, Inc.
+ *
+ * Pentium III FXSR, SSE support
+ * Gareth Hughes <gareth@valinux.com>, May 2000
+ */
+
+/*
+ * Handle hardware traps and faults.
+ */
+#include <linux/spinlock.h>
+#include <linux/kprobes.h>
+#include <linux/kdebug.h>
+#include <linux/sched/debug.h>
+#include <linux/nmi.h>
+#include <linux/debugfs.h>
+#include <linux/delay.h>
+#include <linux/hardirq.h>
+#include <linux/ratelimit.h>
+#include <linux/slab.h>
+#include <linux/export.h>
+#include <linux/atomic.h>
+#include <linux/sched/clock.h>
+#include <linux/kvm_types.h>
+
+#include <asm/cpu_entry_area.h>
+#include <asm/traps.h>
+#include <asm/mach_traps.h>
+#include <asm/nmi.h>
+#include <asm/x86_init.h>
+#include <asm/reboot.h>
+#include <asm/cache.h>
+#include <asm/nospec-branch.h>
+#include <asm/microcode.h>
+#include <asm/sev.h>
+#include <asm/fred.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/nmi.h>
+
+/*
+ * An emergency handler can be set in any context including NMI
+ */
+struct nmi_desc {
+ raw_spinlock_t lock;
+ nmi_handler_t emerg_handler;
+ struct list_head head;
+};
+
+#define NMI_DESC_INIT(type) { \
+ .lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[type].lock), \
+ .head = LIST_HEAD_INIT(nmi_desc[type].head), \
+}
+
+static struct nmi_desc nmi_desc[NMI_MAX] = {
+ NMI_DESC_INIT(NMI_LOCAL),
+ NMI_DESC_INIT(NMI_UNKNOWN),
+ NMI_DESC_INIT(NMI_SERR),
+ NMI_DESC_INIT(NMI_IO_CHECK),
+};
+
+#define nmi_to_desc(type) (&nmi_desc[type])
+
+struct nmi_stats {
+ unsigned int normal;
+ unsigned int unknown;
+ unsigned int external;
+ unsigned int swallow;
+ unsigned long recv_jiffies;
+ unsigned long idt_seq;
+ unsigned long idt_nmi_seq;
+ unsigned long idt_ignored;
+ atomic_long_t idt_calls;
+ unsigned long idt_seq_snap;
+ unsigned long idt_nmi_seq_snap;
+ unsigned long idt_ignored_snap;
+ long idt_calls_snap;
+};
+
+static DEFINE_PER_CPU(struct nmi_stats, nmi_stats);
+
+static int ignore_nmis __read_mostly;
+
+int unknown_nmi_panic;
+int panic_on_unrecovered_nmi;
+int panic_on_io_nmi;
+
+/*
+ * Prevent NMI reason port (0x61) being accessed simultaneously, can
+ * only be used in NMI handler.
+ */
+static DEFINE_RAW_SPINLOCK(nmi_reason_lock);
+
+static int __init setup_unknown_nmi_panic(char *str)
+{
+ unknown_nmi_panic = 1;
+ return 1;
+}
+__setup("unknown_nmi_panic", setup_unknown_nmi_panic);
+
+static u64 nmi_longest_ns = 1 * NSEC_PER_MSEC;
+
+static int __init nmi_warning_debugfs(void)
+{
+ debugfs_create_u64("nmi_longest_ns", 0644,
+ arch_debugfs_dir, &nmi_longest_ns);
+ return 0;
+}
+fs_initcall(nmi_warning_debugfs);
+
+static void nmi_check_duration(struct nmiaction *action, u64 duration)
+{
+ int remainder_ns, decimal_msecs;
+
+ if (duration < nmi_longest_ns || duration < action->max_duration)
+ return;
+
+ action->max_duration = duration;
+
+ /* Convert duration from nsec to msec */
+ remainder_ns = do_div(duration, NSEC_PER_MSEC);
+ decimal_msecs = remainder_ns / NSEC_PER_USEC;
+
+ pr_info_ratelimited("INFO: NMI handler (%ps) took too long to run: %lld.%03d msecs\n",
+ action->handler, duration, decimal_msecs);
+}
+
+static int nmi_handle(unsigned int type, struct pt_regs *regs)
+{
+ struct nmi_desc *desc = nmi_to_desc(type);
+ nmi_handler_t ehandler;
+ struct nmiaction *a;
+ int handled = 0;
+
+ /*
+ * Call the emergency handler, if set
+ *
+ * In the case of crash_nmi_callback() emergency handler, it will
+ * return in the case of the crashing CPU to enable it to complete
+ * other necessary crashing actions ASAP. Other handlers in the
+ * linked list won't need to be run.
+ */
+ ehandler = desc->emerg_handler;
+ if (ehandler)
+ return ehandler(type, regs);
+
+ rcu_read_lock();
+
+ /*
+ * NMIs are edge-triggered, which means if you have enough
+ * of them concurrently, you can lose some because only one
+ * can be latched at any given time. Walk the whole list
+ * to handle those situations.
+ */
+ list_for_each_entry_rcu(a, &desc->head, list) {
+ int thishandled;
+ u64 delta;
+
+ delta = sched_clock();
+ thishandled = a->handler(type, regs);
+ handled += thishandled;
+ delta = sched_clock() - delta;
+ trace_nmi_handler(a->handler, (int)delta, thishandled);
+
+ nmi_check_duration(a, delta);
+ }
+
+ rcu_read_unlock();
+
+ /* return total number of NMI events handled */
+ return handled;
+}
+NOKPROBE_SYMBOL(nmi_handle);
+
+int __register_nmi_handler(unsigned int type, struct nmiaction *action)
+{
+ struct nmi_desc *desc = nmi_to_desc(type);
+ unsigned long flags;
+
+ if (WARN_ON_ONCE(!action->handler || !list_empty(&action->list)))
+ return -EINVAL;
+
+ raw_spin_lock_irqsave(&desc->lock, flags);
+
+ /*
+ * Indicate if there are multiple registrations on the
+ * internal NMI handler call chains (SERR and IO_CHECK).
+ */
+ WARN_ON_ONCE(type == NMI_SERR && !list_empty(&desc->head));
+ WARN_ON_ONCE(type == NMI_IO_CHECK && !list_empty(&desc->head));
+
+ /*
+ * some handlers need to be executed first otherwise a fake
+ * event confuses some handlers (kdump uses this flag)
+ */
+ if (action->flags & NMI_FLAG_FIRST)
+ list_add_rcu(&action->list, &desc->head);
+ else
+ list_add_tail_rcu(&action->list, &desc->head);
+
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ return 0;
+}
+EXPORT_SYMBOL(__register_nmi_handler);
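
For reference, a minimal in-kernel usage sketch of this API (illustrative
only; mydrv_hw_raised_nmi() is a hypothetical stand-in for a real device
check): a handler claims NMIs it recognizes with NMI_HANDLED and returns
NMI_DONE otherwise, so the chain walk continues to the next handler.

#include <linux/module.h>
#include <asm/nmi.h>

static bool mydrv_hw_raised_nmi(void)
{
	return false;	/* hypothetical: query the device's NMI status */
}

static int mydrv_nmi_handler(unsigned int type, struct pt_regs *regs)
{
	if (!mydrv_hw_raised_nmi())
		return NMI_DONE;	/* not ours: keep walking the list */
	return NMI_HANDLED;
}

static int __init mydrv_init(void)
{
	/* the name is the lookup key used by unregister_nmi_handler() */
	return register_nmi_handler(NMI_LOCAL, mydrv_nmi_handler, 0, "mydrv");
}

static void __exit mydrv_exit(void)
{
	unregister_nmi_handler(NMI_LOCAL, "mydrv");
}

module_init(mydrv_init);
module_exit(mydrv_exit);
MODULE_LICENSE("GPL");
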
+
+void unregister_nmi_handler(unsigned int type, const char *name)
+{
+ struct nmi_desc *desc = nmi_to_desc(type);
+ struct nmiaction *n, *found = NULL;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&desc->lock, flags);
+
+ list_for_each_entry_rcu(n, &desc->head, list) {
+ /*
+ * the name passed in to describe the nmi handler
+ * is used as the lookup key
+ */
+ if (!strcmp(n->name, name)) {
+ WARN(in_nmi(),
+ "Trying to free NMI (%s) from NMI context!\n", n->name);
+ list_del_rcu(&n->list);
+ found = n;
+ break;
+ }
+ }
+
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ if (found) {
+ synchronize_rcu();
+ INIT_LIST_HEAD(&found->list);
+ }
+}
+EXPORT_SYMBOL_GPL(unregister_nmi_handler);
+
+/**
+ * set_emergency_nmi_handler - Set emergency handler
+ * @type: NMI type
+ * @handler: the emergency handler to be stored
+ *
+ * Set an emergency NMI handler which, if set, will preempt all the other
+ * handlers in the linked list. If a NULL handler is passed in, it will clear
+ * it. It is expected that concurrent calls to this function will not happen
+ * or the system is screwed beyond repair.
+ */
+void set_emergency_nmi_handler(unsigned int type, nmi_handler_t handler)
+{
+ struct nmi_desc *desc = nmi_to_desc(type);
+
+ if (WARN_ON_ONCE(desc->emerg_handler == handler))
+ return;
+ desc->emerg_handler = handler;
+
+ /*
+ * Ensure the emergency handler is visible to other CPUs before
+ * function return
+ */
+ smp_wmb();
+}
+
+static void
+pci_serr_error(unsigned char reason, struct pt_regs *regs)
+{
+ /* check to see if anyone registered against these types of errors */
+ if (nmi_handle(NMI_SERR, regs))
+ return;
+
+ pr_emerg("NMI: PCI system error (SERR) for reason %02x on CPU %d.\n",
+ reason, smp_processor_id());
+
+ if (panic_on_unrecovered_nmi)
+ nmi_panic(regs, "NMI: Not continuing");
+
+ pr_emerg("Dazed and confused, but trying to continue\n");
+
+ /* Clear and disable the PCI SERR error line. */
+ reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_SERR;
+ outb(reason, NMI_REASON_PORT);
+}
+NOKPROBE_SYMBOL(pci_serr_error);
+
+static void
+io_check_error(unsigned char reason, struct pt_regs *regs)
+{
+ unsigned long i;
+
+ /* check to see if anyone registered against these types of errors */
+ if (nmi_handle(NMI_IO_CHECK, regs))
+ return;
+
+ pr_emerg(
+ "NMI: IOCK error (debug interrupt?) for reason %02x on CPU %d.\n",
+ reason, smp_processor_id());
+ show_regs(regs);
+
+ if (panic_on_io_nmi) {
+ nmi_panic(regs, "NMI IOCK error: Not continuing");
+
+ /*
+ * If we end up here, it means we have received an NMI while
+ * processing panic(). Simply return without delaying and
+ * re-enabling NMIs.
+ */
+ return;
+ }
+
+ /* Re-enable the IOCK line, wait for a few seconds */
+ reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
+ outb(reason, NMI_REASON_PORT);
+
+ i = 20000;
+ while (--i) {
+ touch_nmi_watchdog();
+ udelay(100);
+ }
+
+ reason &= ~NMI_REASON_CLEAR_IOCHK;
+ outb(reason, NMI_REASON_PORT);
+}
+NOKPROBE_SYMBOL(io_check_error);
+
+static void
+unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
+{
+ int handled;
+
+ /*
+ * As a last resort, let the "unknown" handlers make a
+ * best-effort attempt to figure out if they can claim
+ * responsibility for this Unknown NMI.
+ */
+ handled = nmi_handle(NMI_UNKNOWN, regs);
+ if (handled) {
+ __this_cpu_add(nmi_stats.unknown, handled);
+ return;
+ }
+
+ __this_cpu_add(nmi_stats.unknown, 1);
+
+ pr_emerg_ratelimited("Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
+ reason, smp_processor_id());
+
+ if (unknown_nmi_panic || panic_on_unrecovered_nmi)
+ nmi_panic(regs, "NMI: Not continuing");
+
+ pr_emerg_ratelimited("Dazed and confused, but trying to continue\n");
+}
+NOKPROBE_SYMBOL(unknown_nmi_error);
+
+static DEFINE_PER_CPU(bool, swallow_nmi);
+static DEFINE_PER_CPU(unsigned long, last_nmi_rip);
+
+static noinstr void default_do_nmi(struct pt_regs *regs)
+{
+ unsigned char reason = 0;
+ int handled;
+ bool b2b = false;
+
+ /*
+ * Back-to-back NMIs are detected by comparing the RIP of the
+ * current NMI with that of the previous NMI. If it is the same,
+ * it is assumed that the CPU did not have a chance to jump back
+ * into a non-NMI context and execute code in between the two
+ * NMIs.
+ *
+ * They are interesting because even if there are more than two,
+ * only a maximum of two can be detected (anything over two is
+ * dropped due to NMI being edge-triggered). If this is the
+ * second half of the back-to-back NMI, assume we dropped things
+ * and process more handlers. Otherwise, reset the 'swallow' NMI
+ * behavior.
+ */
+ if (regs->ip == __this_cpu_read(last_nmi_rip))
+ b2b = true;
+ else
+ __this_cpu_write(swallow_nmi, false);
+
+ __this_cpu_write(last_nmi_rip, regs->ip);
+
+ instrumentation_begin();
+
+ if (microcode_nmi_handler_enabled() && microcode_nmi_handler())
+ goto out;
+
+ /*
+ * CPU-specific NMI must be processed before non-CPU-specific
+ * NMI, otherwise we may lose it, because the CPU-specific
+ * NMI can not be detected/processed on other CPUs.
+ */
+ handled = nmi_handle(NMI_LOCAL, regs);
+ __this_cpu_add(nmi_stats.normal, handled);
+ if (handled) {
+ /*
+ * There are cases when an NMI handler handles multiple
+ * events in the current NMI. One of these events may
+ * be queued to fire as the next NMI. Because the event is
+ * already handled, the next NMI will result in an unknown
+ * NMI. Instead lets flag this for a potential NMI to
+ * swallow.
+ */
+ if (handled > 1)
+ __this_cpu_write(swallow_nmi, true);
+ goto out;
+ }
+
+ /*
+ * Non-CPU-specific NMI: NMI sources can be processed on any CPU.
+ *
+ * Another CPU may be processing panic routines while holding
+ * nmi_reason_lock. Check if the CPU issued the IPI for crash dumping,
+ * and if so, call its callback directly. If there is no CPU preparing
+ * crash dump, we simply loop here.
+ */
+ while (!raw_spin_trylock(&nmi_reason_lock)) {
+ run_crash_ipi_callback(regs);
+ cpu_relax();
+ }
+
+ reason = x86_platform.get_nmi_reason();
+
+ if (reason & NMI_REASON_MASK) {
+ if (reason & NMI_REASON_SERR)
+ pci_serr_error(reason, regs);
+ else if (reason & NMI_REASON_IOCHK)
+ io_check_error(reason, regs);
+
+ /*
+ * Reassert NMI in case it became active
+ * meanwhile as it's edge-triggered:
+ */
+ if (IS_ENABLED(CONFIG_X86_32))
+ reassert_nmi();
+
+ __this_cpu_add(nmi_stats.external, 1);
+ raw_spin_unlock(&nmi_reason_lock);
+ goto out;
+ }
+ raw_spin_unlock(&nmi_reason_lock);
+
+ /*
+ * Only one NMI can be latched at a time. To handle
+ * this we may process multiple nmi handlers at once to
+ * cover the case where an NMI is dropped. The downside
+ * to this approach is we may process an NMI prematurely,
+ * while its real NMI is sitting latched. This will cause
+ * an unknown NMI on the next run of the NMI processing.
+ *
+ * We tried to flag that condition above, by setting the
+ * swallow_nmi flag when we process more than one event.
+ * This condition is also only present on the second half
+ * of a back-to-back NMI, so we flag that condition too.
+ *
+ * If both are true, we assume we already processed this
+ * NMI previously and we swallow it. Otherwise we reset
+ * the logic.
+ *
+ * There are scenarios where we may accidentally swallow
+ * a 'real' unknown NMI. For example, while processing
+ * a perf NMI another perf NMI comes in along with a
+ * 'real' unknown NMI. These two NMIs get combined into
+ * one (as described above). When the next NMI gets
+ * processed, it will be flagged by perf as handled, but
+ * no one will know that there was a 'real' unknown NMI sent
+ * also. As a result it gets swallowed. Or if the first
+ * perf NMI returns two events handled then the second
+ * NMI will get eaten by the logic below, again losing a
+ * 'real' unknown NMI. But this is the best we can do
+ * for now.
+ */
+ if (b2b && __this_cpu_read(swallow_nmi))
+ __this_cpu_add(nmi_stats.swallow, 1);
+ else
+ unknown_nmi_error(reason, regs);
+
+out:
+ instrumentation_end();
+}
+
+/*
+ * NMIs can page fault or hit breakpoints, which will cause them to lose
+ * their NMI context with the CPU when the breakpoint or page fault does an IRET.
+ *
+ * As a result, NMIs can nest if NMIs get unmasked due to an IRET during
+ * NMI processing. On x86_64, the asm glue protects us from nested NMIs
+ * if the outer NMI came from kernel mode, but we can still nest if the
+ * outer NMI came from user mode.
+ *
+ * To handle these nested NMIs, we have three states:
+ *
+ * 1) not running
+ * 2) executing
+ * 3) latched
+ *
+ * When no NMI is in progress, it is in the "not running" state.
+ * When an NMI comes in, it goes into the "executing" state.
+ * Normally, if another NMI is triggered, it does not interrupt
+ * the running NMI and the HW will simply latch it so that when
+ * the first NMI finishes, it will restart the second NMI.
+ * (Note, the latch is binary, thus multiple NMIs triggering,
+ * when one is running, are ignored. Only one NMI is restarted.)
+ *
+ * If an NMI executes an iret, another NMI can preempt it. We do not
+ * want to allow this new NMI to run, but we want to execute it when the
+ * first one finishes. We set the state to "latched", and the exit of
+ * the first NMI will perform a dec_return, if the result is zero
+ * (NOT_RUNNING), then it will simply exit the NMI handler. If not, the
+ * dec_return would have set the state to NMI_EXECUTING (what we want it
+ * to be when we are running). In this case, we simply jump back to
+ * rerun the NMI handler again, and restart the 'latched' NMI.
+ *
+ * No trap (breakpoint or page fault) should be hit before nmi_restart,
+ * thus there is no race between the first check of state for NOT_RUNNING
+ * and setting it to NMI_EXECUTING. The HW will prevent nested NMIs
+ * at this point.
+ *
+ * In case the NMI takes a page fault, we need to save off the CR2
+ * because the NMI could have preempted another page fault and corrupt
+ * the CR2 that is about to be read. As nested NMIs must be restarted
+ * and they can not take breakpoints or page faults, the update of the
+ * CR2 must be done before converting the nmi state back to NOT_RUNNING.
+ * Otherwise, there would be a race of another nested NMI coming in
+ * after setting state to NOT_RUNNING but before updating the nmi_cr2.
+ */
+enum nmi_states {
+ NMI_NOT_RUNNING = 0,
+ NMI_EXECUTING,
+ NMI_LATCHED,
+};
+static DEFINE_PER_CPU(enum nmi_states, nmi_state);
+static DEFINE_PER_CPU(unsigned long, nmi_cr2);
+static DEFINE_PER_CPU(unsigned long, nmi_dr7);
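
The state machine described above can be modeled in a few lines of plain C
(a userspace toy, not kernel code): a nested NMI only flips the state to
LATCHED, and the outer invocation reruns the handler until its dec-and-test
reaches NOT_RUNNING. Note how several nested NMIs still produce only one
rerun, because the latch is binary.

#include <stdio.h>

enum { NOT_RUNNING, EXECUTING, LATCHED };

static int state = NOT_RUNNING;
static int pending;		/* NMIs injected while the handler runs */

static void nmi(void)
{
	if (state != NOT_RUNNING) {
		state = LATCHED;	/* remember at most one pending NMI */
		return;
	}
	state = EXECUTING;
restart:
	printf("handler run\n");
	while (pending) {		/* nested NMIs arrive mid-handler */
		pending--;
		nmi();
	}
	if (--state)			/* LATCHED -> EXECUTING: run again */
		goto restart;		/* EXECUTING -> NOT_RUNNING: done */
}

int main(void)
{
	pending = 3;	/* three nested NMIs cause just one rerun */
	nmi();
	return 0;
}
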
+
+DEFINE_IDTENTRY_RAW(exc_nmi)
+{
+ irqentry_state_t irq_state;
+ struct nmi_stats *nsp = this_cpu_ptr(&nmi_stats);
+
+ /*
+ * Re-enable NMIs right here when running as an SEV-ES guest. This might
+ * cause nested NMIs, but those can be handled safely.
+ */
+ sev_es_nmi_complete();
+ if (IS_ENABLED(CONFIG_NMI_CHECK_CPU))
+ raw_atomic_long_inc(&nsp->idt_calls);
+
+ if (arch_cpu_is_offline(smp_processor_id())) {
+ if (microcode_nmi_handler_enabled())
+ microcode_offline_nmi_handler();
+ return;
+ }
+
+ if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
+ this_cpu_write(nmi_state, NMI_LATCHED);
+ return;
+ }
+ this_cpu_write(nmi_state, NMI_EXECUTING);
+ this_cpu_write(nmi_cr2, read_cr2());
+
+nmi_restart:
+ if (IS_ENABLED(CONFIG_NMI_CHECK_CPU)) {
+ WRITE_ONCE(nsp->idt_seq, nsp->idt_seq + 1);
+ WARN_ON_ONCE(!(nsp->idt_seq & 0x1));
+ WRITE_ONCE(nsp->recv_jiffies, jiffies);
+ }
+
+ /*
+ * Needs to happen before DR7 is accessed, because the hypervisor can
+ * intercept DR7 reads/writes, turning those into #VC exceptions.
+ */
+ sev_es_ist_enter(regs);
+
+ this_cpu_write(nmi_dr7, local_db_save());
+
+ irq_state = irqentry_nmi_enter(regs);
+
+ inc_irq_stat(__nmi_count);
+
+ if (IS_ENABLED(CONFIG_NMI_CHECK_CPU) && ignore_nmis) {
+ WRITE_ONCE(nsp->idt_ignored, nsp->idt_ignored + 1);
+ } else if (!ignore_nmis) {
+ if (IS_ENABLED(CONFIG_NMI_CHECK_CPU)) {
+ WRITE_ONCE(nsp->idt_nmi_seq, nsp->idt_nmi_seq + 1);
+ WARN_ON_ONCE(!(nsp->idt_nmi_seq & 0x1));
+ }
+ default_do_nmi(regs);
+ if (IS_ENABLED(CONFIG_NMI_CHECK_CPU)) {
+ WRITE_ONCE(nsp->idt_nmi_seq, nsp->idt_nmi_seq + 1);
+ WARN_ON_ONCE(nsp->idt_nmi_seq & 0x1);
+ }
+ }
+
+ irqentry_nmi_exit(regs, irq_state);
+
+ local_db_restore(this_cpu_read(nmi_dr7));
+
+ sev_es_ist_exit();
+
+ if (unlikely(this_cpu_read(nmi_cr2) != read_cr2()))
+ write_cr2(this_cpu_read(nmi_cr2));
+ if (IS_ENABLED(CONFIG_NMI_CHECK_CPU)) {
+ WRITE_ONCE(nsp->idt_seq, nsp->idt_seq + 1);
+ WARN_ON_ONCE(nsp->idt_seq & 0x1);
+ WRITE_ONCE(nsp->recv_jiffies, jiffies);
+ }
+ if (this_cpu_dec_return(nmi_state))
+ goto nmi_restart;
+}
+
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+DEFINE_IDTENTRY_RAW(exc_nmi_kvm_vmx)
+{
+ exc_nmi(regs);
+}
+EXPORT_SYMBOL_FOR_KVM(asm_exc_nmi_kvm_vmx);
+#endif
+
+#ifdef CONFIG_NMI_CHECK_CPU
+
+static char *nmi_check_stall_msg[] = {
+/* */
+/* +--------- nmi_seq & 0x1: CPU is currently in NMI handler. */
+/* | +------ cpu_is_offline(cpu) */
+/* | | +--- nsp->idt_calls_snap != atomic_long_read(&nsp->idt_calls): */
+/* | | | NMI handler has been invoked. */
+/* | | | */
+/* V V V */
+/* 0 0 0 */ "NMIs are not reaching exc_nmi() handler",
+/* 0 0 1 */ "exc_nmi() handler is ignoring NMIs",
+/* 0 1 0 */ "CPU is offline and NMIs are not reaching exc_nmi() handler",
+/* 0 1 1 */ "CPU is offline and exc_nmi() handler is legitimately ignoring NMIs",
+/* 1 0 0 */ "CPU is in exc_nmi() handler and no further NMIs are reaching handler",
+/* 1 0 1 */ "CPU is in exc_nmi() handler which is legitimately ignoring NMIs",
+/* 1 1 0 */ "CPU is offline in exc_nmi() handler and no more NMIs are reaching exc_nmi() handler",
+/* 1 1 1 */ "CPU is offline in exc_nmi() handler which is legitimately ignoring NMIs",
+};
+
+void nmi_backtrace_stall_snap(const struct cpumask *btp)
+{
+ int cpu;
+ struct nmi_stats *nsp;
+
+ for_each_cpu(cpu, btp) {
+ nsp = per_cpu_ptr(&nmi_stats, cpu);
+ nsp->idt_seq_snap = READ_ONCE(nsp->idt_seq);
+ nsp->idt_nmi_seq_snap = READ_ONCE(nsp->idt_nmi_seq);
+ nsp->idt_ignored_snap = READ_ONCE(nsp->idt_ignored);
+ nsp->idt_calls_snap = atomic_long_read(&nsp->idt_calls);
+ }
+}
+
+void nmi_backtrace_stall_check(const struct cpumask *btp)
+{
+ int cpu;
+ int idx;
+ unsigned long nmi_seq;
+ unsigned long j = jiffies;
+ char *modp;
+ char *msgp;
+ char *msghp;
+ struct nmi_stats *nsp;
+
+ for_each_cpu(cpu, btp) {
+ nsp = per_cpu_ptr(&nmi_stats, cpu);
+ modp = "";
+ msghp = "";
+ nmi_seq = READ_ONCE(nsp->idt_nmi_seq);
+ if (nsp->idt_nmi_seq_snap + 1 == nmi_seq && (nmi_seq & 0x1)) {
+ msgp = "CPU entered NMI handler function, but has not exited";
+ } else if (nsp->idt_nmi_seq_snap == nmi_seq ||
+ nsp->idt_nmi_seq_snap + 1 == nmi_seq) {
+ idx = ((nmi_seq & 0x1) << 2) |
+ (cpu_is_offline(cpu) << 1) |
+ (nsp->idt_calls_snap != atomic_long_read(&nsp->idt_calls));
+ msgp = nmi_check_stall_msg[idx];
+ if (nsp->idt_ignored_snap != READ_ONCE(nsp->idt_ignored) && (idx & 0x1))
+ modp = ", but OK because ignore_nmis was set";
+ if (nsp->idt_nmi_seq_snap + 1 == nmi_seq)
+ msghp = " (CPU exited one NMI handler function)";
+ else if (nmi_seq & 0x1)
+ msghp = " (CPU currently in NMI handler function)";
+ else
+ msghp = " (CPU was never in an NMI handler function)";
+ } else {
+ msgp = "CPU is handling NMIs";
+ }
+ pr_alert("%s: CPU %d: %s%s%s\n", __func__, cpu, msgp, modp, msghp);
+ pr_alert("%s: last activity: %lu jiffies ago.\n",
+ __func__, j - READ_ONCE(nsp->recv_jiffies));
+ }
+}
+
+#endif
+
+#ifdef CONFIG_X86_FRED
+/*
+ * With FRED, CR2/DR6 is pushed to #PF/#DB stack frame during FRED
+ * event delivery, i.e., there is no problem of transient states.
+ * And NMI unblocking only happens when the stack frame indicates
+ * that it should happen.
+ *
+ * Thus, the NMI entry stub for FRED is really straightforward and
+ * as simple as most exception handlers. As such, #DB is allowed
+ * during NMI handling.
+ */
+DEFINE_FREDENTRY_NMI(exc_nmi)
+{
+ irqentry_state_t irq_state;
+
+ if (arch_cpu_is_offline(smp_processor_id())) {
+ if (microcode_nmi_handler_enabled())
+ microcode_offline_nmi_handler();
+ return;
+ }
+
+ /*
+ * Save CR2 for eventual restore to cover the case where the NMI
+ * hits the VMENTER/VMEXIT region where guest CR2 is live. This
+ * prevents guest state corruption in case the NMI handler
+ * takes a page fault.
+ */
+ this_cpu_write(nmi_cr2, read_cr2());
+
+ irq_state = irqentry_nmi_enter(regs);
+
+ inc_irq_stat(__nmi_count);
+ default_do_nmi(regs);
+
+ irqentry_nmi_exit(regs, irq_state);
+
+ if (unlikely(this_cpu_read(nmi_cr2) != read_cr2()))
+ write_cr2(this_cpu_read(nmi_cr2));
+}
+#endif
+
+void stop_nmi(void)
+{
+ ignore_nmis++;
+}
+
+void restart_nmi(void)
+{
+ ignore_nmis--;
+}
+
+/* reset the back-to-back NMI logic */
+void local_touch_nmi(void)
+{
+ __this_cpu_write(last_nmi_rip, 0);
+}
diff --git a/arch/x86/kernel/nmi_selftest.c b/arch/x86/kernel/nmi_selftest.c
new file mode 100644
index 000000000000..a010e9d062bf
--- /dev/null
+++ b/arch/x86/kernel/nmi_selftest.c
@@ -0,0 +1,164 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Testsuite for NMI: IPIs
+ *
+ * Started by Don Zickus:
+ * (using lib/locking-selftest.c as a guide)
+ *
+ * Copyright (C) 2011 Red Hat, Inc., Don Zickus <dzickus@redhat.com>
+ */
+
+#include <linux/smp.h>
+#include <linux/cpumask.h>
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/percpu.h>
+
+#include <asm/apic.h>
+#include <asm/nmi.h>
+
+#define SUCCESS 0
+#define FAILURE 1
+#define TIMEOUT 2
+
+static int __initdata nmi_fail;
+
+/* check to see if NMI IPIs work on this machine */
+static DECLARE_BITMAP(nmi_ipi_mask, NR_CPUS) __initdata;
+
+static int __initdata testcase_total;
+static int __initdata testcase_successes;
+static int __initdata unexpected_testcase_failures;
+static int __initdata unexpected_testcase_unknowns;
+
+static int __init nmi_unk_cb(unsigned int val, struct pt_regs *regs)
+{
+ unexpected_testcase_unknowns++;
+ return NMI_HANDLED;
+}
+
+static void __init init_nmi_testsuite(void)
+{
+ /* trap all the unknown NMIs we may generate */
+ register_nmi_handler(NMI_UNKNOWN, nmi_unk_cb, 0, "nmi_selftest_unk",
+ __initdata);
+}
+
+static void __init cleanup_nmi_testsuite(void)
+{
+ unregister_nmi_handler(NMI_UNKNOWN, "nmi_selftest_unk");
+}
+
+static int __init test_nmi_ipi_callback(unsigned int val, struct pt_regs *regs)
+{
+ int cpu = raw_smp_processor_id();
+
+ if (cpumask_test_and_clear_cpu(cpu, to_cpumask(nmi_ipi_mask)))
+ return NMI_HANDLED;
+
+ return NMI_DONE;
+}
+
+static void __init test_nmi_ipi(struct cpumask *mask)
+{
+ unsigned long timeout;
+
+ if (register_nmi_handler(NMI_LOCAL, test_nmi_ipi_callback,
+ NMI_FLAG_FIRST, "nmi_selftest", __initdata)) {
+ nmi_fail = FAILURE;
+ return;
+ }
+
+ /* sync above data before sending NMI */
+ wmb();
+
+ __apic_send_IPI_mask(mask, NMI_VECTOR);
+
+ /* Don't wait longer than a second */
+ timeout = USEC_PER_SEC;
+ while (!cpumask_empty(mask) && --timeout)
+ udelay(1);
+
+ /* Unregister even on timeout so the handler cannot outlive the test */
+ unregister_nmi_handler(NMI_LOCAL, "nmi_selftest");
+
+ if (!timeout)
+ nmi_fail = TIMEOUT;
+ return;
+}
+
+static void __init remote_ipi(void)
+{
+ cpumask_copy(to_cpumask(nmi_ipi_mask), cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), to_cpumask(nmi_ipi_mask));
+ if (!cpumask_empty(to_cpumask(nmi_ipi_mask)))
+ test_nmi_ipi(to_cpumask(nmi_ipi_mask));
+}
+
+static void __init local_ipi(void)
+{
+ cpumask_clear(to_cpumask(nmi_ipi_mask));
+ cpumask_set_cpu(smp_processor_id(), to_cpumask(nmi_ipi_mask));
+ test_nmi_ipi(to_cpumask(nmi_ipi_mask));
+}
+
+static void __init reset_nmi(void)
+{
+ nmi_fail = 0;
+}
+
+static void __init dotest(void (*testcase_fn)(void), int expected)
+{
+ testcase_fn();
+ /*
+ * Filter out expected failures:
+ */
+ if (nmi_fail != expected) {
+ unexpected_testcase_failures++;
+
+ if (nmi_fail == FAILURE)
+ pr_cont("FAILED |");
+ else if (nmi_fail == TIMEOUT)
+ pr_cont("TIMEOUT|");
+ else
+ pr_cont("ERROR |");
+ dump_stack();
+ } else {
+ testcase_successes++;
+ pr_cont(" ok |");
+ }
+ pr_cont("\n");
+
+ testcase_total++;
+ reset_nmi();
+}
+
+void __init nmi_selftest(void)
+{
+ init_nmi_testsuite();
+
+ /*
+ * Run the testsuite:
+ */
+ pr_info("----------------\n");
+ pr_info("| NMI testsuite:\n");
+ pr_info("--------------------\n");
+
+ pr_info("%12s:", "remote IPI");
+ dotest(remote_ipi, SUCCESS);
+
+ pr_info("%12s:", "local IPI");
+ dotest(local_ipi, SUCCESS);
+
+ cleanup_nmi_testsuite();
+
+ pr_info("--------------------\n");
+ if (unexpected_testcase_failures) {
+ pr_info("BUG: %3d unexpected failures (out of %3d) - debugging disabled! |\n",
+ unexpected_testcase_failures, testcase_total);
+ } else {
+ pr_info("Good, all %3d testcases passed! |\n",
+ testcase_successes);
+ }
+ pr_info("-----------------------------------------------------------------\n");
+}
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 676b8c77a976..9e1ea99ad9df 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -1,28 +1,43 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Split spinlock implementation out into its own file, so it can be
* compiled in a FTRACE-compatible way.
*/
#include <linux/spinlock.h>
-#include <linux/module.h>
+#include <linux/export.h>
+#include <linux/jump_label.h>
#include <asm/paravirt.h>
-static inline void
-default_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
+__visible void __native_queued_spin_unlock(struct qspinlock *lock)
{
- arch_spin_lock(lock);
+ native_queued_spin_unlock(lock);
}
+PV_CALLEE_SAVE_REGS_THUNK(__native_queued_spin_unlock);
-struct pv_lock_ops pv_lock_ops = {
-#ifdef CONFIG_SMP
- .spin_is_locked = __ticket_spin_is_locked,
- .spin_is_contended = __ticket_spin_is_contended,
+bool pv_is_native_spin_unlock(void)
+{
+ return pv_ops.lock.queued_spin_unlock.func ==
+ __raw_callee_save___native_queued_spin_unlock;
+}
+
+__visible bool __native_vcpu_is_preempted(long cpu)
+{
+ return false;
+}
+PV_CALLEE_SAVE_REGS_THUNK(__native_vcpu_is_preempted);
- .spin_lock = __ticket_spin_lock,
- .spin_lock_flags = default_spin_lock_flags,
- .spin_trylock = __ticket_spin_trylock,
- .spin_unlock = __ticket_spin_unlock,
-#endif
-};
-EXPORT_SYMBOL(pv_lock_ops);
+bool pv_is_native_vcpu_is_preempted(void)
+{
+ return pv_ops.lock.vcpu_is_preempted.func ==
+ __raw_callee_save___native_vcpu_is_preempted;
+}
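
Both helpers above use the same detection idiom: the ops table initially
points at the native implementation (wrapped in its callee-save thunk), a
hypervisor guest patches its own function in, and a pointer comparison
reveals whether that happened. A stripped-down userspace model of the
pattern (illustrative names, no callee-save thunks):

#include <stdbool.h>
#include <stdio.h>

struct lock_ops {
	bool (*vcpu_is_preempted)(long cpu);
};

static bool native_vcpu_is_preempted(long cpu)
{
	return false;	/* bare metal: a vCPU is never preempted */
}

/* default: native op; a "hypervisor" would overwrite this pointer */
static struct lock_ops ops = {
	.vcpu_is_preempted = native_vcpu_is_preempted,
};

static bool is_native_vcpu_is_preempted(void)
{
	return ops.vcpu_is_preempted == native_vcpu_is_preempted;
}

int main(void)
{
	printf("native: %d\n", is_native_vcpu_is_preempted());
	return 0;
}
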
+void __init paravirt_set_cap(void)
+{
+ if (!pv_is_native_spin_unlock())
+ setup_force_cpu_cap(X86_FEATURE_PVUNLOCK);
+
+ if (!pv_is_native_vcpu_is_preempted())
+ setup_force_cpu_cap(X86_FEATURE_VCPUPREEMPT);
+}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 1db183ed7c01..ab3e172dcc69 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -1,34 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/* Paravirtualization interfaces
Copyright (C) 2006 Rusty Russell IBM Corporation
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation; either version 2 of the License, or
- (at your option) any later version.
-
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
-
- You should have received a copy of the GNU General Public License
- along with this program; if not, write to the Free Software
- Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
2007 - x86_64 support added by Glauber de Oliveira Costa, Red Hat Inc
*/
#include <linux/errno.h>
-#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/export.h>
#include <linux/efi.h>
#include <linux/bcd.h>
#include <linux/highmem.h>
+#include <linux/kprobes.h>
+#include <linux/pgtable.h>
+#include <linux/static_call.h>
#include <asm/bug.h>
#include <asm/paravirt.h>
+#include <asm/debugreg.h>
#include <asm/desc.h>
#include <asm/setup.h>
-#include <asm/pgtable.h>
#include <asm/time.h>
#include <asm/pgalloc.h>
#include <asm/irq.h>
@@ -37,22 +29,14 @@
#include <asm/apic.h>
#include <asm/tlbflush.h>
#include <asm/timer.h>
+#include <asm/special_insns.h>
+#include <asm/tlb.h>
+#include <asm/io_bitmap.h>
+#include <asm/gsseg.h>
+#include <asm/msr.h>
-/* nop stub */
-void _paravirt_nop(void)
-{
-}
-
-/* identity function, which can be inlined */
-u32 _paravirt_ident_32(u32 x)
-{
- return x;
-}
-
-u64 _paravirt_ident_64(u64 x)
-{
- return x;
-}
+/* stub always returning 0. */
+DEFINE_ASM_FUNC(paravirt_ret0, "xor %eax,%eax", .entry.text);
void __init default_banner(void)
{
@@ -60,414 +44,212 @@ void __init default_banner(void)
pv_info.name);
}
-/* Simple instruction patching code. */
-#define DEF_NATIVE(ops, name, code) \
- extern const char start_##ops##_##name[], end_##ops##_##name[]; \
- asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":")
-
-/* Undefined instruction for dealing with missing ops pointers. */
-static const unsigned char ud2a[] = { 0x0f, 0x0b };
+#ifdef CONFIG_PARAVIRT_XXL
+DEFINE_ASM_FUNC(_paravirt_ident_64, "mov %rdi, %rax", .text);
+DEFINE_ASM_FUNC(pv_native_save_fl, "pushf; pop %rax", .noinstr.text);
+DEFINE_ASM_FUNC(pv_native_irq_disable, "cli", .noinstr.text);
+DEFINE_ASM_FUNC(pv_native_irq_enable, "sti", .noinstr.text);
+DEFINE_ASM_FUNC(pv_native_read_cr2, "mov %cr2, %rax", .noinstr.text);
+#endif
-unsigned paravirt_patch_nop(void)
-{
- return 0;
-}
+DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);
-unsigned paravirt_patch_ignore(unsigned len)
+void __init native_pv_lock_init(void)
{
- return len;
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ static_branch_enable(&virt_spin_lock_key);
}
-struct branch {
- unsigned char opcode;
- u32 delta;
-} __attribute__((packed));
+struct static_key paravirt_steal_enabled;
+struct static_key paravirt_steal_rq_enabled;
-unsigned paravirt_patch_call(void *insnbuf,
- const void *target, u16 tgt_clobbers,
- unsigned long addr, u16 site_clobbers,
- unsigned len)
+static u64 native_steal_clock(int cpu)
{
- struct branch *b = insnbuf;
- unsigned long delta = (unsigned long)target - (addr+5);
-
- if (tgt_clobbers & ~site_clobbers)
- return len; /* target would clobber too much for this site */
- if (len < 5)
- return len; /* call too long for patch site */
-
- b->opcode = 0xe8; /* call */
- b->delta = delta;
- BUILD_BUG_ON(sizeof(*b) != 5);
-
- return 5;
-}
-
-unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
- unsigned long addr, unsigned len)
-{
- struct branch *b = insnbuf;
- unsigned long delta = (unsigned long)target - (addr+5);
-
- if (len < 5)
- return len; /* call too long for patch site */
-
- b->opcode = 0xe9; /* jmp */
- b->delta = delta;
-
- return 5;
+ return 0;
}
-/* Neat trick to map patch type back to the call within the
- * corresponding structure. */
-static void *get_call_destination(u8 type)
-{
- struct paravirt_patch_template tmpl = {
- .pv_init_ops = pv_init_ops,
- .pv_time_ops = pv_time_ops,
- .pv_cpu_ops = pv_cpu_ops,
- .pv_irq_ops = pv_irq_ops,
- .pv_apic_ops = pv_apic_ops,
- .pv_mmu_ops = pv_mmu_ops,
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
- .pv_lock_ops = pv_lock_ops,
-#endif
- };
- return *((void **)&tmpl + type);
-}
+DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
+DEFINE_STATIC_CALL(pv_sched_clock, native_sched_clock);
-unsigned paravirt_patch_default(u8 type, u16 clobbers, void *insnbuf,
- unsigned long addr, unsigned len)
+void paravirt_set_sched_clock(u64 (*func)(void))
{
- void *opfunc = get_call_destination(type);
- unsigned ret;
-
- if (opfunc == NULL)
- /* If there's no function, patch it with a ud2a (BUG) */
- ret = paravirt_patch_insns(insnbuf, len, ud2a, ud2a+sizeof(ud2a));
- else if (opfunc == _paravirt_nop)
- /* If the operation is a nop, then nop the callsite */
- ret = paravirt_patch_nop();
-
- /* identity functions just return their single argument */
- else if (opfunc == _paravirt_ident_32)
- ret = paravirt_patch_ident_32(insnbuf, len);
- else if (opfunc == _paravirt_ident_64)
- ret = paravirt_patch_ident_64(insnbuf, len);
-
- else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
- type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
- type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) ||
- type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret64))
- /* If operation requires a jmp, then jmp */
- ret = paravirt_patch_jmp(insnbuf, opfunc, addr, len);
- else
- /* Otherwise call the function; assume target could
- clobber any caller-save reg */
- ret = paravirt_patch_call(insnbuf, opfunc, CLBR_ANY,
- addr, clobbers, len);
-
- return ret;
+ static_call_update(pv_sched_clock, func);
}
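
The sched-clock hook is now a static call rather than an indirect function
pointer: callers go through static_call(pv_sched_clock)(), and retargeting
happens once via static_call_update(). A hedged sketch of how a guest would
use the setter above (myhv_* names are hypothetical):

#include <linux/types.h>
#include <asm/paravirt.h>

static u64 myhv_sched_clock(void)
{
	return 0;	/* hypothetical: read the hypervisor's clock here */
}

static void __init myhv_time_init(void)
{
	/* replaces native_sched_clock for every future caller */
	paravirt_set_sched_clock(myhv_sched_clock);
}
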
-unsigned paravirt_patch_insns(void *insnbuf, unsigned len,
- const char *start, const char *end)
+static noinstr void pv_native_safe_halt(void)
{
- unsigned insn_len = end - start;
-
- if (insn_len > len || start == NULL)
- insn_len = len;
- else
- memcpy(insnbuf, start, insn_len);
-
- return insn_len;
+ native_safe_halt();
}
-static void native_flush_tlb(void)
+#ifdef CONFIG_PARAVIRT_XXL
+static noinstr void pv_native_write_cr2(unsigned long val)
{
- __native_flush_tlb();
+ native_write_cr2(val);
}
-/*
- * Global pages have to be flushed a bit differently. Not a real
- * performance problem because this does not happen often.
- */
-static void native_flush_tlb_global(void)
+static noinstr unsigned long pv_native_read_cr3(void)
{
- __native_flush_tlb_global();
+ return __native_read_cr3();
}
-static void native_flush_tlb_single(unsigned long addr)
+static noinstr void pv_native_write_cr3(unsigned long cr3)
{
- __native_flush_tlb_single(addr);
+ native_write_cr3(cr3);
}
-/* These are in entry.S */
-extern void native_iret(void);
-extern void native_irq_enable_sysexit(void);
-extern void native_usergs_sysret32(void);
-extern void native_usergs_sysret64(void);
-
-static struct resource reserve_ioports = {
- .start = 0,
- .end = IO_SPACE_LIMIT,
- .name = "paravirt-ioport",
- .flags = IORESOURCE_IO | IORESOURCE_BUSY,
-};
-
-/*
- * Reserve the whole legacy IO space to prevent any legacy drivers
- * from wasting time probing for their hardware. This is a fairly
- * brute-force approach to disabling all non-virtual drivers.
- *
- * Note that this must be called very early to have any effect.
- */
-int paravirt_disable_iospace(void)
+static noinstr unsigned long pv_native_get_debugreg(int regno)
{
- return request_resource(&ioport_resource, &reserve_ioports);
+ return native_get_debugreg(regno);
}
-static DEFINE_PER_CPU(enum paravirt_lazy_mode, paravirt_lazy_mode) = PARAVIRT_LAZY_NONE;
-
-static inline void enter_lazy(enum paravirt_lazy_mode mode)
+static noinstr void pv_native_set_debugreg(int regno, unsigned long val)
{
- BUG_ON(percpu_read(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE);
-
- percpu_write(paravirt_lazy_mode, mode);
+ native_set_debugreg(regno, val);
}
+#endif
-static void leave_lazy(enum paravirt_lazy_mode mode)
-{
- BUG_ON(percpu_read(paravirt_lazy_mode) != mode);
-
- percpu_write(paravirt_lazy_mode, PARAVIRT_LAZY_NONE);
-}
+struct pv_info pv_info = {
+ .name = "bare hardware",
+#ifdef CONFIG_PARAVIRT_XXL
+ .extra_user_64bit_cs = __USER_CS,
+#endif
+};
-void paravirt_enter_lazy_mmu(void)
-{
- enter_lazy(PARAVIRT_LAZY_MMU);
-}
+/* 64-bit pagetable entries */
+#define PTE_IDENT __PV_IS_CALLEE_SAVE(_paravirt_ident_64)
-void paravirt_leave_lazy_mmu(void)
-{
- leave_lazy(PARAVIRT_LAZY_MMU);
-}
+struct paravirt_patch_template pv_ops = {
+ /* Cpu ops. */
+ .cpu.io_delay = native_io_delay,
+
+#ifdef CONFIG_PARAVIRT_XXL
+ .cpu.cpuid = native_cpuid,
+ .cpu.get_debugreg = pv_native_get_debugreg,
+ .cpu.set_debugreg = pv_native_set_debugreg,
+ .cpu.read_cr0 = native_read_cr0,
+ .cpu.write_cr0 = native_write_cr0,
+ .cpu.write_cr4 = native_write_cr4,
+ .cpu.read_msr = native_read_msr,
+ .cpu.write_msr = native_write_msr,
+ .cpu.read_msr_safe = native_read_msr_safe,
+ .cpu.write_msr_safe = native_write_msr_safe,
+ .cpu.read_pmc = native_read_pmc,
+ .cpu.load_tr_desc = native_load_tr_desc,
+ .cpu.set_ldt = native_set_ldt,
+ .cpu.load_gdt = native_load_gdt,
+ .cpu.load_idt = native_load_idt,
+ .cpu.store_tr = native_store_tr,
+ .cpu.load_tls = native_load_tls,
+ .cpu.load_gs_index = native_load_gs_index,
+ .cpu.write_ldt_entry = native_write_ldt_entry,
+ .cpu.write_gdt_entry = native_write_gdt_entry,
+ .cpu.write_idt_entry = native_write_idt_entry,
+
+ .cpu.alloc_ldt = paravirt_nop,
+ .cpu.free_ldt = paravirt_nop,
+
+ .cpu.load_sp0 = native_load_sp0,
+
+#ifdef CONFIG_X86_IOPL_IOPERM
+ .cpu.invalidate_io_bitmap = native_tss_invalidate_io_bitmap,
+ .cpu.update_io_bitmap = native_tss_update_io_bitmap,
+#endif
-void paravirt_start_context_switch(struct task_struct *prev)
-{
- BUG_ON(preemptible());
+ .cpu.start_context_switch = paravirt_nop,
+ .cpu.end_context_switch = paravirt_nop,
- if (percpu_read(paravirt_lazy_mode) == PARAVIRT_LAZY_MMU) {
- arch_leave_lazy_mmu_mode();
- set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU_UPDATES);
- }
- enter_lazy(PARAVIRT_LAZY_CPU);
-}
+ /* Irq ops. */
+ .irq.save_fl = __PV_IS_CALLEE_SAVE(pv_native_save_fl),
+ .irq.irq_disable = __PV_IS_CALLEE_SAVE(pv_native_irq_disable),
+ .irq.irq_enable = __PV_IS_CALLEE_SAVE(pv_native_irq_enable),
+#endif /* CONFIG_PARAVIRT_XXL */
-void paravirt_end_context_switch(struct task_struct *next)
-{
- BUG_ON(preemptible());
+ /* Irq HLT ops. */
+ .irq.safe_halt = pv_native_safe_halt,
+ .irq.halt = native_halt,
- leave_lazy(PARAVIRT_LAZY_CPU);
+ /* Mmu ops. */
+ .mmu.flush_tlb_user = native_flush_tlb_local,
+ .mmu.flush_tlb_kernel = native_flush_tlb_global,
+ .mmu.flush_tlb_one_user = native_flush_tlb_one_user,
+ .mmu.flush_tlb_multi = native_flush_tlb_multi,
- if (test_and_clear_ti_thread_flag(task_thread_info(next), TIF_LAZY_MMU_UPDATES))
- arch_enter_lazy_mmu_mode();
-}
+ .mmu.exit_mmap = paravirt_nop,
+ .mmu.notify_page_enc_status_changed = paravirt_nop,
-enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
-{
- if (in_interrupt())
- return PARAVIRT_LAZY_NONE;
+#ifdef CONFIG_PARAVIRT_XXL
+ .mmu.read_cr2 = __PV_IS_CALLEE_SAVE(pv_native_read_cr2),
+ .mmu.write_cr2 = pv_native_write_cr2,
+ .mmu.read_cr3 = pv_native_read_cr3,
+ .mmu.write_cr3 = pv_native_write_cr3,
- return percpu_read(paravirt_lazy_mode);
-}
+ .mmu.pgd_alloc = __paravirt_pgd_alloc,
+ .mmu.pgd_free = paravirt_nop,
-void arch_flush_lazy_mmu_mode(void)
-{
- preempt_disable();
+ .mmu.alloc_pte = paravirt_nop,
+ .mmu.alloc_pmd = paravirt_nop,
+ .mmu.alloc_pud = paravirt_nop,
+ .mmu.alloc_p4d = paravirt_nop,
+ .mmu.release_pte = paravirt_nop,
+ .mmu.release_pmd = paravirt_nop,
+ .mmu.release_pud = paravirt_nop,
+ .mmu.release_p4d = paravirt_nop,
- if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU) {
- arch_leave_lazy_mmu_mode();
- arch_enter_lazy_mmu_mode();
- }
+ .mmu.set_pte = native_set_pte,
+ .mmu.set_pmd = native_set_pmd,
- preempt_enable();
-}
+ .mmu.ptep_modify_prot_start = __ptep_modify_prot_start,
+ .mmu.ptep_modify_prot_commit = __ptep_modify_prot_commit,
-struct pv_info pv_info = {
- .name = "bare hardware",
- .paravirt_enabled = 0,
- .kernel_rpl = 0,
- .shared_kernel_pmd = 1, /* Only used when CONFIG_X86_PAE is set */
-};
+ .mmu.set_pud = native_set_pud,
-struct pv_init_ops pv_init_ops = {
- .patch = native_patch,
-};
+ .mmu.pmd_val = PTE_IDENT,
+ .mmu.make_pmd = PTE_IDENT,
-struct pv_time_ops pv_time_ops = {
- .sched_clock = native_sched_clock,
-};
+ .mmu.pud_val = PTE_IDENT,
+ .mmu.make_pud = PTE_IDENT,
-struct pv_irq_ops pv_irq_ops = {
- .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
- .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
- .irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable),
- .irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable),
- .safe_halt = native_safe_halt,
- .halt = native_halt,
-#ifdef CONFIG_X86_64
- .adjust_exception_frame = paravirt_nop,
-#endif
-};
+ .mmu.set_p4d = native_set_p4d,
-struct pv_cpu_ops pv_cpu_ops = {
- .cpuid = native_cpuid,
- .get_debugreg = native_get_debugreg,
- .set_debugreg = native_set_debugreg,
- .clts = native_clts,
- .read_cr0 = native_read_cr0,
- .write_cr0 = native_write_cr0,
- .read_cr4 = native_read_cr4,
- .read_cr4_safe = native_read_cr4_safe,
- .write_cr4 = native_write_cr4,
-#ifdef CONFIG_X86_64
- .read_cr8 = native_read_cr8,
- .write_cr8 = native_write_cr8,
-#endif
- .wbinvd = native_wbinvd,
- .read_msr = native_read_msr_safe,
- .rdmsr_regs = native_rdmsr_safe_regs,
- .write_msr = native_write_msr_safe,
- .wrmsr_regs = native_wrmsr_safe_regs,
- .read_tsc = native_read_tsc,
- .read_pmc = native_read_pmc,
- .read_tscp = native_read_tscp,
- .load_tr_desc = native_load_tr_desc,
- .set_ldt = native_set_ldt,
- .load_gdt = native_load_gdt,
- .load_idt = native_load_idt,
- .store_gdt = native_store_gdt,
- .store_idt = native_store_idt,
- .store_tr = native_store_tr,
- .load_tls = native_load_tls,
-#ifdef CONFIG_X86_64
- .load_gs_index = native_load_gs_index,
-#endif
- .write_ldt_entry = native_write_ldt_entry,
- .write_gdt_entry = native_write_gdt_entry,
- .write_idt_entry = native_write_idt_entry,
+ .mmu.p4d_val = PTE_IDENT,
+ .mmu.make_p4d = PTE_IDENT,
- .alloc_ldt = paravirt_nop,
- .free_ldt = paravirt_nop,
+ .mmu.set_pgd = native_set_pgd,
- .load_sp0 = native_load_sp0,
+ .mmu.pte_val = PTE_IDENT,
+ .mmu.pgd_val = PTE_IDENT,
-#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
- .irq_enable_sysexit = native_irq_enable_sysexit,
-#endif
-#ifdef CONFIG_X86_64
-#ifdef CONFIG_IA32_EMULATION
- .usergs_sysret32 = native_usergs_sysret32,
-#endif
- .usergs_sysret64 = native_usergs_sysret64,
-#endif
- .iret = native_iret,
- .swapgs = native_swapgs,
+ .mmu.make_pte = PTE_IDENT,
+ .mmu.make_pgd = PTE_IDENT,
- .set_iopl_mask = native_set_iopl_mask,
- .io_delay = native_io_delay,
+ .mmu.enter_mmap = paravirt_nop,
- .start_context_switch = paravirt_nop,
- .end_context_switch = paravirt_nop,
-};
+ .mmu.lazy_mode = {
+ .enter = paravirt_nop,
+ .leave = paravirt_nop,
+ .flush = paravirt_nop,
+ },
-struct pv_apic_ops pv_apic_ops = {
-#ifdef CONFIG_X86_LOCAL_APIC
- .startup_ipi_hook = paravirt_nop,
+ .mmu.set_fixmap = native_set_fixmap,
+#endif /* CONFIG_PARAVIRT_XXL */
+
+#if defined(CONFIG_PARAVIRT_SPINLOCKS)
+ /* Lock ops. */
+#ifdef CONFIG_SMP
+ .lock.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
+ .lock.queued_spin_unlock =
+ PV_CALLEE_SAVE(__native_queued_spin_unlock),
+ .lock.wait = paravirt_nop,
+ .lock.kick = paravirt_nop,
+ .lock.vcpu_is_preempted =
+ PV_CALLEE_SAVE(__native_vcpu_is_preempted),
+#endif /* SMP */
#endif
};
-#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE)
-/* 32-bit pagetable entries */
-#define PTE_IDENT __PV_IS_CALLEE_SAVE(_paravirt_ident_32)
-#else
-/* 64-bit pagetable entries */
-#define PTE_IDENT __PV_IS_CALLEE_SAVE(_paravirt_ident_64)
+#ifdef CONFIG_PARAVIRT_XXL
+NOKPROBE_SYMBOL(native_load_idt);
#endif
-struct pv_mmu_ops pv_mmu_ops = {
-
- .read_cr2 = native_read_cr2,
- .write_cr2 = native_write_cr2,
- .read_cr3 = native_read_cr3,
- .write_cr3 = native_write_cr3,
-
- .flush_tlb_user = native_flush_tlb,
- .flush_tlb_kernel = native_flush_tlb_global,
- .flush_tlb_single = native_flush_tlb_single,
- .flush_tlb_others = native_flush_tlb_others,
-
- .pgd_alloc = __paravirt_pgd_alloc,
- .pgd_free = paravirt_nop,
-
- .alloc_pte = paravirt_nop,
- .alloc_pmd = paravirt_nop,
- .alloc_pmd_clone = paravirt_nop,
- .alloc_pud = paravirt_nop,
- .release_pte = paravirt_nop,
- .release_pmd = paravirt_nop,
- .release_pud = paravirt_nop,
-
- .set_pte = native_set_pte,
- .set_pte_at = native_set_pte_at,
- .set_pmd = native_set_pmd,
- .pte_update = paravirt_nop,
- .pte_update_defer = paravirt_nop,
-
- .ptep_modify_prot_start = __ptep_modify_prot_start,
- .ptep_modify_prot_commit = __ptep_modify_prot_commit,
-
-#if PAGETABLE_LEVELS >= 3
-#ifdef CONFIG_X86_PAE
- .set_pte_atomic = native_set_pte_atomic,
- .pte_clear = native_pte_clear,
- .pmd_clear = native_pmd_clear,
-#endif
- .set_pud = native_set_pud,
-
- .pmd_val = PTE_IDENT,
- .make_pmd = PTE_IDENT,
-
-#if PAGETABLE_LEVELS == 4
- .pud_val = PTE_IDENT,
- .make_pud = PTE_IDENT,
-
- .set_pgd = native_set_pgd,
-#endif
-#endif /* PAGETABLE_LEVELS >= 3 */
-
- .pte_val = PTE_IDENT,
- .pgd_val = PTE_IDENT,
-
- .make_pte = PTE_IDENT,
- .make_pgd = PTE_IDENT,
-
- .dup_mmap = paravirt_nop,
- .exit_mmap = paravirt_nop,
- .activate_mm = paravirt_nop,
-
- .lazy_mode = {
- .enter = paravirt_nop,
- .leave = paravirt_nop,
- },
-
- .set_fixmap = native_set_fixmap,
-};
-
-EXPORT_SYMBOL_GPL(pv_time_ops);
-EXPORT_SYMBOL (pv_cpu_ops);
-EXPORT_SYMBOL (pv_mmu_ops);
-EXPORT_SYMBOL_GPL(pv_apic_ops);
+EXPORT_SYMBOL(pv_ops);
EXPORT_SYMBOL_GPL(pv_info);
-EXPORT_SYMBOL (pv_irq_ops);
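The PTE_IDENT initializers above are identity callbacks: with native page tables a pte/pgd value needs no transformation, so the hook simply hands its argument back. A minimal sketch of what the underlying helper amounts to (the in-tree versions are notrace and wrapped via __PV_IS_CALLEE_SAVE, details not shown in this hunk):

/* Sketch, assuming the 64-bit variant; the 32-bit one is analogous.
 * u64 is the kernel's unsigned 64-bit typedef. */
u64 _paravirt_ident_64(u64 x)
{
	return x;	/* native page-table values pass through unchanged */
}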
diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c
deleted file mode 100644
index d9f32e6d6ab6..000000000000
--- a/arch/x86/kernel/paravirt_patch_32.c
+++ /dev/null
@@ -1,61 +0,0 @@
-#include <asm/paravirt.h>
-
-DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
-DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
-DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
-DEF_NATIVE(pv_cpu_ops, iret, "iret");
-DEF_NATIVE(pv_cpu_ops, irq_enable_sysexit, "sti; sysexit");
-DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
-DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
-DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
-DEF_NATIVE(pv_cpu_ops, clts, "clts");
-DEF_NATIVE(pv_cpu_ops, read_tsc, "rdtsc");
-
-unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
-{
- /* arg in %eax, return in %eax */
- return 0;
-}
-
-unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
-{
- /* arg in %edx:%eax, return in %edx:%eax */
- return 0;
-}
-
-unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
- unsigned long addr, unsigned len)
-{
- const unsigned char *start, *end;
- unsigned ret;
-
-#define PATCH_SITE(ops, x) \
- case PARAVIRT_PATCH(ops.x): \
- start = start_##ops##_##x; \
- end = end_##ops##_##x; \
- goto patch_site
- switch (type) {
- PATCH_SITE(pv_irq_ops, irq_disable);
- PATCH_SITE(pv_irq_ops, irq_enable);
- PATCH_SITE(pv_irq_ops, restore_fl);
- PATCH_SITE(pv_irq_ops, save_fl);
- PATCH_SITE(pv_cpu_ops, iret);
- PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
- PATCH_SITE(pv_mmu_ops, read_cr2);
- PATCH_SITE(pv_mmu_ops, read_cr3);
- PATCH_SITE(pv_mmu_ops, write_cr3);
- PATCH_SITE(pv_cpu_ops, clts);
- PATCH_SITE(pv_cpu_ops, read_tsc);
-
- patch_site:
- ret = paravirt_patch_insns(ibuf, len, start, end);
- break;
-
- default:
- ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
- break;
- }
-#undef PATCH_SITE
- return ret;
-}
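For reference, the DEF_NATIVE() sites in this deleted file worked by emitting the native instruction bytes between two linker-visible labels, which native_patch() then copies over the paravirt indirect-call site. Roughly, as a hedged sketch of the macro from asm/paravirt_types.h of this era (the macro itself is not part of this patch):

/* Assumed simplification of the real DEF_NATIVE() definition. */
#define DEF_NATIVE(ops, name, code)					\
	extern const char start_##ops##_##name[], end_##ops##_##name[]; \
	asm("start_" #ops "_" #name ": " code "; end_" #ops "_" #name ":")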
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
deleted file mode 100644
index 3f08f34f93eb..000000000000
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ /dev/null
@@ -1,75 +0,0 @@
-#include <asm/paravirt.h>
-#include <asm/asm-offsets.h>
-#include <linux/stringify.h>
-
-DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
-DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
-DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
-DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
-DEF_NATIVE(pv_cpu_ops, iret, "iretq");
-DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
-DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
-DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3");
-DEF_NATIVE(pv_mmu_ops, flush_tlb_single, "invlpg (%rdi)");
-DEF_NATIVE(pv_cpu_ops, clts, "clts");
-DEF_NATIVE(pv_cpu_ops, wbinvd, "wbinvd");
-
-DEF_NATIVE(pv_cpu_ops, irq_enable_sysexit, "swapgs; sti; sysexit");
-DEF_NATIVE(pv_cpu_ops, usergs_sysret64, "swapgs; sysretq");
-DEF_NATIVE(pv_cpu_ops, usergs_sysret32, "swapgs; sysretl");
-DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
-
-DEF_NATIVE(, mov32, "mov %edi, %eax");
-DEF_NATIVE(, mov64, "mov %rdi, %rax");
-
-unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
-{
- return paravirt_patch_insns(insnbuf, len,
- start__mov32, end__mov32);
-}
-
-unsigned paravirt_patch_ident_64(void *insnbuf, unsigned len)
-{
- return paravirt_patch_insns(insnbuf, len,
- start__mov64, end__mov64);
-}
-
-unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
- unsigned long addr, unsigned len)
-{
- const unsigned char *start, *end;
- unsigned ret;
-
-#define PATCH_SITE(ops, x) \
- case PARAVIRT_PATCH(ops.x): \
- start = start_##ops##_##x; \
- end = end_##ops##_##x; \
- goto patch_site
- switch(type) {
- PATCH_SITE(pv_irq_ops, restore_fl);
- PATCH_SITE(pv_irq_ops, save_fl);
- PATCH_SITE(pv_irq_ops, irq_enable);
- PATCH_SITE(pv_irq_ops, irq_disable);
- PATCH_SITE(pv_cpu_ops, iret);
- PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
- PATCH_SITE(pv_cpu_ops, usergs_sysret32);
- PATCH_SITE(pv_cpu_ops, usergs_sysret64);
- PATCH_SITE(pv_cpu_ops, swapgs);
- PATCH_SITE(pv_mmu_ops, read_cr2);
- PATCH_SITE(pv_mmu_ops, read_cr3);
- PATCH_SITE(pv_mmu_ops, write_cr3);
- PATCH_SITE(pv_cpu_ops, clts);
- PATCH_SITE(pv_mmu_ops, flush_tlb_single);
- PATCH_SITE(pv_cpu_ops, wbinvd);
-
- patch_site:
- ret = paravirt_patch_insns(ibuf, len, start, end);
- break;
-
- default:
- ret = paravirt_patch_default(type, clobbers, ibuf, addr, len);
- break;
- }
-#undef PATCH_SITE
- return ret;
-}
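The effect of these patches is that a former indirect call through the pv_*_ops tables collapses into one or two native instructions: paravirt_patch_insns() copies the template into the call site if it fits, and the remaining bytes are NOP-padded. A simplified, assumption-laden sketch of that copy step:

/*
 * Before patching:  call *pv_mmu_ops.pgd_val    (argument in %rdi)
 * After patching:   mov %rdi, %rax              (the mov64 template)
 */
#include <string.h>

static unsigned patch_insns(void *ibuf, unsigned len,
			    const char *start, const char *end)
{
	unsigned insn_len = end - start;

	if (insn_len > len)	/* template must fit within the call site */
		return 0;	/* fall back to the default patcher */
	memcpy(ibuf, start, insn_len);
	return insn_len;	/* caller NOP-fills the remaining bytes */
}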
diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
deleted file mode 100644
index 078d4ec1a9d9..000000000000
--- a/arch/x86/kernel/pci-calgary_64.c
+++ /dev/null
@@ -1,1596 +0,0 @@
-/*
- * Derived from arch/powerpc/kernel/iommu.c
- *
- * Copyright IBM Corporation, 2006-2007
- * Copyright (C) 2006 Jon Mason <jdmason@kudzu.us>
- *
- * Author: Jon Mason <jdmason@kudzu.us>
- * Author: Muli Ben-Yehuda <muli@il.ibm.com>
-
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- */
-
-#include <linux/kernel.h>
-#include <linux/init.h>
-#include <linux/types.h>
-#include <linux/slab.h>
-#include <linux/mm.h>
-#include <linux/spinlock.h>
-#include <linux/string.h>
-#include <linux/crash_dump.h>
-#include <linux/dma-mapping.h>
-#include <linux/bitmap.h>
-#include <linux/pci_ids.h>
-#include <linux/pci.h>
-#include <linux/delay.h>
-#include <linux/scatterlist.h>
-#include <linux/iommu-helper.h>
-
-#include <asm/iommu.h>
-#include <asm/calgary.h>
-#include <asm/tce.h>
-#include <asm/pci-direct.h>
-#include <asm/system.h>
-#include <asm/dma.h>
-#include <asm/rio.h>
-#include <asm/bios_ebda.h>
-#include <asm/x86_init.h>
-
-#ifdef CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT
-int use_calgary __read_mostly = 1;
-#else
-int use_calgary __read_mostly = 0;
-#endif /* CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT */
-
-#define PCI_DEVICE_ID_IBM_CALGARY 0x02a1
-#define PCI_DEVICE_ID_IBM_CALIOC2 0x0308
-
-/* register offsets inside the host bridge space */
-#define CALGARY_CONFIG_REG 0x0108
-#define PHB_CSR_OFFSET 0x0110 /* Channel Status */
-#define PHB_PLSSR_OFFSET 0x0120
-#define PHB_CONFIG_RW_OFFSET 0x0160
-#define PHB_IOBASE_BAR_LOW 0x0170
-#define PHB_IOBASE_BAR_HIGH 0x0180
-#define PHB_MEM_1_LOW 0x0190
-#define PHB_MEM_1_HIGH 0x01A0
-#define PHB_IO_ADDR_SIZE 0x01B0
-#define PHB_MEM_1_SIZE 0x01C0
-#define PHB_MEM_ST_OFFSET 0x01D0
-#define PHB_AER_OFFSET 0x0200
-#define PHB_CONFIG_0_HIGH 0x0220
-#define PHB_CONFIG_0_LOW 0x0230
-#define PHB_CONFIG_0_END 0x0240
-#define PHB_MEM_2_LOW 0x02B0
-#define PHB_MEM_2_HIGH 0x02C0
-#define PHB_MEM_2_SIZE_HIGH 0x02D0
-#define PHB_MEM_2_SIZE_LOW 0x02E0
-#define PHB_DOSHOLE_OFFSET 0x08E0
-
-/* CalIOC2 specific */
-#define PHB_SAVIOR_L2 0x0DB0
-#define PHB_PAGE_MIG_CTRL 0x0DA8
-#define PHB_PAGE_MIG_DEBUG 0x0DA0
-#define PHB_ROOT_COMPLEX_STATUS 0x0CB0
-
-/* PHB_CONFIG_RW */
-#define PHB_TCE_ENABLE 0x20000000
-#define PHB_SLOT_DISABLE 0x1C000000
-#define PHB_DAC_DISABLE 0x01000000
-#define PHB_MEM2_ENABLE 0x00400000
-#define PHB_MCSR_ENABLE 0x00100000
-/* TAR (Table Address Register) */
-#define TAR_SW_BITS 0x0000ffffffff800fUL
-#define TAR_VALID 0x0000000000000008UL
-/* CSR (Channel/DMA Status Register) */
-#define CSR_AGENT_MASK 0xffe0ffff
-/* CCR (Calgary Configuration Register) */
-#define CCR_2SEC_TIMEOUT 0x000000000000000EUL
-/* PMCR/PMDR (Page Migration Control/Debug Registers) */
-#define PMR_SOFTSTOP 0x80000000
-#define PMR_SOFTSTOPFAULT 0x40000000
-#define PMR_HARDSTOP 0x20000000
-
-/*
- * The maximum PHB bus number.
- * x3950M2 (rare): 8 chassis, 48 PHBs per chassis = 384
- * x3950M2: 4 chassis, 48 PHBs per chassis = 192
- * x3950 (PCIE): 8 chassis, 32 PHBs per chassis = 256
- * x3950 (PCIX): 8 chassis, 16 PHBs per chassis = 128
- */
-#define MAX_PHB_BUS_NUM 256
-
-#define PHBS_PER_CALGARY 4
-
-/* register offsets in Calgary's internal register space */
-static const unsigned long tar_offsets[] = {
- 0x0580 /* TAR0 */,
- 0x0588 /* TAR1 */,
- 0x0590 /* TAR2 */,
- 0x0598 /* TAR3 */
-};
-
-static const unsigned long split_queue_offsets[] = {
- 0x4870 /* SPLIT QUEUE 0 */,
- 0x5870 /* SPLIT QUEUE 1 */,
- 0x6870 /* SPLIT QUEUE 2 */,
- 0x7870 /* SPLIT QUEUE 3 */
-};
-
-static const unsigned long phb_offsets[] = {
- 0x8000 /* PHB0 */,
- 0x9000 /* PHB1 */,
- 0xA000 /* PHB2 */,
- 0xB000 /* PHB3 */
-};
-
-/* PHB debug registers */
-
-static const unsigned long phb_debug_offsets[] = {
- 0x4000 /* PHB 0 DEBUG */,
- 0x5000 /* PHB 1 DEBUG */,
- 0x6000 /* PHB 2 DEBUG */,
- 0x7000 /* PHB 3 DEBUG */
-};
-
-/*
- * STUFF register for each debug PHB,
- * byte 1 = start bus number, byte 2 = end bus number
- */
-
-#define PHB_DEBUG_STUFF_OFFSET 0x0020
-
-#define EMERGENCY_PAGES 32 /* = 128KB */
-
-unsigned int specified_table_size = TCE_TABLE_SIZE_UNSPECIFIED;
-static int translate_empty_slots __read_mostly = 0;
-static int calgary_detected __read_mostly = 0;
-
-static struct rio_table_hdr *rio_table_hdr __initdata;
-static struct scal_detail *scal_devs[MAX_NUMNODES] __initdata;
-static struct rio_detail *rio_devs[MAX_NUMNODES * 4] __initdata;
-
-struct calgary_bus_info {
- void *tce_space;
- unsigned char translation_disabled;
- signed char phbid;
- void __iomem *bbar;
-};
-
-static void calgary_handle_quirks(struct iommu_table *tbl, struct pci_dev *dev);
-static void calgary_tce_cache_blast(struct iommu_table *tbl);
-static void calgary_dump_error_regs(struct iommu_table *tbl);
-static void calioc2_handle_quirks(struct iommu_table *tbl, struct pci_dev *dev);
-static void calioc2_tce_cache_blast(struct iommu_table *tbl);
-static void calioc2_dump_error_regs(struct iommu_table *tbl);
-static void calgary_init_bitmap_from_tce_table(struct iommu_table *tbl);
-static void get_tce_space_from_tar(void);
-
-static struct cal_chipset_ops calgary_chip_ops = {
- .handle_quirks = calgary_handle_quirks,
- .tce_cache_blast = calgary_tce_cache_blast,
- .dump_error_regs = calgary_dump_error_regs
-};
-
-static struct cal_chipset_ops calioc2_chip_ops = {
- .handle_quirks = calioc2_handle_quirks,
- .tce_cache_blast = calioc2_tce_cache_blast,
- .dump_error_regs = calioc2_dump_error_regs
-};
-
-static struct calgary_bus_info bus_info[MAX_PHB_BUS_NUM] = { { NULL, 0, 0 }, };
-
-static inline int translation_enabled(struct iommu_table *tbl)
-{
- /* only PHBs with translation enabled have an IOMMU table */
- return (tbl != NULL);
-}
-
-static void iommu_range_reserve(struct iommu_table *tbl,
- unsigned long start_addr, unsigned int npages)
-{
- unsigned long index;
- unsigned long end;
- unsigned long flags;
-
- index = start_addr >> PAGE_SHIFT;
-
- /* bail out if we're asked to reserve a region we don't cover */
- if (index >= tbl->it_size)
- return;
-
- end = index + npages;
- if (end > tbl->it_size) /* don't go off the table */
- end = tbl->it_size;
-
- spin_lock_irqsave(&tbl->it_lock, flags);
-
- bitmap_set(tbl->it_map, index, npages);
-
- spin_unlock_irqrestore(&tbl->it_lock, flags);
-}
-
-static unsigned long iommu_range_alloc(struct device *dev,
- struct iommu_table *tbl,
- unsigned int npages)
-{
- unsigned long flags;
- unsigned long offset;
- unsigned long boundary_size;
-
- boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- PAGE_SIZE) >> PAGE_SHIFT;
-
- BUG_ON(npages == 0);
-
- spin_lock_irqsave(&tbl->it_lock, flags);
-
- offset = iommu_area_alloc(tbl->it_map, tbl->it_size, tbl->it_hint,
- npages, 0, boundary_size, 0);
- if (offset == ~0UL) {
- tbl->chip_ops->tce_cache_blast(tbl);
-
- offset = iommu_area_alloc(tbl->it_map, tbl->it_size, 0,
- npages, 0, boundary_size, 0);
- if (offset == ~0UL) {
- printk(KERN_WARNING "Calgary: IOMMU full.\n");
- spin_unlock_irqrestore(&tbl->it_lock, flags);
- if (panic_on_overflow)
- panic("Calgary: fix the allocator.\n");
- else
- return DMA_ERROR_CODE;
- }
- }
-
- tbl->it_hint = offset + npages;
- BUG_ON(tbl->it_hint > tbl->it_size);
-
- spin_unlock_irqrestore(&tbl->it_lock, flags);
-
- return offset;
-}
-
-static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
- void *vaddr, unsigned int npages, int direction)
-{
- unsigned long entry;
- dma_addr_t ret;
-
- entry = iommu_range_alloc(dev, tbl, npages);
-
- if (unlikely(entry == DMA_ERROR_CODE)) {
- printk(KERN_WARNING "Calgary: failed to allocate %u pages in "
- "iommu %p\n", npages, tbl);
- return DMA_ERROR_CODE;
- }
-
- /* set the return dma address */
- ret = (entry << PAGE_SHIFT) | ((unsigned long)vaddr & ~PAGE_MASK);
-
- /* put the TCEs in the HW table */
- tce_build(tbl, entry, npages, (unsigned long)vaddr & PAGE_MASK,
- direction);
- return ret;
-}
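The returned bus address packs the allocated TCE index into the page-frame bits and keeps the sub-page offset from the CPU virtual address. A worked example with hypothetical numbers:

/*
 * PAGE_SHIFT = 12, entry = 0x42, vaddr ending in 0x678:
 *
 *	ret = (0x42 << 12) | (vaddr & ~PAGE_MASK)
 *	    = 0x42000 | 0x678
 *	    = 0x42678
 *
 * The device uses bus address 0x42678, while TCE slot 0x42 maps the
 * page-aligned physical page behind vaddr.
 */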
-
-static void iommu_free(struct iommu_table *tbl, dma_addr_t dma_addr,
- unsigned int npages)
-{
- unsigned long entry;
- unsigned long badend;
- unsigned long flags;
-
- /* were we called with bad_dma_address? */
- badend = DMA_ERROR_CODE + (EMERGENCY_PAGES * PAGE_SIZE);
- if (unlikely((dma_addr >= DMA_ERROR_CODE) && (dma_addr < badend))) {
- WARN(1, KERN_ERR "Calgary: driver tried unmapping bad DMA "
- "address 0x%Lx\n", dma_addr);
- return;
- }
-
- entry = dma_addr >> PAGE_SHIFT;
-
- BUG_ON(entry + npages > tbl->it_size);
-
- tce_free(tbl, entry, npages);
-
- spin_lock_irqsave(&tbl->it_lock, flags);
-
- bitmap_clear(tbl->it_map, entry, npages);
-
- spin_unlock_irqrestore(&tbl->it_lock, flags);
-}
-
-static inline struct iommu_table *find_iommu_table(struct device *dev)
-{
- struct pci_dev *pdev;
- struct pci_bus *pbus;
- struct iommu_table *tbl;
-
- pdev = to_pci_dev(dev);
-
- /* search up the device tree for an iommu */
- pbus = pdev->bus;
- do {
- tbl = pci_iommu(pbus);
- if (tbl && tbl->it_busno == pbus->number)
- break;
- tbl = NULL;
- pbus = pbus->parent;
- } while (pbus);
-
- BUG_ON(tbl && (tbl->it_busno != pbus->number));
-
- return tbl;
-}
-
-static void calgary_unmap_sg(struct device *dev, struct scatterlist *sglist,
- int nelems,enum dma_data_direction dir,
- struct dma_attrs *attrs)
-{
- struct iommu_table *tbl = find_iommu_table(dev);
- struct scatterlist *s;
- int i;
-
- if (!translation_enabled(tbl))
- return;
-
- for_each_sg(sglist, s, nelems, i) {
- unsigned int npages;
- dma_addr_t dma = s->dma_address;
- unsigned int dmalen = s->dma_length;
-
- if (dmalen == 0)
- break;
-
- npages = iommu_num_pages(dma, dmalen, PAGE_SIZE);
- iommu_free(tbl, dma, npages);
- }
-}
-
-static int calgary_map_sg(struct device *dev, struct scatterlist *sg,
- int nelems, enum dma_data_direction dir,
- struct dma_attrs *attrs)
-{
- struct iommu_table *tbl = find_iommu_table(dev);
- struct scatterlist *s;
- unsigned long vaddr;
- unsigned int npages;
- unsigned long entry;
- int i;
-
- for_each_sg(sg, s, nelems, i) {
- BUG_ON(!sg_page(s));
-
- vaddr = (unsigned long) sg_virt(s);
- npages = iommu_num_pages(vaddr, s->length, PAGE_SIZE);
-
- entry = iommu_range_alloc(dev, tbl, npages);
- if (entry == DMA_ERROR_CODE) {
- /* makes sure unmap knows to stop */
- s->dma_length = 0;
- goto error;
- }
-
- s->dma_address = (entry << PAGE_SHIFT) | s->offset;
-
- /* insert into HW table */
- tce_build(tbl, entry, npages, vaddr & PAGE_MASK, dir);
-
- s->dma_length = s->length;
- }
-
- return nelems;
-error:
- calgary_unmap_sg(dev, sg, nelems, dir, NULL);
- for_each_sg(sg, s, nelems, i) {
- sg->dma_address = DMA_ERROR_CODE;
- sg->dma_length = 0;
- }
- return 0;
-}
-
-static dma_addr_t calgary_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
-{
- void *vaddr = page_address(page) + offset;
- unsigned long uaddr;
- unsigned int npages;
- struct iommu_table *tbl = find_iommu_table(dev);
-
- uaddr = (unsigned long)vaddr;
- npages = iommu_num_pages(uaddr, size, PAGE_SIZE);
-
- return iommu_alloc(dev, tbl, vaddr, npages, dir);
-}
-
-static void calgary_unmap_page(struct device *dev, dma_addr_t dma_addr,
- size_t size, enum dma_data_direction dir,
- struct dma_attrs *attrs)
-{
- struct iommu_table *tbl = find_iommu_table(dev);
- unsigned int npages;
-
- npages = iommu_num_pages(dma_addr, size, PAGE_SIZE);
- iommu_free(tbl, dma_addr, npages);
-}
-
-static void* calgary_alloc_coherent(struct device *dev, size_t size,
- dma_addr_t *dma_handle, gfp_t flag)
-{
- void *ret = NULL;
- dma_addr_t mapping;
- unsigned int npages, order;
- struct iommu_table *tbl = find_iommu_table(dev);
-
- size = PAGE_ALIGN(size); /* size rounded up to full pages */
- npages = size >> PAGE_SHIFT;
- order = get_order(size);
-
- flag &= ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32);
-
- /* alloc enough pages (and possibly more) */
- ret = (void *)__get_free_pages(flag, order);
- if (!ret)
- goto error;
- memset(ret, 0, size);
-
- /* set up tces to cover the allocated range */
- mapping = iommu_alloc(dev, tbl, ret, npages, DMA_BIDIRECTIONAL);
- if (mapping == DMA_ERROR_CODE)
- goto free;
- *dma_handle = mapping;
- return ret;
-free:
- free_pages((unsigned long)ret, get_order(size));
- ret = NULL;
-error:
- return ret;
-}
-
-static void calgary_free_coherent(struct device *dev, size_t size,
- void *vaddr, dma_addr_t dma_handle)
-{
- unsigned int npages;
- struct iommu_table *tbl = find_iommu_table(dev);
-
- size = PAGE_ALIGN(size);
- npages = size >> PAGE_SHIFT;
-
- iommu_free(tbl, dma_handle, npages);
- free_pages((unsigned long)vaddr, get_order(size));
-}
-
-static struct dma_map_ops calgary_dma_ops = {
- .alloc_coherent = calgary_alloc_coherent,
- .free_coherent = calgary_free_coherent,
- .map_sg = calgary_map_sg,
- .unmap_sg = calgary_unmap_sg,
- .map_page = calgary_map_page,
- .unmap_page = calgary_unmap_page,
-};
-
-static inline void __iomem * busno_to_bbar(unsigned char num)
-{
- return bus_info[num].bbar;
-}
-
-static inline int busno_to_phbid(unsigned char num)
-{
- return bus_info[num].phbid;
-}
-
-static inline unsigned long split_queue_offset(unsigned char num)
-{
- size_t idx = busno_to_phbid(num);
-
- return split_queue_offsets[idx];
-}
-
-static inline unsigned long tar_offset(unsigned char num)
-{
- size_t idx = busno_to_phbid(num);
-
- return tar_offsets[idx];
-}
-
-static inline unsigned long phb_offset(unsigned char num)
-{
- size_t idx = busno_to_phbid(num);
-
- return phb_offsets[idx];
-}
-
-static inline void __iomem* calgary_reg(void __iomem *bar, unsigned long offset)
-{
- unsigned long target = ((unsigned long)bar) | offset;
- return (void __iomem*)target;
-}
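calgary_reg() composes register addresses with a bitwise OR rather than an addition; that only works because the 1MB config-space mapping is naturally aligned, so the offset bits never carry into the base. A hypothetical use (bus number made up for illustration):

/* Read the channel status register of the PHB behind bus 0x12. */
void __iomem *target = calgary_reg(bbar,
				   phb_offset(0x12) | PHB_CSR_OFFSET);
u32 csr = be32_to_cpu(readl(target));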
-
-static inline int is_calioc2(unsigned short device)
-{
- return (device == PCI_DEVICE_ID_IBM_CALIOC2);
-}
-
-static inline int is_calgary(unsigned short device)
-{
- return (device == PCI_DEVICE_ID_IBM_CALGARY);
-}
-
-static inline int is_cal_pci_dev(unsigned short device)
-{
- return (is_calgary(device) || is_calioc2(device));
-}
-
-static void calgary_tce_cache_blast(struct iommu_table *tbl)
-{
- u64 val;
- u32 aer;
- int i = 0;
- void __iomem *bbar = tbl->bbar;
- void __iomem *target;
-
- /* disable arbitration on the bus */
- target = calgary_reg(bbar, phb_offset(tbl->it_busno) | PHB_AER_OFFSET);
- aer = readl(target);
- writel(0, target);
-
- /* read plssr to ensure it got there */
- target = calgary_reg(bbar, phb_offset(tbl->it_busno) | PHB_PLSSR_OFFSET);
- val = readl(target);
-
- /* poll split queues until all DMA activity is done */
- target = calgary_reg(bbar, split_queue_offset(tbl->it_busno));
- do {
- val = readq(target);
- i++;
- } while ((val & 0xff) != 0xff && i < 100);
- if (i == 100)
- printk(KERN_WARNING "Calgary: PCI bus not quiesced, "
- "continuing anyway\n");
-
- /* invalidate TCE cache */
- target = calgary_reg(bbar, tar_offset(tbl->it_busno));
- writeq(tbl->tar_val, target);
-
- /* enable arbitration */
- target = calgary_reg(bbar, phb_offset(tbl->it_busno) | PHB_AER_OFFSET);
- writel(aer, target);
- (void)readl(target); /* flush */
-}
-
-static void calioc2_tce_cache_blast(struct iommu_table *tbl)
-{
- void __iomem *bbar = tbl->bbar;
- void __iomem *target;
- u64 val64;
- u32 val;
- int i = 0;
- int count = 1;
- unsigned char bus = tbl->it_busno;
-
-begin:
- printk(KERN_DEBUG "Calgary: CalIOC2 bus 0x%x entering tce cache blast "
- "sequence - count %d\n", bus, count);
-
- /* 1. using the Page Migration Control reg set SoftStop */
- target = calgary_reg(bbar, phb_offset(bus) | PHB_PAGE_MIG_CTRL);
- val = be32_to_cpu(readl(target));
- printk(KERN_DEBUG "1a. read 0x%x [LE] from %p\n", val, target);
- val |= PMR_SOFTSTOP;
- printk(KERN_DEBUG "1b. writing 0x%x [LE] to %p\n", val, target);
- writel(cpu_to_be32(val), target);
-
- /* 2. poll split queues until all DMA activity is done */
- printk(KERN_DEBUG "2a. starting to poll split queues\n");
- target = calgary_reg(bbar, split_queue_offset(bus));
- do {
- val64 = readq(target);
- i++;
- } while ((val64 & 0xff) != 0xff && i < 100);
- if (i == 100)
- printk(KERN_WARNING "CalIOC2: PCI bus not quiesced, "
- "continuing anyway\n");
-
- /* 3. poll Page Migration DEBUG for SoftStopFault */
- target = calgary_reg(bbar, phb_offset(bus) | PHB_PAGE_MIG_DEBUG);
- val = be32_to_cpu(readl(target));
- printk(KERN_DEBUG "3. read 0x%x [LE] from %p\n", val, target);
-
- /* 4. if SoftStopFault - goto (1) */
- if (val & PMR_SOFTSTOPFAULT) {
- if (++count < 100)
- goto begin;
- else {
- printk(KERN_WARNING "CalIOC2: too many SoftStopFaults, "
- "aborting TCE cache flush sequence!\n");
- return; /* pray for the best */
- }
- }
-
- /* 5. Slam into HardStop by reading PHB_PAGE_MIG_CTRL */
- target = calgary_reg(bbar, phb_offset(bus) | PHB_PAGE_MIG_CTRL);
- printk(KERN_DEBUG "5a. slamming into HardStop by reading %p\n", target);
- val = be32_to_cpu(readl(target));
- printk(KERN_DEBUG "5b. read 0x%x [LE] from %p\n", val, target);
- target = calgary_reg(bbar, phb_offset(bus) | PHB_PAGE_MIG_DEBUG);
- val = be32_to_cpu(readl(target));
- printk(KERN_DEBUG "5c. read 0x%x [LE] from %p (debug)\n", val, target);
-
- /* 6. invalidate TCE cache */
- printk(KERN_DEBUG "6. invalidating TCE cache\n");
- target = calgary_reg(bbar, tar_offset(bus));
- writeq(tbl->tar_val, target);
-
- /* 7. Re-read PMCR */
- printk(KERN_DEBUG "7a. Re-reading PMCR\n");
- target = calgary_reg(bbar, phb_offset(bus) | PHB_PAGE_MIG_CTRL);
- val = be32_to_cpu(readl(target));
- printk(KERN_DEBUG "7b. read 0x%x [LE] from %p\n", val, target);
-
- /* 8. Remove HardStop */
- printk(KERN_DEBUG "8a. removing HardStop from PMCR\n");
- target = calgary_reg(bbar, phb_offset(bus) | PHB_PAGE_MIG_CTRL);
- val = 0;
- printk(KERN_DEBUG "8b. writing 0x%x [LE] to %p\n", val, target);
- writel(cpu_to_be32(val), target);
- val = be32_to_cpu(readl(target));
- printk(KERN_DEBUG "8c. read 0x%x [LE] from %p\n", val, target);
-}
-
-static void __init calgary_reserve_mem_region(struct pci_dev *dev, u64 start,
- u64 limit)
-{
- unsigned int numpages;
-
- limit = limit | 0xfffff;
- limit++;
-
- numpages = ((limit - start) >> PAGE_SHIFT);
- iommu_range_reserve(pci_iommu(dev->bus), start, numpages);
-}
-
-static void __init calgary_reserve_peripheral_mem_1(struct pci_dev *dev)
-{
- void __iomem *target;
- u64 low, high, sizelow;
- u64 start, limit;
- struct iommu_table *tbl = pci_iommu(dev->bus);
- unsigned char busnum = dev->bus->number;
- void __iomem *bbar = tbl->bbar;
-
- /* peripheral MEM_1 region */
- target = calgary_reg(bbar, phb_offset(busnum) | PHB_MEM_1_LOW);
- low = be32_to_cpu(readl(target));
- target = calgary_reg(bbar, phb_offset(busnum) | PHB_MEM_1_HIGH);
- high = be32_to_cpu(readl(target));
- target = calgary_reg(bbar, phb_offset(busnum) | PHB_MEM_1_SIZE);
- sizelow = be32_to_cpu(readl(target));
-
- start = (high << 32) | low;
- limit = sizelow;
-
- calgary_reserve_mem_region(dev, start, limit);
-}
-
-static void __init calgary_reserve_peripheral_mem_2(struct pci_dev *dev)
-{
- void __iomem *target;
- u32 val32;
- u64 low, high, sizelow, sizehigh;
- u64 start, limit;
- struct iommu_table *tbl = pci_iommu(dev->bus);
- unsigned char busnum = dev->bus->number;
- void __iomem *bbar = tbl->bbar;
-
- /* is it enabled? */
- target = calgary_reg(bbar, phb_offset(busnum) | PHB_CONFIG_RW_OFFSET);
- val32 = be32_to_cpu(readl(target));
- if (!(val32 & PHB_MEM2_ENABLE))
- return;
-
- target = calgary_reg(bbar, phb_offset(busnum) | PHB_MEM_2_LOW);
- low = be32_to_cpu(readl(target));
- target = calgary_reg(bbar, phb_offset(busnum) | PHB_MEM_2_HIGH);
- high = be32_to_cpu(readl(target));
- target = calgary_reg(bbar, phb_offset(busnum) | PHB_MEM_2_SIZE_LOW);
- sizelow = be32_to_cpu(readl(target));
- target = calgary_reg(bbar, phb_offset(busnum) | PHB_MEM_2_SIZE_HIGH);
- sizehigh = be32_to_cpu(readl(target));
-
- start = (high << 32) | low;
- limit = (sizehigh << 32) | sizelow;
-
- calgary_reserve_mem_region(dev, start, limit);
-}
-
-/*
- * some regions of the IO address space do not get translated, so we
- * must not give devices IO addresses in those regions. The regions
- * are the 640KB-1MB region and the two PCI peripheral memory holes.
- * Reserve all of them in the IOMMU bitmap to avoid giving them out
- * later.
- */
-static void __init calgary_reserve_regions(struct pci_dev *dev)
-{
- unsigned int npages;
- u64 start;
- struct iommu_table *tbl = pci_iommu(dev->bus);
-
- /* reserve EMERGENCY_PAGES from bad_dma_address and up */
- iommu_range_reserve(tbl, DMA_ERROR_CODE, EMERGENCY_PAGES);
-
- /* avoid the BIOS/VGA first 640KB-1MB region */
- /* for CalIOC2 - avoid the entire first MB */
- if (is_calgary(dev->device)) {
- start = (640 * 1024);
- npages = ((1024 - 640) * 1024) >> PAGE_SHIFT;
- } else { /* calioc2 */
- start = 0;
- npages = (1 * 1024 * 1024) >> PAGE_SHIFT;
- }
- iommu_range_reserve(tbl, start, npages);
-
- /* reserve the two PCI peripheral memory regions in IO space */
- calgary_reserve_peripheral_mem_1(dev);
- calgary_reserve_peripheral_mem_2(dev);
-}
-
-static int __init calgary_setup_tar(struct pci_dev *dev, void __iomem *bbar)
-{
- u64 val64;
- u64 table_phys;
- void __iomem *target;
- int ret;
- struct iommu_table *tbl;
-
- /* build TCE tables for each PHB */
- ret = build_tce_table(dev, bbar);
- if (ret)
- return ret;
-
- tbl = pci_iommu(dev->bus);
- tbl->it_base = (unsigned long)bus_info[dev->bus->number].tce_space;
-
- if (is_kdump_kernel())
- calgary_init_bitmap_from_tce_table(tbl);
- else
- tce_free(tbl, 0, tbl->it_size);
-
- if (is_calgary(dev->device))
- tbl->chip_ops = &calgary_chip_ops;
- else if (is_calioc2(dev->device))
- tbl->chip_ops = &calioc2_chip_ops;
- else
- BUG();
-
- calgary_reserve_regions(dev);
-
- /* set TARs for each PHB */
- target = calgary_reg(bbar, tar_offset(dev->bus->number));
- val64 = be64_to_cpu(readq(target));
-
- /* zero out all TAR bits under sw control */
- val64 &= ~TAR_SW_BITS;
- table_phys = (u64)__pa(tbl->it_base);
-
- val64 |= table_phys;
-
- BUG_ON(specified_table_size > TCE_TABLE_SIZE_8M);
- val64 |= (u64) specified_table_size;
-
- tbl->tar_val = cpu_to_be64(val64);
-
- writeq(tbl->tar_val, target);
- readq(target); /* flush */
-
- return 0;
-}
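TAR_SW_BITS marks the bits of the Table Address Register that software owns: the table's physical address field plus the low nibble holding the size code and the valid bit. With hypothetical numbers the update above works out as:

/*
 * table_phys = 0x4000000, specified_table_size = TCE_TABLE_SIZE_8M (7):
 *
 *	val64 &= ~TAR_SW_BITS;	keep only the hardware-owned bits
 *	val64 |= 0x4000000;	install the table's physical address
 *	val64 |= 7;		size code in the software low nibble
 */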
-
-static void __init calgary_free_bus(struct pci_dev *dev)
-{
- u64 val64;
- struct iommu_table *tbl = pci_iommu(dev->bus);
- void __iomem *target;
- unsigned int bitmapsz;
-
- target = calgary_reg(tbl->bbar, tar_offset(dev->bus->number));
- val64 = be64_to_cpu(readq(target));
- val64 &= ~TAR_SW_BITS;
- writeq(cpu_to_be64(val64), target);
- readq(target); /* flush */
-
- bitmapsz = tbl->it_size / BITS_PER_BYTE;
- free_pages((unsigned long)tbl->it_map, get_order(bitmapsz));
- tbl->it_map = NULL;
-
- kfree(tbl);
-
- set_pci_iommu(dev->bus, NULL);
-
- /* Can't free bootmem allocated memory after system is up :-( */
- bus_info[dev->bus->number].tce_space = NULL;
-}
-
-static void calgary_dump_error_regs(struct iommu_table *tbl)
-{
- void __iomem *bbar = tbl->bbar;
- void __iomem *target;
- u32 csr, plssr;
-
- target = calgary_reg(bbar, phb_offset(tbl->it_busno) | PHB_CSR_OFFSET);
- csr = be32_to_cpu(readl(target));
-
- target = calgary_reg(bbar, phb_offset(tbl->it_busno) | PHB_PLSSR_OFFSET);
- plssr = be32_to_cpu(readl(target));
-
- /* If no error, the agent ID in the CSR is not valid */
- printk(KERN_EMERG "Calgary: DMA error on Calgary PHB 0x%x, "
- "0x%08x@CSR 0x%08x@PLSSR\n", tbl->it_busno, csr, plssr);
-}
-
-static void calioc2_dump_error_regs(struct iommu_table *tbl)
-{
- void __iomem *bbar = tbl->bbar;
- u32 csr, csmr, plssr, mck, rcstat;
- void __iomem *target;
- unsigned long phboff = phb_offset(tbl->it_busno);
- unsigned long erroff;
- u32 errregs[7];
- int i;
-
- /* dump CSR */
- target = calgary_reg(bbar, phboff | PHB_CSR_OFFSET);
- csr = be32_to_cpu(readl(target));
- /* dump PLSSR */
- target = calgary_reg(bbar, phboff | PHB_PLSSR_OFFSET);
- plssr = be32_to_cpu(readl(target));
- /* dump CSMR */
- target = calgary_reg(bbar, phboff | 0x290);
- csmr = be32_to_cpu(readl(target));
- /* dump mck */
- target = calgary_reg(bbar, phboff | 0x800);
- mck = be32_to_cpu(readl(target));
-
- printk(KERN_EMERG "Calgary: DMA error on CalIOC2 PHB 0x%x\n",
- tbl->it_busno);
-
- printk(KERN_EMERG "Calgary: 0x%08x@CSR 0x%08x@PLSSR 0x%08x@CSMR 0x%08x@MCK\n",
- csr, plssr, csmr, mck);
-
- /* dump rest of error regs */
- printk(KERN_EMERG "Calgary: ");
- for (i = 0; i < ARRAY_SIZE(errregs); i++) {
- /* err regs are at 0x810 - 0x870 */
- erroff = (0x810 + (i * 0x10));
- target = calgary_reg(bbar, phboff | erroff);
- errregs[i] = be32_to_cpu(readl(target));
- printk("0x%08x@0x%lx ", errregs[i], erroff);
- }
- printk("\n");
-
- /* root complex status */
- target = calgary_reg(bbar, phboff | PHB_ROOT_COMPLEX_STATUS);
- rcstat = be32_to_cpu(readl(target));
- printk(KERN_EMERG "Calgary: 0x%08x@0x%x\n", rcstat,
- PHB_ROOT_COMPLEX_STATUS);
-}
-
-static void calgary_watchdog(unsigned long data)
-{
- struct pci_dev *dev = (struct pci_dev *)data;
- struct iommu_table *tbl = pci_iommu(dev->bus);
- void __iomem *bbar = tbl->bbar;
- u32 val32;
- void __iomem *target;
-
- target = calgary_reg(bbar, phb_offset(tbl->it_busno) | PHB_CSR_OFFSET);
- val32 = be32_to_cpu(readl(target));
-
- /* If no error, the agent ID in the CSR is not valid */
- if (val32 & CSR_AGENT_MASK) {
- tbl->chip_ops->dump_error_regs(tbl);
-
- /* reset error */
- writel(0, target);
-
- /* Disable bus that caused the error */
- target = calgary_reg(bbar, phb_offset(tbl->it_busno) |
- PHB_CONFIG_RW_OFFSET);
- val32 = be32_to_cpu(readl(target));
- val32 |= PHB_SLOT_DISABLE;
- writel(cpu_to_be32(val32), target);
- readl(target); /* flush */
- } else {
- /* Reset the timer */
- mod_timer(&tbl->watchdog_timer, jiffies + 2 * HZ);
- }
-}
-
-static void __init calgary_set_split_completion_timeout(void __iomem *bbar,
- unsigned char busnum, unsigned long timeout)
-{
- u64 val64;
- void __iomem *target;
- unsigned int phb_shift = ~0; /* silence gcc */
- u64 mask;
-
- switch (busno_to_phbid(busnum)) {
- case 0: phb_shift = (63 - 19);
- break;
- case 1: phb_shift = (63 - 23);
- break;
- case 2: phb_shift = (63 - 27);
- break;
- case 3: phb_shift = (63 - 35);
- break;
- default:
- BUG_ON(busno_to_phbid(busnum));
- }
-
- target = calgary_reg(bbar, CALGARY_CONFIG_REG);
- val64 = be64_to_cpu(readq(target));
-
- /* zero out this PHB's timer bits */
- mask = ~(0xFUL << phb_shift);
- val64 &= mask;
- val64 |= (timeout << phb_shift);
- writeq(cpu_to_be64(val64), target);
- readq(target); /* flush */
-}
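Each PHB owns a 4-bit timeout field inside the shared Calgary configuration register; the switch above just selects where that field sits. For PHB 0 the arithmetic is:

/*
 * phb_shift = 63 - 19 = 44
 * mask      = ~(0xFUL << 44)	clears bits 47:44
 * val64    |= timeout << 44	installs the new 4-bit timeout there
 */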
-
-static void __init calioc2_handle_quirks(struct iommu_table *tbl, struct pci_dev *dev)
-{
- unsigned char busnum = dev->bus->number;
- void __iomem *bbar = tbl->bbar;
- void __iomem *target;
- u32 val;
-
- /*
- * CalIOC2 designers recommend setting bit 8 in 0xnDB0 to 1
- */
- target = calgary_reg(bbar, phb_offset(busnum) | PHB_SAVIOR_L2);
- val = cpu_to_be32(readl(target));
- val |= 0x00800000;
- writel(cpu_to_be32(val), target);
-}
-
-static void __init calgary_handle_quirks(struct iommu_table *tbl, struct pci_dev *dev)
-{
- unsigned char busnum = dev->bus->number;
-
- /*
- * Give split completion a longer timeout on bus 1 for aic94xx
- * http://bugzilla.kernel.org/show_bug.cgi?id=7180
- */
- if (is_calgary(dev->device) && (busnum == 1))
- calgary_set_split_completion_timeout(tbl->bbar, busnum,
- CCR_2SEC_TIMEOUT);
-}
-
-static void __init calgary_enable_translation(struct pci_dev *dev)
-{
- u32 val32;
- unsigned char busnum;
- void __iomem *target;
- void __iomem *bbar;
- struct iommu_table *tbl;
-
- busnum = dev->bus->number;
- tbl = pci_iommu(dev->bus);
- bbar = tbl->bbar;
-
- /* enable TCE in PHB Config Register */
- target = calgary_reg(bbar, phb_offset(busnum) | PHB_CONFIG_RW_OFFSET);
- val32 = be32_to_cpu(readl(target));
- val32 |= PHB_TCE_ENABLE | PHB_DAC_DISABLE | PHB_MCSR_ENABLE;
-
- printk(KERN_INFO "Calgary: enabling translation on %s PHB %#x\n",
- (dev->device == PCI_DEVICE_ID_IBM_CALGARY) ?
- "Calgary" : "CalIOC2", busnum);
- printk(KERN_INFO "Calgary: errant DMAs will now be prevented on this "
- "bus.\n");
-
- writel(cpu_to_be32(val32), target);
- readl(target); /* flush */
-
- init_timer(&tbl->watchdog_timer);
- tbl->watchdog_timer.function = &calgary_watchdog;
- tbl->watchdog_timer.data = (unsigned long)dev;
- mod_timer(&tbl->watchdog_timer, jiffies);
-}
-
-static void __init calgary_disable_translation(struct pci_dev *dev)
-{
- u32 val32;
- unsigned char busnum;
- void __iomem *target;
- void __iomem *bbar;
- struct iommu_table *tbl;
-
- busnum = dev->bus->number;
- tbl = pci_iommu(dev->bus);
- bbar = tbl->bbar;
-
- /* disable TCE in PHB Config Register */
- target = calgary_reg(bbar, phb_offset(busnum) | PHB_CONFIG_RW_OFFSET);
- val32 = be32_to_cpu(readl(target));
- val32 &= ~(PHB_TCE_ENABLE | PHB_DAC_DISABLE | PHB_MCSR_ENABLE);
-
- printk(KERN_INFO "Calgary: disabling translation on PHB %#x!\n", busnum);
- writel(cpu_to_be32(val32), target);
- readl(target); /* flush */
-
- del_timer_sync(&tbl->watchdog_timer);
-}
-
-static void __init calgary_init_one_nontranslated(struct pci_dev *dev)
-{
- pci_dev_get(dev);
- set_pci_iommu(dev->bus, NULL);
-
- /* is the device behind a bridge? */
- if (dev->bus->parent)
- dev->bus->parent->self = dev;
- else
- dev->bus->self = dev;
-}
-
-static int __init calgary_init_one(struct pci_dev *dev)
-{
- void __iomem *bbar;
- struct iommu_table *tbl;
- int ret;
-
- bbar = busno_to_bbar(dev->bus->number);
- ret = calgary_setup_tar(dev, bbar);
- if (ret)
- goto done;
-
- pci_dev_get(dev);
-
- if (dev->bus->parent) {
- if (dev->bus->parent->self)
- printk(KERN_WARNING "Calgary: IEEEE, dev %p has "
- "bus->parent->self!\n", dev);
- dev->bus->parent->self = dev;
- } else
- dev->bus->self = dev;
-
- tbl = pci_iommu(dev->bus);
- tbl->chip_ops->handle_quirks(tbl, dev);
-
- calgary_enable_translation(dev);
-
- return 0;
-
-done:
- return ret;
-}
-
-static int __init calgary_locate_bbars(void)
-{
- int ret;
- int rioidx, phb, bus;
- void __iomem *bbar;
- void __iomem *target;
- unsigned long offset;
- u8 start_bus, end_bus;
- u32 val;
-
- ret = -ENODATA;
- for (rioidx = 0; rioidx < rio_table_hdr->num_rio_dev; rioidx++) {
- struct rio_detail *rio = rio_devs[rioidx];
-
- if ((rio->type != COMPAT_CALGARY) && (rio->type != ALT_CALGARY))
- continue;
-
- /* map entire 1MB of Calgary config space */
- bbar = ioremap_nocache(rio->BBAR, 1024 * 1024);
- if (!bbar)
- goto error;
-
- for (phb = 0; phb < PHBS_PER_CALGARY; phb++) {
- offset = phb_debug_offsets[phb] | PHB_DEBUG_STUFF_OFFSET;
- target = calgary_reg(bbar, offset);
-
- val = be32_to_cpu(readl(target));
-
- start_bus = (u8)((val & 0x00FF0000) >> 16);
- end_bus = (u8)((val & 0x0000FF00) >> 8);
-
- if (end_bus) {
- for (bus = start_bus; bus <= end_bus; bus++) {
- bus_info[bus].bbar = bbar;
- bus_info[bus].phbid = phb;
- }
- } else {
- bus_info[start_bus].bbar = bbar;
- bus_info[start_bus].phbid = phb;
- }
- }
- }
-
- return 0;
-
-error:
- /* scan bus_info and iounmap any bbars we previously ioremap'd */
- for (bus = 0; bus < ARRAY_SIZE(bus_info); bus++)
- if (bus_info[bus].bbar)
- iounmap(bus_info[bus].bbar);
-
- return ret;
-}
-
-static int __init calgary_init(void)
-{
- int ret;
- struct pci_dev *dev = NULL;
- struct calgary_bus_info *info;
-
- ret = calgary_locate_bbars();
- if (ret)
- return ret;
-
- /* Purely for kdump kernel case */
- if (is_kdump_kernel())
- get_tce_space_from_tar();
-
- do {
- dev = pci_get_device(PCI_VENDOR_ID_IBM, PCI_ANY_ID, dev);
- if (!dev)
- break;
- if (!is_cal_pci_dev(dev->device))
- continue;
-
- info = &bus_info[dev->bus->number];
- if (info->translation_disabled) {
- calgary_init_one_nontranslated(dev);
- continue;
- }
-
- if (!info->tce_space && !translate_empty_slots)
- continue;
-
- ret = calgary_init_one(dev);
- if (ret)
- goto error;
- } while (1);
-
- dev = NULL;
- for_each_pci_dev(dev) {
- struct iommu_table *tbl;
-
- tbl = find_iommu_table(&dev->dev);
-
- if (translation_enabled(tbl))
- dev->dev.archdata.dma_ops = &calgary_dma_ops;
- }
-
- return ret;
-
-error:
- do {
- dev = pci_get_device(PCI_VENDOR_ID_IBM, PCI_ANY_ID, dev);
- if (!dev)
- break;
- if (!is_cal_pci_dev(dev->device))
- continue;
-
- info = &bus_info[dev->bus->number];
- if (info->translation_disabled) {
- pci_dev_put(dev);
- continue;
- }
- if (!info->tce_space && !translate_empty_slots)
- continue;
-
- calgary_disable_translation(dev);
- calgary_free_bus(dev);
- pci_dev_put(dev); /* Undo calgary_init_one()'s pci_dev_get() */
- dev->dev.archdata.dma_ops = NULL;
- } while (1);
-
- return ret;
-}
-
-static inline int __init determine_tce_table_size(u64 ram)
-{
- int ret;
-
- if (specified_table_size != TCE_TABLE_SIZE_UNSPECIFIED)
- return specified_table_size;
-
- /*
- * Table sizes are from 0 to 7 (TCE_TABLE_SIZE_64K to
- * TCE_TABLE_SIZE_8M). Table size 0 has 8K entries and each
- * larger table size has twice as many entries, so shift the
- * max ram address by 13 to divide by 8K and then look at the
- * order of the result to choose between 0-7.
- */
- ret = get_order(ram >> 13);
- if (ret > TCE_TABLE_SIZE_8M)
- ret = TCE_TABLE_SIZE_8M;
-
- return ret;
-}
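The shift-by-13 comment is easiest to see with concrete numbers; for a 4GB machine:

/*
 * ram >> 13        = 2^32 >> 13 = 2^19 = 512KB
 * get_order(512KB) = 7		(512KB = 128 4KB pages = 2^7 pages)
 *
 * so 4GB of RAM selects table size 7, i.e. TCE_TABLE_SIZE_8M.
 */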
-
-static int __init build_detail_arrays(void)
-{
- unsigned long ptr;
- unsigned numnodes, i;
- int scal_detail_size, rio_detail_size;
-
- numnodes = rio_table_hdr->num_scal_dev;
- if (numnodes > MAX_NUMNODES){
- printk(KERN_WARNING
- "Calgary: MAX_NUMNODES too low! Defined as %d, "
- "but system has %d nodes.\n",
- MAX_NUMNODES, numnodes);
- return -ENODEV;
- }
-
- switch (rio_table_hdr->version){
- case 2:
- scal_detail_size = 11;
- rio_detail_size = 13;
- break;
- case 3:
- scal_detail_size = 12;
- rio_detail_size = 15;
- break;
- default:
- printk(KERN_WARNING
- "Calgary: Invalid Rio Grande Table Version: %d\n",
- rio_table_hdr->version);
- return -EPROTO;
- }
-
- ptr = ((unsigned long)rio_table_hdr) + 3;
- for (i = 0; i < numnodes; i++, ptr += scal_detail_size)
- scal_devs[i] = (struct scal_detail *)ptr;
-
- for (i = 0; i < rio_table_hdr->num_rio_dev;
- i++, ptr += rio_detail_size)
- rio_devs[i] = (struct rio_detail *)ptr;
-
- return 0;
-}
-
-static int __init calgary_bus_has_devices(int bus, unsigned short pci_dev)
-{
- int dev;
- u32 val;
-
- if (pci_dev == PCI_DEVICE_ID_IBM_CALIOC2) {
- /*
- * FIXME: properly scan for devices across the
- * PCI-to-PCI bridge on every CalIOC2 port.
- */
- return 1;
- }
-
- for (dev = 1; dev < 8; dev++) {
- val = read_pci_config(bus, dev, 0, 0);
- if (val != 0xffffffff)
- break;
- }
- return (val != 0xffffffff);
-}
-
-/*
- * calgary_init_bitmap_from_tce_table():
- * Function for the kdump case. In the second (kdump) kernel, initialize
- * the bitmap based on the TCE table entries obtained from the first kernel.
- */
-static void calgary_init_bitmap_from_tce_table(struct iommu_table *tbl)
-{
- u64 *tp;
- unsigned int index;
- tp = ((u64 *)tbl->it_base);
- for (index = 0 ; index < tbl->it_size; index++) {
- if (*tp != 0x0)
- set_bit(index, tbl->it_map);
- tp++;
- }
-}
-
-/*
- * get_tce_space_from_tar():
- * Function for the kdump case. Get the TCE tables from the first kernel
- * by reading the contents of the Calgary IOMMU's table address registers.
- */
-static void __init get_tce_space_from_tar(void)
-{
- int bus;
- void __iomem *target;
- unsigned long tce_space;
-
- for (bus = 0; bus < MAX_PHB_BUS_NUM; bus++) {
- struct calgary_bus_info *info = &bus_info[bus];
- unsigned short pci_device;
- u32 val;
-
- val = read_pci_config(bus, 0, 0, 0);
- pci_device = (val & 0xFFFF0000) >> 16;
-
- if (!is_cal_pci_dev(pci_device))
- continue;
- if (info->translation_disabled)
- continue;
-
- if (calgary_bus_has_devices(bus, pci_device) ||
- translate_empty_slots) {
- target = calgary_reg(bus_info[bus].bbar,
- tar_offset(bus));
- tce_space = be64_to_cpu(readq(target));
- tce_space = tce_space & TAR_SW_BITS;
-
- tce_space = tce_space & (~specified_table_size);
- info->tce_space = (u64 *)__va(tce_space);
- }
- }
- return;
-}
-
-static int __init calgary_iommu_init(void)
-{
- int ret;
-
- /* ok, we're trying to use Calgary - let's roll */
- printk(KERN_INFO "PCI-DMA: Using Calgary IOMMU\n");
-
- ret = calgary_init();
- if (ret) {
- printk(KERN_ERR "PCI-DMA: Calgary init failed %d, "
- "falling back to no_iommu\n", ret);
- return ret;
- }
-
- return 0;
-}
-
-void __init detect_calgary(void)
-{
- int bus;
- void *tbl;
- int calgary_found = 0;
- unsigned long ptr;
- unsigned int offset, prev_offset;
- int ret;
-
- /*
- * if the user specified iommu=off or iommu=soft or we found
- * another HW IOMMU already, bail out.
- */
- if (no_iommu || iommu_detected)
- return;
-
- if (!use_calgary)
- return;
-
- if (!early_pci_allowed())
- return;
-
- printk(KERN_DEBUG "Calgary: detecting Calgary via BIOS EBDA area\n");
-
- ptr = (unsigned long)phys_to_virt(get_bios_ebda());
-
- rio_table_hdr = NULL;
- prev_offset = 0;
- offset = 0x180;
- /*
- * The next offset is stored in the 1st word.
- * Only parse up until the offset increases:
- */
- while (offset > prev_offset) {
- /* The block id is stored in the 2nd word */
- if (*((unsigned short *)(ptr + offset + 2)) == 0x4752){
- /* set the pointer past the offset & block id */
- rio_table_hdr = (struct rio_table_hdr *)(ptr + offset + 4);
- break;
- }
- prev_offset = offset;
- offset = *((unsigned short *)(ptr + offset));
- }
- if (!rio_table_hdr) {
- printk(KERN_DEBUG "Calgary: Unable to locate Rio Grande table "
- "in EBDA - bailing!\n");
- return;
- }
-
- ret = build_detail_arrays();
- if (ret) {
- printk(KERN_DEBUG "Calgary: build_detail_arrays ret %d\n", ret);
- return;
- }
-
- specified_table_size = determine_tce_table_size((is_kdump_kernel() ?
- saved_max_pfn : max_pfn) * PAGE_SIZE);
-
- for (bus = 0; bus < MAX_PHB_BUS_NUM; bus++) {
- struct calgary_bus_info *info = &bus_info[bus];
- unsigned short pci_device;
- u32 val;
-
- val = read_pci_config(bus, 0, 0, 0);
- pci_device = (val & 0xFFFF0000) >> 16;
-
- if (!is_cal_pci_dev(pci_device))
- continue;
-
- if (info->translation_disabled)
- continue;
-
- if (calgary_bus_has_devices(bus, pci_device) ||
- translate_empty_slots) {
- /*
- * If it is kdump kernel, find and use tce tables
- * from first kernel, else allocate tce tables here
- */
- if (!is_kdump_kernel()) {
- tbl = alloc_tce_table();
- if (!tbl)
- goto cleanup;
- info->tce_space = tbl;
- }
- calgary_found = 1;
- }
- }
-
- printk(KERN_DEBUG "Calgary: finished detection, Calgary %s\n",
- calgary_found ? "found" : "not found");
-
- if (calgary_found) {
- iommu_detected = 1;
- calgary_detected = 1;
- printk(KERN_INFO "PCI-DMA: Calgary IOMMU detected.\n");
- printk(KERN_INFO "PCI-DMA: Calgary TCE table spec is %d\n",
- specified_table_size);
-
- x86_init.iommu.iommu_init = calgary_iommu_init;
- }
- return;
-
-cleanup:
- for (--bus; bus >= 0; --bus) {
- struct calgary_bus_info *info = &bus_info[bus];
-
- if (info->tce_space)
- free_tce_table(info->tce_space);
- }
-}
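The EBDA scan above walks a chain of blocks where the first 16-bit word of each block is the offset of the next one and the second word is a block id; 0x4752 identifies the Rio Grande table. A standalone sketch of the same walk (hypothetical helper, not part of the patch):

#include <stddef.h>

static void *find_rio_table(unsigned char *ebda)
{
	unsigned short prev = 0, offset = 0x180;

	while (offset > prev) {		/* only parse while offsets grow */
		/* the block id lives in the second 16-bit word */
		if (*(unsigned short *)(ebda + offset + 2) == 0x4752)
			return ebda + offset + 4; /* past offset + id */
		prev = offset;
		offset = *(unsigned short *)(ebda + offset);
	}
	return NULL;			/* no Rio Grande table found */
}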
-
-static int __init calgary_parse_options(char *p)
-{
- unsigned int bridge;
- size_t len;
- char* endp;
-
- while (*p) {
- if (!strncmp(p, "64k", 3))
- specified_table_size = TCE_TABLE_SIZE_64K;
- else if (!strncmp(p, "128k", 4))
- specified_table_size = TCE_TABLE_SIZE_128K;
- else if (!strncmp(p, "256k", 4))
- specified_table_size = TCE_TABLE_SIZE_256K;
- else if (!strncmp(p, "512k", 4))
- specified_table_size = TCE_TABLE_SIZE_512K;
- else if (!strncmp(p, "1M", 2))
- specified_table_size = TCE_TABLE_SIZE_1M;
- else if (!strncmp(p, "2M", 2))
- specified_table_size = TCE_TABLE_SIZE_2M;
- else if (!strncmp(p, "4M", 2))
- specified_table_size = TCE_TABLE_SIZE_4M;
- else if (!strncmp(p, "8M", 2))
- specified_table_size = TCE_TABLE_SIZE_8M;
-
- len = strlen("translate_empty_slots");
- if (!strncmp(p, "translate_empty_slots", len))
- translate_empty_slots = 1;
-
- len = strlen("disable");
- if (!strncmp(p, "disable", len)) {
- p += len;
- if (*p == '=')
- ++p;
- if (*p == '\0')
- break;
- bridge = simple_strtoul(p, &endp, 0);
- if (p == endp)
- break;
-
- if (bridge < MAX_PHB_BUS_NUM) {
- printk(KERN_INFO "Calgary: disabling "
- "translation for PHB %#x\n", bridge);
- bus_info[bridge].translation_disabled = 1;
- }
- }
-
- p = strpbrk(p, ",");
- if (!p)
- break;
-
- p++; /* skip ',' */
- }
- return 1;
-}
-__setup("calgary=", calgary_parse_options);
-
-static void __init calgary_fixup_one_tce_space(struct pci_dev *dev)
-{
- struct iommu_table *tbl;
- unsigned int npages;
- int i;
-
- tbl = pci_iommu(dev->bus);
-
- for (i = 0; i < 4; i++) {
- struct resource *r = &dev->resource[PCI_BRIDGE_RESOURCES + i];
-
- /* Don't give out TCEs that map MEM resources */
- if (!(r->flags & IORESOURCE_MEM))
- continue;
-
- /* 0-based? we reserve the whole 1st MB anyway */
- if (!r->start)
- continue;
-
- /* cover the whole region */
- npages = (r->end - r->start) >> PAGE_SHIFT;
- npages++;
-
- iommu_range_reserve(tbl, r->start, npages);
- }
-}
-
-static int __init calgary_fixup_tce_spaces(void)
-{
- struct pci_dev *dev = NULL;
- struct calgary_bus_info *info;
-
- if (no_iommu || swiotlb || !calgary_detected)
- return -ENODEV;
-
- printk(KERN_DEBUG "Calgary: fixing up tce spaces\n");
-
- do {
- dev = pci_get_device(PCI_VENDOR_ID_IBM, PCI_ANY_ID, dev);
- if (!dev)
- break;
- if (!is_cal_pci_dev(dev->device))
- continue;
-
- info = &bus_info[dev->bus->number];
- if (info->translation_disabled)
- continue;
-
- if (!info->tce_space)
- continue;
-
- calgary_fixup_one_tce_space(dev);
-
- } while (1);
-
- return 0;
-}
-
-/*
- * We need to be called after pcibios_assign_resources (fs_initcall level)
- * and before device_initcall.
- */
-rootfs_initcall(calgary_fixup_tce_spaces);
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 4b7e3d8b01dd..6267363e0189 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -1,25 +1,27 @@
-#include <linux/dma-mapping.h>
-#include <linux/dma-debug.h>
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/dma-map-ops.h>
+#include <linux/dma-direct.h>
+#include <linux/iommu.h>
#include <linux/dmar.h>
-#include <linux/bootmem.h>
+#include <linux/export.h>
+#include <linux/memblock.h>
#include <linux/gfp.h>
#include <linux/pci.h>
-#include <linux/kmemleak.h>
+#include <linux/amd-iommu.h>
#include <asm/proto.h>
#include <asm/dma.h>
#include <asm/iommu.h>
#include <asm/gart.h>
-#include <asm/calgary.h>
-#include <asm/amd_iommu.h>
#include <asm/x86_init.h>
-static int forbid_dac __read_mostly;
-struct dma_map_ops *dma_ops = &nommu_dma_ops;
-EXPORT_SYMBOL(dma_ops);
-static int iommu_sac_force __read_mostly;
+#include <xen/xen.h>
+#include <xen/swiotlb-xen.h>
+static bool disable_dac_quirk __read_mostly;
+const struct dma_map_ops *dma_ops;
+EXPORT_SYMBOL(dma_ops);
#ifdef CONFIG_IOMMU_DEBUG
int panic_on_overflow __read_mostly = 1;
@@ -35,153 +37,77 @@ int no_iommu __read_mostly;
/* Set this to 1 if there is a HW IOMMU in the system */
int iommu_detected __read_mostly = 0;
-/*
- * This variable becomes 1 if iommu=pt is passed on the kernel command line.
- * If this variable is 1, IOMMU implementations do no DMA translation for
- * devices and allow every device to access to whole physical memory. This is
- * useful if a user wants to use an IOMMU only for KVM device assignment to
- * guests and not for driver dma translation.
- */
-int iommu_pass_through __read_mostly;
-
-/* Dummy device used for NULL arguments (normally ISA). */
-struct device x86_dma_fallback_dev = {
- .init_name = "fallback device",
- .coherent_dma_mask = ISA_DMA_BIT_MASK,
- .dma_mask = &x86_dma_fallback_dev.coherent_dma_mask,
-};
-EXPORT_SYMBOL(x86_dma_fallback_dev);
-
-/* Number of entries preallocated for DMA-API debugging */
-#define PREALLOC_DMA_DEBUG_ENTRIES 32768
-
-int dma_set_mask(struct device *dev, u64 mask)
-{
- if (!dev->dma_mask || !dma_supported(dev, mask))
- return -EIO;
-
- *dev->dma_mask = mask;
-
- return 0;
-}
-EXPORT_SYMBOL(dma_set_mask);
-
-#if defined(CONFIG_X86_64) && !defined(CONFIG_NUMA)
-static __initdata void *dma32_bootmem_ptr;
-static unsigned long dma32_bootmem_size __initdata = (128ULL<<20);
-
-static int __init parse_dma32_size_opt(char *p)
-{
- if (!p)
- return -EINVAL;
- dma32_bootmem_size = memparse(p, &p);
- return 0;
-}
-early_param("dma32_size", parse_dma32_size_opt);
+#ifdef CONFIG_SWIOTLB
+bool x86_swiotlb_enable;
+static unsigned int x86_swiotlb_flags;
-void __init dma32_reserve_bootmem(void)
+static void __init pci_swiotlb_detect(void)
{
- unsigned long size, align;
- if (max_pfn <= MAX_DMA32_PFN)
- return;
+ /* don't initialize swiotlb if iommu=off (no_iommu=1) */
+ if (!no_iommu && max_possible_pfn > MAX_DMA32_PFN)
+ x86_swiotlb_enable = true;
/*
- * check aperture_64.c allocate_aperture() for reason about
- * using 512M as goal
+ * Set swiotlb to 1 so that bounce buffers are allocated and used for
+ * devices that can't support DMA to encrypted memory.
*/
- align = 64ULL<<20;
- size = roundup(dma32_bootmem_size, align);
- dma32_bootmem_ptr = __alloc_bootmem_nopanic(size, align,
- 512ULL<<20);
+ if (cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
+ x86_swiotlb_enable = true;
+
/*
- * Kmemleak should not scan this block as it may not be mapped via the
- * kernel direct mapping.
+ * Guest with guest memory encryption currently perform all DMA through
+ * bounce buffers as the hypervisor can't access arbitrary VM memory
+ * that is not explicitly shared with it.
*/
- kmemleak_ignore(dma32_bootmem_ptr);
- if (dma32_bootmem_ptr)
- dma32_bootmem_size = size;
- else
- dma32_bootmem_size = 0;
+ if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
+ x86_swiotlb_enable = true;
+ x86_swiotlb_flags |= SWIOTLB_FORCE;
+ }
}
-static void __init dma32_free_bootmem(void)
+#else
+static inline void __init pci_swiotlb_detect(void)
{
+}
+#define x86_swiotlb_flags 0
+#endif /* CONFIG_SWIOTLB */
- if (max_pfn <= MAX_DMA32_PFN)
- return;
+#ifdef CONFIG_SWIOTLB_XEN
+static bool xen_swiotlb_enabled(void)
+{
+ return xen_initial_domain() || x86_swiotlb_enable ||
+ (IS_ENABLED(CONFIG_XEN_PCIDEV_FRONTEND) && xen_pv_pci_possible);
+}
- if (!dma32_bootmem_ptr)
+static void __init pci_xen_swiotlb_init(void)
+{
+ if (!xen_swiotlb_enabled())
return;
-
- free_bootmem(__pa(dma32_bootmem_ptr), dma32_bootmem_size);
-
- dma32_bootmem_ptr = NULL;
- dma32_bootmem_size = 0;
+ x86_swiotlb_enable = true;
+ x86_swiotlb_flags |= SWIOTLB_ANY;
+ swiotlb_init_remap(true, x86_swiotlb_flags, xen_swiotlb_fixup);
+ dma_ops = &xen_swiotlb_dma_ops;
+ if (IS_ENABLED(CONFIG_PCI))
+ pci_request_acs();
}
#else
-void __init dma32_reserve_bootmem(void)
+static inline void __init pci_xen_swiotlb_init(void)
{
}
-static void __init dma32_free_bootmem(void)
-{
-}
-
-#endif
+#endif /* CONFIG_SWIOTLB_XEN */
void __init pci_iommu_alloc(void)
{
- /* free the range so iommu could get some range less than 4G */
- dma32_free_bootmem();
-
- if (pci_swiotlb_detect())
- goto out;
-
+ if (xen_pv_domain()) {
+ pci_xen_swiotlb_init();
+ return;
+ }
+ pci_swiotlb_detect();
gart_iommu_hole_init();
-
- detect_calgary();
-
- detect_intel_iommu();
-
- /* needs to be called after gart_iommu_hole_init */
amd_iommu_detect();
-out:
- pci_swiotlb_init();
-}
-
-void *dma_generic_alloc_coherent(struct device *dev, size_t size,
- dma_addr_t *dma_addr, gfp_t flag)
-{
- unsigned long dma_mask;
- struct page *page;
- dma_addr_t addr;
-
- dma_mask = dma_alloc_coherent_mask(dev, flag);
-
- flag |= __GFP_ZERO;
-again:
- page = alloc_pages_node(dev_to_node(dev), flag, get_order(size));
- if (!page)
- return NULL;
-
- addr = page_to_phys(page);
- if (addr + size > dma_mask) {
- __free_pages(page, get_order(size));
-
- if (dma_mask < DMA_BIT_MASK(32) && !(flag & GFP_DMA)) {
- flag = (flag & ~GFP_DMA32) | GFP_DMA;
- goto again;
- }
-
- return NULL;
- }
-
- *dma_addr = addr;
- return page_address(page);
+ detect_intel_iommu();
+ swiotlb_init(x86_swiotlb_enable, x86_swiotlb_flags);
}
-/*
- * See <Documentation/x86_64/boot-options.txt> for the iommu kernel parameter
- * documentation.
- */
static __init int iommu_setup(char *p)
{
iommu_merge = 1;
@@ -215,29 +141,26 @@ static __init int iommu_setup(char *p)
if (!strncmp(p, "nomerge", 7))
iommu_merge = 0;
if (!strncmp(p, "forcesac", 8))
- iommu_sac_force = 1;
+ pr_warn("forcesac option ignored.\n");
if (!strncmp(p, "allowdac", 8))
- forbid_dac = 0;
+ pr_warn("allowdac option ignored.\n");
if (!strncmp(p, "nodac", 5))
- forbid_dac = 1;
+ pr_warn("nodac option ignored.\n");
if (!strncmp(p, "usedac", 6)) {
- forbid_dac = -1;
+ disable_dac_quirk = true;
return 1;
}
#ifdef CONFIG_SWIOTLB
if (!strncmp(p, "soft", 4))
- swiotlb = 1;
+ x86_swiotlb_enable = true;
#endif
if (!strncmp(p, "pt", 2))
- iommu_pass_through = 1;
+ iommu_set_default_passthrough(true);
+ if (!strncmp(p, "nopt", 4))
+ iommu_set_default_translated(true);
gart_parse_options(p);
-#ifdef CONFIG_CALGARY_IOMMU
- if (!strncmp(p, "calgary", 7))
- use_calgary = 1;
-#endif /* CONFIG_CALGARY_IOMMU */
-
p += strcspn(p, ",");
if (*p == ',')
++p;
@@ -246,62 +169,19 @@ static __init int iommu_setup(char *p)
}
early_param("iommu", iommu_setup);
-int dma_supported(struct device *dev, u64 mask)
-{
- struct dma_map_ops *ops = get_dma_ops(dev);
-
-#ifdef CONFIG_PCI
- if (mask > 0xffffffff && forbid_dac > 0) {
- dev_info(dev, "PCI: Disallowing DAC for device\n");
- return 0;
- }
-#endif
-
- if (ops->dma_supported)
- return ops->dma_supported(dev, mask);
-
- /* Copied from i386. Doesn't make much sense, because it will
- only work for pci_alloc_coherent.
- The caller just has to use GFP_DMA in this case. */
- if (mask < DMA_BIT_MASK(24))
- return 0;
-
- /* Tell the device to use SAC when IOMMU force is on. This
- allows the driver to use cheaper accesses in some cases.
-
- Problem with this is that if we overflow the IOMMU area and
- return DAC as fallback address the device may not handle it
- correctly.
-
- As a special case some controllers have a 39bit address
- mode that is as efficient as 32bit (aic79xx). Don't force
- SAC for these. Assume all masks <= 40 bits are of this
- type. Normally this doesn't make any difference, but gives
- more gentle handling of IOMMU overflow. */
- if (iommu_sac_force && (mask >= DMA_BIT_MASK(40))) {
- dev_info(dev, "Force SAC with mask %Lx\n", mask);
- return 0;
- }
-
- return 1;
-}
-EXPORT_SYMBOL(dma_supported);
-
static int __init pci_iommu_init(void)
{
- dma_debug_init(PREALLOC_DMA_DEBUG_ENTRIES);
-
-#ifdef CONFIG_PCI
- dma_debug_add_bus(&pci_bus_type);
-#endif
x86_init.iommu.iommu_init();
- if (swiotlb) {
- printk(KERN_INFO "PCI-DMA: "
- "Using software bounce buffering for IO (SWIOTLB)\n");
+#ifdef CONFIG_SWIOTLB
+ /* An IOMMU turned us off. */
+ if (x86_swiotlb_enable) {
+ pr_info("PCI-DMA: Using software bounce buffering for IO (SWIOTLB)\n");
swiotlb_print_info();
- } else
- swiotlb_free();
+ } else {
+ swiotlb_exit();
+ }
+#endif
return 0;
}
@@ -311,12 +191,19 @@ rootfs_initcall(pci_iommu_init);
#ifdef CONFIG_PCI
/* Many VIA bridges seem to corrupt data for DAC. Disable it here */
-static __devinit void via_no_dac(struct pci_dev *dev)
+static int via_no_dac_cb(struct pci_dev *pdev, void *data)
+{
+ pdev->dev.bus_dma_limit = DMA_BIT_MASK(32);
+ return 0;
+}
+
+static void via_no_dac(struct pci_dev *dev)
{
- if ((dev->class >> 8) == PCI_CLASS_BRIDGE_PCI && forbid_dac == 0) {
+ if (!disable_dac_quirk) {
dev_info(&dev->dev, "disabling DAC on VIA PCI bridge\n");
- forbid_dac = 1;
+ pci_walk_bus(dev->subordinate, via_no_dac_cb, NULL);
}
}
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_VIA, PCI_ANY_ID, via_no_dac);
+DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_VIA, PCI_ANY_ID,
+ PCI_CLASS_BRIDGE_PCI, 8, via_no_dac);
#endif
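
The iommu_setup() parser above walks the iommu= string one comma-separated token at a time: strncmp() matches a keyword, then strcspn() skips to the next comma. Below is a minimal userspace sketch of that loop shape only; parse_opts() and the messages are invented for illustration and are not kernel interfaces.

#include <stdio.h>
#include <string.h>

static void parse_opts(char *p)
{
	while (*p) {
		if (!strncmp(p, "soft", 4))
			printf("enable swiotlb\n");
		if (!strncmp(p, "pt", 2))
			printf("default to passthrough\n");
		if (!strncmp(p, "nopt", 4))
			printf("default to translated\n");
		/* skip the rest of this token, then step past the comma */
		p += strcspn(p, ",");
		if (*p == ',')
			++p;
	}
}

int main(void)
{
	char opts[] = "soft,nopt";

	parse_opts(opts);	/* prints: enable swiotlb, default to translated */
	return 0;
}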
diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
deleted file mode 100644
index 3af4af810c07..000000000000
--- a/arch/x86/kernel/pci-nommu.c
+++ /dev/null
@@ -1,106 +0,0 @@
-/* Fallback functions when the main IOMMU code is not compiled in. This
- code is roughly equivalent to i386. */
-#include <linux/dma-mapping.h>
-#include <linux/scatterlist.h>
-#include <linux/string.h>
-#include <linux/init.h>
-#include <linux/gfp.h>
-#include <linux/pci.h>
-#include <linux/mm.h>
-
-#include <asm/processor.h>
-#include <asm/iommu.h>
-#include <asm/dma.h>
-
-static int
-check_addr(char *name, struct device *hwdev, dma_addr_t bus, size_t size)
-{
- if (hwdev && !dma_capable(hwdev, bus, size)) {
- if (*hwdev->dma_mask >= DMA_BIT_MASK(32))
- printk(KERN_ERR
- "nommu_%s: overflow %Lx+%zu of device mask %Lx\n",
- name, (long long)bus, size,
- (long long)*hwdev->dma_mask);
- return 0;
- }
- return 1;
-}
-
-static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
-{
- dma_addr_t bus = page_to_phys(page) + offset;
- WARN_ON(size == 0);
- if (!check_addr("map_single", dev, bus, size))
- return DMA_ERROR_CODE;
- flush_write_buffers();
- return bus;
-}
-
-/* Map a set of buffers described by scatterlist in streaming
- * mode for DMA. This is the scatter-gather version of the
- * above pci_map_single interface. Here the scatter gather list
- * elements are each tagged with the appropriate dma address
- * and length. They are obtained via sg_dma_{address,length}(SG).
- *
- * NOTE: An implementation may be able to use a smaller number of
- * DMA address/length pairs than there are SG table elements.
- * (for example via virtual mapping capabilities)
- * The routine returns the number of addr/length pairs actually
- * used, at most nents.
- *
- * Device ownership issues as mentioned above for pci_map_single are
- * the same here.
- */
-static int nommu_map_sg(struct device *hwdev, struct scatterlist *sg,
- int nents, enum dma_data_direction dir,
- struct dma_attrs *attrs)
-{
- struct scatterlist *s;
- int i;
-
- WARN_ON(nents == 0 || sg[0].length == 0);
-
- for_each_sg(sg, s, nents, i) {
- BUG_ON(!sg_page(s));
- s->dma_address = sg_phys(s);
- if (!check_addr("map_sg", hwdev, s->dma_address, s->length))
- return 0;
- s->dma_length = s->length;
- }
- flush_write_buffers();
- return nents;
-}
-
-static void nommu_free_coherent(struct device *dev, size_t size, void *vaddr,
- dma_addr_t dma_addr)
-{
- free_pages((unsigned long)vaddr, get_order(size));
-}
-
-static void nommu_sync_single_for_device(struct device *dev,
- dma_addr_t addr, size_t size,
- enum dma_data_direction dir)
-{
- flush_write_buffers();
-}
-
-
-static void nommu_sync_sg_for_device(struct device *dev,
- struct scatterlist *sg, int nelems,
- enum dma_data_direction dir)
-{
- flush_write_buffers();
-}
-
-struct dma_map_ops nommu_dma_ops = {
- .alloc_coherent = dma_generic_alloc_coherent,
- .free_coherent = nommu_free_coherent,
- .map_sg = nommu_map_sg,
- .map_page = nommu_map_page,
- .sync_single_for_device = nommu_sync_single_for_device,
- .sync_sg_for_device = nommu_sync_sg_for_device,
- .is_phys = 1,
-};
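
The deleted check_addr() above boiled down to one range test: a buffer is usable for DMA only if its last byte still lies under the device's DMA mask. A standalone sketch of that test with made-up addresses; fits_dma_mask() is illustrative, not the kernel's dma_capable().

#include <stdio.h>
#include <stdint.h>

static int fits_dma_mask(uint64_t addr, size_t size, uint64_t dma_mask)
{
	/* addr + size is one past the end, so test the last byte */
	return addr + size - 1 <= dma_mask;
}

int main(void)
{
	uint64_t mask32 = (1ULL << 32) - 1;	/* DMA_BIT_MASK(32) */

	printf("%d\n", fits_dma_mask(0xfffff000, 0x1000, mask32)); /* 1: ends at 4G-1 */
	printf("%d\n", fits_dma_mask(0xfffff000, 0x2000, mask32)); /* 0: crosses 4G */
	return 0;
}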
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
deleted file mode 100644
index a5bc528d4328..000000000000
--- a/arch/x86/kernel/pci-swiotlb.c
+++ /dev/null
@@ -1,70 +0,0 @@
-/* Glue code to lib/swiotlb.c */
-
-#include <linux/pci.h>
-#include <linux/cache.h>
-#include <linux/module.h>
-#include <linux/swiotlb.h>
-#include <linux/bootmem.h>
-#include <linux/dma-mapping.h>
-
-#include <asm/iommu.h>
-#include <asm/swiotlb.h>
-#include <asm/dma.h>
-
-int swiotlb __read_mostly;
-
-static void *x86_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
- dma_addr_t *dma_handle, gfp_t flags)
-{
- void *vaddr;
-
- vaddr = dma_generic_alloc_coherent(hwdev, size, dma_handle, flags);
- if (vaddr)
- return vaddr;
-
- return swiotlb_alloc_coherent(hwdev, size, dma_handle, flags);
-}
-
-static struct dma_map_ops swiotlb_dma_ops = {
- .mapping_error = swiotlb_dma_mapping_error,
- .alloc_coherent = x86_swiotlb_alloc_coherent,
- .free_coherent = swiotlb_free_coherent,
- .sync_single_for_cpu = swiotlb_sync_single_for_cpu,
- .sync_single_for_device = swiotlb_sync_single_for_device,
- .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu,
- .sync_sg_for_device = swiotlb_sync_sg_for_device,
- .map_sg = swiotlb_map_sg_attrs,
- .unmap_sg = swiotlb_unmap_sg_attrs,
- .map_page = swiotlb_map_page,
- .unmap_page = swiotlb_unmap_page,
- .dma_supported = NULL,
-};
-
-/*
- * pci_swiotlb_detect - set swiotlb to 1 if necessary
- *
- * This returns non-zero if we are forced to use swiotlb (by the boot
- * option).
- */
-int __init pci_swiotlb_detect(void)
-{
- int use_swiotlb = swiotlb | swiotlb_force;
-
- /* don't initialize swiotlb if iommu=off (no_iommu=1) */
-#ifdef CONFIG_X86_64
- if (!no_iommu && max_pfn > MAX_DMA32_PFN)
- swiotlb = 1;
-#endif
- if (swiotlb_force)
- swiotlb = 1;
-
- return use_swiotlb;
-}
-
-void __init pci_swiotlb_init(void)
-{
- if (swiotlb) {
- swiotlb_init(0);
- dma_ops = &swiotlb_dma_ops;
- }
-}
diff --git a/arch/x86/kernel/pcspeaker.c b/arch/x86/kernel/pcspeaker.c
index a311ffcaad16..4a710ffffd9a 100644
--- a/arch/x86/kernel/pcspeaker.c
+++ b/arch/x86/kernel/pcspeaker.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/platform_device.h>
#include <linux/err.h>
#include <linux/init.h>
@@ -8,6 +9,6 @@ static __init int add_pcspkr(void)
pd = platform_device_register_simple("pcspkr", -1, NULL, 0);
- return IS_ERR(pd) ? PTR_ERR(pd) : 0;
+ return PTR_ERR_OR_ZERO(pd);
}
device_initcall(add_pcspkr);
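
The one-line PTR_ERR_OR_ZERO() conversion above works because the kernel encodes errno values as pointers in the top 4095 bytes of the address space, so a single return value can carry either a valid pointer or an error code. A userspace re-creation of the idiom, for illustration only:

#include <stdio.h>

#define MAX_ERRNO	4095
#define IS_ERR_VALUE(x)	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

static inline void *ERR_PTR(long error)   { return (void *)error; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)   { return IS_ERR_VALUE(p); }

static inline int PTR_ERR_OR_ZERO(const void *p)
{
	return IS_ERR(p) ? (int)PTR_ERR(p) : 0;
}

int main(void)
{
	int x;
	void *ok = &x;				/* a normal pointer */
	void *bad = ERR_PTR(-12 /* -ENOMEM */);	/* an encoded error */

	printf("%d %d\n", PTR_ERR_OR_ZERO(ok), PTR_ERR_OR_ZERO(bad)); /* 0 -12 */
	return 0;
}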
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
new file mode 100644
index 000000000000..624703af80a1
--- /dev/null
+++ b/arch/x86/kernel/perf_regs.c
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/perf_event.h>
+#include <linux/bug.h>
+#include <linux/stddef.h>
+#include <asm/perf_regs.h>
+#include <asm/ptrace.h>
+
+#ifdef CONFIG_X86_32
+#define PERF_REG_X86_MAX PERF_REG_X86_32_MAX
+#else
+#define PERF_REG_X86_MAX PERF_REG_X86_64_MAX
+#endif
+
+#define PT_REGS_OFFSET(id, r) [id] = offsetof(struct pt_regs, r)
+
+static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
+ PT_REGS_OFFSET(PERF_REG_X86_AX, ax),
+ PT_REGS_OFFSET(PERF_REG_X86_BX, bx),
+ PT_REGS_OFFSET(PERF_REG_X86_CX, cx),
+ PT_REGS_OFFSET(PERF_REG_X86_DX, dx),
+ PT_REGS_OFFSET(PERF_REG_X86_SI, si),
+ PT_REGS_OFFSET(PERF_REG_X86_DI, di),
+ PT_REGS_OFFSET(PERF_REG_X86_BP, bp),
+ PT_REGS_OFFSET(PERF_REG_X86_SP, sp),
+ PT_REGS_OFFSET(PERF_REG_X86_IP, ip),
+ PT_REGS_OFFSET(PERF_REG_X86_FLAGS, flags),
+ PT_REGS_OFFSET(PERF_REG_X86_CS, cs),
+ PT_REGS_OFFSET(PERF_REG_X86_SS, ss),
+#ifdef CONFIG_X86_32
+ PT_REGS_OFFSET(PERF_REG_X86_DS, ds),
+ PT_REGS_OFFSET(PERF_REG_X86_ES, es),
+ PT_REGS_OFFSET(PERF_REG_X86_FS, fs),
+ PT_REGS_OFFSET(PERF_REG_X86_GS, gs),
+#else
+ /*
+ * The pt_regs struct does not store
+ * ds, es, fs, gs in 64 bit mode.
+ */
+ (unsigned int) -1,
+ (unsigned int) -1,
+ (unsigned int) -1,
+ (unsigned int) -1,
+#endif
+#ifdef CONFIG_X86_64
+ PT_REGS_OFFSET(PERF_REG_X86_R8, r8),
+ PT_REGS_OFFSET(PERF_REG_X86_R9, r9),
+ PT_REGS_OFFSET(PERF_REG_X86_R10, r10),
+ PT_REGS_OFFSET(PERF_REG_X86_R11, r11),
+ PT_REGS_OFFSET(PERF_REG_X86_R12, r12),
+ PT_REGS_OFFSET(PERF_REG_X86_R13, r13),
+ PT_REGS_OFFSET(PERF_REG_X86_R14, r14),
+ PT_REGS_OFFSET(PERF_REG_X86_R15, r15),
+#endif
+};
+
+u64 perf_reg_value(struct pt_regs *regs, int idx)
+{
+ struct x86_perf_regs *perf_regs;
+
+ if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
+ perf_regs = container_of(regs, struct x86_perf_regs, regs);
+ if (!perf_regs->xmm_regs)
+ return 0;
+ return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
+ }
+
+ if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
+ return 0;
+
+ return regs_get_register(regs, pt_regs_offset[idx]);
+}
+
+#define PERF_REG_X86_RESERVED (((1ULL << PERF_REG_X86_XMM0) - 1) & \
+ ~((1ULL << PERF_REG_X86_MAX) - 1))
+
+#ifdef CONFIG_X86_32
+#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_R8) | \
+ (1ULL << PERF_REG_X86_R9) | \
+ (1ULL << PERF_REG_X86_R10) | \
+ (1ULL << PERF_REG_X86_R11) | \
+ (1ULL << PERF_REG_X86_R12) | \
+ (1ULL << PERF_REG_X86_R13) | \
+ (1ULL << PERF_REG_X86_R14) | \
+ (1ULL << PERF_REG_X86_R15))
+
+int perf_reg_validate(u64 mask)
+{
+ if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+ return -EINVAL;
+
+ return 0;
+}
+
+u64 perf_reg_abi(struct task_struct *task)
+{
+ return PERF_SAMPLE_REGS_ABI_32;
+}
+
+void perf_get_regs_user(struct perf_regs *regs_user,
+ struct pt_regs *regs)
+{
+ regs_user->regs = task_pt_regs(current);
+ regs_user->abi = perf_reg_abi(current);
+}
+#else /* CONFIG_X86_64 */
+#define REG_NOSUPPORT ((1ULL << PERF_REG_X86_DS) | \
+ (1ULL << PERF_REG_X86_ES) | \
+ (1ULL << PERF_REG_X86_FS) | \
+ (1ULL << PERF_REG_X86_GS))
+
+int perf_reg_validate(u64 mask)
+{
+ if (!mask || (mask & (REG_NOSUPPORT | PERF_REG_X86_RESERVED)))
+ return -EINVAL;
+
+ return 0;
+}
+
+u64 perf_reg_abi(struct task_struct *task)
+{
+ if (!user_64bit_mode(task_pt_regs(task)))
+ return PERF_SAMPLE_REGS_ABI_32;
+ else
+ return PERF_SAMPLE_REGS_ABI_64;
+}
+
+static DEFINE_PER_CPU(struct pt_regs, nmi_user_regs);
+
+void perf_get_regs_user(struct perf_regs *regs_user,
+ struct pt_regs *regs)
+{
+ struct pt_regs *regs_user_copy = this_cpu_ptr(&nmi_user_regs);
+ struct pt_regs *user_regs = task_pt_regs(current);
+
+ if (!in_nmi()) {
+ regs_user->regs = user_regs;
+ regs_user->abi = perf_reg_abi(current);
+ return;
+ }
+
+ /*
+ * If we're in an NMI that interrupted task_pt_regs setup, then
+ * we can't sample user regs at all. This check isn't really
+ * sufficient, though, as we could be in an NMI inside an interrupt
+ * that happened during task_pt_regs setup.
+ */
+ if (regs->sp > (unsigned long)&user_regs->r11 &&
+ regs->sp <= (unsigned long)(user_regs + 1)) {
+ regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
+ regs_user->regs = NULL;
+ return;
+ }
+
+ /*
+ * These registers are always saved on 64-bit syscall entry.
+ * On 32-bit entry points, they are saved too except r8..r11.
+ */
+ regs_user_copy->ip = user_regs->ip;
+ regs_user_copy->ax = user_regs->ax;
+ regs_user_copy->cx = user_regs->cx;
+ regs_user_copy->dx = user_regs->dx;
+ regs_user_copy->si = user_regs->si;
+ regs_user_copy->di = user_regs->di;
+ regs_user_copy->r8 = user_regs->r8;
+ regs_user_copy->r9 = user_regs->r9;
+ regs_user_copy->r10 = user_regs->r10;
+ regs_user_copy->r11 = user_regs->r11;
+ regs_user_copy->orig_ax = user_regs->orig_ax;
+ regs_user_copy->flags = user_regs->flags;
+ regs_user_copy->sp = user_regs->sp;
+ regs_user_copy->cs = user_regs->cs;
+ regs_user_copy->ss = user_regs->ss;
+ /*
+ * Store user space frame-pointer value on sample
+ * to facilitate stack unwinding for cases when
+ * user space executable code has such support
+ * enabled at compile time:
+ */
+ regs_user_copy->bp = user_regs->bp;
+
+ regs_user_copy->bx = -1;
+ regs_user_copy->r12 = -1;
+ regs_user_copy->r13 = -1;
+ regs_user_copy->r14 = -1;
+ regs_user_copy->r15 = -1;
+ /*
+ * For this to be at all useful, we need a reasonable guess for
+ * the ABI. Be careful: we're in NMI context, and we're
+ * considering current to be the current task, so we should
+ * be careful not to look at any other percpu variables that might
+ * change during context switches.
+ */
+ regs_user->abi = user_64bit_mode(user_regs) ?
+ PERF_SAMPLE_REGS_ABI_64 : PERF_SAMPLE_REGS_ABI_32;
+
+ regs_user->regs = regs_user_copy;
+}
+#endif /* CONFIG_X86_32 */
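
For a concrete feel of the mask arithmetic in perf_reg_validate() above: with the 64-bit enum layout (AX=0 through R15=23, XMM0=32), the RESERVED expression carves out bits 24-31, and the segment registers land in REG_NOSUPPORT. A standalone check follows; the values mirror the uapi perf_regs layout but should be treated as assumptions of this sketch.

#include <stdio.h>
#include <stdint.h>

#define PERF_REG_X86_GS		15
#define PERF_REG_X86_64_MAX	24	/* R15 is bit 23 */
#define PERF_REG_X86_XMM0	32

#define RESERVED	(((1ULL << PERF_REG_X86_XMM0) - 1) & \
			 ~((1ULL << PERF_REG_X86_64_MAX) - 1))
#define NOSUPPORT	(0xfULL << 12)	/* DS, ES, FS, GS: bits 12-15 */

static int validate(uint64_t mask)
{
	if (!mask || (mask & (NOSUPPORT | RESERVED)))
		return -22;	/* -EINVAL */
	return 0;
}

int main(void)
{
	printf("reserved=%#llx\n", (unsigned long long)RESERVED); /* 0xff000000 */
	printf("%d\n", validate(1ULL << 0));		   /* 0: sampling AX is fine */
	printf("%d\n", validate(1ULL << PERF_REG_X86_GS)); /* -22: no GS on 64-bit */
	return 0;
}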
diff --git a/arch/x86/kernel/platform-quirks.c b/arch/x86/kernel/platform-quirks.c
new file mode 100644
index 000000000000..b525fe6d6657
--- /dev/null
+++ b/arch/x86/kernel/platform-quirks.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/pnp.h>
+
+#include <asm/setup.h>
+#include <asm/bios_ebda.h>
+
+void __init x86_early_init_platform_quirks(void)
+{
+ x86_platform.legacy.i8042 = X86_LEGACY_I8042_EXPECTED_PRESENT;
+ x86_platform.legacy.rtc = 1;
+ x86_platform.legacy.warm_reset = 1;
+ x86_platform.legacy.reserve_bios_regions = 0;
+ x86_platform.legacy.devices.pnpbios = 1;
+
+ switch (boot_params.hdr.hardware_subarch) {
+ case X86_SUBARCH_PC:
+ x86_platform.legacy.reserve_bios_regions = 1;
+ break;
+ case X86_SUBARCH_XEN:
+ x86_platform.legacy.devices.pnpbios = 0;
+ x86_platform.legacy.rtc = 0;
+ break;
+ case X86_SUBARCH_INTEL_MID:
+ case X86_SUBARCH_CE4100:
+ x86_platform.legacy.devices.pnpbios = 0;
+ x86_platform.legacy.rtc = 0;
+ x86_platform.legacy.i8042 = X86_LEGACY_I8042_PLATFORM_ABSENT;
+ break;
+ }
+
+ if (x86_platform.set_legacy_features)
+ x86_platform.set_legacy_features();
+}
+
+bool __init x86_pnpbios_disabled(void)
+{
+ return x86_platform.legacy.devices.pnpbios == 0;
+}
+
+#if defined(CONFIG_PNPBIOS)
+bool __init arch_pnpbios_disabled(void)
+{
+ return x86_pnpbios_disabled();
+}
+#endif
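
x86_early_init_platform_quirks() above follows a defaults-then-override shape: assume full legacy PC hardware, then strip features per subarchitecture. A toy version of that flow; the enum, struct, and field names below are invented for the sketch.

#include <stdio.h>

enum subarch { SUBARCH_PC, SUBARCH_XEN, SUBARCH_INTEL_MID };

struct legacy { int rtc, pnpbios, reserve_bios_regions; };

static struct legacy pick_quirks(enum subarch sa)
{
	/* defaults first: assume full legacy PC hardware */
	struct legacy l = { .rtc = 1, .pnpbios = 1, .reserve_bios_regions = 0 };

	switch (sa) {	/* then adjust per subarchitecture */
	case SUBARCH_PC:
		l.reserve_bios_regions = 1;
		break;
	case SUBARCH_XEN:
	case SUBARCH_INTEL_MID:
		l.rtc = 0;
		l.pnpbios = 0;
		break;
	}
	return l;
}

int main(void)
{
	struct legacy l = pick_quirks(SUBARCH_XEN);

	printf("rtc=%d pnpbios=%d\n", l.rtc, l.pnpbios); /* rtc=0 pnpbios=0 */
	return 0;
}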
diff --git a/arch/x86/kernel/pmem.c b/arch/x86/kernel/pmem.c
new file mode 100644
index 000000000000..23154d24b117
--- /dev/null
+++ b/arch/x86/kernel/pmem.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2015, Christoph Hellwig.
+ * Copyright (c) 2015, Intel Corporation.
+ */
+#include <linux/platform_device.h>
+#include <linux/init.h>
+#include <linux/ioport.h>
+
+static int found(struct resource *res, void *data)
+{
+ return 1;
+}
+
+static __init int register_e820_pmem(void)
+{
+ struct platform_device *pdev;
+ int rc;
+
+ rc = walk_iomem_res_desc(IORES_DESC_PERSISTENT_MEMORY_LEGACY,
+ IORESOURCE_MEM, 0, -1, NULL, found);
+ if (rc <= 0)
+ return 0;
+
+ /*
+ * See drivers/nvdimm/e820.c for the implementation; this is
+ * simply here to trigger the module to load on demand.
+ */
+ pdev = platform_device_alloc("e820_pmem", -1);
+
+ rc = platform_device_add(pdev);
+ if (rc)
+ platform_device_put(pdev);
+
+ return rc;
+}
+device_initcall(register_e820_pmem);
diff --git a/arch/x86/kernel/pmtimer_64.c b/arch/x86/kernel/pmtimer_64.c
deleted file mode 100644
index b112406f1996..000000000000
--- a/arch/x86/kernel/pmtimer_64.c
+++ /dev/null
@@ -1,69 +0,0 @@
-/* Ported over from i386 by AK, original copyright was:
- *
- * (C) Dominik Brodowski <linux@brodo.de> 2003
- *
- * Driver to use the Power Management Timer (PMTMR) available in some
- * southbridges as primary timing source for the Linux kernel.
- *
- * Based on parts of linux/drivers/acpi/hardware/hwtimer.c, timer_pit.c,
- * timer_hpet.c, and on Arjan van de Ven's implementation for 2.4.
- *
- * This file is licensed under the GPL v2.
- *
- * Dropped all the hardware bug workarounds for now. Hopefully they
- * are not needed on 64bit chipsets.
- */
-
-#include <linux/jiffies.h>
-#include <linux/kernel.h>
-#include <linux/time.h>
-#include <linux/init.h>
-#include <linux/cpumask.h>
-#include <linux/acpi_pmtmr.h>
-
-#include <asm/io.h>
-#include <asm/proto.h>
-#include <asm/msr.h>
-#include <asm/vsyscall.h>
-
-static inline u32 cyc2us(u32 cycles)
-{
- /* The Power Management Timer ticks at 3.579545 ticks per microsecond.
- * 1 / PM_TIMER_FREQUENCY == 0.27936511 =~ 286/1024 [error: 0.024%]
- *
- * Even with HZ = 100, delta is at maximum 35796 ticks, so it can
- * easily be multiplied with 286 (=0x11E) without having to fear
- * u32 overflows.
- */
- cycles *= 286;
- return (cycles >> 10);
-}
-
-static unsigned pmtimer_wait_tick(void)
-{
- u32 a, b;
- for (a = b = inl(pmtmr_ioport) & ACPI_PM_MASK;
- a == b;
- b = inl(pmtmr_ioport) & ACPI_PM_MASK)
- cpu_relax();
- return b;
-}
-
-/* note: wait time is rounded up to one tick */
-void pmtimer_wait(unsigned us)
-{
- u32 a, b;
- a = pmtimer_wait_tick();
- do {
- b = inl(pmtmr_ioport);
- cpu_relax();
- } while (cyc2us(b - a) < us);
-}
-
-static int __init nopmtimer_setup(char *s)
-{
- pmtmr_ioport = 0;
- return 1;
-}
-
-__setup("nopmtimer", nopmtimer_setup);
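
The deleted cyc2us() relied on a classic fixed-point trick: instead of dividing by the 3.579545 MHz PM timer frequency, multiply by 286 and shift right by 10, which stays within roughly 0.024% of the exact result. A standalone comparison:

#include <stdio.h>
#include <stdint.h>

#define PMTMR_HZ	3579545.0	/* ACPI PM timer frequency */

static uint32_t cyc2us(uint32_t cycles)
{
	return (cycles * 286) >> 10;	/* ~0.024% low, no divide needed */
}

int main(void)
{
	uint32_t ticks = 35796;		/* about 10 ms worth of PM timer ticks */

	printf("approx: %u us\n", cyc2us(ticks));	/* 9997 */
	printf("exact : %.1f us\n", ticks * 1e6 / PMTMR_HZ); /* 10000.2 */
	return 0;
}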
diff --git a/arch/x86/kernel/probe_roms_32.c b/arch/x86/kernel/probe_roms.c
index 071e7fea42e5..cc2c34ba7228 100644
--- a/arch/x86/kernel/probe_roms_32.c
+++ b/arch/x86/kernel/probe_roms.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/uaccess.h>
@@ -10,15 +11,17 @@
#include <linux/dmi.h>
#include <linux/pfn.h>
#include <linux/pci.h>
-#include <asm/pci-direct.h>
-
+#include <linux/export.h>
-#include <asm/e820.h>
+#include <asm/probe_roms.h>
+#include <asm/pci-direct.h>
+#include <asm/e820/api.h>
#include <asm/mmzone.h>
#include <asm/setup.h>
#include <asm/sections.h>
#include <asm/io.h>
#include <asm/setup_arch.h>
+#include <asm/sev.h>
static struct resource system_rom_resource = {
.name = "System ROM",
@@ -73,6 +76,107 @@ static struct resource video_rom_resource = {
.flags = IORESOURCE_BUSY | IORESOURCE_READONLY | IORESOURCE_MEM
};
+/* Does this oprom support the given PCI device, or any of the devices
+ * that the driver supports?
+ */
+static bool match_id(struct pci_dev *pdev, unsigned short vendor, unsigned short device)
+{
+ struct pci_driver *drv = to_pci_driver(pdev->dev.driver);
+ const struct pci_device_id *id;
+
+ if (pdev->vendor == vendor && pdev->device == device)
+ return true;
+
+ for (id = drv ? drv->id_table : NULL; id && id->vendor; id++)
+ if (id->vendor == vendor && id->device == device)
+ break;
+
+ return id && id->vendor;
+}
+
+static bool probe_list(struct pci_dev *pdev, unsigned short vendor,
+ const void *rom_list)
+{
+ unsigned short device;
+
+ do {
+ if (get_kernel_nofault(device, rom_list) != 0)
+ device = 0;
+
+ if (device && match_id(pdev, vendor, device))
+ break;
+
+ rom_list += 2;
+ } while (device);
+
+ return !!device;
+}
+
+static struct resource *find_oprom(struct pci_dev *pdev)
+{
+ struct resource *oprom = NULL;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(adapter_rom_resources); i++) {
+ struct resource *res = &adapter_rom_resources[i];
+ unsigned short offset, vendor, device, list, rev;
+ const void *rom;
+
+ if (res->end == 0)
+ break;
+
+ rom = isa_bus_to_virt(res->start);
+ if (get_kernel_nofault(offset, rom + 0x18) != 0)
+ continue;
+
+ if (get_kernel_nofault(vendor, rom + offset + 0x4) != 0)
+ continue;
+
+ if (get_kernel_nofault(device, rom + offset + 0x6) != 0)
+ continue;
+
+ if (match_id(pdev, vendor, device)) {
+ oprom = res;
+ break;
+ }
+
+ if (get_kernel_nofault(list, rom + offset + 0x8) == 0 &&
+ get_kernel_nofault(rev, rom + offset + 0xc) == 0 &&
+ rev >= 3 && list &&
+ probe_list(pdev, vendor, rom + offset + list)) {
+ oprom = res;
+ break;
+ }
+ }
+
+ return oprom;
+}
+
+void __iomem *pci_map_biosrom(struct pci_dev *pdev)
+{
+ struct resource *oprom = find_oprom(pdev);
+
+ if (!oprom)
+ return NULL;
+
+ return ioremap(oprom->start, resource_size(oprom));
+}
+EXPORT_SYMBOL(pci_map_biosrom);
+
+void pci_unmap_biosrom(void __iomem *image)
+{
+ iounmap(image);
+}
+EXPORT_SYMBOL(pci_unmap_biosrom);
+
+size_t pci_biosrom_size(struct pci_dev *pdev)
+{
+ struct resource *oprom = find_oprom(pdev);
+
+ return oprom ? resource_size(oprom) : 0;
+}
+EXPORT_SYMBOL(pci_biosrom_size);
+
#define ROMSIGNATURE 0xaa55
static int __init romsignature(const unsigned char *rom)
@@ -80,22 +184,22 @@ static int __init romsignature(const unsigned char *rom)
const unsigned short * const ptr = (const unsigned short *)rom;
unsigned short sig;
- return probe_kernel_address(ptr, sig) == 0 && sig == ROMSIGNATURE;
+ return get_kernel_nofault(sig, ptr) == 0 && sig == ROMSIGNATURE;
}
static int __init romchecksum(const unsigned char *rom, unsigned long length)
{
unsigned char sum, c;
- for (sum = 0; length && probe_kernel_address(rom++, c) == 0; length--)
+ for (sum = 0; length && get_kernel_nofault(c, rom++) == 0; length--)
sum += c;
return !length && !sum;
}
void __init probe_roms(void)
{
- const unsigned char *rom;
unsigned long start, length, upper;
+ const unsigned char *rom;
unsigned char c;
int i;
@@ -108,7 +212,7 @@ void __init probe_roms(void)
video_rom_resource.start = start;
- if (probe_kernel_address(rom + 2, c) != 0)
+ if (get_kernel_nofault(c, rom + 2) != 0)
continue;
/* 0 < length <= 0x7f * 512, historically */
@@ -133,7 +237,7 @@ void __init probe_roms(void)
/* check for extension rom (ignore length byte!) */
rom = isa_bus_to_virt(extension_rom_resource.start);
if (romsignature(rom)) {
- length = extension_rom_resource.end - extension_rom_resource.start + 1;
+ length = resource_size(&extension_rom_resource);
if (romchecksum(rom, length)) {
request_resource(&iomem_resource, &extension_rom_resource);
upper = extension_rom_resource.start;
@@ -146,7 +250,7 @@ void __init probe_roms(void)
if (!romsignature(rom))
continue;
- if (probe_kernel_address(rom + 2, c) != 0)
+ if (get_kernel_nofault(c, rom + 2) != 0)
continue;
/* 0 < length <= 0x7f * 512, historically */
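
The romsignature()/romchecksum() pair in this file implements the legacy option-ROM validity rules: the image starts with 0xAA55 and all of its bytes sum to zero modulo 256. A self-contained version for a little-endian host; the toy ROM contents are made up.

#include <stdio.h>
#include <string.h>

#define ROMSIGNATURE 0xaa55

static int romsignature(const unsigned char *rom)
{
	unsigned short sig;

	memcpy(&sig, rom, sizeof(sig));	/* stands in for get_kernel_nofault() */
	return sig == ROMSIGNATURE;	/* little-endian: 0x55 0xaa in memory */
}

static int romchecksum(const unsigned char *rom, unsigned long length)
{
	unsigned char sum = 0;

	while (length--)
		sum += *rom++;
	return sum == 0;
}

int main(void)
{
	/* 4-byte toy "ROM": signature, payload, and a checksum byte */
	unsigned char rom[4] = { 0x55, 0xaa, 0x03, 0 };

	rom[3] = 0x100 - (0x55 + 0xaa + 0x03);	/* 0xfe zeroes the sum mod 256 */
	printf("sig=%d sum_ok=%d\n", romsignature(rom), romchecksum(rom, 4));
	return 0;
}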
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index e7e35219b32f..4c718f8adc59 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -1,116 +1,284 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/smp.h>
+#include <linux/cpu.h>
#include <linux/prctl.h>
#include <linux/slab.h>
#include <linux/sched.h>
-#include <linux/module.h>
+#include <linux/sched/idle.h>
+#include <linux/sched/debug.h>
+#include <linux/sched/task.h>
+#include <linux/sched/task_stack.h>
+#include <linux/init.h>
+#include <linux/export.h>
#include <linux/pm.h>
-#include <linux/clockchips.h>
+#include <linux/tick.h>
#include <linux/random.h>
#include <linux/user-return-notifier.h>
#include <linux/dmi.h>
#include <linux/utsname.h>
+#include <linux/stackprotector.h>
+#include <linux/cpuidle.h>
+#include <linux/acpi.h>
+#include <linux/elf-randomize.h>
+#include <linux/static_call.h>
#include <trace/events/power.h>
#include <linux/hw_breakpoint.h>
-#include <asm/system.h>
+#include <linux/entry-common.h>
+#include <asm/cpu.h>
+#include <asm/cpuid/api.h>
#include <asm/apic.h>
-#include <asm/syscalls.h>
-#include <asm/idle.h>
-#include <asm/uaccess.h>
-#include <asm/i387.h>
+#include <linux/uaccess.h>
+#include <asm/mwait.h>
+#include <asm/fpu/api.h>
+#include <asm/fpu/sched.h>
+#include <asm/fpu/xstate.h>
#include <asm/debugreg.h>
+#include <asm/nmi.h>
+#include <asm/tlbflush.h>
+#include <asm/mce.h>
+#include <asm/vm86.h>
+#include <asm/switch_to.h>
+#include <asm/desc.h>
+#include <asm/prctl.h>
+#include <asm/spec-ctrl.h>
+#include <asm/io_bitmap.h>
+#include <asm/proto.h>
+#include <asm/frame.h>
+#include <asm/unwind.h>
+#include <asm/tdx.h>
+#include <asm/mmu_context.h>
+#include <asm/msr.h>
+#include <asm/shstk.h>
+
+#include "process.h"
-unsigned long idle_halt;
-EXPORT_SYMBOL(idle_halt);
-unsigned long idle_nomwait;
-EXPORT_SYMBOL(idle_nomwait);
+/*
+ * per-CPU TSS segments. Threads are completely 'soft' on Linux,
+ * no more per-task TSS's. The TSS size is kept cacheline-aligned
+ * so they are allowed to end up in the .data..cacheline_aligned
+ * section. Since TSS's are completely CPU-local, we want them
+ * on exact cacheline boundaries, to eliminate cacheline ping-pong.
+ */
+__visible DEFINE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw) = {
+ .x86_tss = {
+ /*
+ * .sp0 is only used when entering ring 0 from a lower
+ * privilege level. Since the init task never runs anything
+ * but ring 0 code, there is no need for a valid value here.
+ * Poison it.
+ */
+ .sp0 = (1UL << (BITS_PER_LONG-1)) + 1,
-struct kmem_cache *task_xstate_cachep;
+#ifdef CONFIG_X86_32
+ .sp1 = TOP_OF_INIT_STACK,
+ .ss0 = __KERNEL_DS,
+ .ss1 = __KERNEL_CS,
+#endif
+ .io_bitmap_base = IO_BITMAP_OFFSET_INVALID,
+ },
+};
+EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
+
+DEFINE_PER_CPU(bool, __tss_limit_invalid);
+EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
+
+/*
+ * The cache may be in an incoherent state and needs flushing during kexec.
+ * E.g., on SME/TDX platforms, dirty cacheline aliases with and without
+ * encryption bit(s) can coexist and the cache needs to be flushed before
+ * booting into the new kernel, to avoid silent memory corruption from
+ * dirty cachelines with different encryption properties being written
+ * back to memory.
+ */
+DEFINE_PER_CPU(bool, cache_state_incoherent);
+
+/*
+ * this gets called so that we can store lazy state into memory and copy the
+ * current task into the new thread.
+ */
int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
{
- int ret;
+ /* fpu_clone() will initialize the "dst_fpu" memory */
+ memcpy_and_pad(dst, arch_task_struct_size, src, sizeof(*dst), 0);
+
+#ifdef CONFIG_VM86
+ dst->thread.vm86 = NULL;
+#endif
- *dst = *src;
- if (fpu_allocated(&src->thread.fpu)) {
- memset(&dst->thread.fpu, 0, sizeof(dst->thread.fpu));
- ret = fpu_alloc(&dst->thread.fpu);
- if (ret)
- return ret;
- fpu_copy(&dst->thread.fpu, &src->thread.fpu);
- }
return 0;
}
-void free_thread_xstate(struct task_struct *tsk)
+#ifdef CONFIG_X86_64
+void arch_release_task_struct(struct task_struct *tsk)
{
- fpu_free(&tsk->thread.fpu);
+ if (fpu_state_size_dynamic() && !(tsk->flags & (PF_KTHREAD | PF_USER_WORKER)))
+ fpstate_free(x86_task_fpu(tsk));
}
+#endif
-void free_thread_info(struct thread_info *ti)
+/*
+ * Free thread data structures etc..
+ */
+void exit_thread(struct task_struct *tsk)
{
- free_thread_xstate(ti->task);
- free_pages((unsigned long)ti, get_order(THREAD_SIZE));
+ struct thread_struct *t = &tsk->thread;
+
+ if (test_thread_flag(TIF_IO_BITMAP))
+ io_bitmap_exit(tsk);
+
+ free_vm86(t);
+
+ shstk_free(tsk);
+ fpu__drop(tsk);
}
-void arch_task_cache_init(void)
+static int set_new_tls(struct task_struct *p, unsigned long tls)
{
- task_xstate_cachep =
- kmem_cache_create("task_xstate", xstate_size,
- __alignof__(union thread_xstate),
- SLAB_PANIC | SLAB_NOTRACK, NULL);
+ struct user_desc __user *utls = (struct user_desc __user *)tls;
+
+ if (in_ia32_syscall())
+ return do_set_thread_area(p, -1, utls, 0);
+ else
+ return do_set_thread_area_64(p, ARCH_SET_FS, tls);
}
-/*
- * Free current thread data structures etc..
- */
-void exit_thread(void)
+__visible void ret_from_fork(struct task_struct *prev, struct pt_regs *regs,
+ int (*fn)(void *), void *fn_arg)
{
- struct task_struct *me = current;
- struct thread_struct *t = &me->thread;
- unsigned long *bp = t->io_bitmap_ptr;
-
- if (bp) {
- struct tss_struct *tss = &per_cpu(init_tss, get_cpu());
+ schedule_tail(prev);
- t->io_bitmap_ptr = NULL;
- clear_thread_flag(TIF_IO_BITMAP);
+ /* Is this a kernel thread? */
+ if (unlikely(fn)) {
+ fn(fn_arg);
/*
- * Careful, clear this in the TSS too:
+ * A kernel thread is allowed to return here after successfully
+ * calling kernel_execve(). Exit to userspace to complete the
+ * execve() syscall.
*/
- memset(tss->io_bitmap, 0xff, t->io_bitmap_max);
- t->io_bitmap_max = 0;
- put_cpu();
- kfree(bp);
+ regs->ax = 0;
}
-}
-void show_regs(struct pt_regs *regs)
-{
- show_registers(regs);
- show_trace(NULL, regs, (unsigned long *)kernel_stack_pointer(regs),
- regs->bp);
+ syscall_exit_to_user_mode(regs);
}
-void show_regs_common(void)
+int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
{
- const char *board, *product;
+ u64 clone_flags = args->flags;
+ unsigned long sp = args->stack;
+ unsigned long tls = args->tls;
+ struct inactive_task_frame *frame;
+ struct fork_frame *fork_frame;
+ struct pt_regs *childregs;
+ unsigned long new_ssp;
+ int ret = 0;
+
+ childregs = task_pt_regs(p);
+ fork_frame = container_of(childregs, struct fork_frame, regs);
+ frame = &fork_frame->frame;
+
+ frame->bp = encode_frame_pointer(childregs);
+ frame->ret_addr = (unsigned long) ret_from_fork_asm;
+ p->thread.sp = (unsigned long) fork_frame;
+ p->thread.io_bitmap = NULL;
+ clear_tsk_thread_flag(p, TIF_IO_BITMAP);
+ p->thread.iopl_warn = 0;
+ memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
+
+#ifdef CONFIG_X86_64
+ current_save_fsgs();
+ p->thread.fsindex = current->thread.fsindex;
+ p->thread.fsbase = current->thread.fsbase;
+ p->thread.gsindex = current->thread.gsindex;
+ p->thread.gsbase = current->thread.gsbase;
+
+ savesegment(es, p->thread.es);
+ savesegment(ds, p->thread.ds);
+
+ if (p->mm && (clone_flags & (CLONE_VM | CLONE_VFORK)) == CLONE_VM)
+ set_bit(MM_CONTEXT_LOCK_LAM, &p->mm->context.flags);
+#else
+ p->thread.sp0 = (unsigned long) (childregs + 1);
+ savesegment(gs, p->thread.gs);
+ /*
+ * Clear all status flags including IF and set fixed bit. 64bit
+ * does not have this initialization as the frame does not contain
+ * flags. The flags consistency (especially vs. AC) is there
+ * ensured via objtool, which lacks 32bit support.
+ */
+ frame->flags = X86_EFLAGS_FIXED;
+#endif
+
+ /*
+ * Allocate a new shadow stack for thread if needed. If shadow stack,
+ * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
+ * update it.
+ */
+ new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+ if (IS_ERR_VALUE(new_ssp))
+ return PTR_ERR((void *)new_ssp);
+
+ fpu_clone(p, clone_flags, args->fn, new_ssp);
+
+ /* Kernel thread ? */
+ if (unlikely(p->flags & PF_KTHREAD)) {
+ p->thread.pkru = pkru_get_init_value();
+ memset(childregs, 0, sizeof(struct pt_regs));
+ kthread_frame_init(frame, args->fn, args->fn_arg);
+ return 0;
+ }
- board = dmi_get_system_info(DMI_BOARD_NAME);
- if (!board)
- board = "";
- product = dmi_get_system_info(DMI_PRODUCT_NAME);
- if (!product)
- product = "";
+ /*
+ * Clone current's PKRU value from hardware. tsk->thread.pkru
+ * is only valid when scheduled out.
+ */
+ p->thread.pkru = read_pkru();
+
+ frame->bx = 0;
+ *childregs = *current_pt_regs();
+ childregs->ax = 0;
+ if (sp)
+ childregs->sp = sp;
- printk(KERN_CONT "\n");
- printk(KERN_DEFAULT "Pid: %d, comm: %.20s %s %s %.*s %s/%s\n",
- current->pid, current->comm, print_tainted(),
- init_utsname()->release,
- (int)strcspn(init_utsname()->version, " "),
- init_utsname()->version, board, product);
+ if (unlikely(args->fn)) {
+ /*
+ * A user space thread, but it doesn't return to
+ * ret_after_fork().
+ *
+ * In order to indicate that to tools like gdb,
+ * we reset the stack and instruction pointers.
+ *
+ * It does the same kernel frame setup to return to a kernel
+ * function that a kernel thread does.
+ */
+ childregs->sp = 0;
+ childregs->ip = 0;
+ kthread_frame_init(frame, args->fn, args->fn_arg);
+ return 0;
+ }
+
+ /* Set a new TLS for the child thread? */
+ if (clone_flags & CLONE_SETTLS)
+ ret = set_new_tls(p, tls);
+
+ if (!ret && unlikely(test_tsk_thread_flag(current, TIF_IO_BITMAP)))
+ io_bitmap_share(p);
+
+ return ret;
+}
+
+static void pkru_flush_thread(void)
+{
+ /*
+ * If PKRU is enabled the default PKRU value has to be loaded into
+ * the hardware right here (similar to context switch).
+ */
+ pkru_write_default();
}
void flush_thread(void)
@@ -119,17 +287,9 @@ void flush_thread(void)
flush_ptrace_hw_breakpoint(tsk);
memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array));
- /*
- * Forget coprocessor state..
- */
- tsk->fpu_counter = 0;
- clear_fpu(tsk);
- clear_used_math();
-}
-static void hard_disable_TSC(void)
-{
- write_cr4(read_cr4() | X86_CR4_TSD);
+ fpu_flush_thread();
+ pkru_flush_thread();
}
void disable_TSC(void)
@@ -140,15 +300,10 @@ void disable_TSC(void)
* Must flip the CPU state synchronously with
* TIF_NOTSC in the current running context.
*/
- hard_disable_TSC();
+ cr4_set_bits(X86_CR4_TSD);
preempt_enable();
}
-static void hard_enable_TSC(void)
-{
- write_cr4(read_cr4() & ~X86_CR4_TSD);
-}
-
static void enable_TSC(void)
{
preempt_disable();
@@ -157,7 +312,7 @@ static void enable_TSC(void)
* Must flip the CPU state synchronously with
* TIF_NOTSC in the current running context.
*/
- hard_enable_TSC();
+ cr4_clear_bits(X86_CR4_TSD);
preempt_enable();
}
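
The speculation-control code in the hunk below builds a single SPEC_CTRL MSR image out of per-task TIF bits by shifting each flag from its TIF position to its MSR position (ssbd_tif_to_spec_ctrl() and friends). A toy illustration of that shift; the bit positions here are stand-ins, not the kernel's definitions.

#include <stdio.h>
#include <stdint.h>

#define TIF_SSBD		5		/* stand-in task-flag bit */
#define _TIF_SSBD		(1UL << TIF_SSBD)
#define SPEC_CTRL_SSBD_SHIFT	2		/* stand-in MSR bit for SSBD */

static uint64_t ssbd_tif_to_spec_ctrl(uint64_t tifn)
{
	/* move bit 5 of the flags word down to bit 2 of the MSR image */
	return (tifn & _TIF_SSBD) >> (TIF_SSBD - SPEC_CTRL_SSBD_SHIFT);
}

int main(void)
{
	uint64_t base = 0;	/* x86_spec_ctrl_base stand-in */
	uint64_t tifn = _TIF_SSBD;

	printf("msr=%#llx\n",
	       (unsigned long long)(base | ssbd_tif_to_spec_ctrl(tifn))); /* 0x4 */
	return 0;
}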
@@ -185,471 +340,657 @@ int set_tsc_mode(unsigned int val)
return 0;
}
-void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
- struct tss_struct *tss)
-{
- struct thread_struct *prev, *next;
-
- prev = &prev_p->thread;
- next = &next_p->thread;
-
- if (test_tsk_thread_flag(prev_p, TIF_BLOCKSTEP) ^
- test_tsk_thread_flag(next_p, TIF_BLOCKSTEP)) {
- unsigned long debugctl = get_debugctlmsr();
-
- debugctl &= ~DEBUGCTLMSR_BTF;
- if (test_tsk_thread_flag(next_p, TIF_BLOCKSTEP))
- debugctl |= DEBUGCTLMSR_BTF;
+DEFINE_PER_CPU(u64, msr_misc_features_shadow);
- update_debugctlmsr(debugctl);
- }
+static void set_cpuid_faulting(bool on)
+{
- if (test_tsk_thread_flag(prev_p, TIF_NOTSC) ^
- test_tsk_thread_flag(next_p, TIF_NOTSC)) {
- /* prev and next are different */
- if (test_tsk_thread_flag(next_p, TIF_NOTSC))
- hard_disable_TSC();
+ if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
+ u64 msrval;
+
+ msrval = this_cpu_read(msr_misc_features_shadow);
+ msrval &= ~MSR_MISC_FEATURES_ENABLES_CPUID_FAULT;
+ msrval |= (on << MSR_MISC_FEATURES_ENABLES_CPUID_FAULT_BIT);
+ this_cpu_write(msr_misc_features_shadow, msrval);
+ wrmsrq(MSR_MISC_FEATURES_ENABLES, msrval);
+ } else if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
+ if (on)
+ msr_set_bit(MSR_K7_HWCR, MSR_K7_HWCR_CPUID_USER_DIS_BIT);
else
- hard_enable_TSC();
+ msr_clear_bit(MSR_K7_HWCR, MSR_K7_HWCR_CPUID_USER_DIS_BIT);
}
+}
- if (test_tsk_thread_flag(next_p, TIF_IO_BITMAP)) {
+static void disable_cpuid(void)
+{
+ preempt_disable();
+ if (!test_and_set_thread_flag(TIF_NOCPUID)) {
/*
- * Copy the relevant range of the IO bitmap.
- * Normally this is 128 bytes or less:
+ * Must flip the CPU state synchronously with
+ * TIF_NOCPUID in the current running context.
*/
- memcpy(tss->io_bitmap, next->io_bitmap_ptr,
- max(prev->io_bitmap_max, next->io_bitmap_max));
- } else if (test_tsk_thread_flag(prev_p, TIF_IO_BITMAP)) {
+ set_cpuid_faulting(true);
+ }
+ preempt_enable();
+}
+
+static void enable_cpuid(void)
+{
+ preempt_disable();
+ if (test_and_clear_thread_flag(TIF_NOCPUID)) {
/*
- * Clear any possible leftover bits:
+ * Must flip the CPU state synchronously with
+ * TIF_NOCPUID in the current running context.
*/
- memset(tss->io_bitmap, 0xff, prev->io_bitmap_max);
+ set_cpuid_faulting(false);
}
- propagate_user_return_notify(prev_p, next_p);
+ preempt_enable();
}
-int sys_fork(struct pt_regs *regs)
+static int get_cpuid_mode(void)
{
- return do_fork(SIGCHLD, regs->sp, regs, 0, NULL, NULL);
+ return !test_thread_flag(TIF_NOCPUID);
+}
+
+static int set_cpuid_mode(unsigned long cpuid_enabled)
+{
+ if (!boot_cpu_has(X86_FEATURE_CPUID_FAULT))
+ return -ENODEV;
+
+ if (cpuid_enabled)
+ enable_cpuid();
+ else
+ disable_cpuid();
+
+ return 0;
}
/*
- * This is trivial, and on the face of it looks like it
- * could equally well be done in user mode.
- *
- * Not so, for quite unobvious reasons - register pressure.
- * In user mode vfork() cannot have a stack frame, and if
- * done by calling the "clone()" system call directly, you
- * do not have enough call-clobbered registers to hold all
- * the information you need.
+ * Called immediately after a successful exec.
*/
-int sys_vfork(struct pt_regs *regs)
+void arch_setup_new_exec(void)
{
- return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs->sp, regs, 0,
- NULL, NULL);
+ /* If cpuid was previously disabled for this task, re-enable it. */
+ if (test_thread_flag(TIF_NOCPUID))
+ enable_cpuid();
+
+ /*
+ * Don't inherit TIF_SSBD across exec boundary when
+ * PR_SPEC_DISABLE_NOEXEC is used.
+ */
+ if (test_thread_flag(TIF_SSBD) &&
+ task_spec_ssb_noexec(current)) {
+ clear_thread_flag(TIF_SSBD);
+ task_clear_spec_ssb_disable(current);
+ task_clear_spec_ssb_noexec(current);
+ speculation_ctrl_update(read_thread_flags());
+ }
+
+ mm_reset_untag_mask(current->mm);
}
-long
-sys_clone(unsigned long clone_flags, unsigned long newsp,
- void __user *parent_tid, void __user *child_tid, struct pt_regs *regs)
+#ifdef CONFIG_X86_IOPL_IOPERM
+static inline void switch_to_bitmap(unsigned long tifp)
{
- if (!newsp)
- newsp = regs->sp;
- return do_fork(clone_flags, newsp, regs, 0, parent_tid, child_tid);
+ /*
+ * Invalidate I/O bitmap if the previous task used it. This prevents
+ * any possible leakage of an active I/O bitmap.
+ *
+ * If the next task has an I/O bitmap it will handle it on exit to
+ * user mode.
+ */
+ if (tifp & _TIF_IO_BITMAP)
+ tss_invalidate_io_bitmap();
}
-/*
- * This gets run with %si containing the
- * function to call, and %di containing
- * the "args".
- */
-extern void kernel_thread_helper(void);
+static void tss_copy_io_bitmap(struct tss_struct *tss, struct io_bitmap *iobm)
+{
+ /*
+ * Copy at least the byte range of the incoming task's bitmap which
+ * covers the permitted I/O ports.
+ *
+ * If the previous task which used an I/O bitmap had more bits
+ * permitted, then the copy needs to cover those as well so they
+ * get turned off.
+ */
+ memcpy(tss->io_bitmap.bitmap, iobm->bitmap,
+ max(tss->io_bitmap.prev_max, iobm->max));
-/*
- * Create a kernel thread
+ /*
+ * Store the new max and the sequence number of this bitmap
+ * and a pointer to the bitmap itself.
+ */
+ tss->io_bitmap.prev_max = iobm->max;
+ tss->io_bitmap.prev_sequence = iobm->sequence;
+}
+
+/**
+ * native_tss_update_io_bitmap - Update I/O bitmap before exiting to user mode
*/
-int kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
+void native_tss_update_io_bitmap(void)
{
- struct pt_regs regs;
+ struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);
+ struct thread_struct *t = &current->thread;
+ u16 *base = &tss->x86_tss.io_bitmap_base;
- memset(&regs, 0, sizeof(regs));
+ if (!test_thread_flag(TIF_IO_BITMAP)) {
+ native_tss_invalidate_io_bitmap();
+ return;
+ }
- regs.si = (unsigned long) fn;
- regs.di = (unsigned long) arg;
+ if (IS_ENABLED(CONFIG_X86_IOPL_IOPERM) && t->iopl_emul == 3) {
+ *base = IO_BITMAP_OFFSET_VALID_ALL;
+ } else {
+ struct io_bitmap *iobm = t->io_bitmap;
-#ifdef CONFIG_X86_32
- regs.ds = __USER_DS;
- regs.es = __USER_DS;
- regs.fs = __KERNEL_PERCPU;
- regs.gs = __KERNEL_STACK_CANARY;
-#else
- regs.ss = __KERNEL_DS;
+ if (WARN_ON_ONCE(!iobm)) {
+ clear_thread_flag(TIF_IO_BITMAP);
+ native_tss_invalidate_io_bitmap();
+ }
+
+ /*
+ * Only copy bitmap data when the sequence number differs. The
+ * update time is accounted to the incoming task.
+ */
+ if (tss->io_bitmap.prev_sequence != iobm->sequence)
+ tss_copy_io_bitmap(tss, iobm);
+
+ /* Enable the bitmap */
+ *base = IO_BITMAP_OFFSET_VALID_MAP;
+ }
+
+ /*
+ * Make sure that the TSS limit is covering the IO bitmap. It might have
+ * been cut down by a VMEXIT to 0x67 which would cause a subsequent I/O
+ * access from user space to trigger a #GP because the bitmap is outside
+ * the TSS limit.
+ */
+ refresh_tss_limit();
+}
+#else /* CONFIG_X86_IOPL_IOPERM */
+static inline void switch_to_bitmap(unsigned long tifp) { }
#endif
- regs.orig_ax = -1;
- regs.ip = (unsigned long) kernel_thread_helper;
- regs.cs = __KERNEL_CS | get_kernel_rpl();
- regs.flags = X86_EFLAGS_IF | 0x2;
+#ifdef CONFIG_SMP
+
+struct ssb_state {
+ struct ssb_state *shared_state;
+ raw_spinlock_t lock;
+ unsigned int disable_state;
+ unsigned long local_state;
+};
+
+#define LSTATE_SSB 0
+
+static DEFINE_PER_CPU(struct ssb_state, ssb_state);
+
+void speculative_store_bypass_ht_init(void)
+{
+ struct ssb_state *st = this_cpu_ptr(&ssb_state);
+ unsigned int this_cpu = smp_processor_id();
+ unsigned int cpu;
+
+ st->local_state = 0;
+
+ /*
+ * Shared state setup happens once on the first bringup
+ * of the CPU. It's not destroyed on CPU hotunplug.
+ */
+ if (st->shared_state)
+ return;
+
+ raw_spin_lock_init(&st->lock);
+
+ /*
+ * Go over HT siblings and check whether one of them has set up the
+ * shared state pointer already.
+ */
+ for_each_cpu(cpu, topology_sibling_cpumask(this_cpu)) {
+ if (cpu == this_cpu)
+ continue;
- /* Ok, create the new process.. */
- return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
+ if (!per_cpu(ssb_state, cpu).shared_state)
+ continue;
+
+ /* Link it to the state of the sibling: */
+ st->shared_state = per_cpu(ssb_state, cpu).shared_state;
+ return;
+ }
+
+ /*
+ * First HT sibling to come up on the core. Link shared state of
+ * the first HT sibling to itself. The siblings on the same core
+ * which come up later will see the shared state pointer and link
+ * themselves to the state of this CPU.
+ */
+ st->shared_state = st;
}
-EXPORT_SYMBOL(kernel_thread);
/*
- * sys_execve() executes a new program.
+ * Logic is: First HT sibling enables SSBD for both siblings in the core
+ * and the last sibling to disable it, disables it for the whole core. This is how
+ * MSR_SPEC_CTRL works in "hardware":
+ *
+ * CORE_SPEC_CTRL = THREAD0_SPEC_CTRL | THREAD1_SPEC_CTRL
*/
-long sys_execve(char __user *name, char __user * __user *argv,
- char __user * __user *envp, struct pt_regs *regs)
+static __always_inline void amd_set_core_ssb_state(unsigned long tifn)
{
- long error;
- char *filename;
+ struct ssb_state *st = this_cpu_ptr(&ssb_state);
+ u64 msr = x86_amd_ls_cfg_base;
- filename = getname(name);
- error = PTR_ERR(filename);
- if (IS_ERR(filename))
- return error;
- error = do_execve(filename, argv, envp, regs);
+ if (!static_cpu_has(X86_FEATURE_ZEN)) {
+ msr |= ssbd_tif_to_amd_ls_cfg(tifn);
+ wrmsrq(MSR_AMD64_LS_CFG, msr);
+ return;
+ }
-#ifdef CONFIG_X86_32
- if (error == 0) {
- /* Make sure we don't return using sysenter.. */
- set_thread_flag(TIF_IRET);
- }
-#endif
+ if (tifn & _TIF_SSBD) {
+ /*
+ * Since this can race with prctl(), block reentry on the
+ * same CPU.
+ */
+ if (__test_and_set_bit(LSTATE_SSB, &st->local_state))
+ return;
- putname(filename);
- return error;
+ msr |= x86_amd_ls_cfg_ssbd_mask;
+
+ raw_spin_lock(&st->shared_state->lock);
+ /* First sibling enables SSBD: */
+ if (!st->shared_state->disable_state)
+ wrmsrq(MSR_AMD64_LS_CFG, msr);
+ st->shared_state->disable_state++;
+ raw_spin_unlock(&st->shared_state->lock);
+ } else {
+ if (!__test_and_clear_bit(LSTATE_SSB, &st->local_state))
+ return;
+
+ raw_spin_lock(&st->shared_state->lock);
+ st->shared_state->disable_state--;
+ if (!st->shared_state->disable_state)
+ wrmsrq(MSR_AMD64_LS_CFG, msr);
+ raw_spin_unlock(&st->shared_state->lock);
+ }
}
+#else
+static __always_inline void amd_set_core_ssb_state(unsigned long tifn)
+{
+ u64 msr = x86_amd_ls_cfg_base | ssbd_tif_to_amd_ls_cfg(tifn);
-/*
- * Idle related variables and functions
- */
-unsigned long boot_option_idle_override = 0;
-EXPORT_SYMBOL(boot_option_idle_override);
+ wrmsrq(MSR_AMD64_LS_CFG, msr);
+}
+#endif
-/*
- * Powermanagement idle function, if any..
- */
-void (*pm_idle)(void);
-EXPORT_SYMBOL(pm_idle);
+static __always_inline void amd_set_ssb_virt_state(unsigned long tifn)
+{
+ /*
+ * SSBD has the same definition in SPEC_CTRL and VIRT_SPEC_CTRL,
+ * so ssbd_tif_to_spec_ctrl() just works.
+ */
+ wrmsrq(MSR_AMD64_VIRT_SPEC_CTRL, ssbd_tif_to_spec_ctrl(tifn));
+}
-#ifdef CONFIG_X86_32
/*
- * This halt magic was a workaround for ancient floppy DMA
- * wreckage. It should be safe to remove.
+ * Update the MSRs managing speculation control, during context switch.
+ *
+ * tifp: Previous task's thread flags
+ * tifn: Next task's thread flags
*/
-static int hlt_counter;
-void disable_hlt(void)
+static __always_inline void __speculation_ctrl_update(unsigned long tifp,
+ unsigned long tifn)
{
- hlt_counter++;
+ unsigned long tif_diff = tifp ^ tifn;
+ u64 msr = x86_spec_ctrl_base;
+ bool updmsr = false;
+
+ lockdep_assert_irqs_disabled();
+
+ /* Handle change of TIF_SSBD depending on the mitigation method. */
+ if (static_cpu_has(X86_FEATURE_VIRT_SSBD)) {
+ if (tif_diff & _TIF_SSBD)
+ amd_set_ssb_virt_state(tifn);
+ } else if (static_cpu_has(X86_FEATURE_LS_CFG_SSBD)) {
+ if (tif_diff & _TIF_SSBD)
+ amd_set_core_ssb_state(tifn);
+ } else if (static_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD) ||
+ static_cpu_has(X86_FEATURE_AMD_SSBD)) {
+ updmsr |= !!(tif_diff & _TIF_SSBD);
+ msr |= ssbd_tif_to_spec_ctrl(tifn);
+ }
+
+ /* Only evaluate TIF_SPEC_IB if conditional STIBP is enabled. */
+ if (IS_ENABLED(CONFIG_SMP) &&
+ static_branch_unlikely(&switch_to_cond_stibp)) {
+ updmsr |= !!(tif_diff & _TIF_SPEC_IB);
+ msr |= stibp_tif_to_spec_ctrl(tifn);
+ }
+
+ if (updmsr)
+ update_spec_ctrl_cond(msr);
}
-EXPORT_SYMBOL(disable_hlt);
-void enable_hlt(void)
+static unsigned long speculation_ctrl_update_tif(struct task_struct *tsk)
{
- hlt_counter--;
+ if (test_and_clear_tsk_thread_flag(tsk, TIF_SPEC_FORCE_UPDATE)) {
+ if (task_spec_ssb_disable(tsk))
+ set_tsk_thread_flag(tsk, TIF_SSBD);
+ else
+ clear_tsk_thread_flag(tsk, TIF_SSBD);
+
+ if (task_spec_ib_disable(tsk))
+ set_tsk_thread_flag(tsk, TIF_SPEC_IB);
+ else
+ clear_tsk_thread_flag(tsk, TIF_SPEC_IB);
+ }
+ /* Return the updated threadinfo flags */
+ return read_task_thread_flags(tsk);
}
-EXPORT_SYMBOL(enable_hlt);
-static inline int hlt_use_halt(void)
+void speculation_ctrl_update(unsigned long tif)
{
- return (!hlt_counter && boot_cpu_data.hlt_works_ok);
+ unsigned long flags;
+
+ /* Forced update. Make sure all relevant TIF flags are different */
+ local_irq_save(flags);
+ __speculation_ctrl_update(~tif, tif);
+ local_irq_restore(flags);
}
-#else
-static inline int hlt_use_halt(void)
+
+/* Called from seccomp/prctl update */
+void speculation_ctrl_update_current(void)
{
- return 1;
+ preempt_disable();
+ speculation_ctrl_update(speculation_ctrl_update_tif(current));
+ preempt_enable();
}
-#endif
-/*
- * We use this if we don't have any better
- * idle routine..
- */
-void default_idle(void)
+static inline void cr4_toggle_bits_irqsoff(unsigned long mask)
{
- if (hlt_use_halt()) {
- trace_power_start(POWER_CSTATE, 1);
- current_thread_info()->status &= ~TS_POLLING;
- /*
- * TS_POLLING-cleared state must be visible before we
- * test NEED_RESCHED:
- */
- smp_mb();
+ unsigned long newval, cr4 = this_cpu_read(cpu_tlbstate.cr4);
- if (!need_resched())
- safe_halt(); /* enables interrupts racelessly */
- else
- local_irq_enable();
- current_thread_info()->status |= TS_POLLING;
- } else {
- local_irq_enable();
- /* loop is done by the caller */
- cpu_relax();
+ newval = cr4 ^ mask;
+ if (newval != cr4) {
+ this_cpu_write(cpu_tlbstate.cr4, newval);
+ __write_cr4(newval);
}
}
-#ifdef CONFIG_APM_MODULE
-EXPORT_SYMBOL(default_idle);
-#endif
-void stop_this_cpu(void *dummy)
+void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p)
{
- local_irq_disable();
- /*
- * Remove this CPU:
- */
- set_cpu_online(smp_processor_id(), false);
- disable_local_APIC();
+ unsigned long tifp, tifn;
- for (;;) {
- if (hlt_works(smp_processor_id()))
- halt();
+ tifn = read_task_thread_flags(next_p);
+ tifp = read_task_thread_flags(prev_p);
+
+ switch_to_bitmap(tifp);
+
+ propagate_user_return_notify(prev_p, next_p);
+
+ if ((tifp & _TIF_BLOCKSTEP || tifn & _TIF_BLOCKSTEP) &&
+ arch_has_block_step()) {
+ unsigned long debugctl, msk;
+
+ rdmsrq(MSR_IA32_DEBUGCTLMSR, debugctl);
+ debugctl &= ~DEBUGCTLMSR_BTF;
+ msk = tifn & _TIF_BLOCKSTEP;
+ debugctl |= (msk >> TIF_BLOCKSTEP) << DEBUGCTLMSR_BTF_SHIFT;
+ wrmsrq(MSR_IA32_DEBUGCTLMSR, debugctl);
}
-}
-static void do_nothing(void *unused)
-{
+ if ((tifp ^ tifn) & _TIF_NOTSC)
+ cr4_toggle_bits_irqsoff(X86_CR4_TSD);
+
+ if ((tifp ^ tifn) & _TIF_NOCPUID)
+ set_cpuid_faulting(!!(tifn & _TIF_NOCPUID));
+
+ if (likely(!((tifp | tifn) & _TIF_SPEC_FORCE_UPDATE))) {
+ __speculation_ctrl_update(tifp, tifn);
+ } else {
+ speculation_ctrl_update_tif(prev_p);
+ tifn = speculation_ctrl_update_tif(next_p);
+
+ /* Enforce MSR update to ensure consistent state */
+ __speculation_ctrl_update(~tifn, tifn);
+ }
}
/*
- * cpu_idle_wait - Used to ensure that all the CPUs discard old value of
- * pm_idle and update to new pm_idle value. Required while changing pm_idle
- * handler on SMP systems.
- *
- * Caller must have changed pm_idle to the new value before the call. Old
- * pm_idle value will not be used by any CPU after the return of this function.
+ * Idle related variables and functions
*/
-void cpu_idle_wait(void)
-{
- smp_mb();
- /* kick all the CPUs so that they exit out of pm_idle */
- smp_call_function(do_nothing, NULL, 1);
-}
-EXPORT_SYMBOL_GPL(cpu_idle_wait);
+unsigned long boot_option_idle_override = IDLE_NO_OVERRIDE;
+EXPORT_SYMBOL(boot_option_idle_override);
/*
- * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
- * which can obviate IPI to trigger checking of need_resched.
- * We execute MONITOR against need_resched and enter optimized wait state
- * through MWAIT. Whenever someone changes need_resched, we would be woken
- * up from MWAIT (without an IPI).
- *
- * New with Core Duo processors, MWAIT can take some hints based on CPU
- * capability.
+ * We use this if we don't have any better idle routine..
*/
-void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
+void __cpuidle default_idle(void)
{
- trace_power_start(POWER_CSTATE, (ax>>4)+1);
- if (!need_resched()) {
- if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
- clflush((void *)&current_thread_info()->flags);
+ raw_safe_halt();
+ raw_local_irq_disable();
+}
+#if defined(CONFIG_APM_MODULE) || defined(CONFIG_HALTPOLL_CPUIDLE_MODULE)
+EXPORT_SYMBOL(default_idle);
+#endif
- __monitor((void *)&current_thread_info()->flags, 0, 0);
- smp_mb();
- if (!need_resched())
- __mwait(ax, cx);
- }
+DEFINE_STATIC_CALL_NULL(x86_idle, default_idle);
+
+static bool x86_idle_set(void)
+{
+ return !!static_call_query(x86_idle);
}
-/* Default MONITOR/MWAIT with no hints, used for default C1 state */
-static void mwait_idle(void)
+#ifndef CONFIG_SMP
+static inline void __noreturn play_dead(void)
{
- if (!need_resched()) {
- trace_power_start(POWER_CSTATE, 1);
- if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
- clflush((void *)&current_thread_info()->flags);
+ BUG();
+}
+#endif
- __monitor((void *)&current_thread_info()->flags, 0, 0);
- smp_mb();
- if (!need_resched())
- __sti_mwait(0, 0);
- else
- local_irq_enable();
- } else
- local_irq_enable();
+void arch_cpu_idle_enter(void)
+{
+ tsc_verify_tsc_adjust(false);
+ local_touch_nmi();
}
-/*
- * On SMP it's slightly faster (but much more power-consuming!)
- * to poll the ->work.need_resched flag instead of waiting for the
- * cross-CPU IPI to arrive. Use this option with caution.
- */
-static void poll_idle(void)
+void __noreturn arch_cpu_idle_dead(void)
{
- trace_power_start(POWER_CSTATE, 0);
- local_irq_enable();
- while (!need_resched())
- cpu_relax();
- trace_power_end(0);
+ play_dead();
}
/*
- * mwait selection logic:
- *
- * It depends on the CPU. For AMD CPUs that support MWAIT this is
- * wrong. Family 0x10 and 0x11 CPUs will enter C1 on HLT. Powersavings
- * then depend on a clock divisor and current Pstate of the core. If
- * all cores of a processor are in halt state (C1) the processor can
- * enter the C1E (C1 enhanced) state. If mwait is used this will never
- * happen.
- *
- * idle=mwait overrides this decision and forces the usage of mwait.
+ * Called from the generic idle code.
*/
-static int __cpuinitdata force_mwait;
+void __cpuidle arch_cpu_idle(void)
+{
+ static_call(x86_idle)();
+}
+EXPORT_SYMBOL_GPL(arch_cpu_idle);
+
+#ifdef CONFIG_XEN
+bool xen_set_default_idle(void)
+{
+ bool ret = x86_idle_set();
+
+ static_call_update(x86_idle, default_idle);
+
+ return ret;
+}
+#endif
-#define MWAIT_INFO 0x05
-#define MWAIT_ECX_EXTENDED_INFO 0x01
-#define MWAIT_EDX_C1 0xf0
+struct cpumask cpus_stop_mask;
-static int __cpuinit mwait_usable(const struct cpuinfo_x86 *c)
+void __noreturn stop_this_cpu(void *dummy)
{
- u32 eax, ebx, ecx, edx;
+ struct cpuinfo_x86 *c = this_cpu_ptr(&cpu_info);
+ unsigned int cpu = smp_processor_id();
- if (force_mwait)
- return 1;
+ local_irq_disable();
- if (c->cpuid_level < MWAIT_INFO)
- return 0;
+ /*
+ * Remove this CPU from the online mask and disable it
+ * unconditionally. This might be redundant in case that the reboot
+ * vector was handled late and stop_other_cpus() sent an NMI.
+ *
+ * According to SDM and APM NMIs can be accepted even after soft
+ * disabling the local APIC.
+ */
+ set_cpu_online(cpu, false);
+ disable_local_APIC();
+ mcheck_cpu_clear(c);
- cpuid(MWAIT_INFO, &eax, &ebx, &ecx, &edx);
- /* Check, whether EDX has extended info about MWAIT */
- if (!(ecx & MWAIT_ECX_EXTENDED_INFO))
- return 1;
+ if (this_cpu_read(cache_state_incoherent))
+ wbinvd();
/*
- * edx enumeratios MONITOR/MWAIT extensions. Check, whether
- * C1 supports MWAIT
+ * This brings a cache line back and dirties it, but
+ * native_stop_other_cpus() will overwrite cpus_stop_mask after it
+ * observed that all CPUs reported stop. This write will invalidate
+ * the related cache line on this CPU.
*/
- return (edx & MWAIT_EDX_C1);
+ cpumask_clear_cpu(cpu, &cpus_stop_mask);
+
+#ifdef CONFIG_SMP
+ if (smp_ops.stop_this_cpu) {
+ smp_ops.stop_this_cpu();
+ BUG();
+ }
+#endif
+
+ for (;;) {
+ /*
+ * Use native_halt() so that memory contents don't change
+ * (stack usage and variables) after possibly issuing the
+ * wbinvd() above.
+ */
+ native_halt();
+ }
}
/*
- * Check for AMD CPUs, where APIC timer interrupt does not wake up CPU from C1e.
- * For more information see
- * - Erratum #400 for NPT family 0xf and family 0x10 CPUs
- * - Erratum #365 for family 0x11 (not affected because C1e not in use)
+ * Prefer MWAIT over HALT if MWAIT is supported, MWAIT_CPUID leaf
+ * exists and whenever MONITOR/MWAIT extensions are present there is at
+ * least one C1 substate.
+ *
+ * Do not prefer MWAIT if the MONITOR instruction has a bug, or if
+ * idle=nomwait is passed on the kernel command line.
*/
-static int __cpuinit check_c1e_idle(const struct cpuinfo_x86 *c)
+static __init bool prefer_mwait_c1_over_halt(void)
{
- u64 val;
- if (c->x86_vendor != X86_VENDOR_AMD)
- goto no_c1e_idle;
+ const struct cpuinfo_x86 *c = &boot_cpu_data;
+ u32 eax, ebx, ecx, edx;
- /* Family 0x0f models < rev F do not have C1E */
- if (c->x86 == 0x0F && c->x86_model >= 0x40)
- return 1;
+ /* If override is enforced on the command line, fall back to HALT. */
+ if (boot_option_idle_override != IDLE_NO_OVERRIDE)
+ return false;
- if (c->x86 == 0x10) {
- /*
- * check OSVW bit for CPUs that are not affected
- * by erratum #400
- */
- if (cpu_has(c, X86_FEATURE_OSVW)) {
- rdmsrl(MSR_AMD64_OSVW_ID_LENGTH, val);
- if (val >= 2) {
- rdmsrl(MSR_AMD64_OSVW_STATUS, val);
- if (!(val & BIT(1)))
- goto no_c1e_idle;
- }
- }
- return 1;
- }
+ /* MWAIT is not supported on this platform. Fall back to HALT */
+ if (!cpu_has(c, X86_FEATURE_MWAIT))
+ return false;
-no_c1e_idle:
- return 0;
-}
+ /* Monitor has a bug or the APIC stops in C1E. Fall back to HALT */
+ if (boot_cpu_has_bug(X86_BUG_MONITOR) || boot_cpu_has_bug(X86_BUG_AMD_APIC_C1E))
+ return false;
-static cpumask_var_t c1e_mask;
-static int c1e_detected;
+ cpuid(CPUID_LEAF_MWAIT, &eax, &ebx, &ecx, &edx);
-void c1e_remove_cpu(int cpu)
-{
- if (c1e_mask != NULL)
- cpumask_clear_cpu(cpu, c1e_mask);
+ /*
+ * If MWAIT extensions are not available, it is safe to use MWAIT
+ * with EAX=0, ECX=0.
+ */
+ if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED))
+ return true;
+
+ /*
+ * If MWAIT extensions are available, there should be at least one
+ * MWAIT C1 substate present.
+ */
+ return !!(edx & MWAIT_C1_SUBSTATE_MASK);
}
/*
- * C1E aware idle routine. We check for C1E active in the interrupt
- * pending message MSR. If we detect C1E, then we handle it the same
- * way as C3 power states (local apic timer and TSC stop)
+ * MONITOR/MWAIT with no hints, used for default C1 state. This invokes MWAIT
+ * with interrupts enabled and no flags, which is backwards compatible with the
+ * original MWAIT implementation.
*/
-static void c1e_idle(void)
+static __cpuidle void mwait_idle(void)
{
if (need_resched())
return;
- if (!c1e_detected) {
- u32 lo, hi;
+ x86_idle_clear_cpu_buffers();
- rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
- if (lo & K8_INTP_C1E_ACTIVE_MASK) {
- c1e_detected = 1;
- if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
- mark_tsc_unstable("TSC halt in AMD C1E");
- printk(KERN_INFO "System has AMD C1E enabled\n");
- set_cpu_cap(&boot_cpu_data, X86_FEATURE_AMDC1E);
- }
- }
+ if (!current_set_polling_and_test()) {
+ const void *addr = &current_thread_info()->flags;
- if (c1e_detected) {
- int cpu = smp_processor_id();
-
- if (!cpumask_test_cpu(cpu, c1e_mask)) {
- cpumask_set_cpu(cpu, c1e_mask);
- /*
- * Force broadcast so ACPI can not interfere.
- */
- clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_FORCE,
- &cpu);
- printk(KERN_INFO "Switch to broadcast mode on CPU%d\n",
- cpu);
- }
- clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+ alternative_input("", "clflush (%[addr])", X86_BUG_CLFLUSH_MONITOR, [addr] "a" (addr));
+ __monitor(addr, 0, 0);
+ if (need_resched())
+ goto out;
- default_idle();
+ __sti_mwait(0, 0);
+ raw_local_irq_disable();
+ }
- /*
- * The switch back from broadcast mode needs to be
- * called with interrupts disabled.
- */
- local_irq_disable();
- clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
- local_irq_enable();
- } else
- default_idle();
+out:
+ __current_clr_polling();
}
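mwait_idle() follows a classic lost-wakeup discipline: arm the monitor, re-check need_resched(), and only then wait. A hedged user-space analogue of that ordering — a condition variable stands in for MONITOR/MWAIT, and all names are invented; the clflush workaround and CPU-buffer clearing have no counterpart here:

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t  kick = PTHREAD_COND_INITIALIZER;
	static atomic_bool need_resched_flag;

	static void idle_once(void)
	{
		if (atomic_load(&need_resched_flag))      /* if (need_resched()) */
			return;
		pthread_mutex_lock(&lock);                /* __monitor(): arm */
		if (!atomic_load(&need_resched_flag))     /* re-check, then wait */
			pthread_cond_wait(&kick, &lock);  /* __sti_mwait() */
		pthread_mutex_unlock(&lock);
	}

	static void *waker(void *arg)
	{
		pthread_mutex_lock(&lock);
		atomic_store(&need_resched_flag, 1);      /* set TIF_NEED_RESCHED */
		pthread_cond_signal(&kick);               /* the monitored write */
		pthread_mutex_unlock(&lock);
		return arg;
	}

	int main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, waker, NULL);
		idle_once();                              /* returns once kicked */
		pthread_join(t, NULL);
		puts("woken");
		return 0;
	}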
-void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c)
+void __init select_idle_routine(void)
{
-#ifdef CONFIG_SMP
- if (pm_idle == poll_idle && smp_num_siblings > 1) {
- printk_once(KERN_WARNING "WARNING: polling idle and HT enabled,"
- " performance may degrade.\n");
+ if (boot_option_idle_override == IDLE_POLL) {
+ if (IS_ENABLED(CONFIG_SMP) && __max_threads_per_core > 1)
+ pr_warn_once("WARNING: polling idle and HT enabled, performance may degrade\n");
+ return;
}
-#endif
- if (pm_idle)
+
+ /* Required to guard against xen_set_default_idle() */
+ if (x86_idle_set())
return;
- if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) {
- /*
- * One CPU supports mwait => All CPUs supports mwait
- */
- printk(KERN_INFO "using mwait in idle threads.\n");
- pm_idle = mwait_idle;
- } else if (check_c1e_idle(c)) {
- printk(KERN_INFO "using C1E aware idle routine\n");
- pm_idle = c1e_idle;
- } else
- pm_idle = default_idle;
+ if (prefer_mwait_c1_over_halt()) {
+ pr_info("using mwait in idle threads\n");
+ static_call_update(x86_idle, mwait_idle);
+ } else if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+ pr_info("using TDX aware idle routine\n");
+ static_call_update(x86_idle, tdx_halt);
+ } else {
+ static_call_update(x86_idle, default_idle);
+ }
}
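static_call_update() patches the call site in place; the closest portable analogue is a function pointer selected once at boot. A sketch of the selection shape only (a real static call avoids the indirect-branch cost that this pointer still pays):

	#include <stdbool.h>
	#include <stdio.h>

	static void default_idle_fn(void) { puts("default_idle()"); }
	static void mwait_idle_fn(void)   { puts("mwait_idle()"); }
	static void tdx_halt_fn(void)     { puts("tdx_halt()"); }

	static void (*x86_idle_fn)(void) = default_idle_fn;

	static void select_idle_routine_sketch(bool mwait_c1, bool tdx_guest)
	{
		if (mwait_c1)
			x86_idle_fn = mwait_idle_fn;  /* static_call_update() */
		else if (tdx_guest)
			x86_idle_fn = tdx_halt_fn;
	}

	int main(void)
	{
		select_idle_routine_sketch(true, false);
		x86_idle_fn();                        /* the idle loop's call site */
		return 0;
	}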
-void __init init_c1e_mask(void)
+void amd_e400_c1e_apic_setup(void)
{
- /* If we're using c1e_idle, we need to allocate c1e_mask. */
- if (pm_idle == c1e_idle)
- zalloc_cpumask_var(&c1e_mask, GFP_KERNEL);
+ if (boot_cpu_has_bug(X86_BUG_AMD_APIC_C1E)) {
+ pr_info("Switch to broadcast mode on CPU%d\n", smp_processor_id());
+ local_irq_disable();
+ tick_broadcast_force();
+ local_irq_enable();
+ }
+}
+
+void __init arch_post_acpi_subsys_init(void)
+{
+ u32 lo, hi;
+
+ if (!boot_cpu_has_bug(X86_BUG_AMD_E400))
+ return;
+
+ /*
+ * AMD E400 detection needs to happen after ACPI has been enabled. If
+	 * the machine is affected, the K8_INTP_C1E_ACTIVE_MASK bits are set
+	 * in MSR_K8_INT_PENDING_MSG.
+ */
+ rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
+ if (!(lo & K8_INTP_C1E_ACTIVE_MASK))
+ return;
+
+ boot_cpu_set_bug(X86_BUG_AMD_APIC_C1E);
+
+ if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
+ mark_tsc_unstable("TSC halt in AMD C1E");
+
+ if (IS_ENABLED(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE))
+ static_branch_enable(&arch_needs_tick_broadcast);
+ pr_info("System has AMD C1E erratum E400. Workaround enabled.\n");
}
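With the msr driver loaded, the same E400 probe can be repeated from user space (root required). The constants below are copied from the kernel's msr-index.h; treat the sketch as illustrative, not a supported interface:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	#define MSR_K8_INT_PENDING_MSG  0xc0010055
	#define K8_INTP_C1E_ACTIVE_MASK 0x18000000u

	int main(void)
	{
		uint64_t val;
		int fd = open("/dev/cpu/0/msr", O_RDONLY);  /* needs msr.ko + root */

		if (fd < 0 ||
		    pread(fd, &val, sizeof(val), MSR_K8_INT_PENDING_MSG) != sizeof(val)) {
			perror("rdmsr");
			return 1;
		}
		printf("C1E %sactive\n", (val & K8_INTP_C1E_ACTIVE_MASK) ? "" : "not ");
		close(fd);
		return 0;
	}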
static int __init idle_setup(char *str)
@@ -658,34 +999,19 @@ static int __init idle_setup(char *str)
return -EINVAL;
if (!strcmp(str, "poll")) {
- printk("using polling idle threads.\n");
- pm_idle = poll_idle;
- } else if (!strcmp(str, "mwait"))
- force_mwait = 1;
- else if (!strcmp(str, "halt")) {
- /*
- * When the boot option of idle=halt is added, halt is
- * forced to be used for CPU idle. In such case CPU C2/C3
- * won't be used again.
- * To continue to load the CPU idle driver, don't touch
- * the boot_option_idle_override.
- */
- pm_idle = default_idle;
- idle_halt = 1;
- return 0;
+ pr_info("using polling idle threads\n");
+ boot_option_idle_override = IDLE_POLL;
+ cpu_idle_poll_ctrl(true);
+ } else if (!strcmp(str, "halt")) {
+		/* 'idle=halt': use HALT for idle. C-states are disabled. */
+ boot_option_idle_override = IDLE_HALT;
} else if (!strcmp(str, "nomwait")) {
- /*
- * If the boot option of "idle=nomwait" is added,
- * it means that mwait will be disabled for CPU C2/C3
- * states. In such case it won't touch the variable
- * of boot_option_idle_override.
- */
- idle_nomwait = 1;
- return 0;
- } else
- return -1;
+ /* 'idle=nomwait' disables MWAIT for idle */
+ boot_option_idle_override = IDLE_NOMWAIT;
+ } else {
+ return -EINVAL;
+ }
- boot_option_idle_override = 1;
return 0;
}
early_param("idle", idle_setup);
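Usage example (illustrative boot line): with this parser, a kernel command line such as "... root=/dev/sda1 idle=nomwait" simply records IDLE_NOMWAIT in boot_option_idle_override and lets prefer_mwait_c1_over_halt() veto MWAIT at selection time, while "idle=poll" additionally switches the generic idle loop to polling via cpu_idle_poll_ctrl(true).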
@@ -693,13 +1019,69 @@ early_param("idle", idle_setup);
unsigned long arch_align_stack(unsigned long sp)
{
if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
- sp -= get_random_int() % 8192;
+ sp -= get_random_u32_below(8192);
return sp & ~0xf;
}
unsigned long arch_randomize_brk(struct mm_struct *mm)
{
- unsigned long range_end = mm->brk + 0x02000000;
- return randomize_range(mm->brk, range_end, 0) ? : mm->brk;
+ if (mmap_is_ia32())
+ return randomize_page(mm->brk, SZ_32M);
+
+ return randomize_page(mm->brk, SZ_1G);
+}
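A sketch of the underlying "random page-aligned offset below a bound" calculation, with rand() standing in for the kernel's CSPRNG (randomize_page() also clamps against address-space limits, which is omitted here):

	#include <stdio.h>
	#include <stdlib.h>

	#define PAGE_SHIFT 12
	#define SZ_1G      (1UL << 30)

	static unsigned long randomize_page_sketch(unsigned long start,
						   unsigned long range)
	{
		range >>= PAGE_SHIFT;                 /* bytes -> pages */
		return start + (((unsigned long)rand() % range) << PAGE_SHIFT);
	}

	int main(void)
	{
		unsigned long brk = 0x555555560000UL;

		srand(42);                            /* fixed seed: just a demo */
		printf("randomized brk: %#lx\n",
		       randomize_page_sketch(brk, SZ_1G));
		return 0;
	}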
+
+/*
+ * Called from fs/proc with a reference on @p to find the function
+ * which called into schedule(). This needs to be done carefully
+ * because the task might wake up and we might look at a stack
+ * changing under us.
+ */
+unsigned long __get_wchan(struct task_struct *p)
+{
+ struct unwind_state state;
+ unsigned long addr = 0;
+
+ if (!try_get_task_stack(p))
+ return 0;
+
+ for (unwind_start(&state, p, NULL, NULL); !unwind_done(&state);
+ unwind_next_frame(&state)) {
+ addr = unwind_get_return_address(&state);
+ if (!addr)
+ break;
+ if (in_sched_functions(addr))
+ continue;
+ break;
+ }
+
+ put_task_stack(p);
+
+ return addr;
+}
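The value computed here is what /proc/<pid>/wchan reports: for a sleeping task, `cat /proc/1/wchan` will typically print the symbol of the first non-scheduler function on the stack (for example something like ep_poll, depending on what init happens to be blocked in).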
+
+SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
+{
+ switch (option) {
+ case ARCH_GET_CPUID:
+ return get_cpuid_mode();
+ case ARCH_SET_CPUID:
+ return set_cpuid_mode(arg2);
+ case ARCH_GET_XCOMP_SUPP:
+ case ARCH_GET_XCOMP_PERM:
+ case ARCH_REQ_XCOMP_PERM:
+ case ARCH_GET_XCOMP_GUEST_PERM:
+ case ARCH_REQ_XCOMP_GUEST_PERM:
+ return fpu_xstate_prctl(option, arg2);
+ }
+
+ if (!in_ia32_syscall())
+ return do_arch_prctl_64(current, option, arg2);
+
+ return -EINVAL;
}
+SYSCALL_DEFINE0(ni_syscall)
+{
+ return -ENOSYS;
+}
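The new syscall surface can be exercised from user space; a minimal sketch for the CPUID-faulting pair (glibc has no arch_prctl wrapper, so the raw syscall is used; ARCH_GET_CPUID returns nonzero when the CPUID instruction is permitted, and ARCH_SET_CPUID(0) only succeeds on CPUs with CPUID faulting):

	#include <asm/prctl.h>       /* ARCH_GET_CPUID, ARCH_SET_CPUID */
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		long ret = syscall(SYS_arch_prctl, ARCH_GET_CPUID, 0);

		printf("ARCH_GET_CPUID -> %ld\n", ret);  /* 1: CPUID allowed */

		/* Arm CPUID faulting where the hardware supports it: */
		if (syscall(SYS_arch_prctl, ARCH_SET_CPUID, 0) == 0)
			puts("CPUID now faults in this task");
		return 0;
	}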
diff --git a/arch/x86/kernel/process.h b/arch/x86/kernel/process.h
new file mode 100644
index 000000000000..76b547b83232
--- /dev/null
+++ b/arch/x86/kernel/process.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+//
+// Code shared between 32 and 64 bit
+
+#include <asm/spec-ctrl.h>
+
+void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
+
+/*
+ * This needs to be inline to optimize for the common case where no extra
+ * work needs to be done.
+ */
+static inline void switch_to_extra(struct task_struct *prev,
+ struct task_struct *next)
+{
+ unsigned long next_tif = read_task_thread_flags(next);
+ unsigned long prev_tif = read_task_thread_flags(prev);
+
+ if (IS_ENABLED(CONFIG_SMP)) {
+ /*
+ * Avoid __switch_to_xtra() invocation when conditional
+ * STIBP is disabled and the only different bit is
+ * TIF_SPEC_IB. For CONFIG_SMP=n TIF_SPEC_IB is not
+ * in the TIF_WORK_CTXSW masks.
+ */
+ if (!static_branch_likely(&switch_to_cond_stibp)) {
+ prev_tif &= ~_TIF_SPEC_IB;
+ next_tif &= ~_TIF_SPEC_IB;
+ }
+ }
+
+ /*
+ * __switch_to_xtra() handles debug registers, i/o bitmaps,
+ * speculation mitigations etc.
+ */
+ if (unlikely(next_tif & _TIF_WORK_CTXSW_NEXT ||
+ prev_tif & _TIF_WORK_CTXSW_PREV))
+ __switch_to_xtra(prev, next);
+}
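In isolation, the STIBP filtering above is just "mask one bit out of both snapshots unless the static key says it matters". A self-contained sketch with invented bit positions:

	#include <stdbool.h>
	#include <stdio.h>

	#define TIF_SPEC_IB   (1UL << 9)   /* illustrative bit positions */
	#define TIF_IO_BITMAP (1UL << 22)
	#define WORK_MASK     (TIF_SPEC_IB | TIF_IO_BITMAP)

	static bool needs_switch_to_xtra(unsigned long prev_tif,
					 unsigned long next_tif,
					 bool cond_stibp)
	{
		if (!cond_stibp) {             /* conditional STIBP disabled */
			prev_tif &= ~TIF_SPEC_IB;
			next_tif &= ~TIF_SPEC_IB;
		}
		return (prev_tif | next_tif) & WORK_MASK;
	}

	int main(void)
	{
		/* Only TIF_SPEC_IB differs and conditional STIBP is off: skip. */
		printf("%d\n", needs_switch_to_xtra(TIF_SPEC_IB, 0, false)); /* 0 */
		printf("%d\n", needs_switch_to_xtra(TIF_SPEC_IB, 0, true));  /* 1 */
		return 0;
	}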
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 8d128783af47..3ef15c2f152f 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -9,10 +9,11 @@
* This file handles the architecture-dependent parts of process handling..
*/
-#include <linux/stackprotector.h>
#include <linux/cpu.h>
#include <linux/errno.h>
#include <linux/sched.h>
+#include <linux/sched/task.h>
+#include <linux/sched/task_stack.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/mm.h>
@@ -25,152 +26,80 @@
#include <linux/interrupt.h>
#include <linux/delay.h>
#include <linux/reboot.h>
-#include <linux/init.h>
#include <linux/mc146818rtc.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/kallsyms.h>
#include <linux/ptrace.h>
#include <linux/personality.h>
-#include <linux/tick.h>
#include <linux/percpu.h>
#include <linux/prctl.h>
#include <linux/ftrace.h>
#include <linux/uaccess.h>
#include <linux/io.h>
#include <linux/kdebug.h>
+#include <linux/syscalls.h>
-#include <asm/pgtable.h>
-#include <asm/system.h>
#include <asm/ldt.h>
#include <asm/processor.h>
-#include <asm/i387.h>
+#include <asm/fpu/sched.h>
#include <asm/desc.h>
-#ifdef CONFIG_MATH_EMULATION
-#include <asm/math_emu.h>
-#endif
#include <linux/err.h>
#include <asm/tlbflush.h>
#include <asm/cpu.h>
-#include <asm/idle.h>
-#include <asm/syscalls.h>
#include <asm/debugreg.h>
+#include <asm/switch_to.h>
+#include <asm/vm86.h>
+#include <asm/resctrl.h>
+#include <asm/proto.h>
-asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
+#include "process.h"
-/*
- * Return saved PC of a blocked thread.
- */
-unsigned long thread_saved_pc(struct task_struct *tsk)
-{
- return ((unsigned long *)tsk->thread.sp)[3];
-}
-
-#ifndef CONFIG_SMP
-static inline void play_dead(void)
-{
- BUG();
-}
-#endif
-
-/*
- * The idle thread. There's no useful work to be
- * done, so just try to conserve power and have a
- * low exit latency (ie sit in a loop waiting for
- * somebody to say that they'd like to reschedule)
- */
-void cpu_idle(void)
-{
- int cpu = smp_processor_id();
-
- /*
- * If we're the non-boot CPU, nothing set the stack canary up
- * for us. CPU0 already has it initialized but no harm in
- * doing it again. This is a good place for updating it, as
- * we wont ever return from this function (so the invalid
- * canaries already on the stack wont ever trigger).
- */
- boot_init_stack_canary();
-
- current_thread_info()->status |= TS_POLLING;
-
- /* endless idle loop with no priority at all */
- while (1) {
- tick_nohz_stop_sched_tick(1);
- while (!need_resched()) {
-
- check_pgt_cache();
- rmb();
-
- if (cpu_is_offline(cpu))
- play_dead();
-
- local_irq_disable();
- /* Don't trace irqs off for idle */
- stop_critical_timings();
- pm_idle();
- start_critical_timings();
- }
- tick_nohz_restart_sched_tick();
- preempt_enable_no_resched();
- schedule();
- preempt_disable();
- }
-}
-
-void __show_regs(struct pt_regs *regs, int all)
+void __show_regs(struct pt_regs *regs, enum show_regs_mode mode,
+ const char *log_lvl)
{
unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L;
unsigned long d0, d1, d2, d3, d6, d7;
- unsigned long sp;
- unsigned short ss, gs;
-
- if (user_mode_vm(regs)) {
- sp = regs->sp;
- ss = regs->ss & 0xffff;
- gs = get_user_gs(regs);
- } else {
- sp = kernel_stack_pointer(regs);
- savesegment(ss, ss);
- savesegment(gs, gs);
- }
+ unsigned short gs;
- show_regs_common();
+ savesegment(gs, gs);
- printk(KERN_DEFAULT "EIP: %04x:[<%08lx>] EFLAGS: %08lx CPU: %d\n",
- (u16)regs->cs, regs->ip, regs->flags,
- smp_processor_id());
- print_symbol("EIP is at %s\n", regs->ip);
+ show_ip(regs, log_lvl);
- printk(KERN_DEFAULT "EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n",
- regs->ax, regs->bx, regs->cx, regs->dx);
- printk(KERN_DEFAULT "ESI: %08lx EDI: %08lx EBP: %08lx ESP: %08lx\n",
- regs->si, regs->di, regs->bp, sp);
- printk(KERN_DEFAULT " DS: %04x ES: %04x FS: %04x GS: %04x SS: %04x\n",
- (u16)regs->ds, (u16)regs->es, (u16)regs->fs, gs, ss);
+ printk("%sEAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n",
+ log_lvl, regs->ax, regs->bx, regs->cx, regs->dx);
+ printk("%sESI: %08lx EDI: %08lx EBP: %08lx ESP: %08lx\n",
+ log_lvl, regs->si, regs->di, regs->bp, regs->sp);
+ printk("%sDS: %04x ES: %04x FS: %04x GS: %04x SS: %04x EFLAGS: %08lx\n",
+ log_lvl, (u16)regs->ds, (u16)regs->es, (u16)regs->fs, gs, regs->ss, regs->flags);
- if (!all)
+ if (mode != SHOW_REGS_ALL)
return;
cr0 = read_cr0();
cr2 = read_cr2();
- cr3 = read_cr3();
- cr4 = read_cr4_safe();
- printk(KERN_DEFAULT "CR0: %08lx CR2: %08lx CR3: %08lx CR4: %08lx\n",
- cr0, cr2, cr3, cr4);
+ cr3 = __read_cr3();
+ cr4 = __read_cr4();
+ printk("%sCR0: %08lx CR2: %08lx CR3: %08lx CR4: %08lx\n",
+ log_lvl, cr0, cr2, cr3, cr4);
get_debugreg(d0, 0);
get_debugreg(d1, 1);
get_debugreg(d2, 2);
get_debugreg(d3, 3);
- printk(KERN_DEFAULT "DR0: %08lx DR1: %08lx DR2: %08lx DR3: %08lx\n",
- d0, d1, d2, d3);
-
get_debugreg(d6, 6);
get_debugreg(d7, 7);
- printk(KERN_DEFAULT "DR6: %08lx DR7: %08lx\n",
- d6, d7);
+
+ /* Only print out debug registers if they are in their non-default state. */
+ if ((d0 == 0) && (d1 == 0) && (d2 == 0) && (d3 == 0) &&
+ (d6 == DR6_RESERVED) && (d7 == DR7_FIXED_1))
+ return;
+
+ printk("%sDR0: %08lx DR1: %08lx DR2: %08lx DR3: %08lx\n",
+ log_lvl, d0, d1, d2, d3);
+ printk("%sDR6: %08lx DR7: %08lx\n",
+ log_lvl, d6, d7);
}
void release_thread(struct task_struct *dead_task)
@@ -179,89 +108,24 @@ void release_thread(struct task_struct *dead_task)
release_vm86_irqs(dead_task);
}
-/*
- * This gets called before we allocate a new thread and copy
- * the current task into it.
- */
-void prepare_to_copy(struct task_struct *tsk)
-{
- unlazy_fpu(tsk);
-}
-
-int copy_thread(unsigned long clone_flags, unsigned long sp,
- unsigned long unused,
- struct task_struct *p, struct pt_regs *regs)
-{
- struct pt_regs *childregs;
- struct task_struct *tsk;
- int err;
-
- childregs = task_pt_regs(p);
- *childregs = *regs;
- childregs->ax = 0;
- childregs->sp = sp;
-
- p->thread.sp = (unsigned long) childregs;
- p->thread.sp0 = (unsigned long) (childregs+1);
-
- p->thread.ip = (unsigned long) ret_from_fork;
-
- task_user_gs(p) = get_user_gs(regs);
-
- p->thread.io_bitmap_ptr = NULL;
- tsk = current;
- err = -ENOMEM;
-
- memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
-
- if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) {
- p->thread.io_bitmap_ptr = kmemdup(tsk->thread.io_bitmap_ptr,
- IO_BITMAP_BYTES, GFP_KERNEL);
- if (!p->thread.io_bitmap_ptr) {
- p->thread.io_bitmap_max = 0;
- return -ENOMEM;
- }
- set_tsk_thread_flag(p, TIF_IO_BITMAP);
- }
-
- err = 0;
-
- /*
- * Set a new TLS for the child thread?
- */
- if (clone_flags & CLONE_SETTLS)
- err = do_set_thread_area(p, -1,
- (struct user_desc __user *)childregs->si, 0);
-
- if (err && p->thread.io_bitmap_ptr) {
- kfree(p->thread.io_bitmap_ptr);
- p->thread.io_bitmap_max = 0;
- }
- return err;
-}
-
void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
- set_user_gs(regs, 0);
+ loadsegment(gs, 0);
regs->fs = 0;
- set_fs(USER_DS);
regs->ds = __USER_DS;
regs->es = __USER_DS;
regs->ss = __USER_DS;
regs->cs = __USER_CS;
regs->ip = new_ip;
regs->sp = new_sp;
- /*
- * Free the old FP and other extended state
- */
- free_thread_xstate(current);
+ regs->flags = X86_EFLAGS_IF;
}
EXPORT_SYMBOL_GPL(start_thread);
/*
- * switch_to(x,yn) should switch tasks from x to y.
+ * switch_to(x,y) should switch tasks from x to y.
*
* We fsave/fwait so that an exception goes off at the right time
* (as a call from the fsave or fwait in effect) rather than to
@@ -287,34 +151,16 @@ EXPORT_SYMBOL_GPL(start_thread);
* the task-switch, and shows up in ret_from_fork in entry.S,
* for example.
*/
-__notrace_funcgraph struct task_struct *
+__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
struct thread_struct *prev = &prev_p->thread,
- *next = &next_p->thread;
+ *next = &next_p->thread;
int cpu = smp_processor_id();
- struct tss_struct *tss = &per_cpu(init_tss, cpu);
- bool preload_fpu;
/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */
- /*
- * If the task has used fpu the last 5 timeslices, just do a full
- * restore of the math state immediately to avoid the trap; the
- * chances of needing FPU soon are obviously high now
- */
- preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
-
- __unlazy_fpu(prev_p);
-
- /* we're going to use this soon, after a few expensive things */
- if (preload_fpu)
- prefetch(next->fpu.state);
-
- /*
- * Reload esp0.
- */
- load_sp0(tss, next);
+ switch_fpu(prev_p, cpu);
/*
* Save away %gs. No need to save %fs, as it was saved on the
@@ -326,81 +172,43 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
* used %fs or %gs (it does not today), or if the kernel is
* running inside of a hypervisor layer.
*/
- lazy_save_gs(prev->gs);
+ savesegment(gs, prev->gs);
/*
* Load the per-thread Thread-Local Storage descriptor.
*/
load_TLS(next, cpu);
- /*
- * Restore IOPL if needed. In normal use, the flags restore
- * in the switch assembly will handle this. But if the kernel
- * is running virtualized at a non-zero CPL, the popf will
- * not restore flags, so it must be done in a separate step.
- */
- if (get_kernel_rpl() && unlikely(prev->iopl != next->iopl))
- set_iopl_mask(next->iopl);
-
- /*
- * Now maybe handle debug registers and/or IO bitmaps
- */
- if (unlikely(task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV ||
- task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT))
- __switch_to_xtra(prev_p, next_p, tss);
-
- /* If we're going to preload the fpu context, make sure clts
- is run while we're batching the cpu state updates. */
- if (preload_fpu)
- clts();
+ switch_to_extra(prev_p, next_p);
/*
* Leave lazy mode, flushing any hypercalls made here.
* This must be done before restoring TLS segments so
- * the GDT and LDT are properly updated, and must be
- * done before math_state_restore, so the TS bit is up
- * to date.
+ * the GDT and LDT are properly updated.
*/
arch_end_context_switch(next_p);
- if (preload_fpu)
- __math_state_restore();
+ /*
+ * Reload esp0 and cpu_current_top_of_stack. This changes
+ * current_thread_info(). Refresh the SYSENTER configuration in
+ * case prev or next is vm86.
+ */
+ update_task_stack(next_p);
+ refresh_sysenter_cs(next);
+ this_cpu_write(cpu_current_top_of_stack,
+ (unsigned long)task_stack_page(next_p) +
+ THREAD_SIZE);
/*
* Restore %gs if needed (which is common)
*/
if (prev->gs | next->gs)
- lazy_load_gs(next->gs);
+ loadsegment(gs, next->gs);
- percpu_write(current_task, next_p);
-
- return prev_p;
-}
+ raw_cpu_write(current_task, next_p);
-#define top_esp (THREAD_SIZE - sizeof(unsigned long))
-#define top_ebp (THREAD_SIZE - 2*sizeof(unsigned long))
+ /* Load the Intel cache allocation PQR MSR. */
+ resctrl_arch_sched_in(next_p);
-unsigned long get_wchan(struct task_struct *p)
-{
- unsigned long bp, sp, ip;
- unsigned long stack_page;
- int count = 0;
- if (!p || p == current || p->state == TASK_RUNNING)
- return 0;
- stack_page = (unsigned long)task_stack_page(p);
- sp = p->thread.sp;
- if (!stack_page || sp < stack_page || sp > top_esp+stack_page)
- return 0;
- /* include/asm-i386/system.h:switch_to() pushes bp last. */
- bp = *(unsigned long *) sp;
- do {
- if (bp < stack_page || bp > top_ebp+stack_page)
- return 0;
- ip = *(unsigned long *) (bp+4);
- if (!in_sched_functions(ip))
- return ip;
- bp = *(unsigned long *) bp;
- } while (count++ < 16);
- return 0;
+ return prev_p;
}
-
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3c2422a99f1f..432c0a004c60 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Copyright (C) 1995 Linus Torvalds
*
@@ -14,10 +15,11 @@
* This file handles the architecture-dependent parts of process handling..
*/
-#include <linux/stackprotector.h>
#include <linux/cpu.h>
#include <linux/errno.h>
#include <linux/sched.h>
+#include <linux/sched/task.h>
+#include <linux/sched/task_stack.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/mm.h>
@@ -27,321 +29,553 @@
#include <linux/user.h>
#include <linux/interrupt.h>
#include <linux/delay.h>
-#include <linux/module.h>
+#include <linux/export.h>
+#include <linux/kvm_types.h>
#include <linux/ptrace.h>
#include <linux/notifier.h>
#include <linux/kprobes.h>
#include <linux/kdebug.h>
-#include <linux/tick.h>
#include <linux/prctl.h>
#include <linux/uaccess.h>
#include <linux/io.h>
#include <linux/ftrace.h>
+#include <linux/syscalls.h>
+#include <linux/iommu.h>
-#include <asm/pgtable.h>
-#include <asm/system.h>
#include <asm/processor.h>
-#include <asm/i387.h>
+#include <asm/pkru.h>
+#include <asm/fpu/sched.h>
#include <asm/mmu_context.h>
#include <asm/prctl.h>
#include <asm/desc.h>
#include <asm/proto.h>
#include <asm/ia32.h>
-#include <asm/idle.h>
-#include <asm/syscalls.h>
#include <asm/debugreg.h>
+#include <asm/switch_to.h>
+#include <asm/xen/hypervisor.h>
+#include <asm/vdso.h>
+#include <asm/resctrl.h>
+#include <asm/unistd.h>
+#include <asm/fsgsbase.h>
+#include <asm/fred.h>
+#include <asm/msr.h>
+#ifdef CONFIG_IA32_EMULATION
+/* Not included via unistd.h */
+#include <asm/unistd_32_ia32.h>
+#endif
-asmlinkage extern void ret_from_fork(void);
+#include "process.h"
-DEFINE_PER_CPU(unsigned long, old_rsp);
-static DEFINE_PER_CPU(unsigned char, is_idle);
+/* Also prints some state that isn't saved in the pt_regs */
+void __show_regs(struct pt_regs *regs, enum show_regs_mode mode,
+ const char *log_lvl)
+{
+ unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L, fs, gs, shadowgs;
+ unsigned long d0, d1, d2, d3, d6, d7;
+ unsigned int fsindex, gsindex;
+ unsigned int ds, es;
-static ATOMIC_NOTIFIER_HEAD(idle_notifier);
+ show_iret_regs(regs, log_lvl);
-void idle_notifier_register(struct notifier_block *n)
-{
- atomic_notifier_chain_register(&idle_notifier, n);
+ if (regs->orig_ax != -1)
+ pr_cont(" ORIG_RAX: %016lx\n", regs->orig_ax);
+ else
+ pr_cont("\n");
+
+ printk("%sRAX: %016lx RBX: %016lx RCX: %016lx\n",
+ log_lvl, regs->ax, regs->bx, regs->cx);
+ printk("%sRDX: %016lx RSI: %016lx RDI: %016lx\n",
+ log_lvl, regs->dx, regs->si, regs->di);
+ printk("%sRBP: %016lx R08: %016lx R09: %016lx\n",
+ log_lvl, regs->bp, regs->r8, regs->r9);
+ printk("%sR10: %016lx R11: %016lx R12: %016lx\n",
+ log_lvl, regs->r10, regs->r11, regs->r12);
+ printk("%sR13: %016lx R14: %016lx R15: %016lx\n",
+ log_lvl, regs->r13, regs->r14, regs->r15);
+
+ if (mode == SHOW_REGS_SHORT)
+ return;
+
+ if (mode == SHOW_REGS_USER) {
+ rdmsrq(MSR_FS_BASE, fs);
+ rdmsrq(MSR_KERNEL_GS_BASE, shadowgs);
+ printk("%sFS: %016lx GS: %016lx\n",
+ log_lvl, fs, shadowgs);
+ return;
+ }
+
+ asm("movl %%ds,%0" : "=r" (ds));
+ asm("movl %%es,%0" : "=r" (es));
+ asm("movl %%fs,%0" : "=r" (fsindex));
+ asm("movl %%gs,%0" : "=r" (gsindex));
+
+ rdmsrq(MSR_FS_BASE, fs);
+ rdmsrq(MSR_GS_BASE, gs);
+ rdmsrq(MSR_KERNEL_GS_BASE, shadowgs);
+
+ cr0 = read_cr0();
+ cr2 = read_cr2();
+ cr3 = __read_cr3();
+ cr4 = __read_cr4();
+
+ printk("%sFS: %016lx(%04x) GS:%016lx(%04x) knlGS:%016lx\n",
+ log_lvl, fs, fsindex, gs, gsindex, shadowgs);
+ printk("%sCS: %04x DS: %04x ES: %04x CR0: %016lx\n",
+ log_lvl, regs->cs, ds, es, cr0);
+ printk("%sCR2: %016lx CR3: %016lx CR4: %016lx\n",
+ log_lvl, cr2, cr3, cr4);
+
+ get_debugreg(d0, 0);
+ get_debugreg(d1, 1);
+ get_debugreg(d2, 2);
+ get_debugreg(d3, 3);
+ get_debugreg(d6, 6);
+ get_debugreg(d7, 7);
+
+ /* Only print out debug registers if they are in their non-default state. */
+ if (!((d0 == 0) && (d1 == 0) && (d2 == 0) && (d3 == 0) &&
+ (d6 == DR6_RESERVED) && (d7 == DR7_FIXED_1))) {
+ printk("%sDR0: %016lx DR1: %016lx DR2: %016lx\n",
+ log_lvl, d0, d1, d2);
+ printk("%sDR3: %016lx DR6: %016lx DR7: %016lx\n",
+ log_lvl, d3, d6, d7);
+ }
+
+ if (cr4 & X86_CR4_PKE)
+ printk("%sPKRU: %08x\n", log_lvl, read_pkru());
}
-EXPORT_SYMBOL_GPL(idle_notifier_register);
-void idle_notifier_unregister(struct notifier_block *n)
+void release_thread(struct task_struct *dead_task)
{
- atomic_notifier_chain_unregister(&idle_notifier, n);
+ WARN_ON(dead_task->mm);
}
-EXPORT_SYMBOL_GPL(idle_notifier_unregister);
-void enter_idle(void)
+enum which_selector {
+ FS,
+ GS
+};
+
+/*
+ * Out of line to be protected from kprobes and tracing. If this were
+ * traced or probed, any access to a per-CPU variable would happen
+ * with the wrong GS.
+ *
+ * It is not used on Xen paravirt. When paravirt support is needed, it
+ * needs to be renamed with native_ prefix.
+ */
+static noinstr unsigned long __rdgsbase_inactive(void)
{
- percpu_write(is_idle, 1);
- atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
+ unsigned long gsbase;
+
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * SWAPGS is no longer needed thus NOT allowed with FRED because
+ * FRED transitions ensure that an operating system can _always_
+ * operate with its own GS base address:
+ * - For events that occur in ring 3, FRED event delivery swaps
+ * the GS base address with the IA32_KERNEL_GS_BASE MSR.
+ * - ERETU (the FRED transition that returns to ring 3) also swaps
+ * the GS base address with the IA32_KERNEL_GS_BASE MSR.
+ *
+ * And the operating system can still setup the GS segment for a
+ * user thread without the need of loading a user thread GS with:
+ * - Using LKGS, available with FRED, to modify other attributes
+ * of the GS segment without compromising its ability always to
+ * operate with its own GS base address.
+ * - Accessing the GS segment base address for a user thread as
+ * before using RDMSR or WRMSR on the IA32_KERNEL_GS_BASE MSR.
+ *
+ * Note, LKGS loads the GS base address into the IA32_KERNEL_GS_BASE
+ * MSR instead of the GS segment's descriptor cache. As such, the
+ * operating system never changes its runtime GS base address.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+ !cpu_feature_enabled(X86_FEATURE_XENPV)) {
+ native_swapgs();
+ gsbase = rdgsbase();
+ native_swapgs();
+ } else {
+ instrumentation_begin();
+ rdmsrq(MSR_KERNEL_GS_BASE, gsbase);
+ instrumentation_end();
+ }
+
+ return gsbase;
}
-static void __exit_idle(void)
+/*
+ * Out of line to be protected from kprobes and tracing. If this were
+ * traced or probed, any access to a per-CPU variable would happen
+ * with the wrong GS.
+ *
+ * It is not used on Xen paravirt. When paravirt support is needed, it
+ * needs to be renamed with native_ prefix.
+ */
+static noinstr void __wrgsbase_inactive(unsigned long gsbase)
{
- if (x86_test_and_clear_bit_percpu(0, is_idle) == 0)
- return;
- atomic_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
+ lockdep_assert_irqs_disabled();
+
+ if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+ !cpu_feature_enabled(X86_FEATURE_XENPV)) {
+ native_swapgs();
+ wrgsbase(gsbase);
+ native_swapgs();
+ } else {
+ instrumentation_begin();
+ wrmsrq(MSR_KERNEL_GS_BASE, gsbase);
+ instrumentation_end();
+ }
}
-/* Called from interrupts to signify idle end */
-void exit_idle(void)
+/*
+ * Saves the FS or GS base for an outgoing thread if FSGSBASE extensions are
+ * not available. The goal is to be reasonably fast on non-FSGSBASE systems.
+ * It's forcibly inlined because it'll generate better code and this function
+ * is hot.
+ */
+static __always_inline void save_base_legacy(struct task_struct *prev_p,
+ unsigned short selector,
+ enum which_selector which)
{
- /* idle loop has pid 0 */
- if (current->pid)
- return;
- __exit_idle();
+ if (likely(selector == 0)) {
+ /*
+ * On Intel (without X86_BUG_NULL_SEG), the segment base could
+ * be the pre-existing saved base or it could be zero. On AMD
+ * (with X86_BUG_NULL_SEG), the segment base could be almost
+ * anything.
+ *
+ * This branch is very hot (it's hit twice on almost every
+ * context switch between 64-bit programs), and avoiding
+ * the RDMSR helps a lot, so we just assume that whatever
+ * value is already saved is correct. This matches historical
+ * Linux behavior, so it won't break existing applications.
+ *
+ * To avoid leaking state, on non-X86_BUG_NULL_SEG CPUs, if we
+ * report that the base is zero, it needs to actually be zero:
+ * see the corresponding logic in load_seg_legacy.
+ */
+ } else {
+ /*
+ * If the selector is 1, 2, or 3, then the base is zero on
+ * !X86_BUG_NULL_SEG CPUs and could be anything on
+ * X86_BUG_NULL_SEG CPUs. In the latter case, Linux
+ * has never attempted to preserve the base across context
+ * switches.
+ *
+ * If selector > 3, then it refers to a real segment, and
+ * saving the base isn't necessary.
+ */
+ if (which == FS)
+ prev_p->thread.fsbase = 0;
+ else
+ prev_p->thread.gsbase = 0;
+ }
}
-#ifndef CONFIG_SMP
-static inline void play_dead(void)
+static __always_inline void save_fsgs(struct task_struct *task)
{
- BUG();
+ savesegment(fs, task->thread.fsindex);
+ savesegment(gs, task->thread.gsindex);
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ /*
+ * If FSGSBASE is enabled, we can't make any useful guesses
+ * about the base, and user code expects us to save the current
+ * value. Fortunately, reading the base directly is efficient.
+ */
+ task->thread.fsbase = rdfsbase();
+ task->thread.gsbase = __rdgsbase_inactive();
+ } else {
+ save_base_legacy(task, task->thread.fsindex, FS);
+ save_base_legacy(task, task->thread.gsindex, GS);
+ }
}
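On kernels that enable CR4.FSGSBASE (v5.9+), user space can read its own FS/GS bases with the same instructions save_fsgs() relies on. A sketch, compiled with -mfsgsbase; it will #UD (crash) on older kernels or CPUs, so real code should probe for the "fsgsbase" flag in /proc/cpuinfo first:

	#include <immintrin.h>
	#include <stdio.h>

	int main(void)
	{
		/* RDFSBASE/RDGSBASE: the fast path taken when
		 * X86_FEATURE_FSGSBASE is set. */
		printf("fsbase=%#llx gsbase=%#llx\n",
		       (unsigned long long)_readfsbase_u64(),
		       (unsigned long long)_readgsbase_u64());
		return 0;
	}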
-#endif
/*
- * The idle thread. There's no useful work to be
- * done, so just try to conserve power and have a
- * low exit latency (ie sit in a loop waiting for
- * somebody to say that they'd like to reschedule)
+ * While a process is running, current->thread.fsbase and
+ * current->thread.gsbase may not match the corresponding CPU
+ * registers (see save_base_legacy()).
*/
-void cpu_idle(void)
+void current_save_fsgs(void)
{
- current_thread_info()->status |= TS_POLLING;
+ unsigned long flags;
- /*
- * If we're the non-boot CPU, nothing set the stack canary up
- * for us. CPU0 already has it initialized but no harm in
- * doing it again. This is a good place for updating it, as
- * we wont ever return from this function (so the invalid
- * canaries already on the stack wont ever trigger).
- */
- boot_init_stack_canary();
-
- /* endless idle loop with no priority at all */
- while (1) {
- tick_nohz_stop_sched_tick(1);
- while (!need_resched()) {
+ /* Interrupts need to be off for FSGSBASE */
+ local_irq_save(flags);
+ save_fsgs(current);
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL_FOR_KVM(current_save_fsgs);
- rmb();
+static __always_inline void loadseg(enum which_selector which,
+ unsigned short sel)
+{
+ if (which == FS)
+ loadsegment(fs, sel);
+ else
+ load_gs_index(sel);
+}
- if (cpu_is_offline(smp_processor_id()))
- play_dead();
+static __always_inline void load_seg_legacy(unsigned short prev_index,
+ unsigned long prev_base,
+ unsigned short next_index,
+ unsigned long next_base,
+ enum which_selector which)
+{
+ if (likely(next_index <= 3)) {
+ /*
+ * The next task is using 64-bit TLS, is not using this
+ * segment at all, or is having fun with arcane CPU features.
+ */
+ if (next_base == 0) {
/*
- * Idle routines should keep interrupts disabled
- * from here on, until they go to idle.
- * Otherwise, idle callbacks can misfire.
+ * Nasty case: on AMD CPUs, we need to forcibly zero
+ * the base.
*/
- local_irq_disable();
- enter_idle();
- /* Don't trace irqs off for idle */
- stop_critical_timings();
- pm_idle();
- start_critical_timings();
- /* In many cases the interrupt that ended idle
- has already called exit_idle. But some idle
- loops can be woken up without interrupt. */
- __exit_idle();
+ if (static_cpu_has_bug(X86_BUG_NULL_SEG)) {
+ loadseg(which, __USER_DS);
+ loadseg(which, next_index);
+ } else {
+ /*
+ * We could try to exhaustively detect cases
+ * under which we can skip the segment load,
+ * but there's really only one case that matters
+ * for performance: if both the previous and
+ * next states are fully zeroed, we can skip
+ * the load.
+ *
+ * (This assumes that prev_base == 0 has no
+ * false positives. This is the case on
+ * Intel-style CPUs.)
+ */
+ if (likely(prev_index | next_index | prev_base))
+ loadseg(which, next_index);
+ }
+ } else {
+ if (prev_index != next_index)
+ loadseg(which, next_index);
+ wrmsrq(which == FS ? MSR_FS_BASE : MSR_KERNEL_GS_BASE,
+ next_base);
}
-
- tick_nohz_restart_sched_tick();
- preempt_enable_no_resched();
- schedule();
- preempt_disable();
+ } else {
+ /*
+ * The next task is using a real segment. Loading the selector
+ * is sufficient.
+ */
+ loadseg(which, next_index);
}
}
-/* Prints also some state that isn't saved in the pt_regs */
-void __show_regs(struct pt_regs *regs, int all)
+/*
+ * Store prev's PKRU value and load next's PKRU value if they differ. PKRU
+ * is not XSTATE managed on context switch because that would require a
+ * lookup in the task's FPU xsave buffer and require to keep that updated
+ * in various places.
+ */
+static __always_inline void x86_pkru_load(struct thread_struct *prev,
+ struct thread_struct *next)
{
- unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L, fs, gs, shadowgs;
- unsigned long d0, d1, d2, d3, d6, d7;
- unsigned int fsindex, gsindex;
- unsigned int ds, cs, es;
-
- show_regs_common();
- printk(KERN_DEFAULT "RIP: %04lx:[<%016lx>] ", regs->cs & 0xffff, regs->ip);
- printk_address(regs->ip, 1);
- printk(KERN_DEFAULT "RSP: %04lx:%016lx EFLAGS: %08lx\n", regs->ss,
- regs->sp, regs->flags);
- printk(KERN_DEFAULT "RAX: %016lx RBX: %016lx RCX: %016lx\n",
- regs->ax, regs->bx, regs->cx);
- printk(KERN_DEFAULT "RDX: %016lx RSI: %016lx RDI: %016lx\n",
- regs->dx, regs->si, regs->di);
- printk(KERN_DEFAULT "RBP: %016lx R08: %016lx R09: %016lx\n",
- regs->bp, regs->r8, regs->r9);
- printk(KERN_DEFAULT "R10: %016lx R11: %016lx R12: %016lx\n",
- regs->r10, regs->r11, regs->r12);
- printk(KERN_DEFAULT "R13: %016lx R14: %016lx R15: %016lx\n",
- regs->r13, regs->r14, regs->r15);
-
- asm("movl %%ds,%0" : "=r" (ds));
- asm("movl %%cs,%0" : "=r" (cs));
- asm("movl %%es,%0" : "=r" (es));
- asm("movl %%fs,%0" : "=r" (fsindex));
- asm("movl %%gs,%0" : "=r" (gsindex));
-
- rdmsrl(MSR_FS_BASE, fs);
- rdmsrl(MSR_GS_BASE, gs);
- rdmsrl(MSR_KERNEL_GS_BASE, shadowgs);
-
- if (!all)
+ if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
return;
- cr0 = read_cr0();
- cr2 = read_cr2();
- cr3 = read_cr3();
- cr4 = read_cr4();
-
- printk(KERN_DEFAULT "FS: %016lx(%04x) GS:%016lx(%04x) knlGS:%016lx\n",
- fs, fsindex, gs, gsindex, shadowgs);
- printk(KERN_DEFAULT "CS: %04x DS: %04x ES: %04x CR0: %016lx\n", cs, ds,
- es, cr0);
- printk(KERN_DEFAULT "CR2: %016lx CR3: %016lx CR4: %016lx\n", cr2, cr3,
- cr4);
+ /* Stash the prev task's value: */
+ prev->pkru = rdpkru();
- get_debugreg(d0, 0);
- get_debugreg(d1, 1);
- get_debugreg(d2, 2);
- printk(KERN_DEFAULT "DR0: %016lx DR1: %016lx DR2: %016lx\n", d0, d1, d2);
- get_debugreg(d3, 3);
- get_debugreg(d6, 6);
- get_debugreg(d7, 7);
- printk(KERN_DEFAULT "DR3: %016lx DR6: %016lx DR7: %016lx\n", d3, d6, d7);
+ /*
+ * PKRU writes are slightly expensive. Avoid them when not
+ * strictly necessary:
+ */
+ if (prev->pkru != next->pkru)
+ wrpkru(next->pkru);
}
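The PKRU value being context-switched here is the one user space programs through the pkey API. An illustrative round trip (assumes glibc 2.27+ and an OSPKE-capable CPU; pkey_alloc() fails cleanly otherwise):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		int pkey = pkey_alloc(0, PKEY_DISABLE_WRITE); /* writes PKRU */
		char *p;

		if (pkey < 0) {
			perror("pkey_alloc");                 /* no OSPKE here */
			return 1;
		}
		p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		pkey_mprotect(p, 4096, PROT_READ | PROT_WRITE, pkey);
		printf("read ok: %d\n", p[0]);                /* reads allowed */
		/* p[0] = 'x'; would SIGSEGV: PKRU forbids writes for this key */
		pkey_free(pkey);
		return 0;
	}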
-void release_thread(struct task_struct *dead_task)
+static __always_inline void x86_fsgsbase_load(struct thread_struct *prev,
+ struct thread_struct *next)
{
- if (dead_task->mm) {
- if (dead_task->mm->context.size) {
- printk("WARNING: dead process %8s still has LDT? <%p/%d>\n",
- dead_task->comm,
- dead_task->mm->context.ldt,
- dead_task->mm->context.size);
- BUG();
- }
+ if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+ /* Update the FS and GS selectors if they could have changed. */
+ if (unlikely(prev->fsindex || next->fsindex))
+ loadseg(FS, next->fsindex);
+ if (unlikely(prev->gsindex || next->gsindex))
+ loadseg(GS, next->gsindex);
+
+ /* Update the bases. */
+ wrfsbase(next->fsbase);
+ __wrgsbase_inactive(next->gsbase);
+ } else {
+ load_seg_legacy(prev->fsindex, prev->fsbase,
+ next->fsindex, next->fsbase, FS);
+ load_seg_legacy(prev->gsindex, prev->gsbase,
+ next->gsindex, next->gsbase, GS);
}
}
-static inline void set_32bit_tls(struct task_struct *t, int tls, u32 addr)
+unsigned long x86_fsgsbase_read_task(struct task_struct *task,
+ unsigned short selector)
{
- struct user_desc ud = {
- .base_addr = addr,
- .limit = 0xfffff,
- .seg_32bit = 1,
- .limit_in_pages = 1,
- .useable = 1,
- };
- struct desc_struct *desc = t->thread.tls_array;
- desc += tls;
- fill_ldt(desc, &ud);
+ unsigned short idx = selector >> 3;
+ unsigned long base;
+
+ if (likely((selector & SEGMENT_TI_MASK) == 0)) {
+ if (unlikely(idx >= GDT_ENTRIES))
+ return 0;
+
+ /*
+ * There are no user segments in the GDT with nonzero bases
+ * other than the TLS segments.
+ */
+ if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX)
+ return 0;
+
+ idx -= GDT_ENTRY_TLS_MIN;
+ base = get_desc_base(&task->thread.tls_array[idx]);
+ } else {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ struct ldt_struct *ldt;
+
+ /*
+ * If performance here mattered, we could protect the LDT
+ * with RCU. This is a slow path, though, so we can just
+ * take the mutex.
+ */
+ mutex_lock(&task->mm->context.lock);
+ ldt = task->mm->context.ldt;
+ if (unlikely(!ldt || idx >= ldt->nr_entries))
+ base = 0;
+ else
+ base = get_desc_base(ldt->entries + idx);
+ mutex_unlock(&task->mm->context.lock);
+#else
+ base = 0;
+#endif
+ }
+
+ return base;
}
-static inline u32 read_32bit_tls(struct task_struct *t, int tls)
+unsigned long x86_gsbase_read_cpu_inactive(void)
{
- return get_desc_base(&t->thread.tls_array[tls]);
+ unsigned long gsbase;
+
+ if (boot_cpu_has(X86_FEATURE_FSGSBASE)) {
+ unsigned long flags;
+
+ local_irq_save(flags);
+ gsbase = __rdgsbase_inactive();
+ local_irq_restore(flags);
+ } else {
+ rdmsrq(MSR_KERNEL_GS_BASE, gsbase);
+ }
+
+ return gsbase;
}
-/*
- * This gets called before we allocate a new thread and copy
- * the current task into it.
- */
-void prepare_to_copy(struct task_struct *tsk)
+void x86_gsbase_write_cpu_inactive(unsigned long gsbase)
{
- unlazy_fpu(tsk);
+ if (boot_cpu_has(X86_FEATURE_FSGSBASE)) {
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __wrgsbase_inactive(gsbase);
+ local_irq_restore(flags);
+ } else {
+ wrmsrq(MSR_KERNEL_GS_BASE, gsbase);
+ }
}
-int copy_thread(unsigned long clone_flags, unsigned long sp,
- unsigned long unused,
- struct task_struct *p, struct pt_regs *regs)
+unsigned long x86_fsbase_read_task(struct task_struct *task)
{
- int err;
- struct pt_regs *childregs;
- struct task_struct *me = current;
+ unsigned long fsbase;
- childregs = ((struct pt_regs *)
- (THREAD_SIZE + task_stack_page(p))) - 1;
- *childregs = *regs;
-
- childregs->ax = 0;
- if (user_mode(regs))
- childregs->sp = sp;
+ if (task == current)
+ fsbase = x86_fsbase_read_cpu();
+ else if (boot_cpu_has(X86_FEATURE_FSGSBASE) ||
+ (task->thread.fsindex == 0))
+ fsbase = task->thread.fsbase;
else
- childregs->sp = (unsigned long)childregs;
+ fsbase = x86_fsgsbase_read_task(task, task->thread.fsindex);
- p->thread.sp = (unsigned long) childregs;
- p->thread.sp0 = (unsigned long) (childregs+1);
- p->thread.usersp = me->thread.usersp;
+ return fsbase;
+}
- set_tsk_thread_flag(p, TIF_FORK);
+unsigned long x86_gsbase_read_task(struct task_struct *task)
+{
+ unsigned long gsbase;
- p->thread.io_bitmap_ptr = NULL;
+ if (task == current)
+ gsbase = x86_gsbase_read_cpu_inactive();
+ else if (boot_cpu_has(X86_FEATURE_FSGSBASE) ||
+ (task->thread.gsindex == 0))
+ gsbase = task->thread.gsbase;
+ else
+ gsbase = x86_fsgsbase_read_task(task, task->thread.gsindex);
- savesegment(gs, p->thread.gsindex);
- p->thread.gs = p->thread.gsindex ? 0 : me->thread.gs;
- savesegment(fs, p->thread.fsindex);
- p->thread.fs = p->thread.fsindex ? 0 : me->thread.fs;
- savesegment(es, p->thread.es);
- savesegment(ds, p->thread.ds);
+ return gsbase;
+}
- err = -ENOMEM;
- memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
+void x86_fsbase_write_task(struct task_struct *task, unsigned long fsbase)
+{
+ WARN_ON_ONCE(task == current);
- if (unlikely(test_tsk_thread_flag(me, TIF_IO_BITMAP))) {
- p->thread.io_bitmap_ptr = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
- if (!p->thread.io_bitmap_ptr) {
- p->thread.io_bitmap_max = 0;
- return -ENOMEM;
- }
- memcpy(p->thread.io_bitmap_ptr, me->thread.io_bitmap_ptr,
- IO_BITMAP_BYTES);
- set_tsk_thread_flag(p, TIF_IO_BITMAP);
- }
+ task->thread.fsbase = fsbase;
+}
- /*
- * Set a new TLS for the child thread?
- */
- if (clone_flags & CLONE_SETTLS) {
-#ifdef CONFIG_IA32_EMULATION
- if (test_thread_flag(TIF_IA32))
- err = do_set_thread_area(p, -1,
- (struct user_desc __user *)childregs->si, 0);
- else
-#endif
- err = do_arch_prctl(p, ARCH_SET_FS, childregs->r8);
- if (err)
- goto out;
- }
- err = 0;
-out:
- if (err && p->thread.io_bitmap_ptr) {
- kfree(p->thread.io_bitmap_ptr);
- p->thread.io_bitmap_max = 0;
- }
+void x86_gsbase_write_task(struct task_struct *task, unsigned long gsbase)
+{
+ WARN_ON_ONCE(task == current);
- return err;
+ task->thread.gsbase = gsbase;
}
static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
unsigned long new_sp,
- unsigned int _cs, unsigned int _ss, unsigned int _ds)
+ u16 _cs, u16 _ss, u16 _ds)
{
+ WARN_ON_ONCE(regs != current_pt_regs());
+
+ if (static_cpu_has(X86_BUG_NULL_SEG)) {
+ /* Loading zero below won't clear the base. */
+ loadsegment(fs, __USER_DS);
+ load_gs_index(__USER_DS);
+ }
+
+ reset_thread_features();
+
loadsegment(fs, 0);
loadsegment(es, _ds);
loadsegment(ds, _ds);
load_gs_index(0);
- regs->ip = new_ip;
- regs->sp = new_sp;
- percpu_write(old_rsp, new_sp);
- regs->cs = _cs;
- regs->ss = _ss;
- regs->flags = X86_EFLAGS_IF;
- set_fs(USER_DS);
+
+ regs->ip = new_ip;
+ regs->sp = new_sp;
+ regs->csx = _cs;
+ regs->ssx = _ss;
/*
- * Free the old FP and other extended state
+ * Allow single-step trap and NMI when starting a new task, thus
+ * once the new task enters user space, single-step trap and NMI
+ * are both enabled immediately.
+ *
+ * Entering a new task is logically speaking a return from a
+ * system call (exec, fork, clone, etc.). As such, if ptrace
+	 * enables single stepping, a single-step exception should be
+ * allowed to trigger immediately upon entering user space.
+ * This is not optional.
+ *
+ * NMI should *never* be disabled in user space. As such, this
+ * is an optional, opportunistic way to catch errors.
+ *
+ * Paranoia: High-order 48 bits above the lowest 16 bit SS are
+ * discarded by the legacy IRET instruction on all Intel, AMD,
+ * and Cyrix/Centaur/VIA CPUs, thus can be set unconditionally,
+ * even when FRED is not enabled. But we choose the safer side
+ * to use these bits only when FRED is enabled.
*/
- free_thread_xstate(current);
+ if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+ regs->fred_ss.swevent = true;
+ regs->fred_ss.nmi = true;
+ }
+
+ regs->flags = X86_EFLAGS_IF | X86_EFLAGS_FIXED;
}
void
@@ -350,12 +584,14 @@ start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
start_thread_common(regs, new_ip, new_sp,
__USER_CS, __USER_DS, 0);
}
+EXPORT_SYMBOL_GPL(start_thread);
-#ifdef CONFIG_IA32_EMULATION
-void start_thread_ia32(struct pt_regs *regs, u32 new_ip, u32 new_sp)
+#ifdef CONFIG_COMPAT
+void compat_start_thread(struct pt_regs *regs, u32 new_ip, u32 new_sp, bool x32)
{
start_thread_common(regs, new_ip, new_sp,
- __USER32_CS, __USER32_DS, __USER32_DS);
+ x32 ? __USER_CS : __USER32_CS,
+ __USER_DS, __USER_DS);
}
#endif
@@ -369,35 +605,52 @@ void start_thread_ia32(struct pt_regs *regs, u32 new_ip, u32 new_sp)
* Kprobes not supported here. Set the probe on schedule instead.
* Function graph tracer not supported too.
*/
-__notrace_funcgraph struct task_struct *
+__no_kmsan_checks
+__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
struct thread_struct *prev = &prev_p->thread;
struct thread_struct *next = &next_p->thread;
int cpu = smp_processor_id();
- struct tss_struct *tss = &per_cpu(init_tss, cpu);
- unsigned fsindex, gsindex;
- bool preload_fpu;
- /*
- * If the task has used fpu the last 5 timeslices, just do a full
- * restore of the math state immediately to avoid the trap; the
- * chances of needing FPU soon are obviously high now
- */
- preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
+ this_cpu_read(hardirq_stack_inuse));
- /* we're going to use this soon, after a few expensive things */
- if (preload_fpu)
- prefetch(next->fpu.state);
+ switch_fpu(prev_p, cpu);
+
+ /* We must save %fs and %gs before load_TLS() because
+ * %fs and %gs may be cleared by load_TLS().
+ *
+ * (e.g. xen_load_tls())
+ */
+ save_fsgs(prev_p);
/*
- * Reload esp0, LDT and the page table pointer:
+ * Load TLS before restoring any segments so that segment loads
+ * reference the correct GDT entries.
*/
- load_sp0(tss, next);
+ load_TLS(next, cpu);
/*
- * Switch DS and ES.
- * This won't pick up thread selector changes, but I guess that is ok.
+ * Leave lazy mode, flushing any hypercalls made here. This
+ * must be done after loading TLS entries in the GDT but before
+ * loading segments that might reference them.
+ */
+ arch_end_context_switch(next_p);
+
+ /* Switch DS and ES.
+ *
+ * Reading them only returns the selectors, but writing them (if
+ * nonzero) loads the full descriptor from the GDT or LDT. The
+ * LDT for next is loaded in switch_mm, and the GDT is loaded
+ * above.
+ *
+ * We therefore need to write new values to the segment
+ * registers on every context switch unless both the new and old
+ * values are zero.
+ *
+ * Note that we don't need to do anything for CS and SS, as
+ * those are saved and restored as part of pt_regs.
*/
savesegment(es, prev->es);
if (unlikely(next->es | prev->es))
@@ -407,88 +660,55 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (unlikely(next->ds | prev->ds))
loadsegment(ds, next->ds);
+ x86_fsgsbase_load(prev, next);
- /* We must save %fs and %gs before load_TLS() because
- * %fs and %gs may be cleared by load_TLS().
- *
- * (e.g. xen_load_tls())
- */
- savesegment(fs, fsindex);
- savesegment(gs, gsindex);
-
- load_TLS(next, cpu);
-
- /* Must be after DS reload */
- unlazy_fpu(prev_p);
-
- /* Make sure cpu is ready for new context */
- if (preload_fpu)
- clts();
+ x86_pkru_load(prev, next);
/*
- * Leave lazy mode, flushing any hypercalls made here.
- * This must be done before restoring TLS segments so
- * the GDT and LDT are properly updated, and must be
- * done before math_state_restore, so the TS bit is up
- * to date.
+ * Switch the PDA and FPU contexts.
*/
- arch_end_context_switch(next_p);
+ raw_cpu_write(current_task, next_p);
+ raw_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
- /*
- * Switch FS and GS.
- *
- * Segment register != 0 always requires a reload. Also
- * reload when it has changed. When prev process used 64bit
- * base always reload to avoid an information leak.
- */
- if (unlikely(fsindex | next->fsindex | prev->fs)) {
- loadsegment(fs, next->fsindex);
+ /* Reload sp0. */
+ update_task_stack(next_p);
+
+ switch_to_extra(prev_p, next_p);
+
+ if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) {
/*
- * Check if the user used a selector != 0; if yes
- * clear 64bit base, since overloaded base is always
- * mapped to the Null selector
+ * AMD CPUs have a misfeature: SYSRET sets the SS selector but
+ * does not update the cached descriptor. As a result, if we
+ * do SYSRET while SS is NULL, we'll end up in user mode with
+ * SS apparently equal to __USER_DS but actually unusable.
+ *
+ * The straightforward workaround would be to fix it up just
+ * before SYSRET, but that would slow down the system call
+ * fast paths. Instead, we ensure that SS is never NULL in
+ * system call context. We do this by replacing NULL SS
+ * selectors at every context switch. SYSCALL sets up a valid
+ * SS, so the only way to get NULL is to re-enter the kernel
+ * from CPL 3 through an interrupt. Since that can't happen
+ * in the same task as a running syscall, we are guaranteed to
+ * context switch between every interrupt vector entry and a
+ * subsequent SYSRET.
+ *
+ * We read SS first because SS reads are much faster than
+ * writes. Out of caution, we force SS to __KERNEL_DS even if
+ * it previously had a different non-NULL value.
*/
- if (fsindex)
- prev->fs = 0;
+ unsigned short ss_sel;
+ savesegment(ss, ss_sel);
+ if (ss_sel != __KERNEL_DS)
+ loadsegment(ss, __KERNEL_DS);
}
- /* when next process has a 64bit base use it */
- if (next->fs)
- wrmsrl(MSR_FS_BASE, next->fs);
- prev->fsindex = fsindex;
-
- if (unlikely(gsindex | next->gsindex | prev->gs)) {
- load_gs_index(next->gsindex);
- if (gsindex)
- prev->gs = 0;
- }
- if (next->gs)
- wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
- prev->gsindex = gsindex;
-
- /*
- * Switch the PDA and FPU contexts.
- */
- prev->usersp = percpu_read(old_rsp);
- percpu_write(old_rsp, next->usersp);
- percpu_write(current_task, next_p);
- percpu_write(kernel_stack,
- (unsigned long)task_stack_page(next_p) +
- THREAD_SIZE - KERNEL_STACK_OFFSET);
+ /* Load the Intel cache allocation PQR MSR. */
+ resctrl_arch_sched_in(next_p);
- /*
- * Now maybe reload the debug registers and handle I/O bitmaps
- */
- if (unlikely(task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT ||
- task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV))
- __switch_to_xtra(prev_p, next_p, tss);
-
- /*
- * Preload the FPU context, now that we've determined that the
- * task is likely to be using it.
- */
- if (preload_fpu)
- __math_state_restore();
+ /* Reset hw history on AMD CPUs */
+ if (cpu_feature_enabled(X86_FEATURE_AMD_WORKLOAD_CLASS))
+ wrmsrl(MSR_AMD_WORKLOAD_HRST, 0x1);
return prev_p;
}
@@ -498,138 +718,261 @@ void set_personality_64bit(void)
/* inherit personality from parent */
/* Make sure to be in 64bit mode */
- clear_thread_flag(TIF_IA32);
+ clear_thread_flag(TIF_ADDR32);
+ /* Pretend that this comes from a 64bit execve */
+ task_pt_regs(current)->orig_ax = __NR_execve;
+ current_thread_info()->status &= ~TS_COMPAT;
+ if (current->mm)
+ __set_bit(MM_CONTEXT_HAS_VSYSCALL, &current->mm->context.flags);
/* TBD: overwrites user setup. Should have two bits.
But 64bit processes have always behaved this way,
so it's not too bad. The main problem is just that
- 32bit childs are affected again. */
+ 32bit children are affected again. */
current->personality &= ~READ_IMPLIES_EXEC;
}
-void set_personality_ia32(void)
+static void __set_personality_x32(void)
{
- /* inherit personality from parent */
+#ifdef CONFIG_X86_X32_ABI
+ if (current->mm)
+ current->mm->context.flags = 0;
- /* Make sure to be in 32bit mode */
- set_thread_flag(TIF_IA32);
- current->personality |= force_personality32;
+ current->personality &= ~READ_IMPLIES_EXEC;
+ /*
+ * in_32bit_syscall() uses the presence of the x32 syscall bit
+ * flag to determine compat status. The x86 mmap() code relies on
+ * the syscall bitness so set x32 syscall bit right here to make
+ * in_32bit_syscall() work during exec().
+ *
+ * Pretend to come from a x32 execve.
+ */
+ task_pt_regs(current)->orig_ax = __NR_x32_execve | __X32_SYSCALL_BIT;
+ current_thread_info()->status &= ~TS_COMPAT;
+#endif
+}
+
+static void __set_personality_ia32(void)
+{
+#ifdef CONFIG_IA32_EMULATION
+ if (current->mm) {
+ /*
+ * uprobes applied to this MM need to know this and
+ * cannot use user_64bit_mode() at that time.
+ */
+ __set_bit(MM_CONTEXT_UPROBE_IA32, &current->mm->context.flags);
+ }
+ current->personality |= force_personality32;
/* Prepare the first "return" to user space */
+ task_pt_regs(current)->orig_ax = __NR_ia32_execve;
current_thread_info()->status |= TS_COMPAT;
+#endif
}
-unsigned long get_wchan(struct task_struct *p)
+void set_personality_ia32(bool x32)
{
- unsigned long stack;
- u64 fp, ip;
- int count = 0;
+ /* Make sure to be in 32bit mode */
+ set_thread_flag(TIF_ADDR32);
+
+ if (x32)
+ __set_personality_x32();
+ else
+ __set_personality_ia32();
+}
+EXPORT_SYMBOL_GPL(set_personality_ia32);
+
+#ifdef CONFIG_CHECKPOINT_RESTORE
+static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
+{
+ int ret;
+
+ ret = map_vdso_once(image, addr);
+ if (ret)
+ return ret;
+
+ return (long)image->size;
+}
+#endif
+
+#ifdef CONFIG_ADDRESS_MASKING
+
+#define LAM_U57_BITS 6
+
+static void enable_lam_func(void *__mm)
+{
+ struct mm_struct *mm = __mm;
+ unsigned long lam;
+
+ if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm) {
+ lam = mm_lam_cr3_mask(mm);
+ write_cr3(__read_cr3() | lam);
+ cpu_tlbstate_update_lam(lam, mm_untag_mask(mm));
+ }
+}
+
+static void mm_enable_lam(struct mm_struct *mm)
+{
+ mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
+ mm->context.untag_mask = ~GENMASK(62, 57);
+
+ /*
+ * Even though the process must still be single-threaded at this
+ * point, kernel threads may be using the mm. IPI those kernel
+ * threads if they exist.
+ */
+ on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true);
+ set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags);
+}
+
+static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_LAM))
+ return -ENODEV;
+
+ /* PTRACE_ARCH_PRCTL */
+ if (current->mm != mm)
+ return -EINVAL;
+
+ if (mm_valid_pasid(mm) &&
+ !test_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &mm->context.flags))
+ return -EINVAL;
+
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+
+ /*
+ * MM_CONTEXT_LOCK_LAM is set on clone. Prevent LAM from
+ * being enabled unless the process is single threaded:
+ */
+ if (test_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags)) {
+ mmap_write_unlock(mm);
+ return -EBUSY;
+ }
+
+ if (!nr_bits || nr_bits > LAM_U57_BITS) {
+ mmap_write_unlock(mm);
+ return -EINVAL;
+ }
+
+ mm_enable_lam(mm);
+
+ mmap_write_unlock(mm);
- if (!p || p == current || p->state == TASK_RUNNING)
- return 0;
- stack = (unsigned long)task_stack_page(p);
- if (p->thread.sp < stack || p->thread.sp >= stack+THREAD_SIZE)
- return 0;
- fp = *(u64 *)(p->thread.sp);
- do {
- if (fp < (unsigned long)stack ||
- fp >= (unsigned long)stack+THREAD_SIZE)
- return 0;
- ip = *(u64 *)(fp+8);
- if (!in_sched_functions(ip))
- return ip;
- fp = *(u64 *)fp;
- } while (count++ < 16);
return 0;
}
+#endif
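User space opts into LAM through arch_prctl(). A sketch assuming a LAM-capable CPU and kernel (v6.4+); the option values are copied from asm/prctl.h and guarded in case older headers lack them:

	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef ARCH_ENABLE_TAGGED_ADDR          /* values from asm/prctl.h */
	#define ARCH_ENABLE_TAGGED_ADDR 0x4002
	#define ARCH_GET_MAX_TAG_BITS   0x4003
	#endif

	int main(void)
	{
		unsigned long bits = 0;

		syscall(SYS_arch_prctl, ARCH_GET_MAX_TAG_BITS, &bits);
		printf("max tag bits: %lu\n", bits); /* 6 with LAM_U57, else 0 */

		if (bits &&
		    syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, bits) == 0)
			puts("LAM_U57 on: pointer bits 62:57 are now ignored");
		return 0;
	}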
-long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
+long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
{
int ret = 0;
- int doit = task == current;
- int cpu;
- switch (code) {
- case ARCH_SET_GS:
- if (addr >= TASK_SIZE_OF(task))
+ switch (option) {
+ case ARCH_SET_GS: {
+ if (unlikely(arg2 >= TASK_SIZE_MAX))
return -EPERM;
- cpu = get_cpu();
- /* handle small bases via the GDT because that's faster to
- switch. */
- if (addr <= 0xffffffff) {
- set_32bit_tls(task, GS_TLS, addr);
- if (doit) {
- load_TLS(&task->thread, cpu);
- load_gs_index(GS_TLS_SEL);
- }
- task->thread.gsindex = GS_TLS_SEL;
- task->thread.gs = 0;
+
+ preempt_disable();
+ /*
+ * ARCH_SET_GS has always overwritten the index
+ * and the base. Zero is the most sensible value
+ * to put in the index, and is the only value that
+ * makes any sense if FSGSBASE is unavailable.
+ */
+ if (task == current) {
+ loadseg(GS, 0);
+ x86_gsbase_write_cpu_inactive(arg2);
+
+ /*
+ * On non-FSGSBASE systems, save_base_legacy() expects
+ * that we also fill in thread.gsbase.
+ */
+ task->thread.gsbase = arg2;
+
} else {
task->thread.gsindex = 0;
- task->thread.gs = addr;
- if (doit) {
- load_gs_index(0);
- ret = checking_wrmsrl(MSR_KERNEL_GS_BASE, addr);
- }
+ x86_gsbase_write_task(task, arg2);
}
- put_cpu();
+ preempt_enable();
break;
- case ARCH_SET_FS:
- /* Not strictly needed for fs, but do it for symmetry
- with gs */
- if (addr >= TASK_SIZE_OF(task))
+ }
+ case ARCH_SET_FS: {
+ /*
+ * Not strictly needed for %fs, but do it for symmetry
+ * with %gs
+ */
+ if (unlikely(arg2 >= TASK_SIZE_MAX))
return -EPERM;
- cpu = get_cpu();
- /* handle small bases via the GDT because that's faster to
- switch. */
- if (addr <= 0xffffffff) {
- set_32bit_tls(task, FS_TLS, addr);
- if (doit) {
- load_TLS(&task->thread, cpu);
- loadsegment(fs, FS_TLS_SEL);
- }
- task->thread.fsindex = FS_TLS_SEL;
- task->thread.fs = 0;
+
+ preempt_disable();
+ /*
+ * Set the selector to 0 for the same reason
+ * as %gs above.
+ */
+ if (task == current) {
+ loadseg(FS, 0);
+ x86_fsbase_write_cpu(arg2);
+
+ /*
+ * On non-FSGSBASE systems, save_base_legacy() expects
+ * that we also fill in thread.fsbase.
+ */
+ task->thread.fsbase = arg2;
} else {
task->thread.fsindex = 0;
- task->thread.fs = addr;
- if (doit) {
- /* set the selector to 0 to not confuse
- __switch_to */
- loadsegment(fs, 0);
- ret = checking_wrmsrl(MSR_FS_BASE, addr);
- }
+ x86_fsbase_write_task(task, arg2);
}
- put_cpu();
+ preempt_enable();
break;
+ }
case ARCH_GET_FS: {
- unsigned long base;
- if (task->thread.fsindex == FS_TLS_SEL)
- base = read_32bit_tls(task, FS_TLS);
- else if (doit)
- rdmsrl(MSR_FS_BASE, base);
- else
- base = task->thread.fs;
- ret = put_user(base, (unsigned long __user *)addr);
+ unsigned long base = x86_fsbase_read_task(task);
+
+ ret = put_user(base, (unsigned long __user *)arg2);
break;
}
case ARCH_GET_GS: {
- unsigned long base;
- unsigned gsindex;
- if (task->thread.gsindex == GS_TLS_SEL)
- base = read_32bit_tls(task, GS_TLS);
- else if (doit) {
- savesegment(gs, gsindex);
- if (gsindex)
- rdmsrl(MSR_KERNEL_GS_BASE, base);
- else
- base = task->thread.gs;
- } else
- base = task->thread.gs;
- ret = put_user(base, (unsigned long __user *)addr);
+ unsigned long base = x86_gsbase_read_task(task);
+
+ ret = put_user(base, (unsigned long __user *)arg2);
break;
}
+#ifdef CONFIG_CHECKPOINT_RESTORE
+# ifdef CONFIG_X86_X32_ABI
+ case ARCH_MAP_VDSO_X32:
+ return prctl_map_vdso(&vdso_image_x32, arg2);
+# endif
+# ifdef CONFIG_IA32_EMULATION
+ case ARCH_MAP_VDSO_32:
+ return prctl_map_vdso(&vdso_image_32, arg2);
+# endif
+ case ARCH_MAP_VDSO_64:
+ return prctl_map_vdso(&vdso_image_64, arg2);
+#endif
+#ifdef CONFIG_ADDRESS_MASKING
+ case ARCH_GET_UNTAG_MASK:
+ return put_user(task->mm->context.untag_mask,
+ (unsigned long __user *)arg2);
+ case ARCH_ENABLE_TAGGED_ADDR:
+ return prctl_enable_tagged_addr(task->mm, arg2);
+ case ARCH_FORCE_TAGGED_SVA:
+ if (current != task)
+ return -EINVAL;
+ set_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &task->mm->context.flags);
+ return 0;
+ case ARCH_GET_MAX_TAG_BITS:
+ if (!cpu_feature_enabled(X86_FEATURE_LAM))
+ return put_user(0, (unsigned long __user *)arg2);
+ else
+ return put_user(LAM_U57_BITS, (unsigned long __user *)arg2);
+#endif
+ case ARCH_SHSTK_ENABLE:
+ case ARCH_SHSTK_DISABLE:
+ case ARCH_SHSTK_LOCK:
+ case ARCH_SHSTK_UNLOCK:
+ case ARCH_SHSTK_STATUS:
+ return shstk_prctl(task, option, arg2);
default:
ret = -EINVAL;
break;
@@ -637,14 +980,3 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
return ret;
}
-
-long sys_arch_prctl(int code, unsigned long addr)
-{
- return do_arch_prctl(current, code, addr);
-}
-
-unsigned long KSTK_ESP(struct task_struct *task)
-{
- return (test_tsk_thread_flag(task, TIF_IA32)) ?
- (task_pt_regs(task)->sp) : ((task)->thread.usersp);
-}
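The LAM plumbing above is driven from userspace through arch_prctl(2). Below is a minimal sketch of the intended calling sequence; the ARCH_* constant values are assumed to match asm/prctl.h, and real code should include the uapi header instead of redefining them.

/*
 * Userspace sketch of the LAM enablement flow implemented by
 * prctl_enable_tagged_addr() above.  Must run single threaded, or
 * the kernel returns -EBUSY per the MM_CONTEXT_LOCK_LAM check.
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_GET_UNTAG_MASK	0x4001	/* assumed value */
#define ARCH_ENABLE_TAGGED_ADDR	0x4002	/* assumed value */
#define ARCH_GET_MAX_TAG_BITS	0x4003	/* assumed value */

int main(void)
{
	unsigned long bits = 0, mask = 0;

	if (syscall(SYS_arch_prctl, ARCH_GET_MAX_TAG_BITS, &bits) || !bits)
		return 1;	/* no LAM: kernel stored 0 or errored */

	if (syscall(SYS_arch_prctl, ARCH_ENABLE_TAGGED_ADDR, bits))
		return 1;	/* -EBUSY if multithreaded, -EINVAL if bits too large */

	syscall(SYS_arch_prctl, ARCH_GET_UNTAG_MASK, &mask);
	printf("untag mask: %#lx\n", mask);	/* bits 62:57 clear under LAM_U57 */
	return 0;
}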
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 70c4872cd8aa..3dcadc13f09a 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-only
/* By Ross Biro 1/23/92 */
/*
* Pentium III FXSR, SSE support
@@ -6,13 +7,12 @@
#include <linux/kernel.h>
#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/ptrace.h>
-#include <linux/regset.h>
-#include <linux/tracehook.h>
#include <linux/user.h>
#include <linux/elf.h>
#include <linux/security.h>
@@ -21,34 +21,59 @@
#include <linux/signal.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
+#include <linux/rcupdate.h>
+#include <linux/export.h>
+#include <linux/context_tracking.h>
+#include <linux/nospec.h>
-#include <asm/uaccess.h>
-#include <asm/pgtable.h>
-#include <asm/system.h>
+#include <linux/uaccess.h>
#include <asm/processor.h>
-#include <asm/i387.h>
+#include <asm/fpu/signal.h>
+#include <asm/fpu/regset.h>
+#include <asm/fpu/xstate.h>
#include <asm/debugreg.h>
#include <asm/ldt.h>
#include <asm/desc.h>
#include <asm/prctl.h>
#include <asm/proto.h>
#include <asm/hw_breakpoint.h>
+#include <asm/traps.h>
+#include <asm/syscall.h>
+#include <asm/fsgsbase.h>
+#include <asm/io_bitmap.h>
#include "tls.h"
-#define CREATE_TRACE_POINTS
-#include <trace/events/syscalls.h>
-
-enum x86_regset {
- REGSET_GENERAL,
- REGSET_FP,
- REGSET_XFP,
- REGSET_IOPERM64 = REGSET_XFP,
- REGSET_XSTATE,
- REGSET_TLS,
- REGSET_IOPERM32,
+enum x86_regset_32 {
+ REGSET32_GENERAL,
+ REGSET32_FP,
+ REGSET32_XFP,
+ REGSET32_XSTATE,
+ REGSET32_TLS,
+ REGSET32_IOPERM,
};
+enum x86_regset_64 {
+ REGSET64_GENERAL,
+ REGSET64_FP,
+ REGSET64_IOPERM,
+ REGSET64_XSTATE,
+ REGSET64_SSP,
+};
+
+#define REGSET_GENERAL \
+({ \
+ BUILD_BUG_ON((int)REGSET32_GENERAL != (int)REGSET64_GENERAL); \
+ REGSET32_GENERAL; \
+})
+
+#define REGSET_FP \
+({ \
+ BUILD_BUG_ON((int)REGSET32_FP != (int)REGSET64_FP); \
+ REGSET32_FP; \
+})
+
struct pt_regs_offset {
const char *name;
int offset;
@@ -122,21 +147,6 @@ const char *regs_query_register_name(unsigned int offset)
return NULL;
}
-static const int arg_offs_table[] = {
-#ifdef CONFIG_X86_32
- [0] = offsetof(struct pt_regs, ax),
- [1] = offsetof(struct pt_regs, dx),
- [2] = offsetof(struct pt_regs, cx)
-#else /* CONFIG_X86_64 */
- [0] = offsetof(struct pt_regs, di),
- [1] = offsetof(struct pt_regs, si),
- [2] = offsetof(struct pt_regs, dx),
- [3] = offsetof(struct pt_regs, cx),
- [4] = offsetof(struct pt_regs, r8),
- [5] = offsetof(struct pt_regs, r9)
-#endif
-};
-
/*
* does not yet catch signals sent when the child dies.
* in exit.c or in signal.c.
@@ -180,9 +190,9 @@ static u16 get_segment_reg(struct task_struct *task, unsigned long offset)
retval = *pt_regs_access(task_pt_regs(task), offset);
else {
if (task == current)
- retval = get_user_gs(task_pt_regs(task));
+ savesegment(gs, retval);
else
- retval = task_user_gs(task);
+ retval = task->thread.gs;
}
return retval;
}
@@ -190,6 +200,9 @@ static u16 get_segment_reg(struct task_struct *task, unsigned long offset)
static int set_segment_reg(struct task_struct *task,
unsigned long offset, u16 value)
{
+ if (WARN_ON_ONCE(task == current))
+ return -EIO;
+
/*
* The value argument was already truncated to 16 bits.
*/
@@ -210,16 +223,14 @@ static int set_segment_reg(struct task_struct *task,
case offsetof(struct user_regs_struct, ss):
if (unlikely(value == 0))
return -EIO;
+ fallthrough;
default:
*pt_regs_access(task_pt_regs(task), offset) = value;
break;
case offsetof(struct user_regs_struct, gs):
- if (task == current)
- set_user_gs(task_pt_regs(task), value);
- else
- task_user_gs(task) = value;
+ task->thread.gs = value;
}
return 0;
@@ -279,50 +290,33 @@ static u16 get_segment_reg(struct task_struct *task, unsigned long offset)
static int set_segment_reg(struct task_struct *task,
unsigned long offset, u16 value)
{
+ if (WARN_ON_ONCE(task == current))
+ return -EIO;
+
/*
* The value argument was already truncated to 16 bits.
*/
if (invalid_selector(value))
return -EIO;
+ /*
+ * Writes to FS and GS will change the stored selector. Whether
+ * this changes the segment base as well depends on whether
+ * FSGSBASE is enabled.
+ */
+
switch (offset) {
case offsetof(struct user_regs_struct,fs):
- /*
- * If this is setting fs as for normal 64-bit use but
- * setting fs_base has implicitly changed it, leave it.
- */
- if ((value == FS_TLS_SEL && task->thread.fsindex == 0 &&
- task->thread.fs != 0) ||
- (value == 0 && task->thread.fsindex == FS_TLS_SEL &&
- task->thread.fs == 0))
- break;
task->thread.fsindex = value;
- if (task == current)
- loadsegment(fs, task->thread.fsindex);
break;
case offsetof(struct user_regs_struct,gs):
- /*
- * If this is setting gs as for normal 64-bit use but
- * setting gs_base has implicitly changed it, leave it.
- */
- if ((value == GS_TLS_SEL && task->thread.gsindex == 0 &&
- task->thread.gs != 0) ||
- (value == 0 && task->thread.gsindex == GS_TLS_SEL &&
- task->thread.gs == 0))
- break;
task->thread.gsindex = value;
- if (task == current)
- load_gs_index(task->thread.gsindex);
break;
case offsetof(struct user_regs_struct,ds):
task->thread.ds = value;
- if (task == current)
- loadsegment(ds, task->thread.ds);
break;
case offsetof(struct user_regs_struct,es):
task->thread.es = value;
- if (task == current)
- loadsegment(es, task->thread.es);
break;
/*
@@ -331,18 +325,12 @@ static int set_segment_reg(struct task_struct *task,
case offsetof(struct user_regs_struct,cs):
if (unlikely(value == 0))
return -EIO;
-#ifdef CONFIG_IA32_EMULATION
- if (test_tsk_thread_flag(task, TIF_IA32))
- task_pt_regs(task)->cs = value;
-#endif
+ task_pt_regs(task)->cs = value;
break;
case offsetof(struct user_regs_struct,ss):
if (unlikely(value == 0))
return -EIO;
-#ifdef CONFIG_IA32_EMULATION
- if (test_tsk_thread_flag(task, TIF_IA32))
- task_pt_regs(task)->ss = value;
-#endif
+ task_pt_regs(task)->ss = value;
break;
}
@@ -400,24 +388,14 @@ static int putreg(struct task_struct *child,
#ifdef CONFIG_X86_64
case offsetof(struct user_regs_struct,fs_base):
- if (value >= TASK_SIZE_OF(child))
+ if (value >= TASK_SIZE_MAX)
return -EIO;
- /*
- * When changing the segment base, use do_arch_prctl
- * to set either thread.fs or thread.fsindex and the
- * corresponding GDT slot.
- */
- if (child->thread.fs != value)
- return do_arch_prctl(child, ARCH_SET_FS, value);
+ x86_fsbase_write_task(child, value);
return 0;
case offsetof(struct user_regs_struct,gs_base):
- /*
- * Exactly the same here as the %fs handling above.
- */
- if (value >= TASK_SIZE_OF(child))
+ if (value >= TASK_SIZE_MAX)
return -EIO;
- if (child->thread.gs != value)
- return do_arch_prctl(child, ARCH_SET_GS, value);
+ x86_gsbase_write_task(child, value);
return 0;
#endif
}
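With putreg() now writing the bases via x86_fsbase_write_task()/x86_gsbase_write_task(), a 64-bit debugger can set a stopped tracee's FS base through the ordinary user-area offset. A sketch using only the standard ptrace(2) interface:

#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* Set a stopped tracee's FS base; the kernel rejects values at or
 * above TASK_SIZE_MAX with -EIO, matching the check in putreg(). */
static long set_fs_base(pid_t pid, unsigned long base)
{
	return ptrace(PTRACE_POKEUSER, pid,
		      (void *)offsetof(struct user_regs_struct, fs_base),
		      (void *)base);
}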
@@ -441,34 +419,10 @@ static unsigned long getreg(struct task_struct *task, unsigned long offset)
return get_flags(task);
#ifdef CONFIG_X86_64
- case offsetof(struct user_regs_struct, fs_base): {
- /*
- * do_arch_prctl may have used a GDT slot instead of
- * the MSR. To userland, it appears the same either
- * way, except the %fs segment selector might not be 0.
- */
- unsigned int seg = task->thread.fsindex;
- if (task->thread.fs != 0)
- return task->thread.fs;
- if (task == current)
- asm("movl %%fs,%0" : "=r" (seg));
- if (seg != FS_TLS_SEL)
- return 0;
- return get_desc_base(&task->thread.tls_array[FS_TLS]);
- }
- case offsetof(struct user_regs_struct, gs_base): {
- /*
- * Exactly the same here as the %fs handling above.
- */
- unsigned int seg = task->thread.gsindex;
- if (task->thread.gs != 0)
- return task->thread.gs;
- if (task == current)
- asm("movl %%gs,%0" : "=r" (seg));
- if (seg != GS_TLS_SEL)
- return 0;
- return get_desc_base(&task->thread.tls_array[GS_TLS]);
- }
+ case offsetof(struct user_regs_struct, fs_base):
+ return x86_fsbase_read_task(task);
+ case offsetof(struct user_regs_struct, gs_base):
+ return x86_gsbase_read_task(task);
#endif
}
@@ -477,26 +431,12 @@ static unsigned long getreg(struct task_struct *task, unsigned long offset)
static int genregs_get(struct task_struct *target,
const struct user_regset *regset,
- unsigned int pos, unsigned int count,
- void *kbuf, void __user *ubuf)
+ struct membuf to)
{
- if (kbuf) {
- unsigned long *k = kbuf;
- while (count >= sizeof(*k)) {
- *k++ = getreg(target, pos);
- count -= sizeof(*k);
- pos += sizeof(*k);
- }
- } else {
- unsigned long __user *u = ubuf;
- while (count >= sizeof(*u)) {
- if (__put_user(getreg(target, pos), u++))
- return -EFAULT;
- count -= sizeof(*u);
- pos += sizeof(*u);
- }
- }
+ int reg;
+ for (reg = 0; to.left; reg++)
+ membuf_store(&to, getreg(target, reg * sizeof(unsigned long)));
return 0;
}
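genregs_get() here moves from the pos/count/kbuf/ubuf quadruple to the struct membuf cursor. Below is a self-contained userspace model of that contract, loosely following include/linux/regset.h (the return convention is an assumption of this sketch):

#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Model of the kernel's membuf cursor: a destination buffer plus the
 * space remaining.  Stores shrink ->left until the regset is full. */
struct membuf { void *p; size_t left; };

static int membuf_write(struct membuf *s, const void *v, size_t size)
{
	if (size > s->left)
		size = s->left;
	memcpy(s->p, v, size);
	s->p = (char *)s->p + size;
	s->left -= size;
	return s->left;		/* remaining space */
}

#define membuf_store(s, v) membuf_write((s), &(v), sizeof(v))

int main(void)
{
	unsigned long regs[4] = { 0 }, val = 0xdeadbeef;	/* LP64 assumed */
	struct membuf to = { regs, sizeof(regs) };

	while (to.left)			/* same shape as genregs_get() */
		membuf_store(&to, val);
	printf("%#lx\n", regs[3]);
	return 0;
}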
@@ -528,7 +468,7 @@ static int genregs_set(struct task_struct *target,
return ret;
}
-static void ptrace_triggered(struct perf_event *bp, int nmi,
+static void ptrace_triggered(struct perf_event *bp,
struct perf_sample_data *data,
struct pt_regs *regs)
{
@@ -544,7 +484,7 @@ static void ptrace_triggered(struct perf_event *bp, int nmi,
break;
}
- thread->debugreg6 |= (DR_TRAP0 << i);
+ thread->virtual_dr6 |= (DR_TRAP0 << i);
}
/*
@@ -568,30 +508,48 @@ static unsigned long ptrace_get_dr7(struct perf_event *bp[])
return dr7;
}
-static int
-ptrace_modify_breakpoint(struct perf_event *bp, int len, int type,
- struct task_struct *tsk, int disabled)
+static int ptrace_fill_bp_fields(struct perf_event_attr *attr,
+ int len, int type, bool disabled)
+{
+ int err, bp_len, bp_type;
+
+ err = arch_bp_generic_fields(len, type, &bp_len, &bp_type);
+ if (!err) {
+ attr->bp_len = bp_len;
+ attr->bp_type = bp_type;
+ attr->disabled = disabled;
+ }
+
+ return err;
+}
+
+static struct perf_event *
+ptrace_register_breakpoint(struct task_struct *tsk, int len, int type,
+ unsigned long addr, bool disabled)
{
- int err;
- int gen_len, gen_type;
struct perf_event_attr attr;
+ int err;
- /*
- * We should have at least an inactive breakpoint at this
- * slot. It means the user is writing dr7 without having
- * written the address register first
- */
- if (!bp)
- return -EINVAL;
+ ptrace_breakpoint_init(&attr);
+ attr.bp_addr = addr;
- err = arch_bp_generic_fields(len, type, &gen_len, &gen_type);
+ err = ptrace_fill_bp_fields(&attr, len, type, disabled);
if (err)
- return err;
+ return ERR_PTR(err);
+
+ return register_user_hw_breakpoint(&attr, ptrace_triggered,
+ NULL, tsk);
+}
- attr = bp->attr;
- attr.bp_len = gen_len;
- attr.bp_type = gen_type;
- attr.disabled = disabled;
+static int ptrace_modify_breakpoint(struct perf_event *bp, int len, int type,
+ int disabled)
+{
+ struct perf_event_attr attr = bp->attr;
+ int err;
+
+ err = ptrace_fill_bp_fields(&attr, len, type, disabled);
+ if (err)
+ return err;
return modify_user_hw_breakpoint(bp, &attr);
}
@@ -601,61 +559,50 @@ ptrace_modify_breakpoint(struct perf_event *bp, int len, int type,
*/
static int ptrace_write_dr7(struct task_struct *tsk, unsigned long data)
{
- struct thread_struct *thread = &(tsk->thread);
+ struct thread_struct *thread = &tsk->thread;
unsigned long old_dr7;
- int i, orig_ret = 0, rc = 0;
- int enabled, second_pass = 0;
- unsigned len, type;
- struct perf_event *bp;
+ bool second_pass = false;
+ int i, rc, ret = 0;
data &= ~DR_CONTROL_RESERVED;
old_dr7 = ptrace_get_dr7(thread->ptrace_bps);
+
restore:
- /*
- * Loop through all the hardware breakpoints, making the
- * appropriate changes to each.
- */
+ rc = 0;
for (i = 0; i < HBP_NUM; i++) {
- enabled = decode_dr7(data, i, &len, &type);
- bp = thread->ptrace_bps[i];
-
- if (!enabled) {
- if (bp) {
- /*
- * Don't unregister the breakpoints right-away,
- * unless all register_user_hw_breakpoint()
- * requests have succeeded. This prevents
- * any window of opportunity for debug
- * register grabbing by other users.
- */
- if (!second_pass)
- continue;
-
- rc = ptrace_modify_breakpoint(bp, len, type,
- tsk, 1);
- if (rc)
- break;
+ unsigned len, type;
+ bool disabled = !decode_dr7(data, i, &len, &type);
+ struct perf_event *bp = thread->ptrace_bps[i];
+
+ if (!bp) {
+ if (disabled)
+ continue;
+
+ bp = ptrace_register_breakpoint(tsk,
+ len, type, 0, disabled);
+ if (IS_ERR(bp)) {
+ rc = PTR_ERR(bp);
+ break;
}
+
+ thread->ptrace_bps[i] = bp;
continue;
}
- rc = ptrace_modify_breakpoint(bp, len, type, tsk, 0);
+ rc = ptrace_modify_breakpoint(bp, len, type, disabled);
if (rc)
break;
}
- /*
- * Make a second pass to free the remaining unused breakpoints
- * or to restore the original breakpoints if an error occurred.
- */
- if (!second_pass) {
- second_pass = 1;
- if (rc < 0) {
- orig_ret = rc;
- data = old_dr7;
- }
+
+ /* Restore if the first pass failed, second_pass shouldn't fail. */
+ if (rc && !WARN_ON(second_pass)) {
+ ret = rc;
+ data = old_dr7;
+ second_pass = true;
goto restore;
}
- return ((orig_ret < 0) ? orig_ret : rc);
+
+ return ret;
}
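For reference, decode_dr7(), which the restore loop above relies on, unpacks two enable bits per slot at the bottom of DR7 plus one 4-bit type/len nibble per slot starting at bit 16. A self-contained model mirroring arch/x86/kernel/hw_breakpoint.c:

#define DR_CONTROL_SHIFT	16	/* first len/type nibble */
#define DR_CONTROL_SIZE		4	/* bits per len/type nibble */
#define DR_ENABLE_SIZE		2	/* enable bits per breakpoint */

/* Returns nonzero if breakpoint 'bpnum' is enabled in 'dr7' and fills
 * in its encoded length and type, as the loop above expects. */
static int decode_dr7_model(unsigned long dr7, int bpnum,
			    unsigned int *len, unsigned int *type)
{
	int bp_info = dr7 >> (DR_CONTROL_SHIFT + bpnum * DR_CONTROL_SIZE);

	*len  = (bp_info & 0xc) | 0x40;
	*type = (bp_info & 0x3) | 0x80;
	return (dr7 >> (bpnum * DR_ENABLE_SIZE)) & 0x3;
}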
/*
@@ -663,18 +610,18 @@ restore:
*/
static unsigned long ptrace_get_debugreg(struct task_struct *tsk, int n)
{
- struct thread_struct *thread = &(tsk->thread);
+ struct thread_struct *thread = &tsk->thread;
unsigned long val = 0;
if (n < HBP_NUM) {
- struct perf_event *bp;
- bp = thread->ptrace_bps[n];
- if (!bp)
- return 0;
- val = bp->hw.info.address;
+ int index = array_index_nospec(n, HBP_NUM);
+ struct perf_event *bp = thread->ptrace_bps[index];
+
+ if (bp)
+ val = bp->hw.info.address;
} else if (n == 6) {
- val = thread->debugreg6;
- } else if (n == 7) {
+ val = thread->virtual_dr6 ^ DR6_RESERVED; /* Flip back to arch polarity */
+ } else if (n == 7) {
val = thread->ptrace_dr7;
}
return val;
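The XOR with DR6_RESERVED works because several DR6 bits architecturally read as 1 when idle; the kernel keeps thread->virtual_dr6 in uniform positive polarity and flips only at this ptrace boundary. A sketch of the round trip, assuming the 0xFFFF0FF0 value from asm/debugreg.h:

#define DR6_RESERVED	0xFFFF0FF0	/* bits that architecturally read as 1 */

/* With no debug events recorded, virtual_dr6 == 0 and the tracer reads
 * the architectural idle value 0xFFFF0FF0; a write of that value maps
 * back to 0 internally. */
static unsigned long dr6_to_user(unsigned long virtual_dr6)
{
	return virtual_dr6 ^ DR6_RESERVED;	/* positive -> architectural */
}

static unsigned long dr6_from_user(unsigned long user_dr6)
{
	return user_dr6 ^ DR6_RESERVED;		/* architectural -> positive */
}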
@@ -683,24 +630,14 @@ static unsigned long ptrace_get_debugreg(struct task_struct *tsk, int n)
static int ptrace_set_breakpoint_addr(struct task_struct *tsk, int nr,
unsigned long addr)
{
- struct perf_event *bp;
struct thread_struct *t = &tsk->thread;
- struct perf_event_attr attr;
-
- if (!t->ptrace_bps[nr]) {
- ptrace_breakpoint_init(&attr);
- /*
- * Put stub len and type to register (reserve) an inactive but
- * correct bp
- */
- attr.bp_addr = addr;
- attr.bp_len = HW_BREAKPOINT_LEN_1;
- attr.bp_type = HW_BREAKPOINT_W;
- attr.disabled = 1;
-
- bp = register_user_hw_breakpoint(&attr, ptrace_triggered, tsk);
+ struct perf_event *bp = t->ptrace_bps[nr];
+ int err = 0;
+ if (!bp) {
/*
+ * Put stub len and type to create an inactive but correct bp.
+ *
* CHECKME: the previous code returned -EIO if the addr wasn't
* a valid task virtual addr. The new one will return -EINVAL in
* this case.
@@ -709,55 +646,43 @@ static int ptrace_set_breakpoint_addr(struct task_struct *tsk, int nr,
* writing for the user. And anyway this is the previous
* behaviour.
*/
+ bp = ptrace_register_breakpoint(tsk,
+ X86_BREAKPOINT_LEN_1, X86_BREAKPOINT_WRITE,
+ addr, true);
if (IS_ERR(bp))
- return PTR_ERR(bp);
-
- t->ptrace_bps[nr] = bp;
+ err = PTR_ERR(bp);
+ else
+ t->ptrace_bps[nr] = bp;
} else {
- int err;
-
- bp = t->ptrace_bps[nr];
+ struct perf_event_attr attr = bp->attr;
- attr = bp->attr;
attr.bp_addr = addr;
err = modify_user_hw_breakpoint(bp, &attr);
- if (err)
- return err;
}
-
- return 0;
+ return err;
}
/*
* Handle PTRACE_POKEUSR calls for the debug register area.
*/
-int ptrace_set_debugreg(struct task_struct *tsk, int n, unsigned long val)
+static int ptrace_set_debugreg(struct task_struct *tsk, int n,
+ unsigned long val)
{
- struct thread_struct *thread = &(tsk->thread);
- int rc = 0;
-
+ struct thread_struct *thread = &tsk->thread;
/* There are no DR4 or DR5 registers */
- if (n == 4 || n == 5)
- return -EIO;
+ int rc = -EIO;
- if (n == 6) {
- thread->debugreg6 = val;
- goto ret_path;
- }
if (n < HBP_NUM) {
rc = ptrace_set_breakpoint_addr(tsk, n, val);
- if (rc)
- return rc;
- }
- /* All that's left is DR7 */
- if (n == 7) {
+ } else if (n == 6) {
+ thread->virtual_dr6 = val ^ DR6_RESERVED; /* Flip to positive polarity */
+ rc = 0;
+ } else if (n == 7) {
rc = ptrace_write_dr7(tsk, val);
if (!rc)
thread->ptrace_dr7 = val;
}
-
-ret_path:
return rc;
}
@@ -768,20 +693,21 @@ ret_path:
static int ioperm_active(struct task_struct *target,
const struct user_regset *regset)
{
- return target->thread.io_bitmap_max / regset->size;
+ struct io_bitmap *iobm = target->thread.io_bitmap;
+
+ return iobm ? DIV_ROUND_UP(iobm->max, regset->size) : 0;
}
static int ioperm_get(struct task_struct *target,
const struct user_regset *regset,
- unsigned int pos, unsigned int count,
- void *kbuf, void __user *ubuf)
+ struct membuf to)
{
- if (!target->thread.io_bitmap_ptr)
+ struct io_bitmap *iobm = target->thread.io_bitmap;
+
+ if (!iobm)
return -ENXIO;
- return user_regset_copyout(&pos, &count, &kbuf, &ubuf,
- target->thread.io_bitmap_ptr,
- 0, IO_BITMAP_BYTES);
+ return membuf_write(&to, iobm->bitmap, IO_BITMAP_BYTES);
}
/*
@@ -792,28 +718,36 @@ static int ioperm_get(struct task_struct *target,
void ptrace_disable(struct task_struct *child)
{
user_disable_single_step(child);
-#ifdef TIF_SYSCALL_EMU
- clear_tsk_thread_flag(child, TIF_SYSCALL_EMU);
-#endif
}
#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
static const struct user_regset_view user_x86_32_view; /* Initialized below. */
#endif
+#ifdef CONFIG_X86_64
+static const struct user_regset_view user_x86_64_view; /* Initialized below. */
+#endif
-long arch_ptrace(struct task_struct *child, long request, long addr, long data)
+long arch_ptrace(struct task_struct *child, long request,
+ unsigned long addr, unsigned long data)
{
int ret;
unsigned long __user *datap = (unsigned long __user *)data;
+#ifdef CONFIG_X86_64
+ /* This is native 64-bit ptrace() */
+ const struct user_regset_view *regset_view = &user_x86_64_view;
+#else
+ /* This is native 32-bit ptrace() */
+ const struct user_regset_view *regset_view = &user_x86_32_view;
+#endif
+
switch (request) {
/* read the word at location addr in the USER area. */
case PTRACE_PEEKUSR: {
unsigned long tmp;
ret = -EIO;
- if ((addr & (sizeof(data) - 1)) || addr < 0 ||
- addr >= sizeof(struct user))
+ if ((addr & (sizeof(data) - 1)) || addr >= sizeof(struct user))
break;
tmp = 0; /* Default return condition */
@@ -830,8 +764,7 @@ long arch_ptrace(struct task_struct *child, long request, long addr, long data)
case PTRACE_POKEUSR: /* write the word at location addr in the USER area */
ret = -EIO;
- if ((addr & (sizeof(data) - 1)) || addr < 0 ||
- addr >= sizeof(struct user))
+ if ((addr & (sizeof(data) - 1)) || addr >= sizeof(struct user))
break;
if (addr < sizeof(struct user_regs_struct))
@@ -846,28 +779,28 @@ long arch_ptrace(struct task_struct *child, long request, long addr, long data)
case PTRACE_GETREGS: /* Get all gp regs from the child. */
return copy_regset_to_user(child,
- task_user_regset_view(current),
+ regset_view,
REGSET_GENERAL,
0, sizeof(struct user_regs_struct),
datap);
case PTRACE_SETREGS: /* Set all gp regs in the child. */
return copy_regset_from_user(child,
- task_user_regset_view(current),
+ regset_view,
REGSET_GENERAL,
0, sizeof(struct user_regs_struct),
datap);
case PTRACE_GETFPREGS: /* Get the child FPU state. */
return copy_regset_to_user(child,
- task_user_regset_view(current),
+ regset_view,
REGSET_FP,
0, sizeof(struct user_i387_struct),
datap);
case PTRACE_SETFPREGS: /* Set the child FPU state. */
return copy_regset_from_user(child,
- task_user_regset_view(current),
+ regset_view,
REGSET_FP,
0, sizeof(struct user_i387_struct),
datap);
@@ -875,30 +808,30 @@ long arch_ptrace(struct task_struct *child, long request, long addr, long data)
#ifdef CONFIG_X86_32
case PTRACE_GETFPXREGS: /* Get the child extended FPU state. */
return copy_regset_to_user(child, &user_x86_32_view,
- REGSET_XFP,
+ REGSET32_XFP,
0, sizeof(struct user_fxsr_struct),
datap) ? -EIO : 0;
case PTRACE_SETFPXREGS: /* Set the child extended FPU state. */
return copy_regset_from_user(child, &user_x86_32_view,
- REGSET_XFP,
+ REGSET32_XFP,
0, sizeof(struct user_fxsr_struct),
datap) ? -EIO : 0;
#endif
#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
case PTRACE_GET_THREAD_AREA:
- if (addr < 0)
+ if ((int) addr < 0)
return -EIO;
ret = do_get_thread_area(child, addr,
- (struct user_desc __user *) data);
+ (struct user_desc __user *)data);
break;
case PTRACE_SET_THREAD_AREA:
- if (addr < 0)
+ if ((int) addr < 0)
return -EIO;
ret = do_set_thread_area(child, addr,
- (struct user_desc __user *) data, 0);
+ (struct user_desc __user *)data, 0);
break;
#endif
@@ -907,7 +840,7 @@ long arch_ptrace(struct task_struct *child, long request, long addr, long data)
Works just like arch_prctl, except that the arguments
are reversed. */
case PTRACE_ARCH_PRCTL:
- ret = do_arch_prctl(child, data, addr);
+ ret = do_arch_prctl_64(child, data, addr);
break;
#endif
@@ -940,14 +873,39 @@ long arch_ptrace(struct task_struct *child, long request, long addr, long data)
static int putreg32(struct task_struct *child, unsigned regno, u32 value)
{
struct pt_regs *regs = task_pt_regs(child);
+ int ret;
switch (regno) {
SEG32(cs);
SEG32(ds);
SEG32(es);
- SEG32(fs);
- SEG32(gs);
+
+ /*
+ * A 32-bit ptracer on a 64-bit kernel expects that writing
+ * FS or GS will also update the base. This is needed for
+ * operations like PTRACE_SETREGS to fully restore a saved
+ * CPU state.
+ */
+
+ case offsetof(struct user32, regs.fs):
+ ret = set_segment_reg(child,
+ offsetof(struct user_regs_struct, fs),
+ value);
+ if (ret == 0)
+ child->thread.fsbase =
+ x86_fsgsbase_read_task(child, value);
+ return ret;
+
+ case offsetof(struct user32, regs.gs):
+ ret = set_segment_reg(child,
+ offsetof(struct user_regs_struct, gs),
+ value);
+ if (ret == 0)
+ child->thread.gsbase =
+ x86_fsgsbase_read_task(child, value);
+ return ret;
+
SEG32(ss);
R32(ebx, bx);
@@ -962,15 +920,18 @@ static int putreg32(struct task_struct *child, unsigned regno, u32 value)
case offsetof(struct user32, regs.orig_eax):
/*
- * A 32-bit debugger setting orig_eax means to restore
- * the state of the task restarting a 32-bit syscall.
- * Make sure we interpret the -ERESTART* codes correctly
- * in case the task is not actually still sitting at the
- * exit from a 32-bit syscall with TS_COMPAT still set.
+ * Warning: bizarre corner case fixup here. A 32-bit
+ * debugger setting orig_eax to -1 wants to disable
+ * syscall restart. Make sure that the syscall
+ * restart code sign-extends orig_ax. Also make sure
+ * we interpret the -ERESTART* codes correctly if
+ * loaded into regs->ax in case the task is not
+ * actually still sitting at the exit from a 32-bit
+ * syscall with TS_COMPAT still set.
*/
regs->orig_ax = value;
- if (syscall_get_nr(child, regs) >= 0)
- task_thread_info(child)->status |= TS_COMPAT;
+ if (syscall_get_nr(child, regs) != -1)
+ child->thread_info.status |= TS_I386_REGS_POKED;
break;
case offsetof(struct user32, regs.eflags):
@@ -1060,28 +1021,15 @@ static int getreg32(struct task_struct *child, unsigned regno, u32 *val)
static int genregs32_get(struct task_struct *target,
const struct user_regset *regset,
- unsigned int pos, unsigned int count,
- void *kbuf, void __user *ubuf)
+ struct membuf to)
{
- if (kbuf) {
- compat_ulong_t *k = kbuf;
- while (count >= sizeof(*k)) {
- getreg32(target, pos, k++);
- count -= sizeof(*k);
- pos += sizeof(*k);
- }
- } else {
- compat_ulong_t __user *u = ubuf;
- while (count >= sizeof(*u)) {
- compat_ulong_t word;
- getreg32(target, pos, &word);
- if (__put_user(word, u++))
- return -EFAULT;
- count -= sizeof(*u);
- pos += sizeof(*u);
- }
- }
+ int reg;
+ for (reg = 0; to.left; reg++) {
+ u32 val;
+ getreg32(target, reg * 4, &val);
+ membuf_store(&to, val);
+ }
return 0;
}
@@ -1113,8 +1061,8 @@ static int genregs32_set(struct task_struct *target,
return ret;
}
-long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
- compat_ulong_t caddr, compat_ulong_t cdata)
+static long ia32_arch_ptrace(struct task_struct *child, compat_long_t request,
+ compat_ulong_t caddr, compat_ulong_t cdata)
{
unsigned long addr = caddr;
unsigned long data = cdata;
@@ -1158,13 +1106,13 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
case PTRACE_GETFPXREGS: /* Get the child extended FPU state. */
return copy_regset_to_user(child, &user_x86_32_view,
- REGSET_XFP, 0,
+ REGSET32_XFP, 0,
sizeof(struct user32_fxsr_struct),
datap);
case PTRACE_SETFPXREGS: /* Set the child extended FPU state. */
return copy_regset_from_user(child, &user_x86_32_view,
- REGSET_XFP, 0,
+ REGSET32_XFP, 0,
sizeof(struct user32_fxsr_struct),
datap);
@@ -1178,36 +1126,159 @@ long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
return ret;
}
+#endif /* CONFIG_IA32_EMULATION */
-#endif /* CONFIG_IA32_EMULATION */
+#ifdef CONFIG_X86_X32_ABI
+static long x32_arch_ptrace(struct task_struct *child,
+ compat_long_t request, compat_ulong_t caddr,
+ compat_ulong_t cdata)
+{
+ unsigned long addr = caddr;
+ unsigned long data = cdata;
+ void __user *datap = compat_ptr(data);
+ int ret;
+
+ switch (request) {
+ /* Read 32 bits at location addr in the USER area. Only the lower
+ 32 bits of segment and debug registers are returned. */
+ case PTRACE_PEEKUSR: {
+ u32 tmp;
+
+ ret = -EIO;
+ if ((addr & (sizeof(data) - 1)) || addr >= sizeof(struct user) ||
+ addr < offsetof(struct user_regs_struct, cs))
+ break;
+
+ tmp = 0; /* Default return condition */
+ if (addr < sizeof(struct user_regs_struct))
+ tmp = getreg(child, addr);
+ else if (addr >= offsetof(struct user, u_debugreg[0]) &&
+ addr <= offsetof(struct user, u_debugreg[7])) {
+ addr -= offsetof(struct user, u_debugreg[0]);
+ tmp = ptrace_get_debugreg(child, addr / sizeof(data));
+ }
+ ret = put_user(tmp, (__u32 __user *)datap);
+ break;
+ }
+
+ /* Write the word at location addr in the USER area. Only segment
+ and debug registers may be updated, with the upper 32 bits
+ zero-extended. */
+ case PTRACE_POKEUSR:
+ ret = -EIO;
+ if ((addr & (sizeof(data) - 1)) || addr >= sizeof(struct user) ||
+ addr < offsetof(struct user_regs_struct, cs))
+ break;
+
+ if (addr < sizeof(struct user_regs_struct))
+ ret = putreg(child, addr, data);
+ else if (addr >= offsetof(struct user, u_debugreg[0]) &&
+ addr <= offsetof(struct user, u_debugreg[7])) {
+ addr -= offsetof(struct user, u_debugreg[0]);
+ ret = ptrace_set_debugreg(child,
+ addr / sizeof(data), data);
+ }
+ break;
+
+ case PTRACE_GETREGS: /* Get all gp regs from the child. */
+ return copy_regset_to_user(child,
+ &user_x86_64_view,
+ REGSET_GENERAL,
+ 0, sizeof(struct user_regs_struct),
+ datap);
+
+ case PTRACE_SETREGS: /* Set all gp regs in the child. */
+ return copy_regset_from_user(child,
+ &user_x86_64_view,
+ REGSET_GENERAL,
+ 0, sizeof(struct user_regs_struct),
+ datap);
+
+ case PTRACE_GETFPREGS: /* Get the child FPU state. */
+ return copy_regset_to_user(child,
+ &user_x86_64_view,
+ REGSET_FP,
+ 0, sizeof(struct user_i387_struct),
+ datap);
+
+ case PTRACE_SETFPREGS: /* Set the child FPU state. */
+ return copy_regset_from_user(child,
+ &user_x86_64_view,
+ REGSET_FP,
+ 0, sizeof(struct user_i387_struct),
+ datap);
+
+ default:
+ return compat_ptrace_request(child, request, addr, data);
+ }
+
+ return ret;
+}
+#endif
+
+#ifdef CONFIG_COMPAT
+long compat_arch_ptrace(struct task_struct *child, compat_long_t request,
+ compat_ulong_t caddr, compat_ulong_t cdata)
+{
+#ifdef CONFIG_X86_X32_ABI
+ if (!in_ia32_syscall())
+ return x32_arch_ptrace(child, request, caddr, cdata);
+#endif
+#ifdef CONFIG_IA32_EMULATION
+ return ia32_arch_ptrace(child, request, caddr, cdata);
+#else
+ return 0;
+#endif
+}
+#endif /* CONFIG_COMPAT */
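The dispatch above relies on x32 syscalls entering through the 64-bit path with the x32 marker bit set in the syscall number, which makes in_ia32_syscall() false; only real int 0x80/sysenter callers reach ia32_arch_ptrace(). A sketch of the distinguishing test, assuming the conventional __X32_SYSCALL_BIT value:

#define __X32_SYSCALL_BIT	0x40000000UL	/* x32 ABI marker bit */

/* True for a syscall number issued by an x32 process: every x32
 * syscall number carries this bit in addition to its table index. */
static int is_x32_syscall_nr(unsigned long nr)
{
	return (nr & __X32_SYSCALL_BIT) != 0;
}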
#ifdef CONFIG_X86_64
-static struct user_regset x86_64_regsets[] __read_mostly = {
- [REGSET_GENERAL] = {
- .core_note_type = NT_PRSTATUS,
- .n = sizeof(struct user_regs_struct) / sizeof(long),
- .size = sizeof(long), .align = sizeof(long),
- .get = genregs_get, .set = genregs_set
+static struct user_regset x86_64_regsets[] __ro_after_init = {
+ [REGSET64_GENERAL] = {
+ USER_REGSET_NOTE_TYPE(PRSTATUS),
+ .n = sizeof(struct user_regs_struct) / sizeof(long),
+ .size = sizeof(long),
+ .align = sizeof(long),
+ .regset_get = genregs_get,
+ .set = genregs_set
+ },
+ [REGSET64_FP] = {
+ USER_REGSET_NOTE_TYPE(PRFPREG),
+ .n = sizeof(struct fxregs_state) / sizeof(long),
+ .size = sizeof(long),
+ .align = sizeof(long),
+ .active = regset_xregset_fpregs_active,
+ .regset_get = xfpregs_get,
+ .set = xfpregs_set
},
- [REGSET_FP] = {
- .core_note_type = NT_PRFPREG,
- .n = sizeof(struct user_i387_struct) / sizeof(long),
- .size = sizeof(long), .align = sizeof(long),
- .active = xfpregs_active, .get = xfpregs_get, .set = xfpregs_set
+ [REGSET64_XSTATE] = {
+ USER_REGSET_NOTE_TYPE(X86_XSTATE),
+ .size = sizeof(u64),
+ .align = sizeof(u64),
+ .active = xstateregs_active,
+ .regset_get = xstateregs_get,
+ .set = xstateregs_set
},
- [REGSET_XSTATE] = {
- .core_note_type = NT_X86_XSTATE,
- .size = sizeof(u64), .align = sizeof(u64),
- .active = xstateregs_active, .get = xstateregs_get,
- .set = xstateregs_set
+ [REGSET64_IOPERM] = {
+ USER_REGSET_NOTE_TYPE(386_IOPERM),
+ .n = IO_BITMAP_LONGS,
+ .size = sizeof(long),
+ .align = sizeof(long),
+ .active = ioperm_active,
+ .regset_get = ioperm_get
},
- [REGSET_IOPERM64] = {
- .core_note_type = NT_386_IOPERM,
- .n = IO_BITMAP_LONGS,
- .size = sizeof(long), .align = sizeof(long),
- .active = ioperm_active, .get = ioperm_get
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+ [REGSET64_SSP] = {
+ USER_REGSET_NOTE_TYPE(X86_SHSTK),
+ .n = 1,
+ .size = sizeof(u64),
+ .align = sizeof(u64),
+ .active = ssp_active,
+ .regset_get = ssp_get,
+ .set = ssp_set
},
+#endif
};
static const struct user_regset_view user_x86_64_view = {
@@ -1221,50 +1292,61 @@ static const struct user_regset_view user_x86_64_view = {
#define genregs32_get genregs_get
#define genregs32_set genregs_set
-#define user_i387_ia32_struct user_i387_struct
-#define user32_fxsr_struct user_fxsr_struct
-
#endif /* CONFIG_X86_64 */
#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
-static struct user_regset x86_32_regsets[] __read_mostly = {
- [REGSET_GENERAL] = {
- .core_note_type = NT_PRSTATUS,
- .n = sizeof(struct user_regs_struct32) / sizeof(u32),
- .size = sizeof(u32), .align = sizeof(u32),
- .get = genregs32_get, .set = genregs32_set
+static struct user_regset x86_32_regsets[] __ro_after_init = {
+ [REGSET32_GENERAL] = {
+ USER_REGSET_NOTE_TYPE(PRSTATUS),
+ .n = sizeof(struct user_regs_struct32) / sizeof(u32),
+ .size = sizeof(u32),
+ .align = sizeof(u32),
+ .regset_get = genregs32_get,
+ .set = genregs32_set
},
- [REGSET_FP] = {
- .core_note_type = NT_PRFPREG,
- .n = sizeof(struct user_i387_ia32_struct) / sizeof(u32),
- .size = sizeof(u32), .align = sizeof(u32),
- .active = fpregs_active, .get = fpregs_get, .set = fpregs_set
+ [REGSET32_FP] = {
+ USER_REGSET_NOTE_TYPE(PRFPREG),
+ .n = sizeof(struct user_i387_ia32_struct) / sizeof(u32),
+ .size = sizeof(u32),
+ .align = sizeof(u32),
+ .active = regset_fpregs_active,
+ .regset_get = fpregs_get,
+ .set = fpregs_set
},
- [REGSET_XFP] = {
- .core_note_type = NT_PRXFPREG,
- .n = sizeof(struct user32_fxsr_struct) / sizeof(u32),
- .size = sizeof(u32), .align = sizeof(u32),
- .active = xfpregs_active, .get = xfpregs_get, .set = xfpregs_set
+ [REGSET32_XFP] = {
+ USER_REGSET_NOTE_TYPE(PRXFPREG),
+ .n = sizeof(struct fxregs_state) / sizeof(u32),
+ .size = sizeof(u32),
+ .align = sizeof(u32),
+ .active = regset_xregset_fpregs_active,
+ .regset_get = xfpregs_get,
+ .set = xfpregs_set
},
- [REGSET_XSTATE] = {
- .core_note_type = NT_X86_XSTATE,
- .size = sizeof(u64), .align = sizeof(u64),
- .active = xstateregs_active, .get = xstateregs_get,
- .set = xstateregs_set
+ [REGSET32_XSTATE] = {
+ USER_REGSET_NOTE_TYPE(X86_XSTATE),
+ .size = sizeof(u64),
+ .align = sizeof(u64),
+ .active = xstateregs_active,
+ .regset_get = xstateregs_get,
+ .set = xstateregs_set
},
- [REGSET_TLS] = {
- .core_note_type = NT_386_TLS,
- .n = GDT_ENTRY_TLS_ENTRIES, .bias = GDT_ENTRY_TLS_MIN,
- .size = sizeof(struct user_desc),
- .align = sizeof(struct user_desc),
- .active = regset_tls_active,
- .get = regset_tls_get, .set = regset_tls_set
+ [REGSET32_TLS] = {
+ USER_REGSET_NOTE_TYPE(386_TLS),
+ .n = GDT_ENTRY_TLS_ENTRIES,
+ .bias = GDT_ENTRY_TLS_MIN,
+ .size = sizeof(struct user_desc),
+ .align = sizeof(struct user_desc),
+ .active = regset_tls_active,
+ .regset_get = regset_tls_get,
+ .set = regset_tls_set
},
- [REGSET_IOPERM32] = {
- .core_note_type = NT_386_IOPERM,
- .n = IO_BITMAP_BYTES / sizeof(u32),
- .size = sizeof(u32), .align = sizeof(u32),
- .active = ioperm_active, .get = ioperm_get
+ [REGSET32_IOPERM] = {
+ USER_REGSET_NOTE_TYPE(386_IOPERM),
+ .n = IO_BITMAP_BYTES / sizeof(u32),
+ .size = sizeof(u32),
+ .align = sizeof(u32),
+ .active = ioperm_active,
+ .regset_get = ioperm_get
},
};
@@ -1280,21 +1362,40 @@ static const struct user_regset_view user_x86_32_view = {
*/
u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
-void update_regset_xstate_info(unsigned int size, u64 xstate_mask)
+void __init update_regset_xstate_info(unsigned int size, u64 xstate_mask)
{
#ifdef CONFIG_X86_64
- x86_64_regsets[REGSET_XSTATE].n = size / sizeof(u64);
+ x86_64_regsets[REGSET64_XSTATE].n = size / sizeof(u64);
#endif
#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
- x86_32_regsets[REGSET_XSTATE].n = size / sizeof(u64);
+ x86_32_regsets[REGSET32_XSTATE].n = size / sizeof(u64);
#endif
xstate_fx_sw_bytes[USER_XSTATE_XCR0_WORD] = xstate_mask;
}
+/*
+ * This is used by the core dump code to decide which regset to dump. The
+ * core dump code writes out the resulting .e_machine and the corresponding
+ * regsets. This is suboptimal if the task is messing around with its CS.L
+ * field, but at worst the core dump will end up missing some information.
+ *
+ * Unfortunately, it is also used by the broken PTRACE_GETREGSET and
+ * PTRACE_SETREGSET APIs. These APIs look at the .regsets field but have
+ * no way to make sure that the e_machine they use matches the caller's
+ * expectations. The result is that the data format returned by
+ * PTRACE_GETREGSET depends on the returned CS field (and even the offset
+ * of the returned CS field depends on its value!) and the data format
+ * accepted by PTRACE_SETREGSET is determined by the old CS value. The
+ * upshot is that it is basically impossible to use these APIs correctly.
+ *
+ * The best way to fix it in the long run would probably be to add new
+ * improved ptrace() APIs to read and write registers reliably, possibly by
+ * allowing userspace to select the ELF e_machine variant that they expect.
+ */
const struct user_regset_view *task_user_regset_view(struct task_struct *task)
{
#ifdef CONFIG_IA32_EMULATION
- if (test_tsk_thread_flag(task, TIF_IA32))
+ if (!user_64bit_mode(task_pt_regs(task)))
#endif
#if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION
return &user_x86_32_view;
@@ -1304,113 +1405,19 @@ const struct user_regset_view *task_user_regset_view(struct task_struct *task)
#endif
}
-static void fill_sigtrap_info(struct task_struct *tsk,
- struct pt_regs *regs,
- int error_code, int si_code,
- struct siginfo *info)
-{
- tsk->thread.trap_no = 1;
- tsk->thread.error_code = error_code;
-
- memset(info, 0, sizeof(*info));
- info->si_signo = SIGTRAP;
- info->si_code = si_code;
- info->si_addr = user_mode_vm(regs) ? (void __user *)regs->ip : NULL;
-}
-
-void user_single_step_siginfo(struct task_struct *tsk,
- struct pt_regs *regs,
- struct siginfo *info)
+void send_sigtrap(struct pt_regs *regs, int error_code, int si_code)
{
- fill_sigtrap_info(tsk, regs, 0, TRAP_BRKPT, info);
-}
+ struct task_struct *tsk = current;
-void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
- int error_code, int si_code)
-{
- struct siginfo info;
+ tsk->thread.trap_nr = X86_TRAP_DB;
+ tsk->thread.error_code = error_code;
- fill_sigtrap_info(tsk, regs, error_code, si_code, &info);
/* Send us the fake SIGTRAP */
- force_sig_info(SIGTRAP, &info, tsk);
-}
-
-
-#ifdef CONFIG_X86_32
-# define IS_IA32 1
-#elif defined CONFIG_IA32_EMULATION
-# define IS_IA32 is_compat_task()
-#else
-# define IS_IA32 0
-#endif
-
-/*
- * We must return the syscall number to actually look up in the table.
- * This can be -1L to skip running any syscall at all.
- */
-asmregparm long syscall_trace_enter(struct pt_regs *regs)
-{
- long ret = 0;
-
- /*
- * If we stepped into a sysenter/syscall insn, it trapped in
- * kernel mode; do_debug() cleared TF and set TIF_SINGLESTEP.
- * If user-mode had set TF itself, then it's still clear from
- * do_debug() and we need to set it again to restore the user
- * state. If we entered on the slow path, TF was already set.
- */
- if (test_thread_flag(TIF_SINGLESTEP))
- regs->flags |= X86_EFLAGS_TF;
-
- /* do the secure computing check first */
- secure_computing(regs->orig_ax);
-
- if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
- ret = -1L;
-
- if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
- tracehook_report_syscall_entry(regs))
- ret = -1L;
-
- if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
- trace_sys_enter(regs, regs->orig_ax);
-
- if (unlikely(current->audit_context)) {
- if (IS_IA32)
- audit_syscall_entry(AUDIT_ARCH_I386,
- regs->orig_ax,
- regs->bx, regs->cx,
- regs->dx, regs->si);
-#ifdef CONFIG_X86_64
- else
- audit_syscall_entry(AUDIT_ARCH_X86_64,
- regs->orig_ax,
- regs->di, regs->si,
- regs->dx, regs->r10);
-#endif
- }
-
- return ret ?: regs->orig_ax;
+ force_sig_fault(SIGTRAP, si_code,
+ user_mode(regs) ? (void __user *)regs->ip : NULL);
}
-asmregparm void syscall_trace_leave(struct pt_regs *regs)
+void user_single_step_report(struct pt_regs *regs)
{
- bool step;
-
- if (unlikely(current->audit_context))
- audit_syscall_exit(AUDITSC_RESULT(regs->ax), regs->ax);
-
- if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
- trace_sys_exit(regs, regs->ax);
-
- /*
- * If TIF_SYSCALL_EMU is set, we only get here because of
- * TIF_SINGLESTEP (i.e. this is PTRACE_SYSEMU_SINGLESTEP).
- * We already reported this syscall instruction in
- * syscall_trace_enter().
- */
- step = unlikely(test_thread_flag(TIF_SINGLESTEP)) &&
- !test_thread_flag(TIF_SYSCALL_EMU);
- if (step || test_thread_flag(TIF_SYSCALL_TRACE))
- tracehook_report_syscall_exit(regs, step);
+ send_sigtrap(regs, 0, TRAP_BRKPT);
}
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 239427ca02af..b3f81379c2fc 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -1,148 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/* paravirtual clock -- common code used by kvm/xen
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation; either version 2 of the License, or
- (at your option) any later version.
-
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- GNU General Public License for more details.
-
- You should have received a copy of the GNU General Public License
- along with this program; if not, write to the Free Software
- Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
+#include <linux/clocksource.h>
#include <linux/kernel.h>
#include <linux/percpu.h>
-#include <asm/pvclock.h>
+#include <linux/notifier.h>
+#include <linux/sched.h>
+#include <linux/gfp.h>
+#include <linux/memblock.h>
+#include <linux/nmi.h>
-/*
- * These are perodically updated
- * xen: magic shared_info page
- * kvm: gpa registered via msr
- * and then copied here.
- */
-struct pvclock_shadow_time {
- u64 tsc_timestamp; /* TSC at last update of time vals. */
- u64 system_timestamp; /* Time, in nanosecs, since boot. */
- u32 tsc_to_nsec_mul;
- int tsc_shift;
- u32 version;
- u8 flags;
-};
+#include <asm/fixmap.h>
+#include <asm/pvclock.h>
+#include <asm/vgtod.h>
static u8 valid_flags __read_mostly = 0;
+static struct pvclock_vsyscall_time_info *pvti_cpu0_va __read_mostly;
void pvclock_set_flags(u8 flags)
{
valid_flags = flags;
}
-/*
- * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
- * yielding a 64-bit result.
- */
-static inline u64 scale_delta(u64 delta, u32 mul_frac, int shift)
+unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
{
- u64 product;
-#ifdef __i386__
- u32 tmp1, tmp2;
-#endif
+ u64 pv_tsc_khz = 1000000ULL << 32;
- if (shift < 0)
- delta >>= -shift;
+ do_div(pv_tsc_khz, src->tsc_to_system_mul);
+ if (src->tsc_shift < 0)
+ pv_tsc_khz <<= -src->tsc_shift;
else
- delta <<= shift;
-
-#ifdef __i386__
- __asm__ (
- "mul %5 ; "
- "mov %4,%%eax ; "
- "mov %%edx,%4 ; "
- "mul %5 ; "
- "xor %5,%5 ; "
- "add %4,%%eax ; "
- "adc %5,%%edx ; "
- : "=A" (product), "=r" (tmp1), "=r" (tmp2)
- : "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
-#elif defined(__x86_64__)
- __asm__ (
- "mul %%rdx ; shrd $32,%%rdx,%%rax"
- : "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
-#else
-#error implement me!
-#endif
-
- return product;
+ pv_tsc_khz >>= src->tsc_shift;
+ return pv_tsc_khz;
}
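pvclock_tsc_khz() (moved unchanged above) inverts the 32.32 fixed-point scale the hypervisor publishes, where ns = ((ticks << tsc_shift) * tsc_to_system_mul) >> 32 for non-negative shifts. A worked userspace model of the same arithmetic:

#include <stdint.h>
#include <stdio.h>

static uint64_t model_tsc_khz(uint32_t mul, int8_t shift)
{
	uint64_t khz = 1000000ULL << 32;

	khz /= mul;
	return shift < 0 ? khz << -shift : khz >> shift;
}

int main(void)
{
	/* A 1 GHz TSC can be encoded as mul = 2^31, shift = 1:
	 * ns = ((ticks << 1) * 2^31) >> 32 = ticks, so 1 tick = 1 ns. */
	printf("%llu kHz\n",
	       (unsigned long long)model_tsc_khz(0x80000000u, 1));
	return 0;	/* prints 1000000, i.e. 1 GHz */
}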
-static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
+void pvclock_touch_watchdogs(void)
{
- u64 delta = native_read_tsc() - shadow->tsc_timestamp;
- return scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);
+ touch_softlockup_watchdog_sync();
+ clocksource_touch_watchdog();
+ rcu_cpu_stall_reset();
+ reset_hung_task_detector();
}
-/*
- * Reads a consistent set of time-base values from hypervisor,
- * into a shadow data area.
- */
-static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
- struct pvclock_vcpu_time_info *src)
+static atomic64_t last_value = ATOMIC64_INIT(0);
+
+void pvclock_resume(void)
{
- do {
- dst->version = src->version;
- rmb(); /* fetch version before data */
- dst->tsc_timestamp = src->tsc_timestamp;
- dst->system_timestamp = src->system_time;
- dst->tsc_to_nsec_mul = src->tsc_to_system_mul;
- dst->tsc_shift = src->tsc_shift;
- dst->flags = src->flags;
- rmb(); /* test version after fetching data */
- } while ((src->version & 1) || (dst->version != src->version));
-
- return dst->version;
+ atomic64_set(&last_value, 0);
}
-unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
{
- u64 pv_tsc_khz = 1000000ULL << 32;
+ unsigned version;
+ u8 flags;
- do_div(pv_tsc_khz, src->tsc_to_system_mul);
- if (src->tsc_shift < 0)
- pv_tsc_khz <<= -src->tsc_shift;
- else
- pv_tsc_khz >>= src->tsc_shift;
- return pv_tsc_khz;
-}
+ do {
+ version = pvclock_read_begin(src);
+ flags = src->flags;
+ } while (pvclock_read_retry(src, version));
-static atomic64_t last_value = ATOMIC64_INIT(0);
+ return flags & valid_flags;
+}
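pvclock_read_begin()/pvclock_read_retry() form a seqcount-style handshake: the hypervisor makes the version field odd before rewriting the time fields and even afterwards. A simplified model follows; the kernel versions additionally issue virt_rmb() barriers between the version and data accesses.

#include <stdint.h>

/* Snapshot the version with the "update in progress" bit masked off;
 * any concurrent update makes the retry test below fail the compare. */
static uint32_t read_begin(const volatile uint32_t *version)
{
	return *version & ~1u;
}

static int read_retry(const volatile uint32_t *version, uint32_t v)
{
	return *version != v;	/* raced with an update: go around again */
}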
-cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+static __always_inline
+u64 __pvclock_clocksource_read(struct pvclock_vcpu_time_info *src, bool dowd)
{
- struct pvclock_shadow_time shadow;
unsigned version;
- cycle_t ret, offset;
+ u64 ret;
u64 last;
+ u8 flags;
do {
- version = pvclock_get_time_values(&shadow, src);
- barrier();
- offset = pvclock_get_nsec_offset(&shadow);
- ret = shadow.system_timestamp + offset;
- barrier();
- } while (version != src->version);
+ version = pvclock_read_begin(src);
+ ret = __pvclock_read_cycles(src, rdtsc_ordered());
+ flags = src->flags;
+ } while (pvclock_read_retry(src, version));
+
+ if (dowd && unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
+ src->flags &= ~PVCLOCK_GUEST_STOPPED;
+ pvclock_touch_watchdogs();
+ }
if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
- (shadow.flags & PVCLOCK_TSC_STABLE_BIT))
+ (flags & PVCLOCK_TSC_STABLE_BIT))
return ret;
/*
* Assumption here is that last_value, a global accumulator, always goes
* forward. If we are less than that, we should not be much smaller.
- * We assume there is an error marging we're inside, and then the correction
+ * We assume there is an error margin we're inside, and then the correction
* does not sacrifice accuracy.
*
* For reads: global may have changed between test and return,
@@ -153,38 +101,66 @@ cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
* updating at the same time, and one of them could be slightly behind,
* making the assumption that last_value always go forward fail to hold.
*/
- last = atomic64_read(&last_value);
+ last = raw_atomic64_read(&last_value);
do {
- if (ret < last)
+ if (ret <= last)
return last;
- last = atomic64_cmpxchg(&last_value, last, ret);
- } while (unlikely(last != ret));
+ } while (!raw_atomic64_try_cmpxchg(&last_value, &last, ret));
return ret;
}
+u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
+{
+ return __pvclock_clocksource_read(src, true);
+}
+
+noinstr u64 pvclock_clocksource_read_nowd(struct pvclock_vcpu_time_info *src)
+{
+ return __pvclock_clocksource_read(src, false);
+}
+
void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
struct pvclock_vcpu_time_info *vcpu_time,
- struct timespec *ts)
+ struct timespec64 *ts)
{
u32 version;
u64 delta;
- struct timespec now;
+ struct timespec64 now;
/* get wallclock at system boot */
do {
version = wall_clock->version;
rmb(); /* fetch version before time */
+ /*
+ * Note: wall_clock->sec is a u32 value, so it can
+ * only store dates between 1970 and 2106. To allow
+ * times beyond that, we need to create a new hypercall
+ * interface with an extended pvclock_wall_clock structure
+ * like ARM has.
+ */
now.tv_sec = wall_clock->sec;
now.tv_nsec = wall_clock->nsec;
rmb(); /* fetch time before checking version */
} while ((wall_clock->version & 1) || (version != wall_clock->version));
delta = pvclock_clocksource_read(vcpu_time); /* time since system boot */
- delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;
+ delta += now.tv_sec * NSEC_PER_SEC + now.tv_nsec;
now.tv_nsec = do_div(delta, NSEC_PER_SEC);
now.tv_sec = delta;
- set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
+ set_normalized_timespec64(ts, now.tv_sec, now.tv_nsec);
+}
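The u32 wall_clock->sec limit called out in the new comment is concrete: the field wraps 2^32 - 1 seconds after the 1970 epoch, in February 2106. A quick check, assuming a platform with 64-bit time_t:

#include <stdio.h>
#include <time.h>

int main(void)
{
	time_t t = 4294967295UL;		/* UINT32_MAX seconds */
	printf("%s", asctime(gmtime(&t)));	/* Sun Feb  7 06:28:15 2106 */
	return 0;
}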
+
+void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
+{
+ WARN_ON(vclock_was_used(VDSO_CLOCKMODE_PVCLOCK));
+ pvti_cpu0_va = pvti;
+}
+
+struct pvclock_vsyscall_time_info *pvclock_get_pvti_cpu0_va(void)
+{
+ return pvti_cpu0_va;
}
+EXPORT_SYMBOL_GPL(pvclock_get_pvti_cpu0_va);
diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
index 939b9e98245f..a92f18db9610 100644
--- a/arch/x86/kernel/quirks.c
+++ b/arch/x86/kernel/quirks.c
@@ -1,16 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* This file contains work-arounds for x86 and x86_64 platform bugs.
*/
+#include <linux/dmi.h>
#include <linux/pci.h>
#include <linux/irq.h>
#include <asm/hpet.h>
+#include <asm/setup.h>
+#include <asm/mce.h>
+
+#include <linux/platform_data/x86/apple.h>
#if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)
-static void __devinit quirk_intel_irqbalance(struct pci_dev *dev)
+static void quirk_intel_irqbalance(struct pci_dev *dev)
{
- u8 config, rev;
+ u8 config;
u16 word;
/* BIOS may enable hardware IRQ balancing for
@@ -18,8 +24,7 @@ static void __devinit quirk_intel_irqbalance(struct pci_dev *dev)
* based platforms.
* Disable SW irqbalance/affinity on those platforms.
*/
- pci_read_config_byte(dev, PCI_CLASS_REVISION, &rev);
- if (rev > 0x9)
+ if (dev->revision > 0x9)
return;
/* enable access to config space*/
@@ -88,14 +93,12 @@ static void ich_force_hpet_resume(void)
BUG();
else
printk(KERN_DEBUG "Force enabled HPET at resume\n");
-
- return;
}
static void ich_force_enable_hpet(struct pci_dev *dev)
{
u32 val;
- u32 uninitialized_var(rcba);
+ u32 rcba;
int err = 0;
if (hpet_address || force_hpet_address)
@@ -110,7 +113,7 @@ static void ich_force_enable_hpet(struct pci_dev *dev)
}
/* use bits 31:14, 16 kB aligned */
- rcba_base = ioremap_nocache(rcba, 0x4000);
+ rcba_base = ioremap(rcba, 0x4000);
if (rcba_base == NULL) {
dev_printk(KERN_DEBUG, &dev->dev, "ioremap failed; "
"cannot force enable HPET\n");
@@ -185,7 +188,7 @@ static void hpet_print_force_info(void)
static void old_ich_force_hpet_resume(void)
{
u32 val;
- u32 uninitialized_var(gen_cntl);
+ u32 gen_cntl;
if (!force_hpet_address || !cached_dev)
return;
@@ -207,7 +210,7 @@ static void old_ich_force_hpet_resume(void)
static void old_ich_force_enable_hpet(struct pci_dev *dev)
{
u32 val;
- u32 uninitialized_var(gen_cntl);
+ u32 gen_cntl;
if (hpet_address || force_hpet_address)
return;
@@ -298,7 +301,7 @@ static void vt8237_force_hpet_resume(void)
static void vt8237_force_enable_hpet(struct pci_dev *dev)
{
- u32 uninitialized_var(val);
+ u32 val;
if (hpet_address || force_hpet_address)
return;
@@ -344,6 +347,8 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_8235,
vt8237_force_enable_hpet);
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_8237,
vt8237_force_enable_hpet);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_VIA, PCI_DEVICE_ID_VIA_CX700,
+ vt8237_force_enable_hpet);
static void ati_force_hpet_resume(void)
{
@@ -353,18 +358,22 @@ static void ati_force_hpet_resume(void)
static u32 ati_ixp4x0_rev(struct pci_dev *dev)
{
- u32 d;
- u8 b;
+ int err = 0;
+ u32 d = 0;
+ u8 b = 0;
- pci_read_config_byte(dev, 0xac, &b);
+ err = pci_read_config_byte(dev, 0xac, &b);
b &= ~(1<<5);
- pci_write_config_byte(dev, 0xac, b);
- pci_read_config_dword(dev, 0x70, &d);
+ err |= pci_write_config_byte(dev, 0xac, b);
+ err |= pci_read_config_dword(dev, 0x70, &d);
d |= 1<<8;
- pci_write_config_dword(dev, 0x70, d);
- pci_read_config_dword(dev, 0x8, &d);
+ err |= pci_write_config_dword(dev, 0x70, d);
+ err |= pci_read_config_dword(dev, 0x8, &d);
d &= 0xff;
dev_printk(KERN_DEBUG, &dev->dev, "SB4X0 revision 0x%x\n", d);
+
+ WARN_ON_ONCE(err);
+
return d;
}
@@ -423,7 +432,7 @@ static void nvidia_force_hpet_resume(void)
static void nvidia_force_enable_hpet(struct pci_dev *dev)
{
- u32 uninitialized_var(val);
+ u32 val;
if (hpet_address || force_hpet_address)
return;
@@ -440,7 +449,6 @@ static void nvidia_force_enable_hpet(struct pci_dev *dev)
dev_printk(KERN_DEBUG, &dev->dev, "Force enabled HPET at 0x%lx\n",
force_hpet_address);
cached_dev = dev;
- return;
}
/* ISA Bridges */
@@ -493,6 +501,23 @@ void force_hpet_resume(void)
}
/*
+ * According to the datasheet e6xx systems have the HPET hardwired to
+ * 0xfed00000
+ */
+static void e6xx_force_enable_hpet(struct pci_dev *dev)
+{
+ if (hpet_address || force_hpet_address)
+ return;
+
+ force_hpet_address = 0xFED00000;
+ force_hpet_resume_type = NONE_FORCE_HPET_RESUME;
+ dev_printk(KERN_DEBUG, &dev->dev, "Force enabled HPET at "
+ "0x%lx\n", force_hpet_address);
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_E6XX_CU,
+ e6xx_force_enable_hpet);
+
+/*
* HPET MSI on some boards (ATI SB700/SB800) has side effect on
* floppy DMA. Disable HPET MSI on such platforms.
* See erratum #27 (Misinterpreted MSI Requests May Result in
@@ -501,7 +526,7 @@ void force_hpet_resume(void)
*/
static void force_disable_hpet_msi(struct pci_dev *unused)
{
- hpet_msi_disable = 1;
+ hpet_msi_disable = true;
}
DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
@@ -511,7 +536,7 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, PCI_DEVICE_ID_ATI_SBX00_SMBUS,
#if defined(CONFIG_PCI) && defined(CONFIG_NUMA)
/* Set correct numa_node information for AMD NB functions */
-static void __init quirk_amd_nb_node(struct pci_dev *dev)
+static void quirk_amd_nb_node(struct pci_dev *dev)
{
struct pci_dev *nb_ht;
unsigned int devfn;
@@ -524,7 +549,7 @@ static void __init quirk_amd_nb_node(struct pci_dev *dev)
return;
pci_read_config_dword(nb_ht, 0x60, &val);
- node = val & 7;
+ node = pcibus_to_node(dev->bus) | (val & 7);
/*
* Some hardware may return an invalid node ID,
* so check it first:
@@ -552,4 +577,95 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_10H_NB_MISC,
quirk_amd_nb_node);
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_10H_NB_LINK,
quirk_amd_nb_node);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F0,
+ quirk_amd_nb_node);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F1,
+ quirk_amd_nb_node);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F2,
+ quirk_amd_nb_node);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F3,
+ quirk_amd_nb_node);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F4,
+ quirk_amd_nb_node);
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5,
+ quirk_amd_nb_node);
+
#endif
+
+#ifdef CONFIG_PCI
+/*
+ * Processor does not ensure DRAM scrub read/write sequence
+ * is atomic wrt accesses to CC6 save state area. Therefore
+ * if a concurrent scrub read/write access is to same address
+ * the entry may appear as if it is not written. This quirk
+ * applies to Fam16h models 00h-0Fh
+ *
+ * See "Revision Guide" for AMD F16h models 00h-0fh,
+ * document 51810 rev. 3.04, Nov 2013
+ */
+static void amd_disable_seq_and_redirect_scrub(struct pci_dev *dev)
+{
+ u32 val;
+
+ /*
+ * Suggested workaround:
+ * set D18F3x58[4:0] = 00h and set D18F3x5C[0] = 0b
+ */
+ pci_read_config_dword(dev, 0x58, &val);
+ if (val & 0x1F) {
+ val &= ~(0x1F);
+ pci_write_config_dword(dev, 0x58, val);
+ }
+
+ pci_read_config_dword(dev, 0x5C, &val);
+ if (val & BIT(0)) {
+ val &= ~BIT(0);
+ pci_write_config_dword(dev, 0x5c, val);
+ }
+}
+
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_16H_NB_F3,
+ amd_disable_seq_and_redirect_scrub);
+
+/* Ivy Bridge, Haswell, Broadwell */
+static void quirk_intel_brickland_xeon_ras_cap(struct pci_dev *pdev)
+{
+ u32 capid0;
+
+ pci_read_config_dword(pdev, 0x84, &capid0);
+
+ if (capid0 & 0x10)
+ enable_copy_mc_fragile();
+}
+
+/* Skylake */
+static void quirk_intel_purley_xeon_ras_cap(struct pci_dev *pdev)
+{
+ u32 capid0, capid5;
+
+ pci_read_config_dword(pdev, 0x84, &capid0);
+ pci_read_config_dword(pdev, 0x98, &capid5);
+
+ /*
+ * CAPID0{7:6} indicate whether this is an advanced RAS SKU
+ * CAPID5{8:5} indicate that various NVDIMM usage modes are
+ * enabled, so memory machine check recovery is also enabled.
+ */
+ if ((capid0 & 0xc0) == 0xc0 || (capid5 & 0x1e0))
+ enable_copy_mc_fragile();
+
+}
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0ec3, quirk_intel_brickland_xeon_ras_cap);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2fc0, quirk_intel_brickland_xeon_ras_cap);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x6fc0, quirk_intel_brickland_xeon_ras_cap);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x2083, quirk_intel_purley_xeon_ras_cap);
+#endif
+
+bool x86_apple_machine;
+EXPORT_SYMBOL(x86_apple_machine);
+
+void __init early_platform_quirks(void)
+{
+ x86_apple_machine = dmi_match(DMI_SYS_VENDOR, "Apple Inc.") ||
+ dmi_match(DMI_SYS_VENDOR, "Apple Computer, Inc.");
+}
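
The x86_apple_machine flag exported above is meant to be consumed elsewhere in the tree by code that needs Mac-specific behaviour. A minimal sketch of such a consumer, with a hypothetical helper name:

	/* sketch: gate a Mac-only workaround (apply_mac_workaround() is made up) */
	if (x86_apple_machine)
		apply_mac_workaround();
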
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index e3af342fe83a..6032fa9ec753 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -1,4 +1,7 @@
-#include <linux/module.h>
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/export.h>
#include <linux/reboot.h>
#include <linux/init.h>
#include <linux/pm.h>
@@ -6,25 +9,30 @@
#include <linux/dmi.h>
#include <linux/sched.h>
#include <linux/tboot.h>
+#include <linux/delay.h>
+#include <linux/objtool.h>
+#include <linux/pgtable.h>
+#include <linux/kexec.h>
+#include <linux/kvm_types.h>
#include <acpi/reboot.h>
#include <asm/io.h>
#include <asm/apic.h>
+#include <asm/io_apic.h>
#include <asm/desc.h>
#include <asm/hpet.h>
-#include <asm/pgtable.h>
#include <asm/proto.h>
#include <asm/reboot_fixups.h>
#include <asm/reboot.h>
#include <asm/pci_x86.h>
-#include <asm/virtext.h>
#include <asm/cpu.h>
+#include <asm/nmi.h>
+#include <asm/smp.h>
-#ifdef CONFIG_X86_32
-# include <linux/ctype.h>
-# include <linux/mc146818rtc.h>
-#else
-# include <asm/x86_init.h>
-#endif
+#include <linux/ctype.h>
+#include <linux/mc146818rtc.h>
+#include <asm/realmode.h>
+#include <asm/x86_init.h>
+#include <asm/efi.h>
/*
* Power off function, if any
@@ -32,16 +40,8 @@
void (*pm_power_off)(void);
EXPORT_SYMBOL(pm_power_off);
-static const struct desc_ptr no_idt = {};
-static int reboot_mode;
-enum reboot_type reboot_type = BOOT_KBD;
-int reboot_force;
-
-#if defined(CONFIG_X86_32) && defined(CONFIG_SMP)
-static int reboot_cpu = -1;
-#endif
-
-/* This is set if we need to go through the 'emergency' path.
+/*
+ * This is set if we need to go through the 'emergency' path.
* When machine_emergency_restart() is called, we may be on
* an inconsistent state and won't be able to do a clean cleanup
*/
@@ -50,91 +50,242 @@ static int reboot_emergency;
/* This is set by the PCI code if either type 1 or type 2 PCI is detected */
bool port_cf9_safe = false;
-/* reboot=b[ios] | s[mp] | t[riple] | k[bd] | e[fi] [, [w]arm | [c]old] | p[ci]
- warm Don't set the cold reboot flag
- cold Set the cold reboot flag
- bios Reboot by jumping through the BIOS (only for X86_32)
- smp Reboot by executing reset on BSP or other CPU (only for X86_32)
- triple Force a triple fault (init)
- kbd Use the keyboard controller. cold reset (default)
- acpi Use the RESET_REG in the FADT
- efi Use efi reset_system runtime service
- pci Use the so-called "PCI reset register", CF9
- force Avoid anything that could hang.
+/*
+ * Reboot options and system auto-detection code provided by
+ * Dell Inc. so their systems "just work". :-)
*/
-static int __init reboot_setup(char *str)
-{
- for (;;) {
- switch (*str) {
- case 'w':
- reboot_mode = 0x1234;
- break;
- case 'c':
- reboot_mode = 0;
- break;
-
-#ifdef CONFIG_X86_32
-#ifdef CONFIG_SMP
- case 's':
- if (isdigit(*(str+1))) {
- reboot_cpu = (int) (*(str+1) - '0');
- if (isdigit(*(str+2)))
- reboot_cpu = reboot_cpu*10 + (int)(*(str+2) - '0');
- }
- /* we will leave sorting out the final value
- when we are ready to reboot, since we might not
- have set up boot_cpu_id or smp_num_cpu */
- break;
-#endif /* CONFIG_SMP */
-
- case 'b':
-#endif
- case 'a':
- case 'k':
- case 't':
- case 'e':
- case 'p':
- reboot_type = *str;
- break;
+/*
+ * Some machines require the "reboot=a" commandline option.
+ */
+static int __init set_acpi_reboot(const struct dmi_system_id *d)
+{
+ if (reboot_type != BOOT_ACPI) {
+ reboot_type = BOOT_ACPI;
+ pr_info("%s series board detected. Selecting %s-method for reboots.\n",
+ d->ident, "ACPI");
+ }
+ return 0;
+}
- case 'f':
- reboot_force = 1;
- break;
- }
+/*
+ * Some machines require the "reboot=b" or "reboot=k" commandline options;
+ * this quirk makes that automatic.
+ */
+static int __init set_bios_reboot(const struct dmi_system_id *d)
+{
+ if (reboot_type != BOOT_BIOS) {
+ reboot_type = BOOT_BIOS;
+ pr_info("%s series board detected. Selecting %s-method for reboots.\n",
+ d->ident, "BIOS");
+ }
+ return 0;
+}
- str = strchr(str, ',');
- if (str)
- str++;
- else
- break;
+/*
+ * Some machines don't handle the default ACPI reboot method and
+ * require the EFI reboot method:
+ */
+static int __init set_efi_reboot(const struct dmi_system_id *d)
+{
+ if (reboot_type != BOOT_EFI && !efi_runtime_disabled()) {
+ reboot_type = BOOT_EFI;
+ pr_info("%s series board detected. Selecting EFI-method for reboot.\n", d->ident);
}
- return 1;
+ return 0;
}
-__setup("reboot=", reboot_setup);
+void __noreturn machine_real_restart(unsigned int type)
+{
+ local_irq_disable();
+
+ /*
+ * Write zero to CMOS register number 0x0f, which the BIOS POST
+ * routine will recognize as telling it to do a proper reboot. (Well
+ * that's what this book in front of me says -- it may only apply to
+ * the Phoenix BIOS though, it's not clear). At the same time,
+ * disable NMIs by setting the top bit in the CMOS address register,
+ * as we're about to do peculiar things to the CPU. I'm not sure if
+ * `outb_p' is needed instead of just `outb'. Use it to be on the
+ * safe side. (Yes, CMOS_WRITE does outb_p's. - Paul G.)
+ */
+ spin_lock(&rtc_lock);
+ CMOS_WRITE(0x00, 0x8f);
+ spin_unlock(&rtc_lock);
+ /*
+ * Switch to the trampoline page table.
+ */
+ load_trampoline_pgtable();
+ /* Jump to the identity-mapped low memory code */
#ifdef CONFIG_X86_32
-/*
- * Reboot options and system auto-detection code provided by
- * Dell Inc. so their systems "just work". :-)
- */
+ asm volatile("jmpl *%0" : :
+ "rm" (real_mode_header->machine_real_restart_asm),
+ "a" (type));
+#else
+ asm volatile("ljmpl *%0" : :
+ "m" (real_mode_header->machine_real_restart_asm),
+ "D" (type));
+#endif
+ unreachable();
+}
+#ifdef CONFIG_APM_MODULE
+EXPORT_SYMBOL(machine_real_restart);
+#endif
+STACK_FRAME_NON_STANDARD(machine_real_restart);
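
For reference, the CMOS_WRITE(0x00, 0x8f) above boils down to the classic RTC index/data port sequence. A sketch assuming the usual 0x70/0x71 port pair (not the literal <asm/mc146818rtc.h> expansion):

	outb_p(0x8f, 0x70);	/* select CMOS reg 0x0f; bit 7 keeps NMIs masked */
	outb_p(0x00, 0x71);	/* shutdown status 0x00: full POST on next reset */
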
/*
- * Some machines require the "reboot=b" commandline option,
- * this quirk makes that automatic.
+ * Some Apple MacBooks and MacBookPros need reboot=p to be able to reboot.
*/
-static int __init set_bios_reboot(const struct dmi_system_id *d)
+static int __init set_pci_reboot(const struct dmi_system_id *d)
{
- if (reboot_type != BOOT_BIOS) {
- reboot_type = BOOT_BIOS;
- printk(KERN_INFO "%s series board detected. Selecting BIOS-method for reboots.\n", d->ident);
+ if (reboot_type != BOOT_CF9_FORCE) {
+ reboot_type = BOOT_CF9_FORCE;
+ pr_info("%s series board detected. Selecting %s-method for reboots.\n",
+ d->ident, "PCI");
}
return 0;
}
-static struct dmi_system_id __initdata reboot_dmi_table[] = {
+static int __init set_kbd_reboot(const struct dmi_system_id *d)
+{
+ if (reboot_type != BOOT_KBD) {
+ reboot_type = BOOT_KBD;
+ pr_info("%s series board detected. Selecting %s-method for reboot.\n",
+ d->ident, "KBD");
+ }
+ return 0;
+}
+
+/*
+ * This is a single dmi_table handling all reboot quirks.
+ */
+static const struct dmi_system_id reboot_dmi_table[] __initconst = {
+
+ /* Acer */
+ { /* Handle reboot issue on Acer Aspire one */
+ .callback = set_kbd_reboot,
+ .ident = "Acer Aspire One A110",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Acer"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "AOA110"),
+ },
+ },
+ { /* Handle reboot issue on Acer TravelMate X514-51T */
+ .callback = set_efi_reboot,
+ .ident = "Acer TravelMate X514-51T",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Acer"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "TravelMate X514-51T"),
+ },
+ },
+
+ /* Apple */
+ { /* Handle problems with rebooting on Apple MacBook5 */
+ .callback = set_pci_reboot,
+ .ident = "Apple MacBook5",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "MacBook5"),
+ },
+ },
+ { /* Handle problems with rebooting on Apple MacBook6,1 */
+ .callback = set_pci_reboot,
+ .ident = "Apple MacBook6,1",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "MacBook6,1"),
+ },
+ },
+ { /* Handle problems with rebooting on Apple MacBookPro5 */
+ .callback = set_pci_reboot,
+ .ident = "Apple MacBookPro5",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "MacBookPro5"),
+ },
+ },
+ { /* Handle problems with rebooting on Apple Macmini3,1 */
+ .callback = set_pci_reboot,
+ .ident = "Apple Macmini3,1",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Macmini3,1"),
+ },
+ },
+ { /* Handle problems with rebooting on the iMac9,1. */
+ .callback = set_pci_reboot,
+ .ident = "Apple iMac9,1",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "iMac9,1"),
+ },
+ },
+ { /* Handle problems with rebooting on the iMac10,1. */
+ .callback = set_pci_reboot,
+ .ident = "Apple iMac10,1",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "iMac10,1"),
+ },
+ },
+
+ /* ASRock */
+ { /* Handle problems with rebooting on ASRock Q1900DC-ITX */
+ .callback = set_pci_reboot,
+ .ident = "ASRock Q1900DC-ITX",
+ .matches = {
+ DMI_MATCH(DMI_BOARD_VENDOR, "ASRock"),
+ DMI_MATCH(DMI_BOARD_NAME, "Q1900DC-ITX"),
+ },
+ },
+
+ /* ASUS */
+ { /* Handle problems with rebooting on ASUS P4S800 */
+ .callback = set_bios_reboot,
+ .ident = "ASUS P4S800",
+ .matches = {
+ DMI_MATCH(DMI_BOARD_VENDOR, "ASUSTeK Computer INC."),
+ DMI_MATCH(DMI_BOARD_NAME, "P4S800"),
+ },
+ },
+ { /* Handle problems with rebooting on ASUS EeeBook X205TA */
+ .callback = set_acpi_reboot,
+ .ident = "ASUS EeeBook X205TA",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "X205TA"),
+ },
+ },
+ { /* Handle problems with rebooting on ASUS EeeBook X205TAW */
+ .callback = set_acpi_reboot,
+ .ident = "ASUS EeeBook X205TAW",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "ASUSTeK COMPUTER INC."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "X205TAW"),
+ },
+ },
+
+ /* Certec */
+ { /* Handle problems with rebooting on Certec BPC600 */
+ .callback = set_pci_reboot,
+ .ident = "Certec BPC600",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Certec"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "BPC600"),
+ },
+ },
+
+ /* Dell */
+ { /* Handle problems with rebooting on Dell DXP061 */
+ .callback = set_bios_reboot,
+ .ident = "Dell DXP061",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Dell DXP061"),
+ },
+ },
{ /* Handle problems with rebooting on Dell E520's */
.callback = set_bios_reboot,
.ident = "Dell E520",
@@ -143,23 +294,57 @@ static struct dmi_system_id __initdata reboot_dmi_table[] = {
DMI_MATCH(DMI_PRODUCT_NAME, "Dell DM061"),
},
},
- { /* Handle problems with rebooting on Dell 1300's */
+ { /* Handle problems with rebooting on the Latitude E5410. */
+ .callback = set_pci_reboot,
+ .ident = "Dell Latitude E5410",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Latitude E5410"),
+ },
+ },
+ { /* Handle problems with rebooting on the Latitude E5420. */
+ .callback = set_pci_reboot,
+ .ident = "Dell Latitude E5420",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Latitude E5420"),
+ },
+ },
+ { /* Handle problems with rebooting on the Latitude E6320. */
+ .callback = set_pci_reboot,
+ .ident = "Dell Latitude E6320",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Latitude E6320"),
+ },
+ },
+ { /* Handle problems with rebooting on the Latitude E6420. */
+ .callback = set_pci_reboot,
+ .ident = "Dell Latitude E6420",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Latitude E6420"),
+ },
+ },
+ { /* Handle problems with rebooting on Dell Optiplex 330 with 0KP561 */
.callback = set_bios_reboot,
- .ident = "Dell PowerEdge 1300",
+ .ident = "Dell OptiPlex 330",
.matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"),
- DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge 1300/"),
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 330"),
+ DMI_MATCH(DMI_BOARD_NAME, "0KP561"),
},
},
- { /* Handle problems with rebooting on Dell 300's */
+ { /* Handle problems with rebooting on Dell Optiplex 360 with 0T656F */
.callback = set_bios_reboot,
- .ident = "Dell PowerEdge 300",
+ .ident = "Dell OptiPlex 360",
.matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"),
- DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge 300/"),
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 360"),
+ DMI_MATCH(DMI_BOARD_NAME, "0T656F"),
},
},
- { /* Handle problems with rebooting on Dell Optiplex 745's SFF*/
+ { /* Handle problems with rebooting on Dell Optiplex 745's SFF */
.callback = set_bios_reboot,
.ident = "Dell OptiPlex 745",
.matches = {
@@ -167,7 +352,7 @@ static struct dmi_system_id __initdata reboot_dmi_table[] = {
DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 745"),
},
},
- { /* Handle problems with rebooting on Dell Optiplex 745's DFF*/
+ { /* Handle problems with rebooting on Dell Optiplex 745's DFF */
.callback = set_bios_reboot,
.ident = "Dell OptiPlex 745",
.matches = {
@@ -176,7 +361,7 @@ static struct dmi_system_id __initdata reboot_dmi_table[] = {
DMI_MATCH(DMI_BOARD_NAME, "0MM599"),
},
},
- { /* Handle problems with rebooting on Dell Optiplex 745 with 0KW626 */
+ { /* Handle problems with rebooting on Dell Optiplex 745 with 0KW626 */
.callback = set_bios_reboot,
.ident = "Dell OptiPlex 745",
.matches = {
@@ -185,31 +370,38 @@ static struct dmi_system_id __initdata reboot_dmi_table[] = {
DMI_MATCH(DMI_BOARD_NAME, "0KW626"),
},
},
- { /* Handle problems with rebooting on Dell Optiplex 330 with 0KP561 */
+ { /* Handle problems with rebooting on Dell OptiPlex 760 with 0G919G */
.callback = set_bios_reboot,
- .ident = "Dell OptiPlex 330",
+ .ident = "Dell OptiPlex 760",
.matches = {
DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
- DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 330"),
- DMI_MATCH(DMI_BOARD_NAME, "0KP561"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 760"),
+ DMI_MATCH(DMI_BOARD_NAME, "0G919G"),
},
},
- { /* Handle problems with rebooting on Dell Optiplex 360 with 0T656F */
- .callback = set_bios_reboot,
- .ident = "Dell OptiPlex 360",
+ { /* Handle problems with rebooting on the OptiPlex 990. */
+ .callback = set_pci_reboot,
+ .ident = "Dell OptiPlex 990 BIOS A0x",
.matches = {
DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
- DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 360"),
- DMI_MATCH(DMI_BOARD_NAME, "0T656F"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 990"),
+ DMI_MATCH(DMI_BIOS_VERSION, "A0"),
},
},
- { /* Handle problems with rebooting on Dell OptiPlex 760 with 0G919G*/
+ { /* Handle problems with rebooting on Dell 300's */
.callback = set_bios_reboot,
- .ident = "Dell OptiPlex 760",
+ .ident = "Dell PowerEdge 300",
.matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
- DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 760"),
- DMI_MATCH(DMI_BOARD_NAME, "0G919G"),
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge 300/"),
+ },
+ },
+ { /* Handle problems with rebooting on Dell 1300's */
+ .callback = set_bios_reboot,
+ .ident = "Dell PowerEdge 1300",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Computer Corporation"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge 1300/"),
},
},
{ /* Handle problems with rebooting on Dell 2400's */
@@ -220,6 +412,22 @@ static struct dmi_system_id __initdata reboot_dmi_table[] = {
DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge 2400"),
},
},
+ { /* Handle problems with rebooting on the Dell PowerEdge C6100. */
+ .callback = set_pci_reboot,
+ .ident = "Dell PowerEdge C6100",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "C6100"),
+ },
+ },
+ { /* Handle problems with rebooting on the Precision M6600. */
+ .callback = set_pci_reboot,
+ .ident = "Dell Precision M6600",
+ .matches = {
+ DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
+ DMI_MATCH(DMI_PRODUCT_NAME, "Precision M6600"),
+ },
+ },
{ /* Handle problems with rebooting on Dell T5400's */
.callback = set_bios_reboot,
.ident = "Dell Precision T5400",
@@ -236,14 +444,6 @@ static struct dmi_system_id __initdata reboot_dmi_table[] = {
DMI_MATCH(DMI_PRODUCT_NAME, "Precision WorkStation T7400"),
},
},
- { /* Handle problems with rebooting on HP laptops */
- .callback = set_bios_reboot,
- .ident = "HP Compaq Laptop",
- .matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
- DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq"),
- },
- },
{ /* Handle problems with rebooting on Dell XPS710 */
.callback = set_bios_reboot,
.ident = "Dell XPS710",
@@ -252,240 +452,71 @@ static struct dmi_system_id __initdata reboot_dmi_table[] = {
DMI_MATCH(DMI_PRODUCT_NAME, "Dell XPS710"),
},
},
- { /* Handle problems with rebooting on Dell DXP061 */
- .callback = set_bios_reboot,
- .ident = "Dell DXP061",
+ { /* Handle problems with rebooting on Dell Optiplex 7450 AIO */
+ .callback = set_acpi_reboot,
+ .ident = "Dell OptiPlex 7450 AIO",
.matches = {
DMI_MATCH(DMI_SYS_VENDOR, "Dell Inc."),
- DMI_MATCH(DMI_PRODUCT_NAME, "Dell DXP061"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "OptiPlex 7450 AIO"),
},
},
- { /* Handle problems with rebooting on Sony VGN-Z540N */
+
+ /* Hewlett-Packard */
+ { /* Handle problems with rebooting on HP laptops */
.callback = set_bios_reboot,
- .ident = "Sony VGN-Z540N",
+ .ident = "HP Compaq Laptop",
.matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "Sony Corporation"),
- DMI_MATCH(DMI_PRODUCT_NAME, "VGN-Z540N"),
+ DMI_MATCH(DMI_SYS_VENDOR, "Hewlett-Packard"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "HP Compaq"),
},
},
- { /* Handle problems with rebooting on CompuLab SBC-FITPC2 */
- .callback = set_bios_reboot,
- .ident = "CompuLab SBC-FITPC2",
+
+ { /* PCIe Wifi card isn't detected after reboot otherwise */
+ .callback = set_pci_reboot,
+ .ident = "Zotac ZBOX CI327 nano",
.matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "CompuLab"),
- DMI_MATCH(DMI_PRODUCT_NAME, "SBC-FITPC2"),
+ DMI_MATCH(DMI_SYS_VENDOR, "NA"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "ZBOX-CI327NANO-GS-01"),
},
},
- { /* Handle problems with rebooting on ASUS P4S800 */
+
+ /* Sony */
+ { /* Handle problems with rebooting on Sony VGN-Z540N */
.callback = set_bios_reboot,
- .ident = "ASUS P4S800",
+ .ident = "Sony VGN-Z540N",
.matches = {
- DMI_MATCH(DMI_BOARD_VENDOR, "ASUSTeK Computer INC."),
- DMI_MATCH(DMI_BOARD_NAME, "P4S800"),
+ DMI_MATCH(DMI_SYS_VENDOR, "Sony Corporation"),
+ DMI_MATCH(DMI_PRODUCT_NAME, "VGN-Z540N"),
},
},
+
{ }
};
static int __init reboot_init(void)
{
- dmi_check_system(reboot_dmi_table);
- return 0;
-}
-core_initcall(reboot_init);
-
-/* The following code and data reboots the machine by switching to real
- mode and jumping to the BIOS reset entry point, as if the CPU has
- really been reset. The previous version asked the keyboard
- controller to pulse the CPU reset line, which is more thorough, but
- doesn't work with at least one type of 486 motherboard. It is easy
- to stop this code working; hence the copious comments. */
-static const unsigned long long
-real_mode_gdt_entries [3] =
-{
- 0x0000000000000000ULL, /* Null descriptor */
- 0x00009b000000ffffULL, /* 16-bit real-mode 64k code at 0x00000000 */
- 0x000093000100ffffULL /* 16-bit real-mode 64k data at 0x00000100 */
-};
-
-static const struct desc_ptr
-real_mode_gdt = { sizeof (real_mode_gdt_entries) - 1, (long)real_mode_gdt_entries },
-real_mode_idt = { 0x3ff, 0 };
-
-/* This is 16-bit protected mode code to disable paging and the cache,
- switch to real mode and jump to the BIOS reset code.
-
- The instruction that switches to real mode by writing to CR0 must be
- followed immediately by a far jump instruction, which set CS to a
- valid value for real mode, and flushes the prefetch queue to avoid
- running instructions that have already been decoded in protected
- mode.
-
- Clears all the flags except ET, especially PG (paging), PE
- (protected-mode enable) and TS (task switch for coprocessor state
- save). Flushes the TLB after paging has been disabled. Sets CD and
- NW, to disable the cache on a 486, and invalidates the cache. This
- is more like the state of a 486 after reset. I don't know if
- something else should be done for other chips.
-
- More could be done here to set up the registers as if a CPU reset had
- occurred; hopefully real BIOSs don't assume much. */
-static const unsigned char real_mode_switch [] =
-{
- 0x66, 0x0f, 0x20, 0xc0, /* movl %cr0,%eax */
- 0x66, 0x83, 0xe0, 0x11, /* andl $0x00000011,%eax */
- 0x66, 0x0d, 0x00, 0x00, 0x00, 0x60, /* orl $0x60000000,%eax */
- 0x66, 0x0f, 0x22, 0xc0, /* movl %eax,%cr0 */
- 0x66, 0x0f, 0x22, 0xd8, /* movl %eax,%cr3 */
- 0x66, 0x0f, 0x20, 0xc3, /* movl %cr0,%ebx */
- 0x66, 0x81, 0xe3, 0x00, 0x00, 0x00, 0x60, /* andl $0x60000000,%ebx */
- 0x74, 0x02, /* jz f */
- 0x0f, 0x09, /* wbinvd */
- 0x24, 0x10, /* f: andb $0x10,al */
- 0x66, 0x0f, 0x22, 0xc0 /* movl %eax,%cr0 */
-};
-static const unsigned char jump_to_bios [] =
-{
- 0xea, 0x00, 0x00, 0xff, 0xff /* ljmp $0xffff,$0x0000 */
-};
-
-/*
- * Switch to real mode and then execute the code
- * specified by the code and length parameters.
- * We assume that length will aways be less that 100!
- */
-void machine_real_restart(const unsigned char *code, int length)
-{
- local_irq_disable();
+ int rv;
- /* Write zero to CMOS register number 0x0f, which the BIOS POST
- routine will recognize as telling it to do a proper reboot. (Well
- that's what this book in front of me says -- it may only apply to
- the Phoenix BIOS though, it's not clear). At the same time,
- disable NMIs by setting the top bit in the CMOS address register,
- as we're about to do peculiar things to the CPU. I'm not sure if
- `outb_p' is needed instead of just `outb'. Use it to be on the
- safe side. (Yes, CMOS_WRITE does outb_p's. - Paul G.)
+ /*
+ * Only do the DMI check if reboot_type hasn't been overridden
+ * on the command line
*/
- spin_lock(&rtc_lock);
- CMOS_WRITE(0x00, 0x8f);
- spin_unlock(&rtc_lock);
-
- /* Remap the kernel at virtual address zero, as well as offset zero
- from the kernel segment. This assumes the kernel segment starts at
- virtual address PAGE_OFFSET. */
- memcpy(swapper_pg_dir, swapper_pg_dir + KERNEL_PGD_BOUNDARY,
- sizeof(swapper_pg_dir [0]) * KERNEL_PGD_PTRS);
+ if (!reboot_default)
+ return 0;
/*
- * Use `swapper_pg_dir' as our page directory.
+ * The DMI quirks table takes precedence. If no quirks entry
+ * matches and the ACPI Hardware Reduced bit is set and EFI
+ * runtime services are enabled, force EFI reboot.
*/
- load_cr3(swapper_pg_dir);
-
- /* Write 0x1234 to absolute memory location 0x472. The BIOS reads
- this on booting to tell it to "Bypass memory test (also warm
- boot)". This seems like a fairly standard thing that gets set by
- REBOOT.COM programs, and the previous reset routine did this
- too. */
- *((unsigned short *)0x472) = reboot_mode;
-
- /* For the switch to real mode, copy some code to low memory. It has
- to be in the first 64k because it is running in 16-bit mode, and it
- has to have the same physical and virtual address, because it turns
- off paging. Copy it near the end of the first page, out of the way
- of BIOS variables. */
- memcpy((void *)(0x1000 - sizeof(real_mode_switch) - 100),
- real_mode_switch, sizeof (real_mode_switch));
- memcpy((void *)(0x1000 - 100), code, length);
-
- /* Set up the IDT for real mode. */
- load_idt(&real_mode_idt);
-
- /* Set up a GDT from which we can load segment descriptors for real
- mode. The GDT is not used in real mode; it is just needed here to
- prepare the descriptors. */
- load_gdt(&real_mode_gdt);
-
- /* Load the data segment registers, and thus the descriptors ready for
- real mode. The base address of each segment is 0x100, 16 times the
- selector value being loaded here. This is so that the segment
- registers don't have to be reloaded after switching to real mode:
- the values are consistent for real mode operation already. */
- __asm__ __volatile__ ("movl $0x0010,%%eax\n"
- "\tmovl %%eax,%%ds\n"
- "\tmovl %%eax,%%es\n"
- "\tmovl %%eax,%%fs\n"
- "\tmovl %%eax,%%gs\n"
- "\tmovl %%eax,%%ss" : : : "eax");
-
- /* Jump to the 16-bit code that we copied earlier. It disables paging
- and the cache, switches to real mode, and jumps to the BIOS reset
- entry point. */
- __asm__ __volatile__ ("ljmp $0x0008,%0"
- :
- : "i" ((void *)(0x1000 - sizeof (real_mode_switch) - 100)));
-}
-#ifdef CONFIG_APM_MODULE
-EXPORT_SYMBOL(machine_real_restart);
-#endif
+ rv = dmi_check_system(reboot_dmi_table);
-#endif /* CONFIG_X86_32 */
+ if (!rv && efi_reboot_required() && !efi_runtime_disabled())
+ reboot_type = BOOT_EFI;
-/*
- * Some Apple MacBook and MacBookPro's needs reboot=p to be able to reboot
- */
-static int __init set_pci_reboot(const struct dmi_system_id *d)
-{
- if (reboot_type != BOOT_CF9) {
- reboot_type = BOOT_CF9;
- printk(KERN_INFO "%s series board detected. "
- "Selecting PCI-method for reboots.\n", d->ident);
- }
return 0;
}
-
-static struct dmi_system_id __initdata pci_reboot_dmi_table[] = {
- { /* Handle problems with rebooting on Apple MacBook5 */
- .callback = set_pci_reboot,
- .ident = "Apple MacBook5",
- .matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
- DMI_MATCH(DMI_PRODUCT_NAME, "MacBook5"),
- },
- },
- { /* Handle problems with rebooting on Apple MacBookPro5 */
- .callback = set_pci_reboot,
- .ident = "Apple MacBookPro5",
- .matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
- DMI_MATCH(DMI_PRODUCT_NAME, "MacBookPro5"),
- },
- },
- { /* Handle problems with rebooting on Apple Macmini3,1 */
- .callback = set_pci_reboot,
- .ident = "Apple Macmini3,1",
- .matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
- DMI_MATCH(DMI_PRODUCT_NAME, "Macmini3,1"),
- },
- },
- { /* Handle problems with rebooting on the iMac9,1. */
- .callback = set_pci_reboot,
- .ident = "Apple iMac9,1",
- .matches = {
- DMI_MATCH(DMI_SYS_VENDOR, "Apple Inc."),
- DMI_MATCH(DMI_PRODUCT_NAME, "iMac9,1"),
- },
- },
- { }
-};
-
-static int __init pci_reboot_init(void)
-{
- dmi_check_system(pci_reboot_dmi_table);
- return 0;
-}
-core_initcall(pci_reboot_init);
+core_initcall(reboot_init);
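
dmi_check_system() returns the number of entries that matched, which is what lets reboot_init() fall back to EFI only when no DMI quirk fired. A sketch of the resulting precedence:

	/*
	 * Sketch of the decision order implemented above:
	 *   1. "reboot=" given on the command line -> reboot_default is false,
	 *      skip everything
	 *   2. a reboot_dmi_table entry matches    -> its callback picks the method
	 *   3. otherwise, BOOT_EFI if efi_reboot_required() and EFI runtime
	 *      services are enabled
	 */
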
static inline void kb_wait(void)
{
@@ -498,118 +529,187 @@ static inline void kb_wait(void)
}
}
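
kb_wait(), whose loop body falls between the two hunks above, spins until the i8042 controller is ready to accept a command, so the 0xFE pulse-reset writes below are not dropped. A minimal sketch assuming the usual status-port semantics (port 0x64, bit 1 = input buffer full):

	int i;

	for (i = 0; i < 0x10000; i++) {
		if ((inb(0x64) & 0x02) == 0)	/* input buffer empty? */
			break;
		udelay(2);
	}
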
-static void vmxoff_nmi(int cpu, struct die_args *args)
+static inline void nmi_shootdown_cpus_on_restart(void);
+
+#if IS_ENABLED(CONFIG_KVM_X86)
+/* RCU-protected callback to disable virtualization prior to reboot. */
+static cpu_emergency_virt_cb __rcu *cpu_emergency_virt_callback;
+
+void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback)
{
- cpu_emergency_vmxoff();
+ if (WARN_ON_ONCE(rcu_access_pointer(cpu_emergency_virt_callback)))
+ return;
+
+ rcu_assign_pointer(cpu_emergency_virt_callback, callback);
}
+EXPORT_SYMBOL_FOR_KVM(cpu_emergency_register_virt_callback);
-/* Use NMIs as IPIs to tell all CPUs to disable virtualization
+void cpu_emergency_unregister_virt_callback(cpu_emergency_virt_cb *callback)
+{
+ if (WARN_ON_ONCE(rcu_access_pointer(cpu_emergency_virt_callback) != callback))
+ return;
+
+ rcu_assign_pointer(cpu_emergency_virt_callback, NULL);
+ synchronize_rcu();
+}
+EXPORT_SYMBOL_FOR_KVM(cpu_emergency_unregister_virt_callback);
+
+/*
+ * Disable virtualization, i.e. VMX or SVM, to ensure INIT is recognized during
+ * reboot. VMX blocks INIT if the CPU is post-VMXON, and SVM blocks INIT if
+ * GIF=0, i.e. if the crash occurred between CLGI and STGI.
*/
-static void emergency_vmx_disable_all(void)
+void cpu_emergency_disable_virtualization(void)
+{
+ cpu_emergency_virt_cb *callback;
+
+ /*
+ * IRQs must be disabled as KVM enables virtualization in hardware via
+ * function call IPIs, i.e. IRQs need to be disabled to guarantee
+ * virtualization stays disabled.
+ */
+ lockdep_assert_irqs_disabled();
+
+ rcu_read_lock();
+ callback = rcu_dereference(cpu_emergency_virt_callback);
+ if (callback)
+ callback();
+ rcu_read_unlock();
+}
+
+static void emergency_reboot_disable_virtualization(void)
{
- /* Just make sure we won't change CPUs while doing this */
local_irq_disable();
- /* We need to disable VMX on all CPUs before rebooting, otherwise
- * we risk hanging up the machine, because the CPU ignore INIT
- * signals when VMX is enabled.
- *
- * We can't take any locks and we may be on an inconsistent
- * state, so we use NMIs as IPIs to tell the other CPUs to disable
- * VMX and halt.
+ /*
+ * Disable virtualization on all CPUs before rebooting to avoid hanging
+ * the system, as VMX and SVM block INIT when running in the host.
*
- * For safety, we will avoid running the nmi_shootdown_cpus()
- * stuff unnecessarily, but we don't have a way to check
- * if other CPUs have VMX enabled. So we will call it only if the
- * CPU we are running on has VMX enabled.
+ * We can't take any locks and we may be on an inconsistent state, so
+ * use NMIs as IPIs to tell the other CPUs to disable VMX/SVM and halt.
*
- * We will miss cases where VMX is not enabled on all CPUs. This
- * shouldn't do much harm because KVM always enable VMX on all
- * CPUs anyway. But we can miss it on the small window where KVM
- * is still enabling VMX.
+ * Do the NMI shootdown even if virtualization is off on _this_ CPU, as
+ * other CPUs may have virtualization enabled.
*/
- if (cpu_has_vmx() && cpu_vmx_enabled()) {
- /* Disable VMX on this CPU.
- */
- cpu_vmxoff();
-
- /* Halt and disable VMX on the other CPUs */
- nmi_shootdown_cpus(vmxoff_nmi);
+ if (rcu_access_pointer(cpu_emergency_virt_callback)) {
+ /* Safely force _this_ CPU out of VMX/SVM operation. */
+ cpu_emergency_disable_virtualization();
+ /* Disable VMX/SVM and halt on other CPUs. */
+ nmi_shootdown_cpus_on_restart();
}
}
-
+#else
+static void emergency_reboot_disable_virtualization(void) { }
+#endif /* CONFIG_KVM_X86 */
void __attribute__((weak)) mach_reboot_fixups(void)
{
}
+/*
+ * To the best of our knowledge, Windows-compatible x86 hardware expects
+ * the following on reboot:
+ *
+ * 1) If the FADT has the ACPI reboot register flag set, try it
+ * 2) If still alive, write to the keyboard controller
+ * 3) If still alive, write to the ACPI reboot register again
+ * 4) If still alive, write to the keyboard controller again
+ * 5) If still alive, call the EFI runtime service to reboot
+ * 6) If no EFI runtime service, call the BIOS to do a reboot
+ *
+ * We default to following the same pattern. We also have
+ * two other reboot methods: 'triple fault' and 'PCI', which
+ * can be triggered via the reboot= kernel boot option or
+ * via quirks.
+ *
+ * This means that this function can never return; it can only
+ * misbehave by not rebooting properly and hanging.
+ */
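
The switch below encodes that list as a simple state machine: each case performs one reset attempt and then rewrites reboot_type to the next method to try. Roughly:

	/*
	 * Sketch of the fallback chain below:
	 *   ACPI -> KBD -> (ACPI once more, if ACPI was the original choice)
	 *        -> EFI -> BIOS -> CF9_SAFE -> TRIPLE -> KBD -> ...
	 */
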
static void native_machine_emergency_restart(void)
{
int i;
+ int attempt = 0;
+ int orig_reboot_type = reboot_type;
+ unsigned short mode;
if (reboot_emergency)
- emergency_vmx_disable_all();
+ emergency_reboot_disable_virtualization();
tboot_shutdown(TB_SHUTDOWN_REBOOT);
/* Tell the BIOS if we want cold or warm reboot */
- *((unsigned short *)__va(0x472)) = reboot_mode;
+ mode = reboot_mode == REBOOT_WARM ? 0x1234 : 0;
+ *((unsigned short *)__va(0x472)) = mode;
+
+ /*
+ * If an EFI capsule has been registered with the firmware then
+ * override the reboot= parameter.
+ */
+ if (efi_capsule_pending(NULL)) {
+ pr_info("EFI capsule is pending, forcing EFI reboot.\n");
+ reboot_type = BOOT_EFI;
+ }
for (;;) {
/* Could also try the reset bit in the Hammer NB */
switch (reboot_type) {
+ case BOOT_ACPI:
+ acpi_reboot();
+ reboot_type = BOOT_KBD;
+ break;
+
case BOOT_KBD:
- mach_reboot_fixups(); /* for board specific fixups */
+ mach_reboot_fixups(); /* For board specific fixups */
for (i = 0; i < 10; i++) {
kb_wait();
udelay(50);
- outb(0xfe, 0x64); /* pulse reset low */
+ outb(0xfe, 0x64); /* Pulse reset low */
udelay(50);
}
-
- case BOOT_TRIPLE:
- load_idt(&no_idt);
- __asm__ __volatile__("int3");
-
- reboot_type = BOOT_KBD;
+ if (attempt == 0 && orig_reboot_type == BOOT_ACPI) {
+ attempt = 1;
+ reboot_type = BOOT_ACPI;
+ } else {
+ reboot_type = BOOT_EFI;
+ }
break;
-#ifdef CONFIG_X86_32
- case BOOT_BIOS:
- machine_real_restart(jump_to_bios, sizeof(jump_to_bios));
-
- reboot_type = BOOT_KBD;
+ case BOOT_EFI:
+ efi_reboot(reboot_mode, NULL);
+ reboot_type = BOOT_BIOS;
break;
-#endif
- case BOOT_ACPI:
- acpi_reboot();
- reboot_type = BOOT_KBD;
- break;
+ case BOOT_BIOS:
+ machine_real_restart(MRR_BIOS);
- case BOOT_EFI:
- if (efi_enabled)
- efi.reset_system(reboot_mode ?
- EFI_RESET_WARM :
- EFI_RESET_COLD,
- EFI_SUCCESS, 0, NULL);
- reboot_type = BOOT_KBD;
+ /* We're probably dead after this, but... */
+ reboot_type = BOOT_CF9_SAFE;
break;
- case BOOT_CF9:
+ case BOOT_CF9_FORCE:
port_cf9_safe = true;
- /* fall through */
+ fallthrough;
- case BOOT_CF9_COND:
+ case BOOT_CF9_SAFE:
if (port_cf9_safe) {
- u8 cf9 = inb(0xcf9) & ~6;
+ u8 reboot_code = reboot_mode == REBOOT_WARM ? 0x06 : 0x0E;
+ u8 cf9 = inb(0xcf9) & ~reboot_code;
outb(cf9|2, 0xcf9); /* Request hard reset */
udelay(50);
- outb(cf9|6, 0xcf9); /* Actually do the reset */
+ /* Actually do the reset */
+ outb(cf9|reboot_code, 0xcf9);
udelay(50);
}
+ reboot_type = BOOT_TRIPLE;
+ break;
+
+ case BOOT_TRIPLE:
+ idt_invalidate();
+ __asm__ __volatile__("int3");
+
+ /* We're probably dead after this, but... */
reboot_type = BOOT_KBD;
break;
}
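
The 0x06/0x0E values in the CF9 case above follow the conventional "reset control register" layout; the bit meanings below are an assumption from that convention, not stated in the patch:

	/*
	 * Assumed CF9 (reset control) layout:
	 *   bit 1 (0x02) SYS_RST  - reset request
	 *   bit 2 (0x04) RST_CPU  - 0->1 edge triggers the reset
	 *   bit 3 (0x08) FULL_RST - cold (full) rather than warm reset
	 * hence reboot_code is 0x06 for REBOOT_WARM and 0x0E otherwise.
	 */
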
@@ -618,37 +718,42 @@ static void native_machine_emergency_restart(void)
void native_machine_shutdown(void)
{
- /* Stop the cpus and apics */
-#ifdef CONFIG_SMP
-
- /* The boot cpu is always logical cpu 0 */
- int reboot_cpu_id = 0;
+ /*
+ * Call enc_kexec_begin() while all CPUs are still active and
+ * interrupts are enabled. This will allow all in-flight memory
+ * conversions to finish cleanly.
+ */
+ if (kexec_in_progress)
+ x86_platform.guest.enc_kexec_begin();
-#ifdef CONFIG_X86_32
- /* See if there has been given a command line override */
- if ((reboot_cpu != -1) && (reboot_cpu < nr_cpu_ids) &&
- cpu_online(reboot_cpu))
- reboot_cpu_id = reboot_cpu;
+ /* Stop the cpus and apics */
+#ifdef CONFIG_X86_IO_APIC
+ /*
+ * Disabling IO APIC before local APIC is a workaround for
+ * erratum AVR31 in "Intel Atom Processor C2000 Product Family
+ * Specification Update". In this situation, interrupts that target
+ * a Logical Processor whose Local APIC is either in the process of
+ * being hardware disabled or software disabled are neither delivered
+ * nor discarded. When this erratum occurs, the processor may hang.
+ *
+ * Even without the erratum, it still makes sense to quiet IO APIC
+ * before disabling Local APIC.
+ */
+ clear_IO_APIC();
#endif
- /* Make certain the cpu I'm about to reboot on is online */
- if (!cpu_online(reboot_cpu_id))
- reboot_cpu_id = smp_processor_id();
-
- /* Make certain I only run on the appropriate processor */
- set_cpus_allowed_ptr(current, cpumask_of(reboot_cpu_id));
-
- /* O.K Now that I'm on the appropriate processor,
- * stop all of the others.
+#ifdef CONFIG_SMP
+ /*
+ * Stop all of the others. Also disable the local IRQ so we do
+ * not receive the per-CPU timer interrupt, which may trigger the
+ * scheduler's load balancing.
*/
- smp_send_stop();
+ local_irq_disable();
+ stop_other_cpus();
#endif
lapic_shutdown();
-
-#ifdef CONFIG_X86_IO_APIC
- disable_IO_APIC();
-#endif
+ restore_boot_irq_mode();
#ifdef CONFIG_HPET_TIMER
hpet_disable();
@@ -657,6 +762,9 @@ void native_machine_shutdown(void)
#ifdef CONFIG_X86_64
x86_platform.iommu_shutdown();
#endif
+
+ if (kexec_in_progress)
+ x86_platform.guest.enc_kexec_finish();
}
static void __machine_emergency_restart(int emergency)
@@ -667,7 +775,7 @@ static void __machine_emergency_restart(int emergency)
static void native_machine_restart(char *__unused)
{
- printk("machine restart\n");
+ pr_notice("machine restart\n");
if (!reboot_force)
machine_shutdown();
@@ -676,33 +784,32 @@ static void native_machine_restart(char *__unused)
static void native_machine_halt(void)
{
- /* stop other cpus and apics */
+ /* Stop other cpus and apics */
machine_shutdown();
tboot_shutdown(TB_SHUTDOWN_HALT);
- /* stop this cpu */
stop_this_cpu(NULL);
}
static void native_machine_power_off(void)
{
- if (pm_power_off) {
+ if (kernel_can_power_off()) {
if (!reboot_force)
machine_shutdown();
- pm_power_off();
+ do_kernel_power_off();
}
- /* a fallback in case there is no PM info available */
+ /* A fallback in case there is no PM info available */
tboot_shutdown(TB_SHUTDOWN_HALT);
}
-struct machine_ops machine_ops = {
+struct machine_ops machine_ops __ro_after_init = {
.power_off = native_machine_power_off,
.shutdown = native_machine_shutdown,
.emergency_restart = native_machine_emergency_restart,
.restart = native_machine_restart,
.halt = native_machine_halt,
-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_CRASH_DUMP
.crash_shutdown = native_machine_crash_shutdown,
#endif
};
@@ -732,86 +839,104 @@ void machine_halt(void)
machine_ops.halt();
}
-#ifdef CONFIG_KEXEC
+#ifdef CONFIG_CRASH_DUMP
void machine_crash_shutdown(struct pt_regs *regs)
{
machine_ops.crash_shutdown(regs);
}
#endif
+/* This is the CPU performing the emergency shutdown work. */
+int crashing_cpu = -1;
#if defined(CONFIG_SMP)
-/* This keeps a track of which one is crashing cpu. */
-static int crashing_cpu;
static nmi_shootdown_cb shootdown_callback;
static atomic_t waiting_for_crash_ipi;
+static int crash_ipi_issued;
-static int crash_nmi_callback(struct notifier_block *self,
- unsigned long val, void *data)
+static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
{
int cpu;
- if (val != DIE_NMI_IPI)
- return NOTIFY_OK;
-
cpu = raw_smp_processor_id();
- /* Don't do anything if this handler is invoked on crashing cpu.
+ /*
+ * Don't do anything if this handler is invoked on crashing cpu.
* Otherwise, system will completely hang. Crashing cpu can get
* an NMI if system was initially booted with nmi_watchdog parameter.
*/
if (cpu == crashing_cpu)
- return NOTIFY_STOP;
+ return NMI_HANDLED;
local_irq_disable();
- shootdown_callback(cpu, (struct die_args *)data);
+ if (shootdown_callback)
+ shootdown_callback(cpu, regs);
+
+ /*
+ * Prepare the CPU for reboot _after_ invoking the callback so that the
+ * callback can safely use virtualization instructions, e.g. VMCLEAR.
+ */
+ cpu_emergency_disable_virtualization();
atomic_dec(&waiting_for_crash_ipi);
+
+ if (smp_ops.stop_this_cpu) {
+ smp_ops.stop_this_cpu();
+ BUG();
+ }
+
/* Assume hlt works */
halt();
for (;;)
cpu_relax();
- return 1;
+ return NMI_HANDLED;
}
-static void smp_send_nmi_allbutself(void)
-{
- apic->send_IPI_allbutself(NMI_VECTOR);
-}
-
-static struct notifier_block crash_nmi_nb = {
- .notifier_call = crash_nmi_callback,
-};
-
-/* Halt all other CPUs, calling the specified function on each of them
+/**
+ * nmi_shootdown_cpus - Stop other CPUs via NMI
+ * @callback: Optional callback to be invoked from the NMI handler
*
- * This function can be used to halt all other CPUs on crash
- * or emergency reboot time. The function passed as parameter
- * will be called inside a NMI handler on all CPUs.
+ * The NMI handler on the remote CPUs invokes @callback, if not
+ * NULL, first and then disables virtualization to ensure that
+ * INIT is recognized during reboot.
+ *
+ * nmi_shootdown_cpus() can only be invoked once. After the first
+ * invocation all other CPUs are stuck in crash_nmi_callback() and
+ * cannot respond to a second NMI.
*/
void nmi_shootdown_cpus(nmi_shootdown_cb callback)
{
unsigned long msecs;
+
local_irq_disable();
- /* Make a note of crashing cpu. Will be used in NMI callback.*/
- crashing_cpu = safe_smp_processor_id();
+ /*
+ * Avoid certain doom if a shootdown already occurred; re-registering
+ * the NMI handler will cause list corruption, modifying the callback
+ * will do who knows what, etc...
+ */
+ if (WARN_ON_ONCE(crash_ipi_issued))
+ return;
+
+ /* Make a note of crashing cpu. Will be used in NMI callback. */
+ crashing_cpu = smp_processor_id();
shootdown_callback = callback;
atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
- /* Would it be better to replace the trap vector here? */
- if (register_die_notifier(&crash_nmi_nb))
- return; /* return what? */
- /* Ensure the new callback function is set before sending
- * out the NMI
+
+ /*
+ * Set emergency handler to preempt other handlers.
*/
- wmb();
+ set_emergency_nmi_handler(NMI_LOCAL, crash_nmi_callback);
- smp_send_nmi_allbutself();
+ apic_send_IPI_allbutself(NMI_VECTOR);
+
+ /* Kick CPUs looping in NMI context. */
+ WRITE_ONCE(crash_ipi_issued, 1);
msecs = 1000; /* Wait at most a second for the other cpus to stop */
while ((atomic_read(&waiting_for_crash_ipi) > 0) && msecs) {
@@ -819,11 +944,49 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
msecs--;
}
- /* Leave the nmi callback set */
+ /*
+ * Leave the nmi callback set, shootdown is a one-time thing. Clearing
+ * the callback could result in a NULL pointer dereference if a CPU
+ * (finally) responds after the timeout expires.
+ */
+}
+
+static inline void nmi_shootdown_cpus_on_restart(void)
+{
+ if (!crash_ipi_issued)
+ nmi_shootdown_cpus(NULL);
+}
+
+/*
+ * Check if the crash dumping IPI got issued and if so, call its callback
+ * directly. This function is used when we have already been in NMI handler.
+ * It doesn't return.
+ */
+void run_crash_ipi_callback(struct pt_regs *regs)
+{
+ if (crash_ipi_issued)
+ crash_nmi_callback(0, regs);
+}
+
+/* Override the weak function in kernel/panic.c */
+void __noreturn nmi_panic_self_stop(struct pt_regs *regs)
+{
+ while (1) {
+ /* If no CPU is preparing crash dump, we simply loop here. */
+ run_crash_ipi_callback(regs);
+ cpu_relax();
+ }
}
+
#else /* !CONFIG_SMP */
void nmi_shootdown_cpus(nmi_shootdown_cb callback)
{
/* No other CPUs to shoot down */
}
+
+static inline void nmi_shootdown_cpus_on_restart(void) { }
+
+void run_crash_ipi_callback(struct pt_regs *regs)
+{
+}
#endif
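
For orientation, a hedged sketch of how a crash path is expected to drive nmi_shootdown_cpus(); the callback signature matches the shootdown_callback(cpu, regs) call above, while the function bodies here are hypothetical:

static void crash_save_cb(int cpu, struct pt_regs *regs)	/* hypothetical */
{
	/* e.g. save this CPU's register state for the dump */
}

static void my_crash_path(void)					/* hypothetical */
{
	nmi_shootdown_cpus(crash_save_cb);	/* NMIs the others, waits ~1s */
	/* from here on, only this CPU is still running */
}

Note the one-shot restriction documented above: a second caller would be caught by the WARN_ON_ONCE(crash_ipi_issued) check.
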
diff --git a/arch/x86/kernel/reboot_fixups_32.c b/arch/x86/kernel/reboot_fixups_32.c
index fda313ebbb03..4679ac0a03eb 100644
--- a/arch/x86/kernel/reboot_fixups_32.c
+++ b/arch/x86/kernel/reboot_fixups_32.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* This is a good place to put board specific reboot fixups.
*
@@ -26,7 +27,7 @@ static void cs5530a_warm_reset(struct pci_dev *dev)
static void cs5536_warm_reset(struct pci_dev *dev)
{
/* writing 1 to the LSB of this MSR causes a hard reset */
- wrmsrl(MSR_DIVIL_SOFT_RESET, 1ULL);
+ wrmsrq(MSR_DIVIL_SOFT_RESET, 1ULL);
udelay(50); /* shouldn't get here but be safe and spin a while */
}
@@ -43,17 +44,33 @@ static void rdc321x_reset(struct pci_dev *dev)
outb(1, 0x92);
}
+static void ce4100_reset(struct pci_dev *dev)
+{
+ int i;
+
+ for (i = 0; i < 10; i++) {
+ outb(0x2, 0xcf9);
+ udelay(50);
+ }
+}
+
struct device_fixup {
unsigned int vendor;
unsigned int device;
void (*reboot_fixup)(struct pci_dev *);
};
+/*
+ * PCI ids solely used for fixups_table go here
+ */
+#define PCI_DEVICE_ID_INTEL_CE4100 0x0708
+
static const struct device_fixup fixups_table[] = {
{ PCI_VENDOR_ID_CYRIX, PCI_DEVICE_ID_CYRIX_5530_LEGACY, cs5530a_warm_reset },
{ PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_CS5536_ISA, cs5536_warm_reset },
{ PCI_VENDOR_ID_NS, PCI_DEVICE_ID_NS_SC1100_BRIDGE, cs5530a_warm_reset },
{ PCI_VENDOR_ID_RDC, PCI_DEVICE_ID_RDC_R6030, rdc321x_reset },
+{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_CE4100, ce4100_reset },
};
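
The walker for this table, mach_reboot_fixups() in the same file, sits outside the hunk; a simplified sketch of how such a fixup table is typically consumed (the exact loop body is an assumption):

	const struct device_fixup *f;
	struct pci_dev *dev;
	int i;

	for (i = 0; i < ARRAY_SIZE(fixups_table); i++) {
		f = &fixups_table[i];
		dev = pci_get_device(f->vendor, f->device, NULL);
		if (dev) {
			f->reboot_fixup(dev);
			pci_dev_put(dev);
		}
	}
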
/*
diff --git a/arch/x86/kernel/relocate_kernel_32.S b/arch/x86/kernel/relocate_kernel_32.S
index 41235531b11c..57276f134d12 100644
--- a/arch/x86/kernel/relocate_kernel_32.S
+++ b/arch/x86/kernel/relocate_kernel_32.S
@@ -1,18 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
/*
* relocate_kernel.S - put the kernel image in place to boot
* Copyright (C) 2002-2004 Eric Biederman <ebiederm@xmission.com>
- *
- * This source code is licensed under the GNU General Public License,
- * Version 2. See the file COPYING for more details.
*/
#include <linux/linkage.h>
#include <asm/page_types.h>
#include <asm/kexec.h>
+#include <asm/nospec-branch.h>
#include <asm/processor-flags.h>
/*
- * Must be relocatable PIC code callable as a C function
+ * Must be relocatable PIC code callable as a C function, in particular
+ * there must be a plain RET and not jump to return thunk.
*/
#define PTR(x) (x << 2)
@@ -37,8 +37,7 @@
#define CP_PA_BACKUP_PAGES_MAP DATA(0x1c)
.text
- .globl relocate_kernel
-relocate_kernel:
+SYM_CODE_START_NOALIGN(relocate_kernel)
/* Save the CPU context, used for jumping back */
pushl %ebx
@@ -94,9 +93,14 @@ relocate_kernel:
movl %edi, %eax
addl $(identity_mapped - relocate_kernel), %eax
pushl %eax
+ ANNOTATE_UNRET_SAFE
ret
+ int3
+SYM_CODE_END(relocate_kernel)
-identity_mapped:
+SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
+ /* set return address to 0 if not preserving context */
+ pushl $0
/* store the start address on the stack */
pushl %edx
@@ -107,7 +111,7 @@ identity_mapped:
* - Write protect disabled
* - No task switch
* - Don't do FP software emulation.
- * - Proctected mode enabled
+ * - Protected mode enabled
*/
movl %cr0, %eax
andl $~(X86_CR0_PG | X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %eax
@@ -159,12 +163,15 @@ identity_mapped:
xorl %edx, %edx
xorl %esi, %esi
xorl %ebp, %ebp
+ ANNOTATE_UNRET_SAFE
ret
+ int3
1:
popl %edx
movl CP_PA_SWAP_PAGE(%edi), %esp
addl $PAGE_SIZE, %esp
2:
+ ANNOTATE_RETPOLINE_SAFE
call *%edx
/* get the re-entry point of the peer system */
@@ -184,15 +191,18 @@ identity_mapped:
movl CP_PA_PGD(%ebx), %eax
movl %eax, %cr3
movl %cr0, %eax
- orl $(1<<31), %eax
+ orl $X86_CR0_PG, %eax
movl %eax, %cr0
lea PAGE_SIZE(%edi), %esp
movl %edi, %eax
addl $(virtual_mapped - relocate_kernel), %eax
pushl %eax
+ ANNOTATE_UNRET_SAFE
ret
+ int3
+SYM_CODE_END(identity_mapped)
-virtual_mapped:
+SYM_CODE_START_LOCAL_NOALIGN(virtual_mapped)
movl CR4(%edi), %eax
movl %eax, %cr4
movl CR3(%edi), %eax
@@ -207,10 +217,13 @@ virtual_mapped:
popl %edi
popl %esi
popl %ebx
+ ANNOTATE_UNRET_SAFE
ret
+ int3
+SYM_CODE_END(virtual_mapped)
/* Do the copies */
-swap_pages:
+SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
movl 8(%esp), %edx
movl 4(%esp), %ecx
pushl %ebp
@@ -224,23 +237,23 @@ swap_pages:
movl (%ebx), %ecx
addl $4, %ebx
1:
- testl $0x1, %ecx /* is it a destination page */
+ testb $0x1, %cl /* is it a destination page */
jz 2f
movl %ecx, %edi
andl $0xfffff000, %edi
jmp 0b
2:
- testl $0x2, %ecx /* is it an indirection page */
+ testb $0x2, %cl /* is it an indirection page */
jz 2f
movl %ecx, %ebx
andl $0xfffff000, %ebx
jmp 0b
2:
- testl $0x4, %ecx /* is it the done indicator */
+ testb $0x4, %cl /* is it the done indicator */
jz 2f
jmp 3f
2:
- testl $0x8, %ecx /* is it the source indicator */
+ testb $0x8, %cl /* is it the source indicator */
jz 0b /* Ignore it otherwise */
movl %ecx, %esi /* For every source page do a copy */
andl $0xfffff000, %esi
@@ -250,17 +263,17 @@ swap_pages:
movl %edx, %edi
movl $1024, %ecx
- rep ; movsl
+ rep movsl
movl %ebp, %edi
movl %eax, %esi
movl $1024, %ecx
- rep ; movsl
+ rep movsl
movl %eax, %edi
movl %edx, %esi
movl $1024, %ecx
- rep ; movsl
+ rep movsl
lea PAGE_SIZE(%ebp), %esi
jmp 0b
@@ -269,7 +282,10 @@ swap_pages:
popl %edi
popl %ebx
popl %ebp
+ ANNOTATE_UNRET_SAFE
ret
+ int3
+SYM_CODE_END(swap_pages)
.globl kexec_control_code_size
.set kexec_control_code_size, . - relocate_kernel
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 4de8f5b3d476..4ffba68dc57b 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -1,52 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
/*
* relocate_kernel.S - put the kernel image in place to boot
* Copyright (C) 2002-2005 Eric Biederman <ebiederm@xmission.com>
- *
- * This source code is licensed under the GNU General Public License,
- * Version 2. See the file COPYING for more details.
*/
#include <linux/linkage.h>
+#include <linux/stringify.h>
+#include <asm/alternative.h>
#include <asm/page_types.h>
#include <asm/kexec.h>
#include <asm/processor-flags.h>
#include <asm/pgtable_types.h>
+#include <asm/nospec-branch.h>
+#include <asm/unwind_hints.h>
+#include <asm/asm-offsets.h>
/*
- * Must be relocatable PIC code callable as a C function
+ * Must be relocatable PIC code callable as a C function, in particular
+ * there must be a plain RET and not jump to return thunk.
*/
#define PTR(x) (x << 3)
#define PAGE_ATTR (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
/*
- * control_page + KEXEC_CONTROL_CODE_MAX_SIZE
- * ~ control_page + PAGE_SIZE are used as data storage and stack for
- * jumping back
+ * The .text..relocate_kernel and .data..relocate_kernel sections are copied
+ * into the control page, and the remainder of the page is used as the stack.
*/
-#define DATA(offset) (KEXEC_CONTROL_CODE_MAX_SIZE+(offset))
+ .section .data..relocate_kernel,"a";
/* Minimal CPU state */
-#define RSP DATA(0x0)
-#define CR0 DATA(0x8)
-#define CR3 DATA(0x10)
-#define CR4 DATA(0x18)
-
-/* other data */
-#define CP_PA_TABLE_PAGE DATA(0x20)
-#define CP_PA_SWAP_PAGE DATA(0x28)
-#define CP_PA_BACKUP_PAGES_MAP DATA(0x30)
-
- .text
- .align PAGE_SIZE
+SYM_DATA_LOCAL(saved_rsp, .quad 0)
+SYM_DATA_LOCAL(saved_cr0, .quad 0)
+SYM_DATA_LOCAL(saved_cr3, .quad 0)
+SYM_DATA_LOCAL(saved_cr4, .quad 0)
+ /* other data */
+SYM_DATA(kexec_va_control_page, .quad 0)
+SYM_DATA(kexec_pa_table_page, .quad 0)
+SYM_DATA(kexec_pa_swap_page, .quad 0)
+SYM_DATA_LOCAL(pa_backup_pages_map, .quad 0)
+SYM_DATA(kexec_debug_8250_mmio32, .quad 0)
+SYM_DATA(kexec_debug_8250_port, .word 0)
+
+ .balign 16
+SYM_DATA_START_LOCAL(kexec_debug_gdt)
+ .word kexec_debug_gdt_end - kexec_debug_gdt - 1
+ .long 0
+ .word 0
+ .quad 0x00cf9a000000ffff /* __KERNEL32_CS */
+ .quad 0x00af9a000000ffff /* __KERNEL_CS */
+ .quad 0x00cf92000000ffff /* __KERNEL_DS */
+SYM_DATA_END_LABEL(kexec_debug_gdt, SYM_L_LOCAL, kexec_debug_gdt_end)
+
+ .balign 8
+SYM_DATA_START(kexec_debug_idt)
+ .skip 0x100, 0x00
+SYM_DATA_END(kexec_debug_idt)
+
+ .section .text..relocate_kernel,"ax";
.code64
- .globl relocate_kernel
-relocate_kernel:
+SYM_CODE_START_NOALIGN(relocate_kernel)
+ UNWIND_HINT_END_OF_STACK
+ ANNOTATE_NOENDBR
/*
* %rdi indirection_page
- * %rsi page_list
+ * %rsi pa_control_page
* %rdx start address
- * %rcx preserve_context
+ * %rcx flags: RELOC_KERNEL_*
*/
/* Save the CPU context, used for jumping back */
@@ -58,51 +78,97 @@ relocate_kernel:
pushq %r15
pushf
- movq PTR(VA_CONTROL_PAGE)(%rsi), %r11
- movq %rsp, RSP(%r11)
- movq %cr0, %rax
- movq %rax, CR0(%r11)
- movq %cr3, %rax
- movq %rax, CR3(%r11)
- movq %cr4, %rax
- movq %rax, CR4(%r11)
+ /* Invalidate GDT/IDT, zero out flags */
+ pushq $0
+ pushq $0
- /* zero out flags, and disable interrupts */
- pushq $0
+ lidt (%rsp)
+ lgdt (%rsp)
+ addq $8, %rsp
popfq
- /*
- * get physical address of control page now
- * this is impossible after page table switch
- */
- movq PTR(PA_CONTROL_PAGE)(%rsi), %r8
+ /* Switch to the identity mapped page tables */
+ movq %cr3, %rax
+ movq kexec_pa_table_page(%rip), %r9
+ movq %r9, %cr3
- /* get physical address of page table now too */
- movq PTR(PA_TABLE_PAGE)(%rsi), %r9
+ /* Leave CR4 in %r13 to enable the right paging mode later. */
+ movq %cr4, %r13
- /* get physical address of swap page now */
- movq PTR(PA_SWAP_PAGE)(%rsi), %r10
+ /*
+ * Disable global pages immediately to ensure this mapping is RWX.
+ * Disable LASS before jumping to the identity mapped page.
+ */
+ movq %r13, %r12
+ andq $~(X86_CR4_PGE | X86_CR4_LASS), %r12
+ movq %r12, %cr4
+
+ /* Save %rsp and CRs. */
+ movq %r13, saved_cr4(%rip)
+ movq %rsp, saved_rsp(%rip)
+ movq %rax, saved_cr3(%rip)
+ movq %cr0, %rax
+ movq %rax, saved_cr0(%rip)
- /* save some information for jumping back */
- movq %r9, CP_PA_TABLE_PAGE(%r11)
- movq %r10, CP_PA_SWAP_PAGE(%r11)
- movq %rdi, CP_PA_BACKUP_PAGES_MAP(%r11)
+ /* save indirection list for jumping back */
+ movq %rdi, pa_backup_pages_map(%rip)
- /* Switch to the identity mapped page tables */
- movq %r9, %cr3
+ /* Save the flags to %r11 as swap_pages clobbers %rcx. */
+ movq %rcx, %r11
/* setup a new stack at the end of the physical control page */
- lea PAGE_SIZE(%r8), %rsp
+ lea PAGE_SIZE(%rsi), %rsp
/* jump to identity mapped page */
- addq $(identity_mapped - relocate_kernel), %r8
- pushq %r8
- ret
+0: addq $identity_mapped - 0b, %rsi
+ subq $__relocate_kernel_start - 0b, %rsi
+ ANNOTATE_RETPOLINE_SAFE
+ jmp *%rsi
+SYM_CODE_END(relocate_kernel)
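
The "0:" arithmetic above is position-independent address math: the two label-relative constants cancel the 0b term, leaving only the offset of identity_mapped within the relocated blob. In effect:

	/*
	 * %rsi = pa_control_page
	 *      + (identity_mapped - __relocate_kernel_start)
	 * i.e. the physical address of identity_mapped inside the copied
	 * control page, computed without any absolute relocations.
	 */
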
+
+SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
+ UNWIND_HINT_END_OF_STACK
+ /*
+ * %rdi indirection page
+ * %rdx start address
+ * %r9 page table page
+ * %r11 flags: RELOC_KERNEL_*
+ * %r13 original CR4 when relocate_kernel() was invoked
+ */
-identity_mapped:
/* store the start address on the stack */
pushq %rdx
+ /* Create a GDTR (16 bits limit, 64 bits addr) on stack */
+ leaq kexec_debug_gdt(%rip), %rax
+ pushq %rax
+ pushw (%rax)
+
+ /* Load the GDT, put the stack back */
+ lgdt (%rsp)
+ addq $10, %rsp
+
+ /* Test that we can load segments */
+ movq %ds, %rax
+ movq %rax, %ds
+
+ /* Now an IDTR on the stack to load the IDT the kernel created */
+ leaq kexec_debug_idt(%rip), %rsi
+ pushq %rsi
+ pushw $0xff
+ lidt (%rsp)
+ addq $10, %rsp
+
+ //int3
+
+ /*
+ * Clear X86_CR4_CET (if it was set) such that we can clear CR0_WP
+ * below.
+ */
+ movq %cr4, %rax
+ andq $~(X86_CR4_CET), %rax
+ movq %rax, %cr4
+
/*
* Set cr0 to a known state:
* - Paging enabled
@@ -110,7 +176,7 @@ identity_mapped:
* - Write protect disabled
* - No task switch
* - Don't do FP software emulation.
- * - Proctected mode enabled
+ * - Protected mode enabled
*/
movq %cr0, %rax
andq $~(X86_CR0_AM | X86_CR0_WP | X86_CR0_TS | X86_CR0_EM), %rax
@@ -120,17 +186,37 @@ identity_mapped:
/*
* Set cr4 to a known state:
* - physical address extension enabled
+ * - 5-level paging, if it was enabled before
+ * - Machine check exception on TDX guest, if it was enabled before.
+ * Clearing MCE might not be allowed in TDX guests, depending on setup.
+ *
+ * Use R13 that contains the original CR4 value, read in relocate_kernel().
+ * PAE is always set in the original CR4.
*/
- movq $X86_CR4_PAE, %rax
- movq %rax, %cr4
-
- jmp 1f
-1:
+ andl $(X86_CR4_PAE | X86_CR4_LA57), %r13d
+ ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %r13d), X86_FEATURE_TDX_GUEST
+ movq %r13, %cr4
/* Flush the TLB (needed?) */
movq %r9, %cr3
- movq %rcx, %r11
+ /*
+ * If the memory cache is in an incoherent state, e.g. due to
+ * memory encryption, do WBINVD to flush the cache.
+ *
+ * If SME is active, there could be old encrypted cache line
+ * entries that will conflict with the now unencrypted memory
+ * used by kexec. Flush the caches before copying the kernel.
+ *
+ * Note SME sets this flag to true when the platform supports
+ * SME, so the WBINVD is performed even if SME is not activated
+ * by the kernel. This does no harm.
+ */
+ testb $RELOC_KERNEL_CACHE_INCOHERENT, %r11b
+ jz .Lnowbinvd
+ wbinvd
+.Lnowbinvd:
+
call swap_pages
/*
@@ -142,60 +228,94 @@ identity_mapped:
movq %cr3, %rax
movq %rax, %cr3
+ testb $RELOC_KERNEL_PRESERVE_CONTEXT, %r11b
+ jnz .Lrelocate
+
/*
* set all of the registers to known values
* leave %rsp alone
*/
- testq %r11, %r11
- jnz 1f
- xorq %rax, %rax
- xorq %rbx, %rbx
- xorq %rcx, %rcx
- xorq %rdx, %rdx
- xorq %rsi, %rsi
- xorq %rdi, %rdi
- xorq %rbp, %rbp
- xorq %r8, %r8
- xorq %r9, %r9
- xorq %r10, %r9
- xorq %r11, %r11
- xorq %r12, %r12
- xorq %r13, %r13
- xorq %r14, %r14
- xorq %r15, %r15
-
+ xorl %eax, %eax
+ xorl %ebx, %ebx
+ xorl %ecx, %ecx
+ xorl %edx, %edx
+ xorl %esi, %esi
+ xorl %edi, %edi
+ xorl %ebp, %ebp
+ xorl %r8d, %r8d
+ xorl %r9d, %r9d
+ xorl %r10d, %r10d
+ xorl %r11d, %r11d
+ xorl %r12d, %r12d
+ xorl %r13d, %r13d
+ xorl %r14d, %r14d
+ xorl %r15d, %r15d
+
+ ANNOTATE_UNRET_SAFE
ret
+ int3
-1:
+.Lrelocate:
popq %rdx
+
+ /* Use the swap page for the callee's stack */
+ movq kexec_pa_swap_page(%rip), %r10
leaq PAGE_SIZE(%r10), %rsp
+
+ /* push the existing entry point onto the callee's stack */
+ pushq %rdx
+
+ ANNOTATE_RETPOLINE_SAFE
call *%rdx
/* get the re-entry point of the peer system */
- movq 0(%rsp), %rbp
- call 1f
-1:
- popq %r8
- subq $(1b - relocate_kernel), %r8
- movq CP_PA_SWAP_PAGE(%r8), %r10
- movq CP_PA_BACKUP_PAGES_MAP(%r8), %rdi
- movq CP_PA_TABLE_PAGE(%r8), %rax
+ popq %rbp
+ movq kexec_pa_swap_page(%rip), %r10
+ movq pa_backup_pages_map(%rip), %rdi
+ movq kexec_pa_table_page(%rip), %rax
movq %rax, %cr3
+
+ /* Find start (and end) of this physical mapping of control page */
+ leaq (%rip), %r8
+ ANNOTATE_NOENDBR
+ andq $PAGE_MASK, %r8
lea PAGE_SIZE(%r8), %rsp
+ /*
+ * Ensure RELOC_KERNEL_PRESERVE_CONTEXT flag is set so that
+ * swap_pages() can swap pages correctly. Note all other
+ * RELOC_KERNEL_* flags passed to relocate_kernel() are not
+ * restored.
+ */
+ movl $RELOC_KERNEL_PRESERVE_CONTEXT, %r11d
call swap_pages
- movq $virtual_mapped, %rax
+ movq kexec_va_control_page(%rip), %rax
+0: addq $virtual_mapped - 0b, %rax
+ subq $__relocate_kernel_start - 0b, %rax
pushq %rax
+ ANNOTATE_UNRET_SAFE
ret
-
-virtual_mapped:
- movq RSP(%r8), %rsp
- movq CR4(%r8), %rax
+ int3
+SYM_CODE_END(identity_mapped)
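In C terms, the CR4 handling above masks the saved value down to the paging
bits and conditionally re-adds MCE for TDX. A minimal sketch of that
computation (bit positions per the x86 architecture; orig_cr4 and
is_tdx_guest are illustrative names, not kernel symbols):

#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_PAE	(1UL << 5)	/* physical address extension */
#define X86_CR4_MCE	(1UL << 6)	/* machine check enable */
#define X86_CR4_LA57	(1UL << 12)	/* 5-level paging */

/* Mirrors the andl/ALTERNATIVE sequence in identity_mapped above. */
static uint64_t kexec_cr4(uint64_t orig_cr4, bool is_tdx_guest)
{
	/* Keep PAE (always set originally) and LA57 if it was enabled. */
	uint64_t cr4 = orig_cr4 & (X86_CR4_PAE | X86_CR4_LA57);

	/* TDX guests may not allow clearing MCE; keep it set there. */
	if (is_tdx_guest)
		cr4 |= X86_CR4_MCE;

	return cr4;
}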
+
+SYM_CODE_START_LOCAL_NOALIGN(virtual_mapped)
+ UNWIND_HINT_END_OF_STACK
+ ANNOTATE_NOENDBR // RET target, above
+ movq saved_rsp(%rip), %rsp
+ movq saved_cr4(%rip), %rax
movq %rax, %cr4
- movq CR3(%r8), %rax
- movq CR0(%r8), %r8
+ movq saved_cr3(%rip), %rax
+ movq saved_cr0(%rip), %r8
movq %rax, %cr3
movq %r8, %cr0
+
+#ifdef CONFIG_KEXEC_JUMP
+ /* Saved in save_processor_state. */
+ movq $saved_context, %rax
+ lgdt saved_context_gdt_desc(%rax)
+#endif
+
+ /* relocate_kernel() returns the re-entry point for next time */
movq %rbp, %rax
popf
@@ -205,62 +325,297 @@ virtual_mapped:
popq %r12
popq %rbp
popq %rbx
+ ANNOTATE_UNRET_SAFE
ret
+ int3
+SYM_CODE_END(virtual_mapped)
/* Do the copies */
-swap_pages:
- movq %rdi, %rcx /* Put the page_list in %rcx */
- xorq %rdi, %rdi
- xorq %rsi, %rsi
- jmp 1f
+SYM_CODE_START_LOCAL_NOALIGN(swap_pages)
+ UNWIND_HINT_END_OF_STACK
+ /*
+ * %rdi indirection page
+ * %r11 flags: RELOC_KERNEL_*
+ */
+ movq %rdi, %rcx /* Put the indirection_page in %rcx */
+ xorl %edi, %edi
+ xorl %esi, %esi
+ jmp .Lstart /* Should start with an indirection record */
-0: /* top, read another word for the indirection page */
+.Lloop: /* top, read another word for the indirection page */
movq (%rbx), %rcx
addq $8, %rbx
-1:
- testq $0x1, %rcx /* is it a destination page? */
- jz 2f
+.Lstart:
+ testb $0x1, %cl /* is it a destination page? */
+ jz .Lnotdest
movq %rcx, %rdi
andq $0xfffffffffffff000, %rdi
- jmp 0b
-2:
- testq $0x2, %rcx /* is it an indirection page? */
- jz 2f
+ jmp .Lloop
+.Lnotdest:
+ testb $0x2, %cl /* is it an indirection page? */
+ jz .Lnotind
movq %rcx, %rbx
andq $0xfffffffffffff000, %rbx
- jmp 0b
-2:
- testq $0x4, %rcx /* is it the done indicator? */
- jz 2f
- jmp 3f
-2:
- testq $0x8, %rcx /* is it the source indicator? */
- jz 0b /* Ignore it otherwise */
+ jmp .Lloop
+.Lnotind:
+ testb $0x4, %cl /* is it the done indicator? */
+ jz .Lnotdone
+ jmp .Ldone
+.Lnotdone:
+ testb $0x8, %cl /* is it the source indicator? */
+ jz .Lloop /* Ignore it otherwise */
movq %rcx, %rsi /* For every source page do a copy */
andq $0xfffffffffffff000, %rsi
- movq %rdi, %rdx
- movq %rsi, %rax
+ movq %rdi, %rdx /* Save destination page to %rdx */
+ movq %rsi, %rax /* Save source page to %rax */
+
+ /* Only actually swap for ::preserve_context */
+ testb $RELOC_KERNEL_PRESERVE_CONTEXT, %r11b
+ jz .Lnoswap
- movq %r10, %rdi
- movq $512, %rcx
- rep ; movsq
+ /* copy source page to swap page */
+ movq kexec_pa_swap_page(%rip), %rdi
+ movl $512, %ecx
+ rep movsq
+ /* copy destination page to source page */
movq %rax, %rdi
movq %rdx, %rsi
- movq $512, %rcx
- rep ; movsq
+ movl $512, %ecx
+ rep movsq
+ /* copy swap page to destination page */
movq %rdx, %rdi
- movq %r10, %rsi
- movq $512, %rcx
- rep ; movsq
+ movq kexec_pa_swap_page(%rip), %rsi
+.Lnoswap:
+ movl $512, %ecx
+ rep movsq
lea PAGE_SIZE(%rax), %rsi
- jmp 0b
-3:
+ jmp .Lloop
+.Ldone:
+ ANNOTATE_UNRET_SAFE
ret
+ int3
+SYM_CODE_END(swap_pages)
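For reference, the indirection encoding consumed by swap_pages() packs a flag
into the low bits of each page-aligned entry: 0x1 marks a destination, 0x2 an
indirection page, 0x4 the end of the list, and 0x8 a source page. A hedged C
walker over the same format, without the preserve-context three-way swap
(copy semantics only; memcpy() stands in for the rep movsq loops):

#include <stdint.h>
#include <string.h>

#define PAGE_SIZE	4096UL
#define ENTRY_MASK	(~(PAGE_SIZE - 1))

#define IND_DESTINATION	0x1
#define IND_INDIRECTION	0x2
#define IND_DONE	0x4
#define IND_SOURCE	0x8

/*
 * Walk a kexec-style indirection list: each source page is copied to
 * the current destination, which then advances by one page, exactly
 * as %rdi does across the rep movsq in the assembly above.
 */
static void walk_indirection(const uintptr_t *page, char *dest)
{
	for (;;) {
		uintptr_t entry = *page++;

		if (entry & IND_DESTINATION) {
			dest = (char *)(entry & ENTRY_MASK);
		} else if (entry & IND_INDIRECTION) {
			page = (const uintptr_t *)(entry & ENTRY_MASK);
		} else if (entry & IND_DONE) {
			return;
		} else if (entry & IND_SOURCE) {
			memcpy(dest, (const void *)(entry & ENTRY_MASK),
			       PAGE_SIZE);
			dest += PAGE_SIZE;
		}
		/* Anything else is ignored, as in the assembly. */
	}
}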
+
+/*
+ * Generic 'print character' routine
+ * - %al: Character to be printed (may clobber %rax)
+ * - %rdx: MMIO address or port.
+ */
+#define XMTRDY 0x20
+
+#define TXR 0 /* Transmit register (WRITE) */
+#define LSR 5 /* Line Status */
+
+SYM_CODE_START_LOCAL_NOALIGN(pr_char_8250)
+ UNWIND_HINT_FUNC
+ ANNOTATE_NOENDBR
+ addw $LSR, %dx
+ xchg %al, %ah
+.Lxmtrdy_loop:
+ inb %dx, %al
+ testb $XMTRDY, %al
+ jnz .Lready
+ pause
+ jmp .Lxmtrdy_loop
+
+.Lready:
+ subw $LSR, %dx
+ xchg %al, %ah
+ outb %al, %dx
+pr_char_null:
+ ANNOTATE_NOENDBR
+
+ ANNOTATE_UNRET_SAFE
+ ret
+SYM_CODE_END(pr_char_8250)
+
+SYM_CODE_START_LOCAL_NOALIGN(pr_char_8250_mmio32)
+ UNWIND_HINT_FUNC
+ ANNOTATE_NOENDBR
+.Lxmtrdy_loop_mmio:
+ movb (LSR*4)(%rdx), %ah
+ testb $XMTRDY, %ah
+ jnz .Lready_mmio
+ pause
+ jmp .Lxmtrdy_loop_mmio
+
+.Lready_mmio:
+ movb %al, (%rdx)
+ ANNOTATE_UNRET_SAFE
+ ret
+SYM_CODE_END(pr_char_8250_mmio32)
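Both pr_char variants poll the 8250 line status register until the
transmit-holding bit (XMTRDY, 0x20) is set, then write the character to the
transmit register. A sketch of the MMIO variant in C, assuming the 4-byte
register stride the mmio32 code uses (mmio_read8/mmio_write8 are illustrative
accessors, not kernel APIs):

#include <stdint.h>

#define TXR	0	/* Transmit register (WRITE) */
#define LSR	5	/* Line Status */
#define XMTRDY	0x20

static inline uint8_t mmio_read8(const volatile uint8_t *a) { return *a; }
static inline void mmio_write8(volatile uint8_t *a, uint8_t v) { *a = v; }

/* 8250 with 32-bit register stride: register N lives at base + N * 4. */
static void putc_8250_mmio32(volatile uint8_t *base, char c)
{
	/* Spin until the transmitter can accept another byte. */
	while (!(mmio_read8(base + LSR * 4) & XMTRDY))
		;

	mmio_write8(base + TXR * 4, (uint8_t)c);
}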
+
+/*
+ * Load pr_char function pointer into %rsi and load %rdx with whatever
+ * that function wants to see there (typically port/MMIO address).
+ */
+.macro pr_setup
+ leaq pr_char_8250(%rip), %rsi
+ movw kexec_debug_8250_port(%rip), %dx
+ testw %dx, %dx
+ jnz 1f
+
+ leaq pr_char_8250_mmio32(%rip), %rsi
+ movq kexec_debug_8250_mmio32(%rip), %rdx
+ testq %rdx, %rdx
+ jnz 1f
+
+ leaq pr_char_null(%rip), %rsi
+1:
+.endm
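The macro's fallback order, restated as C for clarity (a sketch; the globals
mirror kexec_debug_8250_port/kexec_debug_8250_mmio32, and the stub bodies
abbreviate the output routines above):

#include <stdint.h>

typedef void (*pr_char_fn)(uint64_t dst, char c);

static uint16_t debug_8250_port;	/* port I/O base, 0 if unset */
static uint64_t debug_8250_mmio32;	/* MMIO base, 0 if unset */

static void pr_char_port(uint64_t port, char c) { (void)port; (void)c; /* outb path */ }
static void pr_char_mmio(uint64_t base, char c) { (void)base; (void)c; /* MMIO path */ }
static void pr_char_null(uint64_t dst, char c)  { (void)dst;  (void)c; /* sink */ }

/* Prefer port I/O, then MMIO, else discard output entirely. */
static pr_char_fn pr_setup(uint64_t *dst)
{
	if (debug_8250_port) {
		*dst = debug_8250_port;
		return pr_char_port;
	}
	if (debug_8250_mmio32) {
		*dst = debug_8250_mmio32;
		return pr_char_mmio;
	}
	*dst = 0;
	return pr_char_null;
}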
+
+/* Print the nybble in %bl, clobber %rax */
+SYM_CODE_START_LOCAL_NOALIGN(pr_nybble)
+ UNWIND_HINT_FUNC
+ movb %bl, %al
+ nop
+ andb $0x0f, %al
+ addb $0x30, %al
+ cmpb $0x3a, %al
+ jb 1f
+ addb $('a' - '0' - 10), %al
+ ANNOTATE_RETPOLINE_SAFE
+1: jmp *%rsi
+SYM_CODE_END(pr_nybble)
+
+SYM_CODE_START_LOCAL_NOALIGN(pr_qword)
+ UNWIND_HINT_FUNC
+ movq $16, %rcx
+1: rolq $4, %rbx
+ call pr_nybble
+ loop 1b
+ movb $'\n', %al
+ ANNOTATE_RETPOLINE_SAFE
+ jmp *%rsi
+SYM_CODE_END(pr_qword)
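pr_nybble and pr_qword together print a 64-bit value as 16 lowercase hex
digits, most significant nybble first, followed by a newline. The equivalent
in C (pr_char is whatever output routine pr_setup selected; the function
pointer form is illustrative):

#include <stdint.h>

static void pr_nybble(void (*pr_char)(char), uint8_t v)
{
	char c = '0' + (v & 0x0f);

	if (c > '9')
		c += 'a' - '0' - 10;	/* 0xa..0xf -> 'a'..'f' */
	pr_char(c);
}

/* Same digit order as the rolq $4 loop: high nybble first. */
static void pr_qword(void (*pr_char)(char), uint64_t v)
{
	for (int i = 15; i >= 0; i--)
		pr_nybble(pr_char, (uint8_t)(v >> (i * 4)));
	pr_char('\n');
}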
+
+.macro print_reg a, b, c, d, r
+ movb $\a, %al
+ ANNOTATE_RETPOLINE_SAFE
+ call *%rsi
+ movb $\b, %al
+ ANNOTATE_RETPOLINE_SAFE
+ call *%rsi
+ movb $\c, %al
+ ANNOTATE_RETPOLINE_SAFE
+ call *%rsi
+ movb $\d, %al
+ ANNOTATE_RETPOLINE_SAFE
+ call *%rsi
+ movq \r, %rbx
+ call pr_qword
+.endm
+
+SYM_CODE_START_NOALIGN(kexec_debug_exc_vectors)
+ /* Each of these is 6 bytes. */
+.macro vec_err exc
+ UNWIND_HINT_ENTRY
+ . = kexec_debug_exc_vectors + (\exc * KEXEC_DEBUG_EXC_HANDLER_SIZE)
+ nop
+ nop
+ pushq $\exc
+ jmp exc_handler
+.endm
+
+.macro vec_noerr exc
+ UNWIND_HINT_ENTRY
+ . = kexec_debug_exc_vectors + (\exc * KEXEC_DEBUG_EXC_HANDLER_SIZE)
+ pushq $0
+ pushq $\exc
+ jmp exc_handler
+.endm
+
+ ANNOTATE_NOENDBR
+ vec_noerr 0 // #DE
+ vec_noerr 1 // #DB
+ vec_noerr 2 // #NMI
+ vec_noerr 3 // #BP
+ vec_noerr 4 // #OF
+ vec_noerr 5 // #BR
+ vec_noerr 6 // #UD
+ vec_noerr 7 // #NM
+ vec_err 8 // #DF
+ vec_noerr 9
+ vec_err 10 // #TS
+ vec_err 11 // #NP
+ vec_err 12 // #SS
+ vec_err 13 // #GP
+ vec_err 14 // #PF
+ vec_noerr 15
+SYM_CODE_END(kexec_debug_exc_vectors)
+
+SYM_CODE_START_LOCAL_NOALIGN(exc_handler)
+ /* No need for RET mitigations during kexec */
+ VALIDATE_UNRET_END
+
+ pushq %rax
+ pushq %rbx
+ pushq %rcx
+ pushq %rdx
+ pushq %rsi
+
+ /* Stack frame */
+#define EXC_SS 0x58 /* Architectural... */
+#define EXC_RSP 0x50
+#define EXC_EFLAGS 0x48
+#define EXC_CS 0x40
+#define EXC_RIP 0x38
+#define EXC_ERRORCODE 0x30 /* Either architectural or zero pushed by handler */
+#define EXC_EXCEPTION 0x28 /* Pushed by handler entry point */
+#define EXC_RAX 0x20 /* Pushed just above in exc_handler */
+#define EXC_RBX 0x18
+#define EXC_RCX 0x10
+#define EXC_RDX 0x08
+#define EXC_RSI 0x00
+
+ /* Set up %rdx/%rsi for debug output */
+ pr_setup
+
+ /* rip and exception info */
+ print_reg 'E', 'x', 'c', ':', EXC_EXCEPTION(%rsp)
+ print_reg 'E', 'r', 'r', ':', EXC_ERRORCODE(%rsp)
+ print_reg 'r', 'i', 'p', ':', EXC_RIP(%rsp)
+ print_reg 'r', 's', 'p', ':', EXC_RSP(%rsp)
+
+ /* We spilled these to the stack */
+ print_reg 'r', 'a', 'x', ':', EXC_RAX(%rsp)
+ print_reg 'r', 'b', 'x', ':', EXC_RBX(%rsp)
+ print_reg 'r', 'c', 'x', ':', EXC_RCX(%rsp)
+ print_reg 'r', 'd', 'x', ':', EXC_RDX(%rsp)
+ print_reg 'r', 's', 'i', ':', EXC_RSI(%rsp)
+
+ /* Other registers untouched */
+ print_reg 'r', 'd', 'i', ':', %rdi
+ print_reg 'r', '8', ' ', ':', %r8
+ print_reg 'r', '9', ' ', ':', %r9
+ print_reg 'r', '1', '0', ':', %r10
+ print_reg 'r', '1', '1', ':', %r11
+ print_reg 'r', '1', '2', ':', %r12
+ print_reg 'r', '1', '3', ':', %r13
+ print_reg 'r', '1', '4', ':', %r14
+ print_reg 'r', '1', '5', ':', %r15
+ print_reg 'c', 'r', '2', ':', %cr2
+
+ /* Only return from INT3 */
+ cmpq $3, EXC_EXCEPTION(%rsp)
+ jne .Ldie
+
+ popq %rsi
+ popq %rdx
+ popq %rcx
+ popq %rbx
+ popq %rax
+
+ addq $16, %rsp
+ iretq
+
+.Ldie:
+ hlt
+ jmp .Ldie
- .globl kexec_control_code_size
-.set kexec_control_code_size, . - relocate_kernel
+SYM_CODE_END(exc_handler)
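The EXC_* offsets above describe the stack at the point the registers are
printed: five spilled registers, then the vector and error code pushed by the
entry stubs, then the hardware exception frame. Written as a struct for
cross-checking (a sketch; the kernel only defines the offsets):

#include <stdint.h>

/* Lowest address first; field offsets match EXC_RSI..EXC_SS. */
struct exc_frame {
	uint64_t rsi;		/* 0x00: pushed last in exc_handler */
	uint64_t rdx;		/* 0x08 */
	uint64_t rcx;		/* 0x10 */
	uint64_t rbx;		/* 0x18 */
	uint64_t rax;		/* 0x20: pushed first in exc_handler */
	uint64_t exception;	/* 0x28: pushed by the vector stub */
	uint64_t errorcode;	/* 0x30: architectural, or 0 from vec_noerr */
	uint64_t rip;		/* 0x38: hardware frame starts here */
	uint64_t cs;		/* 0x40 */
	uint64_t eflags;	/* 0x48 */
	uint64_t rsp;		/* 0x50 */
	uint64_t ss;		/* 0x58 */
};

/* The addq $16, %rsp before iretq drops exception and errorcode. */
_Static_assert(sizeof(struct exc_frame) == 0x60, "matches EXC_* offsets");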
diff --git a/arch/x86/kernel/resource.c b/arch/x86/kernel/resource.c
new file mode 100644
index 000000000000..79bc8a97a083
--- /dev/null
+++ b/arch/x86/kernel/resource.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/ioport.h>
+#include <linux/printk.h>
+#include <asm/e820/api.h>
+#include <asm/pci_x86.h>
+
+static void resource_clip(struct resource *res, resource_size_t start,
+ resource_size_t end)
+{
+ resource_size_t low = 0, high = 0;
+
+ if (res->end < start || res->start > end)
+ return; /* no conflict */
+
+ if (res->start < start)
+ low = start - res->start;
+
+ if (res->end > end)
+ high = res->end - end;
+
+ /* Keep the area above or below the conflict, whichever is larger */
+ if (low > high)
+ res->end = start - 1;
+ else
+ res->start = end + 1;
+}
+
+static void remove_e820_regions(struct resource *avail)
+{
+ int i;
+ struct e820_entry *entry;
+ u64 e820_start, e820_end;
+ struct resource orig = *avail;
+
+ if (!pci_use_e820)
+ return;
+
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ entry = &e820_table->entries[i];
+ e820_start = entry->addr;
+ e820_end = entry->addr + entry->size - 1;
+
+ resource_clip(avail, e820_start, e820_end);
+ if (orig.start != avail->start || orig.end != avail->end) {
+ pr_info("resource: avoiding allocation from e820 entry [mem %#010Lx-%#010Lx]\n",
+ e820_start, e820_end);
+ if (avail->end > avail->start)
+ /*
+ * Use %pa instead of %pR because "avail"
+ * is typically IORESOURCE_UNSET, so %pR
+ * shows the size instead of addresses.
+ */
+ pr_info("resource: remaining [mem %pa-%pa] available\n",
+ &avail->start, &avail->end);
+ orig = *avail;
+ }
+ }
+}
+
+void arch_remove_reservations(struct resource *avail)
+{
+ /*
+ * Trim out BIOS area (high 2MB) and E820 regions. We do not remove
+ * the low 1MB unconditionally, as this area is needed for some ISA
+ * cards requiring a memory range, e.g. the i82365 PCMCIA controller.
+ */
+ if (avail->flags & IORESOURCE_MEM) {
+ resource_clip(avail, BIOS_ROM_BASE, BIOS_ROM_END);
+
+ remove_e820_regions(avail);
+ }
+}
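resource_clip() resolves an overlap by keeping whichever side of the conflict
window is larger. A self-contained illustration of the same arithmetic with a
worked example (standalone types, not the kernel's struct resource):

#include <assert.h>
#include <stdint.h>

typedef uint64_t resource_size_t;

struct res { resource_size_t start, end; };

/* Same logic as resource_clip(): trim 'r' so it no longer overlaps
 * [start, end], keeping whichever remaining side is larger. */
static void clip(struct res *r, resource_size_t start, resource_size_t end)
{
	resource_size_t low = 0, high = 0;

	if (r->end < start || r->start > end)
		return;			/* no conflict */
	if (r->start < start)
		low = start - r->start;
	if (r->end > end)
		high = r->end - end;

	if (low > high)
		r->end = start - 1;	/* keep the area below */
	else
		r->start = end + 1;	/* keep the area above */
}

int main(void)
{
	struct res r = { 0x1000, 0x9000 };

	/* The high side (0x3001..0x9000) is larger, so it survives. */
	clip(&r, 0x2000, 0x3000);
	assert(r.start == 0x3001 && r.end == 0x9000);
	return 0;
}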
diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c
new file mode 100644
index 000000000000..85e2f2d16a90
--- /dev/null
+++ b/arch/x86/kernel/rethook.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * x86 implementation of rethook. Mostly copied from arch/x86/kernel/kprobes/core.c.
+ */
+#include <linux/bug.h>
+#include <linux/rethook.h>
+#include <linux/kprobes.h>
+#include <linux/objtool.h>
+
+#include "kprobes/common.h"
+
+__visible void arch_rethook_trampoline_callback(struct pt_regs *regs);
+
+#ifndef ANNOTATE_NOENDBR
+#define ANNOTATE_NOENDBR
+#endif
+
+/*
+ * When a target function returns, this code saves registers and calls
+ * arch_rethook_trampoline_callback(), which calls the rethook handler.
+ */
+asm(
+ ".text\n"
+ ".global arch_rethook_trampoline\n"
+ ".type arch_rethook_trampoline, @function\n"
+ "arch_rethook_trampoline:\n"
+#ifdef CONFIG_X86_64
+ ANNOTATE_NOENDBR "\n" /* This is only jumped from ret instruction */
+ /* Push a fake return address to tell the unwinder it's a rethook. */
+ " pushq $arch_rethook_trampoline\n"
+ UNWIND_HINT_FUNC
+ " pushq $" __stringify(__KERNEL_DS) "\n"
+ /* Save 'sp - 16'; this will be fixed later. */
+ " pushq %rsp\n"
+ " pushfq\n"
+ SAVE_REGS_STRING
+ " movq %rsp, %rdi\n"
+ " call arch_rethook_trampoline_callback\n"
+ RESTORE_REGS_STRING
+ /* In the callback function, 'regs->flags' is copied to 'regs->ss'. */
+ " addq $16, %rsp\n"
+ " popfq\n"
+#else
+ /* Push a fake return address to tell the unwinder it's a rethook. */
+ " pushl $arch_rethook_trampoline\n"
+ UNWIND_HINT_FUNC
+ " pushl %ss\n"
+ /* Save 'sp - 8'; this will be fixed later. */
+ " pushl %esp\n"
+ " pushfl\n"
+ SAVE_REGS_STRING
+ " movl %esp, %eax\n"
+ " call arch_rethook_trampoline_callback\n"
+ RESTORE_REGS_STRING
+ /* In the callback function, 'regs->flags' is copied to 'regs->ss'. */
+ " addl $8, %esp\n"
+ " popfl\n"
+#endif
+ ASM_RET
+ ".size arch_rethook_trampoline, .-arch_rethook_trampoline\n"
+);
+NOKPROBE_SYMBOL(arch_rethook_trampoline);
+
+/*
+ * Called from arch_rethook_trampoline
+ */
+__used __visible void arch_rethook_trampoline_callback(struct pt_regs *regs)
+{
+ unsigned long *frame_pointer;
+
+ /* fixup registers */
+ regs->cs = __KERNEL_CS;
+#ifdef CONFIG_X86_32
+ regs->gs = 0;
+#endif
+ regs->ip = (unsigned long)&arch_rethook_trampoline;
+ regs->orig_ax = ~0UL;
+ regs->sp += 2*sizeof(long);
+ frame_pointer = (long *)(regs + 1);
+
+ /*
+ * The return address at 'frame_pointer' is recovered by
+ * arch_rethook_fixup_return(), which is called from this
+ * rethook_trampoline_handler().
+ */
+ rethook_trampoline_handler(regs, (unsigned long)frame_pointer);
+
+ /*
+ * Copy FLAGS to 'pt_regs::ss' so that arch_rethook_trampoline()
+ * can do RET right after POPF.
+ */
+ *(unsigned long *)&regs->ss = regs->flags;
+}
+NOKPROBE_SYMBOL(arch_rethook_trampoline_callback);
+
+/*
+ * arch_rethook_trampoline() skips updating the frame pointer. The frame
+ * pointer saved in arch_rethook_trampoline_callback() points to the real
+ * caller function's frame pointer, so arch_rethook_trampoline() doesn't
+ * have a standard stack frame with CONFIG_FRAME_POINTER=y.
+ * Mark it as a non-standard function; the FP unwinder can still unwind
+ * correctly without the hint.
+ */
+STACK_FRAME_NON_STANDARD_FP(arch_rethook_trampoline);
+
+/* This is called from rethook_trampoline_handler(). */
+void arch_rethook_fixup_return(struct pt_regs *regs,
+ unsigned long correct_ret_addr)
+{
+ unsigned long *frame_pointer = (void *)(regs + 1);
+
+ /* Replace fake return address with real one. */
+ *frame_pointer = correct_ret_addr;
+}
+NOKPROBE_SYMBOL(arch_rethook_fixup_return);
+
+void arch_rethook_prepare(struct rethook_node *rh, struct pt_regs *regs, bool mcount)
+{
+ unsigned long *stack = (unsigned long *)regs->sp;
+
+ rh->ret_addr = stack[0];
+ rh->frame = regs->sp;
+
+ /* Replace the return addr with trampoline addr */
+ stack[0] = (unsigned long) arch_rethook_trampoline;
+}
+NOKPROBE_SYMBOL(arch_rethook_prepare);
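At its core the prepare/fixup pair swaps the return address at the top of the
stack for the trampoline and later restores it. A simplified user-space
illustration of that bookkeeping (fake_regs stands in for pt_regs; nothing
here is the kernel API):

#include <stdint.h>
#include <stdio.h>

struct fake_regs { uintptr_t sp; };
struct rethook_node { uintptr_t ret_addr; uintptr_t frame; };

static void trampoline(void)
{
	/* Would run the rethook handler, then return to the real address. */
}

/* Mirror of arch_rethook_prepare(): remember the real return address
 * found at *sp and replace it with the trampoline. */
static void prepare(struct rethook_node *rh, struct fake_regs *regs)
{
	uintptr_t *stack = (uintptr_t *)regs->sp;

	rh->ret_addr = stack[0];
	rh->frame = regs->sp;
	stack[0] = (uintptr_t)trampoline;
}

/* Mirror of arch_rethook_fixup_return(): put the real address back. */
static void fixup(uintptr_t *frame_slot, uintptr_t correct_ret_addr)
{
	*frame_slot = correct_ret_addr;
}

int main(void)
{
	uintptr_t stack_slot = 0xdeadbeef;	/* pretend return address */
	struct fake_regs regs = { (uintptr_t)&stack_slot };
	struct rethook_node rh;

	prepare(&rh, &regs);
	printf("hijacked to trampoline at %#lx\n", (unsigned long)stack_slot);
	fixup(&stack_slot, rh.ret_addr);
	printf("restored to %#lx\n", (unsigned long)stack_slot);
	return 0;
}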
diff --git a/arch/x86/kernel/rtc.c b/arch/x86/kernel/rtc.c
index 1cfbbfc3ae26..51a849a79c98 100644
--- a/arch/x86/kernel/rtc.c
+++ b/arch/x86/kernel/rtc.c
@@ -1,146 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* RTC related functions
*/
#include <linux/platform_device.h>
#include <linux/mc146818rtc.h>
-#include <linux/acpi.h>
-#include <linux/bcd.h>
+#include <linux/export.h>
#include <linux/pnp.h>
#include <asm/vsyscall.h>
#include <asm/x86_init.h>
#include <asm/time.h>
+#include <asm/setup.h>
#ifdef CONFIG_X86_32
/*
* This is a special lock that is owned by the CPU and holds the index
* register we are working with. It is required for NMI access to the
- * CMOS/RTC registers. See include/asm-i386/mc146818rtc.h for details.
+ * CMOS/RTC registers. See arch/x86/include/asm/mc146818rtc.h for details.
*/
volatile unsigned long cmos_lock;
EXPORT_SYMBOL(cmos_lock);
#endif /* CONFIG_X86_32 */
-/* For two digit years assume time is always after that */
-#define CMOS_YEARS_OFFS 2000
-
DEFINE_SPINLOCK(rtc_lock);
EXPORT_SYMBOL(rtc_lock);
/*
- * In order to set the CMOS clock precisely, set_rtc_mmss has to be
+ * In order to set the CMOS clock precisely, mach_set_cmos_time has to be
* called 500 ms after the second nowtime has started, because when
* nowtime is written into the registers of the CMOS clock, it will
* jump to the next second precisely 500 ms later. Check the Motorola
* MC146818A or Dallas DS12887 data sheet for details.
- *
- * BUG: This routine does not handle hour overflow properly; it just
- * sets the minutes. Usually you'll only notice that after reboot!
*/
-int mach_set_rtc_mmss(unsigned long nowtime)
+int mach_set_cmos_time(const struct timespec64 *now)
{
- int real_seconds, real_minutes, cmos_minutes;
- unsigned char save_control, save_freq_select;
+ unsigned long long nowtime = now->tv_sec;
+ struct rtc_time tm;
int retval = 0;
- /* tell the clock it's being set */
- save_control = CMOS_READ(RTC_CONTROL);
- CMOS_WRITE((save_control|RTC_SET), RTC_CONTROL);
-
- /* stop and reset prescaler */
- save_freq_select = CMOS_READ(RTC_FREQ_SELECT);
- CMOS_WRITE((save_freq_select|RTC_DIV_RESET2), RTC_FREQ_SELECT);
-
- cmos_minutes = CMOS_READ(RTC_MINUTES);
- if (!(save_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD)
- cmos_minutes = bcd2bin(cmos_minutes);
-
- /*
- * since we're only adjusting minutes and seconds,
- * don't interfere with hour overflow. This avoids
- * messing with unknown time zones but requires your
- * RTC not to be off by more than 15 minutes
- */
- real_seconds = nowtime % 60;
- real_minutes = nowtime / 60;
- /* correct for half hour time zone */
- if (((abs(real_minutes - cmos_minutes) + 15)/30) & 1)
- real_minutes += 30;
- real_minutes %= 60;
-
- if (abs(real_minutes - cmos_minutes) < 30) {
- if (!(save_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD) {
- real_seconds = bin2bcd(real_seconds);
- real_minutes = bin2bcd(real_minutes);
- }
- CMOS_WRITE(real_seconds, RTC_SECONDS);
- CMOS_WRITE(real_minutes, RTC_MINUTES);
+ rtc_time64_to_tm(nowtime, &tm);
+ if (!rtc_valid_tm(&tm)) {
+ retval = mc146818_set_time(&tm);
+ if (retval)
+ printk(KERN_ERR "%s: RTC write failed with error %d\n",
+ __func__, retval);
} else {
- printk(KERN_WARNING
- "set_rtc_mmss: can't update from %d to %d\n",
- cmos_minutes, real_minutes);
- retval = -1;
+ printk(KERN_ERR
+ "%s: Invalid RTC value: write of %llx to RTC failed\n",
+ __func__, nowtime);
+ retval = -EINVAL;
}
-
- /* The following flags have to be released exactly in this order,
- * otherwise the DS12887 (popular MC146818A clone with integrated
- * battery and quartz) will not reset the oscillator and will not
- * update precisely 500 ms later. You won't find this mentioned in
- * the Dallas Semiconductor data sheets, but who believes data
- * sheets anyway ... -- Markus Kuhn
- */
- CMOS_WRITE(save_control, RTC_CONTROL);
- CMOS_WRITE(save_freq_select, RTC_FREQ_SELECT);
-
return retval;
}
-unsigned long mach_get_cmos_time(void)
+void mach_get_cmos_time(struct timespec64 *now)
{
- unsigned int status, year, mon, day, hour, min, sec, century = 0;
+ struct rtc_time tm;
/*
- * If UIP is clear, then we have >= 244 microseconds before
- * RTC registers will be updated. Spec sheet says that this
- * is the reliable way to read RTC - registers. If UIP is set
- * then the register access might be invalid.
+ * If pm_trace abused the RTC as storage, set the timespec to 0,
+ * which tells the caller that this RTC value is unusable.
*/
- while ((CMOS_READ(RTC_FREQ_SELECT) & RTC_UIP))
- cpu_relax();
-
- sec = CMOS_READ(RTC_SECONDS);
- min = CMOS_READ(RTC_MINUTES);
- hour = CMOS_READ(RTC_HOURS);
- day = CMOS_READ(RTC_DAY_OF_MONTH);
- mon = CMOS_READ(RTC_MONTH);
- year = CMOS_READ(RTC_YEAR);
-
-#ifdef CONFIG_ACPI
- if (acpi_gbl_FADT.header.revision >= FADT2_REVISION_ID &&
- acpi_gbl_FADT.century)
- century = CMOS_READ(acpi_gbl_FADT.century);
-#endif
-
- status = CMOS_READ(RTC_CONTROL);
- WARN_ON_ONCE(RTC_ALWAYS_BCD && (status & RTC_DM_BINARY));
-
- if (RTC_ALWAYS_BCD || !(status & RTC_DM_BINARY)) {
- sec = bcd2bin(sec);
- min = bcd2bin(min);
- hour = bcd2bin(hour);
- day = bcd2bin(day);
- mon = bcd2bin(mon);
- year = bcd2bin(year);
+ if (!pm_trace_rtc_valid()) {
+ now->tv_sec = now->tv_nsec = 0;
+ return;
}
- if (century) {
- century = bcd2bin(century);
- year += century * 100;
- printk(KERN_INFO "Extended CMOS year: %d\n", century * 100);
- } else
- year += CMOS_YEARS_OFFS;
+ if (mc146818_get_time(&tm, 1000)) {
+ pr_err("Unable to read current time from RTC\n");
+ now->tv_sec = now->tv_nsec = 0;
+ return;
+ }
- return mktime(year, mon, day, hour, min, sec);
+ now->tv_sec = rtc_tm_to_time64(&tm);
+ now->tv_nsec = 0;
}
/* Routines for accessing the CMOS RAM/RTC. */
@@ -166,36 +99,16 @@ void rtc_cmos_write(unsigned char val, unsigned char addr)
}
EXPORT_SYMBOL(rtc_cmos_write);
-int update_persistent_clock(struct timespec now)
+int update_persistent_clock64(struct timespec64 now)
{
- unsigned long flags;
- int retval;
-
- spin_lock_irqsave(&rtc_lock, flags);
- retval = x86_platform.set_wallclock(now.tv_sec);
- spin_unlock_irqrestore(&rtc_lock, flags);
-
- return retval;
+ return x86_platform.set_wallclock(&now);
}
/* not static: needed by APM */
-void read_persistent_clock(struct timespec *ts)
-{
- unsigned long retval, flags;
-
- spin_lock_irqsave(&rtc_lock, flags);
- retval = x86_platform.get_wallclock();
- spin_unlock_irqrestore(&rtc_lock, flags);
-
- ts->tv_sec = retval;
- ts->tv_nsec = 0;
-}
-
-unsigned long long native_read_tsc(void)
+void read_persistent_clock64(struct timespec64 *ts)
{
- return __native_read_tsc();
+ x86_platform.get_wallclock(ts);
}
-EXPORT_SYMBOL(native_read_tsc);
static struct resource rtc_resources[] = {
@@ -221,21 +134,20 @@ static struct platform_device rtc_device = {
static __init int add_rtc_cmos(void)
{
#ifdef CONFIG_PNP
- static const char *ids[] __initconst =
+ static const char * const ids[] __initconst =
{ "PNP0b00", "PNP0b01", "PNP0b02", };
struct pnp_dev *dev;
- struct pnp_id *id;
int i;
pnp_for_each_dev(dev) {
- for (id = dev->id; id; id = id->next) {
- for (i = 0; i < ARRAY_SIZE(ids); i++) {
- if (compare_pnp_id(id, ids[i]) != 0)
- return 0;
- }
+ for (i = 0; i < ARRAY_SIZE(ids); i++) {
+ if (compare_pnp_id(dev->id, ids[i]) != 0)
+ return 0;
}
}
#endif
+ if (!x86_platform.legacy.rtc)
+ return -ENODEV;
platform_device_register(&rtc_device);
dev_info(&rtc_device.dev,
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index b4ae4acbd031..1b2edd07a3e1 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1,121 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Copyright (C) 1995 Linus Torvalds
*
- * Support of BIGMEM added by Gerhard Wichert, Siemens AG, July 1999
- *
- * Memory region support
- * David Parsons <orc@pell.chi.il.us>, July-August 1999
- *
- * Added E820 sanitization routine (removes overlapping memory regions);
- * Brian Moyle <bmoyle@mvista.com>, February 2001
- *
- * Moved CPU detection code to cpu/${cpu}.c
- * Patrick Mochel <mochel@osdl.org>, March 2002
- *
- * Provisions for empty E820 memory regions (reported by certain BIOSes).
- * Alex Achenbach <xela@slit.de>, December 2002.
- *
- */
-
-/*
- * This file handles the architecture-dependent parts of initialization
+ * This file contains the setup_arch() code, which handles the architecture-dependent
+ * parts of early kernel initialization.
*/
-
-#include <linux/sched.h>
-#include <linux/mm.h>
-#include <linux/mmzone.h>
-#include <linux/screen_info.h>
-#include <linux/ioport.h>
#include <linux/acpi.h>
-#include <linux/sfi.h>
-#include <linux/apm_bios.h>
-#include <linux/initrd.h>
-#include <linux/bootmem.h>
-#include <linux/seq_file.h>
#include <linux/console.h>
-#include <linux/mca.h>
-#include <linux/root_dev.h>
-#include <linux/highmem.h>
-#include <linux/module.h>
+#include <linux/cpu.h>
+#include <linux/crash_dump.h>
+#include <linux/dma-map-ops.h>
#include <linux/efi.h>
-#include <linux/init.h>
-#include <linux/edd.h>
+#include <linux/hugetlb.h>
+#include <linux/ima.h>
+#include <linux/init_ohci1394_dma.h>
+#include <linux/initrd.h>
#include <linux/iscsi_ibft.h>
-#include <linux/nodemask.h>
-#include <linux/kexec.h>
-#include <linux/dmi.h>
-#include <linux/pfn.h>
+#include <linux/memblock.h>
+#include <linux/panic_notifier.h>
#include <linux/pci.h>
-#include <asm/pci-direct.h>
-#include <linux/init_ohci1394_dma.h>
-#include <linux/kvm_para.h>
-
-#include <linux/errno.h>
-#include <linux/kernel.h>
-#include <linux/stddef.h>
-#include <linux/unistd.h>
-#include <linux/ptrace.h>
-#include <linux/user.h>
-#include <linux/delay.h>
-
-#include <linux/kallsyms.h>
-#include <linux/cpufreq.h>
-#include <linux/dma-mapping.h>
-#include <linux/ctype.h>
-#include <linux/uaccess.h>
-
-#include <linux/percpu.h>
-#include <linux/crash_dump.h>
+#include <linux/random.h>
+#include <linux/root_dev.h>
+#include <linux/static_call.h>
+#include <linux/swiotlb.h>
#include <linux/tboot.h>
+#include <linux/usb/xhci-dbgp.h>
+#include <linux/vmalloc.h>
-#include <video/edid.h>
+#include <uapi/linux/mount.h>
+
+#include <xen/xen.h>
-#include <asm/mtrr.h>
#include <asm/apic.h>
-#include <asm/trampoline.h>
-#include <asm/e820.h>
-#include <asm/mpspec.h>
-#include <asm/setup.h>
-#include <asm/efi.h>
-#include <asm/timer.h>
-#include <asm/i8259.h>
-#include <asm/sections.h>
-#include <asm/dmi.h>
-#include <asm/io_apic.h>
-#include <asm/ist.h>
-#include <asm/vmi.h>
-#include <asm/setup_arch.h>
#include <asm/bios_ebda.h>
-#include <asm/cacheflush.h>
-#include <asm/processor.h>
#include <asm/bugs.h>
-
-#include <asm/system.h>
-#include <asm/vsyscall.h>
+#include <asm/cacheinfo.h>
+#include <asm/coco.h>
#include <asm/cpu.h>
-#include <asm/desc.h>
-#include <asm/dma.h>
-#include <asm/iommu.h>
+#include <asm/efi.h>
#include <asm/gart.h>
-#include <asm/mmu_context.h>
-#include <asm/proto.h>
-
-#include <asm/paravirt.h>
#include <asm/hypervisor.h>
-
-#include <asm/percpu.h>
-#include <asm/topology.h>
-#include <asm/apicdef.h>
-#include <asm/k8.h>
-#ifdef CONFIG_X86_64
-#include <asm/numa_64.h>
-#endif
+#include <asm/io_apic.h>
+#include <asm/kasan.h>
+#include <asm/kaslr.h>
#include <asm/mce.h>
+#include <asm/memtype.h>
+#include <asm/mtrr.h>
+#include <asm/nmi.h>
+#include <asm/numa.h>
+#include <asm/olpc_ofw.h>
+#include <asm/pci-direct.h>
+#include <asm/prom.h>
+#include <asm/proto.h>
+#include <asm/realmode.h>
+#include <asm/thermal.h>
+#include <asm/unwind.h>
+#include <asm/vsyscall.h>
/*
- * end_pfn only includes RAM, while max_pfn_mapped includes all e820 entries.
- * The direct mapping extends to max_pfn_mapped, so that we can directly access
- * apertures, ACPI and other tables without having to play with fixmaps.
+ * max_low_pfn_mapped: highest directly mapped pfn < 4 GB
+ * max_pfn_mapped: highest directly mapped pfn > 4 GB
+ *
+ * The direct mapping only covers E820_TYPE_RAM regions, so the ranges and gaps are
+ * represented by pfn_mapped[].
*/
unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;
@@ -124,73 +71,50 @@ unsigned long max_pfn_mapped;
RESERVE_BRK(dmi_alloc, 65536);
#endif
-unsigned int boot_cpu_id __read_mostly;
-
-static __initdata unsigned long _brk_start = (unsigned long)__brk_base;
-unsigned long _brk_end = (unsigned long)__brk_base;
-#ifdef CONFIG_X86_64
-int default_cpu_present_to_apicid(int mps_cpu)
-{
- return __default_cpu_present_to_apicid(mps_cpu);
-}
+unsigned long _brk_start = (unsigned long)__brk_base;
+unsigned long _brk_end = (unsigned long)__brk_base;
-int default_check_phys_apicid_present(int phys_apicid)
-{
- return __default_check_phys_apicid_present(phys_apicid);
-}
-#endif
-
-#ifndef CONFIG_DEBUG_BOOT_PARAMS
-struct boot_params __initdata boot_params;
-#else
struct boot_params boot_params;
-#endif
/*
- * Machine setup..
+ * These are the four main kernel memory regions; we put them into
+ * the resource tree so that kdump tools and other debugging tools
+ * can recover them:
*/
+
+static struct resource rodata_resource = {
+ .name = "Kernel rodata",
+ .start = 0,
+ .end = 0,
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
+};
+
static struct resource data_resource = {
.name = "Kernel data",
.start = 0,
.end = 0,
- .flags = IORESOURCE_BUSY | IORESOURCE_MEM
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
};
static struct resource code_resource = {
.name = "Kernel code",
.start = 0,
.end = 0,
- .flags = IORESOURCE_BUSY | IORESOURCE_MEM
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
};
static struct resource bss_resource = {
.name = "Kernel bss",
.start = 0,
.end = 0,
- .flags = IORESOURCE_BUSY | IORESOURCE_MEM
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM
};
#ifdef CONFIG_X86_32
-/* cpu data as detected by the assembly code in head.S */
-struct cpuinfo_x86 new_cpu_data __cpuinitdata = {0, 0, 0, 0, -1, 1, 0, 0, -1};
-/* common cpu data for all cpus */
-struct cpuinfo_x86 boot_cpu_data __read_mostly = {0, 0, 0, 0, -1, 1, 0, 0, -1};
-EXPORT_SYMBOL(boot_cpu_data);
-static void set_mca_bus(int x)
-{
-#ifdef CONFIG_MCA
- MCA_bus = x;
-#endif
-}
-
-unsigned int def_to_bigsmp;
-
-/* for MCA, but anyone else can use it if they want */
-unsigned int machine_id;
-unsigned int machine_submodel_id;
-unsigned int BIOS_revision;
+/* CPU data as detected by the assembly code in head_32.S */
+struct cpuinfo_x86 new_cpu_data;
struct apm_info apm_info;
EXPORT_SYMBOL(apm_info);
@@ -203,30 +127,96 @@ EXPORT_SYMBOL(ist_info);
struct ist_info ist_info;
#endif
-#else
-struct cpuinfo_x86 boot_cpu_data __read_mostly = {
- .x86_phys_bits = MAX_PHYSMEM_BITS,
-};
-EXPORT_SYMBOL(boot_cpu_data);
#endif
+struct cpuinfo_x86 boot_cpu_data __read_mostly;
+EXPORT_SYMBOL(boot_cpu_data);
+SYM_PIC_ALIAS(boot_cpu_data);
#if !defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
-unsigned long mmu_cr4_features;
+__visible unsigned long mmu_cr4_features __ro_after_init;
#else
-unsigned long mmu_cr4_features = X86_CR4_PAE;
+__visible unsigned long mmu_cr4_features __ro_after_init = X86_CR4_PAE;
+#endif
+
+#ifdef CONFIG_IMA
+static phys_addr_t ima_kexec_buffer_phys;
+static size_t ima_kexec_buffer_size;
#endif
/* Boot loader ID and version as integers, for the benefit of proc_dointvec */
int bootloader_type, bootloader_version;
+static const struct ctl_table x86_sysctl_table[] = {
+ {
+ .procname = "unknown_nmi_panic",
+ .data = &unknown_nmi_panic,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "panic_on_unrecovered_nmi",
+ .data = &panic_on_unrecovered_nmi,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "panic_on_io_nmi",
+ .data = &panic_on_io_nmi,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "bootloader_type",
+ .data = &bootloader_type,
+ .maxlen = sizeof(int),
+ .mode = 0444,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "bootloader_version",
+ .data = &bootloader_version,
+ .maxlen = sizeof(int),
+ .mode = 0444,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "io_delay_type",
+ .data = &io_delay_type,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#if defined(CONFIG_ACPI_SLEEP)
+ {
+ .procname = "acpi_video_flags",
+ .data = &acpi_realmode_flags,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ },
+#endif
+};
+
+static int __init init_x86_sysctl(void)
+{
+ register_sysctl_init("kernel", x86_sysctl_table);
+ return 0;
+}
+arch_initcall(init_x86_sysctl);
+
/*
* Setup options
*/
struct screen_info screen_info;
EXPORT_SYMBOL(screen_info);
+#if defined(CONFIG_FIRMWARE_EDID)
struct edid_info edid_info;
EXPORT_SYMBOL_GPL(edid_info);
+#endif
extern int root_mountflags;
@@ -238,7 +228,8 @@ unsigned long saved_video_mode;
static char __initdata command_line[COMMAND_LINE_SIZE];
#ifdef CONFIG_CMDLINE_BOOL
-static char __initdata builtin_cmdline[COMMAND_LINE_SIZE] = CONFIG_CMDLINE;
+char builtin_cmdline[COMMAND_LINE_SIZE] = CONFIG_CMDLINE;
+bool builtin_cmdline_added __ro_after_init;
#endif
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
@@ -284,16 +275,8 @@ void * __init extend_brk(size_t size, size_t align)
return ret;
}
-#ifdef CONFIG_X86_64
-static void __init init_gbpages(void)
-{
- if (direct_gbpages && cpu_has_gbpages)
- printk(KERN_INFO "Using GB pages for direct mapping\n");
- else
- direct_gbpages = 0;
-}
-#else
-static inline void init_gbpages(void)
+#ifdef CONFIG_X86_32
+static void __init cleanup_highmap(void)
{
}
#endif
@@ -301,7 +284,8 @@ static inline void init_gbpages(void)
static void __init reserve_brk(void)
{
if (_brk_end > _brk_start)
- reserve_early(__pa(_brk_start), __pa(_brk_end), "BRK");
+ memblock_reserve_kern(__pa_symbol(_brk_start),
+ _brk_end - _brk_start);
/* Mark brk area as locked down and no longer taking any
new allocations */
@@ -310,101 +294,92 @@ static void __init reserve_brk(void)
#ifdef CONFIG_BLK_DEV_INITRD
-#define MAX_MAP_CHUNK (NR_FIX_BTMAPS << PAGE_SHIFT)
+static u64 __init get_ramdisk_image(void)
+{
+ u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+
+ ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
+
+ if (ramdisk_image == 0)
+ ramdisk_image = phys_initrd_start;
+
+ return ramdisk_image;
+}
+static u64 __init get_ramdisk_size(void)
+{
+ u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+
+ ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;
+
+ if (ramdisk_size == 0)
+ ramdisk_size = phys_initrd_size;
+
+ return ramdisk_size;
+}
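Both helpers splice a 64-bit value out of the legacy 32-bit header field and
its ext_* extension; the composition is simply (a restatement, with
illustrative parameter names):

#include <stdint.h>

static uint64_t compose64(uint32_t legacy_lo, uint32_t ext_hi)
{
	return (uint64_t)legacy_lo | ((uint64_t)ext_hi << 32);
}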
+
static void __init relocate_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = boot_params.hdr.ramdisk_image;
- u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+ u64 ramdisk_image = get_ramdisk_image();
+ u64 ramdisk_size = get_ramdisk_size();
u64 area_size = PAGE_ALIGN(ramdisk_size);
- u64 end_of_lowmem = max_low_pfn_mapped << PAGE_SHIFT;
- u64 ramdisk_here;
- unsigned long slop, clen, mapaddr;
- char *p, *q;
+ int ret = 0;
- /* We need to move the initrd down into lowmem */
- ramdisk_here = find_e820_area(0, end_of_lowmem, area_size,
- PAGE_SIZE);
-
- if (ramdisk_here == -1ULL)
+ /* We need to move the initrd down into directly mapped mem */
+ u64 relocated_ramdisk = memblock_phys_alloc_range(area_size, PAGE_SIZE, 0,
+ PFN_PHYS(max_pfn_mapped));
+ if (!relocated_ramdisk)
panic("Cannot find place for new RAMDISK of size %lld\n",
- ramdisk_size);
+ ramdisk_size);
- /* Note: this includes all the lowmem currently occupied by
- the initrd, we rely on that fact to keep the data intact. */
- reserve_early(ramdisk_here, ramdisk_here + area_size,
- "NEW RAMDISK");
- initrd_start = ramdisk_here + PAGE_OFFSET;
+ initrd_start = relocated_ramdisk + PAGE_OFFSET;
initrd_end = initrd_start + ramdisk_size;
- printk(KERN_INFO "Allocated new RAMDISK: %08llx - %08llx\n",
- ramdisk_here, ramdisk_here + ramdisk_size);
-
- q = (char *)initrd_start;
-
- /* Copy any lowmem portion of the initrd */
- if (ramdisk_image < end_of_lowmem) {
- clen = end_of_lowmem - ramdisk_image;
- p = (char *)__va(ramdisk_image);
- memcpy(q, p, clen);
- q += clen;
- ramdisk_image += clen;
- ramdisk_size -= clen;
- }
+ printk(KERN_INFO "Allocated new RAMDISK: [mem %#010llx-%#010llx]\n",
+ relocated_ramdisk, relocated_ramdisk + ramdisk_size - 1);
- /* Copy the highmem portion of the initrd */
- while (ramdisk_size) {
- slop = ramdisk_image & ~PAGE_MASK;
- clen = ramdisk_size;
- if (clen > MAX_MAP_CHUNK-slop)
- clen = MAX_MAP_CHUNK-slop;
- mapaddr = ramdisk_image & PAGE_MASK;
- p = early_memremap(mapaddr, clen+slop);
- memcpy(q, p+slop, clen);
- early_iounmap(p, clen+slop);
- q += clen;
- ramdisk_image += clen;
- ramdisk_size -= clen;
- }
- /* high pages is not converted by early_res_to_bootmem */
- ramdisk_image = boot_params.hdr.ramdisk_image;
- ramdisk_size = boot_params.hdr.ramdisk_size;
- printk(KERN_INFO "Move RAMDISK from %016llx - %016llx to"
- " %08llx - %08llx\n",
+ ret = copy_from_early_mem((void *)initrd_start, ramdisk_image, ramdisk_size);
+ if (ret)
+ panic("Copy RAMDISK failed\n");
+
+ printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
+ " [mem %#010llx-%#010llx]\n",
ramdisk_image, ramdisk_image + ramdisk_size - 1,
- ramdisk_here, ramdisk_here + ramdisk_size - 1);
+ relocated_ramdisk, relocated_ramdisk + ramdisk_size - 1);
}
-static void __init reserve_initrd(void)
+static void __init early_reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = boot_params.hdr.ramdisk_image;
- u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+ u64 ramdisk_image = get_ramdisk_image();
+ u64 ramdisk_size = get_ramdisk_size();
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
- u64 end_of_lowmem = max_low_pfn_mapped << PAGE_SHIFT;
if (!boot_params.hdr.type_of_loader ||
!ramdisk_image || !ramdisk_size)
return; /* No initrd provided by bootloader */
- initrd_start = 0;
+ memblock_reserve_kern(ramdisk_image, ramdisk_end - ramdisk_image);
+}
- if (ramdisk_size >= (end_of_lowmem>>1)) {
- free_early(ramdisk_image, ramdisk_end);
- printk(KERN_ERR "initrd too large to handle, "
- "disabling initrd\n");
- return;
- }
+static void __init reserve_initrd(void)
+{
+ /* Assume only end is not page aligned */
+ u64 ramdisk_image = get_ramdisk_image();
+ u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
- printk(KERN_INFO "RAMDISK: %08llx - %08llx\n", ramdisk_image,
- ramdisk_end);
+ if (!boot_params.hdr.type_of_loader ||
+ !ramdisk_image || !ramdisk_size)
+ return; /* No initrd provided by bootloader */
+
+ initrd_start = 0;
+ printk(KERN_INFO "RAMDISK: [mem %#010llx-%#010llx]\n", ramdisk_image,
+ ramdisk_end - 1);
- if (ramdisk_end <= end_of_lowmem) {
- /* All in lowmem, easy case */
- /*
- * don't need to reserve again, already reserved early
- * in i386_start_kernel
- */
+ if (pfn_range_is_mapped(PFN_DOWN(ramdisk_image),
+ PFN_DOWN(ramdisk_end))) {
+ /* All are mapped, easy case */
initrd_start = ramdisk_image + PAGE_OFFSET;
initrd_end = initrd_start + ramdisk_size;
return;
@@ -412,145 +387,243 @@ static void __init reserve_initrd(void)
relocate_initrd();
- free_early(ramdisk_image, ramdisk_end);
+ memblock_phys_free(ramdisk_image, ramdisk_end - ramdisk_image);
}
+
#else
+static void __init early_reserve_initrd(void)
+{
+}
static void __init reserve_initrd(void)
{
}
#endif /* CONFIG_BLK_DEV_INITRD */
+static void __init add_early_ima_buffer(u64 phys_addr)
+{
+#ifdef CONFIG_IMA
+ struct ima_setup_data *data;
+
+ data = early_memremap(phys_addr + sizeof(struct setup_data), sizeof(*data));
+ if (!data) {
+ pr_warn("setup: failed to memremap ima_setup_data entry\n");
+ return;
+ }
+
+ if (data->size) {
+ memblock_reserve_kern(data->addr, data->size);
+ ima_kexec_buffer_phys = data->addr;
+ ima_kexec_buffer_size = data->size;
+ }
+
+ early_memunmap(data, sizeof(*data));
+#else
+ pr_warn("Passed IMA kexec data, but CONFIG_IMA not set. Ignoring.\n");
+#endif
+}
+
+#if defined(CONFIG_HAVE_IMA_KEXEC) && !defined(CONFIG_OF_FLATTREE)
+int __init ima_free_kexec_buffer(void)
+{
+ if (!ima_kexec_buffer_size)
+ return -ENOENT;
+
+ memblock_free_late(ima_kexec_buffer_phys,
+ ima_kexec_buffer_size);
+
+ ima_kexec_buffer_phys = 0;
+ ima_kexec_buffer_size = 0;
+
+ return 0;
+}
+
+int __init ima_get_kexec_buffer(void **addr, size_t *size)
+{
+ if (!ima_kexec_buffer_size)
+ return -ENOENT;
+
+ *addr = __va(ima_kexec_buffer_phys);
+ *size = ima_kexec_buffer_size;
+
+ return 0;
+}
+#endif
+
+static void __init add_kho(u64 phys_addr, u32 data_len)
+{
+ struct kho_data *kho;
+ u64 addr = phys_addr + sizeof(struct setup_data);
+ u64 size = data_len - sizeof(struct setup_data);
+
+ if (!IS_ENABLED(CONFIG_KEXEC_HANDOVER)) {
+ pr_warn("Passed KHO data, but CONFIG_KEXEC_HANDOVER not set. Ignoring.\n");
+ return;
+ }
+
+ kho = early_memremap(addr, size);
+ if (!kho) {
+ pr_warn("setup: failed to memremap kho data (0x%llx, 0x%llx)\n",
+ addr, size);
+ return;
+ }
+
+ kho_populate(kho->fdt_addr, kho->fdt_size, kho->scratch_addr, kho->scratch_size);
+
+ early_memunmap(kho, size);
+}
+
static void __init parse_setup_data(void)
{
struct setup_data *data;
- u64 pa_data;
+ u64 pa_data, pa_next;
- if (boot_params.hdr.version < 0x0209)
- return;
pa_data = boot_params.hdr.setup_data;
while (pa_data) {
- data = early_memremap(pa_data, PAGE_SIZE);
- switch (data->type) {
+ u32 data_len, data_type;
+
+ data = early_memremap(pa_data, sizeof(*data));
+ data_len = data->len + sizeof(struct setup_data);
+ data_type = data->type;
+ pa_next = data->next;
+ early_memunmap(data, sizeof(*data));
+
+ switch (data_type) {
case SETUP_E820_EXT:
- parse_e820_ext(data, pa_data);
+ e820__memory_setup_extended(pa_data, data_len);
+ break;
+ case SETUP_DTB:
+ add_dtb(pa_data);
+ break;
+ case SETUP_EFI:
+ parse_efi_setup(pa_data, data_len);
+ break;
+ case SETUP_IMA:
+ add_early_ima_buffer(pa_data);
+ break;
+ case SETUP_KEXEC_KHO:
+ add_kho(pa_data, data_len);
+ break;
+ case SETUP_RNG_SEED:
+ data = early_memremap(pa_data, data_len);
+ add_bootloader_randomness(data->data, data->len);
+ /* Zero seed for forward secrecy. */
+ memzero_explicit(data->data, data->len);
+ /* Zero length in case we find ourselves back here by accident. */
+ memzero_explicit(&data->len, sizeof(data->len));
+ early_memunmap(data, data_len);
break;
default:
break;
}
- pa_data = data->next;
- early_iounmap(data, PAGE_SIZE);
+ pa_data = pa_next;
}
}
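parse_setup_data() walks a singly linked list of physically addressed blobs;
each node's header is mapped just long enough to read its type, length, and
next pointer before the payload is dispatched. A sketch of the node layout and
traversal (struct fields per the x86 boot protocol; map/unmap stand in for
early_memremap/early_memunmap and pretend physical addresses are directly
usable, which only holds for this sketch):

#include <stddef.h>
#include <stdint.h>

/* Node header per the x86 boot protocol's setup_data chain. */
struct setup_data {
	uint64_t next;	/* physical address of the next node, 0 ends */
	uint32_t type;	/* SETUP_E820_EXT, SETUP_EFI, SETUP_IMA, ... */
	uint32_t len;	/* payload length, excluding this header */
	uint8_t data[];
};

static void *map(uint64_t phys, size_t len) { (void)len; return (void *)(uintptr_t)phys; }
static void unmap(void *virt, size_t len) { (void)virt; (void)len; }

/* Read each header, stash next/type before unmapping, then dispatch. */
static void walk_setup_data(uint64_t pa_data)
{
	while (pa_data) {
		struct setup_data *data = map(pa_data, sizeof(*data));
		uint64_t pa_next = data->next;
		uint32_t type = data->type;

		unmap(data, sizeof(*data));

		switch (type) {
		/* SETUP_E820_EXT, SETUP_EFI, SETUP_IMA, ... as above */
		default:
			break;
		}
		pa_data = pa_next;
	}
}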
-static void __init e820_reserve_setup_data(void)
+/*
+ * Translate the fields of 'struct boot_param' into global variables
+ * representing these parameters.
+ */
+static void __init parse_boot_params(void)
{
- struct setup_data *data;
- u64 pa_data;
- int found = 0;
+ ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
+ screen_info = boot_params.screen_info;
+#if defined(CONFIG_FIRMWARE_EDID)
+ edid_info = boot_params.edid_info;
+#endif
+#ifdef CONFIG_X86_32
+ apm_info.bios = boot_params.apm_bios_info;
+ ist_info = boot_params.ist_info;
+#endif
+ saved_video_mode = boot_params.hdr.vid_mode;
+ bootloader_type = boot_params.hdr.type_of_loader;
+ if ((bootloader_type >> 4) == 0xe) {
+ bootloader_type &= 0xf;
+ bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4;
+ }
+ bootloader_version = bootloader_type & 0xf;
+ bootloader_version |= boot_params.hdr.ext_loader_ver << 4;
- if (boot_params.hdr.version < 0x0209)
- return;
- pa_data = boot_params.hdr.setup_data;
- while (pa_data) {
- data = early_memremap(pa_data, sizeof(*data));
- e820_update_range(pa_data, sizeof(*data)+data->len,
- E820_RAM, E820_RESERVED_KERN);
- found = 1;
- pa_data = data->next;
- early_iounmap(data, sizeof(*data));
+#ifdef CONFIG_BLK_DEV_RAM
+ rd_image_start = boot_params.hdr.ram_size & RAMDISK_IMAGE_START_MASK;
+#endif
+#ifdef CONFIG_EFI
+ if (!strncmp((char *)&boot_params.efi_info.efi_loader_signature,
+ EFI32_LOADER_SIGNATURE, 4)) {
+ set_bit(EFI_BOOT, &efi.flags);
+ } else if (!strncmp((char *)&boot_params.efi_info.efi_loader_signature,
+ EFI64_LOADER_SIGNATURE, 4)) {
+ set_bit(EFI_BOOT, &efi.flags);
+ set_bit(EFI_64BIT, &efi.flags);
}
- if (!found)
- return;
+#endif
- sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
- memcpy(&e820_saved, &e820, sizeof(struct e820map));
- printk(KERN_INFO "extended physical RAM map:\n");
- e820_print_map("reserve setup_data");
+ if (!boot_params.hdr.root_flags)
+ root_mountflags &= ~MS_RDONLY;
}
-static void __init reserve_early_setup_data(void)
+static void __init memblock_x86_reserve_range_setup_data(void)
{
+ struct setup_indirect *indirect;
struct setup_data *data;
- u64 pa_data;
- char buf[32];
+ u64 pa_data, pa_next;
+ u32 len;
- if (boot_params.hdr.version < 0x0209)
- return;
pa_data = boot_params.hdr.setup_data;
while (pa_data) {
data = early_memremap(pa_data, sizeof(*data));
- sprintf(buf, "setup data %x", data->type);
- reserve_early(pa_data, pa_data+sizeof(*data)+data->len, buf);
- pa_data = data->next;
- early_iounmap(data, sizeof(*data));
- }
-}
+ if (!data) {
+ pr_warn("setup: failed to memremap setup_data entry\n");
+ return;
+ }
-/*
- * --------- Crashkernel reservation ------------------------------
- */
+ len = sizeof(*data);
+ pa_next = data->next;
-#ifdef CONFIG_KEXEC
+ memblock_reserve_kern(pa_data, sizeof(*data) + data->len);
-static inline unsigned long long get_total_mem(void)
-{
- unsigned long long total;
+ if (data->type == SETUP_INDIRECT) {
+ len += data->len;
+ early_memunmap(data, sizeof(*data));
+ data = early_memremap(pa_data, len);
+ if (!data) {
+ pr_warn("setup: failed to memremap indirect setup_data\n");
+ return;
+ }
+
+ indirect = (struct setup_indirect *)data->data;
- total = max_pfn - min_low_pfn;
+ if (indirect->type != SETUP_INDIRECT)
+ memblock_reserve_kern(indirect->addr, indirect->len);
+ }
- return total << PAGE_SHIFT;
+ pa_data = pa_next;
+ early_memunmap(data, len);
+ }
}
-static void __init reserve_crashkernel(void)
+static void __init arch_reserve_crashkernel(void)
{
- unsigned long long total_mem;
- unsigned long long crash_size, crash_base;
+ unsigned long long crash_base, crash_size, low_size = 0, cma_size = 0;
+ bool high = false;
int ret;
- total_mem = get_total_mem();
-
- ret = parse_crashkernel(boot_command_line, total_mem,
- &crash_size, &crash_base);
- if (ret != 0 || crash_size <= 0)
+ if (!IS_ENABLED(CONFIG_CRASH_RESERVE))
return;
- /* 0 means: find the address automatically */
- if (crash_base <= 0) {
- const unsigned long long alignment = 16<<20; /* 16M */
-
- crash_base = find_e820_area(alignment, ULONG_MAX, crash_size,
- alignment);
- if (crash_base == -1ULL) {
- pr_info("crashkernel reservation failed - No suitable area found.\n");
- return;
- }
- } else {
- unsigned long long start;
+ ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
+ &crash_size, &crash_base,
+ &low_size, &cma_size, &high);
+ if (ret)
+ return;
- start = find_e820_area(crash_base, ULONG_MAX, crash_size,
- 1<<20);
- if (start != crash_base) {
- pr_info("crashkernel reservation failed - memory is in use.\n");
- return;
- }
+ if (xen_pv_domain()) {
+ pr_info("Ignoring crashkernel for a Xen PV domain\n");
+ return;
}
- reserve_early(crash_base, crash_base + crash_size, "CRASH KERNEL");
- printk(KERN_INFO "Reserving %ldMB of memory at %ldMB "
- "for crashkernel (System RAM: %ldMB)\n",
- (unsigned long)(crash_size >> 20),
- (unsigned long)(crash_base >> 20),
- (unsigned long)(total_mem >> 20));
-
- crashk_res.start = crash_base;
- crashk_res.end = crash_base + crash_size - 1;
- insert_resource(&iomem_resource, &crashk_res);
+ reserve_crashkernel_generic(crash_size, crash_base, low_size, high);
+ reserve_crashkernel_cma(cma_size);
}
-#else
-static void __init reserve_crashkernel(void)
-{
-}
-#endif
static struct resource standard_io_resources[] = {
{ .name = "dma1", .start = 0x00, .end = 0x1f,
@@ -585,111 +658,92 @@ void __init reserve_standard_io_resources(void)
}
-/*
- * Note: elfcorehdr_addr is not just limited to vmcore. It is also used by
- * is_kdump_kernel() to determine if we are booting after a panic. Hence
- * ifdef it under CONFIG_CRASH_DUMP and not CONFIG_PROC_VMCORE.
- */
-
-#ifdef CONFIG_CRASH_DUMP
-/* elfcorehdr= specifies the location of elf core header
- * stored by the crashed kernel. This option will be passed
- * by kexec loader to the capture kernel.
- */
-static int __init setup_elfcorehdr(char *arg)
+static void __init setup_kernel_resources(void)
{
- char *end;
- if (!arg)
- return -EINVAL;
- elfcorehdr_addr = memparse(arg, &end);
- return end > arg ? 0 : -EINVAL;
+ code_resource.start = __pa_symbol(_text);
+ code_resource.end = __pa_symbol(_etext)-1;
+ rodata_resource.start = __pa_symbol(__start_rodata);
+ rodata_resource.end = __pa_symbol(__end_rodata)-1;
+ data_resource.start = __pa_symbol(_sdata);
+ data_resource.end = __pa_symbol(_edata)-1;
+ bss_resource.start = __pa_symbol(__bss_start);
+ bss_resource.end = __pa_symbol(__bss_stop)-1;
+
+ insert_resource(&iomem_resource, &code_resource);
+ insert_resource(&iomem_resource, &rodata_resource);
+ insert_resource(&iomem_resource, &data_resource);
+ insert_resource(&iomem_resource, &bss_resource);
}
-early_param("elfcorehdr", setup_elfcorehdr);
-#endif
-static __init void reserve_ibft_region(void)
+static bool __init snb_gfx_workaround_needed(void)
{
- unsigned long addr, size = 0;
-
- addr = find_ibft_region(&size);
+#ifdef CONFIG_PCI
+ int i;
+ u16 vendor, devid;
+ static const __initconst u16 snb_ids[] = {
+ 0x0102,
+ 0x0112,
+ 0x0122,
+ 0x0106,
+ 0x0116,
+ 0x0126,
+ 0x010a,
+ };
+
+ /* Assume no if something weird is going on with PCI */
+ if (!early_pci_allowed())
+ return false;
+
+ vendor = read_pci_config_16(0, 2, 0, PCI_VENDOR_ID);
+ if (vendor != 0x8086)
+ return false;
+
+ devid = read_pci_config_16(0, 2, 0, PCI_DEVICE_ID);
+ for (i = 0; i < ARRAY_SIZE(snb_ids); i++)
+ if (devid == snb_ids[i])
+ return true;
+#endif
- if (size)
- reserve_early_overlap_ok(addr, addr + size, "ibft");
+ return false;
}
-#ifdef CONFIG_X86_RESERVE_LOW_64K
-static int __init dmi_low_memory_corruption(const struct dmi_system_id *d)
+/*
+ * Sandy Bridge graphics has trouble with certain ranges, exclude
+ * them from allocation.
+ */
+static void __init trim_snb_memory(void)
{
- printk(KERN_NOTICE
- "%s detected: BIOS may corrupt low RAM, working around it.\n",
- d->ident);
+ static const __initconst unsigned long bad_pages[] = {
+ 0x20050000,
+ 0x20110000,
+ 0x20130000,
+ 0x20138000,
+ 0x40004000,
+ };
+ int i;
- e820_update_range(0, 0x10000, E820_RAM, E820_RESERVED);
- sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
+ if (!snb_gfx_workaround_needed())
+ return;
- return 0;
-}
-#endif
+ printk(KERN_DEBUG "reserving inaccessible SNB gfx pages\n");
-/* List of systems that have known low memory corruption BIOS problems */
-static struct dmi_system_id __initdata bad_bios_dmi_table[] = {
-#ifdef CONFIG_X86_RESERVE_LOW_64K
- {
- .callback = dmi_low_memory_corruption,
- .ident = "AMI BIOS",
- .matches = {
- DMI_MATCH(DMI_BIOS_VENDOR, "American Megatrends Inc."),
- },
- },
- {
- .callback = dmi_low_memory_corruption,
- .ident = "Phoenix BIOS",
- .matches = {
- DMI_MATCH(DMI_BIOS_VENDOR, "Phoenix Technologies"),
- },
- },
- {
- .callback = dmi_low_memory_corruption,
- .ident = "Phoenix/MSC BIOS",
- .matches = {
- DMI_MATCH(DMI_BIOS_VENDOR, "Phoenix/MSC"),
- },
- },
/*
- * AMI BIOS with low memory corruption was found on Intel DG45ID and
- * DG45FC boards.
- * It has a different DMI_BIOS_VENDOR = "Intel Corp.", for now we will
- * match only DMI_BOARD_NAME and see if there is more bad products
- * with this vendor.
+ * SandyBridge integrated graphics devices have a bug that prevents
+ * them from accessing certain memory ranges, namely anything below
+ * 1M and in the pages listed in bad_pages[] above.
+ *
+ * To avoid these pages ever being accessed by SNB gfx devices, reserve
+ * bad_pages that have not already been reserved at boot time.
+ * All memory below the 1 MB mark is anyway reserved later during
+ * setup_arch(), so there is no need to reserve it here.
*/
- {
- .callback = dmi_low_memory_corruption,
- .ident = "AMI BIOS",
- .matches = {
- DMI_MATCH(DMI_BOARD_NAME, "DG45ID"),
- },
- },
- {
- .callback = dmi_low_memory_corruption,
- .ident = "AMI BIOS",
- .matches = {
- DMI_MATCH(DMI_BOARD_NAME, "DG45FC"),
- },
- },
- /*
- * The Dell Inspiron Mini 1012 has DMI_BIOS_VENDOR = "Dell Inc.", so
- * match on the product name.
- */
- {
- .callback = dmi_low_memory_corruption,
- .ident = "Phoenix BIOS",
- .matches = {
- DMI_MATCH(DMI_PRODUCT_NAME, "Inspiron 1012"),
- },
- },
-#endif
- {}
-};
+
+ for (i = 0; i < ARRAY_SIZE(bad_pages); i++) {
+ if (memblock_reserve(bad_pages[i], PAGE_SIZE))
+ printk(KERN_WARNING "failed to reserve 0x%08lx\n",
+ bad_pages[i]);
+ }
+}
static void __init trim_bios_range(void)
{
@@ -697,15 +751,117 @@ static void __init trim_bios_range(void)
* A special case is the first 4Kb of memory;
* This is a BIOS owned area, not kernel ram, but generally
* not listed as such in the E820 table.
+ *
+ * This typically reserves additional memory (64KiB by default)
+ * since some BIOSes are known to corrupt low memory. See the
+ * Kconfig help text for X86_RESERVE_LOW.
*/
- e820_update_range(0, PAGE_SIZE, E820_RAM, E820_RESERVED);
+ e820__range_update(0, PAGE_SIZE, E820_TYPE_RAM, E820_TYPE_RESERVED);
+
/*
- * special case: Some BIOSen report the PC BIOS
- * area (640->1Mb) as ram even though it is not.
+ * special case: Some BIOSes report the PC BIOS
+ * area (640Kb -> 1Mb) as RAM even though it is not.
* take them out.
*/
- e820_remove_range(BIOS_BEGIN, BIOS_END - BIOS_BEGIN, E820_RAM, 1);
- sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
+ e820__range_remove(BIOS_BEGIN, BIOS_END - BIOS_BEGIN, E820_TYPE_RAM, 1);
+
+ e820__update_table(e820_table);
+}
+
+/* called before trim_bios_range() to spare extra sanitize */
+static void __init e820_add_kernel_range(void)
+{
+ u64 start = __pa_symbol(_text);
+ u64 size = __pa_symbol(_end) - start;
+
+ /*
+ * Complain if .text .data and .bss are not marked as E820_TYPE_RAM and
+ * attempt to fix it by adding the range. We may have a confused BIOS,
+ * or the user may have used memmap=exactmap or memmap=xxM$yyM to
+ * exclude kernel range. If we really are running on top non-RAM,
+ * we will crash later anyways.
+ */
+ if (e820__mapped_all(start, start + size, E820_TYPE_RAM))
+ return;
+
+ pr_warn(".text .data .bss are not marked as E820_TYPE_RAM!\n");
+ e820__range_remove(start, size, E820_TYPE_RAM, 0);
+ e820__range_add(start, size, E820_TYPE_RAM);
+}
+
+static void __init early_reserve_memory(void)
+{
+ /*
+ * Reserve the memory occupied by the kernel between _text and
+ * __end_of_kernel_reserve symbols. Any kernel sections after the
+ * __end_of_kernel_reserve symbol must be explicitly reserved with a
+ * separate memblock_reserve() or they will be discarded.
+ */
+ memblock_reserve_kern(__pa_symbol(_text),
+ (unsigned long)__end_of_kernel_reserve - (unsigned long)_text);
+
+ /*
+ * The first 4Kb of memory is a BIOS owned area, but generally it is
+ * not listed as such in the E820 table.
+ *
+ * Reserve the first 64K of memory since some BIOSes are known to
+ * corrupt low memory. After the real mode trampoline is allocated the
+ * rest of the memory below 640k is reserved.
+ *
+ * In addition, make sure page 0 is always reserved because on
+ * systems with L1TF its contents can be leaked to user processes.
+ */
+ memblock_reserve(0, SZ_64K);
+
+ early_reserve_initrd();
+
+ memblock_x86_reserve_range_setup_data();
+
+ reserve_bios_regions();
+ trim_snb_memory();
+}
+
+/*
+ * Dump out kernel offset information on panic.
+ */
+static int
+dump_kernel_offset(struct notifier_block *self, unsigned long v, void *p)
+{
+ if (kaslr_enabled()) {
+ pr_emerg("Kernel Offset: 0x%lx from 0x%lx (relocation range: 0x%lx-0x%lx)\n",
+ kaslr_offset(),
+ __START_KERNEL,
+ __START_KERNEL_map,
+ MODULES_VADDR-1);
+ } else {
+ pr_emerg("Kernel Offset: disabled\n");
+ }
+
+ return 0;
+}
+
+void x86_configure_nx(void)
+{
+ if (boot_cpu_has(X86_FEATURE_NX))
+ __supported_pte_mask |= _PAGE_NX;
+ else
+ __supported_pte_mask &= ~_PAGE_NX;
+}
+
+static void __init x86_report_nx(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_NX)) {
+ printk(KERN_NOTICE "Notice: NX (Execute Disable) protection "
+ "missing in CPU!\n");
+ } else {
+#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
+ printk(KERN_INFO "NX (Execute Disable) protection: active\n");
+#else
+ /* 32bit non-PAE kernel, NX cannot be used */
+ printk(KERN_NOTICE "Notice: NX (Execute Disable) protection "
+ "cannot be enabled: non-PAE kernel!\n");
+#endif
+ }
}
/*
@@ -723,163 +879,147 @@ static void __init trim_bios_range(void)
void __init setup_arch(char **cmdline_p)
{
- int acpi = 0;
- int k8 = 0;
-
#ifdef CONFIG_X86_32
memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
- visws_early_detect();
-#else
- printk(KERN_INFO "Command line: %s\n", boot_command_line);
-#endif
-
- /* VMI may relocate the fixmap; do this before touching ioremap area */
- vmi_init();
-
- early_trap_init();
- early_cpu_init();
- early_ioremap_init();
- ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
- screen_info = boot_params.screen_info;
- edid_info = boot_params.edid_info;
-#ifdef CONFIG_X86_32
- apm_info.bios = boot_params.apm_bios_info;
- ist_info = boot_params.ist_info;
- if (boot_params.sys_desc_table.length != 0) {
- set_mca_bus(boot_params.sys_desc_table.table[3] & 0x2);
- machine_id = boot_params.sys_desc_table.table[0];
- machine_submodel_id = boot_params.sys_desc_table.table[1];
- BIOS_revision = boot_params.sys_desc_table.table[2];
- }
-#endif
- saved_video_mode = boot_params.hdr.vid_mode;
- bootloader_type = boot_params.hdr.type_of_loader;
- if ((bootloader_type >> 4) == 0xe) {
- bootloader_type &= 0xf;
- bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4;
- }
- bootloader_version = bootloader_type & 0xf;
- bootloader_version |= boot_params.hdr.ext_loader_ver << 4;
+ /*
+ * copy kernel address range established so far and switch
+ * to the proper swapper page table
+ */
+ clone_pgd_range(swapper_pg_dir + KERNEL_PGD_BOUNDARY,
+ initial_page_table + KERNEL_PGD_BOUNDARY,
+ KERNEL_PGD_PTRS);
-#ifdef CONFIG_BLK_DEV_RAM
- rd_image_start = boot_params.hdr.ram_size & RAMDISK_IMAGE_START_MASK;
- rd_prompt = ((boot_params.hdr.ram_size & RAMDISK_PROMPT_FLAG) != 0);
- rd_doload = ((boot_params.hdr.ram_size & RAMDISK_LOAD_FLAG) != 0);
-#endif
-#ifdef CONFIG_EFI
- if (!strncmp((char *)&boot_params.efi_info.efi_loader_signature,
-#ifdef CONFIG_X86_32
- "EL32",
+ load_cr3(swapper_pg_dir);
+ /*
+ * Note: Quark X1000 CPUs advertise PGE incorrectly and require
+ * a cr3 based tlb flush, so the following __flush_tlb_all()
+ * will not flush anything because the CPU quirk which clears
+ * X86_FEATURE_PGE has not been invoked yet. Though due to the
+ * load_cr3() above the TLB has been flushed already. The
+ * quirk is invoked before subsequent calls to __flush_tlb_all()
+ * so proper operation is guaranteed.
+ */
+ __flush_tlb_all();
#else
- "EL64",
-#endif
- 4)) {
- efi_enabled = 1;
- efi_reserve_early();
- }
+ printk(KERN_INFO "Command line: %s\n", boot_command_line);
+ boot_cpu_data.x86_phys_bits = MAX_PHYSMEM_BITS;
#endif
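
clone_pgd_range() above is, in effect, a memcpy over top-level page-table
slots; a minimal sketch of its semantics (not necessarily the kernel's exact
definition):

    static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
    {
            /* copy 'count' PGD entries so dst maps the same kernel range */
            memcpy(dst, src, count * sizeof(pgd_t));
    }
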
- x86_init.oem.arch_setup();
-
- setup_memory_map();
- parse_setup_data();
- /* update the e820_saved too */
- e820_reserve_setup_data();
-
- copy_edd();
-
- if (!boot_params.hdr.root_flags)
- root_mountflags &= ~MS_RDONLY;
- init_mm.start_code = (unsigned long) _text;
- init_mm.end_code = (unsigned long) _etext;
- init_mm.end_data = (unsigned long) _edata;
- init_mm.brk = _brk_end;
-
- code_resource.start = virt_to_phys(_text);
- code_resource.end = virt_to_phys(_etext)-1;
- data_resource.start = virt_to_phys(_etext);
- data_resource.end = virt_to_phys(_edata)-1;
- bss_resource.start = virt_to_phys(&__bss_start);
- bss_resource.end = virt_to_phys(&__bss_stop)-1;
-
#ifdef CONFIG_CMDLINE_BOOL
#ifdef CONFIG_CMDLINE_OVERRIDE
- strlcpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE);
+ strscpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE);
#else
if (builtin_cmdline[0]) {
/* append boot loader cmdline to builtin */
strlcat(builtin_cmdline, " ", COMMAND_LINE_SIZE);
strlcat(builtin_cmdline, boot_command_line, COMMAND_LINE_SIZE);
- strlcpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE);
+ strscpy(boot_command_line, builtin_cmdline, COMMAND_LINE_SIZE);
}
#endif
+ builtin_cmdline_added = true;
#endif
- strlcpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
+ strscpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
*cmdline_p = command_line;
/*
+ * If we have OLPC OFW, we might end up relocating the fixmap due to
+ * reserve_top(), so do this before touching the ioremap area.
+ */
+ olpc_ofw_detect();
+
+ idt_setup_early_traps();
+ early_cpu_init();
+ jump_label_init();
+ static_call_init();
+ early_ioremap_init();
+
+ setup_olpc_ofw_pgd();
+
+ parse_boot_params();
+
+ x86_init.oem.arch_setup();
+
+ /*
+ * Do some memory reservations *before* memory is added to memblock, so
+ * memblock allocations won't overwrite it.
+ *
+ * After this point, everything still needed from the boot loader or
+ * firmware or kernel text should be early reserved or marked not RAM in
+ * e820. All other memory is free game.
+ *
+ * This call needs to happen before e820__memory_setup() which calls the
+ * xen_memory_setup() on Xen dom0 which relies on the fact that those
+ * early reservations have happened already.
+ */
+ early_reserve_memory();
+
+ iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
+ e820__memory_setup();
+ parse_setup_data();
+
+ copy_edd();
+
+ setup_initial_init_mm(_text, _etext, _edata, (void *)_brk_end);
+
+ /*
* x86_configure_nx() is called before parse_early_param() to detect
* whether hardware doesn't support NX (so that the early EHCI debug
- * console setup can safely call set_fixmap()). It may then be called
- * again from within noexec_setup() during parsing early parameters
- * to honor the respective command line option.
+ * console setup can safely call set_fixmap()).
*/
x86_configure_nx();
parse_early_param();
- x86_report_nx();
+ if (efi_enabled(EFI_BOOT))
+ efi_memblock_x86_reserve_range();
- /* Must be before kernel pagetables are setup */
- vmi_activate();
+ x86_report_nx();
- /* after early param, so could get panic from serial */
- reserve_early_setup_data();
+ apic_setup_apic_calls();
if (acpi_mps_check()) {
#ifdef CONFIG_X86_LOCAL_APIC
- disable_apic = 1;
+ apic_is_disabled = true;
#endif
setup_clear_cpu_cap(X86_FEATURE_APIC);
}
-#ifdef CONFIG_PCI
- if (pci_early_dump_regs)
- early_dump_pci_devices();
-#endif
-
- finish_e820_parsing();
+ e820__finish_early_params();
- if (efi_enabled)
+ if (efi_enabled(EFI_BOOT))
efi_init();
- dmi_scan_machine();
-
- dmi_check_system(bad_bios_dmi_table);
+ reserve_ibft_region();
+ x86_init.resources.dmi_setup();
/*
* VMware detection requires dmi to be available, so this
- * needs to be done after dmi_scan_machine, for the BP.
+ * needs to be done after dmi_setup(), for the boot CPU.
+ * For some guest types (Xen PV, SEV-SNP, TDX) it is required to be
+ * called before cache_bp_init() for setting up MTRR state.
*/
init_hypervisor_platform();
+ tsc_early_init();
x86_init.resources.probe_roms();
- /* after parse_early_param, so could debug it */
- insert_resource(&iomem_resource, &code_resource);
- insert_resource(&iomem_resource, &data_resource);
- insert_resource(&iomem_resource, &bss_resource);
+ /*
+ * Add resources for kernel text and data to the iomem_resource.
+ * Do it after parse_early_param, so it can be debugged.
+ */
+ setup_kernel_resources();
+ e820_add_kernel_range();
trim_bios_range();
#ifdef CONFIG_X86_32
if (ppro_with_ram_bug()) {
- e820_update_range(0x70000000ULL, 0x40000ULL, E820_RAM,
- E820_RESERVED);
- sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map);
+ e820__range_update(0x70000000ULL, 0x40000ULL, E820_TYPE_RAM,
+ E820_TYPE_RESERVED);
+ e820__update_table(e820_table);
printk(KERN_INFO "fixed physical RAM map:\n");
- e820_print_map("bad_ppro");
+ e820__print_table("bad_ppro");
}
#else
early_gart_iommu_check();
@@ -889,75 +1029,121 @@ void __init setup_arch(char **cmdline_p)
* partially used pages are not usable - thus
* we are rounding upwards:
*/
- max_pfn = e820_end_of_ram_pfn();
+ max_pfn = e820__end_of_ram_pfn();
- /* preallocate 4k for mptable mpc */
- early_reserve_e820_mpc_new();
/* update e820 for memory not covered by WB MTRRs */
- mtrr_bp_init();
+ cache_bp_init();
if (mtrr_trim_uncached_memory(max_pfn))
- max_pfn = e820_end_of_ram_pfn();
+ max_pfn = e820__end_of_ram_pfn();
+
+ max_possible_pfn = max_pfn;
+
+ /*
+ * Define random base addresses for memory sections after max_pfn is
+ * defined and before each memory section base is used.
+ */
+ kernel_randomize_memory();
#ifdef CONFIG_X86_32
/* max_low_pfn get updated here */
find_low_pfn_range();
#else
- num_physpages = max_pfn;
-
check_x2apic();
/* How many end-of-memory variables you have, grandma! */
/* need this before calling reserve_initrd */
if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
- max_low_pfn = e820_end_of_low_ram_pfn();
+ max_low_pfn = e820__end_of_low_ram_pfn();
else
max_low_pfn = max_pfn;
-
- high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
- max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
#endif
-#ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
- setup_bios_corruption_check();
-#endif
+ /* Find and reserve MPTABLE area */
+ x86_init.mpparse.find_mptable();
- printk(KERN_DEBUG "initial memory mapped : 0 - %08lx\n",
- max_pfn_mapped<<PAGE_SHIFT);
+ early_alloc_pgt_buf();
+ /*
+ * Conclude brk before e820__memblock_setup(): the latter may use
+ * memblock_find_in_range(), which could otherwise hand out memory
+ * that overlaps the brk area.
+ */
reserve_brk();
+ cleanup_highmap();
+
+ e820__memblock_setup();
+
/*
- * Find and reserve possible boot-time SMP configuration:
+ * Needs to run after memblock setup because it needs the physical
+ * memory size.
*/
- find_smp_config();
-
- reserve_ibft_region();
+ mem_encrypt_setup_arch();
+ cc_random_init();
- reserve_trampoline_memory();
+ efi_find_mirror();
+ efi_esrt_init();
+ efi_mokvar_table_init();
-#ifdef CONFIG_ACPI_SLEEP
/*
- * Reserve low memory region for sleep support.
- * even before init_memory_mapping
+ * The EFI specification says that boot service code won't be
+ * called after ExitBootServices(). This is, in fact, a lie.
*/
- acpi_reserve_wakeup_memory();
-#endif
- init_gbpages();
+ efi_reserve_boot_services();
- /* max_pfn_mapped is updated here */
- max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
- max_pfn_mapped = max_low_pfn_mapped;
+ /* preallocate 4k for mptable mpc */
+ e820__memblock_alloc_reserved_mpc_new();
-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- max_pfn_mapped = init_memory_mapping(1UL<<32,
- max_pfn<<PAGE_SHIFT);
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
- }
+#ifdef CONFIG_X86_CHECK_BIOS_CORRUPTION
+ setup_bios_corruption_check();
+#endif
+
+#ifdef CONFIG_X86_32
+ printk(KERN_DEBUG "initial memory mapped: [mem 0x00000000-%#010lx]\n",
+ (max_pfn_mapped<<PAGE_SHIFT) - 1);
#endif
/*
+ * Find free memory for the real mode trampoline and place it there. If
+ * there is not enough free memory under 1M, on EFI-enabled systems
+ * there will be an additional attempt to reclaim the memory for the
+ * real mode trampoline at efi_free_boot_services().
+ *
+ * Unconditionally reserve the entire first 1M of RAM because BIOSes
+ * are known to corrupt low memory, and a few hundred kilobytes are
+ * not worth the complexity of detecting exactly what gets clobbered.
+ * Windows does the same thing for very similar reasons.
+ *
+ * Moreover, on machines with SandyBridge graphics or in setups that use
+ * crashkernel, the entire 1M is reserved anyway.
+ *
+ * Note that TDX host kernels also require the first 1M to be reserved.
+ */
+ x86_platform.realmode_reserve();
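
A minimal sketch of the reservation a default realmode_reserve() hook performs
first, assuming the standard x86 implementation (helper name abbreviated):

    /* take the whole low 1M out of the free pool before memblock
     * allocations can land there; the real mode trampoline is then
     * carved out of this reserved region */
    static void __init reserve_low_1m_sketch(void)
    {
            memblock_reserve(0, SZ_1M);
    }
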
+
+ init_mem_mapping();
+
+ /*
+ * init_mem_mapping() relies on the early IDT page fault handling.
+ * Now either enable FRED or install the real page fault handler
+ * for 64-bit in the IDT.
+ */
+ cpu_init_replace_early_idt();
+
+ /*
+ * Update mmu_cr4_features (and, indirectly, trampoline_cr4_features)
+ * with the current CR4 value. This may not be necessary, but
+ * auditing all the early-boot CR4 manipulation would be needed to
+ * rule it out.
+ *
+ * Mask off features that don't work outside long mode (just
+ * PCIDE for now).
+ */
+ mmu_cr4_features = __read_cr4() & ~X86_CR4_PCIDE;
+
+ memblock_set_current_limit(get_max_mapped());
+
+ /*
* NOTE: On x86-32, only from this point on, fixmaps are ready for use.
*/
@@ -965,104 +1151,134 @@ void __init setup_arch(char **cmdline_p)
if (init_ohci1394_dma_early)
init_ohci1394_dma_on_all_controllers();
#endif
+ /* Allocate bigger log buffer */
+ setup_log_buf(1);
+
+ if (efi_enabled(EFI_BOOT)) {
+ switch (boot_params.secure_boot) {
+ case efi_secureboot_mode_disabled:
+ pr_info("Secure boot disabled\n");
+ break;
+ case efi_secureboot_mode_enabled:
+ pr_info("Secure boot enabled\n");
+ break;
+ default:
+ pr_info("Secure boot could not be determined\n");
+ break;
+ }
+ }
reserve_initrd();
- reserve_crashkernel();
+ acpi_table_upgrade();
+ /* Look for ACPI tables and reserve memory occupied by them. */
+ acpi_boot_table_init();
vsmp_init();
io_delay_init();
- /*
- * Parse the ACPI tables for possible boot-time SMP configuration.
- */
- acpi_boot_table_init();
+ early_platform_quirks();
+ /* Some platforms need the APIC registered for NUMA configuration */
early_acpi_boot_init();
+ x86_init.mpparse.early_parse_smp_cfg();
+
+ x86_flattree_get_config();
+
+ initmem_init();
+ dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
+
+ if (boot_cpu_has(X86_FEATURE_GBPAGES)) {
+ hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
+ hugetlb_bootmem_alloc();
+ }
-#ifdef CONFIG_ACPI_NUMA
/*
- * Parse SRAT to discover nodes.
+ * Reserve memory for crash kernel after SRAT is parsed so that it
+ * won't consume hotpluggable memory.
*/
- acpi = acpi_numa_init();
-#endif
+ arch_reserve_crashkernel();
-#ifdef CONFIG_K8_NUMA
- if (!acpi)
- k8 = !k8_numa_init(0, max_pfn);
-#endif
+ if (!early_xdbc_setup_hardware())
+ early_xdbc_register_console();
- initmem_init(0, max_pfn, acpi, k8);
-#ifndef CONFIG_NO_BOOTMEM
- early_res_to_bootmem(0, max_low_pfn<<PAGE_SHIFT);
-#endif
-
- dma32_reserve_bootmem();
+ x86_init.paging.pagetable_init();
-#ifdef CONFIG_KVM_CLOCK
- kvmclock_init();
-#endif
+ kasan_init();
- x86_init.paging.pagetable_setup_start(swapper_pg_dir);
- paging_init();
- x86_init.paging.pagetable_setup_done(swapper_pg_dir);
+ /*
+ * Sync back kernel address range.
+ *
+ * FIXME: Can the later sync in setup_cpu_entry_areas() replace
+ * this call?
+ */
+ sync_initial_page_table();
tboot_probe();
-#ifdef CONFIG_X86_64
map_vsyscall();
-#endif
- generic_apic_probe();
+ x86_32_probe_apic();
early_quirks();
+ topology_apply_cmdline_limits_early();
+
/*
- * Read APIC and some other early information from ACPI tables.
+ * Parse SMP configuration. Try ACPI first and then the platform
+ * specific parser.
*/
acpi_boot_init();
+ x86_init.mpparse.parse_smp_cfg();
- sfi_init();
-
- /*
- * get boot-time SMP configuration:
- */
- if (smp_found_config)
- get_smp_config();
+ /* Last opportunity to detect and map the local APIC */
+ init_apic_mappings();
- prefill_possible_map();
+ topology_init_possible_cpus();
-#ifdef CONFIG_X86_64
init_cpu_to_node();
-#endif
-
- init_apic_mappings();
- ioapic_init_mappings();
+ init_gi_nodes();
- /* need to wait for io_apic is mapped */
- probe_nr_irqs_gsi();
+ io_apic_init_mappings();
- kvm_guest_init();
+ x86_init.hyper.guest_late_init();
- e820_reserve_resources();
- e820_mark_nosave_regions(max_low_pfn);
+ e820__reserve_resources();
+ e820__register_nosave_regions(max_pfn);
x86_init.resources.reserve_resources();
- e820_setup_gap();
+ e820__setup_pci_gap();
#ifdef CONFIG_VT
#if defined(CONFIG_VGA_CONSOLE)
- if (!efi_enabled || (efi_mem_type(0xa0000) != EFI_CONVENTIONAL_MEMORY))
- conswitchp = &vga_con;
-#elif defined(CONFIG_DUMMY_CONSOLE)
- conswitchp = &dummy_con;
+ if (!efi_enabled(EFI_BOOT) || (efi_mem_type(0xa0000) != EFI_CONVENTIONAL_MEMORY))
+ vgacon_register_screen(&screen_info);
#endif
#endif
x86_init.oem.banner();
+ x86_init.timers.wallclock_init();
+
+ /*
+ * This needs to run before setup_local_APIC() which soft-disables the
+ * local APIC temporarily and that masks the thermal LVT interrupt,
+ * leading to softlockups on machines which have configured SMI
+ * interrupt delivery.
+ */
+ therm_lvt_init();
+
mcheck_init();
+
+ register_refined_jiffies(CLOCK_TICK_RATE);
+
+#ifdef CONFIG_EFI
+ if (efi_enabled(EFI_BOOT))
+ efi_apply_memmap_quirks();
+#endif
+
+ unwind_init();
}
#ifdef CONFIG_X86_32
@@ -1081,3 +1297,22 @@ void __init i386_reserve_resources(void)
}
#endif /* CONFIG_X86_32 */
+
+static struct notifier_block kernel_offset_notifier = {
+ .notifier_call = dump_kernel_offset
+};
+
+static int __init register_kernel_offset_dumper(void)
+{
+ atomic_notifier_chain_register(&panic_notifier_list,
+ &kernel_offset_notifier);
+ return 0;
+}
+__initcall(register_kernel_offset_dumper);
+
+#ifdef CONFIG_HOTPLUG_CPU
+bool arch_cpu_is_hotpluggable(int cpu)
+{
+ return cpu > 0;
+}
+#endif /* CONFIG_HOTPLUG_CPU */
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index a60df9ae6454..bfa48e7a32a2 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -1,17 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/kernel.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/init.h>
-#include <linux/bootmem.h>
+#include <linux/memblock.h>
#include <linux/percpu.h>
#include <linux/kexec.h>
#include <linux/crash_dump.h>
#include <linux/smp.h>
#include <linux/topology.h>
#include <linux/pfn.h>
+#include <linux/stackprotector.h>
#include <asm/sections.h>
#include <asm/processor.h>
+#include <asm/desc.h>
#include <asm/setup.h>
#include <asm/mpspec.h>
#include <asm/apicdef.h>
@@ -19,23 +22,14 @@
#include <asm/proto.h>
#include <asm/cpumask.h>
#include <asm/cpu.h>
-#include <asm/stackprotector.h>
-DEFINE_PER_CPU(int, cpu_number);
+DEFINE_PER_CPU_CACHE_HOT(int, cpu_number);
EXPORT_PER_CPU_SYMBOL(cpu_number);
-#ifdef CONFIG_X86_64
-#define BOOT_PERCPU_OFFSET ((unsigned long)__per_cpu_load)
-#else
-#define BOOT_PERCPU_OFFSET 0
-#endif
-
-DEFINE_PER_CPU(unsigned long, this_cpu_off) = BOOT_PERCPU_OFFSET;
+DEFINE_PER_CPU_CACHE_HOT(unsigned long, this_cpu_off);
EXPORT_PER_CPU_SYMBOL(this_cpu_off);
-unsigned long __per_cpu_offset[NR_CPUS] __read_mostly = {
- [0 ... NR_CPUS-1] = BOOT_PERCPU_OFFSET,
-};
+unsigned long __per_cpu_offset[NR_CPUS] __ro_after_init;
EXPORT_SYMBOL(__per_cpu_offset);
/*
@@ -64,7 +58,7 @@ EXPORT_SYMBOL(__per_cpu_offset);
*/
static bool __init pcpu_need_numa(void)
{
-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NUMA
pg_data_t *last = NULL;
unsigned int cpu;
@@ -82,67 +76,9 @@ static bool __init pcpu_need_numa(void)
}
#endif
-/**
- * pcpu_alloc_bootmem - NUMA friendly alloc_bootmem wrapper for percpu
- * @cpu: cpu to allocate for
- * @size: size allocation in bytes
- * @align: alignment
- *
- * Allocate @size bytes aligned at @align for cpu @cpu. This wrapper
- * does the right thing for NUMA regardless of the current
- * configuration.
- *
- * RETURNS:
- * Pointer to the allocated area on success, NULL on failure.
- */
-static void * __init pcpu_alloc_bootmem(unsigned int cpu, unsigned long size,
- unsigned long align)
-{
- const unsigned long goal = __pa(MAX_DMA_ADDRESS);
-#ifdef CONFIG_NEED_MULTIPLE_NODES
- int node = early_cpu_to_node(cpu);
- void *ptr;
-
- if (!node_online(node) || !NODE_DATA(node)) {
- ptr = __alloc_bootmem_nopanic(size, align, goal);
- pr_info("cpu %d has no node %d or node-local memory\n",
- cpu, node);
- pr_debug("per cpu data for cpu%d %lu bytes at %016lx\n",
- cpu, size, __pa(ptr));
- } else {
- ptr = __alloc_bootmem_node_nopanic(NODE_DATA(node),
- size, align, goal);
- pr_debug("per cpu data for cpu%d %lu bytes on node%d at %016lx\n",
- cpu, size, node, __pa(ptr));
- }
- return ptr;
-#else
- return __alloc_bootmem_nopanic(size, align, goal);
-#endif
-}
-
-/*
- * Helpers for first chunk memory allocation
- */
-static void * __init pcpu_fc_alloc(unsigned int cpu, size_t size, size_t align)
-{
- return pcpu_alloc_bootmem(cpu, size, align);
-}
-
-static void __init pcpu_fc_free(void *ptr, size_t size)
-{
-#ifdef CONFIG_NO_BOOTMEM
- u64 start = __pa(ptr);
- u64 end = start + size;
- free_early_partial(start, end);
-#else
- free_bootmem(__pa(ptr), size);
-#endif
-}
-
static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
{
-#ifdef CONFIG_NEED_MULTIPLE_NODES
+#ifdef CONFIG_NUMA
if (early_cpu_to_node(from) == early_cpu_to_node(to))
return LOCAL_DISTANCE;
else
@@ -152,7 +88,12 @@ static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
#endif
}
-static void __init pcpup_populate_pte(unsigned long addr)
+static int __init pcpu_cpu_to_node(int cpu)
+{
+ return early_cpu_to_node(cpu);
+}
+
+void __init pcpu_populate_pte(unsigned long addr)
{
populate_extra_pte(addr);
}
@@ -160,13 +101,10 @@ static void __init pcpup_populate_pte(unsigned long addr)
static inline void setup_percpu_segment(int cpu)
{
#ifdef CONFIG_X86_32
- struct desc_struct gdt;
+ struct desc_struct d = GDT_ENTRY_INIT(DESC_DATA32,
+ per_cpu_offset(cpu), 0xFFFFF);
- pack_descriptor(&gdt, per_cpu_offset(cpu), 0xFFFFF,
- 0x2 | DESCTYPE_S, 0x8);
- gdt.s = 1;
- write_gdt_entry(get_cpu_gdt_table(cpu),
- GDT_ENTRY_PERCPU, &gdt, DESCTYPE_S);
+ write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_PERCPU, &d, DESCTYPE_S);
#endif
}
@@ -176,7 +114,7 @@ void __init setup_per_cpu_areas(void)
unsigned long delta;
int rc;
- pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
+ pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%u nr_node_ids:%u\n",
NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);
/*
@@ -191,22 +129,33 @@ void __init setup_per_cpu_areas(void)
#endif
rc = -EINVAL;
if (pcpu_chosen_fc != PCPU_FC_PAGE) {
- const size_t atom_size = cpu_has_pse ? PMD_SIZE : PAGE_SIZE;
const size_t dyn_size = PERCPU_MODULE_RESERVE +
PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
+ size_t atom_size;
+ /*
+ * On 64bit, use PMD_SIZE for atom_size so that embedded
+ * percpu areas are aligned to PMD. This, in the future,
+ * can also allow using PMD mappings in vmalloc area. Use
+ * PAGE_SIZE on 32bit as vmalloc space is highly contended
+ * and large vmalloc area allocs can easily fail.
+ */
+#ifdef CONFIG_X86_64
+ atom_size = PMD_SIZE;
+#else
+ atom_size = PAGE_SIZE;
+#endif
rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
dyn_size, atom_size,
pcpu_cpu_distance,
- pcpu_fc_alloc, pcpu_fc_free);
+ pcpu_cpu_to_node);
if (rc < 0)
- pr_warning("%s allocator failed (%d), falling back to page size\n",
- pcpu_fc_names[pcpu_chosen_fc], rc);
+ pr_warn("%s allocator failed (%d), falling back to page size\n",
+ pcpu_fc_names[pcpu_chosen_fc], rc);
}
if (rc < 0)
rc = pcpu_page_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
- pcpu_fc_alloc, pcpu_fc_free,
- pcpup_populate_pte);
+ pcpu_cpu_to_node);
if (rc < 0)
panic("cannot initialize percpu area (err=%d)", rc);
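
Concretely, with 4K pages the embed path above uses atom_size = PMD_SIZE
= 2 MiB on x86-64 and atom_size = PAGE_SIZE = 4 KiB on 32-bit. If
pcpu_embed_first_chunk() fails (for example because NUMA nodes are too far
apart for the chosen atom size), pcpu_page_first_chunk() builds the first
chunk page by page through pcpu_populate_pte() instead.
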
@@ -217,7 +166,6 @@ void __init setup_per_cpu_areas(void)
per_cpu(this_cpu_off, cpu) = per_cpu_offset(cpu);
per_cpu(cpu_number, cpu) = cpu;
setup_percpu_segment(cpu);
- setup_stack_canary_segment(cpu);
/*
* Copy data used in early init routines from the
* initial arrays to the per cpu data areas. These
@@ -228,13 +176,9 @@ void __init setup_per_cpu_areas(void)
#ifdef CONFIG_X86_LOCAL_APIC
per_cpu(x86_cpu_to_apicid, cpu) =
early_per_cpu_map(x86_cpu_to_apicid, cpu);
- per_cpu(x86_bios_cpu_apicid, cpu) =
- early_per_cpu_map(x86_bios_cpu_apicid, cpu);
+ per_cpu(x86_cpu_to_acpiid, cpu) =
+ early_per_cpu_map(x86_cpu_to_acpiid, cpu);
#endif
-#ifdef CONFIG_X86_64
- per_cpu(irq_stack_ptr, cpu) =
- per_cpu(irq_stack_union.irq_stack, cpu) +
- IRQ_STACK_SIZE - 64;
#ifdef CONFIG_NUMA
per_cpu(x86_cpu_to_node_map, cpu) =
early_per_cpu_map(x86_cpu_to_node_map, cpu);
@@ -248,21 +192,20 @@ void __init setup_per_cpu_areas(void)
*/
set_cpu_numa_node(cpu, early_cpu_to_node(cpu));
#endif
-#endif
/*
* Up to this point, the boot CPU has been using .init.data
* area. Reload any changed state for the boot CPU.
*/
- if (cpu == boot_cpu_id)
- switch_to_new_gdt(cpu);
+ if (!cpu)
+ switch_gdt_and_percpu_base(cpu);
}
/* indicate the early static arrays will soon be gone */
#ifdef CONFIG_X86_LOCAL_APIC
early_per_cpu_ptr(x86_cpu_to_apicid) = NULL;
- early_per_cpu_ptr(x86_bios_cpu_apicid) = NULL;
+ early_per_cpu_ptr(x86_cpu_to_acpiid) = NULL;
#endif
-#if defined(CONFIG_X86_64) && defined(CONFIG_NUMA)
+#ifdef CONFIG_NUMA
early_per_cpu_ptr(x86_cpu_to_node_map) = NULL;
#endif
@@ -271,4 +214,16 @@ void __init setup_per_cpu_areas(void)
/* Setup cpu initialized, callin, callout masks */
setup_cpu_local_masks();
+
+ /*
+ * Sync back kernel address range again. We already did this in
+ * setup_arch(), but percpu data also needs to be available in
+ * the smpboot asm and arch_sync_kernel_mappings() doesn't sync to
+ * swapper_pg_dir on 32-bit. The per-cpu mappings need to be available
+ * there too.
+ *
+ * FIXME: Can the later sync in setup_cpu_entry_areas() replace
+ * this call?
+ */
+ sync_initial_page_table();
}
diff --git a/arch/x86/kernel/sev_verify_cbit.S b/arch/x86/kernel/sev_verify_cbit.S
new file mode 100644
index 000000000000..1ab65f6c6ae7
--- /dev/null
+++ b/arch/x86/kernel/sev_verify_cbit.S
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * sev_verify_cbit.S - Code for verification of the C-bit position reported
+ * by the Hypervisor when running with SEV enabled.
+ *
+ * Copyright (c) 2020 Joerg Roedel (jroedel@suse.de)
+ *
+ * sev_verify_cbit() is called before switching to a new long-mode page-table
+ * at boot.
+ *
+ * Verify that the C-bit position is correct by writing a random value to
+ * an encrypted memory location while on the current page-table. Then it
+ * switches to the new page-table to verify the memory content is still the
+ * same. After that it switches back to the current page-table and when the
+ * check succeeded it returns. If the check failed the code invalidates the
+ * stack pointer and goes into a hlt loop. The stack-pointer is invalidated to
+ * make sure no interrupt or exception can get the CPU out of the hlt loop.
+ *
+ * New page-table pointer is expected in %rdi (first parameter)
+ *
+ */
+#include <linux/linkage.h>
+#include <asm/processor-flags.h>
+
+ .text
+ .code64
+SYM_FUNC_START(sev_verify_cbit)
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+ /* First check if a C-bit was detected */
+ movq sme_me_mask(%rip), %rsi
+ testq %rsi, %rsi
+ jz 3f
+
+ /* sme_me_mask != 0 could mean SME or SEV - Check also for SEV */
+ movq sev_status(%rip), %rsi
+ testq %rsi, %rsi
+ jz 3f
+
+ /* Save CR4 in %rsi */
+ movq %cr4, %rsi
+
+ /* Disable Global Pages */
+ movq %rsi, %rdx
+ andq $(~X86_CR4_PGE), %rdx
+ movq %rdx, %cr4
+
+ /*
+ * Verified that running under SEV - now get a random value using
+ * RDRAND. This instruction is mandatory when running as an SEV guest.
+ *
+ * Don't bail out of the loop if RDRAND returns errors. It is better to
+ * prevent forward progress than to work with a non-random value here.
+ */
+1: rdrand %rdx
+ jnc 1b
+
+ /* Store value to memory and keep it in %rdx */
+ movq %rdx, sev_check_data(%rip)
+
+ /* Backup current %cr3 value to restore it later */
+ movq %cr3, %rcx
+
+ /* Switch to new %cr3 - This might unmap the stack */
+ movq %rdi, %cr3
+
+ /*
+ * Compare value in %rdx with memory location. If C-bit is incorrect
+ * this would read the encrypted data and make the check fail.
+ */
+ cmpq %rdx, sev_check_data(%rip)
+
+ /* Restore old %cr3 */
+ movq %rcx, %cr3
+
+ /* Restore previous CR4 */
+ movq %rsi, %cr4
+
+ /* Check CMPQ result */
+ je 3f
+
+ /*
+ * The check failed. To block any forward progress (and with it any
+ * ROP attack surface), invalidate the stack and go into a hlt loop.
+ */
+ xorq %rsp, %rsp
+ subq $0x1000, %rsp
+2: hlt
+ jmp 2b
+3:
+#endif
+ /* Return page-table pointer */
+ movq %rdi, %rax
+ RET
+SYM_FUNC_END(sev_verify_cbit)
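
The control flow above, restated as C-like pseudocode (illustrative only;
rdrand64(), read_cr3(), write_cr3() and hlt_forever() are stand-in helpers,
and the CR4.PGE save/restore dance is omitted):

    unsigned long sev_verify_cbit_sketch(unsigned long new_pgtable)
    {
            unsigned long val, old_cr3;
            bool ok;

            if (!sme_me_mask || !sev_status)        /* SME-only or no encryption */
                    return new_pgtable;

            while (!rdrand64(&val))                 /* retry until RDRAND succeeds */
                    ;
            sev_check_data = val;                   /* write via the current page table */

            old_cr3 = read_cr3();
            write_cr3(new_pgtable);                 /* may unmap the stack */
            ok = (sev_check_data == val);           /* wrong C-bit => garbled read */
            write_cr3(old_cr3);

            if (!ok)
                    hlt_forever();                  /* stack invalidated, no return */
            return new_pgtable;
    }
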
diff --git a/arch/x86/kernel/sfi.c b/arch/x86/kernel/sfi.c
deleted file mode 100644
index cb22acf3ed09..000000000000
--- a/arch/x86/kernel/sfi.c
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * sfi.c - x86 architecture SFI support.
- *
- * Copyright (c) 2009, Intel Corporation.
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program; if not, write to the Free Software Foundation, Inc.,
- * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
- *
- */
-
-#define KMSG_COMPONENT "SFI"
-#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
-
-#include <linux/acpi.h>
-#include <linux/init.h>
-#include <linux/sfi.h>
-#include <linux/io.h>
-
-#include <asm/io_apic.h>
-#include <asm/mpspec.h>
-#include <asm/setup.h>
-#include <asm/apic.h>
-
-#ifdef CONFIG_X86_LOCAL_APIC
-static unsigned long sfi_lapic_addr __initdata = APIC_DEFAULT_PHYS_BASE;
-
-void __init mp_sfi_register_lapic_address(unsigned long address)
-{
- mp_lapic_addr = address;
-
- set_fixmap_nocache(FIX_APIC_BASE, mp_lapic_addr);
- if (boot_cpu_physical_apicid == -1U)
- boot_cpu_physical_apicid = read_apic_id();
-
- pr_info("Boot CPU = %d\n", boot_cpu_physical_apicid);
-}
-
-/* All CPUs enumerated by SFI must be present and enabled */
-void __cpuinit mp_sfi_register_lapic(u8 id)
-{
- if (MAX_APICS - id <= 0) {
- pr_warning("Processor #%d invalid (max %d)\n",
- id, MAX_APICS);
- return;
- }
-
- pr_info("registering lapic[%d]\n", id);
-
- generic_processor_info(id, GET_APIC_VERSION(apic_read(APIC_LVR)));
-}
-
-static int __init sfi_parse_cpus(struct sfi_table_header *table)
-{
- struct sfi_table_simple *sb;
- struct sfi_cpu_table_entry *pentry;
- int i;
- int cpu_num;
-
- sb = (struct sfi_table_simple *)table;
- cpu_num = SFI_GET_NUM_ENTRIES(sb, struct sfi_cpu_table_entry);
- pentry = (struct sfi_cpu_table_entry *)sb->pentry;
-
- for (i = 0; i < cpu_num; i++) {
- mp_sfi_register_lapic(pentry->apic_id);
- pentry++;
- }
-
- smp_found_config = 1;
- return 0;
-}
-#endif /* CONFIG_X86_LOCAL_APIC */
-
-#ifdef CONFIG_X86_IO_APIC
-
-static int __init sfi_parse_ioapic(struct sfi_table_header *table)
-{
- struct sfi_table_simple *sb;
- struct sfi_apic_table_entry *pentry;
- int i, num;
-
- sb = (struct sfi_table_simple *)table;
- num = SFI_GET_NUM_ENTRIES(sb, struct sfi_apic_table_entry);
- pentry = (struct sfi_apic_table_entry *)sb->pentry;
-
- for (i = 0; i < num; i++) {
- mp_register_ioapic(i, pentry->phys_addr, gsi_top);
- pentry++;
- }
-
- WARN(pic_mode, KERN_WARNING
- "SFI: pic_mod shouldn't be 1 when IOAPIC table is present\n");
- pic_mode = 0;
- return 0;
-}
-#endif /* CONFIG_X86_IO_APIC */
-
-/*
- * sfi_platform_init(): register lapics & io-apics
- */
-int __init sfi_platform_init(void)
-{
-#ifdef CONFIG_X86_LOCAL_APIC
- mp_sfi_register_lapic_address(sfi_lapic_addr);
- sfi_table_parse(SFI_SIG_CPUS, NULL, NULL, sfi_parse_cpus);
-#endif
-#ifdef CONFIG_X86_IO_APIC
- sfi_table_parse(SFI_SIG_APIC, NULL, NULL, sfi_parse_ioapic);
-#endif
- return 0;
-}
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
new file mode 100644
index 000000000000..978232b6d48d
--- /dev/null
+++ b/arch/x86/kernel/shstk.c
@@ -0,0 +1,635 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * shstk.c - Intel shadow stack support
+ *
+ * Copyright (c) 2021, Intel Corporation.
+ * Yu-cheng Yu <yu-cheng.yu@intel.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/bitops.h>
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/sched/signal.h>
+#include <linux/compat.h>
+#include <linux/sizes.h>
+#include <linux/user.h>
+#include <linux/syscalls.h>
+#include <asm/msr.h>
+#include <asm/fpu/xstate.h>
+#include <asm/fpu/types.h>
+#include <asm/shstk.h>
+#include <asm/special_insns.h>
+#include <asm/fpu/api.h>
+#include <asm/prctl.h>
+
+#define SS_FRAME_SIZE 8
+
+static bool features_enabled(unsigned long features)
+{
+ return current->thread.features & features;
+}
+
+static void features_set(unsigned long features)
+{
+ current->thread.features |= features;
+}
+
+static void features_clr(unsigned long features)
+{
+ current->thread.features &= ~features;
+}
+
+/*
+ * Create a restore token on the shadow stack. A token is always 8-byte
+ * and aligned to 8.
+ */
+static int create_rstor_token(unsigned long ssp, unsigned long *token_addr)
+{
+ unsigned long addr;
+
+ /* Token must be aligned */
+ if (!IS_ALIGNED(ssp, 8))
+ return -EINVAL;
+
+ addr = ssp - SS_FRAME_SIZE;
+
+ /*
+ * SSP is aligned, so the reserved bits and the mode bit are zero. Just
+ * mark the token 64-bit.
+ */
+ ssp |= BIT(0);
+
+ if (write_user_shstk_64((u64 __user *)addr, (u64)ssp))
+ return -EFAULT;
+
+ if (token_addr)
+ *token_addr = addr;
+
+ return 0;
+}
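
Decoding a token produced this way is straightforward; rstor_token_to_ssp()
is a hypothetical helper, shown only to make the layout explicit:

    static inline unsigned long rstor_token_to_ssp(unsigned long token)
    {
            /* bit 0 is the 64-bit mode flag; bits 1-2 are zero because
             * the encoded SSP was 8-byte aligned */
            return token & ~7UL;
    }
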
+
+/*
+ * VM_SHADOW_STACK will have a guard page. This helps userspace protect
+ * itself from attacks. The reasoning is as follows:
+ *
+ * The shadow stack pointer (SSP) is moved by CALL, RET, and INCSSPQ. The
+ * INCSSP instruction can increment the shadow stack pointer. It is the
+ * shadow stack analog of an instruction like:
+ *
+ * addq $0x80, %rsp
+ *
+ * However, there is one important difference between an ADD on %rsp
+ * and INCSSP. In addition to modifying SSP, INCSSP also reads from the
+ * memory of the first and last elements that were "popped". It can be
+ * thought of as acting like this:
+ *
+ * READ_ONCE(ssp); // read+discard top element on stack
+ * ssp += nr_to_pop * 8; // move the shadow stack
+ * READ_ONCE(ssp-8); // read+discard last popped stack element
+ *
+ * The maximum distance INCSSP can move the SSP is 2040 bytes, before
+ * it would read the memory. Therefore a single page gap will be enough
+ * to prevent any operation from shifting the SSP to an adjacent stack,
+ * since it would have to land in the gap at least once, causing a
+ * fault.
+ */
+static unsigned long alloc_shstk(unsigned long addr, unsigned long size,
+ unsigned long token_offset, bool set_res_tok)
+{
+ int flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G;
+ struct mm_struct *mm = current->mm;
+ unsigned long mapped_addr, unused;
+
+ if (addr)
+ flags |= MAP_FIXED_NOREPLACE;
+
+ mmap_write_lock(mm);
+ mapped_addr = do_mmap(NULL, addr, size, PROT_READ, flags,
+ VM_SHADOW_STACK | VM_WRITE, 0, &unused, NULL);
+ mmap_write_unlock(mm);
+
+ if (!set_res_tok || IS_ERR_VALUE(mapped_addr))
+ goto out;
+
+ if (create_rstor_token(mapped_addr + token_offset, NULL)) {
+ vm_munmap(mapped_addr, size);
+ return -EINVAL;
+ }
+
+out:
+ return mapped_addr;
+}
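
The arithmetic behind the one-page guard described above: INCSSP can move the
SSP by at most 2040 bytes, i.e. 255 entries of 8 bytes (the 255-entry limit is
assumed here from the 2040-byte figure), so a stray read must land inside a
4K gap. A compile-time restatement (illustrative, not kernel code):

    #define SHSTK_ENTRY_SIZE   8
    #define INCSSP_MAX_ENTRIES 255  /* assumed from the 2040-byte bound */

    _Static_assert(INCSSP_MAX_ENTRIES * SHSTK_ENTRY_SIZE == 2040,
                   "max INCSSP move");
    _Static_assert(INCSSP_MAX_ENTRIES * SHSTK_ENTRY_SIZE < 4096,
                   "one 4K guard page always catches the stray read");
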
+
+static unsigned long adjust_shstk_size(unsigned long size)
+{
+ if (size)
+ return PAGE_ALIGN(size);
+
+ return PAGE_ALIGN(min_t(unsigned long long, rlimit(RLIMIT_STACK), SZ_4G));
+}
+
+static void unmap_shadow_stack(u64 base, u64 size)
+{
+ int r;
+
+ r = vm_munmap(base, size);
+
+ /*
+ * mmap_write_lock_killable() failed with -EINTR. This means
+ * the process is about to die and have its MM cleaned up.
+ * This task shouldn't ever make it back to userspace. In this
+ * case it is ok to leak a shadow stack, so just exit out.
+ */
+ if (r == -EINTR)
+ return;
+
+ /*
+ * For all other types of vm_munmap() failure, either the
+ * system is out of memory or there is a bug.
+ */
+ WARN_ON_ONCE(r);
+}
+
+static int shstk_setup(void)
+{
+ struct thread_shstk *shstk = &current->thread.shstk;
+ unsigned long addr, size;
+
+ /* Already enabled */
+ if (features_enabled(ARCH_SHSTK_SHSTK))
+ return 0;
+
+ /* Also not supported for 32 bit */
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) || in_ia32_syscall())
+ return -EOPNOTSUPP;
+
+ size = adjust_shstk_size(0);
+ addr = alloc_shstk(0, size, 0, false);
+ if (IS_ERR_VALUE(addr))
+ return PTR_ERR((void *)addr);
+
+ fpregs_lock_and_load();
+ wrmsrq(MSR_IA32_PL3_SSP, addr + size);
+ wrmsrq(MSR_IA32_U_CET, CET_SHSTK_EN);
+ fpregs_unlock();
+
+ shstk->base = addr;
+ shstk->size = size;
+ features_set(ARCH_SHSTK_SHSTK);
+
+ return 0;
+}
+
+void reset_thread_features(void)
+{
+ memset(&current->thread.shstk, 0, sizeof(struct thread_shstk));
+ current->thread.features = 0;
+ current->thread.features_locked = 0;
+}
+
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, u64 clone_flags,
+ unsigned long stack_size)
+{
+ struct thread_shstk *shstk = &tsk->thread.shstk;
+ unsigned long addr, size;
+
+ /*
+ * If shadow stack is not enabled on the new thread, skip any
+ * switch to a new shadow stack.
+ */
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return 0;
+
+ /*
+ * For CLONE_VFORK the child will share the parent's shadow stack.
+ * Make sure to clear the internal tracking of the thread shadow
+ * stack so the freeing logic that runs for the child knows to leave
+ * it alone.
+ */
+ if (clone_flags & CLONE_VFORK) {
+ shstk->base = 0;
+ shstk->size = 0;
+ return 0;
+ }
+
+ /*
+ * For !CLONE_VM the child will use a copy of the parent's shadow
+ * stack.
+ */
+ if (!(clone_flags & CLONE_VM))
+ return 0;
+
+ size = adjust_shstk_size(stack_size);
+ addr = alloc_shstk(0, size, 0, false);
+ if (IS_ERR_VALUE(addr))
+ return addr;
+
+ shstk->base = addr;
+ shstk->size = size;
+
+ return addr + size;
+}
+
+static unsigned long get_user_shstk_addr(void)
+{
+ unsigned long long ssp;
+
+ fpregs_lock_and_load();
+
+ rdmsrq(MSR_IA32_PL3_SSP, ssp);
+
+ fpregs_unlock();
+
+ return ssp;
+}
+
+int shstk_pop(u64 *val)
+{
+ int ret = 0;
+ u64 ssp;
+
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return -ENOTSUPP;
+
+ fpregs_lock_and_load();
+
+ rdmsrq(MSR_IA32_PL3_SSP, ssp);
+ if (val && get_user(*val, (__user u64 *)ssp))
+ ret = -EFAULT;
+ else
+ wrmsrq(MSR_IA32_PL3_SSP, ssp + SS_FRAME_SIZE);
+ fpregs_unlock();
+
+ return ret;
+}
+
+int shstk_push(u64 val)
+{
+ u64 ssp;
+ int ret;
+
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return -ENOTSUPP;
+
+ fpregs_lock_and_load();
+
+ rdmsrq(MSR_IA32_PL3_SSP, ssp);
+ ssp -= SS_FRAME_SIZE;
+ ret = write_user_shstk_64((__user void *)ssp, val);
+ if (!ret)
+ wrmsrq(MSR_IA32_PL3_SSP, ssp);
+ fpregs_unlock();
+
+ return ret;
+}
+
+#define SHSTK_DATA_BIT BIT(63)
+
+static int put_shstk_data(u64 __user *addr, u64 data)
+{
+ if (WARN_ON_ONCE(data & SHSTK_DATA_BIT))
+ return -EINVAL;
+
+ /*
+ * Mark the high bit so that the sigframe can't be processed as a
+ * return address.
+ */
+ if (write_user_shstk_64(addr, data | SHSTK_DATA_BIT))
+ return -EFAULT;
+ return 0;
+}
+
+static int get_shstk_data(unsigned long *data, unsigned long __user *addr)
+{
+ unsigned long ldata;
+
+ if (unlikely(get_user(ldata, addr)))
+ return -EFAULT;
+
+ if (!(ldata & SHSTK_DATA_BIT))
+ return -EINVAL;
+
+ *data = ldata & ~SHSTK_DATA_BIT;
+
+ return 0;
+}
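
The two helpers above form a simple round trip on bit 63; a self-contained
sketch of the invariant (illustrative, with hypothetical helper names):

    #include <stdint.h>
    #define SHSTK_DATA_BIT (1ULL << 63)

    static inline uint64_t shstk_mark(uint64_t data)  { return data | SHSTK_DATA_BIT;  }
    static inline uint64_t shstk_unmark(uint64_t raw) { return raw & ~SHSTK_DATA_BIT; }
    /* a marked slot can never be consumed as a return address, since
     * canonical user-space code addresses never have bit 63 set */
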
+
+static int shstk_push_sigframe(unsigned long *ssp)
+{
+ unsigned long target_ssp = *ssp;
+
+ /* Token must be aligned */
+ if (!IS_ALIGNED(target_ssp, 8))
+ return -EINVAL;
+
+ *ssp -= SS_FRAME_SIZE;
+ if (put_shstk_data((void __user *)*ssp, target_ssp))
+ return -EFAULT;
+
+ return 0;
+}
+
+static int shstk_pop_sigframe(unsigned long *ssp)
+{
+ struct vm_area_struct *vma;
+ unsigned long token_addr;
+ bool need_to_check_vma;
+ int err = 1;
+
+ /*
+ * It is possible for the SSP to be off the end of a shadow stack by 4
+ * or 8 bytes. If the SSP sits at the start of a page, or 4 bytes before
+ * one, this might be such a case, so check that the address being read
+ * actually is shadow stack.
+ */
+ if (!IS_ALIGNED(*ssp, 8))
+ return -EINVAL;
+
+ need_to_check_vma = PAGE_ALIGN(*ssp) == *ssp;
+
+ if (need_to_check_vma)
+ mmap_read_lock_killable(current->mm);
+
+ err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
+ if (unlikely(err))
+ goto out_err;
+
+ if (need_to_check_vma) {
+ vma = find_vma(current->mm, *ssp);
+ if (!vma || !(vma->vm_flags & VM_SHADOW_STACK)) {
+ err = -EFAULT;
+ goto out_err;
+ }
+
+ mmap_read_unlock(current->mm);
+ }
+
+ /* Restore SSP aligned? */
+ if (unlikely(!IS_ALIGNED(token_addr, 8)))
+ return -EINVAL;
+
+ /* SSP in userspace? */
+ if (unlikely(token_addr >= TASK_SIZE_MAX))
+ return -EINVAL;
+
+ *ssp = token_addr;
+
+ return 0;
+out_err:
+ if (need_to_check_vma)
+ mmap_read_unlock(current->mm);
+ return err;
+}
+
+int setup_signal_shadow_stack(struct ksignal *ksig)
+{
+ void __user *restorer = ksig->ka.sa.sa_restorer;
+ unsigned long ssp;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+ !features_enabled(ARCH_SHSTK_SHSTK))
+ return 0;
+
+ if (!restorer)
+ return -EINVAL;
+
+ ssp = get_user_shstk_addr();
+ if (unlikely(!ssp))
+ return -EINVAL;
+
+ err = shstk_push_sigframe(&ssp);
+ if (unlikely(err))
+ return err;
+
+ /* Push restorer address */
+ ssp -= SS_FRAME_SIZE;
+ err = write_user_shstk_64((u64 __user *)ssp, (u64)restorer);
+ if (unlikely(err))
+ return -EFAULT;
+
+ fpregs_lock_and_load();
+ wrmsrq(MSR_IA32_PL3_SSP, ssp);
+ fpregs_unlock();
+
+ return 0;
+}
+
+int restore_signal_shadow_stack(void)
+{
+ unsigned long ssp;
+ int err;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+ !features_enabled(ARCH_SHSTK_SHSTK))
+ return 0;
+
+ ssp = get_user_shstk_addr();
+ if (unlikely(!ssp))
+ return -EINVAL;
+
+ err = shstk_pop_sigframe(&ssp);
+ if (unlikely(err))
+ return err;
+
+ fpregs_lock_and_load();
+ wrmsrq(MSR_IA32_PL3_SSP, ssp);
+ fpregs_unlock();
+
+ return 0;
+}
+
+void shstk_free(struct task_struct *tsk)
+{
+ struct thread_shstk *shstk = &tsk->thread.shstk;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK) ||
+ !features_enabled(ARCH_SHSTK_SHSTK))
+ return;
+
+ /*
+ * When fork() with CLONE_VM fails, the child (tsk) already has a
+ * shadow stack allocated, and exit_thread() calls this function to
+ * free it. In this case the parent (current) and the child share
+ * the same mm struct.
+ */
+ if (!tsk->mm || tsk->mm != current->mm)
+ return;
+
+ /*
+ * If shstk->base is NULL, then this task is not managing its
+ * own shadow stack (CLONE_VFORK). So skip freeing it.
+ */
+ if (!shstk->base)
+ return;
+
+ /*
+ * shstk->base being NULL is normal for CLONE_VFORK child tasks. But
+ * size == 0 with a non-NULL shstk->base is not normal and indicates
+ * an attempt to free the thread shadow stack twice.
+ * Warn about it.
+ */
+ if (WARN_ON(!shstk->size))
+ return;
+
+ unmap_shadow_stack(shstk->base, shstk->size);
+
+ shstk->size = 0;
+}
+
+static int wrss_control(bool enable)
+{
+ u64 msrval;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return -EOPNOTSUPP;
+
+ /*
+ * Only enable WRSS if shadow stack is enabled. If shadow stack is not
+ * enabled, WRSS will already be disabled, so don't bother clearing it
+ * when disabling.
+ */
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return -EPERM;
+
+ /* Already enabled/disabled? */
+ if (features_enabled(ARCH_SHSTK_WRSS) == enable)
+ return 0;
+
+ fpregs_lock_and_load();
+ rdmsrq(MSR_IA32_U_CET, msrval);
+
+ if (enable) {
+ features_set(ARCH_SHSTK_WRSS);
+ msrval |= CET_WRSS_EN;
+ } else {
+ features_clr(ARCH_SHSTK_WRSS);
+ if (!(msrval & CET_WRSS_EN))
+ goto unlock;
+
+ msrval &= ~CET_WRSS_EN;
+ }
+
+ wrmsrq(MSR_IA32_U_CET, msrval);
+
+unlock:
+ fpregs_unlock();
+
+ return 0;
+}
+
+static int shstk_disable(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return -EOPNOTSUPP;
+
+ /* Already disabled? */
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return 0;
+
+ fpregs_lock_and_load();
+ /* Disable WRSS too when disabling shadow stack */
+ wrmsrq(MSR_IA32_U_CET, 0);
+ wrmsrq(MSR_IA32_PL3_SSP, 0);
+ fpregs_unlock();
+
+ shstk_free(current);
+ features_clr(ARCH_SHSTK_SHSTK | ARCH_SHSTK_WRSS);
+
+ return 0;
+}
+
+SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
+{
+ bool set_tok = flags & SHADOW_STACK_SET_TOKEN;
+ unsigned long aligned_size;
+
+ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+ return -EOPNOTSUPP;
+
+ if (flags & ~SHADOW_STACK_SET_TOKEN)
+ return -EINVAL;
+
+ /* If there isn't space for a token */
+ if (set_tok && size < 8)
+ return -ENOSPC;
+
+ if (addr && addr < SZ_4G)
+ return -ERANGE;
+
+ /*
+ * An overflow would result in attempting to write the restore token
+ * to the wrong location. Not catastrophic, but just return the right
+ * error code and block it.
+ */
+ aligned_size = PAGE_ALIGN(size);
+ if (aligned_size < size)
+ return -EOVERFLOW;
+
+ return alloc_shstk(addr, aligned_size, size, set_tok);
+}
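
From userspace the new syscall can be reached with a raw syscall(2); a hedged
sketch (the __NR_map_shadow_stack value of 453 and the flag value match the
x86-64 uapi headers at the time of writing, but verify against your headers):

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <stddef.h>

    #ifndef __NR_map_shadow_stack
    #define __NR_map_shadow_stack 453           /* x86-64; assumed here */
    #endif
    #ifndef SHADOW_STACK_SET_TOKEN
    #define SHADOW_STACK_SET_TOKEN (1ULL << 0)
    #endif

    /* map a new shadow stack with a restore token at its top */
    static void *map_shstk(size_t size)
    {
            long ret = syscall(__NR_map_shadow_stack, 0, size,
                               SHADOW_STACK_SET_TOKEN);
            return ret == -1 ? NULL : (void *)ret;
    }
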
+
+long shstk_prctl(struct task_struct *task, int option, unsigned long arg2)
+{
+ unsigned long features = arg2;
+
+ if (option == ARCH_SHSTK_STATUS) {
+ return put_user(task->thread.features, (unsigned long __user *)arg2);
+ }
+
+ if (option == ARCH_SHSTK_LOCK) {
+ task->thread.features_locked |= features;
+ return 0;
+ }
+
+ /* Only allow via ptrace */
+ if (task != current) {
+ if (option == ARCH_SHSTK_UNLOCK && IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)) {
+ task->thread.features_locked &= ~features;
+ return 0;
+ }
+ return -EINVAL;
+ }
+
+ /* Do not allow to change locked features */
+ if (features & task->thread.features_locked)
+ return -EPERM;
+
+ /* Only support enabling/disabling one feature at a time. */
+ if (hweight_long(features) > 1)
+ return -EINVAL;
+
+ if (option == ARCH_SHSTK_DISABLE) {
+ if (features & ARCH_SHSTK_WRSS)
+ return wrss_control(false);
+ if (features & ARCH_SHSTK_SHSTK)
+ return shstk_disable();
+ return -EINVAL;
+ }
+
+ /* Handle ARCH_SHSTK_ENABLE */
+ if (features & ARCH_SHSTK_SHSTK)
+ return shstk_setup();
+ if (features & ARCH_SHSTK_WRSS)
+ return wrss_control(true);
+ return -EINVAL;
+}
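
The prctl side is reached through arch_prctl(2); a hedged userspace sketch
(constant values as in asm/prctl.h at the time of writing; verify locally):

    #include <unistd.h>
    #include <sys/syscall.h>

    #define ARCH_SHSTK_ENABLE  0x5001           /* assumed uapi values */
    #define ARCH_SHSTK_STATUS  0x5005
    #define ARCH_SHSTK_SHSTK   (1ULL << 0)

    static int enable_shstk(void)
    {
            /* glibc has no wrapper for these arch_prctl codes; go raw */
            return syscall(SYS_arch_prctl, ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
    }
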
+
+int shstk_update_last_frame(unsigned long val)
+{
+ unsigned long ssp;
+
+ if (!features_enabled(ARCH_SHSTK_SHSTK))
+ return 0;
+
+ ssp = get_user_shstk_addr();
+ return write_user_shstk_64((u64 __user *)ssp, (u64)val);
+}
+
+bool shstk_is_enabled(void)
+{
+ return features_enabled(ARCH_SHSTK_SHSTK);
+}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 4fd173cd8e57..2404233336ab 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (C) 1991, 1992 Linus Torvalds
* Copyright (C) 2000, 2001, 2002 Andi Kleen SuSE Labs
@@ -6,686 +7,261 @@
* 2000-06-20 Pentium III FXSR, SSE support by Gareth Hughes
* 2000-2002 x86-64 support by Andi Kleen
*/
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/kernel.h>
-#include <linux/signal.h>
+#include <linux/kstrtox.h>
#include <linux/errno.h>
#include <linux/wait.h>
-#include <linux/ptrace.h>
-#include <linux/tracehook.h>
#include <linux/unistd.h>
#include <linux/stddef.h>
#include <linux/personality.h>
#include <linux/uaccess.h>
#include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>
+#include <linux/context_tracking.h>
+#include <linux/entry-common.h>
+#include <linux/syscalls.h>
+#include <linux/rseq.h>
#include <asm/processor.h>
#include <asm/ucontext.h>
-#include <asm/i387.h>
+#include <asm/fpu/signal.h>
+#include <asm/fpu/xstate.h>
#include <asm/vdso.h>
#include <asm/mce.h>
-
-#ifdef CONFIG_X86_64
-#include <asm/proto.h>
-#include <asm/ia32_unistd.h>
-#endif /* CONFIG_X86_64 */
+#include <asm/sighandling.h>
+#include <asm/vm86.h>
#include <asm/syscall.h>
-#include <asm/syscalls.h>
-
#include <asm/sigframe.h>
+#include <asm/signal.h>
+#include <asm/shstk.h>
-#define _BLOCKABLE (~(sigmask(SIGKILL) | sigmask(SIGSTOP)))
-
-#define __FIX_EFLAGS (X86_EFLAGS_AC | X86_EFLAGS_OF | \
- X86_EFLAGS_DF | X86_EFLAGS_TF | X86_EFLAGS_SF | \
- X86_EFLAGS_ZF | X86_EFLAGS_AF | X86_EFLAGS_PF | \
- X86_EFLAGS_CF)
-
-#ifdef CONFIG_X86_32
-# define FIX_EFLAGS (__FIX_EFLAGS | X86_EFLAGS_RF)
-#else
-# define FIX_EFLAGS __FIX_EFLAGS
-#endif
-
-#define COPY(x) do { \
- get_user_ex(regs->x, &sc->x); \
-} while (0)
-
-#define GET_SEG(seg) ({ \
- unsigned short tmp; \
- get_user_ex(tmp, &sc->seg); \
- tmp; \
-})
-
-#define COPY_SEG(seg) do { \
- regs->seg = GET_SEG(seg); \
-} while (0)
+static inline int is_ia32_compat_frame(struct ksignal *ksig)
+{
+ return IS_ENABLED(CONFIG_IA32_EMULATION) &&
+ ksig->ka.sa.sa_flags & SA_IA32_ABI;
+}
-#define COPY_SEG_CPL3(seg) do { \
- regs->seg = GET_SEG(seg) | 3; \
-} while (0)
+static inline int is_ia32_frame(struct ksignal *ksig)
+{
+ return IS_ENABLED(CONFIG_X86_32) || is_ia32_compat_frame(ksig);
+}
-static int
-restore_sigcontext(struct pt_regs *regs, struct sigcontext __user *sc,
- unsigned long *pax)
+static inline int is_x32_frame(struct ksignal *ksig)
{
- void __user *buf;
- unsigned int tmpflags;
- unsigned int err = 0;
-
- /* Always make any pending restarted system calls return -EINTR */
- current_thread_info()->restart_block.fn = do_no_restart_syscall;
-
- get_user_try {
-
-#ifdef CONFIG_X86_32
- set_user_gs(regs, GET_SEG(gs));
- COPY_SEG(fs);
- COPY_SEG(es);
- COPY_SEG(ds);
-#endif /* CONFIG_X86_32 */
-
- COPY(di); COPY(si); COPY(bp); COPY(sp); COPY(bx);
- COPY(dx); COPY(cx); COPY(ip);
-
-#ifdef CONFIG_X86_64
- COPY(r8);
- COPY(r9);
- COPY(r10);
- COPY(r11);
- COPY(r12);
- COPY(r13);
- COPY(r14);
- COPY(r15);
-#endif /* CONFIG_X86_64 */
-
-#ifdef CONFIG_X86_32
- COPY_SEG_CPL3(cs);
- COPY_SEG_CPL3(ss);
-#else /* !CONFIG_X86_32 */
- /* Kernel saves and restores only the CS segment register on signals,
- * which is the bare minimum needed to allow mixed 32/64-bit code.
- * App's signal handler can save/restore other segments if needed. */
- COPY_SEG_CPL3(cs);
-#endif /* CONFIG_X86_32 */
-
- get_user_ex(tmpflags, &sc->flags);
- regs->flags = (regs->flags & ~FIX_EFLAGS) | (tmpflags & FIX_EFLAGS);
- regs->orig_ax = -1; /* disable syscall checks */
-
- get_user_ex(buf, &sc->fpstate);
- err |= restore_i387_xstate(buf);
-
- get_user_ex(*pax, &sc->ax);
- } get_user_catch(err);
-
- return err;
+ return IS_ENABLED(CONFIG_X86_X32_ABI) &&
+ ksig->ka.sa.sa_flags & SA_X32_ABI;
}
-static int
-setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate,
- struct pt_regs *regs, unsigned long mask)
+/*
+ * Enable all pkeys temporarily, so as to ensure that both the current
+ * execution stack as well as the alternate signal stack are writeable.
+ * The application can use any of the available pkeys to protect the
+ * alternate signal stack, and we don't know which one it is, so enable
+ * all. The PKRU register will be reset to init_pkru later in the flow,
+ * in fpu__clear_user_states(), and it is the application's responsibility
+ * to enable the appropriate pkey as the first step in the signal handler
+ * so that the handler does not segfault.
+ */
+static inline u32 sig_prepare_pkru(void)
{
- int err = 0;
-
- put_user_try {
-
-#ifdef CONFIG_X86_32
- put_user_ex(get_user_gs(regs), (unsigned int __user *)&sc->gs);
- put_user_ex(regs->fs, (unsigned int __user *)&sc->fs);
- put_user_ex(regs->es, (unsigned int __user *)&sc->es);
- put_user_ex(regs->ds, (unsigned int __user *)&sc->ds);
-#endif /* CONFIG_X86_32 */
-
- put_user_ex(regs->di, &sc->di);
- put_user_ex(regs->si, &sc->si);
- put_user_ex(regs->bp, &sc->bp);
- put_user_ex(regs->sp, &sc->sp);
- put_user_ex(regs->bx, &sc->bx);
- put_user_ex(regs->dx, &sc->dx);
- put_user_ex(regs->cx, &sc->cx);
- put_user_ex(regs->ax, &sc->ax);
-#ifdef CONFIG_X86_64
- put_user_ex(regs->r8, &sc->r8);
- put_user_ex(regs->r9, &sc->r9);
- put_user_ex(regs->r10, &sc->r10);
- put_user_ex(regs->r11, &sc->r11);
- put_user_ex(regs->r12, &sc->r12);
- put_user_ex(regs->r13, &sc->r13);
- put_user_ex(regs->r14, &sc->r14);
- put_user_ex(regs->r15, &sc->r15);
-#endif /* CONFIG_X86_64 */
-
- put_user_ex(current->thread.trap_no, &sc->trapno);
- put_user_ex(current->thread.error_code, &sc->err);
- put_user_ex(regs->ip, &sc->ip);
-#ifdef CONFIG_X86_32
- put_user_ex(regs->cs, (unsigned int __user *)&sc->cs);
- put_user_ex(regs->flags, &sc->flags);
- put_user_ex(regs->sp, &sc->sp_at_signal);
- put_user_ex(regs->ss, (unsigned int __user *)&sc->ss);
-#else /* !CONFIG_X86_32 */
- put_user_ex(regs->flags, &sc->flags);
- put_user_ex(regs->cs, &sc->cs);
- put_user_ex(0, &sc->gs);
- put_user_ex(0, &sc->fs);
-#endif /* CONFIG_X86_32 */
-
- put_user_ex(fpstate, &sc->fpstate);
-
- /* non-iBCS2 extensions.. */
- put_user_ex(mask, &sc->oldmask);
- put_user_ex(current->thread.cr2, &sc->cr2);
- } put_user_catch(err);
-
- return err;
+ u32 orig_pkru = read_pkru();
+
+ write_pkru(0);
+ return orig_pkru;
}
/*
* Set up a signal frame.
*/
+/* x86 ABI requires 16-byte alignment */
+#define FRAME_ALIGNMENT 16UL
+
+#define MAX_FRAME_PADDING (FRAME_ALIGNMENT - 1)
+
/*
* Determine which stack to use.
*/
-static unsigned long align_sigframe(unsigned long sp)
-{
-#ifdef CONFIG_X86_32
- /*
- * Align the stack pointer according to the i386 ABI,
- * i.e. so that on function entry ((sp + 4) & 15) == 0.
- */
- sp = ((sp + 4) & -16ul) - 4;
-#else /* !CONFIG_X86_32 */
- sp = round_down(sp, 16) - 8;
-#endif
- return sp;
-}
-
-static inline void __user *
-get_sigframe(struct k_sigaction *ka, struct pt_regs *regs, size_t frame_size,
+void __user *
+get_sigframe(struct ksignal *ksig, struct pt_regs *regs, size_t frame_size,
void __user **fpstate)
{
+ struct k_sigaction *ka = &ksig->ka;
+ int ia32_frame = is_ia32_frame(ksig);
/* Default to using normal stack */
+ bool nested_altstack = on_sig_stack(regs->sp);
+ bool entering_altstack = false;
+ unsigned long math_size = 0;
unsigned long sp = regs->sp;
- int onsigstack = on_sig_stack(sp);
+ unsigned long buf_fx = 0;
+ u32 pkru;
-#ifdef CONFIG_X86_64
/* redzone */
- sp -= 128;
-#endif /* CONFIG_X86_64 */
-
- if (!onsigstack) {
- /* This is the X/Open sanctioned signal stack switching. */
- if (ka->sa.sa_flags & SA_ONSTACK) {
- if (current->sas_ss_size)
- sp = current->sas_ss_sp + current->sas_ss_size;
- } else {
-#ifdef CONFIG_X86_32
- /* This is the legacy signal stack switching. */
- if ((regs->ss & 0xffff) != __USER_DS &&
- !(ka->sa.sa_flags & SA_RESTORER) &&
- ka->sa.sa_restorer)
- sp = (unsigned long) ka->sa.sa_restorer;
-#endif /* CONFIG_X86_32 */
+ if (!ia32_frame)
+ sp -= 128;
+
+ /* This is the X/Open sanctioned signal stack switching. */
+ if (ka->sa.sa_flags & SA_ONSTACK) {
+ /*
+ * This checks nested_altstack via sas_ss_flags(). Sensible
+ * programs use SS_AUTODISARM, which disables that check, and
+ * programs that don't use SS_AUTODISARM get the historically
+ * compatible behavior.
+ */
+ if (sas_ss_flags(sp) == 0) {
+ sp = current->sas_ss_sp + current->sas_ss_size;
+ entering_altstack = true;
}
+ } else if (ia32_frame &&
+ !nested_altstack &&
+ regs->ss != __USER_DS &&
+ !(ka->sa.sa_flags & SA_RESTORER) &&
+ ka->sa.sa_restorer) {
+ /* This is the legacy signal stack switching. */
+ sp = (unsigned long) ka->sa.sa_restorer;
+ entering_altstack = true;
}
- if (used_math()) {
- sp -= sig_xstate_size;
-#ifdef CONFIG_X86_64
- sp = round_down(sp, 64);
-#endif /* CONFIG_X86_64 */
- *fpstate = (void __user *)sp;
- }
+ sp = fpu__alloc_mathframe(sp, ia32_frame, &buf_fx, &math_size);
+ *fpstate = (void __user *)sp;
- sp = align_sigframe(sp - frame_size);
+ sp -= frame_size;
+
+ if (ia32_frame)
+ /*
+ * Align the stack pointer according to the i386 ABI,
+ * i.e. so that on function entry ((sp + 4) & 15) == 0.
+ */
+ sp = ((sp + 4) & -FRAME_ALIGNMENT) - 4;
+ else
+ sp = round_down(sp, FRAME_ALIGNMENT) - 8;
/*
* If we are on the alternate signal stack and would overflow it, don't.
* Return an always-bogus address instead so we will die with SIGSEGV.
*/
- if (onsigstack && !likely(on_sig_stack(sp)))
- return (void __user *)-1L;
+ if (unlikely((nested_altstack || entering_altstack) &&
+ !__on_sig_stack(sp))) {
- /* save i387 state */
- if (used_math() && save_i387_xstate(*fpstate) < 0)
- return (void __user *)-1L;
+ if (show_unhandled_signals && printk_ratelimit())
+ pr_info("%s[%d] overflowed sigaltstack\n",
+ current->comm, task_pid_nr(current));
- return (void __user *)sp;
-}
-
-#ifdef CONFIG_X86_32
-static const struct {
- u16 poplmovl;
- u32 val;
- u16 int80;
-} __attribute__((packed)) retcode = {
- 0xb858, /* popl %eax; movl $..., %eax */
- __NR_sigreturn,
- 0x80cd, /* int $0x80 */
-};
-
-static const struct {
- u8 movl;
- u32 val;
- u16 int80;
- u8 pad;
-} __attribute__((packed)) rt_retcode = {
- 0xb8, /* movl $..., %eax */
- __NR_rt_sigreturn,
- 0x80cd, /* int $0x80 */
- 0
-};
-
-static int
-__setup_frame(int sig, struct k_sigaction *ka, sigset_t *set,
- struct pt_regs *regs)
-{
- struct sigframe __user *frame;
- void __user *restorer;
- int err = 0;
- void __user *fpstate = NULL;
-
- frame = get_sigframe(ka, regs, sizeof(*frame), &fpstate);
-
- if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame)))
- return -EFAULT;
-
- if (__put_user(sig, &frame->sig))
- return -EFAULT;
-
- if (setup_sigcontext(&frame->sc, fpstate, regs, set->sig[0]))
- return -EFAULT;
-
- if (_NSIG_WORDS > 1) {
- if (__copy_to_user(&frame->extramask, &set->sig[1],
- sizeof(frame->extramask)))
- return -EFAULT;
+ return (void __user *)-1L;
}
- if (current->mm->context.vdso)
- restorer = VDSO32_SYMBOL(current->mm->context.vdso, sigreturn);
- else
- restorer = &frame->retcode;
- if (ka->sa.sa_flags & SA_RESTORER)
- restorer = ka->sa.sa_restorer;
-
- /* Set up to return from userspace. */
- err |= __put_user(restorer, &frame->pretcode);
-
- /*
- * This is popl %eax ; movl $__NR_sigreturn, %eax ; int $0x80
- *
- * WE DO NOT USE IT ANY MORE! It's only left here for historical
- * reasons and because gdb uses it as a signature to notice
- * signal handler stack frames.
- */
- err |= __put_user(*((u64 *)&retcode), (u64 *)frame->retcode);
-
- if (err)
- return -EFAULT;
-
- /* Set up registers for signal handler */
- regs->sp = (unsigned long)frame;
- regs->ip = (unsigned long)ka->sa.sa_handler;
- regs->ax = (unsigned long)sig;
- regs->dx = 0;
- regs->cx = 0;
-
- regs->ds = __USER_DS;
- regs->es = __USER_DS;
- regs->ss = __USER_DS;
- regs->cs = __USER_CS;
-
- return 0;
-}
-
-static int __setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info,
- sigset_t *set, struct pt_regs *regs)
-{
- struct rt_sigframe __user *frame;
- void __user *restorer;
- int err = 0;
- void __user *fpstate = NULL;
-
- frame = get_sigframe(ka, regs, sizeof(*frame), &fpstate);
-
- if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame)))
- return -EFAULT;
-
- put_user_try {
- put_user_ex(sig, &frame->sig);
- put_user_ex(&frame->info, &frame->pinfo);
- put_user_ex(&frame->uc, &frame->puc);
- err |= copy_siginfo_to_user(&frame->info, info);
-
- /* Create the ucontext. */
- if (cpu_has_xsave)
- put_user_ex(UC_FP_XSTATE, &frame->uc.uc_flags);
- else
- put_user_ex(0, &frame->uc.uc_flags);
- put_user_ex(0, &frame->uc.uc_link);
- put_user_ex(current->sas_ss_sp, &frame->uc.uc_stack.ss_sp);
- put_user_ex(sas_ss_flags(regs->sp),
- &frame->uc.uc_stack.ss_flags);
- put_user_ex(current->sas_ss_size, &frame->uc.uc_stack.ss_size);
- err |= setup_sigcontext(&frame->uc.uc_mcontext, fpstate,
- regs, set->sig[0]);
- err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
-
- /* Set up to return from userspace. */
- restorer = VDSO32_SYMBOL(current->mm->context.vdso, rt_sigreturn);
- if (ka->sa.sa_flags & SA_RESTORER)
- restorer = ka->sa.sa_restorer;
- put_user_ex(restorer, &frame->pretcode);
-
+ /* Update PKRU to enable access to the alternate signal stack. */
+ pkru = sig_prepare_pkru();
+ /* save i387 and extended state */
+ if (!copy_fpstate_to_sigframe(*fpstate, (void __user *)buf_fx, math_size, pkru)) {
/*
- * This is movl $__NR_rt_sigreturn, %ax ; int $0x80
- *
- * WE DO NOT USE IT ANY MORE! It's only left here for historical
- * reasons and because gdb uses it as a signature to notice
- * signal handler stack frames.
+ * Restore PKRU to the original, user-defined value; disable
+ * extra pkeys enabled for the alternate signal stack, if any.
*/
- put_user_ex(*((u64 *)&rt_retcode), (u64 *)frame->retcode);
- } put_user_catch(err);
-
- if (err)
- return -EFAULT;
-
- /* Set up registers for signal handler */
- regs->sp = (unsigned long)frame;
- regs->ip = (unsigned long)ka->sa.sa_handler;
- regs->ax = (unsigned long)sig;
- regs->dx = (unsigned long)&frame->info;
- regs->cx = (unsigned long)&frame->uc;
-
- regs->ds = __USER_DS;
- regs->es = __USER_DS;
- regs->ss = __USER_DS;
- regs->cs = __USER_CS;
-
- return 0;
-}
-#else /* !CONFIG_X86_32 */
-static int __setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info,
- sigset_t *set, struct pt_regs *regs)
-{
- struct rt_sigframe __user *frame;
- void __user *fp = NULL;
- int err = 0;
- struct task_struct *me = current;
-
- frame = get_sigframe(ka, regs, sizeof(struct rt_sigframe), &fp);
-
- if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame)))
- return -EFAULT;
-
- if (ka->sa.sa_flags & SA_SIGINFO) {
- if (copy_siginfo_to_user(&frame->info, info))
- return -EFAULT;
+ write_pkru(pkru);
+ return (void __user *)-1L;
}
- put_user_try {
- /* Create the ucontext. */
- if (cpu_has_xsave)
- put_user_ex(UC_FP_XSTATE, &frame->uc.uc_flags);
- else
- put_user_ex(0, &frame->uc.uc_flags);
- put_user_ex(0, &frame->uc.uc_link);
- put_user_ex(me->sas_ss_sp, &frame->uc.uc_stack.ss_sp);
- put_user_ex(sas_ss_flags(regs->sp),
- &frame->uc.uc_stack.ss_flags);
- put_user_ex(me->sas_ss_size, &frame->uc.uc_stack.ss_size);
- err |= setup_sigcontext(&frame->uc.uc_mcontext, fp, regs, set->sig[0]);
- err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
-
- /* Set up to return from userspace. If provided, use a stub
- already in userspace. */
- /* x86-64 should always use SA_RESTORER. */
- if (ka->sa.sa_flags & SA_RESTORER) {
- put_user_ex(ka->sa.sa_restorer, &frame->pretcode);
- } else {
- /* could use a vstub here */
- err |= -EFAULT;
- }
- } put_user_catch(err);
-
- if (err)
- return -EFAULT;
-
- /* Set up registers for signal handler */
- regs->di = sig;
- /* In case the signal handler was declared without prototypes */
- regs->ax = 0;
-
- /* This also works for non SA_SIGINFO handlers because they expect the
- next argument after the signal number on the stack. */
- regs->si = (unsigned long)&frame->info;
- regs->dx = (unsigned long)&frame->uc;
- regs->ip = (unsigned long) ka->sa.sa_handler;
-
- regs->sp = (unsigned long)frame;
-
- /* Set up the CS register to run signal handlers in 64-bit mode,
- even if the handler happens to be interrupting 32-bit code. */
- regs->cs = __USER_CS;
-
- return 0;
+ return (void __user *)sp;
}
-#endif /* CONFIG_X86_32 */
-#ifdef CONFIG_X86_32
/*
- * Atomically swap in the new signal mask, and wait for a signal.
+ * There are four different struct types for signal frame: sigframe_ia32,
+ * rt_sigframe_ia32, rt_sigframe_x32, and rt_sigframe. Use the worst case
+ * -- the largest size. It means the size for 64-bit apps is a bit more
+ * than needed, but this keeps the code simple.
*/
-asmlinkage int
-sys_sigsuspend(int history0, int history1, old_sigset_t mask)
-{
- mask &= _BLOCKABLE;
- spin_lock_irq(&current->sighand->siglock);
- current->saved_sigmask = current->blocked;
- siginitset(&current->blocked, mask);
- recalc_sigpending();
- spin_unlock_irq(&current->sighand->siglock);
-
- current->state = TASK_INTERRUPTIBLE;
- schedule();
- set_restore_sigmask();
-
- return -ERESTARTNOHAND;
-}
-
-asmlinkage int
-sys_sigaction(int sig, const struct old_sigaction __user *act,
- struct old_sigaction __user *oact)
-{
- struct k_sigaction new_ka, old_ka;
- int ret = 0;
-
- if (act) {
- old_sigset_t mask;
-
- if (!access_ok(VERIFY_READ, act, sizeof(*act)))
- return -EFAULT;
-
- get_user_try {
- get_user_ex(new_ka.sa.sa_handler, &act->sa_handler);
- get_user_ex(new_ka.sa.sa_flags, &act->sa_flags);
- get_user_ex(mask, &act->sa_mask);
- get_user_ex(new_ka.sa.sa_restorer, &act->sa_restorer);
- } get_user_catch(ret);
-
- if (ret)
- return -EFAULT;
- siginitset(&new_ka.sa.sa_mask, mask);
- }
-
- ret = do_sigaction(sig, act ? &new_ka : NULL, oact ? &old_ka : NULL);
-
- if (!ret && oact) {
- if (!access_ok(VERIFY_WRITE, oact, sizeof(*oact)))
- return -EFAULT;
-
- put_user_try {
- put_user_ex(old_ka.sa.sa_handler, &oact->sa_handler);
- put_user_ex(old_ka.sa.sa_flags, &oact->sa_flags);
- put_user_ex(old_ka.sa.sa_mask.sig[0], &oact->sa_mask);
- put_user_ex(old_ka.sa.sa_restorer, &oact->sa_restorer);
- } put_user_catch(ret);
-
- if (ret)
- return -EFAULT;
- }
-
- return ret;
-}
-#endif /* CONFIG_X86_32 */
-
-long
-sys_sigaltstack(const stack_t __user *uss, stack_t __user *uoss,
- struct pt_regs *regs)
-{
- return do_sigaltstack(uss, uoss, regs->sp);
-}
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+# define MAX_FRAME_SIGINFO_UCTXT_SIZE sizeof(struct sigframe_ia32)
+#else
+# define MAX_FRAME_SIGINFO_UCTXT_SIZE sizeof(struct rt_sigframe)
+#endif
/*
- * Do a signal return; undo the signal stack.
+ * The FP state frame contains an XSAVE buffer which must be 64-byte aligned.
+ * If a signal frame starts at an unaligned address, extra space is required.
+ * This is the max alignment padding, conservatively.
*/
-#ifdef CONFIG_X86_32
-unsigned long sys_sigreturn(struct pt_regs *regs)
-{
- struct sigframe __user *frame;
- unsigned long ax;
- sigset_t set;
-
- frame = (struct sigframe __user *)(regs->sp - 8);
-
- if (!access_ok(VERIFY_READ, frame, sizeof(*frame)))
- goto badframe;
- if (__get_user(set.sig[0], &frame->sc.oldmask) || (_NSIG_WORDS > 1
- && __copy_from_user(&set.sig[1], &frame->extramask,
- sizeof(frame->extramask))))
- goto badframe;
+#define MAX_XSAVE_PADDING 63UL
- sigdelsetmask(&set, ~_BLOCKABLE);
- spin_lock_irq(&current->sighand->siglock);
- current->blocked = set;
- recalc_sigpending();
- spin_unlock_irq(&current->sighand->siglock);
-
- if (restore_sigcontext(regs, &frame->sc, &ax))
- goto badframe;
- return ax;
-
-badframe:
- signal_fault(regs, frame, "sigreturn");
+/*
+ * The frame data is composed of the following areas and laid out as:
+ *
+ * -------------------------
+ * | alignment padding |
+ * -------------------------
+ * | (f)xsave frame |
+ * -------------------------
+ * | fsave header |
+ * -------------------------
+ * | alignment padding |
+ * -------------------------
+ * | siginfo + ucontext |
+ * -------------------------
+ */
- return 0;
-}
-#endif /* CONFIG_X86_32 */
+/* max_frame_size tells userspace the worst case signal stack size. */
+static unsigned long __ro_after_init max_frame_size;
+static unsigned int __ro_after_init fpu_default_state_size;
-long sys_rt_sigreturn(struct pt_regs *regs)
+static int __init init_sigframe_size(void)
{
- struct rt_sigframe __user *frame;
- unsigned long ax;
- sigset_t set;
-
- frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long));
- if (!access_ok(VERIFY_READ, frame, sizeof(*frame)))
- goto badframe;
- if (__copy_from_user(&set, &frame->uc.uc_sigmask, sizeof(set)))
- goto badframe;
-
- sigdelsetmask(&set, ~_BLOCKABLE);
- spin_lock_irq(&current->sighand->siglock);
- current->blocked = set;
- recalc_sigpending();
- spin_unlock_irq(&current->sighand->siglock);
+ fpu_default_state_size = fpu__get_fpstate_size();
- if (restore_sigcontext(regs, &frame->uc.uc_mcontext, &ax))
- goto badframe;
+ max_frame_size = MAX_FRAME_SIGINFO_UCTXT_SIZE + MAX_FRAME_PADDING;
- if (do_sigaltstack(&frame->uc.uc_stack, NULL, regs->sp) == -EFAULT)
- goto badframe;
+ max_frame_size += fpu_default_state_size + MAX_XSAVE_PADDING;
- return ax;
+ /* Userspace expects an aligned size. */
+ max_frame_size = round_up(max_frame_size, FRAME_ALIGNMENT);
-badframe:
- signal_fault(regs, frame, "rt_sigreturn");
+ pr_info("max sigframe size: %lu\n", max_frame_size);
return 0;
}
+early_initcall(init_sigframe_size);
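/*
 * Illustrative arithmetic, not from this patch: assuming FRAME_ALIGNMENT is
 * 16 (so MAX_FRAME_PADDING is 15) and taking a hypothetical 776-byte
 * siginfo+ucontext worst case with a 4096-byte default XSAVE buffer:
 *
 *	max_frame_size = round_up(776 + 15 + 4096 + 63, 16)
 *	               = round_up(4950, 16) = 4960
 */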
-/*
- * OK, we're invoking a handler:
- */
-static int signr_convert(int sig)
+unsigned long get_sigframe_size(void)
{
-#ifdef CONFIG_X86_32
- struct thread_info *info = current_thread_info();
-
- if (info->exec_domain && info->exec_domain->signal_invmap && sig < 32)
- return info->exec_domain->signal_invmap[sig];
-#endif /* CONFIG_X86_32 */
- return sig;
+ return max_frame_size;
}
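/*
 * Illustrative userspace sketch, not from this patch: the value returned by
 * get_sigframe_size() is what the kernel reports through the AT_MINSIGSTKSZ
 * auxiliary vector entry, so a program can size its sigaltstack against the
 * real worst case instead of the legacy MINSIGSTKSZ constant.
 */
#include <signal.h>
#include <stdlib.h>
#include <sys/auxv.h>

static int setup_altstack(void)
{
	unsigned long min = getauxval(AT_MINSIGSTKSZ);	/* 0 if not provided */
	stack_t ss = { 0 };

	/* Worst-case frame plus some headroom for the handler itself. */
	ss.ss_size = (min ? min : MINSIGSTKSZ) + 4096;
	ss.ss_sp = malloc(ss.ss_size);
	if (!ss.ss_sp)
		return -1;
	return sigaltstack(&ss, NULL);
}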
-#ifdef CONFIG_X86_32
-
-#define is_ia32 1
-#define ia32_setup_frame __setup_frame
-#define ia32_setup_rt_frame __setup_rt_frame
-
-#else /* !CONFIG_X86_32 */
-
-#ifdef CONFIG_IA32_EMULATION
-#define is_ia32 test_thread_flag(TIF_IA32)
-#else /* !CONFIG_IA32_EMULATION */
-#define is_ia32 0
-#endif /* CONFIG_IA32_EMULATION */
-
-int ia32_setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info,
- sigset_t *set, struct pt_regs *regs);
-int ia32_setup_frame(int sig, struct k_sigaction *ka,
- sigset_t *set, struct pt_regs *regs);
-
-#endif /* CONFIG_X86_32 */
-
static int
-setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info,
- sigset_t *set, struct pt_regs *regs)
+setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
{
- int usig = signr_convert(sig);
- int ret;
+ /* Perform fixup for the pre-signal frame. */
+ rseq_signal_deliver(ksig, regs);
/* Set up the stack frame */
- if (is_ia32) {
- if (ka->sa.sa_flags & SA_SIGINFO)
- ret = ia32_setup_rt_frame(usig, ka, info, set, regs);
+ if (is_ia32_frame(ksig)) {
+ if (ksig->ka.sa.sa_flags & SA_SIGINFO)
+ return ia32_setup_rt_frame(ksig, regs);
else
- ret = ia32_setup_frame(usig, ka, set, regs);
- } else
- ret = __setup_rt_frame(sig, ka, info, set, regs);
-
- if (ret) {
- force_sigsegv(sig, current);
- return -EFAULT;
+ return ia32_setup_frame(ksig, regs);
+ } else if (is_x32_frame(ksig)) {
+ return x32_setup_rt_frame(ksig, regs);
+ } else {
+ return x64_setup_rt_frame(ksig, regs);
}
-
- return ret;
}
-static int
-handle_signal(unsigned long sig, siginfo_t *info, struct k_sigaction *ka,
- sigset_t *oldset, struct pt_regs *regs)
+static void
+handle_signal(struct ksignal *ksig, struct pt_regs *regs)
{
- int ret;
+ bool stepping, failed;
+ struct fpu *fpu = x86_task_fpu(current);
+
+ if (v8086_mode(regs))
+ save_v86_state((struct kernel_vm86_regs *) regs, VM86_SIGNAL);
/* Are we from a system call? */
- if (syscall_get_nr(current, regs) >= 0) {
+ if (syscall_get_nr(current, regs) != -1) {
/* If so, check system call restarting.. */
switch (syscall_get_error(current, regs)) {
case -ERESTART_RESTARTBLOCK:
@@ -694,11 +270,11 @@ handle_signal(unsigned long sig, siginfo_t *info, struct k_sigaction *ka,
break;
case -ERESTARTSYS:
- if (!(ka->sa.sa_flags & SA_RESTART)) {
+ if (!(ksig->ka.sa.sa_flags & SA_RESTART)) {
regs->ax = -EINTR;
break;
}
- /* fallthrough */
+ fallthrough;
case -ERESTARTNOINTR:
regs->ax = regs->orig_ax;
regs->ip -= 2;
@@ -707,104 +283,65 @@ handle_signal(unsigned long sig, siginfo_t *info, struct k_sigaction *ka,
}
/*
- * If TF is set due to a debugger (TIF_FORCED_TF), clear the TF
- * flag so that register information in the sigcontext is correct.
+ * If TF is set due to a debugger (TIF_FORCED_TF), clear TF now
+ * so that register information in the sigcontext is correct and
+ * then notify the tracer before entering the signal handler.
*/
- if (unlikely(regs->flags & X86_EFLAGS_TF) &&
- likely(test_and_clear_thread_flag(TIF_FORCED_TF)))
- regs->flags &= ~X86_EFLAGS_TF;
-
- ret = setup_rt_frame(sig, ka, info, oldset, regs);
+ stepping = test_thread_flag(TIF_SINGLESTEP);
+ if (stepping)
+ user_disable_single_step(current);
- if (ret)
- return ret;
+ failed = (setup_rt_frame(ksig, regs) < 0);
+ if (!failed) {
+ /*
+ * Clear the direction flag as per the ABI for function entry.
+ *
+		 * Clear RF when entering the signal handler, because
+		 * leaving it set could suppress a debug exception on the
+		 * handler's first instruction.
+		 *
+		 * Clear TF for the case when it wasn't set by the debugger,
+		 * to avoid a recursive send_sigtrap() in the SIGTRAP handler.
+ */
+ regs->flags &= ~(X86_EFLAGS_DF|X86_EFLAGS_RF|X86_EFLAGS_TF);
+ /*
+ * Ensure the signal handler starts with the new fpu state.
+ */
+ fpu__clear_user_states(fpu);
+ }
+ signal_setup_done(failed, ksig, stepping);
+}
-#ifdef CONFIG_X86_64
- /*
- * This has nothing to do with segment registers,
- * despite the name. This magic affects uaccess.h
- * macros' behavior. Reset it to the normal setting.
- */
- set_fs(USER_DS);
+static inline unsigned long get_nr_restart_syscall(const struct pt_regs *regs)
+{
+#ifdef CONFIG_IA32_EMULATION
+ if (current->restart_block.arch_data & TS_COMPAT)
+ return __NR_ia32_restart_syscall;
+#endif
+#ifdef CONFIG_X86_X32_ABI
+ return __NR_restart_syscall | (regs->orig_ax & __X32_SYSCALL_BIT);
+#else
+ return __NR_restart_syscall;
#endif
-
- /*
- * Clear the direction flag as per the ABI for function entry.
- */
- regs->flags &= ~X86_EFLAGS_DF;
-
- /*
- * Clear TF when entering the signal handler, but
- * notify any tracer that was single-stepping it.
- * The tracer may want to single-step inside the
- * handler too.
- */
- regs->flags &= ~X86_EFLAGS_TF;
-
- spin_lock_irq(&current->sighand->siglock);
- sigorsets(&current->blocked, &current->blocked, &ka->sa.sa_mask);
- if (!(ka->sa.sa_flags & SA_NODEFER))
- sigaddset(&current->blocked, sig);
- recalc_sigpending();
- spin_unlock_irq(&current->sighand->siglock);
-
- tracehook_signal_handler(sig, info, ka, regs,
- test_thread_flag(TIF_SINGLESTEP));
-
- return 0;
}
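/*
 * Illustrative arithmetic, not from this patch: x32 system call numbers
 * are the 64-bit numbers with __X32_SYSCALL_BIT (0x40000000) ORed in, so
 * preserving that bit from orig_ax keeps the restart in the caller's ABI.
 * With __NR_restart_syscall == 219 and a hypothetical x32 orig_ax of
 * 0x40000123:
 *
 *	219 | (0x40000123 & 0x40000000) == 0x400000db	(x32 caller)
 *	219 | (       219 & 0x40000000) == 219		(64-bit caller)
 */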
-#ifdef CONFIG_X86_32
-#define NR_restart_syscall __NR_restart_syscall
-#else /* !CONFIG_X86_32 */
-#define NR_restart_syscall \
- test_thread_flag(TIF_IA32) ? __NR_ia32_restart_syscall : __NR_restart_syscall
-#endif /* CONFIG_X86_32 */
-
/*
* Note that 'init' is a special process: it doesn't get signals it doesn't
* want to handle. Thus you cannot kill init even with a SIGKILL even by
* mistake.
*/
-static void do_signal(struct pt_regs *regs)
+void arch_do_signal_or_restart(struct pt_regs *regs)
{
- struct k_sigaction ka;
- siginfo_t info;
- int signr;
- sigset_t *oldset;
-
- /*
- * We want the common case to go fast, which is why we may in certain
- * cases get here from kernel mode. Just return without doing anything
- * if so.
- * X86_32: vm86 regs switched out by assembly code before reaching
- * here, so testing against kernel CS suffices.
- */
- if (!user_mode(regs))
- return;
+ struct ksignal ksig;
- if (current_thread_info()->status & TS_RESTORE_SIGMASK)
- oldset = &current->saved_sigmask;
- else
- oldset = &current->blocked;
-
- signr = get_signal_to_deliver(&info, &ka, regs, NULL);
- if (signr > 0) {
+ if (get_signal(&ksig)) {
/* Whee! Actually deliver the signal. */
- if (handle_signal(signr, &info, &ka, oldset, regs) == 0) {
- /*
- * A signal was successfully delivered; the saved
- * sigmask will have been stored in the signal frame,
- * and will be restored by sigreturn, so we can simply
- * clear the TS_RESTORE_SIGMASK flag.
- */
- current_thread_info()->status &= ~TS_RESTORE_SIGMASK;
- }
+ handle_signal(&ksig, regs);
return;
}
/* Did we come from a system call? */
- if (syscall_get_nr(current, regs) >= 0) {
+ if (syscall_get_nr(current, regs) != -1) {
/* Restart the system call - no handlers present */
switch (syscall_get_error(current, regs)) {
case -ERESTARTNOHAND:
@@ -815,7 +352,7 @@ static void do_signal(struct pt_regs *regs)
break;
case -ERESTART_RESTARTBLOCK:
- regs->ax = NR_restart_syscall;
+ regs->ax = get_nr_restart_syscall(regs);
regs->ip -= 2;
break;
}
@@ -825,41 +362,7 @@ static void do_signal(struct pt_regs *regs)
* If there's no signal to deliver, we just put the saved sigmask
* back.
*/
- if (current_thread_info()->status & TS_RESTORE_SIGMASK) {
- current_thread_info()->status &= ~TS_RESTORE_SIGMASK;
- sigprocmask(SIG_SETMASK, &current->saved_sigmask, NULL);
- }
-}
-
-/*
- * notification of userspace execution resumption
- * - triggered by the TIF_WORK_MASK flags
- */
-void
-do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
-{
-#ifdef CONFIG_X86_MCE
- /* notify userspace of pending MCEs */
- if (thread_info_flags & _TIF_MCE_NOTIFY)
- mce_notify_process();
-#endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
-
- /* deal with pending signal delivery */
- if (thread_info_flags & _TIF_SIGPENDING)
- do_signal(regs);
-
- if (thread_info_flags & _TIF_NOTIFY_RESUME) {
- clear_thread_flag(TIF_NOTIFY_RESUME);
- tracehook_notify_resume(regs);
- if (current->replacement_session_keyring)
- key_replace_session_keyring();
- }
- if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
- fire_user_return_notifiers();
-
-#ifdef CONFIG_X86_32
- clear_thread_flag(TIF_IRET);
-#endif /* CONFIG_X86_32 */
+ restore_saved_sigmask();
}
void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
@@ -872,9 +375,65 @@ void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
task_pid_nr(current) > 1 ? KERN_INFO : KERN_EMERG,
me->comm, me->pid, where, frame,
regs->ip, regs->sp, regs->orig_ax);
- print_vma_addr(" in ", regs->ip);
- printk(KERN_CONT "\n");
+ print_vma_addr(KERN_CONT " in ", regs->ip);
+ pr_cont("\n");
}
- force_sig(SIGSEGV, me);
+ force_sig(SIGSEGV);
+}
+
+#ifdef CONFIG_DYNAMIC_SIGFRAME
+#ifdef CONFIG_STRICT_SIGALTSTACK_SIZE
+static bool strict_sigaltstack_size __ro_after_init = true;
+#else
+static bool strict_sigaltstack_size __ro_after_init = false;
+#endif
+
+static int __init strict_sas_size(char *arg)
+{
+ return kstrtobool(arg, &strict_sigaltstack_size) == 0;
+}
+__setup("strict_sas_size", strict_sas_size);
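/*
 * Illustrative usage, not from this patch: booting with "strict_sas_size=1"
 * enforces the sigaltstack size check on any kernel, while
 * "strict_sas_size=0" relaxes it even when CONFIG_STRICT_SIGALTSTACK_SIZE
 * selected the strict default.
 */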
+
+/*
+ * MINSIGSTKSZ is 2048 and can't be changed despite the fact that AVX512
+ * state alone already exceeds that size. Programs which allocate such an
+ * undersized sigaltstack but never actually take a signal on it just
+ * continued to work. While always checking against the real size would be
+ * correct, this might be considered a regression.
+ *
+ * Therefore avoid the sanity check, unless enforced by kernel
+ * configuration or command line option.
+ *
+ * When dynamic FPU features are supported, the check is also enforced when
+ * the task has permissions to use dynamic features. Tasks which have no
+ * permission are checked against the size of the non-dynamic feature set
+ * if strict checking is enabled. This avoids forcing all tasks on the
+ * system to allocate large sigaltstacks even if they are never going
+ * to use a dynamic feature. As this is serialized via sighand::siglock
+ * any permission request for a dynamic feature either happened already
+ * or will see the newly installed sigaltstack size in the permission checks.
+ */
+bool sigaltstack_size_valid(size_t ss_size)
+{
+ unsigned long fsize = max_frame_size - fpu_default_state_size;
+ u64 mask;
+
+ lockdep_assert_held(&current->sighand->siglock);
+
+ if (!fpu_state_size_dynamic() && !strict_sigaltstack_size)
+ return true;
+
+ fsize += x86_task_fpu(current->group_leader)->perm.__user_state_size;
+ if (likely(ss_size > fsize))
+ return true;
+
+ if (strict_sigaltstack_size)
+ return ss_size > fsize;
+
+ mask = x86_task_fpu(current->group_leader)->perm.__state_perm;
+ if (mask & XFEATURE_MASK_USER_DYNAMIC)
+ return ss_size > fsize;
+
+ return true;
}
+#endif /* CONFIG_DYNAMIC_SIGFRAME */
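/*
 * Illustrative userspace sketch, not from this patch: when the check above
 * rejects a stack, sigaltstack() fails with -ENOMEM, which a program can
 * use to probe whether its buffer covers the worst-case frame.
 */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/auxv.h>

int main(void)
{
	static char buf[2048];	/* MINSIGSTKSZ-sized: may be too small with AVX-512 */
	stack_t ss = { .ss_sp = buf, .ss_size = sizeof(buf) };

	if (sigaltstack(&ss, NULL) == -1 && errno == ENOMEM)
		printf("rejected; AT_MINSIGSTKSZ says %lu bytes are needed\n",
		       getauxval(AT_MINSIGSTKSZ));
	return 0;
}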
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
new file mode 100644
index 000000000000..42bbc42bd350
--- /dev/null
+++ b/arch/x86/kernel/signal_32.c
@@ -0,0 +1,535 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ *
+ * 1997-11-28 Modified for POSIX.1b signals by Richard Henderson
+ * 2000-06-20 Pentium III FXSR, SSE support by Gareth Hughes
+ * 2000-12-* x86-64 compatibility mode signal handling by Andi Kleen
+ */
+
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/wait.h>
+#include <linux/unistd.h>
+#include <linux/stddef.h>
+#include <linux/personality.h>
+#include <linux/compat.h>
+#include <linux/binfmts.h>
+#include <linux/syscalls.h>
+#include <asm/ucontext.h>
+#include <linux/uaccess.h>
+#include <asm/fpu/signal.h>
+#include <asm/ptrace.h>
+#include <asm/user32.h>
+#include <uapi/asm/sigcontext.h>
+#include <asm/proto.h>
+#include <asm/vdso.h>
+#include <asm/sigframe.h>
+#include <asm/sighandling.h>
+#include <asm/smap.h>
+#include <asm/gsseg.h>
+
+/*
+ * The first GDT descriptor is reserved as 'NULL descriptor'. As bits 0
+ * and 1 of a segment selector, i.e., the RPL bits, are NOT used to index
+ * GDT, selector values 0~3 all point to the NULL descriptor, thus values
+ * 0, 1, 2 and 3 are all valid NULL selector values.
+ *
+ * However IRET zeros ES, FS, GS, and DS segment registers if any of them
+ * is found to have any nonzero NULL selector value, which can be used by
+ * userspace in pre-FRED systems to spot any interrupt/exception by loading
+ * a nonzero NULL selector and waiting for it to become zero. Before FRED
+ * there was nothing software could do to prevent such an information leak.
+ *
+ * ERETU, the only legitimate instruction for returning to userspace from
+ * the kernel under FRED, by design does NOT zero any segment register,
+ * avoiding this problematic behavior.
+ *
+ * As such, leave NULL selector values 0~3 unchanged.
+ */
+static inline u16 fixup_rpl(u16 sel)
+{
+ return sel <= 3 ? sel : sel | 3;
+}
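/*
 * Illustrative values, not from this patch:
 *
 *	fixup_rpl(0x00) == 0x00		(NULL selector, left alone)
 *	fixup_rpl(0x02) == 0x02		(still one of the four NULL values)
 *	fixup_rpl(0x28) == 0x2b		(GDT index 5, RPL forced to 3)
 */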
+
+#ifdef CONFIG_IA32_EMULATION
+#include <asm/unistd_32_ia32.h>
+
+static inline void reload_segments(struct sigcontext_32 *sc)
+{
+ u16 cur;
+
+ /*
+ * Reload fs and gs if they have changed in the signal
+ * handler. This does not handle long fs/gs base changes in
+ * the handler, but does not clobber them at least in the
+ * normal case.
+ */
+ savesegment(gs, cur);
+ if (fixup_rpl(sc->gs) != cur)
+ load_gs_index(fixup_rpl(sc->gs));
+ savesegment(fs, cur);
+ if (fixup_rpl(sc->fs) != cur)
+ loadsegment(fs, fixup_rpl(sc->fs));
+
+ savesegment(ds, cur);
+ if (fixup_rpl(sc->ds) != cur)
+ loadsegment(ds, fixup_rpl(sc->ds));
+ savesegment(es, cur);
+ if (fixup_rpl(sc->es) != cur)
+ loadsegment(es, fixup_rpl(sc->es));
+}
+
+#define sigset32_t compat_sigset_t
+#define siginfo32_t compat_siginfo_t
+#define restore_altstack32 compat_restore_altstack
+#define unsafe_save_altstack32 unsafe_compat_save_altstack
+
+#else
+
+#define sigset32_t sigset_t
+#define siginfo32_t siginfo_t
+#define __NR_ia32_sigreturn __NR_sigreturn
+#define __NR_ia32_rt_sigreturn __NR_rt_sigreturn
+#define restore_altstack32 restore_altstack
+#define unsafe_save_altstack32 unsafe_save_altstack
+#define __copy_siginfo_to_user32 copy_siginfo_to_user
+
+#endif
+
+/*
+ * Do a signal return; undo the signal stack.
+ */
+static bool ia32_restore_sigcontext(struct pt_regs *regs,
+ struct sigcontext_32 __user *usc)
+{
+ struct sigcontext_32 sc;
+
+ /* Always make any pending restarted system calls return -EINTR */
+ current->restart_block.fn = do_no_restart_syscall;
+
+ if (unlikely(copy_from_user(&sc, usc, sizeof(sc))))
+ return false;
+
+ /* Get only the ia32 registers. */
+ regs->bx = sc.bx;
+ regs->cx = sc.cx;
+ regs->dx = sc.dx;
+ regs->si = sc.si;
+ regs->di = sc.di;
+ regs->bp = sc.bp;
+ regs->ax = sc.ax;
+ regs->sp = sc.sp;
+ regs->ip = sc.ip;
+
+ /* Get CS/SS and force CPL3 */
+ regs->cs = sc.cs | 0x03;
+ regs->ss = sc.ss | 0x03;
+
+ regs->flags = (regs->flags & ~FIX_EFLAGS) | (sc.flags & FIX_EFLAGS);
+ /* disable syscall checks */
+ regs->orig_ax = -1;
+
+#ifdef CONFIG_IA32_EMULATION
+ reload_segments(&sc);
+#else
+ loadsegment(gs, fixup_rpl(sc.gs));
+ regs->fs = fixup_rpl(sc.fs);
+ regs->es = fixup_rpl(sc.es);
+ regs->ds = fixup_rpl(sc.ds);
+#endif
+
+ return fpu__restore_sig(compat_ptr(sc.fpstate), 1);
+}
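/*
 * Illustrative arithmetic, not from this patch: ORing 0x03 into CS/SS
 * above forces RPL 3 on whatever selector userspace stored in the context:
 *
 *	0x23 | 0x03 == 0x23	(__USER32_CS, unchanged)
 *	0x20 | 0x03 == 0x23	(ring-0 RPL smuggled in, demoted to ring 3)
 */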
+
+SYSCALL32_DEFINE0(sigreturn)
+{
+ struct pt_regs *regs = current_pt_regs();
+ struct sigframe_ia32 __user *frame = (struct sigframe_ia32 __user *)(regs->sp-8);
+ sigset_t set;
+
+ prevent_single_step_upon_eretu(regs);
+
+ if (!access_ok(frame, sizeof(*frame)))
+ goto badframe;
+ if (__get_user(set.sig[0], &frame->sc.oldmask)
+ || __get_user(((__u32 *)&set)[1], &frame->extramask[0]))
+ goto badframe;
+
+ set_current_blocked(&set);
+
+ if (!ia32_restore_sigcontext(regs, &frame->sc))
+ goto badframe;
+ return regs->ax;
+
+badframe:
+ signal_fault(regs, frame, "32bit sigreturn");
+ return 0;
+}
+
+SYSCALL32_DEFINE0(rt_sigreturn)
+{
+ struct pt_regs *regs = current_pt_regs();
+ struct rt_sigframe_ia32 __user *frame;
+ sigset_t set;
+
+ prevent_single_step_upon_eretu(regs);
+
+ frame = (struct rt_sigframe_ia32 __user *)(regs->sp - 4);
+
+ if (!access_ok(frame, sizeof(*frame)))
+ goto badframe;
+ if (__get_user(*(__u64 *)&set, (__u64 __user *)&frame->uc.uc_sigmask))
+ goto badframe;
+
+ set_current_blocked(&set);
+
+ if (!ia32_restore_sigcontext(regs, &frame->uc.uc_mcontext))
+ goto badframe;
+
+ if (restore_altstack32(&frame->uc.uc_stack))
+ goto badframe;
+
+ return regs->ax;
+
+badframe:
+ signal_fault(regs, frame, "32bit rt sigreturn");
+ return 0;
+}
+
+/*
+ * Set up a signal frame.
+ */
+
+#define get_user_seg(seg) ({ unsigned int v; savesegment(seg, v); v; })
+
+static __always_inline int
+__unsafe_setup_sigcontext32(struct sigcontext_32 __user *sc,
+ void __user *fpstate,
+ struct pt_regs *regs, unsigned int mask)
+{
+ unsafe_put_user(get_user_seg(gs), (unsigned int __user *)&sc->gs, Efault);
+#ifdef CONFIG_IA32_EMULATION
+ unsafe_put_user(get_user_seg(fs), (unsigned int __user *)&sc->fs, Efault);
+ unsafe_put_user(get_user_seg(ds), (unsigned int __user *)&sc->ds, Efault);
+ unsafe_put_user(get_user_seg(es), (unsigned int __user *)&sc->es, Efault);
+#else
+ unsafe_put_user(regs->fs, (unsigned int __user *)&sc->fs, Efault);
+ unsafe_put_user(regs->es, (unsigned int __user *)&sc->es, Efault);
+ unsafe_put_user(regs->ds, (unsigned int __user *)&sc->ds, Efault);
+#endif
+
+ unsafe_put_user(regs->di, &sc->di, Efault);
+ unsafe_put_user(regs->si, &sc->si, Efault);
+ unsafe_put_user(regs->bp, &sc->bp, Efault);
+ unsafe_put_user(regs->sp, &sc->sp, Efault);
+ unsafe_put_user(regs->bx, &sc->bx, Efault);
+ unsafe_put_user(regs->dx, &sc->dx, Efault);
+ unsafe_put_user(regs->cx, &sc->cx, Efault);
+ unsafe_put_user(regs->ax, &sc->ax, Efault);
+ unsafe_put_user(current->thread.trap_nr, &sc->trapno, Efault);
+ unsafe_put_user(current->thread.error_code, &sc->err, Efault);
+ unsafe_put_user(regs->ip, &sc->ip, Efault);
+ unsafe_put_user(regs->cs, (unsigned int __user *)&sc->cs, Efault);
+ unsafe_put_user(regs->flags, &sc->flags, Efault);
+ unsafe_put_user(regs->sp, &sc->sp_at_signal, Efault);
+ unsafe_put_user(regs->ss, (unsigned int __user *)&sc->ss, Efault);
+
+ unsafe_put_user(ptr_to_compat(fpstate), &sc->fpstate, Efault);
+
+ /* non-iBCS2 extensions.. */
+ unsafe_put_user(mask, &sc->oldmask, Efault);
+ unsafe_put_user(current->thread.cr2, &sc->cr2, Efault);
+ return 0;
+
+Efault:
+ return -EFAULT;
+}
+
+#define unsafe_put_sigcontext32(sc, fp, regs, set, label) \
+do { \
+ if (__unsafe_setup_sigcontext32(sc, fp, regs, set->sig[0])) \
+ goto label; \
+} while(0)
+
+int ia32_setup_frame(struct ksignal *ksig, struct pt_regs *regs)
+{
+ sigset32_t *set = (sigset32_t *) sigmask_to_save();
+ struct sigframe_ia32 __user *frame;
+ void __user *restorer;
+ void __user *fp = NULL;
+
+ /* copy_to_user optimizes that into a single 8 byte store */
+ static const struct {
+ u16 poplmovl;
+ u32 val;
+ u16 int80;
+ } __attribute__((packed)) code = {
+ 0xb858, /* popl %eax ; movl $...,%eax */
+ __NR_ia32_sigreturn,
+ 0x80cd, /* int $0x80 */
+ };
+
+ frame = get_sigframe(ksig, regs, sizeof(*frame), &fp);
+
+ if (ksig->ka.sa.sa_flags & SA_RESTORER) {
+ restorer = ksig->ka.sa.sa_restorer;
+ } else {
+ /* Return stub is in 32bit vsyscall page */
+ if (current->mm->context.vdso)
+ restorer = current->mm->context.vdso +
+ vdso_image_32.sym___kernel_sigreturn;
+ else
+ restorer = &frame->retcode;
+ }
+
+ if (!user_access_begin(frame, sizeof(*frame)))
+ return -EFAULT;
+
+ unsafe_put_user(ksig->sig, &frame->sig, Efault);
+ unsafe_put_sigcontext32(&frame->sc, fp, regs, set, Efault);
+ unsafe_put_user(set->sig[1], &frame->extramask[0], Efault);
+ unsafe_put_user(ptr_to_compat(restorer), &frame->pretcode, Efault);
+ /*
+ * These are actually not used anymore, but left because some
+ * gdb versions depend on them as a marker.
+ */
+ unsafe_put_user(*((u64 *)&code), (u64 __user *)frame->retcode, Efault);
+ user_access_end();
+
+ /* Set up registers for signal handler */
+ regs->sp = (unsigned long) frame;
+ regs->ip = (unsigned long) ksig->ka.sa.sa_handler;
+
+ /* Make -mregparm=3 work */
+ regs->ax = ksig->sig;
+ regs->dx = 0;
+ regs->cx = 0;
+
+#ifdef CONFIG_IA32_EMULATION
+ loadsegment(ds, __USER_DS);
+ loadsegment(es, __USER_DS);
+#else
+ regs->ds = __USER_DS;
+ regs->es = __USER_DS;
+#endif
+
+ regs->cs = __USER32_CS;
+ regs->ss = __USER_DS;
+
+ return 0;
+Efault:
+ user_access_end();
+ return -EFAULT;
+}
+
+int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
+{
+ sigset32_t *set = (sigset32_t *) sigmask_to_save();
+ struct rt_sigframe_ia32 __user *frame;
+ void __user *restorer;
+ void __user *fp = NULL;
+
+ /* unsafe_put_user optimizes that into a single 8 byte store */
+ static const struct {
+ u8 movl;
+ u32 val;
+ u16 int80;
+ u8 pad;
+ } __attribute__((packed)) code = {
+ 0xb8,
+ __NR_ia32_rt_sigreturn,
+ 0x80cd,
+ 0,
+ };
+
+ frame = get_sigframe(ksig, regs, sizeof(*frame), &fp);
+
+ if (!user_access_begin(frame, sizeof(*frame)))
+ return -EFAULT;
+
+ unsafe_put_user(ksig->sig, &frame->sig, Efault);
+ unsafe_put_user(ptr_to_compat(&frame->info), &frame->pinfo, Efault);
+ unsafe_put_user(ptr_to_compat(&frame->uc), &frame->puc, Efault);
+
+ /* Create the ucontext. */
+ if (static_cpu_has(X86_FEATURE_XSAVE))
+ unsafe_put_user(UC_FP_XSTATE, &frame->uc.uc_flags, Efault);
+ else
+ unsafe_put_user(0, &frame->uc.uc_flags, Efault);
+ unsafe_put_user(0, &frame->uc.uc_link, Efault);
+ unsafe_save_altstack32(&frame->uc.uc_stack, regs->sp, Efault);
+
+ if (ksig->ka.sa.sa_flags & SA_RESTORER)
+ restorer = ksig->ka.sa.sa_restorer;
+ else
+ restorer = current->mm->context.vdso +
+ vdso_image_32.sym___kernel_rt_sigreturn;
+ unsafe_put_user(ptr_to_compat(restorer), &frame->pretcode, Efault);
+
+ /*
+ * Not actually used anymore, but left because some gdb
+ * versions need it.
+ */
+ unsafe_put_user(*((u64 *)&code), (u64 __user *)frame->retcode, Efault);
+ unsafe_put_sigcontext32(&frame->uc.uc_mcontext, fp, regs, set, Efault);
+ unsafe_put_user(*(__u64 *)set, (__u64 __user *)&frame->uc.uc_sigmask, Efault);
+ user_access_end();
+
+ if (__copy_siginfo_to_user32(&frame->info, &ksig->info))
+ return -EFAULT;
+
+ /* Set up registers for signal handler */
+ regs->sp = (unsigned long) frame;
+ regs->ip = (unsigned long) ksig->ka.sa.sa_handler;
+
+ /* Make -mregparm=3 work */
+ regs->ax = ksig->sig;
+ regs->dx = (unsigned long) &frame->info;
+ regs->cx = (unsigned long) &frame->uc;
+
+#ifdef CONFIG_IA32_EMULATION
+ loadsegment(ds, __USER_DS);
+ loadsegment(es, __USER_DS);
+#else
+ regs->ds = __USER_DS;
+ regs->es = __USER_DS;
+#endif
+
+ regs->cs = __USER32_CS;
+ regs->ss = __USER_DS;
+
+ return 0;
+Efault:
+ user_access_end();
+ return -EFAULT;
+}
+
+/*
+ * The siginfo_t structure and handling code is very easy
+ * to break in several ways. It must always be updated when
+ * changes are made to the main siginfo_t, and
+ * copy_siginfo_to_user32() must be updated when the
+ * (arch-independent) copy_siginfo_to_user() is updated.
+ *
+ * It is also easy to put a new member in the siginfo_t
+ * which has implicit alignment which can move internal structure
+ * alignment around breaking the ABI. This can happen if you,
+ * for instance, put a plain 64-bit value in there.
+ */
+
+/*
+ * If adding a new si_code, there is probably new data in
+ * the siginfo. Make sure folks bumping the si_code
+ * limits also have to look at this code. Make sure any
+ * new fields are handled in copy_siginfo_to_user32()!
+ */
+static_assert(NSIGILL == 11);
+static_assert(NSIGFPE == 15);
+static_assert(NSIGSEGV == 10);
+static_assert(NSIGBUS == 5);
+static_assert(NSIGTRAP == 6);
+static_assert(NSIGCHLD == 6);
+static_assert(NSIGSYS == 2);
+
+/* This is part of the ABI and can never change in size: */
+static_assert(sizeof(siginfo32_t) == 128);
+
+/* This is a part of the ABI and can never change in alignment */
+static_assert(__alignof__(siginfo32_t) == 4);
+
+/*
+ * The offsets of all the (unioned) si_fields are fixed
+ * in the ABI, of course. Make sure none of them ever
+ * moves and that they always stay at the beginning:
+ */
+static_assert(offsetof(siginfo32_t, _sifields) == 3 * sizeof(int));
+
+static_assert(offsetof(siginfo32_t, si_signo) == 0);
+static_assert(offsetof(siginfo32_t, si_errno) == 4);
+static_assert(offsetof(siginfo32_t, si_code) == 8);
+
+/*
+ * Ensure that the size of each si_field never changes.
+ * If it does, it is a sign that the
+ * copy_siginfo_to_user32() code below needs to be updated
+ * along with the size in the CHECK_SI_SIZE().
+ *
+ * We repeat this check for both the generic and compat
+ * siginfos.
+ *
+ * Note: it is OK for these to grow as long as the whole
+ * structure stays within the padding size (checked
+ * above).
+ */
+
+#define CHECK_SI_OFFSET(name) \
+ static_assert(offsetof(siginfo32_t, _sifields) == \
+ offsetof(siginfo32_t, _sifields.name))
+
+#define CHECK_SI_SIZE(name, size) \
+ static_assert(sizeof_field(siginfo32_t, _sifields.name) == size)
+
+CHECK_SI_OFFSET(_kill);
+CHECK_SI_SIZE (_kill, 2*sizeof(int));
+static_assert(offsetof(siginfo32_t, si_pid) == 0xC);
+static_assert(offsetof(siginfo32_t, si_uid) == 0x10);
+
+CHECK_SI_OFFSET(_timer);
+#ifdef CONFIG_COMPAT
+/* compat_siginfo_t doesn't have si_sys_private */
+CHECK_SI_SIZE (_timer, 3*sizeof(int));
+#else
+CHECK_SI_SIZE (_timer, 4*sizeof(int));
+#endif
+static_assert(offsetof(siginfo32_t, si_tid) == 0x0C);
+static_assert(offsetof(siginfo32_t, si_overrun) == 0x10);
+static_assert(offsetof(siginfo32_t, si_value) == 0x14);
+
+CHECK_SI_OFFSET(_rt);
+CHECK_SI_SIZE (_rt, 3*sizeof(int));
+static_assert(offsetof(siginfo32_t, si_pid) == 0x0C);
+static_assert(offsetof(siginfo32_t, si_uid) == 0x10);
+static_assert(offsetof(siginfo32_t, si_value) == 0x14);
+
+CHECK_SI_OFFSET(_sigchld);
+CHECK_SI_SIZE (_sigchld, 5*sizeof(int));
+static_assert(offsetof(siginfo32_t, si_pid) == 0x0C);
+static_assert(offsetof(siginfo32_t, si_uid) == 0x10);
+static_assert(offsetof(siginfo32_t, si_status) == 0x14);
+static_assert(offsetof(siginfo32_t, si_utime) == 0x18);
+static_assert(offsetof(siginfo32_t, si_stime) == 0x1C);
+
+CHECK_SI_OFFSET(_sigfault);
+CHECK_SI_SIZE (_sigfault, 4*sizeof(int));
+static_assert(offsetof(siginfo32_t, si_addr) == 0x0C);
+
+static_assert(offsetof(siginfo32_t, si_trapno) == 0x10);
+
+static_assert(offsetof(siginfo32_t, si_addr_lsb) == 0x10);
+
+static_assert(offsetof(siginfo32_t, si_lower) == 0x14);
+static_assert(offsetof(siginfo32_t, si_upper) == 0x18);
+
+static_assert(offsetof(siginfo32_t, si_pkey) == 0x14);
+
+static_assert(offsetof(siginfo32_t, si_perf_data) == 0x10);
+static_assert(offsetof(siginfo32_t, si_perf_type) == 0x14);
+static_assert(offsetof(siginfo32_t, si_perf_flags) == 0x18);
+
+CHECK_SI_OFFSET(_sigpoll);
+CHECK_SI_SIZE (_sigpoll, 2*sizeof(int));
+static_assert(offsetof(siginfo32_t, si_band) == 0x0C);
+static_assert(offsetof(siginfo32_t, si_fd) == 0x10);
+
+CHECK_SI_OFFSET(_sigsys);
+CHECK_SI_SIZE (_sigsys, 3*sizeof(int));
+static_assert(offsetof(siginfo32_t, si_call_addr) == 0x0C);
+static_assert(offsetof(siginfo32_t, si_syscall) == 0x10);
+static_assert(offsetof(siginfo32_t, si_arch) == 0x14);
+
+/* any new si_fields should be added here */
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
new file mode 100644
index 000000000000..d483b585c6c6
--- /dev/null
+++ b/arch/x86/kernel/signal_64.c
@@ -0,0 +1,526 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ * Copyright (C) 2000, 2001, 2002 Andi Kleen SuSE Labs
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/unistd.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+
+#include <asm/ucontext.h>
+#include <asm/fpu/signal.h>
+#include <asm/sighandling.h>
+
+#include <asm/syscall.h>
+#include <asm/sigframe.h>
+#include <asm/signal.h>
+
+/*
+ * If regs->ss will cause an IRET fault, change it. Otherwise leave it
+ * alone. Using this generally makes no sense unless
+ * user_64bit_mode(regs) would return true.
+ */
+static void force_valid_ss(struct pt_regs *regs)
+{
+ u32 ar;
+ asm volatile ("lar %[old_ss], %[ar]\n\t"
+ "jz 1f\n\t" /* If invalid: */
+ "xorl %[ar], %[ar]\n\t" /* set ar = 0 */
+ "1:"
+ : [ar] "=r" (ar)
+ : [old_ss] "rm" ((u16)regs->ss));
+
+ /*
+ * For a valid 64-bit user context, we need DPL 3, type
+ * read-write data or read-write exp-down data, and S and P
+ * set. We can't use VERW because VERW doesn't check the
+ * P bit.
+ */
+ ar &= AR_DPL_MASK | AR_S | AR_P | AR_TYPE_MASK;
+ if (ar != (AR_DPL3 | AR_S | AR_P | AR_TYPE_RWDATA) &&
+ ar != (AR_DPL3 | AR_S | AR_P | AR_TYPE_RWDATA_EXPDOWN))
+ regs->ss = __USER_DS;
+}
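/*
 * Illustrative arithmetic, not from this patch, assuming the usual LAR
 * access-rights encoding (P bit 15, DPL bits 13-14, S bit 12, type bits
 * 9-11): a flat read-write user data segment masks down to
 *
 *	AR_P | AR_DPL3 | AR_S | AR_TYPE_RWDATA
 *	  == 0x8000 | 0x6000 | 0x1000 | 0x200 == 0xf200
 *
 * and the expand-down variant to 0xf600; anything else gets __USER_DS.
 */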
+
+static bool restore_sigcontext(struct pt_regs *regs,
+ struct sigcontext __user *usc,
+ unsigned long uc_flags)
+{
+ struct sigcontext sc;
+
+ /* Always make any pending restarted system calls return -EINTR */
+ current->restart_block.fn = do_no_restart_syscall;
+
+ if (copy_from_user(&sc, usc, offsetof(struct sigcontext, reserved1)))
+ return false;
+
+ regs->bx = sc.bx;
+ regs->cx = sc.cx;
+ regs->dx = sc.dx;
+ regs->si = sc.si;
+ regs->di = sc.di;
+ regs->bp = sc.bp;
+ regs->ax = sc.ax;
+ regs->sp = sc.sp;
+ regs->ip = sc.ip;
+ regs->r8 = sc.r8;
+ regs->r9 = sc.r9;
+ regs->r10 = sc.r10;
+ regs->r11 = sc.r11;
+ regs->r12 = sc.r12;
+ regs->r13 = sc.r13;
+ regs->r14 = sc.r14;
+ regs->r15 = sc.r15;
+
+ /* Get CS/SS and force CPL3 */
+ regs->cs = sc.cs | 0x03;
+ regs->ss = sc.ss | 0x03;
+
+ regs->flags = (regs->flags & ~FIX_EFLAGS) | (sc.flags & FIX_EFLAGS);
+ /* disable syscall checks */
+ regs->orig_ax = -1;
+
+ /*
+ * Fix up SS if needed for the benefit of old DOSEMU and
+ * CRIU.
+ */
+ if (unlikely(!(uc_flags & UC_STRICT_RESTORE_SS) && user_64bit_mode(regs)))
+ force_valid_ss(regs);
+
+ return fpu__restore_sig((void __user *)sc.fpstate, 0);
+}
+
+static __always_inline int
+__unsafe_setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate,
+ struct pt_regs *regs, unsigned long mask)
+{
+ unsafe_put_user(regs->di, &sc->di, Efault);
+ unsafe_put_user(regs->si, &sc->si, Efault);
+ unsafe_put_user(regs->bp, &sc->bp, Efault);
+ unsafe_put_user(regs->sp, &sc->sp, Efault);
+ unsafe_put_user(regs->bx, &sc->bx, Efault);
+ unsafe_put_user(regs->dx, &sc->dx, Efault);
+ unsafe_put_user(regs->cx, &sc->cx, Efault);
+ unsafe_put_user(regs->ax, &sc->ax, Efault);
+ unsafe_put_user(regs->r8, &sc->r8, Efault);
+ unsafe_put_user(regs->r9, &sc->r9, Efault);
+ unsafe_put_user(regs->r10, &sc->r10, Efault);
+ unsafe_put_user(regs->r11, &sc->r11, Efault);
+ unsafe_put_user(regs->r12, &sc->r12, Efault);
+ unsafe_put_user(regs->r13, &sc->r13, Efault);
+ unsafe_put_user(regs->r14, &sc->r14, Efault);
+ unsafe_put_user(regs->r15, &sc->r15, Efault);
+
+ unsafe_put_user(current->thread.trap_nr, &sc->trapno, Efault);
+ unsafe_put_user(current->thread.error_code, &sc->err, Efault);
+ unsafe_put_user(regs->ip, &sc->ip, Efault);
+ unsafe_put_user(regs->flags, &sc->flags, Efault);
+ unsafe_put_user(regs->cs, &sc->cs, Efault);
+ unsafe_put_user(0, &sc->gs, Efault);
+ unsafe_put_user(0, &sc->fs, Efault);
+ unsafe_put_user(regs->ss, &sc->ss, Efault);
+
+ unsafe_put_user(fpstate, (unsigned long __user *)&sc->fpstate, Efault);
+
+ /* non-iBCS2 extensions.. */
+ unsafe_put_user(mask, &sc->oldmask, Efault);
+ unsafe_put_user(current->thread.cr2, &sc->cr2, Efault);
+ return 0;
+Efault:
+ return -EFAULT;
+}
+
+#define unsafe_put_sigcontext(sc, fp, regs, set, label) \
+do { \
+ if (__unsafe_setup_sigcontext(sc, fp, regs, set->sig[0])) \
+ goto label; \
+} while(0)
+
+#define unsafe_put_sigmask(set, frame, label) \
+ unsafe_put_user(*(__u64 *)(set), \
+ (__u64 __user *)&(frame)->uc.uc_sigmask, \
+ label)
+
+static unsigned long frame_uc_flags(struct pt_regs *regs)
+{
+ unsigned long flags;
+
+ if (boot_cpu_has(X86_FEATURE_XSAVE))
+ flags = UC_FP_XSTATE | UC_SIGCONTEXT_SS;
+ else
+ flags = UC_SIGCONTEXT_SS;
+
+ if (likely(user_64bit_mode(regs)))
+ flags |= UC_STRICT_RESTORE_SS;
+
+ return flags;
+}
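/*
 * Illustrative values, not from this patch, assuming the uapi encodings
 * UC_FP_XSTATE == 0x1, UC_SIGCONTEXT_SS == 0x2 and UC_STRICT_RESTORE_SS ==
 * 0x4: an XSAVE-capable CPU delivering to 64-bit code gets uc_flags ==
 * 0x1 | 0x2 | 0x4 == 0x7.
 */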
+
+int x64_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
+{
+ sigset_t *set = sigmask_to_save();
+ struct rt_sigframe __user *frame;
+ void __user *fp = NULL;
+ unsigned long uc_flags;
+
+ /* x86-64 should always use SA_RESTORER. */
+ if (!(ksig->ka.sa.sa_flags & SA_RESTORER))
+ return -EFAULT;
+
+ frame = get_sigframe(ksig, regs, sizeof(struct rt_sigframe), &fp);
+ uc_flags = frame_uc_flags(regs);
+
+ if (!user_access_begin(frame, sizeof(*frame)))
+ return -EFAULT;
+
+ /* Create the ucontext. */
+ unsafe_put_user(uc_flags, &frame->uc.uc_flags, Efault);
+ unsafe_put_user(0, &frame->uc.uc_link, Efault);
+ unsafe_save_altstack(&frame->uc.uc_stack, regs->sp, Efault);
+
+ /* Set up to return from userspace. If provided, use a stub
+ already in userspace. */
+ unsafe_put_user(ksig->ka.sa.sa_restorer, &frame->pretcode, Efault);
+ unsafe_put_sigcontext(&frame->uc.uc_mcontext, fp, regs, set, Efault);
+ unsafe_put_sigmask(set, frame, Efault);
+ user_access_end();
+
+ if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
+ if (copy_siginfo_to_user(&frame->info, &ksig->info))
+ return -EFAULT;
+ }
+
+ if (setup_signal_shadow_stack(ksig))
+ return -EFAULT;
+
+ /* Set up registers for signal handler */
+ regs->di = ksig->sig;
+ /* In case the signal handler was declared without prototypes */
+ regs->ax = 0;
+
+ /* This also works for non SA_SIGINFO handlers because they expect the
+ next argument after the signal number on the stack. */
+ regs->si = (unsigned long)&frame->info;
+ regs->dx = (unsigned long)&frame->uc;
+ regs->ip = (unsigned long) ksig->ka.sa.sa_handler;
+
+ regs->sp = (unsigned long)frame;
+
+ /*
+ * Set up the CS and SS registers to run signal handlers in
+ * 64-bit mode, even if the handler happens to be interrupting
+ * 32-bit or 16-bit code.
+ *
+ * SS is subtle. In 64-bit mode, we don't need any particular
+ * SS descriptor, but we do need SS to be valid. It's possible
+ * that the old SS is entirely bogus -- this can happen if the
+ * signal we're trying to deliver is #GP or #SS caused by a bad
+ * SS value. We also have a compatibility issue here: DOSEMU
+ * relies on the contents of the SS register indicating the
+ * SS value at the time of the signal, even though that code in
+ * DOSEMU predates sigreturn's ability to restore SS. (DOSEMU
+ * avoids relying on sigreturn to restore SS; instead it uses
+ * a trampoline.) So we do our best: if the old SS was valid,
+ * we keep it. Otherwise we replace it.
+ */
+ regs->cs = __USER_CS;
+
+ if (unlikely(regs->ss != __USER_DS))
+ force_valid_ss(regs);
+
+ return 0;
+
+Efault:
+ user_access_end();
+ return -EFAULT;
+}
+
+/*
+ * Do a signal return; undo the signal stack.
+ */
+SYSCALL_DEFINE0(rt_sigreturn)
+{
+ struct pt_regs *regs = current_pt_regs();
+ struct rt_sigframe __user *frame;
+ sigset_t set;
+ unsigned long uc_flags;
+
+ prevent_single_step_upon_eretu(regs);
+
+ frame = (struct rt_sigframe __user *)(regs->sp - sizeof(long));
+ if (!access_ok(frame, sizeof(*frame)))
+ goto badframe;
+ if (__get_user(*(__u64 *)&set, (__u64 __user *)&frame->uc.uc_sigmask))
+ goto badframe;
+ if (__get_user(uc_flags, &frame->uc.uc_flags))
+ goto badframe;
+
+ set_current_blocked(&set);
+
+ if (restore_altstack(&frame->uc.uc_stack))
+ goto badframe;
+
+ if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
+ goto badframe;
+
+ if (restore_signal_shadow_stack())
+ goto badframe;
+
+ return regs->ax;
+
+badframe:
+ signal_fault(regs, frame, "rt_sigreturn");
+ return 0;
+}
+
+#ifdef CONFIG_X86_X32_ABI
+static int x32_copy_siginfo_to_user(struct compat_siginfo __user *to,
+ const struct kernel_siginfo *from)
+{
+ struct compat_siginfo new;
+
+ copy_siginfo_to_external32(&new, from);
+ if (from->si_signo == SIGCHLD) {
+ new._sifields._sigchld_x32._utime = from->si_utime;
+ new._sifields._sigchld_x32._stime = from->si_stime;
+ }
+ if (copy_to_user(to, &new, sizeof(struct compat_siginfo)))
+ return -EFAULT;
+ return 0;
+}
+
+int copy_siginfo_to_user32(struct compat_siginfo __user *to,
+ const struct kernel_siginfo *from)
+{
+ if (in_x32_syscall())
+ return x32_copy_siginfo_to_user(to, from);
+ return __copy_siginfo_to_user32(to, from);
+}
+
+int x32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
+{
+ compat_sigset_t *set = (compat_sigset_t *) sigmask_to_save();
+ struct rt_sigframe_x32 __user *frame;
+ unsigned long uc_flags;
+ void __user *restorer;
+ void __user *fp = NULL;
+
+ if (!(ksig->ka.sa.sa_flags & SA_RESTORER))
+ return -EFAULT;
+
+ frame = get_sigframe(ksig, regs, sizeof(*frame), &fp);
+
+ uc_flags = frame_uc_flags(regs);
+
+ if (setup_signal_shadow_stack(ksig))
+ return -EFAULT;
+
+ if (!user_access_begin(frame, sizeof(*frame)))
+ return -EFAULT;
+
+ /* Create the ucontext. */
+ unsafe_put_user(uc_flags, &frame->uc.uc_flags, Efault);
+ unsafe_put_user(0, &frame->uc.uc_link, Efault);
+ unsafe_compat_save_altstack(&frame->uc.uc_stack, regs->sp, Efault);
+ unsafe_put_user(0, &frame->uc.uc__pad0, Efault);
+ restorer = ksig->ka.sa.sa_restorer;
+ unsafe_put_user(restorer, (unsigned long __user *)&frame->pretcode, Efault);
+ unsafe_put_sigcontext(&frame->uc.uc_mcontext, fp, regs, set, Efault);
+ unsafe_put_sigmask(set, frame, Efault);
+ user_access_end();
+
+ if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
+ if (x32_copy_siginfo_to_user(&frame->info, &ksig->info))
+ return -EFAULT;
+ }
+
+ /* Set up registers for signal handler */
+ regs->sp = (unsigned long) frame;
+ regs->ip = (unsigned long) ksig->ka.sa.sa_handler;
+
+ /* We use the x32 calling convention here... */
+ regs->di = ksig->sig;
+ regs->si = (unsigned long) &frame->info;
+ regs->dx = (unsigned long) &frame->uc;
+
+ loadsegment(ds, __USER_DS);
+ loadsegment(es, __USER_DS);
+
+ regs->cs = __USER_CS;
+ regs->ss = __USER_DS;
+
+ return 0;
+
+Efault:
+ user_access_end();
+ return -EFAULT;
+}
+
+COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn)
+{
+ struct pt_regs *regs = current_pt_regs();
+ struct rt_sigframe_x32 __user *frame;
+ sigset_t set;
+ unsigned long uc_flags;
+
+ prevent_single_step_upon_eretu(regs);
+
+ frame = (struct rt_sigframe_x32 __user *)(regs->sp - 8);
+
+ if (!access_ok(frame, sizeof(*frame)))
+ goto badframe;
+ if (__get_user(set.sig[0], (__u64 __user *)&frame->uc.uc_sigmask))
+ goto badframe;
+ if (__get_user(uc_flags, &frame->uc.uc_flags))
+ goto badframe;
+
+ set_current_blocked(&set);
+
+ if (!restore_sigcontext(regs, &frame->uc.uc_mcontext, uc_flags))
+ goto badframe;
+
+ if (restore_signal_shadow_stack())
+ goto badframe;
+
+ if (compat_restore_altstack(&frame->uc.uc_stack))
+ goto badframe;
+
+ return regs->ax;
+
+badframe:
+ signal_fault(regs, frame, "x32 rt_sigreturn");
+ return 0;
+}
+#endif /* CONFIG_X86_X32_ABI */
+
+#ifdef CONFIG_COMPAT
+void sigaction_compat_abi(struct k_sigaction *act, struct k_sigaction *oact)
+{
+ if (!act)
+ return;
+
+ if (in_ia32_syscall())
+ act->sa.sa_flags |= SA_IA32_ABI;
+ if (in_x32_syscall())
+ act->sa.sa_flags |= SA_X32_ABI;
+}
+#endif /* CONFIG_COMPAT */
+
+/*
+ * If adding a new si_code, there is probably new data in
+ * the siginfo. Make sure folks bumping the si_code
+ * limits also have to look at this code. Make sure any
+ * new fields are handled in copy_siginfo_to_user32()!
+ */
+static_assert(NSIGILL == 11);
+static_assert(NSIGFPE == 15);
+static_assert(NSIGSEGV == 10);
+static_assert(NSIGBUS == 5);
+static_assert(NSIGTRAP == 6);
+static_assert(NSIGCHLD == 6);
+static_assert(NSIGSYS == 2);
+
+/* This is part of the ABI and can never change in size: */
+static_assert(sizeof(siginfo_t) == 128);
+
+/* This is a part of the ABI and can never change in alignment */
+static_assert(__alignof__(siginfo_t) == 8);
+
+/*
+ * The offsets of all the (unioned) si_fields are fixed
+ * in the ABI, of course. Make sure none of them ever
+ * moves and that they always stay at the beginning:
+ */
+static_assert(offsetof(siginfo_t, si_signo) == 0);
+static_assert(offsetof(siginfo_t, si_errno) == 4);
+static_assert(offsetof(siginfo_t, si_code) == 8);
+
+/*
+ * Ensure that the size of each si_field never changes.
+ * If it does, it is a sign that the
+ * copy_siginfo_to_user32() code below needs to be updated
+ * along with the size in the CHECK_SI_SIZE().
+ *
+ * We repeat this check for both the generic and compat
+ * siginfos.
+ *
+ * Note: it is OK for these to grow as long as the whole
+ * structure stays within the padding size (checked
+ * above).
+ */
+
+#define CHECK_SI_OFFSET(name) \
+ static_assert(offsetof(siginfo_t, _sifields) == \
+ offsetof(siginfo_t, _sifields.name))
+#define CHECK_SI_SIZE(name, size) \
+ static_assert(sizeof_field(siginfo_t, _sifields.name) == size)
+
+CHECK_SI_OFFSET(_kill);
+CHECK_SI_SIZE (_kill, 2*sizeof(int));
+static_assert(offsetof(siginfo_t, si_pid) == 0x10);
+static_assert(offsetof(siginfo_t, si_uid) == 0x14);
+
+CHECK_SI_OFFSET(_timer);
+CHECK_SI_SIZE (_timer, 6*sizeof(int));
+static_assert(offsetof(siginfo_t, si_tid) == 0x10);
+static_assert(offsetof(siginfo_t, si_overrun) == 0x14);
+static_assert(offsetof(siginfo_t, si_value) == 0x18);
+
+CHECK_SI_OFFSET(_rt);
+CHECK_SI_SIZE (_rt, 4*sizeof(int));
+static_assert(offsetof(siginfo_t, si_pid) == 0x10);
+static_assert(offsetof(siginfo_t, si_uid) == 0x14);
+static_assert(offsetof(siginfo_t, si_value) == 0x18);
+
+CHECK_SI_OFFSET(_sigchld);
+CHECK_SI_SIZE (_sigchld, 8*sizeof(int));
+static_assert(offsetof(siginfo_t, si_pid) == 0x10);
+static_assert(offsetof(siginfo_t, si_uid) == 0x14);
+static_assert(offsetof(siginfo_t, si_status) == 0x18);
+static_assert(offsetof(siginfo_t, si_utime) == 0x20);
+static_assert(offsetof(siginfo_t, si_stime) == 0x28);
+
+#ifdef CONFIG_X86_X32_ABI
+/* no _sigchld_x32 in the generic siginfo_t */
+static_assert(sizeof_field(compat_siginfo_t, _sifields._sigchld_x32) ==
+ 7*sizeof(int));
+static_assert(offsetof(compat_siginfo_t, _sifields) ==
+ offsetof(compat_siginfo_t, _sifields._sigchld_x32));
+static_assert(offsetof(compat_siginfo_t, _sifields._sigchld_x32._utime) == 0x18);
+static_assert(offsetof(compat_siginfo_t, _sifields._sigchld_x32._stime) == 0x20);
+#endif
+
+CHECK_SI_OFFSET(_sigfault);
+CHECK_SI_SIZE (_sigfault, 8*sizeof(int));
+static_assert(offsetof(siginfo_t, si_addr) == 0x10);
+
+static_assert(offsetof(siginfo_t, si_trapno) == 0x18);
+
+static_assert(offsetof(siginfo_t, si_addr_lsb) == 0x18);
+
+static_assert(offsetof(siginfo_t, si_lower) == 0x20);
+static_assert(offsetof(siginfo_t, si_upper) == 0x28);
+
+static_assert(offsetof(siginfo_t, si_pkey) == 0x20);
+
+static_assert(offsetof(siginfo_t, si_perf_data) == 0x18);
+static_assert(offsetof(siginfo_t, si_perf_type) == 0x20);
+static_assert(offsetof(siginfo_t, si_perf_flags) == 0x24);
+
+CHECK_SI_OFFSET(_sigpoll);
+CHECK_SI_SIZE (_sigpoll, 4*sizeof(int));
+static_assert(offsetof(siginfo_t, si_band) == 0x10);
+static_assert(offsetof(siginfo_t, si_fd) == 0x18);
+
+CHECK_SI_OFFSET(_sigsys);
+CHECK_SI_SIZE (_sigsys, 4*sizeof(int));
+static_assert(offsetof(siginfo_t, si_call_addr) == 0x10);
+static_assert(offsetof(siginfo_t, si_syscall) == 0x18);
+static_assert(offsetof(siginfo_t, si_arch) == 0x1C);
+
+/* any new si_fields should be added here */
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index d801210945d6..b014e6d229f9 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/*
* Intel SMP support routines.
*
@@ -6,9 +7,6 @@
* (c) 2002,2003 Andi Kleen, SuSE Labs.
*
* i386 and x86_64 integration by Glauber Costa <gcosta@redhat.com>
- *
- * This code is released under the GNU General Public License version 2 or
- * later.
*/
#include <linux/init.h>
@@ -16,18 +14,28 @@
#include <linux/mm.h>
#include <linux/delay.h>
#include <linux/spinlock.h>
+#include <linux/export.h>
#include <linux/kernel_stat.h>
#include <linux/mc146818rtc.h>
#include <linux/cache.h>
#include <linux/interrupt.h>
#include <linux/cpu.h>
#include <linux/gfp.h>
+#include <linux/kexec.h>
#include <asm/mtrr.h>
#include <asm/tlbflush.h>
#include <asm/mmu_context.h>
#include <asm/proto.h>
#include <asm/apic.h>
+#include <asm/cpu.h>
+#include <asm/idtentry.h>
+#include <asm/nmi.h>
+#include <asm/mce.h>
+#include <asm/trace/irq_vectors.h>
+#include <asm/kexec.h>
+#include <asm/reboot.h>
+
/*
* Some notes on x86 processor bugs affecting SMP operation:
*
@@ -61,7 +69,7 @@
* 5AP. symmetric IO mode (normal Linux operation) not affected.
* 'noapic' mode has vector 0xf filled out properly.
* 6AP. 'noapic' mode might be affected - fixed in later steppings
- * 7AP. We do not assume writes to the LVT deassering IRQs
+ * 7AP. We do not assume writes to the LVT deasserting IRQs
* 8AP. We do not enable low power mode (deep sleep) during MP bootup
* 9AP. We do not use mixed mode
*
@@ -107,131 +115,183 @@
* about nothing of note with C stepping upwards.
*/
-/*
- * this function sends a 'reschedule' IPI to another CPU.
- * it goes straight through and wastes no time serializing
- * anything. Worst case is that we lose a reschedule ...
- */
-static void native_smp_send_reschedule(int cpu)
-{
- if (unlikely(cpu_is_offline(cpu))) {
- WARN_ON(1);
- return;
- }
- apic->send_IPI_mask(cpumask_of(cpu), RESCHEDULE_VECTOR);
-}
-
-void native_send_call_func_single_ipi(int cpu)
-{
- apic->send_IPI_mask(cpumask_of(cpu), CALL_FUNCTION_SINGLE_VECTOR);
-}
+static atomic_t stopping_cpu = ATOMIC_INIT(-1);
+static bool smp_no_nmi_ipi = false;
-void native_send_call_func_ipi(const struct cpumask *mask)
+static int smp_stop_nmi_callback(unsigned int val, struct pt_regs *regs)
{
- cpumask_var_t allbutself;
-
- if (!alloc_cpumask_var(&allbutself, GFP_ATOMIC)) {
- apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
- return;
- }
-
- cpumask_copy(allbutself, cpu_online_mask);
- cpumask_clear_cpu(smp_processor_id(), allbutself);
+ /* We are registered on stopping cpu too, avoid spurious NMI */
+ if (raw_smp_processor_id() == atomic_read(&stopping_cpu))
+ return NMI_HANDLED;
- if (cpumask_equal(mask, allbutself) &&
- cpumask_equal(cpu_online_mask, cpu_callout_mask))
- apic->send_IPI_allbutself(CALL_FUNCTION_VECTOR);
- else
- apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
+ cpu_emergency_disable_virtualization();
+ stop_this_cpu(NULL);
- free_cpumask_var(allbutself);
+ return NMI_HANDLED;
}
/*
* this function calls the 'stop' function on all other CPUs in the system.
*/
-
-asmlinkage void smp_reboot_interrupt(void)
+DEFINE_IDTENTRY_SYSVEC(sysvec_reboot)
{
- ack_APIC_irq();
- irq_enter();
+ apic_eoi();
+ cpu_emergency_disable_virtualization();
stop_this_cpu(NULL);
- irq_exit();
}
-static void native_smp_send_stop(void)
+static int register_stop_handler(void)
+{
+ return register_nmi_handler(NMI_LOCAL, smp_stop_nmi_callback,
+ NMI_FLAG_FIRST, "smp_stop");
+}
+
+static void native_stop_other_cpus(int wait)
{
- unsigned long flags;
- unsigned long wait;
+ unsigned int old_cpu, this_cpu;
+ unsigned long flags, timeout;
if (reboot_force)
return;
+ /* Only proceed if this is the first CPU to reach this code */
+ old_cpu = -1;
+ this_cpu = smp_processor_id();
+ if (!atomic_try_cmpxchg(&stopping_cpu, &old_cpu, this_cpu))
+ return;
+
+ /* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
+ if (kexec_in_progress)
+ smp_kick_mwait_play_dead();
+
/*
- * Use an own vector here because smp_call_function
- * does lots of things not suitable in a panic situation.
- * On most systems we could also use an NMI here,
- * but there are a few systems around where NMI
- * is problematic so stay with an non NMI for now
- * (this implies we cannot stop CPUs spinning with irq off
- * currently)
+ * 1) Send an IPI on the reboot vector to all other CPUs.
+ *
+ * The other CPUs should react on it after leaving critical
+ * sections and re-enabling interrupts. They might still hold
+ * locks, but there is nothing which can be done about that.
+ *
+ * 2) Wait for all other CPUs to report that they reached the
+ * HLT loop in stop_this_cpu()
+ *
+ * 3) If #2 timed out send an NMI to the CPUs which did not
+ * yet report
+ *
+ * 4) Wait for all other CPUs to report that they reached the
+ * HLT loop in stop_this_cpu()
+ *
+ * #3 can obviously race against a CPU reaching the HLT loop late.
+ * That CPU will have reported already and the "have all CPUs
+ * reached HLT" condition will be true despite the fact that the
+ * other CPU is still handling the NMI. Again, there is no
+ * protection against that as "disabled" APICs still respond to
+ * NMIs.
*/
- if (num_online_cpus() > 1) {
- apic->send_IPI_allbutself(REBOOT_VECTOR);
+ cpumask_copy(&cpus_stop_mask, cpu_online_mask);
+ cpumask_clear_cpu(this_cpu, &cpus_stop_mask);
- /* Don't wait longer than a second */
- wait = USEC_PER_SEC;
- while (num_online_cpus() > 1 && wait--)
+ if (!cpumask_empty(&cpus_stop_mask)) {
+ apic_send_IPI_allbutself(REBOOT_VECTOR);
+
+ /*
+ * Don't wait longer than a second for IPI completion. The
+ * wait request is not checked here because that would
+ * prevent an NMI shutdown attempt in case that not all
+ * CPUs reach shutdown state.
+ */
+ timeout = USEC_PER_SEC;
+ while (!cpumask_empty(&cpus_stop_mask) && timeout--)
+ udelay(1);
+ }
+
+ /* if the REBOOT_VECTOR didn't work, try with the NMI */
+ if (!cpumask_empty(&cpus_stop_mask)) {
+ /*
+ * If NMI IPI is enabled, try to register the stop handler
+ * and send the IPI. In any case try to wait for the other
+ * CPUs to stop.
+ */
+ if (!smp_no_nmi_ipi && !register_stop_handler()) {
+ unsigned int cpu;
+
+ pr_emerg("Shutting down cpus with NMI\n");
+
+ for_each_cpu(cpu, &cpus_stop_mask)
+ __apic_send_IPI(cpu, NMI_VECTOR);
+ }
+ /*
+ * Don't wait longer than 10 ms if the caller didn't
+ * request it. If wait is true, the machine hangs here if
+ * one or more CPUs do not reach shutdown state.
+ */
+ timeout = USEC_PER_MSEC * 10;
+ while (!cpumask_empty(&cpus_stop_mask) && (wait || timeout--))
udelay(1);
}
local_irq_save(flags);
disable_local_APIC();
+ mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
local_irq_restore(flags);
+
+ /*
+ * Ensure that the cpus_stop_mask cache lines are invalidated on
+ * the other CPUs. See comment vs. SME in stop_this_cpu().
+ */
+ cpumask_clear(&cpus_stop_mask);
}
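
What the rewritten stop path boils down to: the first CPU to arrive claims ownership with a cmpxchg, every other CPU is tracked in a pending mask, and the owner polls that mask with a bounded timeout before escalating. A minimal userspace sketch of that protocol, assuming C11 atomics and pthreads stand in for CPUs and IPIs (all names below are illustrative, not kernel APIs):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <pthread.h>
    #include <unistd.h>

    #define NCPUS 4

    static atomic_int stopping_cpu = -1;   /* first claimant wins */
    static atomic_uint pending_mask;       /* one bit per CPU still running */

    static void *other_cpu(void *arg)
    {
        int cpu = (int)(long)arg;

        /* Simulate "react to the stop IPI": clear our pending bit */
        usleep(1000 * (cpu + 1));
        atomic_fetch_and(&pending_mask, ~(1u << cpu));
        return NULL;
    }

    static void stop_other_cpus(int this_cpu)
    {
        int old = -1;
        unsigned long timeout = 100000;

        /* Only the first CPU to get here runs the shutdown sequence */
        if (!atomic_compare_exchange_strong(&stopping_cpu, &old, this_cpu))
            return;

        /* Stage 1: bounded wait for the "IPI" to be acknowledged */
        while (atomic_load(&pending_mask) && timeout--)
            usleep(1);

        /* Stage 2: escalate (the kernel sends NMIs at this point) */
        if (atomic_load(&pending_mask))
            fprintf(stderr, "escalating, mask=%x\n", atomic_load(&pending_mask));
    }

    int main(void)
    {
        pthread_t t[NCPUS];

        atomic_store(&pending_mask, ((1u << NCPUS) - 1) & ~1u); /* all but CPU0 */
        for (int i = 1; i < NCPUS; i++)
            pthread_create(&t[i], NULL, other_cpu, (void *)(long)i);

        stop_other_cpus(0);
        for (int i = 1; i < NCPUS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }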
/*
- * Reschedule call back. Nothing to do,
- * all the work is done automatically when
- * we return from the interrupt.
+ * Reschedule call back. KVM uses this interrupt to force a cpu out of
+ * guest mode.
*/
-void smp_reschedule_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_reschedule_ipi)
{
- ack_APIC_irq();
+ apic_eoi();
+ trace_reschedule_entry(RESCHEDULE_VECTOR);
inc_irq_stat(irq_resched_count);
- /*
- * KVM uses this interrupt to force a cpu out of guest mode
- */
+ scheduler_ipi();
+ trace_reschedule_exit(RESCHEDULE_VECTOR);
}
-void smp_call_function_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_call_function)
{
- ack_APIC_irq();
- irq_enter();
- generic_smp_call_function_interrupt();
+ apic_eoi();
+ trace_call_function_entry(CALL_FUNCTION_VECTOR);
inc_irq_stat(irq_call_count);
- irq_exit();
+ generic_smp_call_function_interrupt();
+ trace_call_function_exit(CALL_FUNCTION_VECTOR);
}
-void smp_call_function_single_interrupt(struct pt_regs *regs)
+DEFINE_IDTENTRY_SYSVEC(sysvec_call_function_single)
{
- ack_APIC_irq();
- irq_enter();
- generic_smp_call_function_single_interrupt();
+ apic_eoi();
+ trace_call_function_single_entry(CALL_FUNCTION_SINGLE_VECTOR);
inc_irq_stat(irq_call_count);
- irq_exit();
+ generic_smp_call_function_single_interrupt();
+ trace_call_function_single_exit(CALL_FUNCTION_SINGLE_VECTOR);
}
+static int __init nonmi_ipi_setup(char *str)
+{
+ smp_no_nmi_ipi = true;
+ return 1;
+}
+
+__setup("nonmi_ipi", nonmi_ipi_setup);
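
For reference, nonmi_ipi is consumed straight from the kernel command line; disabling the NMI fallback would look like this in a GRUB config (shown only as an example of where the flag goes):

    GRUB_CMDLINE_LINUX_DEFAULT="quiet nonmi_ipi"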
+
struct smp_ops smp_ops = {
.smp_prepare_boot_cpu = native_smp_prepare_boot_cpu,
.smp_prepare_cpus = native_smp_prepare_cpus,
.smp_cpus_done = native_smp_cpus_done,
- .smp_send_stop = native_smp_send_stop,
+ .stop_other_cpus = native_stop_other_cpus,
+#if defined(CONFIG_CRASH_DUMP)
+ .crash_stop_other_cpus = kdump_nmi_shootdown_cpus,
+#endif
.smp_send_reschedule = native_smp_send_reschedule,
- .cpu_up = native_cpu_up,
- .cpu_die = native_cpu_die,
+ .kick_ap_alive = native_kick_ap,
.cpu_disable = native_cpu_disable,
.play_dead = native_play_dead,
@@ -239,3 +299,27 @@ struct smp_ops smp_ops = {
.send_call_func_single_ipi = native_send_call_func_single_ipi,
};
EXPORT_SYMBOL_GPL(smp_ops);
+
+int arch_cpu_rescan_dead_smt_siblings(void)
+{
+ enum cpuhp_smt_control old = cpu_smt_control;
+ int ret;
+
+ /*
+ * If SMT has been disabled and SMT siblings are in HLT, bring them back
+ * online and offline them again so that they end up in MWAIT proper.
+ *
+ * Called with hotplug enabled.
+ */
+ if (old != CPU_SMT_DISABLED && old != CPU_SMT_FORCE_DISABLED)
+ return 0;
+
+ ret = cpuhp_smt_enable();
+ if (ret)
+ return ret;
+
+ ret = cpuhp_smt_disable(old);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(arch_cpu_rescan_dead_smt_siblings);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index c4f33b2e77d6..5cd6950ab672 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1,4 +1,5 @@
-/*
+// SPDX-License-Identifier: GPL-2.0-or-later
+ /*
* x86 SMP booting functions
*
* (c) 1995 Alan Cox, Building #3 <alan@lxorguk.ukuu.org.uk>
@@ -12,9 +13,6 @@
* Pentium Pro and Pentium-II/Xeon MP machines.
* Original development of Linux SMP code supported by Caldera.
*
- * This code is released under the GNU General Public License version 2 or
- * later.
- *
* Fixes
* Felix Koop : NR_CPUS used properly
* Jose Renau : Handle single CPU case.
@@ -39,392 +37,617 @@
* Glauber Costa : i386 and x86_64 integration
*/
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
#include <linux/init.h>
#include <linux/smp.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/sched.h>
+#include <linux/sched/topology.h>
+#include <linux/sched/hotplug.h>
+#include <linux/sched/task_stack.h>
#include <linux/percpu.h>
-#include <linux/bootmem.h>
+#include <linux/memblock.h>
#include <linux/err.h>
#include <linux/nmi.h>
#include <linux/tboot.h>
-#include <linux/stackprotector.h>
#include <linux/gfp.h>
+#include <linux/cpuidle.h>
+#include <linux/kexec.h>
+#include <linux/numa.h>
+#include <linux/pgtable.h>
+#include <linux/overflow.h>
+#include <linux/stackprotector.h>
+#include <linux/cpuhotplug.h>
+#include <linux/mc146818rtc.h>
+#include <linux/acpi.h>
#include <asm/acpi.h>
+#include <asm/cacheinfo.h>
+#include <asm/cpuid/api.h>
#include <asm/desc.h>
#include <asm/nmi.h>
#include <asm/irq.h>
-#include <asm/idle.h>
-#include <asm/trampoline.h>
+#include <asm/realmode.h>
#include <asm/cpu.h>
#include <asm/numa.h>
-#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <asm/mtrr.h>
-#include <asm/vmi.h>
+#include <asm/mwait.h>
#include <asm/apic.h>
+#include <asm/io_apic.h>
+#include <asm/fpu/api.h>
#include <asm/setup.h>
#include <asm/uv/uv.h>
-#include <linux/mc146818rtc.h>
-
-#include <asm/smpboot_hooks.h>
+#include <asm/microcode.h>
#include <asm/i8259.h>
-
-#ifdef CONFIG_X86_32
-u8 apicid_2_node[MAX_APICID];
-static int low_mappings;
-#endif
-
-/* State of each CPU */
-DEFINE_PER_CPU(int, cpu_state) = { 0 };
-
-/* Store all idle threads, this can be reused instead of creating
-* a new thread. Also avoids complicated thread destroy functionality
-* for idle threads.
-*/
-#ifdef CONFIG_HOTPLUG_CPU
-/*
- * Needed only for CONFIG_HOTPLUG_CPU because __cpuinitdata is
- * removed after init for !CONFIG_HOTPLUG_CPU.
- */
-static DEFINE_PER_CPU(struct task_struct *, idle_thread_array);
-#define get_idle_for_cpu(x) (per_cpu(idle_thread_array, x))
-#define set_idle_for_cpu(x, p) (per_cpu(idle_thread_array, x) = (p))
-#else
-static struct task_struct *idle_thread_array[NR_CPUS] __cpuinitdata ;
-#define get_idle_for_cpu(x) (idle_thread_array[(x)])
-#define set_idle_for_cpu(x, p) (idle_thread_array[(x)] = (p))
-#endif
-
-/* Number of siblings per CPU package */
-int smp_num_siblings = 1;
-EXPORT_SYMBOL(smp_num_siblings);
-
-/* Last level cache ID of each logical CPU */
-DEFINE_PER_CPU(u16, cpu_llc_id) = BAD_APICID;
+#include <asm/misc.h>
+#include <asm/qspinlock.h>
+#include <asm/intel-family.h>
+#include <asm/cpu_device_id.h>
+#include <asm/spec-ctrl.h>
+#include <asm/hw_irq.h>
+#include <asm/stackprotector.h>
+#include <asm/sev.h>
/* representing HT siblings of each logical CPU */
-DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
/* representing HT and core siblings of each logical CPU */
-DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
EXPORT_PER_CPU_SYMBOL(cpu_core_map);
-/* Per CPU bogomips and other parameters */
-DEFINE_PER_CPU_SHARED_ALIGNED(struct cpuinfo_x86, cpu_info);
-EXPORT_PER_CPU_SYMBOL(cpu_info);
-
-atomic_t init_deasserted;
+/* representing HT, core, and die siblings of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
+EXPORT_PER_CPU_SYMBOL(cpu_die_map);
-#if defined(CONFIG_NUMA) && defined(CONFIG_X86_32)
-/* which node each logical CPU is on */
-int cpu_to_node_map[NR_CPUS] __read_mostly = { [0 ... NR_CPUS-1] = 0 };
-EXPORT_SYMBOL(cpu_to_node_map);
+/* Representing CPUs for which sibling maps can be computed */
+static cpumask_var_t cpu_sibling_setup_mask;
-/* set up a mapping between cpu and node. */
-static void map_cpu_to_node(int cpu, int node)
-{
- printk(KERN_INFO "Mapping cpu %d to node %d\n", cpu, node);
- cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
- cpu_to_node_map[cpu] = node;
-}
+struct mwait_cpu_dead {
+ unsigned int control;
+ unsigned int status;
+};
-/* undo a mapping between cpu and node. */
-static void unmap_cpu_to_node(int cpu)
-{
- int node;
+#define CPUDEAD_MWAIT_WAIT 0xDEADBEEF
+#define CPUDEAD_MWAIT_KEXEC_HLT 0x4A17DEAD
- printk(KERN_INFO "Unmapping cpu %d from all nodes\n", cpu);
- for (node = 0; node < MAX_NUMNODES; node++)
- cpumask_clear_cpu(cpu, node_to_cpumask_map[node]);
- cpu_to_node_map[cpu] = 0;
-}
-#else /* !(CONFIG_NUMA && CONFIG_X86_32) */
-#define map_cpu_to_node(cpu, node) ({})
-#define unmap_cpu_to_node(cpu) ({})
-#endif
+/*
+ * Cache line aligned data for mwait_play_dead(). Separate on purpose so
+ * that it's unlikely to be touched by other CPUs.
+ */
+static DEFINE_PER_CPU_ALIGNED(struct mwait_cpu_dead, mwait_cpu_dead);
-#ifdef CONFIG_X86_32
-static int boot_cpu_logical_apicid;
+/* Maximum number of SMT threads on any online core */
+int __read_mostly __max_smt_threads = 1;
-u8 cpu_2_logical_apicid[NR_CPUS] __read_mostly =
- { [0 ... NR_CPUS-1] = BAD_APICID };
+/* Flag to indicate if a complete sched domain rebuild is required */
+bool x86_topology_update;
-static void map_cpu_to_logical_apicid(void)
+int arch_update_cpu_topology(void)
{
- int cpu = smp_processor_id();
- int apicid = logical_smp_processor_id();
- int node = apic->apicid_to_node(apicid);
-
- if (!node_online(node))
- node = first_online_node;
+ int retval = x86_topology_update;
- cpu_2_logical_apicid[cpu] = apicid;
- map_cpu_to_node(cpu, node);
+ x86_topology_update = false;
+ return retval;
}
-void numa_remove_cpu(int cpu)
+static unsigned int smpboot_warm_reset_vector_count;
+
+static inline void smpboot_setup_warm_reset_vector(unsigned long start_eip)
{
- cpu_2_logical_apicid[cpu] = BAD_APICID;
- unmap_cpu_to_node(cpu);
+ unsigned long flags;
+
+ spin_lock_irqsave(&rtc_lock, flags);
+ if (!smpboot_warm_reset_vector_count++) {
+ CMOS_WRITE(0xa, 0xf);
+ *((volatile unsigned short *)phys_to_virt(TRAMPOLINE_PHYS_HIGH)) = start_eip >> 4;
+ *((volatile unsigned short *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = start_eip & 0xf;
+ }
+ spin_unlock_irqrestore(&rtc_lock, flags);
}
-#else
-#define map_cpu_to_logical_apicid() do {} while (0)
-#endif
-/*
- * Report back to the Boot Processor.
- * Running on AP.
- */
-static void __cpuinit smp_callin(void)
+static inline void smpboot_restore_warm_reset_vector(void)
{
- int cpuid, phys_id;
- unsigned long timeout;
-
- /*
- * If waken up by an INIT in an 82489DX configuration
- * we may get here before an INIT-deassert IPI reaches
- * our local APIC. We have to wait for the IPI or we'll
- * lock up on an APIC access.
- */
- if (apic->wait_for_init_deassert)
- apic->wait_for_init_deassert(&init_deasserted);
+ unsigned long flags;
/*
- * (This works even if the APIC is not enabled.)
+ * Paranoid: Set warm reset code and vector here back
+ * to default values.
*/
- phys_id = read_apic_id();
- cpuid = smp_processor_id();
- if (cpumask_test_cpu(cpuid, cpu_callin_mask)) {
- panic("%s: phys CPU#%d, CPU#%d already present??\n", __func__,
- phys_id, cpuid);
+ spin_lock_irqsave(&rtc_lock, flags);
+ if (!--smpboot_warm_reset_vector_count) {
+ CMOS_WRITE(0, 0xf);
+ *((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
}
- pr_debug("CPU#%d (phys ID: %d) waiting for CALLOUT\n", cpuid, phys_id);
+ spin_unlock_irqrestore(&rtc_lock, flags);
- /*
- * STARTUP IPIs are fragile beasts as they might sometimes
- * trigger some glue motherboard logic. Complete APIC bus
- * silence for 1 second, this overestimates the time the
- * boot CPU is spending to send the up to 2 STARTUP IPIs
- * by a factor of two. This should be enough.
- */
+}
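
The counter turns the warm reset vector into a shared resource with "first user arms it, last user restores it" semantics. A minimal sketch of the same refcount pattern under an ordinary mutex (arm()/disarm() are placeholders, not kernel functions):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned int users;

    static void arm(void)    { printf("write warm reset vector\n"); }
    static void disarm(void) { printf("restore warm reset vector\n"); }

    void resource_get(void)
    {
        pthread_mutex_lock(&lock);
        if (!users++)       /* first user performs the setup */
            arm();
        pthread_mutex_unlock(&lock);
    }

    void resource_put(void)
    {
        pthread_mutex_lock(&lock);
        if (!--users)       /* last user restores the defaults */
            disarm();
        pthread_mutex_unlock(&lock);
    }

    int main(void)
    {
        resource_get();     /* arms */
        resource_get();     /* no-op, count is now 2 */
        resource_put();     /* no-op, count back to 1 */
        resource_put();     /* disarms */
        return 0;
    }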
- /*
- * Waiting 2s total for startup (udelay is not yet working)
- */
- timeout = jiffies + 2*HZ;
- while (time_before(jiffies, timeout)) {
- /*
- * Has the boot CPU finished it's STARTUP sequence?
- */
- if (cpumask_test_cpu(cpuid, cpu_callout_mask))
- break;
- cpu_relax();
- }
+/* Run the next set of setup steps for the upcoming CPU */
+static void ap_starting(void)
+{
+ int cpuid = smp_processor_id();
- if (!time_before(jiffies, timeout)) {
- panic("%s: CPU%d started up but did not get a callout!\n",
- __func__, cpuid);
- }
+ /* Mop up eventual mwait_play_dead() wreckage */
+ this_cpu_write(mwait_cpu_dead.status, 0);
+ this_cpu_write(mwait_cpu_dead.control, 0);
/*
- * the boot CPU has finished the init stage and is spinning
- * on callin_map until we finish. We are free to set up this
- * CPU, first the APIC. (this is probably redundant on most
- * boards)
+ * If woken up by an INIT in an 82489DX configuration the alive
+ * synchronization guarantees that the CPU does not reach this
+ * point before an INIT_deassert IPI reaches the local APIC, so it
+ * is now safe to touch the local APIC.
+ *
+ * Set up this CPU, first the APIC, which is probably redundant on
+ * most boards.
*/
+ apic_ap_setup();
- pr_debug("CALLIN, before setup_local_APIC().\n");
- if (apic->smp_callin_clear_local_apic)
- apic->smp_callin_clear_local_apic();
- setup_local_APIC();
- end_local_APIC_setup();
- map_cpu_to_logical_apicid();
+ /* Save the processor parameters. */
+ identify_secondary_cpu(cpuid);
/*
- * Need to setup vector mappings before we enable interrupts.
+ * The topology information must be up to date before
+ * notify_cpu_starting().
*/
- setup_vector_irq(smp_processor_id());
- /*
- * Get our bogomips.
- *
- * Need to enable IRQs because it can take longer and then
- * the NMI watchdog might kill us.
- */
- local_irq_enable();
- calibrate_delay();
- local_irq_disable();
+ set_cpu_sibling_map(cpuid);
+
+ ap_init_aperfmperf();
+
pr_debug("Stack at about %p\n", &cpuid);
+ wmb();
+
/*
- * Save our processor parameters
+ * This runs the AP through all the cpuhp states to its target
+ * state CPUHP_ONLINE.
*/
- smp_store_cpu_info(cpuid);
-
notify_cpu_starting(cpuid);
+}
+static void ap_calibrate_delay(void)
+{
/*
- * Allow the master to continue.
+ * Calibrate the delay loop and update loops_per_jiffy in cpu_data.
+ * identify_secondary_cpu() stored a value that is close but not as
+ * accurate as the value just calculated.
+ *
+ * As this is invoked after the TSC synchronization check,
+ * calibrate_delay_is_known() will skip the calibration routine
+ * when TSC is synchronized across sockets.
*/
- cpumask_set_cpu(cpuid, cpu_callin_mask);
+ calibrate_delay();
+ cpu_data(smp_processor_id()).loops_per_jiffy = loops_per_jiffy;
}
/*
* Activate a secondary processor.
*/
-notrace static void __cpuinit start_secondary(void *unused)
+static void notrace __noendbr start_secondary(void *unused)
{
/*
- * Don't put *anything* before cpu_init(), SMP booting is too
- * fragile that we want to limit the things done here to the
- * most necessary things.
+ * Don't put *anything* except direct CPU state initialization
+ * before cpu_init(); SMP booting is fragile enough that we want
+ * to limit the things done here to the most necessary things.
*/
- vmi_bringup();
- cpu_init();
- preempt_disable();
- smp_callin();
+ cr4_init();
- /* otherwise gcc will move up smp_processor_id before the cpu_init */
- barrier();
/*
- * Check TSC synchronization with the BP:
+ * 32-bit specific. 64-bit reaches this code with the correct page
+ * table established. Yet another historical divergence.
*/
- check_tsc_sync_target();
-
- if (nmi_watchdog == NMI_IO_APIC) {
- legacy_pic->chip->mask(0);
- enable_NMI_through_LVT0();
- legacy_pic->chip->unmask(0);
+ if (IS_ENABLED(CONFIG_X86_32)) {
+ /* switch away from the initial page table */
+ load_cr3(swapper_pg_dir);
+ __flush_tlb_all();
}
-#ifdef CONFIG_X86_32
- while (low_mappings)
- cpu_relax();
- __flush_tlb_all();
-#endif
-
- /* This must be done before setting cpu_online_mask */
- set_cpu_sibling_map(raw_smp_processor_id());
- wmb();
+ cpu_init_exception_handling(false);
/*
- * We need to hold call_lock, so there is no inconsistency
- * between the time smp_call_function() determines number of
- * IPI recipients, and the time when the determination is made
- * for which cpus receive the IPI. Holding this
- * lock helps us to not include this cpu in a currently in progress
- * smp_call_function().
+ * Load the microcode before reaching the AP alive synchronization
+ * point below so it is not part of the full per CPU serialized
+ * bringup part when "parallel" bringup is enabled.
+ *
+ * That's even safe when hyperthreading is enabled in the CPU as
+ * the core code starts the primary threads first and leaves the
+ * secondary threads waiting for SIPI. Loading microcode on
+ * physical cores concurrently is a safe operation.
*
- * We need to hold vector_lock so there the set of online cpus
- * does not change while we are assigning vectors to cpus. Holding
- * this lock ensures we don't half assign or remove an irq from a cpu.
+ * This covers both the Intel specific issue that concurrent
+ * microcode loading on SMT siblings must be prohibited and the
+ * vendor independent issue that microcode loading which changes
+ * CPUID, MSRs etc. must be strictly serialized to maintain
+ * software state correctness.
+ */
+ load_ucode_ap();
+
+ /*
+ * Synchronization point with the hotplug core. Sets this CPUs
+ * synchronization state to ALIVE and spin-waits for the control CPU to
+ * release this CPU for further bringup.
+ */
+ cpuhp_ap_sync_alive();
+
+ cpu_init();
+ fpu__init_cpu();
+ rcutree_report_cpu_starting(raw_smp_processor_id());
+ x86_cpuinit.early_percpu_clock_init();
+
+ ap_starting();
+
+ /* Check TSC synchronization with the control CPU. */
+ check_tsc_sync_target();
+
+ /*
+ * Calibrate the delay loop after the TSC synchronization check.
+ * This allows to skip the calibration when TSC is synchronized
+ * across sockets.
+ */
+ ap_calibrate_delay();
+
+ speculative_store_bypass_ht_init();
+
+ /*
+ * Lock vector_lock, set CPU online and bring the vector
+ * allocator online. Online must be set with vector_lock held
+ * to prevent a concurrent irq setup/teardown from seeing a
+ * half valid vector space.
*/
- ipi_call_lock();
lock_vector_lock();
set_cpu_online(smp_processor_id(), true);
+ lapic_online();
unlock_vector_lock();
- ipi_call_unlock();
- per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
x86_platform.nmi_init();
/* enable local interrupts */
local_irq_enable();
- /* to prevent fake stack check failure in clock setup */
- boot_init_stack_canary();
-
x86_cpuinit.setup_percpu_clockev();
wmb();
- cpu_idle();
+ cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
+}
+ANNOTATE_NOENDBR_SYM(start_secondary);
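
cpuhp_ap_sync_alive() is a rendezvous: the AP publishes that it is alive and then spin-waits until the control CPU releases it for the rest of the bringup. A two-thread sketch of that handshake using C11 atomics (only the shape of the protocol, not the kernel implementation):

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    enum { STATE_DEAD, STATE_ALIVE, STATE_RELEASED };

    static atomic_int sync_state = STATE_DEAD;

    static void *ap_thread(void *unused)
    {
        (void)unused;
        /* AP side: publish ALIVE, then spin until released */
        atomic_store_explicit(&sync_state, STATE_ALIVE, memory_order_release);
        while (atomic_load_explicit(&sync_state, memory_order_acquire) != STATE_RELEASED)
            ;   /* the kernel relaxes the CPU in this loop */
        printf("AP: released, continuing bringup\n");
        return NULL;
    }

    int main(void)
    {
        pthread_t ap;

        pthread_create(&ap, NULL, ap_thread, NULL);

        /* Control side: wait for ALIVE, then release the AP */
        while (atomic_load_explicit(&sync_state, memory_order_acquire) != STATE_ALIVE)
            ;
        atomic_store_explicit(&sync_state, STATE_RELEASED, memory_order_release);

        pthread_join(ap, NULL);
        return 0;
    }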
+
+static bool
+topology_same_node(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+ int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+ return (cpu_to_node(cpu1) == cpu_to_node(cpu2));
+}
+
+static bool
+topology_sane(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o, const char *name)
+{
+ int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+ return !WARN_ONCE(!topology_same_node(c, o),
+ "sched: CPU #%d's %s-sibling CPU #%d is not on the same node! "
+ "[node: %d != %d]. Ignoring dependency.\n",
+ cpu1, name, cpu2, cpu_to_node(cpu1), cpu_to_node(cpu2));
+}
+
+#define link_mask(mfunc, c1, c2) \
+do { \
+ cpumask_set_cpu((c1), mfunc(c2)); \
+ cpumask_set_cpu((c2), mfunc(c1)); \
+} while (0)
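
link_mask() sets both directions in one go so the sibling relation stays symmetric regardless of which CPU comes up first. The same idea on plain per-CPU bitmasks (illustrative; the kernel uses cpumask_t, not raw integers):

    #include <stdio.h>

    #define MAX_CPUS 8

    static unsigned int sibling_mask[MAX_CPUS];

    /* Symmetric link: cpu1 becomes cpu2's sibling and vice versa */
    static void link_siblings(int cpu1, int cpu2)
    {
        sibling_mask[cpu1] |= 1u << cpu2;
        sibling_mask[cpu2] |= 1u << cpu1;
    }

    int main(void)
    {
        link_siblings(0, 4);    /* e.g. an SMT pair on core 0 */
        link_siblings(1, 5);

        for (int cpu = 0; cpu < MAX_CPUS; cpu++)
            if (sibling_mask[cpu])
                printf("cpu%d siblings: %#x\n", cpu, sibling_mask[cpu]);
        return 0;
    }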
+
+static bool match_smt(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+ if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
+ int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+ if (c->topo.pkg_id == o->topo.pkg_id &&
+ c->topo.die_id == o->topo.die_id &&
+ c->topo.amd_node_id == o->topo.amd_node_id &&
+ per_cpu_llc_id(cpu1) == per_cpu_llc_id(cpu2)) {
+ if (c->topo.core_id == o->topo.core_id)
+ return topology_sane(c, o, "smt");
+
+ if ((c->topo.cu_id != 0xff) &&
+ (o->topo.cu_id != 0xff) &&
+ (c->topo.cu_id == o->topo.cu_id))
+ return topology_sane(c, o, "smt");
+ }
+
+ } else if (c->topo.pkg_id == o->topo.pkg_id &&
+ c->topo.die_id == o->topo.die_id &&
+ c->topo.core_id == o->topo.core_id) {
+ return topology_sane(c, o, "smt");
+ }
+
+ return false;
}
-#ifdef CONFIG_CPUMASK_OFFSTACK
-/* In this case, llc_shared_map is a pointer to a cpumask. */
-static inline void copy_cpuinfo_x86(struct cpuinfo_x86 *dst,
- const struct cpuinfo_x86 *src)
+static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
- struct cpumask *llc = dst->llc_shared_map;
- *dst = *src;
- dst->llc_shared_map = llc;
+ if (c->topo.pkg_id != o->topo.pkg_id || c->topo.die_id != o->topo.die_id)
+ return false;
+
+ if (cpu_feature_enabled(X86_FEATURE_TOPOEXT) && topology_amd_nodes_per_pkg() > 1)
+ return c->topo.amd_node_id == o->topo.amd_node_id;
+
+ return true;
}
-#else
-static inline void copy_cpuinfo_x86(struct cpuinfo_x86 *dst,
- const struct cpuinfo_x86 *src)
+
+static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
- *dst = *src;
+ int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+ /* If the arch didn't set up l2c_id, fall back to SMT */
+ if (per_cpu_l2c_id(cpu1) == BAD_APICID)
+ return match_smt(c, o);
+
+ /* Do not match if L2 cache id does not match: */
+ if (per_cpu_l2c_id(cpu1) != per_cpu_l2c_id(cpu2))
+ return false;
+
+ return topology_sane(c, o, "l2c");
}
-#endif /* CONFIG_CPUMASK_OFFSTACK */
/*
- * The bootstrap kernel entry code has set these up. Save them for
- * a given CPU
+ * Unlike the other levels, we do not enforce keeping a
+ * multicore group inside a NUMA node. If this happens, we will
+ * discard the MC level of the topology later.
+ */
+static bool match_pkg(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+ if (c->topo.pkg_id == o->topo.pkg_id)
+ return true;
+ return false;
+}
+
+/*
+ * Define intel_cod_cpu[] for Intel COD (Cluster-on-Die) CPUs.
+ *
+ * Any Intel CPU that has multiple nodes per package and does not
+ * match intel_cod_cpu[] has the SNC (Sub-NUMA Cluster) topology.
+ *
+ * When in SNC mode, these CPUs enumerate an LLC that is shared
+ * by multiple NUMA nodes. The LLC is shared for off-package data
+ * access but private to the NUMA node (half of the package) for
+ * on-package access. CPUID (the source of the information about
+ * the LLC) can only enumerate the cache as shared or unshared,
+ * but not this particular configuration.
*/
-void __cpuinit smp_store_cpu_info(int id)
+static const struct x86_cpu_id intel_cod_cpu[] = {
+ X86_MATCH_VFM(INTEL_HASWELL_X, 0), /* COD */
+ X86_MATCH_VFM(INTEL_BROADWELL_X, 0), /* COD */
+ X86_MATCH_VFM(INTEL_ANY, 1), /* SNC */
+ {}
+};
+
+static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
{
- struct cpuinfo_x86 *c = &cpu_data(id);
+ const struct x86_cpu_id *id = x86_match_cpu(intel_cod_cpu);
+ int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+ bool intel_snc = id && id->driver_data;
+
+ /* Do not match if we do not have a valid APICID for cpu: */
+ if (per_cpu_llc_id(cpu1) == BAD_APICID)
+ return false;
+
+ /* Do not match if LLC id does not match: */
+ if (per_cpu_llc_id(cpu1) != per_cpu_llc_id(cpu2))
+ return false;
- copy_cpuinfo_x86(c, &boot_cpu_data);
- c->cpu_index = id;
- if (id != 0)
- identify_secondary_cpu(c);
+ /*
+ * Allow the SNC topology without warning. Return of false
+ * means 'c' does not share the LLC of 'o'. This will be
+ * reflected to userspace.
+ */
+ if (match_pkg(c, o) && !topology_same_node(c, o) && intel_snc)
+ return false;
+
+ return topology_sane(c, o, "llc");
}
-void __cpuinit set_cpu_sibling_map(int cpu)
+static inline int x86_sched_itmt_flags(void)
{
- int i;
- struct cpuinfo_x86 *c = &cpu_data(cpu);
+ return sysctl_sched_itmt_enabled ? SD_ASYM_PACKING : 0;
+}
- cpumask_set_cpu(cpu, cpu_sibling_setup_mask);
+#ifdef CONFIG_SCHED_MC
+static int x86_core_flags(void)
+{
+ return cpu_core_flags() | x86_sched_itmt_flags();
+}
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
+static int x86_cluster_flags(void)
+{
+ return cpu_cluster_flags() | x86_sched_itmt_flags();
+}
+#endif
- if (smp_num_siblings > 1) {
- for_each_cpu(i, cpu_sibling_setup_mask) {
- struct cpuinfo_x86 *o = &cpu_data(i);
-
- if (c->phys_proc_id == o->phys_proc_id &&
- c->cpu_core_id == o->cpu_core_id) {
- cpumask_set_cpu(i, cpu_sibling_mask(cpu));
- cpumask_set_cpu(cpu, cpu_sibling_mask(i));
- cpumask_set_cpu(i, cpu_core_mask(cpu));
- cpumask_set_cpu(cpu, cpu_core_mask(i));
- cpumask_set_cpu(i, c->llc_shared_map);
- cpumask_set_cpu(cpu, o->llc_shared_map);
+/*
+ * Set if a package/die has multiple NUMA nodes inside.
+ * AMD Magny-Cours, Intel Cluster-on-Die, and Intel
+ * Sub-NUMA Clustering have this.
+ */
+static bool x86_has_numa_in_package;
+
+static struct sched_domain_topology_level x86_topology[] = {
+ SDTL_INIT(tl_smt_mask, cpu_smt_flags, SMT),
+#ifdef CONFIG_SCHED_CLUSTER
+ SDTL_INIT(tl_cls_mask, x86_cluster_flags, CLS),
+#endif
+#ifdef CONFIG_SCHED_MC
+ SDTL_INIT(tl_mc_mask, x86_core_flags, MC),
+#endif
+ SDTL_INIT(tl_pkg_mask, x86_sched_itmt_flags, PKG),
+ { NULL },
+};
+
+static void __init build_sched_topology(void)
+{
+ struct sched_domain_topology_level *topology = x86_topology;
+
+ /*
+ * When there is NUMA topology inside the package invalidate the
+ * PKG domain since the NUMA domains will auto-magically create the
+ * right spanning domains based on the SLIT.
+ */
+ if (x86_has_numa_in_package) {
+ unsigned int pkgdom = ARRAY_SIZE(x86_topology) - 2;
+
+ memset(&x86_topology[pkgdom], 0, sizeof(x86_topology[pkgdom]));
+ }
+
+ /*
+ * Drop the SMT domains if there is only one thread per core
+ * since it'll get degenerated by the scheduler anyway.
+ */
+ if (cpu_smt_num_threads <= 1)
+ ++topology;
+
+ set_sched_topology(topology);
+}
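
build_sched_topology() edits the level table in place: zeroing an entry terminates the walk at that level, and advancing the base pointer drops SMT entirely. A sketch of the same table surgery with the SDTL entries reduced to a name (inputs assumed for illustration):

    #include <stdio.h>
    #include <string.h>

    struct tl { const char *name; };

    static struct tl topo[] = {
        { "SMT" }, { "CLS" }, { "MC" }, { "PKG" }, { NULL },
    };

    int main(void)
    {
        struct tl *t = topo;
        int numa_in_package = 1;    /* pretend SNC/COD is active */
        int smt_threads = 1;        /* pretend SMT is off */

        /* NUMA inside the package: invalidate PKG (second-to-last slot) */
        if (numa_in_package)
            memset(&topo[sizeof(topo) / sizeof(topo[0]) - 2], 0, sizeof(topo[0]));

        /* One thread per core: skip the SMT level entirely */
        if (smt_threads <= 1)
            t++;

        for (; t->name; t++)        /* prints CLS and MC only */
            printf("level: %s\n", t->name);
        return 0;
    }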
+
+#ifdef CONFIG_NUMA
+static int sched_avg_remote_distance;
+static int avg_remote_numa_distance(void)
+{
+ int i, j;
+ int distance, nr_remote, total_distance;
+
+ if (sched_avg_remote_distance > 0)
+ return sched_avg_remote_distance;
+
+ nr_remote = 0;
+ total_distance = 0;
+ for_each_node_state(i, N_CPU) {
+ for_each_node_state(j, N_CPU) {
+ distance = node_distance(i, j);
+
+ if (distance >= REMOTE_DISTANCE) {
+ nr_remote++;
+ total_distance += distance;
}
}
- } else {
- cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
}
+ if (nr_remote)
+ sched_avg_remote_distance = total_distance / nr_remote;
+ else
+ sched_avg_remote_distance = REMOTE_DISTANCE;
+
+ return sched_avg_remote_distance;
+}
+
+int arch_sched_node_distance(int from, int to)
+{
+ int d = node_distance(from, to);
+
+ switch (boot_cpu_data.x86_vfm) {
+ case INTEL_GRANITERAPIDS_X:
+ case INTEL_ATOM_DARKMONT_X:
+
+ if (!x86_has_numa_in_package || topology_max_packages() == 1 ||
+ d < REMOTE_DISTANCE)
+ return d;
+
+ /*
+ * With SNC enabled, there could be too many levels of remote
+ * NUMA node distances, creating NUMA domain levels
+ * including local nodes and partial remote nodes.
+ *
+ * Trim finer distance tuning for NUMA nodes in remote package
+ * for the purpose of building sched domains. Group NUMA nodes
+ * in the remote package in the same sched group.
+ * Simplify NUMA domains and avoid extra NUMA levels including
+ * different remote NUMA nodes and local nodes.
+ *
+ * GNR and CWF don't expect systems with more than 2 packages
+ * and more than 2 hops between packages. Single average remote
+ * distance won't be appropriate if there are more than 2
+ * packages as average distance to different remote packages
+ * could be different.
+ */
+ WARN_ONCE(topology_max_packages() > 2,
+ "sched: Expect only up to 2 packages for GNR or CWF, "
+ "but saw %d packages when building sched domains.",
+ topology_max_packages());
+
+ d = avg_remote_numa_distance();
+ }
+ return d;
+}
+#endif /* CONFIG_NUMA */
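
avg_remote_numa_distance() folds every distance at or above REMOTE_DISTANCE into a single mean. A self-contained check of that arithmetic against a sample 4-node SLIT with two SNC nodes per package (matrix values are invented for illustration):

    #include <stdio.h>

    #define NR_NODES        4
    #define REMOTE_DISTANCE 20

    /* Example SLIT: nodes 0/1 in one package, 2/3 in the other */
    static const int slit[NR_NODES][NR_NODES] = {
        { 10, 12, 21, 23 },
        { 12, 10, 23, 21 },
        { 21, 23, 10, 12 },
        { 23, 21, 12, 10 },
    };

    static int avg_remote_distance(void)
    {
        int nr_remote = 0, total = 0;

        for (int i = 0; i < NR_NODES; i++)
            for (int j = 0; j < NR_NODES; j++)
                if (slit[i][j] >= REMOTE_DISTANCE) {
                    nr_remote++;
                    total += slit[i][j];
                }

        return nr_remote ? total / nr_remote : REMOTE_DISTANCE;
    }

    int main(void)
    {
        /* Eight remote entries of 21 and 23 -> prints 22 */
        printf("avg remote distance: %d\n", avg_remote_distance());
        return 0;
    }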
- cpumask_set_cpu(cpu, c->llc_shared_map);
+void set_cpu_sibling_map(int cpu)
+{
+ bool has_smt = __max_threads_per_core > 1;
+ bool has_mp = has_smt || topology_num_cores_per_package() > 1;
+ struct cpuinfo_x86 *c = &cpu_data(cpu);
+ struct cpuinfo_x86 *o;
+ int i, threads;
- if (current_cpu_data.x86_max_cores == 1) {
- cpumask_copy(cpu_core_mask(cpu), cpu_sibling_mask(cpu));
+ cpumask_set_cpu(cpu, cpu_sibling_setup_mask);
+
+ if (!has_mp) {
+ cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
+ cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
+ cpumask_set_cpu(cpu, cpu_l2c_shared_mask(cpu));
+ cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
+ cpumask_set_cpu(cpu, topology_die_cpumask(cpu));
c->booted_cores = 1;
return;
}
for_each_cpu(i, cpu_sibling_setup_mask) {
- if (per_cpu(cpu_llc_id, cpu) != BAD_APICID &&
- per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
- cpumask_set_cpu(i, c->llc_shared_map);
- cpumask_set_cpu(cpu, cpu_data(i).llc_shared_map);
- }
- if (c->phys_proc_id == cpu_data(i).phys_proc_id) {
- cpumask_set_cpu(i, cpu_core_mask(cpu));
- cpumask_set_cpu(cpu, cpu_core_mask(i));
+ o = &cpu_data(i);
+
+ if (match_pkg(c, o) && !topology_same_node(c, o))
+ x86_has_numa_in_package = true;
+
+ if ((i == cpu) || (has_smt && match_smt(c, o)))
+ link_mask(topology_sibling_cpumask, cpu, i);
+
+ if ((i == cpu) || (has_mp && match_llc(c, o)))
+ link_mask(cpu_llc_shared_mask, cpu, i);
+
+ if ((i == cpu) || (has_mp && match_l2c(c, o)))
+ link_mask(cpu_l2c_shared_mask, cpu, i);
+
+ if ((i == cpu) || (has_mp && match_die(c, o)))
+ link_mask(topology_die_cpumask, cpu, i);
+ }
+
+ threads = cpumask_weight(topology_sibling_cpumask(cpu));
+ if (threads > __max_smt_threads)
+ __max_smt_threads = threads;
+
+ for_each_cpu(i, topology_sibling_cpumask(cpu))
+ cpu_data(i).smt_active = threads > 1;
+
+ /*
+ * This needs a separate iteration over the cpus because we rely on all
+ * topology_sibling_cpumask links to be set-up.
+ */
+ for_each_cpu(i, cpu_sibling_setup_mask) {
+ o = &cpu_data(i);
+
+ if ((i == cpu) || (has_mp && match_pkg(c, o))) {
+ link_mask(topology_core_cpumask, cpu, i);
+
/*
* Does this new cpu bringup a new core?
*/
- if (cpumask_weight(cpu_sibling_mask(cpu)) == 1) {
+ if (threads == 1) {
/*
* for each core in package, increment
* the booted_cores for this new cpu
*/
- if (cpumask_first(cpu_sibling_mask(i)) == i)
+ if (cpumask_first(topology_sibling_cpumask(i)) == i)
c->booted_cores++;
/*
* increment the core count for all
@@ -441,18 +664,15 @@ void __cpuinit set_cpu_sibling_map(int cpu)
/* maps the cpu to the sched domain representing multi-core */
const struct cpumask *cpu_coregroup_mask(int cpu)
{
- struct cpuinfo_x86 *c = &cpu_data(cpu);
- /*
- * For perf, we return last level cache shared map.
- * And for power savings, we return cpu_core_map
- */
- if ((sched_mc_power_savings || sched_smt_power_savings) &&
- !(cpu_has(c, X86_FEATURE_AMD_DCM)))
- return cpu_core_mask(cpu);
- else
- return c->llc_shared_map;
+ return cpu_llc_shared_mask(cpu);
}
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+ return cpu_l2c_shared_mask(cpu);
+}
+EXPORT_SYMBOL_GPL(cpu_clustergroup_mask);
+
static void impress_friends(void)
{
int cpu;
@@ -460,141 +680,95 @@ static void impress_friends(void)
/*
* Allow the user to impress friends.
*/
- pr_debug("Before bogomips.\n");
- for_each_possible_cpu(cpu)
- if (cpumask_test_cpu(cpu, cpu_callout_mask))
- bogosum += cpu_data(cpu).loops_per_jiffy;
- printk(KERN_INFO
- "Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
+ pr_debug("Before bogomips\n");
+ for_each_online_cpu(cpu)
+ bogosum += cpu_data(cpu).loops_per_jiffy;
+
+ pr_info("Total of %d processors activated (%lu.%02lu BogoMIPS)\n",
num_online_cpus(),
bogosum/(500000/HZ),
(bogosum/(5000/HZ))%100);
- pr_debug("Before bogocount - setting activated=1.\n");
+ pr_debug("Before bogocount - setting activated=1\n");
}
-void __inquire_remote_apic(int apicid)
-{
- unsigned i, regs[] = { APIC_ID >> 4, APIC_LVR >> 4, APIC_SPIV >> 4 };
- char *names[] = { "ID", "VERSION", "SPIV" };
- int timeout;
- u32 status;
+/*
+ * The Multiprocessor Specification 1.4 (1997) example code suggests
+ * that there should be a 10ms delay between the BSP asserting INIT
+ * and de-asserting INIT, when starting a remote processor.
+ * But that slows boot and resume on modern processors, which include
+ * many cores and don't require that delay.
+ *
+ * Cmdline "cpu_init_udelay=" is available to override this delay.
+ */
+#define UDELAY_10MS_LEGACY 10000
- printk(KERN_INFO "Inquiring remote APIC 0x%x...\n", apicid);
+static unsigned int init_udelay = UINT_MAX;
- for (i = 0; i < ARRAY_SIZE(regs); i++) {
- printk(KERN_INFO "... APIC 0x%x %s: ", apicid, names[i]);
+static int __init cpu_init_udelay(char *str)
+{
+ get_option(&str, &init_udelay);
- /*
- * Wait for idle.
- */
- status = safe_apic_wait_icr_idle();
- if (status)
- printk(KERN_CONT
- "a previous APIC delivery may have failed\n");
-
- apic_icr_write(APIC_DM_REMRD | regs[i], apicid);
-
- timeout = 0;
- do {
- udelay(100);
- status = apic_read(APIC_ICR) & APIC_ICR_RR_MASK;
- } while (status == APIC_ICR_RR_INPROG && timeout++ < 1000);
-
- switch (status) {
- case APIC_ICR_RR_VALID:
- status = apic_read(APIC_RRR);
- printk(KERN_CONT "%08x\n", status);
- break;
- default:
- printk(KERN_CONT "failed\n");
- }
- }
+ return 0;
}
+early_param("cpu_init_udelay", cpu_init_udelay);
-/*
- * Poke the other CPU in the eye via NMI to wake it up. Remember that the normal
- * INIT, INIT, STARTUP sequence will reset the chip hard for us, and this
- * won't ... remember to clear down the APIC, etc later.
- */
-int __cpuinit
-wakeup_secondary_cpu_via_nmi(int logical_apicid, unsigned long start_eip)
+static void __init smp_set_init_udelay(void)
{
- unsigned long send_status, accept_status = 0;
- int maxlvt;
-
- /* Target chip */
- /* Boot on the stack */
- /* Kick the second */
- apic_icr_write(APIC_DM_NMI | apic->dest_logical, logical_apicid);
-
- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ /* if cmdline changed it from default, leave it alone */
+ if (init_udelay != UINT_MAX)
+ return;
- /*
- * Give the other CPU some time to accept the IPI.
- */
- udelay(200);
- if (APIC_INTEGRATED(apic_version[boot_cpu_physical_apicid])) {
- maxlvt = lapic_get_maxlvt();
- if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */
- apic_write(APIC_ESR, 0);
- accept_status = (apic_read(APIC_ESR) & 0xEF);
+ /* if modern processor, use no delay */
+ if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL && boot_cpu_data.x86_vfm >= INTEL_PENTIUM_PRO) ||
+ (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON && boot_cpu_data.x86 >= 0x18) ||
+ (boot_cpu_data.x86_vendor == X86_VENDOR_AMD && boot_cpu_data.x86 >= 0xF)) {
+ init_udelay = 0;
+ return;
}
- pr_debug("NMI sent.\n");
-
- if (send_status)
- printk(KERN_ERR "APIC never delivered???\n");
- if (accept_status)
- printk(KERN_ERR "APIC delivery error (%lx).\n", accept_status);
-
- return (send_status | accept_status);
+ /* else, use legacy delay */
+ init_udelay = UDELAY_10MS_LEGACY;
}
-static int __cpuinit
-wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
+/*
+ * Wake up AP by INIT, INIT, STARTUP sequence.
+ */
+static void send_init_sequence(u32 phys_apicid)
{
- unsigned long send_status, accept_status = 0;
- int maxlvt, num_starts, j;
+ int maxlvt = lapic_get_maxlvt();
- maxlvt = lapic_get_maxlvt();
-
- /*
- * Be paranoid about clearing APIC errors.
- */
- if (APIC_INTEGRATED(apic_version[phys_apicid])) {
- if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */
+ /* Be paranoid about clearing APIC errors. */
+ if (APIC_INTEGRATED(boot_cpu_apic_version)) {
+ /* Due to the Pentium erratum 3AP. */
+ if (maxlvt > 3)
apic_write(APIC_ESR, 0);
apic_read(APIC_ESR);
}
- pr_debug("Asserting INIT.\n");
+ /* Assert INIT on the target CPU */
+ apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT, phys_apicid);
+ safe_apic_wait_icr_idle();
- /*
- * Turn INIT on target chip
- */
- /*
- * Send IPI
- */
- apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
- phys_apicid);
-
- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
-
- mdelay(10);
-
- pr_debug("Deasserting INIT.\n");
+ udelay(init_udelay);
- /* Target chip */
- /* Send IPI */
+ /* Deassert INIT on the target CPU */
apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+ safe_apic_wait_icr_idle();
+}
- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+/*
+ * Wake up AP by INIT, INIT, STARTUP sequence.
+ */
+static int wakeup_secondary_cpu_via_init(u32 phys_apicid, unsigned long start_eip, unsigned int cpu)
+{
+ unsigned long send_status = 0, accept_status = 0;
+ int num_starts, j, maxlvt;
+
+ preempt_disable();
+ maxlvt = lapic_get_maxlvt();
+ send_init_sequence(phys_apicid);
mb();
- atomic_set(&init_deasserted, 1);
/*
* Should we send STARTUP IPIs ?
@@ -602,29 +776,22 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
* Determine this based on the APIC version.
* If we don't have an integrated APIC, don't send the STARTUP IPIs.
*/
- if (APIC_INTEGRATED(apic_version[phys_apicid]))
+ if (APIC_INTEGRATED(boot_cpu_apic_version))
num_starts = 2;
else
num_starts = 0;
/*
- * Paravirt / VMI wants a startup IPI hook here to set up the
- * target processor state.
- */
- startup_ipi_hook(phys_apicid, (unsigned long) start_secondary,
- (unsigned long)stack_start.sp);
-
- /*
* Run STARTUP IPI loop.
*/
- pr_debug("#startup loops: %d.\n", num_starts);
+ pr_debug("#startup loops: %d\n", num_starts);
for (j = 1; j <= num_starts; j++) {
- pr_debug("Sending STARTUP #%d.\n", j);
+ pr_debug("Sending STARTUP #%d\n", j);
if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */
apic_write(APIC_ESR, 0);
apic_read(APIC_ESR);
- pr_debug("After apic_write.\n");
+ pr_debug("After apic_write\n");
/*
* STARTUP IPI
@@ -639,9 +806,12 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
/*
* Give the other CPU some time to accept the IPI.
*/
- udelay(300);
+ if (init_udelay == 0)
+ udelay(10);
+ else
+ udelay(300);
- pr_debug("Startup point 1.\n");
+ pr_debug("Startup point 1\n");
pr_debug("Waiting for send to finish...\n");
send_status = safe_apic_wait_icr_idle();
@@ -649,124 +819,115 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
/*
* Give the other CPU some time to accept the IPI.
*/
- udelay(200);
+ if (init_udelay == 0)
+ udelay(10);
+ else
+ udelay(200);
+
if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */
apic_write(APIC_ESR, 0);
accept_status = (apic_read(APIC_ESR) & 0xEF);
if (send_status || accept_status)
break;
}
- pr_debug("After Startup.\n");
+ pr_debug("After Startup\n");
if (send_status)
- printk(KERN_ERR "APIC never delivered???\n");
+ pr_err("APIC never delivered???\n");
if (accept_status)
- printk(KERN_ERR "APIC delivery error (%lx).\n", accept_status);
+ pr_err("APIC delivery error (%lx)\n", accept_status);
+ preempt_enable();
return (send_status | accept_status);
}
-struct create_idle {
- struct work_struct work;
- struct task_struct *idle;
- struct completion done;
- int cpu;
-};
-
-static void __cpuinit do_fork_idle(struct work_struct *work)
-{
- struct create_idle *c_idle =
- container_of(work, struct create_idle, work);
-
- c_idle->idle = fork_idle(c_idle->cpu);
- complete(&c_idle->done);
-}
-
/* reduce the number of lines printed when booting a large cpu count system */
-static void __cpuinit announce_cpu(int cpu, int apicid)
+static void announce_cpu(int cpu, int apicid)
{
- static int current_node = -1;
+ static int width, node_width, first = 1;
+ static int current_node = NUMA_NO_NODE;
int node = early_cpu_to_node(cpu);
- if (system_state == SYSTEM_BOOTING) {
+ if (!width)
+ width = num_digits(num_possible_cpus()) + 1; /* + '#' sign */
+
+ if (!node_width)
+ node_width = num_digits(num_possible_nodes()) + 1; /* + '#' */
+
+ if (system_state < SYSTEM_RUNNING) {
+ if (first)
+ pr_info("x86: Booting SMP configuration:\n");
+
if (node != current_node) {
if (current_node > (-1))
- pr_cont(" Ok.\n");
+ pr_cont("\n");
current_node = node;
- pr_info("Booting Node %3d, Processors ", node);
+
+ printk(KERN_INFO ".... node %*s#%d, CPUs: ",
+ node_width - num_digits(node), " ", node);
}
- pr_cont(" #%d%s", cpu, cpu == (nr_cpu_ids - 1) ? " Ok.\n" : "");
- return;
+
+ /* Add padding for the BSP */
+ if (first)
+ pr_cont("%*s", width + 1, " ");
+ first = 0;
+
+ pr_cont("%*s#%d", width - num_digits(cpu), " ", cpu);
} else
pr_info("Booting Node %d Processor %d APIC 0x%x\n",
node, cpu, apicid);
}
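
The width bookkeeping above right-aligns CPU numbers so the per-node boot lines form tidy columns, with one extra slot of padding reserved for the BSP. A userspace rendering of the same padding math (num_digits() replaced by a local helper; node and CPU counts assumed):

    #include <stdio.h>

    static int num_digits(int n)
    {
        int d = 1;

        while (n >= 10) {
            n /= 10;
            d++;
        }
        return d;
    }

    int main(void)
    {
        int nr_possible = 16;
        int width = num_digits(nr_possible) + 1;    /* + '#' sign */

        printf(".... node  #0, CPUs: ");
        printf("%*s", width + 1, " ");              /* padding for the BSP */
        for (int cpu = 1; cpu < 8; cpu++)
            printf("%*s#%d", width - num_digits(cpu), " ", cpu);
        printf("\n");
        return 0;
    }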
-/*
- * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
- * (ie clustered apic addressing mode), this is a LOGICAL apic ID.
- * Returns zero if CPU booted OK, else error code from
- * ->wakeup_secondary_cpu.
- */
-static int __cpuinit do_boot_cpu(int apicid, int cpu)
+int common_cpu_up(unsigned int cpu, struct task_struct *idle)
{
- unsigned long boot_error = 0;
- unsigned long start_ip;
- int timeout;
- struct create_idle c_idle = {
- .cpu = cpu,
- .done = COMPLETION_INITIALIZER_ONSTACK(c_idle.done),
- };
+ int ret;
- INIT_WORK_ON_STACK(&c_idle.work, do_fork_idle);
+ /* Just in case we booted with a single CPU. */
+ alternatives_enable_smp();
- alternatives_smp_switch(1);
+ per_cpu(current_task, cpu) = idle;
+ cpu_init_stack_canary(cpu, idle);
- c_idle.idle = get_idle_for_cpu(cpu);
+ /* Initialize the interrupt stack(s) */
+ ret = irq_init_percpu_irqstack(cpu);
+ if (ret)
+ return ret;
- /*
- * We can't use kernel_thread since we must avoid to
- * reschedule the child.
- */
- if (c_idle.idle) {
- c_idle.idle->thread.sp = (unsigned long) (((struct pt_regs *)
- (THREAD_SIZE + task_stack_page(c_idle.idle))) - 1);
- init_idle(c_idle.idle, cpu);
- goto do_rest;
- }
-
- if (!keventd_up() || current_is_keventd())
- c_idle.work.func(&c_idle.work);
- else {
- schedule_work(&c_idle.work);
- wait_for_completion(&c_idle.done);
- }
-
- if (IS_ERR(c_idle.idle)) {
- printk("failed fork for CPU %d\n", cpu);
- destroy_work_on_stack(&c_idle.work);
- return PTR_ERR(c_idle.idle);
- }
-
- set_idle_for_cpu(cpu, c_idle.idle);
-do_rest:
- per_cpu(current_task, cpu) = c_idle.idle;
#ifdef CONFIG_X86_32
/* Stack for startup_32 can be just as for start_secondary onwards */
- irq_ctx_init(cpu);
-#else
- clear_tsk_thread_flag(c_idle.idle, TIF_FORK);
- initial_gs = per_cpu_offset(cpu);
- per_cpu(kernel_stack, cpu) =
- (unsigned long)task_stack_page(c_idle.idle) -
- KERNEL_STACK_OFFSET + THREAD_SIZE;
+ per_cpu(cpu_current_top_of_stack, cpu) = task_top_of_stack(idle);
+#endif
+ return 0;
+}
+
+/*
+ * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
+ * (ie clustered apic addressing mode), this is a LOGICAL apic ID.
+ * Returns zero if startup was successfully sent, else error code from
+ * ->wakeup_secondary_cpu.
+ */
+static int do_boot_cpu(u32 apicid, unsigned int cpu, struct task_struct *idle)
+{
+ unsigned long start_ip = real_mode_header->trampoline_start;
+ int ret;
+
+#ifdef CONFIG_X86_64
+ /* If 64-bit wakeup method exists, use the 64-bit mode trampoline IP */
+ if (apic->wakeup_secondary_cpu_64)
+ start_ip = real_mode_header->trampoline_start64;
#endif
- early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
+ idle->thread.sp = (unsigned long)task_pt_regs(idle);
initial_code = (unsigned long)start_secondary;
- stack_start.sp = (void *) c_idle.idle->thread.sp;
- /* start_ip had better be page-aligned! */
- start_ip = setup_trampoline();
+ if (IS_ENABLED(CONFIG_X86_32)) {
+ early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(cpu);
+ initial_stack = idle->thread.sp;
+ } else if (!(smpboot_control & STARTUP_PARALLEL_MASK)) {
+ smpboot_control = cpu;
+ }
+
+ /* Enable the espfix hack for this CPU */
+ init_espfix_ap(cpu);
/* So we see what's up */
announce_cpu(cpu, apicid);
@@ -775,10 +936,7 @@ do_rest:
* This grunge runs the startup process for
* the targeted processor.
*/
-
- atomic_set(&init_deasserted, 0);
-
- if (get_uv_system_type() != UV_NON_UNIQUE_APIC) {
+ if (x86_platform.legacy.warm_reset) {
pr_debug("Setting warm reset code and vector.\n");
@@ -786,104 +944,51 @@ do_rest:
/*
* Be paranoid about clearing APIC errors.
*/
- if (APIC_INTEGRATED(apic_version[boot_cpu_physical_apicid])) {
+ if (APIC_INTEGRATED(boot_cpu_apic_version)) {
apic_write(APIC_ESR, 0);
apic_read(APIC_ESR);
}
}
+ smp_mb();
+
/*
- * Kick the secondary CPU. Use the method in the APIC driver
- * if it's defined - or use an INIT boot APIC message otherwise:
+ * Wake up a CPU in different cases:
+ * - Use a method from the APIC driver if one is defined, with wakeup
+ * straight to 64-bit mode preferred over wakeup to RM.
+ * Otherwise,
+ * - Use an INIT boot APIC message
*/
- if (apic->wakeup_secondary_cpu)
- boot_error = apic->wakeup_secondary_cpu(apicid, start_ip);
+ if (apic->wakeup_secondary_cpu_64)
+ ret = apic->wakeup_secondary_cpu_64(apicid, start_ip, cpu);
+ else if (apic->wakeup_secondary_cpu)
+ ret = apic->wakeup_secondary_cpu(apicid, start_ip, cpu);
else
- boot_error = wakeup_secondary_cpu_via_init(apicid, start_ip);
-
- if (!boot_error) {
- /*
- * allow APs to start initializing.
- */
- pr_debug("Before Callout %d.\n", cpu);
- cpumask_set_cpu(cpu, cpu_callout_mask);
- pr_debug("After Callout %d.\n", cpu);
-
- /*
- * Wait 5s total for a response
- */
- for (timeout = 0; timeout < 50000; timeout++) {
- if (cpumask_test_cpu(cpu, cpu_callin_mask))
- break; /* It has booted */
- udelay(100);
- }
-
- if (cpumask_test_cpu(cpu, cpu_callin_mask))
- pr_debug("CPU%d: has booted.\n", cpu);
- else {
- boot_error = 1;
- if (*((volatile unsigned char *)trampoline_base)
- == 0xA5)
- /* trampoline started but...? */
- pr_err("CPU%d: Stuck ??\n", cpu);
- else
- /* trampoline code not run */
- pr_err("CPU%d: Not responding.\n", cpu);
- if (apic->inquire_remote_apic)
- apic->inquire_remote_apic(apicid);
- }
- }
-
- if (boot_error) {
- /* Try to put things back the way they were before ... */
- numa_remove_cpu(cpu); /* was set by numa_add_cpu */
-
- /* was set by do_boot_cpu() */
- cpumask_clear_cpu(cpu, cpu_callout_mask);
-
- /* was set by cpu_init() */
- cpumask_clear_cpu(cpu, cpu_initialized_mask);
-
- set_cpu_present(cpu, false);
- per_cpu(x86_cpu_to_apicid, cpu) = BAD_APICID;
- }
-
- /* mark "stuck" area as not stuck */
- *((volatile unsigned long *)trampoline_base) = 0;
-
- if (get_uv_system_type() != UV_NON_UNIQUE_APIC) {
- /*
- * Cleanup possible dangling ends...
- */
- smpboot_restore_warm_reset_vector();
- }
+ ret = wakeup_secondary_cpu_via_init(apicid, start_ip, cpu);
- destroy_work_on_stack(&c_idle.work);
- return boot_error;
+ /* If the wakeup mechanism failed, cleanup the warm reset vector */
+ if (ret)
+ arch_cpuhp_cleanup_kick_cpu(cpu);
+ return ret;
}
-int __cpuinit native_cpu_up(unsigned int cpu)
+int native_kick_ap(unsigned int cpu, struct task_struct *tidle)
{
- int apicid = apic->cpu_present_to_apicid(cpu);
- unsigned long flags;
+ u32 apicid = apic->cpu_present_to_apicid(cpu);
int err;
- WARN_ON(irqs_disabled());
+ lockdep_assert_irqs_enabled();
pr_debug("++++++++++++++++++++=_---CPU UP %u\n", cpu);
- if (apicid == BAD_APICID || apicid == boot_cpu_physical_apicid ||
- !physid_isset(apicid, phys_cpu_present_map)) {
- printk(KERN_ERR "%s: bad cpu %d\n", __func__, cpu);
+ if (apicid == BAD_APICID || !apic_id_valid(apicid)) {
+ pr_err("CPU %u has invalid APIC ID %x. Aborting bringup\n", cpu, apicid);
return -EINVAL;
}
- /*
- * Already booted CPU?
- */
- if (cpumask_test_cpu(cpu, cpu_callin_mask)) {
- pr_debug("do_boot_cpu %d Already started\n", cpu);
- return -ENOSYS;
+ if (!test_bit(apicid, phys_cpu_present_map)) {
+ pr_err("CPU %u APIC ID %x is not present. Aborting bringup\n", cpu, apicid);
+ return -EINVAL;
}
/*
@@ -892,269 +997,162 @@ int __cpuinit native_cpu_up(unsigned int cpu)
*/
mtrr_save_state();
- per_cpu(cpu_state, cpu) = CPU_UP_PREPARE;
+ /* the FPU context is blank, nobody can own it */
+ per_cpu(fpu_fpregs_owner_ctx, cpu) = NULL;
-#ifdef CONFIG_X86_32
- /* init low mem mapping */
- clone_pgd_range(swapper_pg_dir, swapper_pg_dir + KERNEL_PGD_BOUNDARY,
- min_t(unsigned long, KERNEL_PGD_PTRS, KERNEL_PGD_BOUNDARY));
- flush_tlb_all();
- low_mappings = 1;
-
- err = do_boot_cpu(apicid, cpu);
-
- zap_low_mappings(false);
- low_mappings = 0;
-#else
- err = do_boot_cpu(apicid, cpu);
-#endif
- if (err) {
- pr_debug("do_boot_cpu failed %d\n", err);
- return -EIO;
- }
-
- /*
- * Check TSC synchronization with the AP (keep irqs disabled
- * while doing so):
- */
- local_irq_save(flags);
- check_tsc_sync_source(cpu);
- local_irq_restore(flags);
+ err = common_cpu_up(cpu, tidle);
+ if (err)
+ return err;
- while (!cpu_online(cpu)) {
- cpu_relax();
- touch_nmi_watchdog();
- }
+ err = do_boot_cpu(apicid, cpu, tidle);
+ if (err)
+ pr_err("do_boot_cpu failed(%d) to wakeup CPU#%u\n", err, cpu);
- return 0;
+ return err;
}
-/*
- * Fall back to non SMP mode after errors.
- *
- * RED-PEN audit/test this more. I bet there is more state messed up here.
- */
-static __init void disable_smp(void)
+int arch_cpuhp_kick_ap_alive(unsigned int cpu, struct task_struct *tidle)
{
- init_cpu_present(cpumask_of(0));
- init_cpu_possible(cpumask_of(0));
- smpboot_clear_io_apic_irqs();
-
- if (smp_found_config)
- physid_set_mask_of_physid(boot_cpu_physical_apicid, &phys_cpu_present_map);
- else
- physid_set_mask_of_physid(0, &phys_cpu_present_map);
- map_cpu_to_logical_apicid();
- cpumask_set_cpu(0, cpu_sibling_mask(0));
- cpumask_set_cpu(0, cpu_core_mask(0));
+ return smp_ops.kick_ap_alive(cpu, tidle);
}
-/*
- * Various sanity checks.
- */
-static int __init smp_sanity_check(unsigned max_cpus)
+void arch_cpuhp_cleanup_kick_cpu(unsigned int cpu)
{
- preempt_disable();
-
-#if !defined(CONFIG_X86_BIGSMP) && defined(CONFIG_X86_32)
- if (def_to_bigsmp && nr_cpu_ids > 8) {
- unsigned int cpu;
- unsigned nr;
+ /* Cleanup possible dangling ends... */
+ if (smp_ops.kick_ap_alive == native_kick_ap && x86_platform.legacy.warm_reset)
+ smpboot_restore_warm_reset_vector();
+}
- printk(KERN_WARNING
- "More than 8 CPUs detected - skipping them.\n"
- "Use CONFIG_X86_BIGSMP.\n");
+void arch_cpuhp_cleanup_dead_cpu(unsigned int cpu)
+{
+ if (smp_ops.cleanup_dead_cpu)
+ smp_ops.cleanup_dead_cpu(cpu);
- nr = 0;
- for_each_present_cpu(cpu) {
- if (nr >= 8)
- set_cpu_present(cpu, false);
- nr++;
- }
+ if (system_state == SYSTEM_RUNNING)
+ pr_info("CPU %u is now offline\n", cpu);
+}
- nr = 0;
- for_each_possible_cpu(cpu) {
- if (nr >= 8)
- set_cpu_possible(cpu, false);
- nr++;
- }
+void arch_cpuhp_sync_state_poll(void)
+{
+ if (smp_ops.poll_sync_state)
+ smp_ops.poll_sync_state();
+}
- nr_cpu_ids = 8;
- }
-#endif
+/**
+ * arch_disable_smp_support() - Disables SMP support for x86 at boottime
+ */
+void __init arch_disable_smp_support(void)
+{
+ disable_ioapic_support();
+}
- if (!physid_isset(hard_smp_processor_id(), phys_cpu_present_map)) {
- printk(KERN_WARNING
- "weird, boot CPU (#%d) not listed by the BIOS.\n",
- hard_smp_processor_id());
+/*
+ * Fall back to non SMP mode after errors.
+ *
+ * RED-PEN audit/test this more. I bet there is more state messed up here.
+ */
+static __init void disable_smp(void)
+{
+ pr_info("SMP disabled\n");
- physid_set(hard_smp_processor_id(), phys_cpu_present_map);
- }
+ disable_ioapic_support();
+ topology_reset_possible_cpus_up();
- /*
- * If we couldn't find an SMP configuration at boot time,
- * get out of here now!
- */
- if (!smp_found_config && !acpi_lapic) {
- preempt_enable();
- printk(KERN_NOTICE "SMP motherboard not detected.\n");
- disable_smp();
- if (APIC_init_uniprocessor())
- printk(KERN_NOTICE "Local APIC not detected."
- " Using dummy APIC emulation.\n");
- return -1;
- }
+ cpumask_set_cpu(0, topology_sibling_cpumask(0));
+ cpumask_set_cpu(0, topology_core_cpumask(0));
+ cpumask_set_cpu(0, topology_die_cpumask(0));
+}
- /*
- * Should not be necessary because the MP table should list the boot
- * CPU too, but we do it for the sake of robustness anyway.
- */
- if (!apic->check_phys_apicid_present(boot_cpu_physical_apicid)) {
- printk(KERN_NOTICE
- "weird, boot CPU (#%d) not listed by the BIOS.\n",
- boot_cpu_physical_apicid);
- physid_set(hard_smp_processor_id(), phys_cpu_present_map);
- }
- preempt_enable();
+void __init smp_prepare_cpus_common(void)
+{
+ unsigned int cpu, node;
- /*
- * If we couldn't find a local APIC, then get out of here now!
- */
- if (APIC_INTEGRATED(apic_version[boot_cpu_physical_apicid]) &&
- !cpu_has_apic) {
- if (!disable_apic) {
- pr_err("BIOS bug, local APIC #%d not detected!...\n",
- boot_cpu_physical_apicid);
- pr_err("... forcing use of dummy APIC emulation."
- "(tell your hw vendor)\n");
- }
- smpboot_clear_io_apic();
- arch_disable_smp_support();
- return -1;
+ /* Mark all except the boot CPU as hotpluggable */
+ for_each_possible_cpu(cpu) {
+ if (cpu)
+ per_cpu(cpu_info.cpu_index, cpu) = nr_cpu_ids;
}
- verify_local_APIC();
-
- /*
- * If SMP should be disabled, then really disable it!
- */
- if (!max_cpus) {
- printk(KERN_INFO "SMP mode deactivated.\n");
- smpboot_clear_io_apic();
-
- localise_nmi_watchdog();
+ for_each_possible_cpu(cpu) {
+ node = cpu_to_node(cpu);
- connect_bsp_APIC();
- setup_local_APIC();
- end_local_APIC_setup();
- return -1;
+ zalloc_cpumask_var_node(&per_cpu(cpu_sibling_map, cpu), GFP_KERNEL, node);
+ zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu), GFP_KERNEL, node);
+ zalloc_cpumask_var_node(&per_cpu(cpu_die_map, cpu), GFP_KERNEL, node);
+ zalloc_cpumask_var_node(&per_cpu(cpu_llc_shared_map, cpu), GFP_KERNEL, node);
+ zalloc_cpumask_var_node(&per_cpu(cpu_l2c_shared_map, cpu), GFP_KERNEL, node);
}
- return 0;
+ set_cpu_sibling_map(0);
}
-static void __init smp_cpu_index_default(void)
+void __init smp_prepare_boot_cpu(void)
{
- int i;
- struct cpuinfo_x86 *c;
+ smp_ops.smp_prepare_boot_cpu();
+}
- for_each_possible_cpu(i) {
- c = &cpu_data(i);
- /* mark all to hotplug */
- c->cpu_index = nr_cpu_ids;
+#ifdef CONFIG_X86_64
+/* Establish whether parallel bringup can be supported. */
+bool __init arch_cpuhp_init_parallel_bringup(void)
+{
+ if (!x86_cpuinit.parallel_bringup) {
+ pr_info("Parallel CPU startup disabled by the platform\n");
+ return false;
}
+
+ smpboot_control = STARTUP_READ_APICID;
+ pr_debug("Parallel CPU startup enabled: 0x%08x\n", smpboot_control);
+ return true;
}
+#endif
/*
- * Prepare for SMP bootup. The MP table or ACPI has been read
- * earlier. Just do some sanity checking here and enable APIC mode.
+ * Prepare for SMP bootup.
+ * @max_cpus: configured maximum number of CPUs. It is a legacy parameter
+ * for common interface support.
*/
void __init native_smp_prepare_cpus(unsigned int max_cpus)
{
- unsigned int i;
+ smp_prepare_cpus_common();
- preempt_disable();
- smp_cpu_index_default();
- current_cpu_data = boot_cpu_data;
- cpumask_copy(cpu_callin_mask, cpumask_of(0));
- mb();
- /*
- * Setup boot CPU information
- */
- smp_store_cpu_info(0); /* Final full version of the data */
-#ifdef CONFIG_X86_32
- boot_cpu_logical_apicid = logical_smp_processor_id();
-#endif
- current_thread_info()->cpu = 0; /* needed? */
- for_each_possible_cpu(i) {
- zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i), GFP_KERNEL);
- zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
- zalloc_cpumask_var(&cpu_data(i).llc_shared_map, GFP_KERNEL);
- }
- set_cpu_sibling_map(0);
-
- enable_IR_x2apic();
- default_setup_apic_routing();
-
- if (smp_sanity_check(max_cpus) < 0) {
- printk(KERN_INFO "SMP disabled\n");
+ switch (apic_intr_mode) {
+ case APIC_PIC:
+ case APIC_VIRTUAL_WIRE_NO_CONFIG:
disable_smp();
- goto out;
- }
-
- preempt_disable();
- if (read_apic_id() != boot_cpu_physical_apicid) {
- panic("Boot APIC ID in local APIC unexpected (%d vs %d)",
- read_apic_id(), boot_cpu_physical_apicid);
- /* Or can we switch back to PIC here? */
+ return;
+ case APIC_SYMMETRIC_IO_NO_ROUTING:
+ disable_smp();
+ /* Setup local timer */
+ x86_init.timers.setup_percpu_clockev();
+ return;
+ case APIC_VIRTUAL_WIRE:
+ case APIC_SYMMETRIC_IO:
+ break;
}
- preempt_enable();
-
- connect_bsp_APIC();
- /*
- * Switch from PIC to APIC mode.
- */
- setup_local_APIC();
-
- /*
- * Enable IO APIC before setting up error vector
- */
- if (!skip_ioapic_setup && nr_ioapics)
- enable_IO_APIC();
-
- end_local_APIC_setup();
-
- map_cpu_to_logical_apicid();
+ /* Setup local timer */
+ x86_init.timers.setup_percpu_clockev();
- if (apic->setup_portio_remap)
- apic->setup_portio_remap();
+ pr_info("CPU0: ");
+ print_cpu_info(&cpu_data(0));
- smpboot_setup_io_apic();
- /*
- * Set up local APIC timer on boot CPU.
- */
+ uv_system_init();
- printk(KERN_INFO "CPU%d: ", 0);
- print_cpu_info(&cpu_data(0));
- x86_init.timers.setup_percpu_clockev();
+ smp_set_init_udelay();
- if (is_uv_system())
- uv_system_init();
+ speculative_store_bypass_ht_init();
- set_mtrr_aps_delayed_init();
-out:
- preempt_enable();
+ snp_set_wakeup_secondary_cpu();
}
-void arch_enable_nonboot_cpus_begin(void)
+void arch_thaw_secondary_cpus_begin(void)
{
- set_mtrr_aps_delayed_init();
+ set_cache_aps_delayed_init(true);
}
-void arch_enable_nonboot_cpus_end(void)
+void arch_thaw_secondary_cpus_end(void)
{
- mtrr_aps_init();
+ cache_aps_init();
}
/*
@@ -1163,134 +1161,88 @@ void arch_enable_nonboot_cpus_end(void)
void __init native_smp_prepare_boot_cpu(void)
{
int me = smp_processor_id();
- switch_to_new_gdt(me);
- /* already set me in cpu_online_mask in boot_cpu_init() */
- cpumask_set_cpu(me, cpu_callout_mask);
- per_cpu(cpu_state, me) = CPU_ONLINE;
+
+ /* SMP handles this from setup_per_cpu_areas() */
+ if (!IS_ENABLED(CONFIG_SMP))
+ switch_gdt_and_percpu_base(me);
+
+ native_pv_lock_init();
}
void __init native_smp_cpus_done(unsigned int max_cpus)
{
- pr_debug("Boot done.\n");
+ pr_debug("Boot done\n");
+ build_sched_topology();
+ nmi_selftest();
impress_friends();
-#ifdef CONFIG_X86_IO_APIC
- setup_ioapic_dest();
-#endif
- check_nmi_watchdog();
- mtrr_aps_init();
+ cache_aps_init();
}
-static int __initdata setup_possible_cpus = -1;
-static int __init _setup_possible_cpus(char *str)
+/* correctly size the local cpu masks */
+void __init setup_cpu_local_masks(void)
{
- get_option(&str, &setup_possible_cpus);
- return 0;
+ alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
}
-early_param("possible_cpus", _setup_possible_cpus);
-
-
-/*
- * cpu_possible_mask should be static, it cannot change as cpu's
- * are onlined, or offlined. The reason is per-cpu data-structures
- * are allocated by some modules at init time, and dont expect to
- * do this dynamically on cpu arrival/departure.
- * cpu_present_mask on the other hand can change dynamically.
- * In case when cpu_hotplug is not compiled, then we resort to current
- * behaviour, which is cpu_possible == cpu_present.
- * - Ashok Raj
- *
- * Three ways to find out the number of additional hotplug CPUs:
- * - If the BIOS specified disabled CPUs in ACPI/mptables use that.
- * - The user can overwrite it with possible_cpus=NUM
- * - Otherwise don't reserve additional CPUs.
- * We do this because additional CPUs waste a lot of memory.
- * -AK
- */
-__init void prefill_possible_map(void)
-{
- int i, possible;
- /* no processor from mptable or madt */
- if (!num_processors)
- num_processors = 1;
-
- i = setup_max_cpus ?: 1;
- if (setup_possible_cpus == -1) {
- possible = num_processors;
#ifdef CONFIG_HOTPLUG_CPU
- if (setup_max_cpus)
- possible += disabled_cpus;
-#else
- if (possible > i)
- possible = i;
-#endif
- } else
- possible = setup_possible_cpus;
- total_cpus = max_t(int, possible, num_processors + disabled_cpus);
+/* Recompute SMT state for all CPUs on offline */
+static void recompute_smt_state(void)
+{
+ int max_threads, cpu;
- /* nr_cpu_ids could be reduced via nr_cpus= */
- if (possible > nr_cpu_ids) {
- printk(KERN_WARNING
- "%d Processors exceeds NR_CPUS limit of %d\n",
- possible, nr_cpu_ids);
- possible = nr_cpu_ids;
- }
+ max_threads = 0;
+ for_each_online_cpu(cpu) {
+ int threads = cpumask_weight(topology_sibling_cpumask(cpu));
-#ifdef CONFIG_HOTPLUG_CPU
- if (!setup_max_cpus)
-#endif
- if (possible > i) {
- printk(KERN_WARNING
- "%d Processors exceeds max_cpus limit of %u\n",
- possible, setup_max_cpus);
- possible = i;
+ if (threads > max_threads)
+ max_threads = threads;
}
-
- printk(KERN_INFO "SMP: Allowing %d CPUs, %d hotplug CPUs\n",
- possible, max_t(int, possible - num_processors, 0));
-
- for (i = 0; i < possible; i++)
- set_cpu_possible(i, true);
- for (; i < NR_CPUS; i++)
- set_cpu_possible(i, false);
-
- nr_cpu_ids = possible;
+ __max_smt_threads = max_threads;
}
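
As an aside for readers of this hunk: the new recompute_smt_state() is just a max of popcounts over the online CPUs' sibling masks. A minimal userspace sketch of the same computation, with made-up masks and a GCC-style popcount builtin standing in for cpumask_weight():

#include <stdio.h>

int main(void)
{
	/* Hypothetical sibling masks: two full SMT-2 cores, plus one
	 * core whose second thread is offline. cpumask_weight()
	 * becomes a popcount here. */
	unsigned long sibling_mask[] = { 0x3, 0x3, 0x4 };
	int max_threads = 0;

	for (unsigned int i = 0; i < sizeof(sibling_mask) / sizeof(sibling_mask[0]); i++) {
		int threads = __builtin_popcountl(sibling_mask[i]);

		if (threads > max_threads)
			max_threads = threads;
	}
	printf("max SMT threads: %d\n", max_threads);	/* prints 2 */
	return 0;
}
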
-#ifdef CONFIG_HOTPLUG_CPU
-
static void remove_siblinginfo(int cpu)
{
int sibling;
struct cpuinfo_x86 *c = &cpu_data(cpu);
- for_each_cpu(sibling, cpu_core_mask(cpu)) {
- cpumask_clear_cpu(cpu, cpu_core_mask(sibling));
+ for_each_cpu(sibling, topology_core_cpumask(cpu)) {
+ cpumask_clear_cpu(cpu, topology_core_cpumask(sibling));
/*
* last thread sibling in this cpu core going down
*/
- if (cpumask_weight(cpu_sibling_mask(cpu)) == 1)
+ if (cpumask_weight(topology_sibling_cpumask(cpu)) == 1)
cpu_data(sibling).booted_cores--;
}
- for_each_cpu(sibling, cpu_sibling_mask(cpu))
- cpumask_clear_cpu(cpu, cpu_sibling_mask(sibling));
- cpumask_clear(cpu_sibling_mask(cpu));
- cpumask_clear(cpu_core_mask(cpu));
- c->phys_proc_id = 0;
- c->cpu_core_id = 0;
+ for_each_cpu(sibling, topology_die_cpumask(cpu))
+ cpumask_clear_cpu(cpu, topology_die_cpumask(sibling));
+
+ for_each_cpu(sibling, topology_sibling_cpumask(cpu)) {
+ cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
+ if (cpumask_weight(topology_sibling_cpumask(sibling)) == 1)
+ cpu_data(sibling).smt_active = false;
+ }
+
+ for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
+ cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+ for_each_cpu(sibling, cpu_l2c_shared_mask(cpu))
+ cpumask_clear_cpu(cpu, cpu_l2c_shared_mask(sibling));
+ cpumask_clear(cpu_llc_shared_mask(cpu));
+ cpumask_clear(cpu_l2c_shared_mask(cpu));
+ cpumask_clear(topology_sibling_cpumask(cpu));
+ cpumask_clear(topology_core_cpumask(cpu));
+ cpumask_clear(topology_die_cpumask(cpu));
+ c->topo.core_id = 0;
+ c->booted_cores = 0;
cpumask_clear_cpu(cpu, cpu_sibling_setup_mask);
+ recompute_smt_state();
}
-static void __ref remove_cpu_from_maps(int cpu)
+static void remove_cpu_from_maps(int cpu)
{
set_cpu_online(cpu, false);
- cpumask_clear_cpu(cpu, cpu_callout_mask);
- cpumask_clear_cpu(cpu, cpu_callin_mask);
- /* was set by cpu_init() */
- cpumask_clear_cpu(cpu, cpu_initialized_mask);
numa_remove_cpu(cpu);
}
@@ -1300,78 +1252,160 @@ void cpu_disable_common(void)
remove_siblinginfo(cpu);
+ /*
+ * Stop allowing kernel-mode FPU. This is needed so that if the CPU is
+ * brought online again, it starts out with kernel-mode FPU disallowed:
+ */
+ this_cpu_write(kernel_fpu_allowed, false);
+
/* It's now safe to remove this processor from the online map */
lock_vector_lock();
remove_cpu_from_maps(cpu);
unlock_vector_lock();
fixup_irqs();
+ lapic_offline();
}
int native_cpu_disable(void)
{
- int cpu = smp_processor_id();
-
- /*
- * Perhaps use cpufreq to drop frequency, but that could go
- * into generic code.
- *
- * We won't take down the boot processor on i386 due to some
- * interrupts only being able to be serviced by the BSP.
- * Especially so if we're not using an IOAPIC -zwane
- */
- if (cpu == 0)
- return -EBUSY;
+ int ret;
- if (nmi_watchdog == NMI_LOCAL_APIC)
- stop_apic_nmi_watchdog(NULL);
- clear_local_APIC();
+ ret = lapic_can_unplug_cpu();
+ if (ret)
+ return ret;
cpu_disable_common();
+
+ /*
+ * Disable the local APIC. Otherwise IPI broadcasts will reach
+ * it. It still responds normally to INIT, NMI, SMI, and SIPI
+ * messages.
+ *
+ * Disabling the APIC must happen after cpu_disable_common()
+ * which invokes fixup_irqs().
+ *
+ * Disabling the APIC preserves already set bits in IRR, but
+ * an interrupt arriving after disabling the local APIC does not
+ * set the corresponding IRR bit.
+ *
+ * fixup_irqs() scans IRR for set bits so it can raise a not
+ * yet handled interrupt on the new destination CPU via an IPI
+ * but obviously it can't do so for IRR bits which are not set.
+ * IOW, interrupts arriving after disabling the local APIC will
+ * be lost.
+ */
+ apic_soft_disable();
+
return 0;
}
-void native_cpu_die(unsigned int cpu)
+void play_dead_common(void)
{
- /* We don't do anything here: idle task is faking death itself. */
- unsigned int i;
-
- for (i = 0; i < 10; i++) {
- /* They ack this in play_dead by setting CPU_DEAD */
- if (per_cpu(cpu_state, cpu) == CPU_DEAD) {
- if (system_state == SYSTEM_RUNNING)
- pr_info("CPU %u is now offline\n", cpu);
-
- if (1 == num_online_cpus())
- alternatives_smp_switch(0);
- return;
+ idle_task_exit();
+
+ cpuhp_ap_report_dead();
+
+ local_irq_disable();
+}
+
+/*
+ * We need to flush the caches before going to sleep, lest we have
+ * dirty data in our caches when we come back up.
+ */
+void __noreturn mwait_play_dead(unsigned int eax_hint)
+{
+ struct mwait_cpu_dead *md = this_cpu_ptr(&mwait_cpu_dead);
+
+ /* Set up state for the kexec() hack below */
+ md->status = CPUDEAD_MWAIT_WAIT;
+ md->control = CPUDEAD_MWAIT_WAIT;
+
+ wbinvd();
+
+ while (1) {
+ /*
+ * The CLFLUSH is a workaround for erratum AAI65 for
+ * the Xeon 7400 series. It's not clear it is actually
+ * needed, but it should be harmless in either case.
+ * The WBINVD is insufficient due to the spurious-wakeup
+ * case where we return around the loop.
+ */
+ mb();
+ clflush(md);
+ mb();
+ __monitor(md, 0, 0);
+ mb();
+ __mwait(eax_hint, 0);
+
+ if (READ_ONCE(md->control) == CPUDEAD_MWAIT_KEXEC_HLT) {
+ /*
+ * Kexec is about to happen. Don't go back into mwait() as
+ * the kexec kernel might overwrite text and data including
+ * page tables and stack. So mwait() would resume when the
+ * monitor cache line is written to and then the CPU goes
+ * south due to overwritten text, page tables and stack.
+ *
+ * Note: This does _NOT_ protect against a stray MCE, NMI,
+ * SMI. They will resume execution at the instruction
+ * following the HLT instruction and run into the problem
+ * which this is trying to prevent.
+ */
+ WRITE_ONCE(md->status, CPUDEAD_MWAIT_KEXEC_HLT);
+ while (1)
+ native_halt();
}
- msleep(100);
}
- pr_err("CPU %u didn't die...\n", cpu);
}
-void play_dead_common(void)
+/*
+ * Kick all "offline" CPUs out of mwait on kexec(). See comment in
+ * mwait_play_dead().
+ */
+void smp_kick_mwait_play_dead(void)
{
- idle_task_exit();
- reset_lazy_tlbstate();
- irq_ctx_exit(raw_smp_processor_id());
- c1e_remove_cpu(raw_smp_processor_id());
+ u32 newstate = CPUDEAD_MWAIT_KEXEC_HLT;
+ struct mwait_cpu_dead *md;
+ unsigned int cpu, i;
+
+ for_each_cpu_andnot(cpu, cpu_present_mask, cpu_online_mask) {
+ md = per_cpu_ptr(&mwait_cpu_dead, cpu);
+
+ /* Does it sit in mwait_play_dead() ? */
+ if (READ_ONCE(md->status) != CPUDEAD_MWAIT_WAIT)
+ continue;
+
+ /* Wait up to 5ms */
+ for (i = 0; READ_ONCE(md->status) != newstate && i < 1000; i++) {
+ /* Bring it out of mwait */
+ WRITE_ONCE(md->control, newstate);
+ udelay(5);
+ }
- mb();
- /* Ack it */
- __get_cpu_var(cpu_state) = CPU_DEAD;
+ if (READ_ONCE(md->status) != newstate)
+ pr_err_once("CPU%u is stuck in mwait_play_dead()\n", cpu);
+ }
+}
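
The control/status pair above forms a small handshake: the kexec path writes control to kick a parked CPU out of MWAIT, and the parked CPU acknowledges through status before halting. A rough userspace model of that protocol, using a spinning thread and C11 atomics in place of MONITOR/MWAIT (the constants mirror the kernel's names, but everything here is illustrative):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define MWAIT_WAIT	0xdeadbeefU	/* parked, monitoring */
#define MWAIT_KEXEC_HLT	0x4a11deadU	/* kicked out, halted */

static _Atomic unsigned int control = MWAIT_WAIT;
static _Atomic unsigned int status  = MWAIT_WAIT;

static void *parked_cpu(void *arg)
{
	/* Stand-in for the mwait loop: wake when 'control' changes. */
	while (atomic_load(&control) != MWAIT_KEXEC_HLT)
		;
	/* Acknowledge, then "halt" (here: just return). */
	atomic_store(&status, MWAIT_KEXEC_HLT);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, parked_cpu, NULL);
	atomic_store(&control, MWAIT_KEXEC_HLT);	/* the kexec-side kick */
	while (atomic_load(&status) != MWAIT_KEXEC_HLT)
		;	/* the kernel bounds this wait to roughly 5ms */
	printf("parked CPU acknowledged the kick\n");
	pthread_join(t, NULL);
	return 0;
}
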
- /*
- * With physical CPU hotplug, we should halt the cpu
- */
- local_irq_disable();
+void __noreturn hlt_play_dead(void)
+{
+ if (__this_cpu_read(cpu_info.x86) >= 4)
+ wbinvd();
+
+ while (1)
+ native_halt();
}
-void native_play_dead(void)
+void __noreturn native_play_dead(void)
{
+ if (cpu_feature_enabled(X86_FEATURE_KERNEL_IBRS))
+ __update_spec_ctrl(0);
+
play_dead_common();
tboot_shutdown(TB_SHUTDOWN_WFS);
- wbinvd_halt();
+
+ /* Below returns only on error. */
+ cpuidle_play_dead();
+ hlt_play_dead();
}
#else /* ... !CONFIG_HOTPLUG_CPU */
@@ -1380,13 +1414,7 @@ int native_cpu_disable(void)
return -ENOSYS;
}
-void native_cpu_die(unsigned int cpu)
-{
- /* We said "no" in __cpu_disable */
- BUG();
-}
-
-void native_play_dead(void)
+void __noreturn native_play_dead(void)
{
BUG();
}
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index 922eefbb3f6c..ee117fcf46ed 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -4,129 +4,115 @@
* Copyright (C) 2006-2009 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
*/
#include <linux/sched.h>
+#include <linux/sched/debug.h>
+#include <linux/sched/task_stack.h>
#include <linux/stacktrace.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/uaccess.h>
#include <asm/stacktrace.h>
+#include <asm/unwind.h>
-static void save_stack_warning(void *data, char *msg)
+void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
+ struct task_struct *task, struct pt_regs *regs)
{
-}
+ struct unwind_state state;
+ unsigned long addr;
-static void
-save_stack_warning_symbol(void *data, char *msg, unsigned long symbol)
-{
-}
-
-static int save_stack_stack(void *data, char *name)
-{
- return 0;
-}
-
-static void save_stack_address(void *data, unsigned long addr, int reliable)
-{
- struct stack_trace *trace = data;
- if (!reliable)
- return;
- if (trace->skip > 0) {
- trace->skip--;
+ if (regs && !consume_entry(cookie, regs->ip))
return;
+
+ for (unwind_start(&state, task, regs, NULL); !unwind_done(&state);
+ unwind_next_frame(&state)) {
+ addr = unwind_get_return_address(&state);
+ if (!addr || !consume_entry(cookie, addr))
+ break;
}
- if (trace->nr_entries < trace->max_entries)
- trace->entries[trace->nr_entries++] = addr;
}
-static void
-save_stack_address_nosched(void *data, unsigned long addr, int reliable)
+int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
+ void *cookie, struct task_struct *task)
{
- struct stack_trace *trace = (struct stack_trace *)data;
- if (!reliable)
- return;
- if (in_sched_functions(addr))
- return;
- if (trace->skip > 0) {
- trace->skip--;
- return;
- }
- if (trace->nr_entries < trace->max_entries)
- trace->entries[trace->nr_entries++] = addr;
-}
+ struct unwind_state state;
+ struct pt_regs *regs;
+ unsigned long addr;
+
+ for (unwind_start(&state, task, NULL, NULL);
+ !unwind_done(&state) && !unwind_error(&state);
+ unwind_next_frame(&state)) {
+
+ regs = unwind_get_entry_regs(&state, NULL);
+ if (regs) {
+ /* Success path for user tasks */
+ if (user_mode(regs))
+ return 0;
+
+ /*
+ * Kernel mode registers on the stack indicate an
+ * in-kernel interrupt or exception (e.g., preemption
+ * or a page fault), which can make frame pointers
+ * unreliable.
+ */
+ if (IS_ENABLED(CONFIG_FRAME_POINTER))
+ return -EINVAL;
+ }
-static const struct stacktrace_ops save_stack_ops = {
- .warning = save_stack_warning,
- .warning_symbol = save_stack_warning_symbol,
- .stack = save_stack_stack,
- .address = save_stack_address,
- .walk_stack = print_context_stack,
-};
+ addr = unwind_get_return_address(&state);
-static const struct stacktrace_ops save_stack_ops_nosched = {
- .warning = save_stack_warning,
- .warning_symbol = save_stack_warning_symbol,
- .stack = save_stack_stack,
- .address = save_stack_address_nosched,
- .walk_stack = print_context_stack,
-};
+ /*
+ * A NULL or invalid return address probably means there's some
+ * generated code which __kernel_text_address() doesn't know
+ * about.
+ */
+ if (!addr)
+ return -EINVAL;
-/*
- * Save stack-backtrace addresses into a stack_trace buffer.
- */
-void save_stack_trace(struct stack_trace *trace)
-{
- dump_trace(current, NULL, NULL, 0, &save_stack_ops, trace);
- if (trace->nr_entries < trace->max_entries)
- trace->entries[trace->nr_entries++] = ULONG_MAX;
-}
-EXPORT_SYMBOL_GPL(save_stack_trace);
+ if (!consume_entry(cookie, addr))
+ return -EINVAL;
+ }
-void save_stack_trace_bp(struct stack_trace *trace, unsigned long bp)
-{
- dump_trace(current, NULL, NULL, bp, &save_stack_ops, trace);
- if (trace->nr_entries < trace->max_entries)
- trace->entries[trace->nr_entries++] = ULONG_MAX;
-}
+ /* Check for stack corruption */
+ if (unwind_error(&state))
+ return -EINVAL;
-void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
-{
- dump_trace(tsk, NULL, NULL, 0, &save_stack_ops_nosched, trace);
- if (trace->nr_entries < trace->max_entries)
- trace->entries[trace->nr_entries++] = ULONG_MAX;
+ return 0;
}
-EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
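
The consume_entry callback is the whole contract of the new arch_stack_walk() API above: the walker feeds addresses to a caller-supplied function until it returns false. A self-contained sketch of that shape, with a fake three-entry unwinder (the addresses and the "budget" cookie are invented for illustration):

#include <stdbool.h>
#include <stdio.h>

typedef bool (*consume_fn)(void *cookie, unsigned long addr);

/* Fake unwinder: hands out three canned addresses. */
static void walk(consume_fn consume, void *cookie)
{
	static const unsigned long trace[] = { 0x1000, 0x2000, 0x3000 };

	for (unsigned int i = 0; i < sizeof(trace) / sizeof(trace[0]); i++)
		if (!consume(cookie, trace[i]))
			break;
}

/* Consumer that stops the walk once its budget is spent. */
static bool take_some(void *cookie, unsigned long addr)
{
	int *budget = cookie;

	printf("entry %#lx\n", addr);
	return --*budget > 0;
}

int main(void)
{
	int budget = 2;

	walk(take_some, &budget);	/* prints two of the three entries */
	return 0;
}
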
/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
-struct stack_frame {
+struct stack_frame_user {
const void __user *next_fp;
unsigned long ret_addr;
};
-static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
+static int
+copy_stack_frame(const struct stack_frame_user __user *fp,
+ struct stack_frame_user *frame)
{
int ret;
- if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
+ if (!__access_ok(fp, sizeof(*frame)))
return 0;
ret = 1;
pagefault_disable();
- if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
+ if (__get_user(frame->next_fp, &fp->next_fp) ||
+ __get_user(frame->ret_addr, &fp->ret_addr))
ret = 0;
pagefault_enable();
return ret;
}
-static inline void __save_stack_trace_user(struct stack_trace *trace)
+void arch_stack_walk_user(stack_trace_consume_fn consume_entry, void *cookie,
+ const struct pt_regs *regs)
{
- const struct pt_regs *regs = task_pt_regs(current);
const void __user *fp = (const void __user *)regs->bp;
- if (trace->nr_entries < trace->max_entries)
- trace->entries[trace->nr_entries++] = regs->ip;
+ if (!consume_entry(cookie, regs->ip))
+ return;
- while (trace->nr_entries < trace->max_entries) {
- struct stack_frame frame;
+ while (1) {
+ struct stack_frame_user frame;
frame.next_fp = NULL;
frame.ret_addr = 0;
@@ -134,25 +120,11 @@ static inline void __save_stack_trace_user(struct stack_trace *trace)
break;
if ((unsigned long)fp < regs->sp)
break;
- if (frame.ret_addr) {
- trace->entries[trace->nr_entries++] =
- frame.ret_addr;
- }
- if (fp == frame.next_fp)
+ if (!frame.ret_addr)
+ break;
+ if (!consume_entry(cookie, frame.ret_addr))
break;
fp = frame.next_fp;
}
}
-void save_stack_trace_user(struct stack_trace *trace)
-{
- /*
- * Trace user stack if we are not a kernel thread
- */
- if (current->mm) {
- __save_stack_trace_user(trace);
- }
- if (trace->nr_entries < trace->max_entries)
- trace->entries[trace->nr_entries++] = ULONG_MAX;
-}
-
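
The user-stack walk above is the classic frame-pointer chain: each frame begins with {caller's fp, return address}, and the loop follows next_fp while it keeps increasing. The same walk can be sketched in userspace against the program's own stack; this assumes x86-64 frame layout and should be built with -O0 -fno-omit-frame-pointer so the chain actually exists:

#include <stdio.h>

struct frame {
	struct frame *next_fp;
	unsigned long ret_addr;
};

static void show(void)
{
	struct frame *fp = __builtin_frame_address(0);

	/* Stay shallow: we only know the three frames below us are ours. */
	for (int depth = 0; fp && depth < 3; depth++) {
		printf("ret %#lx\n", fp->ret_addr);
		if (fp->next_fp <= fp)	/* same monotonicity rule as above */
			break;
		fp = fp->next_fp;
	}
}

static void a(void) { show(); }
static void b(void) { a(); }

int main(void)
{
	b();
	return 0;
}
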
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
new file mode 100644
index 000000000000..61592e41a6b1
--- /dev/null
+++ b/arch/x86/kernel/static_call.c
@@ -0,0 +1,231 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/static_call.h>
+#include <linux/memory.h>
+#include <linux/bug.h>
+#include <asm/text-patching.h>
+
+enum insn_type {
+ CALL = 0, /* site call */
+ NOP = 1, /* site cond-call */
+ JMP = 2, /* tramp / site tail-call */
+ RET = 3, /* tramp / site cond-tail-call */
+ JCC = 4,
+};
+
+/*
+ * ud1 %esp, %ecx - a 3 byte #UD that is unique to trampolines, chosen such
+ * that there is no false-positive trampoline identification while also being a
+ * speculation stop.
+ */
+static const u8 tramp_ud[] = { 0x0f, 0xb9, 0xcc };
+
+/*
+ * cs cs cs xorl %eax, %eax - a single 5 byte instruction that clears %[er]ax
+ */
+static const u8 xor5rax[] = { 0x2e, 0x2e, 0x2e, 0x31, 0xc0 };
+
+static const u8 retinsn[] = { RET_INSN_OPCODE, 0xcc, 0xcc, 0xcc, 0xcc };
+
+/*
+ * ud1 (%edx),%rdi -- see __WARN_trap() / decode_bug()
+ */
+static const u8 warninsn[] = { 0x67, 0x48, 0x0f, 0xb9, 0x3a };
+
+static u8 __is_Jcc(u8 *insn) /* Jcc.d32 */
+{
+ u8 ret = 0;
+
+ if (insn[0] == 0x0f) {
+ u8 tmp = insn[1];
+ if ((tmp & 0xf0) == 0x80)
+ ret = tmp;
+ }
+
+ return ret;
+}
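
__is_Jcc() keys off the two-byte Jcc rel32 encoding: 0x0F followed by 0x80..0x8F. A standalone check of that decode rule (the byte sequences are ordinary x86 encodings; the function is a rewrite for illustration only):

#include <assert.h>

static unsigned char is_jcc(const unsigned char *insn)
{
	/* 0x0F 0x80..0x8F is the Jcc.d32 family; return the condition byte. */
	if (insn[0] == 0x0f && (insn[1] & 0xf0) == 0x80)
		return insn[1];
	return 0;
}

int main(void)
{
	const unsigned char je[]  = { 0x0f, 0x84, 0x00, 0x00, 0x00, 0x00 };	/* je rel32 */
	const unsigned char nop[] = { 0x0f, 0x1f, 0x00 };			/* nopl (%rax) */

	assert(is_jcc(je) == 0x84);
	assert(is_jcc(nop) == 0);
	return 0;
}
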
+
+extern void __static_call_return(void);
+
+asm (".global __static_call_return\n\t"
+ ".type __static_call_return, @function\n\t"
+ ASM_FUNC_ALIGN "\n\t"
+ "__static_call_return:\n\t"
+ ANNOTATE_NOENDBR "\n\t"
+ ANNOTATE_RETPOLINE_SAFE "\n\t"
+ "ret; int3\n\t"
+ ".size __static_call_return, . - __static_call_return \n\t");
+
+static void __ref __static_call_transform(void *insn, enum insn_type type,
+ void *func, bool modinit)
+{
+ const void *emulate = NULL;
+ int size = CALL_INSN_SIZE;
+ const void *code;
+ u8 op, buf[6];
+
+ if ((type == JMP || type == RET) && (op = __is_Jcc(insn)))
+ type = JCC;
+
+ switch (type) {
+ case CALL:
+ func = callthunks_translate_call_dest(func);
+ code = text_gen_insn(CALL_INSN_OPCODE, insn, func);
+ if (func == &__static_call_return0) {
+ emulate = code;
+ code = &xor5rax;
+ }
+ if (func == &__WARN_trap) {
+ emulate = code;
+ code = &warninsn;
+ }
+ break;
+
+ case NOP:
+ code = x86_nops[5];
+ break;
+
+ case JMP:
+ code = text_gen_insn(JMP32_INSN_OPCODE, insn, func);
+ break;
+
+ case RET:
+ if (cpu_wants_rethunk_at(insn))
+ code = text_gen_insn(JMP32_INSN_OPCODE, insn, x86_return_thunk);
+ else
+ code = &retinsn;
+ break;
+
+ case JCC:
+ if (!func) {
+ func = __static_call_return;
+ if (cpu_wants_rethunk())
+ func = x86_return_thunk;
+ }
+
+ buf[0] = 0x0f;
+ __text_gen_insn(buf+1, op, insn+1, func, 5);
+ code = buf;
+ size = 6;
+
+ break;
+ }
+
+ if (memcmp(insn, code, size) == 0)
+ return;
+
+ if (system_state == SYSTEM_BOOTING || modinit)
+ return text_poke_early(insn, code, size);
+
+ smp_text_poke_single(insn, code, size, emulate);
+}
+
+static void __static_call_validate(u8 *insn, bool tail, bool tramp)
+{
+ u8 opcode = insn[0];
+
+ if (tramp && memcmp(insn+5, tramp_ud, 3)) {
+ pr_err("trampoline signature fail");
+ BUG();
+ }
+
+ if (tail) {
+ if (opcode == JMP32_INSN_OPCODE ||
+ opcode == RET_INSN_OPCODE ||
+ __is_Jcc(insn))
+ return;
+ } else {
+ if (opcode == CALL_INSN_OPCODE ||
+ !memcmp(insn, x86_nops[5], 5) ||
+ !memcmp(insn, xor5rax, 5) ||
+ !memcmp(insn, warninsn, 5))
+ return;
+ }
+
+ /*
+ * If we ever trigger this, our text is corrupt and we'll probably not live long.
+ */
+ pr_err("unexpected static_call insn opcode 0x%x at %pS\n", opcode, insn);
+ BUG();
+}
+
+static inline enum insn_type __sc_insn(bool null, bool tail)
+{
+ /*
+ * Encode the following table without branches:
+ *
+ * tail null insn
+ * -----+-------+------
+ * 0 | 0 | CALL
+ * 0 | 1 | NOP
+ * 1 | 0 | JMP
+ * 1 | 1 | RET
+ */
+ return 2*tail + null;
+}
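
The branchless encoding is easy to sanity-check outside the kernel: 2*tail + null lands on exactly the enum values in the table. A minimal standalone version:

#include <assert.h>
#include <stdbool.h>

enum insn_type { CALL = 0, NOP = 1, JMP = 2, RET = 3 };

static enum insn_type sc_insn(bool null, bool tail)
{
	return 2*tail + null;
}

int main(void)
{
	assert(sc_insn(false, false) == CALL);	/* call site, real target */
	assert(sc_insn(true,  false) == NOP);	/* call site, NULL target */
	assert(sc_insn(false, true)  == JMP);	/* tail call, real target */
	assert(sc_insn(true,  true)  == RET);	/* tail call, NULL target */
	return 0;
}
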
+
+void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
+{
+ mutex_lock(&text_mutex);
+
+ if (tramp && !site) {
+ __static_call_validate(tramp, true, true);
+ __static_call_transform(tramp, __sc_insn(!func, true), func, false);
+ }
+
+ if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) && site) {
+ __static_call_validate(site, tail, false);
+ __static_call_transform(site, __sc_insn(!func, tail), func, false);
+ }
+
+ mutex_unlock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(arch_static_call_transform);
+
+noinstr void __static_call_update_early(void *tramp, void *func)
+{
+ BUG_ON(system_state != SYSTEM_BOOTING);
+ BUG_ON(static_call_initialized);
+ __text_gen_insn(tramp, JMP32_INSN_OPCODE, tramp, func, JMP32_INSN_SIZE);
+ sync_core();
+}
+
+#ifdef CONFIG_MITIGATION_RETHUNK
+/*
+ * This is called by apply_returns() to fix up static call trampolines,
+ * specifically ARCH_DEFINE_STATIC_CALL_NULL_TRAMP which is recorded as
+ * having a return trampoline.
+ *
+ * The problem is that static_call() is available before determining
+ * X86_FEATURE_RETHUNK and, by implication, running alternatives.
+ *
+ * This means that __static_call_transform() above can have overwritten the
+ * return trampoline and we now need to fix things up to be consistent.
+ */
+bool __static_call_fixup(void *tramp, u8 op, void *dest)
+{
+ unsigned long addr = (unsigned long)tramp;
+ /*
+ * Not all .return_sites are a static_call trampoline (most are not).
+ * Check if the 3 bytes after the return are still kernel text, if not,
+ * then this definitely is not a trampoline and we need not worry
+ * further.
+ *
+ * This avoids the memcmp() below tripping over pagefaults etc..
+ */
+ if (((addr >> PAGE_SHIFT) != ((addr + 7) >> PAGE_SHIFT)) &&
+ !kernel_text_address(addr + 7))
+ return false;
+
+ if (memcmp(tramp+5, tramp_ud, 3)) {
+ /* Not a trampoline site, not our problem. */
+ return false;
+ }
+
+ mutex_lock(&text_mutex);
+ if (op == RET_INSN_OPCODE || dest == &__x86_return_thunk)
+ __static_call_transform(tramp, RET, NULL, true);
+ mutex_unlock(&text_mutex);
+
+ return true;
+}
+#endif
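
The guard in __static_call_fixup() relies on a page-crossing observation: reading tramp+5..tramp+7 can only fault if those bytes sit on a different page than tramp itself. That test is portable enough to check standalone (PAGE_SHIFT hard-coded to 4K pages for this sketch):

#include <assert.h>

#define PAGE_SHIFT 12

/* True if any of the 8 bytes at addr spill onto the next page. */
static int crosses_page(unsigned long addr)
{
	return (addr >> PAGE_SHIFT) != ((addr + 7) >> PAGE_SHIFT);
}

int main(void)
{
	assert(!crosses_page(0x1000));	/* bytes 0x1000..0x1007, one page */
	assert(crosses_page(0x1ffa));	/* byte 0x2001 is on the next page */
	return 0;
}
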
diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index 58de45ee08b6..3e2952679b88 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -1,22 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* x86 single-step support code, common to 32-bit and 64-bit.
*/
#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
#include <linux/mm.h>
#include <linux/ptrace.h>
+
#include <asm/desc.h>
+#include <asm/debugreg.h>
+#include <asm/mmu_context.h>
unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *regs)
{
unsigned long addr, seg;
addr = regs->ip;
- seg = regs->cs & 0xffff;
+ seg = regs->cs;
if (v8086_mode(regs)) {
addr = (addr & 0xffff) + (seg << 4);
return addr;
}
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
/*
* We'll assume that the code segments in the GDT
* are all zero-based. That is largely true: the
@@ -27,13 +33,14 @@ unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *re
struct desc_struct *desc;
unsigned long base;
- seg &= ~7UL;
+ seg >>= 3;
mutex_lock(&child->mm->context.lock);
- if (unlikely((seg >> 3) >= child->mm->context.size))
+ if (unlikely(!child->mm->context.ldt ||
+ seg >= child->mm->context.ldt->nr_entries))
addr = -1L; /* bogus selector, access would fault */
else {
- desc = child->mm->context.ldt + seg;
+ desc = &child->mm->context.ldt->entries[seg];
base = get_desc_base(desc);
/* 16-bit code segment? */
@@ -43,6 +50,7 @@ unsigned long convert_ip_to_linear(struct task_struct *child, struct pt_regs *re
}
mutex_unlock(&child->mm->context.lock);
}
+#endif
return addr;
}
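
For v8086 mode the linear-address math above is just real-mode segmentation: segment * 16 plus the 16-bit offset. A one-liner check (the reset-vector example is the textbook case):

#include <assert.h>

static unsigned long v8086_linear(unsigned long ip, unsigned long cs)
{
	return (ip & 0xffff) + (cs << 4);
}

int main(void)
{
	/* CS=0xF000, IP=0xFFF0 -> 0xFFFF0, the classic reset vector. */
	assert(v8086_linear(0xfff0, 0xf000) == 0xffff0);
	return 0;
}
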
@@ -53,7 +61,8 @@ static int is_setting_trap_flag(struct task_struct *child, struct pt_regs *regs)
unsigned char opcode[15];
unsigned long addr = convert_ip_to_linear(child, regs);
- copied = access_process_vm(child, addr, opcode, sizeof(opcode), 0);
+ copied = access_process_vm(child, addr, opcode, sizeof(opcode),
+ FOLL_FORCE);
for (i = 0; i < copied; i++) {
switch (opcode[i]) {
/* popf and iret */
@@ -74,7 +83,7 @@ static int is_setting_trap_flag(struct task_struct *child, struct pt_regs *regs)
#ifdef CONFIG_X86_64
case 0x40 ... 0x4f:
- if (regs->cs != __USER_CS)
+ if (!user_64bit_mode(regs))
/* 32-bit mode: register increment */
return 0;
/* 64-bit mode: REX prefix */
@@ -120,12 +129,17 @@ static int enable_single_step(struct task_struct *child)
regs->flags |= X86_EFLAGS_TF;
/*
- * Always set TIF_SINGLESTEP - this guarantees that
- * we single-step system calls etc.. This will also
+ * Always set TIF_SINGLESTEP. This will also
* cause us to set TF when returning to user mode.
*/
set_tsk_thread_flag(child, TIF_SINGLESTEP);
+ /*
+ * Ensure that a trap is triggered once stepping out of a system
+ * call prior to executing any user instruction.
+ */
+ set_task_syscall_work(child, SYSCALL_EXIT_TRAP);
+
oflags = regs->flags;
/* Set TF on the kernel stack.. */
@@ -157,6 +171,33 @@ static int enable_single_step(struct task_struct *child)
return 1;
}
+void set_task_blockstep(struct task_struct *task, bool on)
+{
+ unsigned long debugctl;
+
+ /*
+ * Ensure irq/preemption can't change debugctl in between.
+ * Note also that both TIF_BLOCKSTEP and debugctl should
+ * be changed atomically wrt preemption.
+ *
+ * NOTE: this means that set/clear TIF_BLOCKSTEP is only safe if
+ * task is current or it can't be running, otherwise we can race
+ * with __switch_to_xtra(). We rely on ptrace_freeze_traced().
+ */
+ local_irq_disable();
+ debugctl = get_debugctlmsr();
+ if (on) {
+ debugctl |= DEBUGCTLMSR_BTF;
+ set_tsk_thread_flag(task, TIF_BLOCKSTEP);
+ } else {
+ debugctl &= ~DEBUGCTLMSR_BTF;
+ clear_tsk_thread_flag(task, TIF_BLOCKSTEP);
+ }
+ if (task == current)
+ update_debugctlmsr(debugctl);
+ local_irq_enable();
+}
+
/*
* Enable single or block step.
*/
@@ -166,22 +207,13 @@ static void enable_step(struct task_struct *child, bool block)
* Make sure block stepping (BTF) is not enabled unless it should be.
* Note that we don't try to worry about any is_setting_trap_flag()
* instructions after the first when using block stepping.
- * So noone should try to use debugger block stepping in a program
+ * So no one should try to use debugger block stepping in a program
* that uses user-mode single stepping itself.
*/
- if (enable_single_step(child) && block) {
- unsigned long debugctl = get_debugctlmsr();
-
- debugctl |= DEBUGCTLMSR_BTF;
- update_debugctlmsr(debugctl);
- set_tsk_thread_flag(child, TIF_BLOCKSTEP);
- } else if (test_tsk_thread_flag(child, TIF_BLOCKSTEP)) {
- unsigned long debugctl = get_debugctlmsr();
-
- debugctl &= ~DEBUGCTLMSR_BTF;
- update_debugctlmsr(debugctl);
- clear_tsk_thread_flag(child, TIF_BLOCKSTEP);
- }
+ if (enable_single_step(child) && block)
+ set_task_blockstep(child, true);
+ else if (test_tsk_thread_flag(child, TIF_BLOCKSTEP))
+ set_task_blockstep(child, false);
}
void user_enable_single_step(struct task_struct *child)
@@ -199,16 +231,12 @@ void user_disable_single_step(struct task_struct *child)
/*
* Make sure block stepping (BTF) is disabled.
*/
- if (test_tsk_thread_flag(child, TIF_BLOCKSTEP)) {
- unsigned long debugctl = get_debugctlmsr();
-
- debugctl &= ~DEBUGCTLMSR_BTF;
- update_debugctlmsr(debugctl);
- clear_tsk_thread_flag(child, TIF_BLOCKSTEP);
- }
+ if (test_tsk_thread_flag(child, TIF_BLOCKSTEP))
+ set_task_blockstep(child, false);
/* Always clear TIF_SINGLESTEP... */
clear_tsk_thread_flag(child, TIF_SINGLESTEP);
+ clear_task_syscall_work(child, SYSCALL_EXIT_TRAP);
/* But touch TF only if it was set by us.. */
if (test_and_clear_tsk_thread_flag(child, TIF_FORCED_TF))
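
From userspace, all of this machinery is driven by PTRACE_SINGLESTEP, which ends up in user_enable_single_step(). A minimal tracer that steps a child a few instructions (error handling elided; the step count is arbitrary):

#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {			/* child: volunteer for tracing */
		ptrace(PTRACE_TRACEME, 0, NULL, NULL);
		raise(SIGSTOP);
		return 0;
	}

	waitpid(pid, NULL, 0);		/* wait for the SIGSTOP */
	for (int i = 0; i < 5; i++) {	/* five single steps */
		ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);
		waitpid(pid, NULL, 0);	/* each step stops the child again */
	}
	ptrace(PTRACE_CONT, pid, NULL, NULL);
	waitpid(pid, NULL, 0);
	printf("stepped and released child %d\n", pid);
	return 0;
}
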
diff --git a/arch/x86/kernel/sys_i386_32.c b/arch/x86/kernel/sys_i386_32.c
deleted file mode 100644
index 196552bb412c..000000000000
--- a/arch/x86/kernel/sys_i386_32.c
+++ /dev/null
@@ -1,38 +0,0 @@
-/*
- * This file contains various random system calls that
- * have a non-standard calling sequence on the Linux/i386
- * platform.
- */
-
-#include <linux/errno.h>
-#include <linux/sched.h>
-#include <linux/mm.h>
-#include <linux/fs.h>
-#include <linux/smp.h>
-#include <linux/sem.h>
-#include <linux/msg.h>
-#include <linux/shm.h>
-#include <linux/stat.h>
-#include <linux/syscalls.h>
-#include <linux/mman.h>
-#include <linux/file.h>
-#include <linux/utsname.h>
-#include <linux/ipc.h>
-
-#include <linux/uaccess.h>
-#include <linux/unistd.h>
-
-#include <asm/syscalls.h>
-
-/*
- * Do a system call from kernel instead of calling sys_execve so we
- * end up with proper pt_regs.
- */
-int kernel_execve(const char *filename, char *const argv[], char *const envp[])
-{
- long __res;
- asm volatile ("push %%ebx ; movl %2,%%ebx ; int $0x80 ; pop %%ebx"
- : "=a" (__res)
- : "0" (__NR_execve), "ri" (filename), "c" (argv), "d" (envp) : "memory");
- return __res;
-}
diff --git a/arch/x86/kernel/sys_ia32.c b/arch/x86/kernel/sys_ia32.c
new file mode 100644
index 000000000000..6cf65397d225
--- /dev/null
+++ b/arch/x86/kernel/sys_ia32.c
@@ -0,0 +1,256 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * sys_ia32.c: Conversion between 32bit and 64bit native syscalls. Based on
+ * sys_sparc32
+ *
+ * Copyright (C) 2000 VA Linux Co
+ * Copyright (C) 2000 Don Dugger <n0ano@valinux.com>
+ * Copyright (C) 1999 Arun Sharma <arun.sharma@intel.com>
+ * Copyright (C) 1997,1998 Jakub Jelinek (jj@sunsite.mff.cuni.cz)
+ * Copyright (C) 1997 David S. Miller (davem@caip.rutgers.edu)
+ * Copyright (C) 2000 Hewlett-Packard Co.
+ * Copyright (C) 2000 David Mosberger-Tang <davidm@hpl.hp.com>
+ * Copyright (C) 2000,2001,2002 Andi Kleen, SuSE Labs (x86-64 port)
+ *
+ * These routines maintain argument size conversion between 32bit and 64bit
+ * environment. In 2.5 most of this should be moved to a generic directory.
+ *
+ * This file assumes that there is a hole at the end of user address space.
+ *
+ * Some of the functions are LE specific currently. These are
+ * hopefully all marked. This should be fixed.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/signal.h>
+#include <linux/syscalls.h>
+#include <linux/times.h>
+#include <linux/utsname.h>
+#include <linux/mm.h>
+#include <linux/uio.h>
+#include <linux/poll.h>
+#include <linux/personality.h>
+#include <linux/stat.h>
+#include <linux/rwsem.h>
+#include <linux/compat.h>
+#include <linux/vfs.h>
+#include <linux/ptrace.h>
+#include <linux/highuid.h>
+#include <linux/sysctl.h>
+#include <linux/slab.h>
+#include <linux/sched/task.h>
+#include <asm/mman.h>
+#include <asm/types.h>
+#include <linux/uaccess.h>
+#include <linux/atomic.h>
+#include <asm/vgtod.h>
+#include <asm/ia32.h>
+
+#define AA(__x) ((unsigned long)(__x))
+
+SYSCALL_DEFINE3(ia32_truncate64, const char __user *, filename,
+ unsigned long, offset_low, unsigned long, offset_high)
+{
+ return ksys_truncate(filename,
+ ((loff_t) offset_high << 32) | offset_low);
+}
+
+SYSCALL_DEFINE3(ia32_ftruncate64, unsigned int, fd,
+ unsigned long, offset_low, unsigned long, offset_high)
+{
+ return ksys_ftruncate(fd, ((loff_t) offset_high << 32) | offset_low);
+}
+
+/* warning: next two assume little endian */
+SYSCALL_DEFINE5(ia32_pread64, unsigned int, fd, char __user *, ubuf,
+ u32, count, u32, poslo, u32, poshi)
+{
+ return ksys_pread64(fd, ubuf, count,
+ ((loff_t)AA(poshi) << 32) | AA(poslo));
+}
+
+SYSCALL_DEFINE5(ia32_pwrite64, unsigned int, fd, const char __user *, ubuf,
+ u32, count, u32, poslo, u32, poshi)
+{
+ return ksys_pwrite64(fd, ubuf, count,
+ ((loff_t)AA(poshi) << 32) | AA(poslo));
+}
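
Every wrapper in this file uses the same reassembly idiom: a 64-bit quantity arrives as two 32-bit halves (one register each in the ia32 ABI) and is stitched back together. Shown standalone:

#include <assert.h>
#include <stdint.h>

static int64_t join64(uint32_t lo, uint32_t hi)
{
	/* Widen the high half first so the shift doesn't overflow. */
	return ((int64_t)hi << 32) | lo;
}

int main(void)
{
	assert(join64(0x89abcdefu, 0x01234567u) == 0x0123456789abcdefLL);
	return 0;
}
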
+
+
+/*
+ * Some system calls need sign-extended arguments. This could be
+ * done by a generic wrapper.
+ */
+SYSCALL_DEFINE6(ia32_fadvise64_64, int, fd, __u32, offset_low,
+ __u32, offset_high, __u32, len_low, __u32, len_high,
+ int, advice)
+{
+ return ksys_fadvise64_64(fd,
+ (((u64)offset_high)<<32) | offset_low,
+ (((u64)len_high)<<32) | len_low,
+ advice);
+}
+
+SYSCALL_DEFINE4(ia32_readahead, int, fd, unsigned int, off_lo,
+ unsigned int, off_hi, size_t, count)
+{
+ return ksys_readahead(fd, ((u64)off_hi << 32) | off_lo, count);
+}
+
+SYSCALL_DEFINE6(ia32_sync_file_range, int, fd, unsigned int, off_low,
+ unsigned int, off_hi, unsigned int, n_low,
+ unsigned int, n_hi, int, flags)
+{
+ return ksys_sync_file_range(fd,
+ ((u64)off_hi << 32) | off_low,
+ ((u64)n_hi << 32) | n_low, flags);
+}
+
+SYSCALL_DEFINE5(ia32_fadvise64, int, fd, unsigned int, offset_lo,
+ unsigned int, offset_hi, size_t, len, int, advice)
+{
+ return ksys_fadvise64_64(fd, ((u64)offset_hi << 32) | offset_lo,
+ len, advice);
+}
+
+SYSCALL_DEFINE6(ia32_fallocate, int, fd, int, mode,
+ unsigned int, offset_lo, unsigned int, offset_hi,
+ unsigned int, len_lo, unsigned int, len_hi)
+{
+ return ksys_fallocate(fd, mode, ((u64)offset_hi << 32) | offset_lo,
+ ((u64)len_hi << 32) | len_lo);
+}
+
+#ifdef CONFIG_IA32_EMULATION
+/*
+ * Another set for IA32/LFS -- x86_64 struct stat is different due to
+ * support for 64bit inode numbers.
+ */
+static int cp_stat64(struct stat64 __user *ubuf, struct kstat *stat)
+{
+ typeof(ubuf->st_uid) uid = 0;
+ typeof(ubuf->st_gid) gid = 0;
+ SET_UID(uid, from_kuid_munged(current_user_ns(), stat->uid));
+ SET_GID(gid, from_kgid_munged(current_user_ns(), stat->gid));
+ if (!user_write_access_begin(ubuf, sizeof(struct stat64)))
+ return -EFAULT;
+ unsafe_put_user(huge_encode_dev(stat->dev), &ubuf->st_dev, Efault);
+ unsafe_put_user(stat->ino, &ubuf->__st_ino, Efault);
+ unsafe_put_user(stat->ino, &ubuf->st_ino, Efault);
+ unsafe_put_user(stat->mode, &ubuf->st_mode, Efault);
+ unsafe_put_user(stat->nlink, &ubuf->st_nlink, Efault);
+ unsafe_put_user(uid, &ubuf->st_uid, Efault);
+ unsafe_put_user(gid, &ubuf->st_gid, Efault);
+ unsafe_put_user(huge_encode_dev(stat->rdev), &ubuf->st_rdev, Efault);
+ unsafe_put_user(stat->size, &ubuf->st_size, Efault);
+ unsafe_put_user(stat->atime.tv_sec, &ubuf->st_atime, Efault);
+ unsafe_put_user(stat->atime.tv_nsec, &ubuf->st_atime_nsec, Efault);
+ unsafe_put_user(stat->mtime.tv_sec, &ubuf->st_mtime, Efault);
+ unsafe_put_user(stat->mtime.tv_nsec, &ubuf->st_mtime_nsec, Efault);
+ unsafe_put_user(stat->ctime.tv_sec, &ubuf->st_ctime, Efault);
+ unsafe_put_user(stat->ctime.tv_nsec, &ubuf->st_ctime_nsec, Efault);
+ unsafe_put_user(stat->blksize, &ubuf->st_blksize, Efault);
+ unsafe_put_user(stat->blocks, &ubuf->st_blocks, Efault);
+ user_access_end();
+ return 0;
+Efault:
+ user_write_access_end();
+ return -EFAULT;
+}
+
+COMPAT_SYSCALL_DEFINE2(ia32_stat64, const char __user *, filename,
+ struct stat64 __user *, statbuf)
+{
+ struct kstat stat;
+ int ret = vfs_stat(filename, &stat);
+
+ if (!ret)
+ ret = cp_stat64(statbuf, &stat);
+ return ret;
+}
+
+COMPAT_SYSCALL_DEFINE2(ia32_lstat64, const char __user *, filename,
+ struct stat64 __user *, statbuf)
+{
+ struct kstat stat;
+ int ret = vfs_lstat(filename, &stat);
+ if (!ret)
+ ret = cp_stat64(statbuf, &stat);
+ return ret;
+}
+
+COMPAT_SYSCALL_DEFINE2(ia32_fstat64, unsigned int, fd,
+ struct stat64 __user *, statbuf)
+{
+ struct kstat stat;
+ int ret = vfs_fstat(fd, &stat);
+ if (!ret)
+ ret = cp_stat64(statbuf, &stat);
+ return ret;
+}
+
+COMPAT_SYSCALL_DEFINE4(ia32_fstatat64, unsigned int, dfd,
+ const char __user *, filename,
+ struct stat64 __user *, statbuf, int, flag)
+{
+ struct kstat stat;
+ int error;
+
+ error = vfs_fstatat(dfd, filename, &stat, flag);
+ if (error)
+ return error;
+ return cp_stat64(statbuf, &stat);
+}
+
+/*
+ * Linux/i386 didn't use to be able to handle more than
+ * 4 system call parameters, so these system calls used a memory
+ * block for parameter passing..
+ */
+
+struct mmap_arg_struct32 {
+ unsigned int addr;
+ unsigned int len;
+ unsigned int prot;
+ unsigned int flags;
+ unsigned int fd;
+ unsigned int offset;
+};
+
+COMPAT_SYSCALL_DEFINE1(ia32_mmap, struct mmap_arg_struct32 __user *, arg)
+{
+ struct mmap_arg_struct32 a;
+
+ if (copy_from_user(&a, arg, sizeof(a)))
+ return -EFAULT;
+
+ if (a.offset & ~PAGE_MASK)
+ return -EINVAL;
+
+ return ksys_mmap_pgoff(a.addr, a.len, a.prot, a.flags, a.fd,
+ a.offset>>PAGE_SHIFT);
+}
+
+/*
+ * The 32-bit clone ABI is CONFIG_CLONE_BACKWARDS
+ */
+COMPAT_SYSCALL_DEFINE5(ia32_clone, unsigned long, clone_flags,
+ unsigned long, newsp, int __user *, parent_tidptr,
+ unsigned long, tls_val, int __user *, child_tidptr)
+{
+ struct kernel_clone_args args = {
+ .flags = (clone_flags & ~CSIGNAL),
+ .pidfd = parent_tidptr,
+ .child_tid = child_tidptr,
+ .parent_tid = parent_tidptr,
+ .exit_signal = (clone_flags & CSIGNAL),
+ .stack = newsp,
+ .tls = tls_val,
+ };
+
+ return kernel_clone(&args);
+}
+#endif /* CONFIG_IA32_EMULATION */
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index ff14a5044ce6..776ae6fa7f2d 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -1,5 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/compat.h>
#include <linux/errno.h>
#include <linux/sched.h>
+#include <linux/sched/mm.h>
#include <linux/syscalls.h>
#include <linux/mm.h>
#include <linux/fs.h>
@@ -14,29 +17,82 @@
#include <linux/personality.h>
#include <linux/random.h>
#include <linux/uaccess.h>
+#include <linux/elf.h>
+#include <linux/hugetlb.h>
+#include <asm/elf.h>
#include <asm/ia32.h>
-#include <asm/syscalls.h>
+
+/*
+ * Align a virtual address to avoid aliasing in the I$ on AMD F15h.
+ */
+static unsigned long get_align_mask(struct file *filp)
+{
+ if (filp && is_file_hugepages(filp))
+ return huge_page_mask_align(filp);
+ /* handle 32- and 64-bit case with a single conditional */
+ if (va_align.flags < 0 || !(va_align.flags & (2 - mmap_is_ia32())))
+ return 0;
+
+ if (!(current->flags & PF_RANDOMIZE))
+ return 0;
+
+ return va_align.mask;
+}
+
+/*
+ * To avoid aliasing in the I$ on AMD F15h, the bits defined by the
+ * va_align.bits, [12:upper_bit), are set to a random value instead of
+ * zeroing them. This random value is computed once per boot. This form
+ * of ASLR is known as "per-boot ASLR".
+ *
+ * To achieve this, the random value is added to the info.align_offset
+ * value before calling vm_unmapped_area() or ORed directly to the
+ * address.
+ */
+static unsigned long get_align_bits(void)
+{
+ return va_align.bits & get_align_mask(NULL);
+}
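
The interplay of mask and bits may be easier to see with concrete numbers: the search aligns the address to the mask, then the per-boot random bits are folded in, so the chosen slot is constant across one boot but differs between boots. A toy model (all values invented; the real mask covers the I$-aliasing bits on AMD F15h):

#include <stdio.h>

int main(void)
{
	unsigned long mask = 0x7fff;		/* align to 32K, say */
	unsigned long bits = 0x3a000 & mask;	/* per-boot random, fixed */
	unsigned long base = 0x7f0012345678UL;	/* candidate from the search */

	/* Align down, then apply the boot-constant offset. */
	unsigned long addr = (base & ~mask) | bits;

	printf("%#lx\n", addr);
	return 0;
}
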
+
+static int __init control_va_addr_alignment(char *str)
+{
+ /* guard against enabling this on other CPU families */
+ if (va_align.flags < 0)
+ return 1;
+
+ if (*str == 0)
+ return 1;
+
+ if (!strcmp(str, "32"))
+ va_align.flags = ALIGN_VA_32;
+ else if (!strcmp(str, "64"))
+ va_align.flags = ALIGN_VA_64;
+ else if (!strcmp(str, "off"))
+ va_align.flags = 0;
+ else if (!strcmp(str, "on"))
+ va_align.flags = ALIGN_VA_32 | ALIGN_VA_64;
+ else
+ pr_warn("invalid option value: 'align_va_addr=%s'\n", str);
+
+ return 1;
+}
+__setup("align_va_addr=", control_va_addr_alignment);
SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
unsigned long, prot, unsigned long, flags,
unsigned long, fd, unsigned long, off)
{
- long error;
- error = -EINVAL;
if (off & ~PAGE_MASK)
- goto out;
+ return -EINVAL;
- error = sys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
-out:
- return error;
+ return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
}
-static void find_start_end(unsigned long flags, unsigned long *begin,
- unsigned long *end)
+static void find_start_end(unsigned long addr, unsigned long flags,
+ unsigned long *begin, unsigned long *end)
{
- if (!test_thread_flag(TIF_IA32) && (flags & MAP_32BIT)) {
- unsigned long new_begin;
+ if (!in_32bit_syscall() && (flags & MAP_32BIT)) {
/* This is usually needed to map code in the small
model, so it needs to be in the first 31 bits. Limit
it to that. This means we need to move the
@@ -47,29 +103,39 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
*begin = 0x40000000;
*end = 0x80000000;
if (current->flags & PF_RANDOMIZE) {
- new_begin = randomize_range(*begin, *begin + 0x02000000, 0);
- if (new_begin)
- *begin = new_begin;
+ *begin = randomize_page(*begin, 0x02000000);
}
- } else {
- *begin = TASK_UNMAPPED_BASE;
- *end = TASK_SIZE;
+ return;
}
+
+ *begin = get_mmap_base(1);
+ if (in_32bit_syscall())
+ *end = task_size_32bit();
+ else
+ *end = task_size_64bit(addr > DEFAULT_MAP_WINDOW);
+}
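
From userspace the window computed above is visible directly: a MAP_32BIT mapping lands below 2 GiB (between 0x40000000 and 0x80000000 on x86-64, modulo the randomized start). A quick probe:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);

	if (p != MAP_FAILED)
		printf("%p\n", p);	/* expected in [0x40000000, 0x80000000) */
	return 0;
}
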
+
+static inline unsigned long stack_guard_placement(vm_flags_t vm_flags)
+{
+ if (vm_flags & VM_SHADOW_STACK)
+ return PAGE_SIZE;
+
+ return 0;
}
unsigned long
-arch_get_unmapped_area(struct file *filp, unsigned long addr,
- unsigned long len, unsigned long pgoff, unsigned long flags)
+arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags, vm_flags_t vm_flags)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
- unsigned long start_addr;
+ struct vm_unmapped_area_info info = {};
unsigned long begin, end;
if (flags & MAP_FIXED)
return addr;
- find_start_end(flags, &begin, &end);
+ find_start_end(addr, flags, &begin, &end);
if (len > end)
return -ENOMEM;
@@ -78,118 +144,90 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
addr = PAGE_ALIGN(addr);
vma = find_vma(mm, addr);
if (end - len >= addr &&
- (!vma || addr + len <= vma->vm_start))
+ (!vma || addr + len <= vm_start_gap(vma)))
return addr;
}
- if (((flags & MAP_32BIT) || test_thread_flag(TIF_IA32))
- && len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = begin;
- }
- addr = mm->free_area_cache;
- if (addr < begin)
- addr = begin;
- start_addr = addr;
-
-full_search:
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (end - len < addr) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != begin) {
- start_addr = addr = begin;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (!vma || addr + len <= vma->vm_start) {
- /*
- * Remember the place where we stopped the search:
- */
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
- addr = vma->vm_end;
+ info.length = len;
+ info.low_limit = begin;
+ info.high_limit = end;
+ if (!(filp && is_file_hugepages(filp))) {
+ info.align_offset = pgoff << PAGE_SHIFT;
+ info.start_gap = stack_guard_placement(vm_flags);
+ }
+ if (filp) {
+ info.align_mask = get_align_mask(filp);
+ info.align_offset += get_align_bits();
}
-}
+ return vm_unmapped_area(&info);
+}
unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
- const unsigned long len, const unsigned long pgoff,
- const unsigned long flags)
+arch_get_unmapped_area_topdown(struct file *filp, unsigned long addr0,
+ unsigned long len, unsigned long pgoff,
+ unsigned long flags, vm_flags_t vm_flags)
{
struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
unsigned long addr = addr0;
+ struct vm_unmapped_area_info info = {};
/* requested length too big for entire address space */
if (len > TASK_SIZE)
return -ENOMEM;
+ /* No address checking. See comment at mmap_address_hint_valid() */
if (flags & MAP_FIXED)
return addr;
- /* for MAP_32BIT mappings we force the legact mmap base */
- if (!test_thread_flag(TIF_IA32) && (flags & MAP_32BIT))
+ /* for MAP_32BIT mappings we force the legacy mmap base */
+ if (!in_32bit_syscall() && (flags & MAP_32BIT))
goto bottomup;
/* requesting a specific address */
if (addr) {
- addr = PAGE_ALIGN(addr);
+ addr &= PAGE_MASK;
+ if (!mmap_address_hint_valid(addr, len))
+ goto get_unmapped_area;
+
vma = find_vma(mm, addr);
- if (TASK_SIZE - len >= addr &&
- (!vma || addr + len <= vma->vm_start))
+ if (!vma || addr + len <= vm_start_gap(vma))
return addr;
}
-
- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
+get_unmapped_area:
+
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ info.length = len;
+ if (!in_32bit_syscall() && (flags & MAP_ABOVE4G))
+ info.low_limit = SZ_4G;
+ else
+ info.low_limit = PAGE_SIZE;
+
+ info.high_limit = get_mmap_base(0);
+ if (!(filp && is_file_hugepages(filp))) {
+ info.start_gap = stack_guard_placement(vm_flags);
+ info.align_offset = pgoff << PAGE_SHIFT;
}
- /* either no address requested or can't fit in requested address hole */
- addr = mm->free_area_cache;
+ /*
+ * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
+ * in the full address space.
+ *
+ * !in_32bit_syscall() check to avoid high addresses for x32
+ * (and make it a no-op on native i386).
+ */
+ if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall())
+ info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;
- /* make sure it can fit in the remaining address space */
- if (addr > len) {
- vma = find_vma(mm, addr-len);
- if (!vma || addr <= vma->vm_start)
- /* remember the address as a hint for next time */
- return mm->free_area_cache = addr-len;
+ if (filp) {
+ info.align_mask = get_align_mask(filp);
+ info.align_offset += get_align_bits();
}
-
- if (mm->mmap_base < len)
- goto bottomup;
-
- addr = mm->mmap_base-len;
-
- do {
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (!vma || addr+len <= vma->vm_start)
- /* remember the address as a hint for next time */
- return mm->free_area_cache = addr;
-
- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = vma->vm_start-len;
- } while (len < vma->vm_start);
+ addr = vm_unmapped_area(&info);
+ if (!(addr & ~PAGE_MASK))
+ return addr;
+ VM_BUG_ON(addr != -ENOMEM);
bottomup:
/*
@@ -198,14 +236,5 @@ bottomup:
* can happen with large stack limits and large mmap()
* allocations.
*/
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
-
- return addr;
+ return arch_get_unmapped_area(filp, addr0, len, pgoff, flags, 0);
}
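
The hint rule in the topdown path has a user-visible consequence on 5-level-paging machines: passing an address hint above the default 47-bit window opts that one mapping in to the full address space. A sketch of the opt-in (on 4-level kernels this simply maps somewhere below the window):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	void *hint = (void *)(1UL << 48);	/* above DEFAULT_MAP_WINDOW */
	void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p != MAP_FAILED)
		printf("mapped at %p\n", p);	/* may exceed 47 bits */
	return 0;
}
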
diff --git a/arch/x86/kernel/syscall_64.c b/arch/x86/kernel/syscall_64.c
deleted file mode 100644
index de87d6008295..000000000000
--- a/arch/x86/kernel/syscall_64.c
+++ /dev/null
@@ -1,29 +0,0 @@
-/* System call table for x86-64. */
-
-#include <linux/linkage.h>
-#include <linux/sys.h>
-#include <linux/cache.h>
-#include <asm/asm-offsets.h>
-
-#define __NO_STUBS
-
-#define __SYSCALL(nr, sym) extern asmlinkage void sym(void) ;
-#undef _ASM_X86_UNISTD_64_H
-#include <asm/unistd_64.h>
-
-#undef __SYSCALL
-#define __SYSCALL(nr, sym) [nr] = sym,
-#undef _ASM_X86_UNISTD_64_H
-
-typedef void (*sys_call_ptr_t)(void);
-
-extern void sys_ni_syscall(void);
-
-const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
- /*
- * Smells like a compiler bug -- it doesn't work
- * when the & below is removed.
- */
- [0 ... __NR_syscall_max] = &sys_ni_syscall,
-#include <asm/unistd_64.h>
-};
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
deleted file mode 100644
index 8b3729341216..000000000000
--- a/arch/x86/kernel/syscall_table_32.S
+++ /dev/null
@@ -1,339 +0,0 @@
-ENTRY(sys_call_table)
- .long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting */
- .long sys_exit
- .long ptregs_fork
- .long sys_read
- .long sys_write
- .long sys_open /* 5 */
- .long sys_close
- .long sys_waitpid
- .long sys_creat
- .long sys_link
- .long sys_unlink /* 10 */
- .long ptregs_execve
- .long sys_chdir
- .long sys_time
- .long sys_mknod
- .long sys_chmod /* 15 */
- .long sys_lchown16
- .long sys_ni_syscall /* old break syscall holder */
- .long sys_stat
- .long sys_lseek
- .long sys_getpid /* 20 */
- .long sys_mount
- .long sys_oldumount
- .long sys_setuid16
- .long sys_getuid16
- .long sys_stime /* 25 */
- .long sys_ptrace
- .long sys_alarm
- .long sys_fstat
- .long sys_pause
- .long sys_utime /* 30 */
- .long sys_ni_syscall /* old stty syscall holder */
- .long sys_ni_syscall /* old gtty syscall holder */
- .long sys_access
- .long sys_nice
- .long sys_ni_syscall /* 35 - old ftime syscall holder */
- .long sys_sync
- .long sys_kill
- .long sys_rename
- .long sys_mkdir
- .long sys_rmdir /* 40 */
- .long sys_dup
- .long sys_pipe
- .long sys_times
- .long sys_ni_syscall /* old prof syscall holder */
- .long sys_brk /* 45 */
- .long sys_setgid16
- .long sys_getgid16
- .long sys_signal
- .long sys_geteuid16
- .long sys_getegid16 /* 50 */
- .long sys_acct
- .long sys_umount /* recycled never used phys() */
- .long sys_ni_syscall /* old lock syscall holder */
- .long sys_ioctl
- .long sys_fcntl /* 55 */
- .long sys_ni_syscall /* old mpx syscall holder */
- .long sys_setpgid
- .long sys_ni_syscall /* old ulimit syscall holder */
- .long sys_olduname
- .long sys_umask /* 60 */
- .long sys_chroot
- .long sys_ustat
- .long sys_dup2
- .long sys_getppid
- .long sys_getpgrp /* 65 */
- .long sys_setsid
- .long sys_sigaction
- .long sys_sgetmask
- .long sys_ssetmask
- .long sys_setreuid16 /* 70 */
- .long sys_setregid16
- .long sys_sigsuspend
- .long sys_sigpending
- .long sys_sethostname
- .long sys_setrlimit /* 75 */
- .long sys_old_getrlimit
- .long sys_getrusage
- .long sys_gettimeofday
- .long sys_settimeofday
- .long sys_getgroups16 /* 80 */
- .long sys_setgroups16
- .long sys_old_select
- .long sys_symlink
- .long sys_lstat
- .long sys_readlink /* 85 */
- .long sys_uselib
- .long sys_swapon
- .long sys_reboot
- .long sys_old_readdir
- .long sys_old_mmap /* 90 */
- .long sys_munmap
- .long sys_truncate
- .long sys_ftruncate
- .long sys_fchmod
- .long sys_fchown16 /* 95 */
- .long sys_getpriority
- .long sys_setpriority
- .long sys_ni_syscall /* old profil syscall holder */
- .long sys_statfs
- .long sys_fstatfs /* 100 */
- .long sys_ioperm
- .long sys_socketcall
- .long sys_syslog
- .long sys_setitimer
- .long sys_getitimer /* 105 */
- .long sys_newstat
- .long sys_newlstat
- .long sys_newfstat
- .long sys_uname
- .long ptregs_iopl /* 110 */
- .long sys_vhangup
- .long sys_ni_syscall /* old "idle" system call */
- .long ptregs_vm86old
- .long sys_wait4
- .long sys_swapoff /* 115 */
- .long sys_sysinfo
- .long sys_ipc
- .long sys_fsync
- .long ptregs_sigreturn
- .long ptregs_clone /* 120 */
- .long sys_setdomainname
- .long sys_newuname
- .long sys_modify_ldt
- .long sys_adjtimex
- .long sys_mprotect /* 125 */
- .long sys_sigprocmask
- .long sys_ni_syscall /* old "create_module" */
- .long sys_init_module
- .long sys_delete_module
- .long sys_ni_syscall /* 130: old "get_kernel_syms" */
- .long sys_quotactl
- .long sys_getpgid
- .long sys_fchdir
- .long sys_bdflush
- .long sys_sysfs /* 135 */
- .long sys_personality
- .long sys_ni_syscall /* reserved for afs_syscall */
- .long sys_setfsuid16
- .long sys_setfsgid16
- .long sys_llseek /* 140 */
- .long sys_getdents
- .long sys_select
- .long sys_flock
- .long sys_msync
- .long sys_readv /* 145 */
- .long sys_writev
- .long sys_getsid
- .long sys_fdatasync
- .long sys_sysctl
- .long sys_mlock /* 150 */
- .long sys_munlock
- .long sys_mlockall
- .long sys_munlockall
- .long sys_sched_setparam
- .long sys_sched_getparam /* 155 */
- .long sys_sched_setscheduler
- .long sys_sched_getscheduler
- .long sys_sched_yield
- .long sys_sched_get_priority_max
- .long sys_sched_get_priority_min /* 160 */
- .long sys_sched_rr_get_interval
- .long sys_nanosleep
- .long sys_mremap
- .long sys_setresuid16
- .long sys_getresuid16 /* 165 */
- .long ptregs_vm86
- .long sys_ni_syscall /* Old sys_query_module */
- .long sys_poll
- .long sys_nfsservctl
- .long sys_setresgid16 /* 170 */
- .long sys_getresgid16
- .long sys_prctl
- .long ptregs_rt_sigreturn
- .long sys_rt_sigaction
- .long sys_rt_sigprocmask /* 175 */
- .long sys_rt_sigpending
- .long sys_rt_sigtimedwait
- .long sys_rt_sigqueueinfo
- .long sys_rt_sigsuspend
- .long sys_pread64 /* 180 */
- .long sys_pwrite64
- .long sys_chown16
- .long sys_getcwd
- .long sys_capget
- .long sys_capset /* 185 */
- .long ptregs_sigaltstack
- .long sys_sendfile
- .long sys_ni_syscall /* reserved for streams1 */
- .long sys_ni_syscall /* reserved for streams2 */
- .long ptregs_vfork /* 190 */
- .long sys_getrlimit
- .long sys_mmap_pgoff
- .long sys_truncate64
- .long sys_ftruncate64
- .long sys_stat64 /* 195 */
- .long sys_lstat64
- .long sys_fstat64
- .long sys_lchown
- .long sys_getuid
- .long sys_getgid /* 200 */
- .long sys_geteuid
- .long sys_getegid
- .long sys_setreuid
- .long sys_setregid
- .long sys_getgroups /* 205 */
- .long sys_setgroups
- .long sys_fchown
- .long sys_setresuid
- .long sys_getresuid
- .long sys_setresgid /* 210 */
- .long sys_getresgid
- .long sys_chown
- .long sys_setuid
- .long sys_setgid
- .long sys_setfsuid /* 215 */
- .long sys_setfsgid
- .long sys_pivot_root
- .long sys_mincore
- .long sys_madvise
- .long sys_getdents64 /* 220 */
- .long sys_fcntl64
- .long sys_ni_syscall /* reserved for TUX */
- .long sys_ni_syscall
- .long sys_gettid
- .long sys_readahead /* 225 */
- .long sys_setxattr
- .long sys_lsetxattr
- .long sys_fsetxattr
- .long sys_getxattr
- .long sys_lgetxattr /* 230 */
- .long sys_fgetxattr
- .long sys_listxattr
- .long sys_llistxattr
- .long sys_flistxattr
- .long sys_removexattr /* 235 */
- .long sys_lremovexattr
- .long sys_fremovexattr
- .long sys_tkill
- .long sys_sendfile64
- .long sys_futex /* 240 */
- .long sys_sched_setaffinity
- .long sys_sched_getaffinity
- .long sys_set_thread_area
- .long sys_get_thread_area
- .long sys_io_setup /* 245 */
- .long sys_io_destroy
- .long sys_io_getevents
- .long sys_io_submit
- .long sys_io_cancel
- .long sys_fadvise64 /* 250 */
- .long sys_ni_syscall
- .long sys_exit_group
- .long sys_lookup_dcookie
- .long sys_epoll_create
- .long sys_epoll_ctl /* 255 */
- .long sys_epoll_wait
- .long sys_remap_file_pages
- .long sys_set_tid_address
- .long sys_timer_create
- .long sys_timer_settime /* 260 */
- .long sys_timer_gettime
- .long sys_timer_getoverrun
- .long sys_timer_delete
- .long sys_clock_settime
- .long sys_clock_gettime /* 265 */
- .long sys_clock_getres
- .long sys_clock_nanosleep
- .long sys_statfs64
- .long sys_fstatfs64
- .long sys_tgkill /* 270 */
- .long sys_utimes
- .long sys_fadvise64_64
- .long sys_ni_syscall /* sys_vserver */
- .long sys_mbind
- .long sys_get_mempolicy
- .long sys_set_mempolicy
- .long sys_mq_open
- .long sys_mq_unlink
- .long sys_mq_timedsend
- .long sys_mq_timedreceive /* 280 */
- .long sys_mq_notify
- .long sys_mq_getsetattr
- .long sys_kexec_load
- .long sys_waitid
- .long sys_ni_syscall /* 285 */ /* available */
- .long sys_add_key
- .long sys_request_key
- .long sys_keyctl
- .long sys_ioprio_set
- .long sys_ioprio_get /* 290 */
- .long sys_inotify_init
- .long sys_inotify_add_watch
- .long sys_inotify_rm_watch
- .long sys_migrate_pages
- .long sys_openat /* 295 */
- .long sys_mkdirat
- .long sys_mknodat
- .long sys_fchownat
- .long sys_futimesat
- .long sys_fstatat64 /* 300 */
- .long sys_unlinkat
- .long sys_renameat
- .long sys_linkat
- .long sys_symlinkat
- .long sys_readlinkat /* 305 */
- .long sys_fchmodat
- .long sys_faccessat
- .long sys_pselect6
- .long sys_ppoll
- .long sys_unshare /* 310 */
- .long sys_set_robust_list
- .long sys_get_robust_list
- .long sys_splice
- .long sys_sync_file_range
- .long sys_tee /* 315 */
- .long sys_vmsplice
- .long sys_move_pages
- .long sys_getcpu
- .long sys_epoll_pwait
- .long sys_utimensat /* 320 */
- .long sys_signalfd
- .long sys_timerfd_create
- .long sys_eventfd
- .long sys_fallocate
- .long sys_timerfd_settime /* 325 */
- .long sys_timerfd_gettime
- .long sys_signalfd4
- .long sys_eventfd2
- .long sys_epoll_create1
- .long sys_dup3 /* 330 */
- .long sys_pipe2
- .long sys_inotify_init1
- .long sys_preadv
- .long sys_pwritev
- .long sys_rt_tgsigqueueinfo /* 335 */
- .long sys_perf_event_open
- .long sys_recvmmsg
diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index c2f1b26141e2..46b8f1f16676 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -1,27 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* tboot.c: main implementation of helper functions used by kernel for
* runtime support of Intel(R) Trusted Execution Technology
*
* Copyright (c) 2006-2009, Intel Corporation
- *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program; if not, write to the Free Software Foundation, Inc.,
- * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
- *
*/
-#include <linux/dma_remapping.h>
#include <linux/init_task.h>
#include <linux/spinlock.h>
+#include <linux/export.h>
#include <linux/delay.h>
#include <linux/sched.h>
#include <linux/init.h>
@@ -30,23 +17,22 @@
#include <linux/pfn.h>
#include <linux/mm.h>
#include <linux/tboot.h>
+#include <linux/debugfs.h>
-#include <asm/trampoline.h>
+#include <asm/realmode.h>
#include <asm/processor.h>
#include <asm/bootparam.h>
-#include <asm/pgtable.h>
#include <asm/pgalloc.h>
#include <asm/fixmap.h>
#include <asm/proto.h>
#include <asm/setup.h>
-#include <asm/e820.h>
+#include <asm/e820/api.h>
#include <asm/io.h>
-#include "acpi/realmode/wakeup.h"
+#include "../realmode/rm/wakeup.h"
/* Global pointer to shared data; NULL means no measured launch. */
-struct tboot *tboot __read_mostly;
-EXPORT_SYMBOL(tboot);
+static struct tboot *tboot __read_mostly;
/* timeout for APs (in secs) to enter wait-for-SIPI state during shutdown */
#define AP_WAIT_TIMEOUT 1
@@ -56,6 +42,35 @@ EXPORT_SYMBOL(tboot);
static u8 tboot_uuid[16] __initdata = TBOOT_UUID;
+bool tboot_enabled(void)
+{
+ return tboot != NULL;
+}
+
+/* noinline to prevent gcc from warning about dereferencing constant fixaddr */
+static noinline __init bool check_tboot_version(void)
+{
+ if (memcmp(&tboot_uuid, &tboot->uuid, sizeof(tboot->uuid))) {
+ pr_warn("tboot at 0x%llx is invalid\n", boot_params.tboot_addr);
+ return false;
+ }
+
+ if (tboot->version < 5) {
+ pr_warn("tboot version is invalid: %u\n", tboot->version);
+ return false;
+ }
+
+ pr_info("found shared page at phys addr 0x%llx:\n",
+ boot_params.tboot_addr);
+ pr_debug("version: %d\n", tboot->version);
+ pr_debug("log_addr: 0x%08x\n", tboot->log_addr);
+ pr_debug("shutdown_entry: 0x%x\n", tboot->shutdown_entry);
+ pr_debug("tboot_base: 0x%08x\n", tboot->tboot_base);
+ pr_debug("tboot_size: 0x%x\n", tboot->tboot_size);
+
+ return true;
+}
+
void __init tboot_probe(void)
{
/* Look for valid page-aligned address for shared page. */
@@ -65,52 +80,29 @@ void __init tboot_probe(void)
* also verify that it is mapped as we expect it before calling
* set_fixmap(), to reduce chance of garbage value causing crash
*/
- if (!e820_any_mapped(boot_params.tboot_addr,
- boot_params.tboot_addr, E820_RESERVED)) {
- pr_warning("non-0 tboot_addr but it is not of type E820_RESERVED\n");
- return;
- }
-
- /* only a natively booted kernel should be using TXT */
- if (paravirt_enabled()) {
- pr_warning("non-0 tboot_addr but pv_ops is enabled\n");
+ if (!e820__mapped_any(boot_params.tboot_addr,
+ boot_params.tboot_addr, E820_TYPE_RESERVED)) {
+ pr_warn("non-0 tboot_addr but it is not of type E820_TYPE_RESERVED\n");
return;
}
/* Map and check for tboot UUID. */
set_fixmap(FIX_TBOOT_BASE, boot_params.tboot_addr);
- tboot = (struct tboot *)fix_to_virt(FIX_TBOOT_BASE);
- if (memcmp(&tboot_uuid, &tboot->uuid, sizeof(tboot->uuid))) {
- pr_warning("tboot at 0x%llx is invalid\n",
- boot_params.tboot_addr);
+ tboot = (void *)fix_to_virt(FIX_TBOOT_BASE);
+ if (!check_tboot_version())
tboot = NULL;
- return;
- }
- if (tboot->version < 5) {
- pr_warning("tboot version is invalid: %u\n", tboot->version);
- tboot = NULL;
- return;
- }
-
- pr_info("found shared page at phys addr 0x%llx:\n",
- boot_params.tboot_addr);
- pr_debug("version: %d\n", tboot->version);
- pr_debug("log_addr: 0x%08x\n", tboot->log_addr);
- pr_debug("shutdown_entry: 0x%x\n", tboot->shutdown_entry);
- pr_debug("tboot_base: 0x%08x\n", tboot->tboot_base);
- pr_debug("tboot_size: 0x%x\n", tboot->tboot_size);
}
static pgd_t *tboot_pg_dir;
static struct mm_struct tboot_mm = {
- .mm_rb = RB_ROOT,
+ .mm_mt = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, tboot_mm.mmap_lock),
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
- .mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
+ .write_protect_seq = SEQCNT_ZERO(tboot_mm.write_protect_seq),
+ MMAP_LOCK_INITIALIZER(init_mm)
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
- .cpu_vm_mask = CPU_MASK_ALL,
};
static inline void switch_to_tboot_pt(void)
@@ -122,12 +114,16 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
pgprot_t prot)
{
pgd_t *pgd;
+ p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
pgd = pgd_offset(&tboot_mm, vaddr);
- pud = pud_alloc(&tboot_mm, pgd, vaddr);
+ p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
+ if (!p4d)
+ return -1;
+ pud = pud_alloc(&tboot_mm, p4d, vaddr);
if (!pud)
return -1;
pmd = pmd_alloc(&tboot_mm, pud, vaddr);
@@ -138,6 +134,17 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
return -1;
set_pte_at(&tboot_mm, vaddr, pte, pfn_pte(pfn, prot));
pte_unmap(pte);
+
+ /*
+ * PTI poisons low addresses in the kernel page tables in the
+ * name of making them unusable for userspace. To execute
+ * code at such a low address, the poison must be cleared.
+ *
+ * Note: 'pgd' actually gets set in p4d_alloc() _or_
+ * pud_alloc() depending on 4/5-level paging.
+ */
+ pgd->pgd &= ~_PAGE_NX;
+
return 0;
}
@@ -192,15 +199,15 @@ static int tboot_setup_sleep(void)
tboot->num_mac_regions = 0;
- for (i = 0; i < e820.nr_map; i++) {
- if ((e820.map[i].type != E820_RAM)
- && (e820.map[i].type != E820_RESERVED_KERN))
+ for (i = 0; i < e820_table->nr_entries; i++) {
+ if (e820_table->entries[i].type != E820_TYPE_RAM)
continue;
- add_mac_region(e820.map[i].addr, e820.map[i].size);
+ add_mac_region(e820_table->entries[i].addr, e820_table->entries[i].size);
}
- tboot->acpi_sinfo.kernel_s3_resume_vector = acpi_wakeup_address;
+ tboot->acpi_sinfo.kernel_s3_resume_vector =
+ real_mode_header->wakeup_start;
return 0;
}
@@ -271,7 +278,7 @@ static void tboot_copy_fadt(const struct acpi_table_fadt *fadt)
offsetof(struct acpi_table_facs, firmware_waking_vector);
}
-void tboot_sleep(u8 sleep_state, u32 pm1a_control, u32 pm1b_control)
+static int tboot_sleep(u8 sleep_state, u32 pm1a_control, u32 pm1b_control)
{
static u32 acpi_shutdown_map[ACPI_S_STATE_COUNT] = {
/* S0,1,2: */ -1, -1, -1,
@@ -280,7 +287,7 @@ void tboot_sleep(u8 sleep_state, u32 pm1a_control, u32 pm1b_control)
/* S5: */ TB_SHUTDOWN_S5 };
if (!tboot_enabled())
- return;
+ return 0;
tboot_copy_fadt(&acpi_gbl_FADT);
tboot->acpi_sinfo.pm1a_cnt_val = pm1a_control;
@@ -290,11 +297,21 @@ void tboot_sleep(u8 sleep_state, u32 pm1a_control, u32 pm1b_control)
if (sleep_state >= ACPI_S_STATE_COUNT ||
acpi_shutdown_map[sleep_state] == -1) {
- pr_warning("unsupported sleep state 0x%x\n", sleep_state);
- return;
+ pr_warn("unsupported sleep state 0x%x\n", sleep_state);
+ return -1;
}
tboot_shutdown(acpi_shutdown_map[sleep_state]);
+ return 0;
+}
+
+static int tboot_extended_sleep(u8 sleep_state, u32 val_a, u32 val_b)
+{
+ if (!tboot_enabled())
+ return 0;
+
+ pr_warn("tboot is not able to suspend on platforms with reduced hardware sleep (ACPIv5)");
+ return -ENODEV;
}
static atomic_t ap_wfs_count;
@@ -311,30 +328,88 @@ static int tboot_wait_for_aps(int num_aps)
}
if (timeout)
- pr_warning("tboot wait for APs timeout\n");
+ pr_warn("tboot wait for APs timeout\n");
return !(atomic_read((atomic_t *)&tboot->num_in_wfs) == num_aps);
}
-static int __cpuinit tboot_cpu_callback(struct notifier_block *nfb,
- unsigned long action, void *hcpu)
+static int tboot_dying_cpu(unsigned int cpu)
{
- switch (action) {
- case CPU_DYING:
- atomic_inc(&ap_wfs_count);
- if (num_online_cpus() == 1)
- if (tboot_wait_for_aps(atomic_read(&ap_wfs_count)))
- return NOTIFY_BAD;
- break;
+ atomic_inc(&ap_wfs_count);
+ if (num_online_cpus() == 1) {
+ if (tboot_wait_for_aps(atomic_read(&ap_wfs_count)))
+ return -EBUSY;
}
- return NOTIFY_OK;
+ return 0;
}
-static struct notifier_block tboot_cpu_notifier __cpuinitdata =
+#ifdef CONFIG_DEBUG_FS
+
+#define TBOOT_LOG_UUID { 0x26, 0x25, 0x19, 0xc0, 0x30, 0x6b, 0xb4, 0x4d, \
+ 0x4c, 0x84, 0xa3, 0xe9, 0x53, 0xb8, 0x81, 0x74 }
+
+#define TBOOT_SERIAL_LOG_ADDR 0x60000
+#define TBOOT_SERIAL_LOG_SIZE 0x08000
+#define LOG_MAX_SIZE_OFF 16
+#define LOG_BUF_OFF 24
+
+static uint8_t tboot_log_uuid[16] = TBOOT_LOG_UUID;
+
+static ssize_t tboot_log_read(struct file *file, char __user *user_buf, size_t count, loff_t *ppos)
{
- .notifier_call = tboot_cpu_callback,
+ void __iomem *log_base;
+ u8 log_uuid[16];
+ u32 max_size;
+ void *kbuf;
+ int ret = -EFAULT;
+
+ log_base = ioremap(TBOOT_SERIAL_LOG_ADDR, TBOOT_SERIAL_LOG_SIZE);
+ if (!log_base)
+ return ret;
+
+ memcpy_fromio(log_uuid, log_base, sizeof(log_uuid));
+ if (memcmp(&tboot_log_uuid, log_uuid, sizeof(log_uuid)))
+ goto err_iounmap;
+
+ max_size = readl(log_base + LOG_MAX_SIZE_OFF);
+ if (*ppos >= max_size) {
+ ret = 0;
+ goto err_iounmap;
+ }
+
+ if (*ppos + count > max_size)
+ count = max_size - *ppos;
+
+ kbuf = kmalloc(count, GFP_KERNEL);
+ if (!kbuf) {
+ ret = -ENOMEM;
+ goto err_iounmap;
+ }
+
+ memcpy_fromio(kbuf, log_base + LOG_BUF_OFF + *ppos, count);
+ if (copy_to_user(user_buf, kbuf, count))
+ goto err_kfree;
+
+ *ppos += count;
+
+ ret = count;
+
+err_kfree:
+ kfree(kbuf);
+
+err_iounmap:
+ iounmap(log_base);
+
+ return ret;
+}
+
+static const struct file_operations tboot_log_fops = {
+ .read = tboot_log_read,
+ .llseek = default_llseek,
};
+#endif /* CONFIG_DEBUG_FS */
+
static __init int tboot_late_init(void)
{
if (!tboot_enabled())
@@ -343,7 +418,15 @@ static __init int tboot_late_init(void)
tboot_create_trampoline();
atomic_set(&ap_wfs_count, 0);
- register_hotcpu_notifier(&tboot_cpu_notifier);
+ cpuhp_setup_state(CPUHP_AP_X86_TBOOT_DYING, "x86/tboot:dying", NULL,
+ tboot_dying_cpu);
+#ifdef CONFIG_DEBUG_FS
+ debugfs_create_file("tboot_log", S_IRUSR,
+ arch_debugfs_dir, NULL, &tboot_log_fops);
+#endif
+
+ acpi_os_set_prepare_sleep(&tboot_sleep);
+ acpi_os_set_prepare_extended_sleep(&tboot_extended_sleep);
return 0;
}
@@ -431,20 +514,3 @@ struct acpi_table_header *tboot_get_dmar_table(struct acpi_table_header *dmar_tb
return dmar_tbl;
}
-
-int tboot_force_iommu(void)
-{
- if (!tboot_enabled())
- return 0;
-
- if (no_iommu || swiotlb || dmar_disabled)
- pr_warning("Forcing Intel-IOMMU to enabled\n");
-
- dmar_disabled = 0;
-#ifdef CONFIG_SWIOTLB
- swiotlb = 0;
-#endif
- no_iommu = 0;
-
- return 1;
-}
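
The tboot.c hunks above swap the old __cpuinit notifier_block for a callback
registered on the CPU hotplug state machine, and hang the debugfs log reader
plus the ACPI sleep hooks off tboot_late_init(). A minimal sketch of the
registration pattern follows; demo_dying_cpu and "demo:dying" are hypothetical
placeholders, and a dynamic state is used here rather than the fixed
CPUHP_AP_X86_TBOOT_DYING slot (a fixed slot must be declared in enum
cpuhp_state and, for this driver, runs in the CPU-dying phase):

#include <linux/init.h>
#include <linux/cpuhotplug.h>

/* Teardown callback: invoked for each CPU as it goes offline.
 * Returning a negative errno aborts the offline operation. */
static int demo_dying_cpu(unsigned int cpu)
{
        return 0;
}

static int __init demo_init(void)
{
        int ret;

        /* CPUHP_AP_ONLINE_DYN allocates a state slot dynamically and
         * returns its (positive) number on success; startup is NULL,
         * so only the teardown callback is installed */
        ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "demo:dying",
                                NULL, demo_dying_cpu);
        return ret < 0 ? ret : 0;
}
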
diff --git a/arch/x86/kernel/tce_64.c b/arch/x86/kernel/tce_64.c
deleted file mode 100644
index 9e540fee7009..000000000000
--- a/arch/x86/kernel/tce_64.c
+++ /dev/null
@@ -1,189 +0,0 @@
-/*
- * This file manages the translation entries for the IBM Calgary IOMMU.
- *
- * Derived from arch/powerpc/platforms/pseries/iommu.c
- *
- * Copyright (C) IBM Corporation, 2006
- *
- * Author: Jon Mason <jdmason@us.ibm.com>
- * Author: Muli Ben-Yehuda <muli@il.ibm.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- */
-
-#include <linux/types.h>
-#include <linux/slab.h>
-#include <linux/mm.h>
-#include <linux/spinlock.h>
-#include <linux/string.h>
-#include <linux/pci.h>
-#include <linux/dma-mapping.h>
-#include <linux/bootmem.h>
-#include <asm/tce.h>
-#include <asm/calgary.h>
-#include <asm/proto.h>
-
-/* flush a tce at 'tceaddr' to main memory */
-static inline void flush_tce(void* tceaddr)
-{
- /* a single tce can't cross a cache line */
- if (cpu_has_clflush)
- clflush(tceaddr);
- else
- wbinvd();
-}
-
-void tce_build(struct iommu_table *tbl, unsigned long index,
- unsigned int npages, unsigned long uaddr, int direction)
-{
- u64* tp;
- u64 t;
- u64 rpn;
-
- t = (1 << TCE_READ_SHIFT);
- if (direction != DMA_TO_DEVICE)
- t |= (1 << TCE_WRITE_SHIFT);
-
- tp = ((u64*)tbl->it_base) + index;
-
- while (npages--) {
- rpn = (virt_to_bus((void*)uaddr)) >> PAGE_SHIFT;
- t &= ~TCE_RPN_MASK;
- t |= (rpn << TCE_RPN_SHIFT);
-
- *tp = cpu_to_be64(t);
- flush_tce(tp);
-
- uaddr += PAGE_SIZE;
- tp++;
- }
-}
-
-void tce_free(struct iommu_table *tbl, long index, unsigned int npages)
-{
- u64* tp;
-
- tp = ((u64*)tbl->it_base) + index;
-
- while (npages--) {
- *tp = cpu_to_be64(0);
- flush_tce(tp);
- tp++;
- }
-}
-
-static inline unsigned int table_size_to_number_of_entries(unsigned char size)
-{
- /*
- * size is the order of the table, 0-7
- * smallest table is 8K entries, so shift result by 13 to
- * multiply by 8K
- */
- return (1 << size) << 13;
-}
-
-static int tce_table_setparms(struct pci_dev *dev, struct iommu_table *tbl)
-{
- unsigned int bitmapsz;
- unsigned long bmppages;
- int ret;
-
- tbl->it_busno = dev->bus->number;
-
- /* set the tce table size - measured in entries */
- tbl->it_size = table_size_to_number_of_entries(specified_table_size);
-
- /*
- * number of bytes needed for the bitmap size in number of
- * entries; we need one bit per entry
- */
- bitmapsz = tbl->it_size / BITS_PER_BYTE;
- bmppages = __get_free_pages(GFP_KERNEL, get_order(bitmapsz));
- if (!bmppages) {
- printk(KERN_ERR "Calgary: cannot allocate bitmap\n");
- ret = -ENOMEM;
- goto done;
- }
-
- tbl->it_map = (unsigned long*)bmppages;
-
- memset(tbl->it_map, 0, bitmapsz);
-
- tbl->it_hint = 0;
-
- spin_lock_init(&tbl->it_lock);
-
- return 0;
-
-done:
- return ret;
-}
-
-int __init build_tce_table(struct pci_dev *dev, void __iomem *bbar)
-{
- struct iommu_table *tbl;
- int ret;
-
- if (pci_iommu(dev->bus)) {
- printk(KERN_ERR "Calgary: dev %p has sysdata->iommu %p\n",
- dev, pci_iommu(dev->bus));
- BUG();
- }
-
- tbl = kzalloc(sizeof(struct iommu_table), GFP_KERNEL);
- if (!tbl) {
- printk(KERN_ERR "Calgary: error allocating iommu_table\n");
- ret = -ENOMEM;
- goto done;
- }
-
- ret = tce_table_setparms(dev, tbl);
- if (ret)
- goto free_tbl;
-
- tbl->bbar = bbar;
-
- set_pci_iommu(dev->bus, tbl);
-
- return 0;
-
-free_tbl:
- kfree(tbl);
-done:
- return ret;
-}
-
-void * __init alloc_tce_table(void)
-{
- unsigned int size;
-
- size = table_size_to_number_of_entries(specified_table_size);
- size *= TCE_ENTRY_SIZE;
-
- return __alloc_bootmem_low(size, size, 0);
-}
-
-void __init free_tce_table(void *tbl)
-{
- unsigned int size;
-
- if (!tbl)
- return;
-
- size = table_size_to_number_of_entries(specified_table_size);
- size *= TCE_ENTRY_SIZE;
-
- free_bootmem(__pa(tbl), size);
-}
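
The removed Calgary code sizes its TCE tables by an order value, as described
in the table_size_to_number_of_entries() comment above. A worked instance of
that arithmetic, with a hypothetical helper name, assuming the same 8K-entry
minimum:

/* order 0: (1 << 0) << 13 = 8192 entries (the 8K minimum)
 * order 7: (1 << 7) << 13 = 1048576 entries
 * bitmap bytes = entries / BITS_PER_BYTE (one bit per entry)
 * table bytes  = entries * TCE_ENTRY_SIZE */
static inline unsigned int entries_for_order(unsigned char order)
{
        return (1u << order) << 13;
}
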
diff --git a/arch/x86/kernel/test_nx.c b/arch/x86/kernel/test_nx.c
deleted file mode 100644
index 787a5e499dd1..000000000000
--- a/arch/x86/kernel/test_nx.c
+++ /dev/null
@@ -1,175 +0,0 @@
-/*
- * test_nx.c: functional test for NX functionality
- *
- * (C) Copyright 2008 Intel Corporation
- * Author: Arjan van de Ven <arjan@linux.intel.com>
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; version 2
- * of the License.
- */
-#include <linux/module.h>
-#include <linux/sort.h>
-#include <linux/slab.h>
-
-#include <asm/uaccess.h>
-#include <asm/asm.h>
-
-extern int rodata_test_data;
-
-/*
- * This file checks 4 things:
- * 1) Check if the stack is not executable
- * 2) Check if kmalloc memory is not executable
- * 3) Check if the .rodata section is not executable
- * 4) Check if the .data section of a module is not executable
- *
- * To do this, the test code tries to execute memory in stack/kmalloc/etc,
- * and then checks if the expected trap happens.
- *
- * Sadly, this implies having a dynamic exception handling table entry.
- * ... which can be done (and will make Rusty cry)... but it can only
- * be done in a stand-alone module with only 1 entry total.
- * (otherwise we'd have to sort and that's just too messy)
- */
-
-
-
-/*
- * We want to set up an exception handling point on our stack,
- * which means a variable value. This function is rather dirty
- * and walks the exception table of the module, looking for a magic
- * marker and replaces it with a specific function.
- */
-static void fudze_exception_table(void *marker, void *new)
-{
- struct module *mod = THIS_MODULE;
- struct exception_table_entry *extable;
-
- /*
- * Note: This module has only 1 exception table entry,
- * so searching and sorting is not needed. If that changes,
- * this would be the place to search and re-sort the exception
- * table.
- */
- if (mod->num_exentries > 1) {
- printk(KERN_ERR "test_nx: too many exception table entries!\n");
- printk(KERN_ERR "test_nx: test results are not reliable.\n");
- return;
- }
- extable = (struct exception_table_entry *)mod->extable;
- extable[0].insn = (unsigned long)new;
-}
-
-
-/*
- * exception tables get their symbols translated so we need
- * to use a fake function to put in there, which we can then
- * replace at runtime.
- */
-void foo_label(void);
-
-/*
- * returns 0 for not-executable, negative for executable
- *
- * Note: we cannot allow this function to be inlined, because
- * that would give us more than 1 exception table entry.
- * This in turn would break the assumptions above.
- */
-static noinline int test_address(void *address)
-{
- unsigned long result;
-
- /* Set up an exception table entry for our address */
- fudze_exception_table(&foo_label, address);
- result = 1;
- asm volatile(
- "foo_label:\n"
- "0: call *%[fake_code]\n"
- "1:\n"
- ".section .fixup,\"ax\"\n"
- "2: mov %[zero], %[rslt]\n"
- " ret\n"
- ".previous\n"
- _ASM_EXTABLE(0b,2b)
- : [rslt] "=r" (result)
- : [fake_code] "r" (address), [zero] "r" (0UL), "0" (result)
- );
- /* change the exception table back for the next round */
- fudze_exception_table(address, &foo_label);
-
- if (result)
- return -ENODEV;
- return 0;
-}
-
-static unsigned char test_data = 0xC3; /* 0xC3 is the opcode for "ret" */
-
-static int test_NX(void)
-{
- int ret = 0;
- /* 0xC3 is the opcode for "ret" */
- char stackcode[] = {0xC3, 0x90, 0 };
- char *heap;
-
- test_data = 0xC3;
-
- printk(KERN_INFO "Testing NX protection\n");
-
- /* Test 1: check if the stack is not executable */
- if (test_address(&stackcode)) {
- printk(KERN_ERR "test_nx: stack was executable\n");
- ret = -ENODEV;
- }
-
-
- /* Test 2: Check if the heap is executable */
- heap = kmalloc(64, GFP_KERNEL);
- if (!heap)
- return -ENOMEM;
- heap[0] = 0xC3; /* opcode for "ret" */
-
- if (test_address(heap)) {
- printk(KERN_ERR "test_nx: heap was executable\n");
- ret = -ENODEV;
- }
- kfree(heap);
-
- /*
- * The following 2 tests currently fail, this needs to get fixed
- * Until then, don't run them to avoid too many people getting scared
- * by the error message
- */
-
-#ifdef CONFIG_DEBUG_RODATA
- /* Test 3: Check if the .rodata section is executable */
- if (rodata_test_data != 0xC3) {
- printk(KERN_ERR "test_nx: .rodata marker has invalid value\n");
- ret = -ENODEV;
- } else if (test_address(&rodata_test_data)) {
- printk(KERN_ERR "test_nx: .rodata section is executable\n");
- ret = -ENODEV;
- }
-#endif
-
-#if 0
- /* Test 4: Check if the .data section of a module is executable */
- if (test_address(&test_data)) {
- printk(KERN_ERR "test_nx: .data section is executable\n");
- ret = -ENODEV;
- }
-
-#endif
- return 0;
-}
-
-static void test_exit(void)
-{
-}
-
-module_init(test_NX);
-module_exit(test_exit);
-MODULE_LICENSE("GPL");
-MODULE_DESCRIPTION("Testcase for the NX infrastructure");
-MODULE_AUTHOR("Arjan van de Ven <arjan@linux.intel.com>");
diff --git a/arch/x86/kernel/test_rodata.c b/arch/x86/kernel/test_rodata.c
deleted file mode 100644
index c29e235792af..000000000000
--- a/arch/x86/kernel/test_rodata.c
+++ /dev/null
@@ -1,86 +0,0 @@
-/*
- * test_rodata.c: functional test for mark_rodata_ro function
- *
- * (C) Copyright 2008 Intel Corporation
- * Author: Arjan van de Ven <arjan@linux.intel.com>
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; version 2
- * of the License.
- */
-#include <linux/module.h>
-#include <asm/cacheflush.h>
-#include <asm/sections.h>
-
-int rodata_test(void)
-{
- unsigned long result;
- unsigned long start, end;
-
- /* test 1: read the value */
- /* If this test fails, some previous testrun has clobbered the state */
- if (!rodata_test_data) {
- printk(KERN_ERR "rodata_test: test 1 fails (start data)\n");
- return -ENODEV;
- }
-
- /* test 2: write to the variable; this should fault */
- /*
- * If this test fails, we managed to overwrite the data
- *
- * This is written in assembly to be able to catch the
- * exception that is supposed to happen in the correct
- * case
- */
-
- result = 1;
- asm volatile(
- "0: mov %[zero],(%[rodata_test])\n"
- " mov %[zero], %[rslt]\n"
- "1:\n"
- ".section .fixup,\"ax\"\n"
- "2: jmp 1b\n"
- ".previous\n"
- ".section __ex_table,\"a\"\n"
- " .align 16\n"
-#ifdef CONFIG_X86_32
- " .long 0b,2b\n"
-#else
- " .quad 0b,2b\n"
-#endif
- ".previous"
- : [rslt] "=r" (result)
- : [rodata_test] "r" (&rodata_test_data), [zero] "r" (0UL)
- );
-
-
- if (!result) {
- printk(KERN_ERR "rodata_test: test data was not read only\n");
- return -ENODEV;
- }
-
- /* test 3: check the value hasn't changed */
- /* If this test fails, we managed to overwrite the data */
- if (!rodata_test_data) {
- printk(KERN_ERR "rodata_test: Test 3 failes (end data)\n");
- return -ENODEV;
- }
- /* test 4: check if the rodata section is 4Kb aligned */
- start = (unsigned long)__start_rodata;
- end = (unsigned long)__end_rodata;
- if (start & (PAGE_SIZE - 1)) {
- printk(KERN_ERR "rodata_test: .rodata is not 4k aligned\n");
- return -ENODEV;
- }
- if (end & (PAGE_SIZE - 1)) {
- printk(KERN_ERR "rodata_test: .rodata end is not 4k aligned\n");
- return -ENODEV;
- }
-
- return 0;
-}
-
-MODULE_LICENSE("GPL");
-MODULE_DESCRIPTION("Testcase for the DEBUG_RODATA infrastructure");
-MODULE_AUTHOR("Arjan van de Ven <arjan@linux.intel.com>");
diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
index fb5cc5e14cfa..52e1f3f0b361 100644
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (c) 1991,1992,1995 Linus Torvalds
* Copyright (c) 1994 Alan Modra
@@ -9,49 +10,24 @@
*
*/
+#include <linux/clocksource.h>
#include <linux/clockchips.h>
#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/i8253.h>
#include <linux/time.h>
-#include <linux/mca.h>
+#include <linux/export.h>
#include <asm/vsyscall.h>
#include <asm/x86_init.h>
#include <asm/i8259.h>
-#include <asm/i8253.h>
#include <asm/timer.h>
#include <asm/hpet.h>
#include <asm/time.h>
-#if defined(CONFIG_X86_32) && defined(CONFIG_X86_IO_APIC)
-int timer_ack;
-#endif
-
-#ifdef CONFIG_X86_64
-volatile unsigned long __jiffies __section_jiffies = INITIAL_JIFFIES;
-#endif
-
unsigned long profile_pc(struct pt_regs *regs)
{
- unsigned long pc = instruction_pointer(regs);
-
- if (!user_mode_vm(regs) && in_lock_functions(pc)) {
-#ifdef CONFIG_FRAME_POINTER
- return *(unsigned long *)(regs->bp + sizeof(long));
-#else
- unsigned long *sp =
- (unsigned long *)kernel_stack_pointer(regs);
- /*
- * Return address is either directly at stack pointer
- * or above a saved flags. Eflags has bits 22-31 zero,
- * kernel addresses don't.
- */
- if (sp[0] >> 22)
- return sp[0];
- if (sp[1] >> 22)
- return sp[1];
-#endif
- }
- return pc;
+ return instruction_pointer(regs);
}
EXPORT_SYMBOL(profile_pc);
@@ -60,55 +36,54 @@ EXPORT_SYMBOL(profile_pc);
*/
static irqreturn_t timer_interrupt(int irq, void *dev_id)
{
- /* Keep nmi watchdog up to date */
- inc_irq_stat(irq0_irqs);
-
- /* Optimized out for !IO_APIC and x86_64 */
- if (timer_ack) {
- /*
- * Subtle, when I/O APICs are used we have to ack timer IRQ
- * manually to deassert NMI lines for the watchdog if run
- * on an 82489DX-based system.
- */
- raw_spin_lock(&i8259A_lock);
- outb(0x0c, PIC_MASTER_OCW3);
- /* Ack the IRQ; AEOI will end it automatically. */
- inb(PIC_MASTER_POLL);
- raw_spin_unlock(&i8259A_lock);
- }
-
global_clock_event->event_handler(global_clock_event);
-
- /* MCA bus quirk: Acknowledge irq0 by setting bit 7 in port 0x61 */
- if (MCA_bus)
- outb_p(inb_p(0x61)| 0x80, 0x61);
-
return IRQ_HANDLED;
}
-static struct irqaction irq0 = {
- .handler = timer_interrupt,
- .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER,
- .name = "timer"
-};
-
-void __init setup_default_timer_irq(void)
+static void __init setup_default_timer_irq(void)
{
- setup_irq(0, &irq0);
+ unsigned long flags = IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER;
+
+ /*
+ * Unconditionally register the legacy timer interrupt; even
+ * without legacy PIC/PIT we need this for the HPET0 in legacy
+ * replacement mode.
+ */
+ if (request_irq(0, timer_interrupt, flags, "timer", NULL))
+ pr_info("Failed to register legacy timer interrupt\n");
}
/* Default timer init function */
void __init hpet_time_init(void)
{
- if (!hpet_enable())
- setup_pit_timer();
+ if (!hpet_enable()) {
+ if (!pit_timer_init())
+ return;
+ }
+
setup_default_timer_irq();
}
static __init void x86_late_time_init(void)
{
+ /*
+ * Before PIT/HPET init, select the interrupt mode. This is required
+ * to make the decision whether PIT should be initialized correct.
+ */
+ x86_init.irqs.intr_mode_select();
+
+ /* Setup the legacy timers */
x86_init.timers.timer_init();
+
+ /*
+ * After PIT/HPET timers init, set up the final interrupt mode for
+ * delivering IRQs.
+ */
+ x86_init.irqs.intr_mode_init();
tsc_init();
+
+ if (static_cpu_has(X86_FEATURE_WAITPKG))
+ use_tpause_delay();
}
/*
@@ -119,3 +94,18 @@ void __init time_init(void)
{
late_time_init = x86_late_time_init;
}
+
+/*
+ * Sanity check the vdso related archdata content.
+ */
+void clocksource_arch_init(struct clocksource *cs)
+{
+ if (cs->vdso_clock_mode == VDSO_CLOCKMODE_NONE)
+ return;
+
+ if (cs->mask != CLOCKSOURCE_MASK(64)) {
+ pr_warn("clocksource %s registered with invalid mask %016llx for VDSO. Disabling VDSO support.\n",
+ cs->name, cs->mask);
+ cs->vdso_clock_mode = VDSO_CLOCKMODE_NONE;
+ }
+}
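
The time.c hunk above drops the static struct irqaction and setup_irq() in
favor of request_irq(), which allocates and registers the irqaction itself. A
minimal sketch of the same conversion; demo_timer_isr and "demo-timer" are
placeholders, not names from this patch:

#include <linux/init.h>
#include <linux/interrupt.h>

static irqreturn_t demo_timer_isr(int irq, void *dev_id)
{
        /* per-tick work goes here */
        return IRQ_HANDLED;
}

static int __init demo_timer_setup(void)
{
        unsigned long flags = IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER;

        /* no static struct irqaction needed; dev_id may be NULL
         * because the line is not shared (no IRQF_SHARED) */
        return request_irq(0, demo_timer_isr, flags, "demo-timer", NULL);
}
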
diff --git a/arch/x86/kernel/tlb_uv.c b/arch/x86/kernel/tlb_uv.c
deleted file mode 100644
index 7fea555929e2..000000000000
--- a/arch/x86/kernel/tlb_uv.c
+++ /dev/null
@@ -1,1397 +0,0 @@
-/*
- * SGI UltraViolet TLB flush routines.
- *
- * (c) 2008-2010 Cliff Wickman <cpw@sgi.com>, SGI.
- *
- * This code is released under the GNU General Public License version 2 or
- * later.
- */
-#include <linux/seq_file.h>
-#include <linux/proc_fs.h>
-#include <linux/kernel.h>
-#include <linux/slab.h>
-
-#include <asm/mmu_context.h>
-#include <asm/uv/uv.h>
-#include <asm/uv/uv_mmrs.h>
-#include <asm/uv/uv_hub.h>
-#include <asm/uv/uv_bau.h>
-#include <asm/apic.h>
-#include <asm/idle.h>
-#include <asm/tsc.h>
-#include <asm/irq_vectors.h>
-#include <asm/timer.h>
-
-struct msg_desc {
- struct bau_payload_queue_entry *msg;
- int msg_slot;
- int sw_ack_slot;
- struct bau_payload_queue_entry *va_queue_first;
- struct bau_payload_queue_entry *va_queue_last;
-};
-
-#define UV_INTD_SOFT_ACK_TIMEOUT_PERIOD 0x000000000bUL
-
-static int uv_bau_max_concurrent __read_mostly;
-
-static int nobau;
-static int __init setup_nobau(char *arg)
-{
- nobau = 1;
- return 0;
-}
-early_param("nobau", setup_nobau);
-
-/* base pnode in this partition */
-static int uv_partition_base_pnode __read_mostly;
-/* position of pnode (which is nasid>>1): */
-static int uv_nshift __read_mostly;
-static unsigned long uv_mmask __read_mostly;
-
-static DEFINE_PER_CPU(struct ptc_stats, ptcstats);
-static DEFINE_PER_CPU(struct bau_control, bau_control);
-static DEFINE_PER_CPU(cpumask_var_t, uv_flush_tlb_mask);
-
-struct reset_args {
- int sender;
-};
-
-/*
- * Determine the first node on a uvhub. 'Nodes' are used for kernel
- * memory allocation.
- */
-static int __init uvhub_to_first_node(int uvhub)
-{
- int node, b;
-
- for_each_online_node(node) {
- b = uv_node_to_blade_id(node);
- if (uvhub == b)
- return node;
- }
- return -1;
-}
-
-/*
- * Determine the apicid of the first cpu on a uvhub.
- */
-static int __init uvhub_to_first_apicid(int uvhub)
-{
- int cpu;
-
- for_each_present_cpu(cpu)
- if (uvhub == uv_cpu_to_blade_id(cpu))
- return per_cpu(x86_cpu_to_apicid, cpu);
- return -1;
-}
-
-/*
- * Free a software acknowledge hardware resource by clearing its Pending
- * bit. This will return a reply to the sender.
- * If the message has timed out, a reply has already been sent by the
- * hardware but the resource has not been released. In that case our
- * clear of the Timeout bit (as well) will free the resource. No reply will
- * be sent (the hardware will only do one reply per message).
- */
-static inline void uv_reply_to_message(struct msg_desc *mdp,
- struct bau_control *bcp)
-{
- unsigned long dw;
- struct bau_payload_queue_entry *msg;
-
- msg = mdp->msg;
- if (!msg->canceled) {
- dw = (msg->sw_ack_vector << UV_SW_ACK_NPENDING) |
- msg->sw_ack_vector;
- uv_write_local_mmr(
- UVH_LB_BAU_INTD_SOFTWARE_ACKNOWLEDGE_ALIAS, dw);
- }
- msg->replied_to = 1;
- msg->sw_ack_vector = 0;
-}
-
-/*
- * Process the receipt of a RETRY message
- */
-static inline void uv_bau_process_retry_msg(struct msg_desc *mdp,
- struct bau_control *bcp)
-{
- int i;
- int cancel_count = 0;
- int slot2;
- unsigned long msg_res;
- unsigned long mmr = 0;
- struct bau_payload_queue_entry *msg;
- struct bau_payload_queue_entry *msg2;
- struct ptc_stats *stat;
-
- msg = mdp->msg;
- stat = &per_cpu(ptcstats, bcp->cpu);
- stat->d_retries++;
- /*
- * cancel any message from msg+1 to the retry itself
- */
- for (msg2 = msg+1, i = 0; i < DEST_Q_SIZE; msg2++, i++) {
- if (msg2 > mdp->va_queue_last)
- msg2 = mdp->va_queue_first;
- if (msg2 == msg)
- break;
-
- /* same conditions for cancellation as uv_do_reset */
- if ((msg2->replied_to == 0) && (msg2->canceled == 0) &&
- (msg2->sw_ack_vector) && ((msg2->sw_ack_vector &
- msg->sw_ack_vector) == 0) &&
- (msg2->sending_cpu == msg->sending_cpu) &&
- (msg2->msg_type != MSG_NOOP)) {
- slot2 = msg2 - mdp->va_queue_first;
- mmr = uv_read_local_mmr
- (UVH_LB_BAU_INTD_SOFTWARE_ACKNOWLEDGE);
- msg_res = ((msg2->sw_ack_vector << 8) |
- msg2->sw_ack_vector);
- /*
- * This is a message retry; clear the resources held
- * by the previous message only if they timed out.
- * If it has not timed out we have an unexpected
- * situation to report.
- */
- if (mmr & (msg_res << 8)) {
- /*
- * is the resource timed out?
- * make everyone ignore the cancelled message.
- */
- msg2->canceled = 1;
- stat->d_canceled++;
- cancel_count++;
- uv_write_local_mmr(
- UVH_LB_BAU_INTD_SOFTWARE_ACKNOWLEDGE_ALIAS,
- (msg_res << 8) | msg_res);
- } else
- printk(KERN_INFO "note bau retry: no effect\n");
- }
- }
- if (!cancel_count)
- stat->d_nocanceled++;
-}
-
-/*
- * Do all the things a cpu should do for a TLB shootdown message.
- * Other cpu's may come here at the same time for this message.
- */
-static void uv_bau_process_message(struct msg_desc *mdp,
- struct bau_control *bcp)
-{
- int msg_ack_count;
- short socket_ack_count = 0;
- struct ptc_stats *stat;
- struct bau_payload_queue_entry *msg;
- struct bau_control *smaster = bcp->socket_master;
-
- /*
- * This must be a normal message, or retry of a normal message
- */
- msg = mdp->msg;
- stat = &per_cpu(ptcstats, bcp->cpu);
- if (msg->address == TLB_FLUSH_ALL) {
- local_flush_tlb();
- stat->d_alltlb++;
- } else {
- __flush_tlb_one(msg->address);
- stat->d_onetlb++;
- }
- stat->d_requestee++;
-
- /*
- * One cpu on each uvhub has the additional job on a RETRY
- * of releasing the resource held by the message that is
- * being retried. That message is identified by sending
- * cpu number.
- */
- if (msg->msg_type == MSG_RETRY && bcp == bcp->uvhub_master)
- uv_bau_process_retry_msg(mdp, bcp);
-
- /*
- * This is a sw_ack message, so we have to reply to it.
- * Count each responding cpu on the socket. This avoids
- * pinging the count's cache line back and forth between
- * the sockets.
- */
- socket_ack_count = atomic_add_short_return(1, (struct atomic_short *)
- &smaster->socket_acknowledge_count[mdp->msg_slot]);
- if (socket_ack_count == bcp->cpus_in_socket) {
- /*
- * Both sockets dump their completed count total into
- * the message's count.
- */
- smaster->socket_acknowledge_count[mdp->msg_slot] = 0;
- msg_ack_count = atomic_add_short_return(socket_ack_count,
- (struct atomic_short *)&msg->acknowledge_count);
-
- if (msg_ack_count == bcp->cpus_in_uvhub) {
- /*
- * All cpus in uvhub saw it; reply
- */
- uv_reply_to_message(mdp, bcp);
- }
- }
-
- return;
-}
-
-/*
- * Determine the first cpu on a uvhub.
- */
-static int uvhub_to_first_cpu(int uvhub)
-{
- int cpu;
- for_each_present_cpu(cpu)
- if (uvhub == uv_cpu_to_blade_id(cpu))
- return cpu;
- return -1;
-}
-
-/*
- * Last resort when we get a large number of destination timeouts is
- * to clear resources held by a given cpu.
- * Do this with IPI so that all messages in the BAU message queue
- * can be identified by their nonzero sw_ack_vector field.
- *
- * This is entered for a single cpu on the uvhub.
- * The sender wants this uvhub to free a specific message's
- * sw_ack resources.
- */
-static void
-uv_do_reset(void *ptr)
-{
- int i;
- int slot;
- int count = 0;
- unsigned long mmr;
- unsigned long msg_res;
- struct bau_control *bcp;
- struct reset_args *rap;
- struct bau_payload_queue_entry *msg;
- struct ptc_stats *stat;
-
- bcp = &per_cpu(bau_control, smp_processor_id());
- rap = (struct reset_args *)ptr;
- stat = &per_cpu(ptcstats, bcp->cpu);
- stat->d_resets++;
-
- /*
- * We're looking for the given sender, and
- * will free its sw_ack resource.
- * If all cpu's finally responded after the timeout, its
- * message 'replied_to' was set.
- */
- for (msg = bcp->va_queue_first, i = 0; i < DEST_Q_SIZE; msg++, i++) {
- /* uv_do_reset: same conditions for cancellation as
- uv_bau_process_retry_msg() */
- if ((msg->replied_to == 0) &&
- (msg->canceled == 0) &&
- (msg->sending_cpu == rap->sender) &&
- (msg->sw_ack_vector) &&
- (msg->msg_type != MSG_NOOP)) {
- /*
- * make everyone else ignore this message
- */
- msg->canceled = 1;
- slot = msg - bcp->va_queue_first;
- count++;
- /*
- * only reset the resource if it is still pending
- */
- mmr = uv_read_local_mmr
- (UVH_LB_BAU_INTD_SOFTWARE_ACKNOWLEDGE);
- msg_res = ((msg->sw_ack_vector << 8) |
- msg->sw_ack_vector);
- if (mmr & msg_res) {
- stat->d_rcanceled++;
- uv_write_local_mmr(
- UVH_LB_BAU_INTD_SOFTWARE_ACKNOWLEDGE_ALIAS,
- msg_res);
- }
- }
- }
- return;
-}
-
-/*
- * Use IPI to get all target uvhubs to release resources held by
- * a given sending cpu number.
- */
-static void uv_reset_with_ipi(struct bau_target_uvhubmask *distribution,
- int sender)
-{
- int uvhub;
- int cpu;
- cpumask_t mask;
- struct reset_args reset_args;
-
- reset_args.sender = sender;
-
- cpus_clear(mask);
- /* find a single cpu for each uvhub in this distribution mask */
- for (uvhub = 0;
- uvhub < sizeof(struct bau_target_uvhubmask) * BITSPERBYTE;
- uvhub++) {
- if (!bau_uvhub_isset(uvhub, distribution))
- continue;
- /* find a cpu for this uvhub */
- cpu = uvhub_to_first_cpu(uvhub);
- cpu_set(cpu, mask);
- }
- /* IPI all cpus; Preemption is already disabled */
- smp_call_function_many(&mask, uv_do_reset, (void *)&reset_args, 1);
- return;
-}
-
-static inline unsigned long
-cycles_2_us(unsigned long long cyc)
-{
- unsigned long long ns;
- unsigned long us;
- ns = (cyc * per_cpu(cyc2ns, smp_processor_id()))
- >> CYC2NS_SCALE_FACTOR;
- us = ns / 1000;
- return us;
-}
-
-/*
- * wait for all cpus on this hub to finish their sends and go quiet
- * leaves uvhub_quiesce set so that no new broadcasts are started by
- * bau_flush_send_and_wait()
- */
-static inline void
-quiesce_local_uvhub(struct bau_control *hmaster)
-{
- atomic_add_short_return(1, (struct atomic_short *)
- &hmaster->uvhub_quiesce);
-}
-
-/*
- * mark this quiet-requestor as done
- */
-static inline void
-end_uvhub_quiesce(struct bau_control *hmaster)
-{
- atomic_add_short_return(-1, (struct atomic_short *)
- &hmaster->uvhub_quiesce);
-}
-
-/*
- * Wait for completion of a broadcast software ack message
- * return COMPLETE, RETRY(PLUGGED or TIMEOUT) or GIVEUP
- */
-static int uv_wait_completion(struct bau_desc *bau_desc,
- unsigned long mmr_offset, int right_shift, int this_cpu,
- struct bau_control *bcp, struct bau_control *smaster, long try)
-{
- int relaxes = 0;
- unsigned long descriptor_status;
- unsigned long mmr;
- unsigned long mask;
- cycles_t ttime;
- cycles_t timeout_time;
- struct ptc_stats *stat = &per_cpu(ptcstats, this_cpu);
- struct bau_control *hmaster;
-
- hmaster = bcp->uvhub_master;
- timeout_time = get_cycles() + bcp->timeout_interval;
-
- /* spin on the status MMR, waiting for it to go idle */
- while ((descriptor_status = (((unsigned long)
- uv_read_local_mmr(mmr_offset) >>
- right_shift) & UV_ACT_STATUS_MASK)) !=
- DESC_STATUS_IDLE) {
- /*
- * Our software ack messages may be blocked because there are
- * no swack resources available. As long as none of them
- * has timed out hardware will NACK our message and its
- * state will stay IDLE.
- */
- if (descriptor_status == DESC_STATUS_SOURCE_TIMEOUT) {
- stat->s_stimeout++;
- return FLUSH_GIVEUP;
- } else if (descriptor_status ==
- DESC_STATUS_DESTINATION_TIMEOUT) {
- stat->s_dtimeout++;
- ttime = get_cycles();
-
- /*
- * Our retries may be blocked by all destination
- * swack resources being consumed, and a timeout
- * pending. In that case hardware returns the
- * ERROR that looks like a destination timeout.
- */
- if (cycles_2_us(ttime - bcp->send_message) < BIOS_TO) {
- bcp->conseccompletes = 0;
- return FLUSH_RETRY_PLUGGED;
- }
-
- bcp->conseccompletes = 0;
- return FLUSH_RETRY_TIMEOUT;
- } else {
- /*
- * descriptor_status is still BUSY
- */
- cpu_relax();
- relaxes++;
- if (relaxes >= 10000) {
- relaxes = 0;
- if (get_cycles() > timeout_time) {
- quiesce_local_uvhub(hmaster);
-
- /* single-thread the register change */
- spin_lock(&hmaster->masks_lock);
- mmr = uv_read_local_mmr(mmr_offset);
- mask = 0UL;
- mask |= (3UL < right_shift);
- mask = ~mask;
- mmr &= mask;
- uv_write_local_mmr(mmr_offset, mmr);
- spin_unlock(&hmaster->masks_lock);
- end_uvhub_quiesce(hmaster);
- stat->s_busy++;
- return FLUSH_GIVEUP;
- }
- }
- }
- }
- bcp->conseccompletes++;
- return FLUSH_COMPLETE;
-}
-
-static inline cycles_t
-sec_2_cycles(unsigned long sec)
-{
- unsigned long ns;
- cycles_t cyc;
-
- ns = sec * 1000000000;
- cyc = (ns << CYC2NS_SCALE_FACTOR)/(per_cpu(cyc2ns, smp_processor_id()));
- return cyc;
-}
-
-/*
- * conditionally add 1 to *v, unless *v is >= u
- * return 0 if we cannot add 1 to *v because it is >= u
- * return 1 if we can add 1 to *v because it is < u
- * the add is atomic
- *
- * This is close to atomic_add_unless(), but this allows the 'u' value
- * to be lowered below the current 'v'. atomic_add_unless can only stop
- * on equal.
- */
-static inline int atomic_inc_unless_ge(spinlock_t *lock, atomic_t *v, int u)
-{
- spin_lock(lock);
- if (atomic_read(v) >= u) {
- spin_unlock(lock);
- return 0;
- }
- atomic_inc(v);
- spin_unlock(lock);
- return 1;
-}
-
-/**
- * uv_flush_send_and_wait
- *
- * Send a broadcast and wait for it to complete.
- *
- * The flush_mask contains the cpus the broadcast is to be sent to, plus
- * cpus that are on the local uvhub.
- *
- * Returns NULL if all flushing represented in the mask was done. The mask
- * is zeroed.
- * Returns @flush_mask if some remote flushing remains to be done. The
- * mask will have some bits still set, representing any cpus on the local
- * uvhub (not current cpu) and any on remote uvhubs if the broadcast failed.
- */
-const struct cpumask *uv_flush_send_and_wait(struct bau_desc *bau_desc,
- struct cpumask *flush_mask,
- struct bau_control *bcp)
-{
- int right_shift;
- int uvhub;
- int bit;
- int completion_status = 0;
- int seq_number = 0;
- long try = 0;
- int cpu = bcp->uvhub_cpu;
- int this_cpu = bcp->cpu;
- int this_uvhub = bcp->uvhub;
- unsigned long mmr_offset;
- unsigned long index;
- cycles_t time1;
- cycles_t time2;
- struct ptc_stats *stat = &per_cpu(ptcstats, bcp->cpu);
- struct bau_control *smaster = bcp->socket_master;
- struct bau_control *hmaster = bcp->uvhub_master;
-
- /*
- * Spin here while there are hmaster->max_concurrent or more active
- * descriptors. This is the per-uvhub 'throttle'.
- */
- if (!atomic_inc_unless_ge(&hmaster->uvhub_lock,
- &hmaster->active_descriptor_count,
- hmaster->max_concurrent)) {
- stat->s_throttles++;
- do {
- cpu_relax();
- } while (!atomic_inc_unless_ge(&hmaster->uvhub_lock,
- &hmaster->active_descriptor_count,
- hmaster->max_concurrent));
- }
-
- while (hmaster->uvhub_quiesce)
- cpu_relax();
-
- if (cpu < UV_CPUS_PER_ACT_STATUS) {
- mmr_offset = UVH_LB_BAU_SB_ACTIVATION_STATUS_0;
- right_shift = cpu * UV_ACT_STATUS_SIZE;
- } else {
- mmr_offset = UVH_LB_BAU_SB_ACTIVATION_STATUS_1;
- right_shift =
- ((cpu - UV_CPUS_PER_ACT_STATUS) * UV_ACT_STATUS_SIZE);
- }
- time1 = get_cycles();
- do {
- /*
- * Every message from any given cpu gets a unique message
- * sequence number. But retries use that same number.
- * Our message may have timed out at the destination because
- * all sw-ack resources are in use and there is a timeout
- * pending there. In that case, our last send never got
- * placed into the queue and we need to persist until it
- * does.
- *
- * Make any retry a type MSG_RETRY so that the destination will
- * free any resource held by a previous message from this cpu.
- */
- if (try == 0) {
- /* use message type set by the caller the first time */
- seq_number = bcp->message_number++;
- } else {
- /* use RETRY type on all the rest; same sequence */
- bau_desc->header.msg_type = MSG_RETRY;
- stat->s_retry_messages++;
- }
- bau_desc->header.sequence = seq_number;
- index = (1UL << UVH_LB_BAU_SB_ACTIVATION_CONTROL_PUSH_SHFT) |
- bcp->uvhub_cpu;
- bcp->send_message = get_cycles();
-
- uv_write_local_mmr(UVH_LB_BAU_SB_ACTIVATION_CONTROL, index);
-
- try++;
- completion_status = uv_wait_completion(bau_desc, mmr_offset,
- right_shift, this_cpu, bcp, smaster, try);
-
- if (completion_status == FLUSH_RETRY_PLUGGED) {
- /*
- * Our retries may be blocked by all destination swack
- * resources being consumed, and a timeout pending. In
- * that case hardware immediately returns the ERROR
- * that looks like a destination timeout.
- */
- udelay(TIMEOUT_DELAY);
- bcp->plugged_tries++;
- if (bcp->plugged_tries >= PLUGSB4RESET) {
- bcp->plugged_tries = 0;
- quiesce_local_uvhub(hmaster);
- spin_lock(&hmaster->queue_lock);
- uv_reset_with_ipi(&bau_desc->distribution,
- this_cpu);
- spin_unlock(&hmaster->queue_lock);
- end_uvhub_quiesce(hmaster);
- bcp->ipi_attempts++;
- stat->s_resets_plug++;
- }
- } else if (completion_status == FLUSH_RETRY_TIMEOUT) {
- hmaster->max_concurrent = 1;
- bcp->timeout_tries++;
- udelay(TIMEOUT_DELAY);
- if (bcp->timeout_tries >= TIMEOUTSB4RESET) {
- bcp->timeout_tries = 0;
- quiesce_local_uvhub(hmaster);
- spin_lock(&hmaster->queue_lock);
- uv_reset_with_ipi(&bau_desc->distribution,
- this_cpu);
- spin_unlock(&hmaster->queue_lock);
- end_uvhub_quiesce(hmaster);
- bcp->ipi_attempts++;
- stat->s_resets_timeout++;
- }
- }
- if (bcp->ipi_attempts >= 3) {
- bcp->ipi_attempts = 0;
- completion_status = FLUSH_GIVEUP;
- break;
- }
- cpu_relax();
- } while ((completion_status == FLUSH_RETRY_PLUGGED) ||
- (completion_status == FLUSH_RETRY_TIMEOUT));
- time2 = get_cycles();
-
- if ((completion_status == FLUSH_COMPLETE) && (bcp->conseccompletes > 5)
- && (hmaster->max_concurrent < hmaster->max_concurrent_constant))
- hmaster->max_concurrent++;
-
- /*
- * hold any cpu not timing out here; no other cpu currently held by
- * the 'throttle' should enter the activation code
- */
- while (hmaster->uvhub_quiesce)
- cpu_relax();
- atomic_dec(&hmaster->active_descriptor_count);
-
- /* guard against cycles wrap */
- if (time2 > time1)
- stat->s_time += (time2 - time1);
- else
- stat->s_requestor--; /* don't count this one */
- if (completion_status == FLUSH_COMPLETE && try > 1)
- stat->s_retriesok++;
- else if (completion_status == FLUSH_GIVEUP) {
- /*
- * Cause the caller to do an IPI-style TLB shootdown on
- * the target cpu's, all of which are still in the mask.
- */
- stat->s_giveup++;
- return flush_mask;
- }
-
- /*
- * Success, so clear the remote cpu's from the mask so we don't
- * use the IPI method of shootdown on them.
- */
- for_each_cpu(bit, flush_mask) {
- uvhub = uv_cpu_to_blade_id(bit);
- if (uvhub == this_uvhub)
- continue;
- cpumask_clear_cpu(bit, flush_mask);
- }
- if (!cpumask_empty(flush_mask))
- return flush_mask;
-
- return NULL;
-}
-
-/**
- * uv_flush_tlb_others - globally purge translation cache of a virtual
- * address or all TLB's
- * @cpumask: mask of all cpu's in which the address is to be removed
- * @mm: mm_struct containing virtual address range
- * @va: virtual address to be removed (or TLB_FLUSH_ALL for all TLB's on cpu)
- * @cpu: the current cpu
- *
- * This is the entry point for initiating any UV global TLB shootdown.
- *
- * Purges the translation caches of all specified processors of the given
- * virtual address, or purges all TLB's on specified processors.
- *
- * The caller has derived the cpumask from the mm_struct. This function
- * is called only if there are bits set in the mask. (e.g. flush_tlb_page())
- *
- * The cpumask is converted into a uvhubmask of the uvhubs containing
- * those cpus.
- *
- * Note that this function should be called with preemption disabled.
- *
- * Returns NULL if all remote flushing was done.
- * Returns pointer to cpumask if some remote flushing remains to be
- * done. The returned pointer is valid till preemption is re-enabled.
- */
-const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
- struct mm_struct *mm,
- unsigned long va, unsigned int cpu)
-{
- int remotes;
- int tcpu;
- int uvhub;
- int locals = 0;
- struct bau_desc *bau_desc;
- struct cpumask *flush_mask;
- struct ptc_stats *stat;
- struct bau_control *bcp;
-
- if (nobau)
- return cpumask;
-
- bcp = &per_cpu(bau_control, cpu);
- /*
- * Each sending cpu has a per-cpu mask which it fills from the caller's
- * cpu mask. Only remote cpus are converted to uvhubs and copied.
- */
- flush_mask = (struct cpumask *)per_cpu(uv_flush_tlb_mask, cpu);
- /*
- * copy cpumask to flush_mask, removing current cpu
- * (current cpu should already have been flushed by the caller and
- * should never be returned if we return flush_mask)
- */
- cpumask_andnot(flush_mask, cpumask, cpumask_of(cpu));
- if (cpu_isset(cpu, *cpumask))
- locals++; /* current cpu was targeted */
-
- bau_desc = bcp->descriptor_base;
- bau_desc += UV_ITEMS_PER_DESCRIPTOR * bcp->uvhub_cpu;
-
- bau_uvhubs_clear(&bau_desc->distribution, UV_DISTRIBUTION_SIZE);
- remotes = 0;
- for_each_cpu(tcpu, flush_mask) {
- uvhub = uv_cpu_to_blade_id(tcpu);
- if (uvhub == bcp->uvhub) {
- locals++;
- continue;
- }
- bau_uvhub_set(uvhub, &bau_desc->distribution);
- remotes++;
- }
- if (remotes == 0) {
- /*
- * No off_hub flushing; return status for local hub.
- * Return the caller's mask if all were local (the current
- * cpu may be in that mask).
- */
- if (locals)
- return cpumask;
- else
- return NULL;
- }
- stat = &per_cpu(ptcstats, cpu);
- stat->s_requestor++;
- stat->s_ntargcpu += remotes;
- remotes = bau_uvhub_weight(&bau_desc->distribution);
- stat->s_ntarguvhub += remotes;
- if (remotes >= 16)
- stat->s_ntarguvhub16++;
- else if (remotes >= 8)
- stat->s_ntarguvhub8++;
- else if (remotes >= 4)
- stat->s_ntarguvhub4++;
- else if (remotes >= 2)
- stat->s_ntarguvhub2++;
- else
- stat->s_ntarguvhub1++;
-
- bau_desc->payload.address = va;
- bau_desc->payload.sending_cpu = cpu;
-
- /*
- * uv_flush_send_and_wait returns null if all cpu's were messaged, or
- * the adjusted flush_mask if any cpu's were not messaged.
- */
- return uv_flush_send_and_wait(bau_desc, flush_mask, bcp);
-}
-
-/*
- * The BAU message interrupt comes here. (registered by set_intr_gate)
- * See entry_64.S
- *
- * We received a broadcast assist message.
- *
- * Interrupts are disabled; this interrupt could represent
- * the receipt of several messages.
- *
- * All cores/threads on this hub get this interrupt.
- * The last one to see it does the software ack.
- * (the resource will not be freed until noninterruptable cpus see this
- * interrupt; hardware may timeout the s/w ack and reply ERROR)
- */
-void uv_bau_message_interrupt(struct pt_regs *regs)
-{
- int count = 0;
- cycles_t time_start;
- struct bau_payload_queue_entry *msg;
- struct bau_control *bcp;
- struct ptc_stats *stat;
- struct msg_desc msgdesc;
-
- time_start = get_cycles();
- bcp = &per_cpu(bau_control, smp_processor_id());
- stat = &per_cpu(ptcstats, smp_processor_id());
- msgdesc.va_queue_first = bcp->va_queue_first;
- msgdesc.va_queue_last = bcp->va_queue_last;
- msg = bcp->bau_msg_head;
- while (msg->sw_ack_vector) {
- count++;
- msgdesc.msg_slot = msg - msgdesc.va_queue_first;
- msgdesc.sw_ack_slot = ffs(msg->sw_ack_vector) - 1;
- msgdesc.msg = msg;
- uv_bau_process_message(&msgdesc, bcp);
- msg++;
- if (msg > msgdesc.va_queue_last)
- msg = msgdesc.va_queue_first;
- bcp->bau_msg_head = msg;
- }
- stat->d_time += (get_cycles() - time_start);
- if (!count)
- stat->d_nomsg++;
- else if (count > 1)
- stat->d_multmsg++;
- ack_APIC_irq();
-}
-
-/*
- * uv_enable_timeouts
- *
- * Each target uvhub (i.e. a uvhub that has cpu's) needs to have
- * shootdown message timeouts enabled. The timeout does not cause
- * an interrupt, but causes an error message to be returned to
- * the sender.
- */
-static void uv_enable_timeouts(void)
-{
- int uvhub;
- int nuvhubs;
- int pnode;
- unsigned long mmr_image;
-
- nuvhubs = uv_num_possible_blades();
-
- for (uvhub = 0; uvhub < nuvhubs; uvhub++) {
- if (!uv_blade_nr_possible_cpus(uvhub))
- continue;
-
- pnode = uv_blade_to_pnode(uvhub);
- mmr_image =
- uv_read_global_mmr64(pnode, UVH_LB_BAU_MISC_CONTROL);
- /*
- * Set the timeout period and then lock it in, in three
- * steps; captures and locks in the period.
- *
- * To program the period, the SOFT_ACK_MODE must be off.
- */
- mmr_image &= ~((unsigned long)1 <<
- UVH_LB_BAU_MISC_CONTROL_ENABLE_INTD_SOFT_ACK_MODE_SHFT);
- uv_write_global_mmr64
- (pnode, UVH_LB_BAU_MISC_CONTROL, mmr_image);
- /*
- * Set the 4-bit period.
- */
- mmr_image &= ~((unsigned long)0xf <<
- UVH_LB_BAU_MISC_CONTROL_INTD_SOFT_ACK_TIMEOUT_PERIOD_SHFT);
- mmr_image |= (UV_INTD_SOFT_ACK_TIMEOUT_PERIOD <<
- UVH_LB_BAU_MISC_CONTROL_INTD_SOFT_ACK_TIMEOUT_PERIOD_SHFT);
- uv_write_global_mmr64
- (pnode, UVH_LB_BAU_MISC_CONTROL, mmr_image);
- /*
- * Subsequent reversals of the timebase bit (3) cause an
- * immediate timeout of one or all INTD resources as
- * indicated in bits 2:0 (7 causes all of them to timeout).
- */
- mmr_image |= ((unsigned long)1 <<
- UVH_LB_BAU_MISC_CONTROL_ENABLE_INTD_SOFT_ACK_MODE_SHFT);
- uv_write_global_mmr64
- (pnode, UVH_LB_BAU_MISC_CONTROL, mmr_image);
- }
-}
-
-static void *uv_ptc_seq_start(struct seq_file *file, loff_t *offset)
-{
- if (*offset < num_possible_cpus())
- return offset;
- return NULL;
-}
-
-static void *uv_ptc_seq_next(struct seq_file *file, void *data, loff_t *offset)
-{
- (*offset)++;
- if (*offset < num_possible_cpus())
- return offset;
- return NULL;
-}
-
-static void uv_ptc_seq_stop(struct seq_file *file, void *data)
-{
-}
-
-static inline unsigned long long
-millisec_2_cycles(unsigned long millisec)
-{
- unsigned long ns;
- unsigned long long cyc;
-
- ns = millisec * 1000000;
- cyc = (ns << CYC2NS_SCALE_FACTOR)/(per_cpu(cyc2ns, smp_processor_id()));
- return cyc;
-}
-
-/*
- * Display the statistics thru /proc.
- * 'data' points to the cpu number
- */
-static int uv_ptc_seq_show(struct seq_file *file, void *data)
-{
- struct ptc_stats *stat;
- int cpu;
-
- cpu = *(loff_t *)data;
-
- if (!cpu) {
- seq_printf(file,
- "# cpu sent stime numuvhubs numuvhubs16 numuvhubs8 ");
- seq_printf(file,
- "numuvhubs4 numuvhubs2 numuvhubs1 numcpus dto ");
- seq_printf(file,
- "retries rok resetp resett giveup sto bz throt ");
- seq_printf(file,
- "sw_ack recv rtime all ");
- seq_printf(file,
- "one mult none retry canc nocan reset rcan\n");
- }
- if (cpu < num_possible_cpus() && cpu_online(cpu)) {
- stat = &per_cpu(ptcstats, cpu);
- /* source side statistics */
- seq_printf(file,
- "cpu %d %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld ",
- cpu, stat->s_requestor, cycles_2_us(stat->s_time),
- stat->s_ntarguvhub, stat->s_ntarguvhub16,
- stat->s_ntarguvhub8, stat->s_ntarguvhub4,
- stat->s_ntarguvhub2, stat->s_ntarguvhub1,
- stat->s_ntargcpu, stat->s_dtimeout);
- seq_printf(file, "%ld %ld %ld %ld %ld %ld %ld %ld ",
- stat->s_retry_messages, stat->s_retriesok,
- stat->s_resets_plug, stat->s_resets_timeout,
- stat->s_giveup, stat->s_stimeout,
- stat->s_busy, stat->s_throttles);
- /* destination side statistics */
- seq_printf(file,
- "%lx %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld %ld\n",
- uv_read_global_mmr64(uv_cpu_to_pnode(cpu),
- UVH_LB_BAU_INTD_SOFTWARE_ACKNOWLEDGE),
- stat->d_requestee, cycles_2_us(stat->d_time),
- stat->d_alltlb, stat->d_onetlb, stat->d_multmsg,
- stat->d_nomsg, stat->d_retries, stat->d_canceled,
- stat->d_nocanceled, stat->d_resets,
- stat->d_rcanceled);
- }
-
- return 0;
-}
-
-/*
- * -1: reset the statistics
- * 0: display meaning of the statistics
- * >0: maximum concurrent active descriptors per uvhub (throttle)
- */
-static ssize_t uv_ptc_proc_write(struct file *file, const char __user *user,
- size_t count, loff_t *data)
-{
- int cpu;
- long input_arg;
- char optstr[64];
- struct ptc_stats *stat;
- struct bau_control *bcp;
-
- if (count == 0 || count > sizeof(optstr))
- return -EINVAL;
- if (copy_from_user(optstr, user, count))
- return -EFAULT;
- optstr[count - 1] = '\0';
- if (strict_strtol(optstr, 10, &input_arg) < 0) {
- printk(KERN_DEBUG "%s is invalid\n", optstr);
- return -EINVAL;
- }
-
- if (input_arg == 0) {
- printk(KERN_DEBUG "# cpu: cpu number\n");
- printk(KERN_DEBUG "Sender statistics:\n");
- printk(KERN_DEBUG
- "sent: number of shootdown messages sent\n");
- printk(KERN_DEBUG
- "stime: time spent sending messages\n");
- printk(KERN_DEBUG
- "numuvhubs: number of hubs targeted with shootdown\n");
- printk(KERN_DEBUG
- "numuvhubs16: number times 16 or more hubs targeted\n");
- printk(KERN_DEBUG
- "numuvhubs8: number times 8 or more hubs targeted\n");
- printk(KERN_DEBUG
- "numuvhubs4: number times 4 or more hubs targeted\n");
- printk(KERN_DEBUG
- "numuvhubs2: number times 2 or more hubs targeted\n");
- printk(KERN_DEBUG
- "numuvhubs1: number times 1 hub targeted\n");
- printk(KERN_DEBUG
- "numcpus: number of cpus targeted with shootdown\n");
- printk(KERN_DEBUG
- "dto: number of destination timeouts\n");
- printk(KERN_DEBUG
- "retries: destination timeout retries sent\n");
- printk(KERN_DEBUG
- "rok: : destination timeouts successfully retried\n");
- printk(KERN_DEBUG
- "resetp: ipi-style resource resets for plugs\n");
- printk(KERN_DEBUG
- "resett: ipi-style resource resets for timeouts\n");
- printk(KERN_DEBUG
- "giveup: fall-backs to ipi-style shootdowns\n");
- printk(KERN_DEBUG
- "sto: number of source timeouts\n");
- printk(KERN_DEBUG
- "bz: number of stay-busy's\n");
- printk(KERN_DEBUG
- "throt: number times spun in throttle\n");
- printk(KERN_DEBUG "Destination side statistics:\n");
- printk(KERN_DEBUG
- "sw_ack: image of UVH_LB_BAU_INTD_SOFTWARE_ACKNOWLEDGE\n");
- printk(KERN_DEBUG
- "recv: shootdown messages received\n");
- printk(KERN_DEBUG
- "rtime: time spent processing messages\n");
- printk(KERN_DEBUG
- "all: shootdown all-tlb messages\n");
- printk(KERN_DEBUG
- "one: shootdown one-tlb messages\n");
- printk(KERN_DEBUG
- "mult: interrupts that found multiple messages\n");
- printk(KERN_DEBUG
- "none: interrupts that found no messages\n");
- printk(KERN_DEBUG
- "retry: number of retry messages processed\n");
- printk(KERN_DEBUG
- "canc: number messages canceled by retries\n");
- printk(KERN_DEBUG
- "nocan: number retries that found nothing to cancel\n");
- printk(KERN_DEBUG
- "reset: number of ipi-style reset requests processed\n");
- printk(KERN_DEBUG
- "rcan: number messages canceled by reset requests\n");
- } else if (input_arg == -1) {
- for_each_present_cpu(cpu) {
- stat = &per_cpu(ptcstats, cpu);
- memset(stat, 0, sizeof(struct ptc_stats));
- }
- } else {
- uv_bau_max_concurrent = input_arg;
- bcp = &per_cpu(bau_control, smp_processor_id());
- if (uv_bau_max_concurrent < 1 ||
- uv_bau_max_concurrent > bcp->cpus_in_uvhub) {
- printk(KERN_DEBUG
- "Error: BAU max concurrent %d; %d is invalid\n",
- bcp->max_concurrent, uv_bau_max_concurrent);
- return -EINVAL;
- }
- printk(KERN_DEBUG "Set BAU max concurrent:%d\n",
- uv_bau_max_concurrent);
- for_each_present_cpu(cpu) {
- bcp = &per_cpu(bau_control, cpu);
- bcp->max_concurrent = uv_bau_max_concurrent;
- }
- }
-
- return count;
-}
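
The three write values documented above can be exercised from userspace. A minimal sketch, assuming the proc entry created under UV_PTC_BASENAME lands at /proc/sgi_uv/ptc_statistics (the path is an assumption for illustration, not stated in this hunk):

        #include <stdio.h>

        int main(void)
        {
                /* path is assumed; the kernel side names it UV_PTC_BASENAME */
                FILE *f = fopen("/proc/sgi_uv/ptc_statistics", "w");

                if (!f)
                        return 1;
                fputs("-1\n", f);       /* -1 zeroes every cpu's ptcstats */
                /* "0" prints the field legend; N > 0 sets the throttle */
                fclose(f);
                return 0;
        }
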
-
-static const struct seq_operations uv_ptc_seq_ops = {
- .start = uv_ptc_seq_start,
- .next = uv_ptc_seq_next,
- .stop = uv_ptc_seq_stop,
- .show = uv_ptc_seq_show
-};
-
-static int uv_ptc_proc_open(struct inode *inode, struct file *file)
-{
- return seq_open(file, &uv_ptc_seq_ops);
-}
-
-static const struct file_operations proc_uv_ptc_operations = {
- .open = uv_ptc_proc_open,
- .read = seq_read,
- .write = uv_ptc_proc_write,
- .llseek = seq_lseek,
- .release = seq_release,
-};
-
-static int __init uv_ptc_init(void)
-{
- struct proc_dir_entry *proc_uv_ptc;
-
- if (!is_uv_system())
- return 0;
-
- proc_uv_ptc = proc_create(UV_PTC_BASENAME, 0444, NULL,
- &proc_uv_ptc_operations);
- if (!proc_uv_ptc) {
- printk(KERN_ERR "unable to create %s proc entry\n",
- UV_PTC_BASENAME);
- return -EINVAL;
- }
- return 0;
-}
-
-/*
- * initialize the sending side's sending buffers
- */
-static void
-uv_activation_descriptor_init(int node, int pnode)
-{
- int i;
- int cpu;
- unsigned long pa;
- unsigned long m;
- unsigned long n;
- struct bau_desc *bau_desc;
- struct bau_desc *bd2;
- struct bau_control *bcp;
-
- /*
- * each bau_desc is 64 bytes; there are 8 (UV_ITEMS_PER_DESCRIPTOR)
- * per cpu; and up to 32 (UV_ADP_SIZE) cpu's per uvhub
- */
- bau_desc = (struct bau_desc *)kmalloc_node(sizeof(struct bau_desc)*
- UV_ADP_SIZE*UV_ITEMS_PER_DESCRIPTOR, GFP_KERNEL, node);
- BUG_ON(!bau_desc);
-
- pa = uv_gpa(bau_desc); /* need the real nasid*/
- n = pa >> uv_nshift;
- m = pa & uv_mmask;
-
- uv_write_global_mmr64(pnode, UVH_LB_BAU_SB_DESCRIPTOR_BASE,
- (n << UV_DESC_BASE_PNODE_SHIFT | m));
-
- /*
- * initializing all 8 (UV_ITEMS_PER_DESCRIPTOR) descriptors for each
- * cpu even though we only use the first one; one descriptor can
- * describe a broadcast to 256 uv hubs.
- */
- for (i = 0, bd2 = bau_desc; i < (UV_ADP_SIZE*UV_ITEMS_PER_DESCRIPTOR);
- i++, bd2++) {
- memset(bd2, 0, sizeof(struct bau_desc));
- bd2->header.sw_ack_flag = 1;
- /*
- * base_dest_nodeid is the nasid (pnode<<1) of the first uvhub
- * in the partition. The bit map will indicate uvhub numbers,
- * which are 0-N in a partition. Pnodes are unique system-wide.
- */
- bd2->header.base_dest_nodeid = uv_partition_base_pnode << 1;
- bd2->header.dest_subnodeid = 0x10; /* the LB */
- bd2->header.command = UV_NET_ENDPOINT_INTD;
- bd2->header.int_both = 1;
- /*
- * all others need to be set to zero:
- * fairness chaining multilevel count replied_to
- */
- }
- for_each_present_cpu(cpu) {
- if (pnode != uv_blade_to_pnode(uv_cpu_to_blade_id(cpu)))
- continue;
- bcp = &per_cpu(bau_control, cpu);
- bcp->descriptor_base = bau_desc;
- }
-}
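
The n/m pair written into UVH_LB_BAU_SB_DESCRIPTOR_BASE is just a shift-and-mask split of the global address, with uv_nshift and uv_mmask derived from m_val (see uv_bau_init below). A standalone sketch of the same decomposition:

        /* sketch: split a UV global address into node bits and offset bits */
        static unsigned long gpa_node(unsigned long pa, int m_val)
        {
                return pa >> m_val;                     /* pnode bits */
        }

        static unsigned long gpa_offset(unsigned long pa, int m_val)
        {
                return pa & ((1UL << m_val) - 1);       /* node-local offset */
        }
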
-
-/*
- * initialize the destination side's receiving buffers
- * entered for each uvhub in the partition
- * - node is first node (kernel memory notion) on the uvhub
- * - pnode is the uvhub's physical identifier
- */
-static void
-uv_payload_queue_init(int node, int pnode)
-{
- int pn;
- int cpu;
- char *cp;
- unsigned long pa;
- struct bau_payload_queue_entry *pqp;
- struct bau_payload_queue_entry *pqp_malloc;
- struct bau_control *bcp;
-
- pqp = (struct bau_payload_queue_entry *) kmalloc_node(
- (DEST_Q_SIZE + 1) * sizeof(struct bau_payload_queue_entry),
- GFP_KERNEL, node);
- BUG_ON(!pqp);
- pqp_malloc = pqp;
-
- cp = (char *)pqp + 31;
- pqp = (struct bau_payload_queue_entry *)(((unsigned long)cp >> 5) << 5);
-
- for_each_present_cpu(cpu) {
- if (pnode != uv_cpu_to_pnode(cpu))
- continue;
- /* for every cpu on this pnode: */
- bcp = &per_cpu(bau_control, cpu);
- bcp->va_queue_first = pqp;
- bcp->bau_msg_head = pqp;
- bcp->va_queue_last = pqp + (DEST_Q_SIZE - 1);
- }
- /*
- * need the pnode of where the memory was really allocated
- */
- pa = uv_gpa(pqp);
- pn = pa >> uv_nshift;
- uv_write_global_mmr64(pnode,
- UVH_LB_BAU_INTD_PAYLOAD_QUEUE_FIRST,
- ((unsigned long)pn << UV_PAYLOADQ_PNODE_SHIFT) |
- uv_physnodeaddr(pqp));
- uv_write_global_mmr64(pnode, UVH_LB_BAU_INTD_PAYLOAD_QUEUE_TAIL,
- uv_physnodeaddr(pqp));
- uv_write_global_mmr64(pnode, UVH_LB_BAU_INTD_PAYLOAD_QUEUE_LAST,
- (unsigned long)
- uv_physnodeaddr(pqp + (DEST_Q_SIZE - 1)));
- /* in effect, all msg_type's are set to MSG_NOOP */
- memset(pqp, 0, sizeof(struct bau_payload_queue_entry) * DEST_Q_SIZE);
-}
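
The cp + 31 / shift-down-shift-up dance above is the usual round-up-to-alignment idiom (and is why DEST_Q_SIZE + 1 entries are allocated: the aligned copy may start up to 31 bytes into the buffer). Equivalently, as a self-contained sketch:

        #include <stdint.h>

        /*
         * Round p up to the next 32-byte boundary.
         * ((x + 31) >> 5) << 5 and (x + 31) & ~31UL compute the same value.
         */
        static void *align32(void *p)
        {
                return (void *)(((uintptr_t)p + 31) & ~(uintptr_t)31);
        }
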
-
-/*
- * Initialization of each UV hub's structures
- */
-static void __init uv_init_uvhub(int uvhub, int vector)
-{
- int node;
- int pnode;
- unsigned long apicid;
-
- node = uvhub_to_first_node(uvhub);
- pnode = uv_blade_to_pnode(uvhub);
- uv_activation_descriptor_init(node, pnode);
- uv_payload_queue_init(node, pnode);
- /*
- * the below initialization can't be in firmware because the
- * messaging IRQ will be determined by the OS
- */
- apicid = uvhub_to_first_apicid(uvhub);
- uv_write_global_mmr64(pnode, UVH_BAU_DATA_CONFIG,
- ((apicid << 32) | vector));
-}
-
-/*
- * initialize the bau_control structure for each cpu
- */
-static void uv_init_per_cpu(int nuvhubs)
-{
- int i, j, k;
- int cpu;
- int pnode;
- int uvhub;
- short socket = 0;
- struct bau_control *bcp;
- struct uvhub_desc *bdp;
- struct socket_desc *sdp;
- struct bau_control *hmaster = NULL;
- struct bau_control *smaster = NULL;
- struct socket_desc {
- short num_cpus;
- short cpu_number[16];
- };
- struct uvhub_desc {
- short num_sockets;
- short num_cpus;
- short uvhub;
- short pnode;
- struct socket_desc socket[2];
- };
- struct uvhub_desc *uvhub_descs;
-
- uvhub_descs = (struct uvhub_desc *)
- kmalloc(nuvhubs * sizeof(struct uvhub_desc), GFP_KERNEL);
- memset(uvhub_descs, 0, nuvhubs * sizeof(struct uvhub_desc));
- for_each_present_cpu(cpu) {
- bcp = &per_cpu(bau_control, cpu);
- memset(bcp, 0, sizeof(struct bau_control));
- spin_lock_init(&bcp->masks_lock);
- bcp->max_concurrent = uv_bau_max_concurrent;
- pnode = uv_cpu_hub_info(cpu)->pnode;
- uvhub = uv_cpu_hub_info(cpu)->numa_blade_id;
- bdp = &uvhub_descs[uvhub];
- bdp->num_cpus++;
- bdp->uvhub = uvhub;
- bdp->pnode = pnode;
- /* time interval to catch a hardware stay-busy bug */
- bcp->timeout_interval = millisec_2_cycles(3);
- /* kludge: assume uv_hub.h is constant */
- socket = (cpu_physical_id(cpu)>>5)&1;
- if (socket >= bdp->num_sockets)
- bdp->num_sockets = socket+1;
- sdp = &bdp->socket[socket];
- sdp->cpu_number[sdp->num_cpus] = cpu;
- sdp->num_cpus++;
- }
- socket = 0;
- for_each_possible_blade(uvhub) {
- bdp = &uvhub_descs[uvhub];
- for (i = 0; i < bdp->num_sockets; i++) {
- sdp = &bdp->socket[i];
- for (j = 0; j < sdp->num_cpus; j++) {
- cpu = sdp->cpu_number[j];
- bcp = &per_cpu(bau_control, cpu);
- bcp->cpu = cpu;
- if (j == 0) {
- smaster = bcp;
- if (i == 0)
- hmaster = bcp;
- }
- bcp->cpus_in_uvhub = bdp->num_cpus;
- bcp->cpus_in_socket = sdp->num_cpus;
- bcp->socket_master = smaster;
- bcp->uvhub_master = hmaster;
- for (k = 0; k < DEST_Q_SIZE; k++)
- bcp->socket_acknowledge_count[k] = 0;
- bcp->uvhub_cpu =
- uv_cpu_hub_info(cpu)->blade_processor_id;
- }
- socket++;
- }
- }
- kfree(uvhub_descs);
-}
-
-/*
- * Initialization of BAU-related structures
- */
-static int __init uv_bau_init(void)
-{
- int uvhub;
- int pnode;
- int nuvhubs;
- int cur_cpu;
- int vector;
- unsigned long mmr;
-
- if (!is_uv_system())
- return 0;
-
- if (nobau)
- return 0;
-
- for_each_possible_cpu(cur_cpu)
- zalloc_cpumask_var_node(&per_cpu(uv_flush_tlb_mask, cur_cpu),
- GFP_KERNEL, cpu_to_node(cur_cpu));
-
- uv_bau_max_concurrent = MAX_BAU_CONCURRENT;
- uv_nshift = uv_hub_info->m_val;
- uv_mmask = (1UL << uv_hub_info->m_val) - 1;
- nuvhubs = uv_num_possible_blades();
-
- uv_init_per_cpu(nuvhubs);
-
- uv_partition_base_pnode = 0x7fffffff;
- for (uvhub = 0; uvhub < nuvhubs; uvhub++)
- if (uv_blade_nr_possible_cpus(uvhub) &&
- (uv_blade_to_pnode(uvhub) < uv_partition_base_pnode))
- uv_partition_base_pnode = uv_blade_to_pnode(uvhub);
-
- vector = UV_BAU_MESSAGE;
- for_each_possible_blade(uvhub)
- if (uv_blade_nr_possible_cpus(uvhub))
- uv_init_uvhub(uvhub, vector);
-
- uv_enable_timeouts();
- alloc_intr_gate(vector, uv_bau_message_intr1);
-
- for_each_possible_blade(uvhub) {
- pnode = uv_blade_to_pnode(uvhub);
- /* INIT the bau */
- uv_write_global_mmr64(pnode, UVH_LB_BAU_SB_ACTIVATION_CONTROL,
- ((unsigned long)1 << 63));
- mmr = 1; /* should be 1 to broadcast to both sockets */
- uv_write_global_mmr64(pnode, UVH_BAU_DATA_BROADCAST, mmr);
- }
-
- return 0;
-}
-core_initcall(uv_bau_init);
-core_initcall(uv_ptc_init);
diff --git a/arch/x86/kernel/tls.c b/arch/x86/kernel/tls.c
index 6bb7b8579e70..3ffbab0081f4 100644
--- a/arch/x86/kernel/tls.c
+++ b/arch/x86/kernel/tls.c
@@ -1,16 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/user.h>
#include <linux/regset.h>
+#include <linux/syscalls.h>
+#include <linux/nospec.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
#include <asm/desc.h>
-#include <asm/system.h>
#include <asm/ldt.h>
#include <asm/processor.h>
#include <asm/proto.h>
-#include <asm/syscalls.h>
+#include <asm/gsseg.h>
#include "tls.h"
@@ -28,6 +30,58 @@ static int get_free_idx(void)
return -ESRCH;
}
+static bool tls_desc_okay(const struct user_desc *info)
+{
+ /*
+ * For historical reasons (i.e. no one ever documented how any
+ * of the segmentation APIs work), user programs can and do
+ * assume that a struct user_desc that's all zeros except for
+ * entry_number means "no segment at all". This never actually
+ * worked. In fact, up to Linux 3.19, a struct user_desc like
+ * this would create a 16-bit read-write segment with base and
+ * limit both equal to zero.
+ *
+ * That was close enough to "no segment at all" until we
+ * hardened this function to disallow 16-bit TLS segments. Fix
+ * it up by interpreting these zeroed segments the way that they
+ * were almost certainly intended to be interpreted.
+ *
+ * The correct way to ask for "no segment at all" is to specify
+ * a user_desc that satisfies LDT_empty. To keep everything
+ * working, we accept both.
+ *
+ * Note that there's a similar kludge in modify_ldt -- look at
+ * the distinction between modes 1 and 0x11.
+ */
+ if (LDT_empty(info) || LDT_zero(info))
+ return true;
+
+ /*
+ * espfix is required for 16-bit data segments, but espfix
+ * only works for LDT segments.
+ */
+ if (!info->seg_32bit)
+ return false;
+
+ /* Only allow data segments in the TLS array. */
+ if (info->contents > 1)
+ return false;
+
+ /*
+ * Non-present segments with DPL 3 present an interesting attack
+ * surface. The kernel should handle such segments correctly,
+ * but TLS is very difficult to protect in a sandbox, so prevent
+ * such segments from being created.
+ *
+ * If userspace needs to remove a TLS entry, it can still delete
+ * it outright.
+ */
+ if (info->seg_not_present)
+ return false;
+
+ return true;
+}
+
static void set_tls_desc(struct task_struct *p, int idx,
const struct user_desc *info, int n)
{
@@ -41,8 +95,8 @@ static void set_tls_desc(struct task_struct *p, int idx,
cpu = get_cpu();
while (n-- > 0) {
- if (LDT_empty(info))
- desc->a = desc->b = 0;
+ if (LDT_empty(info) || LDT_zero(info))
+ memset(desc, 0, sizeof(*desc));
else
fill_ldt(desc, info);
++info;
@@ -63,10 +117,14 @@ int do_set_thread_area(struct task_struct *p, int idx,
int can_allocate)
{
struct user_desc info;
+ unsigned short __maybe_unused sel, modified_sel;
if (copy_from_user(&info, u_info, sizeof(info)))
return -EFAULT;
+ if (!tls_desc_okay(&info))
+ return -EINVAL;
+
if (idx == -1)
idx = info.entry_number;
@@ -87,14 +145,47 @@ int do_set_thread_area(struct task_struct *p, int idx,
set_tls_desc(p, idx, &info, 1);
+ /*
+ * If DS, ES, FS, or GS points to the modified segment, forcibly
+ * refresh it. Only needed on x86_64 because x86_32 reloads them
+ * on return to user mode.
+ */
+ modified_sel = (idx << 3) | 3;
+
+ if (p == current) {
+#ifdef CONFIG_X86_64
+ savesegment(ds, sel);
+ if (sel == modified_sel)
+ loadsegment(ds, sel);
+
+ savesegment(es, sel);
+ if (sel == modified_sel)
+ loadsegment(es, sel);
+
+ savesegment(fs, sel);
+ if (sel == modified_sel)
+ loadsegment(fs, sel);
+#endif
+
+ savesegment(gs, sel);
+ if (sel == modified_sel)
+ load_gs_index(sel);
+ } else {
+#ifdef CONFIG_X86_64
+ if (p->thread.fsindex == modified_sel)
+ p->thread.fsbase = info.base_addr;
+
+ if (p->thread.gsindex == modified_sel)
+ p->thread.gsbase = info.base_addr;
+#endif
+ }
+
return 0;
}
-asmlinkage int sys_set_thread_area(struct user_desc __user *u_info)
+SYSCALL_DEFINE1(set_thread_area, struct user_desc __user *, u_info)
{
- int ret = do_set_thread_area(current, -1, u_info, 1);
- asmlinkage_protect(1, ret, u_info);
- return ret;
+ return do_set_thread_area(current, -1, u_info, 1);
}
@@ -125,6 +216,7 @@ int do_get_thread_area(struct task_struct *p, int idx,
struct user_desc __user *u_info)
{
struct user_desc info;
+ int index;
if (idx == -1 && get_user(idx, &u_info->entry_number))
return -EFAULT;
@@ -132,19 +224,20 @@ int do_get_thread_area(struct task_struct *p, int idx,
if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX)
return -EINVAL;
- fill_user_desc(&info, idx,
- &p->thread.tls_array[idx - GDT_ENTRY_TLS_MIN]);
+ index = idx - GDT_ENTRY_TLS_MIN;
+ index = array_index_nospec(index,
+ GDT_ENTRY_TLS_MAX - GDT_ENTRY_TLS_MIN + 1);
+
+ fill_user_desc(&info, idx, &p->thread.tls_array[index]);
if (copy_to_user(u_info, &info, sizeof(info)))
return -EFAULT;
return 0;
}
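
array_index_nospec() re-clamps the already-checked index so a mispredicted bounds check cannot speculatively index tls_array out of range (Spectre v1). A C-level sketch of the idea; the kernel's real array_index_mask_nospec() is written in asm precisely so the compiler cannot turn the mask back into a branch:

        /* sketch only: branchless clamp of idx to [0, size) */
        static unsigned long index_nospec_sketch(unsigned long idx,
                                                 unsigned long size)
        {
                unsigned long mask = 0UL - (idx < size); /* ~0UL if in bounds, else 0 */

                return idx & mask;
        }
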
-asmlinkage int sys_get_thread_area(struct user_desc __user *u_info)
+SYSCALL_DEFINE1(get_thread_area, struct user_desc __user *, u_info)
{
- int ret = do_get_thread_area(current, -1, u_info);
- asmlinkage_protect(1, ret, u_info);
- return ret;
+ return do_get_thread_area(current, -1, u_info);
}
int regset_tls_active(struct task_struct *target,
@@ -158,36 +251,16 @@ int regset_tls_active(struct task_struct *target,
}
int regset_tls_get(struct task_struct *target, const struct user_regset *regset,
- unsigned int pos, unsigned int count,
- void *kbuf, void __user *ubuf)
+ struct membuf to)
{
const struct desc_struct *tls;
+ struct user_desc v;
+ int pos;
- if (pos > GDT_ENTRY_TLS_ENTRIES * sizeof(struct user_desc) ||
- (pos % sizeof(struct user_desc)) != 0 ||
- (count % sizeof(struct user_desc)) != 0)
- return -EINVAL;
-
- pos /= sizeof(struct user_desc);
- count /= sizeof(struct user_desc);
-
- tls = &target->thread.tls_array[pos];
-
- if (kbuf) {
- struct user_desc *info = kbuf;
- while (count-- > 0)
- fill_user_desc(info++, GDT_ENTRY_TLS_MIN + pos++,
- tls++);
- } else {
- struct user_desc __user *u_info = ubuf;
- while (count-- > 0) {
- struct user_desc info;
- fill_user_desc(&info, GDT_ENTRY_TLS_MIN + pos++, tls++);
- if (__copy_to_user(u_info++, &info, sizeof(info)))
- return -EFAULT;
- }
+ for (pos = 0, tls = target->thread.tls_array; to.left; pos++, tls++) {
+ fill_user_desc(&v, GDT_ENTRY_TLS_MIN + pos, tls);
+ membuf_write(&to, &v, sizeof(v));
}
-
return 0;
}
@@ -197,8 +270,9 @@ int regset_tls_set(struct task_struct *target, const struct user_regset *regset,
{
struct user_desc infobuf[GDT_ENTRY_TLS_ENTRIES];
const struct user_desc *info;
+ int i;
- if (pos > GDT_ENTRY_TLS_ENTRIES * sizeof(struct user_desc) ||
+ if (pos >= GDT_ENTRY_TLS_ENTRIES * sizeof(struct user_desc) ||
(pos % sizeof(struct user_desc)) != 0 ||
(count % sizeof(struct user_desc)) != 0)
return -EINVAL;
@@ -210,6 +284,10 @@ int regset_tls_set(struct task_struct *target, const struct user_regset *regset,
else
info = infobuf;
+ for (i = 0; i < count / sizeof(struct user_desc); i++)
+ if (!tls_desc_okay(info + i))
+ return -EINVAL;
+
set_tls_desc(target,
GDT_ENTRY_TLS_MIN + (pos / sizeof(struct user_desc)),
info, count / sizeof(struct user_desc));
diff --git a/arch/x86/kernel/tls.h b/arch/x86/kernel/tls.h
index 2f083a2fe216..fc39447a0c1a 100644
--- a/arch/x86/kernel/tls.h
+++ b/arch/x86/kernel/tls.h
@@ -1,12 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
/*
* Internal declarations for x86 TLS implementation functions.
*
* Copyright (C) 2007 Red Hat, Inc. All rights reserved.
*
- * This copyrighted material is made available to anyone wishing to use,
- * modify, copy, or redistribute it subject to the terms and conditions
- * of the GNU General Public License v.2.
- *
* Red Hat Author: Roland McGrath.
*/
@@ -15,7 +12,7 @@
#include <linux/regset.h>
extern user_regset_active_fn regset_tls_active;
-extern user_regset_get_fn regset_tls_get;
+extern user_regset_get2_fn regset_tls_get;
extern user_regset_set_fn regset_tls_set;
#endif /* _ARCH_X86_KERNEL_TLS_H */
diff --git a/arch/x86/kernel/topology.c b/arch/x86/kernel/topology.c
deleted file mode 100644
index 7e4515957a1c..000000000000
--- a/arch/x86/kernel/topology.c
+++ /dev/null
@@ -1,82 +0,0 @@
-/*
- * Populate sysfs with topology information
- *
- * Written by: Matthew Dobson, IBM Corporation
- * Original Code: Paul Dorwin, IBM Corporation, Patrick Mochel, OSDL
- *
- * Copyright (C) 2002, IBM Corp.
- *
- * All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT. See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <colpatch@us.ibm.com>
- */
-#include <linux/nodemask.h>
-#include <linux/mmzone.h>
-#include <linux/init.h>
-#include <linux/smp.h>
-#include <asm/cpu.h>
-
-static DEFINE_PER_CPU(struct x86_cpu, cpu_devices);
-
-#ifdef CONFIG_HOTPLUG_CPU
-int __ref arch_register_cpu(int num)
-{
- /*
- * CPU0 cannot be offlined due to several
- * restrictions and assumptions in kernel. This basically
- * doesn't add a control file, so one cannot attempt to offline
- * BSP.
- *
- * Also certain PCI quirks require not to enable hotplug control
- * for all CPU's.
- */
- if (num)
- per_cpu(cpu_devices, num).cpu.hotpluggable = 1;
-
- return register_cpu(&per_cpu(cpu_devices, num).cpu, num);
-}
-EXPORT_SYMBOL(arch_register_cpu);
-
-void arch_unregister_cpu(int num)
-{
- unregister_cpu(&per_cpu(cpu_devices, num).cpu);
-}
-EXPORT_SYMBOL(arch_unregister_cpu);
-#else /* CONFIG_HOTPLUG_CPU */
-
-static int __init arch_register_cpu(int num)
-{
- return register_cpu(&per_cpu(cpu_devices, num).cpu, num);
-}
-#endif /* CONFIG_HOTPLUG_CPU */
-
-static int __init topology_init(void)
-{
- int i;
-
-#ifdef CONFIG_NUMA
- for_each_online_node(i)
- register_one_node(i);
-#endif
-
- for_each_present_cpu(i)
- arch_register_cpu(i);
-
- return 0;
-}
-subsys_initcall(topology_init);
diff --git a/arch/x86/kernel/trace.c b/arch/x86/kernel/trace.c
new file mode 100644
index 000000000000..8322e8352777
--- /dev/null
+++ b/arch/x86/kernel/trace.c
@@ -0,0 +1,234 @@
+#include <asm/trace/irq_vectors.h>
+#include <linux/trace.h>
+
+#if defined(CONFIG_OSNOISE_TRACER) && defined(CONFIG_X86_LOCAL_APIC)
+/*
+ * trace_intel_irq_entry - record intel specific IRQ entry
+ */
+static void trace_intel_irq_entry(void *data, int vector)
+{
+ osnoise_trace_irq_entry(vector);
+}
+
+/*
+ * trace_intel_irq_exit - record intel specific IRQ exit
+ */
+static void trace_intel_irq_exit(void *data, int vector)
+{
+ char *vector_desc = (char *) data;
+
+ osnoise_trace_irq_exit(vector, vector_desc);
+}
+
+/*
+ * register_intel_irq_tp - Register intel specific IRQ entry tracepoints
+ */
+int osnoise_arch_register(void)
+{
+ int ret;
+
+ ret = register_trace_local_timer_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_err;
+
+ ret = register_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
+ if (ret)
+ goto out_timer_entry;
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+ ret = register_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_timer_exit;
+
+ ret = register_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
+ if (ret)
+ goto out_thermal_entry;
+#endif /* CONFIG_X86_THERMAL_VECTOR */
+
+#ifdef CONFIG_X86_MCE_AMD
+ ret = register_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_thermal_exit;
+
+ ret = register_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
+ if (ret)
+ goto out_deferred_entry;
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+ ret = register_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_deferred_exit;
+
+ ret = register_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
+ if (ret)
+ goto out_threshold_entry;
+#endif /* CONFIG_X86_MCE_THRESHOLD */
+
+#ifdef CONFIG_SMP
+ ret = register_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_threshold_exit;
+
+ ret = register_trace_call_function_single_exit(trace_intel_irq_exit,
+ "call_function_single");
+ if (ret)
+ goto out_call_function_single_entry;
+
+ ret = register_trace_call_function_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_call_function_single_exit;
+
+ ret = register_trace_call_function_exit(trace_intel_irq_exit, "call_function");
+ if (ret)
+ goto out_call_function_entry;
+
+ ret = register_trace_reschedule_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_call_function_exit;
+
+ ret = register_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
+ if (ret)
+ goto out_reschedule_entry;
+#endif /* CONFIG_SMP */
+
+#ifdef CONFIG_IRQ_WORK
+ ret = register_trace_irq_work_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_reschedule_exit;
+
+ ret = register_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
+ if (ret)
+ goto out_irq_work_entry;
+#endif
+
+ ret = register_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_irq_work_exit;
+
+ ret = register_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
+ if (ret)
+ goto out_x86_ipi_entry;
+
+ ret = register_trace_error_apic_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_x86_ipi_exit;
+
+ ret = register_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
+ if (ret)
+ goto out_error_apic_entry;
+
+ ret = register_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
+ if (ret)
+ goto out_error_apic_exit;
+
+ ret = register_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
+ if (ret)
+ goto out_spurious_apic_entry;
+
+ return 0;
+
+out_spurious_apic_entry:
+ unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
+out_error_apic_exit:
+ unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
+out_error_apic_entry:
+ unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
+out_x86_ipi_exit:
+ unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
+out_x86_ipi_entry:
+ unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
+out_irq_work_exit:
+
+#ifdef CONFIG_IRQ_WORK
+ unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
+out_irq_work_entry:
+ unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
+out_reschedule_exit:
+#endif
+
+#ifdef CONFIG_SMP
+ unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
+out_reschedule_entry:
+ unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
+out_call_function_exit:
+ unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
+out_call_function_entry:
+ unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
+out_call_function_single_exit:
+ unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
+out_call_function_single_entry:
+ unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
+out_threshold_exit:
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+ unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
+out_threshold_entry:
+ unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
+out_deferred_exit:
+#endif
+
+#ifdef CONFIG_X86_MCE_AMD
+ unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
+out_deferred_entry:
+ unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
+out_thermal_exit:
+#endif /* CONFIG_X86_MCE_AMD */
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+ unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
+out_thermal_entry:
+ unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
+out_timer_exit:
+#endif /* CONFIG_X86_THERMAL_VECTOR */
+
+ unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
+out_timer_entry:
+ unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
+out_err:
+ return -EINVAL;
+}
+
+void osnoise_arch_unregister(void)
+{
+ unregister_trace_spurious_apic_exit(trace_intel_irq_exit, "spurious_apic");
+ unregister_trace_spurious_apic_entry(trace_intel_irq_entry, NULL);
+ unregister_trace_error_apic_exit(trace_intel_irq_exit, "error_apic");
+ unregister_trace_error_apic_entry(trace_intel_irq_entry, NULL);
+ unregister_trace_x86_platform_ipi_exit(trace_intel_irq_exit, "x86_platform_ipi");
+ unregister_trace_x86_platform_ipi_entry(trace_intel_irq_entry, NULL);
+
+#ifdef CONFIG_IRQ_WORK
+ unregister_trace_irq_work_exit(trace_intel_irq_exit, "irq_work");
+ unregister_trace_irq_work_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_SMP
+ unregister_trace_reschedule_exit(trace_intel_irq_exit, "reschedule");
+ unregister_trace_reschedule_entry(trace_intel_irq_entry, NULL);
+ unregister_trace_call_function_exit(trace_intel_irq_exit, "call_function");
+ unregister_trace_call_function_entry(trace_intel_irq_entry, NULL);
+ unregister_trace_call_function_single_exit(trace_intel_irq_exit, "call_function_single");
+ unregister_trace_call_function_single_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+ unregister_trace_threshold_apic_exit(trace_intel_irq_exit, "threshold_apic");
+ unregister_trace_threshold_apic_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_X86_MCE_AMD
+ unregister_trace_deferred_error_apic_exit(trace_intel_irq_exit, "deferred_error");
+ unregister_trace_deferred_error_apic_entry(trace_intel_irq_entry, NULL);
+#endif
+
+#ifdef CONFIG_X86_THERMAL_VECTOR
+ unregister_trace_thermal_apic_exit(trace_intel_irq_exit, "thermal_apic");
+ unregister_trace_thermal_apic_entry(trace_intel_irq_entry, NULL);
+#endif /* CONFIG_X86_THERMAL_VECTOR */
+
+ unregister_trace_local_timer_exit(trace_intel_irq_exit, "local_timer");
+ unregister_trace_local_timer_entry(trace_intel_irq_entry, NULL);
+}
+#endif /* CONFIG_OSNOISE_TRACER && CONFIG_X86_LOCAL_APIC */
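
The long label chain in osnoise_arch_register() is the standard kernel register/unwind pattern: hooks are registered in order and, on the first failure, execution falls through cascading labels that undo them in reverse. Reduced to two hypothetical hooks (register_a/register_b are illustrative names, not from this file):

        int register_both(void)
        {
                int ret;

                ret = register_a();
                if (ret)
                        goto out_err;

                ret = register_b();
                if (ret)
                        goto out_a;     /* b failed: undo a */

                return 0;

        out_a:
                unregister_a();
        out_err:
                return -EINVAL;
        }
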
diff --git a/arch/x86/kernel/trace_clock.c b/arch/x86/kernel/trace_clock.c
new file mode 100644
index 000000000000..708d61743d15
--- /dev/null
+++ b/arch/x86/kernel/trace_clock.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * X86 trace clocks
+ */
+#include <asm/trace_clock.h>
+#include <asm/barrier.h>
+#include <asm/tsc.h>
+
+/*
+ * trace_clock_x86_tsc(): A clock that is just the cycle counter.
+ *
+ * Unlike the other clocks, this is not in nanoseconds.
+ */
+u64 notrace trace_clock_x86_tsc(void)
+{
+ return rdtsc_ordered();
+}
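
rdtsc_ordered() exists because a plain RDTSC may execute speculatively ahead of earlier instructions; the kernel pairs the read with an LFENCE or MFENCE selected via alternatives. A userspace sketch of the same pairing, assuming a compiler that provides <x86intrin.h>:

        #include <stdint.h>
        #include <x86intrin.h>

        static inline uint64_t tsc_ordered(void)
        {
                _mm_lfence();           /* order the read after earlier instructions */
                return __rdtsc();       /* raw cycle count; not nanoseconds */
        }
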
diff --git a/arch/x86/kernel/trampoline.c b/arch/x86/kernel/trampoline.c
deleted file mode 100644
index c652ef62742d..000000000000
--- a/arch/x86/kernel/trampoline.c
+++ /dev/null
@@ -1,39 +0,0 @@
-#include <linux/io.h>
-
-#include <asm/trampoline.h>
-#include <asm/e820.h>
-
-#if defined(CONFIG_X86_64) && defined(CONFIG_ACPI_SLEEP)
-#define __trampinit
-#define __trampinitdata
-#else
-#define __trampinit __cpuinit
-#define __trampinitdata __cpuinitdata
-#endif
-
-/* ready for x86_64 and x86 */
-unsigned char *__trampinitdata trampoline_base;
-
-void __init reserve_trampoline_memory(void)
-{
- unsigned long mem;
-
- /* Has to be in very low memory so we can execute real-mode AP code. */
- mem = find_e820_area(0, 1<<20, TRAMPOLINE_SIZE, PAGE_SIZE);
- if (mem == -1L)
- panic("Cannot allocate trampoline\n");
-
- trampoline_base = __va(mem);
- reserve_early(mem, mem + TRAMPOLINE_SIZE, "TRAMPOLINE");
-}
-
-/*
- * Currently trivial. Write the real->protected mode
- * bootstrap into the page concerned. The caller
- * has made sure it's suitably aligned.
- */
-unsigned long __trampinit setup_trampoline(void)
-{
- memcpy(trampoline_base, trampoline_data, TRAMPOLINE_SIZE);
- return virt_to_phys(trampoline_base);
-}
diff --git a/arch/x86/kernel/trampoline_32.S b/arch/x86/kernel/trampoline_32.S
deleted file mode 100644
index 8508237e8e43..000000000000
--- a/arch/x86/kernel/trampoline_32.S
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- *
- * Trampoline.S Derived from Setup.S by Linus Torvalds
- *
- * 4 Jan 1997 Michael Chastain: changed to gnu as.
- *
- * This is only used for booting secondary CPUs in SMP machine
- *
- * Entry: CS:IP point to the start of our code, we are
- * in real mode with no stack, but the rest of the
- * trampoline page to make our stack and everything else
- * is a mystery.
- *
- * We jump into arch/x86/kernel/head_32.S.
- *
- * On entry to trampoline_data, the processor is in real mode
- * with 16-bit addressing and 16-bit data. CS has some value
- * and IP is zero. Thus, data addresses need to be absolute
- * (no relocation) and are taken with regard to r_base.
- *
- * If you work on this file, check the object module with
- * objdump --reloc to make sure there are no relocation
- * entries except for:
- *
- * TYPE VALUE
- * R_386_32 startup_32_smp
- * R_386_32 boot_gdt
- */
-
-#include <linux/linkage.h>
-#include <linux/init.h>
-#include <asm/segment.h>
-#include <asm/page_types.h>
-
-/* We can free up trampoline after bootup if cpu hotplug is not supported. */
-__CPUINITRODATA
-.code16
-
-ENTRY(trampoline_data)
-r_base = .
- wbinvd # Needed for NUMA-Q; should be harmless for others
- mov %cs, %ax # Code and data in the same place
- mov %ax, %ds
-
- cli # We should be safe anyway
-
- movl $0xA5A5A5A5, trampoline_data - r_base
- # write marker so master knows we're running
-
- /* GDT tables in non default location kernel can be beyond 16MB and
- * lgdt will not be able to load the address as in real mode default
- * operand size is 16bit. Use lgdtl instead to force operand size
- * to 32 bit.
- */
-
- lidtl boot_idt_descr - r_base # load idt with 0, 0
- lgdtl boot_gdt_descr - r_base # load gdt with whatever is appropriate
-
- xor %ax, %ax
- inc %ax # protected mode (PE) bit
- lmsw %ax # into protected mode
- # flush prefetch and jump to startup_32_smp in arch/i386/kernel/head.S
- ljmpl $__BOOT_CS, $(startup_32_smp-__PAGE_OFFSET)
-
- # These need to be in the same 64K segment as the above;
- # hence we don't use the boot_gdt_descr defined in head.S
-boot_gdt_descr:
- .word __BOOT_DS + 7 # gdt limit
- .long boot_gdt - __PAGE_OFFSET # gdt base
-
-boot_idt_descr:
- .word 0 # idt limit = 0
- .long 0 # idt base = 0L
-
-.globl trampoline_end
-trampoline_end:
diff --git a/arch/x86/kernel/trampoline_64.S b/arch/x86/kernel/trampoline_64.S
deleted file mode 100644
index 3af2dff58b21..000000000000
--- a/arch/x86/kernel/trampoline_64.S
+++ /dev/null
@@ -1,167 +0,0 @@
-/*
- *
- * Trampoline.S Derived from Setup.S by Linus Torvalds
- *
- * 4 Jan 1997 Michael Chastain: changed to gnu as.
- * 15 Sept 2005 Eric Biederman: 64bit PIC support
- *
- * Entry: CS:IP point to the start of our code, we are
- * in real mode with no stack, but the rest of the
- * trampoline page to make our stack and everything else
- * is a mystery.
- *
- * On entry to trampoline_data, the processor is in real mode
- * with 16-bit addressing and 16-bit data. CS has some value
- * and IP is zero. Thus, data addresses need to be absolute
- * (no relocation) and are taken with regard to r_base.
- *
- * With the addition of trampoline_level4_pgt this code can
- * now enter a 64bit kernel that lives at arbitrary 64bit
- * physical addresses.
- *
- * If you work on this file, check the object module with objdump
- * --full-contents --reloc to make sure there are no relocation
- * entries.
- */
-
-#include <linux/linkage.h>
-#include <linux/init.h>
-#include <asm/pgtable_types.h>
-#include <asm/page_types.h>
-#include <asm/msr.h>
-#include <asm/segment.h>
-#include <asm/processor-flags.h>
-
-#ifdef CONFIG_ACPI_SLEEP
-.section .rodata, "a", @progbits
-#else
-/* We can free up the trampoline after bootup if cpu hotplug is not supported. */
-__CPUINITRODATA
-#endif
-.code16
-
-ENTRY(trampoline_data)
-r_base = .
- cli # We should be safe anyway
- wbinvd
- mov %cs, %ax # Code and data in the same place
- mov %ax, %ds
- mov %ax, %es
- mov %ax, %ss
-
-
- movl $0xA5A5A5A5, trampoline_data - r_base
- # write marker so master knows we're running
-
- # Setup stack
- movw $(trampoline_stack_end - r_base), %sp
-
- call verify_cpu # Verify the cpu supports long mode
- testl %eax, %eax # Check for return code
- jnz no_longmode
-
- mov %cs, %ax
- movzx %ax, %esi # Find the 32bit trampoline location
- shll $4, %esi
-
- # Fixup the vectors
- addl %esi, startup_32_vector - r_base
- addl %esi, startup_64_vector - r_base
- addl %esi, tgdt + 2 - r_base # Fixup the gdt pointer
-
- /*
- * GDT tables in non default location kernel can be beyond 16MB and
- * lgdt will not be able to load the address as in real mode default
- * operand size is 16bit. Use lgdtl instead to force operand size
- * to 32 bit.
- */
-
- lidtl tidt - r_base # load idt with 0, 0
- lgdtl tgdt - r_base # load gdt with whatever is appropriate
-
- mov $X86_CR0_PE, %ax # protected mode (PE) bit
- lmsw %ax # into protected mode
-
- # flush prefetch and jump to startup_32
- ljmpl *(startup_32_vector - r_base)
-
- .code32
- .balign 4
-startup_32:
- movl $__KERNEL_DS, %eax # Initialize the %ds segment register
- movl %eax, %ds
-
- movl $X86_CR4_PAE, %eax
- movl %eax, %cr4 # Enable PAE mode
-
- # Setup trampoline 4 level pagetables
- leal (trampoline_level4_pgt - r_base)(%esi), %eax
- movl %eax, %cr3
-
- movl $MSR_EFER, %ecx
- movl $(1 << _EFER_LME), %eax # Enable Long Mode
- xorl %edx, %edx
- wrmsr
-
- # Enable paging and in turn activate Long Mode
- # Enable protected mode
- movl $(X86_CR0_PG | X86_CR0_PE), %eax
- movl %eax, %cr0
-
- /*
- * At this point we're in long mode but in 32bit compatibility mode
- * with EFER.LME = 1, CS.L = 0, CS.D = 1 (and in turn
- * EFER.LMA = 1). Now we want to jump in 64bit mode, to do that we use
- * the new gdt/idt that has __KERNEL_CS with CS.L = 1.
- */
- ljmp *(startup_64_vector - r_base)(%esi)
-
- .code64
- .balign 4
-startup_64:
- # Now jump into the kernel using virtual addresses
- movq $secondary_startup_64, %rax
- jmp *%rax
-
- .code16
-no_longmode:
- hlt
- jmp no_longmode
-#include "verify_cpu_64.S"
-
- # Careful these need to be in the same 64K segment as the above;
-tidt:
- .word 0 # idt limit = 0
- .word 0, 0 # idt base = 0L
-
- # Duplicate the global descriptor table
- # so the kernel can live anywhere
- .balign 4
-tgdt:
- .short tgdt_end - tgdt # gdt limit
- .long tgdt - r_base
- .short 0
- .quad 0x00cf9b000000ffff # __KERNEL32_CS
- .quad 0x00af9b000000ffff # __KERNEL_CS
- .quad 0x00cf93000000ffff # __KERNEL_DS
-tgdt_end:
-
- .balign 4
-startup_32_vector:
- .long startup_32 - r_base
- .word __KERNEL32_CS, 0
-
- .balign 4
-startup_64_vector:
- .long startup_64 - r_base
- .word __KERNEL_CS, 0
-
-trampoline_stack:
- .org 0x1000
-trampoline_stack_end:
-ENTRY(trampoline_level4_pgt)
- .quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .fill 510,8,0
- .quad level3_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
-
-ENTRY(trampoline_end)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 725ef4d17cd5..bcf1dedc1d00 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -9,496 +9,1168 @@
/*
* Handle hardware traps and faults.
*/
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/context_tracking.h>
#include <linux/interrupt.h>
#include <linux/kallsyms.h>
+#include <linux/kmsan.h>
#include <linux/spinlock.h>
#include <linux/kprobes.h>
#include <linux/uaccess.h>
#include <linux/kdebug.h>
#include <linux/kgdb.h>
#include <linux/kernel.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/ptrace.h>
+#include <linux/uprobes.h>
#include <linux/string.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/kexec.h>
#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/static_call.h>
#include <linux/timer.h>
#include <linux/init.h>
#include <linux/bug.h>
#include <linux/nmi.h>
#include <linux/mm.h>
#include <linux/smp.h>
+#include <linux/cpu.h>
#include <linux/io.h>
+#include <linux/hardirq.h>
+#include <linux/atomic.h>
+#include <linux/iommu.h>
+#include <linux/ubsan.h>
-#ifdef CONFIG_EISA
-#include <linux/ioport.h>
-#include <linux/eisa.h>
-#endif
-
-#ifdef CONFIG_MCA
-#include <linux/mca.h>
-#endif
-
-#if defined(CONFIG_EDAC)
-#include <linux/edac.h>
-#endif
-
-#include <asm/kmemcheck.h>
#include <asm/stacktrace.h>
#include <asm/processor.h>
#include <asm/debugreg.h>
-#include <asm/atomic.h>
-#include <asm/system.h>
+#include <asm/realmode.h>
+#include <asm/text-patching.h>
+#include <asm/ftrace.h>
#include <asm/traps.h>
#include <asm/desc.h>
-#include <asm/i387.h>
+#include <asm/fred.h>
+#include <asm/fpu/api.h>
+#include <asm/cpu.h>
+#include <asm/cpu_entry_area.h>
#include <asm/mce.h>
-
+#include <asm/fixmap.h>
#include <asm/mach_traps.h>
+#include <asm/alternative.h>
+#include <asm/fpu/xstate.h>
+#include <asm/vm86.h>
+#include <asm/umip.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
+#include <asm/vdso.h>
+#include <asm/tdx.h>
+#include <asm/cfi.h>
+#include <asm/msr.h>
#ifdef CONFIG_X86_64
#include <asm/x86_init.h>
-#include <asm/pgalloc.h>
-#include <asm/proto.h>
#else
#include <asm/processor-flags.h>
#include <asm/setup.h>
+#endif
-asmlinkage int system_call(void);
+#include <asm/proto.h>
-/* Do we ignore FPU interrupts ? */
-char ignore_fpu_irq;
+DECLARE_BITMAP(system_vectors, NR_VECTORS);
+
+__always_inline int is_valid_bugaddr(unsigned long addr)
+{
+ if (addr < TASK_SIZE_MAX)
+ return 0;
+
+ /*
+ * We got #UD, if the text isn't readable we'd have gotten
+ * a different exception.
+ */
+ return *(unsigned short *)addr == INSN_UD2;
+}
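
The unsigned-short compare works because UD2 encodes as the bytes 0f 0b; loaded little-endian that is 0x0b0f, which is what INSN_UD2 expands to. Byte-wise, the same test is:

        /* sketch: does the text at ip start with UD2 (0f 0b)? */
        static int starts_with_ud2(const unsigned char *ip)
        {
                return ip[0] == 0x0f && ip[1] == 0x0b;
        }
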
/*
- * The IDT has to be page-aligned to simplify the Pentium
- * F0 0F bug workaround.
+ * Check for UD1 or UD2, accounting for Address Size Override Prefixes.
+ * If it's a UD1, further decode to determine its use:
+ *
+ * FineIBT: d6 udb
+ * FineIBT: f0 75 f9 lock jne . - 6
+ * UBSan{0}: 67 0f b9 00 ud1 (%eax),%eax
+ * UBSan{10}: 67 0f b9 40 10 ud1 0x10(%eax),%eax
+ * static_call: 0f b9 cc ud1 %esp,%ecx
+ * __WARN_trap: 67 48 0f b9 3a ud1 (%edx),%reg
+ *
+ * Notable, since __WARN_trap can use all registers, the distinction between
+ * UD1 users is through R/M.
*/
-gate_desc idt_table[NR_VECTORS] __page_aligned_data = { { { { 0, 0 } } }, };
-#endif
+__always_inline int decode_bug(unsigned long addr, s32 *imm, int *len)
+{
+ unsigned long start = addr;
+ u8 v, reg, rm, rex = 0;
+ int type = BUG_UD1;
+ bool lock = false;
+
+ if (addr < TASK_SIZE_MAX)
+ return BUG_NONE;
+
+ for (;;) {
+ v = *(u8 *)(addr++);
+ if (v == INSN_ASOP)
+ continue;
+
+ if (v == INSN_LOCK) {
+ lock = true;
+ continue;
+ }
-DECLARE_BITMAP(used_vectors, NR_VECTORS);
-EXPORT_SYMBOL_GPL(used_vectors);
+ if ((v & 0xf0) == 0x40) {
+ rex = v;
+ continue;
+ }
-static int ignore_nmis;
+ break;
+ }
-static inline void conditional_sti(struct pt_regs *regs)
-{
- if (regs->flags & X86_EFLAGS_IF)
- local_irq_enable();
-}
+ switch (v) {
+ case 0x70 ... 0x7f: /* Jcc.d8 */
+ addr += 1; /* d8 */
+ *len = addr - start;
+ WARN_ON_ONCE(!lock);
+ return BUG_LOCK;
-static inline void preempt_conditional_sti(struct pt_regs *regs)
-{
- inc_preempt_count();
- if (regs->flags & X86_EFLAGS_IF)
- local_irq_enable();
+ case 0xd6:
+ *len = addr - start;
+ return BUG_UDB;
+
+ case OPCODE_ESCAPE:
+ break;
+
+ default:
+ return BUG_NONE;
+ }
+
+ v = *(u8 *)(addr++);
+ if (v == SECOND_BYTE_OPCODE_UD2) {
+ *len = addr - start;
+ return BUG_UD2;
+ }
+
+ if (v != SECOND_BYTE_OPCODE_UD1)
+ return BUG_NONE;
+
+ *imm = 0;
+ v = *(u8 *)(addr++); /* ModRM */
+
+ if (X86_MODRM_MOD(v) != 3 && X86_MODRM_RM(v) == 4)
+ addr++; /* SIB */
+
+ reg = X86_MODRM_REG(v) + 8*!!X86_REX_R(rex);
+ rm = X86_MODRM_RM(v) + 8*!!X86_REX_B(rex);
+
+ /* Decode immediate, if present */
+ switch (X86_MODRM_MOD(v)) {
+ case 0: if (X86_MODRM_RM(v) == 5)
+ addr += 4; /* RIP + disp32 */
+
+ if (rm == 0) /* (%eax) */
+ type = BUG_UD1_UBSAN;
+
+ if (rm == 2) { /* (%edx) */
+ *imm = reg;
+ type = BUG_UD1_WARN;
+ }
+ break;
+
+ case 1: *imm = *(s8 *)addr;
+ addr += 1;
+ if (rm == 0) /* (%eax) */
+ type = BUG_UD1_UBSAN;
+ break;
+
+ case 2: *imm = *(s32 *)addr;
+ addr += 4;
+ if (rm == 0) /* (%eax) */
+ type = BUG_UD1_UBSAN;
+ break;
+
+ case 3: break;
+ }
+
+ /* record instruction length */
+ *len = addr - start;
+
+ return type;
}
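
For reference while reading decode_bug(): a ModRM byte is laid out mod[7:6] reg[5:3] rm[2:0], and REX.R/REX.B each contribute a fourth bit to reg and rm. A sketch of the extraction the X86_MODRM_* helpers perform:

        /* sketch of ModRM field extraction (layout per the SDM) */
        #define MOD(b)  (((b) >> 6) & 0x3)      /* addressing mode */
        #define REG(b)  (((b) >> 3) & 0x7)      /* register / opcode extension */
        #define RM(b)   ((b) & 0x7)             /* register-or-memory operand */
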
-static inline void conditional_cli(struct pt_regs *regs)
+static inline unsigned long pt_regs_val(struct pt_regs *regs, int nr)
{
- if (regs->flags & X86_EFLAGS_IF)
- local_irq_disable();
+ int offset = pt_regs_offset(regs, nr);
+ if (WARN_ON_ONCE(offset < 0))
+ return 0;
+ return *((unsigned long *)((void *)regs + offset));
}
-static inline void preempt_conditional_cli(struct pt_regs *regs)
+#ifdef HAVE_ARCH_BUG_FORMAT_ARGS
+DEFINE_STATIC_CALL(WARN_trap, __WARN_trap);
+EXPORT_STATIC_CALL_TRAMP(WARN_trap);
+
+/*
+ * Create a va_list from an exception context.
+ */
+void *__warn_args(struct arch_va_list *args, struct pt_regs *regs)
{
- if (regs->flags & X86_EFLAGS_IF)
- local_irq_disable();
- dec_preempt_count();
+ /*
+ * Register save area; populate with function call argument registers
+ */
+ args->regs[0] = regs->di;
+ args->regs[1] = regs->si;
+ args->regs[2] = regs->dx;
+ args->regs[3] = regs->cx;
+ args->regs[4] = regs->r8;
+ args->regs[5] = regs->r9;
+
+ /*
+ * From the ABI document:
+ *
+ * @gp_offset - the element holds the offset in bytes from
+ * reg_save_area to the place where the next available general purpose
+ * argument register is saved. In case all argument registers have
+ * been exhausted, it is set to the value 48 (6*8).
+ *
+ * @fp_offset - the element holds the offset in bytes from
+ * reg_save_area to the place where the next available floating point
+ * argument is saved. In case all argument registers have been
+ * exhausted, it is set to the value 176 (6*8 + 8*16)
+ *
+ * @overflow_arg_area - this pointer is used to fetch arguments passed
+ * on the stack. It is initialized with the address of the first
+ * argument passed on the stack, if any, and then always updated to
+ * point to the start of the next argument on the stack.
+ *
+ * @reg_save_area - the element points to the start of the register
+ * save area.
+ *
+ * Notably the vararg starts with the second argument and there are no
+ * floating point arguments in the kernel.
+ */
+ args->args.gp_offset = 1*8;
+ args->args.fp_offset = 6*8 + 8*16;
+ args->args.reg_save_area = &args->regs;
+ args->args.overflow_arg_area = (void *)regs->sp;
+
+ /*
+ * If the exception came from __WARN_trap, there is a return
+ * address on the stack, skip that. This is why any __WARN_trap()
+ * caller must inhibit tail-call optimization.
+ */
+ if ((void *)regs->ip == &__WARN_trap)
+ args->args.overflow_arg_area += 8;
+
+ return &args->args;
}
+#endif /* HAVE_ARCH_BUG_FORMAT_ARGS */
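
The four fields populated above mirror the SysV x86-64 va_list element (glibc's __va_list_tag). As a sketch, the record being faked up looks like:

        /* sketch: SysV x86-64 va_list element (cf. glibc __va_list_tag) */
        struct va_list_tag_sketch {
                unsigned int gp_offset;         /* bytes into reg_save_area; 48 once GP regs exhausted */
                unsigned int fp_offset;         /* 176 once FP regs exhausted */
                void *overflow_arg_area;        /* next stack-passed argument */
                void *reg_save_area;            /* rdi, rsi, rdx, rcx, r8, r9 (then XMMs) */
        };
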
-static void __kprobes
-do_trap(int trapnr, int signr, char *str, struct pt_regs *regs,
- long error_code, siginfo_t *info)
+static nokprobe_inline int
+do_trap_no_signal(struct task_struct *tsk, int trapnr, const char *str,
+ struct pt_regs *regs, long error_code)
{
- struct task_struct *tsk = current;
-
-#ifdef CONFIG_X86_32
- if (regs->flags & X86_VM_MASK) {
+ if (v8086_mode(regs)) {
/*
- * traps 0, 1, 3, 4, and 5 should be forwarded to vm86.
+ * Traps 0, 1, 3, 4, and 5 should be forwarded to vm86.
* On nmi (interrupt 2), do_trap should not be called.
*/
- if (trapnr < 6)
- goto vm86_trap;
- goto trap_signal;
- }
-#endif
+ if (trapnr < X86_TRAP_UD) {
+ if (!handle_vm86_trap((struct kernel_vm86_regs *) regs,
+ error_code, trapnr))
+ return 0;
+ }
+ } else if (!user_mode(regs)) {
+ if (fixup_exception(regs, trapnr, error_code, 0))
+ return 0;
- if (!user_mode(regs))
- goto kernel_trap;
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = trapnr;
+ die(str, regs, error_code);
+ } else {
+ if (fixup_vdso_exception(regs, trapnr, error_code, 0))
+ return 0;
+ }
-#ifdef CONFIG_X86_32
-trap_signal:
-#endif
/*
- * We want error_code and trap_no set for userspace faults and
+ * We want error_code and trap_nr set for userspace faults and
* kernelspace faults which result in die(), but not
* kernelspace faults which are fixed up. die() gives the
* process no chance to handle the signal and notice the
* kernel fault information, so that won't result in polluting
* the information about previously queued, but not yet
- * delivered, faults. See also do_general_protection below.
+ * delivered, faults. See also exc_general_protection below.
*/
tsk->thread.error_code = error_code;
- tsk->thread.trap_no = trapnr;
+ tsk->thread.trap_nr = trapnr;
-#ifdef CONFIG_X86_64
+ return -1;
+}
+
+static void show_signal(struct task_struct *tsk, int signr,
+ const char *type, const char *desc,
+ struct pt_regs *regs, long error_code)
+{
if (show_unhandled_signals && unhandled_signal(tsk, signr) &&
printk_ratelimit()) {
- printk(KERN_INFO
- "%s[%d] trap %s ip:%lx sp:%lx error:%lx",
- tsk->comm, tsk->pid, str,
- regs->ip, regs->sp, error_code);
- print_vma_addr(" in ", regs->ip);
- printk("\n");
+ pr_info("%s[%d] %s%s ip:%lx sp:%lx error:%lx",
+ tsk->comm, task_pid_nr(tsk), type, desc,
+ regs->ip, regs->sp, error_code);
+ print_vma_addr(KERN_CONT " in ", regs->ip);
+ pr_cont("\n");
}
-#endif
+}
+
+static void
+do_trap(int trapnr, int signr, char *str, struct pt_regs *regs,
+ long error_code, int sicode, void __user *addr)
+{
+ struct task_struct *tsk = current;
- if (info)
- force_sig_info(signr, info, tsk);
+ if (!do_trap_no_signal(tsk, trapnr, str, regs, error_code))
+ return;
+
+ show_signal(tsk, signr, "trap ", str, regs, error_code);
+
+ if (!sicode)
+ force_sig(signr);
else
- force_sig(signr, tsk);
- return;
+ force_sig_fault(signr, sicode, addr);
+}
+NOKPROBE_SYMBOL(do_trap);
-kernel_trap:
- if (!fixup_exception(regs)) {
- tsk->thread.error_code = error_code;
- tsk->thread.trap_no = trapnr;
- die(str, regs, error_code);
+static void do_error_trap(struct pt_regs *regs, long error_code, char *str,
+ unsigned long trapnr, int signr, int sicode, void __user *addr)
+{
+ RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
+
+ if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) !=
+ NOTIFY_STOP) {
+ cond_local_irq_enable(regs);
+ do_trap(trapnr, signr, str, regs, error_code, sicode, addr);
+ cond_local_irq_disable(regs);
}
- return;
+}
-#ifdef CONFIG_X86_32
-vm86_trap:
- if (handle_vm86_trap((struct kernel_vm86_regs *) regs,
- error_code, trapnr))
- goto trap_signal;
- return;
-#endif
+/*
+ * Posix requires to provide the address of the faulting instruction for
+ * SIGILL (#UD) and SIGFPE (#DE) in the si_addr member of siginfo_t.
+ *
+ * This address is usually regs->ip, but when an uprobe moved the code out
+ * of line then regs->ip points to the XOL code which would confuse
+ * anything which analyzes the fault address vs. the unmodified binary. If
+ * a trap happened in XOL code then uprobe maps regs->ip back to the
+ * original instruction address.
+ */
+static __always_inline void __user *error_get_trap_addr(struct pt_regs *regs)
+{
+ return (void __user *)uprobe_get_trap_addr(regs);
}
-#define DO_ERROR(trapnr, signr, str, name) \
-dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
-{ \
- if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) \
- == NOTIFY_STOP) \
- return; \
- conditional_sti(regs); \
- do_trap(trapnr, signr, str, regs, error_code, NULL); \
-}
-
-#define DO_ERROR_INFO(trapnr, signr, str, name, sicode, siaddr) \
-dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
-{ \
- siginfo_t info; \
- info.si_signo = signr; \
- info.si_errno = 0; \
- info.si_code = sicode; \
- info.si_addr = (void __user *)siaddr; \
- if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr) \
- == NOTIFY_STOP) \
- return; \
- conditional_sti(regs); \
- do_trap(trapnr, signr, str, regs, error_code, &info); \
-}
-
-DO_ERROR_INFO(0, SIGFPE, "divide error", divide_error, FPE_INTDIV, regs->ip)
-DO_ERROR(4, SIGSEGV, "overflow", overflow)
-DO_ERROR(5, SIGSEGV, "bounds", bounds)
-DO_ERROR_INFO(6, SIGILL, "invalid opcode", invalid_op, ILL_ILLOPN, regs->ip)
-DO_ERROR(9, SIGFPE, "coprocessor segment overrun", coprocessor_segment_overrun)
-DO_ERROR(10, SIGSEGV, "invalid TSS", invalid_TSS)
-DO_ERROR(11, SIGBUS, "segment not present", segment_not_present)
-#ifdef CONFIG_X86_32
-DO_ERROR(12, SIGBUS, "stack segment", stack_segment)
-#endif
-DO_ERROR_INFO(17, SIGBUS, "alignment check", alignment_check, BUS_ADRALN, 0)
+DEFINE_IDTENTRY(exc_divide_error)
+{
+ do_error_trap(regs, 0, "divide error", X86_TRAP_DE, SIGFPE,
+ FPE_INTDIV, error_get_trap_addr(regs));
+}
-#ifdef CONFIG_X86_64
-/* Runs on IST stack */
-dotraplinkage void do_stack_segment(struct pt_regs *regs, long error_code)
+DEFINE_IDTENTRY(exc_overflow)
{
- if (notify_die(DIE_TRAP, "stack segment", regs, error_code,
- 12, SIGBUS) == NOTIFY_STOP)
- return;
- preempt_conditional_sti(regs);
- do_trap(12, SIGBUS, "stack segment", regs, error_code, NULL);
- preempt_conditional_cli(regs);
+ do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
}
-dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
+#ifdef CONFIG_X86_F00F_BUG
+void handle_invalid_op(struct pt_regs *regs)
+#else
+static inline void handle_invalid_op(struct pt_regs *regs)
+#endif
{
- static const char str[] = "double fault";
- struct task_struct *tsk = current;
+ do_error_trap(regs, 0, "invalid opcode", X86_TRAP_UD, SIGILL,
+ ILL_ILLOPN, error_get_trap_addr(regs));
+}
- /* Return not checked because a double fault cannot be ignored */
- notify_die(DIE_TRAP, str, regs, error_code, 8, SIGSEGV);
+static noinstr bool handle_bug(struct pt_regs *regs)
+{
+ unsigned long addr = regs->ip;
+ bool handled = false;
+ int ud_type, ud_len;
+ s32 ud_imm;
- tsk->thread.error_code = error_code;
- tsk->thread.trap_no = 8;
+ ud_type = decode_bug(addr, &ud_imm, &ud_len);
+ if (ud_type == BUG_NONE)
+ return handled;
/*
- * This is always a kernel trap and never fixable (and thus must
- * never return).
+ * All lies, just get the WARN/BUG out.
*/
- for (;;)
- die(str, regs, error_code);
+ instrumentation_begin();
+ /*
+ * Normally @regs are unpoisoned by irqentry_enter(), but handle_bug()
+ * is a rare case that uses @regs without passing them to
+ * irqentry_enter().
+ */
+ kmsan_unpoison_entry_regs(regs);
+ /*
+ * Since we're emulating a CALL with exceptions, restore the interrupt
+ * state to what it was at the exception site.
+ */
+ if (regs->flags & X86_EFLAGS_IF)
+ raw_local_irq_enable();
+
+ switch (ud_type) {
+ case BUG_UD1_WARN:
+ if (report_bug_entry((void *)pt_regs_val(regs, ud_imm), regs) == BUG_TRAP_TYPE_WARN)
+ handled = true;
+ break;
+
+ case BUG_UD2:
+ if (report_bug(regs->ip, regs) == BUG_TRAP_TYPE_WARN) {
+ handled = true;
+ break;
+ }
+ fallthrough;
+
+ case BUG_UDB:
+ case BUG_LOCK:
+ if (handle_cfi_failure(regs) == BUG_TRAP_TYPE_WARN) {
+ handled = true;
+ break;
+ }
+ break;
+
+ case BUG_UD1_UBSAN:
+ if (IS_ENABLED(CONFIG_UBSAN_TRAP)) {
+ pr_crit("%s at %pS\n",
+ report_ubsan_failure(ud_imm),
+ (void *)regs->ip);
+ }
+ break;
+
+ default:
+ break;
+ }
+
+ /*
+ * When continuing, and regs->ip hasn't changed, move it to the next
+ * instruction. When not continuing execution, restore the instruction
+ * pointer.
+ */
+ if (handled) {
+ if (regs->ip == addr)
+ regs->ip += ud_len;
+ } else {
+ regs->ip = addr;
+ }
+
+ if (regs->flags & X86_EFLAGS_IF)
+ raw_local_irq_disable();
+ instrumentation_end();
+
+ return handled;
}
-#endif
-dotraplinkage void __kprobes
-do_general_protection(struct pt_regs *regs, long error_code)
+DEFINE_IDTENTRY_RAW(exc_invalid_op)
{
- struct task_struct *tsk;
+ irqentry_state_t state;
- conditional_sti(regs);
+ /*
+ * We use UD2 as a short encoding for 'CALL __WARN', as such
+ * handle it before exception entry to avoid recursive WARN
+ * in case exception entry is the one triggering WARNs.
+ */
+ if (!user_mode(regs) && handle_bug(regs))
+ return;
-#ifdef CONFIG_X86_32
- if (regs->flags & X86_VM_MASK)
- goto gp_in_vm86;
-#endif
+ state = irqentry_enter(regs);
+ instrumentation_begin();
+ handle_invalid_op(regs);
+ instrumentation_end();
+ irqentry_exit(regs, state);
+}
- tsk = current;
- if (!user_mode(regs))
- goto gp_in_kernel;
+DEFINE_IDTENTRY(exc_coproc_segment_overrun)
+{
+ do_error_trap(regs, 0, "coprocessor segment overrun",
+ X86_TRAP_OLD_MF, SIGFPE, 0, NULL);
+}
- tsk->thread.error_code = error_code;
- tsk->thread.trap_no = 13;
+DEFINE_IDTENTRY_ERRORCODE(exc_invalid_tss)
+{
+ do_error_trap(regs, error_code, "invalid TSS", X86_TRAP_TS, SIGSEGV,
+ 0, NULL);
+}
- if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
- printk_ratelimit()) {
- printk(KERN_INFO
- "%s[%d] general protection ip:%lx sp:%lx error:%lx",
- tsk->comm, task_pid_nr(tsk),
- regs->ip, regs->sp, error_code);
- print_vma_addr(" in ", regs->ip);
- printk("\n");
- }
+DEFINE_IDTENTRY_ERRORCODE(exc_segment_not_present)
+{
+ do_error_trap(regs, error_code, "segment not present", X86_TRAP_NP,
+ SIGBUS, 0, NULL);
+}
- force_sig(SIGSEGV, tsk);
- return;
+DEFINE_IDTENTRY_ERRORCODE(exc_stack_segment)
+{
+ do_error_trap(regs, error_code, "stack segment", X86_TRAP_SS, SIGBUS,
+ 0, NULL);
+}
-#ifdef CONFIG_X86_32
-gp_in_vm86:
- local_irq_enable();
- handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
- return;
-#endif
+DEFINE_IDTENTRY_ERRORCODE(exc_alignment_check)
+{
+ char *str = "alignment check";
-gp_in_kernel:
- if (fixup_exception(regs))
+ if (notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_AC, SIGBUS) == NOTIFY_STOP)
return;
- tsk->thread.error_code = error_code;
- tsk->thread.trap_no = 13;
- if (notify_die(DIE_GPF, "general protection fault", regs,
- error_code, 13, SIGSEGV) == NOTIFY_STOP)
- return;
- die("general protection fault", regs, error_code);
+ if (!user_mode(regs))
+ die("Split lock detected\n", regs, error_code);
+
+ local_irq_enable();
+
+ if (handle_user_split_lock(regs, error_code))
+ goto out;
+
+ do_trap(X86_TRAP_AC, SIGBUS, "alignment check", regs,
+ error_code, BUS_ADRALN, NULL);
+
+out:
+ local_irq_disable();
}
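
With split lock detection armed, the #AC handled above can be provoked from userspace by a lock-prefixed read-modify-write that straddles a cache line. A deliberately misaligned sketch (undefined behavior per the C standard, shown only to illustrate what trips the trap):

        #include <stdint.h>

        _Alignas(64) static unsigned char buf[128];

        int main(void)
        {
                /* 4-byte atomic crossing the 64-byte line boundary at buf + 64 */
                volatile uint32_t *p = (volatile uint32_t *)(buf + 62);

                __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST);     /* locked, split */
                return 0;
        }
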
-static notrace __kprobes void
-mem_parity_error(unsigned char reason, struct pt_regs *regs)
+#ifdef CONFIG_VMAP_STACK
+__visible void __noreturn handle_stack_overflow(struct pt_regs *regs,
+ unsigned long fault_address,
+ struct stack_info *info)
{
- printk(KERN_EMERG
- "Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
- reason, smp_processor_id());
+ const char *name = stack_type_name(info->type);
+
+ printk(KERN_EMERG "BUG: %s stack guard page was hit at %p (stack is %p..%p)\n",
+ name, (void *)fault_address, info->begin, info->end);
- printk(KERN_EMERG
- "You have some hardware problem, likely on the PCI bus.\n");
+ die("stack guard page", regs, 0);
+
+ /* Be absolutely certain we don't return. */
+ panic("%s stack guard hit", name);
+}
+#endif
+
+/*
+ * Prevent the compiler and/or objtool from marking the !CONFIG_X86_ESPFIX64
+ * version of exc_double_fault() as noreturn. Otherwise the noreturn mismatch
+ * between configs triggers objtool warnings.
+ *
+ * This is a temporary hack until we have compiler or plugin support for
+ * annotating noreturns.
+ */
+#ifdef CONFIG_X86_ESPFIX64
+#define always_true() true
+#else
+bool always_true(void);
+bool __weak always_true(void) { return true; }
+#endif
+
+/*
+ * Runs on an IST stack for x86_64 and on a special task stack for x86_32.
+ *
+ * On x86_64, this is more or less a normal kernel entry. Notwithstanding the
+ * SDM's warnings about double faults being unrecoverable, returning works as
+ * expected. Presumably what the SDM actually means is that the CPU may get
+ * the register state wrong on entry, so returning could be a bad idea.
+ *
+ * Various CPU engineers have promised that double faults due to an IRET fault
+ * while the stack is read-only are, in fact, recoverable.
+ *
+ * On x86_32, this is entered through a task gate, and regs are synthesized
+ * from the TSS. Returning is, in principle, okay, but changes to regs will
+ * be lost. If, for some reason, we need to return to a context with modified
+ * regs, the shim code could be adjusted to synchronize the registers.
+ *
+ * The 32-bit #DF shim already provides CR2 as an argument. On 64-bit it
+ * needs to be read before doing anything else.
+ */
+DEFINE_IDTENTRY_DF(exc_double_fault)
+{
+ static const char str[] = "double fault";
+ struct task_struct *tsk = current;
+
+#ifdef CONFIG_VMAP_STACK
+ unsigned long address = read_cr2();
+ struct stack_info info;
+#endif
+
+#ifdef CONFIG_X86_ESPFIX64
+ extern unsigned char native_irq_return_iret[];
+
+ /*
+ * If IRET takes a non-IST fault on the espfix64 stack, then we
+ * end up promoting it to a doublefault. In that case, take
+ * advantage of the fact that we're not using the normal (TSS.sp0)
+ * stack right now. We can write a fake #GP(0) frame at TSS.sp0
+ * and then modify our own IRET frame so that, when we return,
+ * we land directly at the #GP(0) vector with the stack already
+ * set up according to its expectations.
+ *
+ * The net result is that our #GP handler will think that we
+ * entered from usermode with the bad user context.
+ *
+ * No need for nmi_enter() here because we don't use RCU.
+ */
+ if (((long)regs->sp >> P4D_SHIFT) == ESPFIX_PGD_ENTRY &&
+ regs->cs == __KERNEL_CS &&
+ regs->ip == (unsigned long)native_irq_return_iret)
+ {
+ struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
+ unsigned long *p = (unsigned long *)regs->sp;
+
+ /*
+ * regs->sp points to the failing IRET frame on the
+ * ESPFIX64 stack. Copy it to the entry stack. This fills
+ * in gpregs->ss through gpregs->ip.
+ */
+ gpregs->ip = p[0];
+ gpregs->cs = p[1];
+ gpregs->flags = p[2];
+ gpregs->sp = p[3];
+ gpregs->ss = p[4];
+ gpregs->orig_ax = 0; /* Missing (lost) #GP error code */
+
+ /*
+ * Adjust our frame so that we return straight to the #GP
+ * vector with the expected RSP value. This is safe because
+ * we won't enable interrupts or schedule before we invoke
+ * general_protection, so nothing will clobber the stack
+ * frame we just set up.
+ *
+ * We will enter general_protection with kernel GSBASE,
+ * which is what the stub expects, given that the faulting
+ * RIP will be the IRET instruction.
+ */
+ regs->ip = (unsigned long)asm_exc_general_protection;
+ regs->sp = (unsigned long)&gpregs->orig_ax;
-#if defined(CONFIG_EDAC)
- if (edac_handler_set()) {
- edac_atomic_assert_error();
return;
}
#endif
- if (panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
+ irqentry_nmi_enter(regs);
+ instrumentation_begin();
+ notify_die(DIE_TRAP, str, regs, error_code, X86_TRAP_DF, SIGSEGV);
- printk(KERN_EMERG "Dazed and confused, but trying to continue\n");
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_nr = X86_TRAP_DF;
+
+#ifdef CONFIG_VMAP_STACK
+ /*
+ * If we overflow the stack into a guard page, the CPU will fail
+ * to deliver #PF and will send #DF instead. Similarly, if we
+ * take any non-IST exception while too close to the bottom of
+ * the stack, the processor will get a page fault while
+ * delivering the exception and will generate a double fault.
+ *
+ * According to the SDM (footnote in 6.15 under "Interrupt 14 -
+ * Page-Fault Exception (#PF)"):
+ *
+ * Processors update CR2 whenever a page fault is detected. If a
+ * second page fault occurs while an earlier page fault is being
+ * delivered, the faulting linear address of the second fault will
+ * overwrite the contents of CR2 (replacing the previous
+ * address). These updates to CR2 occur even if the page fault
+ * results in a double fault or occurs during the delivery of a
+ * double fault.
+ *
+ * The logic below has a small possibility of incorrectly diagnosing
+ * some errors as stack overflows. For example, if the IDT or GDT
+ * gets corrupted such that #GP delivery fails due to a bad descriptor
+ * causing #GP and we hit this condition while CR2 coincidentally
+ * points to the stack guard page, we'll think we overflowed the
+ * stack. Given that we're going to panic one way or another
+ * if this happens, this isn't necessarily worth fixing.
+ *
+ * If necessary, we could improve the test by only diagnosing
+ * a stack overflow if the saved RSP points within 47 bytes of
+ * the bottom of the stack: if RSP == tsk_stack + 48 and we
+ * take an exception, the stack is already aligned and there
+ * will be enough room for SS, RSP, RFLAGS, CS, RIP, and a
+ * possible error code, so a stack overflow would *not* double
+ * fault. With any less space left, exception delivery could
+ * fail, and, as a practical matter, we've overflowed the
+ * stack even if the actual trigger for the double fault was
+ * something else.
+ */
+ if (get_stack_guard_info((void *)address, &info))
+ handle_stack_overflow(regs, address, &info);
+#endif
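The "47 bytes" above is just one less than the size of a 64-bit exception frame: delivery pushes up to six 8-byte slots (SS, RSP, RFLAGS, CS, RIP and a possible error code). A standalone userspace sketch of that headroom arithmetic (illustrative only, not kernel code; the stack bottom value is made up):

#include <stdbool.h>
#include <stdio.h>

#define SLOT_BYTES	8UL
#define FRAME_SLOTS	6UL	/* SS, RSP, RFLAGS, CS, RIP, error code */

/* True if exception delivery at this RSP stays within the stack. */
static bool delivery_fits(unsigned long rsp, unsigned long stack_bottom)
{
	return rsp - stack_bottom >= FRAME_SLOTS * SLOT_BYTES;	/* >= 48 */
}

int main(void)
{
	unsigned long bottom = 0x10000;	/* hypothetical stack bottom */

	printf("rsp = bottom+48: %s\n",
	       delivery_fits(bottom + 48, bottom) ? "fits" : "would double fault");
	printf("rsp = bottom+47: %s\n",
	       delivery_fits(bottom + 47, bottom) ? "fits" : "would double fault");
	return 0;
}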
- /* Clear and disable the memory parity error line. */
- reason = (reason & 0xf) | 4;
- outb(reason, 0x61);
+ pr_emerg("PANIC: double fault, error_code: 0x%lx\n", error_code);
+ die("double fault", regs, error_code);
+ if (always_true())
+ panic("Machine halted.");
+ instrumentation_end();
}
-static notrace __kprobes void
-io_check_error(unsigned char reason, struct pt_regs *regs)
+DEFINE_IDTENTRY(exc_bounds)
{
- unsigned long i;
-
- printk(KERN_EMERG "NMI: IOCK error (debug interrupt?)\n");
- show_registers(regs);
-
- if (panic_on_io_nmi)
- panic("NMI IOCK error: Not continuing");
+ if (notify_die(DIE_TRAP, "bounds", regs, 0,
+ X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
+ return;
+ cond_local_irq_enable(regs);
- /* Re-enable the IOCK line, wait for a few seconds */
- reason = (reason & 0xf) | 8;
- outb(reason, 0x61);
+ if (!user_mode(regs))
+ die("bounds", regs, 0);
- i = 2000;
- while (--i)
- udelay(1000);
+ do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, 0, 0, NULL);
- reason &= ~8;
- outb(reason, 0x61);
+ cond_local_irq_disable(regs);
}
-static notrace __kprobes void
-unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
+enum kernel_gp_hint {
+ GP_NO_HINT,
+ GP_NON_CANONICAL,
+ GP_CANONICAL,
+ GP_LASS_VIOLATION,
+ GP_NULL_POINTER,
+};
+
+static const char * const kernel_gp_hint_help[] = {
+ [GP_NON_CANONICAL] = "probably for non-canonical address",
+ [GP_CANONICAL] = "maybe for address",
+ [GP_LASS_VIOLATION] = "probably LASS violation for address",
+ [GP_NULL_POINTER] = "kernel NULL pointer dereference",
+};
+
+/*
+ * When an uncaught #GP occurs, try to determine the memory address accessed by
+ * the instruction and return that address to the caller. Also, try to figure
+ * out whether any part of the access to that address was non-canonical or
+ * across privilege levels.
+ */
+static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
+ unsigned long *addr)
{
- if (notify_die(DIE_NMIUNKNOWN, "nmi", regs, reason, 2, SIGINT) ==
- NOTIFY_STOP)
- return;
-#ifdef CONFIG_MCA
+ u8 insn_buf[MAX_INSN_SIZE];
+ struct insn insn;
+ int ret;
+
+ if (copy_from_kernel_nofault(insn_buf, (void *)regs->ip,
+ MAX_INSN_SIZE))
+ return GP_NO_HINT;
+
+ ret = insn_decode_kernel(&insn, insn_buf);
+ if (ret < 0)
+ return GP_NO_HINT;
+
+ *addr = (unsigned long)insn_get_addr_ref(&insn, regs);
+ if (*addr == -1UL)
+ return GP_NO_HINT;
+
+#ifdef CONFIG_X86_64
+ /* Operand is in the kernel half */
+ if (*addr >= ~__VIRTUAL_MASK)
+ return GP_CANONICAL;
+
+ /* The last byte of the operand is not in the user canonical half */
+ if (*addr + insn.opnd_bytes - 1 > __VIRTUAL_MASK)
+ return GP_NON_CANONICAL;
+
/*
- * Might actually be able to figure out what the guilty party
- * is:
+ * A NULL pointer dereference usually causes a #PF. However, it
+ * can result in a #GP when LASS is active. Provide the same
+ * hint in the rare case that the condition is hit without LASS.
*/
- if (MCA_bus) {
- mca_handle_nmi();
- return;
- }
-#endif
- printk(KERN_EMERG
- "Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
- reason, smp_processor_id());
+ if (*addr < PAGE_SIZE)
+ return GP_NULL_POINTER;
- printk(KERN_EMERG "Do you have a strange power saving mode enabled?\n");
- if (panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
+ /*
+ * Assume that LASS caused the exception, because the address is
+ * canonical and in the user half.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_LASS))
+ return GP_LASS_VIOLATION;
+#endif
- printk(KERN_EMERG "Dazed and confused, but trying to continue\n");
+ return GP_CANONICAL;
}
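The address classification above follows directly from the x86_64 canonical layout. A minimal userspace sketch of the same decision tree, under the assumption of 4-level paging where __VIRTUAL_MASK is (1UL << 47) - 1 and PAGE_SIZE is 4096 (constants hard-coded here for illustration, not pulled from kernel headers):

#include <stdio.h>

#define VIRTUAL_MASK	((1UL << 47) - 1)	/* assumed 4-level paging */
#define PAGE_SIZE	4096UL

static const char *classify(unsigned long addr, unsigned int opnd_bytes)
{
	if (addr >= ~VIRTUAL_MASK)
		return "kernel half, canonical";
	if (addr + opnd_bytes - 1 > VIRTUAL_MASK)
		return "non-canonical";
	if (addr < PAGE_SIZE)
		return "kernel NULL pointer dereference";
	return "user half, canonical (LASS candidate)";
}

int main(void)
{
	printf("%s\n", classify(0xffff888000000000UL, 8));
	printf("%s\n", classify(0x0000800000000000UL, 8));
	printf("%s\n", classify(0x10UL, 8));
	printf("%s\n", classify(0x00007f0000000000UL, 8));
	return 0;
}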
-static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
+#define GPFSTR "general protection fault"
+
+static bool fixup_iopl_exception(struct pt_regs *regs)
{
- unsigned char reason = 0;
- int cpu;
+ struct thread_struct *t = &current->thread;
+ unsigned char byte;
+ unsigned long ip;
- cpu = smp_processor_id();
+ if (!IS_ENABLED(CONFIG_X86_IOPL_IOPERM) || t->iopl_emul != 3)
+ return false;
- /* Only the BSP gets external NMIs from the system. */
- if (!cpu)
- reason = get_nmi_reason();
+ if (insn_get_effective_ip(regs, &ip))
+ return false;
- if (!(reason & 0xc0)) {
- if (notify_die(DIE_NMI_IPI, "nmi_ipi", regs, reason, 2, SIGINT)
- == NOTIFY_STOP)
- return;
-#ifdef CONFIG_X86_LOCAL_APIC
- /*
- * Ok, so this is none of the documented NMI sources,
- * so it must be the NMI watchdog.
- */
- if (nmi_watchdog_tick(regs, reason))
- return;
- if (!do_nmi_callback(regs, cpu))
- unknown_nmi_error(reason, regs);
-#else
- unknown_nmi_error(reason, regs);
-#endif
+ if (get_user(byte, (const char __user *)ip))
+ return false;
- return;
+ if (byte != 0xfa && byte != 0xfb)
+ return false;
+
+ if (!t->iopl_warn && printk_ratelimit()) {
+ pr_err("%s[%d] attempts to use CLI/STI, pretending it's a NOP, ip:%lx",
+ current->comm, task_pid_nr(current), ip);
+ print_vma_addr(KERN_CONT " in ", ip);
+ pr_cont("\n");
+ t->iopl_warn = 1;
}
- if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT) == NOTIFY_STOP)
- return;
- /* AK: following checks seem to be broken on modern chipsets. FIXME */
- if (reason & 0x80)
- mem_parity_error(reason, regs);
- if (reason & 0x40)
- io_check_error(reason, regs);
-#ifdef CONFIG_X86_32
+ regs->ip += 1;
+ return true;
+}
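The opcode test above keys on the one-byte encodings of CLI (0xfa) and STI (0xfb); the fixup is simply to step over that byte. A hedged standalone sketch of the same emulation decision (userspace C, made-up addresses):

#include <stdbool.h>
#include <stdio.h>

/* Treat CLI (0xfa) / STI (0xfb) as a one-byte NOP by advancing IP. */
static bool emulate_cli_sti(unsigned char byte, unsigned long *ip)
{
	if (byte != 0xfa && byte != 0xfb)
		return false;
	*ip += 1;	/* both instructions are one byte long */
	return true;
}

int main(void)
{
	unsigned long ip = 0x401000;

	printf("0xfa handled: %d, ip now %#lx\n", emulate_cli_sti(0xfa, &ip), ip);
	printf("0x90 handled: %d, ip still %#lx\n", emulate_cli_sti(0x90, &ip), ip);
	return 0;
}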
+
+/*
+ * The unprivileged ENQCMD instruction generates #GPs if the
+ * IA32_PASID MSR has not been populated. If possible, populate
+ * the MSR from a PASID previously allocated to the mm.
+ */
+static bool try_fixup_enqcmd_gp(void)
+{
+#ifdef CONFIG_ARCH_HAS_CPU_PASID
+ u32 pasid;
+
/*
- * Reassert NMI in case it became active meanwhile
- * as it's edge-triggered:
+ * MSR_IA32_PASID is managed using XSAVE. Directly
+ * writing to the MSR is only possible when fpregs
+ * are valid and the fpstate is not. This is
+ * guaranteed when handling a userspace exception
+ * before interrupts are re-enabled.
*/
- reassert_nmi();
+ lockdep_assert_irqs_disabled();
+
+ /*
+ * Hardware without ENQCMD will not generate
+ * #GPs that can be fixed up here.
+ */
+ if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
+ return false;
+
+ /*
+ * If the mm has not been allocated a
+ * PASID, the #GP can not be fixed up.
+ */
+ if (!mm_valid_pasid(current->mm))
+ return false;
+
+ pasid = mm_get_enqcmd_pasid(current->mm);
+
+ /*
+ * Did this thread already have its PASID activated?
+ * If so, the #GP must be from something else.
+ */
+ if (current->pasid_activated)
+ return false;
+
+ wrmsrq(MSR_IA32_PASID, pasid | MSR_IA32_PASID_VALID);
+ current->pasid_activated = 1;
+
+ return true;
+#else
+ return false;
#endif
}
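The fixup above boils down to one MSR write composing the PASID with a valid flag. A tiny sketch of that composition, assuming the IA32_PASID layout where the valid flag is bit 31 (how MSR_IA32_PASID_VALID is defined at the time of writing):

#include <stdint.h>
#include <stdio.h>

#define PASID_VALID	(1ULL << 31)	/* assumed MSR_IA32_PASID_VALID */

int main(void)
{
	uint32_t pasid = 42;			/* hypothetical PASID from the mm */
	uint64_t msr_val = pasid | PASID_VALID;	/* value written to IA32_PASID */

	printf("IA32_PASID = %#llx\n", (unsigned long long)msr_val);
	return 0;
}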
-dotraplinkage notrace __kprobes void
-do_nmi(struct pt_regs *regs, long error_code)
+static bool gp_try_fixup_and_notify(struct pt_regs *regs, int trapnr,
+ unsigned long error_code, const char *str,
+ unsigned long address)
{
- nmi_enter();
+ if (fixup_exception(regs, trapnr, error_code, address))
+ return true;
- inc_irq_stat(__nmi_count);
+ current->thread.error_code = error_code;
+ current->thread.trap_nr = trapnr;
- if (!ignore_nmis)
- default_do_nmi(regs);
+ /*
+ * To be potentially processing a kprobe fault and to trust the result
+ * from kprobe_running(), we have to be non-preemptible.
+ */
+ if (!preemptible() && kprobe_running() &&
+ kprobe_fault_handler(regs, trapnr))
+ return true;
- nmi_exit();
+ return notify_die(DIE_GPF, str, regs, error_code, trapnr, SIGSEGV) == NOTIFY_STOP;
}
-void stop_nmi(void)
+static void gp_user_force_sig_segv(struct pt_regs *regs, int trapnr,
+ unsigned long error_code, const char *str)
{
- acpi_nmi_disable();
- ignore_nmis++;
+ current->thread.error_code = error_code;
+ current->thread.trap_nr = trapnr;
+ show_signal(current, SIGSEGV, "", str, regs, error_code);
+ force_sig(SIGSEGV);
}
-void restart_nmi(void)
+DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
{
- ignore_nmis--;
- acpi_nmi_enable();
+ char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
+ enum kernel_gp_hint hint = GP_NO_HINT;
+ unsigned long gp_addr;
+
+ if (user_mode(regs) && try_fixup_enqcmd_gp())
+ return;
+
+ cond_local_irq_enable(regs);
+
+ if (static_cpu_has(X86_FEATURE_UMIP)) {
+ if (user_mode(regs) && fixup_umip_exception(regs))
+ goto exit;
+ }
+
+ if (v8086_mode(regs)) {
+ local_irq_enable();
+ handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
+ local_irq_disable();
+ return;
+ }
+
+ if (user_mode(regs)) {
+ if (fixup_iopl_exception(regs))
+ goto exit;
+
+ if (fixup_vdso_exception(regs, X86_TRAP_GP, error_code, 0))
+ goto exit;
+
+ gp_user_force_sig_segv(regs, X86_TRAP_GP, error_code, desc);
+ goto exit;
+ }
+
+ if (gp_try_fixup_and_notify(regs, X86_TRAP_GP, error_code, desc, 0))
+ goto exit;
+
+ if (error_code)
+ snprintf(desc, sizeof(desc), "segment-related " GPFSTR);
+ else
+ hint = get_kernel_gp_address(regs, &gp_addr);
+
+ if (hint != GP_NO_HINT)
+ snprintf(desc, sizeof(desc), GPFSTR ", %s 0x%lx",
+ kernel_gp_hint_help[hint], gp_addr);
+
+ /*
+ * KASAN is interested only in the non-canonical case, clear it
+ * otherwise.
+ */
+ if (hint != GP_NON_CANONICAL)
+ gp_addr = 0;
+
+ die_addr(desc, regs, error_code, gp_addr);
+
+exit:
+ cond_local_irq_disable(regs);
}
-/* May run on IST stack. */
-dotraplinkage void __kprobes do_int3(struct pt_regs *regs, long error_code)
+static bool do_int3(struct pt_regs *regs)
{
+ int res;
+
#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
- if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
- == NOTIFY_STOP)
- return;
+ if (kgdb_ll_trap(DIE_INT3, "int3", regs, 0, X86_TRAP_BP,
+ SIGTRAP) == NOTIFY_STOP)
+ return true;
#endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */
+
#ifdef CONFIG_KPROBES
- if (notify_die(DIE_INT3, "int3", regs, error_code, 3, SIGTRAP)
- == NOTIFY_STOP)
+ if (kprobe_int3_handler(regs))
+ return true;
+#endif
+ res = notify_die(DIE_INT3, "int3", regs, 0, X86_TRAP_BP, SIGTRAP);
+
+ return res == NOTIFY_STOP;
+}
+NOKPROBE_SYMBOL(do_int3);
+
+static void do_int3_user(struct pt_regs *regs)
+{
+ if (do_int3(regs))
return;
-#else
- if (notify_die(DIE_TRAP, "int3", regs, error_code, 3, SIGTRAP)
- == NOTIFY_STOP)
+
+ cond_local_irq_enable(regs);
+ do_trap(X86_TRAP_BP, SIGTRAP, "int3", regs, 0, 0, NULL);
+ cond_local_irq_disable(regs);
+}
+
+DEFINE_IDTENTRY_RAW(exc_int3)
+{
+ /*
+ * smp_text_poke_int3_handler() is completely self-contained code; it
+ * does (and must) *NOT* call out to anything, lest it hit upon yet
+ * another INT3.
+ */
+ if (smp_text_poke_int3_handler(regs))
return;
-#endif
- preempt_conditional_sti(regs);
- do_trap(3, SIGTRAP, "int3", regs, error_code, NULL);
- preempt_conditional_cli(regs);
+ /*
+ * irqentry_enter_from_user_mode() uses static_branch_{,un}likely()
+ * and therefore can trigger INT3, hence smp_text_poke_int3_handler() must
+ * be done before. If the entry came from kernel mode, then use
+ * nmi_enter() because the INT3 could have been hit in any context
+ * including NMI.
+ */
+ if (user_mode(regs)) {
+ irqentry_enter_from_user_mode(regs);
+ instrumentation_begin();
+ do_int3_user(regs);
+ instrumentation_end();
+ irqentry_exit_to_user_mode(regs);
+ } else {
+ irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+
+ instrumentation_begin();
+ if (!do_int3(regs))
+ die("int3", regs, 0);
+ instrumentation_end();
+ irqentry_nmi_exit(regs, irq_state);
+ }
}
#ifdef CONFIG_X86_64
/*
- * Help handler running on IST stack to switch back to user stack
- * for scheduling or signal handling. The actual stack switch is done in
- * entry.S
+ * Help handler running on a per-cpu (IST or entry trampoline) stack
+ * to switch to the normal thread stack if the interrupted code was in
+ * user mode. The actual stack switch is done in entry_64.S
*/
-asmlinkage __kprobes struct pt_regs *sync_regs(struct pt_regs *eregs)
-{
- struct pt_regs *regs = eregs;
- /* Did already sync */
- if (eregs == (struct pt_regs *)eregs->sp)
- ;
- /* Exception from user space */
- else if (user_mode(eregs))
- regs = task_pt_regs(current);
- /*
- * Exception from kernel and interrupts are enabled. Move to
- * kernel process stack.
- */
- else if (eregs->flags & X86_EFLAGS_IF)
- regs = (struct pt_regs *)(eregs->sp -= sizeof(struct pt_regs));
- if (eregs != regs)
+asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
+{
+ struct pt_regs *regs = (struct pt_regs *)current_top_of_stack() - 1;
+ if (regs != eregs)
*regs = *eregs;
return regs;
}
+
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+asmlinkage __visible noinstr struct pt_regs *vc_switch_off_ist(struct pt_regs *regs)
+{
+ unsigned long sp, *stack;
+ struct stack_info info;
+ struct pt_regs *regs_ret;
+
+ /*
+ * In the SYSCALL entry path the RSP value comes from user-space - don't
+ * trust it and switch to the current kernel stack
+ */
+ if (ip_within_syscall_gap(regs)) {
+ sp = current_top_of_stack();
+ goto sync;
+ }
+
+ /*
+ * From here on the RSP value is trusted. Now check whether entry
+ * happened from a safe stack. Not safe are the entry or unknown stacks,
+ * use the fall-back stack instead in this case.
+ */
+ sp = regs->sp;
+ stack = (unsigned long *)sp;
+
+ if (!get_stack_info_noinstr(stack, current, &info) || info.type == STACK_TYPE_ENTRY ||
+ info.type > STACK_TYPE_EXCEPTION_LAST)
+ sp = __this_cpu_ist_top_va(VC2);
+
+sync:
+ /*
+ * Found a safe stack - switch to it as if the entry didn't happen via
+ * IST stack. The code below only copies pt_regs, the real switch happens
+ * in assembly code.
+ */
+ sp = ALIGN_DOWN(sp, 8) - sizeof(*regs_ret);
+
+ regs_ret = (struct pt_regs *)sp;
+ *regs_ret = *regs;
+
+ return regs_ret;
+}
#endif
+asmlinkage __visible noinstr struct pt_regs *fixup_bad_iret(struct pt_regs *bad_regs)
+{
+ struct pt_regs tmp, *new_stack;
+
+ /*
+ * This is called from entry_64.S early in handling a fault
+ * caused by a bad iret to user mode. To handle the fault
+ * correctly, we want to move our stack frame to where it would
+ * be had we entered directly on the entry stack (rather than
+ * just below the IRET frame) and we want to pretend that the
+ * exception came from the IRET target.
+ */
+ new_stack = (struct pt_regs *)__this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
+
+ /* Copy the IRET target to the temporary storage. */
+ __memcpy(&tmp.ip, (void *)bad_regs->sp, 5*8);
+
+ /* Copy the remainder of the stack from the current stack. */
+ __memcpy(&tmp, bad_regs, offsetof(struct pt_regs, ip));
+
+ /* Update the entry stack */
+ __memcpy(new_stack, &tmp, sizeof(tmp));
+
+ BUG_ON(!user_mode(new_stack));
+ return new_stack;
+}
+#endif
+
+static bool is_sysenter_singlestep(struct pt_regs *regs)
+{
+ /*
+ * We don't try for precision here. If we're anywhere in the region of
+ * code that can be single-stepped in the SYSENTER entry path, then
+ * assume that this is a useless single-step trap due to SYSENTER
+ * being invoked with TF set. (We don't know in advance exactly
+ * which instructions will be hit because BTF could plausibly
+ * be set.)
+ */
+#ifdef CONFIG_X86_32
+ return (regs->ip - (unsigned long)__begin_SYSENTER_singlestep_region) <
+ (unsigned long)__end_SYSENTER_singlestep_region -
+ (unsigned long)__begin_SYSENTER_singlestep_region;
+#elif defined(CONFIG_IA32_EMULATION)
+ return (regs->ip - (unsigned long)entry_SYSENTER_compat) <
+ (unsigned long)__end_entry_SYSENTER_compat -
+ (unsigned long)entry_SYSENTER_compat;
+#else
+ return false;
+#endif
+}
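Both branches above use the classic single-compare range check: because the subtraction is unsigned, (ip - begin) wraps to a huge value when ip < begin, so one comparison covers both ends of [begin, end). A standalone sketch of the idiom:

#include <stdbool.h>
#include <stdio.h>

static bool in_region(unsigned long x, unsigned long begin, unsigned long end)
{
	return (x - begin) < (end - begin);	/* unsigned wrap handles x < begin */
}

int main(void)
{
	unsigned long begin = 0x1000, end = 0x1040;

	printf("below:    %d\n", in_region(0x0fff, begin, end));	/* 0 */
	printf("first:    %d\n", in_region(0x1000, begin, end));	/* 1 */
	printf("past end: %d\n", in_region(0x1040, begin, end));	/* 0 */
	return 0;
}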
+
+static __always_inline unsigned long debug_read_reset_dr6(void)
+{
+ unsigned long dr6;
+
+ get_debugreg(dr6, 6);
+ dr6 ^= DR6_RESERVED; /* Flip to positive polarity */
+
+ /*
+ * The Intel SDM says:
+ *
+ * Certain debug exceptions may clear bits 0-3 of DR6.
+ *
+ * BLD induced #DB clears DR6.BLD and any other debug
+ * exception doesn't modify DR6.BLD.
+ *
+ * RTM induced #DB clears DR6.RTM and any other debug
+ * exception sets DR6.RTM.
+ *
+ * To avoid confusion in identifying debug exceptions,
+ * debug handlers should set DR6.BLD and DR6.RTM, and
+ * clear other DR6 bits before returning.
+ *
+ * Keep it simple: write DR6 with its architectural reset
+ * value 0xFFFF0FF0, defined as DR6_RESERVED, immediately.
+ */
+ set_debugreg(DR6_RESERVED, 6);
+
+ return dr6;
+}
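The XOR above normalizes DR6 so every reported event reads as 1, whatever its hardware polarity: bits inside DR6_RESERVED (0xFFFF0FF0) read as 1 when inactive, while bits outside it, such as DR_STEP (bit 14), read as 0. A small sketch of the flip:

#include <stdio.h>

#define DR6_RESERVED	0xFFFF0FF0UL	/* reads as 1 when nothing happened */
#define DR_STEP		0x4000UL	/* BS, bit 14, positive polarity */

int main(void)
{
	/* Raw DR6 after a single-step #DB: reserved pattern plus BS. */
	unsigned long raw = DR6_RESERVED | DR_STEP;
	unsigned long dr6 = raw ^ DR6_RESERVED;	/* flip to positive polarity */

	printf("dr6 = %#lx, single-step = %s\n",
	       dr6, (dr6 & DR_STEP) ? "yes" : "no");
	return 0;
}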
+
/*
* Our handling of the processor debug registers is non-trivial.
* We do not clear them on entry and exit from the kernel. Therefore
@@ -523,377 +1195,502 @@ asmlinkage __kprobes struct pt_regs *sync_regs(struct pt_regs *eregs)
*
* May run on IST stack.
*/
-dotraplinkage void __kprobes do_debug(struct pt_regs *regs, long error_code)
+
+static bool notify_debug(struct pt_regs *regs, unsigned long *dr6)
{
- struct task_struct *tsk = current;
- int user_icebp = 0;
- unsigned long dr6;
- int si_code;
+ /*
+ * Notifiers will clear bits in @dr6 to indicate the event has been
+ * consumed - hw_breakpoint_handler(), single_stop_cont().
+ *
+ * Notifiers will set bits in @virtual_dr6 to indicate the desire
+ * for signals - ptrace_triggered(), kgdb_hw_overflow_handler().
+ */
+ if (notify_die(DIE_DEBUG, "debug", regs, (long)dr6, 0, SIGTRAP) == NOTIFY_STOP)
+ return true;
- get_debugreg(dr6, 6);
+ return false;
+}
- /* Filter out all the reserved bits which are preset to 1 */
- dr6 &= ~DR6_RESERVED;
+static noinstr void exc_debug_kernel(struct pt_regs *regs, unsigned long dr6)
+{
+ /*
+ * Disable breakpoints during exception handling; recursive exceptions
+ * are exceedingly 'fun'.
+ *
+ * Since this function is NOKPROBE, and that also applies to
+ * HW_BREAKPOINT_X, we can't hit a breakpoint before this (XXX except a
+ * HW_BREAKPOINT_W on our stack)
+ *
+ * Entry text is excluded for HW_BP_X, and cpu_entry_area, which
+ * includes the entry stack, is excluded for everything.
+ *
+ * For FRED, nested #DB should just work fine. But when a watchpoint or
+ * breakpoint is set in the code path which is executed by #DB handler,
+ * it results in an endless recursion and stack overflow. Thus we stay
+ * with the IDT approach, i.e., save DR7 and disable #DB.
+ */
+ unsigned long dr7 = local_db_save();
+ irqentry_state_t irq_state = irqentry_nmi_enter(regs);
+ instrumentation_begin();
/*
- * If dr6 has no reason to give us about the origin of this trap,
- * then it's very likely the result of an icebp/int01 trap.
- * User wants a sigtrap for that.
+ * If something gets miswired and we end up here for a user mode
+ * #DB, we will malfunction.
*/
- if (!dr6 && user_mode(regs))
- user_icebp = 1;
+ WARN_ON_ONCE(user_mode(regs));
- /* Catch kmemcheck conditions first of all! */
- if ((dr6 & DR_STEP) && kmemcheck_trap(regs))
- return;
+ if (test_thread_flag(TIF_BLOCKSTEP)) {
+ /*
+ * The SDM says "The processor clears the BTF flag when it
+ * generates a debug exception." PTRACE_BLOCKSTEP requested
+ * it for userspace, but we just took a kernel #DB, so re-set
+ * BTF.
+ */
+ unsigned long debugctl;
- /* DR6 may or may not be cleared by the CPU */
- set_debugreg(0, 6);
+ rdmsrq(MSR_IA32_DEBUGCTLMSR, debugctl);
+ debugctl |= DEBUGCTLMSR_BTF;
+ wrmsrq(MSR_IA32_DEBUGCTLMSR, debugctl);
+ }
/*
- * The processor cleared BTF, so don't mark that we need it set.
+ * Catch SYSENTER with TF set and clear DR_STEP. If this hit a
+ * watchpoint at the same time then that will still be handled.
*/
- clear_tsk_thread_flag(tsk, TIF_BLOCKSTEP);
+ if (!cpu_feature_enabled(X86_FEATURE_FRED) &&
+ (dr6 & DR_STEP) && is_sysenter_singlestep(regs))
+ dr6 &= ~DR_STEP;
- /* Store the virtualized DR6 value */
- tsk->thread.debugreg6 = dr6;
-
- if (notify_die(DIE_DEBUG, "debug", regs, PTR_ERR(&dr6), error_code,
- SIGTRAP) == NOTIFY_STOP)
- return;
-
- /* It's safe to allow irq's after DR6 has been saved */
- preempt_conditional_sti(regs);
+ /*
+ * The kernel doesn't use INT1
+ */
+ if (!dr6)
+ goto out;
- if (regs->flags & X86_VM_MASK) {
- handle_vm86_trap((struct kernel_vm86_regs *) regs,
- error_code, 1);
- return;
- }
+ if (notify_debug(regs, &dr6))
+ goto out;
/*
- * Single-stepping through system calls: ignore any exceptions in
- * kernel space, but re-enable TF when returning to user mode.
+ * The kernel doesn't use TF single-step outside of:
+ *
+ * - Kprobes, consumed through kprobe_debug_handler()
+ * - KGDB, consumed through notify_debug()
*
- * We already checked v86 mode above, so we can check for kernel mode
- * by just checking the CPL of CS.
+ * So if we get here with DR_STEP set, something is wonky.
+ *
+ * A known way to trigger this is through QEMU's GDB stub,
+ * which leaks #DB into the guest and causes IST recursion.
*/
- if ((dr6 & DR_STEP) && !user_mode(regs)) {
- tsk->thread.debugreg6 &= ~DR_STEP;
- set_tsk_thread_flag(tsk, TIF_SINGLESTEP);
+ if (WARN_ON_ONCE(dr6 & DR_STEP))
regs->flags &= ~X86_EFLAGS_TF;
- }
- si_code = get_si_code(tsk->thread.debugreg6);
- if (tsk->thread.debugreg6 & (DR_STEP | DR_TRAP_BITS) || user_icebp)
- send_sigtrap(tsk, regs, error_code, si_code);
- preempt_conditional_cli(regs);
+out:
+ instrumentation_end();
+ irqentry_nmi_exit(regs, irq_state);
- return;
+ local_db_restore(dr7);
}
-/*
- * Note that we play around with the 'TS' bit in an attempt to get
- * the correct behaviour even in the presence of the asynchronous
- * IRQ13 behaviour
- */
-void math_error(struct pt_regs *regs, int error_code, int trapnr)
+static noinstr void exc_debug_user(struct pt_regs *regs, unsigned long dr6)
{
- struct task_struct *task = current;
- siginfo_t info;
- unsigned short err;
- char *str = (trapnr == 16) ? "fpu exception" : "simd exception";
+ bool icebp;
- if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, SIGFPE) == NOTIFY_STOP)
- return;
- conditional_sti(regs);
+ /*
+ * If something gets miswired and we end up here for a kernel mode
+ * #DB, we will malfunction.
+ */
+ WARN_ON_ONCE(!user_mode(regs));
- if (!user_mode_vm(regs))
- {
- if (!fixup_exception(regs)) {
- task->thread.error_code = error_code;
- task->thread.trap_no = trapnr;
- die(str, regs, error_code);
- }
- return;
- }
+ /*
+ * NB: We can't easily clear DR7 here because
+ * irqentry_exit_to_usermode() can invoke ptrace, schedule, access
+ * user memory, etc. This means that a recursive #DB is possible. If
+ * this happens, that #DB will hit exc_debug_kernel() and clear DR7.
+ * Since we're not on the IST stack right now, everything will be
+ * fine.
+ */
+
+ irqentry_enter_from_user_mode(regs);
+ instrumentation_begin();
/*
- * Save the info for the exception handler and clear the error.
+ * Start the virtual/ptrace DR6 value with just the DR_STEP mask
+ * of the real DR6. ptrace_triggered() will set the DR_TRAPn bits.
+ *
+ * Userspace expects DR_STEP to be visible in ptrace_get_debugreg(6)
+ * even if it is not the result of PTRACE_SINGLESTEP.
*/
- save_init_fpu(task);
- task->thread.trap_no = trapnr;
- task->thread.error_code = error_code;
- info.si_signo = SIGFPE;
- info.si_errno = 0;
- info.si_addr = (void __user *)regs->ip;
- if (trapnr == 16) {
- unsigned short cwd, swd;
- /*
- * (~cwd & swd) will mask out exceptions that are not set to unmasked
- * status. 0x3f is the exception bits in these regs, 0x200 is the
- * C1 reg you need in case of a stack fault, 0x040 is the stack
- * fault bit. We should only be taking one exception at a time,
- * so if this combination doesn't produce any single exception,
- * then we have a bad program that isn't synchronizing its FPU usage
- * and it will suffer the consequences since we won't be able to
- * fully reproduce the context of the exception
- */
- cwd = get_fpu_cwd(task);
- swd = get_fpu_swd(task);
+ current->thread.virtual_dr6 = (dr6 & DR_STEP);
- err = swd & ~cwd;
- } else {
- /*
- * The SIMD FPU exceptions are handled a little differently, as there
- * is only a single status/control register. Thus, to determine which
- * unmasked exception was caught we must mask the exception mask bits
- * at 0x1f80, and then use these to mask the exception bits at 0x3f.
- */
- unsigned short mxcsr = get_fpu_mxcsr(task);
- err = ~(mxcsr >> 7) & mxcsr;
- }
+ /*
+ * The SDM says "The processor clears the BTF flag when it
+ * generates a debug exception." Clear TIF_BLOCKSTEP to keep
+ * TIF_BLOCKSTEP in sync with the hardware BTF flag.
+ */
+ clear_thread_flag(TIF_BLOCKSTEP);
- if (err & 0x001) { /* Invalid op */
- /*
- * swd & 0x240 == 0x040: Stack Underflow
- * swd & 0x240 == 0x240: Stack Overflow
- * User must clear the SF bit (0x40) if set
- */
- info.si_code = FPE_FLTINV;
- } else if (err & 0x004) { /* Divide by Zero */
- info.si_code = FPE_FLTDIV;
- } else if (err & 0x008) { /* Overflow */
- info.si_code = FPE_FLTOVF;
- } else if (err & 0x012) { /* Denormal, Underflow */
- info.si_code = FPE_FLTUND;
- } else if (err & 0x020) { /* Precision */
- info.si_code = FPE_FLTRES;
- } else {
- /*
- * If we're using IRQ 13, or supposedly even some trap 16
- * implementations, it's possible we get a spurious trap...
- */
- return; /* Spurious trap, no error */
+ /*
+ * If dr6 gives us no clue about the origin of this trap, then it's
+ * very likely the result of an icebp/int01 trap. The user wants a
+ * SIGTRAP for that.
+ */
+ icebp = !dr6;
+
+ if (notify_debug(regs, &dr6))
+ goto out;
+
+ /* It's safe to allow irq's after DR6 has been saved */
+ local_irq_enable();
+
+ if (v8086_mode(regs)) {
+ handle_vm86_trap((struct kernel_vm86_regs *)regs, 0, X86_TRAP_DB);
+ goto out_irq;
}
- force_sig_info(SIGFPE, &info, task);
-}
-dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
-{
-#ifdef CONFIG_X86_32
- ignore_fpu_irq = 1;
-#endif
+ /* #DB for bus lock can only be triggered from userspace. */
+ if (dr6 & DR_BUS_LOCK)
+ handle_bus_lock(regs);
- math_error(regs, error_code, 16);
+ /* Add the virtual_dr6 bits for signals. */
+ dr6 |= current->thread.virtual_dr6;
+ if (dr6 & (DR_STEP | DR_TRAP_BITS) || icebp)
+ send_sigtrap(regs, 0, get_si_code(dr6));
+
+out_irq:
+ local_irq_disable();
+out:
+ instrumentation_end();
+ irqentry_exit_to_user_mode(regs);
}
-dotraplinkage void
-do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
+#ifdef CONFIG_X86_64
+/* IST stack entry */
+DEFINE_IDTENTRY_DEBUG(exc_debug)
{
- math_error(regs, error_code, 19);
+ exc_debug_kernel(regs, debug_read_reset_dr6());
}
-dotraplinkage void
-do_spurious_interrupt_bug(struct pt_regs *regs, long error_code)
+/* User entry, runs on regular task stack */
+DEFINE_IDTENTRY_DEBUG_USER(exc_debug)
{
- conditional_sti(regs);
-#if 0
- /* No need to warn about this any longer. */
- printk(KERN_INFO "Ignoring P6 Local APIC Spurious Interrupt Bug...\n");
-#endif
+ exc_debug_user(regs, debug_read_reset_dr6());
}
-asmlinkage void __attribute__((weak)) smp_thermal_interrupt(void)
+#ifdef CONFIG_X86_FRED
+/*
+ * Depending on the ring level it occurred at, i.e., user or kernel
+ * context, #DB needs to be handled on a different stack: user #DB on
+ * the current task stack, kernel #DB on a dedicated stack.
+ *
+ * This is exactly how FRED event delivery invokes an exception
+ * handler: ring 3 event on level 0 stack, i.e., current task stack;
+ * ring 0 event on the #DB dedicated stack specified in the
+ * IA32_FRED_STKLVLS MSR. So unlike IDT, the FRED debug exception
+ * entry stub doesn't do stack switch.
+ */
+DEFINE_FREDENTRY_DEBUG(exc_debug)
{
+ /*
+ * FRED #DB stores DR6 on the stack in the format which
+ * debug_read_reset_dr6() returns for the IDT entry points.
+ */
+ unsigned long dr6 = fred_event_data(regs);
+
+ if (user_mode(regs))
+ exc_debug_user(regs, dr6);
+ else
+ exc_debug_kernel(regs, dr6);
}
+#endif /* CONFIG_X86_FRED */
-asmlinkage void __attribute__((weak)) smp_threshold_interrupt(void)
+#else
+/* 32 bit does not have separate entry points. */
+DEFINE_IDTENTRY_RAW(exc_debug)
{
+ unsigned long dr6 = debug_read_reset_dr6();
+
+ if (user_mode(regs))
+ exc_debug_user(regs, dr6);
+ else
+ exc_debug_kernel(regs, dr6);
}
+#endif
/*
- * __math_state_restore assumes that cr0.TS is already clear and the
- * fpu state is all ready for use. Used during context switch.
+ * Note that we play around with the 'TS' bit in an attempt to get
+ * the correct behaviour even in the presence of asynchronous
+ * IRQ13 delivery.
*/
-void __math_state_restore(void)
+static void math_error(struct pt_regs *regs, int trapnr)
{
- struct thread_info *thread = current_thread_info();
- struct task_struct *tsk = thread->task;
+ struct task_struct *task = current;
+ struct fpu *fpu = x86_task_fpu(task);
+ int si_code;
+ char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" :
+ "simd exception";
+
+ cond_local_irq_enable(regs);
+
+ if (!user_mode(regs)) {
+ if (fixup_exception(regs, trapnr, 0, 0))
+ goto exit;
+
+ task->thread.error_code = 0;
+ task->thread.trap_nr = trapnr;
+
+ if (notify_die(DIE_TRAP, str, regs, 0, trapnr,
+ SIGFPE) != NOTIFY_STOP)
+ die(str, regs, 0);
+ goto exit;
+ }
/*
- * Paranoid restore. send a SIGSEGV if we fail to restore the state.
+ * Synchronize the FPU register state to the memory register state
+ * if necessary. This allows the exception handler to inspect it.
*/
- if (unlikely(restore_fpu_checking(tsk))) {
- stts();
- force_sig(SIGSEGV, tsk);
- return;
- }
+ fpu_sync_fpstate(fpu);
+
+ task->thread.trap_nr = trapnr;
+ task->thread.error_code = 0;
+
+ si_code = fpu__exception_code(fpu, trapnr);
+ /* Retry when we get spurious exceptions: */
+ if (!si_code)
+ goto exit;
- thread->status |= TS_USEDFPU; /* So we fnsave on switch_to() */
- tsk->fpu_counter++;
+ if (fixup_vdso_exception(regs, trapnr, 0, 0))
+ goto exit;
+
+ force_sig_fault(SIGFPE, si_code,
+ (void __user *)uprobe_get_trap_addr(regs));
+exit:
+ cond_local_irq_disable(regs);
}
-/*
- * 'math_state_restore()' saves the current math information in the
- * old math state array, and gets the new ones from the current task
- *
- * Careful.. There are problems with IBM-designed IRQ13 behaviour.
- * Don't touch unless you *really* know how it works.
- *
- * Must be called with kernel preemption disabled (in this case,
- * local interrupts are disabled at the call-site in entry.S).
- */
-asmlinkage void math_state_restore(void)
+DEFINE_IDTENTRY(exc_coprocessor_error)
{
- struct thread_info *thread = current_thread_info();
- struct task_struct *tsk = thread->task;
+ math_error(regs, X86_TRAP_MF);
+}
- if (!tsk_used_math(tsk)) {
- local_irq_enable();
- /*
- * does a slab alloc which can sleep
- */
- if (init_fpu(tsk)) {
- /*
- * ran out of memory!
- */
- do_group_exit(SIGKILL);
+DEFINE_IDTENTRY(exc_simd_coprocessor_error)
+{
+ if (IS_ENABLED(CONFIG_X86_INVD_BUG)) {
+ /* AMD 486 bug: INVD in CPL 0 raises #XF instead of #GP */
+ if (!static_cpu_has(X86_FEATURE_XMM)) {
+ __exc_general_protection(regs, 0);
return;
}
- local_irq_disable();
}
+ math_error(regs, X86_TRAP_XF);
+}
- clts(); /* Allow maths ops (or we recurse) */
-
- __math_state_restore();
+DEFINE_IDTENTRY(exc_spurious_interrupt_bug)
+{
+ /*
+ * This addresses a Pentium Pro Erratum:
+ *
+ * PROBLEM: If the APIC subsystem is configured in mixed mode with
+ * Virtual Wire mode implemented through the local APIC, an
+ * interrupt vector of 0Fh (Intel reserved encoding) may be
+ * generated by the local APIC (Int 15). This vector may be
+ * generated upon receipt of a spurious interrupt (an interrupt
+ * which is removed before the system receives the INTA sequence)
+ * instead of the programmed 8259 spurious interrupt vector.
+ *
+ * IMPLICATION: The spurious interrupt vector programmed in the
+ * 8259 is normally handled by an operating system's spurious
+ * interrupt handler. However, a vector of 0Fh is unknown to some
+ * operating systems, which would crash if this erratum occurred.
+ *
+ * In theory this could be limited to 32-bit, but the handler is
+ * harmless and who knows which other CPUs suffer from this.
+ */
}
-EXPORT_SYMBOL_GPL(math_state_restore);
-#ifndef CONFIG_MATH_EMULATION
-void math_emulate(struct math_emu_info *info)
+static bool handle_xfd_event(struct pt_regs *regs)
{
- printk(KERN_EMERG
- "math-emulation not enabled and no coprocessor found.\n");
- printk(KERN_EMERG "killing %s.\n", current->comm);
- force_sig(SIGFPE, current);
- schedule();
+ u64 xfd_err;
+ int err;
+
+ if (!IS_ENABLED(CONFIG_X86_64) || !cpu_feature_enabled(X86_FEATURE_XFD))
+ return false;
+
+ rdmsrq(MSR_IA32_XFD_ERR, xfd_err);
+ if (!xfd_err)
+ return false;
+
+ wrmsrq(MSR_IA32_XFD_ERR, 0);
+
+ /* Die if that happens in kernel space */
+ if (WARN_ON(!user_mode(regs)))
+ return false;
+
+ local_irq_enable();
+
+ err = xfd_enable_feature(xfd_err);
+
+ switch (err) {
+ case -EPERM:
+ force_sig_fault(SIGILL, ILL_ILLOPC, error_get_trap_addr(regs));
+ break;
+ case -EFAULT:
+ force_sig(SIGSEGV);
+ break;
+ }
+
+ local_irq_disable();
+ return true;
}
-#endif /* CONFIG_MATH_EMULATION */
-dotraplinkage void __kprobes
-do_device_not_available(struct pt_regs *regs, long error_code)
+DEFINE_IDTENTRY(exc_device_not_available)
{
-#ifdef CONFIG_X86_32
- if (read_cr0() & X86_CR0_EM) {
+ unsigned long cr0 = read_cr0();
+
+ if (handle_xfd_event(regs))
+ return;
+
+#ifdef CONFIG_MATH_EMULATION
+ if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) {
struct math_emu_info info = { };
- conditional_sti(regs);
+ cond_local_irq_enable(regs);
info.regs = regs;
math_emulate(&info);
- } else {
- math_state_restore(); /* interrupts still off */
- conditional_sti(regs);
+
+ cond_local_irq_disable(regs);
+ return;
}
-#else
- math_state_restore();
#endif
+
+ /* This should not happen. */
+ if (WARN(cr0 & X86_CR0_TS, "CR0.TS was set")) {
+ /* Try to fix it up and carry on. */
+ write_cr0(cr0 & ~X86_CR0_TS);
+ } else {
+ /*
+ * Something terrible happened, and we're better off trying
+ * to kill the task than getting stuck in a never-ending
+ * loop of #NM faults.
+ */
+ die("unexpected #NM exception", regs, 0);
+ }
}
-#ifdef CONFIG_X86_32
-dotraplinkage void do_iret_error(struct pt_regs *regs, long error_code)
+#ifdef CONFIG_INTEL_TDX_GUEST
+
+#define VE_FAULT_STR "VE fault"
+
+static void ve_raise_fault(struct pt_regs *regs, long error_code,
+ unsigned long address)
{
- siginfo_t info;
- local_irq_enable();
+ if (user_mode(regs)) {
+ gp_user_force_sig_segv(regs, X86_TRAP_VE, error_code, VE_FAULT_STR);
+ return;
+ }
- info.si_signo = SIGILL;
- info.si_errno = 0;
- info.si_code = ILL_BADSTK;
- info.si_addr = NULL;
- if (notify_die(DIE_TRAP, "iret exception",
- regs, error_code, 32, SIGILL) == NOTIFY_STOP)
+ if (gp_try_fixup_and_notify(regs, X86_TRAP_VE, error_code,
+ VE_FAULT_STR, address)) {
return;
- do_trap(32, SIGILL, "iret exception", regs, error_code, &info);
-}
-#endif
+ }
-/* Set of traps needed for early debugging. */
-void __init early_trap_init(void)
-{
- set_intr_gate_ist(1, &debug, DEBUG_STACK);
- /* int3 can be called from all */
- set_system_intr_gate_ist(3, &int3, DEBUG_STACK);
- set_intr_gate(14, &page_fault);
- load_idt(&idt_descr);
+ die_addr(VE_FAULT_STR, regs, error_code, address);
}
-void __init trap_init(void)
+/*
+ * Virtualization Exceptions (#VE) are delivered to TDX guests due to
+ * specific guest actions which may happen in either user space or the
+ * kernel:
+ *
+ * * Specific instructions (WBINVD, for example)
+ * * Specific MSR accesses
+ * * Specific CPUID leaf accesses
+ * * Access to specific guest physical addresses
+ *
+ * In the settings that Linux will run in, virtualization exceptions are
+ * never generated on accesses to normal, TD-private memory that has been
+ * accepted (by BIOS or with tdx_enc_status_changed()).
+ *
+ * Syscall entry code has a critical window where the kernel stack is not
+ * yet set up. Any exception in this window leads to hard to debug issues
+ * and can be exploited for privilege escalation. Exceptions in the NMI
+ * entry code also cause issues. Returning from the exception handler with
+ * IRET will re-enable NMIs, and a nested NMI will corrupt the NMI stack.
+ *
+ * For these reasons, the kernel avoids #VEs during the syscall gap and
+ * the NMI entry code. Entry code paths do not access TD-shared memory,
+ * MMIO regions, use #VE triggering MSRs, instructions, or CPUID leaves
+ * that might generate #VE. VMM can remove memory from TD at any point,
+ * but access to unaccepted (or missing) private memory leads to VM
+ * termination, not to #VE.
+ *
+ * Similarly to page faults and breakpoints, #VEs are allowed in NMI
+ * handlers once the kernel is ready to deal with nested NMIs.
+ *
+ * During #VE delivery, all interrupts, including NMIs, are blocked until
+ * TDGETVEINFO is called. It prevents #VE nesting until the kernel reads
+ * the VE info.
+ *
+ * If a guest kernel action which would normally cause a #VE occurs in
+ * the interrupt-disabled region before TDGETVEINFO, a #DF (double
+ * fault) is delivered to the guest, which will result in an oops.
+ *
+ * The entry code has been audited carefully to follow these expectations.
+ * Changes to the entry code have to be audited for correctness in this
+ * respect. As with #PF, a #VE in these places would expose the kernel
+ * to privilege escalation or may lead to random crashes.
+ */
+DEFINE_IDTENTRY(exc_virtualization_exception)
{
- int i;
+ struct ve_info ve;
-#ifdef CONFIG_EISA
- void __iomem *p = early_ioremap(0x0FFFD9, 4);
+ /*
+ * NMIs/Machine-checks/Interrupts will be in a disabled state
+ * till TDGETVEINFO TDCALL is executed. This ensures that VE
+ * info cannot be overwritten by a nested #VE.
+ */
+ tdx_get_ve_info(&ve);
- if (readl(p) == 'E' + ('I'<<8) + ('S'<<16) + ('A'<<24))
- EISA_bus = 1;
- early_iounmap(p, 4);
-#endif
+ cond_local_irq_enable(regs);
- set_intr_gate(0, &divide_error);
- set_intr_gate_ist(2, &nmi, NMI_STACK);
- /* int4 can be called from all */
- set_system_intr_gate(4, &overflow);
- set_intr_gate(5, &bounds);
- set_intr_gate(6, &invalid_op);
- set_intr_gate(7, &device_not_available);
-#ifdef CONFIG_X86_32
- set_task_gate(8, GDT_ENTRY_DOUBLEFAULT_TSS);
-#else
- set_intr_gate_ist(8, &double_fault, DOUBLEFAULT_STACK);
-#endif
- set_intr_gate(9, &coprocessor_segment_overrun);
- set_intr_gate(10, &invalid_TSS);
- set_intr_gate(11, &segment_not_present);
- set_intr_gate_ist(12, &stack_segment, STACKFAULT_STACK);
- set_intr_gate(13, &general_protection);
- set_intr_gate(15, &spurious_interrupt_bug);
- set_intr_gate(16, &coprocessor_error);
- set_intr_gate(17, &alignment_check);
-#ifdef CONFIG_X86_MCE
- set_intr_gate_ist(18, &machine_check, MCE_STACK);
-#endif
- set_intr_gate(19, &simd_coprocessor_error);
+ /*
+ * If tdx_handle_virt_exception() could not process
+ * it successfully, treat it as #GP(0) and handle it.
+ */
+ if (!tdx_handle_virt_exception(regs, &ve))
+ ve_raise_fault(regs, 0, ve.gla);
- /* Reserve all the builtin and the syscall vector: */
- for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
- set_bit(i, used_vectors);
+ cond_local_irq_disable(regs);
+}
-#ifdef CONFIG_IA32_EMULATION
- set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
- set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif
#ifdef CONFIG_X86_32
- if (cpu_has_fxsr) {
- printk(KERN_INFO "Enabling fast FPU save and restore... ");
- set_in_cr4(X86_CR4_OSFXSR);
- printk("done.\n");
- }
- if (cpu_has_xmm) {
- printk(KERN_INFO
- "Enabling unmasked SIMD FPU exception support... ");
- set_in_cr4(X86_CR4_OSXMMEXCPT);
- printk("done.\n");
+DEFINE_IDTENTRY_SW(iret_error)
+{
+ local_irq_enable();
+ if (notify_die(DIE_TRAP, "iret exception", regs, 0,
+ X86_TRAP_IRET, SIGILL) != NOTIFY_STOP) {
+ do_trap(X86_TRAP_IRET, SIGILL, "iret exception", regs, 0,
+ ILL_BADSTK, (void __user *)NULL);
}
-
- set_system_trap_gate(SYSCALL_VECTOR, &system_call);
- set_bit(SYSCALL_VECTOR, used_vectors);
+ local_irq_disable();
+}
#endif
- /*
- * Should be a barrier for any external CPU state:
- */
- cpu_init();
+void __init trap_init(void)
+{
+ /* Init cpu_entry_area before IST entries are set up */
+ setup_cpu_entry_areas();
- x86_init.irqs.trap_init();
+ /* Init GHCB memory pages when running as an SEV-ES guest */
+ sev_es_init_vc_handling();
+
+ /* Initialize TSS before setting up traps so ISTs work */
+ cpu_init_exception_handling(true);
+
+ /* Setup traps as cpu_init() might #GP */
+ if (!cpu_feature_enabled(X86_FEATURE_FRED))
+ idt_setup_traps();
+
+ cpu_init();
}
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 9faf91ae1841..7d3e13e14eab 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1,16 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
#include <linux/kernel.h>
#include <linux/sched.h>
+#include <linux/sched/clock.h>
#include <linux/init.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/timer.h>
#include <linux/acpi_pmtmr.h>
#include <linux/cpufreq.h>
-#include <linux/dmi.h>
#include <linux/delay.h>
#include <linux/clocksource.h>
+#include <linux/kvm_types.h>
#include <linux/percpu.h>
#include <linux/timex.h>
+#include <linux/static_key.h>
+#include <linux/static_call.h>
+#include <asm/cpuid/api.h>
#include <asm/hpet.h>
#include <asm/timer.h>
#include <asm/vgtod.h>
@@ -19,6 +26,14 @@
#include <asm/hypervisor.h>
#include <asm/nmi.h>
#include <asm/x86_init.h>
+#include <asm/geode.h>
+#include <asm/apic.h>
+#include <asm/cpu_device_id.h>
+#include <asm/i8259.h>
+#include <asm/msr.h>
+#include <asm/topology.h>
+#include <asm/uv/uv.h>
+#include <asm/sev.h>
unsigned int __read_mostly cpu_khz; /* TSC clocks / usec, not used here */
EXPORT_SYMBOL(cpu_khz);
@@ -26,23 +41,207 @@ EXPORT_SYMBOL(cpu_khz);
unsigned int __read_mostly tsc_khz;
EXPORT_SYMBOL(tsc_khz);
+#define KHZ 1000
+
/*
* TSC can be unstable due to cpufreq or due to unsynced TSCs
*/
static int __read_mostly tsc_unstable;
+static unsigned int __initdata tsc_early_khz;
+
+static DEFINE_STATIC_KEY_FALSE_RO(__use_tsc);
+
+int tsc_clocksource_reliable;
+
+static int __read_mostly tsc_force_recalibrate;
+
+static struct clocksource_base art_base_clk = {
+ .id = CSID_X86_ART,
+};
+static bool have_art;
+
+struct cyc2ns {
+ struct cyc2ns_data data[2]; /* 0 + 2*16 = 32 */
+ seqcount_latch_t seq; /* 32 + 4 = 36 */
+
+}; /* fits one cacheline */
+
+static DEFINE_PER_CPU_ALIGNED(struct cyc2ns, cyc2ns);
+
+static int __init tsc_early_khz_setup(char *buf)
+{
+ return kstrtouint(buf, 0, &tsc_early_khz);
+}
+early_param("tsc_early_khz", tsc_early_khz_setup);
+
+__always_inline void __cyc2ns_read(struct cyc2ns_data *data)
+{
+ int seq, idx;
+
+ do {
+ seq = this_cpu_read(cyc2ns.seq.seqcount.sequence);
+ idx = seq & 1;
+
+ data->cyc2ns_offset = this_cpu_read(cyc2ns.data[idx].cyc2ns_offset);
+ data->cyc2ns_mul = this_cpu_read(cyc2ns.data[idx].cyc2ns_mul);
+ data->cyc2ns_shift = this_cpu_read(cyc2ns.data[idx].cyc2ns_shift);
+
+ } while (unlikely(seq != this_cpu_read(cyc2ns.seq.seqcount.sequence)));
+}
+
+__always_inline void cyc2ns_read_begin(struct cyc2ns_data *data)
+{
+ preempt_disable_notrace();
+ __cyc2ns_read(data);
+}
+
+__always_inline void cyc2ns_read_end(void)
+{
+ preempt_enable_notrace();
+}
+
+/*
+ * Accelerators for sched_clock()
+ * convert from cycles (64 bits) => nanoseconds (64 bits)
+ * basic equation:
+ * ns = cycles / (freq / ns_per_sec)
+ * ns = cycles * (ns_per_sec / freq)
+ * ns = cycles * (10^9 / (cpu_khz * 10^3))
+ * ns = cycles * (10^6 / cpu_khz)
+ *
+ * Then we use scaling math (suggested by george@mvista.com) to get:
+ * ns = cycles * (10^6 * SC / cpu_khz) / SC
+ * ns = cycles * cyc2ns_scale / SC
+ *
+ * And since SC is a constant power of two, we can convert the div
+ * into a shift. The larger SC is, the more accurate the conversion, but
+ * cyc2ns_scale needs to be a 32-bit value so that 32-bit multiplication
+ * (64-bit result) can be used.
+ *
+ * We can use a kHz divisor instead of MHz for better precision.
+ * (mathieu.desnoyers@polymtl.ca)
+ *
+ * -johnstul@us.ibm.com "math is hard, lets go shopping!"
+ */
+
+static __always_inline unsigned long long __cycles_2_ns(unsigned long long cyc)
+{
+ struct cyc2ns_data data;
+ unsigned long long ns;
+
+ __cyc2ns_read(&data);
+
+ ns = data.cyc2ns_offset;
+ ns += mul_u64_u32_shr(cyc, data.cyc2ns_mul, data.cyc2ns_shift);
+
+ return ns;
+}
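The comment above compresses to: precompute mul = (10^6 << shift) / khz once, then every conversion is a multiply and a shift. A userspace sketch of that arithmetic with made-up numbers (the 128-bit intermediate stands in for the kernel's mul_u64_u32_shr(); __uint128_t is a GCC/Clang extension):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t khz = 3000000;		/* assume a 3.0 GHz TSC */
	unsigned int shift = 20;	/* illustrative; the kernel computes this */
	uint32_t mul = (uint32_t)(((uint64_t)1000000 << shift) / khz);

	uint64_t cycles = 3000000000ULL;	/* one second worth of cycles */
	uint64_t ns = (uint64_t)(((__uint128_t)cycles * mul) >> shift);

	/* Prints roughly 1e9 ns; truncating mul costs only a few ppm. */
	printf("mul = %u, %llu cycles -> %llu ns\n", mul,
	       (unsigned long long)cycles, (unsigned long long)ns);
	return 0;
}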
-/* native_sched_clock() is called before tsc_init(), so
- we must start with the TSC soft disabled to prevent
- erroneous rdtsc usage on !cpu_has_tsc processors */
-static int __read_mostly tsc_disabled = -1;
+static __always_inline unsigned long long cycles_2_ns(unsigned long long cyc)
+{
+ unsigned long long ns;
+ preempt_disable_notrace();
+ ns = __cycles_2_ns(cyc);
+ preempt_enable_notrace();
+ return ns;
+}
+
+static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+{
+ unsigned long long ns_now;
+ struct cyc2ns_data data;
+ struct cyc2ns *c2n;
+
+ ns_now = cycles_2_ns(tsc_now);
+
+ /*
+ * Compute a new multiplier as per the above comment and ensure our
+ * time function is continuous; see the comment near struct
+ * cyc2ns_data.
+ */
+ clocks_calc_mult_shift(&data.cyc2ns_mul, &data.cyc2ns_shift, khz,
+ NSEC_PER_MSEC, 0);
+
+ /*
+ * cyc2ns_shift is exported via arch_perf_update_userpage() where it is
+ * not expected to be greater than 31 due to the original published
+ * conversion algorithm shifting a 32-bit value (now specifies a 64-bit
+ * value) - see the perf_event_mmap_page documentation in perf_event.h.
+ */
+ if (data.cyc2ns_shift == 32) {
+ data.cyc2ns_shift = 31;
+ data.cyc2ns_mul >>= 1;
+ }
+
+ data.cyc2ns_offset = ns_now -
+ mul_u64_u32_shr(tsc_now, data.cyc2ns_mul, data.cyc2ns_shift);
+
+ c2n = per_cpu_ptr(&cyc2ns, cpu);
+
+ write_seqcount_latch_begin(&c2n->seq);
+ c2n->data[0] = data;
+ write_seqcount_latch(&c2n->seq);
+ c2n->data[1] = data;
+ write_seqcount_latch_end(&c2n->seq);
+}
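The offset computed above is what keeps sched_clock() continuous across a rescale: at the TSC value sampled during the switch, the old and the new conversion produce the same nanosecond reading. A sketch of that invariant (hypothetical mul/shift values; the 128-bit multiply again stands in for mul_u64_u32_shr()):

#include <stdint.h>
#include <stdio.h>

static uint64_t conv(uint64_t cyc, uint32_t mul, unsigned int shift, uint64_t off)
{
	return off + (uint64_t)(((__uint128_t)cyc * mul) >> shift);
}

int main(void)
{
	unsigned int shift = 20;
	uint32_t old_mul = 349525;	/* ~3.0 GHz */
	uint32_t new_mul = 419430;	/* ~2.5 GHz */
	uint64_t tsc_now = 1000000000ULL;

	uint64_t ns_now = conv(tsc_now, old_mul, shift, 0);
	uint64_t new_off = ns_now - conv(tsc_now, new_mul, shift, 0);

	/* Same reading at the switchover point, so no clock jump. */
	printf("old = %llu ns, new = %llu ns\n",
	       (unsigned long long)ns_now,
	       (unsigned long long)conv(tsc_now, new_mul, shift, new_off));
	return 0;
}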
+
+static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ sched_clock_idle_sleep_event();
+
+ if (khz)
+ __set_cyc2ns_scale(khz, cpu, tsc_now);
+
+ sched_clock_idle_wakeup_event();
+ local_irq_restore(flags);
+}
+
+/*
+ * Initialize cyc2ns for boot cpu
+ */
+static void __init cyc2ns_init_boot_cpu(void)
+{
+ struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+
+ seqcount_latch_init(&c2n->seq);
+ __set_cyc2ns_scale(tsc_khz, smp_processor_id(), rdtsc());
+}
+
+/*
+ * Secondary CPUs do not run through tsc_init(), so set up
+ * all the scale factors for all CPUs, assuming the same
+ * speed as the bootup CPU.
+ */
+static void __init cyc2ns_init_secondary_cpus(void)
+{
+ unsigned int cpu, this_cpu = smp_processor_id();
+ struct cyc2ns *c2n = this_cpu_ptr(&cyc2ns);
+ struct cyc2ns_data *data = c2n->data;
+
+ for_each_possible_cpu(cpu) {
+ if (cpu != this_cpu) {
+ seqcount_latch_init(&c2n->seq);
+ c2n = per_cpu_ptr(&cyc2ns, cpu);
+ c2n->data[0] = data[0];
+ c2n->data[1] = data[1];
+ }
+ }
+}
-static int tsc_clocksource_reliable;
/*
* Scheduler clock - returns current time in nanosec units.
*/
-u64 native_sched_clock(void)
+noinstr u64 native_sched_clock(void)
{
- u64 this_offset;
+ if (static_branch_likely(&__use_tsc)) {
+ u64 tsc_now = rdtsc();
+
+ /* return the value in ns */
+ return __cycles_2_ns(tsc_now);
+ }
/*
* Fall back to jiffies if there's no TSC available:
@@ -52,30 +251,46 @@ u64 native_sched_clock(void)
* very important for it to be as fast as the platform
* can achieve it. )
*/
- if (unlikely(tsc_disabled)) {
- /* No locking but a rare wrong value is not a big deal: */
- return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
- }
- /* read the Time Stamp Counter: */
- rdtscll(this_offset);
+ /* No locking but a rare wrong value is not a big deal: */
+ return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
+}
- /* return the value in ns */
- return __cycles_2_ns(this_offset);
+/*
+ * Generate a sched_clock if you already have a TSC value.
+ */
+u64 native_sched_clock_from_tsc(u64 tsc)
+{
+ return cycles_2_ns(tsc);
}
/* We need to define a real function for sched_clock, to override the
weak default version */
#ifdef CONFIG_PARAVIRT
-unsigned long long sched_clock(void)
+noinstr u64 sched_clock_noinstr(void)
{
return paravirt_sched_clock();
}
+
+bool using_native_sched_clock(void)
+{
+ return static_call_query(pv_sched_clock) == native_sched_clock;
+}
#else
-unsigned long long
-sched_clock(void) __attribute__((alias("native_sched_clock")));
+u64 sched_clock_noinstr(void) __attribute__((alias("native_sched_clock")));
+
+bool using_native_sched_clock(void) { return true; }
#endif
+notrace u64 sched_clock(void)
+{
+ u64 now;
+ preempt_disable_notrace();
+ now = sched_clock_noinstr();
+ preempt_enable_notrace();
+ return now;
+}
+
int check_tsc_unstable(void)
{
return tsc_unstable;
@@ -85,9 +300,7 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
#ifdef CONFIG_X86_TSC
int __init notsc_setup(char *str)
{
- printk(KERN_WARNING "notsc: Kernel compiled with CONFIG_X86_TSC, "
- "cannot disable TSC completely.\n");
- tsc_disabled = 1;
+ mark_tsc_unstable("boot parameter notsc");
return 1;
}
#else
@@ -104,24 +317,49 @@ int __init notsc_setup(char *str)
__setup("notsc", notsc_setup);
+static int no_sched_irq_time;
+static int no_tsc_watchdog;
+static int tsc_as_watchdog;
+
static int __init tsc_setup(char *str)
{
if (!strcmp(str, "reliable"))
tsc_clocksource_reliable = 1;
+ if (!strncmp(str, "noirqtime", 9))
+ no_sched_irq_time = 1;
+ if (!strcmp(str, "unstable"))
+ mark_tsc_unstable("boot parameter");
+ if (!strcmp(str, "nowatchdog")) {
+ no_tsc_watchdog = 1;
+ if (tsc_as_watchdog)
+ pr_alert("%s: Overriding earlier tsc=watchdog with tsc=nowatchdog\n",
+ __func__);
+ tsc_as_watchdog = 0;
+ }
+ if (!strcmp(str, "recalibrate"))
+ tsc_force_recalibrate = 1;
+ if (!strcmp(str, "watchdog")) {
+ if (no_tsc_watchdog)
+ pr_alert("%s: tsc=watchdog overridden by earlier tsc=nowatchdog\n",
+ __func__);
+ else
+ tsc_as_watchdog = 1;
+ }
return 1;
}
__setup("tsc=", tsc_setup);
-#define MAX_RETRIES 5
-#define SMI_TRESHOLD 50000
+#define MAX_RETRIES 5
+#define TSC_DEFAULT_THRESHOLD 0x20000
/*
- * Read TSC and the reference counters. Take care of SMI disturbance
+ * Read TSC and the reference counters. Take care of any disturbances
*/
static u64 tsc_read_refs(u64 *p, int hpet)
{
u64 t1, t2;
+ u64 thresh = tsc_khz ? tsc_khz >> 5 : TSC_DEFAULT_THRESHOLD;
int i;
for (i = 0; i < MAX_RETRIES; i++) {
@@ -131,7 +369,7 @@ static u64 tsc_read_refs(u64 *p, int hpet)
else
*p = acpi_pm_read_early();
t2 = get_cycles();
- if ((t2 - t1) < SMI_TRESHOLD)
+ if ((t2 - t1) < thresh)
return t2;
}
return ULLONG_MAX;
@@ -149,7 +387,7 @@ static unsigned long calc_hpet_ref(u64 deltatsc, u64 hpet1, u64 hpet2)
hpet2 -= hpet1;
tmp = ((u64)hpet2 * hpet_readl(HPET_PERIOD));
do_div(tmp, 1000000);
- do_div(deltatsc, tmp);
+ deltatsc = div64_u64(deltatsc, tmp);
return (unsigned long) deltatsc;
}
@@ -175,11 +413,11 @@ static unsigned long calc_pmtimer_ref(u64 deltatsc, u64 pm1, u64 pm2)
}
#define CAL_MS 10
-#define CAL_LATCH (CLOCK_TICK_RATE / (1000 / CAL_MS))
+#define CAL_LATCH (PIT_TICK_RATE / (1000 / CAL_MS))
#define CAL_PIT_LOOPS 1000
#define CAL2_MS 50
-#define CAL2_LATCH (CLOCK_TICK_RATE / (1000 / CAL2_MS))
+#define CAL2_LATCH (PIT_TICK_RATE / (1000 / CAL2_MS))
#define CAL2_PIT_LOOPS 5000
@@ -196,6 +434,20 @@ static unsigned long pit_calibrate_tsc(u32 latch, unsigned long ms, int loopmin)
unsigned long tscmin, tscmax;
int pitcnt;
+ if (!has_legacy_pic()) {
+ /*
+ * Relies on tsc_early_delay_calibrate() to have given us a
+ * semi-usable udelay(); wait for the same 50ms we would have
+ * spent in the PIT loop below.
+ */
+ udelay(10 * USEC_PER_MSEC);
+ udelay(10 * USEC_PER_MSEC);
+ udelay(10 * USEC_PER_MSEC);
+ udelay(10 * USEC_PER_MSEC);
+ udelay(10 * USEC_PER_MSEC);
+ return ULONG_MAX;
+ }
+
/* Set the Gate high, disable speaker */
outb((inb(0x61) & ~0x02) | 0x01, 0x61);
@@ -275,7 +527,7 @@ static unsigned long pit_calibrate_tsc(u32 latch, unsigned long ms, int loopmin)
* transition from one expected value to another with a fairly
* high accuracy, and we didn't miss any events. We can thus
* use the TSC value at the transitions to calculate a pretty
- * good value for the TSC frequencty.
+ * good value for the TSC frequency.
*/
static inline int pit_verify_msb(unsigned char val)
{
@@ -287,14 +539,15 @@ static inline int pit_verify_msb(unsigned char val)
static inline int pit_expect_msb(unsigned char val, u64 *tscp, unsigned long *deltap)
{
int count;
- u64 tsc = 0;
+ u64 tsc = 0, prev_tsc = 0;
for (count = 0; count < 50000; count++) {
if (!pit_verify_msb(val))
break;
+ prev_tsc = tsc;
tsc = get_cycles();
}
- *deltap = get_cycles() - tsc;
+ *deltap = get_cycles() - prev_tsc;
*tscp = tsc;
/*
@@ -308,9 +561,9 @@ static inline int pit_expect_msb(unsigned char val, u64 *tscp, unsigned long *de
* How many MSB values do we want to see? We aim for
* a maximum error rate of 500ppm (in practice the
* real error is much smaller), but refuse to spend
- * more than 25ms on it.
+ * more than 50ms on it.
*/
-#define MAX_QUICK_PIT_MS 25
+#define MAX_QUICK_PIT_MS 50
#define MAX_QUICK_PIT_ITERATIONS (MAX_QUICK_PIT_MS * PIT_TICK_RATE / 1000 / 256)
static unsigned long quick_pit_calibrate(void)
@@ -319,6 +572,9 @@ static unsigned long quick_pit_calibrate(void)
u64 tsc, delta;
unsigned long d1, d2;
+ if (!has_legacy_pic())
+ return 0;
+
/* Set the Gate high, disable speaker */
outb((inb(0x61) & ~0x02) | 0x01, 0x61);
@@ -350,10 +606,19 @@ static unsigned long quick_pit_calibrate(void)
if (!pit_expect_msb(0xff-i, &delta, &d2))
break;
+ delta -= tsc;
+
+ /*
+ * Extrapolate the error and fail fast if the error will
+ * never be below 500 ppm.
+ */
+ if (i == 1 &&
+ d1 + d2 >= (delta * MAX_QUICK_PIT_ITERATIONS) >> 11)
+ return 0;
+
/*
* Iterate until the error is less than 500 ppm
*/
- delta -= tsc;
if (d1+d2 >= delta >> 11)
continue;
@@ -369,7 +634,7 @@ static unsigned long quick_pit_calibrate(void)
goto success;
}
}
- printk("Fast TSC calibration failed\n");
+ pr_info("Fast TSC calibration failed\n");
return 0;
success:
@@ -380,37 +645,124 @@ success:
*
* As a result, we can depend on there not being
* any odd delays anywhere, and the TSC reads are
- * reliable (within the error). We also adjust the
- * delta to the middle of the error bars, just
- * because it looks nicer.
+ * reliable (within the error).
*
* kHz = ticks / time-in-seconds / 1000;
* kHz = (t2 - t1) / (I * 256 / PIT_TICK_RATE) / 1000
* kHz = ((t2 - t1) * PIT_TICK_RATE) / (I * 256 * 1000)
*/
- delta += (long)(d2 - d1)/2;
delta *= PIT_TICK_RATE;
do_div(delta, i*256*1000);
- printk("Fast TSC calibration using PIT\n");
+ pr_info("Fast TSC calibration using PIT\n");
return delta;
}
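/*
 * Editor's sketch (assumed numbers, not part of this patch): the kHz
 * formula above with concrete values. One PIT MSB step is 256 ticks of
 * the 1193182 Hz PIT clock (~214.6 us); after i = 30 clean steps
 * (~6.44 ms) a 3.1 GHz TSC accumulates delta ~= 19960000 cycles, and
 * 19960000 * 1193182 / (30 * 256 * 1000) ~= 3100000 kHz.
 */
static unsigned long example_pit_khz(u64 delta, int i)
{
	delta *= PIT_TICK_RATE;
	do_div(delta, i * 256 * 1000);
	return delta;
}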
/**
- * native_calibrate_tsc - calibrate the tsc on boot
+ * native_calibrate_tsc - determine TSC frequency
+ * Determine TSC frequency via CPUID, else return 0.
*/
unsigned long native_calibrate_tsc(void)
{
+ unsigned int eax_denominator, ebx_numerator, ecx_hz, edx;
+ unsigned int crystal_khz;
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ return 0;
+
+ if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+ return 0;
+
+ eax_denominator = ebx_numerator = ecx_hz = edx = 0;
+
+ /* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
+ cpuid(CPUID_LEAF_TSC, &eax_denominator, &ebx_numerator, &ecx_hz, &edx);
+
+ if (ebx_numerator == 0 || eax_denominator == 0)
+ return 0;
+
+ crystal_khz = ecx_hz / 1000;
+
+ /*
+ * Denverton SoCs don't report crystal clock, and also don't support
+ * CPUID_LEAF_FREQ for the calculation below, so hardcode the 25MHz
+ * crystal clock.
+ */
+ if (crystal_khz == 0 &&
+ boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT_D)
+ crystal_khz = 25000;
+
+ /*
+ * TSC frequency reported directly by CPUID is a "hardware reported"
+ * frequency and is the most accurate one we have so far. This
+ * is considered a known frequency.
+ */
+ if (crystal_khz != 0)
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+ /*
+ * Some Intel SoCs like Skylake and Kabylake don't report the crystal
+ * clock, but we can easily calculate it to a high degree of accuracy
+ * by considering the crystal ratio and the CPU speed.
+ */
+ if (crystal_khz == 0 && boot_cpu_data.cpuid_level >= CPUID_LEAF_FREQ) {
+ unsigned int eax_base_mhz, ebx, ecx, edx;
+
+ cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx, &ecx, &edx);
+ crystal_khz = eax_base_mhz * 1000 *
+ eax_denominator / ebx_numerator;
+ }
+
+ if (crystal_khz == 0)
+ return 0;
+
+ /*
+ * For Atom SoCs TSC is the only reliable clocksource.
+ * Mark the TSC reliable so that no watchdog runs on it.
+ */
+ if (boot_cpu_data.x86_vfm == INTEL_ATOM_GOLDMONT)
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ /*
+ * The local APIC appears to be fed by the core crystal clock
+ * (which sounds entirely sensible). We can set the global
+ * lapic_timer_period here to avoid having to calibrate the APIC
+ * timer later.
+ */
+ lapic_timer_period = crystal_khz * 1000 / HZ;
+#endif
+
+ return crystal_khz * ebx_numerator / eax_denominator;
+}
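/*
 * Editor's sketch (assumed example values, not part of this patch):
 * the CPUID 15H math above on a hypothetical part reporting
 * EAX (denominator) = 2 and EBX (numerator) = 216 with a 24000 kHz
 * crystal: TSC = 24000 * 216 / 2 = 2592000 kHz, i.e. 2.592 GHz.
 */
static unsigned long example_cpuid_tsc_khz(unsigned int crystal_khz,
					   unsigned int num, unsigned int den)
{
	return den ? (unsigned long)crystal_khz * num / den : 0;
}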
+
+static unsigned long cpu_khz_from_cpuid(void)
+{
+ unsigned int eax_base_mhz, ebx_max_mhz, ecx_bus_mhz, edx;
+
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+ return 0;
+
+ if (boot_cpu_data.cpuid_level < CPUID_LEAF_FREQ)
+ return 0;
+
+ eax_base_mhz = ebx_max_mhz = ecx_bus_mhz = edx = 0;
+
+ cpuid(CPUID_LEAF_FREQ, &eax_base_mhz, &ebx_max_mhz, &ecx_bus_mhz, &edx);
+
+ return eax_base_mhz * 1000;
+}
+
+/*
+ * Calibrate the CPU using the PIT, HPET and PM timer methods. These are
+ * available later in boot, after ACPI has been initialized.
+ */
+static unsigned long pit_hpet_ptimer_calibrate_cpu(void)
+{
u64 tsc1, tsc2, delta, ref1, ref2;
unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
- unsigned long flags, latch, ms, fast_calibrate;
+ unsigned long flags, latch, ms;
int hpet = is_hpet_enabled(), i, loopmin;
- local_irq_save(flags);
- fast_calibrate = quick_pit_calibrate();
- local_irq_restore(flags);
- if (fast_calibrate)
- return fast_calibrate;
-
/*
* Run 5 calibration loops to get the lowest frequency value
* (the best estimate). We use two different calibration modes
@@ -422,15 +774,15 @@ unsigned long native_calibrate_tsc(void)
* zero. In each wait loop iteration we read the TSC and check
* the delta to the previous read. We keep track of the min
* and max values of that delta. The delta is mostly defined
- * by the IO time of the PIT access, so we can detect when a
- * SMI/SMM disturbance happend between the two reads. If the
+ * by the IO time of the PIT access, so we can detect when
+ * any disturbance happened between the two reads. If the
* maximum time is significantly larger than the minimum time,
* then we discard the result and have another try.
*
* 2) Reference counter. If available we use the HPET or the
* PMTIMER as a reference to check the sanity of that value.
* We use separate TSC readouts and check inside of the
- * reference read for a SMI/SMM disturbance. We dicard
+ * reference read for any possible disturbance. We discard
* disturbed values here as well. We do that around the PIT
* calibration delay loop as we have to wait for a certain
* amount of time anyway.
@@ -460,10 +812,10 @@ unsigned long native_calibrate_tsc(void)
tsc_pit_min = min(tsc_pit_min, tsc_pit_khz);
/* hpet or pmtimer available ? */
- if (!hpet && !ref1 && !ref2)
+ if (ref1 == ref2)
continue;
- /* Check, whether the sampling was disturbed by an SMI */
+ /* Check, whether the sampling was disturbed */
if (tsc1 == ULLONG_MAX || tsc2 == ULLONG_MAX)
continue;
@@ -486,9 +838,8 @@ unsigned long native_calibrate_tsc(void)
* use the reference value, as it is more precise.
*/
if (delta >= 90 && delta <= 110) {
- printk(KERN_INFO
- "TSC: PIT calibration matches %s. %d loops\n",
- hpet ? "HPET" : "PMTIMER", i + 1);
+ pr_info("PIT calibration matches %s. %d loops\n",
+ hpet ? "HPET" : "PMTIMER", i + 1);
return tsc_ref_min;
}
@@ -510,38 +861,36 @@ unsigned long native_calibrate_tsc(void)
*/
if (tsc_pit_min == ULONG_MAX) {
/* PIT gave no useful value */
- printk(KERN_WARNING "TSC: Unable to calibrate against PIT\n");
+ pr_warn("Unable to calibrate against PIT\n");
/* We don't have an alternative source, disable TSC */
if (!hpet && !ref1 && !ref2) {
- printk("TSC: No reference (HPET/PMTIMER) available\n");
+ pr_notice("No reference (HPET/PMTIMER) available\n");
return 0;
}
/* The alternative source failed as well, disable TSC */
if (tsc_ref_min == ULONG_MAX) {
- printk(KERN_WARNING "TSC: HPET/PMTIMER calibration "
- "failed.\n");
+ pr_warn("HPET/PMTIMER calibration failed\n");
return 0;
}
/* Use the alternative source */
- printk(KERN_INFO "TSC: using %s reference calibration\n",
- hpet ? "HPET" : "PMTIMER");
+ pr_info("using %s reference calibration\n",
+ hpet ? "HPET" : "PMTIMER");
return tsc_ref_min;
}
/* We don't have an alternative source, use the PIT calibration value */
if (!hpet && !ref1 && !ref2) {
- printk(KERN_INFO "TSC: Using PIT calibration value\n");
+ pr_info("Using PIT calibration value\n");
return tsc_pit_min;
}
/* The alternative source failed, use the PIT calibration value */
if (tsc_ref_min == ULONG_MAX) {
- printk(KERN_WARNING "TSC: HPET/PMTIMER calibration failed. "
- "Using PIT calibration\n");
+ pr_warn("HPET/PMTIMER calibration failed. Using PIT calibration.\n");
return tsc_pit_min;
}
@@ -550,90 +899,119 @@ unsigned long native_calibrate_tsc(void)
* the PIT value as we know that there are PMTIMERs around
* running at double speed. At least we let the user know:
*/
- printk(KERN_WARNING "TSC: PIT calibration deviates from %s: %lu %lu.\n",
- hpet ? "HPET" : "PMTIMER", tsc_pit_min, tsc_ref_min);
- printk(KERN_INFO "TSC: Using PIT calibration value\n");
+ pr_warn("PIT calibration deviates from %s: %lu %lu\n",
+ hpet ? "HPET" : "PMTIMER", tsc_pit_min, tsc_ref_min);
+ pr_info("Using PIT calibration value\n");
return tsc_pit_min;
}
-int recalibrate_cpu_khz(void)
+/**
+ * native_calibrate_cpu_early - calibrate the CPU early in boot
+ */
+unsigned long native_calibrate_cpu_early(void)
+{
+ unsigned long flags, fast_calibrate = cpu_khz_from_cpuid();
+
+ if (!fast_calibrate)
+ fast_calibrate = cpu_khz_from_msr();
+ if (!fast_calibrate) {
+ local_irq_save(flags);
+ fast_calibrate = quick_pit_calibrate();
+ local_irq_restore(flags);
+ }
+ return fast_calibrate;
+}
+
+
+/**
+ * native_calibrate_cpu - calibrate the cpu
+ */
+static unsigned long native_calibrate_cpu(void)
+{
+ unsigned long tsc_freq = native_calibrate_cpu_early();
+
+ if (!tsc_freq)
+ tsc_freq = pit_hpet_ptimer_calibrate_cpu();
+
+ return tsc_freq;
+}
+
+void recalibrate_cpu_khz(void)
{
#ifndef CONFIG_SMP
unsigned long cpu_khz_old = cpu_khz;
- if (cpu_has_tsc) {
- tsc_khz = x86_platform.calibrate_tsc();
+ if (!boot_cpu_has(X86_FEATURE_TSC))
+ return;
+
+ cpu_khz = x86_platform.calibrate_cpu();
+ tsc_khz = x86_platform.calibrate_tsc();
+ if (tsc_khz == 0)
+ tsc_khz = cpu_khz;
+ else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
cpu_khz = tsc_khz;
- cpu_data(0).loops_per_jiffy =
- cpufreq_scale(cpu_data(0).loops_per_jiffy,
- cpu_khz_old, cpu_khz);
- return 0;
- } else
- return -ENODEV;
-#else
- return -ENODEV;
+ cpu_data(0).loops_per_jiffy = cpufreq_scale(cpu_data(0).loops_per_jiffy,
+ cpu_khz_old, cpu_khz);
#endif
}
+EXPORT_SYMBOL_GPL(recalibrate_cpu_khz);
-EXPORT_SYMBOL(recalibrate_cpu_khz);
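/*
 * Editor's sketch (illustrative, not part of this patch): the
 * "abs(cpu_khz - tsc_khz) * 10 > tsc_khz" test above rejects cpu_khz
 * when it deviates from tsc_khz by more than 10%; with
 * tsc_khz = 2400000, anything outside 2160000..2640000 is replaced by
 * tsc_khz.
 */
static bool example_deviates_over_10pct(long cpu, long tsc)
{
	return abs(cpu - tsc) * 10 > tsc;
}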
+static unsigned long long cyc2ns_suspend;
-/* Accelerators for sched_clock()
- * convert from cycles(64bits) => nanoseconds (64bits)
- * basic equation:
- * ns = cycles / (freq / ns_per_sec)
- * ns = cycles * (ns_per_sec / freq)
- * ns = cycles * (10^9 / (cpu_khz * 10^3))
- * ns = cycles * (10^6 / cpu_khz)
- *
- * Then we use scaling math (suggested by george@mvista.com) to get:
- * ns = cycles * (10^6 * SC / cpu_khz) / SC
- * ns = cycles * cyc2ns_scale / SC
- *
- * And since SC is a constant power of two, we can convert the div
- * into a shift.
- *
- * We can use khz divisor instead of mhz to keep a better precision, since
- * cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
- * (mathieu.desnoyers@polymtl.ca)
- *
- * -johnstul@us.ibm.com "math is hard, lets go shopping!"
- */
+void tsc_save_sched_clock_state(void)
+{
+ if (!static_branch_likely(&__use_tsc) && !sched_clock_stable())
+ return;
-DEFINE_PER_CPU(unsigned long, cyc2ns);
-DEFINE_PER_CPU(unsigned long long, cyc2ns_offset);
+ cyc2ns_suspend = sched_clock();
+}
-static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
+/*
+ * Even on processors with an invariant TSC, the TSC gets reset in some of
+ * the ACPI system sleep states. And in some systems the BIOS seems to
+ * reinit the TSC to an arbitrary value (still sync'd across CPUs) during
+ * resume from such sleep states. To cope with this, recompute the
+ * cyc2ns_offset for each CPU so that sched_clock() continues from the
+ * point where it left off during suspend.
+ */
+void tsc_restore_sched_clock_state(void)
{
- unsigned long long tsc_now, ns_now, *offset;
- unsigned long flags, *scale;
+ unsigned long long offset;
+ unsigned long flags;
+ int cpu;
+
+ if (!static_branch_likely(&__use_tsc) && !sched_clock_stable())
+ return;
local_irq_save(flags);
- sched_clock_idle_sleep_event();
- scale = &per_cpu(cyc2ns, cpu);
- offset = &per_cpu(cyc2ns_offset, cpu);
+ /*
+ * We're coming out of suspend, there's no concurrency yet; don't
+ * bother being nice about the RCU stuff, just write to both
+ * data fields.
+ */
- rdtscll(tsc_now);
- ns_now = __cycles_2_ns(tsc_now);
+ this_cpu_write(cyc2ns.data[0].cyc2ns_offset, 0);
+ this_cpu_write(cyc2ns.data[1].cyc2ns_offset, 0);
- if (cpu_khz) {
- *scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
- *offset = ns_now - (tsc_now * *scale >> CYC2NS_SCALE_FACTOR);
+ offset = cyc2ns_suspend - sched_clock();
+
+ for_each_possible_cpu(cpu) {
+ per_cpu(cyc2ns.data[0].cyc2ns_offset, cpu) = offset;
+ per_cpu(cyc2ns.data[1].cyc2ns_offset, cpu) = offset;
}
- sched_clock_idle_wakeup_event(0);
local_irq_restore(flags);
}
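/*
 * Editor's sketch (assumed numbers, not part of this patch): how the
 * offset above restores continuity. If sched_clock() read 100s right
 * before suspend (cyc2ns_suspend) and reads 2s after the TSC was reset
 * on resume, then offset = 100s - 2s = 98s is written into every CPU's
 * cyc2ns_offset, so the next sched_clock() read lands back at ~100s.
 */
static u64 example_resume_offset(u64 suspend_ns, u64 resume_ns)
{
	return suspend_ns - resume_ns;	/* applied via cyc2ns_offset */
}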
#ifdef CONFIG_CPU_FREQ
-
-/* Frequency scaling support. Adjust the TSC based timer when the cpu frequency
+/*
+ * Frequency scaling support. Adjust the TSC based timer when the CPU frequency
* changes.
*
- * RED-PEN: On SMP we assume all CPUs run with the same frequency. It's
- * not that important because current Opteron setups do not support
- * scaling on SMP anyroads.
+ * NOTE: On SMP the situation is not fixable in general, so simply mark the TSC
+ * as unstable and give up in those cases.
*
* Should fix up last_tsc too. Currently gettimeofday in the
* first tick after the change will be slightly wrong.
@@ -647,33 +1025,29 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
void *data)
{
struct cpufreq_freqs *freq = data;
- unsigned long *lpj;
- if (cpu_has(&cpu_data(freq->cpu), X86_FEATURE_CONSTANT_TSC))
+ if (num_online_cpus() > 1) {
+ mark_tsc_unstable("cpufreq changes on SMP");
return 0;
-
- lpj = &boot_cpu_data.loops_per_jiffy;
-#ifdef CONFIG_SMP
- if (!(freq->flags & CPUFREQ_CONST_LOOPS))
- lpj = &cpu_data(freq->cpu).loops_per_jiffy;
-#endif
+ }
if (!ref_freq) {
ref_freq = freq->old;
- loops_per_jiffy_ref = *lpj;
+ loops_per_jiffy_ref = boot_cpu_data.loops_per_jiffy;
tsc_khz_ref = tsc_khz;
}
+
if ((val == CPUFREQ_PRECHANGE && freq->old < freq->new) ||
- (val == CPUFREQ_POSTCHANGE && freq->old > freq->new) ||
- (val == CPUFREQ_RESUMECHANGE)) {
- *lpj = cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new);
+ (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
+ boot_cpu_data.loops_per_jiffy =
+ cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new);
tsc_khz = cpufreq_scale(tsc_khz_ref, ref_freq, freq->new);
if (!(freq->flags & CPUFREQ_CONST_LOOPS))
mark_tsc_unstable("cpufreq changes");
- }
- set_cyc2ns_scale(tsc_khz, freq->cpu);
+ set_cyc2ns_scale(tsc_khz, freq->policy->cpu, rdtsc());
+ }
return 0;
}
@@ -682,9 +1056,9 @@ static struct notifier_block time_cpufreq_notifier_block = {
.notifier_call = time_cpufreq_notifier
};
-static int __init cpufreq_tsc(void)
+static int __init cpufreq_register_tsc_scaling(void)
{
- if (!cpu_has_tsc)
+ if (!boot_cpu_has(X86_FEATURE_TSC))
return 0;
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
return 0;
@@ -693,134 +1067,212 @@ static int __init cpufreq_tsc(void)
return 0;
}
-core_initcall(cpufreq_tsc);
+core_initcall(cpufreq_register_tsc_scaling);
#endif /* CONFIG_CPU_FREQ */
+#define ART_MIN_DENOMINATOR (1)
+
+/*
+ * If ART is present detect the numerator:denominator to convert to TSC
+ */
+static void __init detect_art(void)
+{
+ unsigned int unused;
+
+ if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
+ return;
+
+ /*
+ * Don't enable ART in a VM; non-stop TSC and TSC_ADJUST are
+ * required, and TSC counter resets must not occur asynchronously.
+ */
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR) ||
+ !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) ||
+ !boot_cpu_has(X86_FEATURE_TSC_ADJUST) ||
+ tsc_async_resets)
+ return;
+
+ cpuid(CPUID_LEAF_TSC, &art_base_clk.denominator,
+ &art_base_clk.numerator, &art_base_clk.freq_khz, &unused);
+
+ art_base_clk.freq_khz /= KHZ;
+ if (art_base_clk.denominator < ART_MIN_DENOMINATOR)
+ return;
+
+ rdmsrq(MSR_IA32_TSC_ADJUST, art_base_clk.offset);
+
+ /* Make this sticky over multiple CPU init calls */
+ setup_force_cpu_cap(X86_FEATURE_ART);
+}
+
+
/* clocksource code */
-static struct clocksource clocksource_tsc;
+static void tsc_resume(struct clocksource *cs)
+{
+ tsc_verify_tsc_adjust(true);
+}
/*
- * We compare the TSC to the cycle_last value in the clocksource
+ * We used to compare the TSC to the cycle_last value in the clocksource
* structure to avoid a nasty time-warp. This can be observed in a
* very small window right after one CPU updated cycle_last under
* xtime/vsyscall_gtod lock and the other CPU reads a TSC value which
* is smaller than the cycle_last reference value due to a TSC which
- * is slighty behind. This delta is nowhere else observable, but in
+ * is slightly behind. This delta is nowhere else observable, but in
* that case it results in a forward time jump in the range of hours
* due to the unsigned delta calculation of the time keeping core
* code, which is necessary to support wrapping clocksources like pm
* timer.
+ *
+ * This sanity check is now done in the core timekeeping code by
+ * checking the result of read_tsc() - cycle_last for being negative.
+ * That works because CLOCKSOURCE_MASK(64) does not mask out any bit.
*/
-static cycle_t read_tsc(struct clocksource *cs)
+static u64 read_tsc(struct clocksource *cs)
{
- cycle_t ret = (cycle_t)get_cycles();
-
- return ret >= clocksource_tsc.cycle_last ?
- ret : clocksource_tsc.cycle_last;
+ return (u64)rdtsc_ordered();
}
-#ifdef CONFIG_X86_64
-static cycle_t __vsyscall_fn vread_tsc(void)
+static void tsc_cs_mark_unstable(struct clocksource *cs)
{
- cycle_t ret;
+ if (tsc_unstable)
+ return;
- /*
- * Surround the RDTSC by barriers, to make sure it's not
- * speculated to outside the seqlock critical section and
- * does not cause time warps:
- */
- rdtsc_barrier();
- ret = (cycle_t)vget_cycles();
- rdtsc_barrier();
+ tsc_unstable = 1;
+ if (using_native_sched_clock())
+ clear_sched_clock_stable();
+ disable_sched_clock_irqtime();
+ pr_info("Marking TSC unstable due to clocksource watchdog\n");
+}
- return ret >= __vsyscall_gtod_data.clock.cycle_last ?
- ret : __vsyscall_gtod_data.clock.cycle_last;
+static void tsc_cs_tick_stable(struct clocksource *cs)
+{
+ if (tsc_unstable)
+ return;
+
+ if (using_native_sched_clock())
+ sched_clock_tick_stable();
}
-#endif
-static void resume_tsc(struct clocksource *cs)
+static int tsc_cs_enable(struct clocksource *cs)
{
- clocksource_tsc.cycle_last = 0;
+ vclocks_set_used(VDSO_CLOCKMODE_TSC);
+ return 0;
}
-static struct clocksource clocksource_tsc = {
- .name = "tsc",
- .rating = 300,
- .read = read_tsc,
- .resume = resume_tsc,
- .mask = CLOCKSOURCE_MASK(64),
- .shift = 22,
- .flags = CLOCK_SOURCE_IS_CONTINUOUS |
+/*
+ * .mask MUST be CLOCKSOURCE_MASK(64). See comment above read_tsc()
+ */
+static struct clocksource clocksource_tsc_early = {
+ .name = "tsc-early",
+ .rating = 299,
+ .uncertainty_margin = 32 * NSEC_PER_MSEC,
+ .read = read_tsc,
+ .mask = CLOCKSOURCE_MASK(64),
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS |
CLOCK_SOURCE_MUST_VERIFY,
-#ifdef CONFIG_X86_64
- .vread = vread_tsc,
-#endif
+ .id = CSID_X86_TSC_EARLY,
+ .vdso_clock_mode = VDSO_CLOCKMODE_TSC,
+ .enable = tsc_cs_enable,
+ .resume = tsc_resume,
+ .mark_unstable = tsc_cs_mark_unstable,
+ .tick_stable = tsc_cs_tick_stable,
+ .list = LIST_HEAD_INIT(clocksource_tsc_early.list),
+};
+
+/*
+ * Must mark VALID_FOR_HRES early such that when we unregister tsc_early
+ * this one will immediately take over. We will only register if the TSC
+ * has been found good.
+ */
+static struct clocksource clocksource_tsc = {
+ .name = "tsc",
+ .rating = 300,
+ .read = read_tsc,
+ .mask = CLOCKSOURCE_MASK(64),
+ .flags = CLOCK_SOURCE_IS_CONTINUOUS |
+ CLOCK_SOURCE_VALID_FOR_HRES |
+ CLOCK_SOURCE_MUST_VERIFY |
+ CLOCK_SOURCE_VERIFY_PERCPU,
+ .id = CSID_X86_TSC,
+ .vdso_clock_mode = VDSO_CLOCKMODE_TSC,
+ .enable = tsc_cs_enable,
+ .resume = tsc_resume,
+ .mark_unstable = tsc_cs_mark_unstable,
+ .tick_stable = tsc_cs_tick_stable,
+ .list = LIST_HEAD_INIT(clocksource_tsc.list),
};
void mark_tsc_unstable(char *reason)
{
- if (!tsc_unstable) {
- tsc_unstable = 1;
- sched_clock_stable = 0;
- printk(KERN_INFO "Marking TSC unstable due to %s\n", reason);
- /* Change only the rating, when not registered */
- if (clocksource_tsc.mult)
- clocksource_mark_unstable(&clocksource_tsc);
- else {
- clocksource_tsc.flags |= CLOCK_SOURCE_UNSTABLE;
- clocksource_tsc.rating = 0;
- }
- }
+ if (tsc_unstable)
+ return;
+
+ tsc_unstable = 1;
+ if (using_native_sched_clock())
+ clear_sched_clock_stable();
+ disable_sched_clock_irqtime();
+ pr_info("Marking TSC unstable due to %s\n", reason);
+
+ clocksource_mark_unstable(&clocksource_tsc_early);
+ clocksource_mark_unstable(&clocksource_tsc);
}
EXPORT_SYMBOL_GPL(mark_tsc_unstable);
-static int __init dmi_mark_tsc_unstable(const struct dmi_system_id *d)
+static void __init tsc_disable_clocksource_watchdog(void)
{
- printk(KERN_NOTICE "%s detected: marking TSC unstable.\n",
- d->ident);
- tsc_unstable = 1;
- return 0;
+ clocksource_tsc_early.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
+ clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
}
-/* List of systems that have known TSC problems */
-static struct dmi_system_id __initdata bad_tsc_dmi_table[] = {
- {
- .callback = dmi_mark_tsc_unstable,
- .ident = "IBM Thinkpad 380XD",
- .matches = {
- DMI_MATCH(DMI_BOARD_VENDOR, "IBM"),
- DMI_MATCH(DMI_BOARD_NAME, "2635FA0"),
- },
- },
- {}
-};
+bool tsc_clocksource_watchdog_disabled(void)
+{
+ return !(clocksource_tsc.flags & CLOCK_SOURCE_MUST_VERIFY) &&
+ tsc_as_watchdog && !no_tsc_watchdog;
+}
static void __init check_system_tsc_reliable(void)
{
-#ifdef CONFIG_MGEODE_LX
- /* RTSC counts during suspend */
+#if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || defined(CONFIG_X86_GENERIC)
+ if (is_geode_lx()) {
+ /* RTSC counts during suspend */
#define RTSC_SUSP 0x100
- unsigned long res_low, res_high;
+ unsigned long res_low, res_high;
- rdmsr_safe(MSR_GEODE_BUSCONT_CONF0, &res_low, &res_high);
- /* Geode_LX - the OLPC CPU has a very reliable TSC */
- if (res_low & RTSC_SUSP)
- tsc_clocksource_reliable = 1;
+ rdmsr_safe(MSR_GEODE_BUSCONT_CONF0, &res_low, &res_high);
+ /* Geode_LX - the OLPC CPU has a very reliable TSC */
+ if (res_low & RTSC_SUSP)
+ tsc_clocksource_reliable = 1;
+ }
#endif
if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE))
tsc_clocksource_reliable = 1;
+
+ /*
+ * Disable the clocksource watchdog when the system has:
+ * - TSC running at constant frequency
+ * - TSC which does not stop in C-States
+ * - the TSC_ADJUST register which allows detecting even minimal
+ * modifications
+ * - not more than four packages
+ */
+ if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+ boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+ boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
+ topology_max_packages() <= 4)
+ tsc_disable_clocksource_watchdog();
}
/*
* Make an educated guess if the TSC is trustworthy and synchronized
* over all CPUs.
*/
-__cpuinit int unsynchronized_tsc(void)
+int unsynchronized_tsc(void)
{
- if (!cpu_has_tsc || tsc_unstable)
+ if (!boot_cpu_has(X86_FEATURE_TSC) || tsc_unstable)
return 1;
#ifdef CONFIG_SMP
@@ -830,140 +1282,325 @@ __cpuinit int unsynchronized_tsc(void)
if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
return 0;
+
+ if (tsc_clocksource_reliable)
+ return 0;
/*
* Intel systems are normally all synchronized.
* Exceptions must mark TSC as unstable:
*/
if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) {
/* assume multi socket systems are not synchronized: */
- if (num_possible_cpus() > 1)
- tsc_unstable = 1;
+ if (topology_max_packages() > 1)
+ return 1;
}
- return tsc_unstable;
+ return 0;
}
-static void __init init_tsc_clocksource(void)
+static void tsc_refine_calibration_work(struct work_struct *work);
+static DECLARE_DELAYED_WORK(tsc_irqwork, tsc_refine_calibration_work);
+/**
+ * tsc_refine_calibration_work - Further refine tsc freq calibration
+ * @work: ignored.
+ *
+ * This function uses delayed work over a period of a
+ * second to further refine the TSC frequency value. Since this is
+ * timer based, instead of loop based, we don't block the boot
+ * process while this longer calibration is done.
+ *
+ * If there are any calibration anomalies (too many SMIs, etc),
+ * or the refined calibration is off by more than 1% from the fast
+ * early calibration, we throw out the new calibration and use the
+ * early calibration.
+ */
+static void tsc_refine_calibration_work(struct work_struct *work)
{
- clocksource_tsc.mult = clocksource_khz2mult(tsc_khz,
- clocksource_tsc.shift);
- if (tsc_clocksource_reliable)
- clocksource_tsc.flags &= ~CLOCK_SOURCE_MUST_VERIFY;
- /* lower the rating if we already know its unstable: */
- if (check_tsc_unstable()) {
- clocksource_tsc.rating = 0;
- clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
+ static u64 tsc_start = ULLONG_MAX, ref_start;
+ static int hpet;
+ u64 tsc_stop, ref_stop, delta;
+ unsigned long freq;
+ int cpu;
+
+ /* Don't bother refining TSC on unstable systems */
+ if (tsc_unstable)
+ goto unreg;
+
+ /*
+ * Since the work is started early in boot, we may be
+ * delayed the first time we expire. So set the workqueue
+ * again once we know timers are working.
+ */
+ if (tsc_start == ULLONG_MAX) {
+restart:
+ /*
+ * Only set hpet once, to avoid mixing hardware
+ * if the hpet becomes enabled later.
+ */
+ hpet = is_hpet_enabled();
+ tsc_start = tsc_read_refs(&ref_start, hpet);
+ schedule_delayed_work(&tsc_irqwork, HZ);
+ return;
}
- clocksource_register(&clocksource_tsc);
+
+ tsc_stop = tsc_read_refs(&ref_stop, hpet);
+
+ /* hpet or pmtimer available ? */
+ if (ref_start == ref_stop)
+ goto out;
+
+ /* Check, whether the sampling was disturbed */
+ if (tsc_stop == ULLONG_MAX)
+ goto restart;
+
+ delta = tsc_stop - tsc_start;
+ delta *= 1000000LL;
+ if (hpet)
+ freq = calc_hpet_ref(delta, ref_start, ref_stop);
+ else
+ freq = calc_pmtimer_ref(delta, ref_start, ref_stop);
+
+ /* Will hit this only if tsc_force_recalibrate has been set */
+ if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
+
+ /* Warn if the deviation exceeds 500 ppm */
+ if (abs(tsc_khz - freq) > (tsc_khz >> 11)) {
+ pr_warn("Warning: TSC freq calibrated by CPUID/MSR differs from what is calibrated by HW timer, please check with vendor!!\n");
+ pr_info("Previous calibrated TSC freq:\t %lu.%03lu MHz\n",
+ (unsigned long)tsc_khz / 1000,
+ (unsigned long)tsc_khz % 1000);
+ }
+
+ pr_info("TSC freq recalibrated by [%s]:\t %lu.%03lu MHz\n",
+ hpet ? "HPET" : "PM_TIMER",
+ (unsigned long)freq / 1000,
+ (unsigned long)freq % 1000);
+
+ return;
+ }
+
+ /* Make sure we're within 1% */
+ if (abs(tsc_khz - freq) > tsc_khz/100)
+ goto out;
+
+ tsc_khz = freq;
+ pr_info("Refined TSC clocksource calibration: %lu.%03lu MHz\n",
+ (unsigned long)tsc_khz / 1000,
+ (unsigned long)tsc_khz % 1000);
+
+ /* Inform the TSC deadline clockevent devices about the recalibration */
+ lapic_update_tsc_freq();
+
+ /* Update the sched_clock() rate to match the clocksource one */
+ for_each_possible_cpu(cpu)
+ set_cyc2ns_scale(tsc_khz, cpu, tsc_stop);
+
+out:
+ if (tsc_unstable)
+ goto unreg;
+
+ if (boot_cpu_has(X86_FEATURE_ART)) {
+ have_art = true;
+ clocksource_tsc.base = &art_base_clk;
+ }
+ clocksource_register_khz(&clocksource_tsc, tsc_khz);
+unreg:
+ clocksource_unregister(&clocksource_tsc_early);
}
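/*
 * Editor's sketch (illustrative, not part of this patch): the ">> 11"
 * bound used above is a cheap ~500 ppm check, since 1/2048 ~= 488 ppm.
 * For tsc_khz = 2000000 the allowed deviation is 2000000 >> 11 =
 * 976 kHz before the vendor warning fires.
 */
static bool example_over_500ppm(long freq_khz, long ref_khz)
{
	return abs(freq_khz - ref_khz) > (ref_khz >> 11);
}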
-#ifdef CONFIG_X86_64
+
+static int __init init_tsc_clocksource(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TSC) || !tsc_khz)
+ return 0;
+
+ if (tsc_unstable) {
+ clocksource_unregister(&clocksource_tsc_early);
+ return 0;
+ }
+
+ if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3))
+ clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP;
+
+ /*
+ * When TSC frequency is known (retrieved via MSR or CPUID), we skip
+ * the refined calibration and directly register it as a clocksource.
+ */
+ if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) {
+ if (boot_cpu_has(X86_FEATURE_ART)) {
+ have_art = true;
+ clocksource_tsc.base = &art_base_clk;
+ }
+ clocksource_register_khz(&clocksource_tsc, tsc_khz);
+ clocksource_unregister(&clocksource_tsc_early);
+
+ if (!tsc_force_recalibrate)
+ return 0;
+ }
+
+ schedule_delayed_work(&tsc_irqwork, 0);
+ return 0;
+}
/*
- * calibrate_cpu is used on systems with fixed rate TSCs to determine
- * processor frequency
+ * We use device_initcall here to ensure we run after the HPET
+ * is fully initialized, which may occur at fs_initcall time.
*/
-#define TICK_COUNT 100000000
-static unsigned long __init calibrate_cpu(void)
-{
- int tsc_start, tsc_now;
- int i, no_ctr_free;
- unsigned long evntsel3 = 0, pmc3 = 0, pmc_now = 0;
- unsigned long flags;
+device_initcall(init_tsc_clocksource);
- for (i = 0; i < 4; i++)
- if (avail_to_resrv_perfctr_nmi_bit(i))
- break;
- no_ctr_free = (i == 4);
- if (no_ctr_free) {
- WARN(1, KERN_WARNING "Warning: AMD perfctrs busy ... "
- "cpu_khz value may be incorrect.\n");
- i = 3;
- rdmsrl(MSR_K7_EVNTSEL3, evntsel3);
- wrmsrl(MSR_K7_EVNTSEL3, 0);
- rdmsrl(MSR_K7_PERFCTR3, pmc3);
+static bool __init determine_cpu_tsc_frequencies(bool early)
+{
+ /* Make sure that cpu and tsc are not already calibrated */
+ WARN_ON(cpu_khz || tsc_khz);
+
+ if (early) {
+ cpu_khz = x86_platform.calibrate_cpu();
+ if (tsc_early_khz) {
+ tsc_khz = tsc_early_khz;
+ } else {
+ tsc_khz = x86_platform.calibrate_tsc();
+ clocksource_tsc.freq_khz = tsc_khz;
+ }
} else {
- reserve_perfctr_nmi(MSR_K7_PERFCTR0 + i);
- reserve_evntsel_nmi(MSR_K7_EVNTSEL0 + i);
+ /* We should not be here with non-native cpu calibration */
+ WARN_ON(x86_platform.calibrate_cpu != native_calibrate_cpu);
+ cpu_khz = pit_hpet_ptimer_calibrate_cpu();
}
- local_irq_save(flags);
- /* start measuring cycles, incrementing from 0 */
- wrmsrl(MSR_K7_PERFCTR0 + i, 0);
- wrmsrl(MSR_K7_EVNTSEL0 + i, 1 << 22 | 3 << 16 | 0x76);
- rdtscl(tsc_start);
- do {
- rdmsrl(MSR_K7_PERFCTR0 + i, pmc_now);
- tsc_now = get_cycles();
- } while ((tsc_now - tsc_start) < TICK_COUNT);
- local_irq_restore(flags);
- if (no_ctr_free) {
- wrmsrl(MSR_K7_EVNTSEL3, 0);
- wrmsrl(MSR_K7_PERFCTR3, pmc3);
- wrmsrl(MSR_K7_EVNTSEL3, evntsel3);
- } else {
- release_perfctr_nmi(MSR_K7_PERFCTR0 + i);
- release_evntsel_nmi(MSR_K7_EVNTSEL0 + i);
+ /*
+ * Trust a non-zero tsc_khz as authoritative,
+ * and use it to sanity check cpu_khz,
+ * which will be off if the system timer is off.
+ */
+ if (tsc_khz == 0)
+ tsc_khz = cpu_khz;
+ else if (abs(cpu_khz - tsc_khz) * 10 > tsc_khz)
+ cpu_khz = tsc_khz;
+
+ if (tsc_khz == 0)
+ return false;
+
+ pr_info("Detected %lu.%03lu MHz processor\n",
+ (unsigned long)cpu_khz / KHZ,
+ (unsigned long)cpu_khz % KHZ);
+
+ if (cpu_khz != tsc_khz) {
+ pr_info("Detected %lu.%03lu MHz TSC",
+ (unsigned long)tsc_khz / KHZ,
+ (unsigned long)tsc_khz % KHZ);
}
+ return true;
+}
+
+static unsigned long __init get_loops_per_jiffy(void)
+{
+ u64 lpj = (u64)tsc_khz * KHZ;
- return pmc_now * tsc_khz / (tsc_now - tsc_start);
+ do_div(lpj, HZ);
+ return lpj;
}
-#else
-static inline unsigned long calibrate_cpu(void) { return cpu_khz; }
-#endif
-void __init tsc_init(void)
+static void __init tsc_enable_sched_clock(void)
{
- u64 lpj;
- int cpu;
+ loops_per_jiffy = get_loops_per_jiffy();
+ use_tsc_delay();
- x86_init.timers.tsc_pre_init();
+ /* Sanitize TSC ADJUST before cyc2ns gets initialized */
+ tsc_store_and_check_tsc_adjust(true);
+ cyc2ns_init_boot_cpu();
+ static_branch_enable(&__use_tsc);
+}
- if (!cpu_has_tsc)
+void __init tsc_early_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_TSC))
+ return;
+ /* Don't change UV TSC multi-chassis synchronization */
+ if (is_early_uv_system())
return;
- tsc_khz = x86_platform.calibrate_tsc();
- cpu_khz = tsc_khz;
+ snp_secure_tsc_init();
- if (!tsc_khz) {
- mark_tsc_unstable("could not calculate TSC khz");
+ if (!determine_cpu_tsc_frequencies(true))
return;
- }
-
- if (cpu_has(&boot_cpu_data, X86_FEATURE_CONSTANT_TSC) &&
- (boot_cpu_data.x86_vendor == X86_VENDOR_AMD))
- cpu_khz = calibrate_cpu();
+ tsc_enable_sched_clock();
+}
- printk("Detected %lu.%03lu MHz processor.\n",
- (unsigned long)cpu_khz / 1000,
- (unsigned long)cpu_khz % 1000);
+void __init tsc_init(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_TSC)) {
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+ return;
+ }
/*
- * Secondary CPUs do not run through tsc_init(), so set up
- * all the scale factors for all CPUs, assuming the same
- * speed as the bootup CPU. (cpufreq notifiers will fix this
- * up if their speed diverges)
+ * native_calibrate_cpu_early can only calibrate using methods that are
+ * available early in boot.
*/
- for_each_possible_cpu(cpu)
- set_cyc2ns_scale(cpu_khz, cpu);
+ if (x86_platform.calibrate_cpu == native_calibrate_cpu_early)
+ x86_platform.calibrate_cpu = native_calibrate_cpu;
- if (tsc_disabled > 0)
- return;
+ if (!tsc_khz) {
+ /* We failed to determine frequencies earlier, try again */
+ if (!determine_cpu_tsc_frequencies(false)) {
+ mark_tsc_unstable("could not calculate TSC khz");
+ setup_clear_cpu_cap(X86_FEATURE_TSC_DEADLINE_TIMER);
+ return;
+ }
+ tsc_enable_sched_clock();
+ }
- /* now allow native_sched_clock() to use rdtsc */
- tsc_disabled = 0;
+ cyc2ns_init_secondary_cpus();
- lpj = ((u64)tsc_khz * 1000);
- do_div(lpj, HZ);
- lpj_fine = lpj;
+ if (!no_sched_irq_time)
+ enable_sched_clock_irqtime();
- use_tsc_delay();
- /* Check and install the TSC clocksource */
- dmi_check_system(bad_tsc_dmi_table);
+ lpj_fine = get_loops_per_jiffy();
- if (unsynchronized_tsc())
+ check_system_tsc_reliable();
+
+ if (unsynchronized_tsc()) {
mark_tsc_unstable("TSCs unsynchronized");
+ return;
+ }
- check_system_tsc_reliable();
- init_tsc_clocksource();
+ if (tsc_clocksource_reliable || no_tsc_watchdog)
+ tsc_disable_clocksource_watchdog();
+
+ clocksource_register_khz(&clocksource_tsc_early, tsc_khz);
+ detect_art();
}
+#ifdef CONFIG_SMP
+/*
+ * Check whether existing calibration data can be reused.
+ */
+unsigned long calibrate_delay_is_known(void)
+{
+ int sibling, cpu = smp_processor_id();
+ int constant_tsc = cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC);
+ const struct cpumask *mask = topology_core_cpumask(cpu);
+
+ /*
+ * If TSC has constant frequency and TSC is synchronized across
+ * sockets then reuse CPU0 calibration.
+ */
+ if (constant_tsc && !tsc_unstable)
+ return cpu_data(0).loops_per_jiffy;
+
+ /*
+ * If TSC has constant frequency and TSC is not synchronized across
+ * sockets and this is not the first CPU in the socket, then reuse
+ * the calibration value of an already online CPU on that socket.
+ *
+ * This assumes that CONSTANT_TSC is consistent for all CPUs in a
+ * socket.
+ */
+ if (!constant_tsc || !mask)
+ return 0;
+
+ sibling = cpumask_any_but(mask, cpu);
+ if (sibling < nr_cpu_ids)
+ return cpu_data(sibling).loops_per_jiffy;
+ return 0;
+}
+#endif
diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c
new file mode 100644
index 000000000000..48e6cc1cb017
--- /dev/null
+++ b/arch/x86/kernel/tsc_msr.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * TSC frequency enumeration via MSR
+ *
+ * Copyright (C) 2013, 2018 Intel Corporation
+ * Author: Bin Gao <bin.gao@intel.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/thread_info.h>
+
+#include <asm/apic.h>
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+#include <asm/msr.h>
+#include <asm/param.h>
+#include <asm/tsc.h>
+
+#define MAX_NUM_FREQS 16 /* 4 bits to select the frequency */
+
+/*
+ * The frequency numbers in the SDM are e.g. 83.3 MHz, which do not carry a
+ * lot of accuracy, and that leads to clock drift. As far as we know Bay Trail
+ * SoCs use a 25 MHz crystal and Cherry Trail uses a 19.2 MHz crystal; the
+ * crystal is the source clock for a root PLL which outputs 1600 and 100 MHz.
+ * It is unclear if the root PLL outputs are used directly by the CPU clock
+ * PLL or if there is another PLL in between.
+ * This does not matter though; we can model the chain of PLLs as a single PLL
+ * with a quotient equal to the quotients of all PLLs in the chain multiplied.
+ * So we can create a simplified model of the CPU clock setup using a reference
+ * clock of 100 MHz plus a quotient which gets us as close to the frequency
+ * from the SDM as possible.
+ * For the 83.3 MHz example from above this would give us 100 MHz * 5 / 6 =
+ * 83 and 1/3 MHz, which matches exactly what has been measured on actual hw.
+ */
+#define TSC_REFERENCE_KHZ 100000
+
+struct muldiv {
+ u32 multiplier;
+ u32 divider;
+};
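/*
 * Editor's sketch (illustrative, not part of this patch): the
 * simplified PLL model above turns an FSB table entry into a bus
 * clock, e.g. { 5, 6 } -> DIV_ROUND_CLOSEST(100000 * 5, 6) =
 * 83333 kHz, matching the 83.3333 MHz entries in the tables below.
 */
static inline u32 example_bus_khz(const struct muldiv *md)
{
	if (!md->divider)
		return 0;
	return DIV_ROUND_CLOSEST(TSC_REFERENCE_KHZ * md->multiplier,
				 md->divider);
}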
+
+/*
+ * If MSR_PERF_STAT[31] is set, the maximum resolved bus ratio can be
+ * read in MSR_PLATFORM_ID[12:8], otherwise in MSR_PERF_STAT[44:40].
+ * Unfortunately some Intel Atom SoCs aren't quite compliant with this,
+ * so we need to manually differentiate between SoC families. This is
+ * what the field use_msr_plat does.
+ */
+struct freq_desc {
+ bool use_msr_plat;
+ struct muldiv muldiv[MAX_NUM_FREQS];
+ /*
+ * Some CPU frequencies in the SDM do not map to known PLL freqs, in
+ * that case the muldiv array is empty and the freqs array is used.
+ */
+ u32 freqs[MAX_NUM_FREQS];
+ u32 mask;
+};
+
+/*
+ * Penwell and Clovertrail use a spread spectrum clock,
+ * so the frequency number is not exactly the same as the
+ * one reported by the MSR according to the SDM.
+ */
+static const struct freq_desc freq_desc_pnw = {
+ .use_msr_plat = false,
+ .freqs = { 0, 0, 0, 0, 0, 99840, 0, 83200 },
+ .mask = 0x07,
+};
+
+static const struct freq_desc freq_desc_clv = {
+ .use_msr_plat = false,
+ .freqs = { 0, 133200, 0, 0, 0, 99840, 0, 83200 },
+ .mask = 0x07,
+};
+
+/*
+ * Bay Trail SDM MSR_FSB_FREQ frequencies simplified PLL model:
+ * 000: 100 * 5 / 6 = 83.3333 MHz
+ * 001: 100 * 1 / 1 = 100.0000 MHz
+ * 010: 100 * 4 / 3 = 133.3333 MHz
+ * 011: 100 * 7 / 6 = 116.6667 MHz
+ * 100: 100 * 4 / 5 = 80.0000 MHz
+ */
+static const struct freq_desc freq_desc_byt = {
+ .use_msr_plat = true,
+ .muldiv = { { 5, 6 }, { 1, 1 }, { 4, 3 }, { 7, 6 },
+ { 4, 5 } },
+ .mask = 0x07,
+};
+
+/*
+ * Cherry Trail SDM MSR_FSB_FREQ frequencies simplified PLL model:
+ * 0000: 100 * 5 / 6 = 83.3333 MHz
+ * 0001: 100 * 1 / 1 = 100.0000 MHz
+ * 0010: 100 * 4 / 3 = 133.3333 MHz
+ * 0011: 100 * 7 / 6 = 116.6667 MHz
+ * 0100: 100 * 4 / 5 = 80.0000 MHz
+ * 0101: 100 * 14 / 15 = 93.3333 MHz
+ * 0110: 100 * 9 / 10 = 90.0000 MHz
+ * 0111: 100 * 8 / 9 = 88.8889 MHz
+ * 1000: 100 * 7 / 8 = 87.5000 MHz
+ */
+static const struct freq_desc freq_desc_cht = {
+ .use_msr_plat = true,
+ .muldiv = { { 5, 6 }, { 1, 1 }, { 4, 3 }, { 7, 6 },
+ { 4, 5 }, { 14, 15 }, { 9, 10 }, { 8, 9 },
+ { 7, 8 } },
+ .mask = 0x0f,
+};
+
+/*
+ * Merriefield SDM MSR_FSB_FREQ frequencies simplified PLL model:
+ * 0001: 100 * 1 / 1 = 100.0000 MHz
+ * 0010: 100 * 4 / 3 = 133.3333 MHz
+ */
+static const struct freq_desc freq_desc_tng = {
+ .use_msr_plat = true,
+ .muldiv = { { 0, 0 }, { 1, 1 }, { 4, 3 } },
+ .mask = 0x07,
+};
+
+/*
+ * Moorefield SDM MSR_FSB_FREQ frequencies simplified PLL model:
+ * 0000: 100 * 5 / 6 = 83.3333 MHz
+ * 0001: 100 * 1 / 1 = 100.0000 MHz
+ * 0010: 100 * 4 / 3 = 133.3333 MHz
+ * 0011: 100 * 1 / 1 = 100.0000 MHz
+ */
+static const struct freq_desc freq_desc_ann = {
+ .use_msr_plat = true,
+ .muldiv = { { 5, 6 }, { 1, 1 }, { 4, 3 }, { 1, 1 } },
+ .mask = 0x0f,
+};
+
+/*
+ * 24 MHz crystal? : 24 * 13 / 4 = 78 MHz
+ * The frequency step for the Lightning Mountain SoC is fixed at 78 MHz,
+ * so all the frequency entries are 78000.
+ */
+static const struct freq_desc freq_desc_lgm = {
+ .use_msr_plat = true,
+ .freqs = { 78000, 78000, 78000, 78000, 78000, 78000, 78000, 78000,
+ 78000, 78000, 78000, 78000, 78000, 78000, 78000, 78000 },
+ .mask = 0x0f,
+};
+
+static const struct x86_cpu_id tsc_msr_cpu_ids[] = {
+ X86_MATCH_VFM(INTEL_ATOM_SALTWELL_MID, &freq_desc_pnw),
+ X86_MATCH_VFM(INTEL_ATOM_SALTWELL_TABLET, &freq_desc_clv),
+ X86_MATCH_VFM(INTEL_ATOM_SILVERMONT, &freq_desc_byt),
+ X86_MATCH_VFM(INTEL_ATOM_SILVERMONT_MID, &freq_desc_tng),
+ X86_MATCH_VFM(INTEL_ATOM_AIRMONT, &freq_desc_cht),
+ X86_MATCH_VFM(INTEL_ATOM_SILVERMONT_MID2, &freq_desc_ann),
+ X86_MATCH_VFM(INTEL_ATOM_AIRMONT_NP, &freq_desc_lgm),
+ {}
+};
+
+/*
+ * MSR-based CPU/TSC frequency discovery for certain CPUs.
+ *
+ * Set the global "lapic_timer_period" to bus_clock_cycles/jiffy.
+ * Return the processor base frequency in kHz, or 0 on failure.
+ */
+unsigned long cpu_khz_from_msr(void)
+{
+ u32 lo, hi, ratio, freq, tscref;
+ const struct freq_desc *freq_desc;
+ const struct x86_cpu_id *id;
+ const struct muldiv *md;
+ unsigned long res;
+ int index;
+
+ id = x86_match_cpu(tsc_msr_cpu_ids);
+ if (!id)
+ return 0;
+
+ freq_desc = (struct freq_desc *)id->driver_data;
+ if (freq_desc->use_msr_plat) {
+ rdmsr(MSR_PLATFORM_INFO, lo, hi);
+ ratio = (lo >> 8) & 0xff;
+ } else {
+ rdmsr(MSR_IA32_PERF_STATUS, lo, hi);
+ ratio = (hi >> 8) & 0x1f;
+ }
+
+ /* Get FSB FREQ ID */
+ rdmsr(MSR_FSB_FREQ, lo, hi);
+ index = lo & freq_desc->mask;
+ md = &freq_desc->muldiv[index];
+
+ /*
+ * Note this also catches cases where the index points to an unpopulated
+ * part of muldiv; in that case the else branch sets freq and res to 0.
+ */
+ if (md->divider) {
+ tscref = TSC_REFERENCE_KHZ * md->multiplier;
+ freq = DIV_ROUND_CLOSEST(tscref, md->divider);
+ /*
+ * Multiplying by ratio before the division has better
+ * accuracy than just calculating freq * ratio.
+ */
+ res = DIV_ROUND_CLOSEST(tscref * ratio, md->divider);
+ } else {
+ freq = freq_desc->freqs[index];
+ res = freq * ratio;
+ }
+
+ if (freq == 0)
+ pr_err("Error MSR_FSB_FREQ index %d is unknown\n", index);
+
+#ifdef CONFIG_X86_LOCAL_APIC
+ lapic_timer_period = (freq * 1000) / HZ;
+#endif
+
+ /*
+ * TSC frequency determined by MSR is always considered "known"
+ * because it is reported by HW.
+ * Another fact is that on MSR-capable platforms the PIT/HPET is
+ * generally not available, so calibration won't work at all.
+ */
+ setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
+
+ /*
+ * Unfortunately there is no way for hardware to tell whether the
+ * TSC is reliable. We were told by the silicon design team that the
+ * TSC on Atom SoCs is always "reliable". The TSC is also the only
+ * reliable clocksource on these SoCs (the HPET is either not present
+ * or not functional), so mark the TSC reliable, which removes the
+ * requirement for a watchdog clocksource.
+ */
+ setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE);
+
+ return res;
+}
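/*
 * Editor's sketch (assumed values, not part of this patch): the math
 * above on a hypothetical Bay Trail part with FSB index 2
 * ({ 4, 3 } -> 133.333 MHz bus) and a maximum ratio of 16:
 * res = DIV_ROUND_CLOSEST(100000 * 4 * 16, 3) = 2133333 kHz,
 * i.e. roughly 2.133 GHz.
 */
static unsigned long example_msr_khz(u32 mult, u32 div, u32 ratio)
{
	u32 tscref = TSC_REFERENCE_KHZ * mult;

	return DIV_ROUND_CLOSEST(tscref * ratio, div);
}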
diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c
index 0aa5fed8b9e6..ec3aa340d351 100644
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* check TSC synchronization.
*
@@ -14,47 +15,264 @@
* ( The serial nature of the boot logic and the CPU hotplug lock
* protects against more than 2 CPUs entering this code. )
*/
+#include <linux/workqueue.h>
+#include <linux/topology.h>
#include <linux/spinlock.h>
#include <linux/kernel.h>
-#include <linux/init.h>
#include <linux/smp.h>
#include <linux/nmi.h>
+#include <asm/msr.h>
#include <asm/tsc.h>
+struct tsc_adjust {
+ s64 bootval;
+ s64 adjusted;
+ unsigned long nextcheck;
+ bool warned;
+};
+
+static DEFINE_PER_CPU(struct tsc_adjust, tsc_adjust);
+static struct timer_list tsc_sync_check_timer;
+
+/*
+ * TSCs on different sockets may be reset asynchronously.
+ * This may cause the TSC ADJUST value on socket 0 to be NOT 0.
+ */
+bool __read_mostly tsc_async_resets;
+
+void mark_tsc_async_resets(char *reason)
+{
+ if (tsc_async_resets)
+ return;
+ tsc_async_resets = true;
+ pr_info("tsc: Marking TSC async resets true due to %s\n", reason);
+}
+
+void tsc_verify_tsc_adjust(bool resume)
+{
+ struct tsc_adjust *adj = this_cpu_ptr(&tsc_adjust);
+ s64 curval;
+
+ if (!boot_cpu_has(X86_FEATURE_TSC_ADJUST))
+ return;
+
+ /* Skip unnecessary error messages if TSC already unstable */
+ if (check_tsc_unstable())
+ return;
+
+ /* Rate limit the MSR check */
+ if (!resume && time_before(jiffies, adj->nextcheck))
+ return;
+
+ adj->nextcheck = jiffies + HZ;
+
+ rdmsrq(MSR_IA32_TSC_ADJUST, curval);
+ if (adj->adjusted == curval)
+ return;
+
+ /* Restore the original value */
+ wrmsrq(MSR_IA32_TSC_ADJUST, adj->adjusted);
+
+ if (!adj->warned || resume) {
+ pr_warn(FW_BUG "TSC ADJUST differs: CPU%u %lld --> %lld. Restoring\n",
+ smp_processor_id(), adj->adjusted, curval);
+ adj->warned = true;
+ }
+}
+
+/*
+ * Normally tsc_sync will be checked every time the system enters idle
+ * state, but there is still the caveat that a system may never enter
+ * idle, either because it's too busy or purposely configured not to
+ * enter idle.
+ *
+ * So set up a periodic timer (every 10 minutes) to make sure the check
+ * is always on.
+ */
+
+#define SYNC_CHECK_INTERVAL (HZ * 600)
+
+static void tsc_sync_check_timer_fn(struct timer_list *unused)
+{
+ int next_cpu;
+
+ tsc_verify_tsc_adjust(false);
+
+ /* Run the check for all onlined CPUs in turn */
+ next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
+ if (next_cpu >= nr_cpu_ids)
+ next_cpu = cpumask_first(cpu_online_mask);
+
+ tsc_sync_check_timer.expires += SYNC_CHECK_INTERVAL;
+ add_timer_on(&tsc_sync_check_timer, next_cpu);
+}
+
+static int __init start_sync_check_timer(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_TSC_ADJUST) || tsc_clocksource_reliable)
+ return 0;
+
+ timer_setup(&tsc_sync_check_timer, tsc_sync_check_timer_fn, 0);
+ tsc_sync_check_timer.expires = jiffies + SYNC_CHECK_INTERVAL;
+ add_timer(&tsc_sync_check_timer);
+
+ return 0;
+}
+late_initcall(start_sync_check_timer);
+
+static void tsc_sanitize_first_cpu(struct tsc_adjust *cur, s64 bootval,
+ unsigned int cpu, bool bootcpu)
+{
+ /*
+ * First online CPU in a package stores the boot value in the
+ * adjustment value. This value might change later via the sync
+ * mechanism. If that fails we can still yell about boot values not
+ * being consistent.
+ *
+ * On the boot CPU we just force-set the ADJUST value to 0 if it's
+ * non-zero. We don't do that on non-boot CPUs because physical
+ * hotplug should have set the ADJUST register to a value > 0 so
+ * the TSC is in sync with the already running CPUs.
+ *
+ * Also don't force the ADJUST value to zero if that is a valid value
+ * for socket 0 as determined by the system arch. This is required
+ * when multiple sockets are reset asynchronously with each other
+ * and socket 0 may not have a TSC ADJUST value of 0.
+ */
+ if (bootcpu && bootval != 0) {
+ if (likely(!tsc_async_resets)) {
+ pr_warn(FW_BUG "TSC ADJUST: CPU%u: %lld force to 0\n",
+ cpu, bootval);
+ wrmsrq(MSR_IA32_TSC_ADJUST, 0);
+ bootval = 0;
+ } else {
+ pr_info("TSC ADJUST: CPU%u: %lld NOT forced to 0\n",
+ cpu, bootval);
+ }
+ }
+ cur->adjusted = bootval;
+}
+
+#ifndef CONFIG_SMP
+bool __init tsc_store_and_check_tsc_adjust(bool bootcpu)
+{
+ struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
+ s64 bootval;
+
+ if (!boot_cpu_has(X86_FEATURE_TSC_ADJUST))
+ return false;
+
+ /* Skip unnecessary error messages if TSC already unstable */
+ if (check_tsc_unstable())
+ return false;
+
+ rdmsrq(MSR_IA32_TSC_ADJUST, bootval);
+ cur->bootval = bootval;
+ cur->nextcheck = jiffies + HZ;
+ tsc_sanitize_first_cpu(cur, bootval, smp_processor_id(), bootcpu);
+ return false;
+}
+
+#else /* !CONFIG_SMP */
+
+/*
+ * Store and check the TSC ADJUST MSR if available
+ */
+bool tsc_store_and_check_tsc_adjust(bool bootcpu)
+{
+ struct tsc_adjust *ref, *cur = this_cpu_ptr(&tsc_adjust);
+ unsigned int refcpu, cpu = smp_processor_id();
+ struct cpumask *mask;
+ s64 bootval;
+
+ if (!boot_cpu_has(X86_FEATURE_TSC_ADJUST))
+ return false;
+
+ rdmsrq(MSR_IA32_TSC_ADJUST, bootval);
+ cur->bootval = bootval;
+ cur->nextcheck = jiffies + HZ;
+ cur->warned = false;
+
+ /*
+ * The default adjust value cannot be assumed to be zero on any socket.
+ */
+ cur->adjusted = bootval;
+
+ /*
+ * Check whether this CPU is the first in a package to come up. In
+ * this case do not check the boot value against another package
+ * because the new package might have been physically hotplugged,
+ * where TSC_ADJUST is expected to be different. When called on the
+ * boot CPU topology_core_cpumask() might not be available yet.
+ */
+ mask = topology_core_cpumask(cpu);
+ refcpu = mask ? cpumask_any_but(mask, cpu) : nr_cpu_ids;
+
+ if (refcpu >= nr_cpu_ids) {
+ tsc_sanitize_first_cpu(cur, bootval, smp_processor_id(),
+ bootcpu);
+ return false;
+ }
+
+ ref = per_cpu_ptr(&tsc_adjust, refcpu);
+ /*
+ * Compare the boot value and complain if it differs in the
+ * package.
+ */
+ if (bootval != ref->bootval)
+ printk_once(FW_BUG "TSC ADJUST differs within socket(s), fixing all errors\n");
+
+ /*
+ * The TSC_ADJUST values in a package must be the same. If the boot
+ * value on this newly upcoming CPU differs from the adjustment
+ * value of the already online CPU in this package, set it to that
+ * adjusted value.
+ */
+ if (bootval != ref->adjusted) {
+ cur->adjusted = ref->adjusted;
+ wrmsrq(MSR_IA32_TSC_ADJUST, ref->adjusted);
+ }
+ /*
+ * We have the TSCs forced to be in sync on this package. Skip sync
+ * test:
+ */
+ return true;
+}
+
/*
* Entry/exit counters that make sure that both CPUs
* run the measurement code at once:
*/
-static __cpuinitdata atomic_t start_count;
-static __cpuinitdata atomic_t stop_count;
+static atomic_t start_count;
+static atomic_t stop_count;
+static atomic_t test_runs;
/*
* We use a raw spinlock in this exceptional case, because
* we want to have the fastest, inlined, non-debug version
* of a critical section, to be able to prove TSC time-warps:
*/
-static __cpuinitdata arch_spinlock_t sync_lock = __ARCH_SPIN_LOCK_UNLOCKED;
+static arch_spinlock_t sync_lock = __ARCH_SPIN_LOCK_UNLOCKED;
-static __cpuinitdata cycles_t last_tsc;
-static __cpuinitdata cycles_t max_warp;
-static __cpuinitdata int nr_warps;
+static cycles_t last_tsc;
+static cycles_t max_warp;
+static int nr_warps;
+static int random_warps;
/*
- * TSC-warp measurement loop running on both CPUs:
+ * TSC-warp measurement loop running on both CPUs. This is not called
+ * if there is no TSC.
*/
-static __cpuinit void check_tsc_warp(void)
+static cycles_t check_tsc_warp(unsigned int timeout)
{
- cycles_t start, now, prev, end;
- int i;
+ cycles_t start, now, prev, end, cur_max_warp = 0;
+ int i, cur_warps = 0;
- rdtsc_barrier();
- start = get_cycles();
- rdtsc_barrier();
+ start = rdtsc_ordered();
/*
- * The measurement runs for 20 msecs:
+ * The measurement runs for 'timeout' msecs:
*/
- end = start + tsc_khz * 20ULL;
- now = start;
+ end = start + (cycles_t) tsc_khz * timeout;
for (i = 0; ; i++) {
/*
@@ -64,9 +282,7 @@ static __cpuinit void check_tsc_warp(void)
*/
arch_spin_lock(&sync_lock);
prev = last_tsc;
- rdtsc_barrier();
- now = get_cycles();
- rdtsc_barrier();
+ now = rdtsc_ordered();
last_tsc = now;
arch_spin_unlock(&sync_lock);
@@ -89,72 +305,111 @@ static __cpuinit void check_tsc_warp(void)
if (unlikely(prev > now)) {
arch_spin_lock(&sync_lock);
max_warp = max(max_warp, prev - now);
+ cur_max_warp = max_warp;
+ /*
+ * Check whether this bounces back and forth. Only
+ * one CPU should observe time going backwards.
+ */
+ if (cur_warps != nr_warps)
+ random_warps++;
nr_warps++;
+ cur_warps = nr_warps;
arch_spin_unlock(&sync_lock);
}
}
WARN(!(now-start),
"Warning: zero tsc calibration delta: %Ld [max: %Ld]\n",
now-start, end-start);
+ return cur_max_warp;
}
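/*
 * Editor's sketch of the invariant checked above (illustrative, not
 * part of this patch): updates of last_tsc are serialized by
 * sync_lock, so with synchronized TSCs every locked reader must see
 * now >= prev; prev > now means time warped by (prev - now) cycles.
 */
static inline cycles_t example_warp(cycles_t prev, cycles_t now)
{
	return prev > now ? prev - now : 0;
}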
/*
- * Source CPU calls into this - it waits for the freshly booted
- * target CPU to arrive and then starts the measurement:
+ * If the target CPU coming online doesn't have any of its core-siblings
+ * online, a timeout of 20msec will be used for the TSC-warp measurement
+ * loop. Otherwise a smaller timeout of 2msec will be used, as we have some
+ * information about this socket already (and this information grows as we
+ * have more and more logical-siblings in that socket).
+ *
+ * Ideally we should be able to skip the TSC sync check on the other
+ * core-siblings, if the first logical CPU in a socket passed the sync test.
+ * But as the TSC is per-logical CPU and can potentially be modified wrongly
+ * by the BIOS, a TSC sync test of smaller duration should be able
+ * to catch such errors. Also this will catch the condition where all the
+ * cores in the socket don't get reset at the same time.
*/
-void __cpuinit check_tsc_sync_source(int cpu)
+static inline unsigned int loop_timeout(int cpu)
{
- int cpus = 2;
+ return (cpumask_weight(topology_core_cpumask(cpu)) > 1) ? 2 : 20;
+}
- /*
- * No need to check if we already know that the TSC is not
- * synchronized:
- */
- if (unsynchronized_tsc())
- return;
+static void tsc_sync_mark_tsc_unstable(struct work_struct *work)
+{
+ mark_tsc_unstable("check_tsc_sync_source failed");
+}
- if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) {
- if (cpu == (nr_cpu_ids-1) || system_state != SYSTEM_BOOTING)
- pr_info(
- "Skipped synchronization checks as TSC is reliable.\n");
- return;
- }
+static DECLARE_WORK(tsc_sync_work, tsc_sync_mark_tsc_unstable);
- /*
- * Reset it - in case this is a second bootup:
- */
- atomic_set(&stop_count, 0);
+/*
+ * The freshly booted CPU initiates this via an async SMP function call.
+ */
+static void check_tsc_sync_source(void *__cpu)
+{
+ unsigned int cpu = (unsigned long)__cpu;
+ int cpus = 2;
/*
- * Wait for the target to arrive:
+ * Set the maximum number of test runs to
+ * 1 if the CPU does not provide the TSC_ADJUST MSR
+ * 3 if the MSR is available, so the target can try to adjust
*/
- while (atomic_read(&start_count) != cpus-1)
+ if (!boot_cpu_has(X86_FEATURE_TSC_ADJUST))
+ atomic_set(&test_runs, 1);
+ else
+ atomic_set(&test_runs, 3);
+retry:
+ /* Wait for the target to start. */
+ while (atomic_read(&start_count) != cpus - 1)
cpu_relax();
+
/*
* Trigger the target to continue into the measurement too:
*/
atomic_inc(&start_count);
- check_tsc_warp();
+ check_tsc_warp(loop_timeout(cpu));
while (atomic_read(&stop_count) != cpus-1)
cpu_relax();
- if (nr_warps) {
- pr_warning("TSC synchronization [CPU#%d -> CPU#%d]:\n",
+ /*
+ * If the test was successful set the number of runs to zero and
+ * stop. If not, decrement the number of runs and check if we can
+ * retry. In case of random warps no retry is attempted.
+ */
+ if (!nr_warps) {
+ atomic_set(&test_runs, 0);
+
+ pr_debug("TSC synchronization [CPU#%d -> CPU#%u]: passed\n",
smp_processor_id(), cpu);
- pr_warning("Measured %Ld cycles TSC warp between CPUs, "
- "turning off TSC clock.\n", max_warp);
- mark_tsc_unstable("check_tsc_sync_source failed");
- } else {
- pr_debug("TSC synchronization [CPU#%d -> CPU#%d]: passed\n",
+
+ } else if (atomic_dec_and_test(&test_runs) || random_warps) {
+ /* Force it to 0 if random warps brought us here */
+ atomic_set(&test_runs, 0);
+
+ pr_warn("TSC synchronization [CPU#%d -> CPU#%u]:\n",
smp_processor_id(), cpu);
+ pr_warn("Measured %Ld cycles TSC warp between CPUs, "
+ "turning off TSC clock.\n", max_warp);
+ if (random_warps)
+ pr_warn("TSC warped randomly between CPUs\n");
+ schedule_work(&tsc_sync_work);
}
/*
* Reset it - just in case we boot another CPU later:
*/
atomic_set(&start_count, 0);
+ random_warps = 0;
nr_warps = 0;
max_warp = 0;
last_tsc = 0;
@@ -163,18 +418,44 @@ void __cpuinit check_tsc_sync_source(int cpu)
* Let the target continue with the bootup:
*/
atomic_inc(&stop_count);
+
+ /*
+ * Retry, if there is a chance to do so.
+ */
+ if (atomic_read(&test_runs) > 0)
+ goto retry;
}
/*
* Freshly booted CPUs call into this:
*/
-void __cpuinit check_tsc_sync_target(void)
+void check_tsc_sync_target(void)
{
+ struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
+ unsigned int cpu = smp_processor_id();
+ cycles_t cur_max_warp, gbl_max_warp;
int cpus = 2;
- if (unsynchronized_tsc() || boot_cpu_has(X86_FEATURE_TSC_RELIABLE))
+ /* Also aborts if there is no TSC. */
+ if (unsynchronized_tsc())
+ return;
+
+ /*
+ * Store, verify and sanitize the TSC adjust register. If
+ * successful skip the test.
+ *
+ * The test is also skipped when the TSC is marked reliable. This
+ * is true for SoCs which have no fallback clocksource. On these
+ * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
+ * register might have been wrecked by the BIOS.
+ */
+ if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable)
return;
+ /* Kick the control CPU into the TSC synchronization function */
+ smp_call_function_single(cpumask_first(cpu_online_mask), check_tsc_sync_source,
+ (unsigned long *)(unsigned long)cpu, 0);
+retry:
/*
* Register this CPU's participation and wait for the
* source CPU to start the measurement:
@@ -183,7 +464,12 @@ void __cpuinit check_tsc_sync_target(void)
while (atomic_read(&start_count) != cpus)
cpu_relax();
- check_tsc_warp();
+ cur_max_warp = check_tsc_warp(loop_timeout(cpu));
+
+ /*
+ * Store the maximum observed warp value for a potential retry:
+ */
+ gbl_max_warp = max_warp;
/*
* Ok, we are done:
@@ -195,4 +481,47 @@ void __cpuinit check_tsc_sync_target(void)
*/
while (atomic_read(&stop_count) != cpus)
cpu_relax();
+
+ /*
+ * Reset it for the next sync test:
+ */
+ atomic_set(&stop_count, 0);
+
+ /*
+ * Check the number of remaining test runs. If not zero, the test
+ * failed and a retry with adjusted TSC is possible. If zero the
+ * test was either successful or failed terminally.
+ */
+ if (!atomic_read(&test_runs))
+ return;
+
+ /*
+ * If the warp value of this CPU is 0, then the other CPU
+ * observed time going backwards so this TSC was ahead and
+ * needs to move backwards.
+ */
+ if (!cur_max_warp)
+ cur_max_warp = -gbl_max_warp;
+
+ /*
+ * Add the result to the previous adjustment value.
+ *
+ * The adjustment value is slightly off by the overhead of the
+ * sync mechanism (observed values are ~200 TSC cycles), but this
+ * really depends on CPU, node distance and frequency. So
+ * compensating for this is hard to get right. Experiments show
+ * that the warp is no longer detectable when the observed warp
+ * value is used. In the worst case the adjustment needs to go
+ * through a 3rd run for fine tuning.
+ */
+ cur->adjusted += cur_max_warp;
+
+ pr_warn("TSC ADJUST compensate: CPU%u observed %lld warp. Adjust: %lld\n",
+ cpu, cur_max_warp, cur->adjusted);
+
+ wrmsrq(MSR_IA32_TSC_ADJUST, cur->adjusted);
+ goto retry;
+
}
+
+#endif /* CONFIG_SMP */
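
The sign convention in check_tsc_sync_target() above deserves a worked example: only the lagging CPU ever measures a positive warp itself, so a CPU that measured none must be the one running ahead and derives its correction by negating the globally observed maximum. A standalone sketch, with tsc_adjust_delta() as a hypothetical helper name:

#include <stdio.h>

typedef long long cycles_t;

/* Delta to add to this CPU's TSC_ADJUST value after a failed sync test. */
static cycles_t tsc_adjust_delta(cycles_t cur_max_warp, cycles_t gbl_max_warp)
{
        /* No warp seen locally: the other CPU saw one, so move backwards. */
        if (!cur_max_warp)
                cur_max_warp = -gbl_max_warp;
        return cur_max_warp;
}

int main(void)
{
        printf("%lld\n", tsc_adjust_delta(0, 250));     /* prints -250 */
        printf("%lld\n", tsc_adjust_delta(180, 180));   /* prints 180 */
        return 0;
}
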
diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c
new file mode 100644
index 000000000000..d432f3824f0c
--- /dev/null
+++ b/arch/x86/kernel/umip.c
@@ -0,0 +1,422 @@
+/*
+ * umip.c Emulation for instructions protected by the User-Mode Instruction
+ * Prevention feature
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ * Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
+ */
+
+#include <linux/uaccess.h>
+#include <asm/umip.h>
+#include <asm/traps.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
+#include <linux/ratelimit.h>
+
+#undef pr_fmt
+#define pr_fmt(fmt) "umip: " fmt
+
+/** DOC: Emulation for User-Mode Instruction Prevention (UMIP)
+ *
+ * User-Mode Instruction Prevention is a security feature present in recent
+ * x86 processors that, when enabled, prevents a group of instructions (SGDT,
+ * SIDT, SLDT, SMSW and STR) from being run in user mode by issuing a general
+ * protection fault if the instruction is executed with CPL > 0.
+ *
+ * Rather than relaying to user space the general protection fault caused by
+ * the UMIP-protected instructions (in the form of a SIGSEGV signal), the fault
+ * can be trapped and the result of such instructions emulated to provide dummy
+ * values. This both preserves the current kernel behavior and avoids revealing the
+ * system resources that UMIP intends to protect (i.e., the locations of the
+ * global descriptor and interrupt descriptor tables, the segment selectors of
+ * the local descriptor table, the value of the task state register and the
+ * contents of the CR0 register).
+ *
+ * This emulation is needed because certain applications (e.g., WineHQ and
+ * DOSEMU2) rely on this subset of instructions to function.
+ *
+ * The instructions protected by UMIP can be split in two groups. Those which
+ * return a kernel memory address (SGDT and SIDT) and those which return a
+ * value (SLDT, STR and SMSW).
+ *
+ * For the instructions that return a kernel memory address, applications
+ * such as WineHQ rely on the result being located in the kernel memory space,
+ * not the actual location of the table. The result is emulated as a hard-coded
+ * value that lies close to the top of the kernel memory. The limits for the GDT
+ * and the IDT are set to zero.
+ *
+ * The instruction SMSW is emulated to return the value that the register CR0
+ * has at boot time, as set in head_32.S.
+ * SLDT and STR are emulated to return the values that the kernel programmatically
+ * assigns:
+ * - SLDT returns (GDT_ENTRY_LDT * 8) if an LDT has been set, 0 if not.
+ * - STR returns (GDT_ENTRY_TSS * 8).
+ *
+ * Emulation is provided for both 32-bit and 64-bit processes.
+ *
+ * Care is taken to appropriately emulate the results when segmentation is
+ * used. That is, rather than relying on USER_DS and USER_CS, the function
+ * insn_get_addr_ref() inspects the segment descriptor pointed to by the
+ * registers in pt_regs. This ensures that we correctly obtain the segment
+ * base address and the address and operand sizes even if the user space
+ * application uses a local descriptor table.
+ */
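
To make the dummy-value behavior concrete, here is a hypothetical user-space test, not part of this patch (x86-64 and GCC inline asm assumed). On a UMIP-enabled kernel with this emulation it prints the hard-coded base and a zero limit instead of dying with SIGSEGV:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* SGDT stores a 2-byte limit followed by the base (8 bytes on x86-64). */
struct __attribute__((packed)) dtr {
        uint16_t limit;
        uint64_t base;
};

int main(void)
{
        struct dtr gdt = { 0, 0 };

        __asm__ volatile ("sgdt %0" : "=m" (gdt));
        printf("GDT base=0x%016" PRIx64 " limit=0x%04x\n", gdt.base, gdt.limit);
        return 0;
}
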
+
+#define UMIP_DUMMY_GDT_BASE 0xfffffffffffe0000ULL
+#define UMIP_DUMMY_IDT_BASE 0xffffffffffff0000ULL
+
+/*
+ * The SGDT and SIDT instructions store the contents of the global descriptor
+ * table and interrupt table registers, respectively. The destination is a
+ * memory operand of X+2 bytes. X bytes are used to store the base address of
+ * the table and 2 bytes are used to store the limit. In 32-bit processes X
+ * has a value of 4, in 64-bit processes X has a value of 8.
+ */
+#define UMIP_GDT_IDT_BASE_SIZE_64BIT 8
+#define UMIP_GDT_IDT_BASE_SIZE_32BIT 4
+#define UMIP_GDT_IDT_LIMIT_SIZE 2
+
+#define UMIP_INST_SGDT 0 /* 0F 01 /0 */
+#define UMIP_INST_SIDT 1 /* 0F 01 /1 */
+#define UMIP_INST_SMSW 2 /* 0F 01 /4 */
+#define UMIP_INST_SLDT 3 /* 0F 00 /0 */
+#define UMIP_INST_STR 4 /* 0F 00 /1 */
+
+static const char * const umip_insns[5] = {
+ [UMIP_INST_SGDT] = "SGDT",
+ [UMIP_INST_SIDT] = "SIDT",
+ [UMIP_INST_SMSW] = "SMSW",
+ [UMIP_INST_SLDT] = "SLDT",
+ [UMIP_INST_STR] = "STR",
+};
+
+#define umip_pr_err(regs, fmt, ...) \
+ umip_printk(regs, KERN_ERR, fmt, ##__VA_ARGS__)
+#define umip_pr_debug(regs, fmt, ...) \
+ umip_printk(regs, KERN_DEBUG, fmt, ##__VA_ARGS__)
+
+/**
+ * umip_printk() - Print a rate-limited message
+ * @regs: Register set with the context in which the warning is printed
+ * @log_level: Kernel log level to print the message
+ * @fmt: The text string to print
+ *
+ * Print the text contained in @fmt. The print rate is limited to bursts of 5
+ * messages every two minutes. The purpose of this customized version of
+ * printk() is to print messages when user space processes use any of the
+ * UMIP-protected instructions. Thus, the printed text is prepended with the
+ * task name and process ID number of the current task as well as the
+ * instruction and stack pointers in @regs as seen when entering kernel mode.
+ *
+ * Returns:
+ *
+ * None.
+ */
+static __printf(3, 4)
+void umip_printk(const struct pt_regs *regs, const char *log_level,
+ const char *fmt, ...)
+{
+ /* Bursts of 5 messages every two minutes */
+ static DEFINE_RATELIMIT_STATE(ratelimit, 2 * 60 * HZ, 5);
+ struct task_struct *tsk = current;
+ struct va_format vaf;
+ va_list args;
+
+ if (!__ratelimit(&ratelimit))
+ return;
+
+ va_start(args, fmt);
+ vaf.fmt = fmt;
+ vaf.va = &args;
+ printk("%s" pr_fmt("%s[%d] ip:%lx sp:%lx: %pV"), log_level, tsk->comm,
+ task_pid_nr(tsk), regs->ip, regs->sp, &vaf);
+ va_end(args);
+}
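
The burst semantics are easy to model outside the kernel. A toy fixed-window limiter with the same parameters, as a sketch of the concept only (ratelimit_ok() is made up; the real __ratelimit() differs in detail):

#include <stdio.h>
#include <time.h>

#define INTERVAL 120    /* seconds */
#define BURST    5

static int ratelimit_ok(void)
{
        static time_t window_start;
        static int printed;
        time_t now = time(NULL);

        if (now - window_start >= INTERVAL) {
                window_start = now;
                printed = 0;
        }
        return printed++ < BURST;
}

int main(void)
{
        for (int i = 0; i < 8; i++)
                if (ratelimit_ok())
                        printf("message %d\n", i);      /* only 0..4 print */
        return 0;
}
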
+
+/**
+ * identify_insn() - Identify a UMIP-protected instruction
+ * @insn: Instruction structure with opcode and ModRM byte.
+ *
+ * From the opcode and ModRM.reg in @insn identify, if any, a UMIP-protected
+ * instruction that can be emulated.
+ *
+ * Returns:
+ *
+ * On success, a constant identifying a specific UMIP-protected instruction that
+ * can be emulated.
+ *
+ * -EINVAL on error or when not an UMIP-protected instruction that can be
+ * emulated.
+ */
+static int identify_insn(struct insn *insn)
+{
+ /* By getting modrm we also get the opcode. */
+ insn_get_modrm(insn);
+
+ if (!insn->modrm.nbytes)
+ return -EINVAL;
+
+ /* The instructions of interest have 2-byte opcodes: 0F 00 or 0F 01. */
+ if (insn->opcode.nbytes < 2 || insn->opcode.bytes[0] != 0xf)
+ return -EINVAL;
+
+ if (insn->opcode.bytes[1] == 0x1) {
+ switch (X86_MODRM_REG(insn->modrm.value)) {
+ case 0:
+ /* The reg form of 0F 01 /0 encodes VMX instructions. */
+ if (X86_MODRM_MOD(insn->modrm.value) == 3)
+ return -EINVAL;
+
+ return UMIP_INST_SGDT;
+ case 1:
+ /*
+ * The reg form of 0F 01 /1 encodes MONITOR/MWAIT,
+ * STAC/CLAC, and ENCLS.
+ */
+ if (X86_MODRM_MOD(insn->modrm.value) == 3)
+ return -EINVAL;
+
+ return UMIP_INST_SIDT;
+ case 4:
+ return UMIP_INST_SMSW;
+ default:
+ return -EINVAL;
+ }
+ } else if (insn->opcode.bytes[1] == 0x0) {
+ if (X86_MODRM_REG(insn->modrm.value) == 0)
+ return UMIP_INST_SLDT;
+ else if (X86_MODRM_REG(insn->modrm.value) == 1)
+ return UMIP_INST_STR;
+ else
+ return -EINVAL;
+ } else {
+ return -EINVAL;
+ }
+}
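
The opcode/ModRM dispatch above can be exercised in isolation. A self-contained sketch with a hypothetical classify() helper (the kernel uses X86_MODRM_REG()/X86_MODRM_MOD() from its insn headers; the bit positions are the same):

#include <stdio.h>

#define MODRM_REG(m) (((m) >> 3) & 0x7) /* bits 5:3 select the group member */
#define MODRM_MOD(m) (((m) >> 6) & 0x3) /* mod == 3 means register form */

static const char *classify(unsigned char op2, unsigned char modrm)
{
        if (op2 == 0x01) {
                switch (MODRM_REG(modrm)) {
                case 0: return MODRM_MOD(modrm) == 3 ? "not UMIP" : "SGDT";
                case 1: return MODRM_MOD(modrm) == 3 ? "not UMIP" : "SIDT";
                case 4: return "SMSW";
                }
        } else if (op2 == 0x00) {
                if (MODRM_REG(modrm) == 0) return "SLDT";
                if (MODRM_REG(modrm) == 1) return "STR";
        }
        return "not UMIP";
}

int main(void)
{
        printf("%s\n", classify(0x01, 0x06));   /* 0F 01 /0, memory form: SGDT */
        printf("%s\n", classify(0x01, 0xc8));   /* 0F 01 /1, mod == 3: not UMIP */
        return 0;
}
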
+
+/**
+ * emulate_umip_insn() - Emulate UMIP instructions and return dummy values
+ * @insn: Instruction structure with operands
+ * @umip_inst: A constant indicating the instruction to emulate
+ * @data: Buffer into which the dummy result is stored
+ * @data_size: Size of the emulated result
+ * @x86_64: true if process is 64-bit, false otherwise
+ *
+ * Emulate an instruction protected by UMIP and provide a dummy result. The
+ * result of the emulation is saved in @data. The size of the results depends
+ * on both the instruction and type of operand (register vs memory address).
+ * The size of the result is updated in @data_size. Caller is responsible
+ * of providing a @data buffer of at least UMIP_GDT_IDT_BASE_SIZE +
+ * UMIP_GDT_IDT_LIMIT_SIZE bytes.
+ *
+ * Returns:
+ *
+ * 0 on success, -EINVAL on error while emulating.
+ */
+static int emulate_umip_insn(struct insn *insn, int umip_inst,
+ unsigned char *data, int *data_size, bool x86_64)
+{
+ if (!data || !data_size || !insn)
+ return -EINVAL;
+ /*
+ * These two instructions return the base address and limit of the
+ * global and interrupt descriptor table, respectively. According to the
+ * Intel Software Developer's Manual, the base address can be 24-bit,
+ * 32-bit or 64-bit. The limit is always 16-bit. If the operand size is
+ * 16-bit, the returned value of the base address is supposed to be a
+ * zero-extended 24-bit number. However, it seems that a 32-bit number
+ * is always returned irrespective of the operand size.
+ */
+ if (umip_inst == UMIP_INST_SGDT || umip_inst == UMIP_INST_SIDT) {
+ u64 dummy_base_addr;
+ u16 dummy_limit = 0;
+
+ /* SGDT and SIDT do not use register operands. */
+ if (X86_MODRM_MOD(insn->modrm.value) == 3)
+ return -EINVAL;
+
+ if (umip_inst == UMIP_INST_SGDT)
+ dummy_base_addr = UMIP_DUMMY_GDT_BASE;
+ else
+ dummy_base_addr = UMIP_DUMMY_IDT_BASE;
+
+ /*
+ * 64-bit processes use the entire dummy base address.
+ * 32-bit processes use the lower 32 bits of the base address.
+ * dummy_base_addr is always 64 bits, but we memcpy the correct
+ * number of bytes from it to the destination.
+ */
+ if (x86_64)
+ *data_size = UMIP_GDT_IDT_BASE_SIZE_64BIT;
+ else
+ *data_size = UMIP_GDT_IDT_BASE_SIZE_32BIT;
+
+ memcpy(data + 2, &dummy_base_addr, *data_size);
+
+ *data_size += UMIP_GDT_IDT_LIMIT_SIZE;
+ memcpy(data, &dummy_limit, UMIP_GDT_IDT_LIMIT_SIZE);
+
+ } else if (umip_inst == UMIP_INST_SMSW || umip_inst == UMIP_INST_SLDT ||
+ umip_inst == UMIP_INST_STR) {
+ unsigned long dummy_value;
+
+ if (umip_inst == UMIP_INST_SMSW) {
+ dummy_value = CR0_STATE;
+ } else if (umip_inst == UMIP_INST_STR) {
+ dummy_value = GDT_ENTRY_TSS * 8;
+ } else if (umip_inst == UMIP_INST_SLDT) {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ down_read(&current->mm->context.ldt_usr_sem);
+ if (current->mm->context.ldt)
+ dummy_value = GDT_ENTRY_LDT * 8;
+ else
+ dummy_value = 0;
+ up_read(&current->mm->context.ldt_usr_sem);
+#else
+ dummy_value = 0;
+#endif
+ }
+
+ /*
+ * For these three instructions, the number of bytes to be copied
+ * into the result buffer is determined by whether the operand is
+ * a register or a memory location. If the operand is a register,
+ * return as many bytes as the operand size. If the operand is
+ * memory, return only the two least significant bytes.
+ */
+ if (X86_MODRM_MOD(insn->modrm.value) == 3)
+ *data_size = insn->opnd_bytes;
+ else
+ *data_size = 2;
+
+ memcpy(data, &dummy_value, *data_size);
+ } else {
+ return -EINVAL;
+ }
+
+ return 0;
+}
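
A quick worked example of the operand-size rule in that last branch, with result_size() as an illustrative stand-in:

#include <stdio.h>

static int result_size(int modrm_mod, int opnd_bytes)
{
        /* mod == 3 means a register operand; anything else is memory. */
        return modrm_mod == 3 ? opnd_bytes : 2;
}

int main(void)
{
        printf("SMSW %%eax copies %d bytes\n", result_size(3, 4));
        printf("SMSW m16  copies %d bytes\n", result_size(0, 4));
        return 0;
}
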
+
+/**
+ * force_sig_info_umip_fault() - Force a SIGSEGV with SEGV_MAPERR
+ * @addr: Address that caused the signal
+ * @regs: Register set containing the instruction pointer
+ *
+ * Force a SIGSEGV signal with SEGV_MAPERR as the error code. This function is
+ * intended to be used to provide a segmentation fault when the result of the
+ * UMIP emulation could not be copied to the user space memory.
+ *
+ * Returns: none
+ */
+static void force_sig_info_umip_fault(void __user *addr, struct pt_regs *regs)
+{
+ struct task_struct *tsk = current;
+
+ tsk->thread.cr2 = (unsigned long)addr;
+ tsk->thread.error_code = X86_PF_USER | X86_PF_WRITE;
+ tsk->thread.trap_nr = X86_TRAP_PF;
+
+ force_sig_fault(SIGSEGV, SEGV_MAPERR, addr);
+
+ if (!(show_unhandled_signals && unhandled_signal(tsk, SIGSEGV)))
+ return;
+
+ umip_pr_err(regs, "segfault in emulation. error%x\n",
+ X86_PF_USER | X86_PF_WRITE);
+}
+
+/**
+ * fixup_umip_exception() - Fixup a general protection fault caused by UMIP
+ * @regs: Registers as saved when entering the #GP handler
+ *
+ * The instructions SGDT, SIDT, STR, SMSW and SLDT cause a general protection
+ * fault if executed with CPL > 0 (i.e., from user space). This function fixes
+ * the exception up and provides dummy results for SGDT, SIDT and SMSW; STR
+ * and SLDT are not fixed up.
+ *
+ * If operands are memory addresses, results are copied to user-space memory as
+ * indicated by the instruction pointed to by eIP, using the registers indicated in
+ * the instruction operands. If operands are registers, results are copied into
+ * the context that was saved when entering kernel mode.
+ *
+ * Returns:
+ *
+ * True if emulation was successful; false if not.
+ */
+bool fixup_umip_exception(struct pt_regs *regs)
+{
+ int nr_copied, reg_offset, dummy_data_size, umip_inst;
+ /* 10 bytes is the maximum size of the result of UMIP instructions */
+ unsigned char dummy_data[10] = { 0 };
+ unsigned char buf[MAX_INSN_SIZE];
+ unsigned long *reg_addr;
+ void __user *uaddr;
+ struct insn insn;
+
+ if (!regs)
+ return false;
+
+ /*
+ * Give up on emulation if fetching the instruction failed. Should a
+ * page fault or a #GP be issued?
+ */
+ nr_copied = insn_fetch_from_user(regs, buf);
+ if (nr_copied <= 0)
+ return false;
+
+ if (!insn_decode_from_regs(&insn, regs, buf, nr_copied))
+ return false;
+
+ umip_inst = identify_insn(&insn);
+ if (umip_inst < 0)
+ return false;
+
+ umip_pr_debug(regs, "%s instruction cannot be used by applications.\n",
+ umip_insns[umip_inst]);
+
+ umip_pr_debug(regs, "For now, expensive software emulation returns the result.\n");
+
+ if (emulate_umip_insn(&insn, umip_inst, dummy_data, &dummy_data_size,
+ user_64bit_mode(regs)))
+ return false;
+
+ /*
+ * If operand is a register, write result to the copy of the register
+ * value that was pushed to the stack when entering into kernel mode.
+ * Upon exit, the value we write will be restored to the actual hardware
+ * register.
+ */
+ if (X86_MODRM_MOD(insn.modrm.value) == 3) {
+ reg_offset = insn_get_modrm_rm_off(&insn, regs);
+
+ /*
+ * Negative values are usually errors. In memory addressing,
+ * the exception is -EDOM. Since we expect a register operand,
+ * all negative values are errors.
+ */
+ if (reg_offset < 0)
+ return false;
+
+ reg_addr = (unsigned long *)((unsigned long)regs + reg_offset);
+ memcpy(reg_addr, dummy_data, dummy_data_size);
+ } else {
+ uaddr = insn_get_addr_ref(&insn, regs);
+ if ((unsigned long)uaddr == -1L)
+ return false;
+
+ nr_copied = copy_to_user(uaddr, dummy_data, dummy_data_size);
+ if (nr_copied > 0) {
+ /*
+ * If copy fails, send a signal and tell caller that
+ * fault was fixed up.
+ */
+ force_sig_info_umip_fault(uaddr, regs);
+ return true;
+ }
+ }
+
+ /* increase IP to let the program keep going */
+ regs->ip += insn.length;
+ return true;
+}
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
new file mode 100644
index 000000000000..d8ba93778ae3
--- /dev/null
+++ b/arch/x86/kernel/unwind_frame.c
@@ -0,0 +1,419 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/sched.h>
+#include <linux/sched/task.h>
+#include <linux/sched/task_stack.h>
+#include <linux/interrupt.h>
+#include <asm/sections.h>
+#include <asm/ptrace.h>
+#include <asm/bitops.h>
+#include <asm/stacktrace.h>
+#include <asm/unwind.h>
+
+#define FRAME_HEADER_SIZE (sizeof(long) * 2)
+
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+ if (unwind_done(state))
+ return 0;
+
+ return __kernel_text_address(state->ip) ? state->ip : 0;
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+ if (unwind_done(state))
+ return NULL;
+
+ return state->regs ? &state->regs->ip : state->bp + 1;
+}
+
+static void unwind_dump(struct unwind_state *state)
+{
+ static bool dumped_before = false;
+ bool prev_zero, zero = false;
+ unsigned long word, *sp;
+ struct stack_info stack_info = {0};
+ unsigned long visit_mask = 0;
+
+ if (dumped_before)
+ return;
+
+ dumped_before = true;
+
+ printk_deferred("unwind stack type:%d next_sp:%p mask:0x%lx graph_idx:%d\n",
+ state->stack_info.type, state->stack_info.next_sp,
+ state->stack_mask, state->graph_idx);
+
+ for (sp = PTR_ALIGN(state->orig_sp, sizeof(long)); sp;
+ sp = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
+ if (get_stack_info(sp, state->task, &stack_info, &visit_mask))
+ break;
+
+ for (; sp < stack_info.end; sp++) {
+
+ word = READ_ONCE_NOCHECK(*sp);
+
+ prev_zero = zero;
+ zero = word == 0;
+
+ if (zero) {
+ if (!prev_zero)
+ printk_deferred("%p: %0*x ...\n",
+ sp, BITS_PER_LONG/4, 0);
+ continue;
+ }
+
+ printk_deferred("%p: %0*lx (%pB)\n",
+ sp, BITS_PER_LONG/4, word, (void *)word);
+ }
+ }
+}
+
+static bool in_entry_code(unsigned long ip)
+{
+ char *addr = (char *)ip;
+
+ return addr >= __entry_text_start && addr < __entry_text_end;
+}
+
+static inline unsigned long *last_frame(struct unwind_state *state)
+{
+ return (unsigned long *)task_pt_regs(state->task) - 2;
+}
+
+static bool is_last_frame(struct unwind_state *state)
+{
+ return state->bp == last_frame(state);
+}
+
+#ifdef CONFIG_X86_32
+#define GCC_REALIGN_WORDS 3
+#else
+#define GCC_REALIGN_WORDS 1
+#endif
+
+static inline unsigned long *last_aligned_frame(struct unwind_state *state)
+{
+ return last_frame(state) - GCC_REALIGN_WORDS;
+}
+
+static bool is_last_aligned_frame(struct unwind_state *state)
+{
+ unsigned long *last_bp = last_frame(state);
+ unsigned long *aligned_bp = last_aligned_frame(state);
+
+ /*
+ * GCC can occasionally decide to realign the stack pointer and change
+ * the offset of the stack frame in the prologue of a function called
+ * by head/entry code. Examples:
+ *
+ * <start_secondary>:
+ * push %edi
+ * lea 0x8(%esp),%edi
+ * and $0xfffffff8,%esp
+ * pushl -0x4(%edi)
+ * push %ebp
+ * mov %esp,%ebp
+ *
+ * <x86_64_start_kernel>:
+ * lea 0x8(%rsp),%r10
+ * and $0xfffffffffffffff0,%rsp
+ * pushq -0x8(%r10)
+ * push %rbp
+ * mov %rsp,%rbp
+ *
+ * After aligning the stack, it pushes a duplicate copy of the return
+ * address before pushing the frame pointer.
+ */
+ return (state->bp == aligned_bp && *(aligned_bp + 1) == *(last_bp + 1));
+}
+
+static bool is_last_ftrace_frame(struct unwind_state *state)
+{
+ unsigned long *last_bp = last_frame(state);
+ unsigned long *last_ftrace_bp = last_bp - 3;
+
+ /*
+ * When unwinding from an ftrace handler of a function called by entry
+ * code, the stack layout of the last frame is:
+ *
+ * bp
+ * parent ret addr
+ * bp
+ * function ret addr
+ * parent ret addr
+ * pt_regs
+ * -----------------
+ */
+ return (state->bp == last_ftrace_bp &&
+ *state->bp == *(state->bp + 2) &&
+ *(state->bp + 1) == *(state->bp + 4));
+}
+
+static bool is_last_task_frame(struct unwind_state *state)
+{
+ return is_last_frame(state) || is_last_aligned_frame(state) ||
+ is_last_ftrace_frame(state);
+}
+
+/*
+ * This determines if the frame pointer actually contains an encoded pointer to
+ * pt_regs on the stack. See ENCODE_FRAME_POINTER.
+ */
+#ifdef CONFIG_X86_64
+static struct pt_regs *decode_frame_pointer(unsigned long *bp)
+{
+ unsigned long regs = (unsigned long)bp;
+
+ if (!(regs & 0x1))
+ return NULL;
+
+ return (struct pt_regs *)(regs & ~0x1);
+}
+#else
+static struct pt_regs *decode_frame_pointer(unsigned long *bp)
+{
+ unsigned long regs = (unsigned long)bp;
+
+ if (regs & 0x80000000)
+ return NULL;
+
+ return (struct pt_regs *)(regs | 0x80000000);
+}
+#endif
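
The tagging scheme is small enough to demonstrate stand-alone. This sketch mirrors the 64-bit variant, where the low bit (always clear in a genuine frame pointer) marks an encoded pt_regs pointer; the helper names are made up:

#include <stdio.h>
#include <stdint.h>

static uintptr_t encode_regs_pointer(uintptr_t regs)
{
        return regs | 0x1;      /* cf. ENCODE_FRAME_POINTER */
}

static uintptr_t decode_bp(uintptr_t bp, int *is_regs)
{
        *is_regs = bp & 0x1;
        return bp & ~(uintptr_t)0x1;
}

int main(void)
{
        int is_regs;
        /* x86-64 style kernel address, illustrative value only */
        uintptr_t bp = encode_regs_pointer((uintptr_t)0xffff888000a0b000ULL);
        uintptr_t regs = decode_bp(bp, &is_regs);

        printf("decoded=%#lx is_regs=%d\n", (unsigned long)regs, is_regs);
        return 0;
}
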
+
+/*
+ * While walking the stack, KMSAN may stomp on stale locals from other
+ * functions that were marked as uninitialized upon function exit, and
+ * now hold the call frame information for the current function (e.g. the frame
+ * pointer). Because KMSAN does not specifically mark call frames as
+ * initialized, false positive reports are possible. To prevent such reports,
+ * we mark the functions scanning the stack (here and below) with
+ * __no_kmsan_checks.
+ */
+__no_kmsan_checks
+static bool update_stack_state(struct unwind_state *state,
+ unsigned long *next_bp)
+{
+ struct stack_info *info = &state->stack_info;
+ enum stack_type prev_type = info->type;
+ struct pt_regs *regs;
+ unsigned long *frame, *prev_frame_end, *addr_p, addr;
+ size_t len;
+
+ if (state->regs)
+ prev_frame_end = (void *)state->regs + sizeof(*state->regs);
+ else
+ prev_frame_end = (void *)state->bp + FRAME_HEADER_SIZE;
+
+ /* Is the next frame pointer an encoded pointer to pt_regs? */
+ regs = decode_frame_pointer(next_bp);
+ if (regs) {
+ frame = (unsigned long *)regs;
+ len = sizeof(*regs);
+ state->got_irq = true;
+ } else {
+ frame = next_bp;
+ len = FRAME_HEADER_SIZE;
+ }
+
+ /*
+ * If the next bp isn't on the current stack, switch to the next one.
+ *
+ * We may have to traverse multiple stacks to deal with the possibility
+ * that info->next_sp could point to an empty stack and the next bp
+ * could be on a subsequent stack.
+ */
+ while (!on_stack(info, frame, len))
+ if (get_stack_info(info->next_sp, state->task, info,
+ &state->stack_mask))
+ return false;
+
+ /* Make sure it only unwinds up and doesn't overlap the prev frame: */
+ if (state->orig_sp && state->stack_info.type == prev_type &&
+ frame < prev_frame_end)
+ return false;
+
+ /* Move state to the next frame: */
+ if (regs) {
+ state->regs = regs;
+ state->bp = NULL;
+ } else {
+ state->bp = next_bp;
+ state->regs = NULL;
+ }
+
+ /* Save the return address: */
+ if (state->regs && user_mode(state->regs))
+ state->ip = 0;
+ else {
+ addr_p = unwind_get_return_address_ptr(state);
+ addr = READ_ONCE_TASK_STACK(state->task, *addr_p);
+ state->ip = unwind_recover_ret_addr(state, addr, addr_p);
+ }
+
+ /* Save the original stack pointer for unwind_dump(): */
+ if (!state->orig_sp)
+ state->orig_sp = frame;
+
+ return true;
+}
+
+__no_kmsan_checks
+bool unwind_next_frame(struct unwind_state *state)
+{
+ struct pt_regs *regs;
+ unsigned long *next_bp;
+
+ if (unwind_done(state))
+ return false;
+
+ /* Have we reached the end? */
+ if (state->regs && user_mode(state->regs))
+ goto the_end;
+
+ if (is_last_task_frame(state)) {
+ regs = task_pt_regs(state->task);
+
+ /*
+ * kthreads (other than the boot CPU's idle thread) have some
+ * partial regs at the end of their stack which were placed
+ * there by copy_thread(). But the regs don't have any
+ * useful information, so we can skip them.
+ *
+ * This user_mode() check is slightly broader than a PF_KTHREAD
+ * check because it also catches the awkward situation where a
+ * newly forked kthread transitions into a user task by calling
+ * kernel_execve(), which eventually clears PF_KTHREAD.
+ */
+ if (!user_mode(regs))
+ goto the_end;
+
+ /*
+ * We're almost at the end, but not quite: there's still the
+ * syscall regs frame. Entry code doesn't encode the regs
+ * pointer for syscalls, so we have to set it manually.
+ */
+ state->regs = regs;
+ state->bp = NULL;
+ state->ip = 0;
+ return true;
+ }
+
+ /* Get the next frame pointer: */
+ if (state->next_bp) {
+ next_bp = state->next_bp;
+ state->next_bp = NULL;
+ } else if (state->regs) {
+ next_bp = (unsigned long *)state->regs->bp;
+ } else {
+ next_bp = (unsigned long *)READ_ONCE_TASK_STACK(state->task, *state->bp);
+ }
+
+ /* Move to the next frame if it's safe: */
+ if (!update_stack_state(state, next_bp))
+ goto bad_address;
+
+ return true;
+
+bad_address:
+ state->error = true;
+
+ /*
+ * When unwinding a non-current task, the task might actually be
+ * running on another CPU, in which case it could be modifying its
+ * stack while we're reading it. This is generally not a problem and
+ * can be ignored as long as the caller understands that unwinding
+ * another task will not always succeed.
+ */
+ if (state->task != current)
+ goto the_end;
+
+ /*
+ * Don't warn if the unwinder got lost due to an interrupt in entry
+ * code or in the C handler before the first frame pointer got set up:
+ */
+ if (state->got_irq && in_entry_code(state->ip))
+ goto the_end;
+ if (state->regs &&
+ state->regs->sp >= (unsigned long)last_aligned_frame(state) &&
+ state->regs->sp < (unsigned long)task_pt_regs(state->task))
+ goto the_end;
+
+ /*
+ * There are some known frame pointer issues on 32-bit. Disable
+ * unwinder warnings on 32-bit until it gets objtool support.
+ */
+ if (IS_ENABLED(CONFIG_X86_32))
+ goto the_end;
+
+ if (state->regs) {
+ printk_deferred_once(KERN_WARNING
+ "WARNING: kernel stack regs at %p in %s:%d has bad 'bp' value %p\n",
+ state->regs, state->task->comm,
+ state->task->pid, next_bp);
+ unwind_dump(state);
+ } else {
+ printk_deferred_once(KERN_WARNING
+ "WARNING: kernel stack frame pointer at %p in %s:%d has bad value %p\n",
+ state->bp, state->task->comm,
+ state->task->pid, next_bp);
+ unwind_dump(state);
+ }
+the_end:
+ state->stack_info.type = STACK_TYPE_UNKNOWN;
+ return false;
+}
+EXPORT_SYMBOL_GPL(unwind_next_frame);
+
+void __unwind_start(struct unwind_state *state, struct task_struct *task,
+ struct pt_regs *regs, unsigned long *first_frame)
+{
+ unsigned long *bp;
+
+ memset(state, 0, sizeof(*state));
+ state->task = task;
+ state->got_irq = (regs);
+
+ /* Don't even attempt to start from user mode regs: */
+ if (regs && user_mode(regs)) {
+ state->stack_info.type = STACK_TYPE_UNKNOWN;
+ return;
+ }
+
+ bp = get_frame_pointer(task, regs);
+
+ /*
+ * If we crash with IP==0, the last successfully executed instruction
+ * was probably an indirect function call with a NULL function pointer.
+ * That means that SP points into the middle of an incomplete frame:
+ * *SP is a return pointer, and *(SP-sizeof(unsigned long)) is where we
+ * would have written a frame pointer if we hadn't crashed.
+ * Pretend that the frame is complete and that BP points to it, but save
+ * the real BP so that we can use it when looking for the next frame.
+ */
+ if (regs && regs->ip == 0 && (unsigned long *)regs->sp >= first_frame) {
+ state->next_bp = bp;
+ bp = ((unsigned long *)regs->sp) - 1;
+ }
+
+ /* Initialize stack info and make sure the frame data is accessible: */
+ get_stack_info(bp, state->task, &state->stack_info,
+ &state->stack_mask);
+ update_stack_state(state, bp);
+
+ /*
+ * The caller can provide the address of the first frame directly
+ * (first_frame) or indirectly (regs->sp) to indicate which stack frame
+ * to start unwinding at. Skip ahead until we reach it.
+ */
+ while (!unwind_done(state) &&
+ (!on_stack(&state->stack_info, first_frame, sizeof(long)) ||
+ (state->next_bp == NULL && state->bp < first_frame)))
+ unwind_next_frame(state);
+}
+EXPORT_SYMBOL_GPL(__unwind_start);
diff --git a/arch/x86/kernel/unwind_guess.c b/arch/x86/kernel/unwind_guess.c
new file mode 100644
index 000000000000..884d68a6e714
--- /dev/null
+++ b/arch/x86/kernel/unwind_guess.c
@@ -0,0 +1,72 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/sched.h>
+#include <linux/ftrace.h>
+#include <asm/ptrace.h>
+#include <asm/bitops.h>
+#include <asm/stacktrace.h>
+#include <asm/unwind.h>
+
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+ unsigned long addr;
+
+ if (unwind_done(state))
+ return 0;
+
+ addr = READ_ONCE_NOCHECK(*state->sp);
+
+ return unwind_recover_ret_addr(state, addr, state->sp);
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+ return NULL;
+}
+
+bool unwind_next_frame(struct unwind_state *state)
+{
+ struct stack_info *info = &state->stack_info;
+
+ if (unwind_done(state))
+ return false;
+
+ do {
+ for (state->sp++; state->sp < info->end; state->sp++) {
+ unsigned long addr = READ_ONCE_NOCHECK(*state->sp);
+
+ if (__kernel_text_address(addr))
+ return true;
+ }
+
+ state->sp = PTR_ALIGN(info->next_sp, sizeof(long));
+
+ } while (!get_stack_info(state->sp, state->task, info,
+ &state->stack_mask));
+
+ return false;
+}
+EXPORT_SYMBOL_GPL(unwind_next_frame);
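
That scan is the entire algorithm: treat every stack word that happens to point into kernel text as a possible return address, which is why the guess unwinder's output may include stale entries. A toy version over an array, with a made-up text range and a hypothetical looks_like_text() helper:

#include <stdio.h>

#define TEXT_START 0x400000UL
#define TEXT_END   0x500000UL

static int looks_like_text(unsigned long addr)
{
        return addr >= TEXT_START && addr < TEXT_END;
}

int main(void)
{
        unsigned long stack[] = { 0x7fff0000UL, 0x401234UL, 0x0UL, 0x4ffff0UL };
        size_t i;

        for (i = 0; i < sizeof(stack) / sizeof(stack[0]); i++)
                if (looks_like_text(stack[i]))
                        printf("possible return address: %#lx\n", stack[i]);
        return 0;
}
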
+
+void __unwind_start(struct unwind_state *state, struct task_struct *task,
+ struct pt_regs *regs, unsigned long *first_frame)
+{
+ memset(state, 0, sizeof(*state));
+
+ state->task = task;
+ state->sp = PTR_ALIGN(first_frame, sizeof(long));
+
+ get_stack_info(first_frame, state->task, &state->stack_info,
+ &state->stack_mask);
+
+ /*
+ * The caller can provide the address of the first frame directly
+ * (first_frame) or indirectly (regs->sp) to indicate which stack frame
+ * to start unwinding at. Skip ahead until we reach it.
+ */
+ if (!unwind_done(state) &&
+ (!on_stack(&state->stack_info, first_frame, sizeof(long)) ||
+ !__kernel_text_address(*first_frame)))
+ unwind_next_frame(state);
+}
+EXPORT_SYMBOL_GPL(__unwind_start);
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
new file mode 100644
index 000000000000..977ee75e047c
--- /dev/null
+++ b/arch/x86/kernel/unwind_orc.c
@@ -0,0 +1,767 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/objtool.h>
+#include <linux/module.h>
+#include <linux/sort.h>
+#include <asm/ptrace.h>
+#include <asm/stacktrace.h>
+#include <asm/unwind.h>
+#include <asm/orc_types.h>
+#include <asm/orc_lookup.h>
+#include <asm/orc_header.h>
+
+ORC_HEADER;
+
+#define orc_warn(fmt, ...) \
+ printk_deferred_once(KERN_WARNING "WARNING: " fmt, ##__VA_ARGS__)
+
+#define orc_warn_current(args...) \
+({ \
+ static bool dumped_before; \
+ if (state->task == current && !state->error) { \
+ orc_warn(args); \
+ if (unwind_debug && !dumped_before) { \
+ dumped_before = true; \
+ unwind_dump(state); \
+ } \
+ } \
+})
+
+extern int __start_orc_unwind_ip[];
+extern int __stop_orc_unwind_ip[];
+extern struct orc_entry __start_orc_unwind[];
+extern struct orc_entry __stop_orc_unwind[];
+
+static bool orc_init __ro_after_init;
+static bool unwind_debug __ro_after_init;
+static unsigned int lookup_num_blocks __ro_after_init;
+
+static int __init unwind_debug_cmdline(char *str)
+{
+ unwind_debug = true;
+
+ return 0;
+}
+early_param("unwind_debug", unwind_debug_cmdline);
+
+static void unwind_dump(struct unwind_state *state)
+{
+ static bool dumped_before;
+ unsigned long word, *sp;
+ struct stack_info stack_info = {0};
+ unsigned long visit_mask = 0;
+
+ if (dumped_before)
+ return;
+
+ dumped_before = true;
+
+ printk_deferred("unwind stack type:%d next_sp:%p mask:0x%lx graph_idx:%d\n",
+ state->stack_info.type, state->stack_info.next_sp,
+ state->stack_mask, state->graph_idx);
+
+ for (sp = __builtin_frame_address(0); sp;
+ sp = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
+ if (get_stack_info(sp, state->task, &stack_info, &visit_mask))
+ break;
+
+ for (; sp < stack_info.end; sp++) {
+
+ word = READ_ONCE_NOCHECK(*sp);
+
+ printk_deferred("%0*lx: %0*lx (%pB)\n", BITS_PER_LONG/4,
+ (unsigned long)sp, BITS_PER_LONG/4,
+ word, (void *)word);
+ }
+ }
+}
+
+static inline unsigned long orc_ip(const int *ip)
+{
+ return (unsigned long)ip + *ip;
+}
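
orc_ip() decodes a self-relative entry: each slot in the ip table stores the target minus the slot's own address, which keeps entries 32-bit and position independent. A small demonstration (everything lives in one struct so the pointer arithmetic stays within a single object; the layout is invented):

#include <stdio.h>

static struct {
        int table[2];   /* self-relative "ip" entries */
        char text[64];  /* stand-in for kernel text */
} blob;

static unsigned long orc_ip(const int *ip)
{
        return (unsigned long)ip + *ip;
}

int main(void)
{
        blob.table[0] = (int)(&blob.text[16] - (char *)&blob.table[0]);
        blob.table[1] = (int)(&blob.text[48] - (char *)&blob.table[1]);

        printf("entry 0 ok: %d\n",
               orc_ip(&blob.table[0]) == (unsigned long)&blob.text[16]);
        printf("entry 1 ok: %d\n",
               orc_ip(&blob.table[1]) == (unsigned long)&blob.text[48]);
        return 0;
}
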
+
+static struct orc_entry *__orc_find(int *ip_table, struct orc_entry *u_table,
+ unsigned int num_entries, unsigned long ip)
+{
+ int *first = ip_table;
+ int *last = ip_table + num_entries - 1;
+ int *mid, *found = first;
+
+ if (!num_entries)
+ return NULL;
+
+ /*
+ * Do a binary range search to find the rightmost duplicate of a given
+ * starting address. Some entries are section terminators which are
+ * "weak" entries for ensuring there are no gaps. They should be
+ * ignored when they conflict with a real entry.
+ */
+ while (first <= last) {
+ mid = first + ((last - first) / 2);
+
+ if (orc_ip(mid) <= ip) {
+ found = mid;
+ first = mid + 1;
+ } else
+ last = mid - 1;
+ }
+
+ return u_table + (found - ip_table);
+}
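
The "rightmost duplicate" part is the only twist on a textbook binary search: on a match-or-smaller comparison the search keeps moving right, so a weak terminator that shares its start address with a real entry (and sorts before it, see orc_sort_cmp() below) is never the final answer. The same idea over plain integers, with a hypothetical find_rightmost():

#include <stdio.h>

static int find_rightmost(const int *starts, int n, int ip)
{
        int lo = 0, hi = n - 1, found = 0;

        while (lo <= hi) {
                int mid = lo + (hi - lo) / 2;

                if (starts[mid] <= ip) {
                        found = mid;    /* candidate; keep searching right */
                        lo = mid + 1;
                } else {
                        hi = mid - 1;
                }
        }
        return found;
}

int main(void)
{
        int starts[] = { 0, 16, 16, 32 };       /* duplicate start at 16 */

        printf("%d\n", find_rightmost(starts, 4, 20)); /* prints 2, not 1 */
        return 0;
}
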
+
+#ifdef CONFIG_MODULES
+static struct orc_entry *orc_module_find(unsigned long ip)
+{
+ struct module *mod;
+
+ mod = __module_address(ip);
+ if (!mod || !mod->arch.orc_unwind || !mod->arch.orc_unwind_ip)
+ return NULL;
+ return __orc_find(mod->arch.orc_unwind_ip, mod->arch.orc_unwind,
+ mod->arch.num_orcs, ip);
+}
+#else
+static struct orc_entry *orc_module_find(unsigned long ip)
+{
+ return NULL;
+}
+#endif
+
+#ifdef CONFIG_DYNAMIC_FTRACE
+static struct orc_entry *orc_find(unsigned long ip);
+
+/*
+ * Ftrace dynamic trampolines do not have orc entries of their own.
+ * But they are copies of the ftrace entries that are static and
+ * defined in ftrace_*.S, which do have orc entries.
+ *
+ * If the unwinder comes across a ftrace trampoline, then find the
+ * ftrace function that was used to create it, and use that ftrace
+ * function's orc entry, as the placement of the return code in
+ * the stack will be identical.
+ */
+static struct orc_entry *orc_ftrace_find(unsigned long ip)
+{
+ struct ftrace_ops *ops;
+ unsigned long tramp_addr, offset;
+
+ ops = ftrace_ops_trampoline(ip);
+ if (!ops)
+ return NULL;
+
+ /* Set tramp_addr to the start of the code copied by the trampoline */
+ if (ops->flags & FTRACE_OPS_FL_SAVE_REGS)
+ tramp_addr = (unsigned long)ftrace_regs_caller;
+ else
+ tramp_addr = (unsigned long)ftrace_caller;
+
+ /* Now place tramp_addr to the location within the trampoline ip is at */
+ offset = ip - ops->trampoline;
+ tramp_addr += offset;
+
+ /* Prevent unlikely recursion */
+ if (ip == tramp_addr)
+ return NULL;
+
+ return orc_find(tramp_addr);
+}
+#else
+static struct orc_entry *orc_ftrace_find(unsigned long ip)
+{
+ return NULL;
+}
+#endif
+
+/*
+ * If we crash with IP==0, the last successfully executed instruction
+ * was probably an indirect function call with a NULL function pointer,
+ * and we don't have unwind information for NULL.
+ * This hardcoded ORC entry for IP==0 allows us to unwind from a NULL function
+ * pointer into its parent and then continue normally from there.
+ */
+static struct orc_entry null_orc_entry = {
+ .sp_offset = sizeof(long),
+ .sp_reg = ORC_REG_SP,
+ .bp_reg = ORC_REG_UNDEFINED,
+ .type = ORC_TYPE_CALL
+};
+
+/* Fake frame pointer entry -- used as a fallback for generated code */
+static struct orc_entry orc_fp_entry = {
+ .type = ORC_TYPE_CALL,
+ .sp_reg = ORC_REG_BP,
+ .sp_offset = 16,
+ .bp_reg = ORC_REG_PREV_SP,
+ .bp_offset = -16,
+};
+
+static struct orc_entry *orc_find(unsigned long ip)
+{
+ static struct orc_entry *orc;
+
+ if (ip == 0)
+ return &null_orc_entry;
+
+ /* For non-init vmlinux addresses, use the fast lookup table: */
+ if (ip >= LOOKUP_START_IP && ip < LOOKUP_STOP_IP) {
+ unsigned int idx, start, stop;
+
+ idx = (ip - LOOKUP_START_IP) / LOOKUP_BLOCK_SIZE;
+
+ if (unlikely((idx >= lookup_num_blocks-1))) {
+ orc_warn("WARNING: bad lookup idx: idx=%u num=%u ip=%pB\n",
+ idx, lookup_num_blocks, (void *)ip);
+ return NULL;
+ }
+
+ start = orc_lookup[idx];
+ stop = orc_lookup[idx + 1] + 1;
+
+ if (unlikely((__start_orc_unwind + start >= __stop_orc_unwind) ||
+ (__start_orc_unwind + stop > __stop_orc_unwind))) {
+ orc_warn("WARNING: bad lookup value: idx=%u num=%u start=%u stop=%u ip=%pB\n",
+ idx, lookup_num_blocks, start, stop, (void *)ip);
+ return NULL;
+ }
+
+ return __orc_find(__start_orc_unwind_ip + start,
+ __start_orc_unwind + start, stop - start, ip);
+ }
+
+ /* vmlinux .init slow lookup: */
+ if (is_kernel_inittext(ip))
+ return __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
+ __stop_orc_unwind_ip - __start_orc_unwind_ip, ip);
+
+ /* Module lookup: */
+ orc = orc_module_find(ip);
+ if (orc)
+ return orc;
+
+ return orc_ftrace_find(ip);
+}
+
+#ifdef CONFIG_MODULES
+
+static DEFINE_MUTEX(sort_mutex);
+static int *cur_orc_ip_table = __start_orc_unwind_ip;
+static struct orc_entry *cur_orc_table = __start_orc_unwind;
+
+static void orc_sort_swap(void *_a, void *_b, int size)
+{
+ struct orc_entry *orc_a, *orc_b;
+ int *a = _a, *b = _b, tmp;
+ int delta = _b - _a;
+
+ /* Swap the .orc_unwind_ip entries: */
+ tmp = *a;
+ *a = *b + delta;
+ *b = tmp - delta;
+
+ /* Swap the corresponding .orc_unwind entries: */
+ orc_a = cur_orc_table + (a - cur_orc_ip_table);
+ orc_b = cur_orc_table + (b - cur_orc_ip_table);
+ swap(*orc_a, *orc_b);
+}
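
Swapping two self-relative entries is not a plain exchange: each value encodes the target minus the slot's own address, so a value moved to the other slot must be rebased, which is what the delta terms above do. A minimal sketch of the same adjustment (invented layout, hypothetical helper):

#include <stdio.h>

static void swap_self_relative(int *a, int *b)
{
        int delta = (int)((char *)b - (char *)a);
        int tmp = *a;

        *a = *b + delta;        /* b's target, now relative to a */
        *b = tmp - delta;       /* a's target, now relative to b */
}

int main(void)
{
        static struct { int slots[2]; char text[32]; } blob;
        int *s = blob.slots;

        s[0] = (int)(&blob.text[4] - (char *)&s[0]);
        s[1] = (int)(&blob.text[20] - (char *)&s[1]);
        swap_self_relative(&s[0], &s[1]);

        /* after the swap, slot 0 resolves to text[20] and slot 1 to text[4] */
        printf("%d %d\n",
               (char *)&s[0] + s[0] == &blob.text[20],
               (char *)&s[1] + s[1] == &blob.text[4]);
        return 0;
}
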
+
+static int orc_sort_cmp(const void *_a, const void *_b)
+{
+ struct orc_entry *orc_a;
+ const int *a = _a, *b = _b;
+ unsigned long a_val = orc_ip(a);
+ unsigned long b_val = orc_ip(b);
+
+ if (a_val > b_val)
+ return 1;
+ if (a_val < b_val)
+ return -1;
+
+ /*
+ * The "weak" section terminator entries need to always be first
+ * to ensure the lookup code skips them in favor of real entries.
+ * These terminator entries exist to handle any gaps created by
+ * whitelisted .o files which didn't get objtool generation.
+ */
+ orc_a = cur_orc_table + (a - cur_orc_ip_table);
+ return orc_a->type == ORC_TYPE_UNDEFINED ? -1 : 1;
+}
+
+void unwind_module_init(struct module *mod, void *_orc_ip, size_t orc_ip_size,
+ void *_orc, size_t orc_size)
+{
+ int *orc_ip = _orc_ip;
+ struct orc_entry *orc = _orc;
+ unsigned int num_entries = orc_ip_size / sizeof(int);
+
+ WARN_ON_ONCE(orc_ip_size % sizeof(int) != 0 ||
+ orc_size % sizeof(*orc) != 0 ||
+ num_entries != orc_size / sizeof(*orc));
+
+ /*
+ * The 'cur_orc_*' globals allow the orc_sort_swap() callback to
+ * associate an .orc_unwind_ip table entry with its corresponding
+ * .orc_unwind entry so they can both be swapped.
+ */
+ mutex_lock(&sort_mutex);
+ cur_orc_ip_table = orc_ip;
+ cur_orc_table = orc;
+ sort(orc_ip, num_entries, sizeof(int), orc_sort_cmp, orc_sort_swap);
+ mutex_unlock(&sort_mutex);
+
+ mod->arch.orc_unwind_ip = orc_ip;
+ mod->arch.orc_unwind = orc;
+ mod->arch.num_orcs = num_entries;
+}
+#endif
+
+void __init unwind_init(void)
+{
+ size_t orc_ip_size = (void *)__stop_orc_unwind_ip - (void *)__start_orc_unwind_ip;
+ size_t orc_size = (void *)__stop_orc_unwind - (void *)__start_orc_unwind;
+ size_t num_entries = orc_ip_size / sizeof(int);
+ struct orc_entry *orc;
+ int i;
+
+ if (!num_entries || orc_ip_size % sizeof(int) != 0 ||
+ orc_size % sizeof(struct orc_entry) != 0 ||
+ num_entries != orc_size / sizeof(struct orc_entry)) {
+ orc_warn("WARNING: Bad or missing .orc_unwind table. Disabling unwinder.\n");
+ return;
+ }
+
+ /*
+ * Note, the orc_unwind and orc_unwind_ip tables were already
+ * sorted at build time via the 'sorttable' tool.
+ * It's ready for binary search straight away, no need to sort it.
+ */
+
+ /* Initialize the fast lookup table: */
+ lookup_num_blocks = orc_lookup_end - orc_lookup;
+ for (i = 0; i < lookup_num_blocks-1; i++) {
+ orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind,
+ num_entries,
+ LOOKUP_START_IP + (LOOKUP_BLOCK_SIZE * i));
+ if (!orc) {
+ orc_warn("WARNING: Corrupt .orc_unwind table. Disabling unwinder.\n");
+ return;
+ }
+
+ orc_lookup[i] = orc - __start_orc_unwind;
+ }
+
+ /* Initialize the ending block: */
+ orc = __orc_find(__start_orc_unwind_ip, __start_orc_unwind, num_entries,
+ LOOKUP_STOP_IP);
+ if (!orc) {
+ orc_warn("WARNING: Corrupt .orc_unwind table. Disabling unwinder.\n");
+ return;
+ }
+ orc_lookup[lookup_num_blocks-1] = orc - __start_orc_unwind;
+
+ orc_init = true;
+}
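
The two-level scheme built here first maps an address to a fixed-size block, which bounds the subsequent binary search to a handful of entries. A user-space sketch of the same idea with made-up sizes (build_lookup() is a hypothetical stand-in for what unwind_init() precomputes):

#include <stdio.h>

#define BLOCK_SIZE 256
#define START_IP   0x1000u
#define NUM_BLOCKS 8

/* sorted table of entry start addresses */
static const unsigned int starts[] = { 0x1000, 0x1040, 0x1180, 0x1300, 0x1500 };
#define NUM_ENTRIES (sizeof(starts) / sizeof(starts[0]))

/* lookup[i] = index of the last entry starting at or before block i */
static unsigned int lookup[NUM_BLOCKS];

static void build_lookup(void)
{
        unsigned int i, e = 0;

        for (i = 0; i < NUM_BLOCKS; i++) {
                unsigned int block_start = START_IP + i * BLOCK_SIZE;

                while (e + 1 < NUM_ENTRIES && starts[e + 1] <= block_start)
                        e++;
                lookup[i] = e;
        }
}

int main(void)
{
        unsigned int ip = 0x1340;
        unsigned int idx = (ip - START_IP) / BLOCK_SIZE;

        build_lookup();
        /* the binary search only has to scan [lookup[idx], lookup[idx + 1]] */
        printf("block %u: search entries %u..%u\n",
               idx, lookup[idx], lookup[idx + 1]);
        return 0;
}
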
+
+unsigned long unwind_get_return_address(struct unwind_state *state)
+{
+ if (unwind_done(state))
+ return 0;
+
+ return __kernel_text_address(state->ip) ? state->ip : 0;
+}
+EXPORT_SYMBOL_GPL(unwind_get_return_address);
+
+unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
+{
+ if (unwind_done(state))
+ return NULL;
+
+ if (state->regs)
+ return &state->regs->ip;
+
+ if (state->sp)
+ return (unsigned long *)state->sp - 1;
+
+ return NULL;
+}
+
+static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
+ size_t len)
+{
+ struct stack_info *info = &state->stack_info;
+ void *addr = (void *)_addr;
+
+ if (on_stack(info, addr, len))
+ return true;
+
+ return !get_stack_info(addr, state->task, info, &state->stack_mask) &&
+ on_stack(info, addr, len);
+}
+
+static bool deref_stack_reg(struct unwind_state *state, unsigned long addr,
+ unsigned long *val)
+{
+ if (!stack_access_ok(state, addr, sizeof(long)))
+ return false;
+
+ *val = READ_ONCE_NOCHECK(*(unsigned long *)addr);
+ return true;
+}
+
+static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
+ unsigned long *ip, unsigned long *sp)
+{
+ struct pt_regs *regs = (struct pt_regs *)addr;
+
+ /* x86-32 support will be more complicated due to the &regs->sp hack */
+ BUILD_BUG_ON(IS_ENABLED(CONFIG_X86_32));
+
+ if (!stack_access_ok(state, addr, sizeof(struct pt_regs)))
+ return false;
+
+ *ip = READ_ONCE_NOCHECK(regs->ip);
+ *sp = READ_ONCE_NOCHECK(regs->sp);
+ return true;
+}
+
+static bool deref_stack_iret_regs(struct unwind_state *state, unsigned long addr,
+ unsigned long *ip, unsigned long *sp)
+{
+ struct pt_regs *regs = (void *)addr - IRET_FRAME_OFFSET;
+
+ if (!stack_access_ok(state, addr, IRET_FRAME_SIZE))
+ return false;
+
+ *ip = READ_ONCE_NOCHECK(regs->ip);
+ *sp = READ_ONCE_NOCHECK(regs->sp);
+ return true;
+}
+
+/*
+ * If state->regs is non-NULL, and points to a full pt_regs, just get the reg
+ * value from state->regs.
+ *
+ * Otherwise, if state->regs just points to IRET regs, and the previous frame
+ * had full regs, it's safe to get the value from the previous regs. This can
+ * happen when early/late IRQ entry code gets interrupted by an NMI.
+ */
+static bool get_reg(struct unwind_state *state, unsigned int reg_off,
+ unsigned long *val)
+{
+ unsigned int reg = reg_off/8;
+
+ if (!state->regs)
+ return false;
+
+ if (state->full_regs) {
+ *val = READ_ONCE_NOCHECK(((unsigned long *)state->regs)[reg]);
+ return true;
+ }
+
+ if (state->prev_regs) {
+ *val = READ_ONCE_NOCHECK(((unsigned long *)state->prev_regs)[reg]);
+ return true;
+ }
+
+ return false;
+}
+
+bool unwind_next_frame(struct unwind_state *state)
+{
+ unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp;
+ enum stack_type prev_type = state->stack_info.type;
+ struct orc_entry *orc;
+ bool indirect = false;
+
+ if (unwind_done(state))
+ return false;
+
+ /* Don't let modules unload while we're reading their ORC data. */
+ guard(rcu)();
+
+ /* End-of-stack check for user tasks: */
+ if (state->regs && user_mode(state->regs))
+ goto the_end;
+
+ /*
+ * Find the orc_entry associated with the text address.
+ *
+ * For a call frame (as opposed to a signal frame), state->ip points to
+ * the instruction after the call. That instruction's stack layout
+ * could be different from the call instruction's layout, for example
+ * if the call was to a noreturn function. So get the ORC data for the
+ * call instruction itself.
+ */
+ orc = orc_find(state->signal ? state->ip : state->ip - 1);
+ if (!orc) {
+ /*
+ * As a fallback, try to assume this code uses a frame pointer.
+ * This is useful for generated code, like BPF, which ORC
+ * doesn't know about. This is just a guess, so the rest of
+ * the unwind is no longer considered reliable.
+ */
+ orc = &orc_fp_entry;
+ state->error = true;
+ } else {
+ if (orc->type == ORC_TYPE_UNDEFINED)
+ goto err;
+
+ if (orc->type == ORC_TYPE_END_OF_STACK)
+ goto the_end;
+ }
+
+ state->signal = orc->signal;
+
+ /* Find the previous frame's stack: */
+ switch (orc->sp_reg) {
+ case ORC_REG_SP:
+ sp = state->sp + orc->sp_offset;
+ break;
+
+ case ORC_REG_BP:
+ sp = state->bp + orc->sp_offset;
+ break;
+
+ case ORC_REG_SP_INDIRECT:
+ sp = state->sp;
+ indirect = true;
+ break;
+
+ case ORC_REG_BP_INDIRECT:
+ sp = state->bp + orc->sp_offset;
+ indirect = true;
+ break;
+
+ case ORC_REG_R10:
+ if (!get_reg(state, offsetof(struct pt_regs, r10), &sp)) {
+ orc_warn_current("missing R10 value at %pB\n",
+ (void *)state->ip);
+ goto err;
+ }
+ break;
+
+ case ORC_REG_R13:
+ if (!get_reg(state, offsetof(struct pt_regs, r13), &sp)) {
+ orc_warn_current("missing R13 value at %pB\n",
+ (void *)state->ip);
+ goto err;
+ }
+ break;
+
+ case ORC_REG_DI:
+ if (!get_reg(state, offsetof(struct pt_regs, di), &sp)) {
+ orc_warn_current("missing RDI value at %pB\n",
+ (void *)state->ip);
+ goto err;
+ }
+ break;
+
+ case ORC_REG_DX:
+ if (!get_reg(state, offsetof(struct pt_regs, dx), &sp)) {
+ orc_warn_current("missing DX value at %pB\n",
+ (void *)state->ip);
+ goto err;
+ }
+ break;
+
+ default:
+ orc_warn("unknown SP base reg %d at %pB\n",
+ orc->sp_reg, (void *)state->ip);
+ goto err;
+ }
+
+ if (indirect) {
+ if (!deref_stack_reg(state, sp, &sp))
+ goto err;
+
+ if (orc->sp_reg == ORC_REG_SP_INDIRECT)
+ sp += orc->sp_offset;
+ }
+
+ /* Find IP, SP and possibly regs: */
+ switch (orc->type) {
+ case ORC_TYPE_CALL:
+ ip_p = sp - sizeof(long);
+
+ if (!deref_stack_reg(state, ip_p, &state->ip))
+ goto err;
+
+ state->ip = unwind_recover_ret_addr(state, state->ip,
+ (unsigned long *)ip_p);
+ state->sp = sp;
+ state->regs = NULL;
+ state->prev_regs = NULL;
+ break;
+
+ case ORC_TYPE_REGS:
+ if (!deref_stack_regs(state, sp, &state->ip, &state->sp)) {
+ orc_warn_current("can't access registers at %pB\n",
+ (void *)orig_ip);
+ goto err;
+ }
+ /*
+ * There is a small chance to interrupt at the entry of
+ * arch_rethook_trampoline() where the ORC info doesn't exist.
+ * That point is right after the RET to arch_rethook_trampoline(),
+ * which was the modified return address.
+ * At that point, the @addr_p of the unwind_recover_rethook()
+ * (this has to point to the address of the stack entry storing
+ * the modified return address) must be "SP - (a stack entry)"
+ * because SP is incremented by the RET.
+ */
+ state->ip = unwind_recover_rethook(state, state->ip,
+ (unsigned long *)(state->sp - sizeof(long)));
+ state->regs = (struct pt_regs *)sp;
+ state->prev_regs = NULL;
+ state->full_regs = true;
+ break;
+
+ case ORC_TYPE_REGS_PARTIAL:
+ if (!deref_stack_iret_regs(state, sp, &state->ip, &state->sp)) {
+ orc_warn_current("can't access iret registers at %pB\n",
+ (void *)orig_ip);
+ goto err;
+ }
+ /* See ORC_TYPE_REGS case comment. */
+ state->ip = unwind_recover_rethook(state, state->ip,
+ (unsigned long *)(state->sp - sizeof(long)));
+
+ if (state->full_regs)
+ state->prev_regs = state->regs;
+ state->regs = (void *)sp - IRET_FRAME_OFFSET;
+ state->full_regs = false;
+ break;
+
+ default:
+ orc_warn("unknown .orc_unwind entry type %d at %pB\n",
+ orc->type, (void *)orig_ip);
+ goto err;
+ }
+
+ /* Find BP: */
+ switch (orc->bp_reg) {
+ case ORC_REG_UNDEFINED:
+ if (get_reg(state, offsetof(struct pt_regs, bp), &tmp))
+ state->bp = tmp;
+ break;
+
+ case ORC_REG_PREV_SP:
+ if (!deref_stack_reg(state, sp + orc->bp_offset, &state->bp))
+ goto err;
+ break;
+
+ case ORC_REG_BP:
+ if (!deref_stack_reg(state, state->bp + orc->bp_offset, &state->bp))
+ goto err;
+ break;
+
+ default:
+ orc_warn("unknown BP base reg %d for ip %pB\n",
+ orc->bp_reg, (void *)orig_ip);
+ goto err;
+ }
+
+ /* Prevent a recursive loop due to bad ORC data: */
+ if (state->stack_info.type == prev_type &&
+ on_stack(&state->stack_info, (void *)state->sp, sizeof(long)) &&
+ state->sp <= prev_sp) {
+ orc_warn_current("stack going in the wrong direction? at %pB\n",
+ (void *)orig_ip);
+ goto err;
+ }
+
+ return true;
+
+err:
+ state->error = true;
+
+the_end:
+ state->stack_info.type = STACK_TYPE_UNKNOWN;
+ return false;
+}
+EXPORT_SYMBOL_GPL(unwind_next_frame);
+
+void __unwind_start(struct unwind_state *state, struct task_struct *task,
+ struct pt_regs *regs, unsigned long *first_frame)
+{
+ memset(state, 0, sizeof(*state));
+ state->task = task;
+
+ if (!orc_init)
+ goto err;
+
+ /*
+ * Refuse to unwind the stack of a task while it's executing on another
+ * CPU. This check is racy, but that's ok: the unwinder has other
+ * checks to prevent it from going off the rails.
+ */
+ if (task_on_another_cpu(task))
+ goto err;
+
+ if (regs) {
+ if (user_mode(regs))
+ goto the_end;
+
+ state->ip = regs->ip;
+ state->sp = regs->sp;
+ state->bp = regs->bp;
+ state->regs = regs;
+ state->full_regs = true;
+ state->signal = true;
+
+ } else if (task == current) {
+ asm volatile("lea (%%rip), %0\n\t"
+ "mov %%rsp, %1\n\t"
+ "mov %%rbp, %2\n\t"
+ : "=r" (state->ip), "=r" (state->sp),
+ "=r" (state->bp));
+
+ } else {
+ struct inactive_task_frame *frame = (void *)task->thread.sp;
+
+ state->sp = task->thread.sp + sizeof(*frame);
+ state->bp = READ_ONCE_NOCHECK(frame->bp);
+ state->ip = READ_ONCE_NOCHECK(frame->ret_addr);
+ state->signal = (void *)state->ip == ret_from_fork_asm;
+ }
+
+ if (get_stack_info((unsigned long *)state->sp, state->task,
+ &state->stack_info, &state->stack_mask)) {
+ /*
+ * We weren't on a valid stack. It's possible that
+ * we overflowed a valid stack into a guard page.
+ * See if the next page up is valid so that we can
+ * generate some kind of backtrace if this happens.
+ */
+ void *next_page = (void *)PAGE_ALIGN((unsigned long)state->sp);
+ state->error = true;
+ if (get_stack_info(next_page, state->task, &state->stack_info,
+ &state->stack_mask))
+ return;
+ }
+
+ /*
+ * The caller can provide the address of the first frame directly
+ * (first_frame) or indirectly (regs->sp) to indicate which stack frame
+ * to start unwinding at. Skip ahead until we reach it.
+ */
+
+ /* When starting from regs, skip the regs frame: */
+ if (regs) {
+ unwind_next_frame(state);
+ return;
+ }
+
+ /* Otherwise, skip ahead to the user-specified starting frame: */
+ while (!unwind_done(state) &&
+ (!on_stack(&state->stack_info, first_frame, sizeof(long)) ||
+ state->sp <= (unsigned long)first_frame))
+ unwind_next_frame(state);
+
+ return;
+
+err:
+ state->error = true;
+the_end:
+ state->stack_info.type = STACK_TYPE_UNKNOWN;
+}
+EXPORT_SYMBOL_GPL(__unwind_start);
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
new file mode 100644
index 000000000000..7be8e361ca55
--- /dev/null
+++ b/arch/x86/kernel/uprobes.c
@@ -0,0 +1,1825 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * User-space Probes (UProbes) for x86
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ * Srikar Dronamraju
+ * Jim Keniston
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/uprobes.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+
+#include <linux/kdebug.h>
+#include <asm/processor.h>
+#include <asm/insn.h>
+#include <asm/insn-eval.h>
+#include <asm/mmu_context.h>
+#include <asm/nops.h>
+
+/* Post-execution fixups. */
+
+/* Adjust IP back to vicinity of actual insn */
+#define UPROBE_FIX_IP 0x01
+
+/* Adjust the return address of a call insn */
+#define UPROBE_FIX_CALL 0x02
+
+/* Instruction will modify TF, don't change it */
+#define UPROBE_FIX_SETF 0x04
+
+#define UPROBE_FIX_RIP_SI 0x08
+#define UPROBE_FIX_RIP_DI 0x10
+#define UPROBE_FIX_RIP_BX 0x20
+#define UPROBE_FIX_RIP_MASK \
+ (UPROBE_FIX_RIP_SI | UPROBE_FIX_RIP_DI | UPROBE_FIX_RIP_BX)
+
+#define UPROBE_TRAP_NR UINT_MAX
+
+/* Adaptations for mhiramat x86 decoder v14. */
+#define OPCODE1(insn) ((insn)->opcode.bytes[0])
+#define OPCODE2(insn) ((insn)->opcode.bytes[1])
+#define OPCODE3(insn) ((insn)->opcode.bytes[2])
+#define MODRM_REG(insn) X86_MODRM_REG((insn)->modrm.value)
+
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+ (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
+ (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
+ (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
+ (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
+ << (row % 32))
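
Each W() row ORs sixteen one-bit flags into its slice of a 256-bit map stored as eight 32-bit words, so testing an opcode is a word index plus a bit shift. A sketch of the lookup with a hypothetical is_good_insn() (the kernel itself goes through test_bit()):

#include <stdio.h>

typedef unsigned int u32;

/* Bit (op % 32) of word (op / 32) says whether opcode 'op' is probeable. */
static int is_good_insn(const volatile u32 *table, unsigned char op)
{
        return (table[op >> 5] >> (op & 31)) & 1;
}

int main(void)
{
        static volatile u32 demo[256 / 32];

        /* mark NOP (0x90) as good; INT3 (0xcc) stays clear */
        demo[0x90 >> 5] |= 1u << (0x90 & 31);

        printf("0x90 -> %d, 0xcc -> %d\n",
               is_good_insn(demo, 0x90), is_good_insn(demo, 0xcc));
        return 0;
}
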
+
+/*
+ * Good-instruction tables for 32-bit apps. This is non-const and volatile
+ * to keep gcc from statically optimizing it out, as variable_test_bit makes
+ * some versions of gcc think that only *(unsigned long*) is used.
+ *
+ * Opcodes we'll probably never support:
+ * 6c-6f - ins,outs. SEGVs if used in userspace
+ * e4-e7 - in,out imm. SEGVs if used in userspace
+ * ec-ef - in,out acc. SEGVs if used in userspace
+ * cc - int3. SIGTRAP if used in userspace
+ * ce - into. Not used in userspace - no kernel support to make it useful. SEGVs
+ * (why we support bound (62) then? it's similar, and similarly unused...)
+ * f1 - int1. SIGTRAP if used in userspace
+ * f4 - hlt. SEGVs if used in userspace
+ * fa - cli. SEGVs if used in userspace
+ * fb - sti. SEGVs if used in userspace
+ *
+ * Opcodes which need some work to be supported:
+ * 07,17,1f - pop es/ss/ds
+ * Normally not used in userspace, but would execute if used.
+ * Can cause GP or stack exception if tries to load wrong segment descriptor.
+ * We hesitate to run them under single step since kernel's handling
+ * of userspace single-stepping (TF flag) is fragile.
+ * We can easily refuse to support push es/cs/ss/ds (06/0e/16/1e)
+ * on the same grounds that they are never used.
+ * cd - int N.
+ * Used by userspace for "int 80" syscall entry. (Other "int N"
+ * cause GP -> SEGV since their IDT gates don't allow calls from CPL 3).
+ * Not supported since kernel's handling of userspace single-stepping
+ * (TF flag) is fragile.
+ * cf - iret. Normally not used in userspace. Doesn't SEGV unless arguments are bad
+ */
+#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
+static volatile u32 good_insns_32[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 20 */
+ W(0x30, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+ W(0xd0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+ W(0xf0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+#else
+#define good_insns_32 NULL
+#endif
+
+/* Good-instruction tables for 64-bit apps.
+ *
+ * Genuinely invalid opcodes:
+ * 06,07 - formerly push/pop es
+ * 0e - formerly push cs
+ * 16,17 - formerly push/pop ss
+ * 1e,1f - formerly push/pop ds
+ * 27,2f,37,3f - formerly daa/das/aaa/aas
+ * 60,61 - formerly pusha/popa
+ * 62 - formerly bound. EVEX prefix for AVX512 (not yet supported)
+ * 82 - formerly redundant encoding of Group1
+ * 9a - formerly call seg:ofs
+ * ce - formerly into
+ * d4,d5 - formerly aam/aad
+ * d6 - formerly undocumented salc
+ * ea - formerly jmp seg:ofs
+ *
+ * Opcodes we'll probably never support:
+ * 6c-6f - ins,outs. SEGVs if used in userspace
+ * e4-e7 - in,out imm. SEGVs if used in userspace
+ * ec-ef - in,out acc. SEGVs if used in userspace
+ * cc - int3. SIGTRAP if used in userspace
+ * f1 - int1. SIGTRAP if used in userspace
+ * f4 - hlt. SEGVs if used in userspace
+ * fa - cli. SEGVs if used in userspace
+ * fb - sti. SEGVs if used in userspace
+ *
+ * Opcodes which need some work to be supported:
+ * cd - int N.
+ * Used by userspace for "int 80" syscall entry. (Other "int N"
+ * cause GP -> SEGV since their IDT gates don't allow calls from CPL 3).
+ * Not supported since the kernel's handling of userspace single-stepping
+ * (TF flag) is fragile.
+ * cf - iret. Normally not used in userspace. Doesn't SEGV unless arguments are bad
+ */
+#if defined(CONFIG_X86_64)
+static volatile u32 good_insns_64[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) | /* 20 */
+ W(0x30, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+ W(0xd0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0) | /* e0 */
+ W(0xf0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+#else
+#define good_insns_64 NULL
+#endif
+
+/*
+ * Using this for both 64-bit and 32-bit apps.
+ * Opcodes we don't support:
+ * 0f 00 - SLDT/STR/LLDT/LTR/VERR/VERW/-/- group. System insns
+ * 0f 01 - SGDT/SIDT/LGDT/LIDT/SMSW/-/LMSW/INVLPG group.
+ * Also encodes tons of other system insns if mod=11.
+ * Some are in fact non-system: xend, xtest, rdtscp, maybe more
+ * 0f 05 - syscall
+ * 0f 06 - clts (CPL0 insn)
+ * 0f 07 - sysret
+ * 0f 08 - invd (CPL0 insn)
+ * 0f 09 - wbinvd (CPL0 insn)
+ * 0f 0b - ud2
+ * 0f 30 - wrmsr (CPL0 insn) (then why is rdmsr allowed? it's also a CPL0 insn)
+ * 0f 34 - sysenter
+ * 0f 35 - sysexit
+ * 0f 37 - getsec
+ * 0f 78 - vmread (Intel VMX. CPL0 insn)
+ * 0f 79 - vmwrite (Intel VMX. CPL0 insn)
+ * Note: with prefixes, these two opcodes are
+ * extrq/insertq/AVX512 convert vector ops.
+ * 0f ae - group15: [f]xsave,[f]xrstor,[v]{ld,st}mxcsr,clflush[opt],
+ * {rd,wr}{fs,gs}base,{s,l,m}fence.
+ * Why? They are all user-executable.
+ */
+static volatile u32 good_2byte_insns[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1) | /* 00 */
+ W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 20 */
+ W(0x30, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+ W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 60 */
+ W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1) , /* 70 */
+ W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
+ W(0xd0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+ W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* e0 */
+ W(0xf0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) /* f0 */
+ /* ---------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+#undef W
+
+/*
+ * opcodes we may need to refine support for:
+ *
+ * 0f - 2-byte instructions: For many of these instructions, the validity
+ * depends on the prefix and/or the reg field. On such instructions, we
+ * just consider the opcode combination valid if it corresponds to any
+ * valid instruction.
+ *
+ * 8f - Group 1 - only reg = 0 is OK
+ * c6-c7 - Group 11 - only reg = 0 is OK
+ * d9-df - fpu insns with some illegal encodings
+ * f2, f3 - repnz, repz prefixes. These are also the first byte for
+ * certain floating-point instructions, such as addsd.
+ *
+ * fe - Group 4 - only reg = 0 or 1 is OK
+ * ff - Group 5 - only reg = 0-6 is OK
+ *
+ * others -- Do we need to support these?
+ *
+ * 0f - (floating-point?) prefetch instructions
+ * 07, 17, 1f - pop es, pop ss, pop ds
+ * 26, 2e, 36, 3e - es:, cs:, ss:, ds: segment prefixes --
+ * but 64 and 65 (fs: and gs:) seem to be used, so we support them
+ * 67 - addr16 prefix
+ * ce - into
+ * f0 - lock prefix
+ */
+
+/*
+ * TODO:
+ * - Where necessary, examine the modrm byte and allow only valid instructions
+ * in the different Groups and fpu instructions.
+ */
+
+static bool is_prefix_bad(struct insn *insn)
+{
+ insn_byte_t p;
+
+ for_each_insn_prefix(insn, p) {
+ insn_attr_t attr;
+
+ attr = inat_get_opcode_attribute(p);
+ switch (attr) {
+ case INAT_MAKE_PREFIX(INAT_PFX_ES):
+ case INAT_MAKE_PREFIX(INAT_PFX_CS):
+ case INAT_MAKE_PREFIX(INAT_PFX_DS):
+ case INAT_MAKE_PREFIX(INAT_PFX_SS):
+ case INAT_MAKE_PREFIX(INAT_PFX_LOCK):
+ return true;
+ }
+ }
+ return false;
+}
+
+static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool x86_64)
+{
+ enum insn_mode m = x86_64 ? INSN_MODE_64 : INSN_MODE_32;
+ u32 volatile *good_insns;
+ int ret;
+
+ ret = insn_decode(insn, auprobe->insn, sizeof(auprobe->insn), m);
+ if (ret < 0)
+ return -ENOEXEC;
+
+ if (is_prefix_bad(insn))
+ return -ENOTSUPP;
+
+ /* We should not singlestep on the exception masking instructions */
+ if (insn_masking_exception(insn))
+ return -ENOTSUPP;
+
+ if (x86_64)
+ good_insns = good_insns_64;
+ else
+ good_insns = good_insns_32;
+
+ if (test_bit(OPCODE1(insn), (unsigned long *)good_insns))
+ return 0;
+
+ if (insn->opcode.nbytes == 2) {
+ if (test_bit(OPCODE2(insn), (unsigned long *)good_2byte_insns))
+ return 0;
+ }
+
+ return -ENOTSUPP;
+}
+
+#ifdef CONFIG_X86_64
+
+struct uretprobe_syscall_args {
+ unsigned long r11;
+ unsigned long cx;
+ unsigned long ax;
+};
+
+asm (
+ ".pushsection .rodata\n"
+ ".global uretprobe_trampoline_entry\n"
+ "uretprobe_trampoline_entry:\n"
+ "push %rax\n"
+ "push %rcx\n"
+ "push %r11\n"
+ "mov $" __stringify(__NR_uretprobe) ", %rax\n"
+ "syscall\n"
+ ".global uretprobe_syscall_check\n"
+ "uretprobe_syscall_check:\n"
+ "pop %r11\n"
+ "pop %rcx\n"
+ /*
+ * The uretprobe syscall replaces the stored %rax value with the
+ * final return address, so we don't restore %rax here and just
+ * execute ret.
+ */
+ "ret\n"
+ "int3\n"
+ ".global uretprobe_trampoline_end\n"
+ "uretprobe_trampoline_end:\n"
+ ".popsection\n"
+);
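+
+/*
+ * Resulting stack layout at the "syscall" above (lowest address first),
+ * which sys_uretprobe below reads back as struct uretprobe_syscall_args:
+ *	sp + 0x00: saved %r11
+ *	sp + 0x08: saved %rcx
+ *	sp + 0x10: saved %rax (later overwritten with the final return address)
+ */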
+
+extern u8 uretprobe_trampoline_entry[];
+extern u8 uretprobe_trampoline_end[];
+extern u8 uretprobe_syscall_check[];
+
+void *arch_uretprobe_trampoline(unsigned long *psize)
+{
+ static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+ struct pt_regs *regs = task_pt_regs(current);
+
+ /*
+ * At the moment the uretprobe syscall trampoline is supported
+ * only for native 64-bit processes; compat processes still use
+ * the standard breakpoint.
+ */
+ if (user_64bit_mode(regs)) {
+ *psize = uretprobe_trampoline_end - uretprobe_trampoline_entry;
+ return uretprobe_trampoline_entry;
+ }
+
+ *psize = UPROBE_SWBP_INSN_SIZE;
+ return &insn;
+}
+
+static unsigned long trampoline_check_ip(unsigned long tramp)
+{
+ return tramp + (uretprobe_syscall_check - uretprobe_trampoline_entry);
+}
+
+SYSCALL_DEFINE0(uretprobe)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ struct uretprobe_syscall_args args;
+ unsigned long err, ip, sp, tramp;
+
+ /* If there's no trampoline, we are called from the wrong place. */
+ tramp = uprobe_get_trampoline_vaddr();
+ if (unlikely(tramp == UPROBE_NO_TRAMPOLINE_VADDR))
+ goto sigill;
+
+ /* Make sure the ip matches the only allowed sys_uretprobe caller. */
+ if (unlikely(regs->ip != trampoline_check_ip(tramp)))
+ goto sigill;
+
+ err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
+ if (err)
+ goto sigill;
+
+ /* expose the "right" values of r11/cx/ax/sp to uprobe_consumer/s */
+ regs->r11 = args.r11;
+ regs->cx = args.cx;
+ regs->ax = args.ax;
+ regs->sp += sizeof(args);
+ regs->orig_ax = -1;
+
+ ip = regs->ip;
+ sp = regs->sp;
+
+ uprobe_handle_trampoline(regs);
+
+ /*
+ * If one of the uprobe consumers has changed sp, we can do nothing
+ * but return via iret.
+ * The same applies if shadow stack is enabled, in which case we
+ * need to skip the return through the user-space stack address.
+ */
+ if (regs->sp != sp || shstk_is_enabled())
+ return regs->ax;
+ regs->sp -= sizeof(args);
+
+ /* in case a uprobe_consumer has changed r11/cx */
+ args.r11 = regs->r11;
+ args.cx = regs->cx;
+
+ /*
+ * The ax register is passed through as the return value, so we can
+ * use its slot on the stack for the ip value and jump to it via the
+ * trampoline's ret instruction.
+ */
+ args.ax = regs->ip;
+ regs->ip = ip;
+
+ err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
+ if (err)
+ goto sigill;
+
+ /* ensure sysret, see do_syscall_64() */
+ regs->r11 = regs->flags;
+ regs->cx = regs->ip;
+
+ return regs->ax;
+
+sigill:
+ force_sig(SIGILL);
+ return -1;
+}
+
+/*
+ * If arch_uprobe->insn doesn't use rip-relative addressing, return
+ * immediately. Otherwise, rewrite the instruction so that it accesses
+ * its memory operand indirectly through a scratch register. Set
+ * defparam->fixups accordingly. (The contents of the scratch register
+ * will be saved before we single-step the modified instruction,
+ * and restored afterward).
+ *
+ * We do this because a rip-relative instruction can access only a
+ * relatively small area (+/- 2 GB from the instruction), and the XOL
+ * area typically lies beyond that area. At least for instructions
+ * that store to memory, we can't execute the original instruction
+ * and "fix things up" later, because the misdirected store could be
+ * disastrous.
+ *
+ * Some useful facts about rip-relative instructions:
+ *
+ * - There's always a modrm byte with bit layout "00 reg 101".
+ * - There's never a SIB byte.
+ * - The displacement is always 4 bytes.
+ * - The REX.B bit in the REX prefix, which normally extends the r/m field,
+ *   has no effect in rip-relative mode. It doesn't make a modrm byte
+ *   with r/m=101 refer to register 1101 = R13.
+ */
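+/*
+ * Example: 8b 05 <disp32> is "mov disp32(%rip),%eax"; its modrm byte
+ * 0x05 decodes as mod=00 reg=000 r/m=101, the pattern described above.
+ */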
+static void riprel_analyze(struct arch_uprobe *auprobe, struct insn *insn)
+{
+ u8 *cursor;
+ u8 reg;
+ u8 reg2;
+
+ if (!insn_rip_relative(insn))
+ return;
+
+ /*
+ * insn_rip_relative() would have decoded rex_prefix, vex_prefix, modrm.
+ * Clear REX.b bit (extension of MODRM.rm field):
+ * we want to encode low numbered reg, not r8+.
+ */
+ if (insn->rex_prefix.nbytes) {
+ cursor = auprobe->insn + insn_offset_rex_prefix(insn);
+ /* REX byte has 0100wrxb layout, clearing REX.b bit */
+ *cursor &= 0xfe;
+ }
+ /*
+ * Similar treatment for VEX3/EVEX prefix.
+ * TODO: add XOP treatment when insn decoder supports them
+ */
+ if (insn->vex_prefix.nbytes >= 3) {
+ /*
+ * vex2: c5 rvvvvLpp (has no b bit)
+ * vex3/xop: c4/8f rxbmmmmm wvvvvLpp
+ * evex: 62 rxbR00mm wvvvv1pp zllBVaaa
+ * Setting VEX3.b (setting because it has inverted meaning).
+ * Setting EVEX.x since (in non-SIB encoding) EVEX.x
+ * is the 4th bit of MODRM.rm, and needs the same treatment.
+ * For VEX3-encoded insns, VEX3.x value has no effect in
+ * non-SIB encoding, the change is superfluous but harmless.
+ */
+ cursor = auprobe->insn + insn_offset_vex_prefix(insn) + 1;
+ *cursor |= 0x60;
+ }
+
+ /*
+ * Convert from rip-relative addressing to register-relative addressing
+ * via a scratch register.
+ *
+ * This is tricky since there are insns with modrm byte
+ * which also use registers not encoded in modrm byte:
+ * [i]div/[i]mul: implicitly use dx:ax
+ * shift ops: implicitly use cx
+ * cmpxchg: implicitly uses ax
+ * cmpxchg8/16b: implicitly uses dx:ax and bx:cx
+ * Encoding: 0f c7/1 modrm
+ * The code below sees reg=1 (cx) and chooses si as scratch.
+ * mulx: implicitly uses dx: mulx r/m,r1,r2 does r1:r2 = dx * r/m.
+ * First appeared in Haswell (BMI2 insn). It is vex-encoded.
+ * Example where none of bx,cx,dx can be used as scratch reg:
+ * c4 e2 63 f6 0d disp32 mulx disp32(%rip),%ebx,%ecx
+ * [v]pcmpistri: implicitly uses cx, xmm0
+ * [v]pcmpistrm: implicitly uses xmm0
+ * [v]pcmpestri: implicitly uses ax, dx, cx, xmm0
+ * [v]pcmpestrm: implicitly uses ax, dx, xmm0
+ * Evil SSE4.2 string comparison ops from hell.
+ * maskmovq/[v]maskmovdqu: implicitly uses (ds:rdi) as destination.
+ * Encoding: 0f f7 modrm, 66 0f f7 modrm, vex-encoded: c5 f9 f7 modrm.
+ * Store op1, byte-masked by op2 msb's in each byte, to (ds:rdi).
+ * AMD says it has no 3-operand form (vex.vvvv must be 1111)
+ * and that it can have only register operands, not mem
+ * (its modrm byte must have mode=11).
+ * If these restrictions are ever lifted,
+ * we'll need code to prevent selection of di as the scratch reg!
+ *
+ * Summary: I don't know any insns with modrm byte which
+ * use SI register implicitly. DI register is used only
+ * by one insn (maskmovq) and BX register is used
+ * only by one too (cmpxchg8b).
+ * BP is stack-segment based (may be a problem?).
+ * AX, DX, CX are off-limits (many implicit users).
+ * SP is unusable (it's stack pointer - think about "pop mem";
+ * also, rsp+disp32 needs sib encoding -> insn length change).
+ */
+
+ reg = MODRM_REG(insn); /* Fetch modrm.reg */
+ reg2 = 0xff; /* Fetch vex.vvvv */
+ if (insn->vex_prefix.nbytes)
+ reg2 = insn->vex_prefix.bytes[2];
+ /*
+ * TODO: add XOP vvvv reading.
+ *
+ * vex.vvvv field is in bits 6-3, bits are inverted.
+ * But in 32-bit mode, high-order bit may be ignored.
+ * Therefore, let's consider only 3 low-order bits.
+ */
+ reg2 = ((reg2 >> 3) & 0x7) ^ 0x7;
+ /*
+ * Register numbering is ax,cx,dx,bx, sp,bp,si,di, r8..r15.
+ *
+ * Choose scratch reg. Order is important: must not select bx
+ * if we can use si (cmpxchg8b case!)
+ */
+ if (reg != 6 && reg2 != 6) {
+ reg2 = 6;
+ auprobe->defparam.fixups |= UPROBE_FIX_RIP_SI;
+ } else if (reg != 7 && reg2 != 7) {
+ reg2 = 7;
+ auprobe->defparam.fixups |= UPROBE_FIX_RIP_DI;
+ /* TODO (paranoia): force maskmovq to not use di */
+ } else {
+ reg2 = 3;
+ auprobe->defparam.fixups |= UPROBE_FIX_RIP_BX;
+ }
+ /*
+ * Point cursor at the modrm byte. The next 4 bytes are the
+ * displacement. Beyond the displacement, for some instructions,
+ * is the immediate operand.
+ */
+ cursor = auprobe->insn + insn_offset_modrm(insn);
+ /*
+ * Change modrm from "00 reg 101" to "10 reg reg2". Example:
+ * 89 05 disp32 mov %eax,disp32(%rip) becomes
+ * 89 86 disp32 mov %eax,disp32(%rsi)
+ */
+ *cursor = 0x80 | (reg << 3) | reg2;
+}
+
+static inline unsigned long *
+scratch_reg(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ if (auprobe->defparam.fixups & UPROBE_FIX_RIP_SI)
+ return &regs->si;
+ if (auprobe->defparam.fixups & UPROBE_FIX_RIP_DI)
+ return &regs->di;
+ return &regs->bx;
+}
+
+/*
+ * If we're emulating a rip-relative instruction, save the contents
+ * of the scratch register and store the target address in that register.
+ */
+static void riprel_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ if (auprobe->defparam.fixups & UPROBE_FIX_RIP_MASK) {
+ struct uprobe_task *utask = current->utask;
+ unsigned long *sr = scratch_reg(auprobe, regs);
+
+ utask->autask.saved_scratch_register = *sr;
+ *sr = utask->vaddr + auprobe->defparam.ilen;
+ }
+}
+
+static void riprel_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ if (auprobe->defparam.fixups & UPROBE_FIX_RIP_MASK) {
+ struct uprobe_task *utask = current->utask;
+ unsigned long *sr = scratch_reg(auprobe, regs);
+
+ *sr = utask->autask.saved_scratch_register;
+ }
+}
+
+static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
+{
+ return -EPERM;
+}
+
+static struct page *tramp_mapping_pages[2] __ro_after_init;
+
+static struct vm_special_mapping tramp_mapping = {
+ .name = "[uprobes-trampoline]",
+ .mremap = tramp_mremap,
+ .pages = tramp_mapping_pages,
+};
+
+struct uprobe_trampoline {
+ struct hlist_node node;
+ unsigned long vaddr;
+};
+
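+/*
+ * A rel32 call placed at vaddr occupies 5 bytes and ends at vaddr + 5;
+ * the trampoline is reachable iff the displacement to it fits in a
+ * signed 32-bit immediate, which is what the delta check verifies.
+ */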
+static bool is_reachable_by_call(unsigned long vtramp, unsigned long vaddr)
+{
+ long delta = (long)(vaddr + 5 - vtramp);
+
+ return delta >= INT_MIN && delta <= INT_MAX;
+}
+
+static unsigned long find_nearest_trampoline(unsigned long vaddr)
+{
+ struct vm_unmapped_area_info info = {
+ .length = PAGE_SIZE,
+ .align_mask = ~PAGE_MASK,
+ };
+ unsigned long low_limit, high_limit;
+ unsigned long low_tramp, high_tramp;
+ unsigned long call_end = vaddr + 5;
+
+ if (check_add_overflow(call_end, INT_MIN, &low_limit))
+ low_limit = PAGE_SIZE;
+
+ high_limit = call_end + INT_MAX;
+
+ /* Search up from the caller address. */
+ info.low_limit = call_end;
+ info.high_limit = min(high_limit, TASK_SIZE);
+ high_tramp = vm_unmapped_area(&info);
+
+ /* Search down from the caller address. */
+ info.low_limit = max(low_limit, PAGE_SIZE);
+ info.high_limit = call_end;
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ low_tramp = vm_unmapped_area(&info);
+
+ if (IS_ERR_VALUE(high_tramp) && IS_ERR_VALUE(low_tramp))
+ return -ENOMEM;
+ if (IS_ERR_VALUE(high_tramp))
+ return low_tramp;
+ if (IS_ERR_VALUE(low_tramp))
+ return high_tramp;
+
+ /* Return address that's closest to the caller address. */
+ if (call_end - low_tramp < high_tramp - call_end)
+ return low_tramp;
+ return high_tramp;
+}
+
+static struct uprobe_trampoline *create_uprobe_trampoline(unsigned long vaddr)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ struct mm_struct *mm = current->mm;
+ struct uprobe_trampoline *tramp;
+ struct vm_area_struct *vma;
+
+ if (!user_64bit_mode(regs))
+ return NULL;
+
+ vaddr = find_nearest_trampoline(vaddr);
+ if (IS_ERR_VALUE(vaddr))
+ return NULL;
+
+ tramp = kzalloc(sizeof(*tramp), GFP_KERNEL);
+ if (unlikely(!tramp))
+ return NULL;
+
+ tramp->vaddr = vaddr;
+ vma = _install_special_mapping(mm, tramp->vaddr, PAGE_SIZE,
+ VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
+ &tramp_mapping);
+ if (IS_ERR(vma)) {
+ kfree(tramp);
+ return NULL;
+ }
+ return tramp;
+}
+
+static struct uprobe_trampoline *get_uprobe_trampoline(unsigned long vaddr, bool *new)
+{
+ struct uprobes_state *state = &current->mm->uprobes_state;
+ struct uprobe_trampoline *tramp = NULL;
+
+ if (vaddr > TASK_SIZE || vaddr < PAGE_SIZE)
+ return NULL;
+
+ hlist_for_each_entry(tramp, &state->head_tramps, node) {
+ if (is_reachable_by_call(tramp->vaddr, vaddr)) {
+ *new = false;
+ return tramp;
+ }
+ }
+
+ tramp = create_uprobe_trampoline(vaddr);
+ if (!tramp)
+ return NULL;
+
+ *new = true;
+ hlist_add_head(&tramp->node, &state->head_tramps);
+ return tramp;
+}
+
+static void destroy_uprobe_trampoline(struct uprobe_trampoline *tramp)
+{
+ /*
+ * We do not unmap and release the uprobe trampoline page itself,
+ * because there's no easy way to make sure none of the threads
+ * is still inside the trampoline.
+ */
+ hlist_del(&tramp->node);
+ kfree(tramp);
+}
+
+void arch_uprobe_init_state(struct mm_struct *mm)
+{
+ INIT_HLIST_HEAD(&mm->uprobes_state.head_tramps);
+}
+
+void arch_uprobe_clear_state(struct mm_struct *mm)
+{
+ struct uprobes_state *state = &mm->uprobes_state;
+ struct uprobe_trampoline *tramp;
+ struct hlist_node *n;
+
+ hlist_for_each_entry_safe(tramp, n, &state->head_tramps, node)
+ destroy_uprobe_trampoline(tramp);
+}
+
+static bool __in_uprobe_trampoline(unsigned long ip)
+{
+ struct vm_area_struct *vma = vma_lookup(current->mm, ip);
+
+ return vma && vma_is_special_mapping(vma, &tramp_mapping);
+}
+
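+/*
+ * Lockless fast path: try a speculative VMA lookup first, and fall back
+ * to taking mmap_read_lock only if the mmap sequence count changed
+ * under us.
+ */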
+static bool in_uprobe_trampoline(unsigned long ip)
+{
+ struct mm_struct *mm = current->mm;
+ bool found, retry = true;
+ unsigned int seq;
+
+ rcu_read_lock();
+ if (mmap_lock_speculate_try_begin(mm, &seq)) {
+ found = __in_uprobe_trampoline(ip);
+ retry = mmap_lock_speculate_retry(mm, seq);
+ }
+ rcu_read_unlock();
+
+ if (retry) {
+ mmap_read_lock(mm);
+ found = __in_uprobe_trampoline(ip);
+ mmap_read_unlock(mm);
+ }
+ return found;
+}
+
+/*
+ * See the uprobe syscall trampoline; the call to the trampoline pushes
+ * the return address on the stack, and the trampoline itself then pushes
+ * cx, r11 and ax.
+ */
+struct uprobe_syscall_args {
+ unsigned long ax;
+ unsigned long r11;
+ unsigned long cx;
+ unsigned long retaddr;
+};
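+
+/*
+ * At syscall entry the frame at %rsp thus reads, lowest address first:
+ * ax, r11, cx, retaddr - matching the field order above.
+ */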
+
+SYSCALL_DEFINE0(uprobe)
+{
+ struct pt_regs *regs = task_pt_regs(current);
+ struct uprobe_syscall_args args;
+ unsigned long ip, sp, sret;
+ int err;
+
+ /* Allow execution only from uprobe trampolines. */
+ if (!in_uprobe_trampoline(regs->ip))
+ return -ENXIO;
+
+ err = copy_from_user(&args, (void __user *)regs->sp, sizeof(args));
+ if (err)
+ goto sigill;
+
+ ip = regs->ip;
+
+ /*
+ * expose the "right" values of ax/r11/cx/ip/sp to uprobe_consumer/s, plus:
+ * - adjust ip to the probe address (the call saved the next instruction
+ *   address, so the probe sits 5 bytes earlier)
+ * - adjust sp to the probe's stack frame (check trampoline code)
+ */
+ regs->ax = args.ax;
+ regs->r11 = args.r11;
+ regs->cx = args.cx;
+ regs->ip = args.retaddr - 5;
+ regs->sp += sizeof(args);
+ regs->orig_ax = -1;
+
+ sp = regs->sp;
+
+ err = shstk_pop((u64 *)&sret);
+ if (err == -EFAULT || (!err && sret != args.retaddr))
+ goto sigill;
+
+ handle_syscall_uprobe(regs, regs->ip);
+
+ /*
+ * If one of the uprobe consumers has changed sp, we can do nothing
+ * but return via iret.
+ */
+ if (regs->sp != sp) {
+ /* skip the trampoline call */
+ if (args.retaddr - 5 == regs->ip)
+ regs->ip += 5;
+ return regs->ax;
+ }
+
+ regs->sp -= sizeof(args);
+
+ /* for the case uprobe_consumer has changed ax/r11/cx */
+ args.ax = regs->ax;
+ args.r11 = regs->r11;
+ args.cx = regs->cx;
+
+ /* keep return address unless we are instructed otherwise */
+ if (args.retaddr - 5 != regs->ip)
+ args.retaddr = regs->ip;
+
+ if (shstk_push(args.retaddr) == -EFAULT)
+ goto sigill;
+
+ regs->ip = ip;
+
+ err = copy_to_user((void __user *)regs->sp, &args, sizeof(args));
+ if (err)
+ goto sigill;
+
+ /* ensure sysret, see do_syscall_64() */
+ regs->r11 = regs->flags;
+ regs->cx = regs->ip;
+ return 0;
+
+sigill:
+ force_sig(SIGILL);
+ return -1;
+}
+
+asm (
+ ".pushsection .rodata\n"
+ ".balign " __stringify(PAGE_SIZE) "\n"
+ "uprobe_trampoline_entry:\n"
+ "push %rcx\n"
+ "push %r11\n"
+ "push %rax\n"
+ "mov $" __stringify(__NR_uprobe) ", %rax\n"
+ "syscall\n"
+ "pop %rax\n"
+ "pop %r11\n"
+ "pop %rcx\n"
+ "ret\n"
+ "int3\n"
+ ".balign " __stringify(PAGE_SIZE) "\n"
+ ".popsection\n"
+);
+
+extern u8 uprobe_trampoline_entry[];
+
+static int __init arch_uprobes_init(void)
+{
+ tramp_mapping_pages[0] = virt_to_page(uprobe_trampoline_entry);
+ return 0;
+}
+
+late_initcall(arch_uprobes_init);
+
+enum {
+ EXPECT_SWBP,
+ EXPECT_CALL,
+};
+
+struct write_opcode_ctx {
+ unsigned long base;
+ int expect;
+};
+
+static int is_call_insn(uprobe_opcode_t *insn)
+{
+ return *insn == CALL_INSN_OPCODE;
+}
+
+/*
+ * Verification callback used by the uprobe_write() calls in int3_update()
+ * to make sure the underlying instruction is as expected - either int3
+ * or call.
+ */
+static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode,
+ int nbytes, void *data)
+{
+ struct write_opcode_ctx *ctx = data;
+ uprobe_opcode_t old_opcode[5];
+
+ uprobe_copy_from_page(page, ctx->base, (uprobe_opcode_t *) &old_opcode, 5);
+
+ switch (ctx->expect) {
+ case EXPECT_SWBP:
+ if (is_swbp_insn(&old_opcode[0]))
+ return 1;
+ break;
+ case EXPECT_CALL:
+ if (is_call_insn(&old_opcode[0]))
+ return 1;
+ break;
+ }
+
+ return -1;
+}
+
+/*
+ * Modify multi-byte instructions by using INT3 breakpoints on SMP.
+ * We completely avoid using stop_machine() here, and achieve the
+ * synchronization using INT3 breakpoints and SMP cross-calls.
+ * (borrowed comment from smp_text_poke_batch_finish)
+ *
+ * The way it is done:
+ * - Add an INT3 trap to the address that will be patched
+ * - SMP sync all CPUs
+ * - Update all but the first byte of the patched range
+ * - SMP sync all CPUs
+ * - Replace the first byte (INT3) by the first byte of the replacing opcode
+ * - SMP sync all CPUs
+ */
+static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+ unsigned long vaddr, char *insn, bool optimize)
+{
+ uprobe_opcode_t int3 = UPROBE_SWBP_INSN;
+ struct write_opcode_ctx ctx = {
+ .base = vaddr,
+ };
+ int err;
+
+ /*
+ * Write int3 trap.
+ *
+ * The swbp_optimize path comes with the breakpoint already installed,
+ * so we can skip this step for optimize == true.
+ */
+ if (!optimize) {
+ ctx.expect = EXPECT_CALL;
+ err = uprobe_write(auprobe, vma, vaddr, &int3, 1, verify_insn,
+ true /* is_register */, false /* do_update_ref_ctr */,
+ &ctx);
+ if (err)
+ return err;
+ }
+
+ smp_text_poke_sync_each_cpu();
+
+ /* Write all but the first byte of the patched range. */
+ ctx.expect = EXPECT_SWBP;
+ err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1, 4, verify_insn,
+ true /* is_register */, false /* do_update_ref_ctr */,
+ &ctx);
+ if (err)
+ return err;
+
+ smp_text_poke_sync_each_cpu();
+
+ /*
+ * Write first byte.
+ *
+ * The swbp_unoptimize path needs to finish the uprobe removal together
+ * with the ref_ctr update, using uprobe_write() with the proper flags.
+ */
+ err = uprobe_write(auprobe, vma, vaddr, insn, 1, verify_insn,
+ optimize /* is_register */, !optimize /* do_update_ref_ctr */,
+ &ctx);
+ if (err)
+ return err;
+
+ smp_text_poke_sync_each_cpu();
+ return 0;
+}
+
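+/*
+ * Patch in a 5-byte "call rel32" (e8 <tramp - (vaddr + 5)>) pointing at
+ * the trampoline; __text_gen_insn() computes the relative displacement.
+ */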
+static int swbp_optimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+ unsigned long vaddr, unsigned long tramp)
+{
+ u8 call[5];
+
+ __text_gen_insn(call, CALL_INSN_OPCODE, (const void *) vaddr,
+ (const void *) tramp, CALL_INSN_SIZE);
+ return int3_update(auprobe, vma, vaddr, call, true /* optimize */);
+}
+
+static int swbp_unoptimize(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+ return int3_update(auprobe, vma, vaddr, auprobe->insn, false /* optimize */);
+}
+
+static int copy_from_vaddr(struct mm_struct *mm, unsigned long vaddr, void *dst, int len)
+{
+ unsigned int gup_flags = FOLL_FORCE|FOLL_SPLIT_PMD;
+ struct vm_area_struct *vma;
+ struct page *page;
+
+ page = get_user_page_vma_remote(mm, vaddr, gup_flags, &vma);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+ uprobe_copy_from_page(page, vaddr, dst, len);
+ put_page(page);
+ return 0;
+}
+
+static bool __is_optimized(uprobe_opcode_t *insn, unsigned long vaddr)
+{
+ struct __packed __arch_relative_insn {
+ u8 op;
+ s32 raddr;
+ } *call = (struct __arch_relative_insn *) insn;
+
+ if (!is_call_insn(insn))
+ return false;
+ return __in_uprobe_trampoline(vaddr + 5 + call->raddr);
+}
+
+static int is_optimized(struct mm_struct *mm, unsigned long vaddr)
+{
+ uprobe_opcode_t insn[5];
+ int err;
+
+ err = copy_from_vaddr(mm, vaddr, &insn, 5);
+ if (err)
+ return err;
+ return __is_optimized((uprobe_opcode_t *)&insn, vaddr);
+}
+
+static bool should_optimize(struct arch_uprobe *auprobe)
+{
+ return !test_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags) &&
+ test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
+}
+
+int set_swbp(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+ if (should_optimize(auprobe)) {
+ /*
+ * We could race with another thread that already optimized the probe,
+ * so let's not overwrite it with int3 again in this case.
+ */
+ int ret = is_optimized(vma->vm_mm, vaddr);
+ if (ret < 0)
+ return ret;
+ if (ret)
+ return 0;
+ }
+ return uprobe_write_opcode(auprobe, vma, vaddr, UPROBE_SWBP_INSN,
+ true /* is_register */);
+}
+
+int set_orig_insn(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+ if (test_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags)) {
+ int ret = is_optimized(vma->vm_mm, vaddr);
+ if (ret < 0)
+ return ret;
+ if (ret) {
+ ret = swbp_unoptimize(auprobe, vma, vaddr);
+ WARN_ON_ONCE(ret);
+ return ret;
+ }
+ }
+ return uprobe_write_opcode(auprobe, vma, vaddr, *(uprobe_opcode_t *)&auprobe->insn,
+ false /* is_register */);
+}
+
+static int __arch_uprobe_optimize(struct arch_uprobe *auprobe, struct mm_struct *mm,
+ unsigned long vaddr)
+{
+ struct uprobe_trampoline *tramp;
+ struct vm_area_struct *vma;
+ bool new = false;
+ int err = 0;
+
+ vma = find_vma(mm, vaddr);
+ if (!vma)
+ return -EINVAL;
+ tramp = get_uprobe_trampoline(vaddr, &new);
+ if (!tramp)
+ return -EINVAL;
+ err = swbp_optimize(auprobe, vma, vaddr, tramp->vaddr);
+ if (WARN_ON_ONCE(err) && new)
+ destroy_uprobe_trampoline(tramp);
+ return err;
+}
+
+void arch_uprobe_optimize(struct arch_uprobe *auprobe, unsigned long vaddr)
+{
+ struct mm_struct *mm = current->mm;
+ uprobe_opcode_t insn[5];
+
+ if (!should_optimize(auprobe))
+ return;
+
+ mmap_write_lock(mm);
+
+ /*
+ * Check if some other thread already optimized the uprobe for us;
+ * if so, just return silently.
+ */
+ if (copy_from_vaddr(mm, vaddr, &insn, 5))
+ goto unlock;
+ if (!is_swbp_insn((uprobe_opcode_t*) &insn))
+ goto unlock;
+
+ /*
+ * If we fail to optimize the uprobe, set the fail bit so that
+ * should_optimize() above fails from now on.
+ */
+ if (__arch_uprobe_optimize(auprobe, mm, vaddr))
+ set_bit(ARCH_UPROBE_FLAG_OPTIMIZE_FAIL, &auprobe->flags);
+
+unlock:
+ mmap_write_unlock(mm);
+}
+
+static bool can_optimize(struct insn *insn, unsigned long vaddr)
+{
+ if (!insn->x86_64 || insn->length != 5)
+ return false;
+
+ if (!insn_is_nop(insn))
+ return false;
+
+ /* We can't do cross page atomic writes yet. */
+ return PAGE_SIZE - (vaddr & ~PAGE_MASK) >= 5;
+}
+#else /* 32-bit: */
+/*
+ * No RIP-relative addressing on 32-bit
+ */
+static void riprel_analyze(struct arch_uprobe *auprobe, struct insn *insn)
+{
+}
+static void riprel_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+}
+static void riprel_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+}
+static bool can_optimize(struct insn *insn, unsigned long vaddr)
+{
+ return false;
+}
+#endif /* CONFIG_X86_64 */
+
+struct uprobe_xol_ops {
+ bool (*emulate)(struct arch_uprobe *, struct pt_regs *);
+ int (*pre_xol)(struct arch_uprobe *, struct pt_regs *);
+ int (*post_xol)(struct arch_uprobe *, struct pt_regs *);
+ void (*abort)(struct arch_uprobe *, struct pt_regs *);
+};
+
+static inline int sizeof_long(struct pt_regs *regs)
+{
+ /*
+ * Check registers for mode as in_xxx_syscall() does not apply here.
+ */
+ return user_64bit_mode(regs) ? 8 : 4;
+}
+
+static int default_pre_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ riprel_pre_xol(auprobe, regs);
+ return 0;
+}
+
+static int emulate_push_stack(struct pt_regs *regs, unsigned long val)
+{
+ unsigned long new_sp = regs->sp - sizeof_long(regs);
+
+ if (copy_to_user((void __user *)new_sp, &val, sizeof_long(regs)))
+ return -EFAULT;
+
+ regs->sp = new_sp;
+ return 0;
+}
+
+/*
+ * We have to fix things up as follows:
+ *
+ * Typically, the new ip is relative to the copied instruction. We need
+ * to make it relative to the original instruction (FIX_IP). Exceptions
+ * are return instructions and absolute or indirect jump or call instructions.
+ *
+ * If the single-stepped instruction was a call, the return address that
+ * is atop the stack is the address following the copied instruction. We
+ * need to make it the address following the original instruction (FIX_CALL).
+ *
+ * If the original instruction was a rip-relative instruction such as
+ * "movl %edx,0xnnnn(%rip)", we have instead executed an equivalent
+ * instruction using a scratch register -- e.g., "movl %edx,0xnnnn(%rsi)".
+ * We need to restore the contents of the scratch register
+ * (FIX_RIP_reg).
+ */
+static int default_post_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ struct uprobe_task *utask = current->utask;
+
+ riprel_post_xol(auprobe, regs);
+ if (auprobe->defparam.fixups & UPROBE_FIX_IP) {
+ long correction = utask->vaddr - utask->xol_vaddr;
+ regs->ip += correction;
+ } else if (auprobe->defparam.fixups & UPROBE_FIX_CALL) {
+ regs->sp += sizeof_long(regs); /* Pop incorrect return address */
+ if (emulate_push_stack(regs, utask->vaddr + auprobe->defparam.ilen))
+ return -ERESTART;
+ }
+ /* popf; tell the caller to not touch TF */
+ if (auprobe->defparam.fixups & UPROBE_FIX_SETF)
+ utask->autask.saved_tf = true;
+
+ return 0;
+}
+
+static void default_abort_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ riprel_post_xol(auprobe, regs);
+}
+
+static const struct uprobe_xol_ops default_xol_ops = {
+ .pre_xol = default_pre_xol_op,
+ .post_xol = default_post_xol_op,
+ .abort = default_abort_op,
+};
+
+static bool branch_is_call(struct arch_uprobe *auprobe)
+{
+ return auprobe->branch.opc1 == 0xe8;
+}
+
+#define CASE_COND \
+ COND(70, 71, XF(OF)) \
+ COND(72, 73, XF(CF)) \
+ COND(74, 75, XF(ZF)) \
+ COND(78, 79, XF(SF)) \
+ COND(7a, 7b, XF(PF)) \
+ COND(76, 77, XF(CF) || XF(ZF)) \
+ COND(7c, 7d, XF(SF) != XF(OF)) \
+ COND(7e, 7f, XF(ZF) || XF(SF) != XF(OF))
+
+#define COND(op_y, op_n, expr) \
+ case 0x ## op_y: DO((expr) != 0) \
+ case 0x ## op_n: DO((expr) == 0)
+
+#define XF(xf) (!!(flags & X86_EFLAGS_ ## xf))
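+
+/*
+ * Example expansion: COND(74, 75, XF(ZF)) makes check_jmp_cond() below
+ * return whether ZF is set for opcode 0x74 (jz), and whether it is
+ * clear for 0x75 (jnz).
+ */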
+
+static bool is_cond_jmp_opcode(u8 opcode)
+{
+ switch (opcode) {
+ #define DO(expr) \
+ return true;
+ CASE_COND
+ #undef DO
+
+ default:
+ return false;
+ }
+}
+
+static bool check_jmp_cond(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ unsigned long flags = regs->flags;
+
+ switch (auprobe->branch.opc1) {
+ #define DO(expr) \
+ return expr;
+ CASE_COND
+ #undef DO
+
+ default: /* not a conditional jmp */
+ return true;
+ }
+}
+
+#undef XF
+#undef COND
+#undef CASE_COND
+
+static bool branch_emulate_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ unsigned long new_ip = regs->ip += auprobe->branch.ilen;
+ unsigned long offs = (long)auprobe->branch.offs;
+
+ if (branch_is_call(auprobe)) {
+ /*
+ * If it fails we execute this (mangled, see the comment in
+ * branch_clear_offset) insn out-of-line. In the likely case
+ * this should trigger the trap, and the probed application
+ * should die or restart the same insn after it handles the
+ * signal; arch_uprobe_post_xol() won't even be called.
+ *
+ * But there is a corner case, see the comment in ->post_xol().
+ */
+ */
+ if (emulate_push_stack(regs, new_ip))
+ return false;
+ } else if (!check_jmp_cond(auprobe, regs)) {
+ offs = 0;
+ }
+
+ regs->ip = new_ip + offs;
+ return true;
+}
+
+static bool push_emulate_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ unsigned long *src_ptr = (void *)regs + auprobe->push.reg_offset;
+
+ if (emulate_push_stack(regs, *src_ptr))
+ return false;
+ regs->ip += auprobe->push.ilen;
+ return true;
+}
+
+static int branch_post_xol_op(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ BUG_ON(!branch_is_call(auprobe));
+ /*
+ * We can only get here if branch_emulate_op() failed to push the ret
+ * address _and_ another thread expanded our stack before the (mangled)
+ * "call" insn was executed out-of-line. Just restore ->sp and restart.
+ * We could also restore ->ip and try to call branch_emulate_op() again.
+ */
+ regs->sp += sizeof_long(regs);
+ return -ERESTART;
+}
+
+static void branch_clear_offset(struct arch_uprobe *auprobe, struct insn *insn)
+{
+ /*
+ * Turn this insn into "call 1f; 1:"; this is what we will execute
+ * out-of-line if ->emulate() fails. We only need this to generate
+ * a trap, so that the probed task receives the correct signal with
+ * the properly filled siginfo.
+ *
+ * But see the comment in ->post_xol(): in the unlikely case it can
+ * succeed, we need to ensure that the new ->ip cannot fall into
+ * the non-canonical area and trigger #GP.
+ *
+ * We could turn it into (say) "pushf", but then we would need to
+ * divorce ->insn[] and ->ixol[]. We need to preserve the 1st byte
+ * of ->insn[] for set_orig_insn().
+ */
+ memset(auprobe->insn + insn_offset_immediate(insn),
+ 0, insn->immediate.nbytes);
+}
+
+static const struct uprobe_xol_ops branch_xol_ops = {
+ .emulate = branch_emulate_op,
+ .post_xol = branch_post_xol_op,
+};
+
+static const struct uprobe_xol_ops push_xol_ops = {
+ .emulate = push_emulate_op,
+};
+
+/* Returns -ENOSYS if branch_xol_ops doesn't handle this insn */
+static int branch_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
+{
+ u8 opc1 = OPCODE1(insn);
+ insn_byte_t p;
+
+ if (insn_is_nop(insn))
+ goto setup;
+
+ switch (opc1) {
+ case 0xeb: /* jmp 8 */
+ case 0xe9: /* jmp 32 */
+ break;
+
+ case 0xe8: /* call relative */
+ branch_clear_offset(auprobe, insn);
+ break;
+
+ case 0x0f:
+ if (insn->opcode.nbytes != 2)
+ return -ENOSYS;
+ /*
+ * If it is a "near" conditional jmp, OPCODE2() - 0x10 matches
+ * OPCODE1() of the "short" jmp which checks the same condition.
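+ * For example, 0f 84 (jz near) maps to 0x74 (jz short).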
+ */
+ opc1 = OPCODE2(insn) - 0x10;
+ fallthrough;
+ default:
+ if (!is_cond_jmp_opcode(opc1))
+ return -ENOSYS;
+ }
+
+ /*
+ * 16-bit overrides such as CALLW (66 e8 nn nn) are not supported.
+ * Intel and AMD behavior differ in 64-bit mode: Intel ignores the 66 prefix.
+ * Since no one uses these insns, reject any branch insn with such a prefix.
+ */
+ for_each_insn_prefix(insn, p) {
+ if (p == 0x66)
+ return -ENOTSUPP;
+ }
+
+setup:
+ auprobe->branch.opc1 = opc1;
+ auprobe->branch.ilen = insn->length;
+ auprobe->branch.offs = insn->immediate.value;
+
+ auprobe->ops = &branch_xol_ops;
+ return 0;
+}
+
+/* Returns -ENOSYS if push_xol_ops doesn't handle this insn */
+static int push_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
+{
+ u8 opc1 = OPCODE1(insn), reg_offset = 0;
+
+ if (opc1 < 0x50 || opc1 > 0x57)
+ return -ENOSYS;
+
+ if (insn->length > 2)
+ return -ENOSYS;
+ if (insn->length == 2) {
+ /* only support rex_prefix 0x41 (x64 only) */
+#ifdef CONFIG_X86_64
+ if (insn->rex_prefix.nbytes != 1 ||
+ insn->rex_prefix.bytes[0] != 0x41)
+ return -ENOSYS;
+
+ switch (opc1) {
+ case 0x50:
+ reg_offset = offsetof(struct pt_regs, r8);
+ break;
+ case 0x51:
+ reg_offset = offsetof(struct pt_regs, r9);
+ break;
+ case 0x52:
+ reg_offset = offsetof(struct pt_regs, r10);
+ break;
+ case 0x53:
+ reg_offset = offsetof(struct pt_regs, r11);
+ break;
+ case 0x54:
+ reg_offset = offsetof(struct pt_regs, r12);
+ break;
+ case 0x55:
+ reg_offset = offsetof(struct pt_regs, r13);
+ break;
+ case 0x56:
+ reg_offset = offsetof(struct pt_regs, r14);
+ break;
+ case 0x57:
+ reg_offset = offsetof(struct pt_regs, r15);
+ break;
+ }
+#else
+ return -ENOSYS;
+#endif
+ } else {
+ switch (opc1) {
+ case 0x50:
+ reg_offset = offsetof(struct pt_regs, ax);
+ break;
+ case 0x51:
+ reg_offset = offsetof(struct pt_regs, cx);
+ break;
+ case 0x52:
+ reg_offset = offsetof(struct pt_regs, dx);
+ break;
+ case 0x53:
+ reg_offset = offsetof(struct pt_regs, bx);
+ break;
+ case 0x54:
+ reg_offset = offsetof(struct pt_regs, sp);
+ break;
+ case 0x55:
+ reg_offset = offsetof(struct pt_regs, bp);
+ break;
+ case 0x56:
+ reg_offset = offsetof(struct pt_regs, si);
+ break;
+ case 0x57:
+ reg_offset = offsetof(struct pt_regs, di);
+ break;
+ }
+ }
+
+ auprobe->push.reg_offset = reg_offset;
+ auprobe->push.ilen = insn->length;
+ auprobe->ops = &push_xol_ops;
+ return 0;
+}
+
+/**
+ * arch_uprobe_analyze_insn - instruction analysis including validity and fixups.
+ * @auprobe: the probepoint information.
+ * @mm: the probed address space.
+ * @addr: virtual address at which to install the probepoint
+ * Return 0 on success or a negative errno on error.
+ */
+int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long addr)
+{
+ u8 fix_ip_or_call = UPROBE_FIX_IP;
+ struct insn insn;
+ int ret;
+
+ ret = uprobe_init_insn(auprobe, &insn, is_64bit_mm(mm));
+ if (ret)
+ return ret;
+
+ if (can_optimize(&insn, addr))
+ set_bit(ARCH_UPROBE_FLAG_CAN_OPTIMIZE, &auprobe->flags);
+
+ ret = branch_setup_xol_ops(auprobe, &insn);
+ if (ret != -ENOSYS)
+ return ret;
+
+ ret = push_setup_xol_ops(auprobe, &insn);
+ if (ret != -ENOSYS)
+ return ret;
+
+ /*
+ * Figure out which fixups default_post_xol_op() will need to perform,
+ * and annotate defparam->fixups accordingly.
+ */
+ switch (OPCODE1(&insn)) {
+ case 0x9d: /* popf */
+ auprobe->defparam.fixups |= UPROBE_FIX_SETF;
+ break;
+ case 0xc3: /* ret or lret -- ip is correct */
+ case 0xcb:
+ case 0xc2:
+ case 0xca:
+ case 0xea: /* jmp absolute -- ip is correct */
+ fix_ip_or_call = 0;
+ break;
+ case 0x9a: /* call absolute - Fix return addr, not ip */
+ fix_ip_or_call = UPROBE_FIX_CALL;
+ break;
+ case 0xff:
+ switch (MODRM_REG(&insn)) {
+ case 2: case 3: /* call or lcall, indirect */
+ fix_ip_or_call = UPROBE_FIX_CALL;
+ break;
+ case 4: case 5: /* jmp or ljmp, indirect */
+ fix_ip_or_call = 0;
+ break;
+ }
+ fallthrough;
+ default:
+ riprel_analyze(auprobe, &insn);
+ }
+
+ auprobe->defparam.ilen = insn.length;
+ auprobe->defparam.fixups |= fix_ip_or_call;
+
+ auprobe->ops = &default_xol_ops;
+ return 0;
+}
+
+/*
+ * arch_uprobe_pre_xol - prepare to execute out of line.
+ * @auprobe: the probepoint information.
+ * @regs: reflects the saved user state of current task.
+ */
+int arch_uprobe_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ struct uprobe_task *utask = current->utask;
+
+ if (auprobe->ops->pre_xol) {
+ int err = auprobe->ops->pre_xol(auprobe, regs);
+ if (err)
+ return err;
+ }
+
+ regs->ip = utask->xol_vaddr;
+ utask->autask.saved_trap_nr = current->thread.trap_nr;
+ current->thread.trap_nr = UPROBE_TRAP_NR;
+
+ utask->autask.saved_tf = !!(regs->flags & X86_EFLAGS_TF);
+ regs->flags |= X86_EFLAGS_TF;
+ if (test_tsk_thread_flag(current, TIF_BLOCKSTEP))
+ set_task_blockstep(current, false);
+
+ return 0;
+}
+
+/*
+ * If the xol insn itself traps and generates a signal (say,
+ * SIGILL/SIGSEGV/etc), then detect the case where a single-stepped
+ * instruction jumps back to its own address. It is assumed that anything
+ * like do_page_fault/do_trap/etc sets thread.trap_nr != -1.
+ *
+ * arch_uprobe_pre_xol/arch_uprobe_post_xol save/restore thread.trap_nr,
+ * arch_uprobe_xol_was_trapped() simply checks that ->trap_nr is not equal to
+ * UPROBE_TRAP_NR == -1 set by arch_uprobe_pre_xol().
+ */
+bool arch_uprobe_xol_was_trapped(struct task_struct *t)
+{
+ if (t->thread.trap_nr != UPROBE_TRAP_NR)
+ return true;
+
+ return false;
+}
+
+/*
+ * Called after single-stepping. To avoid the SMP problems that can
+ * occur when we temporarily put back the original opcode to
+ * single-step, we single-stepped a copy of the instruction.
+ *
+ * This function prepares to resume execution after the single-step.
+ */
+int arch_uprobe_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ struct uprobe_task *utask = current->utask;
+ bool send_sigtrap = utask->autask.saved_tf;
+ int err = 0;
+
+ WARN_ON_ONCE(current->thread.trap_nr != UPROBE_TRAP_NR);
+ current->thread.trap_nr = utask->autask.saved_trap_nr;
+
+ if (auprobe->ops->post_xol) {
+ err = auprobe->ops->post_xol(auprobe, regs);
+ if (err) {
+ /*
+ * Restore ->ip for restart or post mortem analysis.
+ * ->post_xol() must not return -ERESTART unless this
+ * is really possible.
+ */
+ regs->ip = utask->vaddr;
+ if (err == -ERESTART)
+ err = 0;
+ send_sigtrap = false;
+ }
+ }
+ /*
+ * arch_uprobe_pre_xol() doesn't save the state of TIF_BLOCKSTEP
+ * so we can get an extra SIGTRAP if we do not clear TF. We need
+ * to examine the opcode to make it right.
+ */
+ if (send_sigtrap)
+ send_sig(SIGTRAP, current, 0);
+
+ if (!utask->autask.saved_tf)
+ regs->flags &= ~X86_EFLAGS_TF;
+
+ return err;
+}
+
+/* callback routine for handling exceptions. */
+int arch_uprobe_exception_notify(struct notifier_block *self, unsigned long val, void *data)
+{
+ struct die_args *args = data;
+ struct pt_regs *regs = args->regs;
+ int ret = NOTIFY_DONE;
+
+ /* We are only interested in userspace traps */
+ if (regs && !user_mode(regs))
+ return NOTIFY_DONE;
+
+ switch (val) {
+ case DIE_INT3:
+ if (uprobe_pre_sstep_notifier(regs))
+ ret = NOTIFY_STOP;
+
+ break;
+
+ case DIE_DEBUG:
+ if (uprobe_post_sstep_notifier(regs))
+ ret = NOTIFY_STOP;
+
+ break;
+
+ default:
+ break;
+ }
+
+ return ret;
+}
+
+/*
+ * This function gets called when the XOL instruction either gets trapped or
+ * the thread has a fatal signal. Reset the instruction pointer to its
+ * probed address for a potential restart or for post-mortem analysis.
+ */
+void arch_uprobe_abort_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ struct uprobe_task *utask = current->utask;
+
+ if (auprobe->ops->abort)
+ auprobe->ops->abort(auprobe, regs);
+
+ current->thread.trap_nr = utask->autask.saved_trap_nr;
+ regs->ip = utask->vaddr;
+ /* clear TF if it was set by us in arch_uprobe_pre_xol() */
+ if (!utask->autask.saved_tf)
+ regs->flags &= ~X86_EFLAGS_TF;
+}
+
+static bool __skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ if (auprobe->ops->emulate)
+ return auprobe->ops->emulate(auprobe, regs);
+ return false;
+}
+
+bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
+{
+ bool ret = __skip_sstep(auprobe, regs);
+ if (ret && (regs->flags & X86_EFLAGS_TF))
+ send_sig(SIGTRAP, current, 0);
+ return ret;
+}
+
+unsigned long
+arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs *regs)
+{
+ int rasize = sizeof_long(regs), nleft;
+ unsigned long orig_ret_vaddr = 0; /* clear high bits for 32-bit apps */
+
+ if (copy_from_user(&orig_ret_vaddr, (void __user *)regs->sp, rasize))
+ return -1;
+
+ /* check whether the address has already been hijacked */
+ if (orig_ret_vaddr == trampoline_vaddr)
+ return orig_ret_vaddr;
+
+ nleft = copy_to_user((void __user *)regs->sp, &trampoline_vaddr, rasize);
+ if (likely(!nleft)) {
+ if (shstk_update_last_frame(trampoline_vaddr)) {
+ force_sig(SIGSEGV);
+ return -1;
+ }
+ return orig_ret_vaddr;
+ }
+
+ if (nleft != rasize) {
+ pr_err("return address clobbered: pid=%d, %%sp=%#lx, %%ip=%#lx\n",
+ current->pid, regs->sp, regs->ip);
+
+ force_sig(SIGSEGV);
+ }
+
+ return -1;
+}
+
+bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check ctx,
+ struct pt_regs *regs)
+{
+ if (ctx == RP_CHECK_CALL) /* sp was just decremented by "call" insn */
+ return regs->sp < ret->stack;
+ else
+ return regs->sp <= ret->stack;
+}
+
+/*
+ * Heuristic-based check whether a uprobe is installed at a function entry.
+ *
+ * Under the assumption that user code is compiled with frame pointers,
+ * `push %rbp/%ebp` is a good indicator that we indeed are at a function entry.
+ *
+ * Similarly, `endbr64` (assuming 64-bit mode) is also a common pattern.
+ * If we get this wrong, the captured stack trace might have one extra bogus
+ * entry, but the rest of the stack trace will still be meaningful.
+ */
+bool is_uprobe_at_func_entry(struct pt_regs *regs)
+{
+ struct arch_uprobe *auprobe;
+
+ if (!current->utask)
+ return false;
+
+ auprobe = current->utask->auprobe;
+ if (!auprobe)
+ return false;
+
+ /* push %rbp/%ebp */
+ if (auprobe->insn[0] == 0x55)
+ return true;
+
+ /* endbr64 (64-bit only) */
+ if (user_64bit_mode(regs) && is_endbr((u32 *)auprobe->insn))
+ return true;
+
+ return false;
+}
diff --git a/arch/x86/kernel/uv_irq.c b/arch/x86/kernel/uv_irq.c
deleted file mode 100644
index 1132129db792..000000000000
--- a/arch/x86/kernel/uv_irq.c
+++ /dev/null
@@ -1,302 +0,0 @@
-/*
- * This file is subject to the terms and conditions of the GNU General Public
- * License. See the file "COPYING" in the main directory of this archive
- * for more details.
- *
- * SGI UV IRQ functions
- *
- * Copyright (C) 2008 Silicon Graphics, Inc. All rights reserved.
- */
-
-#include <linux/module.h>
-#include <linux/rbtree.h>
-#include <linux/slab.h>
-#include <linux/irq.h>
-
-#include <asm/apic.h>
-#include <asm/uv/uv_irq.h>
-#include <asm/uv/uv_hub.h>
-
-/* MMR offset and pnode of hub sourcing interrupts for a given irq */
-struct uv_irq_2_mmr_pnode{
- struct rb_node list;
- unsigned long offset;
- int pnode;
- int irq;
-};
-
-static spinlock_t uv_irq_lock;
-static struct rb_root uv_irq_root;
-
-static int uv_set_irq_affinity(unsigned int, const struct cpumask *);
-
-static void uv_noop(unsigned int irq)
-{
-}
-
-static unsigned int uv_noop_ret(unsigned int irq)
-{
- return 0;
-}
-
-static void uv_ack_apic(unsigned int irq)
-{
- ack_APIC_irq();
-}
-
-static struct irq_chip uv_irq_chip = {
- .name = "UV-CORE",
- .startup = uv_noop_ret,
- .shutdown = uv_noop,
- .enable = uv_noop,
- .disable = uv_noop,
- .ack = uv_noop,
- .mask = uv_noop,
- .unmask = uv_noop,
- .eoi = uv_ack_apic,
- .end = uv_noop,
- .set_affinity = uv_set_irq_affinity,
-};
-
-/*
- * Add offset and pnode information of the hub sourcing interrupts to the
- * rb tree for a specific irq.
- */
-static int uv_set_irq_2_mmr_info(int irq, unsigned long offset, unsigned blade)
-{
- struct rb_node **link = &uv_irq_root.rb_node;
- struct rb_node *parent = NULL;
- struct uv_irq_2_mmr_pnode *n;
- struct uv_irq_2_mmr_pnode *e;
- unsigned long irqflags;
-
- n = kmalloc_node(sizeof(struct uv_irq_2_mmr_pnode), GFP_KERNEL,
- uv_blade_to_memory_nid(blade));
- if (!n)
- return -ENOMEM;
-
- n->irq = irq;
- n->offset = offset;
- n->pnode = uv_blade_to_pnode(blade);
- spin_lock_irqsave(&uv_irq_lock, irqflags);
- /* Find the right place in the rbtree: */
- while (*link) {
- parent = *link;
- e = rb_entry(parent, struct uv_irq_2_mmr_pnode, list);
-
- if (unlikely(irq == e->irq)) {
- /* irq entry exists */
- e->pnode = uv_blade_to_pnode(blade);
- e->offset = offset;
- spin_unlock_irqrestore(&uv_irq_lock, irqflags);
- kfree(n);
- return 0;
- }
-
- if (irq < e->irq)
- link = &(*link)->rb_left;
- else
- link = &(*link)->rb_right;
- }
-
- /* Insert the node into the rbtree. */
- rb_link_node(&n->list, parent, link);
- rb_insert_color(&n->list, &uv_irq_root);
-
- spin_unlock_irqrestore(&uv_irq_lock, irqflags);
- return 0;
-}
-
-/* Retrieve offset and pnode information from the rb tree for a specific irq */
-int uv_irq_2_mmr_info(int irq, unsigned long *offset, int *pnode)
-{
- struct uv_irq_2_mmr_pnode *e;
- struct rb_node *n;
- unsigned long irqflags;
-
- spin_lock_irqsave(&uv_irq_lock, irqflags);
- n = uv_irq_root.rb_node;
- while (n) {
- e = rb_entry(n, struct uv_irq_2_mmr_pnode, list);
-
- if (e->irq == irq) {
- *offset = e->offset;
- *pnode = e->pnode;
- spin_unlock_irqrestore(&uv_irq_lock, irqflags);
- return 0;
- }
-
- if (irq < e->irq)
- n = n->rb_left;
- else
- n = n->rb_right;
- }
- spin_unlock_irqrestore(&uv_irq_lock, irqflags);
- return -1;
-}
-
-/*
- * Re-target the irq to the specified CPU and enable the specified MMR located
- * on the specified blade to allow the sending of MSIs to the specified CPU.
- */
-static int
-arch_enable_uv_irq(char *irq_name, unsigned int irq, int cpu, int mmr_blade,
- unsigned long mmr_offset, int limit)
-{
- const struct cpumask *eligible_cpu = cpumask_of(cpu);
- struct irq_desc *desc = irq_to_desc(irq);
- struct irq_cfg *cfg;
- int mmr_pnode;
- unsigned long mmr_value;
- struct uv_IO_APIC_route_entry *entry;
- int err;
-
- BUILD_BUG_ON(sizeof(struct uv_IO_APIC_route_entry) !=
- sizeof(unsigned long));
-
- cfg = irq_cfg(irq);
-
- err = assign_irq_vector(irq, cfg, eligible_cpu);
- if (err != 0)
- return err;
-
- if (limit == UV_AFFINITY_CPU)
- desc->status |= IRQ_NO_BALANCING;
- else
- desc->status |= IRQ_MOVE_PCNTXT;
-
- set_irq_chip_and_handler_name(irq, &uv_irq_chip, handle_percpu_irq,
- irq_name);
-
- mmr_value = 0;
- entry = (struct uv_IO_APIC_route_entry *)&mmr_value;
- entry->vector = cfg->vector;
- entry->delivery_mode = apic->irq_delivery_mode;
- entry->dest_mode = apic->irq_dest_mode;
- entry->polarity = 0;
- entry->trigger = 0;
- entry->mask = 0;
- entry->dest = apic->cpu_mask_to_apicid(eligible_cpu);
-
- mmr_pnode = uv_blade_to_pnode(mmr_blade);
- uv_write_global_mmr64(mmr_pnode, mmr_offset, mmr_value);
-
- if (cfg->move_in_progress)
- send_cleanup_vector(cfg);
-
- return irq;
-}
-
-/*
- * Disable the specified MMR located on the specified blade so that MSIs are
- * longer allowed to be sent.
- */
-static void arch_disable_uv_irq(int mmr_pnode, unsigned long mmr_offset)
-{
- unsigned long mmr_value;
- struct uv_IO_APIC_route_entry *entry;
-
- BUILD_BUG_ON(sizeof(struct uv_IO_APIC_route_entry) !=
- sizeof(unsigned long));
-
- mmr_value = 0;
- entry = (struct uv_IO_APIC_route_entry *)&mmr_value;
- entry->mask = 1;
-
- uv_write_global_mmr64(mmr_pnode, mmr_offset, mmr_value);
-}
-
-static int uv_set_irq_affinity(unsigned int irq, const struct cpumask *mask)
-{
- struct irq_desc *desc = irq_to_desc(irq);
- struct irq_cfg *cfg = desc->chip_data;
- unsigned int dest;
- unsigned long mmr_value;
- struct uv_IO_APIC_route_entry *entry;
- unsigned long mmr_offset;
- int mmr_pnode;
-
- if (set_desc_affinity(desc, mask, &dest))
- return -1;
-
- mmr_value = 0;
- entry = (struct uv_IO_APIC_route_entry *)&mmr_value;
-
- entry->vector = cfg->vector;
- entry->delivery_mode = apic->irq_delivery_mode;
- entry->dest_mode = apic->irq_dest_mode;
- entry->polarity = 0;
- entry->trigger = 0;
- entry->mask = 0;
- entry->dest = dest;
-
- /* Get previously stored MMR and pnode of hub sourcing interrupts */
- if (uv_irq_2_mmr_info(irq, &mmr_offset, &mmr_pnode))
- return -1;
-
- uv_write_global_mmr64(mmr_pnode, mmr_offset, mmr_value);
-
- if (cfg->move_in_progress)
- send_cleanup_vector(cfg);
-
- return 0;
-}
-
-/*
- * Set up a mapping of an available irq and vector, and enable the specified
- * MMR that defines the MSI that is to be sent to the specified CPU when an
- * interrupt is raised.
- */
-int uv_setup_irq(char *irq_name, int cpu, int mmr_blade,
- unsigned long mmr_offset, int limit)
-{
- int irq, ret;
-
- irq = create_irq_nr(NR_IRQS_LEGACY, uv_blade_to_memory_nid(mmr_blade));
-
- if (irq <= 0)
- return -EBUSY;
-
- ret = arch_enable_uv_irq(irq_name, irq, cpu, mmr_blade, mmr_offset,
- limit);
- if (ret == irq)
- uv_set_irq_2_mmr_info(irq, mmr_offset, mmr_blade);
- else
- destroy_irq(irq);
-
- return ret;
-}
-EXPORT_SYMBOL_GPL(uv_setup_irq);
-
-/*
- * Tear down a mapping of an irq and vector, and disable the specified MMR that
- * defined the MSI that was to be sent to the specified CPU when an interrupt
- * was raised.
- *
- * Set mmr_blade and mmr_offset to what was passed in on uv_setup_irq().
- */
-void uv_teardown_irq(unsigned int irq)
-{
- struct uv_irq_2_mmr_pnode *e;
- struct rb_node *n;
- unsigned long irqflags;
-
- spin_lock_irqsave(&uv_irq_lock, irqflags);
- n = uv_irq_root.rb_node;
- while (n) {
- e = rb_entry(n, struct uv_irq_2_mmr_pnode, list);
- if (e->irq == irq) {
- arch_disable_uv_irq(e->pnode, e->offset);
- rb_erase(n, &uv_irq_root);
- kfree(e);
- break;
- }
- if (irq < e->irq)
- n = n->rb_left;
- else
- n = n->rb_right;
- }
- spin_unlock_irqrestore(&uv_irq_lock, irqflags);
- destroy_irq(irq);
-}
-EXPORT_SYMBOL_GPL(uv_teardown_irq);
diff --git a/arch/x86/kernel/uv_sysfs.c b/arch/x86/kernel/uv_sysfs.c
deleted file mode 100644
index 309c70fb7759..000000000000
--- a/arch/x86/kernel/uv_sysfs.c
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * This file supports the /sys/firmware/sgi_uv interfaces for SGI UV.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- *
- * Copyright (c) 2008 Silicon Graphics, Inc. All Rights Reserved.
- * Copyright (c) Russ Anderson
- */
-
-#include <linux/sysdev.h>
-#include <asm/uv/bios.h>
-#include <asm/uv/uv.h>
-
-struct kobject *sgi_uv_kobj;
-
-static ssize_t partition_id_show(struct kobject *kobj,
- struct kobj_attribute *attr, char *buf)
-{
- return snprintf(buf, PAGE_SIZE, "%ld\n", sn_partition_id);
-}
-
-static ssize_t coherence_id_show(struct kobject *kobj,
- struct kobj_attribute *attr, char *buf)
-{
- return snprintf(buf, PAGE_SIZE, "%ld\n", partition_coherence_id());
-}
-
-static struct kobj_attribute partition_id_attr =
- __ATTR(partition_id, S_IRUGO, partition_id_show, NULL);
-
-static struct kobj_attribute coherence_id_attr =
- __ATTR(coherence_id, S_IRUGO, coherence_id_show, NULL);
-
-
-static int __init sgi_uv_sysfs_init(void)
-{
- unsigned long ret;
-
- if (!is_uv_system())
- return -ENODEV;
-
- if (!sgi_uv_kobj)
- sgi_uv_kobj = kobject_create_and_add("sgi_uv", firmware_kobj);
- if (!sgi_uv_kobj) {
- printk(KERN_WARNING "kobject_create_and_add sgi_uv failed\n");
- return -EINVAL;
- }
-
- ret = sysfs_create_file(sgi_uv_kobj, &partition_id_attr.attr);
- if (ret) {
- printk(KERN_WARNING "sysfs_create_file partition_id failed\n");
- return ret;
- }
-
- ret = sysfs_create_file(sgi_uv_kobj, &coherence_id_attr.attr);
- if (ret) {
- printk(KERN_WARNING "sysfs_create_file coherence_id failed\n");
- return ret;
- }
-
- return 0;
-}
-
-device_initcall(sgi_uv_sysfs_init);
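
The file removed above follows the standard kobject-under-/sys/firmware pattern. As a side note, it declared "unsigned long ret" for sysfs_create_file() (which returns int) and leaked the kobject on failure; below is a minimal modern sketch of the same pattern with those two points tidied up. All names here are placeholders, not part of the patch.

#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

static struct kobject *example_kobj;

static ssize_t value_show(struct kobject *kobj, struct kobj_attribute *attr,
			  char *buf)
{
	return sysfs_emit(buf, "%d\n", 42);	/* placeholder datum */
}
static struct kobj_attribute value_attr = __ATTR_RO(value);

static int __init example_sysfs_init(void)
{
	int ret;

	example_kobj = kobject_create_and_add("example", firmware_kobj);
	if (!example_kobj)
		return -ENOMEM;

	ret = sysfs_create_file(example_kobj, &value_attr.attr);
	if (ret)
		kobject_put(example_kobj);	/* don't leak on failure */
	return ret;
}
device_initcall(example_sysfs_init);
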
diff --git a/arch/x86/kernel/verify_cpu.S b/arch/x86/kernel/verify_cpu.S
new file mode 100644
index 000000000000..37ad43792452
--- /dev/null
+++ b/arch/x86/kernel/verify_cpu.S
@@ -0,0 +1,144 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ *
+ * verify_cpu.S - Code for cpu long mode and SSE verification. This
+ * code has been borrowed from boot/setup.S and was introduced by
+ * Andi Kleen.
+ *
+ * Copyright (c) 2007 Andi Kleen (ak@suse.de)
+ * Copyright (c) 2007 Eric Biederman (ebiederm@xmission.com)
+ * Copyright (c) 2007 Vivek Goyal (vgoyal@in.ibm.com)
+ * Copyright (c) 2010 Kees Cook (kees.cook@canonical.com)
+ *
+ * This is common code for verifying whether the CPU supports
+ * long mode and SSE. It is not called directly; instead, this
+ * file is included in various places and compiled in that context.
+ * This file is expected to run in 32-bit code. Currently:
+ *
+ * arch/x86/boot/compressed/head_64.S: Boot cpu verification
+ * arch/x86/kernel/trampoline_64.S: secondary processor verification
+ * arch/x86/kernel/head_32.S: processor startup
+ *
+ * verify_cpu returns the status of long mode and SSE in register %eax.
+ * 0: Success 1: Failure
+ *
+ * On Intel, the XD_DISABLE flag will be cleared as a side-effect.
+ *
+ * The caller needs to check the error code and take appropriate
+ * action: either display a message or halt.
+ */
+
+#include <asm/cpufeatures.h>
+#include <asm/cpufeaturemasks.h>
+#include <asm/msr-index.h>
+
+#define SSE_MASK \
+ (REQUIRED_MASK0 & ((1<<(X86_FEATURE_XMM & 31)) | (1<<(X86_FEATURE_XMM2 & 31))))
+
+SYM_FUNC_START_LOCAL(verify_cpu)
+ pushf # Save caller passed flags
+ push $0 # Kill any dangerous flags
+ popf
+
+#ifndef __x86_64__
+ pushfl # standard way to check for cpuid
+ popl %eax
+ movl %eax,%ebx
+ xorl $0x200000,%eax
+ pushl %eax
+ popfl
+ pushfl
+ popl %eax
+ cmpl %eax,%ebx
+ jz .Lverify_cpu_no_longmode # cpu has no cpuid
+#endif
+
+ movl $0x0,%eax # See if cpuid 1 is implemented
+ cpuid
+ cmpl $0x1,%eax
+ jb .Lverify_cpu_no_longmode # no cpuid 1
+
+ xor %di,%di
+ cmpl $0x68747541,%ebx # AuthenticAMD
+ jnz .Lverify_cpu_noamd
+ cmpl $0x69746e65,%edx
+ jnz .Lverify_cpu_noamd
+ cmpl $0x444d4163,%ecx
+ jnz .Lverify_cpu_noamd
+ mov $1,%di # cpu is from AMD
+ jmp .Lverify_cpu_check
+
+.Lverify_cpu_noamd:
+ cmpl $0x756e6547,%ebx # GenuineIntel?
+ jnz .Lverify_cpu_check
+ cmpl $0x49656e69,%edx
+ jnz .Lverify_cpu_check
+ cmpl $0x6c65746e,%ecx
+ jnz .Lverify_cpu_check
+
+ # only call IA32_MISC_ENABLE when:
+ # family > 6 || (family == 6 && model >= 0xd)
+ movl $0x1, %eax # check CPU family and model
+ cpuid
+ movl %eax, %ecx
+
+ andl $0x0ff00f00, %eax # mask family and extended family
+ shrl $8, %eax
+ cmpl $6, %eax
+ ja .Lverify_cpu_clear_xd # family > 6, ok
+ jb .Lverify_cpu_check # family < 6, skip
+
+ andl $0x000f00f0, %ecx # mask model and extended model
+ shrl $4, %ecx
+ cmpl $0xd, %ecx
+ jb .Lverify_cpu_check # family == 6, model < 0xd, skip
+
+.Lverify_cpu_clear_xd:
+ movl $MSR_IA32_MISC_ENABLE, %ecx
+ rdmsr
+ btrl $2, %edx # clear MSR_IA32_MISC_ENABLE_XD_DISABLE
+ jnc .Lverify_cpu_check # only write MSR if bit was changed
+ wrmsr
+
+.Lverify_cpu_check:
+ movl $0x1,%eax # Does the cpu have what it takes
+ cpuid
+ andl $REQUIRED_MASK0,%edx
+ xorl $REQUIRED_MASK0,%edx
+ jnz .Lverify_cpu_no_longmode
+
+ movl $0x80000000,%eax # See if extended cpuid is implemented
+ cpuid
+ cmpl $0x80000001,%eax
+ jb .Lverify_cpu_no_longmode # no extended cpuid
+
+ movl $0x80000001,%eax # Does the cpu have what it takes
+ cpuid
+ andl $REQUIRED_MASK1,%edx
+ xorl $REQUIRED_MASK1,%edx
+ jnz .Lverify_cpu_no_longmode
+
+.Lverify_cpu_sse_test:
+ movl $1,%eax
+ cpuid
+ andl $SSE_MASK,%edx
+ cmpl $SSE_MASK,%edx
+ je .Lverify_cpu_sse_ok
+ test %di,%di
+ jz .Lverify_cpu_no_longmode # only try to force SSE on AMD
+ movl $MSR_K7_HWCR,%ecx
+ rdmsr
+ btr $15,%eax # enable SSE
+ wrmsr
+ xor %di,%di # don't loop
+ jmp .Lverify_cpu_sse_test # try again
+
+.Lverify_cpu_no_longmode:
+ popf # Restore caller passed flags
+ movl $1,%eax
+ RET
+.Lverify_cpu_sse_ok:
+ popf # Restore caller passed flags
+ xorl %eax, %eax
+ RET
+SYM_FUNC_END(verify_cpu)
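
For readers who don't speak assembly: the new verify_cpu performs, in essence, the CPUID sequence below (plus the Intel XD_DISABLE and AMD HWCR side steps). This is a userspace C rendering for illustration only; verify_cpu itself runs long before any C environment exists.

#include <cpuid.h>	/* GCC/Clang helper; returns 0 if the leaf is absent */

static int cpu_has_long_mode_and_sse(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* Leaf 1 must exist and advertise SSE (bit 25) and SSE2 (bit 26). */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 0;
	if ((edx & (3u << 25)) != (3u << 25))
		return 0;

	/* Extended leaf 0x80000001 must advertise Long Mode (LM, bit 29). */
	if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
		return 0;
	return !!(edx & (1u << 29));
}
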
diff --git a/arch/x86/kernel/verify_cpu_64.S b/arch/x86/kernel/verify_cpu_64.S
deleted file mode 100644
index 45b6f8a975a1..000000000000
--- a/arch/x86/kernel/verify_cpu_64.S
+++ /dev/null
@@ -1,105 +0,0 @@
-/*
- *
- * verify_cpu.S - Code for cpu long mode and SSE verification. This
- * code has been borrowed from boot/setup.S and was introduced by
- * Andi Kleen.
- *
- * Copyright (c) 2007 Andi Kleen (ak@suse.de)
- * Copyright (c) 2007 Eric Biederman (ebiederm@xmission.com)
- * Copyright (c) 2007 Vivek Goyal (vgoyal@in.ibm.com)
- *
- * This source code is licensed under the GNU General Public License,
- * Version 2. See the file COPYING for more details.
- *
- * This is common code for verifying whether the CPU supports
- * long mode and SSE. It is not called directly; instead, this
- * file is included in various places and compiled in that context.
- * The current usage is as follows.
- *
- * This file is included by both 16bit and 32bit code.
- *
- * arch/x86_64/boot/setup.S : Boot cpu verification (16bit)
- * arch/x86_64/boot/compressed/head.S: Boot cpu verification (32bit)
- * arch/x86_64/kernel/trampoline.S: secondary processor verification (16bit)
- * arch/x86_64/kernel/acpi/wakeup.S: verification at resume (16bit)
- *
- * verify_cpu returns the status of the cpu check in register %eax.
- * 0: Success 1: Failure
- *
- * The caller needs to check the error code and take appropriate
- * action: either display a message or halt.
- */
-
-#include <asm/cpufeature.h>
-
-verify_cpu:
- pushfl # Save caller passed flags
- pushl $0 # Kill any dangerous flags
- popfl
-
- pushfl # standard way to check for cpuid
- popl %eax
- movl %eax,%ebx
- xorl $0x200000,%eax
- pushl %eax
- popfl
- pushfl
- popl %eax
- cmpl %eax,%ebx
- jz verify_cpu_no_longmode # cpu has no cpuid
-
- movl $0x0,%eax # See if cpuid 1 is implemented
- cpuid
- cmpl $0x1,%eax
- jb verify_cpu_no_longmode # no cpuid 1
-
- xor %di,%di
- cmpl $0x68747541,%ebx # AuthenticAMD
- jnz verify_cpu_noamd
- cmpl $0x69746e65,%edx
- jnz verify_cpu_noamd
- cmpl $0x444d4163,%ecx
- jnz verify_cpu_noamd
- mov $1,%di # cpu is from AMD
-
-verify_cpu_noamd:
- movl $0x1,%eax # Does the cpu have what it takes
- cpuid
- andl $REQUIRED_MASK0,%edx
- xorl $REQUIRED_MASK0,%edx
- jnz verify_cpu_no_longmode
-
- movl $0x80000000,%eax # See if extended cpuid is implemented
- cpuid
- cmpl $0x80000001,%eax
- jb verify_cpu_no_longmode # no extended cpuid
-
- movl $0x80000001,%eax # Does the cpu have what it takes
- cpuid
- andl $REQUIRED_MASK1,%edx
- xorl $REQUIRED_MASK1,%edx
- jnz verify_cpu_no_longmode
-
-verify_cpu_sse_test:
- movl $1,%eax
- cpuid
- andl $SSE_MASK,%edx
- cmpl $SSE_MASK,%edx
- je verify_cpu_sse_ok
- test %di,%di
- jz verify_cpu_no_longmode # only try to force SSE on AMD
- movl $0xc0010015,%ecx # HWCR
- rdmsr
- btr $15,%eax # enable SSE
- wrmsr
- xor %di,%di # don't loop
- jmp verify_cpu_sse_test # try again
-
-verify_cpu_no_longmode:
- popfl # Restore caller passed flags
- movl $1,%eax
- ret
-verify_cpu_sse_ok:
- popfl # Restore caller passed flags
- xorl %eax, %eax
- ret
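
Functionally the deleted file matches the new verify_cpu.S; the rewrite switches to .L-local labels, the RET macro, and the MSR_K7_HWCR symbol instead of the raw 0xc0010015. The "btr $15" in both versions clears the HWCR bit that hides SSE on early AMD parts; in kernel C the same operation would look roughly like this (a sketch only, not code from the patch):

#include <linux/bits.h>
#include <asm/msr.h>		/* rdmsrl()/wrmsrl(), MSR_K7_HWCR */

static void amd_force_sse_enable(void)
{
	u64 hwcr;

	rdmsrl(MSR_K7_HWCR, hwcr);
	if (hwcr & BIT_ULL(15)) {	/* SSE-disable bit set? */
		hwcr &= ~BIT_ULL(15);
		wrmsrl(MSR_K7_HWCR, hwcr);
	}
}
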
diff --git a/arch/x86/kernel/visws_quirks.c b/arch/x86/kernel/visws_quirks.c
deleted file mode 100644
index e680ea52db9b..000000000000
--- a/arch/x86/kernel/visws_quirks.c
+++ /dev/null
@@ -1,666 +0,0 @@
-/*
- * SGI Visual Workstation support and quirks, unmaintained.
- *
- * Split out from setup.c by davej@suse.de
- *
- * Copyright (C) 1999 Bent Hagemark, Ingo Molnar
- *
- * SGI Visual Workstation interrupt controller
- *
- * The Cobalt system ASIC in the Visual Workstation contains a "Cobalt" APIC
- * which serves as the main interrupt controller in the system. Non-legacy
- * hardware in the system uses this controller directly. Legacy devices
- * are connected to the PIIX4, which in turn has its 8259(s) connected to
- * one of the Cobalt APIC entries.
- *
- * 09/02/2000 - Updated for 2.4 by jbarnes@sgi.com
- *
- * 25/11/2002 - Updated for 2.5 by Andrey Panin <pazke@orbita1.ru>
- */
-#include <linux/interrupt.h>
-#include <linux/module.h>
-#include <linux/init.h>
-#include <linux/smp.h>
-
-#include <asm/visws/cobalt.h>
-#include <asm/visws/piix4.h>
-#include <asm/io_apic.h>
-#include <asm/fixmap.h>
-#include <asm/reboot.h>
-#include <asm/setup.h>
-#include <asm/apic.h>
-#include <asm/e820.h>
-#include <asm/time.h>
-#include <asm/io.h>
-
-#include <linux/kernel_stat.h>
-
-#include <asm/i8259.h>
-#include <asm/irq_vectors.h>
-#include <asm/visws/lithium.h>
-
-#include <linux/sched.h>
-#include <linux/kernel.h>
-#include <linux/pci.h>
-#include <linux/pci_ids.h>
-
-extern int no_broadcast;
-
-char visws_board_type = -1;
-char visws_board_rev = -1;
-
-static void __init visws_time_init(void)
-{
- printk(KERN_INFO "Starting Cobalt Timer system clock\n");
-
- /* Set the countdown value */
- co_cpu_write(CO_CPU_TIMEVAL, CO_TIME_HZ/HZ);
-
- /* Start the timer */
- co_cpu_write(CO_CPU_CTRL, co_cpu_read(CO_CPU_CTRL) | CO_CTRL_TIMERUN);
-
- /* Enable (unmask) the timer interrupt */
- co_cpu_write(CO_CPU_CTRL, co_cpu_read(CO_CPU_CTRL) & ~CO_CTRL_TIMEMASK);
-
- setup_default_timer_irq();
-}
-
-/* Replaces the default init_ISA_irqs in the generic setup */
-static void __init visws_pre_intr_init(void)
-{
- init_VISWS_APIC_irqs();
-}
-
-/* Quirk for machine specific memory setup. */
-
-#define MB (1024 * 1024)
-
-unsigned long sgivwfb_mem_phys;
-unsigned long sgivwfb_mem_size;
-EXPORT_SYMBOL(sgivwfb_mem_phys);
-EXPORT_SYMBOL(sgivwfb_mem_size);
-
-long long mem_size __initdata = 0;
-
-static char * __init visws_memory_setup(void)
-{
- long long gfx_mem_size = 8 * MB;
-
- mem_size = boot_params.alt_mem_k;
-
- if (!mem_size) {
- printk(KERN_WARNING "Bootloader didn't set memory size, upgrade it !\n");
- mem_size = 128 * MB;
- }
-
- /*
- * this hardcodes the graphics memory to 8 MB
- * it really should be sized dynamically (or at least
- * set as a boot param)
- */
- if (!sgivwfb_mem_size) {
- printk(KERN_WARNING "Defaulting to 8 MB framebuffer size\n");
- sgivwfb_mem_size = 8 * MB;
- }
-
- /*
- * Trim to nearest MB
- */
- sgivwfb_mem_size &= ~((1 << 20) - 1);
- sgivwfb_mem_phys = mem_size - gfx_mem_size;
-
- e820_add_region(0, LOWMEMSIZE(), E820_RAM);
- e820_add_region(HIGH_MEMORY, mem_size - sgivwfb_mem_size - HIGH_MEMORY, E820_RAM);
- e820_add_region(sgivwfb_mem_phys, sgivwfb_mem_size, E820_RESERVED);
-
- return "PROM";
-}
-
-static void visws_machine_emergency_restart(void)
-{
- /*
- * Visual Workstations restart after this
- * register is poked on the PIIX4
- */
- outb(PIIX4_RESET_VAL, PIIX4_RESET_PORT);
-}
-
-static void visws_machine_power_off(void)
-{
- unsigned short pm_status;
-/* extern unsigned int pci_bus0; */
-
- while ((pm_status = inw(PMSTS_PORT)) & 0x100)
- outw(pm_status, PMSTS_PORT);
-
- outw(PM_SUSPEND_ENABLE, PMCNTRL_PORT);
-
- mdelay(10);
-
-#define PCI_CONF1_ADDRESS(bus, devfn, reg) \
- (0x80000000 | (bus << 16) | (devfn << 8) | (reg & ~3))
-
-/* outl(PCI_CONF1_ADDRESS(pci_bus0, SPECIAL_DEV, SPECIAL_REG), 0xCF8); */
- outl(PIIX_SPECIAL_STOP, 0xCFC);
-}
-
-static void __init visws_get_smp_config(unsigned int early)
-{
-}
-
-/*
- * The Visual Workstation is Intel MP compliant in the hardware
- * sense, but it doesn't have a BIOS(-configuration table).
- * No problem for Linux.
- */
-
-static void __init MP_processor_info(struct mpc_cpu *m)
-{
- int ver, logical_apicid;
- physid_mask_t apic_cpus;
-
- if (!(m->cpuflag & CPU_ENABLED))
- return;
-
- logical_apicid = m->apicid;
- printk(KERN_INFO "%sCPU #%d %u:%u APIC version %d\n",
- m->cpuflag & CPU_BOOTPROCESSOR ? "Bootup " : "",
- m->apicid, (m->cpufeature & CPU_FAMILY_MASK) >> 8,
- (m->cpufeature & CPU_MODEL_MASK) >> 4, m->apicver);
-
- if (m->cpuflag & CPU_BOOTPROCESSOR)
- boot_cpu_physical_apicid = m->apicid;
-
- ver = m->apicver;
- if ((ver >= 0x14 && m->apicid >= 0xff) || m->apicid >= 0xf) {
- printk(KERN_ERR "Processor #%d INVALID. (Max ID: %d).\n",
- m->apicid, MAX_APICS);
- return;
- }
-
- apic->apicid_to_cpu_present(m->apicid, &apic_cpus);
- physids_or(phys_cpu_present_map, phys_cpu_present_map, apic_cpus);
- /*
- * Validate version
- */
- if (ver == 0x0) {
- printk(KERN_ERR "BIOS bug, APIC version is 0 for CPU#%d! "
- "fixing up to 0x10. (tell your hw vendor)\n",
- m->apicid);
- ver = 0x10;
- }
- apic_version[m->apicid] = ver;
-}
-
-static void __init visws_find_smp_config(void)
-{
- struct mpc_cpu *mp = phys_to_virt(CO_CPU_TAB_PHYS);
- unsigned short ncpus = readw(phys_to_virt(CO_CPU_NUM_PHYS));
-
- if (ncpus > CO_CPU_MAX) {
- printk(KERN_WARNING "find_visws_smp: got cpu count of %d at %p\n",
- ncpus, mp);
-
- ncpus = CO_CPU_MAX;
- }
-
- if (ncpus > setup_max_cpus)
- ncpus = setup_max_cpus;
-
-#ifdef CONFIG_X86_LOCAL_APIC
- smp_found_config = 1;
-#endif
- while (ncpus--)
- MP_processor_info(mp++);
-
- mp_lapic_addr = APIC_DEFAULT_PHYS_BASE;
-}
-
-static void visws_trap_init(void);
-
-void __init visws_early_detect(void)
-{
- int raw;
-
- visws_board_type = (char)(inb_p(PIIX_GPI_BD_REG) & PIIX_GPI_BD_REG)
- >> PIIX_GPI_BD_SHIFT;
-
- if (visws_board_type < 0)
- return;
-
- /*
- * Override the default platform setup functions
- */
- x86_init.resources.memory_setup = visws_memory_setup;
- x86_init.mpparse.get_smp_config = visws_get_smp_config;
- x86_init.mpparse.find_smp_config = visws_find_smp_config;
- x86_init.irqs.pre_vector_init = visws_pre_intr_init;
- x86_init.irqs.trap_init = visws_trap_init;
- x86_init.timers.timer_init = visws_time_init;
- x86_init.pci.init = pci_visws_init;
- x86_init.pci.init_irq = x86_init_noop;
-
- /*
- * Install reboot quirks:
- */
- pm_power_off = visws_machine_power_off;
- machine_ops.emergency_restart = visws_machine_emergency_restart;
-
- /*
- * Do not use broadcast IPIs:
- */
- no_broadcast = 0;
-
-#ifdef CONFIG_X86_IO_APIC
- /*
- * Turn off IO-APIC detection and initialization:
- */
- skip_ioapic_setup = 1;
-#endif
-
- /*
- * Get Board rev.
- * First, we have to initialize the 307 part to allow us access
- * to the GPIO registers. Let's map them at 0x0fc0 which is right
- * after the PIIX4 PM section.
- */
- outb_p(SIO_DEV_SEL, SIO_INDEX);
- outb_p(SIO_GP_DEV, SIO_DATA); /* Talk to GPIO regs. */
-
- outb_p(SIO_DEV_MSB, SIO_INDEX);
- outb_p(SIO_GP_MSB, SIO_DATA); /* MSB of GPIO base address */
-
- outb_p(SIO_DEV_LSB, SIO_INDEX);
- outb_p(SIO_GP_LSB, SIO_DATA); /* LSB of GPIO base address */
-
- outb_p(SIO_DEV_ENB, SIO_INDEX);
- outb_p(1, SIO_DATA); /* Enable GPIO registers. */
-
- /*
- * Now, we have to map the power management section to write
- * a bit which enables access to the GPIO registers.
- * What lunatic came up with this shit?
- */
- outb_p(SIO_DEV_SEL, SIO_INDEX);
- outb_p(SIO_PM_DEV, SIO_DATA); /* Talk to GPIO regs. */
-
- outb_p(SIO_DEV_MSB, SIO_INDEX);
- outb_p(SIO_PM_MSB, SIO_DATA); /* MSB of PM base address */
-
- outb_p(SIO_DEV_LSB, SIO_INDEX);
- outb_p(SIO_PM_LSB, SIO_DATA); /* LSB of PM base address */
-
- outb_p(SIO_DEV_ENB, SIO_INDEX);
- outb_p(1, SIO_DATA); /* Enable PM registers. */
-
- /*
- * Now, write the PM register which enables the GPIO registers.
- */
- outb_p(SIO_PM_FER2, SIO_PM_INDEX);
- outb_p(SIO_PM_GP_EN, SIO_PM_DATA);
-
- /*
- * Now, initialize the GPIO registers.
- * We want them all to be inputs which is the
- * power on default, so let's leave them alone.
- * So, let's just read the board rev!
- */
- raw = inb_p(SIO_GP_DATA1);
- raw &= 0x7f; /* 7 bits of valid board revision ID. */
-
- if (visws_board_type == VISWS_320) {
- if (raw < 0x6) {
- visws_board_rev = 4;
- } else if (raw < 0xc) {
- visws_board_rev = 5;
- } else {
- visws_board_rev = 6;
- }
- } else if (visws_board_type == VISWS_540) {
- visws_board_rev = 2;
- } else {
- visws_board_rev = raw;
- }
-
- printk(KERN_INFO "Silicon Graphics Visual Workstation %s (rev %d) detected\n",
- (visws_board_type == VISWS_320 ? "320" :
- (visws_board_type == VISWS_540 ? "540" :
- "unknown")), visws_board_rev);
-}
-
-#define A01234 (LI_INTA_0 | LI_INTA_1 | LI_INTA_2 | LI_INTA_3 | LI_INTA_4)
-#define BCD (LI_INTB | LI_INTC | LI_INTD)
-#define ALLDEVS (A01234 | BCD)
-
-static __init void lithium_init(void)
-{
- set_fixmap(FIX_LI_PCIA, LI_PCI_A_PHYS);
- set_fixmap(FIX_LI_PCIB, LI_PCI_B_PHYS);
-
- if ((li_pcia_read16(PCI_VENDOR_ID) != PCI_VENDOR_ID_SGI) ||
- (li_pcia_read16(PCI_DEVICE_ID) != PCI_DEVICE_ID_SGI_LITHIUM)) {
- printk(KERN_EMERG "Lithium hostbridge %c not found\n", 'A');
-/* panic("This machine is not SGI Visual Workstation 320/540"); */
- }
-
- if ((li_pcib_read16(PCI_VENDOR_ID) != PCI_VENDOR_ID_SGI) ||
- (li_pcib_read16(PCI_DEVICE_ID) != PCI_DEVICE_ID_SGI_LITHIUM)) {
- printk(KERN_EMERG "Lithium hostbridge %c not found\n", 'B');
-/* panic("This machine is not SGI Visual Workstation 320/540"); */
- }
-
- li_pcia_write16(LI_PCI_INTEN, ALLDEVS);
- li_pcib_write16(LI_PCI_INTEN, ALLDEVS);
-}
-
-static __init void cobalt_init(void)
-{
- /*
- * On normal SMP PC this is used only with SMP, but we have to
- * use it and set it up here to start the Cobalt clock
- */
- set_fixmap(FIX_APIC_BASE, APIC_DEFAULT_PHYS_BASE);
- setup_local_APIC();
- printk(KERN_INFO "Local APIC Version %#x, ID %#x\n",
- (unsigned int)apic_read(APIC_LVR),
- (unsigned int)apic_read(APIC_ID));
-
- set_fixmap(FIX_CO_CPU, CO_CPU_PHYS);
- set_fixmap(FIX_CO_APIC, CO_APIC_PHYS);
- printk(KERN_INFO "Cobalt Revision %#lx, APIC ID %#lx\n",
- co_cpu_read(CO_CPU_REV), co_apic_read(CO_APIC_ID));
-
- /* Enable Cobalt APIC being careful to NOT change the ID! */
- co_apic_write(CO_APIC_ID, co_apic_read(CO_APIC_ID) | CO_APIC_ENABLE);
-
- printk(KERN_INFO "Cobalt APIC enabled: ID reg %#lx\n",
- co_apic_read(CO_APIC_ID));
-}
-
-static void __init visws_trap_init(void)
-{
- lithium_init();
- cobalt_init();
-}
-
-/*
- * IRQ controller / APIC support:
- */
-
-static DEFINE_SPINLOCK(cobalt_lock);
-
-/*
- * Set the given Cobalt APIC Redirection Table entry to point
- * to the given IDT vector/index.
- */
-static inline void co_apic_set(int entry, int irq)
-{
- co_apic_write(CO_APIC_LO(entry), CO_APIC_LEVEL | (irq + FIRST_EXTERNAL_VECTOR));
- co_apic_write(CO_APIC_HI(entry), 0);
-}
-
-/*
- * Cobalt (IO)-APIC functions to handle PCI devices.
- */
-static inline int co_apic_ide0_hack(void)
-{
- extern char visws_board_type;
- extern char visws_board_rev;
-
- if (visws_board_type == VISWS_320 && visws_board_rev == 5)
- return 5;
- return CO_APIC_IDE0;
-}
-
-static int is_co_apic(unsigned int irq)
-{
- if (IS_CO_APIC(irq))
- return CO_APIC(irq);
-
- switch (irq) {
- case 0: return CO_APIC_CPU;
- case CO_IRQ_IDE0: return co_apic_ide0_hack();
- case CO_IRQ_IDE1: return CO_APIC_IDE1;
- default: return -1;
- }
-}
-
-
-/*
- * This is the SGI Cobalt (IO-)APIC:
- */
-
-static void enable_cobalt_irq(unsigned int irq)
-{
- co_apic_set(is_co_apic(irq), irq);
-}
-
-static void disable_cobalt_irq(unsigned int irq)
-{
- int entry = is_co_apic(irq);
-
- co_apic_write(CO_APIC_LO(entry), CO_APIC_MASK);
- co_apic_read(CO_APIC_LO(entry));
-}
-
-/*
- * "irq" really just serves to identify the device. Here is where we
- * map this to the Cobalt APIC entry where it's physically wired.
- * This is called via request_irq -> setup_irq -> irq_desc->startup()
- */
-static unsigned int startup_cobalt_irq(unsigned int irq)
-{
- unsigned long flags;
- struct irq_desc *desc = irq_to_desc(irq);
-
- spin_lock_irqsave(&cobalt_lock, flags);
- if ((desc->status & (IRQ_DISABLED | IRQ_INPROGRESS | IRQ_WAITING)))
- desc->status &= ~(IRQ_DISABLED | IRQ_INPROGRESS | IRQ_WAITING);
- enable_cobalt_irq(irq);
- spin_unlock_irqrestore(&cobalt_lock, flags);
- return 0;
-}
-
-static void ack_cobalt_irq(unsigned int irq)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&cobalt_lock, flags);
- disable_cobalt_irq(irq);
- apic_write(APIC_EOI, APIC_EIO_ACK);
- spin_unlock_irqrestore(&cobalt_lock, flags);
-}
-
-static void end_cobalt_irq(unsigned int irq)
-{
- unsigned long flags;
- struct irq_desc *desc = irq_to_desc(irq);
-
- spin_lock_irqsave(&cobalt_lock, flags);
- if (!(desc->status & (IRQ_DISABLED | IRQ_INPROGRESS)))
- enable_cobalt_irq(irq);
- spin_unlock_irqrestore(&cobalt_lock, flags);
-}
-
-static struct irq_chip cobalt_irq_type = {
- .name = "Cobalt-APIC",
- .startup = startup_cobalt_irq,
- .shutdown = disable_cobalt_irq,
- .enable = enable_cobalt_irq,
- .disable = disable_cobalt_irq,
- .ack = ack_cobalt_irq,
- .end = end_cobalt_irq,
-};
-
-
-/*
- * This is the PIIX4-based 8259 that is wired up indirectly to Cobalt
- * -- not the manner expected by the code in i8259.c.
- *
- * there is a 'master' physical interrupt source that gets sent to
- * the CPU. But in the chipset there are various 'virtual' interrupts
- * waiting to be handled. We represent this to Linux through a 'master'
- * interrupt controller type, and through a special virtual interrupt-
- * controller. Device drivers only see the virtual interrupt sources.
- */
-static unsigned int startup_piix4_master_irq(unsigned int irq)
-{
- legacy_pic->init(0);
-
- return startup_cobalt_irq(irq);
-}
-
-static void end_piix4_master_irq(unsigned int irq)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&cobalt_lock, flags);
- enable_cobalt_irq(irq);
- spin_unlock_irqrestore(&cobalt_lock, flags);
-}
-
-static struct irq_chip piix4_master_irq_type = {
- .name = "PIIX4-master",
- .startup = startup_piix4_master_irq,
- .ack = ack_cobalt_irq,
- .end = end_piix4_master_irq,
-};
-
-
-static struct irq_chip piix4_virtual_irq_type = {
- .name = "PIIX4-virtual",
-};
-
-
-/*
- * PIIX4-8259 master/virtual functions to handle interrupt requests
- * from legacy devices: floppy, parallel, serial, rtc.
- *
- * None of these get Cobalt APIC entries, neither do they have IDT
- * entries. These interrupts are purely virtual and distributed from
- * the 'master' interrupt source: CO_IRQ_8259.
- *
- * When the 8259 interrupts, its handler figures out which of these
- * devices is interrupting and dispatches to its handler.
- *
- * CAREFUL: devices see the 'virtual' interrupt only. Thus disable/
- * enable_irq gets the right irq. This 'master' irq is never directly
- * manipulated by any driver.
- */
-static irqreturn_t piix4_master_intr(int irq, void *dev_id)
-{
- int realirq;
- struct irq_desc *desc;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&i8259A_lock, flags);
-
- /* Find out what's interrupting in the PIIX4 master 8259 */
- outb(0x0c, 0x20); /* OCW3 Poll command */
- realirq = inb(0x20);
-
- /*
- * Bit 7 == 0 means invalid/spurious
- */
- if (unlikely(!(realirq & 0x80)))
- goto out_unlock;
-
- realirq &= 7;
-
- if (unlikely(realirq == 2)) {
- outb(0x0c, 0xa0);
- realirq = inb(0xa0);
-
- if (unlikely(!(realirq & 0x80)))
- goto out_unlock;
-
- realirq = (realirq & 7) + 8;
- }
-
- /* mask and ack interrupt */
- cached_irq_mask |= 1 << realirq;
- if (unlikely(realirq > 7)) {
- inb(0xa1);
- outb(cached_slave_mask, 0xa1);
- outb(0x60 + (realirq & 7), 0xa0);
- outb(0x60 + 2, 0x20);
- } else {
- inb(0x21);
- outb(cached_master_mask, 0x21);
- outb(0x60 + realirq, 0x20);
- }
-
- raw_spin_unlock_irqrestore(&i8259A_lock, flags);
-
- desc = irq_to_desc(realirq);
-
- /*
- * handle this 'virtual interrupt' as a Cobalt one now.
- */
- kstat_incr_irqs_this_cpu(realirq, desc);
-
- if (likely(desc->action != NULL))
- handle_IRQ_event(realirq, desc->action);
-
- if (!(desc->status & IRQ_DISABLED))
- legacy_pic->chip->unmask(realirq);
-
- return IRQ_HANDLED;
-
-out_unlock:
- raw_spin_unlock_irqrestore(&i8259A_lock, flags);
- return IRQ_NONE;
-}
-
-static struct irqaction master_action = {
- .handler = piix4_master_intr,
- .name = "PIIX4-8259",
-};
-
-static struct irqaction cascade_action = {
- .handler = no_action,
- .name = "cascade",
-};
-
-static inline void set_piix4_virtual_irq_type(void)
-{
- piix4_virtual_irq_type.shutdown = i8259A_chip.mask;
- piix4_virtual_irq_type.enable = i8259A_chip.unmask;
- piix4_virtual_irq_type.disable = i8259A_chip.mask;
-}
-
-void init_VISWS_APIC_irqs(void)
-{
- int i;
-
- for (i = 0; i < CO_IRQ_APIC0 + CO_APIC_LAST + 1; i++) {
- struct irq_desc *desc = irq_to_desc(i);
-
- desc->status = IRQ_DISABLED;
- desc->action = 0;
- desc->depth = 1;
-
- if (i == 0) {
- desc->chip = &cobalt_irq_type;
- }
- else if (i == CO_IRQ_IDE0) {
- desc->chip = &cobalt_irq_type;
- }
- else if (i == CO_IRQ_IDE1) {
- desc->chip = &cobalt_irq_type;
- }
- else if (i == CO_IRQ_8259) {
- desc->chip = &piix4_master_irq_type;
- }
- else if (i < CO_IRQ_APIC0) {
- set_piix4_virtual_irq_type();
- desc->chip = &piix4_virtual_irq_type;
- }
- else if (IS_CO_APIC(i)) {
- desc->chip = &cobalt_irq_type;
- }
- }
-
- setup_irq(CO_IRQ_8259, &master_action);
- setup_irq(2, &cascade_action);
-}
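
One detail of the deleted piix4_master_intr() worth preserving: it used the i8259 OCW3 poll command rather than per-IRQ vectors, because the PIIX4's 8259 output is funneled through a single Cobalt APIC entry. The poll handshake itself is generic; here is a stand-alone sketch (standard PC port constants assumed):

#include <asm/io.h>

/*
 * Sketch: ask the master 8259 which IRQ is pending.
 * Returns 0-7, or -1 if the read was spurious (bit 7 clear).
 */
static int i8259_poll_master(void)
{
	int val;

	outb(0x0c, 0x20);	/* OCW3: poll command */
	val = inb(0x20);	/* bit 7 = IRQ pending, bits 2:0 = IRQ number */
	return (val & 0x80) ? (val & 7) : -1;
}
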
diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 5ffb5622f793..e6cc84143f3e 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* Copyright (C) 1994 Linus Torvalds
*
@@ -28,10 +29,14 @@
*
*/
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
#include <linux/capability.h>
#include <linux/errno.h>
#include <linux/interrupt.h>
+#include <linux/syscalls.h>
#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
#include <linux/kernel.h>
#include <linux/signal.h>
#include <linux/string.h>
@@ -41,12 +46,16 @@
#include <linux/ptrace.h>
#include <linux/audit.h>
#include <linux/stddef.h>
+#include <linux/slab.h>
+#include <linux/security.h>
-#include <asm/uaccess.h>
+#include <linux/uaccess.h>
#include <asm/io.h>
#include <asm/tlbflush.h>
#include <asm/irq.h>
-#include <asm/syscalls.h>
+#include <asm/traps.h>
+#include <asm/vm86.h>
+#include <asm/switch_to.h>
/*
* Known problems:
@@ -64,10 +73,6 @@
*/
-#define KVM86 ((struct kernel_vm86_struct *)regs)
-#define VMPI KVM86->vm86plus
-
-
/*
* 8- and 16-bit register defines..
*/
@@ -79,8 +84,8 @@
/*
* virtual flags (16 and 32-bit versions)
*/
-#define VFLAGS (*(unsigned short *)&(current->thread.v86flags))
-#define VEFLAGS (current->thread.v86flags)
+#define VFLAGS (*(unsigned short *)&(current->thread.vm86->veflags))
+#define VEFLAGS (current->thread.vm86->veflags)
#define set_flags(X, new, mask) \
((X) = ((X) & ~(mask)) | ((new) & (mask)))
@@ -88,46 +93,11 @@
#define SAFE_MASK (0xDD5)
#define RETURN_MASK (0xDFF)
-/* convert kernel_vm86_regs to vm86_regs */
-static int copy_vm86_regs_to_user(struct vm86_regs __user *user,
- const struct kernel_vm86_regs *regs)
-{
- int ret = 0;
-
- /*
- * kernel_vm86_regs is missing gs, so copy everything up to
- * (but not including) orig_eax, and then rest including orig_eax.
- */
- ret += copy_to_user(user, regs, offsetof(struct kernel_vm86_regs, pt.orig_ax));
- ret += copy_to_user(&user->orig_eax, &regs->pt.orig_ax,
- sizeof(struct kernel_vm86_regs) -
- offsetof(struct kernel_vm86_regs, pt.orig_ax));
-
- return ret;
-}
-
-/* convert vm86_regs to kernel_vm86_regs */
-static int copy_vm86_regs_from_user(struct kernel_vm86_regs *regs,
- const struct vm86_regs __user *user,
- unsigned extra)
-{
- int ret = 0;
-
- /* copy ax-fs inclusive */
- ret += copy_from_user(regs, user, offsetof(struct kernel_vm86_regs, pt.orig_ax));
- /* copy orig_ax-__gsh+extra */
- ret += copy_from_user(&regs->pt.orig_ax, &user->orig_eax,
- sizeof(struct kernel_vm86_regs) -
- offsetof(struct kernel_vm86_regs, pt.orig_ax) +
- extra);
- return ret;
-}
-
-struct pt_regs *save_v86_state(struct kernel_vm86_regs *regs)
+void save_v86_state(struct kernel_vm86_regs *regs, int retval)
{
- struct tss_struct *tss;
- struct pt_regs *ret;
- unsigned long tmp;
+ struct task_struct *tsk = current;
+ struct vm86plus_struct __user *user;
+ struct vm86 *vm86 = current->thread.vm86;
/*
* This gets called from entry.S with interrupts disabled, but
@@ -136,115 +106,81 @@ struct pt_regs *save_v86_state(struct kernel_vm86_regs *regs)
*/
local_irq_enable();
- if (!current->thread.vm86_info) {
- printk("no vm86_info: BAD\n");
- do_exit(SIGSEGV);
- }
- set_flags(regs->pt.flags, VEFLAGS, X86_EFLAGS_VIF | current->thread.v86mask);
- tmp = copy_vm86_regs_to_user(&current->thread.vm86_info->regs, regs);
- tmp += put_user(current->thread.screen_bitmap, &current->thread.vm86_info->screen_bitmap);
- if (tmp) {
- printk("vm86: could not access userspace vm86_info\n");
- do_exit(SIGSEGV);
- }
-
- tss = &per_cpu(init_tss, get_cpu());
- current->thread.sp0 = current->thread.saved_sp0;
- current->thread.sysenter_cs = __KERNEL_CS;
- load_sp0(tss, &current->thread);
- current->thread.saved_sp0 = 0;
- put_cpu();
+ BUG_ON(!vm86);
+
+ set_flags(regs->pt.flags, VEFLAGS, X86_EFLAGS_VIF | vm86->veflags_mask);
+ user = vm86->user_vm86;
+
+ if (!user_access_begin(user, vm86->vm86plus.is_vm86pus ?
+ sizeof(struct vm86plus_struct) :
+ sizeof(struct vm86_struct)))
+ goto Efault;
+
+ unsafe_put_user(regs->pt.bx, &user->regs.ebx, Efault_end);
+ unsafe_put_user(regs->pt.cx, &user->regs.ecx, Efault_end);
+ unsafe_put_user(regs->pt.dx, &user->regs.edx, Efault_end);
+ unsafe_put_user(regs->pt.si, &user->regs.esi, Efault_end);
+ unsafe_put_user(regs->pt.di, &user->regs.edi, Efault_end);
+ unsafe_put_user(regs->pt.bp, &user->regs.ebp, Efault_end);
+ unsafe_put_user(regs->pt.ax, &user->regs.eax, Efault_end);
+ unsafe_put_user(regs->pt.ip, &user->regs.eip, Efault_end);
+ unsafe_put_user(regs->pt.cs, &user->regs.cs, Efault_end);
+ unsafe_put_user(regs->pt.flags, &user->regs.eflags, Efault_end);
+ unsafe_put_user(regs->pt.sp, &user->regs.esp, Efault_end);
+ unsafe_put_user(regs->pt.ss, &user->regs.ss, Efault_end);
+ unsafe_put_user(regs->es, &user->regs.es, Efault_end);
+ unsafe_put_user(regs->ds, &user->regs.ds, Efault_end);
+ unsafe_put_user(regs->fs, &user->regs.fs, Efault_end);
+ unsafe_put_user(regs->gs, &user->regs.gs, Efault_end);
- ret = KVM86->regs32;
+ /*
+ * Don't write screen_bitmap in case some user had a value there
+ * and expected it to remain unchanged.
+ */
- ret->fs = current->thread.saved_fs;
- set_user_gs(ret, current->thread.saved_gs);
+ user_access_end();
- return ret;
-}
+exit_vm86:
+ preempt_disable();
+ tsk->thread.sp0 = vm86->saved_sp0;
+ tsk->thread.sysenter_cs = __KERNEL_CS;
+ update_task_stack(tsk);
+ refresh_sysenter_cs(&tsk->thread);
+ vm86->saved_sp0 = 0;
+ preempt_enable();
-static void mark_screen_rdonly(struct mm_struct *mm)
-{
- pgd_t *pgd;
- pud_t *pud;
- pmd_t *pmd;
- pte_t *pte;
- spinlock_t *ptl;
- int i;
+ memcpy(&regs->pt, &vm86->regs32, sizeof(struct pt_regs));
- pgd = pgd_offset(mm, 0xA0000);
- if (pgd_none_or_clear_bad(pgd))
- goto out;
- pud = pud_offset(pgd, 0xA0000);
- if (pud_none_or_clear_bad(pud))
- goto out;
- pmd = pmd_offset(pud, 0xA0000);
- if (pmd_none_or_clear_bad(pmd))
- goto out;
- pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
- for (i = 0; i < 32; i++) {
- if (pte_present(*pte))
- set_pte(pte, pte_wrprotect(*pte));
- pte++;
- }
- pte_unmap_unlock(pte, ptl);
-out:
- flush_tlb();
-}
+ loadsegment(gs, vm86->regs32.gs);
+ regs->pt.ax = retval;
+ return;
+Efault_end:
+ user_access_end();
+Efault:
+ pr_alert("could not access userspace vm86 info\n");
+ force_exit_sig(SIGSEGV);
+ goto exit_vm86;
+}
static int do_vm86_irq_handling(int subfunction, int irqnumber);
-static void do_sys_vm86(struct kernel_vm86_struct *info, struct task_struct *tsk);
+static long do_sys_vm86(struct vm86plus_struct __user *user_vm86, bool plus);
-int sys_vm86old(struct vm86_struct __user *v86, struct pt_regs *regs)
+SYSCALL_DEFINE1(vm86old, struct vm86_struct __user *, user_vm86)
{
- struct kernel_vm86_struct info; /* declare this _on top_,
- * this avoids wasting of stack space.
- * This remains on the stack until we
- * return to 32 bit user space.
- */
- struct task_struct *tsk;
- int tmp, ret = -EPERM;
-
- tsk = current;
- if (tsk->thread.saved_sp0)
- goto out;
- tmp = copy_vm86_regs_from_user(&info.regs, &v86->regs,
- offsetof(struct kernel_vm86_struct, vm86plus) -
- sizeof(info.regs));
- ret = -EFAULT;
- if (tmp)
- goto out;
- memset(&info.vm86plus, 0, (int)&info.regs32 - (int)&info.vm86plus);
- info.regs32 = regs;
- tsk->thread.vm86_info = v86;
- do_sys_vm86(&info, tsk);
- ret = 0; /* we never return here */
-out:
- return ret;
+ return do_sys_vm86((struct vm86plus_struct __user *) user_vm86, false);
}
-int sys_vm86(unsigned long cmd, unsigned long arg, struct pt_regs *regs)
+SYSCALL_DEFINE2(vm86, unsigned long, cmd, unsigned long, arg)
{
- struct kernel_vm86_struct info; /* declare this _on top_,
- * this avoids wasting of stack space.
- * This remains on the stack until we
- * return to 32 bit user space.
- */
- struct task_struct *tsk;
- int tmp, ret;
- struct vm86plus_struct __user *v86;
-
- tsk = current;
switch (cmd) {
case VM86_REQUEST_IRQ:
case VM86_FREE_IRQ:
case VM86_GET_IRQ_BITS:
case VM86_GET_AND_RESET_IRQ:
- ret = do_vm86_irq_handling(cmd, (int)arg);
- goto out;
+ return do_vm86_irq_handling(cmd, (int)arg);
case VM86_PLUS_INSTALL_CHECK:
/*
* NOTE: on old vm86 stuff this will return the error
@@ -252,121 +188,163 @@ int sys_vm86(unsigned long cmd, unsigned long arg, struct pt_regs *regs)
* interpreted as (invalid) address to vm86_struct.
* So the installation check works.
*/
- ret = 0;
- goto out;
+ return 0;
}
/* we come here only for functions VM86_ENTER, VM86_ENTER_NO_BYPASS */
- ret = -EPERM;
- if (tsk->thread.saved_sp0)
- goto out;
- v86 = (struct vm86plus_struct __user *)arg;
- tmp = copy_vm86_regs_from_user(&info.regs, &v86->regs,
- offsetof(struct kernel_vm86_struct, regs32) -
- sizeof(info.regs));
- ret = -EFAULT;
- if (tmp)
- goto out;
- info.regs32 = regs;
- info.vm86plus.is_vm86pus = 1;
- tsk->thread.vm86_info = (struct vm86_struct __user *)v86;
- do_sys_vm86(&info, tsk);
- ret = 0; /* we never return here */
-out:
- return ret;
+ return do_sys_vm86((struct vm86plus_struct __user *) arg, true);
}
-static void do_sys_vm86(struct kernel_vm86_struct *info, struct task_struct *tsk)
+static long do_sys_vm86(struct vm86plus_struct __user *user_vm86, bool plus)
{
- struct tss_struct *tss;
-/*
- * make sure the vm86() system call doesn't try to do anything silly
- */
- info->regs.pt.ds = 0;
- info->regs.pt.es = 0;
- info->regs.pt.fs = 0;
-#ifndef CONFIG_X86_32_LAZY_GS
- info->regs.pt.gs = 0;
-#endif
+ struct task_struct *tsk = current;
+ struct vm86 *vm86 = tsk->thread.vm86;
+ struct kernel_vm86_regs vm86regs;
+ struct pt_regs *regs = current_pt_regs();
+ unsigned long err = 0;
+ struct vm86_struct v;
+
+ err = security_mmap_addr(0);
+ if (err) {
+ /*
+ * vm86 cannot virtualize the address space, so vm86 users
+ * need to manage the low 1MB themselves using mmap. Given
+ * that BIOS places important data in the first page, vm86
+ * is essentially useless if mmap_min_addr != 0. DOSEMU,
+ * for example, won't even bother trying to use vm86 if it
+ * can't map a page at virtual address 0.
+ *
+ * To reduce the available kernel attack surface, simply
+ * disallow vm86(old) for users who cannot mmap at va 0.
+ *
+ * The implementation of security_mmap_addr will allow
+ * suitably privileged users to map va 0 even if
+ * vm.mmap_min_addr is set above 0, and we want this
+ * behavior for vm86 as well, as it ensures that legacy
+ * tools like vbetool will not fail just because of
+ * vm.mmap_min_addr.
+ */
+ pr_info_once("Denied a call to vm86(old) from %s[%d] (uid: %d). Set the vm.mmap_min_addr sysctl to 0 and/or adjust LSM mmap_min_addr policy to enable vm86 if you are using a vm86-based DOS emulator.\n",
+ current->comm, task_pid_nr(current),
+ from_kuid_munged(&init_user_ns, current_uid()));
+ return -EPERM;
+ }
+
+ if (!vm86) {
+ if (!(vm86 = kzalloc(sizeof(*vm86), GFP_KERNEL)))
+ return -ENOMEM;
+ tsk->thread.vm86 = vm86;
+ }
+ if (vm86->saved_sp0)
+ return -EPERM;
+
+ if (copy_from_user(&v, user_vm86,
+ offsetof(struct vm86_struct, int_revectored)))
+ return -EFAULT;
+
+
+ /* VM86_SCREEN_BITMAP had numerous bugs and appears to have no users. */
+ if (v.flags & VM86_SCREEN_BITMAP) {
+ pr_info_once("vm86: '%s' uses VM86_SCREEN_BITMAP, which is no longer supported\n",
+ current->comm);
+ return -EINVAL;
+ }
+
+ memset(&vm86regs, 0, sizeof(vm86regs));
+
+ vm86regs.pt.bx = v.regs.ebx;
+ vm86regs.pt.cx = v.regs.ecx;
+ vm86regs.pt.dx = v.regs.edx;
+ vm86regs.pt.si = v.regs.esi;
+ vm86regs.pt.di = v.regs.edi;
+ vm86regs.pt.bp = v.regs.ebp;
+ vm86regs.pt.ax = v.regs.eax;
+ vm86regs.pt.ip = v.regs.eip;
+ vm86regs.pt.cs = v.regs.cs;
+ vm86regs.pt.flags = v.regs.eflags;
+ vm86regs.pt.sp = v.regs.esp;
+ vm86regs.pt.ss = v.regs.ss;
+ vm86regs.es = v.regs.es;
+ vm86regs.ds = v.regs.ds;
+ vm86regs.fs = v.regs.fs;
+ vm86regs.gs = v.regs.gs;
+
+ vm86->flags = v.flags;
+ vm86->cpu_type = v.cpu_type;
+
+ if (copy_from_user(&vm86->int_revectored,
+ &user_vm86->int_revectored,
+ sizeof(struct revectored_struct)))
+ return -EFAULT;
+ if (copy_from_user(&vm86->int21_revectored,
+ &user_vm86->int21_revectored,
+ sizeof(struct revectored_struct)))
+ return -EFAULT;
+ if (plus) {
+ if (copy_from_user(&vm86->vm86plus, &user_vm86->vm86plus,
+ sizeof(struct vm86plus_info_struct)))
+ return -EFAULT;
+ vm86->vm86plus.is_vm86pus = 1;
+ } else
+ memset(&vm86->vm86plus, 0,
+ sizeof(struct vm86plus_info_struct));
+
+ memcpy(&vm86->regs32, regs, sizeof(struct pt_regs));
+ vm86->user_vm86 = user_vm86;
/*
* The flags register is also special: we cannot trust that the user
* has set it up safely, so this makes sure interrupt etc flags are
* inherited from protected mode.
*/
- VEFLAGS = info->regs.pt.flags;
- info->regs.pt.flags &= SAFE_MASK;
- info->regs.pt.flags |= info->regs32->flags & ~SAFE_MASK;
- info->regs.pt.flags |= X86_VM_MASK;
+ VEFLAGS = vm86regs.pt.flags;
+ vm86regs.pt.flags &= SAFE_MASK;
+ vm86regs.pt.flags |= regs->flags & ~SAFE_MASK;
+ vm86regs.pt.flags |= X86_VM_MASK;
+
+ vm86regs.pt.orig_ax = regs->orig_ax;
- switch (info->cpu_type) {
+ switch (vm86->cpu_type) {
case CPU_286:
- tsk->thread.v86mask = 0;
+ vm86->veflags_mask = 0;
break;
case CPU_386:
- tsk->thread.v86mask = X86_EFLAGS_NT | X86_EFLAGS_IOPL;
+ vm86->veflags_mask = X86_EFLAGS_NT | X86_EFLAGS_IOPL;
break;
case CPU_486:
- tsk->thread.v86mask = X86_EFLAGS_AC | X86_EFLAGS_NT | X86_EFLAGS_IOPL;
+ vm86->veflags_mask = X86_EFLAGS_AC | X86_EFLAGS_NT | X86_EFLAGS_IOPL;
break;
default:
- tsk->thread.v86mask = X86_EFLAGS_ID | X86_EFLAGS_AC | X86_EFLAGS_NT | X86_EFLAGS_IOPL;
+ vm86->veflags_mask = X86_EFLAGS_ID | X86_EFLAGS_AC | X86_EFLAGS_NT | X86_EFLAGS_IOPL;
break;
}
/*
- * Save old state, set default return value (%ax) to 0 (VM86_SIGNAL)
+ * Save old state
*/
- info->regs32->ax = VM86_SIGNAL;
- tsk->thread.saved_sp0 = tsk->thread.sp0;
- tsk->thread.saved_fs = info->regs32->fs;
- tsk->thread.saved_gs = get_user_gs(info->regs32);
-
- tss = &per_cpu(init_tss, get_cpu());
- tsk->thread.sp0 = (unsigned long) &info->VM86_TSS_ESP0;
- if (cpu_has_sep)
+ vm86->saved_sp0 = tsk->thread.sp0;
+ savesegment(gs, vm86->regs32.gs);
+
+ /* make room for real-mode segments */
+ preempt_disable();
+ tsk->thread.sp0 += 16;
+
+ if (boot_cpu_has(X86_FEATURE_SEP)) {
tsk->thread.sysenter_cs = 0;
- load_sp0(tss, &tsk->thread);
- put_cpu();
-
- tsk->thread.screen_bitmap = info->screen_bitmap;
- if (info->flags & VM86_SCREEN_BITMAP)
- mark_screen_rdonly(tsk->mm);
-
- /*call audit_syscall_exit since we do not exit via the normal paths */
- if (unlikely(current->audit_context))
- audit_syscall_exit(AUDITSC_RESULT(0), 0);
-
- __asm__ __volatile__(
- "movl %0,%%esp\n\t"
- "movl %1,%%ebp\n\t"
-#ifdef CONFIG_X86_32_LAZY_GS
- "mov %2, %%gs\n\t"
-#endif
- "jmp resume_userspace"
- : /* no outputs */
- :"r" (&info->regs), "r" (task_thread_info(tsk)), "r" (0));
- /* we never return here */
-}
+ refresh_sysenter_cs(&tsk->thread);
+ }
-static inline void return_to_32bit(struct kernel_vm86_regs *regs16, int retval)
-{
- struct pt_regs *regs32;
-
- regs32 = save_v86_state(regs16);
- regs32->ax = retval;
- __asm__ __volatile__("movl %0,%%esp\n\t"
- "movl %1,%%ebp\n\t"
- "jmp resume_userspace"
- : : "r" (regs32), "r" (current_thread_info()));
+ update_task_stack(tsk);
+ preempt_enable();
+
+ memcpy((struct kernel_vm86_regs *)regs, &vm86regs, sizeof(vm86regs));
+ return regs->ax;
}
static inline void set_IF(struct kernel_vm86_regs *regs)
{
VEFLAGS |= X86_EFLAGS_VIF;
- if (VEFLAGS & X86_EFLAGS_VIP)
- return_to_32bit(regs, VM86_STI);
}
static inline void clear_IF(struct kernel_vm86_regs *regs)
@@ -398,7 +376,7 @@ static inline void clear_AC(struct kernel_vm86_regs *regs)
static inline void set_vflags_long(unsigned long flags, struct kernel_vm86_regs *regs)
{
- set_flags(VEFLAGS, flags, current->thread.v86mask);
+ set_flags(VEFLAGS, flags, current->thread.vm86->veflags_mask);
set_flags(regs->pt.flags, flags, SAFE_MASK);
if (flags & X86_EFLAGS_IF)
set_IF(regs);
@@ -408,7 +386,7 @@ static inline void set_vflags_long(unsigned long flags, struct kernel_vm86_regs
static inline void set_vflags_short(unsigned short flags, struct kernel_vm86_regs *regs)
{
- set_flags(VFLAGS, flags, current->thread.v86mask);
+ set_flags(VFLAGS, flags, current->thread.vm86->veflags_mask);
set_flags(regs->pt.flags, flags, SAFE_MASK);
if (flags & X86_EFLAGS_IF)
set_IF(regs);
@@ -423,15 +401,12 @@ static inline unsigned long get_vflags(struct kernel_vm86_regs *regs)
if (VEFLAGS & X86_EFLAGS_VIF)
flags |= X86_EFLAGS_IF;
flags |= X86_EFLAGS_IOPL;
- return flags | (VEFLAGS & current->thread.v86mask);
+ return flags | (VEFLAGS & current->thread.vm86->veflags_mask);
}
static inline int is_revectored(int nr, struct revectored_struct *bitmap)
{
- __asm__ __volatile__("btl %2,%1\n\tsbbl %0,%0"
- :"=r" (nr)
- :"m" (*bitmap), "r" (nr));
- return nr;
+ return test_bit(nr, bitmap->__map);
}
#define val_byte(val, n) (((__u8 *)&val)[n])
@@ -521,12 +496,13 @@ static void do_int(struct kernel_vm86_regs *regs, int i,
{
unsigned long __user *intr_ptr;
unsigned long segoffs;
+ struct vm86 *vm86 = current->thread.vm86;
if (regs->pt.cs == BIOSSEG)
goto cannot_handle;
- if (is_revectored(i, &KVM86->int_revectored))
+ if (is_revectored(i, &vm86->int_revectored))
goto cannot_handle;
- if (i == 0x21 && is_revectored(AH(regs), &KVM86->int21_revectored))
+ if (i == 0x21 && is_revectored(AH(regs), &vm86->int21_revectored))
goto cannot_handle;
intr_ptr = (unsigned long __user *) (i << 2);
if (get_user(segoffs, intr_ptr))
@@ -545,22 +521,26 @@ static void do_int(struct kernel_vm86_regs *regs, int i,
return;
cannot_handle:
- return_to_32bit(regs, VM86_INTx + (i << 8));
+ save_v86_state(regs, VM86_INTx + (i << 8));
}
int handle_vm86_trap(struct kernel_vm86_regs *regs, long error_code, int trapno)
{
- if (VMPI.is_vm86pus) {
- if ((trapno == 3) || (trapno == 1))
- return_to_32bit(regs, VM86_TRAP + (trapno << 8));
+ struct vm86 *vm86 = current->thread.vm86;
+
+ if (vm86->vm86plus.is_vm86pus) {
+ if ((trapno == 3) || (trapno == 1)) {
+ save_v86_state(regs, VM86_TRAP + (trapno << 8));
+ return 0;
+ }
do_int(regs, trapno, (unsigned char __user *) (regs->pt.ss << 4), SP(regs));
return 0;
}
if (trapno != 1)
 		return 1; /* we let this be handled by the calling routine */
- current->thread.trap_no = trapno;
+ current->thread.trap_nr = trapno;
current->thread.error_code = error_code;
- force_sig(SIGTRAP, current);
+ force_sig(SIGTRAP);
return 0;
}
@@ -571,16 +551,11 @@ void handle_vm86_fault(struct kernel_vm86_regs *regs, long error_code)
unsigned char __user *ssp;
unsigned short ip, sp, orig_flags;
int data32, pref_done;
+ struct vm86plus_info_struct *vmpi = &current->thread.vm86->vm86plus;
#define CHECK_IF_IN_TRAP \
- if (VMPI.vm86dbg_active && VMPI.vm86dbg_TFpendig) \
+ if (vmpi->vm86dbg_active && vmpi->vm86dbg_TFpendig) \
newflags |= X86_EFLAGS_TF
-#define VM86_FAULT_RETURN do { \
- if (VMPI.force_return_for_pic && (VEFLAGS & (X86_EFLAGS_IF | X86_EFLAGS_VIF))) \
- return_to_32bit(regs, VM86_PICRETURN); \
- if (orig_flags & X86_EFLAGS_TF) \
- handle_vm86_trap(regs, 0, 1); \
- return; } while (0)
orig_flags = *(unsigned short *)&regs->pt.flags;
@@ -619,7 +594,7 @@ void handle_vm86_fault(struct kernel_vm86_regs *regs, long error_code)
SP(regs) -= 2;
}
IP(regs) = ip;
- VM86_FAULT_RETURN;
+ goto vm86_fault_return;
/* popf */
case 0x9d:
@@ -639,16 +614,18 @@ void handle_vm86_fault(struct kernel_vm86_regs *regs, long error_code)
else
set_vflags_short(newflags, regs);
- VM86_FAULT_RETURN;
+ goto check_vip;
}
/* int xx */
case 0xcd: {
int intno = popb(csp, ip, simulate_sigsegv);
IP(regs) = ip;
- if (VMPI.vm86dbg_active) {
- if ((1 << (intno & 7)) & VMPI.vm86dbg_intxxtab[intno >> 3])
- return_to_32bit(regs, VM86_INTx + (intno << 8));
+ if (vmpi->vm86dbg_active) {
+ if ((1 << (intno & 7)) & vmpi->vm86dbg_intxxtab[intno >> 3]) {
+ save_v86_state(regs, VM86_INTx + (intno << 8));
+ return;
+ }
}
do_int(regs, intno, ssp, sp);
return;
@@ -679,14 +656,14 @@ void handle_vm86_fault(struct kernel_vm86_regs *regs, long error_code)
} else {
set_vflags_short(newflags, regs);
}
- VM86_FAULT_RETURN;
+ goto check_vip;
}
/* cli */
case 0xfa:
IP(regs) = ip;
clear_IF(regs);
- VM86_FAULT_RETURN;
+ goto vm86_fault_return;
/* sti */
/*
@@ -698,14 +675,30 @@ void handle_vm86_fault(struct kernel_vm86_regs *regs, long error_code)
case 0xfb:
IP(regs) = ip;
set_IF(regs);
- VM86_FAULT_RETURN;
+ goto check_vip;
default:
- return_to_32bit(regs, VM86_UNKNOWN);
+ save_v86_state(regs, VM86_UNKNOWN);
}
return;
+check_vip:
+ if ((VEFLAGS & (X86_EFLAGS_VIP | X86_EFLAGS_VIF)) ==
+ (X86_EFLAGS_VIP | X86_EFLAGS_VIF)) {
+ save_v86_state(regs, VM86_STI);
+ return;
+ }
+
+vm86_fault_return:
+ if (vmpi->force_return_for_pic && (VEFLAGS & (X86_EFLAGS_IF | X86_EFLAGS_VIF))) {
+ save_v86_state(regs, VM86_PICRETURN);
+ return;
+ }
+ if (orig_flags & X86_EFLAGS_TF)
+ handle_vm86_trap(regs, 0, X86_TRAP_DB);
+ return;
+
simulate_sigsegv:
/* FIXME: After a long discussion with Stas we finally
* agreed, that this is wrong. Here we should
@@ -717,7 +710,7 @@ simulate_sigsegv:
* should be a mixture of the two, but how do we
* get the information? [KD]
*/
- return_to_32bit(regs, VM86_UNKNOWN);
+ save_v86_state(regs, VM86_UNKNOWN);
}
/* ---------------- vm86 special IRQ passing stuff ----------------- */
diff --git a/arch/x86/kernel/vmcore_info_32.c b/arch/x86/kernel/vmcore_info_32.c
new file mode 100644
index 000000000000..5995a749288a
--- /dev/null
+++ b/arch/x86/kernel/vmcore_info_32.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/vmcore_info.h>
+#include <linux/pgtable.h>
+
+#include <asm/setup.h>
+
+void arch_crash_save_vmcoreinfo(void)
+{
+#ifdef CONFIG_NUMA
+ VMCOREINFO_SYMBOL(node_data);
+ VMCOREINFO_LENGTH(node_data, MAX_NUMNODES);
+#endif
+#ifdef CONFIG_X86_PAE
+ VMCOREINFO_CONFIG(X86_PAE);
+#endif
+}
diff --git a/arch/x86/kernel/vmcore_info_64.c b/arch/x86/kernel/vmcore_info_64.c
new file mode 100644
index 000000000000..0dec7d868754
--- /dev/null
+++ b/arch/x86/kernel/vmcore_info_64.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/vmcore_info.h>
+#include <linux/pgtable.h>
+
+#include <asm/setup.h>
+
+void arch_crash_save_vmcoreinfo(void)
+{
+ u64 sme_mask = sme_me_mask;
+
+ VMCOREINFO_NUMBER(phys_base);
+ VMCOREINFO_SYMBOL(init_top_pgt);
+ vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
+ pgtable_l5_enabled());
+
+#ifdef CONFIG_NUMA
+ VMCOREINFO_SYMBOL(node_data);
+ VMCOREINFO_LENGTH(node_data, MAX_NUMNODES);
+#endif
+ vmcoreinfo_append_str("KERNELOFFSET=%lx\n", kaslr_offset());
+ VMCOREINFO_NUMBER(KERNEL_IMAGE_SIZE);
+ VMCOREINFO_NUMBER(sme_mask);
+}
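
Both new vmcore_info files feed the vmcoreinfo ELF note that crash and makedumpfile parse. Each helper appends one "KEY(name)=value" text line, roughly as annotated below (my paraphrase of the macro behavior, not quoted from the headers):

VMCOREINFO_NUMBER(phys_base);	 /* appends "NUMBER(phys_base)=<decimal>\n"    */
VMCOREINFO_SYMBOL(init_top_pgt); /* appends "SYMBOL(init_top_pgt)=<hex addr>\n" */
VMCOREINFO_CONFIG(X86_PAE);	 /* appends "CONFIG_X86_PAE=y\n"                */
vmcoreinfo_append_str("KERNELOFFSET=%lx\n", kaslr_offset());
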
diff --git a/arch/x86/kernel/vmi_32.c b/arch/x86/kernel/vmi_32.c
deleted file mode 100644
index ce9fbacb7526..000000000000
--- a/arch/x86/kernel/vmi_32.c
+++ /dev/null
@@ -1,893 +0,0 @@
-/*
- * VMI specific paravirt-ops implementation
- *
- * Copyright (C) 2005, VMware, Inc.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT. See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to zach@vmware.com
- *
- */
-
-#include <linux/module.h>
-#include <linux/cpu.h>
-#include <linux/bootmem.h>
-#include <linux/mm.h>
-#include <linux/highmem.h>
-#include <linux/sched.h>
-#include <linux/gfp.h>
-#include <asm/vmi.h>
-#include <asm/io.h>
-#include <asm/fixmap.h>
-#include <asm/apicdef.h>
-#include <asm/apic.h>
-#include <asm/pgalloc.h>
-#include <asm/processor.h>
-#include <asm/timer.h>
-#include <asm/vmi_time.h>
-#include <asm/kmap_types.h>
-#include <asm/setup.h>
-
-/* Convenient for calling VMI functions indirectly in the ROM */
-typedef u32 __attribute__((regparm(1))) (VROMFUNC)(void);
-typedef u64 __attribute__((regparm(2))) (VROMLONGFUNC)(int);
-
-#define call_vrom_func(rom,func) \
- (((VROMFUNC *)(rom->func))())
-
-#define call_vrom_long_func(rom,func,arg) \
- (((VROMLONGFUNC *)(rom->func)) (arg))
-
-static struct vrom_header *vmi_rom;
-static int disable_pge;
-static int disable_pse;
-static int disable_sep;
-static int disable_tsc;
-static int disable_mtrr;
-static int disable_noidle;
-static int disable_vmi_timer;
-
-/* Cached VMI operations */
-static struct {
- void (*cpuid)(void /* non-c */);
- void (*_set_ldt)(u32 selector);
- void (*set_tr)(u32 selector);
- void (*write_idt_entry)(struct desc_struct *, int, u32, u32);
- void (*write_gdt_entry)(struct desc_struct *, int, u32, u32);
- void (*write_ldt_entry)(struct desc_struct *, int, u32, u32);
- void (*set_kernel_stack)(u32 selector, u32 sp0);
- void (*allocate_page)(u32, u32, u32, u32, u32);
- void (*release_page)(u32, u32);
- void (*set_pte)(pte_t, pte_t *, unsigned);
- void (*update_pte)(pte_t *, unsigned);
- void (*set_linear_mapping)(int, void *, u32, u32);
- void (*_flush_tlb)(int);
- void (*set_initial_ap_state)(int, int);
- void (*halt)(void);
- void (*set_lazy_mode)(int mode);
-} vmi_ops;
-
-/* Cached VMI operations */
-struct vmi_timer_ops vmi_timer_ops;
-
-/*
- * VMI patching routines.
- */
-#define MNEM_CALL 0xe8
-#define MNEM_JMP 0xe9
-#define MNEM_RET 0xc3
-
-#define IRQ_PATCH_INT_MASK 0
-#define IRQ_PATCH_DISABLE 5
-
-static inline void patch_offset(void *insnbuf,
- unsigned long ip, unsigned long dest)
-{
- *(unsigned long *)(insnbuf+1) = dest-ip-5;
-}
-
-static unsigned patch_internal(int call, unsigned len, void *insnbuf,
- unsigned long ip)
-{
- u64 reloc;
- struct vmi_relocation_info *const rel = (struct vmi_relocation_info *)&reloc;
- reloc = call_vrom_long_func(vmi_rom, get_reloc, call);
- switch(rel->type) {
- case VMI_RELOCATION_CALL_REL:
- BUG_ON(len < 5);
- *(char *)insnbuf = MNEM_CALL;
- patch_offset(insnbuf, ip, (unsigned long)rel->eip);
- return 5;
-
- case VMI_RELOCATION_JUMP_REL:
- BUG_ON(len < 5);
- *(char *)insnbuf = MNEM_JMP;
- patch_offset(insnbuf, ip, (unsigned long)rel->eip);
- return 5;
-
- case VMI_RELOCATION_NOP:
- /* obliterate the whole thing */
- return 0;
-
- case VMI_RELOCATION_NONE:
- /* leave native code in place */
- break;
-
- default:
- BUG();
- }
- return len;
-}
-
-/*
- * Apply patch if appropriate, return length of new instruction
- * sequence. The callee does nop padding for us.
- */
-static unsigned vmi_patch(u8 type, u16 clobbers, void *insns,
- unsigned long ip, unsigned len)
-{
- switch (type) {
- case PARAVIRT_PATCH(pv_irq_ops.irq_disable):
- return patch_internal(VMI_CALL_DisableInterrupts, len,
- insns, ip);
- case PARAVIRT_PATCH(pv_irq_ops.irq_enable):
- return patch_internal(VMI_CALL_EnableInterrupts, len,
- insns, ip);
- case PARAVIRT_PATCH(pv_irq_ops.restore_fl):
- return patch_internal(VMI_CALL_SetInterruptMask, len,
- insns, ip);
- case PARAVIRT_PATCH(pv_irq_ops.save_fl):
- return patch_internal(VMI_CALL_GetInterruptMask, len,
- insns, ip);
- case PARAVIRT_PATCH(pv_cpu_ops.iret):
- return patch_internal(VMI_CALL_IRET, len, insns, ip);
- case PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit):
- return patch_internal(VMI_CALL_SYSEXIT, len, insns, ip);
- default:
- break;
- }
- return len;
-}
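
patch_internal()/vmi_patch() above rewrite a paravirt call site in place with a 5-byte near CALL or JMP. The rel32 operand is computed as dest - ip - 5 because x86 relative branches are encoded relative to the address of the *next* instruction. A freestanding sketch of that encoding (hypothetical helper, same arithmetic as patch_offset()):

#include <stdint.h>
#include <string.h>

/* Emit "call dest" (0xE8 + rel32) into buf, as if placed at address ip. */
static unsigned int emit_call_rel32(uint8_t *buf, unsigned long ip,
				    unsigned long dest)
{
	int32_t rel = (int32_t)(dest - ip - 5);	/* relative to next insn */

	buf[0] = 0xE8;				/* MNEM_CALL */
	memcpy(buf + 1, &rel, sizeof(rel));
	return 5;				/* patched length */
}
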
-
-/* CPUID has non-C semantics, and paravirt-ops API doesn't match hardware ISA */
-static void vmi_cpuid(unsigned int *ax, unsigned int *bx,
- unsigned int *cx, unsigned int *dx)
-{
- int override = 0;
- if (*ax == 1)
- override = 1;
- asm volatile ("call *%6"
- : "=a" (*ax),
- "=b" (*bx),
- "=c" (*cx),
- "=d" (*dx)
- : "0" (*ax), "2" (*cx), "r" (vmi_ops.cpuid));
- if (override) {
- if (disable_pse)
- *dx &= ~X86_FEATURE_PSE;
- if (disable_pge)
- *dx &= ~X86_FEATURE_PGE;
- if (disable_sep)
- *dx &= ~X86_FEATURE_SEP;
- if (disable_tsc)
- *dx &= ~X86_FEATURE_TSC;
- if (disable_mtrr)
- *dx &= ~X86_FEATURE_MTRR;
- }
-}
-
-static inline void vmi_maybe_load_tls(struct desc_struct *gdt, int nr, struct desc_struct *new)
-{
- if (gdt[nr].a != new->a || gdt[nr].b != new->b)
- write_gdt_entry(gdt, nr, new, 0);
-}
-
-static void vmi_load_tls(struct thread_struct *t, unsigned int cpu)
-{
- struct desc_struct *gdt = get_cpu_gdt_table(cpu);
- vmi_maybe_load_tls(gdt, GDT_ENTRY_TLS_MIN + 0, &t->tls_array[0]);
- vmi_maybe_load_tls(gdt, GDT_ENTRY_TLS_MIN + 1, &t->tls_array[1]);
- vmi_maybe_load_tls(gdt, GDT_ENTRY_TLS_MIN + 2, &t->tls_array[2]);
-}
-
-static void vmi_set_ldt(const void *addr, unsigned entries)
-{
- unsigned cpu = smp_processor_id();
- struct desc_struct desc;
-
- pack_descriptor(&desc, (unsigned long)addr,
- entries * sizeof(struct desc_struct) - 1,
- DESC_LDT, 0);
- write_gdt_entry(get_cpu_gdt_table(cpu), GDT_ENTRY_LDT, &desc, DESC_LDT);
- vmi_ops._set_ldt(entries ? GDT_ENTRY_LDT*sizeof(struct desc_struct) : 0);
-}
-
-static void vmi_set_tr(void)
-{
- vmi_ops.set_tr(GDT_ENTRY_TSS*sizeof(struct desc_struct));
-}
-
-static void vmi_write_idt_entry(gate_desc *dt, int entry, const gate_desc *g)
-{
- u32 *idt_entry = (u32 *)g;
- vmi_ops.write_idt_entry(dt, entry, idt_entry[0], idt_entry[1]);
-}
-
-static void vmi_write_gdt_entry(struct desc_struct *dt, int entry,
- const void *desc, int type)
-{
- u32 *gdt_entry = (u32 *)desc;
- vmi_ops.write_gdt_entry(dt, entry, gdt_entry[0], gdt_entry[1]);
-}
-
-static void vmi_write_ldt_entry(struct desc_struct *dt, int entry,
- const void *desc)
-{
- u32 *ldt_entry = (u32 *)desc;
- vmi_ops.write_ldt_entry(dt, entry, ldt_entry[0], ldt_entry[1]);
-}
-
-static void vmi_load_sp0(struct tss_struct *tss,
- struct thread_struct *thread)
-{
- tss->x86_tss.sp0 = thread->sp0;
-
- /* This can only happen when SEP is enabled, no need to test "SEP"arately */
- if (unlikely(tss->x86_tss.ss1 != thread->sysenter_cs)) {
- tss->x86_tss.ss1 = thread->sysenter_cs;
- wrmsr(MSR_IA32_SYSENTER_CS, thread->sysenter_cs, 0);
- }
- vmi_ops.set_kernel_stack(__KERNEL_DS, tss->x86_tss.sp0);
-}
-
-static void vmi_flush_tlb_user(void)
-{
- vmi_ops._flush_tlb(VMI_FLUSH_TLB);
-}
-
-static void vmi_flush_tlb_kernel(void)
-{
- vmi_ops._flush_tlb(VMI_FLUSH_TLB | VMI_FLUSH_GLOBAL);
-}
-
-/* Stub to do nothing at all; used for delays and unimplemented calls */
-static void vmi_nop(void)
-{
-}
-
-static void vmi_allocate_pte(struct mm_struct *mm, unsigned long pfn)
-{
- vmi_ops.allocate_page(pfn, VMI_PAGE_L1, 0, 0, 0);
-}
-
-static void vmi_allocate_pmd(struct mm_struct *mm, unsigned long pfn)
-{
- /*
- * This call comes in very early, before mem_map is setup.
- * It is called only for swapper_pg_dir, which already has
- * data on it.
- */
- vmi_ops.allocate_page(pfn, VMI_PAGE_L2, 0, 0, 0);
-}
-
-static void vmi_allocate_pmd_clone(unsigned long pfn, unsigned long clonepfn, unsigned long start, unsigned long count)
-{
- vmi_ops.allocate_page(pfn, VMI_PAGE_L2 | VMI_PAGE_CLONE, clonepfn, start, count);
-}
-
-static void vmi_release_pte(unsigned long pfn)
-{
- vmi_ops.release_page(pfn, VMI_PAGE_L1);
-}
-
-static void vmi_release_pmd(unsigned long pfn)
-{
- vmi_ops.release_page(pfn, VMI_PAGE_L2);
-}
-
-/*
- * We use the pgd_free hook for releasing the pgd page:
- */
-static void vmi_pgd_free(struct mm_struct *mm, pgd_t *pgd)
-{
- unsigned long pfn = __pa(pgd) >> PAGE_SHIFT;
-
- vmi_ops.release_page(pfn, VMI_PAGE_L2);
-}
-
-/*
- * Helper macros for MMU update flags. We can defer updates until a flush
- * or page invalidation only if the update is to the current address space
- * (otherwise, there is no flush). We must check against init_mm, since
- * this could be a kernel update, which usually passes init_mm, although
- * sometimes this check can be skipped if we know the particular function
- * is only called on user mode PTEs. We could change the kernel to pass
- * current->active_mm here, but in particular, I was unsure if changing
- * mm/highmem.c to do this would still be correct on other architectures.
- */
-#define is_current_as(mm, mustbeuser) ((mm) == current->active_mm || \
- (!mustbeuser && (mm) == &init_mm))
-#define vmi_flags_addr(mm, addr, level, user) \
- ((level) | (is_current_as(mm, user) ? \
- (VMI_PAGE_CURRENT_AS | ((addr) & VMI_PAGE_VA_MASK)) : 0))
-#define vmi_flags_addr_defer(mm, addr, level, user) \
- ((level) | (is_current_as(mm, user) ? \
- (VMI_PAGE_DEFER | VMI_PAGE_CURRENT_AS | ((addr) & VMI_PAGE_VA_MASK)) : 0))
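-/*
- * As a worked expansion of the macro above (the address is hypothetical):
- * updating a user PTE at 0x08048000 in the current address space gives
- *   vmi_flags_addr(mm, 0x08048000, VMI_PAGE_PT, 1)
- *     == VMI_PAGE_PT | VMI_PAGE_CURRENT_AS | (0x08048000 & VMI_PAGE_VA_MASK)
- * while the same update against a foreign mm collapses to plain
- * VMI_PAGE_PT, telling the hypervisor that no flush of the current
- * address space is implied.
- */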
-
-static void vmi_update_pte(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
-{
- vmi_ops.update_pte(ptep, vmi_flags_addr(mm, addr, VMI_PAGE_PT, 0));
-}
-
-static void vmi_update_pte_defer(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
-{
- vmi_ops.update_pte(ptep, vmi_flags_addr_defer(mm, addr, VMI_PAGE_PT, 0));
-}
-
-static void vmi_set_pte(pte_t *ptep, pte_t pte)
-{
- /* XXX because of set_pmd_pte, this can be called on PT or PD layers */
- vmi_ops.set_pte(pte, ptep, VMI_PAGE_PT);
-}
-
-static void vmi_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte)
-{
- vmi_ops.set_pte(pte, ptep, vmi_flags_addr(mm, addr, VMI_PAGE_PT, 0));
-}
-
-static void vmi_set_pmd(pmd_t *pmdp, pmd_t pmdval)
-{
-#ifdef CONFIG_X86_PAE
- const pte_t pte = { .pte = pmdval.pmd };
-#else
- const pte_t pte = { pmdval.pud.pgd.pgd };
-#endif
- vmi_ops.set_pte(pte, (pte_t *)pmdp, VMI_PAGE_PD);
-}
-
-#ifdef CONFIG_X86_PAE
-
-static void vmi_set_pte_atomic(pte_t *ptep, pte_t pteval)
-{
- /*
- * XXX This is called from set_pmd_pte, but at both PT
- * and PD layers so the VMI_PAGE_PT flag is wrong. But
- * it is only called for large page mapping changes,
- * the Xen backend doesn't support large pages, and the
- * ESX backend doesn't depend on the flag.
- */
- set_64bit((unsigned long long *)ptep,pte_val(pteval));
- vmi_ops.update_pte(ptep, VMI_PAGE_PT);
-}
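-/*
- * Under PAE a PTE is 64 bits wide on a 32-bit CPU, so the store above
- * must be a single atomic 8-byte write (set_64bit is built on
- * cmpxchg8b); a pair of 32-bit stores could expose a half-updated
- * entry to the hardware page-table walker.
- */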
-
-static void vmi_set_pud(pud_t *pudp, pud_t pudval)
-{
- /* Um, eww */
- const pte_t pte = { .pte = pudval.pgd.pgd };
- vmi_ops.set_pte(pte, (pte_t *)pudp, VMI_PAGE_PDP);
-}
-
-static void vmi_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
-{
- const pte_t pte = { .pte = 0 };
- vmi_ops.set_pte(pte, ptep, vmi_flags_addr(mm, addr, VMI_PAGE_PT, 0));
-}
-
-static void vmi_pmd_clear(pmd_t *pmd)
-{
- const pte_t pte = { .pte = 0 };
- vmi_ops.set_pte(pte, (pte_t *)pmd, VMI_PAGE_PD);
-}
-#endif
-
-#ifdef CONFIG_SMP
-static void __devinit
-vmi_startup_ipi_hook(int phys_apicid, unsigned long start_eip,
- unsigned long start_esp)
-{
- struct vmi_ap_state ap;
-
- /* Default everything to zero. This is fine for most GPRs. */
- memset(&ap, 0, sizeof(struct vmi_ap_state));
-
- ap.gdtr_limit = GDT_SIZE - 1;
- ap.gdtr_base = (unsigned long) get_cpu_gdt_table(phys_apicid);
-
- ap.idtr_limit = IDT_ENTRIES * 8 - 1;
- ap.idtr_base = (unsigned long) idt_table;
-
- ap.ldtr = 0;
-
- ap.cs = __KERNEL_CS;
- ap.eip = (unsigned long) start_eip;
- ap.ss = __KERNEL_DS;
- ap.esp = (unsigned long) start_esp;
-
- ap.ds = __USER_DS;
- ap.es = __USER_DS;
- ap.fs = __KERNEL_PERCPU;
- ap.gs = __KERNEL_STACK_CANARY;
-
- ap.eflags = 0;
-
-#ifdef CONFIG_X86_PAE
- /* efer should match BSP efer. */
- if (cpu_has_nx) {
- unsigned l, h;
- rdmsr(MSR_EFER, l, h);
- ap.efer = (unsigned long long) h << 32 | l;
- }
-#endif
-
- ap.cr3 = __pa(swapper_pg_dir);
- /* Protected mode, paging, AM, WP, NE, MP. */
- ap.cr0 = 0x80050023;
- ap.cr4 = mmu_cr4_features;
- vmi_ops.set_initial_ap_state((u32)&ap, phys_apicid);
-}
-#endif
-
-static void vmi_start_context_switch(struct task_struct *prev)
-{
- paravirt_start_context_switch(prev);
- vmi_ops.set_lazy_mode(2);
-}
-
-static void vmi_end_context_switch(struct task_struct *next)
-{
- vmi_ops.set_lazy_mode(0);
- paravirt_end_context_switch(next);
-}
-
-static void vmi_enter_lazy_mmu(void)
-{
- paravirt_enter_lazy_mmu();
- vmi_ops.set_lazy_mode(1);
-}
-
-static void vmi_leave_lazy_mmu(void)
-{
- vmi_ops.set_lazy_mode(0);
- paravirt_leave_lazy_mmu();
-}
-
-static inline int __init check_vmi_rom(struct vrom_header *rom)
-{
- struct pci_header *pci;
- struct pnp_header *pnp;
- const char *manufacturer = "UNKNOWN";
- const char *product = "UNKNOWN";
- const char *license = "unspecified";
-
- if (rom->rom_signature != 0xaa55)
- return 0;
- if (rom->vrom_signature != VMI_SIGNATURE)
- return 0;
- if (rom->api_version_maj != VMI_API_REV_MAJOR ||
- rom->api_version_min+1 < VMI_API_REV_MINOR+1) {
- printk(KERN_WARNING "VMI: Found mismatched rom version %d.%d\n",
- rom->api_version_maj,
- rom->api_version_min);
- return 0;
- }
-
- /*
- * Relying on the VMI_SIGNATURE field is not 100% safe, so check
- * the PCI header and device type to make sure this is really a
- * VMI device.
- */
- if (!rom->pci_header_offs) {
- printk(KERN_WARNING "VMI: ROM does not contain PCI header.\n");
- return 0;
- }
-
- pci = (struct pci_header *)((char *)rom+rom->pci_header_offs);
- if (pci->vendorID != PCI_VENDOR_ID_VMWARE ||
- pci->deviceID != PCI_DEVICE_ID_VMWARE_VMI) {
- /* Allow it to run anyway, but warn */
- printk(KERN_WARNING "VMI: ROM from unknown manufacturer\n");
- }
-
- if (rom->pnp_header_offs) {
- pnp = (struct pnp_header *)((char *)rom+rom->pnp_header_offs);
- if (pnp->manufacturer_offset)
- manufacturer = (const char *)rom+pnp->manufacturer_offset;
- if (pnp->product_offset)
- product = (const char *)rom+pnp->product_offset;
- }
-
- if (rom->license_offs)
- license = (char *)rom+rom->license_offs;
-
- printk(KERN_INFO "VMI: Found %s %s, API version %d.%d, ROM version %d.%d\n",
- manufacturer, product,
- rom->api_version_maj, rom->api_version_min,
- pci->rom_version_maj, pci->rom_version_min);
-
- /* Don't allow BSD/MIT here for now because we don't want to end up
- with any binary-only shim layers */
- if (strcmp(license, "GPL") && strcmp(license, "GPL v2")) {
- printk(KERN_WARNING "VMI: Non GPL license `%s' found for ROM. Not used.\n",
- license);
- return 0;
- }
-
- return 1;
-}
-
-/*
- * Probe for the VMI option ROM
- */
-static inline int __init probe_vmi_rom(void)
-{
- unsigned long base;
-
- /* VMI ROM is in option ROM area, check signature */
- for (base = 0xC0000; base < 0xE0000; base += 2048) {
- struct vrom_header *romstart;
- romstart = (struct vrom_header *)isa_bus_to_virt(base);
- if (check_vmi_rom(romstart)) {
- vmi_rom = romstart;
- return 1;
- }
- }
- return 0;
-}
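-/*
- * The loop walks the legacy option-ROM window at the 2K granularity to
- * which expansion ROMs must be aligned; check_vmi_rom() then verifies
- * the 0xaa55 expansion-ROM signature and the VMI header before the
- * ROM is accepted.
- */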
-
-/*
- * VMI setup common to all processors
- */
-void vmi_bringup(void)
-{
- /* We must establish the lowmem mapping for MMU ops to work */
- if (vmi_ops.set_linear_mapping)
- vmi_ops.set_linear_mapping(0, (void *)__PAGE_OFFSET, MAXMEM_PFN, 0);
-}
-
-/*
- * Return a pointer to a VMI function or NULL if unimplemented
- */
-static void *vmi_get_function(int vmicall)
-{
- u64 reloc;
- const struct vmi_relocation_info *rel = (struct vmi_relocation_info *)&reloc;
- reloc = call_vrom_long_func(vmi_rom, get_reloc, vmicall);
- BUG_ON(rel->type == VMI_RELOCATION_JUMP_REL);
- if (rel->type == VMI_RELOCATION_CALL_REL)
- return (void *)rel->eip;
- else
- return NULL;
-}
-
-/*
- * Helper macro for making the VMI paravirt-ops fill code readable.
- * For unimplemented operations, fall back to default, unless nop
- * is returned by the ROM.
- */
-#define para_fill(opname, vmicall) \
-do { \
- reloc = call_vrom_long_func(vmi_rom, get_reloc, \
- VMI_CALL_##vmicall); \
- if (rel->type == VMI_RELOCATION_CALL_REL) \
- opname = (void *)rel->eip; \
- else if (rel->type == VMI_RELOCATION_NOP) \
- opname = (void *)vmi_nop; \
- else if (rel->type != VMI_RELOCATION_NONE) \
- printk(KERN_WARNING "VMI: Unknown relocation " \
- "type %d for " #vmicall"\n",\
- rel->type); \
-} while (0)
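-/*
- * For instance, para_fill(pv_cpu_ops.clts, CLTS) looks up VMI_CALL_CLTS
- * in the ROM: a CALL_REL relocation installs the ROM's entry point as
- * the paravirt-op, a NOP relocation installs vmi_nop, and
- * VMI_RELOCATION_NONE leaves the native implementation in place.
- */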
-
-/*
- * Helper macro for making the VMI paravirt-ops fill code readable.
- * For cached operations which do not match the VMI ROM ABI and must
- * go through a translation stub. Ignore NOPs, since it is not clear
- * a NOP VMI function corresponds to a NOP paravirt-op when the
- * functions are not in 1-1 correspondence.
- */
-#define para_wrap(opname, wrapper, cache, vmicall) \
-do { \
- reloc = call_vrom_long_func(vmi_rom, get_reloc, \
- VMI_CALL_##vmicall); \
- BUG_ON(rel->type == VMI_RELOCATION_JUMP_REL); \
- if (rel->type == VMI_RELOCATION_CALL_REL) { \
- opname = wrapper; \
- vmi_ops.cache = (void *)rel->eip; \
- } \
-} while (0)
-
-/*
- * Activate the VMI interface and switch into paravirtualized mode
- */
-static inline int __init activate_vmi(void)
-{
- short kernel_cs;
- u64 reloc;
- const struct vmi_relocation_info *rel = (struct vmi_relocation_info *)&reloc;
-
- /*
- * Prevent page tables from being allocated in highmem, even if
- * CONFIG_HIGHPTE is enabled.
- */
- __userpte_alloc_gfp &= ~__GFP_HIGHMEM;
-
- if (call_vrom_func(vmi_rom, vmi_init) != 0) {
- printk(KERN_ERR "VMI ROM failed to initialize!");
- return 0;
- }
- savesegment(cs, kernel_cs);
-
- pv_info.paravirt_enabled = 1;
- pv_info.kernel_rpl = kernel_cs & SEGMENT_RPL_MASK;
- pv_info.name = "vmi [deprecated]";
-
- pv_init_ops.patch = vmi_patch;
-
- /*
- * Many of these operations are ABI compatible with VMI.
- * This means we can fill in the paravirt-ops with direct
- * pointers into the VMI ROM. If the calling convention for
- * these operations changes, this code needs to be updated.
- *
- * Exceptions
- * CPUID paravirt-op uses pointers, not the native ISA
- * halt has no VMI equivalent; all VMI halts are "safe"
- * no MSR support yet - just trap and emulate. VMI uses the
- * same ABI as the native ISA, but Linux wants exceptions
- * from bogus MSR read / write handled
- * rdpmc is not yet used in Linux
- */
-
- /* CPUID is special, so very special it gets wrapped like a present */
- para_wrap(pv_cpu_ops.cpuid, vmi_cpuid, cpuid, CPUID);
-
- para_fill(pv_cpu_ops.clts, CLTS);
- para_fill(pv_cpu_ops.get_debugreg, GetDR);
- para_fill(pv_cpu_ops.set_debugreg, SetDR);
- para_fill(pv_cpu_ops.read_cr0, GetCR0);
- para_fill(pv_mmu_ops.read_cr2, GetCR2);
- para_fill(pv_mmu_ops.read_cr3, GetCR3);
- para_fill(pv_cpu_ops.read_cr4, GetCR4);
- para_fill(pv_cpu_ops.write_cr0, SetCR0);
- para_fill(pv_mmu_ops.write_cr2, SetCR2);
- para_fill(pv_mmu_ops.write_cr3, SetCR3);
- para_fill(pv_cpu_ops.write_cr4, SetCR4);
-
- para_fill(pv_irq_ops.save_fl.func, GetInterruptMask);
- para_fill(pv_irq_ops.restore_fl.func, SetInterruptMask);
- para_fill(pv_irq_ops.irq_disable.func, DisableInterrupts);
- para_fill(pv_irq_ops.irq_enable.func, EnableInterrupts);
-
- para_fill(pv_cpu_ops.wbinvd, WBINVD);
- para_fill(pv_cpu_ops.read_tsc, RDTSC);
-
- /* The following we emulate with trap and emulate for now */
- /* paravirt_ops.read_msr = vmi_rdmsr */
- /* paravirt_ops.write_msr = vmi_wrmsr */
- /* paravirt_ops.rdpmc = vmi_rdpmc */
-
- /* TR interface doesn't pass TR value, wrap */
- para_wrap(pv_cpu_ops.load_tr_desc, vmi_set_tr, set_tr, SetTR);
-
- /* LDT is special, too */
- para_wrap(pv_cpu_ops.set_ldt, vmi_set_ldt, _set_ldt, SetLDT);
-
- para_fill(pv_cpu_ops.load_gdt, SetGDT);
- para_fill(pv_cpu_ops.load_idt, SetIDT);
- para_fill(pv_cpu_ops.store_gdt, GetGDT);
- para_fill(pv_cpu_ops.store_idt, GetIDT);
- para_fill(pv_cpu_ops.store_tr, GetTR);
- pv_cpu_ops.load_tls = vmi_load_tls;
- para_wrap(pv_cpu_ops.write_ldt_entry, vmi_write_ldt_entry,
- write_ldt_entry, WriteLDTEntry);
- para_wrap(pv_cpu_ops.write_gdt_entry, vmi_write_gdt_entry,
- write_gdt_entry, WriteGDTEntry);
- para_wrap(pv_cpu_ops.write_idt_entry, vmi_write_idt_entry,
- write_idt_entry, WriteIDTEntry);
- para_wrap(pv_cpu_ops.load_sp0, vmi_load_sp0, set_kernel_stack, UpdateKernelStack);
- para_fill(pv_cpu_ops.set_iopl_mask, SetIOPLMask);
- para_fill(pv_cpu_ops.io_delay, IODelay);
-
- para_wrap(pv_cpu_ops.start_context_switch, vmi_start_context_switch,
- set_lazy_mode, SetLazyMode);
- para_wrap(pv_cpu_ops.end_context_switch, vmi_end_context_switch,
- set_lazy_mode, SetLazyMode);
-
- para_wrap(pv_mmu_ops.lazy_mode.enter, vmi_enter_lazy_mmu,
- set_lazy_mode, SetLazyMode);
- para_wrap(pv_mmu_ops.lazy_mode.leave, vmi_leave_lazy_mmu,
- set_lazy_mode, SetLazyMode);
-
- /* user and kernel flush are just handled with different flags to FlushTLB */
- para_wrap(pv_mmu_ops.flush_tlb_user, vmi_flush_tlb_user, _flush_tlb, FlushTLB);
- para_wrap(pv_mmu_ops.flush_tlb_kernel, vmi_flush_tlb_kernel, _flush_tlb, FlushTLB);
- para_fill(pv_mmu_ops.flush_tlb_single, InvalPage);
-
- /*
- * Until a standard flag format can be agreed on, we need to
- * implement these as wrappers in Linux. Get the VMI ROM
- * function pointers for the two backend calls.
- */
-#ifdef CONFIG_X86_PAE
- vmi_ops.set_pte = vmi_get_function(VMI_CALL_SetPxELong);
- vmi_ops.update_pte = vmi_get_function(VMI_CALL_UpdatePxELong);
-#else
- vmi_ops.set_pte = vmi_get_function(VMI_CALL_SetPxE);
- vmi_ops.update_pte = vmi_get_function(VMI_CALL_UpdatePxE);
-#endif
-
- if (vmi_ops.set_pte) {
- pv_mmu_ops.set_pte = vmi_set_pte;
- pv_mmu_ops.set_pte_at = vmi_set_pte_at;
- pv_mmu_ops.set_pmd = vmi_set_pmd;
-#ifdef CONFIG_X86_PAE
- pv_mmu_ops.set_pte_atomic = vmi_set_pte_atomic;
- pv_mmu_ops.set_pud = vmi_set_pud;
- pv_mmu_ops.pte_clear = vmi_pte_clear;
- pv_mmu_ops.pmd_clear = vmi_pmd_clear;
-#endif
- }
-
- if (vmi_ops.update_pte) {
- pv_mmu_ops.pte_update = vmi_update_pte;
- pv_mmu_ops.pte_update_defer = vmi_update_pte_defer;
- }
-
- vmi_ops.allocate_page = vmi_get_function(VMI_CALL_AllocatePage);
- if (vmi_ops.allocate_page) {
- pv_mmu_ops.alloc_pte = vmi_allocate_pte;
- pv_mmu_ops.alloc_pmd = vmi_allocate_pmd;
- pv_mmu_ops.alloc_pmd_clone = vmi_allocate_pmd_clone;
- }
-
- vmi_ops.release_page = vmi_get_function(VMI_CALL_ReleasePage);
- if (vmi_ops.release_page) {
- pv_mmu_ops.release_pte = vmi_release_pte;
- pv_mmu_ops.release_pmd = vmi_release_pmd;
- pv_mmu_ops.pgd_free = vmi_pgd_free;
- }
-
- /* Set linear is needed in all cases */
- vmi_ops.set_linear_mapping = vmi_get_function(VMI_CALL_SetLinearMapping);
-
- /*
- * These MUST always be patched. Don't support indirect jumps
- * through these operations, as the VMI interface may use either
- * a jump or a call to get to these operations, depending on
- * the backend. They are performance critical anyway, so requiring
- * a patch is not a big problem.
- */
- pv_cpu_ops.irq_enable_sysexit = (void *)0xfeedbab0;
- pv_cpu_ops.iret = (void *)0xbadbab0;
-
-#ifdef CONFIG_SMP
- para_wrap(pv_apic_ops.startup_ipi_hook, vmi_startup_ipi_hook, set_initial_ap_state, SetInitialAPState);
-#endif
-
-#ifdef CONFIG_X86_LOCAL_APIC
- para_fill(apic->read, APICRead);
- para_fill(apic->write, APICWrite);
-#endif
-
- /*
- * Check for VMI timer functionality by probing for a cycle frequency method
- */
- reloc = call_vrom_long_func(vmi_rom, get_reloc, VMI_CALL_GetCycleFrequency);
- if (!disable_vmi_timer && rel->type != VMI_RELOCATION_NONE) {
- vmi_timer_ops.get_cycle_frequency = (void *)rel->eip;
- vmi_timer_ops.get_cycle_counter =
- vmi_get_function(VMI_CALL_GetCycleCounter);
- vmi_timer_ops.get_wallclock =
- vmi_get_function(VMI_CALL_GetWallclockTime);
- vmi_timer_ops.wallclock_updated =
- vmi_get_function(VMI_CALL_WallclockUpdated);
- vmi_timer_ops.set_alarm = vmi_get_function(VMI_CALL_SetAlarm);
- vmi_timer_ops.cancel_alarm =
- vmi_get_function(VMI_CALL_CancelAlarm);
- x86_init.timers.timer_init = vmi_time_init;
-#ifdef CONFIG_X86_LOCAL_APIC
- x86_init.timers.setup_percpu_clockev = vmi_time_bsp_init;
- x86_cpuinit.setup_percpu_clockev = vmi_time_ap_init;
-#endif
- pv_time_ops.sched_clock = vmi_sched_clock;
- x86_platform.calibrate_tsc = vmi_tsc_khz;
- x86_platform.get_wallclock = vmi_get_wallclock;
- x86_platform.set_wallclock = vmi_set_wallclock;
-
- /* We have true wallclock functions; disable CMOS clock sync */
- no_sync_cmos_clock = 1;
- } else {
- disable_noidle = 1;
- disable_vmi_timer = 1;
- }
-
- para_fill(pv_irq_ops.safe_halt, Halt);
-
- /*
- * Alternative instruction rewriting doesn't happen soon enough
- * to convert VMI_IRET to a call instead of a jump; so we have
- * to do this before IRQs get reenabled. Fortunately, it is
- * idempotent.
- */
- apply_paravirt(__parainstructions, __parainstructions_end);
-
- vmi_bringup();
-
- return 1;
-}
-
-#undef para_fill
-
-void __init vmi_init(void)
-{
- if (!vmi_rom)
- probe_vmi_rom();
- else
- check_vmi_rom(vmi_rom);
-
- /* In case probing for or validating the ROM failed, bail */
- if (!vmi_rom)
- return;
-
- reserve_top_address(-vmi_rom->virtual_top);
-
-#ifdef CONFIG_X86_IO_APIC
- /* This is virtual hardware; timer routing is wired correctly */
- no_timer_check = 1;
-#endif
-}
-
-void __init vmi_activate(void)
-{
- unsigned long flags;
-
- if (!vmi_rom)
- return;
-
- local_irq_save(flags);
- activate_vmi();
- local_irq_restore(flags & X86_EFLAGS_IF);
-}
-
-static int __init parse_vmi(char *arg)
-{
- if (!arg)
- return -EINVAL;
-
- if (!strcmp(arg, "disable_pge")) {
- clear_cpu_cap(&boot_cpu_data, X86_FEATURE_PGE);
- disable_pge = 1;
- } else if (!strcmp(arg, "disable_pse")) {
- clear_cpu_cap(&boot_cpu_data, X86_FEATURE_PSE);
- disable_pse = 1;
- } else if (!strcmp(arg, "disable_sep")) {
- clear_cpu_cap(&boot_cpu_data, X86_FEATURE_SEP);
- disable_sep = 1;
- } else if (!strcmp(arg, "disable_tsc")) {
- clear_cpu_cap(&boot_cpu_data, X86_FEATURE_TSC);
- disable_tsc = 1;
- } else if (!strcmp(arg, "disable_mtrr")) {
- clear_cpu_cap(&boot_cpu_data, X86_FEATURE_MTRR);
- disable_mtrr = 1;
- } else if (!strcmp(arg, "disable_timer")) {
- disable_vmi_timer = 1;
- disable_noidle = 1;
- } else if (!strcmp(arg, "disable_noidle"))
- disable_noidle = 1;
- return 0;
-}
-
-early_param("vmi", parse_vmi);
diff --git a/arch/x86/kernel/vmiclock_32.c b/arch/x86/kernel/vmiclock_32.c
deleted file mode 100644
index 5e1ff66ecd73..000000000000
--- a/arch/x86/kernel/vmiclock_32.c
+++ /dev/null
@@ -1,317 +0,0 @@
-/*
- * VMI paravirtual timer support routines.
- *
- * Copyright (C) 2007, VMware, Inc.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT. See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- */
-
-#include <linux/smp.h>
-#include <linux/interrupt.h>
-#include <linux/cpumask.h>
-#include <linux/clocksource.h>
-#include <linux/clockchips.h>
-
-#include <asm/vmi.h>
-#include <asm/vmi_time.h>
-#include <asm/apicdef.h>
-#include <asm/apic.h>
-#include <asm/timer.h>
-#include <asm/i8253.h>
-#include <asm/irq_vectors.h>
-
-#define VMI_ONESHOT (VMI_ALARM_IS_ONESHOT | VMI_CYCLES_REAL | vmi_get_alarm_wiring())
-#define VMI_PERIODIC (VMI_ALARM_IS_PERIODIC | VMI_CYCLES_REAL | vmi_get_alarm_wiring())
-
-static DEFINE_PER_CPU(struct clock_event_device, local_events);
-
-static inline u32 vmi_counter(u32 flags)
-{
- /* Given VMI_ONESHOT or VMI_PERIODIC, return the corresponding
- * cycle counter. */
- return flags & VMI_ALARM_COUNTER_MASK;
-}
-
-/* paravirt_ops.get_wallclock = vmi_get_wallclock */
-unsigned long vmi_get_wallclock(void)
-{
- unsigned long long wallclock;
- wallclock = vmi_timer_ops.get_wallclock(); // nsec
- (void)do_div(wallclock, 1000000000); // sec
-
- return wallclock;
-}
-
-/* paravirt_ops.set_wallclock = vmi_set_wallclock */
-int vmi_set_wallclock(unsigned long now)
-{
- return 0;
-}
-
-/* paravirt_ops.sched_clock = vmi_sched_clock */
-unsigned long long vmi_sched_clock(void)
-{
- return cycles_2_ns(vmi_timer_ops.get_cycle_counter(VMI_CYCLES_AVAILABLE));
-}
-
-/* x86_platform.calibrate_tsc = vmi_tsc_khz */
-unsigned long vmi_tsc_khz(void)
-{
- unsigned long long khz;
- khz = vmi_timer_ops.get_cycle_frequency();
- (void)do_div(khz, 1000);
- return khz;
-}
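-/*
- * Worked example (the frequency is assumed): a 2.4GHz cycle counter
- * returns 2,400,000,000 from get_cycle_frequency(); do_div() by 1000
- * leaves 2,400,000, the TSC rate in kHz that the caller expects.
- */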
-
-static inline unsigned int vmi_get_timer_vector(void)
-{
- return IRQ0_VECTOR;
-}
-
-/** vmi clockchip */
-#ifdef CONFIG_X86_LOCAL_APIC
-static unsigned int startup_timer_irq(unsigned int irq)
-{
- unsigned long val = apic_read(APIC_LVTT);
- apic_write(APIC_LVTT, vmi_get_timer_vector());
-
- return (val & APIC_SEND_PENDING);
-}
-
-static void mask_timer_irq(unsigned int irq)
-{
- unsigned long val = apic_read(APIC_LVTT);
- apic_write(APIC_LVTT, val | APIC_LVT_MASKED);
-}
-
-static void unmask_timer_irq(unsigned int irq)
-{
- unsigned long val = apic_read(APIC_LVTT);
- apic_write(APIC_LVTT, val & ~APIC_LVT_MASKED);
-}
-
-static void ack_timer_irq(unsigned int irq)
-{
- ack_APIC_irq();
-}
-
-static struct irq_chip vmi_chip __read_mostly = {
- .name = "VMI-LOCAL",
- .startup = startup_timer_irq,
- .mask = mask_timer_irq,
- .unmask = unmask_timer_irq,
- .ack = ack_timer_irq
-};
-#endif
-
-/** vmi clockevent */
-#define VMI_ALARM_WIRED_IRQ0 0x00000000
-#define VMI_ALARM_WIRED_LVTT 0x00010000
-static int vmi_wiring = VMI_ALARM_WIRED_IRQ0;
-
-static inline int vmi_get_alarm_wiring(void)
-{
- return vmi_wiring;
-}
-
-static void vmi_timer_set_mode(enum clock_event_mode mode,
- struct clock_event_device *evt)
-{
- cycle_t now, cycles_per_hz;
- BUG_ON(!irqs_disabled());
-
- switch (mode) {
- case CLOCK_EVT_MODE_ONESHOT:
- case CLOCK_EVT_MODE_RESUME:
- break;
- case CLOCK_EVT_MODE_PERIODIC:
- cycles_per_hz = vmi_timer_ops.get_cycle_frequency();
- (void)do_div(cycles_per_hz, HZ);
- now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_PERIODIC));
- vmi_timer_ops.set_alarm(VMI_PERIODIC, now, cycles_per_hz);
- break;
- case CLOCK_EVT_MODE_UNUSED:
- case CLOCK_EVT_MODE_SHUTDOWN:
- switch (evt->mode) {
- case CLOCK_EVT_MODE_ONESHOT:
- vmi_timer_ops.cancel_alarm(VMI_ONESHOT);
- break;
- case CLOCK_EVT_MODE_PERIODIC:
- vmi_timer_ops.cancel_alarm(VMI_PERIODIC);
- break;
- default:
- break;
- }
- break;
- default:
- break;
- }
-}
-
-static int vmi_timer_next_event(unsigned long delta,
- struct clock_event_device *evt)
-{
- /* Unfortunately, the set_next_event interface only passes a relative
- * expiry, but we want an absolute expiry. It'd be better if we
- * were passed an absolute expiry, since a bunch of time may
- * have been stolen between the time the delta is computed and
- * when we set the alarm below. */
- cycle_t now = vmi_timer_ops.get_cycle_counter(vmi_counter(VMI_ONESHOT));
-
- BUG_ON(evt->mode != CLOCK_EVT_MODE_ONESHOT);
- vmi_timer_ops.set_alarm(VMI_ONESHOT, now + delta, 0);
- return 0;
-}
-
-static struct clock_event_device vmi_clockevent = {
- .name = "vmi-timer",
- .features = CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT,
- .shift = 22,
- .set_mode = vmi_timer_set_mode,
- .set_next_event = vmi_timer_next_event,
- .rating = 1000,
- .irq = 0,
-};
-
-static irqreturn_t vmi_timer_interrupt(int irq, void *dev_id)
-{
- struct clock_event_device *evt = &__get_cpu_var(local_events);
- evt->event_handler(evt);
- return IRQ_HANDLED;
-}
-
-static struct irqaction vmi_clock_action = {
- .name = "vmi-timer",
- .handler = vmi_timer_interrupt,
- .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_TIMER,
-};
-
-static void __devinit vmi_time_init_clockevent(void)
-{
- cycle_t cycles_per_msec;
- struct clock_event_device *evt;
-
- int cpu = smp_processor_id();
- evt = &__get_cpu_var(local_events);
-
- /* Use cycles_per_msec since div_sc params are 32-bits. */
- cycles_per_msec = vmi_timer_ops.get_cycle_frequency();
- (void)do_div(cycles_per_msec, 1000);
-
- memcpy(evt, &vmi_clockevent, sizeof(*evt));
- /* Must pick .shift such that .mult fits in 32 bits. Choosing
- * .shift to be 22 allows up to 2^(32-22) cycles per nanosecond
- * before overflow. */
- evt->mult = div_sc(cycles_per_msec, NSEC_PER_MSEC, evt->shift);
- /* Upper bound is clockevent's use of ulong for cycle deltas. */
- evt->max_delta_ns = clockevent_delta2ns(ULONG_MAX, evt);
- evt->min_delta_ns = clockevent_delta2ns(1, evt);
- evt->cpumask = cpumask_of(cpu);
-
- printk(KERN_WARNING "vmi: registering clock event %s. mult=%u shift=%u\n",
- evt->name, evt->mult, evt->shift);
- clockevents_register_device(evt);
-}
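-/*
- * Sketch of the conversion (1GHz is an assumed frequency): with
- * cycles_per_msec = 1,000,000, div_sc(1000000, NSEC_PER_MSEC, 22)
- * yields mult = (1000000 << 22) / 1000000 = 1 << 22, so the
- * clockevents core converts cycles = (ns * mult) >> 22 = ns --
- * exactly one cycle per nanosecond, as expected at 1GHz.
- */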
-
-void __init vmi_time_init(void)
-{
- unsigned int cpu;
- /* Disable the PIT: BIOSes start PIT CH0 with an 18.2 Hz periodic tick. */
- outb_pit(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */
-
- vmi_time_init_clockevent();
- setup_irq(0, &vmi_clock_action);
- for_each_possible_cpu(cpu)
- per_cpu(vector_irq, cpu)[vmi_get_timer_vector()] = 0;
-}
-
-#ifdef CONFIG_X86_LOCAL_APIC
-void __devinit vmi_time_bsp_init(void)
-{
- /*
- * On APIC systems, we want local timers to fire on each cpu. We do
- * this by programming LVTT to deliver timer events to the IRQ handler
- * for IRQ-0, since we can't re-use the APIC local timer handler
- * without interfering with that code.
- */
- clockevents_notify(CLOCK_EVT_NOTIFY_SUSPEND, NULL);
- local_irq_disable();
-#ifdef CONFIG_SMP
- /*
- * XXX handle_percpu_irq only defined for SMP; we need to switch over
- * to using it, since this is a local interrupt, which each CPU must
- * handle individually without locking out or dropping simultaneous
- * local timers on other CPUs. We also don't want to trigger the
- * quirk workaround code for interrupts which gets invoked from
- * handle_percpu_irq via eoi, so we use our own IRQ chip.
- */
- set_irq_chip_and_handler_name(0, &vmi_chip, handle_percpu_irq, "lvtt");
-#else
- set_irq_chip_and_handler_name(0, &vmi_chip, handle_edge_irq, "lvtt");
-#endif
- vmi_wiring = VMI_ALARM_WIRED_LVTT;
- apic_write(APIC_LVTT, vmi_get_timer_vector());
- local_irq_enable();
- clockevents_notify(CLOCK_EVT_NOTIFY_RESUME, NULL);
-}
-
-void __devinit vmi_time_ap_init(void)
-{
- vmi_time_init_clockevent();
- apic_write(APIC_LVTT, vmi_get_timer_vector());
-}
-#endif
-
-/** vmi clocksource */
-static struct clocksource clocksource_vmi;
-
-static cycle_t read_real_cycles(struct clocksource *cs)
-{
- cycle_t ret = (cycle_t)vmi_timer_ops.get_cycle_counter(VMI_CYCLES_REAL);
- return max(ret, clocksource_vmi.cycle_last);
-}
-
-static struct clocksource clocksource_vmi = {
- .name = "vmi-timer",
- .rating = 450,
- .read = read_real_cycles,
- .mask = CLOCKSOURCE_MASK(64),
- .mult = 0, /* to be set */
- .shift = 22,
- .flags = CLOCK_SOURCE_IS_CONTINUOUS,
-};
-
-static int __init init_vmi_clocksource(void)
-{
- cycle_t cycles_per_msec;
-
- if (!vmi_timer_ops.get_cycle_frequency)
- return 0;
- /* Use khz2mult rather than hz2mult since hz arg is only 32-bits. */
- cycles_per_msec = vmi_timer_ops.get_cycle_frequency();
- (void)do_div(cycles_per_msec, 1000);
-
- /* Note that clocksource.{mult, shift} converts in the opposite direction
- * from clockevents. */
- clocksource_vmi.mult = clocksource_khz2mult(cycles_per_msec,
- clocksource_vmi.shift);
-
- printk(KERN_WARNING "vmi: registering clock source khz=%lld\n", cycles_per_msec);
- return clocksource_register(&clocksource_vmi);
-
-}
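-/*
- * Here mult converts the other way, ns = (cycles * mult) >> shift; for
- * the same assumed 1GHz counter, clocksource_khz2mult(1000000, 22) is
- * again 1 << 22 and the conversion is the identity.
- */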
-module_init(init_vmi_clocksource);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index d0bb52296fa3..d7af4a64c211 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
/*
* ld script for the x86 kernel
*
@@ -14,111 +15,158 @@
* put it inside the section definition.
*/
-#ifdef CONFIG_X86_32
-#define LOAD_OFFSET __PAGE_OFFSET
-#else
#define LOAD_OFFSET __START_KERNEL_map
-#endif
+
+#define RUNTIME_DISCARD_EXIT
+#define EMITS_PT_NOTE
+#define RO_EXCEPTION_TABLE_ALIGN 16
#include <asm-generic/vmlinux.lds.h>
#include <asm/asm-offsets.h>
#include <asm/thread_info.h>
#include <asm/page_types.h>
+#include <asm/orc_lookup.h>
#include <asm/cache.h>
#include <asm/boot.h>
+#include <asm/kexec.h>
#undef i386 /* in case the preprocessor is a 32bit one */
-OUTPUT_FORMAT(CONFIG_OUTPUT_FORMAT, CONFIG_OUTPUT_FORMAT, CONFIG_OUTPUT_FORMAT)
+OUTPUT_FORMAT(CONFIG_OUTPUT_FORMAT)
#ifdef CONFIG_X86_32
OUTPUT_ARCH(i386)
ENTRY(phys_startup_32)
-jiffies = jiffies_64;
#else
OUTPUT_ARCH(i386:x86-64)
ENTRY(phys_startup_64)
-jiffies_64 = jiffies;
#endif
-#if defined(CONFIG_X86_64) && defined(CONFIG_DEBUG_RODATA)
+jiffies = jiffies_64;
+const_current_task = current_task;
+const_cpu_current_top_of_stack = cpu_current_top_of_stack;
+
+#if defined(CONFIG_X86_64)
/*
- * On 64-bit, align RODATA to 2MB so that even with CONFIG_DEBUG_RODATA
- * we retain large page mappings for boundaries spanning kernel text, rodata
- * and data sections.
+ * On 64-bit, align RODATA to 2MB so we retain large page mappings for
+ * boundaries spanning kernel text, rodata and data sections.
*
* However, kernel identity mappings will have different RWX permissions
* to the pages mapping to text and to the pages padding (which are freed) the
* text section. Hence kernel identity mappings will be broken to smaller
* pages. For 64-bit, kernel text and kernel identity mappings are different,
- * so we can enable protection checks that come with CONFIG_DEBUG_RODATA,
- * as well as retain 2MB large page mappings for kernel text.
+ * so we can enable protection checks as well as retain 2MB large page
+ * mappings for kernel text.
*/
-#define X64_ALIGN_DEBUG_RODATA_BEGIN . = ALIGN(HPAGE_SIZE);
+#define X86_ALIGN_RODATA_BEGIN . = ALIGN(HPAGE_SIZE);
-#define X64_ALIGN_DEBUG_RODATA_END \
+#define X86_ALIGN_RODATA_END \
. = ALIGN(HPAGE_SIZE); \
- __end_rodata_hpage_align = .;
+ __end_rodata_hpage_align = .; \
+ __end_rodata_aligned = .;
+
+#define ALIGN_ENTRY_TEXT_BEGIN . = ALIGN(PMD_SIZE);
+#define ALIGN_ENTRY_TEXT_END . = ALIGN(PMD_SIZE);
+
+/*
+ * This section contains data which will be mapped as decrypted. Memory
+ * encryption operates on a page basis. Make this section PMD-aligned
+ * to avoid splitting the pages while mapping the section early.
+ *
+ * Note: We use a separate section so that only this section gets
+ * decrypted to avoid exposing more than we wish.
+ */
+#define BSS_DECRYPTED \
+ . = ALIGN(PMD_SIZE); \
+ __start_bss_decrypted = .; \
+ __pi___start_bss_decrypted = .; \
+ *(.bss..decrypted); \
+ . = ALIGN(PAGE_SIZE); \
+ __start_bss_decrypted_unused = .; \
+ . = ALIGN(PMD_SIZE); \
+ __end_bss_decrypted = .; \
+ __pi___end_bss_decrypted = .; \
#else
-#define X64_ALIGN_DEBUG_RODATA_BEGIN
-#define X64_ALIGN_DEBUG_RODATA_END
+#define X86_ALIGN_RODATA_BEGIN
+#define X86_ALIGN_RODATA_END \
+ . = ALIGN(PAGE_SIZE); \
+ __end_rodata_aligned = .;
-#endif
+#define ALIGN_ENTRY_TEXT_BEGIN
+#define ALIGN_ENTRY_TEXT_END
+#define BSS_DECRYPTED
-PHDRS {
- text PT_LOAD FLAGS(5); /* R_E */
- data PT_LOAD FLAGS(7); /* RWE */
-#ifdef CONFIG_X86_64
- user PT_LOAD FLAGS(5); /* R_E */
-#ifdef CONFIG_SMP
- percpu PT_LOAD FLAGS(6); /* RW_ */
#endif
- init PT_LOAD FLAGS(7); /* RWE */
+#if defined(CONFIG_X86_64) && defined(CONFIG_KEXEC_CORE)
+#define KEXEC_RELOCATE_KERNEL \
+ . = ALIGN(0x100); \
+ __relocate_kernel_start = .; \
+ *(.text..relocate_kernel); \
+ *(.data..relocate_kernel); \
+ __relocate_kernel_end = .;
+
+ASSERT(__relocate_kernel_end - __relocate_kernel_start <= KEXEC_CONTROL_CODE_MAX_SIZE,
+ "relocate_kernel code too large!")
+#else
+#define KEXEC_RELOCATE_KERNEL
#endif
+PHDRS {
+ text PT_LOAD FLAGS(5); /* R_E */
+ data PT_LOAD FLAGS(6); /* RW_ */
note PT_NOTE FLAGS(0); /* ___ */
}
SECTIONS
{
+ . = __START_KERNEL;
#ifdef CONFIG_X86_32
- . = LOAD_OFFSET + LOAD_PHYSICAL_ADDR;
- phys_startup_32 = startup_32 - LOAD_OFFSET;
+ phys_startup_32 = ABSOLUTE(startup_32 - LOAD_OFFSET);
#else
- . = __START_KERNEL;
- phys_startup_64 = startup_64 - LOAD_OFFSET;
+ phys_startup_64 = ABSOLUTE(startup_64 - LOAD_OFFSET);
#endif
/* Text and read-only data */
.text : AT(ADDR(.text) - LOAD_OFFSET) {
_text = .;
- /* bootstrapping code */
- HEAD_TEXT
-#ifdef CONFIG_X86_32
- . = ALIGN(PAGE_SIZE);
- *(.text..page_aligned)
-#endif
- . = ALIGN(8);
+ __pi__text = .;
_stext = .;
+ ALIGN_ENTRY_TEXT_BEGIN
+ *(.text..__x86.rethunk_untrain)
+ ENTRY_TEXT
+
+#ifdef CONFIG_MITIGATION_SRSO
+ /*
+ * See the comment above srso_alias_untrain_ret()'s
+ * definition.
+ */
+ . = srso_alias_untrain_ret | (1 << 2) | (1 << 8) | (1 << 14) | (1 << 20);
+ *(.text..__x86.rethunk_safe)
+#endif
+ ALIGN_ENTRY_TEXT_END
+
TEXT_TEXT
SCHED_TEXT
LOCK_TEXT
KPROBES_TEXT
- IRQENTRY_TEXT
- *(.fixup)
+ SOFTIRQENTRY_TEXT
+#ifdef CONFIG_MITIGATION_RETPOLINE
+ *(.text..__x86.indirect_thunk)
+ *(.text..__x86.return_thunk)
+#endif
+ STATIC_CALL_TEXT
*(.gnu.warning)
- /* End of text section */
- _etext = .;
- } :text = 0x9090
- NOTES :text :note
+ } :text = 0xcccccccc
- EXCEPTION_TABLE(16) :text = 0x9090
+ /* End of text section, which should occupy whole number of pages */
+ _etext = .;
+ . = ALIGN(PAGE_SIZE);
- X64_ALIGN_DEBUG_RODATA_BEGIN
+ X86_ALIGN_RODATA_BEGIN
RO_DATA(PAGE_SIZE)
- X64_ALIGN_DEBUG_RODATA_END
+ X86_ALIGN_RODATA_END
/* Data */
.data : AT(ADDR(.data) - LOAD_OFFSET) {
@@ -128,6 +176,9 @@ SECTIONS
/* init_task */
INIT_TASK_DATA(THREAD_SIZE)
+ /* equivalent to task_pt_regs(&init_task) */
+ __top_init_kernel_stack = __end_init_stack - TOP_OF_KERNEL_STACK_PADDING - PTREGS_SIZE;
+
#ifdef CONFIG_X86_32
/* 32 bit has nosave before _edata */
NOSAVE_DATA
@@ -135,10 +186,13 @@ SECTIONS
PAGE_ALIGNED_DATA(PAGE_SIZE)
+ CACHE_HOT_DATA(L1_CACHE_BYTES)
+
CACHELINE_ALIGNED_DATA(L1_CACHE_BYTES)
DATA_DATA
CONSTRUCTORS
+ KEXEC_RELOCATE_KERNEL
/* rarely changed data like cpu maps */
READ_MOSTLY_DATA(INTERNODE_CACHE_BYTES)
@@ -147,108 +201,99 @@ SECTIONS
_edata = .;
} :data
-#ifdef CONFIG_X86_64
-
-#define VSYSCALL_ADDR (-10*1024*1024)
+ BUG_TABLE
-#define VLOAD_OFFSET (VSYSCALL_ADDR - __vsyscall_0 + LOAD_OFFSET)
-#define VLOAD(x) (ADDR(x) - VLOAD_OFFSET)
+ ORC_UNWIND_TABLE
-#define VVIRT_OFFSET (VSYSCALL_ADDR - __vsyscall_0)
-#define VVIRT(x) (ADDR(x) - VVIRT_OFFSET)
-
- . = ALIGN(4096);
- __vsyscall_0 = .;
-
- . = VSYSCALL_ADDR;
- .vsyscall_0 : AT(VLOAD(.vsyscall_0)) {
- *(.vsyscall_0)
- } :user
-
- . = ALIGN(L1_CACHE_BYTES);
- .vsyscall_fn : AT(VLOAD(.vsyscall_fn)) {
- *(.vsyscall_fn)
+ /* Init code and data - will be freed after init */
+ . = ALIGN(PAGE_SIZE);
+ .init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
+ __init_begin = .; /* paired with __init_end */
}
- . = ALIGN(L1_CACHE_BYTES);
- .vsyscall_gtod_data : AT(VLOAD(.vsyscall_gtod_data)) {
- *(.vsyscall_gtod_data)
- }
+ INIT_TEXT_SECTION(PAGE_SIZE)
- vsyscall_gtod_data = VVIRT(.vsyscall_gtod_data);
- .vsyscall_clock : AT(VLOAD(.vsyscall_clock)) {
- *(.vsyscall_clock)
+ /*
+ * Section for code used exclusively before alternatives are run. All
+ * references to such code must be patched out by alternatives, normally
+ * by using X86_FEATURE_ALWAYS CPU feature bit.
+ *
+ * See static_cpu_has() for an example.
+ */
+ .altinstr_aux : AT(ADDR(.altinstr_aux) - LOAD_OFFSET) {
+ *(.altinstr_aux)
+ . = ALIGN(PAGE_SIZE);
+ __inittext_end = .;
}
- vsyscall_clock = VVIRT(.vsyscall_clock);
+ INIT_DATA_SECTION(16)
- .vsyscall_1 ADDR(.vsyscall_0) + 1024: AT(VLOAD(.vsyscall_1)) {
- *(.vsyscall_1)
- }
- .vsyscall_2 ADDR(.vsyscall_0) + 2048: AT(VLOAD(.vsyscall_2)) {
- *(.vsyscall_2)
+ .x86_cpu_dev.init : AT(ADDR(.x86_cpu_dev.init) - LOAD_OFFSET) {
+ __x86_cpu_dev_start = .;
+ *(.x86_cpu_dev.init)
+ __x86_cpu_dev_end = .;
}
- .vgetcpu_mode : AT(VLOAD(.vgetcpu_mode)) {
- *(.vgetcpu_mode)
+#ifdef CONFIG_X86_INTEL_MID
+ .x86_intel_mid_dev.init : AT(ADDR(.x86_intel_mid_dev.init) - \
+ LOAD_OFFSET) {
+ __x86_intel_mid_dev_start = .;
+ *(.x86_intel_mid_dev.init)
+ __x86_intel_mid_dev_end = .;
}
- vgetcpu_mode = VVIRT(.vgetcpu_mode);
+#endif
- . = ALIGN(L1_CACHE_BYTES);
- .jiffies : AT(VLOAD(.jiffies)) {
- *(.jiffies)
+#ifdef CONFIG_MITIGATION_RETPOLINE
+ /*
+ * List of instructions that call/jmp/jcc to retpoline thunks
+ * __x86_indirect_thunk_*(). These instructions can be patched along
+ * with alternatives, after which the section can be freed.
+ */
+ . = ALIGN(8);
+ .retpoline_sites : AT(ADDR(.retpoline_sites) - LOAD_OFFSET) {
+ __retpoline_sites = .;
+ *(.retpoline_sites)
+ __retpoline_sites_end = .;
}
- jiffies = VVIRT(.jiffies);
- .vsyscall_3 ADDR(.vsyscall_0) + 3072: AT(VLOAD(.vsyscall_3)) {
- *(.vsyscall_3)
+ . = ALIGN(8);
+ .return_sites : AT(ADDR(.return_sites) - LOAD_OFFSET) {
+ __return_sites = .;
+ *(.return_sites)
+ __return_sites_end = .;
}
- . = __vsyscall_0 + PAGE_SIZE;
-
-#undef VSYSCALL_ADDR
-#undef VLOAD_OFFSET
-#undef VLOAD
-#undef VVIRT_OFFSET
-#undef VVIRT
-
-#endif /* CONFIG_X86_64 */
-
- /* Init code and data - will be freed after init */
- . = ALIGN(PAGE_SIZE);
- .init.begin : AT(ADDR(.init.begin) - LOAD_OFFSET) {
- __init_begin = .; /* paired with __init_end */
+ . = ALIGN(8);
+ .call_sites : AT(ADDR(.call_sites) - LOAD_OFFSET) {
+ __call_sites = .;
+ *(.call_sites)
+ __call_sites_end = .;
}
-
-#if defined(CONFIG_X86_64) && defined(CONFIG_SMP)
- /*
- * percpu offsets are zero-based on SMP. PERCPU_VADDR() changes the
- * output PHDR, so the next output section - .init.text - should
- * start another segment - init.
- */
- PERCPU_VADDR(0, :percpu)
-#endif
-
- INIT_TEXT_SECTION(PAGE_SIZE)
-#ifdef CONFIG_X86_64
- :init
#endif
- INIT_DATA_SECTION(16)
-
- .x86_cpu_dev.init : AT(ADDR(.x86_cpu_dev.init) - LOAD_OFFSET) {
- __x86_cpu_dev_start = .;
- *(.x86_cpu_dev.init)
- __x86_cpu_dev_end = .;
+#ifdef CONFIG_X86_KERNEL_IBT
+ . = ALIGN(8);
+ .ibt_endbr_seal : AT(ADDR(.ibt_endbr_seal) - LOAD_OFFSET) {
+ __ibt_endbr_seal = .;
+ *(.ibt_endbr_seal)
+ __ibt_endbr_seal_end = .;
}
+#endif
+#ifdef CONFIG_FINEIBT
. = ALIGN(8);
- .parainstructions : AT(ADDR(.parainstructions) - LOAD_OFFSET) {
- __parainstructions = .;
- *(.parainstructions)
- __parainstructions_end = .;
+ .cfi_sites : AT(ADDR(.cfi_sites) - LOAD_OFFSET) {
+ __cfi_sites = .;
+ *(.cfi_sites)
+ __cfi_sites_end = .;
}
+#endif
+ /*
+ * struct alt_inst entries. From the header (alternative.h):
+ * "Alternative instructions for different CPU types or capabilities"
+ * Think locking instructions on spinlocks.
+ */
. = ALIGN(8);
.altinstructions : AT(ADDR(.altinstructions) - LOAD_OFFSET) {
__alt_instructions = .;
@@ -256,13 +301,26 @@ SECTIONS
__alt_instructions_end = .;
}
+ /*
+ * And here are the replacement instructions. The linker sticks
+ * them as binary blobs. The .altinstructions has enough data to
+ * get the address and the length of them to patch the kernel safely.
+ */
.altinstr_replacement : AT(ADDR(.altinstr_replacement) - LOAD_OFFSET) {
*(.altinstr_replacement)
}
+ . = ALIGN(8);
+ .apicdrivers : AT(ADDR(.apicdrivers) - LOAD_OFFSET) {
+ __apicdrivers = .;
+ *(.apicdrivers);
+ __apicdrivers_end = .;
+ }
+
+ . = ALIGN(8);
/*
- * .exit.text is discard at runtime, not link time, to deal with
- * references from .altinstructions and .eh_frame
+ * .exit.text is discarded at runtime, not link time, to deal with
+ * references from .altinstructions
*/
.exit.text : AT(ADDR(.exit.text) - LOAD_OFFSET) {
EXIT_TEXT
@@ -272,9 +330,11 @@ SECTIONS
EXIT_DATA
}
-#if !defined(CONFIG_X86_64) || !defined(CONFIG_SMP)
- PERCPU(PAGE_SIZE)
-#endif
+ PERCPU_SECTION(L1_CACHE_BYTES)
+ ASSERT(__per_cpu_hot_end - __per_cpu_hot_start <= 64, "percpu cache hot data too large")
+
+ RUNTIME_CONST_VARIABLES
+ RUNTIME_CONST(ptr, USER_PTR_MAX)
. = ALIGN(PAGE_SIZE);
@@ -306,62 +366,171 @@ SECTIONS
.bss : AT(ADDR(.bss) - LOAD_OFFSET) {
__bss_start = .;
*(.bss..page_aligned)
- *(.bss)
- . = ALIGN(4);
+ . = ALIGN(PAGE_SIZE);
+ *(BSS_MAIN)
+ BSS_DECRYPTED
+ . = ALIGN(PAGE_SIZE);
__bss_stop = .;
}
+ /*
+ * The memory occupied from _text to here, __end_of_kernel_reserve, is
+ * automatically reserved in setup_arch(). Anything after here must be
+ * explicitly reserved using memblock_reserve() or it will be discarded
+ * and treated as available memory.
+ */
+ __end_of_kernel_reserve = .;
+
. = ALIGN(PAGE_SIZE);
.brk : AT(ADDR(.brk) - LOAD_OFFSET) {
__brk_base = .;
. += 64 * 1024; /* 64k alignment slop space */
- *(.brk_reservation) /* areas brk users have reserved */
+ *(.bss..brk) /* areas brk users have reserved */
__brk_limit = .;
}
+ . = ALIGN(PAGE_SIZE); /* keep VO_INIT_SIZE page aligned */
_end = .;
+ __pi__end = .;
- STABS_DEBUG
- DWARF_DEBUG
+#ifdef CONFIG_AMD_MEM_ENCRYPT
+ /*
+ * Early scratch/workarea section: Lives outside of the kernel proper
+ * (_text - _end).
+ *
+ * Resides after _end because even though the .brk section is after
+ * __end_of_kernel_reserve, the .brk section is later reserved as a
+ * part of the kernel. Since it is located after __end_of_kernel_reserve
+ * it will be discarded and become part of the available memory. As
+ * such, it can only be used by very early boot code and must not be
+ * needed afterwards.
+ *
+ * Currently used by SME for performing in-place encryption of the
+ * kernel during boot. Resides on a 2MB boundary to simplify the
+ * pagetable setup used for SME in-place encryption.
+ */
+ . = ALIGN(HPAGE_SIZE);
+ .init.scratch : AT(ADDR(.init.scratch) - LOAD_OFFSET) {
+ __init_scratch_begin = .;
+ *(.init.scratch)
+ . = ALIGN(HPAGE_SIZE);
+ __init_scratch_end = .;
+ }
+#endif
+
+ STABS_DEBUG
+ DWARF_DEBUG
+#ifdef CONFIG_PROPELLER_CLANG
+ .llvm_bb_addr_map : { *(.llvm_bb_addr_map) }
+#endif
+
+ ELF_DETAILS
- /* Sections to be discarded */
DISCARDS
- /DISCARD/ : { *(.eh_frame) }
-}
+ /*
+ * Make sure that the .got.plt is either completely empty or it
+ * contains only the lazy dispatch entries.
+ */
+ .got.plt (INFO) : { *(.got.plt) }
+ ASSERT(SIZEOF(.got.plt) == 0 ||
+#ifdef CONFIG_X86_64
+ SIZEOF(.got.plt) == 0x18,
+#else
+ SIZEOF(.got.plt) == 0xc,
+#endif
+ "Unexpected GOT/PLT entries detected!")
+
+ /*
+ * Sections that should stay zero sized, which is safer to
+ * explicitly check instead of blindly discarding.
+ */
+ .got : {
+ *(.got) *(.igot.*)
+ }
+ ASSERT(SIZEOF(.got) == 0, "Unexpected GOT entries detected!")
+
+ .plt : {
+ *(.plt) *(.plt.*) *(.iplt)
+ }
+ ASSERT(SIZEOF(.plt) == 0, "Unexpected run-time procedure linkages detected!")
+
+ .rel.dyn : {
+ *(.rel.*) *(.rel_*)
+ }
+ ASSERT(SIZEOF(.rel.dyn) == 0, "Unexpected run-time relocations (.rel) detected!")
+
+ .rela.dyn : {
+ *(.rela.*) *(.rela_*)
+ }
+ ASSERT(SIZEOF(.rela.dyn) == 0, "Unexpected run-time relocations (.rela) detected!")
+}
-#ifdef CONFIG_X86_32
/*
- * The ASSERT() sink to . is intentional, for binutils 2.14 compatibility:
+ * COMPILE_TEST kernels can be large - CONFIG_KASAN, for example, can cause
+ * this. Let's assume that nobody will be running a COMPILE_TEST kernel and
+ * let's assert that fuller build coverage is more valuable than being able to
+ * run a COMPILE_TEST kernel.
*/
-. = ASSERT((_end - LOAD_OFFSET <= KERNEL_IMAGE_SIZE),
- "kernel image bigger than KERNEL_IMAGE_SIZE");
-#else
+#ifndef CONFIG_COMPILE_TEST
/*
- * Per-cpu symbols which need to be offset from __per_cpu_load
- * for the boot processor.
+ * The ASSERT() sink to . is intentional, for binutils 2.14 compatibility:
*/
-#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
-INIT_PER_CPU(gdt_page);
-INIT_PER_CPU(irq_stack_union);
+. = ASSERT((_end - LOAD_OFFSET <= KERNEL_IMAGE_SIZE),
+ "kernel image bigger than KERNEL_IMAGE_SIZE");
+#endif
+
+/* needed for Clang - see arch/x86/entry/entry.S */
+PROVIDE(__ref_stack_chk_guard = __stack_chk_guard);
+
+#ifdef CONFIG_X86_64
+#ifdef CONFIG_MITIGATION_UNRET_ENTRY
+. = ASSERT((retbleed_return_thunk & 0x3f) == 0, "retbleed_return_thunk not cacheline-aligned");
+#endif
+
+#ifdef CONFIG_MITIGATION_SRSO
+. = ASSERT((srso_safe_ret & 0x3f) == 0, "srso_safe_ret not cacheline-aligned");
/*
- * Build-time check on the image size:
+ * GNU ld cannot do XOR until 2.41.
+ * https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=f6f78318fca803c4907fb8d7f6ded8295f1947b1
+ *
+ * LLVM lld cannot do XOR until lld-17.
+ * https://github.com/llvm/llvm-project/commit/fae96104d4378166cbe5c875ef8ed808a356f3fb
+ *
+ * Instead do: (A | B) - (A & B) in order to compute the XOR
+ * of the two function addresses:
*/
-. = ASSERT((_end - _text <= KERNEL_IMAGE_SIZE),
- "kernel image bigger than KERNEL_IMAGE_SIZE");
+. = ASSERT(((ABSOLUTE(srso_alias_untrain_ret) | srso_alias_safe_ret) -
+ (ABSOLUTE(srso_alias_untrain_ret) & srso_alias_safe_ret)) == ((1 << 2) | (1 << 8) | (1 << 14) | (1 << 20)),
+ "SRSO function pair won't alias");
+#endif
-#ifdef CONFIG_SMP
-. = ASSERT((irq_stack_union == 0),
- "irq_stack_union is not at start of per-cpu area");
+#if defined(CONFIG_MITIGATION_ITS) && !defined(CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B)
+. = ASSERT(__x86_indirect_its_thunk_rax & 0x20, "__x86_indirect_thunk_rax not in second half of cacheline");
+. = ASSERT(((__x86_indirect_its_thunk_rcx - __x86_indirect_its_thunk_rax) % 64) == 0, "Indirect thunks are not cacheline apart");
+. = ASSERT(__x86_indirect_its_thunk_array == __x86_indirect_its_thunk_rax, "Gap in ITS thunk array");
#endif
-#endif /* CONFIG_X86_32 */
+#if defined(CONFIG_MITIGATION_ITS) && !defined(CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B)
+. = ASSERT(its_return_thunk & 0x20, "its_return_thunk not in second half of cacheline");
+#endif
-#ifdef CONFIG_KEXEC
-#include <asm/kexec.h>
+#endif /* CONFIG_X86_64 */
-. = ASSERT(kexec_control_code_size <= KEXEC_CONTROL_CODE_MAX_SIZE,
- "kexec control code size is too big");
+/*
+ * The symbols below are referenced using relative relocations in the
+ * respective ELF notes. This produces build time constants that the
+ * linker will never mark as relocatable. (Using just ABSOLUTE() is not
+ * sufficient for that).
+ */
+#ifdef CONFIG_XEN_PV
+xen_elfnote_entry_value =
+ ABSOLUTE(xen_elfnote_entry) + ABSOLUTE(startup_xen);
+#endif
+#ifdef CONFIG_PVH
+xen_elfnote_phys32_entry_value =
+ ABSOLUTE(xen_elfnote_phys32_entry) + ABSOLUTE(pvh_start_xen - LOAD_OFFSET);
#endif
+#include "../boot/startup/exports.h"
diff --git a/arch/x86/kernel/vsmp_64.c b/arch/x86/kernel/vsmp_64.c
index a1d804bcd483..73511332bb67 100644
--- a/arch/x86/kernel/vsmp_64.c
+++ b/arch/x86/kernel/vsmp_64.c
@@ -1,11 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* vSMPowered(tm) systems specific initialization
* Copyright (C) 2005 ScaleMP Inc.
*
- * Use of this code is subject to the terms and conditions of the
- * GNU general public license version 2. See "COPYING" or
- * http://www.gnu.org/licenses/gpl.html
- *
* Ravikiran Thirumalai <kiran@scalemp.com>,
* Shai Fultheim <shai@scalemp.com>
* Paravirt ops integration: Glauber de Oliveira Costa <gcosta@redhat.com>,
@@ -15,6 +12,8 @@
#include <linux/init.h>
#include <linux/pci_ids.h>
#include <linux/pci_regs.h>
+#include <linux/smp.h>
+#include <linux/irq.h>
#include <asm/apic.h>
#include <asm/pci-direct.h>
@@ -22,65 +21,10 @@
#include <asm/paravirt.h>
#include <asm/setup.h>
-#if defined CONFIG_PCI && defined CONFIG_PARAVIRT
-/*
- * Interrupt control on vSMPowered systems:
- * ~AC is a shadow of IF. If IF is 'on' AC should be 'off'
- * and vice versa.
- */
-
-static unsigned long vsmp_save_fl(void)
-{
- unsigned long flags = native_save_fl();
-
- if (!(flags & X86_EFLAGS_IF) || (flags & X86_EFLAGS_AC))
- flags &= ~X86_EFLAGS_IF;
- return flags;
-}
-PV_CALLEE_SAVE_REGS_THUNK(vsmp_save_fl);
+#define TOPOLOGY_REGISTER_OFFSET 0x10
-static void vsmp_restore_fl(unsigned long flags)
-{
- if (flags & X86_EFLAGS_IF)
- flags &= ~X86_EFLAGS_AC;
- else
- flags |= X86_EFLAGS_AC;
- native_restore_fl(flags);
-}
-PV_CALLEE_SAVE_REGS_THUNK(vsmp_restore_fl);
-
-static void vsmp_irq_disable(void)
-{
- unsigned long flags = native_save_fl();
-
- native_restore_fl((flags & ~X86_EFLAGS_IF) | X86_EFLAGS_AC);
-}
-PV_CALLEE_SAVE_REGS_THUNK(vsmp_irq_disable);
-
-static void vsmp_irq_enable(void)
-{
- unsigned long flags = native_save_fl();
-
- native_restore_fl((flags | X86_EFLAGS_IF) & (~X86_EFLAGS_AC));
-}
-PV_CALLEE_SAVE_REGS_THUNK(vsmp_irq_enable);
-
-static unsigned __init_or_module vsmp_patch(u8 type, u16 clobbers, void *ibuf,
- unsigned long addr, unsigned len)
-{
- switch (type) {
- case PARAVIRT_PATCH(pv_irq_ops.irq_enable):
- case PARAVIRT_PATCH(pv_irq_ops.irq_disable):
- case PARAVIRT_PATCH(pv_irq_ops.save_fl):
- case PARAVIRT_PATCH(pv_irq_ops.restore_fl):
- return paravirt_patch_default(type, clobbers, ibuf, addr, len);
- default:
- return native_patch(type, clobbers, ibuf, addr, len);
- }
-
-}
-
-static void __init set_vsmp_pv_ops(void)
+#ifdef CONFIG_PCI
+static void __init set_vsmp_ctl(void)
{
void __iomem *address;
unsigned int cap, ctl, cfg;
@@ -92,29 +36,25 @@ static void __init set_vsmp_pv_ops(void)
ctl = readl(address + 4);
printk(KERN_INFO "vSMP CTL: capabilities:0x%08x control:0x%08x\n",
cap, ctl);
- if (cap & ctl & (1 << 4)) {
- /* Setup irq ops and turn on vSMP IRQ fastpath handling */
- pv_irq_ops.irq_disable = PV_CALLEE_SAVE(vsmp_irq_disable);
- pv_irq_ops.irq_enable = PV_CALLEE_SAVE(vsmp_irq_enable);
- pv_irq_ops.save_fl = PV_CALLEE_SAVE(vsmp_save_fl);
- pv_irq_ops.restore_fl = PV_CALLEE_SAVE(vsmp_restore_fl);
- pv_init_ops.patch = vsmp_patch;
-
- ctl &= ~(1 << 4);
- writel(ctl, address + 4);
- ctl = readl(address + 4);
- printk(KERN_INFO "vSMP CTL: control set to:0x%08x\n", ctl);
+
+ /* If possible, let the vSMP foundation route the interrupt optimally */
+#ifdef CONFIG_SMP
+ if (cap & ctl & BIT(8)) {
+ ctl &= ~BIT(8);
+
+#ifdef CONFIG_PROC_FS
+ /* Don't let users change irq affinity via procfs */
+ no_irq_affinity = 1;
+#endif
}
+#endif
+
+ writel(ctl, address + 4);
+ ctl = readl(address + 4);
+ pr_info("vSMP CTL: control set to:0x%08x\n", ctl);
early_iounmap(address, 8);
}
-#else
-static void __init set_vsmp_pv_ops(void)
-{
-}
-#endif
-
-#ifdef CONFIG_PCI
static int is_vsmp = -1;
static void __init detect_vsmp_box(void)
@@ -130,7 +70,7 @@ static void __init detect_vsmp_box(void)
is_vsmp = 1;
}
-int is_vsmp_box(void)
+static int is_vsmp_box(void)
{
if (is_vsmp != -1)
return is_vsmp;
@@ -144,17 +84,57 @@ int is_vsmp_box(void)
static void __init detect_vsmp_box(void)
{
}
-int is_vsmp_box(void)
+static int is_vsmp_box(void)
{
return 0;
}
+static void __init set_vsmp_ctl(void)
+{
+}
#endif
+
+static void __init vsmp_cap_cpus(void)
+{
+#if !defined(CONFIG_X86_VSMP) && defined(CONFIG_SMP) && defined(CONFIG_PCI)
+ void __iomem *address;
+ unsigned int cfg, topology, node_shift, maxcpus;
+
+ /*
+ * CONFIG_X86_VSMP is not configured, so limit the number CPUs to the
+ * ones present in the first board, unless explicitly overridden by
+ * setup_max_cpus
+ */
+ if (setup_max_cpus != NR_CPUS)
+ return;
+
+ /* Read the vSMP Foundation topology register */
+ cfg = read_pci_config(0, 0x1f, 0, PCI_BASE_ADDRESS_0);
+ address = early_ioremap(cfg + TOPOLOGY_REGISTER_OFFSET, 4);
+ if (WARN_ON(!address))
+ return;
+
+ topology = readl(address);
+ node_shift = (topology >> 16) & 0x7;
+ if (!node_shift)
+ /* The value 0 should be decoded as 8 */
+ node_shift = 8;
+ maxcpus = (topology & ((1 << node_shift) - 1)) + 1;
+
+ pr_info("vSMP CTL: Capping CPUs to %d (CONFIG_X86_VSMP is unset)\n",
+ maxcpus);
+ setup_max_cpus = maxcpus;
+ early_iounmap(address, 4);
+#endif
+}
+
void __init vsmp_init(void)
{
detect_vsmp_box();
if (!is_vsmp_box())
return;
- set_vsmp_pv_ops();
+ vsmp_cap_cpus();
+
+ set_vsmp_ctl();
return;
}
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
deleted file mode 100644
index 1c0c6ab9c60f..000000000000
--- a/arch/x86/kernel/vsyscall_64.c
+++ /dev/null
@@ -1,309 +0,0 @@
-/*
- * Copyright (C) 2001 Andrea Arcangeli <andrea@suse.de> SuSE
- * Copyright 2003 Andi Kleen, SuSE Labs.
- *
- * Thanks to hpa@transmeta.com for some useful hints.
- * Special thanks to Ingo Molnar for his early experience with
- * a different vsyscall implementation for Linux/IA32 and for the name.
- *
- * vsyscall 1 is located at -10Mbyte, vsyscall 2 is located
- * at virtual address -10Mbyte+1024bytes etc... There are at most 4
- * vsyscalls. One vsyscall can reserve more than 1 slot to avoid
- * jumping out of line if necessary. We cannot add more with this
- * mechanism because older kernels won't return -ENOSYS.
- * If we want more than four we need a vDSO.
- *
- * Note: the concept clashes with user mode linux. If you use UML and
- * want per guest time just set the kernel.vsyscall64 sysctl to 0.
- */
-
-/* Disable profiling for userspace code: */
-#define DISABLE_BRANCH_PROFILING
-
-#include <linux/time.h>
-#include <linux/init.h>
-#include <linux/kernel.h>
-#include <linux/timer.h>
-#include <linux/seqlock.h>
-#include <linux/jiffies.h>
-#include <linux/sysctl.h>
-#include <linux/clocksource.h>
-#include <linux/getcpu.h>
-#include <linux/cpu.h>
-#include <linux/smp.h>
-#include <linux/notifier.h>
-
-#include <asm/vsyscall.h>
-#include <asm/pgtable.h>
-#include <asm/page.h>
-#include <asm/unistd.h>
-#include <asm/fixmap.h>
-#include <asm/errno.h>
-#include <asm/io.h>
-#include <asm/segment.h>
-#include <asm/desc.h>
-#include <asm/topology.h>
-#include <asm/vgtod.h>
-
-#define __vsyscall(nr) \
- __attribute__ ((unused, __section__(".vsyscall_" #nr))) notrace
-#define __syscall_clobber "r11","cx","memory"
-
-/*
- * vsyscall_gtod_data contains data that is :
- * - readonly from vsyscalls
- * - written by timer interrupt or sysctl (/proc/sys/kernel/vsyscall64)
- * Try to keep this structure as small as possible to avoid cache-line ping-pong
- */
-int __vgetcpu_mode __section_vgetcpu_mode;
-
-struct vsyscall_gtod_data __vsyscall_gtod_data __section_vsyscall_gtod_data =
-{
- .lock = SEQLOCK_UNLOCKED,
- .sysctl_enabled = 1,
-};
-
-void update_vsyscall_tz(void)
-{
- unsigned long flags;
-
- write_seqlock_irqsave(&vsyscall_gtod_data.lock, flags);
- /* sys_tz has changed */
- vsyscall_gtod_data.sys_tz = sys_tz;
- write_sequnlock_irqrestore(&vsyscall_gtod_data.lock, flags);
-}
-
-void update_vsyscall(struct timespec *wall_time, struct clocksource *clock,
- u32 mult)
-{
- unsigned long flags;
-
- write_seqlock_irqsave(&vsyscall_gtod_data.lock, flags);
- /* copy vsyscall data */
- vsyscall_gtod_data.clock.vread = clock->vread;
- vsyscall_gtod_data.clock.cycle_last = clock->cycle_last;
- vsyscall_gtod_data.clock.mask = clock->mask;
- vsyscall_gtod_data.clock.mult = mult;
- vsyscall_gtod_data.clock.shift = clock->shift;
- vsyscall_gtod_data.wall_time_sec = wall_time->tv_sec;
- vsyscall_gtod_data.wall_time_nsec = wall_time->tv_nsec;
- vsyscall_gtod_data.wall_to_monotonic = wall_to_monotonic;
- vsyscall_gtod_data.wall_time_coarse = __current_kernel_time();
- write_sequnlock_irqrestore(&vsyscall_gtod_data.lock, flags);
-}
-
-/* RED-PEN may want to re-add seq locking, but then the variable should be
- * write-once.
- */
-static __always_inline void do_get_tz(struct timezone * tz)
-{
- *tz = __vsyscall_gtod_data.sys_tz;
-}
-
-static __always_inline int gettimeofday(struct timeval *tv, struct timezone *tz)
-{
- int ret;
- asm volatile("syscall"
- : "=a" (ret)
- : "0" (__NR_gettimeofday),"D" (tv),"S" (tz)
- : __syscall_clobber );
- return ret;
-}
-
-static __always_inline long time_syscall(long *t)
-{
- long secs;
- asm volatile("syscall"
- : "=a" (secs)
- : "0" (__NR_time),"D" (t) : __syscall_clobber);
- return secs;
-}
-
-static __always_inline void do_vgettimeofday(struct timeval * tv)
-{
- cycle_t now, base, mask, cycle_delta;
- unsigned seq;
- unsigned long mult, shift, nsec;
- cycle_t (*vread)(void);
- do {
- seq = read_seqbegin(&__vsyscall_gtod_data.lock);
-
- vread = __vsyscall_gtod_data.clock.vread;
- if (unlikely(!__vsyscall_gtod_data.sysctl_enabled || !vread)) {
- gettimeofday(tv,NULL);
- return;
- }
-
- now = vread();
- base = __vsyscall_gtod_data.clock.cycle_last;
- mask = __vsyscall_gtod_data.clock.mask;
- mult = __vsyscall_gtod_data.clock.mult;
- shift = __vsyscall_gtod_data.clock.shift;
-
- tv->tv_sec = __vsyscall_gtod_data.wall_time_sec;
- nsec = __vsyscall_gtod_data.wall_time_nsec;
- } while (read_seqretry(&__vsyscall_gtod_data.lock, seq));
-
- /* calculate interval: */
- cycle_delta = (now - base) & mask;
- /* convert to nsecs: */
- nsec += (cycle_delta * mult) >> shift;
-
- while (nsec >= NSEC_PER_SEC) {
- tv->tv_sec += 1;
- nsec -= NSEC_PER_SEC;
- }
- tv->tv_usec = nsec / NSEC_PER_USEC;
-}
-
-int __vsyscall(0) vgettimeofday(struct timeval * tv, struct timezone * tz)
-{
- if (tv)
- do_vgettimeofday(tv);
- if (tz)
- do_get_tz(tz);
- return 0;
-}
-
-/* This will break when the xtime seconds get inaccurate, but that is
- * unlikely */
-time_t __vsyscall(1) vtime(time_t *t)
-{
- struct timeval tv;
- time_t result;
- if (unlikely(!__vsyscall_gtod_data.sysctl_enabled))
- return time_syscall(t);
-
- vgettimeofday(&tv, NULL);
- result = tv.tv_sec;
- if (t)
- *t = result;
- return result;
-}
-
-/* Fast way to get the current CPU and node.
-   This helps to do per-node and per-CPU caches in user space.
-   The result is not guaranteed without CPU affinity, but usually
-   works out because the scheduler tries to keep a thread on the same
-   CPU.
-
-   tcache must point to a two-element long array.
-   All arguments can be NULL. */
-long __vsyscall(2)
-vgetcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *tcache)
-{
- unsigned int p;
- unsigned long j = 0;
-
-	/* Fast cache - only recompute the value once per jiffy and avoid
-	   the relatively costly rdtscp/cpuid otherwise.
-	   This works because the scheduler usually keeps the process
-	   on the same CPU and this syscall doesn't guarantee its
-	   results anyway.
-	   We do this here because otherwise user space would do it on
-	   its own in a likely inferior way (no access to jiffies).
-	   If you don't like it, pass NULL. */
- if (tcache && tcache->blob[0] == (j = __jiffies)) {
- p = tcache->blob[1];
- } else if (__vgetcpu_mode == VGETCPU_RDTSCP) {
- /* Load per CPU data from RDTSCP */
- native_read_tscp(&p);
- } else {
- /* Load per CPU data from GDT */
- asm("lsl %1,%0" : "=r" (p) : "r" (__PER_CPU_SEG));
- }
- if (tcache) {
- tcache->blob[0] = j;
- tcache->blob[1] = p;
- }
- if (cpu)
- *cpu = p & 0xfff;
- if (node)
- *node = p >> 12;
- return 0;
-}
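
On the consuming side, userspace splits the packed word back into CPU and node exactly as the tail of vgetcpu() does. A hedged sketch; the struct mirrors the two-long blob described above, and its name is hypothetical:

#include <stdio.h>

/* Hypothetical mirror of the two-long cache described above:
 * blob[0] holds the jiffies stamp, blob[1] the packed cpu/node word. */
struct getcpu_cache_demo {
	unsigned long blob[2];
};

static void unpack(unsigned int p, unsigned int *cpu, unsigned int *node)
{
	*cpu  = p & 0xfff;	/* low 12 bits: CPU number  */
	*node = p >> 12;	/* remaining bits: NUMA node */
}

int main(void)
{
	struct getcpu_cache_demo cache = { .blob = { 42, (3u << 12) | 5u } };
	unsigned int cpu, node;

	unpack((unsigned int)cache.blob[1], &cpu, &node);
	printf("cpu=%u node=%u\n", cpu, node);	/* cpu=5 node=3 */
	return 0;
}
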
-
-static long __vsyscall(3) venosys_1(void)
-{
- return -ENOSYS;
-}
-
-#ifdef CONFIG_SYSCTL
-static ctl_table kernel_table2[] = {
- { .procname = "vsyscall64",
- .data = &vsyscall_gtod_data.sysctl_enabled, .maxlen = sizeof(int),
- .mode = 0644,
- .proc_handler = proc_dointvec },
- {}
-};
-
-static ctl_table kernel_root_table2[] = {
- { .procname = "kernel", .mode = 0555,
- .child = kernel_table2 },
- {}
-};
-#endif
-
-/* Assume __initcall executes before all user space. Hopefully kmod
- doesn't violate that. We'll find out if it does. */
-static void __cpuinit vsyscall_set_cpu(int cpu)
-{
- unsigned long d;
- unsigned long node = 0;
-#ifdef CONFIG_NUMA
- node = cpu_to_node(cpu);
-#endif
- if (cpu_has(&cpu_data(cpu), X86_FEATURE_RDTSCP))
- write_rdtscp_aux((node << 12) | cpu);
-
- /* Store cpu number in limit so that it can be loaded quickly
- in user space in vgetcpu.
- 12 bits for the CPU and 8 bits for the node. */
- d = 0x0f40000000000ULL;
- d |= cpu;
- d |= (node & 0xf) << 12;
- d |= (node >> 4) << 48;
- write_gdt_entry(get_cpu_gdt_table(cpu), GDT_ENTRY_PER_CPU, &d, DESCTYPE_S);
-}
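
The magic constant written into the GDT entry here hides the segment-limit encoding: a descriptor's 20-bit limit lives in bits 0-15 and 48-51 of the entry, and the lsl in vgetcpu() reassembles it. A round-trip sketch with illustrative values:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned int cpu = 5, node = 19;	/* node > 15 exercises the split */
	uint64_t d = 0x0f40000000000ULL;	/* type/base bits of the entry   */
	unsigned int p;

	d |= cpu;				/* limit[11:0]  = cpu            */
	d |= (uint64_t)(node & 0xf) << 12;	/* limit[15:12] = node low bits  */
	d |= (uint64_t)(node >> 4) << 48;	/* limit[19:16] = node high bits */

	/* What LSL hands back: limit[15:0] | limit[19:16] << 16. */
	p = (unsigned int)(d & 0xffff) |
	    (unsigned int)((d >> 48) & 0xf) << 16;
	printf("cpu=%u node=%u\n", p & 0xfff, p >> 12);	/* cpu=5 node=19 */
	return 0;
}
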
-
-static void __cpuinit cpu_vsyscall_init(void *arg)
-{
- /* preemption should be already off */
- vsyscall_set_cpu(raw_smp_processor_id());
-}
-
-static int __cpuinit
-cpu_vsyscall_notifier(struct notifier_block *n, unsigned long action, void *arg)
-{
- long cpu = (long)arg;
- if (action == CPU_ONLINE || action == CPU_ONLINE_FROZEN)
- smp_call_function_single(cpu, cpu_vsyscall_init, NULL, 1);
- return NOTIFY_DONE;
-}
-
-void __init map_vsyscall(void)
-{
- extern char __vsyscall_0;
- unsigned long physaddr_page0 = __pa_symbol(&__vsyscall_0);
-
- /* Note that VSYSCALL_MAPPED_PAGES must agree with the code below. */
- __set_fixmap(VSYSCALL_FIRST_PAGE, physaddr_page0, PAGE_KERNEL_VSYSCALL);
-}
-
-static int __init vsyscall_init(void)
-{
- BUG_ON(((unsigned long) &vgettimeofday !=
- VSYSCALL_ADDR(__NR_vgettimeofday)));
- BUG_ON((unsigned long) &vtime != VSYSCALL_ADDR(__NR_vtime));
- BUG_ON((VSYSCALL_ADDR(0) != __fix_to_virt(VSYSCALL_FIRST_PAGE)));
- BUG_ON((unsigned long) &vgetcpu != VSYSCALL_ADDR(__NR_vgetcpu));
-#ifdef CONFIG_SYSCTL
- register_sysctl_table(kernel_root_table2);
-#endif
- on_each_cpu(cpu_vsyscall_init, NULL, 1);
- /* notifier priority > KVM */
- hotcpu_notifier(cpu_vsyscall_notifier, 30);
- return 0;
-}
-
-__initcall(vsyscall_init);
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
deleted file mode 100644
index 1b950d151e58..000000000000
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ /dev/null
@@ -1,59 +0,0 @@
-/* Exports for assembly files.
- All C exports should go in the respective C files. */
-
-#include <linux/module.h>
-#include <linux/smp.h>
-
-#include <net/checksum.h>
-
-#include <asm/processor.h>
-#include <asm/pgtable.h>
-#include <asm/uaccess.h>
-#include <asm/desc.h>
-#include <asm/ftrace.h>
-
-#ifdef CONFIG_FUNCTION_TRACER
-/* mcount is defined in assembly */
-EXPORT_SYMBOL(mcount);
-#endif
-
-EXPORT_SYMBOL(__get_user_1);
-EXPORT_SYMBOL(__get_user_2);
-EXPORT_SYMBOL(__get_user_4);
-EXPORT_SYMBOL(__get_user_8);
-EXPORT_SYMBOL(__put_user_1);
-EXPORT_SYMBOL(__put_user_2);
-EXPORT_SYMBOL(__put_user_4);
-EXPORT_SYMBOL(__put_user_8);
-
-EXPORT_SYMBOL(copy_user_generic_string);
-EXPORT_SYMBOL(copy_user_generic_unrolled);
-EXPORT_SYMBOL(__copy_user_nocache);
-EXPORT_SYMBOL(_copy_from_user);
-EXPORT_SYMBOL(_copy_to_user);
-
-EXPORT_SYMBOL(copy_page);
-EXPORT_SYMBOL(clear_page);
-
-EXPORT_SYMBOL(csum_partial);
-
-/*
- * Export string functions. We normally rely on gcc builtins for most of these,
- * but gcc sometimes decides not to inline them.
- */
-#undef memcpy
-#undef memset
-#undef memmove
-
-extern void *memset(void *, int, __kernel_size_t);
-extern void *memcpy(void *, const void *, __kernel_size_t);
-extern void *__memcpy(void *, const void *, __kernel_size_t);
-
-EXPORT_SYMBOL(memset);
-EXPORT_SYMBOL(memcpy);
-EXPORT_SYMBOL(__memcpy);
-
-EXPORT_SYMBOL(empty_zero_page);
-#ifndef CONFIG_PARAVIRT
-EXPORT_SYMBOL(native_load_gs_index);
-#endif
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index cd6da6bf3eca..0a2bbd674a6d 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -3,28 +3,60 @@
*
 * For licensing details see kernel-base/COPYING
*/
+#include <linux/dmi.h>
#include <linux/init.h>
#include <linux/ioport.h>
-#include <linux/module.h>
+#include <linux/export.h>
+#include <linux/pci.h>
+#include <linux/acpi.h>
+#include <asm/acpi.h>
#include <asm/bios_ebda.h>
#include <asm/paravirt.h>
#include <asm/pci_x86.h>
#include <asm/mpspec.h>
#include <asm/setup.h>
#include <asm/apic.h>
-#include <asm/e820.h>
+#include <asm/e820/api.h>
#include <asm/time.h>
#include <asm/irq.h>
-#include <asm/pat.h>
+#include <asm/io_apic.h>
+#include <asm/hpet.h>
+#include <asm/memtype.h>
#include <asm/tsc.h>
#include <asm/iommu.h>
+#include <asm/mach_traps.h>
+#include <asm/irqdomain.h>
+#include <asm/realmode.h>
-void __cpuinit x86_init_noop(void) { }
+void x86_init_noop(void) { }
void __init x86_init_uint_noop(unsigned int unused) { }
-void __init x86_init_pgd_noop(pgd_t *unused) { }
-int __init iommu_init_noop(void) { return 0; }
-void iommu_shutdown_noop(void) { }
+static int __init iommu_init_noop(void) { return 0; }
+static void iommu_shutdown_noop(void) { }
+bool __init bool_x86_init_noop(void) { return false; }
+void x86_op_int_noop(int cpu) { }
+int set_rtc_noop(const struct timespec64 *now) { return -EINVAL; }
+void get_rtc_noop(struct timespec64 *now) { }
+
+static __initconst const struct of_device_id of_cmos_match[] = {
+ { .compatible = "motorola,mc146818" },
+ {}
+};
+
+/*
+ * Allow devicetree configured systems to disable the RTC by setting the
+ * corresponding DT node's status property to disabled. Code is optimized
+ * out for CONFIG_OF=n builds.
+ */
+static __init void x86_wallclock_init(void)
+{
+ struct device_node *node = of_find_matching_node(NULL, of_cmos_match);
+
+ if (node && !of_device_is_available(node)) {
+ x86_platform.get_wallclock = get_rtc_noop;
+ x86_platform.set_wallclock = set_rtc_noop;
+ }
+}
/*
* The platform setup functions are preset with the default functions
@@ -33,25 +65,25 @@ void iommu_shutdown_noop(void) { }
struct x86_init_ops x86_init __initdata = {
.resources = {
- .probe_roms = x86_init_noop,
+ .probe_roms = probe_roms,
.reserve_resources = reserve_standard_io_resources,
- .memory_setup = default_machine_specific_memory_setup,
+ .memory_setup = e820__memory_setup_default,
+ .dmi_setup = dmi_setup,
},
.mpparse = {
- .mpc_record = x86_init_uint_noop,
.setup_ioapic_ids = x86_init_noop,
- .mpc_apic_id = default_mpc_apic_id,
- .smp_read_mpc_oem = default_smp_read_mpc_oem,
- .mpc_oem_bus_info = default_mpc_oem_bus_info,
- .find_smp_config = default_find_smp_config,
- .get_smp_config = default_get_smp_config,
+ .find_mptable = mpparse_find_mptable,
+ .early_parse_smp_cfg = mpparse_parse_early_smp_config,
+ .parse_smp_cfg = mpparse_parse_smp_config,
},
.irqs = {
.pre_vector_init = init_ISA_irqs,
.intr_init = native_init_IRQ,
- .trap_init = x86_init_noop,
+ .intr_mode_select = apic_intr_mode_select,
+ .intr_mode_init = apic_intr_mode_init,
+ .create_pci_msi_domain = native_create_pci_msi_domain,
},
.oem = {
@@ -60,14 +92,13 @@ struct x86_init_ops x86_init __initdata = {
},
.paging = {
- .pagetable_setup_start = native_pagetable_setup_start,
- .pagetable_setup_done = native_pagetable_setup_done,
+ .pagetable_init = native_pagetable_init,
},
.timers = {
.setup_percpu_clockev = setup_boot_APIC_clock,
- .tsc_pre_init = x86_init_noop,
.timer_init = hpet_time_init,
+ .wallclock_init = x86_wallclock_init,
},
.iommu = {
@@ -79,23 +110,68 @@ struct x86_init_ops x86_init __initdata = {
.init_irq = x86_default_pci_init_irq,
.fixup_irqs = x86_default_pci_fixup_irqs,
},
+
+ .hyper = {
+ .init_platform = x86_init_noop,
+ .guest_late_init = x86_init_noop,
+ .x2apic_available = bool_x86_init_noop,
+ .msi_ext_dest_id = bool_x86_init_noop,
+ .init_mem_mapping = x86_init_noop,
+ .init_after_bootmem = x86_init_noop,
+ },
+
+ .acpi = {
+ .set_root_pointer = x86_default_set_root_pointer,
+ .get_root_pointer = x86_default_get_root_pointer,
+ .reduced_hw_early_init = acpi_generic_reduced_hw_init,
+ },
};
-struct x86_cpuinit_ops x86_cpuinit __cpuinitdata = {
+struct x86_cpuinit_ops x86_cpuinit = {
+ .early_percpu_clock_init = x86_init_noop,
.setup_percpu_clockev = setup_secondary_APIC_clock,
+ .parallel_bringup = true,
};
static void default_nmi_init(void) { };
-static int default_i8042_detect(void) { return 1; };
-struct x86_platform_ops x86_platform = {
+static int enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
+static int enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
+static bool enc_tlb_flush_required_noop(bool enc) { return false; }
+static bool enc_cache_flush_required_noop(void) { return false; }
+static void enc_kexec_begin_noop(void) {}
+static void enc_kexec_finish_noop(void) {}
+static bool is_private_mmio_noop(u64 addr) {return false; }
+
+struct x86_platform_ops x86_platform __ro_after_init = {
+ .calibrate_cpu = native_calibrate_cpu_early,
.calibrate_tsc = native_calibrate_tsc,
.get_wallclock = mach_get_cmos_time,
- .set_wallclock = mach_set_rtc_mmss,
+ .set_wallclock = mach_set_cmos_time,
.iommu_shutdown = iommu_shutdown_noop,
.is_untracked_pat_range = is_ISA_range,
.nmi_init = default_nmi_init,
- .i8042_detect = default_i8042_detect
+ .get_nmi_reason = default_get_nmi_reason,
+ .save_sched_clock_state = tsc_save_sched_clock_state,
+ .restore_sched_clock_state = tsc_restore_sched_clock_state,
+ .realmode_reserve = reserve_real_mode,
+ .realmode_init = init_real_mode,
+ .hyper.pin_vcpu = x86_op_int_noop,
+ .hyper.is_private_mmio = is_private_mmio_noop,
+
+ .guest = {
+ .enc_status_change_prepare = enc_status_change_prepare_noop,
+ .enc_status_change_finish = enc_status_change_finish_noop,
+ .enc_tlb_flush_required = enc_tlb_flush_required_noop,
+ .enc_cache_flush_required = enc_cache_flush_required_noop,
+ .enc_kexec_begin = enc_kexec_begin_noop,
+ .enc_kexec_finish = enc_kexec_finish_noop,
+ },
};
EXPORT_SYMBOL_GPL(x86_platform);
+
+struct x86_apic_ops x86_apic_ops __ro_after_init = {
+ .io_apic_read = native_io_apic_read,
+ .restore = native_restore_boot_irq_mode,
+};
diff --git a/arch/x86/kernel/xsave.c b/arch/x86/kernel/xsave.c
deleted file mode 100644
index 37e68fc5e24a..000000000000
--- a/arch/x86/kernel/xsave.c
+++ /dev/null
@@ -1,348 +0,0 @@
-/*
- * xsave/xrstor support.
- *
- * Author: Suresh Siddha <suresh.b.siddha@intel.com>
- */
-#include <linux/bootmem.h>
-#include <linux/compat.h>
-#include <asm/i387.h>
-#ifdef CONFIG_IA32_EMULATION
-#include <asm/sigcontext32.h>
-#endif
-#include <asm/xcr.h>
-
-/*
- * Supported feature mask by the CPU and the kernel.
- */
-u64 pcntxt_mask;
-
-struct _fpx_sw_bytes fx_sw_reserved;
-#ifdef CONFIG_IA32_EMULATION
-struct _fpx_sw_bytes fx_sw_reserved_ia32;
-#endif
-
-/*
- * Check for the presence of extended state information in the
- * user fpstate pointer in the sigcontext.
- */
-int check_for_xstate(struct i387_fxsave_struct __user *buf,
- void __user *fpstate,
- struct _fpx_sw_bytes *fx_sw_user)
-{
- int min_xstate_size = sizeof(struct i387_fxsave_struct) +
- sizeof(struct xsave_hdr_struct);
- unsigned int magic2;
- int err;
-
- err = __copy_from_user(fx_sw_user, &buf->sw_reserved[0],
- sizeof(struct _fpx_sw_bytes));
-
- if (err)
- return err;
-
- /*
- * First Magic check failed.
- */
- if (fx_sw_user->magic1 != FP_XSTATE_MAGIC1)
- return -1;
-
- /*
- * Check for error scenarios.
- */
- if (fx_sw_user->xstate_size < min_xstate_size ||
- fx_sw_user->xstate_size > xstate_size ||
- fx_sw_user->xstate_size > fx_sw_user->extended_size)
- return -1;
-
- err = __get_user(magic2, (__u32 *) (((void *)fpstate) +
- fx_sw_user->extended_size -
- FP_XSTATE_MAGIC2_SIZE));
- /*
- * Check for the presence of second magic word at the end of memory
- * layout. This detects the case where the user just copied the legacy
-	 * fpstate layout without copying the extended state information
- * in the memory layout.
- */
- if (err || magic2 != FP_XSTATE_MAGIC2)
- return -1;
-
- return 0;
-}
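
The two magic words bracket the extended region: magic1 sits in the sw_reserved bytes of the legacy fxsave area, magic2 at the very end of the extended layout. A sketch of the size arithmetic validated above, with illustrative rather than ABI-exact sizes:

#include <stdio.h>

#define FP_XSTATE_MAGIC2_SIZE 4u	/* trailing magic word */

int main(void)
{
	/* Hypothetical sizes standing in for the runtime values used
	 * by check_for_xstate() and prepare_fx_sw_frame(). */
	unsigned int fxsave_size = 512;		/* i387_fxsave_struct */
	unsigned int xsave_hdr_size = 64;	/* xsave_hdr_struct   */
	unsigned int xstate_size = 832;		/* e.g. FP+SSE+AVX    */

	unsigned int min_xstate_size = fxsave_size + xsave_hdr_size;
	/* extended_size covers everything up to and including magic2. */
	unsigned int extended_size = xstate_size + FP_XSTATE_MAGIC2_SIZE;

	printf("valid xstate_size range: [%u, %u], magic2 at offset %u\n",
	       min_xstate_size, xstate_size,
	       extended_size - FP_XSTATE_MAGIC2_SIZE);
	return 0;
}
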
-
-#ifdef CONFIG_X86_64
-/*
- * Signal frame handlers.
- */
-
-int save_i387_xstate(void __user *buf)
-{
- struct task_struct *tsk = current;
- int err = 0;
-
- if (!access_ok(VERIFY_WRITE, buf, sig_xstate_size))
- return -EACCES;
-
- BUG_ON(sig_xstate_size < xstate_size);
-
- if ((unsigned long)buf % 64)
- printk("save_i387_xstate: bad fpstate %p\n", buf);
-
- if (!used_math())
- return 0;
-
- if (task_thread_info(tsk)->status & TS_USEDFPU) {
- /*
- * Start with clearing the user buffer. This will present a
- * clean context for the bytes not touched by the fxsave/xsave.
- */
- err = __clear_user(buf, sig_xstate_size);
- if (err)
- return err;
-
- if (use_xsave())
- err = xsave_user(buf);
- else
- err = fxsave_user(buf);
-
- if (err)
- return err;
- task_thread_info(tsk)->status &= ~TS_USEDFPU;
- stts();
- } else {
- if (__copy_to_user(buf, &tsk->thread.fpu.state->fxsave,
- xstate_size))
- return -1;
- }
-
- clear_used_math(); /* trigger finit */
-
- if (use_xsave()) {
- struct _fpstate __user *fx = buf;
- struct _xstate __user *x = buf;
- u64 xstate_bv;
-
- err = __copy_to_user(&fx->sw_reserved, &fx_sw_reserved,
- sizeof(struct _fpx_sw_bytes));
-
- err |= __put_user(FP_XSTATE_MAGIC2,
- (__u32 __user *) (buf + sig_xstate_size
- - FP_XSTATE_MAGIC2_SIZE));
-
- /*
- * Read the xstate_bv which we copied (directly from the cpu or
- * from the state in task struct) to the user buffers and
- * set the FP/SSE bits.
- */
- err |= __get_user(xstate_bv, &x->xstate_hdr.xstate_bv);
-
- /*
-		 * For legacy compatibility, we always set the FP/SSE bits in
-		 * the bit vector while saving the state to the user context.
-		 * This lets us capture any changes (during sigreturn) to the
-		 * FP/SSE bits made by legacy applications which don't touch
-		 * xstate_bv in the xsave header.
-		 *
-		 * xsave-aware apps can change the xstate_bv in the xsave header
-		 * as well as change any contents in the memory layout. xrstor
-		 * as part of sigreturn will capture all the changes.
- */
- xstate_bv |= XSTATE_FPSSE;
-
- err |= __put_user(xstate_bv, &x->xstate_hdr.xstate_bv);
-
- if (err)
- return err;
- }
-
- return 1;
-}
-
-/*
- * Restore the extended state if present. Otherwise, restore the FP/SSE
- * state.
- */
-static int restore_user_xstate(void __user *buf)
-{
- struct _fpx_sw_bytes fx_sw_user;
- u64 mask;
- int err;
-
- if (((unsigned long)buf % 64) ||
- check_for_xstate(buf, buf, &fx_sw_user))
- goto fx_only;
-
- mask = fx_sw_user.xstate_bv;
-
- /*
- * restore the state passed by the user.
- */
- err = xrestore_user(buf, mask);
- if (err)
- return err;
-
- /*
- * init the state skipped by the user.
- */
- mask = pcntxt_mask & ~mask;
-
- xrstor_state(init_xstate_buf, mask);
-
- return 0;
-
-fx_only:
- /*
- * couldn't find the extended state information in the
- * memory layout. Restore just the FP/SSE and init all
- * the other extended state.
- */
- xrstor_state(init_xstate_buf, pcntxt_mask & ~XSTATE_FPSSE);
- return fxrstor_checking((__force struct i387_fxsave_struct *)buf);
-}
-
-/*
- * This restores directly out of user space. Exceptions are handled.
- */
-int restore_i387_xstate(void __user *buf)
-{
- struct task_struct *tsk = current;
- int err = 0;
-
- if (!buf) {
- if (used_math())
- goto clear;
- return 0;
- } else
- if (!access_ok(VERIFY_READ, buf, sig_xstate_size))
- return -EACCES;
-
- if (!used_math()) {
- err = init_fpu(tsk);
- if (err)
- return err;
- }
-
- if (!(task_thread_info(current)->status & TS_USEDFPU)) {
- clts();
- task_thread_info(current)->status |= TS_USEDFPU;
- }
- if (use_xsave())
- err = restore_user_xstate(buf);
- else
- err = fxrstor_checking((__force struct i387_fxsave_struct *)
- buf);
- if (unlikely(err)) {
- /*
- * Encountered an error while doing the restore from the
- * user buffer, clear the fpu state.
- */
-clear:
- clear_fpu(tsk);
- clear_used_math();
- }
- return err;
-}
-#endif
-
-/*
- * Prepare the SW reserved portion of the fxsave memory layout, indicating
- * the presence of the extended state information in the memory layout
- * pointed to by the fpstate pointer in the sigcontext.
- * This will be saved whenever the FP and extended state context is
- * saved on the user stack during signal handler delivery to the user.
- */
-static void prepare_fx_sw_frame(void)
-{
- int size_extended = (xstate_size - sizeof(struct i387_fxsave_struct)) +
- FP_XSTATE_MAGIC2_SIZE;
-
- sig_xstate_size = sizeof(struct _fpstate) + size_extended;
-
-#ifdef CONFIG_IA32_EMULATION
- sig_xstate_ia32_size = sizeof(struct _fpstate_ia32) + size_extended;
-#endif
-
- memset(&fx_sw_reserved, 0, sizeof(fx_sw_reserved));
-
- fx_sw_reserved.magic1 = FP_XSTATE_MAGIC1;
- fx_sw_reserved.extended_size = sig_xstate_size;
- fx_sw_reserved.xstate_bv = pcntxt_mask;
- fx_sw_reserved.xstate_size = xstate_size;
-#ifdef CONFIG_IA32_EMULATION
- memcpy(&fx_sw_reserved_ia32, &fx_sw_reserved,
- sizeof(struct _fpx_sw_bytes));
- fx_sw_reserved_ia32.extended_size = sig_xstate_ia32_size;
-#endif
-}
-
-/*
- * Represents init state for the supported extended state.
- */
-struct xsave_struct *init_xstate_buf;
-
-#ifdef CONFIG_X86_64
-unsigned int sig_xstate_size = sizeof(struct _fpstate);
-#endif
-
-/*
- * Enable the extended processor state save/restore feature
- */
-void __cpuinit xsave_init(void)
-{
- if (!cpu_has_xsave)
- return;
-
- set_in_cr4(X86_CR4_OSXSAVE);
-
- /*
- * Enable all the features that the HW is capable of
- * and the Linux kernel is aware of.
- */
- xsetbv(XCR_XFEATURE_ENABLED_MASK, pcntxt_mask);
-}
-
-/*
- * setup the xstate image representing the init state
- */
-static void __init setup_xstate_init(void)
-{
- init_xstate_buf = alloc_bootmem(xstate_size);
- init_xstate_buf->i387.mxcsr = MXCSR_DEFAULT;
-}
-
-/*
- * Enable and initialize the xsave feature.
- */
-void __ref xsave_cntxt_init(void)
-{
- unsigned int eax, ebx, ecx, edx;
-
- cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx);
- pcntxt_mask = eax + ((u64)edx << 32);
-
- if ((pcntxt_mask & XSTATE_FPSSE) != XSTATE_FPSSE) {
- printk(KERN_ERR "FP/SSE not shown under xsave features 0x%llx\n",
- pcntxt_mask);
- BUG();
- }
-
- /*
- * Support only the state known to OS.
- */
- pcntxt_mask = pcntxt_mask & XCNTXT_MASK;
- xsave_init();
-
- /*
- * Recompute the context size for enabled features
- */
- cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx);
- xstate_size = ebx;
-
- update_regset_xstate_info(xstate_size, pcntxt_mask);
- prepare_fx_sw_frame();
-
- setup_xstate_init();
-
- printk(KERN_INFO "xsave/xrstor: enabled xstate_bv 0x%llx, "
- "cntxt size 0x%x\n",
- pcntxt_mask, xstate_size);
-}
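
The leaf 0xD enumeration done by xsave_cntxt_init() can be reproduced from userspace with the compiler's cpuid helper. A sketch, x86-only and assuming an XSAVE-capable CPU:

#include <cpuid.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	uint64_t xfeature_mask;

	/* Leaf 0xD, subleaf 0: EDX:EAX is the supported xfeature mask,
	 * EBX the XSAVE area size for the currently enabled features,
	 * ECX the maximum size across all supported features. */
	if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
		return 1;

	xfeature_mask = ((uint64_t)edx << 32) | eax;
	printf("xfeatures=0x%llx cur_size=%u max_size=%u\n",
	       (unsigned long long)xfeature_mask, ebx, ecx);
	return 0;
}
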
diff --git a/arch/x86/kvm/.gitignore b/arch/x86/kvm/.gitignore
new file mode 100644
index 000000000000..615d6ff35c00
--- /dev/null
+++ b/arch/x86/kvm/.gitignore
@@ -0,0 +1,2 @@
+/kvm-asm-offsets.s
+/kvm-asm-offsets.h
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 970bbd479516..278f08194ec8 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -1,3 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
#
# KVM configuration
#
@@ -6,9 +7,8 @@ source "virt/kvm/Kconfig"
menuconfig VIRTUALIZATION
bool "Virtualization"
- depends on HAVE_KVM || X86
default y
- ---help---
+ help
Say Y here to get to see options for using your Linux host to run other
operating systems inside virtual machines (guests).
This option alone does not add any kernel code.
@@ -17,20 +17,42 @@ menuconfig VIRTUALIZATION
if VIRTUALIZATION
-config KVM
- tristate "Kernel-based Virtual Machine (KVM) support"
- depends on HAVE_KVM
- # for device assignment:
- depends on PCI
- select PREEMPT_NOTIFIERS
- select MMU_NOTIFIER
- select ANON_INODES
+config KVM_X86
+ def_tristate KVM if (KVM_INTEL != n || KVM_AMD != n)
+ select KVM_COMMON
+ select KVM_GENERIC_MMU_NOTIFIER
+ select KVM_ELIDE_TLB_FLUSH_IF_YOUNG
+ select KVM_MMU_LOCKLESS_AGING
select HAVE_KVM_IRQCHIP
- select HAVE_KVM_EVENTFD
- select KVM_APIC_ARCHITECTURE
+ select HAVE_KVM_PFNCACHE
+ select HAVE_KVM_DIRTY_RING_TSO
+ select HAVE_KVM_DIRTY_RING_ACQ_REL
+ select HAVE_KVM_IRQ_BYPASS
+ select HAVE_KVM_IRQ_ROUTING
+ select HAVE_KVM_READONLY_MEM
+ select VHOST_TASK
+ select KVM_ASYNC_PF
select USER_RETURN_NOTIFIER
select KVM_MMIO
- ---help---
+ select SCHED_INFO
+ select PERF_EVENTS
+ select GUEST_PERF_EVENTS
+ select HAVE_KVM_MSI
+ select HAVE_KVM_CPU_RELAX_INTERCEPT
+ select HAVE_KVM_NO_POLL
+ select VIRT_XFER_TO_GUEST_WORK
+ select KVM_GENERIC_DIRTYLOG_READ_PROTECT
+ select KVM_VFIO
+ select HAVE_KVM_PM_NOTIFIER if PM
+ select KVM_GENERIC_HARDWARE_ENABLING
+ select KVM_GENERIC_PRE_FAULT_MEMORY
+ select KVM_WERROR if WERROR
+ select KVM_GUEST_MEMFD if X86_64
+
+config KVM
+ tristate "Kernel-based Virtual Machine (KVM) support"
+ depends on X86_LOCAL_APIC
+ help
Support hosting fully virtualized guest machines using hardware
virtualization extensions. You will need a fairly recent
processor equipped with virtualization extensions. You will also
@@ -44,30 +66,178 @@ config KVM
If unsure, say N.
+config KVM_WERROR
+ bool "Compile KVM with -Werror"
+ # Disallow KVM's -Werror if KASAN is enabled, e.g. to guard against
+ # randomized configs from selecting KVM_WERROR=y, which doesn't play
+	# nice with KASAN. KASAN builds generate warnings for the default
+ # FRAME_WARN, i.e. KVM_WERROR=y with KASAN=y requires special tuning.
+ # Building KVM with -Werror and KASAN is still doable via enabling
+ # the kernel-wide WERROR=y.
+ depends on KVM_X86 && ((EXPERT && !KASAN) || WERROR)
+ help
+ Add -Werror to the build flags for KVM.
+
+ If in doubt, say "N".
+
+config KVM_SW_PROTECTED_VM
+ bool "Enable support for KVM software-protected VMs"
+ depends on EXPERT
+ depends on KVM_X86 && X86_64
+ select KVM_GENERIC_MEMORY_ATTRIBUTES
+ help
+ Enable support for KVM software-protected VMs. Currently, software-
+ protected VMs are purely a development and testing vehicle for
+ KVM_CREATE_GUEST_MEMFD. Attempting to run a "real" VM workload as a
+ software-protected VM will fail miserably.
+
+ If unsure, say "N".
+
config KVM_INTEL
- tristate "KVM for Intel processors support"
- depends on KVM
- ---help---
- Provides support for KVM on Intel processors equipped with the VT
- extensions.
+ tristate "KVM for Intel (and compatible) processors support"
+ depends on KVM && IA32_FEAT_CTL
+ select X86_FRED if X86_64
+ help
+ Provides support for KVM on processors equipped with Intel's VT
+ extensions, a.k.a. Virtual Machine Extensions (VMX).
To compile this as a module, choose M here: the module
will be called kvm-intel.
+config KVM_INTEL_PROVE_VE
+ bool "Check that guests do not receive #VE exceptions"
+ depends on KVM_INTEL && EXPERT
+ help
+ Checks that KVM's page table management code will not incorrectly
+ let guests receive a virtualization exception. Virtualization
+ exceptions will be trapped by the hypervisor rather than injected
+ in the guest.
+
+ Note: some CPUs appear to generate spurious EPT Violations #VEs
+ that trigger KVM's WARN, in particular with eptad=0 and/or nested
+ virtualization.
+
+ If unsure, say N.
+
+config X86_SGX_KVM
+ bool "Software Guard eXtensions (SGX) Virtualization"
+ depends on X86_SGX && KVM_INTEL
+	help
+	  Enables KVM guests to create SGX enclaves.
+
+ This includes support to expose "raw" unreclaimable enclave memory to
+ guests via a device node, e.g. /dev/sgx_vepc.
+
+ If unsure, say N.
+
+config KVM_INTEL_TDX
+ bool "Intel Trust Domain Extensions (TDX) support"
+ default y
+ depends on INTEL_TDX_HOST
+ select KVM_GENERIC_MEMORY_ATTRIBUTES
+ select HAVE_KVM_ARCH_GMEM_POPULATE
+ help
+ Provides support for launching Intel Trust Domain Extensions (TDX)
+ confidential VMs on Intel processors.
+
+ If unsure, say N.
+
config KVM_AMD
tristate "KVM for AMD processors support"
- depends on KVM
- ---help---
+ depends on KVM && (CPU_SUP_AMD || CPU_SUP_HYGON)
+ help
Provides support for KVM on AMD processors equipped with the AMD-V
(SVM) extensions.
To compile this as a module, choose M here: the module
will be called kvm-amd.
-# OK, it's a little counter-intuitive to do this, but it puts it neatly under
-# the virtualization menu.
-source drivers/vhost/Kconfig
-source drivers/lguest/Kconfig
-source drivers/virtio/Kconfig
+config KVM_AMD_SEV
+ bool "AMD Secure Encrypted Virtualization (SEV) support"
+ default y
+ depends on KVM_AMD && X86_64
+ depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
+ select ARCH_HAS_CC_PLATFORM
+ select KVM_GENERIC_MEMORY_ATTRIBUTES
+ select HAVE_KVM_ARCH_GMEM_PREPARE
+ select HAVE_KVM_ARCH_GMEM_INVALIDATE
+ select HAVE_KVM_ARCH_GMEM_POPULATE
+ help
+ Provides support for launching encrypted VMs which use Secure
+ Encrypted Virtualization (SEV), Secure Encrypted Virtualization with
+ Encrypted State (SEV-ES), and Secure Encrypted Virtualization with
+ Secure Nested Paging (SEV-SNP) technologies on AMD processors.
+
+config KVM_IOAPIC
+ bool "I/O APIC, PIC, and PIT emulation"
+ default y
+ depends on KVM_X86
+ help
+ Provides support for KVM to emulate an I/O APIC, PIC, and PIT, i.e.
+ for full in-kernel APIC emulation.
+
+ If unsure, say Y.
+
+config KVM_SMM
+ bool "System Management Mode emulation"
+ default y
+ depends on KVM_X86
+ help
+ Provides support for KVM to emulate System Management Mode (SMM)
+ in virtual machines. This can be used by the virtual machine
+ firmware to implement UEFI secure boot.
+
+ If unsure, say Y.
+
+config KVM_HYPERV
+ bool "Support for Microsoft Hyper-V emulation"
+ depends on KVM_X86
+ default y
+ help
+ Provides KVM support for emulating Microsoft Hyper-V. This allows KVM
+ to expose a subset of the paravirtualized interfaces defined in the
+ Hyper-V Hypervisor Top-Level Functional Specification (TLFS):
+ https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs
+ These interfaces are required for the correct and performant functioning
+ of Windows and Hyper-V guests on KVM.
+
+ If unsure, say "Y".
+
+config KVM_XEN
+ bool "Support for Xen hypercall interface"
+ depends on KVM_X86
+ help
+	  Provides KVM support for hosting Xen HVM guests and
+ passing Xen hypercalls to userspace.
+
+ If in doubt, say "N".
+
+config KVM_PROVE_MMU
+ bool "Prove KVM MMU correctness"
+ depends on DEBUG_KERNEL
+ depends on KVM_X86
+ depends on EXPERT
+ help
+ Enables runtime assertions in KVM's MMU that are too costly to enable
+ in anything remotely resembling a production environment, e.g. this
+ gates code that verifies a to-be-freed page table doesn't have any
+ present SPTEs.
+
+ If in doubt, say "N".
+
+config KVM_EXTERNAL_WRITE_TRACKING
+ bool
+
+config KVM_MAX_NR_VCPUS
+ int "Maximum number of vCPUs per KVM guest"
+ depends on KVM_X86
+ range 1024 4096
+ default 4096 if MAXSMP
+ default 1024
+ help
+ Set the maximum number of vCPUs per KVM guest. Larger values will increase
+ the memory footprint of each KVM guest, regardless of how many vCPUs are
+ created for a given VM.
endif # VIRTUALIZATION
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 31a7035c4bd9..c4b8950c7abe 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -1,20 +1,49 @@
+# SPDX-License-Identifier: GPL-2.0
-EXTRA_CFLAGS += -Ivirt/kvm -Iarch/x86/kvm
+ccflags-y += -I $(srctree)/arch/x86/kvm
+ccflags-$(CONFIG_KVM_WERROR) += -Werror
-CFLAGS_x86.o := -I.
-CFLAGS_svm.o := -I.
-CFLAGS_vmx.o := -I.
+include $(srctree)/virt/kvm/Makefile.kvm
-kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \
- coalesced_mmio.o irq_comm.o eventfd.o \
- assigned-dev.o)
-kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o)
+kvm-y += x86.o emulate.o irq.o lapic.o cpuid.o pmu.o mtrr.o \
+ debugfs.o mmu/mmu.o mmu/page_track.o mmu/spte.o
-kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
- i8254.o timer.o
-kvm-intel-y += vmx.o
-kvm-amd-y += svm.o
+kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
+kvm-$(CONFIG_KVM_IOAPIC) += i8259.o i8254.o ioapic.o
+kvm-$(CONFIG_KVM_HYPERV) += hyperv.o
+kvm-$(CONFIG_KVM_XEN) += xen.o
+kvm-$(CONFIG_KVM_SMM) += smm.o
-obj-$(CONFIG_KVM) += kvm.o
+kvm-intel-y += vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
+ vmx/nested.o vmx/posted_intr.o vmx/main.o
+
+kvm-intel-$(CONFIG_X86_SGX_KVM) += vmx/sgx.o
+kvm-intel-$(CONFIG_KVM_HYPERV) += vmx/hyperv.o vmx/hyperv_evmcs.o
+kvm-intel-$(CONFIG_KVM_INTEL_TDX) += vmx/tdx.o
+
+kvm-amd-y += svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o
+
+kvm-amd-$(CONFIG_KVM_AMD_SEV) += svm/sev.o
+kvm-amd-$(CONFIG_KVM_HYPERV) += svm/hyperv.o
+
+ifdef CONFIG_HYPERV
+kvm-y += kvm_onhyperv.o
+kvm-intel-y += vmx/vmx_onhyperv.o vmx/hyperv_evmcs.o
+kvm-amd-y += svm/svm_onhyperv.o
+endif
+
+obj-$(CONFIG_KVM_X86) += kvm.o
obj-$(CONFIG_KVM_INTEL) += kvm-intel.o
obj-$(CONFIG_KVM_AMD) += kvm-amd.o
+
+AFLAGS_svm/vmenter.o := -iquote $(obj)
+$(obj)/svm/vmenter.o: $(obj)/kvm-asm-offsets.h
+
+AFLAGS_vmx/vmenter.o := -iquote $(obj)
+$(obj)/vmx/vmenter.o: $(obj)/kvm-asm-offsets.h
+
+$(obj)/kvm-asm-offsets.h: $(obj)/kvm-asm-offsets.s FORCE
+ $(call filechk,offsets,__KVM_ASM_OFFSETS_H__)
+
+targets += kvm-asm-offsets.s
+clean-files += kvm-asm-offsets.h
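
kvm-asm-offsets.h is produced with the kernel's usual asm-offsets trick: compile a C file to assembly, emit each constant as a magic .ascii marker, and let the filechk,offsets rule scrape the markers into #defines. A minimal standalone rendition of the idea; the DEFINE macro mirrors include/linux/kbuild.h, and the struct is a made-up stand-in:

#include <stddef.h>

/* A stand-in structure; in KVM's case the offsets come from vcpu_svm
 * and vcpu_vmx so that vmenter.S can address fields directly. */
struct vcpu_demo {
	int mode;
	unsigned long regs[16];
};

/* Emits "->SYM $val" into the generated assembly, which the
 * filechk,offsets rule rewrites as "#define SYM val". */
#define DEFINE(sym, val) \
	asm volatile("\n.ascii \"->" #sym " %0 " #val "\"" : : "i" (val))

void common(void)
{
	DEFINE(VCPU_DEMO_REGS, offsetof(struct vcpu_demo, regs));
}

Compiling this with cc -S and grepping the .s output for "->" shows the scraped constant.
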
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
new file mode 100644
index 000000000000..d563a948318b
--- /dev/null
+++ b/arch/x86/kvm/cpuid.c
@@ -0,0 +1,2107 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ * cpuid support routines
+ *
+ * derived from arch/x86/kvm/x86.c
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates.
+ * Copyright IBM Corporation, 2008
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kvm_host.h>
+#include "linux/lockdep.h"
+#include <linux/export.h>
+#include <linux/vmalloc.h>
+#include <linux/uaccess.h>
+#include <linux/sched/stat.h>
+
+#include <asm/processor.h>
+#include <asm/user.h>
+#include <asm/fpu/xstate.h>
+#include <asm/sgx.h>
+#include <asm/cpuid/api.h>
+#include "cpuid.h"
+#include "lapic.h"
+#include "mmu.h"
+#include "trace.h"
+#include "pmu.h"
+#include "xen.h"
+
+/*
+ * Unlike "struct cpuinfo_x86.x86_capability", kvm_cpu_caps doesn't need to be
+ * aligned to sizeof(unsigned long) because it's not accessed via bitops.
+ */
+u32 kvm_cpu_caps[NR_KVM_CPU_CAPS] __read_mostly;
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_cpu_caps);
+
+struct cpuid_xstate_sizes {
+ u32 eax;
+ u32 ebx;
+ u32 ecx;
+};
+
+static struct cpuid_xstate_sizes xstate_sizes[XFEATURE_MAX] __ro_after_init;
+
+void __init kvm_init_xstate_sizes(void)
+{
+ u32 ign;
+ int i;
+
+ for (i = XFEATURE_YMM; i < ARRAY_SIZE(xstate_sizes); i++) {
+ struct cpuid_xstate_sizes *xs = &xstate_sizes[i];
+
+ cpuid_count(0xD, i, &xs->eax, &xs->ebx, &xs->ecx, &ign);
+ }
+}
+
+u32 xstate_required_size(u64 xstate_bv, bool compacted)
+{
+ u32 ret = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET;
+ int i;
+
+ xstate_bv &= XFEATURE_MASK_EXTEND;
+ for (i = XFEATURE_YMM; i < ARRAY_SIZE(xstate_sizes) && xstate_bv; i++) {
+ struct cpuid_xstate_sizes *xs = &xstate_sizes[i];
+ u32 offset;
+
+ if (!(xstate_bv & BIT_ULL(i)))
+ continue;
+
+ /* ECX[1]: 64B alignment in compacted form */
+ if (compacted)
+ offset = (xs->ecx & 0x2) ? ALIGN(ret, 64) : ret;
+ else
+ offset = xs->ebx;
+ ret = max(ret, offset + xs->eax);
+ xstate_bv &= ~BIT_ULL(i);
+ }
+
+ return ret;
+}
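
In the compacted format each enabled component is packed immediately after the previous one (64-byte aligned when ECX[1] is set for that component), while the standard format uses fixed per-component offsets, so compaction usually yields a smaller buffer. A worked sketch of the loop above with made-up component sizes:

#include <stdio.h>

#define ALIGN64(x) (((x) + 63u) & ~63u)

int main(void)
{
	/* Hypothetical components past the 576-byte legacy+header region
	 * (XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET): {size, fixed offset,
	 * 64-byte alignment flag for the compacted form}. The first row
	 * matches YMM state; the second is an invented later feature. */
	struct { unsigned int size, offset, align64; } comp[] = {
		{ 256,  576, 0 },
		{  64, 1088, 1 },
	};
	unsigned int std = 576, compacted = 576;
	unsigned int i;

	for (i = 0; i < 2; i++) {
		unsigned int off = comp[i].align64 ? ALIGN64(compacted)
						   : compacted;

		compacted = off + comp[i].size;
		if (comp[i].offset + comp[i].size > std)
			std = comp[i].offset + comp[i].size;
	}
	printf("standard=%u compacted=%u\n", std, compacted);
	return 0;	/* standard=1152 compacted=896 */
}
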
+
+struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(
+ struct kvm_cpuid_entry2 *entries, int nent, u32 function, u64 index)
+{
+ struct kvm_cpuid_entry2 *e;
+ int i;
+
+ /*
+ * KVM has a semi-arbitrary rule that querying the guest's CPUID model
+ * with IRQs disabled is disallowed. The CPUID model can legitimately
+ * have over one hundred entries, i.e. the lookup is slow, and IRQs are
+ * typically disabled in KVM only when KVM is in a performance critical
+ * path, e.g. the core VM-Enter/VM-Exit run loop. Nothing will break
+ * if this rule is violated, this assertion is purely to flag potential
+ * performance issues. If this fires, consider moving the lookup out
+ * of the hotpath, e.g. by caching information during CPUID updates.
+ */
+ lockdep_assert_irqs_enabled();
+
+ for (i = 0; i < nent; i++) {
+ e = &entries[i];
+
+ if (e->function != function)
+ continue;
+
+ /*
+ * If the index isn't significant, use the first entry with a
+		 * matching function. It's userspace's responsibility not to
+		 * provide "duplicate" entries in all cases.
+ */
+ if (!(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) || e->index == index)
+ return e;
+
+ /*
+ * Similarly, use the first matching entry if KVM is doing a
+ * lookup (as opposed to emulating CPUID) for a function that's
+ * architecturally defined as not having a significant index.
+ */
+ if (index == KVM_CPUID_INDEX_NOT_SIGNIFICANT) {
+ /*
+ * Direct lookups from KVM should not diverge from what
+ * KVM defines internally (the architectural behavior).
+ */
+ WARN_ON_ONCE(cpuid_function_is_indexed(function));
+ return e;
+ }
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_find_cpuid_entry2);
+
+static int kvm_check_cpuid(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+ u64 xfeatures;
+
+ /*
+ * The existing code assumes virtual address is 48-bit or 57-bit in the
+ * canonical address checks; exit if it is ever changed.
+ */
+ best = kvm_find_cpuid_entry(vcpu, 0x80000008);
+ if (best) {
+ int vaddr_bits = (best->eax & 0xff00) >> 8;
+
+ if (vaddr_bits != 48 && vaddr_bits != 57 && vaddr_bits != 0)
+ return -EINVAL;
+ }
+
+ /*
+ * Exposing dynamic xfeatures to the guest requires additional
+ * enabling in the FPU, e.g. to expand the guest XSAVE state size.
+ */
+ best = kvm_find_cpuid_entry_index(vcpu, 0xd, 0);
+ if (!best)
+ return 0;
+
+ xfeatures = best->eax | ((u64)best->edx << 32);
+ xfeatures &= XFEATURE_MASK_USER_DYNAMIC;
+ if (!xfeatures)
+ return 0;
+
+ return fpu_enable_guest_xfd_features(&vcpu->arch.guest_fpu, xfeatures);
+}
+
+static u32 kvm_apply_cpuid_pv_features_quirk(struct kvm_vcpu *vcpu);
+static void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu);
+
+/* Check whether the supplied CPUID data is equal to what is already set for the vCPU. */
+static int kvm_cpuid_check_equal(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
+ int nent)
+{
+ struct kvm_cpuid_entry2 *orig;
+ int i;
+
+ /*
+ * Apply runtime CPUID updates to the incoming CPUID entries to avoid
+	 * false positives due to mismatches on KVM-owned feature flags.
+ *
+ * Note! @e2 and @nent track the _old_ CPUID entries!
+ */
+ kvm_update_cpuid_runtime(vcpu);
+ kvm_apply_cpuid_pv_features_quirk(vcpu);
+
+ if (nent != vcpu->arch.cpuid_nent)
+ return -EINVAL;
+
+ for (i = 0; i < nent; i++) {
+ orig = &vcpu->arch.cpuid_entries[i];
+ if (e2[i].function != orig->function ||
+ e2[i].index != orig->index ||
+ e2[i].flags != orig->flags ||
+ e2[i].eax != orig->eax || e2[i].ebx != orig->ebx ||
+ e2[i].ecx != orig->ecx || e2[i].edx != orig->edx)
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static struct kvm_hypervisor_cpuid kvm_get_hypervisor_cpuid(struct kvm_vcpu *vcpu,
+ const char *sig)
+{
+ struct kvm_hypervisor_cpuid cpuid = {};
+ struct kvm_cpuid_entry2 *entry;
+ u32 base;
+
+ for_each_possible_cpuid_base_hypervisor(base) {
+ entry = kvm_find_cpuid_entry(vcpu, base);
+
+ if (entry) {
+ u32 signature[3];
+
+ signature[0] = entry->ebx;
+ signature[1] = entry->ecx;
+ signature[2] = entry->edx;
+
+ if (!memcmp(signature, sig, sizeof(signature))) {
+ cpuid.base = base;
+ cpuid.limit = entry->eax;
+ break;
+ }
+ }
+ }
+
+ return cpuid;
+}
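
A hypervisor signature is twelve bytes packed into EBX, ECX, and EDX of the base leaf; KVM's own KVM_SIGNATURE is "KVMKVMKVM\0\0\0". A sketch of the comparison performed above, assuming a little-endian host as on x86:

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* The three registers carry four signature bytes each, in EBX,
	 * ECX, EDX order; the values below spell "KVMKVMKVM\0\0\0". */
	unsigned int sig[3] = {
		0x4b4d564b,	/* EBX: "KVMK"    */
		0x564b4d56,	/* ECX: "VMKV"    */
		0x0000004d,	/* EDX: "M\0\0\0" */
	};

	puts(memcmp(sig, "KVMKVMKVM\0\0\0", 12) ? "no match" : "match");
	return 0;
}
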
+
+static u32 kvm_apply_cpuid_pv_features_quirk(struct kvm_vcpu *vcpu)
+{
+ struct kvm_hypervisor_cpuid kvm_cpuid;
+ struct kvm_cpuid_entry2 *best;
+
+ kvm_cpuid = kvm_get_hypervisor_cpuid(vcpu, KVM_SIGNATURE);
+ if (!kvm_cpuid.base)
+ return 0;
+
+ best = kvm_find_cpuid_entry(vcpu, kvm_cpuid.base | KVM_CPUID_FEATURES);
+ if (!best)
+ return 0;
+
+ if (kvm_hlt_in_guest(vcpu->kvm))
+ best->eax &= ~(1 << KVM_FEATURE_PV_UNHALT);
+
+ return best->eax;
+}
+
+/*
+ * Calculate guest's supported XCR0 taking into account guest CPUID data and
+ * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
+ */
+static u64 cpuid_get_supported_xcr0(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry_index(vcpu, 0xd, 0);
+ if (!best)
+ return 0;
+
+ return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
+}
+
+static u64 cpuid_get_supported_xss(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
+ if (!best)
+ return 0;
+
+ return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
+}
+
+static __always_inline void kvm_update_feature_runtime(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid_entry2 *entry,
+ unsigned int x86_feature,
+ bool has_feature)
+{
+ cpuid_entry_change(entry, x86_feature, has_feature);
+ guest_cpu_cap_change(vcpu, x86_feature, has_feature);
+}
+
+static void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ vcpu->arch.cpuid_dynamic_bits_dirty = false;
+
+ best = kvm_find_cpuid_entry(vcpu, 1);
+ if (best) {
+ kvm_update_feature_runtime(vcpu, best, X86_FEATURE_OSXSAVE,
+ kvm_is_cr4_bit_set(vcpu, X86_CR4_OSXSAVE));
+
+ kvm_update_feature_runtime(vcpu, best, X86_FEATURE_APIC,
+ vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE);
+
+ if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT))
+ kvm_update_feature_runtime(vcpu, best, X86_FEATURE_MWAIT,
+ vcpu->arch.ia32_misc_enable_msr &
+ MSR_IA32_MISC_ENABLE_MWAIT);
+ }
+
+ best = kvm_find_cpuid_entry_index(vcpu, 7, 0);
+ if (best)
+ kvm_update_feature_runtime(vcpu, best, X86_FEATURE_OSPKE,
+ kvm_is_cr4_bit_set(vcpu, X86_CR4_PKE));
+
+ best = kvm_find_cpuid_entry_index(vcpu, 0xD, 0);
+ if (best)
+ best->ebx = xstate_required_size(vcpu->arch.xcr0, false);
+
+ best = kvm_find_cpuid_entry_index(vcpu, 0xD, 1);
+ if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
+ cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
+ best->ebx = xstate_required_size(vcpu->arch.xcr0 |
+ vcpu->arch.ia32_xss, true);
+}
+
+static bool kvm_cpuid_has_hyperv(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_KVM_HYPERV
+ struct kvm_cpuid_entry2 *entry;
+
+ entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_INTERFACE);
+ return entry && entry->eax == HYPERV_CPUID_SIGNATURE_EAX;
+#else
+ return false;
+#endif
+}
+
+static bool guest_cpuid_is_amd_or_hygon(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *entry;
+
+ entry = kvm_find_cpuid_entry(vcpu, 0);
+ if (!entry)
+ return false;
+
+ return is_guest_vendor_amd(entry->ebx, entry->ecx, entry->edx) ||
+ is_guest_vendor_hygon(entry->ebx, entry->ecx, entry->edx);
+}
+
+/*
+ * This isn't truly "unsafe", but except for the cpu_caps initialization code,
+ * all register lookups should use __cpuid_entry_get_reg(), which provides
+ * compile-time validation of the input.
+ */
+static u32 cpuid_get_reg_unsafe(struct kvm_cpuid_entry2 *entry, u32 reg)
+{
+ switch (reg) {
+ case CPUID_EAX:
+ return entry->eax;
+ case CPUID_EBX:
+ return entry->ebx;
+ case CPUID_ECX:
+ return entry->ecx;
+ case CPUID_EDX:
+ return entry->edx;
+ default:
+ WARN_ON_ONCE(1);
+ return 0;
+ }
+}
+
+static int cpuid_func_emulated(struct kvm_cpuid_entry2 *entry, u32 func,
+ bool include_partially_emulated);
+
+void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ struct kvm_cpuid_entry2 *best;
+ struct kvm_cpuid_entry2 *entry;
+ bool allow_gbpages;
+ int i;
+
+ memset(vcpu->arch.cpu_caps, 0, sizeof(vcpu->arch.cpu_caps));
+ BUILD_BUG_ON(ARRAY_SIZE(reverse_cpuid) != NR_KVM_CPU_CAPS);
+
+ /*
+ * Reset guest capabilities to userspace's guest CPUID definition, i.e.
+ * honor userspace's definition for features that don't require KVM or
+ * hardware management/support (or that KVM simply doesn't care about).
+ */
+ for (i = 0; i < NR_KVM_CPU_CAPS; i++) {
+ const struct cpuid_reg cpuid = reverse_cpuid[i];
+ struct kvm_cpuid_entry2 emulated;
+
+ if (!cpuid.function)
+ continue;
+
+ entry = kvm_find_cpuid_entry_index(vcpu, cpuid.function, cpuid.index);
+ if (!entry)
+ continue;
+
+ cpuid_func_emulated(&emulated, cpuid.function, true);
+
+ /*
+ * A vCPU has a feature if it's supported by KVM and is enabled
+ * in guest CPUID. Note, this includes features that are
+ * supported by KVM but aren't advertised to userspace!
+ */
+ vcpu->arch.cpu_caps[i] = kvm_cpu_caps[i] |
+ cpuid_get_reg_unsafe(&emulated, cpuid.reg);
+ vcpu->arch.cpu_caps[i] &= cpuid_get_reg_unsafe(entry, cpuid.reg);
+ }
+
+ kvm_update_cpuid_runtime(vcpu);
+
+ /*
+ * If TDP is enabled, let the guest use GBPAGES if they're supported in
+ * hardware. The hardware page walker doesn't let KVM disable GBPAGES,
+ * i.e. won't treat them as reserved, and KVM doesn't redo the GVA->GPA
+ * walk for performance and complexity reasons. Not to mention KVM
+ * _can't_ solve the problem because GVA->GPA walks aren't visible to
+ * KVM once a TDP translation is installed. Mimic hardware behavior so
+ * that KVM's is at least consistent, i.e. doesn't randomly inject #PF.
+ * If TDP is disabled, honor *only* guest CPUID as KVM has full control
+ * and can install smaller shadow pages if the host lacks 1GiB support.
+ */
+ allow_gbpages = tdp_enabled ? boot_cpu_has(X86_FEATURE_GBPAGES) :
+ guest_cpu_cap_has(vcpu, X86_FEATURE_GBPAGES);
+ guest_cpu_cap_change(vcpu, X86_FEATURE_GBPAGES, allow_gbpages);
+
+ best = kvm_find_cpuid_entry(vcpu, 1);
+ if (best && apic) {
+ if (cpuid_entry_has(best, X86_FEATURE_TSC_DEADLINE_TIMER))
+ apic->lapic_timer.timer_mode_mask = 3 << 17;
+ else
+ apic->lapic_timer.timer_mode_mask = 1 << 17;
+
+ kvm_apic_set_version(vcpu);
+ }
+
+ vcpu->arch.guest_supported_xcr0 = cpuid_get_supported_xcr0(vcpu);
+ vcpu->arch.guest_supported_xss = cpuid_get_supported_xss(vcpu);
+
+ vcpu->arch.pv_cpuid.features = kvm_apply_cpuid_pv_features_quirk(vcpu);
+
+ vcpu->arch.is_amd_compatible = guest_cpuid_is_amd_or_hygon(vcpu);
+ vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
+ vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
+
+ kvm_pmu_refresh(vcpu);
+
+#define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
+ vcpu->arch.cr4_guest_rsvd_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_) |
+ __cr4_reserved_bits(guest_cpu_cap_has, vcpu);
+#undef __kvm_cpu_cap_has
+
+ kvm_hv_set_cpuid(vcpu, kvm_cpuid_has_hyperv(vcpu));
+
+ /* Invoke the vendor callback only after the above state is updated. */
+ kvm_x86_call(vcpu_after_set_cpuid)(vcpu);
+
+ /*
+	 * Update the MMU last, as it needs to factor in any vendor-specific
+	 * adjustments to the reserved GPA bits.
+ */
+ kvm_mmu_after_set_cpuid(vcpu);
+
+ kvm_make_request(KVM_REQ_RECALC_INTERCEPTS, vcpu);
+}
+
+int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, 0x80000000);
+ if (!best || best->eax < 0x80000008)
+ goto not_found;
+ best = kvm_find_cpuid_entry(vcpu, 0x80000008);
+ if (best)
+ return best->eax & 0xff;
+not_found:
+ return 36;
+}
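
Leaf 0x80000008 packs the address widths into EAX: bits 7:0 give the physical width, bits 15:8 the linear (virtual) width, and bits 23:16 the guest physical width consulted by cpuid_query_maxguestphyaddr() below. A userspace sketch, x86-only:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x80000000, &eax, &ebx, &ecx, &edx) ||
	    eax < 0x80000008) {
		puts("no leaf 0x80000008: fall back to 36 phys bits, as above");
		return 0;
	}
	__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx);
	printf("phys=%u virt=%u guest_phys=%u\n",
	       eax & 0xff, (eax >> 8) & 0xff, (eax >> 16) & 0xff);
	return 0;
}
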
+
+int cpuid_query_maxguestphyaddr(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, 0x80000000);
+ if (!best || best->eax < 0x80000008)
+ goto not_found;
+ best = kvm_find_cpuid_entry(vcpu, 0x80000008);
+ if (best)
+ return (best->eax >> 16) & 0xff;
+not_found:
+ return 0;
+}
+
+/*
+ * This "raw" version returns the reserved GPA bits without any adjustments for
+ * encryption technologies that usurp bits. The raw mask should be used if and
+ * only if hardware does _not_ strip the usurped bits, e.g. in virtual MTRRs.
+ */
+u64 kvm_vcpu_reserved_gpa_bits_raw(struct kvm_vcpu *vcpu)
+{
+ return rsvd_bits(cpuid_maxphyaddr(vcpu), 63);
+}
+
+static int kvm_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
+ int nent)
+{
+ u32 vcpu_caps[NR_KVM_CPU_CAPS];
+ int r;
+
+ /*
+ * Swap the existing (old) entries with the incoming (new) entries in
+ * order to massage the new entries, e.g. to account for dynamic bits
+ * that KVM controls, without clobbering the current guest CPUID, which
+ * KVM needs to preserve in order to unwind on failure.
+ *
+ * Similarly, save the vCPU's current cpu_caps so that the capabilities
+ * can be updated alongside the CPUID entries when performing runtime
+ * updates. Full initialization is done if and only if the vCPU hasn't
+ * run, i.e. only if userspace is potentially changing CPUID features.
+ */
+ swap(vcpu->arch.cpuid_entries, e2);
+ swap(vcpu->arch.cpuid_nent, nent);
+
+ memcpy(vcpu_caps, vcpu->arch.cpu_caps, sizeof(vcpu_caps));
+ BUILD_BUG_ON(sizeof(vcpu_caps) != sizeof(vcpu->arch.cpu_caps));
+
+ /*
+ * KVM does not correctly handle changing guest CPUID after KVM_RUN, as
+	 * MAXPHYADDR, GBPAGES support, AMD reserved bit behavior, etc. aren't
+ * tracked in kvm_mmu_page_role. As a result, KVM may miss guest page
+ * faults due to reusing SPs/SPTEs. In practice no sane VMM mucks with
+ * the core vCPU model on the fly. It would've been better to forbid any
+ * KVM_SET_CPUID{,2} calls after KVM_RUN altogether but unfortunately
+ * some VMMs (e.g. QEMU) reuse vCPU fds for CPU hotplug/unplug and do
+ * KVM_SET_CPUID{,2} again. To support this legacy behavior, check
+ * whether the supplied CPUID data is equal to what's already set.
+ */
+ if (kvm_vcpu_has_run(vcpu)) {
+ r = kvm_cpuid_check_equal(vcpu, e2, nent);
+ if (r)
+ goto err;
+ goto success;
+ }
+
+#ifdef CONFIG_KVM_HYPERV
+ if (kvm_cpuid_has_hyperv(vcpu)) {
+ r = kvm_hv_vcpu_init(vcpu);
+ if (r)
+ goto err;
+ }
+#endif
+
+ r = kvm_check_cpuid(vcpu);
+ if (r)
+ goto err;
+
+#ifdef CONFIG_KVM_XEN
+ vcpu->arch.xen.cpuid = kvm_get_hypervisor_cpuid(vcpu, XEN_SIGNATURE);
+#endif
+ kvm_vcpu_after_set_cpuid(vcpu);
+
+success:
+ kvfree(e2);
+ return 0;
+
+err:
+ memcpy(vcpu->arch.cpu_caps, vcpu_caps, sizeof(vcpu_caps));
+ swap(vcpu->arch.cpuid_entries, e2);
+ swap(vcpu->arch.cpuid_nent, nent);
+ return r;
+}
+
+/* when an old userspace process fills a new kernel module */
+int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid *cpuid,
+ struct kvm_cpuid_entry __user *entries)
+{
+ int r, i;
+ struct kvm_cpuid_entry *e = NULL;
+ struct kvm_cpuid_entry2 *e2 = NULL;
+
+ if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
+ return -E2BIG;
+
+ if (cpuid->nent) {
+ e = vmemdup_array_user(entries, cpuid->nent, sizeof(*e));
+ if (IS_ERR(e))
+ return PTR_ERR(e);
+
+ e2 = kvmalloc_array(cpuid->nent, sizeof(*e2), GFP_KERNEL_ACCOUNT);
+ if (!e2) {
+ r = -ENOMEM;
+ goto out_free_cpuid;
+ }
+ }
+ for (i = 0; i < cpuid->nent; i++) {
+ e2[i].function = e[i].function;
+ e2[i].eax = e[i].eax;
+ e2[i].ebx = e[i].ebx;
+ e2[i].ecx = e[i].ecx;
+ e2[i].edx = e[i].edx;
+ e2[i].index = 0;
+ e2[i].flags = 0;
+ e2[i].padding[0] = 0;
+ e2[i].padding[1] = 0;
+ e2[i].padding[2] = 0;
+ }
+
+ r = kvm_set_cpuid(vcpu, e2, cpuid->nent);
+ if (r)
+ kvfree(e2);
+
+out_free_cpuid:
+ kvfree(e);
+
+ return r;
+}
+
+int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid2 *cpuid,
+ struct kvm_cpuid_entry2 __user *entries)
+{
+ struct kvm_cpuid_entry2 *e2 = NULL;
+ int r;
+
+ if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
+ return -E2BIG;
+
+ if (cpuid->nent) {
+ e2 = vmemdup_array_user(entries, cpuid->nent, sizeof(*e2));
+ if (IS_ERR(e2))
+ return PTR_ERR(e2);
+ }
+
+ r = kvm_set_cpuid(vcpu, e2, cpuid->nent);
+ if (r)
+ kvfree(e2);
+
+ return r;
+}
+
+int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid2 *cpuid,
+ struct kvm_cpuid_entry2 __user *entries)
+{
+ if (cpuid->nent < vcpu->arch.cpuid_nent)
+ return -E2BIG;
+
+ if (vcpu->arch.cpuid_dynamic_bits_dirty)
+ kvm_update_cpuid_runtime(vcpu);
+
+ if (copy_to_user(entries, vcpu->arch.cpuid_entries,
+ vcpu->arch.cpuid_nent * sizeof(struct kvm_cpuid_entry2)))
+ return -EFAULT;
+
+ cpuid->nent = vcpu->arch.cpuid_nent;
+ return 0;
+}
+
+static __always_inline u32 raw_cpuid_get(struct cpuid_reg cpuid)
+{
+ struct kvm_cpuid_entry2 entry;
+ u32 base;
+
+ /*
+ * KVM only supports features defined by Intel (0x0), AMD (0x80000000),
+ * and Centaur (0xc0000000). WARN if a feature for new vendor base is
+ * defined, as this and other code would need to be updated.
+ */
+ base = cpuid.function & 0xffff0000;
+ if (WARN_ON_ONCE(base && base != 0x80000000 && base != 0xc0000000))
+ return 0;
+
+ if (cpuid_eax(base) < cpuid.function)
+ return 0;
+
+ cpuid_count(cpuid.function, cpuid.index,
+ &entry.eax, &entry.ebx, &entry.ecx, &entry.edx);
+
+ return *__cpuid_entry_get_reg(&entry, cpuid.reg);
+}
+
+/*
+ * For kernel-defined leafs, mask KVM's supported feature set with the kernel's
+ * capabilities as well as raw CPUID. For KVM-defined leafs, consult only raw
+ * CPUID, as KVM is the one and only authority (in the kernel).
+ */
+#define kvm_cpu_cap_init(leaf, feature_initializers...) \
+do { \
+ const struct cpuid_reg cpuid = x86_feature_cpuid(leaf * 32); \
+ const u32 __maybe_unused kvm_cpu_cap_init_in_progress = leaf; \
+ const u32 *kernel_cpu_caps = boot_cpu_data.x86_capability; \
+ u32 kvm_cpu_cap_passthrough = 0; \
+ u32 kvm_cpu_cap_synthesized = 0; \
+ u32 kvm_cpu_cap_emulated = 0; \
+ u32 kvm_cpu_cap_features = 0; \
+ \
+ feature_initializers \
+ \
+ kvm_cpu_caps[leaf] = kvm_cpu_cap_features; \
+ \
+ if (leaf < NCAPINTS) \
+ kvm_cpu_caps[leaf] &= kernel_cpu_caps[leaf]; \
+ \
+ kvm_cpu_caps[leaf] |= kvm_cpu_cap_passthrough; \
+ kvm_cpu_caps[leaf] &= (raw_cpuid_get(cpuid) | \
+ kvm_cpu_cap_synthesized); \
+ kvm_cpu_caps[leaf] |= kvm_cpu_cap_emulated; \
+} while (0)
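
Stripped of its helper macros, kvm_cpu_cap_init() reduces to a fixed precedence among the feature classes. A simplified model of the mask algebra; this is a sketch, not the kernel macro, and the kernel-caps mask applies only to kernel-defined leaves (leaf < NCAPINTS):

#include <stdio.h>

/* Order of operations mirrored from kvm_cpu_cap_init() above; all
 * arguments are 32-bit feature masks for one CPUID register. */
static unsigned int cap_init(unsigned int features, unsigned int kernel,
			     unsigned int raw, unsigned int passthrough,
			     unsigned int synthesized, unsigned int emulated)
{
	unsigned int caps = features & kernel;	/* kernel must also support it */

	caps |= passthrough;			/* ...unless KVM forces it on  */
	caps &= raw | synthesized;		/* needs HW (or synth) backing */
	caps |= emulated;			/* emulated bits need no HW    */
	return caps;
}

int main(void)
{
	/* Bit 0: a plain F() supported everywhere; bit 1: an EMULATED_F()
	 * absent from both the kernel caps and raw CPUID. */
	printf("caps=0x%x\n", cap_init(0x3, 0x1, 0x1, 0x0, 0x0, 0x2));
	return 0;	/* prints caps=0x3 */
}
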
+
+/*
+ * Assert that the feature bit being declared, e.g. via F(), is in the CPUID
+ * word that's being initialized. Exempt 0x8000_0001.EDX usage of 0x1.EDX
+ * features, as AMD duplicated many 0x1.EDX features into 0x8000_0001.EDX.
+ */
+#define KVM_VALIDATE_CPU_CAP_USAGE(name) \
+do { \
+ u32 __leaf = __feature_leaf(X86_FEATURE_##name); \
+ \
+ BUILD_BUG_ON(__leaf != kvm_cpu_cap_init_in_progress); \
+} while (0)
+
+#define F(name) \
+({ \
+ KVM_VALIDATE_CPU_CAP_USAGE(name); \
+ kvm_cpu_cap_features |= feature_bit(name); \
+})
+
+/* Scattered Flag - For features that are scattered by cpufeatures.h. */
+#define SCATTERED_F(name) \
+({ \
+ BUILD_BUG_ON(X86_FEATURE_##name >= MAX_CPU_FEATURES); \
+ KVM_VALIDATE_CPU_CAP_USAGE(name); \
+ if (boot_cpu_has(X86_FEATURE_##name)) \
+ F(name); \
+})
+
+/* Features that KVM supports only on 64-bit kernels. */
+#define X86_64_F(name) \
+({ \
+ KVM_VALIDATE_CPU_CAP_USAGE(name); \
+ if (IS_ENABLED(CONFIG_X86_64)) \
+ F(name); \
+})
+
+/*
+ * Emulated Feature - For features that KVM emulates in software irrespective
+ * of host CPU/kernel support.
+ */
+#define EMULATED_F(name) \
+({ \
+ kvm_cpu_cap_emulated |= feature_bit(name); \
+ F(name); \
+})
+
+/*
+ * Synthesized Feature - For features that are synthesized into boot_cpu_data,
+ * i.e. may not be present in the raw CPUID, but can still be advertised to
+ * userspace. Primarily used for mitigation related feature flags.
+ */
+#define SYNTHESIZED_F(name) \
+({ \
+ kvm_cpu_cap_synthesized |= feature_bit(name); \
+ F(name); \
+})
+
+/*
+ * Passthrough Feature - For features that KVM supports based purely on raw
+ * hardware CPUID, i.e. that KVM virtualizes even if the host kernel doesn't
+ * use the feature. Simply force set the feature in KVM's capabilities, raw
+ * CPUID support will be factored in by kvm_cpu_cap_init().
+ */
+#define PASSTHROUGH_F(name) \
+({ \
+ kvm_cpu_cap_passthrough |= feature_bit(name); \
+ F(name); \
+})
+
+/*
+ * Aliased Features - For features in 0x8000_0001.EDX that are duplicates of
+ * identical 0x1.EDX features, and thus are aliased from 0x1 to 0x8000_0001.
+ */
+#define ALIASED_1_EDX_F(name) \
+({ \
+ BUILD_BUG_ON(__feature_leaf(X86_FEATURE_##name) != CPUID_1_EDX); \
+ BUILD_BUG_ON(kvm_cpu_cap_init_in_progress != CPUID_8000_0001_EDX); \
+ kvm_cpu_cap_features |= feature_bit(name); \
+})
+
+/*
+ * Vendor Features - For features that KVM supports, but are added in later
+ * because they require additional vendor enabling.
+ */
+#define VENDOR_F(name) \
+({ \
+ KVM_VALIDATE_CPU_CAP_USAGE(name); \
+})
+
+/*
+ * Runtime Features - For features that KVM dynamically sets/clears at runtime,
+ * e.g. when CR4 changes, but which are never advertised to userspace.
+ */
+#define RUNTIME_F(name) \
+({ \
+ KVM_VALIDATE_CPU_CAP_USAGE(name); \
+})
+
+/*
+ * Undefine the MSR bit macro to avoid token concatenation issues when
+ * processing X86_FEATURE_SPEC_CTRL_SSBD.
+ */
+#undef SPEC_CTRL_SSBD
+
+/* DS is defined by ptrace-abi.h on 32-bit builds. */
+#undef DS
+
+void kvm_set_cpu_caps(void)
+{
+ memset(kvm_cpu_caps, 0, sizeof(kvm_cpu_caps));
+
+ BUILD_BUG_ON(sizeof(kvm_cpu_caps) - (NKVMCAPINTS * sizeof(*kvm_cpu_caps)) >
+ sizeof(boot_cpu_data.x86_capability));
+
+ kvm_cpu_cap_init(CPUID_1_ECX,
+ F(XMM3),
+ F(PCLMULQDQ),
+ VENDOR_F(DTES64),
+ /*
+ * NOTE: MONITOR (and MWAIT) are emulated as NOP, but *not*
+ * advertised to guests via CPUID! MWAIT is also technically a
+ * runtime flag thanks to IA32_MISC_ENABLES; mark it as such so
+ * that KVM is aware that it's a known, unadvertised flag.
+ */
+ RUNTIME_F(MWAIT),
+ /* DS-CPL */
+ VENDOR_F(VMX),
+ /* SMX, EST */
+ /* TM2 */
+ F(SSSE3),
+ /* CNXT-ID */
+ /* Reserved */
+ F(FMA),
+ F(CX16),
+ /* xTPR Update */
+ F(PDCM),
+ F(PCID),
+ /* Reserved, DCA */
+ F(XMM4_1),
+ F(XMM4_2),
+ EMULATED_F(X2APIC),
+ F(MOVBE),
+ F(POPCNT),
+ EMULATED_F(TSC_DEADLINE_TIMER),
+ F(AES),
+ F(XSAVE),
+ RUNTIME_F(OSXSAVE),
+ F(AVX),
+ F(F16C),
+ F(RDRAND),
+ EMULATED_F(HYPERVISOR),
+ );
+
+ kvm_cpu_cap_init(CPUID_1_EDX,
+ F(FPU),
+ F(VME),
+ F(DE),
+ F(PSE),
+ F(TSC),
+ F(MSR),
+ F(PAE),
+ F(MCE),
+ F(CX8),
+ F(APIC),
+ /* Reserved */
+ F(SEP),
+ F(MTRR),
+ F(PGE),
+ F(MCA),
+ F(CMOV),
+ F(PAT),
+ F(PSE36),
+ /* PSN */
+ F(CLFLUSH),
+ /* Reserved */
+ VENDOR_F(DS),
+ /* ACPI */
+ F(MMX),
+ F(FXSR),
+ F(XMM),
+ F(XMM2),
+ F(SELFSNOOP),
+ /* HTT, TM, Reserved, PBE */
+ );
+
+ kvm_cpu_cap_init(CPUID_7_0_EBX,
+ F(FSGSBASE),
+ EMULATED_F(TSC_ADJUST),
+ F(SGX),
+ F(BMI1),
+ F(HLE),
+ F(AVX2),
+ F(FDP_EXCPTN_ONLY),
+ F(SMEP),
+ F(BMI2),
+ F(ERMS),
+ F(INVPCID),
+ F(RTM),
+ F(ZERO_FCS_FDS),
+ VENDOR_F(MPX),
+ F(AVX512F),
+ F(AVX512DQ),
+ F(RDSEED),
+ F(ADX),
+ F(SMAP),
+ F(AVX512IFMA),
+ F(CLFLUSHOPT),
+ F(CLWB),
+ VENDOR_F(INTEL_PT),
+ F(AVX512PF),
+ F(AVX512ER),
+ F(AVX512CD),
+ F(SHA_NI),
+ F(AVX512BW),
+ F(AVX512VL),
+ );
+
+ kvm_cpu_cap_init(CPUID_7_ECX,
+ F(AVX512VBMI),
+ PASSTHROUGH_F(LA57),
+ F(PKU),
+ RUNTIME_F(OSPKE),
+ F(RDPID),
+ F(AVX512_VPOPCNTDQ),
+ F(UMIP),
+ F(AVX512_VBMI2),
+ F(GFNI),
+ F(VAES),
+ F(VPCLMULQDQ),
+ F(AVX512_VNNI),
+ F(AVX512_BITALG),
+ F(CLDEMOTE),
+ F(MOVDIRI),
+ F(MOVDIR64B),
+ VENDOR_F(WAITPKG),
+ F(SGX_LC),
+ F(BUS_LOCK_DETECT),
+ X86_64_F(SHSTK),
+ );
+
+ /*
+	 * PKU is not yet implemented for shadow paging and requires OSPKE
+	 * to be set on the host. Clear it if that is not the case.
+ */
+ if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
+ kvm_cpu_cap_clear(X86_FEATURE_PKU);
+
+ /*
+ * Shadow Stacks aren't implemented in the Shadow MMU. Shadow Stack
+ * accesses require "magic" Writable=0,Dirty=1 protection, which KVM
+ * doesn't know how to emulate or map.
+ */
+ if (!tdp_enabled)
+ kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+
+ kvm_cpu_cap_init(CPUID_7_EDX,
+ F(AVX512_4VNNIW),
+ F(AVX512_4FMAPS),
+ F(SPEC_CTRL),
+ F(SPEC_CTRL_SSBD),
+ EMULATED_F(ARCH_CAPABILITIES),
+ F(INTEL_STIBP),
+ F(MD_CLEAR),
+ F(AVX512_VP2INTERSECT),
+ F(FSRM),
+ F(SERIALIZE),
+ F(TSXLDTRK),
+ F(AVX512_FP16),
+ F(AMX_TILE),
+ F(AMX_INT8),
+ F(AMX_BF16),
+ F(FLUSH_L1D),
+ F(IBT),
+ );
+
+ /*
+ * Disable support for IBT and SHSTK if KVM is configured to emulate
+ * accesses to reserved GPAs, as KVM's emulator doesn't support IBT or
+ * SHSTK, nor does KVM handle Shadow Stack #PFs (see above).
+ */
+ if (allow_smaller_maxphyaddr) {
+ kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+ kvm_cpu_cap_clear(X86_FEATURE_IBT);
+ }
+
+ if (boot_cpu_has(X86_FEATURE_AMD_IBPB_RET) &&
+ boot_cpu_has(X86_FEATURE_AMD_IBPB) &&
+ boot_cpu_has(X86_FEATURE_AMD_IBRS))
+ kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL);
+ if (boot_cpu_has(X86_FEATURE_STIBP))
+ kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
+ if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
+ kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
+
+ kvm_cpu_cap_init(CPUID_7_1_EAX,
+ F(SHA512),
+ F(SM3),
+ F(SM4),
+ F(AVX_VNNI),
+ F(AVX512_BF16),
+ F(CMPCCXADD),
+ F(FZRM),
+ F(FSRS),
+ F(FSRC),
+ F(WRMSRNS),
+ X86_64_F(LKGS),
+ F(AMX_FP16),
+ F(AVX_IFMA),
+ F(LAM),
+ );
+
+ kvm_cpu_cap_init(CPUID_7_1_ECX,
+ SCATTERED_F(MSR_IMM),
+ );
+
+ kvm_cpu_cap_init(CPUID_7_1_EDX,
+ F(AVX_VNNI_INT8),
+ F(AVX_NE_CONVERT),
+ F(AMX_COMPLEX),
+ F(AVX_VNNI_INT16),
+ F(PREFETCHITI),
+ F(AVX10),
+ );
+
+ kvm_cpu_cap_init(CPUID_7_2_EDX,
+ F(INTEL_PSFD),
+ F(IPRED_CTRL),
+ F(RRSBA_CTRL),
+ F(DDPD_U),
+ F(BHI_CTRL),
+ F(MCDT_NO),
+ );
+
+ kvm_cpu_cap_init(CPUID_D_1_EAX,
+ F(XSAVEOPT),
+ F(XSAVEC),
+ F(XGETBV1),
+ F(XSAVES),
+ X86_64_F(XFD),
+ );
+
+ kvm_cpu_cap_init(CPUID_12_EAX,
+ SCATTERED_F(SGX1),
+ SCATTERED_F(SGX2),
+ SCATTERED_F(SGX_EDECCSSA),
+ );
+
+ kvm_cpu_cap_init(CPUID_24_0_EBX,
+ F(AVX10_128),
+ F(AVX10_256),
+ F(AVX10_512),
+ );
+
+ kvm_cpu_cap_init(CPUID_8000_0001_ECX,
+ F(LAHF_LM),
+ F(CMP_LEGACY),
+ VENDOR_F(SVM),
+ /* ExtApicSpace */
+ F(CR8_LEGACY),
+ F(ABM),
+ F(SSE4A),
+ F(MISALIGNSSE),
+ F(3DNOWPREFETCH),
+ F(OSVW),
+ /* IBS */
+ F(XOP),
+ /* SKINIT, WDT, LWP */
+ F(FMA4),
+ F(TBM),
+ F(TOPOEXT),
+ VENDOR_F(PERFCTR_CORE),
+ );
+
+ kvm_cpu_cap_init(CPUID_8000_0001_EDX,
+ ALIASED_1_EDX_F(FPU),
+ ALIASED_1_EDX_F(VME),
+ ALIASED_1_EDX_F(DE),
+ ALIASED_1_EDX_F(PSE),
+ ALIASED_1_EDX_F(TSC),
+ ALIASED_1_EDX_F(MSR),
+ ALIASED_1_EDX_F(PAE),
+ ALIASED_1_EDX_F(MCE),
+ ALIASED_1_EDX_F(CX8),
+ ALIASED_1_EDX_F(APIC),
+ /* Reserved */
+ F(SYSCALL),
+ ALIASED_1_EDX_F(MTRR),
+ ALIASED_1_EDX_F(PGE),
+ ALIASED_1_EDX_F(MCA),
+ ALIASED_1_EDX_F(CMOV),
+ ALIASED_1_EDX_F(PAT),
+ ALIASED_1_EDX_F(PSE36),
+ /* Reserved */
+ F(NX),
+ /* Reserved */
+ F(MMXEXT),
+ ALIASED_1_EDX_F(MMX),
+ ALIASED_1_EDX_F(FXSR),
+ F(FXSR_OPT),
+ X86_64_F(GBPAGES),
+ F(RDTSCP),
+ /* Reserved */
+ X86_64_F(LM),
+ F(3DNOWEXT),
+ F(3DNOW),
+ );
+
+ if (!tdp_enabled && IS_ENABLED(CONFIG_X86_64))
+ kvm_cpu_cap_set(X86_FEATURE_GBPAGES);
+
+ kvm_cpu_cap_init(CPUID_8000_0007_EDX,
+ SCATTERED_F(CONSTANT_TSC),
+ );
+
+ kvm_cpu_cap_init(CPUID_8000_0008_EBX,
+ F(CLZERO),
+ F(XSAVEERPTR),
+ F(WBNOINVD),
+ F(AMD_IBPB),
+ F(AMD_IBRS),
+ F(AMD_SSBD),
+ F(VIRT_SSBD),
+ F(AMD_SSB_NO),
+ F(AMD_STIBP),
+ F(AMD_STIBP_ALWAYS_ON),
+ F(AMD_IBRS_SAME_MODE),
+ PASSTHROUGH_F(EFER_LMSLE_MBZ),
+ F(AMD_PSFD),
+ F(AMD_IBPB_RET),
+ );
+
+ /*
+ * AMD has separate bits for each SPEC_CTRL bit.
+ * arch/x86/kernel/cpu/bugs.c is kind enough to
+ * record that in cpufeatures, so use them.
+ */
+ if (boot_cpu_has(X86_FEATURE_IBPB)) {
+ kvm_cpu_cap_set(X86_FEATURE_AMD_IBPB);
+ if (boot_cpu_has(X86_FEATURE_SPEC_CTRL) &&
+ !boot_cpu_has_bug(X86_BUG_EIBRS_PBRSB))
+ kvm_cpu_cap_set(X86_FEATURE_AMD_IBPB_RET);
+ }
+ if (boot_cpu_has(X86_FEATURE_IBRS))
+ kvm_cpu_cap_set(X86_FEATURE_AMD_IBRS);
+ if (boot_cpu_has(X86_FEATURE_STIBP))
+ kvm_cpu_cap_set(X86_FEATURE_AMD_STIBP);
+ if (boot_cpu_has(X86_FEATURE_SPEC_CTRL_SSBD))
+ kvm_cpu_cap_set(X86_FEATURE_AMD_SSBD);
+ if (!boot_cpu_has_bug(X86_BUG_SPEC_STORE_BYPASS))
+ kvm_cpu_cap_set(X86_FEATURE_AMD_SSB_NO);
+ /*
+ * The preference is to use the SPEC_CTRL MSR instead of the
+ * VIRT_SPEC_CTRL MSR.
+ */
+ if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) &&
+ !boot_cpu_has(X86_FEATURE_AMD_SSBD))
+ kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD);
+
+ /* All SVM features require additional vendor module enabling. */
+ kvm_cpu_cap_init(CPUID_8000_000A_EDX,
+ VENDOR_F(NPT),
+ VENDOR_F(VMCBCLEAN),
+ VENDOR_F(FLUSHBYASID),
+ VENDOR_F(NRIPS),
+ VENDOR_F(TSCRATEMSR),
+ VENDOR_F(V_VMSAVE_VMLOAD),
+ VENDOR_F(LBRV),
+ VENDOR_F(PAUSEFILTER),
+ VENDOR_F(PFTHRESHOLD),
+ VENDOR_F(VGIF),
+ VENDOR_F(VNMI),
+ VENDOR_F(SVME_ADDR_CHK),
+ );
+
+ kvm_cpu_cap_init(CPUID_8000_001F_EAX,
+ VENDOR_F(SME),
+ VENDOR_F(SEV),
+ /* VM_PAGE_FLUSH */
+ VENDOR_F(SEV_ES),
+ F(SME_COHERENT),
+ );
+
+ kvm_cpu_cap_init(CPUID_8000_0021_EAX,
+ F(NO_NESTED_DATA_BP),
+ F(WRMSR_XX_BASE_NS),
+ /*
+ * Synthesize "LFENCE is serializing" into the AMD-defined entry
+ * in KVM's supported CPUID, i.e. if the feature is reported as
+ * supported by the kernel. LFENCE_RDTSC was a Linux-defined
+ * synthetic feature long before AMD joined the bandwagon, e.g.
+ * LFENCE is serializing on most CPUs that support SSE2. On
+ * CPUs that don't support AMD's leaf, ANDing with the raw host
+ * CPUID will drop the flags, and reporting support in AMD's
+ * leaf can make it easier for userspace to detect the feature.
+ */
+ SYNTHESIZED_F(LFENCE_RDTSC),
+ /* SmmPgCfgLock */
+ /* 4: Resv */
+ SYNTHESIZED_F(VERW_CLEAR),
+ F(NULL_SEL_CLR_BASE),
+ /* UpperAddressIgnore */
+ F(AUTOIBRS),
+ F(PREFETCHI),
+ EMULATED_F(NO_SMM_CTL_MSR),
+ /* PrefetchCtlMsr */
+ /* GpOnUserCpuid */
+ /* EPSF */
+ SYNTHESIZED_F(SBPB),
+ SYNTHESIZED_F(IBPB_BRTYPE),
+ SYNTHESIZED_F(SRSO_NO),
+ F(SRSO_USER_KERNEL_NO),
+ );
+
+ kvm_cpu_cap_init(CPUID_8000_0021_ECX,
+ SYNTHESIZED_F(TSA_SQ_NO),
+ SYNTHESIZED_F(TSA_L1_NO),
+ );
+
+ kvm_cpu_cap_init(CPUID_8000_0022_EAX,
+ F(PERFMON_V2),
+ );
+
+ if (!static_cpu_has_bug(X86_BUG_NULL_SEG))
+ kvm_cpu_cap_set(X86_FEATURE_NULL_SEL_CLR_BASE);
+
+ kvm_cpu_cap_init(CPUID_C000_0001_EDX,
+ F(XSTORE),
+ F(XSTORE_EN),
+ F(XCRYPT),
+ F(XCRYPT_EN),
+ F(ACE2),
+ F(ACE2_EN),
+ F(PHE),
+ F(PHE_EN),
+ F(PMM),
+ F(PMM_EN),
+ );
+
+ /*
+ * Hide RDTSCP and RDPID if either feature is reported as supported but
+ * probing MSR_TSC_AUX failed. This is purely a sanity check and
+ * should never happen, but the guest will likely crash if RDTSCP or
+ * RDPID is misreported, and KVM has botched MSR_TSC_AUX emulation in
+ * the past. For example, the sanity check may fire if this instance of
+ * KVM is running as L1 on top of an older, broken KVM.
+ */
+ if (WARN_ON((kvm_cpu_cap_has(X86_FEATURE_RDTSCP) ||
+ kvm_cpu_cap_has(X86_FEATURE_RDPID)) &&
+ !kvm_is_supported_user_return_msr(MSR_TSC_AUX))) {
+ kvm_cpu_cap_clear(X86_FEATURE_RDTSCP);
+ kvm_cpu_cap_clear(X86_FEATURE_RDPID);
+ }
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_set_cpu_caps);
+
+#undef F
+#undef SCATTERED_F
+#undef X86_64_F
+#undef EMULATED_F
+#undef SYNTHESIZED_F
+#undef PASSTHROUGH_F
+#undef ALIASED_1_EDX_F
+#undef VENDOR_F
+#undef RUNTIME_F
+
+struct kvm_cpuid_array {
+ struct kvm_cpuid_entry2 *entries;
+ int maxnent;
+ int nent;
+};
+
+static struct kvm_cpuid_entry2 *get_next_cpuid(struct kvm_cpuid_array *array)
+{
+ if (array->nent >= array->maxnent)
+ return NULL;
+
+ return &array->entries[array->nent++];
+}
+
+static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
+ u32 function, u32 index)
+{
+ struct kvm_cpuid_entry2 *entry = get_next_cpuid(array);
+
+ if (!entry)
+ return NULL;
+
+ memset(entry, 0, sizeof(*entry));
+ entry->function = function;
+ entry->index = index;
+ switch (function & 0xC0000000) {
+ case 0x40000000:
+ /* Hypervisor leaves are always synthesized by __do_cpuid_func. */
+ return entry;
+
+ case 0x80000000:
+ /*
+ * 0x80000021 is sometimes synthesized by __do_cpuid_func, which
+ * would result in out-of-bounds calls to do_host_cpuid.
+ */
+ {
+ static int max_cpuid_80000000;
+ if (!READ_ONCE(max_cpuid_80000000))
+ WRITE_ONCE(max_cpuid_80000000, cpuid_eax(0x80000000));
+ if (function > READ_ONCE(max_cpuid_80000000))
+ return entry;
+ }
+ break;
+
+ default:
+ break;
+ }
+
+ cpuid_count(entry->function, entry->index,
+ &entry->eax, &entry->ebx, &entry->ecx, &entry->edx);
+
+ if (cpuid_function_is_indexed(function))
+ entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+
+ return entry;
+}
+
+static int cpuid_func_emulated(struct kvm_cpuid_entry2 *entry, u32 func,
+ bool include_partially_emulated)
+{
+ memset(entry, 0, sizeof(*entry));
+
+ entry->function = func;
+ entry->index = 0;
+ entry->flags = 0;
+
+ switch (func) {
+ case 0:
+ entry->eax = 7;
+ return 1;
+ case 1:
+ entry->ecx = feature_bit(MOVBE);
+ /*
+ * KVM allows userspace to enumerate MONITOR+MWAIT support to
+ * the guest, but the MWAIT feature flag is never advertised
+ * to userspace because MONITOR+MWAIT aren't virtualized by
+ * hardware, can't be faithfully emulated in software (KVM
+ * emulates them as NOPs), and allowing the guest to execute
+ * them natively requires enabling a per-VM capability.
+ */
+ if (include_partially_emulated)
+ entry->ecx |= feature_bit(MWAIT);
+ return 1;
+ case 7:
+ entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+ entry->eax = 0;
+ if (kvm_cpu_cap_has(X86_FEATURE_RDTSCP))
+ entry->ecx = feature_bit(RDPID);
+ return 1;
+ default:
+ return 0;
+ }
+}
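+
+/*
+ * Illustrative sketch (not part of the patch): the effect of
+ * include_partially_emulated can be seen for leaf 0x1:
+ *
+ *	struct kvm_cpuid_entry2 e;
+ *
+ *	cpuid_func_emulated(&e, 1, false);	// e.ecx = MOVBE
+ *	cpuid_func_emulated(&e, 1, true);	// e.ecx = MOVBE | MWAIT
+ *
+ * i.e. MWAIT is enumerated only when the caller explicitly opts in to
+ * partially emulated features.
+ */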
+
+static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func)
+{
+ if (array->nent >= array->maxnent)
+ return -E2BIG;
+
+ array->nent += cpuid_func_emulated(&array->entries[array->nent], func, false);
+ return 0;
+}
+
+static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
+{
+ struct kvm_cpuid_entry2 *entry;
+ int r, i, max_idx;
+
+ /* all calls to cpuid_count() should be made on the same cpu */
+ get_cpu();
+
+ r = -E2BIG;
+
+ entry = do_host_cpuid(array, function, 0);
+ if (!entry)
+ goto out;
+
+ switch (function) {
+ case 0:
+ /* Limited to the highest leaf implemented in KVM. */
+ entry->eax = min(entry->eax, 0x24U);
+ break;
+ case 1:
+ cpuid_entry_override(entry, CPUID_1_EDX);
+ cpuid_entry_override(entry, CPUID_1_ECX);
+ break;
+ case 2:
+ /*
+ * On ancient CPUs, function 2 entries are STATEFUL. That is,
+ * CPUID(function=2, index=0) may return different results each
+ * time, with the least-significant byte in EAX enumerating the
+ * number of times software should do CPUID(2, 0).
+ *
+ * Modern CPUs, i.e. every CPU KVM has *ever* run on, are less
+ * idiotic. Intel's SDM states that EAX & 0xff "will always
+ * return 01H. Software should ignore this value and not
+ * interpret it as an informational descriptor", while AMD's
+ * APM states that CPUID(2) is reserved.
+ *
+ * WARN if a Frankenstein CPU that supports virtualization and
+ * a stateful CPUID.0x2 is encountered.
+ */
+ WARN_ON_ONCE((entry->eax & 0xff) > 1);
+ break;
+ /* Functions 4 and 0x8000001d have an additional index. */
+ case 4:
+ case 0x8000001d:
+ /*
+ * Read entries until the cache type in the previous entry is
+ * zero, i.e. indicates an invalid entry.
+ */
+ for (i = 1; entry->eax & 0x1f; ++i) {
+ entry = do_host_cpuid(array, function, i);
+ if (!entry)
+ goto out;
+ }
+ break;
+ case 6: /* Thermal management */
+ entry->eax = 0x4; /* allow ARAT */
+ entry->ebx = 0;
+ entry->ecx = 0;
+ entry->edx = 0;
+ break;
+ /* Function 7 has an additional index. */
+ case 7:
+ max_idx = entry->eax = min(entry->eax, 2u);
+ cpuid_entry_override(entry, CPUID_7_0_EBX);
+ cpuid_entry_override(entry, CPUID_7_ECX);
+ cpuid_entry_override(entry, CPUID_7_EDX);
+
+ /* KVM only supports up to 0x7.2, capped above via min(). */
+ if (max_idx >= 1) {
+ entry = do_host_cpuid(array, function, 1);
+ if (!entry)
+ goto out;
+
+ cpuid_entry_override(entry, CPUID_7_1_EAX);
+ cpuid_entry_override(entry, CPUID_7_1_ECX);
+ cpuid_entry_override(entry, CPUID_7_1_EDX);
+ entry->ebx = 0;
+ }
+ if (max_idx >= 2) {
+ entry = do_host_cpuid(array, function, 2);
+ if (!entry)
+ goto out;
+
+ cpuid_entry_override(entry, CPUID_7_2_EDX);
+ entry->ecx = 0;
+ entry->ebx = 0;
+ entry->eax = 0;
+ }
+ break;
+ case 0xa: { /* Architectural Performance Monitoring */
+ union cpuid10_eax eax = { };
+ union cpuid10_edx edx = { };
+
+ if (!enable_pmu || !static_cpu_has(X86_FEATURE_ARCH_PERFMON)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+
+ eax.split.version_id = kvm_pmu_cap.version;
+ eax.split.num_counters = kvm_pmu_cap.num_counters_gp;
+ eax.split.bit_width = kvm_pmu_cap.bit_width_gp;
+ eax.split.mask_length = kvm_pmu_cap.events_mask_len;
+ edx.split.num_counters_fixed = kvm_pmu_cap.num_counters_fixed;
+ edx.split.bit_width_fixed = kvm_pmu_cap.bit_width_fixed;
+
+ if (kvm_pmu_cap.version)
+ edx.split.anythread_deprecated = 1;
+
+ entry->eax = eax.full;
+ entry->ebx = kvm_pmu_cap.events_mask;
+ entry->ecx = 0;
+ entry->edx = edx.full;
+ break;
+ }
+ case 0x1f:
+ case 0xb:
+ /*
+ * No topology; a valid topology is indicated by the presence
+ * of subleaf 1.
+ */
+ entry->eax = entry->ebx = entry->ecx = 0;
+ break;
+ case 0xd: {
+ u64 permitted_xcr0 = kvm_get_filtered_xcr0();
+ u64 permitted_xss = kvm_caps.supported_xss;
+
+ entry->eax &= permitted_xcr0;
+ entry->ebx = xstate_required_size(permitted_xcr0, false);
+ entry->ecx = entry->ebx;
+ entry->edx &= permitted_xcr0 >> 32;
+ if (!permitted_xcr0)
+ break;
+
+ entry = do_host_cpuid(array, function, 1);
+ if (!entry)
+ goto out;
+
+ cpuid_entry_override(entry, CPUID_D_1_EAX);
+ if (entry->eax & (feature_bit(XSAVES) | feature_bit(XSAVEC)))
+ entry->ebx = xstate_required_size(permitted_xcr0 | permitted_xss,
+ true);
+ else {
+ WARN_ON_ONCE(permitted_xss != 0);
+ entry->ebx = 0;
+ }
+ entry->ecx &= permitted_xss;
+ entry->edx &= permitted_xss >> 32;
+
+ for (i = 2; i < 64; ++i) {
+ bool s_state;
+ if (permitted_xcr0 & BIT_ULL(i))
+ s_state = false;
+ else if (permitted_xss & BIT_ULL(i))
+ s_state = true;
+ else
+ continue;
+
+ entry = do_host_cpuid(array, function, i);
+ if (!entry)
+ goto out;
+
+ /*
+ * The supported check above should have filtered out
+ * invalid sub-leaves. Only valid sub-leaves should
+ * reach this point, and they should have a non-zero
+ * save state size. Furthermore, check whether the
+ * processor agrees with permitted_xcr0/permitted_xss
+ * on whether this is an XCR0- or IA32_XSS-managed area.
+ */
+ if (WARN_ON_ONCE(!entry->eax || (entry->ecx & 0x1) != s_state)) {
+ --array->nent;
+ continue;
+ }
+
+ if (!kvm_cpu_cap_has(X86_FEATURE_XFD))
+ entry->ecx &= ~BIT_ULL(2);
+ entry->edx = 0;
+ }
+ break;
+ }
+ case 0x12:
+ /* Intel SGX */
+ if (!kvm_cpu_cap_has(X86_FEATURE_SGX)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+
+ /*
+ * Index 0: Sub-features, MISCSELECT (a.k.a. extended features)
+ * and max enclave sizes. The SGX sub-features and MISCSELECT
+ * are restricted by kernel and KVM capabilities (like most
+ * feature flags), while enclave size is unrestricted.
+ */
+ cpuid_entry_override(entry, CPUID_12_EAX);
+ entry->ebx &= SGX_MISC_EXINFO;
+
+ entry = do_host_cpuid(array, function, 1);
+ if (!entry)
+ goto out;
+
+ /*
+ * Index 1: SECS.ATTRIBUTES. ATTRIBUTES are restricted a la
+ * feature flags. Advertise all supported flags, including
+ * privileged attributes that require explicit opt-in from
+ * userspace. ATTRIBUTES.XFRM is not adjusted as userspace is
+ * expected to derive it from supported XCR0.
+ */
+ entry->eax &= SGX_ATTR_PRIV_MASK | SGX_ATTR_UNPRIV_MASK;
+ entry->ebx &= 0;
+ break;
+ /* Intel PT */
+ case 0x14:
+ if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+
+ for (i = 1, max_idx = entry->eax; i <= max_idx; ++i) {
+ if (!do_host_cpuid(array, function, i))
+ goto out;
+ }
+ break;
+ /* Intel AMX TILE */
+ case 0x1d:
+ if (!kvm_cpu_cap_has(X86_FEATURE_AMX_TILE)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+
+ for (i = 1, max_idx = entry->eax; i <= max_idx; ++i) {
+ if (!do_host_cpuid(array, function, i))
+ goto out;
+ }
+ break;
+ case 0x1e: /* TMUL information */
+ if (!kvm_cpu_cap_has(X86_FEATURE_AMX_TILE)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+ break;
+ case 0x24: {
+ u8 avx10_version;
+
+ if (!kvm_cpu_cap_has(X86_FEATURE_AVX10)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+
+ /*
+ * The AVX10 version is encoded in EBX[7:0] and is guaranteed to be
+ * >=1 if AVX10 is supported; KVM caps the advertised version at 1.
+ * Note, the version needs to be captured before overriding the EBX
+ * features!
+ */
+ avx10_version = min_t(u8, entry->ebx & 0xff, 1);
+ cpuid_entry_override(entry, CPUID_24_0_EBX);
+ entry->ebx |= avx10_version;
+
+ entry->eax = 0;
+ entry->ecx = 0;
+ entry->edx = 0;
+ break;
+ }
+ case KVM_CPUID_SIGNATURE: {
+ const u32 *sigptr = (const u32 *)KVM_SIGNATURE;
+ entry->eax = KVM_CPUID_FEATURES;
+ entry->ebx = sigptr[0];
+ entry->ecx = sigptr[1];
+ entry->edx = sigptr[2];
+ break;
+ }
+ case KVM_CPUID_FEATURES:
+ entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
+ (1 << KVM_FEATURE_NOP_IO_DELAY) |
+ (1 << KVM_FEATURE_CLOCKSOURCE2) |
+ (1 << KVM_FEATURE_ASYNC_PF) |
+ (1 << KVM_FEATURE_PV_EOI) |
+ (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+ (1 << KVM_FEATURE_PV_UNHALT) |
+ (1 << KVM_FEATURE_PV_TLB_FLUSH) |
+ (1 << KVM_FEATURE_ASYNC_PF_VMEXIT) |
+ (1 << KVM_FEATURE_PV_SEND_IPI) |
+ (1 << KVM_FEATURE_POLL_CONTROL) |
+ (1 << KVM_FEATURE_PV_SCHED_YIELD) |
+ (1 << KVM_FEATURE_ASYNC_PF_INT);
+
+ if (sched_info_on())
+ entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
+
+ entry->ebx = 0;
+ entry->ecx = 0;
+ entry->edx = 0;
+ break;
+ case 0x80000000:
+ entry->eax = min(entry->eax, 0x80000022);
+ /*
+ * Serializing LFENCE is reported in a multitude of ways, and
+ * NullSegClearsBase is not reported in CPUID on Zen2; help
+ * userspace by providing the CPUID leaf ourselves.
+ *
+ * However, only do it if the host has CPUID leaf 0x8000001d.
+ * QEMU thinks that it can query the host blindly for that
+ * CPUID leaf if KVM reports that it supports 0x8000001d or
+ * above. The processor merrily returns values from the
+ * highest Intel leaf which QEMU tries to use as the guest's
+ * 0x8000001d. Even worse, this can result in an infinite
+ * loop if said highest leaf has no subleaves indexed by ECX.
+ */
+ if (entry->eax >= 0x8000001d &&
+ (static_cpu_has(X86_FEATURE_LFENCE_RDTSC)
+ || !static_cpu_has_bug(X86_BUG_NULL_SEG)))
+ entry->eax = max(entry->eax, 0x80000021);
+ break;
+ case 0x80000001:
+ entry->ebx &= ~GENMASK(27, 16);
+ cpuid_entry_override(entry, CPUID_8000_0001_EDX);
+ cpuid_entry_override(entry, CPUID_8000_0001_ECX);
+ break;
+ case 0x80000005:
+ /* Pass host L1 cache and TLB info. */
+ break;
+ case 0x80000006:
+ /* Drop reserved bits, pass host L2 cache and TLB info. */
+ entry->edx &= ~GENMASK(17, 16);
+ break;
+ case 0x80000007: /* Advanced power management */
+ cpuid_entry_override(entry, CPUID_8000_0007_EDX);
+
+ /* mask against host */
+ entry->edx &= boot_cpu_data.x86_power;
+ entry->eax = entry->ebx = entry->ecx = 0;
+ break;
+ case 0x80000008: {
+ /*
+ * GuestPhysAddrSize (EAX[23:16]) is intended for software
+ * use.
+ *
+ * KVM's ABI is to report the effective MAXPHYADDR for the
+ * guest in PhysAddrSize (phys_as), and the maximum
+ * *addressable* GPA in GuestPhysAddrSize (g_phys_as).
+ *
+ * GuestPhysAddrSize is valid if and only if TDP is enabled,
+ * in which case the max GPA that can be addressed by KVM may
+ * be less than the max GPA that can be legally generated by
+ * the guest, e.g. if MAXPHYADDR>48 but the CPU doesn't
+ * support 5-level TDP.
+ */
+ unsigned int virt_as = max((entry->eax >> 8) & 0xff, 48U);
+ unsigned int phys_as, g_phys_as;
+
+ /*
+ * If TDP (NPT) is disabled use the adjusted host MAXPHYADDR as
+ * the guest operates in the same PA space as the host, i.e.
+ * reductions in MAXPHYADDR for memory encryption affect shadow
+ * paging, too.
+ *
+ * If TDP is enabled, use the raw bare metal MAXPHYADDR as
+ * reductions to the HPAs do not affect GPAs. The max
+ * addressable GPA is the same as the max effective GPA, except
+ * that it's capped at 48 bits if 5-level TDP isn't supported
+ * (hardware processes bits 51:48 only when walking the fifth
+ * level page table).
+ */
+ if (!tdp_enabled) {
+ phys_as = boot_cpu_data.x86_phys_bits;
+ g_phys_as = 0;
+ } else {
+ phys_as = entry->eax & 0xff;
+ g_phys_as = phys_as;
+ if (kvm_mmu_get_max_tdp_level() < 5)
+ g_phys_as = min(g_phys_as, 48U);
+ }
+
+ entry->eax = phys_as | (virt_as << 8) | (g_phys_as << 16);
+ entry->ecx &= ~(GENMASK(31, 16) | GENMASK(11, 8));
+ entry->edx = 0;
+ cpuid_entry_override(entry, CPUID_8000_0008_EBX);
+ break;
+ }
+ case 0x8000000A:
+ if (!kvm_cpu_cap_has(X86_FEATURE_SVM)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+ entry->eax = 1; /* SVM revision 1 */
+ entry->ebx = 8; /* Let's support 8 ASIDs in case we add proper
+ ASID emulation to nested SVM. */
+ entry->ecx = 0; /* Reserved */
+ cpuid_entry_override(entry, CPUID_8000_000A_EDX);
+ break;
+ case 0x80000019:
+ entry->ecx = entry->edx = 0;
+ break;
+ case 0x8000001a:
+ entry->eax &= GENMASK(2, 0);
+ entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ case 0x8000001e:
+ /* Do not return host topology information. */
+ entry->eax = entry->ebx = entry->ecx = 0;
+ entry->edx = 0; /* reserved */
+ break;
+ case 0x8000001F:
+ if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) {
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ } else {
+ cpuid_entry_override(entry, CPUID_8000_001F_EAX);
+ /* Clear NumVMPL since KVM does not support VMPL. */
+ entry->ebx &= ~GENMASK(31, 12);
+ /*
+ * Enumerate '0' for "PA bits reduction", the adjusted
+ * MAXPHYADDR is enumerated directly (see 0x80000008).
+ */
+ entry->ebx &= ~GENMASK(11, 6);
+ }
+ break;
+ case 0x80000020:
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ case 0x80000021:
+ entry->ebx = entry->edx = 0;
+ cpuid_entry_override(entry, CPUID_8000_0021_EAX);
+ cpuid_entry_override(entry, CPUID_8000_0021_ECX);
+ break;
+ /* AMD Extended Performance Monitoring and Debug */
+ case 0x80000022: {
+ union cpuid_0x80000022_ebx ebx = { };
+
+ entry->ecx = entry->edx = 0;
+ if (!enable_pmu || !kvm_cpu_cap_has(X86_FEATURE_PERFMON_V2)) {
+ entry->eax = entry->ebx = 0;
+ break;
+ }
+
+ cpuid_entry_override(entry, CPUID_8000_0022_EAX);
+
+ ebx.split.num_core_pmc = kvm_pmu_cap.num_counters_gp;
+ entry->ebx = ebx.full;
+ break;
+ }
+ /* Add support for Centaur's CPUID instruction. */
+ case 0xC0000000:
+ /* Just support up to 0xC0000004 for now. */
+ entry->eax = min(entry->eax, 0xC0000004);
+ break;
+ case 0xC0000001:
+ cpuid_entry_override(entry, CPUID_C000_0001_EDX);
+ break;
+ case 3: /* Processor serial number */
+ case 5: /* MONITOR/MWAIT */
+ case 0xC0000002:
+ case 0xC0000003:
+ case 0xC0000004:
+ default:
+ entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+ break;
+ }
+
+ r = 0;
+
+out:
+ put_cpu();
+
+ return r;
+}
+
+static int do_cpuid_func(struct kvm_cpuid_array *array, u32 func,
+ unsigned int type)
+{
+ if (type == KVM_GET_EMULATED_CPUID)
+ return __do_cpuid_func_emulated(array, func);
+
+ return __do_cpuid_func(array, func);
+}
+
+#define CENTAUR_CPUID_SIGNATURE 0xC0000000
+
+static int get_cpuid_func(struct kvm_cpuid_array *array, u32 func,
+ unsigned int type)
+{
+ u32 limit;
+ int r;
+
+ if (func == CENTAUR_CPUID_SIGNATURE &&
+ boot_cpu_data.x86_vendor != X86_VENDOR_CENTAUR &&
+ boot_cpu_data.x86_vendor != X86_VENDOR_ZHAOXIN)
+ return 0;
+
+ r = do_cpuid_func(array, func, type);
+ if (r)
+ return r;
+
+ limit = array->entries[array->nent - 1].eax;
+ for (func = func + 1; func <= limit; ++func) {
+ r = do_cpuid_func(array, func, type);
+ if (r)
+ break;
+ }
+
+ return r;
+}
+
+static bool sanity_check_entries(struct kvm_cpuid_entry2 __user *entries,
+ __u32 num_entries, unsigned int ioctl_type)
+{
+ int i;
+ __u32 pad[3];
+
+ if (ioctl_type != KVM_GET_EMULATED_CPUID)
+ return false;
+
+ /*
+ * We want to make sure that ->padding is being passed clean from
+ * userspace in case we want to use it for something in the future.
+ *
+ * Sadly, this wasn't enforced for KVM_GET_SUPPORTED_CPUID, so we
+ * have to content ourselves with enforcing it only on the emulated
+ * side. /me sheds a tear.
+ */
+ for (i = 0; i < num_entries; i++) {
+ if (copy_from_user(pad, entries[i].padding, sizeof(pad)))
+ return true;
+
+ if (pad[0] || pad[1] || pad[2])
+ return true;
+ }
+ return false;
+}
+
+int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid,
+ struct kvm_cpuid_entry2 __user *entries,
+ unsigned int type)
+{
+ static const u32 funcs[] = {
+ 0, 0x80000000, CENTAUR_CPUID_SIGNATURE, KVM_CPUID_SIGNATURE,
+ };
+
+ struct kvm_cpuid_array array = {
+ .nent = 0,
+ };
+ int r, i;
+
+ if (cpuid->nent < 1)
+ return -E2BIG;
+ if (cpuid->nent > KVM_MAX_CPUID_ENTRIES)
+ cpuid->nent = KVM_MAX_CPUID_ENTRIES;
+
+ if (sanity_check_entries(entries, cpuid->nent, type))
+ return -EINVAL;
+
+ array.entries = kvcalloc(cpuid->nent, sizeof(struct kvm_cpuid_entry2), GFP_KERNEL);
+ if (!array.entries)
+ return -ENOMEM;
+
+ array.maxnent = cpuid->nent;
+
+ for (i = 0; i < ARRAY_SIZE(funcs); i++) {
+ r = get_cpuid_func(&array, funcs[i], type);
+ if (r)
+ goto out_free;
+ }
+ cpuid->nent = array.nent;
+
+ if (copy_to_user(entries, array.entries,
+ array.nent * sizeof(struct kvm_cpuid_entry2)))
+ r = -EFAULT;
+
+out_free:
+ kvfree(array.entries);
+ return r;
+}
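+
+/*
+ * Illustrative userspace sketch (hypothetical buffer sizes, not part of
+ * the patch): KVM_GET_SUPPORTED_CPUID reaches this function through the
+ * /dev/kvm ioctl path, and userspace is expected to retry with a larger
+ * buffer when it gets -E2BIG:
+ *
+ *	int nent = 64;
+ *	struct kvm_cpuid2 *cpuid;
+ *
+ *	for (;;) {
+ *		cpuid = calloc(1, sizeof(*cpuid) +
+ *				  nent * sizeof(struct kvm_cpuid_entry2));
+ *		cpuid->nent = nent;
+ *		if (ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, cpuid) == 0)
+ *			break;		// cpuid->nent now holds the real count
+ *		free(cpuid);
+ *		if (errno != E2BIG)
+ *			return -1;	// hypothetical error handling
+ *		nent *= 2;
+ *	}
+ */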
+
+/*
+ * Intel CPUID semantics treats any query for an out-of-range leaf as if the
+ * highest basic leaf (i.e. CPUID.0H:EAX) were requested. AMD CPUID semantics
+ * returns all zeroes for any undefined leaf, whether or not the leaf is in
+ * range. Centaur/VIA follows Intel semantics.
+ *
+ * A leaf is considered out-of-range if its function is higher than the maximum
+ * supported leaf of its associated class or if its associated class does not
+ * exist.
+ *
+ * There are four primary classes to be considered, with their respective
+ * ranges described as "<base> - <top>[,<base2> - <top2>]", inclusive. A primary
+ * class exists if a guest CPUID entry for its <base> leaf exists. For a given
+ * class, CPUID.<base>.EAX contains the max supported leaf for the class.
+ *
+ * - Basic: 0x00000000 - 0x3fffffff, 0x50000000 - 0x7fffffff
+ * - Hypervisor: 0x40000000 - 0x4fffffff
+ * - Extended: 0x80000000 - 0xbfffffff
+ * - Centaur: 0xc0000000 - 0xcfffffff
+ *
+ * The Hypervisor class is further subdivided into sub-classes that each act as
+ * their own independent class associated with a range of 0x100 functions. E.g.
+ * if QEMU is advertising support for both Hyper-V and KVM, the resulting
+ * Hypervisor CPUID sub-classes are:
+ *
+ * - Hyper-V: 0x40000000 - 0x400000ff
+ * - KVM: 0x40000100 - 0x400001ff
+ */
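+
+/*
+ * Worked example (hypothetical guest CPUID): if the guest's max basic leaf
+ * is 0xd and userspace defined no Extended class entries at all, then for
+ * an Intel-like guest a query of 0x80000008 is out-of-range and is serviced
+ * as CPUID.0xd (with the originally requested index), whereas an AMD-like
+ * guest is simply handed all zeroes by the caller.
+ */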
+static struct kvm_cpuid_entry2 *
+get_out_of_range_cpuid_entry(struct kvm_vcpu *vcpu, u32 *fn_ptr, u32 index)
+{
+ struct kvm_cpuid_entry2 *basic, *class;
+ u32 function = *fn_ptr;
+
+ basic = kvm_find_cpuid_entry(vcpu, 0);
+ if (!basic)
+ return NULL;
+
+ if (is_guest_vendor_amd(basic->ebx, basic->ecx, basic->edx) ||
+ is_guest_vendor_hygon(basic->ebx, basic->ecx, basic->edx))
+ return NULL;
+
+ if (function >= 0x40000000 && function <= 0x4fffffff)
+ class = kvm_find_cpuid_entry(vcpu, function & 0xffffff00);
+ else if (function >= 0xc0000000)
+ class = kvm_find_cpuid_entry(vcpu, 0xc0000000);
+ else
+ class = kvm_find_cpuid_entry(vcpu, function & 0x80000000);
+
+ if (class && function <= class->eax)
+ return NULL;
+
+ /*
+ * Leaf specific adjustments are also applied when redirecting to the
+ * max basic entry, e.g. if the max basic leaf is 0xb but there is no
+ * entry for CPUID.0xb.index (see below), then the output value for EDX
+ * needs to be pulled from CPUID.0xb.1.
+ */
+ *fn_ptr = basic->eax;
+
+ /*
+ * The class does not exist or the requested function is out of range;
+ * the effective CPUID entry is the max basic leaf. Note, the index of
+ * the original requested leaf is observed!
+ */
+ return kvm_find_cpuid_entry_index(vcpu, basic->eax, index);
+}
+
+bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
+ u32 *ecx, u32 *edx, bool exact_only)
+{
+ u32 orig_function = *eax, function = *eax, index = *ecx;
+ struct kvm_cpuid_entry2 *entry;
+ bool exact, used_max_basic = false;
+
+ if (vcpu->arch.cpuid_dynamic_bits_dirty)
+ kvm_update_cpuid_runtime(vcpu);
+
+ entry = kvm_find_cpuid_entry_index(vcpu, function, index);
+ exact = !!entry;
+
+ if (!entry && !exact_only) {
+ entry = get_out_of_range_cpuid_entry(vcpu, &function, index);
+ used_max_basic = !!entry;
+ }
+
+ if (entry) {
+ *eax = entry->eax;
+ *ebx = entry->ebx;
+ *ecx = entry->ecx;
+ *edx = entry->edx;
+ if (function == 7 && index == 0) {
+ u64 data;
+ if ((*ebx & (feature_bit(RTM) | feature_bit(HLE))) &&
+ !kvm_msr_read(vcpu, MSR_IA32_TSX_CTRL, &data) &&
+ (data & TSX_CTRL_CPUID_CLEAR))
+ *ebx &= ~(feature_bit(RTM) | feature_bit(HLE));
+ } else if (function == 0x80000007) {
+ if (kvm_hv_invtsc_suppressed(vcpu))
+ *edx &= ~feature_bit(CONSTANT_TSC);
+ } else if (IS_ENABLED(CONFIG_KVM_XEN) &&
+ kvm_xen_is_tsc_leaf(vcpu, function)) {
+ /*
+ * Update guest TSC frequency information if necessary.
+ * Ignore failures, there is no sane value that can be
+ * provided if KVM can't get the TSC frequency.
+ */
+ if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu))
+ kvm_guest_time_update(vcpu);
+
+ if (index == 1) {
+ *ecx = vcpu->arch.pvclock_tsc_mul;
+ *edx = vcpu->arch.pvclock_tsc_shift;
+ } else if (index == 2) {
+ *eax = vcpu->arch.hw_tsc_khz;
+ }
+ }
+ } else {
+ *eax = *ebx = *ecx = *edx = 0;
+ /*
+ * When leaf 0BH or 1FH is defined, CL is pass-through
+ * and EDX is always the x2APIC ID, even for undefined
+ * subleaves. Index 1 will exist iff the leaf is
+ * implemented, so we pass through CL iff leaf 1
+ * exists. EDX can be copied from any existing index.
+ */
+ if (function == 0xb || function == 0x1f) {
+ entry = kvm_find_cpuid_entry_index(vcpu, function, 1);
+ if (entry) {
+ *ecx = index & 0xff;
+ *edx = entry->edx;
+ }
+ }
+ }
+ trace_kvm_cpuid(orig_function, index, *eax, *ebx, *ecx, *edx, exact,
+ used_max_basic);
+ return exact;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_cpuid);
+
+int kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
+{
+ u32 eax, ebx, ecx, edx;
+
+ if (cpuid_fault_enabled(vcpu) && !kvm_require_cpl(vcpu, 0))
+ return 1;
+
+ eax = kvm_rax_read(vcpu);
+ ecx = kvm_rcx_read(vcpu);
+ kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, false);
+ kvm_rax_write(vcpu, eax);
+ kvm_rbx_write(vcpu, ebx);
+ kvm_rcx_write(vcpu, ecx);
+ kvm_rdx_write(vcpu, edx);
+ return kvm_skip_emulated_instruction(vcpu);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_emulate_cpuid);
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
new file mode 100644
index 000000000000..d3f5ae15a7ca
--- /dev/null
+++ b/arch/x86/kvm/cpuid.h
@@ -0,0 +1,293 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ARCH_X86_KVM_CPUID_H
+#define ARCH_X86_KVM_CPUID_H
+
+#include "reverse_cpuid.h"
+#include <asm/cpu.h>
+#include <asm/processor.h>
+#include <uapi/asm/kvm_para.h>
+
+extern u32 kvm_cpu_caps[NR_KVM_CPU_CAPS] __read_mostly;
+void kvm_set_cpu_caps(void);
+
+void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
+struct kvm_cpuid_entry2 *kvm_find_cpuid_entry2(struct kvm_cpuid_entry2 *entries,
+ int nent, u32 function, u64 index);
+/*
+ * Magic value used by KVM when querying userspace-provided CPUID entries for
+ * which KVM doesn't care about the CPUID index, i.e. for functions whose
+ * index is not significant. Note, this magic value must have at least one
+ * bit set in bits[63:32] and must be consumed as a u64 by
+ * kvm_find_cpuid_entry2() to avoid false positives when processing guest
+ * CPUID input: guest-provided indices are 32-bit values, so a value with
+ * bits[63:32] set can never collide with a real index.
+ *
+ * KVM_CPUID_INDEX_NOT_SIGNIFICANT should never be used directly outside of
+ * kvm_find_cpuid_entry2() and kvm_find_cpuid_entry().
+ */
+#define KVM_CPUID_INDEX_NOT_SIGNIFICANT -1ull
+
+static inline struct kvm_cpuid_entry2 *kvm_find_cpuid_entry_index(struct kvm_vcpu *vcpu,
+ u32 function, u32 index)
+{
+ return kvm_find_cpuid_entry2(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent,
+ function, index);
+}
+
+static inline struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
+ u32 function)
+{
+ return kvm_find_cpuid_entry2(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent,
+ function, KVM_CPUID_INDEX_NOT_SIGNIFICANT);
+}
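+
+/*
+ * Usage sketch (illustrative): the _index variant is needed only for leaves
+ * where ECX selects a subleaf, e.g.
+ *
+ *	best = kvm_find_cpuid_entry(vcpu, 0x1);		  // index ignored
+ *	sub  = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);  // CPUID.0x7.1
+ */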
+
+int kvm_dev_ioctl_get_cpuid(struct kvm_cpuid2 *cpuid,
+ struct kvm_cpuid_entry2 __user *entries,
+ unsigned int type);
+int kvm_vcpu_ioctl_set_cpuid(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid *cpuid,
+ struct kvm_cpuid_entry __user *entries);
+int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid2 *cpuid,
+ struct kvm_cpuid_entry2 __user *entries);
+int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
+ struct kvm_cpuid2 *cpuid,
+ struct kvm_cpuid_entry2 __user *entries);
+bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
+ u32 *ecx, u32 *edx, bool exact_only);
+
+void __init kvm_init_xstate_sizes(void);
+u32 xstate_required_size(u64 xstate_bv, bool compacted);
+
+int cpuid_query_maxphyaddr(struct kvm_vcpu *vcpu);
+int cpuid_query_maxguestphyaddr(struct kvm_vcpu *vcpu);
+u64 kvm_vcpu_reserved_gpa_bits_raw(struct kvm_vcpu *vcpu);
+
+static inline int cpuid_maxphyaddr(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.maxphyaddr;
+}
+
+static inline bool kvm_vcpu_is_legal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ return !(gpa & vcpu->arch.reserved_gpa_bits);
+}
+
+static inline bool kvm_vcpu_is_legal_aligned_gpa(struct kvm_vcpu *vcpu,
+ gpa_t gpa, gpa_t alignment)
+{
+ return IS_ALIGNED(gpa, alignment) && kvm_vcpu_is_legal_gpa(vcpu, gpa);
+}
+
+static inline bool page_address_valid(struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+ return kvm_vcpu_is_legal_aligned_gpa(vcpu, gpa, PAGE_SIZE);
+}
+
+static __always_inline void cpuid_entry_override(struct kvm_cpuid_entry2 *entry,
+ unsigned int leaf)
+{
+ u32 *reg = cpuid_entry_get_reg(entry, leaf * 32);
+
+ BUILD_BUG_ON(leaf >= ARRAY_SIZE(kvm_cpu_caps));
+ *reg = kvm_cpu_caps[leaf];
+}
+
+static __always_inline bool guest_cpuid_has(struct kvm_vcpu *vcpu,
+ unsigned int x86_feature)
+{
+ const struct cpuid_reg cpuid = x86_feature_cpuid(x86_feature);
+ struct kvm_cpuid_entry2 *entry;
+ u32 *reg;
+
+ /*
+ * XSAVES is a special snowflake. Due to lack of a dedicated intercept
+ * on SVM, KVM must assume that XSAVES (and thus XRSTORS) is usable by
+ * the guest if the host supports XSAVES and *XSAVE* is exposed to the
+ * guest. Because the guest can execute XSAVES and XRSTORS, i.e. can
+ * indirectly consume XSS, KVM must ensure XSS is zeroed when running
+ * the guest, i.e. must set XSAVES in vCPU capabilities. But to reject
+ * direct XSS reads and writes (to minimize the virtualization hole and
+ * honor userspace's CPUID), KVM needs to check the raw guest CPUID,
+ * not KVM's view of guest capabilities.
+ *
+ * For all other features, guest capabilities are accurate. Expand
+ * this allowlist with extreme vigilance.
+ */
+ BUILD_BUG_ON(x86_feature != X86_FEATURE_XSAVES);
+
+ entry = kvm_find_cpuid_entry_index(vcpu, cpuid.function, cpuid.index);
+ if (!entry)
+ return false;
+
+ reg = __cpuid_entry_get_reg(entry, cpuid.reg);
+ if (!reg)
+ return false;
+
+ return *reg & __feature_bit(x86_feature);
+}
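+
+/*
+ * Illustrative call-site sketch (assumed handler shape, not taken from this
+ * patch): the raw-CPUID check above is what allows an XSS access to be
+ * rejected even though KVM's own capability view forces XSAVES on:
+ *
+ *	case MSR_IA32_XSS:
+ *		if (!msr_info->host_initiated &&
+ *		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
+ *			return 1;	// reject: XSAVES not in guest CPUID
+ *		...
+ */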
+
+static inline bool guest_cpuid_is_amd_compatible(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.is_amd_compatible;
+}
+
+static inline bool guest_cpuid_is_intel_compatible(struct kvm_vcpu *vcpu)
+{
+ return !guest_cpuid_is_amd_compatible(vcpu);
+}
+
+static inline int guest_cpuid_family(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, 0x1);
+ if (!best)
+ return -1;
+
+ return x86_family(best->eax);
+}
+
+static inline int guest_cpuid_model(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, 0x1);
+ if (!best)
+ return -1;
+
+ return x86_model(best->eax);
+}
+
+static inline bool cpuid_model_is_consistent(struct kvm_vcpu *vcpu)
+{
+ return boot_cpu_data.x86_model == guest_cpuid_model(vcpu);
+}
+
+static inline int guest_cpuid_stepping(struct kvm_vcpu *vcpu)
+{
+ struct kvm_cpuid_entry2 *best;
+
+ best = kvm_find_cpuid_entry(vcpu, 0x1);
+ if (!best)
+ return -1;
+
+ return x86_stepping(best->eax);
+}
+
+static inline bool supports_cpuid_fault(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.msr_platform_info & MSR_PLATFORM_INFO_CPUID_FAULT;
+}
+
+static inline bool cpuid_fault_enabled(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.msr_misc_features_enables &
+ MSR_MISC_FEATURES_ENABLES_CPUID_FAULT;
+}
+
+static __always_inline void kvm_cpu_cap_clear(unsigned int x86_feature)
+{
+ unsigned int x86_leaf = __feature_leaf(x86_feature);
+
+ kvm_cpu_caps[x86_leaf] &= ~__feature_bit(x86_feature);
+}
+
+static __always_inline void kvm_cpu_cap_set(unsigned int x86_feature)
+{
+ unsigned int x86_leaf = __feature_leaf(x86_feature);
+
+ kvm_cpu_caps[x86_leaf] |= __feature_bit(x86_feature);
+}
+
+static __always_inline u32 kvm_cpu_cap_get(unsigned int x86_feature)
+{
+ unsigned int x86_leaf = __feature_leaf(x86_feature);
+
+ return kvm_cpu_caps[x86_leaf] & __feature_bit(x86_feature);
+}
+
+static __always_inline bool kvm_cpu_cap_has(unsigned int x86_feature)
+{
+ return !!kvm_cpu_cap_get(x86_feature);
+}
+
+static __always_inline void kvm_cpu_cap_check_and_set(unsigned int x86_feature)
+{
+ if (boot_cpu_has(x86_feature))
+ kvm_cpu_cap_set(x86_feature);
+}
+
+static __always_inline bool guest_pv_has(struct kvm_vcpu *vcpu,
+ unsigned int kvm_feature)
+{
+ if (!vcpu->arch.pv_cpuid.enforce)
+ return true;
+
+ return vcpu->arch.pv_cpuid.features & (1u << kvm_feature);
+}
+
+static __always_inline void guest_cpu_cap_set(struct kvm_vcpu *vcpu,
+ unsigned int x86_feature)
+{
+ unsigned int x86_leaf = __feature_leaf(x86_feature);
+
+ vcpu->arch.cpu_caps[x86_leaf] |= __feature_bit(x86_feature);
+}
+
+static __always_inline void guest_cpu_cap_clear(struct kvm_vcpu *vcpu,
+ unsigned int x86_feature)
+{
+ unsigned int x86_leaf = __feature_leaf(x86_feature);
+
+ vcpu->arch.cpu_caps[x86_leaf] &= ~__feature_bit(x86_feature);
+}
+
+static __always_inline void guest_cpu_cap_change(struct kvm_vcpu *vcpu,
+ unsigned int x86_feature,
+ bool guest_has_cap)
+{
+ if (guest_has_cap)
+ guest_cpu_cap_set(vcpu, x86_feature);
+ else
+ guest_cpu_cap_clear(vcpu, x86_feature);
+}
+
+static __always_inline bool guest_cpu_cap_has(struct kvm_vcpu *vcpu,
+ unsigned int x86_feature)
+{
+ unsigned int x86_leaf = __feature_leaf(x86_feature);
+
+ /*
+ * Except for MWAIT, querying dynamic feature bits is disallowed, so
+ * that KVM can defer runtime updates until the next CPUID emulation.
+ */
+ BUILD_BUG_ON(x86_feature == X86_FEATURE_APIC ||
+ x86_feature == X86_FEATURE_OSXSAVE ||
+ x86_feature == X86_FEATURE_OSPKE);
+
+ return vcpu->arch.cpu_caps[x86_leaf] & __feature_bit(x86_feature);
+}
+
+static inline bool kvm_vcpu_is_legal_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
+{
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_LAM))
+ cr3 &= ~(X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
+
+ return kvm_vcpu_is_legal_gpa(vcpu, cr3);
+}
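+
+/*
+ * Worked example (hypothetical value): with LAM enumerated to the guest, a
+ * CR3 value with bit 62 (X86_CR3_LAM_U48) set is legal because the LAM bits
+ * are stripped before the reserved-GPA check; the very same value is
+ * illegal for a guest without LAM.
+ */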
+
+static inline bool guest_has_spec_ctrl_msr(struct kvm_vcpu *vcpu)
+{
+ return (guest_cpu_cap_has(vcpu, X86_FEATURE_SPEC_CTRL) ||
+ guest_cpu_cap_has(vcpu, X86_FEATURE_AMD_STIBP) ||
+ guest_cpu_cap_has(vcpu, X86_FEATURE_AMD_IBRS) ||
+ guest_cpu_cap_has(vcpu, X86_FEATURE_AMD_SSBD));
+}
+
+static inline bool guest_has_pred_cmd_msr(struct kvm_vcpu *vcpu)
+{
+ return (guest_cpu_cap_has(vcpu, X86_FEATURE_SPEC_CTRL) ||
+ guest_cpu_cap_has(vcpu, X86_FEATURE_AMD_IBPB) ||
+ guest_cpu_cap_has(vcpu, X86_FEATURE_SBPB));
+}
+
+#endif
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
new file mode 100644
index 000000000000..999227fc7c66
--- /dev/null
+++ b/arch/x86/kvm/debugfs.c
@@ -0,0 +1,196 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * Copyright 2016 Red Hat, Inc. and/or its affiliates.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kvm_host.h>
+#include <linux/debugfs.h>
+#include "lapic.h"
+#include "mmu.h"
+#include "mmu/mmu_internal.h"
+
+static int vcpu_get_timer_advance_ns(void *data, u64 *val)
+{
+ struct kvm_vcpu *vcpu = (struct kvm_vcpu *) data;
+ *val = vcpu->arch.apic->lapic_timer.timer_advance_ns;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(vcpu_timer_advance_ns_fops, vcpu_get_timer_advance_ns, NULL, "%llu\n");
+
+static int vcpu_get_guest_mode(void *data, u64 *val)
+{
+ struct kvm_vcpu *vcpu = (struct kvm_vcpu *) data;
+ *val = vcpu->stat.guest_mode;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(vcpu_guest_mode_fops, vcpu_get_guest_mode, NULL, "%lld\n");
+
+static int vcpu_get_tsc_offset(void *data, u64 *val)
+{
+ struct kvm_vcpu *vcpu = (struct kvm_vcpu *) data;
+ *val = vcpu->arch.tsc_offset;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(vcpu_tsc_offset_fops, vcpu_get_tsc_offset, NULL, "%lld\n");
+
+static int vcpu_get_tsc_scaling_ratio(void *data, u64 *val)
+{
+ struct kvm_vcpu *vcpu = (struct kvm_vcpu *) data;
+ *val = vcpu->arch.tsc_scaling_ratio;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(vcpu_tsc_scaling_fops, vcpu_get_tsc_scaling_ratio, NULL, "%llu\n");
+
+static int vcpu_get_tsc_scaling_frac_bits(void *data, u64 *val)
+{
+ *val = kvm_caps.tsc_scaling_ratio_frac_bits;
+ return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(vcpu_tsc_scaling_frac_fops, vcpu_get_tsc_scaling_frac_bits, NULL, "%llu\n");
+
+void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry)
+{
+ debugfs_create_file("guest_mode", 0444, debugfs_dentry, vcpu,
+ &vcpu_guest_mode_fops);
+ debugfs_create_file("tsc-offset", 0444, debugfs_dentry, vcpu,
+ &vcpu_tsc_offset_fops);
+
+ if (lapic_in_kernel(vcpu))
+ debugfs_create_file("lapic_timer_advance_ns", 0444,
+ debugfs_dentry, vcpu,
+ &vcpu_timer_advance_ns_fops);
+
+ if (kvm_caps.has_tsc_control) {
+ debugfs_create_file("tsc-scaling-ratio", 0444,
+ debugfs_dentry, vcpu,
+ &vcpu_tsc_scaling_fops);
+ debugfs_create_file("tsc-scaling-ratio-frac-bits", 0444,
+ debugfs_dentry, vcpu,
+ &vcpu_tsc_scaling_frac_fops);
+ }
+}
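+
+/*
+ * Illustrative layout (path shape assumed from the common KVM debugfs code,
+ * not defined in this file): the files above land under the per-vCPU
+ * debugfs directory, e.g.
+ *
+ *	/sys/kernel/debug/kvm/<pid>-<vm_fd>/vcpu0/tsc-offset
+ *	/sys/kernel/debug/kvm/<pid>-<vm_fd>/vcpu0/guest_mode
+ */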
+
+/*
+ * This covers statistics for counts < 1024 (11 = log2(1024) + 1), which
+ * should be enough to cover RMAP_RECYCLE_THRESHOLD.
+ */
+#define RMAP_LOG_SIZE 11
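+
+/*
+ * Bucket layout (derived from the log2-style indexing below): bucket 0
+ * counts empty rmap lists, bucket 1 counts lists with exactly one entry,
+ * and bucket i (i >= 2) counts lists with 2^(i-1) .. 2^i - 1 entries,
+ * matching the "2-3", "4-7", ... column headers printed by
+ * kvm_mmu_rmaps_stat_show().
+ */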
+
+static const char *kvm_lpage_str[KVM_NR_PAGE_SIZES] = { "4K", "2M", "1G" };
+
+static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
+{
+ struct kvm_rmap_head *rmap;
+ struct kvm *kvm = m->private;
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ unsigned int lpage_size, index;
+ /* Still small enough to be on the stack */
+ unsigned int *log[KVM_NR_PAGE_SIZES], *cur;
+ int i, j, k, l, ret;
+
+ if (!kvm_memslots_have_rmaps(kvm))
+ return 0;
+
+ ret = -ENOMEM;
+ memset(log, 0, sizeof(log));
+ for (i = 0; i < KVM_NR_PAGE_SIZES; i++) {
+ log[i] = kcalloc(RMAP_LOG_SIZE, sizeof(unsigned int), GFP_KERNEL);
+ if (!log[i])
+ goto out;
+ }
+
+ mutex_lock(&kvm->slots_lock);
+ write_lock(&kvm->mmu_lock);
+
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+ int bkt;
+
+ slots = __kvm_memslots(kvm, i);
+ kvm_for_each_memslot(slot, bkt, slots)
+ for (k = 0; k < KVM_NR_PAGE_SIZES; k++) {
+ rmap = slot->arch.rmap[k];
+ lpage_size = kvm_mmu_slot_lpages(slot, k + 1);
+ cur = log[k];
+ for (l = 0; l < lpage_size; l++) {
+ index = fls(pte_list_count(&rmap[l]));
+ if (WARN_ON_ONCE(index >= RMAP_LOG_SIZE))
+ index = RMAP_LOG_SIZE - 1;
+ cur[index]++;
+ }
+ }
+ }
+
+ write_unlock(&kvm->mmu_lock);
+ mutex_unlock(&kvm->slots_lock);
+
+ /* index=0 counts no rmap; index=1 counts 1 rmap */
+ seq_printf(m, "Rmap_Count:\t0\t1\t");
+ for (i = 2; i < RMAP_LOG_SIZE; i++) {
+ j = 1 << (i - 1);
+ k = (1 << i) - 1;
+ seq_printf(m, "%d-%d\t", j, k);
+ }
+ seq_printf(m, "\n");
+
+ for (i = 0; i < KVM_NR_PAGE_SIZES; i++) {
+ seq_printf(m, "Level=%s:\t", kvm_lpage_str[i]);
+ cur = log[i];
+ for (j = 0; j < RMAP_LOG_SIZE; j++)
+ seq_printf(m, "%d\t", cur[j]);
+ seq_printf(m, "\n");
+ }
+
+ ret = 0;
+out:
+ for (i = 0; i < KVM_NR_PAGE_SIZES; i++)
+ kfree(log[i]);
+
+ return ret;
+}
+
+static int kvm_mmu_rmaps_stat_open(struct inode *inode, struct file *file)
+{
+ struct kvm *kvm = inode->i_private;
+ int r;
+
+ if (!kvm_get_kvm_safe(kvm))
+ return -ENOENT;
+
+ r = single_open(file, kvm_mmu_rmaps_stat_show, kvm);
+ if (r < 0)
+ kvm_put_kvm(kvm);
+
+ return r;
+}
+
+static int kvm_mmu_rmaps_stat_release(struct inode *inode, struct file *file)
+{
+ struct kvm *kvm = inode->i_private;
+
+ kvm_put_kvm(kvm);
+
+ return single_release(inode, file);
+}
+
+static const struct file_operations mmu_rmaps_stat_fops = {
+ .owner = THIS_MODULE,
+ .open = kvm_mmu_rmaps_stat_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = kvm_mmu_rmaps_stat_release,
+};
+
+void kvm_arch_create_vm_debugfs(struct kvm *kvm)
+{
+ debugfs_create_file("mmu_rmaps_stat", 0644, kvm->debugfs_dentry, kvm,
+ &mmu_rmaps_stat_fops);
+}
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 5ac0bb465ed6..c8e292e9a24d 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-only
/******************************************************************************
* emulate.c
*
@@ -9,31 +10,66 @@
* privileged instructions:
*
* Copyright (C) 2006 Qumranet
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
*
* Avi Kivity <avi@qumranet.com>
* Yaniv Kamay <yaniv@qumranet.com>
*
- * This work is licensed under the terms of the GNU GPL, version 2. See
- * the COPYING file in the top-level directory.
- *
* From: xen-unstable 10676:af9809f51f81a3c43f276f00c81a52ef558afda4
*/
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-#ifndef __KERNEL__
-#include <stdio.h>
-#include <stdint.h>
-#include <public/xen.h>
-#define DPRINTF(_f, _a ...) printf(_f , ## _a)
-#else
#include <linux/kvm_host.h>
#include "kvm_cache_regs.h"
-#define DPRINTF(x...) do {} while (0)
-#endif
-#include <linux/module.h>
-#include <asm/kvm_emulate.h>
+#include "kvm_emulate.h"
+#include <linux/stringify.h>
+#include <asm/debugreg.h>
+#include <asm/nospec-branch.h>
+#include <asm/ibt.h>
+#include <asm/text-patching.h>
#include "x86.h"
#include "tss.h"
+#include "mmu.h"
+#include "pmu.h"
+
+/*
+ * Operand types
+ */
+#define OpNone 0ull
+#define OpImplicit 1ull /* No generic decode */
+#define OpReg 2ull /* Register */
+#define OpMem 3ull /* Memory */
+#define OpAcc 4ull /* Accumulator: AL/AX/EAX/RAX */
+#define OpDI 5ull /* ES:DI/EDI/RDI */
+#define OpMem64 6ull /* Memory, 64-bit */
+#define OpImmUByte 7ull /* Zero-extended 8-bit immediate */
+#define OpDX 8ull /* DX register */
+#define OpCL 9ull /* CL register (for shifts) */
+#define OpImmByte 10ull /* 8-bit sign extended immediate */
+#define OpOne 11ull /* Implied 1 */
+#define OpImm 12ull /* Sign extended up to 32-bit immediate */
+#define OpMem16 13ull /* Memory operand (16-bit). */
+#define OpMem32 14ull /* Memory operand (32-bit). */
+#define OpImmU 15ull /* Immediate operand, zero extended */
+#define OpSI 16ull /* SI/ESI/RSI */
+#define OpImmFAddr 17ull /* Immediate far address */
+#define OpMemFAddr 18ull /* Far address in memory */
+#define OpImmU16 19ull /* Immediate operand, 16 bits, zero extended */
+#define OpES 20ull /* ES */
+#define OpCS 21ull /* CS */
+#define OpSS 22ull /* SS */
+#define OpDS 23ull /* DS */
+#define OpFS 24ull /* FS */
+#define OpGS 25ull /* GS */
+#define OpMem8 26ull /* 8-bit zero extended memory operand */
+#define OpImm64 27ull /* Sign extended 16/32/64-bit immediate */
+#define OpXLat 28ull /* memory at BX/EBX/RBX + zero-extended AL */
+#define OpAccLo 29ull /* Low part of extended acc (AL/AX/EAX/RAX) */
+#define OpAccHi 30ull /* High part of extended acc (-/DX/EDX/RDX) */
+
+#define OpBits 5 /* Width of operand field */
+#define OpMask ((1ull << OpBits) - 1)
/*
* Opcode effective-address decode tables.
@@ -45,576 +81,443 @@
*/
/* Operand sizes: 8-bit operands or specified/overridden size. */
-#define ByteOp (1<<0) /* 8-bit operands. */
-/* Destination operand type. */
-#define ImplicitOps (1<<1) /* Implicit in opcode. No generic decode. */
-#define DstReg (2<<1) /* Register operand. */
-#define DstMem (3<<1) /* Memory operand. */
-#define DstAcc (4<<1) /* Destination Accumulator */
-#define DstDI (5<<1) /* Destination is in ES:(E)DI */
-#define DstMem64 (6<<1) /* 64bit memory operand */
-#define DstMask (7<<1)
-/* Source operand type. */
-#define SrcNone (0<<4) /* No source operand. */
-#define SrcImplicit (0<<4) /* Source operand is implicit in the opcode. */
-#define SrcReg (1<<4) /* Register operand. */
-#define SrcMem (2<<4) /* Memory operand. */
-#define SrcMem16 (3<<4) /* Memory operand (16-bit). */
-#define SrcMem32 (4<<4) /* Memory operand (32-bit). */
-#define SrcImm (5<<4) /* Immediate operand. */
-#define SrcImmByte (6<<4) /* 8-bit sign-extended immediate operand. */
-#define SrcOne (7<<4) /* Implied '1' */
-#define SrcImmUByte (8<<4) /* 8-bit unsigned immediate operand. */
-#define SrcImmU (9<<4) /* Immediate operand, unsigned */
-#define SrcSI (0xa<<4) /* Source is in the DS:RSI */
-#define SrcMask (0xf<<4)
-/* Generic ModRM decode. */
-#define ModRM (1<<8)
-/* Destination is only written; never read. */
-#define Mov (1<<9)
-#define BitOp (1<<10)
-#define MemAbs (1<<11) /* Memory operand is absolute displacement */
-#define String (1<<12) /* String instruction (rep capable) */
-#define Stack (1<<13) /* Stack instruction (push/pop) */
-#define Group (1<<14) /* Bits 3:5 of modrm byte extend opcode */
-#define GroupDual (1<<15) /* Alternate decoding of mod == 3 */
-#define GroupMask 0xff /* Group number stored in bits 0:7 */
-/* Misc flags */
+#define ByteOp (1<<0) /* 8-bit operands. */
+#define DstShift 1 /* Destination operand type at bits 1-5 */
+#define ImplicitOps (OpImplicit << DstShift)
+#define DstReg (OpReg << DstShift)
+#define DstMem (OpMem << DstShift)
+#define DstAcc (OpAcc << DstShift)
+#define DstDI (OpDI << DstShift)
+#define DstMem64 (OpMem64 << DstShift)
+#define DstMem16 (OpMem16 << DstShift)
+#define DstImmUByte (OpImmUByte << DstShift)
+#define DstDX (OpDX << DstShift)
+#define DstAccLo (OpAccLo << DstShift)
+#define DstMask (OpMask << DstShift)
+#define SrcShift 6 /* Source operand type at bits 6-10 */
+#define SrcNone (OpNone << SrcShift)
+#define SrcReg (OpReg << SrcShift)
+#define SrcMem (OpMem << SrcShift)
+#define SrcMem16 (OpMem16 << SrcShift)
+#define SrcMem32 (OpMem32 << SrcShift)
+#define SrcImm (OpImm << SrcShift)
+#define SrcImmByte (OpImmByte << SrcShift)
+#define SrcOne (OpOne << SrcShift)
+#define SrcImmUByte (OpImmUByte << SrcShift)
+#define SrcImmU (OpImmU << SrcShift)
+#define SrcSI (OpSI << SrcShift)
+#define SrcXLat (OpXLat << SrcShift)
+#define SrcImmFAddr (OpImmFAddr << SrcShift)
+#define SrcMemFAddr (OpMemFAddr << SrcShift)
+#define SrcAcc (OpAcc << SrcShift)
+#define SrcImmU16 (OpImmU16 << SrcShift)
+#define SrcImm64 (OpImm64 << SrcShift)
+#define SrcDX (OpDX << SrcShift)
+#define SrcMem8 (OpMem8 << SrcShift)
+#define SrcAccHi (OpAccHi << SrcShift)
+#define SrcMask (OpMask << SrcShift)
+#define BitOp (1<<11)
+#define MemAbs (1<<12) /* Memory operand is absolute displacement */
+#define String (1<<13) /* String instruction (rep capable) */
+#define Stack (1<<14) /* Stack instruction (push/pop) */
+#define GroupMask (7<<15) /* Group mechanisms, at bits 15-17 */
+#define Group (1<<15) /* Bits 3:5 of modrm byte extend opcode */
+#define GroupDual (2<<15) /* Alternate decoding of mod == 3 */
+#define Prefix (3<<15) /* Instruction varies with 66/f2/f3 prefix */
+#define RMExt (4<<15) /* Opcode extension in ModRM r/m if mod == 3 */
+#define Escape (5<<15) /* Escape to coprocessor instruction */
+#define InstrDual (6<<15) /* Alternate instruction decoding of mod == 3 */
+#define ModeDual (7<<15) /* Different instruction for 32/64 bit */
+#define Sse (1<<18) /* SSE Vector instruction */
+#define ModRM (1<<19) /* Generic ModRM decode. */
+#define Mov (1<<20) /* Destination is only written; never read. */
+#define Prot (1<<21) /* instruction generates #UD if not in prot-mode */
+#define EmulateOnUD (1<<22) /* Emulate if unsupported by the host */
+#define NoAccess (1<<23) /* Don't access memory (lea/invlpg/verr etc) */
+#define Op3264 (1<<24) /* Operand is 64b in long mode, 32b otherwise */
+#define Undefined (1<<25) /* No Such Instruction */
#define Lock (1<<26) /* lock prefix is allowed for the instruction */
#define Priv (1<<27) /* instruction generates #GP if current CPL != 0 */
-#define No64 (1<<28)
-/* Source 2 operand type */
-#define Src2None (0<<29)
-#define Src2CL (1<<29)
-#define Src2ImmByte (2<<29)
-#define Src2One (3<<29)
-#define Src2Imm16 (4<<29)
-#define Src2Mem16 (5<<29) /* Used for Ep encoding. First argument has to be
- in memory and second argument is located
- immediately after the first one in memory. */
-#define Src2Mask (7<<29)
-
-enum {
- Group1_80, Group1_81, Group1_82, Group1_83,
- Group1A, Group3_Byte, Group3, Group4, Group5, Group7,
- Group8, Group9,
+#define No64 (1<<28) /* Instruction generates #UD in 64-bit mode */
+#define PageTable (1 << 29) /* instruction used to write page table */
+#define NotImpl (1 << 30) /* instruction is not implemented */
+#define Avx ((u64)1 << 31) /* Instruction uses VEX prefix */
+#define Src2Shift (32) /* Source 2 operand type at bits 32-36 */
+#define Src2None (OpNone << Src2Shift)
+#define Src2Mem (OpMem << Src2Shift)
+#define Src2CL (OpCL << Src2Shift)
+#define Src2ImmByte (OpImmByte << Src2Shift)
+#define Src2One (OpOne << Src2Shift)
+#define Src2Imm (OpImm << Src2Shift)
+#define Src2ES (OpES << Src2Shift)
+#define Src2CS (OpCS << Src2Shift)
+#define Src2SS (OpSS << Src2Shift)
+#define Src2DS (OpDS << Src2Shift)
+#define Src2FS (OpFS << Src2Shift)
+#define Src2GS (OpGS << Src2Shift)
+#define Src2Mask (OpMask << Src2Shift)
+/* free: 37-39 */
+#define Mmx ((u64)1 << 40) /* MMX Vector instruction */
+#define AlignMask ((u64)3 << 41) /* Memory alignment requirement at bits 41-42 */
+#define Aligned ((u64)1 << 41) /* Explicitly aligned (e.g. MOVDQA) */
+#define Unaligned ((u64)2 << 41) /* Explicitly unaligned (e.g. MOVDQU) */
+#define Aligned16 ((u64)3 << 41) /* Aligned to 16 byte boundary (e.g. FXSAVE) */
+/* free: 43-44 */
+#define NoWrite ((u64)1 << 45) /* No writeback */
+#define SrcWrite ((u64)1 << 46) /* Write back src operand */
+#define NoMod ((u64)1 << 47) /* Mod field is ignored */
+#define Intercept ((u64)1 << 48) /* Has valid intercept field */
+#define CheckPerm ((u64)1 << 49) /* Has valid check_perm field */
+#define PrivUD ((u64)1 << 51) /* #UD instead of #GP on CPL > 0 */
+#define NearBranch ((u64)1 << 52) /* Near branches */
+#define No16 ((u64)1 << 53) /* No 16 bit operand */
+#define IncSP ((u64)1 << 54) /* SP is incremented before ModRM calc */
+#define TwoMemOp ((u64)1 << 55) /* Instruction has two memory operands */
+#define IsBranch ((u64)1 << 56) /* Instruction is considered a branch. */
+#define ShadowStack ((u64)1 << 57) /* Instruction affects Shadow Stacks. */
+
+#define DstXacc (DstAccLo | SrcAccHi | SrcWrite)
+
+#define X2(x...) x, x
+#define X3(x...) X2(x), x
+#define X4(x...) X2(x), X2(x)
+#define X5(x...) X4(x), x
+#define X6(x...) X4(x), X2(x)
+#define X7(x...) X4(x), X3(x)
+#define X8(x...) X4(x), X4(x)
+#define X16(x...) X8(x), X8(x)
+
+struct opcode {
+ u64 flags;
+ u8 intercept;
+ u8 pad[7];
+ union {
+ int (*execute)(struct x86_emulate_ctxt *ctxt);
+ const struct opcode *group;
+ const struct group_dual *gdual;
+ const struct gprefix *gprefix;
+ const struct escape *esc;
+ const struct instr_dual *idual;
+ const struct mode_dual *mdual;
+ } u;
+ int (*check_perm)(struct x86_emulate_ctxt *ctxt);
};
-static u32 opcode_table[256] = {
- /* 0x00 - 0x07 */
- ByteOp | DstMem | SrcReg | ModRM | Lock, DstMem | SrcReg | ModRM | Lock,
- ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
- ImplicitOps | Stack | No64, ImplicitOps | Stack | No64,
- /* 0x08 - 0x0F */
- ByteOp | DstMem | SrcReg | ModRM | Lock, DstMem | SrcReg | ModRM | Lock,
- ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
- ImplicitOps | Stack | No64, 0,
- /* 0x10 - 0x17 */
- ByteOp | DstMem | SrcReg | ModRM | Lock, DstMem | SrcReg | ModRM | Lock,
- ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
- ImplicitOps | Stack | No64, ImplicitOps | Stack | No64,
- /* 0x18 - 0x1F */
- ByteOp | DstMem | SrcReg | ModRM | Lock, DstMem | SrcReg | ModRM | Lock,
- ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
- ImplicitOps | Stack | No64, ImplicitOps | Stack | No64,
- /* 0x20 - 0x27 */
- ByteOp | DstMem | SrcReg | ModRM | Lock, DstMem | SrcReg | ModRM | Lock,
- ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- DstAcc | SrcImmByte, DstAcc | SrcImm, 0, 0,
- /* 0x28 - 0x2F */
- ByteOp | DstMem | SrcReg | ModRM | Lock, DstMem | SrcReg | ModRM | Lock,
- ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- 0, 0, 0, 0,
- /* 0x30 - 0x37 */
- ByteOp | DstMem | SrcReg | ModRM | Lock, DstMem | SrcReg | ModRM | Lock,
- ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- 0, 0, 0, 0,
- /* 0x38 - 0x3F */
- ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
- ByteOp | DstReg | SrcMem | ModRM, DstReg | SrcMem | ModRM,
- ByteOp | DstAcc | SrcImm, DstAcc | SrcImm,
- 0, 0,
- /* 0x40 - 0x47 */
- DstReg, DstReg, DstReg, DstReg, DstReg, DstReg, DstReg, DstReg,
- /* 0x48 - 0x4F */
- DstReg, DstReg, DstReg, DstReg, DstReg, DstReg, DstReg, DstReg,
- /* 0x50 - 0x57 */
- SrcReg | Stack, SrcReg | Stack, SrcReg | Stack, SrcReg | Stack,
- SrcReg | Stack, SrcReg | Stack, SrcReg | Stack, SrcReg | Stack,
- /* 0x58 - 0x5F */
- DstReg | Stack, DstReg | Stack, DstReg | Stack, DstReg | Stack,
- DstReg | Stack, DstReg | Stack, DstReg | Stack, DstReg | Stack,
- /* 0x60 - 0x67 */
- ImplicitOps | Stack | No64, ImplicitOps | Stack | No64,
- 0, DstReg | SrcMem32 | ModRM | Mov /* movsxd (x86/64) */ ,
- 0, 0, 0, 0,
- /* 0x68 - 0x6F */
- SrcImm | Mov | Stack, 0, SrcImmByte | Mov | Stack, 0,
- DstDI | ByteOp | Mov | String, DstDI | Mov | String, /* insb, insw/insd */
- SrcSI | ByteOp | ImplicitOps | String, SrcSI | ImplicitOps | String, /* outsb, outsw/outsd */
- /* 0x70 - 0x77 */
- SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte,
- SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte,
- /* 0x78 - 0x7F */
- SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte,
- SrcImmByte, SrcImmByte, SrcImmByte, SrcImmByte,
- /* 0x80 - 0x87 */
- Group | Group1_80, Group | Group1_81,
- Group | Group1_82, Group | Group1_83,
- ByteOp | DstMem | SrcReg | ModRM, DstMem | SrcReg | ModRM,
- ByteOp | DstMem | SrcReg | ModRM | Lock, DstMem | SrcReg | ModRM | Lock,
- /* 0x88 - 0x8F */
- ByteOp | DstMem | SrcReg | ModRM | Mov, DstMem | SrcReg | ModRM | Mov,
- ByteOp | DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
- DstMem | SrcReg | ModRM | Mov, ModRM | DstReg,
- DstReg | SrcMem | ModRM | Mov, Group | Group1A,
- /* 0x90 - 0x97 */
- DstReg, DstReg, DstReg, DstReg, DstReg, DstReg, DstReg, DstReg,
- /* 0x98 - 0x9F */
- 0, 0, SrcImm | Src2Imm16 | No64, 0,
- ImplicitOps | Stack, ImplicitOps | Stack, 0, 0,
- /* 0xA0 - 0xA7 */
- ByteOp | DstReg | SrcMem | Mov | MemAbs, DstReg | SrcMem | Mov | MemAbs,
- ByteOp | DstMem | SrcReg | Mov | MemAbs, DstMem | SrcReg | Mov | MemAbs,
- ByteOp | SrcSI | DstDI | Mov | String, SrcSI | DstDI | Mov | String,
- ByteOp | SrcSI | DstDI | String, SrcSI | DstDI | String,
- /* 0xA8 - 0xAF */
- 0, 0, ByteOp | DstDI | Mov | String, DstDI | Mov | String,
- ByteOp | SrcSI | DstAcc | Mov | String, SrcSI | DstAcc | Mov | String,
- ByteOp | DstDI | String, DstDI | String,
- /* 0xB0 - 0xB7 */
- ByteOp | DstReg | SrcImm | Mov, ByteOp | DstReg | SrcImm | Mov,
- ByteOp | DstReg | SrcImm | Mov, ByteOp | DstReg | SrcImm | Mov,
- ByteOp | DstReg | SrcImm | Mov, ByteOp | DstReg | SrcImm | Mov,
- ByteOp | DstReg | SrcImm | Mov, ByteOp | DstReg | SrcImm | Mov,
- /* 0xB8 - 0xBF */
- DstReg | SrcImm | Mov, DstReg | SrcImm | Mov,
- DstReg | SrcImm | Mov, DstReg | SrcImm | Mov,
- DstReg | SrcImm | Mov, DstReg | SrcImm | Mov,
- DstReg | SrcImm | Mov, DstReg | SrcImm | Mov,
- /* 0xC0 - 0xC7 */
- ByteOp | DstMem | SrcImm | ModRM, DstMem | SrcImmByte | ModRM,
- 0, ImplicitOps | Stack, 0, 0,
- ByteOp | DstMem | SrcImm | ModRM | Mov, DstMem | SrcImm | ModRM | Mov,
- /* 0xC8 - 0xCF */
- 0, 0, 0, ImplicitOps | Stack,
- ImplicitOps, SrcImmByte, ImplicitOps | No64, ImplicitOps,
- /* 0xD0 - 0xD7 */
- ByteOp | DstMem | SrcImplicit | ModRM, DstMem | SrcImplicit | ModRM,
- ByteOp | DstMem | SrcImplicit | ModRM, DstMem | SrcImplicit | ModRM,
- 0, 0, 0, 0,
- /* 0xD8 - 0xDF */
- 0, 0, 0, 0, 0, 0, 0, 0,
- /* 0xE0 - 0xE7 */
- 0, 0, 0, 0,
- ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc,
- ByteOp | SrcImmUByte | DstAcc, SrcImmUByte | DstAcc,
- /* 0xE8 - 0xEF */
- SrcImm | Stack, SrcImm | ImplicitOps,
- SrcImmU | Src2Imm16 | No64, SrcImmByte | ImplicitOps,
- SrcNone | ByteOp | DstAcc, SrcNone | DstAcc,
- SrcNone | ByteOp | DstAcc, SrcNone | DstAcc,
- /* 0xF0 - 0xF7 */
- 0, 0, 0, 0,
- ImplicitOps | Priv, ImplicitOps, Group | Group3_Byte, Group | Group3,
- /* 0xF8 - 0xFF */
- ImplicitOps, 0, ImplicitOps, ImplicitOps,
- ImplicitOps, ImplicitOps, Group | Group4, Group | Group5,
+struct group_dual {
+ struct opcode mod012[8];
+ struct opcode mod3[8];
};
-static u32 twobyte_table[256] = {
- /* 0x00 - 0x0F */
- 0, Group | GroupDual | Group7, 0, 0,
- 0, ImplicitOps, ImplicitOps | Priv, 0,
- ImplicitOps | Priv, ImplicitOps | Priv, 0, 0,
- 0, ImplicitOps | ModRM, 0, 0,
- /* 0x10 - 0x1F */
- 0, 0, 0, 0, 0, 0, 0, 0, ImplicitOps | ModRM, 0, 0, 0, 0, 0, 0, 0,
- /* 0x20 - 0x2F */
- ModRM | ImplicitOps | Priv, ModRM | Priv,
- ModRM | ImplicitOps | Priv, ModRM | Priv,
- 0, 0, 0, 0,
- 0, 0, 0, 0, 0, 0, 0, 0,
- /* 0x30 - 0x3F */
- ImplicitOps | Priv, 0, ImplicitOps | Priv, 0,
- ImplicitOps, ImplicitOps | Priv, 0, 0,
- 0, 0, 0, 0, 0, 0, 0, 0,
- /* 0x40 - 0x47 */
- DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
- DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
- DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
- DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
- /* 0x48 - 0x4F */
- DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
- DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
- DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
- DstReg | SrcMem | ModRM | Mov, DstReg | SrcMem | ModRM | Mov,
- /* 0x50 - 0x5F */
- 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
- /* 0x60 - 0x6F */
- 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
- /* 0x70 - 0x7F */
- 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
- /* 0x80 - 0x8F */
- SrcImm, SrcImm, SrcImm, SrcImm, SrcImm, SrcImm, SrcImm, SrcImm,
- SrcImm, SrcImm, SrcImm, SrcImm, SrcImm, SrcImm, SrcImm, SrcImm,
- /* 0x90 - 0x9F */
- 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
- /* 0xA0 - 0xA7 */
- ImplicitOps | Stack, ImplicitOps | Stack,
- 0, DstMem | SrcReg | ModRM | BitOp,
- DstMem | SrcReg | Src2ImmByte | ModRM,
- DstMem | SrcReg | Src2CL | ModRM, 0, 0,
- /* 0xA8 - 0xAF */
- ImplicitOps | Stack, ImplicitOps | Stack,
- 0, DstMem | SrcReg | ModRM | BitOp | Lock,
- DstMem | SrcReg | Src2ImmByte | ModRM,
- DstMem | SrcReg | Src2CL | ModRM,
- ModRM, 0,
- /* 0xB0 - 0xB7 */
- ByteOp | DstMem | SrcReg | ModRM | Lock, DstMem | SrcReg | ModRM | Lock,
- 0, DstMem | SrcReg | ModRM | BitOp | Lock,
- 0, 0, ByteOp | DstReg | SrcMem | ModRM | Mov,
- DstReg | SrcMem16 | ModRM | Mov,
- /* 0xB8 - 0xBF */
- 0, 0,
- Group | Group8, DstMem | SrcReg | ModRM | BitOp | Lock,
- 0, 0, ByteOp | DstReg | SrcMem | ModRM | Mov,
- DstReg | SrcMem16 | ModRM | Mov,
- /* 0xC0 - 0xCF */
- 0, 0, 0, DstMem | SrcReg | ModRM | Mov,
- 0, 0, 0, Group | GroupDual | Group9,
- 0, 0, 0, 0, 0, 0, 0, 0,
- /* 0xD0 - 0xDF */
- 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
- /* 0xE0 - 0xEF */
- 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
- /* 0xF0 - 0xFF */
- 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+struct gprefix {
+ struct opcode pfx_no;
+ struct opcode pfx_66;
+ struct opcode pfx_f2;
+ struct opcode pfx_f3;
};
-static u32 group_table[] = {
- [Group1_80*8] =
- ByteOp | DstMem | SrcImm | ModRM | Lock,
- ByteOp | DstMem | SrcImm | ModRM | Lock,
- ByteOp | DstMem | SrcImm | ModRM | Lock,
- ByteOp | DstMem | SrcImm | ModRM | Lock,
- ByteOp | DstMem | SrcImm | ModRM | Lock,
- ByteOp | DstMem | SrcImm | ModRM | Lock,
- ByteOp | DstMem | SrcImm | ModRM | Lock,
- ByteOp | DstMem | SrcImm | ModRM,
- [Group1_81*8] =
- DstMem | SrcImm | ModRM | Lock,
- DstMem | SrcImm | ModRM | Lock,
- DstMem | SrcImm | ModRM | Lock,
- DstMem | SrcImm | ModRM | Lock,
- DstMem | SrcImm | ModRM | Lock,
- DstMem | SrcImm | ModRM | Lock,
- DstMem | SrcImm | ModRM | Lock,
- DstMem | SrcImm | ModRM,
- [Group1_82*8] =
- ByteOp | DstMem | SrcImm | ModRM | No64 | Lock,
- ByteOp | DstMem | SrcImm | ModRM | No64 | Lock,
- ByteOp | DstMem | SrcImm | ModRM | No64 | Lock,
- ByteOp | DstMem | SrcImm | ModRM | No64 | Lock,
- ByteOp | DstMem | SrcImm | ModRM | No64 | Lock,
- ByteOp | DstMem | SrcImm | ModRM | No64 | Lock,
- ByteOp | DstMem | SrcImm | ModRM | No64 | Lock,
- ByteOp | DstMem | SrcImm | ModRM | No64,
- [Group1_83*8] =
- DstMem | SrcImmByte | ModRM | Lock,
- DstMem | SrcImmByte | ModRM | Lock,
- DstMem | SrcImmByte | ModRM | Lock,
- DstMem | SrcImmByte | ModRM | Lock,
- DstMem | SrcImmByte | ModRM | Lock,
- DstMem | SrcImmByte | ModRM | Lock,
- DstMem | SrcImmByte | ModRM | Lock,
- DstMem | SrcImmByte | ModRM,
- [Group1A*8] =
- DstMem | SrcNone | ModRM | Mov | Stack, 0, 0, 0, 0, 0, 0, 0,
- [Group3_Byte*8] =
- ByteOp | SrcImm | DstMem | ModRM, 0,
- ByteOp | DstMem | SrcNone | ModRM, ByteOp | DstMem | SrcNone | ModRM,
- 0, 0, 0, 0,
- [Group3*8] =
- DstMem | SrcImm | ModRM, 0,
- DstMem | SrcNone | ModRM, DstMem | SrcNone | ModRM,
- 0, 0, 0, 0,
- [Group4*8] =
- ByteOp | DstMem | SrcNone | ModRM, ByteOp | DstMem | SrcNone | ModRM,
- 0, 0, 0, 0, 0, 0,
- [Group5*8] =
- DstMem | SrcNone | ModRM, DstMem | SrcNone | ModRM,
- SrcMem | ModRM | Stack, 0,
- SrcMem | ModRM | Stack, SrcMem | ModRM | Src2Mem16 | ImplicitOps,
- SrcMem | ModRM | Stack, 0,
- [Group7*8] =
- 0, 0, ModRM | SrcMem | Priv, ModRM | SrcMem | Priv,
- SrcNone | ModRM | DstMem | Mov, 0,
- SrcMem16 | ModRM | Mov | Priv, SrcMem | ModRM | ByteOp | Priv,
- [Group8*8] =
- 0, 0, 0, 0,
- DstMem | SrcImmByte | ModRM, DstMem | SrcImmByte | ModRM | Lock,
- DstMem | SrcImmByte | ModRM | Lock, DstMem | SrcImmByte | ModRM | Lock,
- [Group9*8] =
- 0, DstMem64 | ModRM | Lock, 0, 0, 0, 0, 0, 0,
+struct escape {
+ struct opcode op[8];
+ struct opcode high[64];
};
-static u32 group2_table[] = {
- [Group7*8] =
- SrcNone | ModRM | Priv, 0, 0, SrcNone | ModRM | Priv,
- SrcNone | ModRM | DstMem | Mov, 0,
- SrcMem16 | ModRM | Mov | Priv, 0,
- [Group9*8] =
- 0, 0, 0, 0, 0, 0, 0, 0,
+struct instr_dual {
+ struct opcode mod012;
+ struct opcode mod3;
};
-/* EFLAGS bit definitions. */
-#define EFLG_ID (1<<21)
-#define EFLG_VIP (1<<20)
-#define EFLG_VIF (1<<19)
-#define EFLG_AC (1<<18)
-#define EFLG_VM (1<<17)
-#define EFLG_RF (1<<16)
-#define EFLG_IOPL (3<<12)
-#define EFLG_NT (1<<14)
-#define EFLG_OF (1<<11)
-#define EFLG_DF (1<<10)
-#define EFLG_IF (1<<9)
-#define EFLG_TF (1<<8)
-#define EFLG_SF (1<<7)
-#define EFLG_ZF (1<<6)
-#define EFLG_AF (1<<4)
-#define EFLG_PF (1<<2)
-#define EFLG_CF (1<<0)
+struct mode_dual {
+ struct opcode mode32;
+ struct opcode mode64;
+};
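+
+/*
+ * Illustrative sketch (not from this patch): each dispatch structure
+ * above refines decoding by one more field of the instruction.  For
+ * example, a gprefix entry is selected by the mandatory 66/F2/F3
+ * prefix; a hypothetical entry implementing only the F3 form could be
+ * built with the N ("not implemented") and I() (flags plus handler)
+ * helpers this file's opcode tables use:
+ *
+ *	static const struct gprefix pfx_example = {
+ *		N, N, N, I(DstReg | SrcMem | ModRM, em_example),
+ *	};
+ *
+ * where em_example stands in for a real handler.
+ */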
-/*
- * Instruction emulation:
- * Most instructions are emulated directly via a fragment of inline assembly
- * code. This allows us to save/restore EFLAGS and thus very easily pick up
- * any modified flags.
- */
+#define EFLG_RESERVED_ZEROS_MASK 0xffc0802a
-#if defined(CONFIG_X86_64)
-#define _LO32 "k" /* force 32-bit operand */
-#define _STK "%%rsp" /* stack pointer */
-#elif defined(__i386__)
-#define _LO32 "" /* force 32-bit operand */
-#define _STK "%%esp" /* stack pointer */
-#endif
+enum x86_transfer_type {
+ X86_TRANSFER_NONE,
+ X86_TRANSFER_CALL_JMP,
+ X86_TRANSFER_RET,
+ X86_TRANSFER_TASK_SWITCH,
+};
+
+enum rex_bits {
+ REX_B = 1,
+ REX_X = 2,
+ REX_R = 4,
+ REX_W = 8,
+};
+
+static void writeback_registers(struct x86_emulate_ctxt *ctxt)
+{
+ unsigned long dirty = ctxt->regs_dirty;
+ unsigned reg;
+
+ for_each_set_bit(reg, &dirty, NR_EMULATOR_GPRS)
+ ctxt->ops->write_gpr(ctxt, reg, ctxt->_regs[reg]);
+}
+
+static void invalidate_registers(struct x86_emulate_ctxt *ctxt)
+{
+ ctxt->regs_dirty = 0;
+ ctxt->regs_valid = 0;
+}
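+
+/*
+ * Usage sketch (assumed, matching the reg_read()/reg_rmw() helpers used
+ * below): fetching a register for read-modify-write marks it dirty, so
+ * only registers the emulator actually modified are flushed back:
+ *
+ *	*reg_rmw(ctxt, VCPU_REGS_RAX) += 1;	<- sets RAX's regs_dirty bit
+ *	writeback_registers(ctxt);		<- write_gpr() for RAX only
+ */
+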
/*
* These EFLAGS bits are restored from saved value during emulation, and
* any changes are written back to the saved value after emulation.
*/
-#define EFLAGS_MASK (EFLG_OF|EFLG_SF|EFLG_ZF|EFLG_AF|EFLG_PF|EFLG_CF)
-
-/* Before executing instruction: restore necessary bits in EFLAGS. */
-#define _PRE_EFLAGS(_sav, _msk, _tmp) \
- /* EFLAGS = (_sav & _msk) | (EFLAGS & ~_msk); _sav &= ~_msk; */ \
- "movl %"_sav",%"_LO32 _tmp"; " \
- "push %"_tmp"; " \
- "push %"_tmp"; " \
- "movl %"_msk",%"_LO32 _tmp"; " \
- "andl %"_LO32 _tmp",("_STK"); " \
- "pushf; " \
- "notl %"_LO32 _tmp"; " \
- "andl %"_LO32 _tmp",("_STK"); " \
- "andl %"_LO32 _tmp","__stringify(BITS_PER_LONG/4)"("_STK"); " \
- "pop %"_tmp"; " \
- "orl %"_LO32 _tmp",("_STK"); " \
- "popf; " \
- "pop %"_sav"; "
-
-/* After executing instruction: write-back necessary bits in EFLAGS. */
-#define _POST_EFLAGS(_sav, _msk, _tmp) \
- /* _sav |= EFLAGS & _msk; */ \
- "pushf; " \
- "pop %"_tmp"; " \
- "andl %"_msk",%"_LO32 _tmp"; " \
- "orl %"_LO32 _tmp",%"_sav"; "
+#define EFLAGS_MASK (X86_EFLAGS_OF|X86_EFLAGS_SF|X86_EFLAGS_ZF|X86_EFLAGS_AF|\
+ X86_EFLAGS_PF|X86_EFLAGS_CF)
#ifdef CONFIG_X86_64
-#define ON64(x) x
+#define ON64(x...) x
#else
-#define ON64(x)
+#define ON64(x...)
#endif
-#define ____emulate_2op(_op, _src, _dst, _eflags, _x, _y, _suffix) \
- do { \
- __asm__ __volatile__ ( \
- _PRE_EFLAGS("0", "4", "2") \
- _op _suffix " %"_x"3,%1; " \
- _POST_EFLAGS("0", "4", "2") \
- : "=m" (_eflags), "=m" ((_dst).val), \
- "=&r" (_tmp) \
- : _y ((_src).val), "i" (EFLAGS_MASK)); \
- } while (0)
-
-
-/* Raw emulation: instruction has two explicit operands. */
-#define __emulate_2op_nobyte(_op,_src,_dst,_eflags,_wx,_wy,_lx,_ly,_qx,_qy) \
- do { \
- unsigned long _tmp; \
- \
- switch ((_dst).bytes) { \
- case 2: \
- ____emulate_2op(_op,_src,_dst,_eflags,_wx,_wy,"w"); \
- break; \
- case 4: \
- ____emulate_2op(_op,_src,_dst,_eflags,_lx,_ly,"l"); \
- break; \
- case 8: \
- ON64(____emulate_2op(_op,_src,_dst,_eflags,_qx,_qy,"q")); \
- break; \
- } \
- } while (0)
-
-#define __emulate_2op(_op,_src,_dst,_eflags,_bx,_by,_wx,_wy,_lx,_ly,_qx,_qy) \
- do { \
- unsigned long _tmp; \
- switch ((_dst).bytes) { \
- case 1: \
- ____emulate_2op(_op,_src,_dst,_eflags,_bx,_by,"b"); \
- break; \
- default: \
- __emulate_2op_nobyte(_op, _src, _dst, _eflags, \
- _wx, _wy, _lx, _ly, _qx, _qy); \
- break; \
- } \
- } while (0)
-
-/* Source operand is byte-sized and may be restricted to just %cl. */
-#define emulate_2op_SrcB(_op, _src, _dst, _eflags) \
- __emulate_2op(_op, _src, _dst, _eflags, \
- "b", "c", "b", "c", "b", "c", "b", "c")
-
-/* Source operand is byte, word, long or quad sized. */
-#define emulate_2op_SrcV(_op, _src, _dst, _eflags) \
- __emulate_2op(_op, _src, _dst, _eflags, \
- "b", "q", "w", "r", _LO32, "r", "", "r")
-
-/* Source operand is word, long or quad sized. */
-#define emulate_2op_SrcV_nobyte(_op, _src, _dst, _eflags) \
- __emulate_2op_nobyte(_op, _src, _dst, _eflags, \
- "w", "r", _LO32, "r", "", "r")
-
-/* Instruction has three operands and one operand is stored in ECX register */
-#define __emulate_2op_cl(_op, _cl, _src, _dst, _eflags, _suffix, _type) \
- do { \
- unsigned long _tmp; \
- _type _clv = (_cl).val; \
- _type _srcv = (_src).val; \
- _type _dstv = (_dst).val; \
- \
- __asm__ __volatile__ ( \
- _PRE_EFLAGS("0", "5", "2") \
- _op _suffix " %4,%1 \n" \
- _POST_EFLAGS("0", "5", "2") \
- : "=m" (_eflags), "+r" (_dstv), "=&r" (_tmp) \
- : "c" (_clv) , "r" (_srcv), "i" (EFLAGS_MASK) \
- ); \
- \
- (_cl).val = (unsigned long) _clv; \
- (_src).val = (unsigned long) _srcv; \
- (_dst).val = (unsigned long) _dstv; \
- } while (0)
-
-#define emulate_2op_cl(_op, _cl, _src, _dst, _eflags) \
- do { \
- switch ((_dst).bytes) { \
- case 2: \
- __emulate_2op_cl(_op, _cl, _src, _dst, _eflags, \
- "w", unsigned short); \
- break; \
- case 4: \
- __emulate_2op_cl(_op, _cl, _src, _dst, _eflags, \
- "l", unsigned int); \
- break; \
- case 8: \
- ON64(__emulate_2op_cl(_op, _cl, _src, _dst, _eflags, \
- "q", unsigned long)); \
- break; \
- } \
- } while (0)
-
-#define __emulate_1op(_op, _dst, _eflags, _suffix) \
- do { \
- unsigned long _tmp; \
- \
- __asm__ __volatile__ ( \
- _PRE_EFLAGS("0", "3", "2") \
- _op _suffix " %1; " \
- _POST_EFLAGS("0", "3", "2") \
- : "=m" (_eflags), "+m" ((_dst).val), \
- "=&r" (_tmp) \
- : "i" (EFLAGS_MASK)); \
- } while (0)
-
-/* Instruction has only one explicit operand (no source operand). */
-#define emulate_1op(_op, _dst, _eflags) \
- do { \
- switch ((_dst).bytes) { \
- case 1: __emulate_1op(_op, _dst, _eflags, "b"); break; \
- case 2: __emulate_1op(_op, _dst, _eflags, "w"); break; \
- case 4: __emulate_1op(_op, _dst, _eflags, "l"); break; \
- case 8: ON64(__emulate_1op(_op, _dst, _eflags, "q")); break; \
- } \
- } while (0)
+#define EM_ASM_START(op) \
+static int em_##op(struct x86_emulate_ctxt *ctxt) \
+{ \
+ unsigned long flags = (ctxt->eflags & EFLAGS_MASK) | X86_EFLAGS_IF; \
+ int bytes = 1, ok = 1; \
+ if (!(ctxt->d & ByteOp)) \
+ bytes = ctxt->dst.bytes; \
+ switch (bytes) {
+
+#define __EM_ASM(str) \
+ asm("push %[flags]; popf \n\t" \
+ "10: " str \
+ "pushf; pop %[flags] \n\t" \
+ "11: \n\t" \
+ : "+a" (ctxt->dst.val), \
+ "+d" (ctxt->src.val), \
+ [flags] "+D" (flags), \
+ "+S" (ok) \
+ : "c" (ctxt->src2.val))
+
+#define __EM_ASM_1(op, dst) \
+ __EM_ASM(#op " %%" #dst " \n\t")
+
+#define __EM_ASM_1_EX(op, dst) \
+ __EM_ASM(#op " %%" #dst " \n\t" \
+ _ASM_EXTABLE_TYPE_REG(10b, 11f, EX_TYPE_ZERO_REG, %%esi))
+
+#define __EM_ASM_2(op, dst, src) \
+ __EM_ASM(#op " %%" #src ", %%" #dst " \n\t")
+
+#define __EM_ASM_3(op, dst, src, src2) \
+ __EM_ASM(#op " %%" #src2 ", %%" #src ", %%" #dst " \n\t")
+
+#define EM_ASM_END \
+ } \
+ ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK); \
+ return !ok ? emulate_de(ctxt) : X86EMUL_CONTINUE; \
+}
-/* Fetch next part of the instruction being emulated. */
-#define insn_fetch(_type, _size, _eip) \
-({ unsigned long _x; \
- rc = do_insn_fetch(ctxt, ops, (_eip), &_x, (_size)); \
- if (rc != X86EMUL_CONTINUE) \
- goto done; \
- (_eip) += (_size); \
- (_type)_x; \
+/* 1-operand, using "a" (dst) */
+#define EM_ASM_1(op) \
+ EM_ASM_START(op) \
+ case 1: __EM_ASM_1(op##b, al); break; \
+ case 2: __EM_ASM_1(op##w, ax); break; \
+ case 4: __EM_ASM_1(op##l, eax); break; \
+ ON64(case 8: __EM_ASM_1(op##q, rax); break;) \
+ EM_ASM_END
+
+/* 1-operand, using "c" (src2) */
+#define EM_ASM_1SRC2(op, name) \
+ EM_ASM_START(name) \
+ case 1: __EM_ASM_1(op##b, cl); break; \
+ case 2: __EM_ASM_1(op##w, cx); break; \
+ case 4: __EM_ASM_1(op##l, ecx); break; \
+ ON64(case 8: __EM_ASM_1(op##q, rcx); break;) \
+ EM_ASM_END
+
+/* 1-operand, using "c" (src2) with exception */
+#define EM_ASM_1SRC2EX(op, name) \
+ EM_ASM_START(name) \
+ case 1: __EM_ASM_1_EX(op##b, cl); break; \
+ case 2: __EM_ASM_1_EX(op##w, cx); break; \
+ case 4: __EM_ASM_1_EX(op##l, ecx); break; \
+ ON64(case 8: __EM_ASM_1_EX(op##q, rcx); break;) \
+ EM_ASM_END
+
+/* 2-operand, using "a" (dst), "d" (src) */
+#define EM_ASM_2(op) \
+ EM_ASM_START(op) \
+ case 1: __EM_ASM_2(op##b, al, dl); break; \
+ case 2: __EM_ASM_2(op##w, ax, dx); break; \
+ case 4: __EM_ASM_2(op##l, eax, edx); break; \
+ ON64(case 8: __EM_ASM_2(op##q, rax, rdx); break;) \
+ EM_ASM_END
+
+/* 2-operand, reversed */
+#define EM_ASM_2R(op, name) \
+ EM_ASM_START(name) \
+ case 1: __EM_ASM_2(op##b, dl, al); break; \
+ case 2: __EM_ASM_2(op##w, dx, ax); break; \
+ case 4: __EM_ASM_2(op##l, edx, eax); break; \
+ ON64(case 8: __EM_ASM_2(op##q, rdx, rax); break;) \
+ EM_ASM_END
+
+/* 2-operand, word only (no byte op) */
+#define EM_ASM_2W(op) \
+ EM_ASM_START(op) \
+ case 1: break; \
+ case 2: __EM_ASM_2(op##w, ax, dx); break; \
+ case 4: __EM_ASM_2(op##l, eax, edx); break; \
+ ON64(case 8: __EM_ASM_2(op##q, rax, rdx); break;) \
+ EM_ASM_END
+
+/* 2-operand, using "a" (dst) and CL (src2) */
+#define EM_ASM_2CL(op) \
+ EM_ASM_START(op) \
+ case 1: __EM_ASM_2(op##b, al, cl); break; \
+ case 2: __EM_ASM_2(op##w, ax, cl); break; \
+ case 4: __EM_ASM_2(op##l, eax, cl); break; \
+ ON64(case 8: __EM_ASM_2(op##q, rax, cl); break;) \
+ EM_ASM_END
+
+/* 3-operand, using "a" (dst), "d" (src) and CL (src2) */
+#define EM_ASM_3WCL(op) \
+ EM_ASM_START(op) \
+ case 1: break; \
+ case 2: __EM_ASM_3(op##w, ax, dx, cl); break; \
+ case 4: __EM_ASM_3(op##l, eax, edx, cl); break; \
+ ON64(case 8: __EM_ASM_3(op##q, rax, rdx, cl); break;) \
+ EM_ASM_END
+
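+/*
+ * A sketch of what the wrappers above generate (abridged, for
+ * reference): EM_ASM_2(add), instantiated later in this file, expands
+ * to roughly
+ *
+ *	static int em_add(struct x86_emulate_ctxt *ctxt)
+ *	{
+ *		unsigned long flags = (ctxt->eflags & EFLAGS_MASK) | X86_EFLAGS_IF;
+ *		...
+ *		switch (bytes) {
+ *		case 1: asm("push %[flags]; popf; addb %%dl, %%al; ..."); break;
+ *		...
+ *		}
+ *		ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK);
+ *		return X86EMUL_CONTINUE;
+ *	}
+ *
+ * i.e. the guest's arithmetic flags are loaded around a single host
+ * instruction operating on the cached operand values; X86_EFLAGS_IF is
+ * forced on so the popf never disables host interrupts.
+ */
+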
+static int em_salc(struct x86_emulate_ctxt *ctxt)
+{
+ /*
+ * Set AL to 0xFF if CF is set, or to 0x00 when it is clear.
+ */
+ ctxt->dst.val = 0xFF * !!(ctxt->eflags & X86_EFLAGS_CF);
+ return X86EMUL_CONTINUE;
+}
+
+/*
+ * XXX: the user of inoutclob must know where the asm argument will be expanded.
+ * Using asm goto would allow us to remove _fault.
+ */
+#define asm_safe(insn, inoutclob...) \
+({ \
+ int _fault = 0; \
+ \
+ asm volatile("1:" insn "\n" \
+ "2:\n" \
+ _ASM_EXTABLE_TYPE_REG(1b, 2b, EX_TYPE_ONE_REG, %[_fault]) \
+ : [_fault] "+r"(_fault) inoutclob ); \
+ \
+ _fault ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE; \
})
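+
+/*
+ * For example (an assumed sketch of how the full file uses this):
+ *
+ *	rc = asm_safe("fwait");
+ *	rc = asm_safe("fxrstor %[fx]", : [fx] "m"(fx_state));
+ *
+ * The leading ':' makes the operand an input; a leading ',' would
+ * append it to the outputs instead.  A faulting instruction resumes at
+ * label 2 with _fault set by the exception table, so the caller sees
+ * X86EMUL_UNHANDLEABLE instead of an unhandled host fault.
+ */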
-static inline unsigned long ad_mask(struct decode_cache *c)
+static int emulator_check_intercept(struct x86_emulate_ctxt *ctxt,
+ enum x86_intercept intercept,
+ enum x86_intercept_stage stage)
{
- return (1UL << (c->ad_bytes << 3)) - 1;
+ struct x86_instruction_info info = {
+ .intercept = intercept,
+ .rep_prefix = ctxt->rep_prefix,
+ .modrm_mod = ctxt->modrm_mod,
+ .modrm_reg = ctxt->modrm_reg,
+ .modrm_rm = ctxt->modrm_rm,
+ .src_val = ctxt->src.val64,
+ .dst_val = ctxt->dst.val64,
+ .src_bytes = ctxt->src.bytes,
+ .dst_bytes = ctxt->dst.bytes,
+ .src_type = ctxt->src.type,
+ .dst_type = ctxt->dst.type,
+ .ad_bytes = ctxt->ad_bytes,
+ .rip = ctxt->eip,
+ .next_rip = ctxt->_eip,
+ };
+
+ return ctxt->ops->intercept(ctxt, &info, stage);
+}
+
+static void assign_masked(ulong *dest, ulong src, ulong mask)
+{
+ *dest = (*dest & ~mask) | (src & mask);
+}
+
+static void assign_register(unsigned long *reg, u64 val, int bytes)
+{
+ /* The 4-byte case *is* correct: in 64-bit mode we zero-extend. */
+ switch (bytes) {
+ case 1:
+ *(u8 *)reg = (u8)val;
+ break;
+ case 2:
+ *(u16 *)reg = (u16)val;
+ break;
+ case 4:
+ *reg = (u32)val;
+ break; /* 64b: zero-extend */
+ case 8:
+ *reg = val;
+ break;
+ }
+}
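+
+/*
+ * Worked example: in 64-bit mode "mov $-1, %eax" must clear RAX's upper
+ * half, so assign_register(reg, 0xffffffffULL, 4) stores the full value
+ * 0x00000000ffffffff instead of preserving bits 63:32, while the 1- and
+ * 2-byte cases leave the upper bits untouched.
+ */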
+
+static inline unsigned long ad_mask(struct x86_emulate_ctxt *ctxt)
+{
+ return (1UL << (ctxt->ad_bytes << 3)) - 1;
+}
+
+static ulong stack_mask(struct x86_emulate_ctxt *ctxt)
+{
+ u16 sel;
+ struct desc_struct ss;
+
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ return ~0UL;
+ ctxt->ops->get_segment(ctxt, &sel, &ss, NULL, VCPU_SREG_SS);
+ return ~0U >> ((ss.d ^ 1) * 16); /* d=0: 0xffff; d=1: 0xffffffff */
+}
+
+static int stack_size(struct x86_emulate_ctxt *ctxt)
+{
+ return (__fls(stack_mask(ctxt)) + 1) >> 3;
}
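+
+/*
+ * Example: in 64-bit mode stack_mask() is ~0UL and stack_size() is 8;
+ * with a 32-bit SS (ss.d == 1) the mask is 0xffffffff and stack ops use
+ * ESP; with a 16-bit SS the mask is 0xffff and they use SP.
+ */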
/* Access/update address held in a register, based on addressing mode. */
static inline unsigned long
-address_mask(struct decode_cache *c, unsigned long reg)
+address_mask(struct x86_emulate_ctxt *ctxt, unsigned long reg)
{
- if (c->ad_bytes == sizeof(unsigned long))
+ if (ctxt->ad_bytes == sizeof(unsigned long))
return reg;
else
- return reg & ad_mask(c);
+ return reg & ad_mask(ctxt);
}
static inline unsigned long
-register_address(struct decode_cache *c, unsigned long base, unsigned long reg)
+register_address(struct x86_emulate_ctxt *ctxt, int reg)
{
- return base + address_mask(c, reg);
+ return address_mask(ctxt, reg_read(ctxt, reg));
+}
+
+static void masked_increment(ulong *reg, ulong mask, int inc)
+{
+ assign_masked(reg, *reg + inc, mask);
}
static inline void
-register_address_increment(struct decode_cache *c, unsigned long *reg, int inc)
+register_address_increment(struct x86_emulate_ctxt *ctxt, int reg, int inc)
{
- if (c->ad_bytes == sizeof(unsigned long))
- *reg += inc;
- else
- *reg = (*reg & ~ad_mask(c)) | ((*reg + inc) & ad_mask(c));
+ ulong *preg = reg_rmw(ctxt, reg);
+
+ assign_register(preg, *preg + inc, ctxt->ad_bytes);
}
-static inline void jmp_rel(struct decode_cache *c, int rel)
+static void rsp_increment(struct x86_emulate_ctxt *ctxt, int inc)
{
- register_address_increment(c, &c->eip, rel);
+ masked_increment(reg_rmw(ctxt, VCPU_REGS_RSP), stack_mask(ctxt), inc);
}
-static void set_seg_override(struct decode_cache *c, int seg)
+static u32 desc_limit_scaled(struct desc_struct *desc)
{
- c->has_seg_override = true;
- c->seg_override = seg;
+ u32 limit = get_desc_limit(desc);
+
+ return desc->g ? (limit << 12) | 0xfff : limit;
}
static unsigned long seg_base(struct x86_emulate_ctxt *ctxt, int seg)
@@ -622,85 +525,418 @@ static unsigned long seg_base(struct x86_emulate_ctxt *ctxt, int seg)
if (ctxt->mode == X86EMUL_MODE_PROT64 && seg < VCPU_SREG_FS)
return 0;
- return kvm_x86_ops->get_segment_base(ctxt->vcpu, seg);
+ return ctxt->ops->get_cached_segment_base(ctxt, seg);
}
-static unsigned long seg_override_base(struct x86_emulate_ctxt *ctxt,
- struct decode_cache *c)
+static int emulate_exception(struct x86_emulate_ctxt *ctxt, int vec,
+ u32 error, bool valid)
{
- if (!c->has_seg_override)
- return 0;
+ if (KVM_EMULATOR_BUG_ON(vec > 0x1f, ctxt))
+ return X86EMUL_UNHANDLEABLE;
- return seg_base(ctxt, c->seg_override);
+ ctxt->exception.vector = vec;
+ ctxt->exception.error_code = error;
+ ctxt->exception.error_code_valid = valid;
+ return X86EMUL_PROPAGATE_FAULT;
}
-static unsigned long es_base(struct x86_emulate_ctxt *ctxt)
+static int emulate_db(struct x86_emulate_ctxt *ctxt)
{
- return seg_base(ctxt, VCPU_SREG_ES);
+ return emulate_exception(ctxt, DB_VECTOR, 0, false);
}
-static unsigned long ss_base(struct x86_emulate_ctxt *ctxt)
+static int emulate_gp(struct x86_emulate_ctxt *ctxt, int err)
{
- return seg_base(ctxt, VCPU_SREG_SS);
+ return emulate_exception(ctxt, GP_VECTOR, err, true);
}
-static int do_fetch_insn_byte(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- unsigned long eip, u8 *dest)
+static int emulate_ss(struct x86_emulate_ctxt *ctxt, int err)
{
- struct fetch_cache *fc = &ctxt->decode.fetch;
- int rc;
- int size, cur_size;
+ return emulate_exception(ctxt, SS_VECTOR, err, true);
+}
- if (eip == fc->end) {
- cur_size = fc->end - fc->start;
- size = min(15UL - cur_size, PAGE_SIZE - offset_in_page(eip));
- rc = ops->fetch(ctxt->cs_base + eip, fc->data + cur_size,
- size, ctxt->vcpu, NULL);
- if (rc != X86EMUL_CONTINUE)
- return rc;
- fc->end += size;
+static int emulate_ud(struct x86_emulate_ctxt *ctxt)
+{
+ return emulate_exception(ctxt, UD_VECTOR, 0, false);
+}
+
+static int emulate_ts(struct x86_emulate_ctxt *ctxt, int err)
+{
+ return emulate_exception(ctxt, TS_VECTOR, err, true);
+}
+
+static int emulate_de(struct x86_emulate_ctxt *ctxt)
+{
+ return emulate_exception(ctxt, DE_VECTOR, 0, false);
+}
+
+static int emulate_nm(struct x86_emulate_ctxt *ctxt)
+{
+ return emulate_exception(ctxt, NM_VECTOR, 0, false);
+}
+
+static u16 get_segment_selector(struct x86_emulate_ctxt *ctxt, unsigned seg)
+{
+ u16 selector;
+ struct desc_struct desc;
+
+ ctxt->ops->get_segment(ctxt, &selector, &desc, NULL, seg);
+ return selector;
+}
+
+static void set_segment_selector(struct x86_emulate_ctxt *ctxt, u16 selector,
+ unsigned seg)
+{
+ u16 dummy;
+ u32 base3;
+ struct desc_struct desc;
+
+ ctxt->ops->get_segment(ctxt, &dummy, &desc, &base3, seg);
+ ctxt->ops->set_segment(ctxt, selector, &desc, base3, seg);
+}
+
+static inline u8 ctxt_virt_addr_bits(struct x86_emulate_ctxt *ctxt)
+{
+ return (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_LA57) ? 57 : 48;
+}
+
+static inline bool emul_is_noncanonical_address(u64 la,
+ struct x86_emulate_ctxt *ctxt,
+ unsigned int flags)
+{
+ return !ctxt->ops->is_canonical_addr(ctxt, la, flags);
+}
+
+/*
+ * x86 defines three classes of vector instructions: explicitly
+ * aligned, explicitly unaligned, and the rest, which change behaviour
+ * depending on whether they're AVX encoded or not.
+ *
+ * Also included is CMPXCHG16B, which is not a vector instruction but is
+ * subject to the same check. FXSAVE and FXRSTOR are checked here too, as
+ * their 512 bytes of data must be aligned to a 16-byte boundary.
+ */
+static unsigned insn_alignment(struct x86_emulate_ctxt *ctxt, unsigned size)
+{
+ u64 alignment = ctxt->d & AlignMask;
+
+ if (likely(size < 16))
+ return 1;
+
+ switch (alignment) {
+ case Unaligned:
+ return 1;
+ case Aligned16:
+ return 16;
+ case Aligned:
+ default:
+ return size;
+ }
+}
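+
+/*
+ * Example: MOVDQA (Aligned) with a 16-byte operand requires the linear
+ * address to be 16-byte aligned and #GPs otherwise; MOVDQU (Unaligned)
+ * never faults on alignment; FXSAVE (Aligned16) needs only 16-byte
+ * alignment even though it touches 512 bytes.
+ */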
+
+static __always_inline int __linearize(struct x86_emulate_ctxt *ctxt,
+ struct segmented_address addr,
+ unsigned *max_size, unsigned size,
+ enum x86emul_mode mode, ulong *linear,
+ unsigned int flags)
+{
+ struct desc_struct desc;
+ bool usable;
+ ulong la;
+ u32 lim;
+ u16 sel;
+ u8 va_bits;
+
+ la = seg_base(ctxt, addr.seg) + addr.ea;
+ *max_size = 0;
+ switch (mode) {
+ case X86EMUL_MODE_PROT64:
+ *linear = la = ctxt->ops->get_untagged_addr(ctxt, la, flags);
+ va_bits = ctxt_virt_addr_bits(ctxt);
+ if (!__is_canonical_address(la, va_bits))
+ goto bad;
+
+ *max_size = min_t(u64, ~0u, (1ull << va_bits) - la);
+ if (size > *max_size)
+ goto bad;
+ break;
+ default:
+ *linear = la = (u32)la;
+ usable = ctxt->ops->get_segment(ctxt, &sel, &desc, NULL,
+ addr.seg);
+ if (!usable)
+ goto bad;
+ /* code segment in protected mode or read-only data segment */
+ if ((((ctxt->mode != X86EMUL_MODE_REAL) && (desc.type & 8)) || !(desc.type & 2)) &&
+ (flags & X86EMUL_F_WRITE))
+ goto bad;
+ /* unreadable code segment */
+ if (!(flags & X86EMUL_F_FETCH) && (desc.type & 8) && !(desc.type & 2))
+ goto bad;
+ lim = desc_limit_scaled(&desc);
+ if (!(desc.type & 8) && (desc.type & 4)) {
+ /* expand-down segment */
+ if (addr.ea <= lim)
+ goto bad;
+ lim = desc.d ? 0xffffffff : 0xffff;
+ }
+ if (addr.ea > lim)
+ goto bad;
+ if (lim == 0xffffffff)
+ *max_size = ~0u;
+ else {
+ *max_size = (u64)lim + 1 - addr.ea;
+ if (size > *max_size)
+ goto bad;
+ }
+ break;
}
- *dest = fc->data[eip - fc->start];
+ if (la & (insn_alignment(ctxt, size) - 1))
+ return emulate_gp(ctxt, 0);
return X86EMUL_CONTINUE;
+bad:
+ if (addr.seg == VCPU_SREG_SS)
+ return emulate_ss(ctxt, 0);
+ else
+ return emulate_gp(ctxt, 0);
+}
+
+static int linearize(struct x86_emulate_ctxt *ctxt,
+ struct segmented_address addr,
+ unsigned size, bool write,
+ ulong *linear)
+{
+ unsigned max_size;
+ return __linearize(ctxt, addr, &max_size, size, ctxt->mode, linear,
+ write ? X86EMUL_F_WRITE : 0);
}
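+
+/*
+ * A minimal caller sketch, mirroring segmented_read() further below:
+ *
+ *	ulong linear;
+ *
+ *	rc = linearize(ctxt, addr, size, false, &linear);
+ *	if (rc != X86EMUL_CONTINUE)
+ *		return rc;
+ *	... access 'linear', now canonical/limit- and alignment-checked ...
+ */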
-static int do_insn_fetch(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- unsigned long eip, void *dest, unsigned size)
+static inline int assign_eip(struct x86_emulate_ctxt *ctxt, ulong dst)
{
+ ulong linear;
int rc;
+ unsigned max_size;
+ struct segmented_address addr = { .seg = VCPU_SREG_CS,
+ .ea = dst };
+
+ if (ctxt->op_bytes != sizeof(unsigned long))
+ addr.ea = dst & ((1UL << (ctxt->op_bytes << 3)) - 1);
+ rc = __linearize(ctxt, addr, &max_size, 1, ctxt->mode, &linear,
+ X86EMUL_F_FETCH);
+ if (rc == X86EMUL_CONTINUE)
+ ctxt->_eip = addr.ea;
+ return rc;
+}
+
+static inline int emulator_recalc_and_set_mode(struct x86_emulate_ctxt *ctxt)
+{
+ u64 efer;
+ struct desc_struct cs;
+ u16 selector;
+ u32 base3;
+
+ ctxt->ops->get_msr(ctxt, MSR_EFER, &efer);
+
+ if (!(ctxt->ops->get_cr(ctxt, 0) & X86_CR0_PE)) {
+ /* Real mode; the CPU must not have long mode active */
+ if (efer & EFER_LMA)
+ return X86EMUL_UNHANDLEABLE;
+ ctxt->mode = X86EMUL_MODE_REAL;
+ return X86EMUL_CONTINUE;
+ }
+
+ if (ctxt->eflags & X86_EFLAGS_VM) {
+ /* Protected/VM86 mode; the CPU must not have long mode active */
+ if (efer & EFER_LMA)
+ return X86EMUL_UNHANDLEABLE;
+ ctxt->mode = X86EMUL_MODE_VM86;
+ return X86EMUL_CONTINUE;
+ }
- /* x86 instructions are limited to 15 bytes. */
- if (eip + size - ctxt->eip > 15)
+ if (!ctxt->ops->get_segment(ctxt, &selector, &cs, &base3, VCPU_SREG_CS))
return X86EMUL_UNHANDLEABLE;
- while (size--) {
- rc = do_fetch_insn_byte(ctxt, ops, eip++, dest++);
- if (rc != X86EMUL_CONTINUE)
- return rc;
+
+ if (efer & EFER_LMA) {
+ if (cs.l) {
+ /* Proper long mode */
+ ctxt->mode = X86EMUL_MODE_PROT64;
+ } else if (cs.d) {
+ /* 32-bit compatibility mode */
+ ctxt->mode = X86EMUL_MODE_PROT32;
+ } else {
+ ctxt->mode = X86EMUL_MODE_PROT16;
+ }
+ } else {
+ /* Legacy 32-bit / 16-bit mode */
+ ctxt->mode = cs.d ? X86EMUL_MODE_PROT32 : X86EMUL_MODE_PROT16;
}
+
return X86EMUL_CONTINUE;
}
+static inline int assign_eip_near(struct x86_emulate_ctxt *ctxt, ulong dst)
+{
+ return assign_eip(ctxt, dst);
+}
+
+static int assign_eip_far(struct x86_emulate_ctxt *ctxt, ulong dst)
+{
+ int rc = emulator_recalc_and_set_mode(ctxt);
+
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ return assign_eip(ctxt, dst);
+}
+
+static inline int jmp_rel(struct x86_emulate_ctxt *ctxt, int rel)
+{
+ return assign_eip_near(ctxt, ctxt->_eip + rel);
+}
+
+static int linear_read_system(struct x86_emulate_ctxt *ctxt, ulong linear,
+ void *data, unsigned size)
+{
+ return ctxt->ops->read_std(ctxt, linear, data, size, &ctxt->exception, true);
+}
+
+static int linear_write_system(struct x86_emulate_ctxt *ctxt,
+ ulong linear, void *data,
+ unsigned int size)
+{
+ return ctxt->ops->write_std(ctxt, linear, data, size, &ctxt->exception, true);
+}
+
+static int segmented_read_std(struct x86_emulate_ctxt *ctxt,
+ struct segmented_address addr,
+ void *data,
+ unsigned size)
+{
+ int rc;
+ ulong linear;
+
+ rc = linearize(ctxt, addr, size, false, &linear);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ return ctxt->ops->read_std(ctxt, linear, data, size, &ctxt->exception, false);
+}
+
+static int segmented_write_std(struct x86_emulate_ctxt *ctxt,
+ struct segmented_address addr,
+ void *data,
+ unsigned int size)
+{
+ int rc;
+ ulong linear;
+
+ rc = linearize(ctxt, addr, size, true, &linear);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ return ctxt->ops->write_std(ctxt, linear, data, size, &ctxt->exception, false);
+}
+
+/*
+ * Prefetch the remaining bytes of the instruction without crossing page
+ * boundary if they are not in fetch_cache yet.
+ */
+static int __do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt, int op_size)
+{
+ int rc;
+ unsigned size, max_size;
+ unsigned long linear;
+ int cur_size = ctxt->fetch.end - ctxt->fetch.data;
+ struct segmented_address addr = { .seg = VCPU_SREG_CS,
+ .ea = ctxt->eip + cur_size };
+
+ /*
+ * We do not know exactly how many bytes will be needed, and
+ * __linearize is expensive, so fetch as much as possible. We
+ * just have to avoid going beyond the 15-byte limit, the end
+ * of the segment, or the end of the page.
+ *
+ * __linearize is called with size 0 so that it does not do any
+ * boundary check itself. Instead, we use max_size to check
+ * against op_size.
+ */
+ rc = __linearize(ctxt, addr, &max_size, 0, ctxt->mode, &linear,
+ X86EMUL_F_FETCH);
+ if (unlikely(rc != X86EMUL_CONTINUE))
+ return rc;
+
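+ /* cur_size <= 15 here, so 15UL ^ cur_size == 15 - cur_size */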
+ size = min_t(unsigned, 15UL ^ cur_size, max_size);
+ size = min_t(unsigned, size, PAGE_SIZE - offset_in_page(linear));
+
+ /*
+ * One instruction can only straddle two pages, and one page has
+ * already been loaded at the beginning of x86_decode_insn.  So if
+ * there are still not enough bytes, we must have hit the 15-byte
+ * limit.
+ */
+ if (unlikely(size < op_size))
+ return emulate_gp(ctxt, 0);
+
+ rc = ctxt->ops->fetch(ctxt, linear, ctxt->fetch.end,
+ size, &ctxt->exception);
+ if (unlikely(rc != X86EMUL_CONTINUE))
+ return rc;
+ ctxt->fetch.end += size;
+ return X86EMUL_CONTINUE;
+}
+
+static __always_inline int do_insn_fetch_bytes(struct x86_emulate_ctxt *ctxt,
+ unsigned size)
+{
+ unsigned done_size = ctxt->fetch.end - ctxt->fetch.ptr;
+
+ if (unlikely(done_size < size))
+ return __do_insn_fetch_bytes(ctxt, size - done_size);
+ else
+ return X86EMUL_CONTINUE;
+}
+
+/* Fetch next part of the instruction being emulated. */
+#define insn_fetch(_type, _ctxt) \
+({ _type _x; \
+ \
+ rc = do_insn_fetch_bytes(_ctxt, sizeof(_type)); \
+ if (rc != X86EMUL_CONTINUE) \
+ goto done; \
+ ctxt->_eip += sizeof(_type); \
+ memcpy(&_x, ctxt->fetch.ptr, sizeof(_type)); \
+ ctxt->fetch.ptr += sizeof(_type); \
+ _x; \
+})
+
+#define insn_fetch_arr(_arr, _size, _ctxt) \
+({ \
+ rc = do_insn_fetch_bytes(_ctxt, _size); \
+ if (rc != X86EMUL_CONTINUE) \
+ goto done; \
+ ctxt->_eip += (_size); \
+ memcpy(_arr, ctxt->fetch.ptr, _size); \
+ ctxt->fetch.ptr += (_size); \
+})
+
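+/*
+ * Usage (as in the decode paths below): the macro bounds-checks through
+ * do_insn_fetch_bytes() and advances _eip in one step, e.g.
+ *
+ *	modrm_ea += insn_fetch(s8, ctxt);
+ *
+ * jumping to the enclosing function's "done" label on failure, which is
+ * why this is a macro rather than a function.
+ */
+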
/*
* Given the 'reg' portion of a ModRM byte, and a register block, return a
* pointer into the block that addresses the relevant register.
* @highbyte_regs specifies whether to decode AH,CH,DH,BH.
*/
-static void *decode_register(u8 modrm_reg, unsigned long *regs,
- int highbyte_regs)
+static void *decode_register(struct x86_emulate_ctxt *ctxt, u8 modrm_reg,
+ int byteop)
{
void *p;
+ int highbyte_regs = (ctxt->rex_prefix == REX_NONE) && byteop;
- p = &regs[modrm_reg];
if (highbyte_regs && modrm_reg >= 4 && modrm_reg < 8)
- p = (unsigned char *)&regs[modrm_reg & 3] + 1;
+ p = (unsigned char *)reg_rmw(ctxt, modrm_reg & 3) + 1;
+ else
+ p = reg_rmw(ctxt, modrm_reg);
return p;
}
static int read_descriptor(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- void *ptr,
+ struct segmented_address addr,
u16 *size, unsigned long *address, int op_bytes)
{
int rc;
@@ -708,743 +944,725 @@ static int read_descriptor(struct x86_emulate_ctxt *ctxt,
if (op_bytes == 2)
op_bytes = 3;
*address = 0;
- rc = ops->read_std((unsigned long)ptr, (unsigned long *)size, 2,
- ctxt->vcpu, NULL);
+ rc = segmented_read_std(ctxt, addr, size, 2);
if (rc != X86EMUL_CONTINUE)
return rc;
- rc = ops->read_std((unsigned long)ptr + 2, address, op_bytes,
- ctxt->vcpu, NULL);
+ addr.ea += 2;
+ rc = segmented_read_std(ctxt, addr, address, op_bytes);
return rc;
}
-static int test_cc(unsigned int condition, unsigned int flags)
+EM_ASM_2(add);
+EM_ASM_2(or);
+EM_ASM_2(adc);
+EM_ASM_2(sbb);
+EM_ASM_2(and);
+EM_ASM_2(sub);
+EM_ASM_2(xor);
+EM_ASM_2(cmp);
+EM_ASM_2(test);
+EM_ASM_2(xadd);
+
+EM_ASM_1SRC2(mul, mul_ex);
+EM_ASM_1SRC2(imul, imul_ex);
+EM_ASM_1SRC2EX(div, div_ex);
+EM_ASM_1SRC2EX(idiv, idiv_ex);
+
+EM_ASM_3WCL(shld);
+EM_ASM_3WCL(shrd);
+
+EM_ASM_2W(imul);
+
+EM_ASM_1(not);
+EM_ASM_1(neg);
+EM_ASM_1(inc);
+EM_ASM_1(dec);
+
+EM_ASM_2CL(rol);
+EM_ASM_2CL(ror);
+EM_ASM_2CL(rcl);
+EM_ASM_2CL(rcr);
+EM_ASM_2CL(shl);
+EM_ASM_2CL(shr);
+EM_ASM_2CL(sar);
+
+EM_ASM_2W(bsf);
+EM_ASM_2W(bsr);
+EM_ASM_2W(bt);
+EM_ASM_2W(bts);
+EM_ASM_2W(btr);
+EM_ASM_2W(btc);
+
+EM_ASM_2R(cmp, cmp_r);
+
+static int em_bsf_c(struct x86_emulate_ctxt *ctxt)
{
- int rc = 0;
+ /* If src is zero, do not writeback, but update flags */
+ if (ctxt->src.val == 0)
+ ctxt->dst.type = OP_NONE;
+ return em_bsf(ctxt);
+}
- switch ((condition & 15) >> 1) {
- case 0: /* o */
- rc |= (flags & EFLG_OF);
- break;
- case 1: /* b/c/nae */
- rc |= (flags & EFLG_CF);
- break;
- case 2: /* z/e */
- rc |= (flags & EFLG_ZF);
- break;
- case 3: /* be/na */
- rc |= (flags & (EFLG_CF|EFLG_ZF));
+static int em_bsr_c(struct x86_emulate_ctxt *ctxt)
+{
+ /* If src is zero, do not writeback, but update flags */
+ if (ctxt->src.val == 0)
+ ctxt->dst.type = OP_NONE;
+ return em_bsr(ctxt);
+}
+
+static __always_inline u8 test_cc(unsigned int condition, unsigned long flags)
+{
+ return __emulate_cc(flags, condition & 0xf);
+}
+
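+/*
+ * Example: condition code 4 (the "e"/"z" encoding used by JE and SETE)
+ * makes test_cc() return nonzero iff ZF is set; the low bit inverts the
+ * sense, so condition code 5 ("ne"/"nz") tests for ZF clear.
+ */
+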
+static void fetch_register_operand(struct operand *op)
+{
+ switch (op->bytes) {
+ case 1:
+ op->val = *(u8 *)op->addr.reg;
break;
- case 4: /* s */
- rc |= (flags & EFLG_SF);
+ case 2:
+ op->val = *(u16 *)op->addr.reg;
break;
- case 5: /* p/pe */
- rc |= (flags & EFLG_PF);
+ case 4:
+ op->val = *(u32 *)op->addr.reg;
break;
- case 7: /* le/ng */
- rc |= (flags & EFLG_ZF);
- /* fall through */
- case 6: /* l/nge */
- rc |= (!(flags & EFLG_SF) != !(flags & EFLG_OF));
+ case 8:
+ op->val = *(u64 *)op->addr.reg;
break;
}
+ op->orig_val = op->val;
+}
- /* Odd condition identifiers (lsb == 1) have inverted sense. */
- return (!!rc ^ (condition & 1));
+static int em_fninit(struct x86_emulate_ctxt *ctxt)
+{
+ if (ctxt->ops->get_cr(ctxt, 0) & (X86_CR0_TS | X86_CR0_EM))
+ return emulate_nm(ctxt);
+
+ kvm_fpu_get();
+ asm volatile("fninit");
+ kvm_fpu_put();
+ return X86EMUL_CONTINUE;
}
-static void decode_register_operand(struct operand *op,
- struct decode_cache *c,
- int inhibit_bytereg)
+static int em_fnstcw(struct x86_emulate_ctxt *ctxt)
{
- unsigned reg = c->modrm_reg;
- int highbyte_regs = c->rex_prefix == 0;
+ u16 fcw;
- if (!(c->d & ModRM))
- reg = (c->b & 7) | ((c->rex_prefix & 1) << 3);
- op->type = OP_REG;
- if ((c->d & ByteOp) && !inhibit_bytereg) {
- op->ptr = decode_register(reg, c->regs, highbyte_regs);
- op->val = *(u8 *)op->ptr;
- op->bytes = 1;
- } else {
- op->ptr = decode_register(reg, c->regs, 0);
- op->bytes = c->op_bytes;
- switch (op->bytes) {
- case 2:
- op->val = *(u16 *)op->ptr;
- break;
- case 4:
- op->val = *(u32 *)op->ptr;
- break;
- case 8:
- op->val = *(u64 *) op->ptr;
- break;
- }
+ if (ctxt->ops->get_cr(ctxt, 0) & (X86_CR0_TS | X86_CR0_EM))
+ return emulate_nm(ctxt);
+
+ kvm_fpu_get();
+ asm volatile("fnstcw %0": "+m"(fcw));
+ kvm_fpu_put();
+
+ ctxt->dst.val = fcw;
+
+ return X86EMUL_CONTINUE;
+}
+
+static int em_fnstsw(struct x86_emulate_ctxt *ctxt)
+{
+ u16 fsw;
+
+ if (ctxt->ops->get_cr(ctxt, 0) & (X86_CR0_TS | X86_CR0_EM))
+ return emulate_nm(ctxt);
+
+ kvm_fpu_get();
+ asm volatile("fnstsw %0": "+m"(fsw));
+ kvm_fpu_put();
+
+ ctxt->dst.val = fsw;
+
+ return X86EMUL_CONTINUE;
+}
+
+static void __decode_register_operand(struct x86_emulate_ctxt *ctxt,
+ struct operand *op, int reg)
+{
+ if ((ctxt->d & Avx) && ctxt->op_bytes == 32) {
+ op->type = OP_YMM;
+ op->bytes = 32;
+ op->addr.xmm = reg;
+ kvm_read_avx_reg(reg, &op->vec_val2);
+ return;
}
- op->orig_val = op->val;
+ if (ctxt->d & (Avx|Sse)) {
+ op->type = OP_XMM;
+ op->bytes = 16;
+ op->addr.xmm = reg;
+ kvm_read_sse_reg(reg, &op->vec_val);
+ return;
+ }
+ if (ctxt->d & Mmx) {
+ reg &= 7;
+ op->type = OP_MM;
+ op->bytes = 8;
+ op->addr.mm = reg;
+ return;
+ }
+
+ op->type = OP_REG;
+ op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
+ op->addr.reg = decode_register(ctxt, reg, ctxt->d & ByteOp);
+ fetch_register_operand(op);
+}
+
+static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
+ struct operand *op)
+{
+ unsigned int reg;
+
+ if (ctxt->d & ModRM)
+ reg = ctxt->modrm_reg;
+ else
+ reg = (ctxt->b & 7) | (ctxt->rex_bits & REX_B ? 8 : 0);
+
+ __decode_register_operand(ctxt, op, reg);
+}
+
+static void adjust_modrm_seg(struct x86_emulate_ctxt *ctxt, int base_reg)
+{
+ if (base_reg == VCPU_REGS_RSP || base_reg == VCPU_REGS_RBP)
+ ctxt->modrm_seg = VCPU_SREG_SS;
}
static int decode_modrm(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops)
+ struct operand *op)
{
- struct decode_cache *c = &ctxt->decode;
u8 sib;
- int index_reg = 0, base_reg = 0, scale;
+ int index_reg, base_reg, scale;
int rc = X86EMUL_CONTINUE;
+ ulong modrm_ea = 0;
- if (c->rex_prefix) {
- c->modrm_reg = (c->rex_prefix & 4) << 1; /* REX.R */
- index_reg = (c->rex_prefix & 2) << 2; /* REX.X */
- c->modrm_rm = base_reg = (c->rex_prefix & 1) << 3; /* REG.B */
- }
+ ctxt->modrm_reg = (ctxt->rex_bits & REX_R ? 8 : 0);
+ index_reg = (ctxt->rex_bits & REX_X ? 8 : 0);
+ base_reg = (ctxt->rex_bits & REX_B ? 8 : 0);
- c->modrm = insn_fetch(u8, 1, c->eip);
- c->modrm_mod |= (c->modrm & 0xc0) >> 6;
- c->modrm_reg |= (c->modrm & 0x38) >> 3;
- c->modrm_rm |= (c->modrm & 0x07);
- c->modrm_ea = 0;
- c->use_modrm_ea = 1;
+ ctxt->modrm_mod = (ctxt->modrm & 0xc0) >> 6;
+ ctxt->modrm_reg |= (ctxt->modrm & 0x38) >> 3;
+ ctxt->modrm_rm = base_reg | (ctxt->modrm & 0x07);
+ ctxt->modrm_seg = VCPU_SREG_DS;
- if (c->modrm_mod == 3) {
- c->modrm_ptr = decode_register(c->modrm_rm,
- c->regs, c->d & ByteOp);
- c->modrm_val = *(unsigned long *)c->modrm_ptr;
+ if (ctxt->modrm_mod == 3 || (ctxt->d & NoMod)) {
+ __decode_register_operand(ctxt, op, ctxt->modrm_rm);
return rc;
}
- if (c->ad_bytes == 2) {
- unsigned bx = c->regs[VCPU_REGS_RBX];
- unsigned bp = c->regs[VCPU_REGS_RBP];
- unsigned si = c->regs[VCPU_REGS_RSI];
- unsigned di = c->regs[VCPU_REGS_RDI];
+ op->type = OP_MEM;
+
+ if (ctxt->ad_bytes == 2) {
+ unsigned bx = reg_read(ctxt, VCPU_REGS_RBX);
+ unsigned bp = reg_read(ctxt, VCPU_REGS_RBP);
+ unsigned si = reg_read(ctxt, VCPU_REGS_RSI);
+ unsigned di = reg_read(ctxt, VCPU_REGS_RDI);
/* 16-bit ModR/M decode. */
- switch (c->modrm_mod) {
+ switch (ctxt->modrm_mod) {
case 0:
- if (c->modrm_rm == 6)
- c->modrm_ea += insn_fetch(u16, 2, c->eip);
+ if (ctxt->modrm_rm == 6)
+ modrm_ea += insn_fetch(u16, ctxt);
break;
case 1:
- c->modrm_ea += insn_fetch(s8, 1, c->eip);
+ modrm_ea += insn_fetch(s8, ctxt);
break;
case 2:
- c->modrm_ea += insn_fetch(u16, 2, c->eip);
+ modrm_ea += insn_fetch(u16, ctxt);
break;
}
- switch (c->modrm_rm) {
+ switch (ctxt->modrm_rm) {
case 0:
- c->modrm_ea += bx + si;
+ modrm_ea += bx + si;
break;
case 1:
- c->modrm_ea += bx + di;
+ modrm_ea += bx + di;
break;
case 2:
- c->modrm_ea += bp + si;
+ modrm_ea += bp + si;
break;
case 3:
- c->modrm_ea += bp + di;
+ modrm_ea += bp + di;
break;
case 4:
- c->modrm_ea += si;
+ modrm_ea += si;
break;
case 5:
- c->modrm_ea += di;
+ modrm_ea += di;
break;
case 6:
- if (c->modrm_mod != 0)
- c->modrm_ea += bp;
+ if (ctxt->modrm_mod != 0)
+ modrm_ea += bp;
break;
case 7:
- c->modrm_ea += bx;
+ modrm_ea += bx;
break;
}
- if (c->modrm_rm == 2 || c->modrm_rm == 3 ||
- (c->modrm_rm == 6 && c->modrm_mod != 0))
- if (!c->has_seg_override)
- set_seg_override(c, VCPU_SREG_SS);
- c->modrm_ea = (u16)c->modrm_ea;
+ if (ctxt->modrm_rm == 2 || ctxt->modrm_rm == 3 ||
+ (ctxt->modrm_rm == 6 && ctxt->modrm_mod != 0))
+ ctxt->modrm_seg = VCPU_SREG_SS;
+ modrm_ea = (u16)modrm_ea;
} else {
/* 32/64-bit ModR/M decode. */
- if ((c->modrm_rm & 7) == 4) {
- sib = insn_fetch(u8, 1, c->eip);
+ if ((ctxt->modrm_rm & 7) == 4) {
+ sib = insn_fetch(u8, ctxt);
index_reg |= (sib >> 3) & 7;
base_reg |= sib & 7;
scale = sib >> 6;
- if ((base_reg & 7) == 5 && c->modrm_mod == 0)
- c->modrm_ea += insn_fetch(s32, 4, c->eip);
- else
- c->modrm_ea += c->regs[base_reg];
+ if ((base_reg & 7) == 5 && ctxt->modrm_mod == 0)
+ modrm_ea += insn_fetch(s32, ctxt);
+ else {
+ modrm_ea += reg_read(ctxt, base_reg);
+ adjust_modrm_seg(ctxt, base_reg);
+ /* Increment ESP on POP [ESP] */
+ if ((ctxt->d & IncSP) &&
+ base_reg == VCPU_REGS_RSP)
+ modrm_ea += ctxt->op_bytes;
+ }
if (index_reg != 4)
- c->modrm_ea += c->regs[index_reg] << scale;
- } else if ((c->modrm_rm & 7) == 5 && c->modrm_mod == 0) {
+ modrm_ea += reg_read(ctxt, index_reg) << scale;
+ } else if ((ctxt->modrm_rm & 7) == 5 && ctxt->modrm_mod == 0) {
+ modrm_ea += insn_fetch(s32, ctxt);
if (ctxt->mode == X86EMUL_MODE_PROT64)
- c->rip_relative = 1;
- } else
- c->modrm_ea += c->regs[c->modrm_rm];
- switch (c->modrm_mod) {
- case 0:
- if (c->modrm_rm == 5)
- c->modrm_ea += insn_fetch(s32, 4, c->eip);
- break;
+ ctxt->rip_relative = 1;
+ } else {
+ base_reg = ctxt->modrm_rm;
+ modrm_ea += reg_read(ctxt, base_reg);
+ adjust_modrm_seg(ctxt, base_reg);
+ }
+ switch (ctxt->modrm_mod) {
case 1:
- c->modrm_ea += insn_fetch(s8, 1, c->eip);
+ modrm_ea += insn_fetch(s8, ctxt);
break;
case 2:
- c->modrm_ea += insn_fetch(s32, 4, c->eip);
+ modrm_ea += insn_fetch(s32, ctxt);
break;
}
}
+ op->addr.mem.ea = modrm_ea;
+ if (ctxt->ad_bytes != 8)
+ ctxt->memop.addr.mem.ea = (u32)ctxt->memop.addr.mem.ea;
+
done:
return rc;
}
static int decode_abs(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops)
+ struct operand *op)
{
- struct decode_cache *c = &ctxt->decode;
int rc = X86EMUL_CONTINUE;
- switch (c->ad_bytes) {
+ op->type = OP_MEM;
+ switch (ctxt->ad_bytes) {
case 2:
- c->modrm_ea = insn_fetch(u16, 2, c->eip);
+ op->addr.mem.ea = insn_fetch(u16, ctxt);
break;
case 4:
- c->modrm_ea = insn_fetch(u32, 4, c->eip);
+ op->addr.mem.ea = insn_fetch(u32, ctxt);
break;
case 8:
- c->modrm_ea = insn_fetch(u64, 8, c->eip);
+ op->addr.mem.ea = insn_fetch(u64, ctxt);
break;
}
done:
return rc;
}
-int
-x86_decode_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
+static void fetch_bit_operand(struct x86_emulate_ctxt *ctxt)
{
- struct decode_cache *c = &ctxt->decode;
- int rc = X86EMUL_CONTINUE;
- int mode = ctxt->mode;
- int def_op_bytes, def_ad_bytes, group;
-
-
- /* we cannot decode insn before we complete previous rep insn */
- WARN_ON(ctxt->restart);
-
- /* Shadow copy of register state. Committed on successful emulation. */
- memset(c, 0, sizeof(struct decode_cache));
- c->eip = ctxt->eip;
- c->fetch.start = c->fetch.end = c->eip;
- ctxt->cs_base = seg_base(ctxt, VCPU_SREG_CS);
- memcpy(c->regs, ctxt->vcpu->arch.regs, sizeof c->regs);
-
- switch (mode) {
- case X86EMUL_MODE_REAL:
- case X86EMUL_MODE_VM86:
- case X86EMUL_MODE_PROT16:
- def_op_bytes = def_ad_bytes = 2;
- break;
- case X86EMUL_MODE_PROT32:
- def_op_bytes = def_ad_bytes = 4;
- break;
-#ifdef CONFIG_X86_64
- case X86EMUL_MODE_PROT64:
- def_op_bytes = 4;
- def_ad_bytes = 8;
- break;
-#endif
- default:
- return -1;
- }
-
- c->op_bytes = def_op_bytes;
- c->ad_bytes = def_ad_bytes;
-
- /* Legacy prefixes. */
- for (;;) {
- switch (c->b = insn_fetch(u8, 1, c->eip)) {
- case 0x66: /* operand-size override */
- /* switch between 2/4 bytes */
- c->op_bytes = def_op_bytes ^ 6;
- break;
- case 0x67: /* address-size override */
- if (mode == X86EMUL_MODE_PROT64)
- /* switch between 4/8 bytes */
- c->ad_bytes = def_ad_bytes ^ 12;
- else
- /* switch between 2/4 bytes */
- c->ad_bytes = def_ad_bytes ^ 6;
- break;
- case 0x26: /* ES override */
- case 0x2e: /* CS override */
- case 0x36: /* SS override */
- case 0x3e: /* DS override */
- set_seg_override(c, (c->b >> 3) & 3);
- break;
- case 0x64: /* FS override */
- case 0x65: /* GS override */
- set_seg_override(c, c->b & 7);
- break;
- case 0x40 ... 0x4f: /* REX */
- if (mode != X86EMUL_MODE_PROT64)
- goto done_prefixes;
- c->rex_prefix = c->b;
- continue;
- case 0xf0: /* LOCK */
- c->lock_prefix = 1;
- break;
- case 0xf2: /* REPNE/REPNZ */
- c->rep_prefix = REPNE_PREFIX;
- break;
- case 0xf3: /* REP/REPE/REPZ */
- c->rep_prefix = REPE_PREFIX;
- break;
- default:
- goto done_prefixes;
- }
-
- /* Any legacy prefix after a REX prefix nullifies its effect. */
-
- c->rex_prefix = 0;
- }
+ long sv = 0, mask;
-done_prefixes:
+ if (ctxt->dst.type == OP_MEM && ctxt->src.type == OP_REG) {
+ mask = ~((long)ctxt->dst.bytes * 8 - 1);
- /* REX prefix. */
- if (c->rex_prefix)
- if (c->rex_prefix & 8)
- c->op_bytes = 8; /* REX.W */
+ if (ctxt->src.bytes == 2)
+ sv = (s16)ctxt->src.val & (s16)mask;
+ else if (ctxt->src.bytes == 4)
+ sv = (s32)ctxt->src.val & (s32)mask;
+ else
+ sv = (s64)ctxt->src.val & (s64)mask;
- /* Opcode byte(s). */
- c->d = opcode_table[c->b];
- if (c->d == 0) {
- /* Two-byte opcode? */
- if (c->b == 0x0f) {
- c->twobyte = 1;
- c->b = insn_fetch(u8, 1, c->eip);
- c->d = twobyte_table[c->b];
- }
+ ctxt->dst.addr.mem.ea = address_mask(ctxt,
+ ctxt->dst.addr.mem.ea + (sv >> 3));
}
- if (c->d & Group) {
- group = c->d & GroupMask;
- c->modrm = insn_fetch(u8, 1, c->eip);
- --c->eip;
+ /* only subword offset */
+ ctxt->src.val &= (ctxt->dst.bytes << 3) - 1;
+}
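+
+/*
+ * Worked example: for "bt %eax, (mem)" with EAX == 100 and a 32-bit
+ * operand, mask is ~31L and sv is 96, so the effective address advances
+ * by 96 / 8 == 12 bytes and the remaining offset 100 & 31 == 4 selects
+ * bit 4 of that dword.  Immediate bit offsets skip the address
+ * adjustment and only get the final subword masking.
+ */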
- group = (group << 3) + ((c->modrm >> 3) & 7);
- if ((c->d & GroupDual) && (c->modrm >> 6) == 3)
- c->d = group2_table[group];
- else
- c->d = group_table[group];
- }
+static int read_emulated(struct x86_emulate_ctxt *ctxt,
+ unsigned long addr, void *dest, unsigned size)
+{
+ int rc;
+ struct read_cache *mc = &ctxt->mem_read;
- /* Unrecognised? */
- if (c->d == 0) {
- DPRINTF("Cannot emulate %02x\n", c->b);
- return -1;
- }
+ if (mc->pos < mc->end)
+ goto read_cached;
- if (mode == X86EMUL_MODE_PROT64 && (c->d & Stack))
- c->op_bytes = 8;
+ if (KVM_EMULATOR_BUG_ON((mc->end + size) >= sizeof(mc->data), ctxt))
+ return X86EMUL_UNHANDLEABLE;
- /* ModRM and SIB bytes. */
- if (c->d & ModRM)
- rc = decode_modrm(ctxt, ops);
- else if (c->d & MemAbs)
- rc = decode_abs(ctxt, ops);
+ rc = ctxt->ops->read_emulated(ctxt, addr, mc->data + mc->end, size,
+ &ctxt->exception);
if (rc != X86EMUL_CONTINUE)
- goto done;
-
- if (!c->has_seg_override)
- set_seg_override(c, VCPU_SREG_DS);
+ return rc;
- if (!(!c->twobyte && c->b == 0x8d))
- c->modrm_ea += seg_override_base(ctxt, c);
+ mc->end += size;
- if (c->ad_bytes != 8)
- c->modrm_ea = (u32)c->modrm_ea;
+read_cached:
+ memcpy(dest, mc->data + mc->pos, size);
+ mc->pos += size;
+ return X86EMUL_CONTINUE;
+}
- if (c->rip_relative)
- c->modrm_ea += c->eip;
+static int segmented_read(struct x86_emulate_ctxt *ctxt,
+ struct segmented_address addr,
+ void *data,
+ unsigned size)
+{
+ int rc;
+ ulong linear;
- /*
- * Decode and fetch the source operand: register, memory
- * or immediate.
- */
- switch (c->d & SrcMask) {
- case SrcNone:
- break;
- case SrcReg:
- decode_register_operand(&c->src, c, 0);
- break;
- case SrcMem16:
- c->src.bytes = 2;
- goto srcmem_common;
- case SrcMem32:
- c->src.bytes = 4;
- goto srcmem_common;
- case SrcMem:
- c->src.bytes = (c->d & ByteOp) ? 1 :
- c->op_bytes;
- /* Don't fetch the address for invlpg: it could be unmapped. */
- if (c->twobyte && c->b == 0x01 && c->modrm_reg == 7)
- break;
- srcmem_common:
- /*
- * For instructions with a ModR/M byte, switch to register
- * access if Mod = 3.
- */
- if ((c->d & ModRM) && c->modrm_mod == 3) {
- c->src.type = OP_REG;
- c->src.val = c->modrm_val;
- c->src.ptr = c->modrm_ptr;
- break;
- }
- c->src.type = OP_MEM;
- c->src.ptr = (unsigned long *)c->modrm_ea;
- c->src.val = 0;
- break;
- case SrcImm:
- case SrcImmU:
- c->src.type = OP_IMM;
- c->src.ptr = (unsigned long *)c->eip;
- c->src.bytes = (c->d & ByteOp) ? 1 : c->op_bytes;
- if (c->src.bytes == 8)
- c->src.bytes = 4;
- /* NB. Immediates are sign-extended as necessary. */
- switch (c->src.bytes) {
- case 1:
- c->src.val = insn_fetch(s8, 1, c->eip);
- break;
- case 2:
- c->src.val = insn_fetch(s16, 2, c->eip);
- break;
- case 4:
- c->src.val = insn_fetch(s32, 4, c->eip);
- break;
- }
- if ((c->d & SrcMask) == SrcImmU) {
- switch (c->src.bytes) {
- case 1:
- c->src.val &= 0xff;
- break;
- case 2:
- c->src.val &= 0xffff;
- break;
- case 4:
- c->src.val &= 0xffffffff;
- break;
- }
- }
- break;
- case SrcImmByte:
- case SrcImmUByte:
- c->src.type = OP_IMM;
- c->src.ptr = (unsigned long *)c->eip;
- c->src.bytes = 1;
- if ((c->d & SrcMask) == SrcImmByte)
- c->src.val = insn_fetch(s8, 1, c->eip);
- else
- c->src.val = insn_fetch(u8, 1, c->eip);
- break;
- case SrcOne:
- c->src.bytes = 1;
- c->src.val = 1;
- break;
- case SrcSI:
- c->src.type = OP_MEM;
- c->src.bytes = (c->d & ByteOp) ? 1 : c->op_bytes;
- c->src.ptr = (unsigned long *)
- register_address(c, seg_override_base(ctxt, c),
- c->regs[VCPU_REGS_RSI]);
- c->src.val = 0;
- break;
- }
+ rc = linearize(ctxt, addr, size, false, &linear);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ return read_emulated(ctxt, linear, data, size);
+}
- /*
- * Decode and fetch the second source operand: register, memory
- * or immediate.
- */
- switch (c->d & Src2Mask) {
- case Src2None:
- break;
- case Src2CL:
- c->src2.bytes = 1;
- c->src2.val = c->regs[VCPU_REGS_RCX] & 0x8;
- break;
- case Src2ImmByte:
- c->src2.type = OP_IMM;
- c->src2.ptr = (unsigned long *)c->eip;
- c->src2.bytes = 1;
- c->src2.val = insn_fetch(u8, 1, c->eip);
- break;
- case Src2Imm16:
- c->src2.type = OP_IMM;
- c->src2.ptr = (unsigned long *)c->eip;
- c->src2.bytes = 2;
- c->src2.val = insn_fetch(u16, 2, c->eip);
- break;
- case Src2One:
- c->src2.bytes = 1;
- c->src2.val = 1;
- break;
- case Src2Mem16:
- c->src2.type = OP_MEM;
- c->src2.bytes = 2;
- c->src2.ptr = (unsigned long *)(c->modrm_ea + c->src.bytes);
- c->src2.val = 0;
- break;
- }
+static int segmented_write(struct x86_emulate_ctxt *ctxt,
+ struct segmented_address addr,
+ const void *data,
+ unsigned size)
+{
+ int rc;
+ ulong linear;
- /* Decode and fetch the destination operand: register or memory. */
- switch (c->d & DstMask) {
- case ImplicitOps:
- /* Special instructions do their own operand decoding. */
- return 0;
- case DstReg:
- decode_register_operand(&c->dst, c,
- c->twobyte && (c->b == 0xb6 || c->b == 0xb7));
- break;
- case DstMem:
- case DstMem64:
- if ((c->d & ModRM) && c->modrm_mod == 3) {
- c->dst.bytes = (c->d & ByteOp) ? 1 : c->op_bytes;
- c->dst.type = OP_REG;
- c->dst.val = c->dst.orig_val = c->modrm_val;
- c->dst.ptr = c->modrm_ptr;
- break;
- }
- c->dst.type = OP_MEM;
- c->dst.ptr = (unsigned long *)c->modrm_ea;
- if ((c->d & DstMask) == DstMem64)
- c->dst.bytes = 8;
- else
- c->dst.bytes = (c->d & ByteOp) ? 1 : c->op_bytes;
- c->dst.val = 0;
- if (c->d & BitOp) {
- unsigned long mask = ~(c->dst.bytes * 8 - 1);
+ rc = linearize(ctxt, addr, size, true, &linear);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ return ctxt->ops->write_emulated(ctxt, linear, data, size,
+ &ctxt->exception);
+}
- c->dst.ptr = (void *)c->dst.ptr +
- (c->src.val & mask) / 8;
- }
- break;
- case DstAcc:
- c->dst.type = OP_REG;
- c->dst.bytes = (c->d & ByteOp) ? 1 : c->op_bytes;
- c->dst.ptr = &c->regs[VCPU_REGS_RAX];
- switch (c->dst.bytes) {
- case 1:
- c->dst.val = *(u8 *)c->dst.ptr;
- break;
- case 2:
- c->dst.val = *(u16 *)c->dst.ptr;
- break;
- case 4:
- c->dst.val = *(u32 *)c->dst.ptr;
- break;
- case 8:
- c->dst.val = *(u64 *)c->dst.ptr;
- break;
- }
- c->dst.orig_val = c->dst.val;
- break;
- case DstDI:
- c->dst.type = OP_MEM;
- c->dst.bytes = (c->d & ByteOp) ? 1 : c->op_bytes;
- c->dst.ptr = (unsigned long *)
- register_address(c, es_base(ctxt),
- c->regs[VCPU_REGS_RDI]);
- c->dst.val = 0;
- break;
- }
+static int segmented_cmpxchg(struct x86_emulate_ctxt *ctxt,
+ struct segmented_address addr,
+ const void *orig_data, const void *data,
+ unsigned size)
+{
+ int rc;
+ ulong linear;
-done:
- return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0;
+ rc = linearize(ctxt, addr, size, true, &linear);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ return ctxt->ops->cmpxchg_emulated(ctxt, linear, orig_data, data,
+ size, &ctxt->exception);
}
static int pio_in_emulated(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
unsigned int size, unsigned short port,
void *dest)
{
- struct read_cache *rc = &ctxt->decode.io_read;
+ struct read_cache *rc = &ctxt->io_read;
if (rc->pos == rc->end) { /* refill pio read ahead */
- struct decode_cache *c = &ctxt->decode;
unsigned int in_page, n;
- unsigned int count = c->rep_prefix ?
- address_mask(c, c->regs[VCPU_REGS_RCX]) : 1;
- in_page = (ctxt->eflags & EFLG_DF) ?
- offset_in_page(c->regs[VCPU_REGS_RDI]) :
- PAGE_SIZE - offset_in_page(c->regs[VCPU_REGS_RDI]);
- n = min(min(in_page, (unsigned int)sizeof(rc->data)) / size,
- count);
+ unsigned int count = ctxt->rep_prefix ?
+ address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) : 1;
+ in_page = (ctxt->eflags & X86_EFLAGS_DF) ?
+ offset_in_page(reg_read(ctxt, VCPU_REGS_RDI)) :
+ PAGE_SIZE - offset_in_page(reg_read(ctxt, VCPU_REGS_RDI));
+ n = min3(in_page, (unsigned int)sizeof(rc->data) / size, count);
if (n == 0)
n = 1;
rc->pos = rc->end = 0;
- if (!ops->pio_in_emulated(size, port, rc->data, n, ctxt->vcpu))
+ if (!ctxt->ops->pio_in_emulated(ctxt, size, port, rc->data, n))
return 0;
rc->end = n * size;
}
- memcpy(dest, rc->data + rc->pos, size);
- rc->pos += size;
+ if (ctxt->rep_prefix && (ctxt->d & String) &&
+ !(ctxt->eflags & X86_EFLAGS_DF)) {
+ ctxt->dst.data = rc->data + rc->pos;
+ ctxt->dst.type = OP_MEM_STR;
+ ctxt->dst.count = (rc->end - rc->pos) / size;
+ rc->pos = rc->end;
+ } else {
+ memcpy(dest, rc->data + rc->pos, size);
+ rc->pos += size;
+ }
return 1;
}
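
The refill path above sizes the read-ahead batch with min3(): the room left in the current page, the cache capacity in elements, and the REP count. A standalone sketch of that sizing for the forward (DF=0) case, with made-up constants:

#include <stdio.h>

#define PAGE_SIZE   4096u
#define CACHE_BYTES 1024u	/* stand-in for sizeof(rc->data) */

static unsigned min3u(unsigned a, unsigned b, unsigned c)
{
	unsigned m = a < b ? a : b;
	return m < c ? m : c;
}

int main(void)
{
	unsigned size = 4;			/* 4-byte "ins" */
	unsigned long rdi = 0x1ff8;		/* near a page boundary */
	unsigned count = 100;			/* REP count from RCX */
	unsigned in_page = PAGE_SIZE - (rdi & (PAGE_SIZE - 1));
	unsigned n = min3u(in_page, CACHE_BYTES / size, count);

	printf("batch of %u\n", n ? n : 1);	/* prints 8 */
	return 0;
}
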
-static u32 desc_limit_scaled(struct desc_struct *desc)
+static int read_interrupt_descriptor(struct x86_emulate_ctxt *ctxt,
+ u16 index, struct desc_struct *desc)
{
- u32 limit = get_desc_limit(desc);
+ struct desc_ptr dt;
+ ulong addr;
- return desc->g ? (limit << 12) | 0xfff : limit;
+ ctxt->ops->get_idt(ctxt, &dt);
+
+ if (dt.size < index * 8 + 7)
+ return emulate_gp(ctxt, index << 3 | 0x2);
+
+ addr = dt.address + index * 8;
+ return linear_read_system(ctxt, addr, desc, sizeof(*desc));
}
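
In protected mode each IDT gate is 8 bytes, so vector n lives at idt.base + n*8 and the table limit must cover bytes n*8 through n*8+7; on failure the error code is the vector shifted left by 3 with bit 1 set to flag the IDT, which is the `index << 3 | 0x2` above. A standalone sketch of that check (constants are made up):

#include <stdio.h>

int main(void)
{
	unsigned idt_limit = 255;	/* IDT covering vectors 0..31 only */
	unsigned index = 32;

	if (idt_limit < index * 8 + 7)
		printf("#GP(0x%x)\n", index << 3 | 0x2);	/* #GP(0x102) */
	return 0;
}
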
static void get_descriptor_table_ptr(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
u16 selector, struct desc_ptr *dt)
{
+ const struct x86_emulate_ops *ops = ctxt->ops;
+ u32 base3 = 0;
+
if (selector & 1 << 2) {
struct desc_struct desc;
- memset (dt, 0, sizeof *dt);
- if (!ops->get_cached_descriptor(&desc, VCPU_SREG_LDTR, ctxt->vcpu))
+ u16 sel;
+
+ memset(dt, 0, sizeof(*dt));
+ if (!ops->get_segment(ctxt, &sel, &desc, &base3,
+ VCPU_SREG_LDTR))
return;
dt->size = desc_limit_scaled(&desc); /* what if limit > 65535? */
- dt->address = get_desc_base(&desc);
+ dt->address = get_desc_base(&desc) | ((u64)base3 << 32);
} else
- ops->get_gdt(dt, ctxt->vcpu);
+ ops->get_gdt(ctxt, dt);
}
-/* allowed just for 8 bytes segments */
-static int read_segment_descriptor(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- u16 selector, struct desc_struct *desc)
+static int get_descriptor_ptr(struct x86_emulate_ctxt *ctxt,
+ u16 selector, ulong *desc_addr_p)
{
struct desc_ptr dt;
u16 index = selector >> 3;
- int ret;
- u32 err;
ulong addr;
- get_descriptor_table_ptr(ctxt, ops, selector, &dt);
+ get_descriptor_table_ptr(ctxt, selector, &dt);
+
+ if (dt.size < index * 8 + 7)
+ return emulate_gp(ctxt, selector & 0xfffc);
- if (dt.size < index * 8 + 7) {
- kvm_inject_gp(ctxt->vcpu, selector & 0xfffc);
- return X86EMUL_PROPAGATE_FAULT;
- }
addr = dt.address + index * 8;
- ret = ops->read_std(addr, desc, sizeof *desc, ctxt->vcpu, &err);
- if (ret == X86EMUL_PROPAGATE_FAULT)
- kvm_inject_page_fault(ctxt->vcpu, addr, err);
- return ret;
+#ifdef CONFIG_X86_64
+ if (addr >> 32 != 0) {
+ u64 efer = 0;
+
+ ctxt->ops->get_msr(ctxt, MSR_EFER, &efer);
+ if (!(efer & EFER_LMA))
+ addr &= (u32)-1;
+ }
+#endif
+
+ *desc_addr_p = addr;
+ return X86EMUL_CONTINUE;
+}
+
+/* allowed just for 8-byte segments */
+static int read_segment_descriptor(struct x86_emulate_ctxt *ctxt,
+ u16 selector, struct desc_struct *desc,
+ ulong *desc_addr_p)
+{
+ int rc;
+
+ rc = get_descriptor_ptr(ctxt, selector, desc_addr_p);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ return linear_read_system(ctxt, *desc_addr_p, desc, sizeof(*desc));
}
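
get_descriptor_ptr() decodes the selector the architectural way: bits 15:3 index the table, bit 2 (TI) chooses LDT over GDT, and bits 1:0 are the RPL. A quick standalone illustration (the sample selector is just an example value):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint16_t sel = 0x2b;	/* e.g. a user data selector: entry 5, RPL 3 */

	printf("index=%u table=%s rpl=%u\n",
	       sel >> 3, (sel & (1 << 2)) ? "LDT" : "GDT", sel & 3);
	/* prints: index=5 table=GDT rpl=3 */
	return 0;
}
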
/* allowed just for 8-byte segments */
static int write_segment_descriptor(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
u16 selector, struct desc_struct *desc)
{
- struct desc_ptr dt;
- u16 index = selector >> 3;
- u32 err;
+ int rc;
ulong addr;
- int ret;
- get_descriptor_table_ptr(ctxt, ops, selector, &dt);
+ rc = get_descriptor_ptr(ctxt, selector, &addr);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
- if (dt.size < index * 8 + 7) {
- kvm_inject_gp(ctxt->vcpu, selector & 0xfffc);
- return X86EMUL_PROPAGATE_FAULT;
- }
+ return linear_write_system(ctxt, addr, desc, sizeof(*desc));
+}
- addr = dt.address + index * 8;
- ret = ops->write_std(addr, desc, sizeof *desc, ctxt->vcpu, &err);
- if (ret == X86EMUL_PROPAGATE_FAULT)
- kvm_inject_page_fault(ctxt->vcpu, addr, err);
+static bool emulator_is_ssp_invalid(struct x86_emulate_ctxt *ctxt, u8 cpl)
+{
+ const u32 MSR_IA32_X_CET = cpl == 3 ? MSR_IA32_U_CET : MSR_IA32_S_CET;
+ u64 efer = 0, cet = 0, ssp = 0;
- return ret;
+ if (!(ctxt->ops->get_cr(ctxt, 4) & X86_CR4_CET))
+ return false;
+
+ if (ctxt->ops->get_msr(ctxt, MSR_EFER, &efer))
+ return true;
+
+ /* SSP is guaranteed to be valid if the vCPU was already in 32-bit mode. */
+ if (!(efer & EFER_LMA))
+ return false;
+
+ if (ctxt->ops->get_msr(ctxt, MSR_IA32_X_CET, &cet))
+ return true;
+
+ if (!(cet & CET_SHSTK_EN))
+ return false;
+
+ if (ctxt->ops->get_msr(ctxt, MSR_KVM_INTERNAL_GUEST_SSP, &ssp))
+ return true;
+
+ /*
+ * On transfer from 64-bit mode to compatibility mode, SSP[63:32] must
+ * be 0, i.e. SSP must be a 32-bit value outside of 64-bit mode.
+ */
+ return ssp >> 32;
}
-static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- u16 selector, int seg)
+static int __load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
+ u16 selector, int seg, u8 cpl,
+ enum x86_transfer_type transfer,
+ struct desc_struct *desc)
{
- struct desc_struct seg_desc;
- u8 dpl, rpl, cpl;
+ struct desc_struct seg_desc, old_desc;
+ u8 dpl, rpl;
unsigned err_vec = GP_VECTOR;
u32 err_code = 0;
bool null_selector = !(selector & ~0x3); /* 0000-0003 are null */
+ ulong desc_addr;
int ret;
+ u16 dummy;
+ u32 base3 = 0;
- memset(&seg_desc, 0, sizeof seg_desc);
+ memset(&seg_desc, 0, sizeof(seg_desc));
- if ((seg <= VCPU_SREG_GS && ctxt->mode == X86EMUL_MODE_VM86)
- || ctxt->mode == X86EMUL_MODE_REAL) {
- /* set real mode segment descriptor */
+ if (ctxt->mode == X86EMUL_MODE_REAL) {
+ /* set real mode segment descriptor (keep limit etc. for
+ * unreal mode) */
+ ctxt->ops->get_segment(ctxt, &dummy, &seg_desc, NULL, seg);
+ set_desc_base(&seg_desc, selector << 4);
+ goto load;
+ } else if (seg <= VCPU_SREG_GS && ctxt->mode == X86EMUL_MODE_VM86) {
+ /* VM86 needs a clean new segment descriptor */
set_desc_base(&seg_desc, selector << 4);
set_desc_limit(&seg_desc, 0xffff);
seg_desc.type = 3;
seg_desc.p = 1;
seg_desc.s = 1;
+ seg_desc.dpl = 3;
goto load;
}
- /* NULL selector is not valid for TR, CS and SS */
- if ((seg == VCPU_SREG_CS || seg == VCPU_SREG_SS || seg == VCPU_SREG_TR)
- && null_selector)
- goto exception;
+ rpl = selector & 3;
/* TR should be in GDT only */
if (seg == VCPU_SREG_TR && (selector & (1 << 2)))
goto exception;
- if (null_selector) /* for NULL selector skip all following checks */
+ /* NULL selector is not valid for TR, CS and (except for long mode) SS */
+ if (null_selector) {
+ if (seg == VCPU_SREG_CS || seg == VCPU_SREG_TR)
+ goto exception;
+
+ if (seg == VCPU_SREG_SS) {
+ if (ctxt->mode != X86EMUL_MODE_PROT64 || rpl != cpl)
+ goto exception;
+
+ /*
+ * ctxt->ops->set_segment expects the CPL to be in
+ * SS.DPL, so fake an expand-up 32-bit data segment.
+ */
+ seg_desc.type = 3;
+ seg_desc.p = 1;
+ seg_desc.s = 1;
+ seg_desc.dpl = cpl;
+ seg_desc.d = 1;
+ seg_desc.g = 1;
+ }
+
+ /* Skip all following checks */
goto load;
+ }
- ret = read_segment_descriptor(ctxt, ops, selector, &seg_desc);
+ ret = read_segment_descriptor(ctxt, selector, &seg_desc, &desc_addr);
if (ret != X86EMUL_CONTINUE)
return ret;
err_code = selector & 0xfffc;
- err_vec = GP_VECTOR;
-
- /* can't load system descriptor into segment selecor */
- if (seg <= VCPU_SREG_GS && !seg_desc.s)
- goto exception;
+ err_vec = (transfer == X86_TRANSFER_TASK_SWITCH) ? TS_VECTOR :
+ GP_VECTOR;
- if (!seg_desc.p) {
- err_vec = (seg == VCPU_SREG_SS) ? SS_VECTOR : NP_VECTOR;
+ /* can't load system descriptor into segment selector */
+ if (seg <= VCPU_SREG_GS && !seg_desc.s) {
+ if (transfer == X86_TRANSFER_CALL_JMP)
+ return X86EMUL_UNHANDLEABLE;
goto exception;
}
- rpl = selector & 3;
dpl = seg_desc.dpl;
- cpl = ops->cpl(ctxt->vcpu);
switch (seg) {
case VCPU_SREG_SS:
/*
* segment is not a writable data segment or segment
- * selector's RPL != CPL or segment selector's RPL != CPL
+ * selector's RPL != CPL or DPL != CPL
*/
if (rpl != cpl || (seg_desc.type & 0xa) != 0x2 || dpl != cpl)
goto exception;
break;
case VCPU_SREG_CS:
+ /*
+ * KVM uses "none" when loading CS as part of emulating Real
+ * Mode exceptions and IRET (handled above). In all other
+ * cases, loading CS without a control transfer is a KVM bug.
+ */
+ if (WARN_ON_ONCE(transfer == X86_TRANSFER_NONE))
+ goto exception;
+
if (!(seg_desc.type & 8))
goto exception;
- if (seg_desc.type & 4) {
- /* conforming */
- if (dpl > cpl)
+ if (transfer == X86_TRANSFER_RET) {
+ /* RET can never return to an inner privilege level. */
+ if (rpl < cpl)
goto exception;
- } else {
- /* nonconforming */
- if (rpl > cpl || dpl != cpl)
+ /* Outer-privilege level return is not implemented */
+ if (rpl > cpl)
+ return X86EMUL_UNHANDLEABLE;
+ }
+ if (transfer == X86_TRANSFER_RET || transfer == X86_TRANSFER_TASK_SWITCH) {
+ if (seg_desc.type & 4) {
+ /* conforming */
+ if (dpl > rpl)
+ goto exception;
+ } else {
+ /* nonconforming */
+ if (dpl != rpl)
+ goto exception;
+ }
+ } else { /* X86_TRANSFER_CALL_JMP */
+ if (seg_desc.type & 4) {
+ /* conforming */
+ if (dpl > cpl)
+ goto exception;
+ } else {
+ /* nonconforming */
+ if (rpl > cpl || dpl != cpl)
+ goto exception;
+ }
+ }
+ /* in long-mode d/b must be clear if l is set */
+ if (seg_desc.d && seg_desc.l) {
+ u64 efer = 0;
+
+ ctxt->ops->get_msr(ctxt, MSR_EFER, &efer);
+ if (efer & EFER_LMA)
goto exception;
}
+ if (!seg_desc.l && emulator_is_ssp_invalid(ctxt, cpl)) {
+ err_code = 0;
+ goto exception;
+ }
+
/* CS(RPL) <- CPL */
selector = (selector & 0xfffc) | cpl;
break;
@@ -1460,94 +1678,203 @@ static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
/*
* segment is not a data or readable code segment or
* ((segment is a data or nonconforming code segment)
- * and (both RPL and CPL > DPL))
+ * and ((RPL > DPL) or (CPL > DPL)))
*/
if ((seg_desc.type & 0xa) == 0x8 ||
(((seg_desc.type & 0xc) != 0xc) &&
- (rpl > dpl && cpl > dpl)))
+ (rpl > dpl || cpl > dpl)))
goto exception;
break;
}
+ if (!seg_desc.p) {
+ err_vec = (seg == VCPU_SREG_SS) ? SS_VECTOR : NP_VECTOR;
+ goto exception;
+ }
+
if (seg_desc.s) {
/* mark segment as accessed */
- seg_desc.type |= 1;
- ret = write_segment_descriptor(ctxt, ops, selector, &seg_desc);
+ if (!(seg_desc.type & 1)) {
+ seg_desc.type |= 1;
+ ret = write_segment_descriptor(ctxt, selector,
+ &seg_desc);
+ if (ret != X86EMUL_CONTINUE)
+ return ret;
+ }
+ } else if (ctxt->mode == X86EMUL_MODE_PROT64) {
+ ret = linear_read_system(ctxt, desc_addr+8, &base3, sizeof(base3));
+ if (ret != X86EMUL_CONTINUE)
+ return ret;
+ if (emul_is_noncanonical_address(get_desc_base(&seg_desc) |
+ ((u64)base3 << 32), ctxt,
+ X86EMUL_F_DT_LOAD))
+ return emulate_gp(ctxt, err_code);
+ }
+
+ if (seg == VCPU_SREG_TR) {
+ old_desc = seg_desc;
+ seg_desc.type |= 2; /* busy */
+ ret = ctxt->ops->cmpxchg_emulated(ctxt, desc_addr, &old_desc, &seg_desc,
+ sizeof(seg_desc), &ctxt->exception);
if (ret != X86EMUL_CONTINUE)
return ret;
}
load:
- ops->set_segment_selector(selector, seg, ctxt->vcpu);
- ops->set_cached_descriptor(&seg_desc, seg, ctxt->vcpu);
+ ctxt->ops->set_segment(ctxt, selector, &seg_desc, base3, seg);
+ if (desc)
+ *desc = seg_desc;
return X86EMUL_CONTINUE;
exception:
- kvm_queue_exception_e(ctxt->vcpu, err_vec, err_code);
- return X86EMUL_PROPAGATE_FAULT;
+ return emulate_exception(ctxt, err_vec, err_code, true);
}
-static inline void emulate_push(struct x86_emulate_ctxt *ctxt)
+static int load_segment_descriptor(struct x86_emulate_ctxt *ctxt,
+ u16 selector, int seg)
{
- struct decode_cache *c = &ctxt->decode;
+ u8 cpl = ctxt->ops->cpl(ctxt);
- c->dst.type = OP_MEM;
- c->dst.bytes = c->op_bytes;
- c->dst.val = c->src.val;
- register_address_increment(c, &c->regs[VCPU_REGS_RSP], -c->op_bytes);
- c->dst.ptr = (void *) register_address(c, ss_base(ctxt),
- c->regs[VCPU_REGS_RSP]);
+ /*
+ * None of MOV, POP and LSS can load a NULL selector in CPL=3, but
+ * they can load it at CPL<3 (Intel's manual says only LSS can,
+ * but it's wrong).
+ *
+ * However, the Intel manual says that putting IST=1/DPL=3 in
+ * an interrupt gate will result in SS=3 (the AMD manual instead
+ * says it doesn't), so allow SS=3 in __load_segment_descriptor
+ * and only forbid it here.
+ */
+ if (seg == VCPU_SREG_SS && selector == 3 &&
+ ctxt->mode == X86EMUL_MODE_PROT64)
+ return emulate_exception(ctxt, GP_VECTOR, 0, true);
+
+ return __load_segment_descriptor(ctxt, selector, seg, cpl,
+ X86_TRANSFER_NONE, NULL);
+}
+
+static void write_register_operand(struct operand *op)
+{
+ return assign_register(op->addr.reg, op->val, op->bytes);
+}
+
+static int writeback(struct x86_emulate_ctxt *ctxt, struct operand *op)
+{
+ switch (op->type) {
+ case OP_REG:
+ write_register_operand(op);
+ break;
+ case OP_MEM:
+ if (ctxt->lock_prefix)
+ return segmented_cmpxchg(ctxt,
+ op->addr.mem,
+ &op->orig_val,
+ &op->val,
+ op->bytes);
+ else
+ return segmented_write(ctxt,
+ op->addr.mem,
+ &op->val,
+ op->bytes);
+ case OP_MEM_STR:
+ return segmented_write(ctxt,
+ op->addr.mem,
+ op->data,
+ op->bytes * op->count);
+ case OP_XMM:
+ if (!(ctxt->d & Avx)) {
+ kvm_write_sse_reg(op->addr.xmm, &op->vec_val);
+ break;
+ }
+ /* full YMM write but with high bytes cleared */
+ memset(op->valptr + 16, 0, 16);
+ fallthrough;
+ case OP_YMM:
+ kvm_write_avx_reg(op->addr.xmm, &op->vec_val2);
+ break;
+ case OP_MM:
+ kvm_write_mmx_reg(op->addr.mm, &op->mm_val);
+ break;
+ case OP_NONE:
+ /* no writeback */
+ break;
+ default:
+ break;
+ }
+ return X86EMUL_CONTINUE;
+}
+
+static int emulate_push(struct x86_emulate_ctxt *ctxt, const void *data, int len)
+{
+ struct segmented_address addr;
+
+ rsp_increment(ctxt, -len);
+ addr.ea = reg_read(ctxt, VCPU_REGS_RSP) & stack_mask(ctxt);
+ addr.seg = VCPU_SREG_SS;
+
+ return segmented_write(ctxt, addr, data, len);
+}
+
+static int em_push(struct x86_emulate_ctxt *ctxt)
+{
+ /* Disable writeback. */
+ ctxt->dst.type = OP_NONE;
+ return emulate_push(ctxt, &ctxt->src.val, ctxt->op_bytes);
}
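
emulate_push() above is the whole PUSH data path: decrement RSP by the operand size, mask RSP to the stack-segment width, and store at SS:RSP. A user-space model; the flat byte array and the 16-bit mask are illustrative, standing in for a 16-bit stack segment:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t stack[0x10000];

static void push(uint64_t *rsp, uint64_t mask, const void *data, int len)
{
	*rsp -= len;				/* rsp_increment(ctxt, -len) */
	memcpy(&stack[*rsp & mask], data, len);	/* store at SS:RSP */
}

int main(void)
{
	uint64_t rsp = 0x8000;
	uint32_t val = 0xdeadbeef, top;

	push(&rsp, 0xffff, &val, sizeof(val));
	memcpy(&top, &stack[rsp & 0xffff], sizeof(top));
	printf("rsp=0x%llx top=0x%x\n", (unsigned long long)rsp, top);
	/* prints: rsp=0x7ffc top=0xdeadbeef */
	return 0;
}
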
static int emulate_pop(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
void *dest, int len)
{
- struct decode_cache *c = &ctxt->decode;
int rc;
+ struct segmented_address addr;
- rc = ops->read_emulated(register_address(c, ss_base(ctxt),
- c->regs[VCPU_REGS_RSP]),
- dest, len, ctxt->vcpu);
+ addr.ea = reg_read(ctxt, VCPU_REGS_RSP) & stack_mask(ctxt);
+ addr.seg = VCPU_SREG_SS;
+ rc = segmented_read(ctxt, addr, dest, len);
if (rc != X86EMUL_CONTINUE)
return rc;
- register_address_increment(c, &c->regs[VCPU_REGS_RSP], len);
+ rsp_increment(ctxt, len);
return rc;
}
+static int em_pop(struct x86_emulate_ctxt *ctxt)
+{
+ return emulate_pop(ctxt, &ctxt->dst.val, ctxt->op_bytes);
+}
+
static int emulate_popf(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- void *dest, int len)
+ void *dest, int len)
{
int rc;
- unsigned long val, change_mask;
- int iopl = (ctxt->eflags & X86_EFLAGS_IOPL) >> IOPL_SHIFT;
- int cpl = ops->cpl(ctxt->vcpu);
+ unsigned long val = 0;
+ unsigned long change_mask;
+ int iopl = (ctxt->eflags & X86_EFLAGS_IOPL) >> X86_EFLAGS_IOPL_BIT;
+ int cpl = ctxt->ops->cpl(ctxt);
- rc = emulate_pop(ctxt, ops, &val, len);
+ rc = emulate_pop(ctxt, &val, len);
if (rc != X86EMUL_CONTINUE)
return rc;
- change_mask = EFLG_CF | EFLG_PF | EFLG_AF | EFLG_ZF | EFLG_SF | EFLG_OF
- | EFLG_TF | EFLG_DF | EFLG_NT | EFLG_RF | EFLG_AC | EFLG_ID;
+ change_mask = X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+ X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF |
+ X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_NT |
+ X86_EFLAGS_AC | X86_EFLAGS_ID;
switch(ctxt->mode) {
case X86EMUL_MODE_PROT64:
case X86EMUL_MODE_PROT32:
case X86EMUL_MODE_PROT16:
if (cpl == 0)
- change_mask |= EFLG_IOPL;
+ change_mask |= X86_EFLAGS_IOPL;
if (cpl <= iopl)
- change_mask |= EFLG_IF;
+ change_mask |= X86_EFLAGS_IF;
break;
case X86EMUL_MODE_VM86:
- if (iopl < 3) {
- kvm_inject_gp(ctxt->vcpu, 0);
- return X86EMUL_PROPAGATE_FAULT;
- }
- change_mask |= EFLG_IF;
+ if (iopl < 3)
+ return emulate_gp(ctxt, 0);
+ change_mask |= X86_EFLAGS_IF;
break;
default: /* real mode */
- change_mask |= (EFLG_IOPL | EFLG_IF);
+ change_mask |= (X86_EFLAGS_IOPL | X86_EFLAGS_IF);
break;
}
@@ -1557,498 +1884,686 @@ static int emulate_popf(struct x86_emulate_ctxt *ctxt,
return rc;
}
-static void emulate_push_sreg(struct x86_emulate_ctxt *ctxt, int seg)
+static int em_popf(struct x86_emulate_ctxt *ctxt)
+{
+ ctxt->dst.type = OP_REG;
+ ctxt->dst.addr.reg = &ctxt->eflags;
+ ctxt->dst.bytes = ctxt->op_bytes;
+ return emulate_popf(ctxt, &ctxt->dst.val, ctxt->op_bytes);
+}
+
+static int em_enter(struct x86_emulate_ctxt *ctxt)
+{
+ int rc;
+ unsigned frame_size = ctxt->src.val;
+ unsigned nesting_level = ctxt->src2.val & 31;
+ ulong rbp;
+
+ if (nesting_level)
+ return X86EMUL_UNHANDLEABLE;
+
+ rbp = reg_read(ctxt, VCPU_REGS_RBP);
+ rc = emulate_push(ctxt, &rbp, stack_size(ctxt));
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ assign_masked(reg_rmw(ctxt, VCPU_REGS_RBP), reg_read(ctxt, VCPU_REGS_RSP),
+ stack_mask(ctxt));
+ assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP),
+ reg_read(ctxt, VCPU_REGS_RSP) - frame_size,
+ stack_mask(ctxt));
+ return X86EMUL_CONTINUE;
+}
+
+static int em_leave(struct x86_emulate_ctxt *ctxt)
+{
+ assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP), reg_read(ctxt, VCPU_REGS_RBP),
+ stack_mask(ctxt));
+ return emulate_pop(ctxt, reg_rmw(ctxt, VCPU_REGS_RBP), ctxt->op_bytes);
+}
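
With the nesting level forced to zero, em_enter()/em_leave() reduce to the classic frame idiom: push RBP, RBP = RSP, RSP -= frame_size, and the inverse for LEAVE. A register-only sketch; the actual store and reload of RBP through the stack is elided for brevity, so this models only the pointer arithmetic:

#include <stdint.h>
#include <stdio.h>

struct cpu { uint64_t rsp, rbp; };

static void enter0(struct cpu *c, unsigned frame_size)
{
	c->rsp -= 8;		/* push old rbp (memory write elided) */
	c->rbp = c->rsp;
	c->rsp -= frame_size;
}

static void leave(struct cpu *c)
{
	c->rsp = c->rbp;
	c->rsp += 8;		/* pop rbp (memory read elided) */
}

int main(void)
{
	struct cpu c = { .rsp = 0x8000 };

	enter0(&c, 0x20);
	printf("enter: rsp=0x%llx rbp=0x%llx\n",
	       (unsigned long long)c.rsp, (unsigned long long)c.rbp);
	leave(&c);
	printf("leave: rsp=0x%llx\n", (unsigned long long)c.rsp);
	return 0;
}
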
+
+static int em_push_sreg(struct x86_emulate_ctxt *ctxt)
{
- struct decode_cache *c = &ctxt->decode;
- struct kvm_segment segment;
+ int seg = ctxt->src2.val;
- kvm_x86_ops->get_segment(ctxt->vcpu, &segment, seg);
+ ctxt->src.val = get_segment_selector(ctxt, seg);
+ if (ctxt->op_bytes == 4) {
+ rsp_increment(ctxt, -2);
+ ctxt->op_bytes = 2;
+ }
- c->src.val = segment.selector;
- emulate_push(ctxt);
+ return em_push(ctxt);
}
-static int emulate_pop_sreg(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops, int seg)
+static int em_pop_sreg(struct x86_emulate_ctxt *ctxt)
{
- struct decode_cache *c = &ctxt->decode;
- unsigned long selector;
+ int seg = ctxt->src2.val;
+ unsigned long selector = 0;
int rc;
- rc = emulate_pop(ctxt, ops, &selector, c->op_bytes);
+ rc = emulate_pop(ctxt, &selector, 2);
if (rc != X86EMUL_CONTINUE)
return rc;
- rc = load_segment_descriptor(ctxt, ops, (u16)selector, seg);
+ if (seg == VCPU_SREG_SS)
+ ctxt->interruptibility = KVM_X86_SHADOW_INT_MOV_SS;
+ if (ctxt->op_bytes > 2)
+ rsp_increment(ctxt, ctxt->op_bytes - 2);
+
+ rc = load_segment_descriptor(ctxt, (u16)selector, seg);
return rc;
}
-static void emulate_pusha(struct x86_emulate_ctxt *ctxt)
+static int em_pusha(struct x86_emulate_ctxt *ctxt)
{
- struct decode_cache *c = &ctxt->decode;
- unsigned long old_esp = c->regs[VCPU_REGS_RSP];
+ unsigned long old_esp = reg_read(ctxt, VCPU_REGS_RSP);
+ int rc = X86EMUL_CONTINUE;
int reg = VCPU_REGS_RAX;
while (reg <= VCPU_REGS_RDI) {
(reg == VCPU_REGS_RSP) ?
- (c->src.val = old_esp) : (c->src.val = c->regs[reg]);
+ (ctxt->src.val = old_esp) : (ctxt->src.val = reg_read(ctxt, reg));
+
+ rc = em_push(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
- emulate_push(ctxt);
++reg;
}
+
+ return rc;
}
-static int emulate_popa(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops)
+static int em_pushf(struct x86_emulate_ctxt *ctxt)
+{
+ ctxt->src.val = (unsigned long)ctxt->eflags & ~X86_EFLAGS_VM;
+ return em_push(ctxt);
+}
+
+static int em_popa(struct x86_emulate_ctxt *ctxt)
{
- struct decode_cache *c = &ctxt->decode;
int rc = X86EMUL_CONTINUE;
int reg = VCPU_REGS_RDI;
+ u32 val = 0;
while (reg >= VCPU_REGS_RAX) {
if (reg == VCPU_REGS_RSP) {
- register_address_increment(c, &c->regs[VCPU_REGS_RSP],
- c->op_bytes);
+ rsp_increment(ctxt, ctxt->op_bytes);
--reg;
}
- rc = emulate_pop(ctxt, ops, &c->regs[reg], c->op_bytes);
+ rc = emulate_pop(ctxt, &val, ctxt->op_bytes);
if (rc != X86EMUL_CONTINUE)
break;
+ assign_register(reg_rmw(ctxt, reg), val, ctxt->op_bytes);
--reg;
}
return rc;
}
-static inline int emulate_grp1a(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops)
+static int __emulate_int_real(struct x86_emulate_ctxt *ctxt, int irq)
{
- struct decode_cache *c = &ctxt->decode;
+ const struct x86_emulate_ops *ops = ctxt->ops;
+ int rc;
+ struct desc_ptr dt;
+ gva_t cs_addr;
+ gva_t eip_addr;
+ u16 cs, eip;
+
+ /* TODO: Add limit checks */
+ ctxt->src.val = ctxt->eflags;
+ rc = em_push(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ ctxt->eflags &= ~(X86_EFLAGS_IF | X86_EFLAGS_TF | X86_EFLAGS_AC);
+
+ ctxt->src.val = get_segment_selector(ctxt, VCPU_SREG_CS);
+ rc = em_push(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ ctxt->src.val = ctxt->_eip;
+ rc = em_push(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ ops->get_idt(ctxt, &dt);
+
+ eip_addr = dt.address + (irq << 2);
+ cs_addr = dt.address + (irq << 2) + 2;
+
+ rc = linear_read_system(ctxt, cs_addr, &cs, 2);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ rc = linear_read_system(ctxt, eip_addr, &eip, 2);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ rc = load_segment_descriptor(ctxt, cs, VCPU_SREG_CS);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ ctxt->_eip = eip;
- return emulate_pop(ctxt, ops, &c->dst.val, c->dst.bytes);
+ return rc;
}
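
__emulate_int_real() fetches the handler from the real-mode IVT, where entry n is 4 bytes at idt.base + n*4: IP in the low word, CS in the high word, exactly the eip_addr/cs_addr arithmetic above. Standalone sketch with an invented handler address:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint8_t ivt[1024] = {0};
	unsigned irq = 8;

	/* Pretend the guest installed CS:IP = f000:fea5 for vector 8. */
	ivt[irq * 4 + 0] = 0xa5; ivt[irq * 4 + 1] = 0xfe;	/* IP */
	ivt[irq * 4 + 2] = 0x00; ivt[irq * 4 + 3] = 0xf0;	/* CS */

	unsigned ip = ivt[irq * 4 + 0] | ivt[irq * 4 + 1] << 8;
	unsigned cs = ivt[irq * 4 + 2] | ivt[irq * 4 + 3] << 8;
	printf("vector %u -> %04x:%04x\n", irq, cs, ip);
	return 0;
}
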
-static inline void emulate_grp2(struct x86_emulate_ctxt *ctxt)
+int emulate_int_real(struct x86_emulate_ctxt *ctxt, int irq)
{
- struct decode_cache *c = &ctxt->decode;
- switch (c->modrm_reg) {
- case 0: /* rol */
- emulate_2op_SrcB("rol", c->src, c->dst, ctxt->eflags);
- break;
- case 1: /* ror */
- emulate_2op_SrcB("ror", c->src, c->dst, ctxt->eflags);
- break;
- case 2: /* rcl */
- emulate_2op_SrcB("rcl", c->src, c->dst, ctxt->eflags);
- break;
- case 3: /* rcr */
- emulate_2op_SrcB("rcr", c->src, c->dst, ctxt->eflags);
- break;
- case 4: /* sal/shl */
- case 6: /* sal/shl */
- emulate_2op_SrcB("sal", c->src, c->dst, ctxt->eflags);
- break;
- case 5: /* shr */
- emulate_2op_SrcB("shr", c->src, c->dst, ctxt->eflags);
- break;
- case 7: /* sar */
- emulate_2op_SrcB("sar", c->src, c->dst, ctxt->eflags);
- break;
- }
+ int rc;
+
+ invalidate_registers(ctxt);
+ rc = __emulate_int_real(ctxt, irq);
+ if (rc == X86EMUL_CONTINUE)
+ writeback_registers(ctxt);
+ return rc;
}
-static inline int emulate_grp3(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops)
+static int emulate_int(struct x86_emulate_ctxt *ctxt, int irq)
{
- struct decode_cache *c = &ctxt->decode;
-
- switch (c->modrm_reg) {
- case 0 ... 1: /* test */
- emulate_2op_SrcV("test", c->src, c->dst, ctxt->eflags);
- break;
- case 2: /* not */
- c->dst.val = ~c->dst.val;
- break;
- case 3: /* neg */
- emulate_1op("neg", c->dst, ctxt->eflags);
- break;
+ switch(ctxt->mode) {
+ case X86EMUL_MODE_REAL:
+ return __emulate_int_real(ctxt, irq);
+ case X86EMUL_MODE_VM86:
+ case X86EMUL_MODE_PROT16:
+ case X86EMUL_MODE_PROT32:
+ case X86EMUL_MODE_PROT64:
default:
- return 0;
+ /* Protected mode interrupts not implemented yet */
+ return X86EMUL_UNHANDLEABLE;
}
- return 1;
}
-static inline int emulate_grp45(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops)
+static int emulate_iret_real(struct x86_emulate_ctxt *ctxt)
{
- struct decode_cache *c = &ctxt->decode;
+ int rc = X86EMUL_CONTINUE;
+ unsigned long temp_eip = 0;
+ unsigned long temp_eflags = 0;
+ unsigned long cs = 0;
+ unsigned long mask = X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+ X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_TF |
+ X86_EFLAGS_IF | X86_EFLAGS_DF | X86_EFLAGS_OF |
+ X86_EFLAGS_IOPL | X86_EFLAGS_NT | X86_EFLAGS_RF |
+ X86_EFLAGS_AC | X86_EFLAGS_ID |
+ X86_EFLAGS_FIXED;
+ unsigned long vm86_mask = X86_EFLAGS_VM | X86_EFLAGS_VIF |
+ X86_EFLAGS_VIP;
+
+ /* TODO: Add stack limit check */
+
+ rc = emulate_pop(ctxt, &temp_eip, ctxt->op_bytes);
- switch (c->modrm_reg) {
- case 0: /* inc */
- emulate_1op("inc", c->dst, ctxt->eflags);
- break;
- case 1: /* dec */
- emulate_1op("dec", c->dst, ctxt->eflags);
- break;
- case 2: /* call near abs */ {
- long int old_eip;
- old_eip = c->eip;
- c->eip = c->src.val;
- c->src.val = old_eip;
- emulate_push(ctxt);
- break;
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ if (temp_eip & ~0xffff)
+ return emulate_gp(ctxt, 0);
+
+ rc = emulate_pop(ctxt, &cs, ctxt->op_bytes);
+
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ rc = emulate_pop(ctxt, &temp_eflags, ctxt->op_bytes);
+
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ rc = load_segment_descriptor(ctxt, (u16)cs, VCPU_SREG_CS);
+
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ ctxt->_eip = temp_eip;
+
+ if (ctxt->op_bytes == 4)
+ ctxt->eflags = ((temp_eflags & mask) | (ctxt->eflags & vm86_mask));
+ else if (ctxt->op_bytes == 2) {
+ ctxt->eflags &= ~0xffff;
+ ctxt->eflags |= temp_eflags;
}
- case 4: /* jmp abs */
- c->eip = c->src.val;
- break;
- case 6: /* push */
- emulate_push(ctxt);
- break;
+
+ ctxt->eflags &= ~EFLG_RESERVED_ZEROS_MASK; /* Clear reserved zeros */
+ ctxt->eflags |= X86_EFLAGS_FIXED;
+ ctxt->ops->set_nmi_mask(ctxt, false);
+
+ return rc;
+}
+
+static int em_iret(struct x86_emulate_ctxt *ctxt)
+{
+ switch(ctxt->mode) {
+ case X86EMUL_MODE_REAL:
+ return emulate_iret_real(ctxt);
+ case X86EMUL_MODE_VM86:
+ case X86EMUL_MODE_PROT16:
+ case X86EMUL_MODE_PROT32:
+ case X86EMUL_MODE_PROT64:
+ default:
+ /* iret from protected mode not implemented yet */
+ return X86EMUL_UNHANDLEABLE;
}
- return X86EMUL_CONTINUE;
}
-static inline int emulate_grp9(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops)
+static int em_jmp_far(struct x86_emulate_ctxt *ctxt)
{
- struct decode_cache *c = &ctxt->decode;
- u64 old = c->dst.orig_val;
+ int rc;
+ unsigned short sel;
+ struct desc_struct new_desc;
+ u8 cpl = ctxt->ops->cpl(ctxt);
+
+ memcpy(&sel, ctxt->src.valptr + ctxt->op_bytes, 2);
- if (((u32) (old >> 0) != (u32) c->regs[VCPU_REGS_RAX]) ||
- ((u32) (old >> 32) != (u32) c->regs[VCPU_REGS_RDX])) {
+ rc = __load_segment_descriptor(ctxt, sel, VCPU_SREG_CS, cpl,
+ X86_TRANSFER_CALL_JMP,
+ &new_desc);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ rc = assign_eip_far(ctxt, ctxt->src.val);
+ /* Error handling is not implemented. */
+ if (rc != X86EMUL_CONTINUE)
+ return X86EMUL_UNHANDLEABLE;
+
+ return rc;
+}
+
+static int em_jmp_abs(struct x86_emulate_ctxt *ctxt)
+{
+ return assign_eip_near(ctxt, ctxt->src.val);
+}
- c->regs[VCPU_REGS_RAX] = (u32) (old >> 0);
- c->regs[VCPU_REGS_RDX] = (u32) (old >> 32);
- ctxt->eflags &= ~EFLG_ZF;
+static int em_call_near_abs(struct x86_emulate_ctxt *ctxt)
+{
+ int rc;
+ long int old_eip;
+
+ old_eip = ctxt->_eip;
+ rc = assign_eip_near(ctxt, ctxt->src.val);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ ctxt->src.val = old_eip;
+ rc = em_push(ctxt);
+ return rc;
+}
+
+static int em_cmpxchg8b(struct x86_emulate_ctxt *ctxt)
+{
+ u64 old = ctxt->dst.orig_val64;
+
+ if (ctxt->dst.bytes == 16)
+ return X86EMUL_UNHANDLEABLE;
+
+ if (((u32) (old >> 0) != (u32) reg_read(ctxt, VCPU_REGS_RAX)) ||
+ ((u32) (old >> 32) != (u32) reg_read(ctxt, VCPU_REGS_RDX))) {
+ *reg_write(ctxt, VCPU_REGS_RAX) = (u32) (old >> 0);
+ *reg_write(ctxt, VCPU_REGS_RDX) = (u32) (old >> 32);
+ ctxt->eflags &= ~X86_EFLAGS_ZF;
} else {
- c->dst.val = ((u64)c->regs[VCPU_REGS_RCX] << 32) |
- (u32) c->regs[VCPU_REGS_RBX];
+ ctxt->dst.val64 = ((u64)reg_read(ctxt, VCPU_REGS_RCX) << 32) |
+ (u32) reg_read(ctxt, VCPU_REGS_RBX);
- ctxt->eflags |= EFLG_ZF;
+ ctxt->eflags |= X86_EFLAGS_ZF;
}
return X86EMUL_CONTINUE;
}
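
em_cmpxchg8b() implements the architectural semantics: compare EDX:EAX with the 8-byte memory operand; on a match store ECX:EBX and set ZF, otherwise load EDX:EAX from memory and clear ZF. A self-contained model:

#include <stdint.h>
#include <stdio.h>

static int cmpxchg8b(uint64_t *mem, uint32_t *eax, uint32_t *edx,
		     uint32_t ebx, uint32_t ecx)
{
	uint64_t expected = ((uint64_t)*edx << 32) | *eax;

	if (*mem == expected) {
		*mem = ((uint64_t)ecx << 32) | ebx;	/* store ECX:EBX */
		return 1;				/* ZF set */
	}
	*eax = (uint32_t)*mem;				/* load EDX:EAX */
	*edx = (uint32_t)(*mem >> 32);
	return 0;					/* ZF clear */
}

int main(void)
{
	uint64_t mem = 0x1111111122222222ull;
	uint32_t eax = 0x22222222, edx = 0x11111111;

	printf("zf=%d mem=%016llx\n",
	       cmpxchg8b(&mem, &eax, &edx, 0xbbbbbbbb, 0xaaaaaaaa),
	       (unsigned long long)mem);
	/* prints: zf=1 mem=aaaaaaaabbbbbbbb */
	return 0;
}
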
-static int emulate_ret_far(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops)
+static int em_ret(struct x86_emulate_ctxt *ctxt)
{
- struct decode_cache *c = &ctxt->decode;
int rc;
- unsigned long cs;
+ unsigned long eip = 0;
- rc = emulate_pop(ctxt, ops, &c->eip, c->op_bytes);
+ rc = emulate_pop(ctxt, &eip, ctxt->op_bytes);
if (rc != X86EMUL_CONTINUE)
return rc;
- if (c->op_bytes == 4)
- c->eip = (u32)c->eip;
- rc = emulate_pop(ctxt, ops, &cs, c->op_bytes);
+
+ return assign_eip_near(ctxt, eip);
+}
+
+static int em_ret_far(struct x86_emulate_ctxt *ctxt)
+{
+ int rc;
+ unsigned long eip = 0;
+ unsigned long cs = 0;
+ int cpl = ctxt->ops->cpl(ctxt);
+ struct desc_struct new_desc;
+
+ rc = emulate_pop(ctxt, &eip, ctxt->op_bytes);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ rc = emulate_pop(ctxt, &cs, ctxt->op_bytes);
if (rc != X86EMUL_CONTINUE)
return rc;
- rc = load_segment_descriptor(ctxt, ops, (u16)cs, VCPU_SREG_CS);
+ rc = __load_segment_descriptor(ctxt, (u16)cs, VCPU_SREG_CS, cpl,
+ X86_TRANSFER_RET,
+ &new_desc);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ rc = assign_eip_far(ctxt, eip);
+ /* Error handling is not implemented. */
+ if (rc != X86EMUL_CONTINUE)
+ return X86EMUL_UNHANDLEABLE;
+
return rc;
}
-static inline int writeback(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops)
+static int em_ret_far_imm(struct x86_emulate_ctxt *ctxt)
{
- int rc;
- struct decode_cache *c = &ctxt->decode;
+ int rc;
- switch (c->dst.type) {
- case OP_REG:
- /* The 4-byte case *is* correct:
- * in 64-bit mode we zero-extend.
- */
- switch (c->dst.bytes) {
- case 1:
- *(u8 *)c->dst.ptr = (u8)c->dst.val;
- break;
- case 2:
- *(u16 *)c->dst.ptr = (u16)c->dst.val;
- break;
- case 4:
- *c->dst.ptr = (u32)c->dst.val;
- break; /* 64b: zero-ext */
- case 8:
- *c->dst.ptr = c->dst.val;
- break;
- }
- break;
- case OP_MEM:
- if (c->lock_prefix)
- rc = ops->cmpxchg_emulated(
- (unsigned long)c->dst.ptr,
- &c->dst.orig_val,
- &c->dst.val,
- c->dst.bytes,
- ctxt->vcpu);
- else
- rc = ops->write_emulated(
- (unsigned long)c->dst.ptr,
- &c->dst.val,
- c->dst.bytes,
- ctxt->vcpu);
- if (rc != X86EMUL_CONTINUE)
- return rc;
- break;
- case OP_NONE:
- /* no writeback */
- break;
- default:
- break;
+ rc = em_ret_far(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ rsp_increment(ctxt, ctxt->src.val);
+ return X86EMUL_CONTINUE;
+}
+
+static int em_cmpxchg(struct x86_emulate_ctxt *ctxt)
+{
+ /* Save real source value, then compare EAX against destination. */
+ ctxt->dst.orig_val = ctxt->dst.val;
+ ctxt->dst.val = reg_read(ctxt, VCPU_REGS_RAX);
+ ctxt->src.orig_val = ctxt->src.val;
+ ctxt->src.val = ctxt->dst.orig_val;
+ em_cmp(ctxt);
+
+ if (ctxt->eflags & X86_EFLAGS_ZF) {
+ /* Success: write back to memory; no update of EAX */
+ ctxt->src.type = OP_NONE;
+ ctxt->dst.val = ctxt->src.orig_val;
+ } else {
+ /* Failure: write the value we saw to EAX. */
+ ctxt->src.type = OP_REG;
+ ctxt->src.addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX);
+ ctxt->src.val = ctxt->dst.orig_val;
+ /* Create write-cycle to dest by writing the same value */
+ ctxt->dst.val = ctxt->dst.orig_val;
}
return X86EMUL_CONTINUE;
}
-static void toggle_interruptibility(struct x86_emulate_ctxt *ctxt, u32 mask)
+static int em_lseg(struct x86_emulate_ctxt *ctxt)
{
- u32 int_shadow = kvm_x86_ops->get_interrupt_shadow(ctxt->vcpu, mask);
- /*
- * an sti; sti; sequence only disable interrupts for the first
- * instruction. So, if the last instruction, be it emulated or
- * not, left the system with the INT_STI flag enabled, it
- * means that the last instruction is an sti. We should not
- * leave the flag on in this case. The same goes for mov ss
- */
- if (!(int_shadow & mask))
- ctxt->interruptibility = mask;
+ int seg = ctxt->src2.val;
+ unsigned short sel;
+ int rc;
+
+ memcpy(&sel, ctxt->src.valptr + ctxt->op_bytes, 2);
+
+ rc = load_segment_descriptor(ctxt, sel, seg);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ ctxt->dst.val = ctxt->src.val;
+ return rc;
}
-static inline void
-setup_syscalls_segments(struct x86_emulate_ctxt *ctxt,
- struct kvm_segment *cs, struct kvm_segment *ss)
+static int em_rsm(struct x86_emulate_ctxt *ctxt)
{
- memset(cs, 0, sizeof(struct kvm_segment));
- kvm_x86_ops->get_segment(ctxt->vcpu, cs, VCPU_SREG_CS);
- memset(ss, 0, sizeof(struct kvm_segment));
+ if (!ctxt->ops->is_smm(ctxt))
+ return emulate_ud(ctxt);
+
+ if (ctxt->ops->leave_smm(ctxt))
+ ctxt->ops->triple_fault(ctxt);
+ return emulator_recalc_and_set_mode(ctxt);
+}
+
+static void
+setup_syscalls_segments(struct desc_struct *cs, struct desc_struct *ss)
+{
cs->l = 0; /* will be adjusted later */
- cs->base = 0; /* flat segment */
+ set_desc_base(cs, 0); /* flat segment */
cs->g = 1; /* 4kb granularity */
- cs->limit = 0xffffffff; /* 4GB limit */
+ set_desc_limit(cs, 0xfffff); /* 4GB limit */
cs->type = 0x0b; /* Read, Execute, Accessed */
cs->s = 1;
cs->dpl = 0; /* will be adjusted later */
- cs->present = 1;
- cs->db = 1;
+ cs->p = 1;
+ cs->d = 1;
+ cs->avl = 0;
- ss->unusable = 0;
- ss->base = 0; /* flat segment */
- ss->limit = 0xffffffff; /* 4GB limit */
+ set_desc_base(ss, 0); /* flat segment */
+ set_desc_limit(ss, 0xfffff); /* 4GB limit */
ss->g = 1; /* 4kb granularity */
ss->s = 1;
ss->type = 0x03; /* Read/Write, Accessed */
- ss->db = 1; /* 32bit stack segment */
+ ss->d = 1; /* 32bit stack segment */
ss->dpl = 0;
- ss->present = 1;
+ ss->p = 1;
+ ss->l = 0;
+ ss->avl = 0;
}
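
The flat 4 GiB segments built here rely on granularity scaling: with g=1 the 20-bit limit is shifted left by 12 and the low 12 bits are filled with ones, which is the desc_limit_scaled() computation seen near the top of this diff. Quick check:

#include <stdint.h>
#include <stdio.h>

static uint32_t desc_limit_scaled(uint32_t limit, int g)
{
	return g ? (limit << 12) | 0xfff : limit;
}

int main(void)
{
	/* 0xfffff with 4 KiB granularity covers the full 4 GiB. */
	printf("0x%x\n", desc_limit_scaled(0xfffff, 1));	/* 0xffffffff */
	return 0;
}
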
-static int
-emulate_syscall(struct x86_emulate_ctxt *ctxt)
+static int em_syscall(struct x86_emulate_ctxt *ctxt)
{
- struct decode_cache *c = &ctxt->decode;
- struct kvm_segment cs, ss;
+ const struct x86_emulate_ops *ops = ctxt->ops;
+ struct desc_struct cs, ss;
u64 msr_data;
+ u16 cs_sel, ss_sel;
+ u64 efer = 0;
/* syscall is not available in real mode */
if (ctxt->mode == X86EMUL_MODE_REAL ||
- ctxt->mode == X86EMUL_MODE_VM86) {
- kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
- return X86EMUL_PROPAGATE_FAULT;
- }
+ ctxt->mode == X86EMUL_MODE_VM86)
+ return emulate_ud(ctxt);
+
+ /*
+ * Intel compatible CPUs only support SYSCALL in 64-bit mode, whereas
+ * AMD allows SYSCALL in any flavor of protected mode. Note, it's
+ * infeasible to emulate Intel behavior when running on AMD hardware,
+ * as SYSCALL won't fault in the "wrong" mode, i.e. there is no #UD
+ * for KVM to trap-and-emulate, unlike emulating AMD on Intel.
+ */
+ if (ctxt->mode != X86EMUL_MODE_PROT64 &&
+ ctxt->ops->guest_cpuid_is_intel_compatible(ctxt))
+ return emulate_ud(ctxt);
- setup_syscalls_segments(ctxt, &cs, &ss);
+ ops->get_msr(ctxt, MSR_EFER, &efer);
+ if (!(efer & EFER_SCE))
+ return emulate_ud(ctxt);
- kvm_x86_ops->get_msr(ctxt->vcpu, MSR_STAR, &msr_data);
+ setup_syscalls_segments(&cs, &ss);
+ ops->get_msr(ctxt, MSR_STAR, &msr_data);
msr_data >>= 32;
- cs.selector = (u16)(msr_data & 0xfffc);
- ss.selector = (u16)(msr_data + 8);
+ cs_sel = (u16)(msr_data & 0xfffc);
+ ss_sel = (u16)(msr_data + 8);
- if (is_long_mode(ctxt->vcpu)) {
- cs.db = 0;
+ if (efer & EFER_LMA) {
+ cs.d = 0;
cs.l = 1;
}
- kvm_x86_ops->set_segment(ctxt->vcpu, &cs, VCPU_SREG_CS);
- kvm_x86_ops->set_segment(ctxt->vcpu, &ss, VCPU_SREG_SS);
+ ops->set_segment(ctxt, cs_sel, &cs, 0, VCPU_SREG_CS);
+ ops->set_segment(ctxt, ss_sel, &ss, 0, VCPU_SREG_SS);
- c->regs[VCPU_REGS_RCX] = c->eip;
- if (is_long_mode(ctxt->vcpu)) {
+ *reg_write(ctxt, VCPU_REGS_RCX) = ctxt->_eip;
+ if (efer & EFER_LMA) {
#ifdef CONFIG_X86_64
- c->regs[VCPU_REGS_R11] = ctxt->eflags & ~EFLG_RF;
+ *reg_write(ctxt, VCPU_REGS_R11) = ctxt->eflags;
- kvm_x86_ops->get_msr(ctxt->vcpu,
- ctxt->mode == X86EMUL_MODE_PROT64 ?
- MSR_LSTAR : MSR_CSTAR, &msr_data);
- c->eip = msr_data;
+ ops->get_msr(ctxt,
+ ctxt->mode == X86EMUL_MODE_PROT64 ?
+ MSR_LSTAR : MSR_CSTAR, &msr_data);
+ ctxt->_eip = msr_data;
- kvm_x86_ops->get_msr(ctxt->vcpu, MSR_SYSCALL_MASK, &msr_data);
- ctxt->eflags &= ~(msr_data | EFLG_RF);
+ ops->get_msr(ctxt, MSR_SYSCALL_MASK, &msr_data);
+ ctxt->eflags &= ~msr_data;
+ ctxt->eflags |= X86_EFLAGS_FIXED;
#endif
} else {
/* legacy mode */
- kvm_x86_ops->get_msr(ctxt->vcpu, MSR_STAR, &msr_data);
- c->eip = (u32)msr_data;
+ ops->get_msr(ctxt, MSR_STAR, &msr_data);
+ ctxt->_eip = (u32)msr_data;
- ctxt->eflags &= ~(EFLG_VM | EFLG_IF | EFLG_RF);
+ ctxt->eflags &= ~(X86_EFLAGS_VM | X86_EFLAGS_IF);
}
+ ctxt->tf = (ctxt->eflags & X86_EFLAGS_TF) != 0;
return X86EMUL_CONTINUE;
}
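
em_syscall() derives both target selectors from MSR_STAR: bits 47:32 supply the kernel CS (with the RPL bits masked off) and SS is architecturally defined as that value plus 8. Sketch with an invented STAR value:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t star = 0x0023001000000000ull;	/* illustrative only */
	uint32_t hi = (uint32_t)(star >> 32);
	unsigned cs = hi & 0xfffc;
	unsigned ss = (hi + 8) & 0xffff;

	printf("cs=0x%x ss=0x%x\n", cs, ss);	/* cs=0x10 ss=0x18 */
	return 0;
}
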
-static int
-emulate_sysenter(struct x86_emulate_ctxt *ctxt)
+static int em_sysenter(struct x86_emulate_ctxt *ctxt)
{
- struct decode_cache *c = &ctxt->decode;
- struct kvm_segment cs, ss;
+ const struct x86_emulate_ops *ops = ctxt->ops;
+ struct desc_struct cs, ss;
u64 msr_data;
+ u16 cs_sel, ss_sel;
+ u64 efer = 0;
+ ops->get_msr(ctxt, MSR_EFER, &efer);
/* inject #GP if in real mode */
- if (ctxt->mode == X86EMUL_MODE_REAL) {
- kvm_inject_gp(ctxt->vcpu, 0);
- return X86EMUL_PROPAGATE_FAULT;
- }
+ if (ctxt->mode == X86EMUL_MODE_REAL)
+ return emulate_gp(ctxt, 0);
- /* XXX sysenter/sysexit have not been tested in 64bit mode.
- * Therefore, we inject an #UD.
- */
- if (ctxt->mode == X86EMUL_MODE_PROT64) {
- kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
- return X86EMUL_PROPAGATE_FAULT;
- }
+ /*
+ * Intel's architecture allows SYSENTER in compatibility mode, but AMD
+ * does not. Note, AMD does allow SYSENTER in legacy protected mode.
+ */
+ if ((ctxt->mode != X86EMUL_MODE_PROT64) && (efer & EFER_LMA) &&
+ !ctxt->ops->guest_cpuid_is_intel_compatible(ctxt))
+ return emulate_ud(ctxt);
- setup_syscalls_segments(ctxt, &cs, &ss);
+ /* sysenter/sysexit have not been tested in 64bit mode. */
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ return X86EMUL_UNHANDLEABLE;
- kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_CS, &msr_data);
- switch (ctxt->mode) {
- case X86EMUL_MODE_PROT32:
- if ((msr_data & 0xfffc) == 0x0) {
- kvm_inject_gp(ctxt->vcpu, 0);
- return X86EMUL_PROPAGATE_FAULT;
- }
- break;
- case X86EMUL_MODE_PROT64:
- if (msr_data == 0x0) {
- kvm_inject_gp(ctxt->vcpu, 0);
- return X86EMUL_PROPAGATE_FAULT;
- }
- break;
- }
+ ops->get_msr(ctxt, MSR_IA32_SYSENTER_CS, &msr_data);
+ if ((msr_data & 0xfffc) == 0x0)
+ return emulate_gp(ctxt, 0);
- ctxt->eflags &= ~(EFLG_VM | EFLG_IF | EFLG_RF);
- cs.selector = (u16)msr_data;
- cs.selector &= ~SELECTOR_RPL_MASK;
- ss.selector = cs.selector + 8;
- ss.selector &= ~SELECTOR_RPL_MASK;
- if (ctxt->mode == X86EMUL_MODE_PROT64
- || is_long_mode(ctxt->vcpu)) {
- cs.db = 0;
+ setup_syscalls_segments(&cs, &ss);
+ ctxt->eflags &= ~(X86_EFLAGS_VM | X86_EFLAGS_IF);
+ cs_sel = (u16)msr_data & ~SEGMENT_RPL_MASK;
+ ss_sel = cs_sel + 8;
+ if (efer & EFER_LMA) {
+ cs.d = 0;
cs.l = 1;
}
- kvm_x86_ops->set_segment(ctxt->vcpu, &cs, VCPU_SREG_CS);
- kvm_x86_ops->set_segment(ctxt->vcpu, &ss, VCPU_SREG_SS);
+ ops->set_segment(ctxt, cs_sel, &cs, 0, VCPU_SREG_CS);
+ ops->set_segment(ctxt, ss_sel, &ss, 0, VCPU_SREG_SS);
- kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_EIP, &msr_data);
- c->eip = msr_data;
+ ops->get_msr(ctxt, MSR_IA32_SYSENTER_EIP, &msr_data);
+ ctxt->_eip = (efer & EFER_LMA) ? msr_data : (u32)msr_data;
- kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_ESP, &msr_data);
- c->regs[VCPU_REGS_RSP] = msr_data;
+ ops->get_msr(ctxt, MSR_IA32_SYSENTER_ESP, &msr_data);
+ *reg_write(ctxt, VCPU_REGS_RSP) = (efer & EFER_LMA) ? msr_data :
+ (u32)msr_data;
+ if (efer & EFER_LMA)
+ ctxt->mode = X86EMUL_MODE_PROT64;
return X86EMUL_CONTINUE;
}
-static int
-emulate_sysexit(struct x86_emulate_ctxt *ctxt)
+static int em_sysexit(struct x86_emulate_ctxt *ctxt)
{
- struct decode_cache *c = &ctxt->decode;
- struct kvm_segment cs, ss;
- u64 msr_data;
+ const struct x86_emulate_ops *ops = ctxt->ops;
+ struct desc_struct cs, ss;
+ u64 msr_data, rcx, rdx;
int usermode;
+ u16 cs_sel = 0, ss_sel = 0;
/* inject #GP if in real mode or Virtual 8086 mode */
if (ctxt->mode == X86EMUL_MODE_REAL ||
- ctxt->mode == X86EMUL_MODE_VM86) {
- kvm_inject_gp(ctxt->vcpu, 0);
- return X86EMUL_PROPAGATE_FAULT;
- }
+ ctxt->mode == X86EMUL_MODE_VM86)
+ return emulate_gp(ctxt, 0);
- setup_syscalls_segments(ctxt, &cs, &ss);
+ setup_syscalls_segments(&cs, &ss);
- if ((c->rex_prefix & 0x8) != 0x0)
+ if (ctxt->rex_bits & REX_W)
usermode = X86EMUL_MODE_PROT64;
else
usermode = X86EMUL_MODE_PROT32;
+ rcx = reg_read(ctxt, VCPU_REGS_RCX);
+ rdx = reg_read(ctxt, VCPU_REGS_RDX);
+
cs.dpl = 3;
ss.dpl = 3;
- kvm_x86_ops->get_msr(ctxt->vcpu, MSR_IA32_SYSENTER_CS, &msr_data);
+ ops->get_msr(ctxt, MSR_IA32_SYSENTER_CS, &msr_data);
switch (usermode) {
case X86EMUL_MODE_PROT32:
- cs.selector = (u16)(msr_data + 16);
- if ((msr_data & 0xfffc) == 0x0) {
- kvm_inject_gp(ctxt->vcpu, 0);
- return X86EMUL_PROPAGATE_FAULT;
- }
- ss.selector = (u16)(msr_data + 24);
+ cs_sel = (u16)(msr_data + 16);
+ if ((msr_data & 0xfffc) == 0x0)
+ return emulate_gp(ctxt, 0);
+ ss_sel = (u16)(msr_data + 24);
+ rcx = (u32)rcx;
+ rdx = (u32)rdx;
break;
case X86EMUL_MODE_PROT64:
- cs.selector = (u16)(msr_data + 32);
- if (msr_data == 0x0) {
- kvm_inject_gp(ctxt->vcpu, 0);
- return X86EMUL_PROPAGATE_FAULT;
- }
- ss.selector = cs.selector + 8;
- cs.db = 0;
+ cs_sel = (u16)(msr_data + 32);
+ if (msr_data == 0x0)
+ return emulate_gp(ctxt, 0);
+ ss_sel = cs_sel + 8;
+ cs.d = 0;
cs.l = 1;
+ if (emul_is_noncanonical_address(rcx, ctxt, 0) ||
+ emul_is_noncanonical_address(rdx, ctxt, 0))
+ return emulate_gp(ctxt, 0);
break;
}
- cs.selector |= SELECTOR_RPL_MASK;
- ss.selector |= SELECTOR_RPL_MASK;
+ cs_sel |= SEGMENT_RPL_MASK;
+ ss_sel |= SEGMENT_RPL_MASK;
- kvm_x86_ops->set_segment(ctxt->vcpu, &cs, VCPU_SREG_CS);
- kvm_x86_ops->set_segment(ctxt->vcpu, &ss, VCPU_SREG_SS);
+ ops->set_segment(ctxt, cs_sel, &cs, 0, VCPU_SREG_CS);
+ ops->set_segment(ctxt, ss_sel, &ss, 0, VCPU_SREG_SS);
- c->eip = ctxt->vcpu->arch.regs[VCPU_REGS_RDX];
- c->regs[VCPU_REGS_RSP] = ctxt->vcpu->arch.regs[VCPU_REGS_RCX];
+ ctxt->_eip = rdx;
+ ctxt->mode = usermode;
+ *reg_write(ctxt, VCPU_REGS_RSP) = rcx;
return X86EMUL_CONTINUE;
}
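
em_sysexit() computes the user selectors as fixed offsets from IA32_SYSENTER_CS: +16/+24 when returning to 32-bit mode, +32/+40 (i.e. CS+8) for 64-bit, with the RPL forced to 3 in both cases. Standalone sketch:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint16_t sysenter_cs = 0x10;	/* illustrative MSR value */
	int long_mode = 1;
	unsigned cs = sysenter_cs + (long_mode ? 32 : 16);
	unsigned ss = long_mode ? cs + 8 : sysenter_cs + 24u;

	cs |= 3;	/* SEGMENT_RPL_MASK */
	ss |= 3;
	printf("cs=0x%x ss=0x%x\n", cs, ss);	/* cs=0x33 ss=0x3b */
	return 0;
}
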
-static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops)
+static bool emulator_bad_iopl(struct x86_emulate_ctxt *ctxt)
{
int iopl;
if (ctxt->mode == X86EMUL_MODE_REAL)
return false;
if (ctxt->mode == X86EMUL_MODE_VM86)
return true;
- iopl = (ctxt->eflags & X86_EFLAGS_IOPL) >> IOPL_SHIFT;
- return ops->cpl(ctxt->vcpu) > iopl;
+ iopl = (ctxt->eflags & X86_EFLAGS_IOPL) >> X86_EFLAGS_IOPL_BIT;
+ return ctxt->ops->cpl(ctxt) > iopl;
}
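
The IOPL field sits in EFLAGS bits 13:12; emulator_bad_iopl() simply compares it against the current privilege level. Worked example:

#include <stdio.h>

#define X86_EFLAGS_IOPL     0x3000u
#define X86_EFLAGS_IOPL_BIT 12

int main(void)
{
	unsigned eflags = 0x3202;	/* IF set, IOPL = 3 */
	int iopl = (eflags & X86_EFLAGS_IOPL) >> X86_EFLAGS_IOPL_BIT;
	int cpl = 3;

	printf("iopl=%d bad_iopl=%d\n", iopl, cpl > iopl);	/* 3, 0 */
	return 0;
}
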
+#define VMWARE_PORT_VMPORT (0x5658)
+#define VMWARE_PORT_VMRPC (0x5659)
+
static bool emulator_io_port_access_allowed(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
u16 port, u16 len)
{
- struct kvm_segment tr_seg;
+ const struct x86_emulate_ops *ops = ctxt->ops;
+ struct desc_struct tr_seg;
+ u32 base3;
int r;
- u16 io_bitmap_ptr;
- u8 perm, bit_idx = port & 0x7;
+ u16 tr, io_bitmap_ptr, perm, bit_idx = port & 0x7;
unsigned mask = (1 << len) - 1;
+ unsigned long base;
- kvm_get_segment(ctxt->vcpu, &tr_seg, VCPU_SREG_TR);
- if (tr_seg.unusable)
+ /*
+ * VMware allows access to these ports even if denied
+ * by TSS I/O permission bitmap. Mimic behavior.
+ */
+ if (enable_vmware_backdoor &&
+ ((port == VMWARE_PORT_VMPORT) || (port == VMWARE_PORT_VMRPC)))
+ return true;
+
+ ops->get_segment(ctxt, &tr, &tr_seg, &base3, VCPU_SREG_TR);
+ if (!tr_seg.p)
return false;
- if (tr_seg.limit < 103)
+ if (desc_limit_scaled(&tr_seg) < 103)
return false;
- r = ops->read_std(tr_seg.base + 102, &io_bitmap_ptr, 2, ctxt->vcpu,
- NULL);
+ base = get_desc_base(&tr_seg);
+#ifdef CONFIG_X86_64
+ base |= ((u64)base3) << 32;
+#endif
+ r = ops->read_std(ctxt, base + 102, &io_bitmap_ptr, 2, NULL, true);
if (r != X86EMUL_CONTINUE)
return false;
- if (io_bitmap_ptr + port/8 > tr_seg.limit)
+ if (io_bitmap_ptr + port/8 > desc_limit_scaled(&tr_seg))
return false;
- r = ops->read_std(tr_seg.base + io_bitmap_ptr + port/8, &perm, 1,
- ctxt->vcpu, NULL);
+ r = ops->read_std(ctxt, base + io_bitmap_ptr + port/8, &perm, 2, NULL, true);
if (r != X86EMUL_CONTINUE)
return false;
if ((perm >> bit_idx) & mask)
@@ -2056,321 +2571,352 @@ static bool emulator_io_port_access_allowed(struct x86_emulate_ctxt *ctxt,
return true;
}
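
The TSS I/O bitmap holds one bit per port, set meaning denied; a len-byte access is allowed only if all len consecutive bits are clear, which is why the code reads 16 bits and tests (perm >> bit_idx) & mask. A user-space model of just the bitmap test:

#include <stdint.h>
#include <stdio.h>

static int io_allowed(const uint8_t *bitmap, uint16_t port, uint16_t len)
{
	unsigned perm = bitmap[port / 8] | bitmap[port / 8 + 1] << 8;
	unsigned mask = (1u << len) - 1;

	return !((perm >> (port & 7)) & mask);
}

int main(void)
{
	static uint8_t bitmap[8193];		/* 65536 bits + pad byte */

	bitmap[0x61 / 8] = 1 << (0x61 & 7);	/* deny port 0x61 only */
	printf("%d %d %d\n",
	       io_allowed(bitmap, 0x60, 1),	/* 1: allowed */
	       io_allowed(bitmap, 0x61, 1),	/* 0: denied */
	       io_allowed(bitmap, 0x60, 2));	/* 0: straddles 0x61 */
	return 0;
}
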
-static bool emulator_io_permited(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- u16 port, u16 len)
+static bool emulator_io_permitted(struct x86_emulate_ctxt *ctxt,
+ u16 port, u16 len)
{
- if (emulator_bad_iopl(ctxt, ops))
- if (!emulator_io_port_access_allowed(ctxt, ops, port, len))
+ if (ctxt->perm_ok)
+ return true;
+
+ if (emulator_bad_iopl(ctxt))
+ if (!emulator_io_port_access_allowed(ctxt, port, len))
return false;
+
+ ctxt->perm_ok = true;
+
return true;
}
-static u32 get_cached_descriptor_base(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- int seg)
+static void string_registers_quirk(struct x86_emulate_ctxt *ctxt)
{
- struct desc_struct desc;
- if (ops->get_cached_descriptor(&desc, seg, ctxt->vcpu))
- return get_desc_base(&desc);
- else
- return ~0;
+ /*
+ * Intel CPUs mask the counter and pointers in a rather strange
+ * manner when ECX is zero, due to REP-string optimizations.
+ */
+#ifdef CONFIG_X86_64
+ u32 eax, ebx, ecx, edx;
+
+ if (ctxt->ad_bytes != 4)
+ return;
+
+ eax = ecx = 0;
+ ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx, true);
+ if (!is_guest_vendor_intel(ebx, ecx, edx))
+ return;
+
+ *reg_write(ctxt, VCPU_REGS_RCX) = 0;
+
+ switch (ctxt->b) {
+ case 0xa4: /* movsb */
+ case 0xa5: /* movsd/w */
+ *reg_rmw(ctxt, VCPU_REGS_RSI) &= (u32)-1;
+ fallthrough;
+ case 0xaa: /* stosb */
+ case 0xab: /* stosd/w */
+ *reg_rmw(ctxt, VCPU_REGS_RDI) &= (u32)-1;
+ }
+#endif
}
static void save_state_to_tss16(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
struct tss_segment_16 *tss)
{
- struct decode_cache *c = &ctxt->decode;
-
- tss->ip = c->eip;
+ tss->ip = ctxt->_eip;
tss->flag = ctxt->eflags;
- tss->ax = c->regs[VCPU_REGS_RAX];
- tss->cx = c->regs[VCPU_REGS_RCX];
- tss->dx = c->regs[VCPU_REGS_RDX];
- tss->bx = c->regs[VCPU_REGS_RBX];
- tss->sp = c->regs[VCPU_REGS_RSP];
- tss->bp = c->regs[VCPU_REGS_RBP];
- tss->si = c->regs[VCPU_REGS_RSI];
- tss->di = c->regs[VCPU_REGS_RDI];
-
- tss->es = ops->get_segment_selector(VCPU_SREG_ES, ctxt->vcpu);
- tss->cs = ops->get_segment_selector(VCPU_SREG_CS, ctxt->vcpu);
- tss->ss = ops->get_segment_selector(VCPU_SREG_SS, ctxt->vcpu);
- tss->ds = ops->get_segment_selector(VCPU_SREG_DS, ctxt->vcpu);
- tss->ldt = ops->get_segment_selector(VCPU_SREG_LDTR, ctxt->vcpu);
+ tss->ax = reg_read(ctxt, VCPU_REGS_RAX);
+ tss->cx = reg_read(ctxt, VCPU_REGS_RCX);
+ tss->dx = reg_read(ctxt, VCPU_REGS_RDX);
+ tss->bx = reg_read(ctxt, VCPU_REGS_RBX);
+ tss->sp = reg_read(ctxt, VCPU_REGS_RSP);
+ tss->bp = reg_read(ctxt, VCPU_REGS_RBP);
+ tss->si = reg_read(ctxt, VCPU_REGS_RSI);
+ tss->di = reg_read(ctxt, VCPU_REGS_RDI);
+
+ tss->es = get_segment_selector(ctxt, VCPU_SREG_ES);
+ tss->cs = get_segment_selector(ctxt, VCPU_SREG_CS);
+ tss->ss = get_segment_selector(ctxt, VCPU_SREG_SS);
+ tss->ds = get_segment_selector(ctxt, VCPU_SREG_DS);
+ tss->ldt = get_segment_selector(ctxt, VCPU_SREG_LDTR);
}
static int load_state_from_tss16(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
struct tss_segment_16 *tss)
{
- struct decode_cache *c = &ctxt->decode;
int ret;
+ u8 cpl;
- c->eip = tss->ip;
+ ctxt->_eip = tss->ip;
ctxt->eflags = tss->flag | 2;
- c->regs[VCPU_REGS_RAX] = tss->ax;
- c->regs[VCPU_REGS_RCX] = tss->cx;
- c->regs[VCPU_REGS_RDX] = tss->dx;
- c->regs[VCPU_REGS_RBX] = tss->bx;
- c->regs[VCPU_REGS_RSP] = tss->sp;
- c->regs[VCPU_REGS_RBP] = tss->bp;
- c->regs[VCPU_REGS_RSI] = tss->si;
- c->regs[VCPU_REGS_RDI] = tss->di;
+ *reg_write(ctxt, VCPU_REGS_RAX) = tss->ax;
+ *reg_write(ctxt, VCPU_REGS_RCX) = tss->cx;
+ *reg_write(ctxt, VCPU_REGS_RDX) = tss->dx;
+ *reg_write(ctxt, VCPU_REGS_RBX) = tss->bx;
+ *reg_write(ctxt, VCPU_REGS_RSP) = tss->sp;
+ *reg_write(ctxt, VCPU_REGS_RBP) = tss->bp;
+ *reg_write(ctxt, VCPU_REGS_RSI) = tss->si;
+ *reg_write(ctxt, VCPU_REGS_RDI) = tss->di;
/*
* SDM says that segment selectors are loaded before segment
* descriptors
*/
- ops->set_segment_selector(tss->ldt, VCPU_SREG_LDTR, ctxt->vcpu);
- ops->set_segment_selector(tss->es, VCPU_SREG_ES, ctxt->vcpu);
- ops->set_segment_selector(tss->cs, VCPU_SREG_CS, ctxt->vcpu);
- ops->set_segment_selector(tss->ss, VCPU_SREG_SS, ctxt->vcpu);
- ops->set_segment_selector(tss->ds, VCPU_SREG_DS, ctxt->vcpu);
+ set_segment_selector(ctxt, tss->ldt, VCPU_SREG_LDTR);
+ set_segment_selector(ctxt, tss->es, VCPU_SREG_ES);
+ set_segment_selector(ctxt, tss->cs, VCPU_SREG_CS);
+ set_segment_selector(ctxt, tss->ss, VCPU_SREG_SS);
+ set_segment_selector(ctxt, tss->ds, VCPU_SREG_DS);
+
+ cpl = tss->cs & 3;
/*
- * Now load segment descriptors. If fault happenes at this stage
+ * Now load segment descriptors. If a fault happens at this stage,
* it is handled in the context of the new task
*/
- ret = load_segment_descriptor(ctxt, ops, tss->ldt, VCPU_SREG_LDTR);
+ ret = __load_segment_descriptor(ctxt, tss->ldt, VCPU_SREG_LDTR, cpl,
+ X86_TRANSFER_TASK_SWITCH, NULL);
if (ret != X86EMUL_CONTINUE)
return ret;
- ret = load_segment_descriptor(ctxt, ops, tss->es, VCPU_SREG_ES);
+ ret = __load_segment_descriptor(ctxt, tss->es, VCPU_SREG_ES, cpl,
+ X86_TRANSFER_TASK_SWITCH, NULL);
if (ret != X86EMUL_CONTINUE)
return ret;
- ret = load_segment_descriptor(ctxt, ops, tss->cs, VCPU_SREG_CS);
+ ret = __load_segment_descriptor(ctxt, tss->cs, VCPU_SREG_CS, cpl,
+ X86_TRANSFER_TASK_SWITCH, NULL);
if (ret != X86EMUL_CONTINUE)
return ret;
- ret = load_segment_descriptor(ctxt, ops, tss->ss, VCPU_SREG_SS);
+ ret = __load_segment_descriptor(ctxt, tss->ss, VCPU_SREG_SS, cpl,
+ X86_TRANSFER_TASK_SWITCH, NULL);
if (ret != X86EMUL_CONTINUE)
return ret;
- ret = load_segment_descriptor(ctxt, ops, tss->ds, VCPU_SREG_DS);
+ ret = __load_segment_descriptor(ctxt, tss->ds, VCPU_SREG_DS, cpl,
+ X86_TRANSFER_TASK_SWITCH, NULL);
if (ret != X86EMUL_CONTINUE)
return ret;
return X86EMUL_CONTINUE;
}
-static int task_switch_16(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- u16 tss_selector, u16 old_tss_sel,
+static int task_switch_16(struct x86_emulate_ctxt *ctxt, u16 old_tss_sel,
ulong old_tss_base, struct desc_struct *new_desc)
{
struct tss_segment_16 tss_seg;
int ret;
- u32 err, new_tss_base = get_desc_base(new_desc);
+ u32 new_tss_base = get_desc_base(new_desc);
- ret = ops->read_std(old_tss_base, &tss_seg, sizeof tss_seg, ctxt->vcpu,
- &err);
- if (ret == X86EMUL_PROPAGATE_FAULT) {
- /* FIXME: need to provide precise fault address */
- kvm_inject_page_fault(ctxt->vcpu, old_tss_base, err);
+ ret = linear_read_system(ctxt, old_tss_base, &tss_seg, sizeof(tss_seg));
+ if (ret != X86EMUL_CONTINUE)
return ret;
- }
- save_state_to_tss16(ctxt, ops, &tss_seg);
+ save_state_to_tss16(ctxt, &tss_seg);
- ret = ops->write_std(old_tss_base, &tss_seg, sizeof tss_seg, ctxt->vcpu,
- &err);
- if (ret == X86EMUL_PROPAGATE_FAULT) {
- /* FIXME: need to provide precise fault address */
- kvm_inject_page_fault(ctxt->vcpu, old_tss_base, err);
+ ret = linear_write_system(ctxt, old_tss_base, &tss_seg, sizeof(tss_seg));
+ if (ret != X86EMUL_CONTINUE)
return ret;
- }
- ret = ops->read_std(new_tss_base, &tss_seg, sizeof tss_seg, ctxt->vcpu,
- &err);
- if (ret == X86EMUL_PROPAGATE_FAULT) {
- /* FIXME: need to provide precise fault address */
- kvm_inject_page_fault(ctxt->vcpu, new_tss_base, err);
+ ret = linear_read_system(ctxt, new_tss_base, &tss_seg, sizeof(tss_seg));
+ if (ret != X86EMUL_CONTINUE)
return ret;
- }
if (old_tss_sel != 0xffff) {
tss_seg.prev_task_link = old_tss_sel;
- ret = ops->write_std(new_tss_base,
- &tss_seg.prev_task_link,
- sizeof tss_seg.prev_task_link,
- ctxt->vcpu, &err);
- if (ret == X86EMUL_PROPAGATE_FAULT) {
- /* FIXME: need to provide precise fault address */
- kvm_inject_page_fault(ctxt->vcpu, new_tss_base, err);
+ ret = linear_write_system(ctxt, new_tss_base,
+ &tss_seg.prev_task_link,
+ sizeof(tss_seg.prev_task_link));
+ if (ret != X86EMUL_CONTINUE)
return ret;
- }
}
- return load_state_from_tss16(ctxt, ops, &tss_seg);
+ return load_state_from_tss16(ctxt, &tss_seg);
}
static void save_state_to_tss32(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
struct tss_segment_32 *tss)
{
- struct decode_cache *c = &ctxt->decode;
-
- tss->cr3 = ops->get_cr(3, ctxt->vcpu);
- tss->eip = c->eip;
+ /* CR3 and the LDT selector are intentionally not saved */
+ tss->eip = ctxt->_eip;
tss->eflags = ctxt->eflags;
- tss->eax = c->regs[VCPU_REGS_RAX];
- tss->ecx = c->regs[VCPU_REGS_RCX];
- tss->edx = c->regs[VCPU_REGS_RDX];
- tss->ebx = c->regs[VCPU_REGS_RBX];
- tss->esp = c->regs[VCPU_REGS_RSP];
- tss->ebp = c->regs[VCPU_REGS_RBP];
- tss->esi = c->regs[VCPU_REGS_RSI];
- tss->edi = c->regs[VCPU_REGS_RDI];
-
- tss->es = ops->get_segment_selector(VCPU_SREG_ES, ctxt->vcpu);
- tss->cs = ops->get_segment_selector(VCPU_SREG_CS, ctxt->vcpu);
- tss->ss = ops->get_segment_selector(VCPU_SREG_SS, ctxt->vcpu);
- tss->ds = ops->get_segment_selector(VCPU_SREG_DS, ctxt->vcpu);
- tss->fs = ops->get_segment_selector(VCPU_SREG_FS, ctxt->vcpu);
- tss->gs = ops->get_segment_selector(VCPU_SREG_GS, ctxt->vcpu);
- tss->ldt_selector = ops->get_segment_selector(VCPU_SREG_LDTR, ctxt->vcpu);
+ tss->eax = reg_read(ctxt, VCPU_REGS_RAX);
+ tss->ecx = reg_read(ctxt, VCPU_REGS_RCX);
+ tss->edx = reg_read(ctxt, VCPU_REGS_RDX);
+ tss->ebx = reg_read(ctxt, VCPU_REGS_RBX);
+ tss->esp = reg_read(ctxt, VCPU_REGS_RSP);
+ tss->ebp = reg_read(ctxt, VCPU_REGS_RBP);
+ tss->esi = reg_read(ctxt, VCPU_REGS_RSI);
+ tss->edi = reg_read(ctxt, VCPU_REGS_RDI);
+
+ tss->es = get_segment_selector(ctxt, VCPU_SREG_ES);
+ tss->cs = get_segment_selector(ctxt, VCPU_SREG_CS);
+ tss->ss = get_segment_selector(ctxt, VCPU_SREG_SS);
+ tss->ds = get_segment_selector(ctxt, VCPU_SREG_DS);
+ tss->fs = get_segment_selector(ctxt, VCPU_SREG_FS);
+ tss->gs = get_segment_selector(ctxt, VCPU_SREG_GS);
}
static int load_state_from_tss32(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
struct tss_segment_32 *tss)
{
- struct decode_cache *c = &ctxt->decode;
int ret;
+ u8 cpl;
- ops->set_cr(3, tss->cr3, ctxt->vcpu);
- c->eip = tss->eip;
+ if (ctxt->ops->set_cr(ctxt, 3, tss->cr3))
+ return emulate_gp(ctxt, 0);
+ ctxt->_eip = tss->eip;
ctxt->eflags = tss->eflags | 2;
- c->regs[VCPU_REGS_RAX] = tss->eax;
- c->regs[VCPU_REGS_RCX] = tss->ecx;
- c->regs[VCPU_REGS_RDX] = tss->edx;
- c->regs[VCPU_REGS_RBX] = tss->ebx;
- c->regs[VCPU_REGS_RSP] = tss->esp;
- c->regs[VCPU_REGS_RBP] = tss->ebp;
- c->regs[VCPU_REGS_RSI] = tss->esi;
- c->regs[VCPU_REGS_RDI] = tss->edi;
+
+ /* General purpose registers */
+ *reg_write(ctxt, VCPU_REGS_RAX) = tss->eax;
+ *reg_write(ctxt, VCPU_REGS_RCX) = tss->ecx;
+ *reg_write(ctxt, VCPU_REGS_RDX) = tss->edx;
+ *reg_write(ctxt, VCPU_REGS_RBX) = tss->ebx;
+ *reg_write(ctxt, VCPU_REGS_RSP) = tss->esp;
+ *reg_write(ctxt, VCPU_REGS_RBP) = tss->ebp;
+ *reg_write(ctxt, VCPU_REGS_RSI) = tss->esi;
+ *reg_write(ctxt, VCPU_REGS_RDI) = tss->edi;
/*
* SDM says that segment selectors are loaded before segment
- * descriptors
+ * descriptors. This is important because CPL checks will
+ * use CS.RPL.
*/
- ops->set_segment_selector(tss->ldt_selector, VCPU_SREG_LDTR, ctxt->vcpu);
- ops->set_segment_selector(tss->es, VCPU_SREG_ES, ctxt->vcpu);
- ops->set_segment_selector(tss->cs, VCPU_SREG_CS, ctxt->vcpu);
- ops->set_segment_selector(tss->ss, VCPU_SREG_SS, ctxt->vcpu);
- ops->set_segment_selector(tss->ds, VCPU_SREG_DS, ctxt->vcpu);
- ops->set_segment_selector(tss->fs, VCPU_SREG_FS, ctxt->vcpu);
- ops->set_segment_selector(tss->gs, VCPU_SREG_GS, ctxt->vcpu);
+ set_segment_selector(ctxt, tss->ldt_selector, VCPU_SREG_LDTR);
+ set_segment_selector(ctxt, tss->es, VCPU_SREG_ES);
+ set_segment_selector(ctxt, tss->cs, VCPU_SREG_CS);
+ set_segment_selector(ctxt, tss->ss, VCPU_SREG_SS);
+ set_segment_selector(ctxt, tss->ds, VCPU_SREG_DS);
+ set_segment_selector(ctxt, tss->fs, VCPU_SREG_FS);
+ set_segment_selector(ctxt, tss->gs, VCPU_SREG_GS);
+
+ /*
+ * If we're switching between Protected Mode and VM86, we need to make
+ * sure to update the mode before loading the segment descriptors so
+ * that the selectors are interpreted correctly.
+ */
+ if (ctxt->eflags & X86_EFLAGS_VM) {
+ ctxt->mode = X86EMUL_MODE_VM86;
+ cpl = 3;
+ } else {
+ ctxt->mode = X86EMUL_MODE_PROT32;
+ cpl = tss->cs & 3;
+ }
/*
- * Now load segment descriptors. If fault happenes at this stage
+ * Now load segment descriptors. If a fault happens at this stage,
* it is handled in the context of the new task
*/
- ret = load_segment_descriptor(ctxt, ops, tss->ldt_selector, VCPU_SREG_LDTR);
+ ret = __load_segment_descriptor(ctxt, tss->ldt_selector, VCPU_SREG_LDTR,
+ cpl, X86_TRANSFER_TASK_SWITCH, NULL);
if (ret != X86EMUL_CONTINUE)
return ret;
- ret = load_segment_descriptor(ctxt, ops, tss->es, VCPU_SREG_ES);
+ ret = __load_segment_descriptor(ctxt, tss->es, VCPU_SREG_ES, cpl,
+ X86_TRANSFER_TASK_SWITCH, NULL);
if (ret != X86EMUL_CONTINUE)
return ret;
- ret = load_segment_descriptor(ctxt, ops, tss->cs, VCPU_SREG_CS);
+ ret = __load_segment_descriptor(ctxt, tss->cs, VCPU_SREG_CS, cpl,
+ X86_TRANSFER_TASK_SWITCH, NULL);
if (ret != X86EMUL_CONTINUE)
return ret;
- ret = load_segment_descriptor(ctxt, ops, tss->ss, VCPU_SREG_SS);
+ ret = __load_segment_descriptor(ctxt, tss->ss, VCPU_SREG_SS, cpl,
+ X86_TRANSFER_TASK_SWITCH, NULL);
if (ret != X86EMUL_CONTINUE)
return ret;
- ret = load_segment_descriptor(ctxt, ops, tss->ds, VCPU_SREG_DS);
+ ret = __load_segment_descriptor(ctxt, tss->ds, VCPU_SREG_DS, cpl,
+ X86_TRANSFER_TASK_SWITCH, NULL);
if (ret != X86EMUL_CONTINUE)
return ret;
- ret = load_segment_descriptor(ctxt, ops, tss->fs, VCPU_SREG_FS);
- if (ret != X86EMUL_CONTINUE)
- return ret;
- ret = load_segment_descriptor(ctxt, ops, tss->gs, VCPU_SREG_GS);
+ ret = __load_segment_descriptor(ctxt, tss->fs, VCPU_SREG_FS, cpl,
+ X86_TRANSFER_TASK_SWITCH, NULL);
if (ret != X86EMUL_CONTINUE)
return ret;
+ ret = __load_segment_descriptor(ctxt, tss->gs, VCPU_SREG_GS, cpl,
+ X86_TRANSFER_TASK_SWITCH, NULL);
- return X86EMUL_CONTINUE;
+ return ret;
}
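
For reference, a minimal standalone sketch (hypothetical helper name, not emulator code) of the CPL derivation above: VM86 forces CPL 3, otherwise the RPL bits of the incoming CS selector are used.

/* Illustrative only: CPL used for the descriptor checks during a task switch. */
#include <stdio.h>

#define X86_EFLAGS_VM (1u << 17)

static unsigned int cpl_for_task_switch(unsigned int eflags, unsigned short cs)
{
	/* In VM86 mode the CPL is always 3; otherwise it is CS.RPL (bits 1:0). */
	return (eflags & X86_EFLAGS_VM) ? 3 : (cs & 3);
}

int main(void)
{
	printf("%u\n", cpl_for_task_switch(X86_EFLAGS_VM, 0x0008)); /* 3 */
	printf("%u\n", cpl_for_task_switch(0, 0x001b));             /* 3: RPL of 0x1b */
	printf("%u\n", cpl_for_task_switch(0, 0x0008));             /* 0 */
	return 0;
}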
-static int task_switch_32(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- u16 tss_selector, u16 old_tss_sel,
+static int task_switch_32(struct x86_emulate_ctxt *ctxt, u16 old_tss_sel,
ulong old_tss_base, struct desc_struct *new_desc)
{
struct tss_segment_32 tss_seg;
int ret;
- u32 err, new_tss_base = get_desc_base(new_desc);
+ u32 new_tss_base = get_desc_base(new_desc);
+ u32 eip_offset = offsetof(struct tss_segment_32, eip);
+ u32 ldt_sel_offset = offsetof(struct tss_segment_32, ldt_selector);
- ret = ops->read_std(old_tss_base, &tss_seg, sizeof tss_seg, ctxt->vcpu,
- &err);
- if (ret == X86EMUL_PROPAGATE_FAULT) {
- /* FIXME: need to provide precise fault address */
- kvm_inject_page_fault(ctxt->vcpu, old_tss_base, err);
+ ret = linear_read_system(ctxt, old_tss_base, &tss_seg, sizeof(tss_seg));
+ if (ret != X86EMUL_CONTINUE)
return ret;
- }
- save_state_to_tss32(ctxt, ops, &tss_seg);
+ save_state_to_tss32(ctxt, &tss_seg);
- ret = ops->write_std(old_tss_base, &tss_seg, sizeof tss_seg, ctxt->vcpu,
- &err);
- if (ret == X86EMUL_PROPAGATE_FAULT) {
- /* FIXME: need to provide precise fault address */
- kvm_inject_page_fault(ctxt->vcpu, old_tss_base, err);
+ /* Only GP registers and segment selectors are saved */
+ ret = linear_write_system(ctxt, old_tss_base + eip_offset, &tss_seg.eip,
+ ldt_sel_offset - eip_offset);
+ if (ret != X86EMUL_CONTINUE)
return ret;
- }
- ret = ops->read_std(new_tss_base, &tss_seg, sizeof tss_seg, ctxt->vcpu,
- &err);
- if (ret == X86EMUL_PROPAGATE_FAULT) {
- /* FIXME: need to provide precise fault address */
- kvm_inject_page_fault(ctxt->vcpu, new_tss_base, err);
+ ret = linear_read_system(ctxt, new_tss_base, &tss_seg, sizeof(tss_seg));
+ if (ret != X86EMUL_CONTINUE)
return ret;
- }
if (old_tss_sel != 0xffff) {
tss_seg.prev_task_link = old_tss_sel;
- ret = ops->write_std(new_tss_base,
- &tss_seg.prev_task_link,
- sizeof tss_seg.prev_task_link,
- ctxt->vcpu, &err);
- if (ret == X86EMUL_PROPAGATE_FAULT) {
- /* FIXME: need to provide precise fault address */
- kvm_inject_page_fault(ctxt->vcpu, new_tss_base, err);
+ ret = linear_write_system(ctxt, new_tss_base,
+ &tss_seg.prev_task_link,
+ sizeof(tss_seg.prev_task_link));
+ if (ret != X86EMUL_CONTINUE)
return ret;
- }
}
- return load_state_from_tss32(ctxt, ops, &tss_seg);
+ return load_state_from_tss32(ctxt, &tss_seg);
}
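
A standalone illustration of the partial write-back above, assuming a demo struct that mirrors the architectural 32-bit TSS layout: only the eip..ldt_selector range, i.e. the fields save_state_to_tss32() actually fills, is written back to the old TSS.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct demo_tss32 {            /* hypothetical mirror of tss_segment_32 */
	uint16_t prev_task_link, r0;
	uint32_t esp0; uint16_t ss0, r1;
	uint32_t esp1; uint16_t ss1, r2;
	uint32_t esp2; uint16_t ss2, r3;
	uint32_t cr3, eip, eflags;
	uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
	uint16_t es, r4, cs, r5, ss, r6, ds, r7, fs, r8, gs, r9;
	uint16_t ldt_selector, r10;
	uint16_t t, io_map;
};

int main(void)
{
	size_t eip_off = offsetof(struct demo_tss32, eip);          /* 32 (0x20) */
	size_t ldt_off = offsetof(struct demo_tss32, ldt_selector); /* 96 (0x60) */

	/* The byte range actually written back to the old TSS. */
	printf("write %zu bytes at offset %zu\n", ldt_off - eip_off, eip_off);
	return 0;
}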
static int emulator_do_task_switch(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- u16 tss_selector, int reason,
+ u16 tss_selector, int idt_index, int reason,
bool has_error_code, u32 error_code)
{
+ const struct x86_emulate_ops *ops = ctxt->ops;
struct desc_struct curr_tss_desc, next_tss_desc;
int ret;
- u16 old_tss_sel = ops->get_segment_selector(VCPU_SREG_TR, ctxt->vcpu);
+ u16 old_tss_sel = get_segment_selector(ctxt, VCPU_SREG_TR);
ulong old_tss_base =
- get_cached_descriptor_base(ctxt, ops, VCPU_SREG_TR);
+ ops->get_cached_segment_base(ctxt, VCPU_SREG_TR);
u32 desc_limit;
+ ulong desc_addr, dr7;
/* FIXME: old_tss_base == ~0 ? */
- ret = read_segment_descriptor(ctxt, ops, tss_selector, &next_tss_desc);
+ ret = read_segment_descriptor(ctxt, tss_selector, &next_tss_desc, &desc_addr);
if (ret != X86EMUL_CONTINUE)
return ret;
- ret = read_segment_descriptor(ctxt, ops, old_tss_sel, &curr_tss_desc);
+ ret = read_segment_descriptor(ctxt, old_tss_sel, &curr_tss_desc, &desc_addr);
if (ret != X86EMUL_CONTINUE)
return ret;
/* FIXME: check that next_tss_desc is tss */
- if (reason != TASK_SWITCH_IRET) {
- if ((tss_selector & 3) > next_tss_desc.dpl ||
- ops->cpl(ctxt->vcpu) > next_tss_desc.dpl) {
- kvm_inject_gp(ctxt->vcpu, 0);
- return X86EMUL_PROPAGATE_FAULT;
+ /*
+ * Check privileges. The three cases are task switch caused by...
+ *
+ * 1. jmp/call/int to task gate: Check against DPL of the task gate
+ * 2. Exception/IRQ/iret: No check is performed
+ * 3. jmp/call to TSS/task-gate: No check is performed since the
+ * hardware checks it before exiting.
+ */
+ if (reason == TASK_SWITCH_GATE) {
+ if (idt_index != -1) {
+ /* Software interrupts */
+ struct desc_struct task_gate_desc;
+ int dpl;
+
+ ret = read_interrupt_descriptor(ctxt, idt_index,
+ &task_gate_desc);
+ if (ret != X86EMUL_CONTINUE)
+ return ret;
+
+ dpl = task_gate_desc.dpl;
+ if ((tss_selector & 3) > dpl || ops->cpl(ctxt) > dpl)
+ return emulate_gp(ctxt, (idt_index << 3) | 0x2);
}
}
@@ -2378,30 +2924,26 @@ static int emulator_do_task_switch(struct x86_emulate_ctxt *ctxt,
if (!next_tss_desc.p ||
((desc_limit < 0x67 && (next_tss_desc.type & 8)) ||
desc_limit < 0x2b)) {
- kvm_queue_exception_e(ctxt->vcpu, TS_VECTOR,
- tss_selector & 0xfffc);
- return X86EMUL_PROPAGATE_FAULT;
+ return emulate_ts(ctxt, tss_selector & 0xfffc);
}
if (reason == TASK_SWITCH_IRET || reason == TASK_SWITCH_JMP) {
curr_tss_desc.type &= ~(1 << 1); /* clear busy flag */
- write_segment_descriptor(ctxt, ops, old_tss_sel,
- &curr_tss_desc);
+ write_segment_descriptor(ctxt, old_tss_sel, &curr_tss_desc);
}
if (reason == TASK_SWITCH_IRET)
ctxt->eflags = ctxt->eflags & ~X86_EFLAGS_NT;
/* set back link to prev task only if NT bit is set in eflags
- note that old_tss_sel is not used afetr this point */
+ note that old_tss_sel is not used after this point */
if (reason != TASK_SWITCH_CALL && reason != TASK_SWITCH_GATE)
old_tss_sel = 0xffff;
if (next_tss_desc.type & 8)
- ret = task_switch_32(ctxt, ops, tss_selector, old_tss_sel,
- old_tss_base, &next_tss_desc);
+ ret = task_switch_32(ctxt, old_tss_sel, old_tss_base, &next_tss_desc);
else
- ret = task_switch_16(ctxt, ops, tss_selector, old_tss_sel,
+ ret = task_switch_16(ctxt, old_tss_sel,
old_tss_base, &next_tss_desc);
if (ret != X86EMUL_CONTINUE)
return ret;
@@ -2411,858 +2953,2693 @@ static int emulator_do_task_switch(struct x86_emulate_ctxt *ctxt,
if (reason != TASK_SWITCH_IRET) {
next_tss_desc.type |= (1 << 1); /* set busy flag */
- write_segment_descriptor(ctxt, ops, tss_selector,
- &next_tss_desc);
+ write_segment_descriptor(ctxt, tss_selector, &next_tss_desc);
}
- ops->set_cr(0, ops->get_cr(0, ctxt->vcpu) | X86_CR0_TS, ctxt->vcpu);
- ops->set_cached_descriptor(&next_tss_desc, VCPU_SREG_TR, ctxt->vcpu);
- ops->set_segment_selector(tss_selector, VCPU_SREG_TR, ctxt->vcpu);
+ ops->set_cr(ctxt, 0, ops->get_cr(ctxt, 0) | X86_CR0_TS);
+ ops->set_segment(ctxt, tss_selector, &next_tss_desc, 0, VCPU_SREG_TR);
if (has_error_code) {
- struct decode_cache *c = &ctxt->decode;
-
- c->op_bytes = c->ad_bytes = (next_tss_desc.type & 8) ? 4 : 2;
- c->lock_prefix = 0;
- c->src.val = (unsigned long) error_code;
- emulate_push(ctxt);
+ ctxt->op_bytes = ctxt->ad_bytes = (next_tss_desc.type & 8) ? 4 : 2;
+ ctxt->lock_prefix = 0;
+ ctxt->src.val = (unsigned long) error_code;
+ ret = em_push(ctxt);
}
+ dr7 = ops->get_dr(ctxt, 7);
+ ops->set_dr(ctxt, 7, dr7 & ~(DR_LOCAL_ENABLE_MASK | DR_LOCAL_SLOWDOWN));
+
return ret;
}
int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
- struct x86_emulate_ops *ops,
- u16 tss_selector, int reason,
+ u16 tss_selector, int idt_index, int reason,
bool has_error_code, u32 error_code)
{
- struct decode_cache *c = &ctxt->decode;
int rc;
- memset(c, 0, sizeof(struct decode_cache));
- c->eip = ctxt->eip;
- memcpy(c->regs, ctxt->vcpu->arch.regs, sizeof c->regs);
- c->dst.type = OP_NONE;
+ invalidate_registers(ctxt);
+ ctxt->_eip = ctxt->eip;
+ ctxt->dst.type = OP_NONE;
- rc = emulator_do_task_switch(ctxt, ops, tss_selector, reason,
+ rc = emulator_do_task_switch(ctxt, tss_selector, idt_index, reason,
has_error_code, error_code);
if (rc == X86EMUL_CONTINUE) {
- memcpy(ctxt->vcpu->arch.regs, c->regs, sizeof c->regs);
- kvm_rip_write(ctxt->vcpu, c->eip);
- rc = writeback(ctxt, ops);
+ ctxt->eip = ctxt->_eip;
+ writeback_registers(ctxt);
}
- return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0;
+ return (rc == X86EMUL_UNHANDLEABLE) ? EMULATION_FAILED : EMULATION_OK;
}
-static void string_addr_inc(struct x86_emulate_ctxt *ctxt, unsigned long base,
- int reg, struct operand *op)
+static void string_addr_inc(struct x86_emulate_ctxt *ctxt, int reg,
+ struct operand *op)
{
- struct decode_cache *c = &ctxt->decode;
- int df = (ctxt->eflags & EFLG_DF) ? -1 : 1;
+ int df = (ctxt->eflags & X86_EFLAGS_DF) ? -op->count : op->count;
- register_address_increment(c, &c->regs[reg], df * op->bytes);
- op->ptr = (unsigned long *)register_address(c, base, c->regs[reg]);
+ register_address_increment(ctxt, reg, df * op->bytes);
+ op->addr.mem.ea = register_address(ctxt, reg);
}
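
A quick sketch of the direction-flag arithmetic in string_addr_inc() (hypothetical helper, for illustration only): with DF clear the index register advances by op->bytes per iteration, with DF set it moves backwards; count folds in REP-batched iterations.

#include <stdio.h>

#define X86_EFLAGS_DF (1u << 10)

static long string_delta(unsigned int eflags, int count, int bytes)
{
	int df = (eflags & X86_EFLAGS_DF) ? -count : count;
	return (long)df * bytes;
}

int main(void)
{
	printf("%ld\n", string_delta(0, 4, 2));             /* +8: 4 words forward  */
	printf("%ld\n", string_delta(X86_EFLAGS_DF, 4, 2)); /* -8: 4 words backward */
	return 0;
}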
-int
-x86_emulate_insn(struct x86_emulate_ctxt *ctxt, struct x86_emulate_ops *ops)
+static int em_das(struct x86_emulate_ctxt *ctxt)
{
- u64 msr_data;
- struct decode_cache *c = &ctxt->decode;
- int rc = X86EMUL_CONTINUE;
- int saved_dst_type = c->dst.type;
+ u8 al, old_al;
+ bool af, cf, old_cf;
+
+ cf = ctxt->eflags & X86_EFLAGS_CF;
+ al = ctxt->dst.val;
+
+ old_al = al;
+ old_cf = cf;
+ cf = false;
+ af = ctxt->eflags & X86_EFLAGS_AF;
+ if ((al & 0x0f) > 9 || af) {
+ al -= 6;
+ cf = old_cf | (al >= 250);
+ af = true;
+ } else {
+ af = false;
+ }
+ if (old_al > 0x99 || old_cf) {
+ al -= 0x60;
+ cf = true;
+ }
- ctxt->interruptibility = 0;
+ ctxt->dst.val = al;
+ /* Set PF, ZF, SF */
+ ctxt->src.type = OP_IMM;
+ ctxt->src.val = 0;
+ ctxt->src.bytes = 1;
+ em_or(ctxt);
+ ctxt->eflags &= ~(X86_EFLAGS_AF | X86_EFLAGS_CF);
+ if (cf)
+ ctxt->eflags |= X86_EFLAGS_CF;
+ if (af)
+ ctxt->eflags |= X86_EFLAGS_AF;
+ return X86EMUL_CONTINUE;
+}
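
The same DAS algorithm as em_das(), lifted into a plain userspace function as a cross-check (illustrative only): decimal-adjust AL after a packed-BCD subtraction.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint8_t das(uint8_t al, bool *cf, bool *af)
{
	uint8_t old_al = al;
	bool old_cf = *cf;

	*cf = false;
	if ((al & 0x0f) > 9 || *af) {
		al -= 6;
		*cf = old_cf || (al >= 250); /* borrow out of the low nibble */
		*af = true;
	} else {
		*af = false;
	}
	if (old_al > 0x99 || old_cf) {
		al -= 0x60;
		*cf = true;
	}
	return al;
}

int main(void)
{
	bool cf = false, af = true;              /* e.g. after 0x22 - 0x03 = 0x1f */
	printf("0x%02x\n", das(0x1f, &cf, &af)); /* 0x19: BCD 22 - 03 = 19 */
	return 0;
}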
- /* Shadow copy of register state. Committed on successful emulation.
- * NOTE: we can copy them from vcpu as x86_decode_insn() doesn't
- * modify them.
- */
+static int em_aam(struct x86_emulate_ctxt *ctxt)
+{
+ u8 al, ah;
- memcpy(c->regs, ctxt->vcpu->arch.regs, sizeof c->regs);
+ if (ctxt->src.val == 0)
+ return emulate_de(ctxt);
- if (ctxt->mode == X86EMUL_MODE_PROT64 && (c->d & No64)) {
- kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
- goto done;
- }
+ al = ctxt->dst.val & 0xff;
+ ah = al / ctxt->src.val;
+ al %= ctxt->src.val;
- /* LOCK prefix is allowed only with some instructions */
- if (c->lock_prefix && (!(c->d & Lock) || c->dst.type != OP_MEM)) {
- kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
- goto done;
+ ctxt->dst.val = (ctxt->dst.val & 0xffff0000) | al | (ah << 8);
+
+ /* Set PF, ZF, SF */
+ ctxt->src.type = OP_IMM;
+ ctxt->src.val = 0;
+ ctxt->src.bytes = 1;
+ em_or(ctxt);
+
+ return X86EMUL_CONTINUE;
+}
+
+static int em_aad(struct x86_emulate_ctxt *ctxt)
+{
+ u8 al = ctxt->dst.val & 0xff;
+ u8 ah = (ctxt->dst.val >> 8) & 0xff;
+
+ al = (al + (ah * ctxt->src.val)) & 0xff;
+
+ ctxt->dst.val = (ctxt->dst.val & 0xffff0000) | al;
+
+ /* Set PF, ZF, SF */
+ ctxt->src.type = OP_IMM;
+ ctxt->src.val = 0;
+ ctxt->src.bytes = 1;
+ em_or(ctxt);
+
+ return X86EMUL_CONTINUE;
+}
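
A quick illustration of the AAM/AAD arithmetic above (note that em_aam() rejects a zero divisor with #DE before this point). The imm8 operand, 0x0a in the classic encodings, is the base: AAM splits AL into AH:AL digits and AAD folds them back.

#include <stdint.h>
#include <stdio.h>

static void aam(uint8_t base, uint8_t *al, uint8_t *ah)
{
	*ah = *al / base;
	*al %= base;
}

static void aad(uint8_t base, uint8_t *al, uint8_t *ah)
{
	*al = (uint8_t)(*al + *ah * base);
	*ah = 0;
}

int main(void)
{
	uint8_t al = 73, ah = 0;

	aam(10, &al, &ah);          /* ah = 7, al = 3 */
	printf("%u %u\n", ah, al);
	aad(10, &al, &ah);          /* al = 73 again, ah = 0 */
	printf("%u %u\n", ah, al);
	return 0;
}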
+
+static int em_call(struct x86_emulate_ctxt *ctxt)
+{
+ int rc;
+ long rel = ctxt->src.val;
+
+ ctxt->src.val = (unsigned long)ctxt->_eip;
+ rc = jmp_rel(ctxt, rel);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ return em_push(ctxt);
+}
+
+static int em_call_far(struct x86_emulate_ctxt *ctxt)
+{
+ u16 sel, old_cs;
+ ulong old_eip;
+ int rc;
+ struct desc_struct old_desc, new_desc;
+ const struct x86_emulate_ops *ops = ctxt->ops;
+ int cpl = ctxt->ops->cpl(ctxt);
+ enum x86emul_mode prev_mode = ctxt->mode;
+
+ old_eip = ctxt->_eip;
+ ops->get_segment(ctxt, &old_cs, &old_desc, NULL, VCPU_SREG_CS);
+
+ memcpy(&sel, ctxt->src.valptr + ctxt->op_bytes, 2);
+ rc = __load_segment_descriptor(ctxt, sel, VCPU_SREG_CS, cpl,
+ X86_TRANSFER_CALL_JMP, &new_desc);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ rc = assign_eip_far(ctxt, ctxt->src.val);
+ if (rc != X86EMUL_CONTINUE)
+ goto fail;
+
+ ctxt->src.val = old_cs;
+ rc = em_push(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ goto fail;
+
+ ctxt->src.val = old_eip;
+ rc = em_push(ctxt);
+ /* If we failed, we tainted the memory, but at the very least we
+    should restore CS */
+ if (rc != X86EMUL_CONTINUE) {
+ pr_warn_once("faulting far call emulation tainted memory\n");
+ goto fail;
}
+ return rc;
+fail:
+ ops->set_segment(ctxt, old_cs, &old_desc, 0, VCPU_SREG_CS);
+ ctxt->mode = prev_mode;
+ return rc;
- /* Privileged instruction can be executed only in CPL=0 */
- if ((c->d & Priv) && ops->cpl(ctxt->vcpu)) {
- kvm_inject_gp(ctxt->vcpu, 0);
- goto done;
+}
+
+static int em_ret_near_imm(struct x86_emulate_ctxt *ctxt)
+{
+ int rc;
+ unsigned long eip = 0;
+
+ rc = emulate_pop(ctxt, &eip, ctxt->op_bytes);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ rc = assign_eip_near(ctxt, eip);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ rsp_increment(ctxt, ctxt->src.val);
+ return X86EMUL_CONTINUE;
+}
+
+static int em_xchg(struct x86_emulate_ctxt *ctxt)
+{
+ /* Write back the register source. */
+ ctxt->src.val = ctxt->dst.val;
+ write_register_operand(&ctxt->src);
+
+ /* Write back the memory destination with implicit LOCK prefix. */
+ ctxt->dst.val = ctxt->src.orig_val;
+ ctxt->lock_prefix = 1;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_imul_3op(struct x86_emulate_ctxt *ctxt)
+{
+ ctxt->dst.val = ctxt->src2.val;
+ return em_imul(ctxt);
+}
+
+static int em_cwd(struct x86_emulate_ctxt *ctxt)
+{
+ ctxt->dst.type = OP_REG;
+ ctxt->dst.bytes = ctxt->src.bytes;
+ ctxt->dst.addr.reg = reg_rmw(ctxt, VCPU_REGS_RDX);
+ ctxt->dst.val = ~((ctxt->src.val >> (ctxt->src.bytes * 8 - 1)) - 1);
+
+ return X86EMUL_CONTINUE;
+}
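
The expression in em_cwd() is a branchless sign fill: shift the sign bit down to bit 0; subtracting 1 gives 0 when the sign bit was set and all-ones when it was clear; inverting yields the replicated sign. Standalone check (illustrative only):

#include <stdint.h>
#include <stdio.h>

static uint64_t sign_fill(uint64_t val, unsigned int bytes)
{
	/* Same trick as em_cwd(): replicate the top bit of a bytes-wide value. */
	return ~((val >> (bytes * 8 - 1)) - 1);
}

int main(void)
{
	/* CWD: AX = 0x8000 -> DX = all ones; AX = 0x7fff -> DX = 0 */
	printf("%#llx\n", (unsigned long long)sign_fill(0x8000, 2));
	printf("%#llx\n", (unsigned long long)sign_fill(0x7fff, 2));
	return 0;
}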
+
+static int em_rdpid(struct x86_emulate_ctxt *ctxt)
+{
+ u64 tsc_aux = 0;
+
+ if (!ctxt->ops->guest_has_rdpid(ctxt))
+ return emulate_ud(ctxt);
+
+ ctxt->ops->get_msr(ctxt, MSR_TSC_AUX, &tsc_aux);
+ ctxt->dst.val = tsc_aux;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_rdtsc(struct x86_emulate_ctxt *ctxt)
+{
+ u64 tsc = 0;
+
+ ctxt->ops->get_msr(ctxt, MSR_IA32_TSC, &tsc);
+ *reg_write(ctxt, VCPU_REGS_RAX) = (u32)tsc;
+ *reg_write(ctxt, VCPU_REGS_RDX) = tsc >> 32;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_rdpmc(struct x86_emulate_ctxt *ctxt)
+{
+ u64 pmc;
+
+ if (ctxt->ops->read_pmc(ctxt, reg_read(ctxt, VCPU_REGS_RCX), &pmc))
+ return emulate_gp(ctxt, 0);
+ *reg_write(ctxt, VCPU_REGS_RAX) = (u32)pmc;
+ *reg_write(ctxt, VCPU_REGS_RDX) = pmc >> 32;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_mov(struct x86_emulate_ctxt *ctxt)
+{
+ memcpy(ctxt->dst.valptr, ctxt->src.valptr, sizeof(ctxt->src.valptr));
+ return X86EMUL_CONTINUE;
+}
+
+static int em_movbe(struct x86_emulate_ctxt *ctxt)
+{
+ u16 tmp;
+
+ if (!ctxt->ops->guest_has_movbe(ctxt))
+ return emulate_ud(ctxt);
+
+ switch (ctxt->op_bytes) {
+ case 2:
+ /*
+ * From MOVBE definition: "...When the operand size is 16 bits,
+ * the upper word of the destination register remains unchanged
+ * ..."
+ *
+ * Both casting ->valptr and ->val to u16 breaks strict aliasing
+ * rules, so we have to do the operation almost by hand.
+ */
+ tmp = (u16)ctxt->src.val;
+ ctxt->dst.val &= ~0xffffUL;
+ ctxt->dst.val |= (unsigned long)swab16(tmp);
+ break;
+ case 4:
+ ctxt->dst.val = swab32((u32)ctxt->src.val);
+ break;
+ case 8:
+ ctxt->dst.val = swab64(ctxt->src.val);
+ break;
+ default:
+ BUG();
}
+ return X86EMUL_CONTINUE;
+}
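
A userspace sketch of the 16-bit MOVBE case above: the byte swap is applied to the low word only and the destination's upper bits survive, matching the SDM wording quoted in the comment.

#include <stdint.h>
#include <stdio.h>

static uint16_t swab16_demo(uint16_t x)
{
	return (uint16_t)((x << 8) | (x >> 8));
}

int main(void)
{
	unsigned long dst = 0xdeadbeef;
	uint16_t src = 0x1234;

	dst &= ~0xffffUL;           /* keep the upper word of the destination */
	dst |= swab16_demo(src);    /* byte-swapped low word */
	printf("%#lx\n", dst);      /* 0xdead3412 */
	return 0;
}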
- if (c->rep_prefix && (c->d & String)) {
- ctxt->restart = true;
- /* All REP prefixes have the same first termination condition */
- if (address_mask(c, c->regs[VCPU_REGS_RCX]) == 0) {
- string_done:
- ctxt->restart = false;
- kvm_rip_write(ctxt->vcpu, c->eip);
- goto done;
- }
- /* The second termination condition only applies for REPE
- * and REPNE. Test if the repeat string operation prefix is
- * REPE/REPZ or REPNE/REPNZ and if it's the case it tests the
- * corresponding termination condition according to:
- * - if REPE/REPZ and ZF = 0 then done
- * - if REPNE/REPNZ and ZF = 1 then done
+static int em_cr_write(struct x86_emulate_ctxt *ctxt)
+{
+ int cr_num = ctxt->modrm_reg;
+ int r;
+
+ if (ctxt->ops->set_cr(ctxt, cr_num, ctxt->src.val))
+ return emulate_gp(ctxt, 0);
+
+ /* Disable writeback. */
+ ctxt->dst.type = OP_NONE;
+
+ if (cr_num == 0) {
+ /*
+ * CR0 write might have updated CR0.PE and/or CR0.PG
+ * which can affect the cpu's execution mode.
*/
- if ((c->b == 0xa6) || (c->b == 0xa7) ||
- (c->b == 0xae) || (c->b == 0xaf)) {
- if ((c->rep_prefix == REPE_PREFIX) &&
- ((ctxt->eflags & EFLG_ZF) == 0))
- goto string_done;
- if ((c->rep_prefix == REPNE_PREFIX) &&
- ((ctxt->eflags & EFLG_ZF) == EFLG_ZF))
- goto string_done;
- }
- c->eip = ctxt->eip;
+ r = emulator_recalc_and_set_mode(ctxt);
+ if (r != X86EMUL_CONTINUE)
+ return r;
}
- if (c->src.type == OP_MEM) {
- rc = ops->read_emulated((unsigned long)c->src.ptr,
- &c->src.val,
- c->src.bytes,
- ctxt->vcpu);
- if (rc != X86EMUL_CONTINUE)
- goto done;
- c->src.orig_val = c->src.val;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_dr_write(struct x86_emulate_ctxt *ctxt)
+{
+ unsigned long val;
+
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ val = ctxt->src.val & ~0ULL;
+ else
+ val = ctxt->src.val & ~0U;
+
+ /* #UD condition is already handled. */
+ if (ctxt->ops->set_dr(ctxt, ctxt->modrm_reg, val) < 0)
+ return emulate_gp(ctxt, 0);
+
+ /* Disable writeback. */
+ ctxt->dst.type = OP_NONE;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
+{
+ u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
+ u64 msr_data;
+ int r;
+
+ msr_data = (u32)reg_read(ctxt, VCPU_REGS_RAX)
+ | ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32);
+ r = ctxt->ops->set_msr_with_filter(ctxt, msr_index, msr_data);
+
+ if (r == X86EMUL_PROPAGATE_FAULT)
+ return emulate_gp(ctxt, 0);
+
+ return r;
+}
+
+static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
+{
+ u64 msr_index = reg_read(ctxt, VCPU_REGS_RCX);
+ u64 msr_data;
+ int r;
+
+ r = ctxt->ops->get_msr_with_filter(ctxt, msr_index, &msr_data);
+
+ if (r == X86EMUL_PROPAGATE_FAULT)
+ return emulate_gp(ctxt, 0);
+
+ if (r == X86EMUL_CONTINUE) {
+ *reg_write(ctxt, VCPU_REGS_RAX) = (u32)msr_data;
+ *reg_write(ctxt, VCPU_REGS_RDX) = msr_data >> 32;
}
+ return r;
+}
- if (c->src2.type == OP_MEM) {
- rc = ops->read_emulated((unsigned long)c->src2.ptr,
- &c->src2.val,
- c->src2.bytes,
- ctxt->vcpu);
- if (rc != X86EMUL_CONTINUE)
- goto done;
+static int em_store_sreg(struct x86_emulate_ctxt *ctxt, int segment)
+{
+ if (segment > VCPU_SREG_GS &&
+ (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_UMIP) &&
+ ctxt->ops->cpl(ctxt) > 0)
+ return emulate_gp(ctxt, 0);
+
+ ctxt->dst.val = get_segment_selector(ctxt, segment);
+ if (ctxt->dst.bytes == 4 && ctxt->dst.type == OP_MEM)
+ ctxt->dst.bytes = 2;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_mov_rm_sreg(struct x86_emulate_ctxt *ctxt)
+{
+ if (ctxt->modrm_reg > VCPU_SREG_GS)
+ return emulate_ud(ctxt);
+
+ return em_store_sreg(ctxt, ctxt->modrm_reg);
+}
+
+static int em_mov_sreg_rm(struct x86_emulate_ctxt *ctxt)
+{
+ u16 sel = ctxt->src.val;
+
+ if (ctxt->modrm_reg == VCPU_SREG_CS || ctxt->modrm_reg > VCPU_SREG_GS)
+ return emulate_ud(ctxt);
+
+ if (ctxt->modrm_reg == VCPU_SREG_SS)
+ ctxt->interruptibility = KVM_X86_SHADOW_INT_MOV_SS;
+
+ /* Disable writeback. */
+ ctxt->dst.type = OP_NONE;
+ return load_segment_descriptor(ctxt, sel, ctxt->modrm_reg);
+}
+
+static int em_sldt(struct x86_emulate_ctxt *ctxt)
+{
+ return em_store_sreg(ctxt, VCPU_SREG_LDTR);
+}
+
+static int em_lldt(struct x86_emulate_ctxt *ctxt)
+{
+ u16 sel = ctxt->src.val;
+
+ /* Disable writeback. */
+ ctxt->dst.type = OP_NONE;
+ return load_segment_descriptor(ctxt, sel, VCPU_SREG_LDTR);
+}
+
+static int em_str(struct x86_emulate_ctxt *ctxt)
+{
+ return em_store_sreg(ctxt, VCPU_SREG_TR);
+}
+
+static int em_ltr(struct x86_emulate_ctxt *ctxt)
+{
+ u16 sel = ctxt->src.val;
+
+ /* Disable writeback. */
+ ctxt->dst.type = OP_NONE;
+ return load_segment_descriptor(ctxt, sel, VCPU_SREG_TR);
+}
+
+static int em_invlpg(struct x86_emulate_ctxt *ctxt)
+{
+ int rc;
+ ulong linear;
+ unsigned int max_size;
+
+ rc = __linearize(ctxt, ctxt->src.addr.mem, &max_size, 1, ctxt->mode,
+ &linear, X86EMUL_F_INVLPG);
+ if (rc == X86EMUL_CONTINUE)
+ ctxt->ops->invlpg(ctxt, linear);
+ /* Disable writeback. */
+ ctxt->dst.type = OP_NONE;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_clts(struct x86_emulate_ctxt *ctxt)
+{
+ ulong cr0;
+
+ cr0 = ctxt->ops->get_cr(ctxt, 0);
+ cr0 &= ~X86_CR0_TS;
+ ctxt->ops->set_cr(ctxt, 0, cr0);
+ return X86EMUL_CONTINUE;
+}
+
+static int em_hypercall(struct x86_emulate_ctxt *ctxt)
+{
+ int rc = ctxt->ops->fix_hypercall(ctxt);
+
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ /* Let the processor re-execute the fixed hypercall */
+ ctxt->_eip = ctxt->eip;
+ /* Disable writeback. */
+ ctxt->dst.type = OP_NONE;
+ return X86EMUL_CONTINUE;
+}
+
+static int emulate_store_desc_ptr(struct x86_emulate_ctxt *ctxt,
+ void (*get)(struct x86_emulate_ctxt *ctxt,
+ struct desc_ptr *ptr))
+{
+ struct desc_ptr desc_ptr;
+
+ if ((ctxt->ops->get_cr(ctxt, 4) & X86_CR4_UMIP) &&
+ ctxt->ops->cpl(ctxt) > 0)
+ return emulate_gp(ctxt, 0);
+
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ ctxt->op_bytes = 8;
+ get(ctxt, &desc_ptr);
+ if (ctxt->op_bytes == 2) {
+ ctxt->op_bytes = 4;
+ desc_ptr.address &= 0x00ffffff;
}
+ /* Disable writeback. */
+ ctxt->dst.type = OP_NONE;
+ return segmented_write_std(ctxt, ctxt->dst.addr.mem,
+ &desc_ptr, 2 + ctxt->op_bytes);
+}
- if ((c->d & DstMask) == ImplicitOps)
- goto special_insn;
+static int em_sgdt(struct x86_emulate_ctxt *ctxt)
+{
+ return emulate_store_desc_ptr(ctxt, ctxt->ops->get_gdt);
+}
+static int em_sidt(struct x86_emulate_ctxt *ctxt)
+{
+ return emulate_store_desc_ptr(ctxt, ctxt->ops->get_idt);
+}
- if ((c->dst.type == OP_MEM) && !(c->d & Mov)) {
- /* optimisation - avoid slow emulated read if Mov */
- rc = ops->read_emulated((unsigned long)c->dst.ptr, &c->dst.val,
- c->dst.bytes, ctxt->vcpu);
- if (rc != X86EMUL_CONTINUE)
- goto done;
+static int em_lgdt_lidt(struct x86_emulate_ctxt *ctxt, bool lgdt)
+{
+ struct desc_ptr desc_ptr;
+ int rc;
+
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ ctxt->op_bytes = 8;
+ rc = read_descriptor(ctxt, ctxt->src.addr.mem,
+ &desc_ptr.size, &desc_ptr.address,
+ ctxt->op_bytes);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+ if (ctxt->mode == X86EMUL_MODE_PROT64 &&
+ emul_is_noncanonical_address(desc_ptr.address, ctxt,
+ X86EMUL_F_DT_LOAD))
+ return emulate_gp(ctxt, 0);
+ if (lgdt)
+ ctxt->ops->set_gdt(ctxt, &desc_ptr);
+ else
+ ctxt->ops->set_idt(ctxt, &desc_ptr);
+ /* Disable writeback. */
+ ctxt->dst.type = OP_NONE;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_lgdt(struct x86_emulate_ctxt *ctxt)
+{
+ return em_lgdt_lidt(ctxt, true);
+}
+
+static int em_lidt(struct x86_emulate_ctxt *ctxt)
+{
+ return em_lgdt_lidt(ctxt, false);
+}
+
+static int em_smsw(struct x86_emulate_ctxt *ctxt)
+{
+ if ((ctxt->ops->get_cr(ctxt, 4) & X86_CR4_UMIP) &&
+ ctxt->ops->cpl(ctxt) > 0)
+ return emulate_gp(ctxt, 0);
+
+ if (ctxt->dst.type == OP_MEM)
+ ctxt->dst.bytes = 2;
+ ctxt->dst.val = ctxt->ops->get_cr(ctxt, 0);
+ return X86EMUL_CONTINUE;
+}
+
+static int em_lmsw(struct x86_emulate_ctxt *ctxt)
+{
+ ctxt->ops->set_cr(ctxt, 0, (ctxt->ops->get_cr(ctxt, 0) & ~0x0eul)
+ | (ctxt->src.val & 0x0f));
+ ctxt->dst.type = OP_NONE;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_loop(struct x86_emulate_ctxt *ctxt)
+{
+ int rc = X86EMUL_CONTINUE;
+
+ register_address_increment(ctxt, VCPU_REGS_RCX, -1);
+ if ((address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) != 0) &&
+ (ctxt->b == 0xe2 || test_cc(ctxt->b ^ 0x5, ctxt->eflags)))
+ rc = jmp_rel(ctxt, ctxt->src.val);
+
+ return rc;
+}
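
The "ctxt->b ^ 0x5" trick above leans on the Jcc condition encoding (low nibble 4 = ZF set, 5 = ZF clear): 0xe0 (LOOPNE) ^ 0x5 ends in 5 and 0xe1 (LOOPE) ^ 0x5 ends in 4, so a generic condition tester produces exactly the extra ZF test the two variants need, while plain LOOP (0xe2) skips the test. Toy model covering only the two ZF conditions:

#include <stdbool.h>
#include <stdio.h>

#define ZF (1u << 6)

static bool test_zf_cc(unsigned char cc, unsigned int eflags)
{
	bool zf = eflags & ZF;

	/* Toy test_cc(): condition nibble 4 = "ZF set", 5 = "ZF clear". */
	return ((cc & 0x0f) == 4) ? zf : !zf;
}

int main(void)
{
	unsigned char loopne = 0xe0, loope = 0xe1;

	printf("%d\n", test_zf_cc(loopne ^ 0x5, 0));  /* 1: ZF clear -> keep looping */
	printf("%d\n", test_zf_cc(loope  ^ 0x5, ZF)); /* 1: ZF set   -> keep looping */
	return 0;
}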
+
+static int em_jcxz(struct x86_emulate_ctxt *ctxt)
+{
+ int rc = X86EMUL_CONTINUE;
+
+ if (address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) == 0)
+ rc = jmp_rel(ctxt, ctxt->src.val);
+
+ return rc;
+}
+
+static int em_in(struct x86_emulate_ctxt *ctxt)
+{
+ if (!pio_in_emulated(ctxt, ctxt->dst.bytes, ctxt->src.val,
+ &ctxt->dst.val))
+ return X86EMUL_IO_NEEDED;
+
+ return X86EMUL_CONTINUE;
+}
+
+static int em_out(struct x86_emulate_ctxt *ctxt)
+{
+ ctxt->ops->pio_out_emulated(ctxt, ctxt->src.bytes, ctxt->dst.val,
+ &ctxt->src.val, 1);
+ /* Disable writeback. */
+ ctxt->dst.type = OP_NONE;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_cli(struct x86_emulate_ctxt *ctxt)
+{
+ if (emulator_bad_iopl(ctxt))
+ return emulate_gp(ctxt, 0);
+
+ ctxt->eflags &= ~X86_EFLAGS_IF;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_sti(struct x86_emulate_ctxt *ctxt)
+{
+ if (emulator_bad_iopl(ctxt))
+ return emulate_gp(ctxt, 0);
+
+ ctxt->interruptibility = KVM_X86_SHADOW_INT_STI;
+ ctxt->eflags |= X86_EFLAGS_IF;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_cpuid(struct x86_emulate_ctxt *ctxt)
+{
+ u32 eax, ebx, ecx, edx;
+ u64 msr = 0;
+
+ ctxt->ops->get_msr(ctxt, MSR_MISC_FEATURES_ENABLES, &msr);
+ if (msr & MSR_MISC_FEATURES_ENABLES_CPUID_FAULT &&
+ ctxt->ops->cpl(ctxt)) {
+ return emulate_gp(ctxt, 0);
}
- c->dst.orig_val = c->dst.val;
-special_insn:
+ eax = reg_read(ctxt, VCPU_REGS_RAX);
+ ecx = reg_read(ctxt, VCPU_REGS_RCX);
+ ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx, false);
+ *reg_write(ctxt, VCPU_REGS_RAX) = eax;
+ *reg_write(ctxt, VCPU_REGS_RBX) = ebx;
+ *reg_write(ctxt, VCPU_REGS_RCX) = ecx;
+ *reg_write(ctxt, VCPU_REGS_RDX) = edx;
+ return X86EMUL_CONTINUE;
+}
- if (c->twobyte)
- goto twobyte_insn;
+static int em_sahf(struct x86_emulate_ctxt *ctxt)
+{
+ u32 flags;
+
+ flags = X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF |
+ X86_EFLAGS_SF;
+ flags &= *reg_rmw(ctxt, VCPU_REGS_RAX) >> 8;
+
+ ctxt->eflags &= ~0xffUL;
+ ctxt->eflags |= flags | X86_EFLAGS_FIXED;
+ return X86EMUL_CONTINUE;
+}
+
+static int em_lahf(struct x86_emulate_ctxt *ctxt)
+{
+ *reg_rmw(ctxt, VCPU_REGS_RAX) &= ~0xff00UL;
+ *reg_rmw(ctxt, VCPU_REGS_RAX) |= (ctxt->eflags & 0xff) << 8;
+ return X86EMUL_CONTINUE;
+}
- switch (c->b) {
- case 0x00 ... 0x05:
- add: /* add */
- emulate_2op_SrcV("add", c->src, c->dst, ctxt->eflags);
+static int em_bswap(struct x86_emulate_ctxt *ctxt)
+{
+ switch (ctxt->op_bytes) {
+#ifdef CONFIG_X86_64
+ case 8:
+ asm("bswap %0" : "+r"(ctxt->dst.val));
break;
- case 0x06: /* push es */
- emulate_push_sreg(ctxt, VCPU_SREG_ES);
+#endif
+ default:
+ asm("bswap %0" : "+r"(*(u32 *)&ctxt->dst.val));
break;
- case 0x07: /* pop es */
- rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_ES);
+ }
+ return X86EMUL_CONTINUE;
+}
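
For reference, the portable GCC/Clang equivalent of the inline-asm BSWAP above: 64-bit operands swap all eight bytes, everything else goes through a 32-bit swap of the low dword.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	printf("%#x\n", __builtin_bswap32(0x12345678u));   /* 0x78563412 */
	printf("%#llx\n", (unsigned long long)
	       __builtin_bswap64(0x1122334455667788ull)); /* 0x8877665544332211 */
	return 0;
}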
+
+static int em_clflush(struct x86_emulate_ctxt *ctxt)
+{
+ /* emulating clflush regardless of cpuid */
+ return X86EMUL_CONTINUE;
+}
+
+static int em_clflushopt(struct x86_emulate_ctxt *ctxt)
+{
+ /* emulating clflushopt regardless of cpuid */
+ return X86EMUL_CONTINUE;
+}
+
+static int em_movsxd(struct x86_emulate_ctxt *ctxt)
+{
+ ctxt->dst.val = (s32) ctxt->src.val;
+ return X86EMUL_CONTINUE;
+}
+
+static int check_fxsr(struct x86_emulate_ctxt *ctxt)
+{
+ if (!ctxt->ops->guest_has_fxsr(ctxt))
+ return emulate_ud(ctxt);
+
+ if (ctxt->ops->get_cr(ctxt, 0) & (X86_CR0_TS | X86_CR0_EM))
+ return emulate_nm(ctxt);
+
+ /*
+ * Don't emulate a case that should never be hit, instead of working
+ * around a lack of fxsave64/fxrstor64 on old compilers.
+ */
+ if (ctxt->mode >= X86EMUL_MODE_PROT64)
+ return X86EMUL_UNHANDLEABLE;
+
+ return X86EMUL_CONTINUE;
+}
+
+/*
+ * Hardware doesn't save and restore XMM 0-7 without CR4.OSFXSR, but does save
+ * and restore MXCSR.
+ */
+static size_t __fxstate_size(int nregs)
+{
+ return offsetof(struct fxregs_state, xmm_space[0]) + nregs * 16;
+}
+
+static inline size_t fxstate_size(struct x86_emulate_ctxt *ctxt)
+{
+ bool cr4_osfxsr;
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ return __fxstate_size(16);
+
+ cr4_osfxsr = ctxt->ops->get_cr(ctxt, 4) & X86_CR4_OSFXSR;
+ return __fxstate_size(cr4_osfxsr ? 8 : 0);
+}
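
Rough numbers for the sizes above, assuming the standard 512-byte FXSAVE image with the XMM area at byte offset 160 and 16 bytes per register: 64-bit guests save XMM0-15, 32-bit guests with CR4.OSFXSR save XMM0-7, and without OSFXSR only the legacy area is touched.

#include <stdio.h>

/* Assumption: matches offsetof(struct fxregs_state, xmm_space). */
#define XMM_AREA_OFFSET 160

static unsigned int fxstate_size_demo(int nregs)
{
	return XMM_AREA_OFFSET + nregs * 16;
}

int main(void)
{
	printf("%u\n", fxstate_size_demo(16)); /* 416: 64-bit mode */
	printf("%u\n", fxstate_size_demo(8));  /* 288: 32-bit, OSFXSR set */
	printf("%u\n", fxstate_size_demo(0));  /* 160: legacy area only */
	return 0;
}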
+
+/*
+ * FXSAVE and FXRSTOR have 4 different formats depending on execution mode,
+ * 1) 16 bit mode
+ * 2) 32 bit mode
+ * - like (1), but FIP and FDP (the FPU instruction and data pointers)
+ *   are only 16 bit. At least Intel CPUs preserve whole 32-bit values,
+ *   though, so (1) and (2) are the same wrt. save and restore
+ * 3) 64-bit mode with REX.W prefix
+ * - like (2), but XMM 8-15 are being saved and restored
+ * 4) 64-bit mode without REX.W prefix
+ * - like (3), but FIP and FDP are 64 bit
+ *
+ * Emulation uses (3) for (1) and (2) and preserves XMM 8-15 to reach the
+ * desired result. (4) is not emulated.
+ *
+ * Note: Guest and host CPUID.(EAX=07H,ECX=0H):EBX[bit 13] (deprecate FPU CS
+ * and FPU DS) should match.
+ */
+static int em_fxsave(struct x86_emulate_ctxt *ctxt)
+{
+ struct fxregs_state fx_state;
+ int rc;
+
+ rc = check_fxsr(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ kvm_fpu_get();
+
+ rc = asm_safe("fxsave %[fx]", , [fx] "+m"(fx_state));
+
+ kvm_fpu_put();
+
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ return segmented_write_std(ctxt, ctxt->memop.addr.mem, &fx_state,
+ fxstate_size(ctxt));
+}
+
+/*
+ * FXRSTOR might restore XMM registers not provided by the guest. Fill
+ * in the host registers (via FXSAVE) instead, so they won't be modified.
+ * (preemption has to stay disabled until FXRSTOR).
+ *
+ * Use noinline to keep the stack for other functions called by callers small.
+ */
+static noinline int fxregs_fixup(struct fxregs_state *fx_state,
+ const size_t used_size)
+{
+ struct fxregs_state fx_tmp;
+ int rc;
+
+ rc = asm_safe("fxsave %[fx]", , [fx] "+m"(fx_tmp));
+ memcpy((void *)fx_state + used_size, (void *)&fx_tmp + used_size,
+ __fxstate_size(16) - used_size);
+
+ return rc;
+}
+
+static int em_fxrstor(struct x86_emulate_ctxt *ctxt)
+{
+ struct fxregs_state fx_state;
+ int rc;
+ size_t size;
+
+ rc = check_fxsr(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ size = fxstate_size(ctxt);
+ rc = segmented_read_std(ctxt, ctxt->memop.addr.mem, &fx_state, size);
+ if (rc != X86EMUL_CONTINUE)
+ return rc;
+
+ kvm_fpu_get();
+
+ if (size < __fxstate_size(16)) {
+ rc = fxregs_fixup(&fx_state, size);
if (rc != X86EMUL_CONTINUE)
- goto done;
- break;
- case 0x08 ... 0x0d:
- or: /* or */
- emulate_2op_SrcV("or", c->src, c->dst, ctxt->eflags);
- break;
- case 0x0e: /* push cs */
- emulate_push_sreg(ctxt, VCPU_SREG_CS);
- break;
- case 0x10 ... 0x15:
- adc: /* adc */
- emulate_2op_SrcV("adc", c->src, c->dst, ctxt->eflags);
- break;
- case 0x16: /* push ss */
- emulate_push_sreg(ctxt, VCPU_SREG_SS);
+ goto out;
+ }
+
+ if (fx_state.mxcsr >> 16) {
+ rc = emulate_gp(ctxt, 0);
+ goto out;
+ }
+
+ if (rc == X86EMUL_CONTINUE)
+ rc = asm_safe("fxrstor %[fx]", : [fx] "m"(fx_state));
+
+out:
+ kvm_fpu_put();
+
+ return rc;
+}
+
+static int em_xsetbv(struct x86_emulate_ctxt *ctxt)
+{
+ u32 eax, ecx, edx;
+
+ if (!(ctxt->ops->get_cr(ctxt, 4) & X86_CR4_OSXSAVE))
+ return emulate_ud(ctxt);
+
+ eax = reg_read(ctxt, VCPU_REGS_RAX);
+ edx = reg_read(ctxt, VCPU_REGS_RDX);
+ ecx = reg_read(ctxt, VCPU_REGS_RCX);
+
+ if (ctxt->ops->set_xcr(ctxt, ecx, ((u64)edx << 32) | eax))
+ return emulate_gp(ctxt, 0);
+
+ return X86EMUL_CONTINUE;
+}
+
+static bool valid_cr(int nr)
+{
+ switch (nr) {
+ case 0:
+ case 2 ... 4:
+ case 8:
+ return true;
+ default:
+ return false;
+ }
+}
+
+static int check_cr_access(struct x86_emulate_ctxt *ctxt)
+{
+ if (!valid_cr(ctxt->modrm_reg))
+ return emulate_ud(ctxt);
+
+ return X86EMUL_CONTINUE;
+}
+
+static int check_dr_read(struct x86_emulate_ctxt *ctxt)
+{
+ int dr = ctxt->modrm_reg;
+ u64 cr4;
+
+ if (dr > 7)
+ return emulate_ud(ctxt);
+
+ cr4 = ctxt->ops->get_cr(ctxt, 4);
+ if ((cr4 & X86_CR4_DE) && (dr == 4 || dr == 5))
+ return emulate_ud(ctxt);
+
+ if (ctxt->ops->get_dr(ctxt, 7) & DR7_GD) {
+ ulong dr6;
+
+ dr6 = ctxt->ops->get_dr(ctxt, 6);
+ dr6 &= ~DR_TRAP_BITS;
+ dr6 |= DR6_BD | DR6_ACTIVE_LOW;
+ ctxt->ops->set_dr(ctxt, 6, dr6);
+ return emulate_db(ctxt);
+ }
+
+ return X86EMUL_CONTINUE;
+}
+
+static int check_dr_write(struct x86_emulate_ctxt *ctxt)
+{
+ u64 new_val = ctxt->src.val64;
+ int dr = ctxt->modrm_reg;
+
+ if ((dr == 6 || dr == 7) && (new_val & 0xffffffff00000000ULL))
+ return emulate_gp(ctxt, 0);
+
+ return check_dr_read(ctxt);
+}
+
+static int check_svme(struct x86_emulate_ctxt *ctxt)
+{
+ u64 efer = 0;
+
+ ctxt->ops->get_msr(ctxt, MSR_EFER, &efer);
+
+ if (!(efer & EFER_SVME))
+ return emulate_ud(ctxt);
+
+ return X86EMUL_CONTINUE;
+}
+
+static int check_svme_pa(struct x86_emulate_ctxt *ctxt)
+{
+ u64 rax = reg_read(ctxt, VCPU_REGS_RAX);
+
+ /* Valid physical address? */
+ if (rax & 0xffff000000000000ULL)
+ return emulate_gp(ctxt, 0);
+
+ return check_svme(ctxt);
+}
+
+static int check_rdtsc(struct x86_emulate_ctxt *ctxt)
+{
+ u64 cr4 = ctxt->ops->get_cr(ctxt, 4);
+
+ if (cr4 & X86_CR4_TSD && ctxt->ops->cpl(ctxt))
+ return emulate_gp(ctxt, 0);
+
+ return X86EMUL_CONTINUE;
+}
+
+static int check_rdpmc(struct x86_emulate_ctxt *ctxt)
+{
+ u64 cr4 = ctxt->ops->get_cr(ctxt, 4);
+ u64 rcx = reg_read(ctxt, VCPU_REGS_RCX);
+
+ /*
+ * VMware allows access to these Pseudo-PMCs even when read via RDPMC
+ * in Ring3 when CR4.PCE=0.
+ */
+ if (enable_vmware_backdoor && is_vmware_backdoor_pmc(rcx))
+ return X86EMUL_CONTINUE;
+
+ /*
+ * If CR4.PCE is set, the SDM requires CPL=0 or CR0.PE=0. The CR0.PE
+ * check however is unnecessary because CPL is always 0 outside
+ * protected mode.
+ */
+ if ((!(cr4 & X86_CR4_PCE) && ctxt->ops->cpl(ctxt)) ||
+ ctxt->ops->check_rdpmc_early(ctxt, rcx))
+ return emulate_gp(ctxt, 0);
+
+ return X86EMUL_CONTINUE;
+}
+
+static int check_perm_in(struct x86_emulate_ctxt *ctxt)
+{
+ ctxt->dst.bytes = min(ctxt->dst.bytes, 4u);
+ if (!emulator_io_permitted(ctxt, ctxt->src.val, ctxt->dst.bytes))
+ return emulate_gp(ctxt, 0);
+
+ return X86EMUL_CONTINUE;
+}
+
+static int check_perm_out(struct x86_emulate_ctxt *ctxt)
+{
+ ctxt->src.bytes = min(ctxt->src.bytes, 4u);
+ if (!emulator_io_permitted(ctxt, ctxt->dst.val, ctxt->src.bytes))
+ return emulate_gp(ctxt, 0);
+
+ return X86EMUL_CONTINUE;
+}
+
+#define D(_y) { .flags = (_y) }
+#define DI(_y, _i) { .flags = (_y)|Intercept, .intercept = x86_intercept_##_i }
+#define DIP(_y, _i, _p) { .flags = (_y)|Intercept|CheckPerm, \
+ .intercept = x86_intercept_##_i, .check_perm = (_p) }
+#define N D(NotImpl)
+#define EXT(_f, _e) { .flags = ((_f) | RMExt), .u.group = (_e) }
+#define G(_f, _g) { .flags = ((_f) | Group | ModRM), .u.group = (_g) }
+#define GD(_f, _g) { .flags = ((_f) | GroupDual | ModRM), .u.gdual = (_g) }
+#define ID(_f, _i) { .flags = ((_f) | InstrDual | ModRM), .u.idual = (_i) }
+#define MD(_f, _m) { .flags = ((_f) | ModeDual), .u.mdual = (_m) }
+#define E(_f, _e) { .flags = ((_f) | Escape | ModRM), .u.esc = (_e) }
+#define I(_f, _e) { .flags = (_f), .u.execute = (_e) }
+#define II(_f, _e, _i) \
+ { .flags = (_f)|Intercept, .u.execute = (_e), .intercept = x86_intercept_##_i }
+#define IIP(_f, _e, _i, _p) \
+ { .flags = (_f)|Intercept|CheckPerm, .u.execute = (_e), \
+ .intercept = x86_intercept_##_i, .check_perm = (_p) }
+#define GP(_f, _g) { .flags = ((_f) | Prefix), .u.gprefix = (_g) }
+
+#define D2bv(_f) D((_f) | ByteOp), D(_f)
+#define D2bvIP(_f, _i, _p) DIP((_f) | ByteOp, _i, _p), DIP(_f, _i, _p)
+#define I2bv(_f, _e) I((_f) | ByteOp, _e), I(_f, _e)
+#define F2bv(_f, _e) F((_f) | ByteOp, _e), F(_f, _e)
+#define I2bvIP(_f, _e, _i, _p) \
+ IIP((_f) | ByteOp, _e, _i, _p), IIP(_f, _e, _i, _p)
+
+#define I6ALU(_f, _e) I2bv((_f) | DstMem | SrcReg | ModRM, _e), \
+ I2bv(((_f) | DstReg | SrcMem | ModRM) & ~Lock, _e), \
+ I2bv(((_f) & ~Lock) | DstAcc | SrcImm, _e)
+
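A toy expansion of the table DSL above (deliberately simplified; the real struct opcode carries a union plus intercept/check_perm fields): I(Lock, em_add) becomes an entry whose flags word is Lock and whose execute hook is em_add, and the 2bv/6ALU wrappers stamp out the byte/word variants.

#include <stdio.h>

enum { DEMO_LOCK = 1 << 0, DEMO_BYTEOP = 1 << 1 };

struct demo_opcode {
	unsigned int flags;
	int (*execute)(void);
};

static int demo_em_add(void) { return 0; }

#define DEMO_I(_f, _e)    { .flags = (_f), .execute = (_e) }
#define DEMO_I2bv(_f, _e) DEMO_I((_f) | DEMO_BYTEOP, _e), DEMO_I(_f, _e)

static const struct demo_opcode demo_table[] = {
	DEMO_I2bv(DEMO_LOCK, demo_em_add),	/* e.g. 0x00/0x01: add r/m, reg */
};

int main(void)
{
	printf("%zu entries, first flags %#x\n",
	       sizeof(demo_table) / sizeof(demo_table[0]), demo_table[0].flags);
	return 0;
}
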
+static const struct opcode ud = I(SrcNone, emulate_ud);
+
+static const struct opcode group7_rm0[] = {
+ N,
+ I(SrcNone | Priv | EmulateOnUD, em_hypercall),
+ N, N, N, N, N, N,
+};
+
+static const struct opcode group7_rm1[] = {
+ DI(SrcNone | Priv, monitor),
+ DI(SrcNone | Priv, mwait),
+ N, N, N, N, N, N,
+};
+
+static const struct opcode group7_rm2[] = {
+ N,
+ II(ImplicitOps | Priv, em_xsetbv, xsetbv),
+ N, N, N, N, N, N,
+};
+
+static const struct opcode group7_rm3[] = {
+ DIP(SrcNone | Prot | Priv, vmrun, check_svme_pa),
+ II(SrcNone | Prot | EmulateOnUD, em_hypercall, vmmcall),
+ DIP(SrcNone | Prot | Priv, vmload, check_svme_pa),
+ DIP(SrcNone | Prot | Priv, vmsave, check_svme_pa),
+ DIP(SrcNone | Prot | Priv, stgi, check_svme),
+ DIP(SrcNone | Prot | Priv, clgi, check_svme),
+ DIP(SrcNone | Prot | Priv, skinit, check_svme),
+ DIP(SrcNone | Prot | Priv, invlpga, check_svme),
+};
+
+static const struct opcode group7_rm7[] = {
+ N,
+ DIP(SrcNone, rdtscp, check_rdtsc),
+ N, N, N, N, N, N,
+};
+
+static const struct opcode group1[] = {
+ I(Lock, em_add),
+ I(Lock | PageTable, em_or),
+ I(Lock, em_adc),
+ I(Lock, em_sbb),
+ I(Lock | PageTable, em_and),
+ I(Lock, em_sub),
+ I(Lock, em_xor),
+ I(NoWrite, em_cmp),
+};
+
+static const struct opcode group1A[] = {
+ I(DstMem | SrcNone | Mov | Stack | IncSP | TwoMemOp, em_pop), N, N, N, N, N, N, N,
+};
+
+static const struct opcode group2[] = {
+ I(DstMem | ModRM, em_rol),
+ I(DstMem | ModRM, em_ror),
+ I(DstMem | ModRM, em_rcl),
+ I(DstMem | ModRM, em_rcr),
+ I(DstMem | ModRM, em_shl),
+ I(DstMem | ModRM, em_shr),
+ I(DstMem | ModRM, em_shl),
+ I(DstMem | ModRM, em_sar),
+};
+
+static const struct opcode group3[] = {
+ I(DstMem | SrcImm | NoWrite, em_test),
+ I(DstMem | SrcImm | NoWrite, em_test),
+ I(DstMem | SrcNone | Lock, em_not),
+ I(DstMem | SrcNone | Lock, em_neg),
+ I(DstXacc | Src2Mem, em_mul_ex),
+ I(DstXacc | Src2Mem, em_imul_ex),
+ I(DstXacc | Src2Mem, em_div_ex),
+ I(DstXacc | Src2Mem, em_idiv_ex),
+};
+
+static const struct opcode group4[] = {
+ I(ByteOp | DstMem | SrcNone | Lock, em_inc),
+ I(ByteOp | DstMem | SrcNone | Lock, em_dec),
+ N, N, N, N, N, N,
+};
+
+static const struct opcode group5[] = {
+ I(DstMem | SrcNone | Lock, em_inc),
+ I(DstMem | SrcNone | Lock, em_dec),
+ I(SrcMem | NearBranch | IsBranch | ShadowStack, em_call_near_abs),
+ I(SrcMemFAddr | ImplicitOps | IsBranch | ShadowStack, em_call_far),
+ I(SrcMem | NearBranch | IsBranch, em_jmp_abs),
+ I(SrcMemFAddr | ImplicitOps | IsBranch, em_jmp_far),
+ I(SrcMem | Stack | TwoMemOp, em_push), D(Undefined),
+};
+
+static const struct opcode group6[] = {
+ II(Prot | DstMem, em_sldt, sldt),
+ II(Prot | DstMem, em_str, str),
+ II(Prot | Priv | SrcMem16, em_lldt, lldt),
+ II(Prot | Priv | SrcMem16, em_ltr, ltr),
+ N, N, N, N,
+};
+
+static const struct group_dual group7 = { {
+ II(Mov | DstMem, em_sgdt, sgdt),
+ II(Mov | DstMem, em_sidt, sidt),
+ II(SrcMem | Priv, em_lgdt, lgdt),
+ II(SrcMem | Priv, em_lidt, lidt),
+ II(SrcNone | DstMem | Mov, em_smsw, smsw), N,
+ II(SrcMem16 | Mov | Priv, em_lmsw, lmsw),
+ II(SrcMem | ByteOp | Priv | NoAccess, em_invlpg, invlpg),
+}, {
+ EXT(0, group7_rm0),
+ EXT(0, group7_rm1),
+ EXT(0, group7_rm2),
+ EXT(0, group7_rm3),
+ II(SrcNone | DstMem | Mov, em_smsw, smsw), N,
+ II(SrcMem16 | Mov | Priv, em_lmsw, lmsw),
+ EXT(0, group7_rm7),
+} };
+
+static const struct opcode group8[] = {
+ N, N, N, N,
+ I(DstMem | SrcImmByte | NoWrite, em_bt),
+ I(DstMem | SrcImmByte | Lock | PageTable, em_bts),
+ I(DstMem | SrcImmByte | Lock, em_btr),
+ I(DstMem | SrcImmByte | Lock | PageTable, em_btc),
+};
+
+/*
+ * The "memory" destination is actually always a register, since we come
+ * from the register case of group9.
+ */
+static const struct gprefix pfx_0f_c7_7 = {
+ N, N, N, II(DstMem | ModRM | Op3264 | EmulateOnUD, em_rdpid, rdpid),
+};
+
+
+static const struct group_dual group9 = { {
+ N, I(DstMem64 | Lock | PageTable, em_cmpxchg8b), N, N, N, N, N, N,
+}, {
+ N, N, N, N, N, N, N,
+ GP(0, &pfx_0f_c7_7),
+} };
+
+static const struct opcode group11[] = {
+ I(DstMem | SrcImm | Mov | PageTable, em_mov),
+ X7(D(Undefined)),
+};
+
+static const struct gprefix pfx_0f_ae_7 = {
+ I(SrcMem | ByteOp, em_clflush), I(SrcMem | ByteOp, em_clflushopt), N, N,
+};
+
+static const struct group_dual group15 = { {
+ I(ModRM | Aligned16, em_fxsave),
+ I(ModRM | Aligned16, em_fxrstor),
+ N, N, N, N, N, GP(0, &pfx_0f_ae_7),
+}, {
+ N, N, N, N, N, N, N, N,
+} };
+
+static const struct gprefix pfx_0f_6f_0f_7f = {
+ I(Mmx, em_mov), I(Sse | Avx | Aligned, em_mov), N, I(Sse | Avx | Unaligned, em_mov),
+};
+
+static const struct instr_dual instr_dual_0f_2b = {
+ I(0, em_mov), N
+};
+
+static const struct gprefix pfx_0f_2b = {
+ ID(0, &instr_dual_0f_2b), ID(0, &instr_dual_0f_2b), N, N,
+};
+
+static const struct gprefix pfx_0f_10_0f_11 = {
+ I(Unaligned, em_mov), I(Unaligned, em_mov), N, N,
+};
+
+static const struct gprefix pfx_0f_28_0f_29 = {
+ I(Aligned, em_mov), I(Aligned, em_mov), N, N,
+};
+
+static const struct gprefix pfx_0f_e7_0f_38_2a = {
+ N, I(Sse | Avx, em_mov), N, N,
+};
+
+static const struct escape escape_d9 = { {
+ N, N, N, N, N, N, N, I(DstMem16 | Mov, em_fnstcw),
+}, {
+ /* 0xC0 - 0xC7 */
+ N, N, N, N, N, N, N, N,
+ /* 0xC8 - 0xCF */
+ N, N, N, N, N, N, N, N,
+ /* 0xD0 - 0xD7 */
+ N, N, N, N, N, N, N, N,
+ /* 0xD8 - 0xDF */
+ N, N, N, N, N, N, N, N,
+ /* 0xE0 - 0xE7 */
+ N, N, N, N, N, N, N, N,
+ /* 0xE8 - 0xEF */
+ N, N, N, N, N, N, N, N,
+ /* 0xF0 - 0xF7 */
+ N, N, N, N, N, N, N, N,
+ /* 0xF8 - 0xFF */
+ N, N, N, N, N, N, N, N,
+} };
+
+static const struct escape escape_db = { {
+ N, N, N, N, N, N, N, N,
+}, {
+ /* 0xC0 - 0xC7 */
+ N, N, N, N, N, N, N, N,
+ /* 0xC8 - 0xCF */
+ N, N, N, N, N, N, N, N,
+ /* 0xD0 - 0xC7 */
+ N, N, N, N, N, N, N, N,
+ /* 0xD8 - 0xDF */
+ N, N, N, N, N, N, N, N,
+ /* 0xE0 - 0xE7 */
+ N, N, N, I(ImplicitOps, em_fninit), N, N, N, N,
+ /* 0xE8 - 0xEF */
+ N, N, N, N, N, N, N, N,
+ /* 0xF0 - 0xF7 */
+ N, N, N, N, N, N, N, N,
+ /* 0xF8 - 0xFF */
+ N, N, N, N, N, N, N, N,
+} };
+
+static const struct escape escape_dd = { {
+ N, N, N, N, N, N, N, I(DstMem16 | Mov, em_fnstsw),
+}, {
+ /* 0xC0 - 0xC7 */
+ N, N, N, N, N, N, N, N,
+ /* 0xC8 - 0xCF */
+ N, N, N, N, N, N, N, N,
+ /* 0xD0 - 0xD7 */
+ N, N, N, N, N, N, N, N,
+ /* 0xD8 - 0xDF */
+ N, N, N, N, N, N, N, N,
+ /* 0xE0 - 0xE7 */
+ N, N, N, N, N, N, N, N,
+ /* 0xE8 - 0xEF */
+ N, N, N, N, N, N, N, N,
+ /* 0xF0 - 0xF7 */
+ N, N, N, N, N, N, N, N,
+ /* 0xF8 - 0xFF */
+ N, N, N, N, N, N, N, N,
+} };
+
+static const struct instr_dual instr_dual_0f_c3 = {
+ I(DstMem | SrcReg | ModRM | No16 | Mov, em_mov), N
+};
+
+static const struct mode_dual mode_dual_63 = {
+ N, I(DstReg | SrcMem32 | ModRM | Mov, em_movsxd)
+};
+
+static const struct instr_dual instr_dual_8d = {
+ D(DstReg | SrcMem | ModRM | NoAccess), N
+};
+
+static const struct opcode opcode_table[256] = {
+ /* 0x00 - 0x07 */
+ I6ALU(Lock, em_add),
+ I(ImplicitOps | Stack | No64 | Src2ES, em_push_sreg),
+ I(ImplicitOps | Stack | No64 | Src2ES, em_pop_sreg),
+ /* 0x08 - 0x0F */
+ I6ALU(Lock | PageTable, em_or),
+ I(ImplicitOps | Stack | No64 | Src2CS, em_push_sreg),
+ N,
+ /* 0x10 - 0x17 */
+ I6ALU(Lock, em_adc),
+ I(ImplicitOps | Stack | No64 | Src2SS, em_push_sreg),
+ I(ImplicitOps | Stack | No64 | Src2SS, em_pop_sreg),
+ /* 0x18 - 0x1F */
+ I6ALU(Lock, em_sbb),
+ I(ImplicitOps | Stack | No64 | Src2DS, em_push_sreg),
+ I(ImplicitOps | Stack | No64 | Src2DS, em_pop_sreg),
+ /* 0x20 - 0x27 */
+ I6ALU(Lock | PageTable, em_and), N, N,
+ /* 0x28 - 0x2F */
+ I6ALU(Lock, em_sub), N, I(ByteOp | DstAcc | No64, em_das),
+ /* 0x30 - 0x37 */
+ I6ALU(Lock, em_xor), N, N,
+ /* 0x38 - 0x3F */
+ I6ALU(NoWrite, em_cmp), N, N,
+ /* 0x40 - 0x4F */
+ X8(I(DstReg, em_inc)), X8(I(DstReg, em_dec)),
+ /* 0x50 - 0x57 */
+ X8(I(SrcReg | Stack, em_push)),
+ /* 0x58 - 0x5F */
+ X8(I(DstReg | Stack, em_pop)),
+ /* 0x60 - 0x67 */
+ I(ImplicitOps | Stack | No64, em_pusha),
+ I(ImplicitOps | Stack | No64, em_popa),
+ N, MD(ModRM, &mode_dual_63),
+ N, N, N, N,
+ /* 0x68 - 0x6F */
+ I(SrcImm | Mov | Stack, em_push),
+ I(DstReg | SrcMem | ModRM | Src2Imm, em_imul_3op),
+ I(SrcImmByte | Mov | Stack, em_push),
+ I(DstReg | SrcMem | ModRM | Src2ImmByte, em_imul_3op),
+ I2bvIP(DstDI | SrcDX | Mov | String | Unaligned, em_in, ins, check_perm_in), /* insb, insw/insd */
+ I2bvIP(SrcSI | DstDX | String, em_out, outs, check_perm_out), /* outsb, outsw/outsd */
+ /* 0x70 - 0x7F */
+ X16(D(SrcImmByte | NearBranch | IsBranch)),
+ /* 0x80 - 0x87 */
+ G(ByteOp | DstMem | SrcImm, group1),
+ G(DstMem | SrcImm, group1),
+ G(ByteOp | DstMem | SrcImm | No64, group1),
+ G(DstMem | SrcImmByte, group1),
+ I2bv(DstMem | SrcReg | ModRM | NoWrite, em_test),
+ I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_xchg),
+ /* 0x88 - 0x8F */
+ I2bv(DstMem | SrcReg | ModRM | Mov | PageTable, em_mov),
+ I2bv(DstReg | SrcMem | ModRM | Mov, em_mov),
+ I(DstMem | SrcNone | ModRM | Mov | PageTable, em_mov_rm_sreg),
+ ID(0, &instr_dual_8d),
+ I(ImplicitOps | SrcMem16 | ModRM, em_mov_sreg_rm),
+ G(0, group1A),
+ /* 0x90 - 0x97 */
+ DI(SrcAcc | DstReg, pause), X7(D(SrcAcc | DstReg)),
+ /* 0x98 - 0x9F */
+ D(DstAcc | SrcNone), I(ImplicitOps | SrcAcc, em_cwd),
+ I(SrcImmFAddr | No64 | IsBranch | ShadowStack, em_call_far), N,
+ II(ImplicitOps | Stack, em_pushf, pushf),
+ II(ImplicitOps | Stack, em_popf, popf),
+ I(ImplicitOps, em_sahf), I(ImplicitOps, em_lahf),
+ /* 0xA0 - 0xA7 */
+ I2bv(DstAcc | SrcMem | Mov | MemAbs, em_mov),
+ I2bv(DstMem | SrcAcc | Mov | MemAbs | PageTable, em_mov),
+ I2bv(SrcSI | DstDI | Mov | String | TwoMemOp, em_mov),
+ I2bv(SrcSI | DstDI | String | NoWrite | TwoMemOp, em_cmp_r),
+ /* 0xA8 - 0xAF */
+ I2bv(DstAcc | SrcImm | NoWrite, em_test),
+ I2bv(SrcAcc | DstDI | Mov | String, em_mov),
+ I2bv(SrcSI | DstAcc | Mov | String, em_mov),
+ I2bv(SrcAcc | DstDI | String | NoWrite, em_cmp_r),
+ /* 0xB0 - 0xB7 */
+ X8(I(ByteOp | DstReg | SrcImm | Mov, em_mov)),
+ /* 0xB8 - 0xBF */
+ X8(I(DstReg | SrcImm64 | Mov, em_mov)),
+ /* 0xC0 - 0xC7 */
+ G(ByteOp | Src2ImmByte, group2), G(Src2ImmByte, group2),
+ I(ImplicitOps | NearBranch | SrcImmU16 | IsBranch | ShadowStack, em_ret_near_imm),
+ I(ImplicitOps | NearBranch | IsBranch | ShadowStack, em_ret),
+ I(DstReg | SrcMemFAddr | ModRM | No64 | Src2ES, em_lseg),
+ I(DstReg | SrcMemFAddr | ModRM | No64 | Src2DS, em_lseg),
+ G(ByteOp, group11), G(0, group11),
+ /* 0xC8 - 0xCF */
+ I(Stack | SrcImmU16 | Src2ImmByte, em_enter),
+ I(Stack, em_leave),
+ I(ImplicitOps | SrcImmU16 | IsBranch | ShadowStack, em_ret_far_imm),
+ I(ImplicitOps | IsBranch | ShadowStack, em_ret_far),
+ D(ImplicitOps | IsBranch), DI(SrcImmByte | IsBranch | ShadowStack, intn),
+ D(ImplicitOps | No64 | IsBranch),
+ II(ImplicitOps | IsBranch | ShadowStack, em_iret, iret),
+ /* 0xD0 - 0xD7 */
+ G(Src2One | ByteOp, group2), G(Src2One, group2),
+ G(Src2CL | ByteOp, group2), G(Src2CL, group2),
+ I(DstAcc | SrcImmUByte | No64, em_aam),
+ I(DstAcc | SrcImmUByte | No64, em_aad),
+ I(DstAcc | ByteOp | No64, em_salc),
+ I(DstAcc | SrcXLat | ByteOp, em_mov),
+ /* 0xD8 - 0xDF */
+ N, E(0, &escape_d9), N, E(0, &escape_db), N, E(0, &escape_dd), N, N,
+ /* 0xE0 - 0xE7 */
+ X3(I(SrcImmByte | NearBranch | IsBranch, em_loop)),
+ I(SrcImmByte | NearBranch | IsBranch, em_jcxz),
+ I2bvIP(SrcImmUByte | DstAcc, em_in, in, check_perm_in),
+ I2bvIP(SrcAcc | DstImmUByte, em_out, out, check_perm_out),
+ /* 0xE8 - 0xEF */
+ I(SrcImm | NearBranch | IsBranch | ShadowStack, em_call),
+ D(SrcImm | ImplicitOps | NearBranch | IsBranch),
+ I(SrcImmFAddr | No64 | IsBranch, em_jmp_far),
+ D(SrcImmByte | ImplicitOps | NearBranch | IsBranch),
+ I2bvIP(SrcDX | DstAcc, em_in, in, check_perm_in),
+ I2bvIP(SrcAcc | DstDX, em_out, out, check_perm_out),
+ /* 0xF0 - 0xF7 */
+ N, DI(ImplicitOps, icebp), N, N,
+ DI(ImplicitOps | Priv, hlt), D(ImplicitOps),
+ G(ByteOp, group3), G(0, group3),
+ /* 0xF8 - 0xFF */
+ D(ImplicitOps), D(ImplicitOps),
+ I(ImplicitOps, em_cli), I(ImplicitOps, em_sti),
+ D(ImplicitOps), D(ImplicitOps), G(0, group4), G(0, group5),
+};
+
+static const struct opcode twobyte_table[256] = {
+ /* 0x00 - 0x0F */
+ G(0, group6), GD(0, &group7), N, N,
+ N, I(ImplicitOps | EmulateOnUD | IsBranch | ShadowStack, em_syscall),
+ II(ImplicitOps | Priv, em_clts, clts), N,
+ DI(ImplicitOps | Priv, invd), DI(ImplicitOps | Priv, wbinvd), N, N,
+ N, D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N,
+ /* 0x10 - 0x1F */
+ GP(ModRM | DstReg | SrcMem | Mov | Sse | Avx, &pfx_0f_10_0f_11),
+ GP(ModRM | DstMem | SrcReg | Mov | Sse | Avx, &pfx_0f_10_0f_11),
+ N, N, N, N, N, N,
+ D(ImplicitOps | ModRM | SrcMem | NoAccess), /* 4 * prefetch + 4 * reserved NOP */
+ D(ImplicitOps | ModRM | SrcMem | NoAccess), N, N,
+ D(ImplicitOps | ModRM | SrcMem | NoAccess), /* 8 * reserved NOP */
+ D(ImplicitOps | ModRM | SrcMem | NoAccess), /* 8 * reserved NOP */
+ D(ImplicitOps | ModRM | SrcMem | NoAccess), /* 8 * reserved NOP */
+ D(ImplicitOps | ModRM | SrcMem | NoAccess), /* NOP + 7 * reserved NOP */
+ /* 0x20 - 0x2F */
+ DIP(ModRM | DstMem | Priv | Op3264 | NoMod, cr_read, check_cr_access),
+ DIP(ModRM | DstMem | Priv | Op3264 | NoMod, dr_read, check_dr_read),
+ IIP(ModRM | SrcMem | Priv | Op3264 | NoMod, em_cr_write, cr_write,
+ check_cr_access),
+ IIP(ModRM | SrcMem | Priv | Op3264 | NoMod, em_dr_write, dr_write,
+ check_dr_write),
+ N, N, N, N,
+ GP(ModRM | DstReg | SrcMem | Mov | Sse | Avx, &pfx_0f_28_0f_29),
+ GP(ModRM | DstMem | SrcReg | Mov | Sse | Avx, &pfx_0f_28_0f_29),
+ N, GP(ModRM | DstMem | SrcReg | Mov | Sse | Avx, &pfx_0f_2b),
+ N, N, N, N,
+ /* 0x30 - 0x3F */
+ II(ImplicitOps | Priv, em_wrmsr, wrmsr),
+ IIP(ImplicitOps, em_rdtsc, rdtsc, check_rdtsc),
+ II(ImplicitOps | Priv, em_rdmsr, rdmsr),
+ IIP(ImplicitOps, em_rdpmc, rdpmc, check_rdpmc),
+ I(ImplicitOps | EmulateOnUD | IsBranch | ShadowStack, em_sysenter),
+ I(ImplicitOps | Priv | EmulateOnUD | IsBranch | ShadowStack, em_sysexit),
+ N, N,
+ N, N, N, N, N, N, N, N,
+ /* 0x40 - 0x4F */
+ X16(D(DstReg | SrcMem | ModRM)),
+ /* 0x50 - 0x5F */
+ N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,
+ /* 0x60 - 0x6F */
+ N, N, N, N,
+ N, N, N, N,
+ N, N, N, N,
+ N, N, N, GP(SrcMem | DstReg | ModRM | Mov, &pfx_0f_6f_0f_7f),
+ /* 0x70 - 0x7F */
+ N, N, N, N,
+ N, N, N, N,
+ N, N, N, N,
+ N, N, N, GP(SrcReg | DstMem | ModRM | Mov, &pfx_0f_6f_0f_7f),
+ /* 0x80 - 0x8F */
+ X16(D(SrcImm | NearBranch | IsBranch)),
+ /* 0x90 - 0x9F */
+ X16(D(ByteOp | DstMem | SrcNone | ModRM| Mov)),
+ /* 0xA0 - 0xA7 */
+ I(Stack | Src2FS, em_push_sreg), I(Stack | Src2FS, em_pop_sreg),
+ II(ImplicitOps, em_cpuid, cpuid),
+ I(DstMem | SrcReg | ModRM | BitOp | NoWrite, em_bt),
+ I(DstMem | SrcReg | Src2ImmByte | ModRM, em_shld),
+ I(DstMem | SrcReg | Src2CL | ModRM, em_shld), N, N,
+ /* 0xA8 - 0xAF */
+ I(Stack | Src2GS, em_push_sreg), I(Stack | Src2GS, em_pop_sreg),
+ II(EmulateOnUD | ImplicitOps, em_rsm, rsm),
+ I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_bts),
+ I(DstMem | SrcReg | Src2ImmByte | ModRM, em_shrd),
+ I(DstMem | SrcReg | Src2CL | ModRM, em_shrd),
+ GD(0, &group15), I(DstReg | SrcMem | ModRM, em_imul),
+ /* 0xB0 - 0xB7 */
+ I2bv(DstMem | SrcReg | ModRM | Lock | PageTable | SrcWrite, em_cmpxchg),
+ I(DstReg | SrcMemFAddr | ModRM | Src2SS, em_lseg),
+ I(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr),
+ I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg),
+ I(DstReg | SrcMemFAddr | ModRM | Src2GS, em_lseg),
+ D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov),
+ /* 0xB8 - 0xBF */
+ N, N,
+ G(BitOp, group8),
+ I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc),
+ I(DstReg | SrcMem | ModRM, em_bsf_c),
+ I(DstReg | SrcMem | ModRM, em_bsr_c),
+ D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov),
+ /* 0xC0 - 0xC7 */
+ I2bv(DstMem | SrcReg | ModRM | SrcWrite | Lock, em_xadd),
+ N, ID(0, &instr_dual_0f_c3),
+ N, N, N, GD(0, &group9),
+ /* 0xC8 - 0xCF */
+ X8(I(DstReg, em_bswap)),
+ /* 0xD0 - 0xDF */
+ N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,
+ /* 0xE0 - 0xEF */
+ N, N, N, N, N, N, N, GP(SrcReg | DstMem | ModRM | Mov, &pfx_0f_e7_0f_38_2a),
+ N, N, N, N, N, N, N, N,
+ /* 0xF0 - 0xFF */
+ N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N
+};
+
+static const struct instr_dual instr_dual_0f_38_f0 = {
+ I(DstReg | SrcMem | Mov, em_movbe), N
+};
+
+static const struct instr_dual instr_dual_0f_38_f1 = {
+ I(DstMem | SrcReg | Mov, em_movbe), N
+};
+
+static const struct gprefix three_byte_0f_38_f0 = {
+ ID(0, &instr_dual_0f_38_f0), ID(0, &instr_dual_0f_38_f0), N, N
+};
+
+static const struct gprefix three_byte_0f_38_f1 = {
+ ID(0, &instr_dual_0f_38_f1), ID(0, &instr_dual_0f_38_f1), N, N
+};
+
+/*
+ * Insns below are selected by the prefix, which is indexed by the third
+ * opcode byte.
+ */
+static const struct opcode opcode_map_0f_38[256] = {
+ /* 0x00 - 0x1f */
+ X16(N), X16(N),
+ /* 0x20 - 0x2f */
+ X8(N),
+ X2(N), GP(SrcReg | DstMem | ModRM | Mov | Aligned, &pfx_0f_e7_0f_38_2a), N, N, N, N, N,
+ /* 0x30 - 0x7f */
+ X16(N), X16(N), X16(N), X16(N), X16(N),
+ /* 0x80 - 0xef */
+ X16(N), X16(N), X16(N), X16(N), X16(N), X16(N), X16(N),
+ /* 0xf0 - 0xf1 */
+ GP(EmulateOnUD | ModRM, &three_byte_0f_38_f0),
+ GP(EmulateOnUD | ModRM, &three_byte_0f_38_f1),
+ /* 0xf2 - 0xff */
+ N, N, X4(N), X8(N)
+};
+
+#undef D
+#undef N
+#undef G
+#undef GD
+#undef I
+#undef GP
+#undef EXT
+#undef MD
+#undef ID
+
+#undef D2bv
+#undef D2bvIP
+#undef I2bv
+#undef I2bvIP
+#undef I6ALU
+
+static bool is_shstk_instruction(struct x86_emulate_ctxt *ctxt)
+{
+ return ctxt->d & ShadowStack;
+}
+
+static bool is_ibt_instruction(struct x86_emulate_ctxt *ctxt)
+{
+ u64 flags = ctxt->d;
+
+ if (!(flags & IsBranch))
+ return false;
+
+ /*
+ * All far JMPs and CALLs (including SYSCALL, SYSENTER, and INTn) are
+ * indirect and thus affect IBT state. All far RETs (including SYSEXIT
+ * and IRET) are protected via Shadow Stacks and thus don't affect IBT
+ * state. IRET #GPs when returning to virtual-8086 and IBT or SHSTK is
+ * enabled, but that should be handled by IRET emulation (in the very
+ * unlikely scenario that KVM adds support for fully emulating IRET).
+ */
+ if (!(flags & NearBranch))
+ return ctxt->execute != em_iret &&
+ ctxt->execute != em_ret_far &&
+ ctxt->execute != em_ret_far_imm &&
+ ctxt->execute != em_sysexit;
+
+ switch (flags & SrcMask) {
+ case SrcReg:
+ case SrcMem:
+ case SrcMem16:
+ case SrcMem32:
+ return true;
+ case SrcMemFAddr:
+ case SrcImmFAddr:
+ /* Far branches should be handled above. */
+ WARN_ON_ONCE(1);
+ return true;
+ case SrcNone:
+ case SrcImm:
+ case SrcImmByte:
+ /*
+ * Note, ImmU16 is used only for the stack adjustment operand on ENTER
+ * and RET instructions. ENTER isn't a branch and RET FAR is handled
+ * by the NearBranch check above. RET itself isn't an indirect branch.
+ */
+ case SrcImmU16:
+ return false;
+ default:
+ WARN_ONCE(1, "Unexpected Src operand '%llx' on branch",
+ flags & SrcMask);
+ return false;
+ }
+}
+
+static unsigned imm_size(struct x86_emulate_ctxt *ctxt)
+{
+ unsigned size;
+
+ size = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
+ if (size == 8)
+ size = 4;
+ return size;
+}
+
+static int decode_imm(struct x86_emulate_ctxt *ctxt, struct operand *op,
+ unsigned size, bool sign_extension)
+{
+ int rc = X86EMUL_CONTINUE;
+
+ op->type = OP_IMM;
+ op->bytes = size;
+ op->addr.mem.ea = ctxt->_eip;
+ /* NB. Immediates are sign-extended as necessary. */
+ switch (op->bytes) {
+ case 1:
+ op->val = insn_fetch(s8, ctxt);
break;
- case 0x17: /* pop ss */
- rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_SS);
- if (rc != X86EMUL_CONTINUE)
- goto done;
+ case 2:
+ op->val = insn_fetch(s16, ctxt);
break;
- case 0x18 ... 0x1d:
- sbb: /* sbb */
- emulate_2op_SrcV("sbb", c->src, c->dst, ctxt->eflags);
+ case 4:
+ op->val = insn_fetch(s32, ctxt);
break;
- case 0x1e: /* push ds */
- emulate_push_sreg(ctxt, VCPU_SREG_DS);
+ case 8:
+ op->val = insn_fetch(s64, ctxt);
break;
- case 0x1f: /* pop ds */
- rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_DS);
- if (rc != X86EMUL_CONTINUE)
- goto done;
+ }
+ if (!sign_extension) {
+ switch (op->bytes) {
+ case 1:
+ op->val &= 0xff;
+ break;
+ case 2:
+ op->val &= 0xffff;
+ break;
+ case 4:
+ op->val &= 0xffffffff;
+ break;
+ }
+ }
+done:
+ return rc;
+}
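
In isolation, the sign_extension flag above behaves like this: an s8 fetch of 0xf0 yields -16, and the follow-up masking turns it back into a zero-extended 240 for the unsigned operand forms (OpImmUByte, OpImmU16, ...). Illustrative only:

#include <stdint.h>
#include <stdio.h>

static long fetch_imm8(uint8_t byte, int sign_extension)
{
	long val = (int8_t)byte;	/* the fetch itself sign-extends */

	if (!sign_extension)
		val &= 0xff;		/* undo it for unsigned immediates */
	return val;
}

int main(void)
{
	printf("%ld\n", fetch_imm8(0xf0, 1));	/* -16 */
	printf("%ld\n", fetch_imm8(0xf0, 0));	/* 240 */
	return 0;
}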
+
+static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op,
+ unsigned d)
+{
+ int rc = X86EMUL_CONTINUE;
+
+ switch (d) {
+ case OpReg:
+ decode_register_operand(ctxt, op);
+ break;
+ case OpImmUByte:
+ rc = decode_imm(ctxt, op, 1, false);
+ break;
+ case OpMem:
+ ctxt->memop.bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
+ mem_common:
+ *op = ctxt->memop;
+ ctxt->memopp = op;
+ if (ctxt->d & BitOp)
+ fetch_bit_operand(ctxt);
+ op->orig_val = op->val;
+ break;
+ case OpMem64:
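+		/* CMPXCHG8B/16B memory operand: twice the current operand size. */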
+ ctxt->memop.bytes = (ctxt->op_bytes == 8) ? 16 : 8;
+ goto mem_common;
+ case OpAcc:
+ op->type = OP_REG;
+ op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
+ op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX);
+ fetch_register_operand(op);
+ break;
+ case OpAccLo:
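+		/* Low half of the rDX:rAX result pair (all of AX for byte ops). */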
+ op->type = OP_REG;
+ op->bytes = (ctxt->d & ByteOp) ? 2 : ctxt->op_bytes;
+ op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RAX);
+ fetch_register_operand(op);
+ break;
+ case OpAccHi:
+ if (ctxt->d & ByteOp) {
+ op->type = OP_NONE;
+ break;
+ }
+ op->type = OP_REG;
+ op->bytes = ctxt->op_bytes;
+ op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RDX);
+ fetch_register_operand(op);
+ break;
+ case OpDI:
+ op->type = OP_MEM;
+ op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
+ op->addr.mem.ea =
+ register_address(ctxt, VCPU_REGS_RDI);
+ op->addr.mem.seg = VCPU_SREG_ES;
+ op->val = 0;
+ op->count = 1;
+ break;
+ case OpDX:
+ op->type = OP_REG;
+ op->bytes = 2;
+ op->addr.reg = reg_rmw(ctxt, VCPU_REGS_RDX);
+ fetch_register_operand(op);
+ break;
+ case OpCL:
+ op->type = OP_IMM;
+ op->bytes = 1;
+ op->val = reg_read(ctxt, VCPU_REGS_RCX) & 0xff;
break;
- case 0x20 ... 0x25:
- and: /* and */
- emulate_2op_SrcV("and", c->src, c->dst, ctxt->eflags);
+ case OpImmByte:
+ rc = decode_imm(ctxt, op, 1, true);
break;
- case 0x28 ... 0x2d:
- sub: /* sub */
- emulate_2op_SrcV("sub", c->src, c->dst, ctxt->eflags);
+ case OpOne:
+ op->type = OP_IMM;
+ op->bytes = 1;
+ op->val = 1;
break;
- case 0x30 ... 0x35:
- xor: /* xor */
- emulate_2op_SrcV("xor", c->src, c->dst, ctxt->eflags);
+ case OpImm:
+ rc = decode_imm(ctxt, op, imm_size(ctxt), true);
break;
- case 0x38 ... 0x3d:
- cmp: /* cmp */
- emulate_2op_SrcV("cmp", c->src, c->dst, ctxt->eflags);
+ case OpImm64:
+ rc = decode_imm(ctxt, op, ctxt->op_bytes, true);
break;
- case 0x40 ... 0x47: /* inc r16/r32 */
- emulate_1op("inc", c->dst, ctxt->eflags);
+ case OpMem8:
+ ctxt->memop.bytes = 1;
+ if (ctxt->memop.type == OP_REG) {
+ ctxt->memop.addr.reg = decode_register(ctxt,
+ ctxt->modrm_rm, true);
+ fetch_register_operand(&ctxt->memop);
+ }
+ goto mem_common;
+ case OpMem16:
+ ctxt->memop.bytes = 2;
+ goto mem_common;
+ case OpMem32:
+ ctxt->memop.bytes = 4;
+ goto mem_common;
+ case OpImmU16:
+ rc = decode_imm(ctxt, op, 2, false);
+ break;
+ case OpImmU:
+ rc = decode_imm(ctxt, op, imm_size(ctxt), false);
+ break;
+ case OpSI:
+ op->type = OP_MEM;
+ op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
+ op->addr.mem.ea =
+ register_address(ctxt, VCPU_REGS_RSI);
+ op->addr.mem.seg = ctxt->seg_override;
+ op->val = 0;
+ op->count = 1;
+ break;
+ case OpXLat:
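+		/* XLAT: the operand is a byte at seg:[rBX + AL]. */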
+ op->type = OP_MEM;
+ op->bytes = (ctxt->d & ByteOp) ? 1 : ctxt->op_bytes;
+ op->addr.mem.ea =
+ address_mask(ctxt,
+ reg_read(ctxt, VCPU_REGS_RBX) +
+ (reg_read(ctxt, VCPU_REGS_RAX) & 0xff));
+ op->addr.mem.seg = ctxt->seg_override;
+ op->val = 0;
+ break;
+ case OpImmFAddr:
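+		/* Far pointer: op_bytes of offset plus a 2-byte segment selector. */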
+ op->type = OP_IMM;
+ op->addr.mem.ea = ctxt->_eip;
+ op->bytes = ctxt->op_bytes + 2;
+ insn_fetch_arr(op->valptr, op->bytes, ctxt);
+ break;
+ case OpMemFAddr:
+ ctxt->memop.bytes = ctxt->op_bytes + 2;
+ goto mem_common;
+ case OpES:
+ op->type = OP_IMM;
+ op->val = VCPU_SREG_ES;
+ break;
+ case OpCS:
+ op->type = OP_IMM;
+ op->val = VCPU_SREG_CS;
+ break;
+ case OpSS:
+ op->type = OP_IMM;
+ op->val = VCPU_SREG_SS;
+ break;
+ case OpDS:
+ op->type = OP_IMM;
+ op->val = VCPU_SREG_DS;
+ break;
+ case OpFS:
+ op->type = OP_IMM;
+ op->val = VCPU_SREG_FS;
+ break;
+ case OpGS:
+ op->type = OP_IMM;
+ op->val = VCPU_SREG_GS;
+ break;
+ case OpImplicit:
+ /* Special instructions do their own operand decoding. */
+ default:
+ op->type = OP_NONE; /* Disable writeback. */
break;
- case 0x48 ... 0x4f: /* dec r16/r32 */
- emulate_1op("dec", c->dst, ctxt->eflags);
+ }
+
+done:
+ return rc;
+}
+
+static int x86_decode_avx(struct x86_emulate_ctxt *ctxt,
+ u8 vex_1st, u8 vex_2nd, struct opcode *opcode)
+{
+ u8 vex_3rd, map, pp, l, v;
+ int rc = X86EMUL_CONTINUE;
+
+ if (ctxt->rep_prefix || ctxt->op_prefix || ctxt->rex_prefix)
+ goto ud;
+
+ if (vex_1st == 0xc5) {
+ /* Expand RVVVVlpp to VEX3 format */
+ vex_3rd = vex_2nd & ~0x80; /* VVVVlpp from VEX2, w=0 */
+ vex_2nd = (vex_2nd & 0x80) | 0x61; /* R from VEX2, X=1 B=1 mmmmm=00001 */
+ } else {
+ vex_3rd = insn_fetch(u8, ctxt);
+ }
+
+ /* vex_2nd = RXBmmmmm, vex_3rd = wVVVVlpp. Fix polarity */
+ vex_2nd ^= 0xE0; /* binary 11100000 */
+ vex_3rd ^= 0x78; /* binary 01111000 */
+
+ ctxt->rex_prefix = REX_PREFIX;
+ ctxt->rex_bits = (vex_2nd & 0xE0) >> 5; /* RXB */
+ ctxt->rex_bits |= (vex_3rd & 0x80) >> 4; /* w */
+ if (ctxt->rex_bits && ctxt->mode != X86EMUL_MODE_PROT64)
+ goto ud;
+
+ map = vex_2nd & 0x1f;
+ v = (vex_3rd >> 3) & 0xf;
+ l = vex_3rd & 0x4;
+ pp = vex_3rd & 0x3;
+
+ ctxt->b = insn_fetch(u8, ctxt);
+ switch (map) {
+ case 1:
+ ctxt->opcode_len = 2;
+ *opcode = twobyte_table[ctxt->b];
break;
- case 0x50 ... 0x57: /* push reg */
- emulate_push(ctxt);
+ case 2:
+ ctxt->opcode_len = 3;
+ *opcode = opcode_map_0f_38[ctxt->b];
break;
- case 0x58 ... 0x5f: /* pop reg */
- pop_instruction:
- rc = emulate_pop(ctxt, ops, &c->dst.val, c->op_bytes);
+ case 3:
+ /* no 0f 3a instructions are supported yet */
+ return X86EMUL_UNHANDLEABLE;
+ default:
+ goto ud;
+ }
+
+	/*
+	 * No three-operand instructions are supported yet; those that *are*
+	 * marked with the Avx flag treat VVVV as reserved, so it must be zero.
+	 */
+ if (v)
+ goto ud;
+
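+	/* VEX.L selects the vector length: 0 = 128-bit XMM, 1 = 256-bit YMM. */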
+ if (l)
+ ctxt->op_bytes = 32;
+ else
+ ctxt->op_bytes = 16;
+
+ switch (pp) {
+ case 0: break;
+ case 1: ctxt->op_prefix = true; break;
+ case 2: ctxt->rep_prefix = 0xf3; break;
+ case 3: ctxt->rep_prefix = 0xf2; break;
+ }
+
+done:
+ return rc;
+ud:
+ *opcode = ud;
+ return rc;
+}
+
+int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int emulation_type)
+{
+ int rc = X86EMUL_CONTINUE;
+ int mode = ctxt->mode;
+ int def_op_bytes, def_ad_bytes, goffset, simd_prefix;
+ bool vex_prefix = false;
+ bool has_seg_override = false;
+ struct opcode opcode;
+ u16 dummy;
+ struct desc_struct desc;
+
+ ctxt->memop.type = OP_NONE;
+ ctxt->memopp = NULL;
+ ctxt->_eip = ctxt->eip;
+ ctxt->fetch.ptr = ctxt->fetch.data;
+ ctxt->fetch.end = ctxt->fetch.data + insn_len;
+ ctxt->opcode_len = 1;
+ ctxt->intercept = x86_intercept_none;
+ if (insn_len > 0)
+ memcpy(ctxt->fetch.data, insn, insn_len);
+ else {
+ rc = __do_insn_fetch_bytes(ctxt, 1);
if (rc != X86EMUL_CONTINUE)
goto done;
+ }
+
+ switch (mode) {
+ case X86EMUL_MODE_REAL:
+ case X86EMUL_MODE_VM86:
+ def_op_bytes = def_ad_bytes = 2;
+ ctxt->ops->get_segment(ctxt, &dummy, &desc, NULL, VCPU_SREG_CS);
+ if (desc.d)
+ def_op_bytes = def_ad_bytes = 4;
break;
- case 0x60: /* pusha */
- emulate_pusha(ctxt);
+ case X86EMUL_MODE_PROT16:
+ def_op_bytes = def_ad_bytes = 2;
break;
- case 0x61: /* popa */
- rc = emulate_popa(ctxt, ops);
- if (rc != X86EMUL_CONTINUE)
- goto done;
+ case X86EMUL_MODE_PROT32:
+ def_op_bytes = def_ad_bytes = 4;
break;
- case 0x63: /* movsxd */
- if (ctxt->mode != X86EMUL_MODE_PROT64)
- goto cannot_emulate;
- c->dst.val = (s32) c->src.val;
- break;
- case 0x68: /* push imm */
- case 0x6a: /* push imm8 */
- emulate_push(ctxt);
- break;
- case 0x6c: /* insb */
- case 0x6d: /* insw/insd */
- c->dst.bytes = min(c->dst.bytes, 4u);
- if (!emulator_io_permited(ctxt, ops, c->regs[VCPU_REGS_RDX],
- c->dst.bytes)) {
- kvm_inject_gp(ctxt->vcpu, 0);
- goto done;
+#ifdef CONFIG_X86_64
+ case X86EMUL_MODE_PROT64:
+ def_op_bytes = 4;
+ def_ad_bytes = 8;
+ break;
+#endif
+ default:
+ return EMULATION_FAILED;
+ }
+
+ ctxt->op_bytes = def_op_bytes;
+ ctxt->ad_bytes = def_ad_bytes;
+
+ /* Legacy prefixes. */
+ for (;;) {
+ switch (ctxt->b = insn_fetch(u8, ctxt)) {
+ case 0x66: /* operand-size override */
+ ctxt->op_prefix = true;
+ /* switch between 2/4 bytes */
+ ctxt->op_bytes = def_op_bytes ^ 6;
+ break;
+ case 0x67: /* address-size override */
+ if (mode == X86EMUL_MODE_PROT64)
+ /* switch between 4/8 bytes */
+ ctxt->ad_bytes = def_ad_bytes ^ 12;
+ else
+ /* switch between 2/4 bytes */
+ ctxt->ad_bytes = def_ad_bytes ^ 6;
+ break;
+ case 0x26: /* ES override */
+ has_seg_override = true;
+ ctxt->seg_override = VCPU_SREG_ES;
+ break;
+ case 0x2e: /* CS override */
+ has_seg_override = true;
+ ctxt->seg_override = VCPU_SREG_CS;
+ break;
+ case 0x36: /* SS override */
+ has_seg_override = true;
+ ctxt->seg_override = VCPU_SREG_SS;
+ break;
+ case 0x3e: /* DS override */
+ has_seg_override = true;
+ ctxt->seg_override = VCPU_SREG_DS;
+ break;
+ case 0x64: /* FS override */
+ has_seg_override = true;
+ ctxt->seg_override = VCPU_SREG_FS;
+ break;
+ case 0x65: /* GS override */
+ has_seg_override = true;
+ ctxt->seg_override = VCPU_SREG_GS;
+ break;
+ case 0x40 ... 0x4f: /* REX */
+ if (mode != X86EMUL_MODE_PROT64)
+ goto done_prefixes;
+ ctxt->rex_prefix = REX_PREFIX;
+ ctxt->rex_bits = ctxt->b & 0xf;
+ continue;
+ case 0xf0: /* LOCK */
+ ctxt->lock_prefix = 1;
+ break;
+ case 0xf2: /* REPNE/REPNZ */
+ case 0xf3: /* REP/REPE/REPZ */
+ ctxt->rep_prefix = ctxt->b;
+ break;
+ default:
+ goto done_prefixes;
}
- if (!pio_in_emulated(ctxt, ops, c->dst.bytes,
- c->regs[VCPU_REGS_RDX], &c->dst.val))
- goto done; /* IO is needed, skip writeback */
- break;
- case 0x6e: /* outsb */
- case 0x6f: /* outsw/outsd */
- c->src.bytes = min(c->src.bytes, 4u);
- if (!emulator_io_permited(ctxt, ops, c->regs[VCPU_REGS_RDX],
- c->src.bytes)) {
- kvm_inject_gp(ctxt->vcpu, 0);
- goto done;
+
+ /* Any legacy prefix after a REX prefix nullifies its effect. */
+ ctxt->rex_prefix = REX_NONE;
+ ctxt->rex_bits = 0;
+ }
+
+done_prefixes:
+
+ /* REX prefix. */
+ if (ctxt->rex_bits & REX_W)
+ ctxt->op_bytes = 8;
+
+ /* Opcode byte(s). */
+ if (ctxt->b == 0xc4 || ctxt->b == 0xc5) {
+ /* VEX or LDS/LES */
+ u8 vex_2nd = insn_fetch(u8, ctxt);
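+		/*
+		 * Outside 64-bit mode, 0xc4/0xc5 encode LDS/LES unless the
+		 * next byte has mod == 11b, a register form that is invalid
+		 * for LDS/LES and therefore unambiguously marks a VEX prefix.
+		 */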
+ if (mode != X86EMUL_MODE_PROT64 && (vex_2nd & 0xc0) != 0xc0) {
+ opcode = opcode_table[ctxt->b];
+ ctxt->modrm = vex_2nd;
+ /* the Mod/RM byte has been fetched already! */
+ goto done_modrm;
}
- ops->pio_out_emulated(c->src.bytes, c->regs[VCPU_REGS_RDX],
- &c->src.val, 1, ctxt->vcpu);
- c->dst.type = OP_NONE; /* nothing to writeback */
- break;
- case 0x70 ... 0x7f: /* jcc (short) */
- if (test_cc(c->b, ctxt->eflags))
- jmp_rel(c, c->src.val);
- break;
- case 0x80 ... 0x83: /* Grp1 */
- switch (c->modrm_reg) {
- case 0:
- goto add;
- case 1:
- goto or;
- case 2:
- goto adc;
- case 3:
- goto sbb;
- case 4:
- goto and;
- case 5:
- goto sub;
- case 6:
- goto xor;
- case 7:
- goto cmp;
+ vex_prefix = true;
+ rc = x86_decode_avx(ctxt, ctxt->b, vex_2nd, &opcode);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ } else if (ctxt->b == 0x0f) {
+ /* Two- or three-byte opcode */
+ ctxt->opcode_len = 2;
+ ctxt->b = insn_fetch(u8, ctxt);
+ opcode = twobyte_table[ctxt->b];
+
+ /* 0F_38 opcode map */
+ if (ctxt->b == 0x38) {
+ ctxt->opcode_len = 3;
+ ctxt->b = insn_fetch(u8, ctxt);
+ opcode = opcode_map_0f_38[ctxt->b];
}
- break;
- case 0x84 ... 0x85:
- emulate_2op_SrcV("test", c->src, c->dst, ctxt->eflags);
- break;
- case 0x86 ... 0x87: /* xchg */
- xchg:
- /* Write back the register source. */
- switch (c->dst.bytes) {
- case 1:
- *(u8 *) c->src.ptr = (u8) c->dst.val;
+ } else {
+ /* Opcode byte(s). */
+ opcode = opcode_table[ctxt->b];
+ }
+
+ if (opcode.flags & ModRM)
+ ctxt->modrm = insn_fetch(u8, ctxt);
+
+done_modrm:
+ ctxt->d = opcode.flags;
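+	/* Resolve group/dual/prefix/escape indirection to a leaf opcode. */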
+ while (ctxt->d & GroupMask) {
+ switch (ctxt->d & GroupMask) {
+ case Group:
+ goffset = (ctxt->modrm >> 3) & 7;
+ opcode = opcode.u.group[goffset];
break;
- case 2:
- *(u16 *) c->src.ptr = (u16) c->dst.val;
+ case GroupDual:
+ goffset = (ctxt->modrm >> 3) & 7;
+ if ((ctxt->modrm >> 6) == 3)
+ opcode = opcode.u.gdual->mod3[goffset];
+ else
+ opcode = opcode.u.gdual->mod012[goffset];
break;
- case 4:
- *c->src.ptr = (u32) c->dst.val;
- break; /* 64b reg: zero-extend */
- case 8:
- *c->src.ptr = c->dst.val;
+ case RMExt:
+ goffset = ctxt->modrm & 7;
+ opcode = opcode.u.group[goffset];
+ break;
+ case Prefix:
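+			/* A mandatory 66/F2/F3 prefix selects the SIMD opcode variant. */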
+ if (ctxt->rep_prefix && ctxt->op_prefix)
+ return EMULATION_FAILED;
+ simd_prefix = ctxt->op_prefix ? 0x66 : ctxt->rep_prefix;
+ switch (simd_prefix) {
+ case 0x00: opcode = opcode.u.gprefix->pfx_no; break;
+ case 0x66: opcode = opcode.u.gprefix->pfx_66; break;
+ case 0xf2: opcode = opcode.u.gprefix->pfx_f2; break;
+ case 0xf3: opcode = opcode.u.gprefix->pfx_f3; break;
+ }
+ break;
+ case Escape:
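+			/*
+			 * x87 escape: modrm >= 0xc0 indexes the register-form
+			 * table (clamped against speculative out-of-bounds
+			 * access); otherwise /reg indexes the memory-form table.
+			 */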
+ if (ctxt->modrm > 0xbf) {
+ size_t size = ARRAY_SIZE(opcode.u.esc->high);
+ u32 index = array_index_nospec(
+ ctxt->modrm - 0xc0, size);
+
+ opcode = opcode.u.esc->high[index];
+ } else {
+ opcode = opcode.u.esc->op[(ctxt->modrm >> 3) & 7];
+ }
break;
+ case InstrDual:
+ if ((ctxt->modrm >> 6) == 3)
+ opcode = opcode.u.idual->mod3;
+ else
+ opcode = opcode.u.idual->mod012;
+ break;
+ case ModeDual:
+ if (ctxt->mode == X86EMUL_MODE_PROT64)
+ opcode = opcode.u.mdual->mode64;
+ else
+ opcode = opcode.u.mdual->mode32;
+ break;
+ default:
+ return EMULATION_FAILED;
}
+
+ ctxt->d &= ~(u64)GroupMask;
+ ctxt->d |= opcode.flags;
+ }
+
+ ctxt->is_branch = opcode.flags & IsBranch;
+
+ /* Unrecognised? */
+ if (ctxt->d == 0)
+ return EMULATION_FAILED;
+
+ if (unlikely(vex_prefix)) {
/*
- * Write back the memory destination with implicit LOCK
- * prefix.
+ * Only specifically marked instructions support VEX. Since many
+ * instructions support it but are not annotated, return not implemented
+ * rather than #UD.
*/
- c->dst.val = c->src.val;
- c->lock_prefix = 1;
- break;
- case 0x88 ... 0x8b: /* mov */
- goto mov;
- case 0x8c: { /* mov r/m, sreg */
- struct kvm_segment segreg;
+ if (!(ctxt->d & Avx))
+ return EMULATION_FAILED;
- if (c->modrm_reg <= VCPU_SREG_GS)
- kvm_get_segment(ctxt->vcpu, &segreg, c->modrm_reg);
- else {
- kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
- goto done;
+ if (!(ctxt->d & AlignMask))
+ ctxt->d |= Unaligned;
+ }
+
+ ctxt->execute = opcode.u.execute;
+
+ /*
+ * Reject emulation if KVM might need to emulate shadow stack updates
+ * and/or indirect branch tracking enforcement, which the emulator
+ * doesn't support.
+ */
+ if ((is_ibt_instruction(ctxt) || is_shstk_instruction(ctxt)) &&
+ ctxt->ops->get_cr(ctxt, 4) & X86_CR4_CET) {
+ u64 u_cet = 0, s_cet = 0;
+
+ /*
+ * Check both User and Supervisor on far transfers as inter-
+ * privilege level transfers are impacted by CET at the target
+ * privilege level, and that is not known at this time. The
+ * expectation is that the guest will not require emulation of
+ * any CET-affected instructions at any privilege level.
+ */
+ if (!(ctxt->d & NearBranch))
+ u_cet = s_cet = CET_SHSTK_EN | CET_ENDBR_EN;
+ else if (ctxt->ops->cpl(ctxt) == 3)
+ u_cet = CET_SHSTK_EN | CET_ENDBR_EN;
+ else
+ s_cet = CET_SHSTK_EN | CET_ENDBR_EN;
+
+ if ((u_cet && ctxt->ops->get_msr(ctxt, MSR_IA32_U_CET, &u_cet)) ||
+ (s_cet && ctxt->ops->get_msr(ctxt, MSR_IA32_S_CET, &s_cet)))
+ return EMULATION_FAILED;
+
+ if ((u_cet | s_cet) & CET_SHSTK_EN && is_shstk_instruction(ctxt))
+ return EMULATION_FAILED;
+
+ if ((u_cet | s_cet) & CET_ENDBR_EN && is_ibt_instruction(ctxt))
+ return EMULATION_FAILED;
+ }
+
+ if (unlikely(emulation_type & EMULTYPE_TRAP_UD) &&
+ likely(!(ctxt->d & EmulateOnUD)))
+ return EMULATION_FAILED;
+
+ if (unlikely(ctxt->d &
+ (NotImpl|Stack|Op3264|Sse|Mmx|Intercept|CheckPerm|NearBranch|
+ No16))) {
+ /*
+ * These are copied unconditionally here, and checked unconditionally
+ * in x86_emulate_insn.
+ */
+ ctxt->check_perm = opcode.check_perm;
+ ctxt->intercept = opcode.intercept;
+
+ if (ctxt->d & NotImpl)
+ return EMULATION_FAILED;
+
+ if (mode == X86EMUL_MODE_PROT64) {
+ if (ctxt->op_bytes == 4 && (ctxt->d & Stack))
+ ctxt->op_bytes = 8;
+ else if (ctxt->d & NearBranch)
+ ctxt->op_bytes = 8;
}
- c->dst.val = segreg.selector;
- break;
+
+ if (ctxt->d & Op3264) {
+ if (mode == X86EMUL_MODE_PROT64)
+ ctxt->op_bytes = 8;
+ else
+ ctxt->op_bytes = 4;
+ }
+
+ if ((ctxt->d & No16) && ctxt->op_bytes == 2)
+ ctxt->op_bytes = 4;
+
+		if (!vex_prefix) {
+			if (ctxt->d & Sse) {
+				ctxt->op_bytes = 16;
+				ctxt->d &= ~Avx;
+			} else if (ctxt->d & Mmx) {
+				ctxt->op_bytes = 8;
+			}
+		}
}
- case 0x8d: /* lea r16/r32, m */
- c->dst.val = c->modrm_ea;
- break;
- case 0x8e: { /* mov seg, r/m16 */
- uint16_t sel;
- sel = c->src.val;
+ /* ModRM and SIB bytes. */
+ if (ctxt->d & ModRM) {
+ rc = decode_modrm(ctxt, &ctxt->memop);
+ if (!has_seg_override) {
+ has_seg_override = true;
+ ctxt->seg_override = ctxt->modrm_seg;
+ }
+ } else if (ctxt->d & MemAbs)
+ rc = decode_abs(ctxt, &ctxt->memop);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+
+ if (!has_seg_override)
+ ctxt->seg_override = VCPU_SREG_DS;
- if (c->modrm_reg == VCPU_SREG_CS ||
- c->modrm_reg > VCPU_SREG_GS) {
- kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
+ ctxt->memop.addr.mem.seg = ctxt->seg_override;
+
+ /*
+ * Decode and fetch the source operand: register, memory
+ * or immediate.
+ */
+ rc = decode_operand(ctxt, &ctxt->src, (ctxt->d >> SrcShift) & OpMask);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+
+ /*
+ * Decode and fetch the second source operand: register, memory
+ * or immediate.
+ */
+ rc = decode_operand(ctxt, &ctxt->src2, (ctxt->d >> Src2Shift) & OpMask);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+
+ /* Decode and fetch the destination operand: register or memory. */
+ rc = decode_operand(ctxt, &ctxt->dst, (ctxt->d >> DstShift) & OpMask);
+
+ if (ctxt->rip_relative && likely(ctxt->memopp))
+ ctxt->memopp->addr.mem.ea = address_mask(ctxt,
+ ctxt->memopp->addr.mem.ea + ctxt->_eip);
+
+done:
+ if (rc == X86EMUL_PROPAGATE_FAULT)
+ ctxt->have_exception = true;
+ return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK;
+}
+
+bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt)
+{
+ return ctxt->d & PageTable;
+}
+
+static bool string_insn_completed(struct x86_emulate_ctxt *ctxt)
+{
+	/*
+	 * The second termination condition applies only to REPE/REPZ and
+	 * REPNE/REPNZ. If one of those prefixes is present, check the
+	 * corresponding termination condition:
+	 * - REPE/REPZ terminates when ZF = 0
+	 * - REPNE/REPNZ terminates when ZF = 1
+	 */
+ if (((ctxt->b == 0xa6) || (ctxt->b == 0xa7) ||
+ (ctxt->b == 0xae) || (ctxt->b == 0xaf))
+ && (((ctxt->rep_prefix == REPE_PREFIX) &&
+ ((ctxt->eflags & X86_EFLAGS_ZF) == 0))
+ || ((ctxt->rep_prefix == REPNE_PREFIX) &&
+ ((ctxt->eflags & X86_EFLAGS_ZF) == X86_EFLAGS_ZF))))
+ return true;
+
+ return false;
+}
+
+static int flush_pending_x87_faults(struct x86_emulate_ctxt *ctxt)
+{
+ int rc;
+
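+	/*
+	 * FWAIT forces delivery of any pending x87 exception; asm_safe()
+	 * catches the resulting fault so it can be reflected as #MF.
+	 */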
+ kvm_fpu_get();
+ rc = asm_safe("fwait");
+ kvm_fpu_put();
+
+ if (unlikely(rc != X86EMUL_CONTINUE))
+ return emulate_exception(ctxt, MF_VECTOR, 0, false);
+
+ return X86EMUL_CONTINUE;
+}
+
+static void fetch_possible_mmx_operand(struct operand *op)
+{
+ if (op->type == OP_MM)
+ kvm_read_mmx_reg(op->addr.mm, &op->mm_val);
+}
+
+void init_decode_cache(struct x86_emulate_ctxt *ctxt)
+{
+ /* Clear fields that are set conditionally but read without a guard. */
+ ctxt->rip_relative = false;
+ ctxt->rex_prefix = REX_NONE;
+ ctxt->rex_bits = 0;
+ ctxt->lock_prefix = 0;
+ ctxt->op_prefix = false;
+ ctxt->rep_prefix = 0;
+ ctxt->regs_valid = 0;
+ ctxt->regs_dirty = 0;
+
+ ctxt->io_read.pos = 0;
+ ctxt->io_read.end = 0;
+ ctxt->mem_read.end = 0;
+}
+
+int x86_emulate_insn(struct x86_emulate_ctxt *ctxt, bool check_intercepts)
+{
+ const struct x86_emulate_ops *ops = ctxt->ops;
+ int rc = X86EMUL_CONTINUE;
+ int saved_dst_type = ctxt->dst.type;
+
+ ctxt->mem_read.pos = 0;
+
+ /* LOCK prefix is allowed only with some instructions */
+ if (ctxt->lock_prefix && (!(ctxt->d & Lock) || ctxt->dst.type != OP_MEM)) {
+ rc = emulate_ud(ctxt);
+ goto done;
+ }
+
+ if ((ctxt->d & SrcMask) == SrcMemFAddr && ctxt->src.type != OP_MEM) {
+ rc = emulate_ud(ctxt);
+ goto done;
+ }
+
+ if (unlikely(ctxt->d &
+ (No64|Undefined|Avx|Sse|Mmx|Intercept|CheckPerm|Priv|Prot|String))) {
+ if ((ctxt->mode == X86EMUL_MODE_PROT64 && (ctxt->d & No64)) ||
+ (ctxt->d & Undefined)) {
+ rc = emulate_ud(ctxt);
goto done;
}
- if (c->modrm_reg == VCPU_SREG_SS)
- toggle_interruptibility(ctxt, KVM_X86_SHADOW_INT_MOV_SS);
+ if ((ctxt->d & (Avx|Sse|Mmx)) && ((ops->get_cr(ctxt, 0) & X86_CR0_EM))) {
+ rc = emulate_ud(ctxt);
+ goto done;
+ }
- rc = load_segment_descriptor(ctxt, ops, sel, c->modrm_reg);
+ if (ctxt->d & Avx) {
+ u64 xcr = 0;
+ if (!(ops->get_cr(ctxt, 4) & X86_CR4_OSXSAVE)
+ || ops->get_xcr(ctxt, 0, &xcr)
+ || !(xcr & XFEATURE_MASK_YMM)) {
+ rc = emulate_ud(ctxt);
+ goto done;
+ }
+ } else if (ctxt->d & Sse) {
+ if (!(ops->get_cr(ctxt, 4) & X86_CR4_OSFXSR)) {
+ rc = emulate_ud(ctxt);
+ goto done;
+ }
+ }
- c->dst.type = OP_NONE; /* Disable writeback. */
- break;
+ if ((ctxt->d & (Avx|Sse|Mmx)) && (ops->get_cr(ctxt, 0) & X86_CR0_TS)) {
+ rc = emulate_nm(ctxt);
+ goto done;
+ }
+
+ if (ctxt->d & Mmx) {
+ rc = flush_pending_x87_faults(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ /*
+ * Now that we know the fpu is exception safe, we can fetch
+ * operands from it.
+ */
+ fetch_possible_mmx_operand(&ctxt->src);
+ fetch_possible_mmx_operand(&ctxt->src2);
+ if (!(ctxt->d & Mov))
+ fetch_possible_mmx_operand(&ctxt->dst);
+ }
+
+ if (unlikely(check_intercepts) && ctxt->intercept) {
+ rc = emulator_check_intercept(ctxt, ctxt->intercept,
+ X86_ICPT_PRE_EXCEPT);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ }
+
+ /* Instruction can only be executed in protected mode */
+ if ((ctxt->d & Prot) && ctxt->mode < X86EMUL_MODE_PROT16) {
+ rc = emulate_ud(ctxt);
+ goto done;
+ }
+
+ /* Privileged instruction can be executed only in CPL=0 */
+ if ((ctxt->d & Priv) && ops->cpl(ctxt)) {
+ if (ctxt->d & PrivUD)
+ rc = emulate_ud(ctxt);
+ else
+ rc = emulate_gp(ctxt, 0);
+ goto done;
+ }
+
+ /* Do instruction specific permission checks */
+ if (ctxt->d & CheckPerm) {
+ rc = ctxt->check_perm(ctxt);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ }
+
+ if (unlikely(check_intercepts) && (ctxt->d & Intercept)) {
+ rc = emulator_check_intercept(ctxt, ctxt->intercept,
+ X86_ICPT_POST_EXCEPT);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ }
+
+ if (ctxt->rep_prefix && (ctxt->d & String)) {
+ /* All REP prefixes have the same first termination condition */
+ if (address_mask(ctxt, reg_read(ctxt, VCPU_REGS_RCX)) == 0) {
+ string_registers_quirk(ctxt);
+ ctxt->eip = ctxt->_eip;
+ ctxt->eflags &= ~X86_EFLAGS_RF;
+ goto done;
+ }
+ }
}
- case 0x8f: /* pop (sole member of Grp1a) */
- rc = emulate_grp1a(ctxt, ops);
+
+ if ((ctxt->src.type == OP_MEM) && !(ctxt->d & NoAccess)) {
+ rc = segmented_read(ctxt, ctxt->src.addr.mem,
+ ctxt->src.valptr, ctxt->src.bytes);
if (rc != X86EMUL_CONTINUE)
goto done;
- break;
- case 0x90: /* nop / xchg r8,rax */
- if (!(c->rex_prefix & 1)) { /* nop */
- c->dst.type = OP_NONE;
- break;
+ ctxt->src.orig_val64 = ctxt->src.val64;
+ }
+
+ if (ctxt->src2.type == OP_MEM) {
+ rc = segmented_read(ctxt, ctxt->src2.addr.mem,
+ &ctxt->src2.val, ctxt->src2.bytes);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ }
+
+ if ((ctxt->d & DstMask) == ImplicitOps)
+ goto special_insn;
+
+ if ((ctxt->dst.type == OP_MEM) && !(ctxt->d & Mov)) {
+ /* optimisation - avoid slow emulated read if Mov */
+ rc = segmented_read(ctxt, ctxt->dst.addr.mem,
+ &ctxt->dst.val, ctxt->dst.bytes);
+ if (rc != X86EMUL_CONTINUE) {
+ if (!(ctxt->d & NoWrite) &&
+ rc == X86EMUL_PROPAGATE_FAULT &&
+ ctxt->exception.vector == PF_VECTOR)
+ ctxt->exception.error_code |= PFERR_WRITE_MASK;
+ goto done;
}
- case 0x91 ... 0x97: /* xchg reg,rax */
- c->src.type = c->dst.type = OP_REG;
- c->src.bytes = c->dst.bytes = c->op_bytes;
- c->src.ptr = (unsigned long *) &c->regs[VCPU_REGS_RAX];
- c->src.val = *(c->src.ptr);
- goto xchg;
- case 0x9c: /* pushf */
- c->src.val = (unsigned long) ctxt->eflags;
- emulate_push(ctxt);
- break;
- case 0x9d: /* popf */
- c->dst.type = OP_REG;
- c->dst.ptr = (unsigned long *) &ctxt->eflags;
- c->dst.bytes = c->op_bytes;
- rc = emulate_popf(ctxt, ops, &c->dst.val, c->op_bytes);
+ }
+ /* Copy full 64-bit value for CMPXCHG8B. */
+ ctxt->dst.orig_val64 = ctxt->dst.val64;
+
+special_insn:
+
+ if (unlikely(check_intercepts) && (ctxt->d & Intercept)) {
+ rc = emulator_check_intercept(ctxt, ctxt->intercept,
+ X86_ICPT_POST_MEMACCESS);
if (rc != X86EMUL_CONTINUE)
goto done;
- break;
- case 0xa0 ... 0xa1: /* mov */
- c->dst.ptr = (unsigned long *)&c->regs[VCPU_REGS_RAX];
- c->dst.val = c->src.val;
- break;
- case 0xa2 ... 0xa3: /* mov */
- c->dst.val = (unsigned long)c->regs[VCPU_REGS_RAX];
- break;
- case 0xa4 ... 0xa5: /* movs */
- goto mov;
- case 0xa6 ... 0xa7: /* cmps */
- c->dst.type = OP_NONE; /* Disable writeback. */
- DPRINTF("cmps: mem1=0x%p mem2=0x%p\n", c->src.ptr, c->dst.ptr);
- goto cmp;
- case 0xaa ... 0xab: /* stos */
- c->dst.val = c->regs[VCPU_REGS_RAX];
- break;
- case 0xac ... 0xad: /* lods */
- goto mov;
- case 0xae ... 0xaf: /* scas */
- DPRINTF("Urk! I don't handle SCAS.\n");
- goto cannot_emulate;
- case 0xb0 ... 0xbf: /* mov r, imm */
- goto mov;
- case 0xc0 ... 0xc1:
- emulate_grp2(ctxt);
- break;
- case 0xc3: /* ret */
- c->dst.type = OP_REG;
- c->dst.ptr = &c->eip;
- c->dst.bytes = c->op_bytes;
- goto pop_instruction;
- case 0xc6 ... 0xc7: /* mov (sole member of Grp11) */
- mov:
- c->dst.val = c->src.val;
- break;
- case 0xcb: /* ret far */
- rc = emulate_ret_far(ctxt, ops);
+ }
+
+ if (ctxt->rep_prefix && (ctxt->d & String))
+ ctxt->eflags |= X86_EFLAGS_RF;
+ else
+ ctxt->eflags &= ~X86_EFLAGS_RF;
+
+ if (ctxt->execute) {
+ rc = ctxt->execute(ctxt);
if (rc != X86EMUL_CONTINUE)
goto done;
+ goto writeback;
+ }
+
+ if (ctxt->opcode_len == 2)
+ goto twobyte_insn;
+ else if (ctxt->opcode_len == 3)
+ goto threebyte_insn;
+
+ switch (ctxt->b) {
+ case 0x70 ... 0x7f: /* jcc (short) */
+ if (test_cc(ctxt->b, ctxt->eflags))
+ rc = jmp_rel(ctxt, ctxt->src.val);
break;
- case 0xd0 ... 0xd1: /* Grp2 */
- c->src.val = 1;
- emulate_grp2(ctxt);
+ case 0x8d: /* lea r16/r32, m */
+ ctxt->dst.val = ctxt->src.addr.mem.ea;
break;
- case 0xd2 ... 0xd3: /* Grp2 */
- c->src.val = c->regs[VCPU_REGS_RCX];
- emulate_grp2(ctxt);
+ case 0x90 ... 0x97: /* nop / xchg reg, rax */
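+		/* XCHG rAX with itself (0x90 without REX.B) is NOP; skip writeback. */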
+ if (ctxt->dst.addr.reg == reg_rmw(ctxt, VCPU_REGS_RAX))
+ ctxt->dst.type = OP_NONE;
+ else
+ rc = em_xchg(ctxt);
break;
- case 0xe4: /* inb */
- case 0xe5: /* in */
- goto do_io_in;
- case 0xe6: /* outb */
- case 0xe7: /* out */
- goto do_io_out;
- case 0xe8: /* call (near) */ {
- long int rel = c->src.val;
- c->src.val = (unsigned long) c->eip;
- jmp_rel(c, rel);
- emulate_push(ctxt);
+ case 0x98: /* cbw/cwde/cdqe */
+ switch (ctxt->op_bytes) {
+ case 2: ctxt->dst.val = (s8)ctxt->dst.val; break;
+ case 4: ctxt->dst.val = (s16)ctxt->dst.val; break;
+ case 8: ctxt->dst.val = (s32)ctxt->dst.val; break;
+ }
break;
- }
- case 0xe9: /* jmp rel */
- goto jmp;
- case 0xea: /* jmp far */
- jump_far:
- if (load_segment_descriptor(ctxt, ops, c->src2.val,
- VCPU_SREG_CS))
- goto done;
-
- c->eip = c->src.val;
+ case 0xcc: /* int3 */
+ rc = emulate_int(ctxt, 3);
break;
- case 0xeb:
- jmp: /* jmp rel short */
- jmp_rel(c, c->src.val);
- c->dst.type = OP_NONE; /* Disable writeback. */
+ case 0xcd: /* int n */
+ rc = emulate_int(ctxt, ctxt->src.val);
break;
- case 0xec: /* in al,dx */
- case 0xed: /* in (e/r)ax,dx */
- c->src.val = c->regs[VCPU_REGS_RDX];
- do_io_in:
- c->dst.bytes = min(c->dst.bytes, 4u);
- if (!emulator_io_permited(ctxt, ops, c->src.val, c->dst.bytes)) {
- kvm_inject_gp(ctxt->vcpu, 0);
- goto done;
- }
- if (!pio_in_emulated(ctxt, ops, c->dst.bytes, c->src.val,
- &c->dst.val))
- goto done; /* IO is needed */
- break;
- case 0xee: /* out al,dx */
- case 0xef: /* out (e/r)ax,dx */
- c->src.val = c->regs[VCPU_REGS_RDX];
- do_io_out:
- c->dst.bytes = min(c->dst.bytes, 4u);
- if (!emulator_io_permited(ctxt, ops, c->src.val, c->dst.bytes)) {
- kvm_inject_gp(ctxt->vcpu, 0);
- goto done;
- }
- ops->pio_out_emulated(c->dst.bytes, c->src.val, &c->dst.val, 1,
- ctxt->vcpu);
- c->dst.type = OP_NONE; /* Disable writeback. */
+ case 0xce: /* into */
+ if (ctxt->eflags & X86_EFLAGS_OF)
+ rc = emulate_int(ctxt, 4);
+ break;
+ case 0xe9: /* jmp rel */
+ case 0xeb: /* jmp rel short */
+ rc = jmp_rel(ctxt, ctxt->src.val);
+ ctxt->dst.type = OP_NONE; /* Disable writeback. */
break;
case 0xf4: /* hlt */
- ctxt->vcpu->arch.halt_request = 1;
+ ctxt->ops->halt(ctxt);
break;
case 0xf5: /* cmc */
/* complement carry flag from eflags reg */
- ctxt->eflags ^= EFLG_CF;
- c->dst.type = OP_NONE; /* Disable writeback. */
- break;
- case 0xf6 ... 0xf7: /* Grp3 */
- if (!emulate_grp3(ctxt, ops))
- goto cannot_emulate;
+ ctxt->eflags ^= X86_EFLAGS_CF;
break;
case 0xf8: /* clc */
- ctxt->eflags &= ~EFLG_CF;
- c->dst.type = OP_NONE; /* Disable writeback. */
+ ctxt->eflags &= ~X86_EFLAGS_CF;
break;
- case 0xfa: /* cli */
- if (emulator_bad_iopl(ctxt, ops))
- kvm_inject_gp(ctxt->vcpu, 0);
- else {
- ctxt->eflags &= ~X86_EFLAGS_IF;
- c->dst.type = OP_NONE; /* Disable writeback. */
- }
- break;
- case 0xfb: /* sti */
- if (emulator_bad_iopl(ctxt, ops))
- kvm_inject_gp(ctxt->vcpu, 0);
- else {
- toggle_interruptibility(ctxt, KVM_X86_SHADOW_INT_STI);
- ctxt->eflags |= X86_EFLAGS_IF;
- c->dst.type = OP_NONE; /* Disable writeback. */
- }
+ case 0xf9: /* stc */
+ ctxt->eflags |= X86_EFLAGS_CF;
break;
case 0xfc: /* cld */
- ctxt->eflags &= ~EFLG_DF;
- c->dst.type = OP_NONE; /* Disable writeback. */
+ ctxt->eflags &= ~X86_EFLAGS_DF;
break;
case 0xfd: /* std */
- ctxt->eflags |= EFLG_DF;
- c->dst.type = OP_NONE; /* Disable writeback. */
+ ctxt->eflags |= X86_EFLAGS_DF;
break;
- case 0xfe: /* Grp4 */
- grp45:
- rc = emulate_grp45(ctxt, ops);
- if (rc != X86EMUL_CONTINUE)
- goto done;
- break;
- case 0xff: /* Grp5 */
- if (c->modrm_reg == 5)
- goto jump_far;
- goto grp45;
+ default:
+ goto cannot_emulate;
}
-writeback:
- rc = writeback(ctxt, ops);
if (rc != X86EMUL_CONTINUE)
goto done;
+writeback:
+ if (ctxt->d & SrcWrite) {
+ BUG_ON(ctxt->src.type == OP_MEM || ctxt->src.type == OP_MEM_STR);
+ rc = writeback(ctxt, &ctxt->src);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ }
+ if (!(ctxt->d & NoWrite)) {
+ rc = writeback(ctxt, &ctxt->dst);
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+ }
+
/*
* restore dst type in case the decoding will be reused
	 * (happens for string instructions)
*/
- c->dst.type = saved_dst_type;
+ ctxt->dst.type = saved_dst_type;
- if ((c->d & SrcMask) == SrcSI)
- string_addr_inc(ctxt, seg_override_base(ctxt, c), VCPU_REGS_RSI,
- &c->src);
+ if ((ctxt->d & SrcMask) == SrcSI)
+ string_addr_inc(ctxt, VCPU_REGS_RSI, &ctxt->src);
- if ((c->d & DstMask) == DstDI)
- string_addr_inc(ctxt, es_base(ctxt), VCPU_REGS_RDI, &c->dst);
+ if ((ctxt->d & DstMask) == DstDI)
+ string_addr_inc(ctxt, VCPU_REGS_RDI, &ctxt->dst);
- if (c->rep_prefix && (c->d & String)) {
- struct read_cache *rc = &ctxt->decode.io_read;
- register_address_increment(c, &c->regs[VCPU_REGS_RCX], -1);
- /*
- * Re-enter guest when pio read ahead buffer is empty or,
- * if it is not used, after each 1024 iteration.
- */
- if ((rc->end == 0 && !(c->regs[VCPU_REGS_RCX] & 0x3ff)) ||
- (rc->end != 0 && rc->end == rc->pos))
- ctxt->restart = false;
+ if (ctxt->rep_prefix && (ctxt->d & String)) {
+ unsigned int count;
+ struct read_cache *r = &ctxt->io_read;
+ if ((ctxt->d & SrcMask) == SrcSI)
+ count = ctxt->src.count;
+ else
+ count = ctxt->dst.count;
+ register_address_increment(ctxt, VCPU_REGS_RCX, -count);
+
+ if (!string_insn_completed(ctxt)) {
+ /*
+ * Re-enter guest when pio read ahead buffer is empty
+			 * or, if it is not used, after every 1024 iterations.
+ */
+ if ((r->end != 0 || reg_read(ctxt, VCPU_REGS_RCX) & 0x3ff) &&
+ (r->end == 0 || r->end != r->pos)) {
+ /*
+ * Reset read cache. Usually happens before
+				 * decode, but since the instruction is restarted
+ * we have to do it here.
+ */
+ ctxt->mem_read.end = 0;
+ writeback_registers(ctxt);
+ return EMULATION_RESTART;
+ }
+ goto done; /* skip rip writeback */
+ }
+ ctxt->eflags &= ~X86_EFLAGS_RF;
}
- /* Commit shadow register state. */
- memcpy(ctxt->vcpu->arch.regs, c->regs, sizeof c->regs);
- kvm_rip_write(ctxt->vcpu, c->eip);
- ops->set_rflags(ctxt->vcpu, ctxt->eflags);
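+	/* Outside 64-bit mode the committed RIP is truncated to 32 bits. */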
+ ctxt->eip = ctxt->_eip;
+ if (ctxt->mode != X86EMUL_MODE_PROT64)
+ ctxt->eip = (u32)ctxt->_eip;
done:
- return (rc == X86EMUL_UNHANDLEABLE) ? -1 : 0;
-
-twobyte_insn:
- switch (c->b) {
- case 0x01: /* lgdt, lidt, lmsw */
- switch (c->modrm_reg) {
- u16 size;
- unsigned long address;
+ if (rc == X86EMUL_PROPAGATE_FAULT) {
+ if (KVM_EMULATOR_BUG_ON(ctxt->exception.vector > 0x1f, ctxt))
+ return EMULATION_FAILED;
+ ctxt->have_exception = true;
+ }
+ if (rc == X86EMUL_INTERCEPTED)
+ return EMULATION_INTERCEPTED;
- case 0: /* vmcall */
- if (c->modrm_mod != 3 || c->modrm_rm != 1)
- goto cannot_emulate;
+ if (rc == X86EMUL_CONTINUE)
+ writeback_registers(ctxt);
- rc = kvm_fix_hypercall(ctxt->vcpu);
- if (rc != X86EMUL_CONTINUE)
- goto done;
+ return (rc == X86EMUL_UNHANDLEABLE) ? EMULATION_FAILED : EMULATION_OK;
- /* Let the processor re-execute the fixed hypercall */
- c->eip = ctxt->eip;
- /* Disable writeback. */
- c->dst.type = OP_NONE;
- break;
- case 2: /* lgdt */
- rc = read_descriptor(ctxt, ops, c->src.ptr,
- &size, &address, c->op_bytes);
- if (rc != X86EMUL_CONTINUE)
- goto done;
- realmode_lgdt(ctxt->vcpu, size, address);
- /* Disable writeback. */
- c->dst.type = OP_NONE;
- break;
- case 3: /* lidt/vmmcall */
- if (c->modrm_mod == 3) {
- switch (c->modrm_rm) {
- case 1:
- rc = kvm_fix_hypercall(ctxt->vcpu);
- if (rc != X86EMUL_CONTINUE)
- goto done;
- break;
- default:
- goto cannot_emulate;
- }
- } else {
- rc = read_descriptor(ctxt, ops, c->src.ptr,
- &size, &address,
- c->op_bytes);
- if (rc != X86EMUL_CONTINUE)
- goto done;
- realmode_lidt(ctxt->vcpu, size, address);
- }
- /* Disable writeback. */
- c->dst.type = OP_NONE;
- break;
- case 4: /* smsw */
- c->dst.bytes = 2;
- c->dst.val = ops->get_cr(0, ctxt->vcpu);
- break;
- case 6: /* lmsw */
- ops->set_cr(0, (ops->get_cr(0, ctxt->vcpu) & ~0x0ful) |
- (c->src.val & 0x0f), ctxt->vcpu);
- c->dst.type = OP_NONE;
- break;
- case 5: /* not defined */
- kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
- goto done;
- case 7: /* invlpg*/
- emulate_invlpg(ctxt->vcpu, c->modrm_ea);
- /* Disable writeback. */
- c->dst.type = OP_NONE;
- break;
- default:
- goto cannot_emulate;
- }
- break;
- case 0x05: /* syscall */
- rc = emulate_syscall(ctxt);
- if (rc != X86EMUL_CONTINUE)
- goto done;
- else
- goto writeback;
- break;
- case 0x06:
- emulate_clts(ctxt->vcpu);
- c->dst.type = OP_NONE;
+twobyte_insn:
+ switch (ctxt->b) {
+ case 0x09: /* wbinvd */
+ (ctxt->ops->wbinvd)(ctxt);
break;
case 0x08: /* invd */
- case 0x09: /* wbinvd */
case 0x0d: /* GrpP (prefetch) */
case 0x18: /* Grp16 (prefetch/nop) */
- c->dst.type = OP_NONE;
+ case 0x1f: /* nop */
break;
case 0x20: /* mov cr, reg */
- switch (c->modrm_reg) {
- case 1:
- case 5 ... 7:
- case 9 ... 15:
- kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
- goto done;
- }
- c->regs[c->modrm_rm] = ops->get_cr(c->modrm_reg, ctxt->vcpu);
- c->dst.type = OP_NONE; /* no writeback */
+ ctxt->dst.val = ops->get_cr(ctxt, ctxt->modrm_reg);
break;
case 0x21: /* mov from dr to reg */
- if ((ops->get_cr(4, ctxt->vcpu) & X86_CR4_DE) &&
- (c->modrm_reg == 4 || c->modrm_reg == 5)) {
- kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
- goto done;
- }
- emulator_get_dr(ctxt, c->modrm_reg, &c->regs[c->modrm_rm]);
- c->dst.type = OP_NONE; /* no writeback */
- break;
- case 0x22: /* mov reg, cr */
- ops->set_cr(c->modrm_reg, c->modrm_val, ctxt->vcpu);
- c->dst.type = OP_NONE;
- break;
- case 0x23: /* mov from reg to dr */
- if ((ops->get_cr(4, ctxt->vcpu) & X86_CR4_DE) &&
- (c->modrm_reg == 4 || c->modrm_reg == 5)) {
- kvm_queue_exception(ctxt->vcpu, UD_VECTOR);
- goto done;
- }
- emulator_set_dr(ctxt, c->modrm_reg, c->regs[c->modrm_rm]);
- c->dst.type = OP_NONE; /* no writeback */
- break;
- case 0x30:
- /* wrmsr */
- msr_data = (u32)c->regs[VCPU_REGS_RAX]
- | ((u64)c->regs[VCPU_REGS_RDX] << 32);
- if (kvm_set_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], msr_data)) {
- kvm_inject_gp(ctxt->vcpu, 0);
- goto done;
- }
- rc = X86EMUL_CONTINUE;
- c->dst.type = OP_NONE;
- break;
- case 0x32:
- /* rdmsr */
- if (kvm_get_msr(ctxt->vcpu, c->regs[VCPU_REGS_RCX], &msr_data)) {
- kvm_inject_gp(ctxt->vcpu, 0);
- goto done;
- } else {
- c->regs[VCPU_REGS_RAX] = (u32)msr_data;
- c->regs[VCPU_REGS_RDX] = msr_data >> 32;
- }
- rc = X86EMUL_CONTINUE;
- c->dst.type = OP_NONE;
- break;
- case 0x34: /* sysenter */
- rc = emulate_sysenter(ctxt);
- if (rc != X86EMUL_CONTINUE)
- goto done;
- else
- goto writeback;
- break;
- case 0x35: /* sysexit */
- rc = emulate_sysexit(ctxt);
- if (rc != X86EMUL_CONTINUE)
- goto done;
- else
- goto writeback;
+ ctxt->dst.val = ops->get_dr(ctxt, ctxt->modrm_reg);
break;
case 0x40 ... 0x4f: /* cmov */
- c->dst.val = c->dst.orig_val = c->src.val;
- if (!test_cc(c->b, ctxt->eflags))
- c->dst.type = OP_NONE; /* no writeback */
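+		/*
+		 * A false condition still zero-extends a 32-bit destination in
+		 * 64-bit mode, so skip writeback only for 16/64-bit operands.
+		 */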
+ if (test_cc(ctxt->b, ctxt->eflags))
+ ctxt->dst.val = ctxt->src.val;
+ else if (ctxt->op_bytes != 4)
+ ctxt->dst.type = OP_NONE; /* no writeback */
break;
case 0x80 ... 0x8f: /* jnz rel, etc*/
- if (test_cc(c->b, ctxt->eflags))
- jmp_rel(c, c->src.val);
- c->dst.type = OP_NONE;
- break;
- case 0xa0: /* push fs */
- emulate_push_sreg(ctxt, VCPU_SREG_FS);
- break;
- case 0xa1: /* pop fs */
- rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_FS);
- if (rc != X86EMUL_CONTINUE)
- goto done;
- break;
- case 0xa3:
- bt: /* bt */
- c->dst.type = OP_NONE;
- /* only subword offset */
- c->src.val &= (c->dst.bytes << 3) - 1;
- emulate_2op_SrcV_nobyte("bt", c->src, c->dst, ctxt->eflags);
+ if (test_cc(ctxt->b, ctxt->eflags))
+ rc = jmp_rel(ctxt, ctxt->src.val);
break;
- case 0xa4: /* shld imm8, r, r/m */
- case 0xa5: /* shld cl, r, r/m */
- emulate_2op_cl("shld", c->src2, c->src, c->dst, ctxt->eflags);
- break;
- case 0xa8: /* push gs */
- emulate_push_sreg(ctxt, VCPU_SREG_GS);
- break;
- case 0xa9: /* pop gs */
- rc = emulate_pop_sreg(ctxt, ops, VCPU_SREG_GS);
- if (rc != X86EMUL_CONTINUE)
- goto done;
- break;
- case 0xab:
- bts: /* bts */
- /* only subword offset */
- c->src.val &= (c->dst.bytes << 3) - 1;
- emulate_2op_SrcV_nobyte("bts", c->src, c->dst, ctxt->eflags);
- break;
- case 0xac: /* shrd imm8, r, r/m */
- case 0xad: /* shrd cl, r, r/m */
- emulate_2op_cl("shrd", c->src2, c->src, c->dst, ctxt->eflags);
- break;
- case 0xae: /* clflush */
- break;
- case 0xb0 ... 0xb1: /* cmpxchg */
- /*
- * Save real source value, then compare EAX against
- * destination.
- */
- c->src.orig_val = c->src.val;
- c->src.val = c->regs[VCPU_REGS_RAX];
- emulate_2op_SrcV("cmp", c->src, c->dst, ctxt->eflags);
- if (ctxt->eflags & EFLG_ZF) {
- /* Success: write back to memory. */
- c->dst.val = c->src.orig_val;
- } else {
- /* Failure: write the value we saw to EAX. */
- c->dst.type = OP_REG;
- c->dst.ptr = (unsigned long *)&c->regs[VCPU_REGS_RAX];
- }
- break;
- case 0xb3:
- btr: /* btr */
- /* only subword offset */
- c->src.val &= (c->dst.bytes << 3) - 1;
- emulate_2op_SrcV_nobyte("btr", c->src, c->dst, ctxt->eflags);
+ case 0x90 ... 0x9f: /* setcc r/m8 */
+ ctxt->dst.val = test_cc(ctxt->b, ctxt->eflags);
break;
case 0xb6 ... 0xb7: /* movzx */
- c->dst.bytes = c->op_bytes;
- c->dst.val = (c->d & ByteOp) ? (u8) c->src.val
- : (u16) c->src.val;
- break;
- case 0xba: /* Grp8 */
- switch (c->modrm_reg & 3) {
- case 0:
- goto bt;
- case 1:
- goto bts;
- case 2:
- goto btr;
- case 3:
- goto btc;
- }
- break;
- case 0xbb:
- btc: /* btc */
- /* only subword offset */
- c->src.val &= (c->dst.bytes << 3) - 1;
- emulate_2op_SrcV_nobyte("btc", c->src, c->dst, ctxt->eflags);
+ ctxt->dst.bytes = ctxt->op_bytes;
+ ctxt->dst.val = (ctxt->src.bytes == 1) ? (u8) ctxt->src.val
+ : (u16) ctxt->src.val;
break;
case 0xbe ... 0xbf: /* movsx */
- c->dst.bytes = c->op_bytes;
- c->dst.val = (c->d & ByteOp) ? (s8) c->src.val :
- (s16) c->src.val;
- break;
- case 0xc3: /* movnti */
- c->dst.bytes = c->op_bytes;
- c->dst.val = (c->op_bytes == 4) ? (u32) c->src.val :
- (u64) c->src.val;
- break;
- case 0xc7: /* Grp9 (cmpxchg8b) */
- rc = emulate_grp9(ctxt, ops);
- if (rc != X86EMUL_CONTINUE)
- goto done;
+ ctxt->dst.bytes = ctxt->op_bytes;
+ ctxt->dst.val = (ctxt->src.bytes == 1) ? (s8) ctxt->src.val :
+ (s16) ctxt->src.val;
break;
+ default:
+ goto cannot_emulate;
}
+
+threebyte_insn:
+
+ if (rc != X86EMUL_CONTINUE)
+ goto done;
+
goto writeback;
cannot_emulate:
- DPRINTF("Cannot emulate %02x\n", c->b);
- return -1;
+ return EMULATION_FAILED;
+}
+
+void emulator_invalidate_register_cache(struct x86_emulate_ctxt *ctxt)
+{
+ invalidate_registers(ctxt);
+}
+
+void emulator_writeback_register_cache(struct x86_emulate_ctxt *ctxt)
+{
+ writeback_registers(ctxt);
+}
+
+bool emulator_can_use_gpa(struct x86_emulate_ctxt *ctxt)
+{
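+	/*
+	 * String and two-memory-operand instructions touch more than one
+	 * linear address, so a single cached GPA cannot cover the access.
+	 */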
+ if (ctxt->rep_prefix && (ctxt->d & String))
+ return false;
+
+ if (ctxt->d & TwoMemOp)
+ return false;
+
+ return true;
}
diff --git a/arch/x86/kvm/fpu.h b/arch/x86/kvm/fpu.h
new file mode 100644
index 000000000000..f898781b6a06
--- /dev/null
+++ b/arch/x86/kvm/fpu.h
@@ -0,0 +1,206 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __KVM_FPU_H_
+#define __KVM_FPU_H_
+
+#include <asm/fpu/api.h>
+
+typedef u32 __attribute__((vector_size(16))) sse128_t;
+#define __sse128_u union { sse128_t vec; u64 as_u64[2]; u32 as_u32[4]; }
+#define sse128_lo(x) ({ __sse128_u t; t.vec = x; t.as_u64[0]; })
+#define sse128_hi(x) ({ __sse128_u t; t.vec = x; t.as_u64[1]; })
+#define sse128_l0(x) ({ __sse128_u t; t.vec = x; t.as_u32[0]; })
+#define sse128_l1(x) ({ __sse128_u t; t.vec = x; t.as_u32[1]; })
+#define sse128_l2(x) ({ __sse128_u t; t.vec = x; t.as_u32[2]; })
+#define sse128_l3(x) ({ __sse128_u t; t.vec = x; t.as_u32[3]; })
+#define sse128(lo, hi) ({ __sse128_u t; t.as_u64[0] = lo; t.as_u64[1] = hi; t.vec; })
+
+typedef u32 __attribute__((vector_size(32))) avx256_t;
+
+static inline void _kvm_read_avx_reg(int reg, avx256_t *data)
+{
+ switch (reg) {
+ case 0: asm("vmovdqa %%ymm0, %0" : "=m"(*data)); break;
+ case 1: asm("vmovdqa %%ymm1, %0" : "=m"(*data)); break;
+ case 2: asm("vmovdqa %%ymm2, %0" : "=m"(*data)); break;
+ case 3: asm("vmovdqa %%ymm3, %0" : "=m"(*data)); break;
+ case 4: asm("vmovdqa %%ymm4, %0" : "=m"(*data)); break;
+ case 5: asm("vmovdqa %%ymm5, %0" : "=m"(*data)); break;
+ case 6: asm("vmovdqa %%ymm6, %0" : "=m"(*data)); break;
+ case 7: asm("vmovdqa %%ymm7, %0" : "=m"(*data)); break;
+#ifdef CONFIG_X86_64
+ case 8: asm("vmovdqa %%ymm8, %0" : "=m"(*data)); break;
+ case 9: asm("vmovdqa %%ymm9, %0" : "=m"(*data)); break;
+ case 10: asm("vmovdqa %%ymm10, %0" : "=m"(*data)); break;
+ case 11: asm("vmovdqa %%ymm11, %0" : "=m"(*data)); break;
+ case 12: asm("vmovdqa %%ymm12, %0" : "=m"(*data)); break;
+ case 13: asm("vmovdqa %%ymm13, %0" : "=m"(*data)); break;
+ case 14: asm("vmovdqa %%ymm14, %0" : "=m"(*data)); break;
+ case 15: asm("vmovdqa %%ymm15, %0" : "=m"(*data)); break;
+#endif
+ default: BUG();
+ }
+}
+
+static inline void _kvm_write_avx_reg(int reg, const avx256_t *data)
+{
+ switch (reg) {
+ case 0: asm("vmovdqa %0, %%ymm0" : : "m"(*data)); break;
+ case 1: asm("vmovdqa %0, %%ymm1" : : "m"(*data)); break;
+ case 2: asm("vmovdqa %0, %%ymm2" : : "m"(*data)); break;
+ case 3: asm("vmovdqa %0, %%ymm3" : : "m"(*data)); break;
+ case 4: asm("vmovdqa %0, %%ymm4" : : "m"(*data)); break;
+ case 5: asm("vmovdqa %0, %%ymm5" : : "m"(*data)); break;
+ case 6: asm("vmovdqa %0, %%ymm6" : : "m"(*data)); break;
+ case 7: asm("vmovdqa %0, %%ymm7" : : "m"(*data)); break;
+#ifdef CONFIG_X86_64
+ case 8: asm("vmovdqa %0, %%ymm8" : : "m"(*data)); break;
+ case 9: asm("vmovdqa %0, %%ymm9" : : "m"(*data)); break;
+ case 10: asm("vmovdqa %0, %%ymm10" : : "m"(*data)); break;
+ case 11: asm("vmovdqa %0, %%ymm11" : : "m"(*data)); break;
+ case 12: asm("vmovdqa %0, %%ymm12" : : "m"(*data)); break;
+ case 13: asm("vmovdqa %0, %%ymm13" : : "m"(*data)); break;
+ case 14: asm("vmovdqa %0, %%ymm14" : : "m"(*data)); break;
+ case 15: asm("vmovdqa %0, %%ymm15" : : "m"(*data)); break;
+#endif
+ default: BUG();
+ }
+}
+
+static inline void _kvm_read_sse_reg(int reg, sse128_t *data)
+{
+ switch (reg) {
+ case 0: asm("movdqa %%xmm0, %0" : "=m"(*data)); break;
+ case 1: asm("movdqa %%xmm1, %0" : "=m"(*data)); break;
+ case 2: asm("movdqa %%xmm2, %0" : "=m"(*data)); break;
+ case 3: asm("movdqa %%xmm3, %0" : "=m"(*data)); break;
+ case 4: asm("movdqa %%xmm4, %0" : "=m"(*data)); break;
+ case 5: asm("movdqa %%xmm5, %0" : "=m"(*data)); break;
+ case 6: asm("movdqa %%xmm6, %0" : "=m"(*data)); break;
+ case 7: asm("movdqa %%xmm7, %0" : "=m"(*data)); break;
+#ifdef CONFIG_X86_64
+ case 8: asm("movdqa %%xmm8, %0" : "=m"(*data)); break;
+ case 9: asm("movdqa %%xmm9, %0" : "=m"(*data)); break;
+ case 10: asm("movdqa %%xmm10, %0" : "=m"(*data)); break;
+ case 11: asm("movdqa %%xmm11, %0" : "=m"(*data)); break;
+ case 12: asm("movdqa %%xmm12, %0" : "=m"(*data)); break;
+ case 13: asm("movdqa %%xmm13, %0" : "=m"(*data)); break;
+ case 14: asm("movdqa %%xmm14, %0" : "=m"(*data)); break;
+ case 15: asm("movdqa %%xmm15, %0" : "=m"(*data)); break;
+#endif
+ default: BUG();
+ }
+}
+
+static inline void _kvm_write_sse_reg(int reg, const sse128_t *data)
+{
+ switch (reg) {
+ case 0: asm("movdqa %0, %%xmm0" : : "m"(*data)); break;
+ case 1: asm("movdqa %0, %%xmm1" : : "m"(*data)); break;
+ case 2: asm("movdqa %0, %%xmm2" : : "m"(*data)); break;
+ case 3: asm("movdqa %0, %%xmm3" : : "m"(*data)); break;
+ case 4: asm("movdqa %0, %%xmm4" : : "m"(*data)); break;
+ case 5: asm("movdqa %0, %%xmm5" : : "m"(*data)); break;
+ case 6: asm("movdqa %0, %%xmm6" : : "m"(*data)); break;
+ case 7: asm("movdqa %0, %%xmm7" : : "m"(*data)); break;
+#ifdef CONFIG_X86_64
+ case 8: asm("movdqa %0, %%xmm8" : : "m"(*data)); break;
+ case 9: asm("movdqa %0, %%xmm9" : : "m"(*data)); break;
+ case 10: asm("movdqa %0, %%xmm10" : : "m"(*data)); break;
+ case 11: asm("movdqa %0, %%xmm11" : : "m"(*data)); break;
+ case 12: asm("movdqa %0, %%xmm12" : : "m"(*data)); break;
+ case 13: asm("movdqa %0, %%xmm13" : : "m"(*data)); break;
+ case 14: asm("movdqa %0, %%xmm14" : : "m"(*data)); break;
+ case 15: asm("movdqa %0, %%xmm15" : : "m"(*data)); break;
+#endif
+ default: BUG();
+ }
+}
+
+static inline void _kvm_read_mmx_reg(int reg, u64 *data)
+{
+ switch (reg) {
+ case 0: asm("movq %%mm0, %0" : "=m"(*data)); break;
+ case 1: asm("movq %%mm1, %0" : "=m"(*data)); break;
+ case 2: asm("movq %%mm2, %0" : "=m"(*data)); break;
+ case 3: asm("movq %%mm3, %0" : "=m"(*data)); break;
+ case 4: asm("movq %%mm4, %0" : "=m"(*data)); break;
+ case 5: asm("movq %%mm5, %0" : "=m"(*data)); break;
+ case 6: asm("movq %%mm6, %0" : "=m"(*data)); break;
+ case 7: asm("movq %%mm7, %0" : "=m"(*data)); break;
+ default: BUG();
+ }
+}
+
+static inline void _kvm_write_mmx_reg(int reg, const u64 *data)
+{
+ switch (reg) {
+ case 0: asm("movq %0, %%mm0" : : "m"(*data)); break;
+ case 1: asm("movq %0, %%mm1" : : "m"(*data)); break;
+ case 2: asm("movq %0, %%mm2" : : "m"(*data)); break;
+ case 3: asm("movq %0, %%mm3" : : "m"(*data)); break;
+ case 4: asm("movq %0, %%mm4" : : "m"(*data)); break;
+ case 5: asm("movq %0, %%mm5" : : "m"(*data)); break;
+ case 6: asm("movq %0, %%mm6" : : "m"(*data)); break;
+ case 7: asm("movq %0, %%mm7" : : "m"(*data)); break;
+ default: BUG();
+ }
+}
+
+static inline void kvm_fpu_get(void)
+{
+ fpregs_lock();
+
+ fpregs_assert_state_consistent();
+ if (test_thread_flag(TIF_NEED_FPU_LOAD))
+ switch_fpu_return();
+}
+
+static inline void kvm_fpu_put(void)
+{
+ fpregs_unlock();
+}
+
+static inline void kvm_read_avx_reg(int reg, avx256_t *data)
+{
+ kvm_fpu_get();
+ _kvm_read_avx_reg(reg, data);
+ kvm_fpu_put();
+}
+
+static inline void kvm_write_avx_reg(int reg, const avx256_t *data)
+{
+ kvm_fpu_get();
+ _kvm_write_avx_reg(reg, data);
+ kvm_fpu_put();
+}
+
+static inline void kvm_read_sse_reg(int reg, sse128_t *data)
+{
+ kvm_fpu_get();
+ _kvm_read_sse_reg(reg, data);
+ kvm_fpu_put();
+}
+
+static inline void kvm_write_sse_reg(int reg, const sse128_t *data)
+{
+ kvm_fpu_get();
+ _kvm_write_sse_reg(reg, data);
+ kvm_fpu_put();
+}
+
+static inline void kvm_read_mmx_reg(int reg, u64 *data)
+{
+ kvm_fpu_get();
+ _kvm_read_mmx_reg(reg, data);
+ kvm_fpu_put();
+}
+
+static inline void kvm_write_mmx_reg(int reg, const u64 *data)
+{
+ kvm_fpu_get();
+ _kvm_write_mmx_reg(reg, data);
+ kvm_fpu_put();
+}
+
+#endif
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
new file mode 100644
index 000000000000..de92292eb1f5
--- /dev/null
+++ b/arch/x86/kvm/hyperv.c
@@ -0,0 +1,2925 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KVM Microsoft Hyper-V emulation
+ *
+ * derived from arch/x86/kvm/x86.c
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright IBM Corporation, 2008
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
+ * Copyright (C) 2015 Andrey Smetanin <asmetanin@virtuozzo.com>
+ *
+ * Authors:
+ * Avi Kivity <avi@qumranet.com>
+ * Yaniv Kamay <yaniv@qumranet.com>
+ * Amit Shah <amit.shah@qumranet.com>
+ * Ben-Ami Yassour <benami@il.ibm.com>
+ * Andrey Smetanin <asmetanin@virtuozzo.com>
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include "x86.h"
+#include "lapic.h"
+#include "ioapic.h"
+#include "cpuid.h"
+#include "hyperv.h"
+#include "mmu.h"
+#include "xen.h"
+
+#include <linux/cpu.h>
+#include <linux/kvm_host.h>
+#include <linux/highmem.h>
+#include <linux/sched/cputime.h>
+#include <linux/spinlock.h>
+#include <linux/eventfd.h>
+
+#include <asm/apicdef.h>
+#include <asm/mshyperv.h>
+#include <trace/events/kvm.h>
+
+#include "trace.h"
+#include "irq.h"
+#include "fpu.h"
+
+#define KVM_HV_MAX_SPARSE_VCPU_SET_BITS DIV_ROUND_UP(KVM_MAX_VCPUS, HV_VCPUS_PER_SPARSE_BANK)
+
+/*
+ * As per Hyper-V TLFS, extended hypercalls start from 0x8001
+ * (HvExtCallQueryCapabilities). The response to this hypercall is a 64-bit
+ * value where each bit indicates which extended hypercall is available
+ * besides HvExtCallQueryCapabilities.
+ *
+ * 0x8001 - First extended hypercall, HvExtCallQueryCapabilities, no bit
+ * assigned.
+ *
+ * 0x8002 - Bit 0
+ * 0x8003 - Bit 1
+ * ..
+ * 0x8041 - Bit 63
+ *
+ * Therefore, HV_EXT_CALL_MAX = 0x8001 + 64
+ */
+#define HV_EXT_CALL_MAX (HV_EXT_CALL_QUERY_CAPABILITIES + 64)
+
+static void stimer_mark_pending(struct kvm_vcpu_hv_stimer *stimer,
+ bool vcpu_kick);
+
+static inline u64 synic_read_sint(struct kvm_vcpu_hv_synic *synic, int sint)
+{
+ return atomic64_read(&synic->sint[sint]);
+}
+
+static inline int synic_get_sint_vector(u64 sint_value)
+{
+ if (sint_value & HV_SYNIC_SINT_MASKED)
+ return -1;
+ return sint_value & HV_SYNIC_SINT_VECTOR_MASK;
+}
+
+static bool synic_has_vector_connected(struct kvm_vcpu_hv_synic *synic,
+ int vector)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(synic->sint); i++) {
+ if (synic_get_sint_vector(synic_read_sint(synic, i)) == vector)
+ return true;
+ }
+ return false;
+}
+
+static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
+ int vector)
+{
+ int i;
+ u64 sint_value;
+
+ for (i = 0; i < ARRAY_SIZE(synic->sint); i++) {
+ sint_value = synic_read_sint(synic, i);
+ if (synic_get_sint_vector(sint_value) == vector &&
+ sint_value & HV_SYNIC_SINT_AUTO_EOI)
+ return true;
+ }
+ return false;
+}
+
+static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
+ int vector)
+{
+ struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
+ struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
+ bool auto_eoi_old, auto_eoi_new;
+
+ if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
+ return;
+
+ if (synic_has_vector_connected(synic, vector))
+ __set_bit(vector, synic->vec_bitmap);
+ else
+ __clear_bit(vector, synic->vec_bitmap);
+
+ auto_eoi_old = !bitmap_empty(synic->auto_eoi_bitmap, 256);
+
+ if (synic_has_vector_auto_eoi(synic, vector))
+ __set_bit(vector, synic->auto_eoi_bitmap);
+ else
+ __clear_bit(vector, synic->auto_eoi_bitmap);
+
+ auto_eoi_new = !bitmap_empty(synic->auto_eoi_bitmap, 256);
+
+ if (auto_eoi_old == auto_eoi_new)
+ return;
+
+ if (!enable_apicv)
+ return;
+
+ down_write(&vcpu->kvm->arch.apicv_update_lock);
+
+ if (auto_eoi_new)
+ hv->synic_auto_eoi_used++;
+ else
+ hv->synic_auto_eoi_used--;
+
+ /*
+ * Inhibit APICv if any vCPU is using SynIC's AutoEOI, which relies on
+ * the hypervisor to manually inject IRQs.
+ */
+ __kvm_set_or_clear_apicv_inhibit(vcpu->kvm,
+ APICV_INHIBIT_REASON_HYPERV,
+ !!hv->synic_auto_eoi_used);
+
+ up_write(&vcpu->kvm->arch.apicv_update_lock);
+}
+
+static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
+ u64 data, bool host)
+{
+ int vector, old_vector;
+ bool masked;
+
+ vector = data & HV_SYNIC_SINT_VECTOR_MASK;
+ masked = data & HV_SYNIC_SINT_MASKED;
+
+ /*
+	 * Valid vectors are 16-255; however, nested Hyper-V attempts to write
+	 * the default '0x10000' value on boot, and this should not #GP. We
+	 * also need to allow zero-initializing the register from the host.
+ */
+ if (vector < HV_SYNIC_FIRST_VALID_VECTOR && !host && !masked)
+ return 1;
+ /*
+ * Guest may configure multiple SINTs to use the same vector, so
+ * we maintain a bitmap of vectors handled by synic, and a
+ * bitmap of vectors with auto-eoi behavior. The bitmaps are
+ * updated here, and atomically queried on fast paths.
+ */
+ old_vector = synic_read_sint(synic, sint) & HV_SYNIC_SINT_VECTOR_MASK;
+
+ atomic64_set(&synic->sint[sint], data);
+
+ synic_update_vector(synic, old_vector);
+
+ synic_update_vector(synic, vector);
+
+ /* Load SynIC vectors into EOI exit bitmap */
+ kvm_make_request(KVM_REQ_SCAN_IOAPIC, hv_synic_to_vcpu(synic));
+ return 0;
+}
+
+static struct kvm_vcpu *get_vcpu_by_vpidx(struct kvm *kvm, u32 vpidx)
+{
+ struct kvm_vcpu *vcpu = NULL;
+ unsigned long i;
+
+ if (vpidx >= KVM_MAX_VCPUS)
+ return NULL;
+
+ vcpu = kvm_get_vcpu(kvm, vpidx);
+ if (vcpu && kvm_hv_get_vpindex(vcpu) == vpidx)
+ return vcpu;
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ if (kvm_hv_get_vpindex(vcpu) == vpidx)
+ return vcpu;
+ return NULL;
+}
+
+static struct kvm_vcpu_hv_synic *synic_get(struct kvm *kvm, u32 vpidx)
+{
+ struct kvm_vcpu *vcpu;
+ struct kvm_vcpu_hv_synic *synic;
+
+ vcpu = get_vcpu_by_vpidx(kvm, vpidx);
+ if (!vcpu || !to_hv_vcpu(vcpu))
+ return NULL;
+ synic = to_hv_synic(vcpu);
+ return (synic->active) ? synic : NULL;
+}
+
+static void kvm_hv_notify_acked_sint(struct kvm_vcpu *vcpu, u32 sint)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_vcpu_hv_synic *synic = to_hv_synic(vcpu);
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ struct kvm_vcpu_hv_stimer *stimer;
+ int gsi, idx;
+
+ trace_kvm_hv_notify_acked_sint(vcpu->vcpu_id, sint);
+
+	/* Try to deliver pending Hyper-V SynIC timer messages */
+ for (idx = 0; idx < ARRAY_SIZE(hv_vcpu->stimer); idx++) {
+ stimer = &hv_vcpu->stimer[idx];
+ if (stimer->msg_pending && stimer->config.enable &&
+ !stimer->config.direct_mode &&
+ stimer->config.sintx == sint)
+ stimer_mark_pending(stimer, false);
+ }
+
+ idx = srcu_read_lock(&kvm->irq_srcu);
+ gsi = atomic_read(&synic->sint_to_gsi[sint]);
+ if (gsi != -1)
+ kvm_notify_acked_gsi(kvm, gsi);
+ srcu_read_unlock(&kvm->irq_srcu, idx);
+}
+
+static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr)
+{
+ struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ hv_vcpu->exit.type = KVM_EXIT_HYPERV_SYNIC;
+ hv_vcpu->exit.u.synic.msr = msr;
+ hv_vcpu->exit.u.synic.control = synic->control;
+ hv_vcpu->exit.u.synic.evt_page = synic->evt_page;
+ hv_vcpu->exit.u.synic.msg_page = synic->msg_page;
+
+ kvm_make_request(KVM_REQ_HV_EXIT, vcpu);
+}
+
+static int synic_set_msr(struct kvm_vcpu_hv_synic *synic,
+ u32 msr, u64 data, bool host)
+{
+ struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
+ int ret;
+
+ if (!synic->active && (!host || data))
+ return 1;
+
+ trace_kvm_hv_synic_set_msr(vcpu->vcpu_id, msr, data, host);
+
+ ret = 0;
+ switch (msr) {
+ case HV_X64_MSR_SCONTROL:
+ synic->control = data;
+ if (!host)
+ synic_exit(synic, msr);
+ break;
+ case HV_X64_MSR_SVERSION:
+ if (!host) {
+ ret = 1;
+ break;
+ }
+ synic->version = data;
+ break;
+ case HV_X64_MSR_SIEFP:
+ if ((data & HV_SYNIC_SIEFP_ENABLE) && !host &&
+ !synic->dont_zero_synic_pages)
+ if (kvm_clear_guest(vcpu->kvm,
+ data & PAGE_MASK, PAGE_SIZE)) {
+ ret = 1;
+ break;
+ }
+ synic->evt_page = data;
+ if (!host)
+ synic_exit(synic, msr);
+ break;
+ case HV_X64_MSR_SIMP:
+ if ((data & HV_SYNIC_SIMP_ENABLE) && !host &&
+ !synic->dont_zero_synic_pages)
+ if (kvm_clear_guest(vcpu->kvm,
+ data & PAGE_MASK, PAGE_SIZE)) {
+ ret = 1;
+ break;
+ }
+ synic->msg_page = data;
+ if (!host)
+ synic_exit(synic, msr);
+ break;
+ case HV_X64_MSR_EOM: {
+ int i;
+
+ if (!synic->active)
+ break;
+
+ for (i = 0; i < ARRAY_SIZE(synic->sint); i++)
+ kvm_hv_notify_acked_sint(vcpu, i);
+ break;
+ }
+ case HV_X64_MSR_SINT0 ... HV_X64_MSR_SINT15:
+ ret = synic_set_sint(synic, msr - HV_X64_MSR_SINT0, data, host);
+ break;
+ default:
+ ret = 1;
+ break;
+ }
+ return ret;
+}
+
+static bool kvm_hv_is_syndbg_enabled(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ return hv_vcpu->cpuid_cache.syndbg_cap_eax &
+ HV_X64_SYNDBG_CAP_ALLOW_KERNEL_DEBUGGING;
+}
+
+static int kvm_hv_syndbg_complete_userspace(struct kvm_vcpu *vcpu)
+{
+ struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
+
+ if (vcpu->run->hyperv.u.syndbg.msr == HV_X64_MSR_SYNDBG_CONTROL)
+ hv->hv_syndbg.control.status =
+ vcpu->run->hyperv.u.syndbg.status;
+ return 1;
+}
+
+static void syndbg_exit(struct kvm_vcpu *vcpu, u32 msr)
+{
+ struct kvm_hv_syndbg *syndbg = to_hv_syndbg(vcpu);
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ hv_vcpu->exit.type = KVM_EXIT_HYPERV_SYNDBG;
+ hv_vcpu->exit.u.syndbg.msr = msr;
+ hv_vcpu->exit.u.syndbg.control = syndbg->control.control;
+ hv_vcpu->exit.u.syndbg.send_page = syndbg->control.send_page;
+ hv_vcpu->exit.u.syndbg.recv_page = syndbg->control.recv_page;
+ hv_vcpu->exit.u.syndbg.pending_page = syndbg->control.pending_page;
+ vcpu->arch.complete_userspace_io =
+ kvm_hv_syndbg_complete_userspace;
+
+ kvm_make_request(KVM_REQ_HV_EXIT, vcpu);
+}
+
+static int syndbg_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host)
+{
+ struct kvm_hv_syndbg *syndbg = to_hv_syndbg(vcpu);
+
+ if (!kvm_hv_is_syndbg_enabled(vcpu) && !host)
+ return 1;
+
+ trace_kvm_hv_syndbg_set_msr(vcpu->vcpu_id,
+ to_hv_vcpu(vcpu)->vp_index, msr, data);
+ switch (msr) {
+ case HV_X64_MSR_SYNDBG_CONTROL:
+ syndbg->control.control = data;
+ if (!host)
+ syndbg_exit(vcpu, msr);
+ break;
+ case HV_X64_MSR_SYNDBG_STATUS:
+ syndbg->control.status = data;
+ break;
+ case HV_X64_MSR_SYNDBG_SEND_BUFFER:
+ syndbg->control.send_page = data;
+ break;
+ case HV_X64_MSR_SYNDBG_RECV_BUFFER:
+ syndbg->control.recv_page = data;
+ break;
+ case HV_X64_MSR_SYNDBG_PENDING_BUFFER:
+ syndbg->control.pending_page = data;
+ if (!host)
+ syndbg_exit(vcpu, msr);
+ break;
+ case HV_X64_MSR_SYNDBG_OPTIONS:
+ syndbg->options = data;
+ break;
+ default:
+ break;
+ }
+
+ return 0;
+}
+
+static int syndbg_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata, bool host)
+{
+ struct kvm_hv_syndbg *syndbg = to_hv_syndbg(vcpu);
+
+ if (!kvm_hv_is_syndbg_enabled(vcpu) && !host)
+ return 1;
+
+ switch (msr) {
+ case HV_X64_MSR_SYNDBG_CONTROL:
+ *pdata = syndbg->control.control;
+ break;
+ case HV_X64_MSR_SYNDBG_STATUS:
+ *pdata = syndbg->control.status;
+ break;
+ case HV_X64_MSR_SYNDBG_SEND_BUFFER:
+ *pdata = syndbg->control.send_page;
+ break;
+ case HV_X64_MSR_SYNDBG_RECV_BUFFER:
+ *pdata = syndbg->control.recv_page;
+ break;
+ case HV_X64_MSR_SYNDBG_PENDING_BUFFER:
+ *pdata = syndbg->control.pending_page;
+ break;
+ case HV_X64_MSR_SYNDBG_OPTIONS:
+ *pdata = syndbg->options;
+ break;
+ default:
+ break;
+ }
+
+ trace_kvm_hv_syndbg_get_msr(vcpu->vcpu_id, kvm_hv_get_vpindex(vcpu), msr, *pdata);
+
+ return 0;
+}
+
+static int synic_get_msr(struct kvm_vcpu_hv_synic *synic, u32 msr, u64 *pdata,
+ bool host)
+{
+ int ret;
+
+ if (!synic->active && !host)
+ return 1;
+
+ ret = 0;
+ switch (msr) {
+ case HV_X64_MSR_SCONTROL:
+ *pdata = synic->control;
+ break;
+ case HV_X64_MSR_SVERSION:
+ *pdata = synic->version;
+ break;
+ case HV_X64_MSR_SIEFP:
+ *pdata = synic->evt_page;
+ break;
+ case HV_X64_MSR_SIMP:
+ *pdata = synic->msg_page;
+ break;
+ case HV_X64_MSR_EOM:
+ *pdata = 0;
+ break;
+ case HV_X64_MSR_SINT0 ... HV_X64_MSR_SINT15:
+ *pdata = atomic64_read(&synic->sint[msr - HV_X64_MSR_SINT0]);
+ break;
+ default:
+ ret = 1;
+ break;
+ }
+ return ret;
+}
+
+static int synic_set_irq(struct kvm_vcpu_hv_synic *synic, u32 sint)
+{
+ struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
+ struct kvm_lapic_irq irq;
+ int ret, vector;
+
+ if (KVM_BUG_ON(!lapic_in_kernel(vcpu), vcpu->kvm))
+ return -EINVAL;
+
+ if (sint >= ARRAY_SIZE(synic->sint))
+ return -EINVAL;
+
+ vector = synic_get_sint_vector(synic_read_sint(synic, sint));
+ if (vector < 0)
+ return -ENOENT;
+
+ memset(&irq, 0, sizeof(irq));
+ irq.shorthand = APIC_DEST_SELF;
+ irq.dest_mode = APIC_DEST_PHYSICAL;
+ irq.delivery_mode = APIC_DM_FIXED;
+ irq.vector = vector;
+ irq.level = 1;
+
+ ret = kvm_irq_delivery_to_apic(vcpu->kvm, vcpu->arch.apic, &irq, NULL);
+ trace_kvm_hv_synic_set_irq(vcpu->vcpu_id, sint, irq.vector, ret);
+ return ret;
+}
+
+int kvm_hv_synic_set_irq(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm,
+ int irq_source_id, int level, bool line_status)
+{
+ struct kvm_vcpu_hv_synic *synic;
+
+ if (!level)
+ return -1;
+
+ synic = synic_get(kvm, e->hv_sint.vcpu);
+ if (!synic)
+ return -EINVAL;
+
+ return synic_set_irq(synic, e->hv_sint.sint);
+}
+
+void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int vector)
+{
+ struct kvm_vcpu_hv_synic *synic = to_hv_synic(vcpu);
+ int i;
+
+ trace_kvm_hv_synic_send_eoi(vcpu->vcpu_id, vector);
+
+ for (i = 0; i < ARRAY_SIZE(synic->sint); i++)
+ if (synic_get_sint_vector(synic_read_sint(synic, i)) == vector)
+ kvm_hv_notify_acked_sint(vcpu, i);
+}
+
+static int kvm_hv_set_sint_gsi(struct kvm *kvm, u32 vpidx, u32 sint, int gsi)
+{
+ struct kvm_vcpu_hv_synic *synic;
+
+ synic = synic_get(kvm, vpidx);
+ if (!synic)
+ return -EINVAL;
+
+ if (sint >= ARRAY_SIZE(synic->sint_to_gsi))
+ return -EINVAL;
+
+ atomic_set(&synic->sint_to_gsi[sint], gsi);
+ return 0;
+}
+
+void kvm_hv_irq_routing_update(struct kvm *kvm)
+{
+ struct kvm_irq_routing_table *irq_rt;
+ struct kvm_kernel_irq_routing_entry *e;
+ u32 gsi;
+
+ irq_rt = srcu_dereference_check(kvm->irq_routing, &kvm->irq_srcu,
+ lockdep_is_held(&kvm->irq_lock));
+
+ for (gsi = 0; gsi < irq_rt->nr_rt_entries; gsi++) {
+ hlist_for_each_entry(e, &irq_rt->map[gsi], link) {
+ if (e->type == KVM_IRQ_ROUTING_HV_SINT)
+ kvm_hv_set_sint_gsi(kvm, e->hv_sint.vcpu,
+ e->hv_sint.sint, gsi);
+ }
+ }
+}
+
+static void synic_init(struct kvm_vcpu_hv_synic *synic)
+{
+ int i;
+
+ memset(synic, 0, sizeof(*synic));
+ synic->version = HV_SYNIC_VERSION_1;
+ for (i = 0; i < ARRAY_SIZE(synic->sint); i++) {
+ atomic64_set(&synic->sint[i], HV_SYNIC_SINT_MASKED);
+ atomic_set(&synic->sint_to_gsi[i], -1);
+ }
+}
+
+static u64 get_time_ref_counter(struct kvm *kvm)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+ struct kvm_vcpu *vcpu;
+ u64 tsc;
+
+ /*
+ * Fall back to get_kvmclock_ns() when TSC page hasn't been set up,
+ * is broken, disabled or being updated.
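+ * The Hyper-V reference counter is in 100ns units, hence the division
+ * of nanoseconds by 100 below.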
+ */
+ if (hv->hv_tsc_page_status != HV_TSC_PAGE_SET)
+ return div_u64(get_kvmclock_ns(kvm), 100);
+
+ vcpu = kvm_get_vcpu(kvm, 0);
+ tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+ return mul_u64_u64_shr(tsc, hv->tsc_ref.tsc_scale, 64)
+ + hv->tsc_ref.tsc_offset;
+}
+
+static void stimer_mark_pending(struct kvm_vcpu_hv_stimer *stimer,
+ bool vcpu_kick)
+{
+ struct kvm_vcpu *vcpu = hv_stimer_to_vcpu(stimer);
+
+ set_bit(stimer->index,
+ to_hv_vcpu(vcpu)->stimer_pending_bitmap);
+ kvm_make_request(KVM_REQ_HV_STIMER, vcpu);
+ if (vcpu_kick)
+ kvm_vcpu_kick(vcpu);
+}
+
+static void stimer_cleanup(struct kvm_vcpu_hv_stimer *stimer)
+{
+ struct kvm_vcpu *vcpu = hv_stimer_to_vcpu(stimer);
+
+ trace_kvm_hv_stimer_cleanup(hv_stimer_to_vcpu(stimer)->vcpu_id,
+ stimer->index);
+
+ hrtimer_cancel(&stimer->timer);
+ clear_bit(stimer->index,
+ to_hv_vcpu(vcpu)->stimer_pending_bitmap);
+ stimer->msg_pending = false;
+ stimer->exp_time = 0;
+}
+
+static enum hrtimer_restart stimer_timer_callback(struct hrtimer *timer)
+{
+ struct kvm_vcpu_hv_stimer *stimer;
+
+ stimer = container_of(timer, struct kvm_vcpu_hv_stimer, timer);
+ trace_kvm_hv_stimer_callback(hv_stimer_to_vcpu(stimer)->vcpu_id,
+ stimer->index);
+ stimer_mark_pending(stimer, true);
+
+ return HRTIMER_NORESTART;
+}
+
+/*
+ * stimer_start() assumptions:
+ * a) stimer->count is not equal to 0
+ * b) stimer->config has HV_STIMER_ENABLE flag
+ */
+static int stimer_start(struct kvm_vcpu_hv_stimer *stimer)
+{
+ u64 time_now;
+ ktime_t ktime_now;
+
+ time_now = get_time_ref_counter(hv_stimer_to_vcpu(stimer)->kvm);
+ ktime_now = ktime_get();
+
+ if (stimer->config.periodic) {
+ if (stimer->exp_time) {
+ if (time_now >= stimer->exp_time) {
+ u64 remainder;
+
+ div64_u64_rem(time_now - stimer->exp_time,
+ stimer->count, &remainder);
+ stimer->exp_time =
+ time_now + (stimer->count - remainder);
+ }
+ } else
+ stimer->exp_time = time_now + stimer->count;
+
+ trace_kvm_hv_stimer_start_periodic(
+ hv_stimer_to_vcpu(stimer)->vcpu_id,
+ stimer->index,
+ time_now, stimer->exp_time);
+
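+ /* exp_time and time_now are in 100ns units; ktime_add_ns() wants ns. */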
+ hrtimer_start(&stimer->timer,
+ ktime_add_ns(ktime_now,
+ 100 * (stimer->exp_time - time_now)),
+ HRTIMER_MODE_ABS);
+ return 0;
+ }
+ stimer->exp_time = stimer->count;
+ if (time_now >= stimer->count) {
+ /*
+ * Expire the timer according to the Hypervisor Top-Level Functional
+ * Specification v4 (15.3.1):
+ * "If a one shot is enabled and the specified count is in
+ * the past, it will expire immediately."
+ */
+ stimer_mark_pending(stimer, false);
+ return 0;
+ }
+
+ trace_kvm_hv_stimer_start_one_shot(hv_stimer_to_vcpu(stimer)->vcpu_id,
+ stimer->index,
+ time_now, stimer->count);
+
+ hrtimer_start(&stimer->timer,
+ ktime_add_ns(ktime_now, 100 * (stimer->count - time_now)),
+ HRTIMER_MODE_ABS);
+ return 0;
+}
+
+static int stimer_set_config(struct kvm_vcpu_hv_stimer *stimer, u64 config,
+ bool host)
+{
+ union hv_stimer_config new_config = {.as_uint64 = config},
+ old_config = {.as_uint64 = stimer->config.as_uint64};
+ struct kvm_vcpu *vcpu = hv_stimer_to_vcpu(stimer);
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ struct kvm_vcpu_hv_synic *synic = to_hv_synic(vcpu);
+
+ if (!synic->active && (!host || config))
+ return 1;
+
+ if (unlikely(!host && hv_vcpu->enforce_cpuid && new_config.direct_mode &&
+ !(hv_vcpu->cpuid_cache.features_edx &
+ HV_STIMER_DIRECT_MODE_AVAILABLE)))
+ return 1;
+
+ trace_kvm_hv_stimer_set_config(hv_stimer_to_vcpu(stimer)->vcpu_id,
+ stimer->index, config, host);
+
+ stimer_cleanup(stimer);
+ if (old_config.enable &&
+ !new_config.direct_mode && new_config.sintx == 0)
+ new_config.enable = 0;
+ stimer->config.as_uint64 = new_config.as_uint64;
+
+ if (stimer->config.enable)
+ stimer_mark_pending(stimer, false);
+
+ return 0;
+}
+
+static int stimer_set_count(struct kvm_vcpu_hv_stimer *stimer, u64 count,
+ bool host)
+{
+ struct kvm_vcpu *vcpu = hv_stimer_to_vcpu(stimer);
+ struct kvm_vcpu_hv_synic *synic = to_hv_synic(vcpu);
+
+ if (!synic->active && (!host || count))
+ return 1;
+
+ trace_kvm_hv_stimer_set_count(hv_stimer_to_vcpu(stimer)->vcpu_id,
+ stimer->index, count, host);
+
+ stimer_cleanup(stimer);
+ stimer->count = count;
+ if (!host) {
+ if (stimer->count == 0)
+ stimer->config.enable = 0;
+ else if (stimer->config.auto_enable)
+ stimer->config.enable = 1;
+ }
+
+ if (stimer->config.enable)
+ stimer_mark_pending(stimer, false);
+
+ return 0;
+}
+
+static int stimer_get_config(struct kvm_vcpu_hv_stimer *stimer, u64 *pconfig)
+{
+ *pconfig = stimer->config.as_uint64;
+ return 0;
+}
+
+static int stimer_get_count(struct kvm_vcpu_hv_stimer *stimer, u64 *pcount)
+{
+ *pcount = stimer->count;
+ return 0;
+}
+
+static int synic_deliver_msg(struct kvm_vcpu_hv_synic *synic, u32 sint,
+ struct hv_message *src_msg, bool no_retry)
+{
+ struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
+ int msg_off = offsetof(struct hv_message_page, sint_message[sint]);
+ gfn_t msg_page_gfn;
+ struct hv_message_header hv_hdr;
+ int r;
+
+ if (!(synic->msg_page & HV_SYNIC_SIMP_ENABLE))
+ return -ENOENT;
+
+ msg_page_gfn = synic->msg_page >> PAGE_SHIFT;
+
+ /*
+ * Strictly following the spec-mandated ordering would assume setting
+ * .msg_pending before checking .message_type. However, this function
+ * is only called in vcpu context so the entire update is atomic from
+ * guest POV and thus the exact order here doesn't matter.
+ */
+ r = kvm_vcpu_read_guest_page(vcpu, msg_page_gfn, &hv_hdr.message_type,
+ msg_off + offsetof(struct hv_message,
+ header.message_type),
+ sizeof(hv_hdr.message_type));
+ if (r < 0)
+ return r;
+
+ if (hv_hdr.message_type != HVMSG_NONE) {
+ if (no_retry)
+ return 0;
+
+ hv_hdr.message_flags.msg_pending = 1;
+ r = kvm_vcpu_write_guest_page(vcpu, msg_page_gfn,
+ &hv_hdr.message_flags,
+ msg_off +
+ offsetof(struct hv_message,
+ header.message_flags),
+ sizeof(hv_hdr.message_flags));
+ if (r < 0)
+ return r;
+ return -EAGAIN;
+ }
+
+ r = kvm_vcpu_write_guest_page(vcpu, msg_page_gfn, src_msg, msg_off,
+ sizeof(src_msg->header) +
+ src_msg->header.payload_size);
+ if (r < 0)
+ return r;
+
+ r = synic_set_irq(synic, sint);
+ if (r < 0)
+ return r;
+ if (r == 0)
+ return -EFAULT;
+ return 0;
+}
+
+static int stimer_send_msg(struct kvm_vcpu_hv_stimer *stimer)
+{
+ struct kvm_vcpu *vcpu = hv_stimer_to_vcpu(stimer);
+ struct hv_message *msg = &stimer->msg;
+ struct hv_timer_message_payload *payload =
+ (struct hv_timer_message_payload *)&msg->u.payload;
+
+ /*
+ * To avoid piling up periodic ticks, don't retry message
+ * delivery for them (within "lazy" lost ticks policy).
+ */
+ bool no_retry = stimer->config.periodic;
+
+ payload->expiration_time = stimer->exp_time;
+ payload->delivery_time = get_time_ref_counter(vcpu->kvm);
+ return synic_deliver_msg(to_hv_synic(vcpu),
+ stimer->config.sintx, msg,
+ no_retry);
+}
+
+static int stimer_notify_direct(struct kvm_vcpu_hv_stimer *stimer)
+{
+ struct kvm_vcpu *vcpu = hv_stimer_to_vcpu(stimer);
+ struct kvm_lapic_irq irq = {
+ .delivery_mode = APIC_DM_FIXED,
+ .vector = stimer->config.apic_vector
+ };
+
+ if (lapic_in_kernel(vcpu))
+ return !kvm_apic_set_irq(vcpu, &irq, NULL);
+ return 0;
+}
+
+static void stimer_expiration(struct kvm_vcpu_hv_stimer *stimer)
+{
+ int r, direct = stimer->config.direct_mode;
+
+ stimer->msg_pending = true;
+ if (!direct)
+ r = stimer_send_msg(stimer);
+ else
+ r = stimer_notify_direct(stimer);
+ trace_kvm_hv_stimer_expiration(hv_stimer_to_vcpu(stimer)->vcpu_id,
+ stimer->index, direct, r);
+ if (!r) {
+ stimer->msg_pending = false;
+ if (!(stimer->config.periodic))
+ stimer->config.enable = 0;
+ }
+}
+
+void kvm_hv_process_stimers(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ struct kvm_vcpu_hv_stimer *stimer;
+ u64 time_now, exp_time;
+ int i;
+
+ if (!hv_vcpu)
+ return;
+
+ for (i = 0; i < ARRAY_SIZE(hv_vcpu->stimer); i++)
+ if (test_and_clear_bit(i, hv_vcpu->stimer_pending_bitmap)) {
+ stimer = &hv_vcpu->stimer[i];
+ if (stimer->config.enable) {
+ exp_time = stimer->exp_time;
+
+ if (exp_time) {
+ time_now =
+ get_time_ref_counter(vcpu->kvm);
+ if (time_now >= exp_time)
+ stimer_expiration(stimer);
+ }
+
+ if ((stimer->config.enable) &&
+ stimer->count) {
+ if (!stimer->msg_pending)
+ stimer_start(stimer);
+ } else
+ stimer_cleanup(stimer);
+ }
+ }
+}
+
+void kvm_hv_vcpu_uninit(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ int i;
+
+ if (!hv_vcpu)
+ return;
+
+ for (i = 0; i < ARRAY_SIZE(hv_vcpu->stimer); i++)
+ stimer_cleanup(&hv_vcpu->stimer[i]);
+
+ kfree(hv_vcpu);
+ vcpu->arch.hyperv = NULL;
+}
+
+bool kvm_hv_assist_page_enabled(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ if (!hv_vcpu)
+ return false;
+
+ if (!(hv_vcpu->hv_vapic & HV_X64_MSR_VP_ASSIST_PAGE_ENABLE))
+ return false;
+ return vcpu->arch.pv_eoi.msr_val & KVM_MSR_ENABLED;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_hv_assist_page_enabled);
+
+int kvm_hv_get_assist_page(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ if (!hv_vcpu || !kvm_hv_assist_page_enabled(vcpu))
+ return -EFAULT;
+
+ return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data,
+ &hv_vcpu->vp_assist_page, sizeof(struct hv_vp_assist_page));
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_hv_get_assist_page);
+
+static void stimer_prepare_msg(struct kvm_vcpu_hv_stimer *stimer)
+{
+ struct hv_message *msg = &stimer->msg;
+ struct hv_timer_message_payload *payload =
+ (struct hv_timer_message_payload *)&msg->u.payload;
+
+ memset(&msg->header, 0, sizeof(msg->header));
+ msg->header.message_type = HVMSG_TIMER_EXPIRED;
+ msg->header.payload_size = sizeof(*payload);
+
+ payload->timer_index = stimer->index;
+ payload->expiration_time = 0;
+ payload->delivery_time = 0;
+}
+
+static void stimer_init(struct kvm_vcpu_hv_stimer *stimer, int timer_index)
+{
+ memset(stimer, 0, sizeof(*stimer));
+ stimer->index = timer_index;
+ hrtimer_setup(&stimer->timer, stimer_timer_callback, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ stimer_prepare_msg(stimer);
+}
+
+int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ int i;
+
+ if (hv_vcpu)
+ return 0;
+
+ hv_vcpu = kzalloc(sizeof(struct kvm_vcpu_hv), GFP_KERNEL_ACCOUNT);
+ if (!hv_vcpu)
+ return -ENOMEM;
+
+ vcpu->arch.hyperv = hv_vcpu;
+ hv_vcpu->vcpu = vcpu;
+
+ synic_init(&hv_vcpu->synic);
+
+ bitmap_zero(hv_vcpu->stimer_pending_bitmap, HV_SYNIC_STIMER_COUNT);
+ for (i = 0; i < ARRAY_SIZE(hv_vcpu->stimer); i++)
+ stimer_init(&hv_vcpu->stimer[i], i);
+
+ hv_vcpu->vp_index = vcpu->vcpu_idx;
+
+ for (i = 0; i < HV_NR_TLB_FLUSH_FIFOS; i++) {
+ INIT_KFIFO(hv_vcpu->tlb_flush_fifo[i].entries);
+ spin_lock_init(&hv_vcpu->tlb_flush_fifo[i].write_lock);
+ }
+
+ return 0;
+}
+
+int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages)
+{
+ struct kvm_vcpu_hv_synic *synic;
+ int r;
+
+ r = kvm_hv_vcpu_init(vcpu);
+ if (r)
+ return r;
+
+ synic = to_hv_synic(vcpu);
+
+ synic->active = true;
+ synic->dont_zero_synic_pages = dont_zero_synic_pages;
+ synic->control = HV_SYNIC_CONTROL_ENABLE;
+ return 0;
+}
+
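+/*
+ * Partition-wide MSRs live in struct kvm_hv and are shared by all vCPUs,
+ * so accesses to them are serialized with hv_lock (see
+ * kvm_hv_{get,set}_msr_common() below).
+ */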
+static bool kvm_hv_msr_partition_wide(u32 msr)
+{
+ bool r = false;
+
+ switch (msr) {
+ case HV_X64_MSR_GUEST_OS_ID:
+ case HV_X64_MSR_HYPERCALL:
+ case HV_X64_MSR_REFERENCE_TSC:
+ case HV_X64_MSR_TIME_REF_COUNT:
+ case HV_X64_MSR_CRASH_CTL:
+ case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
+ case HV_X64_MSR_RESET:
+ case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
+ case HV_X64_MSR_TSC_EMULATION_CONTROL:
+ case HV_X64_MSR_TSC_EMULATION_STATUS:
+ case HV_X64_MSR_TSC_INVARIANT_CONTROL:
+ case HV_X64_MSR_SYNDBG_OPTIONS:
+ case HV_X64_MSR_SYNDBG_CONTROL ... HV_X64_MSR_SYNDBG_PENDING_BUFFER:
+ r = true;
+ break;
+ }
+
+ return r;
+}
+
+static int kvm_hv_msr_get_crash_data(struct kvm *kvm, u32 index, u64 *pdata)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+ size_t size = ARRAY_SIZE(hv->hv_crash_param);
+
+ if (WARN_ON_ONCE(index >= size))
+ return -EINVAL;
+
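+ /* array_index_nospec() clamps 'index' under speculation (Spectre v1). */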
+ *pdata = hv->hv_crash_param[array_index_nospec(index, size)];
+ return 0;
+}
+
+static int kvm_hv_msr_get_crash_ctl(struct kvm *kvm, u64 *pdata)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+
+ *pdata = hv->hv_crash_ctl;
+ return 0;
+}
+
+static int kvm_hv_msr_set_crash_ctl(struct kvm *kvm, u64 data)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+
+ hv->hv_crash_ctl = data & HV_CRASH_CTL_CRASH_NOTIFY;
+
+ return 0;
+}
+
+static int kvm_hv_msr_set_crash_data(struct kvm *kvm, u32 index, u64 data)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+ size_t size = ARRAY_SIZE(hv->hv_crash_param);
+
+ if (WARN_ON_ONCE(index >= size))
+ return -EINVAL;
+
+ hv->hv_crash_param[array_index_nospec(index, size)] = data;
+ return 0;
+}
+
+/*
+ * The kvmclock and Hyper-V TSC page use similar formulas, and converting
+ * between them is possible:
+ *
+ * kvmclock formula:
+ * nsec = (ticks - tsc_timestamp) * tsc_to_system_mul * 2^(tsc_shift-32)
+ * + system_time
+ *
+ * Hyper-V formula:
+ * nsec/100 = ticks * scale / 2^64 + offset
+ *
+ * When tsc_timestamp = system_time = 0, offset is zero in the Hyper-V formula.
+ * By dividing the kvmclock formula by 100 and equating what's left we get:
+ * ticks * scale / 2^64 = ticks * tsc_to_system_mul * 2^(tsc_shift-32) / 100
+ * scale / 2^64 = tsc_to_system_mul * 2^(tsc_shift-32) / 100
+ * scale = tsc_to_system_mul * 2^(32+tsc_shift) / 100
+ *
+ * Now expand the kvmclock formula and divide by 100:
+ * nsec = ticks * tsc_to_system_mul * 2^(tsc_shift-32)
+ * - tsc_timestamp * tsc_to_system_mul * 2^(tsc_shift-32)
+ * + system_time
+ * nsec/100 = ticks * tsc_to_system_mul * 2^(tsc_shift-32) / 100
+ * - tsc_timestamp * tsc_to_system_mul * 2^(tsc_shift-32) / 100
+ * + system_time / 100
+ *
+ * Replace tsc_to_system_mul * 2^(tsc_shift-32) / 100 by scale / 2^64:
+ * nsec/100 = ticks * scale / 2^64
+ * - tsc_timestamp * scale / 2^64
+ * + system_time / 100
+ *
+ * Equate with the Hyper-V formula so that ticks * scale / 2^64 cancels out:
+ * offset = system_time / 100 - tsc_timestamp * scale / 2^64
+ *
+ * These two equivalencies are implemented in this function.
+ */
+static bool compute_tsc_page_parameters(struct pvclock_vcpu_time_info *hv_clock,
+ struct ms_hyperv_tsc_page *tsc_ref)
+{
+ u64 max_mul;
+
+ if (!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT))
+ return false;
+
+ /*
+ * Check if scale would overflow; if so, fall back to the time ref counter:
+ * tsc_to_system_mul * 2^(tsc_shift+32) / 100 >= 2^64
+ * tsc_to_system_mul / 100 >= 2^(32-tsc_shift)
+ * tsc_to_system_mul >= 100 * 2^(32-tsc_shift)
+ */
+ max_mul = 100ull << (32 - hv_clock->tsc_shift);
+ if (hv_clock->tsc_to_system_mul >= max_mul)
+ return false;
+
+ /*
+ * Otherwise compute the scale and offset according to the formulas
+ * derived above.
+ */
+ tsc_ref->tsc_scale =
+ mul_u64_u32_div(1ULL << (32 + hv_clock->tsc_shift),
+ hv_clock->tsc_to_system_mul,
+ 100);
+
+ tsc_ref->tsc_offset = hv_clock->system_time;
+ do_div(tsc_ref->tsc_offset, 100);
+ tsc_ref->tsc_offset -=
+ mul_u64_u64_shr(hv_clock->tsc_timestamp, tsc_ref->tsc_scale, 64);
+ return true;
+}
+
+/*
+ * Don't touch TSC page values if the guest has opted for TSC emulation after
+ * migration. KVM doesn't fully support reenlightenment notifications and TSC
+ * access emulation and Hyper-V is known to expect the values in TSC page to
+ * stay constant before TSC access emulation is disabled from guest side
+ * (HV_X64_MSR_TSC_EMULATION_STATUS). KVM userspace is expected to preserve TSC
+ * frequency and guest visible TSC value across migration (and prevent it when
+ * TSC scaling is unsupported).
+ */
+static inline bool tsc_page_update_unsafe(struct kvm_hv *hv)
+{
+ return (hv->hv_tsc_page_status != HV_TSC_PAGE_GUEST_CHANGED) &&
+ hv->hv_tsc_emulation_control;
+}
+
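+/*
+ * Rough sketch of the guest-side counterpart (cf. hv_read_tsc_page_tsc() in
+ * the Hyper-V clocksource driver), showing what the sequence writes below
+ * pair with:
+ *
+ *	do {
+ *		seq = READ_ONCE(tsc_page->tsc_sequence);
+ *		if (!seq)
+ *			break;		/* fall back to the ref counter MSR */
+ *		tsc = rdtsc();
+ *		scale = READ_ONCE(tsc_page->tsc_scale);
+ *		offset = READ_ONCE(tsc_page->tsc_offset);
+ *	} while (READ_ONCE(tsc_page->tsc_sequence) != seq);
+ *	time = mul_u64_u64_shr(tsc, scale, 64) + offset;
+ *
+ * A sequence of 0 tells the guest the page is being updated and it must
+ * read HV_X64_MSR_TIME_REF_COUNT instead.
+ */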
+void kvm_hv_setup_tsc_page(struct kvm *kvm,
+ struct pvclock_vcpu_time_info *hv_clock)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+ u32 tsc_seq;
+ u64 gfn;
+
+ BUILD_BUG_ON(sizeof(tsc_seq) != sizeof(hv->tsc_ref.tsc_sequence));
+ BUILD_BUG_ON(offsetof(struct ms_hyperv_tsc_page, tsc_sequence) != 0);
+
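+ /* Scope-based lock: hv_lock is dropped automatically on every return path. */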
+ guard(mutex)(&hv->hv_lock);
+
+ if (hv->hv_tsc_page_status == HV_TSC_PAGE_BROKEN ||
+ hv->hv_tsc_page_status == HV_TSC_PAGE_SET ||
+ hv->hv_tsc_page_status == HV_TSC_PAGE_UNSET)
+ return;
+
+ if (!(hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE))
+ return;
+
+ gfn = hv->hv_tsc_page >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
+ /*
+ * Because the TSC parameters only vary when there is a
+ * change in the master clock, do not bother with caching.
+ */
+ if (unlikely(kvm_read_guest(kvm, gfn_to_gpa(gfn),
+ &tsc_seq, sizeof(tsc_seq))))
+ goto out_err;
+
+ if (tsc_seq && tsc_page_update_unsafe(hv)) {
+ if (kvm_read_guest(kvm, gfn_to_gpa(gfn), &hv->tsc_ref, sizeof(hv->tsc_ref)))
+ goto out_err;
+
+ hv->hv_tsc_page_status = HV_TSC_PAGE_SET;
+ return;
+ }
+
+ /*
+ * While we're computing and writing the parameters, force the
+ * guest to use the time reference count MSR.
+ */
+ hv->tsc_ref.tsc_sequence = 0;
+ if (kvm_write_guest(kvm, gfn_to_gpa(gfn),
+ &hv->tsc_ref, sizeof(hv->tsc_ref.tsc_sequence)))
+ goto out_err;
+
+ if (!compute_tsc_page_parameters(hv_clock, &hv->tsc_ref))
+ goto out_err;
+
+ /* Ensure sequence is zero before writing the rest of the struct. */
+ smp_wmb();
+ if (kvm_write_guest(kvm, gfn_to_gpa(gfn), &hv->tsc_ref, sizeof(hv->tsc_ref)))
+ goto out_err;
+
+ /*
+ * Now switch to the TSC page mechanism by writing the sequence.
+ */
+ tsc_seq++;
+ if (tsc_seq == 0xFFFFFFFF || tsc_seq == 0)
+ tsc_seq = 1;
+
+ /* Write the struct entirely before the non-zero sequence. */
+ smp_wmb();
+
+ hv->tsc_ref.tsc_sequence = tsc_seq;
+ if (kvm_write_guest(kvm, gfn_to_gpa(gfn),
+ &hv->tsc_ref, sizeof(hv->tsc_ref.tsc_sequence)))
+ goto out_err;
+
+ hv->hv_tsc_page_status = HV_TSC_PAGE_SET;
+ return;
+
+out_err:
+ hv->hv_tsc_page_status = HV_TSC_PAGE_BROKEN;
+}
+
+void kvm_hv_request_tsc_page_update(struct kvm *kvm)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+
+ mutex_lock(&hv->hv_lock);
+
+ if (hv->hv_tsc_page_status == HV_TSC_PAGE_SET &&
+ !tsc_page_update_unsafe(hv))
+ hv->hv_tsc_page_status = HV_TSC_PAGE_HOST_CHANGED;
+
+ mutex_unlock(&hv->hv_lock);
+}
+
+static bool hv_check_msr_access(struct kvm_vcpu_hv *hv_vcpu, u32 msr)
+{
+ if (!hv_vcpu->enforce_cpuid)
+ return true;
+
+ switch (msr) {
+ case HV_X64_MSR_GUEST_OS_ID:
+ case HV_X64_MSR_HYPERCALL:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_MSR_HYPERCALL_AVAILABLE;
+ case HV_X64_MSR_VP_RUNTIME:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_MSR_VP_RUNTIME_AVAILABLE;
+ case HV_X64_MSR_TIME_REF_COUNT:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_MSR_TIME_REF_COUNT_AVAILABLE;
+ case HV_X64_MSR_VP_INDEX:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_MSR_VP_INDEX_AVAILABLE;
+ case HV_X64_MSR_RESET:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_MSR_RESET_AVAILABLE;
+ case HV_X64_MSR_REFERENCE_TSC:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_MSR_REFERENCE_TSC_AVAILABLE;
+ case HV_X64_MSR_SCONTROL:
+ case HV_X64_MSR_SVERSION:
+ case HV_X64_MSR_SIEFP:
+ case HV_X64_MSR_SIMP:
+ case HV_X64_MSR_EOM:
+ case HV_X64_MSR_SINT0 ... HV_X64_MSR_SINT15:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_MSR_SYNIC_AVAILABLE;
+ case HV_X64_MSR_STIMER0_CONFIG:
+ case HV_X64_MSR_STIMER1_CONFIG:
+ case HV_X64_MSR_STIMER2_CONFIG:
+ case HV_X64_MSR_STIMER3_CONFIG:
+ case HV_X64_MSR_STIMER0_COUNT:
+ case HV_X64_MSR_STIMER1_COUNT:
+ case HV_X64_MSR_STIMER2_COUNT:
+ case HV_X64_MSR_STIMER3_COUNT:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_MSR_SYNTIMER_AVAILABLE;
+ case HV_X64_MSR_EOI:
+ case HV_X64_MSR_ICR:
+ case HV_X64_MSR_TPR:
+ case HV_X64_MSR_VP_ASSIST_PAGE:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_MSR_APIC_ACCESS_AVAILABLE;
+ case HV_X64_MSR_TSC_FREQUENCY:
+ case HV_X64_MSR_APIC_FREQUENCY:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_ACCESS_FREQUENCY_MSRS;
+ case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
+ case HV_X64_MSR_TSC_EMULATION_CONTROL:
+ case HV_X64_MSR_TSC_EMULATION_STATUS:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_ACCESS_REENLIGHTENMENT;
+ case HV_X64_MSR_TSC_INVARIANT_CONTROL:
+ return hv_vcpu->cpuid_cache.features_eax &
+ HV_ACCESS_TSC_INVARIANT;
+ case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
+ case HV_X64_MSR_CRASH_CTL:
+ return hv_vcpu->cpuid_cache.features_edx &
+ HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
+ case HV_X64_MSR_SYNDBG_OPTIONS:
+ case HV_X64_MSR_SYNDBG_CONTROL ... HV_X64_MSR_SYNDBG_PENDING_BUFFER:
+ return hv_vcpu->cpuid_cache.features_edx &
+ HV_FEATURE_DEBUG_MSRS_AVAILABLE;
+ default:
+ break;
+ }
+
+ return false;
+}
+
+#define KVM_HV_WIN2016_GUEST_ID 0x1040a00003839
+#define KVM_HV_WIN2016_GUEST_ID_MASK (~GENMASK_ULL(23, 16)) /* mask out the service version */
+
+/*
+ * Hyper-V enabled Windows Server 2016 SMP VMs fail to boot in an
+ * !XSAVES && XSAVEC configuration. Such a configuration can result,
+ * for example, from the AMD Erratum 1386 workaround.
+ *
+ * Print a notice so users aren't left wondering what's suddenly gone wrong.
+ */
+static void __kvm_hv_xsaves_xsavec_maybe_warn(struct kvm_vcpu *vcpu)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+
+ /* Check again under the hv_lock. */
+ if (hv->xsaves_xsavec_checked)
+ return;
+
+ if ((hv->hv_guest_os_id & KVM_HV_WIN2016_GUEST_ID_MASK) !=
+ KVM_HV_WIN2016_GUEST_ID)
+ return;
+
+ hv->xsaves_xsavec_checked = true;
+
+ /* UP configurations aren't affected */
+ if (atomic_read(&kvm->online_vcpus) < 2)
+ return;
+
+ if (guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) ||
+ !guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVEC))
+ return;
+
+ pr_notice_ratelimited("Booting SMP Windows KVM VM with !XSAVES && XSAVEC. "
+ "If it fails to boot try disabling XSAVEC in the VM config.\n");
+}
+
+void kvm_hv_xsaves_xsavec_maybe_warn(struct kvm_vcpu *vcpu)
+{
+ struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
+
+ if (!vcpu->arch.hyperv_enabled ||
+ hv->xsaves_xsavec_checked)
+ return;
+
+ mutex_lock(&hv->hv_lock);
+ __kvm_hv_xsaves_xsavec_maybe_warn(vcpu);
+ mutex_unlock(&hv->hv_lock);
+}
+
+static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
+ bool host)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+
+ if (unlikely(!host && !hv_check_msr_access(to_hv_vcpu(vcpu), msr)))
+ return 1;
+
+ switch (msr) {
+ case HV_X64_MSR_GUEST_OS_ID:
+ hv->hv_guest_os_id = data;
+ /* Setting the guest OS ID to zero disables the hypercall page. */
+ if (!hv->hv_guest_os_id)
+ hv->hv_hypercall &= ~HV_X64_MSR_HYPERCALL_ENABLE;
+ break;
+ case HV_X64_MSR_HYPERCALL: {
+ u8 instructions[9];
+ int i = 0;
+ u64 addr;
+
+ /* If the guest OS ID is not set, the hypercall page must remain disabled. */
+ if (!hv->hv_guest_os_id)
+ break;
+ if (!(data & HV_X64_MSR_HYPERCALL_ENABLE)) {
+ hv->hv_hypercall = data;
+ break;
+ }
+
+ /*
+ * If Xen and Hyper-V hypercalls are both enabled, disambiguate
+ * the same way Xen itself does, by setting bit 31 of EAX,
+ * which is RsvdZ in the 32-bit Hyper-V hypercall ABI and just
+ * going to be clobbered on 64-bit.
+ */
+ if (kvm_xen_hypercall_enabled(kvm)) {
+ /* orl $0x80000000, %eax */
+ instructions[i++] = 0x0d;
+ instructions[i++] = 0x00;
+ instructions[i++] = 0x00;
+ instructions[i++] = 0x00;
+ instructions[i++] = 0x80;
+ }
+
+ /* vmcall/vmmcall */
+ kvm_x86_call(patch_hypercall)(vcpu, instructions + i);
+ i += 3;
+
+ /* ret */
+ ((unsigned char *)instructions)[i++] = 0xc3;
+
+ addr = data & HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK;
+ if (kvm_vcpu_write_guest(vcpu, addr, instructions, i))
+ return 1;
+ hv->hv_hypercall = data;
+ break;
+ }
+ case HV_X64_MSR_REFERENCE_TSC:
+ hv->hv_tsc_page = data;
+ if (hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE) {
+ if (!host)
+ hv->hv_tsc_page_status = HV_TSC_PAGE_GUEST_CHANGED;
+ else
+ hv->hv_tsc_page_status = HV_TSC_PAGE_HOST_CHANGED;
+ kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+ } else {
+ hv->hv_tsc_page_status = HV_TSC_PAGE_UNSET;
+ }
+ break;
+ case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
+ return kvm_hv_msr_set_crash_data(kvm,
+ msr - HV_X64_MSR_CRASH_P0,
+ data);
+ case HV_X64_MSR_CRASH_CTL:
+ if (host)
+ return kvm_hv_msr_set_crash_ctl(kvm, data);
+
+ if (data & HV_CRASH_CTL_CRASH_NOTIFY) {
+ vcpu_debug(vcpu, "hv crash (0x%llx 0x%llx 0x%llx 0x%llx 0x%llx)\n",
+ hv->hv_crash_param[0],
+ hv->hv_crash_param[1],
+ hv->hv_crash_param[2],
+ hv->hv_crash_param[3],
+ hv->hv_crash_param[4]);
+
+ /* Send notification about crash to user space */
+ kvm_make_request(KVM_REQ_HV_CRASH, vcpu);
+ }
+ break;
+ case HV_X64_MSR_RESET:
+ if (data == 1) {
+ vcpu_debug(vcpu, "hyper-v reset requested\n");
+ kvm_make_request(KVM_REQ_HV_RESET, vcpu);
+ }
+ break;
+ case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
+ hv->hv_reenlightenment_control = data;
+ break;
+ case HV_X64_MSR_TSC_EMULATION_CONTROL:
+ hv->hv_tsc_emulation_control = data;
+ break;
+ case HV_X64_MSR_TSC_EMULATION_STATUS:
+ if (data && !host)
+ return 1;
+
+ hv->hv_tsc_emulation_status = data;
+ break;
+ case HV_X64_MSR_TIME_REF_COUNT:
+ /* Read-only for the guest, but silently ignore host-initiated writes. */
+ if (!host)
+ return 1;
+ break;
+ case HV_X64_MSR_TSC_INVARIANT_CONTROL:
+ /* Only bit 0 is supported */
+ if (data & ~HV_EXPOSE_INVARIANT_TSC)
+ return 1;
+
+ /* The feature can't be disabled from the guest */
+ if (!host && hv->hv_invtsc_control && !data)
+ return 1;
+
+ hv->hv_invtsc_control = data;
+ break;
+ case HV_X64_MSR_SYNDBG_OPTIONS:
+ case HV_X64_MSR_SYNDBG_CONTROL ... HV_X64_MSR_SYNDBG_PENDING_BUFFER:
+ return syndbg_set_msr(vcpu, msr, data, host);
+ default:
+ kvm_pr_unimpl_wrmsr(vcpu, msr, data);
+ return 1;
+ }
+ return 0;
+}
+
+/* Calculate the CPU time spent by the current task, in 100ns units. */
+static u64 current_task_runtime_100ns(void)
+{
+ u64 utime, stime;
+
+ task_cputime_adjusted(current, &utime, &stime);
+
+ return div_u64(utime + stime, 100);
+}
+
+static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ if (unlikely(!host && !hv_check_msr_access(hv_vcpu, msr)))
+ return 1;
+
+ switch (msr) {
+ case HV_X64_MSR_VP_INDEX: {
+ struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
+ u32 new_vp_index = (u32)data;
+
+ if (!host || new_vp_index >= KVM_MAX_VCPUS)
+ return 1;
+
+ if (new_vp_index == hv_vcpu->vp_index)
+ return 0;
+
+ /*
+ * The VP index is initialized to vcpu_idx by
+ * kvm_hv_vcpu_init() so they initially match. Now the
+ * VP index is changing, adjust num_mismatched_vp_indexes if
+ * it now matches or no longer matches vcpu_idx.
+ */
+ if (hv_vcpu->vp_index == vcpu->vcpu_idx)
+ atomic_inc(&hv->num_mismatched_vp_indexes);
+ else if (new_vp_index == vcpu->vcpu_idx)
+ atomic_dec(&hv->num_mismatched_vp_indexes);
+
+ hv_vcpu->vp_index = new_vp_index;
+ break;
+ }
+ case HV_X64_MSR_VP_ASSIST_PAGE: {
+ u64 gfn;
+ unsigned long addr;
+
+ if (!(data & HV_X64_MSR_VP_ASSIST_PAGE_ENABLE)) {
+ hv_vcpu->hv_vapic = data;
+ if (kvm_lapic_set_pv_eoi(vcpu, 0, 0))
+ return 1;
+ break;
+ }
+ gfn = data >> HV_X64_MSR_VP_ASSIST_PAGE_ADDRESS_SHIFT;
+ addr = kvm_vcpu_gfn_to_hva(vcpu, gfn);
+ if (kvm_is_error_hva(addr))
+ return 1;
+
+ /*
+ * Clear only the apic_assist portion of struct hv_vp_assist_page;
+ * the rest may hold valuable data that needs to be preserved,
+ * e.g. on migration.
+ */
+ if (put_user(0, (u32 __user *)addr))
+ return 1;
+ hv_vcpu->hv_vapic = data;
+ kvm_vcpu_mark_page_dirty(vcpu, gfn);
+ if (kvm_lapic_set_pv_eoi(vcpu,
+ gfn_to_gpa(gfn) | KVM_MSR_ENABLED,
+ sizeof(struct hv_vp_assist_page)))
+ return 1;
+ break;
+ }
+ case HV_X64_MSR_EOI:
+ return kvm_hv_vapic_msr_write(vcpu, APIC_EOI, data);
+ case HV_X64_MSR_ICR:
+ return kvm_hv_vapic_msr_write(vcpu, APIC_ICR, data);
+ case HV_X64_MSR_TPR:
+ return kvm_hv_vapic_msr_write(vcpu, APIC_TASKPRI, data);
+ case HV_X64_MSR_VP_RUNTIME:
+ if (!host)
+ return 1;
+ hv_vcpu->runtime_offset = data - current_task_runtime_100ns();
+ break;
+ case HV_X64_MSR_SCONTROL:
+ case HV_X64_MSR_SVERSION:
+ case HV_X64_MSR_SIEFP:
+ case HV_X64_MSR_SIMP:
+ case HV_X64_MSR_EOM:
+ case HV_X64_MSR_SINT0 ... HV_X64_MSR_SINT15:
+ return synic_set_msr(to_hv_synic(vcpu), msr, data, host);
+ case HV_X64_MSR_STIMER0_CONFIG:
+ case HV_X64_MSR_STIMER1_CONFIG:
+ case HV_X64_MSR_STIMER2_CONFIG:
+ case HV_X64_MSR_STIMER3_CONFIG: {
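+ /* CONFIG and COUNT MSRs interleave per timer, hence the divide by 2. */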
+ int timer_index = (msr - HV_X64_MSR_STIMER0_CONFIG)/2;
+
+ return stimer_set_config(to_hv_stimer(vcpu, timer_index),
+ data, host);
+ }
+ case HV_X64_MSR_STIMER0_COUNT:
+ case HV_X64_MSR_STIMER1_COUNT:
+ case HV_X64_MSR_STIMER2_COUNT:
+ case HV_X64_MSR_STIMER3_COUNT: {
+ int timer_index = (msr - HV_X64_MSR_STIMER0_COUNT)/2;
+
+ return stimer_set_count(to_hv_stimer(vcpu, timer_index),
+ data, host);
+ }
+ case HV_X64_MSR_TSC_FREQUENCY:
+ case HV_X64_MSR_APIC_FREQUENCY:
+ /* Read-only for the guest, but silently ignore host-initiated writes. */
+ if (!host)
+ return 1;
+ break;
+ default:
+ kvm_pr_unimpl_wrmsr(vcpu, msr, data);
+ return 1;
+ }
+
+ return 0;
+}
+
+static int kvm_hv_get_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata,
+ bool host)
+{
+ u64 data = 0;
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+
+ if (unlikely(!host && !hv_check_msr_access(to_hv_vcpu(vcpu), msr)))
+ return 1;
+
+ switch (msr) {
+ case HV_X64_MSR_GUEST_OS_ID:
+ data = hv->hv_guest_os_id;
+ break;
+ case HV_X64_MSR_HYPERCALL:
+ data = hv->hv_hypercall;
+ break;
+ case HV_X64_MSR_TIME_REF_COUNT:
+ data = get_time_ref_counter(kvm);
+ break;
+ case HV_X64_MSR_REFERENCE_TSC:
+ data = hv->hv_tsc_page;
+ break;
+ case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
+ return kvm_hv_msr_get_crash_data(kvm,
+ msr - HV_X64_MSR_CRASH_P0,
+ pdata);
+ case HV_X64_MSR_CRASH_CTL:
+ return kvm_hv_msr_get_crash_ctl(kvm, pdata);
+ case HV_X64_MSR_RESET:
+ data = 0;
+ break;
+ case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
+ data = hv->hv_reenlightenment_control;
+ break;
+ case HV_X64_MSR_TSC_EMULATION_CONTROL:
+ data = hv->hv_tsc_emulation_control;
+ break;
+ case HV_X64_MSR_TSC_EMULATION_STATUS:
+ data = hv->hv_tsc_emulation_status;
+ break;
+ case HV_X64_MSR_TSC_INVARIANT_CONTROL:
+ data = hv->hv_invtsc_control;
+ break;
+ case HV_X64_MSR_SYNDBG_OPTIONS:
+ case HV_X64_MSR_SYNDBG_CONTROL ... HV_X64_MSR_SYNDBG_PENDING_BUFFER:
+ return syndbg_get_msr(vcpu, msr, pdata, host);
+ default:
+ kvm_pr_unimpl_rdmsr(vcpu, msr);
+ return 1;
+ }
+
+ *pdata = data;
+ return 0;
+}
+
+static int kvm_hv_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata,
+ bool host)
+{
+ u64 data = 0;
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ if (unlikely(!host && !hv_check_msr_access(hv_vcpu, msr)))
+ return 1;
+
+ switch (msr) {
+ case HV_X64_MSR_VP_INDEX:
+ data = hv_vcpu->vp_index;
+ break;
+ case HV_X64_MSR_EOI:
+ return kvm_hv_vapic_msr_read(vcpu, APIC_EOI, pdata);
+ case HV_X64_MSR_ICR:
+ return kvm_hv_vapic_msr_read(vcpu, APIC_ICR, pdata);
+ case HV_X64_MSR_TPR:
+ return kvm_hv_vapic_msr_read(vcpu, APIC_TASKPRI, pdata);
+ case HV_X64_MSR_VP_ASSIST_PAGE:
+ data = hv_vcpu->hv_vapic;
+ break;
+ case HV_X64_MSR_VP_RUNTIME:
+ data = current_task_runtime_100ns() + hv_vcpu->runtime_offset;
+ break;
+ case HV_X64_MSR_SCONTROL:
+ case HV_X64_MSR_SVERSION:
+ case HV_X64_MSR_SIEFP:
+ case HV_X64_MSR_SIMP:
+ case HV_X64_MSR_EOM:
+ case HV_X64_MSR_SINT0 ... HV_X64_MSR_SINT15:
+ return synic_get_msr(to_hv_synic(vcpu), msr, pdata, host);
+ case HV_X64_MSR_STIMER0_CONFIG:
+ case HV_X64_MSR_STIMER1_CONFIG:
+ case HV_X64_MSR_STIMER2_CONFIG:
+ case HV_X64_MSR_STIMER3_CONFIG: {
+ int timer_index = (msr - HV_X64_MSR_STIMER0_CONFIG)/2;
+
+ return stimer_get_config(to_hv_stimer(vcpu, timer_index),
+ pdata);
+ }
+ case HV_X64_MSR_STIMER0_COUNT:
+ case HV_X64_MSR_STIMER1_COUNT:
+ case HV_X64_MSR_STIMER2_COUNT:
+ case HV_X64_MSR_STIMER3_COUNT: {
+ int timer_index = (msr - HV_X64_MSR_STIMER0_COUNT)/2;
+
+ return stimer_get_count(to_hv_stimer(vcpu, timer_index),
+ pdata);
+ }
+ case HV_X64_MSR_TSC_FREQUENCY:
+ data = (u64)vcpu->arch.virtual_tsc_khz * 1000;
+ break;
+ case HV_X64_MSR_APIC_FREQUENCY:
+ data = div64_u64(1000000000ULL,
+ vcpu->kvm->arch.apic_bus_cycle_ns);
+ break;
+ default:
+ kvm_pr_unimpl_rdmsr(vcpu, msr);
+ return 1;
+ }
+ *pdata = data;
+ return 0;
+}
+
+int kvm_hv_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host)
+{
+ struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
+
+ if (!host && !vcpu->arch.hyperv_enabled)
+ return 1;
+
+ if (kvm_hv_vcpu_init(vcpu))
+ return 1;
+
+ if (kvm_hv_msr_partition_wide(msr)) {
+ int r;
+
+ mutex_lock(&hv->hv_lock);
+ r = kvm_hv_set_msr_pw(vcpu, msr, data, host);
+ mutex_unlock(&hv->hv_lock);
+ return r;
+ } else
+ return kvm_hv_set_msr(vcpu, msr, data, host);
+}
+
+int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata, bool host)
+{
+ struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
+
+ if (!host && !vcpu->arch.hyperv_enabled)
+ return 1;
+
+ if (kvm_hv_vcpu_init(vcpu))
+ return 1;
+
+ if (kvm_hv_msr_partition_wide(msr)) {
+ int r;
+
+ mutex_lock(&hv->hv_lock);
+ r = kvm_hv_get_msr_pw(vcpu, msr, pdata, host);
+ mutex_unlock(&hv->hv_lock);
+ return r;
+ } else
+ return kvm_hv_get_msr(vcpu, msr, pdata, host);
+}
+
+static void sparse_set_to_vcpu_mask(struct kvm *kvm, u64 *sparse_banks,
+ u64 valid_bank_mask, unsigned long *vcpu_mask)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+ bool has_mismatch = atomic_read(&hv->num_mismatched_vp_indexes);
+ u64 vp_bitmap[KVM_HV_MAX_SPARSE_VCPU_SET_BITS];
+ struct kvm_vcpu *vcpu;
+ int bank, sbank = 0;
+ unsigned long i;
+ u64 *bitmap;
+
+ BUILD_BUG_ON(sizeof(vp_bitmap) >
+ sizeof(*vcpu_mask) * BITS_TO_LONGS(KVM_MAX_VCPUS));
+
+ /*
+ * If vp_index == vcpu_idx for all vCPUs, fill vcpu_mask directly, else
+ * fill a temporary buffer and manually test each vCPU's VP index.
+ */
+ if (likely(!has_mismatch))
+ bitmap = (u64 *)vcpu_mask;
+ else
+ bitmap = vp_bitmap;
+
+ /*
+ * Each set of 64 VPs is packed into sparse_banks, with valid_bank_mask
+ * having a '1' for each bank that exists in sparse_banks. Sets must
+ * be in ascending order, i.e. bank0..bankN.
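+ * E.g. valid_bank_mask == 0b101 selects banks 0 and 2: sparse_banks[0]
+ * then covers VPs 0..63 and sparse_banks[1] covers VPs 128..191.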
+ */
+ memset(bitmap, 0, sizeof(vp_bitmap));
+ for_each_set_bit(bank, (unsigned long *)&valid_bank_mask,
+ KVM_HV_MAX_SPARSE_VCPU_SET_BITS)
+ bitmap[bank] = sparse_banks[sbank++];
+
+ if (likely(!has_mismatch))
+ return;
+
+ bitmap_zero(vcpu_mask, KVM_MAX_VCPUS);
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (test_bit(kvm_hv_get_vpindex(vcpu), (unsigned long *)vp_bitmap))
+ __set_bit(i, vcpu_mask);
+ }
+}
+
+static bool hv_is_vp_in_sparse_set(u32 vp_id, u64 valid_bank_mask, u64 sparse_banks[])
+{
+ int valid_bit_nr = vp_id / HV_VCPUS_PER_SPARSE_BANK;
+ unsigned long sbank;
+
+ if (!test_bit(valid_bit_nr, (unsigned long *)&valid_bank_mask))
+ return false;
+
+ /*
+ * The index into the sparse bank is the number of preceding bits in
+ * the valid mask. Optimize for VMs with <64 vCPUs by skipping the
+ * fancy math if there can't possibly be preceding bits.
+ */
+ if (valid_bit_nr)
+ sbank = hweight64(valid_bank_mask & GENMASK_ULL(valid_bit_nr - 1, 0));
+ else
+ sbank = 0;
+
+ return test_bit(vp_id % HV_VCPUS_PER_SPARSE_BANK,
+ (unsigned long *)&sparse_banks[sbank]);
+}
+
+struct kvm_hv_hcall {
+ /* Hypercall input data */
+ u64 param;
+ u64 ingpa;
+ u64 outgpa;
+ u16 code;
+ u16 var_cnt;
+ u16 rep_cnt;
+ u16 rep_idx;
+ bool fast;
+ bool rep;
+ sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS];
+
+ /*
+ * Current read offset when KVM reads hypercall input data gradually:
+ * either the offset in bytes from 'ingpa' for regular hypercalls or the
+ * number of already consumed 'XMM halves' for 'fast' hypercalls.
+ */
+ union {
+ gpa_t data_offset;
+ int consumed_xmm_halves;
+ };
+};
+
+
+static int kvm_hv_get_hc_data(struct kvm *kvm, struct kvm_hv_hcall *hc,
+ u16 orig_cnt, u16 cnt_cap, u64 *data)
+{
+ /*
+ * Preserve the original count when ignoring entries via a "cap"; KVM
+ * still needs to validate the guest input (though the non-XMM path
+ * punts on the checks).
+ */
+ u16 cnt = min(orig_cnt, cnt_cap);
+ int i, j;
+
+ if (hc->fast) {
+ /*
+ * Each XMM holds two sparse banks, but do not count halves that
+ * have already been consumed for hypercall parameters.
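+ * E.g. with consumed_xmm_halves == 1, data[0] comes from the high half
+ * of XMM0 and data[1] from the low half of XMM1.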
+ */
+ if (orig_cnt > 2 * HV_HYPERCALL_MAX_XMM_REGISTERS - hc->consumed_xmm_halves)
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+ for (i = 0; i < cnt; i++) {
+ j = i + hc->consumed_xmm_halves;
+ if (j % 2)
+ data[i] = sse128_hi(hc->xmm[j / 2]);
+ else
+ data[i] = sse128_lo(hc->xmm[j / 2]);
+ }
+ return 0;
+ }
+
+ return kvm_read_guest(kvm, hc->ingpa + hc->data_offset, data,
+ cnt * sizeof(*data));
+}
+
+static u64 kvm_get_sparse_vp_set(struct kvm *kvm, struct kvm_hv_hcall *hc,
+ u64 *sparse_banks)
+{
+ if (hc->var_cnt > HV_MAX_SPARSE_VCPU_BANKS)
+ return -EINVAL;
+
+ /* Cap var_cnt to ignore banks that cannot contain a legal VP index. */
+ return kvm_hv_get_hc_data(kvm, hc, hc->var_cnt, KVM_HV_MAX_SPARSE_VCPU_SET_BITS,
+ sparse_banks);
+}
+
+static int kvm_hv_get_tlb_flush_entries(struct kvm *kvm, struct kvm_hv_hcall *hc, u64 entries[])
+{
+ return kvm_hv_get_hc_data(kvm, hc, hc->rep_cnt, hc->rep_cnt, entries);
+}
+
+static void hv_tlb_flush_enqueue(struct kvm_vcpu *vcpu,
+ struct kvm_vcpu_hv_tlb_flush_fifo *tlb_flush_fifo,
+ u64 *entries, int count)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ u64 flush_all_entry = KVM_HV_TLB_FLUSHALL_ENTRY;
+
+ if (!hv_vcpu)
+ return;
+
+ spin_lock(&tlb_flush_fifo->write_lock);
+
+ /*
+ * All entries should fit in the fifo, leaving one slot free for a
+ * 'flush all' entry in case another request comes in. If there is not
+ * enough space, just put a 'flush all' entry there instead.
+ */
+ if (count && entries && count < kfifo_avail(&tlb_flush_fifo->entries)) {
+ WARN_ON(kfifo_in(&tlb_flush_fifo->entries, entries, count) != count);
+ goto out_unlock;
+ }
+
+ /*
+ * Note: a full fifo always contains a 'flush all' entry, so there is
+ * no need to check the return value.
+ */
+ kfifo_in(&tlb_flush_fifo->entries, &flush_all_entry, 1);
+
+out_unlock:
+ spin_unlock(&tlb_flush_fifo->write_lock);
+}
+
+int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv_tlb_flush_fifo *tlb_flush_fifo;
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ u64 entries[KVM_HV_TLB_FLUSH_FIFO_SIZE];
+ int i, j, count;
+ gva_t gva;
+
+ if (!tdp_enabled || !hv_vcpu)
+ return -EINVAL;
+
+ tlb_flush_fifo = kvm_hv_get_tlb_flush_fifo(vcpu, is_guest_mode(vcpu));
+
+ count = kfifo_out(&tlb_flush_fifo->entries, entries, KVM_HV_TLB_FLUSH_FIFO_SIZE);
+
+ for (i = 0; i < count; i++) {
+ if (entries[i] == KVM_HV_TLB_FLUSHALL_ENTRY)
+ goto out_flush_all;
+
+ if (is_noncanonical_invlpg_address(entries[i], vcpu))
+ continue;
+
+ /*
+ * Lower 12 bits of 'address' encode the number of additional
+ * pages to flush.
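+ * E.g. an entry of 0x7f8003 flushes the 4 pages starting at gva 0x7f8000.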
+ */
+ gva = entries[i] & PAGE_MASK;
+ for (j = 0; j < (entries[i] & ~PAGE_MASK) + 1; j++)
+ kvm_x86_call(flush_tlb_gva)(vcpu, gva + j * PAGE_SIZE);
+
+ ++vcpu->stat.tlb_flush;
+ }
+ return 0;
+
+out_flush_all:
+ kfifo_reset_out(&tlb_flush_fifo->entries);
+
+ /* Fall back to full flush. */
+ return -ENOSPC;
+}
+
+static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ unsigned long *vcpu_mask = hv_vcpu->vcpu_mask;
+ u64 *sparse_banks = hv_vcpu->sparse_banks;
+ struct kvm *kvm = vcpu->kvm;
+ struct hv_tlb_flush_ex flush_ex;
+ struct hv_tlb_flush flush;
+ struct kvm_vcpu_hv_tlb_flush_fifo *tlb_flush_fifo;
+ /*
+ * Normally, there can be no more than 'KVM_HV_TLB_FLUSH_FIFO_SIZE'
+ * entries on the TLB flush fifo. The last entry, however, always needs
+ * to be left free for the 'flush all' entry, which gets placed when
+ * there is not enough space for all the requested entries.
+ */
+ u64 __tlb_flush_entries[KVM_HV_TLB_FLUSH_FIFO_SIZE - 1];
+ u64 *tlb_flush_entries;
+ u64 valid_bank_mask;
+ struct kvm_vcpu *v;
+ unsigned long i;
+ bool all_cpus;
+
+ /*
+ * The Hyper-V TLFS doesn't allow more than HV_MAX_SPARSE_VCPU_BANKS
+ * sparse banks. Fail the build if KVM's max allowed number of
+ * vCPUs (>4096) exceeds this limit.
+ */
+ BUILD_BUG_ON(KVM_HV_MAX_SPARSE_VCPU_SET_BITS > HV_MAX_SPARSE_VCPU_BANKS);
+
+ /*
+ * A 'slow' hypercall's first parameter is the address in guest memory
+ * where the hypercall parameters are placed. This is either a GPA or a
+ * nested GPA when KVM is handling the call from L2 ('direct' TLB
+ * flush). Translate the address here so the memory can be uniformly
+ * read with kvm_read_guest().
+ */
+ if (!hc->fast && is_guest_mode(vcpu)) {
+ hc->ingpa = translate_nested_gpa(vcpu, hc->ingpa, 0, NULL);
+ if (unlikely(hc->ingpa == INVALID_GPA))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+ }
+
+ if (hc->code == HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST ||
+ hc->code == HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE) {
+ if (hc->fast) {
+ flush.address_space = hc->ingpa;
+ flush.flags = hc->outgpa;
+ flush.processor_mask = sse128_lo(hc->xmm[0]);
+ hc->consumed_xmm_halves = 1;
+ } else {
+ if (unlikely(kvm_read_guest(kvm, hc->ingpa,
+ &flush, sizeof(flush))))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+ hc->data_offset = sizeof(flush);
+ }
+
+ trace_kvm_hv_flush_tlb(flush.processor_mask,
+ flush.address_space, flush.flags,
+ is_guest_mode(vcpu));
+
+ valid_bank_mask = BIT_ULL(0);
+ sparse_banks[0] = flush.processor_mask;
+
+ /*
+ * Work around possible WS2012 bug: it sends hypercalls
+ * with processor_mask = 0x0 and HV_FLUSH_ALL_PROCESSORS clear,
+ * while also expecting us to flush something and crashing if
+ * we don't. Let's treat processor_mask == 0 the same as
+ * HV_FLUSH_ALL_PROCESSORS.
+ */
+ all_cpus = (flush.flags & HV_FLUSH_ALL_PROCESSORS) ||
+ flush.processor_mask == 0;
+ } else {
+ if (hc->fast) {
+ flush_ex.address_space = hc->ingpa;
+ flush_ex.flags = hc->outgpa;
+ memcpy(&flush_ex.hv_vp_set,
+ &hc->xmm[0], sizeof(hc->xmm[0]));
+ hc->consumed_xmm_halves = 2;
+ } else {
+ if (unlikely(kvm_read_guest(kvm, hc->ingpa, &flush_ex,
+ sizeof(flush_ex))))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+ hc->data_offset = sizeof(flush_ex);
+ }
+
+ trace_kvm_hv_flush_tlb_ex(flush_ex.hv_vp_set.valid_bank_mask,
+ flush_ex.hv_vp_set.format,
+ flush_ex.address_space,
+ flush_ex.flags, is_guest_mode(vcpu));
+
+ valid_bank_mask = flush_ex.hv_vp_set.valid_bank_mask;
+ all_cpus = flush_ex.hv_vp_set.format !=
+ HV_GENERIC_SET_SPARSE_4K;
+
+ if (hc->var_cnt != hweight64(valid_bank_mask))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+ if (!all_cpus) {
+ if (!hc->var_cnt)
+ goto ret_success;
+
+ if (kvm_get_sparse_vp_set(kvm, hc, sparse_banks))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+ }
+
+ /*
+ * Hyper-V TLFS doesn't explicitly forbid non-empty sparse vCPU
+ * banks (and, thus, non-zero 'var_cnt') for the 'all vCPUs'
+ * case (HV_GENERIC_SET_ALL). Always adjust data_offset and
+ * consumed_xmm_halves to make sure TLB flush entries are read
+ * from the correct offset.
+ */
+ if (hc->fast)
+ hc->consumed_xmm_halves += hc->var_cnt;
+ else
+ hc->data_offset += hc->var_cnt * sizeof(sparse_banks[0]);
+ }
+
+ if (hc->code == HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE ||
+ hc->code == HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX ||
+ hc->rep_cnt > ARRAY_SIZE(__tlb_flush_entries)) {
+ tlb_flush_entries = NULL;
+ } else {
+ if (kvm_hv_get_tlb_flush_entries(kvm, hc, __tlb_flush_entries))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+ tlb_flush_entries = __tlb_flush_entries;
+ }
+
+ /*
+ * vcpu->arch.cr3 may not be up-to-date for running vCPUs so we can't
+ * analyze it here, flush TLB regardless of the specified address space.
+ */
+ if (all_cpus && !is_guest_mode(vcpu)) {
+ kvm_for_each_vcpu(i, v, kvm) {
+ tlb_flush_fifo = kvm_hv_get_tlb_flush_fifo(v, false);
+ hv_tlb_flush_enqueue(v, tlb_flush_fifo,
+ tlb_flush_entries, hc->rep_cnt);
+ }
+
+ kvm_make_all_cpus_request(kvm, KVM_REQ_HV_TLB_FLUSH);
+ } else if (!is_guest_mode(vcpu)) {
+ sparse_set_to_vcpu_mask(kvm, sparse_banks, valid_bank_mask, vcpu_mask);
+
+ for_each_set_bit(i, vcpu_mask, KVM_MAX_VCPUS) {
+ v = kvm_get_vcpu(kvm, i);
+ if (!v)
+ continue;
+ tlb_flush_fifo = kvm_hv_get_tlb_flush_fifo(v, false);
+ hv_tlb_flush_enqueue(v, tlb_flush_fifo,
+ tlb_flush_entries, hc->rep_cnt);
+ }
+
+ kvm_make_vcpus_request_mask(kvm, KVM_REQ_HV_TLB_FLUSH, vcpu_mask);
+ } else {
+ struct kvm_vcpu_hv *hv_v;
+
+ bitmap_zero(vcpu_mask, KVM_MAX_VCPUS);
+
+ kvm_for_each_vcpu(i, v, kvm) {
+ hv_v = to_hv_vcpu(v);
+
+ /*
+ * The following check races with nested vCPUs entering/exiting
+ * and/or migrating between L1's vCPUs, however the only case when
+ * KVM *must* flush the TLB is when the target L2 vCPU keeps
+ * running on the same L1 vCPU from the moment of the request until
+ * kvm_hv_flush_tlb() returns. TLB is fully flushed in all other
+ * cases, e.g. when the target L2 vCPU migrates to a different L1
+ * vCPU or when the corresponding L1 vCPU temporarily switches to a
+ * different L2 vCPU while the request is being processed.
+ */
+ if (!hv_v || hv_v->nested.vm_id != hv_vcpu->nested.vm_id)
+ continue;
+
+ if (!all_cpus &&
+ !hv_is_vp_in_sparse_set(hv_v->nested.vp_id, valid_bank_mask,
+ sparse_banks))
+ continue;
+
+ __set_bit(i, vcpu_mask);
+ tlb_flush_fifo = kvm_hv_get_tlb_flush_fifo(v, true);
+ hv_tlb_flush_enqueue(v, tlb_flush_fifo,
+ tlb_flush_entries, hc->rep_cnt);
+ }
+
+ kvm_make_vcpus_request_mask(kvm, KVM_REQ_HV_TLB_FLUSH, vcpu_mask);
+ }
+
+ret_success:
+ /* We always do a full TLB flush, so set 'Reps completed' = 'Rep Count'. */
+ return (u64)HV_STATUS_SUCCESS |
+ ((u64)hc->rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
+}
+
+static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
+ u64 *sparse_banks, u64 valid_bank_mask)
+{
+ struct kvm_lapic_irq irq = {
+ .delivery_mode = APIC_DM_FIXED,
+ .vector = vector
+ };
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (sparse_banks &&
+ !hv_is_vp_in_sparse_set(kvm_hv_get_vpindex(vcpu),
+ valid_bank_mask, sparse_banks))
+ continue;
+
+ /* We fail only when APIC is disabled */
+ kvm_apic_set_irq(vcpu, &irq, NULL);
+ }
+}
+
+static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ u64 *sparse_banks = hv_vcpu->sparse_banks;
+ struct kvm *kvm = vcpu->kvm;
+ struct hv_send_ipi_ex send_ipi_ex;
+ struct hv_send_ipi send_ipi;
+ u64 valid_bank_mask;
+ u32 vector;
+ bool all_cpus;
+
+ if (!lapic_in_kernel(vcpu))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+ if (hc->code == HVCALL_SEND_IPI) {
+ if (!hc->fast) {
+ if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi,
+ sizeof(send_ipi))))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+ sparse_banks[0] = send_ipi.cpu_mask;
+ vector = send_ipi.vector;
+ } else {
+ /* 'reserved' part of hv_send_ipi should be 0 */
+ if (unlikely(hc->ingpa >> 32 != 0))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+ sparse_banks[0] = hc->outgpa;
+ vector = (u32)hc->ingpa;
+ }
+ all_cpus = false;
+ valid_bank_mask = BIT_ULL(0);
+
+ trace_kvm_hv_send_ipi(vector, sparse_banks[0]);
+ } else {
+ if (!hc->fast) {
+ if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi_ex,
+ sizeof(send_ipi_ex))))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+ } else {
+ send_ipi_ex.vector = (u32)hc->ingpa;
+ send_ipi_ex.vp_set.format = hc->outgpa;
+ send_ipi_ex.vp_set.valid_bank_mask = sse128_lo(hc->xmm[0]);
+ }
+
+ trace_kvm_hv_send_ipi_ex(send_ipi_ex.vector,
+ send_ipi_ex.vp_set.format,
+ send_ipi_ex.vp_set.valid_bank_mask);
+
+ vector = send_ipi_ex.vector;
+ valid_bank_mask = send_ipi_ex.vp_set.valid_bank_mask;
+ all_cpus = send_ipi_ex.vp_set.format == HV_GENERIC_SET_ALL;
+
+ if (hc->var_cnt != hweight64(valid_bank_mask))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+ if (all_cpus)
+ goto check_and_send_ipi;
+
+ if (!hc->var_cnt)
+ goto ret_success;
+
+ if (!hc->fast)
+ hc->data_offset = offsetof(struct hv_send_ipi_ex,
+ vp_set.bank_contents);
+ else
+ hc->consumed_xmm_halves = 1;
+
+ if (kvm_get_sparse_vp_set(kvm, hc, sparse_banks))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+ }
+
+check_and_send_ipi:
+ if ((vector < HV_IPI_LOW_VECTOR) || (vector > HV_IPI_HIGH_VECTOR))
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+ if (all_cpus)
+ kvm_hv_send_ipi_to_many(kvm, vector, NULL, 0);
+ else
+ kvm_hv_send_ipi_to_many(kvm, vector, sparse_banks, valid_bank_mask);
+
+ret_success:
+ return HV_STATUS_SUCCESS;
+}
+
+void kvm_hv_set_cpuid(struct kvm_vcpu *vcpu, bool hyperv_enabled)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ struct kvm_cpuid_entry2 *entry;
+
+ vcpu->arch.hyperv_enabled = hyperv_enabled;
+
+ if (!hv_vcpu) {
+ /*
+ * KVM should have already allocated kvm_vcpu_hv if Hyper-V is
+ * enabled in CPUID.
+ */
+ WARN_ON_ONCE(vcpu->arch.hyperv_enabled);
+ return;
+ }
+
+ memset(&hv_vcpu->cpuid_cache, 0, sizeof(hv_vcpu->cpuid_cache));
+
+ if (!vcpu->arch.hyperv_enabled)
+ return;
+
+ entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_FEATURES);
+ if (entry) {
+ hv_vcpu->cpuid_cache.features_eax = entry->eax;
+ hv_vcpu->cpuid_cache.features_ebx = entry->ebx;
+ hv_vcpu->cpuid_cache.features_edx = entry->edx;
+ }
+
+ entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_ENLIGHTMENT_INFO);
+ if (entry) {
+ hv_vcpu->cpuid_cache.enlightenments_eax = entry->eax;
+ hv_vcpu->cpuid_cache.enlightenments_ebx = entry->ebx;
+ }
+
+ entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES);
+ if (entry)
+ hv_vcpu->cpuid_cache.syndbg_cap_eax = entry->eax;
+
+ entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_NESTED_FEATURES);
+ if (entry) {
+ hv_vcpu->cpuid_cache.nested_eax = entry->eax;
+ hv_vcpu->cpuid_cache.nested_ebx = entry->ebx;
+ }
+}
+
+int kvm_hv_set_enforce_cpuid(struct kvm_vcpu *vcpu, bool enforce)
+{
+ struct kvm_vcpu_hv *hv_vcpu;
+ int ret = 0;
+
+ if (!to_hv_vcpu(vcpu)) {
+ if (enforce) {
+ ret = kvm_hv_vcpu_init(vcpu);
+ if (ret)
+ return ret;
+ } else {
+ return 0;
+ }
+ }
+
+ hv_vcpu = to_hv_vcpu(vcpu);
+ hv_vcpu->enforce_cpuid = enforce;
+
+ return ret;
+}
+
+static void kvm_hv_hypercall_set_result(struct kvm_vcpu *vcpu, u64 result)
+{
+ bool longmode;
+
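+ /* Per the TLFS, the result goes in RAX (64-bit) or the EDX:EAX pair (32-bit). */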
+ longmode = is_64_bit_hypercall(vcpu);
+ if (longmode)
+ kvm_rax_write(vcpu, result);
+ else {
+ kvm_rdx_write(vcpu, result >> 32);
+ kvm_rax_write(vcpu, result & 0xffffffff);
+ }
+}
+
+static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result)
+{
+ u32 tlb_lock_count = 0;
+ int ret;
+
+ if (hv_result_success(result) && is_guest_mode(vcpu) &&
+ kvm_hv_is_tlb_flush_hcall(vcpu) &&
+ kvm_read_guest(vcpu->kvm, to_hv_vcpu(vcpu)->nested.pa_page_gpa,
+ &tlb_lock_count, sizeof(tlb_lock_count)))
+ result = HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+ trace_kvm_hv_hypercall_done(result);
+ kvm_hv_hypercall_set_result(vcpu, result);
+ ++vcpu->stat.hypercalls;
+
+ ret = kvm_skip_emulated_instruction(vcpu);
+
+ if (tlb_lock_count)
+ kvm_x86_ops.nested_ops->hv_inject_synthetic_vmexit_post_tlb_flush(vcpu);
+
+ return ret;
+}
+
+static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
+{
+ return kvm_hv_hypercall_complete(vcpu, vcpu->run->hyperv.u.hcall.result);
+}
+
+static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc)
+{
+ struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
+ struct eventfd_ctx *eventfd;
+
+ if (unlikely(!hc->fast)) {
+ int ret;
+ gpa_t gpa = hc->ingpa;
+
+ if ((gpa & (__alignof__(hc->ingpa) - 1)) ||
+ offset_in_page(gpa) + sizeof(hc->ingpa) > PAGE_SIZE)
+ return HV_STATUS_INVALID_ALIGNMENT;
+
+ ret = kvm_vcpu_read_guest(vcpu, gpa,
+ &hc->ingpa, sizeof(hc->ingpa));
+ if (ret < 0)
+ return HV_STATUS_INVALID_ALIGNMENT;
+ }
+
+ /*
+ * Per spec, bits 32-47 contain the extra "flag number". However, we
+	 * have no use for it, and in all known use cases it is zero, so just
+ * report lookup failure if it isn't.
+ */
+ if (hc->ingpa & 0xffff00000000ULL)
+ return HV_STATUS_INVALID_PORT_ID;
+ /* remaining bits are reserved-zero */
+ if (hc->ingpa & ~KVM_HYPERV_CONN_ID_MASK)
+ return HV_STATUS_INVALID_HYPERCALL_INPUT;
+
+ /* the eventfd is protected by vcpu->kvm->srcu, but conn_to_evt isn't */
+ rcu_read_lock();
+ eventfd = idr_find(&hv->conn_to_evt, hc->ingpa);
+ rcu_read_unlock();
+ if (!eventfd)
+ return HV_STATUS_INVALID_PORT_ID;
+
+ eventfd_signal(eventfd);
+ return HV_STATUS_SUCCESS;
+}
+
+static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc)
+{
+ switch (hc->code) {
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST:
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE:
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX:
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX:
+ case HVCALL_SEND_IPI_EX:
+ return true;
+ }
+
+ return false;
+}
+
+static void kvm_hv_hypercall_read_xmm(struct kvm_hv_hcall *hc)
+{
+ int reg;
+
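+	/* XMM fast hypercalls pass input in XMM0-XMM5; snapshot the registers while the guest FPU state is loaded. */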
+ kvm_fpu_get();
+ for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++)
+ _kvm_read_sse_reg(reg, &hc->xmm[reg]);
+ kvm_fpu_put();
+}
+
+static bool hv_check_hypercall_access(struct kvm_vcpu_hv *hv_vcpu, u16 code)
+{
+ if (!hv_vcpu->enforce_cpuid)
+ return true;
+
+ switch (code) {
+ case HVCALL_NOTIFY_LONG_SPIN_WAIT:
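+		/*
+		 * EBX holds the recommended spinlock retry count; 0 means the
+		 * feature is unsupported and 0xFFFFFFFF means "never notify".
+		 */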
+ return hv_vcpu->cpuid_cache.enlightenments_ebx &&
+ hv_vcpu->cpuid_cache.enlightenments_ebx != U32_MAX;
+ case HVCALL_POST_MESSAGE:
+ return hv_vcpu->cpuid_cache.features_ebx & HV_POST_MESSAGES;
+ case HVCALL_SIGNAL_EVENT:
+ return hv_vcpu->cpuid_cache.features_ebx & HV_SIGNAL_EVENTS;
+ case HVCALL_POST_DEBUG_DATA:
+ case HVCALL_RETRIEVE_DEBUG_DATA:
+ case HVCALL_RESET_DEBUG_SESSION:
+ /*
+ * Return 'true' when SynDBG is disabled so the resulting code
+ * will be HV_STATUS_INVALID_HYPERCALL_CODE.
+ */
+ return !kvm_hv_is_syndbg_enabled(hv_vcpu->vcpu) ||
+ hv_vcpu->cpuid_cache.features_ebx & HV_DEBUGGING;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX:
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX:
+ if (!(hv_vcpu->cpuid_cache.enlightenments_eax &
+ HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
+ return false;
+ fallthrough;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST:
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE:
+ return hv_vcpu->cpuid_cache.enlightenments_eax &
+ HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED;
+ case HVCALL_SEND_IPI_EX:
+ if (!(hv_vcpu->cpuid_cache.enlightenments_eax &
+ HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
+ return false;
+ fallthrough;
+ case HVCALL_SEND_IPI:
+ return hv_vcpu->cpuid_cache.enlightenments_eax &
+ HV_X64_CLUSTER_IPI_RECOMMENDED;
+ case HV_EXT_CALL_QUERY_CAPABILITIES ... HV_EXT_CALL_MAX:
+ return hv_vcpu->cpuid_cache.features_ebx &
+ HV_ENABLE_EXTENDED_HYPERCALLS;
+ default:
+ break;
+ }
+
+ return true;
+}
+
+int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ struct kvm_hv_hcall hc;
+ u64 ret = HV_STATUS_SUCCESS;
+
+ /*
+	 * Per the Hyper-V spec, the hypercall instruction generates #UD when
+	 * executed at non-zero CPL or in real mode.
+ */
+ if (kvm_x86_call(get_cpl)(vcpu) != 0 || !is_protmode(vcpu)) {
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+
+#ifdef CONFIG_X86_64
+ if (is_64_bit_hypercall(vcpu)) {
+ hc.param = kvm_rcx_read(vcpu);
+ hc.ingpa = kvm_rdx_read(vcpu);
+ hc.outgpa = kvm_r8_read(vcpu);
+ } else
+#endif
+ {
+ hc.param = ((u64)kvm_rdx_read(vcpu) << 32) |
+ (kvm_rax_read(vcpu) & 0xffffffff);
+ hc.ingpa = ((u64)kvm_rbx_read(vcpu) << 32) |
+ (kvm_rcx_read(vcpu) & 0xffffffff);
+ hc.outgpa = ((u64)kvm_rdi_read(vcpu) << 32) |
+ (kvm_rsi_read(vcpu) & 0xffffffff);
+ }
+
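+	/* Decode the input value: call code in bits 15:0, plus variable header size, fast bit, and rep count/start index fields. */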
+ hc.code = hc.param & 0xffff;
+ hc.var_cnt = (hc.param & HV_HYPERCALL_VARHEAD_MASK) >> HV_HYPERCALL_VARHEAD_OFFSET;
+ hc.fast = !!(hc.param & HV_HYPERCALL_FAST_BIT);
+ hc.rep_cnt = (hc.param >> HV_HYPERCALL_REP_COMP_OFFSET) & 0xfff;
+ hc.rep_idx = (hc.param >> HV_HYPERCALL_REP_START_OFFSET) & 0xfff;
+ hc.rep = !!(hc.rep_cnt || hc.rep_idx);
+
+ trace_kvm_hv_hypercall(hc.code, hc.fast, hc.var_cnt, hc.rep_cnt,
+ hc.rep_idx, hc.ingpa, hc.outgpa);
+
+ if (unlikely(!hv_check_hypercall_access(hv_vcpu, hc.code))) {
+ ret = HV_STATUS_ACCESS_DENIED;
+ goto hypercall_complete;
+ }
+
+ if (unlikely(hc.param & HV_HYPERCALL_RSVD_MASK)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ goto hypercall_complete;
+ }
+
+ if (hc.fast && is_xmm_fast_hypercall(&hc)) {
+ if (unlikely(hv_vcpu->enforce_cpuid &&
+ !(hv_vcpu->cpuid_cache.features_edx &
+ HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE))) {
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+
+ kvm_hv_hypercall_read_xmm(&hc);
+ }
+
+ switch (hc.code) {
+ case HVCALL_NOTIFY_LONG_SPIN_WAIT:
+ if (unlikely(hc.rep || hc.var_cnt)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ kvm_vcpu_on_spin(vcpu, true);
+ break;
+ case HVCALL_SIGNAL_EVENT:
+ if (unlikely(hc.rep || hc.var_cnt)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ ret = kvm_hvcall_signal_event(vcpu, &hc);
+ if (ret != HV_STATUS_INVALID_PORT_ID)
+ break;
+ fallthrough; /* maybe userspace knows this conn_id */
+ case HVCALL_POST_MESSAGE:
+ /* don't bother userspace if it has no way to handle it */
+ if (unlikely(hc.rep || hc.var_cnt || !to_hv_synic(vcpu)->active)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ goto hypercall_userspace_exit;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST:
+ if (unlikely(hc.var_cnt)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ fallthrough;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX:
+ if (unlikely(!hc.rep_cnt || hc.rep_idx)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ ret = kvm_hv_flush_tlb(vcpu, &hc);
+ break;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE:
+ if (unlikely(hc.var_cnt)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ fallthrough;
+ case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX:
+ if (unlikely(hc.rep)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ ret = kvm_hv_flush_tlb(vcpu, &hc);
+ break;
+ case HVCALL_SEND_IPI:
+ if (unlikely(hc.var_cnt)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ fallthrough;
+ case HVCALL_SEND_IPI_EX:
+ if (unlikely(hc.rep)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+ break;
+ }
+ ret = kvm_hv_send_ipi(vcpu, &hc);
+ break;
+ case HVCALL_POST_DEBUG_DATA:
+ case HVCALL_RETRIEVE_DEBUG_DATA:
+ if (unlikely(hc.fast)) {
+ ret = HV_STATUS_INVALID_PARAMETER;
+ break;
+ }
+ fallthrough;
+ case HVCALL_RESET_DEBUG_SESSION: {
+ struct kvm_hv_syndbg *syndbg = to_hv_syndbg(vcpu);
+
+ if (!kvm_hv_is_syndbg_enabled(vcpu)) {
+ ret = HV_STATUS_INVALID_HYPERCALL_CODE;
+ break;
+ }
+
+ if (!(syndbg->options & HV_X64_SYNDBG_OPTION_USE_HCALLS)) {
+ ret = HV_STATUS_OPERATION_DENIED;
+ break;
+ }
+ goto hypercall_userspace_exit;
+ }
+ case HV_EXT_CALL_QUERY_CAPABILITIES ... HV_EXT_CALL_MAX:
+ if (unlikely(hc.fast)) {
+ ret = HV_STATUS_INVALID_PARAMETER;
+ break;
+ }
+ goto hypercall_userspace_exit;
+ default:
+ ret = HV_STATUS_INVALID_HYPERCALL_CODE;
+ break;
+ }
+
+hypercall_complete:
+ return kvm_hv_hypercall_complete(vcpu, ret);
+
+hypercall_userspace_exit:
+ vcpu->run->exit_reason = KVM_EXIT_HYPERV;
+ vcpu->run->hyperv.type = KVM_EXIT_HYPERV_HCALL;
+ vcpu->run->hyperv.u.hcall.input = hc.param;
+ vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa;
+ vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa;
+ vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace;
+ return 0;
+}
+
+void kvm_hv_init_vm(struct kvm *kvm)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+
+ mutex_init(&hv->hv_lock);
+ idr_init(&hv->conn_to_evt);
+}
+
+void kvm_hv_destroy_vm(struct kvm *kvm)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+ struct eventfd_ctx *eventfd;
+ int i;
+
+ idr_for_each_entry(&hv->conn_to_evt, eventfd, i)
+ eventfd_ctx_put(eventfd);
+ idr_destroy(&hv->conn_to_evt);
+}
+
+static int kvm_hv_eventfd_assign(struct kvm *kvm, u32 conn_id, int fd)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+ struct eventfd_ctx *eventfd;
+ int ret;
+
+ eventfd = eventfd_ctx_fdget(fd);
+ if (IS_ERR(eventfd))
+ return PTR_ERR(eventfd);
+
+ mutex_lock(&hv->hv_lock);
+ ret = idr_alloc(&hv->conn_to_evt, eventfd, conn_id, conn_id + 1,
+ GFP_KERNEL_ACCOUNT);
+ mutex_unlock(&hv->hv_lock);
+
+ if (ret >= 0)
+ return 0;
+
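+	/* idr_alloc() returns -ENOSPC when conn_id is already in use; report that as -EEXIST. */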
+ if (ret == -ENOSPC)
+ ret = -EEXIST;
+ eventfd_ctx_put(eventfd);
+ return ret;
+}
+
+static int kvm_hv_eventfd_deassign(struct kvm *kvm, u32 conn_id)
+{
+ struct kvm_hv *hv = to_kvm_hv(kvm);
+ struct eventfd_ctx *eventfd;
+
+ mutex_lock(&hv->hv_lock);
+ eventfd = idr_remove(&hv->conn_to_evt, conn_id);
+ mutex_unlock(&hv->hv_lock);
+
+ if (!eventfd)
+ return -ENOENT;
+
+ synchronize_srcu(&kvm->srcu);
+ eventfd_ctx_put(eventfd);
+ return 0;
+}
+
+int kvm_vm_ioctl_hv_eventfd(struct kvm *kvm, struct kvm_hyperv_eventfd *args)
+{
+ if ((args->flags & ~KVM_HYPERV_EVENTFD_DEASSIGN) ||
+ (args->conn_id & ~KVM_HYPERV_CONN_ID_MASK))
+ return -EINVAL;
+
+ if (args->flags == KVM_HYPERV_EVENTFD_DEASSIGN)
+ return kvm_hv_eventfd_deassign(kvm, args->conn_id);
+ return kvm_hv_eventfd_assign(kvm, args->conn_id, args->fd);
+}
+
+int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
+ struct kvm_cpuid_entry2 __user *entries)
+{
+ uint16_t evmcs_ver = 0;
+ struct kvm_cpuid_entry2 cpuid_entries[] = {
+ { .function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS },
+ { .function = HYPERV_CPUID_INTERFACE },
+ { .function = HYPERV_CPUID_VERSION },
+ { .function = HYPERV_CPUID_FEATURES },
+ { .function = HYPERV_CPUID_ENLIGHTMENT_INFO },
+ { .function = HYPERV_CPUID_IMPLEMENT_LIMITS },
+ { .function = HYPERV_CPUID_SYNDBG_VENDOR_AND_MAX_FUNCTIONS },
+ { .function = HYPERV_CPUID_SYNDBG_INTERFACE },
+ { .function = HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES },
+ { .function = HYPERV_CPUID_NESTED_FEATURES },
+ };
+ int i, nent = ARRAY_SIZE(cpuid_entries);
+
+ if (kvm_x86_ops.nested_ops->get_evmcs_version)
+ evmcs_ver = kvm_x86_ops.nested_ops->get_evmcs_version(vcpu);
+
+ if (cpuid->nent < nent)
+ return -E2BIG;
+
+ if (cpuid->nent > nent)
+ cpuid->nent = nent;
+
+ for (i = 0; i < nent; i++) {
+ struct kvm_cpuid_entry2 *ent = &cpuid_entries[i];
+ u32 signature[3];
+
+ switch (ent->function) {
+ case HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS:
+ memcpy(signature, "Linux KVM Hv", 12);
+
+ ent->eax = HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES;
+ ent->ebx = signature[0];
+ ent->ecx = signature[1];
+ ent->edx = signature[2];
+ break;
+
+ case HYPERV_CPUID_INTERFACE:
+ ent->eax = HYPERV_CPUID_SIGNATURE_EAX;
+ break;
+
+ case HYPERV_CPUID_VERSION:
+ /*
+			 * We implement some Hyper-V 2016 functions, so let's use
+ * this version.
+ */
+ ent->eax = 0x00003839;
+ ent->ebx = 0x000A0000;
+ break;
+
+ case HYPERV_CPUID_FEATURES:
+ ent->eax |= HV_MSR_VP_RUNTIME_AVAILABLE;
+ ent->eax |= HV_MSR_TIME_REF_COUNT_AVAILABLE;
+ ent->eax |= HV_MSR_SYNIC_AVAILABLE;
+ ent->eax |= HV_MSR_SYNTIMER_AVAILABLE;
+ ent->eax |= HV_MSR_APIC_ACCESS_AVAILABLE;
+ ent->eax |= HV_MSR_HYPERCALL_AVAILABLE;
+ ent->eax |= HV_MSR_VP_INDEX_AVAILABLE;
+ ent->eax |= HV_MSR_RESET_AVAILABLE;
+ ent->eax |= HV_MSR_REFERENCE_TSC_AVAILABLE;
+ ent->eax |= HV_ACCESS_FREQUENCY_MSRS;
+ ent->eax |= HV_ACCESS_REENLIGHTENMENT;
+ ent->eax |= HV_ACCESS_TSC_INVARIANT;
+
+ ent->ebx |= HV_POST_MESSAGES;
+ ent->ebx |= HV_SIGNAL_EVENTS;
+ ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
+
+ ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
+ ent->edx |= HV_FEATURE_FREQUENCY_MSRS_AVAILABLE;
+ ent->edx |= HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
+
+ ent->ebx |= HV_DEBUGGING;
+ ent->edx |= HV_X64_GUEST_DEBUGGING_AVAILABLE;
+ ent->edx |= HV_FEATURE_DEBUG_MSRS_AVAILABLE;
+ ent->edx |= HV_FEATURE_EXT_GVA_RANGES_FLUSH;
+
+ /*
+			 * Direct-mode synthetic timers only make sense with an
+			 * in-kernel LAPIC.
+ */
+ if (!vcpu || lapic_in_kernel(vcpu))
+ ent->edx |= HV_STIMER_DIRECT_MODE_AVAILABLE;
+
+ break;
+
+ case HYPERV_CPUID_ENLIGHTMENT_INFO:
+ ent->eax |= HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED;
+ ent->eax |= HV_X64_APIC_ACCESS_RECOMMENDED;
+ ent->eax |= HV_X64_RELAXED_TIMING_RECOMMENDED;
+ if (!vcpu || lapic_in_kernel(vcpu))
+ ent->eax |= HV_X64_CLUSTER_IPI_RECOMMENDED;
+ ent->eax |= HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED;
+ if (evmcs_ver)
+ ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
+ if (!cpu_smt_possible())
+ ent->eax |= HV_X64_NO_NONARCH_CORESHARING;
+
+ ent->eax |= HV_DEPRECATING_AEOI_RECOMMENDED;
+ /*
+ * Default number of spinlock retry attempts, matches
+			 * Hyper-V 2016.
+ */
+ ent->ebx = 0x00000FFF;
+
+ break;
+
+ case HYPERV_CPUID_IMPLEMENT_LIMITS:
+ /* Maximum number of virtual processors */
+ ent->eax = KVM_MAX_VCPUS;
+ /*
+ * Maximum number of logical processors, matches
+			 * Hyper-V 2016.
+ */
+ ent->ebx = 64;
+
+ break;
+
+ case HYPERV_CPUID_NESTED_FEATURES:
+ ent->eax = evmcs_ver;
+ ent->eax |= HV_X64_NESTED_DIRECT_FLUSH;
+ ent->eax |= HV_X64_NESTED_MSR_BITMAP;
+ ent->ebx |= HV_X64_NESTED_EVMCS1_PERF_GLOBAL_CTRL;
+ break;
+
+ case HYPERV_CPUID_SYNDBG_VENDOR_AND_MAX_FUNCTIONS:
+ memcpy(signature, "Linux KVM Hv", 12);
+
+ ent->eax = 0;
+ ent->ebx = signature[0];
+ ent->ecx = signature[1];
+ ent->edx = signature[2];
+ break;
+
+ case HYPERV_CPUID_SYNDBG_INTERFACE:
+ memcpy(signature, "VS#1\0\0\0\0\0\0\0\0", 12);
+ ent->eax = signature[0];
+ break;
+
+ case HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES:
+ ent->eax |= HV_X64_SYNDBG_CAP_ALLOW_KERNEL_DEBUGGING;
+ break;
+
+ default:
+ break;
+ }
+ }
+
+ if (copy_to_user(entries, cpuid_entries,
+ nent * sizeof(struct kvm_cpuid_entry2)))
+ return -EFAULT;
+
+ return 0;
+}
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
new file mode 100644
index 000000000000..6ce160ffa678
--- /dev/null
+++ b/arch/x86/kvm/hyperv.h
@@ -0,0 +1,327 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * KVM Microsoft Hyper-V emulation
+ *
+ * derived from arch/x86/kvm/x86.c
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright (C) 2008 Qumranet, Inc.
+ * Copyright IBM Corporation, 2008
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
+ * Copyright (C) 2015 Andrey Smetanin <asmetanin@virtuozzo.com>
+ *
+ * Authors:
+ * Avi Kivity <avi@qumranet.com>
+ * Yaniv Kamay <yaniv@qumranet.com>
+ * Amit Shah <amit.shah@qumranet.com>
+ * Ben-Ami Yassour <benami@il.ibm.com>
+ * Andrey Smetanin <asmetanin@virtuozzo.com>
+ */
+
+#ifndef __ARCH_X86_KVM_HYPERV_H__
+#define __ARCH_X86_KVM_HYPERV_H__
+
+#include <linux/kvm_host.h>
+#include "x86.h"
+
+#ifdef CONFIG_KVM_HYPERV
+
+/* "Hv#1" signature */
+#define HYPERV_CPUID_SIGNATURE_EAX 0x31237648
+
+/*
+ * The #defines related to the synthetic debugger are required by KDNet, but
+ * they are not documented in the Hyper-V TLFS because the synthetic debugger
+ * functionality has been deprecated and is subject to removal in future
+ * versions of Windows.
+ */
+#define HYPERV_CPUID_SYNDBG_VENDOR_AND_MAX_FUNCTIONS 0x40000080
+#define HYPERV_CPUID_SYNDBG_INTERFACE 0x40000081
+#define HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES 0x40000082
+
+/*
+ * Hyper-V synthetic debugger platform capabilities
+ * These are HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES.EAX bits.
+ */
+#define HV_X64_SYNDBG_CAP_ALLOW_KERNEL_DEBUGGING BIT(1)
+
+/* Hyper-V Synthetic debug options MSR */
+#define HV_X64_MSR_SYNDBG_CONTROL 0x400000F1
+#define HV_X64_MSR_SYNDBG_STATUS 0x400000F2
+#define HV_X64_MSR_SYNDBG_SEND_BUFFER 0x400000F3
+#define HV_X64_MSR_SYNDBG_RECV_BUFFER 0x400000F4
+#define HV_X64_MSR_SYNDBG_PENDING_BUFFER 0x400000F5
+#define HV_X64_MSR_SYNDBG_OPTIONS 0x400000FF
+
+/* Hyper-V HV_X64_MSR_SYNDBG_OPTIONS bits */
+#define HV_X64_SYNDBG_OPTION_USE_HCALLS BIT(2)
+
+static inline struct kvm_hv *to_kvm_hv(struct kvm *kvm)
+{
+ return &kvm->arch.hyperv;
+}
+
+static inline struct kvm_vcpu_hv *to_hv_vcpu(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.hyperv;
+}
+
+static inline struct kvm_vcpu_hv_synic *to_hv_synic(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ return &hv_vcpu->synic;
+}
+
+static inline struct kvm_vcpu *hv_synic_to_vcpu(struct kvm_vcpu_hv_synic *synic)
+{
+ struct kvm_vcpu_hv *hv_vcpu = container_of(synic, struct kvm_vcpu_hv, synic);
+
+ return hv_vcpu->vcpu;
+}
+
+static inline struct kvm_hv_syndbg *to_hv_syndbg(struct kvm_vcpu *vcpu)
+{
+ return &vcpu->kvm->arch.hyperv.hv_syndbg;
+}
+
+static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ return hv_vcpu ? hv_vcpu->vp_index : vcpu->vcpu_idx;
+}
+
+int kvm_hv_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host);
+int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata, bool host);
+
+static inline bool kvm_hv_hypercall_enabled(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.hyperv_enabled && to_kvm_hv(vcpu->kvm)->hv_guest_os_id;
+}
+
+int kvm_hv_hypercall(struct kvm_vcpu *vcpu);
+
+void kvm_hv_irq_routing_update(struct kvm *kvm);
+int kvm_hv_synic_set_irq(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm,
+ int irq_source_id, int level, bool line_status);
+void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int vector);
+int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages);
+
+static inline bool kvm_hv_synic_has_vector(struct kvm_vcpu *vcpu, int vector)
+{
+ return to_hv_vcpu(vcpu) && test_bit(vector, to_hv_synic(vcpu)->vec_bitmap);
+}
+
+static inline bool kvm_hv_synic_auto_eoi_set(struct kvm_vcpu *vcpu, int vector)
+{
+ return to_hv_vcpu(vcpu) &&
+ test_bit(vector, to_hv_synic(vcpu)->auto_eoi_bitmap);
+}
+
+void kvm_hv_vcpu_uninit(struct kvm_vcpu *vcpu);
+
+bool kvm_hv_assist_page_enabled(struct kvm_vcpu *vcpu);
+int kvm_hv_get_assist_page(struct kvm_vcpu *vcpu);
+
+static inline struct kvm_vcpu_hv_stimer *to_hv_stimer(struct kvm_vcpu *vcpu,
+ int timer_index)
+{
+ return &to_hv_vcpu(vcpu)->stimer[timer_index];
+}
+
+static inline struct kvm_vcpu *hv_stimer_to_vcpu(struct kvm_vcpu_hv_stimer *stimer)
+{
+ struct kvm_vcpu_hv *hv_vcpu;
+
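+	/* stimer - stimer->index points at stimer[0], letting container_of() recover the enclosing kvm_vcpu_hv. */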
+ hv_vcpu = container_of(stimer - stimer->index, struct kvm_vcpu_hv,
+ stimer[0]);
+ return hv_vcpu->vcpu;
+}
+
+static inline bool kvm_hv_has_stimer_pending(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ if (!hv_vcpu)
+ return false;
+
+ return !bitmap_empty(hv_vcpu->stimer_pending_bitmap,
+ HV_SYNIC_STIMER_COUNT);
+}
+
+/*
+ * With the HV_ACCESS_TSC_INVARIANT feature, the invariant TSC bit
+ * (CPUID.80000007H:EDX[8]) is only observed after
+ * HV_X64_MSR_TSC_INVARIANT_CONTROL has been written to.
+ */
+static inline bool kvm_hv_invtsc_suppressed(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ /*
+ * If Hyper-V's invariant TSC control is not exposed to the guest,
+	 * the invariant TSC CPUID flag is not suppressed; Windows guests were
+	 * observed to handle it correctly. Going forward, VMMs are encouraged
+	 * to enable Hyper-V's invariant TSC control when the invariant TSC
+	 * CPUID flag is set, to make KVM's behavior match genuine Hyper-V.
+ */
+ if (!hv_vcpu ||
+ !(hv_vcpu->cpuid_cache.features_eax & HV_ACCESS_TSC_INVARIANT))
+ return false;
+
+ /*
+ * If Hyper-V's invariant TSC control is exposed to the guest, KVM is
+ * responsible for suppressing the invariant TSC CPUID flag if the
+ * Hyper-V control is not enabled.
+ */
+ return !(to_kvm_hv(vcpu->kvm)->hv_invtsc_control & HV_EXPOSE_INVARIANT_TSC);
+}
+
+void kvm_hv_process_stimers(struct kvm_vcpu *vcpu);
+
+void kvm_hv_setup_tsc_page(struct kvm *kvm,
+ struct pvclock_vcpu_time_info *hv_clock);
+void kvm_hv_request_tsc_page_update(struct kvm *kvm);
+
+void kvm_hv_xsaves_xsavec_maybe_warn(struct kvm_vcpu *vcpu);
+
+void kvm_hv_init_vm(struct kvm *kvm);
+void kvm_hv_destroy_vm(struct kvm *kvm);
+int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu);
+void kvm_hv_set_cpuid(struct kvm_vcpu *vcpu, bool hyperv_enabled);
+int kvm_hv_set_enforce_cpuid(struct kvm_vcpu *vcpu, bool enforce);
+int kvm_vm_ioctl_hv_eventfd(struct kvm *kvm, struct kvm_hyperv_eventfd *args);
+int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
+ struct kvm_cpuid_entry2 __user *entries);
+
+static inline struct kvm_vcpu_hv_tlb_flush_fifo *kvm_hv_get_tlb_flush_fifo(struct kvm_vcpu *vcpu,
+ bool is_guest_mode)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ int i = is_guest_mode ? HV_L2_TLB_FLUSH_FIFO :
+ HV_L1_TLB_FLUSH_FIFO;
+
+ return &hv_vcpu->tlb_flush_fifo[i];
+}
+
+static inline void kvm_hv_vcpu_purge_flush_tlb(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv_tlb_flush_fifo *tlb_flush_fifo;
+
+ if (!to_hv_vcpu(vcpu) || !kvm_check_request(KVM_REQ_HV_TLB_FLUSH, vcpu))
+ return;
+
+ tlb_flush_fifo = kvm_hv_get_tlb_flush_fifo(vcpu, is_guest_mode(vcpu));
+
+ kfifo_reset_out(&tlb_flush_fifo->entries);
+}
+
+static inline bool guest_hv_cpuid_has_l2_tlb_flush(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ return hv_vcpu &&
+ (hv_vcpu->cpuid_cache.nested_eax & HV_X64_NESTED_DIRECT_FLUSH);
+}
+
+static inline bool kvm_hv_is_tlb_flush_hcall(struct kvm_vcpu *vcpu)
+{
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+ u16 code;
+
+ if (!hv_vcpu)
+ return false;
+
+ code = is_64_bit_hypercall(vcpu) ? kvm_rcx_read(vcpu) :
+ kvm_rax_read(vcpu);
+
+ return (code == HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE ||
+ code == HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST ||
+ code == HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX ||
+ code == HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX);
+}
+
+static inline int kvm_hv_verify_vp_assist(struct kvm_vcpu *vcpu)
+{
+ if (!to_hv_vcpu(vcpu))
+ return 0;
+
+ if (!kvm_hv_assist_page_enabled(vcpu))
+ return 0;
+
+ return kvm_hv_get_assist_page(vcpu);
+}
+
+static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu,
+ bool tdp_enabled)
+{
+ /*
+ * KVM_REQ_HV_TLB_FLUSH flushes entries from either L1's VP_ID or
+ * L2's VP_ID upon request from the guest. Make sure we check for
+ * pending entries in the right FIFO upon L1/L2 transition as these
+	 * requests may be posted by other vCPUs asynchronously.
+ */
+ if (to_hv_vcpu(vcpu) && tdp_enabled)
+ kvm_make_request(KVM_REQ_HV_TLB_FLUSH, vcpu);
+}
+
+int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
+#else /* CONFIG_KVM_HYPERV */
+static inline void kvm_hv_setup_tsc_page(struct kvm *kvm,
+ struct pvclock_vcpu_time_info *hv_clock) {}
+static inline void kvm_hv_request_tsc_page_update(struct kvm *kvm) {}
+static inline void kvm_hv_xsaves_xsavec_maybe_warn(struct kvm_vcpu *vcpu) {}
+static inline void kvm_hv_init_vm(struct kvm *kvm) {}
+static inline void kvm_hv_destroy_vm(struct kvm *kvm) {}
+static inline int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
+{
+ return 0;
+}
+static inline void kvm_hv_vcpu_uninit(struct kvm_vcpu *vcpu) {}
+static inline bool kvm_hv_hypercall_enabled(struct kvm_vcpu *vcpu)
+{
+ return false;
+}
+static inline int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
+{
+ return HV_STATUS_ACCESS_DENIED;
+}
+static inline void kvm_hv_vcpu_purge_flush_tlb(struct kvm_vcpu *vcpu) {}
+static inline bool kvm_hv_synic_has_vector(struct kvm_vcpu *vcpu, int vector)
+{
+ return false;
+}
+static inline bool kvm_hv_synic_auto_eoi_set(struct kvm_vcpu *vcpu, int vector)
+{
+ return false;
+}
+static inline void kvm_hv_synic_send_eoi(struct kvm_vcpu *vcpu, int vector) {}
+static inline bool kvm_hv_invtsc_suppressed(struct kvm_vcpu *vcpu)
+{
+ return false;
+}
+static inline void kvm_hv_set_cpuid(struct kvm_vcpu *vcpu, bool hyperv_enabled) {}
+static inline bool kvm_hv_has_stimer_pending(struct kvm_vcpu *vcpu)
+{
+ return false;
+}
+static inline bool kvm_hv_is_tlb_flush_hcall(struct kvm_vcpu *vcpu)
+{
+ return false;
+}
+static inline bool guest_hv_cpuid_has_l2_tlb_flush(struct kvm_vcpu *vcpu)
+{
+ return false;
+}
+static inline int kvm_hv_verify_vp_assist(struct kvm_vcpu *vcpu)
+{
+ return 0;
+}
+static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu)
+{
+ return vcpu->vcpu_idx;
+}
+static inline void kvm_hv_nested_transtion_tlb_flush(struct kvm_vcpu *vcpu, bool tdp_enabled) {}
+#endif /* CONFIG_KVM_HYPERV */
+
+#endif /* __ARCH_X86_KVM_HYPERV_H__ */
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 0150affad25d..850972deac8e 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -5,6 +5,7 @@
* Copyright (c) 2006 Intel Corporation
* Copyright (c) 2007 Keir Fraser, XenSource Inc
* Copyright (c) 2008 Intel Corporation
+ * Copyright 2009 Red Hat, Inc. and/or its affiliates.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
@@ -29,13 +30,15 @@
* Based on QEMU and Xen.
*/
-#define pr_fmt(fmt) "pit: " fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/kvm_host.h>
#include <linux/slab.h>
+#include "ioapic.h"
#include "irq.h"
#include "i8254.h"
+#include "x86.h"
#ifndef CONFIG_X86_64
#define mod_64(x, y) ((x) - (y) * div64_u64(x, y))
@@ -48,32 +51,9 @@
#define RW_STATE_WORD0 3
#define RW_STATE_WORD1 4
-/* Compute with 96 bit intermediate result: (a*b)/c */
-static u64 muldiv64(u64 a, u32 b, u32 c)
+static void pit_set_gate(struct kvm_pit *pit, int channel, u32 val)
{
- union {
- u64 ll;
- struct {
- u32 low, high;
- } l;
- } u, res;
- u64 rl, rh;
-
- u.ll = a;
- rl = (u64)u.l.low * (u64)b;
- rh = (u64)u.l.high * (u64)b;
- rh += (rl >> 32);
- res.l.high = div64_u64(rh, c);
- res.l.low = div64_u64(((mod_64(rh, c) << 32) + (rl & 0xffffffff)), c);
- return res.ll;
-}
-
-static void pit_set_gate(struct kvm *kvm, int channel, u32 val)
-{
- struct kvm_kpit_channel_state *c =
- &kvm->arch.vpit->pit_state.channels[channel];
-
- WARN_ON(!mutex_is_locked(&kvm->arch.vpit->pit_state.lock));
+ struct kvm_kpit_channel_state *c = &pit->pit_state.channels[channel];
switch (c->mode) {
default:
@@ -94,20 +74,18 @@ static void pit_set_gate(struct kvm *kvm, int channel, u32 val)
c->gate = val;
}
-static int pit_get_gate(struct kvm *kvm, int channel)
+static int pit_get_gate(struct kvm_pit *pit, int channel)
{
- WARN_ON(!mutex_is_locked(&kvm->arch.vpit->pit_state.lock));
-
- return kvm->arch.vpit->pit_state.channels[channel].gate;
+ return pit->pit_state.channels[channel].gate;
}
-static s64 __kpit_elapsed(struct kvm *kvm)
+static s64 __kpit_elapsed(struct kvm_pit *pit)
{
s64 elapsed;
ktime_t remaining;
- struct kvm_kpit_state *ps = &kvm->arch.vpit->pit_state;
+ struct kvm_kpit_state *ps = &pit->pit_state;
- if (!ps->pit_timer.period)
+ if (!ps->period)
return 0;
/*
@@ -119,33 +97,29 @@ static s64 __kpit_elapsed(struct kvm *kvm)
* itself with the initial count and continues counting
* from there.
*/
- remaining = hrtimer_get_remaining(&ps->pit_timer.timer);
- elapsed = ps->pit_timer.period - ktime_to_ns(remaining);
- elapsed = mod_64(elapsed, ps->pit_timer.period);
+ remaining = hrtimer_get_remaining(&ps->timer);
+ elapsed = ps->period - ktime_to_ns(remaining);
return elapsed;
}
-static s64 kpit_elapsed(struct kvm *kvm, struct kvm_kpit_channel_state *c,
+static s64 kpit_elapsed(struct kvm_pit *pit, struct kvm_kpit_channel_state *c,
int channel)
{
if (channel == 0)
- return __kpit_elapsed(kvm);
+ return __kpit_elapsed(pit);
return ktime_to_ns(ktime_sub(ktime_get(), c->count_load_time));
}
-static int pit_get_count(struct kvm *kvm, int channel)
+static int pit_get_count(struct kvm_pit *pit, int channel)
{
- struct kvm_kpit_channel_state *c =
- &kvm->arch.vpit->pit_state.channels[channel];
+ struct kvm_kpit_channel_state *c = &pit->pit_state.channels[channel];
s64 d, t;
int counter;
- WARN_ON(!mutex_is_locked(&kvm->arch.vpit->pit_state.lock));
-
- t = kpit_elapsed(kvm, c, channel);
- d = muldiv64(t, KVM_PIT_FREQ, NSEC_PER_SEC);
+ t = kpit_elapsed(pit, c, channel);
+ d = mul_u64_u32_div(t, KVM_PIT_FREQ, NSEC_PER_SEC);
switch (c->mode) {
case 0:
@@ -165,17 +139,14 @@ static int pit_get_count(struct kvm *kvm, int channel)
return counter;
}
-static int pit_get_out(struct kvm *kvm, int channel)
+static int pit_get_out(struct kvm_pit *pit, int channel)
{
- struct kvm_kpit_channel_state *c =
- &kvm->arch.vpit->pit_state.channels[channel];
+ struct kvm_kpit_channel_state *c = &pit->pit_state.channels[channel];
s64 d, t;
int out;
- WARN_ON(!mutex_is_locked(&kvm->arch.vpit->pit_state.lock));
-
- t = kpit_elapsed(kvm, c, channel);
- d = muldiv64(t, KVM_PIT_FREQ, NSEC_PER_SEC);
+ t = kpit_elapsed(pit, c, channel);
+ d = mul_u64_u32_div(t, KVM_PIT_FREQ, NSEC_PER_SEC);
switch (c->mode) {
default:
@@ -200,29 +171,23 @@ static int pit_get_out(struct kvm *kvm, int channel)
return out;
}
-static void pit_latch_count(struct kvm *kvm, int channel)
+static void pit_latch_count(struct kvm_pit *pit, int channel)
{
- struct kvm_kpit_channel_state *c =
- &kvm->arch.vpit->pit_state.channels[channel];
-
- WARN_ON(!mutex_is_locked(&kvm->arch.vpit->pit_state.lock));
+ struct kvm_kpit_channel_state *c = &pit->pit_state.channels[channel];
if (!c->count_latched) {
- c->latched_count = pit_get_count(kvm, channel);
+ c->latched_count = pit_get_count(pit, channel);
c->count_latched = c->rw_mode;
}
}
-static void pit_latch_status(struct kvm *kvm, int channel)
+static void pit_latch_status(struct kvm_pit *pit, int channel)
{
- struct kvm_kpit_channel_state *c =
- &kvm->arch.vpit->pit_state.channels[channel];
-
- WARN_ON(!mutex_is_locked(&kvm->arch.vpit->pit_state.lock));
+ struct kvm_kpit_channel_state *c = &pit->pit_state.channels[channel];
if (!c->status_latched) {
/* TODO: Return NULL COUNT (bit 6). */
- c->status = ((pit_get_out(kvm, channel) << 7) |
+ c->status = ((pit_get_out(pit, channel) << 7) |
(c->rw_mode << 4) |
(c->mode << 1) |
c->bcd);
@@ -230,24 +195,24 @@ static void pit_latch_status(struct kvm *kvm, int channel)
}
}
-int pit_has_pending_timer(struct kvm_vcpu *vcpu)
+static inline struct kvm_pit *pit_state_to_pit(struct kvm_kpit_state *ps)
{
- struct kvm_pit *pit = vcpu->kvm->arch.vpit;
-
- if (pit && kvm_vcpu_is_bsp(vcpu) && pit->pit_state.irq_ack)
- return atomic_read(&pit->pit_state.pit_timer.pending);
- return 0;
+ return container_of(ps, struct kvm_pit, pit_state);
}
static void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
{
struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
irq_ack_notifier);
- raw_spin_lock(&ps->inject_lock);
- if (atomic_dec_return(&ps->pit_timer.pending) < 0)
- atomic_inc(&ps->pit_timer.pending);
- ps->irq_ack = 1;
- raw_spin_unlock(&ps->inject_lock);
+ struct kvm_pit *pit = pit_state_to_pit(ps);
+
+ atomic_set(&ps->irq_ack, 1);
+ /* irq_ack should be set before pending is read. Order accesses with
+ * inc(pending) in pit_timer_fn and xchg(irq_ack, 0) in pit_do_work.
+ */
+ smp_mb();
+ if (atomic_dec_if_positive(&ps->pending) > 0)
+ kthread_queue_work(pit->worker, &pit->expired);
}
void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
@@ -255,64 +220,153 @@ void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
struct kvm_pit *pit = vcpu->kvm->arch.vpit;
struct hrtimer *timer;
- if (!kvm_vcpu_is_bsp(vcpu) || !pit)
+ /* Somewhat arbitrarily make vcpu0 the owner of the PIT. */
+ if (vcpu->vcpu_id || !pit)
return;
- timer = &pit->pit_state.pit_timer.timer;
+ timer = &pit->pit_state.timer;
+ mutex_lock(&pit->pit_state.lock);
if (hrtimer_cancel(timer))
hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
+ mutex_unlock(&pit->pit_state.lock);
}
-static void destroy_pit_timer(struct kvm_timer *pt)
+static void destroy_pit_timer(struct kvm_pit *pit)
{
- pr_debug("execute del timer!\n");
- hrtimer_cancel(&pt->timer);
+ hrtimer_cancel(&pit->pit_state.timer);
+ kthread_flush_work(&pit->expired);
}
-static bool kpit_is_periodic(struct kvm_timer *ktimer)
+static void pit_do_work(struct kthread_work *work)
{
- struct kvm_kpit_state *ps = container_of(ktimer, struct kvm_kpit_state,
- pit_timer);
- return ps->is_periodic;
+ struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
+ struct kvm *kvm = pit->kvm;
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+ struct kvm_kpit_state *ps = &pit->pit_state;
+
+ if (atomic_read(&ps->reinject) && !atomic_xchg(&ps->irq_ack, 0))
+ return;
+
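+	/* Pulse IRQ0: raise then immediately lower the line to deliver an edge. */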
+ kvm_set_irq(kvm, KVM_PIT_IRQ_SOURCE_ID, 0, 1, false);
+ kvm_set_irq(kvm, KVM_PIT_IRQ_SOURCE_ID, 0, 0, false);
+
+ /*
+ * Provides NMI watchdog support via Virtual Wire mode.
+ * The route is: PIT -> LVT0 in NMI mode.
+ *
+ * Note: Our Virtual Wire implementation does not follow
+	 * the MP specification. We propagate the PIT interrupt to all
+	 * VCPUs, but only when LVT0 is in NMI mode. The interrupt can
+ * also be simultaneously delivered through PIC and IOAPIC.
+ */
+ if (atomic_read(&kvm->arch.vapics_in_nmi_mode) > 0)
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ kvm_apic_nmi_wd_deliver(vcpu);
}
-static struct kvm_timer_ops kpit_ops = {
- .is_periodic = kpit_is_periodic,
-};
+static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
+{
+ struct kvm_kpit_state *ps = container_of(data, struct kvm_kpit_state, timer);
+ struct kvm_pit *pt = pit_state_to_pit(ps);
+
+ if (atomic_read(&ps->reinject))
+ atomic_inc(&ps->pending);
+
+ kthread_queue_work(pt->worker, &pt->expired);
+
+ if (ps->is_periodic) {
+ hrtimer_add_expires_ns(&ps->timer, ps->period);
+ return HRTIMER_RESTART;
+ } else
+ return HRTIMER_NORESTART;
+}
-static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int is_period)
+static inline void kvm_pit_reset_reinject(struct kvm_pit *pit)
{
- struct kvm_timer *pt = &ps->pit_timer;
+ atomic_set(&pit->pit_state.pending, 0);
+ atomic_set(&pit->pit_state.irq_ack, 1);
+}
+
+static void kvm_pit_set_reinject(struct kvm_pit *pit, bool reinject)
+{
+ struct kvm_kpit_state *ps = &pit->pit_state;
+ struct kvm *kvm = pit->kvm;
+
+ if (atomic_read(&ps->reinject) == reinject)
+ return;
+
+ /*
+	 * AMD SVM AVIC accelerates EOI writes and does not trap them.
+	 * This causes in-kernel PIT re-inject mode to fail, since it
+	 * checks ps->irq_ack before kvm_set_irq() and relies on the
+	 * ack notifier to promptly queue the pt->worker work item and
+	 * reinject the missed tick. So, deactivate APICv when the PIT
+	 * is in reinject mode.
+ */
+ if (reinject) {
+ kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);
+ /* The initial state is preserved while ps->reinject == 0. */
+ kvm_pit_reset_reinject(pit);
+ kvm_register_irq_ack_notifier(kvm, &ps->irq_ack_notifier);
+ kvm_register_irq_mask_notifier(kvm, 0, &pit->mask_notifier);
+ } else {
+ kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);
+ kvm_unregister_irq_ack_notifier(kvm, &ps->irq_ack_notifier);
+ kvm_unregister_irq_mask_notifier(kvm, 0, &pit->mask_notifier);
+ }
+
+ atomic_set(&ps->reinject, reinject);
+}
+
+static void create_pit_timer(struct kvm_pit *pit, u32 val, int is_period)
+{
+ struct kvm_kpit_state *ps = &pit->pit_state;
+ struct kvm *kvm = pit->kvm;
s64 interval;
- interval = muldiv64(val, NSEC_PER_SEC, KVM_PIT_FREQ);
+ if (!ioapic_in_kernel(kvm) ||
+ ps->flags & KVM_PIT_FLAGS_HPET_LEGACY)
+ return;
+
+ interval = mul_u64_u32_div(val, NSEC_PER_SEC, KVM_PIT_FREQ);
pr_debug("create pit timer, interval is %llu nsec\n", interval);
/* TODO: the new value only takes effect after the timer is retriggered */
- hrtimer_cancel(&pt->timer);
- pt->period = interval;
+ hrtimer_cancel(&ps->timer);
+ kthread_flush_work(&pit->expired);
+ ps->period = interval;
ps->is_periodic = is_period;
- pt->timer.function = kvm_timer_fn;
- pt->t_ops = &kpit_ops;
- pt->kvm = ps->pit->kvm;
- pt->vcpu = pt->kvm->bsp_vcpu;
+ kvm_pit_reset_reinject(pit);
- atomic_set(&pt->pending, 0);
- ps->irq_ack = 1;
+ /*
+ * Do not allow the guest to program periodic timers with small
+ * interval, since the hrtimers are not throttled by the host
+ * scheduler.
+ */
+ if (ps->is_periodic) {
+ s64 min_period = min_timer_period_us * 1000LL;
+
+ if (ps->period < min_period) {
+ pr_info_ratelimited(
+ "requested %lld ns "
+ "i8254 timer period limited to %lld ns\n",
+ ps->period, min_period);
+ ps->period = min_period;
+ }
+ }
- hrtimer_start(&pt->timer, ktime_add_ns(ktime_get(), interval),
+ hrtimer_start(&ps->timer, ktime_add_ns(ktime_get(), interval),
HRTIMER_MODE_ABS);
}
-static void pit_load_count(struct kvm *kvm, int channel, u32 val)
+static void pit_load_count(struct kvm_pit *pit, int channel, u32 val)
{
- struct kvm_kpit_state *ps = &kvm->arch.vpit->pit_state;
+ struct kvm_kpit_state *ps = &pit->pit_state;
- WARN_ON(!mutex_is_locked(&ps->lock));
-
- pr_debug("load_count val is %d, channel is %d\n", val, channel);
+ pr_debug("load_count val is %u, channel is %d\n", val, channel);
/*
* The largest possible initial count is 0; this is equivalent
@@ -335,32 +389,33 @@ static void pit_load_count(struct kvm *kvm, int channel, u32 val)
case 1:
/* FIXME: enhance mode 4 precision */
case 4:
- if (!(ps->flags & KVM_PIT_FLAGS_HPET_LEGACY)) {
- create_pit_timer(ps, val, 0);
- }
+ create_pit_timer(pit, val, 0);
break;
case 2:
case 3:
- if (!(ps->flags & KVM_PIT_FLAGS_HPET_LEGACY)){
- create_pit_timer(ps, val, 1);
- }
+ create_pit_timer(pit, val, 1);
break;
default:
- destroy_pit_timer(&ps->pit_timer);
+ destroy_pit_timer(pit);
}
}
-void kvm_pit_load_count(struct kvm *kvm, int channel, u32 val, int hpet_legacy_start)
+static void kvm_pit_load_count(struct kvm_pit *pit, int channel, u32 val,
+ int hpet_legacy_start)
{
u8 saved_mode;
+
+ WARN_ON_ONCE(!mutex_is_locked(&pit->pit_state.lock));
+
if (hpet_legacy_start) {
/* save existing mode for later reenablement */
- saved_mode = kvm->arch.vpit->pit_state.channels[0].mode;
- kvm->arch.vpit->pit_state.channels[0].mode = 0xff; /* disable timer */
- pit_load_count(kvm, channel, val);
- kvm->arch.vpit->pit_state.channels[0].mode = saved_mode;
+ WARN_ON(channel != 0);
+ saved_mode = pit->pit_state.channels[0].mode;
+ pit->pit_state.channels[0].mode = 0xff; /* disable timer */
+ pit_load_count(pit, channel, val);
+ pit->pit_state.channels[0].mode = saved_mode;
} else {
- pit_load_count(kvm, channel, val);
+ pit_load_count(pit, channel, val);
}
}
@@ -380,12 +435,12 @@ static inline int pit_in_range(gpa_t addr)
(addr < KVM_PIT_BASE_ADDRESS + KVM_PIT_MEM_LENGTH));
}
-static int pit_ioport_write(struct kvm_io_device *this,
+static int pit_ioport_write(struct kvm_vcpu *vcpu,
+ struct kvm_io_device *this,
gpa_t addr, int len, const void *data)
{
struct kvm_pit *pit = dev_to_pit(this);
struct kvm_kpit_state *pit_state = &pit->pit_state;
- struct kvm *kvm = pit->kvm;
int channel, access;
struct kvm_kpit_channel_state *s;
u32 val = *(u32 *) data;
@@ -406,12 +461,11 @@ static int pit_ioport_write(struct kvm_io_device *this,
if (channel == 3) {
/* Read-Back Command. */
for (channel = 0; channel < 3; channel++) {
- s = &pit_state->channels[channel];
if (val & (2 << channel)) {
if (!(val & 0x20))
- pit_latch_count(kvm, channel);
+ pit_latch_count(pit, channel);
if (!(val & 0x10))
- pit_latch_status(kvm, channel);
+ pit_latch_status(pit, channel);
}
}
} else {
@@ -419,7 +473,7 @@ static int pit_ioport_write(struct kvm_io_device *this,
s = &pit_state->channels[channel];
access = (val >> 4) & KVM_PIT_CHANNEL_MASK;
if (access == 0) {
- pit_latch_count(kvm, channel);
+ pit_latch_count(pit, channel);
} else {
s->rw_mode = access;
s->read_state = access;
@@ -436,17 +490,17 @@ static int pit_ioport_write(struct kvm_io_device *this,
switch (s->write_state) {
default:
case RW_STATE_LSB:
- pit_load_count(kvm, addr, val);
+ pit_load_count(pit, addr, val);
break;
case RW_STATE_MSB:
- pit_load_count(kvm, addr, val << 8);
+ pit_load_count(pit, addr, val << 8);
break;
case RW_STATE_WORD0:
s->write_latch = val;
s->write_state = RW_STATE_WORD1;
break;
case RW_STATE_WORD1:
- pit_load_count(kvm, addr, s->write_latch | (val << 8));
+ pit_load_count(pit, addr, s->write_latch | (val << 8));
s->write_state = RW_STATE_WORD0;
break;
}
@@ -456,12 +510,12 @@ static int pit_ioport_write(struct kvm_io_device *this,
return 0;
}
-static int pit_ioport_read(struct kvm_io_device *this,
+static int pit_ioport_read(struct kvm_vcpu *vcpu,
+ struct kvm_io_device *this,
gpa_t addr, int len, void *data)
{
struct kvm_pit *pit = dev_to_pit(this);
struct kvm_kpit_state *pit_state = &pit->pit_state;
- struct kvm *kvm = pit->kvm;
int ret, count;
struct kvm_kpit_channel_state *s;
if (!pit_in_range(addr))
@@ -498,20 +552,20 @@ static int pit_ioport_read(struct kvm_io_device *this,
switch (s->read_state) {
default:
case RW_STATE_LSB:
- count = pit_get_count(kvm, addr);
+ count = pit_get_count(pit, addr);
ret = count & 0xff;
break;
case RW_STATE_MSB:
- count = pit_get_count(kvm, addr);
+ count = pit_get_count(pit, addr);
ret = (count >> 8) & 0xff;
break;
case RW_STATE_WORD0:
- count = pit_get_count(kvm, addr);
+ count = pit_get_count(pit, addr);
ret = count & 0xff;
s->read_state = RW_STATE_WORD1;
break;
case RW_STATE_WORD1:
- count = pit_get_count(kvm, addr);
+ count = pit_get_count(pit, addr);
ret = (count >> 8) & 0xff;
s->read_state = RW_STATE_WORD0;
break;
@@ -526,29 +580,32 @@ static int pit_ioport_read(struct kvm_io_device *this,
return 0;
}
-static int speaker_ioport_write(struct kvm_io_device *this,
+static int speaker_ioport_write(struct kvm_vcpu *vcpu,
+ struct kvm_io_device *this,
gpa_t addr, int len, const void *data)
{
struct kvm_pit *pit = speaker_to_pit(this);
struct kvm_kpit_state *pit_state = &pit->pit_state;
- struct kvm *kvm = pit->kvm;
u32 val = *(u32 *) data;
if (addr != KVM_SPEAKER_BASE_ADDRESS)
return -EOPNOTSUPP;
mutex_lock(&pit_state->lock);
- pit_state->speaker_data_on = (val >> 1) & 1;
- pit_set_gate(kvm, 2, val & 1);
+ if (val & (1 << 1))
+ pit_state->flags |= KVM_PIT_FLAGS_SPEAKER_DATA_ON;
+ else
+ pit_state->flags &= ~KVM_PIT_FLAGS_SPEAKER_DATA_ON;
+ pit_set_gate(pit, 2, val & 1);
mutex_unlock(&pit_state->lock);
return 0;
}
-static int speaker_ioport_read(struct kvm_io_device *this,
- gpa_t addr, int len, void *data)
+static int speaker_ioport_read(struct kvm_vcpu *vcpu,
+ struct kvm_io_device *this,
+ gpa_t addr, int len, void *data)
{
struct kvm_pit *pit = speaker_to_pit(this);
struct kvm_kpit_state *pit_state = &pit->pit_state;
- struct kvm *kvm = pit->kvm;
unsigned int refresh_clock;
int ret;
if (addr != KVM_SPEAKER_BASE_ADDRESS)
@@ -558,8 +615,9 @@ static int speaker_ioport_read(struct kvm_io_device *this,
refresh_clock = ((unsigned int)ktime_to_ns(ktime_get()) >> 14) & 1;
mutex_lock(&pit_state->lock);
- ret = ((pit_state->speaker_data_on << 1) | pit_get_gate(kvm, 2) |
- (pit_get_out(kvm, 2) << 5) | (refresh_clock << 4));
+ ret = (!!(pit_state->flags & KVM_PIT_FLAGS_SPEAKER_DATA_ON) << 1) |
+ pit_get_gate(pit, 2) | (pit_get_out(pit, 2) << 5) |
+ (refresh_clock << 4);
if (len > sizeof(ret))
len = sizeof(ret);
memcpy(data, (char *)&ret, len);
@@ -567,33 +625,101 @@ static int speaker_ioport_read(struct kvm_io_device *this,
return 0;
}
-void kvm_pit_reset(struct kvm_pit *pit)
+static void kvm_pit_reset(struct kvm_pit *pit)
{
int i;
struct kvm_kpit_channel_state *c;
- mutex_lock(&pit->pit_state.lock);
pit->pit_state.flags = 0;
for (i = 0; i < 3; i++) {
c = &pit->pit_state.channels[i];
c->mode = 0xff;
c->gate = (i != 2);
- pit_load_count(pit->kvm, i, 0);
+ pit_load_count(pit, i, 0);
}
- mutex_unlock(&pit->pit_state.lock);
- atomic_set(&pit->pit_state.pit_timer.pending, 0);
- pit->pit_state.irq_ack = 1;
+ kvm_pit_reset_reinject(pit);
}
-static void pit_mask_notifer(struct kvm_irq_mask_notifier *kimn, bool mask)
+static void pit_mask_notifier(struct kvm_irq_mask_notifier *kimn, bool mask)
{
struct kvm_pit *pit = container_of(kimn, struct kvm_pit, mask_notifier);
- if (!mask) {
- atomic_set(&pit->pit_state.pit_timer.pending, 0);
- pit->pit_state.irq_ack = 1;
- }
+ if (!mask)
+ kvm_pit_reset_reinject(pit);
+}
+
+int kvm_vm_ioctl_get_pit(struct kvm *kvm, struct kvm_pit_state *ps)
+{
+ struct kvm_kpit_state *kps = &kvm->arch.vpit->pit_state;
+
+ BUILD_BUG_ON(sizeof(*ps) != sizeof(kps->channels));
+
+ mutex_lock(&kps->lock);
+ memcpy(ps, &kps->channels, sizeof(*ps));
+ mutex_unlock(&kps->lock);
+ return 0;
+}
+
+int kvm_vm_ioctl_set_pit(struct kvm *kvm, struct kvm_pit_state *ps)
+{
+ int i;
+ struct kvm_pit *pit = kvm->arch.vpit;
+
+ mutex_lock(&pit->pit_state.lock);
+ memcpy(&pit->pit_state.channels, ps, sizeof(*ps));
+ for (i = 0; i < 3; i++)
+ kvm_pit_load_count(pit, i, ps->channels[i].count, 0);
+ mutex_unlock(&pit->pit_state.lock);
+ return 0;
+}
+
+int kvm_vm_ioctl_get_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps)
+{
+ mutex_lock(&kvm->arch.vpit->pit_state.lock);
+ memcpy(ps->channels, &kvm->arch.vpit->pit_state.channels,
+ sizeof(ps->channels));
+ ps->flags = kvm->arch.vpit->pit_state.flags;
+ mutex_unlock(&kvm->arch.vpit->pit_state.lock);
+ memset(&ps->reserved, 0, sizeof(ps->reserved));
+ return 0;
+}
+
+int kvm_vm_ioctl_set_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps)
+{
+ int start = 0;
+ int i;
+ u32 prev_legacy, cur_legacy;
+ struct kvm_pit *pit = kvm->arch.vpit;
+
+ mutex_lock(&pit->pit_state.lock);
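+	/* Reload channel 0 in HPET legacy mode only on a 0 -> 1 transition of the flag. */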
+ prev_legacy = pit->pit_state.flags & KVM_PIT_FLAGS_HPET_LEGACY;
+ cur_legacy = ps->flags & KVM_PIT_FLAGS_HPET_LEGACY;
+ if (!prev_legacy && cur_legacy)
+ start = 1;
+ memcpy(&pit->pit_state.channels, &ps->channels,
+ sizeof(pit->pit_state.channels));
+ pit->pit_state.flags = ps->flags;
+ for (i = 0; i < 3; i++)
+ kvm_pit_load_count(pit, i, pit->pit_state.channels[i].count,
+ start && i == 0);
+ mutex_unlock(&pit->pit_state.lock);
+ return 0;
+}
+
+int kvm_vm_ioctl_reinject(struct kvm *kvm, struct kvm_reinject_control *control)
+{
+ struct kvm_pit *pit = kvm->arch.vpit;
+
+ /* pit->pit_state.lock was overloaded to prevent userspace from getting
+ * an inconsistent state after running multiple KVM_REINJECT_CONTROL
+ * ioctls in parallel. Use a separate lock if that ioctl isn't rare.
+ */
+ mutex_lock(&pit->pit_state.lock);
+ kvm_pit_set_reinject(pit, control->pit_reinject);
+ mutex_unlock(&pit->pit_state.lock);
+
+ return 0;
}
static const struct kvm_io_device_ops pit_dev_ops = {
@@ -606,132 +732,85 @@ static const struct kvm_io_device_ops speaker_dev_ops = {
.write = speaker_ioport_write,
};
-/* Caller must hold slots_lock */
struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
{
struct kvm_pit *pit;
struct kvm_kpit_state *pit_state;
+ struct pid *pid;
+ pid_t pid_nr;
int ret;
- pit = kzalloc(sizeof(struct kvm_pit), GFP_KERNEL);
+ pit = kzalloc(sizeof(struct kvm_pit), GFP_KERNEL_ACCOUNT);
if (!pit)
return NULL;
- pit->irq_source_id = kvm_request_irq_source_id(kvm);
- if (pit->irq_source_id < 0) {
- kfree(pit);
- return NULL;
- }
-
mutex_init(&pit->pit_state.lock);
- mutex_lock(&pit->pit_state.lock);
- raw_spin_lock_init(&pit->pit_state.inject_lock);
- kvm->arch.vpit = pit;
+ pid = get_pid(task_tgid(current));
+ pid_nr = pid_vnr(pid);
+ put_pid(pid);
+
+ pit->worker = kthread_run_worker(0, "kvm-pit/%d", pid_nr);
+ if (IS_ERR(pit->worker))
+ goto fail_kthread;
+
+ kthread_init_work(&pit->expired, pit_do_work);
+
pit->kvm = kvm;
pit_state = &pit->pit_state;
- pit_state->pit = pit;
- hrtimer_init(&pit_state->pit_timer.timer,
- CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ hrtimer_setup(&pit_state->timer, pit_timer_fn, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+
pit_state->irq_ack_notifier.gsi = 0;
pit_state->irq_ack_notifier.irq_acked = kvm_pit_ack_irq;
- kvm_register_irq_ack_notifier(kvm, &pit_state->irq_ack_notifier);
- pit_state->pit_timer.reinject = true;
- mutex_unlock(&pit->pit_state.lock);
+ pit->mask_notifier.func = pit_mask_notifier;
kvm_pit_reset(pit);
- pit->mask_notifier.func = pit_mask_notifer;
- kvm_register_irq_mask_notifier(kvm, 0, &pit->mask_notifier);
+ kvm_pit_set_reinject(pit, true);
+ mutex_lock(&kvm->slots_lock);
kvm_iodevice_init(&pit->dev, &pit_dev_ops);
- ret = kvm_io_bus_register_dev(kvm, KVM_PIO_BUS, &pit->dev);
+ ret = kvm_io_bus_register_dev(kvm, KVM_PIO_BUS, KVM_PIT_BASE_ADDRESS,
+ KVM_PIT_MEM_LENGTH, &pit->dev);
if (ret < 0)
- goto fail;
+ goto fail_register_pit;
if (flags & KVM_PIT_SPEAKER_DUMMY) {
kvm_iodevice_init(&pit->speaker_dev, &speaker_dev_ops);
ret = kvm_io_bus_register_dev(kvm, KVM_PIO_BUS,
- &pit->speaker_dev);
+ KVM_SPEAKER_BASE_ADDRESS, 4,
+ &pit->speaker_dev);
if (ret < 0)
- goto fail_unregister;
+ goto fail_register_speaker;
}
+ mutex_unlock(&kvm->slots_lock);
return pit;
-fail_unregister:
+fail_register_speaker:
kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &pit->dev);
-
-fail:
- kvm_unregister_irq_mask_notifier(kvm, 0, &pit->mask_notifier);
- kvm_unregister_irq_ack_notifier(kvm, &pit_state->irq_ack_notifier);
- kvm_free_irq_source_id(kvm, pit->irq_source_id);
-
+fail_register_pit:
+ mutex_unlock(&kvm->slots_lock);
+ kvm_pit_set_reinject(pit, false);
+ kthread_destroy_worker(pit->worker);
+fail_kthread:
kfree(pit);
return NULL;
}
void kvm_free_pit(struct kvm *kvm)
{
- struct hrtimer *timer;
-
- if (kvm->arch.vpit) {
- kvm_unregister_irq_mask_notifier(kvm, 0,
- &kvm->arch.vpit->mask_notifier);
- kvm_unregister_irq_ack_notifier(kvm,
- &kvm->arch.vpit->pit_state.irq_ack_notifier);
- mutex_lock(&kvm->arch.vpit->pit_state.lock);
- timer = &kvm->arch.vpit->pit_state.pit_timer.timer;
- hrtimer_cancel(timer);
- kvm_free_irq_source_id(kvm, kvm->arch.vpit->irq_source_id);
- mutex_unlock(&kvm->arch.vpit->pit_state.lock);
- kfree(kvm->arch.vpit);
- }
-}
-
-static void __inject_pit_timer_intr(struct kvm *kvm)
-{
- struct kvm_vcpu *vcpu;
- int i;
-
- kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
- kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
-
- /*
- * Provides NMI watchdog support via Virtual Wire mode.
- * The route is: PIT -> PIC -> LVT0 in NMI mode.
- *
- * Note: Our Virtual Wire implementation is simplified, only
- * propagating PIT interrupts to all VCPUs when they have set
- * LVT0 to NMI delivery. Other PIC interrupts are just sent to
- * VCPU0, and only if its LVT0 is in EXTINT mode.
- */
- if (kvm->arch.vapics_in_nmi_mode > 0)
- kvm_for_each_vcpu(i, vcpu, kvm)
- kvm_apic_nmi_wd_deliver(vcpu);
-}
-
-void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
-{
- struct kvm_pit *pit = vcpu->kvm->arch.vpit;
- struct kvm *kvm = vcpu->kvm;
- struct kvm_kpit_state *ps;
+ struct kvm_pit *pit = kvm->arch.vpit;
if (pit) {
- int inject = 0;
- ps = &pit->pit_state;
-
- /* Try to inject pending interrupts when
- * last one has been acked.
- */
- raw_spin_lock(&ps->inject_lock);
- if (atomic_read(&ps->pit_timer.pending) && ps->irq_ack) {
- ps->irq_ack = 0;
- inject = 1;
- }
- raw_spin_unlock(&ps->inject_lock);
- if (inject)
- __inject_pit_timer_intr(kvm);
+ mutex_lock(&kvm->slots_lock);
+ kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &pit->dev);
+ kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &pit->speaker_dev);
+ mutex_unlock(&kvm->slots_lock);
+ kvm_pit_set_reinject(pit, false);
+ hrtimer_cancel(&pit->pit_state.timer);
+ kthread_destroy_worker(pit->worker);
+ kfree(pit);
}
}
diff --git a/arch/x86/kvm/i8254.h b/arch/x86/kvm/i8254.h
index 900d6b0ba7c2..60fa499d2f8a 100644
--- a/arch/x86/kvm/i8254.h
+++ b/arch/x86/kvm/i8254.h
@@ -1,8 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __I8254_H
#define __I8254_H
-#include "iodev.h"
+#include <linux/kthread.h>
+#include <kvm/iodev.h>
+
+#include <uapi/asm/kvm.h>
+
+#include "ioapic.h"
+
+#ifdef CONFIG_KVM_IOAPIC
struct kvm_kpit_channel_state {
u32 count; /* can be 65536 */
u16 latched_count;
@@ -20,26 +28,28 @@ struct kvm_kpit_channel_state {
};
struct kvm_kpit_state {
+ /* All members before "struct mutex lock" are protected by the lock. */
struct kvm_kpit_channel_state channels[3];
u32 flags;
- struct kvm_timer pit_timer;
bool is_periodic;
- u32 speaker_data_on;
+ s64 period; /* unit: ns */
+ struct hrtimer timer;
+
struct mutex lock;
- struct kvm_pit *pit;
- raw_spinlock_t inject_lock;
- unsigned long irq_ack;
+ atomic_t reinject;
+ atomic_t pending; /* accumulated triggered timers */
+ atomic_t irq_ack;
struct kvm_irq_ack_notifier irq_ack_notifier;
};
struct kvm_pit {
- unsigned long base_addresss;
struct kvm_io_device dev;
struct kvm_io_device speaker_dev;
struct kvm *kvm;
struct kvm_kpit_state pit_state;
- int irq_source_id;
struct kvm_irq_mask_notifier mask_notifier;
+ struct kthread_worker *worker;
+ struct kthread_work expired;
};
#define KVM_PIT_BASE_ADDRESS 0x40
@@ -49,10 +59,14 @@ struct kvm_pit {
#define KVM_MAX_PIT_INTR_INTERVAL HZ / 100
#define KVM_PIT_CHANNEL_MASK 0x3
-void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu);
-void kvm_pit_load_count(struct kvm *kvm, int channel, u32 val, int hpet_legacy_start);
+int kvm_vm_ioctl_get_pit(struct kvm *kvm, struct kvm_pit_state *ps);
+int kvm_vm_ioctl_set_pit(struct kvm *kvm, struct kvm_pit_state *ps);
+int kvm_vm_ioctl_get_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps);
+int kvm_vm_ioctl_set_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps);
+int kvm_vm_ioctl_reinject(struct kvm *kvm, struct kvm_reinject_control *control);
+
struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags);
void kvm_free_pit(struct kvm *kvm);
-void kvm_pit_reset(struct kvm_pit *pit);
+#endif /* CONFIG_KVM_IOAPIC */
#endif
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 93825ff3338f..2ac7f1678c46 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -3,6 +3,7 @@
*
* Copyright (c) 2003-2004 Fabrice Bellard
* Copyright (c) 2007 Intel Corporation
+ * Copyright 2009 Red Hat, Inc. and/or its affiliates.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
@@ -25,18 +26,27 @@
* Yaozu (Eddie) Dong <Eddie.dong@intel.com>
* Port from Qemu.
*/
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/bitops.h>
+
+#include "ioapic.h"
#include "irq.h"
#include <linux/kvm_host.h>
#include "trace.h"
+#define pr_pic_unimpl(fmt, ...) \
+ pr_err_ratelimited("pic: " fmt, ## __VA_ARGS__)
+
+static void pic_irq_request(struct kvm *kvm, int level);
+
static void pic_lock(struct kvm_pic *s)
__acquires(&s->lock)
{
- raw_spin_lock(&s->lock);
+ spin_lock(&s->lock);
}
static void pic_unlock(struct kvm_pic *s)
@@ -44,22 +54,26 @@ static void pic_unlock(struct kvm_pic *s)
{
bool wakeup = s->wakeup_needed;
struct kvm_vcpu *vcpu;
+ unsigned long i;
s->wakeup_needed = false;
- raw_spin_unlock(&s->lock);
+ spin_unlock(&s->lock);
if (wakeup) {
- vcpu = s->kvm->bsp_vcpu;
- if (vcpu)
- kvm_vcpu_kick(vcpu);
+ kvm_for_each_vcpu(i, vcpu, s->kvm) {
+ if (kvm_apic_accept_pic_intr(vcpu)) {
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+ kvm_vcpu_kick(vcpu);
+ return;
+ }
+ }
}
}
static void pic_clear_isr(struct kvm_kpic_state *s, int irq)
{
s->isr &= ~(1 << irq);
- s->isr_ack |= (1 << irq);
if (s != &s->pics_state->pics[0])
irq += 8;
/*
@@ -73,16 +87,6 @@ static void pic_clear_isr(struct kvm_kpic_state *s, int irq)
pic_lock(s->pics_state);
}
-void kvm_pic_clear_isr_ack(struct kvm *kvm)
-{
- struct kvm_pic *s = pic_irqchip(kvm);
-
- pic_lock(s);
- s->pics[0].isr_ack = 0xff;
- s->pics[1].isr_ack = 0xff;
- pic_unlock(s);
-}
-
/*
* set irq level. If an edge is detected, then the IRR is set to 1
*/
@@ -173,10 +177,7 @@ static void pic_update_irq(struct kvm_pic *s)
pic_set_irq1(&s->pics[0], 2, 0);
}
irq = pic_get_irq(&s->pics[0]);
- if (irq >= 0)
- s->irq_request(s->irq_request_opaque, 1);
- else
- s->irq_request(s->irq_request_opaque, 0);
+ pic_irq_request(s->kvm, irq >= 0);
}
void kvm_pic_update_irq(struct kvm_pic *s)
@@ -186,18 +187,22 @@ void kvm_pic_update_irq(struct kvm_pic *s)
pic_unlock(s);
}
-int kvm_pic_set_irq(void *opaque, int irq, int level)
+int kvm_pic_set_irq(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm,
+ int irq_source_id, int level, bool line_status)
{
- struct kvm_pic *s = opaque;
- int ret = -1;
+ struct kvm_pic *s = kvm->arch.vpic;
+ int irq = e->irqchip.pin;
+ int ret, irq_level;
+
+ BUG_ON(irq < 0 || irq >= PIC_NUM_PINS);
pic_lock(s);
- if (irq >= 0 && irq < PIC_NUM_PINS) {
- ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
- pic_update_irq(s);
- trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
- s->pics[irq >> 3].imr, ret == 0);
- }
+ irq_level = __kvm_irq_line_state(&s->irq_states[irq],
+ irq_source_id, level);
+ ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, irq_level);
+ pic_update_irq(s);
+ trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
+ s->pics[irq >> 3].imr, ret == 0);
pic_unlock(s);
return ret;
@@ -226,7 +231,9 @@ static inline void pic_intack(struct kvm_kpic_state *s, int irq)
int kvm_pic_read_irq(struct kvm *kvm)
{
int irq, irq2, intno;
- struct kvm_pic *s = pic_irqchip(kvm);
+ struct kvm_pic *s = kvm->arch.vpic;
+
+ s->output = 0;
pic_lock(s);
irq = pic_get_irq(&s->pics[0]);
@@ -242,7 +249,6 @@ int kvm_pic_read_irq(struct kvm *kvm)
*/
irq2 = 7;
intno = s->pics[1].irq_base + irq2;
- irq = irq2 + 8;
} else
intno = s->pics[0].irq_base + irq;
} else {
@@ -258,35 +264,39 @@ int kvm_pic_read_irq(struct kvm *kvm)
return intno;
}
-void kvm_pic_reset(struct kvm_kpic_state *s)
+static void kvm_pic_reset(struct kvm_kpic_state *s)
{
int irq;
- struct kvm *kvm = s->pics_state->irq_request_opaque;
- struct kvm_vcpu *vcpu0 = kvm->bsp_vcpu;
- u8 irr = s->irr, isr = s->imr;
+ unsigned long i;
+ struct kvm_vcpu *vcpu;
+ u8 edge_irr = s->irr & ~s->elcr;
+ bool found = false;
s->last_irr = 0;
- s->irr = 0;
+ s->irr &= s->elcr;
s->imr = 0;
- s->isr = 0;
- s->isr_ack = 0xff;
s->priority_add = 0;
- s->irq_base = 0;
- s->read_reg_select = 0;
- s->poll = 0;
s->special_mask = 0;
- s->init_state = 0;
- s->auto_eoi = 0;
- s->rotate_on_auto_eoi = 0;
- s->special_fully_nested_mode = 0;
- s->init4 = 0;
-
- for (irq = 0; irq < PIC_NUM_PINS/2; irq++) {
- if (vcpu0 && kvm_apic_accept_pic_intr(vcpu0))
- if (irr & (1 << irq) || isr & (1 << irq)) {
- pic_clear_isr(s, irq);
- }
+ s->read_reg_select = 0;
+ if (!s->init4) {
+ s->special_fully_nested_mode = 0;
+ s->auto_eoi = 0;
}
+ s->init_state = 1;
+
+ kvm_for_each_vcpu(i, vcpu, s->pics_state->kvm)
+ if (kvm_apic_accept_pic_intr(vcpu)) {
+ found = true;
+ break;
+ }
+
+ if (!found)
+ return;
+
+ for (irq = 0; irq < PIC_NUM_PINS/2; irq++)
+ if (edge_irr & (1 << irq))
+ pic_clear_isr(s, irq);
}
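
Note how the reworked reset models real 8259 behavior: ICW1 drops only the edge-triggered request bits (those not flagged as level in ELCR), while level-triggered lines stay pending because the device is still asserting them. The masking, in isolation:

    /* Sketch of the edge/level split on reset; elcr flags level-triggered pins. */
    #include <stdio.h>

    int main(void)
    {
            unsigned char irr = 0xa5, elcr = 0x0f;
            unsigned char edge_irr = irr & ~elcr;  /* cleared by the reset */
            unsigned char level_irr = irr & elcr;  /* survives the reset   */

            printf("edge_irr=%#x level_irr=%#x\n", edge_irr, level_irr); /* 0xa0, 0x5 */
            return 0;
    }
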
static void pic_ioport_write(void *opaque, u32 addr, u32 val)
@@ -297,19 +307,13 @@ static void pic_ioport_write(void *opaque, u32 addr, u32 val)
addr &= 1;
if (addr == 0) {
if (val & 0x10) {
- kvm_pic_reset(s); /* init */
- /*
- * deassert a pending interrupt
- */
- s->pics_state->irq_request(s->pics_state->
- irq_request_opaque, 0);
- s->init_state = 1;
s->init4 = val & 1;
if (val & 0x02)
- printk(KERN_ERR "single mode not supported");
+ pr_pic_unimpl("single mode not supported");
if (val & 0x08)
- printk(KERN_ERR
- "level sensitive irq not supported");
+ pr_pic_unimpl(
+ "level sensitive irq not supported");
+ kvm_pic_reset(s);
} else if (val & 0x08) {
if (val & 0x04)
s->poll = 1;
@@ -356,10 +360,20 @@ static void pic_ioport_write(void *opaque, u32 addr, u32 val)
}
} else
switch (s->init_state) {
- case 0: /* normal mode */
+ case 0: { /* normal mode */
+ u8 imr_diff = s->imr ^ val,
+ off = (s == &s->pics_state->pics[0]) ? 0 : 8;
s->imr = val;
+ for (irq = 0; irq < PIC_NUM_PINS/2; irq++)
+ if (imr_diff & (1 << irq))
+ kvm_fire_mask_notifiers(
+ s->pics_state->kvm,
+ SELECT_PIC(irq + off),
+ irq + off,
+ !!(s->imr & (1 << irq)));
pic_update_irq(s->pics_state);
break;
+ }
case 1:
s->irq_base = val & 0xf8;
s->init_state = 2;
@@ -392,7 +406,10 @@ static u32 pic_poll_read(struct kvm_kpic_state *s, u32 addr1)
pic_clear_isr(s, ret);
if (addr1 >> 7 || ret != 2)
pic_update_irq(s->pics_state);
+ /* Bit 7 set means there's an interrupt */
+ ret |= 0x80;
} else {
+ /* Bit 7 clear means there's no interrupt */
ret = 0x07;
pic_update_irq(s->pics_state);
}
@@ -400,19 +417,16 @@ static u32 pic_poll_read(struct kvm_kpic_state *s, u32 addr1)
return ret;
}
-static u32 pic_ioport_read(void *opaque, u32 addr1)
+static u32 pic_ioport_read(void *opaque, u32 addr)
{
struct kvm_kpic_state *s = opaque;
- unsigned int addr;
int ret;
- addr = addr1;
- addr &= 1;
if (s->poll) {
- ret = pic_poll_read(s, addr1);
+ ret = pic_poll_read(s, addr);
s->poll = 0;
} else
- if (addr == 0)
+ if ((addr & 1) == 0)
if (s->read_reg_select)
ret = s->isr;
else
@@ -422,160 +436,220 @@ static u32 pic_ioport_read(void *opaque, u32 addr1)
return ret;
}
-static void elcr_ioport_write(void *opaque, u32 addr, u32 val)
+static void elcr_ioport_write(void *opaque, u32 val)
{
struct kvm_kpic_state *s = opaque;
s->elcr = val & s->elcr_mask;
}
-static u32 elcr_ioport_read(void *opaque, u32 addr1)
+static u32 elcr_ioport_read(void *opaque)
{
struct kvm_kpic_state *s = opaque;
return s->elcr;
}
-static int picdev_in_range(gpa_t addr)
-{
- switch (addr) {
- case 0x20:
- case 0x21:
- case 0xa0:
- case 0xa1:
- case 0x4d0:
- case 0x4d1:
- return 1;
- default:
- return 0;
- }
-}
-
-static inline struct kvm_pic *to_pic(struct kvm_io_device *dev)
-{
- return container_of(dev, struct kvm_pic, dev);
-}
-
-static int picdev_write(struct kvm_io_device *this,
+static int picdev_write(struct kvm_pic *s,
gpa_t addr, int len, const void *val)
{
- struct kvm_pic *s = to_pic(this);
unsigned char data = *(unsigned char *)val;
- if (!picdev_in_range(addr))
- return -EOPNOTSUPP;
if (len != 1) {
- if (printk_ratelimit())
- printk(KERN_ERR "PIC: non byte write\n");
+ pr_pic_unimpl("non byte write\n");
return 0;
}
- pic_lock(s);
switch (addr) {
case 0x20:
case 0x21:
+ pic_lock(s);
+ pic_ioport_write(&s->pics[0], addr, data);
+ pic_unlock(s);
+ break;
case 0xa0:
case 0xa1:
- pic_ioport_write(&s->pics[addr >> 7], addr, data);
+ pic_lock(s);
+ pic_ioport_write(&s->pics[1], addr, data);
+ pic_unlock(s);
break;
case 0x4d0:
case 0x4d1:
- elcr_ioport_write(&s->pics[addr & 1], addr, data);
+ pic_lock(s);
+ elcr_ioport_write(&s->pics[addr & 1], data);
+ pic_unlock(s);
break;
+ default:
+ return -EOPNOTSUPP;
}
- pic_unlock(s);
return 0;
}
-static int picdev_read(struct kvm_io_device *this,
+static int picdev_read(struct kvm_pic *s,
gpa_t addr, int len, void *val)
{
- struct kvm_pic *s = to_pic(this);
- unsigned char data = 0;
- if (!picdev_in_range(addr))
- return -EOPNOTSUPP;
+ unsigned char *data = (unsigned char *)val;
if (len != 1) {
- if (printk_ratelimit())
- printk(KERN_ERR "PIC: non byte read\n");
+ memset(val, 0, len);
+ pr_pic_unimpl("non byte read\n");
return 0;
}
- pic_lock(s);
switch (addr) {
case 0x20:
case 0x21:
case 0xa0:
case 0xa1:
- data = pic_ioport_read(&s->pics[addr >> 7], addr);
+ pic_lock(s);
+ *data = pic_ioport_read(&s->pics[addr >> 7], addr);
+ pic_unlock(s);
break;
case 0x4d0:
case 0x4d1:
- data = elcr_ioport_read(&s->pics[addr & 1], addr);
+ pic_lock(s);
+ *data = elcr_ioport_read(&s->pics[addr & 1]);
+ pic_unlock(s);
break;
+ default:
+ return -EOPNOTSUPP;
}
- *(unsigned char *)val = data;
- pic_unlock(s);
return 0;
}
+static int picdev_master_write(struct kvm_vcpu *vcpu, struct kvm_io_device *dev,
+ gpa_t addr, int len, const void *val)
+{
+ return picdev_write(container_of(dev, struct kvm_pic, dev_master),
+ addr, len, val);
+}
+
+static int picdev_master_read(struct kvm_vcpu *vcpu, struct kvm_io_device *dev,
+ gpa_t addr, int len, void *val)
+{
+ return picdev_read(container_of(dev, struct kvm_pic, dev_master),
+ addr, len, val);
+}
+
+static int picdev_slave_write(struct kvm_vcpu *vcpu, struct kvm_io_device *dev,
+ gpa_t addr, int len, const void *val)
+{
+ return picdev_write(container_of(dev, struct kvm_pic, dev_slave),
+ addr, len, val);
+}
+
+static int picdev_slave_read(struct kvm_vcpu *vcpu, struct kvm_io_device *dev,
+ gpa_t addr, int len, void *val)
+{
+ return picdev_read(container_of(dev, struct kvm_pic, dev_slave),
+ addr, len, val);
+}
+
+static int picdev_elcr_write(struct kvm_vcpu *vcpu, struct kvm_io_device *dev,
+ gpa_t addr, int len, const void *val)
+{
+ return picdev_write(container_of(dev, struct kvm_pic, dev_elcr),
+ addr, len, val);
+}
+
+static int picdev_elcr_read(struct kvm_vcpu *vcpu, struct kvm_io_device *dev,
+ gpa_t addr, int len, void *val)
+{
+ return picdev_read(container_of(dev, struct kvm_pic, dev_elcr),
+ addr, len, val);
+}
+
/*
* callback when PIC0 irq status changed
*/
-static void pic_irq_request(void *opaque, int level)
+static void pic_irq_request(struct kvm *kvm, int level)
{
- struct kvm *kvm = opaque;
- struct kvm_vcpu *vcpu = kvm->bsp_vcpu;
- struct kvm_pic *s = pic_irqchip(kvm);
- int irq = pic_get_irq(&s->pics[0]);
+ struct kvm_pic *s = kvm->arch.vpic;
- s->output = level;
- if (vcpu && level && (s->pics[0].isr_ack & (1 << irq))) {
- s->pics[0].isr_ack &= ~(1 << irq);
+ if (!s->output && level)
s->wakeup_needed = true;
- }
+ s->output = level;
}
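
The rewritten pic_irq_request() requests a wakeup only on a 0 -> 1 transition of the master PIC output, instead of consulting the removed isr_ack state. A tiny sketch of the edge detection:

    /* Sketch of the 0 -> 1 edge detection on the PIC output line. */
    #include <stdio.h>

    int main(void)
    {
            int output = 0;
            int levels[] = { 1, 1, 0, 1 };

            for (int i = 0; i < 4; i++) {
                    int wakeup = !output && levels[i];

                    printf("level=%d wakeup=%d\n", levels[i], wakeup);
                    output = levels[i];
            }
            return 0; /* prints wakeup = 1, 0, 0, 1 */
    }
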
-static const struct kvm_io_device_ops picdev_ops = {
- .read = picdev_read,
- .write = picdev_write,
+static const struct kvm_io_device_ops picdev_master_ops = {
+ .read = picdev_master_read,
+ .write = picdev_master_write,
+};
+
+static const struct kvm_io_device_ops picdev_slave_ops = {
+ .read = picdev_slave_read,
+ .write = picdev_slave_write,
+};
+
+static const struct kvm_io_device_ops picdev_elcr_ops = {
+ .read = picdev_elcr_read,
+ .write = picdev_elcr_write,
};
-struct kvm_pic *kvm_create_pic(struct kvm *kvm)
+int kvm_pic_init(struct kvm *kvm)
{
struct kvm_pic *s;
int ret;
- s = kzalloc(sizeof(struct kvm_pic), GFP_KERNEL);
+ s = kzalloc(sizeof(struct kvm_pic), GFP_KERNEL_ACCOUNT);
if (!s)
- return NULL;
- raw_spin_lock_init(&s->lock);
+ return -ENOMEM;
+ spin_lock_init(&s->lock);
s->kvm = kvm;
s->pics[0].elcr_mask = 0xf8;
s->pics[1].elcr_mask = 0xde;
- s->irq_request = pic_irq_request;
- s->irq_request_opaque = kvm;
s->pics[0].pics_state = s;
s->pics[1].pics_state = s;
/*
* Initialize PIO device
*/
- kvm_iodevice_init(&s->dev, &picdev_ops);
+ kvm_iodevice_init(&s->dev_master, &picdev_master_ops);
+ kvm_iodevice_init(&s->dev_slave, &picdev_slave_ops);
+ kvm_iodevice_init(&s->dev_elcr, &picdev_elcr_ops);
mutex_lock(&kvm->slots_lock);
- ret = kvm_io_bus_register_dev(kvm, KVM_PIO_BUS, &s->dev);
+ ret = kvm_io_bus_register_dev(kvm, KVM_PIO_BUS, 0x20, 2,
+ &s->dev_master);
+ if (ret < 0)
+ goto fail_unlock;
+
+ ret = kvm_io_bus_register_dev(kvm, KVM_PIO_BUS, 0xa0, 2, &s->dev_slave);
+ if (ret < 0)
+ goto fail_unreg_2;
+
+ ret = kvm_io_bus_register_dev(kvm, KVM_PIO_BUS, 0x4d0, 2, &s->dev_elcr);
+ if (ret < 0)
+ goto fail_unreg_1;
+
mutex_unlock(&kvm->slots_lock);
- if (ret < 0) {
- kfree(s);
- return NULL;
- }
- return s;
+ kvm->arch.vpic = s;
+
+ return 0;
+
+fail_unreg_1:
+ kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &s->dev_slave);
+
+fail_unreg_2:
+ kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &s->dev_master);
+
+fail_unlock:
+ mutex_unlock(&kvm->slots_lock);
+
+ kfree(s);
+
+ return ret;
}
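
With the single picdev gone, each legacy port pair is registered as its own kvm_io_device, so the PIO bus can dispatch by address without a per-device range scan. A quick reference for the mapping established above (the port values are architectural, the helper is illustrative):

    #include <stdio.h>

    static const char *pic_port_owner(unsigned short port)
    {
            switch (port) {
            case 0x20: case 0x21:   return "dev_master";
            case 0xa0: case 0xa1:   return "dev_slave";
            case 0x4d0: case 0x4d1: return "dev_elcr";
            default:                return "unclaimed";
            }
    }

    int main(void)
    {
            unsigned short ports[] = { 0x20, 0xa1, 0x4d0, 0x60 };

            for (int i = 0; i < 4; i++)
                    printf("%#5x -> %s\n", ports[i], pic_port_owner(ports[i]));
            return 0;
    }
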
-void kvm_destroy_pic(struct kvm *kvm)
+void kvm_pic_destroy(struct kvm *kvm)
{
struct kvm_pic *vpic = kvm->arch.vpic;
- if (vpic) {
- kvm_io_bus_unregister_dev(kvm, KVM_PIO_BUS, &vpic->dev);
- kvm->arch.vpic = NULL;
- kfree(vpic);
- }
+ if (!vpic)
+ return;
+
+ mutex_lock(&kvm->slots_lock);
+ kvm_io_bus_unregister_dev(vpic->kvm, KVM_PIO_BUS, &vpic->dev_master);
+ kvm_io_bus_unregister_dev(vpic->kvm, KVM_PIO_BUS, &vpic->dev_slave);
+ kvm_io_bus_unregister_dev(vpic->kvm, KVM_PIO_BUS, &vpic->dev_elcr);
+ mutex_unlock(&kvm->slots_lock);
+
+ kvm->arch.vpic = NULL;
+ kfree(vpic);
}
diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
new file mode 100644
index 000000000000..2c2783296aed
--- /dev/null
+++ b/arch/x86/kvm/ioapic.c
@@ -0,0 +1,789 @@
+// SPDX-License-Identifier: LGPL-2.1-or-later
+/*
+ * Copyright (C) 2001 MandrakeSoft S.A.
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
+ *
+ * MandrakeSoft S.A.
+ * 43, rue d'Aboukir
+ * 75002 Paris - France
+ * http://www.linux-mandrake.com/
+ * http://www.mandrakesoft.com/
+ *
+ * Yunhong Jiang <yunhong.jiang@intel.com>
+ * Yaozu (Eddie) Dong <eddie.dong@intel.com>
+ * Based on Xen 3.1 code.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/smp.h>
+#include <linux/hrtimer.h>
+#include <linux/io.h>
+#include <linux/slab.h>
+#include <linux/export.h>
+#include <linux/nospec.h>
+#include <asm/processor.h>
+#include <asm/page.h>
+#include <asm/current.h>
+
+#include "ioapic.h"
+#include "lapic.h"
+#include "irq.h"
+#include "trace.h"
+
+static int ioapic_service(struct kvm_ioapic *vioapic, int irq,
+ bool line_status);
+
+static void kvm_ioapic_update_eoi_one(struct kvm_vcpu *vcpu,
+ struct kvm_ioapic *ioapic,
+ int trigger_mode,
+ int pin);
+
+static unsigned long ioapic_read_indirect(struct kvm_ioapic *ioapic)
+{
+ unsigned long result = 0;
+
+ switch (ioapic->ioregsel) {
+ case IOAPIC_REG_VERSION:
+ result = ((((IOAPIC_NUM_PINS - 1) & 0xff) << 16)
+ | (IOAPIC_VERSION_ID & 0xff));
+ break;
+
+ case IOAPIC_REG_APIC_ID:
+ case IOAPIC_REG_ARB_ID:
+ result = ((ioapic->id & 0xf) << 24);
+ break;
+
+ default:
+ {
+ u32 redir_index = (ioapic->ioregsel - 0x10) >> 1;
+ u64 redir_content = ~0ULL;
+
+ if (redir_index < IOAPIC_NUM_PINS) {
+ u32 index = array_index_nospec(
+ redir_index, IOAPIC_NUM_PINS);
+
+ redir_content = ioapic->redirtbl[index].bits;
+ }
+
+ result = (ioapic->ioregsel & 0x1) ?
+ (redir_content >> 32) & 0xffffffff :
+ redir_content & 0xffffffff;
+ break;
+ }
+ }
+
+ return result;
+}
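
ioapic_read_indirect() implements the classic IOREGSEL/IOWIN indirection: redirection entries are 64 bits wide, addressed as register 0x10 + 2*pin, with the low bit of IOREGSEL selecting which 32-bit half the window returns. A standalone decode of that addressing (the entry value is illustrative):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t entry = 0xff00000000000931ULL; /* illustrative: dest 0xff, vector 0x31 */
            uint32_t ioregsel = 0x13;               /* 0x10 + 2*1 + 1: pin 1, high half */
            uint32_t pin  = (ioregsel - 0x10) >> 1;
            uint32_t half = (ioregsel & 1) ? (uint32_t)(entry >> 32)
                                           : (uint32_t)entry;

            printf("pin=%u half=%#x\n", pin, half); /* pin=1 half=0xff000000 */
            return 0;
    }
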
+
+static void rtc_irq_eoi_tracking_reset(struct kvm_ioapic *ioapic)
+{
+ ioapic->rtc_status.pending_eoi = 0;
+ bitmap_zero(ioapic->rtc_status.dest_map.map, KVM_MAX_VCPU_IDS);
+}
+
+static void kvm_rtc_eoi_tracking_restore_all(struct kvm_ioapic *ioapic);
+
+static void rtc_status_pending_eoi_check_valid(struct kvm_ioapic *ioapic)
+{
+ if (WARN_ON(ioapic->rtc_status.pending_eoi < 0))
+ kvm_rtc_eoi_tracking_restore_all(ioapic);
+}
+
+static void __rtc_irq_eoi_tracking_restore_one(struct kvm_vcpu *vcpu)
+{
+ bool new_val, old_val;
+ struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
+ struct dest_map *dest_map = &ioapic->rtc_status.dest_map;
+ union kvm_ioapic_redirect_entry *e;
+
+ e = &ioapic->redirtbl[RTC_GSI];
+ if (!kvm_apic_match_dest(vcpu, NULL, APIC_DEST_NOSHORT,
+ e->fields.dest_id,
+ kvm_lapic_irq_dest_mode(!!e->fields.dest_mode)))
+ return;
+
+ new_val = kvm_apic_pending_eoi(vcpu, e->fields.vector);
+ old_val = test_bit(vcpu->vcpu_id, dest_map->map);
+
+ if (new_val == old_val)
+ return;
+
+ if (new_val) {
+ __set_bit(vcpu->vcpu_id, dest_map->map);
+ dest_map->vectors[vcpu->vcpu_id] = e->fields.vector;
+ ioapic->rtc_status.pending_eoi++;
+ } else {
+ __clear_bit(vcpu->vcpu_id, dest_map->map);
+ ioapic->rtc_status.pending_eoi--;
+ rtc_status_pending_eoi_check_valid(ioapic);
+ }
+}
+
+void kvm_rtc_eoi_tracking_restore_one(struct kvm_vcpu *vcpu)
+{
+ struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
+
+ spin_lock(&ioapic->lock);
+ __rtc_irq_eoi_tracking_restore_one(vcpu);
+ spin_unlock(&ioapic->lock);
+}
+
+static void kvm_rtc_eoi_tracking_restore_all(struct kvm_ioapic *ioapic)
+{
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+
+ if (RTC_GSI >= IOAPIC_NUM_PINS)
+ return;
+
+ rtc_irq_eoi_tracking_reset(ioapic);
+ kvm_for_each_vcpu(i, vcpu, ioapic->kvm)
+ __rtc_irq_eoi_tracking_restore_one(vcpu);
+}
+
+static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu,
+ int vector)
+{
+ struct dest_map *dest_map = &ioapic->rtc_status.dest_map;
+
+ /* RTC special handling */
+ if (test_bit(vcpu->vcpu_id, dest_map->map) &&
+ (vector == dest_map->vectors[vcpu->vcpu_id]) &&
+ (test_and_clear_bit(vcpu->vcpu_id,
+ ioapic->rtc_status.dest_map.map))) {
+ --ioapic->rtc_status.pending_eoi;
+ rtc_status_pending_eoi_check_valid(ioapic);
+ }
+}
+
+static bool rtc_irq_check_coalesced(struct kvm_ioapic *ioapic)
+{
+ if (ioapic->rtc_status.pending_eoi > 0)
+ return true; /* coalesced */
+
+ return false;
+}
+
+static void ioapic_lazy_update_eoi(struct kvm_ioapic *ioapic, int irq)
+{
+ unsigned long i;
+ struct kvm_vcpu *vcpu;
+ union kvm_ioapic_redirect_entry *entry = &ioapic->redirtbl[irq];
+
+ kvm_for_each_vcpu(i, vcpu, ioapic->kvm) {
+ if (!kvm_apic_match_dest(vcpu, NULL, APIC_DEST_NOSHORT,
+ entry->fields.dest_id,
+ entry->fields.dest_mode) ||
+ kvm_apic_pending_eoi(vcpu, entry->fields.vector))
+ continue;
+
+ /*
+ * If no longer has pending EOI in LAPICs, update
+ * EOI for this vector.
+ */
+ rtc_irq_eoi(ioapic, vcpu, entry->fields.vector);
+ break;
+ }
+}
+
+static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
+ int irq_level, bool line_status)
+{
+ union kvm_ioapic_redirect_entry entry;
+ u32 mask = 1 << irq;
+ u32 old_irr;
+ int edge, ret;
+
+ entry = ioapic->redirtbl[irq];
+ edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG);
+
+ if (!irq_level) {
+ ioapic->irr &= ~mask;
+ ret = 1;
+ goto out;
+ }
+
+ /*
+ * AMD SVM AVIC accelerates EOI writes iff the interrupt is
+ * edge-triggered, in which case the in-kernel IOAPIC will not be
+ * able to receive the EOI. In this case, we do a lazy update of
+ * the pending EOI when trying to set the IOAPIC irq.
+ */
+ if (edge && kvm_apicv_activated(ioapic->kvm))
+ ioapic_lazy_update_eoi(ioapic, irq);
+
+ /*
+ * Return 0 for coalesced interrupts; for edge-triggered interrupts,
+ * this only happens if a previous edge has not been delivered due
+ * to masking. For level interrupts, the remote_irr field tells
+ * us if the interrupt is waiting for an EOI.
+ *
+ * RTC is special: it is edge-triggered, but userspace likes to know
+ * if it has been already ack-ed via EOI because coalesced RTC
+ * interrupts lead to time drift in Windows guests. So we track
+ * EOI manually for the RTC interrupt.
+ */
+ if (irq == RTC_GSI && line_status &&
+ rtc_irq_check_coalesced(ioapic)) {
+ ret = 0;
+ goto out;
+ }
+
+ old_irr = ioapic->irr;
+ ioapic->irr |= mask;
+ if (edge) {
+ ioapic->irr_delivered &= ~mask;
+ if (old_irr == ioapic->irr) {
+ ret = 0;
+ goto out;
+ }
+ }
+
+ ret = ioapic_service(ioapic, irq, line_status);
+
+out:
+ trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0);
+ return ret;
+}
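
For edge-triggered pins, the coalescing test above boils down to "was the IRR bit already set when the new edge arrived". In isolation:

    /* Sketch: an edge IRQ is coalesced when its IRR bit was already pending. */
    #include <stdio.h>

    int main(void)
    {
            unsigned irr = 0, mask = 1u << 4;
            unsigned old;

            old = irr; irr |= mask;
            printf("first edge coalesced?  %s\n", old == irr ? "yes" : "no"); /* no  */
            old = irr; irr |= mask;
            printf("second edge coalesced? %s\n", old == irr ? "yes" : "no"); /* yes */
            return 0;
    }
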
+
+static void kvm_ioapic_inject_all(struct kvm_ioapic *ioapic, unsigned long irr)
+{
+ u32 idx;
+
+ rtc_irq_eoi_tracking_reset(ioapic);
+ for_each_set_bit(idx, &irr, IOAPIC_NUM_PINS)
+ ioapic_set_irq(ioapic, idx, 1, true);
+
+ kvm_rtc_eoi_tracking_restore_all(ioapic);
+}
+
+void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, ulong *ioapic_handled_vectors)
+{
+ struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
+ struct dest_map *dest_map = &ioapic->rtc_status.dest_map;
+ union kvm_ioapic_redirect_entry *e;
+ int index;
+
+ spin_lock(&ioapic->lock);
+
+ /* Make sure we see any missing RTC EOI */
+ if (test_bit(vcpu->vcpu_id, dest_map->map))
+ __set_bit(dest_map->vectors[vcpu->vcpu_id],
+ ioapic_handled_vectors);
+
+ for (index = 0; index < IOAPIC_NUM_PINS; index++) {
+ e = &ioapic->redirtbl[index];
+ if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
+ kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
+ index == RTC_GSI) {
+ u16 dm = kvm_lapic_irq_dest_mode(!!e->fields.dest_mode);
+
+ kvm_scan_ioapic_irq(vcpu, e->fields.dest_id, dm,
+ e->fields.vector, ioapic_handled_vectors);
+ }
+ }
+ spin_unlock(&ioapic->lock);
+}
+
+void kvm_arch_post_irq_ack_notifier_list_update(struct kvm *kvm)
+{
+ if (!ioapic_in_kernel(kvm))
+ return;
+ kvm_make_scan_ioapic_request(kvm);
+}
+
+void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn)
+{
+ struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+
+ mutex_lock(&kvm->irq_lock);
+ kimn->irq = irq;
+ hlist_add_head_rcu(&kimn->link, &ioapic->mask_notifier_list);
+ mutex_unlock(&kvm->irq_lock);
+}
+
+void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn)
+{
+ mutex_lock(&kvm->irq_lock);
+ hlist_del_rcu(&kimn->link);
+ mutex_unlock(&kvm->irq_lock);
+ synchronize_srcu(&kvm->irq_srcu);
+}
+
+void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
+ bool mask)
+{
+ struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+ struct kvm_irq_mask_notifier *kimn;
+ int idx, gsi;
+
+ idx = srcu_read_lock(&kvm->irq_srcu);
+ gsi = kvm_irq_map_chip_pin(kvm, irqchip, pin);
+ if (gsi != -1)
+ hlist_for_each_entry_rcu(kimn, &ioapic->mask_notifier_list, link)
+ if (kimn->irq == gsi)
+ kimn->func(kimn, mask);
+ srcu_read_unlock(&kvm->irq_srcu, idx);
+}
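
A hypothetical consumer of the mask-notifier API, mirroring how the in-kernel PIT gates timer reinjection when the guest masks IRQ0; the demo_* names are invented for illustration, and this is a kernel-side fragment rather than a standalone program:

    /* Hypothetical kernel-side consumer (names demo_* are invented). */
    static void demo_mask_cb(struct kvm_irq_mask_notifier *kimn, bool masked)
    {
            pr_info("GSI %d is now %s\n", kimn->irq,
                    masked ? "masked" : "unmasked");
    }

    static struct kvm_irq_mask_notifier demo_kimn = { .func = demo_mask_cb };

    /* During device setup, with a valid struct kvm *kvm:        */
    /*         kvm_register_irq_mask_notifier(kvm, 0, &demo_kimn);   */
    /* and symmetrically on teardown:                             */
    /*         kvm_unregister_irq_mask_notifier(kvm, 0, &demo_kimn); */
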
+
+static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
+{
+ unsigned index;
+ bool mask_before, mask_after;
+ union kvm_ioapic_redirect_entry *e;
+ int old_remote_irr, old_delivery_status, old_dest_id, old_dest_mode;
+ DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);
+
+ switch (ioapic->ioregsel) {
+ case IOAPIC_REG_VERSION:
+ /* Writes are ignored. */
+ break;
+
+ case IOAPIC_REG_APIC_ID:
+ ioapic->id = (val >> 24) & 0xf;
+ break;
+
+ case IOAPIC_REG_ARB_ID:
+ break;
+
+ default:
+ index = (ioapic->ioregsel - 0x10) >> 1;
+
+ if (index >= IOAPIC_NUM_PINS)
+ return;
+ index = array_index_nospec(index, IOAPIC_NUM_PINS);
+ e = &ioapic->redirtbl[index];
+ mask_before = e->fields.mask;
+ /* Preserve read-only fields */
+ old_remote_irr = e->fields.remote_irr;
+ old_delivery_status = e->fields.delivery_status;
+ old_dest_id = e->fields.dest_id;
+ old_dest_mode = e->fields.dest_mode;
+ if (ioapic->ioregsel & 1) {
+ e->bits &= 0xffffffff;
+ e->bits |= (u64) val << 32;
+ } else {
+ e->bits &= ~0xffffffffULL;
+ e->bits |= (u32) val;
+ }
+ e->fields.remote_irr = old_remote_irr;
+ e->fields.delivery_status = old_delivery_status;
+
+ /*
+ * Some OSes (Linux, Xen) assume that Remote IRR bit will
+ * be cleared by IOAPIC hardware when the entry is configured
+ * as edge-triggered. This behavior is used to simulate an
+ * explicit EOI on IOAPICs that don't have the EOI register.
+ */
+ if (e->fields.trig_mode == IOAPIC_EDGE_TRIG)
+ e->fields.remote_irr = 0;
+
+ mask_after = e->fields.mask;
+ if (mask_before != mask_after)
+ kvm_fire_mask_notifiers(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index, mask_after);
+ if (e->fields.trig_mode == IOAPIC_LEVEL_TRIG &&
+ ioapic->irr & (1 << index) && !e->fields.mask && !e->fields.remote_irr) {
+ /*
+ * Pending status in irr may be outdated: the IRQ line may have
+ * already been deasserted by a device while the IRQ was masked.
+ * This occurs, for instance, if the interrupt is handled in a
+ * Linux guest as a oneshot interrupt (IRQF_ONESHOT). In this
+ * case the guest acknowledges the interrupt to the device in
+ * its threaded irq handler, i.e. after the EOI but before
+ * unmasking, so at the time of unmasking the IRQ line is
+ * already down but our pending irr bit is still set. In such
+ * cases, injecting this pending interrupt to the guest is
+ * buggy: the guest will receive an extra unwanted interrupt.
+ *
+ * So we need to check here if the IRQ is actually still pending.
+ * As we are generally not able to probe the IRQ line status
+ * directly, we do it through irqfd resampler. Namely, we clear
+ * the pending status and notify the resampler that this interrupt
+ * is done, without actually injecting it into the guest. If the
+ * IRQ line is actually already deasserted, we are done. If it is
+ * still asserted, a new interrupt will be shortly triggered
+ * through irqfd and injected into the guest.
+ *
+ * If, however, it's not possible to resample (no irqfd resampler
+ * registered for this irq), then unconditionally inject this
+ * pending interrupt into the guest, so that the guest will not
+ * miss an interrupt, although it may get an extra unwanted one.
+ */
+ if (kvm_notify_irqfd_resampler(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index))
+ ioapic->irr &= ~(1 << index);
+ else
+ ioapic_service(ioapic, index, false);
+ }
+ if (e->fields.delivery_mode == APIC_DM_FIXED) {
+ struct kvm_lapic_irq irq;
+
+ irq.vector = e->fields.vector;
+ irq.delivery_mode = e->fields.delivery_mode << 8;
+ irq.dest_mode =
+ kvm_lapic_irq_dest_mode(!!e->fields.dest_mode);
+ irq.level = false;
+ irq.trig_mode = e->fields.trig_mode;
+ irq.shorthand = APIC_DEST_NOSHORT;
+ irq.dest_id = e->fields.dest_id;
+ irq.msi_redir_hint = false;
+ bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS);
+ kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq,
+ vcpu_bitmap);
+ if (old_dest_mode != e->fields.dest_mode ||
+ old_dest_id != e->fields.dest_id) {
+ /*
+ * Update vcpu_bitmap with vcpus specified in
+ * the previous request as well. This is done to
+ * keep ioapic_handled_vectors synchronized.
+ */
+ irq.dest_id = old_dest_id;
+ irq.dest_mode =
+ kvm_lapic_irq_dest_mode(
+ !!e->fields.dest_mode);
+ kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq,
+ vcpu_bitmap);
+ }
+ kvm_make_scan_ioapic_request_mask(ioapic->kvm,
+ vcpu_bitmap);
+ } else {
+ kvm_make_scan_ioapic_request(ioapic->kvm);
+ }
+ break;
+ }
+}
+
+static int ioapic_service(struct kvm_ioapic *ioapic, int irq, bool line_status)
+{
+ union kvm_ioapic_redirect_entry *entry = &ioapic->redirtbl[irq];
+ struct kvm_lapic_irq irqe;
+ int ret;
+
+ if (entry->fields.mask ||
+ (entry->fields.trig_mode == IOAPIC_LEVEL_TRIG &&
+ entry->fields.remote_irr))
+ return -1;
+
+ irqe.dest_id = entry->fields.dest_id;
+ irqe.vector = entry->fields.vector;
+ irqe.dest_mode = kvm_lapic_irq_dest_mode(!!entry->fields.dest_mode);
+ irqe.trig_mode = entry->fields.trig_mode;
+ irqe.delivery_mode = entry->fields.delivery_mode << 8;
+ irqe.level = 1;
+ irqe.shorthand = APIC_DEST_NOSHORT;
+ irqe.msi_redir_hint = false;
+
+ if (irqe.trig_mode == IOAPIC_EDGE_TRIG)
+ ioapic->irr_delivered |= 1 << irq;
+
+ if (irq == RTC_GSI && line_status) {
+ /*
+ * pending_eoi cannot ever become negative (see
+ * rtc_status_pending_eoi_check_valid) and the caller
+ * ensures that it is only called if it is >= zero (namely,
+ * if rtc_irq_check_coalesced returns false).
+ */
+ BUG_ON(ioapic->rtc_status.pending_eoi != 0);
+ ret = kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe,
+ &ioapic->rtc_status.dest_map);
+ ioapic->rtc_status.pending_eoi = (ret < 0 ? 0 : ret);
+ } else
+ ret = kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe, NULL);
+
+ if (ret && irqe.trig_mode == IOAPIC_LEVEL_TRIG)
+ entry->fields.remote_irr = 1;
+
+ return ret;
+}
+
+int kvm_ioapic_set_irq(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm,
+ int irq_source_id, int level, bool line_status)
+{
+ struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+ int irq = e->irqchip.pin;
+ int ret, irq_level;
+
+ BUG_ON(irq < 0 || irq >= IOAPIC_NUM_PINS);
+
+ spin_lock(&ioapic->lock);
+ irq_level = __kvm_irq_line_state(&ioapic->irq_states[irq],
+ irq_source_id, level);
+ ret = ioapic_set_irq(ioapic, irq, irq_level, line_status);
+
+ spin_unlock(&ioapic->lock);
+
+ return ret;
+}
+
+static void kvm_ioapic_eoi_inject_work(struct work_struct *work)
+{
+ int i;
+ struct kvm_ioapic *ioapic = container_of(work, struct kvm_ioapic,
+ eoi_inject.work);
+ spin_lock(&ioapic->lock);
+ for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+ union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
+
+ if (ent->fields.trig_mode != IOAPIC_LEVEL_TRIG)
+ continue;
+
+ if (ioapic->irr & (1 << i) && !ent->fields.remote_irr)
+ ioapic_service(ioapic, i, false);
+ }
+ spin_unlock(&ioapic->lock);
+}
+
+#define IOAPIC_SUCCESSIVE_IRQ_MAX_COUNT 10000
+static void kvm_ioapic_update_eoi_one(struct kvm_vcpu *vcpu,
+ struct kvm_ioapic *ioapic,
+ int trigger_mode,
+ int pin)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[pin];
+
+ /*
+ * We are dropping the lock while calling the ack notifiers because
+ * ack notifier callbacks for assigned devices call into the IOAPIC
+ * recursively. Since remote_irr is cleared only after the call to
+ * the notifiers, if the same vector is delivered while the lock is
+ * dropped it will be put into irr and will be delivered after the
+ * ack notifier returns.
+ */
+ spin_unlock(&ioapic->lock);
+ kvm_notify_acked_irq(ioapic->kvm, KVM_IRQCHIP_IOAPIC, pin);
+ spin_lock(&ioapic->lock);
+
+ if (trigger_mode != IOAPIC_LEVEL_TRIG ||
+ kvm_lapic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI)
+ return;
+
+ ASSERT(ent->fields.trig_mode == IOAPIC_LEVEL_TRIG);
+ ent->fields.remote_irr = 0;
+ if (!ent->fields.mask && (ioapic->irr & (1 << pin))) {
+ ++ioapic->irq_eoi[pin];
+ if (ioapic->irq_eoi[pin] == IOAPIC_SUCCESSIVE_IRQ_MAX_COUNT) {
+ /*
+ * Real hardware does not deliver the interrupt
+ * immediately during eoi broadcast, and this
+ * lets a buggy guest make slow progress
+ * even if it does not correctly handle a
+ * level-triggered interrupt. Emulate this
+ * behavior if we detect an interrupt storm.
+ */
+ schedule_delayed_work(&ioapic->eoi_inject, HZ / 100);
+ ioapic->irq_eoi[pin] = 0;
+ trace_kvm_ioapic_delayed_eoi_inj(ent->bits);
+ } else {
+ ioapic_service(ioapic, pin, false);
+ }
+ } else {
+ ioapic->irq_eoi[pin] = 0;
+ }
+}
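
The storm throttle above kicks in after IOAPIC_SUCCESSIVE_IRQ_MAX_COUNT (10000) back-to-back EOIs on the same still-asserted pin, deferring redelivery by HZ/100 jiffies, i.e. 10 ms regardless of the configured tick rate:

    #include <stdio.h>

    int main(void)
    {
            unsigned hz = 1000;        /* illustrative CONFIG_HZ        */
            unsigned delay = hz / 100; /* matches the HZ / 100 above    */

            printf("defer %u jiffies = %u ms\n", delay, delay * 1000 / hz);
            return 0;                  /* 10 ms for any HZ >= 100       */
    }
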
+
+void kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu, int vector, int trigger_mode)
+{
+ int i;
+ struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
+
+ spin_lock(&ioapic->lock);
+ rtc_irq_eoi(ioapic, vcpu, vector);
+ for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+ union kvm_ioapic_redirect_entry *ent = &ioapic->redirtbl[i];
+
+ if (ent->fields.vector != vector)
+ continue;
+ kvm_ioapic_update_eoi_one(vcpu, ioapic, trigger_mode, i);
+ }
+ spin_unlock(&ioapic->lock);
+}
+
+static inline struct kvm_ioapic *to_ioapic(struct kvm_io_device *dev)
+{
+ return container_of(dev, struct kvm_ioapic, dev);
+}
+
+static inline int ioapic_in_range(struct kvm_ioapic *ioapic, gpa_t addr)
+{
+ return (addr >= ioapic->base_address &&
+ addr < ioapic->base_address + IOAPIC_MEM_LENGTH);
+}
+
+static int ioapic_mmio_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
+ gpa_t addr, int len, void *val)
+{
+ struct kvm_ioapic *ioapic = to_ioapic(this);
+ u32 result;
+ if (!ioapic_in_range(ioapic, addr))
+ return -EOPNOTSUPP;
+
+ ASSERT(!(addr & 0xf)); /* check alignment */
+
+ addr &= 0xff;
+ spin_lock(&ioapic->lock);
+ switch (addr) {
+ case IOAPIC_REG_SELECT:
+ result = ioapic->ioregsel;
+ break;
+
+ case IOAPIC_REG_WINDOW:
+ result = ioapic_read_indirect(ioapic);
+ break;
+
+ default:
+ result = 0;
+ break;
+ }
+ spin_unlock(&ioapic->lock);
+
+ switch (len) {
+ case 8:
+ *(u64 *) val = result;
+ break;
+ case 1:
+ case 2:
+ case 4:
+ memcpy(val, (char *)&result, len);
+ break;
+ default:
+ printk(KERN_WARNING "ioapic: wrong length %d\n", len);
+ }
+ return 0;
+}
+
+static int ioapic_mmio_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
+ gpa_t addr, int len, const void *val)
+{
+ struct kvm_ioapic *ioapic = to_ioapic(this);
+ u32 data;
+ if (!ioapic_in_range(ioapic, addr))
+ return -EOPNOTSUPP;
+
+ ASSERT(!(addr & 0xf)); /* check alignment */
+
+ switch (len) {
+ case 8:
+ case 4:
+ data = *(u32 *) val;
+ break;
+ case 2:
+ data = *(u16 *) val;
+ break;
+ case 1:
+ data = *(u8 *) val;
+ break;
+ default:
+ printk(KERN_WARNING "ioapic: Unsupported size %d\n", len);
+ return 0;
+ }
+
+ addr &= 0xff;
+ spin_lock(&ioapic->lock);
+ switch (addr) {
+ case IOAPIC_REG_SELECT:
+ ioapic->ioregsel = data & 0xFF; /* 8-bit register */
+ break;
+
+ case IOAPIC_REG_WINDOW:
+ ioapic_write_indirect(ioapic, data);
+ break;
+
+ default:
+ break;
+ }
+ spin_unlock(&ioapic->lock);
+ return 0;
+}
+
+static void kvm_ioapic_reset(struct kvm_ioapic *ioapic)
+{
+ int i;
+
+ cancel_delayed_work_sync(&ioapic->eoi_inject);
+ for (i = 0; i < IOAPIC_NUM_PINS; i++)
+ ioapic->redirtbl[i].fields.mask = 1;
+ ioapic->base_address = IOAPIC_DEFAULT_BASE_ADDRESS;
+ ioapic->ioregsel = 0;
+ ioapic->irr = 0;
+ ioapic->irr_delivered = 0;
+ ioapic->id = 0;
+ memset(ioapic->irq_eoi, 0x00, sizeof(ioapic->irq_eoi));
+ rtc_irq_eoi_tracking_reset(ioapic);
+}
+
+static const struct kvm_io_device_ops ioapic_mmio_ops = {
+ .read = ioapic_mmio_read,
+ .write = ioapic_mmio_write,
+};
+
+int kvm_ioapic_init(struct kvm *kvm)
+{
+ struct kvm_ioapic *ioapic;
+ int ret;
+
+ ioapic = kzalloc(sizeof(struct kvm_ioapic), GFP_KERNEL_ACCOUNT);
+ if (!ioapic)
+ return -ENOMEM;
+ spin_lock_init(&ioapic->lock);
+ INIT_DELAYED_WORK(&ioapic->eoi_inject, kvm_ioapic_eoi_inject_work);
+ INIT_HLIST_HEAD(&ioapic->mask_notifier_list);
+ kvm->arch.vioapic = ioapic;
+ kvm_ioapic_reset(ioapic);
+ kvm_iodevice_init(&ioapic->dev, &ioapic_mmio_ops);
+ ioapic->kvm = kvm;
+ mutex_lock(&kvm->slots_lock);
+ ret = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, ioapic->base_address,
+ IOAPIC_MEM_LENGTH, &ioapic->dev);
+ mutex_unlock(&kvm->slots_lock);
+ if (ret < 0) {
+ kvm->arch.vioapic = NULL;
+ kfree(ioapic);
+ }
+
+ return ret;
+}
+
+void kvm_ioapic_destroy(struct kvm *kvm)
+{
+ struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+
+ if (!ioapic)
+ return;
+
+ cancel_delayed_work_sync(&ioapic->eoi_inject);
+ mutex_lock(&kvm->slots_lock);
+ kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS, &ioapic->dev);
+ mutex_unlock(&kvm->slots_lock);
+ kvm->arch.vioapic = NULL;
+ kfree(ioapic);
+}
+
+void kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state)
+{
+ struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+
+ spin_lock(&ioapic->lock);
+ memcpy(state, ioapic, sizeof(struct kvm_ioapic_state));
+ state->irr &= ~ioapic->irr_delivered;
+ spin_unlock(&ioapic->lock);
+}
+
+void kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state)
+{
+ struct kvm_ioapic *ioapic = kvm->arch.vioapic;
+
+ spin_lock(&ioapic->lock);
+ memcpy(ioapic, state, sizeof(struct kvm_ioapic_state));
+ ioapic->irr = 0;
+ ioapic->irr_delivered = 0;
+ kvm_make_scan_ioapic_request(kvm);
+ kvm_ioapic_inject_all(ioapic, state->irr);
+ spin_unlock(&ioapic->lock);
+}
diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
new file mode 100644
index 000000000000..bf28dbc11ff6
--- /dev/null
+++ b/arch/x86/kvm/ioapic.h
@@ -0,0 +1,141 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_IO_APIC_H
+#define __KVM_IO_APIC_H
+
+#include <linux/kvm_host.h>
+#include <kvm/iodev.h>
+#include "irq.h"
+
+struct kvm;
+struct kvm_vcpu;
+
+#define IOAPIC_NUM_PINS KVM_IOAPIC_NUM_PINS
+#define MAX_NR_RESERVED_IOAPIC_PINS KVM_MAX_IRQ_ROUTES
+#define IOAPIC_VERSION_ID 0x11 /* IOAPIC version */
+#define IOAPIC_EDGE_TRIG 0
+#define IOAPIC_LEVEL_TRIG 1
+
+#define IOAPIC_DEFAULT_BASE_ADDRESS 0xfec00000
+#define IOAPIC_MEM_LENGTH 0x100
+
+/* Direct registers. */
+#define IOAPIC_REG_SELECT 0x00
+#define IOAPIC_REG_WINDOW 0x10
+
+/* Indirect registers. */
+#define IOAPIC_REG_APIC_ID 0x00 /* x86 IOAPIC only */
+#define IOAPIC_REG_VERSION 0x01
+#define IOAPIC_REG_ARB_ID 0x02 /* x86 IOAPIC only */
+
+/* ioapic delivery mode */
+#define IOAPIC_FIXED 0x0
+#define IOAPIC_LOWEST_PRIORITY 0x1
+#define IOAPIC_PMI 0x2
+#define IOAPIC_NMI 0x4
+#define IOAPIC_INIT 0x5
+#define IOAPIC_EXTINT 0x7
+
+#define RTC_GSI 8
+
+struct dest_map {
+ /* vcpu bitmap where IRQ has been sent */
+ DECLARE_BITMAP(map, KVM_MAX_VCPU_IDS);
+
+ /*
+ * Vector sent to a given vcpu, only valid when
+ * the vcpu's bit in map is set
+ */
+ u8 vectors[KVM_MAX_VCPU_IDS];
+};
+
+struct rtc_status {
+ int pending_eoi;
+ struct dest_map dest_map;
+};
+
+union kvm_ioapic_redirect_entry {
+ u64 bits;
+ struct {
+ u8 vector;
+ u8 delivery_mode:3;
+ u8 dest_mode:1;
+ u8 delivery_status:1;
+ u8 polarity:1;
+ u8 remote_irr:1;
+ u8 trig_mode:1;
+ u8 mask:1;
+ u8 reserve:7;
+ u8 reserved[4];
+ u8 dest_id;
+ } fields;
+};
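
A user-space mirror of the redirection-entry layout above, handy for experimenting with the encoding; bitfield ordering is compiler-defined, so this matches the kernel union only on little-endian GCC/Clang targets:

    #include <stdio.h>
    #include <stdint.h>

    union redirect_entry {
            uint64_t bits;
            struct {
                    uint64_t vector:8;
                    uint64_t delivery_mode:3;
                    uint64_t dest_mode:1;
                    uint64_t delivery_status:1;
                    uint64_t polarity:1;
                    uint64_t remote_irr:1;
                    uint64_t trig_mode:1;
                    uint64_t mask:1;
                    uint64_t reserved:39;
                    uint64_t dest_id:8;
            } fields;
    };

    int main(void)
    {
            union redirect_entry e = { .bits = 0 };

            e.fields.vector = 0x30;
            e.fields.trig_mode = 1;  /* level-triggered */
            e.fields.dest_id = 1;
            printf("bits=%#llx\n", (unsigned long long)e.bits);
            return 0;                /* 0x100000000008030 on x86-64 */
    }
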
+
+struct kvm_ioapic {
+ u64 base_address;
+ u32 ioregsel;
+ u32 id;
+ u32 irr;
+ u32 pad;
+ union kvm_ioapic_redirect_entry redirtbl[IOAPIC_NUM_PINS];
+ unsigned long irq_states[IOAPIC_NUM_PINS];
+ struct kvm_io_device dev;
+ struct kvm *kvm;
+ spinlock_t lock;
+ struct rtc_status rtc_status;
+ struct delayed_work eoi_inject;
+ u32 irq_eoi[IOAPIC_NUM_PINS];
+ u32 irr_delivered;
+
+ /* reads protected by irq_srcu, writes by irq_lock */
+ struct hlist_head mask_notifier_list;
+};
+
+struct kvm_irq_mask_notifier {
+ void (*func)(struct kvm_irq_mask_notifier *kimn, bool masked);
+ int irq;
+ struct hlist_node link;
+};
+
+void kvm_register_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn);
+void kvm_unregister_irq_mask_notifier(struct kvm *kvm, int irq,
+ struct kvm_irq_mask_notifier *kimn);
+void kvm_fire_mask_notifiers(struct kvm *kvm, unsigned irqchip, unsigned pin,
+ bool mask);
+
+#ifdef DEBUG
+#define ASSERT(x) \
+do { \
+ if (!(x)) { \
+ printk(KERN_EMERG "assertion failed %s: %d: %s\n", \
+ __FILE__, __LINE__, #x); \
+ BUG(); \
+ } \
+} while (0)
+#else
+#define ASSERT(x) do { } while (0)
+#endif
+
+static inline int ioapic_in_kernel(struct kvm *kvm)
+{
+ return irqchip_full(kvm);
+}
+
+void kvm_rtc_eoi_tracking_restore_one(struct kvm_vcpu *vcpu);
+void kvm_ioapic_update_eoi(struct kvm_vcpu *vcpu, int vector,
+ int trigger_mode);
+int kvm_ioapic_init(struct kvm *kvm);
+void kvm_ioapic_destroy(struct kvm *kvm);
+int kvm_ioapic_set_irq(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm,
+ int irq_source_id, int level, bool line_status);
+
+void kvm_get_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
+void kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
+void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu,
+ ulong *ioapic_handled_vectors);
+void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu,
+ ulong *ioapic_handled_vectors);
+void kvm_scan_ioapic_irq(struct kvm_vcpu *vcpu, u32 dest_id, u16 dest_mode,
+ u8 vector, unsigned long *ioapic_handled_vectors);
+#endif
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 96dfbb6ad2a9..7cc8950005b6 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -1,30 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* irq.c: API for in kernel interrupt controller
* Copyright (c) 2007, Intel Corporation.
+ * Copyright 2009 Red Hat, Inc. and/or its affiliates.
*
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
- * Place - Suite 330, Boston, MA 02111-1307 USA.
* Authors:
* Yaozu (Eddie) Dong <Eddie.dong@intel.com>
- *
*/
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/kvm_host.h>
+#include <linux/kvm_irqfd.h>
+#include "hyperv.h"
+#include "ioapic.h"
#include "irq.h"
-#include "i8254.h"
+#include "trace.h"
#include "x86.h"
+#include "xen.h"
/*
* check if there are pending timer events
@@ -32,14 +26,84 @@
*/
int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu)
{
- int ret;
+ int r = 0;
- ret = pit_has_pending_timer(vcpu);
- ret |= apic_has_pending_timer(vcpu);
+ if (lapic_in_kernel(vcpu))
+ r = apic_has_pending_timer(vcpu);
+ if (kvm_xen_timer_enabled(vcpu))
+ r += kvm_xen_has_pending_timer(vcpu);
- return ret;
+ return r;
}
-EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
+
+/*
+ * check if there is a pending userspace external interrupt
+ */
+static int pending_userspace_extint(struct kvm_vcpu *v)
+{
+ return v->arch.pending_external_vector != -1;
+}
+
+static int get_userspace_extint(struct kvm_vcpu *vcpu)
+{
+ int vector = vcpu->arch.pending_external_vector;
+
+ vcpu->arch.pending_external_vector = -1;
+ return vector;
+}
+
+/*
+ * check if there is a pending interrupt from a
+ * non-APIC source, without intack.
+ */
+int kvm_cpu_has_extint(struct kvm_vcpu *v)
+{
+ /*
+ * FIXME: interrupt.injected represents an interrupt whose
+ * side-effects have already been applied (e.g. bit from IRR
+ * already moved to ISR). Therefore, it is incorrect to rely
+ * on interrupt.injected to know if there is a pending
+ * interrupt in the user-mode LAPIC.
+ * This leads to nVMX/nSVM not being able to distinguish
+ * whether it should exit from L2 to L1 on EXTERNAL_INTERRUPT
+ * because of a pending interrupt or re-inject an already
+ * injected interrupt.
+ */
+ if (!lapic_in_kernel(v))
+ return v->arch.interrupt.injected;
+
+ if (kvm_xen_has_interrupt(v))
+ return 1;
+
+ if (!kvm_apic_accept_pic_intr(v))
+ return 0;
+
+#ifdef CONFIG_KVM_IOAPIC
+ if (pic_in_kernel(v->kvm))
+ return v->kvm->arch.vpic->output;
+#endif
+
+ WARN_ON_ONCE(!irqchip_split(v->kvm));
+ return pending_userspace_extint(v);
+}
+
+/*
+ * check if there is an injectable interrupt:
+ * when virtual interrupt delivery is enabled,
+ * interrupts from the APIC are handled by hardware,
+ * so we don't need to check them here.
+ */
+int kvm_cpu_has_injectable_intr(struct kvm_vcpu *v)
+{
+ if (kvm_cpu_has_extint(v))
+ return 1;
+
+ if (!is_guest_mode(v) && kvm_vcpu_apicv_active(v))
+ return 0;
+
+ return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_cpu_has_injectable_intr);
/*
* check if there is pending interrupt without
@@ -47,55 +111,523 @@ EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
*/
int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
{
- struct kvm_pic *s;
+ if (kvm_cpu_has_extint(v))
+ return 1;
- if (!irqchip_in_kernel(v->kvm))
- return v->arch.interrupt.pending;
+ if (lapic_in_kernel(v) && v->arch.apic->guest_apic_protected)
+ return kvm_x86_call(protected_apic_has_interrupt)(v);
- if (kvm_apic_has_interrupt(v) == -1) { /* LAPIC */
- if (kvm_apic_accept_pic_intr(v)) {
- s = pic_irqchip(v->kvm); /* PIC */
- return s->output;
- } else
- return 0;
+ return kvm_apic_has_interrupt(v) != -1; /* LAPIC */
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_cpu_has_interrupt);
+
+/*
+ * Read the pending interrupt vector (from a non-APIC
+ * source) and intack.
+ */
+int kvm_cpu_get_extint(struct kvm_vcpu *v)
+{
+ if (!kvm_cpu_has_extint(v)) {
+ WARN_ON(!lapic_in_kernel(v));
+ return -1;
}
- return 1;
+
+ if (!lapic_in_kernel(v))
+ return v->arch.interrupt.nr;
+
+#ifdef CONFIG_KVM_XEN
+ if (kvm_xen_has_interrupt(v))
+ return v->kvm->arch.xen.upcall_vector;
+#endif
+
+#ifdef CONFIG_KVM_IOAPIC
+ if (pic_in_kernel(v->kvm))
+ return kvm_pic_read_irq(v->kvm); /* PIC */
+#endif
+
+ WARN_ON_ONCE(!irqchip_split(v->kvm));
+ return get_userspace_extint(v);
}
-EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_cpu_get_extint);
/*
* Read pending interrupt vector and intack.
*/
int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
{
- struct kvm_pic *s;
- int vector;
+ int vector = kvm_cpu_get_extint(v);
+ if (vector != -1)
+ return vector; /* PIC */
- if (!irqchip_in_kernel(v->kvm))
- return v->arch.interrupt.nr;
+ vector = kvm_apic_has_interrupt(v); /* APIC */
+ if (vector != -1)
+ kvm_apic_ack_interrupt(v, vector);
- vector = kvm_get_apic_interrupt(v); /* APIC */
- if (vector == -1) {
- if (kvm_apic_accept_pic_intr(v)) {
- s = pic_irqchip(v->kvm);
- s->output = 0; /* PIC */
- vector = kvm_pic_read_irq(v->kvm);
- }
- }
return vector;
}
-EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
{
- kvm_inject_apic_timer_irqs(vcpu);
- kvm_inject_pit_timer_irqs(vcpu);
- /* TODO: PIT, RTC etc. */
+ if (lapic_in_kernel(vcpu))
+ kvm_inject_apic_timer_irqs(vcpu);
+ if (kvm_xen_timer_enabled(vcpu))
+ kvm_xen_inject_timer_irqs(vcpu);
}
-EXPORT_SYMBOL_GPL(kvm_inject_pending_timer_irqs);
void __kvm_migrate_timers(struct kvm_vcpu *vcpu)
{
__kvm_migrate_apic_timer(vcpu);
+#ifdef CONFIG_KVM_IOAPIC
__kvm_migrate_pit_timer(vcpu);
+#endif
+ kvm_x86_call(migrate_timers)(vcpu);
+}
+
+bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args)
+{
+ bool resample = args->flags & KVM_IRQFD_FLAG_RESAMPLE;
+
+ return resample ? irqchip_full(kvm) : irqchip_in_kernel(kvm);
+}
+
+bool kvm_arch_irqchip_in_kernel(struct kvm *kvm)
+{
+ return irqchip_in_kernel(kvm);
+}
+
+static void kvm_msi_to_lapic_irq(struct kvm *kvm,
+ struct kvm_kernel_irq_routing_entry *e,
+ struct kvm_lapic_irq *irq)
+{
+ struct msi_msg msg = { .address_lo = e->msi.address_lo,
+ .address_hi = e->msi.address_hi,
+ .data = e->msi.data };
+
+ trace_kvm_msi_set_irq(msg.address_lo | (kvm->arch.x2apic_format ?
+ (u64)msg.address_hi << 32 : 0), msg.data);
+
+ irq->dest_id = x86_msi_msg_get_destid(&msg, kvm->arch.x2apic_format);
+ irq->vector = msg.arch_data.vector;
+ irq->dest_mode = kvm_lapic_irq_dest_mode(msg.arch_addr_lo.dest_mode_logical);
+ irq->trig_mode = msg.arch_data.is_level;
+ irq->delivery_mode = msg.arch_data.delivery_mode << 8;
+ irq->msi_redir_hint = msg.arch_addr_lo.redirect_hint;
+ irq->level = 1;
+ irq->shorthand = APIC_DEST_NOSHORT;
+}
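
kvm_msi_to_lapic_irq() leans on the arch msi_msg accessors; underneath, the classic xAPIC MSI layout puts the destination id in address bits 19:12, the vector in data bits 7:0 and the trigger mode in data bit 15. Decoded by hand on illustrative values:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint32_t address_lo = 0xfee01000; /* illustrative: dest id 0x01   */
            uint32_t data = 0x8041;           /* level-triggered, vector 0x41 */

            printf("dest=%u vector=%#x level=%u\n",
                   (address_lo >> 12) & 0xff, data & 0xff, (data >> 15) & 1);
            return 0;
    }
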
+
+static inline bool kvm_msi_route_invalid(struct kvm *kvm,
+ struct kvm_kernel_irq_routing_entry *e)
+{
+ return kvm->arch.x2apic_format && (e->msi.address_hi & 0xff);
+}
+
+int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
+ struct kvm *kvm, int irq_source_id, int level, bool line_status)
+{
+ struct kvm_lapic_irq irq;
+
+ if (kvm_msi_route_invalid(kvm, e))
+ return -EINVAL;
+
+ if (!level)
+ return -1;
+
+ kvm_msi_to_lapic_irq(kvm, e, &irq);
+
+ return kvm_irq_delivery_to_apic(kvm, NULL, &irq, NULL);
+}
+
+int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
+ struct kvm *kvm, int irq_source_id, int level,
+ bool line_status)
+{
+ struct kvm_lapic_irq irq;
+ int r;
+
+ switch (e->type) {
+#ifdef CONFIG_KVM_HYPERV
+ case KVM_IRQ_ROUTING_HV_SINT:
+ return kvm_hv_synic_set_irq(e, kvm, irq_source_id, level,
+ line_status);
+#endif
+
+ case KVM_IRQ_ROUTING_MSI:
+ if (kvm_msi_route_invalid(kvm, e))
+ return -EINVAL;
+
+ kvm_msi_to_lapic_irq(kvm, e, &irq);
+
+ if (kvm_irq_delivery_to_apic_fast(kvm, NULL, &irq, &r, NULL))
+ return r;
+ break;
+
+#ifdef CONFIG_KVM_XEN
+ case KVM_IRQ_ROUTING_XEN_EVTCHN:
+ if (!level)
+ return -1;
+
+ return kvm_xen_set_evtchn_fast(&e->xen_evtchn, kvm);
+#endif
+ default:
+ break;
+ }
+
+ return -EWOULDBLOCK;
+}
+
+int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
+ bool line_status)
+{
+ if (!irqchip_in_kernel(kvm))
+ return -ENXIO;
+
+ irq_event->status = kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID,
+ irq_event->irq, irq_event->level,
+ line_status);
+ return 0;
+}
+
+bool kvm_arch_can_set_irq_routing(struct kvm *kvm)
+{
+ return irqchip_in_kernel(kvm);
+}
+
+int kvm_set_routing_entry(struct kvm *kvm,
+ struct kvm_kernel_irq_routing_entry *e,
+ const struct kvm_irq_routing_entry *ue)
+{
+ /* We can't check irqchip_in_kernel() here as some callers are
+ * currently initializing the irqchip. Other callers should therefore
+ * check kvm_arch_can_set_irq_routing() before calling this function.
+ */
+ switch (ue->type) {
+#ifdef CONFIG_KVM_IOAPIC
+ case KVM_IRQ_ROUTING_IRQCHIP:
+ if (irqchip_split(kvm))
+ return -EINVAL;
+ e->irqchip.pin = ue->u.irqchip.pin;
+ switch (ue->u.irqchip.irqchip) {
+ case KVM_IRQCHIP_PIC_SLAVE:
+ e->irqchip.pin += PIC_NUM_PINS / 2;
+ fallthrough;
+ case KVM_IRQCHIP_PIC_MASTER:
+ if (ue->u.irqchip.pin >= PIC_NUM_PINS / 2)
+ return -EINVAL;
+ e->set = kvm_pic_set_irq;
+ break;
+ case KVM_IRQCHIP_IOAPIC:
+ if (ue->u.irqchip.pin >= KVM_IOAPIC_NUM_PINS)
+ return -EINVAL;
+ e->set = kvm_ioapic_set_irq;
+ break;
+ default:
+ return -EINVAL;
+ }
+ e->irqchip.irqchip = ue->u.irqchip.irqchip;
+ break;
+#endif
+ case KVM_IRQ_ROUTING_MSI:
+ e->set = kvm_set_msi;
+ e->msi.address_lo = ue->u.msi.address_lo;
+ e->msi.address_hi = ue->u.msi.address_hi;
+ e->msi.data = ue->u.msi.data;
+
+ if (kvm_msi_route_invalid(kvm, e))
+ return -EINVAL;
+ break;
+#ifdef CONFIG_KVM_HYPERV
+ case KVM_IRQ_ROUTING_HV_SINT:
+ e->set = kvm_hv_synic_set_irq;
+ e->hv_sint.vcpu = ue->u.hv_sint.vcpu;
+ e->hv_sint.sint = ue->u.hv_sint.sint;
+ break;
+#endif
+#ifdef CONFIG_KVM_XEN
+ case KVM_IRQ_ROUTING_XEN_EVTCHN:
+ return kvm_xen_setup_evtchn(kvm, e, ue);
+#endif
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+void kvm_scan_ioapic_irq(struct kvm_vcpu *vcpu, u32 dest_id, u16 dest_mode,
+ u8 vector, unsigned long *ioapic_handled_vectors)
+{
+ /*
+ * Intercept EOI if the vCPU is the target of the new IRQ routing, or
+ * the vCPU has a pending IRQ from the old routing, i.e. if the vCPU
+ * may receive a level-triggered IRQ in the future, or has already
+ * received a level-triggered IRQ. The EOI needs to be intercepted
+ * and forwarded to I/O APIC emulation so that the IRQ can be
+ * de-asserted.
+ */
+ if (kvm_apic_match_dest(vcpu, NULL, APIC_DEST_NOSHORT, dest_id, dest_mode)) {
+ __set_bit(vector, ioapic_handled_vectors);
+ } else if (kvm_apic_pending_eoi(vcpu, vector)) {
+ __set_bit(vector, ioapic_handled_vectors);
+
+ /*
+ * Track the highest pending EOI for which the vCPU is NOT the
+ * target in the new routing. Only the EOI for the IRQ that is
+ * in-flight (for the old routing) needs to be intercepted, any
+ * future IRQs that arrive on this vCPU will be coincidental to
+ * the level-triggered routing and don't need to be intercepted.
+ */
+ if ((int)vector > vcpu->arch.highest_stale_pending_ioapic_eoi)
+ vcpu->arch.highest_stale_pending_ioapic_eoi = vector;
+ }
+}
+
+void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu,
+ ulong *ioapic_handled_vectors)
+{
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_kernel_irq_routing_entry *entry;
+ struct kvm_irq_routing_table *table;
+ u32 i, nr_ioapic_pins;
+ int idx;
+
+ idx = srcu_read_lock(&kvm->irq_srcu);
+ table = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
+ nr_ioapic_pins = min_t(u32, table->nr_rt_entries,
+ kvm->arch.nr_reserved_ioapic_pins);
+ for (i = 0; i < nr_ioapic_pins; ++i) {
+ hlist_for_each_entry(entry, &table->map[i], link) {
+ struct kvm_lapic_irq irq;
+
+ if (entry->type != KVM_IRQ_ROUTING_MSI)
+ continue;
+
+ kvm_msi_to_lapic_irq(vcpu->kvm, entry, &irq);
+
+ if (!irq.trig_mode)
+ continue;
+
+ kvm_scan_ioapic_irq(vcpu, irq.dest_id, irq.dest_mode,
+ irq.vector, ioapic_handled_vectors);
+ }
+ }
+ srcu_read_unlock(&kvm->irq_srcu, idx);
+}
+
+void kvm_arch_irq_routing_update(struct kvm *kvm)
+{
+#ifdef CONFIG_KVM_HYPERV
+ kvm_hv_irq_routing_update(kvm);
+#endif
+
+ if (irqchip_split(kvm))
+ kvm_make_scan_ioapic_request(kvm);
+}
+
+static int kvm_pi_update_irte(struct kvm_kernel_irqfd *irqfd,
+ struct kvm_kernel_irq_routing_entry *entry)
+{
+ unsigned int host_irq = irqfd->producer->irq;
+ struct kvm *kvm = irqfd->kvm;
+ struct kvm_vcpu *vcpu = NULL;
+ struct kvm_lapic_irq irq;
+ int r;
+
+ if (WARN_ON_ONCE(!irqchip_in_kernel(kvm) || !kvm_arch_has_irq_bypass()))
+ return -EINVAL;
+
+ if (entry && entry->type == KVM_IRQ_ROUTING_MSI) {
+ kvm_msi_to_lapic_irq(kvm, entry, &irq);
+
+ /*
+ * Force remapped mode if hardware doesn't support posting the
+ * virtual interrupt to a vCPU. Only IRQs are postable (NMIs,
+ * SMIs, etc. are not), and neither AMD nor Intel IOMMUs support
+ * posting multicast/broadcast IRQs. If the interrupt can't be
+ * posted, the device MSI needs to be routed to the host so that
+ * the guest's desired interrupt can be synthesized by KVM.
+ *
+ * This means that KVM can only post lowest-priority interrupts
+ * if they have a single CPU as the destination, e.g. only if
+ * the guest has affined the interrupt to a single vCPU.
+ */
+ if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu) ||
+ !kvm_irq_is_postable(&irq))
+ vcpu = NULL;
+ }
+
+ if (!irqfd->irq_bypass_vcpu && !vcpu)
+ return 0;
+
+ r = kvm_x86_call(pi_update_irte)(irqfd, irqfd->kvm, host_irq, irqfd->gsi,
+ vcpu, irq.vector);
+ if (r) {
+ WARN_ON_ONCE(irqfd->irq_bypass_vcpu && !vcpu);
+ irqfd->irq_bypass_vcpu = NULL;
+ return r;
+ }
+
+ irqfd->irq_bypass_vcpu = vcpu;
+
+ trace_kvm_pi_irte_update(host_irq, vcpu, irqfd->gsi, irq.vector, !!vcpu);
+ return 0;
+}
+
+int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
+ struct irq_bypass_producer *prod)
+{
+ struct kvm_kernel_irqfd *irqfd =
+ container_of(cons, struct kvm_kernel_irqfd, consumer);
+ struct kvm *kvm = irqfd->kvm;
+ int ret = 0;
+
+ spin_lock_irq(&kvm->irqfds.lock);
+ irqfd->producer = prod;
+
+ if (!kvm->arch.nr_possible_bypass_irqs++)
+ kvm_x86_call(pi_start_bypass)(kvm);
+
+ if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
+ ret = kvm_pi_update_irte(irqfd, &irqfd->irq_entry);
+ if (ret)
+ kvm->arch.nr_possible_bypass_irqs--;
+ }
+ spin_unlock_irq(&kvm->irqfds.lock);
+
+ return ret;
+}
+
+void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
+ struct irq_bypass_producer *prod)
+{
+ struct kvm_kernel_irqfd *irqfd =
+ container_of(cons, struct kvm_kernel_irqfd, consumer);
+ struct kvm *kvm = irqfd->kvm;
+ int ret;
+
+ WARN_ON(irqfd->producer != prod);
+
+ /*
+ * If the producer of an IRQ that is currently being posted to a vCPU
+ * is unregistered, change the associated IRTE back to remapped mode as
+ * the IRQ has been released (or repurposed) by the device driver, i.e.
+ * KVM must relinquish control of the IRTE.
+ */
+ spin_lock_irq(&kvm->irqfds.lock);
+
+ if (irqfd->irq_entry.type == KVM_IRQ_ROUTING_MSI) {
+ ret = kvm_pi_update_irte(irqfd, NULL);
+ if (ret)
+ pr_info("irq bypass consumer (eventfd %p) unregistration fails: %d\n",
+ irqfd->consumer.eventfd, ret);
+ }
+ irqfd->producer = NULL;
+
+ kvm->arch.nr_possible_bypass_irqs--;
+
+ spin_unlock_irq(&kvm->irqfds.lock);
+}
+
+void kvm_arch_update_irqfd_routing(struct kvm_kernel_irqfd *irqfd,
+ struct kvm_kernel_irq_routing_entry *old,
+ struct kvm_kernel_irq_routing_entry *new)
+{
+ if (new->type != KVM_IRQ_ROUTING_MSI &&
+ old->type != KVM_IRQ_ROUTING_MSI)
+ return;
+
+ if (old->type == KVM_IRQ_ROUTING_MSI &&
+ new->type == KVM_IRQ_ROUTING_MSI &&
+ !memcmp(&old->msi, &new->msi, sizeof(new->msi)))
+ return;
+
+ kvm_pi_update_irte(irqfd, new);
+}
+
+#ifdef CONFIG_KVM_IOAPIC
+#define IOAPIC_ROUTING_ENTRY(irq) \
+ { .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
+ .u.irqchip = { .irqchip = KVM_IRQCHIP_IOAPIC, .pin = (irq) } }
+#define ROUTING_ENTRY1(irq) IOAPIC_ROUTING_ENTRY(irq)
+
+#define PIC_ROUTING_ENTRY(irq) \
+ { .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
+ .u.irqchip = { .irqchip = SELECT_PIC(irq), .pin = (irq) % 8 } }
+#define ROUTING_ENTRY2(irq) \
+ IOAPIC_ROUTING_ENTRY(irq), PIC_ROUTING_ENTRY(irq)
+
+static const struct kvm_irq_routing_entry default_routing[] = {
+ ROUTING_ENTRY2(0), ROUTING_ENTRY2(1),
+ ROUTING_ENTRY2(2), ROUTING_ENTRY2(3),
+ ROUTING_ENTRY2(4), ROUTING_ENTRY2(5),
+ ROUTING_ENTRY2(6), ROUTING_ENTRY2(7),
+ ROUTING_ENTRY2(8), ROUTING_ENTRY2(9),
+ ROUTING_ENTRY2(10), ROUTING_ENTRY2(11),
+ ROUTING_ENTRY2(12), ROUTING_ENTRY2(13),
+ ROUTING_ENTRY2(14), ROUTING_ENTRY2(15),
+ ROUTING_ENTRY1(16), ROUTING_ENTRY1(17),
+ ROUTING_ENTRY1(18), ROUTING_ENTRY1(19),
+ ROUTING_ENTRY1(20), ROUTING_ENTRY1(21),
+ ROUTING_ENTRY1(22), ROUTING_ENTRY1(23),
+};
+
+int kvm_setup_default_ioapic_and_pic_routing(struct kvm *kvm)
+{
+ return kvm_set_irq_routing(kvm, default_routing,
+ ARRAY_SIZE(default_routing), 0);
+}
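
The table dual-routes GSIs 0-15 through both the IOAPIC and the 8259 pair, while GSIs 16-23 exist on the IOAPIC only. How a low GSI lands on a PIC pin, per SELECT_PIC and the (irq) % 8 in PIC_ROUTING_ENTRY:

    #include <stdio.h>

    int main(void)
    {
            for (int gsi = 0; gsi < 16; gsi += 5)
                    printf("gsi %2d -> %s pin %d\n", gsi,
                           gsi < 8 ? "PIC master" : "PIC slave", gsi % 8);
            return 0; /* 0->master 0, 5->master 5, 10->slave 2, 15->slave 7 */
    }
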
+
+int kvm_vm_ioctl_get_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
+{
+ struct kvm_pic *pic = kvm->arch.vpic;
+ int r;
+
+ r = 0;
+ switch (chip->chip_id) {
+ case KVM_IRQCHIP_PIC_MASTER:
+ memcpy(&chip->chip.pic, &pic->pics[0],
+ sizeof(struct kvm_pic_state));
+ break;
+ case KVM_IRQCHIP_PIC_SLAVE:
+ memcpy(&chip->chip.pic, &pic->pics[1],
+ sizeof(struct kvm_pic_state));
+ break;
+ case KVM_IRQCHIP_IOAPIC:
+ kvm_get_ioapic(kvm, &chip->chip.ioapic);
+ break;
+ default:
+ r = -EINVAL;
+ break;
+ }
+ return r;
+}
+
+int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
+{
+ struct kvm_pic *pic = kvm->arch.vpic;
+ int r;
+
+ r = 0;
+ switch (chip->chip_id) {
+ case KVM_IRQCHIP_PIC_MASTER:
+ spin_lock(&pic->lock);
+ memcpy(&pic->pics[0], &chip->chip.pic,
+ sizeof(struct kvm_pic_state));
+ spin_unlock(&pic->lock);
+ break;
+ case KVM_IRQCHIP_PIC_SLAVE:
+ spin_lock(&pic->lock);
+ memcpy(&pic->pics[1], &chip->chip.pic,
+ sizeof(struct kvm_pic_state));
+ spin_unlock(&pic->lock);
+ break;
+ case KVM_IRQCHIP_IOAPIC:
+ kvm_set_ioapic(kvm, &chip->chip.ioapic);
+ break;
+ default:
+ r = -EINVAL;
+ break;
+ }
+ kvm_pic_update_irq(pic);
+ return r;
}
+#endif
diff --git a/arch/x86/kvm/irq.h b/arch/x86/kvm/irq.h
index cd1f362f413d..34f4a78a7a01 100644
--- a/arch/x86/kvm/irq.h
+++ b/arch/x86/kvm/irq.h
@@ -1,22 +1,10 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
/*
* irq.h: in kernel interrupt controller related definitions
* Copyright (c) 2007, Intel Corporation.
*
- * This program is free software; you can redistribute it and/or modify it
- * under the terms and conditions of the GNU General Public License,
- * version 2, as published by the Free Software Foundation.
- *
- * This program is distributed in the hope it will be useful, but WITHOUT
- * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
- * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
- * more details.
- *
- * You should have received a copy of the GNU General Public License along with
- * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
- * Place - Suite 330, Boston, MA 02111-1307 USA.
* Authors:
* Yaozu (Eddie) Dong <Eddie.dong@intel.com>
- *
*/
#ifndef __IRQ_H
@@ -27,10 +15,11 @@
#include <linux/kvm_host.h>
#include <linux/spinlock.h>
-#include "iodev.h"
-#include "ioapic.h"
+#include <kvm/iodev.h>
#include "lapic.h"
+#ifdef CONFIG_KVM_IOAPIC
+
#define PIC_NUM_PINS 16
#define SELECT_PIC(irq) \
((irq) < 8 ? KVM_IRQCHIP_PIC_MASTER : KVM_IRQCHIP_PIC_SLAVE)
@@ -38,14 +27,11 @@
struct kvm;
struct kvm_vcpu;
-typedef void irq_request_func(void *opaque, int level);
-
struct kvm_kpic_state {
u8 last_irr; /* edge detection */
u8 irr; /* interrupt request register */
u8 imr; /* interrupt mask register */
u8 isr; /* interrupt service register */
- u8 isr_ack; /* interrupt ack detection */
u8 priority_add; /* highest irq priority */
u8 irq_base;
u8 read_reg_select;
@@ -58,44 +44,73 @@ struct kvm_kpic_state {
u8 init4; /* true if 4 byte init */
u8 elcr; /* PIIX edge/trigger selection */
u8 elcr_mask;
+ u8 isr_ack; /* interrupt ack detection */
struct kvm_pic *pics_state;
};
struct kvm_pic {
- raw_spinlock_t lock;
+ spinlock_t lock;
bool wakeup_needed;
unsigned pending_acks;
struct kvm *kvm;
struct kvm_kpic_state pics[2]; /* 0 is master pic, 1 is slave pic */
- irq_request_func *irq_request;
- void *irq_request_opaque;
int output; /* intr from master PIC */
- struct kvm_io_device dev;
- void (*ack_notifier)(void *opaque, int irq);
- unsigned long irq_states[16];
+ struct kvm_io_device dev_master;
+ struct kvm_io_device dev_slave;
+ struct kvm_io_device dev_elcr;
+ unsigned long irq_states[PIC_NUM_PINS];
};
-struct kvm_pic *kvm_create_pic(struct kvm *kvm);
-void kvm_destroy_pic(struct kvm *kvm);
+int kvm_pic_init(struct kvm *kvm);
+void kvm_pic_destroy(struct kvm *kvm);
int kvm_pic_read_irq(struct kvm *kvm);
void kvm_pic_update_irq(struct kvm_pic *s);
-void kvm_pic_clear_isr_ack(struct kvm *kvm);
+int kvm_pic_set_irq(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm,
+ int irq_source_id, int level, bool line_status);
+
+int kvm_setup_default_ioapic_and_pic_routing(struct kvm *kvm);
+
+int kvm_vm_ioctl_get_irqchip(struct kvm *kvm, struct kvm_irqchip *chip);
+int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip);
+
+static inline int irqchip_full(struct kvm *kvm)
+{
+ int mode = kvm->arch.irqchip_mode;
+
+ /* Matches smp_wmb() when setting irqchip_mode */
+ smp_rmb();
+ return mode == KVM_IRQCHIP_KERNEL;
+}
+#else /* CONFIG_KVM_IOAPIC */
+static __always_inline int irqchip_full(struct kvm *kvm)
+{
+ return false;
+}
+#endif
-static inline struct kvm_pic *pic_irqchip(struct kvm *kvm)
+static inline int pic_in_kernel(struct kvm *kvm)
{
- return kvm->arch.vpic;
+ return irqchip_full(kvm);
}
-static inline int irqchip_in_kernel(struct kvm *kvm)
+
+static inline int irqchip_split(struct kvm *kvm)
{
- int ret;
+ int mode = kvm->arch.irqchip_mode;
- ret = (pic_irqchip(kvm) != NULL);
+ /* Matches smp_wmb() when setting irqchip_mode */
smp_rmb();
- return ret;
+ return mode == KVM_IRQCHIP_SPLIT;
}
-void kvm_pic_reset(struct kvm_kpic_state *s);
+static inline int irqchip_in_kernel(struct kvm *kvm)
+{
+ int mode = kvm->arch.irqchip_mode;
+
+ /* Matches smp_wmb() when setting irqchip_mode */
+ smp_rmb();
+ return mode != KVM_IRQCHIP_NONE;
+}
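/*
 * Illustrative sketch, not part of this patch: the smp_rmb() readers above
 * pair with an smp_wmb() on whatever path sets irqchip_mode. The helper name
 * below is hypothetical; it only shows the required store ordering.
 */
static inline void example_publish_irqchip_mode(struct kvm *kvm,
						enum kvm_irqchip_mode mode)
{
	/* All irqchip setup stores must be visible before the mode ... */
	smp_wmb();
	/* ... so irqchip_in_kernel() readers never see a half-built chip. */
	kvm->arch.irqchip_mode = mode;
}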
void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu);
void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu);
@@ -104,7 +119,6 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu);
void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu);
void __kvm_migrate_timers(struct kvm_vcpu *vcpu);
-int pit_has_pending_timer(struct kvm_vcpu *vcpu);
int apic_has_pending_timer(struct kvm_vcpu *vcpu);
#endif
diff --git a/arch/x86/kvm/kvm-asm-offsets.c b/arch/x86/kvm/kvm-asm-offsets.c
new file mode 100644
index 000000000000..24a710d37323
--- /dev/null
+++ b/arch/x86/kvm/kvm-asm-offsets.c
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Generate definitions needed by assembly language modules.
+ * This code generates raw asm output which is post-processed to extract
+ * and format the required data.
+ */
+#define COMPILE_OFFSETS
+
+#include <linux/kbuild.h>
+#include "vmx/vmx.h"
+#include "svm/svm.h"
+
+static void __used common(void)
+{
+ if (IS_ENABLED(CONFIG_KVM_AMD)) {
+ BLANK();
+ OFFSET(SVM_vcpu_arch_regs, vcpu_svm, vcpu.arch.regs);
+ OFFSET(SVM_current_vmcb, vcpu_svm, current_vmcb);
+ OFFSET(SVM_spec_ctrl, vcpu_svm, spec_ctrl);
+ OFFSET(SVM_vmcb01, vcpu_svm, vmcb01);
+ OFFSET(KVM_VMCB_pa, kvm_vmcb_info, pa);
+ OFFSET(SD_save_area_pa, svm_cpu_data, save_area_pa);
+ }
+
+ if (IS_ENABLED(CONFIG_KVM_INTEL)) {
+ BLANK();
+ OFFSET(VMX_spec_ctrl, vcpu_vmx, spec_ctrl);
+ }
+}
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index cff851cf5322..8ddb01191d6f 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -1,56 +1,188 @@
+/* SPDX-License-Identifier: GPL-2.0 */
#ifndef ASM_KVM_CACHE_REGS_H
#define ASM_KVM_CACHE_REGS_H
-#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
+#include <linux/kvm_host.h>
+
+#define KVM_POSSIBLE_CR0_GUEST_BITS (X86_CR0_TS | X86_CR0_WP)
#define KVM_POSSIBLE_CR4_GUEST_BITS \
(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR \
- | X86_CR4_OSXMMEXCPT | X86_CR4_PGE)
+ | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE \
+ | X86_CR4_CET)
+
+#define X86_CR0_PDPTR_BITS (X86_CR0_CD | X86_CR0_NW | X86_CR0_PG)
+#define X86_CR4_TLBFLUSH_BITS (X86_CR4_PGE | X86_CR4_PCIDE | X86_CR4_PAE | X86_CR4_SMEP)
+#define X86_CR4_PDPTR_BITS (X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_SMEP)
+
+static_assert(!(KVM_POSSIBLE_CR0_GUEST_BITS & X86_CR0_PDPTR_BITS));
+
+#define BUILD_KVM_GPR_ACCESSORS(lname, uname) \
+static __always_inline unsigned long kvm_##lname##_read(struct kvm_vcpu *vcpu)\
+{ \
+ return vcpu->arch.regs[VCPU_REGS_##uname]; \
+} \
+static __always_inline void kvm_##lname##_write(struct kvm_vcpu *vcpu, \
+ unsigned long val) \
+{ \
+ vcpu->arch.regs[VCPU_REGS_##uname] = val; \
+}
+BUILD_KVM_GPR_ACCESSORS(rax, RAX)
+BUILD_KVM_GPR_ACCESSORS(rbx, RBX)
+BUILD_KVM_GPR_ACCESSORS(rcx, RCX)
+BUILD_KVM_GPR_ACCESSORS(rdx, RDX)
+BUILD_KVM_GPR_ACCESSORS(rbp, RBP)
+BUILD_KVM_GPR_ACCESSORS(rsi, RSI)
+BUILD_KVM_GPR_ACCESSORS(rdi, RDI)
+#ifdef CONFIG_X86_64
+BUILD_KVM_GPR_ACCESSORS(r8, R8)
+BUILD_KVM_GPR_ACCESSORS(r9, R9)
+BUILD_KVM_GPR_ACCESSORS(r10, R10)
+BUILD_KVM_GPR_ACCESSORS(r11, R11)
+BUILD_KVM_GPR_ACCESSORS(r12, R12)
+BUILD_KVM_GPR_ACCESSORS(r13, R13)
+BUILD_KVM_GPR_ACCESSORS(r14, R14)
+BUILD_KVM_GPR_ACCESSORS(r15, R15)
+#endif
+
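/*
 * Usage sketch (not part of this patch) for the accessors generated above:
 * each BUILD_KVM_GPR_ACCESSORS() invocation yields a kvm_<reg>_read() and
 * kvm_<reg>_write() pair that accesses vcpu->arch.regs[] directly.
 */
static inline void example_double_rcx(struct kvm_vcpu *vcpu)
{
	unsigned long rcx = kvm_rcx_read(vcpu);

	kvm_rcx_write(vcpu, rcx * 2);
}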
+/*
+ * Using the register cache from interrupt context is generally not allowed, as
+ * caching a register and marking it available/dirty can't be done atomically,
+ * i.e. accesses from interrupt context may clobber state or read stale data if
+ * the vCPU task is in the process of updating the cache. The exception is if
+ * KVM is handling a PMI IRQ/NMI VM-Exit, as that bound code sequence doesn't
+ * touch the cache, it runs after the cache is reset (post VM-Exit), and PMIs
+ * need to access several registers that are cacheable.
+ */
+#define kvm_assert_register_caching_allowed(vcpu) \
+ lockdep_assert_once(in_task() || kvm_arch_pmi_in_guest(vcpu))
+
+/*
+ * avail dirty
+ * 0 0 register in VMCS/VMCB
+ * 0 1 *INVALID*
+ * 1 0 register in vcpu->arch
+ * 1 1 register in vcpu->arch, needs to be stored back
+ */
+static inline bool kvm_register_is_available(struct kvm_vcpu *vcpu,
+ enum kvm_reg reg)
+{
+ kvm_assert_register_caching_allowed(vcpu);
+ return test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+}
+
+static inline bool kvm_register_is_dirty(struct kvm_vcpu *vcpu,
+ enum kvm_reg reg)
+{
+ kvm_assert_register_caching_allowed(vcpu);
+ return test_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
+}
+
+static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
+ enum kvm_reg reg)
+{
+ kvm_assert_register_caching_allowed(vcpu);
+ __set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+}
+
+static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
+ enum kvm_reg reg)
+{
+ kvm_assert_register_caching_allowed(vcpu);
+ __set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+ __set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
+}
+
+/*
+ * kvm_register_test_and_mark_available() is a special snowflake that uses an
+ * arch bitop directly to avoid the explicit instrumentation that comes with
+ * the generic bitops. This allows code that cannot be instrumented (noinstr
+ * functions), e.g. the low level VM-Enter/VM-Exit paths, to cache registers.
+ */
+static __always_inline bool kvm_register_test_and_mark_available(struct kvm_vcpu *vcpu,
+ enum kvm_reg reg)
+{
+ kvm_assert_register_caching_allowed(vcpu);
+ return arch___test_and_set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+}
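/*
 * Read-through sketch (illustrative, mirrors kvm_register_read_raw() below)
 * for the avail/dirty table above: if the register still lives in the
 * VMCS/VMCB (avail=0), have vendor code cache it, then read vcpu->arch.regs.
 */
static inline unsigned long example_cached_read(struct kvm_vcpu *vcpu,
						enum kvm_reg reg)
{
	if (!kvm_register_is_available(vcpu, reg))
		kvm_x86_call(cache_reg)(vcpu, reg);	/* avail: 0 -> 1 */

	return vcpu->arch.regs[reg];
}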
-static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu,
- enum kvm_reg reg)
+/*
+ * The "raw" register helpers are only for cases where the full 64 bits of a
+ * register are read/written irrespective of current vCPU mode. In other words,
+ * odds are good you shouldn't be using the raw variants.
+ */
+static inline unsigned long kvm_register_read_raw(struct kvm_vcpu *vcpu, int reg)
{
- if (!test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail))
- kvm_x86_ops->cache_reg(vcpu, reg);
+ if (WARN_ON_ONCE((unsigned int)reg >= NR_VCPU_REGS))
+ return 0;
+
+ if (!kvm_register_is_available(vcpu, reg))
+ kvm_x86_call(cache_reg)(vcpu, reg);
return vcpu->arch.regs[reg];
}
-static inline void kvm_register_write(struct kvm_vcpu *vcpu,
- enum kvm_reg reg,
- unsigned long val)
+static inline void kvm_register_write_raw(struct kvm_vcpu *vcpu, int reg,
+ unsigned long val)
{
+ if (WARN_ON_ONCE((unsigned int)reg >= NR_VCPU_REGS))
+ return;
+
vcpu->arch.regs[reg] = val;
- __set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
- __set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+ kvm_register_mark_dirty(vcpu, reg);
}
static inline unsigned long kvm_rip_read(struct kvm_vcpu *vcpu)
{
- return kvm_register_read(vcpu, VCPU_REGS_RIP);
+ return kvm_register_read_raw(vcpu, VCPU_REGS_RIP);
}
static inline void kvm_rip_write(struct kvm_vcpu *vcpu, unsigned long val)
{
- kvm_register_write(vcpu, VCPU_REGS_RIP, val);
+ kvm_register_write_raw(vcpu, VCPU_REGS_RIP, val);
+}
+
+static inline unsigned long kvm_rsp_read(struct kvm_vcpu *vcpu)
+{
+ return kvm_register_read_raw(vcpu, VCPU_REGS_RSP);
+}
+
+static inline void kvm_rsp_write(struct kvm_vcpu *vcpu, unsigned long val)
+{
+ kvm_register_write_raw(vcpu, VCPU_REGS_RSP, val);
}
static inline u64 kvm_pdptr_read(struct kvm_vcpu *vcpu, int index)
{
- if (!test_bit(VCPU_EXREG_PDPTR,
- (unsigned long *)&vcpu->arch.regs_avail))
- kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_PDPTR);
+ might_sleep(); /* on svm */
+
+ if (!kvm_register_is_available(vcpu, VCPU_EXREG_PDPTR))
+ kvm_x86_call(cache_reg)(vcpu, VCPU_EXREG_PDPTR);
- return vcpu->arch.pdptrs[index];
+ return vcpu->arch.walk_mmu->pdptrs[index];
+}
+
+static inline void kvm_pdptr_write(struct kvm_vcpu *vcpu, int index, u64 value)
+{
+ vcpu->arch.walk_mmu->pdptrs[index] = value;
}
static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
{
ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;
- if (tmask & vcpu->arch.cr0_guest_owned_bits)
- kvm_x86_ops->decache_cr0_guest_bits(vcpu);
+ if ((tmask & vcpu->arch.cr0_guest_owned_bits) &&
+ !kvm_register_is_available(vcpu, VCPU_EXREG_CR0))
+ kvm_x86_call(cache_reg)(vcpu, VCPU_EXREG_CR0);
return vcpu->arch.cr0 & mask;
}
+static __always_inline bool kvm_is_cr0_bit_set(struct kvm_vcpu *vcpu,
+ unsigned long cr0_bit)
+{
+ BUILD_BUG_ON(!is_power_of_2(cr0_bit));
+
+ return !!kvm_read_cr0_bits(vcpu, cr0_bit);
+}
+
static inline ulong kvm_read_cr0(struct kvm_vcpu *vcpu)
{
return kvm_read_cr0_bits(vcpu, ~0UL);
@@ -59,14 +191,59 @@ static inline ulong kvm_read_cr0(struct kvm_vcpu *vcpu)
static inline ulong kvm_read_cr4_bits(struct kvm_vcpu *vcpu, ulong mask)
{
ulong tmask = mask & KVM_POSSIBLE_CR4_GUEST_BITS;
- if (tmask & vcpu->arch.cr4_guest_owned_bits)
- kvm_x86_ops->decache_cr4_guest_bits(vcpu);
+ if ((tmask & vcpu->arch.cr4_guest_owned_bits) &&
+ !kvm_register_is_available(vcpu, VCPU_EXREG_CR4))
+ kvm_x86_call(cache_reg)(vcpu, VCPU_EXREG_CR4);
return vcpu->arch.cr4 & mask;
}
+static __always_inline bool kvm_is_cr4_bit_set(struct kvm_vcpu *vcpu,
+ unsigned long cr4_bit)
+{
+ BUILD_BUG_ON(!is_power_of_2(cr4_bit));
+
+ return !!kvm_read_cr4_bits(vcpu, cr4_bit);
+}
+
+static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
+{
+ if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
+ kvm_x86_call(cache_reg)(vcpu, VCPU_EXREG_CR3);
+ return vcpu->arch.cr3;
+}
+
static inline ulong kvm_read_cr4(struct kvm_vcpu *vcpu)
{
return kvm_read_cr4_bits(vcpu, ~0UL);
}
+static inline u64 kvm_read_edx_eax(struct kvm_vcpu *vcpu)
+{
+ return (kvm_rax_read(vcpu) & -1u)
+ | ((u64)(kvm_rdx_read(vcpu) & -1u) << 32);
+}
+
+static inline void enter_guest_mode(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.hflags |= HF_GUEST_MASK;
+ vcpu->stat.guest_mode = 1;
+}
+
+static inline void leave_guest_mode(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.hflags &= ~HF_GUEST_MASK;
+
+ if (vcpu->arch.load_eoi_exitmap_pending) {
+ vcpu->arch.load_eoi_exitmap_pending = false;
+ kvm_make_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu);
+ }
+
+ vcpu->stat.guest_mode = 0;
+}
+
+static inline bool is_guest_mode(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.hflags & HF_GUEST_MASK;
+}
+
#endif
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
new file mode 100644
index 000000000000..fb3dab4b5a53
--- /dev/null
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -0,0 +1,575 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/******************************************************************************
+ * x86_emulate.h
+ *
+ * Generic x86 (32-bit and 64-bit) instruction decoder and emulator.
+ *
+ * Copyright (c) 2005 Keir Fraser
+ *
+ * From: xen-unstable 10676:af9809f51f81a3c43f276f00c81a52ef558afda4
+ */
+
+#ifndef _ASM_X86_KVM_X86_EMULATE_H
+#define _ASM_X86_KVM_X86_EMULATE_H
+
+#include <asm/desc_defs.h>
+#include "fpu.h"
+
+struct x86_emulate_ctxt;
+enum x86_intercept;
+enum x86_intercept_stage;
+
+struct x86_exception {
+ u8 vector;
+ bool error_code_valid;
+ u16 error_code;
+ bool nested_page_fault;
+ u64 address; /* cr2 or nested page fault gpa */
+ u8 async_page_fault;
+ unsigned long exit_qualification;
+};
+
+/*
+ * This struct is used to carry enough information from the instruction
+ * decoder to main KVM so that a decision can be made whether the
+ * instruction needs to be intercepted or not.
+ */
+struct x86_instruction_info {
+ u8 intercept; /* which intercept */
+ u8 rep_prefix; /* rep prefix? */
+ u8 modrm_mod; /* mod part of modrm */
+ u8 modrm_reg; /* index of register used */
+ u8 modrm_rm; /* rm part of modrm */
+ u64 src_val; /* value of source operand */
+ u64 dst_val; /* value of destination operand */
+ u8 src_bytes; /* size of source operand */
+ u8 dst_bytes; /* size of destination operand */
+ u8 src_type; /* type of source operand */
+ u8 dst_type; /* type of destination operand */
+ u8 ad_bytes; /* size of src/dst address */
+ u64 rip; /* rip of the instruction */
+ u64 next_rip; /* rip following the instruction */
+};
+
+/*
+ * x86_emulate_ops:
+ *
+ * These operations represent the instruction emulator's interface to memory.
+ * There are two categories of operation: those that act on ordinary memory
+ * regions (*_std), and those that act on memory regions known to require
+ * special treatment or emulation (*_emulated).
+ *
+ * The emulator assumes that an instruction accesses only one 'emulated memory'
+ * location, that this location is the given linear faulting address (cr2), and
+ * that this is one of the instruction's data operands. Instruction fetches and
+ * stack operations are assumed never to access emulated memory. The emulator
+ * automatically deduces which operand of a string-move operation is accessing
+ * emulated memory, and assumes that the other operand accesses normal memory.
+ *
+ * NOTES:
+ * 1. The emulator isn't very smart about emulated vs. standard memory.
+ * 'Emulated memory' access addresses should be checked for sanity.
+ * 'Normal memory' accesses may fault, and the caller must arrange to
+ * detect and handle reentrancy into the emulator via recursive faults.
+ * Accesses may be unaligned and may cross page boundaries.
+ * 2. If the access fails (cannot emulate, or a standard access faults) then
+ * it is up to the memop to propagate the fault to the guest VM via
+ * some out-of-band mechanism, unknown to the emulator. The memop signals
+ * failure by returning X86EMUL_PROPAGATE_FAULT to the emulator, which will
+ * then immediately bail.
+ * 3. Valid access sizes are 1, 2, 4 and 8 bytes. On x86/32 systems only
+ * cmpxchg8b_emulated need support 8-byte accesses.
+ * 4. The emulator cannot handle 64-bit mode emulation on an x86/32 system.
+ */
+/* Access completed successfully: continue emulation as normal. */
+#define X86EMUL_CONTINUE 0
+/* Access is unhandleable: bail from emulation and return error to caller. */
+#define X86EMUL_UNHANDLEABLE 1
+/* Terminate emulation but return success to the caller. */
+#define X86EMUL_PROPAGATE_FAULT 2 /* propagate a generated fault to guest */
+#define X86EMUL_RETRY_INSTR 3 /* retry the instruction for some reason */
+#define X86EMUL_CMPXCHG_FAILED 4 /* cmpxchg did not see expected value */
+#define X86EMUL_IO_NEEDED 5 /* IO is needed to complete emulation */
+#define X86EMUL_INTERCEPTED 6 /* Intercepted by nested VMCB/VMCS */
+/* Emulation during event vectoring is unhandleable. */
+#define X86EMUL_UNHANDLEABLE_VECTORING 7
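/*
 * Dispatch sketch for the X86EMUL_* codes above (illustrative, hypothetical
 * helper): only CONTINUE and RETRY_INSTR keep the emulation loop alive; the
 * other codes terminate this emulation attempt one way or another.
 */
static inline bool example_emul_rc_keeps_going(int rc)
{
	return rc == X86EMUL_CONTINUE || rc == X86EMUL_RETRY_INSTR;
}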
+
+/* x86-specific emulation flags */
+#define X86EMUL_F_WRITE BIT(0)
+#define X86EMUL_F_FETCH BIT(1)
+#define X86EMUL_F_IMPLICIT BIT(2)
+#define X86EMUL_F_INVLPG BIT(3)
+#define X86EMUL_F_MSR BIT(4)
+#define X86EMUL_F_DT_LOAD BIT(5)
+
+struct x86_emulate_ops {
+ void (*vm_bugged)(struct x86_emulate_ctxt *ctxt);
+ /*
+ * read_gpr: read a general purpose register (rax - r15)
+ *
+ * @reg: gpr number.
+ */
+ ulong (*read_gpr)(struct x86_emulate_ctxt *ctxt, unsigned reg);
+ /*
+ * write_gpr: write a general purpose register (rax - r15)
+ *
+ * @reg: gpr number.
+ * @val: value to write.
+ */
+ void (*write_gpr)(struct x86_emulate_ctxt *ctxt, unsigned reg, ulong val);
+ /*
+ * read_std: Read bytes of standard (non-emulated/special) memory.
+ * Used for descriptor reading.
+ * @addr: [IN ] Linear address from which to read.
+ * @val: [OUT] Value read from memory, zero-extended to 'u_long'.
+ * @bytes: [IN ] Number of bytes to read from memory.
+ * @system:[IN ] Whether the access is forced to be at CPL0.
+ */
+ int (*read_std)(struct x86_emulate_ctxt *ctxt,
+ unsigned long addr, void *val,
+ unsigned int bytes,
+ struct x86_exception *fault, bool system);
+
+ /*
+ * write_std: Write bytes of standard (non-emulated/special) memory.
+ * Used for descriptor writing.
+ * @addr: [IN ] Linear address to which to write.
+ * @val: [IN ] Value to write to memory.
+ * @bytes: [IN ] Number of bytes to write to memory.
+ * @system:[IN ] Whether the access is forced to be at CPL0.
+ */
+ int (*write_std)(struct x86_emulate_ctxt *ctxt,
+ unsigned long addr, void *val, unsigned int bytes,
+ struct x86_exception *fault, bool system);
+ /*
+ * fetch: Read bytes of standard (non-emulated/special) memory.
+ * Used for instruction fetch.
+ * @addr: [IN ] Linear address from which to read.
+ * @val: [OUT] Value read from memory, zero-extended to 'u_long'.
+ * @bytes: [IN ] Number of bytes to read from memory.
+ */
+ int (*fetch)(struct x86_emulate_ctxt *ctxt,
+ unsigned long addr, void *val, unsigned int bytes,
+ struct x86_exception *fault);
+
+ /*
+ * read_emulated: Read bytes from emulated/special memory area.
+ * @addr: [IN ] Linear address from which to read.
+ * @val: [OUT] Value read from memory, zero-extended to 'u_long'.
+ * @bytes: [IN ] Number of bytes to read from memory.
+ */
+ int (*read_emulated)(struct x86_emulate_ctxt *ctxt,
+ unsigned long addr, void *val, unsigned int bytes,
+ struct x86_exception *fault);
+
+ /*
+ * write_emulated: Write bytes to emulated/special memory area.
+ * @addr: [IN ] Linear address to which to write.
+ * @val: [IN ] Value to write to memory (low-order bytes used as
+ * required).
+ * @bytes: [IN ] Number of bytes to write to memory.
+ */
+ int (*write_emulated)(struct x86_emulate_ctxt *ctxt,
+ unsigned long addr, const void *val,
+ unsigned int bytes,
+ struct x86_exception *fault);
+
+ /*
+ * cmpxchg_emulated: Emulate an atomic (LOCKed) CMPXCHG operation on an
+ * emulated/special memory area.
+ * @addr: [IN ] Linear address to access.
+ * @old: [IN ] Value expected to be current at @addr.
+ * @new: [IN ] Value to write to @addr.
+ * @bytes: [IN ] Number of bytes to access using CMPXCHG.
+ */
+ int (*cmpxchg_emulated)(struct x86_emulate_ctxt *ctxt,
+ unsigned long addr,
+ const void *old,
+ const void *new,
+ unsigned int bytes,
+ struct x86_exception *fault);
+ void (*invlpg)(struct x86_emulate_ctxt *ctxt, ulong addr);
+
+ int (*pio_in_emulated)(struct x86_emulate_ctxt *ctxt,
+ int size, unsigned short port, void *val,
+ unsigned int count);
+
+ int (*pio_out_emulated)(struct x86_emulate_ctxt *ctxt,
+ int size, unsigned short port, const void *val,
+ unsigned int count);
+
+ bool (*get_segment)(struct x86_emulate_ctxt *ctxt, u16 *selector,
+ struct desc_struct *desc, u32 *base3, int seg);
+ void (*set_segment)(struct x86_emulate_ctxt *ctxt, u16 selector,
+ struct desc_struct *desc, u32 base3, int seg);
+ unsigned long (*get_cached_segment_base)(struct x86_emulate_ctxt *ctxt,
+ int seg);
+ void (*get_gdt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt);
+ void (*get_idt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt);
+ void (*set_gdt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt);
+ void (*set_idt)(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt);
+ ulong (*get_cr)(struct x86_emulate_ctxt *ctxt, int cr);
+ int (*set_cr)(struct x86_emulate_ctxt *ctxt, int cr, ulong val);
+ int (*cpl)(struct x86_emulate_ctxt *ctxt);
+ ulong (*get_dr)(struct x86_emulate_ctxt *ctxt, int dr);
+ int (*set_dr)(struct x86_emulate_ctxt *ctxt, int dr, ulong value);
+ int (*set_msr_with_filter)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 data);
+ int (*get_msr_with_filter)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 *pdata);
+ int (*get_msr)(struct x86_emulate_ctxt *ctxt, u32 msr_index, u64 *pdata);
+ int (*check_rdpmc_early)(struct x86_emulate_ctxt *ctxt, u32 pmc);
+ int (*read_pmc)(struct x86_emulate_ctxt *ctxt, u32 pmc, u64 *pdata);
+ void (*halt)(struct x86_emulate_ctxt *ctxt);
+ void (*wbinvd)(struct x86_emulate_ctxt *ctxt);
+ int (*fix_hypercall)(struct x86_emulate_ctxt *ctxt);
+ int (*intercept)(struct x86_emulate_ctxt *ctxt,
+ struct x86_instruction_info *info,
+ enum x86_intercept_stage stage);
+
+ bool (*get_cpuid)(struct x86_emulate_ctxt *ctxt, u32 *eax, u32 *ebx,
+ u32 *ecx, u32 *edx, bool exact_only);
+ bool (*guest_has_movbe)(struct x86_emulate_ctxt *ctxt);
+ bool (*guest_has_fxsr)(struct x86_emulate_ctxt *ctxt);
+ bool (*guest_has_rdpid)(struct x86_emulate_ctxt *ctxt);
+ bool (*guest_cpuid_is_intel_compatible)(struct x86_emulate_ctxt *ctxt);
+
+ void (*set_nmi_mask)(struct x86_emulate_ctxt *ctxt, bool masked);
+
+ bool (*is_smm)(struct x86_emulate_ctxt *ctxt);
+ int (*leave_smm)(struct x86_emulate_ctxt *ctxt);
+ void (*triple_fault)(struct x86_emulate_ctxt *ctxt);
+ int (*get_xcr)(struct x86_emulate_ctxt *ctxt, u32 index, u64 *xcr);
+ int (*set_xcr)(struct x86_emulate_ctxt *ctxt, u32 index, u64 xcr);
+
+ gva_t (*get_untagged_addr)(struct x86_emulate_ctxt *ctxt, gva_t addr,
+ unsigned int flags);
+
+ bool (*is_canonical_addr)(struct x86_emulate_ctxt *ctxt, gva_t addr,
+ unsigned int flags);
+};
+
+/* Type, address-of, and value of an instruction's operand. */
+struct operand {
+ enum { OP_REG, OP_MEM, OP_MEM_STR, OP_IMM, OP_XMM, OP_YMM, OP_MM, OP_NONE } type;
+ unsigned int bytes;
+ unsigned int count;
+ union {
+ unsigned long orig_val;
+ u64 orig_val64;
+ };
+ union {
+ unsigned long *reg;
+ struct segmented_address {
+ ulong ea;
+ unsigned seg;
+ } mem;
+ unsigned xmm;
+ unsigned mm;
+ } addr;
+ union {
+ unsigned long val;
+ u64 val64;
+ char valptr[sizeof(avx256_t)];
+ sse128_t vec_val;
+ avx256_t vec_val2;
+ u64 mm_val;
+ void *data;
+ } __aligned(32);
+};
+
+#define X86_MAX_INSTRUCTION_LENGTH 15
+
+struct fetch_cache {
+ u8 data[X86_MAX_INSTRUCTION_LENGTH];
+ u8 *ptr;
+ u8 *end;
+};
+
+struct read_cache {
+ u8 data[1024];
+ unsigned long pos;
+ unsigned long end;
+};
+
+/* Execution mode, passed to the emulator. */
+enum x86emul_mode {
+ X86EMUL_MODE_REAL, /* Real mode. */
+ X86EMUL_MODE_VM86, /* Virtual 8086 mode. */
+ X86EMUL_MODE_PROT16, /* 16-bit protected mode. */
+ X86EMUL_MODE_PROT32, /* 32-bit protected mode. */
+ X86EMUL_MODE_PROT64, /* 64-bit (long) mode. */
+};
+
+/*
+ * fastop functions are declared as taking a never-defined fastop parameter,
+ * so they can't be called from C directly.
+ */
+struct fastop;
+
+typedef void (*fastop_t)(struct fastop *);
+
+/*
+ * The emulator's _regs array tracks only the GPRs, i.e. excludes RIP. RIP is
+ * tracked/accessed via _eip, and except for RIP relative addressing, which
+ * also uses _eip, RIP cannot be a register operand nor can it be an operand in
+ * a ModRM or SIB byte.
+ */
+#ifdef CONFIG_X86_64
+#define NR_EMULATOR_GPRS 16
+#else
+#define NR_EMULATOR_GPRS 8
+#endif
+
+/*
+ * Distinguish between no prefix and REX (and, in the future, REX2).
+ */
+enum rex_type {
+ REX_NONE,
+ REX_PREFIX,
+};
+
+struct x86_emulate_ctxt {
+ void *vcpu;
+ const struct x86_emulate_ops *ops;
+
+ /* Register state before/after emulation. */
+ unsigned long eflags;
+ unsigned long eip; /* eip before instruction emulation */
+ /* Emulated execution mode, represented by an X86EMUL_MODE value. */
+ enum x86emul_mode mode;
+
+ /* interruptibility state, as a result of execution of STI or MOV SS */
+ int interruptibility;
+
+ bool perm_ok; /* do not check permissions if true */
+ bool tf; /* TF value before instruction (after for syscall/sysret) */
+
+ bool have_exception;
+ struct x86_exception exception;
+
+ /* GPA available */
+ bool gpa_available;
+ gpa_t gpa_val;
+
+ /*
+ * decode cache
+ */
+
+ /* current opcode length in bytes */
+ u8 opcode_len;
+ u8 b;
+ u8 intercept;
+ bool op_prefix;
+ u8 op_bytes;
+ u8 ad_bytes;
+ union {
+ int (*execute)(struct x86_emulate_ctxt *ctxt);
+ fastop_t fop;
+ };
+ int (*check_perm)(struct x86_emulate_ctxt *ctxt);
+
+ bool rip_relative;
+ enum rex_type rex_prefix;
+ u8 rex_bits;
+ u8 lock_prefix;
+ u8 rep_prefix;
+ /* bitmaps of registers in _regs[] that can be read */
+ u16 regs_valid;
+ /* bitmaps of registers in _regs[] that have been written */
+ u16 regs_dirty;
+ /* modrm */
+ u8 modrm;
+ u8 modrm_mod;
+ u8 modrm_reg;
+ u8 modrm_rm;
+ u8 modrm_seg;
+ u8 seg_override;
+ u64 d;
+ unsigned long _eip;
+
+ /* Here begins the usercopy section. */
+ struct operand src;
+ struct operand src2;
+ struct operand dst;
+ struct operand memop;
+ unsigned long _regs[NR_EMULATOR_GPRS];
+ struct operand *memopp;
+ struct fetch_cache fetch;
+ struct read_cache io_read;
+ struct read_cache mem_read;
+ bool is_branch;
+};
+
+#define KVM_EMULATOR_BUG_ON(cond, ctxt) \
+({ \
+ int __ret = (cond); \
+ \
+ if (WARN_ON_ONCE(__ret)) \
+ ctxt->ops->vm_bugged(ctxt); \
+ unlikely(__ret); \
+})
+
+/* Repeat String Operation Prefix */
+#define REPE_PREFIX 0xf3
+#define REPNE_PREFIX 0xf2
+
+/* CPUID vendors */
+#define X86EMUL_CPUID_VENDOR_AuthenticAMD_ebx 0x68747541
+#define X86EMUL_CPUID_VENDOR_AuthenticAMD_ecx 0x444d4163
+#define X86EMUL_CPUID_VENDOR_AuthenticAMD_edx 0x69746e65
+
+#define X86EMUL_CPUID_VENDOR_AMDisbetterI_ebx 0x69444d41
+#define X86EMUL_CPUID_VENDOR_AMDisbetterI_ecx 0x21726574
+#define X86EMUL_CPUID_VENDOR_AMDisbetterI_edx 0x74656273
+
+#define X86EMUL_CPUID_VENDOR_HygonGenuine_ebx 0x6f677948
+#define X86EMUL_CPUID_VENDOR_HygonGenuine_ecx 0x656e6975
+#define X86EMUL_CPUID_VENDOR_HygonGenuine_edx 0x6e65476e
+
+#define X86EMUL_CPUID_VENDOR_GenuineIntel_ebx 0x756e6547
+#define X86EMUL_CPUID_VENDOR_GenuineIntel_ecx 0x6c65746e
+#define X86EMUL_CPUID_VENDOR_GenuineIntel_edx 0x49656e69
+
+#define X86EMUL_CPUID_VENDOR_CentaurHauls_ebx 0x746e6543
+#define X86EMUL_CPUID_VENDOR_CentaurHauls_ecx 0x736c7561
+#define X86EMUL_CPUID_VENDOR_CentaurHauls_edx 0x48727561
+
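/*
 * The vendor constants above are the 12-byte CPUID vendor strings split into
 * little-endian register words (ebx, edx, ecx order). Decoding one by hand
 * (illustrative):
 *
 *   "GenuineIntel" = "Genu" | "ineI" | "ntel"
 *     ebx = 'u'<<24 | 'n'<<16 | 'e'<<8 | 'G' = 0x756e6547
 *     edx = 'I'<<24 | 'e'<<16 | 'n'<<8 | 'i' = 0x49656e69
 *     ecx = 'l'<<24 | 'e'<<16 | 't'<<8 | 'n' = 0x6c65746e
 */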
+static inline bool is_guest_vendor_intel(u32 ebx, u32 ecx, u32 edx)
+{
+ return ebx == X86EMUL_CPUID_VENDOR_GenuineIntel_ebx &&
+ ecx == X86EMUL_CPUID_VENDOR_GenuineIntel_ecx &&
+ edx == X86EMUL_CPUID_VENDOR_GenuineIntel_edx;
+}
+
+static inline bool is_guest_vendor_amd(u32 ebx, u32 ecx, u32 edx)
+{
+ return (ebx == X86EMUL_CPUID_VENDOR_AuthenticAMD_ebx &&
+ ecx == X86EMUL_CPUID_VENDOR_AuthenticAMD_ecx &&
+ edx == X86EMUL_CPUID_VENDOR_AuthenticAMD_edx) ||
+ (ebx == X86EMUL_CPUID_VENDOR_AMDisbetterI_ebx &&
+ ecx == X86EMUL_CPUID_VENDOR_AMDisbetterI_ecx &&
+ edx == X86EMUL_CPUID_VENDOR_AMDisbetterI_edx);
+}
+
+static inline bool is_guest_vendor_hygon(u32 ebx, u32 ecx, u32 edx)
+{
+ return ebx == X86EMUL_CPUID_VENDOR_HygonGenuine_ebx &&
+ ecx == X86EMUL_CPUID_VENDOR_HygonGenuine_ecx &&
+ edx == X86EMUL_CPUID_VENDOR_HygonGenuine_edx;
+}
+
+enum x86_intercept_stage {
+ X86_ICTP_NONE = 0, /* Allow zero-init to not match anything */
+ X86_ICPT_PRE_EXCEPT,
+ X86_ICPT_POST_EXCEPT,
+ X86_ICPT_POST_MEMACCESS,
+};
+
+enum x86_intercept {
+ x86_intercept_none,
+ x86_intercept_cr_read,
+ x86_intercept_cr_write,
+ x86_intercept_clts,
+ x86_intercept_lmsw,
+ x86_intercept_smsw,
+ x86_intercept_dr_read,
+ x86_intercept_dr_write,
+ x86_intercept_lidt,
+ x86_intercept_sidt,
+ x86_intercept_lgdt,
+ x86_intercept_sgdt,
+ x86_intercept_lldt,
+ x86_intercept_sldt,
+ x86_intercept_ltr,
+ x86_intercept_str,
+ x86_intercept_rdtsc,
+ x86_intercept_rdpmc,
+ x86_intercept_pushf,
+ x86_intercept_popf,
+ x86_intercept_cpuid,
+ x86_intercept_rsm,
+ x86_intercept_iret,
+ x86_intercept_intn,
+ x86_intercept_invd,
+ x86_intercept_pause,
+ x86_intercept_hlt,
+ x86_intercept_invlpg,
+ x86_intercept_invlpga,
+ x86_intercept_vmrun,
+ x86_intercept_vmload,
+ x86_intercept_vmsave,
+ x86_intercept_vmmcall,
+ x86_intercept_stgi,
+ x86_intercept_clgi,
+ x86_intercept_skinit,
+ x86_intercept_rdtscp,
+ x86_intercept_rdpid,
+ x86_intercept_icebp,
+ x86_intercept_wbinvd,
+ x86_intercept_monitor,
+ x86_intercept_mwait,
+ x86_intercept_rdmsr,
+ x86_intercept_wrmsr,
+ x86_intercept_in,
+ x86_intercept_ins,
+ x86_intercept_out,
+ x86_intercept_outs,
+ x86_intercept_xsetbv,
+
+ nr_x86_intercepts
+};
+
+/* Host execution mode. */
+#if defined(CONFIG_X86_32)
+#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT32
+#elif defined(CONFIG_X86_64)
+#define X86EMUL_MODE_HOST X86EMUL_MODE_PROT64
+#endif
+
+int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len, int emulation_type);
+bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
+#define EMULATION_FAILED -1
+#define EMULATION_OK 0
+#define EMULATION_RESTART 1
+#define EMULATION_INTERCEPTED 2
+void init_decode_cache(struct x86_emulate_ctxt *ctxt);
+int x86_emulate_insn(struct x86_emulate_ctxt *ctxt, bool check_intercepts);
+int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
+ u16 tss_selector, int idt_index, int reason,
+ bool has_error_code, u32 error_code);
+int emulate_int_real(struct x86_emulate_ctxt *ctxt, int irq);
+void emulator_invalidate_register_cache(struct x86_emulate_ctxt *ctxt);
+void emulator_writeback_register_cache(struct x86_emulate_ctxt *ctxt);
+bool emulator_can_use_gpa(struct x86_emulate_ctxt *ctxt);
+
+static inline ulong reg_read(struct x86_emulate_ctxt *ctxt, unsigned nr)
+{
+ if (KVM_EMULATOR_BUG_ON(nr >= NR_EMULATOR_GPRS, ctxt))
+ nr &= NR_EMULATOR_GPRS - 1;
+
+ if (!(ctxt->regs_valid & (1 << nr))) {
+ ctxt->regs_valid |= 1 << nr;
+ ctxt->_regs[nr] = ctxt->ops->read_gpr(ctxt, nr);
+ }
+ return ctxt->_regs[nr];
+}
+
+static inline ulong *reg_write(struct x86_emulate_ctxt *ctxt, unsigned nr)
+{
+ if (KVM_EMULATOR_BUG_ON(nr >= NR_EMULATOR_GPRS, ctxt))
+ nr &= NR_EMULATOR_GPRS - 1;
+
+ BUILD_BUG_ON(sizeof(ctxt->regs_dirty) * BITS_PER_BYTE < NR_EMULATOR_GPRS);
+ BUILD_BUG_ON(sizeof(ctxt->regs_valid) * BITS_PER_BYTE < NR_EMULATOR_GPRS);
+
+ ctxt->regs_valid |= 1 << nr;
+ ctxt->regs_dirty |= 1 << nr;
+ return &ctxt->_regs[nr];
+}
+
+static inline ulong *reg_rmw(struct x86_emulate_ctxt *ctxt, unsigned nr)
+{
+ reg_read(ctxt, nr);
+ return reg_write(ctxt, nr);
+}
+
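/*
 * Usage sketch for the lazy GPR cache above (illustrative, hypothetical
 * helper): reg_rmw() faults the GPR into _regs[] on first use (regs_valid)
 * and marks it for writeback (regs_dirty); dirty registers are flushed back
 * through ->write_gpr() when the emulator writes back its register cache.
 */
static inline void example_increment_gpr(struct x86_emulate_ctxt *ctxt,
					 unsigned nr)
{
	ulong *reg = reg_rmw(ctxt, nr);	/* cached and dirtied */

	*reg += 1;			/* written back later */
}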
+#endif /* _ASM_X86_KVM_X86_EMULATE_H */
diff --git a/arch/x86/kvm/kvm_onhyperv.c b/arch/x86/kvm/kvm_onhyperv.c
new file mode 100644
index 000000000000..ee53e75a60cb
--- /dev/null
+++ b/arch/x86/kvm/kvm_onhyperv.c
@@ -0,0 +1,124 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KVM L1 hypervisor optimizations on Hyper-V.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kvm_host.h>
+#include <asm/mshyperv.h>
+
+#include "hyperv.h"
+#include "kvm_onhyperv.h"
+
+struct kvm_hv_tlb_range {
+ u64 start_gfn;
+ u64 pages;
+};
+
+static int kvm_fill_hv_flush_list_func(struct hv_guest_mapping_flush_list *flush,
+ void *data)
+{
+ struct kvm_hv_tlb_range *range = data;
+
+ return hyperv_fill_flush_guest_mapping_list(flush, range->start_gfn,
+ range->pages);
+}
+
+static inline int hv_remote_flush_root_tdp(hpa_t root_tdp,
+ struct kvm_hv_tlb_range *range)
+{
+ if (range)
+ return hyperv_flush_guest_mapping_range(root_tdp,
+ kvm_fill_hv_flush_list_func, (void *)range);
+ else
+ return hyperv_flush_guest_mapping(root_tdp);
+}
+
+static int __hv_flush_remote_tlbs_range(struct kvm *kvm,
+ struct kvm_hv_tlb_range *range)
+{
+ struct kvm_arch *kvm_arch = &kvm->arch;
+ struct kvm_vcpu *vcpu;
+ int ret = 0, nr_unique_valid_roots;
+ unsigned long i;
+ hpa_t root;
+
+ spin_lock(&kvm_arch->hv_root_tdp_lock);
+
+ if (!VALID_PAGE(kvm_arch->hv_root_tdp)) {
+ nr_unique_valid_roots = 0;
+
+ /*
+ * Flush all valid roots, and see if all vCPUs have converged
+ * on a common root, in which case future flushes can skip the
+ * loop and flush the common root.
+ */
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ root = vcpu->arch.hv_root_tdp;
+ if (!VALID_PAGE(root) || root == kvm_arch->hv_root_tdp)
+ continue;
+
+ /*
+ * Set the tracked root to the first valid root. Keep
+ * this root for the entirety of the loop even if more
+ * roots are encountered as a low effort optimization
+ * to avoid flushing the same (first) root again.
+ */
+ if (++nr_unique_valid_roots == 1)
+ kvm_arch->hv_root_tdp = root;
+
+ if (!ret)
+ ret = hv_remote_flush_root_tdp(root, range);
+
+ /*
+ * Stop processing roots if a failure occurred and
+ * multiple valid roots have already been detected.
+ */
+ if (ret && nr_unique_valid_roots > 1)
+ break;
+ }
+
+ /*
+ * The optimized flush of a single root can't be used if there
+ * are multiple valid roots (obviously).
+ */
+ if (nr_unique_valid_roots > 1)
+ kvm_arch->hv_root_tdp = INVALID_PAGE;
+ } else {
+ ret = hv_remote_flush_root_tdp(kvm_arch->hv_root_tdp, range);
+ }
+
+ spin_unlock(&kvm_arch->hv_root_tdp_lock);
+ return ret;
+}
+
+int hv_flush_remote_tlbs_range(struct kvm *kvm, gfn_t start_gfn, gfn_t nr_pages)
+{
+ struct kvm_hv_tlb_range range = {
+ .start_gfn = start_gfn,
+ .pages = nr_pages,
+ };
+
+ return __hv_flush_remote_tlbs_range(kvm, &range);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(hv_flush_remote_tlbs_range);
+
+int hv_flush_remote_tlbs(struct kvm *kvm)
+{
+ return __hv_flush_remote_tlbs_range(kvm, NULL);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(hv_flush_remote_tlbs);
+
+void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp)
+{
+ struct kvm_arch *kvm_arch = &vcpu->kvm->arch;
+
+ if (kvm_x86_ops.flush_remote_tlbs == hv_flush_remote_tlbs) {
+ spin_lock(&kvm_arch->hv_root_tdp_lock);
+ vcpu->arch.hv_root_tdp = root_tdp;
+ if (root_tdp != kvm_arch->hv_root_tdp)
+ kvm_arch->hv_root_tdp = INVALID_PAGE;
+ spin_unlock(&kvm_arch->hv_root_tdp_lock);
+ }
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(hv_track_root_tdp);
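/*
 * Hookup sketch (illustrative, not part of this patch; assumes kvm_x86_ops
 * exposes flush_remote_tlbs/flush_remote_tlbs_range hooks and that
 * hv_is_hyperv_initialized() is available from asm/mshyperv.h): vendor code
 * opts in to Hyper-V assisted remote flushes by pointing kvm_x86_ops at the
 * helpers above, which is also the condition hv_track_root_tdp() checks.
 */
static void example_enable_hv_remote_flush(void)
{
	if (hv_is_hyperv_initialized()) {
		kvm_x86_ops.flush_remote_tlbs = hv_flush_remote_tlbs;
		kvm_x86_ops.flush_remote_tlbs_range = hv_flush_remote_tlbs_range;
	}
}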
diff --git a/arch/x86/kvm/kvm_onhyperv.h b/arch/x86/kvm/kvm_onhyperv.h
new file mode 100644
index 000000000000..eefab3dc8498
--- /dev/null
+++ b/arch/x86/kvm/kvm_onhyperv.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * KVM L1 hypervisor optimizations on Hyper-V.
+ */
+
+#ifndef __ARCH_X86_KVM_KVM_ONHYPERV_H__
+#define __ARCH_X86_KVM_KVM_ONHYPERV_H__
+
+#if IS_ENABLED(CONFIG_HYPERV)
+int hv_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, gfn_t nr_pages);
+int hv_flush_remote_tlbs(struct kvm *kvm);
+void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp);
+static inline hpa_t hv_get_partition_assist_page(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Partition assist page is something which Hyper-V running in L0
+ * requires from KVM running in L1 before direct TLB flush for L2
+ * guests can be enabled. KVM doesn't currently use the page but to
+ * comply with TLFS it still needs to be allocated. For now, this
+ * is a single page shared among all vCPUs.
+ */
+ struct hv_partition_assist_pg **p_hv_pa_pg =
+ &vcpu->kvm->arch.hv_pa_pg;
+
+ if (!*p_hv_pa_pg)
+ *p_hv_pa_pg = kzalloc(PAGE_SIZE, GFP_KERNEL_ACCOUNT);
+
+ if (!*p_hv_pa_pg)
+ return INVALID_PAGE;
+
+ return __pa(*p_hv_pa_pg);
+}
+#else /* !CONFIG_HYPERV */
+static inline int hv_flush_remote_tlbs(struct kvm *kvm)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp)
+{
+}
+#endif /* !CONFIG_HYPERV */
+
+#endif
diff --git a/arch/x86/kvm/kvm_timer.h b/arch/x86/kvm/kvm_timer.h
deleted file mode 100644
index 64bc6ea78d90..000000000000
--- a/arch/x86/kvm/kvm_timer.h
+++ /dev/null
@@ -1,16 +0,0 @@
-
-struct kvm_timer {
- struct hrtimer timer;
- s64 period; /* unit: ns */
- atomic_t pending; /* accumulated triggered timers */
- bool reinject;
- struct kvm_timer_ops *t_ops;
- struct kvm *kvm;
- struct kvm_vcpu *vcpu;
-};
-
-struct kvm_timer_ops {
- bool (*is_periodic)(struct kvm_timer *);
-};
-
-enum hrtimer_restart kvm_timer_fn(struct hrtimer *data);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 1eb7a4ae0c9c..1597dd0b0cc6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0-only
/*
* Local APIC virtualization
@@ -5,6 +6,7 @@
* Copyright (C) 2006 Qumranet, Inc.
* Copyright (C) 2007 Novell
* Copyright (C) 2007 Intel
+ * Copyright 2009 Red Hat, Inc. and/or its affiliates.
*
* Authors:
* Dor Laor <dor.laor@qumranet.com>
@@ -12,10 +14,8 @@
* Yaozu (Eddie) Dong <eddie.dong@intel.com>
*
* Based on Xen 3.1 code, Copyright (c) 2004, Intel Corporation.
- *
- * This work is licensed under the terms of the GNU GPL, version 2. See
- * the COPYING file in the top-level directory.
*/
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/kvm_host.h>
#include <linux/kvm.h>
@@ -24,19 +24,28 @@
#include <linux/smp.h>
#include <linux/hrtimer.h>
#include <linux/io.h>
-#include <linux/module.h>
+#include <linux/export.h>
#include <linux/math64.h>
#include <linux/slab.h>
+#include <asm/apic.h>
#include <asm/processor.h>
+#include <asm/mce.h>
#include <asm/msr.h>
#include <asm/page.h>
#include <asm/current.h>
#include <asm/apicdef.h>
-#include <asm/atomic.h>
+#include <asm/delay.h>
+#include <linux/atomic.h>
+#include <linux/jump_label.h>
#include "kvm_cache_regs.h"
#include "irq.h"
+#include "ioapic.h"
#include "trace.h"
#include "x86.h"
+#include "xen.h"
+#include "cpuid.h"
+#include "hyperv.h"
+#include "smm.h"
#ifndef CONFIG_X86_64
#define mod_64(x, y) ((x) - (y) * div64_u64(x, y))
@@ -44,99 +53,480 @@
#define mod_64(x, y) ((x) % (y))
#endif
-#define PRId64 "d"
-#define PRIx64 "llx"
-#define PRIu64 "u"
-#define PRIo64 "o"
+/* 14 is the version for Xeon and Pentium 8.4.8 */
+#define APIC_VERSION 0x14UL
+#define LAPIC_MMIO_LENGTH (1 << 12)
-#define APIC_BUS_CYCLE_NS 1
+/*
+ * Enable local APIC timer advancement (tscdeadline mode only) with adaptive
+ * tuning. When enabled, KVM programs the host timer event to fire early, i.e.
+ * before the deadline expires, to account for the delay between taking the
+ * VM-Exit (to inject the guest event) and the subsequent VM-Enter to resume
+ * the guest, i.e. so that the interrupt arrives in the guest with minimal
+ * latency relative to the deadline programmed by the guest.
+ */
+static bool lapic_timer_advance __read_mostly = true;
+module_param(lapic_timer_advance, bool, 0444);
-/* #define apic_debug(fmt,arg...) printk(KERN_WARNING fmt,##arg) */
-#define apic_debug(fmt, arg...)
+#define LAPIC_TIMER_ADVANCE_ADJUST_MIN 100 /* clock cycles */
+#define LAPIC_TIMER_ADVANCE_ADJUST_MAX 10000 /* clock cycles */
+#define LAPIC_TIMER_ADVANCE_NS_INIT 1000
+#define LAPIC_TIMER_ADVANCE_NS_MAX 5000
+/* step-by-step approximation to mitigate fluctuation */
+#define LAPIC_TIMER_ADVANCE_ADJUST_STEP 8
-#define APIC_LVT_NUM 6
-/* 14 is the version for Xeon and Pentium 8.4.8*/
-#define APIC_VERSION (0x14UL | ((APIC_LVT_NUM - 1) << 16))
-#define LAPIC_MMIO_LENGTH (1 << 12)
-/* followed define is not in apicdef.h */
-#define APIC_SHORT_MASK 0xc0000
-#define APIC_DEST_NOSHORT 0x0
-#define APIC_DEST_MASK 0x800
-#define MAX_APIC_VECTOR 256
+static bool __read_mostly vector_hashing_enabled = true;
+module_param_named(vector_hashing, vector_hashing_enabled, bool, 0444);
-#define VEC_POS(v) ((v) & (32 - 1))
-#define REG_POS(v) (((v) >> 5) << 4)
+static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data);
+static int kvm_lapic_msr_write(struct kvm_lapic *apic, u32 reg, u64 data);
-static inline u32 apic_get_reg(struct kvm_lapic *apic, int reg_off)
+static inline void kvm_lapic_set_reg(struct kvm_lapic *apic, int reg_off, u32 val)
{
- return *((u32 *) (apic->regs + reg_off));
+ apic_set_reg(apic->regs, reg_off, val);
}
-static inline void apic_set_reg(struct kvm_lapic *apic, int reg_off, u32 val)
+static __always_inline u64 kvm_lapic_get_reg64(struct kvm_lapic *apic, int reg)
{
- *((u32 *) (apic->regs + reg_off)) = val;
+ return apic_get_reg64(apic->regs, reg);
}
-static inline int apic_test_and_set_vector(int vec, void *bitmap)
+static __always_inline void kvm_lapic_set_reg64(struct kvm_lapic *apic,
+ int reg, u64 val)
{
- return test_and_set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+ apic_set_reg64(apic->regs, reg, val);
}
-static inline int apic_test_and_clear_vector(int vec, void *bitmap)
+bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector)
{
- return test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+ return apic_test_vector(vector, apic->regs + APIC_ISR) ||
+ apic_test_vector(vector, apic->regs + APIC_IRR);
}
-static inline void apic_set_vector(int vec, void *bitmap)
+__read_mostly DEFINE_STATIC_KEY_FALSE(kvm_has_noapic_vcpu);
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_has_noapic_vcpu);
+
+__read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_hw_disabled, HZ);
+__read_mostly DEFINE_STATIC_KEY_DEFERRED_FALSE(apic_sw_disabled, HZ);
+
+static inline int apic_enabled(struct kvm_lapic *apic)
+{
+ return kvm_apic_sw_enabled(apic) && kvm_apic_hw_enabled(apic);
+}
+
+#define LVT_MASK \
+ (APIC_LVT_MASKED | APIC_SEND_PENDING | APIC_VECTOR_MASK)
+
+#define LINT_MASK \
+ (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
+ APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
+
+static inline u32 kvm_x2apic_id(struct kvm_lapic *apic)
{
- set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+ return apic->vcpu->vcpu_id;
}
-static inline void apic_clear_vector(int vec, void *bitmap)
+static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu)
{
- clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
+ return pi_inject_timer && kvm_vcpu_apicv_active(vcpu) &&
+ (kvm_mwait_in_guest(vcpu->kvm) || kvm_hlt_in_guest(vcpu->kvm));
}
-static inline int apic_hw_enabled(struct kvm_lapic *apic)
+static bool kvm_can_use_hv_timer(struct kvm_vcpu *vcpu)
{
- return (apic)->vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE;
+ return kvm_x86_ops.set_hv_timer
+ && !(kvm_mwait_in_guest(vcpu->kvm) ||
+ kvm_can_post_timer_interrupt(vcpu));
}
-static inline int apic_sw_enabled(struct kvm_lapic *apic)
+static bool kvm_use_posted_timer_interrupt(struct kvm_vcpu *vcpu)
{
- return apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_APIC_ENABLED;
+ return kvm_can_post_timer_interrupt(vcpu) && vcpu->mode == IN_GUEST_MODE;
}
-static inline int apic_enabled(struct kvm_lapic *apic)
+static inline u32 kvm_apic_calc_x2apic_ldr(u32 id)
{
- return apic_sw_enabled(apic) && apic_hw_enabled(apic);
+ return ((id >> 4) << 16) | (1 << (id & 0xf));
}
-#define LVT_MASK \
- (APIC_LVT_MASKED | APIC_SEND_PENDING | APIC_VECTOR_MASK)
+static inline bool kvm_apic_map_get_logical_dest(struct kvm_apic_map *map,
+ u32 dest_id, struct kvm_lapic ***cluster, u16 *mask) {
+ switch (map->logical_mode) {
+ case KVM_APIC_MODE_SW_DISABLED:
+ /* Arbitrarily use the flat map so that @cluster isn't NULL. */
+ *cluster = map->xapic_flat_map;
+ *mask = 0;
+ return true;
+ case KVM_APIC_MODE_X2APIC: {
+ u32 offset = (dest_id >> 16) * 16;
+ u32 max_apic_id = map->max_apic_id;
-#define LINT_MASK \
- (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
- APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
+ if (offset <= max_apic_id) {
+ u8 cluster_size = min(max_apic_id - offset + 1, 16U);
+
+ offset = array_index_nospec(offset, map->max_apic_id + 1);
+ *cluster = &map->phys_map[offset];
+ *mask = dest_id & (0xffff >> (16 - cluster_size));
+ } else {
+ *mask = 0;
+ }
+
+ return true;
+ }
+ case KVM_APIC_MODE_XAPIC_FLAT:
+ *cluster = map->xapic_flat_map;
+ *mask = dest_id & 0xff;
+ return true;
+ case KVM_APIC_MODE_XAPIC_CLUSTER:
+ *cluster = map->xapic_cluster_map[(dest_id >> 4) & 0xf];
+ *mask = dest_id & 0xf;
+ return true;
+ case KVM_APIC_MODE_MAP_DISABLED:
+ return false;
+ default:
+ WARN_ON_ONCE(1);
+ return false;
+ }
+}
-static inline int kvm_apic_id(struct kvm_lapic *apic)
+static int kvm_recalculate_phys_map(struct kvm_apic_map *new,
+ struct kvm_vcpu *vcpu,
+ bool *xapic_id_mismatch)
{
- return (apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ u32 x2apic_id = kvm_x2apic_id(apic);
+ u32 xapic_id = kvm_xapic_id(apic);
+ u32 physical_id;
+
+ /*
+ * For simplicity, KVM always allocates enough space for all possible
+ * xAPIC IDs. Yell, but don't kill the VM, as KVM can continue on
+ * without the optimized map.
+ */
+ if (WARN_ON_ONCE(xapic_id > new->max_apic_id))
+ return -EINVAL;
+
+ /*
+ * Bail if a vCPU was added and/or enabled its APIC between allocating
+ * the map and doing the actual calculations for the map. Note, KVM
+ * hardcodes the x2APIC ID to vcpu_id, i.e. there's no TOCTOU bug if
+ * the compiler decides to reload x2apic_id after this check.
+ */
+ if (x2apic_id > new->max_apic_id)
+ return -E2BIG;
+
+ /*
+ * Deliberately truncate the vCPU ID when detecting a mismatched APIC
+ * ID to avoid false positives if the vCPU ID, i.e. x2APIC ID, is a
+ * 32-bit value. Any unwanted aliasing due to truncation results will
+ * be detected below.
+ */
+ if (!apic_x2apic_mode(apic) && xapic_id != (u8)vcpu->vcpu_id)
+ *xapic_id_mismatch = true;
+
+ /*
+ * Apply KVM's hotplug hack if userspace has enabled 32-bit APIC IDs.
+ * Allow sending events to vCPUs by their x2APIC ID even if the target
+ * vCPU is in legacy xAPIC mode, and silently ignore aliased xAPIC IDs
+ * (the x2APIC ID is truncated to 8 bits, causing IDs > 0xff to wrap
+ * and collide).
+ *
+ * Honor the architectural (and KVM's non-optimized) behavior if
+ * userspace has not enabled 32-bit x2APIC IDs. Each APIC is supposed
+ * to process messages independently. If multiple vCPUs have the same
+ * effective APIC ID, e.g. due to the x2APIC wrap or because the guest
+ * manually modified its xAPIC IDs, events targeting that ID are
+ * supposed to be recognized by all vCPUs with said ID.
+ */
+ if (vcpu->kvm->arch.x2apic_format) {
+ /* See also kvm_apic_match_physical_addr(). */
+ if (apic_x2apic_mode(apic) || x2apic_id > 0xff)
+ new->phys_map[x2apic_id] = apic;
+
+ if (!apic_x2apic_mode(apic) && !new->phys_map[xapic_id])
+ new->phys_map[xapic_id] = apic;
+ } else {
+ /*
+ * Disable the optimized map if the physical APIC ID is already
+ * mapped, i.e. is aliased to multiple vCPUs. The optimized
+ * map requires a strict 1:1 mapping between IDs and vCPUs.
+ */
+ if (apic_x2apic_mode(apic))
+ physical_id = x2apic_id;
+ else
+ physical_id = xapic_id;
+
+ if (new->phys_map[physical_id])
+ return -EINVAL;
+
+ new->phys_map[physical_id] = apic;
+ }
+
+ return 0;
+}
+
+static void kvm_recalculate_logical_map(struct kvm_apic_map *new,
+ struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ enum kvm_apic_logical_mode logical_mode;
+ struct kvm_lapic **cluster;
+ u16 mask;
+ u32 ldr;
+
+ if (new->logical_mode == KVM_APIC_MODE_MAP_DISABLED)
+ return;
+
+ if (!kvm_apic_sw_enabled(apic))
+ return;
+
+ ldr = kvm_lapic_get_reg(apic, APIC_LDR);
+ if (!ldr)
+ return;
+
+ if (apic_x2apic_mode(apic)) {
+ logical_mode = KVM_APIC_MODE_X2APIC;
+ } else {
+ ldr = GET_APIC_LOGICAL_ID(ldr);
+ if (kvm_lapic_get_reg(apic, APIC_DFR) == APIC_DFR_FLAT)
+ logical_mode = KVM_APIC_MODE_XAPIC_FLAT;
+ else
+ logical_mode = KVM_APIC_MODE_XAPIC_CLUSTER;
+ }
+
+ /*
+ * To optimize logical mode delivery, all software-enabled APICs must
+ * be configured for the same mode.
+ */
+ if (new->logical_mode == KVM_APIC_MODE_SW_DISABLED) {
+ new->logical_mode = logical_mode;
+ } else if (new->logical_mode != logical_mode) {
+ new->logical_mode = KVM_APIC_MODE_MAP_DISABLED;
+ return;
+ }
+
+ /*
+ * In x2APIC mode, the LDR is read-only and derived directly from the
+ * x2APIC ID, thus is guaranteed to be addressable. KVM reuses
+ * kvm_apic_map.phys_map to optimize logical mode x2APIC interrupts by
+ * reversing the LDR calculation to get the cluster of APICs, i.e. no
+ * additional work is required.
+ */
+ if (apic_x2apic_mode(apic))
+ return;
+
+ if (WARN_ON_ONCE(!kvm_apic_map_get_logical_dest(new, ldr,
+ &cluster, &mask))) {
+ new->logical_mode = KVM_APIC_MODE_MAP_DISABLED;
+ return;
+ }
+
+ if (!mask)
+ return;
+
+ ldr = ffs(mask) - 1;
+ if (!is_power_of_2(mask) || cluster[ldr])
+ new->logical_mode = KVM_APIC_MODE_MAP_DISABLED;
+ else
+ cluster[ldr] = apic;
+}
+
+/*
+ * CLEAN -> DIRTY and UPDATE_IN_PROGRESS -> DIRTY changes happen without a lock.
+ *
+ * DIRTY -> UPDATE_IN_PROGRESS and UPDATE_IN_PROGRESS -> CLEAN happen with
+ * apic_map_lock_held.
+ */
+enum {
+ CLEAN,
+ UPDATE_IN_PROGRESS,
+ DIRTY
+};
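/*
 * Transition sketch for the states above (illustrative): writers dirty the
 * map locklessly, the recalculation path claims and completes it under
 * apic_map_lock, and a concurrent dirtying leaves it DIRTY so recalc retries.
 *
 *   vCPU changes LDR/ID:  atomic_set_release(&map_dirty, DIRTY)
 *   recalc (locked):      atomic_cmpxchg_acquire(&map_dirty, DIRTY, UPDATE_IN_PROGRESS)
 *   recalc finished:      atomic_cmpxchg_release(&map_dirty, UPDATE_IN_PROGRESS, CLEAN)
 */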
+
+static void kvm_recalculate_apic_map(struct kvm *kvm)
+{
+ struct kvm_apic_map *new, *old = NULL;
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+ u32 max_id = 255; /* enough space for any xAPIC ID */
+ bool xapic_id_mismatch;
+ int r;
+
+ /* Read kvm->arch.apic_map_dirty before kvm->arch.apic_map. */
+ if (atomic_read_acquire(&kvm->arch.apic_map_dirty) == CLEAN)
+ return;
+
+ WARN_ONCE(!irqchip_in_kernel(kvm),
+ "Dirty APIC map without an in-kernel local APIC");
+
+ mutex_lock(&kvm->arch.apic_map_lock);
+
+retry:
+ /*
+ * Read kvm->arch.apic_map_dirty before kvm->arch.apic_map (if clean)
+ * or the APIC registers (if dirty). Note, on retry the map may have
+ * not yet been marked dirty by whatever task changed a vCPU's x2APIC
+ * ID, i.e. the map may still show up as in-progress. In that case
+ * this task still needs to retry and complete its calculation.
+ */
+ if (atomic_cmpxchg_acquire(&kvm->arch.apic_map_dirty,
+ DIRTY, UPDATE_IN_PROGRESS) == CLEAN) {
+ /* Someone else has updated the map. */
+ mutex_unlock(&kvm->arch.apic_map_lock);
+ return;
+ }
+
+ /*
+ * Reset the mismatch flag between attempts so that KVM does the right
+ * thing if a vCPU changes its xAPIC ID, but do NOT reset max_id, i.e.
+ * keep max_id strictly increasing. Disallowing max_id from shrinking
+ * ensures KVM won't get stuck in an infinite loop, e.g. if the vCPU
+ * with the highest x2APIC ID is toggling its APIC on and off.
+ */
+ xapic_id_mismatch = false;
+
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ if (kvm_apic_present(vcpu))
+ max_id = max(max_id, kvm_x2apic_id(vcpu->arch.apic));
+
+ new = kvzalloc(sizeof(struct kvm_apic_map) +
+ sizeof(struct kvm_lapic *) * ((u64)max_id + 1),
+ GFP_KERNEL_ACCOUNT);
+
+ if (!new)
+ goto out;
+
+ new->max_apic_id = max_id;
+ new->logical_mode = KVM_APIC_MODE_SW_DISABLED;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (!kvm_apic_present(vcpu))
+ continue;
+
+ r = kvm_recalculate_phys_map(new, vcpu, &xapic_id_mismatch);
+ if (r) {
+ kvfree(new);
+ new = NULL;
+ if (r == -E2BIG) {
+ cond_resched();
+ goto retry;
+ }
+
+ goto out;
+ }
+
+ kvm_recalculate_logical_map(new, vcpu);
+ }
+out:
+ /*
+ * The optimized map is effectively KVM's internal version of APICv,
+ * and all unwanted aliasing that results in disabling the optimized
+ * map also applies to APICv.
+ */
+ if (!new)
+ kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED);
+ else
+ kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_ALIASED);
+
+ if (!new || new->logical_mode == KVM_APIC_MODE_MAP_DISABLED)
+ kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED);
+ else
+ kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_LOGICAL_ID_ALIASED);
+
+ if (xapic_id_mismatch)
+ kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_APIC_ID_MODIFIED);
+ else
+ kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_APIC_ID_MODIFIED);
+
+ old = rcu_dereference_protected(kvm->arch.apic_map,
+ lockdep_is_held(&kvm->arch.apic_map_lock));
+ rcu_assign_pointer(kvm->arch.apic_map, new);
+ /*
+ * Write kvm->arch.apic_map before clearing apic->apic_map_dirty.
+ * If another update has come in, leave it DIRTY.
+ */
+ atomic_cmpxchg_release(&kvm->arch.apic_map_dirty,
+ UPDATE_IN_PROGRESS, CLEAN);
+ mutex_unlock(&kvm->arch.apic_map_lock);
+
+ if (old)
+ kvfree_rcu(old, rcu);
+
+ kvm_make_scan_ioapic_request(kvm);
+}
+
+static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
+{
+ bool enabled = val & APIC_SPIV_APIC_ENABLED;
+
+ kvm_lapic_set_reg(apic, APIC_SPIV, val);
+
+ if (enabled != apic->sw_enabled) {
+ apic->sw_enabled = enabled;
+ if (enabled)
+ static_branch_slow_dec_deferred(&apic_sw_disabled);
+ else
+ static_branch_inc(&apic_sw_disabled.key);
+
+ atomic_set_release(&apic->vcpu->kvm->arch.apic_map_dirty, DIRTY);
+ }
+
+ /* Check if there are APF page ready requests pending */
+ if (enabled) {
+ kvm_make_request(KVM_REQ_APF_READY, apic->vcpu);
+ kvm_xen_sw_enable_lapic(apic->vcpu);
+ }
+}
+
+static inline void kvm_apic_set_xapic_id(struct kvm_lapic *apic, u8 id)
+{
+ kvm_lapic_set_reg(apic, APIC_ID, id << 24);
+ atomic_set_release(&apic->vcpu->kvm->arch.apic_map_dirty, DIRTY);
+}
+
+static inline void kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
+{
+ kvm_lapic_set_reg(apic, APIC_LDR, id);
+ atomic_set_release(&apic->vcpu->kvm->arch.apic_map_dirty, DIRTY);
+}
+
+static inline void kvm_apic_set_dfr(struct kvm_lapic *apic, u32 val)
+{
+ kvm_lapic_set_reg(apic, APIC_DFR, val);
+ atomic_set_release(&apic->vcpu->kvm->arch.apic_map_dirty, DIRTY);
+}
+
+static inline void kvm_apic_set_x2apic_id(struct kvm_lapic *apic, u32 id)
+{
+ u32 ldr = kvm_apic_calc_x2apic_ldr(id);
+
+ WARN_ON_ONCE(id != apic->vcpu->vcpu_id);
+
+ kvm_lapic_set_reg(apic, APIC_ID, id);
+ kvm_lapic_set_reg(apic, APIC_LDR, ldr);
+ atomic_set_release(&apic->vcpu->kvm->arch.apic_map_dirty, DIRTY);
}
static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
{
- return !(apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED);
+ return !(kvm_lapic_get_reg(apic, lvt_type) & APIC_LVT_MASKED);
}
-static inline int apic_lvt_vector(struct kvm_lapic *apic, int lvt_type)
+static inline int apic_lvtt_oneshot(struct kvm_lapic *apic)
{
- return apic_get_reg(apic, lvt_type) & APIC_VECTOR_MASK;
+ return apic->lapic_timer.timer_mode == APIC_LVT_TIMER_ONESHOT;
}
static inline int apic_lvtt_period(struct kvm_lapic *apic)
{
- return apic_get_reg(apic, APIC_LVTT) & APIC_LVT_TIMER_PERIODIC;
+ return apic->lapic_timer.timer_mode == APIC_LVT_TIMER_PERIODIC;
+}
+
+static inline int apic_lvtt_tscdeadline(struct kvm_lapic *apic)
+{
+ return apic->lapic_timer.timer_mode == APIC_LVT_TIMER_TSCDEADLINE;
}
static inline int apic_lvt_nmi_mode(u32 lvt_val)
@@ -144,63 +534,144 @@ static inline int apic_lvt_nmi_mode(u32 lvt_val)
return (lvt_val & (APIC_MODE_MASK | APIC_LVT_MASKED)) == APIC_DM_NMI;
}
+static inline bool kvm_lapic_lvt_supported(struct kvm_lapic *apic, int lvt_index)
+{
+ return apic->nr_lvt_entries > lvt_index;
+}
+
+static inline int kvm_apic_calc_nr_lvt_entries(struct kvm_vcpu *vcpu)
+{
+ return KVM_APIC_MAX_NR_LVT_ENTRIES - !(vcpu->arch.mcg_cap & MCG_CMCI_P);
+}
+
void kvm_apic_set_version(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- struct kvm_cpuid_entry2 *feat;
- u32 v = APIC_VERSION;
+ u32 v = 0;
- if (!irqchip_in_kernel(vcpu->kvm))
+ if (!lapic_in_kernel(vcpu))
return;
- feat = kvm_find_cpuid_entry(apic->vcpu, 0x1, 0);
- if (feat && (feat->ecx & (1 << (X86_FEATURE_X2APIC & 31))))
+ v = APIC_VERSION | ((apic->nr_lvt_entries - 1) << 16);
+
+ /*
+ * KVM emulates the 82093AA datasheet (with its in-kernel IOAPIC
+ * implementation), which doesn't have an EOI register. Some buggy OSes
+ * (e.g. Windows with the Hyper-V role) disable EOI broadcast in the
+ * lapic without checking the IOAPIC version first, so level-triggered
+ * interrupts never get EOIed in the IOAPIC.
+ */
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_X2APIC) &&
+ !ioapic_in_kernel(vcpu->kvm))
v |= APIC_LVR_DIRECTED_EOI;
- apic_set_reg(apic, APIC_LVR, v);
+ kvm_lapic_set_reg(apic, APIC_LVR, v);
}
-static inline int apic_x2apic_mode(struct kvm_lapic *apic)
+void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu)
{
- return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
+ int nr_lvt_entries = kvm_apic_calc_nr_lvt_entries(vcpu);
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ int i;
+
+ if (!lapic_in_kernel(vcpu) || nr_lvt_entries == apic->nr_lvt_entries)
+ return;
+
+ /* Initialize/mask any "new" LVT entries. */
+ for (i = apic->nr_lvt_entries; i < nr_lvt_entries; i++)
+ kvm_lapic_set_reg(apic, APIC_LVTx(i), APIC_LVT_MASKED);
+
+ apic->nr_lvt_entries = nr_lvt_entries;
+
+ /* The number of LVT entries is reflected in the version register. */
+ kvm_apic_set_version(vcpu);
}
-static unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
- LVT_MASK | APIC_LVT_TIMER_PERIODIC, /* LVTT */
- LVT_MASK | APIC_MODE_MASK, /* LVTTHMR */
- LVT_MASK | APIC_MODE_MASK, /* LVTPC */
- LINT_MASK, LINT_MASK, /* LVT0-1 */
- LVT_MASK /* LVTERR */
+static const unsigned int apic_lvt_mask[KVM_APIC_MAX_NR_LVT_ENTRIES] = {
+ [LVT_TIMER] = LVT_MASK, /* timer mode mask added at runtime */
+ [LVT_THERMAL_MONITOR] = LVT_MASK | APIC_MODE_MASK,
+ [LVT_PERFORMANCE_COUNTER] = LVT_MASK | APIC_MODE_MASK,
+ [LVT_LINT0] = LINT_MASK,
+ [LVT_LINT1] = LINT_MASK,
+ [LVT_ERROR] = LVT_MASK,
+ [LVT_CMCI] = LVT_MASK | APIC_MODE_MASK
};
-static int find_highest_vector(void *bitmap)
+static u8 count_vectors(void *bitmap)
{
- u32 *word = bitmap;
- int word_offset = MAX_APIC_VECTOR >> 5;
+ int vec;
+ u32 *reg;
+ u8 count = 0;
- while ((word_offset != 0) && (word[(--word_offset) << 2] == 0))
- continue;
+ for (vec = 0; vec < MAX_APIC_VECTOR; vec += APIC_VECTORS_PER_REG) {
+ reg = bitmap + APIC_VECTOR_TO_REG_OFFSET(vec);
+ count += hweight32(*reg);
+ }
- if (likely(!word_offset && !word[0]))
- return -1;
- else
- return fls(word[word_offset << 2]) - 1 + (word_offset << 5);
+ return count;
}
-static inline int apic_test_and_set_irr(int vec, struct kvm_lapic *apic)
+bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr)
{
- apic->irr_pending = true;
- return apic_test_and_set_vector(vec, apic->regs + APIC_IRR);
+ unsigned long pir_vals[NR_PIR_WORDS];
+ u32 *__pir = (void *)pir_vals;
+ u32 i, vec;
+ u32 irr_val, prev_irr_val;
+ int max_updated_irr;
+
+ max_updated_irr = -1;
+ *max_irr = -1;
+
+ if (!pi_harvest_pir(pir, pir_vals))
+ return false;
+
+ for (i = vec = 0; i <= 7; i++, vec += 32) {
+ u32 *p_irr = (u32 *)(regs + APIC_IRR + i * 0x10);
+
+ irr_val = READ_ONCE(*p_irr);
+
+ if (__pir[i]) {
+ prev_irr_val = irr_val;
+ do {
+ irr_val = prev_irr_val | __pir[i];
+ } while (prev_irr_val != irr_val &&
+ !try_cmpxchg(p_irr, &prev_irr_val, irr_val));
+
+ if (prev_irr_val != irr_val)
+ max_updated_irr = __fls(irr_val ^ prev_irr_val) + vec;
+ }
+ if (irr_val)
+ *max_irr = __fls(irr_val) + vec;
+ }
+
+ return ((max_updated_irr != -1) &&
+ (max_updated_irr == *max_irr));
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(__kvm_apic_update_irr);
+
+bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned long *pir, int *max_irr)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ bool irr_updated = __kvm_apic_update_irr(pir, apic->regs, max_irr);
+
+ if (unlikely(!apic->apicv_active && irr_updated))
+ apic->irr_pending = true;
+ return irr_updated;
}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_update_irr);
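The PIR-to-IRR merge above has to tolerate the CPU setting IRR bits concurrently, so it only retries the compare-exchange when another agent changed the word. A reduced sketch of that pattern with C11 atomics (names are illustrative, not KVM's):

#include <stdatomic.h>
#include <stdint.h>

/* OR one 32-bit posted-interrupt word into the matching IRR word
 * without losing bits set concurrently by another agent. */
static void merge_pir_word(_Atomic uint32_t *irr, uint32_t pir)
{
        uint32_t old = atomic_load(irr);
        uint32_t merged;

        do {
                merged = old | pir;
                /* Nothing new to set: every pir bit is already in irr. */
                if (merged == old)
                        return;
        } while (!atomic_compare_exchange_weak(irr, &old, merged));
}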
static inline int apic_search_irr(struct kvm_lapic *apic)
{
- return find_highest_vector(apic->regs + APIC_IRR);
+ return apic_find_highest_vector(apic->regs + APIC_IRR);
}
static inline int apic_find_highest_irr(struct kvm_lapic *apic)
{
int result;
+ /*
+ * Note that irr_pending is just a hint. It will always be
+ * true with virtual interrupt delivery enabled.
+ */
if (!apic->irr_pending)
return -1;
@@ -212,56 +683,249 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
{
- apic->irr_pending = false;
- apic_clear_vector(vec, apic->regs + APIC_IRR);
- if (apic_search_irr(apic) != -1)
- apic->irr_pending = true;
+ if (unlikely(apic->apicv_active)) {
+ apic_clear_vector(vec, apic->regs + APIC_IRR);
+ } else {
+ apic->irr_pending = false;
+ apic_clear_vector(vec, apic->regs + APIC_IRR);
+ if (apic_search_irr(apic) != -1)
+ apic->irr_pending = true;
+ }
}
-int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu)
+void kvm_apic_clear_irr(struct kvm_vcpu *vcpu, int vec)
+{
+ apic_clear_irr(vec, vcpu->arch.apic);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_clear_irr);
+
+static void *apic_vector_to_isr(int vec, struct kvm_lapic *apic)
+{
+ return apic->regs + APIC_ISR + APIC_VECTOR_TO_REG_OFFSET(vec);
+}
+
+static inline void apic_set_isr(int vec, struct kvm_lapic *apic)
+{
+ if (__test_and_set_bit(APIC_VECTOR_TO_BIT_NUMBER(vec),
+ apic_vector_to_isr(vec, apic)))
+ return;
+
+ /*
+ * With APIC virtualization enabled, all caching is disabled
+ * because the processor can modify ISR under the hood. Instead
+ * just set SVI.
+ */
+ if (unlikely(apic->apicv_active))
+ kvm_x86_call(hwapic_isr_update)(apic->vcpu, vec);
+ else {
+ ++apic->isr_count;
+ BUG_ON(apic->isr_count > MAX_APIC_VECTOR);
+ /*
+ * The ISR (in-service register) bit is set when an interrupt is
+ * injected. Only the highest-priority vector is ever injected, so
+ * the most recently set bit is also the highest bit in the ISR.
+ */
+ apic->highest_isr_cache = vec;
+ }
+}
+
+static inline int apic_find_highest_isr(struct kvm_lapic *apic)
+{
+ int result;
+
+ /*
+ * Note that isr_count is always 1, and highest_isr_cache
+ * is always -1, with APIC virtualization enabled.
+ */
+ if (!apic->isr_count)
+ return -1;
+ if (likely(apic->highest_isr_cache != -1))
+ return apic->highest_isr_cache;
+
+ result = apic_find_highest_vector(apic->regs + APIC_ISR);
+ ASSERT(result == -1 || result >= 16);
+
+ return result;
+}
+
+static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
+{
+ if (!__test_and_clear_bit(APIC_VECTOR_TO_BIT_NUMBER(vec),
+ apic_vector_to_isr(vec, apic)))
+ return;
+
+ /*
+ * We do get here with APIC virtualization enabled if the guest
+ * uses the Hyper-V APIC enlightenment. In this case we may need
+ * to trigger a new interrupt delivery by writing the SVI field;
+ * on the other hand isr_count and highest_isr_cache are unused
+ * and must be left alone.
+ */
+ if (unlikely(apic->apicv_active))
+ kvm_x86_call(hwapic_isr_update)(apic->vcpu, apic_find_highest_isr(apic));
+ else {
+ --apic->isr_count;
+ BUG_ON(apic->isr_count < 0);
+ apic->highest_isr_cache = -1;
+ }
+}
+
+void kvm_apic_update_hwapic_isr(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- int highest_irr;
+ if (WARN_ON_ONCE(!lapic_in_kernel(vcpu)) || !apic->apicv_active)
+ return;
+
+ kvm_x86_call(hwapic_isr_update)(vcpu, apic_find_highest_isr(apic));
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_update_hwapic_isr);
+
+int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu)
+{
/* This may race with the setting of irr in __apic_accept_irq(), and
* the value returned may be wrong, but kvm_vcpu_kick() in
* __apic_accept_irq() will cause an immediate vmexit and the value
* will be recalculated on the next vmentry.
*/
- if (!apic)
- return 0;
- highest_irr = apic_find_highest_irr(apic);
-
- return highest_irr;
+ return apic_find_highest_irr(vcpu->arch.apic);
}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_lapic_find_highest_irr);
static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
- int vector, int level, int trig_mode);
+ int vector, int level, int trig_mode,
+ struct dest_map *dest_map);
-int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq)
+int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
+ struct dest_map *dest_map)
{
struct kvm_lapic *apic = vcpu->arch.apic;
return __apic_accept_irq(apic, irq->delivery_mode, irq->vector,
- irq->level, irq->trig_mode);
+ irq->level, irq->trig_mode, dest_map);
}
-static inline int apic_find_highest_isr(struct kvm_lapic *apic)
+static int __pv_send_ipi(unsigned long *ipi_bitmap, struct kvm_apic_map *map,
+ struct kvm_lapic_irq *irq, u32 min)
{
- int result;
+ int i, count = 0;
+ struct kvm_vcpu *vcpu;
- result = find_highest_vector(apic->regs + APIC_ISR);
- ASSERT(result == -1 || result >= 16);
+ if (min > map->max_apic_id)
+ return 0;
- return result;
+ min = array_index_nospec(min, map->max_apic_id + 1);
+
+ for_each_set_bit(i, ipi_bitmap,
+ min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) {
+ if (map->phys_map[min + i]) {
+ vcpu = map->phys_map[min + i]->vcpu;
+ count += kvm_apic_set_irq(vcpu, irq, NULL);
+ }
+ }
+
+ return count;
}
-static void apic_update_ppr(struct kvm_lapic *apic)
+int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
+ unsigned long ipi_bitmap_high, u32 min,
+ unsigned long icr, int op_64_bit)
+{
+ struct kvm_apic_map *map;
+ struct kvm_lapic_irq irq = {0};
+ int cluster_size = op_64_bit ? 64 : 32;
+ int count;
+
+ if (icr & (APIC_DEST_MASK | APIC_SHORT_MASK))
+ return -KVM_EINVAL;
+
+ irq.vector = icr & APIC_VECTOR_MASK;
+ irq.delivery_mode = icr & APIC_MODE_MASK;
+ irq.level = (icr & APIC_INT_ASSERT) != 0;
+ irq.trig_mode = icr & APIC_INT_LEVELTRIG;
+
+ rcu_read_lock();
+ map = rcu_dereference(kvm->arch.apic_map);
+
+ count = -EOPNOTSUPP;
+ if (likely(map)) {
+ count = __pv_send_ipi(&ipi_bitmap_low, map, &irq, min);
+ min += cluster_size;
+ count += __pv_send_ipi(&ipi_bitmap_high, map, &irq, min);
+ }
+
+ rcu_read_unlock();
+ return count;
+}
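The two hypercall bitmaps address 2 x BITS_PER_LONG consecutive APIC IDs starting at min, with the high bitmap offset by the cluster size. A tiny helper makes the addressing explicit (illustrative, not a KVM function):

#include <stdint.h>

/* APIC ID targeted by bit 'bit' of the high IPI bitmap. */
static uint32_t pv_ipi_high_target(uint32_t min, int op_64_bit, int bit)
{
        int cluster_size = op_64_bit ? 64 : 32;

        return min + cluster_size + bit;
}
/* e.g. for a 64-bit guest, bit 3 of ipi_bitmap_high hits APIC ID min + 67 */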
+
+static int pv_eoi_put_user(struct kvm_vcpu *vcpu, u8 val)
{
- u32 tpr, isrv, ppr;
+
+ return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data, &val,
+ sizeof(val));
+}
+
+static int pv_eoi_get_user(struct kvm_vcpu *vcpu, u8 *val)
+{
+
+ return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.pv_eoi.data, val,
+ sizeof(*val));
+}
+
+static inline bool pv_eoi_enabled(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.pv_eoi.msr_val & KVM_MSR_ENABLED;
+}
+
+static void pv_eoi_set_pending(struct kvm_vcpu *vcpu)
+{
+ if (pv_eoi_put_user(vcpu, KVM_PV_EOI_ENABLED) < 0)
+ return;
+
+ __set_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
+}
+
+static bool pv_eoi_test_and_clr_pending(struct kvm_vcpu *vcpu)
+{
+ u8 val;
+
+ if (pv_eoi_get_user(vcpu, &val) < 0)
+ return false;
+
+ val &= KVM_PV_EOI_ENABLED;
+
+ if (val && pv_eoi_put_user(vcpu, KVM_PV_EOI_DISABLED) < 0)
+ return false;
+
+ /*
+ * Clear pending bit in any case: it will be set again on vmentry.
+ * While this might not be ideal from a performance point of view,
+ * this makes sure pv eoi is only enabled when we know it's safe.
+ */
+ __clear_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention);
+
+ return val;
+}
+
+static int apic_has_interrupt_for_ppr(struct kvm_lapic *apic, u32 ppr)
+{
+ int highest_irr;
+ if (kvm_x86_ops.sync_pir_to_irr)
+ highest_irr = kvm_x86_call(sync_pir_to_irr)(apic->vcpu);
+ else
+ highest_irr = apic_find_highest_irr(apic);
+ if (highest_irr == -1 || (highest_irr & 0xF0) <= ppr)
+ return -1;
+ return highest_irr;
+}
+
+static bool __apic_update_ppr(struct kvm_lapic *apic, u32 *new_ppr)
+{
+ u32 tpr, isrv, ppr, old_ppr;
int isr;
- tpr = apic_get_reg(apic, APIC_TASKPRI);
+ old_ppr = kvm_lapic_get_reg(apic, APIC_PROCPRI);
+ tpr = kvm_lapic_get_reg(apic, APIC_TASKPRI);
isr = apic_find_highest_isr(apic);
isrv = (isr != -1) ? isr : 0;
@@ -270,90 +934,421 @@ static void apic_update_ppr(struct kvm_lapic *apic)
else
ppr = isrv & 0xf0;
- apic_debug("vlapic %p, ppr 0x%x, isr 0x%x, isrv 0x%x",
- apic, ppr, isr, isrv);
+ *new_ppr = ppr;
+ if (old_ppr != ppr)
+ kvm_lapic_set_reg(apic, APIC_PROCPRI, ppr);
- apic_set_reg(apic, APIC_PROCPRI, ppr);
+ return ppr < old_ppr;
}
+static void apic_update_ppr(struct kvm_lapic *apic)
+{
+ u32 ppr;
+
+ if (__apic_update_ppr(apic, &ppr) &&
+ apic_has_interrupt_for_ppr(apic, ppr) != -1)
+ kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
+}
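The PPR arithmetic folded into the hunk above follows the SDM rule: PPR is TPR when TPR's priority class is at least the class of the highest in-service vector, otherwise it is the in-service class with a zero sub-class. A self-contained rendering of just that rule:

#include <stdint.h>

static uint32_t compute_ppr(uint32_t tpr, int highest_isr)
{
        uint32_t isrv = (highest_isr != -1) ? (uint32_t)highest_isr : 0;

        if ((tpr & 0xf0) >= (isrv & 0xf0))
                return tpr;             /* task priority dominates */

        return isrv & 0xf0;             /* in-service class, sub-class 0 */
}
/* compute_ppr(0x20, 0x35) == 0x30; compute_ppr(0x40, 0x35) == 0x40 */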
+
+void kvm_apic_update_ppr(struct kvm_vcpu *vcpu)
+{
+ apic_update_ppr(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_update_ppr);
+
static void apic_set_tpr(struct kvm_lapic *apic, u32 tpr)
{
- apic_set_reg(apic, APIC_TASKPRI, tpr);
+ kvm_lapic_set_reg(apic, APIC_TASKPRI, tpr);
apic_update_ppr(apic);
}
-int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest)
+static bool kvm_apic_broadcast(struct kvm_lapic *apic, u32 mda)
{
- return dest == 0xff || kvm_apic_id(apic) == dest;
+ return mda == (apic_x2apic_mode(apic) ?
+ X2APIC_BROADCAST : APIC_BROADCAST);
}
-int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda)
+static bool kvm_apic_match_physical_addr(struct kvm_lapic *apic, u32 mda)
+{
+ if (kvm_apic_broadcast(apic, mda))
+ return true;
+
+ /*
+ * Hotplug hack: Accept interrupts for vCPUs in xAPIC mode as if they
+ * were in x2APIC mode if the target APIC ID can't be encoded as an
+ * xAPIC ID. This allows unique addressing of hotplugged vCPUs (which
+ * start in xAPIC mode) with an APIC ID that is unaddressable in xAPIC
+ * mode. Match the x2APIC ID if and only if the target APIC ID can't
+ * be encoded in xAPIC to avoid spurious matches against a vCPU that
+ * changed its (addressable) xAPIC ID (which is writable).
+ */
+ if (apic_x2apic_mode(apic) || mda > 0xff)
+ return mda == kvm_x2apic_id(apic);
+
+ return mda == kvm_xapic_id(apic);
+}
+
+static bool kvm_apic_match_logical_addr(struct kvm_lapic *apic, u32 mda)
{
- int result = 0;
u32 logical_id;
- if (apic_x2apic_mode(apic)) {
- logical_id = apic_get_reg(apic, APIC_LDR);
- return logical_id & mda;
- }
+ if (kvm_apic_broadcast(apic, mda))
+ return true;
- logical_id = GET_APIC_LOGICAL_ID(apic_get_reg(apic, APIC_LDR));
+ logical_id = kvm_lapic_get_reg(apic, APIC_LDR);
- switch (apic_get_reg(apic, APIC_DFR)) {
+ if (apic_x2apic_mode(apic))
+ return ((logical_id >> 16) == (mda >> 16))
+ && (logical_id & mda & 0xffff) != 0;
+
+ logical_id = GET_APIC_LOGICAL_ID(logical_id);
+
+ switch (kvm_lapic_get_reg(apic, APIC_DFR)) {
case APIC_DFR_FLAT:
- if (logical_id & mda)
- result = 1;
- break;
+ return (logical_id & mda) != 0;
case APIC_DFR_CLUSTER:
- if (((logical_id >> 4) == (mda >> 0x4))
- && (logical_id & mda & 0xf))
- result = 1;
- break;
+ return ((logical_id >> 4) == (mda >> 4))
+ && (logical_id & mda & 0xf) != 0;
default:
- printk(KERN_WARNING "Bad DFR vcpu %d: %08x\n",
- apic->vcpu->vcpu_id, apic_get_reg(apic, APIC_DFR));
- break;
+ return false;
}
+}
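For the two xAPIC DFR modes in the switch above, matching reduces to a byte AND (flat) or a cluster compare plus nibble AND (cluster). The following free-standing helpers show both on 8-bit logical IDs:

#include <stdbool.h>
#include <stdint.h>

/* Flat model: each of the 8 LDR bits names one CPU group. */
static bool match_flat(uint8_t ldr, uint8_t mda)
{
        return (ldr & mda) != 0;
}

/* Cluster model: the high nibble selects the cluster, the low
 * nibble is a 4-way member bitmap within it. */
static bool match_cluster(uint8_t ldr, uint8_t mda)
{
        return (ldr >> 4) == (mda >> 4) && (ldr & mda & 0xf) != 0;
}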
- return result;
+/* The KVM local APIC implementation has two quirks:
+ *
+ * - Real hardware delivers interrupts destined to x2APIC ID > 0xff to LAPICs
+ * in xAPIC mode if the "destination & 0xff" matches its xAPIC ID.
+ * KVM doesn't do that aliasing.
+ *
+ * - in-kernel IOAPIC messages have to be delivered directly to
+ * x2APIC, because the kernel does not support interrupt remapping.
+ * In order to support broadcast without interrupt remapping, x2APIC
+ * rewrites the destination of non-IPI messages from APIC_BROADCAST
+ * to X2APIC_BROADCAST.
+ *
+ * The broadcast quirk can be disabled with KVM_CAP_X2APIC_API. This is
+ * important when userspace wants to use x2APIC-format MSIs, because
+ * APIC_BROADCAST (0xff) is a legal route for "cluster 0, CPUs 0-7".
+ */
+static u32 kvm_apic_mda(struct kvm_vcpu *vcpu, unsigned int dest_id,
+ struct kvm_lapic *source, struct kvm_lapic *target)
+{
+ bool ipi = source != NULL;
+
+ if (!vcpu->kvm->arch.x2apic_broadcast_quirk_disabled &&
+ !ipi && dest_id == APIC_BROADCAST && apic_x2apic_mode(target))
+ return X2APIC_BROADCAST;
+
+ return dest_id;
}
-int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
- int short_hand, int dest, int dest_mode)
+bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
+ int shorthand, unsigned int dest, int dest_mode)
{
- int result = 0;
struct kvm_lapic *target = vcpu->arch.apic;
+ u32 mda = kvm_apic_mda(vcpu, dest, source, target);
- apic_debug("target %p, source %p, dest 0x%x, "
- "dest_mode 0x%x, short_hand 0x%x\n",
- target, source, dest, dest_mode, short_hand);
-
- ASSERT(!target);
- switch (short_hand) {
+ ASSERT(target);
+ switch (shorthand) {
case APIC_DEST_NOSHORT:
- if (dest_mode == 0)
- /* Physical mode. */
- result = kvm_apic_match_physical_addr(target, dest);
+ if (dest_mode == APIC_DEST_PHYSICAL)
+ return kvm_apic_match_physical_addr(target, mda);
else
- /* Logical mode. */
- result = kvm_apic_match_logical_addr(target, dest);
- break;
+ return kvm_apic_match_logical_addr(target, mda);
case APIC_DEST_SELF:
- result = (target == source);
- break;
+ return target == source;
case APIC_DEST_ALLINC:
- result = 1;
- break;
+ return true;
case APIC_DEST_ALLBUT:
- result = (target != source);
- break;
+ return target != source;
default:
- printk(KERN_WARNING "Bad dest shorthand value %x\n",
- short_hand);
- break;
+ return false;
}
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_match_dest);
- return result;
+static int kvm_vector_to_index(u32 vector, u32 dest_vcpus,
+ const unsigned long *bitmap, u32 bitmap_size)
+{
+ int idx = find_nth_bit(bitmap, bitmap_size, vector % dest_vcpus);
+
+ BUG_ON(idx >= bitmap_size);
+ return idx;
+}
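kvm_vector_to_index() realizes the vector-hashing rule quoted later in this file ("guest vector % max number of destination vCPUs" picks the nth set destination bit). A loop-based equivalent of find_nth_bit over a 16-bit map, as a sketch:

#include <stdint.h>

static int pick_hashed_dest(uint16_t bitmap, uint32_t vector, int dest_vcpus)
{
        int n = vector % dest_vcpus;    /* which set bit we want */
        int i;

        for (i = 0; i < 16; i++) {
                if ((bitmap & (1u << i)) && n-- == 0)
                        return i;
        }
        return -1;      /* unreachable if bitmap has dest_vcpus bits set */
}
/* pick_hashed_dest(0xa, 33, 2) == 3: 33 % 2 == 1 -> the second set bit */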
+
+static void kvm_apic_disabled_lapic_found(struct kvm *kvm)
+{
+ if (!kvm->arch.disabled_lapic_found) {
+ kvm->arch.disabled_lapic_found = true;
+ pr_info("Disabled LAPIC found during irq injection\n");
+ }
+}
+
+static bool kvm_apic_is_broadcast_dest(struct kvm *kvm, struct kvm_lapic **src,
+ struct kvm_lapic_irq *irq, struct kvm_apic_map *map)
+{
+ if (kvm->arch.x2apic_broadcast_quirk_disabled) {
+ if ((irq->dest_id == APIC_BROADCAST &&
+ map->logical_mode != KVM_APIC_MODE_X2APIC))
+ return true;
+ if (irq->dest_id == X2APIC_BROADCAST)
+ return true;
+ } else {
+ bool x2apic_ipi = src && *src && apic_x2apic_mode(*src);
+ if (irq->dest_id == (x2apic_ipi ?
+ X2APIC_BROADCAST : APIC_BROADCAST))
+ return true;
+ }
+
+ return false;
+}
+
+static bool kvm_lowest_prio_delivery(struct kvm_lapic_irq *irq)
+{
+ return (irq->delivery_mode == APIC_DM_LOWEST || irq->msi_redir_hint);
+}
+
+static int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
+{
+ return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
+}
+
+/* Return true if the interrupt can be handled by using *bitmap as an index
+ * mask for valid destinations in the *dst array.
+ * Return false if kvm_apic_map_get_dest_lapic did nothing useful.
+ * Note: we may have zero kvm_lapic destinations when we return true, which
+ * means that the interrupt should be dropped. In this case, *bitmap would be
+ * zero and *dst undefined.
+ */
+static inline bool kvm_apic_map_get_dest_lapic(struct kvm *kvm,
+ struct kvm_lapic **src, struct kvm_lapic_irq *irq,
+ struct kvm_apic_map *map, struct kvm_lapic ***dst,
+ unsigned long *bitmap)
+{
+ int i, lowest;
+
+ if (irq->shorthand == APIC_DEST_SELF && src) {
+ *dst = src;
+ *bitmap = 1;
+ return true;
+ } else if (irq->shorthand)
+ return false;
+
+ if (!map || kvm_apic_is_broadcast_dest(kvm, src, irq, map))
+ return false;
+
+ if (irq->dest_mode == APIC_DEST_PHYSICAL) {
+ if (irq->dest_id > map->max_apic_id) {
+ *bitmap = 0;
+ } else {
+ u32 dest_id = array_index_nospec(irq->dest_id, map->max_apic_id + 1);
+ *dst = &map->phys_map[dest_id];
+ *bitmap = 1;
+ }
+ return true;
+ }
+
+ *bitmap = 0;
+ if (!kvm_apic_map_get_logical_dest(map, irq->dest_id, dst,
+ (u16 *)bitmap))
+ return false;
+
+ if (!kvm_lowest_prio_delivery(irq))
+ return true;
+
+ if (!vector_hashing_enabled) {
+ lowest = -1;
+ for_each_set_bit(i, bitmap, 16) {
+ if (!(*dst)[i])
+ continue;
+ if (lowest < 0)
+ lowest = i;
+ else if (kvm_apic_compare_prio((*dst)[i]->vcpu,
+ (*dst)[lowest]->vcpu) < 0)
+ lowest = i;
+ }
+ } else {
+ if (!*bitmap)
+ return true;
+
+ lowest = kvm_vector_to_index(irq->vector, hweight16(*bitmap),
+ bitmap, 16);
+
+ if (!(*dst)[lowest]) {
+ kvm_apic_disabled_lapic_found(kvm);
+ *bitmap = 0;
+ return true;
+ }
+ }
+
+ *bitmap = (lowest >= 0) ? 1 << lowest : 0;
+
+ return true;
+}
+
+bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
+ struct kvm_lapic_irq *irq, int *r, struct dest_map *dest_map)
+{
+ struct kvm_apic_map *map;
+ unsigned long bitmap;
+ struct kvm_lapic **dst = NULL;
+ int i;
+ bool ret;
+
+ *r = -1;
+
+ if (irq->shorthand == APIC_DEST_SELF) {
+ if (KVM_BUG_ON(!src, kvm)) {
+ *r = 0;
+ return true;
+ }
+ *r = kvm_apic_set_irq(src->vcpu, irq, dest_map);
+ return true;
+ }
+
+ rcu_read_lock();
+ map = rcu_dereference(kvm->arch.apic_map);
+
+ ret = kvm_apic_map_get_dest_lapic(kvm, &src, irq, map, &dst, &bitmap);
+ if (ret) {
+ *r = 0;
+ for_each_set_bit(i, &bitmap, 16) {
+ if (!dst[i])
+ continue;
+ *r += kvm_apic_set_irq(dst[i]->vcpu, irq, dest_map);
+ }
+ }
+
+ rcu_read_unlock();
+ return ret;
+}
+
+/*
+ * This routine tries to handle interrupts in posted mode; here is how
+ * it handles the different cases:
+ * - For single-destination interrupts, handle it in posted mode
+ * - Else if vector hashing is enabled and it is a lowest-priority
+ * interrupt, handle it in posted mode and use the following mechanism
+ * to find the destination vCPU.
+ * 1. For lowest-priority interrupts, store all the possible
+ * destination vCPUs in an array.
+ * 2. Use "guest vector % max number of destination vCPUs" to find
+ * the right destination vCPU in the array for the lowest-priority
+ * interrupt.
+ * - Otherwise, use remapped mode to inject the interrupt.
+ */
+static bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm,
+ struct kvm_lapic_irq *irq,
+ struct kvm_vcpu **dest_vcpu)
+{
+ struct kvm_apic_map *map;
+ unsigned long bitmap;
+ struct kvm_lapic **dst = NULL;
+ bool ret = false;
+
+ if (irq->shorthand)
+ return false;
+
+ rcu_read_lock();
+ map = rcu_dereference(kvm->arch.apic_map);
+
+ if (kvm_apic_map_get_dest_lapic(kvm, NULL, irq, map, &dst, &bitmap) &&
+ hweight16(bitmap) == 1) {
+ unsigned long i = find_first_bit(&bitmap, 16);
+
+ if (dst[i]) {
+ *dest_vcpu = dst[i]->vcpu;
+ ret = true;
+ }
+ }
+
+ rcu_read_unlock();
+ return ret;
+}
+
+bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
+ struct kvm_vcpu **dest_vcpu)
+{
+ int r = 0;
+ unsigned long i;
+ struct kvm_vcpu *vcpu;
+
+ if (kvm_intr_is_single_vcpu_fast(kvm, irq, dest_vcpu))
+ return true;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (!kvm_apic_present(vcpu))
+ continue;
+
+ if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
+ irq->dest_id, irq->dest_mode))
+ continue;
+
+ if (++r == 2)
+ return false;
+
+ *dest_vcpu = vcpu;
+ }
+
+ return r == 1;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_intr_is_single_vcpu);
+
+int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
+ struct kvm_lapic_irq *irq, struct dest_map *dest_map)
+{
+ int r = -1;
+ struct kvm_vcpu *vcpu, *lowest = NULL;
+ unsigned long i, dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
+ unsigned int dest_vcpus = 0;
+
+ if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r, dest_map))
+ return r;
+
+ if (irq->dest_mode == APIC_DEST_PHYSICAL &&
+ irq->dest_id == 0xff && kvm_lowest_prio_delivery(irq)) {
+ pr_info("apic: phys broadcast and lowest prio\n");
+ irq->delivery_mode = APIC_DM_FIXED;
+ }
+
+ memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (!kvm_apic_present(vcpu))
+ continue;
+
+ if (!kvm_apic_match_dest(vcpu, src, irq->shorthand,
+ irq->dest_id, irq->dest_mode))
+ continue;
+
+ if (!kvm_lowest_prio_delivery(irq)) {
+ if (r < 0)
+ r = 0;
+ r += kvm_apic_set_irq(vcpu, irq, dest_map);
+ } else if (kvm_apic_sw_enabled(vcpu->arch.apic)) {
+ if (!vector_hashing_enabled) {
+ if (!lowest)
+ lowest = vcpu;
+ else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
+ lowest = vcpu;
+ } else {
+ __set_bit(i, dest_vcpu_bitmap);
+ dest_vcpus++;
+ }
+ }
+ }
+
+ if (dest_vcpus != 0) {
+ int idx = kvm_vector_to_index(irq->vector, dest_vcpus,
+ dest_vcpu_bitmap, KVM_MAX_VCPUS);
+
+ lowest = kvm_get_vcpu(kvm, idx);
+ }
+
+ if (lowest)
+ r = kvm_apic_set_irq(lowest, irq, dest_map);
+
+ return r;
}
/*
@@ -361,44 +1356,56 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
* Return 1 if successfully added and 0 if discarded.
*/
static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
- int vector, int level, int trig_mode)
+ int vector, int level, int trig_mode,
+ struct dest_map *dest_map)
{
int result = 0;
struct kvm_vcpu *vcpu = apic->vcpu;
+ trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
+ trig_mode, vector);
switch (delivery_mode) {
case APIC_DM_LOWEST:
vcpu->arch.apic_arb_prio++;
+ fallthrough;
case APIC_DM_FIXED:
+ if (unlikely(trig_mode && !level))
+ break;
+
/* FIXME add logic for vcpu on reset */
if (unlikely(!apic_enabled(apic)))
break;
- if (trig_mode) {
- apic_debug("level trig mode for vector %d", vector);
- apic_set_vector(vector, apic->regs + APIC_TMR);
- } else
- apic_clear_vector(vector, apic->regs + APIC_TMR);
+ result = 1;
+
+ if (dest_map) {
+ __set_bit(vcpu->vcpu_id, dest_map->map);
+ dest_map->vectors[vcpu->vcpu_id] = vector;
+ }
- result = !apic_test_and_set_irr(vector, apic);
- trace_kvm_apic_accept_irq(vcpu->vcpu_id, delivery_mode,
- trig_mode, vector, !result);
- if (!result) {
+ if (apic_test_vector(vector, apic->regs + APIC_TMR) != !!trig_mode) {
if (trig_mode)
- apic_debug("level trig mode repeatedly for "
- "vector %d", vector);
- break;
+ apic_set_vector(vector, apic->regs + APIC_TMR);
+ else
+ apic_clear_vector(vector, apic->regs + APIC_TMR);
}
- kvm_vcpu_kick(vcpu);
+ kvm_x86_call(deliver_interrupt)(apic, delivery_mode,
+ trig_mode, vector);
break;
case APIC_DM_REMRD:
- printk(KERN_DEBUG "Ignoring delivery mode 3\n");
+ result = 1;
+ vcpu->arch.pv.pv_unhalted = 1;
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+ kvm_vcpu_kick(vcpu);
break;
case APIC_DM_SMI:
- printk(KERN_DEBUG "Ignoring guest SMI\n");
+ if (!kvm_inject_smi(vcpu)) {
+ kvm_vcpu_kick(vcpu);
+ result = 1;
+ }
break;
case APIC_DM_NMI:
@@ -408,29 +1415,23 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
break;
case APIC_DM_INIT:
- if (level) {
+ if (!trig_mode || level) {
result = 1;
- if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE)
- printk(KERN_DEBUG
- "INIT on a runnable vcpu %d\n",
- vcpu->vcpu_id);
- vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+ /* assumes that there are only KVM_APIC_INIT/SIPI */
+ apic->pending_events = (1UL << KVM_APIC_INIT);
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_vcpu_kick(vcpu);
- } else {
- apic_debug("Ignoring de-assert INIT to vcpu %d\n",
- vcpu->vcpu_id);
}
break;
case APIC_DM_STARTUP:
- apic_debug("SIPI to vcpu %d vector 0x%02x\n",
- vcpu->vcpu_id, vector);
- if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
- result = 1;
- vcpu->arch.sipi_vector = vector;
- vcpu->arch.mp_state = KVM_MP_STATE_SIPI_RECEIVED;
- kvm_vcpu_kick(vcpu);
- }
+ result = 1;
+ apic->sipi_vector = vector;
+ /* make sure sipi_vector is visible for the receiver */
+ smp_wmb();
+ set_bit(KVM_APIC_SIPI, &apic->pending_events);
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+ kvm_vcpu_kick(vcpu);
break;
case APIC_DM_EXTINT:
@@ -449,83 +1450,178 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
return result;
}
-int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2)
+/*
+ * This routine identifies the destination vcpus mask meant to receive the
+ * IOAPIC interrupts. It either uses kvm_apic_map_get_dest_lapic() to find
+ * out the destination vcpus array and set the bitmap or it traverses to
+ * each available vcpu to identify the same.
+ */
+void kvm_bitmap_or_dest_vcpus(struct kvm *kvm, struct kvm_lapic_irq *irq,
+ unsigned long *vcpu_bitmap)
+{
+ struct kvm_lapic **dest_vcpu = NULL;
+ struct kvm_lapic *src = NULL;
+ struct kvm_apic_map *map;
+ struct kvm_vcpu *vcpu;
+ unsigned long bitmap, i;
+ int vcpu_idx;
+ bool ret;
+
+ rcu_read_lock();
+ map = rcu_dereference(kvm->arch.apic_map);
+
+ ret = kvm_apic_map_get_dest_lapic(kvm, &src, irq, map, &dest_vcpu,
+ &bitmap);
+ if (ret) {
+ for_each_set_bit(i, &bitmap, 16) {
+ if (!dest_vcpu[i])
+ continue;
+ vcpu_idx = dest_vcpu[i]->vcpu->vcpu_idx;
+ __set_bit(vcpu_idx, vcpu_bitmap);
+ }
+ } else {
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (!kvm_apic_present(vcpu))
+ continue;
+ if (!kvm_apic_match_dest(vcpu, NULL,
+ irq->shorthand,
+ irq->dest_id,
+ irq->dest_mode))
+ continue;
+ __set_bit(i, vcpu_bitmap);
+ }
+ }
+ rcu_read_unlock();
+}
+
+static bool kvm_ioapic_handles_vector(struct kvm_lapic *apic, int vector)
{
- return vcpu1->arch.apic_arb_prio - vcpu2->arch.apic_arb_prio;
+ return test_bit(vector, apic->vcpu->arch.ioapic_handled_vectors);
+}
+
+static void kvm_ioapic_send_eoi(struct kvm_lapic *apic, int vector)
+{
+ int __maybe_unused trigger_mode;
+
+ /* EOI the ioapic only if the ioapic actually handles the vector. */
+ if (!kvm_ioapic_handles_vector(apic, vector))
+ return;
+
+ /*
+ * If the intercepted EOI is for an IRQ that was pending from previous
+ * routing, then re-scan the I/O APIC routes as EOIs for the IRQ likely
+ * no longer need to be intercepted.
+ */
+ if (apic->vcpu->arch.highest_stale_pending_ioapic_eoi == vector)
+ kvm_make_request(KVM_REQ_SCAN_IOAPIC, apic->vcpu);
+
+ /* Request a KVM exit to inform the userspace IOAPIC. */
+ if (irqchip_split(apic->vcpu->kvm)) {
+ apic->vcpu->arch.pending_ioapic_eoi = vector;
+ kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT, apic->vcpu);
+ return;
+ }
+
+#ifdef CONFIG_KVM_IOAPIC
+ if (apic_test_vector(vector, apic->regs + APIC_TMR))
+ trigger_mode = IOAPIC_LEVEL_TRIG;
+ else
+ trigger_mode = IOAPIC_EDGE_TRIG;
+
+ kvm_ioapic_update_eoi(apic->vcpu, vector, trigger_mode);
+#endif
}
-static void apic_set_eoi(struct kvm_lapic *apic)
+static int apic_set_eoi(struct kvm_lapic *apic)
{
int vector = apic_find_highest_isr(apic);
- int trigger_mode;
+
+ trace_kvm_eoi(apic, vector);
+
/*
* Not every EOI write has a corresponding ISR bit; one example
* is when the kernel checks the timer during setup_IO_APIC.
*/
if (vector == -1)
- return;
+ return vector;
- apic_clear_vector(vector, apic->regs + APIC_ISR);
+ apic_clear_isr(vector, apic);
apic_update_ppr(apic);
- if (apic_test_and_clear_vector(vector, apic->regs + APIC_TMR))
- trigger_mode = IOAPIC_LEVEL_TRIG;
- else
- trigger_mode = IOAPIC_EDGE_TRIG;
- if (!(apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_DIRECTED_EOI))
- kvm_ioapic_update_eoi(apic->vcpu->kvm, vector, trigger_mode);
+ if (kvm_hv_synic_has_vector(apic->vcpu, vector))
+ kvm_hv_synic_send_eoi(apic->vcpu, vector);
+
+ kvm_ioapic_send_eoi(apic, vector);
+ kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
+ return vector;
}
-static void apic_send_ipi(struct kvm_lapic *apic)
+/*
+ * This interface assumes a trap-like exit, which has already finished
+ * the desired side effects, including the vISR and vPPR updates.
+ */
+void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector)
{
- u32 icr_low = apic_get_reg(apic, APIC_ICR);
- u32 icr_high = apic_get_reg(apic, APIC_ICR2);
- struct kvm_lapic_irq irq;
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+ trace_kvm_eoi(apic, vector);
- irq.vector = icr_low & APIC_VECTOR_MASK;
- irq.delivery_mode = icr_low & APIC_MODE_MASK;
- irq.dest_mode = icr_low & APIC_DEST_MASK;
- irq.level = icr_low & APIC_INT_ASSERT;
- irq.trig_mode = icr_low & APIC_INT_LEVELTRIG;
- irq.shorthand = icr_low & APIC_SHORT_MASK;
+ kvm_ioapic_send_eoi(apic, vector);
+ kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_set_eoi_accelerated);
+
+static void kvm_icr_to_lapic_irq(struct kvm_lapic *apic, u32 icr_low,
+ u32 icr_high, struct kvm_lapic_irq *irq)
+{
+ /* KVM has no delay and should always clear the BUSY/PENDING flag. */
+ WARN_ON_ONCE(icr_low & APIC_ICR_BUSY);
+
+ irq->vector = icr_low & APIC_VECTOR_MASK;
+ irq->delivery_mode = icr_low & APIC_MODE_MASK;
+ irq->dest_mode = icr_low & APIC_DEST_MASK;
+ irq->level = (icr_low & APIC_INT_ASSERT) != 0;
+ irq->trig_mode = icr_low & APIC_INT_LEVELTRIG;
+ irq->shorthand = icr_low & APIC_SHORT_MASK;
+ irq->msi_redir_hint = false;
if (apic_x2apic_mode(apic))
- irq.dest_id = icr_high;
+ irq->dest_id = icr_high;
else
- irq.dest_id = GET_APIC_DEST_FIELD(icr_high);
+ irq->dest_id = GET_XAPIC_DEST_FIELD(icr_high);
+}
- trace_kvm_apic_ipi(icr_low, irq.dest_id);
+void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
+{
+ struct kvm_lapic_irq irq;
+
+ kvm_icr_to_lapic_irq(apic, icr_low, icr_high, &irq);
- apic_debug("icr_high 0x%x, icr_low 0x%x, "
- "short_hand 0x%x, dest 0x%x, trig_mode 0x%x, level 0x%x, "
- "dest_mode 0x%x, delivery_mode 0x%x, vector 0x%x\n",
- icr_high, icr_low, irq.shorthand, irq.dest_id,
- irq.trig_mode, irq.level, irq.dest_mode, irq.delivery_mode,
- irq.vector);
+ trace_kvm_apic_ipi(icr_low, irq.dest_id);
- kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq);
+ kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq, NULL);
}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_send_ipi);
static u32 apic_get_tmcct(struct kvm_lapic *apic)
{
- ktime_t remaining;
+ ktime_t remaining, now;
s64 ns;
- u32 tmcct;
ASSERT(apic != NULL);
/* if initial count is 0, current count should also be 0 */
- if (apic_get_reg(apic, APIC_TMICT) == 0)
+ if (kvm_lapic_get_reg(apic, APIC_TMICT) == 0 ||
+ apic->lapic_timer.period == 0)
return 0;
- remaining = hrtimer_get_remaining(&apic->lapic_timer.timer);
+ now = ktime_get();
+ remaining = ktime_sub(apic->lapic_timer.target_expiration, now);
if (ktime_to_ns(remaining) < 0)
- remaining = ktime_set(0, 0);
+ remaining = 0;
ns = mod_64(ktime_to_ns(remaining), apic->lapic_timer.period);
- tmcct = div64_u64(ns,
- (APIC_BUS_CYCLE_NS * apic->divide_count));
-
- return tmcct;
+ return div64_u64(ns, (apic->vcpu->kvm->arch.apic_bus_cycle_ns *
+ apic->divide_count));
}
static void __report_tpr_access(struct kvm_lapic *apic, bool write)
@@ -533,7 +1629,7 @@ static void __report_tpr_access(struct kvm_lapic *apic, bool write)
struct kvm_vcpu *vcpu = apic->vcpu;
struct kvm_run *run = vcpu->run;
- set_bit(KVM_REQ_REPORT_TPR_ACCESS, &vcpu->requests);
+ kvm_make_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu);
run->tpr_access.rip = kvm_rip_read(vcpu);
run->tpr_access.is_write = write;
}
@@ -552,27 +1648,24 @@ static u32 __apic_read(struct kvm_lapic *apic, unsigned int offset)
return 0;
switch (offset) {
- case APIC_ID:
- if (apic_x2apic_mode(apic))
- val = kvm_apic_id(apic);
- else
- val = kvm_apic_id(apic) << 24;
- break;
case APIC_ARBPRI:
- printk(KERN_WARNING "Access APIC ARBPRI register "
- "which is for P6\n");
break;
case APIC_TMCCT: /* Timer CCR */
+ if (apic_lvtt_tscdeadline(apic))
+ return 0;
+
val = apic_get_tmcct(apic);
break;
-
+ case APIC_PROCPRI:
+ apic_update_ppr(apic);
+ val = kvm_lapic_get_reg(apic, offset);
+ break;
case APIC_TASKPRI:
report_tpr_access(apic, false);
- /* fall thru */
+ fallthrough;
default:
- apic_update_ppr(apic);
- val = apic_get_reg(apic, offset);
+ val = kvm_lapic_get_reg(apic, offset);
break;
}
@@ -584,25 +1677,66 @@ static inline struct kvm_lapic *to_lapic(struct kvm_io_device *dev)
return container_of(dev, struct kvm_lapic, dev);
}
-static int apic_reg_read(struct kvm_lapic *apic, u32 offset, int len,
- void *data)
+#define APIC_REG_MASK(reg) (1ull << ((reg) >> 4))
+#define APIC_REGS_MASK(first, count) \
+ (APIC_REG_MASK(first) * ((1ull << (count)) - 1))
+
+u64 kvm_lapic_readable_reg_mask(struct kvm_lapic *apic)
+{
+ /* Leave bits '0' for reserved and write-only registers. */
+ u64 valid_reg_mask =
+ APIC_REG_MASK(APIC_ID) |
+ APIC_REG_MASK(APIC_LVR) |
+ APIC_REG_MASK(APIC_TASKPRI) |
+ APIC_REG_MASK(APIC_PROCPRI) |
+ APIC_REG_MASK(APIC_LDR) |
+ APIC_REG_MASK(APIC_SPIV) |
+ APIC_REGS_MASK(APIC_ISR, APIC_ISR_NR) |
+ APIC_REGS_MASK(APIC_TMR, APIC_ISR_NR) |
+ APIC_REGS_MASK(APIC_IRR, APIC_ISR_NR) |
+ APIC_REG_MASK(APIC_ESR) |
+ APIC_REG_MASK(APIC_ICR) |
+ APIC_REG_MASK(APIC_LVTT) |
+ APIC_REG_MASK(APIC_LVTTHMR) |
+ APIC_REG_MASK(APIC_LVTPC) |
+ APIC_REG_MASK(APIC_LVT0) |
+ APIC_REG_MASK(APIC_LVT1) |
+ APIC_REG_MASK(APIC_LVTERR) |
+ APIC_REG_MASK(APIC_TMICT) |
+ APIC_REG_MASK(APIC_TMCCT) |
+ APIC_REG_MASK(APIC_TDCR);
+
+ if (kvm_lapic_lvt_supported(apic, LVT_CMCI))
+ valid_reg_mask |= APIC_REG_MASK(APIC_LVTCMCI);
+
+ /* ARBPRI, DFR, and ICR2 are not valid in x2APIC mode. */
+ if (!apic_x2apic_mode(apic))
+ valid_reg_mask |= APIC_REG_MASK(APIC_ARBPRI) |
+ APIC_REG_MASK(APIC_DFR) |
+ APIC_REG_MASK(APIC_ICR2);
+
+ return valid_reg_mask;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_lapic_readable_reg_mask);
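Since every APIC register sits on a 16-byte boundary, APIC_REG_MASK maps offset >> 4 to a bit position, and APIC_REGS_MASK turns a run of registers into a run of bits. Two compile-time checks using SDM offsets (0x20 for APIC_ID, 0x100 for the first ISR register) illustrate the math:

#define REG_MASK(reg)           (1ull << ((reg) >> 4))
#define REGS_MASK(first, count) (REG_MASK(first) * ((1ull << (count)) - 1))

_Static_assert(REG_MASK(0x20) == (1ull << 2), "APIC_ID -> bit 2");
_Static_assert(REGS_MASK(0x100, 8) == (0xffull << 16),
               "8 ISR registers -> bits 16..23");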
+
+static int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 offset, int len,
+ void *data)
{
unsigned char alignment = offset & 0xf;
u32 result;
- /* this bitmask has a bit cleared for each reserver register */
- static const u64 rmask = 0x43ff01ffffffe70cULL;
- if ((alignment + len) > 4) {
- apic_debug("KVM_APIC_READ: alignment error %x %d\n",
- offset, len);
+ /*
+ * WARN if KVM reads ICR in x2APIC mode, as it's an 8-byte register in
+ * x2APIC and needs to be manually handled by the caller.
+ */
+ WARN_ON_ONCE(apic_x2apic_mode(apic) && offset == APIC_ICR);
+
+ if (alignment + len > 4)
return 1;
- }
- if (offset > 0x3f0 || !(rmask & (1ULL << (offset >> 4)))) {
- apic_debug("KVM_APIC_READ: read reserved register %x\n",
- offset);
+ if (offset > 0x3f0 ||
+ !(kvm_lapic_readable_reg_mask(apic) & APIC_REG_MASK(offset)))
return 1;
- }
result = __apic_read(apic, offset & ~0xf);
@@ -624,12 +1758,11 @@ static int apic_reg_read(struct kvm_lapic *apic, u32 offset, int len,
static int apic_mmio_in_range(struct kvm_lapic *apic, gpa_t addr)
{
- return apic_hw_enabled(apic) &&
- addr >= apic->base_address &&
- addr < apic->base_address + LAPIC_MMIO_LENGTH;
+ return addr >= apic->base_address &&
+ addr < apic->base_address + LAPIC_MMIO_LENGTH;
}
-static int apic_mmio_read(struct kvm_io_device *this,
+static int apic_mmio_read(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
gpa_t address, int len, void *data)
{
struct kvm_lapic *apic = to_lapic(this);
@@ -638,7 +1771,16 @@ static int apic_mmio_read(struct kvm_io_device *this,
if (!apic_mmio_in_range(apic, address))
return -EOPNOTSUPP;
- apic_reg_read(apic, offset, len, data);
+ if (!kvm_apic_hw_enabled(apic) || apic_x2apic_mode(apic)) {
+ if (!kvm_check_has_quirk(vcpu->kvm,
+ KVM_X86_QUIRK_LAPIC_MMIO_HOLE))
+ return -EOPNOTSUPP;
+
+ memset(data, 0xff, len);
+ return 0;
+ }
+
+ kvm_lapic_reg_read(apic, offset, len, data);
return 0;
}
@@ -647,65 +1789,573 @@ static void update_divide_count(struct kvm_lapic *apic)
{
u32 tmp1, tmp2, tdcr;
- tdcr = apic_get_reg(apic, APIC_TDCR);
+ tdcr = kvm_lapic_get_reg(apic, APIC_TDCR);
tmp1 = tdcr & 0xf;
tmp2 = ((tmp1 & 0x3) | ((tmp1 & 0x8) >> 1)) + 1;
apic->divide_count = 0x1 << (tmp2 & 0x7);
-
- apic_debug("timer divide count is 0x%x\n",
- apic->divide_count);
}
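The TDCR decode above packs the divide value into bits 0:1 and 3; running the same bit twiddling outside the kernel reproduces the full SDM table, including the 0b1011 divide-by-1 wrap:

#include <stdint.h>
#include <stdio.h>

static uint32_t divide_count(uint32_t tdcr)
{
        uint32_t tmp1 = tdcr & 0xf;
        uint32_t tmp2 = ((tmp1 & 0x3) | ((tmp1 & 0x8) >> 1)) + 1;

        return 1u << (tmp2 & 0x7);
}

int main(void)
{
        /* 0x0 -> 2, 0x1 -> 4, 0x2 -> 8, 0x3 -> 16,
         * 0x8 -> 32, 0x9 -> 64, 0xa -> 128, 0xb -> 1 */
        printf("%u %u %u\n", divide_count(0x0), divide_count(0x3),
               divide_count(0xb));
        return 0;
}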
-static void start_apic_timer(struct kvm_lapic *apic)
+static void limit_periodic_timer_frequency(struct kvm_lapic *apic)
{
- ktime_t now = apic->lapic_timer.timer.base->get_time();
+ /*
+ * Do not allow the guest to program periodic timers with small
+ * interval, since the hrtimers are not throttled by the host
+ * scheduler.
+ */
+ if (apic_lvtt_period(apic) && apic->lapic_timer.period) {
+ s64 min_period = min_timer_period_us * 1000LL;
+
+ if (apic->lapic_timer.period < min_period) {
+ pr_info_once(
+ "vcpu %i: requested %lld ns "
+ "lapic timer period limited to %lld ns\n",
+ apic->vcpu->vcpu_id,
+ apic->lapic_timer.period, min_period);
+ apic->lapic_timer.period = min_period;
+ }
+ }
+}
- apic->lapic_timer.period = (u64)apic_get_reg(apic, APIC_TMICT) *
- APIC_BUS_CYCLE_NS * apic->divide_count;
+static void cancel_hv_timer(struct kvm_lapic *apic);
+
+static void cancel_apic_timer(struct kvm_lapic *apic)
+{
+ hrtimer_cancel(&apic->lapic_timer.timer);
+ preempt_disable();
+ if (apic->lapic_timer.hv_timer_in_use)
+ cancel_hv_timer(apic);
+ preempt_enable();
atomic_set(&apic->lapic_timer.pending, 0);
+}
- if (!apic->lapic_timer.period)
+static void apic_update_lvtt(struct kvm_lapic *apic)
+{
+ u32 timer_mode = kvm_lapic_get_reg(apic, APIC_LVTT) &
+ apic->lapic_timer.timer_mode_mask;
+
+ if (apic->lapic_timer.timer_mode != timer_mode) {
+ if (apic_lvtt_tscdeadline(apic) != (timer_mode ==
+ APIC_LVT_TIMER_TSCDEADLINE)) {
+ cancel_apic_timer(apic);
+ kvm_lapic_set_reg(apic, APIC_TMICT, 0);
+ apic->lapic_timer.period = 0;
+ apic->lapic_timer.tscdeadline = 0;
+ }
+ apic->lapic_timer.timer_mode = timer_mode;
+ limit_periodic_timer_frequency(apic);
+ }
+}
+
+/*
+ * On APICv, this test will cause a busy wait
+ * during a higher-priority task.
+ */
+
+static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ u32 reg;
+
+ /*
+ * Assume a timer IRQ was "injected" if the APIC is protected. KVM's
+ * copy of the vIRR is bogus; it's the responsibility of the caller to
+ * precisely check whether or not a timer IRQ is pending.
+ */
+ if (apic->guest_apic_protected)
+ return true;
+
+ reg = kvm_lapic_get_reg(apic, APIC_LVTT);
+ if (kvm_apic_hw_enabled(apic)) {
+ int vec = reg & APIC_VECTOR_MASK;
+ void *bitmap = apic->regs + APIC_ISR;
+
+ if (apic->apicv_active)
+ bitmap = apic->regs + APIC_IRR;
+
+ if (apic_test_vector(vec, bitmap))
+ return true;
+ }
+ return false;
+}
+
+static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
+{
+ u64 timer_advance_ns = vcpu->arch.apic->lapic_timer.timer_advance_ns;
+
+ /*
+ * If the guest TSC is running at a different ratio than the host, then
+ * convert the delay to nanoseconds to achieve an accurate delay. Note
+ * that __delay() uses delay_tsc whenever the hardware has TSC, thus
+ * always for VMX enabled hardware.
+ */
+ if (vcpu->arch.tsc_scaling_ratio == kvm_caps.default_tsc_scaling_ratio) {
+ __delay(min(guest_cycles,
+ nsec_to_cycles(vcpu, timer_advance_ns)));
+ } else {
+ u64 delay_ns = guest_cycles * 1000000ULL;
+ do_div(delay_ns, vcpu->arch.virtual_tsc_khz);
+ ndelay(min_t(u32, delay_ns, timer_advance_ns));
+ }
+}
+
+static inline void adjust_lapic_timer_advance(struct kvm_vcpu *vcpu,
+ s64 advance_expire_delta)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ u32 timer_advance_ns = apic->lapic_timer.timer_advance_ns;
+ u64 ns;
+
+ /* Do not adjust for tiny fluctuations or large random spikes. */
+ if (abs(advance_expire_delta) > LAPIC_TIMER_ADVANCE_ADJUST_MAX ||
+ abs(advance_expire_delta) < LAPIC_TIMER_ADVANCE_ADJUST_MIN)
return;
+
+ /* too early */
+ if (advance_expire_delta < 0) {
+ ns = -advance_expire_delta * 1000000ULL;
+ do_div(ns, vcpu->arch.virtual_tsc_khz);
+ timer_advance_ns -= ns/LAPIC_TIMER_ADVANCE_ADJUST_STEP;
+ } else {
+ /* too late */
+ ns = advance_expire_delta * 1000000ULL;
+ do_div(ns, vcpu->arch.virtual_tsc_khz);
+ timer_advance_ns += ns/LAPIC_TIMER_ADVANCE_ADJUST_STEP;
+ }
+
+ if (unlikely(timer_advance_ns > LAPIC_TIMER_ADVANCE_NS_MAX))
+ timer_advance_ns = LAPIC_TIMER_ADVANCE_NS_INIT;
+ apic->lapic_timer.timer_advance_ns = timer_advance_ns;
+}
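The adjustment above is a simple proportional controller: convert the early/late error from guest TSC cycles to nanoseconds, then move the advance by a fraction of it. A sketch with signed arithmetic; the STEP divisor of 8 mirrors LAPIC_TIMER_ADVANCE_ADJUST_STEP but should be treated as illustrative:

#include <stdint.h>
#include <stdlib.h>

#define ADJUST_STEP     8

static int64_t adjust_advance_ns(int64_t advance_ns, int64_t delta_tsc,
                                 uint32_t tsc_khz)
{
        /* cycles -> ns at the guest's TSC frequency */
        int64_t err_ns = llabs(delta_tsc) * 1000000 / tsc_khz;

        if (delta_tsc < 0)
                advance_ns -= err_ns / ADJUST_STEP;     /* fired early */
        else
                advance_ns += err_ns / ADJUST_STEP;     /* fired late */

        return advance_ns;
}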
+
+static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ u64 guest_tsc, tsc_deadline;
+
+ tsc_deadline = apic->lapic_timer.expired_tscdeadline;
+ apic->lapic_timer.expired_tscdeadline = 0;
+ guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+ trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);
+
+ adjust_lapic_timer_advance(vcpu, guest_tsc - tsc_deadline);
+
/*
- * Do not allow the guest to program periodic timers with small
- * interval, since the hrtimers are not throttled by the host
- * scheduler.
+ * If the timer fired early, reread the TSC to account for the overhead
+ * of the above adjustment to avoid waiting longer than is necessary.
*/
- if (apic_lvtt_period(apic)) {
- if (apic->lapic_timer.period < NSEC_PER_MSEC/2)
- apic->lapic_timer.period = NSEC_PER_MSEC/2;
+ if (guest_tsc < tsc_deadline)
+ guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+
+ if (guest_tsc < tsc_deadline)
+ __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
+}
+
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
+{
+ if (lapic_in_kernel(vcpu) &&
+ vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
+ vcpu->arch.apic->lapic_timer.timer_advance_ns &&
+ lapic_timer_int_injected(vcpu))
+ __kvm_wait_lapic_expire(vcpu);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_wait_lapic_expire);
+
+static void kvm_apic_inject_pending_timer_irqs(struct kvm_lapic *apic)
+{
+ struct kvm_timer *ktimer = &apic->lapic_timer;
+
+ kvm_apic_local_deliver(apic, APIC_LVTT);
+ if (apic_lvtt_tscdeadline(apic)) {
+ ktimer->tscdeadline = 0;
+ } else if (apic_lvtt_oneshot(apic)) {
+ ktimer->tscdeadline = 0;
+ ktimer->target_expiration = 0;
+ }
+}
+
+static void apic_timer_expired(struct kvm_lapic *apic, bool from_timer_fn)
+{
+ struct kvm_vcpu *vcpu = apic->vcpu;
+ struct kvm_timer *ktimer = &apic->lapic_timer;
+
+ if (atomic_read(&apic->lapic_timer.pending))
+ return;
+
+ if (apic_lvtt_tscdeadline(apic) || ktimer->hv_timer_in_use)
+ ktimer->expired_tscdeadline = ktimer->tscdeadline;
+
+ if (!from_timer_fn && apic->apicv_active) {
+ WARN_ON(kvm_get_running_vcpu() != vcpu);
+ kvm_apic_inject_pending_timer_irqs(apic);
+ return;
+ }
+
+ if (kvm_use_posted_timer_interrupt(apic->vcpu)) {
+ /*
+ * Ensure the guest's timer has truly expired before posting an
+ * interrupt. Open code the relevant checks to avoid querying
+ * lapic_timer_int_injected(), which will be false since the
+ * interrupt isn't yet injected. Waiting until after injecting
+ * is not an option since that won't help a posted interrupt.
+ */
+ if (vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
+ vcpu->arch.apic->lapic_timer.timer_advance_ns)
+ __kvm_wait_lapic_expire(vcpu);
+ kvm_apic_inject_pending_timer_irqs(apic);
+ return;
+ }
+
+ atomic_inc(&apic->lapic_timer.pending);
+ kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
+ if (from_timer_fn)
+ kvm_vcpu_kick(vcpu);
+}
+
+static void start_sw_tscdeadline(struct kvm_lapic *apic)
+{
+ struct kvm_timer *ktimer = &apic->lapic_timer;
+ u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
+ u64 ns = 0;
+ ktime_t expire;
+ struct kvm_vcpu *vcpu = apic->vcpu;
+ u32 this_tsc_khz = vcpu->arch.virtual_tsc_khz;
+ unsigned long flags;
+ ktime_t now;
+
+ if (unlikely(!tscdeadline || !this_tsc_khz))
+ return;
+
+ local_irq_save(flags);
+
+ now = ktime_get();
+ guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
+
+ ns = (tscdeadline - guest_tsc) * 1000000ULL;
+ do_div(ns, this_tsc_khz);
+
+ if (likely(tscdeadline > guest_tsc) &&
+ likely(ns > apic->lapic_timer.timer_advance_ns)) {
+ expire = ktime_add_ns(now, ns);
+ expire = ktime_sub_ns(expire, ktimer->timer_advance_ns);
+ hrtimer_start(&ktimer->timer, expire, HRTIMER_MODE_ABS_HARD);
+ } else
+ apic_timer_expired(apic, false);
+
+ local_irq_restore(flags);
+}
+
+static inline u64 tmict_to_ns(struct kvm_lapic *apic, u32 tmict)
+{
+ return (u64)tmict * apic->vcpu->kvm->arch.apic_bus_cycle_ns *
+ (u64)apic->divide_count;
+}
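Putting numbers on tmict_to_ns(): assuming KVM's default bus cycle of 1 ns (the apic_bus_cycle_ns default) and divide-by-16, an initial count of one million yields a 16 ms period:

#include <stdint.h>

static uint64_t tmict_to_ns(uint32_t tmict, uint64_t bus_cycle_ns,
                            uint32_t divide_count)
{
        return (uint64_t)tmict * bus_cycle_ns * divide_count;
}
/* tmict_to_ns(1000000, 1, 16) == 16000000 ns == 16 ms */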
+
+static void update_target_expiration(struct kvm_lapic *apic, uint32_t old_divisor)
+{
+ ktime_t now, remaining;
+ u64 ns_remaining_old, ns_remaining_new;
+
+ apic->lapic_timer.period =
+ tmict_to_ns(apic, kvm_lapic_get_reg(apic, APIC_TMICT));
+ limit_periodic_timer_frequency(apic);
+
+ now = ktime_get();
+ remaining = ktime_sub(apic->lapic_timer.target_expiration, now);
+ if (ktime_to_ns(remaining) < 0)
+ remaining = 0;
+
+ ns_remaining_old = ktime_to_ns(remaining);
+ ns_remaining_new = mul_u64_u32_div(ns_remaining_old,
+ apic->divide_count, old_divisor);
+
+ apic->lapic_timer.tscdeadline +=
+ nsec_to_cycles(apic->vcpu, ns_remaining_new) -
+ nsec_to_cycles(apic->vcpu, ns_remaining_old);
+ apic->lapic_timer.target_expiration = ktime_add_ns(now, ns_remaining_new);
+}
+
+static bool set_target_expiration(struct kvm_lapic *apic, u32 count_reg)
+{
+ ktime_t now;
+ u64 tscl = rdtsc();
+ s64 deadline;
+
+ now = ktime_get();
+ apic->lapic_timer.period =
+ tmict_to_ns(apic, kvm_lapic_get_reg(apic, APIC_TMICT));
+
+ if (!apic->lapic_timer.period) {
+ apic->lapic_timer.tscdeadline = 0;
+ return false;
+ }
+
+ limit_periodic_timer_frequency(apic);
+ deadline = apic->lapic_timer.period;
+
+ if (apic_lvtt_period(apic) || apic_lvtt_oneshot(apic)) {
+ if (unlikely(count_reg != APIC_TMICT)) {
+ deadline = tmict_to_ns(apic,
+ kvm_lapic_get_reg(apic, count_reg));
+ if (unlikely(deadline <= 0)) {
+ if (apic_lvtt_period(apic))
+ deadline = apic->lapic_timer.period;
+ else
+ deadline = 0;
+ }
+ else if (unlikely(deadline > apic->lapic_timer.period)) {
+ pr_info_ratelimited(
+ "vcpu %i: requested lapic timer restore with "
+ "starting count register %#x=%u (%lld ns) > initial count (%lld ns). "
+ "Using initial count to start timer.\n",
+ apic->vcpu->vcpu_id,
+ count_reg,
+ kvm_lapic_get_reg(apic, count_reg),
+ deadline, apic->lapic_timer.period);
+ kvm_lapic_set_reg(apic, count_reg, 0);
+ deadline = apic->lapic_timer.period;
+ }
+ }
+ }
+
+ apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
+ nsec_to_cycles(apic->vcpu, deadline);
+ apic->lapic_timer.target_expiration = ktime_add_ns(now, deadline);
+
+ return true;
+}
+
+static void advance_periodic_target_expiration(struct kvm_lapic *apic)
+{
+ struct kvm_timer *ktimer = &apic->lapic_timer;
+ ktime_t now = ktime_get();
+ u64 tscl = rdtsc();
+ ktime_t delta;
+
+ /*
+ * Use kernel time as the time source for both the hrtimer deadline and
+ * TSC-based deadline so that they stay synchronized. Computing each
+ * deadline independently will cause the two deadlines to drift apart
+ * over time as differences in the periods accumulate, e.g. due to
+ * differences in the underlying clocks or numerical approximation errors.
+ */
+ ktimer->target_expiration = ktime_add_ns(ktimer->target_expiration,
+ ktimer->period);
+
+ /*
+ * If the new expiration is in the past, e.g. because userspace stopped
+ * running the VM for an extended duration, then force the expiration
+ * to "now" and don't try to play catch-up with the missed events. KVM
+ * will only deliver a single interrupt regardless of how many events
+ * are pending, i.e. restarting the timer with an expiration in the
+ * past will do nothing more than waste host cycles, and can even lead
+ * to a hard lockup in extreme cases.
+ */
+ if (ktime_before(ktimer->target_expiration, now))
+ ktimer->target_expiration = now;
+
+ /*
+ * Note, ensuring the expiration isn't in the past also prevents delta
+ * from going negative, which could cause the TSC deadline to become
+ * excessively large due to it being an unsigned value.
+ */
+ delta = ktime_sub(ktimer->target_expiration, now);
+ ktimer->tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
+ nsec_to_cycles(apic->vcpu, delta);
+}
+
+static void start_sw_period(struct kvm_lapic *apic)
+{
+ if (!apic->lapic_timer.period)
+ return;
+
+ if (ktime_after(ktime_get(),
+ apic->lapic_timer.target_expiration)) {
+ apic_timer_expired(apic, false);
+
+ if (apic_lvtt_oneshot(apic))
+ return;
+
+ advance_periodic_target_expiration(apic);
}
hrtimer_start(&apic->lapic_timer.timer,
- ktime_add_ns(now, apic->lapic_timer.period),
- HRTIMER_MODE_ABS);
+ apic->lapic_timer.target_expiration,
+ HRTIMER_MODE_ABS_HARD);
+}
+
+bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu)
+{
+ if (!lapic_in_kernel(vcpu))
+ return false;
- apic_debug("%s: bus cycle is %" PRId64 "ns, now 0x%016"
- PRIx64 ", "
- "timer initial count 0x%x, period %lldns, "
- "expire @ 0x%016" PRIx64 ".\n", __func__,
- APIC_BUS_CYCLE_NS, ktime_to_ns(now),
- apic_get_reg(apic, APIC_TMICT),
- apic->lapic_timer.period,
- ktime_to_ns(ktime_add_ns(now,
- apic->lapic_timer.period)));
+ return vcpu->arch.apic->lapic_timer.hv_timer_in_use;
}
-static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
+static void cancel_hv_timer(struct kvm_lapic *apic)
{
- int nmi_wd_enabled = apic_lvt_nmi_mode(apic_get_reg(apic, APIC_LVT0));
+ WARN_ON(preemptible());
+ WARN_ON(!apic->lapic_timer.hv_timer_in_use);
+ kvm_x86_call(cancel_hv_timer)(apic->vcpu);
+ apic->lapic_timer.hv_timer_in_use = false;
+}
+
+static bool start_hv_timer(struct kvm_lapic *apic)
+{
+ struct kvm_timer *ktimer = &apic->lapic_timer;
+ struct kvm_vcpu *vcpu = apic->vcpu;
+ bool expired;
- if (apic_lvt_nmi_mode(lvt0_val)) {
- if (!nmi_wd_enabled) {
- apic_debug("Receive NMI setting on APIC_LVT0 "
- "for cpu %d\n", apic->vcpu->vcpu_id);
- apic->vcpu->kvm->arch.vapics_in_nmi_mode++;
+ WARN_ON(preemptible());
+ if (!kvm_can_use_hv_timer(vcpu))
+ return false;
+
+ if (!ktimer->tscdeadline)
+ return false;
+
+ if (kvm_x86_call(set_hv_timer)(vcpu, ktimer->tscdeadline, &expired))
+ return false;
+
+ ktimer->hv_timer_in_use = true;
+ hrtimer_cancel(&ktimer->timer);
+
+ /*
+ * To simplify handling the periodic timer, leave the hv timer running
+ * even if the deadline timer has expired, i.e. rely on the resulting
+ * VM-Exit to recompute the periodic timer's target expiration.
+ */
+ if (!apic_lvtt_period(apic)) {
+ /*
+ * Cancel the hv timer if the sw timer fired while the hv timer
+ * was being programmed, or if the hv timer itself expired.
+ */
+ if (atomic_read(&ktimer->pending)) {
+ cancel_hv_timer(apic);
+ } else if (expired) {
+ apic_timer_expired(apic, false);
+ cancel_hv_timer(apic);
}
- } else if (nmi_wd_enabled)
- apic->vcpu->kvm->arch.vapics_in_nmi_mode--;
+ }
+
+ trace_kvm_hv_timer_state(vcpu->vcpu_id, ktimer->hv_timer_in_use);
+
+ return true;
+}
+
+static void start_sw_timer(struct kvm_lapic *apic)
+{
+ struct kvm_timer *ktimer = &apic->lapic_timer;
+
+ WARN_ON(preemptible());
+ if (apic->lapic_timer.hv_timer_in_use)
+ cancel_hv_timer(apic);
+ if (!apic_lvtt_period(apic) && atomic_read(&ktimer->pending))
+ return;
+
+ if (apic_lvtt_period(apic) || apic_lvtt_oneshot(apic))
+ start_sw_period(apic);
+ else if (apic_lvtt_tscdeadline(apic))
+ start_sw_tscdeadline(apic);
+ trace_kvm_hv_timer_state(apic->vcpu->vcpu_id, false);
+}
+
+static void restart_apic_timer(struct kvm_lapic *apic)
+{
+ preempt_disable();
+
+ if (!apic_lvtt_period(apic) && atomic_read(&apic->lapic_timer.pending))
+ goto out;
+
+ if (!start_hv_timer(apic))
+ start_sw_timer(apic);
+out:
+ preempt_enable();
+}
+
+void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+ preempt_disable();
+ /* If the preempt notifier has already run, it also called apic_timer_expired */
+ if (!apic->lapic_timer.hv_timer_in_use)
+ goto out;
+ WARN_ON(kvm_vcpu_is_blocking(vcpu));
+ apic_timer_expired(apic, false);
+ cancel_hv_timer(apic);
+
+ if (apic_lvtt_period(apic) && apic->lapic_timer.period) {
+ advance_periodic_target_expiration(apic);
+ restart_apic_timer(apic);
+ }
+out:
+ preempt_enable();
}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_lapic_expired_hv_timer);
-static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
+void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu)
+{
+ restart_apic_timer(vcpu->arch.apic);
+}
+
+void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+ preempt_disable();
+ /* Possibly the TSC deadline timer is not enabled yet */
+ if (apic->lapic_timer.hv_timer_in_use)
+ start_sw_timer(apic);
+ preempt_enable();
+}
+
+void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+ WARN_ON(!apic->lapic_timer.hv_timer_in_use);
+ restart_apic_timer(apic);
+}
+
+static void __start_apic_timer(struct kvm_lapic *apic, u32 count_reg)
+{
+ atomic_set(&apic->lapic_timer.pending, 0);
+
+ if ((apic_lvtt_period(apic) || apic_lvtt_oneshot(apic))
+ && !set_target_expiration(apic, count_reg))
+ return;
+
+ restart_apic_timer(apic);
+}
+
+static void start_apic_timer(struct kvm_lapic *apic)
+{
+ __start_apic_timer(apic, APIC_TMICT);
+}
+
+static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
+{
+ bool lvt0_in_nmi_mode = apic_lvt_nmi_mode(lvt0_val);
+
+ if (apic->lvt0_in_nmi_mode != lvt0_in_nmi_mode) {
+ apic->lvt0_in_nmi_mode = lvt0_in_nmi_mode;
+ if (lvt0_in_nmi_mode) {
+ atomic_inc(&apic->vcpu->kvm->arch.vapics_in_nmi_mode);
+ } else
+ atomic_dec(&apic->vcpu->kvm->arch.vapics_in_nmi_mode);
+ }
+}
+
+static int get_lvt_index(u32 reg)
+{
+ if (reg == APIC_LVTCMCI)
+ return LVT_CMCI;
+ if (reg < APIC_LVTT || reg > APIC_LVTERR)
+ return -1;
+ return array_index_nospec(
+ (reg - APIC_LVTT) >> 4, KVM_APIC_MAX_NR_LVT_ENTRIES);
+}
+
+static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
{
int ret = 0;
@@ -713,10 +2363,11 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
switch (reg) {
case APIC_ID: /* Local APIC ID */
- if (!apic_x2apic_mode(apic))
- apic_set_reg(apic, APIC_ID, val);
- else
+ if (!apic_x2apic_mode(apic)) {
+ kvm_apic_set_xapic_id(apic, val >> 24);
+ } else {
ret = 1;
+ }
break;
case APIC_TASKPRI:
@@ -730,102 +2381,132 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
case APIC_LDR:
if (!apic_x2apic_mode(apic))
- apic_set_reg(apic, APIC_LDR, val & APIC_LDR_MASK);
+ kvm_apic_set_ldr(apic, val & APIC_LDR_MASK);
else
ret = 1;
break;
case APIC_DFR:
if (!apic_x2apic_mode(apic))
- apic_set_reg(apic, APIC_DFR, val | 0x0FFFFFFF);
+ kvm_apic_set_dfr(apic, val | 0x0FFFFFFF);
else
ret = 1;
break;
case APIC_SPIV: {
u32 mask = 0x3ff;
- if (apic_get_reg(apic, APIC_LVR) & APIC_LVR_DIRECTED_EOI)
+ if (kvm_lapic_get_reg(apic, APIC_LVR) & APIC_LVR_DIRECTED_EOI)
mask |= APIC_SPIV_DIRECTED_EOI;
- apic_set_reg(apic, APIC_SPIV, val & mask);
+ apic_set_spiv(apic, val & mask);
if (!(val & APIC_SPIV_APIC_ENABLED)) {
int i;
- u32 lvt_val;
- for (i = 0; i < APIC_LVT_NUM; i++) {
- lvt_val = apic_get_reg(apic,
- APIC_LVTT + 0x10 * i);
- apic_set_reg(apic, APIC_LVTT + 0x10 * i,
- lvt_val | APIC_LVT_MASKED);
+ for (i = 0; i < apic->nr_lvt_entries; i++) {
+ kvm_lapic_set_reg(apic, APIC_LVTx(i),
+ kvm_lapic_get_reg(apic, APIC_LVTx(i)) | APIC_LVT_MASKED);
}
+ apic_update_lvtt(apic);
atomic_set(&apic->lapic_timer.pending, 0);
}
break;
}
case APIC_ICR:
+ WARN_ON_ONCE(apic_x2apic_mode(apic));
+
/* No delay here, so we always clear the pending bit */
- apic_set_reg(apic, APIC_ICR, val & ~(1 << 12));
- apic_send_ipi(apic);
+ val &= ~APIC_ICR_BUSY;
+ kvm_apic_send_ipi(apic, val, kvm_lapic_get_reg(apic, APIC_ICR2));
+ kvm_lapic_set_reg(apic, APIC_ICR, val);
break;
-
case APIC_ICR2:
- if (!apic_x2apic_mode(apic))
- val &= 0xff000000;
- apic_set_reg(apic, APIC_ICR2, val);
+ if (apic_x2apic_mode(apic))
+ ret = 1;
+ else
+ kvm_lapic_set_reg(apic, APIC_ICR2, val & 0xff000000);
break;
case APIC_LVT0:
apic_manage_nmi_watchdog(apic, val);
- case APIC_LVTT:
+ fallthrough;
case APIC_LVTTHMR:
case APIC_LVTPC:
case APIC_LVT1:
case APIC_LVTERR:
- /* TODO: Check vector */
- if (!apic_sw_enabled(apic))
+ case APIC_LVTCMCI: {
+ u32 index = get_lvt_index(reg);
+ if (!kvm_lapic_lvt_supported(apic, index)) {
+ ret = 1;
+ break;
+ }
+ if (!kvm_apic_sw_enabled(apic))
val |= APIC_LVT_MASKED;
+ val &= apic_lvt_mask[index];
+ kvm_lapic_set_reg(apic, reg, val);
+ break;
+ }
- val &= apic_lvt_mask[(reg - APIC_LVTT) >> 4];
- apic_set_reg(apic, reg, val);
-
+ case APIC_LVTT:
+ if (!kvm_apic_sw_enabled(apic))
+ val |= APIC_LVT_MASKED;
+ val &= (apic_lvt_mask[LVT_TIMER] | apic->lapic_timer.timer_mode_mask);
+ kvm_lapic_set_reg(apic, APIC_LVTT, val);
+ apic_update_lvtt(apic);
break;
case APIC_TMICT:
- hrtimer_cancel(&apic->lapic_timer.timer);
- apic_set_reg(apic, APIC_TMICT, val);
+ if (apic_lvtt_tscdeadline(apic))
+ break;
+
+ cancel_apic_timer(apic);
+ kvm_lapic_set_reg(apic, APIC_TMICT, val);
start_apic_timer(apic);
break;
- case APIC_TDCR:
- if (val & 4)
- printk(KERN_ERR "KVM_WRITE:TDCR %x\n", val);
- apic_set_reg(apic, APIC_TDCR, val);
+ case APIC_TDCR: {
+ uint32_t old_divisor = apic->divide_count;
+
+ kvm_lapic_set_reg(apic, APIC_TDCR, val & 0xb);
update_divide_count(apic);
+ if (apic->divide_count != old_divisor &&
+ apic->lapic_timer.period) {
+ hrtimer_cancel(&apic->lapic_timer.timer);
+ update_target_expiration(apic, old_divisor);
+ restart_apic_timer(apic);
+ }
break;
-
+ }
case APIC_ESR:
- if (apic_x2apic_mode(apic) && val != 0) {
- printk(KERN_ERR "KVM_WRITE:ESR not zero %x\n", val);
+ if (apic_x2apic_mode(apic) && val != 0)
ret = 1;
- }
break;
case APIC_SELF_IPI:
- if (apic_x2apic_mode(apic)) {
- apic_reg_write(apic, APIC_ICR, 0x40000 | (val & 0xff));
- } else
+ /*
+ * Self-IPI exists only when x2APIC is enabled. Bits 7:0 hold
+ * the vector, everything else is reserved.
+ */
+ if (!apic_x2apic_mode(apic) || (val & ~APIC_VECTOR_MASK))
ret = 1;
+ else
+ kvm_apic_send_ipi(apic, APIC_DEST_SELF | val, 0);
break;
default:
ret = 1;
break;
}
- if (ret)
- apic_debug("Local APIC Write to read-only register %x\n", reg);
+
+ /*
+ * Recalculate APIC maps if necessary, e.g. if the software enable bit
+ * was toggled, the APIC ID changed, etc... The maps are marked dirty
+ * on relevant changes, i.e. this is a nop for most writes.
+ */
+ kvm_recalculate_apic_map(apic->vcpu->kvm);
+
return ret;
}
-static int apic_mmio_write(struct kvm_io_device *this,
+static int apic_mmio_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
gpa_t address, int len, const void *data)
{
struct kvm_lapic *apic = to_lapic(this);
@@ -835,40 +2516,139 @@ static int apic_mmio_write(struct kvm_io_device *this,
if (!apic_mmio_in_range(apic, address))
return -EOPNOTSUPP;
+ if (!kvm_apic_hw_enabled(apic) || apic_x2apic_mode(apic)) {
+ if (!kvm_check_has_quirk(vcpu->kvm,
+ KVM_X86_QUIRK_LAPIC_MMIO_HOLE))
+ return -EOPNOTSUPP;
+
+ return 0;
+ }
+
/*
 * APIC registers must be aligned on a 128-bit boundary.
 * 32/64/128-bit registers must be accessed through 32-bit reads/writes.
 * Refer to SDM 8.4.1.
*/
- if (len != 4 || (offset & 0xf)) {
- /* Don't shout loud, $infamous_os would cause only noise. */
- apic_debug("apic write: bad size=%d %lx\n", len, (long)address);
+ if (len != 4 || (offset & 0xf))
return 0;
- }
val = *(u32*)data;
- /* too common printing */
- if (offset != APIC_EOI)
- apic_debug("%s: offset 0x%x with length 0x%x, and value is "
- "0x%x\n", __func__, offset, len, val);
+ kvm_lapic_reg_write(apic, offset & 0xff0, val);
+
+ return 0;
+}
+
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+ kvm_lapic_reg_write(vcpu->arch.apic, APIC_EOI, 0);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_lapic_set_eoi);
- apic_reg_write(apic, offset & 0xff0, val);
+#define X2APIC_ICR_RESERVED_BITS (GENMASK_ULL(31, 20) | GENMASK_ULL(17, 16) | BIT(13))
+static int __kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data, bool fast)
+{
+ if (data & X2APIC_ICR_RESERVED_BITS)
+ return 1;
+
+ /*
+ * The BUSY bit is reserved on both Intel and AMD in x2APIC mode, but
+ * only AMD requires it to be zero; Intel essentially just ignores the
+ * bit. And if IPI virtualization (Intel) or x2AVIC (AMD) is enabled,
+ * the CPU performs the reserved bits checks, i.e. the underlying CPU
+ * behavior will "win". Arbitrarily clear the BUSY bit, as there is no
+ * sane way to provide consistent behavior with respect to hardware.
+ */
+ data &= ~APIC_ICR_BUSY;
+
+ if (fast) {
+ struct kvm_lapic_irq irq;
+ int ignored;
+
+ kvm_icr_to_lapic_irq(apic, (u32)data, (u32)(data >> 32), &irq);
+
+ if (!kvm_irq_delivery_to_apic_fast(apic->vcpu->kvm, apic, &irq,
+ &ignored, NULL))
+ return -EWOULDBLOCK;
+
+ trace_kvm_apic_ipi((u32)data, irq.dest_id);
+ } else {
+ kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32));
+ }
+ if (kvm_x86_ops.x2apic_icr_is_split) {
+ kvm_lapic_set_reg(apic, APIC_ICR, data);
+ kvm_lapic_set_reg(apic, APIC_ICR2, data >> 32);
+ } else {
+ kvm_lapic_set_reg64(apic, APIC_ICR, data);
+ }
+ trace_kvm_apic_write(APIC_ICR, data);
return 0;
}
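+
+/*
+ * For reference, the x2APIC ICR layout assumed above (bits 13, 17:16 and
+ * 31:20 are the reserved bits rejected via X2APIC_ICR_RESERVED_BITS):
+ *
+ *   [7:0]   vector
+ *   [10:8]  delivery mode
+ *   [11]    destination mode
+ *   [12]    BUSY (reserved in x2APIC mode, cleared above)
+ *   [14]    level
+ *   [15]    trigger mode
+ *   [19:18] destination shorthand
+ *   [63:32] destination
+ */
+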
+static int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data)
+{
+ return __kvm_x2apic_icr_write(apic, data, false);
+}
+
+int kvm_x2apic_icr_write_fast(struct kvm_lapic *apic, u64 data)
+{
+ return __kvm_x2apic_icr_write(apic, data, true);
+}
+
+static u64 kvm_x2apic_icr_read(struct kvm_lapic *apic)
+{
+ if (kvm_x86_ops.x2apic_icr_is_split)
+ return (u64)kvm_lapic_get_reg(apic, APIC_ICR) |
+ (u64)kvm_lapic_get_reg(apic, APIC_ICR2) << 32;
+
+ return kvm_lapic_get_reg64(apic, APIC_ICR);
+}
+
+/* emulate APIC access in a trap manner */
+void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+ /*
+ * ICR is a single 64-bit register when x2APIC is enabled; all other
+ * registers hold 32-bit values. For legacy xAPIC, ICR writes need to
+ * go down the common path to get the upper half from ICR2.
+ *
+ * Note, using the write helpers may incur an unnecessary write to the
+ * virtual APIC state, but KVM needs to conditionally modify the value
+ * in certain cases, e.g. to clear the ICR busy bit. The cost of extra
+ * conditional branches is likely a wash relative to the cost of the
+ * maybe-unnecessary write, and both are in the noise anyway.
+ */
+ if (apic_x2apic_mode(apic) && offset == APIC_ICR)
+ WARN_ON_ONCE(kvm_x2apic_icr_write(apic, kvm_x2apic_icr_read(apic)));
+ else
+ kvm_lapic_reg_write(apic, offset, kvm_lapic_get_reg(apic, offset));
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_write_nodecode);
+
void kvm_free_lapic(struct kvm_vcpu *vcpu)
{
- if (!vcpu->arch.apic)
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+ if (!vcpu->arch.apic) {
+ static_branch_dec(&kvm_has_noapic_vcpu);
return;
+ }
+
+ hrtimer_cancel(&apic->lapic_timer.timer);
- hrtimer_cancel(&vcpu->arch.apic->lapic_timer.timer);
+ if (!(vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE))
+ static_branch_slow_dec_deferred(&apic_hw_disabled);
- if (vcpu->arch.apic->regs_page)
- __free_page(vcpu->arch.apic->regs_page);
+ if (!apic->sw_enabled)
+ static_branch_slow_dec_deferred(&apic_sw_disabled);
- kfree(vcpu->arch.apic);
+ if (apic->regs)
+ free_page((unsigned long)apic->regs);
+
+ kfree(apic);
}
/*
@@ -876,116 +2656,275 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu)
* LAPIC interface
*----------------------------------------------------------------------
*/
+u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
-void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8)
+ if (!kvm_apic_present(vcpu) || !apic_lvtt_tscdeadline(apic))
+ return 0;
+
+ return apic->lapic_timer.tscdeadline;
+}
+
+void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- if (!apic)
+ if (!kvm_apic_present(vcpu) || !apic_lvtt_tscdeadline(apic))
return;
- apic_set_tpr(apic, ((cr8 & 0x0f) << 4)
- | (apic_get_reg(apic, APIC_TASKPRI) & 4));
+
+ hrtimer_cancel(&apic->lapic_timer.timer);
+ apic->lapic_timer.tscdeadline = data;
+ start_apic_timer(apic);
+}
+
+void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8)
+{
+ apic_set_tpr(vcpu->arch.apic, (cr8 & 0x0f) << 4);
}
u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu)
{
- struct kvm_lapic *apic = vcpu->arch.apic;
u64 tpr;
- if (!apic)
- return 0;
- tpr = (u64) apic_get_reg(apic, APIC_TASKPRI);
+ tpr = (u64) kvm_lapic_get_reg(vcpu->arch.apic, APIC_TASKPRI);
return (tpr & 0xf0) >> 4;
}
-void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
+static void __kvm_apic_set_base(struct kvm_vcpu *vcpu, u64 value)
{
+ u64 old_value = vcpu->arch.apic_base;
struct kvm_lapic *apic = vcpu->arch.apic;
- if (!apic) {
- value |= MSR_IA32_APICBASE_BSP;
- vcpu->arch.apic_base = value;
+ vcpu->arch.apic_base = value;
+
+ if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE)
+ vcpu->arch.cpuid_dynamic_bits_dirty = true;
+
+ if (!apic)
return;
+
+ /* update jump label if enable bit changes */
+ if ((old_value ^ value) & MSR_IA32_APICBASE_ENABLE) {
+ if (value & MSR_IA32_APICBASE_ENABLE) {
+ kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
+ static_branch_slow_dec_deferred(&apic_hw_disabled);
+ /* Check if there are APF page ready requests pending */
+ kvm_make_request(KVM_REQ_APF_READY, vcpu);
+ } else {
+ static_branch_inc(&apic_hw_disabled.key);
+ atomic_set_release(&apic->vcpu->kvm->arch.apic_map_dirty, DIRTY);
+ }
}
- if (!kvm_vcpu_is_bsp(apic->vcpu))
- value &= ~MSR_IA32_APICBASE_BSP;
+ if ((old_value ^ value) & X2APIC_ENABLE) {
+ if (value & X2APIC_ENABLE)
+ kvm_apic_set_x2apic_id(apic, vcpu->vcpu_id);
+ else if (value & MSR_IA32_APICBASE_ENABLE)
+ kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
+ }
- vcpu->arch.apic_base = value;
- if (apic_x2apic_mode(apic)) {
- u32 id = kvm_apic_id(apic);
- u32 ldr = ((id & ~0xf) << 16) | (1 << (id & 0xf));
- apic_set_reg(apic, APIC_LDR, ldr);
+ if ((old_value ^ value) & (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE)) {
+ kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+ kvm_x86_call(set_virtual_apic_mode)(vcpu);
}
+
apic->base_address = apic->vcpu->arch.apic_base &
MSR_IA32_APICBASE_BASE;
- /* with FSB delivery interrupt, we can restart APIC functionality */
- apic_debug("apic base msr is 0x%016" PRIx64 ", and base address is "
- "0x%lx.\n", apic->vcpu->arch.apic_base, apic->base_address);
+ if ((value & MSR_IA32_APICBASE_ENABLE) &&
+ apic->base_address != APIC_DEFAULT_PHYS_BASE) {
+ kvm_set_apicv_inhibit(apic->vcpu->kvm,
+ APICV_INHIBIT_REASON_APIC_BASE_MODIFIED);
+ }
+}
+int kvm_apic_set_base(struct kvm_vcpu *vcpu, u64 value, bool host_initiated)
+{
+ enum lapic_mode old_mode = kvm_get_apic_mode(vcpu);
+ enum lapic_mode new_mode = kvm_apic_mode(value);
+
+ if (vcpu->arch.apic_base == value)
+ return 0;
+
+ u64 reserved_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu) | 0x2ff |
+ (guest_cpu_cap_has(vcpu, X86_FEATURE_X2APIC) ? 0 : X2APIC_ENABLE);
+
+ if ((value & reserved_bits) != 0 || new_mode == LAPIC_MODE_INVALID)
+ return 1;
+ if (!host_initiated) {
+ if (old_mode == LAPIC_MODE_X2APIC && new_mode == LAPIC_MODE_XAPIC)
+ return 1;
+ if (old_mode == LAPIC_MODE_DISABLED && new_mode == LAPIC_MODE_X2APIC)
+ return 1;
+ }
+
+ __kvm_apic_set_base(vcpu, value);
+ kvm_recalculate_apic_map(vcpu->kvm);
+ return 0;
}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_set_base);
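+
+/*
+ * A sketch of the transitions enforced above for guest-initiated writes
+ * (host-initiated writes may move between any two valid modes):
+ *
+ *   DISABLED -> XAPIC -> X2APIC    allowed
+ *   X2APIC -> DISABLED             allowed
+ *   X2APIC -> XAPIC                rejected
+ *   DISABLED -> X2APIC             rejected
+ *
+ * Any write with reserved bits set, or one that yields LAPIC_MODE_INVALID
+ * (X2APIC_ENABLE without MSR_IA32_APICBASE_ENABLE), is also rejected.
+ */
+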
-void kvm_lapic_reset(struct kvm_vcpu *vcpu)
+void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
{
- struct kvm_lapic *apic;
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+ /*
+ * When APICv is enabled, KVM must always search the IRR for a pending
+ * IRQ, as other vCPUs and devices can set IRR bits even if the vCPU
+ * isn't running. If APICv is disabled, KVM _should_ search the IRR
+ * for a pending IRQ. But KVM currently doesn't ensure *all* hardware,
+ * e.g. CPUs and IOMMUs, has seen the change in state, i.e. searching
+ * the IRR at this time could race with IRQ delivery from hardware that
+ * still sees APICv as being enabled.
+ *
+ * FIXME: Ensure other vCPUs and devices observe the change in APICv
+ * state prior to updating KVM's metadata caches, so that KVM
+ * can safely search the IRR and set irr_pending accordingly.
+ */
+ apic->irr_pending = true;
+
+ if (apic->apicv_active)
+ apic->isr_count = 1;
+ else
+ apic->isr_count = count_vectors(apic->regs + APIC_ISR);
+
+ apic->highest_isr_cache = -1;
+}
+
+int kvm_alloc_apic_access_page(struct kvm *kvm)
+{
+ void __user *hva;
+
+ guard(mutex)(&kvm->slots_lock);
+
+ if (kvm->arch.apic_access_memslot_enabled ||
+ kvm->arch.apic_access_memslot_inhibited)
+ return 0;
+
+ hva = __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT,
+ APIC_DEFAULT_PHYS_BASE, PAGE_SIZE);
+ if (IS_ERR(hva))
+ return PTR_ERR(hva);
+
+ kvm->arch.apic_access_memslot_enabled = true;
+
+ return 0;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_alloc_apic_access_page);
+
+void kvm_inhibit_apic_access_page(struct kvm_vcpu *vcpu)
+{
+ struct kvm *kvm = vcpu->kvm;
+
+ if (!kvm->arch.apic_access_memslot_enabled)
+ return;
+
+ kvm_vcpu_srcu_read_unlock(vcpu);
+
+ mutex_lock(&kvm->slots_lock);
+
+ if (kvm->arch.apic_access_memslot_enabled) {
+ __x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, 0, 0);
+ /*
+ * Clear "enabled" after the memslot is deleted so that a
+ * different vCPU doesn't get a false negative when checking
+ * the flag outside of slots_lock. No additional memory barrier is
+ * needed as modifying memslots requires waiting for other vCPUs to
+ * drop SRCU (see above), and false positives are ok as the
+ * flag is rechecked after acquiring slots_lock.
+ */
+ kvm->arch.apic_access_memslot_enabled = false;
+
+ /*
+ * Mark the memslot as inhibited to prevent reallocating the
+ * memslot during vCPU creation, e.g. if a vCPU is hotplugged.
+ */
+ kvm->arch.apic_access_memslot_inhibited = true;
+ }
+
+ mutex_unlock(&kvm->slots_lock);
+
+ kvm_vcpu_srcu_read_lock(vcpu);
+}
+
+void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ u64 msr_val;
int i;
- apic_debug("%s\n", __func__);
+ kvm_x86_call(apicv_pre_state_restore)(vcpu);
- ASSERT(vcpu);
- apic = vcpu->arch.apic;
- ASSERT(apic != NULL);
+ if (!init_event) {
+ msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
+ if (kvm_vcpu_is_reset_bsp(vcpu))
+ msr_val |= MSR_IA32_APICBASE_BSP;
+
+ /*
+ * Use the inner helper to avoid an extra recalculation of the
+ * optimized APIC map if some other task has dirtied the map.
+ * The recalculation needed for this vCPU will be done after
+ * all APIC state has been initialized (see below).
+ */
+ __kvm_apic_set_base(vcpu, msr_val);
+ }
+
+ if (!apic)
+ return;
/* Stop the timer in case it's a reset to an active apic */
hrtimer_cancel(&apic->lapic_timer.timer);
- apic_set_reg(apic, APIC_ID, vcpu->vcpu_id << 24);
+ /* The xAPIC ID is set at RESET even if the APIC was already enabled. */
+ if (!init_event)
+ kvm_apic_set_xapic_id(apic, vcpu->vcpu_id);
kvm_apic_set_version(apic->vcpu);
- for (i = 0; i < APIC_LVT_NUM; i++)
- apic_set_reg(apic, APIC_LVTT + 0x10 * i, APIC_LVT_MASKED);
- apic_set_reg(apic, APIC_LVT0,
- SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));
-
- apic_set_reg(apic, APIC_DFR, 0xffffffffU);
- apic_set_reg(apic, APIC_SPIV, 0xff);
- apic_set_reg(apic, APIC_TASKPRI, 0);
- apic_set_reg(apic, APIC_LDR, 0);
- apic_set_reg(apic, APIC_ESR, 0);
- apic_set_reg(apic, APIC_ICR, 0);
- apic_set_reg(apic, APIC_ICR2, 0);
- apic_set_reg(apic, APIC_TDCR, 0);
- apic_set_reg(apic, APIC_TMICT, 0);
+ for (i = 0; i < apic->nr_lvt_entries; i++)
+ kvm_lapic_set_reg(apic, APIC_LVTx(i), APIC_LVT_MASKED);
+ apic_update_lvtt(apic);
+ if (kvm_vcpu_is_reset_bsp(vcpu) &&
+ kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_LINT0_REENABLED))
+ kvm_lapic_set_reg(apic, APIC_LVT0,
+ SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));
+ apic_manage_nmi_watchdog(apic, kvm_lapic_get_reg(apic, APIC_LVT0));
+
+ kvm_apic_set_dfr(apic, 0xffffffffU);
+ apic_set_spiv(apic, 0xff);
+ kvm_lapic_set_reg(apic, APIC_TASKPRI, 0);
+ if (!apic_x2apic_mode(apic))
+ kvm_apic_set_ldr(apic, 0);
+ kvm_lapic_set_reg(apic, APIC_ESR, 0);
+ if (!apic_x2apic_mode(apic)) {
+ kvm_lapic_set_reg(apic, APIC_ICR, 0);
+ kvm_lapic_set_reg(apic, APIC_ICR2, 0);
+ } else {
+ kvm_lapic_set_reg64(apic, APIC_ICR, 0);
+ }
+ kvm_lapic_set_reg(apic, APIC_TDCR, 0);
+ kvm_lapic_set_reg(apic, APIC_TMICT, 0);
for (i = 0; i < 8; i++) {
- apic_set_reg(apic, APIC_IRR + 0x10 * i, 0);
- apic_set_reg(apic, APIC_ISR + 0x10 * i, 0);
- apic_set_reg(apic, APIC_TMR + 0x10 * i, 0);
+ kvm_lapic_set_reg(apic, APIC_IRR + 0x10 * i, 0);
+ kvm_lapic_set_reg(apic, APIC_ISR + 0x10 * i, 0);
+ kvm_lapic_set_reg(apic, APIC_TMR + 0x10 * i, 0);
}
- apic->irr_pending = false;
+ kvm_apic_update_apicv(vcpu);
update_divide_count(apic);
atomic_set(&apic->lapic_timer.pending, 0);
- if (kvm_vcpu_is_bsp(vcpu))
- vcpu->arch.apic_base |= MSR_IA32_APICBASE_BSP;
+
+ vcpu->arch.pv_eoi.msr_val = 0;
apic_update_ppr(apic);
+ if (apic->apicv_active) {
+ kvm_x86_call(apicv_post_state_restore)(vcpu);
+ kvm_x86_call(hwapic_isr_update)(vcpu, -1);
+ }
vcpu->arch.apic_arb_prio = 0;
+ vcpu->arch.apic_attention = 0;
- apic_debug(KERN_INFO "%s: vcpu=%p, id=%d, base_msr="
- "0x%016" PRIx64 ", base_address=0x%0lx.\n", __func__,
- vcpu, kvm_apic_id(apic),
- vcpu->arch.apic_base, apic->base_address);
-}
-
-bool kvm_apic_present(struct kvm_vcpu *vcpu)
-{
- return vcpu->arch.apic && apic_hw_enabled(vcpu->arch.apic);
-}
-
-int kvm_lapic_enabled(struct kvm_vcpu *vcpu)
-{
- return kvm_apic_present(vcpu) && apic_sw_enabled(vcpu->arch.apic);
+ kvm_recalculate_apic_map(vcpu->kvm);
}
/*
@@ -994,33 +2933,37 @@ int kvm_lapic_enabled(struct kvm_vcpu *vcpu)
*----------------------------------------------------------------------
*/
-static bool lapic_is_periodic(struct kvm_timer *ktimer)
+static bool lapic_is_periodic(struct kvm_lapic *apic)
{
- struct kvm_lapic *apic = container_of(ktimer, struct kvm_lapic,
- lapic_timer);
return apic_lvtt_period(apic);
}
int apic_has_pending_timer(struct kvm_vcpu *vcpu)
{
- struct kvm_lapic *lapic = vcpu->arch.apic;
+ struct kvm_lapic *apic = vcpu->arch.apic;
- if (lapic && apic_enabled(lapic) && apic_lvt_enabled(lapic, APIC_LVTT))
- return atomic_read(&lapic->lapic_timer.pending);
+ if (apic_enabled(apic) && apic_lvt_enabled(apic, APIC_LVTT))
+ return atomic_read(&apic->lapic_timer.pending);
return 0;
}
-static int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
+int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
{
- u32 reg = apic_get_reg(apic, lvt_type);
+ u32 reg = kvm_lapic_get_reg(apic, lvt_type);
int vector, mode, trig_mode;
+ int r;
- if (apic_hw_enabled(apic) && !(reg & APIC_LVT_MASKED)) {
+ if (kvm_apic_hw_enabled(apic) && !(reg & APIC_LVT_MASKED)) {
vector = reg & APIC_VECTOR_MASK;
mode = reg & APIC_MODE_MASK;
trig_mode = reg & APIC_LVT_LEVEL_TRIGGER;
- return __apic_accept_irq(apic, mode, vector, 1, trig_mode);
+
+ r = __apic_accept_irq(apic, mode, vector, 1, trig_mode, NULL);
+ if (r && lvt_type == APIC_LVTPC &&
+ guest_cpuid_is_intel_compatible(apic->vcpu))
+ kvm_lapic_set_reg(apic, APIC_LVTPC, reg | APIC_LVT_MASKED);
+ return r;
}
return 0;
}
@@ -1033,54 +2976,88 @@ void kvm_apic_nmi_wd_deliver(struct kvm_vcpu *vcpu)
kvm_apic_local_deliver(apic, APIC_LVT0);
}
-static struct kvm_timer_ops lapic_timer_ops = {
- .is_periodic = lapic_is_periodic,
-};
-
static const struct kvm_io_device_ops apic_mmio_ops = {
.read = apic_mmio_read,
.write = apic_mmio_write,
};
+static enum hrtimer_restart apic_timer_fn(struct hrtimer *data)
+{
+ struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
+ struct kvm_lapic *apic = container_of(ktimer, struct kvm_lapic, lapic_timer);
+
+ apic_timer_expired(apic, true);
+
+ if (lapic_is_periodic(apic) && !WARN_ON_ONCE(!apic->lapic_timer.period)) {
+ advance_periodic_target_expiration(apic);
+ hrtimer_set_expires(&ktimer->timer, ktimer->target_expiration);
+ return HRTIMER_RESTART;
+ } else
+ return HRTIMER_NORESTART;
+}
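+
+/*
+ * Note: pushing the expiry forward and returning HRTIMER_RESTART is the
+ * standard hrtimer idiom for re-arming a periodic timer from its callback;
+ * HRTIMER_NORESTART lets a one-shot or TSC-deadline timer simply expire.
+ */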
+
int kvm_create_lapic(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic;
ASSERT(vcpu != NULL);
- apic_debug("apic_init %d\n", vcpu->vcpu_id);
- apic = kzalloc(sizeof(*apic), GFP_KERNEL);
+ if (!irqchip_in_kernel(vcpu->kvm)) {
+ static_branch_inc(&kvm_has_noapic_vcpu);
+ return 0;
+ }
+
+ apic = kzalloc(sizeof(*apic), GFP_KERNEL_ACCOUNT);
if (!apic)
goto nomem;
vcpu->arch.apic = apic;
- apic->regs_page = alloc_page(GFP_KERNEL);
- if (apic->regs_page == NULL) {
+ if (kvm_x86_ops.alloc_apic_backing_page)
+ apic->regs = kvm_x86_call(alloc_apic_backing_page)(vcpu);
+ else
+ apic->regs = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (!apic->regs) {
printk(KERN_ERR "malloc apic regs error for vcpu %x\n",
vcpu->vcpu_id);
goto nomem_free_apic;
}
- apic->regs = page_address(apic->regs_page);
- memset(apic->regs, 0, PAGE_SIZE);
apic->vcpu = vcpu;
- hrtimer_init(&apic->lapic_timer.timer, CLOCK_MONOTONIC,
- HRTIMER_MODE_ABS);
- apic->lapic_timer.timer.function = kvm_timer_fn;
- apic->lapic_timer.t_ops = &lapic_timer_ops;
- apic->lapic_timer.kvm = vcpu->kvm;
- apic->lapic_timer.vcpu = vcpu;
+ apic->nr_lvt_entries = kvm_apic_calc_nr_lvt_entries(vcpu);
- apic->base_address = APIC_DEFAULT_PHYS_BASE;
- vcpu->arch.apic_base = APIC_DEFAULT_PHYS_BASE;
+ hrtimer_setup(&apic->lapic_timer.timer, apic_timer_fn, CLOCK_MONOTONIC,
+ HRTIMER_MODE_ABS_HARD);
+ if (lapic_timer_advance)
+ apic->lapic_timer.timer_advance_ns = LAPIC_TIMER_ADVANCE_NS_INIT;
- kvm_lapic_reset(vcpu);
+ /*
+ * Stuff the APIC ENABLE bit in lieu of temporarily incrementing
+ * apic_hw_disabled; the full RESET value is set by kvm_lapic_reset().
+ */
+ vcpu->arch.apic_base = MSR_IA32_APICBASE_ENABLE;
+ static_branch_inc(&apic_sw_disabled.key); /* sw disabled at reset */
kvm_iodevice_init(&apic->dev, &apic_mmio_ops);
+ /*
+ * Defer evaluating inhibits until the vCPU is first run, as this vCPU
+ * will not get notified of any changes until this vCPU is visible to
+ * other vCPUs (marked online and added to the set of vCPUs).
+ *
+ * Opportunistically mark APICv active, as VMX in particular is highly
+ * unlikely to have inhibits. Ignore the current per-VM APICv state so
+ * that vCPU creation is guaranteed to run with a deterministic value;
+ * the request will ensure the vCPU gets the correct state before VM-Entry.
+ */
+ if (enable_apicv) {
+ apic->apicv_active = true;
+ kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+ }
+
return 0;
nomem_free_apic:
kfree(apic);
+ vcpu->arch.apic = NULL;
nomem:
return -ENOMEM;
}
@@ -1088,113 +3065,287 @@ nomem:
int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- int highest_irr;
+ u32 ppr;
- if (!apic || !apic_enabled(apic))
+ if (!kvm_apic_present(vcpu))
return -1;
- apic_update_ppr(apic);
- highest_irr = apic_find_highest_irr(apic);
- if ((highest_irr == -1) ||
- ((highest_irr & 0xF0) <= apic_get_reg(apic, APIC_PROCPRI)))
+ if (apic->guest_apic_protected)
return -1;
- return highest_irr;
+
+ __apic_update_ppr(apic, &ppr);
+ return apic_has_interrupt_for_ppr(apic, ppr);
}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_has_interrupt);
int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
{
- u32 lvt0 = apic_get_reg(vcpu->arch.apic, APIC_LVT0);
- int r = 0;
+ u32 lvt0 = kvm_lapic_get_reg(vcpu->arch.apic, APIC_LVT0);
- if (kvm_vcpu_is_bsp(vcpu)) {
- if (!apic_hw_enabled(vcpu->arch.apic))
- r = 1;
- if ((lvt0 & APIC_LVT_MASKED) == 0 &&
- GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
- r = 1;
- }
- return r;
+ if (!kvm_apic_hw_enabled(vcpu->arch.apic))
+ return 1;
+ if ((lvt0 & APIC_LVT_MASKED) == 0 &&
+ GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
+ return 1;
+ return 0;
}
void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- if (apic && atomic_read(&apic->lapic_timer.pending) > 0) {
- if (kvm_apic_local_deliver(apic, APIC_LVTT))
- atomic_dec(&apic->lapic_timer.pending);
+ if (atomic_read(&apic->lapic_timer.pending) > 0) {
+ kvm_apic_inject_pending_timer_irqs(apic);
+ atomic_set(&apic->lapic_timer.pending, 0);
}
}
-int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
+void kvm_apic_ack_interrupt(struct kvm_vcpu *vcpu, int vector)
{
- int vector = kvm_apic_has_interrupt(vcpu);
struct kvm_lapic *apic = vcpu->arch.apic;
+ u32 ppr;
- if (vector == -1)
- return -1;
+ if (WARN_ON_ONCE(vector < 0 || !apic))
+ return;
+
+ /*
+ * We get here even with APIC virtualization enabled, if doing
+ * nested virtualization and L1 runs with the "acknowledge interrupt
+ * on exit" mode, in which case we cannot inject the interrupt via RVI,
+ * because the process would deliver it through the IDT.
+ */
- apic_set_vector(vector, apic->regs + APIC_ISR);
- apic_update_ppr(apic);
apic_clear_irr(vector, apic);
- return vector;
+ if (kvm_hv_synic_auto_eoi_set(vcpu, vector)) {
+ /*
+ * For auto-EOI interrupts, there might be another pending
+ * interrupt above PPR, so check whether to raise another
+ * KVM_REQ_EVENT.
+ */
+ apic_update_ppr(apic);
+ } else {
+ /*
+ * For normal interrupts, PPR has been raised and there cannot
+ * be a higher-priority pending interrupt---except if there was
+ * a concurrent interrupt injection, but that would have
+ * triggered KVM_REQ_EVENT already.
+ */
+ apic_set_isr(vector, apic);
+ __apic_update_ppr(apic, &ppr);
+ }
+
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_ack_interrupt);
+
+static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
+ struct kvm_lapic_state *s, bool set)
+{
+ if (apic_x2apic_mode(vcpu->arch.apic)) {
+ u32 x2apic_id = kvm_x2apic_id(vcpu->arch.apic);
+ u32 *id = (u32 *)(s->regs + APIC_ID);
+ u32 *ldr = (u32 *)(s->regs + APIC_LDR);
+ u64 icr;
+
+ if (vcpu->kvm->arch.x2apic_format) {
+ if (*id != x2apic_id)
+ return -EINVAL;
+ } else {
+ /*
+ * Ignore the userspace value when setting APIC state.
+ * KVM's model is that the x2APIC ID is readonly, e.g.
+ * KVM only supports delivering interrupts to KVM's
+ * version of the x2APIC ID. However, for backwards
+ * compatibility, don't reject attempts to set a
+ * mismatched ID for userspace that hasn't opted into
+ * x2apic_format.
+ */
+ if (set)
+ *id = x2apic_id;
+ else
+ *id = x2apic_id << 24;
+ }
+
+ /*
+ * In x2APIC mode, the LDR is fixed and based on the id. And
+ * if the ICR is _not_ split, ICR is internally a single 64-bit
+ * register, but needs to be split to ICR+ICR2 in userspace for
+ * backwards compatibility.
+ */
+ if (set)
+ *ldr = kvm_apic_calc_x2apic_ldr(x2apic_id);
+
+ if (!kvm_x86_ops.x2apic_icr_is_split) {
+ if (set) {
+ icr = apic_get_reg(s->regs, APIC_ICR) |
+ (u64)apic_get_reg(s->regs, APIC_ICR2) << 32;
+ apic_set_reg64(s->regs, APIC_ICR, icr);
+ } else {
+ icr = apic_get_reg64(s->regs, APIC_ICR);
+ apic_set_reg(s->regs, APIC_ICR2, icr >> 32);
+ }
+ }
+ }
+
+ return 0;
}
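+
+/*
+ * Worked example of the fixup above: for a vCPU whose x2APIC ID is 5, a get
+ * without x2apic_format converts APIC_ID to the xAPIC-style 5 << 24, a set
+ * without x2apic_format silently overwrites whatever ID userspace supplied
+ * with 5, and with x2apic_format a mismatched ID is rejected with -EINVAL.
+ */
+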
-void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu)
+int kvm_apic_get_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
+{
+ memcpy(s->regs, vcpu->arch.apic->regs, sizeof(*s));
+
+ /*
+ * Get the calculated current timer count for the remaining timer period (if
+ * any) and store it in the returned register set.
+ */
+ apic_set_reg(s->regs, APIC_TMCCT, __apic_read(vcpu->arch.apic, APIC_TMCCT));
+
+ return kvm_apic_state_fixup(vcpu, s, false);
+}
+
+int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
{
struct kvm_lapic *apic = vcpu->arch.apic;
+ int r;
- apic->base_address = vcpu->arch.apic_base &
- MSR_IA32_APICBASE_BASE;
+ kvm_x86_call(apicv_pre_state_restore)(vcpu);
+
+ /* set SPIV separately to get count of SW disabled APICs right */
+ apic_set_spiv(apic, *((u32 *)(s->regs + APIC_SPIV)));
+
+ r = kvm_apic_state_fixup(vcpu, s, true);
+ if (r) {
+ kvm_recalculate_apic_map(vcpu->kvm);
+ return r;
+ }
+ memcpy(vcpu->arch.apic->regs, s->regs, sizeof(*s));
+
+ atomic_set_release(&apic->vcpu->kvm->arch.apic_map_dirty, DIRTY);
+ kvm_recalculate_apic_map(vcpu->kvm);
kvm_apic_set_version(vcpu);
apic_update_ppr(apic);
- hrtimer_cancel(&apic->lapic_timer.timer);
+ cancel_apic_timer(apic);
+ apic->lapic_timer.expired_tscdeadline = 0;
+ apic_update_lvtt(apic);
+ apic_manage_nmi_watchdog(apic, kvm_lapic_get_reg(apic, APIC_LVT0));
update_divide_count(apic);
- start_apic_timer(apic);
- apic->irr_pending = true;
+ __start_apic_timer(apic, APIC_TMCCT);
+ kvm_lapic_set_reg(apic, APIC_TMCCT, 0);
+ kvm_apic_update_apicv(vcpu);
+ if (apic->apicv_active) {
+ kvm_x86_call(apicv_post_state_restore)(vcpu);
+ kvm_x86_call(hwapic_isr_update)(vcpu, apic_find_highest_isr(apic));
+ }
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+
+#ifdef CONFIG_KVM_IOAPIC
+ if (ioapic_in_kernel(vcpu->kvm))
+ kvm_rtc_eoi_tracking_restore_one(vcpu);
+#endif
+
+ vcpu->arch.apic_arb_prio = 0;
+
+ return 0;
}
void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
{
- struct kvm_lapic *apic = vcpu->arch.apic;
struct hrtimer *timer;
- if (!apic)
+ if (!lapic_in_kernel(vcpu) ||
+ kvm_can_post_timer_interrupt(vcpu))
return;
- timer = &apic->lapic_timer.timer;
+ timer = &vcpu->arch.apic->lapic_timer.timer;
if (hrtimer_cancel(timer))
- hrtimer_start_expires(timer, HRTIMER_MODE_ABS);
+ hrtimer_start_expires(timer, HRTIMER_MODE_ABS_HARD);
+}
+
+/*
+ * apic_sync_pv_eoi_from_guest - called on vmexit or cancel interrupt
+ *
+ * Detect whether the guest triggered PV EOI since the
+ * last entry. If yes, set EOI on the guest's behalf.
+ * Clear PV EOI in guest memory in any case.
+ */
+static void apic_sync_pv_eoi_from_guest(struct kvm_vcpu *vcpu,
+ struct kvm_lapic *apic)
+{
+ int vector;
+ /*
+ * PV EOI state is derived from KVM_APIC_PV_EOI_PENDING in host
+ * and KVM_PV_EOI_ENABLED in guest memory as follows:
+ *
+ * KVM_APIC_PV_EOI_PENDING is unset:
+ * -> host disabled PV EOI.
+ * KVM_APIC_PV_EOI_PENDING is set, KVM_PV_EOI_ENABLED is set:
+ * -> host enabled PV EOI, guest did not execute EOI yet.
+ * KVM_APIC_PV_EOI_PENDING is set, KVM_PV_EOI_ENABLED is unset:
+ * -> host enabled PV EOI, guest executed EOI.
+ */
+ BUG_ON(!pv_eoi_enabled(vcpu));
+
+ if (pv_eoi_test_and_clr_pending(vcpu))
+ return;
+ vector = apic_set_eoi(apic);
+ trace_kvm_pv_eoi(apic, vector);
}
void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu)
{
u32 data;
- void *vapic;
- if (!irqchip_in_kernel(vcpu->kvm) || !vcpu->arch.apic->vapic_addr)
+ if (test_bit(KVM_APIC_PV_EOI_PENDING, &vcpu->arch.apic_attention))
+ apic_sync_pv_eoi_from_guest(vcpu, vcpu->arch.apic);
+
+ if (!test_bit(KVM_APIC_CHECK_VAPIC, &vcpu->arch.apic_attention))
return;
- vapic = kmap_atomic(vcpu->arch.apic->vapic_page, KM_USER0);
- data = *(u32 *)(vapic + offset_in_page(vcpu->arch.apic->vapic_addr));
- kunmap_atomic(vapic, KM_USER0);
+ if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.apic->vapic_cache, &data,
+ sizeof(u32)))
+ return;
apic_set_tpr(vcpu->arch.apic, data & 0xff);
}
+/*
+ * apic_sync_pv_eoi_to_guest - called before vmentry
+ *
+ * Detect whether it's safe to enable PV EOI and
+ * if yes do so.
+ */
+static void apic_sync_pv_eoi_to_guest(struct kvm_vcpu *vcpu,
+ struct kvm_lapic *apic)
+{
+ if (!pv_eoi_enabled(vcpu) ||
+ /* IRR set or many bits in ISR: could be nested. */
+ apic->irr_pending ||
+ /* Cache not set: could be safe but we don't bother. */
+ apic->highest_isr_cache == -1 ||
+ /* Need EOI to update ioapic. */
+ kvm_ioapic_handles_vector(apic, apic->highest_isr_cache)) {
+ /*
+ * PV EOI was disabled by apic_sync_pv_eoi_from_guest
+ * so we need not do anything here.
+ */
+ return;
+ }
+
+ pv_eoi_set_pending(apic->vcpu);
+}
+
void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu)
{
u32 data, tpr;
int max_irr, max_isr;
- struct kvm_lapic *apic;
- void *vapic;
+ struct kvm_lapic *apic = vcpu->arch.apic;
- if (!irqchip_in_kernel(vcpu->kvm) || !vcpu->arch.apic->vapic_addr)
+ apic_sync_pv_eoi_to_guest(vcpu, apic);
+
+ if (!test_bit(KVM_APIC_CHECK_VAPIC, &vcpu->arch.apic_attention))
return;
- apic = vcpu->arch.apic;
- tpr = apic_get_reg(apic, APIC_TASKPRI) & 0xff;
+ tpr = kvm_lapic_get_reg(apic, APIC_TASKPRI) & 0xff;
max_irr = apic_find_highest_irr(apic);
if (max_irr < 0)
max_irr = 0;
@@ -1203,17 +3354,58 @@ void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu)
max_isr = 0;
data = (tpr & 0xff) | ((max_isr & 0xf0) << 8) | (max_irr << 24);
- vapic = kmap_atomic(vcpu->arch.apic->vapic_page, KM_USER0);
- *(u32 *)(vapic + offset_in_page(vcpu->arch.apic->vapic_addr)) = data;
- kunmap_atomic(vapic, KM_USER0);
+ kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apic->vapic_cache, &data,
+ sizeof(u32));
}
-void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr)
+int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr)
{
- if (!irqchip_in_kernel(vcpu->kvm))
- return;
+ if (vapic_addr) {
+ if (kvm_gfn_to_hva_cache_init(vcpu->kvm,
+ &vcpu->arch.apic->vapic_cache,
+ vapic_addr, sizeof(u32)))
+ return -EINVAL;
+ __set_bit(KVM_APIC_CHECK_VAPIC, &vcpu->arch.apic_attention);
+ } else {
+ __clear_bit(KVM_APIC_CHECK_VAPIC, &vcpu->arch.apic_attention);
+ }
vcpu->arch.apic->vapic_addr = vapic_addr;
+ return 0;
+}
+
+static int kvm_lapic_msr_read(struct kvm_lapic *apic, u32 reg, u64 *data)
+{
+ u32 low;
+
+ if (reg == APIC_ICR) {
+ *data = kvm_x2apic_icr_read(apic);
+ return 0;
+ }
+
+ if (kvm_lapic_reg_read(apic, reg, 4, &low))
+ return 1;
+
+ *data = low;
+
+ return 0;
+}
+
+static int kvm_lapic_msr_write(struct kvm_lapic *apic, u32 reg, u64 data)
+{
+ /*
+ * ICR is a 64-bit register in x2APIC mode (and Hyper-V PV vAPIC) and
+ * can be written as such; all other registers remain accessible only
+ * through 32-bit reads/writes.
+ */
+ if (reg == APIC_ICR)
+ return kvm_x2apic_icr_write(apic, data);
+
+ /* Bits 63:32 are reserved in all other registers. */
+ if (data >> 32)
+ return 1;
+
+ return kvm_lapic_reg_write(apic, reg, (u32)data);
}
int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data)
@@ -1221,60 +3413,120 @@ int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data)
struct kvm_lapic *apic = vcpu->arch.apic;
u32 reg = (msr - APIC_BASE_MSR) << 4;
- if (!irqchip_in_kernel(vcpu->kvm) || !apic_x2apic_mode(apic))
+ if (!lapic_in_kernel(vcpu) || !apic_x2apic_mode(apic))
return 1;
- /* if this is ICR write vector before command */
- if (msr == 0x830)
- apic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
- return apic_reg_write(apic, reg, (u32)data);
+ return kvm_lapic_msr_write(apic, reg, data);
}
int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- u32 reg = (msr - APIC_BASE_MSR) << 4, low, high = 0;
+ u32 reg = (msr - APIC_BASE_MSR) << 4;
- if (!irqchip_in_kernel(vcpu->kvm) || !apic_x2apic_mode(apic))
+ if (!lapic_in_kernel(vcpu) || !apic_x2apic_mode(apic))
return 1;
- if (apic_reg_read(apic, reg, 4, &low))
+ return kvm_lapic_msr_read(apic, reg, data);
+}
+
+int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 reg, u64 data)
+{
+ if (!lapic_in_kernel(vcpu))
return 1;
- if (msr == 0x830)
- apic_reg_read(apic, APIC_ICR2, 4, &high);
- *data = (((u64)high) << 32) | low;
+ return kvm_lapic_msr_write(vcpu->arch.apic, reg, data);
+}
- return 0;
+int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)
+{
+ if (!lapic_in_kernel(vcpu))
+ return 1;
+
+ return kvm_lapic_msr_read(vcpu->arch.apic, reg, data);
}
-int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 reg, u64 data)
+int kvm_lapic_set_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len)
{
- struct kvm_lapic *apic = vcpu->arch.apic;
+ u64 addr = data & ~KVM_MSR_ENABLED;
+ struct gfn_to_hva_cache *ghc = &vcpu->arch.pv_eoi.data;
+ unsigned long new_len;
+ int ret;
- if (!irqchip_in_kernel(vcpu->kvm))
+ if (!IS_ALIGNED(addr, 4))
return 1;
- /* if this is ICR write vector before command */
- if (reg == APIC_ICR)
- apic_reg_write(apic, APIC_ICR2, (u32)(data >> 32));
- return apic_reg_write(apic, reg, (u32)data);
+ if (data & KVM_MSR_ENABLED) {
+ if (addr == ghc->gpa && len <= ghc->len)
+ new_len = ghc->len;
+ else
+ new_len = len;
+
+ ret = kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, addr, new_len);
+ if (ret)
+ return ret;
+ }
+
+ vcpu->arch.pv_eoi.msr_val = data;
+
+ return 0;
}
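+
+/*
+ * For reference, the MSR_KVM_PV_EOI_EN format handled above: bit 0 is
+ * KVM_MSR_ENABLED and the remaining bits carry the 4-byte aligned guest
+ * address of the PV EOI flag word, i.e. a guest enables the feature with
+ * something like:
+ *
+ *   wrmsrl(MSR_KVM_PV_EOI_EN, eoi_flag_gpa | KVM_MSR_ENABLED);
+ */
+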
-int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 reg, u64 *data)
+int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- u32 low, high = 0;
+ u8 sipi_vector;
+ int r;
- if (!irqchip_in_kernel(vcpu->kvm))
- return 1;
+ if (!kvm_apic_has_pending_init_or_sipi(vcpu))
+ return 0;
- if (apic_reg_read(apic, reg, 4, &low))
- return 1;
- if (reg == APIC_ICR)
- apic_reg_read(apic, APIC_ICR2, 4, &high);
+ if (is_guest_mode(vcpu)) {
+ r = kvm_check_nested_events(vcpu);
+ if (r < 0)
+ return r == -EBUSY ? 0 : r;
+ /*
+ * Continue processing INIT/SIPI even if a nested VM-Exit
+ * occurred, e.g. pending SIPIs should be dropped if INIT+SIPI
+ * are blocked as a result of transitioning to VMX root mode.
+ */
+ }
- *data = (((u64)high) << 32) | low;
+ /*
+ * INITs are blocked while CPU is in specific states (SMM, VMX root
+ * mode, SVM with GIF=0), while SIPIs are dropped if the CPU isn't in
+ * wait-for-SIPI (WFS).
+ */
+ if (!kvm_apic_init_sipi_allowed(vcpu)) {
+ WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
+ clear_bit(KVM_APIC_SIPI, &apic->pending_events);
+ return 0;
+ }
+ if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
+ kvm_vcpu_reset(vcpu, true);
+ if (kvm_vcpu_is_bsp(apic->vcpu))
+ kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
+ else
+ kvm_set_mp_state(vcpu, KVM_MP_STATE_INIT_RECEIVED);
+ }
+ if (test_and_clear_bit(KVM_APIC_SIPI, &apic->pending_events)) {
+ if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
+ /* evaluate pending_events before reading the vector */
+ smp_rmb();
+ sipi_vector = apic->sipi_vector;
+ kvm_x86_call(vcpu_deliver_sipi_vector)(vcpu,
+ sipi_vector);
+ kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
+ }
+ }
return 0;
}
+
+void kvm_lapic_exit(void)
+{
+ static_key_deferred_flush(&apic_hw_disabled);
+ WARN_ON(static_branch_unlikely(&apic_hw_disabled.key));
+ static_key_deferred_flush(&apic_sw_disabled);
+ WARN_ON(static_branch_unlikely(&apic_sw_disabled.key));
+}
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index f5fe32c5edad..282b9b7da98c 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -1,59 +1,262 @@
+/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __KVM_X86_LAPIC_H
#define __KVM_X86_LAPIC_H
-#include "iodev.h"
-#include "kvm_timer.h"
+#include <kvm/iodev.h>
+
+#include <asm/apic.h>
#include <linux/kvm_host.h>
+#include "hyperv.h"
+#include "smm.h"
+
+#define KVM_APIC_INIT 0
+#define KVM_APIC_SIPI 1
+
+#define APIC_SHORT_MASK 0xc0000
+#define APIC_DEST_NOSHORT 0x0
+#define APIC_DEST_MASK 0x800
+
+#define APIC_BUS_CYCLE_NS_DEFAULT 1
+
+#define APIC_BROADCAST 0xFF
+#define X2APIC_BROADCAST 0xFFFFFFFFul
+
+#define X2APIC_MSR(r) (APIC_BASE_MSR + ((r) >> 4))
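+
+/*
+ * e.g. X2APIC_MSR(APIC_ICR) == 0x800 + (0x300 >> 4) == 0x830, the inverse
+ * of the reg = (msr - APIC_BASE_MSR) << 4 conversion in the MSR handlers.
+ */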
+
+enum lapic_mode {
+ LAPIC_MODE_DISABLED = 0,
+ LAPIC_MODE_INVALID = X2APIC_ENABLE,
+ LAPIC_MODE_XAPIC = MSR_IA32_APICBASE_ENABLE,
+ LAPIC_MODE_X2APIC = MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE,
+};
+
+enum lapic_lvt_entry {
+ LVT_TIMER,
+ LVT_THERMAL_MONITOR,
+ LVT_PERFORMANCE_COUNTER,
+ LVT_LINT0,
+ LVT_LINT1,
+ LVT_ERROR,
+ LVT_CMCI,
+
+ KVM_APIC_MAX_NR_LVT_ENTRIES,
+};
+
+#define APIC_LVTx(x) ((x) == LVT_CMCI ? APIC_LVTCMCI : APIC_LVTT + 0x10 * (x))
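+
+/*
+ * e.g. APIC_LVTx(LVT_ERROR) == APIC_LVTT + 0x50 == APIC_LVTERR, while
+ * APIC_LVTx(LVT_CMCI) maps to the non-contiguous APIC_LVTCMCI register.
+ */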
+
+struct kvm_timer {
+ struct hrtimer timer;
+ s64 period; /* unit: ns */
+ ktime_t target_expiration;
+ u32 timer_mode;
+ u32 timer_mode_mask;
+ u64 tscdeadline;
+ u64 expired_tscdeadline;
+ u32 timer_advance_ns;
+ atomic_t pending; /* accumulated triggered timers */
+ bool hv_timer_in_use;
+};
+
struct kvm_lapic {
unsigned long base_address;
struct kvm_io_device dev;
struct kvm_timer lapic_timer;
u32 divide_count;
struct kvm_vcpu *vcpu;
+ bool apicv_active;
+ bool sw_enabled;
bool irr_pending;
- struct page *regs_page;
+ bool lvt0_in_nmi_mode;
+ /* Select registers in the vAPIC cannot be read/written. */
+ bool guest_apic_protected;
+ /* Number of bits set in ISR. */
+ s16 isr_count;
+ /* The highest vector set in ISR; -1 means invalid, must scan ISR. */
+ int highest_isr_cache;
+ /**
+ * APIC register page. The layout matches the register layout seen by
+ * the guest 1:1, because it is accessed by the vmx microcode.
+ * Note: Only one register, the TPR, is used by the microcode.
+ */
void *regs;
gpa_t vapic_addr;
- struct page *vapic_page;
+ struct gfn_to_hva_cache vapic_cache;
+ unsigned long pending_events;
+ unsigned int sipi_vector;
+ int nr_lvt_entries;
};
+
+struct dest_map;
+
int kvm_create_lapic(struct kvm_vcpu *vcpu);
void kvm_free_lapic(struct kvm_vcpu *vcpu);
int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);
+void kvm_apic_ack_interrupt(struct kvm_vcpu *vcpu, int vector);
int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu);
-int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
-void kvm_lapic_reset(struct kvm_vcpu *vcpu);
+int kvm_apic_accept_events(struct kvm_vcpu *vcpu);
+void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event);
u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
-void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
-u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
void kvm_apic_set_version(struct kvm_vcpu *vcpu);
+void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu);
+bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
+ int shorthand, unsigned int dest, int dest_mode);
+void kvm_apic_clear_irr(struct kvm_vcpu *vcpu, int vec);
+bool __kvm_apic_update_irr(unsigned long *pir, void *regs, int *max_irr);
+bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, unsigned long *pir, int *max_irr);
+void kvm_apic_update_ppr(struct kvm_vcpu *vcpu);
+int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
+ struct dest_map *dest_map);
+int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
+void kvm_apic_update_apicv(struct kvm_vcpu *vcpu);
+int kvm_alloc_apic_access_page(struct kvm *kvm);
+void kvm_inhibit_apic_access_page(struct kvm_vcpu *vcpu);
-int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest);
-int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
-int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
+bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
+ struct kvm_lapic_irq *irq, int *r, struct dest_map *dest_map);
+int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
+ struct kvm_lapic_irq *irq,
+ struct dest_map *dest_map);
+void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high);
-u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
-void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
-void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu);
-int kvm_lapic_enabled(struct kvm_vcpu *vcpu);
-bool kvm_apic_present(struct kvm_vcpu *vcpu);
+int kvm_apic_set_base(struct kvm_vcpu *vcpu, u64 value, bool host_initiated);
+int kvm_apic_get_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s);
+int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s);
+void kvm_apic_update_hwapic_isr(struct kvm_vcpu *vcpu);
int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
-void kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
+u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);
+void kvm_set_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu, u64 data);
+
+void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset);
+void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, int vector);
+
+int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr);
void kvm_lapic_sync_from_vapic(struct kvm_vcpu *vcpu);
void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu);
+int kvm_x2apic_icr_write_fast(struct kvm_lapic *apic, u64 data);
int kvm_x2apic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data);
int kvm_x2apic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data);
int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
-static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
+int kvm_lapic_set_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len);
+void kvm_lapic_exit(void);
+
+u64 kvm_lapic_readable_reg_mask(struct kvm_lapic *apic);
+
+static inline void kvm_lapic_set_irr(int vec, struct kvm_lapic *apic)
+{
+ apic_set_vector(vec, apic->regs + APIC_IRR);
+ /*
+ * irr_pending must be true if any interrupt is pending; set it after
+ * APIC_IRR to avoid a race with apic_clear_irr()
+ */
+ apic->irr_pending = true;
+}
+
+static inline u32 kvm_lapic_get_reg(struct kvm_lapic *apic, int reg_off)
+{
+ return apic_get_reg(apic->regs, reg_off);
+}
+
+DECLARE_STATIC_KEY_FALSE(kvm_has_noapic_vcpu);
+
+static inline bool lapic_in_kernel(struct kvm_vcpu *vcpu)
+{
+ if (static_branch_unlikely(&kvm_has_noapic_vcpu))
+ return vcpu->arch.apic;
+ return true;
+}
+
+extern struct static_key_false_deferred apic_hw_disabled;
+
+static inline bool kvm_apic_hw_enabled(struct kvm_lapic *apic)
+{
+ if (static_branch_unlikely(&apic_hw_disabled.key))
+ return apic->vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE;
+ return true;
+}
+
+extern struct static_key_false_deferred apic_sw_disabled;
+
+static inline bool kvm_apic_sw_enabled(struct kvm_lapic *apic)
{
- return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
+ if (static_branch_unlikely(&apic_sw_disabled.key))
+ return apic->sw_enabled;
+ return true;
}
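+
+/*
+ * Design note: both helpers above rely on a deferred static key so that in
+ * the common case, where no vAPIC is hardware- or software-disabled, the
+ * checks compile down to an unconditional "return true" with no load from
+ * the kvm_lapic structure.
+ */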
+
+static inline bool kvm_apic_present(struct kvm_vcpu *vcpu)
+{
+ return lapic_in_kernel(vcpu) && kvm_apic_hw_enabled(vcpu->arch.apic);
+}
+
+static inline int kvm_lapic_enabled(struct kvm_vcpu *vcpu)
+{
+ return kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic);
+}
+
+static inline int apic_x2apic_mode(struct kvm_lapic *apic)
+{
+ return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
+}
+
+static inline bool kvm_vcpu_apicv_active(struct kvm_vcpu *vcpu)
+{
+ return lapic_in_kernel(vcpu) && vcpu->arch.apic->apicv_active;
+}
+
+static inline bool kvm_apic_has_pending_init_or_sipi(struct kvm_vcpu *vcpu)
+{
+ return lapic_in_kernel(vcpu) && vcpu->arch.apic->pending_events;
+}
+
+static inline bool kvm_apic_init_sipi_allowed(struct kvm_vcpu *vcpu)
+{
+ return !is_smm(vcpu) &&
+ !kvm_x86_call(apic_init_signal_blocked)(vcpu);
+}
+
+static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu)
+{
+ return lapic_in_kernel(vcpu) && test_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
+}
+
+bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
+
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu);
+
+void kvm_bitmap_or_dest_vcpus(struct kvm *kvm, struct kvm_lapic_irq *irq,
+ unsigned long *vcpu_bitmap);
+
+bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
+ struct kvm_vcpu **dest_vcpu);
+void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu);
+void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu);
+void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
+bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu);
+void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu);
+
+static inline enum lapic_mode kvm_apic_mode(u64 apic_base)
+{
+ return apic_base & (MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
+}
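+
+/*
+ * e.g. kvm_apic_mode(APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE) is
+ * LAPIC_MODE_XAPIC, while a base with X2APIC_ENABLE set but
+ * MSR_IA32_APICBASE_ENABLE clear decodes to LAPIC_MODE_INVALID.
+ */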
+
+static inline enum lapic_mode kvm_get_apic_mode(struct kvm_vcpu *vcpu)
+{
+ return kvm_apic_mode(vcpu->arch.apic_base);
+}
+
+static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
+{
+ return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
+}
+
#endif
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
deleted file mode 100644
index b1ed0a1a5913..000000000000
--- a/arch/x86/kvm/mmu.c
+++ /dev/null
@@ -1,3442 +0,0 @@
-/*
- * Kernel-based Virtual Machine driver for Linux
- *
- * This module enables machines with Intel VT-x extensions to run virtual
- * machines without emulation or binary translation.
- *
- * MMU support
- *
- * Copyright (C) 2006 Qumranet, Inc.
- *
- * Authors:
- * Yaniv Kamay <yaniv@qumranet.com>
- * Avi Kivity <avi@qumranet.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2. See
- * the COPYING file in the top-level directory.
- *
- */
-
-#include "mmu.h"
-#include "x86.h"
-#include "kvm_cache_regs.h"
-
-#include <linux/kvm_host.h>
-#include <linux/types.h>
-#include <linux/string.h>
-#include <linux/mm.h>
-#include <linux/highmem.h>
-#include <linux/module.h>
-#include <linux/swap.h>
-#include <linux/hugetlb.h>
-#include <linux/compiler.h>
-#include <linux/srcu.h>
-#include <linux/slab.h>
-
-#include <asm/page.h>
-#include <asm/cmpxchg.h>
-#include <asm/io.h>
-#include <asm/vmx.h>
-
-/*
- * When setting this variable to true it enables Two-Dimensional-Paging
- * where the hardware walks 2 page tables:
- * 1. the guest-virtual to guest-physical
- * 2. while doing 1. it walks guest-physical to host-physical
- * If the hardware supports that we don't need to do shadow paging.
- */
-bool tdp_enabled = false;
-
-#undef MMU_DEBUG
-
-#undef AUDIT
-
-#ifdef AUDIT
-static void kvm_mmu_audit(struct kvm_vcpu *vcpu, const char *msg);
-#else
-static void kvm_mmu_audit(struct kvm_vcpu *vcpu, const char *msg) {}
-#endif
-
-#ifdef MMU_DEBUG
-
-#define pgprintk(x...) do { if (dbg) printk(x); } while (0)
-#define rmap_printk(x...) do { if (dbg) printk(x); } while (0)
-
-#else
-
-#define pgprintk(x...) do { } while (0)
-#define rmap_printk(x...) do { } while (0)
-
-#endif
-
-#if defined(MMU_DEBUG) || defined(AUDIT)
-static int dbg = 0;
-module_param(dbg, bool, 0644);
-#endif
-
-static int oos_shadow = 1;
-module_param(oos_shadow, bool, 0644);
-
-#ifndef MMU_DEBUG
-#define ASSERT(x) do { } while (0)
-#else
-#define ASSERT(x) \
- if (!(x)) { \
- printk(KERN_WARNING "assertion failed %s:%d: %s\n", \
- __FILE__, __LINE__, #x); \
- }
-#endif
-
-#define PT_FIRST_AVAIL_BITS_SHIFT 9
-#define PT64_SECOND_AVAIL_BITS_SHIFT 52
-
-#define VALID_PAGE(x) ((x) != INVALID_PAGE)
-
-#define PT64_LEVEL_BITS 9
-
-#define PT64_LEVEL_SHIFT(level) \
- (PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
-
-#define PT64_LEVEL_MASK(level) \
- (((1ULL << PT64_LEVEL_BITS) - 1) << PT64_LEVEL_SHIFT(level))
-
-#define PT64_INDEX(address, level)\
- (((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))
-
-
-#define PT32_LEVEL_BITS 10
-
-#define PT32_LEVEL_SHIFT(level) \
- (PAGE_SHIFT + (level - 1) * PT32_LEVEL_BITS)
-
-#define PT32_LEVEL_MASK(level) \
- (((1ULL << PT32_LEVEL_BITS) - 1) << PT32_LEVEL_SHIFT(level))
-#define PT32_LVL_OFFSET_MASK(level) \
- (PT32_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT32_LEVEL_BITS))) - 1))
-
-#define PT32_INDEX(address, level)\
- (((address) >> PT32_LEVEL_SHIFT(level)) & ((1 << PT32_LEVEL_BITS) - 1))
-
-
-#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
-#define PT64_DIR_BASE_ADDR_MASK \
- (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + PT64_LEVEL_BITS)) - 1))
-#define PT64_LVL_ADDR_MASK(level) \
- (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT64_LEVEL_BITS))) - 1))
-#define PT64_LVL_OFFSET_MASK(level) \
- (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT64_LEVEL_BITS))) - 1))
-
-#define PT32_BASE_ADDR_MASK PAGE_MASK
-#define PT32_DIR_BASE_ADDR_MASK \
- (PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
-#define PT32_LVL_ADDR_MASK(level) \
- (PAGE_MASK & ~((1ULL << (PAGE_SHIFT + (((level) - 1) \
- * PT32_LEVEL_BITS))) - 1))
-
-#define PT64_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | PT_USER_MASK \
- | PT64_NX_MASK)
-
-#define RMAP_EXT 4
-
-#define ACC_EXEC_MASK 1
-#define ACC_WRITE_MASK PT_WRITABLE_MASK
-#define ACC_USER_MASK PT_USER_MASK
-#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
-
-#include <trace/events/kvm.h>
-
-#define CREATE_TRACE_POINTS
-#include "mmutrace.h"
-
-#define SPTE_HOST_WRITEABLE (1ULL << PT_FIRST_AVAIL_BITS_SHIFT)
-
-#define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level)
-
-struct kvm_rmap_desc {
- u64 *sptes[RMAP_EXT];
- struct kvm_rmap_desc *more;
-};
-
-struct kvm_shadow_walk_iterator {
- u64 addr;
- hpa_t shadow_addr;
- int level;
- u64 *sptep;
- unsigned index;
-};
-
-#define for_each_shadow_entry(_vcpu, _addr, _walker) \
- for (shadow_walk_init(&(_walker), _vcpu, _addr); \
- shadow_walk_okay(&(_walker)); \
- shadow_walk_next(&(_walker)))
-
-typedef int (*mmu_parent_walk_fn) (struct kvm_mmu_page *sp);
-
-static struct kmem_cache *pte_chain_cache;
-static struct kmem_cache *rmap_desc_cache;
-static struct kmem_cache *mmu_page_header_cache;
-
-static u64 __read_mostly shadow_trap_nonpresent_pte;
-static u64 __read_mostly shadow_notrap_nonpresent_pte;
-static u64 __read_mostly shadow_base_present_pte;
-static u64 __read_mostly shadow_nx_mask;
-static u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
-static u64 __read_mostly shadow_user_mask;
-static u64 __read_mostly shadow_accessed_mask;
-static u64 __read_mostly shadow_dirty_mask;
-
-static inline u64 rsvd_bits(int s, int e)
-{
- return ((1ULL << (e - s + 1)) - 1) << s;
-}
-
-void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte)
-{
- shadow_trap_nonpresent_pte = trap_pte;
- shadow_notrap_nonpresent_pte = notrap_pte;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_set_nonpresent_ptes);
-
-void kvm_mmu_set_base_ptes(u64 base_pte)
-{
- shadow_base_present_pte = base_pte;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_set_base_ptes);
-
-void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
- u64 dirty_mask, u64 nx_mask, u64 x_mask)
-{
- shadow_user_mask = user_mask;
- shadow_accessed_mask = accessed_mask;
- shadow_dirty_mask = dirty_mask;
- shadow_nx_mask = nx_mask;
- shadow_x_mask = x_mask;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
-
-static bool is_write_protection(struct kvm_vcpu *vcpu)
-{
- return kvm_read_cr0_bits(vcpu, X86_CR0_WP);
-}
-
-static int is_cpuid_PSE36(void)
-{
- return 1;
-}
-
-static int is_nx(struct kvm_vcpu *vcpu)
-{
- return vcpu->arch.efer & EFER_NX;
-}
-
-static int is_shadow_present_pte(u64 pte)
-{
- return pte != shadow_trap_nonpresent_pte
- && pte != shadow_notrap_nonpresent_pte;
-}
-
-static int is_large_pte(u64 pte)
-{
- return pte & PT_PAGE_SIZE_MASK;
-}
-
-static int is_writable_pte(unsigned long pte)
-{
- return pte & PT_WRITABLE_MASK;
-}
-
-static int is_dirty_gpte(unsigned long pte)
-{
- return pte & PT_DIRTY_MASK;
-}
-
-static int is_rmap_spte(u64 pte)
-{
- return is_shadow_present_pte(pte);
-}
-
-static int is_last_spte(u64 pte, int level)
-{
- if (level == PT_PAGE_TABLE_LEVEL)
- return 1;
- if (is_large_pte(pte))
- return 1;
- return 0;
-}
-
-static pfn_t spte_to_pfn(u64 pte)
-{
- return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
-}
-
-static gfn_t pse36_gfn_delta(u32 gpte)
-{
- int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
-
- return (gpte & PT32_DIR_PSE36_MASK) << shift;
-}
-
-static void __set_spte(u64 *sptep, u64 spte)
-{
-#ifdef CONFIG_X86_64
- set_64bit((unsigned long *)sptep, spte);
-#else
- set_64bit((unsigned long long *)sptep, spte);
-#endif
-}
-
-static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
- struct kmem_cache *base_cache, int min)
-{
- void *obj;
-
- if (cache->nobjs >= min)
- return 0;
- while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
- obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
- if (!obj)
- return -ENOMEM;
- cache->objects[cache->nobjs++] = obj;
- }
- return 0;
-}
-
-static void mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc)
-{
- while (mc->nobjs)
- kfree(mc->objects[--mc->nobjs]);
-}
-
-static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
- int min)
-{
- struct page *page;
-
- if (cache->nobjs >= min)
- return 0;
- while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
- page = alloc_page(GFP_KERNEL);
- if (!page)
- return -ENOMEM;
- cache->objects[cache->nobjs++] = page_address(page);
- }
- return 0;
-}
-
-static void mmu_free_memory_cache_page(struct kvm_mmu_memory_cache *mc)
-{
- while (mc->nobjs)
- free_page((unsigned long)mc->objects[--mc->nobjs]);
-}
-
-static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
-{
- int r;
-
- r = mmu_topup_memory_cache(&vcpu->arch.mmu_pte_chain_cache,
- pte_chain_cache, 4);
- if (r)
- goto out;
- r = mmu_topup_memory_cache(&vcpu->arch.mmu_rmap_desc_cache,
- rmap_desc_cache, 4);
- if (r)
- goto out;
- r = mmu_topup_memory_cache_page(&vcpu->arch.mmu_page_cache, 8);
- if (r)
- goto out;
- r = mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
- mmu_page_header_cache, 4);
-out:
- return r;
-}
-
-static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
-{
- mmu_free_memory_cache(&vcpu->arch.mmu_pte_chain_cache);
- mmu_free_memory_cache(&vcpu->arch.mmu_rmap_desc_cache);
- mmu_free_memory_cache_page(&vcpu->arch.mmu_page_cache);
- mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
-}
-
-static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc,
- size_t size)
-{
- void *p;
-
- BUG_ON(!mc->nobjs);
- p = mc->objects[--mc->nobjs];
- return p;
-}
-
-static struct kvm_pte_chain *mmu_alloc_pte_chain(struct kvm_vcpu *vcpu)
-{
- return mmu_memory_cache_alloc(&vcpu->arch.mmu_pte_chain_cache,
- sizeof(struct kvm_pte_chain));
-}
-
-static void mmu_free_pte_chain(struct kvm_pte_chain *pc)
-{
- kfree(pc);
-}
-
-static struct kvm_rmap_desc *mmu_alloc_rmap_desc(struct kvm_vcpu *vcpu)
-{
- return mmu_memory_cache_alloc(&vcpu->arch.mmu_rmap_desc_cache,
- sizeof(struct kvm_rmap_desc));
-}
-
-static void mmu_free_rmap_desc(struct kvm_rmap_desc *rd)
-{
- kfree(rd);
-}
-
-/*
- * Return the pointer to the largepage write count for a given
- * gfn, handling slots that are not large page aligned.
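- *
- * Worked example (hypothetical values; assumes 4K pages, so that
- * KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL) == 512): for a slot with
- * base_gfn == 1600 and gfn == 2200, idx = 2200/512 - 1600/512 =
- * 4 - 3 = 1; the separate divisions keep the result correct even
- * though the slot does not start on a 512-page boundary.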
- */
-static int *slot_largepage_idx(gfn_t gfn,
- struct kvm_memory_slot *slot,
- int level)
-{
- unsigned long idx;
-
- idx = (gfn / KVM_PAGES_PER_HPAGE(level)) -
- (slot->base_gfn / KVM_PAGES_PER_HPAGE(level));
- return &slot->lpage_info[level - 2][idx].write_count;
-}
-
-static void account_shadowed(struct kvm *kvm, gfn_t gfn)
-{
- struct kvm_memory_slot *slot;
- int *write_count;
- int i;
-
- gfn = unalias_gfn(kvm, gfn);
-
- slot = gfn_to_memslot_unaliased(kvm, gfn);
- for (i = PT_DIRECTORY_LEVEL;
- i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
- write_count = slot_largepage_idx(gfn, slot, i);
- *write_count += 1;
- }
-}
-
-static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
-{
- struct kvm_memory_slot *slot;
- int *write_count;
- int i;
-
- gfn = unalias_gfn(kvm, gfn);
- slot = gfn_to_memslot_unaliased(kvm, gfn);
- for (i = PT_DIRECTORY_LEVEL;
- i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
- write_count = slot_largepage_idx(gfn, slot, i);
- *write_count -= 1;
- WARN_ON(*write_count < 0);
- }
-}
-
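-/*
- * Added commentary: a non-zero write count means at least one shadow
- * page accounted by account_shadowed() lies inside this large frame,
- * so mapping the frame with a large spte would be unsafe.  Returning
- * 1 when no slot is found errs on the safe side.
- */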
-static int has_wrprotected_page(struct kvm *kvm,
- gfn_t gfn,
- int level)
-{
- struct kvm_memory_slot *slot;
- int *largepage_idx;
-
- gfn = unalias_gfn(kvm, gfn);
- slot = gfn_to_memslot_unaliased(kvm, gfn);
- if (slot) {
- largepage_idx = slot_largepage_idx(gfn, slot, level);
- return *largepage_idx;
- }
-
- return 1;
-}
-
-static int host_mapping_level(struct kvm *kvm, gfn_t gfn)
-{
- unsigned long page_size;
- int i, ret = 0;
-
- page_size = kvm_host_page_size(kvm, gfn);
-
- for (i = PT_PAGE_TABLE_LEVEL;
- i < (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES); ++i) {
- if (page_size >= KVM_HPAGE_SIZE(i))
- ret = i;
- else
- break;
- }
-
- return ret;
-}
-
-static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
-{
- struct kvm_memory_slot *slot;
- int host_level, level, max_level;
-
- slot = gfn_to_memslot(vcpu->kvm, large_gfn);
- if (slot && slot->dirty_bitmap)
- return PT_PAGE_TABLE_LEVEL;
-
- host_level = host_mapping_level(vcpu->kvm, large_gfn);
-
- if (host_level == PT_PAGE_TABLE_LEVEL)
- return host_level;
-
- max_level = kvm_x86_ops->get_lpage_level() < host_level ?
- kvm_x86_ops->get_lpage_level() : host_level;
-
- for (level = PT_DIRECTORY_LEVEL; level <= max_level; ++level)
- if (has_wrprotected_page(vcpu->kvm, large_gfn, level))
- break;
-
- return level - 1;
-}
-
-/*
- * Take gfn and return the reverse mapping to it.
- * Note: gfn must be unaliased before this function is called
- */
-
-static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
-{
- struct kvm_memory_slot *slot;
- unsigned long idx;
-
- slot = gfn_to_memslot(kvm, gfn);
- if (likely(level == PT_PAGE_TABLE_LEVEL))
- return &slot->rmap[gfn - slot->base_gfn];
-
- idx = (gfn / KVM_PAGES_PER_HPAGE(level)) -
- (slot->base_gfn / KVM_PAGES_PER_HPAGE(level));
-
- return &slot->lpage_info[level - 2][idx].rmap_pde;
-}
-
-/*
- * Reverse mapping data structures:
- *
- * If rmapp bit zero is zero, then rmapp points to the shadow page table entry
- * that points to page_address(page).
- *
- * If rmapp bit zero is one, then (rmapp & ~1) points to a struct kvm_rmap_desc
- * containing more mappings.
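- *
- * Example (hypothetical addresses): with a single spte at
- * 0xffff880001234560, *rmapp == 0xffff880001234560; bit zero is clear
- * because spte pointers are at least 8-byte aligned.  Once a second
- * spte is added, *rmapp == (unsigned long)desc | 1 and the two spte
- * pointers live in desc->sptes[0] and desc->sptes[1].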
- *
- * Returns the number of rmap entries before the spte was added or zero if
- * the spte was not added.
- *
- */
-static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
-{
- struct kvm_mmu_page *sp;
- struct kvm_rmap_desc *desc;
- unsigned long *rmapp;
- int i, count = 0;
-
- if (!is_rmap_spte(*spte))
- return count;
- gfn = unalias_gfn(vcpu->kvm, gfn);
- sp = page_header(__pa(spte));
- sp->gfns[spte - sp->spt] = gfn;
- rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
- if (!*rmapp) {
- rmap_printk("rmap_add: %p %llx 0->1\n", spte, *spte);
- *rmapp = (unsigned long)spte;
- } else if (!(*rmapp & 1)) {
- rmap_printk("rmap_add: %p %llx 1->many\n", spte, *spte);
- desc = mmu_alloc_rmap_desc(vcpu);
- desc->sptes[0] = (u64 *)*rmapp;
- desc->sptes[1] = spte;
- *rmapp = (unsigned long)desc | 1;
- } else {
- rmap_printk("rmap_add: %p %llx many->many\n", spte, *spte);
- desc = (struct kvm_rmap_desc *)(*rmapp & ~1ul);
- while (desc->sptes[RMAP_EXT-1] && desc->more) {
- desc = desc->more;
- count += RMAP_EXT;
- }
- if (desc->sptes[RMAP_EXT-1]) {
- desc->more = mmu_alloc_rmap_desc(vcpu);
- desc = desc->more;
- }
- for (i = 0; desc->sptes[i]; ++i)
- ;
- desc->sptes[i] = spte;
- }
- return count;
-}
-
-static void rmap_desc_remove_entry(unsigned long *rmapp,
- struct kvm_rmap_desc *desc,
- int i,
- struct kvm_rmap_desc *prev_desc)
-{
- int j;
-
- for (j = RMAP_EXT - 1; !desc->sptes[j] && j > i; --j)
- ;
- desc->sptes[i] = desc->sptes[j];
- desc->sptes[j] = NULL;
- if (j != 0)
- return;
- if (!prev_desc && !desc->more)
- *rmapp = (unsigned long)desc->sptes[0];
- else
- if (prev_desc)
- prev_desc->more = desc->more;
- else
- *rmapp = (unsigned long)desc->more | 1;
- mmu_free_rmap_desc(desc);
-}
-
-static void rmap_remove(struct kvm *kvm, u64 *spte)
-{
- struct kvm_rmap_desc *desc;
- struct kvm_rmap_desc *prev_desc;
- struct kvm_mmu_page *sp;
- pfn_t pfn;
- unsigned long *rmapp;
- int i;
-
- if (!is_rmap_spte(*spte))
- return;
- sp = page_header(__pa(spte));
- pfn = spte_to_pfn(*spte);
- if (*spte & shadow_accessed_mask)
- kvm_set_pfn_accessed(pfn);
- if (is_writable_pte(*spte))
- kvm_set_pfn_dirty(pfn);
- rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], sp->role.level);
- if (!*rmapp) {
- printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
- BUG();
- } else if (!(*rmapp & 1)) {
- rmap_printk("rmap_remove: %p %llx 1->0\n", spte, *spte);
- if ((u64 *)*rmapp != spte) {
- printk(KERN_ERR "rmap_remove: %p %llx 1->BUG\n",
- spte, *spte);
- BUG();
- }
- *rmapp = 0;
- } else {
- rmap_printk("rmap_remove: %p %llx many->many\n", spte, *spte);
- desc = (struct kvm_rmap_desc *)(*rmapp & ~1ul);
- prev_desc = NULL;
- while (desc) {
- for (i = 0; i < RMAP_EXT && desc->sptes[i]; ++i)
- if (desc->sptes[i] == spte) {
- rmap_desc_remove_entry(rmapp,
- desc, i,
- prev_desc);
- return;
- }
- prev_desc = desc;
- desc = desc->more;
- }
- pr_err("rmap_remove: %p %llx many->many\n", spte, *spte);
- BUG();
- }
-}
-
-static u64 *rmap_next(struct kvm *kvm, unsigned long *rmapp, u64 *spte)
-{
- struct kvm_rmap_desc *desc;
- u64 *prev_spte;
- int i;
-
- if (!*rmapp)
- return NULL;
- else if (!(*rmapp & 1)) {
- if (!spte)
- return (u64 *)*rmapp;
- return NULL;
- }
- desc = (struct kvm_rmap_desc *)(*rmapp & ~1ul);
- prev_spte = NULL;
- while (desc) {
- for (i = 0; i < RMAP_EXT && desc->sptes[i]; ++i) {
- if (prev_spte == spte)
- return desc->sptes[i];
- prev_spte = desc->sptes[i];
- }
- desc = desc->more;
- }
- return NULL;
-}
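-
-/*
- * Illustrative sketch (added; not part of the original file): the
- * canonical iteration pattern over an rmap chain, as used by
- * rmap_write_protect() and kvm_age_rmapp() below.
- */
-static void __maybe_unused rmap_walk_example(struct kvm *kvm,
- unsigned long *rmapp)
-{
- u64 *spte = rmap_next(kvm, rmapp, NULL);
-
- while (spte) {
- /* inspect or modify *spte here */
- spte = rmap_next(kvm, rmapp, spte);
- }
-}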
-
-static int rmap_write_protect(struct kvm *kvm, u64 gfn)
-{
- unsigned long *rmapp;
- u64 *spte;
- int i, write_protected = 0;
-
- gfn = unalias_gfn(kvm, gfn);
- rmapp = gfn_to_rmap(kvm, gfn, PT_PAGE_TABLE_LEVEL);
-
- spte = rmap_next(kvm, rmapp, NULL);
- while (spte) {
- BUG_ON(!spte);
- BUG_ON(!(*spte & PT_PRESENT_MASK));
- rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
- if (is_writable_pte(*spte)) {
- __set_spte(spte, *spte & ~PT_WRITABLE_MASK);
- write_protected = 1;
- }
- spte = rmap_next(kvm, rmapp, spte);
- }
- if (write_protected) {
- pfn_t pfn;
-
- spte = rmap_next(kvm, rmapp, NULL);
- pfn = spte_to_pfn(*spte);
- kvm_set_pfn_dirty(pfn);
- }
-
- /* check for huge page mappings */
- for (i = PT_DIRECTORY_LEVEL;
- i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
- rmapp = gfn_to_rmap(kvm, gfn, i);
- spte = rmap_next(kvm, rmapp, NULL);
- while (spte) {
- BUG_ON(!spte);
- BUG_ON(!(*spte & PT_PRESENT_MASK));
- BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
- pgprintk("rmap_write_protect(large): spte %p %llx %lld\n", spte, *spte, gfn);
- if (is_writable_pte(*spte)) {
- rmap_remove(kvm, spte);
- --kvm->stat.lpages;
- __set_spte(spte, shadow_trap_nonpresent_pte);
- spte = NULL;
- write_protected = 1;
- }
- spte = rmap_next(kvm, rmapp, spte);
- }
- }
-
- return write_protected;
-}
-
-static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
- unsigned long data)
-{
- u64 *spte;
- int need_tlb_flush = 0;
-
- while ((spte = rmap_next(kvm, rmapp, NULL))) {
- BUG_ON(!(*spte & PT_PRESENT_MASK));
- rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
- rmap_remove(kvm, spte);
- __set_spte(spte, shadow_trap_nonpresent_pte);
- need_tlb_flush = 1;
- }
- return need_tlb_flush;
-}
-
-static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
- unsigned long data)
-{
- int need_flush = 0;
- u64 *spte, new_spte;
- pte_t *ptep = (pte_t *)data;
- pfn_t new_pfn;
-
- WARN_ON(pte_huge(*ptep));
- new_pfn = pte_pfn(*ptep);
- spte = rmap_next(kvm, rmapp, NULL);
- while (spte) {
- BUG_ON(!is_shadow_present_pte(*spte));
- rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
- need_flush = 1;
- if (pte_write(*ptep)) {
- rmap_remove(kvm, spte);
- __set_spte(spte, shadow_trap_nonpresent_pte);
- spte = rmap_next(kvm, rmapp, NULL);
- } else {
- new_spte = *spte &~ (PT64_BASE_ADDR_MASK);
- new_spte |= (u64)new_pfn << PAGE_SHIFT;
-
- new_spte &= ~PT_WRITABLE_MASK;
- new_spte &= ~SPTE_HOST_WRITEABLE;
- if (is_writable_pte(*spte))
- kvm_set_pfn_dirty(spte_to_pfn(*spte));
- __set_spte(spte, new_spte);
- spte = rmap_next(kvm, rmapp, spte);
- }
- }
- if (need_flush)
- kvm_flush_remote_tlbs(kvm);
-
- return 0;
-}
-
-static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
- unsigned long data,
- int (*handler)(struct kvm *kvm, unsigned long *rmapp,
- unsigned long data))
-{
- int i, j;
- int ret;
- int retval = 0;
- struct kvm_memslots *slots;
-
- slots = kvm_memslots(kvm);
-
- for (i = 0; i < slots->nmemslots; i++) {
- struct kvm_memory_slot *memslot = &slots->memslots[i];
- unsigned long start = memslot->userspace_addr;
- unsigned long end;
-
- end = start + (memslot->npages << PAGE_SHIFT);
- if (hva >= start && hva < end) {
- gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
-
- ret = handler(kvm, &memslot->rmap[gfn_offset], data);
-
- for (j = 0; j < KVM_NR_PAGE_SIZES - 1; ++j) {
- int idx = gfn_offset;
- idx /= KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL + j);
- ret |= handler(kvm,
- &memslot->lpage_info[j][idx].rmap_pde,
- data);
- }
- trace_kvm_age_page(hva, memslot, ret);
- retval |= ret;
- }
- }
-
- return retval;
-}
-
-int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
-{
- return kvm_handle_hva(kvm, hva, 0, kvm_unmap_rmapp);
-}
-
-void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte)
-{
- kvm_handle_hva(kvm, hva, (unsigned long)&pte, kvm_set_pte_rmapp);
-}
-
-static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
- unsigned long data)
-{
- u64 *spte;
- int young = 0;
-
- /*
- * Emulate the accessed bit for EPT, by checking if this page has
- * an EPT mapping, and clearing it if it does. On the next access,
- * a new EPT mapping will be established.
- * This has some overhead, but not as much as the cost of swapping
- * out actively used pages or breaking up actively used hugepages.
- */
- if (!shadow_accessed_mask)
- return kvm_unmap_rmapp(kvm, rmapp, data);
-
- spte = rmap_next(kvm, rmapp, NULL);
- while (spte) {
- int _young;
- u64 _spte = *spte;
- BUG_ON(!(_spte & PT_PRESENT_MASK));
- _young = _spte & PT_ACCESSED_MASK;
- if (_young) {
- young = 1;
- clear_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
- }
- spte = rmap_next(kvm, rmapp, spte);
- }
- return young;
-}
-
-#define RMAP_RECYCLE_THRESHOLD 1000
-
-static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
-{
- unsigned long *rmapp;
- struct kvm_mmu_page *sp;
-
- sp = page_header(__pa(spte));
-
- gfn = unalias_gfn(vcpu->kvm, gfn);
- rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
-
- kvm_unmap_rmapp(vcpu->kvm, rmapp, 0);
- kvm_flush_remote_tlbs(vcpu->kvm);
-}
-
-int kvm_age_hva(struct kvm *kvm, unsigned long hva)
-{
- return kvm_handle_hva(kvm, hva, 0, kvm_age_rmapp);
-}
-
-#ifdef MMU_DEBUG
-static int is_empty_shadow_page(u64 *spt)
-{
- u64 *pos;
- u64 *end;
-
- for (pos = spt, end = pos + PAGE_SIZE / sizeof(u64); pos != end; pos++)
- if (is_shadow_present_pte(*pos)) {
- printk(KERN_ERR "%s: %p %llx\n", __func__,
- pos, *pos);
- return 0;
- }
- return 1;
-}
-#endif
-
-static void kvm_mmu_free_page(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- ASSERT(is_empty_shadow_page(sp->spt));
- list_del(&sp->link);
- __free_page(virt_to_page(sp->spt));
- __free_page(virt_to_page(sp->gfns));
- kfree(sp);
- ++kvm->arch.n_free_mmu_pages;
-}
-
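-/*
- * Added commentary: a simple power-of-two hash over the low gfn bits.
- * E.g. assuming KVM_MMU_HASH_SHIFT == 10, gfn 0x12345 hashes to
- * 0x12345 & 0x3ff == 0x345.
- */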
-static unsigned kvm_page_table_hashfn(gfn_t gfn)
-{
- return gfn & ((1 << KVM_MMU_HASH_SHIFT) - 1);
-}
-
-static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
- u64 *parent_pte)
-{
- struct kvm_mmu_page *sp;
-
- sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache, sizeof *sp);
- sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache, PAGE_SIZE);
- sp->gfns = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache, PAGE_SIZE);
- set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
- list_add(&sp->link, &vcpu->kvm->arch.active_mmu_pages);
- bitmap_zero(sp->slot_bitmap, KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS);
- sp->multimapped = 0;
- sp->parent_pte = parent_pte;
- --vcpu->kvm->arch.n_free_mmu_pages;
- return sp;
-}
-
-static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp, u64 *parent_pte)
-{
- struct kvm_pte_chain *pte_chain;
- struct hlist_node *node;
- int i;
-
- if (!parent_pte)
- return;
- if (!sp->multimapped) {
- u64 *old = sp->parent_pte;
-
- if (!old) {
- sp->parent_pte = parent_pte;
- return;
- }
- sp->multimapped = 1;
- pte_chain = mmu_alloc_pte_chain(vcpu);
- INIT_HLIST_HEAD(&sp->parent_ptes);
- hlist_add_head(&pte_chain->link, &sp->parent_ptes);
- pte_chain->parent_ptes[0] = old;
- }
- hlist_for_each_entry(pte_chain, node, &sp->parent_ptes, link) {
- if (pte_chain->parent_ptes[NR_PTE_CHAIN_ENTRIES-1])
- continue;
- for (i = 0; i < NR_PTE_CHAIN_ENTRIES; ++i)
- if (!pte_chain->parent_ptes[i]) {
- pte_chain->parent_ptes[i] = parent_pte;
- return;
- }
- }
- pte_chain = mmu_alloc_pte_chain(vcpu);
- BUG_ON(!pte_chain);
- hlist_add_head(&pte_chain->link, &sp->parent_ptes);
- pte_chain->parent_ptes[0] = parent_pte;
-}
-
-static void mmu_page_remove_parent_pte(struct kvm_mmu_page *sp,
- u64 *parent_pte)
-{
- struct kvm_pte_chain *pte_chain;
- struct hlist_node *node;
- int i;
-
- if (!sp->multimapped) {
- BUG_ON(sp->parent_pte != parent_pte);
- sp->parent_pte = NULL;
- return;
- }
- hlist_for_each_entry(pte_chain, node, &sp->parent_ptes, link)
- for (i = 0; i < NR_PTE_CHAIN_ENTRIES; ++i) {
- if (!pte_chain->parent_ptes[i])
- break;
- if (pte_chain->parent_ptes[i] != parent_pte)
- continue;
- while (i + 1 < NR_PTE_CHAIN_ENTRIES
- && pte_chain->parent_ptes[i + 1]) {
- pte_chain->parent_ptes[i]
- = pte_chain->parent_ptes[i + 1];
- ++i;
- }
- pte_chain->parent_ptes[i] = NULL;
- if (i == 0) {
- hlist_del(&pte_chain->link);
- mmu_free_pte_chain(pte_chain);
- if (hlist_empty(&sp->parent_ptes)) {
- sp->multimapped = 0;
- sp->parent_pte = NULL;
- }
- }
- return;
- }
- BUG();
-}
-
-
-static void mmu_parent_walk(struct kvm_mmu_page *sp, mmu_parent_walk_fn fn)
-{
- struct kvm_pte_chain *pte_chain;
- struct hlist_node *node;
- struct kvm_mmu_page *parent_sp;
- int i;
-
- if (!sp->multimapped && sp->parent_pte) {
- parent_sp = page_header(__pa(sp->parent_pte));
- fn(parent_sp);
- mmu_parent_walk(parent_sp, fn);
- return;
- }
- hlist_for_each_entry(pte_chain, node, &sp->parent_ptes, link)
- for (i = 0; i < NR_PTE_CHAIN_ENTRIES; ++i) {
- if (!pte_chain->parent_ptes[i])
- break;
- parent_sp = page_header(__pa(pte_chain->parent_ptes[i]));
- fn(parent_sp);
- mmu_parent_walk(parent_sp, fn);
- }
-}
-
-static void kvm_mmu_update_unsync_bitmap(u64 *spte)
-{
- unsigned int index;
- struct kvm_mmu_page *sp = page_header(__pa(spte));
-
- index = spte - sp->spt;
- if (!__test_and_set_bit(index, sp->unsync_child_bitmap))
- sp->unsync_children++;
- WARN_ON(!sp->unsync_children);
-}
-
-static void kvm_mmu_update_parents_unsync(struct kvm_mmu_page *sp)
-{
- struct kvm_pte_chain *pte_chain;
- struct hlist_node *node;
- int i;
-
- if (!sp->parent_pte)
- return;
-
- if (!sp->multimapped) {
- kvm_mmu_update_unsync_bitmap(sp->parent_pte);
- return;
- }
-
- hlist_for_each_entry(pte_chain, node, &sp->parent_ptes, link)
- for (i = 0; i < NR_PTE_CHAIN_ENTRIES; ++i) {
- if (!pte_chain->parent_ptes[i])
- break;
- kvm_mmu_update_unsync_bitmap(pte_chain->parent_ptes[i]);
- }
-}
-
-static int unsync_walk_fn(struct kvm_mmu_page *sp)
-{
- kvm_mmu_update_parents_unsync(sp);
- return 1;
-}
-
-static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
-{
- mmu_parent_walk(sp, unsync_walk_fn);
- kvm_mmu_update_parents_unsync(sp);
-}
-
-static void nonpaging_prefetch_page(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp)
-{
- int i;
-
- for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
- sp->spt[i] = shadow_trap_nonpresent_pte;
-}
-
-static int nonpaging_sync_page(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp)
-{
- return 1;
-}
-
-static void nonpaging_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
-{
-}
-
-#define KVM_PAGE_ARRAY_NR 16
-
-struct kvm_mmu_pages {
- struct mmu_page_and_offset {
- struct kvm_mmu_page *sp;
- unsigned int idx;
- } page[KVM_PAGE_ARRAY_NR];
- unsigned int nr;
-};
-
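-/*
- * Added commentary: the bitmap carries one bit per spte slot of a
- * shadow page; 512 == PT64_ENT_PER_PAGE (PAGE_SIZE / sizeof(u64)).
- */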
-#define for_each_unsync_children(bitmap, idx) \
- for (idx = find_first_bit(bitmap, 512); \
- idx < 512; \
- idx = find_next_bit(bitmap, 512, idx+1))
-
-static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp,
- int idx)
-{
- int i;
-
- if (sp->unsync)
- for (i=0; i < pvec->nr; i++)
- if (pvec->page[i].sp == sp)
- return 0;
-
- pvec->page[pvec->nr].sp = sp;
- pvec->page[pvec->nr].idx = idx;
- pvec->nr++;
- return (pvec->nr == KVM_PAGE_ARRAY_NR);
-}
-
-static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
- struct kvm_mmu_pages *pvec)
-{
- int i, ret, nr_unsync_leaf = 0;
-
- for_each_unsync_children(sp->unsync_child_bitmap, i) {
- u64 ent = sp->spt[i];
-
- if (is_shadow_present_pte(ent) && !is_large_pte(ent)) {
- struct kvm_mmu_page *child;
- child = page_header(ent & PT64_BASE_ADDR_MASK);
-
- if (child->unsync_children) {
- if (mmu_pages_add(pvec, child, i))
- return -ENOSPC;
-
- ret = __mmu_unsync_walk(child, pvec);
- if (!ret)
- __clear_bit(i, sp->unsync_child_bitmap);
- else if (ret > 0)
- nr_unsync_leaf += ret;
- else
- return ret;
- }
-
- if (child->unsync) {
- nr_unsync_leaf++;
- if (mmu_pages_add(pvec, child, i))
- return -ENOSPC;
- }
- }
- }
-
- if (find_first_bit(sp->unsync_child_bitmap, 512) == 512)
- sp->unsync_children = 0;
-
- return nr_unsync_leaf;
-}
-
-static int mmu_unsync_walk(struct kvm_mmu_page *sp,
- struct kvm_mmu_pages *pvec)
-{
- if (!sp->unsync_children)
- return 0;
-
- mmu_pages_add(pvec, sp, 0);
- return __mmu_unsync_walk(sp, pvec);
-}
-
-static struct kvm_mmu_page *kvm_mmu_lookup_page(struct kvm *kvm, gfn_t gfn)
-{
- unsigned index;
- struct hlist_head *bucket;
- struct kvm_mmu_page *sp;
- struct hlist_node *node;
-
- pgprintk("%s: looking for gfn %lx\n", __func__, gfn);
- index = kvm_page_table_hashfn(gfn);
- bucket = &kvm->arch.mmu_page_hash[index];
- hlist_for_each_entry(sp, node, bucket, hash_link)
- if (sp->gfn == gfn && !sp->role.direct
- && !sp->role.invalid) {
- pgprintk("%s: found role %x\n",
- __func__, sp->role.word);
- return sp;
- }
- return NULL;
-}
-
-static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- WARN_ON(!sp->unsync);
- trace_kvm_mmu_sync_page(sp);
- sp->unsync = 0;
- --kvm->stat.mmu_unsync;
-}
-
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp);
-
-static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
-{
- if (sp->role.cr4_pae != !!is_pae(vcpu)) {
- kvm_mmu_zap_page(vcpu->kvm, sp);
- return 1;
- }
-
- if (rmap_write_protect(vcpu->kvm, sp->gfn))
- kvm_flush_remote_tlbs(vcpu->kvm);
- kvm_unlink_unsync_page(vcpu->kvm, sp);
- if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
- kvm_mmu_zap_page(vcpu->kvm, sp);
- return 1;
- }
-
- kvm_mmu_flush_tlb(vcpu);
- return 0;
-}
-
-struct mmu_page_path {
- struct kvm_mmu_page *parent[PT64_ROOT_LEVEL-1];
- unsigned int idx[PT64_ROOT_LEVEL-1];
-};
-
-#define for_each_sp(pvec, sp, parents, i) \
- for (i = mmu_pages_next(&pvec, &parents, -1), \
- sp = pvec.page[i].sp; \
- i < pvec.nr && ({ sp = pvec.page[i].sp; 1;}); \
- i = mmu_pages_next(&pvec, &parents, i))
-
-static int mmu_pages_next(struct kvm_mmu_pages *pvec,
- struct mmu_page_path *parents,
- int i)
-{
- int n;
-
- for (n = i+1; n < pvec->nr; n++) {
- struct kvm_mmu_page *sp = pvec->page[n].sp;
-
- if (sp->role.level == PT_PAGE_TABLE_LEVEL) {
- parents->idx[0] = pvec->page[n].idx;
- return n;
- }
-
- parents->parent[sp->role.level-2] = sp;
- parents->idx[sp->role.level-1] = pvec->page[n].idx;
- }
-
- return n;
-}
-
-static void mmu_pages_clear_parents(struct mmu_page_path *parents)
-{
- struct kvm_mmu_page *sp;
- unsigned int level = 0;
-
- do {
- unsigned int idx = parents->idx[level];
-
- sp = parents->parent[level];
- if (!sp)
- return;
-
- --sp->unsync_children;
- WARN_ON((int)sp->unsync_children < 0);
- __clear_bit(idx, sp->unsync_child_bitmap);
- level++;
- } while (level < PT64_ROOT_LEVEL-1 && !sp->unsync_children);
-}
-
-static void kvm_mmu_pages_init(struct kvm_mmu_page *parent,
- struct mmu_page_path *parents,
- struct kvm_mmu_pages *pvec)
-{
- parents->parent[parent->role.level-1] = NULL;
- pvec->nr = 0;
-}
-
-static void mmu_sync_children(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *parent)
-{
- int i;
- struct kvm_mmu_page *sp;
- struct mmu_page_path parents;
- struct kvm_mmu_pages pages;
-
- kvm_mmu_pages_init(parent, &parents, &pages);
- while (mmu_unsync_walk(parent, &pages)) {
- int protected = 0;
-
- for_each_sp(pages, sp, parents, i)
- protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
-
- if (protected)
- kvm_flush_remote_tlbs(vcpu->kvm);
-
- for_each_sp(pages, sp, parents, i) {
- kvm_sync_page(vcpu, sp);
- mmu_pages_clear_parents(&parents);
- }
- cond_resched_lock(&vcpu->kvm->mmu_lock);
- kvm_mmu_pages_init(parent, &parents, &pages);
- }
-}
-
-static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
- gfn_t gfn,
- gva_t gaddr,
- unsigned level,
- int direct,
- unsigned access,
- u64 *parent_pte)
-{
- union kvm_mmu_page_role role;
- unsigned index;
- unsigned quadrant;
- struct hlist_head *bucket;
- struct kvm_mmu_page *sp;
- struct hlist_node *node, *tmp;
-
- role = vcpu->arch.mmu.base_role;
- role.level = level;
- role.direct = direct;
- if (role.direct)
- role.cr4_pae = 0;
- role.access = access;
- if (vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) {
- quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
- quadrant &= (1 << ((PT32_PT_BITS - PT64_PT_BITS) * level)) - 1;
- role.quadrant = quadrant;
- }
- index = kvm_page_table_hashfn(gfn);
- bucket = &vcpu->kvm->arch.mmu_page_hash[index];
- hlist_for_each_entry_safe(sp, node, tmp, bucket, hash_link)
- if (sp->gfn == gfn) {
- if (sp->unsync)
- if (kvm_sync_page(vcpu, sp))
- continue;
-
- if (sp->role.word != role.word)
- continue;
-
- mmu_page_add_parent_pte(vcpu, sp, parent_pte);
- if (sp->unsync_children) {
- set_bit(KVM_REQ_MMU_SYNC, &vcpu->requests);
- kvm_mmu_mark_parents_unsync(sp);
- }
- trace_kvm_mmu_get_page(sp, false);
- return sp;
- }
- ++vcpu->kvm->stat.mmu_cache_miss;
- sp = kvm_mmu_alloc_page(vcpu, parent_pte);
- if (!sp)
- return sp;
- sp->gfn = gfn;
- sp->role = role;
- hlist_add_head(&sp->hash_link, bucket);
- if (!direct) {
- if (rmap_write_protect(vcpu->kvm, gfn))
- kvm_flush_remote_tlbs(vcpu->kvm);
- account_shadowed(vcpu->kvm, gfn);
- }
- if (shadow_trap_nonpresent_pte != shadow_notrap_nonpresent_pte)
- vcpu->arch.mmu.prefetch_page(vcpu, sp);
- else
- nonpaging_prefetch_page(vcpu, sp);
- trace_kvm_mmu_get_page(sp, true);
- return sp;
-}
-
-static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
- struct kvm_vcpu *vcpu, u64 addr)
-{
- iterator->addr = addr;
- iterator->shadow_addr = vcpu->arch.mmu.root_hpa;
- iterator->level = vcpu->arch.mmu.shadow_root_level;
- if (iterator->level == PT32E_ROOT_LEVEL) {
- iterator->shadow_addr
- = vcpu->arch.mmu.pae_root[(addr >> 30) & 3];
- iterator->shadow_addr &= PT64_BASE_ADDR_MASK;
- --iterator->level;
- if (!iterator->shadow_addr)
- iterator->level = 0;
- }
-}
-
-static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
-{
- if (iterator->level < PT_PAGE_TABLE_LEVEL)
- return false;
-
- if (iterator->level == PT_PAGE_TABLE_LEVEL)
- if (is_large_pte(*iterator->sptep))
- return false;
-
- iterator->index = SHADOW_PT_INDEX(iterator->addr, iterator->level);
- iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
- return true;
-}
-
-static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
-{
- iterator->shadow_addr = *iterator->sptep & PT64_BASE_ADDR_MASK;
- --iterator->level;
-}
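-
-/*
- * Added commentary: shadow_walk_init/okay/next back the
- * for_each_shadow_entry() loop used by __direct_map() below,
- * descending one paging level per iteration from the shadow root.
- */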
-
-static void kvm_mmu_page_unlink_children(struct kvm *kvm,
- struct kvm_mmu_page *sp)
-{
- unsigned i;
- u64 *pt;
- u64 ent;
-
- pt = sp->spt;
-
- for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
- ent = pt[i];
-
- if (is_shadow_present_pte(ent)) {
- if (!is_last_spte(ent, sp->role.level)) {
- ent &= PT64_BASE_ADDR_MASK;
- mmu_page_remove_parent_pte(page_header(ent),
- &pt[i]);
- } else {
- if (is_large_pte(ent))
- --kvm->stat.lpages;
- rmap_remove(kvm, &pt[i]);
- }
- }
- pt[i] = shadow_trap_nonpresent_pte;
- }
-}
-
-static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte)
-{
- mmu_page_remove_parent_pte(sp, parent_pte);
-}
-
-static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm)
-{
- int i;
- struct kvm_vcpu *vcpu;
-
- kvm_for_each_vcpu(i, vcpu, kvm)
- vcpu->arch.last_pte_updated = NULL;
-}
-
-static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- u64 *parent_pte;
-
- while (sp->multimapped || sp->parent_pte) {
- if (!sp->multimapped)
- parent_pte = sp->parent_pte;
- else {
- struct kvm_pte_chain *chain;
-
- chain = container_of(sp->parent_ptes.first,
- struct kvm_pte_chain, link);
- parent_pte = chain->parent_ptes[0];
- }
- BUG_ON(!parent_pte);
- kvm_mmu_put_page(sp, parent_pte);
- __set_spte(parent_pte, shadow_trap_nonpresent_pte);
- }
-}
-
-static int mmu_zap_unsync_children(struct kvm *kvm,
- struct kvm_mmu_page *parent)
-{
- int i, zapped = 0;
- struct mmu_page_path parents;
- struct kvm_mmu_pages pages;
-
- if (parent->role.level == PT_PAGE_TABLE_LEVEL)
- return 0;
-
- kvm_mmu_pages_init(parent, &parents, &pages);
- while (mmu_unsync_walk(parent, &pages)) {
- struct kvm_mmu_page *sp;
-
- for_each_sp(pages, sp, parents, i) {
- kvm_mmu_zap_page(kvm, sp);
- mmu_pages_clear_parents(&parents);
- zapped++;
- }
- kvm_mmu_pages_init(parent, &parents, &pages);
- }
-
- return zapped;
-}
-
-static int kvm_mmu_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp)
-{
- int ret;
-
- trace_kvm_mmu_zap_page(sp);
- ++kvm->stat.mmu_shadow_zapped;
- ret = mmu_zap_unsync_children(kvm, sp);
- kvm_mmu_page_unlink_children(kvm, sp);
- kvm_mmu_unlink_parents(kvm, sp);
- kvm_flush_remote_tlbs(kvm);
- if (!sp->role.invalid && !sp->role.direct)
- unaccount_shadowed(kvm, sp->gfn);
- if (sp->unsync)
- kvm_unlink_unsync_page(kvm, sp);
- if (!sp->root_count) {
- hlist_del(&sp->hash_link);
- kvm_mmu_free_page(kvm, sp);
- } else {
- sp->role.invalid = 1;
- list_move(&sp->link, &kvm->arch.active_mmu_pages);
- kvm_reload_remote_mmus(kvm);
- }
- kvm_mmu_reset_last_pte_updated(kvm);
- return ret;
-}
-
-/*
- * Change the number of mmu pages allocated to the vm.
- * Note: if kvm_nr_mmu_pages is too small, you will get a deadlock.
- */
-void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages)
-{
- int used_pages;
-
- used_pages = kvm->arch.n_alloc_mmu_pages - kvm->arch.n_free_mmu_pages;
- used_pages = max(0, used_pages);
-
- /*
- * If we set the number of mmu pages to be smaller than the
- * number of active pages, we must free some mmu pages before we
- * change the value.
- */
-
- if (used_pages > kvm_nr_mmu_pages) {
- while (used_pages > kvm_nr_mmu_pages &&
- !list_empty(&kvm->arch.active_mmu_pages)) {
- struct kvm_mmu_page *page;
-
- page = container_of(kvm->arch.active_mmu_pages.prev,
- struct kvm_mmu_page, link);
- used_pages -= kvm_mmu_zap_page(kvm, page);
- used_pages--;
- }
- kvm_nr_mmu_pages = used_pages;
- kvm->arch.n_free_mmu_pages = 0;
- }
- else
- kvm->arch.n_free_mmu_pages += kvm_nr_mmu_pages
- - kvm->arch.n_alloc_mmu_pages;
-
- kvm->arch.n_alloc_mmu_pages = kvm_nr_mmu_pages;
-}
-
-static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
-{
- unsigned index;
- struct hlist_head *bucket;
- struct kvm_mmu_page *sp;
- struct hlist_node *node, *n;
- int r;
-
- pgprintk("%s: looking for gfn %lx\n", __func__, gfn);
- r = 0;
- index = kvm_page_table_hashfn(gfn);
- bucket = &kvm->arch.mmu_page_hash[index];
-restart:
- hlist_for_each_entry_safe(sp, node, n, bucket, hash_link)
- if (sp->gfn == gfn && !sp->role.direct) {
- pgprintk("%s: gfn %lx role %x\n", __func__, gfn,
- sp->role.word);
- r = 1;
- if (kvm_mmu_zap_page(kvm, sp))
- goto restart;
- }
- return r;
-}
-
-static void mmu_unshadow(struct kvm *kvm, gfn_t gfn)
-{
- unsigned index;
- struct hlist_head *bucket;
- struct kvm_mmu_page *sp;
- struct hlist_node *node, *nn;
-
- index = kvm_page_table_hashfn(gfn);
- bucket = &kvm->arch.mmu_page_hash[index];
-restart:
- hlist_for_each_entry_safe(sp, node, nn, bucket, hash_link) {
- if (sp->gfn == gfn && !sp->role.direct
- && !sp->role.invalid) {
- pgprintk("%s: zap %lx %x\n",
- __func__, gfn, sp->role.word);
- if (kvm_mmu_zap_page(kvm, sp))
- goto restart;
- }
- }
-}
-
-static void page_header_update_slot(struct kvm *kvm, void *pte, gfn_t gfn)
-{
- int slot = memslot_id(kvm, gfn);
- struct kvm_mmu_page *sp = page_header(__pa(pte));
-
- __set_bit(slot, sp->slot_bitmap);
-}
-
-static void mmu_convert_notrap(struct kvm_mmu_page *sp)
-{
- int i;
- u64 *pt = sp->spt;
-
- if (shadow_trap_nonpresent_pte == shadow_notrap_nonpresent_pte)
- return;
-
- for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
- if (pt[i] == shadow_notrap_nonpresent_pte)
- __set_spte(&pt[i], shadow_trap_nonpresent_pte);
- }
-}
-
-/*
- * The function is based on mtrr_type_lookup() in
- * arch/x86/kernel/cpu/mtrr/generic.c
- */
-static int get_mtrr_type(struct mtrr_state_type *mtrr_state,
- u64 start, u64 end)
-{
- int i;
- u64 base, mask;
- u8 prev_match, curr_match;
- int num_var_ranges = KVM_NR_VAR_MTRR;
-
- if (!mtrr_state->enabled)
- return 0xFF;
-
- /* Make end inclusive instead of exclusive */
- end--;
-
- /* Look in fixed ranges. Just return the type as per start */
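- /*
- * Added commentary on the fixed-range layout (per the x86 MTRR
- * architecture): idx 0-7 cover 0-0x7ffff in 64K chunks, idx 8-23
- * cover 0x80000-0xbffff in 16K chunks, and idx 24-87 cover
- * 0xc0000-0xfffff in 4K chunks.  E.g. start == 0xa0000 gives
- * idx = 8 + (0x20000 >> 14) == 16.
- */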
- if (mtrr_state->have_fixed && (start < 0x100000)) {
- int idx;
-
- if (start < 0x80000) {
- idx = 0;
- idx += (start >> 16);
- return mtrr_state->fixed_ranges[idx];
- } else if (start < 0xC0000) {
- idx = 1 * 8;
- idx += ((start - 0x80000) >> 14);
- return mtrr_state->fixed_ranges[idx];
- } else if (start < 0x1000000) {
- idx = 3 * 8;
- idx += ((start - 0xC0000) >> 12);
- return mtrr_state->fixed_ranges[idx];
- }
- }
-
- /*
- * Look in variable ranges.  Look for multiple ranges matching
- * this address and pick the type as per MTRR precedence.
- */
- if (!(mtrr_state->enabled & 2))
- return mtrr_state->def_type;
-
- prev_match = 0xFF;
- for (i = 0; i < num_var_ranges; ++i) {
- unsigned short start_state, end_state;
-
- if (!(mtrr_state->var_ranges[i].mask_lo & (1 << 11)))
- continue;
-
- base = (((u64)mtrr_state->var_ranges[i].base_hi) << 32) +
- (mtrr_state->var_ranges[i].base_lo & PAGE_MASK);
- mask = (((u64)mtrr_state->var_ranges[i].mask_hi) << 32) +
- (mtrr_state->var_ranges[i].mask_lo & PAGE_MASK);
-
- start_state = ((start & mask) == (base & mask));
- end_state = ((end & mask) == (base & mask));
- if (start_state != end_state)
- return 0xFE;
-
- if ((start & mask) != (base & mask))
- continue;
-
- curr_match = mtrr_state->var_ranges[i].base_lo & 0xff;
- if (prev_match == 0xFF) {
- prev_match = curr_match;
- continue;
- }
-
- if (prev_match == MTRR_TYPE_UNCACHABLE ||
- curr_match == MTRR_TYPE_UNCACHABLE)
- return MTRR_TYPE_UNCACHABLE;
-
- if ((prev_match == MTRR_TYPE_WRBACK &&
- curr_match == MTRR_TYPE_WRTHROUGH) ||
- (prev_match == MTRR_TYPE_WRTHROUGH &&
- curr_match == MTRR_TYPE_WRBACK)) {
- prev_match = MTRR_TYPE_WRTHROUGH;
- curr_match = MTRR_TYPE_WRTHROUGH;
- }
-
- if (prev_match != curr_match)
- return MTRR_TYPE_UNCACHABLE;
- }
-
- if (prev_match != 0xFF)
- return prev_match;
-
- return mtrr_state->def_type;
-}
-
-u8 kvm_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
-{
- u8 mtrr;
-
- mtrr = get_mtrr_type(&vcpu->arch.mtrr_state, gfn << PAGE_SHIFT,
- (gfn << PAGE_SHIFT) + PAGE_SIZE);
- if (mtrr == 0xfe || mtrr == 0xff)
- mtrr = MTRR_TYPE_WRBACK;
- return mtrr;
-}
-EXPORT_SYMBOL_GPL(kvm_get_guest_memory_type);
-
-static int kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
-{
- unsigned index;
- struct hlist_head *bucket;
- struct kvm_mmu_page *s;
- struct hlist_node *node, *n;
-
- index = kvm_page_table_hashfn(sp->gfn);
- bucket = &vcpu->kvm->arch.mmu_page_hash[index];
- /* don't unsync if pagetable is shadowed with multiple roles */
- hlist_for_each_entry_safe(s, node, n, bucket, hash_link) {
- if (s->gfn != sp->gfn || s->role.direct)
- continue;
- if (s->role.word != sp->role.word)
- return 1;
- }
- trace_kvm_mmu_unsync_page(sp);
- ++vcpu->kvm->stat.mmu_unsync;
- sp->unsync = 1;
-
- kvm_mmu_mark_parents_unsync(sp);
-
- mmu_convert_notrap(sp);
- return 0;
-}
-
-static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
- bool can_unsync)
-{
- struct kvm_mmu_page *shadow;
-
- shadow = kvm_mmu_lookup_page(vcpu->kvm, gfn);
- if (shadow) {
- if (shadow->role.level != PT_PAGE_TABLE_LEVEL)
- return 1;
- if (shadow->unsync)
- return 0;
- if (can_unsync && oos_shadow)
- return kvm_unsync_page(vcpu, shadow);
- return 1;
- }
- return 0;
-}
-
-static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
- unsigned pte_access, int user_fault,
- int write_fault, int dirty, int level,
- gfn_t gfn, pfn_t pfn, bool speculative,
- bool can_unsync, bool reset_host_protection)
-{
- u64 spte;
- int ret = 0;
-
- /*
- * We don't set the accessed bit, since we sometimes want to see
- * whether the guest actually used the pte (in order to detect
- * demand paging).
- */
- spte = shadow_base_present_pte | shadow_dirty_mask;
- if (!speculative)
- spte |= shadow_accessed_mask;
- if (!dirty)
- pte_access &= ~ACC_WRITE_MASK;
- if (pte_access & ACC_EXEC_MASK)
- spte |= shadow_x_mask;
- else
- spte |= shadow_nx_mask;
- if (pte_access & ACC_USER_MASK)
- spte |= shadow_user_mask;
- if (level > PT_PAGE_TABLE_LEVEL)
- spte |= PT_PAGE_SIZE_MASK;
- if (tdp_enabled)
- spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
- kvm_is_mmio_pfn(pfn));
-
- if (reset_host_protection)
- spte |= SPTE_HOST_WRITEABLE;
-
- spte |= (u64)pfn << PAGE_SHIFT;
-
- if ((pte_access & ACC_WRITE_MASK)
- || (write_fault && !is_write_protection(vcpu) && !user_fault)) {
-
- if (level > PT_PAGE_TABLE_LEVEL &&
- has_wrprotected_page(vcpu->kvm, gfn, level)) {
- ret = 1;
- spte = shadow_trap_nonpresent_pte;
- goto set_pte;
- }
-
- spte |= PT_WRITABLE_MASK;
-
- if (!tdp_enabled && !(pte_access & ACC_WRITE_MASK))
- spte &= ~PT_USER_MASK;
-
- /*
- * Optimization: for pte sync, if spte was writable the hash
- * lookup is unnecessary (and expensive). Write protection
- * is the responsibility of mmu_get_page / kvm_sync_page.
- * Same reasoning can be applied to dirty page accounting.
- */
- if (!can_unsync && is_writable_pte(*sptep))
- goto set_pte;
-
- if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
- pgprintk("%s: found shadow page for %lx, marking ro\n",
- __func__, gfn);
- ret = 1;
- pte_access &= ~ACC_WRITE_MASK;
- if (is_writable_pte(spte))
- spte &= ~PT_WRITABLE_MASK;
- }
- }
-
- if (pte_access & ACC_WRITE_MASK)
- mark_page_dirty(vcpu->kvm, gfn);
-
-set_pte:
- __set_spte(sptep, spte);
- return ret;
-}
-
-static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
- unsigned pt_access, unsigned pte_access,
- int user_fault, int write_fault, int dirty,
- int *ptwrite, int level, gfn_t gfn,
- pfn_t pfn, bool speculative,
- bool reset_host_protection)
-{
- int was_rmapped = 0;
- int was_writable = is_writable_pte(*sptep);
- int rmap_count;
-
- pgprintk("%s: spte %llx access %x write_fault %d"
- " user_fault %d gfn %lx\n",
- __func__, *sptep, pt_access,
- write_fault, user_fault, gfn);
-
- if (is_rmap_spte(*sptep)) {
- /*
- * If we overwrite a PTE page pointer with a 2MB PMD, unlink
- * the parent of the now unreachable PTE.
- */
- if (level > PT_PAGE_TABLE_LEVEL &&
- !is_large_pte(*sptep)) {
- struct kvm_mmu_page *child;
- u64 pte = *sptep;
-
- child = page_header(pte & PT64_BASE_ADDR_MASK);
- mmu_page_remove_parent_pte(child, sptep);
- __set_spte(sptep, shadow_trap_nonpresent_pte);
- kvm_flush_remote_tlbs(vcpu->kvm);
- } else if (pfn != spte_to_pfn(*sptep)) {
- pgprintk("hfn old %lx new %lx\n",
- spte_to_pfn(*sptep), pfn);
- rmap_remove(vcpu->kvm, sptep);
- __set_spte(sptep, shadow_trap_nonpresent_pte);
- kvm_flush_remote_tlbs(vcpu->kvm);
- } else
- was_rmapped = 1;
- }
-
- if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
- dirty, level, gfn, pfn, speculative, true,
- reset_host_protection)) {
- if (write_fault)
- *ptwrite = 1;
- kvm_x86_ops->tlb_flush(vcpu);
- }
-
- pgprintk("%s: setting spte %llx\n", __func__, *sptep);
- pgprintk("instantiating %s PTE (%s) at %ld (%llx) addr %p\n",
- is_large_pte(*sptep)? "2MB" : "4kB",
- *sptep & PT_PRESENT_MASK ?"RW":"R", gfn,
- *sptep, sptep);
- if (!was_rmapped && is_large_pte(*sptep))
- ++vcpu->kvm->stat.lpages;
-
- page_header_update_slot(vcpu->kvm, sptep, gfn);
- if (!was_rmapped) {
- rmap_count = rmap_add(vcpu, sptep, gfn);
- kvm_release_pfn_clean(pfn);
- if (rmap_count > RMAP_RECYCLE_THRESHOLD)
- rmap_recycle(vcpu, sptep, gfn);
- } else {
- if (was_writable)
- kvm_release_pfn_dirty(pfn);
- else
- kvm_release_pfn_clean(pfn);
- }
- if (speculative) {
- vcpu->arch.last_pte_updated = sptep;
- vcpu->arch.last_pte_gfn = gfn;
- }
-}
-
-static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
-{
-}
-
-static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
- int level, gfn_t gfn, pfn_t pfn)
-{
- struct kvm_shadow_walk_iterator iterator;
- struct kvm_mmu_page *sp;
- int pt_write = 0;
- gfn_t pseudo_gfn;
-
- for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
- if (iterator.level == level) {
- mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
- 0, write, 1, &pt_write,
- level, gfn, pfn, false, true);
- ++vcpu->stat.pf_fixed;
- break;
- }
-
- if (*iterator.sptep == shadow_trap_nonpresent_pte) {
- pseudo_gfn = (iterator.addr & PT64_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
- sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr,
- iterator.level - 1,
- 1, ACC_ALL, iterator.sptep);
- if (!sp) {
- pgprintk("nonpaging_map: ENOMEM\n");
- kvm_release_pfn_clean(pfn);
- return -ENOMEM;
- }
-
- __set_spte(iterator.sptep,
- __pa(sp->spt)
- | PT_PRESENT_MASK | PT_WRITABLE_MASK
- | shadow_user_mask | shadow_x_mask);
- }
- }
- return pt_write;
-}
-
-static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
-{
- int r;
- int level;
- pfn_t pfn;
- unsigned long mmu_seq;
-
- level = mapping_level(vcpu, gfn);
-
- /*
- * This path builds a PAE pagetable - so we can map 2mb pages at
- * maximum. Therefore check if the level is larger than that.
- */
- if (level > PT_DIRECTORY_LEVEL)
- level = PT_DIRECTORY_LEVEL;
-
- gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
-
- mmu_seq = vcpu->kvm->mmu_notifier_seq;
- smp_rmb();
- pfn = gfn_to_pfn(vcpu->kvm, gfn);
-
- /* mmio */
- if (is_error_pfn(pfn)) {
- kvm_release_pfn_clean(pfn);
- return 1;
- }
-
- spin_lock(&vcpu->kvm->mmu_lock);
- if (mmu_notifier_retry(vcpu, mmu_seq))
- goto out_unlock;
- kvm_mmu_free_some_pages(vcpu);
- r = __direct_map(vcpu, v, write, level, gfn, pfn);
- spin_unlock(&vcpu->kvm->mmu_lock);
-
-
- return r;
-
-out_unlock:
- spin_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(pfn);
- return 0;
-}
-
-
-static void mmu_free_roots(struct kvm_vcpu *vcpu)
-{
- int i;
- struct kvm_mmu_page *sp;
-
- if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
- return;
- spin_lock(&vcpu->kvm->mmu_lock);
- if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
- hpa_t root = vcpu->arch.mmu.root_hpa;
-
- sp = page_header(root);
- --sp->root_count;
- if (!sp->root_count && sp->role.invalid)
- kvm_mmu_zap_page(vcpu->kvm, sp);
- vcpu->arch.mmu.root_hpa = INVALID_PAGE;
- spin_unlock(&vcpu->kvm->mmu_lock);
- return;
- }
- for (i = 0; i < 4; ++i) {
- hpa_t root = vcpu->arch.mmu.pae_root[i];
-
- if (root) {
- root &= PT64_BASE_ADDR_MASK;
- sp = page_header(root);
- --sp->root_count;
- if (!sp->root_count && sp->role.invalid)
- kvm_mmu_zap_page(vcpu->kvm, sp);
- }
- vcpu->arch.mmu.pae_root[i] = INVALID_PAGE;
- }
- spin_unlock(&vcpu->kvm->mmu_lock);
- vcpu->arch.mmu.root_hpa = INVALID_PAGE;
-}
-
-static int mmu_check_root(struct kvm_vcpu *vcpu, gfn_t root_gfn)
-{
- int ret = 0;
-
- if (!kvm_is_visible_gfn(vcpu->kvm, root_gfn)) {
- set_bit(KVM_REQ_TRIPLE_FAULT, &vcpu->requests);
- ret = 1;
- }
-
- return ret;
-}
-
-static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
-{
- int i;
- gfn_t root_gfn;
- struct kvm_mmu_page *sp;
- int direct = 0;
- u64 pdptr;
-
- root_gfn = vcpu->arch.cr3 >> PAGE_SHIFT;
-
- if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
- hpa_t root = vcpu->arch.mmu.root_hpa;
-
- ASSERT(!VALID_PAGE(root));
- if (mmu_check_root(vcpu, root_gfn))
- return 1;
- if (tdp_enabled) {
- direct = 1;
- root_gfn = 0;
- }
- spin_lock(&vcpu->kvm->mmu_lock);
- sp = kvm_mmu_get_page(vcpu, root_gfn, 0,
- PT64_ROOT_LEVEL, direct,
- ACC_ALL, NULL);
- root = __pa(sp->spt);
- ++sp->root_count;
- spin_unlock(&vcpu->kvm->mmu_lock);
- vcpu->arch.mmu.root_hpa = root;
- return 0;
- }
- direct = !is_paging(vcpu);
- for (i = 0; i < 4; ++i) {
- hpa_t root = vcpu->arch.mmu.pae_root[i];
-
- ASSERT(!VALID_PAGE(root));
- if (vcpu->arch.mmu.root_level == PT32E_ROOT_LEVEL) {
- pdptr = kvm_pdptr_read(vcpu, i);
- if (!is_present_gpte(pdptr)) {
- vcpu->arch.mmu.pae_root[i] = 0;
- continue;
- }
- root_gfn = pdptr >> PAGE_SHIFT;
- } else if (vcpu->arch.mmu.root_level == 0)
- root_gfn = 0;
- if (mmu_check_root(vcpu, root_gfn))
- return 1;
- if (tdp_enabled) {
- direct = 1;
- root_gfn = i << 30;
- }
- spin_lock(&vcpu->kvm->mmu_lock);
- sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
- PT32_ROOT_LEVEL, direct,
- ACC_ALL, NULL);
- root = __pa(sp->spt);
- ++sp->root_count;
- spin_unlock(&vcpu->kvm->mmu_lock);
-
- vcpu->arch.mmu.pae_root[i] = root | PT_PRESENT_MASK;
- }
- vcpu->arch.mmu.root_hpa = __pa(vcpu->arch.mmu.pae_root);
- return 0;
-}
-
-static void mmu_sync_roots(struct kvm_vcpu *vcpu)
-{
- int i;
- struct kvm_mmu_page *sp;
-
- if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
- return;
- if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
- hpa_t root = vcpu->arch.mmu.root_hpa;
- sp = page_header(root);
- mmu_sync_children(vcpu, sp);
- return;
- }
- for (i = 0; i < 4; ++i) {
- hpa_t root = vcpu->arch.mmu.pae_root[i];
-
- if (root && VALID_PAGE(root)) {
- root &= PT64_BASE_ADDR_MASK;
- sp = page_header(root);
- mmu_sync_children(vcpu, sp);
- }
- }
-}
-
-void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
-{
- spin_lock(&vcpu->kvm->mmu_lock);
- mmu_sync_roots(vcpu);
- spin_unlock(&vcpu->kvm->mmu_lock);
-}
-
-static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr,
- u32 access, u32 *error)
-{
- if (error)
- *error = 0;
- return vaddr;
-}
-
-static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
- u32 error_code)
-{
- gfn_t gfn;
- int r;
-
- pgprintk("%s: gva %lx error %x\n", __func__, gva, error_code);
- r = mmu_topup_memory_caches(vcpu);
- if (r)
- return r;
-
- ASSERT(vcpu);
- ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
-
- gfn = gva >> PAGE_SHIFT;
-
- return nonpaging_map(vcpu, gva & PAGE_MASK,
- error_code & PFERR_WRITE_MASK, gfn);
-}
-
-static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
- u32 error_code)
-{
- pfn_t pfn;
- int r;
- int level;
- gfn_t gfn = gpa >> PAGE_SHIFT;
- unsigned long mmu_seq;
-
- ASSERT(vcpu);
- ASSERT(VALID_PAGE(vcpu->arch.mmu.root_hpa));
-
- r = mmu_topup_memory_caches(vcpu);
- if (r)
- return r;
-
- level = mapping_level(vcpu, gfn);
-
- gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
-
- mmu_seq = vcpu->kvm->mmu_notifier_seq;
- smp_rmb();
- pfn = gfn_to_pfn(vcpu->kvm, gfn);
- if (is_error_pfn(pfn)) {
- kvm_release_pfn_clean(pfn);
- return 1;
- }
- spin_lock(&vcpu->kvm->mmu_lock);
- if (mmu_notifier_retry(vcpu, mmu_seq))
- goto out_unlock;
- kvm_mmu_free_some_pages(vcpu);
- r = __direct_map(vcpu, gpa, error_code & PFERR_WRITE_MASK,
- level, gfn, pfn);
- spin_unlock(&vcpu->kvm->mmu_lock);
-
- return r;
-
-out_unlock:
- spin_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(pfn);
- return 0;
-}
-
-static void nonpaging_free(struct kvm_vcpu *vcpu)
-{
- mmu_free_roots(vcpu);
-}
-
-static int nonpaging_init_context(struct kvm_vcpu *vcpu)
-{
- struct kvm_mmu *context = &vcpu->arch.mmu;
-
- context->new_cr3 = nonpaging_new_cr3;
- context->page_fault = nonpaging_page_fault;
- context->gva_to_gpa = nonpaging_gva_to_gpa;
- context->free = nonpaging_free;
- context->prefetch_page = nonpaging_prefetch_page;
- context->sync_page = nonpaging_sync_page;
- context->invlpg = nonpaging_invlpg;
- context->root_level = 0;
- context->shadow_root_level = PT32E_ROOT_LEVEL;
- context->root_hpa = INVALID_PAGE;
- return 0;
-}
-
-void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu)
-{
- ++vcpu->stat.tlb_flush;
- kvm_x86_ops->tlb_flush(vcpu);
-}
-
-static void paging_new_cr3(struct kvm_vcpu *vcpu)
-{
- pgprintk("%s: cr3 %lx\n", __func__, vcpu->arch.cr3);
- mmu_free_roots(vcpu);
-}
-
-static void inject_page_fault(struct kvm_vcpu *vcpu,
- u64 addr,
- u32 err_code)
-{
- kvm_inject_page_fault(vcpu, addr, err_code);
-}
-
-static void paging_free(struct kvm_vcpu *vcpu)
-{
- nonpaging_free(vcpu);
-}
-
-static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level)
-{
- int bit7;
-
- bit7 = (gpte >> 7) & 1;
- return (gpte & vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0;
-}
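-
-/*
- * Added commentary: bit 7 (PS) selects the large-page row of the
- * rsvd_bits_mask table.  For level-1 PTEs, where bit 7 is PAT rather
- * than PS, reset_rsvds_bits_mask() sets rows [0][0] and [1][0]
- * identically, so the indexing stays correct either way.
- */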
-
-#define PTTYPE 64
-#include "paging_tmpl.h"
-#undef PTTYPE
-
-#define PTTYPE 32
-#include "paging_tmpl.h"
-#undef PTTYPE
-
-static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
-{
- struct kvm_mmu *context = &vcpu->arch.mmu;
- int maxphyaddr = cpuid_maxphyaddr(vcpu);
- u64 exb_bit_rsvd = 0;
-
- if (!is_nx(vcpu))
- exb_bit_rsvd = rsvd_bits(63, 63);
- switch (level) {
- case PT32_ROOT_LEVEL:
- /* no rsvd bits for 2 level 4K page table entries */
- context->rsvd_bits_mask[0][1] = 0;
- context->rsvd_bits_mask[0][0] = 0;
- context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
-
- if (!is_pse(vcpu)) {
- context->rsvd_bits_mask[1][1] = 0;
- break;
- }
-
- if (is_cpuid_PSE36())
- /* 36bits PSE 4MB page */
- context->rsvd_bits_mask[1][1] = rsvd_bits(17, 21);
- else
- /* 32 bits PSE 4MB page */
- context->rsvd_bits_mask[1][1] = rsvd_bits(13, 21);
- break;
- case PT32E_ROOT_LEVEL:
- context->rsvd_bits_mask[0][2] =
- rsvd_bits(maxphyaddr, 63) |
- rsvd_bits(7, 8) | rsvd_bits(1, 2); /* PDPTE */
- context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
- rsvd_bits(maxphyaddr, 62); /* PDE */
- context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
- rsvd_bits(maxphyaddr, 62); /* PTE */
- context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
- rsvd_bits(maxphyaddr, 62) |
- rsvd_bits(13, 20); /* large page */
- context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
- break;
- case PT64_ROOT_LEVEL:
- context->rsvd_bits_mask[0][3] = exb_bit_rsvd |
- rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
- context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
- rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
- context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
- rsvd_bits(maxphyaddr, 51);
- context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
- rsvd_bits(maxphyaddr, 51);
- context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
- context->rsvd_bits_mask[1][2] = exb_bit_rsvd |
- rsvd_bits(maxphyaddr, 51) |
- rsvd_bits(13, 29);
- context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
- rsvd_bits(maxphyaddr, 51) |
- rsvd_bits(13, 20); /* large page */
- context->rsvd_bits_mask[1][0] = context->rsvd_bits_mask[0][0];
- break;
- }
-}
-
-static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
-{
- struct kvm_mmu *context = &vcpu->arch.mmu;
-
- ASSERT(is_pae(vcpu));
- context->new_cr3 = paging_new_cr3;
- context->page_fault = paging64_page_fault;
- context->gva_to_gpa = paging64_gva_to_gpa;
- context->prefetch_page = paging64_prefetch_page;
- context->sync_page = paging64_sync_page;
- context->invlpg = paging64_invlpg;
- context->free = paging_free;
- context->root_level = level;
- context->shadow_root_level = level;
- context->root_hpa = INVALID_PAGE;
- return 0;
-}
-
-static int paging64_init_context(struct kvm_vcpu *vcpu)
-{
- reset_rsvds_bits_mask(vcpu, PT64_ROOT_LEVEL);
- return paging64_init_context_common(vcpu, PT64_ROOT_LEVEL);
-}
-
-static int paging32_init_context(struct kvm_vcpu *vcpu)
-{
- struct kvm_mmu *context = &vcpu->arch.mmu;
-
- reset_rsvds_bits_mask(vcpu, PT32_ROOT_LEVEL);
- context->new_cr3 = paging_new_cr3;
- context->page_fault = paging32_page_fault;
- context->gva_to_gpa = paging32_gva_to_gpa;
- context->free = paging_free;
- context->prefetch_page = paging32_prefetch_page;
- context->sync_page = paging32_sync_page;
- context->invlpg = paging32_invlpg;
- context->root_level = PT32_ROOT_LEVEL;
- context->shadow_root_level = PT32E_ROOT_LEVEL;
- context->root_hpa = INVALID_PAGE;
- return 0;
-}
-
-static int paging32E_init_context(struct kvm_vcpu *vcpu)
-{
- reset_rsvds_bits_mask(vcpu, PT32E_ROOT_LEVEL);
- return paging64_init_context_common(vcpu, PT32E_ROOT_LEVEL);
-}
-
-static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
-{
- struct kvm_mmu *context = &vcpu->arch.mmu;
-
- context->new_cr3 = nonpaging_new_cr3;
- context->page_fault = tdp_page_fault;
- context->free = nonpaging_free;
- context->prefetch_page = nonpaging_prefetch_page;
- context->sync_page = nonpaging_sync_page;
- context->invlpg = nonpaging_invlpg;
- context->shadow_root_level = kvm_x86_ops->get_tdp_level();
- context->root_hpa = INVALID_PAGE;
-
- if (!is_paging(vcpu)) {
- context->gva_to_gpa = nonpaging_gva_to_gpa;
- context->root_level = 0;
- } else if (is_long_mode(vcpu)) {
- reset_rsvds_bits_mask(vcpu, PT64_ROOT_LEVEL);
- context->gva_to_gpa = paging64_gva_to_gpa;
- context->root_level = PT64_ROOT_LEVEL;
- } else if (is_pae(vcpu)) {
- reset_rsvds_bits_mask(vcpu, PT32E_ROOT_LEVEL);
- context->gva_to_gpa = paging64_gva_to_gpa;
- context->root_level = PT32E_ROOT_LEVEL;
- } else {
- reset_rsvds_bits_mask(vcpu, PT32_ROOT_LEVEL);
- context->gva_to_gpa = paging32_gva_to_gpa;
- context->root_level = PT32_ROOT_LEVEL;
- }
-
- return 0;
-}
-
-static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
-{
- int r;
-
- ASSERT(vcpu);
- ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
-
- if (!is_paging(vcpu))
- r = nonpaging_init_context(vcpu);
- else if (is_long_mode(vcpu))
- r = paging64_init_context(vcpu);
- else if (is_pae(vcpu))
- r = paging32E_init_context(vcpu);
- else
- r = paging32_init_context(vcpu);
-
- vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
- vcpu->arch.mmu.base_role.cr0_wp = is_write_protection(vcpu);
-
- return r;
-}
-
-static int init_kvm_mmu(struct kvm_vcpu *vcpu)
-{
- vcpu->arch.update_pte.pfn = bad_pfn;
-
- if (tdp_enabled)
- return init_kvm_tdp_mmu(vcpu);
- else
- return init_kvm_softmmu(vcpu);
-}
-
-static void destroy_kvm_mmu(struct kvm_vcpu *vcpu)
-{
- ASSERT(vcpu);
- if (VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
- vcpu->arch.mmu.free(vcpu);
- vcpu->arch.mmu.root_hpa = INVALID_PAGE;
- }
-}
-
-int kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
-{
- destroy_kvm_mmu(vcpu);
- return init_kvm_mmu(vcpu);
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
-
-int kvm_mmu_load(struct kvm_vcpu *vcpu)
-{
- int r;
-
- r = mmu_topup_memory_caches(vcpu);
- if (r)
- goto out;
- spin_lock(&vcpu->kvm->mmu_lock);
- kvm_mmu_free_some_pages(vcpu);
- spin_unlock(&vcpu->kvm->mmu_lock);
- r = mmu_alloc_roots(vcpu);
- spin_lock(&vcpu->kvm->mmu_lock);
- mmu_sync_roots(vcpu);
- spin_unlock(&vcpu->kvm->mmu_lock);
- if (r)
- goto out;
- /* set_cr3() should ensure TLB has been flushed */
- kvm_x86_ops->set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
-out:
- return r;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_load);
-
-void kvm_mmu_unload(struct kvm_vcpu *vcpu)
-{
- mmu_free_roots(vcpu);
-}
-
-static void mmu_pte_write_zap_pte(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp,
- u64 *spte)
-{
- u64 pte;
- struct kvm_mmu_page *child;
-
- pte = *spte;
- if (is_shadow_present_pte(pte)) {
- if (is_last_spte(pte, sp->role.level))
- rmap_remove(vcpu->kvm, spte);
- else {
- child = page_header(pte & PT64_BASE_ADDR_MASK);
- mmu_page_remove_parent_pte(child, spte);
- }
- }
- __set_spte(spte, shadow_trap_nonpresent_pte);
- if (is_large_pte(pte))
- --vcpu->kvm->stat.lpages;
-}
-
-static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp,
- u64 *spte,
- const void *new)
-{
- if (sp->role.level != PT_PAGE_TABLE_LEVEL) {
- ++vcpu->kvm->stat.mmu_pde_zapped;
- return;
- }
-
- ++vcpu->kvm->stat.mmu_pte_updated;
- if (!sp->role.cr4_pae)
- paging32_update_pte(vcpu, sp, spte, new);
- else
- paging64_update_pte(vcpu, sp, spte, new);
-}
-
-static bool need_remote_flush(u64 old, u64 new)
-{
- if (!is_shadow_present_pte(old))
- return false;
- if (!is_shadow_present_pte(new))
- return true;
- if ((old ^ new) & PT64_BASE_ADDR_MASK)
- return true;
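- /*
- * Added commentary: NX is inverted (set means less permission), so
- * flip it in both values; the final test then uniformly asks whether
- * the old spte granted a permission the new one lacks.
- */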
- old ^= PT64_NX_MASK;
- new ^= PT64_NX_MASK;
- return (old & ~new & PT64_PERM_MASK) != 0;
-}
-
-static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, u64 old, u64 new)
-{
- if (need_remote_flush(old, new))
- kvm_flush_remote_tlbs(vcpu->kvm);
- else
- kvm_mmu_flush_tlb(vcpu);
-}
-
-static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu)
-{
- u64 *spte = vcpu->arch.last_pte_updated;
-
- return !!(spte && (*spte & shadow_accessed_mask));
-}
-
-static void mmu_guess_page_from_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
- u64 gpte)
-{
- gfn_t gfn;
- pfn_t pfn;
-
- if (!is_present_gpte(gpte))
- return;
- gfn = (gpte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
-
- vcpu->arch.update_pte.mmu_seq = vcpu->kvm->mmu_notifier_seq;
- smp_rmb();
- pfn = gfn_to_pfn(vcpu->kvm, gfn);
-
- if (is_error_pfn(pfn)) {
- kvm_release_pfn_clean(pfn);
- return;
- }
- vcpu->arch.update_pte.gfn = gfn;
- vcpu->arch.update_pte.pfn = pfn;
-}
-
-static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
-{
- u64 *spte = vcpu->arch.last_pte_updated;
-
- if (spte
- && vcpu->arch.last_pte_gfn == gfn
- && shadow_accessed_mask
- && !(*spte & shadow_accessed_mask)
- && is_shadow_present_pte(*spte))
- set_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
-}
-
-void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
- const u8 *new, int bytes,
- bool guest_initiated)
-{
- gfn_t gfn = gpa >> PAGE_SHIFT;
- struct kvm_mmu_page *sp;
- struct hlist_node *node, *n;
- struct hlist_head *bucket;
- unsigned index;
- u64 entry, gentry;
- u64 *spte;
- unsigned offset = offset_in_page(gpa);
- unsigned pte_size;
- unsigned page_offset;
- unsigned misaligned;
- unsigned quadrant;
- int level;
- int flooded = 0;
- int npte;
- int r;
- int invlpg_counter;
-
- pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
-
- invlpg_counter = atomic_read(&vcpu->kvm->arch.invlpg_counter);
-
- /*
- * Assume that the pte write is on a page table of the same type
- * as the current vcpu paging mode. This is nearly always true
- * (might be false while changing modes). Note it is verified later
- * by update_pte().
- */
- if ((is_pae(vcpu) && bytes == 4) || !new) {
- /* Handle a 32-bit guest writing two halves of a 64-bit gpte */
- if (is_pae(vcpu)) {
- gpa &= ~(gpa_t)7;
- bytes = 8;
- }
- r = kvm_read_guest(vcpu->kvm, gpa, &gentry, min(bytes, 8));
- if (r)
- gentry = 0;
- new = (const u8 *)&gentry;
- }
-
- switch (bytes) {
- case 4:
- gentry = *(const u32 *)new;
- break;
- case 8:
- gentry = *(const u64 *)new;
- break;
- default:
- gentry = 0;
- break;
- }
-
- mmu_guess_page_from_pte_write(vcpu, gpa, gentry);
- spin_lock(&vcpu->kvm->mmu_lock);
- if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter)
- gentry = 0;
- kvm_mmu_access_page(vcpu, gfn);
- kvm_mmu_free_some_pages(vcpu);
- ++vcpu->kvm->stat.mmu_pte_write;
- kvm_mmu_audit(vcpu, "pre pte write");
- if (guest_initiated) {
- if (gfn == vcpu->arch.last_pt_write_gfn
- && !last_updated_pte_accessed(vcpu)) {
- ++vcpu->arch.last_pt_write_count;
- if (vcpu->arch.last_pt_write_count >= 3)
- flooded = 1;
- } else {
- vcpu->arch.last_pt_write_gfn = gfn;
- vcpu->arch.last_pt_write_count = 1;
- vcpu->arch.last_pte_updated = NULL;
- }
- }
- index = kvm_page_table_hashfn(gfn);
- bucket = &vcpu->kvm->arch.mmu_page_hash[index];
-
-restart:
- hlist_for_each_entry_safe(sp, node, n, bucket, hash_link) {
- if (sp->gfn != gfn || sp->role.direct || sp->role.invalid)
- continue;
- pte_size = sp->role.cr4_pae ? 8 : 4;
- misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
- misaligned |= bytes < 4;
- if (misaligned || flooded) {
- /*
- * Misaligned accesses are too much trouble to fix
- * up; also, they usually indicate a page is not used
- * as a page table.
- *
- * If we're seeing too many writes to a page,
- * it may no longer be a page table, or we may be
- * forking, in which case it is better to unmap the
- * page.
- */
- pgprintk("misaligned: gpa %llx bytes %d role %x\n",
- gpa, bytes, sp->role.word);
- if (kvm_mmu_zap_page(vcpu->kvm, sp))
- goto restart;
- ++vcpu->kvm->stat.mmu_flooded;
- continue;
- }
- page_offset = offset;
- level = sp->role.level;
- npte = 1;
- if (!sp->role.cr4_pae) {
- page_offset <<= 1; /* 32->64 */
- /*
- * A 32-bit pde maps 4MB while the shadow pdes map
- * only 2MB. So we need to double the offset again
- * and zap two pdes instead of one.
- */
- if (level == PT32_ROOT_LEVEL) {
- page_offset &= ~7; /* kill rounding error */
- page_offset <<= 1;
- npte = 2;
- }
- quadrant = page_offset >> PAGE_SHIFT;
- page_offset &= ~PAGE_MASK;
- if (quadrant != sp->role.quadrant)
- continue;
- }
- spte = &sp->spt[page_offset / sizeof(*spte)];
- while (npte--) {
- entry = *spte;
- mmu_pte_write_zap_pte(vcpu, sp, spte);
- if (gentry)
- mmu_pte_write_new_pte(vcpu, sp, spte, &gentry);
- mmu_pte_write_flush_tlb(vcpu, entry, *spte);
- ++spte;
- }
- }
- kvm_mmu_audit(vcpu, "post pte write");
- spin_unlock(&vcpu->kvm->mmu_lock);
- if (!is_error_pfn(vcpu->arch.update_pte.pfn)) {
- kvm_release_pfn_clean(vcpu->arch.update_pte.pfn);
- vcpu->arch.update_pte.pfn = bad_pfn;
- }
-}
-
-int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
-{
- gpa_t gpa;
- int r;
-
- if (tdp_enabled)
- return 0;
-
- gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
-
- spin_lock(&vcpu->kvm->mmu_lock);
- r = kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
- spin_unlock(&vcpu->kvm->mmu_lock);
- return r;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page_virt);
-
-void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
-{
- while (vcpu->kvm->arch.n_free_mmu_pages < KVM_REFILL_PAGES &&
- !list_empty(&vcpu->kvm->arch.active_mmu_pages)) {
- struct kvm_mmu_page *sp;
-
- sp = container_of(vcpu->kvm->arch.active_mmu_pages.prev,
- struct kvm_mmu_page, link);
- kvm_mmu_zap_page(vcpu->kvm, sp);
- ++vcpu->kvm->stat.mmu_recycled;
- }
-}
-
-int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code)
-{
- int r;
- enum emulation_result er;
-
- r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code);
- if (r < 0)
- goto out;
-
- if (!r) {
- r = 1;
- goto out;
- }
-
- r = mmu_topup_memory_caches(vcpu);
- if (r)
- goto out;
-
- er = emulate_instruction(vcpu, cr2, error_code, 0);
-
- switch (er) {
- case EMULATE_DONE:
- return 1;
- case EMULATE_DO_MMIO:
- ++vcpu->stat.mmio_exits;
- return 0;
- case EMULATE_FAIL:
- vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
- vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
- vcpu->run->internal.ndata = 0;
- return 0;
- default:
- BUG();
- }
-out:
- return r;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_page_fault);
-
-void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
-{
- vcpu->arch.mmu.invlpg(vcpu, gva);
- kvm_mmu_flush_tlb(vcpu);
- ++vcpu->stat.invlpg;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_invlpg);
-
-void kvm_enable_tdp(void)
-{
- tdp_enabled = true;
-}
-EXPORT_SYMBOL_GPL(kvm_enable_tdp);
-
-void kvm_disable_tdp(void)
-{
- tdp_enabled = false;
-}
-EXPORT_SYMBOL_GPL(kvm_disable_tdp);
-
-static void free_mmu_pages(struct kvm_vcpu *vcpu)
-{
- free_page((unsigned long)vcpu->arch.mmu.pae_root);
-}
-
-static int alloc_mmu_pages(struct kvm_vcpu *vcpu)
-{
- struct page *page;
- int i;
-
- ASSERT(vcpu);
-
- /*
- * When emulating 32-bit mode, cr3 is only 32 bits even on x86_64.
- * Therefore we need to allocate shadow page tables in the first
- * 4GB of memory, which happens to fit the DMA32 zone.
- */
- page = alloc_page(GFP_KERNEL | __GFP_DMA32);
- if (!page)
- return -ENOMEM;
-
- vcpu->arch.mmu.pae_root = page_address(page);
- for (i = 0; i < 4; ++i)
- vcpu->arch.mmu.pae_root[i] = INVALID_PAGE;
-
- return 0;
-}
-
-int kvm_mmu_create(struct kvm_vcpu *vcpu)
-{
- ASSERT(vcpu);
- ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
-
- return alloc_mmu_pages(vcpu);
-}
-
-int kvm_mmu_setup(struct kvm_vcpu *vcpu)
-{
- ASSERT(vcpu);
- ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));
-
- return init_kvm_mmu(vcpu);
-}
-
-void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
-{
- ASSERT(vcpu);
-
- destroy_kvm_mmu(vcpu);
- free_mmu_pages(vcpu);
- mmu_free_memory_caches(vcpu);
-}
-
-void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
-{
- struct kvm_mmu_page *sp;
-
- list_for_each_entry(sp, &kvm->arch.active_mmu_pages, link) {
- int i;
- u64 *pt;
-
- if (!test_bit(slot, sp->slot_bitmap))
- continue;
-
- pt = sp->spt;
- for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
- /* avoid RMW */
- if (pt[i] & PT_WRITABLE_MASK)
- pt[i] &= ~PT_WRITABLE_MASK;
- }
- kvm_flush_remote_tlbs(kvm);
-}
-
-void kvm_mmu_zap_all(struct kvm *kvm)
-{
- struct kvm_mmu_page *sp, *node;
-
- spin_lock(&kvm->mmu_lock);
-restart:
- list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link)
- if (kvm_mmu_zap_page(kvm, sp))
- goto restart;
-
- spin_unlock(&kvm->mmu_lock);
-
- kvm_flush_remote_tlbs(kvm);
-}
-
-static int kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm)
-{
- struct kvm_mmu_page *page;
-
- page = container_of(kvm->arch.active_mmu_pages.prev,
- struct kvm_mmu_page, link);
- return kvm_mmu_zap_page(kvm, page) + 1;
-}
-
-static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
-{
- struct kvm *kvm;
- struct kvm *kvm_freed = NULL;
- int cache_count = 0;
-
- spin_lock(&kvm_lock);
-
- list_for_each_entry(kvm, &vm_list, vm_list) {
- int npages, idx, freed_pages;
-
- idx = srcu_read_lock(&kvm->srcu);
- spin_lock(&kvm->mmu_lock);
- npages = kvm->arch.n_alloc_mmu_pages -
- kvm->arch.n_free_mmu_pages;
- cache_count += npages;
- if (!kvm_freed && nr_to_scan > 0 && npages > 0) {
- freed_pages = kvm_mmu_remove_some_alloc_mmu_pages(kvm);
- cache_count -= freed_pages;
- kvm_freed = kvm;
- }
- nr_to_scan--;
-
- spin_unlock(&kvm->mmu_lock);
- srcu_read_unlock(&kvm->srcu, idx);
- }
- if (kvm_freed)
- list_move_tail(&kvm_freed->vm_list, &vm_list);
-
- spin_unlock(&kvm_lock);
-
- return cache_count;
-}
-
-static struct shrinker mmu_shrinker = {
- .shrink = mmu_shrink,
- .seeks = DEFAULT_SEEKS * 10,
-};
-
-static void mmu_destroy_caches(void)
-{
- if (pte_chain_cache)
- kmem_cache_destroy(pte_chain_cache);
- if (rmap_desc_cache)
- kmem_cache_destroy(rmap_desc_cache);
- if (mmu_page_header_cache)
- kmem_cache_destroy(mmu_page_header_cache);
-}
-
-void kvm_mmu_module_exit(void)
-{
- mmu_destroy_caches();
- unregister_shrinker(&mmu_shrinker);
-}
-
-int kvm_mmu_module_init(void)
-{
- pte_chain_cache = kmem_cache_create("kvm_pte_chain",
- sizeof(struct kvm_pte_chain),
- 0, 0, NULL);
- if (!pte_chain_cache)
- goto nomem;
- rmap_desc_cache = kmem_cache_create("kvm_rmap_desc",
- sizeof(struct kvm_rmap_desc),
- 0, 0, NULL);
- if (!rmap_desc_cache)
- goto nomem;
-
- mmu_page_header_cache = kmem_cache_create("kvm_mmu_page_header",
- sizeof(struct kvm_mmu_page),
- 0, 0, NULL);
- if (!mmu_page_header_cache)
- goto nomem;
-
- register_shrinker(&mmu_shrinker);
-
- return 0;
-
-nomem:
- mmu_destroy_caches();
- return -ENOMEM;
-}
-
-/*
- * Calculate mmu pages needed for kvm.
- */
-unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm)
-{
- int i;
- unsigned int nr_mmu_pages;
- unsigned int nr_pages = 0;
- struct kvm_memslots *slots;
-
- slots = kvm_memslots(kvm);
-
- for (i = 0; i < slots->nmemslots; i++)
- nr_pages += slots->memslots[i].npages;
-
- nr_mmu_pages = nr_pages * KVM_PERMILLE_MMU_PAGES / 1000;
- nr_mmu_pages = max(nr_mmu_pages,
- (unsigned int) KVM_MIN_ALLOC_MMU_PAGES);
-
- return nr_mmu_pages;
-}
-
-static void *pv_mmu_peek_buffer(struct kvm_pv_mmu_op_buffer *buffer,
- unsigned len)
-{
- if (len > buffer->len)
- return NULL;
- return buffer->ptr;
-}
-
-static void *pv_mmu_read_buffer(struct kvm_pv_mmu_op_buffer *buffer,
- unsigned len)
-{
- void *ret;
-
- ret = pv_mmu_peek_buffer(buffer, len);
- if (!ret)
- return ret;
- buffer->ptr += len;
- buffer->len -= len;
- buffer->processed += len;
- return ret;
-}
-
-static int kvm_pv_mmu_write(struct kvm_vcpu *vcpu,
- gpa_t addr, gpa_t value)
-{
- int bytes = 8;
- int r;
-
- if (!is_long_mode(vcpu) && !is_pae(vcpu))
- bytes = 4;
-
- r = mmu_topup_memory_caches(vcpu);
- if (r)
- return r;
-
- if (!emulator_write_phys(vcpu, addr, &value, bytes))
- return -EFAULT;
-
- return 1;
-}
-
-static int kvm_pv_mmu_flush_tlb(struct kvm_vcpu *vcpu)
-{
- kvm_set_cr3(vcpu, vcpu->arch.cr3);
- return 1;
-}
-
-static int kvm_pv_mmu_release_pt(struct kvm_vcpu *vcpu, gpa_t addr)
-{
- spin_lock(&vcpu->kvm->mmu_lock);
- mmu_unshadow(vcpu->kvm, addr >> PAGE_SHIFT);
- spin_unlock(&vcpu->kvm->mmu_lock);
- return 1;
-}
-
-static int kvm_pv_mmu_op_one(struct kvm_vcpu *vcpu,
- struct kvm_pv_mmu_op_buffer *buffer)
-{
- struct kvm_mmu_op_header *header;
-
- header = pv_mmu_peek_buffer(buffer, sizeof *header);
- if (!header)
- return 0;
- switch (header->op) {
- case KVM_MMU_OP_WRITE_PTE: {
- struct kvm_mmu_op_write_pte *wpte;
-
- wpte = pv_mmu_read_buffer(buffer, sizeof *wpte);
- if (!wpte)
- return 0;
- return kvm_pv_mmu_write(vcpu, wpte->pte_phys,
- wpte->pte_val);
- }
- case KVM_MMU_OP_FLUSH_TLB: {
- struct kvm_mmu_op_flush_tlb *ftlb;
-
- ftlb = pv_mmu_read_buffer(buffer, sizeof *ftlb);
- if (!ftlb)
- return 0;
- return kvm_pv_mmu_flush_tlb(vcpu);
- }
- case KVM_MMU_OP_RELEASE_PT: {
- struct kvm_mmu_op_release_pt *rpt;
-
- rpt = pv_mmu_read_buffer(buffer, sizeof *rpt);
- if (!rpt)
- return 0;
- return kvm_pv_mmu_release_pt(vcpu, rpt->pt_phys);
- }
- default: return 0;
- }
-}
-
-int kvm_pv_mmu_op(struct kvm_vcpu *vcpu, unsigned long bytes,
- gpa_t addr, unsigned long *ret)
-{
- int r;
- struct kvm_pv_mmu_op_buffer *buffer = &vcpu->arch.mmu_op_buffer;
-
- buffer->ptr = buffer->buf;
- buffer->len = min_t(unsigned long, bytes, sizeof buffer->buf);
- buffer->processed = 0;
-
- r = kvm_read_guest(vcpu->kvm, addr, buffer->buf, buffer->len);
- if (r)
- goto out;
-
- while (buffer->len) {
- r = kvm_pv_mmu_op_one(vcpu, buffer);
- if (r < 0)
- goto out;
- if (r == 0)
- break;
- }
-
- r = 1;
-out:
- *ret = buffer->processed;
- return r;
-}
-
-int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4])
-{
- struct kvm_shadow_walk_iterator iterator;
- int nr_sptes = 0;
-
- spin_lock(&vcpu->kvm->mmu_lock);
- for_each_shadow_entry(vcpu, addr, iterator) {
- sptes[iterator.level-1] = *iterator.sptep;
- nr_sptes++;
- if (!is_shadow_present_pte(*iterator.sptep))
- break;
- }
- spin_unlock(&vcpu->kvm->mmu_lock);
-
- return nr_sptes;
-}
-EXPORT_SYMBOL_GPL(kvm_mmu_get_spte_hierarchy);
-
-#ifdef AUDIT
-
-static const char *audit_msg;
-
-static gva_t canonicalize(gva_t gva)
-{
-#ifdef CONFIG_X86_64
- gva = (long long)(gva << 16) >> 16;
-#endif
- return gva;
-}
-
-
-typedef void (*inspect_spte_fn) (struct kvm *kvm, u64 *sptep);
-
-static void __mmu_spte_walk(struct kvm *kvm, struct kvm_mmu_page *sp,
- inspect_spte_fn fn)
-{
- int i;
-
- for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
- u64 ent = sp->spt[i];
-
- if (is_shadow_present_pte(ent)) {
- if (!is_last_spte(ent, sp->role.level)) {
- struct kvm_mmu_page *child;
- child = page_header(ent & PT64_BASE_ADDR_MASK);
- __mmu_spte_walk(kvm, child, fn);
- } else
- fn(kvm, &sp->spt[i]);
- }
- }
-}
-
-static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
-{
- int i;
- struct kvm_mmu_page *sp;
-
- if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
- return;
- if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
- hpa_t root = vcpu->arch.mmu.root_hpa;
- sp = page_header(root);
- __mmu_spte_walk(vcpu->kvm, sp, fn);
- return;
- }
- for (i = 0; i < 4; ++i) {
- hpa_t root = vcpu->arch.mmu.pae_root[i];
-
- if (root && VALID_PAGE(root)) {
- root &= PT64_BASE_ADDR_MASK;
- sp = page_header(root);
- __mmu_spte_walk(vcpu->kvm, sp, fn);
- }
- }
- return;
-}
-
-static void audit_mappings_page(struct kvm_vcpu *vcpu, u64 page_pte,
- gva_t va, int level)
-{
- u64 *pt = __va(page_pte & PT64_BASE_ADDR_MASK);
- int i;
- gva_t va_delta = 1ul << (PAGE_SHIFT + 9 * (level - 1));
-
- for (i = 0; i < PT64_ENT_PER_PAGE; ++i, va += va_delta) {
- u64 ent = pt[i];
-
- if (ent == shadow_trap_nonpresent_pte)
- continue;
-
- va = canonicalize(va);
- if (is_shadow_present_pte(ent) && !is_last_spte(ent, level))
- audit_mappings_page(vcpu, ent, va, level - 1);
- else {
- gpa_t gpa = kvm_mmu_gva_to_gpa_read(vcpu, va, NULL);
- gfn_t gfn = gpa >> PAGE_SHIFT;
- pfn_t pfn = gfn_to_pfn(vcpu->kvm, gfn);
- hpa_t hpa = (hpa_t)pfn << PAGE_SHIFT;
-
- if (is_error_pfn(pfn)) {
- kvm_release_pfn_clean(pfn);
- continue;
- }
-
- if (is_shadow_present_pte(ent)
- && (ent & PT64_BASE_ADDR_MASK) != hpa)
- printk(KERN_ERR "xx audit error: (%s) levels %d"
- " gva %lx gpa %llx hpa %llx ent %llx %d\n",
- audit_msg, vcpu->arch.mmu.root_level,
- va, gpa, hpa, ent,
- is_shadow_present_pte(ent));
- else if (ent == shadow_notrap_nonpresent_pte
- && !is_error_hpa(hpa))
- printk(KERN_ERR "audit: (%s) notrap shadow,"
- " valid guest gva %lx\n", audit_msg, va);
- kvm_release_pfn_clean(pfn);
-
- }
- }
-}
-
-static void audit_mappings(struct kvm_vcpu *vcpu)
-{
- unsigned i;
-
- if (vcpu->arch.mmu.root_level == 4)
- audit_mappings_page(vcpu, vcpu->arch.mmu.root_hpa, 0, 4);
- else
- for (i = 0; i < 4; ++i)
- if (vcpu->arch.mmu.pae_root[i] & PT_PRESENT_MASK)
- audit_mappings_page(vcpu,
- vcpu->arch.mmu.pae_root[i],
- i << 30,
- 2);
-}
-
-static int count_rmaps(struct kvm_vcpu *vcpu)
-{
- struct kvm *kvm = vcpu->kvm;
- struct kvm_memslots *slots;
- int nmaps = 0;
- int i, j, k, idx;
-
- idx = srcu_read_lock(&kvm->srcu);
- slots = kvm_memslots(kvm);
- for (i = 0; i < KVM_MEMORY_SLOTS; ++i) {
- struct kvm_memory_slot *m = &slots->memslots[i];
- struct kvm_rmap_desc *d;
-
- for (j = 0; j < m->npages; ++j) {
- unsigned long *rmapp = &m->rmap[j];
-
- if (!*rmapp)
- continue;
- if (!(*rmapp & 1)) {
- ++nmaps;
- continue;
- }
- d = (struct kvm_rmap_desc *)(*rmapp & ~1ul);
- while (d) {
- for (k = 0; k < RMAP_EXT; ++k)
- if (d->sptes[k])
- ++nmaps;
- else
- break;
- d = d->more;
- }
- }
- }
- srcu_read_unlock(&kvm->srcu, idx);
- return nmaps;
-}
-
-void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
-{
- unsigned long *rmapp;
- struct kvm_mmu_page *rev_sp;
- gfn_t gfn;
-
- if (*sptep & PT_WRITABLE_MASK) {
- rev_sp = page_header(__pa(sptep));
- gfn = rev_sp->gfns[sptep - rev_sp->spt];
-
- if (!gfn_to_memslot(kvm, gfn)) {
- if (!printk_ratelimit())
- return;
- printk(KERN_ERR "%s: no memslot for gfn %ld\n",
- audit_msg, gfn);
- printk(KERN_ERR "%s: index %ld of sp (gfn=%lx)\n",
- audit_msg, (long int)(sptep - rev_sp->spt),
- rev_sp->gfn);
- dump_stack();
- return;
- }
-
- rmapp = gfn_to_rmap(kvm, rev_sp->gfns[sptep - rev_sp->spt],
- rev_sp->role.level);
- if (!*rmapp) {
- if (!printk_ratelimit())
- return;
- printk(KERN_ERR "%s: no rmap for writable spte %llx\n",
- audit_msg, *sptep);
- dump_stack();
- }
- }
-
-}
-
-void audit_writable_sptes_have_rmaps(struct kvm_vcpu *vcpu)
-{
- mmu_spte_walk(vcpu, inspect_spte_has_rmap);
-}
-
-static void check_writable_mappings_rmap(struct kvm_vcpu *vcpu)
-{
- struct kvm_mmu_page *sp;
- int i;
-
- list_for_each_entry(sp, &vcpu->kvm->arch.active_mmu_pages, link) {
- u64 *pt = sp->spt;
-
- if (sp->role.level != PT_PAGE_TABLE_LEVEL)
- continue;
-
- for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
- u64 ent = pt[i];
-
- if (!(ent & PT_PRESENT_MASK))
- continue;
- if (!(ent & PT_WRITABLE_MASK))
- continue;
- inspect_spte_has_rmap(vcpu->kvm, &pt[i]);
- }
- }
- return;
-}
-
-static void audit_rmap(struct kvm_vcpu *vcpu)
-{
- check_writable_mappings_rmap(vcpu);
- count_rmaps(vcpu);
-}
-
-static void audit_write_protection(struct kvm_vcpu *vcpu)
-{
- struct kvm_mmu_page *sp;
- struct kvm_memory_slot *slot;
- unsigned long *rmapp;
- u64 *spte;
- gfn_t gfn;
-
- list_for_each_entry(sp, &vcpu->kvm->arch.active_mmu_pages, link) {
- if (sp->role.direct)
- continue;
- if (sp->unsync)
- continue;
-
- gfn = unalias_gfn(vcpu->kvm, sp->gfn);
- slot = gfn_to_memslot_unaliased(vcpu->kvm, sp->gfn);
- rmapp = &slot->rmap[gfn - slot->base_gfn];
-
- spte = rmap_next(vcpu->kvm, rmapp, NULL);
- while (spte) {
- if (*spte & PT_WRITABLE_MASK)
- printk(KERN_ERR "%s: (%s) shadow page has "
- "writable mappings: gfn %lx role %x\n",
- __func__, audit_msg, sp->gfn,
- sp->role.word);
- spte = rmap_next(vcpu->kvm, rmapp, spte);
- }
- }
-}
-
-static void kvm_mmu_audit(struct kvm_vcpu *vcpu, const char *msg)
-{
- int olddbg = dbg;
-
- dbg = 0;
- audit_msg = msg;
- audit_rmap(vcpu);
- audit_write_protection(vcpu);
- if (strcmp("pre pte write", audit_msg) != 0)
- audit_mappings(vcpu);
- audit_writable_sptes_have_rmaps(vcpu);
- dbg = olddbg;
-}
-
-#endif
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index be66759321a5..830f46145692 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -1,25 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __KVM_X86_MMU_H
#define __KVM_X86_MMU_H
#include <linux/kvm_host.h>
#include "kvm_cache_regs.h"
+#include "x86.h"
+#include "cpuid.h"
-#define PT64_PT_BITS 9
-#define PT64_ENT_PER_PAGE (1 << PT64_PT_BITS)
-#define PT32_PT_BITS 10
-#define PT32_ENT_PER_PAGE (1 << PT32_PT_BITS)
+extern bool __read_mostly enable_mmio_caching;
#define PT_WRITABLE_SHIFT 1
+#define PT_USER_SHIFT 2
#define PT_PRESENT_MASK (1ULL << 0)
#define PT_WRITABLE_MASK (1ULL << PT_WRITABLE_SHIFT)
-#define PT_USER_MASK (1ULL << 2)
+#define PT_USER_MASK (1ULL << PT_USER_SHIFT)
#define PT_PWT_MASK (1ULL << 3)
#define PT_PCD_MASK (1ULL << 4)
#define PT_ACCESSED_SHIFT 5
#define PT_ACCESSED_MASK (1ULL << PT_ACCESSED_SHIFT)
-#define PT_DIRTY_MASK (1ULL << 6)
-#define PT_PAGE_SIZE_MASK (1ULL << 7)
+#define PT_DIRTY_SHIFT 6
+#define PT_DIRTY_MASK (1ULL << PT_DIRTY_SHIFT)
+#define PT_PAGE_SIZE_SHIFT 7
+#define PT_PAGE_SIZE_MASK (1ULL << PT_PAGE_SIZE_SHIFT)
#define PT_PAT_MASK (1ULL << 7)
#define PT_GLOBAL_MASK (1ULL << 8)
#define PT64_NX_SHIFT 63
@@ -29,44 +32,294 @@
#define PT_DIR_PAT_SHIFT 12
#define PT_DIR_PAT_MASK (1ULL << PT_DIR_PAT_SHIFT)
-#define PT32_DIR_PSE36_SIZE 4
-#define PT32_DIR_PSE36_SHIFT 13
-#define PT32_DIR_PSE36_MASK \
- (((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT)
-
-#define PT64_ROOT_LEVEL 4
+#define PT64_ROOT_5LEVEL 5
+#define PT64_ROOT_4LEVEL 4
#define PT32_ROOT_LEVEL 2
#define PT32E_ROOT_LEVEL 3
-#define PT_PDPE_LEVEL 3
-#define PT_DIRECTORY_LEVEL 2
-#define PT_PAGE_TABLE_LEVEL 1
+#define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PSE | X86_CR4_PAE | X86_CR4_LA57 | \
+ X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE)
+
+#define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)
+#define KVM_MMU_EFER_ROLE_BITS (EFER_LME | EFER_NX)
+
+static __always_inline u64 rsvd_bits(int s, int e)
+{
+ BUILD_BUG_ON(__builtin_constant_p(e) && __builtin_constant_p(s) && e < s);
+
+ if (__builtin_constant_p(e))
+ BUILD_BUG_ON(e > 63);
+ else
+ e &= 63;
-#define PFERR_PRESENT_MASK (1U << 0)
-#define PFERR_WRITE_MASK (1U << 1)
-#define PFERR_USER_MASK (1U << 2)
-#define PFERR_RSVD_MASK (1U << 3)
-#define PFERR_FETCH_MASK (1U << 4)
+ if (e < s)
+ return 0;
-int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
+ return ((2ULL << (e - s)) - 1) << s;
+}
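For illustration, a minimal standalone sketch of the mask formula (rsvd_bits_demo is a hypothetical copy for demonstration, not a kernel symbol):

#include <stdio.h>
#include <stdint.h>

/* Hypothetical standalone copy of the rsvd_bits() mask formula. */
static uint64_t rsvd_bits_demo(int s, int e)
{
	if (e < s)
		return 0;
	return ((2ULL << (e - s)) - 1) << s;
}

int main(void)
{
	/* Bits 63:52 set: 0xfff0000000000000, the classic reserved-GPA mask. */
	printf("%#llx\n", (unsigned long long)rsvd_bits_demo(52, 63));
	return 0;
}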
-static inline void kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
+static inline gfn_t kvm_mmu_max_gfn(void)
{
- if (unlikely(vcpu->kvm->arch.n_free_mmu_pages < KVM_MIN_FREE_MMU_PAGES))
- __kvm_mmu_free_some_pages(vcpu);
+ /*
+ * Note that this uses the host MAXPHYADDR, not the guest's.
+ * EPT/NPT cannot support GPAs that would exceed host.MAXPHYADDR;
+ * assuming KVM is running on bare metal, guest accesses beyond
+ * host.MAXPHYADDR will hit a #PF(RSVD) and never cause a vmexit
+ * (either EPT Violation/Misconfig or #NPF), and so KVM will never
+ * install a SPTE for such addresses. If KVM is running as a VM
+ * itself, on the other hand, it might see a MAXPHYADDR that is less
+ * than hardware's real MAXPHYADDR. Using the host MAXPHYADDR
+ * disallows such SPTEs entirely and simplifies the TDP MMU.
+ */
+ int max_gpa_bits = likely(tdp_enabled) ? kvm_host.maxphyaddr : 52;
+
+ return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
}
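For intuition, a hypothetical example (the MAXPHYADDR value of 46 is assumed for illustration, not taken from the patch): with tdp_enabled and kvm_host.maxphyaddr == 46, the computation reduces to:

/* Assumed: host MAXPHYADDR = 46, PAGE_SHIFT = 12. */
unsigned long long demo_max_gfn = (1ULL << (46 - 12)) - 1;	/* 0x3ffffffff */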
+u8 kvm_mmu_get_max_tdp_level(void);
+
+void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value);
+void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
+
+void kvm_init_mmu(struct kvm_vcpu *vcpu);
+void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
+ unsigned long cr4, u64 efer, gpa_t nested_cr3);
+void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
+ int huge_page_level, bool accessed_dirty,
+ gpa_t new_eptp);
+bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
+int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
+ u64 fault_address, char *insn, int insn_len);
+void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *mmu);
+
+int kvm_mmu_load(struct kvm_vcpu *vcpu);
+void kvm_mmu_unload(struct kvm_vcpu *vcpu);
+void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu);
+void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu);
+void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu);
+void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+ int bytes);
+
static inline int kvm_mmu_reload(struct kvm_vcpu *vcpu)
{
- if (likely(vcpu->arch.mmu.root_hpa != INVALID_PAGE))
+ if (kvm_check_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
+ kvm_mmu_free_obsolete_roots(vcpu);
+
+ /*
+	 * Checking root.hpa is sufficient even when KVM has a mirror root.
+ * We can have either:
+ * (1) mirror_root_hpa = INVALID_PAGE, root.hpa = INVALID_PAGE
+ * (2) mirror_root_hpa = root, root.hpa = INVALID_PAGE
+ * (3) mirror_root_hpa = root1, root.hpa = root2
+ * We don't ever have:
+ * mirror_root_hpa = INVALID_PAGE, root.hpa = root
+ */
+ if (likely(vcpu->arch.mmu->root.hpa != INVALID_PAGE))
return 0;
return kvm_mmu_load(vcpu);
}
-static inline int is_present_gpte(unsigned long pte)
+static inline unsigned long kvm_get_pcid(struct kvm_vcpu *vcpu, gpa_t cr3)
+{
+ BUILD_BUG_ON((X86_CR3_PCID_MASK & PAGE_MASK) != 0);
+
+ return kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE)
+ ? cr3 & X86_CR3_PCID_MASK
+ : 0;
+}
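A small worked example (CR3 value assumed for illustration); architecturally the PCID occupies CR3 bits 11:0, which is what X86_CR3_PCID_MASK selects:

/* With CR4.PCIDE = 1 and CR3 = 0x1234005, the active PCID is 5;
 * with CR4.PCIDE = 0, kvm_get_pcid() returns 0 regardless of CR3.
 */
unsigned long demo_pcid = 0x1234005ul & 0xffful;	/* == 0x5 */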
+
+static inline unsigned long kvm_get_active_pcid(struct kvm_vcpu *vcpu)
+{
+ return kvm_get_pcid(vcpu, kvm_read_cr3(vcpu));
+}
+
+static inline unsigned long kvm_get_active_cr3_lam_bits(struct kvm_vcpu *vcpu)
+{
+ if (!guest_cpu_cap_has(vcpu, X86_FEATURE_LAM))
+ return 0;
+
+ return kvm_read_cr3(vcpu) & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57);
+}
+
+static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
+{
+ u64 root_hpa = vcpu->arch.mmu->root.hpa;
+
+ if (!VALID_PAGE(root_hpa))
+ return;
+
+ kvm_x86_call(load_mmu_pgd)(vcpu, root_hpa,
+ vcpu->arch.mmu->root_role.level);
+}
+
+static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *mmu)
+{
+ /*
+ * When EPT is enabled, KVM may passthrough CR0.WP to the guest, i.e.
+ * @mmu's snapshot of CR0.WP and thus all related paging metadata may
+ * be stale. Refresh CR0.WP and the metadata on-demand when checking
+ * for permission faults. Exempt nested MMUs, i.e. MMUs for shadowing
+ * nEPT and nNPT, as CR0.WP is ignored in both cases. Note, KVM does
+ * need to refresh nested_mmu, a.k.a. the walker used to translate L2
+ * GVAs to GPAs, as that "MMU" needs to honor L2's CR0.WP.
+ */
+ if (!tdp_enabled || mmu == &vcpu->arch.guest_mmu)
+ return;
+
+ __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
+}
+
+/*
+ * Check if a given access (described through the I/D, W/R and U/S bits of a
+ * page fault error code pfec) causes a permission fault with the given PTE
+ * access rights (in ACC_* format).
+ *
+ * Return zero if the access does not fault; return the page fault error code
+ * if the access faults.
+ */
+static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+ unsigned pte_access, unsigned pte_pkey,
+ u64 access)
+{
+ /* strip nested paging fault error codes */
+ unsigned int pfec = access;
+ unsigned long rflags = kvm_x86_call(get_rflags)(vcpu);
+
+ /*
+ * For explicit supervisor accesses, SMAP is disabled if EFLAGS.AC = 1.
+ * For implicit supervisor accesses, SMAP cannot be overridden.
+ *
+	 * SMAP affects supervisor accesses only; for user accesses, not_smap
+	 * may be either set or clear, with no bearing on the result.
+ *
+ * We put the SMAP checking bit in place of the PFERR_RSVD_MASK bit;
+ * this bit will always be zero in pfec, but it will be one in index
+ * if SMAP checks are being disabled.
+ */
+ u64 implicit_access = access & PFERR_IMPLICIT_ACCESS;
+ bool not_smap = ((rflags & X86_EFLAGS_AC) | implicit_access) == X86_EFLAGS_AC;
+ int index = (pfec | (not_smap ? PFERR_RSVD_MASK : 0)) >> 1;
+ u32 errcode = PFERR_PRESENT_MASK;
+ bool fault;
+
+ kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
+
+ fault = (mmu->permissions[index] >> pte_access) & 1;
+
+ WARN_ON_ONCE(pfec & (PFERR_PK_MASK | PFERR_SS_MASK | PFERR_RSVD_MASK));
+ if (unlikely(mmu->pkru_mask)) {
+ u32 pkru_bits, offset;
+
+ /*
+ * PKRU defines 32 bits, there are 16 domains and 2
+ * attribute bits per domain in pkru. pte_pkey is the
+ * index of the protection domain, so pte_pkey * 2 is
+	 * the index of the first bit for the domain.
+ */
+ pkru_bits = (vcpu->arch.pkru >> (pte_pkey * 2)) & 3;
+
+ /* clear present bit, replace PFEC.RSVD with ACC_USER_MASK. */
+ offset = (pfec & ~1) | ((pte_access & PT_USER_MASK) ? PFERR_RSVD_MASK : 0);
+
+ pkru_bits &= mmu->pkru_mask >> offset;
+ errcode |= -pkru_bits & PFERR_PK_MASK;
+ fault |= (pkru_bits != 0);
+ }
+
+ return -(u32)fault & errcode;
+}
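To make the index trick concrete, a hypothetical standalone sketch (the demo_* names are not kernel symbols; the bit positions match the W=1/U=2/RSVD=3/F=4 PFERR layout assumed by the code above): pfec is shifted down by one so that W, U, "SMAP disabled", and F form a 4-bit index into the 16-entry permissions table.

#define DEMO_PFERR_WRITE	(1u << 1)
#define DEMO_PFERR_USER		(1u << 2)
#define DEMO_PFERR_RSVD		(1u << 3)	/* repurposed: "SMAP checks disabled" */
#define DEMO_PFERR_FETCH	(1u << 4)

static unsigned demo_perm_index(unsigned pfec, int not_smap)
{
	return (pfec | (not_smap ? DEMO_PFERR_RSVD : 0)) >> 1;
}

/* A user-mode write (pfec = W|U = 0x6) with SMAP active indexes
 * permissions[3]; the same access with SMAP checks disabled would
 * index permissions[7].
 */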
+
+int kvm_mmu_post_init_vm(struct kvm *kvm);
+void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
+
+static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
+{
+ /*
+ * Read shadow_root_allocated before related pointers. Hence, threads
+ * reading shadow_root_allocated in any lock context are guaranteed to
+ * see the pointers. Pairs with smp_store_release in
+ * mmu_first_shadow_root_alloc.
+ */
+ return smp_load_acquire(&kvm->arch.shadow_root_allocated);
+}
+
+#ifdef CONFIG_X86_64
+extern bool tdp_mmu_enabled;
+#else
+#define tdp_mmu_enabled false
+#endif
+
+int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn);
+
+static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
+{
+ return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
+}
+
+static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
+{
+ /* KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K) must be 0. */
+ return (gfn >> KVM_HPAGE_GFN_SHIFT(level)) -
+ (base_gfn >> KVM_HPAGE_GFN_SHIFT(level));
+}
+
+static inline unsigned long
+__kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, unsigned long npages,
+ int level)
+{
+ return gfn_to_index(slot->base_gfn + npages - 1,
+ slot->base_gfn, level) + 1;
+}
+
+static inline unsigned long
+kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, int level)
+{
+ return __kvm_mmu_slot_lpages(slot, slot->npages, level);
+}
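A worked example with assumed numbers: at the 2M level, KVM_HPAGE_GFN_SHIFT is 9 (512 4K pages per 2M page), so a slot of 0x1000 pages at base_gfn 0x800 spans eight 2M-aligned ranges. The demo_* helper below is illustrative only:

/* Assumed: KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) == 9. */
static unsigned long demo_slot_lpages_2m(unsigned long base_gfn,
					 unsigned long npages)
{
	return ((base_gfn + npages - 1) >> 9) - (base_gfn >> 9) + 1;
}
/* demo_slot_lpages_2m(0x800, 0x1000) == 8 */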
+
+static inline void kvm_update_page_stats(struct kvm *kvm, int level, int count)
+{
+ atomic64_add(count, &kvm->stat.pages[level - 1]);
+}
+
+gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u64 access,
+ struct x86_exception *exception);
+
+static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *mmu,
+ gpa_t gpa, u64 access,
+ struct x86_exception *exception)
+{
+ if (mmu != &vcpu->arch.nested_mmu)
+ return gpa;
+ return translate_nested_gpa(vcpu, gpa, access, exception);
+}
+
+static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline gfn_t kvm_gfn_direct_bits(const struct kvm *kvm)
+{
+ return kvm->arch.gfn_direct_bits;
+}
+
+static inline bool kvm_is_addr_direct(struct kvm *kvm, gpa_t gpa)
{
- return pte & PT_PRESENT_MASK;
+ gpa_t gpa_direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(kvm));
+
+ return !gpa_direct_bits || (gpa & gpa_direct_bits);
}
+static inline bool kvm_is_gfn_alias(struct kvm *kvm, gfn_t gfn)
+{
+ return gfn & kvm_gfn_direct_bits(kvm);
+}
#endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
new file mode 100644
index 000000000000..02c450686b4a
--- /dev/null
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -0,0 +1,8095 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * This module enables machines with Intel VT-x extensions to run virtual
+ * machines without emulation or binary translation.
+ *
+ * MMU support
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
+ *
+ * Authors:
+ * Yaniv Kamay <yaniv@qumranet.com>
+ * Avi Kivity <avi@qumranet.com>
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include "irq.h"
+#include "ioapic.h"
+#include "mmu.h"
+#include "mmu_internal.h"
+#include "tdp_mmu.h"
+#include "x86.h"
+#include "kvm_cache_regs.h"
+#include "smm.h"
+#include "kvm_emulate.h"
+#include "page_track.h"
+#include "cpuid.h"
+#include "spte.h"
+
+#include <linux/kvm_host.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/moduleparam.h>
+#include <linux/export.h>
+#include <linux/swap.h>
+#include <linux/hugetlb.h>
+#include <linux/compiler.h>
+#include <linux/srcu.h>
+#include <linux/slab.h>
+#include <linux/sched/signal.h>
+#include <linux/uaccess.h>
+#include <linux/hash.h>
+#include <linux/kern_levels.h>
+#include <linux/kstrtox.h>
+#include <linux/kthread.h>
+#include <linux/wordpart.h>
+
+#include <asm/page.h>
+#include <asm/memtype.h>
+#include <asm/cmpxchg.h>
+#include <asm/io.h>
+#include <asm/set_memory.h>
+#include <asm/spec-ctrl.h>
+#include <asm/vmx.h>
+
+#include "trace.h"
+
+static bool nx_hugepage_mitigation_hard_disabled;
+
+int __read_mostly nx_huge_pages = -1;
+static uint __read_mostly nx_huge_pages_recovery_period_ms;
+#ifdef CONFIG_PREEMPT_RT
+/* Recovery can cause latency spikes, disable it for PREEMPT_RT. */
+static uint __read_mostly nx_huge_pages_recovery_ratio = 0;
+#else
+static uint __read_mostly nx_huge_pages_recovery_ratio = 60;
+#endif
+
+static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp);
+static int set_nx_huge_pages(const char *val, const struct kernel_param *kp);
+static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel_param *kp);
+
+static const struct kernel_param_ops nx_huge_pages_ops = {
+ .set = set_nx_huge_pages,
+ .get = get_nx_huge_pages,
+};
+
+static const struct kernel_param_ops nx_huge_pages_recovery_param_ops = {
+ .set = set_nx_huge_pages_recovery_param,
+ .get = param_get_uint,
+};
+
+module_param_cb(nx_huge_pages, &nx_huge_pages_ops, &nx_huge_pages, 0644);
+__MODULE_PARM_TYPE(nx_huge_pages, "bool");
+module_param_cb(nx_huge_pages_recovery_ratio, &nx_huge_pages_recovery_param_ops,
+ &nx_huge_pages_recovery_ratio, 0644);
+__MODULE_PARM_TYPE(nx_huge_pages_recovery_ratio, "uint");
+module_param_cb(nx_huge_pages_recovery_period_ms, &nx_huge_pages_recovery_param_ops,
+ &nx_huge_pages_recovery_period_ms, 0644);
+__MODULE_PARM_TYPE(nx_huge_pages_recovery_period_ms, "uint");
+
+static bool __read_mostly force_flush_and_sync_on_reuse;
+module_param_named(flush_on_reuse, force_flush_and_sync_on_reuse, bool, 0644);
+
+/*
+ * Setting this variable to true enables Two-Dimensional Paging, where the
+ * hardware walks two page tables:
+ * 1. the guest-virtual to guest-physical translation
+ * 2. while doing 1., the guest-physical to host-physical translation
+ * If the hardware supports that, we don't need to do shadow paging.
+ */
+bool tdp_enabled = false;
+
+static bool __ro_after_init tdp_mmu_allowed;
+
+#ifdef CONFIG_X86_64
+bool __read_mostly tdp_mmu_enabled = true;
+module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0444);
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(tdp_mmu_enabled);
+#endif
+
+static int max_huge_page_level __read_mostly;
+static int tdp_root_level __read_mostly;
+static int max_tdp_level __read_mostly;
+
+#define PTE_PREFETCH_NUM 8
+
+#include <trace/events/kvm.h>
+
+/* make pte_list_desc fit well in cache lines */
+#define PTE_LIST_EXT 14
+
+/*
+ * struct pte_list_desc is the core data structure used to implement a custom
+ * list for tracking a set of related SPTEs, e.g. all the SPTEs that map a
+ * given GFN when used in the context of rmaps. Using a custom list allows KVM
+ * to optimize for the common case where many GFNs will have at most a handful
+ * of SPTEs pointing at them, i.e. allows packing multiple SPTEs into a small
+ * memory footprint, which in turn improves runtime performance by exploiting
+ * cache locality.
+ *
+ * A list is comprised of one or more pte_list_desc objects (descriptors).
+ * Each individual descriptor stores up to PTE_LIST_EXT SPTEs. If a descriptor
+ * is full and a new SPTE needs to be added, a new descriptor is allocated and
+ * becomes the head of the list. This means that, by definition, all tail
+ * descriptors are full.
+ *
+ * Note, the metadata fields are deliberately placed at the start of the
+ * structure to optimize the cacheline layout; accessing the descriptor will
+ * touch only a single cacheline so long as @spte_count <= 6 (or if only the
+ * descriptor's metadata is accessed).
+ */
+struct pte_list_desc {
+ struct pte_list_desc *more;
+ /* The number of PTEs stored in _this_ descriptor. */
+ u32 spte_count;
+ /* The number of PTEs stored in all tails of this descriptor. */
+ u32 tail_count;
+ u64 *sptes[PTE_LIST_EXT];
+};
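Because all tail descriptors are kept full, the total chain length is recoverable from the head alone; a minimal sketch (demo_count_sptes is illustrative, not a kernel helper):

/* Count the SPTEs in a pte_list chain from its head descriptor. */
static u32 demo_count_sptes(struct pte_list_desc *head)
{
	/* tail_count already sums the (always-full) tail descriptors. */
	return head ? head->spte_count + head->tail_count : 0;
}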
+
+struct kvm_shadow_walk_iterator {
+ u64 addr;
+ hpa_t shadow_addr;
+ u64 *sptep;
+ int level;
+ unsigned index;
+};
+
+#define for_each_shadow_entry_using_root(_vcpu, _root, _addr, _walker) \
+ for (shadow_walk_init_using_root(&(_walker), (_vcpu), \
+ (_root), (_addr)); \
+ shadow_walk_okay(&(_walker)); \
+ shadow_walk_next(&(_walker)))
+
+#define for_each_shadow_entry(_vcpu, _addr, _walker) \
+ for (shadow_walk_init(&(_walker), _vcpu, _addr); \
+ shadow_walk_okay(&(_walker)); \
+ shadow_walk_next(&(_walker)))
+
+#define for_each_shadow_entry_lockless(_vcpu, _addr, _walker, spte) \
+ for (shadow_walk_init(&(_walker), _vcpu, _addr); \
+ shadow_walk_okay(&(_walker)) && \
+ ({ spte = mmu_spte_get_lockless(_walker.sptep); 1; }); \
+ __shadow_walk_next(&(_walker), spte))
+
+static struct kmem_cache *pte_list_desc_cache;
+struct kmem_cache *mmu_page_header_cache;
+
+static void mmu_spte_set(u64 *sptep, u64 spte);
+
+struct kvm_mmu_role_regs {
+ const unsigned long cr0;
+ const unsigned long cr4;
+ const u64 efer;
+};
+
+#define CREATE_TRACE_POINTS
+#include "mmutrace.h"
+
+/*
+ * Yes, lots of underscores. They're a hint that you probably shouldn't be
+ * reading from the role_regs. Once the root_role is constructed, it becomes
+ * the single source of truth for the MMU's state.
+ */
+#define BUILD_MMU_ROLE_REGS_ACCESSOR(reg, name, flag) \
+static inline bool __maybe_unused \
+____is_##reg##_##name(const struct kvm_mmu_role_regs *regs) \
+{ \
+ return !!(regs->reg & flag); \
+}
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr0, pg, X86_CR0_PG);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr0, wp, X86_CR0_WP);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pse, X86_CR4_PSE);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pae, X86_CR4_PAE);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smep, X86_CR4_SMEP);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, smap, X86_CR4_SMAP);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, pke, X86_CR4_PKE);
+BUILD_MMU_ROLE_REGS_ACCESSOR(cr4, la57, X86_CR4_LA57);
+BUILD_MMU_ROLE_REGS_ACCESSOR(efer, nx, EFER_NX);
+BUILD_MMU_ROLE_REGS_ACCESSOR(efer, lma, EFER_LMA);
+
+/*
+ * The MMU itself (with a valid role) is the single source of truth for the
+ * MMU. Do not use the regs used to build the MMU/role, nor the vCPU. The
+ * regs don't account for dependencies, e.g. clearing CR4 bits if CR0.PG=1,
+ * and the vCPU may be incorrect/irrelevant.
+ */
+#define BUILD_MMU_ROLE_ACCESSOR(base_or_ext, reg, name) \
+static inline bool __maybe_unused is_##reg##_##name(struct kvm_mmu *mmu) \
+{ \
+ return !!(mmu->cpu_role. base_or_ext . reg##_##name); \
+}
+BUILD_MMU_ROLE_ACCESSOR(base, cr0, wp);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pse);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smep);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, smap);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, pke);
+BUILD_MMU_ROLE_ACCESSOR(ext, cr4, la57);
+BUILD_MMU_ROLE_ACCESSOR(base, efer, nx);
+BUILD_MMU_ROLE_ACCESSOR(ext, efer, lma);
+
+static inline bool is_cr0_pg(struct kvm_mmu *mmu)
+{
+ return mmu->cpu_role.base.level > 0;
+}
+
+static inline bool is_cr4_pae(struct kvm_mmu *mmu)
+{
+ return !mmu->cpu_role.base.has_4_byte_gpte;
+}
+
+static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_role_regs regs = {
+ .cr0 = kvm_read_cr0_bits(vcpu, KVM_MMU_CR0_ROLE_BITS),
+ .cr4 = kvm_read_cr4_bits(vcpu, KVM_MMU_CR4_ROLE_BITS),
+ .efer = vcpu->arch.efer,
+ };
+
+ return regs;
+}
+
+static unsigned long get_guest_cr3(struct kvm_vcpu *vcpu)
+{
+ return kvm_read_cr3(vcpu);
+}
+
+static inline unsigned long kvm_mmu_get_guest_pgd(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *mmu)
+{
+ if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) && mmu->get_guest_pgd == get_guest_cr3)
+ return kvm_read_cr3(vcpu);
+
+ return mmu->get_guest_pgd(vcpu);
+}
+
+static inline bool kvm_available_flush_remote_tlbs_range(void)
+{
+#if IS_ENABLED(CONFIG_HYPERV)
+ return kvm_x86_ops.flush_remote_tlbs_range;
+#else
+ return false;
+#endif
+}
+
+static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index);
+
+/* Flush the range of guest memory mapped by the given SPTE. */
+static void kvm_flush_remote_tlbs_sptep(struct kvm *kvm, u64 *sptep)
+{
+ struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+ gfn_t gfn = kvm_mmu_page_get_gfn(sp, spte_index(sptep));
+
+ kvm_flush_remote_tlbs_gfn(kvm, gfn, sp->role.level);
+}
+
+static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
+ unsigned int access)
+{
+ u64 spte = make_mmio_spte(vcpu, gfn, access);
+
+ trace_mark_mmio_spte(sptep, gfn, spte);
+ mmu_spte_set(sptep, spte);
+}
+
+static gfn_t get_mmio_spte_gfn(u64 spte)
+{
+ u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask;
+
+ gpa |= (spte >> SHADOW_NONPRESENT_OR_RSVD_MASK_LEN)
+ & shadow_nonpresent_or_rsvd_mask;
+
+ return gpa >> PAGE_SHIFT;
+}
+
+static unsigned get_mmio_spte_access(u64 spte)
+{
+ return spte & shadow_mmio_access_mask;
+}
+
+static bool check_mmio_spte(struct kvm_vcpu *vcpu, u64 spte)
+{
+ u64 kvm_gen, spte_gen, gen;
+
+ gen = kvm_vcpu_memslots(vcpu)->generation;
+ if (unlikely(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS))
+ return false;
+
+ kvm_gen = gen & MMIO_SPTE_GEN_MASK;
+ spte_gen = get_mmio_spte_generation(spte);
+
+ trace_check_mmio_spte(spte, kvm_gen, spte_gen);
+ return likely(kvm_gen == spte_gen);
+}
+
+static int is_cpuid_PSE36(void)
+{
+ return 1;
+}
+
+#ifdef CONFIG_X86_64
+static void __set_spte(u64 *sptep, u64 spte)
+{
+ KVM_MMU_WARN_ON(is_ept_ve_possible(spte));
+ WRITE_ONCE(*sptep, spte);
+}
+
+static void __update_clear_spte_fast(u64 *sptep, u64 spte)
+{
+ KVM_MMU_WARN_ON(is_ept_ve_possible(spte));
+ WRITE_ONCE(*sptep, spte);
+}
+
+static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
+{
+ KVM_MMU_WARN_ON(is_ept_ve_possible(spte));
+ return xchg(sptep, spte);
+}
+
+static u64 __get_spte_lockless(u64 *sptep)
+{
+ return READ_ONCE(*sptep);
+}
+#else
+union split_spte {
+ struct {
+ u32 spte_low;
+ u32 spte_high;
+ };
+ u64 spte;
+};
+
+static void count_spte_clear(u64 *sptep, u64 spte)
+{
+ struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+
+ if (is_shadow_present_pte(spte))
+ return;
+
+ /* Ensure the spte is completely set before we increase the count */
+ smp_wmb();
+ sp->clear_spte_count++;
+}
+
+static void __set_spte(u64 *sptep, u64 spte)
+{
+ union split_spte *ssptep, sspte;
+
+ ssptep = (union split_spte *)sptep;
+ sspte = (union split_spte)spte;
+
+ ssptep->spte_high = sspte.spte_high;
+
+ /*
+	 * If we map the spte from nonpresent to present, we should store
+	 * the high bits first, then set the present bit, so the CPU cannot
+	 * fetch this spte while we are setting it.
+ */
+ smp_wmb();
+
+ WRITE_ONCE(ssptep->spte_low, sspte.spte_low);
+}
+
+static void __update_clear_spte_fast(u64 *sptep, u64 spte)
+{
+ union split_spte *ssptep, sspte;
+
+ ssptep = (union split_spte *)sptep;
+ sspte = (union split_spte)spte;
+
+ WRITE_ONCE(ssptep->spte_low, sspte.spte_low);
+
+ /*
+ * If we map the spte from present to nonpresent, we should clear
+	 * the present bit first to prevent the vCPU from fetching the old high bits.
+ */
+ smp_wmb();
+
+ ssptep->spte_high = sspte.spte_high;
+ count_spte_clear(sptep, spte);
+}
+
+static u64 __update_clear_spte_slow(u64 *sptep, u64 spte)
+{
+ union split_spte *ssptep, sspte, orig;
+
+ ssptep = (union split_spte *)sptep;
+ sspte = (union split_spte)spte;
+
+ /* xchg acts as a barrier before the setting of the high bits */
+ orig.spte_low = xchg(&ssptep->spte_low, sspte.spte_low);
+ orig.spte_high = ssptep->spte_high;
+ ssptep->spte_high = sspte.spte_high;
+ count_spte_clear(sptep, spte);
+
+ return orig.spte;
+}
+
+/*
+ * The idea of using this lightweight scheme to read the spte on x86_32 hosts
+ * comes from gup_get_pte (mm/gup.c).
+ *
+ * An spte TLB flush may be pending, because flushes are coalesced and
+ * we are running outside of the MMU lock.  Therefore
+ * we need to protect against in-progress updates of the spte.
+ *
+ * Reading the spte while an update is in progress may get the old value
+ * for the high part of the spte. The race is fine for a present->non-present
+ * change (because the high part of the spte is ignored for non-present spte),
+ * but for a present->present change we must reread the spte.
+ *
+ * All such changes are done in two steps (present->non-present and
+ * non-present->present), hence it is enough to count the number of
+ * present->non-present updates: if it changed while reading the spte,
+ * we might have hit the race. This is done using clear_spte_count.
+ */
+static u64 __get_spte_lockless(u64 *sptep)
+{
+ struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+ union split_spte spte, *orig = (union split_spte *)sptep;
+ int count;
+
+retry:
+ count = sp->clear_spte_count;
+ smp_rmb();
+
+ spte.spte_low = orig->spte_low;
+ smp_rmb();
+
+ spte.spte_high = orig->spte_high;
+ smp_rmb();
+
+ if (unlikely(spte.spte_low != orig->spte_low ||
+ count != sp->clear_spte_count))
+ goto retry;
+
+ return spte.spte;
+}
+#endif
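The 32-bit path above is essentially a seqcount-style retry loop keyed on clear_spte_count; a minimal user-space analog (an assumption-laden sketch with the memory barriers elided, not kernel code):

struct demo_split {
	volatile unsigned int gen;	/* stands in for sp->clear_spte_count */
	volatile unsigned int lo, hi;	/* stands in for spte_low/spte_high */
};

static unsigned long long demo_read_split(struct demo_split *s)
{
	unsigned int gen, lo, hi;

	do {
		gen = s->gen;	/* the kernel inserts smp_rmb() between reads */
		lo = s->lo;
		hi = s->hi;
	} while (lo != s->lo || gen != s->gen);	/* retry if a clear raced */

	return ((unsigned long long)hi << 32) | lo;
}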
+
+/* Rules for using mmu_spte_set:
+ * Set the sptep from nonpresent to present.
+ * Note: the sptep being assigned *must* be either not present
+ * or in a state where the hardware will not attempt to update
+ * the spte.
+ */
+static void mmu_spte_set(u64 *sptep, u64 new_spte)
+{
+ WARN_ON_ONCE(is_shadow_present_pte(*sptep));
+ __set_spte(sptep, new_spte);
+}
+
+/* Rules for using mmu_spte_update:
+ * Update the state bits; this implies that the mapped pfn is not changed.
+ *
+ * Returns true if the TLB needs to be flushed
+ */
+static bool mmu_spte_update(u64 *sptep, u64 new_spte)
+{
+ u64 old_spte = *sptep;
+
+ WARN_ON_ONCE(!is_shadow_present_pte(new_spte));
+ check_spte_writable_invariants(new_spte);
+
+ if (!is_shadow_present_pte(old_spte)) {
+ mmu_spte_set(sptep, new_spte);
+ return false;
+ }
+
+ if (!spte_needs_atomic_update(old_spte))
+ __update_clear_spte_fast(sptep, new_spte);
+ else
+ old_spte = __update_clear_spte_slow(sptep, new_spte);
+
+ WARN_ON_ONCE(!is_shadow_present_pte(old_spte) ||
+ spte_to_pfn(old_spte) != spte_to_pfn(new_spte));
+
+ return leaf_spte_change_needs_tlb_flush(old_spte, new_spte);
+}
+
+/*
+ * Rules for using mmu_spte_clear_track_bits:
+ * It sets the sptep from present to nonpresent while tracking the
+ * state bits; it is used to clear the last-level sptep.
+ * Returns the old PTE.
+ */
+static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
+{
+ u64 old_spte = *sptep;
+ int level = sptep_to_sp(sptep)->role.level;
+
+ if (!is_shadow_present_pte(old_spte) ||
+ !spte_needs_atomic_update(old_spte))
+ __update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
+ else
+ old_spte = __update_clear_spte_slow(sptep, SHADOW_NONPRESENT_VALUE);
+
+ if (!is_shadow_present_pte(old_spte))
+ return old_spte;
+
+ kvm_update_page_stats(kvm, level, -1);
+ return old_spte;
+}
+
+/*
+ * Rules for using mmu_spte_clear_no_track:
+ * Directly clear the spte without caring about the state bits of the sptep;
+ * it is used to set the upper-level spte.
+ */
+static void mmu_spte_clear_no_track(u64 *sptep)
+{
+ __update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
+}
+
+static u64 mmu_spte_get_lockless(u64 *sptep)
+{
+ return __get_spte_lockless(sptep);
+}
+
+static inline bool is_tdp_mmu_active(struct kvm_vcpu *vcpu)
+{
+ return tdp_mmu_enabled && vcpu->arch.mmu->root_role.direct;
+}
+
+static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
+{
+ if (is_tdp_mmu_active(vcpu)) {
+ kvm_tdp_mmu_walk_lockless_begin();
+ } else {
+ /*
+ * Prevent page table teardown by making any free-er wait during
+ * kvm_flush_remote_tlbs() IPI to all active vcpus.
+ */
+ local_irq_disable();
+
+ /*
+ * Make sure a following spte read is not reordered ahead of the write
+ * to vcpu->mode.
+ */
+ smp_store_mb(vcpu->mode, READING_SHADOW_PAGE_TABLES);
+ }
+}
+
+static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
+{
+ if (is_tdp_mmu_active(vcpu)) {
+ kvm_tdp_mmu_walk_lockless_end();
+ } else {
+ /*
+ * Make sure the write to vcpu->mode is not reordered in front of
+		 * reads to sptes. If it is, kvm_mmu_commit_zap_page() can see us
+ * OUTSIDE_GUEST_MODE and proceed to free the shadow page table.
+ */
+ smp_store_release(&vcpu->mode, OUTSIDE_GUEST_MODE);
+ local_irq_enable();
+ }
+}
+
+static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
+{
+ int r;
+
+ /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
+ 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
+ if (r)
+ return r;
+ if (kvm_has_mirrored_tdp(vcpu->kvm)) {
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_external_spt_cache,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ }
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ if (maybe_indirect) {
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
+ PT64_ROOT_MAX_LEVEL);
+ if (r)
+ return r;
+ }
+ return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
+ PT64_ROOT_MAX_LEVEL);
+}
+
+static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
+{
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_external_spt_cache);
+ kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
+}
+
+static void mmu_free_pte_list_desc(struct pte_list_desc *pte_list_desc)
+{
+ kmem_cache_free(pte_list_desc_cache, pte_list_desc);
+}
+
+static bool sp_has_gptes(struct kvm_mmu_page *sp);
+
+static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
+{
+ if (sp->role.passthrough)
+ return sp->gfn;
+
+ if (sp->shadowed_translation)
+ return sp->shadowed_translation[index] >> PAGE_SHIFT;
+
+ return sp->gfn + (index << ((sp->role.level - 1) * SPTE_LEVEL_BITS));
+}
+
+/*
+ * For leaf SPTEs, fetch the *guest* access permissions being shadowed. Note
+ * that the SPTE itself may have more constrained access permissions than
+ * what the guest enforces. For example, a guest may create an executable
+ * huge PTE but KVM may disallow execution to mitigate iTLB multihit.
+ */
+static u32 kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
+{
+ if (sp->shadowed_translation)
+ return sp->shadowed_translation[index] & ACC_ALL;
+
+ /*
+ * For direct MMUs (e.g. TDP or non-paging guests) or passthrough SPs,
+ * KVM is not shadowing any guest page tables, so the "guest access
+ * permissions" are just ACC_ALL.
+ *
+ * For direct SPs in indirect MMUs (shadow paging), i.e. when KVM
+ * is shadowing a guest huge page with small pages, the guest access
+ * permissions being shadowed are the access permissions of the huge
+ * page.
+ *
+ * In both cases, sp->role.access contains the correct access bits.
+ */
+ return sp->role.access;
+}
+
+static void kvm_mmu_page_set_translation(struct kvm_mmu_page *sp, int index,
+ gfn_t gfn, unsigned int access)
+{
+ if (sp->shadowed_translation) {
+ sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
+ return;
+ }
+
+ WARN_ONCE(access != kvm_mmu_page_get_access(sp, index),
+ "access mismatch under %s page %llx (expected %u, got %u)\n",
+ sp->role.passthrough ? "passthrough" : "direct",
+ sp->gfn, kvm_mmu_page_get_access(sp, index), access);
+
+ WARN_ONCE(gfn != kvm_mmu_page_get_gfn(sp, index),
+ "gfn mismatch under %s page %llx (expected %llx, got %llx)\n",
+ sp->role.passthrough ? "passthrough" : "direct",
+ sp->gfn, kvm_mmu_page_get_gfn(sp, index), gfn);
+}
+
+static void kvm_mmu_page_set_access(struct kvm_mmu_page *sp, int index,
+ unsigned int access)
+{
+ gfn_t gfn = kvm_mmu_page_get_gfn(sp, index);
+
+ kvm_mmu_page_set_translation(sp, index, gfn, access);
+}
+
+/*
+ * Return the pointer to the large page information for a given gfn,
+ * handling slots that are not large page aligned.
+ */
+static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
+ const struct kvm_memory_slot *slot, int level)
+{
+ unsigned long idx;
+
+ idx = gfn_to_index(gfn, slot->base_gfn, level);
+ return &slot->arch.lpage_info[level - 2][idx];
+}
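A worked example with assumed numbers: for a slot at base_gfn 0x10000 and level PG_LEVEL_2M (gfn shift 9), gfn 0x12345 maps to index 0x11 of the 2M metadata array:

/* Assumed: KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) == 9, and PG_LEVEL_2M == 2,
 * so the 2M metadata lives in lpage_info[2 - 2] == lpage_info[0].
 */
unsigned long demo_idx = (0x12345ul >> 9) - (0x10000ul >> 9);	/* == 0x11 */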
+
+/*
+ * The most significant bit in disallow_lpage tracks whether or not memory
+ * attributes are mixed, i.e. not identical for all gfns at the current level.
+ * The lower order bits are used to refcount other cases where a hugepage is
+ * disallowed, e.g. if KVM is shadowing a page table at the gfn.
+ */
+#define KVM_LPAGE_MIXED_FLAG BIT(31)
+
+static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
+ gfn_t gfn, int count)
+{
+ struct kvm_lpage_info *linfo;
+ int old, i;
+
+ for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
+ linfo = lpage_info_slot(gfn, slot, i);
+
+ old = linfo->disallow_lpage;
+ linfo->disallow_lpage += count;
+ WARN_ON_ONCE((old ^ linfo->disallow_lpage) & KVM_LPAGE_MIXED_FLAG);
+ }
+}
+
+void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ update_gfn_disallow_lpage_count(slot, gfn, 1);
+}
+
+void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ update_gfn_disallow_lpage_count(slot, gfn, -1);
+}
+
+static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ gfn_t gfn;
+
+ kvm->arch.indirect_shadow_pages++;
+ /*
+ * Ensure indirect_shadow_pages is elevated prior to re-reading guest
+ * child PTEs in FNAME(gpte_changed), i.e. guarantee either in-flight
+ * emulated writes are visible before re-reading guest PTEs, or that
+ * an emulated write will see the elevated count and acquire mmu_lock
+ * to update SPTEs. Pairs with the smp_mb() in kvm_mmu_track_write().
+ */
+ smp_mb();
+
+ gfn = sp->gfn;
+ slots = kvm_memslots_for_spte_role(kvm, sp->role);
+ slot = __gfn_to_memslot(slots, gfn);
+
+	/* Non-leaf shadow pages are kept read-only. */
+ if (sp->role.level > PG_LEVEL_4K)
+ return __kvm_write_track_add_gfn(kvm, slot, gfn);
+
+ kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
+ if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
+ kvm_flush_remote_tlbs_gfn(kvm, gfn, PG_LEVEL_4K);
+}
+
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ enum kvm_mmu_type mmu_type)
+{
+ /*
+ * If it's possible to replace the shadow page with an NX huge page,
+ * i.e. if the shadow page is the only thing currently preventing KVM
+ * from using a huge page, add the shadow page to the list of "to be
+ * zapped for NX recovery" pages. Note, the shadow page can already be
+ * on the list if KVM is reusing an existing shadow page, i.e. if KVM
+ * links a shadow page at multiple points.
+ */
+ if (!list_empty(&sp->possible_nx_huge_page_link))
+ return;
+
+ ++kvm->stat.nx_lpage_splits;
+ ++kvm->arch.possible_nx_huge_pages[mmu_type].nr_pages;
+ list_add_tail(&sp->possible_nx_huge_page_link,
+ &kvm->arch.possible_nx_huge_pages[mmu_type].pages);
+}
+
+static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ bool nx_huge_page_possible)
+{
+ sp->nx_huge_page_disallowed = true;
+
+ if (nx_huge_page_possible)
+ track_possible_nx_huge_page(kvm, sp, KVM_SHADOW_MMU);
+}
+
+static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ gfn_t gfn;
+
+ kvm->arch.indirect_shadow_pages--;
+ gfn = sp->gfn;
+ slots = kvm_memslots_for_spte_role(kvm, sp->role);
+ slot = __gfn_to_memslot(slots, gfn);
+ if (sp->role.level > PG_LEVEL_4K)
+ return __kvm_write_track_remove_gfn(kvm, slot, gfn);
+
+ kvm_mmu_gfn_allow_lpage(slot, gfn);
+}
+
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ enum kvm_mmu_type mmu_type)
+{
+ if (list_empty(&sp->possible_nx_huge_page_link))
+ return;
+
+ --kvm->stat.nx_lpage_splits;
+ --kvm->arch.possible_nx_huge_pages[mmu_type].nr_pages;
+ list_del_init(&sp->possible_nx_huge_page_link);
+}
+
+static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ sp->nx_huge_page_disallowed = false;
+
+ untrack_possible_nx_huge_page(kvm, sp, KVM_SHADOW_MMU);
+}
+
+static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ bool no_dirty_log)
+{
+ struct kvm_memory_slot *slot;
+
+ slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+ if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
+ return NULL;
+ if (no_dirty_log && kvm_slot_dirty_track_enabled(slot))
+ return NULL;
+
+ return slot;
+}
+
+/*
+ * About rmap_head encoding:
+ *
+ * If the bit zero of rmap_head->val is clear, then it points to the only spte
+ * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
+ * pte_list_desc containing more mappings.
+ */
+#define KVM_RMAP_MANY BIT(0)
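A sketch of decoding an unlocked rmap value under this encoding; demo_first_spte is illustrative only and assumes the value was read with the lock bit already clear (e.g. via kvm_rmap_get() below):

static u64 *demo_first_spte(unsigned long rmap_val)
{
	struct pte_list_desc *desc;

	if (!rmap_val)
		return NULL;			/* empty rmap */
	if (!(rmap_val & KVM_RMAP_MANY))
		return (u64 *)rmap_val;		/* a single spte, stored inline */
	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
	return desc->sptes[0];			/* head of a descriptor chain */
}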
+
+/*
+ * rmaps and PTE lists are mostly protected by mmu_lock (the shadow MMU always
+ * operates with mmu_lock held for write), but rmaps can be walked without
+ * holding mmu_lock so long as the caller can tolerate SPTEs in the rmap chain
+ * being zapped/dropped _while the rmap is locked_.
+ *
+ * Other than the KVM_RMAP_LOCKED flag, modifications to rmap entries must be
+ * done while holding mmu_lock for write. This allows a task walking rmaps
+ * without holding mmu_lock to concurrently walk the same entries as a task
+ * that is holding mmu_lock but _not_ the rmap lock. Neither task will modify
+ * the rmaps, thus the walks are stable.
+ *
+ * As alluded to above, SPTEs in rmaps are _not_ protected by KVM_RMAP_LOCKED,
+ * only the rmap chains themselves are protected. E.g. holding an rmap's lock
+ * ensures all "struct pte_list_desc" fields are stable.
+ */
+#define KVM_RMAP_LOCKED BIT(1)
+
+static unsigned long __kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+ unsigned long old_val, new_val;
+
+ lockdep_assert_preemption_disabled();
+
+ /*
+ * Elide the lock if the rmap is empty, as lockless walkers (read-only
+ * mode) don't need to (and can't) walk an empty rmap, nor can they add
+ * entries to the rmap. I.e. the only paths that process empty rmaps
+ * do so while holding mmu_lock for write, and are mutually exclusive.
+ */
+ old_val = atomic_long_read(&rmap_head->val);
+ if (!old_val)
+ return 0;
+
+ do {
+ /*
+ * If the rmap is locked, wait for it to be unlocked before
+		 * trying to acquire the lock, e.g. to avoid bouncing the cache
+ * line.
+ */
+ while (old_val & KVM_RMAP_LOCKED) {
+ cpu_relax();
+ old_val = atomic_long_read(&rmap_head->val);
+ }
+
+ /*
+ * Recheck for an empty rmap, it may have been purged by the
+ * task that held the lock.
+ */
+ if (!old_val)
+ return 0;
+
+ new_val = old_val | KVM_RMAP_LOCKED;
+ /*
+ * Use try_cmpxchg_acquire() to prevent reads and writes to the rmap
+ * from being reordered outside of the critical section created by
+ * __kvm_rmap_lock().
+ *
+ * Pairs with the atomic_long_set_release() in kvm_rmap_unlock().
+ *
+ * For the !old_val case, no ordering is needed, as there is no rmap
+ * to walk.
+ */
+ } while (!atomic_long_try_cmpxchg_acquire(&rmap_head->val, &old_val, new_val));
+
+ /*
+ * Return the old value, i.e. _without_ the LOCKED bit set. It's
+ * impossible for the return value to be 0 (see above), i.e. the read-
+ * only unlock flow can't get a false positive and fail to unlock.
+ */
+ return old_val;
+}
+
+static unsigned long kvm_rmap_lock(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head)
+{
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ return __kvm_rmap_lock(rmap_head);
+}
+
+static void __kvm_rmap_unlock(struct kvm_rmap_head *rmap_head,
+ unsigned long val)
+{
+ KVM_MMU_WARN_ON(val & KVM_RMAP_LOCKED);
+ /*
+ * Ensure that all accesses to the rmap have completed before unlocking
+ * the rmap.
+ *
+ * Pairs with the atomic_long_try_cmpxchg_acquire() in __kvm_rmap_lock().
+ */
+ atomic_long_set_release(&rmap_head->val, val);
+}
+
+static void kvm_rmap_unlock(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ unsigned long new_val)
+{
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ __kvm_rmap_unlock(rmap_head, new_val);
+}
+
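+/*
+ * Read the rmap's value with KVM_RMAP_LOCKED masked off, e.g. so that flows
+ * that only need a snapshot of the rmap, such as counting entries or
+ * starting a walk, observe the same value whether or not the rmap is locked.
+ */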
+static unsigned long kvm_rmap_get(struct kvm_rmap_head *rmap_head)
+{
+ return atomic_long_read(&rmap_head->val) & ~KVM_RMAP_LOCKED;
+}
+
+/*
+ * If mmu_lock isn't held, rmaps can only be locked in read-only mode. The
+ * actual locking is the same, but the caller is disallowed from modifying the
+ * rmap, and so the unlock flow is a nop if the rmap is/was empty.
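+ *
+ * Note, preemption must stay disabled for the duration of a read-only lock,
+ * as __kvm_rmap_lock() spin-waits on KVM_RMAP_LOCKED; preempting a lock
+ * holder would stall all other walkers until the holder runs again.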
+ */
+static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap_head)
+{
+ unsigned long rmap_val;
+
+ preempt_disable();
+ rmap_val = __kvm_rmap_lock(rmap_head);
+
+ if (!rmap_val)
+ preempt_enable();
+
+ return rmap_val;
+}
+
+static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head,
+ unsigned long old_val)
+{
+ if (!old_val)
+ return;
+
+ KVM_MMU_WARN_ON(old_val != kvm_rmap_get(rmap_head));
+
+ __kvm_rmap_unlock(rmap_head, old_val);
+ preempt_enable();
+}
+
+/*
+ * Returns the number of pointers in the rmap chain, not counting the new one.
+ */
+static int pte_list_add(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
+ u64 *spte, struct kvm_rmap_head *rmap_head)
+{
+ unsigned long old_val, new_val;
+ struct pte_list_desc *desc;
+ int count = 0;
+
+ old_val = kvm_rmap_lock(kvm, rmap_head);
+
+ if (!old_val) {
+ new_val = (unsigned long)spte;
+ } else if (!(old_val & KVM_RMAP_MANY)) {
+ desc = kvm_mmu_memory_cache_alloc(cache);
+ desc->sptes[0] = (u64 *)old_val;
+ desc->sptes[1] = spte;
+ desc->spte_count = 2;
+ desc->tail_count = 0;
+ new_val = (unsigned long)desc | KVM_RMAP_MANY;
+ ++count;
+ } else {
+ desc = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
+ count = desc->tail_count + desc->spte_count;
+
+ /*
+ * If the previous head is full, allocate a new head descriptor
+ * as tail descriptors are always kept full.
+ */
+ if (desc->spte_count == PTE_LIST_EXT) {
+ desc = kvm_mmu_memory_cache_alloc(cache);
+ desc->more = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
+ desc->spte_count = 0;
+ desc->tail_count = count;
+ new_val = (unsigned long)desc | KVM_RMAP_MANY;
+ } else {
+ new_val = old_val;
+ }
+ desc->sptes[desc->spte_count++] = spte;
+ }
+
+ kvm_rmap_unlock(kvm, rmap_head, new_val);
+
+ return count;
+}
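+
+/*
+ * For example, if PTE_LIST_EXT were 14, an rmap holding 30 SPTEs would be
+ * laid out as head(2 used) -> tail(14 used) -> tail(14 used): only the head
+ * descriptor can be partially filled, and the head's tail_count (28 here) is
+ * the total number of SPTEs stored in the tail descriptors.
+ */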
+
+static void pte_list_desc_remove_entry(struct kvm *kvm, unsigned long *rmap_val,
+ struct pte_list_desc *desc, int i)
+{
+ struct pte_list_desc *head_desc = (struct pte_list_desc *)(*rmap_val & ~KVM_RMAP_MANY);
+ int j = head_desc->spte_count - 1;
+
+ /*
+ * The head descriptor should never be empty. A new head is added only
+ * when adding an entry and the previous head is full, and heads are
+ * removed (this flow) when they become empty.
+ */
+ KVM_BUG_ON_DATA_CORRUPTION(j < 0, kvm);
+
+ /*
+ * Replace the to-be-freed SPTE with the last valid entry from the head
+ * descriptor to ensure that tail descriptors are full at all times.
+ * Note, this also means that tail_count is stable for each descriptor.
+ */
+ desc->sptes[i] = head_desc->sptes[j];
+ head_desc->sptes[j] = NULL;
+ head_desc->spte_count--;
+ if (head_desc->spte_count)
+ return;
+
+ /*
+ * The head descriptor is empty. If there are no tail descriptors,
+ * nullify the rmap head to mark the list as empty, else point the rmap
+ * head at the next descriptor, i.e. the new head.
+ */
+ if (!head_desc->more)
+ *rmap_val = 0;
+ else
+ *rmap_val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
+ mmu_free_pte_list_desc(head_desc);
+}
+
+static void pte_list_remove(struct kvm *kvm, u64 *spte,
+ struct kvm_rmap_head *rmap_head)
+{
+ struct pte_list_desc *desc;
+ unsigned long rmap_val;
+ int i;
+
+ rmap_val = kvm_rmap_lock(kvm, rmap_head);
+ if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
+ goto out;
+
+ if (!(rmap_val & KVM_RMAP_MANY)) {
+ if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_val != spte, kvm))
+ goto out;
+
+ rmap_val = 0;
+ } else {
+ desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
+ while (desc) {
+ for (i = 0; i < desc->spte_count; ++i) {
+ if (desc->sptes[i] == spte) {
+ pte_list_desc_remove_entry(kvm, &rmap_val,
+ desc, i);
+ goto out;
+ }
+ }
+ desc = desc->more;
+ }
+
+ KVM_BUG_ON_DATA_CORRUPTION(true, kvm);
+ }
+
+out:
+ kvm_rmap_unlock(kvm, rmap_head, rmap_val);
+}
+
+static void kvm_zap_one_rmap_spte(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head, u64 *sptep)
+{
+ mmu_spte_clear_track_bits(kvm, sptep);
+ pte_list_remove(kvm, sptep, rmap_head);
+}
+
+/* Return true if at least one SPTE was zapped, false otherwise */
+static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head)
+{
+ struct pte_list_desc *desc, *next;
+ unsigned long rmap_val;
+ int i;
+
+ rmap_val = kvm_rmap_lock(kvm, rmap_head);
+ if (!rmap_val)
+ return false;
+
+ if (!(rmap_val & KVM_RMAP_MANY)) {
+ mmu_spte_clear_track_bits(kvm, (u64 *)rmap_val);
+ goto out;
+ }
+
+ desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
+
+ for (; desc; desc = next) {
+ for (i = 0; i < desc->spte_count; i++)
+ mmu_spte_clear_track_bits(kvm, desc->sptes[i]);
+ next = desc->more;
+ mmu_free_pte_list_desc(desc);
+ }
+out:
+ /* The rmap is now empty; unlocking with '0' resets rmap_head. */
+ kvm_rmap_unlock(kvm, rmap_head, 0);
+ return true;
+}
+
+unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
+{
+ unsigned long rmap_val = kvm_rmap_get(rmap_head);
+ struct pte_list_desc *desc;
+
+ if (!rmap_val)
+ return 0;
+ else if (!(rmap_val & KVM_RMAP_MANY))
+ return 1;
+
+ desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
+ return desc->tail_count + desc->spte_count;
+}
+
+static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
+ const struct kvm_memory_slot *slot)
+{
+ unsigned long idx;
+
+ idx = gfn_to_index(gfn, slot->base_gfn, level);
+ return &slot->arch.rmap[level - PG_LEVEL_4K][idx];
+}
+
+static void rmap_remove(struct kvm *kvm, u64 *spte)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ struct kvm_mmu_page *sp;
+ gfn_t gfn;
+ struct kvm_rmap_head *rmap_head;
+
+ sp = sptep_to_sp(spte);
+ gfn = kvm_mmu_page_get_gfn(sp, spte_index(spte));
+
+ /*
+ * Unlike rmap_add, rmap_remove does not run in the context of a vCPU
+ * so we have to determine which memslots to use based on context
+ * information in sp->role.
+ */
+ slots = kvm_memslots_for_spte_role(kvm, sp->role);
+
+ slot = __gfn_to_memslot(slots, gfn);
+ rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
+
+ pte_list_remove(kvm, spte, rmap_head);
+}
+
+/*
+ * Used by the following functions to iterate through the sptes linked by a
+ * rmap. All fields are private and must not be accessed outside these functions.
+ */
+struct rmap_iterator {
+ /* private fields */
+ struct kvm_rmap_head *head;
+ struct pte_list_desc *desc; /* holds the sptep if not NULL */
+ int pos; /* index of the sptep */
+};
+
+/*
+ * Iteration must be started by this function. This should also be used after
+ * removing/dropping sptes from the rmap link because in such cases the
+ * information in the iterator may not be valid.
+ *
+ * Returns sptep if found, NULL otherwise.
+ */
+static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
+ struct rmap_iterator *iter)
+{
+ unsigned long rmap_val = kvm_rmap_get(rmap_head);
+
+ if (!rmap_val)
+ return NULL;
+
+ if (!(rmap_val & KVM_RMAP_MANY)) {
+ iter->desc = NULL;
+ return (u64 *)rmap_val;
+ }
+
+ iter->desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
+ iter->pos = 0;
+ return iter->desc->sptes[iter->pos];
+}
+
+/*
+ * Must be used with a valid iterator: e.g. after rmap_get_first().
+ *
+ * Returns sptep if found, NULL otherwise.
+ */
+static u64 *rmap_get_next(struct rmap_iterator *iter)
+{
+ if (iter->desc) {
+ if (iter->pos < PTE_LIST_EXT - 1) {
+ ++iter->pos;
+ if (iter->desc->sptes[iter->pos])
+ return iter->desc->sptes[iter->pos];
+ }
+
+ iter->desc = iter->desc->more;
+
+ if (iter->desc) {
+ iter->pos = 0;
+ /* desc->sptes[0] cannot be NULL */
+ return iter->desc->sptes[iter->pos];
+ }
+ }
+
+ return NULL;
+}
+
+#define __for_each_rmap_spte(_rmap_head_, _iter_, _sptep_) \
+ for (_sptep_ = rmap_get_first(_rmap_head_, _iter_); \
+ _sptep_; _sptep_ = rmap_get_next(_iter_))
+
+#define for_each_rmap_spte(_rmap_head_, _iter_, _sptep_) \
+ __for_each_rmap_spte(_rmap_head_, _iter_, _sptep_) \
+ if (!WARN_ON_ONCE(!is_shadow_present_pte(*(_sptep_)))) \
+
+#define for_each_rmap_spte_lockless(_rmap_head_, _iter_, _sptep_, _spte_) \
+ __for_each_rmap_spte(_rmap_head_, _iter_, _sptep_) \
+ if (is_shadow_present_pte(_spte_ = mmu_spte_get_lockless(_sptep_)))
+
+static void drop_spte(struct kvm *kvm, u64 *sptep)
+{
+ u64 old_spte = mmu_spte_clear_track_bits(kvm, sptep);
+
+ if (is_shadow_present_pte(old_spte))
+ rmap_remove(kvm, sptep);
+}
+
+static void drop_large_spte(struct kvm *kvm, u64 *sptep, bool flush)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = sptep_to_sp(sptep);
+ WARN_ON_ONCE(sp->role.level == PG_LEVEL_4K);
+
+ drop_spte(kvm, sptep);
+
+ if (flush)
+ kvm_flush_remote_tlbs_sptep(kvm, sptep);
+}
+
+/*
+ * Write-protect the specified @sptep; @pt_protect indicates whether the
+ * write-protection is being done to protect the shadow page tables.
+ *
+ * Note: write protection differs between dirty logging and spte protection:
+ * - for dirty logging, the spte can be made writable again at any time if
+ * its dirty bitmap is properly set.
+ * - for spte protection, the spte can be made writable only after the
+ * shadow page has been unsynced.
+ *
+ * Return true if the TLB needs to be flushed.
+ */
+static bool spte_write_protect(u64 *sptep, bool pt_protect)
+{
+ u64 spte = *sptep;
+
+ if (!is_writable_pte(spte) &&
+ !(pt_protect && is_mmu_writable_spte(spte)))
+ return false;
+
+ if (pt_protect)
+ spte &= ~shadow_mmu_writable_mask;
+ spte &= ~PT_WRITABLE_MASK;
+
+ return mmu_spte_update(sptep, spte);
+}
+
+static bool rmap_write_protect(struct kvm_rmap_head *rmap_head,
+ bool pt_protect)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+ bool flush = false;
+
+ for_each_rmap_spte(rmap_head, &iter, sptep)
+ flush |= spte_write_protect(sptep, pt_protect);
+
+ return flush;
+}
+
+static bool spte_clear_dirty(u64 *sptep)
+{
+ u64 spte = *sptep;
+
+ KVM_MMU_WARN_ON(!spte_ad_enabled(spte));
+ spte &= ~shadow_dirty_mask;
+ return mmu_spte_update(sptep, spte);
+}
+
+/*
+ * Gets the GFN ready for another round of dirty logging by clearing the
+ * - D bit on ad-enabled SPTEs, and
+ * - W bit on ad-disabled SPTEs.
+ * Returns true iff any D or W bits were cleared.
+ */
+static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+ bool flush = false;
+
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
+ if (spte_ad_need_write_protect(*sptep))
+ flush |= test_and_clear_bit(PT_WRITABLE_SHIFT,
+ (unsigned long *)sptep);
+ else
+ flush |= spte_clear_dirty(sptep);
+ }
+
+ return flush;
+}
+
+static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask)
+{
+ struct kvm_rmap_head *rmap_head;
+
+ if (tdp_mmu_enabled)
+ kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
+ slot->base_gfn + gfn_offset, mask, true);
+
+ if (!kvm_memslots_have_rmaps(kvm))
+ return;
+
+ while (mask) {
+ rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
+ PG_LEVEL_4K, slot);
+ rmap_write_protect(rmap_head, false);
+
+ /* clear the first set bit */
+ mask &= mask - 1;
+ }
+}
+
+static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask)
+{
+ struct kvm_rmap_head *rmap_head;
+
+ if (tdp_mmu_enabled)
+ kvm_tdp_mmu_clear_dirty_pt_masked(kvm, slot,
+ slot->base_gfn + gfn_offset, mask, false);
+
+ if (!kvm_memslots_have_rmaps(kvm))
+ return;
+
+ while (mask) {
+ rmap_head = gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
+ PG_LEVEL_4K, slot);
+ __rmap_clear_dirty(kvm, rmap_head, slot);
+
+ /* clear the first set bit */
+ mask &= mask - 1;
+ }
+}
+
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn_offset, unsigned long mask)
+{
+ /*
+ * If the slot was assumed to be "initially all dirty", write-protect
+ * huge pages to ensure they are split to 4KiB on the first write (KVM
+ * dirty logs at 4KiB granularity). If eager page splitting is enabled,
+ * immediately try to split huge pages, e.g. so that vCPUs don't get
+ * saddled with the cost of splitting.
+ *
+ * The gfn_offset is guaranteed to be aligned to 64, but the memslot's
+ * base_gfn has no such restriction, so the range can cross two large
+ * pages.
+ */
+ if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
+ gfn_t start = slot->base_gfn + gfn_offset + __ffs(mask);
+ gfn_t end = slot->base_gfn + gfn_offset + __fls(mask);
+
+ if (READ_ONCE(eager_page_split))
+ kvm_mmu_try_split_huge_pages(kvm, slot, start, end + 1, PG_LEVEL_4K);
+
+ kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
+
+ /* Cross two large pages? */
+ if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
+ ALIGN(end << PAGE_SHIFT, PMD_SIZE))
+ kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
+ PG_LEVEL_2M);
+ }
+
+ /*
+ * (Re)Enable dirty logging for all 4KiB SPTEs that map the GFNs in
+ * mask. If PML is enabled and the GFN doesn't need to be write-
+ * protected for other reasons, e.g. shadow paging, clear the Dirty bit.
+ * Otherwise clear the Writable bit.
+ *
+ * Note that kvm_mmu_clear_dirty_pt_masked() is called whenever PML is
+ * enabled, but it chooses between clearing the Dirty bit and the
+ * Writable bit based on the context.
+ */
+ if (kvm->arch.cpu_dirty_log_size)
+ kvm_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, mask);
+ else
+ kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
+}
+
+int kvm_cpu_dirty_log_size(struct kvm *kvm)
+{
+ return kvm->arch.cpu_dirty_log_size;
+}
+
+bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
+ struct kvm_memory_slot *slot, u64 gfn,
+ int min_level)
+{
+ struct kvm_rmap_head *rmap_head;
+ int i;
+ bool write_protected = false;
+
+ if (kvm_memslots_have_rmaps(kvm)) {
+ for (i = min_level; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
+ rmap_head = gfn_to_rmap(gfn, i, slot);
+ write_protected |= rmap_write_protect(rmap_head, true);
+ }
+ }
+
+ if (tdp_mmu_enabled)
+ write_protected |=
+ kvm_tdp_mmu_write_protect_gfn(kvm, slot, gfn, min_level);
+
+ return write_protected;
+}
+
+static bool kvm_vcpu_write_protect_gfn(struct kvm_vcpu *vcpu, u64 gfn)
+{
+ struct kvm_memory_slot *slot;
+
+ slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+ return kvm_mmu_slot_gfn_write_protect(vcpu->kvm, slot, gfn, PG_LEVEL_4K);
+}
+
+static bool kvm_zap_rmap(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
+{
+ return kvm_zap_all_rmap_sptes(kvm, rmap_head);
+}
+
+struct slot_rmap_walk_iterator {
+ /* input fields. */
+ const struct kvm_memory_slot *slot;
+ gfn_t start_gfn;
+ gfn_t end_gfn;
+ int start_level;
+ int end_level;
+
+ /* output fields. */
+ gfn_t gfn;
+ struct kvm_rmap_head *rmap;
+ int level;
+
+ /* private field. */
+ struct kvm_rmap_head *end_rmap;
+};
+
+static void rmap_walk_init_level(struct slot_rmap_walk_iterator *iterator,
+ int level)
+{
+ iterator->level = level;
+ iterator->gfn = iterator->start_gfn;
+ iterator->rmap = gfn_to_rmap(iterator->gfn, level, iterator->slot);
+ iterator->end_rmap = gfn_to_rmap(iterator->end_gfn, level, iterator->slot);
+}
+
+static void slot_rmap_walk_init(struct slot_rmap_walk_iterator *iterator,
+ const struct kvm_memory_slot *slot,
+ int start_level, int end_level,
+ gfn_t start_gfn, gfn_t end_gfn)
+{
+ iterator->slot = slot;
+ iterator->start_level = start_level;
+ iterator->end_level = end_level;
+ iterator->start_gfn = start_gfn;
+ iterator->end_gfn = end_gfn;
+
+ rmap_walk_init_level(iterator, iterator->start_level);
+}
+
+static bool slot_rmap_walk_okay(struct slot_rmap_walk_iterator *iterator)
+{
+ return !!iterator->rmap;
+}
+
+static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
+{
+ while (++iterator->rmap <= iterator->end_rmap) {
+ iterator->gfn += KVM_PAGES_PER_HPAGE(iterator->level);
+
+ if (atomic_long_read(&iterator->rmap->val))
+ return;
+ }
+
+ if (++iterator->level > iterator->end_level) {
+ iterator->rmap = NULL;
+ return;
+ }
+
+ rmap_walk_init_level(iterator, iterator->level);
+}
+
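+/*
+ * Iterate over every rmap in [_start_gfn, _end_gfn] one level at a time,
+ * visiting all rmaps at a given level before moving up to the next level;
+ * slot_rmap_walk_next() skips rmaps that are empty.
+ */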
+#define for_each_slot_rmap_range(_slot_, _start_level_, _end_level_, \
+ _start_gfn, _end_gfn, _iter_) \
+ for (slot_rmap_walk_init(_iter_, _slot_, _start_level_, \
+ _end_level_, _start_gfn, _end_gfn); \
+ slot_rmap_walk_okay(_iter_); \
+ slot_rmap_walk_next(_iter_))
+
+/* The return value indicates whether a TLB flush on all vCPUs is needed. */
+typedef bool (*slot_rmaps_handler) (struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot);
+
+static __always_inline bool __walk_slot_rmaps(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn,
+ int start_level, int end_level,
+ gfn_t start_gfn, gfn_t end_gfn,
+ bool can_yield, bool flush_on_yield,
+ bool flush)
+{
+ struct slot_rmap_walk_iterator iterator;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ for_each_slot_rmap_range(slot, start_level, end_level, start_gfn,
+ end_gfn, &iterator) {
+ if (iterator.rmap)
+ flush |= fn(kvm, iterator.rmap, slot);
+
+ if (!can_yield)
+ continue;
+
+ if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
+ if (flush && flush_on_yield) {
+ kvm_flush_remote_tlbs_range(kvm, start_gfn,
+ iterator.gfn - start_gfn + 1);
+ flush = false;
+ }
+ cond_resched_rwlock_write(&kvm->mmu_lock);
+ }
+ }
+
+ return flush;
+}
+
+static __always_inline bool walk_slot_rmaps(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn,
+ int start_level, int end_level,
+ bool flush_on_yield)
+{
+ return __walk_slot_rmaps(kvm, slot, fn, start_level, end_level,
+ slot->base_gfn, slot->base_gfn + slot->npages - 1,
+ true, flush_on_yield, false);
+}
+
+static __always_inline bool walk_slot_rmaps_4k(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ slot_rmaps_handler fn,
+ bool flush_on_yield)
+{
+ return walk_slot_rmaps(kvm, slot, fn, PG_LEVEL_4K, PG_LEVEL_4K, flush_on_yield);
+}
+
+static bool __kvm_rmap_zap_gfn_range(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ gfn_t start, gfn_t end, bool can_yield,
+ bool flush)
+{
+ return __walk_slot_rmaps(kvm, slot, kvm_zap_rmap,
+ PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL,
+ start, end - 1, can_yield, true, flush);
+}
+
+bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ bool flush = false;
+
+ /*
+ * To prevent races with vCPUs faulting in a gfn using stale data,
+ * zapping a gfn range must be protected by mmu_invalidate_in_progress
+ * (and mmu_invalidate_seq). The only exception is memslot deletion;
+ * in that case, SRCU synchronization ensures that SPTEs are zapped
+ * after all vCPUs have unlocked SRCU, guaranteeing that vCPUs see the
+ * invalid slot.
+ */
+ lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
+ lockdep_is_held(&kvm->slots_lock));
+
+ if (kvm_memslots_have_rmaps(kvm))
+ flush = __kvm_rmap_zap_gfn_range(kvm, range->slot,
+ range->start, range->end,
+ range->may_block, flush);
+
+ if (tdp_mmu_enabled)
+ flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
+
+ if (kvm_x86_ops.set_apic_access_page_addr &&
+ range->slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT)
+ kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+
+ return flush;
+}
+
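+/*
+ * If a single gfn accumulates more than this many SPTEs in its rmap, zap the
+ * entire chain on the next add rather than letting a heavily aliased gfn
+ * grow its rmap without bound.
+ */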
+#define RMAP_RECYCLE_THRESHOLD 1000
+
+static void __rmap_add(struct kvm *kvm,
+ struct kvm_mmu_memory_cache *cache,
+ const struct kvm_memory_slot *slot,
+ u64 *spte, gfn_t gfn, unsigned int access)
+{
+ struct kvm_mmu_page *sp;
+ struct kvm_rmap_head *rmap_head;
+ int rmap_count;
+
+ sp = sptep_to_sp(spte);
+ kvm_mmu_page_set_translation(sp, spte_index(spte), gfn, access);
+ kvm_update_page_stats(kvm, sp->role.level, 1);
+
+ rmap_head = gfn_to_rmap(gfn, sp->role.level, slot);
+ rmap_count = pte_list_add(kvm, cache, spte, rmap_head);
+
+ if (rmap_count > kvm->stat.max_mmu_rmap_size)
+ kvm->stat.max_mmu_rmap_size = rmap_count;
+ if (rmap_count > RMAP_RECYCLE_THRESHOLD) {
+ kvm_zap_all_rmap_sptes(kvm, rmap_head);
+ kvm_flush_remote_tlbs_gfn(kvm, gfn, sp->role.level);
+ }
+}
+
+static void rmap_add(struct kvm_vcpu *vcpu, const struct kvm_memory_slot *slot,
+ u64 *spte, gfn_t gfn, unsigned int access)
+{
+ struct kvm_mmu_memory_cache *cache = &vcpu->arch.mmu_pte_list_desc_cache;
+
+ __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access);
+}
+
+static bool kvm_rmap_age_gfn_range(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ bool test_only)
+{
+ struct kvm_rmap_head *rmap_head;
+ struct rmap_iterator iter;
+ unsigned long rmap_val;
+ bool young = false;
+ u64 *sptep;
+ gfn_t gfn;
+ int level;
+ u64 spte;
+
+ for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+ for (gfn = range->start; gfn < range->end;
+ gfn += KVM_PAGES_PER_HPAGE(level)) {
+ rmap_head = gfn_to_rmap(gfn, level, range->slot);
+ rmap_val = kvm_rmap_lock_readonly(rmap_head);
+
+ for_each_rmap_spte_lockless(rmap_head, &iter, sptep, spte) {
+ if (!is_accessed_spte(spte))
+ continue;
+
+ if (test_only) {
+ kvm_rmap_unlock_readonly(rmap_head, rmap_val);
+ return true;
+ }
+
+ if (spte_ad_enabled(spte))
+ clear_bit((ffs(shadow_accessed_mask) - 1),
+ (unsigned long *)sptep);
+ else
+ /*
+ * If the following cmpxchg fails, the
+ * spte is being concurrently modified
+ * and should most likely stay young.
+ */
+ cmpxchg64(sptep, spte,
+ mark_spte_for_access_track(spte));
+ young = true;
+ }
+
+ kvm_rmap_unlock_readonly(rmap_head, rmap_val);
+ }
+ }
+ return young;
+}
+
+static bool kvm_may_have_shadow_mmu_sptes(struct kvm *kvm)
+{
+ return !tdp_mmu_enabled || READ_ONCE(kvm->arch.indirect_shadow_pages);
+}
+
+bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ bool young = false;
+
+ if (tdp_mmu_enabled)
+ young = kvm_tdp_mmu_age_gfn_range(kvm, range);
+
+ if (kvm_may_have_shadow_mmu_sptes(kvm))
+ young |= kvm_rmap_age_gfn_range(kvm, range, false);
+
+ return young;
+}
+
+bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ bool young = false;
+
+ if (tdp_mmu_enabled)
+ young = kvm_tdp_mmu_test_age_gfn(kvm, range);
+
+ if (young)
+ return young;
+
+ if (kvm_may_have_shadow_mmu_sptes(kvm))
+ young |= kvm_rmap_age_gfn_range(kvm, range, true);
+
+ return young;
+}
+
+static void kvm_mmu_check_sptes_at_free(struct kvm_mmu_page *sp)
+{
+#ifdef CONFIG_KVM_PROVE_MMU
+ int i;
+
+ for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
+ if (KVM_MMU_WARN_ON(is_shadow_present_pte(sp->spt[i])))
+ pr_err_ratelimited("SPTE %llx (@ %p) for gfn %llx shadow-present at free",
+ sp->spt[i], &sp->spt[i],
+ kvm_mmu_page_get_gfn(sp, i));
+ }
+#endif
+}
+
+static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ kvm->arch.n_used_mmu_pages++;
+ kvm_account_pgtable_pages((void *)sp->spt, +1);
+}
+
+static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ kvm->arch.n_used_mmu_pages--;
+ kvm_account_pgtable_pages((void *)sp->spt, -1);
+}
+
+static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
+{
+ kvm_mmu_check_sptes_at_free(sp);
+
+ hlist_del(&sp->hash_link);
+ list_del(&sp->link);
+ free_page((unsigned long)sp->spt);
+ free_page((unsigned long)sp->shadowed_translation);
+ kmem_cache_free(mmu_page_header_cache, sp);
+}
+
+static unsigned kvm_page_table_hashfn(gfn_t gfn)
+{
+ return hash_64(gfn, KVM_MMU_HASH_SHIFT);
+}
+
+static void mmu_page_add_parent_pte(struct kvm *kvm,
+ struct kvm_mmu_memory_cache *cache,
+ struct kvm_mmu_page *sp, u64 *parent_pte)
+{
+ if (!parent_pte)
+ return;
+
+ pte_list_add(kvm, cache, parent_pte, &sp->parent_ptes);
+}
+
+static void mmu_page_remove_parent_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
+ u64 *parent_pte)
+{
+ pte_list_remove(kvm, parent_pte, &sp->parent_ptes);
+}
+
+static void drop_parent_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
+ u64 *parent_pte)
+{
+ mmu_page_remove_parent_pte(kvm, sp, parent_pte);
+ mmu_spte_clear_no_track(parent_pte);
+}
+
+static void mark_unsync(u64 *spte);
+static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+
+ for_each_rmap_spte(&sp->parent_ptes, &iter, sptep) {
+ mark_unsync(sptep);
+ }
+}
+
+static void mark_unsync(u64 *spte)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = sptep_to_sp(spte);
+ if (__test_and_set_bit(spte_index(spte), sp->unsync_child_bitmap))
+ return;
+ if (sp->unsync_children++)
+ return;
+ kvm_mmu_mark_parents_unsync(sp);
+}
+
+#define KVM_PAGE_ARRAY_NR 16
+
+struct kvm_mmu_pages {
+ struct mmu_page_and_offset {
+ struct kvm_mmu_page *sp;
+ unsigned int idx;
+ } page[KVM_PAGE_ARRAY_NR];
+ unsigned int nr;
+};
+
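+/*
+ * Add @sp to @pvec, skipping duplicates for unsync pages (which can be
+ * reached via multiple parents). Returns a non-zero value iff the array is
+ * now full, which makes __mmu_unsync_walk() bail with -ENOSPC so the caller
+ * processes the pages gathered so far and then restarts the walk.
+ */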
+static int mmu_pages_add(struct kvm_mmu_pages *pvec, struct kvm_mmu_page *sp,
+ int idx)
+{
+ int i;
+
+ if (sp->unsync)
+ for (i = 0; i < pvec->nr; i++)
+ if (pvec->page[i].sp == sp)
+ return 0;
+
+ pvec->page[pvec->nr].sp = sp;
+ pvec->page[pvec->nr].idx = idx;
+ pvec->nr++;
+ return (pvec->nr == KVM_PAGE_ARRAY_NR);
+}
+
+static inline void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx)
+{
+ --sp->unsync_children;
+ WARN_ON_ONCE((int)sp->unsync_children < 0);
+ __clear_bit(idx, sp->unsync_child_bitmap);
+}
+
+static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
+ struct kvm_mmu_pages *pvec)
+{
+ int i, ret, nr_unsync_leaf = 0;
+
+ for_each_set_bit(i, sp->unsync_child_bitmap, 512) {
+ struct kvm_mmu_page *child;
+ u64 ent = sp->spt[i];
+
+ if (!is_shadow_present_pte(ent) || is_large_pte(ent)) {
+ clear_unsync_child_bit(sp, i);
+ continue;
+ }
+
+ child = spte_to_child_sp(ent);
+
+ if (child->unsync_children) {
+ if (mmu_pages_add(pvec, child, i))
+ return -ENOSPC;
+
+ ret = __mmu_unsync_walk(child, pvec);
+ if (!ret) {
+ clear_unsync_child_bit(sp, i);
+ continue;
+ } else if (ret > 0) {
+ nr_unsync_leaf += ret;
+ } else
+ return ret;
+ } else if (child->unsync) {
+ nr_unsync_leaf++;
+ if (mmu_pages_add(pvec, child, i))
+ return -ENOSPC;
+ } else
+ clear_unsync_child_bit(sp, i);
+ }
+
+ return nr_unsync_leaf;
+}
+
+#define INVALID_INDEX (-1)
+
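+/*
+ * Gather @sp's unsync descendants into @pvec, adding @sp itself first at
+ * INVALID_INDEX. Returns the number of unsync leaf pages gathered, zero if
+ * @sp has no unsync children, or -ENOSPC if @pvec filled up mid-walk.
+ */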
+static int mmu_unsync_walk(struct kvm_mmu_page *sp,
+ struct kvm_mmu_pages *pvec)
+{
+ pvec->nr = 0;
+ if (!sp->unsync_children)
+ return 0;
+
+ mmu_pages_add(pvec, sp, INVALID_INDEX);
+ return __mmu_unsync_walk(sp, pvec);
+}
+
+static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ WARN_ON_ONCE(!sp->unsync);
+ trace_kvm_mmu_sync_page(sp);
+ sp->unsync = 0;
+ --kvm->stat.mmu_unsync;
+}
+
+static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list);
+static void kvm_mmu_commit_zap_page(struct kvm *kvm,
+ struct list_head *invalid_list);
+
+static bool sp_has_gptes(struct kvm_mmu_page *sp)
+{
+ if (sp->role.direct)
+ return false;
+
+ if (sp->role.passthrough)
+ return false;
+
+ return true;
+}
+
+static __ro_after_init HLIST_HEAD(empty_page_hash);
+
+static struct hlist_head *kvm_get_mmu_page_hash(struct kvm *kvm, gfn_t gfn)
+{
+ /*
+ * Ensure the load of the hash table pointer itself is ordered before
+ * loads to walk the table. The pointer is set at runtime outside of
+ * mmu_lock when the TDP MMU is enabled, i.e. when the hash table of
+ * shadow pages becomes necessary only when KVM needs to shadow L1's
+ * TDP for an L2 guest. Pairs with the smp_store_release() in
+ * kvm_mmu_alloc_page_hash().
+ */
+ struct hlist_head *page_hash = smp_load_acquire(&kvm->arch.mmu_page_hash);
+
+ lockdep_assert_held(&kvm->mmu_lock);
+
+ if (!page_hash)
+ return &empty_page_hash;
+
+ return &page_hash[kvm_page_table_hashfn(gfn)];
+}
+
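+/*
+ * Note, the empty "if { } else" in for_each_valid_sp() filters out obsolete
+ * shadow pages while keeping the macro usable as a prefix to a single
+ * statement or block, without introducing a dangling-else hazard.
+ */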
+#define for_each_valid_sp(_kvm, _sp, _list) \
+ hlist_for_each_entry(_sp, _list, hash_link) \
+ if (is_obsolete_sp((_kvm), (_sp))) { \
+ } else
+
+#define for_each_gfn_valid_sp_with_gptes(_kvm, _sp, _gfn) \
+ for_each_valid_sp(_kvm, _sp, kvm_get_mmu_page_hash(_kvm, _gfn)) \
+ if ((_sp)->gfn != (_gfn) || !sp_has_gptes(_sp)) {} else
+
+static bool kvm_sync_page_check(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+ union kvm_mmu_page_role root_role = vcpu->arch.mmu->root_role;
+
+ /*
+ * Ignore various flags when verifying that it's safe to sync a shadow
+ * page using the current MMU context.
+ *
+ * - level: not part of the overall MMU role and will never match as the MMU's
+ * level tracks the root level
+ * - access: updated based on the new guest PTE
+ * - quadrant: not part of the overall MMU role (similar to level)
+ */
+ const union kvm_mmu_page_role sync_role_ign = {
+ .level = 0xf,
+ .access = 0x7,
+ .quadrant = 0x3,
+ .passthrough = 0x1,
+ };
+
+ /*
+ * Direct pages can never be unsync, and KVM should never attempt to
+ * sync a shadow page for a different MMU context, e.g. if the role
+ * differs then the memslot lookup (SMM vs. non-SMM) will be bogus, the
+ * reserved bits checks will be wrong, etc...
+ */
+ if (WARN_ON_ONCE(sp->role.direct || !vcpu->arch.mmu->sync_spte ||
+ (sp->role.word ^ root_role.word) & ~sync_role_ign.word))
+ return false;
+
+ return true;
+}
+
+static int kvm_sync_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
+{
+ /* sp->spt[i] still has the initial value from shadow page allocation. */
+ if (sp->spt[i] == SHADOW_NONPRESENT_VALUE)
+ return 0;
+
+ return vcpu->arch.mmu->sync_spte(vcpu, sp, i);
+}
+
+static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+ int flush = 0;
+ int i;
+
+ if (!kvm_sync_page_check(vcpu, sp))
+ return -1;
+
+ for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
+ int ret = kvm_sync_spte(vcpu, sp, i);
+
+ if (ret < -1)
+ return -1;
+ flush |= ret;
+ }
+
+ /*
+ * Note, any flush is purely for KVM's correctness, e.g. when dropping
+ * an existing SPTE or clearing W/A/D bits to ensure an mmu_notifier
+ * unmap or dirty logging event doesn't fail to flush. The guest is
+ * responsible for flushing the TLB to ensure any changes in protection
+ * bits are recognized, i.e. until the guest flushes or page faults on
+ * a relevant address, KVM is architecturally allowed to let vCPUs use
+ * cached translations with the old protection bits.
+ */
+ return flush;
+}
+
+static int kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
+{
+ int ret = __kvm_sync_page(vcpu, sp);
+
+ if (ret < 0)
+ kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
+ return ret;
+}
+
+static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
+ struct list_head *invalid_list,
+ bool remote_flush)
+{
+ if (!remote_flush && list_empty(invalid_list))
+ return false;
+
+ if (!list_empty(invalid_list))
+ kvm_mmu_commit_zap_page(kvm, invalid_list);
+ else
+ kvm_flush_remote_tlbs(kvm);
+ return true;
+}
+
+static bool is_obsolete_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ if (sp->role.invalid)
+ return true;
+
+ /* TDP MMU pages do not use the MMU generation. */
+ return !is_tdp_mmu_page(sp) &&
+ unlikely(sp->mmu_valid_gen != kvm->arch.mmu_valid_gen);
+}
+
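+/*
+ * Tracks the chain of parents above the page currently being processed in a
+ * pvec walk: parent[n] is a shadow page on the path from the root, and
+ * idx[n] is the index in parent[n]'s unsync_child_bitmap that leads to the
+ * next lower page, so the bit can be cleared once the child is synced.
+ */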
+struct mmu_page_path {
+ struct kvm_mmu_page *parent[PT64_ROOT_MAX_LEVEL];
+ unsigned int idx[PT64_ROOT_MAX_LEVEL];
+};
+
+#define for_each_sp(pvec, sp, parents, i) \
+ for (i = mmu_pages_first(&pvec, &parents); \
+ i < pvec.nr && ({ sp = pvec.page[i].sp; 1;}); \
+ i = mmu_pages_next(&pvec, &parents, i))
+
+static int mmu_pages_next(struct kvm_mmu_pages *pvec,
+ struct mmu_page_path *parents,
+ int i)
+{
+ int n;
+
+ for (n = i+1; n < pvec->nr; n++) {
+ struct kvm_mmu_page *sp = pvec->page[n].sp;
+ unsigned idx = pvec->page[n].idx;
+ int level = sp->role.level;
+
+ parents->idx[level-1] = idx;
+ if (level == PG_LEVEL_4K)
+ break;
+
+ parents->parent[level-2] = sp;
+ }
+
+ return n;
+}
+
+static int mmu_pages_first(struct kvm_mmu_pages *pvec,
+ struct mmu_page_path *parents)
+{
+ struct kvm_mmu_page *sp;
+ int level;
+
+ if (pvec->nr == 0)
+ return 0;
+
+ WARN_ON_ONCE(pvec->page[0].idx != INVALID_INDEX);
+
+ sp = pvec->page[0].sp;
+ level = sp->role.level;
+ WARN_ON_ONCE(level == PG_LEVEL_4K);
+
+ parents->parent[level-2] = sp;
+
+ /*
+ * Also set up a sentinel. Further entries in pvec are all
+ * children of sp, so this element is never overwritten.
+ */
+ parents->parent[level-1] = NULL;
+ return mmu_pages_next(pvec, parents, 0);
+}
+
+static void mmu_pages_clear_parents(struct mmu_page_path *parents)
+{
+ struct kvm_mmu_page *sp;
+ unsigned int level = 0;
+
+ do {
+ unsigned int idx = parents->idx[level];
+ sp = parents->parent[level];
+ if (!sp)
+ return;
+
+ WARN_ON_ONCE(idx == INVALID_INDEX);
+ clear_unsync_child_bit(sp, idx);
+ level++;
+ } while (!sp->unsync_children);
+}
+
+static int mmu_sync_children(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *parent, bool can_yield)
+{
+ int i;
+ struct kvm_mmu_page *sp;
+ struct mmu_page_path parents;
+ struct kvm_mmu_pages pages;
+ LIST_HEAD(invalid_list);
+ bool flush = false;
+
+ while (mmu_unsync_walk(parent, &pages)) {
+ bool protected = false;
+
+ for_each_sp(pages, sp, parents, i)
+ protected |= kvm_vcpu_write_protect_gfn(vcpu, sp->gfn);
+
+ if (protected) {
+ kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, true);
+ flush = false;
+ }
+
+ for_each_sp(pages, sp, parents, i) {
+ kvm_unlink_unsync_page(vcpu->kvm, sp);
+ flush |= kvm_sync_page(vcpu, sp, &invalid_list) > 0;
+ mmu_pages_clear_parents(&parents);
+ }
+ if (need_resched() || rwlock_needbreak(&vcpu->kvm->mmu_lock)) {
+ kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush);
+ if (!can_yield) {
+ kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
+ return -EINTR;
+ }
+
+ cond_resched_rwlock_write(&vcpu->kvm->mmu_lock);
+ flush = false;
+ }
+ }
+
+ kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush);
+ return 0;
+}
+
+static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp)
+{
+ atomic_set(&sp->write_flooding_count, 0);
+}
+
+static void clear_sp_write_flooding_count(u64 *spte)
+{
+ __clear_sp_write_flooding_count(sptep_to_sp(spte));
+}
+
+/*
+ * The vCPU is required when finding indirect shadow pages; the shadow
+ * page may already exist and syncing it needs the vCPU pointer in
+ * order to read guest page tables. Direct shadow pages are never
+ * unsync, thus @vcpu can be NULL if @role.direct is true.
+ */
+static struct kvm_mmu_page *kvm_mmu_find_shadow_page(struct kvm *kvm,
+ struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ struct hlist_head *sp_list,
+ union kvm_mmu_page_role role)
+{
+ struct kvm_mmu_page *sp;
+ int ret;
+ int collisions = 0;
+ LIST_HEAD(invalid_list);
+
+ for_each_valid_sp(kvm, sp, sp_list) {
+ if (sp->gfn != gfn) {
+ collisions++;
+ continue;
+ }
+
+ if (sp->role.word != role.word) {
+ /*
+ * If the guest is creating an upper-level page, zap
+ * unsync pages for the same gfn. While it's possible
+ * the guest is using recursive page tables, in all
+ * likelihood the guest has stopped using the unsync
+ * page and is installing a completely unrelated page.
+ * Unsync pages must not be left as is, because the new
+ * upper-level page will be write-protected.
+ */
+ if (role.level > PG_LEVEL_4K && sp->unsync)
+ kvm_mmu_prepare_zap_page(kvm, sp,
+ &invalid_list);
+ continue;
+ }
+
+ /* unsync and write-flooding only apply to indirect SPs. */
+ if (sp->role.direct)
+ goto out;
+
+ if (sp->unsync) {
+ if (KVM_BUG_ON(!vcpu, kvm))
+ break;
+
+ /*
+ * The page is good, but is stale. kvm_sync_page does
+ * get the latest guest state, but (unlike mmu_unsync_children)
+ * it doesn't write-protect the page or mark it synchronized!
+ * This way the validity of the mapping is ensured, but the
+ * overhead of write protection is not incurred until the
+ * guest invalidates the TLB mapping. This allows multiple
+ * SPs for a single gfn to be unsync.
+ *
+ * If the sync fails, the page is zapped. If so, break
+ * in order to rebuild it.
+ */
+ ret = kvm_sync_page(vcpu, sp, &invalid_list);
+ if (ret < 0)
+ break;
+
+ WARN_ON_ONCE(!list_empty(&invalid_list));
+ if (ret > 0)
+ kvm_flush_remote_tlbs(kvm);
+ }
+
+ __clear_sp_write_flooding_count(sp);
+
+ goto out;
+ }
+
+ sp = NULL;
+ ++kvm->stat.mmu_cache_miss;
+
+out:
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
+ if (collisions > kvm->stat.max_mmu_page_hash_collisions)
+ kvm->stat.max_mmu_page_hash_collisions = collisions;
+ return sp;
+}
+
+/* Caches used when allocating a new shadow page. */
+struct shadow_page_caches {
+ struct kvm_mmu_memory_cache *page_header_cache;
+ struct kvm_mmu_memory_cache *shadow_page_cache;
+ struct kvm_mmu_memory_cache *shadowed_info_cache;
+};
+
+static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
+ struct shadow_page_caches *caches,
+ gfn_t gfn,
+ struct hlist_head *sp_list,
+ union kvm_mmu_page_role role)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = kvm_mmu_memory_cache_alloc(caches->page_header_cache);
+ sp->spt = kvm_mmu_memory_cache_alloc(caches->shadow_page_cache);
+ if (!role.direct && role.level <= KVM_MAX_HUGEPAGE_LEVEL)
+ sp->shadowed_translation = kvm_mmu_memory_cache_alloc(caches->shadowed_info_cache);
+
+ set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+ INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
+
+ /*
+ * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
+ * depends on valid pages being added to the head of the list. See
+ * comments in kvm_zap_obsolete_pages().
+ */
+ sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
+ list_add(&sp->link, &kvm->arch.active_mmu_pages);
+ kvm_account_mmu_page(kvm, sp);
+
+ sp->gfn = gfn;
+ sp->role = role;
+ hlist_add_head(&sp->hash_link, sp_list);
+ if (sp_has_gptes(sp))
+ account_shadowed(kvm, sp);
+
+ return sp;
+}
+
+/* Note, @vcpu may be NULL if @role.direct is true; see kvm_mmu_find_shadow_page. */
+static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
+ struct kvm_vcpu *vcpu,
+ struct shadow_page_caches *caches,
+ gfn_t gfn,
+ union kvm_mmu_page_role role)
+{
+ struct hlist_head *sp_list;
+ struct kvm_mmu_page *sp;
+ bool created = false;
+
+ /*
+ * No need for memory barriers, unlike in kvm_get_mmu_page_hash(), as
+ * mmu_page_hash must be set prior to creating the first shadow root,
+ * i.e. reaching this point is fully serialized by slots_arch_lock.
+ */
+ BUG_ON(!kvm->arch.mmu_page_hash);
+ sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
+
+ sp = kvm_mmu_find_shadow_page(kvm, vcpu, gfn, sp_list, role);
+ if (!sp) {
+ created = true;
+ sp = kvm_mmu_alloc_shadow_page(kvm, caches, gfn, sp_list, role);
+ }
+
+ trace_kvm_mmu_get_page(sp, created);
+ return sp;
+}
+
+static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
+ gfn_t gfn,
+ union kvm_mmu_page_role role)
+{
+ struct shadow_page_caches caches = {
+ .page_header_cache = &vcpu->arch.mmu_page_header_cache,
+ .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
+ .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
+ };
+
+ return __kvm_mmu_get_shadow_page(vcpu->kvm, vcpu, &caches, gfn, role);
+}
+
+static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
+ unsigned int access)
+{
+ struct kvm_mmu_page *parent_sp = sptep_to_sp(sptep);
+ union kvm_mmu_page_role role;
+
+ role = parent_sp->role;
+ role.level--;
+ role.access = access;
+ role.direct = direct;
+ role.passthrough = 0;
+
+ /*
+ * If the guest has 4-byte PTEs then that means it's using 32-bit,
+ * 2-level, non-PAE paging. KVM shadows such guests with PAE paging
+ * (i.e. 8-byte PTEs). The difference in PTE size means that KVM must
+ * shadow each guest page table with multiple shadow page tables, which
+ * requires extra bookkeeping in the role.
+ *
+ * Specifically, to shadow the guest's page directory (which covers a
+ * 4GiB address space), KVM uses 4 PAE page directories, each mapping
+ * 1GiB of the address space. @role.quadrant encodes which quarter of
+ * the address space each maps.
+ *
+ * To shadow the guest's page tables (which each map a 4MiB region), KVM
+ * uses 2 PAE page tables, each mapping a 2MiB region. For these,
+ * @role.quadrant encodes which half of the region they map.
+ *
+ * Concretely, a 4-byte PDE consumes bits 31:22, while an 8-byte PDE
+ * consumes bits 29:21. To consume bits 31:30, KVM uses 4 shadow
+ * PDPTEs; those 4 PAE page directories are pre-allocated and their
+ * quadrant is assigned in mmu_alloc_root(). A 4-byte PTE consumes
+ * bits 21:12, while an 8-byte PTE consumes bits 20:12. To consume
+ * bit 21 in the PTE (the child here), KVM propagates that bit to the
+ * quadrant, i.e. sets quadrant to '0' or '1'. The parent 8-byte PDE
+ * covers bit 21 (see above), thus the quadrant is calculated from the
+ * _least_ significant bit of the PDE index.
+ */
+ if (role.has_4_byte_gpte) {
+ WARN_ON_ONCE(role.level != PG_LEVEL_4K);
+ role.quadrant = spte_index(sptep) & 1;
+ }
+
+ return role;
+}
+
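+/*
+ * Returns ERR_PTR(-EEXIST) if @sptep already points at a (non-large) child
+ * shadow page, in which case there is nothing for the caller to link.
+ */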
+static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
+ u64 *sptep, gfn_t gfn,
+ bool direct, unsigned int access)
+{
+ union kvm_mmu_page_role role;
+
+ if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+ return ERR_PTR(-EEXIST);
+
+ role = kvm_mmu_child_role(sptep, direct, access);
+ return kvm_mmu_get_shadow_page(vcpu, gfn, role);
+}
+
+static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
+ struct kvm_vcpu *vcpu, hpa_t root,
+ u64 addr)
+{
+ iterator->addr = addr;
+ iterator->shadow_addr = root;
+ iterator->level = vcpu->arch.mmu->root_role.level;
+
+ if (iterator->level >= PT64_ROOT_4LEVEL &&
+ vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL &&
+ !vcpu->arch.mmu->root_role.direct)
+ iterator->level = PT32E_ROOT_LEVEL;
+
+ if (iterator->level == PT32E_ROOT_LEVEL) {
+ /*
+ * prev_root is currently only used for 64-bit hosts. So only
+ * the active root_hpa is valid here.
+ */
+ BUG_ON(root != vcpu->arch.mmu->root.hpa);
+
+ iterator->shadow_addr
+ = vcpu->arch.mmu->pae_root[(addr >> 30) & 3];
+ iterator->shadow_addr &= SPTE_BASE_ADDR_MASK;
+ --iterator->level;
+ if (!iterator->shadow_addr)
+ iterator->level = 0;
+ }
+}
+
+static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator,
+ struct kvm_vcpu *vcpu, u64 addr)
+{
+ shadow_walk_init_using_root(iterator, vcpu, vcpu->arch.mmu->root.hpa,
+ addr);
+}
+
+static bool shadow_walk_okay(struct kvm_shadow_walk_iterator *iterator)
+{
+ if (iterator->level < PG_LEVEL_4K)
+ return false;
+
+ iterator->index = SPTE_INDEX(iterator->addr, iterator->level);
+ iterator->sptep = ((u64 *)__va(iterator->shadow_addr)) + iterator->index;
+ return true;
+}
+
+static void __shadow_walk_next(struct kvm_shadow_walk_iterator *iterator,
+ u64 spte)
+{
+ if (!is_shadow_present_pte(spte) || is_last_spte(spte, iterator->level)) {
+ iterator->level = 0;
+ return;
+ }
+
+ iterator->shadow_addr = spte & SPTE_BASE_ADDR_MASK;
+ --iterator->level;
+}
+
+static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator)
+{
+ __shadow_walk_next(iterator, *iterator->sptep);
+}
+
+static void __link_shadow_page(struct kvm *kvm,
+ struct kvm_mmu_memory_cache *cache, u64 *sptep,
+ struct kvm_mmu_page *sp, bool flush)
+{
+ u64 spte;
+
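+ /*
+ * The shadow MMU applies PT_WRITABLE_MASK to SPTEs regardless of whether
+ * they are EPT entries, which is valid only because the EPT and legacy
+ * writable bits are identical; assert that at build time.
+ */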
+ BUILD_BUG_ON(VMX_EPT_WRITABLE_MASK != PT_WRITABLE_MASK);
+
+ /*
+ * If an SPTE is present already, it must be a leaf and therefore
+ * a large one. Drop it, and flush the TLB if needed, before
+ * installing sp.
+ */
+ if (is_shadow_present_pte(*sptep))
+ drop_large_spte(kvm, sptep, flush);
+
+ spte = make_nonleaf_spte(sp->spt, sp_ad_disabled(sp));
+
+ mmu_spte_set(sptep, spte);
+
+ mmu_page_add_parent_pte(kvm, cache, sp, sptep);
+
+ /*
+ * The non-direct sub-pagetable must be updated before linking. For
+ * L1 sp, the pagetable is updated via kvm_sync_page() in
+ * kvm_mmu_find_shadow_page() without write-protecting the gfn,
+ * so sp->unsync can be true or false. For higher level non-direct
+ * sp, the pagetable is updated/synced via mmu_sync_children() in
+ * FNAME(fetch)(), so sp->unsync_children can only be false.
+ * WARN_ON_ONCE() if anything happens unexpectedly.
+ */
+ if (WARN_ON_ONCE(sp->unsync_children) || sp->unsync)
+ mark_unsync(sptep);
+}
+
+static void link_shadow_page(struct kvm_vcpu *vcpu, u64 *sptep,
+ struct kvm_mmu_page *sp)
+{
+ __link_shadow_page(vcpu->kvm, &vcpu->arch.mmu_pte_list_desc_cache, sptep, sp, true);
+}
+
+static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
+ unsigned direct_access)
+{
+ if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
+ struct kvm_mmu_page *child;
+
+ /*
+ * For a direct sp, if the guest pte's dirty bit changed
+ * from clean to dirty, it would corrupt the sp's access:
+ * writes would be allowed through a read-only sp. Update
+ * the spte at this point to get a new sp with the correct
+ * access.
+ */
+ child = spte_to_child_sp(*sptep);
+ if (child->role.access == direct_access)
+ return;
+
+ drop_parent_pte(vcpu->kvm, child, sptep);
+ kvm_flush_remote_tlbs_sptep(vcpu->kvm, sptep);
+ }
+}
+
+/* Returns the number of zapped non-leaf child shadow pages. */
+static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
+ u64 *spte, struct list_head *invalid_list)
+{
+ u64 pte;
+ struct kvm_mmu_page *child;
+
+ pte = *spte;
+ if (is_shadow_present_pte(pte)) {
+ if (is_last_spte(pte, sp->role.level)) {
+ drop_spte(kvm, spte);
+ } else {
+ child = spte_to_child_sp(pte);
+ drop_parent_pte(kvm, child, spte);
+
+ /*
+ * Recursively zap nested TDP SPs; parentless SPs are
+ * unlikely to be used again in the near future. This
+ * avoids retaining a large number of stale nested SPs.
+ */
+ if (tdp_enabled && invalid_list &&
+ child->role.guest_mode &&
+ !atomic_long_read(&child->parent_ptes.val))
+ return kvm_mmu_prepare_zap_page(kvm, child,
+ invalid_list);
+ }
+ } else if (is_mmio_spte(kvm, pte)) {
+ mmu_spte_clear_no_track(spte);
+ }
+ return 0;
+}
+
+static int kvm_mmu_page_unlink_children(struct kvm *kvm,
+ struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
+{
+ int zapped = 0;
+ unsigned i;
+
+ for (i = 0; i < SPTE_ENT_PER_PAGE; ++i)
+ zapped += mmu_page_zap_pte(kvm, sp, sp->spt + i, invalid_list);
+
+ return zapped;
+}
+
+static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+
+ while ((sptep = rmap_get_first(&sp->parent_ptes, &iter)))
+ drop_parent_pte(kvm, sp, sptep);
+}
+
+static int mmu_zap_unsync_children(struct kvm *kvm,
+ struct kvm_mmu_page *parent,
+ struct list_head *invalid_list)
+{
+ int i, zapped = 0;
+ struct mmu_page_path parents;
+ struct kvm_mmu_pages pages;
+
+ if (parent->role.level == PG_LEVEL_4K)
+ return 0;
+
+ while (mmu_unsync_walk(parent, &pages)) {
+ struct kvm_mmu_page *sp;
+
+ for_each_sp(pages, sp, parents, i) {
+ kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+ mmu_pages_clear_parents(&parents);
+ zapped++;
+ }
+ }
+
+ return zapped;
+}
+
+static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
+ struct kvm_mmu_page *sp,
+ struct list_head *invalid_list,
+ int *nr_zapped)
+{
+ bool list_unstable, zapped_root = false;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ trace_kvm_mmu_prepare_zap_page(sp);
+ ++kvm->stat.mmu_shadow_zapped;
+ *nr_zapped = mmu_zap_unsync_children(kvm, sp, invalid_list);
+ *nr_zapped += kvm_mmu_page_unlink_children(kvm, sp, invalid_list);
+ kvm_mmu_unlink_parents(kvm, sp);
+
+ /* Zapping children means active_mmu_pages has become unstable. */
+ list_unstable = *nr_zapped;
+
+ if (!sp->role.invalid && sp_has_gptes(sp))
+ unaccount_shadowed(kvm, sp);
+
+ if (sp->unsync)
+ kvm_unlink_unsync_page(kvm, sp);
+ if (!sp->root_count) {
+ /* Count self */
+ (*nr_zapped)++;
+
+ /*
+ * Already invalid pages (previously active roots) are not on
+ * the active page list. See list_del() in the "else" case of
+ * !sp->root_count.
+ */
+ if (sp->role.invalid)
+ list_add(&sp->link, invalid_list);
+ else
+ list_move(&sp->link, invalid_list);
+ kvm_unaccount_mmu_page(kvm, sp);
+ } else {
+ /*
+ * Remove the active root from the active page list, the root
+ * will be explicitly freed when the root_count hits zero.
+ */
+ list_del(&sp->link);
+
+ /*
+ * Obsolete pages cannot be used on any vCPUs, see the comment
+ * in kvm_mmu_zap_all_fast(). Note, is_obsolete_sp() also
+ * treats invalid shadow pages as being obsolete.
+ */
+ zapped_root = !is_obsolete_sp(kvm, sp);
+ }
+
+ if (sp->nx_huge_page_disallowed)
+ unaccount_nx_huge_page(kvm, sp);
+
+ sp->role.invalid = 1;
+
+ /*
+ * Make the request to free obsolete roots after marking the root
+ * invalid, otherwise other vCPUs may not see it as invalid.
+ */
+ if (zapped_root)
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
+ return list_unstable;
+}
+
+static bool kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ struct list_head *invalid_list)
+{
+ int nr_zapped;
+
+ __kvm_mmu_prepare_zap_page(kvm, sp, invalid_list, &nr_zapped);
+ return nr_zapped;
+}
+
+static void kvm_mmu_commit_zap_page(struct kvm *kvm,
+ struct list_head *invalid_list)
+{
+ struct kvm_mmu_page *sp, *nsp;
+
+ if (list_empty(invalid_list))
+ return;
+
+ /*
+ * We need to make sure everyone sees our modifications to
+ * the page tables and sees changes to vcpu->mode here. The barrier
+ * in the kvm_flush_remote_tlbs() achieves this. This pairs
+ * with vcpu_enter_guest and walk_shadow_page_lockless_begin/end.
+ *
+ * In addition, kvm_flush_remote_tlbs waits for all vcpus to exit
+ * guest mode and/or lockless shadow page table walks.
+ */
+ kvm_flush_remote_tlbs(kvm);
+
+ list_for_each_entry_safe(sp, nsp, invalid_list, link) {
+ WARN_ON_ONCE(!sp->role.invalid || sp->root_count);
+ kvm_mmu_free_shadow_page(sp);
+ }
+}
+
+static unsigned long kvm_mmu_zap_oldest_mmu_pages(struct kvm *kvm,
+ unsigned long nr_to_zap)
+{
+ unsigned long total_zapped = 0;
+ struct kvm_mmu_page *sp, *tmp;
+ LIST_HEAD(invalid_list);
+ bool unstable;
+ int nr_zapped;
+
+ if (list_empty(&kvm->arch.active_mmu_pages))
+ return 0;
+
+restart:
+ list_for_each_entry_safe_reverse(sp, tmp, &kvm->arch.active_mmu_pages, link) {
+ /*
+ * Don't zap active root pages, the page itself can't be freed
+ * and zapping it will just force vCPUs to realloc and reload.
+ */
+ if (sp->root_count)
+ continue;
+
+ unstable = __kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list,
+ &nr_zapped);
+ total_zapped += nr_zapped;
+ if (total_zapped >= nr_to_zap)
+ break;
+
+ if (unstable)
+ goto restart;
+ }
+
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
+ kvm->stat.mmu_recycled += total_zapped;
+ return total_zapped;
+}
+
+static inline unsigned long kvm_mmu_available_pages(struct kvm *kvm)
+{
+ if (kvm->arch.n_max_mmu_pages > kvm->arch.n_used_mmu_pages)
+ return kvm->arch.n_max_mmu_pages -
+ kvm->arch.n_used_mmu_pages;
+
+ return 0;
+}
+
+static int make_mmu_pages_available(struct kvm_vcpu *vcpu)
+{
+ unsigned long avail = kvm_mmu_available_pages(vcpu->kvm);
+
+ if (likely(avail >= KVM_MIN_FREE_MMU_PAGES))
+ return 0;
+
+ kvm_mmu_zap_oldest_mmu_pages(vcpu->kvm, KVM_REFILL_PAGES - avail);
+
+ /*
+ * Note, this check is intentionally soft, it only guarantees that one
+ * page is available, while the caller may end up allocating as many as
+ * four pages, e.g. for PAE roots or for 5-level paging. Temporarily
+ * exceeding the (arbitrary by default) limit will not harm the host,
+ * being too aggressive may unnecessarily kill the guest, and getting an
+ * exact count is far more trouble than it's worth, especially in the
+ * page fault paths.
+ */
+ if (!kvm_mmu_available_pages(vcpu->kvm))
+ return -ENOSPC;
+ return 0;
+}
+
+/*
+ * Change the number of mmu pages allocated to the VM.
+ * Note: if goal_nr_mmu_pages is too small, a deadlock can occur.
+ */
+void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned long goal_nr_mmu_pages)
+{
+ write_lock(&kvm->mmu_lock);
+
+ if (kvm->arch.n_used_mmu_pages > goal_nr_mmu_pages) {
+ kvm_mmu_zap_oldest_mmu_pages(kvm, kvm->arch.n_used_mmu_pages -
+ goal_nr_mmu_pages);
+
+ goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages;
+ }
+
+ kvm->arch.n_max_mmu_pages = goal_nr_mmu_pages;
+
+ write_unlock(&kvm->mmu_lock);
+}
+
+bool __kvm_mmu_unprotect_gfn_and_retry(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ bool always_retry)
+{
+ struct kvm *kvm = vcpu->kvm;
+ LIST_HEAD(invalid_list);
+ struct kvm_mmu_page *sp;
+ gpa_t gpa = cr2_or_gpa;
+ bool r = false;
+
+ /*
+ * Bail early if there aren't any write-protected shadow pages to avoid
+ * unnecessarily taking mmu_lock, e.g. if the gfn is write-tracked
+ * by a third party. Reading indirect_shadow_pages without holding
+ * mmu_lock is safe, as this is purely an optimization, i.e. a false
+ * positive is benign, and a false negative will simply result in KVM
+ * skipping the unprotect+retry path, which is also an optimization.
+ */
+ if (!READ_ONCE(kvm->arch.indirect_shadow_pages))
+ goto out;
+
+ if (!vcpu->arch.mmu->root_role.direct) {
+ gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
+ if (gpa == INVALID_GPA)
+ goto out;
+ }
+
+ write_lock(&kvm->mmu_lock);
+ for_each_gfn_valid_sp_with_gptes(kvm, sp, gpa_to_gfn(gpa))
+ kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+
+ /*
+ * Snapshot the result before zapping, as zapping will remove all list
+ * entries, i.e. checking the list later would yield a false negative.
+ */
+ r = !list_empty(&invalid_list);
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ write_unlock(&kvm->mmu_lock);
+
+out:
+ if (r || always_retry) {
+ vcpu->arch.last_retry_eip = kvm_rip_read(vcpu);
+ vcpu->arch.last_retry_addr = cr2_or_gpa;
+ }
+ return r;
+}
+
+static void kvm_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ trace_kvm_mmu_unsync_page(sp);
+ ++kvm->stat.mmu_unsync;
+ sp->unsync = 1;
+
+ kvm_mmu_mark_parents_unsync(sp);
+}
+
+/*
+ * Attempt to unsync any shadow pages that can be reached by the specified gfn,
+ * for which KVM is creating a writable mapping. Returns 0 if all pages
+ * were marked unsync (or if there is no shadow page), -EPERM if the SPTE must
+ * be write-protected.
+ */
+int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
+ gfn_t gfn, bool synchronizing, bool prefetch)
+{
+ struct kvm_mmu_page *sp;
+ bool locked = false;
+
+ /*
+ * Force write-protection if the page is being tracked. Note, the page
+ * track machinery is used to write-protect upper-level shadow pages,
+ * i.e. this guards the role.level == 4K assertion below!
+ */
+ if (kvm_gfn_is_write_tracked(kvm, slot, gfn))
+ return -EPERM;
+
+ /*
+ * The page is not write-tracked, mark existing shadow pages unsync
+ * unless KVM is synchronizing an unsync SP. In that case, KVM must
+ * complete emulation of the guest TLB flush before allowing shadow
+ * pages to become unsync (writable by the guest).
+ */
+ for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn) {
+ if (synchronizing)
+ return -EPERM;
+
+ if (sp->unsync)
+ continue;
+
+ if (prefetch)
+ return -EEXIST;
+
+ /*
+ * TDP MMU page faults require an additional spinlock as they
+ * run with mmu_lock held for read, not write, and the unsync
+ * logic is not thread safe. Take the spinlock regardless of
+ * the MMU type to avoid extra conditionals/parameters, there's
+ * no meaningful penalty if mmu_lock is held for write.
+ */
+ if (!locked) {
+ locked = true;
+ spin_lock(&kvm->arch.mmu_unsync_pages_lock);
+
+ /*
+ * Recheck after taking the spinlock, a different vCPU
+ * may have since marked the page unsync. A false
+ * negative on the unprotected check above is not
+ * possible as clearing sp->unsync _must_ hold mmu_lock
+ * for write, i.e. unsync cannot transition from 1->0
+ * while this CPU holds mmu_lock for read (or write).
+ */
+ if (READ_ONCE(sp->unsync))
+ continue;
+ }
+
+ WARN_ON_ONCE(sp->role.level != PG_LEVEL_4K);
+ kvm_unsync_page(kvm, sp);
+ }
+ if (locked)
+ spin_unlock(&kvm->arch.mmu_unsync_pages_lock);
+
+ /*
+ * We need to ensure that the marking of unsync pages is visible
+ * before the SPTE is updated to allow writes because
+ * kvm_mmu_sync_roots() checks the unsync flags without holding
+ * the MMU lock and so can race with this. If the SPTE was updated
+ * before the page had been marked as unsync-ed, something like the
+ * following could happen:
+ *
+ * CPU 1 CPU 2
+ * ---------------------------------------------------------------------
+ * 1.2 Host updates SPTE
+ * to be writable
+ * 2.1 Guest writes a GPTE for GVA X.
+ * (GPTE being in the guest page table shadowed
+ * by the SP from CPU 1.)
+ * This reads SPTE during the page table walk.
+ * Since SPTE.W is read as 1, there is no
+ * fault.
+ *
+ * 2.2 Guest issues TLB flush.
+ * That causes a VM Exit.
+ *
+ * 2.3 Walking of unsync pages sees sp->unsync is
+ * false and skips the page.
+ *
+ * 2.4 Guest accesses GVA X.
+ * Since the mapping in the SP was not updated,
+ * the old mapping for GVA X incorrectly
+ * gets used.
+ * 1.1 Host marks SP
+ * as unsync
+ * (sp->unsync = true)
+ *
+ * The write barrier below ensures that 1.1 happens before 1.2 and thus
+ * the situation in 2.4 does not arise. It pairs with the read barrier
+ * in is_unsync_root(), placed between 2.1's load of SPTE.W and 2.3.
+ */
+ smp_wmb();
+
+ return 0;
+}
+
+static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
+ u64 *sptep, unsigned int pte_access, gfn_t gfn,
+ kvm_pfn_t pfn, struct kvm_page_fault *fault)
+{
+ struct kvm_mmu_page *sp = sptep_to_sp(sptep);
+ int level = sp->role.level;
+ int was_rmapped = 0;
+ int ret = RET_PF_FIXED;
+ bool flush = false;
+ bool wrprot;
+ u64 spte;
+
+ /* Prefetching always gets a writable pfn. */
+ bool host_writable = !fault || fault->map_writable;
+ bool prefetch = !fault || fault->prefetch;
+ bool write_fault = fault && fault->write;
+
+ if (unlikely(is_noslot_pfn(pfn))) {
+ vcpu->stat.pf_mmio_spte_created++;
+ mark_mmio_spte(vcpu, sptep, gfn, pte_access);
+ return RET_PF_EMULATE;
+ }
+
+ if (is_shadow_present_pte(*sptep)) {
+ if (prefetch && is_last_spte(*sptep, level) &&
+ pfn == spte_to_pfn(*sptep))
+ return RET_PF_SPURIOUS;
+
+ /*
+ * If we overwrite a PTE page pointer with a 2MB PMD, unlink
+ * the parent of the now unreachable PTE.
+ */
+ if (level > PG_LEVEL_4K && !is_large_pte(*sptep)) {
+ struct kvm_mmu_page *child;
+ u64 pte = *sptep;
+
+ child = spte_to_child_sp(pte);
+ drop_parent_pte(vcpu->kvm, child, sptep);
+ flush = true;
+ } else if (WARN_ON_ONCE(pfn != spte_to_pfn(*sptep))) {
+ drop_spte(vcpu->kvm, sptep);
+ flush = true;
+ } else
+ was_rmapped = 1;
+ }
+
+ wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
+ false, host_writable, &spte);
+
+ if (*sptep == spte) {
+ ret = RET_PF_SPURIOUS;
+ } else {
+ flush |= mmu_spte_update(sptep, spte);
+ trace_kvm_mmu_set_spte(level, gfn, sptep);
+ }
+
+ if (wrprot && write_fault)
+ ret = RET_PF_WRITE_PROTECTED;
+
+ if (flush)
+ kvm_flush_remote_tlbs_gfn(vcpu->kvm, gfn, level);
+
+ if (!was_rmapped) {
+ WARN_ON_ONCE(ret == RET_PF_SPURIOUS);
+ rmap_add(vcpu, slot, sptep, gfn, pte_access);
+ } else {
+ /* Already rmapped but the pte_access bits may have changed. */
+ kvm_mmu_page_set_access(sp, spte_index(sptep), pte_access);
+ }
+
+ return ret;
+}
+
+static bool kvm_mmu_prefetch_sptes(struct kvm_vcpu *vcpu, gfn_t gfn, u64 *sptep,
+ int nr_pages, unsigned int access)
+{
+ struct page *pages[PTE_PREFETCH_NUM];
+ struct kvm_memory_slot *slot;
+ int i;
+
+ if (WARN_ON_ONCE(nr_pages > PTE_PREFETCH_NUM))
+ return false;
+
+ slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, access & ACC_WRITE_MASK);
+ if (!slot)
+ return false;
+
+ nr_pages = kvm_prefetch_pages(slot, gfn, pages, nr_pages);
+ if (nr_pages <= 0)
+ return false;
+
+ for (i = 0; i < nr_pages; i++, gfn++, sptep++) {
+ mmu_set_spte(vcpu, slot, sptep, access, gfn,
+ page_to_pfn(pages[i]), NULL);
+
+ /*
+ * KVM always prefetches writable pages from the primary MMU,
+ * and KVM can make its SPTE writable in the fast page fault handler,
+ * without notifying the primary MMU. Mark pages/folios dirty
+ * now to ensure file data is written back if it ends up being
+ * written by the guest. Because KVM's prefetching GUPs
+ * writable PTEs, the probability of unnecessary writeback is
+ * extremely low.
+ */
+ kvm_release_page_dirty(pages[i]);
+ }
+
+ return true;
+}
+
+static bool direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp,
+ u64 *start, u64 *end)
+{
+ gfn_t gfn = kvm_mmu_page_get_gfn(sp, spte_index(start));
+ unsigned int access = sp->role.access;
+
+ return kvm_mmu_prefetch_sptes(vcpu, gfn, start, end - start, access);
+}
+
+static void __direct_pte_prefetch(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp, u64 *sptep)
+{
+ u64 *spte, *start = NULL;
+ int i;
+
+ WARN_ON_ONCE(!sp->role.direct);
+
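+	/*
+	 * Align the start of the window down to PTE_PREFETCH_NUM (8), e.g. a
+	 * faulting SPTE at index 13 yields the window of indices 8..15.
+	 */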
+ i = spte_index(sptep) & ~(PTE_PREFETCH_NUM - 1);
+ spte = sp->spt + i;
+
+ for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) {
+ if (is_shadow_present_pte(*spte) || spte == sptep) {
+ if (!start)
+ continue;
+ if (!direct_pte_prefetch_many(vcpu, sp, start, spte))
+ return;
+
+ start = NULL;
+ } else if (!start)
+ start = spte;
+ }
+ if (start)
+ direct_pte_prefetch_many(vcpu, sp, start, spte);
+}
+
+static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = sptep_to_sp(sptep);
+
+ /*
+ * Without accessed bits, there's no way to distinguish between
+ * actually accessed translations and prefetched ones, so disable pte
+ * prefetch if accessed bits aren't available.
+ */
+ if (sp_ad_disabled(sp))
+ return;
+
+ if (sp->role.level > PG_LEVEL_4K)
+ return;
+
+ /*
+ * If addresses are being invalidated, skip prefetching to avoid
+ * accidentally prefetching those addresses.
+ */
+ if (unlikely(vcpu->kvm->mmu_invalidate_in_progress))
+ return;
+
+ __direct_pte_prefetch(vcpu, sp, sptep);
+}
+
+/*
+ * Lookup the mapping level for @gfn in the current mm.
+ *
+ * WARNING! Use of host_pfn_mapping_level() requires the caller and the end
+ * consumer to be tied into KVM's handlers for MMU notifier events!
+ *
+ * There are several ways to safely use this helper:
+ *
+ * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
+ * consuming it. In this case, mmu_lock doesn't need to be held during the
+ * lookup, but it does need to be held while checking the MMU notifier.
+ *
+ * - Hold mmu_lock AND ensure there is no in-progress MMU notifier invalidation
+ * event for the hva. This can be done by explicitly checking the MMU notifier
+ * or by ensuring that KVM already has a valid mapping that covers the hva.
+ *
+ * - Do not use the result to install new mappings, e.g. use the host mapping
+ * level only to decide whether or not to zap an entry. In this case, it's
+ * not required to hold mmu_lock (though it's highly likely the caller will
+ * want to hold mmu_lock anyway, e.g. to modify SPTEs).
+ *
+ * Note! The lookup can still race with modifications to host page tables, but
+ * the above "rules" ensure KVM will not _consume_ the result of the walk if a
+ * race with the primary MMU occurs.
+ */
+static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
+ const struct kvm_memory_slot *slot)
+{
+ int level = PG_LEVEL_4K;
+ unsigned long hva;
+ unsigned long flags;
+ pgd_t pgd;
+ p4d_t p4d;
+ pud_t pud;
+ pmd_t pmd;
+
+ /*
+ * Note, using the already-retrieved memslot and __gfn_to_hva_memslot()
+ * is not solely for performance, it's also necessary to avoid the
+ * "writable" check in __gfn_to_hva_many(), which will always fail on
+ * read-only memslots due to gfn_to_hva() assuming writes. Earlier
+ * page fault steps have already verified the guest isn't writing a
+ * read-only memslot.
+ */
+ hva = __gfn_to_hva_memslot(slot, gfn);
+
+ /*
+ * Disable IRQs to prevent concurrent tear down of host page tables,
+ * e.g. if the primary MMU promotes a P*D to a huge page and then frees
+ * the original page table.
+ */
+ local_irq_save(flags);
+
+ /*
+ * Read each entry once. As above, a non-leaf entry can be promoted to
+ * a huge page _during_ this walk. Re-reading the entry could send the
+ * walk into the weeds, e.g. p*d_leaf() returns false (sees the old
+ * value) and then p*d_offset() walks into the target huge page instead
+ * of the old page table (sees the new value).
+ */
+ pgd = READ_ONCE(*pgd_offset(kvm->mm, hva));
+ if (pgd_none(pgd))
+ goto out;
+
+ p4d = READ_ONCE(*p4d_offset(&pgd, hva));
+ if (p4d_none(p4d) || !p4d_present(p4d))
+ goto out;
+
+ pud = READ_ONCE(*pud_offset(&p4d, hva));
+ if (pud_none(pud) || !pud_present(pud))
+ goto out;
+
+ if (pud_leaf(pud)) {
+ level = PG_LEVEL_1G;
+ goto out;
+ }
+
+ pmd = READ_ONCE(*pmd_offset(&pud, hva));
+ if (pmd_none(pmd) || !pmd_present(pmd))
+ goto out;
+
+ if (pmd_leaf(pmd))
+ level = PG_LEVEL_2M;
+
+out:
+ local_irq_restore(flags);
+ return level;
+}
+
+static u8 kvm_max_level_for_order(int order)
+{
+ BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+ KVM_MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) &&
+ order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) &&
+ order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K));
+
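+	/*
+	 * On x86, KVM_HPAGE_GFN_SHIFT() is 9 for PG_LEVEL_2M and 18 for
+	 * PG_LEVEL_1G, so e.g. an order-9 (2MiB) allocation maps at 2M while
+	 * anything smaller falls back to 4K.
+	 */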
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+ return PG_LEVEL_1G;
+
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+ return PG_LEVEL_2M;
+
+ return PG_LEVEL_4K;
+}
+
+static u8 kvm_gmem_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
+ const struct kvm_memory_slot *slot, gfn_t gfn,
+ bool is_private)
+{
+ u8 max_level, coco_level;
+ kvm_pfn_t pfn;
+
+ /* For faults, use the gmem information that was resolved earlier. */
+ if (fault) {
+ pfn = fault->pfn;
+ max_level = fault->max_level;
+ } else {
+ /* TODO: Call into guest_memfd once hugepages are supported. */
+ WARN_ONCE(1, "Get pfn+order from guest_memfd");
+ pfn = KVM_PFN_ERR_FAULT;
+ max_level = PG_LEVEL_4K;
+ }
+
+ if (max_level == PG_LEVEL_4K)
+ return max_level;
+
+ /*
+ * CoCo may influence the max mapping level, e.g. due to RMP or S-EPT
+ * restrictions. A return of '0' means "no additional restrictions", to
+ * allow for using an optional "ret0" static call.
+ */
+ coco_level = kvm_x86_call(gmem_max_mapping_level)(kvm, pfn, is_private);
+ if (coco_level)
+ max_level = min(max_level, coco_level);
+
+ return max_level;
+}
+
+int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
+ const struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ struct kvm_lpage_info *linfo;
+ int host_level, max_level;
+ bool is_private;
+
+ lockdep_assert_held(&kvm->mmu_lock);
+
+ if (fault) {
+ max_level = fault->max_level;
+ is_private = fault->is_private;
+ } else {
+ max_level = PG_LEVEL_NUM;
+ is_private = kvm_mem_is_private(kvm, gfn);
+ }
+
+ max_level = min(max_level, max_huge_page_level);
+ for ( ; max_level > PG_LEVEL_4K; max_level--) {
+ linfo = lpage_info_slot(gfn, slot, max_level);
+ if (!linfo->disallow_lpage)
+ break;
+ }
+
+ if (max_level == PG_LEVEL_4K)
+ return PG_LEVEL_4K;
+
+ if (is_private || kvm_memslot_is_gmem_only(slot))
+ host_level = kvm_gmem_max_mapping_level(kvm, fault, slot, gfn,
+ is_private);
+ else
+ host_level = host_pfn_mapping_level(kvm, gfn, slot);
+ return min(host_level, max_level);
+}
+
+void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ struct kvm_memory_slot *slot = fault->slot;
+ kvm_pfn_t mask;
+
+ fault->huge_page_disallowed = fault->exec && fault->nx_huge_page_workaround_enabled;
+
+ if (unlikely(fault->max_level == PG_LEVEL_4K))
+ return;
+
+ if (is_error_noslot_pfn(fault->pfn))
+ return;
+
+ if (kvm_slot_dirty_track_enabled(slot))
+ return;
+
+ /*
+ * Enforce the iTLB multihit workaround after capturing the requested
+ * level, which will be used to do precise, accurate accounting.
+ */
+ fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, fault,
+ fault->slot, fault->gfn);
+ if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
+ return;
+
+ /*
+ * mmu_invalidate_retry() was successful and mmu_lock is held, so
+ * the pmd can't be split from under us.
+ */
+ fault->goal_level = fault->req_level;
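+	/*
+	 * E.g. for a 2M goal level, KVM_PAGES_PER_HPAGE() is 512, so the mask
+	 * below is 511 and clearing those pfn bits aligns the pfn to a 2M
+	 * frame.
+	 */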
+ mask = KVM_PAGES_PER_HPAGE(fault->goal_level) - 1;
+ VM_BUG_ON((fault->gfn & mask) != (fault->pfn & mask));
+ fault->pfn &= ~mask;
+}
+
+void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level)
+{
+ if (cur_level > PG_LEVEL_4K &&
+ cur_level == fault->goal_level &&
+ is_shadow_present_pte(spte) &&
+ !is_large_pte(spte) &&
+ spte_to_child_sp(spte)->nx_huge_page_disallowed) {
+ /*
+ * A small SPTE exists for this pfn, but FNAME(fetch),
+ * direct_map(), or kvm_tdp_mmu_map() would like to create a
+ * large PTE instead: just force them to go down another level,
+		 * patching the next 9 bits of the address back into the
+		 * pfn.
+ */
+ u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
+ KVM_PAGES_PER_HPAGE(cur_level - 1);
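+		/*
+		 * E.g. when dropping from 1G to 2M, page_mask covers gfn bits
+		 * 17:9, exactly the 9 bits that select a 2M region within the
+		 * 1G region.
+		 */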
+ fault->pfn |= fault->gfn & page_mask;
+ fault->goal_level--;
+ }
+}
+
+static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ struct kvm_shadow_walk_iterator it;
+ struct kvm_mmu_page *sp;
+ int ret;
+ gfn_t base_gfn = fault->gfn;
+
+ kvm_mmu_hugepage_adjust(vcpu, fault);
+
+ trace_kvm_mmu_spte_requested(fault);
+ for_each_shadow_entry(vcpu, fault->addr, it) {
+ /*
+ * We cannot overwrite existing page tables with an NX
+ * large page, as the leaf could be executable.
+ */
+ if (fault->nx_huge_page_workaround_enabled)
+ disallowed_hugepage_adjust(fault, *it.sptep, it.level);
+
+ base_gfn = gfn_round_for_level(fault->gfn, it.level);
+ if (it.level == fault->goal_level)
+ break;
+
+ sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
+ if (sp == ERR_PTR(-EEXIST))
+ continue;
+
+ link_shadow_page(vcpu, it.sptep, sp);
+ if (fault->huge_page_disallowed)
+ account_nx_huge_page(vcpu->kvm, sp,
+ fault->req_level >= it.level);
+ }
+
+ if (WARN_ON_ONCE(it.level != fault->goal_level))
+ return -EFAULT;
+
+ ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL,
+ base_gfn, fault->pfn, fault);
+ if (ret == RET_PF_SPURIOUS)
+ return ret;
+
+ direct_pte_prefetch(vcpu, it.sptep);
+ return ret;
+}
+
+static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ unsigned long hva = gfn_to_hva_memslot(slot, gfn);
+
+ send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);
+}
+
+static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ if (is_sigpending_pfn(fault->pfn)) {
+ kvm_handle_signal_exit(vcpu);
+ return -EINTR;
+ }
+
+ /*
+	 * Do not cache the MMIO info caused by writing a read-only gfn
+	 * into the SPTE, otherwise a read access to the read-only gfn
+	 * would also cause an MMIO page fault and be treated as MMIO.
+ */
+ if (fault->pfn == KVM_PFN_ERR_RO_FAULT)
+ return RET_PF_EMULATE;
+
+ if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
+ kvm_send_hwpoison_signal(fault->slot, fault->gfn);
+ return RET_PF_RETRY;
+ }
+
+ return -EFAULT;
+}
+
+static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault,
+ unsigned int access)
+{
+ gva_t gva = fault->is_tdp ? 0 : fault->addr;
+
+ if (fault->is_private) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
+ vcpu_cache_mmio_info(vcpu, gva, fault->gfn,
+ access & shadow_mmio_access_mask);
+
+ fault->slot = NULL;
+ fault->pfn = KVM_PFN_NOSLOT;
+ fault->map_writable = false;
+
+ /*
+ * If MMIO caching is disabled, emulate immediately without
+ * touching the shadow page tables as attempting to install an
+ * MMIO SPTE will just be an expensive nop.
+ */
+ if (unlikely(!enable_mmio_caching))
+ return RET_PF_EMULATE;
+
+ /*
+ * Do not create an MMIO SPTE for a gfn greater than host.MAXPHYADDR,
+ * any guest that generates such gfns is running nested and is being
+ * tricked by L0 userspace (you can observe gfn > L1.MAXPHYADDR if and
+ * only if L1's MAXPHYADDR is inaccurate with respect to the
+ * hardware's).
+ */
+ if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
+ return RET_PF_EMULATE;
+
+ return RET_PF_CONTINUE;
+}
+
+static bool page_fault_can_be_fast(struct kvm *kvm, struct kvm_page_fault *fault)
+{
+ /*
+ * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
+ * reach the common page fault handler if the SPTE has an invalid MMIO
+ * generation number. Refreshing the MMIO generation needs to go down
+ * the slow path. Note, EPT Misconfigs do NOT set the PRESENT flag!
+ */
+ if (fault->rsvd)
+ return false;
+
+ /*
+ * For hardware-protected VMs, certain conditions like attempting to
+ * perform a write to a page which is not in the state that the guest
+ * expects it to be in can result in a nested/extended #PF. In this
+ * case, the below code might misconstrue this situation as being the
+ * result of a write-protected access, and treat it as a spurious case
+ * rather than taking any action to satisfy the real source of the #PF
+ * such as generating a KVM_EXIT_MEMORY_FAULT. This can lead to the
+ * guest spinning on a #PF indefinitely, so don't attempt the fast path
+ * in this case.
+ *
+ * Note that the kvm_mem_is_private() check might race with an
+ * attribute update, but this will either result in the guest spinning
+ * on RET_PF_SPURIOUS until the update completes, or an actual spurious
+ * case might go down the slow path. Either case will resolve itself.
+ */
+ if (kvm->arch.has_private_mem &&
+ fault->is_private != kvm_mem_is_private(kvm, fault->gfn))
+ return false;
+
+ /*
+ * #PF can be fast if:
+ *
+ * 1. The shadow page table entry is not present and A/D bits are
+ * disabled _by KVM_, which could mean that the fault is potentially
+ * caused by access tracking (if enabled). If A/D bits are enabled
+ * by KVM, but disabled by L1 for L2, KVM is forced to disable A/D
+ * bits for L2 and employ access tracking, but the fast page fault
+ * mechanism only supports direct MMUs.
+ * 2. The shadow page table entry is present, the access is a write,
+ * and no reserved bits are set (MMIO SPTEs cannot be "fixed"), i.e.
+ * the fault was caused by a write-protection violation. If the
+ * SPTE is MMU-writable (determined later), the fault can be fixed
+ * by setting the Writable bit, which can be done out of mmu_lock.
+ */
+ if (!fault->present)
+ return !kvm_ad_enabled;
+
+ /*
+ * Note, instruction fetches and writes are mutually exclusive, ignore
+ * the "exec" flag.
+ */
+ return fault->write;
+}
+
+/*
+ * Returns true if the SPTE was fixed successfully. Otherwise,
+ * someone else modified the SPTE from its original value.
+ */
+static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault,
+ u64 *sptep, u64 old_spte, u64 new_spte)
+{
+ /*
+ * Theoretically we could also set dirty bit (and flush TLB) here in
+ * order to eliminate unnecessary PML logging. See comments in
+ * set_spte. But fast_page_fault is very unlikely to happen with PML
+	 * enabled, so we do not do this. This might result in the same GPA
+	 * being logged in the PML buffer again when the write really happens,
+	 * and mark_page_dirty() eventually being called twice for it, but that
+	 * does no harm. This also avoids the TLB flush needed after setting
+	 * the dirty bit, so non-PML cases won't be impacted.
+ *
+ * Compare with make_spte() where instead shadow_dirty_mask is set.
+ */
+ if (!try_cmpxchg64(sptep, &old_spte, new_spte))
+ return false;
+
+ if (is_writable_pte(new_spte) && !is_writable_pte(old_spte))
+ mark_page_dirty_in_slot(vcpu->kvm, fault->slot, fault->gfn);
+
+ return true;
+}
+
+/*
+ * Returns the last level spte pointer of the shadow page walk for the given
+ * gpa, and sets *spte to the spte value. This spte may be non-present. If no
+ * walk could be performed, returns NULL and *spte does not contain valid data.
+ *
+ * Contract:
+ * - Must be called between walk_shadow_page_lockless_{begin,end}.
+ * - The returned sptep must not be used after walk_shadow_page_lockless_end.
+ */
+static u64 *fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gpa_t gpa, u64 *spte)
+{
+ struct kvm_shadow_walk_iterator iterator;
+ u64 old_spte;
+ u64 *sptep = NULL;
+
+ for_each_shadow_entry_lockless(vcpu, gpa, iterator, old_spte) {
+ sptep = iterator.sptep;
+ *spte = old_spte;
+ }
+
+ return sptep;
+}
+
+/*
+ * Returns one of RET_PF_INVALID, RET_PF_FIXED or RET_PF_SPURIOUS.
+ */
+static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ struct kvm_mmu_page *sp;
+ int ret = RET_PF_INVALID;
+ u64 spte;
+ u64 *sptep;
+ uint retry_count = 0;
+
+ if (!page_fault_can_be_fast(vcpu->kvm, fault))
+ return ret;
+
+ walk_shadow_page_lockless_begin(vcpu);
+
+ do {
+ u64 new_spte;
+
+ if (tdp_mmu_enabled)
+ sptep = kvm_tdp_mmu_fast_pf_get_last_sptep(vcpu, fault->gfn, &spte);
+ else
+ sptep = fast_pf_get_last_sptep(vcpu, fault->addr, &spte);
+
+ /*
+ * It's entirely possible for the mapping to have been zapped
+ * by a different task, but the root page should always be
+ * available as the vCPU holds a reference to its root(s).
+ */
+ if (WARN_ON_ONCE(!sptep))
+ spte = FROZEN_SPTE;
+
+ if (!is_shadow_present_pte(spte))
+ break;
+
+ sp = sptep_to_sp(sptep);
+ if (!is_last_spte(spte, sp->role.level))
+ break;
+
+ /*
+ * Check whether the memory access that caused the fault would
+ * still cause it if it were to be performed right now. If not,
+	 * then this is a spurious fault caused by a lazily flushed TLB,
+ * or some other CPU has already fixed the PTE after the
+ * current CPU took the fault.
+ *
+ * Need not check the access of upper level table entries since
+ * they are always ACC_ALL.
+ */
+ if (is_access_allowed(fault, spte)) {
+ ret = RET_PF_SPURIOUS;
+ break;
+ }
+
+ new_spte = spte;
+
+ /*
+ * KVM only supports fixing page faults outside of MMU lock for
+ * direct MMUs, nested MMUs are always indirect, and KVM always
+ * uses A/D bits for non-nested MMUs. Thus, if A/D bits are
+ * enabled, the SPTE can't be an access-tracked SPTE.
+ */
+ if (unlikely(!kvm_ad_enabled) && is_access_track_spte(spte))
+ new_spte = restore_acc_track_spte(new_spte) |
+ shadow_accessed_mask;
+
+ /*
+ * To keep things simple, only SPTEs that are MMU-writable can
+ * be made fully writable outside of mmu_lock, e.g. only SPTEs
+ * that were write-protected for dirty-logging or access
+ * tracking are handled here. Don't bother checking if the
+ * SPTE is writable to prioritize running with A/D bits enabled.
+ * The is_access_allowed() check above handles the common case
+ * of the fault being spurious, and the SPTE is known to be
+ * shadow-present, i.e. except for access tracking restoration
+ * making the new SPTE writable, the check is wasteful.
+ */
+ if (fault->write && is_mmu_writable_spte(spte)) {
+ new_spte |= PT_WRITABLE_MASK;
+
+ /*
+ * Do not fix write-permission on the large spte when
+			 * dirty logging is enabled. Since only the first
+			 * page is marked in the dirty bitmap by
+			 * fast_pf_fix_direct_spte(), the other pages would
+			 * be missed if the slot has dirty logging enabled.
+ *
+ * Instead, we let the slow page fault path create a
+ * normal spte to fix the access.
+ */
+ if (sp->role.level > PG_LEVEL_4K &&
+ kvm_slot_dirty_track_enabled(fault->slot))
+ break;
+ }
+
+ /* Verify that the fault can be handled in the fast path */
+ if (new_spte == spte ||
+ !is_access_allowed(fault, new_spte))
+ break;
+
+ /*
+ * Currently, fast page fault only works for direct mapping
+ * since the gfn is not stable for indirect shadow page. See
+ * Documentation/virt/kvm/locking.rst to get more detail.
+ */
+ if (fast_pf_fix_direct_spte(vcpu, fault, sptep, spte, new_spte)) {
+ ret = RET_PF_FIXED;
+ break;
+ }
+
+ if (++retry_count > 4) {
+ pr_warn_once("Fast #PF retrying more than 4 times.\n");
+ break;
+ }
+
+ } while (true);
+
+ trace_fast_page_fault(vcpu, fault, sptep, spte, ret);
+ walk_shadow_page_lockless_end(vcpu);
+
+ if (ret != RET_PF_INVALID)
+ vcpu->stat.pf_fast++;
+
+ return ret;
+}
+
+static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
+ struct list_head *invalid_list)
+{
+ struct kvm_mmu_page *sp;
+
+ if (!VALID_PAGE(*root_hpa))
+ return;
+
+ sp = root_to_sp(*root_hpa);
+ if (WARN_ON_ONCE(!sp))
+ return;
+
+ if (is_tdp_mmu_page(sp)) {
+ lockdep_assert_held_read(&kvm->mmu_lock);
+ kvm_tdp_mmu_put_root(kvm, sp);
+ } else {
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ if (!--sp->root_count && sp->role.invalid)
+ kvm_mmu_prepare_zap_page(kvm, sp, invalid_list);
+ }
+
+ *root_hpa = INVALID_PAGE;
+}
+
+/* roots_to_free must be some combination of the KVM_MMU_ROOT_* flags */
+void kvm_mmu_free_roots(struct kvm *kvm, struct kvm_mmu *mmu,
+ ulong roots_to_free)
+{
+ bool is_tdp_mmu = tdp_mmu_enabled && mmu->root_role.direct;
+ int i;
+ LIST_HEAD(invalid_list);
+ bool free_active_root;
+
+ WARN_ON_ONCE(roots_to_free & ~KVM_MMU_ROOTS_ALL);
+
+ BUILD_BUG_ON(KVM_MMU_NUM_PREV_ROOTS >= BITS_PER_LONG);
+
+ /* Before acquiring the MMU lock, see if we need to do any real work. */
+ free_active_root = (roots_to_free & KVM_MMU_ROOT_CURRENT)
+ && VALID_PAGE(mmu->root.hpa);
+
+ if (!free_active_root) {
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+ if ((roots_to_free & KVM_MMU_ROOT_PREVIOUS(i)) &&
+ VALID_PAGE(mmu->prev_roots[i].hpa))
+ break;
+
+ if (i == KVM_MMU_NUM_PREV_ROOTS)
+ return;
+ }
+
+ if (is_tdp_mmu)
+ read_lock(&kvm->mmu_lock);
+ else
+ write_lock(&kvm->mmu_lock);
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+ if (roots_to_free & KVM_MMU_ROOT_PREVIOUS(i))
+ mmu_free_root_page(kvm, &mmu->prev_roots[i].hpa,
+ &invalid_list);
+
+ if (free_active_root) {
+ if (kvm_mmu_is_dummy_root(mmu->root.hpa)) {
+ /* Nothing to cleanup for dummy roots. */
+ } else if (root_to_sp(mmu->root.hpa)) {
+ mmu_free_root_page(kvm, &mmu->root.hpa, &invalid_list);
+ } else if (mmu->pae_root) {
+ for (i = 0; i < 4; ++i) {
+ if (!IS_VALID_PAE_ROOT(mmu->pae_root[i]))
+ continue;
+
+ mmu_free_root_page(kvm, &mmu->pae_root[i],
+ &invalid_list);
+ mmu->pae_root[i] = INVALID_PAE_ROOT;
+ }
+ }
+ mmu->root.hpa = INVALID_PAGE;
+ mmu->root.pgd = 0;
+ }
+
+ if (is_tdp_mmu) {
+ read_unlock(&kvm->mmu_lock);
+ WARN_ON_ONCE(!list_empty(&invalid_list));
+ } else {
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ write_unlock(&kvm->mmu_lock);
+ }
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_free_roots);
+
+void kvm_mmu_free_guest_mode_roots(struct kvm *kvm, struct kvm_mmu *mmu)
+{
+ unsigned long roots_to_free = 0;
+ struct kvm_mmu_page *sp;
+ hpa_t root_hpa;
+ int i;
+
+ /*
+	 * This should not be called while L2 is active; L2 can't invalidate
+ * _only_ its own roots, e.g. INVVPID unconditionally exits.
+ */
+ WARN_ON_ONCE(mmu->root_role.guest_mode);
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+ root_hpa = mmu->prev_roots[i].hpa;
+ if (!VALID_PAGE(root_hpa))
+ continue;
+
+ sp = root_to_sp(root_hpa);
+ if (!sp || sp->role.guest_mode)
+ roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
+ }
+
+ kvm_mmu_free_roots(kvm, mmu, roots_to_free);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_free_guest_mode_roots);
+
+static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
+ u8 level)
+{
+ union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
+ struct kvm_mmu_page *sp;
+
+ role.level = level;
+ role.quadrant = quadrant;
+
+ WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
+ WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
+
+ sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
+ ++sp->root_count;
+
+ return __pa(sp->spt);
+}
+
+static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ u8 shadow_root_level = mmu->root_role.level;
+ hpa_t root;
+ unsigned i;
+ int r;
+
+ if (tdp_mmu_enabled) {
+ if (kvm_has_mirrored_tdp(vcpu->kvm) &&
+ !VALID_PAGE(mmu->mirror_root_hpa))
+ kvm_tdp_mmu_alloc_root(vcpu, true);
+ kvm_tdp_mmu_alloc_root(vcpu, false);
+ return 0;
+ }
+
+ write_lock(&vcpu->kvm->mmu_lock);
+ r = make_mmu_pages_available(vcpu);
+ if (r < 0)
+ goto out_unlock;
+
+ if (shadow_root_level >= PT64_ROOT_4LEVEL) {
+ root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
+ mmu->root.hpa = root;
+ } else if (shadow_root_level == PT32E_ROOT_LEVEL) {
+ if (WARN_ON_ONCE(!mmu->pae_root)) {
+ r = -EIO;
+ goto out_unlock;
+ }
+
+ for (i = 0; i < 4; ++i) {
+ WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
+
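+			/*
+			 * Each PAE page directory covers 1GiB of guest
+			 * physical address space, so the base gfn for entry i
+			 * is i << (30 - PAGE_SHIFT), e.g. gfn 0x40000 for
+			 * i == 1 with 4K pages.
+			 */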
+ root = mmu_alloc_root(vcpu, i << (30 - PAGE_SHIFT), 0,
+ PT32_ROOT_LEVEL);
+ mmu->pae_root[i] = root | PT_PRESENT_MASK |
+ shadow_me_value;
+ }
+ mmu->root.hpa = __pa(mmu->pae_root);
+ } else {
+ WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
+ r = -EIO;
+ goto out_unlock;
+ }
+
+ /* root.pgd is ignored for direct MMUs. */
+ mmu->root.pgd = 0;
+out_unlock:
+ write_unlock(&vcpu->kvm->mmu_lock);
+ return r;
+}
+
+static int kvm_mmu_alloc_page_hash(struct kvm *kvm)
+{
+ struct hlist_head *h;
+
+ if (kvm->arch.mmu_page_hash)
+ return 0;
+
+ h = kvcalloc(KVM_NUM_MMU_PAGES, sizeof(*h), GFP_KERNEL_ACCOUNT);
+ if (!h)
+ return -ENOMEM;
+
+ /*
+ * Ensure the hash table pointer is set only after all stores to zero
+ * the memory are retired. Pairs with the smp_load_acquire() in
+ * kvm_get_mmu_page_hash(). Note, mmu_lock must be held for write to
+ * add (or remove) shadow pages, and so readers are guaranteed to see
+ * an empty list for their current mmu_lock critical section.
+ */
+ smp_store_release(&kvm->arch.mmu_page_hash, h);
+ return 0;
+}
+
+static int mmu_first_shadow_root_alloc(struct kvm *kvm)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ int r = 0, i, bkt;
+
+ /*
+ * Check if this is the first shadow root being allocated before
+ * taking the lock.
+ */
+ if (kvm_shadow_root_allocated(kvm))
+ return 0;
+
+ mutex_lock(&kvm->slots_arch_lock);
+
+ /* Recheck, under the lock, whether this is the first shadow root. */
+ if (kvm_shadow_root_allocated(kvm))
+ goto out_unlock;
+
+ r = kvm_mmu_alloc_page_hash(kvm);
+ if (r)
+ goto out_unlock;
+
+ /*
+ * Check if memslot metadata actually needs to be allocated, e.g. all
+ * metadata will be allocated upfront if TDP is disabled.
+ */
+ if (kvm_memslots_have_rmaps(kvm) &&
+ kvm_page_track_write_tracking_enabled(kvm))
+ goto out_success;
+
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+ slots = __kvm_memslots(kvm, i);
+ kvm_for_each_memslot(slot, bkt, slots) {
+ /*
+ * Both of these functions are no-ops if the target is
+ * already allocated, so unconditionally calling both
+ * is safe. Intentionally do NOT free allocations on
+ * failure to avoid having to track which allocations
+ * were made now versus when the memslot was created.
+ * The metadata is guaranteed to be freed when the slot
+ * is freed, and will be kept/used if userspace retries
+ * KVM_RUN instead of killing the VM.
+ */
+ r = memslot_rmap_alloc(slot, slot->npages);
+ if (r)
+ goto out_unlock;
+ r = kvm_page_track_write_tracking_alloc(slot);
+ if (r)
+ goto out_unlock;
+ }
+ }
+
+ /*
+ * Ensure that shadow_root_allocated becomes true strictly after
+ * all the related pointers are set.
+ */
+out_success:
+ smp_store_release(&kvm->arch.shadow_root_allocated, true);
+
+out_unlock:
+ mutex_unlock(&kvm->slots_arch_lock);
+ return r;
+}
+
+static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ u64 pdptrs[4], pm_mask;
+ gfn_t root_gfn, root_pgd;
+ int quadrant, i, r;
+ hpa_t root;
+
+ root_pgd = kvm_mmu_get_guest_pgd(vcpu, mmu);
+ root_gfn = (root_pgd & __PT_BASE_ADDR_MASK) >> PAGE_SHIFT;
+
+ if (!kvm_vcpu_is_visible_gfn(vcpu, root_gfn)) {
+ mmu->root.hpa = kvm_mmu_get_dummy_root();
+ return 0;
+ }
+
+ /*
+ * On SVM, reading PDPTRs might access guest memory, which might fault
+ * and thus might sleep. Grab the PDPTRs before acquiring mmu_lock.
+ */
+ if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
+ for (i = 0; i < 4; ++i) {
+ pdptrs[i] = mmu->get_pdptr(vcpu, i);
+ if (!(pdptrs[i] & PT_PRESENT_MASK))
+ continue;
+
+ if (!kvm_vcpu_is_visible_gfn(vcpu, pdptrs[i] >> PAGE_SHIFT))
+ pdptrs[i] = 0;
+ }
+ }
+
+ r = mmu_first_shadow_root_alloc(vcpu->kvm);
+ if (r)
+ return r;
+
+ write_lock(&vcpu->kvm->mmu_lock);
+ r = make_mmu_pages_available(vcpu);
+ if (r < 0)
+ goto out_unlock;
+
+ /*
+ * Do we shadow a long mode page table? If so we need to
+	 * write-protect the guest's page table root.
+ */
+ if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
+ root = mmu_alloc_root(vcpu, root_gfn, 0,
+ mmu->root_role.level);
+ mmu->root.hpa = root;
+ goto set_root_pgd;
+ }
+
+ if (WARN_ON_ONCE(!mmu->pae_root)) {
+ r = -EIO;
+ goto out_unlock;
+ }
+
+ /*
+ * We shadow a 32 bit page table. This may be a legacy 2-level
+ * or a PAE 3-level page table. In either case we need to be aware that
+ * the shadow page table may be a PAE or a long mode page table.
+ */
+ pm_mask = PT_PRESENT_MASK | shadow_me_value;
+ if (mmu->root_role.level >= PT64_ROOT_4LEVEL) {
+ pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
+
+ if (WARN_ON_ONCE(!mmu->pml4_root)) {
+ r = -EIO;
+ goto out_unlock;
+ }
+ mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;
+
+ if (mmu->root_role.level == PT64_ROOT_5LEVEL) {
+ if (WARN_ON_ONCE(!mmu->pml5_root)) {
+ r = -EIO;
+ goto out_unlock;
+ }
+ mmu->pml5_root[0] = __pa(mmu->pml4_root) | pm_mask;
+ }
+ }
+
+ for (i = 0; i < 4; ++i) {
+ WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
+
+ if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
+ if (!(pdptrs[i] & PT_PRESENT_MASK)) {
+ mmu->pae_root[i] = INVALID_PAE_ROOT;
+ continue;
+ }
+ root_gfn = pdptrs[i] >> PAGE_SHIFT;
+ }
+
+ /*
+ * If shadowing 32-bit non-PAE page tables, each PAE page
+ * directory maps one quarter of the guest's non-PAE page
+		 * directory. Otherwise each PAE page directory shadows one
+		 * guest PAE page directory, so quadrant should be 0.
+ */
+ quadrant = (mmu->cpu_role.base.level == PT32_ROOT_LEVEL) ? i : 0;
+
+ root = mmu_alloc_root(vcpu, root_gfn, quadrant, PT32_ROOT_LEVEL);
+ mmu->pae_root[i] = root | pm_mask;
+ }
+
+ if (mmu->root_role.level == PT64_ROOT_5LEVEL)
+ mmu->root.hpa = __pa(mmu->pml5_root);
+ else if (mmu->root_role.level == PT64_ROOT_4LEVEL)
+ mmu->root.hpa = __pa(mmu->pml4_root);
+ else
+ mmu->root.hpa = __pa(mmu->pae_root);
+
+set_root_pgd:
+ mmu->root.pgd = root_pgd;
+out_unlock:
+ write_unlock(&vcpu->kvm->mmu_lock);
+
+ return r;
+}
+
+static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ bool need_pml5 = mmu->root_role.level > PT64_ROOT_4LEVEL;
+ u64 *pml5_root = NULL;
+ u64 *pml4_root = NULL;
+ u64 *pae_root;
+
+ /*
+ * When shadowing 32-bit or PAE NPT with 64-bit NPT, the PML4 and PDP
+ * tables are allocated and initialized at root creation as there is no
+ * equivalent level in the guest's NPT to shadow. Allocate the tables
+ * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
+ */
+ if (mmu->root_role.direct ||
+ mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
+ mmu->root_role.level < PT64_ROOT_4LEVEL)
+ return 0;
+
+ /*
+ * NPT, the only paging mode that uses this horror, uses a fixed number
+ * of levels for the shadow page tables, e.g. all MMUs are 4-level or
+	 * all MMUs are 5-level. Thus, this can safely require that pml5_root
+ * is allocated if the other roots are valid and pml5 is needed, as any
+ * prior MMU would also have required pml5.
+ */
+ if (mmu->pae_root && mmu->pml4_root && (!need_pml5 || mmu->pml5_root))
+ return 0;
+
+ /*
+ * The special roots should always be allocated in concert. Yell and
+ * bail if KVM ends up in a state where only one of the roots is valid.
+ */
+ if (WARN_ON_ONCE(!tdp_enabled || mmu->pae_root || mmu->pml4_root ||
+ (need_pml5 && mmu->pml5_root)))
+ return -EIO;
+
+ /*
+ * Unlike 32-bit NPT, the PDP table doesn't need to be in low mem, and
+ * doesn't need to be decrypted.
+ */
+ pae_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (!pae_root)
+ return -ENOMEM;
+
+#ifdef CONFIG_X86_64
+ pml4_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (!pml4_root)
+ goto err_pml4;
+
+ if (need_pml5) {
+ pml5_root = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (!pml5_root)
+ goto err_pml5;
+ }
+#endif
+
+ mmu->pae_root = pae_root;
+ mmu->pml4_root = pml4_root;
+ mmu->pml5_root = pml5_root;
+
+ return 0;
+
+#ifdef CONFIG_X86_64
+err_pml5:
+ free_page((unsigned long)pml4_root);
+err_pml4:
+ free_page((unsigned long)pae_root);
+ return -ENOMEM;
+#endif
+}
+
+static bool is_unsync_root(hpa_t root)
+{
+ struct kvm_mmu_page *sp;
+
+ if (!VALID_PAGE(root) || kvm_mmu_is_dummy_root(root))
+ return false;
+
+ /*
+ * The read barrier orders the CPU's read of SPTE.W during the page table
+ * walk before the reads of sp->unsync/sp->unsync_children here.
+ *
+ * Even if another CPU was marking the SP as unsync-ed simultaneously,
+ * any guest page table changes are not guaranteed to be visible anyway
+ * until this VCPU issues a TLB flush strictly after those changes are
+ * made. We only need to ensure that the other CPU sets these flags
+ * before any actual changes to the page tables are made. The comments
+ * in mmu_try_to_unsync_pages() describe what could go wrong if this
+ * requirement isn't satisfied.
+ */
+ smp_rmb();
+ sp = root_to_sp(root);
+
+ /*
+	 * PAE roots (somewhat arbitrarily) aren't backed by shadow pages; the
+ * PDPTEs for a given PAE root need to be synchronized individually.
+ */
+ if (WARN_ON_ONCE(!sp))
+ return false;
+
+ if (sp->unsync || sp->unsync_children)
+ return true;
+
+ return false;
+}
+
+void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
+{
+ int i;
+ struct kvm_mmu_page *sp;
+
+ if (vcpu->arch.mmu->root_role.direct)
+ return;
+
+ if (!VALID_PAGE(vcpu->arch.mmu->root.hpa))
+ return;
+
+ vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
+
+ if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
+ hpa_t root = vcpu->arch.mmu->root.hpa;
+
+ if (!is_unsync_root(root))
+ return;
+
+ sp = root_to_sp(root);
+
+ write_lock(&vcpu->kvm->mmu_lock);
+ mmu_sync_children(vcpu, sp, true);
+ write_unlock(&vcpu->kvm->mmu_lock);
+ return;
+ }
+
+ write_lock(&vcpu->kvm->mmu_lock);
+
+ for (i = 0; i < 4; ++i) {
+ hpa_t root = vcpu->arch.mmu->pae_root[i];
+
+ if (IS_VALID_PAE_ROOT(root)) {
+ sp = spte_to_child_sp(root);
+ mmu_sync_children(vcpu, sp, true);
+ }
+ }
+
+ write_unlock(&vcpu->kvm->mmu_lock);
+}
+
+void kvm_mmu_sync_prev_roots(struct kvm_vcpu *vcpu)
+{
+ unsigned long roots_to_free = 0;
+ int i;
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+ if (is_unsync_root(vcpu->arch.mmu->prev_roots[i].hpa))
+ roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
+
+ /* sync prev_roots by simply freeing them */
+ kvm_mmu_free_roots(vcpu->kvm, vcpu->arch.mmu, roots_to_free);
+}
+
+static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+ gpa_t vaddr, u64 access,
+ struct x86_exception *exception)
+{
+ if (exception)
+ exception->error_code = 0;
+ return kvm_translate_gpa(vcpu, mmu, vaddr, access, exception);
+}
+
+static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
+{
+ /*
+ * A nested guest cannot use the MMIO cache if it is using nested
+ * page tables, because cr2 is a nGPA while the cache stores GPAs.
+ */
+ if (mmu_is_nested(vcpu))
+ return false;
+
+ if (direct)
+ return vcpu_match_mmio_gpa(vcpu, addr);
+
+ return vcpu_match_mmio_gva(vcpu, addr);
+}
+
+/*
+ * Return the level of the lowest level SPTE added to sptes.
+ * That SPTE may be non-present.
+ *
+ * Must be called between walk_shadow_page_lockless_{begin,end}.
+ */
+static int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes, int *root_level)
+{
+ struct kvm_shadow_walk_iterator iterator;
+ int leaf = -1;
+ u64 spte;
+
+ for (shadow_walk_init(&iterator, vcpu, addr),
+ *root_level = iterator.level;
+ shadow_walk_okay(&iterator);
+ __shadow_walk_next(&iterator, spte)) {
+ leaf = iterator.level;
+ spte = mmu_spte_get_lockless(iterator.sptep);
+
+ sptes[leaf] = spte;
+ }
+
+ return leaf;
+}
+
+static int get_sptes_lockless(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ int *root_level)
+{
+ int leaf;
+
+ walk_shadow_page_lockless_begin(vcpu);
+
+ if (is_tdp_mmu_active(vcpu))
+ leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, root_level);
+ else
+ leaf = get_walk(vcpu, addr, sptes, root_level);
+
+ walk_shadow_page_lockless_end(vcpu);
+ return leaf;
+}
+
+/* return true if reserved bit(s) are detected on a valid, non-MMIO SPTE. */
+static bool get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr, u64 *sptep)
+{
+ u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
+ struct rsvd_bits_validate *rsvd_check;
+ int root, leaf, level;
+ bool reserved = false;
+
+ leaf = get_sptes_lockless(vcpu, addr, sptes, &root);
+ if (unlikely(leaf < 0)) {
+ *sptep = 0ull;
+ return reserved;
+ }
+
+ *sptep = sptes[leaf];
+
+ /*
+ * Skip reserved bits checks on the terminal leaf if it's not a valid
+ * SPTE. Note, this also (intentionally) skips MMIO SPTEs, which, by
+ * design, always have reserved bits set. The purpose of the checks is
+	 * to detect reserved bits on non-MMIO SPTEs, i.e. buggy SPTEs.
+ */
+ if (!is_shadow_present_pte(sptes[leaf]))
+ leaf++;
+
+ rsvd_check = &vcpu->arch.mmu->shadow_zero_check;
+
+ for (level = root; level >= leaf; level--)
+ reserved |= is_rsvd_spte(rsvd_check, sptes[level], level);
+
+ if (reserved) {
+ pr_err("%s: reserved bits set on MMU-present spte, addr 0x%llx, hierarchy:\n",
+ __func__, addr);
+ for (level = root; level >= leaf; level--)
+ pr_err("------ spte = 0x%llx level = %d, rsvd bits = 0x%llx",
+ sptes[level], level,
+ get_rsvd_bits(rsvd_check, sptes[level], level));
+ }
+
+ return reserved;
+}
+
+static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
+{
+ u64 spte;
+ bool reserved;
+
+ if (mmio_info_in_cache(vcpu, addr, direct))
+ return RET_PF_EMULATE;
+
+ reserved = get_mmio_spte(vcpu, addr, &spte);
+ if (WARN_ON_ONCE(reserved))
+ return -EINVAL;
+
+ if (is_mmio_spte(vcpu->kvm, spte)) {
+ gfn_t gfn = get_mmio_spte_gfn(spte);
+ unsigned int access = get_mmio_spte_access(spte);
+
+ if (!check_mmio_spte(vcpu, spte))
+ return RET_PF_INVALID;
+
+ if (direct)
+ addr = 0;
+
+ trace_handle_mmio_page_fault(addr, gfn, access);
+ vcpu_cache_mmio_info(vcpu, addr, gfn, access);
+ return RET_PF_EMULATE;
+ }
+
+ /*
+	 * If the page table is zapped by another CPU, let the vCPU fault again on
+ * the address.
+ */
+ return RET_PF_RETRY;
+}
+
+static bool page_fault_handle_page_track(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ if (unlikely(fault->rsvd))
+ return false;
+
+ if (!fault->present || !fault->write)
+ return false;
+
+ /*
+	 * The guest is writing a page that is write-tracked, which cannot
+	 * be fixed by the page fault handler.
+ */
+ if (kvm_gfn_is_write_tracked(vcpu->kvm, fault->slot, fault->gfn))
+ return true;
+
+ return false;
+}
+
+static void shadow_page_table_clear_flood(struct kvm_vcpu *vcpu, gva_t addr)
+{
+ struct kvm_shadow_walk_iterator iterator;
+ u64 spte;
+
+ walk_shadow_page_lockless_begin(vcpu);
+ for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
+ clear_sp_write_flooding_count(iterator.sptep);
+ walk_shadow_page_lockless_end(vcpu);
+}
+
+static u32 alloc_apf_token(struct kvm_vcpu *vcpu)
+{
+ /* make sure the token value is not 0 */
+ u32 id = vcpu->arch.apf.id;
+
+ if (id << 12 == 0)
+ vcpu->arch.apf.id = 1;
+
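+	/*
+	 * Token layout: bits 31:12 hold the per-vCPU allocation counter and
+	 * bits 11:0 hold the vcpu_id, which keeps tokens unique across vCPUs
+	 * assuming vcpu_id fits in 12 bits.
+	 */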
+ return (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
+}
+
+static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ struct kvm_arch_async_pf arch;
+
+ arch.token = alloc_apf_token(vcpu);
+ arch.gfn = fault->gfn;
+ arch.error_code = fault->error_code;
+ arch.direct_map = vcpu->arch.mmu->root_role.direct;
+ arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
+
+ return kvm_setup_async_pf(vcpu, fault->addr,
+ kvm_vcpu_gfn_to_hva(vcpu, fault->gfn), &arch);
+}
+
+void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
+{
+ int r;
+
+ if (WARN_ON_ONCE(work->arch.error_code & PFERR_PRIVATE_ACCESS))
+ return;
+
+ if ((vcpu->arch.mmu->root_role.direct != work->arch.direct_map) ||
+ work->wakeup_all)
+ return;
+
+ r = kvm_mmu_reload(vcpu);
+ if (unlikely(r))
+ return;
+
+ if (!vcpu->arch.mmu->root_role.direct &&
+ work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
+ return;
+
+ r = kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, work->arch.error_code,
+ true, NULL, NULL);
+
+ /*
+ * Account fixed page faults, otherwise they'll never be counted, but
+	 * ignore stats for all other return codes. Page-ready "faults" aren't
+	 * truly spurious and never trigger emulation.
+ */
+ if (r == RET_PF_FIXED)
+ vcpu->stat.pf_fixed++;
+}
+
+static void kvm_mmu_finish_page_fault(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault, int r)
+{
+ kvm_release_faultin_page(vcpu->kvm, fault->refcounted_page,
+ r == RET_PF_RETRY, fault->map_writable);
+}
+
+static int kvm_mmu_faultin_pfn_gmem(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ int max_order, r;
+
+ if (!kvm_slot_has_gmem(fault->slot)) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
+ r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
+ &fault->refcounted_page, &max_order);
+ if (r) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return r;
+ }
+
+ fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
+ fault->max_level = kvm_max_level_for_order(max_order);
+
+ return RET_PF_CONTINUE;
+}
+
+static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ unsigned int foll = fault->write ? FOLL_WRITE : 0;
+
+ if (fault->is_private || kvm_memslot_is_gmem_only(fault->slot))
+ return kvm_mmu_faultin_pfn_gmem(vcpu, fault);
+
+ foll |= FOLL_NOWAIT;
+ fault->pfn = __kvm_faultin_pfn(fault->slot, fault->gfn, foll,
+ &fault->map_writable, &fault->refcounted_page);
+
+ /*
+ * If resolving the page failed because I/O is needed to fault-in the
+ * page, then either set up an asynchronous #PF to do the I/O, or if
+ * doing an async #PF isn't possible, retry with I/O allowed. All
+ * other failures are terminal, i.e. retrying won't help.
+ */
+ if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
+ return RET_PF_CONTINUE;
+
+ if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
+ trace_kvm_try_async_get_page(fault->addr, fault->gfn);
+ if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
+ trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
+ kvm_make_request(KVM_REQ_APF_HALT, vcpu);
+ return RET_PF_RETRY;
+ } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
+ return RET_PF_RETRY;
+ }
+ }
+
+ /*
+ * Allow gup to bail on pending non-fatal signals when it's also allowed
+ * to wait for IO. Note, gup always bails if it is unable to quickly
+ * get a page and a fatal signal, i.e. SIGKILL, is pending.
+ */
+ foll |= FOLL_INTERRUPTIBLE;
+ foll &= ~FOLL_NOWAIT;
+ fault->pfn = __kvm_faultin_pfn(fault->slot, fault->gfn, foll,
+ &fault->map_writable, &fault->refcounted_page);
+
+ return RET_PF_CONTINUE;
+}
+
+static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault, unsigned int access)
+{
+ struct kvm_memory_slot *slot = fault->slot;
+ struct kvm *kvm = vcpu->kvm;
+ int ret;
+
+ if (KVM_BUG_ON(kvm_is_gfn_alias(kvm, fault->gfn), kvm))
+ return -EFAULT;
+
+ /*
+ * Note that the mmu_invalidate_seq also serves to detect a concurrent
+ * change in attributes. is_page_fault_stale() will detect an
+	 * invalidation related to fault->gfn and resume the guest without
+ * installing a mapping in the page tables.
+ */
+ fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
+ smp_rmb();
+
+ /*
+ * Now that we have a snapshot of mmu_invalidate_seq we can check for a
+ * private vs. shared mismatch.
+ */
+ if (fault->is_private != kvm_mem_is_private(kvm, fault->gfn)) {
+ kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+ return -EFAULT;
+ }
+
+ if (unlikely(!slot))
+ return kvm_handle_noslot_fault(vcpu, fault, access);
+
+ /*
+ * Retry the page fault if the gfn hit a memslot that is being deleted
+ * or moved. This ensures any existing SPTEs for the old memslot will
+ * be zapped before KVM inserts a new MMIO SPTE for the gfn. Punt the
+ * error to userspace if this is a prefault, as KVM's prefaulting ABI
+ * doesn't provide the same forward progress guarantees as KVM_RUN.
+ */
+ if (slot->flags & KVM_MEMSLOT_INVALID) {
+ if (fault->prefetch)
+ return -EAGAIN;
+
+ return RET_PF_RETRY;
+ }
+
+ if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
+ /*
+ * Don't map L1's APIC access page into L2, KVM doesn't support
+ * using APICv/AVIC to accelerate L2 accesses to L1's APIC,
+ * i.e. the access needs to be emulated. Emulating access to
+ * L1's APIC is also correct if L1 is accelerating L2's own
+ * virtual APIC, but for some reason L1 also maps _L1's_ APIC
+ * into L2. Note, vcpu_is_mmio_gpa() always treats access to
+ * the APIC as MMIO. Allow an MMIO SPTE to be created, as KVM
+ * uses different roots for L1 vs. L2, i.e. there is no danger
+ * of breaking APICv/AVIC for L1.
+ */
+ if (is_guest_mode(vcpu))
+ return kvm_handle_noslot_fault(vcpu, fault, access);
+
+ /*
+ * If the APIC access page exists but is disabled, go directly
+ * to emulation without caching the MMIO access or creating a
+ * MMIO SPTE. That way the cache doesn't need to be purged
+ * when the AVIC is re-enabled.
+ */
+ if (!kvm_apicv_activated(vcpu->kvm))
+ return RET_PF_EMULATE;
+ }
+
+ /*
+ * Check for a relevant mmu_notifier invalidation event before getting
+ * the pfn from the primary MMU, and before acquiring mmu_lock.
+ *
+ * For mmu_lock, if there is an in-progress invalidation and the kernel
+ * allows preemption, the invalidation task may drop mmu_lock and yield
+ * in response to mmu_lock being contended, which is *very* counter-
+ * productive as this vCPU can't actually make forward progress until
+ * the invalidation completes.
+ *
+	 * Retrying now can also avoid unnecessary lock contention in the primary
+ * MMU, as the primary MMU doesn't necessarily hold a single lock for
+ * the duration of the invalidation, i.e. faulting in a conflicting pfn
+ * can cause the invalidation to take longer by holding locks that are
+ * needed to complete the invalidation.
+ *
+	 * Do the pre-check even for non-preemptible kernels, i.e. even if KVM
+ * will never yield mmu_lock in response to contention, as this vCPU is
+ * *guaranteed* to need to retry, i.e. waiting until mmu_lock is held
+ * to detect retry guarantees the worst case latency for the vCPU.
+ */
+ if (mmu_invalidate_retry_gfn_unsafe(kvm, fault->mmu_seq, fault->gfn))
+ return RET_PF_RETRY;
+
+ ret = __kvm_mmu_faultin_pfn(vcpu, fault);
+ if (ret != RET_PF_CONTINUE)
+ return ret;
+
+ if (unlikely(is_error_pfn(fault->pfn)))
+ return kvm_handle_error_pfn(vcpu, fault);
+
+ if (WARN_ON_ONCE(!fault->slot || is_noslot_pfn(fault->pfn)))
+ return kvm_handle_noslot_fault(vcpu, fault, access);
+
+ /*
+ * Check again for a relevant mmu_notifier invalidation event purely to
+ * avoid contending mmu_lock. Most invalidations will be detected by
+ * the previous check, but checking is extremely cheap relative to the
+ * overall cost of failing to detect the invalidation until after
+ * mmu_lock is acquired.
+ */
+ if (mmu_invalidate_retry_gfn_unsafe(kvm, fault->mmu_seq, fault->gfn)) {
+ kvm_mmu_finish_page_fault(vcpu, fault, RET_PF_RETRY);
+ return RET_PF_RETRY;
+ }
+
+ return RET_PF_CONTINUE;
+}
+
+/*
+ * Returns true if the page fault is stale and needs to be retried, i.e. if the
+ * root was invalidated by a memslot update or a relevant mmu_notifier fired.
+ */
+static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ struct kvm_mmu_page *sp = root_to_sp(vcpu->arch.mmu->root.hpa);
+
+ /* Special roots, e.g. pae_root, are not backed by shadow pages. */
+ if (sp && is_obsolete_sp(vcpu->kvm, sp))
+ return true;
+
+ /*
+ * Roots without an associated shadow page are considered invalid if
+ * there is a pending request to free obsolete roots. The request is
+ * only a hint that the current root _may_ be obsolete and needs to be
+ * reloaded, e.g. if the guest frees a PGD that KVM is tracking as a
+ * previous root, then __kvm_mmu_prepare_zap_page() signals all vCPUs
+ * to reload even if no vCPU is actively using the root.
+ */
+ if (!sp && kvm_test_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu))
+ return true;
+
+ /*
+ * Check for a relevant mmu_notifier invalidation event one last time
+ * now that mmu_lock is held, as the "unsafe" checks performed without
+ * holding mmu_lock can get false negatives.
+ */
+ return fault->slot &&
+ mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
+}
+
+static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ int r;
+
+ /* Dummy roots are used only for shadowing bad guest roots. */
+ if (WARN_ON_ONCE(kvm_mmu_is_dummy_root(vcpu->arch.mmu->root.hpa)))
+ return RET_PF_RETRY;
+
+ if (page_fault_handle_page_track(vcpu, fault))
+ return RET_PF_WRITE_PROTECTED;
+
+ r = fast_page_fault(vcpu, fault);
+ if (r != RET_PF_INVALID)
+ return r;
+
+ r = mmu_topup_memory_caches(vcpu, false);
+ if (r)
+ return r;
+
+ r = kvm_mmu_faultin_pfn(vcpu, fault, ACC_ALL);
+ if (r != RET_PF_CONTINUE)
+ return r;
+
+ r = RET_PF_RETRY;
+ write_lock(&vcpu->kvm->mmu_lock);
+
+ if (is_page_fault_stale(vcpu, fault))
+ goto out_unlock;
+
+ r = make_mmu_pages_available(vcpu);
+ if (r)
+ goto out_unlock;
+
+ r = direct_map(vcpu, fault);
+
+out_unlock:
+ kvm_mmu_finish_page_fault(vcpu, fault, r);
+ write_unlock(&vcpu->kvm->mmu_lock);
+ return r;
+}
+
+static int nonpaging_page_fault(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+	/* This path builds a PAE page table; 2MB pages are the maximum mappable size. */
+ fault->max_level = PG_LEVEL_2M;
+ return direct_page_fault(vcpu, fault);
+}
+
+int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
+ u64 fault_address, char *insn, int insn_len)
+{
+ int r = 1;
+ u32 flags = vcpu->arch.apf.host_apf_flags;
+
+#ifndef CONFIG_X86_64
+ /* A 64-bit CR2 should be impossible on 32-bit KVM. */
+ if (WARN_ON_ONCE(fault_address >> 32))
+ return -EFAULT;
+#endif
+ /*
+	 * Legacy #PF exceptions only have a 32-bit error code. Simply drop the
+ * upper bits as KVM doesn't use them for #PF (because they are never
+ * set), and to ensure there are no collisions with KVM-defined bits.
+ */
+ if (WARN_ON_ONCE(error_code >> 32))
+ error_code = lower_32_bits(error_code);
+
+ /*
+ * Restrict KVM-defined flags to bits 63:32 so that it's impossible for
+ * them to conflict with #PF error codes, which are limited to 32 bits.
+ */
+ BUILD_BUG_ON(lower_32_bits(PFERR_SYNTHETIC_MASK));
+
+ kvm_request_l1tf_flush_l1d();
+ if (!flags) {
+ trace_kvm_page_fault(vcpu, fault_address, error_code);
+
+ r = kvm_mmu_page_fault(vcpu, fault_address, error_code, insn,
+ insn_len);
+ } else if (flags & KVM_PV_REASON_PAGE_NOT_PRESENT) {
+ vcpu->arch.apf.host_apf_flags = 0;
+ local_irq_disable();
+ kvm_async_pf_task_wait_schedule(fault_address);
+ local_irq_enable();
+ } else {
+ WARN_ONCE(1, "Unexpected host async PF flags: %x\n", flags);
+ }
+
+ return r;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_handle_page_fault);
+
+#ifdef CONFIG_X86_64
+static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ int r;
+
+ if (page_fault_handle_page_track(vcpu, fault))
+ return RET_PF_WRITE_PROTECTED;
+
+ r = fast_page_fault(vcpu, fault);
+ if (r != RET_PF_INVALID)
+ return r;
+
+ r = mmu_topup_memory_caches(vcpu, false);
+ if (r)
+ return r;
+
+ r = kvm_mmu_faultin_pfn(vcpu, fault, ACC_ALL);
+ if (r != RET_PF_CONTINUE)
+ return r;
+
+ r = RET_PF_RETRY;
+ read_lock(&vcpu->kvm->mmu_lock);
+
+ if (is_page_fault_stale(vcpu, fault))
+ goto out_unlock;
+
+ r = kvm_tdp_mmu_map(vcpu, fault);
+
+out_unlock:
+ kvm_mmu_finish_page_fault(vcpu, fault, r);
+ read_unlock(&vcpu->kvm->mmu_lock);
+ return r;
+}
+#endif
+
+int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+#ifdef CONFIG_X86_64
+ if (tdp_mmu_enabled)
+ return kvm_tdp_mmu_page_fault(vcpu, fault);
+#endif
+
+ return direct_page_fault(vcpu, fault);
+}
+
+static int kvm_tdp_page_prefault(struct kvm_vcpu *vcpu, gpa_t gpa,
+ u64 error_code, u8 *level)
+{
+ int r;
+
+ /*
+ * Restrict to TDP page fault, since that's the only case where the MMU
+ * is indexed by GPA.
+ */
+ if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
+ return -EOPNOTSUPP;
+
+ do {
+ if (signal_pending(current))
+ return -EINTR;
+
+ if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+ return -EIO;
+
+ cond_resched();
+ r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
+ } while (r == RET_PF_RETRY);
+
+ if (r < 0)
+ return r;
+
+ switch (r) {
+ case RET_PF_FIXED:
+ case RET_PF_SPURIOUS:
+ case RET_PF_WRITE_PROTECTED:
+ return 0;
+
+ case RET_PF_EMULATE:
+ return -ENOENT;
+
+ case RET_PF_RETRY:
+ case RET_PF_CONTINUE:
+ case RET_PF_INVALID:
+ default:
+ WARN_ONCE(1, "could not fix page fault during prefault");
+ return -EIO;
+ }
+}
+
+long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
+ struct kvm_pre_fault_memory *range)
+{
+ u64 error_code = PFERR_GUEST_FINAL_MASK;
+ u8 level = PG_LEVEL_4K;
+ u64 direct_bits;
+ u64 end;
+ int r;
+
+ if (!vcpu->kvm->arch.pre_fault_allowed)
+ return -EOPNOTSUPP;
+
+ if (kvm_is_gfn_alias(vcpu->kvm, gpa_to_gfn(range->gpa)))
+ return -EINVAL;
+
+ /*
+ * reload is efficient when called repeatedly, so we can do it on
+ * every iteration.
+ */
+ r = kvm_mmu_reload(vcpu);
+ if (r)
+ return r;
+
+ direct_bits = 0;
+ if (kvm_arch_has_private_mem(vcpu->kvm) &&
+ kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
+ error_code |= PFERR_PRIVATE_ACCESS;
+ else
+ direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
+
+ /*
+	 * Shadow paging uses GVAs for KVM page faults, so restrict to
+ * two-dimensional paging.
+ */
+ r = kvm_tdp_page_prefault(vcpu, range->gpa | direct_bits, error_code, &level);
+ if (r < 0)
+ return r;
+
+ /*
+	 * If the mapping that covers range->gpa can use a huge page, it may
+	 * start below range->gpa or end beyond range->gpa + range->size.
+ */
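+	/* E.g. for a 2M mapping, gpa = 0x12345000 yields end = 0x12400000. */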
+ end = (range->gpa & KVM_HPAGE_MASK(level)) + KVM_HPAGE_SIZE(level);
+ return min(range->size, end - range->gpa);
+}
+
+#ifdef CONFIG_KVM_GUEST_MEMFD
+static void kvm_assert_gmem_invalidate_lock_held(struct kvm_memory_slot *slot)
+{
+#ifdef CONFIG_PROVE_LOCKING
+ if (WARN_ON_ONCE(!kvm_slot_has_gmem(slot)) ||
+ WARN_ON_ONCE(!slot->gmem.file) ||
+ WARN_ON_ONCE(!file_count(slot->gmem.file)))
+ return;
+
+ lockdep_assert_held(&file_inode(slot->gmem.file)->i_mapping->invalidate_lock);
+#endif
+}
+
+int kvm_tdp_mmu_map_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
+{
+ struct kvm_page_fault fault = {
+ .addr = gfn_to_gpa(gfn),
+ .error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS,
+ .prefetch = true,
+ .is_tdp = true,
+ .nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
+
+ .max_level = PG_LEVEL_4K,
+ .req_level = PG_LEVEL_4K,
+ .goal_level = PG_LEVEL_4K,
+ .is_private = true,
+
+ .gfn = gfn,
+ .slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn),
+ .pfn = pfn,
+ .map_writable = true,
+ };
+ struct kvm *kvm = vcpu->kvm;
+ int r;
+
+ lockdep_assert_held(&kvm->slots_lock);
+
+ /*
+ * Mapping a pre-determined private pfn is intended only for use when
+ * populating a guest_memfd instance. Assert that the slot is backed
+ * by guest_memfd and that the gmem instance's invalidate_lock is held.
+ */
+ kvm_assert_gmem_invalidate_lock_held(fault.slot);
+
+ if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
+ return -EIO;
+
+ if (kvm_gfn_is_write_tracked(kvm, fault.slot, fault.gfn))
+ return -EPERM;
+
+ r = kvm_mmu_reload(vcpu);
+ if (r)
+ return r;
+
+ r = mmu_topup_memory_caches(vcpu, false);
+ if (r)
+ return r;
+
+ do {
+ if (signal_pending(current))
+ return -EINTR;
+
+ if (kvm_test_request(KVM_REQ_VM_DEAD, vcpu))
+ return -EIO;
+
+ cond_resched();
+
+ guard(read_lock)(&kvm->mmu_lock);
+
+ r = kvm_tdp_mmu_map(vcpu, &fault);
+ } while (r == RET_PF_RETRY);
+
+ if (r != RET_PF_FIXED)
+ return -EIO;
+
+ return 0;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_tdp_mmu_map_private_pfn);
+#endif
+
+static void nonpaging_init_context(struct kvm_mmu *context)
+{
+ context->page_fault = nonpaging_page_fault;
+ context->gva_to_gpa = nonpaging_gva_to_gpa;
+ context->sync_spte = NULL;
+}
+
+static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
+ union kvm_mmu_page_role role)
+{
+ struct kvm_mmu_page *sp;
+
+ if (!VALID_PAGE(root->hpa))
+ return false;
+
+ if (!role.direct && pgd != root->pgd)
+ return false;
+
+ sp = root_to_sp(root->hpa);
+ if (WARN_ON_ONCE(!sp))
+ return false;
+
+ return role.word == sp->role.word;
+}
+
+/*
+ * Find out if a previously cached root matching the new pgd/role is available,
+ * and insert the current root as the MRU in the cache.
+ * If a matching root is found, it is assigned to kvm_mmu->root and
+ * true is returned.
+ * If no match is found, kvm_mmu->root is left invalid, the LRU root is
+ * evicted to make room for the current root, and false is returned.
+ */
+static bool cached_root_find_and_keep_current(struct kvm *kvm, struct kvm_mmu *mmu,
+ gpa_t new_pgd,
+ union kvm_mmu_page_role new_role)
+{
+ uint i;
+
+ if (is_root_usable(&mmu->root, new_pgd, new_role))
+ return true;
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+ /*
+ * The swaps end up rotating the cache like this:
+ * C 0 1 2 3 (on entry to the function)
+ * 0 C 1 2 3
+ * 1 C 0 2 3
+ * 2 C 0 1 3
+ * 3 C 0 1 2 (on exit from the loop)
+ */
+ swap(mmu->root, mmu->prev_roots[i]);
+ if (is_root_usable(&mmu->root, new_pgd, new_role))
+ return true;
+ }
+
+ kvm_mmu_free_roots(kvm, mmu, KVM_MMU_ROOT_CURRENT);
+ return false;
+}
+
+/*
+ * Find out if a previously cached root matching the new pgd/role is available.
+ * On entry, mmu->root is invalid.
+ * If a matching root is found, it is assigned to kvm_mmu->root, the LRU entry
+ * of the cache becomes invalid, and true is returned.
+ * If no match is found, kvm_mmu->root is left invalid and false is returned.
+ */
+static bool cached_root_find_without_current(struct kvm *kvm, struct kvm_mmu *mmu,
+ gpa_t new_pgd,
+ union kvm_mmu_page_role new_role)
+{
+ uint i;
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+ if (is_root_usable(&mmu->prev_roots[i], new_pgd, new_role))
+ goto hit;
+
+ return false;
+
+hit:
+ swap(mmu->root, mmu->prev_roots[i]);
+ /* Bubble up the remaining roots. */
+ for (; i < KVM_MMU_NUM_PREV_ROOTS - 1; i++)
+ mmu->prev_roots[i] = mmu->prev_roots[i + 1];
+ mmu->prev_roots[i].hpa = INVALID_PAGE;
+ return true;
+}
+
+static bool fast_pgd_switch(struct kvm *kvm, struct kvm_mmu *mmu,
+ gpa_t new_pgd, union kvm_mmu_page_role new_role)
+{
+ /*
+ * Limit reuse to 64-bit hosts+VMs without "special" roots in order to
+ * avoid having to deal with PDPTEs and other complexities.
+ */
+ if (VALID_PAGE(mmu->root.hpa) && !root_to_sp(mmu->root.hpa))
+ kvm_mmu_free_roots(kvm, mmu, KVM_MMU_ROOT_CURRENT);
+
+ if (VALID_PAGE(mmu->root.hpa))
+ return cached_root_find_and_keep_current(kvm, mmu, new_pgd, new_role);
+ else
+ return cached_root_find_without_current(kvm, mmu, new_pgd, new_role);
+}
+
+void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
+{
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ union kvm_mmu_page_role new_role = mmu->root_role;
+
+ /*
+ * Return immediately if no usable root was found, kvm_mmu_reload()
+ * will establish a valid root prior to the next VM-Enter.
+ */
+ if (!fast_pgd_switch(vcpu->kvm, mmu, new_pgd, new_role))
+ return;
+
+ /*
+ * It's possible that the cached previous root page is obsolete because
+ * of a change in the MMU generation number. However, changing the
+ * generation number is accompanied by KVM_REQ_MMU_FREE_OBSOLETE_ROOTS,
+ * which will free the root set here and allocate a new one.
+ */
+ kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
+
+ if (force_flush_and_sync_on_reuse) {
+ kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
+ kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
+ }
+
+ /*
+ * The last MMIO access's GVA and GPA are cached in the VCPU. When
+ * switching to a new CR3, that GVA->GPA mapping may no longer be
+ * valid. So clear any cached MMIO info even when we don't need to sync
+ * the shadow page tables.
+ */
+ vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
+
+ /*
+ * If this is a direct root page, it doesn't have a write flooding
+ * count. Otherwise, clear the write flooding count.
+ */
+ if (!new_role.direct) {
+ struct kvm_mmu_page *sp = root_to_sp(vcpu->arch.mmu->root.hpa);
+
+ if (!WARN_ON_ONCE(!sp))
+ __clear_sp_write_flooding_count(sp);
+ }
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_new_pgd);
+
+static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
+ unsigned int access)
+{
+ if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
+ if (gfn != get_mmio_spte_gfn(*sptep)) {
+ mmu_spte_clear_no_track(sptep);
+ return true;
+ }
+
+ mark_mmio_spte(vcpu, sptep, gfn, access);
+ return true;
+ }
+
+ return false;
+}
+
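+/*
+ * paging_tmpl.h acts as a C "template": each inclusion below generates a
+ * complete set of guest-walker and page-fault helpers specialized for one
+ * guest PTE format, with FNAME() expanding to the ept_*, paging64_* and
+ * paging32_* functions referenced elsewhere in this file.
+ */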
+#define PTTYPE_EPT 18 /* arbitrary */
+#define PTTYPE PTTYPE_EPT
+#include "paging_tmpl.h"
+#undef PTTYPE
+
+#define PTTYPE 64
+#include "paging_tmpl.h"
+#undef PTTYPE
+
+#define PTTYPE 32
+#include "paging_tmpl.h"
+#undef PTTYPE
+
+static void __reset_rsvds_bits_mask(struct rsvd_bits_validate *rsvd_check,
+ u64 pa_bits_rsvd, int level, bool nx,
+ bool gbpages, bool pse, bool amd)
+{
+ u64 gbpages_bit_rsvd = 0;
+ u64 nonleaf_bit8_rsvd = 0;
+ u64 high_bits_rsvd;
+
+ rsvd_check->bad_mt_xwr = 0;
+
+ if (!gbpages)
+ gbpages_bit_rsvd = rsvd_bits(7, 7);
+
+ if (level == PT32E_ROOT_LEVEL)
+ high_bits_rsvd = pa_bits_rsvd & rsvd_bits(0, 62);
+ else
+ high_bits_rsvd = pa_bits_rsvd & rsvd_bits(0, 51);
+
+ /* Note, NX doesn't exist in PDPTEs, this is handled below. */
+ if (!nx)
+ high_bits_rsvd |= rsvd_bits(63, 63);
+
+ /*
+ * Non-leaf PML4Es and PDPEs reserve bit 8 (which would be the G bit for
+ * leaf entries) on AMD CPUs only.
+ */
+ if (amd)
+ nonleaf_bit8_rsvd = rsvd_bits(8, 8);
+
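+	/*
+	 * rsvd_bits_mask[i][j] holds the reserved bits for a level j+1 entry,
+	 * where i is the value of the entry's bit 7 (the PS bit).  For 4K PTEs
+	 * bit 7 is PAT rather than PS, which is why [1][0] simply mirrors
+	 * [0][0] in every case below.
+	 */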
+ switch (level) {
+ case PT32_ROOT_LEVEL:
+		/* No rsvd bits for 2-level 4K page table entries. */
+ rsvd_check->rsvd_bits_mask[0][1] = 0;
+ rsvd_check->rsvd_bits_mask[0][0] = 0;
+ rsvd_check->rsvd_bits_mask[1][0] =
+ rsvd_check->rsvd_bits_mask[0][0];
+
+ if (!pse) {
+ rsvd_check->rsvd_bits_mask[1][1] = 0;
+ break;
+ }
+
+		if (is_cpuid_PSE36())
+			/* 36-bit PSE 4MB page */
+			rsvd_check->rsvd_bits_mask[1][1] = rsvd_bits(17, 21);
+		else
+			/* 32-bit PSE 4MB page */
+			rsvd_check->rsvd_bits_mask[1][1] = rsvd_bits(13, 21);
+ break;
+ case PT32E_ROOT_LEVEL:
+ rsvd_check->rsvd_bits_mask[0][2] = rsvd_bits(63, 63) |
+ high_bits_rsvd |
+ rsvd_bits(5, 8) |
+ rsvd_bits(1, 2); /* PDPTE */
+ rsvd_check->rsvd_bits_mask[0][1] = high_bits_rsvd; /* PDE */
+ rsvd_check->rsvd_bits_mask[0][0] = high_bits_rsvd; /* PTE */
+ rsvd_check->rsvd_bits_mask[1][1] = high_bits_rsvd |
+ rsvd_bits(13, 20); /* large page */
+ rsvd_check->rsvd_bits_mask[1][0] =
+ rsvd_check->rsvd_bits_mask[0][0];
+ break;
+ case PT64_ROOT_5LEVEL:
+ rsvd_check->rsvd_bits_mask[0][4] = high_bits_rsvd |
+ nonleaf_bit8_rsvd |
+ rsvd_bits(7, 7);
+ rsvd_check->rsvd_bits_mask[1][4] =
+ rsvd_check->rsvd_bits_mask[0][4];
+ fallthrough;
+ case PT64_ROOT_4LEVEL:
+ rsvd_check->rsvd_bits_mask[0][3] = high_bits_rsvd |
+ nonleaf_bit8_rsvd |
+ rsvd_bits(7, 7);
+ rsvd_check->rsvd_bits_mask[0][2] = high_bits_rsvd |
+ gbpages_bit_rsvd;
+ rsvd_check->rsvd_bits_mask[0][1] = high_bits_rsvd;
+ rsvd_check->rsvd_bits_mask[0][0] = high_bits_rsvd;
+ rsvd_check->rsvd_bits_mask[1][3] =
+ rsvd_check->rsvd_bits_mask[0][3];
+ rsvd_check->rsvd_bits_mask[1][2] = high_bits_rsvd |
+ gbpages_bit_rsvd |
+ rsvd_bits(13, 29);
+ rsvd_check->rsvd_bits_mask[1][1] = high_bits_rsvd |
+ rsvd_bits(13, 20); /* large page */
+ rsvd_check->rsvd_bits_mask[1][0] =
+ rsvd_check->rsvd_bits_mask[0][0];
+ break;
+ }
+}
+
+static void reset_guest_rsvds_bits_mask(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context)
+{
+ __reset_rsvds_bits_mask(&context->guest_rsvd_check,
+ vcpu->arch.reserved_gpa_bits,
+ context->cpu_role.base.level, is_efer_nx(context),
+ guest_cpu_cap_has(vcpu, X86_FEATURE_GBPAGES),
+ is_cr4_pse(context),
+ guest_cpuid_is_amd_compatible(vcpu));
+}
+
+static void __reset_rsvds_bits_mask_ept(struct rsvd_bits_validate *rsvd_check,
+ u64 pa_bits_rsvd, bool execonly,
+ int huge_page_level)
+{
+ u64 high_bits_rsvd = pa_bits_rsvd & rsvd_bits(0, 51);
+ u64 large_1g_rsvd = 0, large_2m_rsvd = 0;
+ u64 bad_mt_xwr;
+
+ if (huge_page_level < PG_LEVEL_1G)
+ large_1g_rsvd = rsvd_bits(7, 7);
+ if (huge_page_level < PG_LEVEL_2M)
+ large_2m_rsvd = rsvd_bits(7, 7);
+
+ rsvd_check->rsvd_bits_mask[0][4] = high_bits_rsvd | rsvd_bits(3, 7);
+ rsvd_check->rsvd_bits_mask[0][3] = high_bits_rsvd | rsvd_bits(3, 7);
+ rsvd_check->rsvd_bits_mask[0][2] = high_bits_rsvd | rsvd_bits(3, 6) | large_1g_rsvd;
+ rsvd_check->rsvd_bits_mask[0][1] = high_bits_rsvd | rsvd_bits(3, 6) | large_2m_rsvd;
+ rsvd_check->rsvd_bits_mask[0][0] = high_bits_rsvd;
+
+ /* large page */
+ rsvd_check->rsvd_bits_mask[1][4] = rsvd_check->rsvd_bits_mask[0][4];
+ rsvd_check->rsvd_bits_mask[1][3] = rsvd_check->rsvd_bits_mask[0][3];
+ rsvd_check->rsvd_bits_mask[1][2] = high_bits_rsvd | rsvd_bits(12, 29) | large_1g_rsvd;
+ rsvd_check->rsvd_bits_mask[1][1] = high_bits_rsvd | rsvd_bits(12, 20) | large_2m_rsvd;
+ rsvd_check->rsvd_bits_mask[1][0] = rsvd_check->rsvd_bits_mask[0][0];
+
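+	/*
+	 * bad_mt_xwr is a 64-bit bitmap indexed by the low 6 bits of an EPT
+	 * PTE (bits 2:0 = XWR permissions, bits 5:3 = memory type); a set bit
+	 * marks that combination as illegal.  Each 0xFF << (mt * 8) below
+	 * flags all eight XWR values for one memory type, and each
+	 * REPEAT_BYTE(1 << xwr) flags one XWR value for every memory type.
+	 */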
+ bad_mt_xwr = 0xFFull << (2 * 8); /* bits 3..5 must not be 2 */
+ bad_mt_xwr |= 0xFFull << (3 * 8); /* bits 3..5 must not be 3 */
+ bad_mt_xwr |= 0xFFull << (7 * 8); /* bits 3..5 must not be 7 */
+ bad_mt_xwr |= REPEAT_BYTE(1ull << 2); /* bits 0..2 must not be 010 */
+ bad_mt_xwr |= REPEAT_BYTE(1ull << 6); /* bits 0..2 must not be 110 */
+ if (!execonly) {
+ /* bits 0..2 must not be 100 unless VMX capabilities allow it */
+ bad_mt_xwr |= REPEAT_BYTE(1ull << 4);
+ }
+ rsvd_check->bad_mt_xwr = bad_mt_xwr;
+}
+
+static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context, bool execonly, int huge_page_level)
+{
+ __reset_rsvds_bits_mask_ept(&context->guest_rsvd_check,
+ vcpu->arch.reserved_gpa_bits, execonly,
+ huge_page_level);
+}
+
+static inline u64 reserved_hpa_bits(void)
+{
+ return rsvd_bits(kvm_host.maxphyaddr, 63);
+}
+
+/*
+ * The page table on the host is the shadow page table for the page table in
+ * the guest or an AMD nested guest; its MMU features completely follow the
+ * features in the guest.
+ */
+static void reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context)
+{
+	/* @amd adds a check on bit 8 of SPTEs, which KVM shouldn't use anyway. */
+ bool is_amd = true;
+ /* KVM doesn't use 2-level page tables for the shadow MMU. */
+ bool is_pse = false;
+ struct rsvd_bits_validate *shadow_zero_check;
+ int i;
+
+ WARN_ON_ONCE(context->root_role.level < PT32E_ROOT_LEVEL);
+
+ shadow_zero_check = &context->shadow_zero_check;
+ __reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
+ context->root_role.level,
+ context->root_role.efer_nx,
+ guest_cpu_cap_has(vcpu, X86_FEATURE_GBPAGES),
+ is_pse, is_amd);
+
+ if (!shadow_me_mask)
+ return;
+
+ for (i = context->root_role.level; --i >= 0;) {
+ /*
+ * So far shadow_me_value is a constant during KVM's life
+ * time. Bits in shadow_me_value are allowed to be set.
+ * Bits in shadow_me_mask but not in shadow_me_value are
+ * not allowed to be set.
+ */
+ shadow_zero_check->rsvd_bits_mask[0][i] |= shadow_me_mask;
+ shadow_zero_check->rsvd_bits_mask[1][i] |= shadow_me_mask;
+ shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_value;
+ shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_value;
+ }
+}
+
+static inline bool boot_cpu_is_amd(void)
+{
+ WARN_ON_ONCE(!tdp_enabled);
+ return shadow_x_mask == 0;
+}
+
+/*
+ * The direct page table on the host uses as many MMU features as possible;
+ * however, KVM currently does not do execution-protection.
+ */
+static void reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
+{
+ struct rsvd_bits_validate *shadow_zero_check;
+ int i;
+
+ shadow_zero_check = &context->shadow_zero_check;
+
+ if (boot_cpu_is_amd())
+ __reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
+ context->root_role.level, true,
+ boot_cpu_has(X86_FEATURE_GBPAGES),
+ false, true);
+ else
+ __reset_rsvds_bits_mask_ept(shadow_zero_check,
+ reserved_hpa_bits(), false,
+ max_huge_page_level);
+
+ if (!shadow_me_mask)
+ return;
+
+ for (i = context->root_role.level; --i >= 0;) {
+ shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
+ shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
+ }
+}
+
+/*
+ * Same as the comments in reset_shadow_zero_bits_mask(), except this is the
+ * shadow page table for an Intel nested guest.
+ */
+static void
+reset_ept_shadow_zero_bits_mask(struct kvm_mmu *context, bool execonly)
+{
+ __reset_rsvds_bits_mask_ept(&context->shadow_zero_check,
+ reserved_hpa_bits(), execonly,
+ max_huge_page_level);
+}
+
+#define BYTE_MASK(access) \
+ ((1 & (access) ? 2 : 0) | \
+ (2 & (access) ? 4 : 0) | \
+ (3 & (access) ? 8 : 0) | \
+ (4 & (access) ? 16 : 0) | \
+ (5 & (access) ? 32 : 0) | \
+ (6 & (access) ? 64 : 0) | \
+ (7 & (access) ? 128 : 0))
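+
+/*
+ * BYTE_MASK(a) sets bit i (for i = 1..7) iff UWX combination i includes any
+ * bit of @a; bit 0 (combination 000) is never set.  E.g. assuming
+ * ACC_EXEC_MASK == 1, ACC_WRITE_MASK == 2 and ACC_USER_MASK == 4, the masks
+ * computed below are x = 0xaa, w = 0xcc and u = 0xf0, i.e. one bit per UWX
+ * combination that includes the given right.
+ */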
+
+static void update_permission_bitmask(struct kvm_mmu *mmu, bool ept)
+{
+ unsigned byte;
+
+ const u8 x = BYTE_MASK(ACC_EXEC_MASK);
+ const u8 w = BYTE_MASK(ACC_WRITE_MASK);
+ const u8 u = BYTE_MASK(ACC_USER_MASK);
+
+ bool cr4_smep = is_cr4_smep(mmu);
+ bool cr4_smap = is_cr4_smap(mmu);
+ bool cr0_wp = is_cr0_wp(mmu);
+ bool efer_nx = is_efer_nx(mmu);
+
+ for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
+ unsigned pfec = byte << 1;
+
+ /*
+ * Each "*f" variable has a 1 bit for each UWX value
+ * that causes a fault with the given PFEC.
+ */
+
+ /* Faults from writes to non-writable pages */
+ u8 wf = (pfec & PFERR_WRITE_MASK) ? (u8)~w : 0;
+ /* Faults from user mode accesses to supervisor pages */
+ u8 uf = (pfec & PFERR_USER_MASK) ? (u8)~u : 0;
+		/* Faults from fetches of non-executable pages */
+ u8 ff = (pfec & PFERR_FETCH_MASK) ? (u8)~x : 0;
+ /* Faults from kernel mode fetches of user pages */
+ u8 smepf = 0;
+ /* Faults from kernel mode accesses of user pages */
+ u8 smapf = 0;
+
+ if (!ept) {
+ /* Faults from kernel mode accesses to user pages */
+ u8 kf = (pfec & PFERR_USER_MASK) ? 0 : u;
+
+ /* Not really needed: !nx will cause pte.nx to fault */
+ if (!efer_nx)
+ ff = 0;
+
+ /* Allow supervisor writes if !cr0.wp */
+ if (!cr0_wp)
+ wf = (pfec & PFERR_USER_MASK) ? wf : 0;
+
+ /* Disallow supervisor fetches of user code if cr4.smep */
+ if (cr4_smep)
+ smepf = (pfec & PFERR_FETCH_MASK) ? kf : 0;
+
+ /*
+ * SMAP:kernel-mode data accesses from user-mode
+ * mappings should fault. A fault is considered
+ * as a SMAP violation if all of the following
+ * conditions are true:
+ * - X86_CR4_SMAP is set in CR4
+ * - A user page is accessed
+ * - The access is not a fetch
+ * - The access is supervisor mode
+ * - If implicit supervisor access or X86_EFLAGS_AC is clear
+ *
+ * Here, we cover the first four conditions.
+ * The fifth is computed dynamically in permission_fault();
+ * PFERR_RSVD_MASK bit will be set in PFEC if the access is
+ * *not* subject to SMAP restrictions.
+ */
+ if (cr4_smap)
+ smapf = (pfec & (PFERR_RSVD_MASK|PFERR_FETCH_MASK)) ? 0 : kf;
+ }
+
+ mmu->permissions[byte] = ff | uf | wf | smepf | smapf;
+ }
+}
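+
+/*
+ * permission_fault() consumes the bitmap built above: roughly, it indexes
+ * permissions[] with the fault's PFEC shifted right by one (folding in
+ * PFERR_RSVD_MASK for accesses that are exempt from SMAP) and tests the bit
+ * selected by the pte's UWX access bits.
+ */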
+
+/*
+ * PKU is an additional mechanism by which the paging controls access to
+ * user-mode addresses based on the value in the PKRU register. Protection
+ * key violations are reported through a bit in the page fault error code.
+ * Unlike other bits of the error code, the PK bit is not known at the
+ * call site of e.g. gva_to_gpa; it must be computed directly in
+ * permission_fault based on two bits of PKRU, on some machine state (CR4,
+ * CR0, EFER, CPL), and on other bits of the error code and the page tables.
+ *
+ * In particular the following conditions come from the error code, the
+ * page tables and the machine state:
+ * - PK is always zero unless CR4.PKE=1 and EFER.LMA=1
+ * - PK is always zero if RSVD=1 (reserved bit set) or F=1 (instruction fetch)
+ * - PK is always zero if U=0 in the page tables
+ * - PKRU.WD is ignored if CR0.WP=0 and the access is a supervisor access.
+ *
+ * The PKRU bitmask caches the result of these four conditions. The error
+ * code (minus the P bit) and the page table's U bit form an index into the
+ * PKRU bitmask. Two bits of the PKRU bitmask are then extracted and ANDed
+ * with the two bits of the PKRU register corresponding to the protection key.
+ * For the first three conditions above the bits will be 00, thus masking
+ * away both AD and WD. For all reads or if the last condition holds, WD
+ * only will be masked away.
+ */
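+/*
+ * Worked example: for a user-mode write (PFEC.U = PFEC.W = 1) to a user page
+ * (i.e. the index bit that replaces PFEC.RSVD is set), check_pkey and
+ * check_write below are both true, so pkey_bits is 3 and both PKRU.AD and
+ * PKRU.WD are honored; for the same access with PFEC.F = 1 (a fetch),
+ * check_pkey is false and the protection key is ignored entirely.
+ */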
+static void update_pkru_bitmask(struct kvm_mmu *mmu)
+{
+ unsigned bit;
+ bool wp;
+
+ mmu->pkru_mask = 0;
+
+ if (!is_cr4_pke(mmu))
+ return;
+
+ wp = is_cr0_wp(mmu);
+
+ for (bit = 0; bit < ARRAY_SIZE(mmu->permissions); ++bit) {
+ unsigned pfec, pkey_bits;
+ bool check_pkey, check_write, ff, uf, wf, pte_user;
+
+ pfec = bit << 1;
+ ff = pfec & PFERR_FETCH_MASK;
+ uf = pfec & PFERR_USER_MASK;
+ wf = pfec & PFERR_WRITE_MASK;
+
+ /* PFEC.RSVD is replaced by ACC_USER_MASK. */
+ pte_user = pfec & PFERR_RSVD_MASK;
+
+ /*
+ * Only need to check the access which is not an
+ * instruction fetch and is to a user page.
+ */
+ check_pkey = (!ff && pte_user);
+ /*
+ * write access is controlled by PKRU if it is a
+ * user access or CR0.WP = 1.
+ */
+ check_write = check_pkey && wf && (uf || wp);
+
+ /* PKRU.AD stops both read and write access. */
+ pkey_bits = !!check_pkey;
+ /* PKRU.WD stops write access. */
+ pkey_bits |= (!!check_write) << 1;
+
+ mmu->pkru_mask |= (pkey_bits & 3) << pfec;
+ }
+}
+
+static void reset_guest_paging_metadata(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *mmu)
+{
+ if (!is_cr0_pg(mmu))
+ return;
+
+ reset_guest_rsvds_bits_mask(vcpu, mmu);
+ update_permission_bitmask(mmu, false);
+ update_pkru_bitmask(mmu);
+}
+
+static void paging64_init_context(struct kvm_mmu *context)
+{
+ context->page_fault = paging64_page_fault;
+ context->gva_to_gpa = paging64_gva_to_gpa;
+ context->sync_spte = paging64_sync_spte;
+}
+
+static void paging32_init_context(struct kvm_mmu *context)
+{
+ context->page_fault = paging32_page_fault;
+ context->gva_to_gpa = paging32_gva_to_gpa;
+ context->sync_spte = paging32_sync_spte;
+}
+
+static union kvm_cpu_role kvm_calc_cpu_role(struct kvm_vcpu *vcpu,
+ const struct kvm_mmu_role_regs *regs)
+{
+ union kvm_cpu_role role = {0};
+
+ role.base.access = ACC_ALL;
+ role.base.smm = is_smm(vcpu);
+ role.base.guest_mode = is_guest_mode(vcpu);
+ role.ext.valid = 1;
+
+ if (!____is_cr0_pg(regs)) {
+ role.base.direct = 1;
+ return role;
+ }
+
+ role.base.efer_nx = ____is_efer_nx(regs);
+ role.base.cr0_wp = ____is_cr0_wp(regs);
+ role.base.smep_andnot_wp = ____is_cr4_smep(regs) && !____is_cr0_wp(regs);
+ role.base.smap_andnot_wp = ____is_cr4_smap(regs) && !____is_cr0_wp(regs);
+ role.base.has_4_byte_gpte = !____is_cr4_pae(regs);
+
+ if (____is_efer_lma(regs))
+ role.base.level = ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL
+ : PT64_ROOT_4LEVEL;
+ else if (____is_cr4_pae(regs))
+ role.base.level = PT32E_ROOT_LEVEL;
+ else
+ role.base.level = PT32_ROOT_LEVEL;
+
+ role.ext.cr4_smep = ____is_cr4_smep(regs);
+ role.ext.cr4_smap = ____is_cr4_smap(regs);
+ role.ext.cr4_pse = ____is_cr4_pse(regs);
+
+ /* PKEY and LA57 are active iff long mode is active. */
+ role.ext.cr4_pke = ____is_efer_lma(regs) && ____is_cr4_pke(regs);
+ role.ext.cr4_la57 = ____is_efer_lma(regs) && ____is_cr4_la57(regs);
+ role.ext.efer_lma = ____is_efer_lma(regs);
+ return role;
+}
+
+void __kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *mmu)
+{
+ const bool cr0_wp = kvm_is_cr0_bit_set(vcpu, X86_CR0_WP);
+
+ BUILD_BUG_ON((KVM_MMU_CR0_ROLE_BITS & KVM_POSSIBLE_CR0_GUEST_BITS) != X86_CR0_WP);
+ BUILD_BUG_ON((KVM_MMU_CR4_ROLE_BITS & KVM_POSSIBLE_CR4_GUEST_BITS));
+
+ if (is_cr0_wp(mmu) == cr0_wp)
+ return;
+
+ mmu->cpu_role.base.cr0_wp = cr0_wp;
+ reset_guest_paging_metadata(vcpu, mmu);
+}
+
+static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
+{
+ int maxpa;
+
+ if (vcpu->kvm->arch.vm_type == KVM_X86_TDX_VM)
+ maxpa = cpuid_query_maxguestphyaddr(vcpu);
+ else
+ maxpa = cpuid_maxphyaddr(vcpu);
+
+	/* tdp_root_level is the architecture-forced level; use it if nonzero. */
+ if (tdp_root_level)
+ return tdp_root_level;
+
+ /* Use 5-level TDP if and only if it's useful/necessary. */
+ if (max_tdp_level == 5 && maxpa <= 48)
+ return 4;
+
+ return max_tdp_level;
+}
+
+u8 kvm_mmu_get_max_tdp_level(void)
+{
+ return tdp_root_level ? tdp_root_level : max_tdp_level;
+}
+
+static union kvm_mmu_page_role
+kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
+ union kvm_cpu_role cpu_role)
+{
+ union kvm_mmu_page_role role = {0};
+
+ role.access = ACC_ALL;
+ role.cr0_wp = true;
+ role.efer_nx = true;
+ role.smm = cpu_role.base.smm;
+ role.guest_mode = cpu_role.base.guest_mode;
+ role.ad_disabled = !kvm_ad_enabled;
+ role.level = kvm_mmu_get_tdp_level(vcpu);
+ role.direct = true;
+ role.has_4_byte_gpte = false;
+
+ return role;
+}
+
+static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
+ union kvm_cpu_role cpu_role)
+{
+ struct kvm_mmu *context = &vcpu->arch.root_mmu;
+ union kvm_mmu_page_role root_role = kvm_calc_tdp_mmu_root_page_role(vcpu, cpu_role);
+
+ if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
+ root_role.word == context->root_role.word)
+ return;
+
+ context->cpu_role.as_u64 = cpu_role.as_u64;
+ context->root_role.word = root_role.word;
+ context->page_fault = kvm_tdp_page_fault;
+ context->sync_spte = NULL;
+ context->get_guest_pgd = get_guest_cr3;
+ context->get_pdptr = kvm_pdptr_read;
+ context->inject_page_fault = kvm_inject_page_fault;
+
+ if (!is_cr0_pg(context))
+ context->gva_to_gpa = nonpaging_gva_to_gpa;
+ else if (is_cr4_pae(context))
+ context->gva_to_gpa = paging64_gva_to_gpa;
+ else
+ context->gva_to_gpa = paging32_gva_to_gpa;
+
+ reset_guest_paging_metadata(vcpu, context);
+ reset_tdp_shadow_zero_bits_mask(context);
+}
+
+static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
+ union kvm_cpu_role cpu_role,
+ union kvm_mmu_page_role root_role)
+{
+ if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
+ root_role.word == context->root_role.word)
+ return;
+
+ context->cpu_role.as_u64 = cpu_role.as_u64;
+ context->root_role.word = root_role.word;
+
+ if (!is_cr0_pg(context))
+ nonpaging_init_context(context);
+ else if (is_cr4_pae(context))
+ paging64_init_context(context);
+ else
+ paging32_init_context(context);
+
+ reset_guest_paging_metadata(vcpu, context);
+ reset_shadow_zero_bits_mask(vcpu, context);
+}
+
+static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
+ union kvm_cpu_role cpu_role)
+{
+ struct kvm_mmu *context = &vcpu->arch.root_mmu;
+ union kvm_mmu_page_role root_role;
+
+ root_role = cpu_role.base;
+
+ /* KVM uses PAE paging whenever the guest isn't using 64-bit paging. */
+ root_role.level = max_t(u32, root_role.level, PT32E_ROOT_LEVEL);
+
+ /*
+ * KVM forces EFER.NX=1 when TDP is disabled, reflect it in the MMU role.
+ * KVM uses NX when TDP is disabled to handle a variety of scenarios,
+ * notably for huge SPTEs if iTLB multi-hit mitigation is enabled and
+ * to generate correct permissions for CR0.WP=0/CR4.SMEP=1/EFER.NX=0.
+ * The iTLB multi-hit workaround can be toggled at any time, so assume
+ * NX can be used by any non-nested shadow MMU to avoid having to reset
+ * MMU contexts.
+ */
+ root_role.efer_nx = true;
+
+ shadow_mmu_init_context(vcpu, context, cpu_role, root_role);
+}
+
+void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
+ unsigned long cr4, u64 efer, gpa_t nested_cr3)
+{
+ struct kvm_mmu *context = &vcpu->arch.guest_mmu;
+ struct kvm_mmu_role_regs regs = {
+ .cr0 = cr0,
+ .cr4 = cr4 & ~X86_CR4_PKE,
+ .efer = efer,
+ };
+ union kvm_cpu_role cpu_role = kvm_calc_cpu_role(vcpu, &regs);
+ union kvm_mmu_page_role root_role;
+
+ /* NPT requires CR0.PG=1. */
+ WARN_ON_ONCE(cpu_role.base.direct || !cpu_role.base.guest_mode);
+
+ root_role = cpu_role.base;
+ root_role.level = kvm_mmu_get_tdp_level(vcpu);
+ if (root_role.level == PT64_ROOT_5LEVEL &&
+ cpu_role.base.level == PT64_ROOT_4LEVEL)
+ root_role.passthrough = 1;
+
+ shadow_mmu_init_context(vcpu, context, cpu_role, root_role);
+ kvm_mmu_new_pgd(vcpu, nested_cr3);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_init_shadow_npt_mmu);
+
+static union kvm_cpu_role
+kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty,
+ bool execonly, u8 level)
+{
+ union kvm_cpu_role role = {0};
+
+ /*
+ * KVM does not support SMM transfer monitors, and consequently does not
+ * support the "entry to SMM" control either. role.base.smm is always 0.
+ */
+ WARN_ON_ONCE(is_smm(vcpu));
+ role.base.level = level;
+ role.base.has_4_byte_gpte = false;
+ role.base.direct = false;
+ role.base.ad_disabled = !accessed_dirty;
+ role.base.guest_mode = true;
+ role.base.access = ACC_ALL;
+
+ role.ext.word = 0;
+ role.ext.execonly = execonly;
+ role.ext.valid = 1;
+
+ return role;
+}
+
+void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
+ int huge_page_level, bool accessed_dirty,
+ gpa_t new_eptp)
+{
+ struct kvm_mmu *context = &vcpu->arch.guest_mmu;
+ u8 level = vmx_eptp_page_walk_level(new_eptp);
+ union kvm_cpu_role new_mode =
+ kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
+ execonly, level);
+
+ if (new_mode.as_u64 != context->cpu_role.as_u64) {
+ /* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
+ context->cpu_role.as_u64 = new_mode.as_u64;
+ context->root_role.word = new_mode.base.word;
+
+ context->page_fault = ept_page_fault;
+ context->gva_to_gpa = ept_gva_to_gpa;
+ context->sync_spte = ept_sync_spte;
+
+ update_permission_bitmask(context, true);
+ context->pkru_mask = 0;
+ reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
+ reset_ept_shadow_zero_bits_mask(context, execonly);
+ }
+
+ kvm_mmu_new_pgd(vcpu, new_eptp);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_init_shadow_ept_mmu);
+
+static void init_kvm_softmmu(struct kvm_vcpu *vcpu,
+ union kvm_cpu_role cpu_role)
+{
+ struct kvm_mmu *context = &vcpu->arch.root_mmu;
+
+ kvm_init_shadow_mmu(vcpu, cpu_role);
+
+ context->get_guest_pgd = get_guest_cr3;
+ context->get_pdptr = kvm_pdptr_read;
+ context->inject_page_fault = kvm_inject_page_fault;
+}
+
+static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu,
+ union kvm_cpu_role new_mode)
+{
+ struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;
+
+ if (new_mode.as_u64 == g_context->cpu_role.as_u64)
+ return;
+
+ g_context->cpu_role.as_u64 = new_mode.as_u64;
+ g_context->get_guest_pgd = get_guest_cr3;
+ g_context->get_pdptr = kvm_pdptr_read;
+ g_context->inject_page_fault = kvm_inject_page_fault;
+
+ /*
+ * L2 page tables are never shadowed, so there is no need to sync
+ * SPTEs.
+ */
+ g_context->sync_spte = NULL;
+
+ /*
+ * Note that arch.mmu->gva_to_gpa translates l2_gpa to l1_gpa using
+ * L1's nested page tables (e.g. EPT12). The nested translation
+ * of l2_gva to l1_gpa is done by arch.nested_mmu.gva_to_gpa using
+ * L2's page tables as the first level of translation and L1's
+ * nested page tables as the second level of translation. Basically
+ * the gva_to_gpa functions between mmu and nested_mmu are swapped.
+ */
+ if (!is_paging(vcpu))
+ g_context->gva_to_gpa = nonpaging_gva_to_gpa;
+ else if (is_long_mode(vcpu))
+ g_context->gva_to_gpa = paging64_gva_to_gpa;
+ else if (is_pae(vcpu))
+ g_context->gva_to_gpa = paging64_gva_to_gpa;
+ else
+ g_context->gva_to_gpa = paging32_gva_to_gpa;
+
+ reset_guest_paging_metadata(vcpu, g_context);
+}
+
+void kvm_init_mmu(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
+ union kvm_cpu_role cpu_role = kvm_calc_cpu_role(vcpu, &regs);
+
+ if (mmu_is_nested(vcpu))
+ init_kvm_nested_mmu(vcpu, cpu_role);
+ else if (tdp_enabled)
+ init_kvm_tdp_mmu(vcpu, cpu_role);
+ else
+ init_kvm_softmmu(vcpu, cpu_role);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_init_mmu);
+
+void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Invalidate all MMU roles to force them to reinitialize as CPUID
+ * information is factored into reserved bit calculations.
+ *
+	 * Correctly handling multiple vCPU models (with respect to paging and
+	 * physical address properties) in a single VM would require tracking
+ * all relevant CPUID information in kvm_mmu_page_role. That is very
+ * undesirable as it would increase the memory requirements for
+ * gfn_write_track (see struct kvm_mmu_page_role comments). For now
+ * that problem is swept under the rug; KVM's CPUID API is horrific and
+ * it's all but impossible to solve it without introducing a new API.
+ */
+ vcpu->arch.root_mmu.root_role.invalid = 1;
+ vcpu->arch.guest_mmu.root_role.invalid = 1;
+ vcpu->arch.nested_mmu.root_role.invalid = 1;
+ vcpu->arch.root_mmu.cpu_role.ext.valid = 0;
+ vcpu->arch.guest_mmu.cpu_role.ext.valid = 0;
+ vcpu->arch.nested_mmu.cpu_role.ext.valid = 0;
+ kvm_mmu_reset_context(vcpu);
+
+ /*
+ * Changing guest CPUID after KVM_RUN is forbidden, see the comment in
+ * kvm_arch_vcpu_ioctl().
+ */
+ KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm);
+}
+
+void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
+{
+ kvm_mmu_unload(vcpu);
+ kvm_init_mmu(vcpu);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_reset_context);
+
+int kvm_mmu_load(struct kvm_vcpu *vcpu)
+{
+ int r;
+
+ r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->root_role.direct);
+ if (r)
+ goto out;
+ r = mmu_alloc_special_roots(vcpu);
+ if (r)
+ goto out;
+ if (vcpu->arch.mmu->root_role.direct)
+ r = mmu_alloc_direct_roots(vcpu);
+ else
+ r = mmu_alloc_shadow_roots(vcpu);
+ if (r)
+ goto out;
+
+ kvm_mmu_sync_roots(vcpu);
+
+ kvm_mmu_load_pgd(vcpu);
+
+ /*
+ * Flush any TLB entries for the new root, the provenance of the root
+ * is unknown. Even if KVM ensures there are no stale TLB entries
+ * for a freed root, in theory another hypervisor could have left
+ * stale entries. Flushing on alloc also allows KVM to skip the TLB
+ * flush when freeing a root (see kvm_tdp_mmu_put_root()).
+ */
+ kvm_x86_call(flush_tlb_current)(vcpu);
+out:
+ return r;
+}
+
+void kvm_mmu_unload(struct kvm_vcpu *vcpu)
+{
+ struct kvm *kvm = vcpu->kvm;
+
+ kvm_mmu_free_roots(kvm, &vcpu->arch.root_mmu, KVM_MMU_ROOTS_ALL);
+ WARN_ON_ONCE(VALID_PAGE(vcpu->arch.root_mmu.root.hpa));
+ kvm_mmu_free_roots(kvm, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL);
+ WARN_ON_ONCE(VALID_PAGE(vcpu->arch.guest_mmu.root.hpa));
+ vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
+}
+
+static bool is_obsolete_root(struct kvm *kvm, hpa_t root_hpa)
+{
+ struct kvm_mmu_page *sp;
+
+ if (!VALID_PAGE(root_hpa))
+ return false;
+
+ /*
+ * When freeing obsolete roots, treat roots as obsolete if they don't
+ * have an associated shadow page, as it's impossible to determine if
+ * such roots are fresh or stale. This does mean KVM will get false
+ * positives and free roots that don't strictly need to be freed, but
+ * such false positives are relatively rare:
+ *
+ * (a) only PAE paging and nested NPT have roots without shadow pages
+ * (or any shadow paging flavor with a dummy root, see note below)
+ * (b) remote reloads due to a memslot update obsoletes _all_ roots
+ * (c) KVM doesn't track previous roots for PAE paging, and the guest
+ * is unlikely to zap an in-use PGD.
+ *
+ * Note! Dummy roots are unique in that they are obsoleted by memslot
+ * _creation_! See also FNAME(fetch).
+ */
+ sp = root_to_sp(root_hpa);
+ return !sp || is_obsolete_sp(kvm, sp);
+}
+
+static void __kvm_mmu_free_obsolete_roots(struct kvm *kvm, struct kvm_mmu *mmu)
+{
+ unsigned long roots_to_free = 0;
+ int i;
+
+ if (is_obsolete_root(kvm, mmu->root.hpa))
+ roots_to_free |= KVM_MMU_ROOT_CURRENT;
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+ if (is_obsolete_root(kvm, mmu->prev_roots[i].hpa))
+ roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
+ }
+
+ if (roots_to_free)
+ kvm_mmu_free_roots(kvm, mmu, roots_to_free);
+}
+
+void kvm_mmu_free_obsolete_roots(struct kvm_vcpu *vcpu)
+{
+ __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.root_mmu);
+ __kvm_mmu_free_obsolete_roots(vcpu->kvm, &vcpu->arch.guest_mmu);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_free_obsolete_roots);
+
+static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
+ int *bytes)
+{
+ u64 gentry = 0;
+ int r;
+
+ /*
+	 * Assume that the pte write is on a page table of the same type as the
+	 * current vcpu paging mode, since we update the sptes only when they
+	 * have the same mode.
+ */
+ if (is_pae(vcpu) && *bytes == 4) {
+ /* Handle a 32-bit guest writing two halves of a 64-bit gpte */
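+		/* E.g. a 4-byte write at gpa 0x1004 becomes an 8-byte read at 0x1000. */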
+ *gpa &= ~(gpa_t)7;
+ *bytes = 8;
+ }
+
+ if (*bytes == 4 || *bytes == 8) {
+ r = kvm_vcpu_read_guest_atomic(vcpu, *gpa, &gentry, *bytes);
+ if (r)
+ gentry = 0;
+ }
+
+ return gentry;
+}
+
+/*
+ * If we're seeing too many writes to a page, it may no longer be a page table,
+ * or we may be forking, in which case it is better to unmap the page.
+ */
+static bool detect_write_flooding(struct kvm_mmu_page *sp)
+{
+ /*
+	 * Skip write-flooding detection for an sp whose level is PG_LEVEL_4K,
+	 * because it can become unsync, in which case the guest page is no
+	 * longer write-protected.
+ */
+ if (sp->role.level == PG_LEVEL_4K)
+ return false;
+
+ atomic_inc(&sp->write_flooding_count);
+ return atomic_read(&sp->write_flooding_count) >= 3;
+}
+
+/*
+ * Misaligned accesses are too much trouble to fix up; also, they usually
+ * indicate a page is not used as a page table.
+ */
+static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
+ int bytes)
+{
+ unsigned offset, pte_size, misaligned;
+
+ offset = offset_in_page(gpa);
+ pte_size = sp->role.has_4_byte_gpte ? 4 : 8;
+
+ /*
+	 * Sometimes the OS writes only the last byte to update status bits,
+	 * e.g. Linux uses the andb instruction in clear_bit().
+ */
+ if (!(offset & (pte_size - 1)) && bytes == 1)
+ return false;
+
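+	/*
+	 * E.g. with 8-byte gptes, a 4-byte write at offset 6 spans two ptes
+	 * ((6 ^ 9) & ~7 == 8) and is treated as misaligned, whereas 4-byte
+	 * writes at offsets 0 and 4 are not.
+	 */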
+ misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
+ misaligned |= bytes < 4;
+
+ return misaligned;
+}
+
+static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte)
+{
+ unsigned page_offset, quadrant;
+ u64 *spte;
+ int level;
+
+ page_offset = offset_in_page(gpa);
+ level = sp->role.level;
+ *nspte = 1;
+ if (sp->role.has_4_byte_gpte) {
+ page_offset <<= 1; /* 32->64 */
+ /*
+ * A 32-bit pde maps 4MB while the shadow pdes map
+ * only 2MB. So we need to double the offset again
+ * and zap two pdes instead of one.
+ */
+ if (level == PT32_ROOT_LEVEL) {
+ page_offset &= ~7; /* kill rounding error */
+ page_offset <<= 1;
+ *nspte = 2;
+ }
+ quadrant = page_offset >> PAGE_SHIFT;
+ page_offset &= ~PAGE_MASK;
+ if (quadrant != sp->role.quadrant)
+ return NULL;
+ }
+
+ spte = &sp->spt[page_offset / sizeof(*spte)];
+ return spte;
+}
+
+void kvm_mmu_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
+ int bytes)
+{
+ gfn_t gfn = gpa >> PAGE_SHIFT;
+ struct kvm_mmu_page *sp;
+ LIST_HEAD(invalid_list);
+ u64 entry, gentry, *spte;
+ int npte;
+ bool flush = false;
+
+ /*
+ * When emulating guest writes, ensure the written value is visible to
+ * any task that is handling page faults before checking whether or not
+ * KVM is shadowing a guest PTE. This ensures either KVM will create
+ * the correct SPTE in the page fault handler, or this task will see
+ * a non-zero indirect_shadow_pages. Pairs with the smp_mb() in
+ * account_shadowed().
+ */
+ smp_mb();
+ if (!vcpu->kvm->arch.indirect_shadow_pages)
+ return;
+
+ write_lock(&vcpu->kvm->mmu_lock);
+
+ gentry = mmu_pte_write_fetch_gpte(vcpu, &gpa, &bytes);
+
+ ++vcpu->kvm->stat.mmu_pte_write;
+
+ for_each_gfn_valid_sp_with_gptes(vcpu->kvm, sp, gfn) {
+ if (detect_write_misaligned(sp, gpa, bytes) ||
+ detect_write_flooding(sp)) {
+ kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
+ ++vcpu->kvm->stat.mmu_flooded;
+ continue;
+ }
+
+ spte = get_written_sptes(sp, gpa, &npte);
+ if (!spte)
+ continue;
+
+ while (npte--) {
+ entry = *spte;
+ mmu_page_zap_pte(vcpu->kvm, sp, spte, NULL);
+ if (gentry && sp->role.level != PG_LEVEL_4K)
+ ++vcpu->kvm->stat.mmu_pde_zapped;
+ if (is_shadow_present_pte(entry))
+ flush = true;
+ ++spte;
+ }
+ }
+ kvm_mmu_remote_flush_or_zap(vcpu->kvm, &invalid_list, flush);
+ write_unlock(&vcpu->kvm->mmu_lock);
+}
+
+static bool is_write_to_guest_page_table(u64 error_code)
+{
+ const u64 mask = PFERR_GUEST_PAGE_MASK | PFERR_WRITE_MASK | PFERR_PRESENT_MASK;
+
+ return (error_code & mask) == mask;
+}
+
+static int kvm_mmu_write_protect_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u64 error_code, int *emulation_type)
+{
+ bool direct = vcpu->arch.mmu->root_role.direct;
+
+ /*
+ * Do not try to unprotect and retry if the vCPU re-faulted on the same
+ * RIP with the same address that was previously unprotected, as doing
+	 * so will likely put the vCPU into an infinite loop. E.g. if the vCPU uses
+ * a non-page-table modifying instruction on the PDE that points to the
+ * instruction, then unprotecting the gfn will unmap the instruction's
+ * code, i.e. make it impossible for the instruction to ever complete.
+ */
+ if (vcpu->arch.last_retry_eip == kvm_rip_read(vcpu) &&
+ vcpu->arch.last_retry_addr == cr2_or_gpa)
+ return RET_PF_EMULATE;
+
+ /*
+ * Reset the unprotect+retry values that guard against infinite loops.
+ * The values will be refreshed if KVM explicitly unprotects a gfn and
+ * retries, in all other cases it's safe to retry in the future even if
+ * the next page fault happens on the same RIP+address.
+ */
+ vcpu->arch.last_retry_eip = 0;
+ vcpu->arch.last_retry_addr = 0;
+
+ /*
+ * It should be impossible to reach this point with an MMIO cache hit,
+ * as RET_PF_WRITE_PROTECTED is returned if and only if there's a valid,
+ * writable memslot, and creating a memslot should invalidate the MMIO
+ * cache by way of changing the memslot generation. WARN and disallow
+ * retry if MMIO is detected, as retrying MMIO emulation is pointless
+ * and could put the vCPU into an infinite loop because the processor
+ * will keep faulting on the non-existent MMIO address.
+ */
+ if (WARN_ON_ONCE(mmio_info_in_cache(vcpu, cr2_or_gpa, direct)))
+ return RET_PF_EMULATE;
+
+ /*
+ * Before emulating the instruction, check to see if the access was due
+ * to a read-only violation while the CPU was walking non-nested NPT
+ * page tables, i.e. for a direct MMU, for _guest_ page tables in L1.
+ * If L1 is sharing (a subset of) its page tables with L2, e.g. by
+ * having nCR3 share lower level page tables with hCR3, then when KVM
+ * (L0) write-protects the nested NPTs, i.e. npt12 entries, KVM is also
+ * unknowingly write-protecting L1's guest page tables, which KVM isn't
+ * shadowing.
+ *
+ * Because the CPU (by default) walks NPT page tables using a write
+ * access (to ensure the CPU can do A/D updates), page walks in L1 can
+ * trigger write faults for the above case even when L1 isn't modifying
+ * PTEs. As a result, KVM will unnecessarily emulate (or at least, try
+ * to emulate) an excessive number of L1 instructions; because L1's MMU
+ * isn't shadowed by KVM, there is no need to write-protect L1's gPTEs
+ * and thus no need to emulate in order to guarantee forward progress.
+ *
+ * Try to unprotect the gfn, i.e. zap any shadow pages, so that L1 can
+ * proceed without triggering emulation. If one or more shadow pages
+ * was zapped, skip emulation and resume L1 to let it natively execute
+ * the instruction. If no shadow pages were zapped, then the write-
+ * fault is due to something else entirely, i.e. KVM needs to emulate,
+ * as resuming the guest will put it into an infinite loop.
+ *
+ * Note, this code also applies to Intel CPUs, even though it is *very*
+ * unlikely that an L1 will share its page tables (IA32/PAE/paging64
+ * format) with L2's page tables (EPT format).
+ *
+ * For indirect MMUs, i.e. if KVM is shadowing the current MMU, try to
+ * unprotect the gfn and retry if an event is awaiting reinjection. If
+ * KVM emulates multiple instructions before completing event injection,
+ * the event could be delayed beyond what is architecturally allowed,
+ * e.g. KVM could inject an IRQ after the TPR has been raised.
+ */
+ if (((direct && is_write_to_guest_page_table(error_code)) ||
+ (!direct && kvm_event_needs_reinjection(vcpu))) &&
+ kvm_mmu_unprotect_gfn_and_retry(vcpu, cr2_or_gpa))
+ return RET_PF_RETRY;
+
+ /*
+	 * The gfn is write-protected, but if KVM detects it's emulating an
+ * instruction that is unlikely to be used to modify page tables, or if
+ * emulation fails, KVM can try to unprotect the gfn and let the CPU
+ * re-execute the instruction that caused the page fault. Do not allow
+ * retrying an instruction from a nested guest as KVM is only explicitly
+ * shadowing L1's page tables, i.e. unprotecting something for L1 isn't
+ * going to magically fix whatever issue caused L2 to fail.
+ */
+ if (!is_guest_mode(vcpu))
+ *emulation_type |= EMULTYPE_ALLOW_RETRY_PF;
+
+ return RET_PF_EMULATE;
+}
+
+int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
+ void *insn, int insn_len)
+{
+ int r, emulation_type = EMULTYPE_PF;
+ bool direct = vcpu->arch.mmu->root_role.direct;
+
+ if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
+ return RET_PF_RETRY;
+
+ /*
+ * Except for reserved faults (emulated MMIO is shared-only), set the
+ * PFERR_PRIVATE_ACCESS flag for software-protected VMs based on the gfn's
+ * current attributes, which are the source of truth for such VMs. Note,
+	 * this is wrong for nested MMUs as the GPA is an L2 GPA, but KVM doesn't
+	 * currently support nested virtualization (among many other things)
+ * for software-protected VMs.
+ */
+ if (IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) &&
+ !(error_code & PFERR_RSVD_MASK) &&
+ vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM &&
+ kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
+ error_code |= PFERR_PRIVATE_ACCESS;
+
+ r = RET_PF_INVALID;
+ if (unlikely(error_code & PFERR_RSVD_MASK)) {
+ if (WARN_ON_ONCE(error_code & PFERR_PRIVATE_ACCESS))
+ return -EFAULT;
+
+ r = handle_mmio_page_fault(vcpu, cr2_or_gpa, direct);
+ if (r == RET_PF_EMULATE)
+ goto emulate;
+ }
+
+ if (r == RET_PF_INVALID) {
+ vcpu->stat.pf_taken++;
+
+ r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, error_code, false,
+ &emulation_type, NULL);
+ if (KVM_BUG_ON(r == RET_PF_INVALID, vcpu->kvm))
+ return -EIO;
+ }
+
+ if (r < 0)
+ return r;
+
+ if (r == RET_PF_WRITE_PROTECTED)
+ r = kvm_mmu_write_protect_fault(vcpu, cr2_or_gpa, error_code,
+ &emulation_type);
+
+ if (r == RET_PF_FIXED)
+ vcpu->stat.pf_fixed++;
+ else if (r == RET_PF_EMULATE)
+ vcpu->stat.pf_emulate++;
+ else if (r == RET_PF_SPURIOUS)
+ vcpu->stat.pf_spurious++;
+
+ /*
+ * None of handle_mmio_page_fault(), kvm_mmu_do_page_fault(), or
+ * kvm_mmu_write_protect_fault() return RET_PF_CONTINUE.
+ * kvm_mmu_do_page_fault() only uses RET_PF_CONTINUE internally to
+	 * indicate continuing the page fault handling until the final
+ * page table mapping phase.
+ */
+ WARN_ON_ONCE(r == RET_PF_CONTINUE);
+ if (r != RET_PF_EMULATE)
+ return r;
+
+emulate:
+ return x86_emulate_instruction(vcpu, cr2_or_gpa, emulation_type, insn,
+ insn_len);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_page_fault);
+
+void kvm_mmu_print_sptes(struct kvm_vcpu *vcpu, gpa_t gpa, const char *msg)
+{
+ u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
+ int root_level, leaf, level;
+
+ leaf = get_sptes_lockless(vcpu, gpa, sptes, &root_level);
+ if (unlikely(leaf < 0))
+ return;
+
+ pr_err("%s %llx", msg, gpa);
+ for (level = root_level; level >= leaf; level--)
+ pr_cont(", spte[%d] = 0x%llx", level, sptes[level]);
+ pr_cont("\n");
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_print_sptes);
+
+static void __kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+ u64 addr, hpa_t root_hpa)
+{
+ struct kvm_shadow_walk_iterator iterator;
+
+ vcpu_clear_mmio_info(vcpu, addr);
+
+ /*
+ * Walking and synchronizing SPTEs both assume they are operating in
+ * the context of the current MMU, and would need to be reworked if
+ * this is ever used to sync the guest_mmu, e.g. to emulate INVEPT.
+ */
+ if (WARN_ON_ONCE(mmu != vcpu->arch.mmu))
+ return;
+
+ if (!VALID_PAGE(root_hpa))
+ return;
+
+ write_lock(&vcpu->kvm->mmu_lock);
+ for_each_shadow_entry_using_root(vcpu, root_hpa, addr, iterator) {
+ struct kvm_mmu_page *sp = sptep_to_sp(iterator.sptep);
+
+ if (sp->unsync) {
+ int ret = kvm_sync_spte(vcpu, sp, iterator.index);
+
+ if (ret < 0)
+ mmu_page_zap_pte(vcpu->kvm, sp, iterator.sptep, NULL);
+ if (ret)
+ kvm_flush_remote_tlbs_sptep(vcpu->kvm, iterator.sptep);
+ }
+
+ if (!sp->unsync_children)
+ break;
+ }
+ write_unlock(&vcpu->kvm->mmu_lock);
+}
+
+void kvm_mmu_invalidate_addr(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+ u64 addr, unsigned long roots)
+{
+ int i;
+
+ WARN_ON_ONCE(roots & ~KVM_MMU_ROOTS_ALL);
+
+ /* It's actually a GPA for vcpu->arch.guest_mmu. */
+ if (mmu != &vcpu->arch.guest_mmu) {
+ /* INVLPG on a non-canonical address is a NOP according to the SDM. */
+ if (is_noncanonical_invlpg_address(addr, vcpu))
+ return;
+
+ kvm_x86_call(flush_tlb_gva)(vcpu, addr);
+ }
+
+ if (!mmu->sync_spte)
+ return;
+
+ if (roots & KVM_MMU_ROOT_CURRENT)
+ __kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->root.hpa);
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+ if (roots & KVM_MMU_ROOT_PREVIOUS(i))
+ __kvm_mmu_invalidate_addr(vcpu, mmu, addr, mmu->prev_roots[i].hpa);
+ }
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_invalidate_addr);
+
+void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
+{
+ /*
+ * INVLPG is required to invalidate any global mappings for the VA,
+ * irrespective of PCID. Blindly sync all roots as it would take
+ * roughly the same amount of work/time to determine whether any of the
+ * previous roots have a global mapping.
+ *
+ * Mappings not reachable via the current or previous cached roots will
+ * be synced when switching to that new cr3, so nothing needs to be
+ * done here for them.
+ */
+ kvm_mmu_invalidate_addr(vcpu, vcpu->arch.walk_mmu, gva, KVM_MMU_ROOTS_ALL);
+ ++vcpu->stat.invlpg;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_invlpg);
+
+void kvm_mmu_invpcid_gva(struct kvm_vcpu *vcpu, gva_t gva, unsigned long pcid)
+{
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ unsigned long roots = 0;
+ uint i;
+
+ if (pcid == kvm_get_active_pcid(vcpu))
+ roots |= KVM_MMU_ROOT_CURRENT;
+
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
+ if (VALID_PAGE(mmu->prev_roots[i].hpa) &&
+ pcid == kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd))
+ roots |= KVM_MMU_ROOT_PREVIOUS(i);
+ }
+
+ if (roots)
+ kvm_mmu_invalidate_addr(vcpu, mmu, gva, roots);
+ ++vcpu->stat.invlpg;
+
+ /*
+ * Mappings not reachable via the current cr3 or the prev_roots will be
+ * synced when switching to that cr3, so nothing needs to be done here
+ * for them.
+ */
+}
+
+void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
+ int tdp_max_root_level, int tdp_huge_page_level)
+{
+ tdp_enabled = enable_tdp;
+ tdp_root_level = tdp_forced_root_level;
+ max_tdp_level = tdp_max_root_level;
+
+#ifdef CONFIG_X86_64
+ tdp_mmu_enabled = tdp_mmu_allowed && tdp_enabled;
+#endif
+ /*
+ * max_huge_page_level reflects KVM's MMU capabilities irrespective
+ * of kernel support, e.g. KVM may be capable of using 1GB pages when
+ * the kernel is not. But, KVM never creates a page size greater than
+ * what is used by the kernel for any given HVA, i.e. the kernel's
+ * capabilities are ultimately consulted by kvm_mmu_hugepage_adjust().
+ */
+ if (tdp_enabled)
+ max_huge_page_level = tdp_huge_page_level;
+ else if (boot_cpu_has(X86_FEATURE_GBPAGES))
+ max_huge_page_level = PG_LEVEL_1G;
+ else
+ max_huge_page_level = PG_LEVEL_2M;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_configure_mmu);
+
+static void free_mmu_pages(struct kvm_mmu *mmu)
+{
+ if (!tdp_enabled && mmu->pae_root)
+ set_memory_encrypted((unsigned long)mmu->pae_root, 1);
+ free_page((unsigned long)mmu->pae_root);
+ free_page((unsigned long)mmu->pml4_root);
+ free_page((unsigned long)mmu->pml5_root);
+}
+
+static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
+{
+ struct page *page;
+ int i;
+
+ mmu->root.hpa = INVALID_PAGE;
+ mmu->root.pgd = 0;
+ mmu->mirror_root_hpa = INVALID_PAGE;
+ for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+ mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
+
+ /* vcpu->arch.guest_mmu isn't used when !tdp_enabled. */
+ if (!tdp_enabled && mmu == &vcpu->arch.guest_mmu)
+ return 0;
+
+ /*
+ * When using PAE paging, the four PDPTEs are treated as 'root' pages,
+ * while the PDP table is a per-vCPU construct that's allocated at MMU
+ * creation. When emulating 32-bit mode, cr3 is only 32 bits even on
+ * x86_64. Therefore we need to allocate the PDP table in the first
+ * 4GB of memory, which happens to fit the DMA32 zone. TDP paging
+ * generally doesn't use PAE paging and can skip allocating the PDP
+ * table. The main exception, handled here, is SVM's 32-bit NPT. The
+ * other exception is for shadowing L1's 32-bit or PAE NPT on 64-bit
+ * KVM; that horror is handled on-demand by mmu_alloc_special_roots().
+ */
+ if (tdp_enabled && kvm_mmu_get_tdp_level(vcpu) > PT32E_ROOT_LEVEL)
+ return 0;
+
+ page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_DMA32);
+ if (!page)
+ return -ENOMEM;
+
+ mmu->pae_root = page_address(page);
+
+ /*
+ * CR3 is only 32 bits when PAE paging is used, thus it's impossible to
+ * get the CPU to treat the PDPTEs as encrypted. Decrypt the page so
+ * that KVM's writes and the CPU's reads get along. Note, this is
+ * only necessary when using shadow paging, as 64-bit NPT can get at
+ * the C-bit even when shadowing 32-bit NPT, and SME isn't supported
+ * by 32-bit kernels (when KVM itself uses 32-bit NPT).
+ */
+ if (!tdp_enabled)
+ set_memory_decrypted((unsigned long)mmu->pae_root, 1);
+ else
+ WARN_ON_ONCE(shadow_me_value);
+
+ for (i = 0; i < 4; ++i)
+ mmu->pae_root[i] = INVALID_PAE_ROOT;
+
+ return 0;
+}
+
+int kvm_mmu_create(struct kvm_vcpu *vcpu)
+{
+ int ret;
+
+ vcpu->arch.mmu_pte_list_desc_cache.kmem_cache = pte_list_desc_cache;
+ vcpu->arch.mmu_pte_list_desc_cache.gfp_zero = __GFP_ZERO;
+
+ vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
+ vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
+
+ vcpu->arch.mmu_shadow_page_cache.init_value =
+ SHADOW_NONPRESENT_VALUE;
+ if (!vcpu->arch.mmu_shadow_page_cache.init_value)
+ vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+
+ vcpu->arch.mmu = &vcpu->arch.root_mmu;
+ vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
+
+ ret = __kvm_mmu_create(vcpu, &vcpu->arch.guest_mmu);
+ if (ret)
+ return ret;
+
+ ret = __kvm_mmu_create(vcpu, &vcpu->arch.root_mmu);
+ if (ret)
+ goto fail_allocate_root;
+
+ return ret;
+ fail_allocate_root:
+ free_mmu_pages(&vcpu->arch.guest_mmu);
+ return ret;
+}
+
+#define BATCH_ZAP_PAGES 10
+static void kvm_zap_obsolete_pages(struct kvm *kvm)
+{
+ struct kvm_mmu_page *sp, *node;
+ int nr_zapped, batch = 0;
+ LIST_HEAD(invalid_list);
+ bool unstable;
+
+ lockdep_assert_held(&kvm->slots_lock);
+
+restart:
+ list_for_each_entry_safe_reverse(sp, node,
+ &kvm->arch.active_mmu_pages, link) {
+ /*
+ * No obsolete valid page exists before a newly created page
+ * since active_mmu_pages is a FIFO list.
+ */
+ if (!is_obsolete_sp(kvm, sp))
+ break;
+
+ /*
+ * Invalid pages should never land back on the list of active
+ * pages. Skip the bogus page, otherwise we'll get stuck in an
+ * infinite loop if the page gets put back on the list (again).
+ */
+ if (WARN_ON_ONCE(sp->role.invalid))
+ continue;
+
+ /*
+ * No need to flush the TLB since we're only zapping shadow
+ * pages with an obsolete generation number and all vCPUS have
+ * loaded a new root, i.e. the shadow pages being zapped cannot
+ * be in active use by the guest.
+ */
+ if (batch >= BATCH_ZAP_PAGES &&
+ cond_resched_rwlock_write(&kvm->mmu_lock)) {
+ batch = 0;
+ goto restart;
+ }
+
+ unstable = __kvm_mmu_prepare_zap_page(kvm, sp,
+ &invalid_list, &nr_zapped);
+ batch += nr_zapped;
+
+ if (unstable)
+ goto restart;
+ }
+
+ /*
+ * Kick all vCPUs (via remote TLB flush) before freeing the page tables
+ * to ensure KVM is not in the middle of a lockless shadow page table
+ * walk, which may reference the pages. The remote TLB flush itself is
+ * not required and is simply a convenient way to kick vCPUs as needed.
+ * KVM performs a local TLB flush when allocating a new root (see
+	 * kvm_mmu_load()), and the reload in the caller ensures no vCPUs are
+ * running with an obsolete MMU.
+ */
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+}
+
+/*
+ * Fast-invalidate all shadow pages, using a lock-break technique to zap
+ * obsolete pages.
+ *
+ * This is required when a memslot is being deleted or the VM is being
+ * destroyed; in these cases, we must ensure that the KVM MMU does not use
+ * any resource of the slot being deleted (or of any slot, when the VM is
+ * being destroyed) after the function returns.
+ */
+static void kvm_mmu_zap_all_fast(struct kvm *kvm)
+{
+ lockdep_assert_held(&kvm->slots_lock);
+
+ write_lock(&kvm->mmu_lock);
+ trace_kvm_mmu_zap_all_fast(kvm);
+
+ /*
+ * Toggle mmu_valid_gen between '0' and '1'. Because slots_lock is
+ * held for the entire duration of zapping obsolete pages, it's
+ * impossible for there to be multiple invalid generations associated
+ * with *valid* shadow pages at any given time, i.e. there is exactly
+ * one valid generation and (at most) one invalid generation.
+ */
+ kvm->arch.mmu_valid_gen = kvm->arch.mmu_valid_gen ? 0 : 1;
+
+ /*
+ * In order to ensure all vCPUs drop their soon-to-be invalid roots,
+ * invalidating TDP MMU roots must be done while holding mmu_lock for
+ * write and in the same critical section as making the reload request,
+ * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
+ */
+ if (tdp_mmu_enabled) {
+ /*
+ * External page tables don't support fast zapping, therefore
+ * their mirrors must be invalidated separately by the caller.
+ */
+ kvm_tdp_mmu_invalidate_roots(kvm, KVM_DIRECT_ROOTS);
+ }
+
+ /*
+	 * Notify all vcpus to reload their shadow page tables and flush TLBs.
+	 * Then all vcpus will switch to the new shadow page tables with the new
+	 * mmu_valid_gen.
+ * mmu_valid_gen.
+ *
+ * Note: we need to do this under the protection of mmu_lock,
+ * otherwise, vcpu would purge shadow page but miss tlb flush.
+ */
+ kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS);
+
+ kvm_zap_obsolete_pages(kvm);
+
+ write_unlock(&kvm->mmu_lock);
+
+ /*
+ * Zap the invalidated TDP MMU roots, all SPTEs must be dropped before
+ * returning to the caller, e.g. if the zap is in response to a memslot
+ * deletion, mmu_notifier callbacks will be unable to reach the SPTEs
+	 * associated with the deleted memslot once the update completes.
+	 * Deferring the zap until the final reference to the root is put would
+ * lead to use-after-free.
+ */
+ if (tdp_mmu_enabled)
+ kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
+}
+
+int kvm_mmu_init_vm(struct kvm *kvm)
+{
+ int r, i;
+
+ kvm->arch.shadow_mmio_value = shadow_mmio_value;
+ INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ for (i = 0; i < KVM_NR_MMU_TYPES; ++i)
+ INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages[i].pages);
+ spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
+
+ if (tdp_mmu_enabled) {
+ kvm_mmu_init_tdp_mmu(kvm);
+ } else {
+ r = kvm_mmu_alloc_page_hash(kvm);
+ if (r)
+ return r;
+ }
+
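+ /*
+ * Caches backing eager huge page splitting for dirty logging; they are
+ * topped up under slots_lock and consumed while holding mmu_lock.
+ */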
+ kvm->arch.split_page_header_cache.kmem_cache = mmu_page_header_cache;
+ kvm->arch.split_page_header_cache.gfp_zero = __GFP_ZERO;
+
+ kvm->arch.split_shadow_page_cache.gfp_zero = __GFP_ZERO;
+
+ kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
+ kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
+ return 0;
+}
+
+static void mmu_free_vm_memory_caches(struct kvm *kvm)
+{
+ kvm_mmu_free_memory_cache(&kvm->arch.split_desc_cache);
+ kvm_mmu_free_memory_cache(&kvm->arch.split_page_header_cache);
+ kvm_mmu_free_memory_cache(&kvm->arch.split_shadow_page_cache);
+}
+
+void kvm_mmu_uninit_vm(struct kvm *kvm)
+{
+ kvfree(kvm->arch.mmu_page_hash);
+
+ if (tdp_mmu_enabled)
+ kvm_mmu_uninit_tdp_mmu(kvm);
+
+ mmu_free_vm_memory_caches(kvm);
+}
+
+static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
+{
+ const struct kvm_memory_slot *memslot;
+ struct kvm_memslots *slots;
+ struct kvm_memslot_iter iter;
+ bool flush = false;
+ gfn_t start, end;
+ int i;
+
+ if (!kvm_memslots_have_rmaps(kvm))
+ return flush;
+
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+ slots = __kvm_memslots(kvm, i);
+
+ kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
+ memslot = iter.slot;
+ start = max(gfn_start, memslot->base_gfn);
+ end = min(gfn_end, memslot->base_gfn + memslot->npages);
+ if (WARN_ON_ONCE(start >= end))
+ continue;
+
+ flush = __kvm_rmap_zap_gfn_range(kvm, memslot, start,
+ end, true, flush);
+ }
+ }
+
+ return flush;
+}
+
+/*
+ * Invalidate (zap) SPTEs that cover GFNs from gfn_start up to, but not
+ * including, gfn_end.
+ */
+void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
+{
+ bool flush;
+
+ if (WARN_ON_ONCE(gfn_end <= gfn_start))
+ return;
+
+ write_lock(&kvm->mmu_lock);
+
+ kvm_mmu_invalidate_begin(kvm);
+
+ kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
+
+ flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
+
+ if (tdp_mmu_enabled)
+ flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
+
+ if (flush)
+ kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
+
+ kvm_mmu_invalidate_end(kvm);
+
+ write_unlock(&kvm->mmu_lock);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_zap_gfn_range);
+
+static bool slot_rmap_write_protect(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
+{
+ return rmap_write_protect(rmap_head, false);
+}
+
+void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot,
+ int start_level)
+{
+ if (kvm_memslots_have_rmaps(kvm)) {
+ write_lock(&kvm->mmu_lock);
+ walk_slot_rmaps(kvm, memslot, slot_rmap_write_protect,
+ start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
+ write_unlock(&kvm->mmu_lock);
+ }
+
+ if (tdp_mmu_enabled) {
+ read_lock(&kvm->mmu_lock);
+ kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
+ read_unlock(&kvm->mmu_lock);
+ }
+}
+
+static inline bool need_topup(struct kvm_mmu_memory_cache *cache, int min)
+{
+ return kvm_mmu_memory_cache_nr_free_objects(cache) < min;
+}
+
+static bool need_topup_split_caches_or_resched(struct kvm *kvm)
+{
+ if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
+ return true;
+
+ /*
+ * In the worst case, SPLIT_DESC_CACHE_MIN_NR_OBJECTS descriptors are needed
+ * to split a single huge page. Calculating how many are actually needed
+ * is possible but not worth the complexity.
+ */
+ return need_topup(&kvm->arch.split_desc_cache, SPLIT_DESC_CACHE_MIN_NR_OBJECTS) ||
+ need_topup(&kvm->arch.split_page_header_cache, 1) ||
+ need_topup(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static int topup_split_caches(struct kvm *kvm)
+{
+ /*
+ * Allocating rmap list entries when splitting huge pages for nested
+ * MMUs is uncommon as KVM needs to use a list if and only if there is
+ * more than one rmap entry for a gfn, i.e. requires an L1 gfn to be
+ * aliased by multiple L2 gfns and/or from multiple nested roots with
+ * different roles. Aliasing gfns when using TDP is atypical for VMMs;
+ * a few gfns are often aliased during boot, e.g. when remapping BIOS,
+ * but aliasing rarely occurs post-boot or for many gfns. If there is
+ * only one rmap entry, rmap->val points directly at that one entry and
+ * doesn't need to allocate a list. Buffer the cache by the default
+ * capacity so that KVM doesn't have to drop mmu_lock to topup if KVM
+ * encounters an aliased gfn or two.
+ */
+ const int capacity = SPLIT_DESC_CACHE_MIN_NR_OBJECTS +
+ KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE;
+ int r;
+
+ lockdep_assert_held(&kvm->slots_lock);
+
+ r = __kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache, capacity,
+ SPLIT_DESC_CACHE_MIN_NR_OBJECTS);
+ if (r)
+ return r;
+
+ r = kvm_mmu_topup_memory_cache(&kvm->arch.split_page_header_cache, 1);
+ if (r)
+ return r;
+
+ return kvm_mmu_topup_memory_cache(&kvm->arch.split_shadow_page_cache, 1);
+}
+
+static struct kvm_mmu_page *shadow_mmu_get_sp_for_split(struct kvm *kvm, u64 *huge_sptep)
+{
+ struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+ struct shadow_page_caches caches = {};
+ union kvm_mmu_page_role role;
+ unsigned int access;
+ gfn_t gfn;
+
+ gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
+ access = kvm_mmu_page_get_access(huge_sp, spte_index(huge_sptep));
+
+ /*
+ * Note, huge page splitting always uses direct shadow pages, regardless
+ * of whether the huge page itself is mapped by a direct or indirect
+ * shadow page, since the huge page region itself is being directly
+ * mapped with smaller pages.
+ */
+ role = kvm_mmu_child_role(huge_sptep, /*direct=*/true, access);
+
+ /* Direct SPs do not require a shadowed_info_cache. */
+ caches.page_header_cache = &kvm->arch.split_page_header_cache;
+ caches.shadow_page_cache = &kvm->arch.split_shadow_page_cache;
+
+ /* Safe to pass NULL for vCPU since requesting a direct SP. */
+ return __kvm_mmu_get_shadow_page(kvm, NULL, &caches, gfn, role);
+}
+
+static void shadow_mmu_split_huge_page(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ u64 *huge_sptep)
+{
+ struct kvm_mmu_memory_cache *cache = &kvm->arch.split_desc_cache;
+ u64 huge_spte = READ_ONCE(*huge_sptep);
+ struct kvm_mmu_page *sp;
+ bool flush = false;
+ u64 *sptep, spte;
+ gfn_t gfn;
+ int index;
+
+ sp = shadow_mmu_get_sp_for_split(kvm, huge_sptep);
+
+ for (index = 0; index < SPTE_ENT_PER_PAGE; index++) {
+ sptep = &sp->spt[index];
+ gfn = kvm_mmu_page_get_gfn(sp, index);
+
+ /*
+ * The SP may already have populated SPTEs, e.g. if this huge
+ * page is aliased by multiple sptes with the same access
+ * permissions. These entries are guaranteed to map the same
+ * gfn-to-pfn translation since the SP is direct, so no need to
+ * modify them.
+ *
+ * However, if a given SPTE points to a lower level page table,
+ * that lower level page table may only be partially populated.
+ * Installing such SPTEs would effectively unmap a portion of the
+ * huge page. Unmapping guest memory always requires a TLB flush
+ * since a subsequent operation on the unmapped regions would
+ * fail to detect the need to flush.
+ */
+ if (is_shadow_present_pte(*sptep)) {
+ flush |= !is_last_spte(*sptep, sp->role.level);
+ continue;
+ }
+
+ spte = make_small_spte(kvm, huge_spte, sp->role, index);
+ mmu_spte_set(sptep, spte);
+ __rmap_add(kvm, cache, slot, sptep, gfn, sp->role.access);
+ }
+
+ __link_shadow_page(kvm, cache, huge_sptep, sp, flush);
+}
+
+static int shadow_mmu_try_split_huge_page(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ u64 *huge_sptep)
+{
+ struct kvm_mmu_page *huge_sp = sptep_to_sp(huge_sptep);
+ int level, r = 0;
+ gfn_t gfn;
+ u64 spte;
+
+ /* Grab information for the tracepoint before dropping the MMU lock. */
+ gfn = kvm_mmu_page_get_gfn(huge_sp, spte_index(huge_sptep));
+ level = huge_sp->role.level;
+ spte = *huge_sptep;
+
+ if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES) {
+ r = -ENOSPC;
+ goto out;
+ }
+
+ if (need_topup_split_caches_or_resched(kvm)) {
+ write_unlock(&kvm->mmu_lock);
+ cond_resched();
+ /*
+ * If the topup succeeds, return -EAGAIN to indicate that the
+ * rmap iterator should be restarted because the MMU lock was
+ * dropped.
+ */
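+ /* "a ?: b" is the GNU ?: extension: evaluates to a if non-zero, else b. */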
+ r = topup_split_caches(kvm) ?: -EAGAIN;
+ write_lock(&kvm->mmu_lock);
+ goto out;
+ }
+
+ shadow_mmu_split_huge_page(kvm, slot, huge_sptep);
+
+out:
+ trace_kvm_mmu_split_huge_page(gfn, spte, level, r);
+ return r;
+}
+
+static bool shadow_mmu_try_split_huge_pages(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
+{
+ struct rmap_iterator iter;
+ struct kvm_mmu_page *sp;
+ u64 *huge_sptep;
+ int r;
+
+restart:
+ for_each_rmap_spte(rmap_head, &iter, huge_sptep) {
+ sp = sptep_to_sp(huge_sptep);
+
+ /* TDP MMU is enabled, so rmap only contains nested MMU SPs. */
+ if (WARN_ON_ONCE(!sp->role.guest_mode))
+ continue;
+
+ /* The rmaps should never contain non-leaf SPTEs. */
+ if (WARN_ON_ONCE(!is_large_pte(*huge_sptep)))
+ continue;
+
+ /* SPs with level > PG_LEVEL_4K should never be unsync. */
+ if (WARN_ON_ONCE(sp->unsync))
+ continue;
+
+ /* Don't bother splitting huge pages on invalid SPs. */
+ if (sp->role.invalid)
+ continue;
+
+ r = shadow_mmu_try_split_huge_page(kvm, slot, huge_sptep);
+
+ /*
+ * The split succeeded or needs to be retried because the MMU
+ * lock was dropped. Either way, restart the iterator to get it
+ * back into a consistent state.
+ */
+ if (!r || r == -EAGAIN)
+ goto restart;
+
+ /* The split failed and shouldn't be retried (e.g. -ENOMEM). */
+ break;
+ }
+
+ return false;
+}
+
+static void kvm_shadow_mmu_try_split_huge_pages(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ gfn_t start, gfn_t end,
+ int target_level)
+{
+ int level;
+
+ /*
+ * Split huge pages starting with KVM_MAX_HUGEPAGE_LEVEL and working
+ * down to the target level. This ensures pages are recursively split
+ * all the way to the target level. There's no need to split pages
+ * already at the target level.
+ */
+ for (level = KVM_MAX_HUGEPAGE_LEVEL; level > target_level; level--)
+ __walk_slot_rmaps(kvm, slot, shadow_mmu_try_split_huge_pages,
+ level, level, start, end - 1, true, true, false);
+}
+
+/* Must be called with the mmu_lock held in write-mode. */
+void kvm_mmu_try_split_huge_pages(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot,
+ u64 start, u64 end,
+ int target_level)
+{
+ if (!tdp_mmu_enabled)
+ return;
+
+ if (kvm_memslots_have_rmaps(kvm))
+ kvm_shadow_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+
+ kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, false);
+
+ /*
+ * A TLB flush is unnecessary at this point for the same reasons as in
+ * kvm_mmu_slot_try_split_huge_pages().
+ */
+}
+
+void kvm_mmu_slot_try_split_huge_pages(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot,
+ int target_level)
+{
+ u64 start = memslot->base_gfn;
+ u64 end = start + memslot->npages;
+
+ if (!tdp_mmu_enabled)
+ return;
+
+ if (kvm_memslots_have_rmaps(kvm)) {
+ write_lock(&kvm->mmu_lock);
+ kvm_shadow_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level);
+ write_unlock(&kvm->mmu_lock);
+ }
+
+ read_lock(&kvm->mmu_lock);
+ kvm_tdp_mmu_try_split_huge_pages(kvm, memslot, start, end, target_level, true);
+ read_unlock(&kvm->mmu_lock);
+
+ /*
+ * No TLB flush is necessary here. KVM will flush TLBs after
+ * write-protecting and/or clearing dirty on the newly split SPTEs to
+ * ensure that guest writes are reflected in the dirty log before the
+ * ioctl to enable dirty logging on this memslot completes. Since the
+ * split SPTEs retain the write and dirty bits of the huge SPTE, it is
+ * safe for KVM to decide if a TLB flush is necessary based on the split
+ * SPTEs.
+ */
+}
+
+static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
+ struct kvm_rmap_head *rmap_head,
+ const struct kvm_memory_slot *slot)
+{
+ u64 *sptep;
+ struct rmap_iterator iter;
+ int need_tlb_flush = 0;
+ struct kvm_mmu_page *sp;
+
+restart:
+ for_each_rmap_spte(rmap_head, &iter, sptep) {
+ sp = sptep_to_sp(sptep);
+
+ /*
+ * Huge page mappings cannot be created for indirect shadow
+ * pages, which are found on the last rmap (level = 1) when not
+ * using TDP; such shadow pages are kept in sync with the guest
+ * page table, and the guest page table uses a 4K page size
+ * mapping if the indirect sp has level = 1.
+ */
+ if (sp->role.direct &&
+ sp->role.level < kvm_mmu_max_mapping_level(kvm, NULL, slot, sp->gfn)) {
+ kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
+
+ if (kvm_available_flush_remote_tlbs_range())
+ kvm_flush_remote_tlbs_sptep(kvm, sptep);
+ else
+ need_tlb_flush = 1;
+
+ goto restart;
+ }
+ }
+
+ return need_tlb_flush;
+}
+
+static void kvm_rmap_zap_collapsible_sptes(struct kvm *kvm,
+ const struct kvm_memory_slot *slot)
+{
+ /*
+ * Note, use KVM_MAX_HUGEPAGE_LEVEL - 1 since there's no need to zap
+ * pages that are already mapped at the maximum hugepage level.
+ */
+ if (walk_slot_rmaps(kvm, slot, kvm_mmu_zap_collapsible_spte,
+ PG_LEVEL_4K, KVM_MAX_HUGEPAGE_LEVEL - 1, true))
+ kvm_flush_remote_tlbs_memslot(kvm, slot);
+}
+
+void kvm_mmu_recover_huge_pages(struct kvm *kvm,
+ const struct kvm_memory_slot *slot)
+{
+ if (kvm_memslots_have_rmaps(kvm)) {
+ write_lock(&kvm->mmu_lock);
+ kvm_rmap_zap_collapsible_sptes(kvm, slot);
+ write_unlock(&kvm->mmu_lock);
+ }
+
+ if (tdp_mmu_enabled) {
+ read_lock(&kvm->mmu_lock);
+ kvm_tdp_mmu_recover_huge_pages(kvm, slot);
+ read_unlock(&kvm->mmu_lock);
+ }
+}
+
+void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
+ const struct kvm_memory_slot *memslot)
+{
+ if (kvm_memslots_have_rmaps(kvm)) {
+ write_lock(&kvm->mmu_lock);
+ /*
+ * Clear dirty bits only on 4k SPTEs since the legacy MMU only
+ * supports dirty logging at a 4k granularity.
+ */
+ walk_slot_rmaps_4k(kvm, memslot, __rmap_clear_dirty, false);
+ write_unlock(&kvm->mmu_lock);
+ }
+
+ if (tdp_mmu_enabled) {
+ read_lock(&kvm->mmu_lock);
+ kvm_tdp_mmu_clear_dirty_slot(kvm, memslot);
+ read_unlock(&kvm->mmu_lock);
+ }
+
+ /*
+ * The caller will flush the TLBs after this function returns.
+ *
+ * It's also safe to flush TLBs out of mmu lock here as currently this
+ * function is only used for dirty logging, in which case flushing TLB
+ * out of mmu lock also guarantees no dirty pages will be lost in
+ * dirty_bitmap.
+ */
+}
+
+static void kvm_mmu_zap_all(struct kvm *kvm)
+{
+ struct kvm_mmu_page *sp, *node;
+ LIST_HEAD(invalid_list);
+ int ign;
+
+ write_lock(&kvm->mmu_lock);
+restart:
+ list_for_each_entry_safe(sp, node, &kvm->arch.active_mmu_pages, link) {
+ if (WARN_ON_ONCE(sp->role.invalid))
+ continue;
+ if (__kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list, &ign))
+ goto restart;
+ if (cond_resched_rwlock_write(&kvm->mmu_lock))
+ goto restart;
+ }
+
+ kvm_mmu_commit_zap_page(kvm, &invalid_list);
+
+ if (tdp_mmu_enabled)
+ kvm_tdp_mmu_zap_all(kvm);
+
+ write_unlock(&kvm->mmu_lock);
+}
+
+void kvm_arch_flush_shadow_all(struct kvm *kvm)
+{
+ kvm_mmu_zap_all(kvm);
+}
+
+static void kvm_mmu_zap_memslot_pages_and_flush(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ bool flush)
+{
+ LIST_HEAD(invalid_list);
+ unsigned long i;
+
+ if (list_empty(&kvm->arch.active_mmu_pages))
+ goto out_flush;
+
+ /*
+ * Since accounting information is stored in struct kvm_arch_memory_slot,
+ * all MMU pages that are shadowing guest PTEs must be zapped before the
+ * memslot is deleted, as freeing such pages after the memslot is freed
+ * will result in use-after-free, e.g. in unaccount_shadowed().
+ */
+ for (i = 0; i < slot->npages; i++) {
+ struct kvm_mmu_page *sp;
+ gfn_t gfn = slot->base_gfn + i;
+
+ for_each_gfn_valid_sp_with_gptes(kvm, sp, gfn)
+ kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+
+ if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
+ kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+ flush = false;
+ cond_resched_rwlock_write(&kvm->mmu_lock);
+ }
+ }
+
+out_flush:
+ kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+}
+
+static void kvm_mmu_zap_memslot(struct kvm *kvm,
+ struct kvm_memory_slot *slot)
+{
+ struct kvm_gfn_range range = {
+ .slot = slot,
+ .start = slot->base_gfn,
+ .end = slot->base_gfn + slot->npages,
+ .may_block = true,
+ .attr_filter = KVM_FILTER_PRIVATE | KVM_FILTER_SHARED,
+ };
+ bool flush;
+
+ write_lock(&kvm->mmu_lock);
+ flush = kvm_unmap_gfn_range(kvm, &range);
+ kvm_mmu_zap_memslot_pages_and_flush(kvm, slot, flush);
+ write_unlock(&kvm->mmu_lock);
+}
+
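+/*
+ * Historically KVM zapped all shadow pages on any memslot deletion. Keep
+ * that behavior only for KVM_X86_DEFAULT_VM VMs with the (default-on)
+ * KVM_X86_QUIRK_SLOT_ZAP_ALL quirk; otherwise zap only the deleted slot.
+ */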
+static inline bool kvm_memslot_flush_zap_all(struct kvm *kvm)
+{
+ return kvm->arch.vm_type == KVM_X86_DEFAULT_VM &&
+ kvm_check_has_quirk(kvm, KVM_X86_QUIRK_SLOT_ZAP_ALL);
+}
+
+void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
+ struct kvm_memory_slot *slot)
+{
+ if (kvm_memslot_flush_zap_all(kvm))
+ kvm_mmu_zap_all_fast(kvm);
+ else
+ kvm_mmu_zap_memslot(kvm, slot);
+}
+
+void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
+{
+ WARN_ON_ONCE(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
+
+ if (!enable_mmio_caching)
+ return;
+
+ gen &= MMIO_SPTE_GEN_MASK;
+
+ /*
+ * Generation numbers are incremented in multiples of the number of
+ * address spaces in order to provide unique generations across all
+ * address spaces. Strip what is effectively the address space
+ * modifier prior to checking for a wrap of the MMIO generation so
+ * that a wrap in any address space is detected.
+ */
+ gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1);
+
+ /*
+ * The very rare case: if the MMIO generation number has wrapped,
+ * zap all shadow pages.
+ */
+ if (unlikely(gen == 0)) {
+ kvm_debug_ratelimited("zapping shadow pages for mmio generation wraparound\n");
+ kvm_mmu_zap_all_fast(kvm);
+ }
+}
+
+static void mmu_destroy_caches(void)
+{
+ kmem_cache_destroy(pte_list_desc_cache);
+ kmem_cache_destroy(mmu_page_header_cache);
+}
+
+static void kvm_wake_nx_recovery_thread(struct kvm *kvm)
+{
+ /*
+ * The NX recovery thread is spawned on-demand at the first KVM_RUN and
+ * may not be valid even though the VM is globally visible. Do nothing,
+ * as such a VM can't have any possible NX huge pages.
+ */
+ struct vhost_task *nx_thread = READ_ONCE(kvm->arch.nx_huge_page_recovery_thread);
+
+ if (nx_thread)
+ vhost_task_wake(nx_thread);
+}
+
+static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
+{
+ if (nx_hugepage_mitigation_hard_disabled)
+ return sysfs_emit(buffer, "never\n");
+
+ return param_get_bool(buffer, kp);
+}
+
+static bool get_nx_auto_mode(void)
+{
+ /* Return true when CPU has the bug, and mitigations are ON */
+ return boot_cpu_has_bug(X86_BUG_ITLB_MULTIHIT) && !cpu_mitigations_off();
+}
+
+static void __set_nx_huge_pages(bool val)
+{
+ nx_huge_pages = itlb_multihit_kvm_mitigation = val;
+}
+
+static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
+{
+ bool old_val = nx_huge_pages;
+ bool new_val;
+
+ if (nx_hugepage_mitigation_hard_disabled)
+ return -EPERM;
+
+ /* In "auto" mode deploy workaround only if CPU has the bug. */
+ if (sysfs_streq(val, "off")) {
+ new_val = 0;
+ } else if (sysfs_streq(val, "force")) {
+ new_val = 1;
+ } else if (sysfs_streq(val, "auto")) {
+ new_val = get_nx_auto_mode();
+ } else if (sysfs_streq(val, "never")) {
+ new_val = 0;
+
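+ /*
+ * Hard-disabling is only allowed while no VMs exist; the recovery
+ * task is conditionally created per-VM, so flipping to "never" with
+ * live VMs would be racy.
+ */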
+ mutex_lock(&kvm_lock);
+ if (!list_empty(&vm_list)) {
+ mutex_unlock(&kvm_lock);
+ return -EBUSY;
+ }
+ nx_hugepage_mitigation_hard_disabled = true;
+ mutex_unlock(&kvm_lock);
+ } else if (kstrtobool(val, &new_val) < 0) {
+ return -EINVAL;
+ }
+
+ __set_nx_huge_pages(new_val);
+
+ if (new_val != old_val) {
+ struct kvm *kvm;
+
+ mutex_lock(&kvm_lock);
+
+ list_for_each_entry(kvm, &vm_list, vm_list) {
+ mutex_lock(&kvm->slots_lock);
+ kvm_mmu_zap_all_fast(kvm);
+ mutex_unlock(&kvm->slots_lock);
+
+ kvm_wake_nx_recovery_thread(kvm);
+ }
+ mutex_unlock(&kvm_lock);
+ }
+
+ return 0;
+}
+
+/*
+ * nx_huge_pages needs to be resolved to true/false when kvm.ko is loaded, as
+ * its default value of -1 is technically undefined behavior for a boolean.
+ * Forward the module init call to SPTE code so that it too can handle module
+ * params that need to be resolved/snapshot.
+ */
+void __init kvm_mmu_x86_module_init(void)
+{
+ if (nx_huge_pages == -1)
+ __set_nx_huge_pages(get_nx_auto_mode());
+
+ /*
+ * Snapshot userspace's desire to enable the TDP MMU. Whether or not the
+ * TDP MMU is actually enabled is determined in kvm_configure_mmu()
+ * when the vendor module is loaded.
+ */
+ tdp_mmu_allowed = tdp_mmu_enabled;
+
+ kvm_mmu_spte_module_init();
+}
+
+/*
+ * The bulk of the MMU initialization is deferred until the vendor module is
+ * loaded as many of the masks/values may be modified by VMX or SVM, i.e. need
+ * to be reset when a potentially different vendor module is loaded.
+ */
+int kvm_mmu_vendor_module_init(void)
+{
+ int ret = -ENOMEM;
+
+ /*
+ * MMU roles use union aliasing, which is, generally speaking,
+ * undefined behavior. However, we supposedly know how compilers behave
+ * and the current status quo is unlikely to change. The build-time
+ * assertions below are meant to let us know if that assumption ever
+ * becomes false.
+ */
+ BUILD_BUG_ON(sizeof(union kvm_mmu_page_role) != sizeof(u32));
+ BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32));
+ BUILD_BUG_ON(sizeof(union kvm_cpu_role) != sizeof(u64));
+
+ kvm_mmu_reset_all_pte_masks();
+
+ pte_list_desc_cache = KMEM_CACHE(pte_list_desc, SLAB_ACCOUNT);
+ if (!pte_list_desc_cache)
+ goto out;
+
+ mmu_page_header_cache = kmem_cache_create("kvm_mmu_page_header",
+ sizeof(struct kvm_mmu_page),
+ 0, SLAB_ACCOUNT, NULL);
+ if (!mmu_page_header_cache)
+ goto out;
+
+ return 0;
+
+out:
+ mmu_destroy_caches();
+ return ret;
+}
+
+void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
+{
+ kvm_mmu_unload(vcpu);
+ if (tdp_mmu_enabled) {
+ read_lock(&vcpu->kvm->mmu_lock);
+ mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->mirror_root_hpa,
+ NULL);
+ read_unlock(&vcpu->kvm->mmu_lock);
+ }
+ free_mmu_pages(&vcpu->arch.root_mmu);
+ free_mmu_pages(&vcpu->arch.guest_mmu);
+ mmu_free_memory_caches(vcpu);
+}
+
+void kvm_mmu_vendor_module_exit(void)
+{
+ mmu_destroy_caches();
+}
+
+/*
+ * Calculate the effective recovery period, accounting for '0' meaning "let KVM
+ * select a halving time of 1 hour". Returns true if recovery is enabled.
+ */
+static bool calc_nx_huge_pages_recovery_period(uint *period)
+{
+ /*
+ * Use READ_ONCE to get the params, this may be called outside of the
+ * param setters, e.g. by the kthread to compute its next timeout.
+ */
+ bool enabled = READ_ONCE(nx_huge_pages);
+ uint ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
+
+ if (!enabled || !ratio)
+ return false;
+
+ *period = READ_ONCE(nx_huge_pages_recovery_period_ms);
+ if (!*period) {
+ /*
+ * Cap the ratio at 3600 so the computed period is never less
+ * than one second (60 * 60 * 1000 / 3600 == 1000ms).
+ */
+ ratio = min(ratio, 3600u);
+ *period = 60 * 60 * 1000 / ratio;
+ }
+ return true;
+}
+
+static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel_param *kp)
+{
+ bool was_recovery_enabled, is_recovery_enabled;
+ uint old_period, new_period;
+ int err;
+
+ if (nx_hugepage_mitigation_hard_disabled)
+ return -EPERM;
+
+ was_recovery_enabled = calc_nx_huge_pages_recovery_period(&old_period);
+
+ err = param_set_uint(val, kp);
+ if (err)
+ return err;
+
+ is_recovery_enabled = calc_nx_huge_pages_recovery_period(&new_period);
+
+ if (is_recovery_enabled &&
+ (!was_recovery_enabled || old_period > new_period)) {
+ struct kvm *kvm;
+
+ mutex_lock(&kvm_lock);
+
+ list_for_each_entry(kvm, &vm_list, vm_list)
+ kvm_wake_nx_recovery_thread(kvm);
+
+ mutex_unlock(&kvm_lock);
+ }
+
+ return err;
+}
+
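+/*
+ * Zap 1/nx_huge_pages_recovery_ratio of the possible NX huge pages per
+ * recovery pass; a ratio of 0 disables recovery entirely.
+ */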
+static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
+ enum kvm_mmu_type mmu_type)
+{
+ unsigned long pages = READ_ONCE(kvm->arch.possible_nx_huge_pages[mmu_type].nr_pages);
+ unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
+
+ return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
+}
+
+static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
+ struct kvm_mmu_page *sp)
+{
+ struct kvm_memory_slot *slot;
+
+ /*
+ * Skip the memslot lookup if dirty tracking can't possibly be enabled,
+ * as memslot lookups are relatively expensive.
+ *
+ * If a memslot update is in progress, reading an incorrect value of
+ * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
+ * zero, KVM will do an unnecessary memslot lookup; if it is becoming
+ * nonzero, the page will be zapped unnecessarily. Either way, this
+ * only affects efficiency in racy situations, and not correctness.
+ */
+ if (!atomic_read(&kvm->nr_memslots_dirty_logging))
+ return false;
+
+ slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
+ if (WARN_ON_ONCE(!slot))
+ return false;
+
+ return kvm_slot_dirty_track_enabled(slot);
+}
+
+static void kvm_recover_nx_huge_pages(struct kvm *kvm,
+ const enum kvm_mmu_type mmu_type)
+{
+#ifdef CONFIG_X86_64
+ const bool is_tdp_mmu = mmu_type == KVM_TDP_MMU;
+ spinlock_t *tdp_mmu_pages_lock = &kvm->arch.tdp_mmu_pages_lock;
+#else
+ const bool is_tdp_mmu = false;
+ spinlock_t *tdp_mmu_pages_lock = NULL;
+#endif
+ unsigned long to_zap = nx_huge_pages_to_zap(kvm, mmu_type);
+ struct list_head *nx_huge_pages;
+ struct kvm_mmu_page *sp;
+ LIST_HEAD(invalid_list);
+ bool flush = false;
+ int rcu_idx;
+
+ nx_huge_pages = &kvm->arch.possible_nx_huge_pages[mmu_type].pages;
+
+ rcu_idx = srcu_read_lock(&kvm->srcu);
+ if (is_tdp_mmu)
+ read_lock(&kvm->mmu_lock);
+ else
+ write_lock(&kvm->mmu_lock);
+
+ /*
+ * Zapping TDP MMU shadow pages, including the remote TLB flush, must
+ * be done under RCU protection, because the pages are freed via RCU
+ * callback.
+ */
+ rcu_read_lock();
+
+ for ( ; to_zap; --to_zap) {
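+ /*
+ * The TDP MMU updates the possible_nx_huge_pages list while
+ * holding mmu_lock only for read, so the list must additionally
+ * be protected by tdp_mmu_pages_lock.
+ */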
+ if (is_tdp_mmu)
+ spin_lock(tdp_mmu_pages_lock);
+
+ if (list_empty(nx_huge_pages)) {
+ if (is_tdp_mmu)
+ spin_unlock(tdp_mmu_pages_lock);
+ break;
+ }
+
+ /*
+ * We use a separate list instead of just using active_mmu_pages
+ * because the number of shadow pages that can be replaced with an
+ * NX huge page is expected to be relatively small compared to
+ * the total number of shadow pages, and because the TDP MMU
+ * doesn't use active_mmu_pages.
+ */
+ sp = list_first_entry(nx_huge_pages,
+ struct kvm_mmu_page,
+ possible_nx_huge_page_link);
+ WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
+ WARN_ON_ONCE(!sp->role.direct);
+
+ unaccount_nx_huge_page(kvm, sp);
+
+ if (is_tdp_mmu)
+ spin_unlock(tdp_mmu_pages_lock);
+
+ /*
+ * Do not attempt to recover any NX Huge Pages that are being
+ * dirty tracked, as they would just be faulted back in as 4KiB
+ * pages. The NX Huge Pages in this slot will be recovered,
+ * along with all the other huge pages in the slot, when dirty
+ * logging is disabled.
+ */
+ if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp)) {
+ if (is_tdp_mmu)
+ flush |= kvm_tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
+ else
+ kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
+ }
+
+ WARN_ON_ONCE(sp->nx_huge_page_disallowed);
+
+ if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
+ kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+ rcu_read_unlock();
+
+ if (is_tdp_mmu)
+ cond_resched_rwlock_read(&kvm->mmu_lock);
+ else
+ cond_resched_rwlock_write(&kvm->mmu_lock);
+
+ flush = false;
+ rcu_read_lock();
+ }
+ }
+ kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
+
+ rcu_read_unlock();
+
+ if (is_tdp_mmu)
+ read_unlock(&kvm->mmu_lock);
+ else
+ write_unlock(&kvm->mmu_lock);
+ srcu_read_unlock(&kvm->srcu, rcu_idx);
+}
+
+static void kvm_nx_huge_page_recovery_worker_kill(void *data)
+{
+}
+
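+/*
+ * Body of the NX recovery vhost task: returning true asks the vhost_task
+ * core to invoke the worker again; returning false lets the task sleep
+ * until it is explicitly woken, e.g. via kvm_wake_nx_recovery_thread().
+ */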
+static bool kvm_nx_huge_page_recovery_worker(void *data)
+{
+ struct kvm *kvm = data;
+ long remaining_time;
+ bool enabled;
+ uint period;
+ int i;
+
+ enabled = calc_nx_huge_pages_recovery_period(&period);
+ if (!enabled)
+ return false;
+
+ remaining_time = kvm->arch.nx_huge_page_last + msecs_to_jiffies(period)
+ - get_jiffies_64();
+ if (remaining_time > 0) {
+ schedule_timeout(remaining_time);
+ /* check for signals and come back */
+ return true;
+ }
+
+ __set_current_state(TASK_RUNNING);
+ for (i = 0; i < KVM_NR_MMU_TYPES; ++i)
+ kvm_recover_nx_huge_pages(kvm, i);
+ kvm->arch.nx_huge_page_last = get_jiffies_64();
+ return true;
+}
+
+static int kvm_mmu_start_lpage_recovery(struct once *once)
+{
+ struct kvm_arch *ka = container_of(once, struct kvm_arch, nx_once);
+ struct kvm *kvm = container_of(ka, struct kvm, arch);
+ struct vhost_task *nx_thread;
+
+ kvm->arch.nx_huge_page_last = get_jiffies_64();
+ nx_thread = vhost_task_create(kvm_nx_huge_page_recovery_worker,
+ kvm_nx_huge_page_recovery_worker_kill,
+ kvm, "kvm-nx-lpage-recovery");
+
+ if (IS_ERR(nx_thread))
+ return PTR_ERR(nx_thread);
+
+ vhost_task_start(nx_thread);
+
+ /* Make the task visible only once it is fully started. */
+ WRITE_ONCE(kvm->arch.nx_huge_page_recovery_thread, nx_thread);
+ return 0;
+}
+
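+/* Invoked at the first KVM_RUN; call_once() starts recovery once per VM. */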
+int kvm_mmu_post_init_vm(struct kvm *kvm)
+{
+ if (nx_hugepage_mitigation_hard_disabled)
+ return 0;
+
+ return call_once(&kvm->arch.nx_once, kvm_mmu_start_lpage_recovery);
+}
+
+void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
+{
+ if (kvm->arch.nx_huge_page_recovery_thread)
+ vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
+}
+
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
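+/*
+ * KVM_LPAGE_MIXED_FLAG is set in disallow_lpage when a hugepage-sized range
+ * covers a mix of private and shared attributes and thus can't be mapped by
+ * a single hugepage.
+ */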
+static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+ int level)
+{
+ return lpage_info_slot(gfn, slot, level)->disallow_lpage & KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_clear_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+ int level)
+{
+ lpage_info_slot(gfn, slot, level)->disallow_lpage &= ~KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+ int level)
+{
+ lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG;
+}
+
+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
+{
+ struct kvm_memory_slot *slot = range->slot;
+ int level;
+
+ /*
+ * Zap SPTEs even if the slot can't be mapped PRIVATE. KVM x86 only
+ * supports KVM_MEMORY_ATTRIBUTE_PRIVATE, and so it *seems* like KVM
+ * can simply ignore such slots. But if userspace is making memory
+ * PRIVATE, then KVM must prevent the guest from accessing the memory
+ * as shared. And if userspace is making memory SHARED and this point
+ * is reached, then at least one page within the range was previously
+ * PRIVATE, i.e. the slot's possible hugepage ranges are changing.
+ * Zapping SPTEs in this case ensures KVM will reassess whether or not
+ * a hugepage can be used for affected ranges.
+ */
+ if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+ return false;
+
+ if (WARN_ON_ONCE(range->end <= range->start))
+ return false;
+
+ /*
+ * If the head and tail pages of the range currently allow a hugepage,
+ * i.e. reside fully in the slot and don't have mixed attributes, then
+ * add each corresponding hugepage range to the ongoing invalidation,
+ * e.g. to prevent KVM from creating a hugepage in response to a fault
+ * for a gfn whose attributes aren't changing. Note, only the range
+ * of gfns whose attributes are being modified needs to be explicitly
+ * unmapped, as that will unmap any existing hugepages.
+ */
+ for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+ gfn_t start = gfn_round_for_level(range->start, level);
+ gfn_t end = gfn_round_for_level(range->end - 1, level);
+ gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level);
+
+ if ((start != range->start || start + nr_pages > range->end) &&
+ start >= slot->base_gfn &&
+ start + nr_pages <= slot->base_gfn + slot->npages &&
+ !hugepage_test_mixed(slot, start, level))
+ kvm_mmu_invalidate_range_add(kvm, start, start + nr_pages);
+
+ if (end == start)
+ continue;
+
+ if ((end + nr_pages) > range->end &&
+ (end + nr_pages) <= (slot->base_gfn + slot->npages) &&
+ !hugepage_test_mixed(slot, end, level))
+ kvm_mmu_invalidate_range_add(kvm, end, end + nr_pages);
+ }
+
+ /* Unmap the old attribute page. */
+ if (range->arg.attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)
+ range->attr_filter = KVM_FILTER_SHARED;
+ else
+ range->attr_filter = KVM_FILTER_PRIVATE;
+
+ return kvm_unmap_gfn_range(kvm, range);
+}
+
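+/*
+ * Check whether every gfn in the hugepage at @gfn/@level has exactly @attrs.
+ * Levels above 2M reuse the mixed tracking already computed for the next
+ * lower level rather than rescanning individual 4KiB pages.
+ */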
+static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, int level, unsigned long attrs)
+{
+ const unsigned long start = gfn;
+ const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
+
+ if (level == PG_LEVEL_2M)
+ return kvm_range_has_memory_attributes(kvm, start, end, ~0, attrs);
+
+ for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
+ if (hugepage_test_mixed(slot, gfn, level - 1) ||
+ attrs != kvm_get_memory_attributes(kvm, gfn))
+ return false;
+ }
+ return true;
+}
+
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+ struct kvm_gfn_range *range)
+{
+ unsigned long attrs = range->arg.attributes;
+ struct kvm_memory_slot *slot = range->slot;
+ int level;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ lockdep_assert_held(&kvm->slots_lock);
+
+ /*
+ * Calculate which ranges can be mapped with hugepages even if the slot
+ * can't map memory PRIVATE. KVM mustn't create a SHARED hugepage over
+ * a range that has PRIVATE GFNs, and conversely converting a range to
+ * SHARED may now allow hugepages.
+ */
+ if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm)))
+ return false;
+
+ /*
+ * The sequence matters here: upper levels consume the result of lower
+ * level's scanning.
+ */
+ for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+ gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level);
+ gfn_t gfn = gfn_round_for_level(range->start, level);
+
+ /* Process the head page if it straddles the range. */
+ if (gfn != range->start || gfn + nr_pages > range->end) {
+ /*
+ * Skip mixed tracking if the aligned gfn isn't covered
+ * by the memslot, KVM can't use a hugepage due to the
+ * misaligned address regardless of memory attributes.
+ */
+ if (gfn >= slot->base_gfn &&
+ gfn + nr_pages <= slot->base_gfn + slot->npages) {
+ if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+ hugepage_clear_mixed(slot, gfn, level);
+ else
+ hugepage_set_mixed(slot, gfn, level);
+ }
+ gfn += nr_pages;
+ }
+
+ /*
+ * Pages entirely covered by the range are guaranteed to have
+ * only the attributes which were just set.
+ */
+ for ( ; gfn + nr_pages <= range->end; gfn += nr_pages)
+ hugepage_clear_mixed(slot, gfn, level);
+
+ /*
+ * Process the last tail page if it straddles the range and is
+ * contained by the memslot. Like the head page, KVM can't
+ * create a hugepage if the slot size is misaligned.
+ */
+ if (gfn < range->end &&
+ (gfn + nr_pages) <= (slot->base_gfn + slot->npages)) {
+ if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+ hugepage_clear_mixed(slot, gfn, level);
+ else
+ hugepage_set_mixed(slot, gfn, level);
+ }
+ }
+ return false;
+}
+
+void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
+ struct kvm_memory_slot *slot)
+{
+ int level;
+
+ if (!kvm_arch_has_private_mem(kvm))
+ return;
+
+ for (level = PG_LEVEL_2M; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+ /*
+ * Don't bother tracking mixed attributes for pages that can't
+ * be huge due to alignment, i.e. process only pages that are
+ * entirely contained by the memslot.
+ */
+ gfn_t end = gfn_round_for_level(slot->base_gfn + slot->npages, level);
+ gfn_t start = gfn_round_for_level(slot->base_gfn, level);
+ gfn_t nr_pages = KVM_PAGES_PER_HPAGE(level);
+ gfn_t gfn;
+
+ if (start < slot->base_gfn)
+ start += nr_pages;
+
+ /*
+ * Unlike setting attributes, every potential hugepage needs to
+ * be manually checked as the attributes may already be mixed.
+ */
+ for (gfn = start; gfn < end; gfn += nr_pages) {
+ unsigned long attrs = kvm_get_memory_attributes(kvm, gfn);
+
+ if (hugepage_has_attrs(kvm, slot, gfn, level, attrs))
+ hugepage_clear_mixed(slot, gfn, level);
+ else
+ hugepage_set_mixed(slot, gfn, level);
+ }
+ }
+}
+#endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
new file mode 100644
index 000000000000..73cdcbccc89e
--- /dev/null
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -0,0 +1,414 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_MMU_INTERNAL_H
+#define __KVM_X86_MMU_INTERNAL_H
+
+#include <linux/types.h>
+#include <linux/kvm_host.h>
+#include <asm/kvm_host.h>
+
+#include "mmu.h"
+
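+/*
+ * KVM_MMU_WARN_ON() is a full WARN_ON_ONCE() only with CONFIG_KVM_PROVE_MMU;
+ * otherwise BUILD_BUG_ON_INVALID() type-checks the expression without
+ * generating any code.
+ */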
+#ifdef CONFIG_KVM_PROVE_MMU
+#define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
+#else
+#define KVM_MMU_WARN_ON(x) BUILD_BUG_ON_INVALID(x)
+#endif
+
+/* Page table builder macros common to shadow (host) PTEs and guest PTEs. */
+#define __PT_BASE_ADDR_MASK GENMASK_ULL(51, 12)
+#define __PT_LEVEL_SHIFT(level, bits_per_level) \
+ (PAGE_SHIFT + ((level) - 1) * (bits_per_level))
+#define __PT_INDEX(address, level, bits_per_level) \
+ (((address) >> __PT_LEVEL_SHIFT(level, bits_per_level)) & ((1 << (bits_per_level)) - 1))
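+/* e.g. with 9 bits per level, __PT_INDEX(addr, 1, 9) extracts addr bits 20:12. */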
+
+#define __PT_LVL_ADDR_MASK(base_addr_mask, level, bits_per_level) \
+ ((base_addr_mask) & ~((1ULL << (PAGE_SHIFT + (((level) - 1) * (bits_per_level)))) - 1))
+
+#define __PT_LVL_OFFSET_MASK(base_addr_mask, level, bits_per_level) \
+ ((base_addr_mask) & ((1ULL << (PAGE_SHIFT + (((level) - 1) * (bits_per_level)))) - 1))
+
+#define __PT_ENT_PER_PAGE(bits_per_level) (1 << (bits_per_level))
+
+/*
+ * Unlike regular MMU roots, PAE "roots", a.k.a. PDPTEs/PDPTRs, have a PRESENT
+ * bit, and thus are guaranteed to be non-zero when valid. And, when a guest
+ * PDPTR is !PRESENT, its corresponding PAE root cannot be set to INVALID_PAGE,
+ * as the CPU would treat that as PRESENT PDPTR with reserved bits set. Use
+ * '0' instead of INVALID_PAGE to indicate an invalid PAE root.
+ */
+#define INVALID_PAE_ROOT 0
+#define IS_VALID_PAE_ROOT(x) (!!(x))
+
+typedef u64 __rcu *tdp_ptep_t;
+
+struct kvm_mmu_page {
+ /*
+ * Note, "link" through "spt" fit in a single 64 byte cache line on
+ * 64-bit kernels, keep it that way unless there's a reason not to.
+ */
+ struct list_head link;
+ struct hlist_node hash_link;
+
+ bool tdp_mmu_page;
+ bool unsync;
+ union {
+ u8 mmu_valid_gen;
+
+ /* Only accessed under slots_lock. */
+ bool tdp_mmu_scheduled_root_to_zap;
+ };
+
+ /*
+ * The shadow page can't be replaced by an equivalent huge page
+ * because it is being used to map an executable page in the guest
+ * and the NX huge page mitigation is enabled.
+ */
+ bool nx_huge_page_disallowed;
+
+ /*
+ * The following two entries are used to key the shadow page in the
+ * hash table.
+ */
+ union kvm_mmu_page_role role;
+ gfn_t gfn;
+
+ u64 *spt;
+
+ /*
+ * Stores the result of the guest translation being shadowed by each
+ * SPTE. KVM shadows two types of guest translations: nGPA -> GPA
+ * (shadow EPT/NPT) and GVA -> GPA (traditional shadow paging). In both
+ * cases the result of the translation is a GPA and a set of access
+ * constraints.
+ *
+ * The GFN is stored in the upper bits (PAGE_SHIFT) and the shadowed
+ * access permissions are stored in the lower bits. Note, for
+ * convenience and uniformity across guests, the access permissions are
+ * stored in KVM format (e.g. ACC_EXEC_MASK) not the raw guest format.
+ */
+ u64 *shadowed_translation;
+
+ /* Currently serving as active root */
+ union {
+ int root_count;
+ refcount_t tdp_mmu_root_count;
+ };
+
+ bool has_mapped_host_mmio;
+
+ union {
+ /* These two members aren't used for TDP MMU */
+ struct {
+ unsigned int unsync_children;
+ /*
+ * Number of writes since the last time traversal
+ * visited this page.
+ */
+ atomic_t write_flooding_count;
+ };
+ /*
+ * Page table page of external PT.
+ * Passed to TDX module, not accessed by KVM.
+ */
+ void *external_spt;
+ };
+ union {
+ struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
+ tdp_ptep_t ptep;
+ };
+ DECLARE_BITMAP(unsync_child_bitmap, 512);
+
+ /*
+ * Tracks shadow pages that, if zapped, would allow KVM to create an NX
+ * huge page. A shadow page will have nx_huge_page_disallowed set but
+ * not be on the list if a huge page is disallowed for other reasons,
+ * e.g. because KVM is shadowing a PTE at the same gfn, the memslot
+ * isn't properly aligned, etc...
+ */
+ struct list_head possible_nx_huge_page_link;
+#ifdef CONFIG_X86_32
+ /*
+ * Used out of the mmu-lock to avoid reading spte values while an
+ * update is in progress; see the comments in __get_spte_lockless().
+ */
+ int clear_spte_count;
+#endif
+
+#ifdef CONFIG_X86_64
+ /* Used for freeing the page asynchronously if it is a TDP MMU page. */
+ struct rcu_head rcu_head;
+#endif
+};
+
+extern struct kmem_cache *mmu_page_header_cache;
+
+static inline int kvm_mmu_role_as_id(union kvm_mmu_page_role role)
+{
+ return role.smm ? 1 : 0;
+}
+
+static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
+{
+ return kvm_mmu_role_as_id(sp->role);
+}
+
+static inline bool is_mirror_sp(const struct kvm_mmu_page *sp)
+{
+ return sp->role.is_mirror;
+}
+
+static inline void kvm_mmu_alloc_external_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+ /*
+ * external_spt is allocated for the TDX module to hold private EPT
+ * mappings; the TDX module will initialize the page by itself.
+ * Therefore, KVM does not need to initialize or access external_spt.
+ * KVM only interacts with sp->spt for private EPT operations.
+ */
+ sp->external_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_external_spt_cache);
+}
+
+static inline gfn_t kvm_gfn_root_bits(const struct kvm *kvm, const struct kvm_mmu_page *root)
+{
+ /*
+ * Since mirror SPs are used only for TDX, which maps private memory
+ * at its "natural" GFN, no mask needs to be applied to them - and, dually,
+ * we expect that the bits are only used for the shared PT.
+ */
+ if (is_mirror_sp(root))
+ return 0;
+ return kvm_gfn_direct_bits(kvm);
+}
+
+static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm *kvm,
+ struct kvm_mmu_page *sp)
+{
+ /*
+ * When using the EPT page-modification log, the GPAs in the CPU dirty
+ * log would come from L2 rather than L1. Therefore, we need to rely
+ * on write protection to record dirty pages, which bypasses PML, since
+ * writes now result in a vmexit. Note, the check on CPU dirty logging
+ * being enabled is mandatory as the bits used to denote WP-only SPTEs
+ * are reserved for PAE paging (32-bit KVM).
+ */
+ return kvm->arch.cpu_dirty_log_size && sp->role.guest_mode;
+}
+
+static inline gfn_t gfn_round_for_level(gfn_t gfn, int level)
+{
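+ /*
+ * KVM_PAGES_PER_HPAGE() is a power of two; its negation is a mask that
+ * aligns the gfn down to the start of the containing hugepage.
+ */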
+ return gfn & -KVM_PAGES_PER_HPAGE(level);
+}
+
+int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
+ gfn_t gfn, bool synchronizing, bool prefetch);
+
+void kvm_mmu_gfn_disallow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
+void kvm_mmu_gfn_allow_lpage(const struct kvm_memory_slot *slot, gfn_t gfn);
+bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
+ struct kvm_memory_slot *slot, u64 gfn,
+ int min_level);
+
+/* Flush the given page (huge or not) of guest memory. */
+static inline void kvm_flush_remote_tlbs_gfn(struct kvm *kvm, gfn_t gfn, int level)
+{
+ kvm_flush_remote_tlbs_range(kvm, gfn_round_for_level(gfn, level),
+ KVM_PAGES_PER_HPAGE(level));
+}
+
+unsigned int pte_list_count(struct kvm_rmap_head *rmap_head);
+
+extern int nx_huge_pages;
+static inline bool is_nx_huge_page_enabled(struct kvm *kvm)
+{
+ return READ_ONCE(nx_huge_pages) && !kvm->arch.disable_nx_huge_pages;
+}
+
+struct kvm_page_fault {
+ /* arguments to kvm_mmu_do_page_fault. */
+ const gpa_t addr;
+ const u64 error_code;
+ const bool prefetch;
+
+ /* Derived from error_code. */
+ const bool exec;
+ const bool write;
+ const bool present;
+ const bool rsvd;
+ const bool user;
+
+ /* Derived from mmu and global state. */
+ const bool is_tdp;
+ const bool is_private;
+ const bool nx_huge_page_workaround_enabled;
+
+ /*
+ * Whether a >4KB mapping can be created or is forbidden due to NX
+ * hugepages.
+ */
+ bool huge_page_disallowed;
+
+ /*
+ * Maximum page size that can be created for this fault; input to
+ * FNAME(fetch), direct_map() and kvm_tdp_mmu_map().
+ */
+ u8 max_level;
+
+ /*
+ * Page size that can be created based on the max_level and the
+ * page size used by the host mapping.
+ */
+ u8 req_level;
+
+ /*
+ * Page size that will be created based on the req_level and
+ * huge_page_disallowed.
+ */
+ u8 goal_level;
+
+ /*
+ * Shifted addr, or result of guest page table walk if addr is a gva. In
+ * the case of a VM where memslots can be mapped at multiple GPA aliases
+ * (i.e. TDX), the gfn field does not contain the bit that selects between
+ * the aliases (i.e. the shared bit for TDX).
+ */
+ gfn_t gfn;
+
+ /* The memslot containing gfn. May be NULL. */
+ struct kvm_memory_slot *slot;
+
+ /* Outputs of kvm_mmu_faultin_pfn(). */
+ unsigned long mmu_seq;
+ kvm_pfn_t pfn;
+ struct page *refcounted_page;
+ bool map_writable;
+
+ /*
+ * Indicates the guest is trying to write a gfn that contains one or
+ * more of the PTEs used to translate the write itself, i.e. the access
+ * is changing its own translation in the guest page tables.
+ */
+ bool write_fault_to_shadow_pgtable;
+};
+
+int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+
+/*
+ * Return values of handle_mmio_page_fault(), mmu.page_fault(), fast_page_fault(),
+ * and of course kvm_mmu_do_page_fault().
+ *
+ * RET_PF_CONTINUE: So far, so good, keep handling the page fault.
+ * RET_PF_RETRY: let CPU fault again on the address.
+ * RET_PF_EMULATE: mmio page fault, emulate the instruction directly.
+ * RET_PF_WRITE_PROTECTED: the gfn is write-protected, either unprotected the
+ * gfn and retry, or emulate the instruction directly.
+ * RET_PF_INVALID: the spte is invalid, let the real page fault path update it.
+ * RET_PF_FIXED: The faulting entry has been fixed.
+ * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU.
+ *
+ * Any names added to this enum should be exported to userspace for use in
+ * tracepoints via TRACE_DEFINE_ENUM() in mmutrace.h
+ *
+ * Note, all values must be greater than or equal to zero so as not to encroach
+ * on -errno return values.
+ */
+enum {
+ RET_PF_CONTINUE = 0,
+ RET_PF_RETRY,
+ RET_PF_EMULATE,
+ RET_PF_WRITE_PROTECTED,
+ RET_PF_INVALID,
+ RET_PF_FIXED,
+ RET_PF_SPURIOUS,
+};
+
+/*
+ * Define RET_PF_CONTINUE as 0 to allow for
+ * - efficient machine code when checking for CONTINUE, e.g.
+ * "TEST %rax, %rax, JNZ", as all "stop!" values are non-zero,
+ * - kvm_mmu_do_page_fault() to return other RET_PF_* as a positive value.
+ */
+static_assert(RET_PF_CONTINUE == 0);
+
+static inline void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
+ PAGE_SIZE, fault->write, fault->exec,
+ fault->is_private);
+}
+
+static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u64 err, bool prefetch,
+ int *emulation_type, u8 *level)
+{
+ struct kvm_page_fault fault = {
+ .addr = cr2_or_gpa,
+ .error_code = err,
+ .exec = err & PFERR_FETCH_MASK,
+ .write = err & PFERR_WRITE_MASK,
+ .present = err & PFERR_PRESENT_MASK,
+ .rsvd = err & PFERR_RSVD_MASK,
+ .user = err & PFERR_USER_MASK,
+ .prefetch = prefetch,
+ .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
+ .nx_huge_page_workaround_enabled =
+ is_nx_huge_page_enabled(vcpu->kvm),
+
+ .max_level = KVM_MAX_HUGEPAGE_LEVEL,
+ .req_level = PG_LEVEL_4K,
+ .goal_level = PG_LEVEL_4K,
+ .is_private = err & PFERR_PRIVATE_ACCESS,
+
+ .pfn = KVM_PFN_ERR_FAULT,
+ };
+ int r;
+
+ if (vcpu->arch.mmu->root_role.direct) {
+ /*
+ * Things like memslots don't understand the concept of a shared
+ * bit. Strip it so that the GFN can be used like normal, and the
+ * fault.addr can be used when the shared bit is needed.
+ */
+ fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_direct_bits(vcpu->kvm);
+ fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
+ }
+
+ /*
+ * With retpoline being active an indirect call is rather expensive,
+ * so do a direct call in the most common case.
+ */
+ if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) && fault.is_tdp)
+ r = kvm_tdp_page_fault(vcpu, &fault);
+ else
+ r = vcpu->arch.mmu->page_fault(vcpu, &fault);
+
+ /*
+ * Not sure what's happening, but punt to userspace and hope that
+ * they can fix it by changing memory to shared, or they can
+ * provide a better error.
+ */
+ if (r == RET_PF_EMULATE && fault.is_private) {
+ pr_warn_ratelimited("kvm: unexpected emulation request on private memory\n");
+ kvm_mmu_prepare_memory_fault_exit(vcpu, &fault);
+ return -EFAULT;
+ }
+
+ if (fault.write_fault_to_shadow_pgtable && emulation_type)
+ *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
+ if (level)
+ *level = fault.goal_level;
+
+ return r;
+}
+
+int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
+ const struct kvm_memory_slot *slot, gfn_t gfn);
+void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
+
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ enum kvm_mmu_type mmu_type);
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+ enum kvm_mmu_type mmu_type);
+
+#endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h
new file mode 100644
index 000000000000..764e3015d021
--- /dev/null
+++ b/arch/x86/kvm/mmu/mmutrace.h
@@ -0,0 +1,455 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#if !defined(_TRACE_KVMMMU_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_KVMMMU_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_events.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM kvmmmu
+
+#define KVM_MMU_PAGE_FIELDS \
+ __field(__u8, mmu_valid_gen) \
+ __field(__u64, gfn) \
+ __field(__u32, role) \
+ __field(__u32, root_count) \
+ __field(bool, unsync)
+
+#define KVM_MMU_PAGE_ASSIGN(sp) \
+ __entry->mmu_valid_gen = sp->mmu_valid_gen; \
+ __entry->gfn = sp->gfn; \
+ __entry->role = sp->role.word; \
+ __entry->root_count = sp->root_count; \
+ __entry->unsync = sp->unsync;
+
+#define KVM_MMU_PAGE_PRINTK() ({ \
+ const char *saved_ptr = trace_seq_buffer_ptr(p); \
+ static const char *access_str[] = { \
+ "---", "--x", "w--", "w-x", "-u-", "-ux", "wu-", "wux" \
+ }; \
+ union kvm_mmu_page_role role; \
+ \
+ role.word = __entry->role; \
+ \
+ trace_seq_printf(p, "sp gen %u gfn %llx l%u %u-byte q%u%s %s%s" \
+ " %snxe %sad root %u %s%c", \
+ __entry->mmu_valid_gen, \
+ __entry->gfn, role.level, \
+ role.has_4_byte_gpte ? 4 : 8, \
+ role.quadrant, \
+ role.direct ? " direct" : "", \
+ access_str[role.access], \
+ role.invalid ? " invalid" : "", \
+ role.efer_nx ? "" : "!", \
+ role.ad_disabled ? "!" : "", \
+ __entry->root_count, \
+ __entry->unsync ? "unsync" : "sync", 0); \
+ saved_ptr; \
+ })
+
+#define kvm_mmu_trace_pferr_flags \
+ { PFERR_PRESENT_MASK, "P" }, \
+ { PFERR_WRITE_MASK, "W" }, \
+ { PFERR_USER_MASK, "U" }, \
+ { PFERR_PK_MASK, "PK" }, \
+ { PFERR_SS_MASK, "SS" }, \
+ { PFERR_SGX_MASK, "SGX" }, \
+ { PFERR_RSVD_MASK, "RSVD" }, \
+ { PFERR_FETCH_MASK, "F" }
+
+TRACE_DEFINE_ENUM(RET_PF_CONTINUE);
+TRACE_DEFINE_ENUM(RET_PF_RETRY);
+TRACE_DEFINE_ENUM(RET_PF_EMULATE);
+TRACE_DEFINE_ENUM(RET_PF_WRITE_PROTECTED);
+TRACE_DEFINE_ENUM(RET_PF_INVALID);
+TRACE_DEFINE_ENUM(RET_PF_FIXED);
+TRACE_DEFINE_ENUM(RET_PF_SPURIOUS);
+
+/*
+ * A pagetable walk has started
+ */
+TRACE_EVENT(
+ kvm_mmu_pagetable_walk,
+ TP_PROTO(u64 addr, u32 pferr),
+ TP_ARGS(addr, pferr),
+
+ TP_STRUCT__entry(
+ __field(__u64, addr)
+ __field(__u32, pferr)
+ ),
+
+ TP_fast_assign(
+ __entry->addr = addr;
+ __entry->pferr = pferr;
+ ),
+
+ TP_printk("addr %llx pferr %x %s", __entry->addr, __entry->pferr,
+ __print_flags(__entry->pferr, "|", kvm_mmu_trace_pferr_flags))
+);
+
+
+/* We just walked a paging element */
+TRACE_EVENT(
+ kvm_mmu_paging_element,
+ TP_PROTO(u64 pte, int level),
+ TP_ARGS(pte, level),
+
+ TP_STRUCT__entry(
+ __field(__u64, pte)
+ __field(__u32, level)
+ ),
+
+ TP_fast_assign(
+ __entry->pte = pte;
+ __entry->level = level;
+ ),
+
+ TP_printk("pte %llx level %u", __entry->pte, __entry->level)
+);
+
+DECLARE_EVENT_CLASS(kvm_mmu_set_bit_class,
+
+ TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
+
+ TP_ARGS(table_gfn, index, size),
+
+ TP_STRUCT__entry(
+ __field(__u64, gpa)
+ ),
+
+ TP_fast_assign(
+ __entry->gpa = ((u64)table_gfn << PAGE_SHIFT)
+ + index * size;
+ ),
+
+ TP_printk("gpa %llx", __entry->gpa)
+);
+
+/* We set a pte accessed bit */
+DEFINE_EVENT(kvm_mmu_set_bit_class, kvm_mmu_set_accessed_bit,
+
+ TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
+
+ TP_ARGS(table_gfn, index, size)
+);
+
+/* We set a pte dirty bit */
+DEFINE_EVENT(kvm_mmu_set_bit_class, kvm_mmu_set_dirty_bit,
+
+ TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
+
+ TP_ARGS(table_gfn, index, size)
+);
+
+TRACE_EVENT(
+ kvm_mmu_walker_error,
+ TP_PROTO(u32 pferr),
+ TP_ARGS(pferr),
+
+ TP_STRUCT__entry(
+ __field(__u32, pferr)
+ ),
+
+ TP_fast_assign(
+ __entry->pferr = pferr;
+ ),
+
+ TP_printk("pferr %x %s", __entry->pferr,
+ __print_flags(__entry->pferr, "|", kvm_mmu_trace_pferr_flags))
+);
+
+TRACE_EVENT(
+ kvm_mmu_get_page,
+ TP_PROTO(struct kvm_mmu_page *sp, bool created),
+ TP_ARGS(sp, created),
+
+ TP_STRUCT__entry(
+ KVM_MMU_PAGE_FIELDS
+ __field(bool, created)
+ ),
+
+ TP_fast_assign(
+ KVM_MMU_PAGE_ASSIGN(sp)
+ __entry->created = created;
+ ),
+
+ TP_printk("%s %s", KVM_MMU_PAGE_PRINTK(),
+ __entry->created ? "new" : "existing")
+);
+
+DECLARE_EVENT_CLASS(kvm_mmu_page_class,
+
+ TP_PROTO(struct kvm_mmu_page *sp),
+ TP_ARGS(sp),
+
+ TP_STRUCT__entry(
+ KVM_MMU_PAGE_FIELDS
+ ),
+
+ TP_fast_assign(
+ KVM_MMU_PAGE_ASSIGN(sp)
+ ),
+
+ TP_printk("%s", KVM_MMU_PAGE_PRINTK())
+);
+
+DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_sync_page,
+ TP_PROTO(struct kvm_mmu_page *sp),
+
+ TP_ARGS(sp)
+);
+
+DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_unsync_page,
+ TP_PROTO(struct kvm_mmu_page *sp),
+
+ TP_ARGS(sp)
+);
+
+DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_prepare_zap_page,
+ TP_PROTO(struct kvm_mmu_page *sp),
+
+ TP_ARGS(sp)
+);
+
+TRACE_EVENT(
+ mark_mmio_spte,
+ TP_PROTO(u64 *sptep, gfn_t gfn, u64 spte),
+ TP_ARGS(sptep, gfn, spte),
+
+ TP_STRUCT__entry(
+ __field(void *, sptep)
+ __field(gfn_t, gfn)
+ __field(unsigned, access)
+ __field(unsigned int, gen)
+ ),
+
+ TP_fast_assign(
+ __entry->sptep = sptep;
+ __entry->gfn = gfn;
+ __entry->access = spte & ACC_ALL;
+ __entry->gen = get_mmio_spte_generation(spte);
+ ),
+
+ TP_printk("sptep:%p gfn %llx access %x gen %x", __entry->sptep,
+ __entry->gfn, __entry->access, __entry->gen)
+);
+
+TRACE_EVENT(
+ handle_mmio_page_fault,
+ TP_PROTO(u64 addr, gfn_t gfn, unsigned access),
+ TP_ARGS(addr, gfn, access),
+
+ TP_STRUCT__entry(
+ __field(u64, addr)
+ __field(gfn_t, gfn)
+ __field(unsigned, access)
+ ),
+
+ TP_fast_assign(
+ __entry->addr = addr;
+ __entry->gfn = gfn;
+ __entry->access = access;
+ ),
+
+ TP_printk("addr:%llx gfn %llx access %x", __entry->addr, __entry->gfn,
+ __entry->access)
+);
+
+TRACE_EVENT(
+ fast_page_fault,
+ TP_PROTO(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ u64 *sptep, u64 old_spte, int ret),
+ TP_ARGS(vcpu, fault, sptep, old_spte, ret),
+
+ TP_STRUCT__entry(
+ __field(int, vcpu_id)
+ __field(gpa_t, cr2_or_gpa)
+ __field(u64, error_code)
+ __field(u64 *, sptep)
+ __field(u64, old_spte)
+ __field(u64, new_spte)
+ __field(int, ret)
+ ),
+
+ TP_fast_assign(
+ __entry->vcpu_id = vcpu->vcpu_id;
+ __entry->cr2_or_gpa = fault->addr;
+ __entry->error_code = fault->error_code;
+ __entry->sptep = sptep;
+ __entry->old_spte = old_spte;
+ __entry->new_spte = *sptep;
+ __entry->ret = ret;
+ ),
+
+ TP_printk("vcpu %d gva %llx error_code %s sptep %p old %#llx"
+ " new %llx spurious %d fixed %d", __entry->vcpu_id,
+ __entry->cr2_or_gpa, __print_flags(__entry->error_code, "|",
+ kvm_mmu_trace_pferr_flags), __entry->sptep,
+ __entry->old_spte, __entry->new_spte,
+ __entry->ret == RET_PF_SPURIOUS, __entry->ret == RET_PF_FIXED
+ )
+);
+
+TRACE_EVENT(
+ kvm_mmu_zap_all_fast,
+ TP_PROTO(struct kvm *kvm),
+ TP_ARGS(kvm),
+
+ TP_STRUCT__entry(
+ __field(__u8, mmu_valid_gen)
+ __field(unsigned int, mmu_used_pages)
+ ),
+
+ TP_fast_assign(
+ __entry->mmu_valid_gen = kvm->arch.mmu_valid_gen;
+ __entry->mmu_used_pages = kvm->arch.n_used_mmu_pages;
+ ),
+
+ TP_printk("kvm-mmu-valid-gen %u used_pages %x",
+ __entry->mmu_valid_gen, __entry->mmu_used_pages
+ )
+);
+
+
+TRACE_EVENT(
+ check_mmio_spte,
+ TP_PROTO(u64 spte, unsigned int kvm_gen, unsigned int spte_gen),
+ TP_ARGS(spte, kvm_gen, spte_gen),
+
+ TP_STRUCT__entry(
+ __field(unsigned int, kvm_gen)
+ __field(unsigned int, spte_gen)
+ __field(u64, spte)
+ ),
+
+ TP_fast_assign(
+ __entry->kvm_gen = kvm_gen;
+ __entry->spte_gen = spte_gen;
+ __entry->spte = spte;
+ ),
+
+ TP_printk("spte %llx kvm_gen %x spte-gen %x valid %d", __entry->spte,
+ __entry->kvm_gen, __entry->spte_gen,
+ __entry->kvm_gen == __entry->spte_gen
+ )
+);
+
+TRACE_EVENT(
+ kvm_mmu_set_spte,
+ TP_PROTO(int level, gfn_t gfn, u64 *sptep),
+ TP_ARGS(level, gfn, sptep),
+
+ TP_STRUCT__entry(
+ __field(u64, gfn)
+ __field(u64, spte)
+ __field(u64, sptep)
+ __field(u8, level)
+ /* These depend on page entry type, so compute them now. */
+ __field(bool, r)
+ __field(bool, x)
+ __field(signed char, u)
+ ),
+
+ TP_fast_assign(
+ __entry->gfn = gfn;
+ __entry->spte = *sptep;
+ __entry->sptep = virt_to_phys(sptep);
+ __entry->level = level;
+ __entry->r = shadow_present_mask || (__entry->spte & PT_PRESENT_MASK);
+ __entry->x = is_executable_pte(__entry->spte);
+ __entry->u = shadow_user_mask ? !!(__entry->spte & shadow_user_mask) : -1;
+ ),
+
+ TP_printk("gfn %llx spte %llx (%s%s%s%s) level %d at %llx",
+ __entry->gfn, __entry->spte,
+ __entry->r ? "r" : "-",
+ __entry->spte & PT_WRITABLE_MASK ? "w" : "-",
+ __entry->x ? "x" : "-",
+ __entry->u == -1 ? "" : (__entry->u ? "u" : "-"),
+ __entry->level, __entry->sptep
+ )
+);
+
+TRACE_EVENT(
+ kvm_mmu_spte_requested,
+ TP_PROTO(struct kvm_page_fault *fault),
+ TP_ARGS(fault),
+
+ TP_STRUCT__entry(
+ __field(u64, gfn)
+ __field(u64, pfn)
+ __field(u8, level)
+ ),
+
+ TP_fast_assign(
+ __entry->gfn = fault->gfn;
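+		/* fault->pfn is aligned to goal_level; fold the gfn offset back in to log the exact 4KiB pfn. */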
+ __entry->pfn = fault->pfn | (fault->gfn & (KVM_PAGES_PER_HPAGE(fault->goal_level) - 1));
+ __entry->level = fault->goal_level;
+ ),
+
+ TP_printk("gfn %llx pfn %llx level %d",
+ __entry->gfn, __entry->pfn, __entry->level
+ )
+);
+
+TRACE_EVENT(
+ kvm_tdp_mmu_spte_changed,
+ TP_PROTO(int as_id, gfn_t gfn, int level, u64 old_spte, u64 new_spte),
+ TP_ARGS(as_id, gfn, level, old_spte, new_spte),
+
+ TP_STRUCT__entry(
+ __field(u64, gfn)
+ __field(u64, old_spte)
+ __field(u64, new_spte)
+ /* Level cannot be larger than 5 on x86, so it fits in a u8. */
+ __field(u8, level)
+	/* as_id can only be 0 or 1 on x86, so it fits in a u8. */
+ __field(u8, as_id)
+ ),
+
+ TP_fast_assign(
+ __entry->gfn = gfn;
+ __entry->old_spte = old_spte;
+ __entry->new_spte = new_spte;
+ __entry->level = level;
+ __entry->as_id = as_id;
+ ),
+
+ TP_printk("as id %d gfn %llx level %d old_spte %llx new_spte %llx",
+ __entry->as_id, __entry->gfn, __entry->level,
+ __entry->old_spte, __entry->new_spte
+ )
+);
+
+TRACE_EVENT(
+ kvm_mmu_split_huge_page,
+ TP_PROTO(u64 gfn, u64 spte, int level, int errno),
+ TP_ARGS(gfn, spte, level, errno),
+
+ TP_STRUCT__entry(
+ __field(u64, gfn)
+ __field(u64, spte)
+ __field(int, level)
+ __field(int, errno)
+ ),
+
+ TP_fast_assign(
+ __entry->gfn = gfn;
+ __entry->spte = spte;
+ __entry->level = level;
+ __entry->errno = errno;
+ ),
+
+ TP_printk("gfn %llx spte %llx level %d errno %d",
+ __entry->gfn, __entry->spte, __entry->level, __entry->errno)
+);
+
+#endif /* _TRACE_KVMMMU_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH mmu
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE mmutrace
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
new file mode 100644
index 000000000000..1b17b12393a8
--- /dev/null
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -0,0 +1,374 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Support KVM guest page tracking
+ *
+ * This feature allows us to track page accesses in the guest. Currently, only
+ * write access is tracked.
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ * Xiao Guangrong <guangrong.xiao@linux.intel.com>
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/lockdep.h>
+#include <linux/kvm_host.h>
+#include <linux/rculist.h>
+
+#include "mmu.h"
+#include "mmu_internal.h"
+#include "page_track.h"
+
+static bool kvm_external_write_tracking_enabled(struct kvm *kvm)
+{
+#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
+ /*
+ * Read external_write_tracking_enabled before related pointers. Pairs
+	 * with the smp_store_release in kvm_enable_external_write_tracking().
+ */
+ return smp_load_acquire(&kvm->arch.external_write_tracking_enabled);
+#else
+ return false;
+#endif
+}
+
+bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
+{
+ return kvm_external_write_tracking_enabled(kvm) ||
+ kvm_shadow_root_allocated(kvm) || !tdp_enabled;
+}
+
+void kvm_page_track_free_memslot(struct kvm_memory_slot *slot)
+{
+ vfree(slot->arch.gfn_write_track);
+ slot->arch.gfn_write_track = NULL;
+}
+
+static int __kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot,
+ unsigned long npages)
+{
+ const size_t size = sizeof(*slot->arch.gfn_write_track);
+
+ if (!slot->arch.gfn_write_track)
+ slot->arch.gfn_write_track = __vcalloc(npages, size,
+ GFP_KERNEL_ACCOUNT);
+
+ return slot->arch.gfn_write_track ? 0 : -ENOMEM;
+}
+
+int kvm_page_track_create_memslot(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned long npages)
+{
+ if (!kvm_page_track_write_tracking_enabled(kvm))
+ return 0;
+
+ return __kvm_page_track_write_tracking_alloc(slot, npages);
+}
+
+int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot)
+{
+ return __kvm_page_track_write_tracking_alloc(slot, slot->npages);
+}
+
+static void update_gfn_write_track(struct kvm_memory_slot *slot, gfn_t gfn,
+ short count)
+{
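+	/*
+	 * Each gfn has a write-track count: incremented for every tracker
+	 * added, decremented for every tracker removed.  The bounds check
+	 * below catches underflow/overflow, which would indicate a KVM bug.
+	 */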
+ int index, val;
+
+ index = gfn_to_index(gfn, slot->base_gfn, PG_LEVEL_4K);
+
+ val = slot->arch.gfn_write_track[index];
+
+ if (WARN_ON_ONCE(val + count < 0 || val + count > USHRT_MAX))
+ return;
+
+ slot->arch.gfn_write_track[index] += count;
+}
+
+void __kvm_write_track_add_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn)
+{
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ lockdep_assert_once(lockdep_is_held(&kvm->slots_lock) ||
+ srcu_read_lock_held(&kvm->srcu));
+
+ if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
+ return;
+
+ update_gfn_write_track(slot, gfn, 1);
+
+ /*
+	 * A new tracker disallows large page mappings for the
+	 * tracked page.
+ */
+ kvm_mmu_gfn_disallow_lpage(slot, gfn);
+
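+	/* Write-protect existing 4KiB SPTEs for the gfn; flush TLBs if any were changed. */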
+ if (kvm_mmu_slot_gfn_write_protect(kvm, slot, gfn, PG_LEVEL_4K))
+ kvm_flush_remote_tlbs(kvm);
+}
+
+void __kvm_write_track_remove_gfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ lockdep_assert_once(lockdep_is_held(&kvm->slots_lock) ||
+ srcu_read_lock_held(&kvm->srcu));
+
+ if (KVM_BUG_ON(!kvm_page_track_write_tracking_enabled(kvm), kvm))
+ return;
+
+ update_gfn_write_track(slot, gfn, -1);
+
+ /*
+	 * Allow large page mappings for the tracked page
+	 * again now that the tracker is gone.
+ */
+ kvm_mmu_gfn_allow_lpage(slot, gfn);
+}
+
+/*
+ * Check whether write accesses to the specified guest page are tracked.
+ */
+bool kvm_gfn_is_write_tracked(struct kvm *kvm,
+ const struct kvm_memory_slot *slot, gfn_t gfn)
+{
+ int index;
+
+ if (!slot)
+ return false;
+
+ if (!kvm_page_track_write_tracking_enabled(kvm))
+ return false;
+
+ index = gfn_to_index(gfn, slot->base_gfn, PG_LEVEL_4K);
+ return !!READ_ONCE(slot->arch.gfn_write_track[index]);
+}
+
+#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
+void kvm_page_track_cleanup(struct kvm *kvm)
+{
+ struct kvm_page_track_notifier_head *head;
+
+ head = &kvm->arch.track_notifier_head;
+ cleanup_srcu_struct(&head->track_srcu);
+}
+
+int kvm_page_track_init(struct kvm *kvm)
+{
+ struct kvm_page_track_notifier_head *head;
+
+ head = &kvm->arch.track_notifier_head;
+ INIT_HLIST_HEAD(&head->track_notifier_list);
+ return init_srcu_struct(&head->track_srcu);
+}
+
+static int kvm_enable_external_write_tracking(struct kvm *kvm)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *slot;
+ int r = 0, i, bkt;
+
+ if (kvm->arch.vm_type == KVM_X86_TDX_VM)
+ return -EOPNOTSUPP;
+
+ mutex_lock(&kvm->slots_arch_lock);
+
+ /*
+ * Check for *any* write tracking user (not just external users) under
+ * lock. This avoids unnecessary work, e.g. if KVM itself is using
+ * write tracking, or if two external users raced when registering.
+ */
+ if (kvm_page_track_write_tracking_enabled(kvm))
+ goto out_success;
+
+ for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+ slots = __kvm_memslots(kvm, i);
+ kvm_for_each_memslot(slot, bkt, slots) {
+ /*
+ * Intentionally do NOT free allocations on failure to
+ * avoid having to track which allocations were made
+ * now versus when the memslot was created. The
+ * metadata is guaranteed to be freed when the slot is
+ * freed, and will be kept/used if userspace retries
+ * the failed ioctl() instead of killing the VM.
+ */
+ r = kvm_page_track_write_tracking_alloc(slot);
+ if (r)
+ goto out_unlock;
+ }
+ }
+
+out_success:
+ /*
+ * Ensure that external_write_tracking_enabled becomes true strictly
+ * after all the related pointers are set.
+ */
+ smp_store_release(&kvm->arch.external_write_tracking_enabled, true);
+out_unlock:
+ mutex_unlock(&kvm->slots_arch_lock);
+ return r;
+}
+
+/*
+ * Register the notifier so that events for write-tracked guest
+ * pages are delivered to it.
+ */
+int kvm_page_track_register_notifier(struct kvm *kvm,
+ struct kvm_page_track_notifier_node *n)
+{
+ struct kvm_page_track_notifier_head *head;
+ int r;
+
+ if (!kvm || kvm->mm != current->mm)
+ return -ESRCH;
+
+ if (!kvm_external_write_tracking_enabled(kvm)) {
+ r = kvm_enable_external_write_tracking(kvm);
+ if (r)
+ return r;
+ }
+
+ kvm_get_kvm(kvm);
+
+ head = &kvm->arch.track_notifier_head;
+
+ write_lock(&kvm->mmu_lock);
+ hlist_add_head_rcu(&n->node, &head->track_notifier_list);
+ write_unlock(&kvm->mmu_lock);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_page_track_register_notifier);
+
+/*
+ * Stop receiving tracking events. This is the inverse operation of
+ * kvm_page_track_register_notifier().
+ */
+void kvm_page_track_unregister_notifier(struct kvm *kvm,
+ struct kvm_page_track_notifier_node *n)
+{
+ struct kvm_page_track_notifier_head *head;
+
+ head = &kvm->arch.track_notifier_head;
+
+ write_lock(&kvm->mmu_lock);
+ hlist_del_rcu(&n->node);
+ write_unlock(&kvm->mmu_lock);
+ synchronize_srcu(&head->track_srcu);
+
+ kvm_put_kvm(kvm);
+}
+EXPORT_SYMBOL_GPL(kvm_page_track_unregister_notifier);
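+
+/*
+ * Hypothetical external-user flow (e.g. a device-emulation driver shadowing
+ * a guest-managed table):
+ *
+ *	kvm_page_track_register_notifier(kvm, &node);
+ *	kvm_write_track_add_gfn(kvm, gfn);
+ *	... node->track_write() is invoked on emulated writes to gfn ...
+ *	kvm_write_track_remove_gfn(kvm, gfn);
+ *	kvm_page_track_unregister_notifier(kvm, &node);
+ */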
+
+/*
+ * Notify registered nodes that write access was intercepted and that write
+ * emulation has finished.
+ *
+ * Each node must determine for itself whether the written page is one it is
+ * interested in.
+ */
+void __kvm_page_track_write(struct kvm *kvm, gpa_t gpa, const u8 *new, int bytes)
+{
+ struct kvm_page_track_notifier_head *head;
+ struct kvm_page_track_notifier_node *n;
+ int idx;
+
+ head = &kvm->arch.track_notifier_head;
+
+ if (hlist_empty(&head->track_notifier_list))
+ return;
+
+ idx = srcu_read_lock(&head->track_srcu);
+ hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
+ srcu_read_lock_held(&head->track_srcu))
+ if (n->track_write)
+ n->track_write(gpa, new, bytes, n);
+ srcu_read_unlock(&head->track_srcu, idx);
+}
+
+/*
+ * Notify external page track nodes that a memory region is being removed from
+ * the VM, e.g. so that users can free any associated metadata.
+ */
+void kvm_page_track_delete_slot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+ struct kvm_page_track_notifier_head *head;
+ struct kvm_page_track_notifier_node *n;
+ int idx;
+
+ head = &kvm->arch.track_notifier_head;
+
+ if (hlist_empty(&head->track_notifier_list))
+ return;
+
+ idx = srcu_read_lock(&head->track_srcu);
+ hlist_for_each_entry_srcu(n, &head->track_notifier_list, node,
+ srcu_read_lock_held(&head->track_srcu))
+ if (n->track_remove_region)
+ n->track_remove_region(slot->base_gfn, slot->npages, n);
+ srcu_read_unlock(&head->track_srcu, idx);
+}
+
+/*
+ * Add a guest page to the write-tracking pool so that writes to that
+ * page are intercepted.
+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ */
+int kvm_write_track_add_gfn(struct kvm *kvm, gfn_t gfn)
+{
+ struct kvm_memory_slot *slot;
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!slot) {
+ srcu_read_unlock(&kvm->srcu, idx);
+ return -EINVAL;
+ }
+
+ write_lock(&kvm->mmu_lock);
+ __kvm_write_track_add_gfn(kvm, slot, gfn);
+ write_unlock(&kvm->mmu_lock);
+
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_write_track_add_gfn);
+
+/*
+ * Remove a guest page from the write-tracking pool, which stops the
+ * interception of writes to that page.
+ *
+ * @kvm: the guest instance we are interested in.
+ * @gfn: the guest page.
+ */
+int kvm_write_track_remove_gfn(struct kvm *kvm, gfn_t gfn)
+{
+ struct kvm_memory_slot *slot;
+ int idx;
+
+ idx = srcu_read_lock(&kvm->srcu);
+
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!slot) {
+ srcu_read_unlock(&kvm->srcu, idx);
+ return -EINVAL;
+ }
+
+ write_lock(&kvm->mmu_lock);
+ __kvm_write_track_remove_gfn(kvm, slot, gfn);
+ write_unlock(&kvm->mmu_lock);
+
+ srcu_read_unlock(&kvm->srcu, idx);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_write_track_remove_gfn);
+#endif
diff --git a/arch/x86/kvm/mmu/page_track.h b/arch/x86/kvm/mmu/page_track.h
new file mode 100644
index 000000000000..d4d72ed999b1
--- /dev/null
+++ b/arch/x86/kvm/mmu/page_track.h
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_PAGE_TRACK_H
+#define __KVM_X86_PAGE_TRACK_H
+
+#include <linux/kvm_host.h>
+
+#include <asm/kvm_page_track.h>
+
+
+bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
+int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
+
+void kvm_page_track_free_memslot(struct kvm_memory_slot *slot);
+int kvm_page_track_create_memslot(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ unsigned long npages);
+
+void __kvm_write_track_add_gfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn);
+void __kvm_write_track_remove_gfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot, gfn_t gfn);
+
+bool kvm_gfn_is_write_tracked(struct kvm *kvm,
+ const struct kvm_memory_slot *slot, gfn_t gfn);
+
+#ifdef CONFIG_KVM_EXTERNAL_WRITE_TRACKING
+int kvm_page_track_init(struct kvm *kvm);
+void kvm_page_track_cleanup(struct kvm *kvm);
+
+void __kvm_page_track_write(struct kvm *kvm, gpa_t gpa, const u8 *new, int bytes);
+void kvm_page_track_delete_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
+
+static inline bool kvm_page_track_has_external_user(struct kvm *kvm)
+{
+ return !hlist_empty(&kvm->arch.track_notifier_head.track_notifier_list);
+}
+#else
+static inline int kvm_page_track_init(struct kvm *kvm) { return 0; }
+static inline void kvm_page_track_cleanup(struct kvm *kvm) { }
+
+static inline void __kvm_page_track_write(struct kvm *kvm, gpa_t gpa,
+ const u8 *new, int bytes) { }
+static inline void kvm_page_track_delete_slot(struct kvm *kvm,
+ struct kvm_memory_slot *slot) { }
+
+static inline bool kvm_page_track_has_external_user(struct kvm *kvm) { return false; }
+
+#endif /* CONFIG_KVM_EXTERNAL_WRITE_TRACKING */
+
+static inline void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
+ const u8 *new, int bytes)
+{
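+	/* Notify external write-tracking users, then KVM's own shadow MMU. */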
+ __kvm_page_track_write(vcpu->kvm, gpa, new, bytes);
+
+ kvm_mmu_track_write(vcpu, gpa, new, bytes);
+}
+
+#endif /* __KVM_X86_PAGE_TRACK_H */
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
new file mode 100644
index 000000000000..901cd2bd40b8
--- /dev/null
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -0,0 +1,983 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * This module enables machines with Intel VT-x extensions to run virtual
+ * machines without emulation or binary translation.
+ *
+ * MMU support
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
+ *
+ * Authors:
+ * Yaniv Kamay <yaniv@qumranet.com>
+ * Avi Kivity <avi@qumranet.com>
+ */
+
+/*
+ * The MMU needs to be able to access/walk 32-bit and 64-bit guest page tables,
+ * as well as guest EPT tables, so the code in this file is compiled thrice,
+ * once per guest PTE type. The per-type defines are #undef'd at the end.
+ */
+
+#if PTTYPE == 64
+ #define pt_element_t u64
+ #define guest_walker guest_walker64
+ #define FNAME(name) paging##64_##name
+ #define PT_LEVEL_BITS 9
+ #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
+ #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
+ #define PT_HAVE_ACCESSED_DIRTY(mmu) true
+ #ifdef CONFIG_X86_64
+ #define PT_MAX_FULL_LEVELS PT64_ROOT_MAX_LEVEL
+ #else
+ #define PT_MAX_FULL_LEVELS 2
+ #endif
+#elif PTTYPE == 32
+ #define pt_element_t u32
+ #define guest_walker guest_walker32
+ #define FNAME(name) paging##32_##name
+ #define PT_LEVEL_BITS 10
+ #define PT_MAX_FULL_LEVELS 2
+ #define PT_GUEST_DIRTY_SHIFT PT_DIRTY_SHIFT
+ #define PT_GUEST_ACCESSED_SHIFT PT_ACCESSED_SHIFT
+ #define PT_HAVE_ACCESSED_DIRTY(mmu) true
+
+ #define PT32_DIR_PSE36_SIZE 4
+ #define PT32_DIR_PSE36_SHIFT 13
+ #define PT32_DIR_PSE36_MASK \
+ (((1ULL << PT32_DIR_PSE36_SIZE) - 1) << PT32_DIR_PSE36_SHIFT)
+#elif PTTYPE == PTTYPE_EPT
+ #define pt_element_t u64
+ #define guest_walker guest_walkerEPT
+ #define FNAME(name) ept_##name
+ #define PT_LEVEL_BITS 9
+ #define PT_GUEST_DIRTY_SHIFT 9
+ #define PT_GUEST_ACCESSED_SHIFT 8
+ #define PT_HAVE_ACCESSED_DIRTY(mmu) (!(mmu)->cpu_role.base.ad_disabled)
+ #define PT_MAX_FULL_LEVELS PT64_ROOT_MAX_LEVEL
+#else
+ #error Invalid PTTYPE value
+#endif
+
+/* Common logic, but per-type values. These also need to be undefined. */
+#define PT_BASE_ADDR_MASK ((pt_element_t)__PT_BASE_ADDR_MASK)
+#define PT_LVL_ADDR_MASK(lvl) __PT_LVL_ADDR_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
+#define PT_LVL_OFFSET_MASK(lvl) __PT_LVL_OFFSET_MASK(PT_BASE_ADDR_MASK, lvl, PT_LEVEL_BITS)
+#define PT_INDEX(addr, lvl) __PT_INDEX(addr, lvl, PT_LEVEL_BITS)
+
+#define PT_GUEST_DIRTY_MASK (1 << PT_GUEST_DIRTY_SHIFT)
+#define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
+
+#define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
+#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
+
+/*
+ * The guest_walker structure emulates the behavior of the hardware page
+ * table walker.
+ */
+struct guest_walker {
+ int level;
+ unsigned max_level;
+ gfn_t table_gfn[PT_MAX_FULL_LEVELS];
+ pt_element_t ptes[PT_MAX_FULL_LEVELS];
+ pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
+ gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
+ pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
+ bool pte_writable[PT_MAX_FULL_LEVELS];
+ unsigned int pt_access[PT_MAX_FULL_LEVELS];
+ unsigned int pte_access;
+ gfn_t gfn;
+ struct x86_exception fault;
+};
+
+#if PTTYPE == 32
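+/*
+ * PSE-36: bits 16:13 of a 4MiB PDE hold bits 35:32 of the physical address;
+ * recover that part of the gfn here.
+ */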
+static inline gfn_t pse36_gfn_delta(u32 gpte)
+{
+ int shift = 32 - PT32_DIR_PSE36_SHIFT - PAGE_SHIFT;
+
+ return (gpte & PT32_DIR_PSE36_MASK) << shift;
+}
+#endif
+
+static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
+{
+ return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
+}
+
+static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
+ unsigned gpte)
+{
+ unsigned mask;
+
+ /* dirty bit is not supported, so no need to track it */
+ if (!PT_HAVE_ACCESSED_DIRTY(mmu))
+ return;
+
+ BUILD_BUG_ON(PT_WRITABLE_MASK != ACC_WRITE_MASK);
+
+ mask = (unsigned)~ACC_WRITE_MASK;
+ /* Allow write access to dirty gptes */
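+	/* i.e. copy the gpte's dirty bit down into the writable bit position. */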
+ mask |= (gpte >> (PT_GUEST_DIRTY_SHIFT - PT_WRITABLE_SHIFT)) &
+ PT_WRITABLE_MASK;
+ *access &= mask;
+}
+
+static inline int FNAME(is_present_gpte)(unsigned long pte)
+{
+#if PTTYPE != PTTYPE_EPT
+ return pte & PT_PRESENT_MASK;
+#else
+ return pte & 7;
+#endif
+}
+
+static bool FNAME(is_bad_mt_xwr)(struct rsvd_bits_validate *rsvd_check, u64 gpte)
+{
+#if PTTYPE != PTTYPE_EPT
+ return false;
+#else
+ return __is_bad_mt_xwr(rsvd_check, gpte);
+#endif
+}
+
+static bool FNAME(is_rsvd_bits_set)(struct kvm_mmu *mmu, u64 gpte, int level)
+{
+ return __is_rsvd_bits_set(&mmu->guest_rsvd_check, gpte, level) ||
+ FNAME(is_bad_mt_xwr)(&mmu->guest_rsvd_check, gpte);
+}
+
+static bool FNAME(prefetch_invalid_gpte)(struct kvm_vcpu *vcpu,
+ struct kvm_mmu_page *sp, u64 *spte,
+ u64 gpte)
+{
+ if (!FNAME(is_present_gpte)(gpte))
+ goto no_present;
+
+ /* Prefetch only accessed entries (unless A/D bits are disabled). */
+ if (PT_HAVE_ACCESSED_DIRTY(vcpu->arch.mmu) &&
+ !(gpte & PT_GUEST_ACCESSED_MASK))
+ goto no_present;
+
+ if (FNAME(is_rsvd_bits_set)(vcpu->arch.mmu, gpte, PG_LEVEL_4K))
+ goto no_present;
+
+ return false;
+
+no_present:
+ drop_spte(vcpu->kvm, spte);
+ return true;
+}
+
+/*
+ * For PTTYPE_EPT, a page table can be executable but not readable
+ * on supported processors. Therefore, set_spte does not automatically
+ * set bit 0 if execute only is supported. Here, we repurpose ACC_USER_MASK
+ * to signify readability since it isn't used in the EPT case
+ */
+static inline unsigned FNAME(gpte_access)(u64 gpte)
+{
+ unsigned access;
+#if PTTYPE == PTTYPE_EPT
+ access = ((gpte & VMX_EPT_WRITABLE_MASK) ? ACC_WRITE_MASK : 0) |
+ ((gpte & VMX_EPT_EXECUTABLE_MASK) ? ACC_EXEC_MASK : 0) |
+ ((gpte & VMX_EPT_READABLE_MASK) ? ACC_USER_MASK : 0);
+#else
+ BUILD_BUG_ON(ACC_EXEC_MASK != PT_PRESENT_MASK);
+ BUILD_BUG_ON(ACC_EXEC_MASK != 1);
+ access = gpte & (PT_WRITABLE_MASK | PT_USER_MASK | PT_PRESENT_MASK);
+ /* Combine NX with P (which is set here) to get ACC_EXEC_MASK. */
+ access ^= (gpte >> PT64_NX_SHIFT);
+#endif
+
+ return access;
+}
+
+static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *mmu,
+ struct guest_walker *walker,
+ gpa_t addr, int write_fault)
+{
+ unsigned level, index;
+ pt_element_t pte, orig_pte;
+ pt_element_t __user *ptep_user;
+ gfn_t table_gfn;
+ int ret;
+
+ /* dirty/accessed bits are not supported, so no need to update them */
+ if (!PT_HAVE_ACCESSED_DIRTY(mmu))
+ return 0;
+
+ for (level = walker->max_level; level >= walker->level; --level) {
+ pte = orig_pte = walker->ptes[level - 1];
+ table_gfn = walker->table_gfn[level - 1];
+ ptep_user = walker->ptep_user[level - 1];
+ index = offset_in_page(ptep_user) / sizeof(pt_element_t);
+ if (!(pte & PT_GUEST_ACCESSED_MASK)) {
+ trace_kvm_mmu_set_accessed_bit(table_gfn, index, sizeof(pte));
+ pte |= PT_GUEST_ACCESSED_MASK;
+ }
+ if (level == walker->level && write_fault &&
+ !(pte & PT_GUEST_DIRTY_MASK)) {
+ trace_kvm_mmu_set_dirty_bit(table_gfn, index, sizeof(pte));
+#if PTTYPE == PTTYPE_EPT
+ if (kvm_x86_ops.nested_ops->write_log_dirty(vcpu, addr))
+ return -EINVAL;
+#endif
+ pte |= PT_GUEST_DIRTY_MASK;
+ }
+ if (pte == orig_pte)
+ continue;
+
+ /*
+ * If the slot is read-only, simply do not process the accessed
+ * and dirty bits. This is the correct thing to do if the slot
+ * is ROM, and page tables in read-as-ROM/write-as-MMIO slots
+ * are only supported if the accessed and dirty bits are already
+ * set in the ROM (so that MMIO writes are never needed).
+ *
+ * Note that NPT does not allow this at all and faults, since
+ * it always wants nested page table entries for the guest
+ * page tables to be writable. And EPT works but will simply
+ * overwrite the read-only memory to set the accessed and dirty
+ * bits.
+ */
+ if (unlikely(!walker->pte_writable[level - 1]))
+ continue;
+
+ ret = __try_cmpxchg_user(ptep_user, &orig_pte, pte, fault);
+ if (ret)
+ return ret;
+
+ kvm_vcpu_mark_page_dirty(vcpu, table_gfn);
+ walker->ptes[level - 1] = pte;
+ }
+ return 0;
+}
+
+static inline unsigned FNAME(gpte_pkeys)(struct kvm_vcpu *vcpu, u64 gpte)
+{
+ unsigned pkeys = 0;
+#if PTTYPE == 64
+ pte_t pte = {.pte = gpte};
+
+ pkeys = pte_flags_pkey(pte_flags(pte));
+#endif
+ return pkeys;
+}
+
+static inline bool FNAME(is_last_gpte)(struct kvm_mmu *mmu,
+ unsigned int level, unsigned int gpte)
+{
+ /*
+ * For EPT and PAE paging (both variants), bit 7 is either reserved at
+	 * all levels or indicates a huge page (ignoring CR3/EPTP). In either
+ * case, bit 7 being set terminates the walk.
+ */
+#if PTTYPE == 32
+ /*
+ * 32-bit paging requires special handling because bit 7 is ignored if
+ * CR4.PSE=0, not reserved. Clear bit 7 in the gpte if the level is
+ * greater than the last level for which bit 7 is the PAGE_SIZE bit.
+ *
+ * The RHS has bit 7 set iff level < (2 + PSE). If it is clear, bit 7
+ * is not reserved and does not indicate a large page at this level,
+ * so clear PT_PAGE_SIZE_MASK in gpte if that is the case.
+ */
+ gpte &= level - (PT32_ROOT_LEVEL + mmu->cpu_role.ext.cr4_pse);
+#endif
+ /*
+ * PG_LEVEL_4K always terminates. The RHS has bit 7 set
+ * iff level <= PG_LEVEL_4K, which for our purpose means
+ * level == PG_LEVEL_4K; set PT_PAGE_SIZE_MASK in gpte then.
+ */
+ gpte |= level - PG_LEVEL_4K - 1;
+
+ return gpte & PT_PAGE_SIZE_MASK;
+}
+/*
+ * Fetch a guest pte for a guest virtual address, or for an L2's GPA.
+ */
+static int FNAME(walk_addr_generic)(struct guest_walker *walker,
+ struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+ gpa_t addr, u64 access)
+{
+ int ret;
+ pt_element_t pte;
+ pt_element_t __user *ptep_user;
+ gfn_t table_gfn;
+ u64 pt_access, pte_access;
+ unsigned index, accessed_dirty, pte_pkey;
+ u64 nested_access;
+ gpa_t pte_gpa;
+ bool have_ad;
+ int offset;
+ u64 walk_nx_mask = 0;
+ const int write_fault = access & PFERR_WRITE_MASK;
+ const int user_fault = access & PFERR_USER_MASK;
+ const int fetch_fault = access & PFERR_FETCH_MASK;
+ u16 errcode = 0;
+ gpa_t real_gpa;
+ gfn_t gfn;
+
+ trace_kvm_mmu_pagetable_walk(addr, access);
+retry_walk:
+ walker->level = mmu->cpu_role.base.level;
+ pte = kvm_mmu_get_guest_pgd(vcpu, mmu);
+ have_ad = PT_HAVE_ACCESSED_DIRTY(mmu);
+
+#if PTTYPE == 64
+ walk_nx_mask = 1ULL << PT64_NX_SHIFT;
+ if (walker->level == PT32E_ROOT_LEVEL) {
+ pte = mmu->get_pdptr(vcpu, (addr >> 30) & 3);
+ trace_kvm_mmu_paging_element(pte, walker->level);
+ if (!FNAME(is_present_gpte)(pte))
+ goto error;
+ --walker->level;
+ }
+#endif
+ walker->max_level = walker->level;
+
+ /*
+ * FIXME: on Intel processors, loads of the PDPTE registers for PAE paging
+ * by the MOV to CR instruction are treated as reads and do not cause the
+ * processor to set the dirty flag in any EPT paging-structure entry.
+ */
+ nested_access = (have_ad ? PFERR_WRITE_MASK : 0) | PFERR_USER_MASK;
+
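+	/*
+	 * The effective permissions are the AND of each level's permission
+	 * bits, so start with everything allowed and narrow during the walk.
+	 */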
+ pte_access = ~0;
+
+ /*
+ * Queue a page fault for injection if this assertion fails, as callers
+ * assume that walker.fault contains sane info on a walk failure. I.e.
+ * avoid making the situation worse by inducing even worse badness
+ * between when the assertion fails and when KVM kicks the vCPU out to
+ * userspace (because the VM is bugged).
+ */
+ if (KVM_BUG_ON(is_long_mode(vcpu) && !is_pae(vcpu), vcpu->kvm))
+ goto error;
+
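+	/* Pre-increment to counter the --walker->level at the top of the loop. */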
+ ++walker->level;
+
+ do {
+ struct kvm_memory_slot *slot;
+ unsigned long host_addr;
+
+ pt_access = pte_access;
+ --walker->level;
+
+ index = PT_INDEX(addr, walker->level);
+ table_gfn = gpte_to_gfn(pte);
+ offset = index * sizeof(pt_element_t);
+ pte_gpa = gfn_to_gpa(table_gfn) + offset;
+
+ BUG_ON(walker->level < 1);
+ walker->table_gfn[walker->level - 1] = table_gfn;
+ walker->pte_gpa[walker->level - 1] = pte_gpa;
+
+ real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(table_gfn),
+ nested_access, &walker->fault);
+
+ /*
+		 * FIXME: This can happen if emulation (e.g. of an INS/OUTS
+ * instruction) triggers a nested page fault. The exit
+ * qualification / exit info field will incorrectly have
+ * "guest page access" as the nested page fault's cause,
+ * instead of "guest page structure access". To fix this,
+ * the x86_exception struct should be augmented with enough
+ * information to fix the exit_qualification or exit_info_1
+ * fields.
+ */
+ if (unlikely(real_gpa == INVALID_GPA))
+ return 0;
+
+ slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(real_gpa));
+ if (!kvm_is_visible_memslot(slot))
+ goto error;
+
+ host_addr = gfn_to_hva_memslot_prot(slot, gpa_to_gfn(real_gpa),
+ &walker->pte_writable[walker->level - 1]);
+ if (unlikely(kvm_is_error_hva(host_addr)))
+ goto error;
+
+ ptep_user = (pt_element_t __user *)((void *)host_addr + offset);
+ if (unlikely(get_user(pte, ptep_user)))
+ goto error;
+ walker->ptep_user[walker->level - 1] = ptep_user;
+
+ trace_kvm_mmu_paging_element(pte, walker->level);
+
+ /*
+		 * Inverting the NX bit lets us AND it like the other
+ * permission bits.
+ */
+ pte_access = pt_access & (pte ^ walk_nx_mask);
+
+ if (unlikely(!FNAME(is_present_gpte)(pte)))
+ goto error;
+
+ if (unlikely(FNAME(is_rsvd_bits_set)(mmu, pte, walker->level))) {
+ errcode = PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
+ goto error;
+ }
+
+ walker->ptes[walker->level - 1] = pte;
+
+ /* Convert to ACC_*_MASK flags for struct guest_walker. */
+ walker->pt_access[walker->level - 1] = FNAME(gpte_access)(pt_access ^ walk_nx_mask);
+ } while (!FNAME(is_last_gpte)(mmu, walker->level, pte));
+
+ pte_pkey = FNAME(gpte_pkeys)(vcpu, pte);
+ accessed_dirty = have_ad ? pte_access & PT_GUEST_ACCESSED_MASK : 0;
+
+ /* Convert to ACC_*_MASK flags for struct guest_walker. */
+ walker->pte_access = FNAME(gpte_access)(pte_access ^ walk_nx_mask);
+ errcode = permission_fault(vcpu, mmu, walker->pte_access, pte_pkey, access);
+ if (unlikely(errcode))
+ goto error;
+
+ gfn = gpte_to_gfn_lvl(pte, walker->level);
+ gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
+
+#if PTTYPE == 32
+ if (walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
+ gfn += pse36_gfn_delta(pte);
+#endif
+
+ real_gpa = kvm_translate_gpa(vcpu, mmu, gfn_to_gpa(gfn), access, &walker->fault);
+ if (real_gpa == INVALID_GPA)
+ return 0;
+
+ walker->gfn = real_gpa >> PAGE_SHIFT;
+
+ if (!write_fault)
+ FNAME(protect_clean_gpte)(mmu, &walker->pte_access, pte);
+ else
+ /*
+ * On a write fault, fold the dirty bit into accessed_dirty.
+		 * For modes without A/D bit support, accessed_dirty will
+		 * always be clear.
+ */
+ accessed_dirty &= pte >>
+ (PT_GUEST_DIRTY_SHIFT - PT_GUEST_ACCESSED_SHIFT);
+
+ if (unlikely(!accessed_dirty)) {
+ ret = FNAME(update_accessed_dirty_bits)(vcpu, mmu, walker,
+ addr, write_fault);
+ if (unlikely(ret < 0))
+ goto error;
+ else if (ret)
+ goto retry_walk;
+ }
+
+ return 1;
+
+error:
+ errcode |= write_fault | user_fault;
+ if (fetch_fault && (is_efer_nx(mmu) || is_cr4_smep(mmu)))
+ errcode |= PFERR_FETCH_MASK;
+
+ walker->fault.vector = PF_VECTOR;
+ walker->fault.error_code_valid = true;
+ walker->fault.error_code = errcode;
+
+#if PTTYPE == PTTYPE_EPT
+ /*
+	 * Use PFERR_RSVD_MASK in error_code to tell if an EPT
+	 * misconfiguration needs to be injected. The detection is
+ * done by is_rsvd_bits_set() above.
+ *
+ * We set up the value of exit_qualification to inject:
+	 * [2:0] - Derived from the access bits. The exit_qualification might be
+ * out of date if it is serving an EPT misconfiguration.
+ * [5:3] - Calculated by the page walk of the guest EPT page tables
+	 * [8:7] - Derived from bits [8:7] of the real exit_qualification
+ *
+ * The other bits are set to 0.
+ */
+ if (!(errcode & PFERR_RSVD_MASK)) {
+ walker->fault.exit_qualification = 0;
+
+ if (write_fault)
+ walker->fault.exit_qualification |= EPT_VIOLATION_ACC_WRITE;
+ if (user_fault)
+ walker->fault.exit_qualification |= EPT_VIOLATION_ACC_READ;
+ if (fetch_fault)
+ walker->fault.exit_qualification |= EPT_VIOLATION_ACC_INSTR;
+
+ /*
+ * Note, pte_access holds the raw RWX bits from the EPTE, not
+ * ACC_*_MASK flags!
+ */
+ walker->fault.exit_qualification |= EPT_VIOLATION_RWX_TO_PROT(pte_access);
+ }
+#endif
+ walker->fault.address = addr;
+ walker->fault.nested_page_fault = mmu != vcpu->arch.walk_mmu;
+ walker->fault.async_page_fault = false;
+
+ trace_kvm_mmu_walker_error(walker->fault.error_code);
+ return 0;
+}
+
+static int FNAME(walk_addr)(struct guest_walker *walker,
+ struct kvm_vcpu *vcpu, gpa_t addr, u64 access)
+{
+ return FNAME(walk_addr_generic)(walker, vcpu, vcpu->arch.mmu, addr,
+ access);
+}
+
+static bool
+FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ u64 *spte, pt_element_t gpte)
+{
+ unsigned pte_access;
+ gfn_t gfn;
+
+ if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
+ return false;
+
+ gfn = gpte_to_gfn(gpte);
+ pte_access = sp->role.access & FNAME(gpte_access)(gpte);
+ FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
+
+ return kvm_mmu_prefetch_sptes(vcpu, gfn, spte, 1, pte_access);
+}
+
+static bool FNAME(gpte_changed)(struct kvm_vcpu *vcpu,
+ struct guest_walker *gw, int level)
+{
+ pt_element_t curr_pte;
+ gpa_t base_gpa, pte_gpa = gw->pte_gpa[level - 1];
+ u64 mask;
+ int r, index;
+
+ if (level == PG_LEVEL_4K) {
+ mask = PTE_PREFETCH_NUM * sizeof(pt_element_t) - 1;
+ base_gpa = pte_gpa & ~mask;
+ index = (pte_gpa - base_gpa) / sizeof(pt_element_t);
+
+ r = kvm_vcpu_read_guest_atomic(vcpu, base_gpa,
+ gw->prefetch_ptes, sizeof(gw->prefetch_ptes));
+ curr_pte = gw->prefetch_ptes[index];
+ } else
+ r = kvm_vcpu_read_guest_atomic(vcpu, pte_gpa,
+ &curr_pte, sizeof(curr_pte));
+
+ return r || curr_pte != gw->ptes[level - 1];
+}
+
+static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
+ u64 *sptep)
+{
+ struct kvm_mmu_page *sp;
+ pt_element_t *gptep = gw->prefetch_ptes;
+ u64 *spte;
+ int i;
+
+ sp = sptep_to_sp(sptep);
+
+ if (sp->role.level > PG_LEVEL_4K)
+ return;
+
+ /*
+ * If addresses are being invalidated, skip prefetching to avoid
+ * accidentally prefetching those addresses.
+ */
+ if (unlikely(vcpu->kvm->mmu_invalidate_in_progress))
+ return;
+
+ if (sp->role.direct)
+ return __direct_pte_prefetch(vcpu, sp, sptep);
+
+ i = spte_index(sptep) & ~(PTE_PREFETCH_NUM - 1);
+ spte = sp->spt + i;
+
+ for (i = 0; i < PTE_PREFETCH_NUM; i++, spte++) {
+ if (spte == sptep)
+ continue;
+
+ if (is_shadow_present_pte(*spte))
+ continue;
+
+ if (!FNAME(prefetch_gpte)(vcpu, sp, spte, gptep[i]))
+ break;
+ }
+}
+
+/*
+ * Fetch a shadow pte for a specific level in the paging hierarchy.
+ * If the guest tries to write a write-protected page, we need to
+ * emulate this operation and return 1 to indicate this case.
+ */
+static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
+ struct guest_walker *gw)
+{
+ struct kvm_mmu_page *sp = NULL;
+ struct kvm_shadow_walk_iterator it;
+ unsigned int direct_access, access;
+ int top_level, ret;
+ gfn_t base_gfn = fault->gfn;
+
+ WARN_ON_ONCE(gw->gfn != base_gfn);
+ direct_access = gw->pte_access;
+
+ top_level = vcpu->arch.mmu->cpu_role.base.level;
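+	/*
+	 * For PAE paging the PDPTEs are loaded from the register cache rather
+	 * than walked in memory, so the topmost gpte that can change is at
+	 * the PT32 root level.
+	 */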
+ if (top_level == PT32E_ROOT_LEVEL)
+ top_level = PT32_ROOT_LEVEL;
+ /*
+ * Verify that the top-level gpte is still there. Since the page
+ * is a root page, it is either write protected (and cannot be
+ * changed from now on) or it is invalid (in which case, we don't
+ * really care if it changes underneath us after this point).
+ */
+ if (FNAME(gpte_changed)(vcpu, gw, top_level))
+ return RET_PF_RETRY;
+
+ if (WARN_ON_ONCE(!VALID_PAGE(vcpu->arch.mmu->root.hpa)))
+ return RET_PF_RETRY;
+
+ /*
+ * Load a new root and retry the faulting instruction in the extremely
+ * unlikely scenario that the guest root gfn became visible between
+ * loading a dummy root and handling the resulting page fault, e.g. if
+	 * userspace creates a memslot in the interim.
+ */
+ if (unlikely(kvm_mmu_is_dummy_root(vcpu->arch.mmu->root.hpa))) {
+ kvm_make_request(KVM_REQ_MMU_FREE_OBSOLETE_ROOTS, vcpu);
+ return RET_PF_RETRY;
+ }
+
+ for_each_shadow_entry(vcpu, fault->addr, it) {
+ gfn_t table_gfn;
+
+ clear_sp_write_flooding_count(it.sptep);
+ if (it.level == gw->level)
+ break;
+
+ table_gfn = gw->table_gfn[it.level - 2];
+ access = gw->pt_access[it.level - 2];
+ sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
+ false, access);
+
+ /*
+ * Synchronize the new page before linking it, as the CPU (KVM)
+ * is architecturally disallowed from inserting non-present
+ * entries into the TLB, i.e. the guest isn't required to flush
+ * the TLB when changing the gPTE from non-present to present.
+ *
+ * For PG_LEVEL_4K, kvm_mmu_find_shadow_page() has already
+ * synchronized the page via kvm_sync_page().
+ *
+		 * For higher level pages, which cannot themselves be unsync
+ * but can have unsync children, synchronize via the slower
+ * mmu_sync_children(). If KVM needs to drop mmu_lock due to
+ * contention or to reschedule, instruct the caller to retry
+ * the #PF (mmu_sync_children() ensures forward progress will
+ * be made).
+ */
+ if (sp != ERR_PTR(-EEXIST) && sp->unsync_children &&
+ mmu_sync_children(vcpu, sp, false))
+ return RET_PF_RETRY;
+
+ /*
+ * Verify that the gpte in the page, which is now either
+ * write-protected or unsync, wasn't modified between the fault
+ * and acquiring mmu_lock. This needs to be done even when
+ * reusing an existing shadow page to ensure the information
+ * gathered by the walker matches the information stored in the
+ * shadow page (which could have been modified by a different
+ * vCPU even if the page was already linked). Holding mmu_lock
+ * prevents the shadow page from changing after this point.
+ */
+ if (FNAME(gpte_changed)(vcpu, gw, it.level - 1))
+ return RET_PF_RETRY;
+
+ if (sp != ERR_PTR(-EEXIST))
+ link_shadow_page(vcpu, it.sptep, sp);
+
+ if (fault->write && table_gfn == fault->gfn)
+ fault->write_fault_to_shadow_pgtable = true;
+ }
+
+ /*
+ * Adjust the hugepage size _after_ resolving indirect shadow pages.
+ * KVM doesn't support mapping hugepages into the guest for gfns that
+ * are being shadowed by KVM, i.e. allocating a new shadow page may
+ * affect the allowed hugepage size.
+ */
+ kvm_mmu_hugepage_adjust(vcpu, fault);
+
+ trace_kvm_mmu_spte_requested(fault);
+
+ for (; shadow_walk_okay(&it); shadow_walk_next(&it)) {
+ /*
+ * We cannot overwrite existing page tables with an NX
+ * large page, as the leaf could be executable.
+ */
+ if (fault->nx_huge_page_workaround_enabled)
+ disallowed_hugepage_adjust(fault, *it.sptep, it.level);
+
+ base_gfn = gfn_round_for_level(fault->gfn, it.level);
+ if (it.level == fault->goal_level)
+ break;
+
+ validate_direct_spte(vcpu, it.sptep, direct_access);
+
+ sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
+ true, direct_access);
+ if (sp == ERR_PTR(-EEXIST))
+ continue;
+
+ link_shadow_page(vcpu, it.sptep, sp);
+ if (fault->huge_page_disallowed)
+ account_nx_huge_page(vcpu->kvm, sp,
+ fault->req_level >= it.level);
+ }
+
+ if (WARN_ON_ONCE(it.level != fault->goal_level))
+ return -EFAULT;
+
+ ret = mmu_set_spte(vcpu, fault->slot, it.sptep, gw->pte_access,
+ base_gfn, fault->pfn, fault);
+ if (ret == RET_PF_SPURIOUS)
+ return ret;
+
+ FNAME(pte_prefetch)(vcpu, gw, it.sptep);
+ return ret;
+}
+
+/*
+ * Page fault handler. There are several causes for a page fault:
+ * - there is no shadow pte for the guest pte
+ * - write access through a shadow pte marked read only so that we can set
+ * the dirty bit
+ * - write access to a shadow pte marked read only so we can update the page
+ * dirty bitmap, when userspace requests it
+ * - mmio access; in this case we will never install a present shadow pte
+ * - normal guest page fault due to the guest pte marked not present, not
+ * writable, or not executable
+ *
+ * Returns: 1 if we need to emulate the instruction, 0 otherwise, or
+ * a negative value on error.
+ */
+static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ struct guest_walker walker;
+ int r;
+
+ WARN_ON_ONCE(fault->is_tdp);
+
+ /*
+ * Look up the guest pte for the faulting address.
+ * If PFEC.RSVD is set, this is a shadow page fault.
+ * The bit needs to be cleared before walking guest page tables.
+ */
+ r = FNAME(walk_addr)(&walker, vcpu, fault->addr,
+ fault->error_code & ~PFERR_RSVD_MASK);
+
+ /*
+ * The page is not mapped by the guest. Let the guest handle it.
+ */
+ if (!r) {
+ if (!fault->prefetch)
+ kvm_inject_emulated_page_fault(vcpu, &walker.fault);
+
+ return RET_PF_RETRY;
+ }
+
+ fault->gfn = walker.gfn;
+ fault->max_level = walker.level;
+ fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
+
+ if (page_fault_handle_page_track(vcpu, fault)) {
+ shadow_page_table_clear_flood(vcpu, fault->addr);
+ return RET_PF_WRITE_PROTECTED;
+ }
+
+ r = mmu_topup_memory_caches(vcpu, true);
+ if (r)
+ return r;
+
+ r = kvm_mmu_faultin_pfn(vcpu, fault, walker.pte_access);
+ if (r != RET_PF_CONTINUE)
+ return r;
+
+#if PTTYPE != PTTYPE_EPT
+ /*
+ * Treat the guest PTE protections as writable, supervisor-only if this
+ * is a supervisor write fault and CR0.WP=0 (supervisor accesses ignore
+ * PTE.W if CR0.WP=0). Don't change the access type for emulated MMIO,
+ * otherwise KVM will cache incorrect access information in the SPTE.
+ */
+ if (fault->write && !(walker.pte_access & ACC_WRITE_MASK) &&
+ !is_cr0_wp(vcpu->arch.mmu) && !fault->user && fault->slot) {
+ walker.pte_access |= ACC_WRITE_MASK;
+ walker.pte_access &= ~ACC_USER_MASK;
+
+ /*
+ * If we converted a user page to a kernel page,
+ * so that the kernel can write to it when cr0.wp=0,
+ * then we should prevent the kernel from executing it
+ * if SMEP is enabled.
+ */
+ if (is_cr4_smep(vcpu->arch.mmu))
+ walker.pte_access &= ~ACC_EXEC_MASK;
+ }
+#endif
+
+ r = RET_PF_RETRY;
+ write_lock(&vcpu->kvm->mmu_lock);
+
+ if (is_page_fault_stale(vcpu, fault))
+ goto out_unlock;
+
+ r = make_mmu_pages_available(vcpu);
+ if (r)
+ goto out_unlock;
+ r = FNAME(fetch)(vcpu, fault, &walker);
+
+out_unlock:
+ kvm_mmu_finish_page_fault(vcpu, fault, r);
+ write_unlock(&vcpu->kvm->mmu_lock);
+ return r;
+}
+
+static gpa_t FNAME(get_level1_sp_gpa)(struct kvm_mmu_page *sp)
+{
+ int offset = 0;
+
+ WARN_ON_ONCE(sp->role.level != PG_LEVEL_4K);
+
+ if (PTTYPE == 32)
+ offset = sp->role.quadrant << SPTE_LEVEL_BITS;
+
+ return gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
+}
+
+/* Note, @addr is a GPA when gva_to_gpa() translates an L2 GPA to an L1 GPA. */
+static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+ gpa_t addr, u64 access,
+ struct x86_exception *exception)
+{
+ struct guest_walker walker;
+ gpa_t gpa = INVALID_GPA;
+ int r;
+
+#ifndef CONFIG_X86_64
+ /* A 64-bit GVA should be impossible on 32-bit KVM. */
+ WARN_ON_ONCE((addr >> 32) && mmu == vcpu->arch.walk_mmu);
+#endif
+
+ r = FNAME(walk_addr_generic)(&walker, vcpu, mmu, addr, access);
+
+ if (r) {
+ gpa = gfn_to_gpa(walker.gfn);
+ gpa |= addr & ~PAGE_MASK;
+ } else if (exception)
+ *exception = walker.fault;
+
+ return gpa;
+}
+
+/*
+ * Using the information in sp->shadowed_translation (kvm_mmu_page_get_gfn()) is
+ * safe because SPTEs are protected by mmu_notifiers and memslot generations, so
+ * the pfn for a given gfn can't change unless all SPTEs pointing to the gfn are
+ * nuked first.
+ *
+ * Returns
+ * < 0: failed to sync spte
+ * 0: the spte is synced and no tlb flushing is required
+ * > 0: the spte is synced and tlb flushing is required
+ */
+static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
+{
+ bool host_writable;
+ gpa_t first_pte_gpa;
+ u64 *sptep, spte;
+ struct kvm_memory_slot *slot;
+ unsigned pte_access;
+ pt_element_t gpte;
+ gpa_t pte_gpa;
+ gfn_t gfn;
+
+ if (WARN_ON_ONCE(sp->spt[i] == SHADOW_NONPRESENT_VALUE ||
+ !sp->shadowed_translation))
+ return 0;
+
+ first_pte_gpa = FNAME(get_level1_sp_gpa)(sp);
+ pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
+
+ if (kvm_vcpu_read_guest_atomic(vcpu, pte_gpa, &gpte,
+ sizeof(pt_element_t)))
+ return -1;
+
+ if (FNAME(prefetch_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte))
+ return 1;
+
+ gfn = gpte_to_gfn(gpte);
+ pte_access = sp->role.access;
+ pte_access &= FNAME(gpte_access)(gpte);
+ FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
+
+ if (sync_mmio_spte(vcpu, &sp->spt[i], gfn, pte_access))
+ return 0;
+
+ /*
+ * Drop the SPTE if the new protections result in no effective
+ * "present" bit or if the gfn is changing. The former case
+ * only affects EPT with execute-only support with pte_access==0;
+ * all other paging modes will create a read-only SPTE if
+ * pte_access is zero.
+ */
+ if ((pte_access | shadow_present_mask) == SHADOW_NONPRESENT_VALUE ||
+ gfn != kvm_mmu_page_get_gfn(sp, i)) {
+ drop_spte(vcpu->kvm, &sp->spt[i]);
+ return 1;
+ }
+ /*
+ * Do nothing if the permissions are unchanged. The existing SPTE is
+	 * still valid, and prefetch_invalid_gpte() has verified that the A/D bits
+ * are set in the "new" gPTE, i.e. there is no danger of missing an A/D
+ * update due to A/D bits being set in the SPTE but not the gPTE.
+ */
+ if (kvm_mmu_page_get_access(sp, i) == pte_access)
+ return 0;
+
+ /* Update the shadowed access bits in case they changed. */
+ kvm_mmu_page_set_access(sp, i, pte_access);
+
+ sptep = &sp->spt[i];
+ spte = *sptep;
+ host_writable = spte & shadow_host_writable_mask;
+ slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+ make_spte(vcpu, sp, slot, pte_access, gfn,
+ spte_to_pfn(spte), spte, true, true,
+ host_writable, &spte);
+
+ /*
+ * There is no need to mark the pfn dirty, as the new protections must
+ * be a subset of the old protections, i.e. synchronizing a SPTE cannot
+ * change the SPTE from read-only to writable.
+ */
+ return mmu_spte_update(sptep, spte);
+}
+
+#undef pt_element_t
+#undef guest_walker
+#undef FNAME
+#undef PT_BASE_ADDR_MASK
+#undef PT_INDEX
+#undef PT_LVL_ADDR_MASK
+#undef PT_LVL_OFFSET_MASK
+#undef PT_LEVEL_BITS
+#undef PT_MAX_FULL_LEVELS
+#undef gpte_to_gfn
+#undef gpte_to_gfn_lvl
+#undef PT_GUEST_ACCESSED_MASK
+#undef PT_GUEST_DIRTY_MASK
+#undef PT_GUEST_DIRTY_SHIFT
+#undef PT_GUEST_ACCESSED_SHIFT
+#undef PT_HAVE_ACCESSED_DIRTY
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
new file mode 100644
index 000000000000..85a0473809b0
--- /dev/null
+++ b/arch/x86/kvm/mmu/spte.c
@@ -0,0 +1,576 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * Macros and functions to access KVM PTEs (also known as SPTEs)
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright 2020 Red Hat, Inc. and/or its affiliates.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kvm_host.h>
+#include "mmu.h"
+#include "mmu_internal.h"
+#include "x86.h"
+#include "spte.h"
+
+#include <asm/e820/api.h>
+#include <asm/memtype.h>
+#include <asm/vmx.h>
+
+bool __read_mostly enable_mmio_caching = true;
+static bool __ro_after_init allow_mmio_caching;
+module_param_named(mmio_caching, enable_mmio_caching, bool, 0444);
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(enable_mmio_caching);
+
+bool __read_mostly kvm_ad_enabled;
+
+u64 __read_mostly shadow_host_writable_mask;
+u64 __read_mostly shadow_mmu_writable_mask;
+u64 __read_mostly shadow_nx_mask;
+u64 __read_mostly shadow_x_mask; /* mutually exclusive with nx_mask */
+u64 __read_mostly shadow_user_mask;
+u64 __read_mostly shadow_accessed_mask;
+u64 __read_mostly shadow_dirty_mask;
+u64 __read_mostly shadow_mmio_value;
+u64 __read_mostly shadow_mmio_mask;
+u64 __read_mostly shadow_mmio_access_mask;
+u64 __read_mostly shadow_present_mask;
+u64 __read_mostly shadow_me_value;
+u64 __read_mostly shadow_me_mask;
+u64 __read_mostly shadow_acc_track_mask;
+
+u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
+u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
+
+static u8 __init kvm_get_host_maxphyaddr(void)
+{
+ /*
+ * boot_cpu_data.x86_phys_bits is reduced when MKTME or SME are detected
+ * in CPU detection code, but the processor treats those reduced bits as
+ * 'keyID' thus they are not reserved bits. Therefore KVM needs to look at
+ * the physical address bits reported by CPUID, i.e. the raw MAXPHYADDR,
+ * when reasoning about CPU behavior with respect to MAXPHYADDR.
+ */
+ if (likely(boot_cpu_data.extended_cpuid_level >= 0x80000008))
+ return cpuid_eax(0x80000008) & 0xff;
+
+ /*
+ * Quite weird to have VMX or SVM but not MAXPHYADDR; probably a VM with
+ * custom CPUID. Proceed with whatever the kernel found since these features
+ * aren't virtualizable (SME/SEV also require CPUIDs higher than 0x80000008).
+ */
+ return boot_cpu_data.x86_phys_bits;
+}
+
+void __init kvm_mmu_spte_module_init(void)
+{
+ /*
+ * Snapshot userspace's desire to allow MMIO caching. Whether or not
+ * KVM can actually enable MMIO caching depends on vendor-specific
+ * hardware capabilities and other module params that can't be resolved
+ * until the vendor module is loaded, i.e. enable_mmio_caching can and
+ * will change when the vendor module is (re)loaded.
+ */
+ allow_mmio_caching = enable_mmio_caching;
+
+ kvm_host.maxphyaddr = kvm_get_host_maxphyaddr();
+}
+
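+/*
+ * The MMIO generation occupies two disjoint bit ranges in the SPTE; scatter
+ * the generation value into those ranges.
+ */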
+static u64 generation_mmio_spte_mask(u64 gen)
+{
+ u64 mask;
+
+ WARN_ON_ONCE(gen & ~MMIO_SPTE_GEN_MASK);
+
+ mask = (gen << MMIO_SPTE_GEN_LOW_SHIFT) & MMIO_SPTE_GEN_LOW_MASK;
+ mask |= (gen << MMIO_SPTE_GEN_HIGH_SHIFT) & MMIO_SPTE_GEN_HIGH_MASK;
+ return mask;
+}
+
+u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
+{
+ u64 gen = kvm_vcpu_memslots(vcpu)->generation & MMIO_SPTE_GEN_MASK;
+ u64 spte = generation_mmio_spte_mask(gen);
+ u64 gpa = gfn << PAGE_SHIFT;
+
+ access &= shadow_mmio_access_mask;
+ spte |= vcpu->kvm->arch.shadow_mmio_value | access;
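+	/*
+	 * Stash the gpa in the SPTE, relocating the bits covered by
+	 * shadow_nonpresent_or_rsvd_mask into high bits so that an MMIO SPTE
+	 * can't be abused to speculatively access host memory (L1TF).
+	 */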
+ spte |= gpa | shadow_nonpresent_or_rsvd_mask;
+ spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
+ << SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
+
+ return spte;
+}
+
+static bool __kvm_is_mmio_pfn(kvm_pfn_t pfn)
+{
+ if (pfn_valid(pfn))
+ return !is_zero_pfn(pfn) && PageReserved(pfn_to_page(pfn)) &&
+ /*
+ * Some reserved pages, such as those from NVDIMM
+ * DAX devices, are not for MMIO, and can be mapped
+ * with cached memory type for better performance.
+ * However, the above check misconceives those pages
+ * as MMIO, and results in KVM mapping them with UC
+ * memory type, which would hurt the performance.
+ * Therefore, we check the host memory type in addition
+ * and only treat UC/UC-/WC pages as MMIO.
+ */
+ (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
+
+ return !e820__mapped_raw_any(pfn_to_hpa(pfn),
+ pfn_to_hpa(pfn + 1) - 1,
+ E820_TYPE_RAM);
+}
+
+static bool kvm_is_mmio_pfn(kvm_pfn_t pfn, int *is_host_mmio)
+{
+ /*
+	 * Determining if a PFN is host MMIO is relatively expensive. Cache the
+ * result locally (in the sole caller) to avoid doing the full query
+ * multiple times when creating a single SPTE.
+ */
+ if (*is_host_mmio < 0)
+ *is_host_mmio = __kvm_is_mmio_pfn(pfn);
+
+ return *is_host_mmio;
+}
+
+static void kvm_track_host_mmio_mapping(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
+
+ if (root)
+ WRITE_ONCE(root->has_mapped_host_mmio, true);
+ else
+ WRITE_ONCE(vcpu->kvm->arch.has_mapped_host_mmio, true);
+
+ /*
+ * Force vCPUs to exit and flush CPU buffers if the vCPU is using the
+ * affected root(s).
+ */
+ kvm_make_all_cpus_request(vcpu->kvm, KVM_REQ_OUTSIDE_GUEST_MODE);
+}
+
+/*
+ * Returns true if the SPTE needs to be updated atomically due to having bits
+ * that may be changed without holding mmu_lock, and for which KVM must not
+ * lose information. E.g. KVM must not drop Dirty bit information. The caller
+ * is responsible for checking if the SPTE is shadow-present, and for
+ * determining whether or not the caller cares about non-leaf SPTEs.
+ */
+bool spte_needs_atomic_update(u64 spte)
+{
+	/* The Writable bit can be set by KVM's fast page fault handler. */
+ if (!is_writable_pte(spte) && is_mmu_writable_spte(spte))
+ return true;
+
+ /*
+ * A/D-disabled SPTEs can be access-tracked by aging, and access-tracked
+ * SPTEs can be restored by KVM's fast page fault handler.
+ */
+ if (!spte_ad_enabled(spte))
+ return true;
+
+ /*
+ * Dirty and Accessed bits can be set by the CPU. Ignore the Accessed
+ * bit, as KVM tolerates false negatives/positives, e.g. KVM doesn't
+ * invalidate TLBs when aging SPTEs, and so it's safe to clobber the
+ * Accessed bit (and rare in practice).
+ */
+ return is_writable_pte(spte) && !(spte & shadow_dirty_mask);
+}
+
+bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ const struct kvm_memory_slot *slot,
+ unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
+ u64 old_spte, bool prefetch, bool synchronizing,
+ bool host_writable, u64 *new_spte)
+{
+ int level = sp->role.level;
+ u64 spte = SPTE_MMU_PRESENT_MASK;
+ int is_host_mmio = -1;
+ bool wrprot = false;
+
+ /*
+ * For the EPT case, shadow_present_mask has no RWX bits set if
+ * exec-only page table entries are supported. In that case,
+ * ACC_USER_MASK and shadow_user_mask are used to represent
+ * read access. See FNAME(gpte_access) in paging_tmpl.h.
+ */
+ WARN_ON_ONCE((pte_access | shadow_present_mask) == SHADOW_NONPRESENT_VALUE);
+
+ if (sp->role.ad_disabled)
+ spte |= SPTE_TDP_AD_DISABLED;
+ else if (kvm_mmu_page_ad_need_write_protect(vcpu->kvm, sp))
+ spte |= SPTE_TDP_AD_WRPROT_ONLY;
+
+ spte |= shadow_present_mask;
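+	/*
+	 * Only mark the SPTE Accessed if the guest actually touched the page,
+	 * i.e. not for prefetches (unless synchronizing an existing mapping).
+	 */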
+ if (!prefetch || synchronizing)
+ spte |= shadow_accessed_mask;
+
+ /*
+ * For simplicity, enforce the NX huge page mitigation even if not
+ * strictly necessary. KVM could ignore the mitigation if paging is
+ * disabled in the guest, as the guest doesn't have any page tables to
+ * abuse. But to safely ignore the mitigation, KVM would have to
+ * ensure a new MMU is loaded (or all shadow pages zapped) when CR0.PG
+ * is toggled on, and that's a net negative for performance when TDP is
+ * enabled. When TDP is disabled, KVM will always switch to a new MMU
+ * when CR0.PG is toggled, but leveraging that to ignore the mitigation
+ * would tie make_spte() further to vCPU/MMU state, and add complexity
+ * just to optimize a mode that is anything but performance critical.
+ */
+ if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
+ is_nx_huge_page_enabled(vcpu->kvm)) {
+ pte_access &= ~ACC_EXEC_MASK;
+ }
+
+ if (pte_access & ACC_EXEC_MASK)
+ spte |= shadow_x_mask;
+ else
+ spte |= shadow_nx_mask;
+
+ if (pte_access & ACC_USER_MASK)
+ spte |= shadow_user_mask;
+
+ if (level > PG_LEVEL_4K)
+ spte |= PT_PAGE_SIZE_MASK;
+
+ if (kvm_x86_ops.get_mt_mask)
+ spte |= kvm_x86_call(get_mt_mask)(vcpu, gfn,
+ kvm_is_mmio_pfn(pfn, &is_host_mmio));
+ if (host_writable)
+ spte |= shadow_host_writable_mask;
+ else
+ pte_access &= ~ACC_WRITE_MASK;
+
+ if (shadow_me_value && !kvm_is_mmio_pfn(pfn, &is_host_mmio))
+ spte |= shadow_me_value;
+
+ spte |= (u64)pfn << PAGE_SHIFT;
+
+ if (pte_access & ACC_WRITE_MASK) {
+ /*
+ * Unsync shadow pages that are reachable by the new, writable
+ * SPTE. Write-protect the SPTE if the page can't be unsync'd,
+ * e.g. it's write-tracked (upper-level SPs) or has one or more
+ * shadow pages and unsync'ing pages is not allowed.
+ *
+ * When overwriting an existing leaf SPTE, and the old SPTE was
+ * writable, skip trying to unsync shadow pages as any relevant
+ * shadow pages must already be unsync, i.e. the hash lookup is
+ * unnecessary (and expensive). Note, this relies on KVM not
+ * changing PFNs without first zapping the old SPTE, which is
+ * guaranteed by both the shadow MMU and the TDP MMU.
+ */
+ if ((!is_last_spte(old_spte, level) || !is_writable_pte(old_spte)) &&
+ mmu_try_to_unsync_pages(vcpu->kvm, slot, gfn, synchronizing, prefetch))
+ wrprot = true;
+ else
+ spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask |
+ shadow_dirty_mask;
+ }
+
+ if (prefetch && !synchronizing)
+ spte = mark_spte_for_access_track(spte);
+
+ WARN_ONCE(is_rsvd_spte(&vcpu->arch.mmu->shadow_zero_check, spte, level),
+ "spte = 0x%llx, level = %d, rsvd bits = 0x%llx", spte, level,
+ get_rsvd_bits(&vcpu->arch.mmu->shadow_zero_check, spte, level));
+
+ /*
+ * Mark the memslot dirty *after* modifying it for access tracking.
+ * Unlike folios, memslots can be safely marked dirty out of mmu_lock,
+ * i.e. in the fast page fault handler.
+ */
+ if ((spte & PT_WRITABLE_MASK) && kvm_slot_dirty_track_enabled(slot)) {
+ /* Enforced by kvm_mmu_hugepage_adjust. */
+ WARN_ON_ONCE(level > PG_LEVEL_4K);
+ mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
+ }
+
+ if (cpu_feature_enabled(X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO) &&
+ !kvm_vcpu_can_access_host_mmio(vcpu) &&
+ kvm_is_mmio_pfn(pfn, &is_host_mmio))
+ kvm_track_host_mmio_mapping(vcpu);
+
+ *new_spte = spte;
+ return wrprot;
+}
+
+static u64 modify_spte_protections(u64 spte, u64 set, u64 clear)
+{
+ bool is_access_track = is_access_track_spte(spte);
+
+ if (is_access_track)
+ spte = restore_acc_track_spte(spte);
+
+ KVM_MMU_WARN_ON(set & clear);
+ spte = (spte | set) & ~clear;
+
+ if (is_access_track)
+ spte = mark_spte_for_access_track(spte);
+
+ return spte;
+}
+
+static u64 make_spte_executable(u64 spte)
+{
+ return modify_spte_protections(spte, shadow_x_mask, shadow_nx_mask);
+}
+
+static u64 make_spte_nonexecutable(u64 spte)
+{
+ return modify_spte_protections(spte, shadow_nx_mask, shadow_x_mask);
+}
+
+/*
+ * Construct an SPTE that maps a sub-page of the given huge page SPTE where
+ * `index` identifies which sub-page.
+ *
+ * This is used during huge page splitting to build the SPTEs that make up the
+ * new page table.
+ */
+u64 make_small_spte(struct kvm *kvm, u64 huge_spte,
+ union kvm_mmu_page_role role, int index)
+{
+ u64 child_spte = huge_spte;
+
+ KVM_BUG_ON(!is_shadow_present_pte(huge_spte) || !is_large_pte(huge_spte), kvm);
+
+ /*
+ * The child_spte already has the base address of the huge page being
+ * split. So we just have to OR in the offset to the page at the next
+ * lower level for the given index.
+ */
+ child_spte |= (index * KVM_PAGES_PER_HPAGE(role.level)) << PAGE_SHIFT;
+
+ if (role.level == PG_LEVEL_4K) {
+ child_spte &= ~PT_PAGE_SIZE_MASK;
+
+ /*
+ * When splitting to a 4K page where execution is allowed, mark
+ * the page executable as the NX hugepage mitigation no longer
+ * applies.
+ */
+ if ((role.access & ACC_EXEC_MASK) && is_nx_huge_page_enabled(kvm))
+ child_spte = make_spte_executable(child_spte);
+ }
+
+ return child_spte;
+}
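+
+/*
+ * Illustrative example (editorial note): when splitting a 2MiB SPTE into a
+ * table of 4KiB SPTEs, role.level is PG_LEVEL_4K, KVM_PAGES_PER_HPAGE()
+ * evaluates to 1, and child i thus maps huge_pfn + i. Each child also drops
+ * PT_PAGE_SIZE_MASK and, if exec is allowed and the NX hugepage mitigation
+ * is enabled, regains execute permission via make_spte_executable().
+ */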
+
+u64 make_huge_spte(struct kvm *kvm, u64 small_spte, int level)
+{
+ u64 huge_spte;
+
+ KVM_BUG_ON(!is_shadow_present_pte(small_spte) || level == PG_LEVEL_4K, kvm);
+
+ huge_spte = small_spte | PT_PAGE_SIZE_MASK;
+
+ /*
+ * huge_spte already has the address of the sub-page being collapsed
+ * from small_spte, so just clear the lower address bits to create the
+ * huge page address.
+ */
+ huge_spte &= KVM_HPAGE_MASK(level) | ~PAGE_MASK;
+
+ if (is_nx_huge_page_enabled(kvm))
+ huge_spte = make_spte_nonexecutable(huge_spte);
+
+ return huge_spte;
+}
+
+u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled)
+{
+ u64 spte = SPTE_MMU_PRESENT_MASK;
+
+ spte |= __pa(child_pt) | shadow_present_mask | PT_WRITABLE_MASK |
+ shadow_user_mask | shadow_x_mask | shadow_me_value;
+
+ if (ad_disabled)
+ spte |= SPTE_TDP_AD_DISABLED;
+ else
+ spte |= shadow_accessed_mask;
+
+ return spte;
+}
+
+u64 mark_spte_for_access_track(u64 spte)
+{
+ if (spte_ad_enabled(spte))
+ return spte & ~shadow_accessed_mask;
+
+ if (is_access_track_spte(spte))
+ return spte;
+
+ check_spte_writable_invariants(spte);
+
+ WARN_ONCE(spte & (SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
+ SHADOW_ACC_TRACK_SAVED_BITS_SHIFT),
+ "Access Tracking saved bit locations are not zero\n");
+
+ spte |= (spte & SHADOW_ACC_TRACK_SAVED_BITS_MASK) <<
+ SHADOW_ACC_TRACK_SAVED_BITS_SHIFT;
+ spte &= ~(shadow_acc_track_mask | shadow_accessed_mask);
+
+ return spte;
+}
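+
+/*
+ * Illustrative example (editorial note): for an A/D-disabled EPT SPTE with
+ * RWX = 111b, the R and X bits (mask 0x5) are copied up to bits 54 and 56
+ * and bits 2:0 are then cleared, so the next guest access faults. W is
+ * deliberately not saved; restore_acc_track_spte() brings back only R and X,
+ * and W is restored only when a write is attempted, preserving dirty
+ * tracking.
+ */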
+
+void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
+{
+ BUG_ON((u64)(unsigned)access_mask != access_mask);
+ WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
+
+ /*
+ * Reset to the original module param value to honor userspace's desire
+ * to (dis)allow MMIO caching. Update the param itself so that
+ * userspace can see whether or not KVM is actually using MMIO caching.
+ */
+ enable_mmio_caching = allow_mmio_caching;
+ if (!enable_mmio_caching)
+ mmio_value = 0;
+
+ /*
+ * The mask must contain only bits that are carved out specifically for
+ * the MMIO SPTE mask, e.g. to ensure there's no overlap with the MMIO
+ * generation.
+ */
+ if (WARN_ON(mmio_mask & ~SPTE_MMIO_ALLOWED_MASK))
+ mmio_value = 0;
+
+ /*
+ * Disable MMIO caching if the MMIO value collides with the bits that
+ * are used to hold the relocated GFN when the L1TF mitigation is
+ * enabled. This should never fire as there is no known hardware that
+ * can trigger this condition, e.g. SME/SEV CPUs that require a custom
+ * MMIO value are not susceptible to L1TF.
+ */
+ if (WARN_ON(mmio_value & (shadow_nonpresent_or_rsvd_mask <<
+ SHADOW_NONPRESENT_OR_RSVD_MASK_LEN)))
+ mmio_value = 0;
+
+ /*
+ * The masked MMIO value must obviously match itself and a frozen SPTE
+ * must not get a false positive. Frozen SPTEs and MMIO SPTEs should
+ * never collide as MMIO must set some RWX bits, and frozen SPTEs must
+ * not set any RWX bits.
+ */
+ if (WARN_ON((mmio_value & mmio_mask) != mmio_value) ||
+ WARN_ON(mmio_value && (FROZEN_SPTE & mmio_mask) == mmio_value))
+ mmio_value = 0;
+
+ if (!mmio_value)
+ enable_mmio_caching = false;
+
+ shadow_mmio_value = mmio_value;
+ shadow_mmio_mask = mmio_mask;
+ shadow_mmio_access_mask = access_mask;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_set_mmio_spte_mask);
+
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value)
+{
+ kvm->arch.shadow_mmio_value = mmio_value;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_set_mmio_spte_value);
+
+void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
+{
+ /* shadow_me_value must be a subset of shadow_me_mask */
+ if (WARN_ON(me_value & ~me_mask))
+ me_value = me_mask = 0;
+
+ shadow_me_value = me_value;
+ shadow_me_mask = me_mask;
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_set_me_spte_mask);
+
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
+{
+ kvm_ad_enabled = has_ad_bits;
+
+ shadow_user_mask = VMX_EPT_READABLE_MASK;
+ shadow_accessed_mask = VMX_EPT_ACCESS_BIT;
+ shadow_dirty_mask = VMX_EPT_DIRTY_BIT;
+ shadow_nx_mask = 0ull;
+ shadow_x_mask = VMX_EPT_EXECUTABLE_MASK;
+ /* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
+ shadow_present_mask =
+ (has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
+
+ shadow_acc_track_mask = VMX_EPT_RWX_MASK;
+ shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
+ shadow_mmu_writable_mask = EPT_SPTE_MMU_WRITABLE;
+
+ /*
+ * EPT Misconfigurations are generated if the value of bits 2:0
+ * of an EPT paging-structure entry is 110b (write/execute).
+ */
+ kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
+ VMX_EPT_RWX_MASK | VMX_EPT_SUPPRESS_VE_BIT, 0);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_mmu_set_ept_masks);
+
+void kvm_mmu_reset_all_pte_masks(void)
+{
+ u8 low_phys_bits;
+ u64 mask;
+
+ kvm_ad_enabled = true;
+
+ /*
+	 * If the CPU has 46 or fewer physical address bits, then set an
+ * appropriate mask to guard against L1TF attacks. Otherwise, it is
+ * assumed that the CPU is not vulnerable to L1TF.
+ *
+ * Some Intel CPUs address the L1 cache using more PA bits than are
+ * reported by CPUID. Use the PA width of the L1 cache when possible
+ * to achieve more effective mitigation, e.g. if system RAM overlaps
+ * the most significant bits of legal physical address space.
+ */
+ shadow_nonpresent_or_rsvd_mask = 0;
+ low_phys_bits = boot_cpu_data.x86_phys_bits;
+ if (boot_cpu_has_bug(X86_BUG_L1TF) &&
+ !WARN_ON_ONCE(boot_cpu_data.x86_cache_bits >=
+ 52 - SHADOW_NONPRESENT_OR_RSVD_MASK_LEN)) {
+ low_phys_bits = boot_cpu_data.x86_cache_bits
+ - SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
+ shadow_nonpresent_or_rsvd_mask =
+ rsvd_bits(low_phys_bits, boot_cpu_data.x86_cache_bits - 1);
+ }
+
+ shadow_nonpresent_or_rsvd_lower_gfn_mask =
+ GENMASK_ULL(low_phys_bits - 1, PAGE_SHIFT);
+
+ shadow_user_mask = PT_USER_MASK;
+ shadow_accessed_mask = PT_ACCESSED_MASK;
+ shadow_dirty_mask = PT_DIRTY_MASK;
+ shadow_nx_mask = PT64_NX_MASK;
+ shadow_x_mask = 0;
+ shadow_present_mask = PT_PRESENT_MASK;
+
+ shadow_acc_track_mask = 0;
+ shadow_me_mask = 0;
+ shadow_me_value = 0;
+
+ shadow_host_writable_mask = DEFAULT_SPTE_HOST_WRITABLE;
+ shadow_mmu_writable_mask = DEFAULT_SPTE_MMU_WRITABLE;
+
+ /*
+ * Set a reserved PA bit in MMIO SPTEs to generate page faults with
+ * PFEC.RSVD=1 on MMIO accesses. 64-bit PTEs (PAE, x86-64, and EPT
+ * paging) support a maximum of 52 bits of PA, i.e. if the CPU supports
+ * 52-bit physical addresses then there are no reserved PA bits in the
+ * PTEs and so the reserved PA approach must be disabled.
+ */
+ if (kvm_host.maxphyaddr < 52)
+ mask = BIT_ULL(51) | PT_PRESENT_MASK;
+ else
+ mask = 0;
+
+ kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
+}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
new file mode 100644
index 000000000000..91ce29fd6f1b
--- /dev/null
+++ b/arch/x86/kvm/mmu/spte.h
@@ -0,0 +1,572 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#ifndef KVM_X86_MMU_SPTE_H
+#define KVM_X86_MMU_SPTE_H
+
+#include <asm/vmx.h>
+
+#include "mmu.h"
+#include "mmu_internal.h"
+
+/*
+ * An MMU-present SPTE is backed by actual memory and may or may not be present
+ * in hardware. E.g. MMIO SPTEs are not considered present. Use bit 11, as it
+ * is ignored by all flavors of SPTEs and checking a low bit often generates
+ * better code than for a high bit, e.g. 56+. MMU present checks are pervasive
+ * enough that the improved code generation is noticeable in KVM's footprint.
+ */
+#define SPTE_MMU_PRESENT_MASK BIT_ULL(11)
+
+/*
+ * TDP SPTEs (more specifically, EPT SPTEs) may not have A/D bits, and may also
+ * be restricted to using write-protection (for L2 when CPU dirty logging, i.e.
+ * PML, is enabled). Use bits 52 and 53 to hold the type of A/D tracking that
+ * must be employed for a given TDP SPTE.
+ *
+ * Note, the "enabled" mask must be '0', as bits 62:52 are _reserved_ for PAE
+ * paging, including NPT PAE. This scheme works because legacy shadow paging
+ * is guaranteed to have A/D bits and write-protection is forced only for
+ * TDP with CPU dirty logging (PML). If NPT ever gains PML-like support, it
+ * must be restricted to 64-bit KVM.
+ */
+#define SPTE_TDP_AD_SHIFT 52
+#define SPTE_TDP_AD_MASK (3ULL << SPTE_TDP_AD_SHIFT)
+#define SPTE_TDP_AD_ENABLED (0ULL << SPTE_TDP_AD_SHIFT)
+#define SPTE_TDP_AD_DISABLED (1ULL << SPTE_TDP_AD_SHIFT)
+#define SPTE_TDP_AD_WRPROT_ONLY (2ULL << SPTE_TDP_AD_SHIFT)
+static_assert(SPTE_TDP_AD_ENABLED == 0);
+
+#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
+#define SPTE_BASE_ADDR_MASK (physical_mask & ~(u64)(PAGE_SIZE-1))
+#else
+#define SPTE_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
+#endif
+
+#define SPTE_PERM_MASK (PT_PRESENT_MASK | PT_WRITABLE_MASK | shadow_user_mask \
+ | shadow_x_mask | shadow_nx_mask | shadow_me_mask)
+
+#define ACC_EXEC_MASK 1
+#define ACC_WRITE_MASK PT_WRITABLE_MASK
+#define ACC_USER_MASK PT_USER_MASK
+#define ACC_ALL (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)
+
+/* The mask for the R/X bits in EPT PTEs */
+#define SPTE_EPT_READABLE_MASK 0x1ull
+#define SPTE_EPT_EXECUTABLE_MASK 0x4ull
+
+#define SPTE_LEVEL_BITS 9
+#define SPTE_LEVEL_SHIFT(level) __PT_LEVEL_SHIFT(level, SPTE_LEVEL_BITS)
+#define SPTE_INDEX(address, level) __PT_INDEX(address, level, SPTE_LEVEL_BITS)
+#define SPTE_ENT_PER_PAGE __PT_ENT_PER_PAGE(SPTE_LEVEL_BITS)
+
+/*
+ * The mask/shift to use for saving the original R/X bits when marking the PTE
+ * as not-present for access tracking purposes. We do not save the W bit as the
+ * PTEs being access tracked also need to be dirty tracked, so the W bit will be
+ * restored only when a write is attempted to the page. This mask obviously
+ * must not overlap the A/D type mask.
+ */
+#define SHADOW_ACC_TRACK_SAVED_BITS_MASK (SPTE_EPT_READABLE_MASK | \
+ SPTE_EPT_EXECUTABLE_MASK)
+#define SHADOW_ACC_TRACK_SAVED_BITS_SHIFT 54
+#define SHADOW_ACC_TRACK_SAVED_MASK (SHADOW_ACC_TRACK_SAVED_BITS_MASK << \
+ SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
+static_assert(!(SPTE_TDP_AD_MASK & SHADOW_ACC_TRACK_SAVED_MASK));
+
+/*
+ * {DEFAULT,EPT}_SPTE_{HOST,MMU}_WRITABLE are used to keep track of why a given
+ * SPTE is write-protected. See is_writable_pte() for details.
+ */
+
+/* Bits 9 and 10 are ignored by all non-EPT PTEs. */
+#define DEFAULT_SPTE_HOST_WRITABLE BIT_ULL(9)
+#define DEFAULT_SPTE_MMU_WRITABLE BIT_ULL(10)
+
+/*
+ * Low ignored bits are at a premium for EPT, use high ignored bits, taking care
+ * to not overlap the A/D type mask or the saved access bits of access-tracked
+ * SPTEs when A/D bits are disabled.
+ */
+#define EPT_SPTE_HOST_WRITABLE BIT_ULL(57)
+#define EPT_SPTE_MMU_WRITABLE BIT_ULL(58)
+
+static_assert(!(EPT_SPTE_HOST_WRITABLE & SPTE_TDP_AD_MASK));
+static_assert(!(EPT_SPTE_MMU_WRITABLE & SPTE_TDP_AD_MASK));
+static_assert(!(EPT_SPTE_HOST_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
+static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
+
+/* Defined only to keep the above static asserts readable. */
+#undef SHADOW_ACC_TRACK_SAVED_MASK
+
+/*
+ * Due to limited space in PTEs, the MMIO generation is a 19-bit subset of
+ * the memslots generation and is derived as follows:
+ *
+ * Bits 0-7 of the MMIO generation are propagated to spte bits 3-10
+ * Bits 8-18 of the MMIO generation are propagated to spte bits 52-62
+ *
+ * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
+ * the MMIO generation number, as doing so would require stealing a bit from
+ * the "real" generation number and thus effectively halve the maximum number
+ * of MMIO generations that can be handled before encountering a wrap (which
+ * requires a full MMU zap). The flag is instead explicitly queried when
+ * checking for MMIO spte cache hits.
+ */
+
+#define MMIO_SPTE_GEN_LOW_START 3
+#define MMIO_SPTE_GEN_LOW_END 10
+
+#define MMIO_SPTE_GEN_HIGH_START 52
+#define MMIO_SPTE_GEN_HIGH_END 62
+
+#define MMIO_SPTE_GEN_LOW_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
+ MMIO_SPTE_GEN_LOW_START)
+#define MMIO_SPTE_GEN_HIGH_MASK GENMASK_ULL(MMIO_SPTE_GEN_HIGH_END, \
+ MMIO_SPTE_GEN_HIGH_START)
+static_assert(!(SPTE_MMU_PRESENT_MASK &
+ (MMIO_SPTE_GEN_LOW_MASK | MMIO_SPTE_GEN_HIGH_MASK)));
+
+/*
+ * The SPTE MMIO mask must NOT overlap the MMIO generation bits or the
+ * MMU-present bit. The generation obviously co-exists with the magic MMIO
+ * mask/value, and MMIO SPTEs are considered !MMU-present.
+ *
+ * The SPTE MMIO mask is allowed to use hardware "present" bits (i.e. all EPT
+ * RWX bits), all physical address bits (legal PA bits are used for "fast" MMIO
+ * and so they're off-limits for generation; additional checks ensure the mask
+ * doesn't overlap legal PA bits), and bit 63 (carved out for future usage).
+ */
+#define SPTE_MMIO_ALLOWED_MASK (BIT_ULL(63) | GENMASK_ULL(51, 12) | GENMASK_ULL(2, 0))
+static_assert(!(SPTE_MMIO_ALLOWED_MASK &
+ (SPTE_MMU_PRESENT_MASK | MMIO_SPTE_GEN_LOW_MASK | MMIO_SPTE_GEN_HIGH_MASK)));
+
+#define MMIO_SPTE_GEN_LOW_BITS (MMIO_SPTE_GEN_LOW_END - MMIO_SPTE_GEN_LOW_START + 1)
+#define MMIO_SPTE_GEN_HIGH_BITS (MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)
+
+/* remember to adjust the comment above as well if you change these */
+static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
+
+#define MMIO_SPTE_GEN_LOW_SHIFT (MMIO_SPTE_GEN_LOW_START - 0)
+#define MMIO_SPTE_GEN_HIGH_SHIFT (MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_LOW_BITS)
+
+#define MMIO_SPTE_GEN_MASK GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
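+
+/*
+ * Illustrative example (editorial note): for generation 0x1ff, the low eight
+ * bits (0xff) are stored in SPTE bits 10:3 and the remaining high bit lands
+ * at SPTE bit 52, i.e. the encoded value is (0xffULL << 3) | (1ULL << 52).
+ * get_mmio_spte_generation() undoes the two shifts and ORs the halves back
+ * together.
+ */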
+
+/*
+ * Non-present SPTE value needs to set bit 63 for TDX, in order to suppress
+ * #VE and get EPT violations on non-present PTEs. We can use the
+ * same value also without TDX for both VMX and SVM:
+ *
+ * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
+ * For VMX EPT, bit 63 is ignored if #VE is disabled. (EPT_VIOLATION_VE=0)
+ * bit 63 is #VE suppress if #VE is enabled. (EPT_VIOLATION_VE=1)
+ */
+#ifdef CONFIG_X86_64
+#define SHADOW_NONPRESENT_VALUE BIT_ULL(63)
+static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
+#else
+#define SHADOW_NONPRESENT_VALUE 0ULL
+#endif
+
+
+/*
+ * True if A/D bits are supported in hardware and are enabled by KVM. When
+ * enabled, KVM uses A/D bits for all non-nested MMUs. Because L1 can disable
+ * A/D bits in EPTP12, SP and SPTE variants are needed to handle the scenario
+ * where KVM is using A/D bits for L1, but not L2.
+ */
+extern bool __read_mostly kvm_ad_enabled;
+
+extern u64 __read_mostly shadow_host_writable_mask;
+extern u64 __read_mostly shadow_mmu_writable_mask;
+extern u64 __read_mostly shadow_nx_mask;
+extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
+extern u64 __read_mostly shadow_user_mask;
+extern u64 __read_mostly shadow_accessed_mask;
+extern u64 __read_mostly shadow_dirty_mask;
+extern u64 __read_mostly shadow_mmio_value;
+extern u64 __read_mostly shadow_mmio_mask;
+extern u64 __read_mostly shadow_mmio_access_mask;
+extern u64 __read_mostly shadow_present_mask;
+extern u64 __read_mostly shadow_me_value;
+extern u64 __read_mostly shadow_me_mask;
+
+/*
+ * SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED;
+ * shadow_acc_track_mask is the set of bits to be cleared in non-accessed
+ * pages.
+ */
+extern u64 __read_mostly shadow_acc_track_mask;
+
+/*
+ * This mask must be set on all non-zero Non-Present or Reserved SPTEs in order
+ * to guard against L1TF attacks.
+ */
+extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
+
+/*
+ * The number of high-order 1 bits to use in the mask above.
+ */
+#define SHADOW_NONPRESENT_OR_RSVD_MASK_LEN 5
+
+/*
+ * If a thread running without exclusive control of the MMU lock must perform a
+ * multi-part operation on an SPTE, it can set the SPTE to FROZEN_SPTE as a
+ * non-present intermediate value. Other threads which encounter this value
+ * should not modify the SPTE.
+ *
+ * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
+ * both AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
+ * vulnerability.
+ *
+ * Only used by the TDP MMU.
+ */
+#define FROZEN_SPTE (SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
+
+/* Frozen SPTEs must not be misconstrued as shadow present PTEs. */
+static_assert(!(FROZEN_SPTE & SPTE_MMU_PRESENT_MASK));
+
+static inline bool is_frozen_spte(u64 spte)
+{
+ return spte == FROZEN_SPTE;
+}
+
+/* Get an SPTE's index into its parent's page table (and the spt array). */
+static inline int spte_index(u64 *sptep)
+{
+ return ((unsigned long)sptep / sizeof(*sptep)) & (SPTE_ENT_PER_PAGE - 1);
+}
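+
+/*
+ * Illustrative example (editorial note): a 4KiB page table holds 512
+ * eight-byte SPTEs, so for an sptep at byte offset 0x38 within its page this
+ * evaluates to 0x38 / 8 = 7.
+ */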
+
+/*
+ * In some cases, we need to preserve the GFN of a non-present or reserved
+ * SPTE when we usurp the upper five bits of the physical address space to
+ * defend against L1TF, e.g. for MMIO SPTEs. To preserve the GFN, we'll
+ * shift bits of the GFN that overlap with shadow_nonpresent_or_rsvd_mask
+ * left into the reserved bits, i.e. the GFN in the SPTE will be split into
+ * high and low parts. This mask covers the lower bits of the GFN.
+ */
+extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
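+
+/*
+ * Illustrative example (editorial note): on an L1TF-affected CPU with 46
+ * cache/physical address bits, bits 45:41 are usurped, so this mask is
+ * GENMASK_ULL(40, 12) and the colliding GFN bits are parked
+ * SHADOW_NONPRESENT_OR_RSVD_MASK_LEN (5) positions higher in the SPTE.
+ */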
+
+static inline hpa_t kvm_mmu_get_dummy_root(void)
+{
+ return my_zero_pfn(0) << PAGE_SHIFT;
+}
+
+static inline bool kvm_mmu_is_dummy_root(hpa_t shadow_page)
+{
+ return is_zero_pfn(shadow_page >> PAGE_SHIFT);
+}
+
+static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
+{
+ struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
+
+ return (struct kvm_mmu_page *)page_private(page);
+}
+
+static inline struct kvm_mmu_page *spte_to_child_sp(u64 spte)
+{
+ return to_shadow_page(spte & SPTE_BASE_ADDR_MASK);
+}
+
+static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
+{
+ return to_shadow_page(__pa(sptep));
+}
+
+static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
+{
+ if (kvm_mmu_is_dummy_root(root))
+ return NULL;
+
+ /*
+ * The "root" may be a special root, e.g. a PAE entry, treat it as a
+ * SPTE to ensure any non-PA bits are dropped.
+ */
+ return spte_to_child_sp(root);
+}
+
+static inline bool is_mirror_sptep(tdp_ptep_t sptep)
+{
+ return is_mirror_sp(sptep_to_sp(rcu_dereference(sptep)));
+}
+
+static inline bool kvm_vcpu_can_access_host_mmio(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
+
+ if (root)
+ return READ_ONCE(root->has_mapped_host_mmio);
+
+ return READ_ONCE(vcpu->kvm->arch.has_mapped_host_mmio);
+}
+
+static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
+{
+ return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
+ likely(enable_mmio_caching);
+}
+
+static inline bool is_shadow_present_pte(u64 pte)
+{
+ return !!(pte & SPTE_MMU_PRESENT_MASK);
+}
+
+static inline bool is_ept_ve_possible(u64 spte)
+{
+ return (shadow_present_mask & VMX_EPT_SUPPRESS_VE_BIT) &&
+ !(spte & VMX_EPT_SUPPRESS_VE_BIT) &&
+ (spte & VMX_EPT_RWX_MASK) != VMX_EPT_MISCONFIG_WX_VALUE;
+}
+
+static inline bool sp_ad_disabled(struct kvm_mmu_page *sp)
+{
+ return sp->role.ad_disabled;
+}
+
+static inline bool spte_ad_enabled(u64 spte)
+{
+ KVM_MMU_WARN_ON(!is_shadow_present_pte(spte));
+ return (spte & SPTE_TDP_AD_MASK) != SPTE_TDP_AD_DISABLED;
+}
+
+static inline bool spte_ad_need_write_protect(u64 spte)
+{
+ KVM_MMU_WARN_ON(!is_shadow_present_pte(spte));
+ /*
+ * This is benign for non-TDP SPTEs as SPTE_TDP_AD_ENABLED is '0',
+ * and non-TDP SPTEs will never set these bits. Optimize for 64-bit
+ * TDP and do the A/D type check unconditionally.
+ */
+ return (spte & SPTE_TDP_AD_MASK) != SPTE_TDP_AD_ENABLED;
+}
+
+static inline bool is_access_track_spte(u64 spte)
+{
+ return !spte_ad_enabled(spte) && (spte & shadow_acc_track_mask) == 0;
+}
+
+static inline bool is_large_pte(u64 pte)
+{
+ return pte & PT_PAGE_SIZE_MASK;
+}
+
+static inline bool is_last_spte(u64 pte, int level)
+{
+ return (level == PG_LEVEL_4K) || is_large_pte(pte);
+}
+
+static inline bool is_executable_pte(u64 spte)
+{
+ return (spte & (shadow_x_mask | shadow_nx_mask)) == shadow_x_mask;
+}
+
+static inline kvm_pfn_t spte_to_pfn(u64 pte)
+{
+ return (pte & SPTE_BASE_ADDR_MASK) >> PAGE_SHIFT;
+}
+
+static inline bool is_accessed_spte(u64 spte)
+{
+ return spte & shadow_accessed_mask;
+}
+
+static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
+ int level)
+{
+ int bit7 = (pte >> 7) & 1;
+
+ return rsvd_check->rsvd_bits_mask[bit7][level-1];
+}
+
+static inline bool __is_rsvd_bits_set(struct rsvd_bits_validate *rsvd_check,
+ u64 pte, int level)
+{
+ return pte & get_rsvd_bits(rsvd_check, pte, level);
+}
+
+static inline bool __is_bad_mt_xwr(struct rsvd_bits_validate *rsvd_check,
+ u64 pte)
+{
+ return rsvd_check->bad_mt_xwr & BIT_ULL(pte & 0x3f);
+}
+
+static __always_inline bool is_rsvd_spte(struct rsvd_bits_validate *rsvd_check,
+ u64 spte, int level)
+{
+ return __is_bad_mt_xwr(rsvd_check, spte) ||
+ __is_rsvd_bits_set(rsvd_check, spte, level);
+}
+
+/*
+ * A shadow-present leaf SPTE may be non-writable for 4 possible reasons:
+ *
+ * 1. To intercept writes for dirty logging. KVM write-protects huge pages
+ * so that they can be split down into the dirty logging
+ * granularity (4KiB) whenever the guest writes to them. KVM also
+ * write-protects 4KiB pages so that writes can be recorded in the dirty log
+ * (e.g. if not using PML). SPTEs are write-protected for dirty logging
+ *    during the VM-ioctls that enable dirty logging.
+ *
+ * 2. To intercept writes to guest page tables that KVM is shadowing. When a
+ * guest writes to its page table the corresponding shadow page table will
+ * be marked "unsync". That way KVM knows which shadow page tables need to
+ * be updated on the next TLB flush, INVLPG, etc. and which do not.
+ *
+ * 3. To prevent guest writes to read-only memory, such as for memory in a
+ * read-only memslot or guest memory backed by a read-only VMA. Writes to
+ * such pages are disallowed entirely.
+ *
+ * 4. To emulate the Accessed bit for SPTEs without A/D bits. Note, in this
+ * case, the SPTE is access-protected, not just write-protected!
+ *
+ * For cases #1 and #4, KVM can safely make such SPTEs writable without taking
+ * mmu_lock as capturing the Accessed/Dirty state doesn't require taking it.
+ * To differentiate #1 and #4 from #2 and #3, KVM uses two software-only bits
+ * in the SPTE:
+ *
+ * shadow_mmu_writable_mask, aka MMU-writable -
+ * Cleared on SPTEs that KVM is currently write-protecting for shadow paging
+ * purposes (case 2 above).
+ *
+ * shadow_host_writable_mask, aka Host-writable -
+ *  Cleared on SPTEs that are not host-writable (case 3 above).
+ *
+ * Note, not all possible combinations of PT_WRITABLE_MASK,
+ * shadow_mmu_writable_mask, and shadow_host_writable_mask are valid. A given
+ * SPTE can be in only one of the following states, which map to cases 1-3
+ * above:
+ *
+ * shadow_host_writable_mask | shadow_mmu_writable_mask | PT_WRITABLE_MASK
+ * ------------------------- | ------------------------ | ----------------
+ * 1 | 1 | 1 (writable)
+ * 1 | 1 | 0 (case 1)
+ * 1 | 0 | 0 (case 2)
+ * 0 | 0 | 0 (case 3)
+ *
+ * The valid combinations of these bits are checked by
+ * check_spte_writable_invariants() whenever an SPTE is modified.
+ *
+ * Clearing the MMU-writable bit is always done under the MMU lock and always
+ * accompanied by a TLB flush before dropping the lock to avoid corrupting the
+ * shadow page tables between vCPUs. Write-protecting an SPTE for dirty logging
+ * (which does not clear the MMU-writable bit), does not flush TLBs before
+ * dropping the lock, as it only needs to synchronize guest writes with the
+ * dirty bitmap. Similarly, making the SPTE inaccessible (and non-writable) for
+ * access-tracking via the clear_young() MMU notifier also does not flush TLBs.
+ *
+ * So, there is the problem: clearing the MMU-writable bit can encounter a
+ * write-protected SPTE while CPUs still have writable mappings for that SPTE
+ * cached in their TLB. To address this, KVM always flushes TLBs when
+ * write-protecting SPTEs if the MMU-writable bit is set on the old SPTE.
+ *
+ * The Host-writable bit is not modified on present SPTEs; it is only set or
+ * cleared when an SPTE is first faulted in from non-present and then remains
+ * immutable.
+ */
+static inline bool is_writable_pte(unsigned long pte)
+{
+ return pte & PT_WRITABLE_MASK;
+}
+
+/* Note: spte must be a shadow-present leaf SPTE. */
+static inline void check_spte_writable_invariants(u64 spte)
+{
+ if (spte & shadow_mmu_writable_mask)
+ WARN_ONCE(!(spte & shadow_host_writable_mask),
+ KBUILD_MODNAME ": MMU-writable SPTE is not Host-writable: %llx",
+ spte);
+ else
+ WARN_ONCE(is_writable_pte(spte),
+ KBUILD_MODNAME ": Writable SPTE is not MMU-writable: %llx", spte);
+}
+
+static inline bool is_mmu_writable_spte(u64 spte)
+{
+ return spte & shadow_mmu_writable_mask;
+}
+
+/*
+ * Returns true if the access indicated by @fault is allowed by the existing
+ * SPTE protections. Note, the caller is responsible for checking that the
+ * SPTE is a shadow-present, leaf SPTE (either before or after).
+ */
+static inline bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
+{
+ if (fault->exec)
+ return is_executable_pte(spte);
+
+ if (fault->write)
+ return is_writable_pte(spte);
+
+ /* Fault was on Read access */
+ return spte & PT_PRESENT_MASK;
+}
+
+/*
+ * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
+ * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only,
+ * as KVM allows stale Writable TLB entries to exist. When dirty logging, KVM
+ * flushes TLBs based on whether or not dirty bitmap/ring entries were reaped,
+ * not whether or not SPTEs were modified, i.e. only the write-tracking case
+ * needs to flush at the time the SPTE is modified, before dropping mmu_lock.
+ *
+ * Don't flush if the Accessed bit is cleared, as access tracking tolerates
+ * false negatives, e.g. KVM x86 omits TLB flushes even when aging SPTEs for a
+ * mmu_notifier.clear_flush_young() event.
+ *
+ * Lastly, don't flush if the Dirty bit is cleared, as KVM unconditionally
+ * flushes when enabling dirty logging (see kvm_mmu_slot_apply_flags()), and
+ * when clearing dirty logs, KVM flushes based on whether or not dirty entries
+ * were reaped from the bitmap/ring, not whether or not dirty SPTEs were found.
+ *
+ * Note, this logic only applies to shadow-present leaf SPTEs. The caller is
+ * responsible for checking that the old SPTE is shadow-present, and is also
+ * responsible for determining whether or not a TLB flush is required when
+ * modifying a shadow-present non-leaf SPTE.
+ */
+static inline bool leaf_spte_change_needs_tlb_flush(u64 old_spte, u64 new_spte)
+{
+ return is_mmu_writable_spte(old_spte) && !is_mmu_writable_spte(new_spte);
+}
+
+static inline u64 get_mmio_spte_generation(u64 spte)
+{
+ u64 gen;
+
+ gen = (spte & MMIO_SPTE_GEN_LOW_MASK) >> MMIO_SPTE_GEN_LOW_SHIFT;
+ gen |= (spte & MMIO_SPTE_GEN_HIGH_MASK) >> MMIO_SPTE_GEN_HIGH_SHIFT;
+ return gen;
+}
+
+bool spte_needs_atomic_update(u64 spte);
+
+bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
+ const struct kvm_memory_slot *slot,
+ unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
+ u64 old_spte, bool prefetch, bool synchronizing,
+ bool host_writable, u64 *new_spte);
+u64 make_small_spte(struct kvm *kvm, u64 huge_spte,
+ union kvm_mmu_page_role role, int index);
+u64 make_huge_spte(struct kvm *kvm, u64 small_spte, int level);
+u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
+u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access);
+u64 mark_spte_for_access_track(u64 spte);
+
+/* Restore an acc-track PTE back to a regular PTE */
+static inline u64 restore_acc_track_spte(u64 spte)
+{
+ u64 saved_bits = (spte >> SHADOW_ACC_TRACK_SAVED_BITS_SHIFT)
+ & SHADOW_ACC_TRACK_SAVED_BITS_MASK;
+
+ spte &= ~shadow_acc_track_mask;
+ spte &= ~(SHADOW_ACC_TRACK_SAVED_BITS_MASK <<
+ SHADOW_ACC_TRACK_SAVED_BITS_SHIFT);
+ spte |= saved_bits;
+
+ return spte;
+}
+
+void __init kvm_mmu_spte_module_init(void);
+void kvm_mmu_reset_all_pte_masks(void);
+
+#endif
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
new file mode 100644
index 000000000000..9e17bfa80901
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include "mmu_internal.h"
+#include "tdp_iter.h"
+#include "spte.h"
+
+/*
+ * Recalculates the pointer to the SPTE for the current GFN and level and
+ * rereads the SPTE.
+ */
+static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
+{
+ iter->sptep = iter->pt_path[iter->level - 1] +
+ SPTE_INDEX((iter->gfn | iter->gfn_bits) << PAGE_SHIFT, iter->level);
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
+}
+
+/*
+ * Return the TDP iterator to the root PT and allow it to continue its
+ * traversal over the paging structure from there.
+ */
+void tdp_iter_restart(struct tdp_iter *iter)
+{
+ iter->yielded = false;
+ iter->yielded_gfn = iter->next_last_level_gfn;
+ iter->level = iter->root_level;
+
+ iter->gfn = gfn_round_for_level(iter->next_last_level_gfn, iter->level);
+ tdp_iter_refresh_sptep(iter);
+
+ iter->valid = true;
+}
+
+/*
+ * Sets a TDP iterator to walk a pre-order traversal of the paging structure
+ * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
+ */
+void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
+ int min_level, gfn_t next_last_level_gfn, gfn_t gfn_bits)
+{
+ if (WARN_ON_ONCE(!root || (root->role.level < 1) ||
+ (root->role.level > PT64_ROOT_MAX_LEVEL) ||
+ (gfn_bits && next_last_level_gfn >= gfn_bits))) {
+ iter->valid = false;
+ return;
+ }
+
+ iter->next_last_level_gfn = next_last_level_gfn;
+ iter->gfn_bits = gfn_bits;
+ iter->root_level = root->role.level;
+ iter->min_level = min_level;
+ iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root->spt;
+ iter->as_id = kvm_mmu_page_as_id(root);
+
+ tdp_iter_restart(iter);
+}
+
+/*
+ * Given an SPTE and its level, returns a pointer containing the host virtual
+ * address of the child page table referenced by the SPTE. Returns null if
+ * there is no such entry.
+ */
+tdp_ptep_t spte_to_child_pt(u64 spte, int level)
+{
+ /*
+ * There's no child entry if this entry isn't present or is a
+ * last-level entry.
+ */
+ if (!is_shadow_present_pte(spte) || is_last_spte(spte, level))
+ return NULL;
+
+ return (tdp_ptep_t)__va(spte_to_pfn(spte) << PAGE_SHIFT);
+}
+
+/*
+ * Steps down one level in the paging structure towards the goal GFN. Returns
+ * true if the iterator was able to step down a level, false otherwise.
+ */
+static bool try_step_down(struct tdp_iter *iter)
+{
+ tdp_ptep_t child_pt;
+
+ if (iter->level == iter->min_level)
+ return false;
+
+ /*
+ * Reread the SPTE before stepping down to avoid traversing into page
+ * tables that are no longer linked from this entry.
+ */
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
+
+ child_pt = spte_to_child_pt(iter->old_spte, iter->level);
+ if (!child_pt)
+ return false;
+
+ iter->level--;
+ iter->pt_path[iter->level - 1] = child_pt;
+ iter->gfn = gfn_round_for_level(iter->next_last_level_gfn, iter->level);
+ tdp_iter_refresh_sptep(iter);
+
+ return true;
+}
+
+/*
+ * Steps to the next entry in the current page table, at the current page table
+ * level. The next entry could point to a page backing guest memory or another
+ * page table, or it could be non-present. Returns true if the iterator was
+ * able to step to the next entry in the page table, false if the iterator was
+ * already at the end of the current page table.
+ */
+static bool try_step_side(struct tdp_iter *iter)
+{
+ /*
+ * Check if the iterator is already at the end of the current page
+ * table.
+ */
+ if (SPTE_INDEX((iter->gfn | iter->gfn_bits) << PAGE_SHIFT, iter->level) ==
+ (SPTE_ENT_PER_PAGE - 1))
+ return false;
+
+ iter->gfn += KVM_PAGES_PER_HPAGE(iter->level);
+ iter->next_last_level_gfn = iter->gfn;
+ iter->sptep++;
+ iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
+
+ return true;
+}
+
+/*
+ * Tries to traverse back up a level in the paging structure so that the walk
+ * can continue from the next entry in the parent page table. Returns true on a
+ * successful step up, false if already in the root page.
+ */
+static bool try_step_up(struct tdp_iter *iter)
+{
+ if (iter->level == iter->root_level)
+ return false;
+
+ iter->level++;
+ iter->gfn = gfn_round_for_level(iter->gfn, iter->level);
+ tdp_iter_refresh_sptep(iter);
+
+ return true;
+}
+
+/*
+ * Step to the next SPTE in a pre-order traversal of the paging structure.
+ * To get to the next SPTE, the iterator either steps down towards the goal
+ * GFN, if at a present, non-last-level SPTE, or over to a SPTE mapping a
+ * higher GFN.
+ *
+ * The basic algorithm is as follows:
+ * 1. If the current SPTE is a non-last-level SPTE, step down into the page
+ * table it points to.
+ * 2. If the iterator cannot step down, it will try to step to the next SPTE
+ * in the current page of the paging structure.
+ * 3. If the iterator cannot step to the next entry in the current page, it will
+ * try to step up to the parent paging structure page. In this case, that
+ * SPTE will have already been visited, and so the iterator must also step
+ * to the side again.
+ */
+void tdp_iter_next(struct tdp_iter *iter)
+{
+ if (iter->yielded) {
+ tdp_iter_restart(iter);
+ return;
+ }
+
+ if (try_step_down(iter))
+ return;
+
+ do {
+ if (try_step_side(iter))
+ return;
+ } while (try_step_up(iter));
+ iter->valid = false;
+}
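+
+/*
+ * Illustrative walk (editorial note): for a two-level structure in which
+ * root entry 0 points at a fully-populated leaf table, the visit order is
+ * root[0], leaf[0] .. leaf[511], then root[1]: one step down, 511 side
+ * steps, then one step up followed by a side step.
+ */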
+
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
new file mode 100644
index 000000000000..364c5da6c499
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef __KVM_X86_MMU_TDP_ITER_H
+#define __KVM_X86_MMU_TDP_ITER_H
+
+#include <linux/kvm_host.h>
+
+#include "mmu.h"
+#include "spte.h"
+
+/*
+ * TDP MMU SPTEs are RCU protected to allow paging structures (non-leaf SPTEs)
+ * to be zapped while holding mmu_lock for read, and to allow TLB flushes to be
+ * batched without having to collect the list of zapped SPs. Flows that can
+ * remove SPs must service pending TLB flushes prior to dropping RCU protection.
+ */
+static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
+{
+ return READ_ONCE(*rcu_dereference(sptep));
+}
+
+static inline u64 kvm_tdp_mmu_write_spte_atomic(tdp_ptep_t sptep, u64 new_spte)
+{
+ KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
+ return xchg(rcu_dereference(sptep), new_spte);
+}
+
+static inline u64 tdp_mmu_clear_spte_bits_atomic(tdp_ptep_t sptep, u64 mask)
+{
+ atomic64_t *sptep_atomic = (atomic64_t *)rcu_dereference(sptep);
+
+ return (u64)atomic64_fetch_and(~mask, sptep_atomic);
+}
+
+static inline void __kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 new_spte)
+{
+ KVM_MMU_WARN_ON(is_ept_ve_possible(new_spte));
+ WRITE_ONCE(*rcu_dereference(sptep), new_spte);
+}
+
+/*
+ * SPTEs must be modified atomically if they are shadow-present, leaf SPTEs,
+ * and have volatile bits (bits that can be set outside of mmu_lock) that
+ * must not be clobbered.
+ */
+static inline bool kvm_tdp_mmu_spte_need_atomic_update(u64 old_spte, int level)
+{
+ return is_shadow_present_pte(old_spte) &&
+ is_last_spte(old_spte, level) &&
+ spte_needs_atomic_update(old_spte);
+}
+
+static inline u64 kvm_tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte,
+ u64 new_spte, int level)
+{
+ if (kvm_tdp_mmu_spte_need_atomic_update(old_spte, level))
+ return kvm_tdp_mmu_write_spte_atomic(sptep, new_spte);
+
+ __kvm_tdp_mmu_write_spte(sptep, new_spte);
+ return old_spte;
+}
+
+static inline u64 tdp_mmu_clear_spte_bits(tdp_ptep_t sptep, u64 old_spte,
+ u64 mask, int level)
+{
+ if (kvm_tdp_mmu_spte_need_atomic_update(old_spte, level))
+ return tdp_mmu_clear_spte_bits_atomic(sptep, mask);
+
+ __kvm_tdp_mmu_write_spte(sptep, old_spte & ~mask);
+ return old_spte;
+}
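+
+/*
+ * Illustrative usage (editorial sketch): an aging pass clears only the
+ * Accessed bit, taking the atomic path when the SPTE has volatile bits:
+ *
+ *	u64 old;
+ *
+ *	old = tdp_mmu_clear_spte_bits(iter->sptep, iter->old_spte,
+ *				      shadow_accessed_mask, iter->level);
+ */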
+
+/*
+ * A TDP iterator performs a pre-order walk over a TDP paging structure.
+ */
+struct tdp_iter {
+ /*
+ * The iterator will traverse the paging structure towards the mapping
+ * for this GFN.
+ */
+ gfn_t next_last_level_gfn;
+ /*
+ * The next_last_level_gfn at the time when the thread last
+ * yielded. Only yielding when the next_last_level_gfn !=
+ * yielded_gfn helps ensure forward progress.
+ */
+ gfn_t yielded_gfn;
+ /* Pointers to the page tables traversed to reach the current SPTE */
+ tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
+ /* A pointer to the current SPTE */
+ tdp_ptep_t sptep;
+ /* The lowest GFN (mask bits excluded) mapped by the current SPTE */
+ gfn_t gfn;
+ /* Mask applied to convert the GFN to the mapping GPA */
+ gfn_t gfn_bits;
+ /* The level of the root page given to the iterator */
+ int root_level;
+ /* The lowest level the iterator should traverse to */
+ int min_level;
+ /* The iterator's current level within the paging structure */
+ int level;
+ /* The address space ID, i.e. SMM vs. regular. */
+ int as_id;
+ /* A snapshot of the value at sptep */
+ u64 old_spte;
+ /*
+ * Whether the iterator has a valid state. This will be false if the
+ * iterator walks off the end of the paging structure.
+ */
+ bool valid;
+ /*
+ * True if KVM dropped mmu_lock and yielded in the middle of a walk, in
+ * which case tdp_iter_next() needs to restart the walk at the root
+ * level instead of advancing to the next entry.
+ */
+ bool yielded;
+};
+
+/*
+ * Iterates over every SPTE mapping the GFN range [start, end) in a
+ * preorder traversal.
+ */
+#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
+ for (tdp_iter_start(&iter, root, min_level, start, kvm_gfn_root_bits(kvm, root)); \
+ iter.valid && iter.gfn < end; \
+ tdp_iter_next(&iter))
+
+#define for_each_tdp_pte_min_level_all(iter, root, min_level) \
+ for (tdp_iter_start(&iter, root, min_level, 0, 0); \
+ iter.valid && iter.gfn < tdp_mmu_max_gfn_exclusive(); \
+ tdp_iter_next(&iter))
+
+#define for_each_tdp_pte(iter, kvm, root, start, end) \
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end)
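+
+/*
+ * Illustrative usage (editorial sketch): walk the leaf SPTEs in [start, end)
+ * under RCU protection, as the zap and dirty-logging paths do; handle_leaf()
+ * below is a hypothetical helper:
+ *
+ *	struct tdp_iter iter;
+ *
+ *	rcu_read_lock();
+ *	for_each_tdp_pte(iter, kvm, root, start, end) {
+ *		if (!is_shadow_present_pte(iter.old_spte) ||
+ *		    !is_last_spte(iter.old_spte, iter.level))
+ *			continue;
+ *		handle_leaf(kvm, &iter);	// hypothetical helper
+ *	}
+ *	rcu_read_unlock();
+ */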
+
+tdp_ptep_t spte_to_child_pt(u64 pte, int level);
+
+void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
+ int min_level, gfn_t next_last_level_gfn, gfn_t gfn_bits);
+void tdp_iter_next(struct tdp_iter *iter);
+void tdp_iter_restart(struct tdp_iter *iter);
+
+#endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
new file mode 100644
index 000000000000..9c26038f6b77
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -0,0 +1,1992 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include "mmu.h"
+#include "mmu_internal.h"
+#include "mmutrace.h"
+#include "tdp_iter.h"
+#include "tdp_mmu.h"
+#include "spte.h"
+
+#include <asm/cmpxchg.h>
+#include <trace/events/kvm.h>
+
+/* Initializes the TDP MMU for the VM, if enabled. */
+void kvm_mmu_init_tdp_mmu(struct kvm *kvm)
+{
+ INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
+ spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
+}
+
+/* Arbitrarily returns true so that this may be used in if statements. */
+static __always_inline bool kvm_lockdep_assert_mmu_lock_held(struct kvm *kvm,
+ bool shared)
+{
+ if (shared)
+ lockdep_assert_held_read(&kvm->mmu_lock);
+ else
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ return true;
+}
+
+void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
+{
+ /*
+	 * Invalidate all roots, which, besides the obvious, schedules all roots
+ * for zapping and thus puts the TDP MMU's reference to each root, i.e.
+ * ultimately frees all roots.
+ */
+ kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
+ kvm_tdp_mmu_zap_invalidated_roots(kvm, false);
+
+#ifdef CONFIG_KVM_PROVE_MMU
+ KVM_MMU_WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
+#endif
+ WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
+
+ /*
+ * Ensure that all the outstanding RCU callbacks to free shadow pages
+ * can run before the VM is torn down. Putting the last reference to
+ * zapped roots will create new callbacks.
+ */
+ rcu_barrier();
+}
+
+static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
+{
+ free_page((unsigned long)sp->external_spt);
+ free_page((unsigned long)sp->spt);
+ kmem_cache_free(mmu_page_header_cache, sp);
+}
+
+/*
+ * This is called through call_rcu in order to free TDP page table memory
+ * safely with respect to other kernel threads that may be operating on
+ * the memory.
+ * By only accessing TDP MMU page table memory in an RCU read critical
+ * section, and freeing it after a grace period, lockless access to that
+ * memory won't use it after it is freed.
+ */
+static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
+{
+ struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
+ rcu_head);
+
+ tdp_mmu_free_sp(sp);
+}
+
+void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root)
+{
+ if (!refcount_dec_and_test(&root->tdp_mmu_root_count))
+ return;
+
+ /*
+ * The TDP MMU itself holds a reference to each root until the root is
+	 * explicitly invalidated, i.e. the final reference should never be
+ * put for a valid root.
+ */
+ KVM_BUG_ON(!is_tdp_mmu_page(root) || !root->role.invalid, kvm);
+
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ list_del_rcu(&root->link);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+ call_rcu(&root->rcu_head, tdp_mmu_free_sp_rcu_callback);
+}
+
+static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
+ enum kvm_tdp_mmu_root_types types)
+{
+ if (WARN_ON_ONCE(!(types & KVM_VALID_ROOTS)))
+ return false;
+
+ if (root->role.invalid && !(types & KVM_INVALID_ROOTS))
+ return false;
+
+ if (likely(!is_mirror_sp(root)))
+ return types & KVM_DIRECT_ROOTS;
+ return types & KVM_MIRROR_ROOTS;
+}
+
+/*
+ * Returns the next root after @prev_root (or the first root if @prev_root is
+ * NULL) that matches @types. A reference to the returned root is
+ * acquired, and the reference to @prev_root is released (the caller obviously
+ * must hold a reference to @prev_root if it's non-NULL).
+ *
+ * Roots that don't match @types are skipped.
+ *
+ * Returns NULL if the end of tdp_mmu_roots was reached.
+ */
+static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
+ struct kvm_mmu_page *prev_root,
+ enum kvm_tdp_mmu_root_types types)
+{
+ struct kvm_mmu_page *next_root;
+
+ /*
+ * While the roots themselves are RCU-protected, fields such as
+ * role.invalid are protected by mmu_lock.
+ */
+ lockdep_assert_held(&kvm->mmu_lock);
+
+ rcu_read_lock();
+
+ if (prev_root)
+ next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
+ &prev_root->link,
+ typeof(*prev_root), link);
+ else
+ next_root = list_first_or_null_rcu(&kvm->arch.tdp_mmu_roots,
+ typeof(*next_root), link);
+
+ while (next_root) {
+ if (tdp_mmu_root_match(next_root, types) &&
+ kvm_tdp_mmu_get_root(next_root))
+ break;
+
+ next_root = list_next_or_null_rcu(&kvm->arch.tdp_mmu_roots,
+ &next_root->link, typeof(*next_root), link);
+ }
+
+ rcu_read_unlock();
+
+ if (prev_root)
+ kvm_tdp_mmu_put_root(kvm, prev_root);
+
+ return next_root;
+}
+
+/*
+ * Note: this iterator gets and puts references to the roots it iterates over.
+ * This makes it safe to release the MMU lock and yield within the loop, but
+ * if exiting the loop early, the caller must drop the reference to the most
+ * recent root. (Unless keeping a live reference is desirable.)
+ *
+ * If shared is set, this function is operating under the MMU lock in read
+ * mode.
+ */
+#define __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, _types) \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, _types); \
+ ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
+ _root = tdp_mmu_next_root(_kvm, _root, _types)) \
+ if (_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) { \
+ } else
+
+#define for_each_valid_tdp_mmu_root_yield_safe(_kvm, _root, _as_id) \
+ __for_each_tdp_mmu_root_yield_safe(_kvm, _root, _as_id, KVM_VALID_ROOTS)
+
+#define for_each_tdp_mmu_root_yield_safe(_kvm, _root) \
+ for (_root = tdp_mmu_next_root(_kvm, NULL, KVM_ALL_ROOTS); \
+ ({ lockdep_assert_held(&(_kvm)->mmu_lock); }), _root; \
+ _root = tdp_mmu_next_root(_kvm, _root, KVM_ALL_ROOTS))
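+
+/*
+ * Illustrative usage (editorial sketch): with mmu_lock held, operate on
+ * every valid root of an address space, yielding as needed; zap_helper() is
+ * a hypothetical helper:
+ *
+ *	struct kvm_mmu_page *root;
+ *	bool flush = false;
+ *
+ *	for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id)
+ *		flush = zap_helper(kvm, root, start, end, flush);
+ */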
+
+/*
+ * Iterate over all TDP MMU roots. Requires that mmu_lock be held for write,
+ * the implication being that any flow that holds mmu_lock for read is
+ * inherently yield-friendly and should use the yield-safe variant above.
+ * Holding mmu_lock for write obviates the need for RCU protection as the list
+ * is guaranteed to be stable.
+ */
+#define __for_each_tdp_mmu_root(_kvm, _root, _as_id, _types) \
+ list_for_each_entry(_root, &_kvm->arch.tdp_mmu_roots, link) \
+ if (kvm_lockdep_assert_mmu_lock_held(_kvm, false) && \
+ ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
+ !tdp_mmu_root_match((_root), (_types)))) { \
+ } else
+
+/*
+ * Iterate over all TDP MMU roots in an RCU read-side critical section.
+ * It is safe to iterate over the SPTEs under the root, but their values will
+ * be unstable, so all writes must be atomic. As this routine is meant to be
+ * used without holding the mmu_lock at all, any bits that are flipped must
+ * be reflected in kvm_tdp_mmu_spte_need_atomic_update().
+ */
+#define for_each_tdp_mmu_root_rcu(_kvm, _root, _as_id, _types) \
+ list_for_each_entry_rcu(_root, &_kvm->arch.tdp_mmu_roots, link) \
+ if ((_as_id >= 0 && kvm_mmu_page_as_id(_root) != _as_id) || \
+ !tdp_mmu_root_match((_root), (_types))) { \
+ } else
+
+#define for_each_valid_tdp_mmu_root(_kvm, _root, _as_id) \
+ __for_each_tdp_mmu_root(_kvm, _root, _as_id, KVM_VALID_ROOTS)
+
+static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
+ sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+
+ return sp;
+}
+
+static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
+ gfn_t gfn, union kvm_mmu_page_role role)
+{
+ INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
+
+ set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+
+ sp->role = role;
+ sp->gfn = gfn;
+ sp->ptep = sptep;
+ sp->tdp_mmu_page = true;
+
+ trace_kvm_mmu_get_page(sp, true);
+}
+
+static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
+ struct tdp_iter *iter)
+{
+ struct kvm_mmu_page *parent_sp;
+ union kvm_mmu_page_role role;
+
+ parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
+
+ role = parent_sp->role;
+ role.level--;
+
+ tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
+}
+
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
+{
+ struct kvm_mmu *mmu = vcpu->arch.mmu;
+ union kvm_mmu_page_role role = mmu->root_role;
+ int as_id = kvm_mmu_role_as_id(role);
+ struct kvm *kvm = vcpu->kvm;
+ struct kvm_mmu_page *root;
+
+ if (mirror)
+ role.is_mirror = true;
+
+ /*
+ * Check for an existing root before acquiring the pages lock to avoid
+ * unnecessary serialization if multiple vCPUs are loading a new root.
+ * E.g. when bringing up secondary vCPUs, KVM will already have created
+ * a valid root on behalf of the primary vCPU.
+ */
+ read_lock(&kvm->mmu_lock);
+
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, as_id) {
+ if (root->role.word == role.word)
+ goto out_read_unlock;
+ }
+
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+
+ /*
+ * Recheck for an existing root after acquiring the pages lock, another
+ * vCPU may have raced ahead and created a new usable root. Manually
+ * walk the list of roots as the standard macros assume that the pages
+ * lock is *not* held. WARN if grabbing a reference to a usable root
+ * fails, as the last reference to a root can only be put *after* the
+ * root has been invalidated, which requires holding mmu_lock for write.
+ */
+ list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+ if (root->role.word == role.word &&
+ !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root)))
+ goto out_spin_unlock;
+ }
+
+ root = tdp_mmu_alloc_sp(vcpu);
+ tdp_mmu_init_sp(root, NULL, 0, role);
+
+ /*
+ * TDP MMU roots are kept until they are explicitly invalidated, either
+ * by a memslot update or by the destruction of the VM. Initialize the
+ * refcount to two; one reference for the vCPU, and one reference for
+ * the TDP MMU itself, which is held until the root is invalidated and
+ * is ultimately put by kvm_tdp_mmu_zap_invalidated_roots().
+ */
+ refcount_set(&root->tdp_mmu_root_count, 2);
+ list_add_rcu(&root->link, &kvm->arch.tdp_mmu_roots);
+
+out_spin_unlock:
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+out_read_unlock:
+ read_unlock(&kvm->mmu_lock);
+ /*
+ * Note, KVM_REQ_MMU_FREE_OBSOLETE_ROOTS will prevent entering the guest
+ * and actually consuming the root if it's invalidated after dropping
+ * mmu_lock, and the root can't be freed as this vCPU holds a reference.
+ */
+ if (mirror) {
+ mmu->mirror_root_hpa = __pa(root->spt);
+ } else {
+ mmu->root.hpa = __pa(root->spt);
+ mmu->root.pgd = 0;
+ }
+}
+
+static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+ u64 old_spte, u64 new_spte, int level,
+ bool shared);
+
+static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ kvm_account_pgtable_pages((void *)sp->spt, +1);
+#ifdef CONFIG_KVM_PROVE_MMU
+ atomic64_inc(&kvm->arch.tdp_mmu_pages);
+#endif
+}
+
+static void tdp_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ kvm_account_pgtable_pages((void *)sp->spt, -1);
+#ifdef CONFIG_KVM_PROVE_MMU
+ atomic64_dec(&kvm->arch.tdp_mmu_pages);
+#endif
+}
+
+/**
+ * tdp_mmu_unlink_sp() - Remove a shadow page from the list of used pages
+ *
+ * @kvm: kvm instance
+ * @sp: the page to be removed
+ */
+static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ tdp_unaccount_mmu_page(kvm, sp);
+
+ if (!sp->nx_huge_page_disallowed)
+ return;
+
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ sp->nx_huge_page_disallowed = false;
+ untrack_possible_nx_huge_page(kvm, sp, KVM_TDP_MMU);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+}
+
+static void remove_external_spte(struct kvm *kvm, gfn_t gfn, u64 old_spte,
+ int level)
+{
+ /*
+ * External (TDX) SPTEs are limited to PG_LEVEL_4K, and external
+ * PTs are removed in a special order, involving free_external_spt().
+ * But remove_external_spte() will be called on non-leaf PTEs via
+ * __tdp_mmu_zap_root(), so avoid the error the former would return
+ * in this case.
+ */
+ if (!is_last_spte(old_spte, level))
+ return;
+
+ /* Zapping leaf spte is allowed only when write lock is held. */
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ kvm_x86_call(remove_external_spte)(kvm, gfn, level, old_spte);
+}
+
+/**
+ * handle_removed_pt() - handle a page table removed from the TDP structure
+ *
+ * @kvm: kvm instance
+ * @pt: the page removed from the paging structure
+ * @shared: This operation may not be running under the exclusive use
+ * of the MMU lock and the operation must synchronize with other
+ * threads that might be modifying SPTEs.
+ *
+ * Given a page table that has been removed from the TDP paging structure,
+ * iterates through the page table to clear SPTEs and free child page tables.
+ *
+ * Note that pt is passed in as a tdp_ptep_t, but it does not need RCU
+ * protection. Since this thread removed it from the paging structure,
+ * this thread will be responsible for ensuring the page is freed. Hence the
+ * early rcu_dereferences in the function.
+ */
+static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
+{
+ struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
+ int level = sp->role.level;
+ gfn_t base_gfn = sp->gfn;
+ int i;
+
+ trace_kvm_mmu_prepare_zap_page(sp);
+
+ tdp_mmu_unlink_sp(kvm, sp);
+
+ for (i = 0; i < SPTE_ENT_PER_PAGE; i++) {
+ tdp_ptep_t sptep = pt + i;
+ gfn_t gfn = base_gfn + i * KVM_PAGES_PER_HPAGE(level);
+ u64 old_spte;
+
+ if (shared) {
+ /*
+ * Set the SPTE to a nonpresent value that other
+ * threads will not overwrite. If the SPTE was
+ * already marked as frozen then another thread
+			 * handling a page fault could overwrite it, so
+			 * keep writing FROZEN_SPTE until the replaced value
+			 * is something other than the frozen SPTE value.
+ */
+ for (;;) {
+ old_spte = kvm_tdp_mmu_write_spte_atomic(sptep, FROZEN_SPTE);
+ if (!is_frozen_spte(old_spte))
+ break;
+ cpu_relax();
+ }
+ } else {
+ /*
+ * If the SPTE is not MMU-present, there is no backing
+ * page associated with the SPTE and so no side effects
+ * that need to be recorded, and exclusive ownership of
+ * mmu_lock ensures the SPTE can't be made present.
+ * Note, zapping MMIO SPTEs is also unnecessary as they
+ * are guarded by the memslots generation, not by being
+ * unreachable.
+ */
+ old_spte = kvm_tdp_mmu_read_spte(sptep);
+ if (!is_shadow_present_pte(old_spte))
+ continue;
+
+ /*
+ * Use the common helper instead of a raw WRITE_ONCE as
+ * the SPTE needs to be updated atomically if it can be
+ * modified by a different vCPU outside of mmu_lock.
+ * Even though the parent SPTE is !PRESENT, the TLB
+ * hasn't yet been flushed, and both Intel and AMD
+ * document that A/D assists can use upper-level PxE
+ * entries that are cached in the TLB, i.e. the CPU can
+ * still access the page and mark it dirty.
+ *
+ * No retry is needed in the atomic update path as the
+ * sole concern is dropping a Dirty bit, i.e. no other
+ * task can zap/remove the SPTE as mmu_lock is held for
+ * write. Marking the SPTE as a frozen SPTE is not
+ * strictly necessary for the same reason, but using
+ * the frozen SPTE value keeps the shared/exclusive
+ * paths consistent and allows the handle_changed_spte()
+ * call below to hardcode the new value to FROZEN_SPTE.
+ *
+ * Note, even though dropping a Dirty bit is the only
+ * scenario where a non-atomic update could result in a
+ * functional bug, simply checking the Dirty bit isn't
+ * sufficient as a fast page fault could read the upper
+ * level SPTE before it is zapped, and then make this
+ * target SPTE writable, resume the guest, and set the
+ * Dirty bit between reading the SPTE above and writing
+ * it here.
+ */
+ old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
+ FROZEN_SPTE, level);
+ }
+ handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
+ old_spte, FROZEN_SPTE, level, shared);
+
+ if (is_mirror_sp(sp)) {
+ KVM_BUG_ON(shared, kvm);
+ remove_external_spte(kvm, gfn, old_spte, level);
+ }
+ }
+
+ if (is_mirror_sp(sp) &&
+ WARN_ON(kvm_x86_call(free_external_spt)(kvm, base_gfn, sp->role.level,
+ sp->external_spt))) {
+ /*
+		 * Failed to free the page table page in the mirror page
+		 * table; there is nothing further to do.
+ * Intentionally leak the page to prevent the kernel from
+ * accessing the encrypted page.
+ */
+ sp->external_spt = NULL;
+ }
+
+ call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
+}
+
+static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
+{
+ if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
+ struct kvm_mmu_page *sp = spte_to_child_sp(new_spte);
+
+ WARN_ON_ONCE(sp->role.level + 1 != level);
+ WARN_ON_ONCE(sp->gfn != gfn);
+ return sp->external_spt;
+ }
+
+ return NULL;
+}
+
+static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
+ gfn_t gfn, u64 old_spte,
+ u64 new_spte, int level)
+{
+ bool was_present = is_shadow_present_pte(old_spte);
+ bool is_present = is_shadow_present_pte(new_spte);
+ bool is_leaf = is_present && is_last_spte(new_spte, level);
+ int ret = 0;
+
+ KVM_BUG_ON(was_present, kvm);
+
+ lockdep_assert_held(&kvm->mmu_lock);
+ /*
+ * We need to lock out other updates to the SPTE until the external
+ * page table has been modified. Use FROZEN_SPTE similar to
+ * the zapping case.
+ */
+ if (!try_cmpxchg64(rcu_dereference(sptep), &old_spte, FROZEN_SPTE))
+ return -EBUSY;
+
+ /*
+	 * Use a different hook depending on whether a middle-level
+	 * external page table or a leaf entry is being set up.
+ */
+ if (is_leaf) {
+ ret = kvm_x86_call(set_external_spte)(kvm, gfn, level, new_spte);
+ } else {
+ void *external_spt = get_external_spt(gfn, new_spte, level);
+
+ KVM_BUG_ON(!external_spt, kvm);
+ ret = kvm_x86_call(link_external_spt)(kvm, gfn, level, external_spt);
+ }
+ if (ret)
+ __kvm_tdp_mmu_write_spte(sptep, old_spte);
+ else
+ __kvm_tdp_mmu_write_spte(sptep, new_spte);
+ return ret;
+}
+
+/**
+ * handle_changed_spte - handle bookkeeping associated with an SPTE change
+ * @kvm: kvm instance
+ * @as_id: the address space of the paging structure the SPTE was a part of
+ * @gfn: the base GFN that was mapped by the SPTE
+ * @old_spte: The value of the SPTE before the change
+ * @new_spte: The value of the SPTE after the change
+ * @level: the level of the PT the SPTE is part of in the paging structure
+ * @shared: This operation may not be running under the exclusive use of
+ * the MMU lock and the operation must synchronize with other
+ * threads that might be modifying SPTEs.
+ *
+ * Handle bookkeeping that might result from the modification of a SPTE. Note,
+ * dirty logging updates are handled in common code, not here (see make_spte()
+ * and fast_pf_fix_direct_spte()).
+ */
+static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+ u64 old_spte, u64 new_spte, int level,
+ bool shared)
+{
+ bool was_present = is_shadow_present_pte(old_spte);
+ bool is_present = is_shadow_present_pte(new_spte);
+ bool was_leaf = was_present && is_last_spte(old_spte, level);
+ bool is_leaf = is_present && is_last_spte(new_spte, level);
+ bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+
+ WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
+ WARN_ON_ONCE(level < PG_LEVEL_4K);
+ WARN_ON_ONCE(gfn & (KVM_PAGES_PER_HPAGE(level) - 1));
+
+ /*
+ * If this warning were to trigger it would indicate that there was a
+ * missing MMU notifier or a race with some notifier handler.
+ * A present, leaf SPTE should never be directly replaced with another
+ * present leaf SPTE pointing to a different PFN. A notifier handler
+ * should be zapping the SPTE before the main MM's page table is
+ * changed, or the SPTE should be zeroed, and the TLBs flushed by the
+ * thread before replacement.
+ */
+ if (was_leaf && is_leaf && pfn_changed) {
+ pr_err("Invalid SPTE change: cannot replace a present leaf\n"
+ "SPTE with another present leaf SPTE mapping a\n"
+ "different PFN!\n"
+ "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
+ as_id, gfn, old_spte, new_spte, level);
+
+ /*
+ * Crash the host to prevent error propagation and guest data
+ * corruption.
+ */
+ BUG();
+ }
+
+ if (old_spte == new_spte)
+ return;
+
+ trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
+
+ if (is_leaf)
+ check_spte_writable_invariants(new_spte);
+
+ /*
+	 * The only time a SPTE should be changed from one non-present
+	 * state to another is when an MMIO entry is installed/modified/
+ * removed. In that case, there is nothing to do here.
+ */
+ if (!was_present && !is_present) {
+ /*
+ * If this change does not involve a MMIO SPTE or frozen SPTE,
+ * it is unexpected. Log the change, though it should not
+ * impact the guest since both the former and current SPTEs
+ * are nonpresent.
+ */
+ if (WARN_ON_ONCE(!is_mmio_spte(kvm, old_spte) &&
+ !is_mmio_spte(kvm, new_spte) &&
+ !is_frozen_spte(new_spte)))
+ pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
+ "should not be replaced with another,\n"
+ "different nonpresent SPTE, unless one or both\n"
+ "are MMIO SPTEs, or the new SPTE is\n"
+ "a temporary frozen SPTE.\n"
+ "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
+ as_id, gfn, old_spte, new_spte, level);
+ return;
+ }
+
+ if (is_leaf != was_leaf)
+ kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);
+
+ /*
+ * Recursively handle child PTs if the change removed a subtree from
+ * the paging structure. Note the WARN on the PFN changing without the
+ * SPTE being converted to a hugepage (leaf) or being zapped. Shadow
+ * pages are kernel allocations and should never be migrated.
+ */
+ if (was_present && !was_leaf &&
+ (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
+ handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
+}
+
+static inline int __must_check __tdp_mmu_set_spte_atomic(struct kvm *kvm,
+ struct tdp_iter *iter,
+ u64 new_spte)
+{
+ /*
+ * The caller is responsible for ensuring the old SPTE is not a FROZEN
+ * SPTE. KVM should never attempt to zap or manipulate a FROZEN SPTE,
+ * and pre-checking before inserting a new SPTE is advantageous as it
+ * avoids unnecessary work.
+ */
+ WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
+
+ if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
+ int ret;
+
+ /*
+ * Users of atomic zapping don't operate on mirror roots,
+ * so don't handle it and bug the VM if it's seen.
+ */
+ if (KVM_BUG_ON(!is_shadow_present_pte(new_spte), kvm))
+ return -EBUSY;
+
+ ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
+ iter->old_spte, new_spte, iter->level);
+ if (ret)
+ return ret;
+ } else {
+ u64 *sptep = rcu_dereference(iter->sptep);
+
+ /*
+ * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
+ * and does not hold the mmu_lock. On failure, i.e. if a
+ * different logical CPU modified the SPTE, try_cmpxchg64()
+ * updates iter->old_spte with the current value, so the caller
+ * operates on fresh data, e.g. if it retries
+		 * tdp_mmu_set_spte_atomic().
+ */
+ if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
+ return -EBUSY;
+ }
+
+ return 0;
+}
+
+/*
+ * tdp_mmu_set_spte_atomic - Set a TDP MMU SPTE atomically
+ * and handle the associated bookkeeping. Do not mark the page dirty
+ * in KVM's dirty bitmaps.
+ *
+ * If setting the SPTE fails because it has changed, iter->old_spte will be
+ * refreshed to the current value of the spte.
+ *
+ * @kvm: kvm instance
+ * @iter: a tdp_iter instance currently on the SPTE that should be set
+ * @new_spte: The value the SPTE should be set to
+ * Return:
+ * * 0 - If the SPTE was set.
+ * * -EBUSY - If the SPTE cannot be set. In this case this function will have
+ * no side-effects other than setting iter->old_spte to the last
+ * known value of the spte.
+ */
+static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
+ struct tdp_iter *iter,
+ u64 new_spte)
+{
+ int ret;
+
+ lockdep_assert_held_read(&kvm->mmu_lock);
+
+ ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
+ if (ret)
+ return ret;
+
+ handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
+ new_spte, iter->level, true);
+
+ return 0;
+}
+
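+/*
+ * Callers typically retry on -EBUSY by restarting the current iteration,
+ * relying on iter->old_spte having been refreshed by the failed cmpxchg.
+ * A minimal sketch of the idiom used throughout this file (loop body
+ * abbreviated; not a new API):
+ *
+ *	for_each_tdp_pte(iter, kvm, root, start, end) {
+ * retry:
+ *		...
+ *		if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
+ *			goto retry;
+ *	}
+ */
+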
+/*
+ * tdp_mmu_set_spte - Set a TDP MMU SPTE and handle the associated bookkeeping
+ * @kvm: KVM instance
+ * @as_id: Address space ID, i.e. regular vs. SMM
+ * @sptep: Pointer to the SPTE
+ * @old_spte: The current value of the SPTE
+ * @new_spte: The new value that will be set for the SPTE
+ * @gfn: The base GFN that was (or will be) mapped by the SPTE
+ * @level: The level _containing_ the SPTE (its parent PT's level)
+ *
+ * Returns the old SPTE value, which _may_ be different than @old_spte if the
+ * SPTE had volatile bits.
+ */
+static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
+ u64 old_spte, u64 new_spte, gfn_t gfn, int level)
+{
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ /*
+ * No thread should be using this function to set SPTEs to or from the
+ * temporary frozen SPTE value.
+ * If operating under the MMU lock in read mode, tdp_mmu_set_spte_atomic
+ * should be used. If operating under the MMU lock in write mode, the
+ * use of the frozen SPTE should not be necessary.
+ */
+ WARN_ON_ONCE(is_frozen_spte(old_spte) || is_frozen_spte(new_spte));
+
+ old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
+
+ handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+
+ /*
+ * Users that do non-atomic setting of PTEs don't operate on mirror
+ * roots, so don't handle it and bug the VM if it's seen.
+ */
+ if (is_mirror_sptep(sptep)) {
+ KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
+ remove_external_spte(kvm, gfn, old_spte, level);
+ }
+
+ return old_spte;
+}
+
+static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
+ u64 new_spte)
+{
+ WARN_ON_ONCE(iter->yielded);
+ iter->old_spte = tdp_mmu_set_spte(kvm, iter->as_id, iter->sptep,
+ iter->old_spte, new_spte,
+ iter->gfn, iter->level);
+}
+
+#define tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end) \
+ for_each_tdp_pte(_iter, _kvm, _root, _start, _end)
+
+#define tdp_root_for_each_leaf_pte(_iter, _kvm, _root, _start, _end) \
+ tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end) \
+ if (!is_shadow_present_pte(_iter.old_spte) || \
+ !is_last_spte(_iter.old_spte, _iter.level)) \
+ continue; \
+ else
+
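+/*
+ * Example (sketch, hypothetical counter): the leaf-only iterator filters out
+ * non-present and non-leaf entries via the embedded "continue", so a walk
+ * that counts present leaf SPTEs in [start, end) is just:
+ *
+ *	tdp_root_for_each_leaf_pte(iter, kvm, root, start, end)
+ *		nr_leafs++;
+ */
+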
+static inline bool __must_check tdp_mmu_iter_need_resched(struct kvm *kvm,
+ struct tdp_iter *iter)
+{
+ if (!need_resched() && !rwlock_needbreak(&kvm->mmu_lock))
+ return false;
+
+ /* Ensure forward progress has been made before yielding. */
+ return iter->next_last_level_gfn != iter->yielded_gfn;
+}
+
+/*
+ * Yield if the MMU lock is contended or this thread needs to return control
+ * to the scheduler.
+ *
+ * If this function should yield and flush is set, it will perform a remote
+ * TLB flush before yielding.
+ *
+ * If this function yields, iter->yielded is set and the caller must skip to
+ * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
+ * over the paging structures to allow the iterator to continue its traversal
+ * from the paging structure root.
+ *
+ * Returns true if this function yielded.
+ */
+static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
+ struct tdp_iter *iter,
+ bool flush, bool shared)
+{
+ KVM_MMU_WARN_ON(iter->yielded);
+
+ if (!tdp_mmu_iter_need_resched(kvm, iter))
+ return false;
+
+ if (flush)
+ kvm_flush_remote_tlbs(kvm);
+
+ rcu_read_unlock();
+
+ if (shared)
+ cond_resched_rwlock_read(&kvm->mmu_lock);
+ else
+ cond_resched_rwlock_write(&kvm->mmu_lock);
+
+ rcu_read_lock();
+
+ WARN_ON_ONCE(iter->gfn > iter->next_last_level_gfn);
+
+ iter->yielded = true;
+ return true;
+}
+
+static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
+{
+ /*
+ * Bound TDP MMU walks at host.MAXPHYADDR. KVM disallows memslots with
+ * a gpa range that would exceed the max gfn, and KVM does not create
+ * MMIO SPTEs for "impossible" gfns, instead sending such accesses down
+ * the slow emulation path every time.
+ */
+ return kvm_mmu_max_gfn() + 1;
+}
+
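+/*
+ * For example, assuming host.MAXPHYADDR == 52 and 4KiB pages,
+ * kvm_mmu_max_gfn() is (1ull << (52 - 12)) - 1, so the exclusive bound
+ * returned here is 1ull << 40, i.e. walks cover gfns [0, 1ull << 40).
+ */
+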
+static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ bool shared, int zap_level)
+{
+ struct tdp_iter iter;
+
+ for_each_tdp_pte_min_level_all(iter, root, zap_level) {
+retry:
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+ continue;
+
+ if (!is_shadow_present_pte(iter.old_spte))
+ continue;
+
+ if (iter.level > zap_level)
+ continue;
+
+ if (!shared)
+ tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
+ else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
+ goto retry;
+ }
+}
+
+static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
+ bool shared)
+{
+ /*
+ * The root must have an elevated refcount so that it's reachable via
+ * mmu_notifier callbacks, which allows this path to yield and drop
+ * mmu_lock. When handling an unmap/release mmu_notifier command, KVM
+ * must drop all references to relevant pages prior to completing the
+ * callback. Dropping mmu_lock with an unreachable root would result
+ * in zapping SPTEs after a relevant mmu_notifier callback completes
+ * and lead to use-after-free as zapping a SPTE triggers "writeback" of
+ * dirty accessed bits to the SPTE's associated struct page.
+ */
+ WARN_ON_ONCE(!refcount_read(&root->tdp_mmu_root_count));
+
+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+
+ rcu_read_lock();
+
+ /*
+ * Zap roots in multiple passes of decreasing granularity, i.e. zap at
+ * 4KiB=>2MiB=>1GiB=>root, in order to better honor need_resched() (all
+ * preempt models) or mmu_lock contention (full or real-time models).
+ * Zapping at finer granularity marginally increases the total time of
+ * the zap, but in most cases the zap itself isn't latency sensitive.
+ *
+ * If KVM is configured to prove the MMU, skip the 4KiB and 2MiB zaps
+ * in order to mimic the page fault path, which can replace a 1GiB page
+ * table with an equivalent 1GiB hugepage, i.e. can get saddled with
+ * zapping a 1GiB region that's fully populated with 4KiB SPTEs. This
+ * allows verifying that KVM can safely zap 1GiB regions, e.g. without
+ * inducing RCU stalls, without relying on a relatively rare event
+ * (zapping roots is orders of magnitude more common). Note, because
+ * zapping a SP recurses on its children, stepping down to PG_LEVEL_4K
+ * in the iterator itself is unnecessary.
+ */
+ if (!IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
+ __tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_4K);
+ __tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_2M);
+ }
+ __tdp_mmu_zap_root(kvm, root, shared, PG_LEVEL_1G);
+ __tdp_mmu_zap_root(kvm, root, shared, root->role.level);
+
+ rcu_read_unlock();
+}
+
+bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
+ struct kvm_mmu_page *sp)
+{
+ struct tdp_iter iter = {
+ .old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0,
+ .sptep = sp->ptep,
+ .level = sp->role.level + 1,
+ .gfn = sp->gfn,
+ .as_id = kvm_mmu_page_as_id(sp),
+ };
+
+ lockdep_assert_held_read(&kvm->mmu_lock);
+
+ if (WARN_ON_ONCE(!is_tdp_mmu_page(sp)))
+ return false;
+
+ /*
+ * Root shadow pages don't have a parent page table and thus no
+ * associated entry, but they can never be possible NX huge pages.
+ */
+ if (WARN_ON_ONCE(!sp->ptep))
+ return false;
+
+ /*
+ * Since mmu_lock is held in read mode, it's possible another task has
+ * already modified the SPTE. Zap the SPTE if and only if the SPTE
+ * points at the SP's page table, as checking shadow-present isn't
+ * sufficient, e.g. the SPTE could be replaced by a leaf SPTE, or even
+ * another SP. Note, spte_to_child_pt() also checks that the SPTE is
+ * shadow-present, i.e. guards against zapping a frozen SPTE.
+ */
+ if ((tdp_ptep_t)sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
+ return false;
+
+ /*
+ * If a different task modified the SPTE, then it should be impossible
+ * for the SPTE to still be used for the to-be-zapped SP. Non-leaf
+ * SPTEs don't have Dirty bits, KVM always sets the Accessed bit when
+ * creating non-leaf SPTEs, and all other bits are immutable for non-
+ * leaf SPTEs, i.e. the only legal operations for non-leaf SPTEs are
+ * zapping and replacement.
+ */
+ if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
+ WARN_ON_ONCE((tdp_ptep_t)sp->spt == spte_to_child_pt(iter.old_spte, iter.level));
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * If can_yield is true, will release the MMU lock and reschedule if the
+ * scheduler needs the CPU or there is contention on the MMU lock. If this
+ * function cannot yield, it will not release the MMU lock or reschedule and
+ * the caller must ensure it does not supply too large a GFN range, or the
+ * operation can cause a soft lockup.
+ */
+static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, bool can_yield, bool flush)
+{
+ struct tdp_iter iter;
+
+ end = min(end, tdp_mmu_max_gfn_exclusive());
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ rcu_read_lock();
+
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
+ if (can_yield &&
+ tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
+ flush = false;
+ continue;
+ }
+
+ if (!is_shadow_present_pte(iter.old_spte) ||
+ !is_last_spte(iter.old_spte, iter.level))
+ continue;
+
+ tdp_mmu_iter_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
+
+ /*
+		 * Zapping SPTEs in invalid roots doesn't require a TLB flush,
+ * see kvm_tdp_mmu_zap_invalidated_roots() for details.
+ */
+ if (!root->role.invalid)
+ flush = true;
+ }
+
+ rcu_read_unlock();
+
+ /*
+ * Because this flow zaps _only_ leaf SPTEs, the caller doesn't need
+ * to provide RCU protection as no 'struct kvm_mmu_page' will be freed.
+ */
+ return flush;
+}
+
+/*
+ * Zap leaf SPTEs for the range of gfns, [start, end), for all *VALID* roots.
+ * Returns true if a TLB flush is needed before releasing the MMU lock, i.e. if
+ * one or more SPTEs were zapped since the MMU lock was last acquired.
+ */
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush)
+{
+ struct kvm_mmu_page *root;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, -1)
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, true, flush);
+
+ return flush;
+}
+
+void kvm_tdp_mmu_zap_all(struct kvm *kvm)
+{
+ struct kvm_mmu_page *root;
+
+ /*
+ * Zap all direct roots, including invalid direct roots, as all direct
+ * SPTEs must be dropped before returning to the caller. For TDX, mirror
+ * roots don't need handling in response to the mmu notifier (the caller).
+ *
+ * Zap directly even if the root is also being zapped by a concurrent
+ * "fast zap". Walking zapped top-level SPTEs isn't all that expensive
+ * and mmu_lock is already held, which means the other thread has yielded.
+ *
+	 * A TLB flush is unnecessary, KVM zaps everything if and only if the VM
+ * is being destroyed or the userspace VMM has exited. In both cases,
+ * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
+ */
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, -1,
+ KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS)
+ tdp_mmu_zap_root(kvm, root, false);
+}
+
+/*
+ * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
+ * zap" completes.
+ */
+void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared)
+{
+ struct kvm_mmu_page *root;
+
+ if (shared)
+ read_lock(&kvm->mmu_lock);
+ else
+ write_lock(&kvm->mmu_lock);
+
+ for_each_tdp_mmu_root_yield_safe(kvm, root) {
+ if (!root->tdp_mmu_scheduled_root_to_zap)
+ continue;
+
+ root->tdp_mmu_scheduled_root_to_zap = false;
+ KVM_BUG_ON(!root->role.invalid, kvm);
+
+ /*
+ * A TLB flush is not necessary as KVM performs a local TLB
+ * flush when allocating a new root (see kvm_mmu_load()), and
+ * when migrating a vCPU to a different pCPU. Note, the local
+ * TLB flush on reuse also invalidates paging-structure-cache
+ * entries, i.e. TLB entries for intermediate paging structures,
+ * that may be zapped, as such entries are associated with the
+ * ASID on both VMX and SVM.
+ */
+ tdp_mmu_zap_root(kvm, root, shared);
+
+ /*
+		 * The reference needs to be put *after* zapping the root, as
+		 * the root must be reachable by mmu_notifiers while it's being
+		 * zapped.
+ */
+ kvm_tdp_mmu_put_root(kvm, root);
+ }
+
+ if (shared)
+ read_unlock(&kvm->mmu_lock);
+ else
+ write_unlock(&kvm->mmu_lock);
+}
+
+/*
+ * Mark each TDP MMU root as invalid to prevent vCPUs from reusing a root that
+ * is about to be zapped, e.g. in response to a memslots update. The actual
+ * zapping is done separately so that it happens with mmu_lock held for read,
+ * whereas invalidating roots must be done with mmu_lock held for write (unless
+ * the VM is being destroyed).
+ *
+ * Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
+ * See kvm_tdp_mmu_alloc_root().
+ */
+void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
+ enum kvm_tdp_mmu_root_types root_types)
+{
+ struct kvm_mmu_page *root;
+
+ /*
+	 * Invalidating invalid roots doesn't make sense; prevent developers from
+ * having to think about it.
+ */
+ if (WARN_ON_ONCE(root_types & KVM_INVALID_ROOTS))
+ root_types &= ~KVM_INVALID_ROOTS;
+
+ /*
+ * mmu_lock must be held for write to ensure that a root doesn't become
+ * invalid while there are active readers (invalidating a root while
+ * there are active readers may or may not be problematic in practice,
+ * but it's uncharted territory and not supported).
+ *
+ * Waive the assertion if there are no users of @kvm, i.e. the VM is
+ * being destroyed after all references have been put, or if no vCPUs
+ * have been created (which means there are no roots), i.e. the VM is
+ * being destroyed in an error path of KVM_CREATE_VM.
+ */
+ if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
+ refcount_read(&kvm->users_count) && kvm->created_vcpus)
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ /*
+ * As above, mmu_lock isn't held when destroying the VM! There can't
+ * be other references to @kvm, i.e. nothing else can invalidate roots
+ * or get/put references to roots.
+ */
+ list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+ if (!tdp_mmu_root_match(root, root_types))
+ continue;
+
+ /*
+ * Note, invalid roots can outlive a memslot update! Invalid
+ * roots must be *zapped* before the memslot update completes,
+ * but a different task can acquire a reference and keep the
+		 * root alive after it's been zapped.
+ */
+ if (!root->role.invalid) {
+ root->tdp_mmu_scheduled_root_to_zap = true;
+ root->role.invalid = true;
+ }
+ }
+}
+
+/*
+ * Installs a last-level SPTE to handle a TDP page fault.
+ * (NPT/EPT violation/misconfiguration)
+ */
+static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault,
+ struct tdp_iter *iter)
+{
+ struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
+ u64 new_spte;
+ int ret = RET_PF_FIXED;
+ bool wrprot = false;
+
+ if (WARN_ON_ONCE(sp->role.level != fault->goal_level))
+ return RET_PF_RETRY;
+
+ if (is_shadow_present_pte(iter->old_spte) &&
+ (fault->prefetch || is_access_allowed(fault, iter->old_spte)) &&
+ is_last_spte(iter->old_spte, iter->level)) {
+ WARN_ON_ONCE(fault->pfn != spte_to_pfn(iter->old_spte));
+ return RET_PF_SPURIOUS;
+ }
+
+ if (unlikely(!fault->slot))
+ new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
+ else
+ wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
+ fault->pfn, iter->old_spte, fault->prefetch,
+ false, fault->map_writable, &new_spte);
+
+ if (new_spte == iter->old_spte)
+ ret = RET_PF_SPURIOUS;
+ else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
+ return RET_PF_RETRY;
+ else if (is_shadow_present_pte(iter->old_spte) &&
+ (!is_last_spte(iter->old_spte, iter->level) ||
+ WARN_ON_ONCE(leaf_spte_change_needs_tlb_flush(iter->old_spte, new_spte))))
+ kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
+
+ /*
+ * If the page fault was caused by a write but the page is write
+ * protected, emulation is needed. If the emulation was skipped,
+ * the vCPU would have the same fault again.
+ */
+ if (wrprot && fault->write)
+ ret = RET_PF_WRITE_PROTECTED;
+
+ /* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
+ if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
+ vcpu->stat.pf_mmio_spte_created++;
+ trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
+ new_spte);
+ ret = RET_PF_EMULATE;
+ } else {
+ trace_kvm_mmu_set_spte(iter->level, iter->gfn,
+ rcu_dereference(iter->sptep));
+ }
+
+ return ret;
+}
+
+/*
+ * tdp_mmu_link_sp - Replace the given spte with an spte pointing to the
+ * provided page table.
+ *
+ * @kvm: kvm instance
+ * @iter: a tdp_iter instance currently on the SPTE that should be set
+ * @sp: The new TDP page table to install.
+ * @shared: This operation is running under the MMU lock in read mode.
+ *
+ * Returns: 0 if the new page table was installed. Non-0 if the page table
+ * could not be installed (e.g. the atomic compare-exchange failed).
+ */
+static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
+ struct kvm_mmu_page *sp, bool shared)
+{
+ u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled);
+ int ret = 0;
+
+ if (shared) {
+ ret = tdp_mmu_set_spte_atomic(kvm, iter, spte);
+ if (ret)
+ return ret;
+ } else {
+ tdp_mmu_iter_set_spte(kvm, iter, spte);
+ }
+
+ tdp_account_mmu_page(kvm, sp);
+
+ return 0;
+}
+
+static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
+ struct kvm_mmu_page *sp, bool shared);
+
+/*
+ * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
+ * page tables and SPTEs to translate the faulting guest physical address.
+ */
+int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+ struct kvm_mmu_page *root = tdp_mmu_get_root_for_fault(vcpu, fault);
+ struct kvm *kvm = vcpu->kvm;
+ struct tdp_iter iter;
+ struct kvm_mmu_page *sp;
+ int ret = RET_PF_RETRY;
+
+ KVM_MMU_WARN_ON(!root || root->role.invalid);
+
+ kvm_mmu_hugepage_adjust(vcpu, fault);
+
+ trace_kvm_mmu_spte_requested(fault);
+
+ rcu_read_lock();
+
+ for_each_tdp_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
+ int r;
+
+ if (fault->nx_huge_page_workaround_enabled)
+ disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
+
+ /*
+ * If SPTE has been frozen by another thread, just give up and
+		 * retry, avoiding unnecessary page table allocation and freeing.
+ */
+ if (is_frozen_spte(iter.old_spte))
+ goto retry;
+
+ if (iter.level == fault->goal_level)
+ goto map_target_level;
+
+ /* Step down into the lower level page table if it exists. */
+ if (is_shadow_present_pte(iter.old_spte) &&
+ !is_large_pte(iter.old_spte))
+ continue;
+
+ /*
+ * The SPTE is either non-present or points to a huge page that
+ * needs to be split.
+ */
+ sp = tdp_mmu_alloc_sp(vcpu);
+ tdp_mmu_init_child_sp(sp, &iter);
+ if (is_mirror_sp(sp))
+ kvm_mmu_alloc_external_spt(vcpu, sp);
+
+ sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
+
+ if (is_shadow_present_pte(iter.old_spte)) {
+			/* Don't support large pages for mirrored roots (TDX) */
+ KVM_BUG_ON(is_mirror_sptep(iter.sptep), vcpu->kvm);
+ r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
+ } else {
+ r = tdp_mmu_link_sp(kvm, &iter, sp, true);
+ }
+
+ /*
+ * Force the guest to retry if installing an upper level SPTE
+ * failed, e.g. because a different task modified the SPTE.
+ */
+ if (r) {
+ tdp_mmu_free_sp(sp);
+ goto retry;
+ }
+
+ if (fault->huge_page_disallowed &&
+ fault->req_level >= iter.level) {
+ spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+ if (sp->nx_huge_page_disallowed)
+ track_possible_nx_huge_page(kvm, sp, KVM_TDP_MMU);
+ spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+ }
+ }
+
+ /*
+ * The walk aborted before reaching the target level, e.g. because the
+ * iterator detected an upper level SPTE was frozen during traversal.
+ */
+ WARN_ON_ONCE(iter.level == fault->goal_level);
+ goto retry;
+
+map_target_level:
+ ret = tdp_mmu_map_handle_target_level(vcpu, fault, &iter);
+
+retry:
+ rcu_read_unlock();
+ return ret;
+}
+
+/* Used by mmu notifier via kvm_unmap_gfn_range() */
+bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool flush)
+{
+ enum kvm_tdp_mmu_root_types types;
+ struct kvm_mmu_page *root;
+
+ types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter) | KVM_INVALID_ROOTS;
+
+ __for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, types)
+ flush = tdp_mmu_zap_leafs(kvm, root, range->start, range->end,
+ range->may_block, flush);
+
+ return flush;
+}
+
+/*
+ * Mark the SPTEs mapping the range of GFNs [start, end) unaccessed and return
+ * non-zero if any of the GFNs in the range have been accessed.
+ *
+ * No need to mark the corresponding PFN as accessed as this call is coming
+ * from the clear_young() or clear_flush_young() notifier, which uses the
+ * return value to determine if the page has been accessed.
+ */
+static void kvm_tdp_mmu_age_spte(struct kvm *kvm, struct tdp_iter *iter)
+{
+ u64 new_spte;
+
+ if (spte_ad_enabled(iter->old_spte)) {
+ iter->old_spte = tdp_mmu_clear_spte_bits_atomic(iter->sptep,
+ shadow_accessed_mask);
+ new_spte = iter->old_spte & ~shadow_accessed_mask;
+ } else {
+ new_spte = mark_spte_for_access_track(iter->old_spte);
+ /*
+ * It is safe for the following cmpxchg to fail. Leave the
+ * Accessed bit set, as the spte is most likely young anyway.
+ */
+ if (__tdp_mmu_set_spte_atomic(kvm, iter, new_spte))
+ return;
+ }
+
+ trace_kvm_tdp_mmu_spte_changed(iter->as_id, iter->gfn, iter->level,
+ iter->old_spte, new_spte);
+}
+
+static bool __kvm_tdp_mmu_age_gfn_range(struct kvm *kvm,
+ struct kvm_gfn_range *range,
+ bool test_only)
+{
+ enum kvm_tdp_mmu_root_types types;
+ struct kvm_mmu_page *root;
+ struct tdp_iter iter;
+ bool ret = false;
+
+ types = kvm_gfn_range_filter_to_root_types(kvm, range->attr_filter);
+
+ /*
+	 * Don't support rescheduling; none of the MMU notifiers that funnel
+ * into this helper allow blocking; it'd be dead, wasteful code. Note,
+ * this helper must NOT be used to unmap GFNs, as it processes only
+ * valid roots!
+ */
+ WARN_ON(types & ~KVM_VALID_ROOTS);
+
+ guard(rcu)();
+ for_each_tdp_mmu_root_rcu(kvm, root, range->slot->as_id, types) {
+ tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end) {
+ if (!is_accessed_spte(iter.old_spte))
+ continue;
+
+ if (test_only)
+ return true;
+
+ ret = true;
+ kvm_tdp_mmu_age_spte(kvm, &iter);
+ }
+ }
+
+ return ret;
+}
+
+bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return __kvm_tdp_mmu_age_gfn_range(kvm, range, false);
+}
+
+bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
+{
+ return __kvm_tdp_mmu_age_gfn_range(kvm, range, true);
+}
+
+/*
+ * Remove write access from all SPTEs at or above min_level that map GFNs
+ * [start, end). Returns true if an SPTE has been changed and the TLBs need to
+ * be flushed.
+ */
+static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end, int min_level)
+{
+ struct tdp_iter iter;
+ u64 new_spte;
+ bool spte_set = false;
+
+ rcu_read_lock();
+
+ BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
+
+ for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) {
+retry:
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
+ continue;
+
+ if (!is_shadow_present_pte(iter.old_spte) ||
+ !is_last_spte(iter.old_spte, iter.level) ||
+ !(iter.old_spte & PT_WRITABLE_MASK))
+ continue;
+
+ new_spte = iter.old_spte & ~PT_WRITABLE_MASK;
+
+ if (tdp_mmu_set_spte_atomic(kvm, &iter, new_spte))
+ goto retry;
+
+ spte_set = true;
+ }
+
+ rcu_read_unlock();
+ return spte_set;
+}
+
+/*
+ * Remove write access from all the SPTEs mapping GFNs in the memslot. Will
+ * only affect leaf SPTEs down to min_level.
+ * Returns true if an SPTE has been changed and the TLBs need to be flushed.
+ */
+bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
+ const struct kvm_memory_slot *slot, int min_level)
+{
+ struct kvm_mmu_page *root;
+ bool spte_set = false;
+
+ lockdep_assert_held_read(&kvm->mmu_lock);
+
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
+ spte_set |= wrprot_gfn_range(kvm, root, slot->base_gfn,
+ slot->base_gfn + slot->npages, min_level);
+
+ return spte_set;
+}
+
+static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
+{
+ struct kvm_mmu_page *sp;
+
+ sp = kmem_cache_zalloc(mmu_page_header_cache, GFP_KERNEL_ACCOUNT);
+ if (!sp)
+ return NULL;
+
+ sp->spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (!sp->spt) {
+ kmem_cache_free(mmu_page_header_cache, sp);
+ return NULL;
+ }
+
+ return sp;
+}
+
+/* Note, the caller is responsible for initializing @sp. */
+static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
+ struct kvm_mmu_page *sp, bool shared)
+{
+ const u64 huge_spte = iter->old_spte;
+ const int level = iter->level;
+ int ret, i;
+
+ /*
+ * No need for atomics when writing to sp->spt since the page table has
+ * not been linked in yet and thus is not reachable from any other CPU.
+ */
+ for (i = 0; i < SPTE_ENT_PER_PAGE; i++)
+ sp->spt[i] = make_small_spte(kvm, huge_spte, sp->role, i);
+
+ /*
+ * Replace the huge spte with a pointer to the populated lower level
+	 * page table. Since we are making this change without a TLB flush, vCPUs
+ * will see a mix of the split mappings and the original huge mapping,
+ * depending on what's currently in their TLB. This is fine from a
+ * correctness standpoint since the translation will be the same either
+ * way.
+ */
+ ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
+ if (ret)
+ goto out;
+
+ /*
+	 * tdp_mmu_link_sp() will handle subtracting the huge page we
+ * are overwriting from the page stats. But we have to manually update
+ * the page stats with the new present child pages.
+ */
+ kvm_update_page_stats(kvm, level - 1, SPTE_ENT_PER_PAGE);
+
+out:
+ trace_kvm_mmu_split_huge_page(iter->gfn, huge_spte, level, ret);
+ return ret;
+}
+
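+/*
+ * Concretely, splitting a 1GiB SPTE installs SPTE_ENT_PER_PAGE == 512 2MiB
+ * SPTEs: the stats update above adds 512 pages at level - 1, while the 1GiB
+ * page itself was already subtracted when handle_changed_spte() observed the
+ * leaf SPTE being replaced by a non-leaf SPTE.
+ */
+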
+static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
+ struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end,
+ int target_level, bool shared)
+{
+ struct kvm_mmu_page *sp = NULL;
+ struct tdp_iter iter;
+
+ rcu_read_lock();
+
+ /*
+ * Traverse the page table splitting all huge pages above the target
+ * level into one lower level. For example, if we encounter a 1GB page
+ * we split it into 512 2MB pages.
+ *
+ * Since the TDP iterator uses a pre-order traversal, we are guaranteed
+ * to visit an SPTE before ever visiting its children, which means we
+ * will correctly recursively split huge pages that are more than one
+	 * level above the target level (e.g. splitting a 1GB page into 512 2MB pages,
+ * and then splitting each of those to 512 4KB pages).
+ */
+ for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
+retry:
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
+ continue;
+
+ if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
+ continue;
+
+ if (!sp) {
+ rcu_read_unlock();
+
+ if (shared)
+ read_unlock(&kvm->mmu_lock);
+ else
+ write_unlock(&kvm->mmu_lock);
+
+ sp = tdp_mmu_alloc_sp_for_split();
+
+ if (shared)
+ read_lock(&kvm->mmu_lock);
+ else
+ write_lock(&kvm->mmu_lock);
+
+ if (!sp) {
+ trace_kvm_mmu_split_huge_page(iter.gfn,
+ iter.old_spte,
+ iter.level, -ENOMEM);
+ return -ENOMEM;
+ }
+
+ rcu_read_lock();
+
+ iter.yielded = true;
+ continue;
+ }
+
+ tdp_mmu_init_child_sp(sp, &iter);
+
+ if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
+ goto retry;
+
+ sp = NULL;
+ }
+
+ rcu_read_unlock();
+
+ /*
+ * It's possible to exit the loop having never used the last sp if, for
+ * example, a vCPU doing HugePage NX splitting wins the race and
+ * installs its own sp in place of the last sp we tried to split.
+ */
+ if (sp)
+ tdp_mmu_free_sp(sp);
+
+ return 0;
+}
+
+
+/*
+ * Try to split all huge pages mapped by the TDP MMU down to the target level.
+ */
+void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ gfn_t start, gfn_t end,
+ int target_level, bool shared)
+{
+ struct kvm_mmu_page *root;
+ int r = 0;
+
+ kvm_lockdep_assert_mmu_lock_held(kvm, shared);
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id) {
+ r = tdp_mmu_split_huge_pages_root(kvm, root, start, end, target_level, shared);
+ if (r) {
+ kvm_tdp_mmu_put_root(kvm, root);
+ break;
+ }
+ }
+}
+
+static bool tdp_mmu_need_write_protect(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+ /*
+ * All TDP MMU shadow pages share the same role as their root, aside
+ * from level, so it is valid to key off any shadow page to determine if
+ * write protection is needed for an entire tree.
+ */
+ return kvm_mmu_page_ad_need_write_protect(kvm, sp) || !kvm_ad_enabled;
+}
+
+static void clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t start, gfn_t end)
+{
+ const u64 dbit = tdp_mmu_need_write_protect(kvm, root) ?
+ PT_WRITABLE_MASK : shadow_dirty_mask;
+ struct tdp_iter iter;
+
+ rcu_read_lock();
+
+ tdp_root_for_each_pte(iter, kvm, root, start, end) {
+retry:
+ if (!is_shadow_present_pte(iter.old_spte) ||
+ !is_last_spte(iter.old_spte, iter.level))
+ continue;
+
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
+ continue;
+
+ KVM_MMU_WARN_ON(dbit == shadow_dirty_mask &&
+ spte_ad_need_write_protect(iter.old_spte));
+
+ if (!(iter.old_spte & dbit))
+ continue;
+
+ if (tdp_mmu_set_spte_atomic(kvm, &iter, iter.old_spte & ~dbit))
+ goto retry;
+ }
+
+ rcu_read_unlock();
+}
+
+/*
+ * Clear the dirty status (D-bit or W-bit) of all the SPTEs mapping GFNs in the
+ * memslot.
+ */
+void kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
+ const struct kvm_memory_slot *slot)
+{
+ struct kvm_mmu_page *root;
+
+ lockdep_assert_held_read(&kvm->mmu_lock);
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
+ clear_dirty_gfn_range(kvm, root, slot->base_gfn,
+ slot->base_gfn + slot->npages);
+}
+
+static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t gfn, unsigned long mask, bool wrprot)
+{
+ const u64 dbit = (wrprot || tdp_mmu_need_write_protect(kvm, root)) ?
+ PT_WRITABLE_MASK : shadow_dirty_mask;
+ struct tdp_iter iter;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+
+ rcu_read_lock();
+
+ tdp_root_for_each_leaf_pte(iter, kvm, root, gfn + __ffs(mask),
+ gfn + BITS_PER_LONG) {
+ if (!mask)
+ break;
+
+ KVM_MMU_WARN_ON(dbit == shadow_dirty_mask &&
+ spte_ad_need_write_protect(iter.old_spte));
+
+ if (iter.level > PG_LEVEL_4K ||
+ !(mask & (1UL << (iter.gfn - gfn))))
+ continue;
+
+ mask &= ~(1UL << (iter.gfn - gfn));
+
+ if (!(iter.old_spte & dbit))
+ continue;
+
+ iter.old_spte = tdp_mmu_clear_spte_bits(iter.sptep,
+ iter.old_spte, dbit,
+ iter.level);
+
+ trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
+ iter.old_spte,
+ iter.old_spte & ~dbit);
+ }
+
+ rcu_read_unlock();
+}
+
+/*
+ * Clear the dirty status (D-bit or W-bit) of all the 4k SPTEs mapping GFNs for
+ * which a bit is set in mask, starting at gfn. The given memslot is expected to
+ * contain all the GFNs represented by set bits in the mask.
+ */
+void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, unsigned long mask,
+ bool wrprot)
+{
+ struct kvm_mmu_page *root;
+
+ for_each_valid_tdp_mmu_root(kvm, root, slot->as_id)
+ clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
+}
+
+static int tdp_mmu_make_huge_spte(struct kvm *kvm,
+ struct tdp_iter *parent,
+ u64 *huge_spte)
+{
+ struct kvm_mmu_page *root = spte_to_child_sp(parent->old_spte);
+ gfn_t start = parent->gfn;
+ gfn_t end = start + KVM_PAGES_PER_HPAGE(parent->level);
+ struct tdp_iter iter;
+
+ tdp_root_for_each_leaf_pte(iter, kvm, root, start, end) {
+ /*
+ * Use the parent iterator when checking for forward progress so
+ * that KVM doesn't get stuck continuously trying to yield (i.e.
+ * returning -EAGAIN here and then failing the forward progress
+ * check in the caller ad nauseam).
+ */
+ if (tdp_mmu_iter_need_resched(kvm, parent))
+ return -EAGAIN;
+
+ *huge_spte = make_huge_spte(kvm, iter.old_spte, parent->level);
+ return 0;
+ }
+
+ return -ENOENT;
+}
+
+static void recover_huge_pages_range(struct kvm *kvm,
+ struct kvm_mmu_page *root,
+ const struct kvm_memory_slot *slot)
+{
+ gfn_t start = slot->base_gfn;
+ gfn_t end = start + slot->npages;
+ struct tdp_iter iter;
+ int max_mapping_level;
+ bool flush = false;
+ u64 huge_spte;
+ int r;
+
+ if (WARN_ON_ONCE(kvm_slot_dirty_track_enabled(slot)))
+ return;
+
+ rcu_read_lock();
+
+ for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_2M, start, end) {
+retry:
+ if (tdp_mmu_iter_cond_resched(kvm, &iter, flush, true)) {
+ flush = false;
+ continue;
+ }
+
+ if (iter.level > KVM_MAX_HUGEPAGE_LEVEL ||
+ !is_shadow_present_pte(iter.old_spte))
+ continue;
+
+ /*
+		 * Don't zap leaf SPTEs; if a leaf SPTE could be replaced with
+		 * a large page, then its parent would have been zapped
+ * instead of stepping down.
+ */
+ if (is_last_spte(iter.old_spte, iter.level))
+ continue;
+
+ /*
+ * If iter.gfn resides outside of the slot, i.e. the page for
+ * the current level overlaps but is not contained by the slot,
+ * then the SPTE can't be made huge. More importantly, trying
+ * to query that info from slot->arch.lpage_info will cause an
+ * out-of-bounds access.
+ */
+ if (iter.gfn < start || iter.gfn >= end)
+ continue;
+
+ max_mapping_level = kvm_mmu_max_mapping_level(kvm, NULL, slot, iter.gfn);
+ if (max_mapping_level < iter.level)
+ continue;
+
+ r = tdp_mmu_make_huge_spte(kvm, &iter, &huge_spte);
+ if (r == -EAGAIN)
+ goto retry;
+ else if (r)
+ continue;
+
+ if (tdp_mmu_set_spte_atomic(kvm, &iter, huge_spte))
+ goto retry;
+
+ flush = true;
+ }
+
+ if (flush)
+ kvm_flush_remote_tlbs_memslot(kvm, slot);
+
+ rcu_read_unlock();
+}
+
+/*
+ * Recover huge page mappings within the slot by replacing non-leaf SPTEs with
+ * huge SPTEs where possible.
+ */
+void kvm_tdp_mmu_recover_huge_pages(struct kvm *kvm,
+ const struct kvm_memory_slot *slot)
+{
+ struct kvm_mmu_page *root;
+
+ lockdep_assert_held_read(&kvm->mmu_lock);
+ for_each_valid_tdp_mmu_root_yield_safe(kvm, root, slot->as_id)
+ recover_huge_pages_range(kvm, root, slot);
+}
+
+/*
+ * Removes write access on the last level SPTE mapping this GFN and unsets the
+ * MMU-writable bit to ensure future writes continue to be intercepted.
+ * Returns true if an SPTE was set and a TLB flush is needed.
+ */
+static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
+ gfn_t gfn, int min_level)
+{
+ struct tdp_iter iter;
+ u64 new_spte;
+ bool spte_set = false;
+
+ BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
+
+ rcu_read_lock();
+
+ for_each_tdp_pte_min_level(iter, kvm, root, min_level, gfn, gfn + 1) {
+ if (!is_shadow_present_pte(iter.old_spte) ||
+ !is_last_spte(iter.old_spte, iter.level))
+ continue;
+
+ new_spte = iter.old_spte &
+ ~(PT_WRITABLE_MASK | shadow_mmu_writable_mask);
+
+ if (new_spte == iter.old_spte)
+ break;
+
+ tdp_mmu_iter_set_spte(kvm, &iter, new_spte);
+ spte_set = true;
+ }
+
+ rcu_read_unlock();
+
+ return spte_set;
+}
+
+/*
+ * Removes write access on the last level SPTE mapping this GFN and unsets the
+ * MMU-writable bit to ensure future writes continue to be intercepted.
+ * Returns true if an SPTE was set and a TLB flush is needed.
+ */
+bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ int min_level)
+{
+ struct kvm_mmu_page *root;
+ bool spte_set = false;
+
+ lockdep_assert_held_write(&kvm->mmu_lock);
+ for_each_valid_tdp_mmu_root(kvm, root, slot->as_id)
+ spte_set |= write_protect_gfn(kvm, root, gfn, min_level);
+
+ return spte_set;
+}
+
+/*
+ * Return the level of the lowest level SPTE added to sptes.
+ * That SPTE may be non-present.
+ *
+ * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
+ */
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ int *root_level)
+{
+ struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
+ struct tdp_iter iter;
+ gfn_t gfn = addr >> PAGE_SHIFT;
+ int leaf = -1;
+
+ *root_level = vcpu->arch.mmu->root_role.level;
+
+ for_each_tdp_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
+ leaf = iter.level;
+ sptes[leaf] = iter.old_spte;
+ }
+
+ return leaf;
+}
+
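+/*
+ * A hypothetical caller (sketch) dumping the walk; sptes[] is assumed to hold
+ * PT64_ROOT_MAX_LEVEL + 1 entries, as levels index the array directly:
+ *
+ *	kvm_tdp_mmu_walk_lockless_begin();
+ *	leaf = kvm_tdp_mmu_get_walk(vcpu, addr, sptes, &root_level);
+ *	kvm_tdp_mmu_walk_lockless_end();
+ *	for (level = root_level; leaf >= 1 && level >= leaf; level--)
+ *		pr_info("level %d spte 0x%llx\n", level, sptes[level]);
+ */
+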
+/*
+ * Returns the last level spte pointer of the shadow page walk for the given
+ * gpa, and sets *spte to the spte value. This spte may be non-present. If no
+ * walk could be performed, returns NULL and *spte does not contain valid data.
+ *
+ * Contract:
+ * - Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
+ * - The returned sptep must not be used after kvm_tdp_mmu_walk_lockless_end.
+ *
+ * WARNING: This function is only intended to be called during fast_page_fault.
+ */
+u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
+ u64 *spte)
+{
+ /* Fast pf is not supported for mirrored roots */
+ struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, KVM_DIRECT_ROOTS);
+ struct tdp_iter iter;
+ tdp_ptep_t sptep = NULL;
+
+ for_each_tdp_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
+ *spte = iter.old_spte;
+ sptep = iter.sptep;
+ }
+
+ /*
+ * Perform the rcu_dereference to get the raw spte pointer value since
+ * we are passing it up to fast_page_fault, which is shared with the
+ * legacy MMU and thus does not retain the TDP MMU-specific __rcu
+ * annotation.
+ *
+ * This is safe since fast_page_fault obeys the contracts of this
+ * function as well as all TDP MMU contracts around modifying SPTEs
+ * outside of mmu_lock.
+ */
+ return rcu_dereference(sptep);
+}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
new file mode 100644
index 000000000000..bd62977c9199
--- /dev/null
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -0,0 +1,122 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __KVM_X86_MMU_TDP_MMU_H
+#define __KVM_X86_MMU_TDP_MMU_H
+
+#include <linux/kvm_host.h>
+
+#include "spte.h"
+
+void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
+void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
+
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool private);
+
+__must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
+{
+ return refcount_inc_not_zero(&root->tdp_mmu_root_count);
+}
+
+void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root);
+
+enum kvm_tdp_mmu_root_types {
+ KVM_INVALID_ROOTS = BIT(0),
+ KVM_DIRECT_ROOTS = BIT(1),
+ KVM_MIRROR_ROOTS = BIT(2),
+ KVM_VALID_ROOTS = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS,
+ KVM_ALL_ROOTS = KVM_VALID_ROOTS | KVM_INVALID_ROOTS,
+};
+
+static inline enum kvm_tdp_mmu_root_types kvm_gfn_range_filter_to_root_types(struct kvm *kvm,
+ enum kvm_gfn_range_filter process)
+{
+ enum kvm_tdp_mmu_root_types ret = 0;
+
+ if (!kvm_has_mirrored_tdp(kvm))
+ return KVM_DIRECT_ROOTS;
+
+ if (process & KVM_FILTER_PRIVATE)
+ ret |= KVM_MIRROR_ROOTS;
+ if (process & KVM_FILTER_SHARED)
+ ret |= KVM_DIRECT_ROOTS;
+
+ WARN_ON_ONCE(!ret);
+
+ return ret;
+}
+
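+/*
+ * Example (sketch): a range operation that targets only private memory passes
+ * KVM_FILTER_PRIVATE and, on a VM with mirrored TDP, gets KVM_MIRROR_ROOTS
+ * back; without mirrored TDP the helper always returns KVM_DIRECT_ROOTS:
+ *
+ *	types = kvm_gfn_range_filter_to_root_types(kvm, KVM_FILTER_PRIVATE);
+ */
+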
+static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
+ return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
+
+ return root_to_sp(vcpu->arch.mmu->root.hpa);
+}
+
+static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu,
+ enum kvm_tdp_mmu_root_types type)
+{
+ if (unlikely(type == KVM_MIRROR_ROOTS))
+ return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
+
+ return root_to_sp(vcpu->arch.mmu->root.hpa);
+}
+
+bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
+bool kvm_tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
+ struct kvm_mmu_page *sp);
+void kvm_tdp_mmu_zap_all(struct kvm *kvm);
+void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
+ enum kvm_tdp_mmu_root_types root_types);
+void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
+
+int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
+
+bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
+ bool flush);
+bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
+bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
+
+bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
+ const struct kvm_memory_slot *slot, int min_level);
+void kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
+ const struct kvm_memory_slot *slot);
+void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
+ struct kvm_memory_slot *slot,
+ gfn_t gfn, unsigned long mask,
+ bool wrprot);
+void kvm_tdp_mmu_recover_huge_pages(struct kvm *kvm,
+ const struct kvm_memory_slot *slot);
+
+bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
+ struct kvm_memory_slot *slot, gfn_t gfn,
+ int min_level);
+
+void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
+ const struct kvm_memory_slot *slot,
+ gfn_t start, gfn_t end,
+ int target_level, bool shared);
+
+static inline void kvm_tdp_mmu_walk_lockless_begin(void)
+{
+ rcu_read_lock();
+}
+
+static inline void kvm_tdp_mmu_walk_lockless_end(void)
+{
+ rcu_read_unlock();
+}
+
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+ int *root_level);
+u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, gfn_t gfn,
+ u64 *spte);
+
+#ifdef CONFIG_X86_64
+static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return sp->tdp_mmu_page; }
+#else
+static inline bool is_tdp_mmu_page(struct kvm_mmu_page *sp) { return false; }
+#endif
+
+#endif /* __KVM_X86_MMU_TDP_MMU_H */
diff --git a/arch/x86/kvm/mmutrace.h b/arch/x86/kvm/mmutrace.h
deleted file mode 100644
index 42f07b1bfbc9..000000000000
--- a/arch/x86/kvm/mmutrace.h
+++ /dev/null
@@ -1,206 +0,0 @@
-#if !defined(_TRACE_KVMMMU_H) || defined(TRACE_HEADER_MULTI_READ)
-#define _TRACE_KVMMMU_H
-
-#include <linux/tracepoint.h>
-#include <linux/ftrace_event.h>
-
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM kvmmmu
-
-#define KVM_MMU_PAGE_FIELDS \
- __field(__u64, gfn) \
- __field(__u32, role) \
- __field(__u32, root_count) \
- __field(bool, unsync)
-
-#define KVM_MMU_PAGE_ASSIGN(sp) \
- __entry->gfn = sp->gfn; \
- __entry->role = sp->role.word; \
- __entry->root_count = sp->root_count; \
- __entry->unsync = sp->unsync;
-
-#define KVM_MMU_PAGE_PRINTK() ({ \
- const char *ret = p->buffer + p->len; \
- static const char *access_str[] = { \
- "---", "--x", "w--", "w-x", "-u-", "-ux", "wu-", "wux" \
- }; \
- union kvm_mmu_page_role role; \
- \
- role.word = __entry->role; \
- \
- trace_seq_printf(p, "sp gfn %llx %u%s q%u%s %s%s" \
- " %snxe root %u %s%c", \
- __entry->gfn, role.level, \
- role.cr4_pae ? " pae" : "", \
- role.quadrant, \
- role.direct ? " direct" : "", \
- access_str[role.access], \
- role.invalid ? " invalid" : "", \
- role.nxe ? "" : "!", \
- __entry->root_count, \
- __entry->unsync ? "unsync" : "sync", 0); \
- ret; \
- })
-
-#define kvm_mmu_trace_pferr_flags \
- { PFERR_PRESENT_MASK, "P" }, \
- { PFERR_WRITE_MASK, "W" }, \
- { PFERR_USER_MASK, "U" }, \
- { PFERR_RSVD_MASK, "RSVD" }, \
- { PFERR_FETCH_MASK, "F" }
-
-/*
- * A pagetable walk has started
- */
-TRACE_EVENT(
- kvm_mmu_pagetable_walk,
- TP_PROTO(u64 addr, int write_fault, int user_fault, int fetch_fault),
- TP_ARGS(addr, write_fault, user_fault, fetch_fault),
-
- TP_STRUCT__entry(
- __field(__u64, addr)
- __field(__u32, pferr)
- ),
-
- TP_fast_assign(
- __entry->addr = addr;
- __entry->pferr = (!!write_fault << 1) | (!!user_fault << 2)
- | (!!fetch_fault << 4);
- ),
-
- TP_printk("addr %llx pferr %x %s", __entry->addr, __entry->pferr,
- __print_flags(__entry->pferr, "|", kvm_mmu_trace_pferr_flags))
-);
-
-
-/* We just walked a paging element */
-TRACE_EVENT(
- kvm_mmu_paging_element,
- TP_PROTO(u64 pte, int level),
- TP_ARGS(pte, level),
-
- TP_STRUCT__entry(
- __field(__u64, pte)
- __field(__u32, level)
- ),
-
- TP_fast_assign(
- __entry->pte = pte;
- __entry->level = level;
- ),
-
- TP_printk("pte %llx level %u", __entry->pte, __entry->level)
-);
-
-DECLARE_EVENT_CLASS(kvm_mmu_set_bit_class,
-
- TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
-
- TP_ARGS(table_gfn, index, size),
-
- TP_STRUCT__entry(
- __field(__u64, gpa)
- ),
-
- TP_fast_assign(
- __entry->gpa = ((u64)table_gfn << PAGE_SHIFT)
- + index * size;
- ),
-
- TP_printk("gpa %llx", __entry->gpa)
-);
-
-/* We set a pte accessed bit */
-DEFINE_EVENT(kvm_mmu_set_bit_class, kvm_mmu_set_accessed_bit,
-
- TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
-
- TP_ARGS(table_gfn, index, size)
-);
-
-/* We set a pte dirty bit */
-DEFINE_EVENT(kvm_mmu_set_bit_class, kvm_mmu_set_dirty_bit,
-
- TP_PROTO(unsigned long table_gfn, unsigned index, unsigned size),
-
- TP_ARGS(table_gfn, index, size)
-);
-
-TRACE_EVENT(
- kvm_mmu_walker_error,
- TP_PROTO(u32 pferr),
- TP_ARGS(pferr),
-
- TP_STRUCT__entry(
- __field(__u32, pferr)
- ),
-
- TP_fast_assign(
- __entry->pferr = pferr;
- ),
-
- TP_printk("pferr %x %s", __entry->pferr,
- __print_flags(__entry->pferr, "|", kvm_mmu_trace_pferr_flags))
-);
-
-TRACE_EVENT(
- kvm_mmu_get_page,
- TP_PROTO(struct kvm_mmu_page *sp, bool created),
- TP_ARGS(sp, created),
-
- TP_STRUCT__entry(
- KVM_MMU_PAGE_FIELDS
- __field(bool, created)
- ),
-
- TP_fast_assign(
- KVM_MMU_PAGE_ASSIGN(sp)
- __entry->created = created;
- ),
-
- TP_printk("%s %s", KVM_MMU_PAGE_PRINTK(),
- __entry->created ? "new" : "existing")
-);
-
-DECLARE_EVENT_CLASS(kvm_mmu_page_class,
-
- TP_PROTO(struct kvm_mmu_page *sp),
- TP_ARGS(sp),
-
- TP_STRUCT__entry(
- KVM_MMU_PAGE_FIELDS
- ),
-
- TP_fast_assign(
- KVM_MMU_PAGE_ASSIGN(sp)
- ),
-
- TP_printk("%s", KVM_MMU_PAGE_PRINTK())
-);
-
-DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_sync_page,
- TP_PROTO(struct kvm_mmu_page *sp),
-
- TP_ARGS(sp)
-);
-
-DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_unsync_page,
- TP_PROTO(struct kvm_mmu_page *sp),
-
- TP_ARGS(sp)
-);
-
-DEFINE_EVENT(kvm_mmu_page_class, kvm_mmu_zap_page,
- TP_PROTO(struct kvm_mmu_page *sp),
-
- TP_ARGS(sp)
-);
-#endif /* _TRACE_KVMMMU_H */
-
-#undef TRACE_INCLUDE_PATH
-#define TRACE_INCLUDE_PATH .
-#undef TRACE_INCLUDE_FILE
-#define TRACE_INCLUDE_FILE mmutrace
-
-/* This part must be outside protection */
-#include <trace/define_trace.h>
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
new file mode 100644
index 000000000000..6f74e2b27c1e
--- /dev/null
+++ b/arch/x86/kvm/mtrr.c
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vMTRR implementation
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Authors:
+ * Yaniv Kamay <yaniv@qumranet.com>
+ * Avi Kivity <avi@qumranet.com>
+ * Marcelo Tosatti <mtosatti@redhat.com>
+ * Paolo Bonzini <pbonzini@redhat.com>
+ * Xiao Guangrong <guangrong.xiao@linux.intel.com>
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kvm_host.h>
+#include <asm/mtrr.h>
+
+#include "cpuid.h"
+#include "x86.h"
+
+static u64 *find_mtrr(struct kvm_vcpu *vcpu, unsigned int msr)
+{
+ int index;
+
+ switch (msr) {
+ case MTRRphysBase_MSR(0) ... MTRRphysMask_MSR(KVM_NR_VAR_MTRR - 1):
+ index = msr - MTRRphysBase_MSR(0);
+ return &vcpu->arch.mtrr_state.var[index];
+ case MSR_MTRRfix64K_00000:
+ return &vcpu->arch.mtrr_state.fixed_64k;
+ case MSR_MTRRfix16K_80000:
+ case MSR_MTRRfix16K_A0000:
+ index = msr - MSR_MTRRfix16K_80000;
+ return &vcpu->arch.mtrr_state.fixed_16k[index];
+ case MSR_MTRRfix4K_C0000:
+ case MSR_MTRRfix4K_C8000:
+ case MSR_MTRRfix4K_D0000:
+ case MSR_MTRRfix4K_D8000:
+ case MSR_MTRRfix4K_E0000:
+ case MSR_MTRRfix4K_E8000:
+ case MSR_MTRRfix4K_F0000:
+ case MSR_MTRRfix4K_F8000:
+ index = msr - MSR_MTRRfix4K_C0000;
+ return &vcpu->arch.mtrr_state.fixed_4k[index];
+ case MSR_MTRRdefType:
+ return &vcpu->arch.mtrr_state.deftype;
+ default:
+ break;
+ }
+ return NULL;
+}
+
+static bool valid_mtrr_type(unsigned t)
+{
+ return t < 8 && (1 << t) & 0x73; /* 0, 1, 4, 5, 6 */
+}
+
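+/*
+ * 0x73 is binary 0111_0011, i.e. bits 0 (UC), 1 (WC), 4 (WT), 5 (WP) and
+ * 6 (WB) are set. For example, valid_mtrr_type(6) evaluates
+ * ((1 << 6) & 0x73) == 0x40 and succeeds, while the reserved type 2 yields
+ * ((1 << 2) & 0x73) == 0 and is rejected.
+ */
+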
+static bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data)
+{
+ int i;
+ u64 mask;
+
+ if (msr == MSR_MTRRdefType) {
+ if (data & ~0xcff)
+ return false;
+ return valid_mtrr_type(data & 0xff);
+ } else if (msr >= MSR_MTRRfix64K_00000 && msr <= MSR_MTRRfix4K_F8000) {
+ for (i = 0; i < 8 ; i++)
+ if (!valid_mtrr_type((data >> (i * 8)) & 0xff))
+ return false;
+ return true;
+ }
+
+ /* variable MTRRs */
+ if (WARN_ON_ONCE(!(msr >= MTRRphysBase_MSR(0) &&
+ msr <= MTRRphysMask_MSR(KVM_NR_VAR_MTRR - 1))))
+ return false;
+
+ mask = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
+ if ((msr & 1) == 0) {
+ /* MTRR base */
+ if (!valid_mtrr_type(data & 0xff))
+ return false;
+ mask |= 0xf00;
+ } else {
+ /* MTRR mask */
+ mask |= 0x7ff;
+ }
+
+ return (data & mask) == 0;
+}
+
+int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
+{
+ u64 *mtrr;
+
+ mtrr = find_mtrr(vcpu, msr);
+ if (!mtrr)
+ return 1;
+
+ if (!kvm_mtrr_valid(vcpu, msr, data))
+ return 1;
+
+ *mtrr = data;
+ return 0;
+}
+
+int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
+{
+ u64 *mtrr;
+
+ /* MSR_MTRRcap is a readonly MSR. */
+ if (msr == MSR_MTRRcap) {
+ /*
+ * SMRR = 0
+ * WC = 1
+ * FIX = 1
+ * VCNT = KVM_NR_VAR_MTRR
+ */
+ *pdata = 0x500 | KVM_NR_VAR_MTRR;
+ return 0;
+ }
+
+ mtrr = find_mtrr(vcpu, msr);
+ if (!mtrr)
+ return 1;
+
+ *pdata = *mtrr;
+ return 0;
+}
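+
+/*
+ * For example, with KVM_NR_VAR_MTRR == 8 the MTRRcap value reported above is
+ * 0x500 | 8 == 0x508: FIX (bit 8) and WC (bit 10) set, SMRR (bit 11) clear,
+ * and VCNT == 8 variable-range MTRRs in bits 7:0.
+ */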
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
deleted file mode 100644
index 2331bdc2b549..000000000000
--- a/arch/x86/kvm/paging_tmpl.h
+++ /dev/null
@@ -1,645 +0,0 @@
-/*
- * Kernel-based Virtual Machine driver for Linux
- *
- * This module enables machines with Intel VT-x extensions to run virtual
- * machines without emulation or binary translation.
- *
- * MMU support
- *
- * Copyright (C) 2006 Qumranet, Inc.
- *
- * Authors:
- * Yaniv Kamay <yaniv@qumranet.com>
- * Avi Kivity <avi@qumranet.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2. See
- * the COPYING file in the top-level directory.
- *
- */
-
-/*
- * We need the mmu code to access both 32-bit and 64-bit guest ptes,
- * so the code in this file is compiled twice, once per pte size.
- */
-
-#if PTTYPE == 64
- #define pt_element_t u64
- #define guest_walker guest_walker64
- #define FNAME(name) paging##64_##name
- #define PT_BASE_ADDR_MASK PT64_BASE_ADDR_MASK
- #define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
- #define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
- #define PT_INDEX(addr, level) PT64_INDEX(addr, level)
- #define PT_LEVEL_MASK(level) PT64_LEVEL_MASK(level)
- #define PT_LEVEL_BITS PT64_LEVEL_BITS
- #ifdef CONFIG_X86_64
- #define PT_MAX_FULL_LEVELS 4
- #define CMPXCHG cmpxchg
- #else
- #define CMPXCHG cmpxchg64
- #define PT_MAX_FULL_LEVELS 2
- #endif
-#elif PTTYPE == 32
- #define pt_element_t u32
- #define guest_walker guest_walker32
- #define FNAME(name) paging##32_##name
- #define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
- #define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
- #define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
- #define PT_INDEX(addr, level) PT32_INDEX(addr, level)
- #define PT_LEVEL_MASK(level) PT32_LEVEL_MASK(level)
- #define PT_LEVEL_BITS PT32_LEVEL_BITS
- #define PT_MAX_FULL_LEVELS 2
- #define CMPXCHG cmpxchg
-#else
- #error Invalid PTTYPE value
-#endif
-
-#define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
-#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PT_PAGE_TABLE_LEVEL)
-
-/*
- * The guest_walker structure emulates the behavior of the hardware page
- * table walker.
- */
-struct guest_walker {
- int level;
- gfn_t table_gfn[PT_MAX_FULL_LEVELS];
- pt_element_t ptes[PT_MAX_FULL_LEVELS];
- gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
- unsigned pt_access;
- unsigned pte_access;
- gfn_t gfn;
- u32 error_code;
-};
-
-static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
-{
- return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
-}
-
-static bool FNAME(cmpxchg_gpte)(struct kvm *kvm,
- gfn_t table_gfn, unsigned index,
- pt_element_t orig_pte, pt_element_t new_pte)
-{
- pt_element_t ret;
- pt_element_t *table;
- struct page *page;
-
- page = gfn_to_page(kvm, table_gfn);
-
- table = kmap_atomic(page, KM_USER0);
- ret = CMPXCHG(&table[index], orig_pte, new_pte);
- kunmap_atomic(table, KM_USER0);
-
- kvm_release_page_dirty(page);
-
- return (ret != orig_pte);
-}
-
-static unsigned FNAME(gpte_access)(struct kvm_vcpu *vcpu, pt_element_t gpte)
-{
- unsigned access;
-
- access = (gpte & (PT_WRITABLE_MASK | PT_USER_MASK)) | ACC_EXEC_MASK;
-#if PTTYPE == 64
- if (is_nx(vcpu))
- access &= ~(gpte >> PT64_NX_SHIFT);
-#endif
- return access;
-}
-
-/*
- * Fetch a guest pte for a guest virtual address
- */
-static int FNAME(walk_addr)(struct guest_walker *walker,
- struct kvm_vcpu *vcpu, gva_t addr,
- int write_fault, int user_fault, int fetch_fault)
-{
- pt_element_t pte;
- gfn_t table_gfn;
- unsigned index, pt_access, pte_access;
- gpa_t pte_gpa;
- int rsvd_fault = 0;
-
- trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
- fetch_fault);
-walk:
- walker->level = vcpu->arch.mmu.root_level;
- pte = vcpu->arch.cr3;
-#if PTTYPE == 64
- if (!is_long_mode(vcpu)) {
- pte = kvm_pdptr_read(vcpu, (addr >> 30) & 3);
- trace_kvm_mmu_paging_element(pte, walker->level);
- if (!is_present_gpte(pte))
- goto not_present;
- --walker->level;
- }
-#endif
- ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) ||
- (vcpu->arch.cr3 & CR3_NONPAE_RESERVED_BITS) == 0);
-
- pt_access = ACC_ALL;
-
- for (;;) {
- index = PT_INDEX(addr, walker->level);
-
- table_gfn = gpte_to_gfn(pte);
- pte_gpa = gfn_to_gpa(table_gfn);
- pte_gpa += index * sizeof(pt_element_t);
- walker->table_gfn[walker->level - 1] = table_gfn;
- walker->pte_gpa[walker->level - 1] = pte_gpa;
-
- if (kvm_read_guest(vcpu->kvm, pte_gpa, &pte, sizeof(pte)))
- goto not_present;
-
- trace_kvm_mmu_paging_element(pte, walker->level);
-
- if (!is_present_gpte(pte))
- goto not_present;
-
- rsvd_fault = is_rsvd_bits_set(vcpu, pte, walker->level);
- if (rsvd_fault)
- goto access_error;
-
- if (write_fault && !is_writable_pte(pte))
- if (user_fault || is_write_protection(vcpu))
- goto access_error;
-
- if (user_fault && !(pte & PT_USER_MASK))
- goto access_error;
-
-#if PTTYPE == 64
- if (fetch_fault && (pte & PT64_NX_MASK))
- goto access_error;
-#endif
-
- if (!(pte & PT_ACCESSED_MASK)) {
- trace_kvm_mmu_set_accessed_bit(table_gfn, index,
- sizeof(pte));
- mark_page_dirty(vcpu->kvm, table_gfn);
- if (FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn,
- index, pte, pte|PT_ACCESSED_MASK))
- goto walk;
- pte |= PT_ACCESSED_MASK;
- }
-
- pte_access = pt_access & FNAME(gpte_access)(vcpu, pte);
-
- walker->ptes[walker->level - 1] = pte;
-
- if ((walker->level == PT_PAGE_TABLE_LEVEL) ||
- ((walker->level == PT_DIRECTORY_LEVEL) &&
- is_large_pte(pte) &&
- (PTTYPE == 64 || is_pse(vcpu))) ||
- ((walker->level == PT_PDPE_LEVEL) &&
- is_large_pte(pte) &&
- is_long_mode(vcpu))) {
- int lvl = walker->level;
-
- walker->gfn = gpte_to_gfn_lvl(pte, lvl);
- walker->gfn += (addr & PT_LVL_OFFSET_MASK(lvl))
- >> PAGE_SHIFT;
-
- if (PTTYPE == 32 &&
- walker->level == PT_DIRECTORY_LEVEL &&
- is_cpuid_PSE36())
- walker->gfn += pse36_gfn_delta(pte);
-
- break;
- }
-
- pt_access = pte_access;
- --walker->level;
- }
-
- if (write_fault && !is_dirty_gpte(pte)) {
- bool ret;
-
- trace_kvm_mmu_set_dirty_bit(table_gfn, index, sizeof(pte));
- mark_page_dirty(vcpu->kvm, table_gfn);
- ret = FNAME(cmpxchg_gpte)(vcpu->kvm, table_gfn, index, pte,
- pte|PT_DIRTY_MASK);
- if (ret)
- goto walk;
- pte |= PT_DIRTY_MASK;
- walker->ptes[walker->level - 1] = pte;
- }
-
- walker->pt_access = pt_access;
- walker->pte_access = pte_access;
- pgprintk("%s: pte %llx pte_access %x pt_access %x\n",
- __func__, (u64)pte, pt_access, pte_access);
- return 1;
-
-not_present:
- walker->error_code = 0;
- goto err;
-
-access_error:
- walker->error_code = PFERR_PRESENT_MASK;
-
-err:
- if (write_fault)
- walker->error_code |= PFERR_WRITE_MASK;
- if (user_fault)
- walker->error_code |= PFERR_USER_MASK;
- if (fetch_fault)
- walker->error_code |= PFERR_FETCH_MASK;
- if (rsvd_fault)
- walker->error_code |= PFERR_RSVD_MASK;
- trace_kvm_mmu_walker_error(walker->error_code);
- return 0;
-}
-
-static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *page,
- u64 *spte, const void *pte)
-{
- pt_element_t gpte;
- unsigned pte_access;
- pfn_t pfn;
- u64 new_spte;
-
- gpte = *(const pt_element_t *)pte;
- if (~gpte & (PT_PRESENT_MASK | PT_ACCESSED_MASK)) {
- if (!is_present_gpte(gpte)) {
- if (page->unsync)
- new_spte = shadow_trap_nonpresent_pte;
- else
- new_spte = shadow_notrap_nonpresent_pte;
- __set_spte(spte, new_spte);
- }
- return;
- }
- pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
- pte_access = page->role.access & FNAME(gpte_access)(vcpu, gpte);
- if (gpte_to_gfn(gpte) != vcpu->arch.update_pte.gfn)
- return;
- pfn = vcpu->arch.update_pte.pfn;
- if (is_error_pfn(pfn))
- return;
- if (mmu_notifier_retry(vcpu, vcpu->arch.update_pte.mmu_seq))
- return;
- kvm_get_pfn(pfn);
- /*
- * We call mmu_set_spte() with reset_host_protection = true because
- * vcpu->arch.update_pte.pfn was fetched from get_user_pages(write = 1).
- */
- mmu_set_spte(vcpu, spte, page->role.access, pte_access, 0, 0,
- gpte & PT_DIRTY_MASK, NULL, PT_PAGE_TABLE_LEVEL,
- gpte_to_gfn(gpte), pfn, true, true);
-}
-
-/*
- * Fetch a shadow pte for a specific level in the paging hierarchy.
- */
-static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
- struct guest_walker *gw,
- int user_fault, int write_fault, int hlevel,
- int *ptwrite, pfn_t pfn)
-{
- unsigned access = gw->pt_access;
- struct kvm_mmu_page *shadow_page;
- u64 spte, *sptep = NULL;
- int direct;
- gfn_t table_gfn;
- int r;
- int level;
- pt_element_t curr_pte;
- struct kvm_shadow_walk_iterator iterator;
-
- if (!is_present_gpte(gw->ptes[gw->level - 1]))
- return NULL;
-
- for_each_shadow_entry(vcpu, addr, iterator) {
- level = iterator.level;
- sptep = iterator.sptep;
- if (iterator.level == hlevel) {
- mmu_set_spte(vcpu, sptep, access,
- gw->pte_access & access,
- user_fault, write_fault,
- gw->ptes[gw->level-1] & PT_DIRTY_MASK,
- ptwrite, level,
- gw->gfn, pfn, false, true);
- break;
- }
-
- if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
- continue;
-
- if (is_large_pte(*sptep)) {
- rmap_remove(vcpu->kvm, sptep);
- __set_spte(sptep, shadow_trap_nonpresent_pte);
- kvm_flush_remote_tlbs(vcpu->kvm);
- }
-
- if (level <= gw->level) {
- int delta = level - gw->level + 1;
- direct = 1;
- if (!is_dirty_gpte(gw->ptes[level - delta]))
- access &= ~ACC_WRITE_MASK;
- table_gfn = gpte_to_gfn(gw->ptes[level - delta]);
- /* advance table_gfn when emulating 1gb pages with 4k */
- if (delta == 0)
- table_gfn += PT_INDEX(addr, level);
- access &= gw->pte_access;
- } else {
- direct = 0;
- table_gfn = gw->table_gfn[level - 2];
- }
- shadow_page = kvm_mmu_get_page(vcpu, table_gfn, addr, level-1,
- direct, access, sptep);
- if (!direct) {
- r = kvm_read_guest_atomic(vcpu->kvm,
- gw->pte_gpa[level - 2],
- &curr_pte, sizeof(curr_pte));
- if (r || curr_pte != gw->ptes[level - 2]) {
- kvm_mmu_put_page(shadow_page, sptep);
- kvm_release_pfn_clean(pfn);
- sptep = NULL;
- break;
- }
- }
-
- spte = __pa(shadow_page->spt)
- | PT_PRESENT_MASK | PT_ACCESSED_MASK
- | PT_WRITABLE_MASK | PT_USER_MASK;
- *sptep = spte;
- }
-
- return sptep;
-}
-
-/*
- * Page fault handler. There are several causes for a page fault:
- * - there is no shadow pte for the guest pte
- * - write access through a shadow pte marked read only so that we can set
- * the dirty bit
- * - write access to a shadow pte marked read only so we can update the page
- * dirty bitmap, when userspace requests it
- * - mmio access; in this case we will never install a present shadow pte
- * - normal guest page fault due to the guest pte marked not present, not
- * writable, or not executable
- *
- * Returns: 1 if we need to emulate the instruction, 0 otherwise, or
- * a negative value on error.
- */
-static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr,
- u32 error_code)
-{
- int write_fault = error_code & PFERR_WRITE_MASK;
- int user_fault = error_code & PFERR_USER_MASK;
- int fetch_fault = error_code & PFERR_FETCH_MASK;
- struct guest_walker walker;
- u64 *sptep;
- int write_pt = 0;
- int r;
- pfn_t pfn;
- int level = PT_PAGE_TABLE_LEVEL;
- unsigned long mmu_seq;
-
- pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
- kvm_mmu_audit(vcpu, "pre page fault");
-
- r = mmu_topup_memory_caches(vcpu);
- if (r)
- return r;
-
- /*
- * Look up the guest pte for the faulting address.
- */
- r = FNAME(walk_addr)(&walker, vcpu, addr, write_fault, user_fault,
- fetch_fault);
-
- /*
- * The page is not mapped by the guest. Let the guest handle it.
- */
- if (!r) {
- pgprintk("%s: guest page fault\n", __func__);
- inject_page_fault(vcpu, addr, walker.error_code);
- vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
- return 0;
- }
-
- if (walker.level >= PT_DIRECTORY_LEVEL) {
- level = min(walker.level, mapping_level(vcpu, walker.gfn));
- walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
- }
-
- mmu_seq = vcpu->kvm->mmu_notifier_seq;
- smp_rmb();
- pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
-
- /* mmio */
- if (is_error_pfn(pfn)) {
- pgprintk("gfn %lx is mmio\n", walker.gfn);
- kvm_release_pfn_clean(pfn);
- return 1;
- }
-
- spin_lock(&vcpu->kvm->mmu_lock);
- if (mmu_notifier_retry(vcpu, mmu_seq))
- goto out_unlock;
- kvm_mmu_free_some_pages(vcpu);
- sptep = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault,
- level, &write_pt, pfn);
- pgprintk("%s: shadow pte %p %llx ptwrite %d\n", __func__,
- sptep, *sptep, write_pt);
-
- if (!write_pt)
- vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
-
- ++vcpu->stat.pf_fixed;
- kvm_mmu_audit(vcpu, "post page fault (fixed)");
- spin_unlock(&vcpu->kvm->mmu_lock);
-
- return write_pt;
-
-out_unlock:
- spin_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(pfn);
- return 0;
-}
-
-static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
-{
- struct kvm_shadow_walk_iterator iterator;
- gpa_t pte_gpa = -1;
- int level;
- u64 *sptep;
- int need_flush = 0;
-
- spin_lock(&vcpu->kvm->mmu_lock);
-
- for_each_shadow_entry(vcpu, gva, iterator) {
- level = iterator.level;
- sptep = iterator.sptep;
-
- if (is_last_spte(*sptep, level)) {
- struct kvm_mmu_page *sp = page_header(__pa(sptep));
- int offset, shift;
-
- shift = PAGE_SHIFT -
- (PT_LEVEL_BITS - PT64_LEVEL_BITS) * level;
- offset = sp->role.quadrant << shift;
-
- pte_gpa = (sp->gfn << PAGE_SHIFT) + offset;
- pte_gpa += (sptep - sp->spt) * sizeof(pt_element_t);
-
- if (is_shadow_present_pte(*sptep)) {
- rmap_remove(vcpu->kvm, sptep);
- if (is_large_pte(*sptep))
- --vcpu->kvm->stat.lpages;
- need_flush = 1;
- }
- __set_spte(sptep, shadow_trap_nonpresent_pte);
- break;
- }
-
- if (!is_shadow_present_pte(*sptep))
- break;
- }
-
- if (need_flush)
- kvm_flush_remote_tlbs(vcpu->kvm);
-
- atomic_inc(&vcpu->kvm->arch.invlpg_counter);
-
- spin_unlock(&vcpu->kvm->mmu_lock);
-
- if (pte_gpa == -1)
- return;
-
- if (mmu_topup_memory_caches(vcpu))
- return;
- kvm_mmu_pte_write(vcpu, pte_gpa, NULL, sizeof(pt_element_t), 0);
-}
-
-static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
- u32 *error)
-{
- struct guest_walker walker;
- gpa_t gpa = UNMAPPED_GVA;
- int r;
-
- r = FNAME(walk_addr)(&walker, vcpu, vaddr,
- !!(access & PFERR_WRITE_MASK),
- !!(access & PFERR_USER_MASK),
- !!(access & PFERR_FETCH_MASK));
-
- if (r) {
- gpa = gfn_to_gpa(walker.gfn);
- gpa |= vaddr & ~PAGE_MASK;
- } else if (error)
- *error = walker.error_code;
-
- return gpa;
-}
-
-static void FNAME(prefetch_page)(struct kvm_vcpu *vcpu,
- struct kvm_mmu_page *sp)
-{
- int i, j, offset, r;
- pt_element_t pt[256 / sizeof(pt_element_t)];
- gpa_t pte_gpa;
-
- if (sp->role.direct
- || (PTTYPE == 32 && sp->role.level > PT_PAGE_TABLE_LEVEL)) {
- nonpaging_prefetch_page(vcpu, sp);
- return;
- }
-
- pte_gpa = gfn_to_gpa(sp->gfn);
- if (PTTYPE == 32) {
- offset = sp->role.quadrant << PT64_LEVEL_BITS;
- pte_gpa += offset * sizeof(pt_element_t);
- }
-
- for (i = 0; i < PT64_ENT_PER_PAGE; i += ARRAY_SIZE(pt)) {
- r = kvm_read_guest_atomic(vcpu->kvm, pte_gpa, pt, sizeof pt);
- pte_gpa += ARRAY_SIZE(pt) * sizeof(pt_element_t);
- for (j = 0; j < ARRAY_SIZE(pt); ++j)
- if (r || is_present_gpte(pt[j]))
- sp->spt[i+j] = shadow_trap_nonpresent_pte;
- else
- sp->spt[i+j] = shadow_notrap_nonpresent_pte;
- }
-}
-
-/*
- * Using the cached information from sp->gfns is safe because:
- * - The spte has a reference to the struct page, so the pfn for a given gfn
- * can't change unless all sptes pointing to it are nuked first.
- * - Alias changes zap the entire shadow cache.
- */
-static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
-{
- int i, offset, nr_present;
- bool reset_host_protection;
- gpa_t first_pte_gpa;
-
- offset = nr_present = 0;
-
- if (PTTYPE == 32)
- offset = sp->role.quadrant << PT64_LEVEL_BITS;
-
- first_pte_gpa = gfn_to_gpa(sp->gfn) + offset * sizeof(pt_element_t);
-
- for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
- unsigned pte_access;
- pt_element_t gpte;
- gpa_t pte_gpa;
- gfn_t gfn = sp->gfns[i];
-
- if (!is_shadow_present_pte(sp->spt[i]))
- continue;
-
- pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
-
- if (kvm_read_guest_atomic(vcpu->kvm, pte_gpa, &gpte,
- sizeof(pt_element_t)))
- return -EINVAL;
-
- if (gpte_to_gfn(gpte) != gfn || !is_present_gpte(gpte) ||
- !(gpte & PT_ACCESSED_MASK)) {
- u64 nonpresent;
-
- rmap_remove(vcpu->kvm, &sp->spt[i]);
- if (is_present_gpte(gpte))
- nonpresent = shadow_trap_nonpresent_pte;
- else
- nonpresent = shadow_notrap_nonpresent_pte;
- __set_spte(&sp->spt[i], nonpresent);
- continue;
- }
-
- nr_present++;
- pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
- if (!(sp->spt[i] & SPTE_HOST_WRITEABLE)) {
- pte_access &= ~ACC_WRITE_MASK;
- reset_host_protection = 0;
- } else {
- reset_host_protection = 1;
- }
- set_spte(vcpu, &sp->spt[i], pte_access, 0, 0,
- is_dirty_gpte(gpte), PT_PAGE_TABLE_LEVEL, gfn,
- spte_to_pfn(sp->spt[i]), true, false,
- reset_host_protection);
- }
-
- return !nr_present;
-}
-
-#undef pt_element_t
-#undef guest_walker
-#undef FNAME
-#undef PT_BASE_ADDR_MASK
-#undef PT_INDEX
-#undef PT_LEVEL_MASK
-#undef PT_LVL_ADDR_MASK
-#undef PT_LVL_OFFSET_MASK
-#undef PT_LEVEL_BITS
-#undef PT_MAX_FULL_LEVELS
-#undef gpte_to_gfn
-#undef gpte_to_gfn_lvl
-#undef CMPXCHG
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
new file mode 100644
index 000000000000..487ad19a236e
--- /dev/null
+++ b/arch/x86/kvm/pmu.c
@@ -0,0 +1,1150 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Kernel-based Virtual Machine -- Performance Monitoring Unit support
+ *
+ * Copyright 2015 Red Hat, Inc. and/or its affiliates.
+ *
+ * Authors:
+ * Avi Kivity <avi@redhat.com>
+ * Gleb Natapov <gleb@redhat.com>
+ * Wei Huang <wei@redhat.com>
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/types.h>
+#include <linux/kvm_host.h>
+#include <linux/perf_event.h>
+#include <linux/bsearch.h>
+#include <linux/sort.h>
+#include <asm/perf_event.h>
+#include <asm/cpu_device_id.h>
+#include "x86.h"
+#include "cpuid.h"
+#include "lapic.h"
+#include "pmu.h"
+
+/* This is enough to filter the vast majority of currently defined events. */
+#define KVM_PMU_EVENT_FILTER_MAX_EVENTS 300
+
+/* Unadulterated PMU capabilities of the host, i.e. of hardware. */
+static struct x86_pmu_capability __read_mostly kvm_host_pmu;
+
+/* KVM's PMU capabilities, i.e. the intersection of KVM and hardware support. */
+struct x86_pmu_capability __read_mostly kvm_pmu_cap;
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_pmu_cap);
+
+struct kvm_pmu_emulated_event_selectors {
+ u64 INSTRUCTIONS_RETIRED;
+ u64 BRANCH_INSTRUCTIONS_RETIRED;
+};
+static struct kvm_pmu_emulated_event_selectors __read_mostly kvm_pmu_eventsel;
+
+/* Precise Distribution of Instructions Retired (PDIR) */
+static const struct x86_cpu_id vmx_pebs_pdir_cpu[] = {
+ X86_MATCH_VFM(INTEL_ICELAKE_D, NULL),
+ X86_MATCH_VFM(INTEL_ICELAKE_X, NULL),
+ /* Instruction-Accurate PDIR (PDIR++) */
+ X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X, NULL),
+ {}
+};
+
+/* Precise Distribution (PDist) */
+static const struct x86_cpu_id vmx_pebs_pdist_cpu[] = {
+ X86_MATCH_VFM(INTEL_SAPPHIRERAPIDS_X, NULL),
+ {}
+};
+
+/* NOTE:
+ * - Each perf counter is defined as "struct kvm_pmc";
+ * - There are two types of perf counters: general purpose (gp) and fixed.
+ * gp counters are stored in gp_counters[] and fixed counters are stored
+ * in fixed_counters[] respectively. Both of them are part of "struct
+ * kvm_pmu";
+ * - pmu.c understands the difference between gp counters and fixed counters.
+ * However, AMD doesn't support fixed counters;
+ * - There are three types of index to access perf counters (PMC):
+ * 1. MSR (named msr): For example Intel has MSR_IA32_PERFCTRn and AMD
+ * has MSR_K7_PERFCTRn and, for families 15H and later,
+ * MSR_F15H_PERF_CTRn, where MSR_F15H_PERF_CTR[0-3] are
+ * aliased to MSR_K7_PERFCTRn.
+ * 2. MSR Index (named idx): This is normally used by the RDPMC
+ * instruction. For instance, AMD's RDPMC instruction uses 0000_0003h
+ * in ECX to access C001_0007h (MSR_K7_PERFCTR3). Intel has a similar
+ * mechanism, except that it also supports fixed counters. idx can be
+ * used as an index into the gp and fixed counters.
+ * 3. Global PMC Index (named pmc): pmc is an index specific to PMU
+ * code. Each pmc, stored in kvm_pmc.idx field, is unique across
+ * all perf counters (both gp and fixed). The mapping relationship
+ * between pmc and perf counters is as the following:
+ * * Intel: [0 .. KVM_MAX_NR_INTEL_GP_COUNTERS-1] <=> gp counters
+ * [KVM_FIXED_PMC_BASE_IDX .. KVM_FIXED_PMC_BASE_IDX + 2] <=> fixed
+ * * AMD: [0 .. AMD64_NUM_COUNTERS-1] and, for families 15H
+ * and later, [0 .. AMD64_NUM_COUNTERS_CORE-1] <=> gp counters
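+ * For example, on Intel, fixed counter 1 is tracked with a global PMC
+ * index of KVM_FIXED_PMC_BASE_IDX + 1 == 33, while the RDPMC idx a guest
+ * uses to read it is (1u << 30) | 1, bit 30 in ECX selecting the
+ * fixed-counter range.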
+ */
+
+static struct kvm_pmu_ops kvm_pmu_ops __read_mostly;
+
+#define KVM_X86_PMU_OP(func) \
+ DEFINE_STATIC_CALL_NULL(kvm_x86_pmu_##func, \
+ *(((struct kvm_pmu_ops *)0)->func));
+#define KVM_X86_PMU_OP_OPTIONAL KVM_X86_PMU_OP
+#include <asm/kvm-x86-pmu-ops.h>
+
+void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops)
+{
+ memcpy(&kvm_pmu_ops, pmu_ops, sizeof(kvm_pmu_ops));
+
+#define __KVM_X86_PMU_OP(func) \
+ static_call_update(kvm_x86_pmu_##func, kvm_pmu_ops.func);
+#define KVM_X86_PMU_OP(func) \
+ WARN_ON(!kvm_pmu_ops.func); __KVM_X86_PMU_OP(func)
+#define KVM_X86_PMU_OP_OPTIONAL __KVM_X86_PMU_OP
+#include <asm/kvm-x86-pmu-ops.h>
+#undef __KVM_X86_PMU_OP
+}
+
+void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
+{
+ bool is_intel = boot_cpu_data.x86_vendor == X86_VENDOR_INTEL;
+ int min_nr_gp_ctrs = pmu_ops->MIN_NR_GP_COUNTERS;
+
+ /*
+ * Hybrid PMUs don't play nice with virtualization without careful
+ * configuration by userspace, and KVM's APIs for reporting supported
+ * vPMU features do not account for hybrid PMUs. Disable vPMU support
+ * for hybrid PMUs until KVM gains a way to let userspace opt-in.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_HYBRID_CPU)) {
+ enable_pmu = false;
+ memset(&kvm_host_pmu, 0, sizeof(kvm_host_pmu));
+ } else {
+ perf_get_x86_pmu_capability(&kvm_host_pmu);
+ }
+
+ if (enable_pmu) {
+ /*
+ * WARN if perf did NOT disable the hardware PMU even though the
+ * architecturally required number of GP counters isn't present,
+ * i.e. if there is a non-zero number of counters, but fewer than
+ * what is architecturally required.
+ */
+ if (!kvm_host_pmu.num_counters_gp ||
+ WARN_ON_ONCE(kvm_host_pmu.num_counters_gp < min_nr_gp_ctrs))
+ enable_pmu = false;
+ else if (is_intel && !kvm_host_pmu.version)
+ enable_pmu = false;
+ }
+
+ if (!enable_pmu) {
+ memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
+ return;
+ }
+
+ memcpy(&kvm_pmu_cap, &kvm_host_pmu, sizeof(kvm_host_pmu));
+ kvm_pmu_cap.version = min(kvm_pmu_cap.version, 2);
+ kvm_pmu_cap.num_counters_gp = min(kvm_pmu_cap.num_counters_gp,
+ pmu_ops->MAX_NR_GP_COUNTERS);
+ kvm_pmu_cap.num_counters_fixed = min(kvm_pmu_cap.num_counters_fixed,
+ KVM_MAX_NR_FIXED_COUNTERS);
+
+ kvm_pmu_eventsel.INSTRUCTIONS_RETIRED =
+ perf_get_hw_event_config(PERF_COUNT_HW_INSTRUCTIONS);
+ kvm_pmu_eventsel.BRANCH_INSTRUCTIONS_RETIRED =
+ perf_get_hw_event_config(PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
+}
+
+static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+ bool skip_pmi = false;
+
+ if (pmc->perf_event && pmc->perf_event->attr.precise_ip) {
+ if (!in_pmi) {
+ /*
+ * TODO: KVM is currently _choosing_ to not generate records
+ * for emulated instructions, avoiding BUFFER_OVF PMI when
+ * there are no records. Strictly speaking, records should also
+ * be generated in the right context to improve sampling accuracy.
+ */
+ skip_pmi = true;
+ } else {
+ /* Indicate PEBS overflow PMI to guest. */
+ skip_pmi = __test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
+ (unsigned long *)&pmu->global_status);
+ }
+ } else {
+ __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+ }
+
+ if (pmc->intr && !skip_pmi)
+ kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
+}
+
+static void kvm_perf_overflow(struct perf_event *perf_event,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct kvm_pmc *pmc = perf_event->overflow_handler_context;
+
+ /*
+ * Ignore asynchronous overflow events for counters that are scheduled
+ * to be reprogrammed, e.g. if a PMI for the previous event races with
+ * KVM's handling of a related guest WRMSR.
+ */
+ if (test_and_set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi))
+ return;
+
+ __kvm_perf_overflow(pmc, true);
+
+ kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
+}
+
+static u64 pmc_get_pebs_precise_level(struct kvm_pmc *pmc)
+{
+ /*
+ * For some model-specific PEBS counters with special capabilities
+ * (PDIR, PDIR++, PDIST), KVM needs to raise the event's precise
+ * level to the maximum value (currently 3, backwards compatible)
+ * so that the perf subsystem assigns a hardware counter with that
+ * capability to the vPMC.
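+ * Note, pmc->idx 0 is GP counter 0 (the PDIST-capable counter on SPR)
+ * and idx 32 is KVM_FIXED_PMC_BASE_IDX, i.e. fixed counter 0 (the
+ * PDIR-capable counter on the CPUs matched above).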
+ */
+ if ((pmc->idx == 0 && x86_match_cpu(vmx_pebs_pdist_cpu)) ||
+ (pmc->idx == 32 && x86_match_cpu(vmx_pebs_pdir_cpu)))
+ return 3;
+
+ /*
+ * A non-zero precision level turns an ordinary guest event into a
+ * guest PEBS event, and triggers the host PEBS PMI handler to
+ * determine whether the PEBS overflow PMI comes from the host
+ * counters or the guest.
+ */
+ return 1;
+}
+
+static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
+{
+ u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
+
+ if (!sample_period)
+ sample_period = pmc_bitmask(pmc) + 1;
+ return sample_period;
+}
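+
+/*
+ * E.g. for a 48-bit counter holding 0xffffffffff00, the sample period is
+ * (-0xffffffffff00) & 0xffffffffffff = 0x100, so the perf event fires
+ * exactly when the guest's counter would roll over; a counter value of 0
+ * yields a full period of pmc_bitmask(pmc) + 1.
+ */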
+
+static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
+ bool exclude_user, bool exclude_kernel,
+ bool intr)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+ struct perf_event *event;
+ struct perf_event_attr attr = {
+ .type = type,
+ .size = sizeof(attr),
+ .pinned = true,
+ .exclude_idle = true,
+ .exclude_host = 1,
+ .exclude_user = exclude_user,
+ .exclude_kernel = exclude_kernel,
+ .config = config,
+ };
+ bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
+
+ attr.sample_period = get_sample_period(pmc, pmc->counter);
+
+ if ((attr.config & HSW_IN_TX_CHECKPOINTED) &&
+ (boot_cpu_has(X86_FEATURE_RTM) || boot_cpu_has(X86_FEATURE_HLE))) {
+ /*
+ * HSW_IN_TX_CHECKPOINTED is not supported with nonzero
+ * period. Just clear the sample period so at least
+ * allocating the counter doesn't fail.
+ */
+ attr.sample_period = 0;
+ }
+ if (pebs) {
+ /*
+ * For most PEBS hardware events, the difference in the software
+ * precision levels of guest and host PEBS events will not affect
+ * the accuracy of the PEBS profiling result, because the "event IP"
+ * in the PEBS record is calibrated on the guest side.
+ */
+ attr.precise_ip = pmc_get_pebs_precise_level(pmc);
+ }
+
+ event = perf_event_create_kernel_counter(&attr, -1, current,
+ kvm_perf_overflow, pmc);
+ if (IS_ERR(event)) {
+ pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
+ PTR_ERR(event), pmc->idx);
+ return PTR_ERR(event);
+ }
+
+ pmc->perf_event = event;
+ pmc_to_pmu(pmc)->event_count++;
+ pmc->is_paused = false;
+ pmc->intr = intr || pebs;
+ return 0;
+}
+
+static bool pmc_pause_counter(struct kvm_pmc *pmc)
+{
+ u64 counter = pmc->counter;
+ u64 prev_counter;
+
+ /* update counter, reset event value to avoid redundant accumulation */
+ if (pmc->perf_event && !pmc->is_paused)
+ counter += perf_event_pause(pmc->perf_event, true);
+
+ /*
+ * Snapshot the previous counter *after* accumulating state from perf.
+ * If overflow already happened, hardware (via perf) is responsible for
+ * generating a PMI. KVM just needs to detect overflow on emulated
+ * counter events that haven't yet been processed.
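+ * E.g. a 48-bit counter at 0xffffffffffff plus one emulated count wraps
+ * to 0, i.e. below the snapshot, and the caller then injects the
+ * overflow that hardware never saw.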
+ */
+ prev_counter = counter & pmc_bitmask(pmc);
+
+ counter += pmc->emulated_counter;
+ pmc->counter = counter & pmc_bitmask(pmc);
+
+ pmc->emulated_counter = 0;
+ pmc->is_paused = true;
+
+ return pmc->counter < prev_counter;
+}
+
+static bool pmc_resume_counter(struct kvm_pmc *pmc)
+{
+ if (!pmc->perf_event)
+ return false;
+
+ /* recalibrate sample period and check if it's accepted by perf core */
+ if (is_sampling_event(pmc->perf_event) &&
+ perf_event_period(pmc->perf_event,
+ get_sample_period(pmc, pmc->counter)))
+ return false;
+
+ if (test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) !=
+ (!!pmc->perf_event->attr.precise_ip))
+ return false;
+
+ /* Reuse the perf_event, as pmc_reprogram_counter() would. */
+ perf_event_enable(pmc->perf_event);
+ pmc->is_paused = false;
+
+ return true;
+}
+
+static void pmc_release_perf_event(struct kvm_pmc *pmc)
+{
+ if (pmc->perf_event) {
+ perf_event_release_kernel(pmc->perf_event);
+ pmc->perf_event = NULL;
+ pmc->current_config = 0;
+ pmc_to_pmu(pmc)->event_count--;
+ }
+}
+
+static void pmc_stop_counter(struct kvm_pmc *pmc)
+{
+ if (pmc->perf_event) {
+ pmc->counter = pmc_read_counter(pmc);
+ pmc_release_perf_event(pmc);
+ }
+}
+
+static void pmc_update_sample_period(struct kvm_pmc *pmc)
+{
+ if (!pmc->perf_event || pmc->is_paused ||
+ !is_sampling_event(pmc->perf_event))
+ return;
+
+ perf_event_period(pmc->perf_event,
+ get_sample_period(pmc, pmc->counter));
+}
+
+void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
+{
+ /*
+ * Drop any unconsumed accumulated counts, the WRMSR is a write, not a
+ * read-modify-write. Adjust the counter value so that its value is
+ * relative to the current count, as reading the current count from
+ * perf is faster than pausing and reprogramming the event in order to
+ * reset it to '0'. Note, this very sneakily offsets the accumulated
+ * emulated count too, by using pmc_read_counter()!
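+ * E.g. after a guest WRMSR of 0, pmc_read_counter() returns 0 (modulo
+ * the counter width) until the perf event counts further, even though
+ * the event may have accumulated ticks never folded into pmc->counter.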
+ */
+ pmc->emulated_counter = 0;
+ pmc->counter += val - pmc_read_counter(pmc);
+ pmc->counter &= pmc_bitmask(pmc);
+ pmc_update_sample_period(pmc);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(pmc_write_counter);
+
+static int filter_cmp(const void *pa, const void *pb, u64 mask)
+{
+ u64 a = *(u64 *)pa & mask;
+ u64 b = *(u64 *)pb & mask;
+
+ return (a > b) - (a < b);
+}
+
+static int filter_sort_cmp(const void *pa, const void *pb)
+{
+ return filter_cmp(pa, pb, (KVM_PMU_MASKED_ENTRY_EVENT_SELECT |
+ KVM_PMU_MASKED_ENTRY_EXCLUDE));
+}
+
+/*
+ * For the event filter, searching is done on the 'includes' list and
+ * 'excludes' list separately rather than on the 'events' list (which
+ * has both). As a result, the exclude bit can be ignored.
+ */
+static int filter_event_cmp(const void *pa, const void *pb)
+{
+ return filter_cmp(pa, pb, (KVM_PMU_MASKED_ENTRY_EVENT_SELECT));
+}
+
+static int find_filter_index(u64 *events, u64 nevents, u64 key)
+{
+ u64 *fe = bsearch(&key, events, nevents, sizeof(events[0]),
+ filter_event_cmp);
+
+ if (!fe)
+ return -1;
+
+ return fe - events;
+}
+
+static bool is_filter_entry_match(u64 filter_event, u64 umask)
+{
+ u64 mask = filter_event >> (KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT - 8);
+ u64 match = filter_event & KVM_PMU_MASKED_ENTRY_UMASK_MATCH;
+
+ BUILD_BUG_ON((KVM_PMU_ENCODE_MASKED_ENTRY(0, 0xff, 0, false) >>
+ (KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT - 8)) !=
+ ARCH_PERFMON_EVENTSEL_UMASK);
+
+ return (umask & mask) == match;
+}
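+
+/*
+ * E.g. in is_filter_entry_match(), a masked entry with umask_mask == 0xff
+ * and umask_match == 0xc0 matches only a guest unit mask of 0xc0, while
+ * umask_mask == 0 and umask_match == 0 match any unit mask; the
+ * BUILD_BUG_ON() verifies the shift lines the mask up with the unit mask
+ * field (bits 15:8) of the guest's event select.
+ */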
+
+static bool filter_contains_match(u64 *events, u64 nevents, u64 eventsel)
+{
+ u64 event_select = eventsel & kvm_pmu_ops.EVENTSEL_EVENT;
+ u64 umask = eventsel & ARCH_PERFMON_EVENTSEL_UMASK;
+ int i, index;
+
+ index = find_filter_index(events, nevents, event_select);
+ if (index < 0)
+ return false;
+
+ /*
+ * Entries are sorted by the event select. Walk the list in both
+ * directions to process all entries with the targeted event select.
+ */
+ for (i = index; i < nevents; i++) {
+ if (filter_event_cmp(&events[i], &event_select))
+ break;
+
+ if (is_filter_entry_match(events[i], umask))
+ return true;
+ }
+
+ for (i = index - 1; i >= 0; i--) {
+ if (filter_event_cmp(&events[i], &event_select))
+ break;
+
+ if (is_filter_entry_match(events[i], umask))
+ return true;
+ }
+
+ return false;
+}
+
+static bool is_gp_event_allowed(struct kvm_x86_pmu_event_filter *f,
+ u64 eventsel)
+{
+ if (filter_contains_match(f->includes, f->nr_includes, eventsel) &&
+ !filter_contains_match(f->excludes, f->nr_excludes, eventsel))
+ return f->action == KVM_PMU_EVENT_ALLOW;
+
+ return f->action == KVM_PMU_EVENT_DENY;
+}
+
+static bool is_fixed_event_allowed(struct kvm_x86_pmu_event_filter *filter,
+ int idx)
+{
+ int fixed_idx = idx - KVM_FIXED_PMC_BASE_IDX;
+
+ if (filter->action == KVM_PMU_EVENT_DENY &&
+ test_bit(fixed_idx, (ulong *)&filter->fixed_counter_bitmap))
+ return false;
+ if (filter->action == KVM_PMU_EVENT_ALLOW &&
+ !test_bit(fixed_idx, (ulong *)&filter->fixed_counter_bitmap))
+ return false;
+
+ return true;
+}
+
+static bool pmc_is_event_allowed(struct kvm_pmc *pmc)
+{
+ struct kvm_x86_pmu_event_filter *filter;
+ struct kvm *kvm = pmc->vcpu->kvm;
+
+ filter = srcu_dereference(kvm->arch.pmu_event_filter, &kvm->srcu);
+ if (!filter)
+ return true;
+
+ if (pmc_is_gp(pmc))
+ return is_gp_event_allowed(filter, pmc->eventsel);
+
+ return is_fixed_event_allowed(filter, pmc->idx);
+}
+
+static int reprogram_counter(struct kvm_pmc *pmc)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+ u64 eventsel = pmc->eventsel;
+ u64 new_config = eventsel;
+ bool emulate_overflow;
+ u8 fixed_ctr_ctrl;
+
+ emulate_overflow = pmc_pause_counter(pmc);
+
+ if (!pmc_is_globally_enabled(pmc) || !pmc_is_locally_enabled(pmc) ||
+ !pmc_is_event_allowed(pmc))
+ return 0;
+
+ if (emulate_overflow)
+ __kvm_perf_overflow(pmc, false);
+
+ if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
+ printk_once("kvm pmu: pin control bit is ignored\n");
+
+ if (pmc_is_fixed(pmc)) {
+ fixed_ctr_ctrl = fixed_ctrl_field(pmu->fixed_ctr_ctrl,
+ pmc->idx - KVM_FIXED_PMC_BASE_IDX);
+ if (fixed_ctr_ctrl & INTEL_FIXED_0_KERNEL)
+ eventsel |= ARCH_PERFMON_EVENTSEL_OS;
+ if (fixed_ctr_ctrl & INTEL_FIXED_0_USER)
+ eventsel |= ARCH_PERFMON_EVENTSEL_USR;
+ if (fixed_ctr_ctrl & INTEL_FIXED_0_ENABLE_PMI)
+ eventsel |= ARCH_PERFMON_EVENTSEL_INT;
+ new_config = (u64)fixed_ctr_ctrl;
+ }
+
+ if (pmc->current_config == new_config && pmc_resume_counter(pmc))
+ return 0;
+
+ pmc_release_perf_event(pmc);
+
+ pmc->current_config = new_config;
+
+ return pmc_reprogram_counter(pmc, PERF_TYPE_RAW,
+ (eventsel & pmu->raw_event_mask),
+ !(eventsel & ARCH_PERFMON_EVENTSEL_USR),
+ !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
+ eventsel & ARCH_PERFMON_EVENTSEL_INT);
+}
+
+static bool pmc_is_event_match(struct kvm_pmc *pmc, u64 eventsel)
+{
+ /*
+ * Ignore checks for edge detect (all events currently emulated by KVM
+ * are always rising edges), pin control (unsupported by modern CPUs),
+ * and counter mask and its invert flag (KVM doesn't emulate multiple
+ * events in a single clock cycle).
+ *
+ * Note, the uppermost nibble of AMD's mask overlaps Intel's IN_TX (bit
+ * 32) and IN_TXCP (bit 33), as well as two reserved bits (bits 35:34).
+ * Checking the "in HLE/RTM transaction" flags is correct as the vCPU
+ * can't be in a transaction if KVM is emulating an instruction.
+ *
+ * Checking the reserved bits might be wrong if they are defined in the
+ * future, but so could ignoring them, so do the simple thing for now.
+ */
+ return !((pmc->eventsel ^ eventsel) & AMD64_RAW_EVENT_MASK_NB);
+}
+
+void kvm_pmu_recalc_pmc_emulation(struct kvm_pmu *pmu, struct kvm_pmc *pmc)
+{
+ bitmap_clear(pmu->pmc_counting_instructions, pmc->idx, 1);
+ bitmap_clear(pmu->pmc_counting_branches, pmc->idx, 1);
+
+ /*
+ * Do NOT consult the PMU event filters, as the filters must be checked
+ * at the time of emulation to ensure KVM uses fresh information, e.g.
+ * omitting a PMC from a bitmap could result in a missed event if the
+ * filter is changed to allow counting the event.
+ */
+ if (!pmc_is_locally_enabled(pmc))
+ return;
+
+ if (pmc_is_event_match(pmc, kvm_pmu_eventsel.INSTRUCTIONS_RETIRED))
+ bitmap_set(pmu->pmc_counting_instructions, pmc->idx, 1);
+
+ if (pmc_is_event_match(pmc, kvm_pmu_eventsel.BRANCH_INSTRUCTIONS_RETIRED))
+ bitmap_set(pmu->pmc_counting_branches, pmc->idx, 1);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_pmu_recalc_pmc_emulation);
+
+void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
+{
+ DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ int bit;
+
+ bitmap_copy(bitmap, pmu->reprogram_pmi, X86_PMC_IDX_MAX);
+
+ /*
+ * The reprogramming bitmap can be written asynchronously by something
+ * other than the task that holds vcpu->mutex; take care to clear only
+ * the bits that will actually be processed.
+ */
+ BUILD_BUG_ON(sizeof(bitmap) != sizeof(atomic64_t));
+ atomic64_andnot(*(s64 *)bitmap, &pmu->__reprogram_pmi);
+
+ kvm_for_each_pmc(pmu, pmc, bit, bitmap) {
+ /*
+ * If reprogramming fails, e.g. due to contention, re-set the
+ * reprogram bit, i.e. opportunistically try again on the
+ * next PMU refresh. Don't make a new request as doing so can
+ * stall the guest if reprogramming repeatedly fails.
+ */
+ if (reprogram_counter(pmc))
+ set_bit(pmc->idx, pmu->reprogram_pmi);
+ }
+
+ /*
+ * Release unused perf_events if the corresponding guest MSRs weren't
+ * accessed during the last vCPU time slice (need_cleanup is set when
+ * the vCPU is scheduled back in).
+ */
+ if (unlikely(pmu->need_cleanup))
+ kvm_pmu_cleanup(vcpu);
+
+ kvm_for_each_pmc(pmu, pmc, bit, bitmap)
+ kvm_pmu_recalc_pmc_emulation(pmu, pmc);
+}
+
+int kvm_pmu_check_rdpmc_early(struct kvm_vcpu *vcpu, unsigned int idx)
+{
+ /*
+ * On Intel, VMX interception has priority over RDPMC exceptions that
+ * aren't already handled by the emulator, i.e. no additional
+ * checks are needed for Intel PMUs.
+ *
+ * On AMD, _all_ exceptions on RDPMC have priority over SVM intercepts,
+ * i.e. an invalid PMC results in a #GP, not #VMEXIT.
+ */
+ if (!kvm_pmu_ops.check_rdpmc_early)
+ return 0;
+
+ return kvm_pmu_call(check_rdpmc_early)(vcpu, idx);
+}
+
+bool is_vmware_backdoor_pmc(u32 pmc_idx)
+{
+ switch (pmc_idx) {
+ case VMWARE_BACKDOOR_PMC_HOST_TSC:
+ case VMWARE_BACKDOOR_PMC_REAL_TIME:
+ case VMWARE_BACKDOOR_PMC_APPARENT_TIME:
+ return true;
+ }
+ return false;
+}
+
+static int kvm_pmu_rdpmc_vmware(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
+{
+ u64 ctr_val;
+
+ switch (idx) {
+ case VMWARE_BACKDOOR_PMC_HOST_TSC:
+ ctr_val = rdtsc();
+ break;
+ case VMWARE_BACKDOOR_PMC_REAL_TIME:
+ ctr_val = ktime_get_boottime_ns();
+ break;
+ case VMWARE_BACKDOOR_PMC_APPARENT_TIME:
+ ctr_val = ktime_get_boottime_ns() +
+ vcpu->kvm->arch.kvmclock_offset;
+ break;
+ default:
+ return 1;
+ }
+
+ *data = ctr_val;
+ return 0;
+}
+
+int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ u64 mask = ~0ull;
+
+ if (!pmu->version)
+ return 1;
+
+ if (is_vmware_backdoor_pmc(idx))
+ return kvm_pmu_rdpmc_vmware(vcpu, idx, data);
+
+ pmc = kvm_pmu_call(rdpmc_ecx_to_pmc)(vcpu, idx, &mask);
+ if (!pmc)
+ return 1;
+
+ if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_PCE) &&
+ (kvm_x86_call(get_cpl)(vcpu) != 0) &&
+ kvm_is_cr0_bit_set(vcpu, X86_CR0_PE))
+ return 1;
+
+ *data = pmc_read_counter(pmc) & mask;
+ return 0;
+}
+
+void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
+{
+ if (lapic_in_kernel(vcpu)) {
+ kvm_pmu_call(deliver_pmi)(vcpu);
+ kvm_apic_local_deliver(vcpu->arch.apic, APIC_LVTPC);
+ }
+}
+
+bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
+{
+ switch (msr) {
+ case MSR_CORE_PERF_GLOBAL_STATUS:
+ case MSR_CORE_PERF_GLOBAL_CTRL:
+ case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ return kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu));
+ default:
+ break;
+ }
+ return kvm_pmu_call(msr_idx_to_pmc)(vcpu, msr) ||
+ kvm_pmu_call(is_valid_msr)(vcpu, msr);
+}
+
+static void kvm_pmu_mark_pmc_in_use(struct kvm_vcpu *vcpu, u32 msr)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = kvm_pmu_call(msr_idx_to_pmc)(vcpu, msr);
+
+ if (pmc)
+ __set_bit(pmc->idx, pmu->pmc_in_use);
+}
+
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ u32 msr = msr_info->index;
+
+ switch (msr) {
+ case MSR_CORE_PERF_GLOBAL_STATUS:
+ case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS:
+ msr_info->data = pmu->global_status;
+ break;
+ case MSR_AMD64_PERF_CNTR_GLOBAL_CTL:
+ case MSR_CORE_PERF_GLOBAL_CTRL:
+ msr_info->data = pmu->global_ctrl;
+ break;
+ case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR:
+ case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET:
+ case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ msr_info->data = 0;
+ break;
+ default:
+ return kvm_pmu_call(get_msr)(vcpu, msr_info);
+ }
+
+ return 0;
+}
+
+int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ u32 msr = msr_info->index;
+ u64 data = msr_info->data;
+ u64 diff;
+
+ /*
+ * Note, AMD ignores writes to reserved bits and read-only PMU MSRs,
+ * whereas Intel generates #GP on attempts to write reserved/RO MSRs.
+ */
+ switch (msr) {
+ case MSR_CORE_PERF_GLOBAL_STATUS:
+ if (!msr_info->host_initiated)
+ return 1; /* RO MSR */
+ fallthrough;
+ case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS:
+ /* Per PPR, Read-only MSR. Writes are ignored. */
+ if (!msr_info->host_initiated)
+ break;
+
+ if (data & pmu->global_status_rsvd)
+ return 1;
+
+ pmu->global_status = data;
+ break;
+ case MSR_AMD64_PERF_CNTR_GLOBAL_CTL:
+ data &= ~pmu->global_ctrl_rsvd;
+ fallthrough;
+ case MSR_CORE_PERF_GLOBAL_CTRL:
+ if (!kvm_valid_perf_global_ctrl(pmu, data))
+ return 1;
+
+ if (pmu->global_ctrl != data) {
+ diff = pmu->global_ctrl ^ data;
+ pmu->global_ctrl = data;
+ reprogram_counters(pmu, diff);
+ }
+ break;
+ case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ /*
+ * GLOBAL_OVF_CTRL, a.k.a. GLOBAL_STATUS_RESET, clears bits in
+ * GLOBAL_STATUS, and so the set of reserved bits is the same.
+ */
+ if (data & pmu->global_status_rsvd)
+ return 1;
+ fallthrough;
+ case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR:
+ if (!msr_info->host_initiated)
+ pmu->global_status &= ~data;
+ break;
+ case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET:
+ if (!msr_info->host_initiated)
+ pmu->global_status |= data & ~pmu->global_status_rsvd;
+ break;
+ default:
+ kvm_pmu_mark_pmc_in_use(vcpu, msr_info->index);
+ return kvm_pmu_call(set_msr)(vcpu, msr_info);
+ }
+
+ return 0;
+}
+
+static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ int i;
+
+ pmu->need_cleanup = false;
+
+ bitmap_zero(pmu->reprogram_pmi, X86_PMC_IDX_MAX);
+
+ kvm_for_each_pmc(pmu, pmc, i, pmu->all_valid_pmc_idx) {
+ pmc_stop_counter(pmc);
+ pmc->counter = 0;
+ pmc->emulated_counter = 0;
+
+ if (pmc_is_gp(pmc))
+ pmc->eventsel = 0;
+ }
+
+ pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
+
+ kvm_pmu_call(reset)(vcpu);
+}
+
+/*
+ * Refresh the PMU configuration for the vCPU, e.g. if userspace changes CPUID
+ * and/or PERF_CAPABILITIES.
+ */
+void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
+ return;
+
+ /*
+ * Stop/release all existing counters/events before realizing the new
+ * vPMU model.
+ */
+ kvm_pmu_reset(vcpu);
+
+ pmu->version = 0;
+ pmu->nr_arch_gp_counters = 0;
+ pmu->nr_arch_fixed_counters = 0;
+ pmu->counter_bitmask[KVM_PMC_GP] = 0;
+ pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
+ pmu->reserved_bits = 0xffffffff00200000ull;
+ pmu->raw_event_mask = X86_RAW_EVENT_MASK;
+ pmu->global_ctrl_rsvd = ~0ull;
+ pmu->global_status_rsvd = ~0ull;
+ pmu->fixed_ctr_ctrl_rsvd = ~0ull;
+ pmu->pebs_enable_rsvd = ~0ull;
+ pmu->pebs_data_cfg_rsvd = ~0ull;
+ bitmap_zero(pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX);
+
+ if (!vcpu->kvm->arch.enable_pmu)
+ return;
+
+ kvm_pmu_call(refresh)(vcpu);
+
+ /*
+ * At RESET, both Intel and AMD CPUs set all enable bits for general
+ * purpose counters in IA32_PERF_GLOBAL_CTRL (so that software that
+ * was written for v1 PMUs doesn't unknowingly leave GP counters disabled
+ * in the global controls). Emulate that behavior when refreshing the
+ * PMU so that userspace doesn't need to manually set PERF_GLOBAL_CTRL.
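+ * E.g. with 8 GP counters, this initializes global_ctrl to 0xff.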
+ */
+ if (kvm_pmu_has_perf_global_ctrl(pmu) && pmu->nr_arch_gp_counters)
+ pmu->global_ctrl = GENMASK_ULL(pmu->nr_arch_gp_counters - 1, 0);
+
+ bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
+ bitmap_set(pmu->all_valid_pmc_idx, KVM_FIXED_PMC_BASE_IDX,
+ pmu->nr_arch_fixed_counters);
+}
+
+void kvm_pmu_init(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ memset(pmu, 0, sizeof(*pmu));
+ kvm_pmu_call(init)(vcpu);
+}
+
+/* Release perf_events for vPMCs that have been unused for a full time slice. */
+void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+ DECLARE_BITMAP(bitmask, X86_PMC_IDX_MAX);
+ int i;
+
+ pmu->need_cleanup = false;
+
+ bitmap_andnot(bitmask, pmu->all_valid_pmc_idx,
+ pmu->pmc_in_use, X86_PMC_IDX_MAX);
+
+ kvm_for_each_pmc(pmu, pmc, i, bitmask) {
+ if (pmc->perf_event && !pmc_is_locally_enabled(pmc))
+ pmc_stop_counter(pmc);
+ }
+
+ kvm_pmu_call(cleanup)(vcpu);
+
+ bitmap_zero(pmu->pmc_in_use, X86_PMC_IDX_MAX);
+}
+
+void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
+{
+ kvm_pmu_reset(vcpu);
+}
+
+static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
+{
+ pmc->emulated_counter++;
+ kvm_pmu_request_counter_reprogram(pmc);
+}
+
+static inline bool cpl_is_matched(struct kvm_pmc *pmc)
+{
+ bool select_os, select_user;
+ u64 config;
+
+ if (pmc_is_gp(pmc)) {
+ config = pmc->eventsel;
+ select_os = config & ARCH_PERFMON_EVENTSEL_OS;
+ select_user = config & ARCH_PERFMON_EVENTSEL_USR;
+ } else {
+ config = fixed_ctrl_field(pmc_to_pmu(pmc)->fixed_ctr_ctrl,
+ pmc->idx - KVM_FIXED_PMC_BASE_IDX);
+ select_os = config & INTEL_FIXED_0_KERNEL;
+ select_user = config & INTEL_FIXED_0_USER;
+ }
+
+ /*
+ * Skip the CPL lookup, which isn't free on Intel, if the result will
+ * be the same regardless of the CPL.
+ */
+ if (select_os == select_user)
+ return select_os;
+
+ return (kvm_x86_call(get_cpl)(pmc->vcpu) == 0) ? select_os :
+ select_user;
+}
+
+static void kvm_pmu_trigger_event(struct kvm_vcpu *vcpu,
+ const unsigned long *event_pmcs)
+{
+ DECLARE_BITMAP(bitmap, X86_PMC_IDX_MAX);
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ int i, idx;
+
+ BUILD_BUG_ON(sizeof(pmu->global_ctrl) * BITS_PER_BYTE != X86_PMC_IDX_MAX);
+
+ if (bitmap_empty(event_pmcs, X86_PMC_IDX_MAX))
+ return;
+
+ if (!kvm_pmu_has_perf_global_ctrl(pmu))
+ bitmap_copy(bitmap, event_pmcs, X86_PMC_IDX_MAX);
+ else if (!bitmap_and(bitmap, event_pmcs,
+ (unsigned long *)&pmu->global_ctrl, X86_PMC_IDX_MAX))
+ return;
+
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+ kvm_for_each_pmc(pmu, pmc, i, bitmap) {
+ if (!pmc_is_event_allowed(pmc) || !cpl_is_matched(pmc))
+ continue;
+
+ kvm_pmu_incr_counter(pmc);
+ }
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+}
+
+void kvm_pmu_instruction_retired(struct kvm_vcpu *vcpu)
+{
+ kvm_pmu_trigger_event(vcpu, vcpu_to_pmu(vcpu)->pmc_counting_instructions);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_pmu_instruction_retired);
+
+void kvm_pmu_branch_retired(struct kvm_vcpu *vcpu)
+{
+ kvm_pmu_trigger_event(vcpu, vcpu_to_pmu(vcpu)->pmc_counting_branches);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_pmu_branch_retired);
+
+static bool is_masked_filter_valid(const struct kvm_x86_pmu_event_filter *filter)
+{
+ u64 mask = kvm_pmu_ops.EVENTSEL_EVENT |
+ KVM_PMU_MASKED_ENTRY_UMASK_MASK |
+ KVM_PMU_MASKED_ENTRY_UMASK_MATCH |
+ KVM_PMU_MASKED_ENTRY_EXCLUDE;
+ int i;
+
+ for (i = 0; i < filter->nevents; i++) {
+ if (filter->events[i] & ~mask)
+ return false;
+ }
+
+ return true;
+}
+
+static void convert_to_masked_filter(struct kvm_x86_pmu_event_filter *filter)
+{
+ int i, j;
+
+ for (i = 0, j = 0; i < filter->nevents; i++) {
+ /*
+ * Skip events that are impossible to match against a guest
+ * event. When filtering, only the event select + unit mask
+ * of the guest event is used. To maintain backwards
+ * compatibility, impossible filters can't be rejected :-(
+ */
+ if (filter->events[i] & ~(kvm_pmu_ops.EVENTSEL_EVENT |
+ ARCH_PERFMON_EVENTSEL_UMASK))
+ continue;
+ /*
+ * Convert userspace events to a common in-kernel event so
+ * only one code path is needed to support both events. For
+ * the in-kernel events use masked events because they are
+ * flexible enough to handle both cases. To convert to masked
+ * events, all that's needed is to add an "all ones" umask_mask
+ * (unmasked filter events don't support EXCLUDE).
+ */
+ filter->events[j++] = filter->events[i] |
+ (0xFFULL << KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT);
+ }
+
+ filter->nevents = j;
+}
+
+static int prepare_filter_lists(struct kvm_x86_pmu_event_filter *filter)
+{
+ int i;
+
+ if (!(filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS))
+ convert_to_masked_filter(filter);
+ else if (!is_masked_filter_valid(filter))
+ return -EINVAL;
+
+ /*
+ * Sort entries by event select and includes vs. excludes so that all
+ * entries for a given event select can be processed efficiently during
+ * filtering. The EXCLUDE flag uses a more significant bit than the
+ * event select, and so the sorted list is also effectively split into
+ * includes and excludes sub-lists.
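+ * E.g. a masked filter sorted to { include A, include B, exclude C }
+ * yields nr_includes == 2 and nr_excludes == 1, with "includes"
+ * pointing at events[0] and "excludes" at events[2]; non-masked
+ * filters always end up with nr_excludes == 0.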
+ */
+ sort(&filter->events, filter->nevents, sizeof(filter->events[0]),
+ filter_sort_cmp, NULL);
+
+ i = filter->nevents;
+ /* Find the first EXCLUDE event (only supported for masked events). */
+ if (filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS) {
+ for (i = 0; i < filter->nevents; i++) {
+ if (filter->events[i] & KVM_PMU_MASKED_ENTRY_EXCLUDE)
+ break;
+ }
+ }
+
+ filter->nr_includes = i;
+ filter->nr_excludes = filter->nevents - filter->nr_includes;
+ filter->includes = filter->events;
+ filter->excludes = filter->events + filter->nr_includes;
+
+ return 0;
+}
+
+int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
+{
+ struct kvm_pmu_event_filter __user *user_filter = argp;
+ struct kvm_x86_pmu_event_filter *filter;
+ struct kvm_pmu_event_filter tmp;
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+ size_t size;
+ int r;
+
+ if (copy_from_user(&tmp, user_filter, sizeof(tmp)))
+ return -EFAULT;
+
+ if (tmp.action != KVM_PMU_EVENT_ALLOW &&
+ tmp.action != KVM_PMU_EVENT_DENY)
+ return -EINVAL;
+
+ if (tmp.flags & ~KVM_PMU_EVENT_FLAGS_VALID_MASK)
+ return -EINVAL;
+
+ if (tmp.nevents > KVM_PMU_EVENT_FILTER_MAX_EVENTS)
+ return -E2BIG;
+
+ size = struct_size(filter, events, tmp.nevents);
+ filter = kzalloc(size, GFP_KERNEL_ACCOUNT);
+ if (!filter)
+ return -ENOMEM;
+
+ filter->action = tmp.action;
+ filter->nevents = tmp.nevents;
+ filter->fixed_counter_bitmap = tmp.fixed_counter_bitmap;
+ filter->flags = tmp.flags;
+
+ r = -EFAULT;
+ if (copy_from_user(filter->events, user_filter->events,
+ sizeof(filter->events[0]) * filter->nevents))
+ goto cleanup;
+
+ r = prepare_filter_lists(filter);
+ if (r)
+ goto cleanup;
+
+ mutex_lock(&kvm->lock);
+ filter = rcu_replace_pointer(kvm->arch.pmu_event_filter, filter,
+ mutex_is_locked(&kvm->lock));
+ mutex_unlock(&kvm->lock);
+ synchronize_srcu_expedited(&kvm->srcu);
+
+ BUILD_BUG_ON(sizeof(((struct kvm_pmu *)0)->reprogram_pmi) >
+ sizeof(((struct kvm_pmu *)0)->__reprogram_pmi));
+
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ atomic64_set(&vcpu_to_pmu(vcpu)->__reprogram_pmi, -1ull);
+
+ kvm_make_all_cpus_request(kvm, KVM_REQ_PMU);
+
+ r = 0;
+cleanup:
+ kfree(filter);
+ return r;
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
new file mode 100644
index 000000000000..5c3939e91f1d
--- /dev/null
+++ b/arch/x86/kvm/pmu.h
@@ -0,0 +1,235 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_PMU_H
+#define __KVM_X86_PMU_H
+
+#include <linux/nospec.h>
+
+#include <asm/kvm_host.h>
+
+#define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu)
+#define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
+#define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu)
+
+#define MSR_IA32_MISC_ENABLE_PMU_RO_MASK (MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL | \
+ MSR_IA32_MISC_ENABLE_BTS_UNAVAIL)
+
+/* Retrieve the control bits for a fixed counter out of IA32_FIXED_CTR_CTRL. */
+#define fixed_ctrl_field(ctrl_reg, idx) \
+ (((ctrl_reg) >> ((idx) * INTEL_FIXED_BITS_STRIDE)) & INTEL_FIXED_BITS_MASK)
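+
+/*
+ * E.g. fixed_ctrl_field(ctrl, 1) extracts bits 7:4 of the control value,
+ * i.e. the enable and PMI bits for fixed counter 1 (4 bits per counter).
+ */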
+
+#define VMWARE_BACKDOOR_PMC_HOST_TSC 0x10000
+#define VMWARE_BACKDOOR_PMC_REAL_TIME 0x10001
+#define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002
+
+#define KVM_FIXED_PMC_BASE_IDX INTEL_PMC_IDX_FIXED
+
+struct kvm_pmu_ops {
+ struct kvm_pmc *(*rdpmc_ecx_to_pmc)(struct kvm_vcpu *vcpu,
+ unsigned int idx, u64 *mask);
+ struct kvm_pmc *(*msr_idx_to_pmc)(struct kvm_vcpu *vcpu, u32 msr);
+ int (*check_rdpmc_early)(struct kvm_vcpu *vcpu, unsigned int idx);
+ bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
+ int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+ int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+ void (*refresh)(struct kvm_vcpu *vcpu);
+ void (*init)(struct kvm_vcpu *vcpu);
+ void (*reset)(struct kvm_vcpu *vcpu);
+ void (*deliver_pmi)(struct kvm_vcpu *vcpu);
+ void (*cleanup)(struct kvm_vcpu *vcpu);
+
+ const u64 EVENTSEL_EVENT;
+ const int MAX_NR_GP_COUNTERS;
+ const int MIN_NR_GP_COUNTERS;
+};
+
+void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);
+
+static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
+{
+ /*
+ * Architecturally, Intel's SDM states that IA32_PERF_GLOBAL_CTRL is
+ * supported if "CPUID.0AH: EAX[7:0] > 0", i.e. if the PMU version is
+ * greater than zero. However, KVM only exposes and emulates the MSR
+ * to/for the guest if the guest PMU supports at least "Architectural
+ * Performance Monitoring Version 2".
+ *
+ * AMD's version of PERF_GLOBAL_CTRL conveniently shows up with v2.
+ */
+ return pmu->version > 1;
+}
+
+/*
+ * KVM tracks all counters in 64-bit bitmaps, with general purpose counters
+ * mapped to bits 31:0 and fixed counters mapped to 63:32, e.g. fixed counter 0
+ * is tracked internally via index 32. On Intel (AMD doesn't support fixed
+ * counters), this mirrors how fixed counters are mapped to PERF_GLOBAL_CTRL
+ * and similar MSRs, i.e. tracking fixed counters at base index 32 reduces the
+ * amount of boilerplate needed to iterate over PMCs *and* simplifies common
+ * enable/disable/reset operations.
+ *
+ * WARNING! This helper is only for lookups that are initiated by KVM, it is
+ * NOT safe for guest lookups, e.g. will do the wrong thing if passed a raw
+ * ECX value from RDPMC (fixed counters are accessed by setting bit 30 in ECX
+ * for RDPMC, not by adding 32 to the fixed counter index).
+ */
+static inline struct kvm_pmc *kvm_pmc_idx_to_pmc(struct kvm_pmu *pmu, int idx)
+{
+ if (idx < pmu->nr_arch_gp_counters)
+ return &pmu->gp_counters[idx];
+
+ idx -= KVM_FIXED_PMC_BASE_IDX;
+ if (idx >= 0 && idx < pmu->nr_arch_fixed_counters)
+ return &pmu->fixed_counters[idx];
+
+ return NULL;
+}
+
+#define kvm_for_each_pmc(pmu, pmc, i, bitmap) \
+ for_each_set_bit(i, bitmap, X86_PMC_IDX_MAX) \
+ if (!(pmc = kvm_pmc_idx_to_pmc(pmu, i))) \
+ continue; \
+ else \
+
+static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+ return pmu->counter_bitmask[pmc->type];
+}
+
+static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
+{
+ u64 counter, enabled, running;
+
+ counter = pmc->counter + pmc->emulated_counter;
+
+ if (pmc->perf_event && !pmc->is_paused)
+ counter += perf_event_read_value(pmc->perf_event,
+ &enabled, &running);
+ /* FIXME: Scaling needed? */
+ return counter & pmc_bitmask(pmc);
+}
+
+void pmc_write_counter(struct kvm_pmc *pmc, u64 val);
+
+static inline bool pmc_is_gp(struct kvm_pmc *pmc)
+{
+ return pmc->type == KVM_PMC_GP;
+}
+
+static inline bool pmc_is_fixed(struct kvm_pmc *pmc)
+{
+ return pmc->type == KVM_PMC_FIXED;
+}
+
+static inline bool kvm_valid_perf_global_ctrl(struct kvm_pmu *pmu,
+ u64 data)
+{
+ return !(pmu->global_ctrl_rsvd & data);
+}
+
+/*
+ * Returns the general purpose PMC with the specified MSR. Note that it can be
+ * used for both PERFCTRn and EVNTSELn; that is why it accepts base as a
+ * parameter to tell them apart.
+ */
+static inline struct kvm_pmc *get_gp_pmc(struct kvm_pmu *pmu, u32 msr,
+ u32 base)
+{
+ if (msr >= base && msr < base + pmu->nr_arch_gp_counters) {
+ u32 index = array_index_nospec(msr - base,
+ pmu->nr_arch_gp_counters);
+
+ return &pmu->gp_counters[index];
+ }
+
+ return NULL;
+}
+
+/* Returns the fixed PMC with the specified MSR. */
+static inline struct kvm_pmc *get_fixed_pmc(struct kvm_pmu *pmu, u32 msr)
+{
+ int base = MSR_CORE_PERF_FIXED_CTR0;
+
+ if (msr >= base && msr < base + pmu->nr_arch_fixed_counters) {
+ u32 index = array_index_nospec(msr - base,
+ pmu->nr_arch_fixed_counters);
+
+ return &pmu->fixed_counters[index];
+ }
+
+ return NULL;
+}
+
+static inline bool pmc_is_locally_enabled(struct kvm_pmc *pmc)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+ if (pmc_is_fixed(pmc))
+ return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
+ pmc->idx - KVM_FIXED_PMC_BASE_IDX) &
+ (INTEL_FIXED_0_KERNEL | INTEL_FIXED_0_USER);
+
+ return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
+}
+
+extern struct x86_pmu_capability kvm_pmu_cap;
+
+void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops);
+
+void kvm_pmu_recalc_pmc_emulation(struct kvm_pmu *pmu, struct kvm_pmc *pmc);
+
+static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
+{
+ kvm_pmu_recalc_pmc_emulation(pmc_to_pmu(pmc), pmc);
+
+ set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
+ kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
+}
+
+static inline void reprogram_counters(struct kvm_pmu *pmu, u64 diff)
+{
+ int bit;
+
+ if (!diff)
+ return;
+
+ for_each_set_bit(bit, (unsigned long *)&diff, X86_PMC_IDX_MAX)
+ set_bit(bit, pmu->reprogram_pmi);
+ kvm_make_request(KVM_REQ_PMU, pmu_to_vcpu(pmu));
+}
+
+/*
+ * Check if a PMC is enabled by comparing it against global_ctrl bits.
+ *
+ * If the vPMU doesn't have global_ctrl MSR, all vPMCs are enabled.
+ */
+static inline bool pmc_is_globally_enabled(struct kvm_pmc *pmc)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+ if (!kvm_pmu_has_perf_global_ctrl(pmu))
+ return true;
+
+ return test_bit(pmc->idx, (unsigned long *)&pmu->global_ctrl);
+}
+
+void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
+void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
+int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
+int kvm_pmu_check_rdpmc_early(struct kvm_vcpu *vcpu, unsigned int idx);
+bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
+void kvm_pmu_init(struct kvm_vcpu *vcpu);
+void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
+void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
+int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
+void kvm_pmu_instruction_retired(struct kvm_vcpu *vcpu);
+void kvm_pmu_branch_retired(struct kvm_vcpu *vcpu);
+
+bool is_vmware_backdoor_pmc(u32 pmc_idx);
+
+extern struct kvm_pmu_ops intel_pmu_ops;
+extern struct kvm_pmu_ops amd_pmu_ops;
+#endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/reverse_cpuid.h b/arch/x86/kvm/reverse_cpuid.h
new file mode 100644
index 000000000000..81b4a7acf72e
--- /dev/null
+++ b/arch/x86/kvm/reverse_cpuid.h
@@ -0,0 +1,243 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ARCH_X86_KVM_REVERSE_CPUID_H
+#define ARCH_X86_KVM_REVERSE_CPUID_H
+
+#include <uapi/asm/kvm.h>
+#include <asm/cpufeature.h>
+#include <asm/cpufeatures.h>
+
+/*
+ * Define a KVM-only feature flag.
+ *
+ * For features that are scattered by cpufeatures.h, __feature_translate() also
+ * needs to be updated to translate the kernel-defined feature into the
+ * KVM-defined feature.
+ *
+ * For features that are 100% KVM-only, i.e. not defined by cpufeatures.h,
+ * forego the intermediate KVM_X86_FEATURE and directly define X86_FEATURE_* so
+ * that X86_FEATURE_* can be used in KVM. No __feature_translate() handling is
+ * needed in this case.
+ */
+#define KVM_X86_FEATURE(w, f) ((w)*32 + (f))
+
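+/*
+ * For example, KVM_X86_FEATURE(CPUID_12_EAX, 11) encodes word CPUID_12_EAX
+ * and bit 11 as "word * 32 + bit", the same layout cpufeatures.h uses, so
+ * the helpers below can recover the word with /32 and the bit with &31.
+ */
+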
+/* Intel-defined SGX sub-features, CPUID level 0x12 (EAX). */
+#define KVM_X86_FEATURE_SGX1 KVM_X86_FEATURE(CPUID_12_EAX, 0)
+#define KVM_X86_FEATURE_SGX2 KVM_X86_FEATURE(CPUID_12_EAX, 1)
+#define KVM_X86_FEATURE_SGX_EDECCSSA KVM_X86_FEATURE(CPUID_12_EAX, 11)
+
+/* Intel-defined sub-features, CPUID level 0x00000007:1 (ECX) */
+#define KVM_X86_FEATURE_MSR_IMM KVM_X86_FEATURE(CPUID_7_1_ECX, 5)
+
+/* Intel-defined sub-features, CPUID level 0x00000007:1 (EDX) */
+#define X86_FEATURE_AVX_VNNI_INT8 KVM_X86_FEATURE(CPUID_7_1_EDX, 4)
+#define X86_FEATURE_AVX_NE_CONVERT KVM_X86_FEATURE(CPUID_7_1_EDX, 5)
+#define X86_FEATURE_AMX_COMPLEX KVM_X86_FEATURE(CPUID_7_1_EDX, 8)
+#define X86_FEATURE_AVX_VNNI_INT16 KVM_X86_FEATURE(CPUID_7_1_EDX, 10)
+#define X86_FEATURE_PREFETCHITI KVM_X86_FEATURE(CPUID_7_1_EDX, 14)
+#define X86_FEATURE_AVX10 KVM_X86_FEATURE(CPUID_7_1_EDX, 19)
+
+/* Intel-defined sub-features, CPUID level 0x00000007:2 (EDX) */
+#define X86_FEATURE_INTEL_PSFD KVM_X86_FEATURE(CPUID_7_2_EDX, 0)
+#define X86_FEATURE_IPRED_CTRL KVM_X86_FEATURE(CPUID_7_2_EDX, 1)
+#define KVM_X86_FEATURE_RRSBA_CTRL KVM_X86_FEATURE(CPUID_7_2_EDX, 2)
+#define X86_FEATURE_DDPD_U KVM_X86_FEATURE(CPUID_7_2_EDX, 3)
+#define KVM_X86_FEATURE_BHI_CTRL KVM_X86_FEATURE(CPUID_7_2_EDX, 4)
+#define X86_FEATURE_MCDT_NO KVM_X86_FEATURE(CPUID_7_2_EDX, 5)
+
+/* Intel-defined sub-features, CPUID level 0x00000024:0 (EBX) */
+#define X86_FEATURE_AVX10_128 KVM_X86_FEATURE(CPUID_24_0_EBX, 16)
+#define X86_FEATURE_AVX10_256 KVM_X86_FEATURE(CPUID_24_0_EBX, 17)
+#define X86_FEATURE_AVX10_512 KVM_X86_FEATURE(CPUID_24_0_EBX, 18)
+
+/* CPUID level 0x80000007 (EDX). */
+#define KVM_X86_FEATURE_CONSTANT_TSC KVM_X86_FEATURE(CPUID_8000_0007_EDX, 8)
+
+/* CPUID level 0x80000022 (EAX) */
+#define KVM_X86_FEATURE_PERFMON_V2 KVM_X86_FEATURE(CPUID_8000_0022_EAX, 0)
+
+/* CPUID level 0x80000021 (ECX) */
+#define KVM_X86_FEATURE_TSA_SQ_NO KVM_X86_FEATURE(CPUID_8000_0021_ECX, 1)
+#define KVM_X86_FEATURE_TSA_L1_NO KVM_X86_FEATURE(CPUID_8000_0021_ECX, 2)
+
+struct cpuid_reg {
+ u32 function;
+ u32 index;
+ int reg;
+};
+
+static const struct cpuid_reg reverse_cpuid[] = {
+ [CPUID_1_EDX] = { 1, 0, CPUID_EDX},
+ [CPUID_8000_0001_EDX] = {0x80000001, 0, CPUID_EDX},
+ [CPUID_8086_0001_EDX] = {0x80860001, 0, CPUID_EDX},
+ [CPUID_1_ECX] = { 1, 0, CPUID_ECX},
+ [CPUID_C000_0001_EDX] = {0xc0000001, 0, CPUID_EDX},
+ [CPUID_8000_0001_ECX] = {0x80000001, 0, CPUID_ECX},
+ [CPUID_7_0_EBX] = { 7, 0, CPUID_EBX},
+ [CPUID_D_1_EAX] = { 0xd, 1, CPUID_EAX},
+ [CPUID_8000_0008_EBX] = {0x80000008, 0, CPUID_EBX},
+ [CPUID_6_EAX] = { 6, 0, CPUID_EAX},
+ [CPUID_8000_000A_EDX] = {0x8000000a, 0, CPUID_EDX},
+ [CPUID_7_ECX] = { 7, 0, CPUID_ECX},
+ [CPUID_7_EDX] = { 7, 0, CPUID_EDX},
+ [CPUID_7_1_EAX] = { 7, 1, CPUID_EAX},
+ [CPUID_12_EAX] = {0x00000012, 0, CPUID_EAX},
+ [CPUID_8000_001F_EAX] = {0x8000001f, 0, CPUID_EAX},
+ [CPUID_7_1_EDX] = { 7, 1, CPUID_EDX},
+ [CPUID_8000_0007_EDX] = {0x80000007, 0, CPUID_EDX},
+ [CPUID_8000_0021_EAX] = {0x80000021, 0, CPUID_EAX},
+ [CPUID_8000_0022_EAX] = {0x80000022, 0, CPUID_EAX},
+ [CPUID_7_2_EDX] = { 7, 2, CPUID_EDX},
+ [CPUID_24_0_EBX] = { 0x24, 0, CPUID_EBX},
+ [CPUID_8000_0021_ECX] = {0x80000021, 0, CPUID_ECX},
+ [CPUID_7_1_ECX] = { 7, 1, CPUID_ECX},
+};
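+
+/*
+ * Example lookup (illustrative): x86_feature_cpuid(X86_FEATURE_SGX1)
+ * translates the scattered kernel flag to KVM_X86_FEATURE_SGX1, indexes this
+ * table at CPUID_12_EAX and yields { 0x12, 0, CPUID_EAX }; __feature_bit()
+ * then selects BIT(0) within that register.
+ */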
+
+/*
+ * Reverse CPUID and its derivatives can only be used for hardware-defined
+ * feature words, i.e. words whose bits directly correspond to a CPUID leaf.
+ * Retrieving a feature bit or masking guest CPUID from a Linux-defined word
+ * is nonsensical as the bit number/mask is an arbitrary software-defined value
+ * and can't be used by KVM to query/control guest capabilities. And obviously
+ * the leaf being queried must have an entry in the lookup table.
+ */
+static __always_inline void reverse_cpuid_check(unsigned int x86_leaf)
+{
+ BUILD_BUG_ON(NR_CPUID_WORDS != NCAPINTS);
+ BUILD_BUG_ON(x86_leaf == CPUID_LNX_1);
+ BUILD_BUG_ON(x86_leaf == CPUID_LNX_2);
+ BUILD_BUG_ON(x86_leaf == CPUID_LNX_3);
+ BUILD_BUG_ON(x86_leaf == CPUID_LNX_4);
+ BUILD_BUG_ON(x86_leaf == CPUID_LNX_5);
+ BUILD_BUG_ON(x86_leaf >= ARRAY_SIZE(reverse_cpuid));
+ BUILD_BUG_ON(reverse_cpuid[x86_leaf].function == 0);
+}
+
+/*
+ * Translate feature bits that are scattered in the kernel's cpufeatures word
+ * into KVM feature words that align with hardware's definitions.
+ */
+static __always_inline u32 __feature_translate(int x86_feature)
+{
+#define KVM_X86_TRANSLATE_FEATURE(f) \
+ case X86_FEATURE_##f: return KVM_X86_FEATURE_##f
+
+ switch (x86_feature) {
+ KVM_X86_TRANSLATE_FEATURE(SGX1);
+ KVM_X86_TRANSLATE_FEATURE(SGX2);
+ KVM_X86_TRANSLATE_FEATURE(SGX_EDECCSSA);
+ KVM_X86_TRANSLATE_FEATURE(CONSTANT_TSC);
+ KVM_X86_TRANSLATE_FEATURE(PERFMON_V2);
+ KVM_X86_TRANSLATE_FEATURE(RRSBA_CTRL);
+ KVM_X86_TRANSLATE_FEATURE(BHI_CTRL);
+ KVM_X86_TRANSLATE_FEATURE(TSA_SQ_NO);
+ KVM_X86_TRANSLATE_FEATURE(TSA_L1_NO);
+ KVM_X86_TRANSLATE_FEATURE(MSR_IMM);
+ default:
+ return x86_feature;
+ }
+}
+
+static __always_inline u32 __feature_leaf(int x86_feature)
+{
+ u32 x86_leaf = __feature_translate(x86_feature) / 32;
+
+ reverse_cpuid_check(x86_leaf);
+ return x86_leaf;
+}
+
+/*
+ * Retrieve the bit mask from an X86_FEATURE_* definition. Features contain
+ * the hardware defined bit number (stored in bits 4:0) and a software defined
+ * "word" (stored in bits 31:5). The word is used to index into arrays of
+ * bit masks that hold the per-cpu feature capabilities, e.g. this_cpu_has().
+ */
+static __always_inline u32 __feature_bit(int x86_feature)
+{
+ x86_feature = __feature_translate(x86_feature);
+
+ reverse_cpuid_check(x86_feature / 32);
+ return 1 << (x86_feature & 31);
+}
+
+#define feature_bit(name) __feature_bit(X86_FEATURE_##name)
+
+static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned int x86_feature)
+{
+ unsigned int x86_leaf = __feature_leaf(x86_feature);
+
+ return reverse_cpuid[x86_leaf];
+}
+
+static __always_inline u32 *__cpuid_entry_get_reg(struct kvm_cpuid_entry2 *entry,
+ u32 reg)
+{
+ switch (reg) {
+ case CPUID_EAX:
+ return &entry->eax;
+ case CPUID_EBX:
+ return &entry->ebx;
+ case CPUID_ECX:
+ return &entry->ecx;
+ case CPUID_EDX:
+ return &entry->edx;
+ default:
+ BUILD_BUG();
+ return NULL;
+ }
+}
+
+static __always_inline u32 *cpuid_entry_get_reg(struct kvm_cpuid_entry2 *entry,
+ unsigned int x86_feature)
+{
+ const struct cpuid_reg cpuid = x86_feature_cpuid(x86_feature);
+
+ return __cpuid_entry_get_reg(entry, cpuid.reg);
+}
+
+static __always_inline u32 cpuid_entry_get(struct kvm_cpuid_entry2 *entry,
+ unsigned int x86_feature)
+{
+ u32 *reg = cpuid_entry_get_reg(entry, x86_feature);
+
+ return *reg & __feature_bit(x86_feature);
+}
+
+static __always_inline bool cpuid_entry_has(struct kvm_cpuid_entry2 *entry,
+ unsigned int x86_feature)
+{
+ return cpuid_entry_get(entry, x86_feature);
+}
+
+static __always_inline void cpuid_entry_clear(struct kvm_cpuid_entry2 *entry,
+ unsigned int x86_feature)
+{
+ u32 *reg = cpuid_entry_get_reg(entry, x86_feature);
+
+ *reg &= ~__feature_bit(x86_feature);
+}
+
+static __always_inline void cpuid_entry_set(struct kvm_cpuid_entry2 *entry,
+ unsigned int x86_feature)
+{
+ u32 *reg = cpuid_entry_get_reg(entry, x86_feature);
+
+ *reg |= __feature_bit(x86_feature);
+}
+
+static __always_inline void cpuid_entry_change(struct kvm_cpuid_entry2 *entry,
+ unsigned int x86_feature,
+ bool set)
+{
+ u32 *reg = cpuid_entry_get_reg(entry, x86_feature);
+
+ /*
+ * Open coded instead of using cpuid_entry_{clear,set}() to coerce the
+ * compiler into using CMOV instead of Jcc when possible.
+ */
+ if (set)
+ *reg |= __feature_bit(x86_feature);
+ else
+ *reg &= ~__feature_bit(x86_feature);
+}
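+
+/*
+ * Illustrative use (assumed caller, not part of this patch): mirroring a
+ * host capability into a guest CPUID entry for leaf 0x7, index 0:
+ *
+ *	cpuid_entry_change(entry, X86_FEATURE_SMEP,
+ *			   boot_cpu_has(X86_FEATURE_SMEP));
+ */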
+
+#endif /* ARCH_X86_KVM_REVERSE_CPUID_H */
diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
new file mode 100644
index 000000000000..f623c5986119
--- /dev/null
+++ b/arch/x86/kvm/smm.c
@@ -0,0 +1,663 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kvm_host.h>
+#include "x86.h"
+#include "kvm_cache_regs.h"
+#include "kvm_emulate.h"
+#include "smm.h"
+#include "cpuid.h"
+#include "trace.h"
+
+#define CHECK_SMRAM32_OFFSET(field, offset) \
+ ASSERT_STRUCT_OFFSET(struct kvm_smram_state_32, field, offset - 0xFE00)
+
+#define CHECK_SMRAM64_OFFSET(field, offset) \
+ ASSERT_STRUCT_OFFSET(struct kvm_smram_state_64, field, offset - 0xFE00)
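+
+/*
+ * Expansion sketch (assuming ASSERT_STRUCT_OFFSET() wraps a compile-time
+ * offsetof() check): CHECK_SMRAM32_OFFSET(smbase, 0xFEF8) becomes
+ *
+ *	BUILD_BUG_ON(offsetof(struct kvm_smram_state_32, smbase) != 0xF8);
+ *
+ * validating the layout relative to the 0xFE00 state-save base.
+ */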
+
+static void check_smram_offsets(void)
+{
+ /* 32 bit SMRAM image */
+ CHECK_SMRAM32_OFFSET(reserved1, 0xFE00);
+ CHECK_SMRAM32_OFFSET(smbase, 0xFEF8);
+ CHECK_SMRAM32_OFFSET(smm_revision, 0xFEFC);
+ CHECK_SMRAM32_OFFSET(io_inst_restart, 0xFF00);
+ CHECK_SMRAM32_OFFSET(auto_hlt_restart, 0xFF02);
+ CHECK_SMRAM32_OFFSET(io_restart_rdi, 0xFF04);
+ CHECK_SMRAM32_OFFSET(io_restart_rcx, 0xFF08);
+ CHECK_SMRAM32_OFFSET(io_restart_rsi, 0xFF0C);
+ CHECK_SMRAM32_OFFSET(io_restart_rip, 0xFF10);
+ CHECK_SMRAM32_OFFSET(cr4, 0xFF14);
+ CHECK_SMRAM32_OFFSET(reserved2, 0xFF18);
+ CHECK_SMRAM32_OFFSET(int_shadow, 0xFF1A);
+ CHECK_SMRAM32_OFFSET(reserved3, 0xFF1B);
+ CHECK_SMRAM32_OFFSET(ds, 0xFF2C);
+ CHECK_SMRAM32_OFFSET(fs, 0xFF38);
+ CHECK_SMRAM32_OFFSET(gs, 0xFF44);
+ CHECK_SMRAM32_OFFSET(idtr, 0xFF50);
+ CHECK_SMRAM32_OFFSET(tr, 0xFF5C);
+ CHECK_SMRAM32_OFFSET(gdtr, 0xFF6C);
+ CHECK_SMRAM32_OFFSET(ldtr, 0xFF78);
+ CHECK_SMRAM32_OFFSET(es, 0xFF84);
+ CHECK_SMRAM32_OFFSET(cs, 0xFF90);
+ CHECK_SMRAM32_OFFSET(ss, 0xFF9C);
+ CHECK_SMRAM32_OFFSET(es_sel, 0xFFA8);
+ CHECK_SMRAM32_OFFSET(cs_sel, 0xFFAC);
+ CHECK_SMRAM32_OFFSET(ss_sel, 0xFFB0);
+ CHECK_SMRAM32_OFFSET(ds_sel, 0xFFB4);
+ CHECK_SMRAM32_OFFSET(fs_sel, 0xFFB8);
+ CHECK_SMRAM32_OFFSET(gs_sel, 0xFFBC);
+ CHECK_SMRAM32_OFFSET(ldtr_sel, 0xFFC0);
+ CHECK_SMRAM32_OFFSET(tr_sel, 0xFFC4);
+ CHECK_SMRAM32_OFFSET(dr7, 0xFFC8);
+ CHECK_SMRAM32_OFFSET(dr6, 0xFFCC);
+ CHECK_SMRAM32_OFFSET(gprs, 0xFFD0);
+ CHECK_SMRAM32_OFFSET(eip, 0xFFF0);
+ CHECK_SMRAM32_OFFSET(eflags, 0xFFF4);
+ CHECK_SMRAM32_OFFSET(cr3, 0xFFF8);
+ CHECK_SMRAM32_OFFSET(cr0, 0xFFFC);
+
+ /* 64 bit SMRAM image */
+ CHECK_SMRAM64_OFFSET(es, 0xFE00);
+ CHECK_SMRAM64_OFFSET(cs, 0xFE10);
+ CHECK_SMRAM64_OFFSET(ss, 0xFE20);
+ CHECK_SMRAM64_OFFSET(ds, 0xFE30);
+ CHECK_SMRAM64_OFFSET(fs, 0xFE40);
+ CHECK_SMRAM64_OFFSET(gs, 0xFE50);
+ CHECK_SMRAM64_OFFSET(gdtr, 0xFE60);
+ CHECK_SMRAM64_OFFSET(ldtr, 0xFE70);
+ CHECK_SMRAM64_OFFSET(idtr, 0xFE80);
+ CHECK_SMRAM64_OFFSET(tr, 0xFE90);
+ CHECK_SMRAM64_OFFSET(io_restart_rip, 0xFEA0);
+ CHECK_SMRAM64_OFFSET(io_restart_rcx, 0xFEA8);
+ CHECK_SMRAM64_OFFSET(io_restart_rsi, 0xFEB0);
+ CHECK_SMRAM64_OFFSET(io_restart_rdi, 0xFEB8);
+ CHECK_SMRAM64_OFFSET(io_restart_dword, 0xFEC0);
+ CHECK_SMRAM64_OFFSET(reserved1, 0xFEC4);
+ CHECK_SMRAM64_OFFSET(io_inst_restart, 0xFEC8);
+ CHECK_SMRAM64_OFFSET(auto_hlt_restart, 0xFEC9);
+ CHECK_SMRAM64_OFFSET(amd_nmi_mask, 0xFECA);
+ CHECK_SMRAM64_OFFSET(int_shadow, 0xFECB);
+ CHECK_SMRAM64_OFFSET(reserved2, 0xFECC);
+ CHECK_SMRAM64_OFFSET(efer, 0xFED0);
+ CHECK_SMRAM64_OFFSET(svm_guest_flag, 0xFED8);
+ CHECK_SMRAM64_OFFSET(svm_guest_vmcb_gpa, 0xFEE0);
+ CHECK_SMRAM64_OFFSET(svm_guest_virtual_int, 0xFEE8);
+ CHECK_SMRAM64_OFFSET(reserved3, 0xFEF0);
+	CHECK_SMRAM64_OFFSET(smm_revision,		0xFEFC);
+ CHECK_SMRAM64_OFFSET(smbase, 0xFF00);
+ CHECK_SMRAM64_OFFSET(reserved4, 0xFF04);
+ CHECK_SMRAM64_OFFSET(ssp, 0xFF18);
+ CHECK_SMRAM64_OFFSET(svm_guest_pat, 0xFF20);
+ CHECK_SMRAM64_OFFSET(svm_host_efer, 0xFF28);
+ CHECK_SMRAM64_OFFSET(svm_host_cr4, 0xFF30);
+ CHECK_SMRAM64_OFFSET(svm_host_cr3, 0xFF38);
+ CHECK_SMRAM64_OFFSET(svm_host_cr0, 0xFF40);
+ CHECK_SMRAM64_OFFSET(cr4, 0xFF48);
+ CHECK_SMRAM64_OFFSET(cr3, 0xFF50);
+ CHECK_SMRAM64_OFFSET(cr0, 0xFF58);
+ CHECK_SMRAM64_OFFSET(dr7, 0xFF60);
+ CHECK_SMRAM64_OFFSET(dr6, 0xFF68);
+ CHECK_SMRAM64_OFFSET(rflags, 0xFF70);
+ CHECK_SMRAM64_OFFSET(rip, 0xFF78);
+ CHECK_SMRAM64_OFFSET(gprs, 0xFF80);
+
+ BUILD_BUG_ON(sizeof(union kvm_smram) != 512);
+}
+
+#undef CHECK_SMRAM64_OFFSET
+#undef CHECK_SMRAM32_OFFSET
+
+
+void kvm_smm_changed(struct kvm_vcpu *vcpu, bool entering_smm)
+{
+ trace_kvm_smm_transition(vcpu->vcpu_id, vcpu->arch.smbase, entering_smm);
+
+ if (entering_smm) {
+ vcpu->arch.hflags |= HF_SMM_MASK;
+ } else {
+ vcpu->arch.hflags &= ~(HF_SMM_MASK | HF_SMM_INSIDE_NMI_MASK);
+
+ /* Process a latched INIT or SMI, if any. */
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+
+ /*
+ * Even if KVM_SET_SREGS2 loaded PDPTRs out of band,
+ * on SMM exit we still need to reload them from
+ * guest memory
+ */
+ vcpu->arch.pdptrs_from_userspace = false;
+ }
+
+ kvm_mmu_reset_context(vcpu);
+}
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_smm_changed);
+
+void process_smi(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.smi_pending = true;
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+}
+
+static u32 enter_smm_get_segment_flags(struct kvm_segment *seg)
+{
+ u32 flags = 0;
+ flags |= seg->g << 23;
+ flags |= seg->db << 22;
+ flags |= seg->l << 21;
+ flags |= seg->avl << 20;
+ flags |= seg->present << 15;
+ flags |= seg->dpl << 13;
+ flags |= seg->s << 12;
+ flags |= seg->type << 8;
+ return flags;
+}
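+
+/*
+ * For example, a flat 32-bit code segment (g=1, db=1, present=1, s=1,
+ * type=0xb) packs to 0x00c09b00, i.e. bits 8..23 of the descriptor's
+ * attribute dword as stored in the SMRAM image.
+ */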
+
+static void enter_smm_save_seg_32(struct kvm_vcpu *vcpu,
+ struct kvm_smm_seg_state_32 *state,
+ u32 *selector, int n)
+{
+ struct kvm_segment seg;
+
+ kvm_get_segment(vcpu, &seg, n);
+ *selector = seg.selector;
+ state->base = seg.base;
+ state->limit = seg.limit;
+ state->flags = enter_smm_get_segment_flags(&seg);
+}
+
+#ifdef CONFIG_X86_64
+static void enter_smm_save_seg_64(struct kvm_vcpu *vcpu,
+ struct kvm_smm_seg_state_64 *state,
+ int n)
+{
+ struct kvm_segment seg;
+
+ kvm_get_segment(vcpu, &seg, n);
+ state->selector = seg.selector;
+ state->attributes = enter_smm_get_segment_flags(&seg) >> 8;
+ state->limit = seg.limit;
+ state->base = seg.base;
+}
+#endif
+
+static void enter_smm_save_state_32(struct kvm_vcpu *vcpu,
+ struct kvm_smram_state_32 *smram)
+{
+ struct desc_ptr dt;
+ int i;
+
+ smram->cr0 = kvm_read_cr0(vcpu);
+ smram->cr3 = kvm_read_cr3(vcpu);
+ smram->eflags = kvm_get_rflags(vcpu);
+ smram->eip = kvm_rip_read(vcpu);
+
+ for (i = 0; i < 8; i++)
+ smram->gprs[i] = kvm_register_read_raw(vcpu, i);
+
+ smram->dr6 = (u32)vcpu->arch.dr6;
+ smram->dr7 = (u32)vcpu->arch.dr7;
+
+ enter_smm_save_seg_32(vcpu, &smram->tr, &smram->tr_sel, VCPU_SREG_TR);
+ enter_smm_save_seg_32(vcpu, &smram->ldtr, &smram->ldtr_sel, VCPU_SREG_LDTR);
+
+ kvm_x86_call(get_gdt)(vcpu, &dt);
+ smram->gdtr.base = dt.address;
+ smram->gdtr.limit = dt.size;
+
+ kvm_x86_call(get_idt)(vcpu, &dt);
+ smram->idtr.base = dt.address;
+ smram->idtr.limit = dt.size;
+
+ enter_smm_save_seg_32(vcpu, &smram->es, &smram->es_sel, VCPU_SREG_ES);
+ enter_smm_save_seg_32(vcpu, &smram->cs, &smram->cs_sel, VCPU_SREG_CS);
+ enter_smm_save_seg_32(vcpu, &smram->ss, &smram->ss_sel, VCPU_SREG_SS);
+
+ enter_smm_save_seg_32(vcpu, &smram->ds, &smram->ds_sel, VCPU_SREG_DS);
+ enter_smm_save_seg_32(vcpu, &smram->fs, &smram->fs_sel, VCPU_SREG_FS);
+ enter_smm_save_seg_32(vcpu, &smram->gs, &smram->gs_sel, VCPU_SREG_GS);
+
+ smram->cr4 = kvm_read_cr4(vcpu);
+ smram->smm_revision = 0x00020000;
+ smram->smbase = vcpu->arch.smbase;
+
+ smram->int_shadow = kvm_x86_call(get_interrupt_shadow)(vcpu);
+}
+
+#ifdef CONFIG_X86_64
+static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
+ struct kvm_smram_state_64 *smram)
+{
+ struct desc_ptr dt;
+ int i;
+
+ for (i = 0; i < 16; i++)
+ smram->gprs[15 - i] = kvm_register_read_raw(vcpu, i);
+
+ smram->rip = kvm_rip_read(vcpu);
+ smram->rflags = kvm_get_rflags(vcpu);
+
+ smram->dr6 = vcpu->arch.dr6;
+ smram->dr7 = vcpu->arch.dr7;
+
+ smram->cr0 = kvm_read_cr0(vcpu);
+ smram->cr3 = kvm_read_cr3(vcpu);
+ smram->cr4 = kvm_read_cr4(vcpu);
+
+ smram->smbase = vcpu->arch.smbase;
+	smram->smm_revision = 0x00020064;
+
+ smram->efer = vcpu->arch.efer;
+
+ enter_smm_save_seg_64(vcpu, &smram->tr, VCPU_SREG_TR);
+
+ kvm_x86_call(get_idt)(vcpu, &dt);
+ smram->idtr.limit = dt.size;
+ smram->idtr.base = dt.address;
+
+ enter_smm_save_seg_64(vcpu, &smram->ldtr, VCPU_SREG_LDTR);
+
+ kvm_x86_call(get_gdt)(vcpu, &dt);
+ smram->gdtr.limit = dt.size;
+ smram->gdtr.base = dt.address;
+
+ enter_smm_save_seg_64(vcpu, &smram->es, VCPU_SREG_ES);
+ enter_smm_save_seg_64(vcpu, &smram->cs, VCPU_SREG_CS);
+ enter_smm_save_seg_64(vcpu, &smram->ss, VCPU_SREG_SS);
+ enter_smm_save_seg_64(vcpu, &smram->ds, VCPU_SREG_DS);
+ enter_smm_save_seg_64(vcpu, &smram->fs, VCPU_SREG_FS);
+ enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
+
+ smram->int_shadow = kvm_x86_call(get_interrupt_shadow)(vcpu);
+
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK) &&
+ kvm_msr_read(vcpu, MSR_KVM_INTERNAL_GUEST_SSP, &smram->ssp))
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+}
+#endif
+
+void enter_smm(struct kvm_vcpu *vcpu)
+{
+ struct kvm_segment cs, ds;
+ struct desc_ptr dt;
+ unsigned long cr0;
+ union kvm_smram smram;
+
+ check_smram_offsets();
+
+ memset(smram.bytes, 0, sizeof(smram.bytes));
+
+#ifdef CONFIG_X86_64
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_LM))
+ enter_smm_save_state_64(vcpu, &smram.smram64);
+ else
+#endif
+ enter_smm_save_state_32(vcpu, &smram.smram32);
+
+	/*
+	 * Give the vendor's ->enter_smm() callback a chance to make
+	 * ISA-specific changes to the vCPU state (e.g. leave guest mode)
+	 * after we've saved the state into the SMM state-save area.
+	 *
+	 * Kill the VM in the unlikely case of failure, because the VM
+	 * can be left in an undefined state at this point.
+	 */
+ if (kvm_x86_call(enter_smm)(vcpu, &smram))
+ goto error;
+
+ kvm_smm_changed(vcpu, true);
+
+ if (kvm_vcpu_write_guest(vcpu, vcpu->arch.smbase + 0xfe00, &smram, sizeof(smram)))
+ goto error;
+
+ if (kvm_x86_call(get_nmi_mask)(vcpu))
+ vcpu->arch.hflags |= HF_SMM_INSIDE_NMI_MASK;
+ else
+ kvm_x86_call(set_nmi_mask)(vcpu, true);
+
+ kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
+ kvm_rip_write(vcpu, 0x8000);
+
+ kvm_x86_call(set_interrupt_shadow)(vcpu, 0);
+
+ cr0 = vcpu->arch.cr0 & ~(X86_CR0_PE | X86_CR0_EM | X86_CR0_TS | X86_CR0_PG);
+ kvm_x86_call(set_cr0)(vcpu, cr0);
+
+ kvm_x86_call(set_cr4)(vcpu, 0);
+
+ /* Undocumented: IDT limit is set to zero on entry to SMM. */
+ dt.address = dt.size = 0;
+ kvm_x86_call(set_idt)(vcpu, &dt);
+
+ if (WARN_ON_ONCE(kvm_set_dr(vcpu, 7, DR7_FIXED_1)))
+ goto error;
+
+ cs.selector = (vcpu->arch.smbase >> 4) & 0xffff;
+ cs.base = vcpu->arch.smbase;
+
+ ds.selector = 0;
+ ds.base = 0;
+
+ cs.limit = ds.limit = 0xffffffff;
+ cs.type = ds.type = 0x3;
+ cs.dpl = ds.dpl = 0;
+ cs.db = ds.db = 0;
+ cs.s = ds.s = 1;
+ cs.l = ds.l = 0;
+ cs.g = ds.g = 1;
+ cs.avl = ds.avl = 0;
+ cs.present = ds.present = 1;
+ cs.unusable = ds.unusable = 0;
+ cs.padding = ds.padding = 0;
+
+ kvm_set_segment(vcpu, &cs, VCPU_SREG_CS);
+ kvm_set_segment(vcpu, &ds, VCPU_SREG_DS);
+ kvm_set_segment(vcpu, &ds, VCPU_SREG_ES);
+ kvm_set_segment(vcpu, &ds, VCPU_SREG_FS);
+ kvm_set_segment(vcpu, &ds, VCPU_SREG_GS);
+ kvm_set_segment(vcpu, &ds, VCPU_SREG_SS);
+
+#ifdef CONFIG_X86_64
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_LM))
+ if (kvm_x86_call(set_efer)(vcpu, 0))
+ goto error;
+#endif
+
+ vcpu->arch.cpuid_dynamic_bits_dirty = true;
+ kvm_mmu_reset_context(vcpu);
+ return;
+error:
+ kvm_vm_dead(vcpu->kvm);
+}
+
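+/*
+ * Inverse of enter_smm_get_segment_flags(): unpack the descriptor attributes
+ * saved in the SMRAM image back into a struct kvm_segment.
+ */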
+static void rsm_set_desc_flags(struct kvm_segment *desc, u32 flags)
+{
+ desc->g = (flags >> 23) & 1;
+ desc->db = (flags >> 22) & 1;
+ desc->l = (flags >> 21) & 1;
+ desc->avl = (flags >> 20) & 1;
+ desc->present = (flags >> 15) & 1;
+ desc->dpl = (flags >> 13) & 3;
+ desc->s = (flags >> 12) & 1;
+ desc->type = (flags >> 8) & 15;
+
+ desc->unusable = !desc->present;
+ desc->padding = 0;
+}
+
+static int rsm_load_seg_32(struct kvm_vcpu *vcpu,
+ const struct kvm_smm_seg_state_32 *state,
+ u16 selector, int n)
+{
+ struct kvm_segment desc;
+
+ desc.selector = selector;
+ desc.base = state->base;
+ desc.limit = state->limit;
+ rsm_set_desc_flags(&desc, state->flags);
+ kvm_set_segment(vcpu, &desc, n);
+ return X86EMUL_CONTINUE;
+}
+
+#ifdef CONFIG_X86_64
+
+static int rsm_load_seg_64(struct kvm_vcpu *vcpu,
+ const struct kvm_smm_seg_state_64 *state,
+ int n)
+{
+ struct kvm_segment desc;
+
+ desc.selector = state->selector;
+ rsm_set_desc_flags(&desc, state->attributes << 8);
+ desc.limit = state->limit;
+ desc.base = state->base;
+ kvm_set_segment(vcpu, &desc, n);
+ return X86EMUL_CONTINUE;
+}
+#endif
+
+static int rsm_enter_protected_mode(struct kvm_vcpu *vcpu,
+ u64 cr0, u64 cr3, u64 cr4)
+{
+ int bad;
+ u64 pcid;
+
+ /* In order to later set CR4.PCIDE, CR3[11:0] must be zero. */
+ pcid = 0;
+ if (cr4 & X86_CR4_PCIDE) {
+ pcid = cr3 & 0xfff;
+ cr3 &= ~0xfff;
+ }
+
+ bad = kvm_set_cr3(vcpu, cr3);
+ if (bad)
+ return X86EMUL_UNHANDLEABLE;
+
+ /*
+ * First enable PAE, long mode needs it before CR0.PG = 1 is set.
+ * Then enable protected mode. However, PCID cannot be enabled
+ * if EFER.LMA=0, so set it separately.
+ */
+ bad = kvm_set_cr4(vcpu, cr4 & ~X86_CR4_PCIDE);
+ if (bad)
+ return X86EMUL_UNHANDLEABLE;
+
+ bad = kvm_set_cr0(vcpu, cr0);
+ if (bad)
+ return X86EMUL_UNHANDLEABLE;
+
+ if (cr4 & X86_CR4_PCIDE) {
+ bad = kvm_set_cr4(vcpu, cr4);
+ if (bad)
+ return X86EMUL_UNHANDLEABLE;
+ if (pcid) {
+ bad = kvm_set_cr3(vcpu, cr3 | pcid);
+ if (bad)
+ return X86EMUL_UNHANDLEABLE;
+ }
+
+ }
+
+ return X86EMUL_CONTINUE;
+}
+
+static int rsm_load_state_32(struct x86_emulate_ctxt *ctxt,
+ const struct kvm_smram_state_32 *smstate)
+{
+ struct kvm_vcpu *vcpu = ctxt->vcpu;
+ struct desc_ptr dt;
+ int i, r;
+
+ ctxt->eflags = smstate->eflags | X86_EFLAGS_FIXED;
+ ctxt->_eip = smstate->eip;
+
+ for (i = 0; i < 8; i++)
+ *reg_write(ctxt, i) = smstate->gprs[i];
+
+ if (kvm_set_dr(vcpu, 6, smstate->dr6))
+ return X86EMUL_UNHANDLEABLE;
+ if (kvm_set_dr(vcpu, 7, smstate->dr7))
+ return X86EMUL_UNHANDLEABLE;
+
+ rsm_load_seg_32(vcpu, &smstate->tr, smstate->tr_sel, VCPU_SREG_TR);
+ rsm_load_seg_32(vcpu, &smstate->ldtr, smstate->ldtr_sel, VCPU_SREG_LDTR);
+
+ dt.address = smstate->gdtr.base;
+ dt.size = smstate->gdtr.limit;
+ kvm_x86_call(set_gdt)(vcpu, &dt);
+
+ dt.address = smstate->idtr.base;
+ dt.size = smstate->idtr.limit;
+ kvm_x86_call(set_idt)(vcpu, &dt);
+
+ rsm_load_seg_32(vcpu, &smstate->es, smstate->es_sel, VCPU_SREG_ES);
+ rsm_load_seg_32(vcpu, &smstate->cs, smstate->cs_sel, VCPU_SREG_CS);
+ rsm_load_seg_32(vcpu, &smstate->ss, smstate->ss_sel, VCPU_SREG_SS);
+
+ rsm_load_seg_32(vcpu, &smstate->ds, smstate->ds_sel, VCPU_SREG_DS);
+ rsm_load_seg_32(vcpu, &smstate->fs, smstate->fs_sel, VCPU_SREG_FS);
+ rsm_load_seg_32(vcpu, &smstate->gs, smstate->gs_sel, VCPU_SREG_GS);
+
+ vcpu->arch.smbase = smstate->smbase;
+
+ r = rsm_enter_protected_mode(vcpu, smstate->cr0,
+ smstate->cr3, smstate->cr4);
+
+ if (r != X86EMUL_CONTINUE)
+ return r;
+
+ kvm_x86_call(set_interrupt_shadow)(vcpu, 0);
+ ctxt->interruptibility = (u8)smstate->int_shadow;
+
+ return r;
+}
+
+#ifdef CONFIG_X86_64
+static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
+ const struct kvm_smram_state_64 *smstate)
+{
+ struct kvm_vcpu *vcpu = ctxt->vcpu;
+ struct desc_ptr dt;
+ int i, r;
+
+ for (i = 0; i < 16; i++)
+ *reg_write(ctxt, i) = smstate->gprs[15 - i];
+
+ ctxt->_eip = smstate->rip;
+ ctxt->eflags = smstate->rflags | X86_EFLAGS_FIXED;
+
+ if (kvm_set_dr(vcpu, 6, smstate->dr6))
+ return X86EMUL_UNHANDLEABLE;
+ if (kvm_set_dr(vcpu, 7, smstate->dr7))
+ return X86EMUL_UNHANDLEABLE;
+
+ vcpu->arch.smbase = smstate->smbase;
+
+ if (__kvm_emulate_msr_write(vcpu, MSR_EFER, smstate->efer & ~EFER_LMA))
+ return X86EMUL_UNHANDLEABLE;
+
+ rsm_load_seg_64(vcpu, &smstate->tr, VCPU_SREG_TR);
+
+ dt.size = smstate->idtr.limit;
+ dt.address = smstate->idtr.base;
+ kvm_x86_call(set_idt)(vcpu, &dt);
+
+ rsm_load_seg_64(vcpu, &smstate->ldtr, VCPU_SREG_LDTR);
+
+ dt.size = smstate->gdtr.limit;
+ dt.address = smstate->gdtr.base;
+ kvm_x86_call(set_gdt)(vcpu, &dt);
+
+ r = rsm_enter_protected_mode(vcpu, smstate->cr0, smstate->cr3, smstate->cr4);
+ if (r != X86EMUL_CONTINUE)
+ return r;
+
+ rsm_load_seg_64(vcpu, &smstate->es, VCPU_SREG_ES);
+ rsm_load_seg_64(vcpu, &smstate->cs, VCPU_SREG_CS);
+ rsm_load_seg_64(vcpu, &smstate->ss, VCPU_SREG_SS);
+ rsm_load_seg_64(vcpu, &smstate->ds, VCPU_SREG_DS);
+ rsm_load_seg_64(vcpu, &smstate->fs, VCPU_SREG_FS);
+ rsm_load_seg_64(vcpu, &smstate->gs, VCPU_SREG_GS);
+
+ kvm_x86_call(set_interrupt_shadow)(vcpu, 0);
+ ctxt->interruptibility = (u8)smstate->int_shadow;
+
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK) &&
+ kvm_msr_write(vcpu, MSR_KVM_INTERNAL_GUEST_SSP, smstate->ssp))
+ return X86EMUL_UNHANDLEABLE;
+
+ return X86EMUL_CONTINUE;
+}
+#endif
+
+int emulator_leave_smm(struct x86_emulate_ctxt *ctxt)
+{
+ struct kvm_vcpu *vcpu = ctxt->vcpu;
+ unsigned long cr0;
+ union kvm_smram smram;
+ u64 smbase;
+ int ret;
+
+ smbase = vcpu->arch.smbase;
+
+ ret = kvm_vcpu_read_guest(vcpu, smbase + 0xfe00, smram.bytes, sizeof(smram));
+ if (ret < 0)
+ return X86EMUL_UNHANDLEABLE;
+
+ if ((vcpu->arch.hflags & HF_SMM_INSIDE_NMI_MASK) == 0)
+ kvm_x86_call(set_nmi_mask)(vcpu, false);
+
+ kvm_smm_changed(vcpu, false);
+
+ /*
+ * Get back to real mode, to prepare a safe state in which to load
+ * CR0/CR3/CR4/EFER. It's all a bit more complicated if the vCPU
+ * supports long mode.
+ */
+#ifdef CONFIG_X86_64
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_LM)) {
+ struct kvm_segment cs_desc;
+ unsigned long cr4;
+
+ /* Zero CR4.PCIDE before CR0.PG. */
+ cr4 = kvm_read_cr4(vcpu);
+ if (cr4 & X86_CR4_PCIDE)
+ kvm_set_cr4(vcpu, cr4 & ~X86_CR4_PCIDE);
+
+ /* A 32-bit code segment is required to clear EFER.LMA. */
+ memset(&cs_desc, 0, sizeof(cs_desc));
+ cs_desc.type = 0xb;
+ cs_desc.s = cs_desc.g = cs_desc.present = 1;
+ kvm_set_segment(vcpu, &cs_desc, VCPU_SREG_CS);
+ }
+#endif
+
+ /* For the 64-bit case, this will clear EFER.LMA. */
+ cr0 = kvm_read_cr0(vcpu);
+ if (cr0 & X86_CR0_PE)
+ kvm_set_cr0(vcpu, cr0 & ~(X86_CR0_PG | X86_CR0_PE));
+
+#ifdef CONFIG_X86_64
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_LM)) {
+ unsigned long cr4, efer;
+
+ /* Clear CR4.PAE before clearing EFER.LME. */
+ cr4 = kvm_read_cr4(vcpu);
+ if (cr4 & X86_CR4_PAE)
+ kvm_set_cr4(vcpu, cr4 & ~X86_CR4_PAE);
+
+ /* And finally go back to 32-bit mode. */
+ efer = 0;
+ __kvm_emulate_msr_write(vcpu, MSR_EFER, efer);
+ }
+#endif
+
+ /*
+ * FIXME: When resuming L2 (a.k.a. guest mode), the transition to guest
+ * mode should happen _after_ loading state from SMRAM. However, KVM
+ * piggybacks the nested VM-Enter flows (which is wrong for many other
+ * reasons), and so nSVM/nVMX would clobber state that is loaded from
+ * SMRAM and from the VMCS/VMCB.
+ */
+ if (kvm_x86_call(leave_smm)(vcpu, &smram))
+ return X86EMUL_UNHANDLEABLE;
+
+#ifdef CONFIG_X86_64
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_LM))
+ ret = rsm_load_state_64(ctxt, &smram.smram64);
+ else
+#endif
+ ret = rsm_load_state_32(ctxt, &smram.smram32);
+
+ /*
+ * If RSM fails and triggers shutdown, architecturally the shutdown
+ * occurs *before* the transition to guest mode. But due to KVM's
+ * flawed handling of RSM to L2 (see above), the vCPU may already be
+ * in_guest_mode(). Force the vCPU out of guest mode before delivering
+ * the shutdown, so that L1 enters shutdown instead of seeing a VM-Exit
+ * that architecturally shouldn't be possible.
+ */
+ if (ret != X86EMUL_CONTINUE && is_guest_mode(vcpu))
+ kvm_leave_nested(vcpu);
+ return ret;
+}
diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
new file mode 100644
index 000000000000..db3c88f16138
--- /dev/null
+++ b/arch/x86/kvm/smm.h
@@ -0,0 +1,171 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ASM_KVM_SMM_H
+#define ASM_KVM_SMM_H
+
+#include <linux/build_bug.h>
+
+#ifdef CONFIG_KVM_SMM
+
+
+/*
+ * KVM's emulated 32-bit SMM layout, based on the Intel P6 layout
+ * (https://www.sandpile.org/x86/smm.htm).
+ */
+
+struct kvm_smm_seg_state_32 {
+ u32 flags;
+ u32 limit;
+ u32 base;
+} __packed;
+
+struct kvm_smram_state_32 {
+ u32 reserved1[62];
+ u32 smbase;
+ u32 smm_revision;
+ u16 io_inst_restart;
+ u16 auto_hlt_restart;
+ u32 io_restart_rdi;
+ u32 io_restart_rcx;
+ u32 io_restart_rsi;
+ u32 io_restart_rip;
+ u32 cr4;
+
+ /* A20M#, CPL, shutdown and other reserved/undocumented fields */
+ u16 reserved2;
+ u8 int_shadow; /* KVM extension */
+ u8 reserved3[17];
+
+ struct kvm_smm_seg_state_32 ds;
+ struct kvm_smm_seg_state_32 fs;
+ struct kvm_smm_seg_state_32 gs;
+ struct kvm_smm_seg_state_32 idtr; /* IDTR has only base and limit */
+ struct kvm_smm_seg_state_32 tr;
+ u32 reserved;
+ struct kvm_smm_seg_state_32 gdtr; /* GDTR has only base and limit */
+ struct kvm_smm_seg_state_32 ldtr;
+ struct kvm_smm_seg_state_32 es;
+ struct kvm_smm_seg_state_32 cs;
+ struct kvm_smm_seg_state_32 ss;
+
+ u32 es_sel;
+ u32 cs_sel;
+ u32 ss_sel;
+ u32 ds_sel;
+ u32 fs_sel;
+ u32 gs_sel;
+ u32 ldtr_sel;
+ u32 tr_sel;
+
+ u32 dr7;
+ u32 dr6;
+ u32 gprs[8]; /* GPRS in the "natural" X86 order (EAX/ECX/EDX.../EDI) */
+ u32 eip;
+ u32 eflags;
+ u32 cr3;
+ u32 cr0;
+} __packed;
+
+
+/* KVM's emulated 64-bit SMM layout, based on the AMD64 layout. */
+
+struct kvm_smm_seg_state_64 {
+ u16 selector;
+ u16 attributes;
+ u32 limit;
+ u64 base;
+};
+
+struct kvm_smram_state_64 {
+
+ struct kvm_smm_seg_state_64 es;
+ struct kvm_smm_seg_state_64 cs;
+ struct kvm_smm_seg_state_64 ss;
+ struct kvm_smm_seg_state_64 ds;
+ struct kvm_smm_seg_state_64 fs;
+ struct kvm_smm_seg_state_64 gs;
+	struct kvm_smm_seg_state_64 gdtr; /* GDTR has only base and limit */
+	struct kvm_smm_seg_state_64 ldtr;
+	struct kvm_smm_seg_state_64 idtr; /* IDTR has only base and limit */
+ struct kvm_smm_seg_state_64 tr;
+
+ /* I/O restart and auto halt restart are not implemented by KVM */
+ u64 io_restart_rip;
+ u64 io_restart_rcx;
+ u64 io_restart_rsi;
+ u64 io_restart_rdi;
+ u32 io_restart_dword;
+ u32 reserved1;
+ u8 io_inst_restart;
+ u8 auto_hlt_restart;
+ u8 amd_nmi_mask; /* Documented in AMD BKDG as NMI mask, not used by KVM */
+ u8 int_shadow;
+ u32 reserved2;
+
+ u64 efer;
+
+	/*
+	 * The two fields below are implemented on AMD only; they store the
+	 * SVM guest's VMCB address if the #SMI was received while in guest
+	 * mode.
+	 */
+ u64 svm_guest_flag;
+ u64 svm_guest_vmcb_gpa;
+ u64 svm_guest_virtual_int; /* unknown purpose, not implemented */
+
+ u32 reserved3[3];
+	u32 smm_revision;
+ u32 smbase;
+ u32 reserved4[5];
+
+ u64 ssp;
+ /* svm_* fields below are not implemented by KVM */
+ u64 svm_guest_pat;
+ u64 svm_host_efer;
+ u64 svm_host_cr4;
+ u64 svm_host_cr3;
+ u64 svm_host_cr0;
+
+ u64 cr4;
+ u64 cr3;
+ u64 cr0;
+ u64 dr7;
+ u64 dr6;
+ u64 rflags;
+ u64 rip;
+	u64 gprs[16]; /* GPRS in the reversed "natural" X86 order (R15/R14/.../RCX/RAX) */
+};
+
+union kvm_smram {
+ struct kvm_smram_state_64 smram64;
+ struct kvm_smram_state_32 smram32;
+ u8 bytes[512];
+};
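+
+/*
+ * Access sketch (mirroring enter_smm() and emulator_leave_smm() in smm.c):
+ * the entire 512-byte image lives at smbase + 0xfe00 in guest memory:
+ *
+ *	union kvm_smram smram;
+ *
+ *	kvm_vcpu_read_guest(vcpu, vcpu->arch.smbase + 0xfe00,
+ *			    smram.bytes, sizeof(smram));
+ */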
+
+static inline int kvm_inject_smi(struct kvm_vcpu *vcpu)
+{
+ if (!kvm_x86_call(has_emulated_msr)(vcpu->kvm, MSR_IA32_SMBASE))
+ return -ENOTTY;
+
+ kvm_make_request(KVM_REQ_SMI, vcpu);
+ return 0;
+}
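+
+/*
+ * Illustrative caller (assumed, matching the KVM_SMI vCPU ioctl path in
+ * x86.c):
+ *
+ *	case KVM_SMI:
+ *		r = kvm_inject_smi(vcpu);
+ *		break;
+ */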
+
+static inline bool is_smm(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.hflags & HF_SMM_MASK;
+}
+
+void kvm_smm_changed(struct kvm_vcpu *vcpu, bool in_smm);
+void enter_smm(struct kvm_vcpu *vcpu);
+int emulator_leave_smm(struct x86_emulate_ctxt *ctxt);
+void process_smi(struct kvm_vcpu *vcpu);
+#else
+static inline int kvm_inject_smi(struct kvm_vcpu *vcpu) { return -ENOTTY; }
+static inline bool is_smm(struct kvm_vcpu *vcpu) { return false; }
+
+/*
+ * emulator_leave_smm is used as a function pointer, so the
+ * stub is defined in x86.c.
+ */
+#endif /* CONFIG_KVM_SMM */
+
+#endif /* ASM_KVM_SMM_H */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
deleted file mode 100644
index ce438e0fdd26..000000000000
--- a/arch/x86/kvm/svm.c
+++ /dev/null
@@ -1,3428 +0,0 @@
-/*
- * Kernel-based Virtual Machine driver for Linux
- *
- * AMD SVM support
- *
- * Copyright (C) 2006 Qumranet, Inc.
- *
- * Authors:
- * Yaniv Kamay <yaniv@qumranet.com>
- * Avi Kivity <avi@qumranet.com>
- *
- * This work is licensed under the terms of the GNU GPL, version 2. See
- * the COPYING file in the top-level directory.
- *
- */
-#include <linux/kvm_host.h>
-
-#include "irq.h"
-#include "mmu.h"
-#include "kvm_cache_regs.h"
-#include "x86.h"
-
-#include <linux/module.h>
-#include <linux/kernel.h>
-#include <linux/vmalloc.h>
-#include <linux/highmem.h>
-#include <linux/sched.h>
-#include <linux/ftrace_event.h>
-#include <linux/slab.h>
-
-#include <asm/tlbflush.h>
-#include <asm/desc.h>
-
-#include <asm/virtext.h>
-#include "trace.h"
-
-#define __ex(x) __kvm_handle_fault_on_reboot(x)
-
-MODULE_AUTHOR("Qumranet");
-MODULE_LICENSE("GPL");
-
-#define IOPM_ALLOC_ORDER 2
-#define MSRPM_ALLOC_ORDER 1
-
-#define SEG_TYPE_LDT 2
-#define SEG_TYPE_BUSY_TSS16 3
-
-#define SVM_FEATURE_NPT (1 << 0)
-#define SVM_FEATURE_LBRV (1 << 1)
-#define SVM_FEATURE_SVML (1 << 2)
-#define SVM_FEATURE_NRIP (1 << 3)
-#define SVM_FEATURE_PAUSE_FILTER (1 << 10)
-
-#define NESTED_EXIT_HOST 0 /* Exit handled on host level */
-#define NESTED_EXIT_DONE 1 /* Exit caused nested vmexit */
-#define NESTED_EXIT_CONTINUE 2 /* Further checks needed */
-
-#define DEBUGCTL_RESERVED_BITS (~(0x3fULL))
-
-static bool erratum_383_found __read_mostly;
-
-static const u32 host_save_user_msrs[] = {
-#ifdef CONFIG_X86_64
- MSR_STAR, MSR_LSTAR, MSR_CSTAR, MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE,
- MSR_FS_BASE,
-#endif
- MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
-};
-
-#define NR_HOST_SAVE_USER_MSRS ARRAY_SIZE(host_save_user_msrs)
-
-struct kvm_vcpu;
-
-struct nested_state {
- struct vmcb *hsave;
- u64 hsave_msr;
- u64 vm_cr_msr;
- u64 vmcb;
-
- /* These are the merged vectors */
- u32 *msrpm;
-
- /* gpa pointers to the real vectors */
- u64 vmcb_msrpm;
- u64 vmcb_iopm;
-
- /* A VMEXIT is required but not yet emulated */
- bool exit_required;
-
- /* cache for intercepts of the guest */
- u16 intercept_cr_read;
- u16 intercept_cr_write;
- u16 intercept_dr_read;
- u16 intercept_dr_write;
- u32 intercept_exceptions;
- u64 intercept;
-
-};
-
-#define MSRPM_OFFSETS 16
-static u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
-
-struct vcpu_svm {
- struct kvm_vcpu vcpu;
- struct vmcb *vmcb;
- unsigned long vmcb_pa;
- struct svm_cpu_data *svm_data;
- uint64_t asid_generation;
- uint64_t sysenter_esp;
- uint64_t sysenter_eip;
-
- u64 next_rip;
-
- u64 host_user_msrs[NR_HOST_SAVE_USER_MSRS];
- u64 host_gs_base;
-
- u32 *msrpm;
-
- struct nested_state nested;
-
- bool nmi_singlestep;
-
- unsigned int3_injected;
- unsigned long int3_rip;
-};
-
-#define MSR_INVALID 0xffffffffU
-
-static struct svm_direct_access_msrs {
- u32 index; /* Index of the MSR */
- bool always; /* True if intercept is always on */
-} direct_access_msrs[] = {
- { .index = MSR_K6_STAR, .always = true },
- { .index = MSR_IA32_SYSENTER_CS, .always = true },
-#ifdef CONFIG_X86_64
- { .index = MSR_GS_BASE, .always = true },
- { .index = MSR_FS_BASE, .always = true },
- { .index = MSR_KERNEL_GS_BASE, .always = true },
- { .index = MSR_LSTAR, .always = true },
- { .index = MSR_CSTAR, .always = true },
- { .index = MSR_SYSCALL_MASK, .always = true },
-#endif
- { .index = MSR_IA32_LASTBRANCHFROMIP, .always = false },
- { .index = MSR_IA32_LASTBRANCHTOIP, .always = false },
- { .index = MSR_IA32_LASTINTFROMIP, .always = false },
- { .index = MSR_IA32_LASTINTTOIP, .always = false },
- { .index = MSR_INVALID, .always = false },
-};
-
-/* enable NPT for AMD64 and X86 with PAE */
-#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
-static bool npt_enabled = true;
-#else
-static bool npt_enabled;
-#endif
-static int npt = 1;
-
-module_param(npt, int, S_IRUGO);
-
-static int nested = 1;
-module_param(nested, int, S_IRUGO);
-
-static void svm_flush_tlb(struct kvm_vcpu *vcpu);
-static void svm_complete_interrupts(struct vcpu_svm *svm);
-
-static int nested_svm_exit_handled(struct vcpu_svm *svm);
-static int nested_svm_intercept(struct vcpu_svm *svm);
-static int nested_svm_vmexit(struct vcpu_svm *svm);
-static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
- bool has_error_code, u32 error_code);
-
-static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
-{
- return container_of(vcpu, struct vcpu_svm, vcpu);
-}
-
-static inline bool is_nested(struct vcpu_svm *svm)
-{
- return svm->nested.vmcb;
-}
-
-static inline void enable_gif(struct vcpu_svm *svm)
-{
- svm->vcpu.arch.hflags |= HF_GIF_MASK;
-}
-
-static inline void disable_gif(struct vcpu_svm *svm)
-{
- svm->vcpu.arch.hflags &= ~HF_GIF_MASK;
-}
-
-static inline bool gif_set(struct vcpu_svm *svm)
-{
- return !!(svm->vcpu.arch.hflags & HF_GIF_MASK);
-}
-
-static unsigned long iopm_base;
-
-struct kvm_ldttss_desc {
- u16 limit0;
- u16 base0;
- unsigned base1:8, type:5, dpl:2, p:1;
- unsigned limit1:4, zero0:3, g:1, base2:8;
- u32 base3;
- u32 zero1;
-} __attribute__((packed));
-
-struct svm_cpu_data {
- int cpu;
-
- u64 asid_generation;
- u32 max_asid;
- u32 next_asid;
- struct kvm_ldttss_desc *tss_desc;
-
- struct page *save_area;
-};
-
-static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
-static uint32_t svm_features;
-
-struct svm_init_data {
- int cpu;
- int r;
-};
-
-static u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000};
-
-#define NUM_MSR_MAPS ARRAY_SIZE(msrpm_ranges)
-#define MSRS_RANGE_SIZE 2048
-#define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2)
-
-static u32 svm_msrpm_offset(u32 msr)
-{
- u32 offset;
- int i;
-
- for (i = 0; i < NUM_MSR_MAPS; i++) {
- if (msr < msrpm_ranges[i] ||
- msr >= msrpm_ranges[i] + MSRS_IN_RANGE)
- continue;
-
- offset = (msr - msrpm_ranges[i]) / 4; /* 4 msrs per u8 */
- offset += (i * MSRS_RANGE_SIZE); /* add range offset */
-
- /* Now we have the u8 offset - but need the u32 offset */
- return offset / 4;
- }
-
- /* MSR not in any range */
- return MSR_INVALID;
-}
-
-#define MAX_INST_SIZE 15
-
-static inline u32 svm_has(u32 feat)
-{
- return svm_features & feat;
-}
-
-static inline void clgi(void)
-{
- asm volatile (__ex(SVM_CLGI));
-}
-
-static inline void stgi(void)
-{
- asm volatile (__ex(SVM_STGI));
-}
-
-static inline void invlpga(unsigned long addr, u32 asid)
-{
- asm volatile (__ex(SVM_INVLPGA) : : "a"(addr), "c"(asid));
-}
-
-static inline void force_new_asid(struct kvm_vcpu *vcpu)
-{
- to_svm(vcpu)->asid_generation--;
-}
-
-static inline void flush_guest_tlb(struct kvm_vcpu *vcpu)
-{
- force_new_asid(vcpu);
-}
-
-static void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
-{
- if (!npt_enabled && !(efer & EFER_LMA))
- efer &= ~EFER_LME;
-
- to_svm(vcpu)->vmcb->save.efer = efer | EFER_SVME;
- vcpu->arch.efer = efer;
-}
-
-static int is_external_interrupt(u32 info)
-{
- info &= SVM_EVTINJ_TYPE_MASK | SVM_EVTINJ_VALID;
- return info == (SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR);
-}
-
-static u32 svm_get_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- u32 ret = 0;
-
- if (svm->vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK)
- ret |= KVM_X86_SHADOW_INT_STI | KVM_X86_SHADOW_INT_MOV_SS;
- return ret & mask;
-}
-
-static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- if (mask == 0)
- svm->vmcb->control.int_state &= ~SVM_INTERRUPT_SHADOW_MASK;
- else
- svm->vmcb->control.int_state |= SVM_INTERRUPT_SHADOW_MASK;
-
-}
-
-static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- if (svm->vmcb->control.next_rip != 0)
- svm->next_rip = svm->vmcb->control.next_rip;
-
- if (!svm->next_rip) {
- if (emulate_instruction(vcpu, 0, 0, EMULTYPE_SKIP) !=
- EMULATE_DONE)
- printk(KERN_DEBUG "%s: NOP\n", __func__);
- return;
- }
- if (svm->next_rip - kvm_rip_read(vcpu) > MAX_INST_SIZE)
- printk(KERN_ERR "%s: ip 0x%lx next 0x%llx\n",
- __func__, kvm_rip_read(vcpu), svm->next_rip);
-
- kvm_rip_write(vcpu, svm->next_rip);
- svm_set_interrupt_shadow(vcpu, 0);
-}
-
-static void svm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
- bool has_error_code, u32 error_code,
- bool reinject)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- /*
- * If we are within a nested VM we'd better #VMEXIT and let the guest
- * handle the exception
- */
- if (!reinject &&
- nested_svm_check_exception(svm, nr, has_error_code, error_code))
- return;
-
- if (nr == BP_VECTOR && !svm_has(SVM_FEATURE_NRIP)) {
- unsigned long rip, old_rip = kvm_rip_read(&svm->vcpu);
-
- /*
- * For guest debugging where we have to reinject #BP if some
- * INT3 is guest-owned:
- * Emulate nRIP by moving RIP forward. Will fail if injection
- * raises a fault that is not intercepted. Still better than
- * failing in all cases.
- */
- skip_emulated_instruction(&svm->vcpu);
- rip = kvm_rip_read(&svm->vcpu);
- svm->int3_rip = rip + svm->vmcb->save.cs.base;
- svm->int3_injected = rip - old_rip;
- }
-
- svm->vmcb->control.event_inj = nr
- | SVM_EVTINJ_VALID
- | (has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
- | SVM_EVTINJ_TYPE_EXEPT;
- svm->vmcb->control.event_inj_err = error_code;
-}
-
-static void svm_init_erratum_383(void)
-{
- u32 low, high;
- int err;
- u64 val;
-
- /* Only Fam10h is affected */
- if (boot_cpu_data.x86 != 0x10)
- return;
-
- /* Use _safe variants to not break nested virtualization */
- val = native_read_msr_safe(MSR_AMD64_DC_CFG, &err);
- if (err)
- return;
-
- val |= (1ULL << 47);
-
- low = lower_32_bits(val);
- high = upper_32_bits(val);
-
- native_write_msr_safe(MSR_AMD64_DC_CFG, low, high);
-
- erratum_383_found = true;
-}
-
-static int has_svm(void)
-{
- const char *msg;
-
- if (!cpu_has_svm(&msg)) {
- printk(KERN_INFO "has_svm: %s\n", msg);
- return 0;
- }
-
- return 1;
-}
-
-static void svm_hardware_disable(void *garbage)
-{
- cpu_svm_disable();
-}
-
-static int svm_hardware_enable(void *garbage)
-{
-
- struct svm_cpu_data *sd;
- uint64_t efer;
- struct desc_ptr gdt_descr;
- struct desc_struct *gdt;
- int me = raw_smp_processor_id();
-
- rdmsrl(MSR_EFER, efer);
- if (efer & EFER_SVME)
- return -EBUSY;
-
- if (!has_svm()) {
- printk(KERN_ERR "svm_hardware_enable: err EOPNOTSUPP on %d\n",
- me);
- return -EINVAL;
- }
- sd = per_cpu(svm_data, me);
-
- if (!sd) {
- printk(KERN_ERR "svm_hardware_enable: svm_data is NULL on %d\n",
- me);
- return -EINVAL;
- }
-
- sd->asid_generation = 1;
- sd->max_asid = cpuid_ebx(SVM_CPUID_FUNC) - 1;
- sd->next_asid = sd->max_asid + 1;
-
- native_store_gdt(&gdt_descr);
- gdt = (struct desc_struct *)gdt_descr.address;
- sd->tss_desc = (struct kvm_ldttss_desc *)(gdt + GDT_ENTRY_TSS);
-
- wrmsrl(MSR_EFER, efer | EFER_SVME);
-
- wrmsrl(MSR_VM_HSAVE_PA, page_to_pfn(sd->save_area) << PAGE_SHIFT);
-
- svm_init_erratum_383();
-
- return 0;
-}
-
-static void svm_cpu_uninit(int cpu)
-{
- struct svm_cpu_data *sd = per_cpu(svm_data, raw_smp_processor_id());
-
- if (!sd)
- return;
-
- per_cpu(svm_data, raw_smp_processor_id()) = NULL;
- __free_page(sd->save_area);
- kfree(sd);
-}
-
-static int svm_cpu_init(int cpu)
-{
- struct svm_cpu_data *sd;
- int r;
-
- sd = kzalloc(sizeof(struct svm_cpu_data), GFP_KERNEL);
- if (!sd)
- return -ENOMEM;
- sd->cpu = cpu;
- sd->save_area = alloc_page(GFP_KERNEL);
- r = -ENOMEM;
- if (!sd->save_area)
- goto err_1;
-
- per_cpu(svm_data, cpu) = sd;
-
- return 0;
-
-err_1:
- kfree(sd);
- return r;
-
-}
-
-static bool valid_msr_intercept(u32 index)
-{
- int i;
-
- for (i = 0; direct_access_msrs[i].index != MSR_INVALID; i++)
- if (direct_access_msrs[i].index == index)
- return true;
-
- return false;
-}
-
-static void set_msr_interception(u32 *msrpm, unsigned msr,
- int read, int write)
-{
- u8 bit_read, bit_write;
- unsigned long tmp;
- u32 offset;
-
- /*
- * If this warning triggers extend the direct_access_msrs list at the
- * beginning of the file
- */
- WARN_ON(!valid_msr_intercept(msr));
-
- offset = svm_msrpm_offset(msr);
- bit_read = 2 * (msr & 0x0f);
- bit_write = 2 * (msr & 0x0f) + 1;
- tmp = msrpm[offset];
-
- BUG_ON(offset == MSR_INVALID);
-
- read ? clear_bit(bit_read, &tmp) : set_bit(bit_read, &tmp);
- write ? clear_bit(bit_write, &tmp) : set_bit(bit_write, &tmp);
-
- msrpm[offset] = tmp;
-}
-
-static void svm_vcpu_init_msrpm(u32 *msrpm)
-{
- int i;
-
- memset(msrpm, 0xff, PAGE_SIZE * (1 << MSRPM_ALLOC_ORDER));
-
- for (i = 0; direct_access_msrs[i].index != MSR_INVALID; i++) {
- if (!direct_access_msrs[i].always)
- continue;
-
- set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
- }
-}
-
-static void add_msr_offset(u32 offset)
-{
- int i;
-
- for (i = 0; i < MSRPM_OFFSETS; ++i) {
-
- /* Offset already in list? */
- if (msrpm_offsets[i] == offset)
- return;
-
- /* Slot used by another offset? */
- if (msrpm_offsets[i] != MSR_INVALID)
- continue;
-
- /* Add offset to list */
- msrpm_offsets[i] = offset;
-
- return;
- }
-
- /*
- * If this BUG triggers the msrpm_offsets table has an overflow. Just
- * increase MSRPM_OFFSETS in this case.
- */
- BUG();
-}
-
-static void init_msrpm_offsets(void)
-{
- int i;
-
- memset(msrpm_offsets, 0xff, sizeof(msrpm_offsets));
-
- for (i = 0; direct_access_msrs[i].index != MSR_INVALID; i++) {
- u32 offset;
-
- offset = svm_msrpm_offset(direct_access_msrs[i].index);
- BUG_ON(offset == MSR_INVALID);
-
- add_msr_offset(offset);
- }
-}
-
-static void svm_enable_lbrv(struct vcpu_svm *svm)
-{
- u32 *msrpm = svm->msrpm;
-
- svm->vmcb->control.lbr_ctl = 1;
- set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1);
- set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1);
- set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
- set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
-}
-
-static void svm_disable_lbrv(struct vcpu_svm *svm)
-{
- u32 *msrpm = svm->msrpm;
-
- svm->vmcb->control.lbr_ctl = 0;
- set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0);
- set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0);
- set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 0, 0);
- set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 0, 0);
-}
-
-static __init int svm_hardware_setup(void)
-{
- int cpu;
- struct page *iopm_pages;
- void *iopm_va;
- int r;
-
- iopm_pages = alloc_pages(GFP_KERNEL, IOPM_ALLOC_ORDER);
-
- if (!iopm_pages)
- return -ENOMEM;
-
- iopm_va = page_address(iopm_pages);
- memset(iopm_va, 0xff, PAGE_SIZE * (1 << IOPM_ALLOC_ORDER));
- iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT;
-
- init_msrpm_offsets();
-
- if (boot_cpu_has(X86_FEATURE_NX))
- kvm_enable_efer_bits(EFER_NX);
-
- if (boot_cpu_has(X86_FEATURE_FXSR_OPT))
- kvm_enable_efer_bits(EFER_FFXSR);
-
- if (nested) {
- printk(KERN_INFO "kvm: Nested Virtualization enabled\n");
- kvm_enable_efer_bits(EFER_SVME);
- }
-
- for_each_possible_cpu(cpu) {
- r = svm_cpu_init(cpu);
- if (r)
- goto err;
- }
-
- svm_features = cpuid_edx(SVM_CPUID_FUNC);
-
- if (!svm_has(SVM_FEATURE_NPT))
- npt_enabled = false;
-
- if (npt_enabled && !npt) {
- printk(KERN_INFO "kvm: Nested Paging disabled\n");
- npt_enabled = false;
- }
-
- if (npt_enabled) {
- printk(KERN_INFO "kvm: Nested Paging enabled\n");
- kvm_enable_tdp();
- } else
- kvm_disable_tdp();
-
- return 0;
-
-err:
- __free_pages(iopm_pages, IOPM_ALLOC_ORDER);
- iopm_base = 0;
- return r;
-}
-
-static __exit void svm_hardware_unsetup(void)
-{
- int cpu;
-
- for_each_possible_cpu(cpu)
- svm_cpu_uninit(cpu);
-
- __free_pages(pfn_to_page(iopm_base >> PAGE_SHIFT), IOPM_ALLOC_ORDER);
- iopm_base = 0;
-}
-
-static void init_seg(struct vmcb_seg *seg)
-{
- seg->selector = 0;
- seg->attrib = SVM_SELECTOR_P_MASK | SVM_SELECTOR_S_MASK |
- SVM_SELECTOR_WRITE_MASK; /* Read/Write Data Segment */
- seg->limit = 0xffff;
- seg->base = 0;
-}
-
-static void init_sys_seg(struct vmcb_seg *seg, uint32_t type)
-{
- seg->selector = 0;
- seg->attrib = SVM_SELECTOR_P_MASK | type;
- seg->limit = 0xffff;
- seg->base = 0;
-}
-
-static void init_vmcb(struct vcpu_svm *svm)
-{
- struct vmcb_control_area *control = &svm->vmcb->control;
- struct vmcb_save_area *save = &svm->vmcb->save;
-
- svm->vcpu.fpu_active = 1;
-
- control->intercept_cr_read = INTERCEPT_CR0_MASK |
- INTERCEPT_CR3_MASK |
- INTERCEPT_CR4_MASK;
-
- control->intercept_cr_write = INTERCEPT_CR0_MASK |
- INTERCEPT_CR3_MASK |
- INTERCEPT_CR4_MASK |
- INTERCEPT_CR8_MASK;
-
- control->intercept_dr_read = INTERCEPT_DR0_MASK |
- INTERCEPT_DR1_MASK |
- INTERCEPT_DR2_MASK |
- INTERCEPT_DR3_MASK |
- INTERCEPT_DR4_MASK |
- INTERCEPT_DR5_MASK |
- INTERCEPT_DR6_MASK |
- INTERCEPT_DR7_MASK;
-
- control->intercept_dr_write = INTERCEPT_DR0_MASK |
- INTERCEPT_DR1_MASK |
- INTERCEPT_DR2_MASK |
- INTERCEPT_DR3_MASK |
- INTERCEPT_DR4_MASK |
- INTERCEPT_DR5_MASK |
- INTERCEPT_DR6_MASK |
- INTERCEPT_DR7_MASK;
-
- control->intercept_exceptions = (1 << PF_VECTOR) |
- (1 << UD_VECTOR) |
- (1 << MC_VECTOR);
-
-
- control->intercept = (1ULL << INTERCEPT_INTR) |
- (1ULL << INTERCEPT_NMI) |
- (1ULL << INTERCEPT_SMI) |
- (1ULL << INTERCEPT_SELECTIVE_CR0) |
- (1ULL << INTERCEPT_CPUID) |
- (1ULL << INTERCEPT_INVD) |
- (1ULL << INTERCEPT_HLT) |
- (1ULL << INTERCEPT_INVLPG) |
- (1ULL << INTERCEPT_INVLPGA) |
- (1ULL << INTERCEPT_IOIO_PROT) |
- (1ULL << INTERCEPT_MSR_PROT) |
- (1ULL << INTERCEPT_TASK_SWITCH) |
- (1ULL << INTERCEPT_SHUTDOWN) |
- (1ULL << INTERCEPT_VMRUN) |
- (1ULL << INTERCEPT_VMMCALL) |
- (1ULL << INTERCEPT_VMLOAD) |
- (1ULL << INTERCEPT_VMSAVE) |
- (1ULL << INTERCEPT_STGI) |
- (1ULL << INTERCEPT_CLGI) |
- (1ULL << INTERCEPT_SKINIT) |
- (1ULL << INTERCEPT_WBINVD) |
- (1ULL << INTERCEPT_MONITOR) |
- (1ULL << INTERCEPT_MWAIT);
-
- control->iopm_base_pa = iopm_base;
- control->msrpm_base_pa = __pa(svm->msrpm);
- control->tsc_offset = 0;
- control->int_ctl = V_INTR_MASKING_MASK;
-
- init_seg(&save->es);
- init_seg(&save->ss);
- init_seg(&save->ds);
- init_seg(&save->fs);
- init_seg(&save->gs);
-
- save->cs.selector = 0xf000;
- /* Executable/Readable Code Segment */
- save->cs.attrib = SVM_SELECTOR_READ_MASK | SVM_SELECTOR_P_MASK |
- SVM_SELECTOR_S_MASK | SVM_SELECTOR_CODE_MASK;
- save->cs.limit = 0xffff;
- /*
- * cs.base should really be 0xffff0000, but vmx can't handle that, so
- * be consistent with it.
- *
- * Replace when we have real mode working for vmx.
- */
- save->cs.base = 0xf0000;
-
- save->gdtr.limit = 0xffff;
- save->idtr.limit = 0xffff;
-
- init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
- init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);
-
- save->efer = EFER_SVME;
- save->dr6 = 0xffff0ff0;
- save->dr7 = 0x400;
- save->rflags = 2;
- save->rip = 0x0000fff0;
- svm->vcpu.arch.regs[VCPU_REGS_RIP] = save->rip;
-
- /*
- * This is the guest-visible cr0 value.
- * svm_set_cr0() sets PG and WP and clears NW and CD on save->cr0.
- */
- svm->vcpu.arch.cr0 = X86_CR0_NW | X86_CR0_CD | X86_CR0_ET;
- kvm_set_cr0(&svm->vcpu, svm->vcpu.arch.cr0);
-
- save->cr4 = X86_CR4_PAE;
- /* rdx = ?? */
-
- if (npt_enabled) {
- /* Setup VMCB for Nested Paging */
- control->nested_ctl = 1;
- control->intercept &= ~((1ULL << INTERCEPT_TASK_SWITCH) |
- (1ULL << INTERCEPT_INVLPG));
- control->intercept_exceptions &= ~(1 << PF_VECTOR);
- control->intercept_cr_read &= ~INTERCEPT_CR3_MASK;
- control->intercept_cr_write &= ~INTERCEPT_CR3_MASK;
- save->g_pat = 0x0007040600070406ULL;
- save->cr3 = 0;
- save->cr4 = 0;
- }
- force_new_asid(&svm->vcpu);
-
- svm->nested.vmcb = 0;
- svm->vcpu.arch.hflags = 0;
-
- if (svm_has(SVM_FEATURE_PAUSE_FILTER)) {
- control->pause_filter_count = 3000;
- control->intercept |= (1ULL << INTERCEPT_PAUSE);
- }
-
- enable_gif(svm);
-}
-
-static int svm_vcpu_reset(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- init_vmcb(svm);
-
- if (!kvm_vcpu_is_bsp(vcpu)) {
- kvm_rip_write(vcpu, 0);
- svm->vmcb->save.cs.base = svm->vcpu.arch.sipi_vector << 12;
- svm->vmcb->save.cs.selector = svm->vcpu.arch.sipi_vector << 8;
- }
- vcpu->arch.regs_avail = ~0;
- vcpu->arch.regs_dirty = ~0;
-
- return 0;
-}
-
-static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
-{
- struct vcpu_svm *svm;
- struct page *page;
- struct page *msrpm_pages;
- struct page *hsave_page;
- struct page *nested_msrpm_pages;
- int err;
-
- svm = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
- if (!svm) {
- err = -ENOMEM;
- goto out;
- }
-
- err = kvm_vcpu_init(&svm->vcpu, kvm, id);
- if (err)
- goto free_svm;
-
- err = -ENOMEM;
- page = alloc_page(GFP_KERNEL);
- if (!page)
- goto uninit;
-
- msrpm_pages = alloc_pages(GFP_KERNEL, MSRPM_ALLOC_ORDER);
- if (!msrpm_pages)
- goto free_page1;
-
- nested_msrpm_pages = alloc_pages(GFP_KERNEL, MSRPM_ALLOC_ORDER);
- if (!nested_msrpm_pages)
- goto free_page2;
-
- hsave_page = alloc_page(GFP_KERNEL);
- if (!hsave_page)
- goto free_page3;
-
- svm->nested.hsave = page_address(hsave_page);
-
- svm->msrpm = page_address(msrpm_pages);
- svm_vcpu_init_msrpm(svm->msrpm);
-
- svm->nested.msrpm = page_address(nested_msrpm_pages);
- svm_vcpu_init_msrpm(svm->nested.msrpm);
-
- svm->vmcb = page_address(page);
- clear_page(svm->vmcb);
- svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
- svm->asid_generation = 0;
- init_vmcb(svm);
-
- fx_init(&svm->vcpu);
- svm->vcpu.arch.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
- if (kvm_vcpu_is_bsp(&svm->vcpu))
- svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP;
-
- return &svm->vcpu;
-
-free_page3:
- __free_pages(nested_msrpm_pages, MSRPM_ALLOC_ORDER);
-free_page2:
- __free_pages(msrpm_pages, MSRPM_ALLOC_ORDER);
-free_page1:
- __free_page(page);
-uninit:
- kvm_vcpu_uninit(&svm->vcpu);
-free_svm:
- kmem_cache_free(kvm_vcpu_cache, svm);
-out:
- return ERR_PTR(err);
-}
-
-static void svm_free_vcpu(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- __free_page(pfn_to_page(svm->vmcb_pa >> PAGE_SHIFT));
- __free_pages(virt_to_page(svm->msrpm), MSRPM_ALLOC_ORDER);
- __free_page(virt_to_page(svm->nested.hsave));
- __free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
- kvm_vcpu_uninit(vcpu);
- kmem_cache_free(kvm_vcpu_cache, svm);
-}
-
-static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- int i;
-
- if (unlikely(cpu != vcpu->cpu)) {
- u64 delta;
-
- if (check_tsc_unstable()) {
- /*
- * Make sure that the guest sees a monotonically
- * increasing TSC.
- */
- delta = vcpu->arch.host_tsc - native_read_tsc();
- svm->vmcb->control.tsc_offset += delta;
- if (is_nested(svm))
- svm->nested.hsave->control.tsc_offset += delta;
- }
- vcpu->cpu = cpu;
- kvm_migrate_timers(vcpu);
- svm->asid_generation = 0;
- }
-
- for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++)
- rdmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
-}
-
-static void svm_vcpu_put(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- int i;
-
- ++vcpu->stat.host_state_reload;
- for (i = 0; i < NR_HOST_SAVE_USER_MSRS; i++)
- wrmsrl(host_save_user_msrs[i], svm->host_user_msrs[i]);
-
- vcpu->arch.host_tsc = native_read_tsc();
-}
-
-static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
-{
- return to_svm(vcpu)->vmcb->save.rflags;
-}
-
-static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
-{
- to_svm(vcpu)->vmcb->save.rflags = rflags;
-}
-
-static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
-{
- switch (reg) {
- case VCPU_EXREG_PDPTR:
- BUG_ON(!npt_enabled);
- load_pdptrs(vcpu, vcpu->arch.cr3);
- break;
- default:
- BUG();
- }
-}
-
-static void svm_set_vintr(struct vcpu_svm *svm)
-{
- svm->vmcb->control.intercept |= 1ULL << INTERCEPT_VINTR;
-}
-
-static void svm_clear_vintr(struct vcpu_svm *svm)
-{
- svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_VINTR);
-}
-
-static struct vmcb_seg *svm_seg(struct kvm_vcpu *vcpu, int seg)
-{
- struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save;
-
- switch (seg) {
- case VCPU_SREG_CS: return &save->cs;
- case VCPU_SREG_DS: return &save->ds;
- case VCPU_SREG_ES: return &save->es;
- case VCPU_SREG_FS: return &save->fs;
- case VCPU_SREG_GS: return &save->gs;
- case VCPU_SREG_SS: return &save->ss;
- case VCPU_SREG_TR: return &save->tr;
- case VCPU_SREG_LDTR: return &save->ldtr;
- }
- BUG();
- return NULL;
-}
-
-static u64 svm_get_segment_base(struct kvm_vcpu *vcpu, int seg)
-{
- struct vmcb_seg *s = svm_seg(vcpu, seg);
-
- return s->base;
-}
-
-static void svm_get_segment(struct kvm_vcpu *vcpu,
- struct kvm_segment *var, int seg)
-{
- struct vmcb_seg *s = svm_seg(vcpu, seg);
-
- var->base = s->base;
- var->limit = s->limit;
- var->selector = s->selector;
- var->type = s->attrib & SVM_SELECTOR_TYPE_MASK;
- var->s = (s->attrib >> SVM_SELECTOR_S_SHIFT) & 1;
- var->dpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;
- var->present = (s->attrib >> SVM_SELECTOR_P_SHIFT) & 1;
- var->avl = (s->attrib >> SVM_SELECTOR_AVL_SHIFT) & 1;
- var->l = (s->attrib >> SVM_SELECTOR_L_SHIFT) & 1;
- var->db = (s->attrib >> SVM_SELECTOR_DB_SHIFT) & 1;
- var->g = (s->attrib >> SVM_SELECTOR_G_SHIFT) & 1;
-
- /*
- * AMD's VMCB does not have an explicit unusable field, so emulate it
- * for cross-vendor migration purposes by deriving it from "not present"
- */
- var->unusable = !var->present || (var->type == 0);
-
- switch (seg) {
- case VCPU_SREG_CS:
- /*
- * SVM always stores 0 for the 'G' bit in the CS selector in
- * the VMCB on a VMEXIT. This hurts cross-vendor migration:
- * Intel's VMENTRY has a check on the 'G' bit.
- */
- var->g = s->limit > 0xfffff;
- break;
- case VCPU_SREG_TR:
- /*
- * Work around a bug where the busy flag in the tr selector
- * isn't exposed
- */
- var->type |= 0x2;
- break;
- case VCPU_SREG_DS:
- case VCPU_SREG_ES:
- case VCPU_SREG_FS:
- case VCPU_SREG_GS:
- /*
- * The accessed bit must always be set in the segment
- * descriptor cache: even if it is cleared in the in-memory
- * descriptor, the cached copy remains 1. Since Intel
- * checks this bit on VM entry, set it here to support
- * cross-vendor migration.
- */
- if (!var->unusable)
- var->type |= 0x1;
- break;
- case VCPU_SREG_SS:
- /*
- * On AMD CPUs sometimes the DB bit in the segment
- * descriptor is left as 1, although the whole segment has
- * been made unusable. Clear it here to pass an Intel VMX
- * entry check when cross vendor migrating.
- */
- if (var->unusable)
- var->db = 0;
- break;
- }
-}
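
/*
 * Illustrative sketch (not part of the patch): how the attribute bits that
 * svm_get_segment() above unpacks are laid out in the VMCB attrib field
 * (type at bits 0-3, S at 4, DPL at 5-6, P at 7, AVL at 8, L at 9, DB at
 * 10, G at 11, matching the SVM_SELECTOR_* shifts). Names here are
 * hypothetical.
 */
#include <stdint.h>

static uint16_t pack_seg_attrib(uint8_t type, int s, int dpl, int present,
                                int avl, int l, int db, int g)
{
        return (type & 0xf) |
               ((s & 1) << 4) | ((dpl & 3) << 5) | ((present & 1) << 7) |
               ((avl & 1) << 8) | ((l & 1) << 9) |
               ((db & 1) << 10) | ((g & 1) << 11);
}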
-
-static int svm_get_cpl(struct kvm_vcpu *vcpu)
-{
- struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save;
-
- return save->cpl;
-}
-
-static void svm_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- dt->size = svm->vmcb->save.idtr.limit;
- dt->address = svm->vmcb->save.idtr.base;
-}
-
-static void svm_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- svm->vmcb->save.idtr.limit = dt->size;
- svm->vmcb->save.idtr.base = dt->address;
-}
-
-static void svm_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- dt->size = svm->vmcb->save.gdtr.limit;
- dt->address = svm->vmcb->save.gdtr.base;
-}
-
-static void svm_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- svm->vmcb->save.gdtr.limit = dt->size;
- svm->vmcb->save.gdtr.base = dt->address;
-}
-
-static void svm_decache_cr0_guest_bits(struct kvm_vcpu *vcpu)
-{
-}
-
-static void svm_decache_cr4_guest_bits(struct kvm_vcpu *vcpu)
-{
-}
-
-static void update_cr0_intercept(struct vcpu_svm *svm)
-{
- struct vmcb *vmcb = svm->vmcb;
- ulong gcr0 = svm->vcpu.arch.cr0;
- u64 *hcr0 = &svm->vmcb->save.cr0;
-
- if (!svm->vcpu.fpu_active)
- *hcr0 |= SVM_CR0_SELECTIVE_MASK;
- else
- *hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
- | (gcr0 & SVM_CR0_SELECTIVE_MASK);
-
- if (gcr0 == *hcr0 && svm->vcpu.fpu_active) {
- vmcb->control.intercept_cr_read &= ~INTERCEPT_CR0_MASK;
- vmcb->control.intercept_cr_write &= ~INTERCEPT_CR0_MASK;
- if (is_nested(svm)) {
- struct vmcb *hsave = svm->nested.hsave;
-
- hsave->control.intercept_cr_read &= ~INTERCEPT_CR0_MASK;
- hsave->control.intercept_cr_write &= ~INTERCEPT_CR0_MASK;
- vmcb->control.intercept_cr_read |= svm->nested.intercept_cr_read;
- vmcb->control.intercept_cr_write |= svm->nested.intercept_cr_write;
- }
- } else {
- svm->vmcb->control.intercept_cr_read |= INTERCEPT_CR0_MASK;
- svm->vmcb->control.intercept_cr_write |= INTERCEPT_CR0_MASK;
- if (is_nested(svm)) {
- struct vmcb *hsave = svm->nested.hsave;
-
- hsave->control.intercept_cr_read |= INTERCEPT_CR0_MASK;
- hsave->control.intercept_cr_write |= INTERCEPT_CR0_MASK;
- }
- }
-}
-
-static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- if (is_nested(svm)) {
- /*
- * We are here because we run in nested mode: the host kvm
- * intercepts all cr0 writes, but the L1 hypervisor does not
- * necessarily intercept them. The L1 hypervisor may, however,
- * intercept selective cr0 writes, which needs to be checked here.
- */
- unsigned long old, new;
-
- /* Remove bits that would trigger a real cr0 write intercept */
- old = vcpu->arch.cr0 & SVM_CR0_SELECTIVE_MASK;
- new = cr0 & SVM_CR0_SELECTIVE_MASK;
-
- if (old == new) {
- /* cr0 write with ts and mp unchanged */
- svm->vmcb->control.exit_code = SVM_EXIT_CR0_SEL_WRITE;
- if (nested_svm_exit_handled(svm) == NESTED_EXIT_DONE)
- return;
- }
- }
-
-#ifdef CONFIG_X86_64
- if (vcpu->arch.efer & EFER_LME) {
- if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
- vcpu->arch.efer |= EFER_LMA;
- svm->vmcb->save.efer |= EFER_LMA | EFER_LME;
- }
-
- if (is_paging(vcpu) && !(cr0 & X86_CR0_PG)) {
- vcpu->arch.efer &= ~EFER_LMA;
- svm->vmcb->save.efer &= ~(EFER_LMA | EFER_LME);
- }
- }
-#endif
- vcpu->arch.cr0 = cr0;
-
- if (!npt_enabled)
- cr0 |= X86_CR0_PG | X86_CR0_WP;
-
- if (!vcpu->fpu_active)
- cr0 |= X86_CR0_TS;
- /*
- * Re-enable caching here because the QEMU BIOS does not do
- * it; leaving caching disabled causes a noticeable delay at
- * reboot.
- */
- cr0 &= ~(X86_CR0_CD | X86_CR0_NW);
- svm->vmcb->save.cr0 = cr0;
- update_cr0_intercept(svm);
-}
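
/*
 * Illustrative sketch (not part of the patch): the "selective CR0 write"
 * test used in svm_set_cr0() above. SVM_CR0_SELECTIVE_MASK covers CR0.TS
 * (bit 3) and CR0.MP (bit 1); a write that leaves both unchanged is the
 * candidate for reflecting SVM_EXIT_CR0_SEL_WRITE to the L1 hypervisor.
 * Names here are hypothetical.
 */
#include <stdbool.h>

#define X86_CR0_MP_BIT (1ul << 1)
#define X86_CR0_TS_BIT (1ul << 3)
#define CR0_SELECTIVE  (X86_CR0_MP_BIT | X86_CR0_TS_BIT)

static bool ts_mp_unchanged(unsigned long old_cr0, unsigned long new_cr0)
{
        return (old_cr0 & CR0_SELECTIVE) == (new_cr0 & CR0_SELECTIVE);
}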
-
-static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
-{
- unsigned long host_cr4_mce = read_cr4() & X86_CR4_MCE;
- unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
-
- if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
- force_new_asid(vcpu);
-
- vcpu->arch.cr4 = cr4;
- if (!npt_enabled)
- cr4 |= X86_CR4_PAE;
- cr4 |= host_cr4_mce;
- to_svm(vcpu)->vmcb->save.cr4 = cr4;
-}
-
-static void svm_set_segment(struct kvm_vcpu *vcpu,
- struct kvm_segment *var, int seg)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- struct vmcb_seg *s = svm_seg(vcpu, seg);
-
- s->base = var->base;
- s->limit = var->limit;
- s->selector = var->selector;
- if (var->unusable)
- s->attrib = 0;
- else {
- s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK);
- s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT;
- s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT;
- s->attrib |= (var->present & 1) << SVM_SELECTOR_P_SHIFT;
- s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT;
- s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT;
- s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
- s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
- }
- if (seg == VCPU_SREG_CS)
- svm->vmcb->save.cpl
- = (svm->vmcb->save.cs.attrib
- >> SVM_SELECTOR_DPL_SHIFT) & 3;
-}
-
-static void update_db_intercept(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- svm->vmcb->control.intercept_exceptions &=
- ~((1 << DB_VECTOR) | (1 << BP_VECTOR));
-
- if (svm->nmi_singlestep)
- svm->vmcb->control.intercept_exceptions |= (1 << DB_VECTOR);
-
- if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) {
- if (vcpu->guest_debug &
- (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP))
- svm->vmcb->control.intercept_exceptions |=
- 1 << DB_VECTOR;
- if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP)
- svm->vmcb->control.intercept_exceptions |=
- 1 << BP_VECTOR;
- } else
- vcpu->guest_debug = 0;
-}
-
-static void svm_guest_debug(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
- svm->vmcb->save.dr7 = dbg->arch.debugreg[7];
- else
- svm->vmcb->save.dr7 = vcpu->arch.dr7;
-
- update_db_intercept(vcpu);
-}
-
-static void load_host_msrs(struct kvm_vcpu *vcpu)
-{
-#ifdef CONFIG_X86_64
- wrmsrl(MSR_GS_BASE, to_svm(vcpu)->host_gs_base);
-#endif
-}
-
-static void save_host_msrs(struct kvm_vcpu *vcpu)
-{
-#ifdef CONFIG_X86_64
- rdmsrl(MSR_GS_BASE, to_svm(vcpu)->host_gs_base);
-#endif
-}
-
-static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
-{
- if (sd->next_asid > sd->max_asid) {
- ++sd->asid_generation;
- sd->next_asid = 1;
- svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID;
- }
-
- svm->asid_generation = sd->asid_generation;
- svm->vmcb->control.asid = sd->next_asid++;
-}
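
/*
 * Illustrative sketch (not part of the patch): the per-cpu ASID allocator
 * implemented by new_asid() above. Exhausting the ASID space bumps a
 * generation counter and requests a full TLB flush; a vcpu whose cached
 * generation no longer matches simply takes a fresh ASID on its next run.
 * Types and names here are hypothetical.
 */
#include <stdint.h>
#include <stdbool.h>

struct asid_state {
        uint32_t next_asid;
        uint32_t max_asid;
        uint64_t generation;
};

static uint32_t alloc_asid(struct asid_state *sd, uint64_t *vcpu_generation,
                           bool *flush_tlb)
{
        *flush_tlb = false;
        if (sd->next_asid > sd->max_asid) {
                ++sd->generation;
                sd->next_asid = 1;      /* ASID 0 stays reserved for the host */
                *flush_tlb = true;
        }
        *vcpu_generation = sd->generation;
        return sd->next_asid++;
}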
-
-static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- svm->vmcb->save.dr7 = value;
-}
-
-static int pf_interception(struct vcpu_svm *svm)
-{
- u64 fault_address;
- u32 error_code;
-
- fault_address = svm->vmcb->control.exit_info_2;
- error_code = svm->vmcb->control.exit_info_1;
-
- trace_kvm_page_fault(fault_address, error_code);
- if (!npt_enabled && kvm_event_needs_reinjection(&svm->vcpu))
- kvm_mmu_unprotect_page_virt(&svm->vcpu, fault_address);
- return kvm_mmu_page_fault(&svm->vcpu, fault_address, error_code);
-}
-
-static int db_interception(struct vcpu_svm *svm)
-{
- struct kvm_run *kvm_run = svm->vcpu.run;
-
- if (!(svm->vcpu.guest_debug &
- (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP)) &&
- !svm->nmi_singlestep) {
- kvm_queue_exception(&svm->vcpu, DB_VECTOR);
- return 1;
- }
-
- if (svm->nmi_singlestep) {
- svm->nmi_singlestep = false;
- if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP))
- svm->vmcb->save.rflags &=
- ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
- update_db_intercept(&svm->vcpu);
- }
-
- if (svm->vcpu.guest_debug &
- (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP)) {
- kvm_run->exit_reason = KVM_EXIT_DEBUG;
- kvm_run->debug.arch.pc =
- svm->vmcb->save.cs.base + svm->vmcb->save.rip;
- kvm_run->debug.arch.exception = DB_VECTOR;
- return 0;
- }
-
- return 1;
-}
-
-static int bp_interception(struct vcpu_svm *svm)
-{
- struct kvm_run *kvm_run = svm->vcpu.run;
-
- kvm_run->exit_reason = KVM_EXIT_DEBUG;
- kvm_run->debug.arch.pc = svm->vmcb->save.cs.base + svm->vmcb->save.rip;
- kvm_run->debug.arch.exception = BP_VECTOR;
- return 0;
-}
-
-static int ud_interception(struct vcpu_svm *svm)
-{
- int er;
-
- er = emulate_instruction(&svm->vcpu, 0, 0, EMULTYPE_TRAP_UD);
- if (er != EMULATE_DONE)
- kvm_queue_exception(&svm->vcpu, UD_VECTOR);
- return 1;
-}
-
-static void svm_fpu_activate(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- u32 excp;
-
- if (is_nested(svm)) {
- u32 h_excp, n_excp;
-
- h_excp = svm->nested.hsave->control.intercept_exceptions;
- n_excp = svm->nested.intercept_exceptions;
- h_excp &= ~(1 << NM_VECTOR);
- excp = h_excp | n_excp;
- } else {
- excp = svm->vmcb->control.intercept_exceptions;
- excp &= ~(1 << NM_VECTOR);
- }
-
- svm->vmcb->control.intercept_exceptions = excp;
-
- svm->vcpu.fpu_active = 1;
- update_cr0_intercept(svm);
-}
-
-static int nm_interception(struct vcpu_svm *svm)
-{
- svm_fpu_activate(&svm->vcpu);
- return 1;
-}
-
-static bool is_erratum_383(void)
-{
- int err, i;
- u64 value;
-
- if (!erratum_383_found)
- return false;
-
- value = native_read_msr_safe(MSR_IA32_MC0_STATUS, &err);
- if (err)
- return false;
-
- /* Bit 62 may or may not be set for this MCE */
- value &= ~(1ULL << 62);
-
- if (value != 0xb600000000010015ULL)
- return false;
-
- /* Clear MCi_STATUS registers */
- for (i = 0; i < 6; ++i)
- native_write_msr_safe(MSR_IA32_MCx_STATUS(i), 0, 0);
-
- value = native_read_msr_safe(MSR_IA32_MCG_STATUS, &err);
- if (!err) {
- u32 low, high;
-
- value &= ~(1ULL << 2);
- low = lower_32_bits(value);
- high = upper_32_bits(value);
-
- native_write_msr_safe(MSR_IA32_MCG_STATUS, low, high);
- }
-
- /* Flush tlb to evict multi-match entries */
- __flush_tlb_all();
-
- return true;
-}
-
-static void svm_handle_mce(struct vcpu_svm *svm)
-{
- if (is_erratum_383()) {
- /*
- * Erratum 383 triggered. Guest state is corrupt so kill the
- * guest.
- */
- pr_err("KVM: Guest triggered AMD Erratum 383\n");
-
- set_bit(KVM_REQ_TRIPLE_FAULT, &svm->vcpu.requests);
-
- return;
- }
-
- /*
- * On an #MC intercept the MCE handler is not called automatically in
- * the host. So do it by hand here.
- */
- asm volatile (
- "int $0x12\n");
- /* not sure if we ever come back to this point */
-
- return;
-}
-
-static int mc_interception(struct vcpu_svm *svm)
-{
- return 1;
-}
-
-static int shutdown_interception(struct vcpu_svm *svm)
-{
- struct kvm_run *kvm_run = svm->vcpu.run;
-
- /*
- * VMCB is undefined after a SHUTDOWN intercept
- * so reinitialize it.
- */
- clear_page(svm->vmcb);
- init_vmcb(svm);
-
- kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
- return 0;
-}
-
-static int io_interception(struct vcpu_svm *svm)
-{
- struct kvm_vcpu *vcpu = &svm->vcpu;
- u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
- int size, in, string;
- unsigned port;
-
- ++svm->vcpu.stat.io_exits;
- string = (io_info & SVM_IOIO_STR_MASK) != 0;
- in = (io_info & SVM_IOIO_TYPE_MASK) != 0;
- if (string || in)
- return !(emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DO_MMIO);
-
- port = io_info >> 16;
- size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT;
- svm->next_rip = svm->vmcb->control.exit_info_2;
- skip_emulated_instruction(&svm->vcpu);
-
- return kvm_fast_pio_out(vcpu, size, port);
-}
-
-static int nmi_interception(struct vcpu_svm *svm)
-{
- return 1;
-}
-
-static int intr_interception(struct vcpu_svm *svm)
-{
- ++svm->vcpu.stat.irq_exits;
- return 1;
-}
-
-static int nop_on_interception(struct vcpu_svm *svm)
-{
- return 1;
-}
-
-static int halt_interception(struct vcpu_svm *svm)
-{
- svm->next_rip = kvm_rip_read(&svm->vcpu) + 1;
- skip_emulated_instruction(&svm->vcpu);
- return kvm_emulate_halt(&svm->vcpu);
-}
-
-static int vmmcall_interception(struct vcpu_svm *svm)
-{
- svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
- skip_emulated_instruction(&svm->vcpu);
- kvm_emulate_hypercall(&svm->vcpu);
- return 1;
-}
-
-static int nested_svm_check_permissions(struct vcpu_svm *svm)
-{
- if (!(svm->vcpu.arch.efer & EFER_SVME)
- || !is_paging(&svm->vcpu)) {
- kvm_queue_exception(&svm->vcpu, UD_VECTOR);
- return 1;
- }
-
- if (svm->vmcb->save.cpl) {
- kvm_inject_gp(&svm->vcpu, 0);
- return 1;
- }
-
- return 0;
-}
-
-static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
- bool has_error_code, u32 error_code)
-{
- int vmexit;
-
- if (!is_nested(svm))
- return 0;
-
- svm->vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr;
- svm->vmcb->control.exit_code_hi = 0;
- svm->vmcb->control.exit_info_1 = error_code;
- svm->vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
-
- vmexit = nested_svm_intercept(svm);
- if (vmexit == NESTED_EXIT_DONE)
- svm->nested.exit_required = true;
-
- return vmexit;
-}
-
-/* This function returns true if it is safe to enable the IRQ window */
-static inline bool nested_svm_intr(struct vcpu_svm *svm)
-{
- if (!is_nested(svm))
- return true;
-
- if (!(svm->vcpu.arch.hflags & HF_VINTR_MASK))
- return true;
-
- if (!(svm->vcpu.arch.hflags & HF_HIF_MASK))
- return false;
-
- svm->vmcb->control.exit_code = SVM_EXIT_INTR;
- svm->vmcb->control.exit_info_1 = 0;
- svm->vmcb->control.exit_info_2 = 0;
-
- if (svm->nested.intercept & 1ULL) {
- /*
- * The #vmexit can't be emulated here directly because this
- * code path runs with IRQs and preemption disabled. A
- * #vmexit emulation might sleep. Only signal the request for
- * the #vmexit here.
- */
- svm->nested.exit_required = true;
- trace_kvm_nested_intr_vmexit(svm->vmcb->save.rip);
- return false;
- }
-
- return true;
-}
-
-/* This function returns true if it is safe to enable the NMI window */
-static inline bool nested_svm_nmi(struct vcpu_svm *svm)
-{
- if (!is_nested(svm))
- return true;
-
- if (!(svm->nested.intercept & (1ULL << INTERCEPT_NMI)))
- return true;
-
- svm->vmcb->control.exit_code = SVM_EXIT_NMI;
- svm->nested.exit_required = true;
-
- return false;
-}
-
-static void *nested_svm_map(struct vcpu_svm *svm, u64 gpa, struct page **_page)
-{
- struct page *page;
-
- might_sleep();
-
- page = gfn_to_page(svm->vcpu.kvm, gpa >> PAGE_SHIFT);
- if (is_error_page(page))
- goto error;
-
- *_page = page;
-
- return kmap(page);
-
-error:
- kvm_release_page_clean(page);
- kvm_inject_gp(&svm->vcpu, 0);
-
- return NULL;
-}
-
-static void nested_svm_unmap(struct page *page)
-{
- kunmap(page);
- kvm_release_page_dirty(page);
-}
-
-static int nested_svm_intercept_ioio(struct vcpu_svm *svm)
-{
- unsigned port;
- u8 val, bit;
- u64 gpa;
-
- if (!(svm->nested.intercept & (1ULL << INTERCEPT_IOIO_PROT)))
- return NESTED_EXIT_HOST;
-
- port = svm->vmcb->control.exit_info_1 >> 16;
- gpa = svm->nested.vmcb_iopm + (port / 8);
- bit = port % 8;
- val = 0;
-
- /* kvm_read_guest() returns non-zero on failure; reflect a failed read to L1 */
- if (kvm_read_guest(svm->vcpu.kvm, gpa, &val, 1))
- return NESTED_EXIT_DONE;
-
- val &= (1 << bit);
-
- return val ? NESTED_EXIT_DONE : NESTED_EXIT_HOST;
-}
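
/*
 * Illustrative sketch (not part of the patch): the IO permission bitmap
 * lookup performed by nested_svm_intercept_ioio() above. One bit per port:
 * byte index port / 8, bit index port % 8, so port 0x3f8 (COM1) is byte
 * 127, bit 0. Names here are hypothetical.
 */
#include <stdint.h>
#include <stdbool.h>

static bool iopm_port_intercepted(const uint8_t *iopm, uint16_t port)
{
        return iopm[port / 8] & (1 << (port % 8));
}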
-
-static int nested_svm_exit_handled_msr(struct vcpu_svm *svm)
-{
- u32 offset, msr, value;
- int write, mask;
-
- if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT)))
- return NESTED_EXIT_HOST;
-
- msr = svm->vcpu.arch.regs[VCPU_REGS_RCX];
- offset = svm_msrpm_offset(msr);
- write = svm->vmcb->control.exit_info_1 & 1;
- mask = 1 << ((2 * (msr & 0xf)) + write);
-
- if (offset == MSR_INVALID)
- return NESTED_EXIT_DONE;
-
- /* The offset is in 32-bit units, but we need it in byte units */
- offset *= 4;
-
- if (kvm_read_guest(svm->vcpu.kvm, svm->nested.vmcb_msrpm + offset, &value, 4))
- return NESTED_EXIT_DONE;
-
- return (value & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST;
-}
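
/*
 * Illustrative sketch (not part of the patch): the bit math used by
 * nested_svm_exit_handled_msr() above. The MSR permission bitmap stores two
 * bits per MSR (read intercept, then write intercept), sixteen MSRs per
 * 32-bit word, so within the word selected by svm_msrpm_offset() the bit is
 * 2 * (msr & 0xf) + write. Names here are hypothetical.
 */
#include <stdint.h>
#include <stdbool.h>

static bool msrpm_word_intercepted(uint32_t msrpm_word, uint32_t msr, bool write)
{
        uint32_t bit = 2 * (msr & 0xf) + (write ? 1 : 0);

        return msrpm_word & (1u << bit);
}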
-
-static int nested_svm_exit_special(struct vcpu_svm *svm)
-{
- u32 exit_code = svm->vmcb->control.exit_code;
-
- switch (exit_code) {
- case SVM_EXIT_INTR:
- case SVM_EXIT_NMI:
- case SVM_EXIT_EXCP_BASE + MC_VECTOR:
- return NESTED_EXIT_HOST;
- case SVM_EXIT_NPF:
- /* For now we are always handling NPFs when using them */
- if (npt_enabled)
- return NESTED_EXIT_HOST;
- break;
- case SVM_EXIT_EXCP_BASE + PF_VECTOR:
- /* When we're shadowing, trap PFs */
- if (!npt_enabled)
- return NESTED_EXIT_HOST;
- break;
- case SVM_EXIT_EXCP_BASE + NM_VECTOR:
- nm_interception(svm);
- break;
- default:
- break;
- }
-
- return NESTED_EXIT_CONTINUE;
-}
-
-/*
- * Returns NESTED_EXIT_DONE if this #vmexit belongs to the nested
- * guest, NESTED_EXIT_HOST if the host has to handle it
- */
-static int nested_svm_intercept(struct vcpu_svm *svm)
-{
- u32 exit_code = svm->vmcb->control.exit_code;
- int vmexit = NESTED_EXIT_HOST;
-
- switch (exit_code) {
- case SVM_EXIT_MSR:
- vmexit = nested_svm_exit_handled_msr(svm);
- break;
- case SVM_EXIT_IOIO:
- vmexit = nested_svm_intercept_ioio(svm);
- break;
- case SVM_EXIT_READ_CR0 ... SVM_EXIT_READ_CR8: {
- u32 cr_bits = 1 << (exit_code - SVM_EXIT_READ_CR0);
- if (svm->nested.intercept_cr_read & cr_bits)
- vmexit = NESTED_EXIT_DONE;
- break;
- }
- case SVM_EXIT_WRITE_CR0 ... SVM_EXIT_WRITE_CR8: {
- u32 cr_bits = 1 << (exit_code - SVM_EXIT_WRITE_CR0);
- if (svm->nested.intercept_cr_write & cr_bits)
- vmexit = NESTED_EXIT_DONE;
- break;
- }
- case SVM_EXIT_READ_DR0 ... SVM_EXIT_READ_DR7: {
- u32 dr_bits = 1 << (exit_code - SVM_EXIT_READ_DR0);
- if (svm->nested.intercept_dr_read & dr_bits)
- vmexit = NESTED_EXIT_DONE;
- break;
- }
- case SVM_EXIT_WRITE_DR0 ... SVM_EXIT_WRITE_DR7: {
- u32 dr_bits = 1 << (exit_code - SVM_EXIT_WRITE_DR0);
- if (svm->nested.intercept_dr_write & dr_bits)
- vmexit = NESTED_EXIT_DONE;
- break;
- }
- case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: {
- u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE);
- if (svm->nested.intercept_exceptions & excp_bits)
- vmexit = NESTED_EXIT_DONE;
- break;
- }
- case SVM_EXIT_ERR: {
- vmexit = NESTED_EXIT_DONE;
- break;
- }
- default: {
- u64 exit_bits = 1ULL << (exit_code - SVM_EXIT_INTR);
- if (svm->nested.intercept & exit_bits)
- vmexit = NESTED_EXIT_DONE;
- }
- }
-
- return vmexit;
-}
-
-static int nested_svm_exit_handled(struct vcpu_svm *svm)
-{
- int vmexit;
-
- vmexit = nested_svm_intercept(svm);
-
- if (vmexit == NESTED_EXIT_DONE)
- nested_svm_vmexit(svm);
-
- return vmexit;
-}
-
-static inline void copy_vmcb_control_area(struct vmcb *dst_vmcb, struct vmcb *from_vmcb)
-{
- struct vmcb_control_area *dst = &dst_vmcb->control;
- struct vmcb_control_area *from = &from_vmcb->control;
-
- dst->intercept_cr_read = from->intercept_cr_read;
- dst->intercept_cr_write = from->intercept_cr_write;
- dst->intercept_dr_read = from->intercept_dr_read;
- dst->intercept_dr_write = from->intercept_dr_write;
- dst->intercept_exceptions = from->intercept_exceptions;
- dst->intercept = from->intercept;
- dst->iopm_base_pa = from->iopm_base_pa;
- dst->msrpm_base_pa = from->msrpm_base_pa;
- dst->tsc_offset = from->tsc_offset;
- dst->asid = from->asid;
- dst->tlb_ctl = from->tlb_ctl;
- dst->int_ctl = from->int_ctl;
- dst->int_vector = from->int_vector;
- dst->int_state = from->int_state;
- dst->exit_code = from->exit_code;
- dst->exit_code_hi = from->exit_code_hi;
- dst->exit_info_1 = from->exit_info_1;
- dst->exit_info_2 = from->exit_info_2;
- dst->exit_int_info = from->exit_int_info;
- dst->exit_int_info_err = from->exit_int_info_err;
- dst->nested_ctl = from->nested_ctl;
- dst->event_inj = from->event_inj;
- dst->event_inj_err = from->event_inj_err;
- dst->nested_cr3 = from->nested_cr3;
- dst->lbr_ctl = from->lbr_ctl;
-}
-
-static int nested_svm_vmexit(struct vcpu_svm *svm)
-{
- struct vmcb *nested_vmcb;
- struct vmcb *hsave = svm->nested.hsave;
- struct vmcb *vmcb = svm->vmcb;
- struct page *page;
-
- trace_kvm_nested_vmexit_inject(vmcb->control.exit_code,
- vmcb->control.exit_info_1,
- vmcb->control.exit_info_2,
- vmcb->control.exit_int_info,
- vmcb->control.exit_int_info_err);
-
- nested_vmcb = nested_svm_map(svm, svm->nested.vmcb, &page);
- if (!nested_vmcb)
- return 1;
-
- /* Exit nested SVM mode */
- svm->nested.vmcb = 0;
-
- /* Give the current vmcb to the guest */
- disable_gif(svm);
-
- nested_vmcb->save.es = vmcb->save.es;
- nested_vmcb->save.cs = vmcb->save.cs;
- nested_vmcb->save.ss = vmcb->save.ss;
- nested_vmcb->save.ds = vmcb->save.ds;
- nested_vmcb->save.gdtr = vmcb->save.gdtr;
- nested_vmcb->save.idtr = vmcb->save.idtr;
- nested_vmcb->save.cr0 = kvm_read_cr0(&svm->vcpu);
- nested_vmcb->save.cr3 = svm->vcpu.arch.cr3;
- nested_vmcb->save.cr2 = vmcb->save.cr2;
- nested_vmcb->save.cr4 = svm->vcpu.arch.cr4;
- nested_vmcb->save.rflags = vmcb->save.rflags;
- nested_vmcb->save.rip = vmcb->save.rip;
- nested_vmcb->save.rsp = vmcb->save.rsp;
- nested_vmcb->save.rax = vmcb->save.rax;
- nested_vmcb->save.dr7 = vmcb->save.dr7;
- nested_vmcb->save.dr6 = vmcb->save.dr6;
- nested_vmcb->save.cpl = vmcb->save.cpl;
-
- nested_vmcb->control.int_ctl = vmcb->control.int_ctl;
- nested_vmcb->control.int_vector = vmcb->control.int_vector;
- nested_vmcb->control.int_state = vmcb->control.int_state;
- nested_vmcb->control.exit_code = vmcb->control.exit_code;
- nested_vmcb->control.exit_code_hi = vmcb->control.exit_code_hi;
- nested_vmcb->control.exit_info_1 = vmcb->control.exit_info_1;
- nested_vmcb->control.exit_info_2 = vmcb->control.exit_info_2;
- nested_vmcb->control.exit_int_info = vmcb->control.exit_int_info;
- nested_vmcb->control.exit_int_info_err = vmcb->control.exit_int_info_err;
-
- /*
- * If we emulate a VMRUN/#VMEXIT in the same host #vmexit cycle we have
- * to make sure that we do not lose injected events. So check event_inj
- * here and copy it to exit_int_info if it is valid.
- * exit_int_info and event_inj can't both be valid, because the case
- * below only happens on a VMRUN instruction intercept, which has
- * no valid exit_int_info set.
- */
- if (vmcb->control.event_inj & SVM_EVTINJ_VALID) {
- struct vmcb_control_area *nc = &nested_vmcb->control;
-
- nc->exit_int_info = vmcb->control.event_inj;
- nc->exit_int_info_err = vmcb->control.event_inj_err;
- }
-
- nested_vmcb->control.tlb_ctl = 0;
- nested_vmcb->control.event_inj = 0;
- nested_vmcb->control.event_inj_err = 0;
-
- /* We always set V_INTR_MASKING and remember the old value in hflags */
- if (!(svm->vcpu.arch.hflags & HF_VINTR_MASK))
- nested_vmcb->control.int_ctl &= ~V_INTR_MASKING_MASK;
-
- /* Restore the original control entries */
- copy_vmcb_control_area(vmcb, hsave);
-
- kvm_clear_exception_queue(&svm->vcpu);
- kvm_clear_interrupt_queue(&svm->vcpu);
-
- /* Restore selected save entries */
- svm->vmcb->save.es = hsave->save.es;
- svm->vmcb->save.cs = hsave->save.cs;
- svm->vmcb->save.ss = hsave->save.ss;
- svm->vmcb->save.ds = hsave->save.ds;
- svm->vmcb->save.gdtr = hsave->save.gdtr;
- svm->vmcb->save.idtr = hsave->save.idtr;
- svm->vmcb->save.rflags = hsave->save.rflags;
- svm_set_efer(&svm->vcpu, hsave->save.efer);
- svm_set_cr0(&svm->vcpu, hsave->save.cr0 | X86_CR0_PE);
- svm_set_cr4(&svm->vcpu, hsave->save.cr4);
- if (npt_enabled) {
- svm->vmcb->save.cr3 = hsave->save.cr3;
- svm->vcpu.arch.cr3 = hsave->save.cr3;
- } else {
- kvm_set_cr3(&svm->vcpu, hsave->save.cr3);
- }
- kvm_register_write(&svm->vcpu, VCPU_REGS_RAX, hsave->save.rax);
- kvm_register_write(&svm->vcpu, VCPU_REGS_RSP, hsave->save.rsp);
- kvm_register_write(&svm->vcpu, VCPU_REGS_RIP, hsave->save.rip);
- svm->vmcb->save.dr7 = 0;
- svm->vmcb->save.cpl = 0;
- svm->vmcb->control.exit_int_info = 0;
-
- nested_svm_unmap(page);
-
- kvm_mmu_reset_context(&svm->vcpu);
- kvm_mmu_load(&svm->vcpu);
-
- return 0;
-}
-
-static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
-{
- /*
- * This function merges the MSR permission bitmaps of kvm and the
- * nested vmcb. It is optimized in that it only merges the parts where
- * the kvm MSR permission bitmap may contain zero bits.
- */
- int i;
-
- if (!(svm->nested.intercept & (1ULL << INTERCEPT_MSR_PROT)))
- return true;
-
- for (i = 0; i < MSRPM_OFFSETS; i++) {
- u32 value, p;
- u64 offset;
-
- if (msrpm_offsets[i] == 0xffffffff)
- break;
-
- p = msrpm_offsets[i];
- offset = svm->nested.vmcb_msrpm + (p * 4);
-
- if (kvm_read_guest(svm->vcpu.kvm, offset, &value, 4))
- return false;
-
- svm->nested.msrpm[p] = svm->msrpm[p] | value;
- }
-
- svm->vmcb->control.msrpm_base_pa = __pa(svm->nested.msrpm);
-
- return true;
-}
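
/*
 * Illustrative sketch (not part of the patch): the merge semantics of
 * nested_svm_vmrun_msrpm() above. A set bit means "intercept", so ORing the
 * host and nested-guest words lets the stricter side win for every MSR.
 * Names here are hypothetical.
 */
#include <stdint.h>
#include <stddef.h>

static void merge_msrpm(uint32_t *merged, const uint32_t *host,
                        const uint32_t *nested, size_t words)
{
        size_t i;

        for (i = 0; i < words; i++)
                merged[i] = host[i] | nested[i];
}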
-
-static bool nested_svm_vmrun(struct vcpu_svm *svm)
-{
- struct vmcb *nested_vmcb;
- struct vmcb *hsave = svm->nested.hsave;
- struct vmcb *vmcb = svm->vmcb;
- struct page *page;
- u64 vmcb_gpa;
-
- vmcb_gpa = svm->vmcb->save.rax;
-
- nested_vmcb = nested_svm_map(svm, svm->vmcb->save.rax, &page);
- if (!nested_vmcb)
- return false;
-
- trace_kvm_nested_vmrun(svm->vmcb->save.rip - 3, vmcb_gpa,
- nested_vmcb->save.rip,
- nested_vmcb->control.int_ctl,
- nested_vmcb->control.event_inj,
- nested_vmcb->control.nested_ctl);
-
- trace_kvm_nested_intercepts(nested_vmcb->control.intercept_cr_read,
- nested_vmcb->control.intercept_cr_write,
- nested_vmcb->control.intercept_exceptions,
- nested_vmcb->control.intercept);
-
- /* Clear internal status */
- kvm_clear_exception_queue(&svm->vcpu);
- kvm_clear_interrupt_queue(&svm->vcpu);
-
- /*
- * Save the current vmcb so we don't need to pick and choose what to
- * save; we can simply restore everything when a #VMEXIT occurs.
- */
- hsave->save.es = vmcb->save.es;
- hsave->save.cs = vmcb->save.cs;
- hsave->save.ss = vmcb->save.ss;
- hsave->save.ds = vmcb->save.ds;
- hsave->save.gdtr = vmcb->save.gdtr;
- hsave->save.idtr = vmcb->save.idtr;
- hsave->save.efer = svm->vcpu.arch.efer;
- hsave->save.cr0 = kvm_read_cr0(&svm->vcpu);
- hsave->save.cr4 = svm->vcpu.arch.cr4;
- hsave->save.rflags = vmcb->save.rflags;
- hsave->save.rip = svm->next_rip;
- hsave->save.rsp = vmcb->save.rsp;
- hsave->save.rax = vmcb->save.rax;
- if (npt_enabled)
- hsave->save.cr3 = vmcb->save.cr3;
- else
- hsave->save.cr3 = svm->vcpu.arch.cr3;
-
- copy_vmcb_control_area(hsave, vmcb);
-
- if (svm->vmcb->save.rflags & X86_EFLAGS_IF)
- svm->vcpu.arch.hflags |= HF_HIF_MASK;
- else
- svm->vcpu.arch.hflags &= ~HF_HIF_MASK;
-
- /* Load the nested guest state */
- svm->vmcb->save.es = nested_vmcb->save.es;
- svm->vmcb->save.cs = nested_vmcb->save.cs;
- svm->vmcb->save.ss = nested_vmcb->save.ss;
- svm->vmcb->save.ds = nested_vmcb->save.ds;
- svm->vmcb->save.gdtr = nested_vmcb->save.gdtr;
- svm->vmcb->save.idtr = nested_vmcb->save.idtr;
- svm->vmcb->save.rflags = nested_vmcb->save.rflags;
- svm_set_efer(&svm->vcpu, nested_vmcb->save.efer);
- svm_set_cr0(&svm->vcpu, nested_vmcb->save.cr0);
- svm_set_cr4(&svm->vcpu, nested_vmcb->save.cr4);
- if (npt_enabled) {
- svm->vmcb->save.cr3 = nested_vmcb->save.cr3;
- svm->vcpu.arch.cr3 = nested_vmcb->save.cr3;
- } else
- kvm_set_cr3(&svm->vcpu, nested_vmcb->save.cr3);
-
- /* Guest paging mode is active - reset mmu */
- kvm_mmu_reset_context(&svm->vcpu);
-
- svm->vmcb->save.cr2 = svm->vcpu.arch.cr2 = nested_vmcb->save.cr2;
- kvm_register_write(&svm->vcpu, VCPU_REGS_RAX, nested_vmcb->save.rax);
- kvm_register_write(&svm->vcpu, VCPU_REGS_RSP, nested_vmcb->save.rsp);
- kvm_register_write(&svm->vcpu, VCPU_REGS_RIP, nested_vmcb->save.rip);
-
- /* In case we don't even reach vcpu_run, the fields are not updated */
- svm->vmcb->save.rax = nested_vmcb->save.rax;
- svm->vmcb->save.rsp = nested_vmcb->save.rsp;
- svm->vmcb->save.rip = nested_vmcb->save.rip;
- svm->vmcb->save.dr7 = nested_vmcb->save.dr7;
- svm->vmcb->save.dr6 = nested_vmcb->save.dr6;
- svm->vmcb->save.cpl = nested_vmcb->save.cpl;
-
- svm->nested.vmcb_msrpm = nested_vmcb->control.msrpm_base_pa & ~0x0fffULL;
- svm->nested.vmcb_iopm = nested_vmcb->control.iopm_base_pa & ~0x0fffULL;
-
- /* cache intercepts */
- svm->nested.intercept_cr_read = nested_vmcb->control.intercept_cr_read;
- svm->nested.intercept_cr_write = nested_vmcb->control.intercept_cr_write;
- svm->nested.intercept_dr_read = nested_vmcb->control.intercept_dr_read;
- svm->nested.intercept_dr_write = nested_vmcb->control.intercept_dr_write;
- svm->nested.intercept_exceptions = nested_vmcb->control.intercept_exceptions;
- svm->nested.intercept = nested_vmcb->control.intercept;
-
- force_new_asid(&svm->vcpu);
- svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK;
- if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK)
- svm->vcpu.arch.hflags |= HF_VINTR_MASK;
- else
- svm->vcpu.arch.hflags &= ~HF_VINTR_MASK;
-
- if (svm->vcpu.arch.hflags & HF_VINTR_MASK) {
- /* We only want the cr8 intercept bits of the guest */
- svm->vmcb->control.intercept_cr_read &= ~INTERCEPT_CR8_MASK;
- svm->vmcb->control.intercept_cr_write &= ~INTERCEPT_CR8_MASK;
- }
-
- /* We don't want to see VMMCALLs from a nested guest */
- svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_VMMCALL);
-
- /*
- * We don't want a nested guest to be able to bypass any intercept that
- * applies to its L1 guest, so the intercept bits of both are ORed.
- */
- svm->vmcb->control.intercept_cr_read |=
- nested_vmcb->control.intercept_cr_read;
- svm->vmcb->control.intercept_cr_write |=
- nested_vmcb->control.intercept_cr_write;
- svm->vmcb->control.intercept_dr_read |=
- nested_vmcb->control.intercept_dr_read;
- svm->vmcb->control.intercept_dr_write |=
- nested_vmcb->control.intercept_dr_write;
- svm->vmcb->control.intercept_exceptions |=
- nested_vmcb->control.intercept_exceptions;
-
- svm->vmcb->control.intercept |= nested_vmcb->control.intercept;
-
- svm->vmcb->control.lbr_ctl = nested_vmcb->control.lbr_ctl;
- svm->vmcb->control.int_vector = nested_vmcb->control.int_vector;
- svm->vmcb->control.int_state = nested_vmcb->control.int_state;
- svm->vmcb->control.tsc_offset += nested_vmcb->control.tsc_offset;
- svm->vmcb->control.event_inj = nested_vmcb->control.event_inj;
- svm->vmcb->control.event_inj_err = nested_vmcb->control.event_inj_err;
-
- nested_svm_unmap(page);
-
- /* svm->nested.vmcb is our indicator of whether nested SVM is activated */
- svm->nested.vmcb = vmcb_gpa;
-
- enable_gif(svm);
-
- return true;
-}
-
-static void nested_svm_vmloadsave(struct vmcb *from_vmcb, struct vmcb *to_vmcb)
-{
- to_vmcb->save.fs = from_vmcb->save.fs;
- to_vmcb->save.gs = from_vmcb->save.gs;
- to_vmcb->save.tr = from_vmcb->save.tr;
- to_vmcb->save.ldtr = from_vmcb->save.ldtr;
- to_vmcb->save.kernel_gs_base = from_vmcb->save.kernel_gs_base;
- to_vmcb->save.star = from_vmcb->save.star;
- to_vmcb->save.lstar = from_vmcb->save.lstar;
- to_vmcb->save.cstar = from_vmcb->save.cstar;
- to_vmcb->save.sfmask = from_vmcb->save.sfmask;
- to_vmcb->save.sysenter_cs = from_vmcb->save.sysenter_cs;
- to_vmcb->save.sysenter_esp = from_vmcb->save.sysenter_esp;
- to_vmcb->save.sysenter_eip = from_vmcb->save.sysenter_eip;
-}
-
-static int vmload_interception(struct vcpu_svm *svm)
-{
- struct vmcb *nested_vmcb;
- struct page *page;
-
- if (nested_svm_check_permissions(svm))
- return 1;
-
- svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
- skip_emulated_instruction(&svm->vcpu);
-
- nested_vmcb = nested_svm_map(svm, svm->vmcb->save.rax, &page);
- if (!nested_vmcb)
- return 1;
-
- nested_svm_vmloadsave(nested_vmcb, svm->vmcb);
- nested_svm_unmap(page);
-
- return 1;
-}
-
-static int vmsave_interception(struct vcpu_svm *svm)
-{
- struct vmcb *nested_vmcb;
- struct page *page;
-
- if (nested_svm_check_permissions(svm))
- return 1;
-
- svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
- skip_emulated_instruction(&svm->vcpu);
-
- nested_vmcb = nested_svm_map(svm, svm->vmcb->save.rax, &page);
- if (!nested_vmcb)
- return 1;
-
- nested_svm_vmloadsave(svm->vmcb, nested_vmcb);
- nested_svm_unmap(page);
-
- return 1;
-}
-
-static int vmrun_interception(struct vcpu_svm *svm)
-{
- if (nested_svm_check_permissions(svm))
- return 1;
-
- svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
- skip_emulated_instruction(&svm->vcpu);
-
- if (!nested_svm_vmrun(svm))
- return 1;
-
- if (!nested_svm_vmrun_msrpm(svm))
- goto failed;
-
- return 1;
-
-failed:
-
- svm->vmcb->control.exit_code = SVM_EXIT_ERR;
- svm->vmcb->control.exit_code_hi = 0;
- svm->vmcb->control.exit_info_1 = 0;
- svm->vmcb->control.exit_info_2 = 0;
-
- nested_svm_vmexit(svm);
-
- return 1;
-}
-
-static int stgi_interception(struct vcpu_svm *svm)
-{
- if (nested_svm_check_permissions(svm))
- return 1;
-
- svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
- skip_emulated_instruction(&svm->vcpu);
-
- enable_gif(svm);
-
- return 1;
-}
-
-static int clgi_interception(struct vcpu_svm *svm)
-{
- if (nested_svm_check_permissions(svm))
- return 1;
-
- svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
- skip_emulated_instruction(&svm->vcpu);
-
- disable_gif(svm);
-
- /* After a CLGI no interrupts should come */
- svm_clear_vintr(svm);
- svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
-
- return 1;
-}
-
-static int invlpga_interception(struct vcpu_svm *svm)
-{
- struct kvm_vcpu *vcpu = &svm->vcpu;
-
- trace_kvm_invlpga(svm->vmcb->save.rip, vcpu->arch.regs[VCPU_REGS_RCX],
- vcpu->arch.regs[VCPU_REGS_RAX]);
-
- /* Let's treat INVLPGA the same as INVLPG (can be optimized!) */
- kvm_mmu_invlpg(vcpu, vcpu->arch.regs[VCPU_REGS_RAX]);
-
- svm->next_rip = kvm_rip_read(&svm->vcpu) + 3;
- skip_emulated_instruction(&svm->vcpu);
- return 1;
-}
-
-static int skinit_interception(struct vcpu_svm *svm)
-{
- trace_kvm_skinit(svm->vmcb->save.rip, svm->vcpu.arch.regs[VCPU_REGS_RAX]);
-
- kvm_queue_exception(&svm->vcpu, UD_VECTOR);
- return 1;
-}
-
-static int invalid_op_interception(struct vcpu_svm *svm)
-{
- kvm_queue_exception(&svm->vcpu, UD_VECTOR);
- return 1;
-}
-
-static int task_switch_interception(struct vcpu_svm *svm)
-{
- u16 tss_selector;
- int reason;
- int int_type = svm->vmcb->control.exit_int_info &
- SVM_EXITINTINFO_TYPE_MASK;
- int int_vec = svm->vmcb->control.exit_int_info & SVM_EVTINJ_VEC_MASK;
- uint32_t type =
- svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_TYPE_MASK;
- uint32_t idt_v =
- svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_VALID;
- bool has_error_code = false;
- u32 error_code = 0;
-
- tss_selector = (u16)svm->vmcb->control.exit_info_1;
-
- if (svm->vmcb->control.exit_info_2 &
- (1ULL << SVM_EXITINFOSHIFT_TS_REASON_IRET))
- reason = TASK_SWITCH_IRET;
- else if (svm->vmcb->control.exit_info_2 &
- (1ULL << SVM_EXITINFOSHIFT_TS_REASON_JMP))
- reason = TASK_SWITCH_JMP;
- else if (idt_v)
- reason = TASK_SWITCH_GATE;
- else
- reason = TASK_SWITCH_CALL;
-
- if (reason == TASK_SWITCH_GATE) {
- switch (type) {
- case SVM_EXITINTINFO_TYPE_NMI:
- svm->vcpu.arch.nmi_injected = false;
- break;
- case SVM_EXITINTINFO_TYPE_EXEPT:
- if (svm->vmcb->control.exit_info_2 &
- (1ULL << SVM_EXITINFOSHIFT_TS_HAS_ERROR_CODE)) {
- has_error_code = true;
- error_code =
- (u32)svm->vmcb->control.exit_info_2;
- }
- kvm_clear_exception_queue(&svm->vcpu);
- break;
- case SVM_EXITINTINFO_TYPE_INTR:
- kvm_clear_interrupt_queue(&svm->vcpu);
- break;
- default:
- break;
- }
- }
-
- if (reason != TASK_SWITCH_GATE ||
- int_type == SVM_EXITINTINFO_TYPE_SOFT ||
- (int_type == SVM_EXITINTINFO_TYPE_EXEPT &&
- (int_vec == OF_VECTOR || int_vec == BP_VECTOR)))
- skip_emulated_instruction(&svm->vcpu);
-
- if (kvm_task_switch(&svm->vcpu, tss_selector, reason,
- has_error_code, error_code) == EMULATE_FAIL) {
- svm->vcpu.run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
- svm->vcpu.run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
- svm->vcpu.run->internal.ndata = 0;
- return 0;
- }
- return 1;
-}
-
-static int cpuid_interception(struct vcpu_svm *svm)
-{
- svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
- kvm_emulate_cpuid(&svm->vcpu);
- return 1;
-}
-
-static int iret_interception(struct vcpu_svm *svm)
-{
- ++svm->vcpu.stat.nmi_window_exits;
- svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_IRET);
- svm->vcpu.arch.hflags |= HF_IRET_MASK;
- return 1;
-}
-
-static int invlpg_interception(struct vcpu_svm *svm)
-{
- if (emulate_instruction(&svm->vcpu, 0, 0, 0) != EMULATE_DONE)
- pr_unimpl(&svm->vcpu, "%s: failed\n", __func__);
- return 1;
-}
-
-static int emulate_on_interception(struct vcpu_svm *svm)
-{
- if (emulate_instruction(&svm->vcpu, 0, 0, 0) != EMULATE_DONE)
- pr_unimpl(&svm->vcpu, "%s: failed\n", __func__);
- return 1;
-}
-
-static int cr8_write_interception(struct vcpu_svm *svm)
-{
- struct kvm_run *kvm_run = svm->vcpu.run;
-
- u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
- /* instruction emulation calls kvm_set_cr8() */
- emulate_instruction(&svm->vcpu, 0, 0, 0);
- if (irqchip_in_kernel(svm->vcpu.kvm)) {
- svm->vmcb->control.intercept_cr_write &= ~INTERCEPT_CR8_MASK;
- return 1;
- }
- if (cr8_prev <= kvm_get_cr8(&svm->vcpu))
- return 1;
- kvm_run->exit_reason = KVM_EXIT_SET_TPR;
- return 0;
-}
-
-static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- switch (ecx) {
- case MSR_IA32_TSC: {
- u64 tsc_offset;
-
- if (is_nested(svm))
- tsc_offset = svm->nested.hsave->control.tsc_offset;
- else
- tsc_offset = svm->vmcb->control.tsc_offset;
-
- *data = tsc_offset + native_read_tsc();
- break;
- }
- case MSR_K6_STAR:
- *data = svm->vmcb->save.star;
- break;
-#ifdef CONFIG_X86_64
- case MSR_LSTAR:
- *data = svm->vmcb->save.lstar;
- break;
- case MSR_CSTAR:
- *data = svm->vmcb->save.cstar;
- break;
- case MSR_KERNEL_GS_BASE:
- *data = svm->vmcb->save.kernel_gs_base;
- break;
- case MSR_SYSCALL_MASK:
- *data = svm->vmcb->save.sfmask;
- break;
-#endif
- case MSR_IA32_SYSENTER_CS:
- *data = svm->vmcb->save.sysenter_cs;
- break;
- case MSR_IA32_SYSENTER_EIP:
- *data = svm->sysenter_eip;
- break;
- case MSR_IA32_SYSENTER_ESP:
- *data = svm->sysenter_esp;
- break;
- /*
- * Nobody will change the following 5 values in the VMCB so we can
- * safely return them on rdmsr. They will always be 0 until LBRV is
- * implemented.
- */
- case MSR_IA32_DEBUGCTLMSR:
- *data = svm->vmcb->save.dbgctl;
- break;
- case MSR_IA32_LASTBRANCHFROMIP:
- *data = svm->vmcb->save.br_from;
- break;
- case MSR_IA32_LASTBRANCHTOIP:
- *data = svm->vmcb->save.br_to;
- break;
- case MSR_IA32_LASTINTFROMIP:
- *data = svm->vmcb->save.last_excp_from;
- break;
- case MSR_IA32_LASTINTTOIP:
- *data = svm->vmcb->save.last_excp_to;
- break;
- case MSR_VM_HSAVE_PA:
- *data = svm->nested.hsave_msr;
- break;
- case MSR_VM_CR:
- *data = svm->nested.vm_cr_msr;
- break;
- case MSR_IA32_UCODE_REV:
- *data = 0x01000065;
- break;
- default:
- return kvm_get_msr_common(vcpu, ecx, data);
- }
- return 0;
-}
-
-static int rdmsr_interception(struct vcpu_svm *svm)
-{
- u32 ecx = svm->vcpu.arch.regs[VCPU_REGS_RCX];
- u64 data;
-
- if (svm_get_msr(&svm->vcpu, ecx, &data)) {
- trace_kvm_msr_read_ex(ecx);
- kvm_inject_gp(&svm->vcpu, 0);
- } else {
- trace_kvm_msr_read(ecx, data);
-
- svm->vcpu.arch.regs[VCPU_REGS_RAX] = data & 0xffffffff;
- svm->vcpu.arch.regs[VCPU_REGS_RDX] = data >> 32;
- svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
- skip_emulated_instruction(&svm->vcpu);
- }
- return 1;
-}
-
-static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 data)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- int svm_dis, chg_mask;
-
- if (data & ~SVM_VM_CR_VALID_MASK)
- return 1;
-
- chg_mask = SVM_VM_CR_VALID_MASK;
-
- if (svm->nested.vm_cr_msr & SVM_VM_CR_SVM_DIS_MASK)
- chg_mask &= ~(SVM_VM_CR_SVM_LOCK_MASK | SVM_VM_CR_SVM_DIS_MASK);
-
- svm->nested.vm_cr_msr &= ~chg_mask;
- svm->nested.vm_cr_msr |= (data & chg_mask);
-
- svm_dis = svm->nested.vm_cr_msr & SVM_VM_CR_SVM_DIS_MASK;
-
- /* reject setting SVM_DIS while EFER.SVME is set */
- if (svm_dis && (vcpu->arch.efer & EFER_SVME))
- return 1;
-
- return 0;
-}
-
-static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- switch (ecx) {
- case MSR_IA32_TSC: {
- u64 tsc_offset = data - native_read_tsc();
- u64 g_tsc_offset = 0;
-
- if (is_nested(svm)) {
- g_tsc_offset = svm->vmcb->control.tsc_offset -
- svm->nested.hsave->control.tsc_offset;
- svm->nested.hsave->control.tsc_offset = tsc_offset;
- }
-
- svm->vmcb->control.tsc_offset = tsc_offset + g_tsc_offset;
-
- break;
- }
- case MSR_K6_STAR:
- svm->vmcb->save.star = data;
- break;
-#ifdef CONFIG_X86_64
- case MSR_LSTAR:
- svm->vmcb->save.lstar = data;
- break;
- case MSR_CSTAR:
- svm->vmcb->save.cstar = data;
- break;
- case MSR_KERNEL_GS_BASE:
- svm->vmcb->save.kernel_gs_base = data;
- break;
- case MSR_SYSCALL_MASK:
- svm->vmcb->save.sfmask = data;
- break;
-#endif
- case MSR_IA32_SYSENTER_CS:
- svm->vmcb->save.sysenter_cs = data;
- break;
- case MSR_IA32_SYSENTER_EIP:
- svm->sysenter_eip = data;
- svm->vmcb->save.sysenter_eip = data;
- break;
- case MSR_IA32_SYSENTER_ESP:
- svm->sysenter_esp = data;
- svm->vmcb->save.sysenter_esp = data;
- break;
- case MSR_IA32_DEBUGCTLMSR:
- if (!svm_has(SVM_FEATURE_LBRV)) {
- pr_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTL 0x%llx, nop\n",
- __func__, data);
- break;
- }
- if (data & DEBUGCTL_RESERVED_BITS)
- return 1;
-
- svm->vmcb->save.dbgctl = data;
- if (data & (1ULL<<0))
- svm_enable_lbrv(svm);
- else
- svm_disable_lbrv(svm);
- break;
- case MSR_VM_HSAVE_PA:
- svm->nested.hsave_msr = data;
- break;
- case MSR_VM_CR:
- return svm_set_vm_cr(vcpu, data);
- case MSR_VM_IGNNE:
- pr_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", ecx, data);
- break;
- default:
- return kvm_set_msr_common(vcpu, ecx, data);
- }
- return 0;
-}
-
-static int wrmsr_interception(struct vcpu_svm *svm)
-{
- u32 ecx = svm->vcpu.arch.regs[VCPU_REGS_RCX];
- u64 data = (svm->vcpu.arch.regs[VCPU_REGS_RAX] & -1u)
- | ((u64)(svm->vcpu.arch.regs[VCPU_REGS_RDX] & -1u) << 32);
-
- svm->next_rip = kvm_rip_read(&svm->vcpu) + 2;
- if (svm_set_msr(&svm->vcpu, ecx, data)) {
- trace_kvm_msr_write_ex(ecx, data);
- kvm_inject_gp(&svm->vcpu, 0);
- } else {
- trace_kvm_msr_write(ecx, data);
- skip_emulated_instruction(&svm->vcpu);
- }
- return 1;
-}
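
/*
 * Illustrative sketch (not part of the patch): WRMSR passes its 64-bit value
 * in EDX:EAX, and wrmsr_interception() above reassembles it exactly like
 * this. Names here are hypothetical.
 */
#include <stdint.h>

static uint64_t msr_value_from_edx_eax(uint32_t eax, uint32_t edx)
{
        return (uint64_t)eax | ((uint64_t)edx << 32);
}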
-
-static int msr_interception(struct vcpu_svm *svm)
-{
- if (svm->vmcb->control.exit_info_1)
- return wrmsr_interception(svm);
- else
- return rdmsr_interception(svm);
-}
-
-static int interrupt_window_interception(struct vcpu_svm *svm)
-{
- struct kvm_run *kvm_run = svm->vcpu.run;
-
- svm_clear_vintr(svm);
- svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
- /*
- * If user space is waiting to inject interrupts, exit as soon as
- * possible.
- */
- if (!irqchip_in_kernel(svm->vcpu.kvm) &&
- kvm_run->request_interrupt_window &&
- !kvm_cpu_has_interrupt(&svm->vcpu)) {
- ++svm->vcpu.stat.irq_window_exits;
- kvm_run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
- return 0;
- }
-
- return 1;
-}
-
-static int pause_interception(struct vcpu_svm *svm)
-{
- kvm_vcpu_on_spin(&(svm->vcpu));
- return 1;
-}
-
-static int (*svm_exit_handlers[])(struct vcpu_svm *svm) = {
- [SVM_EXIT_READ_CR0] = emulate_on_interception,
- [SVM_EXIT_READ_CR3] = emulate_on_interception,
- [SVM_EXIT_READ_CR4] = emulate_on_interception,
- [SVM_EXIT_READ_CR8] = emulate_on_interception,
- [SVM_EXIT_CR0_SEL_WRITE] = emulate_on_interception,
- [SVM_EXIT_WRITE_CR0] = emulate_on_interception,
- [SVM_EXIT_WRITE_CR3] = emulate_on_interception,
- [SVM_EXIT_WRITE_CR4] = emulate_on_interception,
- [SVM_EXIT_WRITE_CR8] = cr8_write_interception,
- [SVM_EXIT_READ_DR0] = emulate_on_interception,
- [SVM_EXIT_READ_DR1] = emulate_on_interception,
- [SVM_EXIT_READ_DR2] = emulate_on_interception,
- [SVM_EXIT_READ_DR3] = emulate_on_interception,
- [SVM_EXIT_READ_DR4] = emulate_on_interception,
- [SVM_EXIT_READ_DR5] = emulate_on_interception,
- [SVM_EXIT_READ_DR6] = emulate_on_interception,
- [SVM_EXIT_READ_DR7] = emulate_on_interception,
- [SVM_EXIT_WRITE_DR0] = emulate_on_interception,
- [SVM_EXIT_WRITE_DR1] = emulate_on_interception,
- [SVM_EXIT_WRITE_DR2] = emulate_on_interception,
- [SVM_EXIT_WRITE_DR3] = emulate_on_interception,
- [SVM_EXIT_WRITE_DR4] = emulate_on_interception,
- [SVM_EXIT_WRITE_DR5] = emulate_on_interception,
- [SVM_EXIT_WRITE_DR6] = emulate_on_interception,
- [SVM_EXIT_WRITE_DR7] = emulate_on_interception,
- [SVM_EXIT_EXCP_BASE + DB_VECTOR] = db_interception,
- [SVM_EXIT_EXCP_BASE + BP_VECTOR] = bp_interception,
- [SVM_EXIT_EXCP_BASE + UD_VECTOR] = ud_interception,
- [SVM_EXIT_EXCP_BASE + PF_VECTOR] = pf_interception,
- [SVM_EXIT_EXCP_BASE + NM_VECTOR] = nm_interception,
- [SVM_EXIT_EXCP_BASE + MC_VECTOR] = mc_interception,
- [SVM_EXIT_INTR] = intr_interception,
- [SVM_EXIT_NMI] = nmi_interception,
- [SVM_EXIT_SMI] = nop_on_interception,
- [SVM_EXIT_INIT] = nop_on_interception,
- [SVM_EXIT_VINTR] = interrupt_window_interception,
- [SVM_EXIT_CPUID] = cpuid_interception,
- [SVM_EXIT_IRET] = iret_interception,
- [SVM_EXIT_INVD] = emulate_on_interception,
- [SVM_EXIT_PAUSE] = pause_interception,
- [SVM_EXIT_HLT] = halt_interception,
- [SVM_EXIT_INVLPG] = invlpg_interception,
- [SVM_EXIT_INVLPGA] = invlpga_interception,
- [SVM_EXIT_IOIO] = io_interception,
- [SVM_EXIT_MSR] = msr_interception,
- [SVM_EXIT_TASK_SWITCH] = task_switch_interception,
- [SVM_EXIT_SHUTDOWN] = shutdown_interception,
- [SVM_EXIT_VMRUN] = vmrun_interception,
- [SVM_EXIT_VMMCALL] = vmmcall_interception,
- [SVM_EXIT_VMLOAD] = vmload_interception,
- [SVM_EXIT_VMSAVE] = vmsave_interception,
- [SVM_EXIT_STGI] = stgi_interception,
- [SVM_EXIT_CLGI] = clgi_interception,
- [SVM_EXIT_SKINIT] = skinit_interception,
- [SVM_EXIT_WBINVD] = emulate_on_interception,
- [SVM_EXIT_MONITOR] = invalid_op_interception,
- [SVM_EXIT_MWAIT] = invalid_op_interception,
- [SVM_EXIT_NPF] = pf_interception,
-};
-
-static int handle_exit(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- struct kvm_run *kvm_run = vcpu->run;
- u32 exit_code = svm->vmcb->control.exit_code;
-
- trace_kvm_exit(exit_code, vcpu);
-
- if (!(svm->vmcb->control.intercept_cr_write & INTERCEPT_CR0_MASK))
- vcpu->arch.cr0 = svm->vmcb->save.cr0;
- if (npt_enabled)
- vcpu->arch.cr3 = svm->vmcb->save.cr3;
-
- if (unlikely(svm->nested.exit_required)) {
- nested_svm_vmexit(svm);
- svm->nested.exit_required = false;
-
- return 1;
- }
-
- if (is_nested(svm)) {
- int vmexit;
-
- trace_kvm_nested_vmexit(svm->vmcb->save.rip, exit_code,
- svm->vmcb->control.exit_info_1,
- svm->vmcb->control.exit_info_2,
- svm->vmcb->control.exit_int_info,
- svm->vmcb->control.exit_int_info_err);
-
- vmexit = nested_svm_exit_special(svm);
-
- if (vmexit == NESTED_EXIT_CONTINUE)
- vmexit = nested_svm_exit_handled(svm);
-
- if (vmexit == NESTED_EXIT_DONE)
- return 1;
- }
-
- svm_complete_interrupts(svm);
-
- if (svm->vmcb->control.exit_code == SVM_EXIT_ERR) {
- kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
- kvm_run->fail_entry.hardware_entry_failure_reason
- = svm->vmcb->control.exit_code;
- return 0;
- }
-
- if (is_external_interrupt(svm->vmcb->control.exit_int_info) &&
- exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR &&
- exit_code != SVM_EXIT_NPF && exit_code != SVM_EXIT_TASK_SWITCH)
- printk(KERN_ERR "%s: unexpected exit_int_info 0x%x "
- "exit_code 0x%x\n",
- __func__, svm->vmcb->control.exit_int_info,
- exit_code);
-
- if (exit_code >= ARRAY_SIZE(svm_exit_handlers)
- || !svm_exit_handlers[exit_code]) {
- kvm_run->exit_reason = KVM_EXIT_UNKNOWN;
- kvm_run->hw.hardware_exit_reason = exit_code;
- return 0;
- }
-
- return svm_exit_handlers[exit_code](svm);
-}
-
-static void reload_tss(struct kvm_vcpu *vcpu)
-{
- int cpu = raw_smp_processor_id();
-
- struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
- sd->tss_desc->type = 9; /* available 32/64-bit TSS */
- load_TR_desc();
-}
-
-static void pre_svm_run(struct vcpu_svm *svm)
-{
- int cpu = raw_smp_processor_id();
-
- struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
-
- svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
- /* FIXME: handle wraparound of asid_generation */
- if (svm->asid_generation != sd->asid_generation)
- new_asid(svm, sd);
-}
-
-static void svm_inject_nmi(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
- vcpu->arch.hflags |= HF_NMI_MASK;
- svm->vmcb->control.intercept |= (1ULL << INTERCEPT_IRET);
- ++vcpu->stat.nmi_injections;
-}
-
-static inline void svm_inject_irq(struct vcpu_svm *svm, int irq)
-{
- struct vmcb_control_area *control;
-
- trace_kvm_inj_virq(irq);
-
- ++svm->vcpu.stat.irq_injections;
- control = &svm->vmcb->control;
- control->int_vector = irq;
- control->int_ctl &= ~V_INTR_PRIO_MASK;
- control->int_ctl |= V_IRQ_MASK |
- ((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);
-}
-
-static void svm_set_irq(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- BUG_ON(!(gif_set(svm)));
-
- svm->vmcb->control.event_inj = vcpu->arch.interrupt.nr |
- SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR;
-}
-
-static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- if (is_nested(svm) && (vcpu->arch.hflags & HF_VINTR_MASK))
- return;
-
- if (irr == -1)
- return;
-
- if (tpr >= irr)
- svm->vmcb->control.intercept_cr_write |= INTERCEPT_CR8_MASK;
-}
-
-static int svm_nmi_allowed(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- struct vmcb *vmcb = svm->vmcb;
- int ret;
- ret = !(vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK) &&
- !(svm->vcpu.arch.hflags & HF_NMI_MASK);
- ret = ret && gif_set(svm) && nested_svm_nmi(svm);
-
- return ret;
-}
-
-static bool svm_get_nmi_mask(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- return !!(svm->vcpu.arch.hflags & HF_NMI_MASK);
-}
-
-static void svm_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- if (masked) {
- svm->vcpu.arch.hflags |= HF_NMI_MASK;
- svm->vmcb->control.intercept |= (1ULL << INTERCEPT_IRET);
- } else {
- svm->vcpu.arch.hflags &= ~HF_NMI_MASK;
- svm->vmcb->control.intercept &= ~(1ULL << INTERCEPT_IRET);
- }
-}
-
-static int svm_interrupt_allowed(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- struct vmcb *vmcb = svm->vmcb;
- int ret;
-
- if (!gif_set(svm) ||
- (vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK))
- return 0;
-
- ret = !!(vmcb->save.rflags & X86_EFLAGS_IF);
-
- if (is_nested(svm))
- return ret && !(svm->vcpu.arch.hflags & HF_VINTR_MASK);
-
- return ret;
-}
-
-static void enable_irq_window(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- /*
- * In case GIF=0 we can't rely on the CPU to tell us when GIF becomes
- * 1, because that's a separate STGI/VMRUN intercept. The next time we
- * get that intercept, this function will be called again and we'll
- * request the VINTR intercept then.
- */
- if (gif_set(svm) && nested_svm_intr(svm)) {
- svm_set_vintr(svm);
- svm_inject_irq(svm, 0x0);
- }
-}
-
-static void enable_nmi_window(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_IRET_MASK))
- == HF_NMI_MASK)
- return; /* IRET will cause a vm exit */
-
- /*
- * Something prevents the NMI from being injected. Single-step over
- * the possible problem (IRET, exception injection, or interrupt shadow).
- */
- svm->nmi_singlestep = true;
- svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
- update_db_intercept(vcpu);
-}
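
/*
 * Illustrative sketch (not part of the patch): the single-step trick used by
 * enable_nmi_window() above. Setting RFLAGS.TF (bit 8) raises #DB after the
 * next instruction; RFLAGS.RF (bit 16) suppresses a spurious instruction
 * breakpoint on that step. db_interception() clears both again. Names here
 * are hypothetical.
 */
#define RFLAGS_TF (1ul << 8)
#define RFLAGS_RF (1ul << 16)

static unsigned long start_nmi_singlestep(unsigned long rflags)
{
        return rflags | RFLAGS_TF | RFLAGS_RF;
}

static unsigned long stop_nmi_singlestep(unsigned long rflags)
{
        return rflags & ~(RFLAGS_TF | RFLAGS_RF);
}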
-
-static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr)
-{
- return 0;
-}
-
-static void svm_flush_tlb(struct kvm_vcpu *vcpu)
-{
- force_new_asid(vcpu);
-}
-
-static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
-{
-}
-
-static inline void sync_cr8_to_lapic(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- if (is_nested(svm) && (vcpu->arch.hflags & HF_VINTR_MASK))
- return;
-
- if (!(svm->vmcb->control.intercept_cr_write & INTERCEPT_CR8_MASK)) {
- int cr8 = svm->vmcb->control.int_ctl & V_TPR_MASK;
- kvm_set_cr8(vcpu, cr8);
- }
-}
-
-static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- u64 cr8;
-
- if (is_nested(svm) && (vcpu->arch.hflags & HF_VINTR_MASK))
- return;
-
- cr8 = kvm_get_cr8(vcpu);
- svm->vmcb->control.int_ctl &= ~V_TPR_MASK;
- svm->vmcb->control.int_ctl |= cr8 & V_TPR_MASK;
-}
-
-static void svm_complete_interrupts(struct vcpu_svm *svm)
-{
- u8 vector;
- int type;
- u32 exitintinfo = svm->vmcb->control.exit_int_info;
- unsigned int3_injected = svm->int3_injected;
-
- svm->int3_injected = 0;
-
- if (svm->vcpu.arch.hflags & HF_IRET_MASK)
- svm->vcpu.arch.hflags &= ~(HF_NMI_MASK | HF_IRET_MASK);
-
- svm->vcpu.arch.nmi_injected = false;
- kvm_clear_exception_queue(&svm->vcpu);
- kvm_clear_interrupt_queue(&svm->vcpu);
-
- if (!(exitintinfo & SVM_EXITINTINFO_VALID))
- return;
-
- vector = exitintinfo & SVM_EXITINTINFO_VEC_MASK;
- type = exitintinfo & SVM_EXITINTINFO_TYPE_MASK;
-
- switch (type) {
- case SVM_EXITINTINFO_TYPE_NMI:
- svm->vcpu.arch.nmi_injected = true;
- break;
- case SVM_EXITINTINFO_TYPE_EXEPT:
- /*
- * In case of software exceptions, do not reinject the vector,
- * but re-execute the instruction instead. Rewind RIP first
- * if we emulated INT3 before.
- */
- if (kvm_exception_is_soft(vector)) {
- if (vector == BP_VECTOR && int3_injected &&
- kvm_is_linear_rip(&svm->vcpu, svm->int3_rip))
- kvm_rip_write(&svm->vcpu,
- kvm_rip_read(&svm->vcpu) -
- int3_injected);
- break;
- }
- if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
- u32 err = svm->vmcb->control.exit_int_info_err;
- kvm_requeue_exception_e(&svm->vcpu, vector, err);
-
- } else
- kvm_requeue_exception(&svm->vcpu, vector);
- break;
- case SVM_EXITINTINFO_TYPE_INTR:
- kvm_queue_interrupt(&svm->vcpu, vector, false);
- break;
- default:
- break;
- }
-}
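
/*
 * Illustrative sketch (not part of the patch): the EXITINTINFO decode done by
 * svm_complete_interrupts() above. Per the SVM event format: bits 0-7 =
 * vector, bits 8-10 = type (0 = INTR, 2 = NMI, 3 = exception, 4 = soft
 * interrupt), bit 11 = error code valid, bit 31 = info valid. Names here are
 * hypothetical.
 */
#include <stdint.h>

struct exitintinfo {
        int valid;
        int has_err;
        uint8_t vector;
        uint8_t type;
};

static struct exitintinfo decode_exitintinfo(uint32_t info)
{
        struct exitintinfo e;

        e.valid   = (int)((info >> 31) & 1);
        e.has_err = (int)((info >> 11) & 1);
        e.vector  = (uint8_t)(info & 0xff);
        e.type    = (uint8_t)((info >> 8) & 7);
        return e;
}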
-
-#ifdef CONFIG_X86_64
-#define R "r"
-#else
-#define R "e"
-#endif
-
-static void svm_vcpu_run(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
- u16 fs_selector;
- u16 gs_selector;
- u16 ldt_selector;
-
- svm->vmcb->save.rax = vcpu->arch.regs[VCPU_REGS_RAX];
- svm->vmcb->save.rsp = vcpu->arch.regs[VCPU_REGS_RSP];
- svm->vmcb->save.rip = vcpu->arch.regs[VCPU_REGS_RIP];
-
- /*
- * A vmexit emulation is required before the vcpu can be executed
- * again.
- */
- if (unlikely(svm->nested.exit_required))
- return;
-
- pre_svm_run(svm);
-
- sync_lapic_to_cr8(vcpu);
-
- save_host_msrs(vcpu);
- fs_selector = kvm_read_fs();
- gs_selector = kvm_read_gs();
- ldt_selector = kvm_read_ldt();
- svm->vmcb->save.cr2 = vcpu->arch.cr2;
- /* required for live migration with NPT */
- if (npt_enabled)
- svm->vmcb->save.cr3 = vcpu->arch.cr3;
-
- clgi();
-
- local_irq_enable();
-
- asm volatile (
- "push %%"R"bp; \n\t"
- "mov %c[rbx](%[svm]), %%"R"bx \n\t"
- "mov %c[rcx](%[svm]), %%"R"cx \n\t"
- "mov %c[rdx](%[svm]), %%"R"dx \n\t"
- "mov %c[rsi](%[svm]), %%"R"si \n\t"
- "mov %c[rdi](%[svm]), %%"R"di \n\t"
- "mov %c[rbp](%[svm]), %%"R"bp \n\t"
-#ifdef CONFIG_X86_64
- "mov %c[r8](%[svm]), %%r8 \n\t"
- "mov %c[r9](%[svm]), %%r9 \n\t"
- "mov %c[r10](%[svm]), %%r10 \n\t"
- "mov %c[r11](%[svm]), %%r11 \n\t"
- "mov %c[r12](%[svm]), %%r12 \n\t"
- "mov %c[r13](%[svm]), %%r13 \n\t"
- "mov %c[r14](%[svm]), %%r14 \n\t"
- "mov %c[r15](%[svm]), %%r15 \n\t"
-#endif
-
- /* Enter guest mode */
- "push %%"R"ax \n\t"
- "mov %c[vmcb](%[svm]), %%"R"ax \n\t"
- __ex(SVM_VMLOAD) "\n\t"
- __ex(SVM_VMRUN) "\n\t"
- __ex(SVM_VMSAVE) "\n\t"
- "pop %%"R"ax \n\t"
-
- /* Save guest registers, load host registers */
- "mov %%"R"bx, %c[rbx](%[svm]) \n\t"
- "mov %%"R"cx, %c[rcx](%[svm]) \n\t"
- "mov %%"R"dx, %c[rdx](%[svm]) \n\t"
- "mov %%"R"si, %c[rsi](%[svm]) \n\t"
- "mov %%"R"di, %c[rdi](%[svm]) \n\t"
- "mov %%"R"bp, %c[rbp](%[svm]) \n\t"
-#ifdef CONFIG_X86_64
- "mov %%r8, %c[r8](%[svm]) \n\t"
- "mov %%r9, %c[r9](%[svm]) \n\t"
- "mov %%r10, %c[r10](%[svm]) \n\t"
- "mov %%r11, %c[r11](%[svm]) \n\t"
- "mov %%r12, %c[r12](%[svm]) \n\t"
- "mov %%r13, %c[r13](%[svm]) \n\t"
- "mov %%r14, %c[r14](%[svm]) \n\t"
- "mov %%r15, %c[r15](%[svm]) \n\t"
-#endif
- "pop %%"R"bp"
- :
- : [svm]"a"(svm),
- [vmcb]"i"(offsetof(struct vcpu_svm, vmcb_pa)),
- [rbx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RBX])),
- [rcx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RCX])),
- [rdx]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RDX])),
- [rsi]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RSI])),
- [rdi]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RDI])),
- [rbp]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_RBP]))
-#ifdef CONFIG_X86_64
- , [r8]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R8])),
- [r9]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R9])),
- [r10]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R10])),
- [r11]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R11])),
- [r12]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R12])),
- [r13]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R13])),
- [r14]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R14])),
- [r15]"i"(offsetof(struct vcpu_svm, vcpu.arch.regs[VCPU_REGS_R15]))
-#endif
- : "cc", "memory"
- , R"bx", R"cx", R"dx", R"si", R"di"
-#ifdef CONFIG_X86_64
- , "r8", "r9", "r10", "r11" , "r12", "r13", "r14", "r15"
-#endif
- );
-
- vcpu->arch.cr2 = svm->vmcb->save.cr2;
- vcpu->arch.regs[VCPU_REGS_RAX] = svm->vmcb->save.rax;
- vcpu->arch.regs[VCPU_REGS_RSP] = svm->vmcb->save.rsp;
- vcpu->arch.regs[VCPU_REGS_RIP] = svm->vmcb->save.rip;
-
- kvm_load_fs(fs_selector);
- kvm_load_gs(gs_selector);
- kvm_load_ldt(ldt_selector);
- load_host_msrs(vcpu);
-
- reload_tss(vcpu);
-
- local_irq_disable();
-
- stgi();
-
- sync_cr8_to_lapic(vcpu);
-
- svm->next_rip = 0;
-
- if (npt_enabled) {
- vcpu->arch.regs_avail &= ~(1 << VCPU_EXREG_PDPTR);
- vcpu->arch.regs_dirty &= ~(1 << VCPU_EXREG_PDPTR);
- }
-
- /*
- * We need to handle MC intercepts here before the vcpu has a chance to
- * change the physical cpu.
- */
- if (unlikely(svm->vmcb->control.exit_code ==
- SVM_EXIT_EXCP_BASE + MC_VECTOR))
- svm_handle_mce(svm);
-}
-
-#undef R
-
-static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned long root)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- if (npt_enabled) {
- svm->vmcb->control.nested_cr3 = root;
- force_new_asid(vcpu);
- return;
- }
-
- svm->vmcb->save.cr3 = root;
- force_new_asid(vcpu);
-}
-
-static int is_disabled(void)
-{
- u64 vm_cr;
-
- rdmsrl(MSR_VM_CR, vm_cr);
- if (vm_cr & (1 << SVM_VM_CR_SVM_DISABLE))
- return 1;
-
- return 0;
-}
-
-static void
-svm_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
-{
- /*
- * Patch in the VMMCALL instruction:
- */
- hypercall[0] = 0x0f;
- hypercall[1] = 0x01;
- hypercall[2] = 0xd9;
-}
-
-static void svm_check_processor_compat(void *rtn)
-{
- *(int *)rtn = 0;
-}
-
-static bool svm_cpu_has_accelerated_tpr(void)
-{
- return false;
-}
-
-static int get_npt_level(void)
-{
-#ifdef CONFIG_X86_64
- return PT64_ROOT_LEVEL;
-#else
- return PT32E_ROOT_LEVEL;
-#endif
-}
-
-static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
-{
- return 0;
-}
-
-static void svm_cpuid_update(struct kvm_vcpu *vcpu)
-{
-}
-
-static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
-{
- switch (func) {
- case 0x8000000A:
- entry->eax = 1; /* SVM revision 1 */
- entry->ebx = 8; /* Let's support 8 ASIDs in case we add proper
- ASID emulation to nested SVM */
- entry->ecx = 0; /* Reserved */
- entry->edx = 0; /* Do not support any additional features */
-
- break;
- }
-}
-
-static const struct trace_print_flags svm_exit_reasons_str[] = {
- { SVM_EXIT_READ_CR0, "read_cr0" },
- { SVM_EXIT_READ_CR3, "read_cr3" },
- { SVM_EXIT_READ_CR4, "read_cr4" },
- { SVM_EXIT_READ_CR8, "read_cr8" },
- { SVM_EXIT_WRITE_CR0, "write_cr0" },
- { SVM_EXIT_WRITE_CR3, "write_cr3" },
- { SVM_EXIT_WRITE_CR4, "write_cr4" },
- { SVM_EXIT_WRITE_CR8, "write_cr8" },
- { SVM_EXIT_READ_DR0, "read_dr0" },
- { SVM_EXIT_READ_DR1, "read_dr1" },
- { SVM_EXIT_READ_DR2, "read_dr2" },
- { SVM_EXIT_READ_DR3, "read_dr3" },
- { SVM_EXIT_WRITE_DR0, "write_dr0" },
- { SVM_EXIT_WRITE_DR1, "write_dr1" },
- { SVM_EXIT_WRITE_DR2, "write_dr2" },
- { SVM_EXIT_WRITE_DR3, "write_dr3" },
- { SVM_EXIT_WRITE_DR5, "write_dr5" },
- { SVM_EXIT_WRITE_DR7, "write_dr7" },
- { SVM_EXIT_EXCP_BASE + DB_VECTOR, "DB excp" },
- { SVM_EXIT_EXCP_BASE + BP_VECTOR, "BP excp" },
- { SVM_EXIT_EXCP_BASE + UD_VECTOR, "UD excp" },
- { SVM_EXIT_EXCP_BASE + PF_VECTOR, "PF excp" },
- { SVM_EXIT_EXCP_BASE + NM_VECTOR, "NM excp" },
- { SVM_EXIT_EXCP_BASE + MC_VECTOR, "MC excp" },
- { SVM_EXIT_INTR, "interrupt" },
- { SVM_EXIT_NMI, "nmi" },
- { SVM_EXIT_SMI, "smi" },
- { SVM_EXIT_INIT, "init" },
- { SVM_EXIT_VINTR, "vintr" },
- { SVM_EXIT_CPUID, "cpuid" },
- { SVM_EXIT_INVD, "invd" },
- { SVM_EXIT_HLT, "hlt" },
- { SVM_EXIT_INVLPG, "invlpg" },
- { SVM_EXIT_INVLPGA, "invlpga" },
- { SVM_EXIT_IOIO, "io" },
- { SVM_EXIT_MSR, "msr" },
- { SVM_EXIT_TASK_SWITCH, "task_switch" },
- { SVM_EXIT_SHUTDOWN, "shutdown" },
- { SVM_EXIT_VMRUN, "vmrun" },
- { SVM_EXIT_VMMCALL, "hypercall" },
- { SVM_EXIT_VMLOAD, "vmload" },
- { SVM_EXIT_VMSAVE, "vmsave" },
- { SVM_EXIT_STGI, "stgi" },
- { SVM_EXIT_CLGI, "clgi" },
- { SVM_EXIT_SKINIT, "skinit" },
- { SVM_EXIT_WBINVD, "wbinvd" },
- { SVM_EXIT_MONITOR, "monitor" },
- { SVM_EXIT_MWAIT, "mwait" },
- { SVM_EXIT_NPF, "npf" },
- { -1, NULL }
-};
-
-static int svm_get_lpage_level(void)
-{
- return PT_PDPE_LEVEL;
-}
-
-static bool svm_rdtscp_supported(void)
-{
- return false;
-}
-
-static void svm_fpu_deactivate(struct kvm_vcpu *vcpu)
-{
- struct vcpu_svm *svm = to_svm(vcpu);
-
- svm->vmcb->control.intercept_exceptions |= 1 << NM_VECTOR;
- if (is_nested(svm))
- svm->nested.hsave->control.intercept_exceptions |= 1 << NM_VECTOR;
- update_cr0_intercept(svm);
-}
-
-static struct kvm_x86_ops svm_x86_ops = {
- .cpu_has_kvm_support = has_svm,
- .disabled_by_bios = is_disabled,
- .hardware_setup = svm_hardware_setup,
- .hardware_unsetup = svm_hardware_unsetup,
- .check_processor_compatibility = svm_check_processor_compat,
- .hardware_enable = svm_hardware_enable,
- .hardware_disable = svm_hardware_disable,
- .cpu_has_accelerated_tpr = svm_cpu_has_accelerated_tpr,
-
- .vcpu_create = svm_create_vcpu,
- .vcpu_free = svm_free_vcpu,
- .vcpu_reset = svm_vcpu_reset,
-
- .prepare_guest_switch = svm_prepare_guest_switch,
- .vcpu_load = svm_vcpu_load,
- .vcpu_put = svm_vcpu_put,
-
- .set_guest_debug = svm_guest_debug,
- .get_msr = svm_get_msr,
- .set_msr = svm_set_msr,
- .get_segment_base = svm_get_segment_base,
- .get_segment = svm_get_segment,
- .set_segment = svm_set_segment,
- .get_cpl = svm_get_cpl,
- .get_cs_db_l_bits = kvm_get_cs_db_l_bits,
- .decache_cr0_guest_bits = svm_decache_cr0_guest_bits,
- .decache_cr4_guest_bits = svm_decache_cr4_guest_bits,
- .set_cr0 = svm_set_cr0,
- .set_cr3 = svm_set_cr3,
- .set_cr4 = svm_set_cr4,
- .set_efer = svm_set_efer,
- .get_idt = svm_get_idt,
- .set_idt = svm_set_idt,
- .get_gdt = svm_get_gdt,
- .set_gdt = svm_set_gdt,
- .set_dr7 = svm_set_dr7,
- .cache_reg = svm_cache_reg,
- .get_rflags = svm_get_rflags,
- .set_rflags = svm_set_rflags,
- .fpu_activate = svm_fpu_activate,
- .fpu_deactivate = svm_fpu_deactivate,
-
- .tlb_flush = svm_flush_tlb,
-
- .run = svm_vcpu_run,
- .handle_exit = handle_exit,
- .skip_emulated_instruction = skip_emulated_instruction,
- .set_interrupt_shadow = svm_set_interrupt_shadow,
- .get_interrupt_shadow = svm_get_interrupt_shadow,
- .patch_hypercall = svm_patch_hypercall,
- .set_irq = svm_set_irq,
- .set_nmi = svm_inject_nmi,
- .queue_exception = svm_queue_exception,
- .interrupt_allowed = svm_interrupt_allowed,
- .nmi_allowed = svm_nmi_allowed,
- .get_nmi_mask = svm_get_nmi_mask,
- .set_nmi_mask = svm_set_nmi_mask,
- .enable_nmi_window = enable_nmi_window,
- .enable_irq_window = enable_irq_window,
- .update_cr8_intercept = update_cr8_intercept,
-
- .set_tss_addr = svm_set_tss_addr,
- .get_tdp_level = get_npt_level,
- .get_mt_mask = svm_get_mt_mask,
-
- .exit_reasons_str = svm_exit_reasons_str,
- .get_lpage_level = svm_get_lpage_level,
-
- .cpuid_update = svm_cpuid_update,
-
- .rdtscp_supported = svm_rdtscp_supported,
-
- .set_supported_cpuid = svm_set_supported_cpuid,
-};
-
-static int __init svm_init(void)
-{
- return kvm_init(&svm_x86_ops, sizeof(struct vcpu_svm),
- __alignof__(struct vcpu_svm), THIS_MODULE);
-}
-
-static void __exit svm_exit(void)
-{
- kvm_exit();
-}
-
-module_init(svm_init)
-module_exit(svm_exit)
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
new file mode 100644
index 000000000000..6b77b2033208
--- /dev/null
+++ b/arch/x86/kvm/svm/avic.c
@@ -0,0 +1,1305 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * AMD SVM support
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
+ *
+ * Authors:
+ * Yaniv Kamay <yaniv@qumranet.com>
+ * Avi Kivity <avi@qumranet.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kvm_types.h>
+#include <linux/hashtable.h>
+#include <linux/amd-iommu.h>
+#include <linux/kvm_host.h>
+#include <linux/kvm_irqfd.h>
+
+#include <asm/irq_remapping.h>
+#include <asm/msr.h>
+
+#include "trace.h"
+#include "lapic.h"
+#include "x86.h"
+#include "irq.h"
+#include "svm.h"
+
+/*
+ * Encode the arbitrary VM ID and the vCPU's _index_ into the GATag so that
+ * KVM can retrieve the correct vCPU from a GALog entry if an interrupt can't
+ * be delivered, e.g. because the vCPU isn't running. Use the vCPU's index
+ * instead of its ID (a.k.a. its default APIC ID), as KVM is guaranteed a fast
+ * lookup on the index, where as vCPUs whose index doesn't match their ID need
+ * to walk the entire xarray of vCPUs in the worst case scenario.
+ *
+ * For the vCPU index, use however many bits are currently allowed for the max
+ * guest physical APIC ID (limited by the size of the physical ID table), and
+ * use whatever bits remain to assign arbitrary AVIC IDs to VMs. Note, the
+ * size of the GATag is defined by hardware (32 bits), but is an opaque value
+ * as far as hardware is concerned.
+ */
+#define AVIC_VCPU_IDX_MASK AVIC_PHYSICAL_MAX_INDEX_MASK
+
+#define AVIC_VM_ID_SHIFT HWEIGHT32(AVIC_PHYSICAL_MAX_INDEX_MASK)
+#define AVIC_VM_ID_MASK (GENMASK(31, AVIC_VM_ID_SHIFT) >> AVIC_VM_ID_SHIFT)
+
+#define AVIC_GATAG_TO_VMID(x) ((x >> AVIC_VM_ID_SHIFT) & AVIC_VM_ID_MASK)
+#define AVIC_GATAG_TO_VCPUIDX(x) (x & AVIC_VCPU_IDX_MASK)
+
+#define __AVIC_GATAG(vm_id, vcpu_idx) ((((vm_id) & AVIC_VM_ID_MASK) << AVIC_VM_ID_SHIFT) | \
+ ((vcpu_idx) & AVIC_VCPU_IDX_MASK))
+#define AVIC_GATAG(vm_id, vcpu_idx) \
+({ \
+ u32 ga_tag = __AVIC_GATAG(vm_id, vcpu_idx); \
+ \
+ WARN_ON_ONCE(AVIC_GATAG_TO_VCPUIDX(ga_tag) != (vcpu_idx)); \
+ WARN_ON_ONCE(AVIC_GATAG_TO_VMID(ga_tag) != (vm_id)); \
+ ga_tag; \
+})
+
+static_assert(__AVIC_GATAG(AVIC_VM_ID_MASK, AVIC_VCPU_IDX_MASK) == -1u);
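+
+/*
+ * For example, assuming a 9-bit physical-ID index mask (bits 8:0),
+ * AVIC_GATAG(0x2, 0x5) encodes as (0x2 << 9) | 0x5 = 0x405, i.e. the vCPU
+ * index lands in GATag[8:0] and the VM ID in GATag[31:9].
+ */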
+
+#define AVIC_AUTO_MODE -1
+
+static int avic_param_set(const char *val, const struct kernel_param *kp)
+{
+ if (val && sysfs_streq(val, "auto")) {
+ *(int *)kp->arg = AVIC_AUTO_MODE;
+ return 0;
+ }
+
+ return param_set_bint(val, kp);
+}
+
+static const struct kernel_param_ops avic_ops = {
+ .flags = KERNEL_PARAM_OPS_FL_NOARG,
+ .set = avic_param_set,
+ .get = param_get_bool,
+};
+
+/*
+ * Enable / disable AVIC. In "auto" mode (default behavior), AVIC is enabled
+ * for Zen4+ CPUs with x2AVIC, provided all other enablement criteria are met.
+ */
+static int avic = AVIC_AUTO_MODE;
+module_param_cb(avic, &avic_ops, &avic, 0444);
+__MODULE_PARM_TYPE(avic, "bool");
+
+module_param(enable_ipiv, bool, 0444);
+
+static bool force_avic;
+module_param_unsafe(force_avic, bool, 0444);
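+
+/*
+ * Usage sketch (illustrative, not exhaustive): loading kvm-amd with "avic"
+ * or "avic=1" enables AVIC, "avic=0" disables it, and "avic=auto" (the
+ * default) defers to the checks in avic_want_avic_enabled(). Setting
+ * force_avic taints the kernel, as module_param_unsafe() marks the
+ * parameter unsafe.
+ */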
+
+/* Note:
+ * This hash table is used to map VM_ID to a struct kvm_svm,
+ * when handling AMD IOMMU GALOG notification to schedule in
+ * a particular vCPU.
+ */
+#define SVM_VM_DATA_HASH_BITS 8
+static DEFINE_HASHTABLE(svm_vm_data_hash, SVM_VM_DATA_HASH_BITS);
+static u32 next_vm_id = 0;
+static bool next_vm_id_wrapped = false;
+static DEFINE_SPINLOCK(svm_vm_data_hash_lock);
+static bool x2avic_enabled;
+static u32 x2avic_max_physical_id;
+
+static void avic_set_x2apic_msr_interception(struct vcpu_svm *svm,
+ bool intercept)
+{
+ static const u32 x2avic_passthrough_msrs[] = {
+ X2APIC_MSR(APIC_ID),
+ X2APIC_MSR(APIC_LVR),
+ X2APIC_MSR(APIC_TASKPRI),
+ X2APIC_MSR(APIC_ARBPRI),
+ X2APIC_MSR(APIC_PROCPRI),
+ X2APIC_MSR(APIC_EOI),
+ X2APIC_MSR(APIC_RRR),
+ X2APIC_MSR(APIC_LDR),
+ X2APIC_MSR(APIC_DFR),
+ X2APIC_MSR(APIC_SPIV),
+ X2APIC_MSR(APIC_ISR),
+ X2APIC_MSR(APIC_TMR),
+ X2APIC_MSR(APIC_IRR),
+ X2APIC_MSR(APIC_ESR),
+ X2APIC_MSR(APIC_ICR),
+ X2APIC_MSR(APIC_ICR2),
+
+ /*
+ * Note! Always intercept LVTT, as TSC-deadline timer mode
+ * isn't virtualized by hardware, and the CPU will generate a
+ * #GP instead of a #VMEXIT.
+ */
+ X2APIC_MSR(APIC_LVTTHMR),
+ X2APIC_MSR(APIC_LVTPC),
+ X2APIC_MSR(APIC_LVT0),
+ X2APIC_MSR(APIC_LVT1),
+ X2APIC_MSR(APIC_LVTERR),
+ X2APIC_MSR(APIC_TMICT),
+ X2APIC_MSR(APIC_TMCCT),
+ X2APIC_MSR(APIC_TDCR),
+ };
+ int i;
+
+ if (intercept == svm->x2avic_msrs_intercepted)
+ return;
+
+ if (!x2avic_enabled)
+ return;
+
+ for (i = 0; i < ARRAY_SIZE(x2avic_passthrough_msrs); i++)
+ svm_set_intercept_for_msr(&svm->vcpu, x2avic_passthrough_msrs[i],
+ MSR_TYPE_RW, intercept);
+
+ svm->x2avic_msrs_intercepted = intercept;
+}
+
+static u32 __avic_get_max_physical_id(struct kvm *kvm, struct kvm_vcpu *vcpu)
+{
+ u32 arch_max;
+
+ /*
+ * Return the largest size (x2APIC) when querying without a vCPU, e.g.
+ * to allocate the per-VM table.
+ */
+ if (x2avic_enabled && (!vcpu || apic_x2apic_mode(vcpu->arch.apic)))
+ arch_max = x2avic_max_physical_id;
+ else
+ arch_max = AVIC_MAX_PHYSICAL_ID;
+
+ /*
+ * Despite its name, KVM_CAP_MAX_VCPU_ID represents the maximum APIC ID
+ * plus one, so the max possible APIC ID is one less than that.
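+ * For example, a VM created with KVM_CAP_MAX_VCPU_ID = 256 has a max
+ * possible APIC ID of 255, which is then clamped to the architectural limit.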
+ */
+ return min(kvm->arch.max_vcpu_ids - 1, arch_max);
+}
+
+static u32 avic_get_max_physical_id(struct kvm_vcpu *vcpu)
+{
+ return __avic_get_max_physical_id(vcpu->kvm, vcpu);
+}
+
+static void avic_activate_vmcb(struct vcpu_svm *svm)
+{
+ struct vmcb *vmcb = svm->vmcb01.ptr;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+
+ vmcb->control.int_ctl &= ~(AVIC_ENABLE_MASK | X2APIC_MODE_MASK);
+
+ vmcb->control.avic_physical_id &= ~AVIC_PHYSICAL_MAX_INDEX_MASK;
+ vmcb->control.avic_physical_id |= avic_get_max_physical_id(vcpu);
+
+ vmcb->control.int_ctl |= AVIC_ENABLE_MASK;
+
+ /*
+ * Note: KVM supports hybrid-AVIC mode, where KVM emulates x2APIC MSR
+ * accesses, while interrupt injection to a running vCPU can be
+ * achieved using AVIC doorbell. KVM disables the APIC access page
+ * (deletes the memslot) if any vCPU has x2APIC enabled, thus enabling
+ * AVIC in hybrid mode activates only the doorbell mechanism.
+ */
+ if (x2avic_enabled && apic_x2apic_mode(svm->vcpu.arch.apic)) {
+ vmcb->control.int_ctl |= X2APIC_MODE_MASK;
+
+ /* Disabling MSR intercept for x2APIC registers */
+ avic_set_x2apic_msr_interception(svm, false);
+ } else {
+ /*
+ * Flush the TLB, as the guest may have inserted a non-APIC
+ * mapping into the TLB while AVIC was disabled.
+ */
+ kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, &svm->vcpu);
+
+ /* Enabling MSR intercept for x2APIC registers */
+ avic_set_x2apic_msr_interception(svm, true);
+ }
+}
+
+static void avic_deactivate_vmcb(struct vcpu_svm *svm)
+{
+ struct vmcb *vmcb = svm->vmcb01.ptr;
+
+ vmcb->control.int_ctl &= ~(AVIC_ENABLE_MASK | X2APIC_MODE_MASK);
+ vmcb->control.avic_physical_id &= ~AVIC_PHYSICAL_MAX_INDEX_MASK;
+
+ /*
+ * If running nested and the guest uses its own MSR bitmap, there
+ * is no need to update L0's msr bitmap
+ */
+ if (is_guest_mode(&svm->vcpu) &&
+ vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_MSR_PROT))
+ return;
+
+ /* Enabling MSR intercept for x2APIC registers */
+ avic_set_x2apic_msr_interception(svm, true);
+}
+
+/* Note:
+ * This function is called from IOMMU driver to notify
+ * SVM to schedule in a particular vCPU of a particular VM.
+ */
+static int avic_ga_log_notifier(u32 ga_tag)
+{
+ unsigned long flags;
+ struct kvm_svm *kvm_svm;
+ struct kvm_vcpu *vcpu = NULL;
+ u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag);
+ u32 vcpu_idx = AVIC_GATAG_TO_VCPUIDX(ga_tag);
+
+ pr_debug("SVM: %s: vm_id=%#x, vcpu_idx=%#x\n", __func__, vm_id, vcpu_idx);
+ trace_kvm_avic_ga_log(vm_id, vcpu_idx);
+
+ spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
+ hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) {
+ if (kvm_svm->avic_vm_id != vm_id)
+ continue;
+ vcpu = kvm_get_vcpu(&kvm_svm->kvm, vcpu_idx);
+ break;
+ }
+ spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
+
+ /* Note:
+ * At this point, the IOMMU should have already set the pending
+ * bit in the vAPIC backing page. So, we just need to schedule
+ * in the vcpu.
+ */
+ if (vcpu)
+ kvm_vcpu_wake_up(vcpu);
+
+ return 0;
+}
+
+static int avic_get_physical_id_table_order(struct kvm *kvm)
+{
+ /* Provision for the maximum physical ID supported in x2avic mode */
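+ /*
+ * E.g. assuming a max physical ID of 0x1ff (512 eight-byte entries), the
+ * table occupies exactly 4KiB and get_order() returns 0.
+ */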
+ return get_order((__avic_get_max_physical_id(kvm, NULL) + 1) * sizeof(u64));
+}
+
+int avic_alloc_physical_id_table(struct kvm *kvm)
+{
+ struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+
+ if (!irqchip_in_kernel(kvm) || !enable_apicv)
+ return 0;
+
+ if (kvm_svm->avic_physical_id_table)
+ return 0;
+
+ kvm_svm->avic_physical_id_table = (void *)__get_free_pages(GFP_KERNEL_ACCOUNT | __GFP_ZERO,
+ avic_get_physical_id_table_order(kvm));
+ if (!kvm_svm->avic_physical_id_table)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void avic_vm_destroy(struct kvm *kvm)
+{
+ unsigned long flags;
+ struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+
+ if (!enable_apicv)
+ return;
+
+ free_page((unsigned long)kvm_svm->avic_logical_id_table);
+ free_pages((unsigned long)kvm_svm->avic_physical_id_table,
+ avic_get_physical_id_table_order(kvm));
+
+ spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
+ hash_del(&kvm_svm->hnode);
+ spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
+}
+
+int avic_vm_init(struct kvm *kvm)
+{
+ unsigned long flags;
+ int err = -ENOMEM;
+ struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+ struct kvm_svm *k2;
+ u32 vm_id;
+
+ if (!enable_apicv)
+ return 0;
+
+ kvm_svm->avic_logical_id_table = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
+ if (!kvm_svm->avic_logical_id_table)
+ goto free_avic;
+
+ spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
+ again:
+ vm_id = next_vm_id = (next_vm_id + 1) & AVIC_VM_ID_MASK;
+ if (vm_id == 0) { /* id is 1-based, zero is not okay */
+ next_vm_id_wrapped = 1;
+ goto again;
+ }
+ /* Is it still in use? Only possible if wrapped at least once */
+ if (next_vm_id_wrapped) {
+ hash_for_each_possible(svm_vm_data_hash, k2, hnode, vm_id) {
+ if (k2->avic_vm_id == vm_id)
+ goto again;
+ }
+ }
+ kvm_svm->avic_vm_id = vm_id;
+ hash_add(svm_vm_data_hash, &kvm_svm->hnode, kvm_svm->avic_vm_id);
+ spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
+
+ return 0;
+
+free_avic:
+ avic_vm_destroy(kvm);
+ return err;
+}
+
+static phys_addr_t avic_get_backing_page_address(struct vcpu_svm *svm)
+{
+ return __sme_set(__pa(svm->vcpu.arch.apic->regs));
+}
+
+void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
+{
+ struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
+
+ vmcb->control.avic_backing_page = avic_get_backing_page_address(svm);
+ vmcb->control.avic_logical_id = __sme_set(__pa(kvm_svm->avic_logical_id_table));
+ vmcb->control.avic_physical_id = __sme_set(__pa(kvm_svm->avic_physical_id_table));
+ vmcb->control.avic_vapic_bar = APIC_DEFAULT_PHYS_BASE;
+
+ if (kvm_apicv_activated(svm->vcpu.kvm))
+ avic_activate_vmcb(svm);
+ else
+ avic_deactivate_vmcb(svm);
+}
+
+static int avic_init_backing_page(struct kvm_vcpu *vcpu)
+{
+ struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u32 id = vcpu->vcpu_id;
+ u64 new_entry;
+
+ /*
+ * Inhibit AVIC if the vCPU ID is bigger than what is supported by AVIC
+ * hardware. Immediately clear apicv_active, i.e. don't wait until the
+ * KVM_REQ_APICV_UPDATE request is processed on the first KVM_RUN, as
+ * avic_vcpu_load() expects to be called if and only if the vCPU has
+ * fully initialized AVIC.
+ */
+ if ((!x2avic_enabled && id > AVIC_MAX_PHYSICAL_ID) ||
+ (id > x2avic_max_physical_id)) {
+ kvm_set_apicv_inhibit(vcpu->kvm, APICV_INHIBIT_REASON_PHYSICAL_ID_TOO_BIG);
+ vcpu->arch.apic->apicv_active = false;
+ return 0;
+ }
+
+ BUILD_BUG_ON((AVIC_MAX_PHYSICAL_ID + 1) * sizeof(new_entry) > PAGE_SIZE ||
+ (X2AVIC_MAX_PHYSICAL_ID + 1) * sizeof(new_entry) > PAGE_SIZE);
+
+ if (WARN_ON_ONCE(!vcpu->arch.apic->regs))
+ return -EINVAL;
+
+ if (kvm_apicv_activated(vcpu->kvm)) {
+ int ret;
+
+ /*
+ * Note, AVIC hardware walks the nested page table to check
+ * permissions, but does not use the SPA address specified in
+ * the leaf SPTE since it uses address in the AVIC_BACKING_PAGE
+ * pointer field of the VMCB.
+ */
+ ret = kvm_alloc_apic_access_page(vcpu->kvm);
+ if (ret)
+ return ret;
+ }
+
+ /* Note, fls64() returns the bit position, +1. */
+ BUILD_BUG_ON(__PHYSICAL_MASK_SHIFT >
+ fls64(AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK));
+
+ /* Setting AVIC backing page address in the phy APIC ID table */
+ new_entry = avic_get_backing_page_address(svm) |
+ AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
+ svm->avic_physical_id_entry = new_entry;
+
+ /*
+ * Initialize the real table, as vCPUs must have a valid entry in order
+ * for broadcast IPIs to function correctly (broadcast IPIs ignore
+ * invalid entries, i.e. aren't guaranteed to generate a VM-Exit).
+ */
+ WRITE_ONCE(kvm_svm->avic_physical_id_table[id], new_entry);
+
+ return 0;
+}
+
+void avic_ring_doorbell(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Note, the vCPU could get migrated to a different pCPU at any point,
+ * which could result in signalling the wrong/previous pCPU. But if
+ * that happens the vCPU is guaranteed to do a VMRUN (after being
+ * migrated) and thus will process pending interrupts, i.e. a doorbell
+ * is not needed (and the spurious one is harmless).
+ */
+ int cpu = READ_ONCE(vcpu->cpu);
+
+ if (cpu != get_cpu()) {
+ wrmsrq(MSR_AMD64_SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
+ trace_kvm_avic_doorbell(vcpu->vcpu_id, kvm_cpu_get_apicid(cpu));
+ }
+ put_cpu();
+}
+
+static void avic_kick_vcpu(struct kvm_vcpu *vcpu, u32 icrl)
+{
+ vcpu->arch.apic->irr_pending = true;
+ svm_complete_interrupt_delivery(vcpu,
+ icrl & APIC_MODE_MASK,
+ icrl & APIC_INT_LEVELTRIG,
+ icrl & APIC_VECTOR_MASK);
+}
+
+static void avic_kick_vcpu_by_physical_id(struct kvm *kvm, u32 physical_id,
+ u32 icrl)
+{
+ /*
+ * KVM inhibits AVIC if any vCPU ID diverges from the vCPU's APIC ID,
+ * i.e. APIC ID == vCPU ID.
+ */
+ struct kvm_vcpu *target_vcpu = kvm_get_vcpu_by_id(kvm, physical_id);
+
+ /* Once again, nothing to do if the target vCPU doesn't exist. */
+ if (unlikely(!target_vcpu))
+ return;
+
+ avic_kick_vcpu(target_vcpu, icrl);
+}
+
+static void avic_kick_vcpu_by_logical_id(struct kvm *kvm, u32 *avic_logical_id_table,
+ u32 logid_index, u32 icrl)
+{
+ u32 physical_id;
+
+ if (avic_logical_id_table) {
+ u32 logid_entry = avic_logical_id_table[logid_index];
+
+ /* Nothing to do if the logical destination is invalid. */
+ if (unlikely(!(logid_entry & AVIC_LOGICAL_ID_ENTRY_VALID_MASK)))
+ return;
+
+ physical_id = logid_entry &
+ AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK;
+ } else {
+ /*
+ * For x2APIC, the logical APIC ID is a read-only value that is
+ * derived from the x2APIC ID, thus the x2APIC ID can be found
+ * by reversing the calculation (stored in logid_index). Note,
+ * bits 31:20 of the x2APIC ID aren't propagated to the logical
+ * ID, but KVM limits the x2APIC ID to KVM_MAX_VCPU_IDS.
+ */
+ physical_id = logid_index;
+ }
+
+ avic_kick_vcpu_by_physical_id(kvm, physical_id, icrl);
+}
+
+/*
+ * A fast-path version of avic_kick_target_vcpus(), which attempts to match
+ * destination APIC ID to vCPU without looping through all vCPUs.
+ */
+static int avic_kick_target_vcpus_fast(struct kvm *kvm, struct kvm_lapic *source,
+ u32 icrl, u32 icrh, u32 index)
+{
+ int dest_mode = icrl & APIC_DEST_MASK;
+ int shorthand = icrl & APIC_SHORT_MASK;
+ struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+ u32 dest;
+
+ if (shorthand != APIC_DEST_NOSHORT)
+ return -EINVAL;
+
+ if (apic_x2apic_mode(source))
+ dest = icrh;
+ else
+ dest = GET_XAPIC_DEST_FIELD(icrh);
+
+ if (dest_mode == APIC_DEST_PHYSICAL) {
+ /* broadcast destination, use slow path */
+ if (apic_x2apic_mode(source) && dest == X2APIC_BROADCAST)
+ return -EINVAL;
+ if (!apic_x2apic_mode(source) && dest == APIC_BROADCAST)
+ return -EINVAL;
+
+ if (WARN_ON_ONCE(dest != index))
+ return -EINVAL;
+
+ avic_kick_vcpu_by_physical_id(kvm, dest, icrl);
+ } else {
+ u32 *avic_logical_id_table;
+ unsigned long bitmap, i;
+ u32 cluster;
+
+ if (apic_x2apic_mode(source)) {
+ /* 16 bit dest mask, 16 bit cluster id */
+ bitmap = dest & 0xFFFF;
+ cluster = (dest >> 16) << 4;
+ } else if (kvm_lapic_get_reg(source, APIC_DFR) == APIC_DFR_FLAT) {
+ /* 8 bit dest mask */
+ bitmap = dest;
+ cluster = 0;
+ } else {
+ /* 4 bit dest mask, 4 bit cluster id */
+ bitmap = dest & 0xF;
+ cluster = (dest >> 4) << 2;
+ }
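+
+ /*
+ * E.g. in xAPIC cluster mode, dest = 0x34 selects cluster 3 (cluster =
+ * 12 after the shift) with dest mask 0x4, which resolves to logical ID
+ * table index 12 + 2 = 14 below.
+ */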
+
+ /* Nothing to do if there are no destinations in the cluster. */
+ if (unlikely(!bitmap))
+ return 0;
+
+ if (apic_x2apic_mode(source))
+ avic_logical_id_table = NULL;
+ else
+ avic_logical_id_table = kvm_svm->avic_logical_id_table;
+
+ /*
+ * AVIC is inhibited if vCPUs aren't mapped 1:1 with logical
+ * IDs, thus each bit in the destination is guaranteed to map
+ * to at most one vCPU.
+ */
+ for_each_set_bit(i, &bitmap, 16)
+ avic_kick_vcpu_by_logical_id(kvm, avic_logical_id_table,
+ cluster + i, icrl);
+ }
+
+ return 0;
+}
+
+static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
+ u32 icrl, u32 icrh, u32 index)
+{
+ u32 dest = apic_x2apic_mode(source) ? icrh : GET_XAPIC_DEST_FIELD(icrh);
+ unsigned long i;
+ struct kvm_vcpu *vcpu;
+
+ if (!avic_kick_target_vcpus_fast(kvm, source, icrl, icrh, index))
+ return;
+
+ trace_kvm_avic_kick_vcpu_slowpath(icrh, icrl, index);
+
+ /*
+ * Wake any target vCPUs that are blocking, i.e. waiting for a wake
+ * event. There's no need to signal doorbells, as hardware has handled
+ * vCPUs that were in guest at the time of the IPI, and vCPUs that have
+ * since entered the guest will have processed pending IRQs at VMRUN.
+ */
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (kvm_apic_match_dest(vcpu, source, icrl & APIC_SHORT_MASK,
+ dest, icrl & APIC_DEST_MASK))
+ avic_kick_vcpu(vcpu, icrl);
+ }
+}
+
+int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u32 icrh = svm->vmcb->control.exit_info_1 >> 32;
+ u32 icrl = svm->vmcb->control.exit_info_1;
+ u32 id = svm->vmcb->control.exit_info_2 >> 32;
+ u32 index = svm->vmcb->control.exit_info_2 & AVIC_PHYSICAL_MAX_INDEX_MASK;
+ struct kvm_lapic *apic = vcpu->arch.apic;
+
+ trace_kvm_avic_incomplete_ipi(vcpu->vcpu_id, icrh, icrl, id, index);
+
+ switch (id) {
+ case AVIC_IPI_FAILURE_INVALID_TARGET:
+ case AVIC_IPI_FAILURE_INVALID_INT_TYPE:
+ /*
+ * Emulate IPIs that are not handled by AVIC hardware, which
+ * only virtualizes Fixed, Edge-Triggered INTRs, and falls over
+ * if _any_ targets are invalid, e.g. if the logical mode mask
+ * is a superset of running vCPUs.
+ *
+ * The exit is a trap, e.g. ICR holds the correct value and RIP
+ * has been advanced, KVM is responsible only for emulating the
+ * IPI. Sadly, hardware may sometimes leave the BUSY flag set,
+ * in which case KVM needs to emulate the ICR write as well in
+ * order to clear the BUSY flag.
+ */
+ if (icrl & APIC_ICR_BUSY)
+ kvm_apic_write_nodecode(vcpu, APIC_ICR);
+ else
+ kvm_apic_send_ipi(apic, icrl, icrh);
+ break;
+ case AVIC_IPI_FAILURE_TARGET_NOT_RUNNING:
+ /*
+ * At this point, we expect that the AVIC HW has already
+ * set the appropriate IRR bits on the valid target
+ * vcpus. So, we just need to kick the appropriate vcpu.
+ */
+ avic_kick_target_vcpus(vcpu->kvm, apic, icrl, icrh, index);
+ break;
+ case AVIC_IPI_FAILURE_INVALID_BACKING_PAGE:
+ WARN_ONCE(1, "Invalid backing page\n");
+ break;
+ case AVIC_IPI_FAILURE_INVALID_IPI_VECTOR:
+ /* Invalid IPI with vector < 16 */
+ break;
+ default:
+ vcpu_unimpl(vcpu, "Unknown avic incomplete IPI interception\n");
+ }
+
+ return 1;
+}
+
+unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu)
+{
+ if (is_guest_mode(vcpu))
+ return APICV_INHIBIT_REASON_NESTED;
+ return 0;
+}
+
+static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat)
+{
+ struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
+ u32 cluster, index;
+
+ ldr = GET_APIC_LOGICAL_ID(ldr);
+
+ if (flat) {
+ cluster = 0;
+ } else {
+ cluster = (ldr >> 4);
+ if (cluster >= 0xf)
+ return NULL;
+ ldr &= 0xf;
+ }
+ if (!ldr || !is_power_of_2(ldr))
+ return NULL;
+
+ index = __ffs(ldr);
+ if (WARN_ON_ONCE(index > 7))
+ return NULL;
+ index += (cluster << 2);
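+
+ /*
+ * E.g. a flat-mode logical ID of 0x20 (bit 5) yields index 5, while a
+ * cluster-mode logical ID of 0x32 (cluster 3, bit 1) yields index
+ * (3 << 2) + 1 = 13.
+ */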
+
+ return &kvm_svm->avic_logical_id_table[index];
+}
+
+static void avic_ldr_write(struct kvm_vcpu *vcpu, u8 g_physical_id, u32 ldr)
+{
+ bool flat;
+ u32 *entry, new_entry;
+
+ flat = kvm_lapic_get_reg(vcpu->arch.apic, APIC_DFR) == APIC_DFR_FLAT;
+ entry = avic_get_logical_id_entry(vcpu, ldr, flat);
+ if (!entry)
+ return;
+
+ new_entry = READ_ONCE(*entry);
+ new_entry &= ~AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK;
+ new_entry |= (g_physical_id & AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK);
+ new_entry |= AVIC_LOGICAL_ID_ENTRY_VALID_MASK;
+ WRITE_ONCE(*entry, new_entry);
+}
+
+static void avic_invalidate_logical_id_entry(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ bool flat = svm->dfr_reg == APIC_DFR_FLAT;
+ u32 *entry;
+
+ /* Note: x2AVIC does not use logical APIC ID table */
+ if (apic_x2apic_mode(vcpu->arch.apic))
+ return;
+
+ entry = avic_get_logical_id_entry(vcpu, svm->ldr_reg, flat);
+ if (entry)
+ clear_bit(AVIC_LOGICAL_ID_ENTRY_VALID_BIT, (unsigned long *)entry);
+}
+
+static void avic_handle_ldr_update(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u32 ldr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_LDR);
+ u32 id = kvm_xapic_id(vcpu->arch.apic);
+
+ /* AVIC does not support LDR update for x2APIC */
+ if (apic_x2apic_mode(vcpu->arch.apic))
+ return;
+
+ if (ldr == svm->ldr_reg)
+ return;
+
+ avic_invalidate_logical_id_entry(vcpu);
+
+ svm->ldr_reg = ldr;
+ avic_ldr_write(vcpu, id, ldr);
+}
+
+static void avic_handle_dfr_update(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u32 dfr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_DFR);
+
+ if (svm->dfr_reg == dfr)
+ return;
+
+ avic_invalidate_logical_id_entry(vcpu);
+ svm->dfr_reg = dfr;
+}
+
+static int avic_unaccel_trap_write(struct kvm_vcpu *vcpu)
+{
+ u32 offset = to_svm(vcpu)->vmcb->control.exit_info_1 &
+ AVIC_UNACCEL_ACCESS_OFFSET_MASK;
+
+ switch (offset) {
+ case APIC_LDR:
+ avic_handle_ldr_update(vcpu);
+ break;
+ case APIC_DFR:
+ avic_handle_dfr_update(vcpu);
+ break;
+ case APIC_RRR:
+ /* Ignore writes to Read Remote Data, it's read-only. */
+ return 1;
+ default:
+ break;
+ }
+
+ kvm_apic_write_nodecode(vcpu, offset);
+ return 1;
+}
+
+static bool is_avic_unaccelerated_access_trap(u32 offset)
+{
+ bool ret = false;
+
+ switch (offset) {
+ case APIC_ID:
+ case APIC_EOI:
+ case APIC_RRR:
+ case APIC_LDR:
+ case APIC_DFR:
+ case APIC_SPIV:
+ case APIC_ESR:
+ case APIC_ICR:
+ case APIC_LVTT:
+ case APIC_LVTTHMR:
+ case APIC_LVTPC:
+ case APIC_LVT0:
+ case APIC_LVT1:
+ case APIC_LVTERR:
+ case APIC_TMICT:
+ case APIC_TDCR:
+ ret = true;
+ break;
+ default:
+ break;
+ }
+ return ret;
+}
+
+int avic_unaccelerated_access_interception(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int ret = 0;
+ u32 offset = svm->vmcb->control.exit_info_1 &
+ AVIC_UNACCEL_ACCESS_OFFSET_MASK;
+ u32 vector = svm->vmcb->control.exit_info_2 &
+ AVIC_UNACCEL_ACCESS_VECTOR_MASK;
+ bool write = (svm->vmcb->control.exit_info_1 >> 32) &
+ AVIC_UNACCEL_ACCESS_WRITE_MASK;
+ bool trap = is_avic_unaccelerated_access_trap(offset);
+
+ trace_kvm_avic_unaccelerated_access(vcpu->vcpu_id, offset,
+ trap, write, vector);
+ if (trap) {
+ /* Handling Trap */
+ WARN_ONCE(!write, "svm: Handling trap read.\n");
+ ret = avic_unaccel_trap_write(vcpu);
+ } else {
+ /* Handling Fault */
+ ret = kvm_emulate_instruction(vcpu, 0);
+ }
+
+ return ret;
+}
+
+int avic_init_vcpu(struct vcpu_svm *svm)
+{
+ int ret;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+
+ INIT_LIST_HEAD(&svm->ir_list);
+ raw_spin_lock_init(&svm->ir_list_lock);
+
+ if (!enable_apicv || !irqchip_in_kernel(vcpu->kvm))
+ return 0;
+
+ ret = avic_init_backing_page(vcpu);
+ if (ret)
+ return ret;
+
+ svm->dfr_reg = APIC_DFR_FLAT;
+
+ return ret;
+}
+
+void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+{
+ avic_handle_dfr_update(vcpu);
+ avic_handle_ldr_update(vcpu);
+}
+
+static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)
+{
+ struct kvm_vcpu *vcpu = irqfd->irq_bypass_vcpu;
+ unsigned long flags;
+
+ if (!vcpu)
+ return;
+
+ raw_spin_lock_irqsave(&to_svm(vcpu)->ir_list_lock, flags);
+ list_del(&irqfd->vcpu_list);
+ raw_spin_unlock_irqrestore(&to_svm(vcpu)->ir_list_lock, flags);
+}
+
+int avic_pi_update_irte(struct kvm_kernel_irqfd *irqfd, struct kvm *kvm,
+ unsigned int host_irq, uint32_t guest_irq,
+ struct kvm_vcpu *vcpu, u32 vector)
+{
+ /*
+ * If the IRQ was affined to a different vCPU, remove the IRTE metadata
+ * from the *previous* vCPU's list.
+ */
+ svm_ir_list_del(irqfd);
+
+ if (vcpu) {
+ /*
+ * Try to enable guest_mode in IRTE, unless AVIC is inhibited,
+ * in which case configure the IRTE for legacy mode, but track
+ * the IRTE metadata so that it can be converted to guest mode
+ * if AVIC is enabled/uninhibited in the future.
+ */
+ struct amd_iommu_pi_data pi_data = {
+ .ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
+ vcpu->vcpu_idx),
+ .is_guest_mode = kvm_vcpu_apicv_active(vcpu),
+ .vapic_addr = avic_get_backing_page_address(to_svm(vcpu)),
+ .vector = vector,
+ };
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u64 entry;
+ int ret;
+
+ /*
+ * Prevent the vCPU from being scheduled out or migrated until
+ * the IRTE is updated and its metadata has been added to the
+ * list of IRQs being posted to the vCPU, to ensure the IRTE
+ * isn't programmed with stale pCPU/IsRunning information.
+ */
+ guard(raw_spinlock_irqsave)(&svm->ir_list_lock);
+
+ /*
+ * Update the target pCPU for IOMMU doorbells if the vCPU is
+ * running. If the vCPU is NOT running, i.e. is blocking or
+ * scheduled out, KVM will update the pCPU info when the vCPU
+ * is awakened and/or scheduled in. See also avic_vcpu_load().
+ */
+ entry = svm->avic_physical_id_entry;
+ if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK) {
+ pi_data.cpu = entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+ } else {
+ pi_data.cpu = -1;
+ pi_data.ga_log_intr = entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
+ }
+
+ ret = irq_set_vcpu_affinity(host_irq, &pi_data);
+ if (ret)
+ return ret;
+
+ /*
+ * Revert to legacy mode if the IOMMU didn't provide metadata
+ * for the IRTE, which KVM needs to keep the IRTE up-to-date,
+ * e.g. if the vCPU is migrated or AVIC is disabled.
+ */
+ if (WARN_ON_ONCE(!pi_data.ir_data)) {
+ irq_set_vcpu_affinity(host_irq, NULL);
+ return -EIO;
+ }
+
+ irqfd->irq_bypass_data = pi_data.ir_data;
+ list_add(&irqfd->vcpu_list, &svm->ir_list);
+ return 0;
+ }
+ return irq_set_vcpu_affinity(host_irq, NULL);
+}
+
+enum avic_vcpu_action {
+ /*
+ * There is no need to differentiate between activate and deactivate,
+ * as KVM only refreshes AVIC state when the vCPU is scheduled in and
+ * isn't blocking, i.e. the pCPU must always be (in)valid when AVIC is
+ * being (de)activated.
+ */
+ AVIC_TOGGLE_ON_OFF = BIT(0),
+ AVIC_ACTIVATE = AVIC_TOGGLE_ON_OFF,
+ AVIC_DEACTIVATE = AVIC_TOGGLE_ON_OFF,
+
+ /*
+ * No unique action is required to deal with a vCPU that stops/starts
+ * running. A vCPU that starts running by definition stops blocking as
+ * well, and a vCPU that stops running can't have been blocking, i.e.
+ * doesn't need to toggle GALogIntr.
+ */
+ AVIC_START_RUNNING = 0,
+ AVIC_STOP_RUNNING = 0,
+
+ /*
+ * When a vCPU starts blocking, KVM needs to set the GALogIntr flag
+ * in all associated IRTEs so that KVM can wake the vCPU if an IRQ is
+ * sent to the vCPU.
+ */
+ AVIC_START_BLOCKING = BIT(1),
+};
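+
+/*
+ * E.g. avic_vcpu_blocking() passes AVIC_START_BLOCKING so that IsRunning is
+ * cleared and GALogIntr is armed in one pass, whereas preemption of a
+ * still-runnable vCPU passes AVIC_STOP_RUNNING (0) and leaves GALogIntr
+ * untouched.
+ */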
+
+static void avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu,
+ enum avic_vcpu_action action)
+{
+ bool ga_log_intr = (action & AVIC_START_BLOCKING);
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct kvm_kernel_irqfd *irqfd;
+
+ lockdep_assert_held(&svm->ir_list_lock);
+
+ /*
+ * Here, we go through the per-vcpu ir_list to update all existing
+ * interrupt remapping table entries targeting this vcpu.
+ */
+ if (list_empty(&svm->ir_list))
+ return;
+
+ list_for_each_entry(irqfd, &svm->ir_list, vcpu_list) {
+ void *data = irqfd->irq_bypass_data;
+
+ if (!(action & AVIC_TOGGLE_ON_OFF))
+ WARN_ON_ONCE(amd_iommu_update_ga(data, cpu, ga_log_intr));
+ else if (cpu >= 0)
+ WARN_ON_ONCE(amd_iommu_activate_guest_mode(data, cpu, ga_log_intr));
+ else
+ WARN_ON_ONCE(amd_iommu_deactivate_guest_mode(data));
+ }
+}
+
+static void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu,
+ enum avic_vcpu_action action)
+{
+ struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
+ int h_physical_id = kvm_cpu_get_apicid(cpu);
+ struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long flags;
+ u64 entry;
+
+ lockdep_assert_preemption_disabled();
+
+ if (WARN_ON(h_physical_id & ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
+ return;
+
+ if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >=
+ PAGE_SIZE << avic_get_physical_id_table_order(vcpu->kvm)))
+ return;
+
+ /*
+ * Grab the per-vCPU interrupt remapping lock even if the VM doesn't
+ * _currently_ have assigned devices, as that can change. Holding
+ * ir_list_lock ensures that either svm_ir_list_add() will consume
+ * up-to-date entry information, or that this task will wait until
+ * svm_ir_list_add() completes to set the new target pCPU.
+ */
+ raw_spin_lock_irqsave(&svm->ir_list_lock, flags);
+
+ entry = svm->avic_physical_id_entry;
+ WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
+
+ entry &= ~(AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK |
+ AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR);
+ entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
+ entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+
+ svm->avic_physical_id_entry = entry;
+
+ /*
+ * If IPI virtualization is disabled, clear IsRunning when updating the
+ * actual Physical ID table, so that the CPU never sees IsRunning=1.
+ * Keep the APIC ID up-to-date in the entry to minimize the chances of
+ * things going sideways if hardware peeks at the ID.
+ */
+ if (!enable_ipiv)
+ entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+
+ WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
+
+ avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, action);
+
+ raw_spin_unlock_irqrestore(&svm->ir_list_lock, flags);
+}
+
+void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ /*
+ * No need to update anything if the vCPU is blocking, i.e. if the vCPU
+ * is being scheduled in after being preempted. The CPU entries in the
+ * Physical APIC table and IRTE are consumed iff IsRun{ning} is '1'.
+ * If the vCPU was migrated, its new CPU value will be stuffed when the
+ * vCPU unblocks.
+ */
+ if (kvm_vcpu_is_blocking(vcpu))
+ return;
+
+ __avic_vcpu_load(vcpu, cpu, AVIC_START_RUNNING);
+}
+
+static void __avic_vcpu_put(struct kvm_vcpu *vcpu, enum avic_vcpu_action action)
+{
+ struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
+ struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long flags;
+ u64 entry = svm->avic_physical_id_entry;
+
+ lockdep_assert_preemption_disabled();
+
+ if (WARN_ON_ONCE(vcpu->vcpu_id * sizeof(entry) >=
+ PAGE_SIZE << avic_get_physical_id_table_order(vcpu->kvm)))
+ return;
+
+ /*
+ * Take and hold the per-vCPU interrupt remapping lock while updating
+ * the Physical ID entry even though the lock doesn't protect against
+ * multiple writers (see above). Holding ir_list_lock ensures that
+ * either svm_ir_list_add() will consume up-to-date entry information,
+ * or that this task will wait until svm_ir_list_add() completes to
+ * mark the vCPU as not running.
+ */
+ raw_spin_lock_irqsave(&svm->ir_list_lock, flags);
+
+ avic_update_iommu_vcpu_affinity(vcpu, -1, action);
+
+ WARN_ON_ONCE(entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR);
+
+ /*
+ * Keep the previous APIC ID in the entry so that a rogue doorbell from
+ * hardware is at least restricted to a CPU associated with the vCPU.
+ */
+ entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+
+ if (enable_ipiv)
+ WRITE_ONCE(kvm_svm->avic_physical_id_table[vcpu->vcpu_id], entry);
+
+ /*
+ * Note! Don't set AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR in the table as
+ * it's a synthetic flag that usurps an unused should-be-zero bit.
+ */
+ if (action & AVIC_START_BLOCKING)
+ entry |= AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR;
+
+ svm->avic_physical_id_entry = entry;
+
+ raw_spin_unlock_irqrestore(&svm->ir_list_lock, flags);
+}
+
+void avic_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Note, reading the Physical ID entry outside of ir_list_lock is safe
+ * as only the pCPU that has loaded (or is loading) the vCPU is allowed
+ * to modify the entry, and preemption is disabled. I.e. the vCPU
+ * can't be scheduled out and thus avic_vcpu_{put,load}() can't run
+ * recursively.
+ */
+ u64 entry = to_svm(vcpu)->avic_physical_id_entry;
+
+ /*
+ * Nothing to do if IsRunning == '0' due to vCPU blocking, i.e. if the
+ * vCPU is preempted while it's in the process of blocking. WARN if the
+ * vCPU wasn't running and isn't blocking, KVM shouldn't attempt to put
+ * the AVIC if it wasn't previously loaded.
+ */
+ if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)) {
+ if (WARN_ON_ONCE(!kvm_vcpu_is_blocking(vcpu)))
+ return;
+
+ /*
+ * The vCPU was preempted while blocking, ensure its IRTEs are
+ * configured to generate GA Log Interrupts.
+ */
+ if (!(WARN_ON_ONCE(!(entry & AVIC_PHYSICAL_ID_ENTRY_GA_LOG_INTR))))
+ return;
+ }
+
+ __avic_vcpu_put(vcpu, kvm_vcpu_is_blocking(vcpu) ? AVIC_START_BLOCKING :
+ AVIC_STOP_RUNNING);
+}
+
+void avic_refresh_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb *vmcb = svm->vmcb01.ptr;
+
+ if (!lapic_in_kernel(vcpu) || !enable_apicv)
+ return;
+
+ if (kvm_vcpu_apicv_active(vcpu)) {
+ /**
+ * During AVIC temporary deactivation, guest could update
+ * APIC ID, DFR and LDR registers, which would not be trapped
+ * by avic_unaccelerated_access_interception(). In this case,
+ * we need to check and update the AVIC logical APIC ID table
+ * accordingly before re-activating.
+ */
+ avic_apicv_post_state_restore(vcpu);
+ avic_activate_vmcb(svm);
+ } else {
+ avic_deactivate_vmcb(svm);
+ }
+ vmcb_mark_dirty(vmcb, VMCB_AVIC);
+}
+
+void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+{
+ if (!enable_apicv)
+ return;
+
+ /* APICv should only be toggled on/off while the vCPU is running. */
+ WARN_ON_ONCE(kvm_vcpu_is_blocking(vcpu));
+
+ avic_refresh_virtual_apic_mode(vcpu);
+
+ if (kvm_vcpu_apicv_active(vcpu))
+ __avic_vcpu_load(vcpu, vcpu->cpu, AVIC_ACTIVATE);
+ else
+ __avic_vcpu_put(vcpu, AVIC_DEACTIVATE);
+}
+
+void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
+{
+ if (!kvm_vcpu_apicv_active(vcpu))
+ return;
+
+ /*
+ * Unload the AVIC when the vCPU is about to block, _before_ the vCPU
+ * actually blocks.
+ *
+ * Note, any IRQs that arrive before IsRunning=0 will not cause an
+ * incomplete IPI vmexit on the source; kvm_vcpu_check_block() handles
+ * this by checking vIRR one last time before blocking. The memory
+ * barrier implicit in set_current_state orders writing IsRunning=0
+ * before reading the vIRR. The processor needs a matching memory
+ * barrier on interrupt delivery between writing IRR and reading
+ * IsRunning (the lack of this barrier might be the cause of errata #1235).
+ *
+ * Clear IsRunning=0 even if guest IRQs are disabled, i.e. even if KVM
+ * doesn't need to detect events for scheduling purposes. The doorbell
+ * used to signal running vCPUs cannot be blocked, i.e. will perturb the
+ * CPU and cause noisy neighbor problems if the VM is sending interrupts
+ * to the vCPU while it's scheduled out.
+ */
+ __avic_vcpu_put(vcpu, AVIC_START_BLOCKING);
+}
+
+void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)
+{
+ if (!kvm_vcpu_apicv_active(vcpu))
+ return;
+
+ avic_vcpu_load(vcpu, vcpu->cpu);
+}
+
+static bool __init avic_want_avic_enabled(void)
+{
+ /*
+ * In "auto" mode, enable AVIC by default for Zen4+ if x2AVIC is
+ * supported (to avoid enabling partial support by default, and because
+ * x2AVIC should be supported by all Zen4+ CPUs). Explicitly check for
+ * family 0x1a and later (Zen5+), as the kernel's synthetic ZenX flags
+ * aren't inclusive of previous generations, i.e. the kernel will set
+ * at most one ZenX feature flag.
+ */
+ if (avic == AVIC_AUTO_MODE)
+ avic = boot_cpu_has(X86_FEATURE_X2AVIC) &&
+ (boot_cpu_data.x86 > 0x19 || cpu_feature_enabled(X86_FEATURE_ZEN4));
+
+ if (!avic || !npt_enabled)
+ return false;
+
+ /* AVIC is a prerequisite for x2AVIC. */
+ if (!boot_cpu_has(X86_FEATURE_AVIC) && !force_avic) {
+ if (boot_cpu_has(X86_FEATURE_X2AVIC))
+ pr_warn(FW_BUG "Cannot enable x2AVIC, AVIC is unsupported\n");
+ return false;
+ }
+
+ if (cc_platform_has(CC_ATTR_HOST_SEV_SNP) &&
+ !boot_cpu_has(X86_FEATURE_HV_INUSE_WR_ALLOWED)) {
+ pr_warn("AVIC disabled: missing HvInUseWrAllowed on SNP-enabled system\n");
+ return false;
+ }
+
+ /*
+ * Print a scary message if AVIC is force enabled to make it abundantly
+	 * clear that ignoring CPUID could have repercussions. See the Revision
+	 * Guide for the specific AMD processor for more details.
+ */
+ if (!boot_cpu_has(X86_FEATURE_AVIC))
+ pr_warn("AVIC unsupported in CPUID but force enabled, your system might crash and burn\n");
+
+ return true;
+}
+
+/*
+ * Note:
+ * - The module param avic enables both xAPIC and x2APIC modes.
+ * - Hypervisor can support both xAVIC and x2AVIC in the same guest.
+ * - The mode can be switched at run-time.
+ */
+bool __init avic_hardware_setup(void)
+{
+ avic = avic_want_avic_enabled();
+ if (!avic)
+ return false;
+
+ pr_info("AVIC enabled\n");
+
+ /* AVIC is a prerequisite for x2AVIC. */
+ x2avic_enabled = boot_cpu_has(X86_FEATURE_X2AVIC);
+ if (x2avic_enabled) {
+ if (cpu_feature_enabled(X86_FEATURE_X2AVIC_EXT))
+ x2avic_max_physical_id = X2AVIC_4K_MAX_PHYSICAL_ID;
+ else
+ x2avic_max_physical_id = X2AVIC_MAX_PHYSICAL_ID;
+ pr_info("x2AVIC enabled (max %u vCPUs)\n", x2avic_max_physical_id + 1);
+ } else {
+ svm_x86_ops.allow_apicv_in_x2apic_without_x2apic_virtualization = true;
+ }
+
+ /*
+ * Disable IPI virtualization for AMD Family 17h CPUs (Zen1 and Zen2)
+ * due to erratum 1235, which results in missed VM-Exits on the sender
+ * and thus missed wake events for blocking vCPUs due to the CPU
+ * failing to see a software update to clear IsRunning.
+ */
+ enable_ipiv = enable_ipiv && boot_cpu_data.x86 != 0x17;
+
+ amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier);
+
+ return true;
+}
+
+void avic_hardware_unsetup(void)
+{
+ if (avic)
+ amd_iommu_register_ga_log_notifier(NULL);
+}
diff --git a/arch/x86/kvm/svm/hyperv.c b/arch/x86/kvm/svm/hyperv.c
new file mode 100644
index 000000000000..088f6429b24c
--- /dev/null
+++ b/arch/x86/kvm/svm/hyperv.c
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * AMD SVM specific code for Hyper-V on KVM.
+ *
+ * Copyright 2022 Red Hat, Inc. and/or its affiliates.
+ */
+#include "hyperv.h"
+
+void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ svm->vmcb->control.exit_code = HV_SVM_EXITCODE_ENL;
+ svm->vmcb->control.exit_code_hi = 0;
+ svm->vmcb->control.exit_info_1 = HV_SVM_ENL_EXITCODE_TRAP_AFTER_FLUSH;
+ svm->vmcb->control.exit_info_2 = 0;
+ nested_svm_vmexit(svm);
+}
diff --git a/arch/x86/kvm/svm/hyperv.h b/arch/x86/kvm/svm/hyperv.h
new file mode 100644
index 000000000000..d3f8bfc05832
--- /dev/null
+++ b/arch/x86/kvm/svm/hyperv.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Common Hyper-V on KVM and KVM on Hyper-V definitions (SVM).
+ */
+
+#ifndef __ARCH_X86_KVM_SVM_HYPERV_H__
+#define __ARCH_X86_KVM_SVM_HYPERV_H__
+
+#include <asm/mshyperv.h>
+
+#include "../hyperv.h"
+#include "svm.h"
+
+#ifdef CONFIG_KVM_HYPERV
+static inline void nested_svm_hv_update_vm_vp_ids(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct hv_vmcb_enlightenments *hve = &svm->nested.ctl.hv_enlightenments;
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ if (!hv_vcpu)
+ return;
+
+ hv_vcpu->nested.pa_page_gpa = hve->partition_assist_page;
+ hv_vcpu->nested.vm_id = hve->hv_vm_id;
+ hv_vcpu->nested.vp_id = hve->hv_vp_id;
+}
+
+static inline bool nested_svm_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct hv_vmcb_enlightenments *hve = &svm->nested.ctl.hv_enlightenments;
+ struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
+
+ if (!hv_vcpu)
+ return false;
+
+ if (!hve->hv_enlightenments_control.nested_flush_hypercall)
+ return false;
+
+ return hv_vcpu->vp_assist_page.nested_control.features.directhypercall;
+}
+
+void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu);
+#else /* CONFIG_KVM_HYPERV */
+static inline void nested_svm_hv_update_vm_vp_ids(struct kvm_vcpu *vcpu) {}
+static inline bool nested_svm_l2_tlb_flush_enabled(struct kvm_vcpu *vcpu)
+{
+ return false;
+}
+static inline void svm_hv_inject_synthetic_vmexit_post_tlb_flush(struct kvm_vcpu *vcpu) {}
+#endif /* CONFIG_KVM_HYPERV */
+
+#endif /* __ARCH_X86_KVM_SVM_HYPERV_H__ */
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
new file mode 100644
index 000000000000..c81005b24522
--- /dev/null
+++ b/arch/x86/kvm/svm/nested.c
@@ -0,0 +1,1928 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * AMD SVM support
+ *
+ * Copyright (C) 2006 Qumranet, Inc.
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
+ *
+ * Authors:
+ * Yaniv Kamay <yaniv@qumranet.com>
+ * Avi Kivity <avi@qumranet.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kvm_types.h>
+#include <linux/kvm_host.h>
+#include <linux/kernel.h>
+
+#include <asm/msr-index.h>
+#include <asm/debugreg.h>
+
+#include "kvm_emulate.h"
+#include "trace.h"
+#include "mmu.h"
+#include "x86.h"
+#include "smm.h"
+#include "cpuid.h"
+#include "lapic.h"
+#include "svm.h"
+#include "hyperv.h"
+
+#define CC KVM_NESTED_VMENTER_CONSISTENCY_CHECK
+
+static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
+ struct x86_exception *fault)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb *vmcb = svm->vmcb;
+
+ if (vmcb->control.exit_code != SVM_EXIT_NPF) {
+ /*
+ * TODO: track the cause of the nested page fault, and
+ * correctly fill in the high bits of exit_info_1.
+ */
+ vmcb->control.exit_code = SVM_EXIT_NPF;
+ vmcb->control.exit_code_hi = 0;
+ vmcb->control.exit_info_1 = (1ULL << 32);
+ vmcb->control.exit_info_2 = fault->address;
+ }
+
+ vmcb->control.exit_info_1 &= ~0xffffffffULL;
+ vmcb->control.exit_info_1 |= fault->error_code;
+
+ nested_svm_vmexit(svm);
+}
+
+static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u64 cr3 = svm->nested.ctl.nested_cr3;
+ u64 pdpte;
+ int ret;
+
+ /*
+ * Note, nCR3 is "assumed" to be 32-byte aligned, i.e. the CPU ignores
+ * nCR3[4:0] when loading PDPTEs from memory.
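+ * E.g. with nCR3 = 0x12340e3, PDPTE[1] is read from guest physical
+ * address 0x12340e0 + 8 = 0x12340e8.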
+ */
+ ret = kvm_vcpu_read_guest_page(vcpu, gpa_to_gfn(cr3), &pdpte,
+ (cr3 & GENMASK(11, 5)) + index * 8, 8);
+ if (ret)
+ return 0;
+ return pdpte;
+}
+
+static unsigned long nested_svm_get_tdp_cr3(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ return svm->nested.ctl.nested_cr3;
+}
+
+static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ WARN_ON(mmu_is_nested(vcpu));
+
+ vcpu->arch.mmu = &vcpu->arch.guest_mmu;
+
+ /*
+ * The NPT format depends on L1's CR4 and EFER, which are in vmcb01. Note,
+ * when called via KVM_SET_NESTED_STATE, that state may _not_ match current
+ * vCPU state. CR0.WP is explicitly ignored, while CR0.PG is required.
+ */
+ kvm_init_shadow_npt_mmu(vcpu, X86_CR0_PG, svm->vmcb01.ptr->save.cr4,
+ svm->vmcb01.ptr->save.efer,
+ svm->nested.ctl.nested_cr3);
+ vcpu->arch.mmu->get_guest_pgd = nested_svm_get_tdp_cr3;
+ vcpu->arch.mmu->get_pdptr = nested_svm_get_tdp_pdptr;
+ vcpu->arch.mmu->inject_page_fault = nested_svm_inject_npf_exit;
+ vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu;
+}
+
+static void nested_svm_uninit_mmu_context(struct kvm_vcpu *vcpu)
+{
+ vcpu->arch.mmu = &vcpu->arch.root_mmu;
+ vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
+}
+
+static bool nested_vmcb_needs_vls_intercept(struct vcpu_svm *svm)
+{
+ if (!guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_V_VMSAVE_VMLOAD))
+ return true;
+
+ if (!nested_npt_enabled(svm))
+ return true;
+
+ if (!(svm->nested.ctl.virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK))
+ return true;
+
+ return false;
+}
+
+void recalc_intercepts(struct vcpu_svm *svm)
+{
+ struct vmcb_control_area *c, *h;
+ struct vmcb_ctrl_area_cached *g;
+ unsigned int i;
+
+ vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
+
+ if (!is_guest_mode(&svm->vcpu))
+ return;
+
+ c = &svm->vmcb->control;
+ h = &svm->vmcb01.ptr->control;
+ g = &svm->nested.ctl;
+
+ for (i = 0; i < MAX_INTERCEPT; i++)
+ c->intercepts[i] = h->intercepts[i];
+
+ if (g->int_ctl & V_INTR_MASKING_MASK) {
+ /*
+ * If L2 is active and V_INTR_MASKING is enabled in vmcb12,
+ * disable intercept of CR8 writes as L2's CR8 does not affect
+ * any interrupt KVM may want to inject.
+ *
+ * Similarly, disable intercept of virtual interrupts (used to
+ * detect interrupt windows) if the saved RFLAGS.IF is '0', as
+ * the effective RFLAGS.IF for L1 interrupts will never be set
+ * while L2 is running (L2's RFLAGS.IF doesn't affect L1 IRQs).
+ */
+ vmcb_clr_intercept(c, INTERCEPT_CR8_WRITE);
+ if (!(svm->vmcb01.ptr->save.rflags & X86_EFLAGS_IF))
+ vmcb_clr_intercept(c, INTERCEPT_VINTR);
+ }
+
+ /*
+ * We want to see VMMCALLs from a nested guest only when Hyper-V L2 TLB
+ * flush feature is enabled.
+ */
+ if (!nested_svm_l2_tlb_flush_enabled(&svm->vcpu))
+ vmcb_clr_intercept(c, INTERCEPT_VMMCALL);
+
+ for (i = 0; i < MAX_INTERCEPT; i++)
+ c->intercepts[i] |= g->intercepts[i];
+
+ /* If SMI is not intercepted, ignore guest SMI intercept as well */
+ if (!intercept_smi)
+ vmcb_clr_intercept(c, INTERCEPT_SMI);
+
+ if (nested_vmcb_needs_vls_intercept(svm)) {
+ /*
+ * If the virtual VMLOAD/VMSAVE is not enabled for the L2,
+ * we must intercept these instructions to correctly
+ * emulate them in case L1 doesn't intercept them.
+ */
+ vmcb_set_intercept(c, INTERCEPT_VMLOAD);
+ vmcb_set_intercept(c, INTERCEPT_VMSAVE);
+ } else {
+ WARN_ON(!(c->virt_ext & VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK));
+ }
+}
+
+/*
+ * This array (and its actual size) holds the set of offsets (in chunk-sized
+ * units) to process when merging vmcb12's MSRPM with vmcb01's MSRPM. Note, the
+ * set of MSRs for which interception is disabled in vmcb01 is per-vCPU, e.g.
+ * based on CPUID features. This array only tracks MSRs that *might* be passed
+ * through to the guest.
+ *
+ * Hardcode the capacity of the array based on the maximum number of _offsets_.
+ * MSRs are batched together, so there are fewer offsets than MSRs.
+ */
+static int nested_svm_msrpm_merge_offsets[7] __ro_after_init;
+static int nested_svm_nr_msrpm_merge_offsets __ro_after_init;
+typedef unsigned long nsvm_msrpm_merge_t;
+
+int __init nested_svm_init_msrpm_merge_offsets(void)
+{
+ static const u32 merge_msrs[] __initconst = {
+ MSR_STAR,
+ MSR_IA32_SYSENTER_CS,
+ MSR_IA32_SYSENTER_EIP,
+ MSR_IA32_SYSENTER_ESP,
+ #ifdef CONFIG_X86_64
+ MSR_GS_BASE,
+ MSR_FS_BASE,
+ MSR_KERNEL_GS_BASE,
+ MSR_LSTAR,
+ MSR_CSTAR,
+ MSR_SYSCALL_MASK,
+ #endif
+ MSR_IA32_SPEC_CTRL,
+ MSR_IA32_PRED_CMD,
+ MSR_IA32_FLUSH_CMD,
+ MSR_IA32_APERF,
+ MSR_IA32_MPERF,
+ MSR_IA32_LASTBRANCHFROMIP,
+ MSR_IA32_LASTBRANCHTOIP,
+ MSR_IA32_LASTINTFROMIP,
+ MSR_IA32_LASTINTTOIP,
+ };
+ int i, j;
+
+ for (i = 0; i < ARRAY_SIZE(merge_msrs); i++) {
+ int bit_nr = svm_msrpm_bit_nr(merge_msrs[i]);
+ u32 offset;
+
+ if (WARN_ON(bit_nr < 0))
+ return -EIO;
+
+ /*
+ * Merging is done in chunks to reduce the number of accesses
+ * to L1's bitmap.
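+ * On 64-bit kernels each chunk is an 8-byte unsigned long, covering 64
+ * bitmap bits (32 MSRs at two bits apiece), so neighboring MSRs collapse
+ * to a single offset.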
+ */
+ offset = bit_nr / BITS_PER_BYTE / sizeof(nsvm_msrpm_merge_t);
+
+ for (j = 0; j < nested_svm_nr_msrpm_merge_offsets; j++) {
+ if (nested_svm_msrpm_merge_offsets[j] == offset)
+ break;
+ }
+
+ if (j < nested_svm_nr_msrpm_merge_offsets)
+ continue;
+
+ if (WARN_ON(j >= ARRAY_SIZE(nested_svm_msrpm_merge_offsets)))
+ return -EIO;
+
+ nested_svm_msrpm_merge_offsets[j] = offset;
+ nested_svm_nr_msrpm_merge_offsets++;
+ }
+
+ return 0;
+}
+
+/*
+ * Merge L0's (KVM) and L1's (Nested VMCB) MSR permission bitmaps. The function
+ * is optimized in that it only merges the parts where KVM MSR permission bitmap
+ * may contain zero bits.
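+ * Since a set bit in an MSRPM means "intercept", OR-ing the two bitmaps
+ * gives L2 direct access to an MSR only when both L0 and L1 allow it.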
+ */
+static bool nested_svm_merge_msrpm(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ nsvm_msrpm_merge_t *msrpm02 = svm->nested.msrpm;
+ nsvm_msrpm_merge_t *msrpm01 = svm->msrpm;
+ int i;
+
+ /*
+ * MSR bitmap update can be skipped when:
+ * - MSR bitmap for L1 hasn't changed.
+ * - Nested hypervisor (L1) is attempting to launch the same L2 as
+ * before.
+ * - Nested hypervisor (L1) is using Hyper-V emulation interface and
+ * tells KVM (L0) there were no changes in MSR bitmap for L2.
+ */
+#ifdef CONFIG_KVM_HYPERV
+ if (!svm->nested.force_msr_bitmap_recalc) {
+ struct hv_vmcb_enlightenments *hve = &svm->nested.ctl.hv_enlightenments;
+
+ if (kvm_hv_hypercall_enabled(vcpu) &&
+ hve->hv_enlightenments_control.msr_bitmap &&
+ (svm->nested.ctl.clean & BIT(HV_VMCB_NESTED_ENLIGHTENMENTS)))
+ goto set_msrpm_base_pa;
+ }
+#endif
+
+ if (!(vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_MSR_PROT)))
+ return true;
+
+ for (i = 0; i < nested_svm_nr_msrpm_merge_offsets; i++) {
+ const int p = nested_svm_msrpm_merge_offsets[i];
+ nsvm_msrpm_merge_t l1_val;
+ gpa_t gpa;
+
+ gpa = svm->nested.ctl.msrpm_base_pa + (p * sizeof(l1_val));
+
+ if (kvm_vcpu_read_guest(vcpu, gpa, &l1_val, sizeof(l1_val)))
+ return false;
+
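+ /* A set bit intercepts the MSR, so OR yields the stricter of L0's and L1's settings. */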
+ msrpm02[p] = msrpm01[p] | l1_val;
+ }
+
+ svm->nested.force_msr_bitmap_recalc = false;
+
+#ifdef CONFIG_KVM_HYPERV
+set_msrpm_base_pa:
+#endif
+ svm->vmcb->control.msrpm_base_pa = __sme_set(__pa(svm->nested.msrpm));
+
+ return true;
+}
+
+/*
+ * Bits 11:0 of bitmap address are ignored by hardware
+ */
+static bool nested_svm_check_bitmap_pa(struct kvm_vcpu *vcpu, u64 pa, u32 size)
+{
+ u64 addr = PAGE_ALIGN(pa);
+
+ return kvm_vcpu_is_legal_gpa(vcpu, addr) &&
+ kvm_vcpu_is_legal_gpa(vcpu, addr + size - 1);
+}
+
+static bool __nested_vmcb_check_controls(struct kvm_vcpu *vcpu,
+ struct vmcb_ctrl_area_cached *control)
+{
+ if (CC(!vmcb12_is_intercept(control, INTERCEPT_VMRUN)))
+ return false;
+
+ if (CC(control->asid == 0))
+ return false;
+
+ if (CC((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) && !npt_enabled))
+ return false;
+
+ if (CC(!nested_svm_check_bitmap_pa(vcpu, control->msrpm_base_pa,
+ MSRPM_SIZE)))
+ return false;
+ if (CC(!nested_svm_check_bitmap_pa(vcpu, control->iopm_base_pa,
+ IOPM_SIZE)))
+ return false;
+
+ if (CC((control->int_ctl & V_NMI_ENABLE_MASK) &&
+ !vmcb12_is_intercept(control, INTERCEPT_NMI))) {
+ return false;
+ }
+
+ return true;
+}
+
+/* Common checks that apply to both L1 and L2 state. */
+static bool __nested_vmcb_check_save(struct kvm_vcpu *vcpu,
+ struct vmcb_save_area_cached *save)
+{
+ if (CC(!(save->efer & EFER_SVME)))
+ return false;
+
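+ /* CR0.NW requires CR0.CD, and bits 63:32 of CR0 are reserved (MBZ). */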
+ if (CC((save->cr0 & X86_CR0_CD) == 0 && (save->cr0 & X86_CR0_NW)) ||
+ CC(save->cr0 & ~0xffffffffULL))
+ return false;
+
+ if (CC(!kvm_dr6_valid(save->dr6)) || CC(!kvm_dr7_valid(save->dr7)))
+ return false;
+
+ /*
+ * These checks are also performed by KVM_SET_SREGS,
+ * except that EFER.LMA is not checked by SVM against
+ * CR0.PG && EFER.LME.
+ */
+ if ((save->efer & EFER_LME) && (save->cr0 & X86_CR0_PG)) {
+ if (CC(!(save->cr4 & X86_CR4_PAE)) ||
+ CC(!(save->cr0 & X86_CR0_PE)) ||
+ CC(!kvm_vcpu_is_legal_cr3(vcpu, save->cr3)))
+ return false;
+ }
+
+ /* Note, SVM doesn't have any additional restrictions on CR4. */
+ if (CC(!__kvm_is_valid_cr4(vcpu, save->cr4)))
+ return false;
+
+ if (CC(!kvm_valid_efer(vcpu, save->efer)))
+ return false;
+
+ return true;
+}
+
+static bool nested_vmcb_check_save(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb_save_area_cached *save = &svm->nested.save;
+
+ return __nested_vmcb_check_save(vcpu, save);
+}
+
+static bool nested_vmcb_check_controls(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb_ctrl_area_cached *ctl = &svm->nested.ctl;
+
+ return __nested_vmcb_check_controls(vcpu, ctl);
+}
+
+static void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
+ struct vmcb_ctrl_area_cached *to,
+ struct vmcb_control_area *from)
+{
+ unsigned int i;
+
+ for (i = 0; i < MAX_INTERCEPT; i++)
+ to->intercepts[i] = from->intercepts[i];
+
+ to->iopm_base_pa = from->iopm_base_pa;
+ to->msrpm_base_pa = from->msrpm_base_pa;
+ to->tsc_offset = from->tsc_offset;
+ to->tlb_ctl = from->tlb_ctl;
+ to->int_ctl = from->int_ctl;
+ to->int_vector = from->int_vector;
+ to->int_state = from->int_state;
+ to->exit_code = from->exit_code;
+ to->exit_code_hi = from->exit_code_hi;
+ to->exit_info_1 = from->exit_info_1;
+ to->exit_info_2 = from->exit_info_2;
+ to->exit_int_info = from->exit_int_info;
+ to->exit_int_info_err = from->exit_int_info_err;
+ to->nested_ctl = from->nested_ctl;
+ to->event_inj = from->event_inj;
+ to->event_inj_err = from->event_inj_err;
+ to->next_rip = from->next_rip;
+ to->nested_cr3 = from->nested_cr3;
+ to->virt_ext = from->virt_ext;
+ to->pause_filter_count = from->pause_filter_count;
+ to->pause_filter_thresh = from->pause_filter_thresh;
+
+ /* Copy asid here because nested_vmcb_check_controls will check it. */
+ to->asid = from->asid;
+ to->msrpm_base_pa &= ~0x0fffULL;
+ to->iopm_base_pa &= ~0x0fffULL;
+
+#ifdef CONFIG_KVM_HYPERV
+ /* Hyper-V extensions (Enlightened VMCB) */
+ if (kvm_hv_hypercall_enabled(vcpu)) {
+ to->clean = from->clean;
+ memcpy(&to->hv_enlightenments, &from->hv_enlightenments,
+ sizeof(to->hv_enlightenments));
+ }
+#endif
+}
+
+void nested_copy_vmcb_control_to_cache(struct vcpu_svm *svm,
+ struct vmcb_control_area *control)
+{
+ __nested_copy_vmcb_control_to_cache(&svm->vcpu, &svm->nested.ctl, control);
+}
+
+static void __nested_copy_vmcb_save_to_cache(struct vmcb_save_area_cached *to,
+ struct vmcb_save_area *from)
+{
+ /*
+ * Copy only the fields that are validated; caching them avoids
+ * TOCTOU (time-of-check/time-of-use) races.
+ */
+ to->efer = from->efer;
+ to->cr0 = from->cr0;
+ to->cr3 = from->cr3;
+ to->cr4 = from->cr4;
+
+ to->dr6 = from->dr6;
+ to->dr7 = from->dr7;
+}
+
+void nested_copy_vmcb_save_to_cache(struct vcpu_svm *svm,
+ struct vmcb_save_area *save)
+{
+ __nested_copy_vmcb_save_to_cache(&svm->nested.save, save);
+}
+
+/*
+ * Synchronize fields that are written by the processor, so that
+ * they can be copied back into the vmcb12.
+ */
+void nested_sync_control_from_vmcb02(struct vcpu_svm *svm)
+{
+ u32 mask;
+
+ svm->nested.ctl.event_inj = svm->vmcb->control.event_inj;
+ svm->nested.ctl.event_inj_err = svm->vmcb->control.event_inj_err;
+
+ /* Only a few fields of int_ctl are written by the processor. */
+ mask = V_IRQ_MASK | V_TPR_MASK;
+ /*
+ * Don't sync vmcb02 V_IRQ back to vmcb12 if KVM (L0) is intercepting
+ * virtual interrupts in order to request an interrupt window, as KVM
+ * has usurped vmcb02's int_ctl. If an interrupt window opens before
+ * the next VM-Exit, svm_clear_vintr() will restore vmcb12's int_ctl.
+ * If no window opens, V_IRQ will be correctly preserved in vmcb12's
+ * int_ctl (because it was never recognized while L2 was running).
+ */
+ if (svm_is_intercept(svm, INTERCEPT_VINTR) &&
+ !test_bit(INTERCEPT_VINTR, (unsigned long *)svm->nested.ctl.intercepts))
+ mask &= ~V_IRQ_MASK;
+
+ if (nested_vgif_enabled(svm))
+ mask |= V_GIF_MASK;
+
+ if (nested_vnmi_enabled(svm))
+ mask |= V_NMI_BLOCKING_MASK | V_NMI_PENDING_MASK;
+
+ svm->nested.ctl.int_ctl &= ~mask;
+ svm->nested.ctl.int_ctl |= svm->vmcb->control.int_ctl & mask;
+}
+
+/*
+ * Transfer any event that L0 or L1 wanted to inject into L2 to
+ * EXIT_INT_INFO.
+ */
+static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm,
+ struct vmcb *vmcb12)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ u32 exit_int_info = 0;
+ unsigned int nr;
+
+ if (vcpu->arch.exception.injected) {
+ nr = vcpu->arch.exception.vector;
+ exit_int_info = nr | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT;
+
+ if (vcpu->arch.exception.has_error_code) {
+ exit_int_info |= SVM_EVTINJ_VALID_ERR;
+ vmcb12->control.exit_int_info_err =
+ vcpu->arch.exception.error_code;
+ }
+
+ } else if (vcpu->arch.nmi_injected) {
+ exit_int_info = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;
+
+ } else if (vcpu->arch.interrupt.injected) {
+ nr = vcpu->arch.interrupt.nr;
+ exit_int_info = nr | SVM_EVTINJ_VALID;
+
+ if (vcpu->arch.interrupt.soft)
+ exit_int_info |= SVM_EVTINJ_TYPE_SOFT;
+ else
+ exit_int_info |= SVM_EVTINJ_TYPE_INTR;
+ }
+
+ vmcb12->control.exit_int_info = exit_int_info;
+}
+
+static void nested_svm_transition_tlb_flush(struct kvm_vcpu *vcpu)
+{
+ /* Handle pending Hyper-V TLB flush requests */
+ kvm_hv_nested_transtion_tlb_flush(vcpu, npt_enabled);
+
+ /*
+ * TODO: optimize unconditional TLB flush/MMU sync. A partial list of
+ * things to fix before this can be conditional:
+ *
+ * - Flush TLBs for both L1 and L2 remote TLB flush
+ * - Honor L1's request to flush an ASID on nested VMRUN
+ * - Sync nested NPT MMU on VMRUN that flushes L2's ASID[*]
+ * - Don't crush a pending TLB flush in vmcb02 on nested VMRUN
+ * - Flush L1's ASID on KVM_REQ_TLB_FLUSH_GUEST
+ *
+ * [*] Unlike nested EPT, SVM's ASID management can invalidate nested
+ * NPT guest-physical mappings on VMRUN.
+ */
+ kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);
+ kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
+}
+
+/*
+ * Load guest's/host's cr3 on nested vmentry or vmexit. @nested_npt is true
+ * if we are emulating VM-Entry into a guest with NPT enabled.
+ */
+static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
+ bool nested_npt, bool reload_pdptrs)
+{
+ if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3)))
+ return -EINVAL;
+
+ if (reload_pdptrs && !nested_npt && is_pae_paging(vcpu) &&
+ CC(!load_pdptrs(vcpu, cr3)))
+ return -EINVAL;
+
+ vcpu->arch.cr3 = cr3;
+
+ /* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */
+ kvm_init_mmu(vcpu);
+
+ if (!nested_npt)
+ kvm_mmu_new_pgd(vcpu, cr3);
+
+ return 0;
+}
+
+void nested_vmcb02_compute_g_pat(struct vcpu_svm *svm)
+{
+ if (!svm->nested.vmcb02.ptr)
+ return;
+
+ /* FIXME: merge g_pat from vmcb01 and vmcb12. */
+ svm->nested.vmcb02.ptr->save.g_pat = svm->vmcb01.ptr->save.g_pat;
+}
+
+static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12)
+{
+ bool new_vmcb12 = false;
+ struct vmcb *vmcb01 = svm->vmcb01.ptr;
+ struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+
+ nested_vmcb02_compute_g_pat(svm);
+ vmcb_mark_dirty(vmcb02, VMCB_NPT);
+
+ /* Load the nested guest state */
+ if (svm->nested.vmcb12_gpa != svm->nested.last_vmcb12_gpa) {
+ new_vmcb12 = true;
+ svm->nested.last_vmcb12_gpa = svm->nested.vmcb12_gpa;
+ svm->nested.force_msr_bitmap_recalc = true;
+ }
+
+ if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_SEG))) {
+ vmcb02->save.es = vmcb12->save.es;
+ vmcb02->save.cs = vmcb12->save.cs;
+ vmcb02->save.ss = vmcb12->save.ss;
+ vmcb02->save.ds = vmcb12->save.ds;
+ vmcb02->save.cpl = vmcb12->save.cpl;
+ vmcb_mark_dirty(vmcb02, VMCB_SEG);
+ }
+
+ if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_DT))) {
+ vmcb02->save.gdtr = vmcb12->save.gdtr;
+ vmcb02->save.idtr = vmcb12->save.idtr;
+ vmcb_mark_dirty(vmcb02, VMCB_DT);
+ }
+
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK) &&
+ (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_CET)))) {
+ vmcb02->save.s_cet = vmcb12->save.s_cet;
+ vmcb02->save.isst_addr = vmcb12->save.isst_addr;
+ vmcb02->save.ssp = vmcb12->save.ssp;
+ vmcb_mark_dirty(vmcb02, VMCB_CET);
+ }
+
+ kvm_set_rflags(vcpu, vmcb12->save.rflags | X86_EFLAGS_FIXED);
+
+ svm_set_efer(vcpu, svm->nested.save.efer);
+
+ svm_set_cr0(vcpu, svm->nested.save.cr0);
+ svm_set_cr4(vcpu, svm->nested.save.cr4);
+
+ svm->vcpu.arch.cr2 = vmcb12->save.cr2;
+
+ kvm_rax_write(vcpu, vmcb12->save.rax);
+ kvm_rsp_write(vcpu, vmcb12->save.rsp);
+ kvm_rip_write(vcpu, vmcb12->save.rip);
+
+ /* In case we don't even reach vcpu_run, the fields are not updated */
+ vmcb02->save.rax = vmcb12->save.rax;
+ vmcb02->save.rsp = vmcb12->save.rsp;
+ vmcb02->save.rip = vmcb12->save.rip;
+
+ /* These bits will be set properly on the first execution when new_vmcb12 is true */
+ if (unlikely(new_vmcb12 || vmcb_is_dirty(vmcb12, VMCB_DR))) {
+ vmcb02->save.dr7 = svm->nested.save.dr7 | DR7_FIXED_1;
+ svm->vcpu.arch.dr6 = svm->nested.save.dr6 | DR6_ACTIVE_LOW;
+ vmcb_mark_dirty(vmcb02, VMCB_DR);
+ }
+
+ if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) &&
+ (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) {
+ /*
+ * Reserved bits of DEBUGCTL are ignored. Be consistent with
+ * svm_set_msr's definition of reserved bits.
+ */
+ svm_copy_lbrs(vmcb02, vmcb12);
+ vmcb02->save.dbgctl &= ~DEBUGCTL_RESERVED_BITS;
+ } else {
+ svm_copy_lbrs(vmcb02, vmcb01);
+ }
+ svm_update_lbrv(&svm->vcpu);
+}
+
+static inline bool is_evtinj_soft(u32 evtinj)
+{
+ u32 type = evtinj & SVM_EVTINJ_TYPE_MASK;
+ u8 vector = evtinj & SVM_EVTINJ_VEC_MASK;
+
+ if (!(evtinj & SVM_EVTINJ_VALID))
+ return false;
+
+ if (type == SVM_EVTINJ_TYPE_SOFT)
+ return true;
+
+ return type == SVM_EVTINJ_TYPE_EXEPT && kvm_exception_is_soft(vector);
+}
+
+static bool is_evtinj_nmi(u32 evtinj)
+{
+ u32 type = evtinj & SVM_EVTINJ_TYPE_MASK;
+
+ if (!(evtinj & SVM_EVTINJ_VALID))
+ return false;
+
+ return type == SVM_EVTINJ_TYPE_NMI;
+}
+
+static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
+ unsigned long vmcb12_rip,
+ unsigned long vmcb12_csbase)
+{
+ u32 int_ctl_vmcb01_bits = V_INTR_MASKING_MASK;
+ u32 int_ctl_vmcb12_bits = V_TPR_MASK | V_IRQ_INJECTION_BITS_MASK;
+
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct vmcb *vmcb01 = svm->vmcb01.ptr;
+ struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
+ u32 pause_count12;
+ u32 pause_thresh12;
+
+ nested_svm_transition_tlb_flush(vcpu);
+
+ /* Enter Guest-Mode */
+ enter_guest_mode(vcpu);
+
+ /*
+ * Filled at exit: exit_code, exit_code_hi, exit_info_1, exit_info_2,
+ * exit_int_info, exit_int_info_err, next_rip, insn_len, insn_bytes.
+ */
+
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_VGIF) &&
+ (svm->nested.ctl.int_ctl & V_GIF_ENABLE_MASK))
+ int_ctl_vmcb12_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK);
+ else
+ int_ctl_vmcb01_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK);
+
+ if (vnmi) {
+ if (vmcb01->control.int_ctl & V_NMI_PENDING_MASK) {
+ svm->vcpu.arch.nmi_pending++;
+ kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
+ }
+ if (nested_vnmi_enabled(svm))
+ int_ctl_vmcb12_bits |= (V_NMI_PENDING_MASK |
+ V_NMI_ENABLE_MASK |
+ V_NMI_BLOCKING_MASK);
+ }
+
+ /* Copied from vmcb01. msrpm_base can be overwritten later. */
+ vmcb02->control.nested_ctl = vmcb01->control.nested_ctl;
+ vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
+ vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
+ vmcb_mark_dirty(vmcb02, VMCB_PERM_MAP);
+
+ /*
+ * Stash vmcb02's counter if the guest hasn't moved past the guilty
+ * instruction; otherwise, reset the counter to '0'.
+ *
+ * To detect whether L2 has made forward progress, track the RIP at
+ * which a bus lock occurred on a per-vmcb12 basis. If RIP has changed,
+ * the guest has clearly made forward progress, so reset
+ * bus_lock_counter to '0' even if it is still '1'. E.g. if a bus lock
+ * happened in L1 before VMRUN, the bus lock firmly happened on an
+ * instruction in the past. Even if vmcb01's counter is still '1'
+ * (because the guilty instruction got patched), the vCPU has clearly
+ * made forward progress and so KVM should reset vmcb02's counter to
+ * '0'.
+ *
+ * If RIP hasn't changed, stash the bus lock counter at nested VMRUN to
+ * prevent the same guilty instruction from triggering a VM-Exit. E.g.
+ * if userspace rate-limits the vCPU, it's entirely possible that L1's
+ * tick interrupt is pending by the time userspace re-runs the vCPU. If
+ * KVM unconditionally cleared the counter on VMRUN, then when L1
+ * re-enters L2, the same instruction would trigger a VM-Exit and the
+ * entire cycle would start over.
+ */
+ if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == vmcb02->save.rip))
+ vmcb02->control.bus_lock_counter = 1;
+ else
+ vmcb02->control.bus_lock_counter = 0;
+
+ /* Done at vmrun: asid. */
+
+ /* Also overwritten later if necessary. */
+ vmcb02->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+
+ /* nested_cr3. */
+ if (nested_npt_enabled(svm))
+ nested_svm_init_mmu_context(vcpu);
+
+ vcpu->arch.tsc_offset = kvm_calc_nested_tsc_offset(
+ vcpu->arch.l1_tsc_offset,
+ svm->nested.ctl.tsc_offset,
+ svm->tsc_ratio_msr);
+
+ vmcb02->control.tsc_offset = vcpu->arch.tsc_offset;
+
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_TSCRATEMSR) &&
+ svm->tsc_ratio_msr != kvm_caps.default_tsc_scaling_ratio)
+ nested_svm_update_tsc_ratio_msr(vcpu);
+
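+ /* Combine the L2-owned int_ctl bits from vmcb12 with the L1-owned bits from vmcb01. */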
+ vmcb02->control.int_ctl =
+ (svm->nested.ctl.int_ctl & int_ctl_vmcb12_bits) |
+ (vmcb01->control.int_ctl & int_ctl_vmcb01_bits);
+
+ vmcb02->control.int_vector = svm->nested.ctl.int_vector;
+ vmcb02->control.int_state = svm->nested.ctl.int_state;
+ vmcb02->control.event_inj = svm->nested.ctl.event_inj;
+ vmcb02->control.event_inj_err = svm->nested.ctl.event_inj_err;
+
+ /*
+ * next_rip is consumed on VMRUN as the return address pushed on the
+ * stack for injected soft exceptions/interrupts. If nrips is exposed
+ * to L1, take it verbatim from vmcb12. If nrips is supported in
+ * hardware but not exposed to L1, stuff the actual L2 RIP to emulate
+ * what a nrips=0 CPU would do (L1 is responsible for advancing RIP
+ * prior to injecting the event).
+ */
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
+ vmcb02->control.next_rip = svm->nested.ctl.next_rip;
+ else if (boot_cpu_has(X86_FEATURE_NRIPS))
+ vmcb02->control.next_rip = vmcb12_rip;
+
+ svm->nmi_l1_to_l2 = is_evtinj_nmi(vmcb02->control.event_inj);
+ if (is_evtinj_soft(vmcb02->control.event_inj)) {
+ svm->soft_int_injected = true;
+ svm->soft_int_csbase = vmcb12_csbase;
+ svm->soft_int_old_rip = vmcb12_rip;
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
+ svm->soft_int_next_rip = svm->nested.ctl.next_rip;
+ else
+ svm->soft_int_next_rip = vmcb12_rip;
+ }
+
+ /* LBR_CTL_ENABLE_MASK is controlled by svm_update_lbrv() */
+
+ if (!nested_vmcb_needs_vls_intercept(svm))
+ vmcb02->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
+
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_PAUSEFILTER))
+ pause_count12 = svm->nested.ctl.pause_filter_count;
+ else
+ pause_count12 = 0;
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_PFTHRESHOLD))
+ pause_thresh12 = svm->nested.ctl.pause_filter_thresh;
+ else
+ pause_thresh12 = 0;
+ if (kvm_pause_in_guest(svm->vcpu.kvm)) {
+ /* use guest values since host doesn't intercept PAUSE */
+ vmcb02->control.pause_filter_count = pause_count12;
+ vmcb02->control.pause_filter_thresh = pause_thresh12;
+ } else {
+ /* start from host values otherwise */
+ vmcb02->control.pause_filter_count = vmcb01->control.pause_filter_count;
+ vmcb02->control.pause_filter_thresh = vmcb01->control.pause_filter_thresh;
+
+ /* ... but ensure filtering is disabled if so requested. */
+ if (vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_PAUSE)) {
+ if (!pause_count12)
+ vmcb02->control.pause_filter_count = 0;
+ if (!pause_thresh12)
+ vmcb02->control.pause_filter_thresh = 0;
+ }
+ }
+
+ /*
+ * Merge guest and host intercepts - must be called with vcpu in
+ * guest-mode to take effect.
+ */
+ recalc_intercepts(svm);
+}
+
+static void nested_svm_copy_common_state(struct vmcb *from_vmcb, struct vmcb *to_vmcb)
+{
+ /*
+ * Some VMCB state is shared between L1 and L2 and thus has to be
+ * moved at the time of nested vmrun and vmexit.
+ *
+ * VMLOAD/VMSAVE state would also belong in this category, but KVM
+ * always performs VMLOAD and VMSAVE from the VMCB01.
+ */
+ to_vmcb->save.spec_ctrl = from_vmcb->save.spec_ctrl;
+}
+
+int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
+ struct vmcb *vmcb12, bool from_vmrun)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int ret;
+
+ trace_kvm_nested_vmenter(svm->vmcb->save.rip,
+ vmcb12_gpa,
+ vmcb12->save.rip,
+ vmcb12->control.int_ctl,
+ vmcb12->control.event_inj,
+ vmcb12->control.nested_ctl,
+ vmcb12->control.nested_cr3,
+ vmcb12->save.cr3,
+ KVM_ISA_SVM);
+
+ trace_kvm_nested_intercepts(vmcb12->control.intercepts[INTERCEPT_CR] & 0xffff,
+ vmcb12->control.intercepts[INTERCEPT_CR] >> 16,
+ vmcb12->control.intercepts[INTERCEPT_EXCEPTION],
+ vmcb12->control.intercepts[INTERCEPT_WORD3],
+ vmcb12->control.intercepts[INTERCEPT_WORD4],
+ vmcb12->control.intercepts[INTERCEPT_WORD5]);
+
+ svm->nested.vmcb12_gpa = vmcb12_gpa;
+
+ WARN_ON(svm->vmcb == svm->nested.vmcb02.ptr);
+
+ nested_svm_copy_common_state(svm->vmcb01.ptr, svm->nested.vmcb02.ptr);
+
+ svm_switch_vmcb(svm, &svm->nested.vmcb02);
+ nested_vmcb02_prepare_control(svm, vmcb12->save.rip, vmcb12->save.cs.base);
+ nested_vmcb02_prepare_save(svm, vmcb12);
+
+ ret = nested_svm_load_cr3(&svm->vcpu, svm->nested.save.cr3,
+ nested_npt_enabled(svm), from_vmrun);
+ if (ret)
+ return ret;
+
+ if (!from_vmrun)
+ kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+
+ svm_set_gif(svm, true);
+
+ if (kvm_vcpu_apicv_active(vcpu))
+ kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+
+ nested_svm_hv_update_vm_vp_ids(vcpu);
+
+ return 0;
+}
+
+int nested_svm_vmrun(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int ret;
+ struct vmcb *vmcb12;
+ struct kvm_host_map map;
+ u64 vmcb12_gpa;
+ struct vmcb *vmcb01 = svm->vmcb01.ptr;
+
+ if (!svm->nested.hsave_msr) {
+ kvm_inject_gp(vcpu, 0);
+ return 1;
+ }
+
+ if (is_smm(vcpu)) {
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+
+ /* This fails when VP assist page is enabled but the supplied GPA is bogus */
+ ret = kvm_hv_verify_vp_assist(vcpu);
+ if (ret) {
+ kvm_inject_gp(vcpu, 0);
+ return ret;
+ }
+
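+ /* VMRUN takes the guest-physical address of vmcb12 in RAX. */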
+ vmcb12_gpa = svm->vmcb->save.rax;
+ ret = kvm_vcpu_map(vcpu, gpa_to_gfn(vmcb12_gpa), &map);
+ if (ret == -EINVAL) {
+ kvm_inject_gp(vcpu, 0);
+ return 1;
+ } else if (ret) {
+ return kvm_skip_emulated_instruction(vcpu);
+ }
+
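+ /* Advance RIP past VMRUN; this is also the value returned on success. */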
+ ret = kvm_skip_emulated_instruction(vcpu);
+
+ vmcb12 = map.hva;
+
+ if (WARN_ON_ONCE(!svm->nested.initialized))
+ return -EINVAL;
+
+ nested_copy_vmcb_control_to_cache(svm, &vmcb12->control);
+ nested_copy_vmcb_save_to_cache(svm, &vmcb12->save);
+
+ if (!nested_vmcb_check_save(vcpu) ||
+ !nested_vmcb_check_controls(vcpu)) {
+ vmcb12->control.exit_code = SVM_EXIT_ERR;
+ vmcb12->control.exit_code_hi = 0;
+ vmcb12->control.exit_info_1 = 0;
+ vmcb12->control.exit_info_2 = 0;
+ goto out;
+ }
+
+ /*
+ * Since vmcb01 is not in use, we can use it to store some of the L1
+ * state.
+ */
+ vmcb01->save.efer = vcpu->arch.efer;
+ vmcb01->save.cr0 = kvm_read_cr0(vcpu);
+ vmcb01->save.cr4 = vcpu->arch.cr4;
+ vmcb01->save.rflags = kvm_get_rflags(vcpu);
+ vmcb01->save.rip = kvm_rip_read(vcpu);
+
+ if (!npt_enabled)
+ vmcb01->save.cr3 = kvm_read_cr3(vcpu);
+
+ svm->nested.nested_run_pending = 1;
+
+ if (enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12, true))
+ goto out_exit_err;
+
+ if (nested_svm_merge_msrpm(vcpu))
+ goto out;
+
+out_exit_err:
+ svm->nested.nested_run_pending = 0;
+ svm->nmi_l1_to_l2 = false;
+ svm->soft_int_injected = false;
+
+ svm->vmcb->control.exit_code = SVM_EXIT_ERR;
+ svm->vmcb->control.exit_code_hi = 0;
+ svm->vmcb->control.exit_info_1 = 0;
+ svm->vmcb->control.exit_info_2 = 0;
+
+ nested_svm_vmexit(svm);
+
+out:
+ kvm_vcpu_unmap(vcpu, &map);
+
+ return ret;
+}
+
+/* Copy state save area fields which are handled by VMRUN */
+void svm_copy_vmrun_state(struct vmcb_save_area *to_save,
+ struct vmcb_save_area *from_save)
+{
+ to_save->es = from_save->es;
+ to_save->cs = from_save->cs;
+ to_save->ss = from_save->ss;
+ to_save->ds = from_save->ds;
+ to_save->gdtr = from_save->gdtr;
+ to_save->idtr = from_save->idtr;
+ to_save->rflags = from_save->rflags | X86_EFLAGS_FIXED;
+ to_save->efer = from_save->efer;
+ to_save->cr0 = from_save->cr0;
+ to_save->cr3 = from_save->cr3;
+ to_save->cr4 = from_save->cr4;
+ to_save->rax = from_save->rax;
+ to_save->rsp = from_save->rsp;
+ to_save->rip = from_save->rip;
+ to_save->cpl = 0;
+
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
+ to_save->s_cet = from_save->s_cet;
+ to_save->isst_addr = from_save->isst_addr;
+ to_save->ssp = from_save->ssp;
+ }
+}
+
+void svm_copy_vmloadsave_state(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
+{
+ to_vmcb->save.fs = from_vmcb->save.fs;
+ to_vmcb->save.gs = from_vmcb->save.gs;
+ to_vmcb->save.tr = from_vmcb->save.tr;
+ to_vmcb->save.ldtr = from_vmcb->save.ldtr;
+ to_vmcb->save.kernel_gs_base = from_vmcb->save.kernel_gs_base;
+ to_vmcb->save.star = from_vmcb->save.star;
+ to_vmcb->save.lstar = from_vmcb->save.lstar;
+ to_vmcb->save.cstar = from_vmcb->save.cstar;
+ to_vmcb->save.sfmask = from_vmcb->save.sfmask;
+ to_vmcb->save.sysenter_cs = from_vmcb->save.sysenter_cs;
+ to_vmcb->save.sysenter_esp = from_vmcb->save.sysenter_esp;
+ to_vmcb->save.sysenter_eip = from_vmcb->save.sysenter_eip;
+}
+
+int nested_svm_vmexit(struct vcpu_svm *svm)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct vmcb *vmcb01 = svm->vmcb01.ptr;
+ struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
+ struct vmcb *vmcb12;
+ struct kvm_host_map map;
+ int rc;
+
+ rc = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.vmcb12_gpa), &map);
+ if (rc) {
+ if (rc == -EINVAL)
+ kvm_inject_gp(vcpu, 0);
+ return 1;
+ }
+
+ vmcb12 = map.hva;
+
+ /* Exit Guest-Mode */
+ leave_guest_mode(vcpu);
+ svm->nested.vmcb12_gpa = 0;
+ WARN_ON_ONCE(svm->nested.nested_run_pending);
+
+ kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+
+ /* in case we halted in L2 */
+ kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
+
+ /* Give the current vmcb to the guest */
+
+ vmcb12->save.es = vmcb02->save.es;
+ vmcb12->save.cs = vmcb02->save.cs;
+ vmcb12->save.ss = vmcb02->save.ss;
+ vmcb12->save.ds = vmcb02->save.ds;
+ vmcb12->save.gdtr = vmcb02->save.gdtr;
+ vmcb12->save.idtr = vmcb02->save.idtr;
+ vmcb12->save.efer = svm->vcpu.arch.efer;
+ vmcb12->save.cr0 = kvm_read_cr0(vcpu);
+ vmcb12->save.cr3 = kvm_read_cr3(vcpu);
+ vmcb12->save.cr2 = vmcb02->save.cr2;
+ vmcb12->save.cr4 = svm->vcpu.arch.cr4;
+ vmcb12->save.rflags = kvm_get_rflags(vcpu);
+ vmcb12->save.rip = kvm_rip_read(vcpu);
+ vmcb12->save.rsp = kvm_rsp_read(vcpu);
+ vmcb12->save.rax = kvm_rax_read(vcpu);
+ vmcb12->save.dr7 = vmcb02->save.dr7;
+ vmcb12->save.dr6 = svm->vcpu.arch.dr6;
+ vmcb12->save.cpl = vmcb02->save.cpl;
+
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK)) {
+ vmcb12->save.s_cet = vmcb02->save.s_cet;
+ vmcb12->save.isst_addr = vmcb02->save.isst_addr;
+ vmcb12->save.ssp = vmcb02->save.ssp;
+ }
+
+ vmcb12->control.int_state = vmcb02->control.int_state;
+ vmcb12->control.exit_code = vmcb02->control.exit_code;
+ vmcb12->control.exit_code_hi = vmcb02->control.exit_code_hi;
+ vmcb12->control.exit_info_1 = vmcb02->control.exit_info_1;
+ vmcb12->control.exit_info_2 = vmcb02->control.exit_info_2;
+
+ if (vmcb12->control.exit_code != SVM_EXIT_ERR)
+ nested_save_pending_event_to_vmcb12(svm, vmcb12);
+
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_NRIPS))
+ vmcb12->control.next_rip = vmcb02->control.next_rip;
+
+ vmcb12->control.int_ctl = svm->nested.ctl.int_ctl;
+ vmcb12->control.event_inj = svm->nested.ctl.event_inj;
+ vmcb12->control.event_inj_err = svm->nested.ctl.event_inj_err;
+
+ if (!kvm_pause_in_guest(vcpu->kvm)) {
+ vmcb01->control.pause_filter_count = vmcb02->control.pause_filter_count;
+ vmcb_mark_dirty(vmcb01, VMCB_INTERCEPTS);
+ }
+
+ /*
+ * Invalidate bus_lock_rip unless KVM is still waiting for the guest
+ * to make forward progress before re-enabling bus lock detection.
+ */
+ if (!vmcb02->control.bus_lock_counter)
+ svm->nested.ctl.bus_lock_rip = INVALID_GPA;
+
+ nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr);
+
+ kvm_nested_vmexit_handle_ibrs(vcpu);
+
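+ /* Switch back to vmcb01 now that L2's state has been saved to vmcb12. */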
+ svm_switch_vmcb(svm, &svm->vmcb01);
+
+ /*
+ * Rules for synchronizing int_ctl bits from vmcb02 to vmcb01:
+ *
+ * V_IRQ, V_IRQ_VECTOR, V_INTR_PRIO_MASK, V_IGN_TPR: If L1 doesn't
+ * intercept interrupts, then KVM will use vmcb02's V_IRQ (and related
+ * flags) to detect interrupt windows for L1 IRQs (even if L1 uses
+ * virtual interrupt masking). Raise KVM_REQ_EVENT to ensure that
+ * KVM re-requests an interrupt window if necessary, which implicitly
+ * copies these bits from vmcb02 to vmcb01.
+ *
+ * V_TPR: If L1 doesn't use virtual interrupt masking, then L1's vTPR
+ * is stored in vmcb02, but its value doesn't need to be copied from/to
+ * vmcb01 because it is copied from/to the virtual APIC's TPR register
+ * on each VM entry/exit.
+ *
+ * V_GIF: If nested vGIF is not used, KVM uses vmcb02's V_GIF for L1's
+ * V_GIF. However, GIF is architecturally cleared on each VM exit, thus
+ * there is no need to copy V_GIF from vmcb02 to vmcb01.
+ */
+ if (!nested_exit_on_intr(svm))
+ kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
+
+ if (unlikely(guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) &&
+ (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK)))
+ svm_copy_lbrs(vmcb12, vmcb02);
+ else
+ svm_copy_lbrs(vmcb01, vmcb02);
+
+ svm_update_lbrv(vcpu);
+
+ if (vnmi) {
+ if (vmcb02->control.int_ctl & V_NMI_BLOCKING_MASK)
+ vmcb01->control.int_ctl |= V_NMI_BLOCKING_MASK;
+ else
+ vmcb01->control.int_ctl &= ~V_NMI_BLOCKING_MASK;
+
+ if (vcpu->arch.nmi_pending) {
+ vcpu->arch.nmi_pending--;
+ vmcb01->control.int_ctl |= V_NMI_PENDING_MASK;
+ } else {
+ vmcb01->control.int_ctl &= ~V_NMI_PENDING_MASK;
+ }
+ }
+
+ /*
+ * On vmexit, GIF is cleared, so no event can be
+ * injected into L1.
+ */
+ svm_set_gif(svm, false);
+ vmcb01->control.exit_int_info = 0;
+
+ svm->vcpu.arch.tsc_offset = svm->vcpu.arch.l1_tsc_offset;
+ if (vmcb01->control.tsc_offset != svm->vcpu.arch.tsc_offset) {
+ vmcb01->control.tsc_offset = svm->vcpu.arch.tsc_offset;
+ vmcb_mark_dirty(vmcb01, VMCB_INTERCEPTS);
+ }
+
+ if (kvm_caps.has_tsc_control &&
+ vcpu->arch.tsc_scaling_ratio != vcpu->arch.l1_tsc_scaling_ratio) {
+ vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
+ svm_write_tsc_multiplier(vcpu);
+ }
+
+ svm->nested.ctl.nested_cr3 = 0;
+
+ /*
+ * Restore processor state that had been saved in vmcb01
+ */
+ kvm_set_rflags(vcpu, vmcb01->save.rflags);
+ svm_set_efer(vcpu, vmcb01->save.efer);
+ svm_set_cr0(vcpu, vmcb01->save.cr0 | X86_CR0_PE);
+ svm_set_cr4(vcpu, vmcb01->save.cr4);
+ kvm_rax_write(vcpu, vmcb01->save.rax);
+ kvm_rsp_write(vcpu, vmcb01->save.rsp);
+ kvm_rip_write(vcpu, vmcb01->save.rip);
+
+ svm->vcpu.arch.dr7 = DR7_FIXED_1;
+ kvm_update_dr7(&svm->vcpu);
+
+ trace_kvm_nested_vmexit_inject(vmcb12->control.exit_code,
+ vmcb12->control.exit_info_1,
+ vmcb12->control.exit_info_2,
+ vmcb12->control.exit_int_info,
+ vmcb12->control.exit_int_info_err,
+ KVM_ISA_SVM);
+
+ kvm_vcpu_unmap(vcpu, &map);
+
+ nested_svm_transition_tlb_flush(vcpu);
+
+ nested_svm_uninit_mmu_context(vcpu);
+
+ rc = nested_svm_load_cr3(vcpu, vmcb01->save.cr3, false, true);
+ if (rc)
+ return 1;
+
+ /*
+ * Drop what we picked up for L2 via svm_complete_interrupts() so it
+ * doesn't end up in L1.
+ */
+ svm->vcpu.arch.nmi_injected = false;
+ kvm_clear_exception_queue(vcpu);
+ kvm_clear_interrupt_queue(vcpu);
+
+ /*
+ * If we are here following the completion of a VMRUN that
+ * is being single-stepped, queue the pending #DB intercept
+ * right now so that it can be accounted for before we execute
+ * L1's next instruction.
+ */
+ if (unlikely(vmcb01->save.rflags & X86_EFLAGS_TF))
+ kvm_queue_exception(&(svm->vcpu), DB_VECTOR);
+
+ /*
+ * Un-inhibit the AVIC right away, so that other vCPUs can start
+ * to benefit from it right away.
+ */
+ if (kvm_apicv_activated(vcpu->kvm))
+ __kvm_vcpu_update_apicv(vcpu);
+
+ return 0;
+}
+
+static void nested_svm_triple_fault(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ if (!vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_SHUTDOWN))
+ return;
+
+ kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+ nested_svm_simple_vmexit(to_svm(vcpu), SVM_EXIT_SHUTDOWN);
+}
+
+int svm_allocate_nested(struct vcpu_svm *svm)
+{
+ struct page *vmcb02_page;
+
+ if (svm->nested.initialized)
+ return 0;
+
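+ /* The VMCB backing page must come from an SNP-safe allocation; see snp_safe_alloc_page(). */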
+ vmcb02_page = snp_safe_alloc_page();
+ if (!vmcb02_page)
+ return -ENOMEM;
+ svm->nested.vmcb02.ptr = page_address(vmcb02_page);
+ svm->nested.vmcb02.pa = __sme_set(page_to_pfn(vmcb02_page) << PAGE_SHIFT);
+
+ svm->nested.msrpm = svm_vcpu_alloc_msrpm();
+ if (!svm->nested.msrpm)
+ goto err_free_vmcb02;
+
+ svm->nested.initialized = true;
+ return 0;
+
+err_free_vmcb02:
+ __free_page(vmcb02_page);
+ return -ENOMEM;
+}
+
+void svm_free_nested(struct vcpu_svm *svm)
+{
+ if (!svm->nested.initialized)
+ return;
+
+ if (WARN_ON_ONCE(svm->vmcb != svm->vmcb01.ptr))
+ svm_switch_vmcb(svm, &svm->vmcb01);
+
+ svm_vcpu_free_msrpm(svm->nested.msrpm);
+ svm->nested.msrpm = NULL;
+
+ __free_page(virt_to_page(svm->nested.vmcb02.ptr));
+ svm->nested.vmcb02.ptr = NULL;
+
+ /*
+ * When last_vmcb12_gpa matches the current vmcb12 gpa,
+ * some vmcb12 fields are not loaded if they are marked clean
+ * in the vmcb12, since in this case they are up to date already.
+ *
+ * When the vmcb02 is freed, this optimization becomes invalid.
+ */
+ svm->nested.last_vmcb12_gpa = INVALID_GPA;
+
+ svm->nested.initialized = false;
+}
+
+void svm_leave_nested(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ if (is_guest_mode(vcpu)) {
+ svm->nested.nested_run_pending = 0;
+ svm->nested.vmcb12_gpa = INVALID_GPA;
+
+ leave_guest_mode(vcpu);
+
+ svm_switch_vmcb(svm, &svm->vmcb01);
+
+ nested_svm_uninit_mmu_context(vcpu);
+ vmcb_mark_all_dirty(svm->vmcb);
+
+ if (kvm_apicv_activated(vcpu->kvm))
+ kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+ }
+
+ kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+}
+
+static int nested_svm_exit_handled_msr(struct vcpu_svm *svm)
+{
+ gpa_t base = svm->nested.ctl.msrpm_base_pa;
+ int write, bit_nr;
+ u8 value, mask;
+ u32 msr;
+
+ if (!(vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_MSR_PROT)))
+ return NESTED_EXIT_HOST;
+
+ msr = svm->vcpu.arch.regs[VCPU_REGS_RCX];
+ bit_nr = svm_msrpm_bit_nr(msr);
+ write = svm->vmcb->control.exit_info_1 & 1;
+
+ if (bit_nr < 0)
+ return NESTED_EXIT_DONE;
+
+ if (kvm_vcpu_read_guest(&svm->vcpu, base + bit_nr / BITS_PER_BYTE,
+ &value, sizeof(value)))
+ return NESTED_EXIT_DONE;
+
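+ /* Each MSR has a two-bit pair in the MSRPM: the low bit covers reads, the high bit writes. */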
+ mask = BIT(write) << (bit_nr & (BITS_PER_BYTE - 1));
+ return (value & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST;
+}
+
+static int nested_svm_intercept_ioio(struct vcpu_svm *svm)
+{
+ unsigned port, size, iopm_len;
+ u16 val, mask;
+ u8 start_bit;
+ u64 gpa;
+
+ if (!(vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_IOIO_PROT)))
+ return NESTED_EXIT_HOST;
+
+ port = svm->vmcb->control.exit_info_1 >> 16;
+ size = (svm->vmcb->control.exit_info_1 & SVM_IOIO_SIZE_MASK) >>
+ SVM_IOIO_SIZE_SHIFT;
+ gpa = svm->nested.ctl.iopm_base_pa + (port / 8);
+ start_bit = port % 8;
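+ /* An access that straddles a byte boundary in the IOPM requires reading two bytes. */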
+ iopm_len = (start_bit + size > 8) ? 2 : 1;
+ mask = (0xf >> (4 - size)) << start_bit;
+ val = 0;
+
+ if (kvm_vcpu_read_guest(&svm->vcpu, gpa, &val, iopm_len))
+ return NESTED_EXIT_DONE;
+
+ return (val & mask) ? NESTED_EXIT_DONE : NESTED_EXIT_HOST;
+}
+
+static int nested_svm_intercept(struct vcpu_svm *svm)
+{
+ u32 exit_code = svm->vmcb->control.exit_code;
+ int vmexit = NESTED_EXIT_HOST;
+
+ switch (exit_code) {
+ case SVM_EXIT_MSR:
+ vmexit = nested_svm_exit_handled_msr(svm);
+ break;
+ case SVM_EXIT_IOIO:
+ vmexit = nested_svm_intercept_ioio(svm);
+ break;
+ case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: {
+ /*
+ * Host-intercepted exceptions have been checked already in
+ * nested_svm_exit_special. There is nothing to do here;
+ * the vmexit is injected by svm_check_nested_events.
+ */
+ vmexit = NESTED_EXIT_DONE;
+ break;
+ }
+ case SVM_EXIT_ERR: {
+ vmexit = NESTED_EXIT_DONE;
+ break;
+ }
+ default: {
+ if (vmcb12_is_intercept(&svm->nested.ctl, exit_code))
+ vmexit = NESTED_EXIT_DONE;
+ }
+ }
+
+ return vmexit;
+}
+
+int nested_svm_exit_handled(struct vcpu_svm *svm)
+{
+ int vmexit;
+
+ vmexit = nested_svm_intercept(svm);
+
+ if (vmexit == NESTED_EXIT_DONE)
+ nested_svm_vmexit(svm);
+
+ return vmexit;
+}
+
+int nested_svm_check_permissions(struct kvm_vcpu *vcpu)
+{
+ if (!(vcpu->arch.efer & EFER_SVME) || !is_paging(vcpu)) {
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+
+ if (to_svm(vcpu)->vmcb->save.cpl) {
+ kvm_inject_gp(vcpu, 0);
+ return 1;
+ }
+
+ return 0;
+}
+
+static bool nested_svm_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector,
+ u32 error_code)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(vector));
+}
+
+static void nested_svm_inject_exception_vmexit(struct kvm_vcpu *vcpu)
+{
+ struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb *vmcb = svm->vmcb;
+
+ vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + ex->vector;
+ vmcb->control.exit_code_hi = 0;
+
+ if (ex->has_error_code)
+ vmcb->control.exit_info_1 = ex->error_code;
+
+ /*
+ * EXITINFO2 is undefined for all exception intercepts other
+ * than #PF.
+ */
+ if (ex->vector == PF_VECTOR) {
+ if (ex->has_payload)
+ vmcb->control.exit_info_2 = ex->payload;
+ else
+ vmcb->control.exit_info_2 = vcpu->arch.cr2;
+ } else if (ex->vector == DB_VECTOR) {
+ /* See kvm_check_and_inject_events(). */
+ kvm_deliver_exception_payload(vcpu, ex);
+
+ if (vcpu->arch.dr7 & DR7_GD) {
+ vcpu->arch.dr7 &= ~DR7_GD;
+ kvm_update_dr7(vcpu);
+ }
+ } else {
+ WARN_ON(ex->has_payload);
+ }
+
+ nested_svm_vmexit(svm);
+}
+
+static inline bool nested_exit_on_init(struct vcpu_svm *svm)
+{
+ return vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_INIT);
+}
+
+static int svm_check_nested_events(struct kvm_vcpu *vcpu)
+{
+ struct kvm_lapic *apic = vcpu->arch.apic;
+ struct vcpu_svm *svm = to_svm(vcpu);
+ /*
+ * Only a pending nested run blocks a pending exception. If there is a
+ * previously injected event, the pending exception occurred while said
+ * event was being delivered and thus needs to be handled.
+ */
+ bool block_nested_exceptions = svm->nested.nested_run_pending;
+ /*
+ * New events (not exceptions) are only recognized at instruction
+ * boundaries. If an event needs reinjection, then KVM is handling a
+ * VM-Exit that occurred _during_ instruction execution; new events are
+ * blocked until the instruction completes.
+ */
+ bool block_nested_events = block_nested_exceptions ||
+ kvm_event_needs_reinjection(vcpu);
+
+ if (lapic_in_kernel(vcpu) &&
+ test_bit(KVM_APIC_INIT, &apic->pending_events)) {
+ if (block_nested_events)
+ return -EBUSY;
+ if (!nested_exit_on_init(svm))
+ return 0;
+ nested_svm_simple_vmexit(svm, SVM_EXIT_INIT);
+ return 0;
+ }
+
+ if (vcpu->arch.exception_vmexit.pending) {
+ if (block_nested_exceptions)
+ return -EBUSY;
+ nested_svm_inject_exception_vmexit(vcpu);
+ return 0;
+ }
+
+ if (vcpu->arch.exception.pending) {
+ if (block_nested_exceptions)
+ return -EBUSY;
+ return 0;
+ }
+
+#ifdef CONFIG_KVM_SMM
+ if (vcpu->arch.smi_pending && !svm_smi_blocked(vcpu)) {
+ if (block_nested_events)
+ return -EBUSY;
+ if (!nested_exit_on_smi(svm))
+ return 0;
+ nested_svm_simple_vmexit(svm, SVM_EXIT_SMI);
+ return 0;
+ }
+#endif
+
+ if (vcpu->arch.nmi_pending && !svm_nmi_blocked(vcpu)) {
+ if (block_nested_events)
+ return -EBUSY;
+ if (!nested_exit_on_nmi(svm))
+ return 0;
+ nested_svm_simple_vmexit(svm, SVM_EXIT_NMI);
+ return 0;
+ }
+
+ if (kvm_cpu_has_interrupt(vcpu) && !svm_interrupt_blocked(vcpu)) {
+ if (block_nested_events)
+ return -EBUSY;
+ if (!nested_exit_on_intr(svm))
+ return 0;
+ trace_kvm_nested_intr_vmexit(svm->vmcb->save.rip);
+ nested_svm_simple_vmexit(svm, SVM_EXIT_INTR);
+ return 0;
+ }
+
+ return 0;
+}
+
+int nested_svm_exit_special(struct vcpu_svm *svm)
+{
+ u32 exit_code = svm->vmcb->control.exit_code;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+
+ switch (exit_code) {
+ case SVM_EXIT_INTR:
+ case SVM_EXIT_NMI:
+ case SVM_EXIT_NPF:
+ return NESTED_EXIT_HOST;
+ case SVM_EXIT_EXCP_BASE ... SVM_EXIT_EXCP_BASE + 0x1f: {
+ u32 excp_bits = 1 << (exit_code - SVM_EXIT_EXCP_BASE);
+
+ if (svm->vmcb01.ptr->control.intercepts[INTERCEPT_EXCEPTION] &
+ excp_bits)
+ return NESTED_EXIT_HOST;
+ else if (exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR &&
+ svm->vcpu.arch.apf.host_apf_flags)
+ /* Trap async PF even if not shadowing */
+ return NESTED_EXIT_HOST;
+ break;
+ }
+ case SVM_EXIT_VMMCALL:
+ /* Hyper-V L2 TLB flush hypercall is handled by L0 */
+ if (guest_hv_cpuid_has_l2_tlb_flush(vcpu) &&
+ nested_svm_l2_tlb_flush_enabled(vcpu) &&
+ kvm_hv_is_tlb_flush_hcall(vcpu))
+ return NESTED_EXIT_HOST;
+ break;
+ default:
+ break;
+ }
+
+ return NESTED_EXIT_CONTINUE;
+}
+
+void nested_svm_update_tsc_ratio_msr(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ vcpu->arch.tsc_scaling_ratio =
+ kvm_calc_nested_tsc_multiplier(vcpu->arch.l1_tsc_scaling_ratio,
+ svm->tsc_ratio_msr);
+ svm_write_tsc_multiplier(vcpu);
+}
+
+/* Inverse operation of nested_copy_vmcb_control_to_cache(). asid is copied too. */
+static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
+ struct vmcb_ctrl_area_cached *from)
+{
+ unsigned int i;
+
+ memset(dst, 0, sizeof(struct vmcb_control_area));
+
+ for (i = 0; i < MAX_INTERCEPT; i++)
+ dst->intercepts[i] = from->intercepts[i];
+
+ dst->iopm_base_pa = from->iopm_base_pa;
+ dst->msrpm_base_pa = from->msrpm_base_pa;
+ dst->tsc_offset = from->tsc_offset;
+ dst->asid = from->asid;
+ dst->tlb_ctl = from->tlb_ctl;
+ dst->int_ctl = from->int_ctl;
+ dst->int_vector = from->int_vector;
+ dst->int_state = from->int_state;
+ dst->exit_code = from->exit_code;
+ dst->exit_code_hi = from->exit_code_hi;
+ dst->exit_info_1 = from->exit_info_1;
+ dst->exit_info_2 = from->exit_info_2;
+ dst->exit_int_info = from->exit_int_info;
+ dst->exit_int_info_err = from->exit_int_info_err;
+ dst->nested_ctl = from->nested_ctl;
+ dst->event_inj = from->event_inj;
+ dst->event_inj_err = from->event_inj_err;
+ dst->next_rip = from->next_rip;
+ dst->nested_cr3 = from->nested_cr3;
+ dst->virt_ext = from->virt_ext;
+ dst->pause_filter_count = from->pause_filter_count;
+ dst->pause_filter_thresh = from->pause_filter_thresh;
+ /* 'clean' and 'hv_enlightenments' are not changed by KVM */
+}
+
+static int svm_get_nested_state(struct kvm_vcpu *vcpu,
+ struct kvm_nested_state __user *user_kvm_nested_state,
+ u32 user_data_size)
+{
+ struct vcpu_svm *svm;
+ struct vmcb_control_area *ctl;
+ unsigned long r;
+ struct kvm_nested_state kvm_state = {
+ .flags = 0,
+ .format = KVM_STATE_NESTED_FORMAT_SVM,
+ .size = sizeof(kvm_state),
+ };
+ struct vmcb __user *user_vmcb = (struct vmcb __user *)
+ &user_kvm_nested_state->data.svm[0];
+
+ if (!vcpu)
+ return kvm_state.size + KVM_STATE_NESTED_SVM_VMCB_SIZE;
+
+ svm = to_svm(vcpu);
+
+ if (user_data_size < kvm_state.size)
+ goto out;
+
+ /* First fill in the header and copy it out. */
+ if (is_guest_mode(vcpu)) {
+ kvm_state.hdr.svm.vmcb_pa = svm->nested.vmcb12_gpa;
+ kvm_state.size += KVM_STATE_NESTED_SVM_VMCB_SIZE;
+ kvm_state.flags |= KVM_STATE_NESTED_GUEST_MODE;
+
+ if (svm->nested.nested_run_pending)
+ kvm_state.flags |= KVM_STATE_NESTED_RUN_PENDING;
+ }
+
+ if (gif_set(svm))
+ kvm_state.flags |= KVM_STATE_NESTED_GIF_SET;
+
+ if (copy_to_user(user_kvm_nested_state, &kvm_state, sizeof(kvm_state)))
+ return -EFAULT;
+
+ if (!is_guest_mode(vcpu))
+ goto out;
+
+ /*
+ * Copy over the full size of the VMCB rather than just the size
+ * of the structs.
+ */
+ if (clear_user(user_vmcb, KVM_STATE_NESTED_SVM_VMCB_SIZE))
+ return -EFAULT;
+
+ ctl = kzalloc(sizeof(*ctl), GFP_KERNEL);
+ if (!ctl)
+ return -ENOMEM;
+
+ nested_copy_vmcb_cache_to_control(ctl, &svm->nested.ctl);
+ r = copy_to_user(&user_vmcb->control, ctl,
+ sizeof(user_vmcb->control));
+ kfree(ctl);
+ if (r)
+ return -EFAULT;
+
+ if (copy_to_user(&user_vmcb->save, &svm->vmcb01.ptr->save,
+ sizeof(user_vmcb->save)))
+ return -EFAULT;
+out:
+ return kvm_state.size;
+}
+
+static int svm_set_nested_state(struct kvm_vcpu *vcpu,
+ struct kvm_nested_state __user *user_kvm_nested_state,
+ struct kvm_nested_state *kvm_state)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb __user *user_vmcb = (struct vmcb __user *)
+ &user_kvm_nested_state->data.svm[0];
+ struct vmcb_control_area *ctl;
+ struct vmcb_save_area *save;
+ struct vmcb_save_area_cached save_cached;
+ struct vmcb_ctrl_area_cached ctl_cached;
+ unsigned long cr0;
+ int ret;
+
+ BUILD_BUG_ON(sizeof(struct vmcb_control_area) + sizeof(struct vmcb_save_area) >
+ KVM_STATE_NESTED_SVM_VMCB_SIZE);
+
+ if (kvm_state->format != KVM_STATE_NESTED_FORMAT_SVM)
+ return -EINVAL;
+
+ if (kvm_state->flags & ~(KVM_STATE_NESTED_GUEST_MODE |
+ KVM_STATE_NESTED_RUN_PENDING |
+ KVM_STATE_NESTED_GIF_SET))
+ return -EINVAL;
+
+ /*
+ * If in guest mode, vcpu->arch.efer actually refers to the L2 guest's
+ * EFER.SVME, but EFER.SVME still has to be 1 for VMRUN to succeed.
+ */
+ if (!(vcpu->arch.efer & EFER_SVME)) {
+ /* GIF=1 and no guest mode are required if SVME=0. */
+ if (kvm_state->flags != KVM_STATE_NESTED_GIF_SET)
+ return -EINVAL;
+ }
+
+ /* SMM temporarily disables SVM, so we cannot be in guest mode. */
+ if (is_smm(vcpu) && (kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE))
+ return -EINVAL;
+
+ if (!(kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)) {
+ svm_leave_nested(vcpu);
+ svm_set_gif(svm, !!(kvm_state->flags & KVM_STATE_NESTED_GIF_SET));
+ return 0;
+ }
+
+ if (!page_address_valid(vcpu, kvm_state->hdr.svm.vmcb_pa))
+ return -EINVAL;
+ if (kvm_state->size < sizeof(*kvm_state) + KVM_STATE_NESTED_SVM_VMCB_SIZE)
+ return -EINVAL;
+
+ ctl = memdup_user(&user_vmcb->control, sizeof(*ctl));
+ if (IS_ERR(ctl))
+ return PTR_ERR(ctl);
+
+ save = memdup_user(&user_vmcb->save, sizeof(*save));
+ if (IS_ERR(save)) {
+ kfree(ctl);
+ return PTR_ERR(save);
+ }
+
+ ret = -EINVAL;
+ __nested_copy_vmcb_control_to_cache(vcpu, &ctl_cached, ctl);
+ if (!__nested_vmcb_check_controls(vcpu, &ctl_cached))
+ goto out_free;
+
+ /*
+ * Processor state contains L2 state. Check that it is
+ * valid for guest mode (see nested_vmcb_check_save).
+ */
+ cr0 = kvm_read_cr0(vcpu);
+ if (((cr0 & X86_CR0_CD) == 0) && (cr0 & X86_CR0_NW))
+ goto out_free;
+
+ /*
+ * Validate host state saved from before VMRUN (see
+ * nested_svm_check_permissions).
+ */
+ __nested_copy_vmcb_save_to_cache(&save_cached, save);
+ if (!(save->cr0 & X86_CR0_PG) ||
+ !(save->cr0 & X86_CR0_PE) ||
+ (save->rflags & X86_EFLAGS_VM) ||
+ !__nested_vmcb_check_save(vcpu, &save_cached))
+ goto out_free;
+
+ /*
+ * All checks done, we can enter guest mode. Userspace provides
+ * vmcb12.control, which will be combined with L1 and stored into
+ * vmcb02, and the L1 save state which we store in vmcb01.
+ * If needed, L2 registers are moved from the current VMCB to vmcb02.
+ */
+
+ if (is_guest_mode(vcpu))
+ svm_leave_nested(vcpu);
+ else
+ svm->nested.vmcb02.ptr->save = svm->vmcb01.ptr->save;
+
+ svm_set_gif(svm, !!(kvm_state->flags & KVM_STATE_NESTED_GIF_SET));
+
+ svm->nested.nested_run_pending =
+ !!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING);
+
+ svm->nested.vmcb12_gpa = kvm_state->hdr.svm.vmcb_pa;
+
+ svm_copy_vmrun_state(&svm->vmcb01.ptr->save, save);
+ nested_copy_vmcb_control_to_cache(svm, ctl);
+
+ svm_switch_vmcb(svm, &svm->nested.vmcb02);
+ nested_vmcb02_prepare_control(svm, svm->vmcb->save.rip, svm->vmcb->save.cs.base);
+
+ /*
+ * While the nested guest CR3 is already checked and set by
+ * KVM_SET_SREGS, it was set before the nested state was loaded,
+ * thus the MMU might not be initialized correctly.
+ * Set it again to fix this.
+ */
+
+ ret = nested_svm_load_cr3(&svm->vcpu, vcpu->arch.cr3,
+ nested_npt_enabled(svm), false);
+ if (WARN_ON_ONCE(ret))
+ goto out_free;
+
+ svm->nested.force_msr_bitmap_recalc = true;
+
+ kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+ ret = 0;
+out_free:
+ kfree(save);
+ kfree(ctl);
+
+ return ret;
+}
+
+static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
+{
+ if (WARN_ON(!is_guest_mode(vcpu)))
+ return true;
+
+ if (!vcpu->arch.pdptrs_from_userspace &&
+ !nested_npt_enabled(to_svm(vcpu)) && is_pae_paging(vcpu))
+ /*
+ * Reload the guest's PDPTRs since after a migration
+ * the guest CR3 might be restored prior to setting the nested
+ * state, which can lead to loading the wrong PDPTRs.
+ */
+ if (CC(!load_pdptrs(vcpu, vcpu->arch.cr3)))
+ return false;
+
+ if (!nested_svm_merge_msrpm(vcpu)) {
+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+ vcpu->run->internal.suberror =
+ KVM_INTERNAL_ERROR_EMULATION;
+ vcpu->run->internal.ndata = 0;
+ return false;
+ }
+
+ if (kvm_hv_verify_vp_assist(vcpu))
+ return false;
+
+ return true;
+}
+
+struct kvm_x86_nested_ops svm_nested_ops = {
+ .leave_nested = svm_leave_nested,
+ .is_exception_vmexit = nested_svm_is_exception_vmexit,
+ .check_events = svm_check_nested_events,
+ .triple_fault = nested_svm_triple_fault,
+ .get_nested_state_pages = svm_get_nested_state_pages,
+ .get_state = svm_get_nested_state,
+ .set_state = svm_set_nested_state,
+ .hv_inject_synthetic_vmexit_post_tlb_flush = svm_hv_inject_synthetic_vmexit_post_tlb_flush,
+};
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
new file mode 100644
index 000000000000..bc062285fbf5
--- /dev/null
+++ b/arch/x86/kvm/svm/pmu.c
@@ -0,0 +1,242 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KVM PMU support for AMD
+ *
+ * Copyright 2015, Red Hat, Inc. and/or its affiliates.
+ *
+ * Author:
+ * Wei Huang <wei@redhat.com>
+ *
+ * Implementation is based on pmu_intel.c file
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/types.h>
+#include <linux/kvm_host.h>
+#include <linux/perf_event.h>
+#include "x86.h"
+#include "cpuid.h"
+#include "lapic.h"
+#include "pmu.h"
+#include "svm.h"
+
+enum pmu_type {
+ PMU_TYPE_COUNTER = 0,
+ PMU_TYPE_EVNTSEL,
+};
+
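+/* Bounds-check the counter index and clamp speculation before indexing the GP counter array. */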
+static struct kvm_pmc *amd_pmu_get_pmc(struct kvm_pmu *pmu, int pmc_idx)
+{
+ unsigned int num_counters = pmu->nr_arch_gp_counters;
+
+ if (pmc_idx >= num_counters)
+ return NULL;
+
+ return &pmu->gp_counters[array_index_nospec(pmc_idx, num_counters)];
+}
+
+static inline struct kvm_pmc *get_gp_pmc_amd(struct kvm_pmu *pmu, u32 msr,
+ enum pmu_type type)
+{
+ struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
+ unsigned int idx;
+
+ if (!pmu->version)
+ return NULL;
+
+ switch (msr) {
+ case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTR5:
+ if (!guest_cpu_cap_has(vcpu, X86_FEATURE_PERFCTR_CORE))
+ return NULL;
+ /*
+ * Each PMU counter has a pair of CTL and CTR MSRs. CTLn
+ * MSRs (accessed via EVNTSEL) are even, CTRn MSRs are odd.
+ */
+ idx = (unsigned int)((msr - MSR_F15H_PERF_CTL0) / 2);
+ if (!(msr & 0x1) != (type == PMU_TYPE_EVNTSEL))
+ return NULL;
+ break;
+ case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
+ if (type != PMU_TYPE_EVNTSEL)
+ return NULL;
+ idx = msr - MSR_K7_EVNTSEL0;
+ break;
+ case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
+ if (type != PMU_TYPE_COUNTER)
+ return NULL;
+ idx = msr - MSR_K7_PERFCTR0;
+ break;
+ default:
+ return NULL;
+ }
+
+ return amd_pmu_get_pmc(pmu, idx);
+}
+
+static int amd_check_rdpmc_early(struct kvm_vcpu *vcpu, unsigned int idx)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ if (idx >= pmu->nr_arch_gp_counters)
+ return -EINVAL;
+
+ return 0;
+}
+
+/* idx is the ECX register of RDPMC instruction */
+static struct kvm_pmc *amd_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
+ unsigned int idx, u64 *mask)
+{
+ return amd_pmu_get_pmc(vcpu_to_pmu(vcpu), idx);
+}
+
+static struct kvm_pmc *amd_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+
+ pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
+ pmc = pmc ? pmc : get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
+
+ return pmc;
+}
+
+static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ switch (msr) {
+ case MSR_K7_EVNTSEL0 ... MSR_K7_PERFCTR3:
+ return pmu->version > 0;
+ case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTR5:
+ return guest_cpu_cap_has(vcpu, X86_FEATURE_PERFCTR_CORE);
+ case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS:
+ case MSR_AMD64_PERF_CNTR_GLOBAL_CTL:
+ case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_CLR:
+ case MSR_AMD64_PERF_CNTR_GLOBAL_STATUS_SET:
+ return pmu->version > 1;
+ default:
+ if (msr > MSR_F15H_PERF_CTR5 &&
+ msr < MSR_F15H_PERF_CTL0 + 2 * pmu->nr_arch_gp_counters)
+ return pmu->version > 1;
+ break;
+ }
+
+ return amd_msr_idx_to_pmc(vcpu, msr);
+}
+
+static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ u32 msr = msr_info->index;
+
+ /* MSR_PERFCTRn */
+ pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
+ if (pmc) {
+ msr_info->data = pmc_read_counter(pmc);
+ return 0;
+ }
+ /* MSR_EVNTSELn */
+ pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
+ if (pmc) {
+ msr_info->data = pmc->eventsel;
+ return 0;
+ }
+
+ return 1;
+}
+
+static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ u32 msr = msr_info->index;
+ u64 data = msr_info->data;
+
+ /* MSR_PERFCTRn */
+ pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
+ if (pmc) {
+ pmc_write_counter(pmc, data);
+ return 0;
+ }
+ /* MSR_EVNTSELn */
+ pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
+ if (pmc) {
+ data &= ~pmu->reserved_bits;
+ if (data != pmc->eventsel) {
+ pmc->eventsel = data;
+ kvm_pmu_request_counter_reprogram(pmc);
+ }
+ return 0;
+ }
+
+ return 1;
+}
+
+static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ union cpuid_0x80000022_ebx ebx;
+
+ pmu->version = 1;
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_PERFMON_V2)) {
+ pmu->version = 2;
+ /*
+ * Note, PERFMON_V2 is also in 0x80000022.0x0, i.e. the guest
+ * CPUID entry is guaranteed to be non-NULL.
+ */
+ BUILD_BUG_ON(x86_feature_cpuid(X86_FEATURE_PERFMON_V2).function != 0x80000022 ||
+ x86_feature_cpuid(X86_FEATURE_PERFMON_V2).index);
+ ebx.full = kvm_find_cpuid_entry_index(vcpu, 0x80000022, 0)->ebx;
+ pmu->nr_arch_gp_counters = ebx.split.num_core_pmc;
+ } else if (guest_cpu_cap_has(vcpu, X86_FEATURE_PERFCTR_CORE)) {
+ pmu->nr_arch_gp_counters = AMD64_NUM_COUNTERS_CORE;
+ } else {
+ pmu->nr_arch_gp_counters = AMD64_NUM_COUNTERS;
+ }
+
+ pmu->nr_arch_gp_counters = min_t(unsigned int, pmu->nr_arch_gp_counters,
+ kvm_pmu_cap.num_counters_gp);
+
+ if (pmu->version > 1) {
+ pmu->global_ctrl_rsvd = ~(BIT_ULL(pmu->nr_arch_gp_counters) - 1);
+ pmu->global_status_rsvd = pmu->global_ctrl_rsvd;
+ }
+
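+ /* AMD general-purpose counters are 48 bits wide. */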
+ pmu->counter_bitmask[KVM_PMC_GP] = BIT_ULL(48) - 1;
+ pmu->reserved_bits = 0xfffffff000280000ull;
+ pmu->raw_event_mask = AMD64_RAW_EVENT_MASK;
+ /* Not applicable to AMD; clear them to prevent any fallout. */
+ pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
+ pmu->nr_arch_fixed_counters = 0;
+}
+
+static void amd_pmu_init(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ int i;
+
+ BUILD_BUG_ON(KVM_MAX_NR_AMD_GP_COUNTERS > AMD64_NUM_COUNTERS_CORE);
+
+ for (i = 0; i < KVM_MAX_NR_AMD_GP_COUNTERS; i++) {
+ pmu->gp_counters[i].type = KVM_PMC_GP;
+ pmu->gp_counters[i].vcpu = vcpu;
+ pmu->gp_counters[i].idx = i;
+ pmu->gp_counters[i].current_config = 0;
+ }
+}
+
+struct kvm_pmu_ops amd_pmu_ops __initdata = {
+ .rdpmc_ecx_to_pmc = amd_rdpmc_ecx_to_pmc,
+ .msr_idx_to_pmc = amd_msr_idx_to_pmc,
+ .check_rdpmc_early = amd_check_rdpmc_early,
+ .is_valid_msr = amd_is_valid_msr,
+ .get_msr = amd_pmu_get_msr,
+ .set_msr = amd_pmu_set_msr,
+ .refresh = amd_pmu_refresh,
+ .init = amd_pmu_init,
+ .EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
+ .MAX_NR_GP_COUNTERS = KVM_MAX_NR_AMD_GP_COUNTERS,
+ .MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
+};
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
new file mode 100644
index 000000000000..f59c65abe3cf
--- /dev/null
+++ b/arch/x86/kvm/svm/sev.c
@@ -0,0 +1,5169 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Kernel-based Virtual Machine driver for Linux
+ *
+ * AMD SVM-SEV support
+ *
+ * Copyright 2010 Red Hat, Inc. and/or its affiliates.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kvm_types.h>
+#include <linux/kvm_host.h>
+#include <linux/kernel.h>
+#include <linux/highmem.h>
+#include <linux/psp.h>
+#include <linux/psp-sev.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/misc_cgroup.h>
+#include <linux/processor.h>
+#include <linux/trace_events.h>
+#include <uapi/linux/sev-guest.h>
+
+#include <asm/pkru.h>
+#include <asm/trapnr.h>
+#include <asm/fpu/xcr.h>
+#include <asm/fpu/xstate.h>
+#include <asm/debugreg.h>
+#include <asm/msr.h>
+#include <asm/sev.h>
+
+#include "mmu.h"
+#include "x86.h"
+#include "svm.h"
+#include "svm_ops.h"
+#include "cpuid.h"
+#include "trace.h"
+
+#define GHCB_VERSION_MAX 2ULL
+#define GHCB_VERSION_MIN 1ULL
+
+#define GHCB_HV_FT_SUPPORTED (GHCB_HV_FT_SNP | GHCB_HV_FT_SNP_AP_CREATION)
+
+/* enable/disable SEV support */
+static bool sev_enabled = true;
+module_param_named(sev, sev_enabled, bool, 0444);
+
+/* enable/disable SEV-ES support */
+static bool sev_es_enabled = true;
+module_param_named(sev_es, sev_es_enabled, bool, 0444);
+
+/* enable/disable SEV-SNP support */
+static bool sev_snp_enabled = true;
+module_param_named(sev_snp, sev_snp_enabled, bool, 0444);
+
+/* enable/disable SEV-ES DebugSwap support */
+static bool sev_es_debug_swap_enabled = true;
+module_param_named(debug_swap, sev_es_debug_swap_enabled, bool, 0444);
+static u64 sev_supported_vmsa_features;
+
+static unsigned int nr_ciphertext_hiding_asids;
+module_param_named(ciphertext_hiding_asids, nr_ciphertext_hiding_asids, uint, 0444);
+
+#define AP_RESET_HOLD_NONE 0
+#define AP_RESET_HOLD_NAE_EVENT 1
+#define AP_RESET_HOLD_MSR_PROTO 2
+
+/*
+ * SEV-SNP policy bits that can be supported by KVM. These are either policy
+ * bits for which KVM implements support, or bits that require no KVM-side
+ * support in order to be enforced.
+ */
+#define KVM_SNP_POLICY_MASK_VALID (SNP_POLICY_MASK_API_MINOR | \
+ SNP_POLICY_MASK_API_MAJOR | \
+ SNP_POLICY_MASK_SMT | \
+ SNP_POLICY_MASK_RSVD_MBO | \
+ SNP_POLICY_MASK_DEBUG | \
+ SNP_POLICY_MASK_SINGLE_SOCKET | \
+ SNP_POLICY_MASK_CXL_ALLOW | \
+ SNP_POLICY_MASK_MEM_AES_256_XTS | \
+ SNP_POLICY_MASK_RAPL_DIS | \
+ SNP_POLICY_MASK_CIPHERTEXT_HIDING_DRAM | \
+ SNP_POLICY_MASK_PAGE_SWAP_DISABLE)
+
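+/* Advertised to userspace via the KVM_X86_SNP_POLICY_BITS device attribute. */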
+static u64 snp_supported_policy_bits __ro_after_init;
+
+#define INITIAL_VMSA_GPA 0xFFFFFFFFF000
+
+static u8 sev_enc_bit;
+static DECLARE_RWSEM(sev_deactivate_lock);
+static DEFINE_MUTEX(sev_bitmap_lock);
+unsigned int max_sev_asid;
+static unsigned int min_sev_asid;
+static unsigned int max_sev_es_asid;
+static unsigned int min_sev_es_asid;
+static unsigned int max_snp_asid;
+static unsigned int min_snp_asid;
+static unsigned long sev_me_mask;
+static unsigned int nr_asids;
+static unsigned long *sev_asid_bitmap;
+static unsigned long *sev_reclaim_asid_bitmap;
+
+static int snp_decommission_context(struct kvm *kvm);
+
+struct enc_region {
+ struct list_head list;
+ unsigned long npages;
+ struct page **pages;
+ unsigned long uaddr;
+ unsigned long size;
+};
+
+/* Called with the sev_bitmap_lock held, or on shutdown */
+static int sev_flush_asids(unsigned int min_asid, unsigned int max_asid)
+{
+ int ret, error = 0;
+ unsigned int asid;
+
+ /* Check if there are any ASIDs to reclaim before performing a flush */
+ asid = find_next_bit(sev_reclaim_asid_bitmap, nr_asids, min_asid);
+ if (asid > max_asid)
+ return -EBUSY;
+
+ /*
+ * DEACTIVATE will clear the WBINVD indicator causing DF_FLUSH to fail,
+ * so it must be guarded.
+ */
+ down_write(&sev_deactivate_lock);
+
+ /* SNP firmware requires use of WBINVD for ASID recycling. */
+ wbinvd_on_all_cpus();
+
+ if (sev_snp_enabled)
+ ret = sev_do_cmd(SEV_CMD_SNP_DF_FLUSH, NULL, &error);
+ else
+ ret = sev_guest_df_flush(&error);
+
+ up_write(&sev_deactivate_lock);
+
+ if (ret)
+ pr_err("SEV%s: DF_FLUSH failed, ret=%d, error=%#x\n",
+ sev_snp_enabled ? "-SNP" : "", ret, error);
+
+ return ret;
+}
+
+static inline bool is_mirroring_enc_context(struct kvm *kvm)
+{
+ return !!to_kvm_sev_info(kvm)->enc_context_owner;
+}
+
+static bool sev_vcpu_has_debug_swap(struct vcpu_svm *svm)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm_sev_info *sev = to_kvm_sev_info(vcpu->kvm);
+
+ return sev->vmsa_features & SVM_SEV_FEAT_DEBUG_SWAP;
+}
+
+static bool snp_is_secure_tsc_enabled(struct kvm *kvm)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+
+ return (sev->vmsa_features & SVM_SEV_FEAT_SECURE_TSC) &&
+ !WARN_ON_ONCE(!sev_snp_guest(kvm));
+}
+
+/* Must be called with the sev_bitmap_lock held */
+static bool __sev_recycle_asids(unsigned int min_asid, unsigned int max_asid)
+{
+ if (sev_flush_asids(min_asid, max_asid))
+ return false;
+
+ /* The flush process will flush all reclaimable SEV and SEV-ES ASIDs */
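+	/* Reclaimed ASIDs remain set in sev_asid_bitmap, so the XOR clears them. */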
+ bitmap_xor(sev_asid_bitmap, sev_asid_bitmap, sev_reclaim_asid_bitmap,
+ nr_asids);
+ bitmap_zero(sev_reclaim_asid_bitmap, nr_asids);
+
+ return true;
+}
+
+static int sev_misc_cg_try_charge(struct kvm_sev_info *sev)
+{
+ enum misc_res_type type = sev->es_active ? MISC_CG_RES_SEV_ES : MISC_CG_RES_SEV;
+ return misc_cg_try_charge(type, sev->misc_cg, 1);
+}
+
+static void sev_misc_cg_uncharge(struct kvm_sev_info *sev)
+{
+ enum misc_res_type type = sev->es_active ? MISC_CG_RES_SEV_ES : MISC_CG_RES_SEV;
+ misc_cg_uncharge(type, sev->misc_cg, 1);
+}
+
+static int sev_asid_new(struct kvm_sev_info *sev, unsigned long vm_type)
+{
+ /*
+	 * SEV-enabled guests must use ASIDs from min_sev_asid to max_sev_asid.
+	 * SEV-ES-enabled guests can use ASIDs from 1 to min_sev_asid - 1.
+ */
+ unsigned int min_asid, max_asid, asid;
+ bool retry = true;
+ int ret;
+
+ if (vm_type == KVM_X86_SNP_VM) {
+ min_asid = min_snp_asid;
+ max_asid = max_snp_asid;
+ } else if (sev->es_active) {
+ min_asid = min_sev_es_asid;
+ max_asid = max_sev_es_asid;
+ } else {
+ min_asid = min_sev_asid;
+ max_asid = max_sev_asid;
+ }
+
+ /*
+ * The min ASID can end up larger than the max if basic SEV support is
+ * effectively disabled by disallowing use of ASIDs for SEV guests.
+ * Similarly for SEV-ES guests the min ASID can end up larger than the
+ * max when ciphertext hiding is enabled, effectively disabling SEV-ES
+ * support.
+ */
+ if (min_asid > max_asid)
+ return -ENOTTY;
+
+ WARN_ON(sev->misc_cg);
+ sev->misc_cg = get_current_misc_cg();
+ ret = sev_misc_cg_try_charge(sev);
+ if (ret) {
+ put_misc_cg(sev->misc_cg);
+ sev->misc_cg = NULL;
+ return ret;
+ }
+
+ mutex_lock(&sev_bitmap_lock);
+
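+	/* If the range is exhausted, try once to recycle reclaimed ASIDs. */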
+again:
+ asid = find_next_zero_bit(sev_asid_bitmap, max_asid + 1, min_asid);
+ if (asid > max_asid) {
+ if (retry && __sev_recycle_asids(min_asid, max_asid)) {
+ retry = false;
+ goto again;
+ }
+ mutex_unlock(&sev_bitmap_lock);
+ ret = -EBUSY;
+ goto e_uncharge;
+ }
+
+ __set_bit(asid, sev_asid_bitmap);
+
+ mutex_unlock(&sev_bitmap_lock);
+
+ sev->asid = asid;
+ return 0;
+e_uncharge:
+ sev_misc_cg_uncharge(sev);
+ put_misc_cg(sev->misc_cg);
+ sev->misc_cg = NULL;
+ return ret;
+}
+
+static unsigned int sev_get_asid(struct kvm *kvm)
+{
+ return to_kvm_sev_info(kvm)->asid;
+}
+
+static void sev_asid_free(struct kvm_sev_info *sev)
+{
+ struct svm_cpu_data *sd;
+ int cpu;
+
+ mutex_lock(&sev_bitmap_lock);
+
+ __set_bit(sev->asid, sev_reclaim_asid_bitmap);
+
+ for_each_possible_cpu(cpu) {
+ sd = per_cpu_ptr(&svm_data, cpu);
+ sd->sev_vmcbs[sev->asid] = NULL;
+ }
+
+ mutex_unlock(&sev_bitmap_lock);
+
+ sev_misc_cg_uncharge(sev);
+ put_misc_cg(sev->misc_cg);
+ sev->misc_cg = NULL;
+}
+
+static void sev_decommission(unsigned int handle)
+{
+ struct sev_data_decommission decommission;
+
+ if (!handle)
+ return;
+
+ decommission.handle = handle;
+ sev_guest_decommission(&decommission, NULL);
+}
+
+/*
+ * Transition a page to hypervisor-owned/shared state in the RMP table. This
+ * should not fail under normal conditions, but if it does, leak the page,
+ * since it will no longer be usable by the host due to RMP protections.
+ */
+static int kvm_rmp_make_shared(struct kvm *kvm, u64 pfn, enum pg_level level)
+{
+ if (KVM_BUG_ON(rmp_make_shared(pfn, level), kvm)) {
+ snp_leak_pages(pfn, page_level_size(level) >> PAGE_SHIFT);
+ return -EIO;
+ }
+
+ return 0;
+}
+
+/*
+ * Certain page-states, such as Pre-Guest and Firmware pages (as documented
+ * in Chapter 5 of the SEV-SNP Firmware ABI under "Page States") cannot be
+ * directly transitioned back to normal/hypervisor-owned state via RMPUPDATE
+ * unless they are reclaimed first.
+ *
+ * Until they are reclaimed and subsequently transitioned via RMPUPDATE, they
+ * might not be usable by the host due to being set as immutable or still
+ * being associated with a guest ASID.
+ *
+ * Bug the VM and leak the page if reclaim fails, or if the RMP entry can't be
+ * converted back to shared, as the page is no longer usable due to RMP
+ * protections, and it's infeasible for the guest to continue on.
+ */
+static int snp_page_reclaim(struct kvm *kvm, u64 pfn)
+{
+ struct sev_data_snp_page_reclaim data = {0};
+ int fw_err, rc;
+
+ data.paddr = __sme_set(pfn << PAGE_SHIFT);
+ rc = sev_do_cmd(SEV_CMD_SNP_PAGE_RECLAIM, &data, &fw_err);
+ if (KVM_BUG(rc, kvm, "Failed to reclaim PFN %llx, rc %d fw_err %d", pfn, rc, fw_err)) {
+ snp_leak_pages(pfn, 1);
+ return -EIO;
+ }
+
+ if (kvm_rmp_make_shared(kvm, pfn, PG_LEVEL_4K))
+ return -EIO;
+
+ return rc;
+}
+
+static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
+{
+ struct sev_data_deactivate deactivate;
+
+ if (!handle)
+ return;
+
+ deactivate.handle = handle;
+
+ /* Guard DEACTIVATE against WBINVD/DF_FLUSH used in ASID recycling */
+ down_read(&sev_deactivate_lock);
+ sev_guest_deactivate(&deactivate, NULL);
+ up_read(&sev_deactivate_lock);
+
+ sev_decommission(handle);
+}
+
+/*
+ * This sets up bounce buffers/firmware pages to handle SNP Guest Request
+ * messages (e.g. attestation requests). See "SNP Guest Request" in the GHCB
+ * 2.0 specification for more details.
+ *
+ * Technically, when an SNP Guest Request is issued, the guest will provide its
+ * own request/response pages, which could in theory be passed along directly
+ * to firmware rather than using bounce pages. However, these pages would need
+ * special care:
+ *
+ * - Both pages are from shared guest memory, so they need to be protected
+ * from migration/etc. occurring while firmware reads/writes to them. At a
+ * minimum, this requires elevating the ref counts and potentially needing
+ * an explicit pinning of the memory. This places additional restrictions
+ * on what type of memory backends userspace can use for shared guest
+ * memory since there is some reliance on using refcounted pages.
+ *
+ * - The response page needs to be switched to Firmware-owned[1] state
+ * before the firmware can write to it, which can lead to potential
+ * host RMP #PFs if the guest is misbehaved and hands the host a
+ * guest page that KVM might write to for other reasons (e.g. virtio
+ * buffers/etc.).
+ *
+ * Both of these issues can be avoided completely by using separately-allocated
+ * bounce pages for both the request/response pages and passing those to
+ * firmware instead. So that's what is being set up here.
+ *
+ * Guest requests rely on message sequence numbers to ensure requests are
+ * issued to firmware in the order the guest issues them, so concurrent guest
+ * requests generally shouldn't happen. But a misbehaved guest could issue
+ * concurrent guest requests in theory, so a mutex is used to serialize
+ * access to the bounce buffers.
+ *
+ * [1] See the "Page States" section of the SEV-SNP Firmware ABI for more
+ * details on Firmware-owned pages, along with "RMP and VMPL Access Checks"
+ * in the APM for details on the related RMP restrictions.
+ */
+static int snp_guest_req_init(struct kvm *kvm)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct page *req_page;
+
+ req_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (!req_page)
+ return -ENOMEM;
+
+ sev->guest_resp_buf = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (!sev->guest_resp_buf) {
+ __free_page(req_page);
+ return -EIO;
+ }
+
+ sev->guest_req_buf = page_address(req_page);
+ mutex_init(&sev->guest_req_mutex);
+
+ return 0;
+}
+
+static void snp_guest_req_cleanup(struct kvm *kvm)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+
+ if (sev->guest_resp_buf)
+ snp_free_firmware_page(sev->guest_resp_buf);
+
+ if (sev->guest_req_buf)
+ __free_page(virt_to_page(sev->guest_req_buf));
+
+ sev->guest_req_buf = NULL;
+ sev->guest_resp_buf = NULL;
+}
+
+static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp,
+ struct kvm_sev_init *data,
+ unsigned long vm_type)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct sev_platform_init_args init_args = {0};
+ bool es_active = vm_type != KVM_X86_SEV_VM;
+ bool snp_active = vm_type == KVM_X86_SNP_VM;
+ u64 valid_vmsa_features = es_active ? sev_supported_vmsa_features : 0;
+ int ret;
+
+ if (kvm->created_vcpus)
+ return -EINVAL;
+
+ if (data->flags)
+ return -EINVAL;
+
+ if (!snp_active)
+ valid_vmsa_features &= ~SVM_SEV_FEAT_SECURE_TSC;
+
+ if (data->vmsa_features & ~valid_vmsa_features)
+ return -EINVAL;
+
+ if (data->ghcb_version > GHCB_VERSION_MAX || (!es_active && data->ghcb_version))
+ return -EINVAL;
+
+ /*
+ * KVM supports the full range of mandatory features defined by version
+ * 2 of the GHCB protocol, so default to that for SEV-ES guests created
+ * via KVM_SEV_INIT2 (KVM_SEV_INIT forces version 1).
+ */
+ if (es_active && !data->ghcb_version)
+ data->ghcb_version = 2;
+
+ if (snp_active && data->ghcb_version < 2)
+ return -EINVAL;
+
+ if (unlikely(sev->active))
+ return -EINVAL;
+
+ sev->active = true;
+ sev->es_active = es_active;
+ sev->vmsa_features = data->vmsa_features;
+ sev->ghcb_version = data->ghcb_version;
+
+ if (snp_active)
+ sev->vmsa_features |= SVM_SEV_FEAT_SNP_ACTIVE;
+
+ ret = sev_asid_new(sev, vm_type);
+ if (ret)
+ goto e_no_asid;
+
+ init_args.probe = false;
+ ret = sev_platform_init(&init_args);
+ if (ret)
+ goto e_free_asid;
+
+ if (!zalloc_cpumask_var(&sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
+ ret = -ENOMEM;
+ goto e_free_asid;
+ }
+
+ /* This needs to happen after SEV/SNP firmware initialization. */
+ if (snp_active) {
+ ret = snp_guest_req_init(kvm);
+ if (ret)
+ goto e_free;
+ }
+
+ INIT_LIST_HEAD(&sev->regions_list);
+ INIT_LIST_HEAD(&sev->mirror_vms);
+ sev->need_init = false;
+
+ kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_SEV);
+
+ return 0;
+
+e_free:
+ free_cpumask_var(sev->have_run_cpus);
+e_free_asid:
+ argp->error = init_args.error;
+ sev_asid_free(sev);
+ sev->asid = 0;
+e_no_asid:
+ sev->vmsa_features = 0;
+ sev->es_active = false;
+ sev->active = false;
+ return ret;
+}
+
+static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_init data = {
+ .vmsa_features = 0,
+ .ghcb_version = 0,
+ };
+ unsigned long vm_type;
+
+ if (kvm->arch.vm_type != KVM_X86_DEFAULT_VM)
+ return -EINVAL;
+
+ vm_type = (argp->id == KVM_SEV_INIT ? KVM_X86_SEV_VM : KVM_X86_SEV_ES_VM);
+
+ /*
+ * KVM_SEV_ES_INIT has been deprecated by KVM_SEV_INIT2, so it will
+ * continue to only ever support the minimal GHCB protocol version.
+ */
+ if (vm_type == KVM_X86_SEV_ES_VM)
+ data.ghcb_version = GHCB_VERSION_MIN;
+
+ return __sev_guest_init(kvm, argp, &data, vm_type);
+}
+
+static int sev_guest_init2(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_init data;
+
+ if (!to_kvm_sev_info(kvm)->need_init)
+ return -EINVAL;
+
+ if (kvm->arch.vm_type != KVM_X86_SEV_VM &&
+ kvm->arch.vm_type != KVM_X86_SEV_ES_VM &&
+ kvm->arch.vm_type != KVM_X86_SNP_VM)
+ return -EINVAL;
+
+ if (copy_from_user(&data, u64_to_user_ptr(argp->data), sizeof(data)))
+ return -EFAULT;
+
+ return __sev_guest_init(kvm, argp, &data, kvm->arch.vm_type);
+}
+
+static int sev_bind_asid(struct kvm *kvm, unsigned int handle, int *error)
+{
+ unsigned int asid = sev_get_asid(kvm);
+ struct sev_data_activate activate;
+ int ret;
+
+ /* activate ASID on the given handle */
+ activate.handle = handle;
+ activate.asid = asid;
+ ret = sev_guest_activate(&activate, error);
+
+ return ret;
+}
+
+static int __sev_issue_cmd(int fd, int id, void *data, int *error)
+{
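+	/* CLASS(fd) pins the struct file and auto-releases it on scope exit. */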
+ CLASS(fd, f)(fd);
+
+ if (fd_empty(f))
+ return -EBADF;
+
+ return sev_issue_cmd_external_user(fd_file(f), id, data, error);
+}
+
+static int sev_issue_cmd(struct kvm *kvm, int id, void *data, int *error)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+
+ return __sev_issue_cmd(sev->fd, id, data, error);
+}
+
+static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct sev_data_launch_start start;
+ struct kvm_sev_launch_start params;
+ void *dh_blob, *session_blob;
+ int *error = &argp->error;
+ int ret;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+ return -EFAULT;
+
+ memset(&start, 0, sizeof(start));
+
+ dh_blob = NULL;
+ if (params.dh_uaddr) {
+ dh_blob = psp_copy_user_blob(params.dh_uaddr, params.dh_len);
+ if (IS_ERR(dh_blob))
+ return PTR_ERR(dh_blob);
+
+ start.dh_cert_address = __sme_set(__pa(dh_blob));
+ start.dh_cert_len = params.dh_len;
+ }
+
+ session_blob = NULL;
+ if (params.session_uaddr) {
+ session_blob = psp_copy_user_blob(params.session_uaddr, params.session_len);
+ if (IS_ERR(session_blob)) {
+ ret = PTR_ERR(session_blob);
+ goto e_free_dh;
+ }
+
+ start.session_address = __sme_set(__pa(session_blob));
+ start.session_len = params.session_len;
+ }
+
+ start.handle = params.handle;
+ start.policy = params.policy;
+
+ /* create memory encryption context */
+ ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_LAUNCH_START, &start, error);
+ if (ret)
+ goto e_free_session;
+
+ /* Bind ASID to this guest */
+ ret = sev_bind_asid(kvm, start.handle, error);
+ if (ret) {
+ sev_decommission(start.handle);
+ goto e_free_session;
+ }
+
+ /* return handle to userspace */
+ params.handle = start.handle;
+ if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params))) {
+ sev_unbind_asid(kvm, start.handle);
+ ret = -EFAULT;
+ goto e_free_session;
+ }
+
+ sev->policy = params.policy;
+ sev->handle = start.handle;
+ sev->fd = argp->sev_fd;
+
+e_free_session:
+ kfree(session_blob);
+e_free_dh:
+ kfree(dh_blob);
+ return ret;
+}
+
+static struct page **sev_pin_memory(struct kvm *kvm, unsigned long uaddr,
+ unsigned long ulen, unsigned long *n,
+ unsigned int flags)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ unsigned long npages, size;
+ int npinned;
+ unsigned long locked, lock_limit;
+ struct page **pages;
+ unsigned long first, last;
+ int ret;
+
+ lockdep_assert_held(&kvm->lock);
+
+ if (ulen == 0 || uaddr + ulen < uaddr)
+ return ERR_PTR(-EINVAL);
+
+ /* Calculate number of pages. */
+ first = (uaddr & PAGE_MASK) >> PAGE_SHIFT;
+ last = ((uaddr + ulen - 1) & PAGE_MASK) >> PAGE_SHIFT;
+ npages = (last - first + 1);
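+	/* e.g. a 2-byte buffer starting at the last byte of a page spans 2 pages. */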
+
+ locked = sev->pages_locked + npages;
+ lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+ pr_err("SEV: %lu locked pages exceed the lock limit of %lu.\n", locked, lock_limit);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ if (WARN_ON_ONCE(npages > INT_MAX))
+ return ERR_PTR(-EINVAL);
+
+ /* Avoid using vmalloc for smaller buffers. */
+ size = npages * sizeof(struct page *);
+ if (size > PAGE_SIZE)
+ pages = __vmalloc(size, GFP_KERNEL_ACCOUNT);
+ else
+ pages = kmalloc(size, GFP_KERNEL_ACCOUNT);
+
+ if (!pages)
+ return ERR_PTR(-ENOMEM);
+
+ /* Pin the user virtual address. */
+ npinned = pin_user_pages_fast(uaddr, npages, flags, pages);
+ if (npinned != npages) {
+ pr_err("SEV: Failure locking %lu pages.\n", npages);
+ ret = -ENOMEM;
+ goto err;
+ }
+
+ *n = npages;
+ sev->pages_locked = locked;
+
+ return pages;
+
+err:
+ if (npinned > 0)
+ unpin_user_pages(pages, npinned);
+
+ kvfree(pages);
+ return ERR_PTR(ret);
+}
+
+static void sev_unpin_memory(struct kvm *kvm, struct page **pages,
+ unsigned long npages)
+{
+ unpin_user_pages(pages, npages);
+ kvfree(pages);
+ to_kvm_sev_info(kvm)->pages_locked -= npages;
+}
+
+static void sev_clflush_pages(struct page *pages[], unsigned long npages)
+{
+ uint8_t *page_virtual;
+ unsigned long i;
+
+ if (this_cpu_has(X86_FEATURE_SME_COHERENT) || npages == 0 ||
+ pages == NULL)
+ return;
+
+ for (i = 0; i < npages; i++) {
+ page_virtual = kmap_local_page(pages[i]);
+ clflush_cache_range(page_virtual, PAGE_SIZE);
+ kunmap_local(page_virtual);
+ cond_resched();
+ }
+}
+
+static void sev_writeback_caches(struct kvm *kvm)
+{
+ /*
+ * Ensure that all dirty guest tagged cache entries are written back
+ * before releasing the pages back to the system for use. CLFLUSH will
+ * not do this without SME_COHERENT, and flushing many cache lines
+ * individually is slower than blasting WBINVD for large VMs, so issue
+ * WBNOINVD (or WBINVD if the "no invalidate" variant is unsupported)
+ * on CPUs that have done VMRUN, i.e. may have dirtied data using the
+ * VM's ASID.
+ *
+ * For simplicity, never remove CPUs from the bitmap. Ideally, KVM
+ * would clear the mask when flushing caches, but doing so requires
+ * serializing multiple calls and having responding CPUs (to the IPI)
+ * mark themselves as still running if they are running (or about to
+ * run) a vCPU for the VM.
+ *
+ * Note, the caller is responsible for ensuring correctness if the mask
+ * can be modified, e.g. if a CPU could be doing VMRUN.
+ */
+ wbnoinvd_on_cpus_mask(to_kvm_sev_info(kvm)->have_run_cpus);
+}
+
+static unsigned long get_num_contig_pages(unsigned long idx,
+ struct page **inpages, unsigned long npages)
+{
+ unsigned long paddr, next_paddr;
+ unsigned long i = idx + 1, pages = 1;
+
+ /* find the number of contiguous pages starting from idx */
+ paddr = __sme_page_pa(inpages[idx]);
+ while (i < npages) {
+ next_paddr = __sme_page_pa(inpages[i++]);
+ if ((paddr + PAGE_SIZE) == next_paddr) {
+ pages++;
+ paddr = next_paddr;
+ continue;
+ }
+ break;
+ }
+
+ return pages;
+}
+
+static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ unsigned long vaddr, vaddr_end, next_vaddr, npages, pages, size, i;
+ struct kvm_sev_launch_update_data params;
+ struct sev_data_launch_update_data data;
+ struct page **inpages;
+ int ret;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+ return -EFAULT;
+
+ vaddr = params.uaddr;
+ size = params.len;
+ vaddr_end = vaddr + size;
+
+ /* Lock the user memory. */
+ inpages = sev_pin_memory(kvm, vaddr, size, &npages, FOLL_WRITE);
+ if (IS_ERR(inpages))
+ return PTR_ERR(inpages);
+
+ /*
+ * Flush (on non-coherent CPUs) before LAUNCH_UPDATE encrypts pages in
+ * place; the cache may contain the data that was written unencrypted.
+ */
+ sev_clflush_pages(inpages, npages);
+
+ data.reserved = 0;
+ data.handle = to_kvm_sev_info(kvm)->handle;
+
+ for (i = 0; vaddr < vaddr_end; vaddr = next_vaddr, i += pages) {
+ int offset, len;
+
+ /*
+ * If the user buffer is not page-aligned, calculate the offset
+ * within the page.
+ */
+ offset = vaddr & (PAGE_SIZE - 1);
+
+ /* Calculate the number of pages that can be encrypted in one go. */
+ pages = get_num_contig_pages(i, inpages, npages);
+
+ len = min_t(size_t, ((pages * PAGE_SIZE) - offset), size);
+
+ data.len = len;
+ data.address = __sme_page_pa(inpages[i]) + offset;
+ ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_DATA, &data, &argp->error);
+ if (ret)
+ goto e_unpin;
+
+ size -= len;
+ next_vaddr = vaddr + len;
+ }
+
+e_unpin:
+ /* content of memory is updated, mark pages dirty */
+ for (i = 0; i < npages; i++) {
+ set_page_dirty_lock(inpages[i]);
+ mark_page_accessed(inpages[i]);
+ }
+ /* unlock the user pages */
+ sev_unpin_memory(kvm, inpages, npages);
+ return ret;
+}
+
+static int sev_es_sync_vmsa(struct vcpu_svm *svm)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm_sev_info *sev = to_kvm_sev_info(vcpu->kvm);
+ struct sev_es_save_area *save = svm->sev_es.vmsa;
+ struct xregs_state *xsave;
+ const u8 *s;
+ u8 *d;
+ int i;
+
+ /* Check some debug related fields before encrypting the VMSA */
+ if (svm->vcpu.guest_debug || (svm->vmcb->save.dr7 & ~DR7_FIXED_1))
+ return -EINVAL;
+
+ /*
+ * SEV-ES will use a VMSA that is pointed to by the VMCB, not
+ * the traditional VMSA that is part of the VMCB. Copy the
+ * traditional VMSA as it has been built so far (in prep
+ * for LAUNCH_UPDATE_VMSA) to be the initial SEV-ES state.
+ */
+ memcpy(save, &svm->vmcb->save, sizeof(svm->vmcb->save));
+
+	/* Sync registers */
+ save->rax = svm->vcpu.arch.regs[VCPU_REGS_RAX];
+ save->rbx = svm->vcpu.arch.regs[VCPU_REGS_RBX];
+ save->rcx = svm->vcpu.arch.regs[VCPU_REGS_RCX];
+ save->rdx = svm->vcpu.arch.regs[VCPU_REGS_RDX];
+ save->rsp = svm->vcpu.arch.regs[VCPU_REGS_RSP];
+ save->rbp = svm->vcpu.arch.regs[VCPU_REGS_RBP];
+ save->rsi = svm->vcpu.arch.regs[VCPU_REGS_RSI];
+ save->rdi = svm->vcpu.arch.regs[VCPU_REGS_RDI];
+#ifdef CONFIG_X86_64
+ save->r8 = svm->vcpu.arch.regs[VCPU_REGS_R8];
+ save->r9 = svm->vcpu.arch.regs[VCPU_REGS_R9];
+ save->r10 = svm->vcpu.arch.regs[VCPU_REGS_R10];
+ save->r11 = svm->vcpu.arch.regs[VCPU_REGS_R11];
+ save->r12 = svm->vcpu.arch.regs[VCPU_REGS_R12];
+ save->r13 = svm->vcpu.arch.regs[VCPU_REGS_R13];
+ save->r14 = svm->vcpu.arch.regs[VCPU_REGS_R14];
+ save->r15 = svm->vcpu.arch.regs[VCPU_REGS_R15];
+#endif
+ save->rip = svm->vcpu.arch.regs[VCPU_REGS_RIP];
+
+ /* Sync some non-GPR registers before encrypting */
+ save->xcr0 = svm->vcpu.arch.xcr0;
+ save->pkru = svm->vcpu.arch.pkru;
+ save->xss = svm->vcpu.arch.ia32_xss;
+ save->dr6 = svm->vcpu.arch.dr6;
+
+ save->sev_features = sev->vmsa_features;
+
+ /*
+ * Skip FPU and AVX setup with KVM_SEV_ES_INIT to avoid
+ * breaking older measurements.
+ */
+ if (vcpu->kvm->arch.vm_type != KVM_X86_DEFAULT_VM) {
+ xsave = &vcpu->arch.guest_fpu.fpstate->regs.xsave;
+ save->x87_dp = xsave->i387.rdp;
+ save->mxcsr = xsave->i387.mxcsr;
+ save->x87_ftw = xsave->i387.twd;
+ save->x87_fsw = xsave->i387.swd;
+ save->x87_fcw = xsave->i387.cwd;
+ save->x87_fop = xsave->i387.fop;
+ save->x87_ds = 0;
+ save->x87_cs = 0;
+ save->x87_rip = xsave->i387.rip;
+
+ for (i = 0; i < 8; i++) {
+ /*
+ * The format of the x87 save area is undocumented and
+ * definitely not what you would expect. It consists of
+ * an 8*8 bytes area with bytes 0-7, and an 8*2 bytes
+ * area with bytes 8-9 of each register.
+ */
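+			/* Bytes 0-7 of ST(i) -> fpreg_x87[i*8]; bytes 8-9 -> fpreg_x87[64 + i*2]. */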
+ d = save->fpreg_x87 + i * 8;
+ s = ((u8 *)xsave->i387.st_space) + i * 16;
+ memcpy(d, s, 8);
+ save->fpreg_x87[64 + i * 2] = s[8];
+ save->fpreg_x87[64 + i * 2 + 1] = s[9];
+ }
+ memcpy(save->fpreg_xmm, xsave->i387.xmm_space, 256);
+
+ s = get_xsave_addr(xsave, XFEATURE_YMM);
+ if (s)
+ memcpy(save->fpreg_ymm, s, 256);
+ else
+ memset(save->fpreg_ymm, 0, 256);
+ }
+
+ pr_debug("Virtual Machine Save Area (VMSA):\n");
+ print_hex_dump_debug("", DUMP_PREFIX_NONE, 16, 1, save, sizeof(*save), false);
+
+ return 0;
+}
+
+static int __sev_launch_update_vmsa(struct kvm *kvm, struct kvm_vcpu *vcpu,
+ int *error)
+{
+ struct sev_data_launch_update_vmsa vmsa;
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int ret;
+
+ if (vcpu->guest_debug) {
+ pr_warn_once("KVM_SET_GUEST_DEBUG for SEV-ES guest is not supported");
+ return -EINVAL;
+ }
+
+ /* Perform some pre-encryption checks against the VMSA */
+ ret = sev_es_sync_vmsa(svm);
+ if (ret)
+ return ret;
+
+ /*
+ * The LAUNCH_UPDATE_VMSA command will perform in-place encryption of
+	 * the VMSA memory content (i.e. it will write the same memory region
+ * with the guest's key), so invalidate it first.
+ */
+ clflush_cache_range(svm->sev_es.vmsa, PAGE_SIZE);
+
+ vmsa.reserved = 0;
+ vmsa.handle = to_kvm_sev_info(kvm)->handle;
+ vmsa.address = __sme_pa(svm->sev_es.vmsa);
+ vmsa.len = PAGE_SIZE;
+ ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_VMSA, &vmsa, error);
+ if (ret)
+ return ret;
+
+ /*
+ * SEV-ES guests maintain an encrypted version of their FPU
+ * state which is restored and saved on VMRUN and VMEXIT.
+ * Mark vcpu->arch.guest_fpu->fpstate as scratch so it won't
+ * do xsave/xrstor on it.
+ */
+ fpstate_set_confidential(&vcpu->arch.guest_fpu);
+ vcpu->arch.guest_state_protected = true;
+
+ /*
+	 * SEV-ES guests mandate LBR Virtualization to be _always_ ON. Enable it
+	 * only after setting guest_state_protected, because KVM_SET_MSRS allows
+	 * dynamic toggling of LBRV (for performance reasons) on write access to
+ * MSR_IA32_DEBUGCTLMSR when guest_state_protected is not set.
+ */
+ svm_enable_lbrv(vcpu);
+ return 0;
+}
+
+static int sev_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+ int ret;
+
+ if (!sev_es_guest(kvm))
+ return -ENOTTY;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ ret = mutex_lock_killable(&vcpu->mutex);
+ if (ret)
+ return ret;
+
+ ret = __sev_launch_update_vmsa(kvm, vcpu, &argp->error);
+
+ mutex_unlock(&vcpu->mutex);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
+}
+
+static int sev_launch_measure(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ void __user *measure = u64_to_user_ptr(argp->data);
+ struct sev_data_launch_measure data;
+ struct kvm_sev_launch_measure params;
+ void __user *p = NULL;
+ void *blob = NULL;
+ int ret;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(&params, measure, sizeof(params)))
+ return -EFAULT;
+
+ memset(&data, 0, sizeof(data));
+
+ /* User wants to query the blob length */
+ if (!params.len)
+ goto cmd;
+
+ p = u64_to_user_ptr(params.uaddr);
+ if (p) {
+ if (params.len > SEV_FW_BLOB_MAX_SIZE)
+ return -EINVAL;
+
+ blob = kzalloc(params.len, GFP_KERNEL_ACCOUNT);
+ if (!blob)
+ return -ENOMEM;
+
+ data.address = __psp_pa(blob);
+ data.len = params.len;
+ }
+
+cmd:
+ data.handle = to_kvm_sev_info(kvm)->handle;
+ ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_MEASURE, &data, &argp->error);
+
+ /*
+	 * If we only queried the blob length, the FW responded with the expected data.
+ */
+ if (!params.len)
+ goto done;
+
+ if (ret)
+ goto e_free_blob;
+
+ if (blob) {
+ if (copy_to_user(p, blob, params.len))
+ ret = -EFAULT;
+ }
+
+done:
+ params.len = data.len;
+ if (copy_to_user(measure, &params, sizeof(params)))
+ ret = -EFAULT;
+e_free_blob:
+ kfree(blob);
+ return ret;
+}
+
+static int sev_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct sev_data_launch_finish data;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ data.handle = to_kvm_sev_info(kvm)->handle;
+ return sev_issue_cmd(kvm, SEV_CMD_LAUNCH_FINISH, &data, &argp->error);
+}
+
+static int sev_guest_status(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_guest_status params;
+ struct sev_data_guest_status data;
+ int ret;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ memset(&data, 0, sizeof(data));
+
+ data.handle = to_kvm_sev_info(kvm)->handle;
+ ret = sev_issue_cmd(kvm, SEV_CMD_GUEST_STATUS, &data, &argp->error);
+ if (ret)
+ return ret;
+
+ params.policy = data.policy;
+ params.state = data.state;
+ params.handle = data.handle;
+
+ if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params)))
+ ret = -EFAULT;
+
+ return ret;
+}
+
+static int __sev_issue_dbg_cmd(struct kvm *kvm, unsigned long src,
+ unsigned long dst, int size,
+ int *error, bool enc)
+{
+ struct sev_data_dbg data;
+
+ data.reserved = 0;
+ data.handle = to_kvm_sev_info(kvm)->handle;
+ data.dst_addr = dst;
+ data.src_addr = src;
+ data.len = size;
+
+ return sev_issue_cmd(kvm,
+ enc ? SEV_CMD_DBG_ENCRYPT : SEV_CMD_DBG_DECRYPT,
+ &data, error);
+}
+
+static int __sev_dbg_decrypt(struct kvm *kvm, unsigned long src_paddr,
+ unsigned long dst_paddr, int sz, int *err)
+{
+ int offset;
+
+ /*
+	 * It's safe to read more than was asked; the caller should ensure that
+	 * the destination has enough space.
+ */
+ offset = src_paddr & 15;
+ src_paddr = round_down(src_paddr, 16);
+ sz = round_up(sz + offset, 16);
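+	/* e.g. src_paddr = 0x1005, sz = 20: offset = 5, read 32 bytes from 0x1000. */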
+
+ return __sev_issue_dbg_cmd(kvm, src_paddr, dst_paddr, sz, err, false);
+}
+
+static int __sev_dbg_decrypt_user(struct kvm *kvm, unsigned long paddr,
+ void __user *dst_uaddr,
+ unsigned long dst_paddr,
+ int size, int *err)
+{
+ struct page *tpage = NULL;
+ int ret, offset;
+
+	/* If the inputs are not 16-byte aligned, use an intermediate buffer. */
+ if (!IS_ALIGNED(dst_paddr, 16) ||
+ !IS_ALIGNED(paddr, 16) ||
+ !IS_ALIGNED(size, 16)) {
+ tpage = (void *)alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+ if (!tpage)
+ return -ENOMEM;
+
+ dst_paddr = __sme_page_pa(tpage);
+ }
+
+ ret = __sev_dbg_decrypt(kvm, paddr, dst_paddr, size, err);
+ if (ret)
+ goto e_free;
+
+ if (tpage) {
+ offset = paddr & 15;
+ if (copy_to_user(dst_uaddr, page_address(tpage) + offset, size))
+ ret = -EFAULT;
+ }
+
+e_free:
+ if (tpage)
+ __free_page(tpage);
+
+ return ret;
+}
+
+static int __sev_dbg_encrypt_user(struct kvm *kvm, unsigned long paddr,
+ void __user *vaddr,
+ unsigned long dst_paddr,
+ void __user *dst_vaddr,
+ int size, int *error)
+{
+ struct page *src_tpage = NULL;
+ struct page *dst_tpage = NULL;
+ int ret, len = size;
+
+ /* If source buffer is not aligned then use an intermediate buffer */
+ if (!IS_ALIGNED((unsigned long)vaddr, 16)) {
+ src_tpage = alloc_page(GFP_KERNEL_ACCOUNT);
+ if (!src_tpage)
+ return -ENOMEM;
+
+ if (copy_from_user(page_address(src_tpage), vaddr, size)) {
+ __free_page(src_tpage);
+ return -EFAULT;
+ }
+
+ paddr = __sme_page_pa(src_tpage);
+ }
+
+ /*
+ * If destination buffer or length is not aligned then do read-modify-write:
+ * - decrypt destination in an intermediate buffer
+ * - copy the source buffer in an intermediate buffer
+ * - use the intermediate buffer as source buffer
+ */
+ if (!IS_ALIGNED((unsigned long)dst_vaddr, 16) || !IS_ALIGNED(size, 16)) {
+ int dst_offset;
+
+ dst_tpage = alloc_page(GFP_KERNEL_ACCOUNT);
+ if (!dst_tpage) {
+ ret = -ENOMEM;
+ goto e_free;
+ }
+
+ ret = __sev_dbg_decrypt(kvm, dst_paddr,
+ __sme_page_pa(dst_tpage), size, error);
+ if (ret)
+ goto e_free;
+
+ /*
+		 * If the source is a kernel buffer, use memcpy(); otherwise use
+		 * copy_from_user().
+ */
+ dst_offset = dst_paddr & 15;
+
+ if (src_tpage)
+ memcpy(page_address(dst_tpage) + dst_offset,
+ page_address(src_tpage), size);
+ else {
+ if (copy_from_user(page_address(dst_tpage) + dst_offset,
+ vaddr, size)) {
+ ret = -EFAULT;
+ goto e_free;
+ }
+ }
+
+ paddr = __sme_page_pa(dst_tpage);
+ dst_paddr = round_down(dst_paddr, 16);
+ len = round_up(size, 16);
+ }
+
+ ret = __sev_issue_dbg_cmd(kvm, paddr, dst_paddr, len, error, true);
+
+e_free:
+ if (src_tpage)
+ __free_page(src_tpage);
+ if (dst_tpage)
+ __free_page(dst_tpage);
+ return ret;
+}
+
+static int sev_dbg_crypt(struct kvm *kvm, struct kvm_sev_cmd *argp, bool dec)
+{
+ unsigned long vaddr, vaddr_end, next_vaddr;
+ unsigned long dst_vaddr;
+ struct page **src_p, **dst_p;
+ struct kvm_sev_dbg debug;
+ unsigned long n;
+ unsigned int size;
+ int ret;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(&debug, u64_to_user_ptr(argp->data), sizeof(debug)))
+ return -EFAULT;
+
+ if (!debug.len || debug.src_uaddr + debug.len < debug.src_uaddr)
+ return -EINVAL;
+ if (!debug.dst_uaddr)
+ return -EINVAL;
+
+ vaddr = debug.src_uaddr;
+ size = debug.len;
+ vaddr_end = vaddr + size;
+ dst_vaddr = debug.dst_uaddr;
+
+ for (; vaddr < vaddr_end; vaddr = next_vaddr) {
+ int len, s_off, d_off;
+
+ /* lock userspace source and destination page */
+ src_p = sev_pin_memory(kvm, vaddr & PAGE_MASK, PAGE_SIZE, &n, 0);
+ if (IS_ERR(src_p))
+ return PTR_ERR(src_p);
+
+ dst_p = sev_pin_memory(kvm, dst_vaddr & PAGE_MASK, PAGE_SIZE, &n, FOLL_WRITE);
+ if (IS_ERR(dst_p)) {
+ sev_unpin_memory(kvm, src_p, n);
+ return PTR_ERR(dst_p);
+ }
+
+ /*
+ * Flush (on non-coherent CPUs) before DBG_{DE,EN}CRYPT read or modify
+ * the pages; flush the destination too so that future accesses do not
+ * see stale data.
+ */
+ sev_clflush_pages(src_p, 1);
+ sev_clflush_pages(dst_p, 1);
+
+ /*
+ * Since user buffer may not be page aligned, calculate the
+ * offset within the page.
+ */
+ s_off = vaddr & ~PAGE_MASK;
+ d_off = dst_vaddr & ~PAGE_MASK;
+ len = min_t(size_t, (PAGE_SIZE - s_off), size);
+
+ if (dec)
+ ret = __sev_dbg_decrypt_user(kvm,
+ __sme_page_pa(src_p[0]) + s_off,
+ (void __user *)dst_vaddr,
+ __sme_page_pa(dst_p[0]) + d_off,
+ len, &argp->error);
+ else
+ ret = __sev_dbg_encrypt_user(kvm,
+ __sme_page_pa(src_p[0]) + s_off,
+ (void __user *)vaddr,
+ __sme_page_pa(dst_p[0]) + d_off,
+ (void __user *)dst_vaddr,
+ len, &argp->error);
+
+ sev_unpin_memory(kvm, src_p, n);
+ sev_unpin_memory(kvm, dst_p, n);
+
+ if (ret)
+ goto err;
+
+ next_vaddr = vaddr + len;
+ dst_vaddr = dst_vaddr + len;
+ size -= len;
+ }
+err:
+ return ret;
+}
+
+static int sev_launch_secret(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct sev_data_launch_secret data;
+ struct kvm_sev_launch_secret params;
+ struct page **pages;
+ void *blob, *hdr;
+ unsigned long n, i;
+ int ret, offset;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+ return -EFAULT;
+
+ pages = sev_pin_memory(kvm, params.guest_uaddr, params.guest_len, &n, FOLL_WRITE);
+ if (IS_ERR(pages))
+ return PTR_ERR(pages);
+
+ /*
+ * Flush (on non-coherent CPUs) before LAUNCH_SECRET encrypts pages in
+ * place; the cache may contain the data that was written unencrypted.
+ */
+ sev_clflush_pages(pages, n);
+
+ /*
+	 * The secret must be copied into a contiguous memory region; verify
+	 * that the userspace memory pages are contiguous before issuing the command.
+ */
+ if (get_num_contig_pages(0, pages, n) != n) {
+ ret = -EINVAL;
+ goto e_unpin_memory;
+ }
+
+ memset(&data, 0, sizeof(data));
+
+ offset = params.guest_uaddr & (PAGE_SIZE - 1);
+ data.guest_address = __sme_page_pa(pages[0]) + offset;
+ data.guest_len = params.guest_len;
+
+ blob = psp_copy_user_blob(params.trans_uaddr, params.trans_len);
+ if (IS_ERR(blob)) {
+ ret = PTR_ERR(blob);
+ goto e_unpin_memory;
+ }
+
+ data.trans_address = __psp_pa(blob);
+ data.trans_len = params.trans_len;
+
+ hdr = psp_copy_user_blob(params.hdr_uaddr, params.hdr_len);
+ if (IS_ERR(hdr)) {
+ ret = PTR_ERR(hdr);
+ goto e_free_blob;
+ }
+ data.hdr_address = __psp_pa(hdr);
+ data.hdr_len = params.hdr_len;
+
+ data.handle = to_kvm_sev_info(kvm)->handle;
+ ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_SECRET, &data, &argp->error);
+
+ kfree(hdr);
+
+e_free_blob:
+ kfree(blob);
+e_unpin_memory:
+ /* content of memory is updated, mark pages dirty */
+ for (i = 0; i < n; i++) {
+ set_page_dirty_lock(pages[i]);
+ mark_page_accessed(pages[i]);
+ }
+ sev_unpin_memory(kvm, pages, n);
+ return ret;
+}
+
+static int sev_get_attestation_report(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ void __user *report = u64_to_user_ptr(argp->data);
+ struct sev_data_attestation_report data;
+ struct kvm_sev_attestation_report params;
+ void __user *p;
+ void *blob = NULL;
+ int ret;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+ return -EFAULT;
+
+ memset(&data, 0, sizeof(data));
+
+ /* User wants to query the blob length */
+ if (!params.len)
+ goto cmd;
+
+ p = u64_to_user_ptr(params.uaddr);
+ if (p) {
+ if (params.len > SEV_FW_BLOB_MAX_SIZE)
+ return -EINVAL;
+
+ blob = kzalloc(params.len, GFP_KERNEL_ACCOUNT);
+ if (!blob)
+ return -ENOMEM;
+
+ data.address = __psp_pa(blob);
+ data.len = params.len;
+ memcpy(data.mnonce, params.mnonce, sizeof(params.mnonce));
+ }
+cmd:
+ data.handle = to_kvm_sev_info(kvm)->handle;
+ ret = sev_issue_cmd(kvm, SEV_CMD_ATTESTATION_REPORT, &data, &argp->error);
+ /*
+	 * If we only queried the blob length, the FW responded with the expected data.
+ */
+ if (!params.len)
+ goto done;
+
+ if (ret)
+ goto e_free_blob;
+
+ if (blob) {
+ if (copy_to_user(p, blob, params.len))
+ ret = -EFAULT;
+ }
+
+done:
+ params.len = data.len;
+ if (copy_to_user(report, &params, sizeof(params)))
+ ret = -EFAULT;
+e_free_blob:
+ kfree(blob);
+ return ret;
+}
+
+/* Userspace wants to query session length. */
+static int
+__sev_send_start_query_session_length(struct kvm *kvm, struct kvm_sev_cmd *argp,
+ struct kvm_sev_send_start *params)
+{
+ struct sev_data_send_start data;
+ int ret;
+
+ memset(&data, 0, sizeof(data));
+ data.handle = to_kvm_sev_info(kvm)->handle;
+ ret = sev_issue_cmd(kvm, SEV_CMD_SEND_START, &data, &argp->error);
+
+ params->session_len = data.session_len;
+ if (copy_to_user(u64_to_user_ptr(argp->data), params,
+ sizeof(struct kvm_sev_send_start)))
+ ret = -EFAULT;
+
+ return ret;
+}
+
+static int sev_send_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct sev_data_send_start data;
+ struct kvm_sev_send_start params;
+ void *amd_certs, *session_data;
+ void *pdh_cert, *plat_certs;
+ int ret;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data),
+ sizeof(struct kvm_sev_send_start)))
+ return -EFAULT;
+
+ /* if session_len is zero, userspace wants to query the session length */
+ if (!params.session_len)
+ return __sev_send_start_query_session_length(kvm, argp,
+ &params);
+
+ /* some sanity checks */
+ if (!params.pdh_cert_uaddr || !params.pdh_cert_len ||
+ !params.session_uaddr || params.session_len > SEV_FW_BLOB_MAX_SIZE)
+ return -EINVAL;
+
+ /* allocate the memory to hold the session data blob */
+ session_data = kzalloc(params.session_len, GFP_KERNEL_ACCOUNT);
+ if (!session_data)
+ return -ENOMEM;
+
+ /* copy the certificate blobs from userspace */
+ pdh_cert = psp_copy_user_blob(params.pdh_cert_uaddr,
+ params.pdh_cert_len);
+ if (IS_ERR(pdh_cert)) {
+ ret = PTR_ERR(pdh_cert);
+ goto e_free_session;
+ }
+
+ plat_certs = psp_copy_user_blob(params.plat_certs_uaddr,
+ params.plat_certs_len);
+ if (IS_ERR(plat_certs)) {
+ ret = PTR_ERR(plat_certs);
+ goto e_free_pdh;
+ }
+
+ amd_certs = psp_copy_user_blob(params.amd_certs_uaddr,
+ params.amd_certs_len);
+ if (IS_ERR(amd_certs)) {
+ ret = PTR_ERR(amd_certs);
+ goto e_free_plat_cert;
+ }
+
+ /* populate the FW SEND_START field with system physical address */
+ memset(&data, 0, sizeof(data));
+ data.pdh_cert_address = __psp_pa(pdh_cert);
+ data.pdh_cert_len = params.pdh_cert_len;
+ data.plat_certs_address = __psp_pa(plat_certs);
+ data.plat_certs_len = params.plat_certs_len;
+ data.amd_certs_address = __psp_pa(amd_certs);
+ data.amd_certs_len = params.amd_certs_len;
+ data.session_address = __psp_pa(session_data);
+ data.session_len = params.session_len;
+ data.handle = to_kvm_sev_info(kvm)->handle;
+
+ ret = sev_issue_cmd(kvm, SEV_CMD_SEND_START, &data, &argp->error);
+
+ if (!ret && copy_to_user(u64_to_user_ptr(params.session_uaddr),
+ session_data, params.session_len)) {
+ ret = -EFAULT;
+ goto e_free_amd_cert;
+ }
+
+ params.policy = data.policy;
+ params.session_len = data.session_len;
+ if (copy_to_user(u64_to_user_ptr(argp->data), &params,
+ sizeof(struct kvm_sev_send_start)))
+ ret = -EFAULT;
+
+e_free_amd_cert:
+ kfree(amd_certs);
+e_free_plat_cert:
+ kfree(plat_certs);
+e_free_pdh:
+ kfree(pdh_cert);
+e_free_session:
+ kfree(session_data);
+ return ret;
+}
+
+/* Userspace wants to query either header or trans length. */
+static int
+__sev_send_update_data_query_lengths(struct kvm *kvm, struct kvm_sev_cmd *argp,
+ struct kvm_sev_send_update_data *params)
+{
+ struct sev_data_send_update_data data;
+ int ret;
+
+ memset(&data, 0, sizeof(data));
+ data.handle = to_kvm_sev_info(kvm)->handle;
+ ret = sev_issue_cmd(kvm, SEV_CMD_SEND_UPDATE_DATA, &data, &argp->error);
+
+ params->hdr_len = data.hdr_len;
+ params->trans_len = data.trans_len;
+
+ if (copy_to_user(u64_to_user_ptr(argp->data), params,
+ sizeof(struct kvm_sev_send_update_data)))
+ ret = -EFAULT;
+
+ return ret;
+}
+
+static int sev_send_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct sev_data_send_update_data data;
+ struct kvm_sev_send_update_data params;
+ void *hdr, *trans_data;
+ struct page **guest_page;
+ unsigned long n;
+ int ret, offset;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data),
+ sizeof(struct kvm_sev_send_update_data)))
+ return -EFAULT;
+
+ /* userspace wants to query either header or trans length */
+ if (!params.trans_len || !params.hdr_len)
+ return __sev_send_update_data_query_lengths(kvm, argp, &params);
+
+ if (!params.trans_uaddr || !params.guest_uaddr ||
+ !params.guest_len || !params.hdr_uaddr)
+ return -EINVAL;
+
+	/* Check if we are crossing a page boundary */
+ offset = params.guest_uaddr & (PAGE_SIZE - 1);
+ if (params.guest_len > PAGE_SIZE || (params.guest_len + offset) > PAGE_SIZE)
+ return -EINVAL;
+
+ /* Pin guest memory */
+ guest_page = sev_pin_memory(kvm, params.guest_uaddr & PAGE_MASK,
+ PAGE_SIZE, &n, 0);
+ if (IS_ERR(guest_page))
+ return PTR_ERR(guest_page);
+
+ /* allocate memory for header and transport buffer */
+ ret = -ENOMEM;
+ hdr = kzalloc(params.hdr_len, GFP_KERNEL);
+ if (!hdr)
+ goto e_unpin;
+
+ trans_data = kzalloc(params.trans_len, GFP_KERNEL);
+ if (!trans_data)
+ goto e_free_hdr;
+
+ memset(&data, 0, sizeof(data));
+ data.hdr_address = __psp_pa(hdr);
+ data.hdr_len = params.hdr_len;
+ data.trans_address = __psp_pa(trans_data);
+ data.trans_len = params.trans_len;
+
+ /* The SEND_UPDATE_DATA command requires C-bit to be always set. */
+ data.guest_address = (page_to_pfn(guest_page[0]) << PAGE_SHIFT) + offset;
+ data.guest_address |= sev_me_mask;
+ data.guest_len = params.guest_len;
+ data.handle = to_kvm_sev_info(kvm)->handle;
+
+ ret = sev_issue_cmd(kvm, SEV_CMD_SEND_UPDATE_DATA, &data, &argp->error);
+
+ if (ret)
+ goto e_free_trans_data;
+
+ /* copy transport buffer to user space */
+ if (copy_to_user(u64_to_user_ptr(params.trans_uaddr),
+ trans_data, params.trans_len)) {
+ ret = -EFAULT;
+ goto e_free_trans_data;
+ }
+
+ /* Copy packet header to userspace. */
+ if (copy_to_user(u64_to_user_ptr(params.hdr_uaddr), hdr,
+ params.hdr_len))
+ ret = -EFAULT;
+
+e_free_trans_data:
+ kfree(trans_data);
+e_free_hdr:
+ kfree(hdr);
+e_unpin:
+ sev_unpin_memory(kvm, guest_page, n);
+
+ return ret;
+}
+
+static int sev_send_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct sev_data_send_finish data;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ data.handle = to_kvm_sev_info(kvm)->handle;
+ return sev_issue_cmd(kvm, SEV_CMD_SEND_FINISH, &data, &argp->error);
+}
+
+static int sev_send_cancel(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct sev_data_send_cancel data;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ data.handle = to_kvm_sev_info(kvm)->handle;
+ return sev_issue_cmd(kvm, SEV_CMD_SEND_CANCEL, &data, &argp->error);
+}
+
+static int sev_receive_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct sev_data_receive_start start;
+ struct kvm_sev_receive_start params;
+ int *error = &argp->error;
+ void *session_data;
+ void *pdh_data;
+ int ret;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+	/* Get the parameters from userspace */
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data),
+ sizeof(struct kvm_sev_receive_start)))
+ return -EFAULT;
+
+ /* some sanity checks */
+ if (!params.pdh_uaddr || !params.pdh_len ||
+ !params.session_uaddr || !params.session_len)
+ return -EINVAL;
+
+ pdh_data = psp_copy_user_blob(params.pdh_uaddr, params.pdh_len);
+ if (IS_ERR(pdh_data))
+ return PTR_ERR(pdh_data);
+
+ session_data = psp_copy_user_blob(params.session_uaddr,
+ params.session_len);
+ if (IS_ERR(session_data)) {
+ ret = PTR_ERR(session_data);
+ goto e_free_pdh;
+ }
+
+ memset(&start, 0, sizeof(start));
+ start.handle = params.handle;
+ start.policy = params.policy;
+ start.pdh_cert_address = __psp_pa(pdh_data);
+ start.pdh_cert_len = params.pdh_len;
+ start.session_address = __psp_pa(session_data);
+ start.session_len = params.session_len;
+
+ /* create memory encryption context */
+ ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_RECEIVE_START, &start,
+ error);
+ if (ret)
+ goto e_free_session;
+
+ /* Bind ASID to this guest */
+ ret = sev_bind_asid(kvm, start.handle, error);
+ if (ret) {
+ sev_decommission(start.handle);
+ goto e_free_session;
+ }
+
+ params.handle = start.handle;
+ if (copy_to_user(u64_to_user_ptr(argp->data),
+ &params, sizeof(struct kvm_sev_receive_start))) {
+ ret = -EFAULT;
+ sev_unbind_asid(kvm, start.handle);
+ goto e_free_session;
+ }
+
+ sev->handle = start.handle;
+ sev->fd = argp->sev_fd;
+
+e_free_session:
+ kfree(session_data);
+e_free_pdh:
+ kfree(pdh_data);
+
+ return ret;
+}
+
+static int sev_receive_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_receive_update_data params;
+ struct sev_data_receive_update_data data;
+ void *hdr = NULL, *trans = NULL;
+ struct page **guest_page;
+ unsigned long n;
+ int ret, offset;
+
+ if (!sev_guest(kvm))
+ return -EINVAL;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data),
+ sizeof(struct kvm_sev_receive_update_data)))
+ return -EFAULT;
+
+ if (!params.hdr_uaddr || !params.hdr_len ||
+ !params.guest_uaddr || !params.guest_len ||
+ !params.trans_uaddr || !params.trans_len)
+ return -EINVAL;
+
+	/* Check if we are crossing a page boundary */
+ offset = params.guest_uaddr & (PAGE_SIZE - 1);
+ if (params.guest_len > PAGE_SIZE || (params.guest_len + offset) > PAGE_SIZE)
+ return -EINVAL;
+
+ hdr = psp_copy_user_blob(params.hdr_uaddr, params.hdr_len);
+ if (IS_ERR(hdr))
+ return PTR_ERR(hdr);
+
+ trans = psp_copy_user_blob(params.trans_uaddr, params.trans_len);
+ if (IS_ERR(trans)) {
+ ret = PTR_ERR(trans);
+ goto e_free_hdr;
+ }
+
+ memset(&data, 0, sizeof(data));
+ data.hdr_address = __psp_pa(hdr);
+ data.hdr_len = params.hdr_len;
+ data.trans_address = __psp_pa(trans);
+ data.trans_len = params.trans_len;
+
+ /* Pin guest memory */
+ guest_page = sev_pin_memory(kvm, params.guest_uaddr & PAGE_MASK,
+ PAGE_SIZE, &n, FOLL_WRITE);
+ if (IS_ERR(guest_page)) {
+ ret = PTR_ERR(guest_page);
+ goto e_free_trans;
+ }
+
+ /*
+ * Flush (on non-coherent CPUs) before RECEIVE_UPDATE_DATA, the PSP
+ * encrypts the written data with the guest's key, and the cache may
+ * contain dirty, unencrypted data.
+ */
+ sev_clflush_pages(guest_page, n);
+
+ /* The RECEIVE_UPDATE_DATA command requires C-bit to be always set. */
+ data.guest_address = (page_to_pfn(guest_page[0]) << PAGE_SHIFT) + offset;
+ data.guest_address |= sev_me_mask;
+ data.guest_len = params.guest_len;
+ data.handle = to_kvm_sev_info(kvm)->handle;
+
+ ret = sev_issue_cmd(kvm, SEV_CMD_RECEIVE_UPDATE_DATA, &data,
+ &argp->error);
+
+ sev_unpin_memory(kvm, guest_page, n);
+
+e_free_trans:
+ kfree(trans);
+e_free_hdr:
+ kfree(hdr);
+
+ return ret;
+}
+
+static int sev_receive_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct sev_data_receive_finish data;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ data.handle = to_kvm_sev_info(kvm)->handle;
+ return sev_issue_cmd(kvm, SEV_CMD_RECEIVE_FINISH, &data, &argp->error);
+}
+
+static bool is_cmd_allowed_from_mirror(u32 cmd_id)
+{
+ /*
+	 * Allow mirror VMs to call KVM_SEV_LAUNCH_UPDATE_VMSA to enable SEV-ES
+	 * on active mirror VMs. Also allow the debugging and status commands.
+ */
+ if (cmd_id == KVM_SEV_LAUNCH_UPDATE_VMSA ||
+ cmd_id == KVM_SEV_GUEST_STATUS || cmd_id == KVM_SEV_DBG_DECRYPT ||
+ cmd_id == KVM_SEV_DBG_ENCRYPT)
+ return true;
+
+ return false;
+}
+
+static int sev_lock_two_vms(struct kvm *dst_kvm, struct kvm *src_kvm)
+{
+ struct kvm_sev_info *dst_sev = to_kvm_sev_info(dst_kvm);
+ struct kvm_sev_info *src_sev = to_kvm_sev_info(src_kvm);
+ int r = -EBUSY;
+
+ if (dst_kvm == src_kvm)
+ return -EINVAL;
+
+ /*
+ * Bail if these VMs are already involved in a migration to avoid
+ * deadlock between two VMs trying to migrate to/from each other.
+ */
+ if (atomic_cmpxchg_acquire(&dst_sev->migration_in_progress, 0, 1))
+ return -EBUSY;
+
+ if (atomic_cmpxchg_acquire(&src_sev->migration_in_progress, 0, 1))
+ goto release_dst;
+
+ r = -EINTR;
+ if (mutex_lock_killable(&dst_kvm->lock))
+ goto release_src;
+ if (mutex_lock_killable_nested(&src_kvm->lock, SINGLE_DEPTH_NESTING))
+ goto unlock_dst;
+ return 0;
+
+unlock_dst:
+ mutex_unlock(&dst_kvm->lock);
+release_src:
+ atomic_set_release(&src_sev->migration_in_progress, 0);
+release_dst:
+ atomic_set_release(&dst_sev->migration_in_progress, 0);
+ return r;
+}
+
+static void sev_unlock_two_vms(struct kvm *dst_kvm, struct kvm *src_kvm)
+{
+ struct kvm_sev_info *dst_sev = to_kvm_sev_info(dst_kvm);
+ struct kvm_sev_info *src_sev = to_kvm_sev_info(src_kvm);
+
+ mutex_unlock(&dst_kvm->lock);
+ mutex_unlock(&src_kvm->lock);
+ atomic_set_release(&dst_sev->migration_in_progress, 0);
+ atomic_set_release(&src_sev->migration_in_progress, 0);
+}
+
+static void sev_migrate_from(struct kvm *dst_kvm, struct kvm *src_kvm)
+{
+ struct kvm_sev_info *dst = to_kvm_sev_info(dst_kvm);
+ struct kvm_sev_info *src = to_kvm_sev_info(src_kvm);
+ struct kvm_vcpu *dst_vcpu, *src_vcpu;
+ struct vcpu_svm *dst_svm, *src_svm;
+ struct kvm_sev_info *mirror;
+ unsigned long i;
+
+ dst->active = true;
+ dst->asid = src->asid;
+ dst->handle = src->handle;
+ dst->pages_locked = src->pages_locked;
+ dst->enc_context_owner = src->enc_context_owner;
+ dst->es_active = src->es_active;
+ dst->vmsa_features = src->vmsa_features;
+
+ src->asid = 0;
+ src->active = false;
+ src->handle = 0;
+ src->pages_locked = 0;
+ src->enc_context_owner = NULL;
+ src->es_active = false;
+
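+	/* Move the entire regions_list from the source to the destination. */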
+ list_cut_before(&dst->regions_list, &src->regions_list, &src->regions_list);
+
+ /*
+ * If this VM has mirrors, "transfer" each mirror's refcount of the
+ * source to the destination (this KVM). The caller holds a reference
+ * to the source, so there's no danger of use-after-free.
+ */
+ list_cut_before(&dst->mirror_vms, &src->mirror_vms, &src->mirror_vms);
+ list_for_each_entry(mirror, &dst->mirror_vms, mirror_entry) {
+ kvm_get_kvm(dst_kvm);
+ kvm_put_kvm(src_kvm);
+ mirror->enc_context_owner = dst_kvm;
+ }
+
+ /*
+ * If this VM is a mirror, remove the old mirror from the owners list
+ * and add the new mirror to the list.
+ */
+ if (is_mirroring_enc_context(dst_kvm)) {
+ struct kvm_sev_info *owner_sev_info = to_kvm_sev_info(dst->enc_context_owner);
+
+ list_del(&src->mirror_entry);
+ list_add_tail(&dst->mirror_entry, &owner_sev_info->mirror_vms);
+ }
+
+ kvm_for_each_vcpu(i, dst_vcpu, dst_kvm) {
+ dst_svm = to_svm(dst_vcpu);
+
+ sev_init_vmcb(dst_svm, false);
+
+ if (!dst->es_active)
+ continue;
+
+ /*
+ * Note, the source is not required to have the same number of
+ * vCPUs as the destination when migrating a vanilla SEV VM.
+ */
+ src_vcpu = kvm_get_vcpu(src_kvm, i);
+ src_svm = to_svm(src_vcpu);
+
+ /*
+ * Transfer VMSA and GHCB state to the destination. Nullify and
+ * clear source fields as appropriate, the state now belongs to
+ * the destination.
+ */
+ memcpy(&dst_svm->sev_es, &src_svm->sev_es, sizeof(src_svm->sev_es));
+ dst_svm->vmcb->control.ghcb_gpa = src_svm->vmcb->control.ghcb_gpa;
+ dst_svm->vmcb->control.vmsa_pa = src_svm->vmcb->control.vmsa_pa;
+ dst_vcpu->arch.guest_state_protected = true;
+
+ memset(&src_svm->sev_es, 0, sizeof(src_svm->sev_es));
+ src_svm->vmcb->control.ghcb_gpa = INVALID_PAGE;
+ src_svm->vmcb->control.vmsa_pa = INVALID_PAGE;
+ src_vcpu->arch.guest_state_protected = false;
+ }
+}
+
+static int sev_check_source_vcpus(struct kvm *dst, struct kvm *src)
+{
+ struct kvm_vcpu *src_vcpu;
+ unsigned long i;
+
+ if (src->created_vcpus != atomic_read(&src->online_vcpus) ||
+ dst->created_vcpus != atomic_read(&dst->online_vcpus))
+ return -EBUSY;
+
+ if (!sev_es_guest(src))
+ return 0;
+
+ if (atomic_read(&src->online_vcpus) != atomic_read(&dst->online_vcpus))
+ return -EINVAL;
+
+ kvm_for_each_vcpu(i, src_vcpu, src) {
+ if (!src_vcpu->arch.guest_state_protected)
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
+{
+ struct kvm_sev_info *dst_sev = to_kvm_sev_info(kvm);
+ struct kvm_sev_info *src_sev, *cg_cleanup_sev;
+ CLASS(fd, f)(source_fd);
+ struct kvm *source_kvm;
+ bool charged = false;
+ int ret;
+
+ if (fd_empty(f))
+ return -EBADF;
+
+ if (!file_is_kvm(fd_file(f)))
+ return -EBADF;
+
+ source_kvm = fd_file(f)->private_data;
+ ret = sev_lock_two_vms(kvm, source_kvm);
+ if (ret)
+ return ret;
+
+ if (kvm->arch.vm_type != source_kvm->arch.vm_type ||
+ sev_guest(kvm) || !sev_guest(source_kvm)) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+
+ src_sev = to_kvm_sev_info(source_kvm);
+
+ dst_sev->misc_cg = get_current_misc_cg();
+ cg_cleanup_sev = dst_sev;
+ if (dst_sev->misc_cg != src_sev->misc_cg) {
+ ret = sev_misc_cg_try_charge(dst_sev);
+ if (ret)
+ goto out_dst_cgroup;
+ charged = true;
+ }
+
+ ret = kvm_lock_all_vcpus(kvm);
+ if (ret)
+ goto out_dst_cgroup;
+ ret = kvm_lock_all_vcpus(source_kvm);
+ if (ret)
+ goto out_dst_vcpu;
+
+ ret = sev_check_source_vcpus(kvm, source_kvm);
+ if (ret)
+ goto out_source_vcpu;
+
+ /*
+ * Allocate a new have_run_cpus for the destination, i.e. don't copy
+ * the set of CPUs from the source. If a CPU was used to run a vCPU in
+ * the source VM but is never used for the destination VM, then the CPU
+ * can only have cached memory that was accessible to the source VM.
+ */
+ if (!zalloc_cpumask_var(&dst_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
+ ret = -ENOMEM;
+ goto out_source_vcpu;
+ }
+
+ sev_migrate_from(kvm, source_kvm);
+ kvm_vm_dead(source_kvm);
+ cg_cleanup_sev = src_sev;
+ ret = 0;
+
+out_source_vcpu:
+ kvm_unlock_all_vcpus(source_kvm);
+out_dst_vcpu:
+ kvm_unlock_all_vcpus(kvm);
+out_dst_cgroup:
+ /* Operates on the source on success, on the destination on failure. */
+ if (charged)
+ sev_misc_cg_uncharge(cg_cleanup_sev);
+ put_misc_cg(cg_cleanup_sev->misc_cg);
+ cg_cleanup_sev->misc_cg = NULL;
+out_unlock:
+ sev_unlock_two_vms(kvm, source_kvm);
+ return ret;
+}
+
+int sev_dev_get_attr(u32 group, u64 attr, u64 *val)
+{
+ if (group != KVM_X86_GRP_SEV)
+ return -ENXIO;
+
+ switch (attr) {
+ case KVM_X86_SEV_VMSA_FEATURES:
+ *val = sev_supported_vmsa_features;
+ return 0;
+
+ case KVM_X86_SNP_POLICY_BITS:
+ *val = snp_supported_policy_bits;
+ return 0;
+
+ default:
+ return -ENXIO;
+ }
+}
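+
+/*
+ * Illustrative sketch (not part of this patch): userspace is expected to
+ * query these attributes with the KVM_GET_DEVICE_ATTR ioctl on the /dev/kvm
+ * fd, along the lines of:
+ *
+ *	__u64 bits;
+ *	struct kvm_device_attr attr = {
+ *		.group = KVM_X86_GRP_SEV,
+ *		.attr  = KVM_X86_SNP_POLICY_BITS,
+ *		.addr  = (__u64)&bits,
+ *	};
+ *	ioctl(dev_kvm_fd, KVM_GET_DEVICE_ATTR, &attr);
+ */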
+
+/*
+ * The guest context contains all the information, keys and metadata
+ * associated with the guest that the firmware tracks to implement SEV
+ * and SNP features. The firmware stores the guest context in a
+ * hypervisor-provided page via the SNP_GCTX_CREATE command.
+ */
+static void *snp_context_create(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct sev_data_snp_addr data = {};
+ void *context;
+ int rc;
+
+ /* Allocate memory for context page */
+ context = snp_alloc_firmware_page(GFP_KERNEL_ACCOUNT);
+ if (!context)
+ return NULL;
+
+ data.address = __psp_pa(context);
+ rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_GCTX_CREATE, &data, &argp->error);
+ if (rc) {
+ pr_warn("Failed to create SEV-SNP context, rc %d fw_error %d",
+ rc, argp->error);
+ snp_free_firmware_page(context);
+ return NULL;
+ }
+
+ return context;
+}
+
+static int snp_bind_asid(struct kvm *kvm, int *error)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct sev_data_snp_activate data = {0};
+
+ data.gctx_paddr = __psp_pa(sev->snp_context);
+ data.asid = sev_get_asid(kvm);
+ return sev_issue_cmd(kvm, SEV_CMD_SNP_ACTIVATE, &data, error);
+}
+
+static int snp_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct sev_data_snp_launch_start start = {0};
+ struct kvm_sev_snp_launch_start params;
+ int rc;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+ return -EFAULT;
+
+ /* Don't allow userspace to allocate memory for more than 1 SNP context. */
+ if (sev->snp_context)
+ return -EINVAL;
+
+ if (params.flags)
+ return -EINVAL;
+
+ if (params.policy & ~snp_supported_policy_bits)
+ return -EINVAL;
+
+ /* Check for policy bits that must be set */
+ if (!(params.policy & SNP_POLICY_MASK_RSVD_MBO))
+ return -EINVAL;
+
+ if (snp_is_secure_tsc_enabled(kvm)) {
+ if (WARN_ON_ONCE(!kvm->arch.default_tsc_khz))
+ return -EINVAL;
+
+ start.desired_tsc_khz = kvm->arch.default_tsc_khz;
+ }
+
+ sev->snp_context = snp_context_create(kvm, argp);
+ if (!sev->snp_context)
+ return -ENOTTY;
+
+ start.gctx_paddr = __psp_pa(sev->snp_context);
+ start.policy = params.policy;
+
+ memcpy(start.gosvw, params.gosvw, sizeof(params.gosvw));
+ rc = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_START, &start, &argp->error);
+ if (rc) {
+ pr_debug("%s: SEV_CMD_SNP_LAUNCH_START firmware command failed, rc %d\n",
+ __func__, rc);
+ goto e_free_context;
+ }
+
+ sev->policy = params.policy;
+ sev->fd = argp->sev_fd;
+ rc = snp_bind_asid(kvm, &argp->error);
+ if (rc) {
+ pr_debug("%s: Failed to bind ASID to SEV-SNP context, rc %d\n",
+ __func__, rc);
+ goto e_free_context;
+ }
+
+ return 0;
+
+e_free_context:
+ snp_decommission_context(kvm);
+
+ return rc;
+}
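+
+/*
+ * Usage sketch (illustrative, not part of this patch): userspace invokes
+ * KVM_SEV_SNP_LAUNCH_START through the KVM_MEMORY_ENCRYPT_OP VM ioctl.
+ * The policy below is only an example; the mandatory reserved bit must be
+ * set per the SNP_POLICY_MASK_RSVD_MBO check above. `vm_fd' and `sev_fd'
+ * are hypothetical fds and error handling is elided.
+ *
+ *	struct kvm_sev_snp_launch_start start = {
+ *		.policy = SNP_POLICY_MASK_RSVD_MBO,
+ *	};
+ *	struct kvm_sev_cmd cmd = {
+ *		.id = KVM_SEV_SNP_LAUNCH_START,
+ *		.data = (__u64)(unsigned long)&start,
+ *		.sev_fd = sev_fd,
+ *	};
+ *
+ *	ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
+ */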
+
+struct sev_gmem_populate_args {
+ __u8 type;
+ int sev_fd;
+ int fw_error;
+};
+
+static int sev_gmem_post_populate(struct kvm *kvm, gfn_t gfn_start, kvm_pfn_t pfn,
+ void __user *src, int order, void *opaque)
+{
+ struct sev_gmem_populate_args *sev_populate_args = opaque;
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ int n_private = 0, ret, i;
+ int npages = (1 << order);
+ gfn_t gfn;
+
+ if (WARN_ON_ONCE(sev_populate_args->type != KVM_SEV_SNP_PAGE_TYPE_ZERO && !src))
+ return -EINVAL;
+
+ for (gfn = gfn_start, i = 0; gfn < gfn_start + npages; gfn++, i++) {
+ struct sev_data_snp_launch_update fw_args = {0};
+ bool assigned = false;
+ int level;
+
+ ret = snp_lookup_rmpentry((u64)pfn + i, &assigned, &level);
+ if (ret || assigned) {
+ pr_debug("%s: Failed to ensure GFN 0x%llx RMP entry is initial shared state, ret: %d assigned: %d\n",
+ __func__, gfn, ret, assigned);
+ ret = ret ? -EINVAL : -EEXIST;
+ goto err;
+ }
+
+ if (src) {
+ void *vaddr = kmap_local_pfn(pfn + i);
+
+ if (copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE)) {
+ ret = -EFAULT;
+ goto err;
+ }
+ kunmap_local(vaddr);
+ }
+
+ ret = rmp_make_private(pfn + i, gfn << PAGE_SHIFT, PG_LEVEL_4K,
+ sev_get_asid(kvm), true);
+ if (ret)
+ goto err;
+
+ n_private++;
+
+ fw_args.gctx_paddr = __psp_pa(sev->snp_context);
+ fw_args.address = __sme_set(pfn_to_hpa(pfn + i));
+ fw_args.page_size = PG_LEVEL_TO_RMP(PG_LEVEL_4K);
+ fw_args.page_type = sev_populate_args->type;
+
+ ret = __sev_issue_cmd(sev_populate_args->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
+ &fw_args, &sev_populate_args->fw_error);
+ if (ret)
+ goto fw_err;
+ }
+
+ return 0;
+
+fw_err:
+ /*
+ * If the firmware command failed, handle the reclaim and cleanup of that
+ * PFN specially vs. prior pages, which can be cleaned up below without
+ * needing to be reclaimed in advance.
+ *
+ * Additionally, when invalid CPUID function entries are detected,
+ * firmware writes the expected values into the page and leaves it
+ * unencrypted so it can be used for debugging and error-reporting.
+ *
+ * Copy this page back into the source buffer so userspace can use it
+ * to report which CPUID leaves/fields failed CPUID validation.
+ */
+ if (!snp_page_reclaim(kvm, pfn + i) &&
+ sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
+ sev_populate_args->fw_error == SEV_RET_INVALID_PARAM) {
+ void *vaddr = kmap_local_pfn(pfn + i);
+
+ if (copy_to_user(src + i * PAGE_SIZE, vaddr, PAGE_SIZE))
+ pr_debug("Failed to write CPUID page back to userspace\n");
+
+ kunmap_local(vaddr);
+ }
+
+ /* pfn + i is hypervisor-owned now, so skip the cleanup below for it. */
+ n_private--;
+
+err:
+ pr_debug("%s: exiting with error ret %d (fw_error %d), restoring %d gmem PFNs to shared.\n",
+ __func__, ret, sev_populate_args->fw_error, n_private);
+ for (i = 0; i < n_private; i++)
+ kvm_rmp_make_shared(kvm, pfn + i, PG_LEVEL_4K);
+
+ return ret;
+}
+
+static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct sev_gmem_populate_args sev_populate_args = {0};
+ struct kvm_sev_snp_launch_update params;
+ struct kvm_memory_slot *memslot;
+ long npages, count;
+ void __user *src;
+ int ret = 0;
+
+ if (!sev_snp_guest(kvm) || !sev->snp_context)
+ return -EINVAL;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+ return -EFAULT;
+
+ pr_debug("%s: GFN start 0x%llx length 0x%llx type %d flags %d\n", __func__,
+ params.gfn_start, params.len, params.type, params.flags);
+
+ if (!params.len || !PAGE_ALIGNED(params.len) || params.flags ||
+ (params.type != KVM_SEV_SNP_PAGE_TYPE_NORMAL &&
+ params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO &&
+ params.type != KVM_SEV_SNP_PAGE_TYPE_UNMEASURED &&
+ params.type != KVM_SEV_SNP_PAGE_TYPE_SECRETS &&
+ params.type != KVM_SEV_SNP_PAGE_TYPE_CPUID))
+ return -EINVAL;
+
+ npages = params.len / PAGE_SIZE;
+
+ /*
+ * For each GFN that's being prepared as part of the initial guest
+ * state, the following pre-conditions are verified:
+ *
+ * 1) The backing memslot is a valid private memslot.
+ * 2) The GFN has been set to private via KVM_SET_MEMORY_ATTRIBUTES
+ * beforehand.
+ * 3) The PFN of the guest_memfd has not already been set to private
+ * in the RMP table.
+ *
+ * The KVM MMU relies on kvm->mmu_invalidate_seq to retry nested page
+ * faults if there's a race between a fault and an attribute update via
+ * KVM_SET_MEMORY_ATTRIBUTES, and a similar approach could be utilized
+ * here. However, kvm->slots_lock guards against both this as well as
+ * concurrent memslot updates occurring while these checks are being
+ * performed, so use that here to make it easier to reason about the
+ * initial expected state and better guard against unexpected
+ * situations.
+ */
+ mutex_lock(&kvm->slots_lock);
+
+ memslot = gfn_to_memslot(kvm, params.gfn_start);
+ if (!kvm_slot_has_gmem(memslot)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ sev_populate_args.sev_fd = argp->sev_fd;
+ sev_populate_args.type = params.type;
+ src = params.type == KVM_SEV_SNP_PAGE_TYPE_ZERO ? NULL : u64_to_user_ptr(params.uaddr);
+
+ count = kvm_gmem_populate(kvm, params.gfn_start, src, npages,
+ sev_gmem_post_populate, &sev_populate_args);
+ if (count < 0) {
+ argp->error = sev_populate_args.fw_error;
+ pr_debug("%s: kvm_gmem_populate failed, ret %ld (fw_error %d)\n",
+ __func__, count, argp->error);
+ ret = -EIO;
+ } else {
+ params.gfn_start += count;
+ params.len -= count * PAGE_SIZE;
+ if (params.type != KVM_SEV_SNP_PAGE_TYPE_ZERO)
+ params.uaddr += count * PAGE_SIZE;
+
+ ret = 0;
+ if (copy_to_user(u64_to_user_ptr(argp->data), &params, sizeof(params)))
+ ret = -EFAULT;
+ }
+
+out:
+ mutex_unlock(&kvm->slots_lock);
+
+ return ret;
+}
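+
+/*
+ * Usage sketch (illustrative, not part of this patch): because the params
+ * struct is written back with gfn_start/uaddr/len advanced by the count of
+ * pages actually processed, userspace can retry KVM_SEV_SNP_LAUNCH_UPDATE
+ * in a loop until the whole range has been encrypted. Names below are
+ * hypothetical and error handling is elided.
+ *
+ *	struct kvm_sev_snp_launch_update update = {
+ *		.gfn_start = gfn,
+ *		.uaddr = (__u64)(unsigned long)src_buf,
+ *		.len = size,
+ *		.type = KVM_SEV_SNP_PAGE_TYPE_NORMAL,
+ *	};
+ *	struct kvm_sev_cmd cmd = {
+ *		.id = KVM_SEV_SNP_LAUNCH_UPDATE,
+ *		.data = (__u64)(unsigned long)&update,
+ *		.sev_fd = sev_fd,
+ *	};
+ *
+ *	do {
+ *		ret = ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
+ *	} while (!ret && update.len);
+ */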
+
+static int snp_launch_update_vmsa(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct sev_data_snp_launch_update data = {};
+ struct kvm_vcpu *vcpu;
+ unsigned long i;
+ int ret;
+
+ data.gctx_paddr = __psp_pa(sev->snp_context);
+ data.page_type = SNP_PAGE_TYPE_VMSA;
+
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
+
+ ret = sev_es_sync_vmsa(svm);
+ if (ret)
+ return ret;
+
+ /* Transition the VMSA page to a firmware state. */
+ ret = rmp_make_private(pfn, INITIAL_VMSA_GPA, PG_LEVEL_4K, sev->asid, true);
+ if (ret)
+ return ret;
+
+ /* Issue the SNP command to encrypt the VMSA */
+ data.address = __sme_pa(svm->sev_es.vmsa);
+ ret = __sev_issue_cmd(argp->sev_fd, SEV_CMD_SNP_LAUNCH_UPDATE,
+ &data, &argp->error);
+ if (ret) {
+ snp_page_reclaim(kvm, pfn);
+
+ return ret;
+ }
+
+ svm->vcpu.arch.guest_state_protected = true;
+ /*
+ * SEV-ES (and thus SNP) guests mandate that LBR Virtualization
+ * be _always_ ON. Enable it only after setting
+ * guest_state_protected, because KVM_SET_MSRS allows dynamic
+ * toggling of LBRV (for performance reasons) on write access to
+ * MSR_IA32_DEBUGCTLMSR when guest_state_protected is not set.
+ */
+ svm_enable_lbrv(vcpu);
+ }
+
+ return 0;
+}
+
+static int snp_launch_finish(struct kvm *kvm, struct kvm_sev_cmd *argp)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct kvm_sev_snp_launch_finish params;
+ struct sev_data_snp_launch_finish *data;
+ void *id_block = NULL, *id_auth = NULL;
+ int ret;
+
+ if (!sev_snp_guest(kvm))
+ return -ENOTTY;
+
+ if (!sev->snp_context)
+ return -EINVAL;
+
+ if (copy_from_user(&params, u64_to_user_ptr(argp->data), sizeof(params)))
+ return -EFAULT;
+
+ if (params.flags)
+ return -EINVAL;
+
+ /* Measure all vCPUs using LAUNCH_UPDATE before finalizing the launch flow. */
+ ret = snp_launch_update_vmsa(kvm, argp);
+ if (ret)
+ return ret;
+
+ data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
+ if (!data)
+ return -ENOMEM;
+
+ if (params.id_block_en) {
+ id_block = psp_copy_user_blob(params.id_block_uaddr, KVM_SEV_SNP_ID_BLOCK_SIZE);
+ if (IS_ERR(id_block)) {
+ ret = PTR_ERR(id_block);
+ goto e_free;
+ }
+
+ data->id_block_en = 1;
+ data->id_block_paddr = __sme_pa(id_block);
+
+ id_auth = psp_copy_user_blob(params.id_auth_uaddr, KVM_SEV_SNP_ID_AUTH_SIZE);
+ if (IS_ERR(id_auth)) {
+ ret = PTR_ERR(id_auth);
+ goto e_free_id_block;
+ }
+
+ data->id_auth_paddr = __sme_pa(id_auth);
+
+ if (params.auth_key_en)
+ data->auth_key_en = 1;
+ }
+
+ data->vcek_disabled = params.vcek_disabled;
+
+ memcpy(data->host_data, params.host_data, KVM_SEV_SNP_FINISH_DATA_SIZE);
+ data->gctx_paddr = __psp_pa(sev->snp_context);
+ ret = sev_issue_cmd(kvm, SEV_CMD_SNP_LAUNCH_FINISH, data, &argp->error);
+
+ /*
+ * Now that there will be no more SNP_LAUNCH_UPDATE ioctls, private pages
+ * can be given to the guest simply by marking the RMP entry as private.
+ * This can happen on first access and also with KVM_PRE_FAULT_MEMORY.
+ */
+ if (!ret)
+ kvm->arch.pre_fault_allowed = true;
+
+ kfree(id_auth);
+
+e_free_id_block:
+ kfree(id_block);
+
+e_free:
+ kfree(data);
+
+ return ret;
+}
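+
+/*
+ * Usage sketch (illustrative, not part of this patch): finalizing the
+ * launch measures all vCPU VMSAs and locks down the launch measurement;
+ * host_data lands in subsequent attestation reports. Names below are
+ * hypothetical and error handling is elided.
+ *
+ *	struct kvm_sev_snp_launch_finish finish = {};
+ *
+ *	memcpy(finish.host_data, my_host_data, KVM_SEV_SNP_FINISH_DATA_SIZE);
+ *
+ *	struct kvm_sev_cmd cmd = {
+ *		.id = KVM_SEV_SNP_LAUNCH_FINISH,
+ *		.data = (__u64)(unsigned long)&finish,
+ *		.sev_fd = sev_fd,
+ *	};
+ *
+ *	ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
+ */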
+
+int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
+{
+ struct kvm_sev_cmd sev_cmd;
+ int r;
+
+ if (!sev_enabled)
+ return -ENOTTY;
+
+ if (!argp)
+ return 0;
+
+ if (copy_from_user(&sev_cmd, argp, sizeof(struct kvm_sev_cmd)))
+ return -EFAULT;
+
+ mutex_lock(&kvm->lock);
+
+ /* Some memory encryption operations may only be handled by the enc_context_owner. */
+ if (is_mirroring_enc_context(kvm) &&
+ !is_cmd_allowed_from_mirror(sev_cmd.id)) {
+ r = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * Once KVM_SEV_INIT2 initializes a KVM instance as an SNP guest, only
+ * allow the use of SNP-specific commands.
+ */
+ if (sev_snp_guest(kvm) && sev_cmd.id < KVM_SEV_SNP_LAUNCH_START) {
+ r = -EPERM;
+ goto out;
+ }
+
+ switch (sev_cmd.id) {
+ case KVM_SEV_ES_INIT:
+ if (!sev_es_enabled) {
+ r = -ENOTTY;
+ goto out;
+ }
+ fallthrough;
+ case KVM_SEV_INIT:
+ r = sev_guest_init(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_INIT2:
+ r = sev_guest_init2(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_LAUNCH_START:
+ r = sev_launch_start(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_LAUNCH_UPDATE_DATA:
+ r = sev_launch_update_data(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_LAUNCH_UPDATE_VMSA:
+ r = sev_launch_update_vmsa(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_LAUNCH_MEASURE:
+ r = sev_launch_measure(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_LAUNCH_FINISH:
+ r = sev_launch_finish(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_GUEST_STATUS:
+ r = sev_guest_status(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_DBG_DECRYPT:
+ r = sev_dbg_crypt(kvm, &sev_cmd, true);
+ break;
+ case KVM_SEV_DBG_ENCRYPT:
+ r = sev_dbg_crypt(kvm, &sev_cmd, false);
+ break;
+ case KVM_SEV_LAUNCH_SECRET:
+ r = sev_launch_secret(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_GET_ATTESTATION_REPORT:
+ r = sev_get_attestation_report(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_SEND_START:
+ r = sev_send_start(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_SEND_UPDATE_DATA:
+ r = sev_send_update_data(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_SEND_FINISH:
+ r = sev_send_finish(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_SEND_CANCEL:
+ r = sev_send_cancel(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_RECEIVE_START:
+ r = sev_receive_start(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_RECEIVE_UPDATE_DATA:
+ r = sev_receive_update_data(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_RECEIVE_FINISH:
+ r = sev_receive_finish(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_SNP_LAUNCH_START:
+ r = snp_launch_start(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_SNP_LAUNCH_UPDATE:
+ r = snp_launch_update(kvm, &sev_cmd);
+ break;
+ case KVM_SEV_SNP_LAUNCH_FINISH:
+ r = snp_launch_finish(kvm, &sev_cmd);
+ break;
+ default:
+ r = -EINVAL;
+ goto out;
+ }
+
+ if (copy_to_user(argp, &sev_cmd, sizeof(struct kvm_sev_cmd)))
+ r = -EFAULT;
+
+out:
+ mutex_unlock(&kvm->lock);
+ return r;
+}
+
+int sev_mem_enc_register_region(struct kvm *kvm,
+ struct kvm_enc_region *range)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct enc_region *region;
+ int ret = 0;
+
+ if (!sev_guest(kvm))
+ return -ENOTTY;
+
+ /* If kvm is mirroring encryption context it isn't responsible for it */
+ if (is_mirroring_enc_context(kvm))
+ return -EINVAL;
+
+ if (range->addr > ULONG_MAX || range->size > ULONG_MAX)
+ return -EINVAL;
+
+ region = kzalloc(sizeof(*region), GFP_KERNEL_ACCOUNT);
+ if (!region)
+ return -ENOMEM;
+
+ mutex_lock(&kvm->lock);
+ region->pages = sev_pin_memory(kvm, range->addr, range->size, &region->npages,
+ FOLL_WRITE | FOLL_LONGTERM);
+ if (IS_ERR(region->pages)) {
+ ret = PTR_ERR(region->pages);
+ mutex_unlock(&kvm->lock);
+ goto e_free;
+ }
+
+ /*
+ * The guest may change the memory encryption attribute from C=0 -> C=1
+ * or vice versa for this memory range. Let's make sure caches are
+ * flushed to ensure that guest data gets written into memory with
+ * correct C-bit. Note, this must be done before dropping kvm->lock,
+ * as region and its array of pages can be freed by a different task
+ * once kvm->lock is released.
+ */
+ sev_clflush_pages(region->pages, region->npages);
+
+ region->uaddr = range->addr;
+ region->size = range->size;
+
+ list_add_tail(&region->list, &sev->regions_list);
+ mutex_unlock(&kvm->lock);
+
+ return ret;
+
+e_free:
+ kfree(region);
+ return ret;
+}
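+
+/*
+ * Usage sketch (illustrative, not part of this patch): userspace registers
+ * the memory that may hold guest-encrypted data via the
+ * KVM_MEMORY_ENCRYPT_REG_REGION VM ioctl so that KVM can pin it. Names
+ * below are hypothetical and error handling is elided.
+ *
+ *	struct kvm_enc_region region = {
+ *		.addr = (__u64)(unsigned long)guest_ram,
+ *		.size = guest_ram_size,
+ *	};
+ *
+ *	ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
+ */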
+
+static struct enc_region *
+find_enc_region(struct kvm *kvm, struct kvm_enc_region *range)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct list_head *head = &sev->regions_list;
+ struct enc_region *i;
+
+ list_for_each_entry(i, head, list) {
+ if (i->uaddr == range->addr &&
+ i->size == range->size)
+ return i;
+ }
+
+ return NULL;
+}
+
+static void __unregister_enc_region_locked(struct kvm *kvm,
+ struct enc_region *region)
+{
+ sev_unpin_memory(kvm, region->pages, region->npages);
+ list_del(&region->list);
+ kfree(region);
+}
+
+int sev_mem_enc_unregister_region(struct kvm *kvm,
+ struct kvm_enc_region *range)
+{
+ struct enc_region *region;
+ int ret;
+
+ /* If kvm is mirroring encryption context it isn't responsible for it */
+ if (is_mirroring_enc_context(kvm))
+ return -EINVAL;
+
+ mutex_lock(&kvm->lock);
+
+ if (!sev_guest(kvm)) {
+ ret = -ENOTTY;
+ goto failed;
+ }
+
+ region = find_enc_region(kvm, range);
+ if (!region) {
+ ret = -EINVAL;
+ goto failed;
+ }
+
+ sev_writeback_caches(kvm);
+
+ __unregister_enc_region_locked(kvm, region);
+
+ mutex_unlock(&kvm->lock);
+ return 0;
+
+failed:
+ mutex_unlock(&kvm->lock);
+ return ret;
+}
+
+int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
+{
+ CLASS(fd, f)(source_fd);
+ struct kvm *source_kvm;
+ struct kvm_sev_info *source_sev, *mirror_sev;
+ int ret;
+
+ if (fd_empty(f))
+ return -EBADF;
+
+ if (!file_is_kvm(fd_file(f)))
+ return -EBADF;
+
+ source_kvm = fd_file(f)->private_data;
+ ret = sev_lock_two_vms(kvm, source_kvm);
+ if (ret)
+ return ret;
+
+ /*
+ * Mirrors of mirrors should work, but let's not get silly. Also
+ * disallow out-of-band SEV/SEV-ES init if the target is already an
+ * SEV guest, or if vCPUs have been created. KVM relies on vCPUs being
+ * created after SEV/SEV-ES initialization, e.g. to init intercepts.
+ */
+ if (sev_guest(kvm) || !sev_guest(source_kvm) ||
+ is_mirroring_enc_context(source_kvm) || kvm->created_vcpus) {
+ ret = -EINVAL;
+ goto e_unlock;
+ }
+
+ mirror_sev = to_kvm_sev_info(kvm);
+ if (!zalloc_cpumask_var(&mirror_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
+ ret = -ENOMEM;
+ goto e_unlock;
+ }
+
+ /*
+ * The mirror kvm holds an enc_context_owner ref so its asid can't
+ * disappear until we're done with it
+ */
+ source_sev = to_kvm_sev_info(source_kvm);
+ kvm_get_kvm(source_kvm);
+ list_add_tail(&mirror_sev->mirror_entry, &source_sev->mirror_vms);
+
+ /* Set enc_context_owner and copy its encryption context over */
+ mirror_sev->enc_context_owner = source_kvm;
+ mirror_sev->active = true;
+ mirror_sev->asid = source_sev->asid;
+ mirror_sev->fd = source_sev->fd;
+ mirror_sev->es_active = source_sev->es_active;
+ mirror_sev->need_init = false;
+ mirror_sev->handle = source_sev->handle;
+ INIT_LIST_HEAD(&mirror_sev->regions_list);
+ INIT_LIST_HEAD(&mirror_sev->mirror_vms);
+ ret = 0;
+
+ /*
+ * Do not copy ap_jump_table: the mirror does not share the same
+ * KVM context as the original, and they may have different
+ * memory views.
+ */
+
+e_unlock:
+ sev_unlock_two_vms(kvm, source_kvm);
+ return ret;
+}
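+
+/*
+ * Usage sketch (illustrative, not part of this patch): a mirror VM is
+ * created by enabling KVM_CAP_VM_COPY_ENC_CONTEXT_FROM on a freshly
+ * created VM (before any vCPUs exist), passing the fd of the SEV VM to
+ * mirror. Names below are hypothetical and error handling is elided.
+ *
+ *	struct kvm_enable_cap cap = {
+ *		.cap = KVM_CAP_VM_COPY_ENC_CONTEXT_FROM,
+ *		.args[0] = sev_vm_fd,
+ *	};
+ *
+ *	ioctl(mirror_vm_fd, KVM_ENABLE_CAP, &cap);
+ */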
+
+static int snp_decommission_context(struct kvm *kvm)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct sev_data_snp_addr data = {};
+ int ret;
+
+ /* If context is not created then do nothing */
+ if (!sev->snp_context)
+ return 0;
+
+ /* Do the decommission, which will unbind the ASID from the SNP context */
+ data.address = __sme_pa(sev->snp_context);
+ down_write(&sev_deactivate_lock);
+ ret = sev_do_cmd(SEV_CMD_SNP_DECOMMISSION, &data, NULL);
+ up_write(&sev_deactivate_lock);
+
+ if (WARN_ONCE(ret, "Failed to release guest context, ret %d", ret))
+ return ret;
+
+ snp_free_firmware_page(sev->snp_context);
+ sev->snp_context = NULL;
+
+ return 0;
+}
+
+void sev_vm_destroy(struct kvm *kvm)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ struct list_head *head = &sev->regions_list;
+ struct list_head *pos, *q;
+
+ if (!sev_guest(kvm))
+ return;
+
+ WARN_ON(!list_empty(&sev->mirror_vms));
+
+ free_cpumask_var(sev->have_run_cpus);
+
+ /*
+ * If this is a mirror VM, remove it from the owner's list of mirrors
+ * and skip ASID cleanup (the ASID is tied to the lifetime of the owner).
+ * Note, mirror VMs don't support registering encrypted regions.
+ */
+ if (is_mirroring_enc_context(kvm)) {
+ struct kvm *owner_kvm = sev->enc_context_owner;
+
+ mutex_lock(&owner_kvm->lock);
+ list_del(&sev->mirror_entry);
+ mutex_unlock(&owner_kvm->lock);
+ kvm_put_kvm(owner_kvm);
+ return;
+ }
+
+ /*
+ * If userspace was terminated before unregistering the memory regions,
+ * then unpin all the registered memory.
+ */
+ if (!list_empty(head)) {
+ list_for_each_safe(pos, q, head) {
+ __unregister_enc_region_locked(kvm,
+ list_entry(pos, struct enc_region, list));
+ cond_resched();
+ }
+ }
+
+ if (sev_snp_guest(kvm)) {
+ snp_guest_req_cleanup(kvm);
+
+ /*
+ * Decommission handles unbinding of the ASID. If it fails for
+ * some unexpected reason, just leak the ASID.
+ */
+ if (snp_decommission_context(kvm))
+ return;
+ } else {
+ sev_unbind_asid(kvm, sev->handle);
+ }
+
+ sev_asid_free(sev);
+}
+
+void __init sev_set_cpu_caps(void)
+{
+ if (sev_enabled) {
+ kvm_cpu_cap_set(X86_FEATURE_SEV);
+ kvm_caps.supported_vm_types |= BIT(KVM_X86_SEV_VM);
+ }
+ if (sev_es_enabled) {
+ kvm_cpu_cap_set(X86_FEATURE_SEV_ES);
+ kvm_caps.supported_vm_types |= BIT(KVM_X86_SEV_ES_VM);
+ }
+ if (sev_snp_enabled) {
+ kvm_cpu_cap_set(X86_FEATURE_SEV_SNP);
+ kvm_caps.supported_vm_types |= BIT(KVM_X86_SNP_VM);
+ }
+}
+
+static bool is_sev_snp_initialized(void)
+{
+ struct sev_user_data_snp_status *status;
+ struct sev_data_snp_addr buf;
+ bool initialized = false;
+ int ret, error = 0;
+
+ status = snp_alloc_firmware_page(GFP_KERNEL | __GFP_ZERO);
+ if (!status)
+ return false;
+
+ buf.address = __psp_pa(status);
+ ret = sev_do_cmd(SEV_CMD_SNP_PLATFORM_STATUS, &buf, &error);
+ if (ret) {
+ pr_err("SEV: SNP_PLATFORM_STATUS failed ret=%d, fw_error=%d (%#x)\n",
+ ret, error, error);
+ goto out;
+ }
+
+ initialized = !!status->state;
+
+out:
+ snp_free_firmware_page(status);
+
+ return initialized;
+}
+
+void __init sev_hardware_setup(void)
+{
+ unsigned int eax, ebx, ecx, edx, sev_asid_count, sev_es_asid_count;
+ struct sev_platform_init_args init_args = {0};
+ bool sev_snp_supported = false;
+ bool sev_es_supported = false;
+ bool sev_supported = false;
+
+ if (!sev_enabled || !npt_enabled || !nrips)
+ goto out;
+
+ /*
+ * SEV must obviously be supported in hardware. Sanity check that the
+ * CPU supports decode assists, which is mandatory for SEV guests to
+ * support instruction emulation. Ditto for flushing by ASID, as SEV
+ * guests are bound to a single ASID, i.e. KVM can't rotate to a new
+ * ASID to effect a TLB flush.
+ */
+ if (!boot_cpu_has(X86_FEATURE_SEV) ||
+ WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_DECODEASSISTS)) ||
+ WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_FLUSHBYASID)))
+ goto out;
+
+ /*
+ * The kernel's initcall infrastructure lacks the ability to express
+ * dependencies between initcalls, whereas the modules infrastructure
+ * automatically handles dependencies via symbol loading. Ensure the
+ * PSP SEV driver is initialized before proceeding if KVM is built-in,
+ * as the dependency isn't handled by the initcall infrastructure.
+ */
+ if (IS_BUILTIN(CONFIG_KVM_AMD) && sev_module_init())
+ goto out;
+
+ /* Retrieve SEV CPUID information */
+ cpuid(0x8000001f, &eax, &ebx, &ecx, &edx);
+
+ /* Set encryption bit location for SEV-ES guests */
+ sev_enc_bit = ebx & 0x3f;
+
+ /* Maximum number of encrypted guests supported simultaneously */
+ max_sev_asid = ecx;
+ if (!max_sev_asid)
+ goto out;
+
+ /* Minimum ASID value that should be used for SEV guest */
+ min_sev_asid = edx;
+ sev_me_mask = 1UL << (ebx & 0x3f);
+
+ /*
+ * Initialize SEV ASID bitmaps. Allocate space for ASID 0 in the bitmap,
+ * even though it's never used, so that the bitmap is indexed by the
+ * actual ASID.
+ */
+ nr_asids = max_sev_asid + 1;
+ sev_asid_bitmap = bitmap_zalloc(nr_asids, GFP_KERNEL);
+ if (!sev_asid_bitmap)
+ goto out;
+
+ sev_reclaim_asid_bitmap = bitmap_zalloc(nr_asids, GFP_KERNEL);
+ if (!sev_reclaim_asid_bitmap) {
+ bitmap_free(sev_asid_bitmap);
+ sev_asid_bitmap = NULL;
+ goto out;
+ }
+
+ if (min_sev_asid <= max_sev_asid) {
+ sev_asid_count = max_sev_asid - min_sev_asid + 1;
+ WARN_ON_ONCE(misc_cg_set_capacity(MISC_CG_RES_SEV, sev_asid_count));
+ }
+ sev_supported = true;
+
+ /* SEV-ES support requested? */
+ if (!sev_es_enabled)
+ goto out;
+
+ /*
+ * SEV-ES requires MMIO caching as KVM doesn't have access to the guest
+ * instruction stream, i.e. can't emulate in response to a #NPF and
+ * instead relies on #NPF(RSVD) being reflected into the guest as #VC
+ * (the guest can then do a #VMGEXIT to request MMIO emulation).
+ */
+ if (!enable_mmio_caching)
+ goto out;
+
+ /* Does the CPU support SEV-ES? */
+ if (!boot_cpu_has(X86_FEATURE_SEV_ES))
+ goto out;
+
+ if (!lbrv) {
+ WARN_ONCE(!boot_cpu_has(X86_FEATURE_LBRV),
+ "LBRV must be present for SEV-ES support");
+ goto out;
+ }
+
+ /* Has the system been allocated ASIDs for SEV-ES? */
+ if (min_sev_asid == 1)
+ goto out;
+
+ min_sev_es_asid = min_snp_asid = 1;
+ max_sev_es_asid = max_snp_asid = min_sev_asid - 1;
+
+ sev_es_asid_count = min_sev_asid - 1;
+ WARN_ON_ONCE(misc_cg_set_capacity(MISC_CG_RES_SEV_ES, sev_es_asid_count));
+ sev_es_supported = true;
+ sev_snp_supported = sev_snp_enabled && cc_platform_has(CC_ATTR_HOST_SEV_SNP);
+
+out:
+ if (sev_enabled) {
+ init_args.probe = true;
+
+ if (sev_is_snp_ciphertext_hiding_supported())
+ init_args.max_snp_asid = min(nr_ciphertext_hiding_asids,
+ min_sev_asid - 1);
+
+ if (sev_platform_init(&init_args))
+ sev_supported = sev_es_supported = sev_snp_supported = false;
+ else if (sev_snp_supported)
+ sev_snp_supported = is_sev_snp_initialized();
+
+ if (sev_snp_supported) {
+ snp_supported_policy_bits = sev_get_snp_policy_bits() &
+ KVM_SNP_POLICY_MASK_VALID;
+ nr_ciphertext_hiding_asids = init_args.max_snp_asid;
+ }
+
+ /*
+ * If ciphertext hiding is enabled, the joint SEV-ES/SEV-SNP
+ * ASID range is partitioned into separate SEV-ES and SEV-SNP
+ * ASID ranges, with the SEV-SNP range being [1..max_snp_asid]
+ * and the SEV-ES range being (max_snp_asid..max_sev_es_asid].
+ * Note, SEV-ES may effectively be disabled if all ASIDs from
+ * the joint range are assigned to SEV-SNP.
+ */
+ if (nr_ciphertext_hiding_asids) {
+ max_snp_asid = nr_ciphertext_hiding_asids;
+ min_sev_es_asid = max_snp_asid + 1;
+ pr_info("SEV-SNP ciphertext hiding enabled\n");
+ }
+ }
+
+ if (boot_cpu_has(X86_FEATURE_SEV))
+ pr_info("SEV %s (ASIDs %u - %u)\n",
+ sev_supported ? min_sev_asid <= max_sev_asid ? "enabled" :
+ "unusable" :
+ "disabled",
+ min_sev_asid, max_sev_asid);
+ if (boot_cpu_has(X86_FEATURE_SEV_ES))
+ pr_info("SEV-ES %s (ASIDs %u - %u)\n",
+ sev_es_supported ? min_sev_es_asid <= max_sev_es_asid ? "enabled" :
+ "unusable" :
+ "disabled",
+ min_sev_es_asid, max_sev_es_asid);
+ if (boot_cpu_has(X86_FEATURE_SEV_SNP))
+ pr_info("SEV-SNP %s (ASIDs %u - %u)\n",
+ str_enabled_disabled(sev_snp_supported),
+ min_snp_asid, max_snp_asid);
+
+ sev_enabled = sev_supported;
+ sev_es_enabled = sev_es_supported;
+ sev_snp_enabled = sev_snp_supported;
+
+ if (!sev_es_enabled || !cpu_feature_enabled(X86_FEATURE_DEBUG_SWAP) ||
+ !cpu_feature_enabled(X86_FEATURE_NO_NESTED_DATA_BP))
+ sev_es_debug_swap_enabled = false;
+
+ sev_supported_vmsa_features = 0;
+ if (sev_es_debug_swap_enabled)
+ sev_supported_vmsa_features |= SVM_SEV_FEAT_DEBUG_SWAP;
+
+ if (sev_snp_enabled && tsc_khz && cpu_feature_enabled(X86_FEATURE_SNP_SECURE_TSC))
+ sev_supported_vmsa_features |= SVM_SEV_FEAT_SECURE_TSC;
+}
+
+void sev_hardware_unsetup(void)
+{
+ if (!sev_enabled)
+ return;
+
+ /* No need to take sev_bitmap_lock, all VMs have been destroyed. */
+ sev_flush_asids(1, max_sev_asid);
+
+ bitmap_free(sev_asid_bitmap);
+ bitmap_free(sev_reclaim_asid_bitmap);
+
+ misc_cg_set_capacity(MISC_CG_RES_SEV, 0);
+ misc_cg_set_capacity(MISC_CG_RES_SEV_ES, 0);
+
+ sev_platform_shutdown();
+}
+
+int sev_cpu_init(struct svm_cpu_data *sd)
+{
+ if (!sev_enabled)
+ return 0;
+
+ sd->sev_vmcbs = kcalloc(nr_asids, sizeof(void *), GFP_KERNEL);
+ if (!sd->sev_vmcbs)
+ return -ENOMEM;
+
+ return 0;
+}
+
+/*
+ * Pages used by hardware to hold guest encrypted state must be flushed before
+ * returning them to the system.
+ */
+static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
+{
+ unsigned int asid = sev_get_asid(vcpu->kvm);
+
+ /*
+ * Note! The address must be a kernel address, as regular page walk
+ * checks are performed by VM_PAGE_FLUSH, i.e. operating on a user
+ * address is non-deterministic and unsafe. This function deliberately
+ * takes a pointer to deter passing in a user address.
+ */
+ unsigned long addr = (unsigned long)va;
+
+ /*
+ * If the CPU enforces cache coherency for encrypted mappings of the
+ * same physical page, use CLFLUSHOPT instead. NOTE: a cache flush is
+ * still needed in order to work properly with DMA devices.
+ */
+ if (boot_cpu_has(X86_FEATURE_SME_COHERENT)) {
+ clflush_cache_range(va, PAGE_SIZE);
+ return;
+ }
+
+ /*
+ * VM Page Flush takes a host virtual address and a guest ASID. Fall
+ * back to full writeback of caches if this faults so as not to make
+ * any problems worse by leaving stale encrypted data in the cache.
+ */
+ if (WARN_ON_ONCE(wrmsrq_safe(MSR_AMD64_VM_PAGE_FLUSH, addr | asid)))
+ goto do_sev_writeback_caches;
+
+ return;
+
+do_sev_writeback_caches:
+ sev_writeback_caches(vcpu->kvm);
+}
+
+void sev_guest_memory_reclaimed(struct kvm *kvm)
+{
+ /*
+ * With SNP+gmem, private/encrypted memory is unreachable via the
+ * hva-based mmu notifiers, i.e. these events are explicitly scoped to
+ * shared pages, where there's no need to flush caches.
+ */
+ if (!sev_guest(kvm) || sev_snp_guest(kvm))
+ return;
+
+ sev_writeback_caches(kvm);
+}
+
+void sev_free_vcpu(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm;
+
+ if (!sev_es_guest(vcpu->kvm))
+ return;
+
+ svm = to_svm(vcpu);
+
+ /*
+ * If it's an SNP guest, then the VMSA was marked in the RMP table as
+ * a guest-owned page. Transition the page to hypervisor state before
+ * releasing it back to the system.
+ */
+ if (sev_snp_guest(vcpu->kvm)) {
+ u64 pfn = __pa(svm->sev_es.vmsa) >> PAGE_SHIFT;
+
+ if (kvm_rmp_make_shared(vcpu->kvm, pfn, PG_LEVEL_4K))
+ goto skip_vmsa_free;
+ }
+
+ if (vcpu->arch.guest_state_protected)
+ sev_flush_encrypted_page(vcpu, svm->sev_es.vmsa);
+
+ __free_page(virt_to_page(svm->sev_es.vmsa));
+
+skip_vmsa_free:
+ if (svm->sev_es.ghcb_sa_free)
+ kvfree(svm->sev_es.ghcb_sa);
+}
+
+static u64 kvm_get_cached_sw_exit_code(struct vmcb_control_area *control)
+{
+ return (((u64)control->exit_code_hi) << 32) | control->exit_code;
+}
+
+static void dump_ghcb(struct vcpu_svm *svm)
+{
+ struct vmcb_control_area *control = &svm->vmcb->control;
+ unsigned int nbits;
+
+ /* Re-use the dump_invalid_vmcb module parameter */
+ if (!dump_invalid_vmcb) {
+ pr_warn_ratelimited("set kvm_amd.dump_invalid_vmcb=1 to dump internal KVM state.\n");
+ return;
+ }
+
+ nbits = sizeof(svm->sev_es.valid_bitmap) * 8;
+
+ /*
+ * Print KVM's snapshot of the GHCB values that were (unsuccessfully)
+ * used to handle the exit. If the guest has since modified the GHCB
+ * itself, dumping the raw GHCB won't help debug why KVM was unable to
+ * handle the VMGEXIT that KVM observed.
+ */
+ pr_err("GHCB (GPA=%016llx) snapshot:\n", svm->vmcb->control.ghcb_gpa);
+ pr_err("%-20s%016llx is_valid: %u\n", "sw_exit_code",
+ kvm_get_cached_sw_exit_code(control), kvm_ghcb_sw_exit_code_is_valid(svm));
+ pr_err("%-20s%016llx is_valid: %u\n", "sw_exit_info_1",
+ control->exit_info_1, kvm_ghcb_sw_exit_info_1_is_valid(svm));
+ pr_err("%-20s%016llx is_valid: %u\n", "sw_exit_info_2",
+ control->exit_info_2, kvm_ghcb_sw_exit_info_2_is_valid(svm));
+ pr_err("%-20s%016llx is_valid: %u\n", "sw_scratch",
+ svm->sev_es.sw_scratch, kvm_ghcb_sw_scratch_is_valid(svm));
+ pr_err("%-20s%*pb\n", "valid_bitmap", nbits, svm->sev_es.valid_bitmap);
+}
+
+static void sev_es_sync_to_ghcb(struct vcpu_svm *svm)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct ghcb *ghcb = svm->sev_es.ghcb;
+
+ /*
+ * The GHCB protocol so far allows for the following data
+ * to be returned:
+ * GPRs RAX, RBX, RCX, RDX
+ *
+ * Copy their values, even if they may not have been written during the
+ * VM-Exit. It's the guest's responsibility to not consume random data.
+ */
+ ghcb_set_rax(ghcb, vcpu->arch.regs[VCPU_REGS_RAX]);
+ ghcb_set_rbx(ghcb, vcpu->arch.regs[VCPU_REGS_RBX]);
+ ghcb_set_rcx(ghcb, vcpu->arch.regs[VCPU_REGS_RCX]);
+ ghcb_set_rdx(ghcb, vcpu->arch.regs[VCPU_REGS_RDX]);
+}
+
+static void sev_es_sync_from_ghcb(struct vcpu_svm *svm)
+{
+ struct vmcb_control_area *control = &svm->vmcb->control;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct ghcb *ghcb = svm->sev_es.ghcb;
+ u64 exit_code;
+
+ /*
+ * The GHCB protocol so far allows for the following data
+ * to be supplied:
+ * GPRs RAX, RBX, RCX, RDX
+ * XCR0
+ * CPL
+ *
+ * VMMCALL allows the guest to provide extra registers. KVM also
+ * expects RSI for hypercalls, so include that, too.
+ *
+ * Copy their values to the appropriate location if supplied.
+ */
+ memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
+
+ BUILD_BUG_ON(sizeof(svm->sev_es.valid_bitmap) != sizeof(ghcb->save.valid_bitmap));
+ memcpy(&svm->sev_es.valid_bitmap, &ghcb->save.valid_bitmap, sizeof(ghcb->save.valid_bitmap));
+
+ vcpu->arch.regs[VCPU_REGS_RAX] = kvm_ghcb_get_rax_if_valid(svm);
+ vcpu->arch.regs[VCPU_REGS_RBX] = kvm_ghcb_get_rbx_if_valid(svm);
+ vcpu->arch.regs[VCPU_REGS_RCX] = kvm_ghcb_get_rcx_if_valid(svm);
+ vcpu->arch.regs[VCPU_REGS_RDX] = kvm_ghcb_get_rdx_if_valid(svm);
+ vcpu->arch.regs[VCPU_REGS_RSI] = kvm_ghcb_get_rsi_if_valid(svm);
+
+ svm->vmcb->save.cpl = kvm_ghcb_get_cpl_if_valid(svm);
+
+ if (kvm_ghcb_xcr0_is_valid(svm))
+ __kvm_set_xcr(vcpu, 0, kvm_ghcb_get_xcr0(svm));
+
+ if (kvm_ghcb_xss_is_valid(svm))
+ __kvm_emulate_msr_write(vcpu, MSR_IA32_XSS, kvm_ghcb_get_xss(svm));
+
+ /* Copy the GHCB exit information into the VMCB fields */
+ exit_code = kvm_ghcb_get_sw_exit_code(svm);
+ control->exit_code = lower_32_bits(exit_code);
+ control->exit_code_hi = upper_32_bits(exit_code);
+ control->exit_info_1 = kvm_ghcb_get_sw_exit_info_1(svm);
+ control->exit_info_2 = kvm_ghcb_get_sw_exit_info_2(svm);
+ svm->sev_es.sw_scratch = kvm_ghcb_get_sw_scratch_if_valid(svm);
+
+ /* Clear the valid-bitmap entries */
+ memset(ghcb->save.valid_bitmap, 0, sizeof(ghcb->save.valid_bitmap));
+}
+
+static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
+{
+ struct vmcb_control_area *control = &svm->vmcb->control;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ u64 exit_code;
+ u64 reason;
+
+ /*
+ * Retrieve the exit code now even though it may not be marked valid
+ * as it could help with debugging.
+ */
+ exit_code = kvm_get_cached_sw_exit_code(control);
+
+ /* Only GHCB Usage code 0 is supported */
+ if (svm->sev_es.ghcb->ghcb_usage) {
+ reason = GHCB_ERR_INVALID_USAGE;
+ goto vmgexit_err;
+ }
+
+ reason = GHCB_ERR_MISSING_INPUT;
+
+ if (!kvm_ghcb_sw_exit_code_is_valid(svm) ||
+ !kvm_ghcb_sw_exit_info_1_is_valid(svm) ||
+ !kvm_ghcb_sw_exit_info_2_is_valid(svm))
+ goto vmgexit_err;
+
+ switch (exit_code) {
+ case SVM_EXIT_READ_DR7:
+ break;
+ case SVM_EXIT_WRITE_DR7:
+ if (!kvm_ghcb_rax_is_valid(svm))
+ goto vmgexit_err;
+ break;
+ case SVM_EXIT_RDTSC:
+ break;
+ case SVM_EXIT_RDPMC:
+ if (!kvm_ghcb_rcx_is_valid(svm))
+ goto vmgexit_err;
+ break;
+ case SVM_EXIT_CPUID:
+ if (!kvm_ghcb_rax_is_valid(svm) ||
+ !kvm_ghcb_rcx_is_valid(svm))
+ goto vmgexit_err;
+ if (vcpu->arch.regs[VCPU_REGS_RAX] == 0xd)
+ if (!kvm_ghcb_xcr0_is_valid(svm))
+ goto vmgexit_err;
+ break;
+ case SVM_EXIT_INVD:
+ break;
+ case SVM_EXIT_IOIO:
+ if (control->exit_info_1 & SVM_IOIO_STR_MASK) {
+ if (!kvm_ghcb_sw_scratch_is_valid(svm))
+ goto vmgexit_err;
+ } else {
+ if (!(control->exit_info_1 & SVM_IOIO_TYPE_MASK))
+ if (!kvm_ghcb_rax_is_valid(svm))
+ goto vmgexit_err;
+ }
+ break;
+ case SVM_EXIT_MSR:
+ if (!kvm_ghcb_rcx_is_valid(svm))
+ goto vmgexit_err;
+ if (control->exit_info_1) {
+ if (!kvm_ghcb_rax_is_valid(svm) ||
+ !kvm_ghcb_rdx_is_valid(svm))
+ goto vmgexit_err;
+ }
+ break;
+ case SVM_EXIT_VMMCALL:
+ if (!kvm_ghcb_rax_is_valid(svm) ||
+ !kvm_ghcb_cpl_is_valid(svm))
+ goto vmgexit_err;
+ break;
+ case SVM_EXIT_RDTSCP:
+ break;
+ case SVM_EXIT_WBINVD:
+ break;
+ case SVM_EXIT_MONITOR:
+ if (!kvm_ghcb_rax_is_valid(svm) ||
+ !kvm_ghcb_rcx_is_valid(svm) ||
+ !kvm_ghcb_rdx_is_valid(svm))
+ goto vmgexit_err;
+ break;
+ case SVM_EXIT_MWAIT:
+ if (!kvm_ghcb_rax_is_valid(svm) ||
+ !kvm_ghcb_rcx_is_valid(svm))
+ goto vmgexit_err;
+ break;
+ case SVM_VMGEXIT_MMIO_READ:
+ case SVM_VMGEXIT_MMIO_WRITE:
+ if (!kvm_ghcb_sw_scratch_is_valid(svm))
+ goto vmgexit_err;
+ break;
+ case SVM_VMGEXIT_AP_CREATION:
+ if (!sev_snp_guest(vcpu->kvm))
+ goto vmgexit_err;
+ if (lower_32_bits(control->exit_info_1) != SVM_VMGEXIT_AP_DESTROY)
+ if (!kvm_ghcb_rax_is_valid(svm))
+ goto vmgexit_err;
+ break;
+ case SVM_VMGEXIT_NMI_COMPLETE:
+ case SVM_VMGEXIT_AP_HLT_LOOP:
+ case SVM_VMGEXIT_AP_JUMP_TABLE:
+ case SVM_VMGEXIT_UNSUPPORTED_EVENT:
+ case SVM_VMGEXIT_HV_FEATURES:
+ case SVM_VMGEXIT_TERM_REQUEST:
+ break;
+ case SVM_VMGEXIT_PSC:
+ if (!sev_snp_guest(vcpu->kvm) || !kvm_ghcb_sw_scratch_is_valid(svm))
+ goto vmgexit_err;
+ break;
+ case SVM_VMGEXIT_GUEST_REQUEST:
+ case SVM_VMGEXIT_EXT_GUEST_REQUEST:
+ if (!sev_snp_guest(vcpu->kvm) ||
+ !PAGE_ALIGNED(control->exit_info_1) ||
+ !PAGE_ALIGNED(control->exit_info_2) ||
+ control->exit_info_1 == control->exit_info_2)
+ goto vmgexit_err;
+ break;
+ default:
+ reason = GHCB_ERR_INVALID_EVENT;
+ goto vmgexit_err;
+ }
+
+ return 0;
+
+vmgexit_err:
+ if (reason == GHCB_ERR_INVALID_USAGE) {
+ vcpu_unimpl(vcpu, "vmgexit: ghcb usage %#x is not valid\n",
+ svm->sev_es.ghcb->ghcb_usage);
+ } else if (reason == GHCB_ERR_INVALID_EVENT) {
+ vcpu_unimpl(vcpu, "vmgexit: exit code %#llx is not valid\n",
+ exit_code);
+ } else {
+ vcpu_unimpl(vcpu, "vmgexit: exit code %#llx input is not valid\n",
+ exit_code);
+ dump_ghcb(svm);
+ }
+
+ svm_vmgexit_bad_input(svm, reason);
+
+ /* Resume the guest to "return" the error code. */
+ return 1;
+}
+
+void sev_es_unmap_ghcb(struct vcpu_svm *svm)
+{
+ /* Clear any indication that the vCPU is in a type of AP Reset Hold */
+ svm->sev_es.ap_reset_hold_type = AP_RESET_HOLD_NONE;
+
+ if (!svm->sev_es.ghcb)
+ return;
+
+ if (svm->sev_es.ghcb_sa_free) {
+ /*
+ * The scratch area lives outside the GHCB, so there is a
+ * buffer that, depending on the operation performed, may
+ * need to be synced, then freed.
+ */
+ if (svm->sev_es.ghcb_sa_sync) {
+ kvm_write_guest(svm->vcpu.kvm,
+ svm->sev_es.sw_scratch,
+ svm->sev_es.ghcb_sa,
+ svm->sev_es.ghcb_sa_len);
+ svm->sev_es.ghcb_sa_sync = false;
+ }
+
+ kvfree(svm->sev_es.ghcb_sa);
+ svm->sev_es.ghcb_sa = NULL;
+ svm->sev_es.ghcb_sa_free = false;
+ }
+
+ trace_kvm_vmgexit_exit(svm->vcpu.vcpu_id, svm->sev_es.ghcb);
+
+ sev_es_sync_to_ghcb(svm);
+
+ kvm_vcpu_unmap(&svm->vcpu, &svm->sev_es.ghcb_map);
+ svm->sev_es.ghcb = NULL;
+}
+
+int pre_sev_run(struct vcpu_svm *svm, int cpu)
+{
+ struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, cpu);
+ struct kvm *kvm = svm->vcpu.kvm;
+ unsigned int asid = sev_get_asid(kvm);
+
+ /*
+ * Reject KVM_RUN if userspace attempts to run the vCPU with an invalid
+ * VMSA, e.g. if userspace forces the vCPU to be RUNNABLE after an SNP
+ * AP Destroy event.
+ */
+ if (sev_es_guest(kvm) && !VALID_PAGE(svm->vmcb->control.vmsa_pa))
+ return -EINVAL;
+
+ /*
+ * To optimize cache flushes when memory is reclaimed from an SEV VM,
+ * track physical CPUs that enter the guest for SEV VMs and thus can
+ * have encrypted, dirty data in the cache, and flush caches only for
+ * CPUs that have entered the guest.
+ */
+ if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
+ cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
+
+ /* Assign the asid allocated with this SEV guest */
+ svm->asid = asid;
+
+ /*
+ * Flush guest TLB:
+ *
+ * 1) when a different VMCB for the same ASID is to be run on the same host CPU, or
+ * 2) when this VMCB was executed on a different host CPU in previous VMRUNs.
+ */
+ if (sd->sev_vmcbs[asid] == svm->vmcb &&
+ svm->vcpu.arch.last_vmentry_cpu == cpu)
+ return 0;
+
+ sd->sev_vmcbs[asid] = svm->vmcb;
+ svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
+ vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
+ return 0;
+}
+
+#define GHCB_SCRATCH_AREA_LIMIT (16ULL * PAGE_SIZE)
+static int setup_vmgexit_scratch(struct vcpu_svm *svm, bool sync, u64 len)
+{
+ struct vmcb_control_area *control = &svm->vmcb->control;
+ u64 ghcb_scratch_beg, ghcb_scratch_end;
+ u64 scratch_gpa_beg, scratch_gpa_end;
+ void *scratch_va;
+
+ scratch_gpa_beg = svm->sev_es.sw_scratch;
+ if (!scratch_gpa_beg) {
+ pr_err("vmgexit: scratch gpa not provided\n");
+ goto e_scratch;
+ }
+
+ scratch_gpa_end = scratch_gpa_beg + len;
+ if (scratch_gpa_end < scratch_gpa_beg) {
+ pr_err("vmgexit: scratch length (%#llx) not valid for scratch address (%#llx)\n",
+ len, scratch_gpa_beg);
+ goto e_scratch;
+ }
+
+ if ((scratch_gpa_beg & PAGE_MASK) == control->ghcb_gpa) {
+ /* Scratch area begins within GHCB */
+ ghcb_scratch_beg = control->ghcb_gpa +
+ offsetof(struct ghcb, shared_buffer);
+ ghcb_scratch_end = control->ghcb_gpa +
+ offsetof(struct ghcb, reserved_0xff0);
+
+ /*
+ * If the scratch area begins within the GHCB, it must be
+ * completely contained in the GHCB shared buffer area.
+ */
+ if (scratch_gpa_beg < ghcb_scratch_beg ||
+ scratch_gpa_end > ghcb_scratch_end) {
+ pr_err("vmgexit: scratch area is outside of GHCB shared buffer area (%#llx - %#llx)\n",
+ scratch_gpa_beg, scratch_gpa_end);
+ goto e_scratch;
+ }
+
+ scratch_va = (void *)svm->sev_es.ghcb;
+ scratch_va += (scratch_gpa_beg - control->ghcb_gpa);
+ } else {
+ /*
+ * The guest memory must be read into a kernel buffer, so
+ * limit the size.
+ */
+ if (len > GHCB_SCRATCH_AREA_LIMIT) {
+ pr_err("vmgexit: scratch area exceeds KVM limits (%#llx requested, %#llx limit)\n",
+ len, GHCB_SCRATCH_AREA_LIMIT);
+ goto e_scratch;
+ }
+ scratch_va = kvzalloc(len, GFP_KERNEL_ACCOUNT);
+ if (!scratch_va)
+ return -ENOMEM;
+
+ if (kvm_read_guest(svm->vcpu.kvm, scratch_gpa_beg, scratch_va, len)) {
+ /* Unable to copy scratch area from guest */
+ pr_err("vmgexit: kvm_read_guest for scratch area failed\n");
+
+ kvfree(scratch_va);
+ return -EFAULT;
+ }
+
+ /*
+ * The scratch area is outside the GHCB. The operation will
+ * dictate whether the buffer needs to be synced before running
+ * the vCPU next time (i.e. a read was requested so the data
+ * must be written back to the guest memory).
+ */
+ svm->sev_es.ghcb_sa_sync = sync;
+ svm->sev_es.ghcb_sa_free = true;
+ }
+
+ svm->sev_es.ghcb_sa = scratch_va;
+ svm->sev_es.ghcb_sa_len = len;
+
+ return 0;
+
+e_scratch:
+ svm_vmgexit_bad_input(svm, GHCB_ERR_INVALID_SCRATCH_AREA);
+
+ return 1;
+}
+
+static void set_ghcb_msr_bits(struct vcpu_svm *svm, u64 value, u64 mask,
+ unsigned int pos)
+{
+ svm->vmcb->control.ghcb_gpa &= ~(mask << pos);
+ svm->vmcb->control.ghcb_gpa |= (value & mask) << pos;
+}
+
+static u64 get_ghcb_msr_bits(struct vcpu_svm *svm, u64 mask, unsigned int pos)
+{
+ return (svm->vmcb->control.ghcb_gpa >> pos) & mask;
+}
+
+static void set_ghcb_msr(struct vcpu_svm *svm, u64 value)
+{
+ svm->vmcb->control.ghcb_gpa = value;
+}
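+
+/*
+ * Worked example (illustrative, not part of this patch): per the GHCB MSR
+ * protocol, bits [11:0] of the GHCB MSR carry the request/response code
+ * and the upper bits carry operation-specific data. E.g. a CPUID response
+ * placing a hypothetical value 0x8000001f in the requested register would
+ * be composed with the helpers above as:
+ *
+ *	set_ghcb_msr_bits(svm, 0x8000001f, GHCB_MSR_CPUID_VALUE_MASK,
+ *			  GHCB_MSR_CPUID_VALUE_POS);
+ *	set_ghcb_msr_bits(svm, GHCB_MSR_CPUID_RESP, GHCB_MSR_INFO_MASK,
+ *			  GHCB_MSR_INFO_POS);
+ */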
+
+static int snp_rmptable_psmash(kvm_pfn_t pfn)
+{
+ int ret;
+
+ pfn = pfn & ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
+
+ /*
+ * PSMASH_FAIL_INUSE indicates another processor is modifying the
+ * entry, so retry until that's no longer the case.
+ */
+ do {
+ ret = psmash(pfn);
+ } while (ret == PSMASH_FAIL_INUSE);
+
+ return ret;
+}
+
+static int snp_complete_psc_msr(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ if (vcpu->run->hypercall.ret)
+ set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
+ else
+ set_ghcb_msr(svm, GHCB_MSR_PSC_RESP);
+
+ return 1; /* resume guest */
+}
+
+static int snp_begin_psc_msr(struct vcpu_svm *svm, u64 ghcb_msr)
+{
+ u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
+ u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+
+ if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
+ set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
+ return 1; /* resume guest */
+ }
+
+ if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) {
+ set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
+ return 1; /* resume guest */
+ }
+
+ vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
+ vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
+ /*
+ * In principle this should have been -KVM_ENOSYS, but userspace (QEMU <=9.2)
+ * assumed that vcpu->run->hypercall.ret is never changed by KVM and thus that
+ * it was always zero on KVM_EXIT_HYPERCALL. Since KVM is now overwriting
+ * vcpu->run->hypercall.ret, ensuring that it is zero to not break QEMU.
+ */
+ vcpu->run->hypercall.ret = 0;
+ vcpu->run->hypercall.args[0] = gpa;
+ vcpu->run->hypercall.args[1] = 1;
+ vcpu->run->hypercall.args[2] = (op == SNP_PAGE_STATE_PRIVATE)
+ ? KVM_MAP_GPA_RANGE_ENCRYPTED
+ : KVM_MAP_GPA_RANGE_DECRYPTED;
+ vcpu->run->hypercall.args[2] |= KVM_MAP_GPA_RANGE_PAGE_SZ_4K;
+
+ vcpu->arch.complete_userspace_io = snp_complete_psc_msr;
+
+ return 0; /* forward request to userspace */
+}
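+
+/*
+ * Usage sketch (illustrative, not part of this patch): when the PSC
+ * request is forwarded, userspace sees a KVM_EXIT_HYPERCALL for
+ * KVM_HC_MAP_GPA_RANGE, flips the memory attributes accordingly, and
+ * resumes the vCPU with hypercall.ret indicating success/failure. Names
+ * below are hypothetical and error handling is elided.
+ *
+ *	struct kvm_run *run = ...;  // mmap'd vCPU run struct
+ *
+ *	if (run->exit_reason == KVM_EXIT_HYPERCALL &&
+ *	    run->hypercall.nr == KVM_HC_MAP_GPA_RANGE) {
+ *		struct kvm_memory_attributes attrs = {
+ *			.address = run->hypercall.args[0],
+ *			.size = run->hypercall.args[1] * 4096,
+ *			.attributes = (run->hypercall.args[2] & KVM_MAP_GPA_RANGE_ENCRYPTED)
+ *				      ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0,
+ *		};
+ *
+ *		run->hypercall.ret = ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
+ *	}
+ */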
+
+struct psc_buffer {
+ struct psc_hdr hdr;
+ struct psc_entry entries[];
+} __packed;
+
+static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc);
+
+static void snp_complete_psc(struct vcpu_svm *svm, u64 psc_ret)
+{
+ svm->sev_es.psc_inflight = 0;
+ svm->sev_es.psc_idx = 0;
+ svm->sev_es.psc_2m = false;
+
+ /*
+ * PSC requests always get a "no action" response in SW_EXITINFO1, with
+ * a PSC-specific return code in SW_EXITINFO2 that provides the "real"
+ * return code. E.g. if the PSC request was interrupted, the need to
+ * retry is communicated via SW_EXITINFO2, not SW_EXITINFO1.
+ */
+ svm_vmgexit_no_action(svm, psc_ret);
+}
+
+static void __snp_complete_one_psc(struct vcpu_svm *svm)
+{
+ struct psc_buffer *psc = svm->sev_es.ghcb_sa;
+ struct psc_entry *entries = psc->entries;
+ struct psc_hdr *hdr = &psc->hdr;
+ __u16 idx;
+
+ /*
+ * Everything in-flight has been processed successfully. Update the
+ * corresponding entries in the guest's PSC buffer and zero out the
+ * count of in-flight PSC entries.
+ */
+ for (idx = svm->sev_es.psc_idx; svm->sev_es.psc_inflight;
+ svm->sev_es.psc_inflight--, idx++) {
+ struct psc_entry *entry = &entries[idx];
+
+ entry->cur_page = entry->pagesize ? 512 : 1;
+ }
+
+ hdr->cur_entry = idx;
+}
+
+static int snp_complete_one_psc(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct psc_buffer *psc = svm->sev_es.ghcb_sa;
+
+ if (vcpu->run->hypercall.ret) {
+ snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
+ return 1; /* resume guest */
+ }
+
+ __snp_complete_one_psc(svm);
+
+ /* Handle the next range (if any). */
+ return snp_begin_psc(svm, psc);
+}
+
+static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
+{
+ struct psc_entry *entries = psc->entries;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct psc_hdr *hdr = &psc->hdr;
+ struct psc_entry entry_start;
+ u16 idx, idx_start, idx_end;
+ int npages;
+ bool huge;
+ u64 gfn;
+
+ if (!user_exit_on_hypercall(vcpu->kvm, KVM_HC_MAP_GPA_RANGE)) {
+ snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
+ return 1;
+ }
+
+next_range:
+ /* There should be no other PSCs in-flight at this point. */
+ if (WARN_ON_ONCE(svm->sev_es.psc_inflight)) {
+ snp_complete_psc(svm, VMGEXIT_PSC_ERROR_GENERIC);
+ return 1;
+ }
+
+ /*
+ * The PSC descriptor buffer can be modified by a misbehaving guest after
+ * validation, so take care to only use validated copies of values used
+ * for things like array indexing.
+ */
+ idx_start = hdr->cur_entry;
+ idx_end = hdr->end_entry;
+
+ if (idx_end >= VMGEXIT_PSC_MAX_COUNT) {
+ snp_complete_psc(svm, VMGEXIT_PSC_ERROR_INVALID_HDR);
+ return 1;
+ }
+
+ /* Find the start of the next range which needs processing. */
+ for (idx = idx_start; idx <= idx_end; idx++, hdr->cur_entry++) {
+ entry_start = entries[idx];
+
+ gfn = entry_start.gfn;
+ huge = entry_start.pagesize;
+ npages = huge ? 512 : 1;
+
+ if (entry_start.cur_page > npages || !IS_ALIGNED(gfn, npages)) {
+ snp_complete_psc(svm, VMGEXIT_PSC_ERROR_INVALID_ENTRY);
+ return 1;
+ }
+
+ if (entry_start.cur_page) {
+ /*
+ * If this is a partially-completed 2M range, force 4K handling
+ * for the remaining pages since they're effectively split at
+ * this point. Subsequent code should ensure this doesn't get
+ * combined with adjacent PSC entries where 2M handling is still
+ * possible.
+ */
+ npages -= entry_start.cur_page;
+ gfn += entry_start.cur_page;
+ huge = false;
+ }
+
+ if (npages)
+ break;
+ }
+
+ if (idx > idx_end) {
+ /* Nothing more to process. */
+ snp_complete_psc(svm, 0);
+ return 1;
+ }
+
+ svm->sev_es.psc_2m = huge;
+ svm->sev_es.psc_idx = idx;
+ svm->sev_es.psc_inflight = 1;
+
+ /*
+ * Find all subsequent PSC entries that contain adjacent GPA
+ * ranges/operations and can be combined into a single
+ * KVM_HC_MAP_GPA_RANGE exit.
+ */
+ while (++idx <= idx_end) {
+ struct psc_entry entry = entries[idx];
+
+ if (entry.operation != entry_start.operation ||
+ entry.gfn != entry_start.gfn + npages ||
+ entry.cur_page || !!entry.pagesize != huge)
+ break;
+
+ svm->sev_es.psc_inflight++;
+ npages += huge ? 512 : 1;
+ }
+
+ switch (entry_start.operation) {
+ case VMGEXIT_PSC_OP_PRIVATE:
+ case VMGEXIT_PSC_OP_SHARED:
+ vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
+ vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
+ /*
+ * In principle this should have been -KVM_ENOSYS, but userspace (QEMU <=9.2)
+ * assumed that vcpu->run->hypercall.ret is never changed by KVM and thus that
+ * it was always zero on KVM_EXIT_HYPERCALL. Since KVM now overwrites
+ * vcpu->run->hypercall.ret, explicitly set it to zero to not break QEMU.
+ */
+ vcpu->run->hypercall.ret = 0;
+ vcpu->run->hypercall.args[0] = gfn_to_gpa(gfn);
+ vcpu->run->hypercall.args[1] = npages;
+ vcpu->run->hypercall.args[2] = entry_start.operation == VMGEXIT_PSC_OP_PRIVATE
+ ? KVM_MAP_GPA_RANGE_ENCRYPTED
+ : KVM_MAP_GPA_RANGE_DECRYPTED;
+ vcpu->run->hypercall.args[2] |= entry_start.pagesize
+ ? KVM_MAP_GPA_RANGE_PAGE_SZ_2M
+ : KVM_MAP_GPA_RANGE_PAGE_SZ_4K;
+ vcpu->arch.complete_userspace_io = snp_complete_one_psc;
+ return 0; /* forward request to userspace */
+ default:
+ /*
+ * Only shared/private PSC operations are currently supported, so if the
+ * entire range consists of unsupported operations (e.g. SMASH/UNSMASH),
+ * then consider the entire range completed and avoid exiting to
+ * userspace. In theory snp_complete_psc() can always be called directly
+ * at this point to complete the current range and start the next one,
+ * but that could lead to unexpected levels of recursion.
+ */
+ __snp_complete_one_psc(svm);
+ goto next_range;
+ }
+
+ BUG();
+}
+
+/*
+ * Invoked as part of svm_vcpu_reset() processing of an init event.
+ */
+static void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct kvm_memory_slot *slot;
+ struct page *page;
+ kvm_pfn_t pfn;
+ gfn_t gfn;
+
+ guard(mutex)(&svm->sev_es.snp_vmsa_mutex);
+
+ if (!svm->sev_es.snp_ap_waiting_for_reset)
+ return;
+
+ svm->sev_es.snp_ap_waiting_for_reset = false;
+
+ /* Mark the vCPU as offline and not runnable */
+ vcpu->arch.pv.pv_unhalted = false;
+ kvm_set_mp_state(vcpu, KVM_MP_STATE_HALTED);
+
+ /* Clear use of the VMSA */
+ svm->vmcb->control.vmsa_pa = INVALID_PAGE;
+
+ /*
+ * When replacing the VMSA during SEV-SNP AP creation,
+ * mark the VMCB dirty so that full state is always reloaded.
+ */
+ vmcb_mark_all_dirty(svm->vmcb);
+
+ if (!VALID_PAGE(svm->sev_es.snp_vmsa_gpa))
+ return;
+
+ gfn = gpa_to_gfn(svm->sev_es.snp_vmsa_gpa);
+ svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
+
+ slot = gfn_to_memslot(vcpu->kvm, gfn);
+ if (!slot)
+ return;
+
+ /*
+ * The new VMSA will be private guest memory, so retrieve the
+ * PFN from the gmem backend.
+ */
+ if (kvm_gmem_get_pfn(vcpu->kvm, slot, gfn, &pfn, &page, NULL))
+ return;
+
+ /*
+ * From this point forward, the VMSA will always be a guest-mapped page
+ * rather than the initial one allocated by KVM in svm->sev_es.vmsa. In
+ * theory, svm->sev_es.vmsa could be freed and cleaned up here, but
+ * that involves cleanups like flushing caches, which would ideally be
+ * handled during teardown rather than guest boot. Deferring that also
+ * allows the existing logic for SEV-ES VMSAs to be re-used with
+ * minimal SNP-specific changes.
+ */
+ svm->sev_es.snp_has_guest_vmsa = true;
+
+ /* Use the new VMSA */
+ svm->vmcb->control.vmsa_pa = pfn_to_hpa(pfn);
+
+ /* Mark the vCPU as runnable */
+ kvm_set_mp_state(vcpu, KVM_MP_STATE_RUNNABLE);
+
+ /*
+ * gmem pages aren't currently migratable, but if this ever changes
+ * then care should be taken to ensure svm->sev_es.vmsa is pinned
+ * through some other means.
+ */
+ kvm_release_page_clean(page);
+}
+
+static int sev_snp_ap_creation(struct vcpu_svm *svm)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(svm->vcpu.kvm);
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm_vcpu *target_vcpu;
+ struct vcpu_svm *target_svm;
+ unsigned int request;
+ unsigned int apic_id;
+
+ request = lower_32_bits(svm->vmcb->control.exit_info_1);
+ apic_id = upper_32_bits(svm->vmcb->control.exit_info_1);
+
+ /* Validate the APIC ID */
+ target_vcpu = kvm_get_vcpu_by_id(vcpu->kvm, apic_id);
+ if (!target_vcpu) {
+ vcpu_unimpl(vcpu, "vmgexit: invalid AP APIC ID [%#x] from guest\n",
+ apic_id);
+ return -EINVAL;
+ }
+
+ target_svm = to_svm(target_vcpu);
+
+ guard(mutex)(&target_svm->sev_es.snp_vmsa_mutex);
+
+ switch (request) {
+ case SVM_VMGEXIT_AP_CREATE_ON_INIT:
+ case SVM_VMGEXIT_AP_CREATE:
+ if (vcpu->arch.regs[VCPU_REGS_RAX] != sev->vmsa_features) {
+ vcpu_unimpl(vcpu, "vmgexit: mismatched AP sev_features [%#lx] != [%#llx] from guest\n",
+ vcpu->arch.regs[VCPU_REGS_RAX], sev->vmsa_features);
+ return -EINVAL;
+ }
+
+ if (!page_address_valid(vcpu, svm->vmcb->control.exit_info_2)) {
+ vcpu_unimpl(vcpu, "vmgexit: invalid AP VMSA address [%#llx] from guest\n",
+ svm->vmcb->control.exit_info_2);
+ return -EINVAL;
+ }
+
+ /*
+ * A malicious guest can RMPADJUST a large page into a VMSA, which
+ * will hit the SNP erratum where the CPU incorrectly signals an
+ * RMP violation #PF if a hugepage collides with the RMP entry of
+ * the VMSA page. Reject the AP CREATE request if the VMSA address
+ * from the guest is 2M-aligned.
+ */
+ if (IS_ALIGNED(svm->vmcb->control.exit_info_2, PMD_SIZE)) {
+ vcpu_unimpl(vcpu,
+ "vmgexit: AP VMSA address [%llx] from guest is unsafe as it is 2M aligned\n",
+ svm->vmcb->control.exit_info_2);
+ return -EINVAL;
+ }
+
+ target_svm->sev_es.snp_vmsa_gpa = svm->vmcb->control.exit_info_2;
+ break;
+ case SVM_VMGEXIT_AP_DESTROY:
+ target_svm->sev_es.snp_vmsa_gpa = INVALID_PAGE;
+ break;
+ default:
+ vcpu_unimpl(vcpu, "vmgexit: invalid AP creation request [%#x] from guest\n",
+ request);
+ return -EINVAL;
+ }
+
+ target_svm->sev_es.snp_ap_waiting_for_reset = true;
+
+ /*
+ * Unless creation is deferred until INIT, signal the vCPU to update
+ * its state.
+ */
+ if (request != SVM_VMGEXIT_AP_CREATE_ON_INIT)
+ kvm_make_request_and_kick(KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, target_vcpu);
+
+ return 0;
+}
+
+static int snp_handle_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
+{
+ struct sev_data_snp_guest_request data = {0};
+ struct kvm *kvm = svm->vcpu.kvm;
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ sev_ret_code fw_err = 0;
+ int ret;
+
+ if (!sev_snp_guest(kvm))
+ return -EINVAL;
+
+ mutex_lock(&sev->guest_req_mutex);
+
+ if (kvm_read_guest(kvm, req_gpa, sev->guest_req_buf, PAGE_SIZE)) {
+ ret = -EIO;
+ goto out_unlock;
+ }
+
+ data.gctx_paddr = __psp_pa(sev->snp_context);
+ data.req_paddr = __psp_pa(sev->guest_req_buf);
+ data.res_paddr = __psp_pa(sev->guest_resp_buf);
+
+ /*
+ * Firmware failures are propagated on to the guest, but any other
+ * failure condition along the way should be reported to userspace,
+ * e.g. if the PSP is dead and commands are timing out.
+ */
+ ret = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &fw_err);
+ if (ret && !fw_err)
+ goto out_unlock;
+
+ if (kvm_write_guest(kvm, resp_gpa, sev->guest_resp_buf, PAGE_SIZE)) {
+ ret = -EIO;
+ goto out_unlock;
+ }
+
+ /* No action is requested *from KVM* if there was a firmware error. */
+ svm_vmgexit_no_action(svm, SNP_GUEST_ERR(0, fw_err));
+
+ ret = 1; /* resume guest */
+
+out_unlock:
+ mutex_unlock(&sev->guest_req_mutex);
+ return ret;
+}
+
+static int snp_handle_ext_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa)
+{
+ struct kvm *kvm = svm->vcpu.kvm;
+ u8 msg_type;
+
+ if (!sev_snp_guest(kvm))
+ return -EINVAL;
+
+ if (kvm_read_guest(kvm, req_gpa + offsetof(struct snp_guest_msg_hdr, msg_type),
+ &msg_type, 1))
+ return -EIO;
+
+ /*
+ * As per GHCB spec, requests of type MSG_REPORT_REQ also allow for
+ * additional certificate data to be provided alongside the attestation
+ * report via the guest-provided data pages indicated by RAX/RBX. The
+ * certificate data is optional and requires additional KVM enablement
+ * to provide an interface for userspace to provide it, but KVM still
+ * needs to be able to handle extended guest requests either way. So
+ * provide a stub implementation that will always return an empty
+ * certificate table in the guest-provided data pages.
+ */
+ if (msg_type == SNP_MSG_REPORT_REQ) {
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ u64 data_npages;
+ gpa_t data_gpa;
+
+ if (!kvm_ghcb_rax_is_valid(svm) || !kvm_ghcb_rbx_is_valid(svm))
+ goto request_invalid;
+
+ data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
+ data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
+
+ if (!PAGE_ALIGNED(data_gpa))
+ goto request_invalid;
+
+ /*
+ * As per GHCB spec (see "SNP Extended Guest Request"), the
+ * certificate table is terminated by 24-bytes of zeroes.
+ */
+ if (data_npages && kvm_clear_guest(kvm, data_gpa, 24))
+ return -EIO;
+ }
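+
+	/*
+	 * For reference, the 24 bytes cleared above amount to one zeroed
+	 * certificate table entry, which the GHCB spec lays out roughly as
+	 * (illustrative):
+	 *
+	 *	struct cert_table_entry {
+	 *		u8  guid[16];	(certificate type)
+	 *		u32 offset;	(offset into the data pages)
+	 *		u32 length;	(certificate length)
+	 *	};
+	 */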
+
+ return snp_handle_guest_req(svm, req_gpa, resp_gpa);
+
+request_invalid:
+ svm_vmgexit_bad_input(svm, GHCB_ERR_INVALID_INPUT);
+ return 1; /* resume guest */
+}
+
+static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
+{
+ struct vmcb_control_area *control = &svm->vmcb->control;
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm_sev_info *sev = to_kvm_sev_info(vcpu->kvm);
+ u64 ghcb_info;
+ int ret = 1;
+
+ ghcb_info = control->ghcb_gpa & GHCB_MSR_INFO_MASK;
+
+ trace_kvm_vmgexit_msr_protocol_enter(svm->vcpu.vcpu_id,
+ control->ghcb_gpa);
+
+ switch (ghcb_info) {
+ case GHCB_MSR_SEV_INFO_REQ:
+ set_ghcb_msr(svm, GHCB_MSR_SEV_INFO((__u64)sev->ghcb_version,
+ GHCB_VERSION_MIN,
+ sev_enc_bit));
+ break;
+ case GHCB_MSR_CPUID_REQ: {
+ u64 cpuid_fn, cpuid_reg, cpuid_value;
+
+ cpuid_fn = get_ghcb_msr_bits(svm,
+ GHCB_MSR_CPUID_FUNC_MASK,
+ GHCB_MSR_CPUID_FUNC_POS);
+
+ /* Initialize the registers needed by the CPUID intercept */
+ vcpu->arch.regs[VCPU_REGS_RAX] = cpuid_fn;
+ vcpu->arch.regs[VCPU_REGS_RCX] = 0;
+
+ ret = svm_invoke_exit_handler(vcpu, SVM_EXIT_CPUID);
+ if (!ret) {
+ /* Error, keep GHCB MSR value as-is */
+ break;
+ }
+
+ cpuid_reg = get_ghcb_msr_bits(svm,
+ GHCB_MSR_CPUID_REG_MASK,
+ GHCB_MSR_CPUID_REG_POS);
+ if (cpuid_reg == 0)
+ cpuid_value = vcpu->arch.regs[VCPU_REGS_RAX];
+ else if (cpuid_reg == 1)
+ cpuid_value = vcpu->arch.regs[VCPU_REGS_RBX];
+ else if (cpuid_reg == 2)
+ cpuid_value = vcpu->arch.regs[VCPU_REGS_RCX];
+ else
+ cpuid_value = vcpu->arch.regs[VCPU_REGS_RDX];
+
+ set_ghcb_msr_bits(svm, cpuid_value,
+ GHCB_MSR_CPUID_VALUE_MASK,
+ GHCB_MSR_CPUID_VALUE_POS);
+
+ set_ghcb_msr_bits(svm, GHCB_MSR_CPUID_RESP,
+ GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ }
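+	/*
+	 * E.g. for the CPUID exchange above (illustrative, encoding per the
+	 * GHCB MSR protocol): to read CPUID 0x8000001F EAX, the guest writes
+	 * the GHCB MSR with (0x8000001fULL << 32) | (0 << 30) |
+	 * GHCB_MSR_CPUID_REQ, issues a VMGEXIT, and reads back
+	 * (EAX value << 32) | GHCB_MSR_CPUID_RESP.
+	 */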
+ case GHCB_MSR_AP_RESET_HOLD_REQ:
+ svm->sev_es.ap_reset_hold_type = AP_RESET_HOLD_MSR_PROTO;
+ ret = kvm_emulate_ap_reset_hold(&svm->vcpu);
+
+ /*
+ * Preset the result to a non-SIPI return and then only set
+ * the result to non-zero when delivering a SIPI.
+ */
+ set_ghcb_msr_bits(svm, 0,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
+
+ set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
+ GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ case GHCB_MSR_HV_FT_REQ:
+ set_ghcb_msr_bits(svm, GHCB_HV_FT_SUPPORTED,
+ GHCB_MSR_HV_FT_MASK, GHCB_MSR_HV_FT_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_HV_FT_RESP,
+ GHCB_MSR_INFO_MASK, GHCB_MSR_INFO_POS);
+ break;
+ case GHCB_MSR_PREF_GPA_REQ:
+ if (!sev_snp_guest(vcpu->kvm))
+ goto out_terminate;
+
+ set_ghcb_msr_bits(svm, GHCB_MSR_PREF_GPA_NONE, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_PREF_GPA_RESP, GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ case GHCB_MSR_REG_GPA_REQ: {
+ u64 gfn;
+
+ if (!sev_snp_guest(vcpu->kvm))
+ goto out_terminate;
+
+ gfn = get_ghcb_msr_bits(svm, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+
+ svm->sev_es.ghcb_registered_gpa = gfn_to_gpa(gfn);
+
+ set_ghcb_msr_bits(svm, gfn, GHCB_MSR_GPA_VALUE_MASK,
+ GHCB_MSR_GPA_VALUE_POS);
+ set_ghcb_msr_bits(svm, GHCB_MSR_REG_GPA_RESP, GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ }
+ case GHCB_MSR_PSC_REQ:
+ if (!sev_snp_guest(vcpu->kvm))
+ goto out_terminate;
+
+ ret = snp_begin_psc_msr(svm, control->ghcb_gpa);
+ break;
+ case GHCB_MSR_TERM_REQ: {
+ u64 reason_set, reason_code;
+
+ reason_set = get_ghcb_msr_bits(svm,
+ GHCB_MSR_TERM_REASON_SET_MASK,
+ GHCB_MSR_TERM_REASON_SET_POS);
+ reason_code = get_ghcb_msr_bits(svm,
+ GHCB_MSR_TERM_REASON_MASK,
+ GHCB_MSR_TERM_REASON_POS);
+ pr_info("SEV-ES guest requested termination: %#llx:%#llx\n",
+ reason_set, reason_code);
+
+ goto out_terminate;
+ }
+ default:
+ /* Error, keep GHCB MSR value as-is */
+ break;
+ }
+
+ trace_kvm_vmgexit_msr_protocol_exit(svm->vcpu.vcpu_id,
+ control->ghcb_gpa, ret);
+
+ return ret;
+
+out_terminate:
+ vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+ vcpu->run->system_event.type = KVM_SYSTEM_EVENT_SEV_TERM;
+ vcpu->run->system_event.ndata = 1;
+ vcpu->run->system_event.data[0] = control->ghcb_gpa;
+
+ return 0;
+}
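+
+/*
+ * Note on the encoding used by the MSR protocol handler above: the low 12
+ * bits of the GHCB MSR carry GHCBInfo (the request/response code covered by
+ * GHCB_MSR_INFO_MASK), and the remaining upper bits carry GHCBData, which is
+ * extracted from and inserted into the MSR via the get_ghcb_msr_bits() and
+ * set_ghcb_msr_bits() helpers.
+ */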
+
+int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb_control_area *control = &svm->vmcb->control;
+ u64 ghcb_gpa, exit_code;
+ int ret;
+
+ /* Validate the GHCB */
+ ghcb_gpa = control->ghcb_gpa;
+ if (ghcb_gpa & GHCB_MSR_INFO_MASK)
+ return sev_handle_vmgexit_msr_protocol(svm);
+
+ if (!ghcb_gpa) {
+ vcpu_unimpl(vcpu, "vmgexit: GHCB gpa is not set\n");
+
+ /* Without a GHCB, just return right back to the guest */
+ return 1;
+ }
+
+ if (kvm_vcpu_map(vcpu, ghcb_gpa >> PAGE_SHIFT, &svm->sev_es.ghcb_map)) {
+ /* Unable to map GHCB from guest */
+ vcpu_unimpl(vcpu, "vmgexit: error mapping GHCB [%#llx] from guest\n",
+ ghcb_gpa);
+
+ /* Without a GHCB, just return right back to the guest */
+ return 1;
+ }
+
+ svm->sev_es.ghcb = svm->sev_es.ghcb_map.hva;
+
+ trace_kvm_vmgexit_enter(vcpu->vcpu_id, svm->sev_es.ghcb);
+
+ sev_es_sync_from_ghcb(svm);
+
+	/* An SEV-SNP guest requires that the GHCB GPA be registered */
+ if (sev_snp_guest(svm->vcpu.kvm) && !ghcb_gpa_is_registered(svm, ghcb_gpa)) {
+ vcpu_unimpl(&svm->vcpu, "vmgexit: GHCB GPA [%#llx] is not registered.\n", ghcb_gpa);
+ return -EINVAL;
+ }
+
+ ret = sev_es_validate_vmgexit(svm);
+ if (ret)
+ return ret;
+
+ svm_vmgexit_success(svm, 0);
+
+ exit_code = kvm_get_cached_sw_exit_code(control);
+ switch (exit_code) {
+ case SVM_VMGEXIT_MMIO_READ:
+ ret = setup_vmgexit_scratch(svm, true, control->exit_info_2);
+ if (ret)
+ break;
+
+ ret = kvm_sev_es_mmio_read(vcpu,
+ control->exit_info_1,
+ control->exit_info_2,
+ svm->sev_es.ghcb_sa);
+ break;
+ case SVM_VMGEXIT_MMIO_WRITE:
+ ret = setup_vmgexit_scratch(svm, false, control->exit_info_2);
+ if (ret)
+ break;
+
+ ret = kvm_sev_es_mmio_write(vcpu,
+ control->exit_info_1,
+ control->exit_info_2,
+ svm->sev_es.ghcb_sa);
+ break;
+ case SVM_VMGEXIT_NMI_COMPLETE:
+ ++vcpu->stat.nmi_window_exits;
+ svm->nmi_masked = false;
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+ ret = 1;
+ break;
+ case SVM_VMGEXIT_AP_HLT_LOOP:
+ svm->sev_es.ap_reset_hold_type = AP_RESET_HOLD_NAE_EVENT;
+ ret = kvm_emulate_ap_reset_hold(vcpu);
+ break;
+ case SVM_VMGEXIT_AP_JUMP_TABLE: {
+ struct kvm_sev_info *sev = to_kvm_sev_info(vcpu->kvm);
+
+ switch (control->exit_info_1) {
+ case 0:
+ /* Set AP jump table address */
+ sev->ap_jump_table = control->exit_info_2;
+ break;
+ case 1:
+ /* Get AP jump table address */
+ svm_vmgexit_success(svm, sev->ap_jump_table);
+ break;
+ default:
+ pr_err("svm: vmgexit: unsupported AP jump table request - exit_info_1=%#llx\n",
+ control->exit_info_1);
+ svm_vmgexit_bad_input(svm, GHCB_ERR_INVALID_INPUT);
+ }
+
+ ret = 1;
+ break;
+ }
+ case SVM_VMGEXIT_HV_FEATURES:
+ svm_vmgexit_success(svm, GHCB_HV_FT_SUPPORTED);
+ ret = 1;
+ break;
+ case SVM_VMGEXIT_TERM_REQUEST:
+ pr_info("SEV-ES guest requested termination: reason %#llx info %#llx\n",
+ control->exit_info_1, control->exit_info_2);
+ vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+ vcpu->run->system_event.type = KVM_SYSTEM_EVENT_SEV_TERM;
+ vcpu->run->system_event.ndata = 1;
+ vcpu->run->system_event.data[0] = control->ghcb_gpa;
+ break;
+ case SVM_VMGEXIT_PSC:
+ ret = setup_vmgexit_scratch(svm, true, control->exit_info_2);
+ if (ret)
+ break;
+
+ ret = snp_begin_psc(svm, svm->sev_es.ghcb_sa);
+ break;
+ case SVM_VMGEXIT_AP_CREATION:
+ ret = sev_snp_ap_creation(svm);
+		if (ret)
+			svm_vmgexit_bad_input(svm, GHCB_ERR_INVALID_INPUT);
+
+ ret = 1;
+ break;
+ case SVM_VMGEXIT_GUEST_REQUEST:
+ ret = snp_handle_guest_req(svm, control->exit_info_1, control->exit_info_2);
+ break;
+ case SVM_VMGEXIT_EXT_GUEST_REQUEST:
+ ret = snp_handle_ext_guest_req(svm, control->exit_info_1, control->exit_info_2);
+ break;
+ case SVM_VMGEXIT_UNSUPPORTED_EVENT:
+ vcpu_unimpl(vcpu,
+ "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
+ control->exit_info_1, control->exit_info_2);
+ ret = -EINVAL;
+ break;
+ default:
+ ret = svm_invoke_exit_handler(vcpu, exit_code);
+ }
+
+ return ret;
+}
+
+int sev_es_string_io(struct vcpu_svm *svm, int size, unsigned int port, int in)
+{
+ int count;
+ int bytes;
+ int r;
+
+ if (svm->vmcb->control.exit_info_2 > INT_MAX)
+ return -EINVAL;
+
+ count = svm->vmcb->control.exit_info_2;
+ if (unlikely(check_mul_overflow(count, size, &bytes)))
+ return -EINVAL;
+
+ r = setup_vmgexit_scratch(svm, in, bytes);
+ if (r)
+ return r;
+
+ return kvm_sev_es_string_io(&svm->vcpu, size, port, svm->sev_es.ghcb_sa,
+ count, in);
+}
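+
+/*
+ * Worked example for the arithmetic above (illustrative): a "rep outsb" with
+ * a rep count of 512 arrives with exit_info_2 = 512 and size = 1, so 512
+ * bytes of scratch area are mapped and the entire string operation is then
+ * emulated in one shot via kvm_sev_es_string_io().
+ */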
+
+void sev_es_recalc_msr_intercepts(struct kvm_vcpu *vcpu)
+{
+ /* Clear intercepts on MSRs that are context switched by hardware. */
+ svm_disable_intercept_for_msr(vcpu, MSR_AMD64_SEV_ES_GHCB, MSR_TYPE_RW);
+ svm_disable_intercept_for_msr(vcpu, MSR_EFER, MSR_TYPE_RW);
+ svm_disable_intercept_for_msr(vcpu, MSR_IA32_CR_PAT, MSR_TYPE_RW);
+
+ if (boot_cpu_has(X86_FEATURE_V_TSC_AUX))
+ svm_set_intercept_for_msr(vcpu, MSR_TSC_AUX, MSR_TYPE_RW,
+ !guest_cpu_cap_has(vcpu, X86_FEATURE_RDTSCP) &&
+ !guest_cpu_cap_has(vcpu, X86_FEATURE_RDPID));
+
+ svm_set_intercept_for_msr(vcpu, MSR_AMD64_GUEST_TSC_FREQ, MSR_TYPE_R,
+ !snp_is_secure_tsc_enabled(vcpu->kvm));
+
+ /*
+ * For SEV-ES, accesses to MSR_IA32_XSS should not be intercepted if
+ * the host/guest supports its use.
+ *
+ * KVM treats the guest as being capable of using XSAVES even if XSAVES
+ * isn't enabled in guest CPUID as there is no intercept for XSAVES,
+ * i.e. the guest can use XSAVES/XRSTOR to read/write XSS if XSAVE is
+ * exposed to the guest and XSAVES is supported in hardware. Condition
+ * full XSS passthrough on the guest being able to use XSAVES *and*
+ * XSAVES being exposed to the guest so that KVM can at least honor
+ * guest CPUID for RDMSR and WRMSR.
+ */
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_XSS, MSR_TYPE_RW,
+ !guest_cpu_cap_has(vcpu, X86_FEATURE_XSAVES) ||
+ !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES));
+}
+
+void sev_vcpu_after_set_cpuid(struct vcpu_svm *svm)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+ struct kvm_cpuid_entry2 *best;
+
+	/* For SEV guests, the memory encryption bit is not reserved in CR3. */
+ best = kvm_find_cpuid_entry(vcpu, 0x8000001F);
+ if (best)
+ vcpu->arch.reserved_gpa_bits &= ~(1UL << (best->ebx & 0x3f));
+}
+
+static void sev_es_init_vmcb(struct vcpu_svm *svm, bool init_event)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(svm->vcpu.kvm);
+ struct vmcb *vmcb = svm->vmcb01.ptr;
+
+ svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ES_ENABLE;
+
+ /*
+	 * An SEV-ES guest requires a VMSA area that is separate from the
+ * VMCB page. Do not include the encryption mask on the VMSA physical
+ * address since hardware will access it using the guest key. Note,
+ * the VMSA will be NULL if this vCPU is the destination for intrahost
+ * migration, and will be copied later.
+ */
+ if (!svm->sev_es.snp_has_guest_vmsa) {
+ if (svm->sev_es.vmsa)
+ svm->vmcb->control.vmsa_pa = __pa(svm->sev_es.vmsa);
+ else
+ svm->vmcb->control.vmsa_pa = INVALID_PAGE;
+ }
+
+ if (cpu_feature_enabled(X86_FEATURE_ALLOWED_SEV_FEATURES))
+ svm->vmcb->control.allowed_sev_features = sev->vmsa_features |
+ VMCB_ALLOWED_SEV_FEATURES_VALID;
+
+ /* Can't intercept CR register access, HV can't modify CR registers */
+ svm_clr_intercept(svm, INTERCEPT_CR0_READ);
+ svm_clr_intercept(svm, INTERCEPT_CR4_READ);
+ svm_clr_intercept(svm, INTERCEPT_CR8_READ);
+ svm_clr_intercept(svm, INTERCEPT_CR0_WRITE);
+ svm_clr_intercept(svm, INTERCEPT_CR4_WRITE);
+ svm_clr_intercept(svm, INTERCEPT_CR8_WRITE);
+
+ svm_clr_intercept(svm, INTERCEPT_SELECTIVE_CR0);
+
+ /* Track EFER/CR register changes */
+ svm_set_intercept(svm, TRAP_EFER_WRITE);
+ svm_set_intercept(svm, TRAP_CR0_WRITE);
+ svm_set_intercept(svm, TRAP_CR4_WRITE);
+ svm_set_intercept(svm, TRAP_CR8_WRITE);
+
+ vmcb->control.intercepts[INTERCEPT_DR] = 0;
+ if (!sev_vcpu_has_debug_swap(svm)) {
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_READ);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_WRITE);
+ recalc_intercepts(svm);
+ } else {
+ /*
+ * Disable #DB intercept iff DebugSwap is enabled. KVM doesn't
+ * allow debugging SEV-ES guests, and enables DebugSwap iff
+ * NO_NESTED_DATA_BP is supported, so there's no reason to
+ * intercept #DB when DebugSwap is enabled. For simplicity
+ * with respect to guest debug, intercept #DB for other VMs
+ * even if NO_NESTED_DATA_BP is supported, i.e. even if the
+ * guest can't DoS the CPU with infinite #DB vectoring.
+ */
+ clr_exception_intercept(svm, DB_VECTOR);
+ }
+
+ /* Can't intercept XSETBV, HV can't modify XCR0 directly */
+ svm_clr_intercept(svm, INTERCEPT_XSETBV);
+
+ /*
+ * Set the GHCB MSR value as per the GHCB specification when emulating
+ * vCPU RESET for an SEV-ES guest.
+ */
+ if (!init_event)
+ set_ghcb_msr(svm, GHCB_MSR_SEV_INFO((__u64)sev->ghcb_version,
+ GHCB_VERSION_MIN,
+ sev_enc_bit));
+}
+
+void sev_init_vmcb(struct vcpu_svm *svm, bool init_event)
+{
+ struct kvm_vcpu *vcpu = &svm->vcpu;
+
+ svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ENABLE;
+ clr_exception_intercept(svm, UD_VECTOR);
+
+ /*
+ * Don't intercept #GP for SEV guests, e.g. for the VMware backdoor, as
+ * KVM can't decrypt guest memory to decode the faulting instruction.
+ */
+ clr_exception_intercept(svm, GP_VECTOR);
+
+ if (init_event && sev_snp_guest(vcpu->kvm))
+ sev_snp_init_protected_guest_state(vcpu);
+
+ if (sev_es_guest(vcpu->kvm))
+ sev_es_init_vmcb(svm, init_event);
+}
+
+int sev_vcpu_create(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct page *vmsa_page;
+
+ mutex_init(&svm->sev_es.snp_vmsa_mutex);
+
+ if (!sev_es_guest(vcpu->kvm))
+ return 0;
+
+ /*
+ * SEV-ES guests require a separate (from the VMCB) VMSA page used to
+ * contain the encrypted register state of the guest.
+ */
+ vmsa_page = snp_safe_alloc_page();
+ if (!vmsa_page)
+ return -ENOMEM;
+
+ svm->sev_es.vmsa = page_address(vmsa_page);
+
+ vcpu->arch.guest_tsc_protected = snp_is_secure_tsc_enabled(vcpu->kvm);
+
+ return 0;
+}
+
+void sev_es_prepare_switch_to_guest(struct vcpu_svm *svm, struct sev_es_save_area *hostsa)
+{
+ struct kvm *kvm = svm->vcpu.kvm;
+
+ /*
+ * All host state for SEV-ES guests is categorized into three swap types
+ * based on how it is handled by hardware during a world switch:
+ *
+ * A: VMRUN: Host state saved in host save area
+ * VMEXIT: Host state loaded from host save area
+ *
+ * B: VMRUN: Host state _NOT_ saved in host save area
+ * VMEXIT: Host state loaded from host save area
+ *
+ * C: VMRUN: Host state _NOT_ saved in host save area
+ * VMEXIT: Host state initialized to default(reset) values
+ *
+ * Manually save type-B state, i.e. state that is loaded by VMEXIT but
+ * isn't saved by VMRUN, that isn't already saved by VMSAVE (performed
+ * by common SVM code).
+ */
+ hostsa->xcr0 = kvm_host.xcr0;
+ hostsa->pkru = read_pkru();
+ hostsa->xss = kvm_host.xss;
+
+ /*
+ * If DebugSwap is enabled, debug registers are loaded but NOT saved by
+ * the CPU (Type-B). If DebugSwap is disabled/unsupported, the CPU does
+ * not save or load debug registers. Sadly, KVM can't prevent SNP
+ * guests from lying about DebugSwap on secondary vCPUs, i.e. the
+ * SEV_FEATURES provided at "AP Create" isn't guaranteed to match what
+ * the guest has actually enabled (or not!) in the VMSA.
+ *
+ * If DebugSwap is *possible*, save the masks so that they're restored
+ * if the guest enables DebugSwap. But for the DRs themselves, do NOT
+ * rely on the CPU to restore the host values; KVM will restore them as
+ * needed in common code, via hw_breakpoint_restore(). Note, KVM does
+ * NOT support virtualizing Breakpoint Extensions, i.e. the mask MSRs
+ * don't need to be restored per se, KVM just needs to ensure they are
+ * loaded with the correct values *if* the CPU writes the MSRs.
+ */
+ if (sev_vcpu_has_debug_swap(svm) ||
+ (sev_snp_guest(kvm) && cpu_feature_enabled(X86_FEATURE_DEBUG_SWAP))) {
+ hostsa->dr0_addr_mask = amd_get_dr_addr_mask(0);
+ hostsa->dr1_addr_mask = amd_get_dr_addr_mask(1);
+ hostsa->dr2_addr_mask = amd_get_dr_addr_mask(2);
+ hostsa->dr3_addr_mask = amd_get_dr_addr_mask(3);
+ }
+
+ /*
+ * TSC_AUX is always virtualized for SEV-ES guests when the feature is
+ * available, i.e. TSC_AUX is loaded on #VMEXIT from the host save area.
+ * Set the save area to the current hardware value, i.e. the current
+ * user return value, so that the correct value is restored on #VMEXIT.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_V_TSC_AUX) &&
+ !WARN_ON_ONCE(tsc_aux_uret_slot < 0))
+ hostsa->tsc_aux = kvm_get_user_return_msr(tsc_aux_uret_slot);
+}
+
+void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ /* First SIPI: Use the values as initially set by the VMM */
+ if (!svm->sev_es.received_first_sipi) {
+ svm->sev_es.received_first_sipi = true;
+ return;
+ }
+
+ /* Subsequent SIPI */
+ switch (svm->sev_es.ap_reset_hold_type) {
+ case AP_RESET_HOLD_NAE_EVENT:
+ /*
+ * Return from an AP Reset Hold VMGEXIT, where the guest will
+ * set the CS and RIP. Set SW_EXIT_INFO_2 to a non-zero value.
+ */
+ svm_vmgexit_success(svm, 1);
+ break;
+ case AP_RESET_HOLD_MSR_PROTO:
+ /*
+ * Return from an AP Reset Hold VMGEXIT, where the guest will
+ * set the CS and RIP. Set GHCB data field to a non-zero value.
+ */
+ set_ghcb_msr_bits(svm, 1,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_MASK,
+ GHCB_MSR_AP_RESET_HOLD_RESULT_POS);
+
+ set_ghcb_msr_bits(svm, GHCB_MSR_AP_RESET_HOLD_RESP,
+ GHCB_MSR_INFO_MASK,
+ GHCB_MSR_INFO_POS);
+ break;
+ default:
+ break;
+ }
+}
+
+struct page *snp_safe_alloc_page_node(int node, gfp_t gfp)
+{
+ unsigned long pfn;
+ struct page *p;
+
+ if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
+ return alloc_pages_node(node, gfp | __GFP_ZERO, 0);
+
+ /*
+	 * Allocate an SNP-safe page to work around the SNP erratum where
+ * the CPU will incorrectly signal an RMP violation #PF if a
+ * hugepage (2MB or 1GB) collides with the RMP entry of a
+ * 2MB-aligned VMCB, VMSA, or AVIC backing page.
+ *
+ * Allocate one extra page, choose a page which is not
+ * 2MB-aligned, and free the other.
+ */
+ p = alloc_pages_node(node, gfp | __GFP_ZERO, 1);
+ if (!p)
+ return NULL;
+
+ split_page(p, 1);
+
+ pfn = page_to_pfn(p);
+ if (IS_ALIGNED(pfn, PTRS_PER_PMD))
+ __free_page(p++);
+ else
+ __free_page(p + 1);
+
+ return p;
+}
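+
+/*
+ * Worked example for the above (illustrative PFNs): an order-1 allocation
+ * returns consecutive PFNs {N, N + 1}, at most one of which can be
+ * 2MB-aligned (N % 512 == 0).  If N is aligned, N is freed and N + 1 is
+ * returned; otherwise N + 1 is freed and N is returned.  Either way the
+ * caller receives a page that is not 2MB-aligned and so can't collide with
+ * the RMP entry of a hugepage.
+ */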
+
+void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm *kvm = vcpu->kvm;
+ int order, rmp_level, ret;
+ struct page *page;
+ bool assigned;
+ kvm_pfn_t pfn;
+ gfn_t gfn;
+
+ gfn = gpa >> PAGE_SHIFT;
+
+ /*
+ * The only time RMP faults occur for shared pages is when the guest is
+ * triggering an RMP fault for an implicit page-state change from
+ * shared->private. Implicit page-state changes are forwarded to
+ * userspace via KVM_EXIT_MEMORY_FAULT events, however, so RMP faults
+ * for shared pages should not end up here.
+ */
+ if (!kvm_mem_is_private(kvm, gfn)) {
+ pr_warn_ratelimited("SEV: Unexpected RMP fault for non-private GPA 0x%llx\n",
+ gpa);
+ return;
+ }
+
+ slot = gfn_to_memslot(kvm, gfn);
+ if (!kvm_slot_has_gmem(slot)) {
+ pr_warn_ratelimited("SEV: Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
+ gpa);
+ return;
+ }
+
+ ret = kvm_gmem_get_pfn(kvm, slot, gfn, &pfn, &page, &order);
+ if (ret) {
+ pr_warn_ratelimited("SEV: Unexpected RMP fault, no backing page for private GPA 0x%llx\n",
+ gpa);
+ return;
+ }
+
+ ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+ if (ret || !assigned) {
+ pr_warn_ratelimited("SEV: Unexpected RMP fault, no assigned RMP entry found for GPA 0x%llx PFN 0x%llx error %d\n",
+ gpa, pfn, ret);
+ goto out_no_trace;
+ }
+
+ /*
+ * There are 2 cases where a PSMASH may be needed to resolve an #NPF
+ * with PFERR_GUEST_RMP_BIT set:
+ *
+ * 1) RMPADJUST/PVALIDATE can trigger an #NPF with PFERR_GUEST_SIZEM
+ * bit set if the guest issues them with a smaller granularity than
+ * what is indicated by the page-size bit in the 2MB RMP entry for
+ * the PFN that backs the GPA.
+ *
+ * 2) Guest access via NPT can trigger an #NPF if the NPT mapping is
+ * smaller than what is indicated by the 2MB RMP entry for the PFN
+ * that backs the GPA.
+ *
+ * In both these cases, the corresponding 2M RMP entry needs to
+ * be PSMASH'd to 512 4K RMP entries. If the RMP entry is already
+ * split into 4K RMP entries, then this is likely a spurious case which
+ * can occur when there are concurrent accesses by the guest to a 2MB
+	 * GPA range that is backed by a 2MB-aligned PFN whose RMP entry is in
+	 * the process of being PSMASH'd into 4K entries. These cases should
+ * resolve automatically on subsequent accesses, so just ignore them
+ * here.
+ */
+ if (rmp_level == PG_LEVEL_4K)
+ goto out;
+
+ ret = snp_rmptable_psmash(pfn);
+ if (ret) {
+ /*
+ * Look it up again. If it's 4K now then the PSMASH may have
+ * raced with another process and the issue has already resolved
+ * itself.
+ */
+ if (!snp_lookup_rmpentry(pfn, &assigned, &rmp_level) &&
+ assigned && rmp_level == PG_LEVEL_4K)
+ goto out;
+
+ pr_warn_ratelimited("SEV: Unable to split RMP entry for GPA 0x%llx PFN 0x%llx ret %d\n",
+ gpa, pfn, ret);
+ }
+
+ kvm_zap_gfn_range(kvm, gfn, gfn + PTRS_PER_PMD);
+out:
+ trace_kvm_rmp_fault(vcpu, gpa, pfn, error_code, rmp_level, ret);
+out_no_trace:
+ kvm_release_page_unused(page);
+}
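+
+/*
+ * Illustrative walk-through of the handler above: a guest PVALIDATEs one 4K
+ * sub-page of a GPA range whose backing PFN has a 2M RMP entry.  The
+ * resulting size-mismatch #NPF is resolved by PSMASH'ing the single 2M RMP
+ * entry into 512 4K entries and zapping the NPT range so that it is rebuilt
+ * at 4K granularity.
+ */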
+
+static bool is_pfn_range_shared(kvm_pfn_t start, kvm_pfn_t end)
+{
+ kvm_pfn_t pfn = start;
+
+ while (pfn < end) {
+ int ret, rmp_level;
+ bool assigned;
+
+ ret = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+ if (ret) {
+ pr_warn_ratelimited("SEV: Failed to retrieve RMP entry: PFN 0x%llx GFN start 0x%llx GFN end 0x%llx RMP level %d error %d\n",
+ pfn, start, end, rmp_level, ret);
+ return false;
+ }
+
+ if (assigned) {
+ pr_debug("%s: overlap detected, PFN 0x%llx start 0x%llx end 0x%llx RMP level %d\n",
+ __func__, pfn, start, end, rmp_level);
+ return false;
+ }
+
+ pfn++;
+ }
+
+ return true;
+}
+
+static u8 max_level_for_order(int order)
+{
+ if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+ return PG_LEVEL_2M;
+
+ return PG_LEVEL_4K;
+}
+
+static bool is_large_rmp_possible(struct kvm *kvm, kvm_pfn_t pfn, int order)
+{
+ kvm_pfn_t pfn_aligned = ALIGN_DOWN(pfn, PTRS_PER_PMD);
+
+ /*
+ * If this is a large folio, and the entire 2M range containing the
+ * PFN is currently shared, then the entire 2M-aligned range can be
+ * set to private via a single 2M RMP entry.
+ */
+ if (max_level_for_order(order) > PG_LEVEL_4K &&
+ is_pfn_range_shared(pfn_aligned, pfn_aligned + PTRS_PER_PMD))
+ return true;
+
+ return false;
+}
+
+int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order)
+{
+ struct kvm_sev_info *sev = to_kvm_sev_info(kvm);
+ kvm_pfn_t pfn_aligned;
+ gfn_t gfn_aligned;
+ int level, rc;
+ bool assigned;
+
+ if (!sev_snp_guest(kvm))
+ return 0;
+
+ rc = snp_lookup_rmpentry(pfn, &assigned, &level);
+ if (rc) {
+ pr_err_ratelimited("SEV: Failed to look up RMP entry: GFN %llx PFN %llx error %d\n",
+ gfn, pfn, rc);
+ return -ENOENT;
+ }
+
+ if (assigned) {
+ pr_debug("%s: already assigned: gfn %llx pfn %llx max_order %d level %d\n",
+ __func__, gfn, pfn, max_order, level);
+ return 0;
+ }
+
+ if (is_large_rmp_possible(kvm, pfn, max_order)) {
+ level = PG_LEVEL_2M;
+ pfn_aligned = ALIGN_DOWN(pfn, PTRS_PER_PMD);
+ gfn_aligned = ALIGN_DOWN(gfn, PTRS_PER_PMD);
+ } else {
+ level = PG_LEVEL_4K;
+ pfn_aligned = pfn;
+ gfn_aligned = gfn;
+ }
+
+ rc = rmp_make_private(pfn_aligned, gfn_to_gpa(gfn_aligned), level, sev->asid, false);
+ if (rc) {
+ pr_err_ratelimited("SEV: Failed to update RMP entry: GFN %llx PFN %llx level %d error %d\n",
+ gfn, pfn, level, rc);
+ return -EINVAL;
+ }
+
+ pr_debug("%s: updated: gfn %llx pfn %llx pfn_aligned %llx max_order %d level %d\n",
+ __func__, gfn, pfn, pfn_aligned, max_order, level);
+
+ return 0;
+}
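+
+/*
+ * E.g. (illustrative): preparing a page from a 2MB gmem folio whose entire
+ * 2MB-aligned PFN range is still shared results in a single 2M RMP entry;
+ * if any PFN in that range is already assigned, only the faulting PFN gets
+ * a 4K RMP entry.
+ */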
+
+void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
+{
+ kvm_pfn_t pfn;
+
+ if (!cc_platform_has(CC_ATTR_HOST_SEV_SNP))
+ return;
+
+ pr_debug("%s: PFN start 0x%llx PFN end 0x%llx\n", __func__, start, end);
+
+ for (pfn = start; pfn < end;) {
+ bool use_2m_update = false;
+ int rc, rmp_level;
+ bool assigned;
+
+ rc = snp_lookup_rmpentry(pfn, &assigned, &rmp_level);
+ if (rc || !assigned)
+ goto next_pfn;
+
+ use_2m_update = IS_ALIGNED(pfn, PTRS_PER_PMD) &&
+ end >= (pfn + PTRS_PER_PMD) &&
+ rmp_level > PG_LEVEL_4K;
+
+ /*
+ * If an unaligned PFN corresponds to a 2M region assigned as a
+ * large page in the RMP table, PSMASH the region into individual
+ * 4K RMP entries before attempting to convert a 4K sub-page.
+ */
+ if (!use_2m_update && rmp_level > PG_LEVEL_4K) {
+ /*
+			 * This shouldn't fail, but if it does, report it, then
+			 * still try to update the RMP entry to shared and pray
+			 * this was a spurious error that can be addressed later.
+ */
+ rc = snp_rmptable_psmash(pfn);
+ WARN_ONCE(rc, "SEV: Failed to PSMASH RMP entry for PFN 0x%llx error %d\n",
+ pfn, rc);
+ }
+
+ rc = rmp_make_shared(pfn, use_2m_update ? PG_LEVEL_2M : PG_LEVEL_4K);
+ if (WARN_ONCE(rc, "SEV: Failed to update RMP entry for PFN 0x%llx error %d\n",
+ pfn, rc))
+ goto next_pfn;
+
+ /*
+ * SEV-ES avoids host/guest cache coherency issues through
+ * WBNOINVD hooks issued via MMU notifiers during run-time, and
+ * KVM's VM destroy path at shutdown. Those MMU notifier events
+ * don't cover gmem since there is no requirement to map pages
+ * to a HVA in order to use them for a running guest. While the
+ * shutdown path would still likely cover things for SNP guests,
+ * userspace may also free gmem pages during run-time via
+ * hole-punching operations on the guest_memfd, so flush the
+	 * cache entries for these pages before freeing them back to
+ * the host.
+ */
+ clflush_cache_range(__va(pfn_to_hpa(pfn)),
+ use_2m_update ? PMD_SIZE : PAGE_SIZE);
+next_pfn:
+ pfn += use_2m_update ? PTRS_PER_PMD : 1;
+ cond_resched();
+ }
+}
+
+int sev_gmem_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn, bool is_private)
+{
+ int level, rc;
+ bool assigned;
+
+ if (!sev_snp_guest(kvm))
+ return 0;
+
+ rc = snp_lookup_rmpentry(pfn, &assigned, &level);
+ if (rc || !assigned)
+ return PG_LEVEL_4K;
+
+ return level;
+}
+
+struct vmcb_save_area *sev_decrypt_vmsa(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb_save_area *vmsa;
+ struct kvm_sev_info *sev;
+ int error = 0;
+ int ret;
+
+ if (!sev_es_guest(vcpu->kvm))
+ return NULL;
+
+ /*
+ * If the VMSA has not yet been encrypted, return a pointer to the
+ * current un-encrypted VMSA.
+ */
+ if (!vcpu->arch.guest_state_protected)
+ return (struct vmcb_save_area *)svm->sev_es.vmsa;
+
+ sev = to_kvm_sev_info(vcpu->kvm);
+
+ /* Check if the SEV policy allows debugging */
+ if (sev_snp_guest(vcpu->kvm)) {
+ if (!(sev->policy & SNP_POLICY_MASK_DEBUG))
+ return NULL;
+ } else {
+ if (sev->policy & SEV_POLICY_MASK_NODBG)
+ return NULL;
+ }
+
+ if (sev_snp_guest(vcpu->kvm)) {
+ struct sev_data_snp_dbg dbg = {0};
+
+ vmsa = snp_alloc_firmware_page(__GFP_ZERO);
+ if (!vmsa)
+ return NULL;
+
+ dbg.gctx_paddr = __psp_pa(sev->snp_context);
+ dbg.src_addr = svm->vmcb->control.vmsa_pa;
+ dbg.dst_addr = __psp_pa(vmsa);
+
+ ret = sev_do_cmd(SEV_CMD_SNP_DBG_DECRYPT, &dbg, &error);
+
+		/*
+		 * Transition the target page back to a hypervisor-owned page
+		 * no matter what.  If that fails, the page can't be used, so
+		 * leak it and don't try to use it.
+		 */
+ if (snp_page_reclaim(vcpu->kvm, PHYS_PFN(__pa(vmsa))))
+ return NULL;
+
+ if (ret) {
+ pr_err("SEV: SNP_DBG_DECRYPT failed ret=%d, fw_error=%d (%#x)\n",
+ ret, error, error);
+ free_page((unsigned long)vmsa);
+
+ return NULL;
+ }
+ } else {
+ struct sev_data_dbg dbg = {0};
+ struct page *vmsa_page;
+
+ vmsa_page = alloc_page(GFP_KERNEL);
+ if (!vmsa_page)
+ return NULL;
+
+ vmsa = page_address(vmsa_page);
+
+ dbg.handle = sev->handle;
+ dbg.src_addr = svm->vmcb->control.vmsa_pa;
+ dbg.dst_addr = __psp_pa(vmsa);
+ dbg.len = PAGE_SIZE;
+
+ ret = sev_do_cmd(SEV_CMD_DBG_DECRYPT, &dbg, &error);
+ if (ret) {
+ pr_err("SEV: SEV_CMD_DBG_DECRYPT failed ret=%d, fw_error=%d (0x%x)\n",
+ ret, error, error);
+ __free_page(vmsa_page);
+
+ return NULL;
+ }
+ }
+
+ return vmsa;
+}
+
+void sev_free_decrypted_vmsa(struct kvm_vcpu *vcpu, struct vmcb_save_area *vmsa)
+{
+ /* If the VMSA has not yet been encrypted, nothing was allocated */
+ if (!vcpu->arch.guest_state_protected || !vmsa)
+ return;
+
+ free_page((unsigned long)vmsa);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
new file mode 100644
index 000000000000..f56c2d895011
--- /dev/null
+++ b/arch/x86/kvm/svm/svm.c
@@ -0,0 +1,5510 @@
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kvm_host.h>
+
+#include "irq.h"
+#include "mmu.h"
+#include "kvm_cache_regs.h"
+#include "x86.h"
+#include "smm.h"
+#include "cpuid.h"
+#include "pmu.h"
+
+#include <linux/module.h>
+#include <linux/mod_devicetable.h>
+#include <linux/kernel.h>
+#include <linux/vmalloc.h>
+#include <linux/highmem.h>
+#include <linux/amd-iommu.h>
+#include <linux/sched.h>
+#include <linux/trace_events.h>
+#include <linux/slab.h>
+#include <linux/hashtable.h>
+#include <linux/objtool.h>
+#include <linux/psp-sev.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/swap.h>
+#include <linux/rwsem.h>
+#include <linux/cc_platform.h>
+#include <linux/smp.h>
+#include <linux/string_choices.h>
+#include <linux/mutex.h>
+
+#include <asm/apic.h>
+#include <asm/msr.h>
+#include <asm/perf_event.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+#include <asm/debugreg.h>
+#include <asm/kvm_para.h>
+#include <asm/irq_remapping.h>
+#include <asm/spec-ctrl.h>
+#include <asm/cpu_device_id.h>
+#include <asm/traps.h>
+#include <asm/reboot.h>
+#include <asm/fpu/api.h>
+
+#include <trace/events/ipi.h>
+
+#include "trace.h"
+
+#include "svm.h"
+#include "svm_ops.h"
+
+#include "kvm_onhyperv.h"
+#include "svm_onhyperv.h"
+
+MODULE_AUTHOR("Qumranet");
+MODULE_DESCRIPTION("KVM support for SVM (AMD-V) extensions");
+MODULE_LICENSE("GPL");
+
+#ifdef MODULE
+static const struct x86_cpu_id svm_cpu_id[] = {
+ X86_MATCH_FEATURE(X86_FEATURE_SVM, NULL),
+ {}
+};
+MODULE_DEVICE_TABLE(x86cpu, svm_cpu_id);
+#endif
+
+#define SEG_TYPE_LDT 2
+#define SEG_TYPE_BUSY_TSS16 3
+
+static bool erratum_383_found __read_mostly;
+
+/*
+ * Set osvw_len to a higher value when updated Revision Guides
+ * are published and we know what the new status bits are.
+ */
+static uint64_t osvw_len = 4, osvw_status;
+
+static DEFINE_PER_CPU(u64, current_tsc_ratio);
+
+/*
+ * These 2 parameters are used to configure the controls for Pause-Loop Exiting:
+ * pause_filter_count: On processors that support Pause filtering (indicated
+ *	by CPUID Fn8000_000A_EDX), the VMCB provides a 16 bit pause filter
+ *	count value. On VMRUN this value is loaded into an internal counter.
+ *	Each time a pause instruction is executed, this counter is decremented
+ *	until it reaches zero, at which time a #VMEXIT is generated if pause
+ *	intercept is enabled. Refer to AMD APM Vol 2 Section 15.14.4 Pause
+ *	Intercept Filtering for more details.
+ *	This also indicates whether PLE logic is enabled.
+ *
+ * pause_filter_thresh: In addition, some processor families support advanced
+ *	pause filtering (indicated by CPUID Fn8000_000A_EDX), which places an
+ *	upper bound on the amount of time a guest is allowed to execute in a
+ *	pause loop.
+ * In this mode, a 16-bit pause filter threshold field is added in the
+ * VMCB. The threshold value is a cycle count that is used to reset the
+ * pause counter. As with simple pause filtering, VMRUN loads the pause
+ * count value from VMCB into an internal counter. Then, on each pause
+ * instruction the hardware checks the elapsed number of cycles since
+ * the most recent pause instruction against the pause filter threshold.
+ * If the elapsed cycle count is greater than the pause filter threshold,
+ * then the internal pause count is reloaded from the VMCB and execution
+ * continues. If the elapsed cycle count is less than the pause filter
+ * threshold, then the internal pause count is decremented. If the count
+ * value is less than zero and PAUSE intercept is enabled, a #VMEXIT is
+ * triggered. If advanced pause filtering is supported and pause filter
+ * threshold field is set to zero, the filter will operate in the simpler,
+ * count only mode.
+ */
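+
+/*
+ * E.g. with the module defaults below (illustrative numbers): the window
+ * starts at pause_filter_count (3000 at the time of writing); each PAUSE
+ * #VMEXIT grows it by pause_filter_count_grow (a factor of 2), capped at
+ * pause_filter_count_max, while a shrink factor of 0 means the window is
+ * reset back to pause_filter_count rather than scaled down.
+ */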
+
+static unsigned short pause_filter_thresh = KVM_DEFAULT_PLE_GAP;
+module_param(pause_filter_thresh, ushort, 0444);
+
+static unsigned short pause_filter_count = KVM_SVM_DEFAULT_PLE_WINDOW;
+module_param(pause_filter_count, ushort, 0444);
+
+/* Default doubles per-vcpu window every exit. */
+static unsigned short pause_filter_count_grow = KVM_DEFAULT_PLE_WINDOW_GROW;
+module_param(pause_filter_count_grow, ushort, 0444);
+
+/* Default resets per-vcpu window every exit to pause_filter_count. */
+static unsigned short pause_filter_count_shrink = KVM_DEFAULT_PLE_WINDOW_SHRINK;
+module_param(pause_filter_count_shrink, ushort, 0444);
+
+/* Default is to compute the maximum so we can never overflow. */
+static unsigned short pause_filter_count_max = KVM_SVM_DEFAULT_PLE_WINDOW_MAX;
+module_param(pause_filter_count_max, ushort, 0444);
+
+/*
+ * Use nested page tables by default. Note, NPT may get forced off by
+ * svm_hardware_setup() if it's unsupported by hardware or the host kernel.
+ */
+bool npt_enabled = true;
+module_param_named(npt, npt_enabled, bool, 0444);
+
+/* allow nested virtualization in KVM/SVM */
+static int nested = true;
+module_param(nested, int, 0444);
+
+/* enable/disable Next RIP Save */
+int nrips = true;
+module_param(nrips, int, 0444);
+
+/* enable/disable Virtual VMLOAD VMSAVE */
+static int vls = true;
+module_param(vls, int, 0444);
+
+/* enable/disable Virtual GIF */
+int vgif = true;
+module_param(vgif, int, 0444);
+
+/* enable/disable LBR virtualization */
+int lbrv = true;
+module_param(lbrv, int, 0444);
+
+static int tsc_scaling = true;
+module_param(tsc_scaling, int, 0444);
+
+module_param(enable_device_posted_irqs, bool, 0444);
+
+bool __read_mostly dump_invalid_vmcb;
+module_param(dump_invalid_vmcb, bool, 0644);
+
+bool intercept_smi = true;
+module_param(intercept_smi, bool, 0444);
+
+bool vnmi = true;
+module_param(vnmi, bool, 0444);
+
+static bool svm_gp_erratum_intercept = true;
+
+static u8 rsm_ins_bytes[] = "\x0f\xaa";
+
+static unsigned long iopm_base;
+
+DEFINE_PER_CPU(struct svm_cpu_data, svm_data);
+
+static DEFINE_MUTEX(vmcb_dump_mutex);
+
+/*
+ * Only MSR_TSC_AUX is switched via the user return hook. EFER is switched via
+ * the VMCB, and the SYSCALL/SYSENTER MSRs are handled by VMLOAD/VMSAVE.
+ *
+ * RDTSCP and RDPID are not used in the kernel, specifically to allow KVM to
+ * defer the restoration of TSC_AUX until the CPU returns to userspace.
+ */
+int tsc_aux_uret_slot __ro_after_init = -1;
+
+static int get_npt_level(void)
+{
+#ifdef CONFIG_X86_64
+ return pgtable_l5_enabled() ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
+#else
+ return PT32E_ROOT_LEVEL;
+#endif
+}
+
+int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u64 old_efer = vcpu->arch.efer;
+ vcpu->arch.efer = efer;
+
+ if (!npt_enabled) {
+ /* Shadow paging assumes NX to be available. */
+ efer |= EFER_NX;
+
+ if (!(efer & EFER_LMA))
+ efer &= ~EFER_LME;
+ }
+
+ if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) {
+ if (!(efer & EFER_SVME)) {
+ svm_leave_nested(vcpu);
+ svm_set_gif(svm, true);
+ /* #GP intercept is still needed for vmware backdoor */
+ if (!enable_vmware_backdoor)
+ clr_exception_intercept(svm, GP_VECTOR);
+
+ /*
+ * Free the nested guest state, unless we are in SMM.
+ * In this case we will return to the nested guest
+ * as soon as we leave SMM.
+ */
+ if (!is_smm(vcpu))
+ svm_free_nested(svm);
+
+ } else {
+ int ret = svm_allocate_nested(svm);
+
+ if (ret) {
+ vcpu->arch.efer = old_efer;
+ return ret;
+ }
+
+ /*
+ * Never intercept #GP for SEV guests, KVM can't
+			 * decrypt guest memory to work around the erratum.
+ */
+ if (svm_gp_erratum_intercept && !sev_guest(vcpu->kvm))
+ set_exception_intercept(svm, GP_VECTOR);
+ }
+ }
+
+ svm->vmcb->save.efer = efer | EFER_SVME;
+ vmcb_mark_dirty(svm->vmcb, VMCB_CR);
+ return 0;
+}
+
+static u32 svm_get_interrupt_shadow(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u32 ret = 0;
+
+ if (svm->vmcb->control.int_state & SVM_INTERRUPT_SHADOW_MASK)
+ ret = KVM_X86_SHADOW_INT_STI | KVM_X86_SHADOW_INT_MOV_SS;
+ return ret;
+}
+
+static void svm_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ if (mask == 0)
+ svm->vmcb->control.int_state &= ~SVM_INTERRUPT_SHADOW_MASK;
+ else
+ svm->vmcb->control.int_state |= SVM_INTERRUPT_SHADOW_MASK;
+}
+
+static int __svm_skip_emulated_instruction(struct kvm_vcpu *vcpu,
+ int emul_type,
+ bool commit_side_effects)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long old_rflags;
+
+ /*
+ * SEV-ES does not expose the next RIP. The RIP update is controlled by
+ * the type of exit and the #VC handler in the guest.
+ */
+ if (sev_es_guest(vcpu->kvm))
+ goto done;
+
+ if (nrips && svm->vmcb->control.next_rip != 0) {
+ WARN_ON_ONCE(!static_cpu_has(X86_FEATURE_NRIPS));
+ svm->next_rip = svm->vmcb->control.next_rip;
+ }
+
+ if (!svm->next_rip) {
+ if (unlikely(!commit_side_effects))
+ old_rflags = svm->vmcb->save.rflags;
+
+ if (!kvm_emulate_instruction(vcpu, emul_type))
+ return 0;
+
+ if (unlikely(!commit_side_effects))
+ svm->vmcb->save.rflags = old_rflags;
+ } else {
+ kvm_rip_write(vcpu, svm->next_rip);
+ }
+
+done:
+ if (likely(commit_side_effects))
+ svm_set_interrupt_shadow(vcpu, 0);
+
+ return 1;
+}
+
+static int svm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+{
+ return __svm_skip_emulated_instruction(vcpu, EMULTYPE_SKIP, true);
+}
+
+static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu, u8 vector)
+{
+ const int emul_type = EMULTYPE_SKIP | EMULTYPE_SKIP_SOFT_INT |
+ EMULTYPE_SET_SOFT_INT_VECTOR(vector);
+ unsigned long rip, old_rip = kvm_rip_read(vcpu);
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ /*
+ * Due to architectural shortcomings, the CPU doesn't always provide
+ * NextRIP, e.g. if KVM intercepted an exception that occurred while
+ * the CPU was vectoring an INTO/INT3 in the guest. Temporarily skip
+ * the instruction even if NextRIP is supported to acquire the next
+ * RIP so that it can be shoved into the NextRIP field, otherwise
+ * hardware will fail to advance guest RIP during event injection.
+ * Drop the exception/interrupt if emulation fails and effectively
+ * retry the instruction, it's the least awful option. If NRIPS is
+ * in use, the skip must not commit any side effects such as clearing
+ * the interrupt shadow or RFLAGS.RF.
+ */
+ if (!__svm_skip_emulated_instruction(vcpu, emul_type, !nrips))
+ return -EIO;
+
+ rip = kvm_rip_read(vcpu);
+
+ /*
+ * Save the injection information, even when using next_rip, as the
+ * VMCB's next_rip will be lost (cleared on VM-Exit) if the injection
+ * doesn't complete due to a VM-Exit occurring while the CPU is
+ * vectoring the event. Decoding the instruction isn't guaranteed to
+ * work as there may be no backing instruction, e.g. if the event is
+ * being injected by L1 for L2, or if the guest is patching INT3 into
+ * a different instruction.
+ */
+ svm->soft_int_injected = true;
+ svm->soft_int_csbase = svm->vmcb->save.cs.base;
+ svm->soft_int_old_rip = old_rip;
+ svm->soft_int_next_rip = rip;
+
+ if (nrips)
+ kvm_rip_write(vcpu, old_rip);
+
+ if (static_cpu_has(X86_FEATURE_NRIPS))
+ svm->vmcb->control.next_rip = rip;
+
+ return 0;
+}
+
+static void svm_inject_exception(struct kvm_vcpu *vcpu)
+{
+ struct kvm_queued_exception *ex = &vcpu->arch.exception;
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ kvm_deliver_exception_payload(vcpu, ex);
+
+ if (kvm_exception_is_soft(ex->vector) &&
+ svm_update_soft_interrupt_rip(vcpu, ex->vector))
+ return;
+
+ svm->vmcb->control.event_inj = ex->vector
+ | SVM_EVTINJ_VALID
+ | (ex->has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
+ | SVM_EVTINJ_TYPE_EXEPT;
+ svm->vmcb->control.event_inj_err = ex->error_code;
+}
+
+static void svm_init_erratum_383(void)
+{
+ u64 val;
+
+ if (!static_cpu_has_bug(X86_BUG_AMD_TLB_MMATCH))
+ return;
+
+ /* Use _safe variants to not break nested virtualization */
+ if (native_read_msr_safe(MSR_AMD64_DC_CFG, &val))
+ return;
+
+ val |= (1ULL << 47);
+
+ native_write_msr_safe(MSR_AMD64_DC_CFG, val);
+
+ erratum_383_found = true;
+}
+
+static void svm_init_osvw(struct kvm_vcpu *vcpu)
+{
+ /*
+ * Guests should see errata 400 and 415 as fixed (assuming that
+ * HLT and IO instructions are intercepted).
+ */
+ vcpu->arch.osvw.length = (osvw_len >= 3) ? (osvw_len) : 3;
+ vcpu->arch.osvw.status = osvw_status & ~(6ULL);
+
+ /*
+ * By increasing VCPU's osvw.length to 3 we are telling the guest that
+ * all osvw.status bits inside that length, including bit 0 (which is
+ * reserved for erratum 298), are valid. However, if host processor's
+ * osvw_len is 0 then osvw_status[0] carries no information. We need to
+ * be conservative here and therefore we tell the guest that erratum 298
+ * is present (because we really don't know).
+ */
+ if (osvw_len == 0 && boot_cpu_data.x86 == 0x10)
+ vcpu->arch.osvw.status |= 1;
+}
+
+static bool __kvm_is_svm_supported(void)
+{
+ int cpu = smp_processor_id();
+ struct cpuinfo_x86 *c = &cpu_data(cpu);
+
+ if (c->x86_vendor != X86_VENDOR_AMD &&
+ c->x86_vendor != X86_VENDOR_HYGON) {
+ pr_err("CPU %d isn't AMD or Hygon\n", cpu);
+ return false;
+ }
+
+ if (!cpu_has(c, X86_FEATURE_SVM)) {
+ pr_err("SVM not supported by CPU %d\n", cpu);
+ return false;
+ }
+
+ if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)) {
+ pr_info("KVM is unsupported when running as an SEV guest\n");
+ return false;
+ }
+
+ return true;
+}
+
+static bool kvm_is_svm_supported(void)
+{
+ bool supported;
+
+ migrate_disable();
+ supported = __kvm_is_svm_supported();
+ migrate_enable();
+
+ return supported;
+}
+
+static int svm_check_processor_compat(void)
+{
+ if (!__kvm_is_svm_supported())
+ return -EIO;
+
+ return 0;
+}
+
+static void __svm_write_tsc_multiplier(u64 multiplier)
+{
+ if (multiplier == __this_cpu_read(current_tsc_ratio))
+ return;
+
+ wrmsrq(MSR_AMD64_TSC_RATIO, multiplier);
+ __this_cpu_write(current_tsc_ratio, multiplier);
+}
+
+static __always_inline struct sev_es_save_area *sev_es_host_save_area(struct svm_cpu_data *sd)
+{
+ return &sd->save_area->host_sev_es_save;
+}
+
+static inline void kvm_cpu_svm_disable(void)
+{
+ uint64_t efer;
+
+ wrmsrq(MSR_VM_HSAVE_PA, 0);
+ rdmsrq(MSR_EFER, efer);
+ if (efer & EFER_SVME) {
+ /*
+ * Force GIF=1 prior to disabling SVM, e.g. to ensure INIT and
+ * NMI aren't blocked.
+ */
+ stgi();
+ wrmsrq(MSR_EFER, efer & ~EFER_SVME);
+ }
+}
+
+static void svm_emergency_disable_virtualization_cpu(void)
+{
+ kvm_rebooting = true;
+
+ kvm_cpu_svm_disable();
+}
+
+static void svm_disable_virtualization_cpu(void)
+{
+ /* Make sure we clean up behind us */
+ if (tsc_scaling)
+ __svm_write_tsc_multiplier(SVM_TSC_RATIO_DEFAULT);
+
+ kvm_cpu_svm_disable();
+
+ amd_pmu_disable_virt();
+}
+
+static int svm_enable_virtualization_cpu(void)
+{
+ struct svm_cpu_data *sd;
+ uint64_t efer;
+ int me = raw_smp_processor_id();
+
+ rdmsrq(MSR_EFER, efer);
+ if (efer & EFER_SVME)
+ return -EBUSY;
+
+ sd = per_cpu_ptr(&svm_data, me);
+ sd->asid_generation = 1;
+ sd->max_asid = cpuid_ebx(SVM_CPUID_FUNC) - 1;
+ sd->next_asid = sd->max_asid + 1;
+ sd->min_asid = max_sev_asid + 1;
+
+ wrmsrq(MSR_EFER, efer | EFER_SVME);
+
+ wrmsrq(MSR_VM_HSAVE_PA, sd->save_area_pa);
+
+ if (static_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+		/*
+		 * Set the default value, even if we don't use TSC scaling,
+		 * to avoid leaving a stale value in the MSR.
+		 */
+ __svm_write_tsc_multiplier(SVM_TSC_RATIO_DEFAULT);
+ }
+
+ /*
+ * Get OSVW bits.
+ *
+ * Note that it is possible to have a system with mixed processor
+ * revisions and therefore different OSVW bits. If bits are not the same
+ * on different processors then choose the worst case (i.e. if erratum
+ * is present on one processor and not on another then assume that the
+ * erratum is present everywhere).
+ */
+ if (cpu_has(&boot_cpu_data, X86_FEATURE_OSVW)) {
+ u64 len, status = 0;
+ int err;
+
+ err = native_read_msr_safe(MSR_AMD64_OSVW_ID_LENGTH, &len);
+ if (!err)
+ err = native_read_msr_safe(MSR_AMD64_OSVW_STATUS, &status);
+
+		if (err) {
+			osvw_status = osvw_len = 0;
+		} else {
+			if (len < osvw_len)
+				osvw_len = len;
+			osvw_status |= status;
+			osvw_status &= (1ULL << osvw_len) - 1;
+		}
+	} else {
+		osvw_status = osvw_len = 0;
+	}
+
+ svm_init_erratum_383();
+
+ amd_pmu_enable_virt();
+
+ return 0;
+}
+
+static void svm_cpu_uninit(int cpu)
+{
+ struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, cpu);
+
+ if (!sd->save_area)
+ return;
+
+ kfree(sd->sev_vmcbs);
+ __free_page(__sme_pa_to_page(sd->save_area_pa));
+ sd->save_area_pa = 0;
+ sd->save_area = NULL;
+}
+
+static int svm_cpu_init(int cpu)
+{
+ struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, cpu);
+ struct page *save_area_page;
+ int ret = -ENOMEM;
+
+ memset(sd, 0, sizeof(struct svm_cpu_data));
+ save_area_page = snp_safe_alloc_page_node(cpu_to_node(cpu), GFP_KERNEL);
+ if (!save_area_page)
+ return ret;
+
+ ret = sev_cpu_init(sd);
+ if (ret)
+ goto free_save_area;
+
+ sd->save_area = page_address(save_area_page);
+ sd->save_area_pa = __sme_page_pa(save_area_page);
+ return 0;
+
+free_save_area:
+ __free_page(save_area_page);
+ return ret;
+}
+
+static void set_dr_intercepts(struct vcpu_svm *svm)
+{
+ struct vmcb *vmcb = svm->vmcb01.ptr;
+
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR0_READ);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR1_READ);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR2_READ);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR3_READ);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR4_READ);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR5_READ);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR6_READ);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR0_WRITE);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR1_WRITE);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR2_WRITE);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR3_WRITE);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR4_WRITE);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR5_WRITE);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR6_WRITE);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_READ);
+ vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_WRITE);
+
+ recalc_intercepts(svm);
+}
+
+static void clr_dr_intercepts(struct vcpu_svm *svm)
+{
+ struct vmcb *vmcb = svm->vmcb01.ptr;
+
+ vmcb->control.intercepts[INTERCEPT_DR] = 0;
+
+ recalc_intercepts(svm);
+}
+
+static bool msr_write_intercepted(struct kvm_vcpu *vcpu, u32 msr)
+{
+ /*
+ * For non-nested case:
+ * If the L01 MSR bitmap does not intercept the MSR, then we need to
+ * save it.
+ *
+ * For nested case:
+ * If the L02 MSR bitmap does not intercept the MSR, then we need to
+ * save it.
+ */
+ void *msrpm = is_guest_mode(vcpu) ? to_svm(vcpu)->nested.msrpm :
+ to_svm(vcpu)->msrpm;
+
+ return svm_test_msr_bitmap_write(msrpm, msr);
+}
+
+void svm_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr, int type, bool set)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ void *msrpm = svm->msrpm;
+
+ /* Don't disable interception for MSRs userspace wants to handle. */
+ if (type & MSR_TYPE_R) {
+ if (!set && kvm_msr_allowed(vcpu, msr, KVM_MSR_FILTER_READ))
+ svm_clear_msr_bitmap_read(msrpm, msr);
+ else
+ svm_set_msr_bitmap_read(msrpm, msr);
+ }
+
+ if (type & MSR_TYPE_W) {
+ if (!set && kvm_msr_allowed(vcpu, msr, KVM_MSR_FILTER_WRITE))
+ svm_clear_msr_bitmap_write(msrpm, msr);
+ else
+ svm_set_msr_bitmap_write(msrpm, msr);
+ }
+
+ svm_hv_vmcb_dirty_nested_enlightenments(vcpu);
+ svm->nested.force_msr_bitmap_recalc = true;
+}
+
+void *svm_alloc_permissions_map(unsigned long size, gfp_t gfp_mask)
+{
+ unsigned int order = get_order(size);
+ struct page *pages = alloc_pages(gfp_mask, order);
+ void *pm;
+
+ if (!pages)
+ return NULL;
+
+ /*
+ * Set all bits in the permissions map so that all MSR and I/O accesses
+ * are intercepted by default.
+ */
+ pm = page_address(pages);
+ memset(pm, 0xff, PAGE_SIZE * (1 << order));
+
+ return pm;
+}
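+
+/*
+ * Convention note for the helper above: in both the MSR and I/O permission
+ * maps a set bit means "intercept", so the 0xff fill yields an
+ * intercept-everything default from which individual MSRs are then opted
+ * out (see the recalc helpers below).
+ */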
+
+static void svm_recalc_lbr_msr_intercepts(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ bool intercept = !(svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK);
+
+ if (intercept == svm->lbr_msrs_intercepted)
+ return;
+
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_LASTBRANCHFROMIP, MSR_TYPE_RW, intercept);
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_LASTBRANCHTOIP, MSR_TYPE_RW, intercept);
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_LASTINTFROMIP, MSR_TYPE_RW, intercept);
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_LASTINTTOIP, MSR_TYPE_RW, intercept);
+
+ if (sev_es_guest(vcpu->kvm))
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_DEBUGCTLMSR, MSR_TYPE_RW, intercept);
+
+ svm->lbr_msrs_intercepted = intercept;
+}
+
+void svm_vcpu_free_msrpm(void *msrpm)
+{
+ __free_pages(virt_to_page(msrpm), get_order(MSRPM_SIZE));
+}
+
+static void svm_recalc_msr_intercepts(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ svm_disable_intercept_for_msr(vcpu, MSR_STAR, MSR_TYPE_RW);
+ svm_disable_intercept_for_msr(vcpu, MSR_IA32_SYSENTER_CS, MSR_TYPE_RW);
+
+#ifdef CONFIG_X86_64
+ svm_disable_intercept_for_msr(vcpu, MSR_GS_BASE, MSR_TYPE_RW);
+ svm_disable_intercept_for_msr(vcpu, MSR_FS_BASE, MSR_TYPE_RW);
+ svm_disable_intercept_for_msr(vcpu, MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
+ svm_disable_intercept_for_msr(vcpu, MSR_LSTAR, MSR_TYPE_RW);
+ svm_disable_intercept_for_msr(vcpu, MSR_CSTAR, MSR_TYPE_RW);
+ svm_disable_intercept_for_msr(vcpu, MSR_SYSCALL_MASK, MSR_TYPE_RW);
+#endif
+
+ if (lbrv)
+ svm_recalc_lbr_msr_intercepts(vcpu);
+
+ if (cpu_feature_enabled(X86_FEATURE_IBPB))
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_PRED_CMD, MSR_TYPE_W,
+ !guest_has_pred_cmd_msr(vcpu));
+
+ if (cpu_feature_enabled(X86_FEATURE_FLUSH_L1D))
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_FLUSH_CMD, MSR_TYPE_W,
+ !guest_cpu_cap_has(vcpu, X86_FEATURE_FLUSH_L1D));
+
+ /*
+ * Disable interception of SPEC_CTRL if KVM doesn't need to manually
+ * context switch the MSR (SPEC_CTRL is virtualized by the CPU), or if
+ * the guest has a non-zero SPEC_CTRL value, i.e. is likely actively
+ * using SPEC_CTRL.
+ */
+ if (cpu_feature_enabled(X86_FEATURE_V_SPEC_CTRL))
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW,
+ !guest_has_spec_ctrl_msr(vcpu));
+ else
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW,
+ !svm->spec_ctrl);
+
+ /*
+ * Intercept SYSENTER_EIP and SYSENTER_ESP when emulating an Intel CPU,
+	 * as AMD hardware only stores 32 bits, whereas Intel CPUs track 64 bits.
+ */
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_SYSENTER_EIP, MSR_TYPE_RW,
+ guest_cpuid_is_intel_compatible(vcpu));
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_SYSENTER_ESP, MSR_TYPE_RW,
+ guest_cpuid_is_intel_compatible(vcpu));
+
+ if (kvm_aperfmperf_in_guest(vcpu->kvm)) {
+ svm_disable_intercept_for_msr(vcpu, MSR_IA32_APERF, MSR_TYPE_R);
+ svm_disable_intercept_for_msr(vcpu, MSR_IA32_MPERF, MSR_TYPE_R);
+ }
+
+ if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
+ bool shstk_enabled = guest_cpu_cap_has(vcpu, X86_FEATURE_SHSTK);
+
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_U_CET, MSR_TYPE_RW, !shstk_enabled);
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_S_CET, MSR_TYPE_RW, !shstk_enabled);
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP, MSR_TYPE_RW, !shstk_enabled);
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP, MSR_TYPE_RW, !shstk_enabled);
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP, MSR_TYPE_RW, !shstk_enabled);
+ svm_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP, MSR_TYPE_RW, !shstk_enabled);
+ }
+
+ if (sev_es_guest(vcpu->kvm))
+ sev_es_recalc_msr_intercepts(vcpu);
+
+ /*
+ * x2APIC intercepts are modified on-demand and cannot be filtered by
+ * userspace.
+ */
+}
+
+void svm_copy_lbrs(struct vmcb *to_vmcb, struct vmcb *from_vmcb)
+{
+ to_vmcb->save.dbgctl = from_vmcb->save.dbgctl;
+ to_vmcb->save.br_from = from_vmcb->save.br_from;
+ to_vmcb->save.br_to = from_vmcb->save.br_to;
+ to_vmcb->save.last_excp_from = from_vmcb->save.last_excp_from;
+ to_vmcb->save.last_excp_to = from_vmcb->save.last_excp_to;
+
+ vmcb_mark_dirty(to_vmcb, VMCB_LBR);
+}
+
+static void __svm_enable_lbrv(struct kvm_vcpu *vcpu)
+{
+ to_svm(vcpu)->vmcb->control.virt_ext |= LBR_CTL_ENABLE_MASK;
+}
+
+void svm_enable_lbrv(struct kvm_vcpu *vcpu)
+{
+ __svm_enable_lbrv(vcpu);
+ svm_recalc_lbr_msr_intercepts(vcpu);
+}
+
+static void __svm_disable_lbrv(struct kvm_vcpu *vcpu)
+{
+ KVM_BUG_ON(sev_es_guest(vcpu->kvm), vcpu->kvm);
+ to_svm(vcpu)->vmcb->control.virt_ext &= ~LBR_CTL_ENABLE_MASK;
+}
+
+void svm_update_lbrv(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ bool current_enable_lbrv = svm->vmcb->control.virt_ext & LBR_CTL_ENABLE_MASK;
+ bool enable_lbrv = (svm->vmcb->save.dbgctl & DEBUGCTLMSR_LBR) ||
+ (is_guest_mode(vcpu) && guest_cpu_cap_has(vcpu, X86_FEATURE_LBRV) &&
+ (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK));
+
+ if (enable_lbrv && !current_enable_lbrv)
+ __svm_enable_lbrv(vcpu);
+ else if (!enable_lbrv && current_enable_lbrv)
+ __svm_disable_lbrv(vcpu);
+
+ /*
+ * During nested transitions, it is possible that the current VMCB has
+ * LBR_CTL set, but the previous LBR_CTL had it cleared (or vice versa).
+ * In this case, even though LBR_CTL does not need an update, intercepts
+ * do, so always recalculate the intercepts here.
+ */
+ svm_recalc_lbr_msr_intercepts(vcpu);
+}
+
+void disable_nmi_singlestep(struct vcpu_svm *svm)
+{
+ svm->nmi_singlestep = false;
+
+ if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP)) {
+ /* Clear our flags if they were not set by the guest */
+ if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF))
+ svm->vmcb->save.rflags &= ~X86_EFLAGS_TF;
+ if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_RF))
+ svm->vmcb->save.rflags &= ~X86_EFLAGS_RF;
+ }
+}
+
+static void grow_ple_window(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb_control_area *control = &svm->vmcb->control;
+ int old = control->pause_filter_count;
+
+ if (kvm_pause_in_guest(vcpu->kvm))
+ return;
+
+ control->pause_filter_count = __grow_ple_window(old,
+ pause_filter_count,
+ pause_filter_count_grow,
+ pause_filter_count_max);
+
+ if (control->pause_filter_count != old) {
+ vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
+ trace_kvm_ple_window_update(vcpu->vcpu_id,
+ control->pause_filter_count, old);
+ }
+}
+
+static void shrink_ple_window(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb_control_area *control = &svm->vmcb->control;
+ int old = control->pause_filter_count;
+
+ if (kvm_pause_in_guest(vcpu->kvm))
+ return;
+
+ control->pause_filter_count =
+ __shrink_ple_window(old,
+ pause_filter_count,
+ pause_filter_count_shrink,
+ pause_filter_count);
+ if (control->pause_filter_count != old) {
+ vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
+ trace_kvm_ple_window_update(vcpu->vcpu_id,
+ control->pause_filter_count, old);
+ }
+}
+
+static void svm_hardware_unsetup(void)
+{
+ int cpu;
+
+ avic_hardware_unsetup();
+
+ sev_hardware_unsetup();
+
+ for_each_possible_cpu(cpu)
+ svm_cpu_uninit(cpu);
+
+ __free_pages(__sme_pa_to_page(iopm_base), get_order(IOPM_SIZE));
+ iopm_base = 0;
+}
+
+static void init_seg(struct vmcb_seg *seg)
+{
+ seg->selector = 0;
+ seg->attrib = SVM_SELECTOR_P_MASK | SVM_SELECTOR_S_MASK |
+ SVM_SELECTOR_WRITE_MASK; /* Read/Write Data Segment */
+ seg->limit = 0xffff;
+ seg->base = 0;
+}
+
+static void init_sys_seg(struct vmcb_seg *seg, uint32_t type)
+{
+ seg->selector = 0;
+ seg->attrib = SVM_SELECTOR_P_MASK | type;
+ seg->limit = 0xffff;
+ seg->base = 0;
+}
+
+static u64 svm_get_l2_tsc_offset(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ return svm->nested.ctl.tsc_offset;
+}
+
+static u64 svm_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ return svm->tsc_ratio_msr;
+}
+
+static void svm_write_tsc_offset(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ svm->vmcb01.ptr->control.tsc_offset = vcpu->arch.l1_tsc_offset;
+ svm->vmcb->control.tsc_offset = vcpu->arch.tsc_offset;
+ vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
+}
+
+void svm_write_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+ preempt_disable();
+ if (to_svm(vcpu)->guest_state_loaded)
+ __svm_write_tsc_multiplier(vcpu->arch.tsc_scaling_ratio);
+ preempt_enable();
+}
+
+/* Evaluate instruction intercepts that depend on guest CPUID features. */
+static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ /*
+ * Intercept INVPCID if shadow paging is enabled to sync/free shadow
+ * roots, or if INVPCID is disabled in the guest to inject #UD.
+ */
+ if (kvm_cpu_cap_has(X86_FEATURE_INVPCID)) {
+ if (!npt_enabled ||
+ !guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_INVPCID))
+ svm_set_intercept(svm, INTERCEPT_INVPCID);
+ else
+ svm_clr_intercept(svm, INTERCEPT_INVPCID);
+ }
+
+ if (kvm_cpu_cap_has(X86_FEATURE_RDTSCP)) {
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_RDTSCP))
+ svm_clr_intercept(svm, INTERCEPT_RDTSCP);
+ else
+ svm_set_intercept(svm, INTERCEPT_RDTSCP);
+ }
+
+ if (guest_cpuid_is_intel_compatible(vcpu)) {
+ svm_set_intercept(svm, INTERCEPT_VMLOAD);
+ svm_set_intercept(svm, INTERCEPT_VMSAVE);
+ svm->vmcb->control.virt_ext &= ~VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
+ } else {
+ /*
+ * If hardware supports Virtual VMLOAD/VMSAVE, enable it in the
+ * VMCB and clear the intercepts to avoid #VMEXITs.
+ */
+ if (vls) {
+ svm_clr_intercept(svm, INTERCEPT_VMLOAD);
+ svm_clr_intercept(svm, INTERCEPT_VMSAVE);
+ svm->vmcb->control.virt_ext |= VIRTUAL_VMLOAD_VMSAVE_ENABLE_MASK;
+ }
+ }
+}
+
+static void svm_recalc_intercepts(struct kvm_vcpu *vcpu)
+{
+ svm_recalc_instruction_intercepts(vcpu);
+ svm_recalc_msr_intercepts(vcpu);
+}
+
+static void init_vmcb(struct kvm_vcpu *vcpu, bool init_event)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb *vmcb = svm->vmcb01.ptr;
+ struct vmcb_control_area *control = &vmcb->control;
+ struct vmcb_save_area *save = &vmcb->save;
+
+ svm_set_intercept(svm, INTERCEPT_CR0_READ);
+ svm_set_intercept(svm, INTERCEPT_CR3_READ);
+ svm_set_intercept(svm, INTERCEPT_CR4_READ);
+ svm_set_intercept(svm, INTERCEPT_CR0_WRITE);
+ svm_set_intercept(svm, INTERCEPT_CR3_WRITE);
+ svm_set_intercept(svm, INTERCEPT_CR4_WRITE);
+ if (!kvm_vcpu_apicv_active(vcpu))
+ svm_set_intercept(svm, INTERCEPT_CR8_WRITE);
+
+ set_dr_intercepts(svm);
+
+ set_exception_intercept(svm, PF_VECTOR);
+ set_exception_intercept(svm, UD_VECTOR);
+ set_exception_intercept(svm, MC_VECTOR);
+ set_exception_intercept(svm, AC_VECTOR);
+ set_exception_intercept(svm, DB_VECTOR);
+ /*
+ * Guest accesses to VMware backdoor ports can legitimately trigger
+ * #GP because of the TSS I/O permission bitmap. Intercept those #GPs
+ * and allow the accesses anyway, as VMware does.
+ */
+ if (enable_vmware_backdoor)
+ set_exception_intercept(svm, GP_VECTOR);
+
+ svm_set_intercept(svm, INTERCEPT_INTR);
+ svm_set_intercept(svm, INTERCEPT_NMI);
+
+ if (intercept_smi)
+ svm_set_intercept(svm, INTERCEPT_SMI);
+
+ svm_set_intercept(svm, INTERCEPT_SELECTIVE_CR0);
+ svm_set_intercept(svm, INTERCEPT_RDPMC);
+ svm_set_intercept(svm, INTERCEPT_CPUID);
+ svm_set_intercept(svm, INTERCEPT_INVD);
+ svm_set_intercept(svm, INTERCEPT_INVLPG);
+ svm_set_intercept(svm, INTERCEPT_INVLPGA);
+ svm_set_intercept(svm, INTERCEPT_IOIO_PROT);
+ svm_set_intercept(svm, INTERCEPT_MSR_PROT);
+ svm_set_intercept(svm, INTERCEPT_TASK_SWITCH);
+ svm_set_intercept(svm, INTERCEPT_SHUTDOWN);
+ svm_set_intercept(svm, INTERCEPT_VMRUN);
+ svm_set_intercept(svm, INTERCEPT_VMMCALL);
+ svm_set_intercept(svm, INTERCEPT_VMLOAD);
+ svm_set_intercept(svm, INTERCEPT_VMSAVE);
+ svm_set_intercept(svm, INTERCEPT_STGI);
+ svm_set_intercept(svm, INTERCEPT_CLGI);
+ svm_set_intercept(svm, INTERCEPT_SKINIT);
+ svm_set_intercept(svm, INTERCEPT_WBINVD);
+ svm_set_intercept(svm, INTERCEPT_XSETBV);
+ svm_set_intercept(svm, INTERCEPT_RDPRU);
+ svm_set_intercept(svm, INTERCEPT_RSM);
+
+ if (!kvm_mwait_in_guest(vcpu->kvm)) {
+ svm_set_intercept(svm, INTERCEPT_MONITOR);
+ svm_set_intercept(svm, INTERCEPT_MWAIT);
+ }
+
+ if (!kvm_hlt_in_guest(vcpu->kvm)) {
+ if (cpu_feature_enabled(X86_FEATURE_IDLE_HLT))
+ svm_set_intercept(svm, INTERCEPT_IDLE_HLT);
+ else
+ svm_set_intercept(svm, INTERCEPT_HLT);
+ }
+
+ control->iopm_base_pa = iopm_base;
+ control->msrpm_base_pa = __sme_set(__pa(svm->msrpm));
+ control->int_ctl = V_INTR_MASKING_MASK;
+
+ init_seg(&save->es);
+ init_seg(&save->ss);
+ init_seg(&save->ds);
+ init_seg(&save->fs);
+ init_seg(&save->gs);
+
+ save->cs.selector = 0xf000;
+ save->cs.base = 0xffff0000;
+ /* Executable/Readable Code Segment */
+ save->cs.attrib = SVM_SELECTOR_READ_MASK | SVM_SELECTOR_P_MASK |
+ SVM_SELECTOR_S_MASK | SVM_SELECTOR_CODE_MASK;
+ save->cs.limit = 0xffff;
+
+ save->gdtr.base = 0;
+ save->gdtr.limit = 0xffff;
+ save->idtr.base = 0;
+ save->idtr.limit = 0xffff;
+
+ init_sys_seg(&save->ldtr, SEG_TYPE_LDT);
+ init_sys_seg(&save->tr, SEG_TYPE_BUSY_TSS16);
+
+ if (npt_enabled) {
+ /* Setup VMCB for Nested Paging */
+ control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE;
+ svm_clr_intercept(svm, INTERCEPT_INVLPG);
+ clr_exception_intercept(svm, PF_VECTOR);
+ svm_clr_intercept(svm, INTERCEPT_CR3_READ);
+ svm_clr_intercept(svm, INTERCEPT_CR3_WRITE);
+ save->g_pat = vcpu->arch.pat;
+ save->cr3 = 0;
+ }
+ svm->current_vmcb->asid_generation = 0;
+ svm->asid = 0;
+
+ svm->nested.vmcb12_gpa = INVALID_GPA;
+ svm->nested.last_vmcb12_gpa = INVALID_GPA;
+
+ if (!kvm_pause_in_guest(vcpu->kvm)) {
+ control->pause_filter_count = pause_filter_count;
+ if (pause_filter_thresh)
+ control->pause_filter_thresh = pause_filter_thresh;
+ svm_set_intercept(svm, INTERCEPT_PAUSE);
+ } else {
+ svm_clr_intercept(svm, INTERCEPT_PAUSE);
+ }
+
+ if (kvm_vcpu_apicv_active(vcpu))
+ avic_init_vmcb(svm, vmcb);
+
+ if (vnmi)
+ svm->vmcb->control.int_ctl |= V_NMI_ENABLE_MASK;
+
+ if (vgif) {
+ svm_clr_intercept(svm, INTERCEPT_STGI);
+ svm_clr_intercept(svm, INTERCEPT_CLGI);
+ svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK;
+ }
+
+ if (vcpu->kvm->arch.bus_lock_detection_enabled)
+ svm_set_intercept(svm, INTERCEPT_BUSLOCK);
+
+ if (sev_guest(vcpu->kvm))
+ sev_init_vmcb(svm, init_event);
+
+ svm_hv_init_vmcb(vmcb);
+
+ kvm_make_request(KVM_REQ_RECALC_INTERCEPTS, vcpu);
+
+ vmcb_mark_all_dirty(vmcb);
+
+ enable_gif(svm);
+}
+
+static void __svm_vcpu_reset(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ svm_init_osvw(vcpu);
+
+ if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_STUFF_FEATURE_MSRS))
+ vcpu->arch.microcode_version = 0x01000065;
+ svm->tsc_ratio_msr = kvm_caps.default_tsc_scaling_ratio;
+
+ svm->nmi_masked = false;
+ svm->awaiting_iret_completion = false;
+}
+
+static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ svm->spec_ctrl = 0;
+ svm->virt_spec_ctrl = 0;
+
+ init_vmcb(vcpu, init_event);
+
+ if (!init_event)
+ __svm_vcpu_reset(vcpu);
+}
+
+void svm_switch_vmcb(struct vcpu_svm *svm, struct kvm_vmcb_info *target_vmcb)
+{
+ svm->current_vmcb = target_vmcb;
+ svm->vmcb = target_vmcb->ptr;
+}
+
+static int svm_vcpu_precreate(struct kvm *kvm)
+{
+ return avic_alloc_physical_id_table(kvm);
+}
+
+static int svm_vcpu_create(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm;
+ struct page *vmcb01_page;
+ int err;
+
+ BUILD_BUG_ON(offsetof(struct vcpu_svm, vcpu) != 0);
+ svm = to_svm(vcpu);
+
+ err = -ENOMEM;
+ vmcb01_page = snp_safe_alloc_page();
+ if (!vmcb01_page)
+ goto out;
+
+ err = sev_vcpu_create(vcpu);
+ if (err)
+ goto error_free_vmcb_page;
+
+ err = avic_init_vcpu(svm);
+ if (err)
+ goto error_free_sev;
+
+ svm->msrpm = svm_vcpu_alloc_msrpm();
+ if (!svm->msrpm) {
+ err = -ENOMEM;
+ goto error_free_sev;
+ }
+
+ svm->x2avic_msrs_intercepted = true;
+ svm->lbr_msrs_intercepted = true;
+
+ svm->vmcb01.ptr = page_address(vmcb01_page);
+ svm->vmcb01.pa = __sme_set(page_to_pfn(vmcb01_page) << PAGE_SHIFT);
+ svm_switch_vmcb(svm, &svm->vmcb01);
+
+ svm->guest_state_loaded = false;
+
+ return 0;
+
+error_free_sev:
+ sev_free_vcpu(vcpu);
+error_free_vmcb_page:
+ __free_page(vmcb01_page);
+out:
+ return err;
+}
+
+static void svm_vcpu_free(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ WARN_ON_ONCE(!list_empty(&svm->ir_list));
+
+ svm_leave_nested(vcpu);
+ svm_free_nested(svm);
+
+ sev_free_vcpu(vcpu);
+
+ __free_page(__sme_pa_to_page(svm->vmcb01.pa));
+ svm_vcpu_free_msrpm(svm->msrpm);
+}
+
+#ifdef CONFIG_CPU_MITIGATIONS
+static DEFINE_SPINLOCK(srso_lock);
+static atomic_t srso_nr_vms;
+
+static void svm_srso_clear_bp_spec_reduce(void *ign)
+{
+ struct svm_cpu_data *sd = this_cpu_ptr(&svm_data);
+
+ if (!sd->bp_spec_reduce_set)
+ return;
+
+ msr_clear_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT);
+ sd->bp_spec_reduce_set = false;
+}
+
+static void svm_srso_vm_destroy(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SRSO_BP_SPEC_REDUCE))
+ return;
+
+ if (atomic_dec_return(&srso_nr_vms))
+ return;
+
+ guard(spinlock)(&srso_lock);
+
+ /*
+ * Re-check now that the lock is held: a new VM may have come along,
+ * acquired the lock, and incremented the count before this task
+ * acquired the lock.
+ */
+ if (atomic_read(&srso_nr_vms))
+ return;
+
+ on_each_cpu(svm_srso_clear_bp_spec_reduce, NULL, 1);
+}
+
+static void svm_srso_vm_init(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_SRSO_BP_SPEC_REDUCE))
+ return;
+
+ /*
+ * Acquire the lock on 0 => 1 transitions to ensure a potential 1 => 0
+ * transition, i.e. destroying the last VM, is fully complete, e.g. so
+ * that a delayed IPI doesn't clear BP_SPEC_REDUCE after a vCPU runs.
+ */
+ if (atomic_inc_not_zero(&srso_nr_vms))
+ return;
+
+ guard(spinlock)(&srso_lock);
+
+ atomic_inc(&srso_nr_vms);
+}
+#else
+static void svm_srso_vm_init(void) { }
+static void svm_srso_vm_destroy(void) { }
+#endif
+
+static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, vcpu->cpu);
+
+ if (sev_es_guest(vcpu->kvm))
+ sev_es_unmap_ghcb(svm);
+
+ if (svm->guest_state_loaded)
+ return;
+
+ /*
+ * Save additional host state that will be restored on VMEXIT (sev-es)
+ * or subsequent vmload of host save area.
+ */
+ vmsave(sd->save_area_pa);
+ if (sev_es_guest(vcpu->kvm))
+ sev_es_prepare_switch_to_guest(svm, sev_es_host_save_area(sd));
+
+ if (tsc_scaling)
+ __svm_write_tsc_multiplier(vcpu->arch.tsc_scaling_ratio);
+
+ /*
+ * TSC_AUX is always virtualized (context switched by hardware) for
+ * SEV-ES guests when the feature is available. For non-SEV-ES guests,
+ * context switch TSC_AUX via the user_return MSR infrastructure (not
+ * all CPUs support TSC_AUX virtualization).
+ */
+ if (likely(tsc_aux_uret_slot >= 0) &&
+ (!boot_cpu_has(X86_FEATURE_V_TSC_AUX) || !sev_es_guest(vcpu->kvm)))
+ kvm_set_user_return_msr(tsc_aux_uret_slot, svm->tsc_aux, -1ull);
+
+ if (cpu_feature_enabled(X86_FEATURE_SRSO_BP_SPEC_REDUCE) &&
+ !sd->bp_spec_reduce_set) {
+ sd->bp_spec_reduce_set = true;
+ msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_BP_SPEC_REDUCE_BIT);
+ }
+ svm->guest_state_loaded = true;
+}
+
+static void svm_prepare_host_switch(struct kvm_vcpu *vcpu)
+{
+ to_svm(vcpu)->guest_state_loaded = false;
+}
+
+static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+ if (vcpu->scheduled_out && !kvm_pause_in_guest(vcpu->kvm))
+ shrink_ple_window(vcpu);
+
+ if (kvm_vcpu_apicv_active(vcpu))
+ avic_vcpu_load(vcpu, cpu);
+}
+
+static void svm_vcpu_put(struct kvm_vcpu *vcpu)
+{
+ if (kvm_vcpu_apicv_active(vcpu))
+ avic_vcpu_put(vcpu);
+
+ svm_prepare_host_switch(vcpu);
+
+ ++vcpu->stat.host_state_reload;
+}
+
+static unsigned long svm_get_rflags(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long rflags = svm->vmcb->save.rflags;
+
+ if (svm->nmi_singlestep) {
+ /* Hide our flags if they were not set by the guest */
+ if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF))
+ rflags &= ~X86_EFLAGS_TF;
+ if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_RF))
+ rflags &= ~X86_EFLAGS_RF;
+ }
+ return rflags;
+}
+
+static void svm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+ if (to_svm(vcpu)->nmi_singlestep)
+ rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
+
+ /*
+ * Any change of EFLAGS.VM is accompanied by a reload of SS
+ * (caused by either a task switch or an inter-privilege IRET),
+ * so we do not need to update the CPL here.
+ */
+ to_svm(vcpu)->vmcb->save.rflags = rflags;
+}
+
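+/*
+ * For SEV-ES guests, RFLAGS lives in the encrypted VMSA, so interrupt
+ * blocking must be read from the GUEST_INTERRUPT_MASK bit that hardware
+ * reflects into int_state instead of from RFLAGS.IF.
+ */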
+static bool svm_get_if_flag(struct kvm_vcpu *vcpu)
+{
+ struct vmcb *vmcb = to_svm(vcpu)->vmcb;
+
+ return sev_es_guest(vcpu->kvm)
+ ? vmcb->control.int_state & SVM_GUEST_INTERRUPT_MASK
+ : kvm_get_rflags(vcpu) & X86_EFLAGS_IF;
+}
+
+static void svm_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+ kvm_register_mark_available(vcpu, reg);
+
+ switch (reg) {
+ case VCPU_EXREG_PDPTR:
+ /*
+ * When !npt_enabled, mmu->pdptrs[] is already available since
+ * it is always updated per SDM when moving to CRs.
+ */
+ if (npt_enabled)
+ load_pdptrs(vcpu, kvm_read_cr3(vcpu));
+ break;
+ default:
+ KVM_BUG_ON(1, vcpu->kvm);
+ }
+}
+
+static void svm_set_vintr(struct vcpu_svm *svm)
+{
+ struct vmcb_control_area *control;
+
+ /*
+ * The following fields are ignored when AVIC is enabled
+ */
+ WARN_ON(kvm_vcpu_apicv_activated(&svm->vcpu));
+
+ svm_set_intercept(svm, INTERCEPT_VINTR);
+
+ /*
+ * Recalculating intercepts may have cleared the VINTR intercept. If
+ * V_INTR_MASKING is enabled in vmcb12, then the effective RFLAGS.IF
+ * for L1 physical interrupts is L1's RFLAGS.IF at the time of VMRUN.
+ * Requesting an interrupt window if save.RFLAGS.IF=0 is pointless as
+ * interrupts will never be unblocked while L2 is running.
+ */
+ if (!svm_is_intercept(svm, INTERCEPT_VINTR))
+ return;
+
+ /*
+ * This is just a dummy VINTR to actually cause a vmexit to happen.
+ * Actual injection of virtual interrupts happens through EVENTINJ.
+ */
+ control = &svm->vmcb->control;
+ control->int_vector = 0x0;
+ control->int_ctl &= ~V_INTR_PRIO_MASK;
+ control->int_ctl |= V_IRQ_MASK |
+ ((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);
+ vmcb_mark_dirty(svm->vmcb, VMCB_INTR);
+}
+
+static void svm_clear_vintr(struct vcpu_svm *svm)
+{
+ svm_clr_intercept(svm, INTERCEPT_VINTR);
+
+ /* Drop int_ctl fields related to VINTR injection. */
+ svm->vmcb->control.int_ctl &= ~V_IRQ_INJECTION_BITS_MASK;
+ if (is_guest_mode(&svm->vcpu)) {
+ svm->vmcb01.ptr->control.int_ctl &= ~V_IRQ_INJECTION_BITS_MASK;
+
+ WARN_ON((svm->vmcb->control.int_ctl & V_TPR_MASK) !=
+ (svm->nested.ctl.int_ctl & V_TPR_MASK));
+
+ svm->vmcb->control.int_ctl |= svm->nested.ctl.int_ctl &
+ V_IRQ_INJECTION_BITS_MASK;
+
+ svm->vmcb->control.int_vector = svm->nested.ctl.int_vector;
+ }
+
+ vmcb_mark_dirty(svm->vmcb, VMCB_INTR);
+}
+
+static struct vmcb_seg *svm_seg(struct kvm_vcpu *vcpu, int seg)
+{
+ struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save;
+ struct vmcb_save_area *save01 = &to_svm(vcpu)->vmcb01.ptr->save;
+
+ switch (seg) {
+ case VCPU_SREG_CS: return &save->cs;
+ case VCPU_SREG_DS: return &save->ds;
+ case VCPU_SREG_ES: return &save->es;
+ case VCPU_SREG_FS: return &save01->fs;
+ case VCPU_SREG_GS: return &save01->gs;
+ case VCPU_SREG_SS: return &save->ss;
+ case VCPU_SREG_TR: return &save01->tr;
+ case VCPU_SREG_LDTR: return &save01->ldtr;
+ }
+ BUG();
+ return NULL;
+}
+
+static u64 svm_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+ struct vmcb_seg *s = svm_seg(vcpu, seg);
+
+ return s->base;
+}
+
+static void svm_get_segment(struct kvm_vcpu *vcpu,
+ struct kvm_segment *var, int seg)
+{
+ struct vmcb_seg *s = svm_seg(vcpu, seg);
+
+ var->base = s->base;
+ var->limit = s->limit;
+ var->selector = s->selector;
+ var->type = s->attrib & SVM_SELECTOR_TYPE_MASK;
+ var->s = (s->attrib >> SVM_SELECTOR_S_SHIFT) & 1;
+ var->dpl = (s->attrib >> SVM_SELECTOR_DPL_SHIFT) & 3;
+ var->present = (s->attrib >> SVM_SELECTOR_P_SHIFT) & 1;
+ var->avl = (s->attrib >> SVM_SELECTOR_AVL_SHIFT) & 1;
+ var->l = (s->attrib >> SVM_SELECTOR_L_SHIFT) & 1;
+ var->db = (s->attrib >> SVM_SELECTOR_DB_SHIFT) & 1;
+
+ /*
+ * AMD CPUs circa 2014 track the G bit for all segments except CS.
+ * However, the SVM spec states that the G bit is not observed by the
+ * CPU, and some VMware virtual CPUs drop the G bit for all segments.
+ * So let's synthesize a legal G bit for all segments; this helps
+ * running KVM nested. It also helps cross-vendor migration, because
+ * Intel's vmentry has a check on the 'G' bit.
+ */
+ var->g = s->limit > 0xfffff;
+
+ /*
+ * AMD's VMCB does not have an explicit unusable field, so emulate it
+ * for cross-vendor migration purposes by treating "not present" as
+ * unusable.
+ */
+ var->unusable = !var->present;
+
+ switch (seg) {
+ case VCPU_SREG_TR:
+ /*
+ * Work around a bug where the busy flag in the tr selector
+ * isn't exposed
+ */
+ var->type |= 0x2;
+ break;
+ case VCPU_SREG_DS:
+ case VCPU_SREG_ES:
+ case VCPU_SREG_FS:
+ case VCPU_SREG_GS:
+ /*
+ * The accessed bit must always be set in the segment descriptor
+ * cache: although it can be cleared in the descriptor itself, the
+ * cached bit always remains 1. Since Intel checks this on VM-entry,
+ * set it here to support cross-vendor migration.
+ */
+ if (!var->unusable)
+ var->type |= 0x1;
+ break;
+ case VCPU_SREG_SS:
+ /*
+ * On AMD CPUs sometimes the DB bit in the segment
+ * descriptor is left as 1, although the whole segment has
+ * been made unusable. Clear it here to pass an Intel VMX
+ * entry check when cross vendor migrating.
+ */
+ if (var->unusable)
+ var->db = 0;
+ /* This is symmetric with svm_set_segment() */
+ var->dpl = to_svm(vcpu)->vmcb->save.cpl;
+ break;
+ }
+}
+
+static int svm_get_cpl(struct kvm_vcpu *vcpu)
+{
+ struct vmcb_save_area *save = &to_svm(vcpu)->vmcb->save;
+
+ return save->cpl;
+}
+
+static void svm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+{
+ struct kvm_segment cs;
+
+ svm_get_segment(vcpu, &cs, VCPU_SREG_CS);
+ *db = cs.db;
+ *l = cs.l;
+}
+
+static void svm_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ dt->size = svm->vmcb->save.idtr.limit;
+ dt->address = svm->vmcb->save.idtr.base;
+}
+
+static void svm_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ svm->vmcb->save.idtr.limit = dt->size;
+ svm->vmcb->save.idtr.base = dt->address;
+ vmcb_mark_dirty(svm->vmcb, VMCB_DT);
+}
+
+static void svm_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ dt->size = svm->vmcb->save.gdtr.limit;
+ dt->address = svm->vmcb->save.gdtr.base;
+}
+
+static void svm_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ svm->vmcb->save.gdtr.limit = dt->size;
+ svm->vmcb->save.gdtr.base = dt->address;
+ vmcb_mark_dirty(svm->vmcb, VMCB_DT);
+}
+
+static void sev_post_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ /*
+ * For guests that don't set guest_state_protected, the cr3 update is
+ * handled via kvm_mmu_load() while entering the guest. For guests
+ * that do (SEV-ES/SEV-SNP), the cr3 update needs to be written to
+ * VMCB save area now, since the save area will become the initial
+ * contents of the VMSA, and future VMCB save area updates won't be
+ * seen.
+ */
+ if (sev_es_guest(vcpu->kvm)) {
+ svm->vmcb->save.cr3 = cr3;
+ vmcb_mark_dirty(svm->vmcb, VMCB_CR);
+ }
+}
+
+static bool svm_is_valid_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ return true;
+}
+
+void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u64 hcr0 = cr0;
+ bool old_paging = is_paging(vcpu);
+
+#ifdef CONFIG_X86_64
+ if (vcpu->arch.efer & EFER_LME) {
+ if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
+ vcpu->arch.efer |= EFER_LMA;
+ if (!vcpu->arch.guest_state_protected)
+ svm->vmcb->save.efer |= EFER_LMA | EFER_LME;
+ }
+
+ if (is_paging(vcpu) && !(cr0 & X86_CR0_PG)) {
+ vcpu->arch.efer &= ~EFER_LMA;
+ if (!vcpu->arch.guest_state_protected)
+ svm->vmcb->save.efer &= ~(EFER_LMA | EFER_LME);
+ }
+ }
+#endif
+ vcpu->arch.cr0 = cr0;
+
+ if (!npt_enabled) {
+ hcr0 |= X86_CR0_PG | X86_CR0_WP;
+ if (old_paging != is_paging(vcpu))
+ svm_set_cr4(vcpu, kvm_read_cr4(vcpu));
+ }
+
+ /*
+ * Re-enable caching here because the QEMU BIOS does not do it;
+ * leaving caching disabled results in a noticeable delay at reboot.
+ */
+ if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
+ hcr0 &= ~(X86_CR0_CD | X86_CR0_NW);
+
+ svm->vmcb->save.cr0 = hcr0;
+ vmcb_mark_dirty(svm->vmcb, VMCB_CR);
+
+ /*
+ * SEV-ES guests must always keep the CR intercepts cleared. CR
+ * tracking is done using the CR write traps.
+ */
+ if (sev_es_guest(vcpu->kvm))
+ return;
+
+ if (hcr0 == cr0) {
+ /* Selective CR0 write remains on. */
+ svm_clr_intercept(svm, INTERCEPT_CR0_READ);
+ svm_clr_intercept(svm, INTERCEPT_CR0_WRITE);
+ } else {
+ svm_set_intercept(svm, INTERCEPT_CR0_READ);
+ svm_set_intercept(svm, INTERCEPT_CR0_WRITE);
+ }
+}
+
+static bool svm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ return true;
+}
+
+void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+ unsigned long host_cr4_mce = cr4_read_shadow() & X86_CR4_MCE;
+ unsigned long old_cr4 = vcpu->arch.cr4;
+
+ vcpu->arch.cr4 = cr4;
+ if (!npt_enabled) {
+ cr4 |= X86_CR4_PAE;
+
+ if (!is_paging(vcpu))
+ cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);
+ }
+ cr4 |= host_cr4_mce;
+ to_svm(vcpu)->vmcb->save.cr4 = cr4;
+ vmcb_mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
+
+ if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
+ vcpu->arch.cpuid_dynamic_bits_dirty = true;
+}
+
+static void svm_set_segment(struct kvm_vcpu *vcpu,
+ struct kvm_segment *var, int seg)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb_seg *s = svm_seg(vcpu, seg);
+
+ s->base = var->base;
+ s->limit = var->limit;
+ s->selector = var->selector;
+ s->attrib = (var->type & SVM_SELECTOR_TYPE_MASK);
+ s->attrib |= (var->s & 1) << SVM_SELECTOR_S_SHIFT;
+ s->attrib |= (var->dpl & 3) << SVM_SELECTOR_DPL_SHIFT;
+ s->attrib |= ((var->present & 1) && !var->unusable) << SVM_SELECTOR_P_SHIFT;
+ s->attrib |= (var->avl & 1) << SVM_SELECTOR_AVL_SHIFT;
+ s->attrib |= (var->l & 1) << SVM_SELECTOR_L_SHIFT;
+ s->attrib |= (var->db & 1) << SVM_SELECTOR_DB_SHIFT;
+ s->attrib |= (var->g & 1) << SVM_SELECTOR_G_SHIFT;
+
+ /*
+ * This is always accurate, except if SYSRET returned to a segment
+ * with SS.DPL != 3. Intel does not have this quirk, and always
+ * forces SS.DPL to 3 on sysret, so we ignore that case; fixing it
+ * would entail passing the CPL to userspace and back.
+ */
+ if (seg == VCPU_SREG_SS)
+ /* This is symmetric with svm_get_segment() */
+ svm->vmcb->save.cpl = (var->dpl & 3);
+
+ vmcb_mark_dirty(svm->vmcb, VMCB_SEG);
+}
+
+static void svm_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ clr_exception_intercept(svm, BP_VECTOR);
+
+ if (vcpu->guest_debug & KVM_GUESTDBG_ENABLE) {
+ if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP)
+ set_exception_intercept(svm, BP_VECTOR);
+ }
+}
+
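+/*
+ * Hand out the next hardware ASID; when the pool is exhausted, start a
+ * new generation at min_asid and flush all ASIDs to avoid reusing stale
+ * TLB entries.
+ */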
+static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
+{
+ if (sd->next_asid > sd->max_asid) {
+ ++sd->asid_generation;
+ sd->next_asid = sd->min_asid;
+ svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID;
+ vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
+ }
+
+ svm->current_vmcb->asid_generation = sd->asid_generation;
+ svm->asid = sd->next_asid++;
+}
+
+static void svm_set_dr6(struct kvm_vcpu *vcpu, unsigned long value)
+{
+ struct vmcb *vmcb = to_svm(vcpu)->vmcb;
+
+ if (vcpu->arch.guest_state_protected)
+ return;
+
+ if (unlikely(value != vmcb->save.dr6)) {
+ vmcb->save.dr6 = value;
+ vmcb_mark_dirty(vmcb, VMCB_DR);
+ }
+}
+
+static void svm_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ if (WARN_ON_ONCE(sev_es_guest(vcpu->kvm)))
+ return;
+
+ get_debugreg(vcpu->arch.db[0], 0);
+ get_debugreg(vcpu->arch.db[1], 1);
+ get_debugreg(vcpu->arch.db[2], 2);
+ get_debugreg(vcpu->arch.db[3], 3);
+ /*
+ * We cannot reset svm->vmcb->save.dr6 to DR6_ACTIVE_LOW here,
+ * because db_interception might need it. We can do it before vmentry.
+ */
+ vcpu->arch.dr6 = svm->vmcb->save.dr6;
+ vcpu->arch.dr7 = svm->vmcb->save.dr7;
+ vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_WONT_EXIT;
+ set_dr_intercepts(svm);
+}
+
+static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ if (vcpu->arch.guest_state_protected)
+ return;
+
+ svm->vmcb->save.dr7 = value;
+ vmcb_mark_dirty(svm->vmcb, VMCB_DR);
+}
+
+static int pf_interception(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ u64 fault_address = svm->vmcb->control.exit_info_2;
+ u64 error_code = svm->vmcb->control.exit_info_1;
+
+ return kvm_handle_page_fault(vcpu, error_code, fault_address,
+ static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+ svm->vmcb->control.insn_bytes : NULL,
+ svm->vmcb->control.insn_len);
+}
+
+static int npf_interception(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int rc;
+
+ u64 fault_address = svm->vmcb->control.exit_info_2;
+ u64 error_code = svm->vmcb->control.exit_info_1;
+
+ /*
+ * WARN if hardware generates a fault with an error code that collides
+ * with KVM-defined synthetic flags. Clear the flags and continue on,
+ * i.e. don't terminate the VM, as KVM can't possibly be relying on a
+ * flag that KVM doesn't know about.
+ */
+ if (WARN_ON_ONCE(error_code & PFERR_SYNTHETIC_MASK))
+ error_code &= ~PFERR_SYNTHETIC_MASK;
+
+ if (sev_snp_guest(vcpu->kvm) && (error_code & PFERR_GUEST_ENC_MASK))
+ error_code |= PFERR_PRIVATE_ACCESS;
+
+ trace_kvm_page_fault(vcpu, fault_address, error_code);
+ rc = kvm_mmu_page_fault(vcpu, fault_address, error_code,
+ static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
+ svm->vmcb->control.insn_bytes : NULL,
+ svm->vmcb->control.insn_len);
+
+ if (rc > 0 && error_code & PFERR_GUEST_RMP_MASK)
+ sev_handle_rmp_fault(vcpu, fault_address, error_code);
+
+ return rc;
+}
+
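+/*
+ * #DB handling: reflect the exception into the guest if neither userspace
+ * debugging nor NMI single-stepping is active; otherwise finish NMI
+ * single-stepping and/or exit to userspace with KVM_EXIT_DEBUG.
+ */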
+static int db_interception(struct kvm_vcpu *vcpu)
+{
+ struct kvm_run *kvm_run = vcpu->run;
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ if (!(vcpu->guest_debug &
+ (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP)) &&
+ !svm->nmi_singlestep) {
+ u32 payload = svm->vmcb->save.dr6 ^ DR6_ACTIVE_LOW;
+ kvm_queue_exception_p(vcpu, DB_VECTOR, payload);
+ return 1;
+ }
+
+ if (svm->nmi_singlestep) {
+ disable_nmi_singlestep(svm);
+ /* Make sure we check for pending NMIs upon entry */
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+ }
+
+ if (vcpu->guest_debug &
+ (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP)) {
+ kvm_run->exit_reason = KVM_EXIT_DEBUG;
+ kvm_run->debug.arch.dr6 = svm->vmcb->save.dr6;
+ kvm_run->debug.arch.dr7 = svm->vmcb->save.dr7;
+ kvm_run->debug.arch.pc =
+ svm->vmcb->save.cs.base + svm->vmcb->save.rip;
+ kvm_run->debug.arch.exception = DB_VECTOR;
+ return 0;
+ }
+
+ return 1;
+}
+
+static int bp_interception(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct kvm_run *kvm_run = vcpu->run;
+
+ kvm_run->exit_reason = KVM_EXIT_DEBUG;
+ kvm_run->debug.arch.pc = svm->vmcb->save.cs.base + svm->vmcb->save.rip;
+ kvm_run->debug.arch.exception = BP_VECTOR;
+ return 0;
+}
+
+static int ud_interception(struct kvm_vcpu *vcpu)
+{
+ return handle_ud(vcpu);
+}
+
+static int ac_interception(struct kvm_vcpu *vcpu)
+{
+ kvm_queue_exception_e(vcpu, AC_VECTOR, 0);
+ return 1;
+}
+
+static bool is_erratum_383(void)
+{
+ int i;
+ u64 value;
+
+ if (!erratum_383_found)
+ return false;
+
+ if (native_read_msr_safe(MSR_IA32_MC0_STATUS, &value))
+ return false;
+
+ /* Bit 62 may or may not be set for this mce */
+ value &= ~(1ULL << 62);
+
+ if (value != 0xb600000000010015ULL)
+ return false;
+
+ /* Clear MCi_STATUS registers */
+ for (i = 0; i < 6; ++i)
+ native_write_msr_safe(MSR_IA32_MCx_STATUS(i), 0);
+
+ if (!native_read_msr_safe(MSR_IA32_MCG_STATUS, &value)) {
+ value &= ~(1ULL << 2);
+ native_write_msr_safe(MSR_IA32_MCG_STATUS, value);
+ }
+
+ /* Flush tlb to evict multi-match entries */
+ __flush_tlb_all();
+
+ return true;
+}
+
+static void svm_handle_mce(struct kvm_vcpu *vcpu)
+{
+ if (is_erratum_383()) {
+ /*
+ * Erratum 383 triggered. Guest state is corrupt so kill the
+ * guest.
+ */
+ pr_err("Guest triggered AMD Erratum 383\n");
+
+ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+
+ return;
+ }
+
+ /*
+ * On an #MC intercept the MCE handler is not called automatically in
+ * the host. So do it by hand here.
+ */
+ kvm_machine_check();
+}
+
+static int mc_interception(struct kvm_vcpu *vcpu)
+{
+ return 1;
+}
+
+static int shutdown_interception(struct kvm_vcpu *vcpu)
+{
+ struct kvm_run *kvm_run = vcpu->run;
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ /*
+ * VMCB is undefined after a SHUTDOWN intercept. INIT the vCPU to put
+ * the VMCB in a known good state. Unfortunately, KVM doesn't have
+ * KVM_MP_STATE_SHUTDOWN and can't add it without potentially breaking
+ * userspace. From a platform perspective, INIT is acceptable behavior,
+ * as there exist bare metal platforms that automatically INIT the CPU
+ * in response to shutdown.
+ *
+ * The VM save area for SEV-ES guests has already been encrypted so it
+ * cannot be reinitialized, i.e. synthesizing INIT is futile.
+ */
+ if (!sev_es_guest(vcpu->kvm)) {
+ clear_page(svm->vmcb);
+#ifdef CONFIG_KVM_SMM
+ if (is_smm(vcpu))
+ kvm_smm_changed(vcpu, false);
+#endif
+ kvm_vcpu_reset(vcpu, true);
+ }
+
+ kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
+ return 0;
+}
+
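+/*
+ * IN/OUT #VMEXIT: decode port, size, direction, and string-ness from
+ * exit_info_1. String I/O is emulated (via the GHCB buffer for SEV-ES);
+ * everything else goes through the fast PIO path.
+ */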
+static int io_interception(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
+ int size, in, string;
+ unsigned port;
+
+ ++vcpu->stat.io_exits;
+ string = (io_info & SVM_IOIO_STR_MASK) != 0;
+ in = (io_info & SVM_IOIO_TYPE_MASK) != 0;
+ port = io_info >> 16;
+ size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT;
+
+ if (string) {
+ if (sev_es_guest(vcpu->kvm))
+ return sev_es_string_io(svm, size, port, in);
+ else
+ return kvm_emulate_instruction(vcpu, 0);
+ }
+
+ svm->next_rip = svm->vmcb->control.exit_info_2;
+
+ return kvm_fast_pio(vcpu, size, port, in);
+}
+
+static int nmi_interception(struct kvm_vcpu *vcpu)
+{
+ return 1;
+}
+
+static int smi_interception(struct kvm_vcpu *vcpu)
+{
+ return 1;
+}
+
+static int intr_interception(struct kvm_vcpu *vcpu)
+{
+ ++vcpu->stat.irq_exits;
+ return 1;
+}
+
+static int vmload_vmsave_interception(struct kvm_vcpu *vcpu, bool vmload)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ struct vmcb *vmcb12;
+ struct kvm_host_map map;
+ int ret;
+
+ if (nested_svm_check_permissions(vcpu))
+ return 1;
+
+ ret = kvm_vcpu_map(vcpu, gpa_to_gfn(svm->vmcb->save.rax), &map);
+ if (ret) {
+ if (ret == -EINVAL)
+ kvm_inject_gp(vcpu, 0);
+ return 1;
+ }
+
+ vmcb12 = map.hva;
+
+ ret = kvm_skip_emulated_instruction(vcpu);
+
+ if (vmload) {
+ svm_copy_vmloadsave_state(svm->vmcb, vmcb12);
+ svm->sysenter_eip_hi = 0;
+ svm->sysenter_esp_hi = 0;
+ } else {
+ svm_copy_vmloadsave_state(vmcb12, svm->vmcb);
+ }
+
+ kvm_vcpu_unmap(vcpu, &map);
+
+ return ret;
+}
+
+static int vmload_interception(struct kvm_vcpu *vcpu)
+{
+ return vmload_vmsave_interception(vcpu, true);
+}
+
+static int vmsave_interception(struct kvm_vcpu *vcpu)
+{
+ return vmload_vmsave_interception(vcpu, false);
+}
+
+static int vmrun_interception(struct kvm_vcpu *vcpu)
+{
+ if (nested_svm_check_permissions(vcpu))
+ return 1;
+
+ return nested_svm_vmrun(vcpu);
+}
+
+enum {
+ NONE_SVM_INSTR,
+ SVM_INSTR_VMRUN,
+ SVM_INSTR_VMLOAD,
+ SVM_INSTR_VMSAVE,
+};
+
+/* Return NONE_SVM_INSTR if not SVM instrs, otherwise return decode result */
+static int svm_instr_opcode(struct kvm_vcpu *vcpu)
+{
+ struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+
+ if (ctxt->b != 0x1 || ctxt->opcode_len != 2)
+ return NONE_SVM_INSTR;
+
+ switch (ctxt->modrm) {
+ case 0xd8: /* VMRUN */
+ return SVM_INSTR_VMRUN;
+ case 0xda: /* VMLOAD */
+ return SVM_INSTR_VMLOAD;
+ case 0xdb: /* VMSAVE */
+ return SVM_INSTR_VMSAVE;
+ default:
+ break;
+ }
+
+ return NONE_SVM_INSTR;
+}
+
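+/*
+ * If L2 executed the SVM instruction, synthesize the corresponding nested
+ * #VMEXIT to L1; otherwise emulate the instruction on L1's behalf.
+ */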
+static int emulate_svm_instr(struct kvm_vcpu *vcpu, int opcode)
+{
+ const int guest_mode_exit_codes[] = {
+ [SVM_INSTR_VMRUN] = SVM_EXIT_VMRUN,
+ [SVM_INSTR_VMLOAD] = SVM_EXIT_VMLOAD,
+ [SVM_INSTR_VMSAVE] = SVM_EXIT_VMSAVE,
+ };
+ int (*const svm_instr_handlers[])(struct kvm_vcpu *vcpu) = {
+ [SVM_INSTR_VMRUN] = vmrun_interception,
+ [SVM_INSTR_VMLOAD] = vmload_interception,
+ [SVM_INSTR_VMSAVE] = vmsave_interception,
+ };
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int ret;
+
+ if (is_guest_mode(vcpu)) {
+ /* Returns '1' or -errno on failure, '0' on success. */
+ ret = nested_svm_simple_vmexit(svm, guest_mode_exit_codes[opcode]);
+ if (ret)
+ return ret;
+ return 1;
+ }
+ return svm_instr_handlers[opcode](vcpu);
+}
+
+/*
+ * #GP handling code. Note that #GP can be triggered under the following
+ * two cases:
+ * 1) SVM VM-related instructions (VMRUN/VMSAVE/VMLOAD) that trigger #GP
+ * on some AMD CPUs when the EAX operand of these instructions points
+ * into reserved memory regions (e.g. SMM memory on the host).
+ * 2) VMware backdoor
+ */
+static int gp_interception(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u32 error_code = svm->vmcb->control.exit_info_1;
+ int opcode;
+
+ /* Both #GP cases have zero error_code */
+ if (error_code)
+ goto reinject;
+
+ /* Decode the instruction for usage later */
+ if (x86_decode_emulated_instruction(vcpu, 0, NULL, 0) != EMULATION_OK)
+ goto reinject;
+
+ opcode = svm_instr_opcode(vcpu);
+
+ if (opcode == NONE_SVM_INSTR) {
+ if (!enable_vmware_backdoor)
+ goto reinject;
+
+ /*
+ * VMware backdoor emulation on #GP interception only handles
+ * IN{S}, OUT{S}, and RDPMC.
+ */
+ if (!is_guest_mode(vcpu))
+ return kvm_emulate_instruction(vcpu,
+ EMULTYPE_VMWARE_GP | EMULTYPE_NO_DECODE);
+ } else {
+ /* All SVM instructions expect page aligned RAX */
+ if (svm->vmcb->save.rax & ~PAGE_MASK)
+ goto reinject;
+
+ return emulate_svm_instr(vcpu, opcode);
+ }
+
+reinject:
+ kvm_queue_exception_e(vcpu, GP_VECTOR, error_code);
+ return 1;
+}
+
+void svm_set_gif(struct vcpu_svm *svm, bool value)
+{
+ if (value) {
+ /*
+ * If VGIF is enabled, the STGI intercept is only added to
+ * detect the opening of the SMI/NMI window; remove it now.
+ * Likewise, clear the VINTR intercept, we will set it
+ * again while processing KVM_REQ_EVENT if needed.
+ */
+ if (vgif)
+ svm_clr_intercept(svm, INTERCEPT_STGI);
+ if (svm_is_intercept(svm, INTERCEPT_VINTR))
+ svm_clear_vintr(svm);
+
+ enable_gif(svm);
+ if (svm->vcpu.arch.smi_pending ||
+ svm->vcpu.arch.nmi_pending ||
+ kvm_cpu_has_injectable_intr(&svm->vcpu) ||
+ kvm_apic_has_pending_init_or_sipi(&svm->vcpu))
+ kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
+ } else {
+ disable_gif(svm);
+
+ /*
+ * After a CLGI no interrupts should come. But if vGIF is
+ * in use, we still rely on the VINTR intercept (rather than
+ * STGI) to detect an open interrupt window.
+ */
+ if (!vgif)
+ svm_clear_vintr(svm);
+ }
+}
+
+static int stgi_interception(struct kvm_vcpu *vcpu)
+{
+ int ret;
+
+ if (nested_svm_check_permissions(vcpu))
+ return 1;
+
+ ret = kvm_skip_emulated_instruction(vcpu);
+ svm_set_gif(to_svm(vcpu), true);
+ return ret;
+}
+
+static int clgi_interception(struct kvm_vcpu *vcpu)
+{
+ int ret;
+
+ if (nested_svm_check_permissions(vcpu))
+ return 1;
+
+ ret = kvm_skip_emulated_instruction(vcpu);
+ svm_set_gif(to_svm(vcpu), false);
+ return ret;
+}
+
+static int invlpga_interception(struct kvm_vcpu *vcpu)
+{
+ gva_t gva = kvm_rax_read(vcpu);
+ u32 asid = kvm_rcx_read(vcpu);
+
+ /* FIXME: Handle an address size prefix. */
+ if (!is_long_mode(vcpu))
+ gva = (u32)gva;
+
+ trace_kvm_invlpga(to_svm(vcpu)->vmcb->save.rip, asid, gva);
+
+ /* Let's treat INVLPGA the same as INVLPG (can be optimized!) */
+ kvm_mmu_invlpg(vcpu, gva);
+
+ return kvm_skip_emulated_instruction(vcpu);
+}
+
+static int skinit_interception(struct kvm_vcpu *vcpu)
+{
+ trace_kvm_skinit(to_svm(vcpu)->vmcb->save.rip, kvm_rax_read(vcpu));
+
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+}
+
+static int task_switch_interception(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ u16 tss_selector;
+ int reason;
+ int int_type = svm->vmcb->control.exit_int_info &
+ SVM_EXITINTINFO_TYPE_MASK;
+ int int_vec = svm->vmcb->control.exit_int_info & SVM_EVTINJ_VEC_MASK;
+ uint32_t type =
+ svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_TYPE_MASK;
+ uint32_t idt_v =
+ svm->vmcb->control.exit_int_info & SVM_EXITINTINFO_VALID;
+ bool has_error_code = false;
+ u32 error_code = 0;
+
+ tss_selector = (u16)svm->vmcb->control.exit_info_1;
+
+ if (svm->vmcb->control.exit_info_2 &
+ (1ULL << SVM_EXITINFOSHIFT_TS_REASON_IRET))
+ reason = TASK_SWITCH_IRET;
+ else if (svm->vmcb->control.exit_info_2 &
+ (1ULL << SVM_EXITINFOSHIFT_TS_REASON_JMP))
+ reason = TASK_SWITCH_JMP;
+ else if (idt_v)
+ reason = TASK_SWITCH_GATE;
+ else
+ reason = TASK_SWITCH_CALL;
+
+ if (reason == TASK_SWITCH_GATE) {
+ switch (type) {
+ case SVM_EXITINTINFO_TYPE_NMI:
+ vcpu->arch.nmi_injected = false;
+ break;
+ case SVM_EXITINTINFO_TYPE_EXEPT:
+ if (svm->vmcb->control.exit_info_2 &
+ (1ULL << SVM_EXITINFOSHIFT_TS_HAS_ERROR_CODE)) {
+ has_error_code = true;
+ error_code =
+ (u32)svm->vmcb->control.exit_info_2;
+ }
+ kvm_clear_exception_queue(vcpu);
+ break;
+ case SVM_EXITINTINFO_TYPE_INTR:
+ case SVM_EXITINTINFO_TYPE_SOFT:
+ kvm_clear_interrupt_queue(vcpu);
+ break;
+ default:
+ break;
+ }
+ }
+
+ if (reason != TASK_SWITCH_GATE ||
+ int_type == SVM_EXITINTINFO_TYPE_SOFT ||
+ (int_type == SVM_EXITINTINFO_TYPE_EXEPT &&
+ (int_vec == OF_VECTOR || int_vec == BP_VECTOR))) {
+ if (!svm_skip_emulated_instruction(vcpu))
+ return 0;
+ }
+
+ if (int_type != SVM_EXITINTINFO_TYPE_SOFT)
+ int_vec = -1;
+
+ return kvm_task_switch(vcpu, tss_selector, int_vec, reason,
+ has_error_code, error_code);
+}
+
+static void svm_clr_iret_intercept(struct vcpu_svm *svm)
+{
+ if (!sev_es_guest(svm->vcpu.kvm))
+ svm_clr_intercept(svm, INTERCEPT_IRET);
+}
+
+static void svm_set_iret_intercept(struct vcpu_svm *svm)
+{
+ if (!sev_es_guest(svm->vcpu.kvm))
+ svm_set_intercept(svm, INTERCEPT_IRET);
+}
+
+static int iret_interception(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ WARN_ON_ONCE(sev_es_guest(vcpu->kvm));
+
+ ++vcpu->stat.nmi_window_exits;
+ svm->awaiting_iret_completion = true;
+
+ svm_clr_iret_intercept(svm);
+ svm->nmi_iret_rip = kvm_rip_read(vcpu);
+
+ kvm_make_request(KVM_REQ_EVENT, vcpu);
+ return 1;
+}
+
+static int invlpg_interception(struct kvm_vcpu *vcpu)
+{
+ if (!static_cpu_has(X86_FEATURE_DECODEASSISTS))
+ return kvm_emulate_instruction(vcpu, 0);
+
+ kvm_mmu_invlpg(vcpu, to_svm(vcpu)->vmcb->control.exit_info_1);
+ return kvm_skip_emulated_instruction(vcpu);
+}
+
+static int emulate_on_interception(struct kvm_vcpu *vcpu)
+{
+ return kvm_emulate_instruction(vcpu, 0);
+}
+
+static int rsm_interception(struct kvm_vcpu *vcpu)
+{
+ return kvm_emulate_instruction_from_buffer(vcpu, rsm_ins_bytes, 2);
+}
+
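+/*
+ * Check whether a CR0 write changes bits other than TS/MP while L1 has
+ * the selective CR0 intercept enabled; if so, the write is reflected to
+ * L1 as SVM_EXIT_CR0_SEL_WRITE.
+ */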
+static bool check_selective_cr0_intercepted(struct kvm_vcpu *vcpu,
+ unsigned long val)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long cr0 = vcpu->arch.cr0;
+ bool ret = false;
+
+ if (!is_guest_mode(vcpu) ||
+ (!(vmcb12_is_intercept(&svm->nested.ctl, INTERCEPT_SELECTIVE_CR0))))
+ return false;
+
+ cr0 &= ~SVM_CR0_SELECTIVE_MASK;
+ val &= ~SVM_CR0_SELECTIVE_MASK;
+
+ if (cr0 ^ val) {
+ svm->vmcb->control.exit_code = SVM_EXIT_CR0_SEL_WRITE;
+ ret = (nested_svm_exit_handled(svm) == NESTED_EXIT_DONE);
+ }
+
+ return ret;
+}
+
+#define CR_VALID (1ULL << 63)
+
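+/*
+ * CR access #VMEXITs are numbered SVM_EXIT_READ_CR0..CR15 followed by
+ * SVM_EXIT_WRITE_CR0..CR15, so an index >= 16 after subtracting
+ * SVM_EXIT_READ_CR0 denotes a write. The accessed GPR is provided by
+ * decode assists in exit_info_1 when CR_VALID is set.
+ */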
+static int cr_interception(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int reg, cr;
+ unsigned long val;
+ int err;
+
+ if (!static_cpu_has(X86_FEATURE_DECODEASSISTS))
+ return emulate_on_interception(vcpu);
+
+ if (unlikely((svm->vmcb->control.exit_info_1 & CR_VALID) == 0))
+ return emulate_on_interception(vcpu);
+
+ reg = svm->vmcb->control.exit_info_1 & SVM_EXITINFO_REG_MASK;
+ if (svm->vmcb->control.exit_code == SVM_EXIT_CR0_SEL_WRITE)
+ cr = SVM_EXIT_WRITE_CR0 - SVM_EXIT_READ_CR0;
+ else
+ cr = svm->vmcb->control.exit_code - SVM_EXIT_READ_CR0;
+
+ err = 0;
+ if (cr >= 16) { /* mov to cr */
+ cr -= 16;
+ val = kvm_register_read(vcpu, reg);
+ trace_kvm_cr_write(cr, val);
+ switch (cr) {
+ case 0:
+ if (!check_selective_cr0_intercepted(vcpu, val))
+ err = kvm_set_cr0(vcpu, val);
+ else
+ return 1;
+
+ break;
+ case 3:
+ err = kvm_set_cr3(vcpu, val);
+ break;
+ case 4:
+ err = kvm_set_cr4(vcpu, val);
+ break;
+ case 8:
+ err = kvm_set_cr8(vcpu, val);
+ break;
+ default:
+ WARN(1, "unhandled write to CR%d", cr);
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+ } else { /* mov from cr */
+ switch (cr) {
+ case 0:
+ val = kvm_read_cr0(vcpu);
+ break;
+ case 2:
+ val = vcpu->arch.cr2;
+ break;
+ case 3:
+ val = kvm_read_cr3(vcpu);
+ break;
+ case 4:
+ val = kvm_read_cr4(vcpu);
+ break;
+ case 8:
+ val = kvm_get_cr8(vcpu);
+ break;
+ default:
+ WARN(1, "unhandled read from CR%d", cr);
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+ kvm_register_write(vcpu, reg, val);
+ trace_kvm_cr_read(cr, val);
+ }
+ return kvm_complete_insn_gp(vcpu, err);
+}
+
+static int cr_trap(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ unsigned long old_value, new_value;
+ unsigned int cr;
+ int ret = 0;
+
+ new_value = (unsigned long)svm->vmcb->control.exit_info_1;
+
+ cr = svm->vmcb->control.exit_code - SVM_EXIT_CR0_WRITE_TRAP;
+ switch (cr) {
+ case 0:
+ old_value = kvm_read_cr0(vcpu);
+ svm_set_cr0(vcpu, new_value);
+
+ kvm_post_set_cr0(vcpu, old_value, new_value);
+ break;
+ case 4:
+ old_value = kvm_read_cr4(vcpu);
+ svm_set_cr4(vcpu, new_value);
+
+ kvm_post_set_cr4(vcpu, old_value, new_value);
+ break;
+ case 8:
+ ret = kvm_set_cr8(vcpu, new_value);
+ break;
+ default:
+ WARN(1, "unhandled CR%d write trap", cr);
+ kvm_queue_exception(vcpu, UD_VECTOR);
+ return 1;
+ }
+
+ return kvm_complete_insn_gp(vcpu, ret);
+}
+
+static int dr_interception(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int reg, dr;
+ int err = 0;
+
+ /*
+ * SEV-ES intercepts DR7 only to disable guest debugging, and the guest
+ * issues a VMGEXIT for DR7 writes only. KVM cannot change DR7 (it is
+ * always swapped as type 'A'), so return early.
+ */
+ if (sev_es_guest(vcpu->kvm))
+ return 1;
+
+ if (vcpu->guest_debug == 0) {
+ /*
+ * No more DR vmexits; force a reload of the debug registers
+ * and reenter on this instruction. The next vmexit will
+ * retrieve the full state of the debug registers.
+ */
+ clr_dr_intercepts(svm);
+ vcpu->arch.switch_db_regs |= KVM_DEBUGREG_WONT_EXIT;
+ return 1;
+ }
+
+ if (!boot_cpu_has(X86_FEATURE_DECODEASSISTS))
+ return emulate_on_interception(vcpu);
+
+ reg = svm->vmcb->control.exit_info_1 & SVM_EXITINFO_REG_MASK;
+ dr = svm->vmcb->control.exit_code - SVM_EXIT_READ_DR0;
+ if (dr >= 16) { /* mov to DRn */
+ dr -= 16;
+ err = kvm_set_dr(vcpu, dr, kvm_register_read(vcpu, reg));
+ } else {
+ kvm_register_write(vcpu, reg, kvm_get_dr(vcpu, dr));
+ }
+
+ return kvm_complete_insn_gp(vcpu, err);
+}
+
+static int cr8_write_interception(struct kvm_vcpu *vcpu)
+{
+ int r;
+
+ u8 cr8_prev = kvm_get_cr8(vcpu);
+ /* instruction emulation calls kvm_set_cr8() */
+ r = cr_interception(vcpu);
+ if (lapic_in_kernel(vcpu))
+ return r;
+ if (cr8_prev <= kvm_get_cr8(vcpu))
+ return r;
+ vcpu->run->exit_reason = KVM_EXIT_SET_TPR;
+ return 0;
+}
+
+static int efer_trap(struct kvm_vcpu *vcpu)
+{
+ struct msr_data msr_info;
+ int ret;
+
+ /*
+ * Clear the EFER_SVME bit from EFER. The SVM code always sets this
+ * bit in svm_set_efer(), but __kvm_valid_efer() checks it against
+ * whether the guest has X86_FEATURE_SVM - this avoids a failure if
+ * the guest doesn't have X86_FEATURE_SVM.
+ */
+ msr_info.host_initiated = false;
+ msr_info.index = MSR_EFER;
+ msr_info.data = to_svm(vcpu)->vmcb->control.exit_info_1 & ~EFER_SVME;
+ ret = kvm_set_msr_common(vcpu, &msr_info);
+
+ return kvm_complete_insn_gp(vcpu, ret);
+}
+
+static int svm_get_feature_msr(u32 msr, u64 *data)
+{
+ *data = 0;
+
+ switch (msr) {
+ case MSR_AMD64_DE_CFG:
+ if (cpu_feature_enabled(X86_FEATURE_LFENCE_RDTSC))
+ *data |= MSR_AMD64_DE_CFG_LFENCE_SERIALIZE;
+ break;
+ default:
+ return KVM_MSR_RET_UNSUPPORTED;
+ }
+
+ return 0;
+}
+
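+/*
+ * For SEV-ES guests with protected state, MSRs that are passed through to
+ * the guest live in the encrypted VMSA and thus can't be accessed by KVM;
+ * MSR_IA32_XSS is the one exception.
+ */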
+static bool sev_es_prevent_msr_access(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info)
+{
+ return sev_es_guest(vcpu->kvm) && vcpu->arch.guest_state_protected &&
+ msr_info->index != MSR_IA32_XSS &&
+ !msr_write_intercepted(vcpu, msr_info->index);
+}
+
+static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+
+ if (sev_es_prevent_msr_access(vcpu, msr_info)) {
+ msr_info->data = 0;
+ return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0;
+ }
+
+ switch (msr_info->index) {
+ case MSR_AMD64_TSC_RATIO:
+ if (!msr_info->host_initiated &&
+ !guest_cpu_cap_has(vcpu, X86_FEATURE_TSCRATEMSR))
+ return 1;
+ msr_info->data = svm->tsc_ratio_msr;
+ break;
+ case MSR_STAR:
+ msr_info->data = svm->vmcb01.ptr->save.star;
+ break;
+#ifdef CONFIG_X86_64
+ case MSR_LSTAR:
+ msr_info->data = svm->vmcb01.ptr->save.lstar;
+ break;
+ case MSR_CSTAR:
+ msr_info->data = svm->vmcb01.ptr->save.cstar;
+ break;
+ case MSR_GS_BASE:
+ msr_info->data = svm->vmcb01.ptr->save.gs.base;
+ break;
+ case MSR_FS_BASE:
+ msr_info->data = svm->vmcb01.ptr->save.fs.base;
+ break;
+ case MSR_KERNEL_GS_BASE:
+ msr_info->data = svm->vmcb01.ptr->save.kernel_gs_base;
+ break;
+ case MSR_SYSCALL_MASK:
+ msr_info->data = svm->vmcb01.ptr->save.sfmask;
+ break;
+#endif
+ case MSR_IA32_SYSENTER_CS:
+ msr_info->data = svm->vmcb01.ptr->save.sysenter_cs;
+ break;
+ case MSR_IA32_SYSENTER_EIP:
+ msr_info->data = (u32)svm->vmcb01.ptr->save.sysenter_eip;
+ if (guest_cpuid_is_intel_compatible(vcpu))
+ msr_info->data |= (u64)svm->sysenter_eip_hi << 32;
+ break;
+ case MSR_IA32_SYSENTER_ESP:
+ msr_info->data = svm->vmcb01.ptr->save.sysenter_esp;
+ if (guest_cpuid_is_intel_compatible(vcpu))
+ msr_info->data |= (u64)svm->sysenter_esp_hi << 32;
+ break;
+ case MSR_IA32_S_CET:
+ msr_info->data = svm->vmcb->save.s_cet;
+ break;
+ case MSR_IA32_INT_SSP_TAB:
+ msr_info->data = svm->vmcb->save.isst_addr;
+ break;
+ case MSR_KVM_INTERNAL_GUEST_SSP:
+ msr_info->data = svm->vmcb->save.ssp;
+ break;
+ case MSR_TSC_AUX:
+ msr_info->data = svm->tsc_aux;
+ break;
+ case MSR_IA32_DEBUGCTLMSR:
+ msr_info->data = svm->vmcb->save.dbgctl;
+ break;
+ case MSR_IA32_LASTBRANCHFROMIP:
+ msr_info->data = svm->vmcb->save.br_from;
+ break;
+ case MSR_IA32_LASTBRANCHTOIP:
+ msr_info->data = svm->vmcb->save.br_to;
+ break;
+ case MSR_IA32_LASTINTFROMIP:
+ msr_info->data = svm->vmcb->save.last_excp_from;
+ break;
+ case MSR_IA32_LASTINTTOIP:
+ msr_info->data = svm->vmcb->save.last_excp_to;
+ break;
+ case MSR_VM_HSAVE_PA:
+ msr_info->data = svm->nested.hsave_msr;
+ break;
+ case MSR_VM_CR:
+ msr_info->data = svm->nested.vm_cr_msr;
+ break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_has_spec_ctrl_msr(vcpu))
+ return 1;
+
+ if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
+ msr_info->data = svm->vmcb->save.spec_ctrl;
+ else
+ msr_info->data = svm->spec_ctrl;
+ break;
+ case MSR_AMD64_VIRT_SPEC_CTRL:
+ if (!msr_info->host_initiated &&
+ !guest_cpu_cap_has(vcpu, X86_FEATURE_VIRT_SSBD))
+ return 1;
+
+ msr_info->data = svm->virt_spec_ctrl;
+ break;
+ case MSR_F15H_IC_CFG: {
+ int family, model;
+
+ family = guest_cpuid_family(vcpu);
+ model = guest_cpuid_model(vcpu);
+
+ if (family < 0 || model < 0)
+ return kvm_get_msr_common(vcpu, msr_info);
+
+ msr_info->data = 0;
+
+ if (family == 0x15 &&
+ (model >= 0x2 && model < 0x20))
+ msr_info->data = 0x1E;
+ }
+ break;
+ case MSR_AMD64_DE_CFG:
+ msr_info->data = svm->msr_decfg;
+ break;
+ default:
+ return kvm_get_msr_common(vcpu, msr_info);
+ }
+ return 0;
+}
+
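+/*
+ * For SEV-ES guests, a failed MSR access can't be completed with a normal
+ * injected #GP; propagate the fault to the guest via the GHCB instead.
+ */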
+static int svm_complete_emulated_msr(struct kvm_vcpu *vcpu, int err)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ if (!err || !sev_es_guest(vcpu->kvm) || WARN_ON_ONCE(!svm->sev_es.ghcb))
+ return kvm_complete_insn_gp(vcpu, err);
+
+ svm_vmgexit_inject_exception(svm, X86_TRAP_GP);
+ return 1;
+}
+
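+/*
+ * Emulate writes to the VM_CR MSR: once SVMDIS is set, the LOCK and DIS
+ * bits become read-only, and SVM can't be disabled while EFER.SVME=1.
+ */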
+static int svm_set_vm_cr(struct kvm_vcpu *vcpu, u64 data)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int svm_dis, chg_mask;
+
+ if (data & ~SVM_VM_CR_VALID_MASK)
+ return 1;
+
+ chg_mask = SVM_VM_CR_VALID_MASK;
+
+ if (svm->nested.vm_cr_msr & SVM_VM_CR_SVM_DIS_MASK)
+ chg_mask &= ~(SVM_VM_CR_SVM_LOCK_MASK | SVM_VM_CR_SVM_DIS_MASK);
+
+ svm->nested.vm_cr_msr &= ~chg_mask;
+ svm->nested.vm_cr_msr |= (data & chg_mask);
+
+ svm_dis = svm->nested.vm_cr_msr & SVM_VM_CR_SVM_DIS_MASK;
+
+ /* check for svm_disable while efer.svme is set */
+ if (svm_dis && (vcpu->arch.efer & EFER_SVME))
+ return 1;
+
+ return 0;
+}
+
+static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+ struct vcpu_svm *svm = to_svm(vcpu);
+ int ret = 0;
+
+ u32 ecx = msr->index;
+ u64 data = msr->data;
+
+ if (sev_es_prevent_msr_access(vcpu, msr))
+ return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0;
+
+ switch (ecx) {
+ case MSR_AMD64_TSC_RATIO:
+
+ if (!guest_cpu_cap_has(vcpu, X86_FEATURE_TSCRATEMSR)) {
+
+ if (!msr->host_initiated)
+ return 1;
+ /*
+ * If TSC scaling is not enabled, always leave this MSR at its
+ * default value.
+ *
+ * Due to a bug, QEMU 6.2.0 tries to set this MSR to 0 even when
+ * TSC scaling is not enabled; ignore that value as well.
+ */
+ if (data != 0 && data != svm->tsc_ratio_msr)
+ return 1;
+ break;
+ }
+
+ if (data & SVM_TSC_RATIO_RSVD)
+ return 1;
+
+ svm->tsc_ratio_msr = data;
+
+ if (guest_cpu_cap_has(vcpu, X86_FEATURE_TSCRATEMSR) &&
+ is_guest_mode(vcpu))
+ nested_svm_update_tsc_ratio_msr(vcpu);
+
+ break;
+ case MSR_IA32_CR_PAT:
+ ret = kvm_set_msr_common(vcpu, msr);
+ if (ret)
+ break;
+
+ svm->vmcb01.ptr->save.g_pat = data;
+ if (is_guest_mode(vcpu))
+ nested_vmcb02_compute_g_pat(svm);
+ vmcb_mark_dirty(svm->vmcb, VMCB_NPT);
+ break;
+ case MSR_IA32_SPEC_CTRL:
+ if (!msr->host_initiated &&
+ !guest_has_spec_ctrl_msr(vcpu))
+ return 1;
+
+ if (kvm_spec_ctrl_test_value(data))
+ return 1;
+
+ if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL))
+ svm->vmcb->save.spec_ctrl = data;
+ else
+ svm->spec_ctrl = data;
+ if (!data)
+ break;
+
+ /*
+ * For non-nested:
+ * When it's written (to non-zero) for the first time, pass
+ * it through.
+ *
+ * For nested: