Diffstat (limited to 'Documentation/driver-api/cxl/linux')
10 files changed, 2421 insertions, 0 deletions
diff --git a/Documentation/driver-api/cxl/linux/access-coordinates.rst b/Documentation/driver-api/cxl/linux/access-coordinates.rst
new file mode 100644
index 000000000000..341a7c682043
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/access-coordinates.rst
@@ -0,0 +1,178 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+==================================
+CXL Access Coordinates Computation
+==================================
+
+Latency and Bandwidth Calculation
+=================================
+A memory region's performance coordinates (latency and bandwidth) are typically
+provided via the ACPI tables :doc:`SRAT <../platform/acpi/srat>` and
+:doc:`HMAT <../platform/acpi/hmat>`. However, the platform firmware (BIOS) is
+not able to annotate those for CXL devices that are hot-plugged, since they do
+not exist during platform firmware initialization. The CXL driver can compute
+the performance coordinates by retrieving data from several components.
+
+The :doc:`SRAT <../platform/acpi/srat>` provides a Generic Port Affinity
+subtable that ties a proximity domain to a device handle, which in this case
+would be the CXL host bridge. Using this association, the performance
+coordinates for the Generic Port can be retrieved from the
+:doc:`HMAT <../platform/acpi/hmat>` subtable. This piece represents the
+performance coordinates between a CPU and a Generic Port (CXL host bridge).
+
+The :doc:`CDAT <../platform/cdat>` provides the performance coordinates for
+the CXL device itself, that is, the bandwidth and latency to access that
+device's memory region. The DSMAS subtable provides a DSMADHandle that is
+tied to a Device Physical Address (DPA) range. The DSLBIS subtable provides
+the performance coordinates tied to a DSMADHandle; this ties the two table
+entries together to provide the performance coordinates for each DPA region.
+For example, if a device exports a DRAM region and a PMEM region, then there
+would be different performance characteristics for each of those regions.
+
+If there is a CXL switch in the topology, then the performance coordinates
+for the switch are provided by the SSLBIS subtable. This provides the
+bandwidth and latency for traversing the switch between the switch upstream
+port and the switch downstream port that points to the endpoint device.
+
+Simple topology example::
+
+                GP0/HB0/ACPI0016-0
+                        RP0
+                         |
+                         | L0
+                         |
+                    SW 0 / USP0
+                    SW 0 / DSP0
+                         |
+                         | L1
+                         |
+                        EP0
+
+In this example, there is a CXL switch between an endpoint and a root port.
+Latency in this example is calculated as such:
+
+- L(EP0) - Latency from EP0 CDAT DSMAS+DSLBIS
+- L(L1) - Link latency between EP0 and SW0DSP0
+- L(SW0) - Latency for the switch from SW0 CDAT SSLBIS
+- L(L0) - Link latency between SW0 and RP0
+- L(RP0) - Latency from root port to CPU via SRAT and HMAT (Generic Port)
+
+Total read and write latencies are the sum of all these parts.
+
+Bandwidth in this example is calculated as such:
+
+- B(EP0) - Bandwidth from EP0 CDAT DSMAS+DSLBIS
+- B(L1) - Link bandwidth between EP0 and SW0DSP0
+- B(SW0) - Bandwidth for the switch from SW0 CDAT SSLBIS
+- B(L0) - Link bandwidth between SW0 and RP0
+- B(RP0) - Bandwidth from root port to CPU via SRAT and HMAT (Generic Port)
+
+The total read and write bandwidth is the min() of all these parts.
+
+To calculate the link bandwidth:
+LinkOperatingFrequency (GT/s) is the current negotiated link speed.
+DataRatePerLink (MB/s) = LinkOperatingFrequency / 8
+Bandwidth (MB/s) = PCIeCurrentLinkWidth * DataRatePerLink
+Where PCIeCurrentLinkWidth is the number of lanes in the link.
+
+To calculate the link latency:
+LinkLatency (picoseconds) = FlitSize / LinkBandwidth (MB/s)
+
+See `CXL Memory Device SW Guide r1.0 <https://www.intel.com/content/www/us/en/content-details/643805/cxl-memory-device-software-guide.html>`_,
+sections 2.11.3 and 2.11.4 for details.
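+
+As a worked instance of these formulas, consider an x8 link negotiated at
+32 GT/s with a 68-byte flit. These values are illustrative assumptions, not
+taken from any particular device, and the sketch below is user-space math,
+not driver code::
+
+   #include <stdio.h>
+
+   int main(void)
+   {
+           unsigned int link_speed = 32; /* LinkOperatingFrequency, GT/s */
+           unsigned int link_width = 8;  /* PCIeCurrentLinkWidth, lanes */
+           unsigned int flit_size = 68;  /* bytes, assumed flit size */
+
+           /* DataRatePerLink (MB/s) = LinkOperatingFrequency / 8 */
+           unsigned int data_rate = link_speed * 1000 / 8;
+           /* Bandwidth (MB/s) = PCIeCurrentLinkWidth * DataRatePerLink */
+           unsigned int bandwidth = link_width * data_rate;
+           /* LinkLatency = FlitSize / LinkBandwidth, scaled to picoseconds */
+           unsigned int latency_ps = flit_size * 1000000 / bandwidth;
+
+           /* Prints: 32000 MB/s, 2125 ps */
+           printf("%u MB/s, %u ps\n", bandwidth, latency_ps);
+           return 0;
+   }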
+
+In the end, the access coordinates for a constructed memory region are
+calculated from one or more memory partitions from each of the CXL device(s).
+
+Shared Upstream Link Calculation
+================================
+For certain CXL region constructions with endpoints behind CXL switches (SW)
+or Root Ports (RP), there is the possibility that the total bandwidth of all
+the endpoints behind a switch is more than the bandwidth of the switch
+upstream link. A similar situation can occur within the host, upstream of the
+root ports. The CXL driver performs an additional pass, after all the targets
+for a region have arrived, in order to recalculate the bandwidths with a
+possible upstream link being a limiting factor in mind.
+
+The algorithm assumes the configuration is a symmetric topology, as that
+maximizes performance. When an asymmetric topology is detected, the
+calculation is aborted. An asymmetric topology is detected during the
+topology walk when the number of RPs detected as grandparents is not equal to
+the number of devices iterated over in the same iteration loop. The
+assumption is made that subtle asymmetry in properties does not happen and
+all paths to EPs are equal.
+
+There can be multiple switches under an RP. There can be multiple RPs under
+a CXL Host Bridge (HB). There can be multiple HBs under a CXL Fixed Memory
+Window Structure (CFMWS) in the :doc:`CEDT <../platform/acpi/cedt>`.
+
+An example hierarchy::
+
+                        CFMWS 0
+                           |
+                  _________|_________
+                 |                   |
+           ACPI0017-0           ACPI0017-1
+      GP0/HB0/ACPI0016-0   GP1/HB1/ACPI0016-1
+          |       |            |       |
+         RP0     RP1          RP2     RP3
+          |       |            |       |
+        SW 0    SW 1         SW 2    SW 3
+        |  |    |  |         |  |    |  |
+       EP0 EP1 EP2 EP3      EP4 EP5 EP6 EP7
+
+Computation for the example hierarchy::
+
+  Min (GP0 to CPU BW,
+       Min(SW 0 Upstream Link to RP0 BW,
+           Min(SW0SSLBIS for SW0DSP0 (EP0), EP0 DSLBIS, EP0 Upstream Link) +
+           Min(SW0SSLBIS for SW0DSP1 (EP1), EP1 DSLBIS, EP1 Upstream Link)) +
+       Min(SW 1 Upstream Link to RP1 BW,
+           Min(SW1SSLBIS for SW1DSP0 (EP2), EP2 DSLBIS, EP2 Upstream Link) +
+           Min(SW1SSLBIS for SW1DSP1 (EP3), EP3 DSLBIS, EP3 Upstream Link)))
+
+  Min (GP1 to CPU BW,
+       Min(SW 2 Upstream Link to RP2 BW,
+           Min(SW2SSLBIS for SW2DSP0 (EP4), EP4 DSLBIS, EP4 Upstream Link) +
+           Min(SW2SSLBIS for SW2DSP1 (EP5), EP5 DSLBIS, EP5 Upstream Link)) +
+       Min(SW 3 Upstream Link to RP3 BW,
+           Min(SW3SSLBIS for SW3DSP0 (EP6), EP6 DSLBIS, EP6 Upstream Link) +
+           Min(SW3SSLBIS for SW3DSP1 (EP7), EP7 DSLBIS, EP7 Upstream Link)))
+
+The calculation starts at cxl_region_shared_upstream_perf_update(). An xarray
+is created to collect all the endpoint bandwidths via the
+cxl_endpoint_gather_bandwidth() function. The min() of the bandwidth from the
+endpoint CDAT and the upstream link bandwidth is calculated. If the endpoint
+has a CXL switch as a parent, then the min() of the calculated bandwidth and
+the bandwidth from the SSLBIS for the switch downstream port that is
+associated with the endpoint is calculated. The final bandwidth is stored in
+a 'struct cxl_perf_ctx' in the xarray, indexed by a device pointer. If the
+endpoint is directly attached to a root port (RP), the device pointer is the
+RP device. If the endpoint is behind a switch, the device pointer is the
+upstream device of the parent switch.
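+
+A hedged sketch of the aggregation shape above for GP0's side of the example
+hierarchy (illustrative only, not the kernel's implementation; all bandwidth
+values are hypothetical)::
+
+   #include <stdio.h>
+
+   static unsigned int min_u(unsigned int a, unsigned int b)
+   {
+           return a < b ? a : b;
+   }
+
+   int main(void)
+   {
+           /* Hypothetical MB/s values for GP0, switch uplinks, endpoints. */
+           unsigned int gp0 = 60000, sw0_up = 32000, sw1_up = 32000;
+           unsigned int ep = 16000; /* min(SSLBIS, DSLBIS, EP link), per EP */
+
+           /* Endpoint sums are capped by each switch upstream link... */
+           unsigned int sw0 = min_u(sw0_up, ep + ep); /* EP0 + EP1 */
+           unsigned int sw1 = min_u(sw1_up, ep + ep); /* EP2 + EP3 */
+           /* ...and the switch sum is capped by the GP coordinates. */
+           unsigned int hb0 = min_u(gp0, sw0 + sw1);
+
+           /* Prints: GP0 subtree bandwidth: 60000 MB/s */
+           printf("GP0 subtree bandwidth: %u MB/s\n", hb0);
+           return 0;
+   }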
+
+At the next stage, the code walks through one or more switches if they exist
+in the topology. For endpoints directly attached to RPs, this step is
+skipped. At each level, the code takes the min() of the currently gathered
+bandwidth and the upstream link bandwidth. If there is another switch
+upstream, the min() of that result and the SSLBIS of the upstream switch is
+taken as well.
+
+Once the topology walk reaches the RP, whether from directly attached
+endpoints or from walking through the switch(es), cxl_rp_gather_bandwidth()
+is called. At this point all the bandwidths are aggregated per host bridge,
+which is also the index for the resulting xarray.
+
+The next step is to take the min() of the per-host-bridge bandwidth and the
+bandwidth from the Generic Port (GP). The bandwidths for the GP are retrieved
+via ACPI tables (:doc:`SRAT <../platform/acpi/srat>` and
+:doc:`HMAT <../platform/acpi/hmat>`). The minimum bandwidths are aggregated
+under the same ACPI0017 device to form a new xarray.
+
+Finally, cxl_region_update_bandwidth() is called and the aggregated
+bandwidth from all the members of the last xarray is updated for the
+access coordinates residing in the cxl region (cxlr) context.
+
+QTG ID
+======
+Each CFMWS entry in the :doc:`CEDT <../platform/acpi/cedt>` has a QTG ID
+field. This field provides the ID of the QoS Throttling Group (QTG)
+associated with the CFMWS window. Once the access coordinates are calculated,
+an ACPI Device Specific Method can be issued to the ACPI0016 device to
+retrieve the QTG ID that corresponds to the access coordinates provided. The
+QTG ID for the device can then be used as guidance to match the device to a
+CFMWS window, and thereby to set up the Linux root decoder best suited to the
+device's performance.
diff --git a/Documentation/driver-api/cxl/linux/cxl-driver.rst b/Documentation/driver-api/cxl/linux/cxl-driver.rst
new file mode 100644
index 000000000000..9759e90c3cf1
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/cxl-driver.rst
@@ -0,0 +1,630 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+CXL Driver Operation
+====================
+
+The devices described in this section are present in ::
+
+  /sys/bus/cxl/devices/
+  /dev/cxl/
+
+The :code:`cxl-cli` tool, maintained as part of the NDCTL project, may
+be used to script interactions with these devices.
+
+Drivers
+=======
+The CXL driver is split into a number of drivers.
+
+* cxl_core - fundamental init interface and core object creation
+* cxl_port - initializes the root and provides the port enumeration interface.
+* cxl_acpi - initializes root decoders and interacts with ACPI data.
+* cxl_p/mem - initializes memory devices
+* cxl_pci - uses cxl_port to enumerate the actual fabric hierarchy.
+
+Driver Devices
+==============
+Here is an example from a single-socket system with 4 host bridges. Two host
+bridges have a single memory device attached, and the devices are interleaved
+into a single memory region. The memory region has been converted to dax. ::
+
+  # ls /sys/bus/cxl/devices/
+  dax_region0  decoder3.0  decoder6.0  mem0   port3
+  decoder0.0   decoder4.0  decoder6.1  mem1   port4
+  decoder1.0   decoder5.0  endpoint5   port1  region0
+  decoder2.0   decoder5.1  endpoint6   port2  root0
+
+.. kernel-render:: DOT
+   :alt: Digraph of CXL fabric describing host-bridge interleaving
+   :caption: Digraph of CXL fabric with a host-bridge interleaved memory region
+
+   digraph foo {
+           "root0" -> "port1";
+           "root0" -> "port3";
+           "root0" -> "decoder0.0";
+           "port1" -> "endpoint5";
+           "port3" -> "endpoint6";
+           "port1" -> "decoder1.0";
+           "port3" -> "decoder3.0";
+           "endpoint5" -> "decoder5.0";
+           "endpoint6" -> "decoder6.0";
+           "decoder0.0" -> "region0";
+           "decoder0.0" -> "decoder1.0";
+           "decoder0.0" -> "decoder3.0";
+           "decoder1.0" -> "decoder5.0";
+           "decoder3.0" -> "decoder6.0";
+           "decoder5.0" -> "region0";
+           "decoder6.0" -> "region0";
+           "region0" -> "dax_region0";
+           "dax_region0" -> "dax0.0";
+   }
+
+In this section we'll explore the devices present in this configuration; more
+configurations are explored in-depth in the example configurations below.
+
+Base Devices
+------------
+Most devices in a CXL fabric are a `port` of some kind (because each
+device mostly routes requests from one device to the next, rather than
+providing a direct service).
+
+Root
+~~~~
+The `CXL Root` is a logical object created by the `cxl_acpi` driver during
+:code:`cxl_acpi_probe` - if the :code:`ACPI0017` `Compute Express Link
+Root Object` Device Class is found.
+
+The Root contains links to:
+
+* `Host Bridge Ports` defined by CHBS in the :doc:`CEDT <../platform/acpi/cedt>`
+
+* `Downstream Ports` typically connected to `Host Bridge Ports`.
+
+* `Root Decoders` defined by CFMWS in the :doc:`CEDT <../platform/acpi/cedt>`
+
+::
+
+  # ls /sys/bus/cxl/devices/root0
+  decoder0.0          dport0  dport5    port2  subsystem
+  decoders_committed  dport1  modalias  port3  uevent
+  devtype             dport4  port1     port4  uport
+
+  # cat /sys/bus/cxl/devices/root0/devtype
+  cxl_port
+
+  # cat port1/devtype
+  cxl_port
+
+  # cat decoder0.0/devtype
+  cxl_decoder_root
+
+The root is the first `logical port` in the CXL fabric, as presented by the
+Linux CXL driver. The `CXL root` is a special type of `switch port`, in that
+it only has downstream port connections.
+
+Port
+~~~~
+A `port` object is better described as a `switch port`. It may represent a
+host bridge to the root or an actual switch port on a switch. A `switch port`
+contains one or more decoders used to route memory requests to downstream
+ports, which may be connected to another `switch port` or an `endpoint port`.
+
+::
+
+  # ls /sys/bus/cxl/devices/port1
+  decoder1.0          dport0    driver     parent_dport  uport
+  decoders_committed  dport113  endpoint5  subsystem
+  devtype             dport2    modalias   uevent
+
+  # cat devtype
+  cxl_port
+
+  # cat decoder1.0/devtype
+  cxl_decoder_switch
+
+  # cat endpoint5/devtype
+  cxl_port
+
+CXL `Host Bridges` in the fabric are probed during :code:`cxl_acpi_probe` at
+the time the `CXL Root` is probed. This allows for the immediate logical
+connection between the root and host bridge:
+
+* The root has a downstream port connection to a host bridge.
+
+* The host bridge has an upstream port connection to the root.
+
+* The host bridge has one or more downstream port connections to switch
+  or endpoint ports.
+
+A `Host Bridge` is a special type of CXL `switch port`. It is explicitly
+defined in the ACPI specification via the `ACPI0016` ID. `Host Bridge` ports
+will be probed at `acpi_probe` time, while similar ports on an actual switch
+will be probed later. Otherwise, switch and host bridge ports look very
+similar - they both contain switch decoders which route accesses between
+upstream and downstream ports.
+
+Endpoint
+~~~~~~~~
+An `endpoint` is a terminal port in the fabric.
+This is a `logical device`, and may be one of many `logical devices`
+presented by a memory device. It is still considered a type of `port` in the
+fabric.
+
+An `endpoint` contains `endpoint decoders` and the device's Coherent Device
+Attribute Table (which describes the device's capabilities). ::
+
+  # ls /sys/bus/cxl/devices/endpoint5
+  CDAT        decoders_committed  modalias      uevent
+  decoder5.0  devtype             parent_dport  uport
+  decoder5.1  driver              subsystem
+
+  # cat /sys/bus/cxl/devices/endpoint5/devtype
+  cxl_port
+
+  # cat /sys/bus/cxl/devices/endpoint5/decoder5.0/devtype
+  cxl_decoder_endpoint
+
+
+Memory Device (memdev)
+~~~~~~~~~~~~~~~~~~~~~~
+A `memdev` is probed and added by the `cxl_pci` driver in :code:`cxl_pci_probe`
+and is managed by the `cxl_mem` driver. It primarily provides the `IOCTL`
+interface to a memory device, via :code:`/dev/cxl/memN`, and exposes various
+device configuration data. ::
+
+  # ls /sys/bus/cxl/devices/mem0
+  dev       firmware_version    payload_max  security  uevent
+  driver    label_storage_size  pmem         serial
+  firmware  numa_node           ram          subsystem
+
+A Memory Device is a discrete base object that is not a port. While the
+physical device it belongs to may also host an `endpoint`, the relationship
+between an `endpoint` and a `memdev` is not captured in sysfs.
+
+Port Relationships
+~~~~~~~~~~~~~~~~~~
+In our example described above, there are four host bridges attached to the
+root, and two of the host bridges have one endpoint attached.
+
+.. kernel-render:: DOT
+   :alt: Digraph of CXL fabric describing host-bridge interleaving
+   :caption: Digraph of CXL fabric with a host-bridge interleaved memory region
+
+   digraph foo {
+           "root0" -> "port1";
+           "root0" -> "port2";
+           "root0" -> "port3";
+           "root0" -> "port4";
+           "port1" -> "endpoint5";
+           "port3" -> "endpoint6";
+   }
+
+Decoders
+--------
+A `Decoder` is short for a CXL Host-Managed Device Memory (HDM) Decoder. It is
+a device that routes accesses through the CXL fabric to an endpoint, and at
+the endpoint translates `Host Physical` to `Device Physical` addresses.
+
+The CXL 3.1 specification heavily implies that only endpoint decoders should
+engage in translation of `Host Physical Address` to `Device Physical Address`.
+::
+
+  8.2.4.20 CXL HDM Decoder Capability Structure
+
+  IMPLEMENTATION NOTE
+  CXL Host Bridge and Upstream Switch Port Decode Flow
+
+  IMPLEMENTATION NOTE
+  Device Decode Logic
+
+These notes imply that there are two logical groups of decoders.
+
+* Routing Decoder - a decoder which routes accesses but does not translate
+  addresses from HPA to DPA.
+
+* Translating Decoder - a decoder which translates accesses from HPA to DPA
+  for an endpoint to service.
+
+The CXL drivers distinguish 3 decoder types: root, switch, and endpoint. Only
+endpoint decoders are Translating Decoders; all others are Routing Decoders.
+
+.. note:: PLATFORM VENDORS BE AWARE
+
+   Linux makes a strong assumption that endpoint decoders are the only
+   decoders in the fabric that actively translate HPA to DPA. Linux assumes
+   routing decoders pass the HPA unchanged to the next decoder in the fabric.
+
+   It is therefore assumed that any given decoder in the fabric will have an
+   address range that is a subset of its upstream port decoder. Any deviation
+   from this scheme is undefined per the specification. Linux prioritizes
+   spec-defined / architectural behavior.
+
+Decoders may have one or more `Downstream Targets` if configured to interleave
+memory accesses. This will be presented in sysfs via the :code:`target_list`
+parameter.
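+
+To make the routing vs. translation distinction concrete, here is a hedged
+sketch of the standard modulo-arithmetic decode (a simplification of what HDM
+decoders implement in hardware, assuming the default modulo interleave
+arithmetic; the function names are hypothetical, and this is not driver
+code)::
+
+   #include <stdint.h>
+   #include <stdio.h>
+
+   /* Routing decoder: which downstream target services this HPA. */
+   static unsigned int decode_target(uint64_t hpa, uint64_t base,
+                                     unsigned int ways, unsigned int gran)
+   {
+           return (unsigned int)(((hpa - base) / gran) % ways);
+   }
+
+   /* Endpoint decoder: translate an HPA to a device-local DPA. */
+   static uint64_t hpa_to_dpa(uint64_t hpa, uint64_t base, uint64_t dpa_base,
+                              unsigned int ways, unsigned int gran)
+   {
+           uint64_t offset = hpa - base;
+           uint64_t round = offset / ((uint64_t)gran * ways); /* rotations */
+
+           return dpa_base + round * gran + offset % gran;
+   }
+
+   int main(void)
+   {
+           uint64_t base = 0xc050000000, hpa = base + 0x1300;
+
+           /* 4 ways at 256-byte granularity; prints: target 3, dpa 0x400 */
+           printf("target %u, dpa 0x%llx\n",
+                  decode_target(hpa, base, 4, 256),
+                  (unsigned long long)hpa_to_dpa(hpa, base, 0, 4, 256));
+           return 0;
+   }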
+
+Root Decoder
+~~~~~~~~~~~~
+A `Root Decoder` is a logical construct of the physical address and interleave
+configurations present in the CFMWS field of the :doc:`CEDT
+<../platform/acpi/cedt>`.
+Linux presents this information as a decoder present in the `CXL Root`. We
+consider this a `Root Decoder`, though technically it exists on the boundary
+of the CXL specification and platform-specific CXL root implementations.
+
+Linux considers these logical decoders a type of `Routing Decoder`; such a
+decoder is the first decoder in the CXL fabric to receive a memory access
+from the platform's memory controllers.
+
+`Root Decoders` are created during :code:`cxl_acpi_probe`. One root decoder
+is created per CFMWS entry in the :doc:`CEDT <../platform/acpi/cedt>`.
+
+The :code:`target_list` parameter is filled by the CFMWS target fields.
+Targets of a root decoder are `Host Bridges`, which means interleave done at
+the root decoder level is an `Inter-Host-Bridge Interleave`.
+
+Only root decoders are capable of `Inter-Host-Bridge Interleave`.
+
+Such interleaves must be configured by the platform and described in the ACPI
+CEDT CFMWS, as the target CXL host bridge UIDs in the CFMWS must match the CXL
+host bridge UIDs in the CHBS field of the :doc:`CEDT
+<../platform/acpi/cedt>` and the UID field of CXL Host Bridges defined in
+the :doc:`DSDT <../platform/acpi/dsdt>`.
+
+Interleave settings in a root decoder describe how to interleave accesses
+among the *immediate downstream targets*, not the entire interleave set.
+
+The memory range described in the root decoder is used to
+
+1) Create a memory region (:code:`region0` in this example), and
+
+2) Associate the region with an IO Memory Resource (:code:`kernel/resource.c`)
+
+::
+
+  # ls /sys/bus/cxl/devices/decoder0.0/
+  cap_pmem           devtype                 region0
+  cap_ram            interleave_granularity  size
+  cap_type2          interleave_ways         start
+  cap_type3          locked                  subsystem
+  create_ram_region  modalias                target_list
+  delete_region      qos_class               uevent
+
+  # cat /sys/bus/cxl/devices/decoder0.0/region0/resource
+  0xc050000000
+
+The IO Memory Resource is created during early boot when the CFMWS region is
+identified in the EFI Memory Map or E820 table (on x86).
+
+Root decoders are defined as a separate devtype, but are also a type
+of `Switch Decoder` due to having downstream targets. ::
+
+  # cat /sys/bus/cxl/devices/decoder0.0/devtype
+  cxl_decoder_root
+
+Switch Decoder
+~~~~~~~~~~~~~~
+Any non-root, non-endpoint decoder is considered a `Switch Decoder`, and will
+present with the type :code:`cxl_decoder_switch`. Both `Host Bridge` and `CXL
+Switch` (device) decoders are of type :code:`cxl_decoder_switch`. ::
+
+  # ls /sys/bus/cxl/devices/decoder1.0/
+  devtype                 locked    size       target_list
+  interleave_granularity  modalias  start      target_type
+  interleave_ways         region    subsystem  uevent
+
+  # cat /sys/bus/cxl/devices/decoder1.0/devtype
+  cxl_decoder_switch
+
+  # cat /sys/bus/cxl/devices/decoder1.0/region
+  region0
+
+A `Switch Decoder` has associations between a region defined by a root
+decoder and downstream target ports. Interleaving done within a switch decoder
+is a multi-downstream-port interleave (or `Intra-Host-Bridge Interleave` for
+host bridges).
+
+Interleave settings in a switch decoder describe how to interleave accesses
+among the *immediate downstream targets*, not the entire interleave set.
+
+Switch decoders are created during :code:`cxl_switch_port_probe` in the
+:code:`cxl_port` driver, and are created based on a PCI device's DVSEC
+registers.
+
+Switch decoder programming is validated during probe if the platform programs
+them during boot (See `Auto Decoders` below), or on commit if programmed at
+runtime (See `Runtime Programming` below).
+
+
+Endpoint Decoder
+~~~~~~~~~~~~~~~~
+Any decoder attached to a *terminal* point in the CXL fabric (`An Endpoint`)
+is considered an `Endpoint Decoder`. Endpoint decoders are of type
+:code:`cxl_decoder_endpoint`. ::
+
+  # ls /sys/bus/cxl/devices/decoder5.0
+  devtype                 locked    start
+  dpa_resource            modalias  subsystem
+  dpa_size                mode      target_type
+  interleave_granularity  region    uevent
+  interleave_ways         size
+
+  # cat /sys/bus/cxl/devices/decoder5.0/devtype
+  cxl_decoder_endpoint
+
+  # cat /sys/bus/cxl/devices/decoder5.0/region
+  region0
+
+An `Endpoint Decoder` has an association with a region defined by a root
+decoder and describes the device-local resource associated with this region.
+
+Unlike root and switch decoders, endpoint decoders translate `Host Physical`
+to `Device Physical` address ranges. The interleave settings on an endpoint
+therefore describe the entire *interleave set*.
+
+`Device Physical Address` regions must be committed in-order. For example, the
+DPA region starting at 0x80000000 cannot be committed before the DPA region
+starting at 0x0.
+
+As of Linux v6.15, Linux does not support *imbalanced* interleave setups; all
+endpoints in an interleave set are expected to have the same interleave
+settings (granularity and ways must be the same).
+
+Endpoint decoders are created during :code:`cxl_endpoint_port_probe` in the
+:code:`cxl_port` driver, and are created based on a PCI device's DVSEC
+registers.
+
+Decoder Relationships
+~~~~~~~~~~~~~~~~~~~~~
+In our example described above, there is one root decoder which routes memory
+accesses over two host bridges. Each host bridge has a decoder which routes
+accesses to their singular endpoint targets. Each endpoint has a decoder
+which translates HPA to DPA and services the memory request.
+
+The driver validates relationships between ports via decoder programming, so
+we can think of decoders as being related in a similarly hierarchical fashion
+to ports.
+
+.. kernel-render:: DOT
+   :alt: Digraph of hierarchical relationship between root, switch, and endpoint decoders.
+   :caption: Digraph of CXL root, switch, and endpoint decoders.
+
+   digraph foo {
+           "root0" -> "decoder0.0";
+           "decoder0.0" -> "decoder1.0";
+           "decoder0.0" -> "decoder3.0";
+           "decoder1.0" -> "decoder5.0";
+           "decoder3.0" -> "decoder6.0";
+   }
+
+Regions
+-------
+
+Memory Region
+~~~~~~~~~~~~~
+A `Memory Region` is a logical construct that connects a set of CXL ports in
+the fabric to an IO Memory Resource. It is ultimately used to expose the
+memory on these devices to the DAX subsystem via a `DAX Region`.
+
+An example RAM region: ::
+
+  # ls /sys/bus/cxl/devices/region0/
+  access0      devtype                 modalias  subsystem  uuid
+  access1      driver                  mode      target0
+  commit       interleave_granularity  resource  target1
+  dax_region0  interleave_ways         size      uevent
+
+A memory region can be constructed during endpoint probe, if decoders were
+programmed by BIOS/EFI (see `Auto Decoders`), or by creating a region manually
+via a `Root Decoder`'s :code:`create_ram_region` or :code:`create_pmem_region`
+interfaces.
+
+The interleave settings in a `Memory Region` describe the configuration of the
+`Interleave Set` - and are what can be expected to be seen in the endpoint
+interleave settings.
+
+.. kernel-render:: DOT
+   :alt: Digraph of CXL memory region relationships between root and endpoint decoders.
+   :caption: Regions are created based on root decoder configurations. Endpoint
+             decoders must be programmed with the same interleave settings as
+             the region.
+
+   digraph foo {
+           "root0" -> "decoder0.0";
+           "decoder0.0" -> "region0";
+           "region0" -> "decoder5.0";
+           "region0" -> "decoder6.0";
+   }
+
+DAX Region
+~~~~~~~~~~
+A `DAX Region` is used to convert a CXL `Memory Region` to a DAX device. A
+DAX device may then be accessed directly via a file descriptor interface, or
+converted to System RAM via the DAX kmem driver. See the DAX driver section
+for more details. ::
+
+  # ls /sys/bus/cxl/devices/dax_region0/
+  dax0.0      devtype  modalias   uevent
+  dax_region  driver   subsystem
+
+Mailbox Interfaces
+------------------
+A mailbox command interface for each device is exposed in ::
+
+  /dev/cxl/mem0
+  /dev/cxl/mem1
+
+These mailboxes may receive any specification-defined command. Raw commands
+(custom commands) can only be sent to these interfaces if the build config
+:code:`CXL_MEM_RAW_COMMANDS` is set. This is considered a debug and/or
+development interface, not an officially supported mechanism for creation
+of vendor-specific commands (see the `fwctl` subsystem for that).
+
+Decoder Programming
+===================
+
+Runtime Programming
+-------------------
+During probe, the only decoders *required* to be programmed are `Root
+Decoders`. In reality, `Root Decoders` are a logical construct to describe
+the memory region and interleave configuration at the host bridge level - as
+described in the ACPI CEDT CFMWS.
+
+All other `Switch` and `Endpoint` decoders may be programmed by the user
+at runtime - if the platform supports such configurations.
+
+This interaction is what creates a `Software Defined Memory` environment.
+
+See the :code:`cxl-cli` documentation for more information about how to
+configure CXL decoders at runtime.
+
+Auto Decoders
+-------------
+Auto Decoders are decoders programmed by BIOS/EFI at boot time, and are
+almost always locked (cannot be changed). This is done by a platform
+which may have a static configuration - or certain quirks which may prevent
+dynamic runtime changes to the decoders (such as requiring additional
+controller programming within the CPU complex outside the scope of CXL).
+
+Auto Decoders are probed automatically as long as the devices and memory
+regions they are associated with probe without issue. When probing Auto
+Decoders, the driver's primary responsibility is to ensure the fabric is
+sane - as if validating runtime programmed regions and decoders.
+
+If Linux cannot validate the auto-decoder configuration, the memory will not
+be surfaced as a DAX device - and therefore not be exposed to the page
+allocator - effectively stranding it.
+
+Interleave
+----------
+
+The Linux CXL driver supports `Cross-Link First` interleave. This dictates
+how interleave is programmed at each decoder step, as the driver validates
+the relationships between a decoder and its parent.
+
+For example, in a `Cross-Link First` interleave setup with 16 endpoints
+attached to 4 host bridges, Linux expects the following ways/granularity
+across the root, host bridge, and endpoints respectively.
+
+.. flat-table:: 4x4 cross-link first interleave settings
+
+   * - decoder
+     - ways
+     - granularity
+
+   * - root
+     - 4
+     - 256
+
+   * - host bridge
+     - 4
+     - 1024
+
+   * - endpoint
+     - 16
+     - 256
+
+At the root, a given access will be routed to the
+:code:`((HPA / 256) % 4)th` target host bridge. Within a host bridge, it will
+be routed to the :code:`((HPA / 1024) % 4)th` target endpoint.
+Each endpoint translates based on the entire 16-device interleave set.
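+
+A worked trace of one access through this 4x4 setup (a sketch; the HPA here
+is an arbitrary example offset relative to the region base)::
+
+   #include <stdio.h>
+
+   int main(void)
+   {
+           unsigned long long hpa = 0x1500; /* offset into the region */
+
+           unsigned int hb = (hpa / 256) % 4;  /* root: host bridge index */
+           unsigned int ep = (hpa / 1024) % 4; /* host bridge: endpoint index */
+           /* endpoint: translates with ways=16, granularity=256 */
+           unsigned long long dpa = (hpa / (256 * 16)) * 256 + hpa % 256;
+
+           /* Prints: HPA 0x1500 -> HB 1, local EP 1, DPA offset 0x100 */
+           printf("HPA 0x%llx -> HB %u, local EP %u, DPA offset 0x%llx\n",
+                  hpa, hb, ep, dpa);
+           return 0;
+   }
+
+Note that the global device position in the 16-device set works out to
+:code:`hb + 4 * ep` under this cross-link-first layout.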
+
+Unbalanced interleave sets are not supported - decoders at a similar point
+in the hierarchy (e.g. all host bridge decoders) must have the same ways and
+granularity configuration.
+
+At Root
+~~~~~~~
+Root decoder interleave is defined by the CFMWS field of the :doc:`CEDT
+<../platform/acpi/cedt>`. The CEDT may actually define multiple CFMWS
+configurations to describe the same physical capacity, with the intent to
+allow users to decide at runtime whether to online memory as interleaved or
+non-interleaved. ::
+
+  Subtable Type             : 01 [CXL Fixed Memory Window Structure]
+  Window base address       : 0000000100000000
+  Window size               : 0000000100000000
+  Interleave Members (2^n)  : 00
+  Interleave Arithmetic     : 00
+  First Target              : 00000007
+
+  Subtable Type             : 01 [CXL Fixed Memory Window Structure]
+  Window base address       : 0000000200000000
+  Window size               : 0000000100000000
+  Interleave Members (2^n)  : 00
+  Interleave Arithmetic     : 00
+  First Target              : 00000006
+
+  Subtable Type             : 01 [CXL Fixed Memory Window Structure]
+  Window base address       : 0000000300000000
+  Window size               : 0000000200000000
+  Interleave Members (2^n)  : 01
+  Interleave Arithmetic     : 00
+  First Target              : 00000007
+  Next Target               : 00000006
+
+In this example, the CFMWS defines two discrete non-interleaved 4GB regions
+for each host bridge, and one interleaved 8GB region that targets both. This
+would result in 3 root decoders presenting in the root. ::
+
+  # ls /sys/bus/cxl/devices/root0/decoder*
+  decoder0.0  decoder0.1  decoder0.2
+
+  # cat /sys/bus/cxl/devices/decoder0.0/target_list start size
+  7
+  0x100000000
+  0x100000000
+
+  # cat /sys/bus/cxl/devices/decoder0.1/target_list start size
+  6
+  0x200000000
+  0x100000000
+
+  # cat /sys/bus/cxl/devices/decoder0.2/target_list start size
+  7,6
+  0x300000000
+  0x200000000
+
+These decoders are not runtime programmable. They are used to generate a
+`Memory Region` to bring this memory online with runtime programmed settings
+at the `Switch` and `Endpoint` decoders.
+
+At Host Bridge or Switch
+~~~~~~~~~~~~~~~~~~~~~~~~
+`Host Bridge` and `Switch` decoders are programmable via the following fields:
+
+- :code:`start` - the HPA region associated with the memory region
+- :code:`size` - the size of the region
+- :code:`target_list` - the list of downstream ports
+- :code:`interleave_ways` - the number of downstream ports to interleave across
+- :code:`interleave_granularity` - the granularity to interleave at
+
+Linux expects the :code:`interleave_granularity` of switch decoders to be
+derived from their upstream port connections. In `Cross-Link First` interleave
+configurations, the :code:`interleave_granularity` of a decoder is equal to
+:code:`parent_interleave_granularity * parent_interleave_ways`.
+
+At Endpoint
+~~~~~~~~~~~
+`Endpoint Decoders` are programmed similarly to Host Bridge and Switch
+decoders, with the exception that the ways and granularity are defined by the
+interleave set (e.g. the interleave settings defined by the associated
+`Memory Region`).
+
+- :code:`start` - the HPA region associated with the memory region
+- :code:`size` - the size of the region
+- :code:`interleave_ways` - the number of endpoints in the interleave set
+- :code:`interleave_granularity` - the granularity to interleave at
+
+These settings are used by endpoint decoders to *Translate* memory requests
+from HPA to DPA. This is why they must be aware of the entire interleave set.
+
+Linux does not support unbalanced interleave configurations. As a result, all
+endpoints in an interleave set must have the same ways and granularity.
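+
+A small sanity-check sketch of the derivation rules above for the 4x4 example
+(illustrative only): switch decoders derive granularity from their parent,
+while endpoint decoders carry the whole set's ways and base granularity::
+
+   #include <assert.h>
+
+   int main(void)
+   {
+           unsigned int root_ways = 4, root_gran = 256;
+
+           /* host bridge: granularity = parent granularity * parent ways */
+           unsigned int hb_ways = 4, hb_gran = root_gran * root_ways;
+
+           /* endpoint: entire interleave set - ways multiply down the path */
+           unsigned int ep_ways = root_ways * hb_ways;
+           unsigned int ep_gran = root_gran;
+
+           assert(hb_gran == 1024 && ep_ways == 16 && ep_gran == 256);
+           return 0;
+   }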
+
+Example Configurations
+======================
+.. toctree::
+   :maxdepth: 1
+
+   example-configurations/single-device.rst
+   example-configurations/hb-interleave.rst
+   example-configurations/intra-hb-interleave.rst
+   example-configurations/multi-interleave.rst
diff --git a/Documentation/driver-api/cxl/linux/dax-driver.rst b/Documentation/driver-api/cxl/linux/dax-driver.rst
new file mode 100644
index 000000000000..10d953a2167b
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/dax-driver.rst
@@ -0,0 +1,43 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+DAX Driver Operation
+====================
+The `Direct Access Device` driver was originally designed to provide a
+memory-like access mechanism to memory-like block-devices. It was
+extended to support CXL Memory Devices, which provide user-configured
+memory devices.
+
+The CXL subsystem depends on the DAX subsystem to either:
+
+- Generate a file-like interface to userland via :code:`/dev/daxN.Y`, or
+- Engage the memory-hotplug interface to add CXL memory to the page allocator.
+
+The DAX subsystem exposes this ability through the `cxl_dax_region` driver.
+A `dax_region` provides the translation between a CXL `memory_region` and
+a `DAX Device`.
+
+DAX Device
+==========
+A `DAX Device` is a file-like interface exposed in :code:`/dev/daxN.Y`. A
+memory region exposed via a dax device can be accessed by userland software
+via the :code:`mmap()` system call. The result is direct mappings to the
+CXL capacity in the task's page tables.
+
+Users wishing to manually handle allocation of CXL memory should use this
+interface.
+
+kmem conversion
+===============
+The :code:`dax_kmem` driver converts a `DAX Device` into a series of `hotplug
+memory blocks` managed by :code:`kernel/memory-hotplug.c`. This capacity
+will be exposed to the kernel page allocator in the user-selected memory
+zone.
+
+The :code:`memmap_on_memory` setting (both global and DAX device local)
+dictates where the :code:`struct folio` descriptors for this memory will be
+allocated from. If :code:`memmap_on_memory` is set, memory hotplug will set
+aside a portion of the memory block capacity to allocate folios. If unset,
+the memory is allocated via a normal :code:`GFP_KERNEL` allocation - and as a
+result will most likely land on the local NUMA node of the CPU executing the
+hotplug operation.
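+
+As a rough sense of scale for this trade-off, a sketch assuming roughly one
+64-byte descriptor per 4KiB page (the exact overhead depends on kernel
+configuration and is not taken from this document)::
+
+   #include <stdio.h>
+
+   int main(void)
+   {
+           unsigned long long capacity = 128ULL << 30;       /* 128 GiB device */
+           unsigned long long descriptors = capacity / 4096; /* 4 KiB pages */
+           unsigned long long metadata = descriptors * 64;   /* bytes */
+
+           /* Prints: ~2048 MiB set aside for the memmap */
+           printf("~%llu MiB set aside for the memmap\n", metadata >> 20);
+           return 0;
+   }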
diff --git a/Documentation/driver-api/cxl/linux/early-boot.rst b/Documentation/driver-api/cxl/linux/early-boot.rst
new file mode 100644
index 000000000000..a7fc6fc85fbe
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/early-boot.rst
@@ -0,0 +1,137 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+Linux Init (Early Boot)
+=======================
+
+Linux configuration is split into two major steps: Early-Boot and everything
+else.
+
+During early boot, Linux sets up immutable resources (such as NUMA nodes),
+while later operations include things like driver probe and memory hotplug.
+Linux may read EFI and ACPI information throughout this process to configure
+logical representations of the devices.
+
+During the Linux Early Boot stage (functions in the kernel that have the
+__init annotation), the system takes the resources created by EFI/BIOS
+(:doc:`ACPI tables <../platform/acpi>`) and turns them into resources that
+the kernel can consume.
+
+
+BIOS, Build and Boot Options
+============================
+
+There are 4 BIOS, build, and boot options that need to be considered, which
+dictate how memory will be managed by Linux during early boot.
+
+* EFI_MEMORY_SP
+
+  * BIOS/EFI Option that dictates whether memory is SystemRAM or
+    Specific Purpose. Specific Purpose memory will be deferred to
+    drivers to manage - and not immediately exposed as system RAM.
+
+* CONFIG_EFI_SOFT_RESERVE
+
+  * Linux Build config option that dictates whether the kernel supports
+    Specific Purpose memory.
+
+* CONFIG_MHP_DEFAULT_ONLINE_TYPE
+
+  * Linux Build config that dictates whether and how Specific Purpose memory
+    converted to a dax device should be managed (left as DAX or onlined as
+    SystemRAM in ZONE_NORMAL or ZONE_MOVABLE).
+
+* nosoftreserve
+
+  * Linux kernel boot option that dictates whether Soft Reserve should be
+    supported. Similar to CONFIG_EFI_SOFT_RESERVE.
+
+Memory Map Creation
+===================
+
+While the kernel parses the EFI memory map, if :code:`Specific Purpose` memory
+is supported and detected, it will set this region aside as
+:code:`SOFT_RESERVED`.
+
+If :code:`EFI_MEMORY_SP=0`, :code:`CONFIG_EFI_SOFT_RESERVE=n`, or
+:code:`nosoftreserve=y` - Linux will default a CXL device memory region to
+SystemRAM. This will expose the memory to the kernel page allocator in
+:code:`ZONE_NORMAL`, making it available for use for most allocations
+(including :code:`struct page` and page tables).
+
+If `Specific Purpose` is set and supported,
+:code:`CONFIG_MHP_DEFAULT_ONLINE_TYPE_*` dictates whether the memory is
+onlined by default (:code:`_OFFLINE` or :code:`_ONLINE_*`), and if online,
+which zone to online this memory to by default (:code:`_NORMAL` or
+:code:`_MOVABLE`).
+
+If placed in :code:`ZONE_MOVABLE`, the memory will not be available for most
+kernel allocations (such as :code:`struct page` or page tables). This may
+significantly impact performance depending on the memory capacity of the
+system.
+
+
+NUMA Node Reservation
+=====================
+
+Linux refers to the proximity domains (:code:`PXM`) defined in the :doc:`SRAT
+<../platform/acpi/srat>` to create NUMA nodes in :code:`acpi_numa_init`.
+Typically, there is a 1:1 relation between :code:`PXM` and NUMA node IDs.
+
+The SRAT is the only ACPI-defined way of defining Proximity Domains. Linux
+chooses to, at most, map those 1:1 with NUMA nodes.
+The :doc:`CEDT <../platform/acpi/cedt>` adds a description of SPA ranges
+which Linux may map to one or more NUMA nodes.
+
+If there are CXL ranges in the CFMWS but not in the SRAT, then a fake
+:code:`PXM` is created (as of v6.15). In the future, Linux may reject CFMWS
+not described by the SRAT due to the ambiguity of proximity domain
+association.
+
+It is important to note that NUMA node creation cannot be done at runtime.
+All possible NUMA nodes are identified at :code:`__init` time, more
+specifically during :code:`mm_init`. The CEDT and SRAT must contain
+sufficient :code:`PXM` data for Linux to identify NUMA nodes and their
+associated memory regions.
+
+The relevant code exists in: :code:`linux/drivers/acpi/numa/srat.c`.
+
+See :doc:`Example Platform Configurations <../platform/example-configs>`
+for more info.
+
+Memory Tiers Creation
+=====================
+Memory tiers are a collection of NUMA nodes grouped by performance
+characteristics. During :code:`__init`, Linux initializes the system with a
+default memory tier that contains all nodes marked :code:`N_MEMORY`.
+
+ +:code:`memory_tier_init` is called at boot for all nodes with memory online by +default. :code:`memory_tier_late_init` is called during late-init for nodes setup +during driver configuration. + +Nodes are only marked :code:`N_MEMORY` if they have *online* memory. + +Tier membership can be inspected in :: + + /sys/devices/virtual/memory_tiering/memory_tierN/nodelist + 0-1 + +If nodes are grouped which have clear difference in performance, check the +:doc:`HMAT <../platform/acpi/hmat>` and CDAT information for the CXL nodes. All +nodes default to the DRAM tier, unless HMAT/CDAT information is reported to the +memory_tier component via `access_coordinates`. + +For more, see :doc:`CXL access coordinates documentation +<../linux/access-coordinates>`. + +Contiguous Memory Allocation +============================ +The contiguous memory allocator (CMA) enables reservation of contiguous memory +regions on NUMA nodes during early boot. However, CMA cannot reserve memory +on NUMA nodes that are not online during early boot. :: + + void __init hugetlb_cma_reserve(int order) { + if (!node_online(nid)) + /* do not allow reservations */ + } + +This means if users intend to defer management of CXL memory to the driver, CMA +cannot be used to guarantee huge page allocations. If enabling CXL memory as +SystemRAM in `ZONE_NORMAL` during early boot, CMA reservations per-node can be +made with the :code:`cma_pernuma` or :code:`numa_cma` kernel command line +parameters. diff --git a/Documentation/driver-api/cxl/linux/example-configurations/hb-interleave.rst b/Documentation/driver-api/cxl/linux/example-configurations/hb-interleave.rst new file mode 100644 index 000000000000..f071490763a2 --- /dev/null +++ b/Documentation/driver-api/cxl/linux/example-configurations/hb-interleave.rst @@ -0,0 +1,314 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +Inter-Host-Bridge Interleave +============================ +This cxl-cli configuration dump shows the following host configuration: + +* A single socket system with one CXL root +* CXL Root has Four (4) CXL Host Bridges +* Two CXL Host Bridges have a single CXL Memory Expander Attached +* The CXL root is configured to interleave across the two host bridges. + +This output is generated by :code:`cxl list -v` and describes the relationships +between objects exposed in :code:`/sys/bus/cxl/devices/`. + +:: + + [ + { + "bus":"root0", + "provider":"ACPI.CXL", + "nr_dports":4, + "dports":[ + { + "dport":"pci0000:00", + "alias":"ACPI0016:01", + "id":0 + }, + { + "dport":"pci0000:a8", + "alias":"ACPI0016:02", + "id":4 + }, + { + "dport":"pci0000:2a", + "alias":"ACPI0016:03", + "id":1 + }, + { + "dport":"pci0000:d2", + "alias":"ACPI0016:00", + "id":5 + } + ], + +This chunk shows the CXL "bus" (root0) has 4 downstream ports attached to CXL +Host Bridges. The `Root` can be considered the singular upstream port attached +to the platform's memory controller - which routes memory requests to it. + +The `ports:root0` section lays out how each of these downstream ports are +configured. If a port is not configured (id's 0 and 1), they are omitted. + +:: + + "ports:root0":[ + { + "port":"port1", + "host":"pci0000:d2", + "depth":1, + "nr_dports":3, + "dports":[ + { + "dport":"0000:d2:01.1", + "alias":"device:02", + "id":0 + }, + { + "dport":"0000:d2:01.3", + "alias":"device:05", + "id":2 + }, + { + "dport":"0000:d2:07.1", + "alias":"device:0d", + "id":113 + } + ], + +This chunk shows the available downstream ports associated with the CXL Host +Bridge :code:`port1`. 
In this case, :code:`port1` has 3 available downstream +ports: :code:`dport1`, :code:`dport2`, and :code:`dport113`.. + +:: + + "endpoints:port1":[ + { + "endpoint":"endpoint5", + "host":"mem0", + "parent_dport":"0000:d2:01.1", + "depth":2, + "memdev":{ + "memdev":"mem0", + "ram_size":137438953472, + "serial":0, + "numa_node":0, + "host":"0000:d3:00.0" + }, + "decoders:endpoint5":[ + { + "decoder":"decoder5.0", + "resource":825975898112, + "size":274877906944, + "interleave_ways":2, + "interleave_granularity":256, + "region":"region0", + "dpa_resource":0, + "dpa_size":137438953472, + "mode":"ram" + } + ] + } + ], + +This chunk shows the endpoints attached to the host bridge :code:`port1`. + +:code:`endpoint5` contains a single configured decoder :code:`decoder5.0` +which has the same interleave configuration as :code:`region0` (shown later). + +Next we have the decodesr belonging to the host bridge: + +:: + + "decoders:port1":[ + { + "decoder":"decoder1.0", + "resource":825975898112, + "size":274877906944, + "interleave_ways":1, + "region":"region0", + "nr_targets":1, + "targets":[ + { + "target":"0000:d2:01.1", + "alias":"device:02", + "position":0, + "id":0 + } + ] + } + ] + }, + +Host Bridge :code:`port1` has a single decoder (:code:`decoder1.0`), whose only +target is :code:`dport1` - which is attached to :code:`endpoint5`. + +The following chunk shows a similar configuration for Host Bridge :code:`port3`, +the second host bridge with a memory device attached. + +:: + + { + "port":"port3", + "host":"pci0000:a8", + "depth":1, + "nr_dports":1, + "dports":[ + { + "dport":"0000:a8:01.1", + "alias":"device:c3", + "id":0 + } + ], + "endpoints:port3":[ + { + "endpoint":"endpoint6", + "host":"mem1", + "parent_dport":"0000:a8:01.1", + "depth":2, + "memdev":{ + "memdev":"mem1", + "ram_size":137438953472, + "serial":0, + "numa_node":0, + "host":"0000:a9:00.0" + }, + "decoders:endpoint6":[ + { + "decoder":"decoder6.0", + "resource":825975898112, + "size":274877906944, + "interleave_ways":2, + "interleave_granularity":256, + "region":"region0", + "dpa_resource":0, + "dpa_size":137438953472, + "mode":"ram" + } + ] + } + ], + "decoders:port3":[ + { + "decoder":"decoder3.0", + "resource":825975898112, + "size":274877906944, + "interleave_ways":1, + "region":"region0", + "nr_targets":1, + "targets":[ + { + "target":"0000:a8:01.1", + "alias":"device:c3", + "position":0, + "id":0 + } + ] + } + ] + }, + + +The next chunk shows the two CXL host bridges without attached endpoints. + +:: + + { + "port":"port2", + "host":"pci0000:00", + "depth":1, + "nr_dports":2, + "dports":[ + { + "dport":"0000:00:01.3", + "alias":"device:55", + "id":2 + }, + { + "dport":"0000:00:07.1", + "alias":"device:5d", + "id":113 + } + ] + }, + { + "port":"port4", + "host":"pci0000:2a", + "depth":1, + "nr_dports":1, + "dports":[ + { + "dport":"0000:2a:01.1", + "alias":"device:d0", + "id":0 + } + ] + } + ], + +Next we have the `Root Decoders` belonging to :code:`root0`. This root decoder +applies the interleave across the downstream ports :code:`port1` and +:code:`port3` - with a granularity of 256 bytes. + +This information is generated by the CXL driver reading the ACPI CEDT CMFWS. 
+ +:: + + "decoders:root0":[ + { + "decoder":"decoder0.0", + "resource":825975898112, + "size":274877906944, + "interleave_ways":2, + "interleave_granularity":256, + "max_available_extent":0, + "volatile_capable":true, + "nr_targets":2, + "targets":[ + { + "target":"pci0000:a8", + "alias":"ACPI0016:02", + "position":1, + "id":4 + }, + { + "target":"pci0000:d2", + "alias":"ACPI0016:00", + "position":0, + "id":5 + } + ], + +Finally we have the `Memory Region` associated with the `Root Decoder` +:code:`decoder0.0`. This region describes the overall interleave configuration +of the interleave set. + +:: + + "regions:decoder0.0":[ + { + "region":"region0", + "resource":825975898112, + "size":274877906944, + "type":"ram", + "interleave_ways":2, + "interleave_granularity":256, + "decode_state":"commit", + "mappings":[ + { + "position":1, + "memdev":"mem1", + "decoder":"decoder6.0" + }, + { + "position":0, + "memdev":"mem0", + "decoder":"decoder5.0" + } + ] + } + ] + } + ] + } + ] diff --git a/Documentation/driver-api/cxl/linux/example-configurations/intra-hb-interleave.rst b/Documentation/driver-api/cxl/linux/example-configurations/intra-hb-interleave.rst new file mode 100644 index 000000000000..077dfaf8458d --- /dev/null +++ b/Documentation/driver-api/cxl/linux/example-configurations/intra-hb-interleave.rst @@ -0,0 +1,291 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +Intra-Host-Bridge Interleave +============================ +This cxl-cli configuration dump shows the following host configuration: + +* A single socket system with one CXL root +* CXL Root has Four (4) CXL Host Bridges +* One (1) CXL Host Bridges has two CXL Memory Expanders Attached +* The Host bridge decoder is programmed to interleave across the expanders. + +This output is generated by :code:`cxl list -v` and describes the relationships +between objects exposed in :code:`/sys/bus/cxl/devices/`. + +:: + + [ + { + "bus":"root0", + "provider":"ACPI.CXL", + "nr_dports":4, + "dports":[ + { + "dport":"pci0000:00", + "alias":"ACPI0016:01", + "id":0 + }, + { + "dport":"pci0000:a8", + "alias":"ACPI0016:02", + "id":4 + }, + { + "dport":"pci0000:2a", + "alias":"ACPI0016:03", + "id":1 + }, + { + "dport":"pci0000:d2", + "alias":"ACPI0016:00", + "id":5 + } + ], + +This chunk shows the CXL "bus" (root0) has 4 downstream ports attached to CXL +Host Bridges. The `Root` can be considered the singular upstream port attached +to the platform's memory controller - which routes memory requests to it. + +The `ports:root0` section lays out how each of these downstream ports are +configured. If a port is not configured (id's 0 and 1), they are omitted. + +:: + + "ports:root0":[ + { + "port":"port1", + "host":"pci0000:d2", + "depth":1, + "nr_dports":3, + "dports":[ + { + "dport":"0000:d2:01.1", + "alias":"device:02", + "id":0 + }, + { + "dport":"0000:d2:01.3", + "alias":"device:05", + "id":2 + }, + { + "dport":"0000:d2:07.1", + "alias":"device:0d", + "id":113 + } + ], + +This chunk shows the available downstream ports associated with the CXL Host +Bridge :code:`port1`. In this case, :code:`port1` has 3 available downstream +ports: :code:`dport1`, :code:`dport2`, and :code:`dport113`.. 
+ +:: + + "endpoints:port1":[ + { + "endpoint":"endpoint5", + "host":"mem0", + "parent_dport":"0000:d2:01.1", + "depth":2, + "memdev":{ + "memdev":"mem0", + "ram_size":137438953472, + "serial":0, + "numa_node":0, + "host":"0000:d3:00.0" + }, + "decoders:endpoint5":[ + { + "decoder":"decoder5.0", + "resource":825975898112, + "size":274877906944, + "interleave_ways":2, + "interleave_granularity":256, + "region":"region0", + "dpa_resource":0, + "dpa_size":137438953472, + "mode":"ram" + } + ] + }, + { + "endpoint":"endpoint6", + "host":"mem1", + "parent_dport":"0000:d2:01.3, + "depth":2, + "memdev":{ + "memdev":"mem1", + "ram_size":137438953472, + "serial":0, + "numa_node":0, + "host":"0000:a9:00.0" + }, + "decoders:endpoint6":[ + { + "decoder":"decoder6.0", + "resource":825975898112, + "size":274877906944, + "interleave_ways":2, + "interleave_granularity":256, + "region":"region0", + "dpa_resource":0, + "dpa_size":137438953472, + "mode":"ram" + } + ] + } + ], + +This chunk shows the endpoints attached to the host bridge :code:`port1`. + +:code:`endpoint5` contains a single configured decoder :code:`decoder5.0` +which has the same interleave configuration memory region they belong to +(show later). + +Next we have the decoders belonging to the host bridge: + +:: + + "decoders:port1":[ + { + "decoder":"decoder1.0", + "resource":825975898112, + "size":274877906944, + "interleave_ways":2, + "interleave_granularity":256, + "region":"region0", + "nr_targets":2, + "targets":[ + { + "target":"0000:d2:01.1", + "alias":"device:02", + "position":0, + "id":0 + }, + { + "target":"0000:d2:01.3", + "alias":"device:05", + "position":1, + "id":0 + } + ] + } + ] + }, + +Host Bridge :code:`port1` has a single decoder (:code:`decoder1.0`) with two +targets: :code:`dport1` and :code:`dport3` - which are attached to +:code:`endpoint5` and :code:`endpoint6` respectively. + +The host bridge decoder interleaves these devices at a 256 byte granularity. + +The next chunk shows the three CXL host bridges without attached endpoints. + +:: + + { + "port":"port2", + "host":"pci0000:00", + "depth":1, + "nr_dports":2, + "dports":[ + { + "dport":"0000:00:01.3", + "alias":"device:55", + "id":2 + }, + { + "dport":"0000:00:07.1", + "alias":"device:5d", + "id":113 + } + ] + }, + { + "port":"port3", + "host":"pci0000:a8", + "depth":1, + "nr_dports":1, + "dports":[ + { + "dport":"0000:a8:01.1", + "alias":"device:c3", + "id":0 + } + ], + }, + { + "port":"port4", + "host":"pci0000:2a", + "depth":1, + "nr_dports":1, + "dports":[ + { + "dport":"0000:2a:01.1", + "alias":"device:d0", + "id":0 + } + ] + } + ], + +Next we have the `Root Decoders` belonging to :code:`root0`. This root decoder +applies the interleave across the downstream ports :code:`port1` and +:code:`port3` - with a granularity of 256 bytes. + +This information is generated by the CXL driver reading the ACPI CEDT CMFWS. + +:: + + "decoders:root0":[ + { + "decoder":"decoder0.0", + "resource":825975898112, + "size":274877906944, + "interleave_ways":1, + "max_available_extent":0, + "volatile_capable":true, + "nr_targets":2, + "targets":[ + { + "target":"pci0000:a8", + "alias":"ACPI0016:02", + "position":1, + "id":4 + }, + ], + +Finally we have the `Memory Region` associated with the `Root Decoder` +:code:`decoder0.0`. This region describes the overall interleave configuration +of the interleave set. 
+ +:: + + "regions:decoder0.0":[ + { + "region":"region0", + "resource":825975898112, + "size":274877906944, + "type":"ram", + "interleave_ways":2, + "interleave_granularity":256, + "decode_state":"commit", + "mappings":[ + { + "position":1, + "memdev":"mem1", + "decoder":"decoder6.0" + }, + { + "position":0, + "memdev":"mem0", + "decoder":"decoder5.0" + } + ] + } + ] + } + ] + } + ] diff --git a/Documentation/driver-api/cxl/linux/example-configurations/multi-interleave.rst b/Documentation/driver-api/cxl/linux/example-configurations/multi-interleave.rst new file mode 100644 index 000000000000..008f9053c630 --- /dev/null +++ b/Documentation/driver-api/cxl/linux/example-configurations/multi-interleave.rst @@ -0,0 +1,401 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +Multi-Level Interleave +====================== +This cxl-cli configuration dump shows the following host configuration: + +* A single socket system with one CXL root +* CXL Root has Four (4) CXL Host Bridges +* Two CXL Host Bridges have a two CXL Memory Expanders Attached each. +* The CXL root is configured to interleave across the two host bridges. +* Each host bridge with expanders interleaves across two endpoints. + +This output is generated by :code:`cxl list -v` and describes the relationships +between objects exposed in :code:`/sys/bus/cxl/devices/`. + +:: + + [ + { + "bus":"root0", + "provider":"ACPI.CXL", + "nr_dports":4, + "dports":[ + { + "dport":"pci0000:00", + "alias":"ACPI0016:01", + "id":0 + }, + { + "dport":"pci0000:a8", + "alias":"ACPI0016:02", + "id":4 + }, + { + "dport":"pci0000:2a", + "alias":"ACPI0016:03", + "id":1 + }, + { + "dport":"pci0000:d2", + "alias":"ACPI0016:00", + "id":5 + } + ], + +This chunk shows the CXL "bus" (root0) has 4 downstream ports attached to CXL +Host Bridges. The `Root` can be considered the singular upstream port attached +to the platform's memory controller - which routes memory requests to it. + +The `ports:root0` section lays out how each of these downstream ports are +configured. If a port is not configured (id's 0 and 1), they are omitted. + +:: + + "ports:root0":[ + { + "port":"port1", + "host":"pci0000:d2", + "depth":1, + "nr_dports":3, + "dports":[ + { + "dport":"0000:d2:01.1", + "alias":"device:02", + "id":0 + }, + { + "dport":"0000:d2:01.3", + "alias":"device:05", + "id":2 + }, + { + "dport":"0000:d2:07.1", + "alias":"device:0d", + "id":113 + } + ], + +This chunk shows the available downstream ports associated with the CXL Host +Bridge :code:`port1`. In this case, :code:`port1` has 3 available downstream +ports: :code:`dport0`, :code:`dport2`, and :code:`dport113`. 
+ +:: + + "endpoints:port1":[ + { + "endpoint":"endpoint5", + "host":"mem0", + "parent_dport":"0000:d2:01.1", + "depth":2, + "memdev":{ + "memdev":"mem0", + "ram_size":137438953472, + "serial":0, + "numa_node":0, + "host":"0000:d3:00.0" + }, + "decoders:endpoint5":[ + { + "decoder":"decoder5.0", + "resource":825975898112, + "size":549755813888, + "interleave_ways":4, + "interleave_granularity":256, + "region":"region0", + "dpa_resource":0, + "dpa_size":137438953472, + "mode":"ram" + } + ] + }, + { + "endpoint":"endpoint6", + "host":"mem1", + "parent_dport":"0000:d2:01.3", + "depth":2, + "memdev":{ + "memdev":"mem1", + "ram_size":137438953472, + "serial":0, + "numa_node":0, + "host":"0000:d3:00.0" + }, + "decoders:endpoint6":[ + { + "decoder":"decoder6.0", + "resource":825975898112, + "size":549755813888, + "interleave_ways":4, + "interleave_granularity":256, + "region":"region0", + "dpa_resource":0, + "dpa_size":137438953472, + "mode":"ram" + } + ] + } + ], + +This chunk shows the endpoints attached to the host bridge :code:`port1`. + +:code:`endpoint5` contains a single configured decoder :code:`decoder5.0` +which has the same interleave configuration as :code:`region0` (shown later). + +:code:`endpoint6` contains a single configured decoder :code:`decoder5.0` +which has the same interleave configuration as :code:`region0` (shown later). + +Next we have the decoders belonging to the host bridge: + +:: + + "decoders:port1":[ + { + "decoder":"decoder1.0", + "resource":825975898112, + "size":549755813888, + "interleave_ways":2, + "interleave_granularity":512, + "region":"region0", + "nr_targets":2, + "targets":[ + { + "target":"0000:d2:01.1", + "alias":"device:02", + "position":0, + "id":0 + }, + { + "target":"0000:d2:01.3", + "alias":"device:05", + "position":2, + "id":0 + } + ] + } + ] + }, + +Host Bridge :code:`port1` has a single decoder (:code:`decoder1.0`), whose +targets are :code:`dport0` and :code:`dport2` - which are attached to +:code:`endpoint5` and :code:`endpoint6` respectively. + +The following chunk shows a similar configuration for Host Bridge :code:`port3`, +the second host bridge with a memory device attached. 
+ +:: + + { + "port":"port3", + "host":"pci0000:a8", + "depth":1, + "nr_dports":1, + "dports":[ + { + "dport":"0000:a8:01.1", + "alias":"device:c3", + "id":0 + }, + { + "dport":"0000:a8:01.3", + "alias":"device:c5", + "id":0 + } + ], + "endpoints:port3":[ + { + "endpoint":"endpoint7", + "host":"mem2", + "parent_dport":"0000:a8:01.1", + "depth":2, + "memdev":{ + "memdev":"mem2", + "ram_size":137438953472, + "serial":0, + "numa_node":0, + "host":"0000:a9:00.0" + }, + "decoders:endpoint7":[ + { + "decoder":"decoder7.0", + "resource":825975898112, + "size":549755813888, + "interleave_ways":4, + "interleave_granularity":256, + "region":"region0", + "dpa_resource":0, + "dpa_size":137438953472, + "mode":"ram" + } + ] + }, + { + "endpoint":"endpoint8", + "host":"mem3", + "parent_dport":"0000:a8:01.3", + "depth":2, + "memdev":{ + "memdev":"mem3", + "ram_size":137438953472, + "serial":0, + "numa_node":0, + "host":"0000:a9:00.0" + }, + "decoders:endpoint8":[ + { + "decoder":"decoder8.0", + "resource":825975898112, + "size":549755813888, + "interleave_ways":4, + "interleave_granularity":256, + "region":"region0", + "dpa_resource":0, + "dpa_size":137438953472, + "mode":"ram" + } + ] + } + ], + "decoders:port3":[ + { + "decoder":"decoder3.0", + "resource":825975898112, + "size":549755813888, + "interleave_ways":2, + "interleave_granularity":512, + "region":"region0", + "nr_targets":1, + "targets":[ + { + "target":"0000:a8:01.1", + "alias":"device:c3", + "position":1, + "id":0 + }, + { + "target":"0000:a8:01.3", + "alias":"device:c5", + "position":3, + "id":0 + } + ] + } + ] + }, + + +The next chunk shows the two CXL host bridges without attached endpoints. + +:: + + { + "port":"port2", + "host":"pci0000:00", + "depth":1, + "nr_dports":2, + "dports":[ + { + "dport":"0000:00:01.3", + "alias":"device:55", + "id":2 + }, + { + "dport":"0000:00:07.1", + "alias":"device:5d", + "id":113 + } + ] + }, + { + "port":"port4", + "host":"pci0000:2a", + "depth":1, + "nr_dports":1, + "dports":[ + { + "dport":"0000:2a:01.1", + "alias":"device:d0", + "id":0 + } + ] + } + ], + +Next we have the `Root Decoders` belonging to :code:`root0`. This root decoder +applies the interleave across the downstream ports :code:`port1` and +:code:`port3` - with a granularity of 256 bytes. + +This information is generated by the CXL driver reading the ACPI CEDT CMFWS. + +:: + + "decoders:root0":[ + { + "decoder":"decoder0.0", + "resource":825975898112, + "size":549755813888, + "interleave_ways":2, + "interleave_granularity":256, + "max_available_extent":0, + "volatile_capable":true, + "nr_targets":2, + "targets":[ + { + "target":"pci0000:a8", + "alias":"ACPI0016:02", + "position":1, + "id":4 + }, + { + "target":"pci0000:d2", + "alias":"ACPI0016:00", + "position":0, + "id":5 + } + ], + +Finally we have the `Memory Region` associated with the `Root Decoder` +:code:`decoder0.0`. This region describes the overall interleave configuration +of the interleave set. So we see there are a total of :code:`4` interleave +targets across 4 endpoint decoders. 
+
+::
+
+      "regions:decoder0.0":[
+        {
+          "region":"region0",
+          "resource":825975898112,
+          "size":549755813888,
+          "type":"ram",
+          "interleave_ways":4,
+          "interleave_granularity":256,
+          "decode_state":"commit",
+          "mappings":[
+            {
+              "position":3,
+              "memdev":"mem3",
+              "decoder":"decoder8.0"
+            },
+            {
+              "position":2,
+              "memdev":"mem1",
+              "decoder":"decoder6.0"
+            },
+            {
+              "position":1,
+              "memdev":"mem2",
+              "decoder":"decoder7.0"
+            },
+            {
+              "position":0,
+              "memdev":"mem0",
+              "decoder":"decoder5.0"
+            }
+          ]
+        }
+      ]
+      }
+    ]
+  }
+]
diff --git a/Documentation/driver-api/cxl/linux/example-configurations/single-device.rst b/Documentation/driver-api/cxl/linux/example-configurations/single-device.rst
new file mode 100644
index 000000000000..5fd38eb0aaf4
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/example-configurations/single-device.rst
@@ -0,0 +1,246 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Single Device
+=============
+This cxl-cli configuration dump shows the following host configuration:
+
+* A single socket system with one CXL root
+* The CXL Root has four (4) CXL Host Bridges
+* One CXL Host Bridge has a single CXL Memory Expander attached
+* No interleave is present
+
+This output is generated by :code:`cxl list -v` and describes the relationships
+between objects exposed in :code:`/sys/bus/cxl/devices/`.
+
+::
+
+    [
+      {
+        "bus":"root0",
+        "provider":"ACPI.CXL",
+        "nr_dports":4,
+        "dports":[
+          {
+            "dport":"pci0000:00",
+            "alias":"ACPI0016:01",
+            "id":0
+          },
+          {
+            "dport":"pci0000:a8",
+            "alias":"ACPI0016:02",
+            "id":4
+          },
+          {
+            "dport":"pci0000:2a",
+            "alias":"ACPI0016:03",
+            "id":1
+          },
+          {
+            "dport":"pci0000:d2",
+            "alias":"ACPI0016:00",
+            "id":5
+          }
+        ],
+
+This chunk shows that the CXL "bus" (:code:`root0`) has 4 downstream ports
+attached to CXL Host Bridges. The `Root` can be considered the singular
+upstream port attached to the platform's memory controller - which routes
+memory requests to it.
+
+The `ports:root0` section lays out how each of these downstream ports is
+configured. The host bridges without configured endpoints (ids 0, 1, and 4)
+appear later with only their dports listed.
+
+::
+
+      "ports:root0":[
+        {
+          "port":"port1",
+          "host":"pci0000:d2",
+          "depth":1,
+          "nr_dports":3,
+          "dports":[
+            {
+              "dport":"0000:d2:01.1",
+              "alias":"device:02",
+              "id":0
+            },
+            {
+              "dport":"0000:d2:01.3",
+              "alias":"device:05",
+              "id":2
+            },
+            {
+              "dport":"0000:d2:07.1",
+              "alias":"device:0d",
+              "id":113
+            }
+          ],
+
+This chunk shows the available downstream ports associated with the CXL Host
+Bridge :code:`port1`. In this case, :code:`port1` has 3 available downstream
+ports: :code:`dport0`, :code:`dport2`, and :code:`dport113`.
+
+::
+
+      "endpoints:port1":[
+        {
+          "endpoint":"endpoint5",
+          "host":"mem0",
+          "parent_dport":"0000:d2:01.1",
+          "depth":2,
+          "memdev":{
+            "memdev":"mem0",
+            "ram_size":137438953472,
+            "serial":0,
+            "numa_node":0,
+            "host":"0000:d3:00.0"
+          },
+          "decoders:endpoint5":[
+            {
+              "decoder":"decoder5.0",
+              "resource":825975898112,
+              "size":137438953472,
+              "interleave_ways":1,
+              "region":"region0",
+              "dpa_resource":0,
+              "dpa_size":137438953472,
+              "mode":"ram"
+            }
+          ]
+        }
+      ],
+
+This chunk shows the endpoints attached to the host bridge :code:`port1`.
+
+:code:`endpoint5` contains a single configured decoder :code:`decoder5.0`
+which has the same interleave configuration as :code:`region0` (shown later).
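+
+The attributes cxl-cli reports here are backed by sysfs files under
+:code:`/sys/bus/cxl/devices/`. As a rough illustration (the path assumes the
+example topology above, attribute names follow the sysfs-bus-cxl ABI, and
+error handling is omitted), the endpoint decoder's programming can be read
+back directly::
+
+      from pathlib import Path
+
+      decoder = Path("/sys/bus/cxl/devices/decoder5.0")
+
+      # Each attribute is a small text file; values are hex or decimal
+      # strings per the ABI documentation.
+      for name in ("start", "size", "mode", "dpa_resource", "dpa_size"):
+          print(f"{name:>14}: {(decoder / name).read_text().strip()}")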
+
+Next we have the decoders belonging to the host bridge:
+
+::
+
+      "decoders:port1":[
+        {
+          "decoder":"decoder1.0",
+          "resource":825975898112,
+          "size":137438953472,
+          "interleave_ways":1,
+          "region":"region0",
+          "nr_targets":1,
+          "targets":[
+            {
+              "target":"0000:d2:01.1",
+              "alias":"device:02",
+              "position":0,
+              "id":0
+            }
+          ]
+        }
+      ]
+      },
+
+Host Bridge :code:`port1` has a single decoder (:code:`decoder1.0`), whose only
+target is :code:`dport0` - which is attached to :code:`endpoint5`.
+
+The next chunk shows the three CXL host bridges without attached endpoints.
+
+::
+
+    {
+      "port":"port2",
+      "host":"pci0000:00",
+      "depth":1,
+      "nr_dports":2,
+      "dports":[
+        {
+          "dport":"0000:00:01.3",
+          "alias":"device:55",
+          "id":2
+        },
+        {
+          "dport":"0000:00:07.1",
+          "alias":"device:5d",
+          "id":113
+        }
+      ]
+    },
+    {
+      "port":"port3",
+      "host":"pci0000:a8",
+      "depth":1,
+      "nr_dports":1,
+      "dports":[
+        {
+          "dport":"0000:a8:01.1",
+          "alias":"device:c3",
+          "id":0
+        }
+      ]
+    },
+    {
+      "port":"port4",
+      "host":"pci0000:2a",
+      "depth":1,
+      "nr_dports":1,
+      "dports":[
+        {
+          "dport":"0000:2a:01.1",
+          "alias":"device:d0",
+          "id":0
+        }
+      ]
+    }
+  ],
+
+Next we have the `Root Decoder` belonging to :code:`root0`. This root decoder
+is a pass-through decoder because :code:`interleave_ways` is set to :code:`1`.
+
+This information is generated by the CXL driver reading the ACPI CEDT CFMWS.
+
+::
+
+    "decoders:root0":[
+      {
+        "decoder":"decoder0.0",
+        "resource":825975898112,
+        "size":137438953472,
+        "interleave_ways":1,
+        "max_available_extent":0,
+        "volatile_capable":true,
+        "nr_targets":1,
+        "targets":[
+          {
+            "target":"pci0000:d2",
+            "alias":"ACPI0016:00",
+            "position":0,
+            "id":5
+          }
+        ],
+
+Finally we have the `Memory Region` associated with the `Root Decoder`
+:code:`decoder0.0`. This describes the discrete region associated with the
+lone device.
+
+::
+
+      "regions:decoder0.0":[
+        {
+          "region":"region0",
+          "resource":825975898112,
+          "size":137438953472,
+          "type":"ram",
+          "interleave_ways":1,
+          "decode_state":"commit",
+          "mappings":[
+            {
+              "position":0,
+              "memdev":"mem0",
+              "decoder":"decoder5.0"
+            }
+          ]
+        }
+      ]
+      }
+    ]
+  }
+]
diff --git a/Documentation/driver-api/cxl/linux/memory-hotplug.rst b/Documentation/driver-api/cxl/linux/memory-hotplug.rst
new file mode 100644
index 000000000000..af368c2bc9cf
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/memory-hotplug.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Memory Hotplug
+==============
+The final phase of surfacing CXL memory to the kernel page allocator is for
+the `DAX` driver to surface a `Driver Managed` memory region via the
+memory-hotplug component.
+
+There are four major configurations to consider:
+
+1) Default Online Behavior (on/off and zone)
+2) Hotplug Memory Block size
+3) Memory Map Resource location
+4) Driver-Managed Memory Designation
+
+Default Online Behavior
+=======================
+The default-online behavior of hotplug memory is dictated by the following,
+in order of precedence:
+
+- :code:`CONFIG_MHP_DEFAULT_ONLINE_TYPE` Build Configuration
+- :code:`memhp_default_state` Boot parameter
+- :code:`/sys/devices/system/memory/auto_online_blocks` value
+
+These dictate whether hotplugged memory blocks arrive in one of three states:
+
+1) Offline
+2) Online in :code:`ZONE_NORMAL`
+3) Online in :code:`ZONE_MOVABLE`
+
+:code:`ZONE_NORMAL` implies this capacity may be used for almost any allocation,
+while :code:`ZONE_MOVABLE` implies this capacity should only be used for
+migratable allocations.
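+
+The effective policy can be inspected (or, with sufficient privileges,
+changed) at runtime. A minimal sketch - the path is the sysfs file named
+above, and the four policy strings are the values the kernel accepts::
+
+      from pathlib import Path
+
+      policy_file = Path("/sys/devices/system/memory/auto_online_blocks")
+
+      # One of: "offline", "online", "online_kernel", "online_movable"
+      print("new blocks default to:", policy_file.read_text().strip())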
+
+:code:`ZONE_MOVABLE` attempts to retain the hotplug-ability of a memory block
+so that the entire region may be hot-unplugged at a later time. Any capacity
+onlined into :code:`ZONE_NORMAL` should be considered permanently attached to
+the page allocator.
+
+Hotplug Memory Block Size
+=========================
+By default, on most architectures, the Hotplug Memory Block Size is either
+128MB or 256MB. On x86, the block size increases up to 2GB as total memory
+capacity exceeds 64GB. As of v6.15, Linux does not take into account the
+size and alignment of the ACPI CEDT CFMWS regions (see Early Boot docs) when
+deciding the Hotplug Memory Block Size.
+
+Memory Map
+==========
+The location of the :code:`struct folio` allocations that represent the
+hotplugged memory capacity is dictated by the following system settings:
+
+- :code:`/sys/module/memory_hotplug/parameters/memmap_on_memory`
+- :code:`/sys/bus/dax/devices/daxN.Y/memmap_on_memory`
+
+If both of these parameters are set to true, :code:`struct folio` for this
+capacity will be carved out of the memory block being onlined. This has
+performance implications if the memory is particularly high-latency and
+its :code:`struct folio` becomes hotly contended.
+
+If either parameter is set to false, :code:`struct folio` for this capacity
+will be allocated from the local node of the processor running the hotplug
+procedure. This capacity will be allocated from :code:`ZONE_NORMAL` on
+that node, as it is a :code:`GFP_KERNEL` allocation.
+
+Systems with extremely large amounts of :code:`ZONE_MOVABLE` memory (e.g.
+CXL memory pools) must ensure that there is sufficient local
+:code:`ZONE_NORMAL` capacity to host the memory map for the hotplugged capacity.
+
+Driver Managed Memory
+=====================
+The DAX driver surfaces this memory to memory-hotplug as "Driver Managed". This
+is not a configurable setting, but it's important to note that driver-managed
+memory is explicitly excluded from use during kexec. This is required to ensure
+that any reset or out-of-band operations the CXL device may be subject to during
+a functional system-reboot (such as a reset-on-probe) will not cause portions of
+the kexec kernel to be overwritten.
diff --git a/Documentation/driver-api/cxl/linux/overview.rst b/Documentation/driver-api/cxl/linux/overview.rst
new file mode 100644
index 000000000000..648beb2c8c83
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/overview.rst
@@ -0,0 +1,103 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+Overview
+========
+
+This section presents the configuration process of a CXL Type-3 memory device
+and how it is ultimately exposed to users as either a :code:`DAX` device or
+normal memory pages via the kernel's page allocator.
+
+Portions marked with a bullet are points at which certain kernel objects
+are generated.
+
+1) Early Boot
+
+   a) BIOS, Build, and Boot Parameters
+
+      i) EFI_MEMORY_SP
+      ii) CONFIG_EFI_SOFT_RESERVE
+      iii) CONFIG_MHP_DEFAULT_ONLINE_TYPE
+      iv) nosoftreserve
+
+   b) Memory Map Creation
+
+      i) EFI Memory Map / E820 Consulted for Soft-Reserved
+
+         * CXL Memory is set aside to be handled by the CXL driver
+
+         * Soft-Reserved IO Resource created for CFMWS entry
+
+   c) NUMA Node Creation
+
+      * Nodes created from ACPI CEDT CFMWS and SRAT Proximity domains (PXM)
+
+   d) Memory Tier Creation
+
+      * A default memory_tier is created with all nodes.
+
+   e) Contiguous Memory Allocation
+
+      * Any requested CMA is allocated from Online nodes
+
+   f) Init Finishes, Drivers start probing
+
+2) ACPI and PCI Drivers
+
+   a) Detects that a PCI device is CXL, marking it for probe by the CXL driver
+
+3) CXL Driver Operation
+
+   a) Base device creation
+
+      * root, port, and memdev devices created
+      * CEDT CFMWS IO Resource creation
+
+   b) Decoder creation
+
+      * root, switch, and endpoint decoders created
+
+   c) Logical device creation
+
+      * memory_region and endpoint devices created
+
+   d) Devices are associated with each other
+
+      * If auto-decoder (BIOS-programmed decoders), driver validates
+        configurations, builds associations, and locks configs at probe time.
+
+      * If user-configured, validation and associations are built at
+        decoder-commit time.
+
+   e) Regions surfaced as DAX region
+
+      * dax_region created
+
+      * DAX device created via DAX driver
+
+4) DAX Driver Operation
+
+   a) DAX driver surfaces DAX region as one of two dax device modes
+
+      * kmem - dax device is converted to hotplug memory blocks
+
+        * DAX kmem IO Resource creation
+
+      * hmem - dax device is left as daxdev to be accessed as a file.
+
+        * If hmem, the journey ends here.
+
+   b) DAX kmem surfaces memory region to Memory Hotplug to add to the page
+      allocator as "driver managed memory"
+
+5) Memory Hotplug
+
+   a) mhp component surfaces a dax device memory region as multiple memory
+      blocks to the page allocator
+
+      * blocks appear in :code:`/sys/bus/memory/devices` and are linked to a
+        NUMA node
+
+   b) blocks are onlined into the requested zone (NORMAL or MOVABLE)
+
+      * Memory is marked "Driver Managed" to prevent kexec from using it as a
+        region for kernel updates
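+
+Once this flow completes, the end state can be observed from userspace. The
+sketch below is illustrative (no error handling, and it assumes at least one
+DAX device exists): each device under :code:`/sys/bus/dax/devices` ends up
+bound either to the :code:`kmem` driver (surfaced as memory blocks) or to the
+:code:`device_dax` driver (the hmem file-access case)::
+
+      import os
+      from pathlib import Path
+
+      for dev in sorted(Path("/sys/bus/dax/devices").iterdir()):
+          # The 'driver' symlink resolves to the bound driver's directory.
+          driver = os.path.basename(os.path.realpath(dev / "driver"))
+          print(f"{dev.name}: bound to {driver}")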