.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later

==========================
EDAC Memory Repair Control
==========================

Copyright (c) 2024-2025 HiSilicon Limited.

:Author:   Shiju Jose <shiju.jose@huawei.com>
:License:  The GNU Free Documentation License, Version 1.2 without
           Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
           (dual licensed under the GPL v2)
:Original Reviewers:

- Written for: 6.15

Introduction
------------

Some memory devices support repair operations to address issues in their
memory media. Post Package Repair (PPR) and memory sparing are examples of
such features.

Post Package Repair (PPR)
~~~~~~~~~~~~~~~~~~~~~~~~~

Post Package Repair is a maintenance operation which requests the memory
device to perform a repair operation on its media. It is a memory self-healing
feature that fixes a failing memory location by replacing it with a spare row
in a DRAM device.

For example, a CXL memory device with DRAM components that support PPR
features implements maintenance operations. DRAM components support the
following types of PPR functions:

 - hard PPR, for a permanent row repair, and
 - soft PPR, for a temporary row repair.

Soft PPR is much faster than hard PPR, but the repair is lost after a power
cycle.

During a repair operation, data may not be retained and memory requests may
not be processed correctly. In such a case, the repair operation should not be
executed at runtime.

For example, for CXL memory devices, see CXL spec rev 3.1 [1]_ sections
8.2.9.7.1.1 PPR Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance Operation
and 8.2.9.7.1.3 hPPR Maintenance Operation for more details.

Memory Sparing
~~~~~~~~~~~~~~

Memory sparing is a repair function that replaces a portion of memory with
a portion of functional memory at a particular granularity. Memory
sparing has cacheline/row/bank/rank sparing granularities. For example, in
rank memory-sparing mode, one memory rank serves as a spare for other ranks on
the same channel in case they fail.

The spare rank is held in reserve and not used as active memory until
a failure is indicated, with reserved capacity subtracted from the total
available memory in the system.

After an error threshold is surpassed in a system protected by memory sparing,
the content of a failing rank of DIMMs is copied to the spare rank. The
failing rank is then taken offline and the spare rank placed online for use as
active memory in place of the failed rank.
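
The failover sequence described above can be illustrated with a small toy
model. This is only a sketch for reasoning about the behaviour: the class,
its attribute names and the per-rank error counter are hypothetical, and real
rank sparing is performed by the memory controller or device, not by code
like this.

```python
class SparedChannel:
    """Toy model of one memory channel protected by rank sparing."""

    def __init__(self, contents, threshold):
        self.active = dict(contents)      # rank id -> contents of that rank
        self.spare_available = True       # spare rank is held in reserve
        self.offline = set()              # ranks that have been retired
        self.errors = {rank: 0 for rank in self.active}
        self.threshold = threshold

    def report_error(self, rank):
        """Count one error on `rank`; return True if sparing was performed."""
        if rank not in self.active:
            raise KeyError(f"rank {rank!r} is not active")
        self.errors[rank] += 1
        # Nothing happens until the threshold is surpassed, and nothing can
        # happen once the single spare rank has already been consumed.
        if self.errors[rank] <= self.threshold or not self.spare_available:
            return False
        # Copy the failing rank's content to the spare, take the failing
        # rank offline and place the spare online in its place.
        self.active["spare"] = self.active.pop(rank)
        self.errors.pop(rank)
        self.errors["spare"] = 0
        self.offline.add(rank)
        self.spare_available = False
        return True
```

With a threshold of 2, the first two errors on a rank leave the channel
unchanged and the third one triggers the switch to the spare rank.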

For example, CXL memory devices can support various subclasses of sparing
operation, which vary in the scope of the sparing being performed.

Cacheline sparing subclass refers to a sparing action that can replace a full
cacheline. Row sparing is provided as an alternative to PPR sparing functions
and its scope is that of a single DDR row. Bank sparing allows an entire bank
to be replaced. Rank sparing is defined as an operation in which an entire DDR
rank is replaced.

See CXL spec 3.1 [1]_ section 8.2.9.7.1.4 Memory Sparing Maintenance
Operations for more details.

.. [1] https://computeexpresslink.org/cxl-specification/

Use cases of generic memory repair features control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. The soft PPR, hard PPR and memory-sparing features share similar control
   attributes. Therefore, there is a need for a standardized, generic sysfs
   repair control that is exposed to userspace and used by administrators,
   scripts and tools.

2. When a CXL device detects an error in a memory component, it informs the
   host of the need for a repair maintenance operation by using an event
   record where the "maintenance needed" flag is set. The event record
   specifies the device physical address (DPA) and attributes of the memory
   that requires repair. The kernel reports the corresponding CXL general
   media or DRAM trace event to userspace, and userspace tools (e.g.
   rasdaemon) initiate a repair maintenance operation in response to the
   device request using the sysfs repair control.

3. Userspace tools, such as rasdaemon, request a repair operation on a memory
   region when the maintenance needed flag is set, when an uncorrected memory
   error or an excess of corrected memory errors above a threshold value is
   reported, or when the corrected errors threshold exceeded flag is set for
   that memory.

4. Multiple PPR/sparing instances may be present per memory device.

5. Drivers should enforce that live repair is safe. In systems where memory
   mapping functions can change between boots, one approach to this is to log
   memory errors seen on this boot against which to check live memory repair
   requests.
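
Use cases 3 and 5 can be sketched as two small checks. This is a minimal
illustration only, assuming a hypothetical ``ErrorReport`` record and a
hypothetical ``BootErrorLog``. Neither name comes from the kernel or from
rasdaemon, and real enforcement of live repair safety belongs in the driver.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorReport:
    """Hypothetical summary of what was reported for one memory location."""
    dpa: int                               # device physical address
    maintenance_needed: bool = False       # flag from the event record
    uncorrected: bool = False              # an uncorrected error was reported
    corrected_count: int = 0               # corrected errors seen so far
    threshold_exceeded_flag: bool = False  # device-reported threshold flag


def repair_warranted(report, corrected_threshold):
    """Use case 3: any one of the listed conditions warrants a repair."""
    return (
        report.maintenance_needed
        or report.uncorrected
        or report.corrected_count > corrected_threshold
        or report.threshold_exceeded_flag
    )


@dataclass
class BootErrorLog:
    """Use case 5: addresses with memory errors seen during this boot."""
    seen: set = field(default_factory=set)

    def record(self, dpa):
        self.seen.add(dpa)

    def live_repair_allowed(self, dpa):
        # Accept a live repair only for an address that was reported since
        # boot, because address mappings may differ on another boot.
        return dpa in self.seen
```

A request for an address that was never logged on the current boot is
rejected in this model, regardless of how it was obtained.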

The File System
---------------

The control attributes of a registered memory repair instance can be
accessed under /sys/bus/edac/devices/<dev-name>/mem_repairX/
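
As a rough illustration, a userspace tool could discover the registered
instances by walking that directory layout. The sketch below assumes only the
path pattern given above. The function name is hypothetical, and no attribute
file names are assumed.

```python
from pathlib import Path


def list_mem_repair_instances(sysfs_root="/sys/bus/edac/devices"):
    """Return {device name: [mem_repairX directory names]} under sysfs_root."""
    root = Path(sysfs_root)
    instances = {}
    if not root.is_dir():
        return instances  # no EDAC device directory on this system
    for dev in sorted(root.iterdir()):
        if not dev.is_dir():
            continue
        # Each registered repair instance appears as a mem_repairX directory.
        repairs = sorted(p.name for p in dev.glob("mem_repair*") if p.is_dir())
        if repairs:
            instances[dev.name] = repairs
    return instances
```

Calling ``list_mem_repair_instances()`` on a system without any registered
memory repair instance simply returns an empty dictionary.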

sysfs
-----

Sysfs files are documented in
`Documentation/ABI/testing/sysfs-edac-memory-repair`.