summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-03-08selftests/clone3: test clone3 with CLONE_NEWTIMETobias Klauser
Verify that clone3 can be called successfully with CLONE_NEWTIME in flags. Cc: Andrey Vagin <avagin@openvz.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-08fork: allow CLONE_NEWTIME in clone3 flagsTobias Klauser
Currently, calling clone3() with CLONE_NEWTIME in clone_args->flags fails with -EINVAL. This is because CLONE_NEWTIME intersects with CSIGNAL. However, CSIGNAL was deprecated when clone3 was introduced in commit 7f192e3cd316 ("fork: add clone3"), allowing re-use of that part of clone flags. Fix this by explicitly allowing CLONE_NEWTIME in clone3_args_valid. This is also in line with the respective check in check_unshare_flags which allow CLONE_NEWTIME for unshare(). Fixes: 769071ac9f20 ("ns: Introduce Time Namespace") Cc: Andrey Vagin <avagin@openvz.org> Cc: Christian Brauner <brauner@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-08watch_queue: fix IOC_WATCH_QUEUE_SET_SIZE alloc error pathsDavid Disseldorp
The watch_queue_set_size() allocation error paths return the ret value set via the prior pipe_resize_ring() call, which will always be zero. As a result, IOC_WATCH_QUEUE_SET_SIZE callers such as "keyctl watch" fail to detect kernel wqueue->notes allocation failures and proceed to KEYCTL_WATCH_KEY, with any notifications subsequently lost. Fixes: c73be61cede58 ("pipe: Add general notification queue support") Signed-off-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-03-07ext4: Fix deadlock during directory renameJan Kara
As lockdep properly warns, we should not be locking i_rwsem while having transactions started as the proper lock ordering used by all directory handling operations is i_rwsem -> transaction start. Fix the lock ordering by moving the locking of the directory earlier in ext4_rename(). Reported-by: syzbot+9d16c39efb5fade84574@syzkaller.appspotmail.com Fixes: 0813299c586b ("ext4: Fix possible corruption when moving a directory") Link: https://syzkaller.appspot.com/bug?extid=9d16c39efb5fade84574 Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230301141004.15087-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-07ext4: Fix comment about the 64BIT featureTudor Ambarus
64BIT is part of the incompatible feature set, update the comment accordingly. Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20230301133842.671821-1-tudor.ambarus@linaro.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-07docs: ext4: modify the group desc size to 64Wu Bo
Since the default ext4 group desc size is 64 now (assuming that the 64-bit feature is enbled). And the size mentioned in this doc is 64 too. Change it to 64. Signed-off-by: Wu Bo <bo.wu@vivo.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20230222013525.14748-1-bo.wu@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-07ext4: fix another off-by-one fsmap error on 1k block filesystemsDarrick J. Wong
Apparently syzbot figured out that issuing this FSMAP call: struct fsmap_head cmd = { .fmh_count = ...; .fmh_keys = { { .fmr_device = /* ext4 dev */, .fmr_physical = 0, }, { .fmr_device = /* ext4 dev */, .fmr_physical = 0, }, }, ... }; ret = ioctl(fd, FS_IOC_GETFSMAP, &cmd); Produces this crash if the underlying filesystem is a 1k-block ext4 filesystem: kernel BUG at fs/ext4/ext4.h:3331! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 3 PID: 3227965 Comm: xfs_io Tainted: G W O 6.2.0-rc8-achx Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:ext4_mb_load_buddy_gfp+0x47c/0x570 [ext4] RSP: 0018:ffffc90007c03998 EFLAGS: 00010246 RAX: ffff888004978000 RBX: ffffc90007c03a20 RCX: ffff888041618000 RDX: 0000000000000000 RSI: 00000000000005a4 RDI: ffffffffa0c99b11 RBP: ffff888012330000 R08: ffffffffa0c2b7d0 R09: 0000000000000400 R10: ffffc90007c03950 R11: 0000000000000000 R12: 0000000000000001 R13: 00000000ffffffff R14: 0000000000000c40 R15: ffff88802678c398 FS: 00007fdf2020c880(0000) GS:ffff88807e100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd318a5fe8 CR3: 000000007f80f001 CR4: 00000000001706e0 Call Trace: <TASK> ext4_mballoc_query_range+0x4b/0x210 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80] ext4_getfsmap_datadev+0x713/0x890 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80] ext4_getfsmap+0x2b7/0x330 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80] ext4_ioc_getfsmap+0x153/0x2b0 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80] __ext4_ioctl+0x2a7/0x17e0 [ext4 dfa189daddffe8fecd3cdfd00564e0f265a8ab80] __x64_sys_ioctl+0x82/0xa0 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fdf20558aff RSP: 002b:00007ffd318a9e30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00000000000200c0 RCX: 00007fdf20558aff RDX: 00007fdf1feb2010 RSI: 00000000c0c0583b RDI: 0000000000000003 RBP: 00005625c0634be0 R08: 00005625c0634c40 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fdf1feb2010 R13: 00005625be70d994 R14: 0000000000000800 R15: 0000000000000000 For GETFSMAP calls, the caller selects a physical block device by writing its block number into fsmap_head.fmh_keys[01].fmr_device. To query mappings for a subrange of the device, the starting byte of the range is written to fsmap_head.fmh_keys[0].fmr_physical and the last byte of the range goes in fsmap_head.fmh_keys[1].fmr_physical. IOWs, to query what mappings overlap with bytes 3-14 of /dev/sda, you'd set the inputs as follows: fmh_keys[0] = { .fmr_device = major(8, 0), .fmr_physical = 3}, fmh_keys[1] = { .fmr_device = major(8, 0), .fmr_physical = 14}, Which would return you whatever is mapped in the 12 bytes starting at physical offset 3. The crash is due to insufficient range validation of keys[1] in ext4_getfsmap_datadev. On 1k-block filesystems, block 0 is not part of the filesystem, which means that s_first_data_block is nonzero. ext4_get_group_no_and_offset subtracts this quantity from the blocknr argument before cracking it into a group number and a block number within a group. IOWs, block group 0 spans blocks 1-8192 (1-based) instead of 0-8191 (0-based) like what happens with larger blocksizes. The net result of this encoding is that blocknr < s_first_data_block is not a valid input to this function. The end_fsb variable is set from the keys that are copied from userspace, which means that in the above example, its value is zero. That leads to an underflow here: blocknr = blocknr - le32_to_cpu(es->s_first_data_block); The division then operates on -1: offset = do_div(blocknr, EXT4_BLOCKS_PER_GROUP(sb)) >> EXT4_SB(sb)->s_cluster_bits; Leaving an impossibly large group number (2^32-1) in blocknr. ext4_getfsmap_check_keys checked that keys[0].fmr_physical and keys[1].fmr_physical are in increasing order, but ext4_getfsmap_datadev adjusts keys[0].fmr_physical to be at least s_first_data_block. This implies that we have to check it again after the adjustment, which is the piece that I forgot. Reported-by: syzbot+6be2b977c89f79b6b153@syzkaller.appspotmail.com Fixes: 4a4956249dac ("ext4: fix off-by-one fsmap error on 1k block filesystems") Link: https://syzkaller.appspot.com/bug?id=79d5768e9bfe362911ac1a5057a36fc6b5c30002 Cc: stable@vger.kernel.org Signed-off-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/Y+58NPTH7VNGgzdd@magnolia Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-07ext4: fix RENAME_WHITEOUT handling for inline directoriesEric Whitney
A significant number of xfstests can cause ext4 to log one or more warning messages when they are run on a test file system where the inline_data feature has been enabled. An example: "EXT4-fs warning (device vdc): ext4_dirblock_csum_set:425: inode #16385: comm fsstress: No space for directory leaf checksum. Please run e2fsck -D." The xfstests include: ext4/057, 058, and 307; generic/013, 051, 068, 070, 076, 078, 083, 232, 269, 270, 390, 461, 475, 476, 482, 579, 585, 589, 626, 631, and 650. In this situation, the warning message indicates a bug in the code that performs the RENAME_WHITEOUT operation on a directory entry that has been stored inline. It doesn't detect that the directory is stored inline, and incorrectly attempts to compute a dirent block checksum on the whiteout inode when creating it. This attempt fails as a result of the integrity checking in get_dirent_tail (usually due to a failure to match the EXT4_FT_DIR_CSUM magic cookie), and the warning message is then emitted. Fix this by simply collecting the inlined data state at the time the search for the source directory entry is performed. Existing code handles the rest, and this is sufficient to eliminate all spurious warning messages produced by the tests above. Go one step further and do the same in the code that resets the source directory entry in the event of failure. The inlined state should be present in the "old" struct, but given the possibility of a race there's no harm in taking a conservative approach and getting that information again since the directory entry is being reread anyway. Fixes: b7ff91fd030d ("ext4: find old entry again if failed to rename whiteout") Cc: stable@kernel.org Signed-off-by: Eric Whitney <enwlinux@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230210173244.679890-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-07ext4: make kobj_type structures constantThomas Weißschuh
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230209-kobj_type-ext4-v1-1-6865fb05c1f8@weissschuh.net Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-07ext4: fix cgroup writeback accounting with fs-layer encryptionEric Biggers
When writing a page from an encrypted file that is using filesystem-layer encryption (not inline encryption), ext4 encrypts the pagecache page into a bounce page, then writes the bounce page. It also passes the bounce page to wbc_account_cgroup_owner(). That's incorrect, because the bounce page is a newly allocated temporary page that doesn't have the memory cgroup of the original pagecache page. This makes wbc_account_cgroup_owner() not account the I/O to the owner of the pagecache page as it should. Fix this by always passing the pagecache page to wbc_account_cgroup_owner(). Fixes: 001e4a8775f6 ("ext4: implement cgroup writeback support") Cc: stable@vger.kernel.org Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203005503.141557-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-08btrfs: fix block group item corruption after inserting new block groupFilipe Manana
We can often end up inserting a block group item, for a new block group, with a wrong value for the used bytes field. This happens if for the new allocated block group, in the same transaction that created the block group, we have tasks allocating extents from it as well as tasks removing extents from it. For example: 1) Task A creates a metadata block group X; 2) Two extents are allocated from block group X, so its "used" field is updated to 32K, and its "commit_used" field remains as 0; 3) Transaction commit starts, by some task B, and it enters btrfs_start_dirty_block_groups(). There it tries to update the block group item for block group X, which currently has its "used" field with a value of 32K. But that fails since the block group item was not yet inserted, and so on failure update_block_group_item() sets the "commit_used" field of the block group back to 0; 4) The block group item is inserted by task A, when for example btrfs_create_pending_block_groups() is called when releasing its transaction handle. This results in insert_block_group_item() inserting the block group item in the extent tree (or block group tree), with a "used" field having a value of 32K, but without updating the "commit_used" field in the block group, which remains with value of 0; 5) The two extents are freed from block X, so its "used" field changes from 32K to 0; 6) The transaction commit by task B continues, it enters btrfs_write_dirty_block_groups() which calls update_block_group_item() for block group X, and there it decides to skip the block group item update, because "used" has a value of 0 and "commit_used" has a value of 0 too. As a result, we end up with a block item having a 32K "used" field but no extents allocated from it. When this issue happens, a btrfs check reports an error like this: [1/7] checking root items [2/7] checking extents block group [1104150528 1073741824] used 39796736 but extent items used 0 ERROR: errors found in extent allocation tree or chunk allocation (...) Fix this by making insert_block_group_item() update the block group's "commit_used" field. Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same") CC: stable@vger.kernel.org # 6.2+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-07docs: sysfs-block: document hidden sysfs entrySagi Grimberg
/sys/block/<disk>/hidden is undocumented. Document it. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230303084323.228098-1-sagi@grimberg.me Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-07ynl: re-license uniformly under GPL-2.0 OR BSD-3-ClauseJakub Kicinski
I was intending to make all the Netlink Spec code BSD-3-Clause to ease the adoption but it appears that: - I fumbled the uAPI and used "GPL WITH uAPI note" there - it gives people pause as they expect GPL in the kernel As suggested by Chuck re-license under dual. This gives us benefit of full BSD freedom while fulfilling the broad "kernel is under GPL" expectations. Link: https://lore.kernel.org/all/20230304120108.05dd44c5@kernel.org/ Link: https://lore.kernel.org/r/20230306200457.3903854-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-07mailmap: update entries for Stephen HemmingerStephen Hemminger
Map all my old email addresses to current address. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/20230306194405.108236-1-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-07mailmap: add entry for Maxim MikityanskiyJakub Kicinski
Map Maxim's old corporate addresses to his personal one. Link: https://lore.kernel.org/r/20230306192018.3894988-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-07nfc: change order inside nfc_se_io error pathFedor Pchelkin
cb_context should be freed on the error path in nfc_se_io as stated by commit 25ff6f8a5a3b ("nfc: fix memory leak of se_io context in nfc_genl_se_io"). Make the error path in nfc_se_io unwind everything in reverse order, i.e. free the cb_context after unlocking the device. Suggested-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20230306212650.230322-1-pchelkin@ispras.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-07ethernet: ice: avoid gcc-9 integer overflow warningArnd Bergmann
With older compilers like gcc-9, the calculation of the vlan priority field causes a false-positive warning from the byteswap: In file included from drivers/net/ethernet/intel/ice/ice_tc_lib.c:4: drivers/net/ethernet/intel/ice/ice_tc_lib.c: In function 'ice_parse_cls_flower': include/uapi/linux/swab.h:15:15: error: integer overflow in expression '(int)(short unsigned int)((int)match.key-><U67c8>.<U6698>.vlan_priority << 13) & 57344 & 255' of type 'int' results in '0' [-Werror=overflow] 15 | (((__u16)(x) & (__u16)0x00ffU) << 8) | \ | ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~ include/uapi/linux/swab.h:106:2: note: in expansion of macro '___constant_swab16' 106 | ___constant_swab16(x) : \ | ^~~~~~~~~~~~~~~~~~ include/uapi/linux/byteorder/little_endian.h:42:43: note: in expansion of macro '__swab16' 42 | #define __cpu_to_be16(x) ((__force __be16)__swab16((x))) | ^~~~~~~~ include/linux/byteorder/generic.h:96:21: note: in expansion of macro '__cpu_to_be16' 96 | #define cpu_to_be16 __cpu_to_be16 | ^~~~~~~~~~~~~ drivers/net/ethernet/intel/ice/ice_tc_lib.c:1458:5: note: in expansion of macro 'cpu_to_be16' 1458 | cpu_to_be16((match.key->vlan_priority << | ^~~~~~~~~~~ After a change to be16_encode_bits(), the code becomes more readable to both people and compilers, which avoids the warning. Fixes: 34800178b302 ("ice: Add support for VLAN priority filters in switchdev") Suggested-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-03-07ice: don't ignore return codes in VSI related codeMichal Swiatkowski
There were few smatch warnings reported by Dan: - ice_vsi_cfg_xdp_txqs can return 0 instead of ret, which is cleaner - return values in ice_vsi_cfg_def were ignored - in ice_vsi_rebuild return value was ignored in case rebuild failed, it was a never reached code, however, rewrite it for clarity. - ice_vsi_cfg_tc can return 0 instead of ret Fixes: 6624e780a577 ("ice: split ice_vsi_setup into smaller functions") Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-03-07ice: Fix DSCP PFC TLV creationDave Ertman
When creating the TLV to send to the FW for configuring DSCP mode PFC,the PFCENABLE field was being masked with a 4 bit mask (0xF), but this is an 8 bit bitmask for enabled classes for PFC. This means that traffic classes 4-7 could not be enabled for PFC. Remove the mask completely, as it is not necessary, as we are assigning 8 bits to an 8 bit field. Fixes: 2a87bd73e50d ("ice: Add DSCP support") Signed-off-by: Dave Ertman <david.m.ertman@intel.com> Signed-off-by: Karen Ostrowska <karen.ostrowska@intel.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-03-07drm/amd/display: Update clock table to include highest clock settingSwapnil Patel
[Why] Currently, the clk manager matches SocVoltage with voltage from fused settings (dfPstate clock table). And then corresponding clocks are selected. However in certain situations, this leads to clk manager not including at least one entry with highest supported clock setting. [How] Update the clk manager to include at least one entry with highest supported clock setting. Reviewed-by: Pavle Kotarac <pavle.kotarac@amd.com> Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com> Signed-off-by: Swapnil Patel <Swapnil.Patel@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-07drm/amd/pm: Enable ecc_info table support for smu v13_0_10Candice Li
Support EccInfoTable which includes umc ras error count and error address. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-07drm/amdgpu: Support umc node harvest config on umc v8_10Candice Li
Don't need to query error count and error address on harvest umc nodes. v2: Fix code bug, use active_mask instead of harvsest_config and remove unnecessary argument in LOOP macro. v3: Leave adev->gmc.num_umc unchanged. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-07drm/connector: print max_requested_bpc in state debugfsHarry Wentland
This is useful to understand the bpc defaults and support of a driver. Signed-off-by: Harry Wentland <harry.wentland@amd.com> Cc: Pekka Paalanen <ppaalanen@gmail.com> Cc: Sebastian Wick <sebastian.wick@redhat.com> Cc: Vitaly.Prosyak@amd.com Cc: Uma Shankar <uma.shankar@intel.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Joshua Ashton <joshua@froggi.es> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: dri-devel@lists.freedesktop.org Cc: amd-gfx@lists.freedesktop.org Reviewed-By: Joshua Ashton <joshua@froggi.es> Link: https://patchwork.freedesktop.org/patch/msgid/20230113162428.33874-3-harry.wentland@amd.com Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-03-07drm/display: Don't block HDR_OUTPUT_METADATA on unknown EOTFHarry Wentland
The EDID of an HDR display defines EOTFs that are supported by the display and can be set in the HDR metadata infoframe. Userspace is expected to read the EDID and set an appropriate HDR_OUTPUT_METADATA. In drm_parse_hdr_metadata_block the kernel reads the supported EOTFs from the EDID and stores them in the drm_connector->hdr_sink_metadata. While doing so it also filters the EOTFs to the EOTFs the kernel knows about. When an HDR_OUTPUT_METADATA is set it then checks to make sure the EOTF is a supported EOTF. In cases where the kernel doesn't know about a new EOTF this check will fail, even if the EDID advertises support. Since it is expected that userspace reads the EDID to understand what the display supports it doesn't make sense for DRM to block an HDR_OUTPUT_METADATA if it contains an EOTF the kernel doesn't understand. This comes with the added benefit of future-proofing metadata support. If the spec defines a new EOTF there is no need to update DRM and an compositor can immediately make use of it. Bug: https://gitlab.freedesktop.org/wayland/weston/-/issues/609 v2: Distinguish EOTFs defind in kernel and ones defined in EDID in the commit description (Pekka) v3: Rebase; drm_hdmi_infoframe_set_hdr_metadata moved to drm_hdmi_helper.c Signed-off-by: Harry Wentland <harry.wentland@amd.com> Cc: Pekka Paalanen <ppaalanen@gmail.com> Cc: Sebastian Wick <sebastian.wick@redhat.com> Cc: Vitaly.Prosyak@amd.com Cc: Uma Shankar <uma.shankar@intel.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Joshua Ashton <joshua@froggi.es> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: dri-devel@lists.freedesktop.org Cc: amd-gfx@lists.freedesktop.org Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> Reviewed-By: Joshua Ashton <joshua@froggi.es> Link: https://patchwork.freedesktop.org/patch/msgid/20230113162428.33874-2-harry.wentland@amd.com Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
2023-03-07RISC-V: fix taking the text_mutex twice during sifive errata patchingConor Dooley
Chris pointed out that some bonehead, *cough* me *cough*, added two mutex_locks() to the SiFive errata patching. The second was meant to have been a mutex_unlock(). This results in errors such as Unable to handle kernel NULL pointer dereference at virtual address 0000000000000030 Oops [#1] Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 6.2.0-rc1-starlight-00079-g9493e6f3ce02 #229 Hardware name: BeagleV Starlight Beta (DT) epc : __schedule+0x42/0x500 ra : schedule+0x46/0xce epc : ffffffff8065957c ra : ffffffff80659a80 sp : ffffffff81203c80 gp : ffffffff812d50a0 tp : ffffffff8120db40 t0 : ffffffff81203d68 t1 : 0000000000000001 t2 : 4c45203a76637369 s0 : ffffffff81203cf0 s1 : ffffffff8120db40 a0 : 0000000000000000 a1 : ffffffff81213958 a2 : ffffffff81213958 a3 : 0000000000000000 a4 : 0000000000000000 a5 : ffffffff80a1bd00 a6 : 0000000000000000 a7 : 0000000052464e43 s2 : ffffffff8120db41 s3 : ffffffff80a1ad00 s4 : 0000000000000000 s5 : 0000000000000002 s6 : ffffffff81213938 s7 : 0000000000000000 s8 : 0000000000000000 s9 : 0000000000000001 s10: ffffffff812d7204 s11: ffffffff80d3c920 t3 : 0000000000000001 t4 : ffffffff812e6dd7 t5 : ffffffff812e6dd8 t6 : ffffffff81203bb8 status: 0000000200000100 badaddr: 0000000000000030 cause: 000000000000000d [<ffffffff80659a80>] schedule+0x46/0xce [<ffffffff80659dce>] schedule_preempt_disabled+0x16/0x28 [<ffffffff8065ae0c>] __mutex_lock.constprop.0+0x3fe/0x652 [<ffffffff8065b138>] __mutex_lock_slowpath+0xe/0x16 [<ffffffff8065b182>] mutex_lock+0x42/0x4c [<ffffffff8000ad94>] sifive_errata_patch_func+0xf6/0x18c [<ffffffff80002b92>] _apply_alternatives+0x74/0x76 [<ffffffff80802ee8>] apply_boot_alternatives+0x3c/0xfa [<ffffffff80803cb0>] setup_arch+0x60c/0x640 [<ffffffff80800926>] start_kernel+0x8e/0x99c ---[ end trace 0000000000000000 ]--- Reported-by: Chris Hofstaedtler <zeha@debian.org> Fixes: 9493e6f3ce02 ("RISC-V: take text_mutex during alternative patching") Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20230302174154.970746-1-conor@kernel.org [Palmer: pick up Geert's bug report from the thread] Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-07cpumask: be more careful with 'cpumask_setall()'Linus Torvalds
Commit 596ff4a09b89 ("cpumask: re-introduce constant-sized cpumask optimizations") changed cpumask_setall() to use "bitmap_set()" instead of "bitmap_fill()", because bitmap_fill() would explicitly set all the bits of a constant sized small bitmap, and that's exactly what we don't want: we want to only set bits up to 'nr_cpu_ids', which is what "bitmap_set()" does. However, Yury correctly points out that while "bitmap_set()" does indeed only set bits up to the required bitmap size, it doesn't _clear_ bits above that size, so the upper bits would still not have well-defined values. Now, none of this should really matter, since any bits set past 'nr_cpu_ids' should always be ignored in the first place. Yes, the bit scanning functions might return them as a result, but since users should always consider the ">= nr_cpu_ids" condition to mean "no more bits", that shouldn't have any actual effect (see previous commit 8ca09d5fa354 "cpumask: fix incorrect cpumask scanning result checks"). But let's just do it right, the way the code was _intended_ to work. We have had enough lazy code that works but bites us in the *rse later (again, see previous commit) that there's no reason to not just do this properly. It turns out that "bitmap_fill()" gets this all right for the complex case, and really only fails for the inlined optimized case that just fills the whole word. And while we could just fix bitmap_fill() to use the proper last word mask, there's two issues with that: - the cpumask case wants to do the _optimization_ based on "NR_CPUS is a small constant", but then wants to do the actual bit _fill_ based on "nr_cpu_ids" that isn't necessarily that same constant - we have lots of non-cpumask users of bitmap_fill(), and while they hopefully don't care, and probably would want the proper semantics anyway ("only set bits up to the limit"), I do not want the cpumask changes to impact other parts So this ends up just doing the single-word optimization by hand in the cpumask code. If our cpumask is fundamentally limited to a single word, just do the proper "fill in that word" exactly. And if it's the more complex multi-word case, then the generic bitmap_fill() will DTRT. This is all an example of how our bitmap function optimizations really are somewhat broken. They conflate the "this is size of the bitmap" optimizations with the actual bit(s) we want to set. In many cases we really want to have the two be separate things: sometimes we base our optimizations on the size of the whole bitmap ("I know this whole bitmap fits in a single word, so I'll just use single-word accesses"), and sometimes we base them on the bit we are looking at ("this is just acting on bits that are in the first word, so I'll use single-word accesses"). Notice how the end result of the two optimizations are the same, but the way we get to them are quite different. And all our cpumask optimization games are really about that fundamental distinction, and we'd often really want to pass in both the "this is the bit I'm working on" (which _can_ be a small constant but might be variable), and "I know it's in this range even if it's variable" (based on CONFIG_NR_CPUS). So this cpumask_setall() implementation just makes that explicit. It checks the "I statically know the size is small" using the known static size of the cpumask (which is what that 'small_cpumask_bits' is all about), but then sets the actual bits using the exact number of cpus we have (ie 'nr_cpumask_bits') Of course, in a perfect world, the compiler would have done all the range analysis (possibly with help from us just telling it that "this value is always in this range"), and would do all of this for us. But that is not the world we live in. While we dream of that perfect world, this does that manual logic to make it all work out. And this was a very long explanation for a small code change that shouldn't even matter. Reported-by: Yury Norov <yury.norov@gmail.com> Link: https://lore.kernel.org/lkml/ZAV9nGG9e1%2FrV+L%2F@yury-laptop/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-03-07NFSD: Protect against filesystem freezingChuck Lever
Flole observes this WARNING on occasion: [1210423.486503] WARNING: CPU: 8 PID: 1524732 at fs/ext4/ext4_jbd2.c:75 ext4_journal_check_start+0x68/0xb0 Reported-by: <flole@flole.de> Suggested-by: Jan Kara <jack@suse.cz> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217123 Fixes: 73da852e3831 ("nfsd: use vfs_iter_read/write") Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-03-07net: usb: qmi_wwan: add Telit 0x1080 compositionEnrico Sau
Add the following Telit FE990 composition: 0x1080: tty, adb, rmnet, tty, tty, tty, tty Signed-off-by: Enrico Sau <enrico.sau@gmail.com> Link: https://lore.kernel.org/r/20230306120528.198842-1-enrico.sau@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-07net: usb: cdc_mbim: avoid altsetting toggling for Telit FE990Enrico Sau
Add quirk CDC_MBIM_FLAG_AVOID_ALTSETTING_TOGGLE for Telit FE990 0x1081 composition in order to avoid bind error. Signed-off-by: Enrico Sau <enrico.sau@gmail.com> Link: https://lore.kernel.org/r/20230306115933.198259-1-enrico.sau@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-07block: fix wrong mode for blkdev_put() from disk_scan_partitions()Yu Kuai
If disk_scan_partitions() is called with 'FMODE_EXCL', blkdev_get_by_dev() will be called without 'FMODE_EXCL', however, follow blkdev_put() is still called with 'FMODE_EXCL', which will cause 'bd_holders' counter to leak. Fix the problem by using the right mode for blkdev_put(). Reported-by: syzbot+2bcc0d79e548c4f62a59@syzkaller.appspotmail.com Link: https://lore.kernel.org/lkml/f9649d501bc8c3444769418f6c26263555d9d3be.camel@linux.ibm.com/T/ Tested-by: Julian Ruess <julianr@linux.ibm.com> Fixes: e5cfefa97bcc ("block: fix scan partition for exclusively open device again") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-07Merge branch 'main' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Restore ctnetlink zero mark in events and dump, from Ivan Delalande. 2) Fix deadlock due to missing disabled bh in tproxy, from Florian Westphal. 3) Safer maximum chain load in conntrack, from Eric Dumazet. * 'main' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: conntrack: adopt safer max chain length netfilter: tproxy: fix deadlock due to missing BH disable netfilter: ctnetlink: revert to dumping mark regardless of event type ==================== Link: https://lore.kernel.org/r/20230307100424.2037-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-07platform: mellanox: mlx-platform: Initialize shift variable to 0Hans de Goede
Initialize shift variable in mlxplat_mlxcpld_verify_bus_topology() to 0 to avoid the following compile error: drivers/platform/x86/mlx-platform.c:6013 mlxplat_mlxcpld_verify_bus_topology() error: uninitialized symbol 'shift'. Fixes: 50b823fdd357 ("platform: mellanox: mlx-platform: Move bus shift assignment out of the loop") Cc: Vadim Pasternak <vadimp@nvidia.com> Cc: Michael Shych <michaelsh@nvidia.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230307105842.286118-1-hdegoede@redhat.com
2023-03-07platform/x86: int3472: Add GPIOs to Surface Go 3 Board dataDaniel Scally
Add the INT347E GPIO lookup table to the board data for the Surface Go 3. This is necessary to allow the ov7251 IR camera to probe properly on that platform. Signed-off-by: Daniel Scally <dan.scally@ideasonboard.com> Link: https://lore.kernel.org/r/20230302102611.314341-1-dan.scally@ideasonboard.com Reviewed-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2023-03-07platform/x86: ISST: Fix kernel documentation warningsSrinivas Pandruvada
Fix warning displayed for "make W=1" for kernel documentation. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Link: https://lore.kernel.org/r/20230211063257.311746-2-srinivas.pandruvada@linux.intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com>
2023-03-07platform: x86: MLX_PLATFORM: select REGMAP instead of depending on itRandy Dunlap
REGMAP is a hidden (not user visible) symbol. Users cannot set it directly thru "make *config", so drivers should select it instead of depending on it if they need it. Consistently using "select" or "depends on" can also help reduce Kconfig circular dependency issues. Therefore, change the use of "depends on REGMAP" to "select REGMAP". Fixes: ef0f62264b2a ("platform/x86: mlx-platform: Add physical bus number auto detection") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Vadim Pasternak <vadimp@mellanox.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Mark Gross <markgross@kernel.org> Cc: platform-driver-x86@vger.kernel.org Link: https://lore.kernel.org/r/20230226053953.4681-7-rdunlap@infradead.org Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com>
2023-03-07platform: mellanox: select REGMAP instead of depending on itRandy Dunlap
REGMAP is a hidden (not user visible) symbol. Users cannot set it directly thru "make *config", so drivers should select it instead of depending on it if they need it. Consistently using "select" or "depends on" can also help reduce Kconfig circular dependency issues. Therefore, change the use of "depends on REGMAP" to "select REGMAP". For NVSW_SN2201, select REGMAP_I2C instead of depending on it. Fixes: c6acad68eb2d ("platform/mellanox: mlxreg-hotplug: Modify to use a regmap interface") Fixes: 5ec4a8ace06c ("platform/mellanox: Introduce support for Mellanox register access driver") Fixes: 62f9529b8d5c ("platform/mellanox: mlxreg-lc: Add initial support for Nvidia line card devices") Fixes: 662f24826f95 ("platform/mellanox: Add support for new SN2201 system") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Michael Shych <michaelsh@nvidia.com> Cc: Mark Gross <markgross@kernel.org> Cc: Vadim Pasternak <vadimp@nvidia.com> Cc: platform-driver-x86@vger.kernel.org Link: https://lore.kernel.org/r/20230226053953.4681-6-rdunlap@infradead.org Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com>
2023-03-07platform/x86/intel/tpmi: Fix double free reported by SmatchSrinivas Pandruvada
Fix warning: drivers/platform/x86/intel/tpmi.c:253 tpmi_create_device() warn: 'feature_vsec_dev' was already freed. If there is some error, feature_vsec_dev memory is freed as part of resource managed call intel_vsec_add_aux(). So, additional kfree() call is not required. Reordered res allocation and feature_vsec_dev, so that on error only res is freed. Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/platform-driver-x86/Y%2FxYR7WGiPayZu%2FR@kili/T/#u Fixes: 47731fd2865f ("platform/x86/intel: Intel TPMI enumeration driver") Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Link: https://lore.kernel.org/r/20230227140614.2913474-1-srinivas.pandruvada@linux.intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com>
2023-03-07platform/x86: ISST: Increase range of valid mail box commandsSrinivas Pandruvada
A new command CONFIG_TDP_GET_RATIO_INFO is added, with sub command type of 0x0C. The previous range of valid sub commands was from 0x00 to 0x0B. Change the valid range from 0x00 to 0x0C. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Link: https://lore.kernel.org/r/20230227053504.2734214-1-srinivas.pandruvada@linux.intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com>
2023-03-07platform/x86: dell-ddv: Fix temperature scalingArmin Wolf
After using the built-in UEFI hardware diagnostics to compare the measured battery temperature, i noticed that the temperature is actually expressed in tenth degree kelvin, similar to the SBS-Data standard. For example, a value of 2992 is displayed as 26 degrees celsius. Fix the scaling so that the correct values are being displayed. Tested on a Dell Inspiron 3505. Fixes: a77272c16041 ("platform/x86: dell: Add new dell-wmi-ddv driver") Signed-off-by: Armin Wolf <W_Armin@gmx.de> Link: https://lore.kernel.org/r/20230218115318.20662-2-W_Armin@gmx.de Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com>
2023-03-07platform/x86: dell-ddv: Fix cache invalidation on resumeArmin Wolf
If one or both sensor buffers could not be initialized, either due to missing hardware support or due to some error during probing, the resume handler will encounter undefined behaviour when attempting to lock buffers then protected by an uninitialized or destroyed mutex. Fix this by introducing a "active" flag which is set during probe, and only invalidate buffers which where flaged as "active". Tested on a Dell Inspiron 3505. Fixes: 3b7eeff93d29 ("platform/x86: dell-ddv: Add hwmon support") Signed-off-by: Armin Wolf <W_Armin@gmx.de> Link: https://lore.kernel.org/r/20230218115318.20662-1-W_Armin@gmx.de Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com>
2023-03-07platform/x86/amd: pmc: remove CONFIG_SUSPEND checksArnd Bergmann
The amd_pmc_write_stb() function was previously hidden in an ifdef to avoid a warning when CONFIG_SUSPEND is disabled, but now there is an additional caller: drivers/platform/x86/amd/pmc.c: In function 'amd_pmc_stb_debugfs_open_v2': drivers/platform/x86/amd/pmc.c:256:8: error: implicit declaration of function 'amd_pmc_write_stb'; did you mean 'amd_pmc_read_stb'? [-Werror=implicit-function-declaration] 256 | ret = amd_pmc_write_stb(dev, AMD_PMC_STB_DUMMY_PC); | ^~~~~~~~~~~~~~~~~ | amd_pmc_read_stb There is now an easier way to handle this using DEFINE_SIMPLE_DEV_PM_OPS() to replace all the #ifdefs, letting gcc drop any of the unused functions silently. Fixes: b0d4bb973539 ("platform/x86/amd: pmc: Write dummy postcode into the STB DRAM") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20230214152512.806188-1-arnd@kernel.org Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com>
2023-03-07netfilter: conntrack: adopt safer max chain lengthEric Dumazet
Customers using GKE 1.25 and 1.26 are facing conntrack issues root caused to commit c9c3b6811f74 ("netfilter: conntrack: make max chain length random"). Even if we assume Uniform Hashing, a bucket often reachs 8 chained items while the load factor of the hash table is smaller than 0.5 With a limit of 16, we reach load factors of 3. With a limit of 32, we reach load factors of 11. With a limit of 40, we reach load factors of 15. With a limit of 50, we reach load factors of 24. This patch changes MIN_CHAINLEN to 50, to minimize risks. Ideally, we could in the future add a cushion based on expected load factor (2 * nf_conntrack_max / nf_conntrack_buckets), because some setups might expect unusual values. Fixes: c9c3b6811f74 ("netfilter: conntrack: make max chain length random") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-03-07new helper: put_and_unmap_page()Al Viro
kunmap_local() + put_page(), as done by e.g. ext2 directory handling. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-03-06Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-03-06 We've added 8 non-merge commits during the last 7 day(s) which contain a total of 9 files changed, 64 insertions(+), 18 deletions(-). The main changes are: 1) Fix BTF resolver for DATASEC sections when a VAR points at a modifier, that is, keep resolving such instances instead of bailing out, from Lorenz Bauer. 2) Fix BPF test framework with regards to xdp_frame info misplacement in the "live packet" code, from Alexander Lobakin. 3) Fix an infinite loop in BPF sockmap code for TCP/UDP/AF_UNIX, from Liu Jian. 4) Fix a build error for riscv BPF JIT under PERF_EVENTS=n, from Randy Dunlap. 5) Several BPF doc fixes with either broken links or external instead of internal doc links, from Bagas Sanjaya. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: check that modifier resolves after pointer btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES bpf, doc: Link to submitting-patches.rst for general patch submission info bpf, doc: Do not link to docs.kernel.org for kselftest link bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser() riscv, bpf: Fix patch_text implicit declaration bpf, docs: Fix link to BTF doc ==================== Link: https://lore.kernel.org/r/20230306215944.11981-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-06alpha: fix lazy-FPU mis(merged/applied/whatnot)Al Viro
Looks like a braino that used to be fixed in e.g. #next.alpha had gotten into alpha.git cherry-picked version of that patch. Sure, alpha has no preempt, but preempt_enable() in place of preempt_disable() is actively confusing the readers... Other than that, the cherry-picked variant matches what I have. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-03-06RISC-V: Stop emitting attributesPalmer Dabbelt
The RISC-V ELF attributes don't contain any useful information. New toolchains ignore them, but they frequently trip up various older/mixed toolchains. So just turn them off. Tested-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20230223224605.6995-1-palmer@rivosinc.com Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-06scsi: sd: Fix wrong zone_write_granularity value during revalidateShin'ichiro Kawasaki
When the sd driver revalidates host-managed SMR disks, it calls disk_set_zoned() which changes the zone_write_granularity attribute value to the logical block size regardless of the device type. After that, the sd driver overwrites the value in sd_zbc_read_zone() with the physical block size, since ZBC/ZAC requires this for host-managed disks. Between the calls to disk_set_zoned() and sd_zbc_read_zone(), there exists a window where the attribute shows the logical block size as the zone_write_granularity value, which is wrong for host-managed disks. The duration of the window is from 20ms to 200ms, depending on report zone command execution time. To avoid the wrong zone_write_granularity value between disk_set_zoned() and sd_zbc_read_zone(), modify the value not in sd_zbc_read_zone() but just after disk_set_zoned() call. Fixes: a805a4fa4fa3 ("block: introduce zone_write_granularity limit") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20230306063024.3376959-1-shinichiro.kawasaki@wdc.com Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-03-06scsi: storvsc: Handle BlockSize change in Hyper-V VHD/VHDX fileMichael Kelley
Hyper-V uses a VHD or VHDX file on the host as the underlying storage for a virtual disk. The VHD/VHDX file format is a sparse format where real disk space on the host is assigned in chunks that the VHD/VHDX file format calls the BlockSize. This BlockSize is not to be confused with the 512-byte (or 4096-byte) sector size of the underlying storage device. The default block size for a new VHD/VHDX file is 32 Mbytes. When a guest VM touches any disk space within a 32 Mbyte chunk of the VHD/VHDX file, Hyper-V allocates 32 Mbytes of real disk space for that section of the VHD/VHDX. Similarly, if a discard operation is done that covers an entire 32 Mbyte chunk, Hyper-V will free the real disk space for that portion of the VHD/VHDX. This BlockSize is surfaced in Linux as the "discard_granularity" in /sys/block/sd<x>/queue, which makes sense. Hyper-V also has differencing disks that can overlay a VHD/VHDX file to capture changes to the VHD/VHDX while preserving the original VHD/VHDX. One example of this differencing functionality is for VM snapshots. When a snapshot is created, a differencing disk is created. If the snapshot is rolled back, Hyper-V can just delete the differencing disk, and the VM will see the original disk contents at the time the snapshot was taken. Differencing disks are used in other scenarios as well. The BlockSize for a differencing disk defaults to 2 Mbytes, not 32 Mbytes. The smaller default is used because changes to differencing disks are typically scattered all over, and Hyper-V doesn't want to allocate 32 Mbytes of real disk space for a stray write here or there. The smaller BlockSize provides more efficient use of real disk space. When a differencing disk is added to a VHD/VHDX, Hyper-V reports UNIT_ATTENTION with a sense code indicating "Operating parameters have changed", because the value of discard_granularity should be changed to 2 Mbytes. When the differencing disk is removed, discard_granularity should be changed back to 32 Mbytes. However, current code simply reports a message from scsi_report_sense() and the value of /sys/block/sd<x>/queue/discard_granularity is not updated. The message isn't very actionable by a sysadmin. Fix this by having the storvsc driver check for the sense code indicating that the underly VHD/VHDX block size has changed, and do a rescan of the device to pick up the new discard_granularity. With this change the entire transition to/from differencing disks is handled automatically and transparently, with no confusing messages being output. Link: https://lore.kernel.org/r/1677516514-86060-1-git-send-email-mikelley@microsoft.com Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-03-06scsi: megaraid_sas: Driver version update to 07.725.01.00-rc1Chandrakanth Patil
Update driver version. Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Link: https://lore.kernel.org/r/20230302105342.34933-4-chandrakanth.patil@broadcom.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-03-06scsi: megaraid_sas: Add crash dump mode capability bit in MFI capabilitiesChandrakanth Patil
In kdump kernel mode, the driver works in reduced functionality mode with some features disabled such as reduced MSI-X count and RDPQ disabled, etc. However, the firmware is not aware of this mode in some cases, which results in undefined behavior. To address this, the driver informs the firmware about the kdump mode through MPI capabilities bit during driver initialization. This allows firmware to adjust its behavior accordingly. Signed-off-by: Chandrakanth Patil <chandrakanth.patil@broadcom.com> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com> Link: https://lore.kernel.org/r/20230302105342.34933-3-chandrakanth.patil@broadcom.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>