summaryrefslogtreecommitdiff
path: root/fs/ext4/page-io.c
AgeCommit message (Collapse)Author
2025-03-27Merge tag 'ext4-for_linus-6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Ext4 bug fixes and cleanups, including: - hardening against maliciously fuzzed file systems - backwards compatibility for the brief period when we attempted to ignore zero-width characters - avoid potentially BUG'ing if there is a file system corruption found during the file system unmount - fix free space reporting by statfs when project quotas are enabled and the free space is less than the remaining project quota Also improve performance when replaying a journal with a very large number of revoke records (applicable for Lustre volumes)" * tag 'ext4-for_linus-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (71 commits) ext4: fix OOB read when checking dotdot dir ext4: on a remount, only log the ro or r/w state when it has changed ext4: correct the error handle in ext4_fallocate() ext4: Make sb update interval tunable ext4: avoid journaling sb update on error if journal is destroying ext4: define ext4_journal_destroy wrapper ext4: hash: simplify kzalloc(n * 1, ...) to kzalloc(n, ...) jbd2: add a missing data flush during file and fs synchronization ext4: don't over-report free space or inodes in statvfs ext4: clear DISCARD flag if device does not support discard jbd2: remove jbd2_journal_unfile_buffer() ext4: reorder capability check last ext4: update the comment about mb_optimize_scan jbd2: fix off-by-one while erasing journal ext4: remove references to bh->b_page ext4: goto right label 'out_mmap_sem' in ext4_setattr() ext4: fix out-of-bound read in ext4_xattr_inode_dec_ref_all() ext4: introduce ITAIL helper jbd2: remove redundant function jbd2_journal_has_csum_v2or3_feature ext4: remove redundant function ext4_has_metadata_csum ...
2025-03-13ext4: add ext4_emergency_state() helper functionBaokun Li
Since both SHUTDOWN and EMERGENCY_RO are emergency states of the ext4 file system, and they are checked in similar locations, we have added a helper function, ext4_emergency_state(), to determine whether the current file system is in one of these two emergency states. Then, replace calls to ext4_forced_shutdown() with ext4_emergency_state() in those functions that could potentially trigger write operations. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122114130.229709-4-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: abort journal on data writeback failure if in data_err=abort modeBaokun Li
The data_err=abort was initially introduced to address users' worries about data corruption spreading unnoticed. With direct writes, we can rely on return values to confirm successful writes to disk. But with buffered writes, a successful return only means the data has been written to memory. Users have no way of knowing if the data has actually written it to disk unless they use fsync (which impacts performance and can sometimes miss errors). The current data_err=abort implementation relies on the ordered data list, but past changes have inadvertently altered its behavior. For example, if an extent is unwritten, we do not add the inode to the ordered data list. Therefore, jbd2 will not wait for the data write-back of that inode to complete and check for errors in the inode mapping. Moreover, the checks performed by jbd2 can also miss errors. Now, all buffered writes eventually call ext4_end_bio(), where I/O errors are checked. Therefore, we can check for the data_err=abort mode at this point and abort the journal in a kworker (due to the interrupt context). Therefore, when data_err=abort is enabled, the journal is aborted in ext4_end_io_end() when an I/O error is detected in ext4_end_bio() to make users who are concerned about the contents of the file happy. Suggested-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/c7ab26f3-85ad-4b31-b132-0afb0e07bf79@huawei.com Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250122110533.4116662-6-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: do not convert the unwritten extents if data writeback failsBaokun Li
When dioread_nolock is turned on (the default), it will convert unwritten extents to written at ext4_end_io_end(), even if the data writeback fails. It leads to the possibility that stale data may be exposed when the physical block corresponding to the file data is read-only (i.e., writes return -EIO, but reads are normal). Therefore a new ext4_io_end->flags EXT4_IO_END_FAILED is added, which indicates that some bio write-back failed in the current ext4_io_end. When this flag is set, the unwritten to written conversion is no longer performed. Users can read the data normally until the caches are dropped, after that, the failed extents can only be read to all 0. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122110533.4116662-3-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-13ext4: replace opencoded ext4_end_io_end() in ext4_put_io_end()Baokun Li
This reduces duplicate code and ensures that a “potential data loss” warning is available if the unwritten conversion fails. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250122110533.4116662-2-libaokun@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-03-05fscrypt: Change fscrypt_encrypt_pagecache_blocks() to take a folioMatthew Wilcox (Oracle)
ext4 and ceph already have a folio to pass; f2fs needs to be properly converted but this will do for now. This removes a reference to page->index and page->mapping as well as removing a call to compound_head(). Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/r/20250304170224.523141-1-willy@infradead.org Acked-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-18Merge tag 'ext4_for_linus-6.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A lot of miscellaneous ext4 bug fixes and cleanups this cycle, most notably in the journaling code, bufered I/O, and compiler warning cleanups" * tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits) jbd2: Fix comment describing journal_init_common() ext4: prevent an infinite loop in the lazyinit thread ext4: use struct_size() to improve ext4_htree_store_dirent() ext4: annotate struct fname with __counted_by() jbd2: avoid dozens of -Wflex-array-member-not-at-end warnings ext4: use str_yes_no() helper function ext4: prevent delalloc to nodelalloc on remount jbd2: make b_frozen_data allocation always succeed ext4: cleanup variable name in ext4_fc_del() ext4: use string choices helpers jbd2: remove the 'success' parameter from the jbd2_do_replay() function jbd2: remove useless 'block_error' variable jbd2: factor out jbd2_do_replay() jbd2: refactor JBD2_COMMIT_BLOCK process in do_one_pass() jbd2: unified release of buffer_head in do_one_pass() jbd2: remove redundant judgments for check v1 checksum ext4: use ERR_CAST to return an error-valued pointer mm: zero range of eof folio exposed by inode size extension ext4: partial zero eof block on unaligned inode size extension ext4: disambiguate the return value of ext4_dio_write_end_io() ...
2024-11-12ext4: pass write-hint for buffered IOj.xia
Commit 449813515d3e ("block, fs: Restore the per-bio/request data lifetime fields") restored write-hint support in ext4. But that is applicable only for direct IO. This patch supports passing write-hint for buffered IO from ext4 file system to block layer by filling bi_write_hint of struct bio in io_submit_add_bh(). Signed-off-by: j.xia <j.xia@samsung.com> Link: https://patch.msgid.link/20240919020341.2657646-1-j.xia@samsung.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-10-28fs/writeback: convert wbc_account_cgroup_owner to take a folioPankaj Raghav
Most of the callers of wbc_account_cgroup_owner() are converting a folio to page before calling the function. wbc_account_cgroup_owner() is converting the page back to a folio to call mem_cgroup_css_from_folio(). Convert wbc_account_cgroup_owner() to take a folio instead of a page, and convert all callers to pass a folio directly except f2fs. Convert the page to folio for all the callers from f2fs as they were the only callers calling wbc_account_cgroup_owner() with a page. As f2fs is already in the process of converting to folios, these call sites might also soon be calling wbc_account_cgroup_owner() with a folio directly in the future. No functional changes. Only compile tested. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com Acked-by: David Sterba <dsterba@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-09ext4: remove calls to to set/clear the folio error flagMatthew Wilcox (Oracle)
Nobody checks this flag on ext4 folios, stop setting and clearing it. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: linux-ext4@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240420025029.2166544-11-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-12-29fs: convert block_write_full_page to block_write_full_folioMatthew Wilcox (Oracle)
Convert the function to be compatible with writepage_t so that it can be passed to write_cache_pages() by blkdev. This removes a call to compound_head(). We can also remove the function export as both callers are built-in. Link: https://lkml.kernel.org/r/20231215200245.748418-14-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-29ext4: make ext4_forced_shutdown() take struct super_blockJan Kara
Currently ext4_forced_shutdown() takes struct ext4_sb_info but most callers need to get it from struct super_block anyway. So just pass in struct super_block to save all callers from some boilerplate code. No functional changes. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-19ext4: remove unneeded check of nr_to_submitTom Rix
cppcheck reports fs/ext4/page-io.c:516:51: style: Condition 'nr_to_submit' is always true [knownConditionTrueFalse] if (fscrypt_inode_uses_fs_layer_crypto(inode) && nr_to_submit) { ^ This earlier check to bail, makes this check unncessary /* Nothing to submit? Just unlock the page... */ if (!nr_to_submit) return 0; Signed-off-by: Tom Rix <trix@redhat.com> Fixes: dff4ac75eeee ("ext4: move keep_towrite handling to ext4_bio_write_page()") Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230316204831.2472537-1-trix@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-14ext4: Keep pages with journalled data dirtyJan Kara
Currently we clear page dirty bit when we checkpoint some buffers from a page with journalled data or when we perform delayed dirtying of a page in ext4_writepages(). In a quest to simplify handling of journalled data we want to keep page dirty as long as it has either buffers to checkpoint or journalled dirty data. So make sure to keep page dirty in ext4_writepages() if it still has journalled data attached to it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230329154950.19720-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06ext4: Convert ext4_bio_write_page() to ext4_bio_write_folio()Matthew Wilcox
The only caller now has a folio so pass it in directly and avoid the call to page_folio() at the beginning. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-9-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06ext4: Convert ext4_finish_bio() to use foliosMatthew Wilcox
Prepare ext4 to support large folios in the page writeback path. Also set the actual error in the mapping, not just -EIO. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-5-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-04-06ext4: Convert ext4_bio_write_page() to use a folioMatthew Wilcox
Remove several calls to compound_head() and the last caller of set_page_writeback_keepwrite(), so remove the wrapper too. Also export bio_add_folio() as this is the first caller from a module. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230324180129.1220691-4-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2023-03-23ext4: Don't unlock page in ext4_bio_write_page()Jan Kara
Do not unlock the written page in ext4_bio_write_page(). Instead leave the page locked and unlock it in the callers. We'll need to keep the page locked for data=journal writeback for a bit longer. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230228051319.4085470-5-tytso@mit.edu
2023-03-07ext4: fix cgroup writeback accounting with fs-layer encryptionEric Biggers
When writing a page from an encrypted file that is using filesystem-layer encryption (not inline encryption), ext4 encrypts the pagecache page into a bounce page, then writes the bounce page. It also passes the bounce page to wbc_account_cgroup_owner(). That's incorrect, because the bounce page is a newly allocated temporary page that doesn't have the memory cgroup of the original pagecache page. This makes wbc_account_cgroup_owner() not account the I/O to the owner of the pagecache page as it should. Fix this by always passing the pagecache page to wbc_account_cgroup_owner(). Fixes: 001e4a8775f6 ("ext4: implement cgroup writeback support") Cc: stable@vger.kernel.org Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203005503.141557-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-12-08ext4: drop pointless IO submission from ext4_bio_write_page()Jan Kara
We submit outstanding IO in ext4_bio_write_page() if we find a buffer we are not going to write. This is however pointless because we already handle submission of previous IO in case we detect newly added buffer head is discontiguous. So just delete the pointless IO submission call. Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221207112722.22220-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-12-08ext4: remove nr_submitted from ext4_bio_write_page()Jan Kara
nr_submitted is the same as nr_to_submit. Drop one of them. Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221207112722.22220-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-12-08ext4: move keep_towrite handling to ext4_bio_write_page()Jan Kara
When we are writing back page but we cannot for some reason write all its buffers (e.g. because we cannot allocate blocks in current context) we have to keep TOWRITE tag set in the mapping as otherwise racing WB_SYNC_ALL writeback that could write these buffers can skip the page and result in data loss. We will need this logic for writeback during transaction commit so move the logic from ext4_writepage() to ext4_bio_write_page(). Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221207112722.22220-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-12-08ext4: handle redirtying in ext4_bio_write_page()Jan Kara
Since we want to transition transaction commits to use ext4_writepages() for writing back ordered, add handling of page redirtying into ext4_bio_write_page(). Also move buffer dirty bit clearing into the same place other buffer state handling. Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221207112722.22220-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16ext4: fix incorrect comment in ext4_bio_write_page()Wang Jianjian
Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com> Link: https://lore.kernel.org/r/20220520022255.2120576-1-wangjianjian3@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-04-22Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Fix some syzbot-detected bugs, as well as other bugs found by I/O injection testing. Change ext4's fallocate to consistently drop set[ug]id bits when an fallocate operation might possibly change the user-visible contents of a file. Also, improve handling of potentially invalid values in the the s_overhead_cluster superblock field to avoid ext4 returning a negative number of free blocks" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: jbd2: fix a potential race while discarding reserved buffers after an abort ext4: update the cached overhead value in the superblock ext4: force overhead calculation if the s_overhead_cluster makes no sense ext4: fix overhead calculation to account for the reserved gdt blocks ext4, doc: fix incorrect h_reserved size ext4: limit length to bitmap_maxbytes - blocksize in punch_hole ext4: fix use-after-free in ext4_search_dir ext4: fix bug_on in start_this_handle during umount filesystem ext4: fix symlink file size not match to file content ext4: fix fallocate to use file_modified to update permissions consistently
2022-04-12ext4: fix symlink file size not match to file contentYe Bin
We got issue as follows: [home]# fsck.ext4 -fn ram0yb e2fsck 1.45.6 (20-Mar-2020) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Symlink /p3/d14/d1a/l3d (inode #3494) is invalid. Clear? no Entry 'l3d' in /p3/d14/d1a (3383) has an incorrect filetype (was 7, should be 0). Fix? no As the symlink file size does not match the file content. If the writeback of the symlink data block failed, ext4_finish_bio() handles the end of IO. However this function fails to mark the buffer with BH_write_io_error and so when unmount does journal checkpoint it cannot detect the writeback error and will cleanup the journal. Thus we've lost the correct data in the journal area. To solve this issue, mark the buffer as BH_write_io_error in ext4_finish_bio(). Cc: stable@kernel.org Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220321144438.201685-1-yebin10@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-03-07block: remove the per-bio/request write hintChristoph Hellwig
With the NVMe support for this gone, there are no consumers of these hints left, so remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07Merge branch 'for-5.18/alloc-cleanups' into for-5.18/write-streamsJens Axboe
* for-5.18/alloc-cleanups: nilfs2: pass the operation to bio_alloc ext4: pass the operation to bio_alloc mpage: pass the operation to bio_alloc
2022-03-07ext4: stop using bio_devnameChristoph Hellwig
Use the %pg format specifier to save on stack consuption and code size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220304180105.409765-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-27ext4: pass the operation to bio_allocChristoph Hellwig
Refactor the readpage code to pass the op to bio_alloc instead of setting it just before the submission. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20220222154634.597067-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02block: pass a block_device and opf to bio_allocChristoph Hellwig
Pass the block_device and operation that we plan to use this bio for to bio_alloc to optimize the assignment. NULL/0 can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Also move the gfp_mask argument after the nr_vecs argument for a much more logical calling convention matching what most of the kernel does. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-15mm: introduce memalloc_retry_wait()NeilBrown
Various places in the kernel - largely in filesystems - respond to a memory allocation failure by looping around and re-trying. Some of these cannot conveniently use __GFP_NOFAIL, for reasons such as: - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on - a need to check for the process being signalled between failures - the possibility that other recovery actions could be performed - the allocation is quite deep in support code, and passing down an extra flag to say if __GFP_NOFAIL is wanted would be clumsy. Many of these currently use congestion_wait() which (in almost all cases) simply waits the given timeout - congestion isn't tracked for most devices. It isn't clear what the best delay is for loops, but it is clear that the various filesystems shouldn't be responsible for choosing a timeout. This patch introduces memalloc_retry_wait() with takes on that responsibility. Code that wants to retry a memory allocation can call this function passing the GFP flags that were used. It will wait however is appropriate. For now, it only considers __GFP_NORETRY and whatever gfpflags_allow_blocking() tests. If blocking is allowed without __GFP_NORETRY, then alloc_page either made some reclaim progress, or waited for a while, before failing. So there is no need for much further waiting. memalloc_retry_wait() will wait until the current jiffie ends. If this condition is not met, then alloc_page() won't have waited much if at all. In that case memalloc_retry_wait() waits about 200ms. This is the delay that most current loops uses. linux/sched/mm.h needs to be included in some files now, but linux/backing-dev.h does not. Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name Signed-off-by: NeilBrown <neilb@suse.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-04ext4: convert from atomic_t to refcount_t on ext4_io_end->countXiyu Yang
refcount_t type and corresponding API can protect refcounters from accidental underflow and overflow and further use-after-free situations. Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1626674355-55795-1-git-send-email-xiyuyang19@fudan.edu.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-03-11block: rename BIO_MAX_PAGES to BIO_MAX_VECSChristoph Hellwig
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop confusing users of the bio API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-22ext4: remove unnecessary wbc parameter from ext4_bio_write_pageLei Chen
ext4_bio_write_page does not need wbc parameter, since its parameter io contains the io_wbc field. The io::io_wbc is initialized by ext4_io_submit_init which is called in ext4_writepages and ext4_writepage functions prior to ext4_bio_write_page. Therefor, when ext4_bio_write_page is called, wbc info has already been included in io parameter. Signed-off-by: Lei Chen <lennychen@tencent.com> Link: https://lore.kernel.org/r/1607669664-25656-1-git-send-email-lennychen@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03ext4: remove the null check of bio_vec pageXianting Tian
bv_page can't be NULL in a valid bio_vec, so we can remove the NULL check, as we did in other places when calling bio_for_each_segment_all() to go through all bio_vec of a bio. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Xianting Tian <tian.xianting@h3c.com> Link: https://lore.kernel.org/r/20201020082201.34257-1-tian.xianting@h3c.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-07-08ext4: add inline encryption supportEric Biggers
Wire up ext4 to support inline encryption via the helper functions which fs/crypto/ now provides. This includes: - Adding a mount option 'inlinecrypt' which enables inline encryption on encrypted files where it can be used. - Setting the bio_crypt_ctx on bios that will be submitted to an inline-encrypted file. Note: submit_bh_wbc() in fs/buffer.c also needed to be patched for this part, since ext4 sometimes uses ll_rw_block() on file data. - Not adding logically discontiguous data to bios that will be submitted to an inline-encrypted file. - Not doing filesystem-layer crypto on inline-encrypted files. Co-developed-by: Satya Tangirala <satyat@google.com> Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200702015607.1215430-5-satyat@google.com Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-03-28fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_tThomas Gleixner
Bit spinlocks are problematic if PREEMPT_RT is enabled, because they disable preemption, which is undesired for latency reasons and breaks when regular spinlocks are taken within the bit_spinlock locked region because regular spinlocks are converted to 'sleeping spinlocks' on RT. PREEMPT_RT replaced the bit spinlocks with regular spinlocks to avoid this problem. The replacement was done conditionaly at compile time, but Christoph requested to do an unconditional conversion. Jan suggested to move the spinlock into a existing padding hole which avoids a size increase of struct buffer_head on production kernels. As a benefit the lock gains lockdep coverage. [ bigeasy: Remove the wrapper and use always spinlock_t and move it into the padding hole ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20191118132824.rclhrbujqh4b4g4d@linutronix.de
2020-01-17ext4: fix deadlock allocating crypto bounce page from mempoolEric Biggers
ext4_writepages() on an encrypted file has to encrypt the data, but it can't modify the pagecache pages in-place, so it encrypts the data into bounce pages and writes those instead. All bounce pages are allocated from a mempool using GFP_NOFS. This is not correct use of a mempool, and it can deadlock. This is because GFP_NOFS includes __GFP_DIRECT_RECLAIM, which enables the "never fail" mode for mempool_alloc() where a failed allocation will fall back to waiting for one of the preallocated elements in the pool. But since this mode is used for all a bio's pages and not just the first, it can deadlock waiting for pages already in the bio to be freed. This deadlock can be reproduced by patching mempool_alloc() to pretend that pool->alloc() always fails (so that it always falls back to the preallocations), and then creating an encrypted file of size > 128 KiB. Fix it by only using GFP_NOFS for the first page in the bio. For subsequent pages just use GFP_NOWAIT, and if any of those fail, just submit the bio and start a new one. This will need to be fixed in f2fs too, but that's less straightforward. Fixes: c9af28fdd449 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181149.47619-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-11-14ext4: bio_alloc with __GFP_DIRECT_RECLAIM never failsGao Xiang
Similar to [1] [2], bio_alloc with __GFP_DIRECT_RECLAIM flags guarantees bio allocation under some given restrictions, as stated in block/bio.c and fs/direct-io.c So here it's ok to not check for NULL value from bio_alloc(). [1] https://lore.kernel.org/r/20191030035518.65477-1-gaoxiang25@huawei.com [2] https://lore.kernel.org/r/20190830162812.GA10694@infradead.org Cc: Theodore Ts'o <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Gao Xiang <gaoxiang25@huawei.com> Link: https://lore.kernel.org/r/20191031092315.139267-1-gaoxiang25@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-22ext4: Add support for blocksize < pagesize in dioread_nolockRitesh Harjani
This patch adds the support for blocksize < pagesize for dioread_nolock feature. Since in case of blocksize < pagesize, we can have multiple small buffers of page as unwritten extents, we need to maintain a vector of these unwritten extents which needs the conversion after the IO is complete. Thus, we maintain a list of tuple <offset, size> pair (io_end_vec) for this & traverse this list to do the unwritten to written conversion. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191016073711.4141-5-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-22ext4: Add API to bring in support for unwritten io_end_vec conversionRitesh Harjani
This patch just brings in the API for conversion of unwritten io_end_vec extents which will be required for blocksize < pagesize support for dioread_nolock feature. No functional changes in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191016073711.4141-3-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-10-22ext4: keep uniform naming convention for io & io_end variablesRitesh Harjani
Let's keep uniform naming convention for ext4_submit_io (io) & ext4_end_io_t (io_end) structures, to avoid any confusion. No functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191016073711.4141-2-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-07-15Merge tag 'for-linus-20190715' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull more block updates from Jens Axboe: "A later pull request with some followup items. I had some vacation coming up to the merge window, so certain things items were delayed a bit. This pull request also contains fixes that came in within the last few days of the merge window, which I didn't want to push right before sending you a pull request. This contains: - NVMe pull request, mostly fixes, but also a few minor items on the feature side that were timing constrained (Christoph et al) - Report zones fixes (Damien) - Removal of dead code (Damien) - Turn on cgroup psi memstall (Josef) - block cgroup MAINTAINERS entry (Konstantin) - Flush init fix (Josef) - blk-throttle low iops timing fix (Konstantin) - nbd resize fixes (Mike) - nbd 0 blocksize crash fix (Xiubo) - block integrity error leak fix (Wenwen) - blk-cgroup writeback and priority inheritance fixes (Tejun)" * tag 'for-linus-20190715' of git://git.kernel.dk/linux-block: (42 commits) MAINTAINERS: add entry for block io cgroup null_blk: fixup ->report_zones() for !CONFIG_BLK_DEV_ZONED block: Limit zone array allocation size sd_zbc: Fix report zones buffer allocation block: Kill gfp_t argument of blkdev_report_zones() block: Allow mapping of vmalloc-ed buffers block/bio-integrity: fix a memory leak bug nvme: fix NULL deref for fabrics options nbd: add netlink reconfigure resize support nbd: fix crash when the blksize is zero block: Disable write plugging for zoned block devices block: Fix elevator name declaration block: Remove unused definitions nvme: fix regression upon hot device removal and insertion blk-throttle: fix zero wait time for iops throttled group block: Fix potential overflow in blk_report_zones() blkcg: implement REQ_CGROUP_PUNT blkcg, writeback: Implement wbc_blkcg_css() blkcg, writeback: Add wbc->no_cgroup_owner blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner() ...
2019-07-10blkcg, writeback: Rename wbc_account_io() to wbc_account_cgroup_owner()Tejun Heo
wbc_account_io() does a very specific job - try to see which cgroup is actually dirtying an inode and transfer its ownership to the majority dirtier if needed. The name is too generic and confusing. Let's rename it to something more specific. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-28ext4: encrypt only up to last block in ext4_bio_write_page()Eric Biggers
As an optimization, don't encrypt blocks fully beyond i_size, since those definitely won't need to be written out. Also add a comment. This is in preparation for allowing encryption on ext4 filesystems with blocksize != PAGE_SIZE. This is based on work by Chandan Rajendra. Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-05-28fscrypt: support encrypting multiple filesystem blocks per pageEric Biggers
Rename fscrypt_encrypt_page() to fscrypt_encrypt_pagecache_blocks() and redefine its behavior to encrypt all filesystem blocks from the given region of the given page, rather than assuming that the region consists of just one filesystem block. Also remove the 'inode' and 'lblk_num' parameters, since they can be retrieved from the page as it's already assumed to be a pagecache page. This is in preparation for allowing encryption on ext4 filesystems with blocksize != PAGE_SIZE. This is based on work by Chandan Rajendra. Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-05-28fscrypt: simplify bounce page handlingEric Biggers
Currently, bounce page handling for writes to encrypted files is unnecessarily complicated. A fscrypt_ctx is allocated along with each bounce page, page_private(bounce_page) points to this fscrypt_ctx, and fscrypt_ctx::w::control_page points to the original pagecache page. However, because writes don't use the fscrypt_ctx for anything else, there's no reason why page_private(bounce_page) can't just point to the original pagecache page directly. Therefore, this patch makes this change. In the process, it also cleans up the API exposed to filesystems that allows testing whether a page is a bounce page, getting the pagecache page from a bounce page, and freeing a bounce page. Reviewed-by: Chandan Rajendra <chandan@linux.ibm.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-04-30block: remove the i argument to bio_for_each_segment_allChristoph Hellwig
We only have two callers that need the integer loop iterator, and they can easily maintain it themselves. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-12Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A large number of bug fixes and cleanups. One new feature to allow users to more easily find the jbd2 journal thread for a particular ext4 file system" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits) jbd2: jbd2_get_transaction does not need to return a value jbd2: fix invalid descriptor block checksum ext4: fix bigalloc cluster freeing when hole punching under load ext4: add sysfs attr /sys/fs/ext4/<disk>/journal_task ext4: Change debugging support help prefix from EXT4 to Ext4 ext4: fix compile error when using BUFFER_TRACE jbd2: fix compile warning when using JBUFFER_TRACE ext4: fix some error pointer dereferences ext4: annotate more implicit fall throughs ext4: annotate implicit fall throughs ext4: don't update s_rev_level if not required jbd2: fold jbd2_superblock_csum_{verify,set} into their callers jbd2: fix race when writing superblock ext4: fix crash during online resizing ext4: disallow files with EXT4_JOURNAL_DATA_FL from EXT4_IOC_SWAP_BOOT ext4: add mask of ext4 flags to swap ext4: update quota information while swapping boot loader inode ext4: cleanup pagecache before swap i_data ext4: fix check of inode in swap_inode_boot_loader ext4: unlock unused_pages timely when doing writeback ...