summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)Author
2024-12-03Merge tag 'xfs-fixes-6.13-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Carlos Maiolino: - Use xchg() in xlog_cil_insert_pcp_aggregate() - Fix ABBA deadlock on a race between mount and log shutdown - Fix quota softlimit incoherency on delalloc - Fix sparse inode limits on runt AG - remove unknown compat feature checks in SB write valdation - Eliminate a lockdep false positive * tag 'xfs-fixes-6.13-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: don't call xfs_bmap_same_rtgroup in xfs_bmap_add_extent_hole_delay xfs: Use xchg() in xlog_cil_insert_pcp_aggregate() xfs: prevent mount and log shutdown race xfs: delalloc and quota softlimit timers are incoherent xfs: fix sparse inode limits on runt AG xfs: remove unknown compat feature check in superblock write validation xfs: eliminate lockdep false positives in xfs_attr_shortform_list
2024-11-28xfs: don't call xfs_bmap_same_rtgroup in xfs_bmap_add_extent_hole_delayChristoph Hellwig
xfs_bmap_add_extent_hole_delay works entirely on delalloc extents, for which xfs_bmap_same_rtgroup doesn't make sense. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-28xfs: Use xchg() in xlog_cil_insert_pcp_aggregate()Uros Bizjak
try_cmpxchg() loop with constant "new" value can be substituted with just xchg() to atomically get and clear the location. The code on x86_64 improves from: 1e7f: 48 89 4c 24 10 mov %rcx,0x10(%rsp) 1e84: 48 03 14 c5 00 00 00 add 0x0(,%rax,8),%rdx 1e8b: 00 1e88: R_X86_64_32S __per_cpu_offset 1e8c: 8b 02 mov (%rdx),%eax 1e8e: 41 89 c5 mov %eax,%r13d 1e91: 31 c9 xor %ecx,%ecx 1e93: f0 0f b1 0a lock cmpxchg %ecx,(%rdx) 1e97: 75 f5 jne 1e8e <xlog_cil_commit+0x84e> 1e99: 48 8b 4c 24 10 mov 0x10(%rsp),%rcx 1e9e: 45 01 e9 add %r13d,%r9d to just: 1e7f: 48 03 14 cd 00 00 00 add 0x0(,%rcx,8),%rdx 1e86: 00 1e83: R_X86_64_32S __per_cpu_offset 1e87: 31 c9 xor %ecx,%ecx 1e89: 87 0a xchg %ecx,(%rdx) 1e8b: 41 01 cb add %ecx,%r11d No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <elder@riscstar.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-23Merge tag 'mm-stable-2024-11-18-19-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. * tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits) cma: enforce non-zero pageblock_order during cma_init_reserved_mem() mm/kfence: add a new kunit test test_use_after_free_read_nofault() zram: fix NULL pointer in comp_algorithm_show() memcg/hugetlb: add hugeTLB counters to memcg vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount zram: ZRAM_DEF_COMP should depend on ZRAM MAINTAINERS/MEMORY MANAGEMENT: add document files for mm Docs/mm/damon: recommend academic papers to read and/or cite mm: define general function pXd_init() kmemleak: iommu/iova: fix transient kmemleak false positive mm/list_lru: simplify the list_lru walk callback function mm/list_lru: split the lock to per-cgroup scope mm/list_lru: simplify reparenting and initial allocation mm/list_lru: code clean up for reparenting mm/list_lru: don't export list_lru_add mm/list_lru: don't pass unnecessary key parameters kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols ...
2024-11-22xfs: prevent mount and log shutdown raceDave Chinner
I recently had an fstests hang where there were two internal tasks stuck like so: [ 6559.010870] task:kworker/24:45 state:D stack:12152 pid:631308 tgid:631308 ppid:2 flags:0x00004000 [ 6559.016984] Workqueue: xfs-buf/dm-2 xfs_buf_ioend_work [ 6559.020349] Call Trace: [ 6559.022002] <TASK> [ 6559.023426] __schedule+0x650/0xb10 [ 6559.025734] schedule+0x6d/0xf0 [ 6559.027835] schedule_timeout+0x31/0x180 [ 6559.030582] wait_for_common+0x10c/0x1e0 [ 6559.033495] wait_for_completion+0x1d/0x30 [ 6559.036463] __flush_workqueue+0xeb/0x490 [ 6559.039479] ? mempool_alloc_slab+0x15/0x20 [ 6559.042537] xlog_cil_force_seq+0xa1/0x2f0 [ 6559.045498] ? bio_alloc_bioset+0x1d8/0x510 [ 6559.048578] ? submit_bio_noacct+0x2f2/0x380 [ 6559.051665] ? xlog_force_shutdown+0x3b/0x170 [ 6559.054819] xfs_log_force+0x77/0x230 [ 6559.057455] xlog_force_shutdown+0x3b/0x170 [ 6559.060507] xfs_do_force_shutdown+0xd4/0x200 [ 6559.063798] ? xfs_buf_rele+0x1bd/0x580 [ 6559.066541] xfs_buf_ioend_handle_error+0x163/0x2e0 [ 6559.070099] xfs_buf_ioend+0x61/0x200 [ 6559.072728] xfs_buf_ioend_work+0x15/0x20 [ 6559.075706] process_scheduled_works+0x1d4/0x400 [ 6559.078814] worker_thread+0x234/0x2e0 [ 6559.081300] kthread+0x147/0x170 [ 6559.083462] ? __pfx_worker_thread+0x10/0x10 [ 6559.086295] ? __pfx_kthread+0x10/0x10 [ 6559.088771] ret_from_fork+0x3e/0x50 [ 6559.091153] ? __pfx_kthread+0x10/0x10 [ 6559.093624] ret_from_fork_asm+0x1a/0x30 [ 6559.096227] </TASK> [ 6559.109304] Workqueue: xfs-cil/dm-2 xlog_cil_push_work [ 6559.112673] Call Trace: [ 6559.114333] <TASK> [ 6559.115760] __schedule+0x650/0xb10 [ 6559.118084] schedule+0x6d/0xf0 [ 6559.120175] schedule_timeout+0x31/0x180 [ 6559.122776] ? call_rcu+0xee/0x2f0 [ 6559.125034] __down_common+0xbe/0x1f0 [ 6559.127470] __down+0x1d/0x30 [ 6559.129458] down+0x48/0x50 [ 6559.131343] ? xfs_buf_item_unpin+0x8d/0x380 [ 6559.134213] xfs_buf_lock+0x3d/0xe0 [ 6559.136544] xfs_buf_item_unpin+0x8d/0x380 [ 6559.139253] xlog_cil_committed+0x287/0x520 [ 6559.142019] ? sched_clock+0x10/0x30 [ 6559.144384] ? sched_clock_cpu+0x10/0x190 [ 6559.147039] ? psi_group_change+0x48/0x310 [ 6559.149735] ? _raw_spin_unlock+0xe/0x30 [ 6559.152340] ? finish_task_switch+0xbc/0x310 [ 6559.155163] xlog_cil_process_committed+0x6d/0x90 [ 6559.158265] xlog_state_shutdown_callbacks+0x53/0x110 [ 6559.161564] ? xlog_cil_push_work+0xa70/0xaf0 [ 6559.164441] xlog_state_release_iclog+0xba/0x1b0 [ 6559.167483] xlog_cil_push_work+0xa70/0xaf0 [ 6559.170260] process_scheduled_works+0x1d4/0x400 [ 6559.173286] worker_thread+0x234/0x2e0 [ 6559.175779] kthread+0x147/0x170 [ 6559.177933] ? __pfx_worker_thread+0x10/0x10 [ 6559.180748] ? __pfx_kthread+0x10/0x10 [ 6559.183231] ret_from_fork+0x3e/0x50 [ 6559.185601] ? __pfx_kthread+0x10/0x10 [ 6559.188092] ret_from_fork_asm+0x1a/0x30 [ 6559.190692] </TASK> This is an ABBA deadlock where buffer IO completion is triggering a forced shutdown with the buffer lock held. It is waiting for the CIL to flush as part of the log force. The CIL flush is blocked doing shutdown processing of all it's objects, trying to unpin a buffer item. That requires taking the buffer lock.... For the CIL to be doing shutdown processing, the log must be marked with XLOG_IO_ERROR, but that doesn't happen until after the log force is issued. Hence for xfs_do_force_shutdown() to be forcing the log on a shut down log, we must have had a racing xlog_force_shutdown and xfs_force_shutdown like so: p0 p1 CIL push <holds buffer lock> xlog_force_shutdown xfs_log_force test_and_set_bit(XLOG_IO_ERROR) xlog_state_release_iclog() sees XLOG_IO_ERROR xlog_state_shutdown_callbacks .... xfs_buf_item_unpin xfs_buf_lock <blocks on buffer p1 holds> xfs_force_shutdown xfs_set_shutdown(mp) wins xlog_force_shutdown xfs_log_force <blocks on CIL push> xfs_set_shutdown(mp) fails <shuts down rest of log> The deadlock can be mitigated by avoiding the log force on the second pass through xlog_force_shutdown. Do this by adding another atomic state bit (XLOG_OP_PENDING_SHUTDOWN) that is set on entry to xlog_force_shutdown() but doesn't mark the log as shutdown. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-22xfs: delalloc and quota softlimit timers are incoherentDave Chinner
I've been seeing this failure on during xfs/050 recently: XFS: Assertion failed: dst->d_spc_timer != 0, file: fs/xfs/xfs_qm_syscalls.c, line: 435 .... Call Trace: <TASK> xfs_qm_scall_getquota_fill_qc+0x2a2/0x2b0 xfs_qm_scall_getquota_next+0x69/0xa0 xfs_fs_get_nextdqblk+0x62/0xf0 quota_getnextxquota+0xbf/0x320 do_quotactl+0x1a1/0x410 __se_sys_quotactl+0x126/0x310 __x64_sys_quotactl+0x21/0x30 x64_sys_call+0x2819/0x2ee0 do_syscall_64+0x68/0x130 entry_SYSCALL_64_after_hwframe+0x76/0x7e It turns out that the _qmount call has silently been failing to unmount and mount the filesystem, so when the softlimit is pushed past with a buffered write, it is not getting synced to disk before the next quota report is being run. Hence when the quota report runs, we have 300 blocks of delalloc data on an inode, with a soft limit of 200 blocks. XFS dquots account delalloc reservations as used space, hence the dquot is over the soft limit. However, we don't update the soft limit timers until we do a transactional update of the dquot. That is, the dquot sits over the soft limit without a softlimit timer being started until writeback occurs and the allocation modifies the dquot and we call xfs_qm_adjust_dqtimers() from xfs_trans_apply_dquot_deltas() in xfs_trans_commit() context. This isn't really a problem, except for this debug code in xfs_qm_scall_getquota_fill_qc(): if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) { if ((dst->d_space > dst->d_spc_softlimit) && (dst->d_spc_softlimit > 0)) { ASSERT(dst->d_spc_timer != 0); } .... It asserts taht if the used block count is over the soft limit, it *must* have a soft limit timer running. This is clearly not the case, because we haven't committed the delalloc space to disk yet. Hence the soft limit is only exceeded temporarily in memory (which isn't an issue) and we start the timer the moment we exceed the soft limit in journalled metadata. This debug was introduced in: commit 0d5ad8383061fbc0a9804fbb98218750000fe032 Author: Supriya Wickrematillake <sup@sgi.com> Date: Wed May 15 22:44:44 1996 +0000 initial checkin quotactl syscall functions. The very first quota support commit back in 1996. This is zero-day debug for Irix and, as it turns out, a zero-day bug in the debug code because the delalloc code on Irix didn't update the softlimit timers, either. IOWs, this issue has been in the code for 28 years. We obviously don't care if soft limit timers are a bit rubbery when we have delalloc reservations in memory. Production systems running quota reports have been exposed to this situation for 28 years and nobody has noticed it, so the debug code is essentially worthless at this point in time. We also have the on-disk dquot verifiers checking that the soft limit timer is running whenever the dquot is over the soft limit before we write it to disk and after we read it from disk. These aren't firing, so it is clear the issue is purely a temporary in-memory incoherency that I never would have noticed had the test not silently failed to unmount the filesystem. Hence I'm simply going to trash this runtime debug because it isn't useful in the slightest for catching quota bugs. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-22xfs: fix sparse inode limits on runt AGDave Chinner
The runt AG at the end of a filesystem is almost always smaller than the mp->m_sb.sb_agblocks. Unfortunately, when setting the max_agbno limit for the inode chunk allocation, we do not take this into account. This means we can allocate a sparse inode chunk that overlaps beyond the end of an AG. When we go to allocate an inode from that sparse chunk, the irec fails validation because the agbno of the start of the irec is beyond valid limits for the runt AG. Prevent this from happening by taking into account the size of the runt AG when allocating inode chunks. Also convert the various checks for valid inode chunk agbnos to use xfs_ag_block_count() so that they will also catch such issues in the future. Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-22xfs: remove unknown compat feature check in superblock write validationLong Li
Compat features are new features that older kernels can safely ignore, allowing read-write mounts without issues. The current sb write validation implementation returns -EFSCORRUPTED for unknown compat features, preventing filesystem write operations and contradicting the feature's definition. Additionally, if the mounted image is unclean, the log recovery may need to write to the superblock. Returning an error for unknown compat features during sb write validation can cause mount failures. Although XFS currently does not use compat feature flags, this issue affects current kernels' ability to mount images that may use compat feature flags in the future. Since superblock read validation already warns about unknown compat features, it's unnecessary to repeat this warning during write validation. Therefore, the relevant code in write validation is being removed. Fixes: 9e037cb7972f ("xfs: check for unknown v5 feature bits in superblock write verifier") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-22xfs: eliminate lockdep false positives in xfs_attr_shortform_listLong Li
xfs_attr_shortform_list() only called from a non-transactional context, it hold ilock before alloc memory and maybe trapped in memory reclaim. Since commit 204fae32d5f7("xfs: clean up remaining GFP_NOFS users") removed GFP_NOFS flag, lockdep warning will be report as [1]. Eliminate lockdep false positives by use __GFP_NOLOCKDEP to alloc memory in xfs_attr_shortform_list(). [1] https://lore.kernel.org/linux-xfs/000000000000e33add0616358204@google.com/ Reported-by: syzbot+4248e91deb3db78358a2@syzkaller.appspotmail.com Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-21Merge tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Carlos Maiolino: "The bulk of this pull request is a major rework that Darrick and Christoph have been doing on XFS's real-time volume, coupled with a few features to support this rework. It does also includes some bug fixes. - convert perag to use xarrays - create a new generic allocation group structure - add metadata inode dir trees - create in-core rt allocation groups - shard the RT section into allocation groups - persist quota options with the enw metadata dir tree - enable quota for RT volumes - enable metadata directory trees - some bugfixes" * tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (146 commits) xfs: port ondisk structure checks from xfs/122 to the kernel xfs: separate space btree structures in xfs_ondisk.h xfs: convert struct typedefs in xfs_ondisk.h xfs: enable metadata directory feature xfs: enable realtime quota again xfs: update sb field checks when metadir is turned on xfs: reserve quota for realtime files correctly xfs: create quota preallocation watermarks for realtime quota xfs: report realtime block quota limits on realtime directories xfs: persist quota flags with metadir xfs: advertise realtime quota support in the xqm stat files xfs: scrub quota file metapaths xfs: fix chown with rt quota xfs: use metadir for quota inodes xfs: refactor xfs_qm_destroy_quotainos xfs: use rtgroup busy extent list for FITRIM xfs: implement busy extent tracking for rtgroups xfs: port the perag discard code to handle generic groups xfs: move the min and max group block numbers to xfs_group xfs: adjust min_block usage in xfs_verify_agbno ...
2024-11-18Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...
2024-11-18Merge tag 'vfs-6.13.untorn.writes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs untorn write support from Christian Brauner: "An atomic write is a write issed with torn-write protection. This means for a power failure or any hardware failure all or none of the data from the write will be stored, never a mix of old and new data. This work is already supported for block devices. If a block device is opened with O_DIRECT and the block device supports atomic write, then FMODE_CAN_ATOMIC_WRITE is added to the file of the opened block device. This contains the work to expand atomic write support to filesystems, specifically ext4 and XFS. Currently, only support for writing exactly one filesystem block atomically is added. Since it's now possible to have filesystem block size > page size for XFS, it's possible to write 4K+ blocks atomically on x86" * tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: drop an obsolete comment in iomap_dio_bio_iter ext4: Do not fallback to buffered-io for DIO atomic write ext4: Support setting FMODE_CAN_ATOMIC_WRITE ext4: Check for atomic writes support in write iter ext4: Add statx support for atomic writes xfs: Support setting FMODE_CAN_ATOMIC_WRITE xfs: Validate atomic writes xfs: Support atomic write for statx fs: iomap: Atomic write support fs: Export generic_atomic_write_valid() block: Add bdev atomic write limits helpers fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid() block/fs: Pass an iocb to generic_atomic_write_valid()
2024-11-18Merge tag 'vfs-6.13.mgtime' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs multigrain timestamps from Christian Brauner: "This is another try at implementing multigrain timestamps. This time with significant help from the timekeeping maintainers to reduce the performance impact. Thomas provided a base branch that contains the required timekeeping interfaces for the VFS. It serves as the base for the multi-grain timestamp work: - Multigrain timestamps allow the kernel to use fine-grained timestamps when an inode's attributes is being actively observed via ->getattr(). With this support, it's possible for a file to get a fine-grained timestamp, and another modified after it to get a coarse-grained stamp that is earlier than the fine-grained time. If this happens then the files can appear to have been modified in reverse order, which breaks VFS ordering guarantees. To prevent this, a floor value is maintained for multigrain timestamps. Whenever a fine-grained timestamp is handed out, record it, and when later coarse-grained stamps are handed out, ensure they are not earlier than that value. If the coarse-grained timestamp is earlier than the fine-grained floor, return the floor value instead. The timekeeper changes add a static singleton atomic64_t into timekeeper.c that is used to keep track of the latest fine-grained time ever handed out. This is tracked as a monotonic ktime_t value to ensure that it isn't affected by clock jumps. Because it is updated at different times than the rest of the timekeeper object, the floor value is managed independently of the timekeeper via a cmpxchg() operation, and sits on its own cacheline. Two new public timekeeper interfaces are added: (1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the coarse-grained clock and the floor time (2) ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries to swap it into the floor. A timespec64 is filled with the result. - The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This adds a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. This is where the earlier mentioned timkeeping interfaces help. A global monotonic atomic64_t value is kept that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems)" * tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: reduce pointer chasing in is_mgtime() test tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime timekeeping: Add percpu counter for tracking floor swap events timekeeping: Add interfaces for handling timestamps with a floor value fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps
2024-11-12Merge tag 'better-ondisk-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: improve ondisk structure checks [v5.5 10/10] Reorganize xfs_ondisk.h to group the build checks by type, then add a bunch of missing checks that were in xfs/122 but not the build system. With this, we can get rid of xfs/122. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'metadir-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: enable metadir [v5.5 09/10] Actually enable this very large feature, which adds metadata directory trees, allocation groups on the realtime volume, persistent quota options, and quota for realtime files. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'realtime-quotas-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: enable quota for realtime volumes [v5.5 08/10] At some point, I realized that I've refactored enough of the quota code in XFS that I should evaluate whether or not quota actually works on realtime volumes. It turns out that it nearly works: the only broken pieces are chown and delayed allocation, and reporting of project quotas in the statvfs output for projinherit+rtinherit directories. Fix these things and we can have realtime quotas again after 20 years. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'metadir-quotas-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: persist quota options with metadir [v5.5 07/10] Store the quota files in the metadata directory tree instead of the superblock. Since we're introducing a new incompat feature flag, let's also make the mount process bring up quotas in whatever state they were when the filesystem was last unmounted, instead of requiring sysadmins to remember that themselves. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'realtime-groups-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: shard the realtime section [v5.5 06/10] Right now, the realtime section uses a single pair of metadata inodes to store the free space information. This presents a scalability problem since every thread trying to allocate or free rt extents have to lock these files. Solve this problem by sharding the realtime section into separate realtime allocation groups. While we're at it, define a superblock to be stamped into the start of the rt section. This enables utilities such as blkid to identify block devices containing realtime sections, and avoids the situation where anything written into block 0 of the realtime extent can be misinterpreted as file data. The best advantage for rtgroups will become evident later when we get to adding rmap and reflink to the realtime volume, since the geometry constraints are the same for rt groups and AGs. Hence we can reuse all that code directly. This is a very large patchset, but it catches us up with 20 years of technical debt that have accumulated. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'rtgroups-prep-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: preparation for realtime allocation groups [v5.5 05/10] Prepare for realtime groups by adding a few bug fixes and generic code that will be necessary. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'incore-rtgroups-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: create incore rt allocation groups [v5.5 04/10] Add in-memory data structures for sharding the realtime volume into independent allocation groups. For existing filesystems, the entire rt volume is modelled as having a single large group, with (potentially) a number of rt extents exceeding 2^32 blocks, though these are not likely to exist because the codebase has been a bit broken for decades. The next series fills in the ondisk format and other supporting structures. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'metadata-directory-tree-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: metadata inode directory trees [v5.5 03/10] This series delivers a new feature -- metadata inode directories. This is a separate directory tree (rooted in the superblock) that contains only inodes that contain filesystem metadata. Different metadata objects can be looked up with regular paths. Start by creating xfs_imeta{dir,file}* functions to mediate access to the metadata directory tree. By the end of this mega series, all existing metadata inodes (rt+quota) will use this directory tree instead of the superblock. Next, define the metadir on-disk format, which consists of marking inodes with a new iflag that says they're metadata. This prevents bulkstat and friends from ever getting their hands on fs metadata files. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'generic-groups-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: create a generic allocation group structure [v5.5 02/10] Soon we'll be sharding the realtime volume into separate allocation groups. These rt groups will /mostly/ behave the same as the ones on the data device, but since rt groups don't have quite the same set of struct fields as perags, let's hoist the parts that will be shared by both into a common xfs_group object. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-12Merge tag 'perag-xarray-6.13_2024-11-05' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into staging-merge xfs: convert perag to use xarrays [v5.5 01/10] Convert the xfs_mount perag tree to use an xarray instead of a radix tree. There should be no functional changes here. With a bit of luck, this should all go splendidly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-11mm/list_lru: simplify the list_lru walk callback functionKairui Song
Now isolation no longer takes the list_lru global node lock, only use the per-cgroup lock instead. And this lock is inside the list_lru_one being walked, no longer needed to pass the lock explicitly. Link: https://lkml.kernel.org/r/20241104175257.60853-7-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11mm/list_lru: split the lock to per-cgroup scopeKairui Song
Currently, every list_lru has a per-node lock that protects adding, deletion, isolation, and reparenting of all list_lru_one instances belonging to this list_lru on this node. This lock contention is heavy when multiple cgroups modify the same list_lru. This lock can be split into per-cgroup scope to reduce contention. To achieve this, we need a stable list_lru_one for every cgroup. This commit adds a lock to each list_lru_one and introduced a helper function lock_list_lru_of_memcg, making it possible to pin the list_lru of a memcg. Then reworked the reparenting process. Reparenting will switch the list_lru_one instances one by one. By locking each instance and marking it dead using the nr_items counter, reparenting ensures that all items in the corresponding cgroup (on-list or not, because items have a stable cgroup, see below) will see the list_lru_one switch synchronously. Objcg reparent is also moved after list_lru reparent so items will have a stable mem cgroup until all list_lru_one instances are drained. The only caller that doesn't work the *_obj interfaces are direct calls to list_lru_{add,del}. But it's only used by zswap and that's also based on objcg, so it's fine. This also changes the bahaviour of the isolation function when LRU_RETRY or LRU_REMOVED_RETRY is returned, because now releasing the lock could unblock reparenting and free the list_lru_one, isolation function will have to return withoug re-lock the lru. prepare() { mkdir /tmp/test-fs modprobe brd rd_nr=1 rd_size=33554432 mkfs.xfs -f /dev/ram0 mount -t xfs /dev/ram0 /tmp/test-fs for i in $(seq 1 512); do mkdir "/tmp/test-fs/$i" for j in $(seq 1 10240); do echo TEST-CONTENT > "/tmp/test-fs/$i/$j" done & done; wait } do_test() { read_worker() { sleep 1 tar -cv "$1" &>/dev/null } read_in_all() { cd "/tmp/test-fs" && ls for i in $(seq 1 512); do (exec sh -c 'echo "$PPID"') > "/sys/fs/cgroup/benchmark/$i/cgroup.procs" read_worker "$i" & done; wait } for i in $(seq 1 512); do mkdir -p "/sys/fs/cgroup/benchmark/$i" done echo +memory > /sys/fs/cgroup/benchmark/cgroup.subtree_control echo 512M > /sys/fs/cgroup/benchmark/memory.max echo 3 > /proc/sys/vm/drop_caches time read_in_all } Above script simulates compression of small files in multiple cgroups with memory pressure. Run prepare() then do_test for 6 times: Before: real 0m7.762s user 0m11.340s sys 3m11.224s real 0m8.123s user 0m11.548s sys 3m2.549s real 0m7.736s user 0m11.515s sys 3m11.171s real 0m8.539s user 0m11.508s sys 3m7.618s real 0m7.928s user 0m11.349s sys 3m13.063s real 0m8.105s user 0m11.128s sys 3m14.313s After this commit (about ~15% faster): real 0m6.953s user 0m11.327s sys 2m42.912s real 0m7.453s user 0m11.343s sys 2m51.942s real 0m6.916s user 0m11.269s sys 2m43.957s real 0m6.894s user 0m11.528s sys 2m45.346s real 0m6.911s user 0m11.095s sys 2m43.168s real 0m6.773s user 0m11.518s sys 2m40.774s Link: https://lkml.kernel.org/r/20241104175257.60853-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-05xfs: port ondisk structure checks from xfs/122 to the kernelDarrick J. Wong
Check this with every kernel and userspace build, so we can drop the nonsense in xfs/122. Roughly drafted with: sed -e 's/^offsetof/\tXFS_CHECK_OFFSET/g' \ -e 's/^sizeof/\tXFS_CHECK_STRUCT_SIZE/g' \ -e 's/ = \([0-9]*\)/,\t\t\t\1);/g' \ -e 's/xfs_sb_t/struct xfs_dsb/g' \ -e 's/),/,/g' \ -e 's/xfs_\([a-z0-9_]*\)_t,/struct xfs_\1,/g' \ < tests/xfs/122.out | sort and then manual fixups. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: separate space btree structures in xfs_ondisk.hDarrick J. Wong
Create a separate section for space management btrees so that they're not mixed in with file structures. Ignore the dsb stuff sprinkled around for now, because we'll deal with that in a subsequent patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: convert struct typedefs in xfs_ondisk.hDarrick J. Wong
Replace xfs_foo_t with struct xfs_foo where appropriate. The next patch will import more checks from xfs/122, and it's easier to automate deduplication if we don't have to reason about typedefs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: enable metadata directory featureDarrick J. Wong
Enable the metadata directory feature. With this feature, all metadata inodes are placed in the metadata directory, and the only inumbers in the superblock are the roots of the two directory trees. The RT device is now sharded into a number of rtgroups, where 0 rtgroups mean that no RT extents are supported, and the traditional XFS stub RT bitmap and summary inodes don't exist. A single rtgroup gives roughly identical behavior to the traditional RT setup, but now with checksummed and self identifying free space metadata. For quota, the quota options are read from the superblock unless explicitly overridden via mount options. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: enable realtime quota againDarrick J. Wong
Enable quotas for the realtime device. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: update sb field checks when metadir is turned onDarrick J. Wong
When metadir is enabled, we want to check the two new rtgroups fields, and we don't want to check the old inumbers that are now in the metadir. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: reserve quota for realtime files correctlyDarrick J. Wong
Fix xfs_quota_reserve_blkres to reserve rt block quota whenever we're dealing with a realtime file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: create quota preallocation watermarks for realtime quotaDarrick J. Wong
Refactor the quota preallocation watermarking code so that it'll work for realtime quota too. Convert the do_div calls into div_u64 for compactness. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: report realtime block quota limits on realtime directoriesDarrick J. Wong
On the data device, calling statvfs on a projinherit directory results in the block and avail counts being curtailed to the project quota block limits, if any are set. Do the same for realtime files or directories, only use the project quota rt block limits. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: persist quota flags with metadirDarrick J. Wong
It's annoying that one has to keep reminding XFS about what quota options it should mount with, since the quota flags recording the previous state are sitting right there in the primary superblock. Even more strangely, there exists a noquota option to disable quotas completely, so it's odder still that providing no options is the same as noquota. Starting with metadir, let's change the behavior so that if the user does not specify any quota-related mount options at all, the ondisk quota flags will be used to bring up quota. In other words, the filesystem will mount in the same state and with the same functionality as it had during the last mount. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: advertise realtime quota support in the xqm stat filesDarrick J. Wong
Add a fifth column to this (really old) stat file to advertise that the kernel supports quota for realtime volumes. This will be used by fstests to detect kernel support. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: scrub quota file metapathsDarrick J. Wong
Enable online fsck for quota file metadata directory paths. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: fix chown with rt quotaDarrick J. Wong
Make chown's quota adjustments work with realtime files. This is mostly a matter of calling xfs_inode_count_blocks on a given file to figure out the number of blocks allocated to the data device and to the realtime device, and using those quantities to update the quota accounting when the id changes. Delayed allocation reservations are moved from the old dquot's incore reservation to the new dquot's incore reservation. Note that there was a missing ILOCK bug in xfs_qm_dqusage_adjust that we must fix before calling xfs_iread_extents. Prior to 2.6.37 the locking was correct, but then someone removed the ILOCK as part of a cleanup. Nobody noticed because nowhere in the git history have we ever supported rt+quota so nobody can use this. I'm leaving git breadcrumbs in case anyone is desperate enough to try to backport the rtquota code to old kernels. Not-Cc: <stable@vger.kernel.org> # v2.6.37 Fixes: 52fda114249578 ("xfs: simplify xfs_qm_dqusage_adjust") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: use metadir for quota inodesDarrick J. Wong
Store the quota inodes in the /quota metadata directory if metadir is enabled. This enables us to stop using the sb_[ugp]uotino fields in the superblock. From this point on, all metadata files will be children of the metadata directory tree root. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: refactor xfs_qm_destroy_quotainosDarrick J. Wong
Reuse this function instead of open-coding the logic. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: use rtgroup busy extent list for FITRIMDarrick J. Wong
For filesystems that have rtgroups and hence use the busy extent list for freed rt space, use that busy extent list so that FITRIM can issue discard commands asynchronously without worrying about other callers accidentally allocating and using space that is being discarded. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: implement busy extent tracking for rtgroupsDarrick J. Wong
For rtgroups filesystems, track newly freed (rt) space through the log until the rt EFIs have been committed to disk. This way we ensure that space cannot be reused until all traces of the old owner are gone. As a fringe benefit, we now support -o discard on the realtime device. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: port the perag discard code to handle generic groupsDarrick J. Wong
Port xfs_discard_extents and its tracepoints to handle generic groups instead of just perags. This is needed to enable busy extent tracking for rtgroups. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: move the min and max group block numbers to xfs_groupDarrick J. Wong
Move the min and max agblock numbers to the generic xfs_group structure so that we can start building validators for extents within an rtgroup. While we're at it, use check_add_overflow for the extent length computation because that has much better overflow checking. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: adjust min_block usage in xfs_verify_agbnoDarrick J. Wong
There's some weird logic in xfs_verify_agbno -- min_block ought to be the first agblock number in the AG that can be used by non-static metadata. However, we initialize it to the last agblock of the static metadata, which works due to the <= check, even though this isn't technically correct. Change the check to < and set min_block to the next agblock past the static metadata. This hasn't been an issue up to now, but we're going to move these things into the generic group struct, and this will cause problems with rtgroups, where min_block can be zero for an rtgroup that doesn't have a rt superblock. Note that there's no user-visible impact with the old logic, so this isn't a bug fix. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: make xfs_rtblock_t a segmented address like xfs_fsblock_tDarrick J. Wong
Now that we've finished adding allocation groups to the realtime volume, let's make the file block mapping address (xfs_rtblock_t) a segmented value just like we do on the data device. This means that group number and block number conversions can be done with shifting and masking instead of integer division. While in theory we could continue caching the rgno shift value in m_rgblklog, the fact that we now always use the shift value means that we have an opportunity to increase the redundancy of the rt geometry by storing it in the ondisk superblock and adding more sb verifier code. Extend the sueprblock to store the rgblklog value. Now that we have segmented addresses, set the correct values in m_groups[XG_TYPE_RTG] so that the xfs_group helpers work correctly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: create helpers to deal with rounding xfs_filblks_t to rtx boundariesDarrick J. Wong
We're about to segment xfs_rtblock_t addresses, so we must create type-specific helpers to do rt extent rounding of file mapping block lengths because the rtb helpers soon will not do the right thing there. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: create helpers to deal with rounding xfs_fileoff_t to rtx boundariesDarrick J. Wong
We're about to segment xfs_rtblock_t addresses, so we must create type-specific helpers to do rt extent rounding of file block offsets because the rtb helpers soon will not do the right thing there. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: mask off the rtbitmap and summary inodes when metadir in useDarrick J. Wong
Set the rtbitmap and summary file inumbers to NULLFSINO in the superblock and make sure they're zeroed whenever we write the superblock to disk, to mimic mkfs behavior. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-11-05xfs: scrub metadir paths for rtgroup metadataDarrick J. Wong
Add the code we need to scan the metadata directory paths of rt group metadata files. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>