summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs/xfs_ialloc.c
AgeCommit message (Collapse)Author
2023-07-03xfs: AGI length should be bounds checkedDarrick J. Wong
Similar to the recent patch strengthening the AGF agf_length verification, the AGI verifier does not check that the AGI length field is within known good bounds. This isn't currently checked by runtime kernel code, yet we assume in many places that it is correct and verify other metadata against it. Add length verification to the AGI verifier. Just like the AGF length checking, the length of the AGI must be equal to the size of the AG specified in the superblock, unless it is the last AG in the filesystem. In that case, it must be less than or equal to sb->sb_agblocks and greater than XFS_MIN_AG_BLOCKS, which is the smallest AG a growfs operation will allow to exist. There's only one place in the filesystem that actually uses agi_length, but let's not leave it vulnerable to the same weird nonsense that generates syzbot bugs, eh? Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-06-29xfs: use deferred frees for btree block freeingDave Chinner
Btrees that aren't freespace management trees use the normal extent allocation and freeing routines for their blocks. Hence when a btree block is freed, a direct call to xfs_free_extent() is made and the extent is immediately freed. This puts the entire free space management btrees under this path, so we are stacking btrees on btrees in the call stack. The inobt, finobt and refcount btrees all do this. However, the bmap btree does not do this - it calls xfs_free_extent_later() to defer the extent free operation via an XEFI and hence it gets processed in deferred operation processing during the commit of the primary transaction (i.e. via intent chaining). We need to change xfs_free_extent() to behave in a non-blocking manner so that we can avoid deadlocks with busy extents near ENOSPC in transactions that free multiple extents. Inserting or removing a record from a btree can cause a multi-level tree merge operation and that will free multiple blocks from the btree in a single transaction. i.e. we can call xfs_free_extent() multiple times, and hence the btree manipulation transaction is vulnerable to this busy extent deadlock vector. To fix this, convert all the remaining callers of xfs_free_extent() to use xfs_free_extent_later() to queue XEFIs and hence defer processing of the extent frees to a context that can be safely restarted if a deadlock condition is detected. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2023-06-05xfs: validate block number being freed before adding to xefiDave Chinner
Bad things happen in defered extent freeing operations if it is passed a bad block number in the xefi. This can come from a bogus agno/agbno pair from deferred agfl freeing, or just a bad fsbno being passed to __xfs_free_extent_later(). Either way, it's very difficult to diagnose where a null perag oops in EFI creation is coming from when the operation that queued the xefi has already been completed and there's no longer any trace of it around.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
2023-04-11xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan resultsDarrick J. Wong
Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill scan results because for a given range of inode numbers, we might have no indexed inodes at all; the entire region might be allocated ondisk inodes; or there might be a mix of the two. Unfortunately, sparse inodes adds to the complexity, because each inode record can have holes, which means that we cannot use the generic btree _scan_keyfill function because we must look for holes in individual records to decide the result. On the plus side, online fsck can now detect sub-chunk discrepancies in the inobt. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11xfs: remove pointless shadow variable from xfs_difree_inobtDarrick J. Wong
In xfs_difree_inobt, the pag passed in was previously used to look up the AGI buffer. There's no need to extract it again, so remove the shadow variable and shut up -Wshadow. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11xfs: hoist inode record alignment checks from scrubDarrick J. Wong
Move the inobt record alignment checks from xchk_iallocbt_rec into xfs_inobt_check_irec so that they are applied everywhere. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11xfs: complain about bad records in query_range helpersDarrick J. Wong
For every btree type except for the bmbt, refactor the code that complains about bad records into a helper and make the ->query_range helpers call it so that corruptions found via that avenue are logged. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-04-11xfs: standardize ondisk to incore conversion for inode btreesDarrick J. Wong
Create a xfs_inobt_check_irec function to detect corruption in btree records. Fix all xfs_inobt_btrec_to_irec callsites to call the new helper and bubble up corruption reports. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-02-27xfs: restore old agirotor behaviorDarrick J. Wong
Prior to the removal of xfs_ialloc_next_ag, we would increment the agi rotor and return the *old* value. atomic_inc_return returns the new value, which causes mkfs to allocate the root directory in AG 1. Put back the old behavior (at least for mkfs) by subtracting 1 here. Fixes: 20a5eab49d35 ("xfs: convert xfs_ialloc_next_ag() to an atomic") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-02-13xfs: introduce xfs_alloc_vextent_exact_bno()Dave Chinner
Two of the callers to xfs_alloc_vextent_this_ag() actually want exact block number allocation, not anywhere-in-ag allocation. Split this out from _this_ag() as a first class citizen so no external extent allocation code needs to care about args->type anymore. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: introduce xfs_alloc_vextent_near_bno()Dave Chinner
The remaining callers of xfs_alloc_vextent() are all doing NEAR_BNO allocations. We can replace that function with a new xfs_alloc_vextent_near_bno() function that does this explicitly. We also multiplex NEAR_BNO allocations through xfs_alloc_vextent_this_ag via args->type. Replace all of these with direct calls to xfs_alloc_vextent_near_bno(), too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use xfs_alloc_vextent_this_ag() where appropriateDave Chinner
Change obvious callers of single AG allocation to use xfs_alloc_vextent_this_ag(). Drive the per-ag grabbing out to the callers, too, so that callers with active references don't need to do new lookups just for an allocation in a context that already has a perag reference. The only remaining caller that does single AG allocation through xfs_alloc_vextent() is xfs_bmap_btalloc() with XFS_ALLOCTYPE_NEAR_BNO. That is going to need more untangling before it can be converted cleanly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: introduce xfs_for_each_perag_wrap()Dave Chinner
In several places we iterate every AG from a specific start agno and wrap back to the first AG when we reach the end of the filesystem to continue searching. We don't have a primitive for this iteration yet, so add one for conversion of these algorithms to per-ag based iteration. The filestream AG select code is a mess, and this initially makes it worse. The per-ag selection needs to be driven completely into the filestream code to clean this up and it will be done in a future patch that makes the filestream allocator use active per-ag references correctly. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: perags need atomic operational stateDave Chinner
We currently don't have any flags or operational state in the xfs_perag except for the pagf_init and pagi_init flags. And the agflreset flag. Oh, there's also the pagf_metadata and pagi_inodeok flags, too. For controlling per-ag operations, we are going to need some atomic state flags. Hence add an opstate field similar to what we already have in the mount and log, and convert all these state flags across to atomic bit operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: convert xfs_ialloc_next_ag() to an atomicDave Chinner
This is currently a spinlock lock protected rotor which can be implemented with a single atomic operation. Change it to be more efficient and get rid of the m_agirotor_lock. Noticed while converting the inode allocation AG selection loop to active perag references. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: inobt can use perags in many more places than it doesDave Chinner
Lots of code in the inobt infrastructure is passed both xfs_mount and perags. We only need perags for the per-ag inode allocation code, so reduce the duplication by passing only the perags as the primary object. This ends up reducing the code size by a bit: text data bss dec hex filename orig 1138878 323979 548 1463405 16546d (TOTALS) patched 1138709 323979 548 1463236 1653c4 (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: use active perag references for inode allocationDave Chinner
Convert the inode allocation routines to use active perag references or references held by callers rather than grab their own. Also drive the perag further inwards to replace xfs_mounts when doing operations on a specific AG. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-13xfs: convert xfs_imap() to take a peragDave Chinner
Callers have referenced perags but they don't pass it into xfs_imap() so it takes it's own reference. Fix that so we can change inode allocation over to using active references. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2023-02-11xfs: prefer free inodes at ENOSPC over chunk allocationDave Chinner
When an XFS filesystem has free inodes in chunks already allocated on disk, it will still allocate new inode chunks if the target AG has no free inodes in it. Normally, this is a good idea as it preserves locality of all the inodes in a given directory. However, at ENOSPC this can lead to using the last few remaining free filesystem blocks to allocate a new chunk when there are many, many free inodes that could be allocated without consuming free space. This results in speeding up the consumption of the last few blocks and inode create operations then returning ENOSPC when there free inodes available because we don't have enough block left in the filesystem for directory creation reservations to proceed. Hence when we are near ENOSPC, we should be attempting to preserve the remaining blocks for directory block allocation rather than using them for unnecessary inode chunk creation. This particular behaviour is exposed by xfs/294, when it drives to ENOSPC on empty file creation whilst there are still thousands of free inodes available for allocation in other AGs in the filesystem. Hence, when we are within 1% of ENOSPC, change the inode allocation behaviour to prefer to use existing free inodes over allocating new inode chunks, even though it results is poorer locality of the data set. It is more important for the allocations to be space efficient near ENOSPC than to have optimal locality for performance, so lets modify the inode AG selection code to reflect that fact. This allows generic/294 to not only pass with this allocator rework patchset, but to increase the number of post-ENOSPC empty inode allocations to from ~600 to ~9080 before we hit ENOSPC on the directory create transaction reservation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-11-18treewide: use get_random_u32_below() instead of deprecated functionJason A. Donenfeld
This is a simple mechanical transformation done by: @@ expression E; @@ - prandom_u32_max + get_random_u32_below (E) Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11treewide: use get_random_u32() when possibleJason A. Donenfeld
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11treewide: use prandom_u32_max() when possible, part 1Jason A. Donenfeld
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done mechanically with this coccinelle script: @basic@ expression E; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; typedef u64; @@ ( - ((T)get_random_u32() % (E)) + prandom_u32_max(E) | - ((T)get_random_u32() & ((E) - 1)) + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2) | - ((u64)(E) * get_random_u32() >> 32) + prandom_u32_max(E) | - ((T)get_random_u32() & ~PAGE_MASK) + prandom_u32_max(PAGE_SIZE) ) @multi_line@ identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; identifier RAND; expression E; @@ - RAND = get_random_u32(); ... when != RAND - RAND %= (E); + RAND = prandom_u32_max(E); // Find a potential literal @literal_mask@ expression LITERAL; type T; identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32"; position p; @@ ((T)get_random_u32()@p & (LITERAL)) // Add one to the literal. @script:python add_one@ literal << literal_mask.LITERAL; RESULT; @@ value = None if literal.startswith('0x'): value = int(literal, 16) elif literal[0] in '123456789': value = int(literal, 10) if value is None: print("I don't know how to handle %s" % (literal)) cocci.include_match(False) elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1: print("Skipping 0x%x for cleanup elsewhere" % (value)) cocci.include_match(False) elif value & (value + 1) != 0: print("Skipping 0x%x because it's not a power of two minus one" % (value)) cocci.include_match(False) elif literal.startswith('0x'): coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1)) else: coccinelle.RESULT = cocci.make_expr("%d" % (value + 1)) // Replace the literal mask with the calculated result. @plus_one@ expression literal_mask.LITERAL; position literal_mask.p; expression add_one.RESULT; identifier FUNC; @@ - (FUNC()@p & (LITERAL)) + prandom_u32_max(RESULT) @collapse_ret@ type T; identifier VAR; expression E; @@ { - T VAR; - VAR = (E); - return VAR; + return E; } @drop_var@ type T; identifier VAR; @@ { - T VAR; ... when != VAR } Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-07-07xfs: make is_log_ag() a first class helperDave Chinner
We check if an ag contains the log in many places, so make this a first class XFS helper by lifting it to fs/xfs/libxfs/xfs_ag.h and renaming it xfs_ag_contains_log(). The convert all the places that check if the AG contains the log to use this helper. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: Pre-calculate per-AG agino geometryDave Chinner
There is a lot of overhead in functions like xfs_verify_agino() that repeatedly calculate the geometry limits of an AG. These can be pre-calculated as they are static and the verification context has a per-ag context it can quickly reference. In the case of xfs_verify_agino(), we now always have a perag context handy, so we can store the minimum and maximum agino values in the AG in the perag. This means we don't have to calculate it on every call and it can be inlined in callers if we move it to xfs_ag.h. xfs_verify_agino_or_null() gets the same perag treatment. xfs_agino_range() is moved to xfs_ag.c as it's not really a type function, and it's use is largely restricted as the first and last aginos can be grabbed straight from the perag in most cases. Note that we leave the original xfs_verify_agino in place in xfs_types.c as a static function as other callers in that file do not have per-ag contexts so still need to go the long way. It's been renamed to xfs_verify_agno_agino() to indicate it takes both an agno and an agino to differentiate it from new function. $ size --totals fs/xfs/built-in.a text data bss dec hex filename before 1482185 329588 572 1812345 1ba779 (TOTALS) after 1481937 329588 572 1812097 1ba681 (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: pass perag to xfs_read_agiDave Chinner
We have the perag in most palces we call xfs_read_agi, so pass the perag instead of a mount/agno pair. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: pass perag to xfs_alloc_read_agf()Dave Chinner
xfs_alloc_read_agf() initialises the perag if it hasn't been done yet, so it makes sense to pass it the perag rather than pull a reference from the buffer. This allows callers to be per-ag centric rather than passing mount/agno pairs everywhere. Whilst modifying the xfs_reflink_find_shared() function definition, declare it static and remove the extern declaration as it is an internal function only these days. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: kill xfs_alloc_pagf_init()Dave Chinner
Trivial wrapper around xfs_alloc_read_agf(), can be easily replaced by passing a NULL agfbp to xfs_alloc_read_agf(). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: pass perag to xfs_ialloc_read_agi()Dave Chinner
xfs_ialloc_read_agi() initialises the perag if it hasn't been done yet, so it makes sense to pass it the perag rather than pull a reference from the buffer. This allows callers to be per-ag centric rather than passing mount/agno pairs everywhere. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-07-07xfs: kill xfs_ialloc_pagi_init()Dave Chinner
This is just a basic wrapper around xfs_ialloc_read_agi(), which can be entirely handled by xfs_ialloc_read_agi() by passing a NULL agibpp.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-04-21Merge tag 'large-extent-counters-v9' of https://github.com/chandanr/linux ↵Dave Chinner
into xfs-5.19-for-next xfs: Large extent counters The commit xfs: fix inode fork extent count overflow (3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) mentions that 10 billion data fork extents should be possible to create. However the corresponding on-disk field has a signed 32-bit type. Hence this patchset extends the per-inode data fork extent counter to 64 bits (out of which 48 bits are used to store the extent count). Also, XFS has an attribute fork extent counter which is 16 bits wide. A workload that, 1. Creates 1 million 255-byte sized xattrs, 2. Deletes 50% of these xattrs in an alternating manner, 3. Tries to insert 400,000 new 255-byte sized xattrs causes the xattr extent counter to overflow. Dave tells me that there are instances where a single file has more than 100 million hardlinks. With parent pointers being stored in xattrs, we will overflow the signed 16-bits wide attribute extent counter when large number of hardlinks are created. Hence this patchset extends the on-disk field to 32-bits. The following changes are made to accomplish this, 1. A 64-bit inode field is carved out of existing di_pad and di_flushiter fields to hold the 64-bit data fork extent counter. 2. The existing 32-bit inode data fork extent counter will be used to hold the attribute fork extent counter. 3. A new incompat superblock flag to prevent older kernels from mounting the filesystem. Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-21xfs: convert AGI log flags to unsigned.Dave Chinner
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned fields to be unsigned. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2022-04-11xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpersChandan Babu R
This commit adds the new per-inode flag XFS_DIFLAG2_NREXT64 to indicate that an inode supports 64-bit extent counters. This flag is also enabled by default on newly created inodes when the corresponding filesystem has large extent counter feature bit (i.e. XFS_FEAT_NREXT64) set. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22xfs: rename xfs_bmap_add_free to xfs_free_extent_laterDarrick J. Wong
xfs_bmap_add_free isn't a block mapping function; it schedules deferred freeing operations for a later point in a compound transaction chain. While it's primarily used by bunmapi, its use has expanded beyond that. Move it to xfs_alloc.c and rename the function since it's now general freeing functionality. Bring the slab cache bits in line with the way we handle the other intent items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-19xfs: compute absolute maximum nlevels for each btree typeDarrick J. Wong
Add code for all five btree types so that we can compute the absolute maximum possible btree height for each btree type. This is a setup for the next patch, which makes every btree type have its own cursor cache. The functions are exported so that we can have xfs_db report the absolute maximum btree heights for each btree type, rather than making everyone run their own ad-hoc computations. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-19xfs: kill xfs_sb_version_has_v3inode()Dave Chinner
All callers to xfs_dinode_good_version() and XFS_DINODE_SIZE() in both the kernel and userspace have a xfs_mount structure available which means they can use mount features checks instead looking directly are the superblock. Convert these functions to take a mount and use a xfs_has_v3inodes() check and move it out of the libxfs/xfs_format.h file as it really doesn't have anything to do with the definition of the on-disk format. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: convert xfs_sb_version_has checks to use mount featuresDave Chinner
This is a conversion of the remaining xfs_sb_version_has..(sbp) checks to use xfs_has_..(mp) feature checks. This was largely done with a vim replacement macro that did: :0,$s/xfs_sb_version_has\(.*\)&\(.*\)->m_sb/xfs_has_\1\2/g<CR> A couple of other variants were also used, and the rest touched up by hand. $ size -t fs/xfs/built-in.a text data bss dec hex filename before 1127533 311352 484 1439369 15f689 (TOTALS) after 1125360 311352 484 1437196 15ee0c (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdownDave Chinner
Remove the shouty macro and instead use the inline function that matches other state/feature check wrapper naming. This conversion was done with sed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: convert mount flags to featuresDave Chinner
Replace m_flags feature checks with xfs_has_<feature>() calls and rework the setup code to set flags in m_features. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: replace xfs_sb_version checks with feature flag checksDave Chinner
Convert the xfs_sb_version_hasfoo() to checks against mp->m_features. Checks of the superblock itself during disk operations (e.g. in the read/write verifiers and the to/from disk formatters) are not converted - they operate purely on the superblock state. Everything else should use the mount features. Large parts of this conversion were done with sed with commands like this: for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f done With manual cleanups for things like "xfs_has_extflgbit" and other little inconsistencies in naming. The result is ia lot less typing to check features and an XFS binary size reduced by a bit over 3kB: $ size -t fs/xfs/built-in.a text data bss dec hex filenam before 1130866 311352 484 1442702 16038e (TOTALS) after 1127727 311352 484 1439563 15f74b (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-18xfs: make the record pointer passed to query_range functions constDarrick J. Wong
The query_range functions are supposed to call a caller-supplied function on each record found in the dataset. These functions don't own the memory storing the record, so don't let them change the record. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-08-09xfs: fix silly whitespace problems with kernel libxfsDarrick J. Wong
Fix a few whitespace errors such as spaces at the end of the line, etc. This gets us back to something more closely resembling parity. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-07-15xfs: check for sparse inode clusters that cross new EOAG when shrinkingDarrick J. Wong
While running xfs/168, I noticed occasional write verifier shutdowns involving inodes at the very end of the filesystem. Existing inode btree validation code checks that all inode clusters are fully contained within the filesystem. However, due to inadequate checking in the fs shrink code, it's possible that there could be a sparse inode cluster at the end of the filesystem where the upper inodes of the cluster are marked as holes and the corresponding blocks are free. In this case, the last blocks in the AG are listed in the bnobt. This enables the shrink to proceed but results in a filesystem that trips the inode verifiers. Fix this by disallowing the shrink. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-06-18xfs: perag may be null in xfs_imap()Dave Chinner
Dan Carpenter's static checker reported: The patch 7b13c5155182: "xfs: use perag for ialloc btree cursors" from Jun 2, 2021, leads to the following Smatch complaint: fs/xfs/libxfs/xfs_ialloc.c:2403 xfs_imap() error: we previously assumed 'pag' could be null (see line 2294) And it's right. Fix it. Fixes: 7b13c5155182 ("xfs: use perag for ialloc btree cursors") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-06-08xfs: drop the AGI being passed to xfs_check_agi_freecountDave Chinner
From: Dave Chinner <dchinner@redhat.com> Stephen Rothwell reported this compiler warning from linux-next: fs/xfs/libxfs/xfs_ialloc.c: In function 'xfs_difree_finobt': fs/xfs/libxfs/xfs_ialloc.c:2032:20: warning: unused variable 'agi' [-Wunused-variable] 2032 | struct xfs_agi *agi = agbp->b_addr; Which is fallout from agno -> perag conversions that were done in this function. xfs_check_agi_freecount() is the only user of "agi" in xfs_difree_finobt() now, and it only uses the agi to get the current free inode count. We hold that in the perag structure, so there's not need to directly reference the raw AGI to get this information. The btree cursor being passed to xfs_check_agi_freecount() has a reference to the perag being operated on, so use that directly in xfs_check_agi_freecount() rather than passing an AGI. Fixes: 7b13c5155182 ("xfs: use perag for ialloc btree cursors") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: use perag through unlink processingDave Chinner
Unlinked lists are held in the perag, and freeing of inodes needs to be passed a perag, too, so look up the perag early in the unlink processing and use it throughout. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-06-02xfs: clean up and simplify xfs_dialloc()Dave Chinner
Because it's a mess. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: inode allocation can use a single perag instanceDave Chinner
Now that we've internalised the two-phase inode allocation, we can now easily make the AG selection and allocation atomic from the perspective of a single perag context. This will ensure AGs going offline/away cannot occur between the selection and allocation steps. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: get rid of xfs_dir_ialloc()Dave Chinner
This is just a simple wrapper around the per-ag inode allocation that doesn't need to exist. The internal mechanism to select and allocate within an AG does not need to be exposed outside xfs_ialloc.c, and it being exposed simply makes it harder to follow the code and simplify it. This is simplified by internalising xf_dialloc_select_ag() and xfs_dialloc_ag() into a single xfs_dialloc() function and then xfs_dir_ialloc() can go away. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: collapse AG selection for inode allocationDave Chinner
xfs_dialloc_select_ag() does a lot of repetitive work. It first calls xfs_ialloc_ag_select() to select the AG to start allocation attempts in, which can do up to two entire loops across the perags that inodes can be allocated in. This is simply checking if there is spce available to allocate inodes in an AG, and it returns when it finds the first candidate AG. xfs_dialloc_select_ag() then does it's own iterative walk across all the perags locking the AGIs and trying to allocate inodes from the locked AG. It also doesn't limit the search to mp->m_maxagi, so it will walk all AGs whether they can allocate inodes or not. Hence if we are really low on inodes, we could do almost 3 entire walks across the whole perag range before we find an allocation group we can allocate inodes in or report ENOSPC. Because xfs_ialloc_ag_select() returns on the first candidate AG it finds, we can simply do these checks directly in xfs_dialloc_select_ag() before we lock and try to allocate inodes. This reduces the inode allocation pass down to 2 perag sweeps at most - one for aligned inode cluster allocation and if we can't allocate full, aligned inode clusters anywhere we'll do another pass trying to do sparse inode cluster allocation. This also removes a big chunk of duplicate code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: simplify xfs_dialloc_select_ag() return valuesDave Chinner
The only caller of xfs_dialloc_select_ag() will always return -ENOSPC to it's caller if the agbp returned from xfs_dialloc_select_ag() is NULL. IOWs, failure to find a candidate AGI we can allocate inodes from is always an ENOSPC condition, so move this logic up into xfs_dialloc_select_ag() so we can simplify the return logic in this function. xfs_dialloc_select_ag() now only ever returns 0 with a locked agbp, or an error with no agbp. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>