summaryrefslogtreecommitdiff
path: root/include/linux/fs.h
AgeCommit message (Collapse)Author
2024-10-09bcachefs: do not use PF_MEMALLOC_NORECLAIMMichal Hocko
Patch series "remove PF_MEMALLOC_NORECLAIM" v3. This patch (of 2): bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new inode to achieve GFP_NOWAIT semantic while holding locks. If this allocation fails it will drop locks and use GFP_NOFS allocation context. We would like to drop PF_MEMALLOC_NORECLAIM because it is really dangerous to use if the caller doesn't control the full call chain with this flag set. E.g. if any of the function down the chain needed GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and cause unexpected failure. While this is not the case in this particular case using the scoped gfp semantic is not really needed bacause we can easily pus the allocation context down the chain without too much clutter. [akpm@linux-foundation.org: fix kerneldoc warnings] Link: https://lkml.kernel.org/r/20240926172940.167084-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20240926172940.167084-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> # For vfs changes Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: James Morris <jmorris@namei.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Paul Moore <paul@paul-moore.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-27[tree-wide] finally take no_llseek outAl Viro
no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-24Merge tag 'fuse-update-6.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Add support for idmapped fuse mounts (Alexander Mikhalitsyn) - Add optimization when checking for writeback (yangyun) - Add tracepoints (Josef Bacik) - Clean up writeback code (Joanne Koong) - Clean up request queuing (me) - Misc fixes * tag 'fuse-update-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (32 commits) fuse: use exclusive lock when FUSE_I_CACHE_IO_MODE is set fuse: clear FR_PENDING if abort is detected when sending request fs/fuse: convert to use invalid_mnt_idmap fs/mnt_idmapping: introduce an invalid_mnt_idmap fs/fuse: introduce and use fuse_simple_idmap_request() helper fs/fuse: fix null-ptr-deref when checking SB_I_NOIDMAP flag fuse: allow O_PATH fd for FUSE_DEV_IOC_BACKING_OPEN virtio_fs: allow idmapped mounts fuse: allow idmapped mounts fuse: warn if fuse_access is called when idmapped mounts are allowed fuse: handle idmappings properly in ->write_iter() fuse: support idmapped ->rename op fuse: support idmapped ->set_acl fuse: drop idmap argument from __fuse_get_acl fuse: support idmapped ->setattr op fuse: support idmapped ->permission inode op fuse: support idmapped getattr inode op fuse: support idmap for mkdir/mknod/symlink/create/tmpfile fuse: support idmapped FUSE_EXT_GROUPS fuse: add an idmap argument to fuse_simple_request ...
2024-09-23Merge tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs updates from Kent Overstreet: - rcu_pending, btree key cache rework: this solves lock contenting in the key cache, eliminating the biggest source of the srcu lock hold time warnings, and drastically improving performance on some metadata heavy workloads - on multithreaded creates we're now 3-4x faster than xfs. - We're now using an rhashtable instead of the system inode hash table; this is another significant performance improvement on multithreaded metadata workloads, eliminating more lock contention. - for_each_btree_key_in_subvolume_upto(): new helper for iterating over keys within a specific subvolume, eliminating a lot of open coded "subvolume_get_snapshot()" and also fixing another source of srcu lock time warnings, by running each loop iteration in its own transaction (as the existing for_each_btree_key() does). - More work on btree_trans locking asserts; we now assert that we don't hold btree node locks when trans->locked is false, which is important because we don't use lockdep for tracking individual btree node locks. - Some cleanups and improvements in the bset.c btree node lookup code, from Alan. - Rework of btree node pinning, which we use in backpointers fsck. The old hacky implementation, where the shrinker just skipped over nodes in the pinned range, was causing OOMs; instead we now use another shrinker with a much higher seeks number for pinned nodes. - Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue where rebalance would sometimes fall back to allocating from the full filesystem, which is not what we want when it's trying to move data to a specific target. - Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache allocations. - Idmap mounts are now supported (Hongbo Li) - Rename whiteouts are now supported (Hongbo Li) - Erasure coding can now handle devices being marked as failed, or forcibly removed. We still need the evacuate path for erasure coding, but it's getting very close to ready for people to start using. * tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs: (99 commits) bcachefs: return err ptr instead of null in read sb clean bcachefs: Remove duplicated include in backpointers.c bcachefs: Don't drop devices with stripe pointers bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices bcachefs: bch_fs.rw_devs_change_count bcachefs: bch2_dev_remove_stripes() bcachefs: bch2_trigger_ptr() calculates sectors even when no device bcachefs: improve error messages in bch2_ec_read_extent() bcachefs: improve error message on too few devices for ec bcachefs: improve bch2_new_stripe_to_text() bcachefs: ec_stripe_head.nr_created bcachefs: bch_stripe.disk_label bcachefs: stripe_to_mem() bcachefs: EIO errcode cleanup bcachefs: Rework btree node pinning bcachefs: split up btree cache counters for live, freeable bcachefs: btree cache counters should be size_t bcachefs: Don't count "skipped access bit" as touched in btree cache scan bcachefs: Failed devices no longer require mounting in degraded mode bcachefs: bch2_dev_rcu_noerror() ...
2024-09-21Merge tag 'mm-stable-2024-09-20-02-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ...
2024-09-16Merge tag 'vfs-6.12.file' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file updates from Christian Brauner: "This is the work to cleanup and shrink struct file significantly. Right now, (focusing on x86) struct file is 232 bytes. After this series struct file will be 184 bytes aka 3 cacheline and a spare 8 bytes for future extensions at the end of the struct. With struct file being as ubiquitous as it is this should make a difference for file heavy workloads and allow further optimizations in the future. - struct fown_struct was embedded into struct file letting it take up 32 bytes in total when really it shouldn't even be embedded in struct file in the first place. Instead, actual users of struct fown_struct now allocate the struct on demand. This frees up 24 bytes. - Move struct file_ra_state into the union containg the cleanup hooks and move f_iocb_flags out of the union. This closes a 4 byte hole we created earlier and brings struct file to 192 bytes. Which means struct file is 3 cachelines and we managed to shrink it by 40 bytes. - Reorder struct file so that nothing crosses a cacheline. I suspect that in the future we will end up reordering some members to mitigate false sharing issues or just because someone does actually provide really good perf data. - Shrinking struct file to 192 bytes is only part of the work. Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache is created with SLAB_TYPESAFE_BY_RCU the free pointer must be located outside of the object because the cache doesn't know what part of the memory can safely be overwritten as it may be needed to prevent object recycling. That has the consequence that SLAB_TYPESAFE_BY_RCU may end up adding a new cacheline. So this also contains work to add a new kmem_cache_create_rcu() function that allows the caller to specify an offset where the freelist pointer is supposed to be placed. Thus avoiding the implicit addition of a fourth cacheline. - And finally this removes the f_version member in struct file. The f_version member isn't particularly well-defined. It is mainly used as a cookie to detect concurrent seeks when iterating directories. But it is also abused by some subsystems for completely unrelated things. It is mostly a directory and filesystem specific thing that doesn't really need to live in struct file and with its wonky semantics it really lacks a specific function. For pipes, f_version is (ab)used to defer poll notifications until a write has happened. And struct pipe_inode_info is used by multiple struct files in their ->private_data so there's no chance of pushing that down into file->private_data without introducing another pointer indirection. But pipes don't rely on f_pos_lock so this adds a union into struct file encompassing f_pos_lock and a pipe specific f_pipe member that pipes can use. This union of course can be extended to other file types and is similar to what we do in struct inode already" * tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits) fs: remove f_version pipe: use f_pipe fs: add f_pipe ubifs: store cookie in private data ufs: store cookie in private data udf: store cookie in private data proc: store cookie in private data ocfs2: store cookie in private data input: remove f_version abuse ext4: store cookie in private data ext2: store cookie in private data affs: store cookie in private data fs: add generic_llseek_cookie() fs: use must_set_pos() fs: add must_set_pos() fs: add vfs_setpos_cookie() s390: remove unused f_version ceph: remove unused f_version adi: remove unused f_version mm: Removed @freeptr_offset to prevent doc warning ...
2024-09-16Merge tag 'vfs-6.12.folio' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs folio updates from Christian Brauner: "This contains work to port write_begin and write_end to rely on folios for various filesystems. This converts ocfs2, vboxfs, orangefs, jffs2, hostfs, fuse, f2fs, ecryptfs, ntfs3, nilfs2, reiserfs, minixfs, qnx6, sysv, ufs, and squashfs. After this series lands a bunch of the filesystems in this list do not mention struct page anymore" * tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (61 commits) Squashfs: Ensure all readahead pages have been used Squashfs: Rewrite and update squashfs_readahead_fragment() to not use page->index Squashfs: Update squashfs_readpage_block() to not use page->index Squashfs: Update squashfs_readahead() to not use page->index Squashfs: Update page_actor to not use page->index jffs2: Use a folio in jffs2_garbage_collect_dnode() jffs2: Convert jffs2_do_readpage_nolock to take a folio buffer: Convert __block_write_begin() to take a folio ocfs2: Convert ocfs2_write_zero_page to use a folio fs: Convert aops->write_begin to take a folio fs: Convert aops->write_end to take a folio vboxsf: Use a folio in vboxsf_write_end() orangefs: Convert orangefs_write_begin() to use a folio orangefs: Convert orangefs_write_end() to use a folio jffs2: Convert jffs2_write_begin() to use a folio jffs2: Convert jffs2_write_end() to use a folio hostfs: Convert hostfs_write_end() to use a folio fuse: Convert fuse_write_begin() to use a folio fuse: Convert fuse_write_end() to use a folio f2fs: Convert f2fs_write_begin() to use a folio ...
2024-09-16Merge tag 'vfs-6.12.misc' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual pile of misc updates: Features: - Add F_CREATED_QUERY fcntl() that allows userspace to query whether a file was actually created. Often userspace wants to know whether an O_CREATE request did actually create a file without using O_EXCL. The current logic is that to first attempts to open the file without O_CREAT | O_EXCL and if ENOENT is returned userspace tries again with both flags. If that succeeds all is well. If it now reports EEXIST it retries. That works fairly well but some corner cases make this more involved. If this operates on a dangling symlink the first openat() without O_CREAT | O_EXCL will return ENOENT but the second openat() with O_CREAT | O_EXCL will fail with EEXIST. The reason is that openat() without O_CREAT | O_EXCL follows the symlink while O_CREAT | O_EXCL doesn't for security reasons. So it's not something we can really change unless we add an explicit opt-in via O_FOLLOW which seems really ugly. All available workarounds are really nasty (fanotify, bpf lsm etc) so add a simple fcntl(). - Try an opportunistic lookup for O_CREAT. Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. This was likely done with the expectation that O_CREAT means that we always expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists (see related F_CREATED_QUERY patch motivation above). The series contained in the pr rearranges the pathwalk-for-open code to also attempt a fast_lookup in certain O_CREAT cases. If a positive dentry is found, the inode_lock can be avoided altogether and it can stay in rcuwalk mode for the last step_into. - Expose the 64 bit mount id via name_to_handle_at() Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call - Add a per dentry expire timeout to autofs There are two fairly well known automounter map formats, the autofs format and the amd format (more or less System V and Berkley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount) . 3) "utimeout=<seconds>", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount) To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info. structure and all indirect mounts use the same expire timeout. Fixes: - Fix missing fput for FSCONFIG_SET_FD in autofs - Use param->file for FSCONFIG_SET_FD in coda - Delete the 'fs/netfs' proc subtreee when netfs module exits - Make sure that struct uid_gid_map fits into a single cacheline - Don't flush in-flight wb switches for superblocks without cgroup writeback - Correcting the idmapping mount example in the idmapping documentation - Fix a race between evice_inodes() and find_inode() and iput() - Refine the show_inode_state() macro definition in writeback code - Prevent dump_mapping() from accessing invalid dentry.d_name.name - Show actual source for debugfs in /proc/mounts - Annotate data-race of busy_poll_usecs in eventpoll - Don't WARN for racy path_noexec check in exec code - Handle OOM on mnt_warn_timestamp_expiry() - Fix some spelling in the iomap design documentation - Fix typo in procfs comment - Fix typo in fs/namespace.c comment Cleanups: - Add the VFS git tree to the MAINTAINERS file - Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode bit in struct file bringing us to 5 free f_mode bits - Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify the wait mechanism - Remove the unused path_put_init() helper - Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi specific - Replace the unsigned long i_state member with a u32 i_state member in struct inode freeing up 4 bytes in struct inode. Instead of using the bit based wait apis we're now using the var event apis and using the individual bytes of the i_state member to wait on state changes - Explain how per-syscall AT_* flags should be allocated - Use in_group_or_capable() helper to simplify the posix acl mode update code - Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code - Removed comment about d_rcu_to_refcount() as that function doesn't exist anymore - Add kernel documentation for lookup_fast() - Don't re-zero evenpoll fields - Remove outdated comment after close_fd() - Fix imprecise wording in comment about the pipe filesystem - Drop GFP_NOFAIL mode from alloc_page_buffers - Missing blank line warnings and struct declaration improved in file_table - Annotate struct poll_list with __counted_by() - Remove the unused read parameter in percpu-rwsem - Remove linux/prefetch.h include from direct-io code - Use kmemdup_array instead of kmemdup for multiple allocation in mnt_idmapping code - Remove unused mnt_cursor_del() declaration Performance tweaks: - Dodge smp_mb in break_lease and break_deleg in the common case - Only read fops once in fops_{get,put}() - Use RCU in ilookup() - Elide smp_mb in iversion handling in the common case - Drop one lock trip in evict()" * tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits) uidgid: make sure we fit into one cacheline proc: Fix typo in the comment fs/pipe: Correct imprecise wording in comment fhandle: expose u64 mount id to name_to_handle_at(2) uapi: explain how per-syscall AT_* flags should be allocated fs: drop GFP_NOFAIL mode from alloc_page_buffers writeback: Refine the show_inode_state() macro definition fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation netfs: Delete subtree of 'fs/netfs' when netfs module exits fs: use LIST_HEAD() to simplify code inode: make i_state a u32 inode: port __I_LRU_ISOLATING to var event vfs: fix race between evice_inodes() and find_inode()&iput() inode: port __I_NEW to var event inode: port __I_SYNC to var event fs: reorder i_state bits fs: add i_state helpers MAINTAINERS: add the VFS git tree fs: s/__u32/u32/ for s_fsnotify_mask ...
2024-09-12fs: remove f_versionChristian Brauner
Now that detecting concurrent seeks is done by the filesystems that require it we can remove f_version and free up 8 bytes for future extensions. Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-20-6d3e4816aa7b@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12fs: add f_pipeChristian Brauner
Only regular files with FMODE_ATOMIC_POS and directories need f_pos_lock. Place a new f_pipe member in a union with f_pos_lock that they can use and make them stop abusing f_version in follow-up patches. Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-18-6d3e4816aa7b@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-09inode: make __iget() a static inlineKent Overstreet
bcachefs is switching to an rhashtable for vfs inodes instead of the standard inode.c hashtable, so we need this exported, or - a static inline makes more sense for a single atomic_inc(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-09fs: add generic_llseek_cookie()Christian Brauner
This is similar to generic_file_llseek() but allows the caller to specify a cookie that will be updated to indicate that a seek happened. Caller's requiring that information in their readdir implementations can use that. Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-8-6d3e4816aa7b@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-04namespace: introduce SB_I_NOIDMAP flagAlexander Mikhalitsyn
Right now we determine if filesystem support vfs idmappings or not basing on the FS_ALLOW_IDMAP flag presence. This "static" way works perfecly well for local filesystems like ext4, xfs, btrfs, etc. But for network-like filesystems like fuse, cephfs this approach is not ideal, because sometimes proper support of vfs idmaps requires some extensions for the on-wire protocol, which implies that changes have to be made not only in the Linux kernel code but also in the 3rd party components like libfuse, cephfs MDS server and so on. We have seen that issue during our work on cephfs idmapped mounts [1] with Christian, but right now I'm working on the idmapped mounts support for fuse/virtiofs and I think that it is a right time for this extension. [1] 5ccd8530dd7 ("ceph: handle idmapped mounts in create_request_message()") Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-09-03mm: allow read-ahead with IOCB_NOWAIT setYafang Shao
Readahead support for IOCB_NOWAIT was introduced in commit 2e85abf053b9 ("mm: allow read-ahead with IOCB_NOWAIT set"). However, this implementation broke the semantics of IOCB_NOWAIT by potentially causing it to wait on I/O during memory reclamation. This behavior was later modified in commit efa8480a8316 ("fs: RWF_NOWAIT should imply IOCB_NOIO"). To resolve the blocking issue during memory reclamation, we can use memalloc_noio_{save,restore} to ensure non-blocking behavior. This change restores the original functionality, allowing preadv2(IOCB_NOWAIT) to trigger readahead if the file content is not present in the page cache. While this process may trigger direct memory reclamation, the __GFP_NORETRY flag is set in the readahead GFP flags, ensuring it won't block. A use case for this change is when we want to trigger readahead in the preadv2(2) syscall if the file cache is absent, but without waiting for certain filesystem locks, like xfs_ilock. A simple example is as follows: retry: if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) { do_other_work(); goto retry; } Link: https://lore.gnuweeb.org/io-uring/20200624164127.GP21350@casper.infradead.org/ Link: https://lkml.kernel.org/r/20240820022639.89562-1-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-30inode: make i_state a u32Christian Brauner
Now that we use the wait var event mechanism make i_state a u32 and free up 4 bytes. This means we currently have two 4 byte holes in struct inode which we can pack. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-6-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30fs: reorder i_state bitsChristian Brauner
so that we can use the first bits to derive unique addresses from i_state. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-2-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30fs: add i_state helpersChristian Brauner
The i_state member is an unsigned long so that it can be used with the wait bit infrastructure which expects unsigned long. This wastes 4 bytes which we're unlikely to ever use. Switch to using the var event wait mechanism using the address of the bit. Thanks to Linus for the address idea. Link: https://lore.kernel.org/r/20240823-work-i_state-v3-1-5cd5fd207a57@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30fs: s/__u32/u32/ for s_fsnotify_maskChristian Brauner
The underscore variants are for uapi whereas the non-underscore variants are for in-kernel consumers. Link: https://lore.kernel.org/r/20240822-anwerben-nutzung-1cd6c82a565f@brauner Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30inode: remove __I_DIO_WAKEUPChristian Brauner
Afaict, we can just rely on inode->i_dio_count for waiting instead of this awkward indirection through __I_DIO_WAKEUP. This survives LTP dio and xfstests dio tests. Link: https://lore.kernel.org/r/20240816-vfs-misc-dio-v1-1-80fe21a2c710@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30fs: move FMODE_UNSIGNED_OFFSET to fop_flagsChristian Brauner
This is another flag that is statically set and doesn't need to use up an FMODE_* bit. Move it to ->fop_flags and free up another FMODE_* bit. (1) mem_open() used from proc_mem_operations (2) adi_open() used from adi_fops (3) drm_open_helper(): (3.1) accel_open() used from DRM_ACCEL_FOPS (3.2) drm_open() used from (3.2.1) amdgpu_driver_kms_fops (3.2.2) psb_gem_fops (3.2.3) i915_driver_fops (3.2.4) nouveau_driver_fops (3.2.5) panthor_drm_driver_fops (3.2.6) radeon_driver_kms_fops (3.2.7) tegra_drm_fops (3.2.8) vmwgfx_driver_fops (3.2.9) xe_driver_fops (3.2.10) DRM_GEM_FOPS (3.2.11) DEFINE_DRM_GEM_DMA_FOPS (4) struct memdev sets fmode flags based on type of device opened. For devices using struct mem_fops unsigned offset is used. Mark all these file operations as FOP_UNSIGNED_OFFSET and add asserts into the open helper to ensure that the flag is always set. Link: https://lore.kernel.org/r/20240809-work-fop_unsigned-v1-1-658e054d893e@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30vfs: only read fops once in fops_get/putMateusz Guzik
In do_dentry_open() the usage is: f->f_op = fops_get(inode->i_fop); In generated asm the compiler emits 2 reads from inode->i_fop instead of just one. This popped up due to false-sharing where loads from that offset end up bouncing a cacheline during parallel open. While this is going to be fixed, the spurious load does not need to be there. This makes do_dentry_open() go down from 1177 to 1154 bytes. fops_put() is patched to maintain some consistency. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240810064753.1211441-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-29fs: use kmem_cache_create_rcu()Christian Brauner
Switch to the new kmem_cache_create_rcu() helper which allows us to use a custom free pointer offset avoiding the need to have an external free pointer which would grow struct file behind our backs. Link: https://lore.kernel.org/r/20240828-work-kmem_cache-rcu-v3-3-5460bc1f09f6@kernel.org Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-29fs: pack struct fileChristian Brauner
Now that we shrunk struct file to 192 bytes aka 3 cachelines reorder struct file to not leave any holes or have members cross cachelines. Add a short comment to each of the fields and mark the cachelines. It's possible that we may have to tweak this based on profiling in the future. So far I had Jens test this comparing io_uring with non-fixed and fixed files and it improved performance. The layout is a combination of Jens' and my changes. Link: https: //lore.kernel.org/r/20240824-peinigen-hocken-7384b977c643@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-29Merge tag 'nfsd-6.11-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a number of crashers - Update email address for an NFSD reviewer * tag 'nfsd-6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: fs/nfsd: fix update of inode attrs in CB_GETATTR nfsd: fix potential UAF in nfsd4_cb_getattr_release nfsd: hold reference to delegation when updating it for cb_getattr MAINTAINERS: Update Olga Kornievskaia's email address nfsd: prevent panic for nfsv4.0 closed files in nfs4_show_open nfsd: ensure that nfsd4_fattr_args.context is zeroed out
2024-08-28fs: switch f_iocb_flags and f_raChristian Brauner
Now that we shrank struct file by 24 bytes we still have a 4 byte hole. If we move struct file_ra_state into the union and f_iocb_flags out of the union we close that whole and bring down struct file to 192 bytes. Which means struct file is 3 cachelines and we managed to shrink it by 40 bytes this cycle. I've tried to audit all codepaths that use f_ra and none of them seem to rely on it in file->f_op->release() and never have since commit 1da177e4c3f4 ("Linux-2.6.12-rc2"). Link: https://lore.kernel.org/r/20240823-luftdicht-berappen-d69a2166a0db@brauner Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28file: reclaim 24 bytes from f_ownerChristian Brauner
We do embedd struct fown_struct into struct file letting it take up 32 bytes in total. We could tweak struct fown_struct to be more compact but really it shouldn't even be embedded in struct file in the first place. Instead, actual users of struct fown_struct should allocate the struct on demand. This frees up 24 bytes in struct file. That will have some potentially user-visible changes for the ownership fcntl()s. Some of them can now fail due to allocation failures. Practically, that probably will almost never happen as the allocations are small and they only happen once per file. The fown_struct is used during kill_fasync() which is used by e.g., pipes to generate a SIGIO signal. Sending of such signals is conditional on userspace having set an owner for the file using one of the F_OWNER fcntl()s. Such users will be unaffected if struct fown_struct is allocated during the fcntl() call. There are a few subsystems that call __f_setown() expecting file->f_owner to be allocated: (1) tun devices file->f_op->fasync::tun_chr_fasync() -> __f_setown() There are no callers of tun_chr_fasync(). (2) tty devices file->f_op->fasync::tty_fasync() -> __tty_fasync() -> __f_setown() tty_fasync() has no additional callers but __tty_fasync() has. Note that __tty_fasync() only calls __f_setown() if the @on argument is true. It's called from: file->f_op->release::tty_release() -> tty_release() -> __tty_fasync() -> __f_setown() tty_release() calls __tty_fasync() with @on false => __f_setown() is never called from tty_release(). => All callers of tty_release() are safe as well. file->f_op->release::tty_open() -> tty_release() -> __tty_fasync() -> __f_setown() __tty_hangup() calls __tty_fasync() with @on false => __f_setown() is never called from tty_release(). => All callers of __tty_hangup() are safe as well. From the callchains it's obvious that (1) and (2) end up getting called via file->f_op->fasync(). That can happen either through the F_SETFL fcntl() with the FASYNC flag raised or via the FIOASYNC ioctl(). If FASYNC is requested and the file isn't already FASYNC then file->f_op->fasync() is called with @on true which ends up causing both (1) and (2) to call __f_setown(). (1) and (2) are the only subsystems that call __f_setown() from the file->f_op->fasync() handler. So both (1) and (2) have been updated to allocate a struct fown_struct prior to calling fasync_helper() to register with the fasync infrastructure. That's safe as they both call fasync_helper() which also does allocations if @on is true. The other interesting case are file leases: (3) file leases lease_manager_ops->lm_setup::lease_setup() -> __f_setown() Which in turn is called from: generic_add_lease() -> lease_manager_ops->lm_setup::lease_setup() -> __f_setown() So here again we can simply make generic_add_lease() allocate struct fown_struct prior to the lease_manager_ops->lm_setup::lease_setup() which happens under a spinlock. With that the two remaining subsystems that call __f_setown() are: (4) dnotify (5) sockets Both have their own custom ioctls to set struct fown_struct and both have been converted to allocate a struct fown_struct on demand from their respective ioctls. Interactions with O_PATH are fine as well e.g., when opening a /dev/tty as O_PATH then no file->f_op->open() happens thus no file->f_owner is allocated. That's fine as no file operation will be set for those and the device has never been opened. fcntl()s called on such things will just allocate a ->f_owner on demand. Although I have zero idea why'd you care about f_owner on an O_PATH fd. Link: https://lore.kernel.org/r/20240813-work-f_owner-v2-1-4e9343a79f9f@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-26fs/nfsd: fix update of inode attrs in CB_GETATTRJeff Layton
Currently, we copy the mtime and ctime to the in-core inode and then mark the inode dirty. This is fine for certain types of filesystems, but not all. Some require a real setattr to properly change these values (e.g. ceph or reexported NFS). Fix this code to call notify_change() instead, which is the proper way to effect a setattr. There is one problem though: In this case, the client is holding a write delegation and has sent us attributes to update our cache. We don't want to break the delegation for this since that would defeat the purpose. Add a new ATTR_DELEG flag that makes notify_change bypass the try_break_deleg call. Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-08-19percpu-rwsem: remove the unused parameter 'read'Wang Long
In the function percpu_rwsem_release, the parameter `read` is unused, so remove it. Signed-off-by: Wang Long <w@laoqinren.net> Link: https://lore.kernel.org/r/20240802091901.2546797-1-w@laoqinren.net Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-13vfs: Don't evict inode under the inode lru traversing contextZhihao Cheng
The inode reclaiming process(See function prune_icache_sb) collects all reclaimable inodes and mark them with I_FREEING flag at first, at that time, other processes will be stuck if they try getting these inodes (See function find_inode_fast), then the reclaiming process destroy the inodes by function dispose_list(). Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may do inode lookup in the inode evicting callback function, if the inode lookup is operated under the inode lru traversing context, deadlock problems may happen. Case 1: In function ext4_evict_inode(), the ea inode lookup could happen if ea_inode feature is enabled, the lookup process will be stuck under the evicting context like this: 1. File A has inode i_reg and an ea inode i_ea 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea 3. Then, following three processes running like this: PA PB echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // i_reg is added into lru, lru->i_ea->i_reg prune_icache_sb list_lru_walk_one inode_lru_isolate i_ea->i_state |= I_FREEING // set inode state inode_lru_isolate __iget(i_reg) spin_unlock(&i_reg->i_lock) spin_unlock(lru_lock) rm file A i_reg->nlink = 0 iput(i_reg) // i_reg->nlink is 0, do evict ext4_evict_inode ext4_xattr_delete_inode ext4_xattr_inode_dec_ref_all ext4_xattr_inode_iget ext4_iget(i_ea->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(i_ea) ----→ AA deadlock dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&i_ea->i_state) Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file deleting process holds BASEHD's wbuf->io_mutex while getting the xattr inode, which could race with inode reclaiming process(The reclaiming process could try locking BASEHD's wbuf->io_mutex in inode evicting function), then an ABBA deadlock problem would happen as following: 1. File A has inode ia and a xattr(with inode ixa), regular file B has inode ib and a xattr. 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa 3. Then, following three processes running like this: PA PB PC echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // ib and ia are added into lru, lru->ixa->ib->ia prune_icache_sb list_lru_walk_one inode_lru_isolate ixa->i_state |= I_FREEING // set inode state inode_lru_isolate __iget(ib) spin_unlock(&ib->i_lock) spin_unlock(lru_lock) rm file B ib->nlink = 0 rm file A iput(ia) ubifs_evict_inode(ia) ubifs_jnl_delete_inode(ia) ubifs_jnl_write_inode(ia) make_reservation(BASEHD) // Lock wbuf->io_mutex ubifs_iget(ixa->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(ixa) | iput(ib) // ib->nlink is 0, do evict | ubifs_evict_inode | ubifs_jnl_delete_inode(ib) ↓ ubifs_jnl_write_inode ABBA deadlock ←-----make_reservation(BASEHD) dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&ixa->i_state) Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING to pin the inode in memory while inode_lru_isolate() reclaims its pages instead of using ordinary inode reference. This way inode deletion cannot be triggered from inode_lru_isolate() thus avoiding the deadlock. evict() is made to wait for I_LRU_ISOLATING to be cleared before proceeding with inode cleanup. Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022 Fixes: e50e5129f384 ("ext4: xattr-in-inode support") Fixes: 7959cf3a7506 ("ubifs: journal: Handle xattrs like files") Cc: stable@vger.kernel.org Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07fs: Convert aops->write_begin to take a folioMatthew Wilcox (Oracle)
Convert all callers from working on a page to working on one page of a folio (support for working on an entire folio can come later). Removes a lot of folio->page->folio conversions. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07fs: Convert aops->write_end to take a folioMatthew Wilcox (Oracle)
Most callers have a folio, and most implementations operate on a folio, so remove the conversion from folio->page->folio to fit through this interface. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-15Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe updates via Keith: - Device initialization memory leak fixes (Keith) - More constants defined (Weiwen) - Target debugfs support (Hannes) - PCIe subsystem reset enhancements (Keith) - Queue-depth multipath policy (Redhat and PureStorage) - Implement get_unique_id (Christoph) - Authentication error fixes (Gaosheng) - MD updates via Song - sync_action fix and refactoring (Yu Kuai) - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li) - Fix loop detach/open race (Gulam) - Fix lower control limit for blk-throttle (Yu) - Add module descriptions to various drivers (Jeff) - Add support for atomic writes for block devices, and statx reporting for same. Includes SCSI and NVMe (John, Prasad, Alan) - Add IO priority information to block trace points (Dongliang) - Various zone improvements and tweaks (Damien) - mq-deadline tag reservation improvements (Bart) - Ignore direct reclaim swap writes in writeback throttling (Baokun) - Block integrity improvements and fixes (Anuj) - Add basic support for rust based block drivers. Has a dummy null_blk variant for now (Andreas) - Series converting driver settings to queue limits, and cleanups and fixes related to that (Christoph) - Cleanup for poking too deeply into the bvec internals, in preparation for DMA mapping API changes (Christoph) - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas, Ming, Zhu, Damien, Christophe, Chaitanya) * tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits) floppy: add missing MODULE_DESCRIPTION() macro loop: add missing MODULE_DESCRIPTION() macro ublk_drv: add missing MODULE_DESCRIPTION() macro xen/blkback: add missing MODULE_DESCRIPTION() macro block/rnbd: Constify struct kobj_type block: take offset into account in blk_bvec_map_sg again block: fix get_max_segment_size() warning loop: Don't bother validating blocksize virtio_blk: Don't bother validating blocksize null_blk: Don't bother validating blocksize block: Validate logical block size in blk_validate_limits() virtio_blk: Fix default logical block size fallback nvmet-auth: fix nvmet_auth hash error handling nvme: implement ->get_unique_id block: pass a phys_addr_t to get_max_segment_size block: add a bvec_phys helper blk-lib: check for kill signal in ioctl BLKZEROOUT block: limit the Write Zeroes to manually writing zeroes fallback block: refacto blkdev_issue_zeroout block: move read-only and supported checks into (__)blkdev_issue_zeroout ...
2024-07-15Merge tag 'vfs-6.11.inode' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode / dentry updates from Christian Brauner: "This contains smaller performance improvements to inodes and dentries: inode: - Add rcu based inode lookup variants. They avoid one inode hash lock acquire in the common case thereby significantly reducing contention. We already support RCU-based operations but didn't take advantage of them during inode insertion. Callers of iget_locked() get the improvement without any code changes. Callers that need a custom callback can switch to iget5_locked_rcu() as e.g., did btrfs. With 20 threads each walking a dedicated 1000 dirs * 1000 files directory tree to stat(2) on a 32 core + 24GB ram vm: before: 3.54s user 892.30s system 1966% cpu 45.549 total after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%) Long-term we should pick up the effort to introduce more fine-grained locking and possibly improve on the currently used hash implementation. - Start zeroing i_state in inode_init_always() instead of doing it in individual filesystems. This allows us to remove an unneeded lock acquire in new_inode() and not burden individual filesystems with this. dcache: - Move d_lockref out of the area used by RCU lookup to avoid cacheline ping poing because the embedded name is sharing a cacheline with d_lockref. - Fix dentry size on 32bit with CONFIG_SMP=y so it does actually end up with 128 bytes in total" * tag 'vfs-6.11.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: fix dentry size vfs: move d_lockref out of the area used by RCU lookup bcachefs: remove now spurious i_state initialization xfs: remove now spurious i_state initialization in xfs_inode_alloc vfs: partially sanitize i_state zeroing on inode creation xfs: preserve i_state around inode_init_always in xfs_reinit_inode btrfs: use iget5_locked_rcu vfs: add rcu-based find_inode variants for iget ops
2024-07-15Merge tag 'vfs-6.11.casefold' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs casefolding updates from Christian Brauner: "This contains some work to simplify the handling of casefolded names: - Simplify the handling of casefolded names in f2fs and ext4 by keeping the names as a qstr to avoiding unnecessary conversions - Introduce a new generic_ci_match() libfs case-insensitive lookup helper and use it in both f2fs and ext4 allowing to remove the filesystem specific implementations - Remove a bunch of ifdefs by making the unicode build checks part of the code flow" * tag 'vfs-6.11.casefold' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: f2fs: Move CONFIG_UNICODE defguards into the code flow ext4: Move CONFIG_UNICODE defguards into the code flow f2fs: Reuse generic_ci_match for ci comparisons ext4: Reuse generic_ci_match for ci comparisons libfs: Introduce case-insensitive string comparison helper f2fs: Simplify the handling of cached casefolded names ext4: Simplify the handling of cached casefolded names
2024-06-27vfs: rename parent_ino to d_parent_ino and make it use RCUMateusz Guzik
The routine is used by procfs through dir_emit_dots. The combined RCU and lock fallback implementation is too big for an inline. Given that the routine takes a dentry argument fs/dcache.c seems like the place to put it in. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240627161152.802567-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-27fs: new helper vfs_empty_path()Christian Brauner
Make it possible to quickly check whether AT_EMPTY_PATH is valid. Note, after some discussion we decided to also allow NULL to be passed instead of requiring the empty string. Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-25fs: Export in_group_or_capable()Youling Tang
Export in_group_or_capable() as a VFS helper function. Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Link: https://lore.kernel.org/r/20240620032335.147136-1-youling.tang@linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-20fs: Add initial atomic write support info to statxPrasad Singamsetty
Extend statx system call to return additional info for atomic write support support for a file. Helper function generic_fill_statx_atomic_writes() can be used by FSes to fill in the relevant statx fields. For now atomic_write_segments_max will always be 1, otherwise some rules would need to be imposed on iovec length and alignment, which we don't want now. Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com> jpg: relocate bdev support to another patch Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-20fs: Initial atomic write supportPrasad Singamsetty
An atomic write is a write issued with torn-write protection, meaning that for a power failure or any other hardware failure, all or none of the data from the write will be stored, but never a mix of old and new data. Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the write is to be issued with torn-write prevention, according to special alignment and length rules. For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for iocb->ki_flags field to indicate the same. A call to statx will give the relevant atomic write info for a file: - atomic_write_unit_min - atomic_write_unit_max - atomic_write_segments_max Both min and max values must be a power-of-2. Applications can avail of atomic write feature by ensuring that the total length of a write is a power-of-2 in size and also sized between atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications must ensure that the write is at a naturally-aligned offset in the file wrt the total write length. The value in atomic_write_segments_max indicates the upper limit for IOV_ITER iovcnt. Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the flag set will have RWF_ATOMIC rejected and not just ignored. Add a type argument to kiocb_set_rw_flags() to allows reads which have RWF_ATOMIC set to be rejected. Helper function generic_atomic_write_valid() can be used by FSes to verify compliant writes. There we check for iov_iter type is for ubuf, which implies iovcnt==1 for pwritev2(), which is an initial restriction for atomic_write_segments_max. Initially the only user will be bdev file operations write handler. We will rely on the block BIO submission path to ensure write sizes are compliant for the bdev, so we don't need to check atomic writes sizes yet. Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com> jpg: merge into single patch and much rewrite Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-12vfs: add rcu-based find_inode variants for iget opsMateusz Guzik
This avoids one inode hash lock acquire in the common case on inode creation, in effect significantly reducing contention. On the stock kernel said lock is typically taken twice: 1. once to check if the inode happens to already be present 2. once to add it to the hash The back-to-back lock/unlock pattern is known to degrade performance significantly, which is further exacerbated if the hash is heavily populated (long chains to walk, extending hold time). Arguably hash sizing and hashing algo need to be revisited, but that's beyond the scope of this patch. With the acquire from step 1 eliminated with RCU lookup throughput increases significantly at the scale of 20 cores (benchmark results at the bottom). So happens the hash already supports RCU-based operation, but lookups on inode insertions didn't take advantage of it. This of course has its limits as the global lock is still a bottleneck. There was a patchset posted which introduced fine-grained locking[1] but it appears staled. Apart from that doubt was expressed whether a handrolled hash implementation is appropriate to begin with, suggesting replacement with rhashtables. Nobody committed to carrying [1] across the finish line or implementing anything better, thus the bandaid below. iget_locked consumers (notably ext4) get away without any changes because inode comparison method is built-in. iget5_locked consumers pass a custom callback. Since removal of locking adds more problems (inode can be changing) it's not safe to assume all filesystems happen to cope. Thus iget5_locked_rcu gets added, requiring manual conversion of interested filesystems. In order to reduce code duplication find_inode and find_inode_fast grow an argument indicating whether inode hash lock is held, which is passed down in case sleeping is necessary. They always rcu_read_lock, which is redundant but harmless. Doing it conditionally reduces readability for no real gain that I can see. RCU-alike restrictions were already put on callbacks due to the hash spinlock being held. Benchmarking: There is a real cache-busting workload scanning millions of files in parallel (it's a backup appliance), where the initial lookup is guaranteed to fail resulting in the two lock acquires on stock kernel (and one with the patch at hand). Implemented below is a synthetic benchmark providing the same behavior. [I shall note the workload is not running on Linux, instead it was causing trouble elsewhere. Benchmark below was used while addressing said problems and was found to adequately represent the real workload.] Total real time fluctuates by 1-2s. With 20 threads each walking a dedicated 1000 dirs * 1000 files directory tree to stat(2) on a 32 core + 24GB RAM vm: ext4 (needed mkfs.ext4 -N 24000000): before: 3.77s user 890.90s system 1939% cpu 46.118 total after: 3.24s user 397.73s system 1858% cpu 21.581 total (-53%) That's 20 million files to visit, while the machine can only cache about 15 million at a time (obtained from ext4_inode_cache object count in /proc/slabinfo). Since each terminal inode is only visited once per run this amounts to 0% hit ratio for the dentry cache and the hash table (there are however hits for the intermediate directories). On repeated runs the kernel caches the last ~15 mln, meaning there is ~5 mln of uncached inodes which are going to be visited first, evicting the previously cached state as it happens. Lack of hits can be trivially verified with bpftrace, like so: bpftrace -e 'kretprobe:find_inode_fast { @[kstack(), retval != 0] = count(); }'\ -c "/bin/sh walktrees /testfs 20" Best ran more than once. Expected results after "warmup": [snip] @[ __ext4_iget+275 ext4_lookup+224 __lookup_slow+130 walk_component+219 link_path_walk.part.0.constprop.0+614 path_lookupat+62 filename_lookup+204 vfs_statx+128 vfs_fstatat+131 __do_sys_newfstatat+38 do_syscall_64+87 entry_SYSCALL_64_after_hwframe+118 , 1]: 20000 @[ __ext4_iget+275 ext4_lookup+224 __lookup_slow+130 walk_component+219 path_lookupat+106 filename_lookup+204 vfs_statx+128 vfs_fstatat+131 __do_sys_newfstatat+38 do_syscall_64+87 entry_SYSCALL_64_after_hwframe+118 , 1]: 20000000 That is 20 million calls for the initial lookup and 20 million after allocating a new inode, all of them failing to return a value != 0 (i.e., they are returning NULL -- no match found). Of course aborting the benchmark in the middle and starting it again (or messing with the state in other ways) is going to alter these results. Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz [1] https://lore.kernel.org/all/20231206060629.2827226-1-david@fromorbit.com/ Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240611173824.535995-2-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-07libfs: Introduce case-insensitive string comparison helperGabriel Krisman Bertazi
generic_ci_match can be used by case-insensitive filesystems to compare strings under lookup with dirents in a case-insensitive way. This function is currently reimplemented by each filesystem supporting casefolding, so this reduces code duplication in filesystem-specific code. [eugen.hristev@collabora.com: rework to first test the exact match, cleanup and add error message] Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Eugen Hristev <eugen.hristev@collabora.com> Link: https://lore.kernel.org/r/20240606073353.47130-4-eugen.hristev@collabora.com Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-05vfs: retire user_path_at_empty and drop empty arg from getname_flagsMateusz Guzik
No users after do_readlinkat started doing the job on its own. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20240604155257.109500-3-mjguzik@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-27fs: switch timespec64 fields in inode to discrete integersJeff Layton
Adjacent struct timespec64's pack poorly. Switch them to discrete integer fields for the seconds and nanoseconds. This shaves 8 bytes off struct inode with a garden-variety Fedora Kconfig on x86_64, but that also moves the i_lock into the previous cacheline, away from the fields it protects. To remedy that, move i_generation above the i_lock, which moves the new 4-byte hole to just after the i_fsnotify_mask in my setup. Amir has plans to use that to expand the i_fsnotify_mask, so add a comment to that effect as well. Cc: Amir Goldstein <amir73il@gmail.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20240517-amtime-v1-1-7b804ca4be8f@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-21Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc vfs updates from Al Viro: "Assorted commits that had missed the last merge window..." * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: remove call_{read,write}_iter() functions do_dentry_open(): kill inode argument kernel_file_open(): get rid of inode argument get_file_rcu(): no need to check for NULL separately fd_is_open(): move to fs/file.c close_on_exec(): pass files_struct instead of fdtable
2024-05-20Merge tag 'xfs-6.10-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Chandan Babu: "Online repair feature continues to be expanded. Also, we now support delayed allocation for realtime devices which have an extent size that is equal to filesystem's block size. New code: - Introduce Parent Pointer extended attribute for inodes - Bring back delalloc support for realtime devices which have an extent size that is equal to filesystem's block size - Improve performance of log incompat feature handling Online Repair: - Implement atomic file content exchanges i.e. exchange ranges of bytes between two files atomically - Create temporary files to repair file-based metadata. This uses atomic file content exchange facility to swap file fork mappings between the temporary file and the metadata inode - Allow callers of directory/xattr code to set an explicit owner number to be written into the header fields of any new blocks that are created. This is required to avoid walking every block of the new structure and modify their ownership during online repair - Repair more data structures: - Extended attributes - Inode unlinked state - Directories - Symbolic links - AGI's unlinked inode list - Parent pointers - Move Orphan files to lost and found directory - Fixes for Inode repair functionality - Introduce a new sub-AG FITRIM implementation to reduce the duration for which the AGF lock is held - Updates for the design documentation - Use Parent Pointers to assist in checking directories, parent pointers, extended attributes, and link counts Fixes: - Prevent userspace from reading invalid file data due to incorrect. updation of file size when performing a non-atomic clone operation - Minor fixes to online repair - Fix confusing return values from xfs_bmapi_write() - Fix an out of bounds access due to incorrect h_size during log recovery - Defer upgrading the extent counters in xfs_reflink_end_cow_extent() until we know we are going to modify the extent mapping - Remove racy access to if_bytes check in xfs_reflink_end_cow_extent() - Fix sparse warnings Cleanups: - Hold inode locks on all files involved in a rename until the completion of the operation. This is in preparation for the parent pointers patchset where parent pointers are applied in a separate chained update from the actual directory update - Compile out v4 support when disabled - Cleanup xfs_extent_busy_clear() - Remove unused flags and fields from struct xfs_da_args - Remove definitions of unused functions - Improve extended attribute validation - Add higher level directory operations helpers to remove duplication of code - Cleanup quota (un)reservation interfaces" * tag 'xfs-6.10-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (221 commits) xfs: simplify iext overflow checking and upgrade xfs: remove a racy if_bytes check in xfs_reflink_end_cow_extent xfs: upgrade the extent counters in xfs_reflink_end_cow_extent later xfs: xfs_quota_unreserve_blkres can't fail xfs: consolidate the xfs_quota_reserve_blkres definitions xfs: clean up buffer allocation in xlog_do_recovery_pass xfs: fix log recovery buffer allocation for the legacy h_size fixup xfs: widen flags argument to the xfs_iflags_* helpers xfs: minor cleanups of xfs_attr3_rmt_blocks xfs: create a helper to compute the blockcount of a max sized remote value xfs: turn XFS_ATTR3_RMT_BUF_SPACE into a function xfs: use unsigned ints for non-negative quantities in xfs_attr_remote.c xfs: do not allocate the entire delalloc extent in xfs_bmapi_write xfs: fix xfs_bmap_add_extent_delay_real for partial conversions xfs: remove the xfs_iext_peek_prev_extent call in xfs_bmapi_allocate xfs: pass the actual offset and len to allocate to xfs_bmapi_allocate xfs: don't open code XFS_FILBLKS_MIN in xfs_bmapi_write xfs: lift a xfs_valid_startblock into xfs_bmapi_allocate xfs: remove the unusued tmp_logflags variable in xfs_bmapi_allocate xfs: fix error returns from xfs_bmapi_write ...
2024-05-20Merge tag 'fsnotify_for_v6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: - reduce overhead of fsnotify infrastructure when no permission events are in use - a few small cleanups * tag 'fsnotify_for_v6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: fix UAF from FS_ERROR event on a shutting down filesystem fsnotify: optimize the case of no permission event watchers fsnotify: use an enum for group priority constants fsnotify: move s_fsnotify_connectors into fsnotify_sb_info fsnotify: lazy attach fsnotify_sb_info state to sb fsnotify: create helper fsnotify_update_sb_watchers() fsnotify: pass object pointer and type to fsnotify mark helpers fanotify: merge two checks regarding add of ignore mark fsnotify: create a wrapper fsnotify_find_inode_mark() fsnotify: create helpers to get sb and connp from object fsnotify: rename fsnotify_{get,put}_sb_connectors() fsnotify: Avoid -Wflex-array-member-not-at-end warning fanotify: remove unneeded sub-zero check for unsigned value
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-15Merge tag 'integrity-v6.10' of ↵Linus Torvalds
ssh://ra.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity Pull integrity updates from Mimi Zohar: "Two IMA changes, one EVM change, a use after free bug fix, and a code cleanup to address "-Wflex-array-member-not-at-end" warnings: - The existing IMA {ascii, binary}_runtime_measurements lists include a hard coded SHA1 hash. To address this limitation, define per TPM enabled hash algorithm {ascii, binary}_runtime_measurements lists - Close an IMA integrity init_module syscall measurement gap by defining a new critical-data record - Enable (partial) EVM support on stacked filesystems (overlayfs). Only EVM portable & immutable file signatures are copied up, since they do not contain filesystem specific metadata" * tag 'integrity-v6.10' of ssh://ra.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: ima: add crypto agility support for template-hash algorithm evm: Rename is_unsupported_fs to is_unsupported_hmac_fs fs: Rename SB_I_EVM_UNSUPPORTED to SB_I_EVM_HMAC_UNSUPPORTED evm: Enforce signatures on unsupported filesystem for EVM_INIT_X509 ima: re-evaluate file integrity on file metadata change evm: Store and detect metadata inode attributes changes ima: Move file-change detection variables into new structure evm: Use the metadata inode to calculate metadata hash evm: Implement per signature type decision in security_inode_copy_up_xattr security: allow finer granularity in permitting copy-up of security xattrs ima: Rename backing_inode to real_inode integrity: Avoid -Wflex-array-member-not-at-end warnings ima: define an init_module critical data record ima: Fix use-after-free on a dentry's dname.name
2024-05-13Merge tag 'vfs-6.10.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That means we now have six free FMODE_* bits in total (but bit #6 already got used for FMODE_WRITE_RESTRICTED) - Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup) - Add fd_raw cleanup class so we can make use of automatic cleanup provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well - Optimize seq_puts() - Simplify __seq_puts() - Add new anon_inode_getfile_fmode() api to allow specifying f_mode instead of open-coding it in multiple places - Annotate struct file_handle with __counted_by() and use struct_size() - Warn in get_file() whether f_count resurrection from zero is attempted (epoll/drm discussion) - Folio-sophize aio - Export the subvolume id in statx() for both btrfs and bcachefs - Relax linkat(AT_EMPTY_PATH) requirements - Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors for dup*() equality replacing kcmp() Cleanups: - Compile out swapfile inode checks when swap isn't enabled - Use (1 << n) notation for FMODE_* bitshifts for clarity - Remove redundant variable assignment in fs/direct-io - Cleanup uses of strncpy in orangefs - Speed up and cleanup writeback - Move fsparam_string_empty() helper into header since it's currently open-coded in multiple places - Add kernel-doc comments to proc_create_net_data_write() - Don't needlessly read dentry->d_flags twice Fixes: - Fix out-of-range warning in nilfs2 - Fix ecryptfs overflow due to wrong encryption packet size calculation - Fix overly long line in xfs file_operations (follow-up to FMODE_* cleanup) - Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up to FMODE_* cleanup) - Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_* cleanup) - Fix stable offset api to prevent endless loops - Fix afs file server rotations - Prevent xattr node from overflowing the eraseblock in jffs2 - Move fdinfo PTRACE_MODE_READ procfs check into the .permission() operation instead of .open() operation since this caused userspace regressions" * tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) afs: Fix fileserver rotation getting stuck selftests: add F_DUPDFD_QUERY selftests fcntl: add F_DUPFD_QUERY fcntl() file: add fd_raw cleanup class fs: WARN when f_count resurrection is attempted seq_file: Simplify __seq_puts() seq_file: Optimize seq_puts() proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation fs: Create anon_inode_getfile_fmode() xfs: don't call xfs_file_open from xfs_dir_open xfs: drop fop_flags for directories xfs: fix overly long line in the file_operations shmem: Fix shmem_rename2() libfs: Add simple_offset_rename() API libfs: Fix simple_offset_rename_exchange() jffs2: prevent xattr node from overflowing the eraseblock vfs, swap: compile out IS_SWAPFILE() on swapless configs vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements fs/direct-io: remove redundant assignment to variable retval fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading ...
2024-05-04fs: WARN when f_count resurrection is attemptedKees Cook
It should never happen that get_file() is called on a file with f_count equal to zero. If this happens, a use-after-free condition has happened[1], and we need to attempt a best-effort reporting of the situation to help find the root cause more easily. Additionally, this serves as a data corruption indicator that system owners using warn_limit or panic_on_warn would like to have detected. Link: https://lore.kernel.org/lkml/7c41cf3c-2a71-4dbb-8f34-0337890906fc@gmail.com/ [1] Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240503201620.work.651-kees@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>