summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-11-21Revert "f2fs: remove unreachable lazytime mount option parsing"Jaegeuk Kim
This reverts commit 54f43a10fa257ad4af02a1d157fefef6ebcfa7dc. The above commit broke the lazytime mount, given mount("/dev/vdb", "/mnt/test", "f2fs", 0, "lazytime"); CC: stable@vger.kernel.org # 6.11+ Signed-off-by: Daniel Rosenberg <drosen@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-11-21statmount: fix security option retrievalChristian Brauner
Fix the inverted check for security_sb_show_options(). Link: https://lore.kernel.org/r/c8eaa647-5d67-49b6-9401-705afcb7e4d7@stanley.mountain Link: https://lore.kernel.org/r/20241120-verehren-rhabarber-83a11b297bcc@brauner Fixes: aefff51e1c29 ("statmount: retrieve security mount options") Reviewed-by: Jeff Layton <jlayton@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: stable@vger.kernel.org # mainline only Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21statmount: clean up unescaped option handlingMiklos Szeredi
Move common code from opt_array/opt_sec_array to helper. This helper does more than just unescape options, so rename to statmount_opt_process(). Handle corner case of just a single character in options. Rename some local variables to better describe their function. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20241120142732.55210-1-mszeredi@redhat.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21iomap: elide flush from partial eof zero rangeBrian Foster
iomap zero range flushes pagecache in certain situations to determine which parts of the range might require zeroing if dirty data is present in pagecache. The kernel robot recently reported a regression associated with this flushing in the following stress-ng workload on XFS: stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64 This workload involves repeated small, strided, extending writes. On XFS, this produces a pattern of post-eof speculative preallocation, conversion of preallocation from delalloc to unwritten, dirtying pagecache over newly unwritten blocks, and then rinse and repeat from the new EOF. This leads to repetitive flushing of the EOF folio via the zero range call XFS uses for writes that start beyond current EOF. To mitigate this problem, special case EOF block zeroing to prefer zeroing the folio over a flush when the EOF folio is already dirty. To do this, split out and open code handling of an unaligned start offset. This brings most of the performance back by avoiding flushes on zero range calls via write and truncate extension operations. The flush doesn't occur in these situations because the entire range is post-eof and therefore the folio that overlaps EOF is the only one in the range. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20241115200155.593665-4-bfoster@redhat.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21iomap: lift zeroed mapping handling into iomap_zero_range()Brian Foster
In preparation for special handling of subranges, lift the zeroed mapping logic from the iterator into the caller. Since this puts the pagecache dirty check and flushing in the same place, streamline the comments a bit as well. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20241115200155.593665-3-bfoster@redhat.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21iomap: reset per-iter state on non-error iter advancesBrian Foster
iomap_iter_advance() zeroes the processed and mapping fields on every non-error iteration except for the last expected iteration (i.e. return 0 expected to terminate the iteration loop). This appears to be circumstantial as nothing currently relies on these fields after the final iteration. Therefore to better faciliate iomap_iter reuse in subsequent patches, update iomap_iter_advance() to always reset per-iteration state on successful completion. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20241115200155.593665-2-bfoster@redhat.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21fscache: Remove duplicate included headerThorsten Blum
Remove duplicate included header file linux/uio.h Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Link: https://lore.kernel.org/r/20240628062329.321162-2-thorsten.blum@toblux.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-21iomap: warn on zero range of a post-eof folioBrian Foster
iomap_zero_range() uses buffered writes for manual zeroing, no longer updates i_size for such writes, but is still explicitly called for post-eof ranges. The historical use case for this is zeroing post-eof speculative preallocation on extending writes from XFS. However, XFS also recently changed to convert all post-eof delalloc mappings to unwritten in the iomap_begin() handler, which means it now never expects manual zeroing of post-eof mappings. In other words, all post-eof mappings should be reported as holes or unwritten. This is a subtle dependency that can be hard to detect if violated because associated codepaths are likely to update i_size after folio locks are dropped, but before writeback happens to occur. For example, if XFS reverts back to some form of manual zeroing of post-eof blocks on write extension, writeback of those zeroed folios will now race with the presumed i_size update from the subsequent buffered write. Since iomap_zero_range() can't correctly zero post-eof mappings beyond EOF without updating i_size, warn if this ever occurs. This serves as minimal indication that if this use case is reintroduced by a filesystem, iomap_zero_range() might need to reconsider i_size updates for write extending use cases. Signed-off-by: Brian Foster <bfoster@redhat.com> Link: https://lore.kernel.org/r/20241115145931.535207-1-bfoster@redhat.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-20ovl: Filter invalid inodes with missing lookup functionVasiliy Kovalev
Add a check to the ovl_dentry_weird() function to prevent the processing of directory inodes that lack the lookup function. This is important because such inodes can cause errors in overlayfs when passed to the lowerstack. Reported-by: syzbot+a8c9d476508bd14a90e5@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=a8c9d476508bd14a90e5 Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Link: https://lore.kernel.org/linux-unionfs/CAJfpegvx-oS9XGuwpJx=Xe28_jzWx5eRo1y900_ZzWY+=gGzUg@mail.gmail.com/ Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org> Cc: <stable@vger.kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2024-11-19Merge tag 'timers-core-2024-11-18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "A rather large update for timekeeping and timers: - The final step to get rid of auto-rearming posix-timers posix-timers are currently auto-rearmed by the kernel when the signal of the timer is ignored so that the timer signal can be delivered once the corresponding signal is unignored. This requires to throttle the timer to prevent a DoS by small intervals and keeps the system pointlessly out of low power states for no value. This is a long standing non-trivial problem due to the lock order of posix-timer lock and the sighand lock along with life time issues as the timer and the sigqueue have different life time rules. Cure this by: - Embedding the sigqueue into the timer struct to have the same life time rules. Aside of that this also avoids the lookup of the timer in the signal delivery and rearm path as it's just a always valid container_of() now. - Queuing ignored timer signals onto a seperate ignored list. - Moving queued timer signals onto the ignored list when the signal is switched to SIG_IGN before it could be delivered. - Walking the ignored list when SIG_IGN is lifted and requeue the signals to the actual signal lists. This allows the signal delivery code to rearm the timer. This also required to consolidate the signal delivery rules so they are consistent across all situations. With that all self test scenarios finally succeed. - Core infrastructure for VFS multigrain timestamping This is required to allow the kernel to use coarse grained time stamps by default and switch to fine grained time stamps when inode attributes are actively observed via getattr(). These changes have been provided to the VFS tree as well, so that the VFS specific infrastructure could be built on top. - Cleanup and consolidation of the sleep() infrastructure - Move all sleep and timeout functions into one file - Rework udelay() and ndelay() into proper documented inline functions and replace the hardcoded magic numbers by proper defines. - Rework the fsleep() implementation to take the reality of the timer wheel granularity on different HZ values into account. Right now the boundaries are hard coded time ranges which fail to provide the requested accuracy on different HZ settings. - Update documentation for all sleep/timeout related functions and fix up stale documentation links all over the place - Fixup a few usage sites - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks A system can have multiple PTP clocks which are participating in seperate and independent PTP clock domains. So far the kernel only considers the PTP clock which is based on CLOCK TAI relevant as that's the clock which drives the timekeeping adjustments via the various user space daemons through adjtimex(2). The non TAI based clock domains are accessible via the file descriptor based posix clocks, but their usability is very limited. They can't be accessed fast as they always go all the way out to the hardware and they cannot be utilized in the kernel itself. As Time Sensitive Networking (TSN) gains traction it is required to provide fast user and kernel space access to these clocks. The approach taken is to utilize the timekeeping and adjtimex(2) infrastructure to provide this access in a similar way how the kernel provides access to clock MONOTONIC, REALTIME etc. Instead of creating a duplicated infrastructure this rework converts timekeeping and adjtimex(2) into generic functionality which operates on pointers to data structures instead of using static variables. This allows to provide time accessors and adjtimex(2) functionality for the independent PTP clocks in a subsequent step. - Consolidate hrtimer initialization hrtimers are set up by initializing the data structure and then seperately setting the callback function for historical reasons. That's an extra unnecessary step and makes Rust support less straight forward than it should be. Provide a new set of hrtimer_setup*() functions and convert the core code and a few usage sites of the less frequently used interfaces over. The bulk of the htimer_init() to hrtimer_setup() conversion is already prepared and scheduled for the next merge window. - Drivers: - Ensure that the global timekeeping clocksource is utilizing the cluster 0 timer on MIPS multi-cluster systems. Otherwise CPUs on different clusters use their cluster specific clocksource which is not guaranteed to be synchronized with other clusters. - Mostly boring cleanups, fixes, improvements and code movement" * tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits) posix-timers: Fix spurious warning on double enqueue versus do_exit() clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties clocksource/drivers/gpx: Remove redundant casts clocksource/drivers/timer-ti-dm: Fix child node refcount handling dt-bindings: timer: actions,owl-timer: convert to YAML clocksource/drivers/ralink: Add Ralink System Tick Counter driver clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource clocksource/drivers/timer-ti-dm: Don't fail probe if int not found clocksource/drivers:sp804: Make user selectable clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions hrtimers: Delete hrtimer_init_on_stack() alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack() io_uring: Switch to use hrtimer_setup_on_stack() sched/idle: Switch to use hrtimer_setup_on_stack() hrtimers: Delete hrtimer_init_sleeper_on_stack() wait: Switch to use hrtimer_setup_sleeper_on_stack() timers: Switch to use hrtimer_setup_sleeper_on_stack() net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack() futex: Switch to use hrtimer_setup_sleeper_on_stack() fs/aio: Switch to use hrtimer_setup_sleeper_on_stack() ...
2024-11-19Merge tag 'irq-core-2024-11-18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt subsystem updates from Thomas Gleixner: "Tree wide: - Make nr_irqs static to the core code and provide accessor functions to remove existing and prevent future aliasing problems with local variables or function arguments of the same name. Core code: - Prevent freeing an interrupt in the devres code which is not managed by devres in the first place. - Use seq_put_decimal_ull_width() for decimal values output in /proc/interrupts which increases performance significantly as it avoids parsing the format strings over and over. - Optimize raising the timer and hrtimer soft interrupts by using the 'set bit only' variants instead of the combined version which checks whether ksoftirqd should be woken up. The latter is a pointless exercise as both soft interrupts are raised in the context of the timer interrupt and therefore never wake up ksoftirqd. - Delegate timer/hrtimer soft interrupt processing to a dedicated thread on RT. Timer and hrtimer soft interrupts are always processed in ksoftirqd on RT enabled kernels. This can lead to high latencies when other soft interrupts are delegated to ksoftirqd as well. The separate thread allows to run them seperately under a RT scheduling policy to reduce the latency overhead. Drivers: - New drivers or extensions of existing drivers to support Renesas RZ/V2H(P), Aspeed AST27XX, T-HEAD C900 and ATMEL sam9x7 interrupt chips - Support for multi-cluster GICs on MIPS. MIPS CPUs can come with multiple CPU clusters, where each CPU cluster has its own GIC (Generic Interrupt Controller). This requires to access the GIC of a remote cluster through a redirect register block. This is encapsulated into a set of helper functions to keep the complexity out of the actual code paths which handle the GIC details. - Support for encrypted guests in the ARM GICV3 ITS driver The ITS page needs to be shared with the hypervisor and therefore must be decrypted. - Small cleanups and fixes all over the place" * tag 'irq-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits) irqchip/riscv-aplic: Prevent crash when MSI domain is missing genirq/proc: Use seq_put_decimal_ull_width() for decimal values softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT. timers: Use __raise_softirq_irqoff() to raise the softirq. hrtimer: Use __raise_softirq_irqoff() to raise the softirq riscv: defconfig: Enable T-HEAD C900 ACLINT SSWI drivers irqchip: Add T-HEAD C900 ACLINT SSWI driver dt-bindings: interrupt-controller: Add T-HEAD C900 ACLINT SSWI device irqchip/stm32mp-exti: Use of_property_present() for non-boolean properties irqchip/mips-gic: Fix selection of GENERIC_IRQ_EFFECTIVE_AFF_MASK irqchip/mips-gic: Prevent indirect access to clusters without CPU cores irqchip/mips-gic: Multi-cluster support irqchip/mips-gic: Setup defaults in each cluster irqchip/mips-gic: Support multi-cluster in for_each_online_cpu_gic() irqchip/mips-gic: Replace open coded online CPU iterations genirq/irqdesc: Use str_enabled_disabled() helper in wakeup_show() genirq/devres: Don't free interrupt which is not managed by devres irqchip/gic-v3-its: Fix over allocation in itt_alloc_pool() irqchip/aspeed-intc: Add AST27XX INTC support dt-bindings: interrupt-controller: Add support for ASPEED AST27XX INTC ...
2024-11-19Merge tag 'sched-core-2024-11-18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Core facilities: - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which optimizes fair-class preemption by delaying preemption requests to the tick boundary, while working as full preemption for RR/FIFO/DEADLINE classes. (Peter Zijlstra) - x86: Enable Lazy preemption (Peter Zijlstra) - riscv: Enable Lazy preemption (Jisheng Zhang) - Initialize idle tasks only once (Thomas Gleixner) - sched/ext: Remove sched_fork() hack (Thomas Gleixner) Fair scheduler: - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie) Idle loop: - Optimize the generic idle loop by removing unnecessary memory barrier (Zhongqiu Han) RSEQ: - Improve cache locality of RSEQ concurrency IDs for intermittent workloads (Mathieu Desnoyers) Waitqueues: - Make wake_up_{bit,var} less fragile (Neil Brown) PSI: - Pass enqueue/dequeue flags to psi callbacks directly (Johannes Weiner) Preparatory patches for proxy execution: - Add move_queued_task_locked helper (Connor O'Brien) - Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien) - Split out __schedule() deactivate task logic into a helper (John Stultz) - Split scheduler and execution contexts (Peter Zijlstra) - Make mutex::wait_lock irq safe (Juri Lelli) - Expose __mutex_owner() (Juri Lelli) - Remove wakeups from under mutex::wait_lock (Peter Zijlstra) Misc fixes and cleanups: - Remove unused __HAVE_THREAD_FUNCTIONS hook support (David Disseldorp) - Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej Siewior) - Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert) - remove the DOUBLE_TICK feature (Huang Shijie) - fix the comment for PREEMPT_SHORT (Huang Shijie) - Fix unnused variable warning (Christian Loehle) - No PREEMPT_RT=y for all{yes,mod}config" * tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY. sched: No PREEMPT_RT=y for all{yes,mod}config riscv: add PREEMPT_LAZY support sched, x86: Enable Lazy preemption sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT sched: Add Lazy preemption model sched: Add TIF_NEED_RESCHED_LAZY infrastructure sched/ext: Remove sched_fork() hack sched: Initialize idle tasks only once sched: psi: pass enqueue/dequeue flags to psi callbacks directly sched/uclamp: Fix unnused variable warning sched: Split scheduler and execution contexts sched: Split out __schedule() deactivate task logic into a helper sched: Consolidate pick_*_task to task_is_pushable helper sched: Add move_queued_task_locked helper locking/mutex: Expose __mutex_owner() locking/mutex: Make mutex::wait_lock irq safe locking/mutex: Remove wakeups from under mutex::wait_lock sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads sched: idle: Optimize the generic idle loop by removing needless memory barrier ...
2024-11-19Merge tag 'random-6.13-rc1-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: "This contains a single series from Uros to replace uses of <linux/random.h> with prandom.h or other more specific headers as needed, in order to avoid a circular header issue. Uros' goal is to be able to use percpu.h from prandom.h, which will then allow him to define __percpu in percpu.h rather than in compiler_types.h" * tag 'random-6.13-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: prandom: Include <linux/percpu.h> in <linux/prandom.h> random: Do not include <linux/prandom.h> in <linux/random.h> netem: Include <linux/prandom.h> in sch_netem.c lib/test_scanf: Include <linux/prandom.h> instead of <linux/random.h> lib/test_parman: Include <linux/prandom.h> instead of <linux/random.h> bpf/tests: Include <linux/prandom.h> instead of <linux/random.h> lib/rbtree-test: Include <linux/prandom.h> instead of <linux/random.h> random32: Include <linux/prandom.h> instead of <linux/random.h> kunit: string-stream-test: Include <linux/prandom.h> lib/interval_tree_test.c: Include <linux/prandom.h> instead of <linux/random.h> bpf: Include <linux/prandom.h> instead of <linux/random.h> scsi: libfcoe: Include <linux/prandom.h> instead of <linux/random.h> fscrypt: Include <linux/once.h> in fs/crypto/keyring.c mtd: tests: Include <linux/prandom.h> instead of <linux/random.h> media: vivid: Include <linux/prandom.h> in vivid-vid-cap.c drm/lib: Include <linux/prandom.h> instead of <linux/random.h> drm/i915/selftests: Include <linux/prandom.h> instead of <linux/random.h> crypto: testmgr: Include <linux/prandom.h> instead of <linux/random.h> x86/kaslr: Include <linux/prandom.h> instead of <linux/random.h>
2024-11-19gfs2: Prevent inode creation raceAndreas Gruenbacher
When a request to evict an inode comes in over the network, we are trying to grab an inode reference via the iopen glock's gl_object pointer. There is a very small probability that by the time such a request comes in, inode creation hasn't completed and the I_NEW flag is still set. To deal with that, wait for the inode and then check if inode creation was successful. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-19gfs2: Only defer deletes when we have an iopen glockAndreas Gruenbacher
The mechanism to defer deleting unlinked inodes is tied to delete_work_func(), which is tied to iopen glocks. When we don't have an iopen glock, we must carry out deletes immediately instead. Fixes a NULL pointer dereference in gfs2_evict_inode(). Fixes: 8c21c2c71e66 ("gfs2: Call gfs2_queue_verify_delete from gfs2_evict_inode") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2024-11-19ceph: improve caps debugging outputPatrick Donnelly
This improves uniformity and exposes important sequence numbers. Now looks like: <7>[ 73.749563] ceph: caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290] caps mds2 op export ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 0 iseq 0 mseq 0 ... <7>[ 73.749574] ceph: caps.c:4102 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290] cap 20000000000.fffffffffffffffe export to peer 1 piseq 1 pmseq 1 ... <7>[ 73.749645] ceph: caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290] caps mds1 op import ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 1 iseq 1 mseq 1 ... <7>[ 73.749681] ceph: caps.c:4244 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290] cap 20000000000.fffffffffffffffe import from peer 2 piseq 686 pmseq 0 ... <7>[ 248.645596] ceph: caps.c:4465 : [c9653bca-110b-4f70-9f84-5a195b205e9a 15290] caps mds1 op revoke ino 20000000000.fffffffffffffffe inode 0000000008d2e5ea seq 2538 iseq 1 mseq 1 See also ceph.git commit cb4ff28af09f ("mds: add issue_seq to all cap messages"). Link: https://tracker.ceph.com/issues/66704 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19ceph: correct ceph_mds_cap_peer field namePatrick Donnelly
The peer seq is used as the issue_seq. Use that name for consistency. See also ceph.git commit 1da6ef237fc7 ("include/ceph_fs: correct ceph_mds_cap_peer field name"). Link: https://tracker.ceph.com/issues/66704 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-19ceph: correct ceph_mds_cap_item field namePatrick Donnelly
The issue_seq is sent with bulk cap releases, not the current sequence number. See also ceph.git commit 655cddb7c9f3 ("include/ceph_fs: correct ceph_mds_cap_item field name"). Link: https://tracker.ceph.com/issues/66704 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-11-18Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - Support for running Linux in a protected VM under the Arm Confidential Compute Architecture (CCA) - Guarded Control Stack user-space support. Current patches follow the x86 ABI of implicitly creating a shadow stack on clone(). Subsequent patches (already on the list) will add support for clone3() allowing finer-grained control of the shadow stack size and placement from libc - AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are getting close with the upcoming dpISA support) - Other arch features: - In-kernel use of the memcpy instructions, FEAT_MOPS (previously only exposed to user; uaccess support not merged yet) - MTE: hugetlbfs support and the corresponding kselftests - Optimise CRC32 using the PMULL instructions - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG - Optimise the kernel TLB flushing to use the range operations - POE/pkey (permission overlays): further cleanups after bringing the signal handler in line with the x86 behaviour for 6.12 - arm64 perf updates: - Support for the NXP i.MX91 PMU in the existing IMX driver - Support for Ampere SoCs in the Designware PCIe PMU driver - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC - Support for Samsung's 'Mongoose' CPU PMU - Support for PMUv3.9 finer-grained userspace counter access control - Switch back to platform_driver::remove() now that it returns 'void' - Add some missing events for the CXL PMU driver - Miscellaneous arm64 fixes/cleanups: - Page table accessors cleanup: type updates, drop unused macros, reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity check addresses before runtime P4D/PUD folding - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the FEAT_ECV for the generic timers) allowing Linux to boot with firmware deployments that don't set SCTLR_EL3.ECVEn - ACPI/arm64: tighten the check for the array of platform timer structures and adjust the error handling procedure in gtdt_parse_timer_block() - Optimise the cache flush for the uprobes xol slot (skip if no change) and other uprobes/kprobes cleanups - Fix the context switching of tpidrro_el0 when kpti is enabled - Dynamic shadow call stack fixes - Sysreg updates - Various arm64 kselftest improvements * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (168 commits) arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled kselftest/arm64: Try harder to generate different keys during PAC tests kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all() arm64/ptrace: Clarify documentation of VL configuration via ptrace kselftest/arm64: Corrupt P0 in the irritator when testing SSVE acpi/arm64: remove unnecessary cast arm64/mm: Change protval as 'pteval_t' in map_range() kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c kselftest/arm64: Add FPMR coverage to fp-ptrace kselftest/arm64: Expand the set of ZA writes fp-ptrace does kselftets/arm64: Use flag bits for features in fp-ptrace assembler code kselftest/arm64: Enable build of PAC tests with LLVM=1 kselftest/arm64: Check that SVCR is 0 in signal handlers selftests/mm: Fix unused function warning for aarch64_write_signal_pkey() kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests kselftest/arm64: Fix build with stricter assemblers arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux() arm64/scs: Deal with 64-bit relative offsets in FDE frames ...
2024-11-18nfsd: allow for up to 32 callback session slotsJeff Layton
nfsd currently only uses a single slot in the callback channel, which is proving to be a bottleneck in some cases. Widen the callback channel to a max of 32 slots (subject to the client's target_maxreqs value). Change the cb_holds_slot boolean to an integer that tracks the current slot number (with -1 meaning "unassigned"). Move the callback slot tracking info into the session. Add a new u32 that acts as a bitmap to track which slots are in use, and a u32 to track the latest callback target_slotid that the client reports. To protect the new fields, add a new per-session spinlock (the se_lock). Fix nfsd41_cb_get_slot to always search for the lowest slotid (using ffs()). Finally, convert the session->se_cb_seq_nr field into an array of ints and add the necessary handling to ensure that the seqids get reset when the slot table grows after shrinking. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfs_common: must not hold RCU while calling nfsd_file_put_localMike Snitzer
Move holding the RCU from nfs_to_nfsd_file_put_local to nfs_to_nfsd_net_put. It is the call to nfs_to->nfsd_serv_put that requires the RCU anyway (the puts for nfsd_file and netns were combined to avoid an extra indirect reference but that micro-optimization isn't possible now). This fixes xfstests generic/013 and it triggering: "Voluntary context switch within RCU read-side critical section!" [ 143.545738] Call Trace: [ 143.546206] <TASK> [ 143.546625] ? show_regs+0x6d/0x80 [ 143.547267] ? __warn+0x91/0x140 [ 143.547951] ? rcu_note_context_switch+0x496/0x5d0 [ 143.548856] ? report_bug+0x193/0x1a0 [ 143.549557] ? handle_bug+0x63/0xa0 [ 143.550214] ? exc_invalid_op+0x1d/0x80 [ 143.550938] ? asm_exc_invalid_op+0x1f/0x30 [ 143.551736] ? rcu_note_context_switch+0x496/0x5d0 [ 143.552634] ? wakeup_preempt+0x62/0x70 [ 143.553358] __schedule+0xaa/0x1380 [ 143.554025] ? _raw_spin_unlock_irqrestore+0x12/0x40 [ 143.554958] ? try_to_wake_up+0x1fe/0x6b0 [ 143.555715] ? wake_up_process+0x19/0x20 [ 143.556452] schedule+0x2e/0x120 [ 143.557066] schedule_preempt_disabled+0x19/0x30 [ 143.557933] rwsem_down_read_slowpath+0x24d/0x4a0 [ 143.558818] ? xfs_efi_item_format+0x50/0xc0 [xfs] [ 143.559894] down_read+0x4e/0xb0 [ 143.560519] xlog_cil_commit+0x1b2/0xbc0 [xfs] [ 143.561460] ? _raw_spin_unlock+0x12/0x30 [ 143.562212] ? xfs_inode_item_precommit+0xc7/0x220 [xfs] [ 143.563309] ? xfs_trans_run_precommits+0x69/0xd0 [xfs] [ 143.564394] __xfs_trans_commit+0xb5/0x330 [xfs] [ 143.565367] xfs_trans_roll+0x48/0xc0 [xfs] [ 143.566262] xfs_defer_trans_roll+0x57/0x100 [xfs] [ 143.567278] xfs_defer_finish_noroll+0x27a/0x490 [xfs] [ 143.568342] xfs_defer_finish+0x1a/0x80 [xfs] [ 143.569267] xfs_bunmapi_range+0x4d/0xb0 [xfs] [ 143.570208] xfs_itruncate_extents_flags+0x13d/0x230 [xfs] [ 143.571353] xfs_free_eofblocks+0x12e/0x190 [xfs] [ 143.572359] xfs_file_release+0x12d/0x140 [xfs] [ 143.573324] __fput+0xe8/0x2d0 [ 143.573922] __fput_sync+0x1d/0x30 [ 143.574574] nfsd_filp_close+0x33/0x60 [nfsd] [ 143.575430] nfsd_file_free+0x96/0x150 [nfsd] [ 143.576274] nfsd_file_put+0xf7/0x1a0 [nfsd] [ 143.577104] nfsd_file_put_local+0x18/0x30 [nfsd] [ 143.578070] nfs_close_local_fh+0x101/0x110 [nfs_localio] [ 143.579079] __put_nfs_open_context+0xc9/0x180 [nfs] [ 143.580031] nfs_file_clear_open_context+0x4a/0x60 [nfs] [ 143.581038] nfs_file_release+0x3e/0x60 [nfs] [ 143.581879] __fput+0xe8/0x2d0 [ 143.582464] __fput_sync+0x1d/0x30 [ 143.583108] __x64_sys_close+0x41/0x80 [ 143.583823] x64_sys_call+0x189a/0x20d0 [ 143.584552] do_syscall_64+0x64/0x170 [ 143.585240] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 143.586185] RIP: 0033:0x7f3c5153efd7 Fixes: 65f2a5c36635 ("nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: get rid of include ../internal.hAl Viro
added back in 2015 for the sake of vfs_clone_file_range(), which is in linux/fs.h these days Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: fix nfs4_openowner leak when concurrent nfsd4_open occurYang Erkun
The action force umount(umount -f) will attempt to kill all rpc_task even umount operation may ultimately fail if some files remain open. Consequently, if an action attempts to open a file, it can potentially send two rpc_task to nfs server. NFS CLIENT thread1 thread2 open("file") ... nfs4_do_open _nfs4_do_open _nfs4_open_and_get_state _nfs4_proc_open nfs4_run_open_task /* rpc_task1 */ rpc_run_task rpc_wait_for_completion_task umount -f nfs_umount_begin rpc_killall_tasks rpc_signal_task rpc_task1 been wakeup and return -512 _nfs4_do_open // while loop ... nfs4_run_open_task /* rpc_task2 */ rpc_run_task rpc_wait_for_completion_task While processing an open request, nfsd will first attempt to find or allocate an nfs4_openowner. If it finds an nfs4_openowner that is not marked as NFS4_OO_CONFIRMED, this nfs4_openowner will released. Since two rpc_task can attempt to open the same file simultaneously from the client to server, and because two instances of nfsd can run concurrently, this situation can lead to lots of memory leak. Additionally, when we echo 0 to /proc/fs/nfsd/threads, warning will be triggered. NFS SERVER nfsd1 nfsd2 echo 0 > /proc/fs/nfsd/threads nfsd4_open nfsd4_process_open1 find_or_alloc_open_stateowner // alloc oo1, stateid1 nfsd4_open nfsd4_process_open1 find_or_alloc_open_stateowner // find oo1, without NFS4_OO_CONFIRMED release_openowner unhash_openowner_locked list_del_init(&oo->oo_perclient) // cannot find this oo // from client, LEAK!!! alloc_stateowner // alloc oo2 nfsd4_process_open2 init_open_stateid // associate oo1 // with stateid1, stateid1 LEAK!!! nfs4_get_vfs_file // alloc nfsd_file1 and nfsd_file_mark1 // all LEAK!!! nfsd4_process_open2 ... write_threads ... nfsd_destroy_serv nfsd_shutdown_net nfs4_state_shutdown_net nfs4_state_destroy_net destroy_client __destroy_client // won't find oo1!!! nfsd_shutdown_generic nfsd_file_cache_shutdown kmem_cache_destroy for nfsd_file_slab and nfsd_file_mark_slab // bark since nfsd_file1 // and nfsd_file_mark1 // still alive ======================================================================= BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on __kmem_cache_shutdown() ----------------------------------------------------------------------- Slab 0xffd4000004438a80 objects=34 used=1 fp=0xff11000110e2ad28 flags=0x17ffffc0000240(workingset|head|node=0|zone=2|lastcpupid=0x1fffff) CPU: 4 UID: 0 PID: 757 Comm: sh Not tainted 6.12.0-rc6+ #19 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x53/0x70 slab_err+0xb0/0xf0 __kmem_cache_shutdown+0x15c/0x310 kmem_cache_destroy+0x66/0x160 nfsd_file_cache_shutdown+0xac/0x210 [nfsd] nfsd_destroy_serv+0x251/0x2a0 [nfsd] nfsd_svc+0x125/0x1e0 [nfsd] write_threads+0x16a/0x2a0 [nfsd] nfsctl_transaction_write+0x74/0xa0 [nfsd] vfs_write+0x1ae/0x6d0 ksys_write+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e Disabling lock debugging due to kernel taint Object 0xff11000110e2ac38 @offset=3128 Allocated in nfsd_file_do_acquire+0x20f/0xa30 [nfsd] age=1635 cpu=3 pid=800 nfsd_file_do_acquire+0x20f/0xa30 [nfsd] nfsd_file_acquire_opened+0x5f/0x90 [nfsd] nfs4_get_vfs_file+0x4c9/0x570 [nfsd] nfsd4_process_open2+0x713/0x1070 [nfsd] nfsd4_open+0x74b/0x8b0 [nfsd] nfsd4_proc_compound+0x70b/0xc20 [nfsd] nfsd_dispatch+0x1b4/0x3a0 [nfsd] svc_process_common+0x5b8/0xc50 [sunrpc] svc_process+0x2ab/0x3b0 [sunrpc] svc_handle_xprt+0x681/0xa20 [sunrpc] nfsd+0x183/0x220 [nfsd] kthread+0x199/0x1e0 ret_from_fork+0x31/0x60 ret_from_fork_asm+0x1a/0x30 Add nfs4_openowner_unhashed to help found unhashed nfs4_openowner, and break nfsd4_open process to fix this problem. Cc: stable@vger.kernel.org # v5.4+ Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Yang Erkun <yangerkun@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Add nfsd4_copy time-to-liveChuck Lever
Keep async copy state alive for a few lease cycles after the copy completes so that OFFLOAD_STATUS returns something meaningful. This means that NFSD's client shutdown processing needs to purge any of this state that happens to be waiting to die. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Add a laundromat reaper for async copy stateChuck Lever
RFC 7862 Section 4.8 states: > A copy offload stateid will be valid until either (A) the client > or server restarts or (B) the client returns the resource by > issuing an OFFLOAD_CANCEL operation or the client replies to a > CB_OFFLOAD operation. Instead of releasing async copy state when the CB_OFFLOAD callback completes, now let it live until the next laundromat run after the callback completes. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Block DESTROY_CLIENTID only when there are ongoing async COPY operationsChuck Lever
Currently __destroy_client() consults the nfs4_client's async_copies list to determine whether there are ongoing async COPY operations. However, NFSD now keeps copy state in that list even when the async copy has completed, to enable OFFLOAD_STATUS to find the COPY results for a while after the COPY has completed. DESTROY_CLIENTID should not be blocked if the client's async_copies list contains state for only completed copy operations. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Handle an NFS4ERR_DELAY response to CB_OFFLOADChuck Lever
RFC 7862 permits callback services to respond to CB_OFFLOAD with NFS4ERR_DELAY. Currently NFSD drops the CB_OFFLOAD in that case. To improve the reliability of COPY offload, NFSD should rather send another CB_OFFLOAD completion notification. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Free async copy information in nfsd4_cb_offload_release()Chuck Lever
RFC 7862 Section 4.8 states: > A copy offload stateid will be valid until either (A) the client > or server restarts or (B) the client returns the resource by > issuing an OFFLOAD_CANCEL operation or the client replies to a > CB_OFFLOAD operation. Currently, NFSD purges the metadata for an async COPY operation as soon as the CB_OFFLOAD callback has been sent. It does not wait even for the client's CB_OFFLOAD response, as the paragraph above suggests that it should. This makes the OFFLOAD_STATUS operation ineffective during the window between the completion of an asynchronous COPY and the server's receipt of the corresponding CB_OFFLOAD response. This is important if, for example, the client responds with NFS4ERR_DELAY, or the transport is lost before the server receives the response. A client might use OFFLOAD_STATUS to query the server about the still pending asynchronous COPY, but NFSD will respond to OFFLOAD_STATUS as if it had never heard of the presented copy stateid. This patch starts to address this issue by extending the lifetime of struct nfsd4_copy at least until the server has seen the client's CB_OFFLOAD response, or the CB_OFFLOAD has timed out. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Fix nfsd4_shutdown_copy()Chuck Lever
nfsd4_shutdown_copy() is just this: while ((copy = nfsd4_get_copy(clp)) != NULL) nfsd4_stop_copy(copy); nfsd4_get_copy() bumps @copy's reference count, preventing nfsd4_stop_copy() from releasing @copy. A while loop like this usually works by removing the first element of the list, but neither nfsd4_get_copy() nor nfsd4_stop_copy() alters the async_copies list. Best I can tell, then, is that nfsd4_shutdown_copy() continues to loop until other threads manage to remove all the items from this list. The spinning loop blocks shutdown until these items are gone. Possibly the reason we haven't seen this issue in the field is because client_has_state() prevents __destroy_client() from calling nfsd4_shutdown_copy() if there are any items on this list. In a subsequent patch I plan to remove that restriction. Fixes: e0639dc5805a ("NFSD introduce async copy feature") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Add a tracepoint to record canceled async COPY operationsChuck Lever
Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: make nfsd4_session->se_flags a boolJeff Layton
While this holds the flags from the CREATE_SESSION request, nothing ever consults them. The only flag used is NFS4_SESSION_DEAD. Make it a simple bool instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: remove nfsd4_session->se_bchannelJeff Layton
This field is written and is never consulted again. Remove it. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: make use of warning provided by refcount_tNeilBrown
refcount_t, by design, checks for unwanted situations and provides warnings. It is rarely useful to have explicit warnings with refcount usage. In this case we have an explicit warning if a refcount_t reaches zero when decremented. Simply using refcount_dec() will provide a similar warning and also mark the refcount_t as saturated to avoid any possible use-after-free. This patch drops the warning and uses refcount_dec() instead of refcount_dec_and_test(). Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: Don't fail OP_SETCLIENTID when there are too many clients.NeilBrown
Failing OP_SETCLIENTID or OP_EXCHANGE_ID should only happen if there is memory allocation failure. Putting a hard limit on the number of clients is not really helpful as it will either happen too early and prevent clients that the server can easily handle, or too late and allow clients when the server is swamped. The calculated limit is still useful for expiring courtesy clients where there are "too many" clients, but it shouldn't prevent the creation of active clients. Testing of lots of clients against small-mem servers reports repeated NFS4ERR_DELAY responses which doesn't seem helpful. There may have been reports of similar problems in production use. Also remove an outdated comment - we do use a slab cache. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: release svc_expkey/svc_export with rcu_workYang Erkun
The last reference for `cache_head` can be reduced to zero in `c_show` and `e_show`(using `rcu_read_lock` and `rcu_read_unlock`). Consequently, `svc_export_put` and `expkey_put` will be invoked, leading to two issues: 1. The `svc_export_put` will directly free ex_uuid. However, `e_show`/`c_show` will access `ex_uuid` after `cache_put`, which can trigger a use-after-free issue, shown below. ================================================================== BUG: KASAN: slab-use-after-free in svc_export_show+0x362/0x430 [nfsd] Read of size 1 at addr ff11000010fdc120 by task cat/870 CPU: 1 UID: 0 PID: 870 Comm: cat Not tainted 6.12.0-rc3+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x53/0x70 print_address_description.constprop.0+0x2c/0x3a0 print_report+0xb9/0x280 kasan_report+0xae/0xe0 svc_export_show+0x362/0x430 [nfsd] c_show+0x161/0x390 [sunrpc] seq_read_iter+0x589/0x770 seq_read+0x1e5/0x270 proc_reg_read+0xe1/0x140 vfs_read+0x125/0x530 ksys_read+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e Allocated by task 830: kasan_save_stack+0x20/0x40 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x8f/0xa0 __kmalloc_node_track_caller_noprof+0x1bc/0x400 kmemdup_noprof+0x22/0x50 svc_export_parse+0x8a9/0xb80 [nfsd] cache_do_downcall+0x71/0xa0 [sunrpc] cache_write_procfs+0x8e/0xd0 [sunrpc] proc_reg_write+0xe1/0x140 vfs_write+0x1a5/0x6d0 ksys_write+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 868: kasan_save_stack+0x20/0x40 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x37/0x50 kfree+0xf3/0x3e0 svc_export_put+0x87/0xb0 [nfsd] cache_purge+0x17f/0x1f0 [sunrpc] nfsd_destroy_serv+0x226/0x2d0 [nfsd] nfsd_svc+0x125/0x1e0 [nfsd] write_threads+0x16a/0x2a0 [nfsd] nfsctl_transaction_write+0x74/0xa0 [nfsd] vfs_write+0x1a5/0x6d0 ksys_write+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e 2. We cannot sleep while using `rcu_read_lock`/`rcu_read_unlock`. However, `svc_export_put`/`expkey_put` will call path_put, which subsequently triggers a sleeping operation due to the following `dput`. ============================= WARNING: suspicious RCU usage 5.10.0-dirty #141 Not tainted ----------------------------- ... Call Trace: dump_stack+0x9a/0xd0 ___might_sleep+0x231/0x240 dput+0x39/0x600 path_put+0x1b/0x30 svc_export_put+0x17/0x80 e_show+0x1c9/0x200 seq_read_iter+0x63f/0x7c0 seq_read+0x226/0x2d0 vfs_read+0x113/0x2c0 ksys_read+0xc9/0x170 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x67/0xd1 Fix these issues by using `rcu_work` to help release `svc_expkey`/`svc_export`. This approach allows for an asynchronous context to invoke `path_put` and also facilitates the freeing of `uuid/exp/key` after an RCU grace period. Fixes: 9ceddd9da134 ("knfsd: Allow lockless lookups of the exports") Signed-off-by: Yang Erkun <yangerkun@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: make sure exp active before svc_export_showYang Erkun
The function `e_show` was called with protection from RCU. This only ensures that `exp` will not be freed. Therefore, the reference count for `exp` can drop to zero, which will trigger a refcount use-after-free warning when `exp_get` is called. To resolve this issue, use `cache_get_rcu` to ensure that `exp` remains active. ------------[ cut here ]------------ refcount_t: addition on 0; use-after-free. WARNING: CPU: 3 PID: 819 at lib/refcount.c:25 refcount_warn_saturate+0xb1/0x120 CPU: 3 UID: 0 PID: 819 Comm: cat Not tainted 6.12.0-rc3+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014 RIP: 0010:refcount_warn_saturate+0xb1/0x120 ... Call Trace: <TASK> e_show+0x20b/0x230 [nfsd] seq_read_iter+0x589/0x770 seq_read+0x1e5/0x270 vfs_read+0x125/0x530 ksys_read+0xc1/0x160 do_syscall_64+0x5f/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: bf18f163e89c ("NFSD: Using exp_get for export getting") Cc: stable@vger.kernel.org # 4.20+ Signed-off-by: Yang Erkun <yangerkun@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18lockd: Remove unneeded initialization of file_lock::c.flc_flagsChuck Lever
Since commit 75c7940d2a86 ("lockd: set missing fl_flags field when retrieving args"), nlmsvc_retrieve_args() initializes the flc_flags field. svcxdr_decode_lock() no longer needs to do this. This clean up removes one dependency on the nlm_lock:fl field. No behavior change is expected. Analysis: svcxdr_decode_lock() is called by: nlm4svc_decode_testargs() nlm4svc_decode_lockargs() nlm4svc_decode_cancargs() nlm4svc_decode_unlockargs() nlm4svc_decode_testargs() is used by: - NLMPROC4_TEST and NLMPROC4_TEST_MSG, which call nlmsvc_retrieve_args() - NLMPROC4_GRANTED and NLMPROC4_GRANTED_MSG, which don't pass the lock's file_lock to the generic lock API nlm4svc_decode_lockargs() is used by: - NLMPROC4_LOCK and NLM4PROC4_LOCK_MSG, which call nlmsvc_retrieve_args() - NLMPROC4_UNLOCK and NLM4PROC4_UNLOCK_MSG, which call nlmsvc_retrieve_args() - NLMPROC4_NM_LOCK, which calls nlmsvc_retrieve_args() nlm4svc_decode_cancargs() is used by: - NLMPROC4_CANCEL and NLMPROC4_CANCEL_MSG, which call nlmsvc_retrieve_args() nlm4svc_decode_unlockargs() is used by: - NLMPROC4_UNLOCK and NLMPROC4_UNLOCK_MSG, which call nlmsvc_retrieve_args() All callers except GRANTED/GRANTED_MSG eventually call nlmsvc_retrieve_args() before using nlm_lock::fl.c.flc_flags. Thus this change is safe. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18lockd: Remove unused parameter to nlmsvc_testlock()Chuck Lever
The nlm_cookie parameter has been unused since commit 09802fd2a8ca ("lockd: rip out deferred lock handling from testlock codepath"). Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18lockd: Remove some snippets of unfinished codeChuck Lever
Clean up. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18lockd: Remove unnecessary memset()Chuck Lever
Since commit 103cc1fafee4 ("SUNRPC: Parametrize how much of argsize should be zeroed") (and possibly long before that, even) all of the memory underlying rqstp->rq_argp is zeroed already. There's no need for the memset() in nlm4svc_decode_shareargs(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Cap the number of bytes copied by nfs4_reset_recoverydir()Chuck Lever
It's only current caller already length-checks the string, but let's be safe. Fixes: 0964a3d3f1aa ("[PATCH] knfsd: nfsd4 reboot dirname fix") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Remove unused values from nfsd4_encode_components_esc()Chuck Lever
Clean up. The computed value of @p is saved each time through the loop but is never used. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Remove unused results in nfsd4_encode_pathname4()Chuck Lever
Clean up. The result of "*p++" is saved, but is not used before it is overwritten. The result of xdr_encode_opaque() is saved each time through the loop but is never used. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Prevent NULL dereference in nfsd4_process_cb_update()Chuck Lever
@ses is initialized to NULL. If __nfsd4_find_backchannel() finds no available backchannel session, setup_callback_client() will try to dereference @ses and segfault. Fixes: dcbeaa68dbbd ("nfsd4: allow backchannel recovery") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Remove a never-true comparisonChuck Lever
fh_size is an unsigned int, thus it can never be less than 0. Fixes: d8b26071e65e ("NFSD: simplify struct nfsfh") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Remove dead code in nfsd4_create_session()Chuck Lever
Clean up. AFAICT, there is no way to reach the out_free_conn label with @old set to a non-NULL value, so the expire_client(old) call is never reached and can be removed. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: refine and rename NFSD_MAY_LOCKNeilBrown
NFSD_MAY_LOCK means a few different things. - it means that GSS is not required. - it means that with NFSEXP_NOAUTHNLM, authentication is not required - it means that OWNER_OVERRIDE is allowed. None of these are specific to locking, they are specific to the NLM protocol. So: - rename to NFSD_MAY_NLM - set NFSD_MAY_OWNER_OVERRIDE and NFSD_MAY_BYPASS_GSS in nlm_fopen() so that NFSD_MAY_NLM doesn't need to imply these. - move the test on NFSEXP_NOAUTHNLM out of nfsd_permission() and into fh_verify where other special-case tests on the MAY flags happen. nfsd_permission() can be called from other places than fh_verify(), but none of these will have NFSD_MAY_NLM. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18NFSD: Replace use of NFSD_MAY_LOCK in nfsd4_lock()Chuck Lever
NFSv4 LOCK operations should not avoid the set of authorization checks that apply to all other NFSv4 operations. Also, the "no_auth_nlm" export option should apply only to NLM LOCK requests. It's not necessary or sensible to apply it to NFSv4 LOCK operations. Instead, set no permission bits when calling fh_verify(). Subsequent stateid processing handles authorization checks. Reported-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: replace call_rcu by kfree_rcu for simple kmem_cache_free callbackJulia Lawall
Since SLOB was removed and since commit 6c6c47b063b5 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"), it is not necessary to use call_rcu when the callback only performs kmem_cache_free. Use kfree_rcu() directly. The changes were made using Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-18nfsd: Fix NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOTPali Rohár
Currently NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT do not bypass only GSS, but bypass any method. This is a problem specially for NFS3 AUTH_NULL-only exports. The purpose of NFSD_MAY_BYPASS_GSS_ON_ROOT is described in RFC 2623, section 2.3.2, to allow mounting NFS2/3 GSS-only export without authentication. So few procedures which do not expose security risk used during mount time can be called also with AUTH_NONE or AUTH_SYS, to allow client mount operation to finish successfully. The problem with current implementation is that for AUTH_NULL-only exports, the NFSD_MAY_BYPASS_GSS_ON_ROOT is active also for NFS3 AUTH_UNIX mount attempts which confuse NFS3 clients, and make them think that AUTH_UNIX is enabled and is working. Linux NFS3 client never switches from AUTH_UNIX to AUTH_NONE on active mount, which makes the mount inaccessible. Fix the NFSD_MAY_BYPASS_GSS and NFSD_MAY_BYPASS_GSS_ON_ROOT implementation and really allow to bypass only exports which have enabled some real authentication (GSS, TLS, or any other). The result would be: For AUTH_NULL-only export if client attempts to do mount with AUTH_UNIX flavor then it will receive access errors, which instruct client that AUTH_UNIX flavor is not usable and will either try other auth flavor (AUTH_NULL if enabled) or fails mount procedure. Similarly if client attempt to do mount with AUTH_NULL flavor and only AUTH_UNIX flavor is enabled then the client will receive access error. This should fix problems with AUTH_NULL-only or AUTH_UNIX-only exports if client attempts to mount it with other auth flavor (e.g. with AUTH_NULL for AUTH_UNIX-only export, or with AUTH_UNIX for AUTH_NULL-only export). Signed-off-by: Pali Rohár <pali@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>