summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-09-17resource: fix region_intersects() vs add_memory_driver_managed()Huang Ying
On a system with CXL memory, the resource tree (/proc/iomem) related to CXL memory may look like something as follows. 490000000-50fffffff : CXL Window 0 490000000-50fffffff : region0 490000000-50fffffff : dax0.0 490000000-50fffffff : System RAM (kmem) Because drivers/dax/kmem.c calls add_memory_driver_managed() during onlining CXL memory, which makes "System RAM (kmem)" a descendant of "CXL Window X". This confuses region_intersects(), which expects all "System RAM" resources to be at the top level of iomem_resource. This can lead to bugs. For example, when the following command line is executed to write some memory in CXL memory range via /dev/mem, $ dd if=data of=/dev/mem bs=$((1 << 10)) seek=$((0x490000000 >> 10)) count=1 dd: error writing '/dev/mem': Bad address 1+0 records in 0+0 records out 0 bytes copied, 0.0283507 s, 0.0 kB/s the command fails as expected. However, the error code is wrong. It should be "Operation not permitted" instead of "Bad address". More seriously, the /dev/mem permission checking in devmem_is_allowed() passes incorrectly. Although the accessing is prevented later because ioremap() isn't allowed to map system RAM, it is a potential security issue. During command executing, the following warning is reported in the kernel log for calling ioremap() on system RAM. ioremap on RAM at 0x0000000490000000 - 0x0000000490000fff WARNING: CPU: 2 PID: 416 at arch/x86/mm/ioremap.c:216 __ioremap_caller.constprop.0+0x131/0x35d Call Trace: memremap+0xcb/0x184 xlate_dev_mem_ptr+0x25/0x2f write_mem+0x94/0xfb vfs_write+0x128/0x26d ksys_write+0xac/0xfe do_syscall_64+0x9a/0xfd entry_SYSCALL_64_after_hwframe+0x4b/0x53 The details of command execution process are as follows. In the above resource tree, "System RAM" is a descendant of "CXL Window 0" instead of a top level resource. So, region_intersects() will report no System RAM resources in the CXL memory region incorrectly, because it only checks the top level resources. Consequently, devmem_is_allowed() will return 1 (allow access via /dev/mem) for CXL memory region incorrectly. Fortunately, ioremap() doesn't allow to map System RAM and reject the access. So, region_intersects() needs to be fixed to work correctly with the resource tree with "System RAM" not at top level as above. To fix it, if we found a unmatched resource in the top level, we will continue to search matched resources in its descendant resources. So, we will not miss any matched resources in resource tree anymore. In the new implementation, an example resource tree |------------- "CXL Window 0" ------------| |-- "System RAM" --| will behave similar as the following fake resource tree for region_intersects(, IORESOURCE_SYSTEM_RAM, ), |-- "System RAM" --||-- "CXL Window 0a" --| Where "CXL Window 0a" is part of the original "CXL Window 0" that isn't covered by "System RAM". Link: https://lkml.kernel.org/r/20240906030713.204292-2-ying.huang@intel.com Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-17Merge tag 'printk-for-6.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: "This is the "last" part of the support for the new nbcon consoles. Where "nbcon" stays for "No Big console lock CONsoles" aka not under the console_lock. New callbacks are added to struct console: - write_thread() for flushing nbcon consoles in task context. - write_atomic() for flushing nbcon consoles in atomic context, including NMI. - con->device_lock() and device_unlock() for taking the driver specific lock, for example, port->lock. New printk-specific kthreads are created: - per-console kthreads which get responsible for flushing normal priority messages on nbcon consoles. - thread which gets responsible for flushing normal priority messages on all consoles when CONFIG_RT enabled. The new callbacks are called under a special per-console lock which has already been added back in v6.7. It allows to distinguish three severities: normal, emergency, and panic. A context with a higher priority could take over the ownership when it is safe even in the middle of handling a record. The panic context could do it even when it is not safe. But it is allowed only for the final desperate flush before entering the infinite loop. The new lock helps to flush the messages directly in emergency and panic contexts. But it is not enough in all situations: - console_lock() is still need for synchronization against boot consoles. - con->device_lock() is need for synchronization against other operations on the same HW, e.g. serial port speed setting, non-printk related read/write. The dependency on con->device_lock() is mutual. Any code taking the driver specific lock has to acquire the related nbcon console context as well. For example, see the new uart_port_lock() API. It provides the necessary synchronization against emergency and panic contexts where the messages are flushed only under the new per-console lock. Maybe surprisingly, a quite tricky part is the decision how to flush the consoles in various situations. It has to take into account: - message priority: normal, emergency, panic - scheduling context: task, atomic, deferred_legacy - registered consoles: boot, legacy, nbcon - threads are running: early boot, suspend, shutdown, panic - caller: printk(), pr_flush(), printk_flush_in_panic(), console_unlock(), console_start(), ... The primary decision is made in printk_get_console_flush_type(). It creates a hint what the caller should do: - flush nbcon consoles directly or via the kthread - call the legacy loop (console_unlock()) directly or via irq_work The existing behavior is preserved for the legacy consoles. The only exception is that they are not longer flushed directly from printk() in panic() before CPUs are stopped. But this blocking happens only when at least one nbcon console is registered. The motivation is to increase a chance to produce the crash dump. They legacy consoles might create a deadlock in compare with nbcon consoles. The nbcon console should allow to see the messages even when the crash dump fails. There are three possible ways how nbcon consoles are flushed: - The per-nbcon-console kthread is responsible for flushing messages added with the normal priority. This is the default mode. - The legacy loop, aka console_unlock(), is used when there is still a boot console registered. There is no easy way how to match an early console driver with a nbcon console driver. And the console_lock() provides the only reliable serialization at the moment. The legacy loop uses either con->write_atomic() or con->write_thread() callbacks depending on whether it is allowed to schedule. The atomic variant has to be used from printk(). - In other situations, the messages are flushed directly using write_atomic() which can be called in any context, including NMI. It is primary needed during early boot or shutdown, in emergency situations, and panic. The emergency priority is used by a code called within nbcon_cpu_emergency_enter()/exit(). At the moment, it is used in four situations: WARN(), Oops, lockdep, and RCU stall reports. Finally, there is no nbcon console at the moment. It means that the changes should _not_ modify the existing behavior. The only exception is CONFIG_RT which would force offloading the legacy loop, for normal priority context, into the dedicated kthread" * tag 'printk-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (54 commits) printk: Avoid false positive lockdep report for legacy printing printk: nbcon: Assign nice -20 for printing threads printk: Implement legacy printer kthread for PREEMPT_RT tty: sysfs: Add nbcon support for 'active' proc: Add nbcon support for /proc/consoles proc: consoles: Add notation to c_start/c_stop printk: nbcon: Show replay message on takeover printk: Provide helper for message prepending printk: nbcon: Rely on kthreads for normal operation printk: nbcon: Use thread callback if in task context for legacy printk: nbcon: Relocate nbcon_atomic_emit_one() printk: nbcon: Introduce printer kthreads printk: nbcon: Init @nbcon_seq to highest possible printk: nbcon: Add context to usable() and emit() printk: Flush console on unregister_console() printk: Fail pr_flush() if before SYSTEM_SCHEDULING printk: nbcon: Add function for printers to reacquire ownership printk: nbcon: Use raw_cpu_ptr() instead of open coding printk: Use the BITS_PER_LONG macro lockdep: Mark emergency sections in lockdep splats ...
2024-09-17Merge tag 'timers-core-2024-09-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Core: - Overhaul of posix-timers in preparation of removing the workaround for periodic timers which have signal delivery ignored. - Remove the historical extra jiffie in msleep() msleep() adds an extra jiffie to the timeout value to ensure minimal sleep time. The timer wheel ensures minimal sleep time since the large rewrite to a non-cascading wheel, but the extra jiffie in msleep() remained unnoticed. Remove it. - Make the timer slack handling correct for realtime tasks. The procfs interface is inconsistent and does neither reflect reality nor conforms to the man page. Show the correct 0 slack for real time tasks and enforce it at the core level instead of having inconsistent individual checks in various timer setup functions. - The usual set of updates and enhancements all over the place. Drivers: - Allow the ACPI PM timer to be turned off during suspend - No new drivers - The usual updates and enhancements in various drivers" * tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits) ntp: Make sure RTC is synchronized when time goes backwards treewide: Fix wrong singular form of jiffies in comments cpu: Use already existing usleep_range() timers: Rename next_expiry_recalc() to be unique platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function clocksource/drivers/jcore: Use request_percpu_irq() clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init() clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended clocksource: acpi_pm: Add external callback for suspend/resume clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped() dt-bindings: timer: rockchip: Add rk3576 compatible timers: Annotate possible non critical data race of next_expiry timers: Remove historical extra jiffie for timeout in msleep() hrtimer: Use and report correct timerslack values for realtime tasks hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse. timers: Add sparse annotation for timer_sync_wait_running(). signal: Replace BUG_ON()s ...
2024-09-17Merge tag 'irq-core-2024-09-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Core: - Remove a global lock in the affinity setting code The lock protects a cpumask for intermediate results and the lock causes a bottleneck on simultaneous start of multiple virtual machines. Replace the lock and the static cpumask with a per CPU cpumask which is nicely serialized by raw spinlock held when executing this code. - Provide support for giving a suffix to interrupt domain names. That's required to support devices with subfunctions so that the domain names are distinct even if they originate from the same device node. - The usual set of cleanups and enhancements all over the place Drivers: - Support for longarch AVEC interrupt chip - Refurbishment of the Armada driver so it can be extended for new variants. - The usual set of cleanups and enhancements all over the place" * tag 'irq-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits) genirq: Use cpumask_intersects() genirq/cpuhotplug: Use cpumask_intersects() irqchip/apple-aic: Only access system registers on SoCs which provide them irqchip/apple-aic: Add a new "Global fast IPIs only" feature level irqchip/apple-aic: Skip unnecessary enabling of use_fast_ipi dt-bindings: apple,aic: Document A7-A11 compatibles irqdomain: Use IS_ERR_OR_NULL() in irq_domain_trim_hierarchy() genirq/msi: Use kmemdup_array() instead of kmemdup() genirq/proc: Change the return value for set affinity permission error genirq/proc: Use irq_move_pending() in show_irq_affinity() genirq/proc: Correctly set file permissions for affinity control files genirq: Get rid of global lock in irq_do_set_affinity() genirq: Fix typo in struct comment irqchip/loongarch-avec: Add AVEC irqchip support irqchip/loongson-pch-msi: Prepare get_pch_msi_handle() for AVECINTC irqchip/loongson-eiointc: Rename CPUHP_AP_IRQ_LOONGARCH_STARTING LoongArch: Architectural preparation for AVEC irqchip LoongArch: Move irqchip function prototypes to irq-loongson.h irqchip/loongson-pch-msi: Switch to MSI parent domains softirq: Remove unused 'action' parameter from action callback ...
2024-09-17Merge tag 'timers-clocksource-2024-09-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull clocksource watchdog updates from Thomas Gleixner: - Make the uncertainty margin handling more robust to prevent false positives - Clarify comments * tag 'timers-clocksource-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin clocksource: Fix comments on WATCHDOG_THRESHOLD & WATCHDOG_MAX_SKEW clocksource: Improve comments for watchdog skew bounds
2024-09-17Merge tag 'smp-core-2024-09-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull CPU hotplug updates from Thomas Gleixner: - Prepare the core for supporting parallel hotplug on loongarch - A small set of cleanups and enhancements * tag 'smp-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smp: Mark smp_prepare_boot_cpu() __init cpu: Fix W=1 build kernel-doc warning cpu/hotplug: Provide weak fallback for arch_cpuhp_init_parallel_bringup() cpu/hotplug: Make HOTPLUG_PARALLEL independent of HOTPLUG_SMT
2024-09-16Merge tag 'audit-pr-20240911' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: - Fix some remaining problems with PID/TGID reporting When most users think about PIDs, what they are really thinking about is the TGID. This commit shifts the audit PID logging and filtering to use the TGID value which should provide a more meaningful audit stream and filtering experience for users. - Migrate to the str_enabled_disabled() helper Evidently we have helper functions that help ensure if we mistype "enabled" or "disabled" it is now caught at compile time. I guess we're fancy now. * tag 'audit-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: Make use of str_enabled_disabled() helper audit: use task_tgid_nr() instead of task_pid_nr()
2024-09-16Merge tag 'vfs-6.12.misc' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual pile of misc updates: Features: - Add F_CREATED_QUERY fcntl() that allows userspace to query whether a file was actually created. Often userspace wants to know whether an O_CREATE request did actually create a file without using O_EXCL. The current logic is that to first attempts to open the file without O_CREAT | O_EXCL and if ENOENT is returned userspace tries again with both flags. If that succeeds all is well. If it now reports EEXIST it retries. That works fairly well but some corner cases make this more involved. If this operates on a dangling symlink the first openat() without O_CREAT | O_EXCL will return ENOENT but the second openat() with O_CREAT | O_EXCL will fail with EEXIST. The reason is that openat() without O_CREAT | O_EXCL follows the symlink while O_CREAT | O_EXCL doesn't for security reasons. So it's not something we can really change unless we add an explicit opt-in via O_FOLLOW which seems really ugly. All available workarounds are really nasty (fanotify, bpf lsm etc) so add a simple fcntl(). - Try an opportunistic lookup for O_CREAT. Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. This was likely done with the expectation that O_CREAT means that we always expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists (see related F_CREATED_QUERY patch motivation above). The series contained in the pr rearranges the pathwalk-for-open code to also attempt a fast_lookup in certain O_CREAT cases. If a positive dentry is found, the inode_lock can be avoided altogether and it can stay in rcuwalk mode for the last step_into. - Expose the 64 bit mount id via name_to_handle_at() Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call - Add a per dentry expire timeout to autofs There are two fairly well known automounter map formats, the autofs format and the amd format (more or less System V and Berkley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount) . 3) "utimeout=<seconds>", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount) To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info. structure and all indirect mounts use the same expire timeout. Fixes: - Fix missing fput for FSCONFIG_SET_FD in autofs - Use param->file for FSCONFIG_SET_FD in coda - Delete the 'fs/netfs' proc subtreee when netfs module exits - Make sure that struct uid_gid_map fits into a single cacheline - Don't flush in-flight wb switches for superblocks without cgroup writeback - Correcting the idmapping mount example in the idmapping documentation - Fix a race between evice_inodes() and find_inode() and iput() - Refine the show_inode_state() macro definition in writeback code - Prevent dump_mapping() from accessing invalid dentry.d_name.name - Show actual source for debugfs in /proc/mounts - Annotate data-race of busy_poll_usecs in eventpoll - Don't WARN for racy path_noexec check in exec code - Handle OOM on mnt_warn_timestamp_expiry() - Fix some spelling in the iomap design documentation - Fix typo in procfs comment - Fix typo in fs/namespace.c comment Cleanups: - Add the VFS git tree to the MAINTAINERS file - Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode bit in struct file bringing us to 5 free f_mode bits - Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify the wait mechanism - Remove the unused path_put_init() helper - Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi specific - Replace the unsigned long i_state member with a u32 i_state member in struct inode freeing up 4 bytes in struct inode. Instead of using the bit based wait apis we're now using the var event apis and using the individual bytes of the i_state member to wait on state changes - Explain how per-syscall AT_* flags should be allocated - Use in_group_or_capable() helper to simplify the posix acl mode update code - Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code - Removed comment about d_rcu_to_refcount() as that function doesn't exist anymore - Add kernel documentation for lookup_fast() - Don't re-zero evenpoll fields - Remove outdated comment after close_fd() - Fix imprecise wording in comment about the pipe filesystem - Drop GFP_NOFAIL mode from alloc_page_buffers - Missing blank line warnings and struct declaration improved in file_table - Annotate struct poll_list with __counted_by() - Remove the unused read parameter in percpu-rwsem - Remove linux/prefetch.h include from direct-io code - Use kmemdup_array instead of kmemdup for multiple allocation in mnt_idmapping code - Remove unused mnt_cursor_del() declaration Performance tweaks: - Dodge smp_mb in break_lease and break_deleg in the common case - Only read fops once in fops_{get,put}() - Use RCU in ilookup() - Elide smp_mb in iversion handling in the common case - Drop one lock trip in evict()" * tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits) uidgid: make sure we fit into one cacheline proc: Fix typo in the comment fs/pipe: Correct imprecise wording in comment fhandle: expose u64 mount id to name_to_handle_at(2) uapi: explain how per-syscall AT_* flags should be allocated fs: drop GFP_NOFAIL mode from alloc_page_buffers writeback: Refine the show_inode_state() macro definition fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation netfs: Delete subtree of 'fs/netfs' when netfs module exits fs: use LIST_HEAD() to simplify code inode: make i_state a u32 inode: port __I_LRU_ISOLATING to var event vfs: fix race between evice_inodes() and find_inode()&iput() inode: port __I_NEW to var event inode: port __I_SYNC to var event fs: reorder i_state bits fs: add i_state helpers MAINTAINERS: add the VFS git tree fs: s/__u32/u32/ for s_fsnotify_mask ...
2024-09-16Merge tag 'pm-6.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "By the number of new lines of code, the most visible change here is the addition of hybrid CPU capacity scaling support to the intel_pstate driver. Next are the amd-pstate driver changes related to the calculation of the AMD boost numerator and preferred core detection. As far as new hardware support is concerned, the intel_idle driver will now handle Granite Rapids Xeon processors natively, the intel_rapl power capping driver will recognize family 1Ah of AMD processors and Intel ArrowLake-U chipos, and intel_pstate will handle Granite Rapids and Sierra Forest chips in the out-of-band (OOB) mode. Apart from the above, there is a usual collection of assorted fixes and code cleanups in many places and there are tooling updates. Specifics: - Remove LATENCY_MULTIPLIER from cpufreq (Qais Yousef) - Add support for Granite Rapids and Sierra Forest in OOB mode to the intel_pstate cpufreq driver (Srinivas Pandruvada) - Add basic support for CPU capacity scaling on x86 and make the intel_pstate driver set asymmetric CPU capacity on hybrid systems without SMT (Rafael Wysocki) - Add missing MODULE_DESCRIPTION() macros to the powerpc cpufreq driver (Jeff Johnson) - Several OF related cleanups in cpufreq drivers (Rob Herring) - Enable COMPILE_TEST for ARM drivers (Rob Herrring) - Introduce quirks for syscon failures and use socinfo to get revision for TI cpufreq driver (Dhruva Gole, Nishanth Menon) - Minor cleanups in amd-pstate driver (Anastasia Belova, Dhananjay Ugwekar) - Minor cleanups for loongson, cpufreq-dt and powernv cpufreq drivers (Danila Tikhonov, Huacai Chen, and Liu Jing) - Make amd-pstate validate return of any attempt to update EPP limits, which fixes the masking hardware problems (Mario Limonciello) - Move the calculation of the AMD boost numerator outside of amd-pstate, correcting acpi-cpufreq on systems with preferred cores (Mario Limonciello) - Harden preferred core detection in amd-pstate to avoid potential false positives (Mario Limonciello) - Add extra unit test coverage for mode state machine (Mario Limonciello) - Fix an "Uninitialized variables" issue in amd-pstste (Qianqiang Liu) - Add Granite Rapids Xeon support to intel_idle (Artem Bityutskiy) - Disable promotion to C1E on Jasper Lake and Elkhart Lake in intel_idle (Kai-Heng Feng) - Use scoped device node handling to fix missing of_node_put() and simplify walking OF children in the riscv-sbi cpuidle driver (Krzysztof Kozlowski) - Remove dead code from cpuidle_enter_state() (Dhruva Gole) - Change an error pointer to NULL to fix error handling in the intel_rapl power capping driver (Dan Carpenter) - Fix off by one in get_rpi() in the intel_rapl power capping driver (Dan Carpenter) - Add support for ArrowLake-U to the intel_rapl power capping driver (Sumeet Pawnikar) - Fix the energy-pkg event for AMD CPUs in the intel_rapl power capping driver (Dhananjay Ugwekar) - Add support for AMD family 1Ah processors to the intel_rapl power capping driver (Dhananjay Ugwekar) - Remove unused stub for saveable_highmem_page() and remove deprecated macros from power management documentation (Andy Shevchenko) - Use ysfs_emit() and sysfs_emit_at() in "show" functions in the PM sysfs interface (Xueqin Luo) - Update the maintainers information for the operating-points-v2-ti-cpu DT binding (Dhruva Gole) - Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring) - Add missing MODULE_DESCRIPTION() macros to devfreq governors (Jeff Johnson) - Use devm_clk_get_enabled() in the exynos-bus devfreq driver (Anand Moon) - Use of_property_present() instead of of_get_property() in the imx-bus devfreq driver (Rob Herring) - Update directory handling and installation process in the pm-graph Makefile and add .gitignore to ignore sleepgraph.py artifacts to pm-graph (Amit Vadhavana, Yo-Jung Lin) - Make cpupower display residency value in idle-info (Aboorva Devarajan) - Add missing powercap_set_enabled() stub function to cpupower (John B. Wyatt IV) - Add SWIG support to cpupower (John B. Wyatt IV)" * tag 'pm-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (62 commits) cpufreq/amd-pstate-ut: Fix an "Uninitialized variables" issue cpufreq/amd-pstate-ut: Add test case for mode switches cpufreq/amd-pstate: Export symbols for changing modes amd-pstate: Add missing documentation for `amd_pstate_prefcore_ranking` cpufreq: amd-pstate: Add documentation for `amd_pstate_hw_prefcore` cpufreq: amd-pstate: Optimize amd_pstate_update_limits() cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator() x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator() x86/amd: Move amd_get_highest_perf() out of amd-pstate ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn ACPI: CPPC: Drop check for non zero perf ratio x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator() ACPI: CPPC: Adjust return code for inline functions in !CONFIG_ACPI_CPPC_LIB x86/amd: Move amd_get_highest_perf() from amd.c to cppc.c PM: hibernate: Remove unused stub for saveable_highmem_page() pm:cpupower: Add error warning when SWIG is not installed MAINTAINERS: Add Maintainers for SWIG Python bindings pm:cpupower: Include test_raw_pylibcpupower.py pm:cpupower: Add SWIG bindings files for libcpupower pm:cpupower: Add missing powercap_set_enabled() stub function ...
2024-09-16Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: "The highlights are support for Arm's "Permission Overlay Extension" using memory protection keys, support for running as a protected guest on Android as well as perf support for a bunch of new interconnect PMUs. Summary: ACPI: - Enable PMCG erratum workaround for HiSilicon HIP10 and 11 platforms. - Ensure arm64-specific IORT header is covered by MAINTAINERS. CPU Errata: - Enable workaround for hardware access/dirty issue on Ampere-1A cores. Memory management: - Define PHYSMEM_END to fix a crash in the amdgpu driver. - Avoid tripping over invalid kernel mappings on the kexec() path. - Userspace support for the Permission Overlay Extension (POE) using protection keys. Perf and PMUs: - Add support for the "fixed instruction counter" extension in the CPU PMU architecture. - Extend and fix the event encodings for Apple's M1 CPU PMU. - Allow LSM hooks to decide on SPE permissions for physical profiling. - Add support for the CMN S3 and NI-700 PMUs. Confidential Computing: - Add support for booting an arm64 kernel as a protected guest under Android's "Protected KVM" (pKVM) hypervisor. Selftests: - Fix vector length issues in the SVE/SME sigreturn tests - Fix build warning in the ptrace tests. Timers: - Add support for PR_{G,S}ET_TSC so that 'rr' can deal with non-determinism arising from the architected counter. Miscellaneous: - Rework our IPI-based CPU stopping code to try NMIs if regular IPIs don't succeed. - Minor fixes and cleanups" * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits) perf: arm-ni: Fix an NULL vs IS_ERR() bug arm64: hibernate: Fix warning for cast from restricted gfp_t arm64: esr: Define ESR_ELx_EC_* constants as UL arm64: pkeys: remove redundant WARN perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled MAINTAINERS: List Arm interconnect PMUs as supported perf: Add driver for Arm NI-700 interconnect PMU dt-bindings/perf: Add Arm NI-700 PMU perf/arm-cmn: Improve format attr printing perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check arm64/mm: use lm_alias() with addresses passed to memblock_free() mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags() arm64: Expose the end of the linear map in PHYSMEM_END arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec() arm64/mm: Delete __init region from memblock.reserved perf/arm-cmn: Support CMN S3 dt-bindings: perf: arm-cmn: Add CMN S3 perf/arm-cmn: Refactor DTC PMU register access perf/arm-cmn: Make cycle counts less surprising perf/arm-cmn: Improve build-time assertion ...
2024-09-16Merge tag 'v6.12-p1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu" "API: - Make self-test asynchronous Algorithms: - Remove MPI functions added for SM3 - Add allocation error checks to remaining MPI functions (introduced for SM3) - Set default Jitter RNG OSR to 3 Drivers: - Add hwrng driver for Rockchip RK3568 SoC - Allow disabling SR-IOV VFs through sysfs in qat - Fix device reset bugs in hisilicon - Fix authenc key parsing by using generic helper in octeontx* Others: - Fix xor benchmarking on parisc" * tag 'v6.12-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (96 commits) crypto: n2 - Set err to EINVAL if snprintf fails for hmac crypto: camm/qi - Use ERR_CAST() to return error-valued pointer crypto: mips/crc32 - Clean up useless assignment operations crypto: qcom-rng - rename *_of_data to *_match_data crypto: qcom-rng - fix support for ACPI-based systems dt-bindings: crypto: qcom,prng: document support for SA8255p crypto: aegis128 - Fix indentation issue in crypto_aegis128_process_crypt() crypto: octeontx* - Select CRYPTO_AUTHENC crypto: testmgr - Hide ENOENT errors crypto: qat - Remove trailing space after \n newline crypto: hisilicon/sec - Remove trailing space after \n newline crypto: algboss - Pass instance creation error up crypto: api - Fix generic algorithm self-test races crypto: hisilicon/qm - inject error before stopping queue crypto: hisilicon/hpre - mask cluster timeout error crypto: hisilicon/qm - reset device before enabling it crypto: hisilicon/trng - modifying the order of header files crypto: hisilicon - add a lock for the qp send operation crypto: hisilicon - fix missed error branch crypto: ccp - do not request interrupt on cmd completion when irqs disabled ...
2024-09-16Merge tag 'net-next-6.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "The zero-copy changes are relatively significant, but regression risk should be contained. The feature needs to be used to cause trouble. Also it feels like we got an order of magnitude more semi-automated "refactoring" chaff than usual, I wonder if it's just us. Core & protocols: - Support Device Memory TCP, ability to zero-copy receive TCP payloads to a DMABUF region of memory while packet headers land separately in normal kernel buffers, and TCP processes then as usual. - The ability to read the PTP PHC (Physical Hardware Clock) alongside MONOTONIC_RAW timestamps with PTP_SYS_OFFSET_EXTENDED. Previously only CLOCK_REALTIME was supported. - Allow matching on all bits of IP DSCP for routing decisions. Previously we only supported on matching TOS bits in IPv4 which is a narrower interpretation of the same header field. - Increase the range of weights used for multi-path routing from 8 bits to 16 bits. - Add support for IPv6 PIO p flag in the Prefix Information Option per draft-ietf-6man-pio-pflag. - IPv6 IOAM6 support for new tunsrc encap mode for better performance. - Detect destinations which blackhole MPTCP traffic and avoid initiating MPTCP connections to them for a certain period of time, 1h by default. - Improve IPsec control path performance by removing the inexact policies list. - AF_VSOCK: add support for SIOCOUTQ ioctl. - Add enum for reasons TCP reset was sent for easier tracing. - Add SMC ringbufs usage statistics. Drivers: - Handle netconsole setup failures more gracefully, don't fail loading, retain the specified target as disabled. - Extend bonding's IPsec offload pass thru capabilities (ESN, stats). Filtering: - Add TCP_BPF_SOCK_OPS_CB_FLAGS to bpf_*sockopt() to address the case when long-lived sockets miss a chance to set additional callbacks if a sockops program was not attached early in their lifetime. - Support using BPF skb helpers in tracepoints. - Conntrack Netlink: support CTA_FILTER for flush. - Improve SCTP support in nfnetlink_queue. - Improve performance of large nftables flush transactions. Things we sprinkled into general kernel code: - selftests: support setting an "interpreter" for script files; make it easy to run as separate cases tests where one "interpreter" is fed various test descriptions (in our case packet sequences). Driver API: - Extend core and ethtool APIs to support many PHYs connected to a single interface (PHY topologies). - Extend cable diagnostics to specify whether Time Domain Reflectometry (TDR) or Active Link Cable Diagnostic (ALCD) was used. - Add library for implementing MAC-PHY Ethernet drivers for SPI devices compatible with Open Alliance 10BASE-T1x MAC-PHY Serial Interface (TC6) standard. - Add helpers to the PHY framework, for PHYs following the Open Alliance standards: - 1000BaseT1 link settings - cable test and diagnostics - Support listing / dumping all allocated RSS contexts. - Add configuration for frequency Embedded SYNC in DPLL, which magically embeds sync pulses into Ethernet signaling. Device drivers: - Ethernet high-speed NICs: - Broadcom (bnxt): - use better FW APIs for queue reset - support QOS and TPID settings for the SR-IOV VLAN - support dynamic MSI-X allocation - Intel (100G, ice, idpf): - ice: support PCIe subfunctions - iavf: add support for TC U32 filters on VFs - ice: support Embedded SYNC in DPLL - nVidia/Mellanox (mlx5): - support HW managed steering tables - support PCIe PTM cross timestamping - AMD/Pensando: - ionic: use page_pool to increase Rx performance - Cisco (enic): - report per-queue statistics - Ethernet virtual: - Microsoft vNIC: - mana: support configuring ring length - netvsc: enable more channels on systems with many CPUs - IBM veth: - optimize polling to improve TCP_RR performance - optimize performance of Tx handling - VirtIO net: - synchronize the operstate with the admin state to allow a lower virtio-net to propagate the link status to an upper device like macvlan - Ethernet NICs consumer, and embedded: - Add driver for Realtek automotive PCIe devices (RTL9054, RTL9068, RTL9072, RTL9075, RTL9068, RTL9071) - Add driver for Microchip LAN8650/1 10BASE-T1S MAC-PHY. - Microchip: - lan743x: use phylink - support WOL, EEE, pause, link settings - add Wake-on-LAN support for KSZ87xx family - add KSZ8895/KSZ8864 switch support - factor out FDMA code and use it in sparx5 and lan966x (including DCB support in both) - Synopsys (stmmac): - support frame preemption (configured using TC and ethtool) - support Loongson DWMAC (GMAC v3.73) - support RockChips RK3576 DWMAC - TI: - am65-cpsw: add multi queue RX support - icssg-prueth: HSR offload support - Cadence (macb): - enable software (hrtimer based) IRQ coalescing by default - Xilinx (axinet): - expose HW statistics - improve multicast filtering - relax Rx checksum offload constraints - MediaTek: - mt7530: add EN7581 support - Aspeed (ftgmac100): - report link speed and duplex - Intel: - igc: add mqprio offload - igc: report EEE configuration - RealTek (r8169): - add support for RTL8126A rev.b - Vitesse (vsc73xx): - implement FDB add/del/dump operations - Freescale (fs_enet): - use phylink - Ethernet PHYs: - vitesse: implement downshift and MDI-X in vsc73xx PHYs - microchip: support LAN887x, supporting IEEE 802.3bw (100BASE-T1) and IEEE 802.3bp (1000BASE-T1) specifications - add Applied Micro QT2025 PHY driver (in Rust) - add Motorcomm yt8821 2.5G Ethernet PHY driver - CAN: - add driver for Rockchip RK3568 CAN-FD controller - flexcan: add wakeup support for imx95 - kvaser_usb: set hardware timestamp on transmitted packets - WiFi: - mac80211/cfg80211: - EHT rate support in AQL airtime fairness - handle DFS (radar detection) per link in Multi-Link Operation - RealTek (rtw89): - support RTL8852BT and 8852BE-VT (WiFi 6) - support hardware rfkill - support HW encryption in unicast management frames - support Wake-on-WLAN with supported network detection - RealTek (rtw89): - improve Rx performance by using USB frame aggregation - support USB 3 with RTL8822CU/RTL8822BU - Intel (iwlwifi/mvm): - offload RLC/SMPS functionality to firmware - Marvell (mwifiex): - add host based MLME to enable WPA3 - Bluetooth: - add support for Amlogic HCI UART protocol - add support for ISO data/packets to Intel and NXP drivers" * tag 'net-next-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1303 commits) net/mlx5: HWS, check the correct variable in hws_send_ring_alloc_sq() netfilter: nft_socket: Fix a NULL vs IS_ERR() bug in nft_socket_cgroup_subtree_level() ice: Fix a NULL vs IS_ERR() check in probe() ice: Fix a couple NULL vs IS_ERR() bugs net: ethernet: fs_enet: Make the per clock optional net: ti: icssg-prueth: Add multicast filtering support in HSR mode net: ti: icssg-prueth: Enable HSR Tx duplication, Tx Tag and Rx Tag offload net: ti: icssg-prueth: Add support for HSR frame forward offload net: ti: icssg-prueth: Stop hardcoding def_inc net: ti: icss-iep: Move icss_iep structure net: ibm: emac: get rid of wol_irq net: ibm: emac: remove all waiting code net: ibm: emac: replace of_get_property net: ibm: emac: use netdev's phydev directly net: ibm: emac: use devm for register_netdev net: ibm: emac: remove mii_bus with devm net: ibm: emac: use devm for of_iomap net: ibm: emac: manage emac_irq with devm net: ibm: emac: use devm for alloc_etherdev octeontx2-af: debugfs: Add Channel info to RPM map ...
2024-09-13bpf: Call the missed kfree() when there is no special field in btfHou Tao
Call the missed kfree() in btf_parse_struct_metas() when there is no special field in btf, otherwise will get the following kmemleak report: unreferenced object 0xffff888101033620 (size 8): comm "test_progs", pid 604, jiffies 4295127011 ...... backtrace (crc e77dc444): [<00000000186f90f3>] kmemleak_alloc+0x4b/0x80 [<00000000ac8e9c4d>] __kmalloc_cache_noprof+0x2a1/0x310 [<00000000d99d68d6>] btf_new_fd+0x72d/0xe90 [<00000000f010b7f8>] __sys_bpf+0xec3/0x2410 [<00000000e077ed6f>] __x64_sys_bpf+0x1f/0x30 [<00000000a12f9e55>] x64_sys_call+0x199/0x9f0 [<00000000f3029ea6>] do_syscall_64+0x3b/0xc0 [<000000005640913a>] entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: 7a851ecb1806 ("bpf: Search for kptrs in prog BTF structs") Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240912012845.3458483-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13bpf: Call the missed btf_record_free() when map creation failsHou Tao
When security_bpf_map_create() in map_create() fails, map_create() will call btf_put() and ->map_free() callback to free the map. It doesn't free the btf_record of map value, so add the missed btf_record_free() when map creation fails. However btf_record_free() needs to be called after ->map_free() just like bpf_map_free_deferred() did, because ->map_free() may use the btf_record to free the special fields in preallocated map value. So factor out bpf_map_free() helper to free the map, btf_record, and btf orderly and use the helper in both map_create() and bpf_map_free_deferred(). Signed-off-by: Hou Tao <houtao1@huawei.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240912012845.3458483-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of errorDaniel Borkmann
For all non-tracing helpers which formerly had ARG_PTR_TO_{LONG,INT} as input arguments, zero the value for the case of an error as otherwise it could leak memory. For tracing, it is not needed given CAP_PERFMON can already read all kernel memory anyway hence bpf_get_func_arg() and bpf_get_func_ret() is skipped in here. Also, the MTU helpers mtu_len pointer value is being written but also read. Technically, the MEM_UNINIT should not be there in order to always force init. Removing MEM_UNINIT needs more verifier rework though: MEM_UNINIT right now implies two things actually: i) write into memory, ii) memory does not have to be initialized. If we lift MEM_UNINIT, it then becomes: i) read into memory, ii) memory must be initialized. This means that for bpf_*_check_mtu() we're readding the issue we're trying to fix, that is, it would then be able to write back into things like .rodata BPF maps. Follow-up work will rework the MEM_UNINIT semantics such that the intent can be better expressed. For now just clear the *mtu_len on error path which can be lifted later again. Fixes: 8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types") Fixes: d7a4cb9b6705 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/e5edd241-59e7-5e39-0ee5-a51e31b6840a@iogearbox.net Link: https://lore.kernel.org/r/20240913191754.13290-5-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13bpf: Improve check_raw_mode_ok test for MEM_UNINIT-tagged typesDaniel Borkmann
When checking malformed helper function signatures, also take other argument types into account aside from just ARG_PTR_TO_UNINIT_MEM. This concerns (formerly) ARG_PTR_TO_{INT,LONG} given uninitialized memory can be passed there, too. The func proto sanity check goes back to commit 435faee1aae9 ("bpf, verifier: add ARG_PTR_TO_RAW_STACK type"), and its purpose was to detect wrong func protos which had more than just one MEM_UNINIT-tagged type as arguments. The reason more than one is currently not supported is as we mark stack slots with STACK_MISC in check_helper_call() in case of raw mode based on meta.access_size to allow uninitialized stack memory to be passed to helpers when they just write into the buffer. Probing for base type as well as MEM_UNINIT tagging ensures that other types do not get missed (as it used to be the case for ARG_PTR_TO_{INT,LONG}). Fixes: 57c3bb725a3d ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types") Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240913191754.13290-4-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13bpf: Fix helper writes to read-only mapsDaniel Borkmann
Lonial found an issue that despite user- and BPF-side frozen BPF map (like in case of .rodata), it was still possible to write into it from a BPF program side through specific helpers having ARG_PTR_TO_{LONG,INT} as arguments. In check_func_arg() when the argument is as mentioned, the meta->raw_mode is never set. Later, check_helper_mem_access(), under the case of PTR_TO_MAP_VALUE as register base type, it assumes BPF_READ for the subsequent call to check_map_access_type() and given the BPF map is read-only it succeeds. The helpers really need to be annotated as ARG_PTR_TO_{LONG,INT} | MEM_UNINIT when results are written into them as opposed to read out of them. The latter indicates that it's okay to pass a pointer to uninitialized memory as the memory is written to anyway. However, ARG_PTR_TO_{LONG,INT} is a special case of ARG_PTR_TO_FIXED_SIZE_MEM just with additional alignment requirement. So it is better to just get rid of the ARG_PTR_TO_{LONG,INT} special cases altogether and reuse the fixed size memory types. For this, add MEM_ALIGNED to additionally ensure alignment given these helpers write directly into the args via *<ptr> = val. The .arg*_size has been initialized reflecting the actual sizeof(*<ptr>). MEM_ALIGNED can only be used in combination with MEM_FIXED_SIZE annotated argument types, since in !MEM_FIXED_SIZE cases the verifier does not know the buffer size a priori and therefore cannot blindly write *<ptr> = val. Fixes: 57c3bb725a3d ("bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20240913191754.13290-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13bpf: Remove truncation test in bpf_strtol and bpf_strtoul helpersDaniel Borkmann
Both bpf_strtol() and bpf_strtoul() helpers passed a temporary "long long" respectively "unsigned long long" to __bpf_strtoll() / __bpf_strtoull(). Later, the result was checked for truncation via _res != ({unsigned,} long)_res as the destination buffer for the BPF helpers was of type {unsigned,} long which is 32bit on 32bit architectures. Given the latter was a bug in the helper signatures where the destination buffer got adjusted to {s,u}64, the truncation check can now be removed. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240913191754.13290-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13bpf: Fix bpf_strtol and bpf_strtoul helpers for 32bitDaniel Borkmann
The bpf_strtol() and bpf_strtoul() helpers are currently broken on 32bit: The argument type ARG_PTR_TO_LONG is BPF-side "long", not kernel-side "long" and therefore always considered fixed 64bit no matter if 64 or 32bit underlying architecture. This contract breaks in case of the two mentioned helpers since their BPF_CALL definition for the helpers was added with {unsigned,}long *res. Meaning, the transition from BPF-side "long" (BPF program) to kernel-side "long" (BPF helper) breaks here. Both helpers call __bpf_strtoll() with "long long" correctly, but later assigning the result into 32-bit "*(long *)" on 32bit architectures. From a BPF program point of view, this means upper bits will be seen as uninitialised. Therefore, fix both BPF_CALL signatures to {s,u}64 types to fix this situation. Now, changing also uapi/bpf.h helper documentation which generates bpf_helper_defs.h for BPF programs is tricky: Changing signatures there to __{s,u}64 would trigger compiler warnings (incompatible pointer types passing 'long *' to parameter of type '__s64 *' (aka 'long long *')) for existing BPF programs. Leaving the signatures as-is would be fine as from BPF program point of view it is still BPF-side "long" and thus equivalent to __{s,u}64 on 64 or 32bit underlying architectures. Note that bpf_strtol() and bpf_strtoul() are the only helpers with this issue. Fixes: d7a4cb9b6705 ("bpf: Introduce bpf_strtol and bpf_strtoul helpers") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/481fcec8-c12c-9abb-8ecb-76c71c009959@iogearbox.net Link: https://lore.kernel.org/r/20240913191754.13290-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13bpf: Fix a sdiv overflow issueYonghong Song
Zac Ecob reported a problem where a bpf program may cause kernel crash due to the following error: Oops: divide error: 0000 [#1] PREEMPT SMP KASAN PTI The failure is due to the below signed divide: LLONG_MIN/-1 where LLONG_MIN equals to -9,223,372,036,854,775,808. LLONG_MIN/-1 is supposed to give a positive number 9,223,372,036,854,775,808, but it is impossible since for 64-bit system, the maximum positive number is 9,223,372,036,854,775,807. On x86_64, LLONG_MIN/-1 will cause a kernel exception. On arm64, the result for LLONG_MIN/-1 is LLONG_MIN. Further investigation found all the following sdiv/smod cases may trigger an exception when bpf program is running on x86_64 platform: - LLONG_MIN/-1 for 64bit operation - INT_MIN/-1 for 32bit operation - LLONG_MIN%-1 for 64bit operation - INT_MIN%-1 for 32bit operation where -1 can be an immediate or in a register. On arm64, there are no exceptions: - LLONG_MIN/-1 = LLONG_MIN - INT_MIN/-1 = INT_MIN - LLONG_MIN%-1 = 0 - INT_MIN%-1 = 0 where -1 can be an immediate or in a register. Insn patching is needed to handle the above cases and the patched codes produced results aligned with above arm64 result. The below are pseudo codes to handle sdiv/smod exceptions including both divisor -1 and divisor 0 and the divisor is stored in a register. sdiv: tmp = rX tmp += 1 /* [-1, 0] -> [0, 1] if tmp >(unsigned) 1 goto L2 if tmp == 0 goto L1 rY = 0 L1: rY = -rY; goto L3 L2: rY /= rX L3: smod: tmp = rX tmp += 1 /* [-1, 0] -> [0, 1] if tmp >(unsigned) 1 goto L1 if tmp == 1 (is64 ? goto L2 : goto L3) rY = 0; goto L2 L1: rY %= rX L2: goto L4 // only when !is64 L3: wY = wY // only when !is64 L4: [1] https://lore.kernel.org/bpf/tPJLTEh7S_DxFEqAI2Ji5MBSoZVg7_G-Py2iaZpAaWtM961fFTWtsnlzwvTbzBzaUzwQAoNATXKUlt0LZOFgnDcIyKCswAnAGdUF3LBrhGQ=@protonmail.com/ Reported-by: Zac Ecob <zacecob@protonmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240913150326.1187788-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-13module: Refine kmemleak scanned areasVincent Donnefort
commit ac3b43283923 ("module: replace module_layout with module_memory") introduced a set of memory regions for the module layout sharing the same attributes. However, it didn't update the kmemleak scanned areas which intended to limit kmemleak scan to sections containing writable data. This means sections such as .text and .rodata are scanned by kmemleak. Refine the scanned areas for modules by limiting it to MOD_TEXT and MOD_INIT_TEXT mod_mem regions. CC: Song Liu <song@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-09-13module: abort module loading when sysfs setup suffer errorsChunhui Li
When insmod a kernel module, if fails in add_notes_attrs or add_sysfs_attrs such as memory allocation fail, mod_sysfs_setup will still return success, but we can't access user interface on android device. Patch for make mod_sysfs_setup can check the error of add_notes_attrs and add_sysfs_attrs [mcgrof: the section stuff comes from linux history.git [0]] Fixes: 3f7b0672086b ("Module section offsets in /sys/module") [0] Fixes: 6d76013381ed ("Add /sys/module/name/notes") Acked-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202409010016.3XIFSmRA-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202409072018.qfEzZbO7-lkp@intel.com/ Link: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=3f7b0672086b97b2d7f322bdc289cbfa203f10ef [0] Signed-off-by: Xion Wang <xion.wang@mediatek.com> Signed-off-by: Chunhui Li <chunhui.li@mediatek.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-09-12Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-09-11 We've added 12 non-merge commits during the last 16 day(s) which contain a total of 20 files changed, 228 insertions(+), 30 deletions(-). There's a minor merge conflict in drivers/net/netkit.c: 00d066a4d4ed ("netdev_features: convert NETIF_F_LLTX to dev->lltx") d96608794889 ("netkit: Disable netpoll support") The main changes are: 1) Enable bpf_dynptr_from_skb for tp_btf such that this can be used to easily parse skbs in BPF programs attached to tracepoints, from Philo Lu. 2) Add a cond_resched() point in BPF's sock_hash_free() as there have been several syzbot soft lockup reports recently, from Eric Dumazet. 3) Fix xsk_buff_can_alloc() to account for queue_empty_descs which got noticed when zero copy ice driver started to use it, from Maciej Fijalkowski. 4) Move the xdp:xdp_cpumap_kthread tracepoint before cpumap pushes skbs up via netif_receive_skb_list() to better measure latencies, from Daniel Xu. 5) Follow-up to disable netpoll support from netkit, from Daniel Borkmann. 6) Improve xsk selftests to not assume a fixed MAX_SKB_FRAGS of 17 but instead gather the actual value via /proc/sys/net/core/max_skb_frags, also from Maciej Fijalkowski. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: sock_map: Add a cond_resched() in sock_hash_free() selftests/bpf: Expand skb dynptr selftests for tp_btf bpf: Allow bpf_dynptr_from_skb() for tp_btf tcp: Use skb__nullable in trace_tcp_send_reset selftests/bpf: Add test for __nullable suffix in tp_btf bpf: Support __nullable argument suffix for tp_btf bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcv selftests/xsk: Read current MAX_SKB_FRAGS from sysctl knob xsk: Bump xsk_queue::queue_empty_descs in xp_can_alloc() tcp_bpf: Remove an unused parameter for bpf_tcp_ingress() bpf, sockmap: Correct spelling skmsg.c netkit: Disable netpoll support Signed-off-by: Jakub Kicinski <kuba@kernel.org> ==================== Link: https://patch.msgid.link/20240911211525.13834-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-12bpf: convert bpf_token_create() to CLASS(fd, ...)Al Viro
Keep file reference through the entire thing, don't bother with grabbing struct path reference and while we are at it, don't confuse the hell out of readers by random mix of path.dentry->d_sb and path.mnt->mnt_sb uses - these two are equal, so just put one of those into a local variable and use that. Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-09-12Merge tag 'wq-for-6.11-rc7-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fix from Tejun Heo: "A fix for a NULL worker->pool deref bug which can be triggered when a worker is created and then destroyed immediately" * tag 'wq-for-6.11-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Clear worker->pool in the worker thread context
2024-09-12dma-mapping: reflow dma_supportedChristoph Hellwig
dma_supported has become too much spaghetti for my taste. Reflow it to remove the duplicate use_dma_iommu condition and make the main path more obvious. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Leon Romanovsky <leon@kernel.org>
2024-09-12uidgid: make sure we fit into one cachelineChristian Brauner
When I expanded uidgid mappings I intended for a struct uid_gid_map to fit into a single cacheline on x86 as they tend to be pretty performance sensitive (idmapped mounts etc). But a 4 byte hole was added that brought it over 64 bytes. Fix that by adding the static extent array and the extent counter into a substruct. C's type punning for unions guarantees that we can access ->nr_extents even if the last written to member wasn't within the same object. This is also what we rely on in struct_group() and friends. This of course relies on non-strict aliasing which we don't do. 99) If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). Link: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2310.pdf Link: https://lore.kernel.org/r/20240910-work-uid_gid_map-v1-1-e6bc761363ed@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-12dma-mapping: reliably inform about DMA support for IOMMULeon Romanovsky
If the DMA IOMMU path is going to be used, the appropriate check should return that DMA is supported. Fixes: b5c58b2fdc42 ("dma-mapping: direct calls for dma-iommu") Closes: https://lore.kernel.org/all/181e06ff-35a3-434f-b505-672f430bd1cb@notapiano Reported-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> #KernelCI Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-09-11sched: Move update_other_load_avgs() to kernel/sched/pelt.cTejun Heo
96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") added update_other_load_avgs() in kernel/sched/syscalls.c right above effective_cpu_util(). This location didn't fit that well in the first place, and with 5d871a63997f ("sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c") moving effective_cpu_util() to kernel/sched/fair.c, it looks even more out of place. Relocate the function to kernel/sched/pelt.c where all its callees are. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com>
2024-09-11workqueue: Clear worker->pool in the worker thread contextLai Jiangshan
Marc Hartmayer reported: [ 23.133876] Unable to handle kernel pointer dereference in virtual kernel address space [ 23.133950] Failing address: 0000000000000000 TEID: 0000000000000483 [ 23.133954] Fault in home space mode while using kernel ASCE. [ 23.133957] AS:000000001b8f0007 R3:0000000056cf4007 S:0000000056cf3800 P:000000000000003d [ 23.134207] Oops: 0004 ilc:2 [#1] SMP (snip) [ 23.134516] Call Trace: [ 23.134520] [<0000024e326caf28>] worker_thread+0x48/0x430 [ 23.134525] ([<0000024e326caf18>] worker_thread+0x38/0x430) [ 23.134528] [<0000024e326d3a3e>] kthread+0x11e/0x130 [ 23.134533] [<0000024e3264b0dc>] __ret_from_fork+0x3c/0x60 [ 23.134536] [<0000024e333fb37a>] ret_from_fork+0xa/0x38 [ 23.134552] Last Breaking-Event-Address: [ 23.134553] [<0000024e333f4c04>] mutex_unlock+0x24/0x30 [ 23.134562] Kernel panic - not syncing: Fatal exception: panic_on_oops With debuging and analysis, worker_thread() accesses to the nullified worker->pool when the newly created worker is destroyed before being waken-up, in which case worker_thread() can see the result detach_worker() reseting worker->pool to NULL at the begining. Move the code "worker->pool = NULL;" out from detach_worker() to fix the problem. worker->pool had been designed to be constant for regular workers and changeable for rescuer. To share attaching/detaching code for regular and rescuer workers and to avoid worker->pool being accessed inadvertently when the worker has been detached, worker->pool is reset to NULL when detached no matter the worker is rescuer or not. To maintain worker->pool being reset after detached, move the code "worker->pool = NULL;" in the worker thread context after detached. It is either be in the regular worker thread context after PF_WQ_WORKER is cleared or in rescuer worker thread context with wq_pool_attach_mutex held. So it is safe to do so. Cc: Marc Hartmayer <mhartmay@linux.ibm.com> Link: https://lore.kernel.org/lkml/87wmjj971b.fsf@linux.ibm.com/ Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com> Fixes: f4b7b53c94af ("workqueue: Detach workers directly in idle_cull_fn()") Cc: stable@vger.kernel.org # v6.11+ Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11bpf: Use fake pt_regs when doing bpf syscall tracepoint tracingYonghong Song
Salvatore Benedetto reported an issue that when doing syscall tracepoint tracing the kernel stack is empty. For example, using the following command line bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }' bpftrace -e 'tracepoint:syscalls:sys_exit_read { print("Kernel Stack\n"); print(kstack()); }' the output for both commands is === Kernel Stack === Further analysis shows that pt_regs used for bpf syscall tracepoint tracing is from the one constructed during user->kernel transition. The call stack looks like perf_syscall_enter+0x88/0x7c0 trace_sys_enter+0x41/0x80 syscall_trace_enter+0x100/0x160 do_syscall_64+0x38/0xf0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The ip address stored in pt_regs is from user space hence no kernel stack is printed. To fix the issue, kernel address from pt_regs is required. In kernel repo, there are already a few cases like this. For example, in kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr) instances are used to supply ip address or use ip address to construct call stack. Instead of allocate fake_regs in the stack which may consume a lot of bytes, the function perf_trace_buf_alloc() in perf_syscall_{enter, exit}() is leveraged to create fake_regs, which will be passed to perf_call_bpf_{enter,exit}(). For the above bpftrace script, I got the following output with this patch: for tracepoint:syscalls:sys_enter_read === Kernel Stack syscall_trace_enter+407 syscall_trace_enter+407 do_syscall_64+74 entry_SYSCALL_64_after_hwframe+75 === and for tracepoint:syscalls:sys_exit_read === Kernel Stack syscall_exit_work+185 syscall_exit_work+185 syscall_exit_to_user_mode+305 do_syscall_64+118 entry_SYSCALL_64_after_hwframe+75 === Reported-by: Salvatore Benedetto <salvabenedetto@meta.com> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240910214037.3663272-1-yonghong.song@linux.dev
2024-09-11bpf: Check percpu map value size firstTao Chen
Percpu map is often used, but the map value size limit often ignored, like issue: https://github.com/iovisor/bcc/issues/2519. Actually, percpu map value size is bound by PCPU_MIN_UNIT_SIZE, so we can check the value size whether it exceeds PCPU_MIN_UNIT_SIZE first, like percpu map of local_storage. Maybe the error message seems clearer compared with "cannot allocate memory". Signed-off-by: Jinke Han <jinkehan@didiglobal.com> Signed-off-by: Tao Chen <chen.dylane@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240910144111.1464912-2-chen.dylane@gmail.com
2024-09-11Merge branch 'tip/sched/core' into sched_ext/for-6.12Tejun Heo
Pull in tip/sched/core to resolve two merge conflicts: - 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") 5d871a63997f ("sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c") A simple context conflict. The former added __update_blocked_others() in the same #ifdef CONFIG_SMP block that effective_cpu_util() and sched_cpu_util() are in and the latter moved those functions to fair.c. This makes __update_blocked_others() more out of place. Will follow up with a patch to relocate. - 96fd6c65efc6 ("sched: Factor out update_other_load_avgs() from __update_blocked_others()") 84d265281d6c ("sched/pelt: Use rq_clock_task() for hw_pressure") The former factored out the body of __update_blocked_others() into update_other_load_avgs(). The latter changed how update_hw_load_avg() is called in the body. Resolved by applying the change to update_other_load_avgs() instead. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11Merge tag 'printk-for-6.11-fixup' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Fix build of serial_core as a module * tag 'printk-for-6.11-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: Export match_devname_and_update_preferred_console()
2024-09-11kernel/workqueue.c: fix DEFINE_PER_CPU_SHARED_ALIGNED expansionBaoquan He
Make tags always produces below annoying warnings: ctags: Warning: kernel/workqueue.c:470: null expansion of name pattern "\1" ctags: Warning: kernel/workqueue.c:474: null expansion of name pattern "\1" ctags: Warning: kernel/workqueue.c:478: null expansion of name pattern "\1" In commit 25528213fe9f ("tags: Fix DEFINE_PER_CPU expansions"), codes in places have been adjusted including cpu_worker_pools definition. I noticed in commit 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets"), cpu_worker_pools definition was unfolded back. Not sure if it was intentionally done or ignored carelessly. Makes change to mute them specifically. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-11Merge branches 'pm-sleep', 'pm-opp' and 'pm-tools'Rafael J. Wysocki
Merge updates related to system sleep, operating performance points (OPP) updates, and PM tooling updates for 6.12-rc1: - Remove unused stub for saveable_highmem_page() and remove deprecated macros from power management documentation (Andy Shevchenko). - Use ysfs_emit() and sysfs_emit_at() in "show" functions in the PM sysfs interface (Xueqin Luo). - Update the maintainers information for the operating-points-v2-ti-cpu DT binding (Dhruva Gole). - Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring). - Update directory handling and installation process in the pm-graph Makefile and add .gitignore to ignore sleepgraph.py artifacts to pm-graph (Amit Vadhavana, Yo-Jung Lin). - Make cpupower display residency value in idle-info (Aboorva Devarajan). - Add missing powercap_set_enabled() stub function to cpupower (John B. Wyatt IV). - Add SWIG support to cpupower (John B. Wyatt IV). * pm-sleep: PM: hibernate: Remove unused stub for saveable_highmem_page() Documentation: PM: Discourage use of deprecated macros PM: sleep: Use sysfs_emit() and sysfs_emit_at() in "show" functions PM: hibernate: Use sysfs_emit() and sysfs_emit_at() in "show" functions * pm-opp: dt-bindings: opp: operating-points-v2-ti-cpu: Update maintainers opp: ti: Drop unnecessary of_match_ptr() * pm-tools: pm:cpupower: Add error warning when SWIG is not installed MAINTAINERS: Add Maintainers for SWIG Python bindings pm:cpupower: Include test_raw_pylibcpupower.py pm:cpupower: Add SWIG bindings files for libcpupower pm:cpupower: Add missing powercap_set_enabled() stub function pm-graph: Update directory handling and installation process in Makefile pm-graph: Make git ignore sleepgraph.py artifacts tools/cpupower: display residency value in idle-info
2024-09-11bpf: wire up sleepable bpf_get_stack() and bpf_get_task_stack() helpersAndrii Nakryiko
Add sleepable implementations of bpf_get_stack() and bpf_get_task_stack() helpers and allow them to be used from sleepable BPF program (e.g., sleepable uprobes). Note, the stack trace IPs capturing itself is not sleepable (that would need to be a separate project), only build ID fetching is sleepable and thus more reliable, as it will wait for data to be paged in, if necessary. For that we make use of sleepable build_id_parse() implementation. Now that build ID related internals in kernel/bpf/stackmap.c can be used both in sleepable and non-sleepable contexts, we need to add additional rcu_read_lock()/rcu_read_unlock() protection around fetching perf_callchain_entry, but with the refactoring in previous commit it's now pretty straightforward. We make sure to do rcu_read_unlock (in sleepable mode only) right before stack_map_get_build_id_offset() call which can sleep. By that time we don't have any more use of perf_callchain_entry. Note, bpf_get_task_stack() will fail for user mode if task != current. And for kernel mode build ID are irrelevant. So in that sense adding sleepable bpf_get_task_stack() implementation is a no-op. It feel right to wire this up for symmetry and completeness, but I'm open to just dropping it until we support `user && crosstask` condition. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240829174232.3133883-10-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-11bpf: decouple stack_map_get_build_id_offset() from perf_callchain_entryAndrii Nakryiko
Change stack_map_get_build_id_offset() which is used to convert stack trace IP addresses into build ID+offset pairs. Right now this function accepts an array of u64s as an input, and uses array of struct bpf_stack_build_id as an output. This is problematic because u64 array is coming from perf_callchain_entry, which is (non-sleepable) RCU protected, so once we allows sleepable build ID fetching, this all breaks down. But its actually pretty easy to make stack_map_get_build_id_offset() works with array of struct bpf_stack_build_id as both input and output. Which is what this patch is doing, eliminating the dependency on perf_callchain_entry. We require caller to fill out bpf_stack_build_id.ip fields (all other can be left uninitialized), and update in place as we do build ID resolution. We make sure to READ_ONCE() and cache locally current IP value as we used it in a few places to find matching VMA and so on. Given this data is directly accessible and modifiable by user's BPF code, we should make sure to have a consistent view of it. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240829174232.3133883-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-11lib/buildid: rename build_id_parse() into build_id_parse_nofault()Andrii Nakryiko
Make it clear that build_id_parse() assumes that it can take no page fault by renaming it and current few users to build_id_parse_nofault(). Also add build_id_parse() stub which for now falls back to non-sleepable implementation, but will be changed in subsequent patches to take advantage of sleepable context. PROCMAP_QUERY ioctl() on /proc/<pid>/maps file is using build_id_parse() and will automatically take advantage of more reliable sleepable context implementation. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240829174232.3133883-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-09-11bpf: Support __nullable argument suffix for tp_btfPhilo Lu
Pointers passed to tp_btf were trusted to be valid, but some tracepoints do take NULL pointer as input, such as trace_tcp_send_reset(). Then the invalid memory access cannot be detected by verifier. This patch fix it by add a suffix "__nullable" to the unreliable argument. The suffix is shown in btf, and PTR_MAYBE_NULL will be added to nullable arguments. Then users must check the pointer before use it. A problem here is that we use "btf_trace_##call" to search func_proto. As it is a typedef, argument names as well as the suffix are not recorded. To solve this, I use bpf_raw_event_map to find "__bpf_trace##template" from "btf_trace_##call", and then we can see the suffix. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Philo Lu <lulie@linux.alibaba.com> Link: https://lore.kernel.org/r/20240911033719.91468-2-lulie@linux.alibaba.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-09-11bpf, cpumap: Move xdp:xdp_cpumap_kthread tracepoint before rcvDaniel Xu
cpumap takes RX processing out of softirq and onto a separate kthread. Since the kthread needs to be scheduled in order to run (versus softirq which does not), we can theoretically experience extra latency if the system is under load and the scheduler is being unfair to us. Moving the tracepoint to before passing the skb list up the stack allows users to more accurately measure enqueue/dequeue latency introduced by cpumap via xdp:xdp_cpumap_enqueue and xdp:xdp_cpumap_kthread tracepoints. f9419f7bd7a5 ("bpf: cpumap add tracepoints") which added the tracepoints states that the intent behind them was for general observability and for a feedback loop to see if the queues are being overwhelmed. This change does not mess with either of those use cases but rather adds a third one. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/bpf/47615d5b5e302e4bd30220473779e98b492d47cd.1725585718.git.dxu@dxuuu.xyz
2024-09-11sched/cpufreq: Use NSEC_PER_MSEC for deadline taskChristian Loehle
Convert the sugov deadline task attributes to use the available definitions to make them more readable. No functional change. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20240813144348.1180344-5-christian.loehle@arm.com
2024-09-11Merge branch 'for-6.11-fixup' into for-linusPetr Mladek
2024-09-11Merge v6.11-rc7 into drm-nextSimona Vetter
Thomas needs 5a498d4d06d6 ("drm/fbdev-dma: Only install deferred I/O if necessary") in drm-misc, so start the backmerge cascade. Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
2024-09-10sched_ext: Don't trigger ops.quiescent/runnable() on migrationsTejun Heo
A task moving across CPUs should not trigger quiescent/runnable task state events as the task is staying runnable the whole time and just stopping and then starting on different CPUs. Suppress quiescent/runnable task state events if task_on_rq_migrating(). Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: David Vernet <void@manifault.com> Cc: Daniel Hodges <hodges.daniel.scott@gmail.com> Cc: Changwoo Min <multics69@gmail.com> Cc: Andrea Righi <andrea.righi@linux.dev> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10sched_ext: Synchronize bypass state changes with rq lockTejun Heo
While the BPF scheduler is being unloaded, the following warning messages trigger sometimes: NOHZ tick-stop error: local softirq work is pending, handler #80!!! This is caused by the CPU entering idle while there are pending softirqs. The main culprit is the bypassing state assertion not being synchronized with rq operations. As the BPF scheduler cannot be trusted in the disable path, the first step is entering the bypass mode where the BPF scheduler is ignored and scheduling becomes global FIFO. This is implemented by turning scx_ops_bypassing() true. However, the transition isn't synchronized against anything and it's possible for enqueue and dispatch paths to have different ideas on whether bypass mode is on. Make each rq track its own bypass state with SCX_RQ_BYPASSING which is modified while rq is locked. This removes most of the NOHZ tick-stop messages but not completely. I believe the stragglers are from the sched core bug where pick_task_scx() can be called without preceding balance_scx(). Once that bug is fixed, we should verify that all occurrences of this error message are gone too. v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that the per-cpu states are always synchronized with the global state. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: David Vernet <void@manifault.com>
2024-09-10cgroup: Do not report unavailable v1 controllers in /proc/cgroupsMichal Koutný
This is a followup to CONFIG-urability of cpuset and memory controllers for v1 hierarchies. Make the output in /proc/cgroups reflect that !CONFIG_CPUSETS_V1 is like !CONFIG_CPUSETS and !CONFIG_MEMCG_V1 is like !CONFIG_MEMCG. The intended effect is that hiding the unavailable controllers will hint users not to try mounting them on v1. Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10cgroup: Disallow mounting v1 hierarchies without controller implementationMichal Koutný
The configs that disable some v1 controllers would still allow mounting them but with no controller-specific files. (Making such hierarchies equivalent to named v1 hierarchies.) To achieve behavior consistent with actual out-compilation of a whole controller, the mounts should treat respective controllers as non-existent. Wrap implementation into a helper function, leverage legacy_files to detect compiled out controllers. The effect is that mounts on v1 would fail and produce a message like: [ 1543.999081] cgroup: Unknown subsys name 'memory' Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10cgroup/cpuset: Expose cpuset filesystem with cpuset v1 onlyMichal Koutný
The cpuset filesystem is a legacy interface to cpuset controller with (pre-)v1 features. It makes little sense to co-mount it on systems without cpuset v1, so do not build it when cpuset v1 is not built neither. Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10PM: hibernate: Remove unused stub for saveable_highmem_page()Andy Shevchenko
When saveable_highmem_page() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: kernel/power/snapshot.c:1369:21: error: unused function 'saveable_highmem_page' [-Werror,-Wunused-function] 1369 | static inline void *saveable_highmem_page(struct zone *z, unsigned long p) | ^~~~~~~~~~~~~~~~~~~~~ Fix this by removing unused stub. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20240905184848.318978-1-andriy.shevchenko@linux.intel.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>