summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2024-12-08Merge tag 'mm-hotfixes-stable-2024-12-07-22-39' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "24 hotfixes. 17 are cc:stable. 15 are MM and 9 are non-MM. The usual bunch of singletons - please see the relevant changelogs for details" * tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits) iio: magnetometer: yas530: use signed integer type for clamp limits sched/numa: fix memory leak due to the overwritten vma->numab_state mm/damon: fix order of arguments in damos_before_apply tracepoint lib: stackinit: hide never-taken branch from compiler mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio() scatterlist: fix incorrect func name in kernel-doc mm: correct typo in MMAP_STATE() macro mm: respect mmap hint address when aligning for THP mm: memcg: declare do_memsw_account inline mm/codetag: swap tags when migrate pages ocfs2: update seq_file index in ocfs2_dlm_seq_next stackdepot: fix stack_depot_save_flags() in NMI context mm: open-code page_folio() in dump_page() mm: open-code PageTail in folio_flags() and const_folio_flags() mm: fix vrealloc()'s KASAN poisoning logic Revert "readahead: properly shorten readahead when falling back to do_page_cache_ra()" selftests/damon: add _damon_sysfs.py to TEST_FILES selftest: hugetlb_dio: fix test naming ocfs2: free inode when ocfs2_get_init_inode() fails nilfs2: fix potential out-of-bounds memory access in nilfs_find_entry() ...
2024-12-06Merge tag 'fixes-2024-12-06' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock fixes from Mike Rapoport: "Restore check for node validity in arch_numa. The rework of NUMA initialization in arch_numa dropped a check that refused to accept configurations with invalid node IDs. Restore that check to ensure that when firmware passes invalid nodes, such configuration is rejected and kernel gracefully falls back to dummy NUMA" * tag 'fixes-2024-12-06' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: arch_numa: Restore nid checks before registering a memblock with a node memblock: allow zero threshold in validate_numa_converage()
2024-12-05mm/filemap: don't call folio_test_locked() without a reference in ↵David Hildenbrand
next_uptodate_folio() The folio can get freed + buddy-merged + reallocated in the meantime, resulting in us calling folio_test_locked() possibly on a tail page. This makes const_folio_flags VM_BUG_ON_PGFLAGS() when stumbling over the tail page. Could this result in other issues? Doesn't look like it. False positives and false negatives don't really matter, because this folio would get skipped either way when detecting that they have been reallocated in the meantime. Fix it by performing the folio_test_locked() checked after grabbing a reference. If this ever becomes a real problem, we could add a special helper that racily checks if the bit is set even on tail pages ... but let's hope that's not required so we can just handle it cleaner: work on the folio after we hold a reference. Do we really need the folio_test_locked() check if we are going to trylock briefly after? Well, we can at least avoid a xas_reload(). It's a bit unclear which exact change introduced that issue. Likely, ever since we made PG_locked obey to the PF_NO_TAIL policy it could have been triggered in some way. Link: https://lkml.kernel.org/r/20241129125303.4033164-1-david@redhat.com Fixes: 48c935ad88f5 ("page-flags: define PG_locked behavior on compound pages") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+9f9a7f73fb079b2387a6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/674184c9.050a0220.1cc393.0001.GAE@google.com/ Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Hillf Danton <hdanton@sina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: correct typo in MMAP_STATE() macroLorenzo Stoakes
We mistakenly refer to len rather than len_ here. The only existing caller passes len to the len_ parameter so this has no impact on the code, but it is obviously incorrect to do this, so fix it. Link: https://lkml.kernel.org/r/20241118175414.390827-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: respect mmap hint address when aligning for THPKalesh Singh
Commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") updated __get_unmapped_area() to align the start address for the VMA to a PMD boundary if CONFIG_TRANSPARENT_HUGEPAGE=y. It does this by effectively looking up a region that is of size, request_size + PMD_SIZE, and aligning up the start to a PMD boundary. Commit 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32 bit") opted out of this for 32bit due to regressions in mmap base randomization. Commit d4148aeab412 ("mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes") restricted this to only mmap sizes that are multiples of the PMD_SIZE due to reported regressions in some performance benchmarks -- which seemed mostly due to the reduced spatial locality of related mappings due to the forced PMD-alignment. Another unintended side effect has emerged: When a user specifies an mmap hint address, the THP alignment logic modifies the behavior, potentially ignoring the hint even if a sufficiently large gap exists at the requested hint location. Example Scenario: Consider the following simplified virtual address (VA) space: ... 0x200000-0x400000 --- VMA A 0x400000-0x600000 --- Hole 0x600000-0x800000 --- VMA B ... A call to mmap() with hint=0x400000 and len=0x200000 behaves differently: - Before THP alignment: The requested region (size 0x200000) fits into the gap at 0x400000, so the hint is respected. - After alignment: The logic searches for a region of size 0x400000 (len + PMD_SIZE) starting at 0x400000. This search fails due to the mapping at 0x600000 (VMA B), and the hint is ignored, falling back to arch_get_unmapped_area[_topdown](). In general the hint is effectively ignored, if there is any existing mapping in the below range: [mmap_hint + mmap_size, mmap_hint + mmap_size + PMD_SIZE) This changes the semantics of mmap hint; from ""Respect the hint if a sufficiently large gap exists at the requested location" to "Respect the hint only if an additional PMD-sized gap exists beyond the requested size". This has performance implications for allocators that allocate their heap using mmap but try to keep it "as contiguous as possible" by using the end of the exisiting heap as the address hint. With the new behavior it's more likely to get a much less contiguous heap, adding extra fragmentation and performance overhead. To restore the expected behavior; don't use thp_get_unmapped_area_vmflags() when the user provided a hint address, for anonymous mappings. Note: As Yang Shi pointed out: the issue still remains for filesystems which are using thp_get_unmapped_area() for their get_unmapped_area() op. It is unclear what worklaods will regress for if we ignore THP alignment when the hint address is provided for such file backed mappings -- so this fix will be handled separately. Link: https://lkml.kernel.org/r/20241118214650.3667577-1-kaleshsingh@google.com Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hans Boehm <hboehm@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: memcg: declare do_memsw_account inlineJohn Sperbeck
In commit 66d60c428b23 ("mm: memcg: move legacy memcg event code into memcontrol-v1.c"), the static do_memsw_account() function was moved from a .c file to a .h file. Unfortunately, the traditional inline keyword wasn't added. If a file (e.g., a unit test) includes the .h file, but doesn't refer to do_memsw_account(), it will get a warning like: mm/memcontrol-v1.h:41:13: warning: unused function 'do_memsw_account' [-Wunused-function] 41 | static bool do_memsw_account(void) | ^~~~~~~~~~~~~~~~ Link: https://lkml.kernel.org/r/20241128203959.726527-1-jsperbeck@google.com Fixes: 66d60c428b23 ("mm: memcg: move legacy memcg event code into memcontrol-v1.c") Signed-off-by: John Sperbeck <jsperbeck@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm/codetag: swap tags when migrate pagesDavid Wang
Current solution to adjust codetag references during page migration is done in 3 steps: 1. sets the codetag reference of the old page as empty (not pointing to any codetag); 2. subtracts counters of the new page to compensate for its own allocation; 3. sets codetag reference of the new page to point to the codetag of the old page. This does not work if CONFIG_MEM_ALLOC_PROFILING_DEBUG=n because set_codetag_empty() becomes NOOP. Instead, let's simply swap codetag references so that the new page is referencing the old codetag and the old page is referencing the new codetag. This way accounting stays valid and the logic makes more sense. Link: https://lkml.kernel.org/r/20241129025213.34836-1-00107082@163.com Fixes: e0a955bf7f61 ("mm/codetag: add pgalloc_tag_copy()") Signed-off-by: David Wang <00107082@163.com> Closes: https://lore.kernel.org/lkml/20241124074318.399027-1-00107082@163.com/ Acked-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Yu Zhao <yuzhao@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: open-code page_folio() in dump_page()Matthew Wilcox (Oracle)
page_folio() calls page_fixed_fake_head() which will misidentify this page as being a fake head and load off the end of 'precise'. We may have a pointer to a fake head, but that's OK because it contains the right information for dump_page(). gcc-15 is smart enough to catch this with -Warray-bounds: In function 'page_fixed_fake_head', inlined from '_compound_head' at ../include/linux/page-flags.h:251:24, inlined from '__dump_page' at ../mm/debug.c:123:11: ../include/asm-generic/rwonce.h:44:26: warning: array subscript 9 is outside +array bounds of 'struct page[1]' [-Warray-bounds=] Link: https://lkml.kernel.org/r/20241125201721.2963278-2-willy@infradead.org Fixes: fae7d834c43c ("mm: add __dump_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm: fix vrealloc()'s KASAN poisoning logicAndrii Nakryiko
When vrealloc() reuses already allocated vmap_area, we need to re-annotate poisoned and unpoisoned portions of underlying memory according to the new size. This results in a KASAN splat recorded at [1]. A KASAN mis-reporting issue where there is none. Note, hard-coding KASAN_VMALLOC_PROT_NORMAL might not be exactly correct, but KASAN flag logic is pretty involved and spread out throughout __vmalloc_node_range_noprof(), so I'm using the bare minimum flag here and leaving the rest to mm people to refactor this logic and reuse it here. Link: https://lkml.kernel.org/r/20241126005206.3457974-1-andrii@kernel.org Link: https://lore.kernel.org/bpf/67450f9b.050a0220.21d33d.0004.GAE@google.com/ [1] Fixes: 3ddc2fefe6f3 ("mm: vmalloc: implement vrealloc()") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05Revert "readahead: properly shorten readahead when falling back to ↵Jan Kara
do_page_cache_ra()" This reverts commit 7c877586da3178974a8a94577b6045a48377ff25. Anders and Philippe have reported that recent kernels occasionally hang when used with NFS in readahead code. The problem has been bisected to 7c877586da3 ("readahead: properly shorten readahead when falling back to do_page_cache_ra()"). The cause of the problem is that ra->size can be shrunk by read_pages() call and subsequently we end up calling do_page_cache_ra() with negative (read huge positive) number of pages. Let's revert 7c877586da3 for now until we can find a proper way how the logic in read_pages() and page_cache_ra_order() can coexist. This can lead to reduced readahead throughput due to readahead window confusion but that's better than outright hangs. Link: https://lkml.kernel.org/r/20241126145208.985-1-jack@suse.cz Fixes: 7c877586da31 ("readahead: properly shorten readahead when falling back to do_page_cache_ra()") Reported-by: Anders Blomdell <anders.blomdell@gmail.com> Reported-by: Philippe Troin <phil@fifi.org> Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Philippe Troin <phil@fifi.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05kasan: make report_lock a raw spinlockJared Kangas
If PREEMPT_RT is enabled, report_lock is a sleeping spinlock and must not be locked when IRQs are disabled. However, KASAN reports may be triggered in such contexts. For example: char *s = kzalloc(1, GFP_KERNEL); kfree(s); local_irq_disable(); char c = *s; /* KASAN report here leads to spin_lock() */ local_irq_enable(); Make report_spinlock a raw spinlock to prevent rescheduling when PREEMPT_RT is enabled. Link: https://lkml.kernel.org/r/20241119210234.1602529-1-jkangas@redhat.com Fixes: 342a93247e08 ("locking/spinlock: Provide RT variant header: <linux/spinlock_rt.h>") Signed-off-by: Jared Kangas <jkangas@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm/mempolicy: fix migrate_to_node() assuming there is at least one VMA in a MMDavid Hildenbrand
We currently assume that there is at least one VMA in a MM, which isn't true. So we might end up having find_vma() return NULL, to then de-reference NULL. So properly handle find_vma() returning NULL. This fixes the report: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 1 UID: 0 PID: 6021 Comm: syz-executor284 Not tainted 6.12.0-rc7-syzkaller-00187-gf868cd251776 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/30/2024 RIP: 0010:migrate_to_node mm/mempolicy.c:1090 [inline] RIP: 0010:do_migrate_pages+0x403/0x6f0 mm/mempolicy.c:1194 Code: ... RSP: 0018:ffffc9000375fd08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffc9000375fd78 RCX: 0000000000000000 RDX: ffff88807e171300 RSI: dffffc0000000000 RDI: ffff88803390c044 RBP: ffff88807e171428 R08: 0000000000000014 R09: fffffbfff2039ef1 R10: ffffffff901cf78f R11: 0000000000000000 R12: 0000000000000003 R13: ffffc9000375fe90 R14: ffffc9000375fe98 R15: ffffc9000375fdf8 FS: 00005555919e1380(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005555919e1ca8 CR3: 000000007f12a000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> kernel_migrate_pages+0x5b2/0x750 mm/mempolicy.c:1709 __do_sys_migrate_pages mm/mempolicy.c:1727 [inline] __se_sys_migrate_pages mm/mempolicy.c:1723 [inline] __x64_sys_migrate_pages+0x96/0x100 mm/mempolicy.c:1723 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [akpm@linux-foundation.org: add unlikely()] Link: https://lkml.kernel.org/r/20241120201151.9518-1-david@redhat.com Fixes: 39743889aaf7 ("[PATCH] Swap Migration V5: sys_migrate_pages interface") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+3511625422f7aa637f0d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/673d2696.050a0220.3c9d61.012f.GAE@google.com/T/ Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05mm/gup: handle NULL pages in unpin_user_pages()John Hubbard
The recent addition of "pofs" (pages or folios) handling to gup has a flaw: it assumes that unpin_user_pages() handles NULL pages in the pages** array. That's not the case, as I discovered when I ran on a new configuration on my test machine. Fix this by skipping NULL pages in unpin_user_pages(), just like unpin_folios() already does. Details: when booting on x86 with "numa=fake=2 movablecore=4G" on Linux 6.12, and running this: tools/testing/selftests/mm/gup_longterm ...I get the following crash: BUG: kernel NULL pointer dereference, address: 0000000000000008 RIP: 0010:sanity_check_pinned_pages+0x3a/0x2d0 ... Call Trace: <TASK> ? __die_body+0x66/0xb0 ? page_fault_oops+0x30c/0x3b0 ? do_user_addr_fault+0x6c3/0x720 ? irqentry_enter+0x34/0x60 ? exc_page_fault+0x68/0x100 ? asm_exc_page_fault+0x22/0x30 ? sanity_check_pinned_pages+0x3a/0x2d0 unpin_user_pages+0x24/0xe0 check_and_migrate_movable_pages_or_folios+0x455/0x4b0 __gup_longterm_locked+0x3bf/0x820 ? mmap_read_lock_killable+0x12/0x50 ? __pfx_mmap_read_lock_killable+0x10/0x10 pin_user_pages+0x66/0xa0 gup_test_ioctl+0x358/0xb20 __se_sys_ioctl+0x6b/0xc0 do_syscall_64+0x7b/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e Link: https://lkml.kernel.org/r/20241121034933.77502-1-jhubbard@nvidia.com Fixes: 94efde1d1539 ("mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases") Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-02module: Convert symbol namespace to string literalPeter Zijlstra
Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-01memblock: allow zero threshold in validate_numa_converage()Mike Rapoport (Microsoft)
Currently memblock validate_numa_converage() returns false negative when threshold set to zero. Make the check if the memory size with invalid node ID is greater than the threshold exclusive to fix that. Link: https://lore.kernel.org/all/Z0mIDBD4KLyxyOCm@kernel.org/ Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2024-11-30Merge tag 'kbuild-v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add generic support for built-in boot DTB files - Enable TAB cycling for dialog buttons in nconfig - Fix issues in streamline_config.pl - Refactor Kconfig - Add support for Clang's AutoFDO (Automatic Feedback-Directed Optimization) - Add support for Clang's Propeller, a profile-guided optimization. - Change the working directory to the external module directory for M= builds - Support building external modules in a separate output directory - Enable objtool for *.mod.o and additional kernel objects - Use lz4 instead of deprecated lz4c - Work around a performance issue with "git describe" - Refactor modpost * tag 'kbuild-v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (85 commits) kbuild: rename .tmp_vmlinux.kallsyms0.syms to .tmp_vmlinux0.syms gitignore: Don't ignore 'tags' directory kbuild: add dependency from vmlinux to resolve_btfids modpost: replace tdb_hash() with hash_str() kbuild: deb-pkg: add python3:native to build dependency genksyms: reduce indentation in export_symbol() modpost: improve error messages in device_id_check() modpost: rename alias symbol for MODULE_DEVICE_TABLE() modpost: rename variables in handle_moddevtable() modpost: move strstarts() to modpost.h modpost: convert do_usb_table() to a generic handler modpost: convert do_of_table() to a generic handler modpost: convert do_pnp_device_entry() to a generic handler modpost: convert do_pnp_card_entries() to a generic handler modpost: call module_alias_printf() from all do_*_entry() functions modpost: pass (struct module *) to do_*_entry() functions modpost: remove DEF_FIELD_ADDR_VAR() macro modpost: deduplicate MODULE_ALIAS() for all drivers modpost: introduce module_alias_printf() helper modpost: remove unnecessary check in do_acpi_entry() ...
2024-11-27Merge tag 'memblock-v6.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock updates from Mike Rapoport: - replace hardcoded strings with str_on_off() in report_meminit() - initialize reserved pages to MIGRATE_MOVABLE when deferred struct page initialization is enabled so that if the reserved pages are freed they are put on movable free lists like it is done now when deferred struct page initialization is disabled * tag 'memblock-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: memblock: uniformly initialize all reserved pages to MIGRATE_MOVABLE mm: Use str_on_off() helper function in report_meminit()
2024-11-27Rename .data.once to .data..once to fix resetting WARN*_ONCEMasahiro Yamada
Commit b1fca27d384e ("kernel debug: support resetting WARN*_ONCE") added support for clearing the state of once warnings. However, it is not functional when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION or CONFIG_LTO_CLANG is enabled, because .data.once matches the .data.[0-9a-zA-Z_]* pattern in the DATA_MAIN macro. Commit cb87481ee89d ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured") was introduced to suppress the issue for the default CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=n case, providing a minimal fix for stable backporting. We were aware this did not address the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y. The plan was to apply correct fixes and then revert cb87481ee89d. [1] Seven years have passed since then, yet the #ifdef workaround remains in place. Meanwhile, commit b1fca27d384e introduced the .data.once section, and commit dc5723b02e52 ("kbuild: add support for Clang LTO") extended the #ifdef. Using a ".." separator in the section name fixes the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_LTO_CLANG. [1]: https://lore.kernel.org/linux-kbuild/CAK7LNASck6BfdLnESxXUeECYL26yUDm0cwRZuM4gmaWUkxjL5g@mail.gmail.com/ Fixes: b1fca27d384e ("kernel debug: support resetting WARN*_ONCE") Fixes: dc5723b02e52 ("kbuild: add support for Clang LTO") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2024-11-26Merge tag 'rust-6.13' of https://github.com/Rust-for-Linux/linuxLinus Torvalds
Pull rust updates from Miguel Ojeda: "Toolchain and infrastructure: - Enable a series of lints, including safety-related ones, e.g. the compiler will now warn about missing safety comments, as well as unnecessary ones. How safety documentation is organized is a frequent source of review comments, thus having the compiler guide new developers on where they are expected (and where not) is very nice. - Start using '#[expect]': an interesting feature in Rust (stabilized in 1.81.0) that makes the compiler warn if an expected warning was _not_ emitted. This is useful to avoid forgetting cleaning up locally ignored diagnostics ('#[allow]'s). - Introduce '.clippy.toml' configuration file for Clippy, the Rust linter, which will allow us to tweak its behaviour. For instance, our first use cases are declaring a disallowed macro and, more importantly, enabling the checking of private items. - Lints-related fixes and cleanups related to the items above. - Migrate from 'receiver_trait' to 'arbitrary_self_types': to get the kernel into stable Rust, one of the major pieces of the puzzle is the support to write custom types that can be used as 'self', i.e. as receivers, since the kernel needs to write types such as 'Arc' that common userspace Rust would not. 'arbitrary_self_types' has been accepted to become stable, and this is one of the steps required to get there. - Remove usage of the 'new_uninit' unstable feature. - Use custom C FFI types. Includes a new 'ffi' crate to contain our custom mapping, instead of using the standard library 'core::ffi' one. The actual remapping will be introduced in a later cycle. - Map '__kernel_{size_t,ssize_t,ptrdiff_t}' to 'usize'/'isize' instead of 32/64-bit integers. - Fix 'size_t' in bindgen generated prototypes of C builtins. - Warn on bindgen < 0.69.5 and libclang >= 19.1 due to a double issue in the projects, which we managed to trigger with the upcoming tracepoint support. It includes a build test since some distributions backported the fix (e.g. Debian -- thanks!). All major distributions we list should be now OK except Ubuntu non-LTS. 'macros' crate: - Adapt the build system to be able run the doctests there too; and clean up and enable the corresponding doctests. 'kernel' crate: - Add 'alloc' module with generic kernel allocator support and remove the dependency on the Rust standard library 'alloc' and the extension traits we used to provide fallible methods with flags. Add the 'Allocator' trait and its implementations '{K,V,KV}malloc'. Add the 'Box' type (a heap allocation for a single value of type 'T' that is also generic over an allocator and considers the kernel's GFP flags) and its shorthand aliases '{K,V,KV}Box'. Add 'ArrayLayout' type. Add 'Vec' (a contiguous growable array type) and its shorthand aliases '{K,V,KV}Vec', including iterator support. For instance, now we may write code such as: let mut v = KVec::new(); v.push(1, GFP_KERNEL)?; assert_eq!(&v, &[1]); Treewide, move as well old users to these new types. - 'sync' module: add global lock support, including the 'GlobalLockBackend' trait; the 'Global{Lock,Guard,LockedBy}' types and the 'global_lock!' macro. Add the 'Lock::try_lock' method. - 'error' module: optimize 'Error' type to use 'NonZeroI32' and make conversion functions public. - 'page' module: add 'page_align' function. - Add 'transmute' module with the existing 'FromBytes' and 'AsBytes' traits. - 'block::mq::request' module: improve rendered documentation. - 'types' module: extend 'Opaque' type documentation and add simple examples for the 'Either' types. drm/panic: - Clean up a series of Clippy warnings. Documentation: - Add coding guidelines for lints and the '#[expect]' feature. - Add Ubuntu to the list of distributions in the Quick Start guide. MAINTAINERS: - Add Danilo Krummrich as maintainer of the new 'alloc' module. And a few other small cleanups and fixes" * tag 'rust-6.13' of https://github.com/Rust-for-Linux/linux: (82 commits) rust: alloc: Fix `ArrayLayout` allocations docs: rust: remove spurious item in `expect` list rust: allow `clippy::needless_lifetimes` rust: warn on bindgen < 0.69.5 and libclang >= 19.1 rust: use custom FFI integer types rust: map `__kernel_size_t` and friends also to usize/isize rust: fix size_t in bindgen prototypes of C builtins rust: sync: add global lock support rust: macros: enable the rest of the tests rust: macros: enable paste! use from macro_rules! rust: enable macros::module! tests rust: kbuild: expand rusttest target for macros rust: types: extend `Opaque` documentation rust: block: fix formatting of `kernel::block::mq::request` module rust: macros: fix documentation of the paste! macro rust: kernel: fix THIS_MODULE header path in ThisModule doc comment rust: page: add Rust version of PAGE_ALIGN rust: helpers: remove unnecessary header includes rust: exports: improve grammar in commentary drm/panic: allow verbose version check ...
2024-11-26Merge tag 'fuse-update-6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Add page -> folio conversions (Joanne Koong, Josef Bacik) - Allow max size of fuse requests to be configurable with a sysctl (Joanne Koong) - Allow FOPEN_DIRECT_IO to take advantage of async code path (yangyun) - Fix large kernel reads (like a module load) in virtio_fs (Hou Tao) - Fix attribute inconsistency in case readdirplus (and plain lookup in corner cases) is racing with inode eviction (Zhang Tianci) - Fix a WARN_ON triggered by virtio_fs (Asahi Lina) * tag 'fuse-update-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (30 commits) virtiofs: dax: remove ->writepages() callback fuse: check attributes staleness on fuse_iget() fuse: remove pages for requests and exclusively use folios fuse: convert direct io to use folios mm/writeback: add folio_mark_dirty_lock() fuse: convert writebacks to use folios fuse: convert retrieves to use folios fuse: convert ioctls to use folios fuse: convert writes (non-writeback) to use folios fuse: convert reads to use folios fuse: convert readdir to use folios fuse: convert readlink to use folios fuse: convert cuse to use folios fuse: add support in virtio for requests using folios fuse: support folios in struct fuse_args_pages and fuse_copy_pages() fuse: convert fuse_notify_store to use folios fuse: convert fuse_retrieve to use folios fuse: use the folio based vmstat helpers fuse: convert fuse_writepage_need_send to take a folio fuse: convert fuse_do_readpage to use folios ...
2024-11-25Merge tag 'slab-for-6.13-v2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Add new slab_strict_numa boot parameter to enforce per-object memory policies on top of slab folio policies, for systems where saving cost of remote accesses is more important than minimizing slab allocation overhead (Christoph Lameter) - Fix for freeptr_offset alignment check being too strict for m68k (Geert Uytterhoeven) - krealloc() fixes for not violating __GFP_ZERO guarantees on krealloc() when slub_debug (redzone and object tracking) is enabled (Feng Tang) - Fix a memory leak in case sysfs registration fails for a slab cache, and also no longer fail to create the cache in that case (Hyeonggon Yoo) - Fix handling of detected consistency problems (due to buggy slab user) with slub_debug enabled, so that it does not cause further list corruption bugs (yuan.gao) - Code cleanup and kerneldocs polishing (Zhen Lei, Vlastimil Babka) * tag 'slab-for-6.13-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slab: Fix too strict alignment check in create_cache() mm/slab: Allow cache creation to proceed even if sysfs registration fails mm/slub: Avoid list corruption when removing a slab from the full list mm/slub, kunit: Add testcase for krealloc redzone and zeroing mm/slub: Improve redzone check and zeroing for krealloc() mm/slub: Consider kfence case for get_orig_size() SLUB: Add support for per object memory policies mm, slab: add kerneldocs for common SLAB_ flags mm/slab: remove duplicate check in create_cache() mm/slub: Move krealloc() and related code to slub.c mm/kasan: Don't store metadata inside kmalloc object when slub_debug_orig_size is on
2024-11-25Merge tag 'mm-nonmm-stable-2024-11-24-02-05' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - The series "resource: A couple of cleanups" from Andy Shevchenko performs some cleanups in the resource management code - The series "Improve the copy of task comm" from Yafang Shao addresses possible race-induced overflows in the management of task_struct.comm[] - The series "Remove unnecessary header includes from {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a small fix to the list_sort library code and to its selftest - The series "Enhance min heap API with non-inline functions and optimizations" also from Kuan-Wei Chiu optimizes and cleans up the min_heap library code - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi finishes off nilfs2's folioification - The series "add detect count for hung tasks" from Lance Yang adds more userspace visibility into the hung-task detector's activity - Apart from that, singelton patches in many places - please see the individual changelogs for details * tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits) gdb: lx-symbols: do not error out on monolithic build kernel/reboot: replace sprintf() with sysfs_emit() lib: util_macros_kunit: add kunit test for util_macros.h util_macros.h: fix/rework find_closest() macros Improve consistency of '#error' directive messages ocfs2: fix uninitialized value in ocfs2_file_read_iter() hung_task: add docs for hung_task_detect_count hung_task: add detect count for hung tasks dma-buf: use atomic64_inc_return() in dma_buf_getfile() fs/proc/kcore.c: fix coccinelle reported ERROR instances resource: avoid unnecessary resource tree walking in __region_intersects() ocfs2: remove unused errmsg function and table ocfs2: cluster: fix a typo lib/scatterlist: use sg_phys() helper checkpatch: always parse orig_commit in fixes tag nilfs2: convert metadata aops from writepage to writepages nilfs2: convert nilfs_recovery_copy_block() to take a folio nilfs2: convert nilfs_page_count_clean_buffers() to take a folio nilfs2: remove nilfs_writepage nilfs2: convert checkpoint file to be folio-based ...
2024-11-23Merge tag 'mm-stable-2024-11-18-19-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. * tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits) cma: enforce non-zero pageblock_order during cma_init_reserved_mem() mm/kfence: add a new kunit test test_use_after_free_read_nofault() zram: fix NULL pointer in comp_algorithm_show() memcg/hugetlb: add hugeTLB counters to memcg vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount zram: ZRAM_DEF_COMP should depend on ZRAM MAINTAINERS/MEMORY MANAGEMENT: add document files for mm Docs/mm/damon: recommend academic papers to read and/or cite mm: define general function pXd_init() kmemleak: iommu/iova: fix transient kmemleak false positive mm/list_lru: simplify the list_lru walk callback function mm/list_lru: split the lock to per-cgroup scope mm/list_lru: simplify reparenting and initial allocation mm/list_lru: code clean up for reparenting mm/list_lru: don't export list_lru_add mm/list_lru: don't pass unnecessary key parameters kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols ...
2024-11-21Merge tag 'for-linus-iommufd' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "Several new features and uAPI for iommufd: - IOMMU_IOAS_MAP_FILE allows passing in a file descriptor as the backing memory for an iommu mapping. To date VFIO/iommufd have used VMA's and pin_user_pages(), this now allows using memfds and memfd_pin_folios(). Notably this creates a pure folio path from the memfd to the iommu page table where memory is never broken down to PAGE_SIZE. - IOMMU_IOAS_CHANGE_PROCESS moves the pinned page accounting between two processes. Combined with the above this allows iommufd to support a VMM re-start using exec() where something like qemu would exec() a new version of itself and fd pass the memfds/iommufd/etc to the new process. The memfd allows DMA access to the memory to continue while the new process is getting setup, and the CHANGE_PROCESS updates all the accounting. - Support for fault reporting to userspace on non-PRI HW, such as ARM stall-mode embedded devices. - IOMMU_VIOMMU_ALLOC introduces the concept of a HW/driver backed virtual iommu. This will be used by VMMs to access hardware features that are contained with in a VM. The first use is to inform the kernel of the virtual SID to physical SID mapping when issuing SID based invalidation on ARM. Further uses will tie HW features that are directly accessed by the VM, such as invalidation queue assignment and others. - IOMMU_VDEVICE_ALLOC informs the kernel about the mapping of virtual device to physical device within a VIOMMU. Minimially this is used to translate VM issued cache invalidation commands from virtual to physical device IDs. - Enhancements to IOMMU_HWPT_INVALIDATE and IOMMU_HWPT_ALLOC to work with the VIOMMU - ARM SMMuv3 support for nested translation. Using the VIOMMU and VDEVICE the driver can model this HW's behavior for nested translation. This includes a shared branch from Will" * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (51 commits) iommu/arm-smmu-v3: Import IOMMUFD module namespace iommufd: IOMMU_IOAS_CHANGE_PROCESS selftest iommufd: Add IOMMU_IOAS_CHANGE_PROCESS iommufd: Lock all IOAS objects iommufd: Export do_update_pinned iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED iommu/arm-smmu-v3: Use S2FWB for NESTED domains iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC Documentation: userspace-api: iommufd: Update vDEVICE iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command iommufd/selftest: Add mock_viommu_cache_invalidate iommufd/viommu: Add iommufd_viommu_find_dev helper iommu: Add iommu_copy_struct_from_full_user_array helper iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE iommu/viommu: Add cache_invalidate to iommufd_viommu_ops iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl ...
2024-11-21Merge tag 'net-next-6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "The most significant set of changes is the per netns RTNL. The new behavior is disabled by default, regression risk should be contained. Notably the new config knob PTP_1588_CLOCK_VMCLOCK will inherit its default value from PTP_1588_CLOCK_KVM, as the first is intended to be a more reliable replacement for the latter. Core: - Started a very large, in-progress, effort to make the RTNL lock scope per network-namespace, thus reducing the lock contention significantly in the containerized use-case, comprising: - RCU-ified some relevant slices of the FIB control path - introduce basic per netns locking helpers - namespacified the IPv4 address hash table - remove rtnl_register{,_module}() in favour of rtnl_register_many() - refactor rtnl_{new,del,set}link() moving as much validation as possible out of RTNL lock - convert all phonet doit() and dumpit() handlers to RCU - convert IPv4 addresses manipulation to per-netns RTNL - convert virtual interface creation to per-netns RTNL the per-netns lock infrastructure is guarded by the CONFIG_DEBUG_NET_SMALL_RTNL knob, disabled by default ad interim. - Introduce NAPI suspension, to efficiently switching between busy polling (NAPI processing suspended) and normal processing. - Migrate the IPv4 routing input, output and control path from direct ToS usage to DSCP macros. This is a work in progress to make ECN handling consistent and reliable. - Add drop reasons support to the IPv4 rotue input path, allowing better introspection in case of packets drop. - Make FIB seqnum lockless, dropping RTNL protection for read access. - Make inet{,v6} addresses hashing less predicable. - Allow providing timestamp OPT_ID via cmsg, to correlate TX packets and timestamps Things we sprinkled into general kernel code: - Add small file operations for debugfs, to reduce the struct ops size. - Refactoring and optimization for the implementation of page_frag API, This is a preparatory work to consolidate the page_frag implementation. Netfilter: - Optimize set element transactions to reduce memory consumption - Extended netlink error reporting for attribute parser failure. - Make legacy xtables configs user selectable, giving users the option to configure iptables without enabling any other config. - Address a lot of false-positive RCU issues, pointed by recent CI improvements. BPF: - Put xsk sockets on a struct diet and add various cleanups. Overall, this helps to bump performance by 12% for some workloads. - Extend BPF selftests to increase coverage of XDP features in combination with BPF cpumap. - Optimize and homogenize bpf_csum_diff helper for all archs and also add a batch of new BPF selftests for it. - Extend netkit with an option to delegate skb->{mark,priority} scrubbing to its BPF program. - Make the bpf_get_netns_cookie() helper available also to tc(x) BPF programs. Protocols: - Introduces 4-tuple hash for connected udp sockets, speeding-up significantly connected sockets lookup. - Add a fastpath for some TCP timers that usually expires after close, the socket lock contention. - Add inbound and outbound xfrm state caches to speed up state lookups. - Avoid sending MPTCP advertisements on stale subflows, reducing risks on loosing them. - Make neighbours table flushing more scalable, maintaining per device neigh lists. Driver API: - Introduce a unified interface to configure transmission H/W shaping, and expose it to user-space via generic-netlink. - Add support for per-NAPI config via netlink. This makes napi configuration persistent across queues removal and re-creation. Requires driver updates, currently supported drivers are: nVidia/Mellanox mlx4 and mlx5, Broadcom brcm and Intel ice. - Add ethtool support for writing SFP / PHY firmware blocks. - Track RSS context allocation from ethtool core. - Implement support for mirroring to DSA CPU port, via TC mirror offload. - Consolidate FDB updates notification, to avoid duplicates on device-specific entries. - Expose DPLL clock quality level to the user-space. - Support master-slave PHY config via device tree. Tests and tooling: - forwarding: introduce deferred commands, to simplify the cleanup phase Drivers: - Updated several drivers - Amazon vNic, Google vNic, Microsoft vNic, Intel e1000e and Broadcom Tigon3 - to use netdev-genl to link the IRQs and queues to NAPI IDs, allowing busy polling and better introspection. - Ethernet high-speed NICs: - nVidia/Mellanox: - mlx5: - a large refactor to implement support for cross E-Switch scheduling - refactor H/W conter management to let it scale better - H/W GRO cleanups - Intel (100G, ice):: - add support for ethtool reset - implement support for per TX queue H/W shaping - AMD/Solarflare: - implement per device queue stats support - Broadcom (bnxt): - improve wildcard l4proto on IPv4/IPv6 ntuple rules - Marvell Octeon: - Add representor support for each Resource Virtualization Unit (RVU) device. - Hisilicon: - add support for the BMC Gigabit Ethernet - IBM (EMAC): - driver cleanup and modernization - Cisco (VIC): - raise the queues number limit to 256 - Ethernet virtual: - Google vNIC: - implement page pool support - macsec: - inherit lower device's features and TSO limits when offloading - virtio_net: - enable premapped mode by default - support for XDP socket(AF_XDP) zerocopy TX - wireguard: - set the TSO max size to be GSO_MAX_SIZE, to aggregate larger packets. - Ethernet NICs embedded and virtual: - Broadcom ASP: - enable software timestamping - Freescale: - add enetc4 PF driver - MediaTek: Airoha SoC: - implement BQL support - RealTek r8169: - enable TSO by default on r8168/r8125 - implement extended ethtool stats - Renesas AVB: - enable TX checksum offload - Synopsys (stmmac): - support header splitting for vlan tagged packets - move common code for DWMAC4 and DWXGMAC into a separate FPE module. - add dwmac driver support for T-HEAD TH1520 SoC - Synopsys (xpcs): - driver refactor and cleanup - TI: - icssg_prueth: add VLAN offload support - Xilinx emaclite: - add clock support - Ethernet switches: - Microchip: - implement support for the lan969x Ethernet switch family - add LAN9646 switch support to KSZ DSA driver - Ethernet PHYs: - Marvel: 88q2x: enable auto negotiation - Microchip: add support for LAN865X Rev B1 and LAN867X Rev C1/C2 - PTP: - Add support for the Amazon virtual clock device - Add PtP driver for s390 clocks - WiFi: - mac80211 - EHT 1024 aggregation size for transmissions - new operation to indicate that a new interface is to be added - support radio separation of multi-band devices - move wireless extension spy implementation to libiw - Broadcom: - brcmfmac: optional LPO clock support - Microchip: - add support for Atmel WILC3000 - Qualcomm (ath12k): - firmware coredump collection support - add debugfs support for a multitude of statistics - Qualcomm (ath5k): - Arcadyan ARV45XX AR2417 & Gigaset SX76[23] AR241[34]A support - Realtek: - rtw88: 8821au and 8812au USB adapters support - rtw89: add thermal protection - rtw89: fine tune BT-coexsitence to improve user experience - rtw89: firmware secure boot for WiFi 6 chip - Bluetooth - add Qualcomm WCN785x support for ids Foxconn 0xe0fc/0xe0f3 and 0x13d3:0x3623 - add Realtek RTL8852BE support for id Foxconn 0xe123 - add MediaTek MT7920 support for wireless module ids - btintel_pcie: add handshake between driver and firmware - btintel_pcie: add recovery mechanism - btnxpuart: add GPIO support to power save feature" * tag 'net-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1475 commits) mm: page_frag: fix a compile error when kernel is not compiled Documentation: tipc: fix formatting issue in tipc.rst selftests: nic_performance: Add selftest for performance of NIC driver selftests: nic_link_layer: Add selftest case for speed and duplex states selftests: nic_link_layer: Add link layer selftest for NIC driver bnxt_en: Add FW trace coredump segments to the coredump bnxt_en: Add a new ethtool -W dump flag bnxt_en: Add 2 parameters to bnxt_fill_coredump_seg_hdr() bnxt_en: Add functions to copy host context memory bnxt_en: Do not free FW log context memory bnxt_en: Manage the FW trace context memory bnxt_en: Allocate backing store memory for FW trace logs bnxt_en: Add a 'force' parameter to bnxt_free_ctx_mem() bnxt_en: Refactor bnxt_free_ctx_mem() bnxt_en: Add mem_valid bit to struct bnxt_ctx_mem_type bnxt_en: Update firmware interface spec to 1.10.3.85 selftests/bpf: Add some tests with sockmap SK_PASS bpf: fix recursive lock when verdict program return SK_PASS wireguard: device: support big tcp GSO wireguard: selftests: load nf_conntrack if not present ...
2024-11-21Merge tag 'bpf-next-6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Add BPF uprobe session support (Jiri Olsa) - Optimize uprobe performance (Andrii Nakryiko) - Add bpf_fastcall support to helpers and kfuncs (Eduard Zingerman) - Avoid calling free_htab_elem() under hash map bucket lock (Hou Tao) - Prevent tailcall infinite loop caused by freplace (Leon Hwang) - Mark raw_tracepoint arguments as nullable (Kumar Kartikeya Dwivedi) - Introduce uptr support in the task local storage map (Martin KaFai Lau) - Stringify errno log messages in libbpf (Mykyta Yatsenko) - Add kmem_cache BPF iterator for perf's lock profiling (Namhyung Kim) - Support BPF objects of either endianness in libbpf (Tony Ambardar) - Add ksym to struct_ops trampoline to fix stack trace (Xu Kuohai) - Introduce private stack for eligible BPF programs (Yonghong Song) - Migrate samples/bpf tests to selftests/bpf test_progs (Daniel T. Lee) - Migrate test_sock to selftests/bpf test_progs (Jordan Rife) * tag 'bpf-next-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (152 commits) libbpf: Change hash_combine parameters from long to unsigned long selftests/bpf: Fix build error with llvm 19 libbpf: Fix memory leak in bpf_program__attach_uprobe_multi bpf: use common instruction history across all states bpf: Add necessary migrate_disable to range_tree. bpf: Do not alloc arena on unsupported arches selftests/bpf: Set test path for token/obj_priv_implicit_token_envvar selftests/bpf: Add a test for arena range tree algorithm bpf: Introduce range_tree data structure and use it in bpf arena samples/bpf: Remove unused variable in xdp2skb_meta_kern.c samples/bpf: Remove unused variables in tc_l2_redirect_kern.c bpftool: Cast variable `var` to long long bpf, x86: Propagate tailcall info only for subprogs bpf: Add kernel symbol for struct_ops trampoline bpf: Use function pointers count as struct_ops links count bpf: Remove unused member rcu from bpf_struct_ops_map selftests/bpf: Add struct_ops prog private stack tests bpf: Support private stack for struct_ops progs selftests/bpf: Add tracing prog private stack tests bpf, x86: Support private stack in jit ...
2024-11-20slab: Fix too strict alignment check in create_cache()Geert Uytterhoeven
On m68k, where the minimum alignment of unsigned long is 2 bytes: Kernel panic - not syncing: __kmem_cache_create_args: Failed to create slab 'io_kiocb'. Error -22 CPU: 0 UID: 0 PID: 1 Comm: swapper Not tainted 6.12.0-atari-03776-g7eaa1f99261a #1783 Stack from 0102fe5c: 0102fe5c 00514a2b 00514a2b ffffff00 00000001 0051f5ed 00425e78 00514a2b 0041eb74 ffffffea 00000310 0051f5ed ffffffea ffffffea 00601f60 00000044 0102ff20 000e7a68 0051ab8e 004383b8 0051f5ed ffffffea 000000b8 00000007 01020c00 00000000 000e77f0 0041e5f0 005f67c0 0051f5ed 000000b6 0102fef4 00000310 0102fef4 00000000 00000016 005f676c 0060a34c 00000010 00000004 00000038 0000009a 01000000 000000b8 005f668e 0102e000 00001372 0102ff88 Call Trace: [<00425e78>] dump_stack+0xc/0x10 [<0041eb74>] panic+0xd8/0x26c [<000e7a68>] __kmem_cache_create_args+0x278/0x2e8 [<000e77f0>] __kmem_cache_create_args+0x0/0x2e8 [<0041e5f0>] memset+0x0/0x8c [<005f67c0>] io_uring_init+0x54/0xd2 The minimal alignment of an integral type may differ from its size, hence is not safe to assume that an arbitrary freeptr_t (which is basically an unsigned long) is always aligned to 4 or 8 bytes. As nothing seems to require the additional alignment, it is safe to fix this by relaxing the check to the actual minimum alignment of freeptr_t. Fixes: aaa736b186239b7d ("io_uring: specify freeptr usage for SLAB_TYPESAFE_BY_RCU io_kiocb cache") Fixes: d345bd2e9834e2da ("mm: add kmem_cache_create_rcu()") Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/37c588d4-2c32-4aad-a19e-642961f200d7@roeck-us.net Cc: <stable@vger.kernel.org> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-11-19Merge tag 'timers-core-2024-11-18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "A rather large update for timekeeping and timers: - The final step to get rid of auto-rearming posix-timers posix-timers are currently auto-rearmed by the kernel when the signal of the timer is ignored so that the timer signal can be delivered once the corresponding signal is unignored. This requires to throttle the timer to prevent a DoS by small intervals and keeps the system pointlessly out of low power states for no value. This is a long standing non-trivial problem due to the lock order of posix-timer lock and the sighand lock along with life time issues as the timer and the sigqueue have different life time rules. Cure this by: - Embedding the sigqueue into the timer struct to have the same life time rules. Aside of that this also avoids the lookup of the timer in the signal delivery and rearm path as it's just a always valid container_of() now. - Queuing ignored timer signals onto a seperate ignored list. - Moving queued timer signals onto the ignored list when the signal is switched to SIG_IGN before it could be delivered. - Walking the ignored list when SIG_IGN is lifted and requeue the signals to the actual signal lists. This allows the signal delivery code to rearm the timer. This also required to consolidate the signal delivery rules so they are consistent across all situations. With that all self test scenarios finally succeed. - Core infrastructure for VFS multigrain timestamping This is required to allow the kernel to use coarse grained time stamps by default and switch to fine grained time stamps when inode attributes are actively observed via getattr(). These changes have been provided to the VFS tree as well, so that the VFS specific infrastructure could be built on top. - Cleanup and consolidation of the sleep() infrastructure - Move all sleep and timeout functions into one file - Rework udelay() and ndelay() into proper documented inline functions and replace the hardcoded magic numbers by proper defines. - Rework the fsleep() implementation to take the reality of the timer wheel granularity on different HZ values into account. Right now the boundaries are hard coded time ranges which fail to provide the requested accuracy on different HZ settings. - Update documentation for all sleep/timeout related functions and fix up stale documentation links all over the place - Fixup a few usage sites - Rework of timekeeping and adjtimex(2) to prepare for multiple PTP clocks A system can have multiple PTP clocks which are participating in seperate and independent PTP clock domains. So far the kernel only considers the PTP clock which is based on CLOCK TAI relevant as that's the clock which drives the timekeeping adjustments via the various user space daemons through adjtimex(2). The non TAI based clock domains are accessible via the file descriptor based posix clocks, but their usability is very limited. They can't be accessed fast as they always go all the way out to the hardware and they cannot be utilized in the kernel itself. As Time Sensitive Networking (TSN) gains traction it is required to provide fast user and kernel space access to these clocks. The approach taken is to utilize the timekeeping and adjtimex(2) infrastructure to provide this access in a similar way how the kernel provides access to clock MONOTONIC, REALTIME etc. Instead of creating a duplicated infrastructure this rework converts timekeeping and adjtimex(2) into generic functionality which operates on pointers to data structures instead of using static variables. This allows to provide time accessors and adjtimex(2) functionality for the independent PTP clocks in a subsequent step. - Consolidate hrtimer initialization hrtimers are set up by initializing the data structure and then seperately setting the callback function for historical reasons. That's an extra unnecessary step and makes Rust support less straight forward than it should be. Provide a new set of hrtimer_setup*() functions and convert the core code and a few usage sites of the less frequently used interfaces over. The bulk of the htimer_init() to hrtimer_setup() conversion is already prepared and scheduled for the next merge window. - Drivers: - Ensure that the global timekeeping clocksource is utilizing the cluster 0 timer on MIPS multi-cluster systems. Otherwise CPUs on different clusters use their cluster specific clocksource which is not guaranteed to be synchronized with other clusters. - Mostly boring cleanups, fixes, improvements and code movement" * tag 'timers-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits) posix-timers: Fix spurious warning on double enqueue versus do_exit() clocksource/drivers/arm_arch_timer: Use of_property_present() for non-boolean properties clocksource/drivers/gpx: Remove redundant casts clocksource/drivers/timer-ti-dm: Fix child node refcount handling dt-bindings: timer: actions,owl-timer: convert to YAML clocksource/drivers/ralink: Add Ralink System Tick Counter driver clocksource/drivers/mips-gic-timer: Always use cluster 0 counter as clocksource clocksource/drivers/timer-ti-dm: Don't fail probe if int not found clocksource/drivers:sp804: Make user selectable clocksource/drivers/dw_apb: Remove unused dw_apb_clockevent functions hrtimers: Delete hrtimer_init_on_stack() alarmtimer: Switch to use hrtimer_setup() and hrtimer_setup_on_stack() io_uring: Switch to use hrtimer_setup_on_stack() sched/idle: Switch to use hrtimer_setup_on_stack() hrtimers: Delete hrtimer_init_sleeper_on_stack() wait: Switch to use hrtimer_setup_sleeper_on_stack() timers: Switch to use hrtimer_setup_sleeper_on_stack() net: pktgen: Switch to use hrtimer_setup_sleeper_on_stack() futex: Switch to use hrtimer_setup_sleeper_on_stack() fs/aio: Switch to use hrtimer_setup_sleeper_on_stack() ...
2024-11-18Merge tag 'arm64-upstream' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Catalin Marinas: - Support for running Linux in a protected VM under the Arm Confidential Compute Architecture (CCA) - Guarded Control Stack user-space support. Current patches follow the x86 ABI of implicitly creating a shadow stack on clone(). Subsequent patches (already on the list) will add support for clone3() allowing finer-grained control of the shadow stack size and placement from libc - AT_HWCAP3 support (not running out of HWCAP2 bits yet but we are getting close with the upcoming dpISA support) - Other arch features: - In-kernel use of the memcpy instructions, FEAT_MOPS (previously only exposed to user; uaccess support not merged yet) - MTE: hugetlbfs support and the corresponding kselftests - Optimise CRC32 using the PMULL instructions - Support for FEAT_HAFT enabling ARCH_HAS_NONLEAF_PMD_YOUNG - Optimise the kernel TLB flushing to use the range operations - POE/pkey (permission overlays): further cleanups after bringing the signal handler in line with the x86 behaviour for 6.12 - arm64 perf updates: - Support for the NXP i.MX91 PMU in the existing IMX driver - Support for Ampere SoCs in the Designware PCIe PMU driver - Support for Marvell's 'PEM' PCIe PMU present in the 'Odyssey' SoC - Support for Samsung's 'Mongoose' CPU PMU - Support for PMUv3.9 finer-grained userspace counter access control - Switch back to platform_driver::remove() now that it returns 'void' - Add some missing events for the CXL PMU driver - Miscellaneous arm64 fixes/cleanups: - Page table accessors cleanup: type updates, drop unused macros, reorganise arch_make_huge_pte() and clean up pte_mkcont(), sanity check addresses before runtime P4D/PUD folding - Command line override for ID_AA64MMFR0_EL1.ECV (advertising the FEAT_ECV for the generic timers) allowing Linux to boot with firmware deployments that don't set SCTLR_EL3.ECVEn - ACPI/arm64: tighten the check for the array of platform timer structures and adjust the error handling procedure in gtdt_parse_timer_block() - Optimise the cache flush for the uprobes xol slot (skip if no change) and other uprobes/kprobes cleanups - Fix the context switching of tpidrro_el0 when kpti is enabled - Dynamic shadow call stack fixes - Sysreg updates - Various arm64 kselftest improvements * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (168 commits) arm64: tls: Fix context-switching of tpidrro_el0 when kpti is enabled kselftest/arm64: Try harder to generate different keys during PAC tests kselftest/arm64: Don't leak pipe fds in pac.exec_sign_all() arm64/ptrace: Clarify documentation of VL configuration via ptrace kselftest/arm64: Corrupt P0 in the irritator when testing SSVE acpi/arm64: remove unnecessary cast arm64/mm: Change protval as 'pteval_t' in map_range() kselftest/arm64: Fix missing printf() argument in gcs/gcs-stress.c kselftest/arm64: Add FPMR coverage to fp-ptrace kselftest/arm64: Expand the set of ZA writes fp-ptrace does kselftets/arm64: Use flag bits for features in fp-ptrace assembler code kselftest/arm64: Enable build of PAC tests with LLVM=1 kselftest/arm64: Check that SVCR is 0 in signal handlers selftests/mm: Fix unused function warning for aarch64_write_signal_pkey() kselftest/arm64: Fix printf() compiler warnings in the arm64 syscall-abi.c tests kselftest/arm64: Fix printf() warning in the arm64 MTE prctl() test kselftest/arm64: Fix printf() compiler warnings in the arm64 fp tests kselftest/arm64: Fix build with stricter assemblers arm64/scs: Drop unused prototype __pi_scs_patch_vmlinux() arm64/scs: Deal with 64-bit relative offsets in FDE frames ...
2024-11-18Merge tag 'ext4_for_linus-6.13-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A lot of miscellaneous ext4 bug fixes and cleanups this cycle, most notably in the journaling code, bufered I/O, and compiler warning cleanups" * tag 'ext4_for_linus-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (33 commits) jbd2: Fix comment describing journal_init_common() ext4: prevent an infinite loop in the lazyinit thread ext4: use struct_size() to improve ext4_htree_store_dirent() ext4: annotate struct fname with __counted_by() jbd2: avoid dozens of -Wflex-array-member-not-at-end warnings ext4: use str_yes_no() helper function ext4: prevent delalloc to nodelalloc on remount jbd2: make b_frozen_data allocation always succeed ext4: cleanup variable name in ext4_fc_del() ext4: use string choices helpers jbd2: remove the 'success' parameter from the jbd2_do_replay() function jbd2: remove useless 'block_error' variable jbd2: factor out jbd2_do_replay() jbd2: refactor JBD2_COMMIT_BLOCK process in do_one_pass() jbd2: unified release of buffer_head in do_one_pass() jbd2: remove redundant judgments for check v1 checksum ext4: use ERR_CAST to return an error-valued pointer mm: zero range of eof folio exposed by inode size extension ext4: partial zero eof block on unaligned inode size extension ext4: disambiguate the return value of ext4_dio_write_end_io() ...
2024-11-18Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull 'struct fd' class updates from Al Viro: "The bulk of struct fd memory safety stuff Making sure that struct fd instances are destroyed in the same scope where they'd been created, getting rid of reassignments and passing them by reference, converting to CLASS(fd{,_pos,_raw}). We are getting very close to having the memory safety of that stuff trivial to verify" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (28 commits) deal with the last remaing boolean uses of fd_file() css_set_fork(): switch to CLASS(fd_raw, ...) memcg_write_event_control(): switch to CLASS(fd) assorted variants of irqfd setup: convert to CLASS(fd) do_pollfd(): convert to CLASS(fd) convert do_select() convert vfs_dedupe_file_range(). convert cifs_ioctl_copychunk() convert media_request_get_by_fd() convert spu_run(2) switch spufs_calls_{get,put}() to CLASS() use convert cachestat(2) convert do_preadv()/do_pwritev() fdget(), more trivial conversions fdget(), trivial conversions privcmd_ioeventfd_assign(): don't open-code eventfd_ctx_fdget() o2hb_region_dev_store(): avoid goto around fdget()/fdput() introduce "fd_pos" class, convert fdget_pos() users to it. fdget_raw() users: switch to CLASS(fd_raw) convert vmsplice() to CLASS(fd) ...
2024-11-18Merge tag 'vfs-6.13.tmpfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull tmpfs case folding updates from Christian Brauner: "This adds case-insensitive support for tmpfs. The work contained in here adds support for case-insensitive file names lookups in tmpfs. The main difference from other casefold filesystems is that tmpfs has no information on disk, just on RAM, so we can't use mkfs to create a case-insensitive tmpfs. For this implementation, there's a mount option for casefolding. The rest of the patchset follows a similar approach as ext4 and f2fs. The use case for this feature is similar to the use case for ext4, to better support compatibility layers (like Wine), particularly in combination with sandboxing/container tools (like Flatpak). Those containerization tools can share a subset of the host filesystem with an application. In the container, the root directory and any parent directories required for a shared directory are on tmpfs, with the shared directories bind-mounted into the container's view of the filesystem. If the host filesystem is using case-insensitive directories, then the application can do lookups inside those directories in a case-insensitive way, without this needing to be implemented in user-space. However, if the host is only sharing a subset of a case-insensitive directory with the application, then the parent directories of the mount point will be part of the container's root tmpfs. When the application tries to do case-insensitive lookups of those parent directories on a case-sensitive tmpfs, the lookup will fail" * tag 'vfs-6.13.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: tmpfs: Initialize sysfs during tmpfs init tmpfs: Fix type for sysfs' casefold attribute libfs: Fix kernel-doc warning in generic_ci_validate_strict_name docs: tmpfs: Add casefold options tmpfs: Expose filesystem features via sysfs tmpfs: Add flag FS_CASEFOLD_FL support for tmpfs dirs tmpfs: Add casefold lookup support libfs: Export generic_ci_ dentry functions unicode: Recreate utf8_parse_version() unicode: Export latest available UTF-8 version number ext4: Use generic_ci_validate_strict_name helper libfs: Create the helper function generic_ci_validate_strict_name()
2024-11-18Merge tag 'vfs-6.13.pagecache' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs pagecache updates from Christian Brauner: "Cleanup filesystem page flag usage: This continues the work to make the mappedtodisk/owner_2 flag available to filesystems which don't use buffer heads. Further patches remove uses of Private2. This brings us very close to being rid of it entirely" * tag 'vfs-6.13.pagecache' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: migrate: Remove references to Private2 ceph: Remove call to PagePrivate2() btrfs: Switch from using the private_2 flag to owner_2 mm: Remove PageMappedToDisk nilfs2: Convert nilfs_copy_buffer() to use folios fs: Move clearing of mappedtodisk to buffer.c
2024-11-18Merge tag 'vfs-6.13.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Fixup and improve NLM and kNFSD file lock callbacks Last year both GFS2 and OCFS2 had some work done to make their locking more robust when exported over NFS. Unfortunately, part of that work caused both NLM (for NFS v3 exports) and kNFSD (for NFSv4.1+ exports) to no longer send lock notifications to clients This in itself is not a huge problem because most NFS clients will still poll the server in order to acquire a conflicted lock It's important for NLM and kNFSD that they do not block their kernel threads inside filesystem's file_lock implementations because that can produce deadlocks. We used to make sure of this by only trusting that posix_lock_file() can correctly handle blocking lock calls asynchronously, so the lock managers would only setup their file_lock requests for async callbacks if the filesystem did not define its own lock() file operation However, when GFS2 and OCFS2 grew the capability to correctly handle blocking lock requests asynchronously, they started signalling this behavior with EXPORT_OP_ASYNC_LOCK, and the check for also trusting posix_lock_file() was inadvertently dropped, so now most filesystems no longer produce lock notifications when exported over NFS Fix this by using an fop_flag which greatly simplifies the problem and grooms the way for future uses by both filesystems and lock managers alike - Add a sysctl to delete the dentry when a file is removed instead of making it a negative dentry Commit 681ce8623567 ("vfs: Delete the associated dentry when deleting a file") introduced an unconditional deletion of the associated dentry when a file is removed. However, this led to performance regressions in specific benchmarks, such as ilebench.sum_operations/s, prompting a revert in commit 4a4be1ad3a6e ("Revert "vfs: Delete the associated dentry when deleting a file""). This reintroduces the concept conditionally through a sysctl - Expand the statmount() system call: * Report the filesystem subtype in a new fs_subtype field to e.g., report fuse filesystem subtypes * Report the superblock source in a new sb_source field * Add a new way to return filesystem specific mount options in an option array that returns filesystem specific mount options separated by zero bytes and unescaped. This allows caller's to retrieve filesystem specific mount options and immediately pass them to e.g., fsconfig() without having to unescape or split them * Report security (LSM) specific mount options in a separate security option array. We don't lump them together with filesystem specific mount options as security mount options are generic and most users aren't interested in them The format is the same as for the filesystem specific mount option array - Support relative paths in fsconfig()'s FSCONFIG_SET_STRING command - Optimize acl_permission_check() to avoid costly {g,u}id ownership checks if possible - Use smp_mb__after_spinlock() to avoid full smp_mb() in evict() - Add synchronous wakeup support for ep_poll_callback. Currently, epoll only uses wake_up() to wake up task. But sometimes there are epoll users which want to use the synchronous wakeup flag to give a hint to the scheduler, e.g., the Android binder driver. So add a wake_up_sync() define, and use wake_up_sync() when sync is true in ep_poll_callback() Fixes: - Fix kernel documentation for inode_insert5() and iget5_locked() - Annotate racy epoll check on file->f_ep - Make F_DUPFD_QUERY associative - Avoid filename buffer overrun in initramfs - Don't let statmount() return empty strings - Add a cond_resched() to dump_user_range() to avoid hogging the CPU - Don't query the device logical blocksize multiple times for hfsplus - Make filemap_read() check that the offset is positive or zero Cleanups: - Various typo fixes - Cleanup wbc_attach_fdatawrite_inode() - Add __releases annotation to wbc_attach_and_unlock_inode() - Add hugetlbfs tracepoints - Fix various vfs kernel doc parameters - Remove obsolete TODO comment from io_cancel() - Convert wbc_account_cgroup_owner() to take a folio - Fix comments for BANDWITH_INTERVAL and wb_domain_writeout_add() - Reorder struct posix_acl to save 8 bytes - Annotate struct posix_acl with __counted_by() - Replace one-element array with flexible array member in freevxfs - Use idiomatic atomic64_inc_return() in alloc_mnt_ns()" * tag 'vfs-6.13.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) statmount: retrieve security mount options vfs: make evict() use smp_mb__after_spinlock instead of smp_mb statmount: add flag to retrieve unescaped options fs: add the ability for statmount() to report the sb_source writeback: wbc_attach_fdatawrite_inode out of line writeback: add a __releases annoation to wbc_attach_and_unlock_inode fs: add the ability for statmount() to report the fs_subtype fs: don't let statmount return empty strings fs:aio: Remove TODO comment suggesting hash or array usage in io_cancel() hfsplus: don't query the device logical block size multiple times freevxfs: Replace one-element array with flexible array member fs: optimize acl_permission_check() initramfs: avoid filename buffer overrun fs/writeback: convert wbc_account_cgroup_owner to take a folio acl: Annotate struct posix_acl with __counted_by() acl: Realign struct posix_acl to save 8 bytes epoll: Add synchronous wakeup support for ep_poll_callback coredump: add cond_resched() to dump_user_range mm/page-writeback.c: Fix comment of wb_domain_writeout_add() mm/page-writeback.c: Update comment for BANDWIDTH_INTERVAL ...
2024-11-18Merge tag 'vfs-6.13.mgtime' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs multigrain timestamps from Christian Brauner: "This is another try at implementing multigrain timestamps. This time with significant help from the timekeeping maintainers to reduce the performance impact. Thomas provided a base branch that contains the required timekeeping interfaces for the VFS. It serves as the base for the multi-grain timestamp work: - Multigrain timestamps allow the kernel to use fine-grained timestamps when an inode's attributes is being actively observed via ->getattr(). With this support, it's possible for a file to get a fine-grained timestamp, and another modified after it to get a coarse-grained stamp that is earlier than the fine-grained time. If this happens then the files can appear to have been modified in reverse order, which breaks VFS ordering guarantees. To prevent this, a floor value is maintained for multigrain timestamps. Whenever a fine-grained timestamp is handed out, record it, and when later coarse-grained stamps are handed out, ensure they are not earlier than that value. If the coarse-grained timestamp is earlier than the fine-grained floor, return the floor value instead. The timekeeper changes add a static singleton atomic64_t into timekeeper.c that is used to keep track of the latest fine-grained time ever handed out. This is tracked as a monotonic ktime_t value to ensure that it isn't affected by clock jumps. Because it is updated at different times than the rest of the timekeeper object, the floor value is managed independently of the timekeeper via a cmpxchg() operation, and sits on its own cacheline. Two new public timekeeper interfaces are added: (1) ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the coarse-grained clock and the floor time (2) ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries to swap it into the floor. A timespec64 is filled with the result. - The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This adds a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. This is where the earlier mentioned timkeeping interfaces help. A global monotonic atomic64_t value is kept that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems)" * tag 'vfs-6.13.mgtime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: reduce pointer chasing in is_mgtime() test tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime timekeeping: Add percpu counter for tracking floor swap events timekeeping: Add interfaces for handling timestamps with a floor value fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps
2024-11-16Merge tag 'mm-hotfixes-stable-2024-11-16-15-33' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "10 hotfixes, 7 of which are cc:stable. All singletons, please see the changelogs for details" * tag 'mm-hotfixes-stable-2024-11-16-15-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: revert "mm: shmem: fix data-race in shmem_getattr()" ocfs2: uncache inode which has failed entering the group mm: fix NULL pointer dereference in alloc_pages_bulk_noprof mm, doc: update read_ahead_kb for MADV_HUGEPAGE fs/proc/task_mmu: prevent integer overflow in pagemap_scan_get_args() sched/task_stack: fix object_is_on_stack() for KASAN tagged pointers crash, powerpc: default to CRASH_DUMP=n on PPC_BOOK3S_32 mm/mremap: fix address wraparound in move_page_tables() tools/mm: fix compile error mm, swap: fix allocation and scanning race with swapoff
2024-11-16mm: revert "mm: shmem: fix data-race in shmem_getattr()"Andrew Morton
Revert d949d1d14fa2 ("mm: shmem: fix data-race in shmem_getattr()") as suggested by Chuck [1]. It is causing deadlocks when accessing tmpfs over NFS. As Hugh commented, "added just to silence a syzbot sanitizer splat: added where there has never been any practical problem". Link: https://lkml.kernel.org/r/ZzdxKF39VEmXSSyN@tissot.1015granger.net [1] Fixes: d949d1d14fa2 ("mm: shmem: fix data-race in shmem_getattr()") Acked-by: Hugh Dickins <hughd@google.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Jeongjun Park <aha310510@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-16Merge branch 'slab/for-6.13/features' into slab/for-nextVlastimil Babka
Merge the slab feature branch for 6.13: - Add new slab_strict_numa parameter for per-object memory policies (Christoph Lameter)
2024-11-16mm/slab: Allow cache creation to proceed even if sysfs registration failsHyeonggon Yoo
When kobject_init_and_add() fails during cache creation, kobj->name can be leaked because SLUB does not call kobject_put(), which should be invoked per the kobject API documentation. This has a bit of historical context, though; SLUB does not call kobject_put() to avoid double-free for struct kmem_cache because 1) simply calling it would free all resources related to the cache, and 2) struct kmem_cache descriptor is always freed by cache_cache()'s error handling path, causing struct kmem_cache to be freed twice. This issue can be reproduced by creating new slab caches while applying failslab for kernfs_node_cache. This makes kobject_add_varg() succeed, but causes kobject_add_internal() to fail in kobject_init_and_add() during cache creation. Historically, this issue has attracted developers' attention several times. Each time a fix addressed either the leak or the double-free, it caused the other issue. Let's summarize a bit of history here: The leak has existed since the early days of SLUB. Commit 54b6a731025f ("slub: fix leak of 'name' in sysfs_slab_add") introduced a double-free bug while fixing the leak. Commit 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate sysfs filename") re-introduced the leak while fixing the double-free error. Commit dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") fixed the memory leak, but it was later reverted by commit 757fed1d0898 ("Revert "mm/slub: fix a memory leak in sysfs_slab_add()"") to avoid the double-free error. This is where we are now: we've chosen a memory leak over a double-free. To resolve this memory leak, skip creating sysfs files if it fails and continue with cache creation regardless (as suggested by Christoph). This resolves the memory leak because both the cache and the kobject remain alive on kobject_init_and_add() failure. If SLUB tries to create an alias for a cache without sysfs files, its symbolic link will not be generated. Since a slab cache might not have associated sysfs files, call kobject_del() only if such files exist. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-11-16mm/slub: Avoid list corruption when removing a slab from the full listyuan.gao
Boot with slub_debug=UFPZ. If allocated object failed in alloc_consistency_checks, all objects of the slab will be marked as used, and then the slab will be removed from the partial list. When an object belonging to the slab got freed later, the remove_full() function is called. Because the slab is neither on the partial list nor on the full list, it eventually lead to a list corruption (actually a list poison being detected). So we need to mark and isolate the slab page with metadata corruption, do not put it back in circulation. Because the debug caches avoid all the fastpaths, reusing the frozen bit to mark slab page with metadata corruption seems to be fine. [ 4277.385669] list_del corruption, ffffea00044b3e50->next is LIST_POISON1 (dead000000000100) [ 4277.387023] ------------[ cut here ]------------ [ 4277.387880] kernel BUG at lib/list_debug.c:56! [ 4277.388680] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 4277.389562] CPU: 5 PID: 90 Comm: kworker/5:1 Kdump: loaded Tainted: G OE 6.6.1-1 #1 [ 4277.392113] Workqueue: xfs-inodegc/vda1 xfs_inodegc_worker [xfs] [ 4277.393551] RIP: 0010:__list_del_entry_valid_or_report+0x7b/0xc0 [ 4277.394518] Code: 48 91 82 e8 37 f9 9a ff 0f 0b 48 89 fe 48 c7 c7 28 49 91 82 e8 26 f9 9a ff 0f 0b 48 89 fe 48 c7 c7 58 49 91 [ 4277.397292] RSP: 0018:ffffc90000333b38 EFLAGS: 00010082 [ 4277.398202] RAX: 000000000000004e RBX: ffffea00044b3e50 RCX: 0000000000000000 [ 4277.399340] RDX: 0000000000000002 RSI: ffffffff828f8715 RDI: 00000000ffffffff [ 4277.400545] RBP: ffffea00044b3e40 R08: 0000000000000000 R09: ffffc900003339f0 [ 4277.401710] R10: 0000000000000003 R11: ffffffff82d44088 R12: ffff888112cf9910 [ 4277.402887] R13: 0000000000000001 R14: 0000000000000001 R15: ffff8881000424c0 [ 4277.404049] FS: 0000000000000000(0000) GS:ffff88842fd40000(0000) knlGS:0000000000000000 [ 4277.405357] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4277.406389] CR2: 00007f2ad0b24000 CR3: 0000000102a3a006 CR4: 00000000007706e0 [ 4277.407589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 4277.408780] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 4277.410000] PKRU: 55555554 [ 4277.410645] Call Trace: [ 4277.411234] <TASK> [ 4277.411777] ? die+0x32/0x80 [ 4277.412439] ? do_trap+0xd6/0x100 [ 4277.413150] ? __list_del_entry_valid_or_report+0x7b/0xc0 [ 4277.414158] ? do_error_trap+0x6a/0x90 [ 4277.414948] ? __list_del_entry_valid_or_report+0x7b/0xc0 [ 4277.415915] ? exc_invalid_op+0x4c/0x60 [ 4277.416710] ? __list_del_entry_valid_or_report+0x7b/0xc0 [ 4277.417675] ? asm_exc_invalid_op+0x16/0x20 [ 4277.418482] ? __list_del_entry_valid_or_report+0x7b/0xc0 [ 4277.419466] ? __list_del_entry_valid_or_report+0x7b/0xc0 [ 4277.420410] free_to_partial_list+0x515/0x5e0 [ 4277.421242] ? xfs_iext_remove+0x41a/0xa10 [xfs] [ 4277.422298] xfs_iext_remove+0x41a/0xa10 [xfs] [ 4277.423316] ? xfs_inodegc_worker+0xb4/0x1a0 [xfs] [ 4277.424383] xfs_bmap_del_extent_delay+0x4fe/0x7d0 [xfs] [ 4277.425490] __xfs_bunmapi+0x50d/0x840 [xfs] [ 4277.426445] xfs_itruncate_extents_flags+0x13a/0x490 [xfs] [ 4277.427553] xfs_inactive_truncate+0xa3/0x120 [xfs] [ 4277.428567] xfs_inactive+0x22d/0x290 [xfs] [ 4277.429500] xfs_inodegc_worker+0xb4/0x1a0 [xfs] [ 4277.430479] process_one_work+0x171/0x340 [ 4277.431227] worker_thread+0x277/0x390 [ 4277.431962] ? __pfx_worker_thread+0x10/0x10 [ 4277.432752] kthread+0xf0/0x120 [ 4277.433382] ? __pfx_kthread+0x10/0x10 [ 4277.434134] ret_from_fork+0x2d/0x50 [ 4277.434837] ? __pfx_kthread+0x10/0x10 [ 4277.435566] ret_from_fork_asm+0x1b/0x30 [ 4277.436280] </TASK> Fixes: 643b113849d8 ("slub: enable tracking of full slabs") Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: yuan.gao <yuan.gao@ucloud.cn> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-11-16mm/slub: Improve redzone check and zeroing for krealloc()Feng Tang
For current krealloc(), one problem is its caller doesn't pass the old request size, say the object is 64 bytes kmalloc one, but caller may only requested 48 bytes. Then when krealloc() shrinks or grows in the same object, or allocate a new bigger object, it lacks this 'original size' information to do accurate data preserving or zeroing (when __GFP_ZERO is set). Thus with slub debug redzone and object tracking enabled, parts of the object after krealloc() might contain redzone data instead of zeroes, which is violating the __GFP_ZERO guarantees. Good thing is in this case, kmalloc caches do have this 'orig_size' feature. So solve the problem by utilize 'org_size' to do accurate data zeroing and preserving. [Thanks to syzbot and V, Narasimhan for discovering kfence and big kmalloc related issues in early patch version] Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-11-16mm/slub: Consider kfence case for get_orig_size()Feng Tang
When 'orig_size' of kmalloc object is enabled by debug option, it should either contains the actual requested size or the cache's 'object_size'. But it's not true if that object is a kfence-allocated one, and the data at 'orig_size' offset of metadata could be zero or other values. This is not a big issue for current 'orig_size' usage, as init_object() and check_object() during alloc/free process will be skipped for kfence addresses. But it could cause trouble for other usage in future. Use the existing kfence helper kfence_ksize() which can return the real original request size. Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-11-14cma: enforce non-zero pageblock_order during cma_init_reserved_mem()Ritesh Harjani (IBM)
cma_init_reserved_mem() checks base and size alignment with CMA_MIN_ALIGNMENT_BYTES. However, some users might call this during early boot when pageblock_order is 0. That means if base and size does not have pageblock_order alignment, it can cause functional failures during cma activate area. So let's enforce pageblock_order to be non-zero during cma_init_reserved_mem() to catch such wrong usages. 1. This was seen with fadump on PowerPC which was calling cma_init_reserved_mem() before the pageblock_order was initialized. This is now fixed in the fadump on PowerPC itself. The details of that can be found in the patch including the userspace-visible effect of the issue [1]. 2. However it was also decided that we should add a stronger enforcement check within cma_init_reserved_mem() to catch such wrong usages [2]. Hence this patch. This is ok to be in -next and there is no "Fixes" tag required for this patch. [1]: https://lore.kernel.org/all/3ae208e48c0d9cefe53d2dc4f593388067405b7d.1729146153.git.ritesh.list@gmail.com/ [2]: https://lore.kernel.org/all/83eb128e-4f06-4725-a843-a4563f246a44@redhat.com/ Link: https://lkml.kernel.org/r/e274344b44d5f80fa54c52f530387257fe99ec65.1731505681.git.ritesh.list@gmail.com Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14mm/kfence: add a new kunit test test_use_after_free_read_nofault()Nirjhar Roy
Faults from copy_from_kernel_nofault() need to be handled by fixup table and should not be handled by kfence. Otherwise while reading /proc/kcore which uses copy_from_kernel_nofault(), kfence can generate false negatives. This can happen when /proc/kcore ends up reading an unmapped address from kfence pool. Let's add a testcase to cover this case. Link: https://lkml.kernel.org/r/210e561f7845697a32de44b643393890f180069f.1729272697.git.ritesh.list@gmail.com Signed-off-by: Nirjhar Roy <nirjhar@linux.ibm.com> Co-developed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Tested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14memcg/hugetlb: add hugeTLB counters to memcgJoshua Hahn
This patch introduces a new counter to memory.stat that tracks hugeTLB usage, only if hugeTLB accounting is done to memory.current. This feature is enabled the same way hugeTLB accounting is enabled, via the memory_hugetlb_accounting mount flag for cgroupsv2. 1. Why is this patch necessary? Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds hugeTLB usage to memory.current. However, the metric is not reported in memory.stat. Given that users often interpret memory.stat as a breakdown of the value reported in memory.current, the disparity between the two reports can be confusing. This patch solves this problem by including the metric in memory.stat as well, but only if it is also reported in memory.current (it would also be confusing if the value was reported in memory.stat, but not in memory.current) Aside from the consistency between the two files, we also see benefits in observability. Userspace might be interested in the hugeTLB footprint of cgroups for many reasons. For instance, system admins might want to verify that hugeTLB usage is distributed as expected across tasks: i.e. memory-intensive tasks are using more hugeTLB pages than tasks that don't consume a lot of memory, or are seen to fault frequently. Note that this is separate from wanting to inspect the distribution for limiting purposes (in which case, hugeTLB controller makes more sense). 2. We already have a hugeTLB controller. Why not use that? It is true that hugeTLB tracks the exact value that we want. In fact, by enabling the hugeTLB controller, we get all of the observability benefits that I mentioned above, and users can check the total hugeTLB usage, verify if it is distributed as expected, etc. With this said, there are 2 problems: (a) They are still not reported in memory.stat, which means the disparity between the memcg reports are still there. (b) We cannot reasonably expect users to enable the hugeTLB controller just for the sake of hugeTLB usage reporting, especially since they don't have any use for hugeTLB usage enforcing [2]. 3. Implementation Details: In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio regardless of whether memcg accounts hugetlb. mem_cgroup_commit_charge which is called from alloc_hugetlb_folio will set memcg for the folio only if the CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is used, so lruvec_stat_mod_folio accounts per-memcg hugetlb counters only if the feature is enabled. Regardless of whether memcg accounts for hugetlb, the newly added global counter is updated and shown in /proc/vmstat. The global counter is added because vmstats is the preferred framework for cgroup stats. It makes stat items consistent between global and cgroups. It also provides a per-node breakdown, which is useful. Because it does not use cgroup-specific hooks, we also keep generic MM code separate from memcg code. [1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/ [2] Of course, we can't make a new patch for every feature that can be duplicated. However, since the existing solution of enabling the hugeTLB controller is an imperfect solution that still leaves a discrepancy between memory.stat and memory.curent, I think that it is reasonable to isolate the feature in this case. Link: https://lkml.kernel.org/r/20241101204402.1885383-1-joshua.hahnjy@gmail.com Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Suggested-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Shakeel Butt <shakeel.butt@linux.dev> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14vmstat: call fold_vm_zone_numa_events() before show per zone NUMA eventMengEn Sun
Since 5.14-rc1, NUMA events will only be folded from per-CPU statistics to per zone and global statistics when the user actually needs it. Currently, the kernel has performs the fold operation when reading /proc/vmstat, but does not perform the fold operation in /proc/zoneinfo. This can lead to inaccuracies in the following statistics in zoneinfo: - numa_hit - numa_miss - numa_foreign - numa_interleave - numa_local - numa_other Therefore, before printing per-zone vm_numa_event when reading /proc/zoneinfo, we should also perform the fold operation. Link: https://lkml.kernel.org/r/1730433998-10461-1-git-send-email-mengensun@tencent.com Fixes: f19298b9516c ("mm/vmstat: convert NUMA statistics to basic NUMA counters") Signed-off-by: MengEn Sun <mengensun@tencent.com> Reviewed-by: JinLiang Zheng <alexjlzheng@tencent.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14mm: fix NULL pointer dereference in alloc_pages_bulk_noprofJinjiang Tu
We triggered a NULL pointer dereference for ac.preferred_zoneref->zone in alloc_pages_bulk_noprof() when the task is migrated between cpusets. When cpuset is enabled, in prepare_alloc_pages(), ac->nodemask may be &current->mems_allowed. when first_zones_zonelist() is called to find preferred_zoneref, the ac->nodemask may be modified concurrently if the task is migrated between different cpusets. Assuming we have 2 NUMA Node, when traversing Node1 in ac->zonelist, the nodemask is 2, and when traversing Node2 in ac->zonelist, the nodemask is 1. As a result, the ac->preferred_zoneref points to NULL zone. In alloc_pages_bulk_noprof(), for_each_zone_zonelist_nodemask() finds a allowable zone and calls zonelist_node_idx(ac.preferred_zoneref), leading to NULL pointer dereference. __alloc_pages_noprof() fixes this issue by checking NULL pointer in commit ea57485af8f4 ("mm, page_alloc: fix check for NULL preferred_zone") and commit df76cee6bbeb ("mm, page_alloc: remove redundant checks from alloc fastpath"). To fix it, check NULL pointer for preferred_zoneref->zone. Link: https://lkml.kernel.org/r/20241113083235.166798-1-tujinjiang@huawei.com Fixes: 387ba26fb1cb ("mm/page_alloc: add a bulk page allocator") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Lobakin <alobakin@pm.me> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14mm/mremap: fix address wraparound in move_page_tables()Jann Horn
On 32-bit platforms, it is possible for the expression `len + old_addr < old_end` to be false-positive if `len + old_addr` wraps around. `old_addr` is the cursor in the old range up to which page table entries have been moved; so if the operation succeeded, `old_addr` is the *end* of the old region, and adding `len` to it can wrap. The overflow causes mremap() to mistakenly believe that PTEs have been copied; the consequence is that mremap() bails out, but doesn't move the PTEs back before the new VMA is unmapped, causing anonymous pages in the region to be lost. So basically if userspace tries to mremap() a private-anon region and hits this bug, mremap() will return an error and the private-anon region's contents appear to have been zeroed. The idea of this check is that `old_end - len` is the original start address, and writing the check that way also makes it easier to read; so fix the check by rearranging the comparison accordingly. (An alternate fix would be to refactor this function by introducing an "orig_old_start" variable or such.) Tested in a VM with a 32-bit X86 kernel; without the patch: ``` user@horn:~/big_mremap$ cat test.c #define _GNU_SOURCE #include <stdlib.h> #include <stdio.h> #include <err.h> #include <sys/mman.h> #define ADDR1 ((void*)0x60000000) #define ADDR2 ((void*)0x10000000) #define SIZE 0x50000000uL int main(void) { unsigned char *p1 = mmap(ADDR1, SIZE, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0); if (p1 == MAP_FAILED) err(1, "mmap 1"); unsigned char *p2 = mmap(ADDR2, SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED_NOREPLACE, -1, 0); if (p2 == MAP_FAILED) err(1, "mmap 2"); *p1 = 0x41; printf("first char is 0x%02hhx\n", *p1); unsigned char *p3 = mremap(p1, SIZE, SIZE, MREMAP_MAYMOVE|MREMAP_FIXED, p2); if (p3 == MAP_FAILED) { printf("mremap() failed; first char is 0x%02hhx\n", *p1); } else { printf("mremap() succeeded; first char is 0x%02hhx\n", *p3); } } user@horn:~/big_mremap$ gcc -static -o test test.c user@horn:~/big_mremap$ setarch -R ./test first char is 0x41 mremap() failed; first char is 0x00 ``` With the patch: ``` user@horn:~/big_mremap$ setarch -R ./test first char is 0x41 mremap() succeeded; first char is 0x41 ``` Link: https://lkml.kernel.org/r/20241111-fix-mremap-32bit-wrap-v1-1-61d6be73b722@google.com Fixes: af8ca1c14906 ("mm/mremap: optimize the start addresses in move_page_tables()") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14mm, swap: fix allocation and scanning race with swapoffKairui Song
There are two flags used to synchronize allocation and scanning with swapoff: SWP_WRITEOK and SWP_SCANNING. SWP_WRITEOK: Swapoff will first unset this flag, at this point any further swap allocation or scanning on this device should just abort so no more new entries will be referencing this device. Swapoff will then unuse all existing swap entries. SWP_SCANNING: This flag is set when device is being scanned. Swapoff will wait for all scanner to stop before the final release of the swap device structures to avoid UAF. Note this flag is the highest used bit of si->flags so it could be added up arithmetically, if there are multiple scanner. commit 5f843a9a3a1e ("mm: swap: separate SSD allocation from scan_swap_map_slots()") ignored SWP_SCANNING and SWP_WRITEOK flags while separating cluster allocation path from the old allocation path. Add the flags back to fix swapoff race. The race is hard to trigger as si->lock prevents most parallel operations, but si->lock could be dropped for reclaim or discard. This issue is found during code review. This commit fixes this problem. For SWP_SCANNING, Just like before, set the flag before scan and remove it afterwards. For SWP_WRITEOK, there are several places where si->lock could be dropped, it will be error-prone and make the code hard to follow if we try to cover these places one by one. So just do one check before the real allocation, which is also very similar like before. With new cluster allocator it may waste a bit of time iterating the clusters but won't take long, and swapoff is not performance sensitive. Link: https://lkml.kernel.org/r/20241112083414.78174-1-ryncsn@gmail.com Fixes: 5f843a9a3a1e ("mm: swap: separate SSD allocation from scan_swap_map_slots()") Reported-by: "Huang, Ying" <ying.huang@intel.com> Closes: https://lore.kernel.org/linux-mm/87a5es3f1f.fsf@yhuang6-desk2.ccr.corp.intel.com/ Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc8). Conflicts: tools/testing/selftests/net/.gitignore 252e01e68241 ("selftests: net: add netlink-dumps to .gitignore") be43a6b23829 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw") https://lore.kernel.org/all/20241113122359.1b95180a@canb.auug.org.au/ drivers/net/phy/phylink.c 671154f174e0 ("net: phylink: ensure PHY momentary link-fails are handled") 7530ea26c810 ("net: phylink: remove "using_mac_select_pcs"") Adjacent changes: drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c 5b366eae7193 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines") e96321fad3ad ("net: ethernet: Switch back to struct platform_driver::remove()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>