Age | Commit message | Author |
|
Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
The cmpxchg128() family of functions is basically & functionally the
same as cmpxchg_double(), but with a saner interface.
Instead of a 6-parameter horror that forced u128 - u64/u64-halves
layout details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128
types (a sketch follows this list).
- Restructure the generated atomic headers, and add kerneldoc comments
for all of the generic atomic{,64,_long}_t operations.
The generated definitions are much cleaner now, and come with
documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
taking multiple locks of the same type.
This gets rid of one use of lockdep_set_novalidate_class() in the
bcache code.
- Fix a raw_cpu_generic_try_cmpxchg() bug where unintended variable
shadowing generated garbage code with Clang on certain ARM builds.
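A minimal sketch of the cmpxchg128() interface change from the first item
above (illustrative fragment; assumes a 16-byte-aligned u128 location, as
the real API requires):

  /* old: six parameters, u64 halves for both the expected and new values */
  /* ok = cmpxchg_double(&s->lo, &s->hi, old_lo, old_hi, new_lo, new_hi); */

  /* new: one pointer, one expected value, one new value, all u128 */
  u128 old = *ptr;
  u128 prev = cmpxchg128(ptr, old, old + 1);
  /* try_cmpxchg128(ptr, &old, new) is the boolean try_ variant */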
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
locking/atomic: treewide: delete arch_atomic_*() kerneldoc
locking/atomic: docs: Add atomic operations to the driver basic API documentation
locking/atomic: scripts: generate kerneldoc comments
docs: scripts: kernel-doc: accept bitwise negation like ~@var
locking/atomic: scripts: simplify raw_atomic*() definitions
locking/atomic: scripts: simplify raw_atomic_long*() definitions
locking/atomic: scripts: split pfx/name/sfx/order
locking/atomic: scripts: restructure fallback ifdeffery
locking/atomic: scripts: build raw_atomic_long*() directly
locking/atomic: treewide: use raw_atomic*_<op>()
locking/atomic: scripts: add trivial raw_atomic*_<op>()
locking/atomic: scripts: factor out order template generation
locking/atomic: scripts: remove leftover "${mult}"
locking/atomic: scripts: remove bogus order parameter
locking/atomic: xtensa: add preprocessor symbols
locking/atomic: x86: add preprocessor symbols
locking/atomic: sparc: add preprocessor symbols
locking/atomic: sh: add preprocessor symbols
...
|
|
It feels very unlikely that anybody would want to do a GUP in an
unmapped area under the stack pointer, but real users sometimes do some
really strange things. So add a (temporary) warning for the case where
a GUP fails and expanding the stack might have made it work.
It's trivial to do the expansion in the caller as part of getting the mm
lock in the first place - see __access_remote_vm() for ptrace, for
example - it's just that it's unnecessarily painful to do it deep in the
guts of the GUP lookup when we might have to drop and re-take the lock.
I doubt anybody actually does anything quite this strange, but let's be
proactive: adding these warnings is simple, and will make debugging it
much easier if they trigger.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This finishes the job of always holding the mmap write lock when
extending the user stack vma, and removes the 'write_locked' argument
from the vm helper functions again.
For some cases, we just avoid expanding the stack at all: drivers and
page pinning really shouldn't be extending any stacks. Let's see if any
strange users really wanted that.
It's worth noting that architectures that weren't converted to the new
lock_mm_and_find_vma() helper function are left using the legacy
"expand_stack()" function, but it has been changed to drop the mmap_lock
and take it for writing while expanding the vma. This makes it fairly
straightforward to convert the remaining architectures.
As a result of dropping and re-taking the lock, the calling conventions
for this function have also changed, since the old vma may no longer be
valid. So it will now return the new vma if successful, and NULL - and
the lock dropped - if the area could not be extended.
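A hedged sketch of the new calling convention from the caller's side (based
purely on the description above; the error handling is illustrative):

  mmap_read_lock(mm);
  vma = find_vma(mm, addr);
  if (!vma || vma->vm_start > addr) {
          /* may drop and re-take the mmap lock for writing internally */
          vma = expand_stack(mm, addr);
          if (!vma)
                  return -EFAULT; /* NULL means failure, lock already dropped */
  }
  /* success: vma is valid and the mmap lock is held again */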
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> # ia64
Tested-by: Frank Scheiner <frank.scheiner@web.de> # ia64
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Merge tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Dave Hansen:
"As usual, these are all over the map. The biggest cluster is work from
Arnd to eliminate -Wmissing-prototype warnings:
- Address -Wmissing-prototype warnings
- Remove repeated 'the' in comments
- Remove unused current_untag_mask()
- Document urgent tip branch timing
- Clean up MSR kernel-doc notation
- Clean up paravirt_ops doc
- Update Srivatsa S. Bhat's maintained areas
- Remove unused extern declaration acpi_copy_wakeup_routine()"
* tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
x86/acpi: Remove unused extern declaration acpi_copy_wakeup_routine()
Documentation: virt: Clean up paravirt_ops doc
x86/mm: Remove unused current_untag_mask()
x86/mm: Remove repeated word in comments
x86/lib/msr: Clean up kernel-doc notation
x86/platform: Avoid missing-prototype warnings for OLPC
x86/mm: Add early_memremap_pgprot_adjust() prototype
x86/usercopy: Include arch_wb_cache_pmem() declaration
x86/vdso: Include vdso/processor.h
x86/mce: Add copy_mc_fragile_handle_tail() prototype
x86/fbdev: Include asm/fb.h as needed
x86/hibernate: Declare global functions in suspend.h
x86/entry: Add do_SYSENTER_32() prototype
x86/quirks: Include linux/pnp.h for arch_pnpbios_disabled()
x86/mm: Include asm/numa.h for set_highmem_pages_init()
x86: Avoid missing-prototype warnings for doublefault code
x86/fpu: Include asm/fpu/regset.h
x86: Add dummy prototype for mk_early_pgtbl_32()
x86/pci: Mark local functions as 'static'
x86/ftrace: Move prepare_ftrace_return prototype to header
...
|
|
Merge tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 confidential computing update from Borislav Petkov:
- Add support for unaccepted memory as specified in the UEFI spec v2.9.
The gist of it all is that Intel TDX and AMD SEV-SNP confidential
computing guests define the notion of accepting memory before using
it and thus preventing a whole set of attacks against such guests
like memory replay and the like.
There are a couple of strategies for how memory should be accepted;
the current implementation accepts memory on demand.
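A hedged sketch of the on-demand model, using the helpers added by this
series (callers and exact placement vary):

  /* accept a physical range just before the allocator hands it out */
  if (range_contains_unaccepted_memory(start, end))
          accept_memory(start, end);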
* tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
virt: sevguest: Add CONFIG_CRYPTO dependency
x86/efi: Safely enable unaccepted memory in UEFI
x86/sev: Add SNP-specific unaccepted memory support
x86/sev: Use large PSC requests if applicable
x86/sev: Allow for use of the early boot GHCB for PSC requests
x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
x86/sev: Fix calculation of end address based on number of pages
x86/tdx: Add unaccepted memory support
x86/tdx: Refactor try_accept_one()
x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
efi: Add unaccepted memory support
x86/boot/compressed: Handle unaccepted memory
efi/libstub: Implement support for unaccepted memory
efi/x86: Get full memory map in allocate_e820()
mm: Add support for unaccepted memory
|
|
Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Various cleanups all around (Irvin, Chaitanya, Christophe)
- Better struct packing (Christophe JAILLET)
- Reduce controller error logs for optional commands (Keith)
- Support for >=64KiB block sizes (Daniel Gomez)
- Fabrics fixes and code organization (Max, Chaitanya, Daniel
Wagner)
- bcache updates via Coly:
- Fix a race at init time (Mingzhe Zou)
- Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
- use page pinning in the block layer for dio (David)
- convert old block dio code to page pinning (David, Christoph)
- cleanups for pktcdvd (Andy)
- cleanups for rnbd (Guoqing)
- use the unchecked __bio_add_page() for the initial single page
additions (Johannes)
- fix overflows in the Amiga partition handling code (Michael)
- improve mq-deadline zoned device support (Bart)
- keep passthrough requests out of the IO schedulers (Christoph, Ming)
- improve support for flush requests, making them less special to deal
with (Christoph)
- add bdev holder ops and shutdown methods (Christoph)
- fix the name_to_dev_t() situation and use cases (Christoph)
- decouple the block open flags from fmode_t (Christoph)
- ublk updates and cleanups, including adding user copy support (Ming)
- BFQ sanity checking (Bart)
- convert brd from radix to xarray (Pankaj)
- constify various structures (Thomas, Ivan)
- more fine grained persistent reservation ioctl capability checks
(Jingbo)
- misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
Jordy, Li, Min, Yu, Zhong, Waiman)
* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
scsi/sg: don't grab scsi host module reference
ext4: Fix warning in blkdev_put()
block: don't return -EINVAL for not found names in devt_from_devname
cdrom: Fix spectre-v1 gadget
block: Improve kernel-doc headers
blk-mq: don't insert passthrough request into sw queue
bsg: make bsg_class a static const structure
ublk: make ublk_chr_class a static const structure
aoe: make aoe_class a static const structure
block/rnbd: make all 'class' structures const
block: fix the exclusive open mask in disk_scan_partitions
block: add overflow checks for Amiga partition support
block: change all __u32 annotations to __be32 in affs_hardblocks.h
block: fix signed int overflow in Amiga partition support
block: add capacity validation in bdev_add_partition()
block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
block: disallow Persistent Reservation on partitions
reiserfs: fix blkdev_put() warning from release_journal_dev()
block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
block: document the holder argument to blkdev_get_by_path
...
|
|
Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux
Pull splice updates from Jens Axboe:
"This kills off ITER_PIPE to avoid a race between truncate,
iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
memory corruption.
Instead, we either use (a) filemap_splice_read(), which invokes the
buffered file reading code and splices from the pagecache into the
pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
into it and then pushes the filled pages into the pipe; or (c) handle
it in filesystem-specific code.
Summary:
- Rename direct_splice_read() to copy_splice_read()
- Simplify the calculations for the number of pages to be reclaimed
in copy_splice_read()
- Turn do_splice_to() into a helper, vfs_splice_read(), so that it
can be used by overlayfs and coda to perform the checks on the
lower fs
- Make vfs_splice_read() jump to copy_splice_read() to handle
direct-I/O and DAX
- Provide shmem with its own splice_read to handle non-existent pages
in the pagecache. We don't want a ->read_folio() as we don't want
to populate holes, but filemap_get_pages() requires it
- Provide overlayfs with its own splice_read to call down to a lower
layer as overlayfs doesn't provide ->read_folio()
- Provide coda with its own splice_read to call down to a lower layer
as coda doesn't provide ->read_folio()
- Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
and random files as they just copy to the output buffer and don't
splice pages
- Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation
(the wrapper pattern is sketched below, after the summary)
- Make cifs use filemap_splice_read()
- Replace pointers to generic_file_splice_read() with pointers to
filemap_splice_read() as DIO and DAX are handled in the caller;
filesystems can still provide their own alternate ->splice_read()
op
- Remove generic_file_splice_read()
- Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
was the only user"
* tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
splice: kdoc for filemap_splice_read() and copy_splice_read()
iov_iter: Kill ITER_PIPE
splice: Remove generic_file_splice_read()
splice: Use filemap_splice_read() instead of generic_file_splice_read()
cifs: Use filemap_splice_read()
trace: Convert trace/seq to use copy_splice_read()
zonefs: Provide a splice-read wrapper
xfs: Provide a splice-read wrapper
orangefs: Provide a splice-read wrapper
ocfs2: Provide a splice-read wrapper
ntfs3: Provide a splice-read wrapper
nfs: Provide a splice-read wrapper
f2fs: Provide a splice-read wrapper
ext4: Provide a splice-read wrapper
ecryptfs: Provide a splice-read wrapper
ceph: Provide a splice-read wrapper
afs: Provide a splice-read wrapper
9p: Add splice_read wrapper
net: Make sock_splice_read() use copy_splice_read() by default
tty, proc, kernfs, random: Use copy_splice_read()
...
|
|
Make calls to extend_vma() and find_extend_vma() fail if the write lock
is required.
To avoid making this a flag-day event, this still allows the old
read-locking case for the trivial situations, and passes in a flag to
say "is it write-locked". That way write-lockers can say "yes, I'm
being careful", and legacy users will continue to work in all the common
cases until they have been fully converted to the new world order.
Co-Developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
arm has an additional check for address < FIRST_USER_ADDRESS before
expanding the stack. Since FIRST_USER_ADDRESS is defined everywhere
(generally as 0), move that check to the generic expand_downwards().
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is done as a separate patch from introducing the new
lock_mm_and_find_vma() helper, because while it's an obvious change,
it's not what x86 used to do in this area.
We already abort the page fault on fatal signals anyway, so why should
we wait for the mmap lock only to then abort later? With the new helper
function that returns without the lock held on failure anyway, this is
particularly easy and straightforward.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
.. and make x86 use it.
This basically extracts the existing x86 "find and expand faulting vma"
code, but extends it to also take the mmap lock for writing in case we
actually do need to expand the vma.
We've historically short-circuited that case, and have some rather ugly
special logic to serialize the stack segment expansion (since we only
hold the mmap lock for reading) that doesn't match the normal VM
locking.
That slight violation of locking worked well, right up until it didn't:
the maple tree code really does want proper locking even for simple
extension of an existing vma.
So extract the code for "look up the vma of the fault" from x86, fix it
up to do the necessary write locking, and make it available as a helper
function for other architectures that can use the common helper.
Note: I say "common helper", but it really only handles the normal
stack-grows-down case. Which is all architectures except for PA-RISC
and IA64. So some rare architectures can't use the helper, but if they
care they'll just need to open-code this logic.
It's also worth pointing out that this code really would like to have an
optimistic "mmap_upgrade_trylock()" to make it quicker to go from a
read-lock (for the common case) to taking the write lock (for having to
extend the vma) in the normal single-threaded situation where there is
no other locking activity.
But that _is_ all the very uncommon special case, so while it would be
nice to have such an operation, it probably doesn't matter in reality.
I did put in the skeleton code for such a possible future expansion,
even if it only acts as pseudo-documentation for what we're doing.
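A hedged sketch of an architecture fault path using the new helper (labels
and flags are illustrative):

  vma = lock_mm_and_find_vma(mm, address, regs);
  if (unlikely(!vma))
          goto bad_area_nosemaphore; /* helper returns with no lock held */

  fault = handle_mm_fault(vma, address, flags, regs);
  /* ... */
  mmap_read_unlock(mm);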
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Ackerley Tng reported an issue with hugetlbfs fallocate as noted in the
Closes tag. The issue showed up after the conversion of hugetlb page
cache lookup code to use page_cache_next_miss. User visible effects are:
- hugetlbfs fallocate incorrectly returns -EEXIST if pages are present
in the file.
- hugetlb pages will not be included in core dumps if they need to be
brought in via GUP.
- userfaultfd UFFDIO_COPY will not notice pages already present in the
cache. It may try to allocate a new page and potentially return
ENOMEM as opposed to EEXIST.
Revert the use of page_cache_next_miss() in hugetlb code.
IMPORTANT NOTE FOR STABLE BACKPORTS:
This patch will apply cleanly to v6.3. However, due to the change of
filemap_get_folio() return values, it will not function correctly. This
patch must be modified for stable backports.
[dan.carpenter@linaro.org: fix hugetlbfs_pagecache_present()]
Link: https://lkml.kernel.org/r/efa86091-6a2c-4064-8f55-9b44e1313015@moroto.mountain
Link: https://lkml.kernel.org/r/20230621212403.174710-2-mike.kravetz@oracle.com
Fixes: d0ce0e47b323 ("mm/hugetlb: convert hugetlb fault paths to use alloc_hugetlb_folio()")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reported-by: Ackerley Tng <ackerleytng@google.com>
Closes: https://lore.kernel.org/linux-mm/cover.1683069252.git.ackerleytng@google.com
Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Erdem Aktas <erdemaktas@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This reverts commit 9425c591e06a9ab27a145ba655fb50532cf0bcc9.
The reverted commit fixed up routines primarily used by readahead code
such that they could also be used by hugetlb. Unfortunately, this
caused a performance regression as pointed out by the Closes: tag.
The hugetlb code which uses page_cache_next_miss will be addressed in
a subsequent patch.
Link: https://lkml.kernel.org/r/20230621212403.174710-1-mike.kravetz@oracle.com
Fixes: 9425c591e06a ("page cache: fix page_cache_next/prev_miss off by one")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202306211346.1e9ff03e-oliver.sang@intel.com
Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Erdem Aktas <erdemaktas@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When memory.reclaim was introduced, it became the first case where
cgroup_reclaim() is true for the root cgroup. Johannes concluded [1] that
for most cases this is okay, except for one case. Historically, kswapd
would throttle reclaim on a node if a lot of pages marked for reclaim are
under writeback (aka the node is congested). This occurred by setting
LRUVEC_CONGESTED bit in lruvec->flags. The bit would be cleared when the
node is balanced.
Similarly, cgroup reclaim would set the same bit when an lruvec is
congested, and clear it on the way out of reclaim (to throttle local
reclaimers).
Before the introduction of memory.reclaim, the root memcg was the only
target of kswapd reclaim, and non-root memcgs were the only targets of
cgroup reclaim, so they would never interfere. Using the same bit for
both was fine. After memory.reclaim, it is possible for cgroup reclaim on
the root cgroup to clear the bit set by kswapd. This would result in
reclaim on the node being unthrottled before the node is balanced.
Fix this by introducing separate bits for cgroup-level and node-level
congestion. kswapd can unthrottle an lruvec that is marked as congested
by cgroup reclaim (as the entire node should no longer be congested), but
not vice versa (to prevent premature unthrottling before the entire node
is balanced).
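A hedged sketch of the split (a plausible shape; assuming the two bits are
named along these lines):

  enum {
          LRUVEC_CGROUP_CONGESTED, /* set/cleared by cgroup reclaim */
          LRUVEC_NODE_CONGESTED,   /* set by kswapd, cleared once balanced */
  };

  /* kswapd may clear both bits; cgroup reclaim clears only its own */
  clear_bit(LRUVEC_CGROUP_CONGESTED, &lruvec->flags);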
[1]https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/
Link: https://lkml.kernel.org/r/20230621023101.432780-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Closes: https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Evidently, global_reclaim() can be a confusing name, especially since it
used to exist before with a subtly different definition (removed by commit
b5ead35e7e1d ("mm: vmscan: naming fixes: global_reclaim() and
sane_reclaim()")). It can be interpreted as non-cgroup reclaim, even
though it returns true for cgroup reclaim on the root memcg (through
memory.reclaim).
Rename it to root_reclaim() in an attempt to make it less ambiguous, and
add documentation to it as well as to cgroup_reclaim().
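A hedged sketch of the renamed helper (the body mirrors how global_reclaim()
was commonly defined; treat as illustrative):

  /* true for kswapd reclaim and for memory.reclaim on the root memcg */
  static bool root_reclaim(struct scan_control *sc)
  {
          return !sc->target_mem_cgroup ||
                 mem_cgroup_is_root(sc->target_mem_cgroup);
  }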
Link: https://lkml.kernel.org/r/20230621023053.432374-1-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Closes: https://lore.kernel.org/lkml/20230405200150.GA35884@cmpxchg.org/
Acked-by: Yu Zhao <yuzhao@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Directly use a folio instead of page_folio() when a page is successfully
isolated (hugepage and movable page) and after folio_get_nontail_page(),
which removes several calls to compound_head().
Link: https://lkml.kernel.org/r/20230619110718.65679-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: James Gowans <jgowans@amazon.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If exclusive loads are enabled for zswap, we invalidate the entry before
returning from zswap_frontswap_load(), after dropping the local reference.
However, the tree lock is dropped during decompression after the local
reference is acquired, so the entry could be invalidated before we drop
the local ref. If this happens, the entry is freed once we drop the local
ref, and zswap_invalidate_entry() tries to invalidate an already freed
entry.
Fix this by:
(a) Making sure zswap_invalidate_entry() is always called with a local
ref held, to avoid being called on a freed entry.
(b) Making sure zswap_invalidate_entry() only drops the ref if the entry
was actually on the rbtree. Otherwise, another invalidation could
have already happened, and the initial ref is already dropped.
With these changes, there is no need for zswap_reclaim_entry() to check
that the entry still exists in the tree before invalidating it, as
zswap_invalidate_entry() now makes this check internally.
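A hedged sketch of point (b) (assumes zswap_rb_erase() is made to report
whether the entry was actually removed):

  /* caller must hold a local reference, per (a) above */
  static void zswap_invalidate_entry(struct zswap_tree *tree,
                                     struct zswap_entry *entry)
  {
          /* drop the base ref only if we were the ones to erase the entry */
          if (zswap_rb_erase(&tree->rbroot, entry))
                  zswap_entry_put(tree, entry);
  }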
Link: https://lkml.kernel.org/r/20230621093009.637544-1-yosryahmed@google.com
Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
These files no longer need pagevec.h, mostly due to function declarations
being moved out of it.
Link: https://lkml.kernel.org/r/20230621164557.3510324-14-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Most of these should just refer to the LRU cache rather than the data
structure used to implement the LRU cache.
Link: https://lkml.kernel.org/r/20230621164557.3510324-13-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We don't use pagevecs for the LRU cache any more, and we don't know that
the failed invalidations were due to the folio being in an LRU cache. So
rename it to be more accurate.
Link: https://lkml.kernel.org/r/20230621164557.3510324-12-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All users are now converted to use the folio_batch so we can get rid of
this data structure.
Link: https://lkml.kernel.org/r/20230621164557.3510324-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All callers have now been converted to call
check_move_unevictable_folios().
Link: https://lkml.kernel.org/r/20230621164557.3510324-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit bf75f200569d ("mm/page_alloc: add page->buddy_list and
page->pcp_list") introduced page->buddy_list and page->pcp_list as a union
with page->lru, but missed changing get_page_from_free_area() to use
page->buddy_list, which would clarify the correct type of list for a free page.
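A hedged sketch of the clarified helper (the shape follows the page_alloc.c
original; the exact form may differ):

  static inline struct page *get_page_from_free_area(struct free_area *area,
                                                     int migratetype)
  {
          return list_first_entry_or_null(&area->free_list[migratetype],
                                          struct page, buddy_list);
  }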
Link: https://lkml.kernel.org/r/7e7ab533247d40c0ea0373c18a6a48e5667f9e10.1687333557.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now that the driver core allows for struct class to be in read-only
memory, move the bdi_class structure to be declared at build time placing
it into read-only memory, instead of having to be dynamically allocated at
load time.
Link: https://lkml.kernel.org/r/20230620183314.682822-2-gregkh@linuxfoundation.org
Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Delete a triply out-of-date comment from add_swap_count_continuation():
1. vmalloc_to_page() changed from pte_offset_map() to pte_offset_kernel()
2. pte_offset_map() changed from using kmap_atomic() to kmap_local_page()
3. kmap_atomic() changed from using fixed FIX_KMAP addresses in 2.6.37.
Link: https://lkml.kernel.org/r/9022632b-ba9d-8cb0-c25-4be9786481b5@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
early_pfn_to_nid() is called frequently in init_reserved_page() to get the
node id of a PFN. These PFNs are likely from the same memory region and
share the same node id, so it's not necessary to call early_pfn_to_nid()
for each PFN.
Pass nid to reserve_bootmem_region() and drop the call to
early_pfn_to_nid() in init_reserved_page(). Also, set the nid on all
reserved pages beforehand, as some reserved memory regions may not have a
nid set.
The most beneficial function is memmap_init_reserved_pages() if
CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
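A hedged sketch of the reworked loop (the iterator and exact signature are
assumptions based on the description above):

  /* resolve the nid once per reserved region instead of once per PFN */
  for_each_reserved_mem_region(region) {
          nid = memblock_get_region_node(region);
          reserve_bootmem_region(region->base,
                                 region->base + region->size, nid);
  }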
The following data was tested on an x86 machine with 190GB of RAM.
before:
memmap_init_reserved_pages() 67ms
after:
memmap_init_reserved_pages() 20ms
Link: https://lkml.kernel.org/r/20230619023406.424298-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
These routines are not intended to return zero; the callers cannot do
anything sane with a 0 return. They should return an error, which means
future calls to GUP will not succeed, or they should return some non-zero
number of pinned pages, which means GUP should be called again.
If start + nr_pages overflows it should return -EOVERFLOW to signal the
arguments are invalid.
Syzkaller keeps tripping on this when fuzzing GUP arguments.
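A hedged sketch of the check (placement and exact expression are
assumptions):

  unsigned long end;

  /* reject overflowing ranges up front instead of returning 0 */
  if (check_add_overflow(start, nr_pages << PAGE_SHIFT, &end))
          return -EOVERFLOW;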
Link: https://lkml.kernel.org/r/0-v1-3d5ed1f20d50+104-gup_overflow_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reported-by: syzbot+353c7be4964c6253f24a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000094fdd05faa4d3a4@google.com
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The per-node numa_stat values for shmem don't change on page migration for
THP:
grep shmem /sys/fs/cgroup/machine.slice/.../memory.numa_stat:
shmem N0=1092616192 N1=10485760
shmem_thp N0=1092616192 N1=10485760
migratepages 9181 0 1:
shmem N0=0 N1=1103101952
shmem_thp N0=1092616192 N1=10485760
Fix that by updating the shmem_thp counters on page migration in the same
way as the shmem counters.
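A hedged sketch of the fix in the migration accounting path (modeled on
folio_migrate_mapping()-style counter updates; treat as illustrative):

  if (folio_test_swapbacked(folio) && !folio_test_swapcache(folio)) {
          __mod_lruvec_state(old_lruvec, NR_SHMEM, -nr);
          __mod_lruvec_state(new_lruvec, NR_SHMEM, nr);
          /* the missing piece: keep the THP counter in sync as well */
          if (folio_test_pmd_mappable(folio)) {
                  __mod_lruvec_state(old_lruvec, NR_SHMEM_THPS, -nr);
                  __mod_lruvec_state(new_lruvec, NR_SHMEM_THPS, nr);
          }
  }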
[jglauber@digitalocean.com: use folio_test_pmd_mappable instead of folio_test_transhuge]
Link: https://lkml.kernel.org/r/20230622094720.510540-1-jglauber@digitalocean.com
Link: https://lkml.kernel.org/r/20230619103351.234837-1-jglauber@digitalocean.com
Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
mem_cgroup_init() requests allocations from each possible node, which used
to be a problem because NODE_DATA is not allocated for offline nodes.
Things have changed since commit 09f49dca570a9 ("mm: handle
uninitialized numa nodes gracefully"), so it's unnecessary to check for
!node_online nodes here.
How to test?
qemu-system-x86_64 \
-kernel vmlinux \
-initrd full.rootfs.cpio.gz \
-append "console=ttyS0,115200 root=/dev/ram0 nokaslr earlyprintk=serial oops=panic panic_on_warn" \
-drive format=qcow2,file=vm_disk.qcow2,media=disk,if=ide \
-enable-kvm \
-cpu host \
-m 8G,slots=2,maxmem=16G \
-smp cores=4,threads=1,sockets=2 \
-object memory-backend-ram,id=mem0,size=4G \
-object memory-backend-ram,id=mem1,size=4G \
-numa node,memdev=mem0,cpus=0-3,nodeid=0 \
-numa node,memdev=mem1,cpus=4-7,nodeid=1 \
-numa node,nodeid=2 \
-net nic,model=virtio,macaddr=52:54:00:12:34:58 \
-net user \
-nographic \
-rtc base=localtime \
-gdb tcp::6000
Guest state when booting:
[ 0.048881] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x00000000-0xbfffffff]
[ 0.050489] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0x00000000-0x13fffffff]
[ 0.052173] NODE_DATA(0) allocated [mem 0x13fffc000-0x13fffffff]
[ 0.053164] NODE_DATA(1) allocated [mem 0x23fffa000-0x23fffdfff]
[ 0.054187] Zone ranges:
[ 0.054587] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.055551] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.056515] Normal [mem 0x0000000100000000-0x000000023fffffff]
[ 0.057484] Movable zone start for each node
[ 0.058149] Early memory node ranges
[ 0.058705] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.059679] node 0: [mem 0x0000000000100000-0x00000000bffdffff]
[ 0.060659] node 0: [mem 0x0000000100000000-0x000000013fffffff]
[ 0.061649] node 1: [mem 0x0000000140000000-0x000000023fffffff]
[ 0.062638] Initmem setup node 0 [mem 0x0000000000001000-0x000000013fffffff]
[ 0.063745] Initmem setup node 1 [mem 0x0000000140000000-0x000000023fffffff]
[ 0.064855] DMA zone: 158 reserved pages exceeds freesize 0
[ 0.065746] Initializing node 2 as memoryless
[ 0.066437] Initmem setup node 2 as memoryless
[ 0.067132] DMA zone: 158 reserved pages exceeds freesize 0
[ 0.068037] On node 0, zone DMA: 1 pages in unavailable ranges
[ 0.068265] On node 0, zone DMA: 97 pages in unavailable ranges
[ 0.124755] On node 0, zone Normal: 32 pages in unavailable ranges
cat /sys/devices/system/node/online
0-1
cat /sys/devices/system/node/possible
0-2
Link: https://lkml.kernel.org/r/20230619130442.2487-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
syzbot is reporting a lockdep warning in __stack_depot_save(), because
the caller of __stack_depot_save() (i.e. __kasan_record_aux_stack() in
this report) is responsible for masking the __GFP_KSWAPD_RECLAIM flag in
order not to wake kswapd, which in turn wakes kcompactd.
Since kasan/kmsan functions might be called with arbitrary locks held,
mask __GFP_KSWAPD_RECLAIM flag from all GFP_NOWAIT/GFP_ATOMIC allocations
in kasan/kmsan.
Note that kmsan_save_stack_with_flags() is changed to mask both the
__GFP_DIRECT_RECLAIM flag and the __GFP_KSWAPD_RECLAIM flag, because
wakeup_kswapd() from wake_all_kswapds() from __alloc_pages_slowpath()
calls wakeup_kcompactd() if the __GFP_KSWAPD_RECLAIM flag is set and the
__GFP_DIRECT_RECLAIM flag is not set.
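A hedged sketch of the masking (the GFP flags are real; the exact call
sites vary):

  /* kasan: don't wake kswapd (and thus kcompactd) from atomic context */
  gfp_t alloc_flags = flags & ~__GFP_KSWAPD_RECLAIM;

  /*
   * kmsan_save_stack_with_flags(): mask both reclaim flags, since
   * __GFP_KSWAPD_RECLAIM without __GFP_DIRECT_RECLAIM still reaches
   * wakeup_kcompactd()
   */
  gfp_t kmsan_flags = flags & ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);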
Link: https://lkml.kernel.org/r/656cb4f5-998b-c8d7-3c61-c2d37aa90f9a@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+ece2915262061d6e0ac1@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=ece2915262061d6e0ac1
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On some machines, the normal zone can have a large memory hole like the
memory layout below, where the range from 0x100000000 to 0x1800000000 is
a hole. So when isolating some migratable pages, the scanner can meet
the hole and it will take more time to skip the large hole. From my
measurement, I can see the isolation scanner will take 80us ~ 100us to
skip the large hole [0x100000000 - 0x1800000000].
So adding a new helper to fast search next online memory section to skip
the large hole can help to find next suitable pageblock efficiently. With
this patch, I can see the large hole scanning only takes < 1us.
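A hedged sketch of such a helper (the sparsemem accessors are real; the
implementation details are an assumption and require CONFIG_SPARSEMEM):

  /* return the first PFN of the next online section, or 0 if none */
  static unsigned long skip_offline_sections(unsigned long start_pfn)
  {
          unsigned long nr = pfn_to_section_nr(start_pfn);

          while (++nr <= __highest_present_section_nr) {
                  if (online_section_nr(nr))
                          return section_nr_to_pfn(nr);
          }
          return 0;
  }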
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000040000000-0x00000000ffffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal [mem 0x0000000100000000-0x0000001fa7ffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000040000000-0x0000000fffffffff]
[ 0.000000] node 0: [mem 0x0000001800000000-0x0000001fa3c7ffff]
[ 0.000000] node 0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
[ 0.000000] node 0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
[ 0.000000] node 0: [mem 0x0000001fa4030000-0x0000001fa40effff]
[ 0.000000] node 0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
[ 0.000000] node 0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
[ 0.000000] node 0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
[ 0.000000] node 0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
[ 0.000000] node 0: [mem 0x0000001fa7590000-0x0000001fa7ffffff]
[baolin.wang@linux.alibaba.com: limit next_ptn to not exceed cc->free_pfn]
Link: https://lkml.kernel.org/r/a1d859c28af0c7e85e91795e7473f553eb180a9d.1686813379.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/75b4c8ca36bf44ad8c42bf0685ac19d272e426ec.1686705221.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
lru_gen_rotate_memcg() can run in softirq context if
memory.soft_limit_in_bytes is set. This requires memcg_lru->lock to be
irq safe. Lockdep warns on this.
This problem only affects memcg v1.
Link: https://lkml.kernel.org/r/20230619193821.2710944-1-yuzhao@google.com
Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: syzbot+87c490fd2be656269b6a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=87c490fd2be656269b6a
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
tools/testing/selftests/net/fcnal-test.sh
d7a2fc1437f7 ("selftests: net: fcnal-test: check if FIPS mode is enabled")
dd017c72dde6 ("selftests: fcnal: Test SO_DONTROUTE on TCP sockets.")
https://lore.kernel.org/all/5007b52c-dd16-dbf6-8d64-b9701bfa498b@tessares.net/
https://lore.kernel.org/all/20230619105427.4a0df9b3@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The arm64 documentation has moved under Documentation/arch/. Fix up a
reference in mm/mremap.c to match.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
|
|
Merge tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"19 hotfixes. 8 of these are cc:stable.
This includes a wholesale reversion of the post-6.4 series 'make slab
shrink lockless'. After input from Dave Chinner it has been decided
that we should go a different way [1]"
Link: https://lkml.kernel.org/r/ZH6K0McWBeCjaf16@dread.disaster.area [1]
* tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
selftests/mm: fix cross compilation with LLVM
mailmap: add entries for Ben Dooks
nilfs2: prevent general protection fault in nilfs_clear_dirty_page()
Revert "mm: vmscan: make global slab shrink lockless"
Revert "mm: vmscan: make memcg slab shrink lockless"
Revert "mm: vmscan: add shrinker_srcu_generation"
Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless"
Revert "mm: vmscan: hold write lock to reparent shrinker nr_deferred"
Revert "mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()"
Revert "mm: shrinkers: convert shrinker_rwsem to mutex"
nilfs2: fix buffer corruption due to concurrent device reads
scripts/gdb: fix SB_* constants parsing
scripts: fix the gfp flags header path in gfp-translate
udmabuf: revert 'Add support for mapping hugepages (v4)'
mm/khugepaged: fix iteration in collapse_file
memfd: check for non-NULL file_seals in memfd_create() syscall
mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
mm/mprotect: fix do_mprotect_pkey() limit check
writeback: fix dereferencing NULL mapping->host on writeback_page_template
|
|
It's only used inside page_alloc.c now. So make it static and remove the
declaration in mm.h.
Link: https://lkml.kernel.org/r/20230617034622.1235913-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
No user checks the return value of mem_cgroup_scan_tasks(). Make the
return value void.
Link: https://lkml.kernel.org/r/20230616063030.977586-1-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 5ff6e2fff88e ("mm/damon/core: fix divide error in
damon_nr_accesses_to_accesses_bp()") fixed a bug by adding arguments
validation in damon_set_attrs(). Add a unit test for the added validation
to ensure the bug cannot occur again.
Link: https://lkml.kernel.org/r/20230615183323.87561-1-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
try_get_folio() takes in a page, then chooses to do some folio operations
based on the flags (either FOLL_GET or FOLL_PIN). We can rewrite this
function to be more purpose oriented.
After calling try_get_folio(), if neither FOLL_GET nor FOLL_PIN are set,
warn and fail. If FOLL_GET is set we can return the result. If FOLL_GET
is not set then FOLL_PIN is set, so we pin the folio.
This change assists with folio conversions, and makes the function more
readable.
Link: https://lkml.kernel.org/r/20230614021312.34085-5-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
verify_dma_pinned() checks that pages are dma-pinned. We can convert this
to use folios.
Link: https://lkml.kernel.org/r/20230614021312.34085-4-vishal.moola@gmail.com
Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
KASAN's boot time kernel parameter 'kasan.fault=' currently supports
'report' and 'panic', which results in either only reporting bugs or also
panicking on reports.
However, some users may wish to have more control over when KASAN reports
result in a kernel panic: in particular, KASAN reported invalid _writes_
are of special interest, because they have greater potential to corrupt
random kernel memory or be more easily exploited.
To panic on invalid writes only, introduce 'kasan.fault=panic_on_write',
which allows users to choose to continue running on invalid reads, but
panic only on invalid writes.
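For example, on the kernel command line:

  kasan.fault=panic_on_write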
Link: https://lkml.kernel.org/r/20230614095158.1133673-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When an entry started writeback, it used to be invalidated with ref count
logic alone, meaning that it would stay on the tree until all references
were put. The problem with this behavior is that as soon as the writeback
started, the ownership of the data held by the entry is passed to the
swapcache and should not be left in zswap too. Currently there are no
known issues because of this, but this change explicitly invalidates an
entry that started writeback to reduce opportunities for future bugs.
This patch is a follow up on the series titled "mm: zswap: move writeback
LRU from zpool to zswap" + commit f090b7949768("mm: zswap: support
exclusive loads").
Link: https://lkml.kernel.org/r/20230614143122.74471-1-cerasuolodomenico@gmail.com
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since commit c7c3dec1c9db ("mm: rmap: remove lock_page_memcg()"), there
are no more users; kill lock_page_memcg() and unlock_page_memcg().
Link: https://lkml.kernel.org/r/20230614143612.62575-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
cma: display pfn as well as pfn_to_page(pfn)
page_owner: display pfn in hex rather than decimal
Link: https://lkml.kernel.org/r/20230613092533.15449-1-quic_yingangl@quicinc.com
Signed-off-by: Kassey Li <quic_yingangl@quicinc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
These are equivalent, but DEFINE_READ_MOSTLY_HASHTABLE exists to define
a hashtable in the .data..read_mostly section.
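A hedged sketch of the change in khugepaged (the table name and bits match
the mm_slots hash there, but treat the exact line as illustrative):

  /*
   * places mm_slots_hash in .data..read_mostly, equivalent to the
   * open-coded __read_mostly hlist_head array it replaces
   */
  static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);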
Link: https://lkml.kernel.org/r/20230609-khugepage-v1-1-dad4e8382298@google.com
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
percpu-internal/pcpu_chunk: re-layout pcpu_chunk structure to reduce false
sharing
When running UnixBench/Execl throughput case, false sharing is observed
due to frequent read on base_addr and write on free_bytes, chunk_md.
UnixBench/Execl represents a class of workload where bash scripts are
spawned frequently to do some short jobs. It will do system call on execl
frequently, and execl will call mm_init to initialize mm_struct of the
process. mm_init will call __percpu_counter_init for percpu_counters
initialization. Then pcpu_alloc is called to read the base_addr of
pcpu_chunk for memory allocation. Inside pcpu_alloc, it will call
pcpu_alloc_area to allocate memory from a specified chunk. This function
will update "free_bytes" and "chunk_md" to record the rest free bytes and
other meta data for this chunk. Correspondingly, pcpu_free_area will also
update these 2 members when free memory.
Call trace from perf is as below:
+ 57.15% 0.01% execl [kernel.kallsyms] [k] __percpu_counter_init
+ 57.13% 0.91% execl [kernel.kallsyms] [k] pcpu_alloc
- 55.27% 54.51% execl [kernel.kallsyms] [k] osq_lock
- 53.54% 0x654278696e552f34
main
__execve
entry_SYSCALL_64_after_hwframe
do_syscall_64
__x64_sys_execve
do_execveat_common.isra.47
alloc_bprm
mm_init
__percpu_counter_init
pcpu_alloc
- __mutex_lock.isra.17
In current pcpu_chunk layout, `base_addr' is in the same cache line with
`free_bytes' and `chunk_md', and `base_addr' is at the last 8 bytes. This
patch moves `bound_map' up to `base_addr', to let `base_addr' locate in a
new cacheline.
With this change, on Intel Sapphire Rapids 112C/224T platform, based on
v6.4-rc4, the 160 parallel score improves by 24%.
The pcpu_chunk struct is a backing data structure per chunk, so the
additional memory should not be dramatic. A chunk covers ballpark
between 64KB and 512KB of memory depending on some config and boot time
stuff, so I believe the additional memory used here is nominal at best.
Working the #s on my desktop:
Percpu: 58624 kB
28 cores -> ~2.1MB of percpu memory.
At say ~128KB per chunk -> 33 chunks, generously 40 chunks.
Adding alignment might bump the chunk size ~64 bytes, so in total ~2KB
of overhead?
I believe we can do a little better to avoid eating that full padding,
so likely less than that.
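A hedged sketch of the re-layout idea (members abridged; the real struct
has many more fields):

  struct pcpu_chunk {
          struct list_head        list;
          int                     free_bytes; /* hot: written on alloc/free */
          struct pcpu_block_md    chunk_md;   /* hot: written on alloc/free */
          unsigned long           *bound_map; /* moved up as spacing */
          void                    *base_addr; /* read-mostly: now starts a
                                                 new cacheline */
          /* remaining members unchanged */
  };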
[dennis@kernel.org: changelog details]
Link: https://lkml.kernel.org/r/20230610030730.110074-1-yu.ma@intel.com
Signed-off-by: Yu Ma <yu.ma@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
establish_demotion_targets() is only defined when CONFIG_MIGRATION is
enabled. There's no need to check it again.
Link: https://lkml.kernel.org/r/20230610034114.981861-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add __meminit to kcompactd_run() and kcompactd_stop() so that they
default to __init when memory hotplug is not enabled.
Link: https://lkml.kernel.org/r/20230610034615.997813-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Android reported a performance regression in the userfaultfd unmap path.
A closer inspection of the userfaultfd_unmap_prep() change showed that a
second tree walk would be necessary in the reworked code.
Fix the regression by passing each VMA that will be unmapped through to
the userfaultfd_unmap_prep() function as they are added to the unmap list,
instead of re-walking the tree for the VMA.
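A hedged sketch of the reworked prototype (an assumption based on the
description above):

  /* called once per VMA as it is added to the unmap list; no tree re-walk */
  int userfaultfd_unmap_prep(struct vm_area_struct *vma, unsigned long start,
                             unsigned long end, struct list_head *unmaps);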
Link: https://lkml.kernel.org/r/20230601015402.2819343-1-Liam.Howlett@oracle.com
Fixes: 69dbe6daf104 ("userfaultfd: use maple tree iterator to iterate VMAs")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|