summaryrefslogtreecommitdiff
path: root/fs/proc
AgeCommit message (Collapse)Author
2019-03-05mm: convert PG_balloon to PG_offlineDavid Hildenbrand
PG_balloon was introduced to implement page migration/compaction for pages inflated in virtio-balloon. Nowadays, it is only a marker that a page is part of virtio-balloon and therefore logically offline. We also want to make use of this flag in other balloon drivers - for inflated pages or when onlining a section but keeping some pages offline (e.g. used right now by XEN and Hyper-V via set_online_page_callback()). We are going to expose this flag to dump tools like makedumpfile. But instead of exposing PG_balloon, let's generalize the concept of marking pages as logically offline, so it can be reused for other purposes later on. Rename PG_balloon to PG_offline. This is an indicator that the page is logically offline, the content stale and that it should not be touched (e.g. a hypervisor would have to allocate backing storage in order for the guest to dump an unused page). We can then e.g. exclude such pages from dumps. We replace and reuse KPF_BALLOON (23), as this shouldn't really harm (and for now the semantics stay the same). In following patches, we will make use of this bit also in other balloon drivers. While at it, document PGTABLE. [akpm@linux-foundation.org: fix comment text, per David] Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Pankaj gupta <pagupta@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Christian Hansen <chansen3@cisco.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Young <dyoung@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Freche <jfreche@vmware.com> Cc: Kairui Song <kasong@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <len.brown@intel.com> Cc: Lianbo Jiang <lijiang@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xavier Deguillard <xdeguillard@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05Merge branch 'irq-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "The interrupt departement delivers this time: - New infrastructure to manage NMIs on platforms which have a sane NMI delivery, i.e. identifiable NMI vectors instead of a single lump. - Simplification of the interrupt affinity management so drivers don't have to implement ugly loops around the PCI/MSI enablement. - Speedup for interrupt statistics in /proc/stat - Provide a function to retrieve the default irq domain - A new interrupt controller for the Loongson LS1X platform - Affinity support for the SiFive PLIC - Better support for the iMX irqsteer driver - NUMA aware memory allocations for GICv3 - The usual small fixes, improvements and cleanups all over the place" * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) irqchip/imx-irqsteer: Add multi output interrupts support irqchip/imx-irqsteer: Change to use reg_num instead of irq_group dt-bindings: irq: imx-irqsteer: Add multi output interrupts support dt-binding: irq: imx-irqsteer: Use irq number instead of group number irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code irqchip/gicv3-its: Use NUMA aware memory allocation for ITS tables irqdomain: Allow the default irq domain to be retrieved irqchip/sifive-plic: Implement irq_set_affinity() for SMP host irqchip/sifive-plic: Differentiate between PLIC handler and context irqchip/sifive-plic: Add warning in plic_init() if handler already present irqchip/sifive-plic: Pre-compute context hart base and enable base PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets genirq/affinity: Remove the leftovers of the original set support nvme-pci: Simplify interrupt allocation genirq/affinity: Add new callback for (re)calculating interrupt sets genirq/affinity: Store interrupt sets size in struct irq_affinity genirq/affinity: Code consolidation irqchip/irq-sifive-plic: Check and continue in case of an invalid cpuid. irqchip/i8259: Fix shutdown order by moving syscore_ops registration dt-bindings: interrupt-controller: loongson ls1x intc ...
2019-03-05signal: add pidfd_send_signal() syscallChristian Brauner
The kill() syscall operates on process identifiers (pid). After a process has exited its pid can be reused by another process. If a caller sends a signal to a reused pid it will end up signaling the wrong process. This issue has often surfaced and there has been a push to address this problem [1]. This patch uses file descriptors (fd) from proc/<pid> as stable handles on struct pid. Even if a pid is recycled the handle will not change. The fd can be used to send signals to the process it refers to. Thus, the new syscall pidfd_send_signal() is introduced to solve this problem. Instead of pids it operates on process fds (pidfd). /* prototype and argument /* long pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags); /* syscall number 424 */ The syscall number was chosen to be 424 to align with Arnd's rework in his y2038 to minimize merge conflicts (cf. [25]). In addition to the pidfd and signal argument it takes an additional siginfo_t and flags argument. If the siginfo_t argument is NULL then pidfd_send_signal() is equivalent to kill(<positive-pid>, <signal>). If it is not NULL pidfd_send_signal() is equivalent to rt_sigqueueinfo(). The flags argument is added to allow for future extensions of this syscall. It currently needs to be passed as 0. Failing to do so will cause EINVAL. /* pidfd_send_signal() replaces multiple pid-based syscalls */ The pidfd_send_signal() syscall currently takes on the job of rt_sigqueueinfo(2) and parts of the functionality of kill(2), Namely, when a positive pid is passed to kill(2). It will however be possible to also replace tgkill(2) and rt_tgsigqueueinfo(2) if this syscall is extended. /* sending signals to threads (tid) and process groups (pgid) */ Specifically, the pidfd_send_signal() syscall does currently not operate on process groups or threads. This is left for future extensions. In order to extend the syscall to allow sending signal to threads and process groups appropriately named flags (e.g. PIDFD_TYPE_PGID, and PIDFD_TYPE_TID) should be added. This implies that the flags argument will determine what is signaled and not the file descriptor itself. Put in other words, grouping in this api is a property of the flags argument not a property of the file descriptor (cf. [13]). Clarification for this has been requested by Eric (cf. [19]). When appropriate extensions through the flags argument are added then pidfd_send_signal() can additionally replace the part of kill(2) which operates on process groups as well as the tgkill(2) and rt_tgsigqueueinfo(2) syscalls. How such an extension could be implemented has been very roughly sketched in [14], [15], and [16]. However, this should not be taken as a commitment to a particular implementation. There might be better ways to do it. Right now this is intentionally left out to keep this patchset as simple as possible (cf. [4]). /* naming */ The syscall had various names throughout iterations of this patchset: - procfd_signal() - procfd_send_signal() - taskfd_send_signal() In the last round of reviews it was pointed out that given that if the flags argument decides the scope of the signal instead of different types of fds it might make sense to either settle for "procfd_" or "pidfd_" as prefix. The community was willing to accept either (cf. [17] and [18]). Given that one developer expressed strong preference for the "pidfd_" prefix (cf. [13]) and with other developers less opinionated about the name we should settle for "pidfd_" to avoid further bikeshedding. The "_send_signal" suffix was chosen to reflect the fact that the syscall takes on the job of multiple syscalls. It is therefore intentional that the name is not reminiscent of neither kill(2) nor rt_sigqueueinfo(2). Not the fomer because it might imply that pidfd_send_signal() is a replacement for kill(2), and not the latter because it is a hassle to remember the correct spelling - especially for non-native speakers - and because it is not descriptive enough of what the syscall actually does. The name "pidfd_send_signal" makes it very clear that its job is to send signals. /* zombies */ Zombies can be signaled just as any other process. No special error will be reported since a zombie state is an unreliable state (cf. [3]). However, this can be added as an extension through the @flags argument if the need ever arises. /* cross-namespace signals */ The patch currently enforces that the signaler and signalee either are in the same pid namespace or that the signaler's pid namespace is an ancestor of the signalee's pid namespace. This is done for the sake of simplicity and because it is unclear to what values certain members of struct siginfo_t would need to be set to (cf. [5], [6]). /* compat syscalls */ It became clear that we would like to avoid adding compat syscalls (cf. [7]). The compat syscall handling is now done in kernel/signal.c itself by adding __copy_siginfo_from_user_generic() which lets us avoid compat syscalls (cf. [8]). It should be noted that the addition of __copy_siginfo_from_user_any() is caused by a bug in the original implementation of rt_sigqueueinfo(2) (cf. 12). With upcoming rework for syscall handling things might improve significantly (cf. [11]) and __copy_siginfo_from_user_any() will not gain any additional callers. /* testing */ This patch was tested on x64 and x86. /* userspace usage */ An asciinema recording for the basic functionality can be found under [9]. With this patch a process can be killed via: #define _GNU_SOURCE #include <errno.h> #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/stat.h> #include <sys/syscall.h> #include <sys/types.h> #include <unistd.h> static inline int do_pidfd_send_signal(int pidfd, int sig, siginfo_t *info, unsigned int flags) { #ifdef __NR_pidfd_send_signal return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags); #else return -ENOSYS; #endif } int main(int argc, char *argv[]) { int fd, ret, saved_errno, sig; if (argc < 3) exit(EXIT_FAILURE); fd = open(argv[1], O_DIRECTORY | O_CLOEXEC); if (fd < 0) { printf("%s - Failed to open \"%s\"\n", strerror(errno), argv[1]); exit(EXIT_FAILURE); } sig = atoi(argv[2]); printf("Sending signal %d to process %s\n", sig, argv[1]); ret = do_pidfd_send_signal(fd, sig, NULL, 0); saved_errno = errno; close(fd); errno = saved_errno; if (ret < 0) { printf("%s - Failed to send signal %d to process %s\n", strerror(errno), sig, argv[1]); exit(EXIT_FAILURE); } exit(EXIT_SUCCESS); } /* Q&A * Given that it seems the same questions get asked again by people who are * late to the party it makes sense to add a Q&A section to the commit * message so it's hopefully easier to avoid duplicate threads. * * For the sake of progress please consider these arguments settled unless * there is a new point that desperately needs to be addressed. Please make * sure to check the links to the threads in this commit message whether * this has not already been covered. */ Q-01: (Florian Weimer [20], Andrew Morton [21]) What happens when the target process has exited? A-01: Sending the signal will fail with ESRCH (cf. [22]). Q-02: (Andrew Morton [21]) Is the task_struct pinned by the fd? A-02: No. A reference to struct pid is kept. struct pid - as far as I understand - was created exactly for the reason to not require to pin struct task_struct (cf. [22]). Q-03: (Andrew Morton [21]) Does the entire procfs directory remain visible? Just one entry within it? A-03: The same thing that happens right now when you hold a file descriptor to /proc/<pid> open (cf. [22]). Q-04: (Andrew Morton [21]) Does the pid remain reserved? A-04: No. This patchset guarantees a stable handle not that pids are not recycled (cf. [22]). Q-05: (Andrew Morton [21]) Do attempts to signal that fd return errors? A-05: See {Q,A}-01. Q-06: (Andrew Morton [22]) Is there a cleaner way of obtaining the fd? Another syscall perhaps. A-06: Userspace can already trivially retrieve file descriptors from procfs so this is something that we will need to support anyway. Hence, there's no immediate need to add another syscalls just to make pidfd_send_signal() not dependent on the presence of procfs. However, adding a syscalls to get such file descriptors is planned for a future patchset (cf. [22]). Q-07: (Andrew Morton [21] and others) This fd-for-a-process sounds like a handy thing and people may well think up other uses for it in the future, probably unrelated to signals. Are the code and the interface designed to permit such future applications? A-07: Yes (cf. [22]). Q-08: (Andrew Morton [21] and others) Now I think about it, why a new syscall? This thing is looking rather like an ioctl? A-08: This has been extensively discussed. It was agreed that a syscall is preferred for a variety or reasons. Here are just a few taken from prior threads. Syscalls are safer than ioctl()s especially when signaling to fds. Processes are a core kernel concept so a syscall seems more appropriate. The layout of the syscall with its four arguments would require the addition of a custom struct for the ioctl() thereby causing at least the same amount or even more complexity for userspace than a simple syscall. The new syscall will replace multiple other pid-based syscalls (see description above). The file-descriptors-for-processes concept introduced with this syscall will be extended with other syscalls in the future. See also [22], [23] and various other threads already linked in here. Q-09: (Florian Weimer [24]) What happens if you use the new interface with an O_PATH descriptor? A-09: pidfds opened as O_PATH fds cannot be used to send signals to a process (cf. [2]). Signaling processes through pidfds is the equivalent of writing to a file. Thus, this is not an operation that operates "purely at the file descriptor level" as required by the open(2) manpage. See also [4]. /* References */ [1]: https://lore.kernel.org/lkml/20181029221037.87724-1-dancol@google.com/ [2]: https://lore.kernel.org/lkml/874lbtjvtd.fsf@oldenburg2.str.redhat.com/ [3]: https://lore.kernel.org/lkml/20181204132604.aspfupwjgjx6fhva@brauner.io/ [4]: https://lore.kernel.org/lkml/20181203180224.fkvw4kajtbvru2ku@brauner.io/ [5]: https://lore.kernel.org/lkml/20181121213946.GA10795@mail.hallyn.com/ [6]: https://lore.kernel.org/lkml/20181120103111.etlqp7zop34v6nv4@brauner.io/ [7]: https://lore.kernel.org/lkml/36323361-90BD-41AF-AB5B-EE0D7BA02C21@amacapital.net/ [8]: https://lore.kernel.org/lkml/87tvjxp8pc.fsf@xmission.com/ [9]: https://asciinema.org/a/IQjuCHew6bnq1cr78yuMv16cy [11]: https://lore.kernel.org/lkml/F53D6D38-3521-4C20-9034-5AF447DF62FF@amacapital.net/ [12]: https://lore.kernel.org/lkml/87zhtjn8ck.fsf@xmission.com/ [13]: https://lore.kernel.org/lkml/871s6u9z6u.fsf@xmission.com/ [14]: https://lore.kernel.org/lkml/20181206231742.xxi4ghn24z4h2qki@brauner.io/ [15]: https://lore.kernel.org/lkml/20181207003124.GA11160@mail.hallyn.com/ [16]: https://lore.kernel.org/lkml/20181207015423.4miorx43l3qhppfz@brauner.io/ [17]: https://lore.kernel.org/lkml/CAGXu5jL8PciZAXvOvCeCU3wKUEB_dU-O3q0tDw4uB_ojMvDEew@mail.gmail.com/ [18]: https://lore.kernel.org/lkml/20181206222746.GB9224@mail.hallyn.com/ [19]: https://lore.kernel.org/lkml/20181208054059.19813-1-christian@brauner.io/ [20]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [21]: https://lore.kernel.org/lkml/20181228152012.dbf0508c2508138efc5f2bbe@linux-foundation.org/ [22]: https://lore.kernel.org/lkml/20181228233725.722tdfgijxcssg76@brauner.io/ [23]: https://lwn.net/Articles/773459/ [24]: https://lore.kernel.org/lkml/8736rebl9s.fsf@oldenburg.str.redhat.com/ [25]: https://lore.kernel.org/lkml/CAK8P3a0ej9NcJM8wXNPbcGUyOUZYX+VLoDFdbenW3s3114oQZw@mail.gmail.com/ Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Jann Horn <jannh@google.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Christian Brauner <christian@brauner.io> Reviewed-by: Tycho Andersen <tycho@tycho.ws> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Aleksa Sarai <cyphar@cyphar.com>
2019-02-28proc: Add fs_context support to procfsDavid Howells
Add fs_context support to procfs. Signed-off-by: David Howells <dhowells@redhat.com> cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28procfs: Move proc_fill_super() to fs/proc/root.cDavid Howells
Move proc_fill_super() to fs/proc/root.c as that's where the other superblock stuff is. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-21proc, oom: do not report alien mms when setting oom_score_adjMichal Hocko
Tetsuo has reported that creating a thousands of processes sharing MM without SIGHAND (aka alien threads) and setting /proc/<pid>/oom_score_adj will swamp the kernel log and takes ages [1] to finish. This is especially worrisome that all that printing is done under RCU lock and this can potentially trigger RCU stall or softlockup detector. The primary reason for the printk was to catch potential users who might depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj") but after more than 2 years without a single report I guess it is safe to simply remove the printk altogether. The next step should be moving oom_score_adj over to the mm struct and remove all the tasks crawling as suggested by [2] [1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp [2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20190212102129.26288-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-14Merge branch 'linus' into irq/coreThomas Gleixner
Pick up upstream changes to avoid conflicts for pending patches.
2019-02-12mm: proc: smaps_rollup: fix pss_locked calculationSandeep Patil
The 'pss_locked' field of smaps_rollup was being calculated incorrectly. It accumulated the current pss everytime a locked VMA was found. Fix that by adding to 'pss_locked' the same time as that of 'pss' if the vma being walked is locked. Link: http://lkml.kernel.org/r/20190203065425.14650-1-sspatil@android.com Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup") Signed-off-by: Sandeep Patil <sspatil@android.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Colascione <dancol@google.com> Cc: <stable@vger.kernel.org> [4.14.x, 4.19.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-11Merge tag 'v5.0-rc6' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-10proc/stat: Make the interrupt statistics more efficientThomas Gleixner
Waiman reported that on large systems with a large amount of interrupts the readout of /proc/stat takes a long time to sum up the interrupt statistics. In principle this is not a problem. but for unknown reasons some enterprise quality software reads /proc/stat with a high frequency. The reason for this is that interrupt statistics are accounted per cpu. So the /proc/stat logic has to sum up the interrupt stats for each interrupt. The interrupt core provides now a per interrupt summary counter which can be used to avoid the summation loops completely except for interrupts marked PER_CPU which are only a small fraction of the interrupt space if at all. Another simplification is to iterate only over the active interrupts and skip the potentially large gaps in the interrupt number space and just print zeros for the gaps without going into the interrupt core in the first place. Waiman provided test results from a 4-socket IvyBridge-EX system (60-core 120-thread, 3016 irqs) excuting a test program which reads /proc/stat 50,000 times: Before: 18.436s (sys 18.380s) After: 3.769s (sys 3.742s) Reported-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Cc: linux-fsdevel@vger.kernel.org Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Randy Dunlap <rdunlap@infradead.org> Link: https://lkml.kernel.org/r/20190208135021.013828701@linutronix.de
2019-02-07mm: make mm->pinned_vm an atomic64 counterDavidlohr Bueso
Taking a sleeping lock to _only_ increment a variable is quite the overkill, and pretty much all users do this. Furthermore, some drivers (ie: infiniband and scif) that need pinned semantics can go to quite some trouble to actually delay via workqueue (un)accounting for pinned pages when not possible to acquire it. By making the counter atomic we no longer need to hold the mmap_sem and can simply some code around it for pinned_vm users. The counter is 64-bit such that we need not worry about overflows such as rdma user input controlled from userspace. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-04sched/core: Convert sighand_struct.count to refcount_tElena Reshetova
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable sighand_struct.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. ** Important note for maintainers: Some functions from refcount_t API defined in lib/refcount.c have different memory ordering guarantees than their atomic counterparts. The full comparison can be seen in https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon in state to be merged to the documentation tree. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the sighand_struct.count it might make a difference in following places: - __cleanup_sighand: decrement in refcount_dec_and_test() only provides RELEASE ordering and control dependency on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: David Windsor <dwindsor@gmail.com> Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com> Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: viro@zeniv.linux.org.uk Link: https://lkml.kernel.org/r/1547814450-18902-2-git-send-email-elena.reshetova@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-02-01proc: fix /proc/net/* after setns(2)Alexey Dobriyan
/proc entries under /proc/net/* can't be cached into dcache because setns(2) can change current net namespace. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: avoid vim miscolorization] [adobriyan@gmail.com: write test, add dummy ->d_revalidate hook: necessary if /proc/net/* is pinned at setns time] Link: http://lkml.kernel.org/r/20190108192350.GA12034@avx2 Link: http://lkml.kernel.org/r/20190107162336.GA9239@avx2 Fixes: 1da4d377f943fe4194ffb9fb9c26cc58fad4dd24 ("proc: revalidate misc dentries") Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: Mateusz Stępień <mateusz.stepien@netrounds.com> Reported-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-25audit: move loginuid and sessionid from CONFIG_AUDITSYSCALL to CONFIG_AUDITRichard Guy Briggs
loginuid and sessionid (and audit_log_session_info) should be part of CONFIG_AUDIT scope and not CONFIG_AUDITSYSCALL since it is used in CONFIG_CHANGE, ANOM_LINK, FEATURE_CHANGE (and INTEGRITY_RULE), none of which are otherwise dependent on AUDITSYSCALL. Please see github issue https://github.com/linux-audit/audit-kernel/issues/104 Signed-off-by: Richard Guy Briggs <rgb@redhat.com> [PM: tweaked subject line for better grep'ing] Signed-off-by: Paul Moore <paul@paul-moore.com>
2019-01-08procfs: add smack subdir to attrsCasey Schaufler
Back in 2007 I made what turned out to be a rather serious mistake in the implementation of the Smack security module. The SELinux module used an interface in /proc to manipulate the security context on processes. Rather than use a similar interface, I used the same interface. The AppArmor team did likewise. Now /proc/.../attr/current will tell you the security "context" of the process, but it will be different depending on the security module you're using. This patch provides a subdirectory in /proc/.../attr for Smack. Smack user space can use the "current" file in this subdirectory and never have to worry about getting SELinux attributes by mistake. Programs that use the old interface will continue to work (or fail, as the case may be) as before. The proposed S.A.R.A security module is dependent on the mechanism to create its own attr subdirectory. The original implementation is by Kees Cook. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2019-01-04fs/proc/base.c: slightly faster /proc/*/limitsAlexey Dobriyan
Header of /proc/*/limits is a fixed string, so print it directly without formatting specifiers. Link: http://lkml.kernel.org/r/20181203164242.GB6904@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/proc/inode.c: delete unnecessary variable in proc_alloc_inode()Alexey Dobriyan
Link: http://lkml.kernel.org/r/20181203164015.GA6904@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/proc/util.c: include fs/proc/internal.h for name_to_int()Eric Biggers
name_to_int() is defined in fs/proc/util.c and declared in fs/proc/internal.h, but the declaration isn't included at the point of the definition. Include the header to enforce that the definition matches the declaration. This addresses a gcc warning when -Wmissing-prototypes is enabled. Link: http://lkml.kernel.org/r/20181115001833.49371-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/proc/base.c: use ns_capable instead of capable for timerslack_nsBenjamin Gordon
Access to timerslack_ns is controlled by a process having CAP_SYS_NICE in its effective capability set, but the current check looks in the root namespace instead of the process' user namespace. Since a process is allowed to do other activities controlled by CAP_SYS_NICE inside a namespace, it should also be able to adjust timerslack_ns. Link: http://lkml.kernel.org/r/20181030180012.232896-1-bmgordon@google.com Signed-off-by: Benjamin Gordon <bmgordon@google.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: John Stultz <john.stultz@linaro.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Todd Kjos <tkjos@google.com> Cc: Colin Cross <ccross@android.com> Cc: Nick Kralevich <nnk@google.com> Cc: Dmitry Shmidt <dimitrysh@google.com> Cc: Elliott Hughes <enh@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, proc: report PR_SET_THP_DISABLE in procMichal Hocko
David Rientjes has reported that commit 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") has changed the way how we report THPable VMAs to the userspace. Their monitoring tool is triggering false alarms on PR_SET_THP_DISABLE tasks because it considers an insufficient THP usage as a memory fragmentation resp. memory pressure issue. Before the said commit each newly created VMA inherited VM_NOHUGEPAGE flag and that got exposed to the userspace via /proc/<pid>/smaps file. This implementation had its downsides as explained in the commit message but it is true that the userspace doesn't have any means to query for the process wide THP enabled/disabled status. PR_SET_THP_DISABLE is a process wide flag so it makes a lot of sense to export in the process wide context rather than per-vma. Introduce a new field to /proc/<pid>/status which export this status. If PR_SET_THP_DISABLE is used then it reports false same as when the THP is not compiled in. It doesn't consider the global THP status because we already export that information via sysfs Link: http://lkml.kernel.org/r/20181211143641.3503-4-mhocko@kernel.org Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm, thp, proc: report THP eligibility for each vmaMichal Hocko
Userspace falls short when trying to find out whether a specific memory range is eligible for THP. There are usecases that would like to know that http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com : This is used to identify heap mappings that should be able to fault thp : but do not, and they normally point to a low-on-memory or fragmentation : issue. The only way to deduce this now is to query for hg resp. nh flags and confronting the state with the global setting. Except that there is also PR_SET_THP_DISABLE that might change the picture. So the final logic is not trivial. Moreover the eligibility of the vma depends on the type of VMA as well. In the past we have supported only anononymous memory VMAs but things have changed and shmem based vmas are supported as well these days and the query logic gets even more complicated because the eligibility depends on the mount option and another global configuration knob. Simplify the current state and report the THP eligibility in /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled for this purpose. The original implementation of this function assumes that the caller knows that the vma itself is supported for THP so make the core checks into __transparent_hugepage_enabled and use it for existing callers. __show_smap just use the new transparent_hugepage_enabled which also checks the vma support status (please note that this one has to be out of line due to include dependency issues). [mhocko@kernel.org: fix oops with NULL ->f_mapping] Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Oppenheimer <bepvte@gmail.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm/mmu_notifier: use structure for invalidate_range_start/end calls v2Jérôme Glisse
To avoid having to change many call sites everytime we want to add a parameter use a structure to group all parameters for the mmu_notifier invalidate_range_start/end cakks. No functional changes with this patch. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> From: Jérôme Glisse <jglisse@redhat.com> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28/proc/kpagecount: return 0 for special pages that are never mappedAnthony Yznaga
Certain pages that are never mapped to userspace have a type indicated in the page_type field of their struct pages (e.g. PG_buddy). page_type overlaps with _mapcount so set the count to 0 and avoid calling page_mapcount() for these pages. [anthony.yznaga@oracle.com: incorporate feedback from Matthew Wilcox] Link: http://lkml.kernel.org/r/1544481313-27318-1-git-send-email-anthony.yznaga@oracle.com Link: http://lkml.kernel.org/r/1543963526-27917-1-git-send-email-anthony.yznaga@oracle.com Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28mm: convert totalram_pages and totalhigh_pages variables to atomicArun KS
totalram_pages and totalhigh_pages are made static inline function. Main motivation was that managed_page_count_lock handling was complicating things. It was discussed in length here, https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seemes better to remove the lock and convert variables to atomic, with preventing poteintial store-to-read tearing as a bonus. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-13proc/sysctl: don't return ENOMEM on lookup when a table is unregisteringIvan Delalande
proc_sys_lookup can fail with ENOMEM instead of ENOENT when the corresponding sysctl table is being unregistered. In our case we see this upon opening /proc/sys/net/*/conf files while network interfaces are being deleted, which confuses our configuration daemon. The problem was successfully reproduced and this fix tested on v4.9.122 and v4.20-rc6. v2: return ERR_PTRs in all cases when proc_sys_make_inode fails instead of mixing them with NULL. Thanks Al Viro for the feedback. Fixes: ace0c791e6c3 ("proc/sysctl: Don't grab i_lock under sysctl_lock.") Cc: stable@vger.kernel.org Signed-off-by: Ivan Delalande <colona@arista.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-11-01Merge tag 'stackleak-v4.20-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull stackleak gcc plugin from Kees Cook: "Please pull this new GCC plugin, stackleak, for v4.20-rc1. This plugin was ported from grsecurity by Alexander Popov. It provides efficient stack content poisoning at syscall exit. This creates a defense against at least two classes of flaws: - Uninitialized stack usage. (We continue to work on improving the compiler to do this in other ways: e.g. unconditional zero init was proposed to GCC and Clang, and more plugin work has started too). - Stack content exposure. By greatly reducing the lifetime of valid stack contents, exposures via either direct read bugs or unknown cache side-channels become much more difficult to exploit. This complements the existing buddy and heap poisoning options, but provides the coverage for stacks. The x86 hooks are included in this series (which have been reviewed by Ingo, Dave Hansen, and Thomas Gleixner). The arm64 hooks have already been merged through the arm64 tree (written by Laura Abbott and reviewed by Mark Rutland and Will Deacon). With VLAs having been removed this release, there is no need for alloca() protection, so it has been removed from the plugin" * tag 'stackleak-v4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: arm64: Drop unneeded stackleak_check_alloca() stackleak: Allow runtime disabling of kernel stack erasing doc: self-protection: Add information about STACKLEAK feature fs/proc: Show STACKLEAK metrics in the /proc file system lkdtm: Add a test for STACKLEAK gcc-plugins: Add STACKLEAK plugin for tracking the kernel stack x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls
2018-10-31mm: remove include/linux/bootmem.hMike Rapoport
Move remaining definitions and declarations from include/linux/bootmem.h into include/linux/memblock.h and remove the redundant header. The includes were replaced with the semantic patch below and then semi-automated removal of duplicated '#include <linux/memblock.h> @@ @@ - #include <linux/bootmem.h> + #include <linux/memblock.h> [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h] Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal] Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Zankel <chris@zankel.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greentime Hu <green.hu@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Ingo Molnar <mingo@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ley Foon Tan <lftan@altera.com> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Serge Semin <fancer.lancer@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-31fs/proc/vmcore.c: Convert to use vmf_error()Souptick Joarder
This code can be replaced with vmf_error() inline function. Link: http://lkml.kernel.org/r/20180918145945.GA11392@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-28Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds
Pull XArray conversion from Matthew Wilcox: "The XArray provides an improved interface to the radix tree data structure, providing locking as part of the API, specifying GFP flags at allocation time, eliminating preloading, less re-walking the tree, more efficient iterations and not exposing RCU-protected pointers to its users. This patch set 1. Introduces the XArray implementation 2. Converts the pagecache to use it 3. Converts memremap to use it The page cache is the most complex and important user of the radix tree, so converting it was most important. Converting the memremap code removes the only other user of the multiorder code, which allows us to remove the radix tree code that supported it. I have 40+ followup patches to convert many other users of the radix tree over to the XArray, but I'd like to get this part in first. The other conversions haven't been in linux-next and aren't suitable for applying yet, but you can see them in the xarray-conv branch if you're interested" * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits) radix tree: Remove multiorder support radix tree test: Convert multiorder tests to XArray radix tree tests: Convert item_delete_rcu to XArray radix tree tests: Convert item_kill_tree to XArray radix tree tests: Move item_insert_order radix tree test suite: Remove multiorder benchmarking radix tree test suite: Remove __item_insert memremap: Convert to XArray xarray: Add range store functionality xarray: Move multiorder_check to in-kernel tests xarray: Move multiorder_shrink to kernel tests xarray: Move multiorder account test in-kernel radix tree test suite: Convert iteration test to XArray radix tree test suite: Convert tag_tagged_items to XArray radix tree: Remove radix_tree_clear_tags radix tree: Remove radix_tree_maybe_preload_order radix tree: Remove split/join code radix tree: Remove radix_tree_update_node_t page cache: Finish XArray conversion dax: Convert page fault handlers to XArray ...
2018-10-26mm: zero-seek shrinkersJohannes Weiner
The page cache and most shrinkable slab caches hold data that has been read from disk, but there are some caches that only cache CPU work, such as the dentry and inode caches of procfs and sysfs, as well as the subset of radix tree nodes that track non-resident page cache. Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for the shrinker's seeks setting tells the reclaim algorithm that for every two page cache pages scanned it should scan one slab object. This is a bogus setting. A virtual inode that required no IO to create is not twice as valuable as a page cache page; shadow cache entries with eviction distances beyond the size of memory aren't either. In most cases, the behavior in practice is still fine. Such virtual caches don't tend to grow and assert themselves aggressively, and usually get picked up before they cause problems. But there are scenarios where that's not true. Our database workloads suffer from two of those. For one, their file workingset is several times bigger than available memory, which has the kernel aggressively create shadow page cache entries for the non-resident parts of it. The workingset code does tell the VM that most of these are expendable, but the VM ends up balancing them 2:1 to cache pages as per the seeks setting. This is a huge waste of memory. These workloads also deal with tens of thousands of open files and use /proc for introspection, which ends up growing the proc_inode_cache to absurdly large sizes - again at the cost of valuable cache space, which isn't a reasonable trade-off, given that proc inodes can be re-created without involving the disk. This patch implements a "zero-seek" setting for shrinkers that results in a target ratio of 0:1 between their objects and IO-backed caches. This allows such virtual caches to grow when memory is available (they do cache/avoid CPU work after all), but effectively disables them as soon as IO-backed objects are under pressure. It then switches the shrinkers for procfs and sysfs metadata, as well as excess page cache shadow nodes, to the new zero-seek setting. Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Domas Mituzas <dmituzas@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOADJohannes Weiner
There are several definitions of those functions/macros in places that mess with fixed-point load averages. Provide an official version. [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c] Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Suren Baghdasaryan <surenb@google.com> Tested-by: Daniel Drake <drake@endlessm.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm, proc: add KReclaimable to /proc/meminfoVlastimil Babka
The vmstat NR_KERNEL_MISC_RECLAIMABLE counter is for kernel non-slab allocations that can be reclaimed via shrinker. In /proc/meminfo, we can show the sum of all reclaimable kernel allocations (including slab) as "KReclaimable". Add the same counter also to per-node meminfo under /sys With this counter, users will have more complete information about kernel memory usage. Non-slab reclaimable pages (currently just the ION allocator) will not be missing from /proc/meminfo, making users wonder where part of their memory went. More precisely, they already appear in MemAvailable, but without the new counter, it's not obvious why the value in MemAvailable doesn't fully correspond with the sum of other counters participating in it. Link: http://lkml.kernel.org/r/20180731090649.16028-6-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <guro@fb.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Vijayanand Jitta <vjitta@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26mm: /proc/pid/smaps_rollup: fix NULL pointer deref in smaps_pte_range()Vlastimil Babka
Leonardo reports an apparent regression in 4.19-rc7: BUG: unable to handle kernel NULL pointer dereference at 00000000000000f0 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 6032 Comm: python Not tainted 4.19.0-041900rc7-lowlatency #201810071631 Hardware name: LENOVO 80UG/Toronto 4A2, BIOS 0XCN45WW 08/09/2018 RIP: 0010:smaps_pte_range+0x32d/0x540 Code: 80 00 00 00 00 74 a9 48 89 de 41 f6 40 52 40 0f 85 04 02 00 00 49 2b 30 48 c1 ee 0c 49 03 b0 98 00 00 00 49 8b 80 a0 00 00 00 <48> 8b b8 f0 00 00 00 e8 b7 ef ec ff 48 85 c0 0f 84 71 ff ff ff a8 RSP: 0018:ffffb0cbc484fb88 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000560ddb9e9000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000560ddb9e9 RDI: 0000000000000001 RBP: ffffb0cbc484fbc0 R08: ffff94a5a227a578 R09: ffff94a5a227a578 R10: 0000000000000000 R11: 0000560ddbbe7000 R12: ffffe903098ba728 R13: ffffb0cbc484fc78 R14: ffffb0cbc484fcf8 R15: ffff94a5a2e9cf48 FS: 00007f6dfb683740(0000) GS:ffff94a5aaf80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000f0 CR3: 000000011c118001 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __walk_page_range+0x3c2/0x6f0 walk_page_vma+0x42/0x60 smap_gather_stats+0x79/0xe0 ? gather_pte_stats+0x320/0x320 ? gather_hugetlb_stats+0x70/0x70 show_smaps_rollup+0xcd/0x1c0 seq_read+0x157/0x400 __vfs_read+0x3a/0x180 ? security_file_permission+0x93/0xc0 ? security_file_permission+0x93/0xc0 vfs_read+0x8f/0x140 ksys_read+0x55/0xc0 __x64_sys_read+0x1a/0x20 do_syscall_64+0x5a/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Decoded code matched to local compilation+disassembly points to smaps_pte_entry(): } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap && pte_none(*pte))) { page = find_get_entry(vma->vm_file->f_mapping, linear_page_index(vma, addr)); Here, vma->vm_file is NULL. mss->check_shmem_swap should be false in that case, however for smaps_rollup, smap_gather_stats() can set the flag true for one vma and leave it true for subsequent vma's where it should be false. To fix, reset the check_shmem_swap flag to false. There's also related bug which sets mss->swap to shmem_swapped, which in the context of smaps_rollup overwrites any value accumulated from previous vma's. Fix that as well. Note that the report suggests a regression between 4.17.19 and 4.19-rc7, which makes the 4.19 series ending with commit 258f669e7e88 ("mm: /proc/pid/smaps_rollup: convert to single value seq_file") suspicious. But the mss was reused for rollup since 493b0e9d945f ("mm: add /proc/pid/smaps_rollup") so let's play it safe with the stable backport. Link: http://lkml.kernel.org/r/555fbd1f-4ac9-0b58-dcd4-5dc4380ff7ca@suse.cz Link: https://bugzilla.kernel.org/show_bug.cgi?id=201377 Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Leonardo Soares Müller <leozinho29_eu@hotmail.com> Tested-by: Leonardo Soares Müller <leozinho29_eu@hotmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Daniel Colascione <dancol@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-23Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "Lots of changes in this cycle: - Lots of CPA (change page attribute) optimizations and related cleanups (Thomas Gleixner, Peter Zijstra) - Make lazy TLB mode even lazier (Rik van Riel) - Fault handler cleanups and improvements (Dave Hansen) - kdump, vmcore: Enable kdumping encrypted memory with AMD SME enabled (Lianbo Jiang) - Clean up VM layout documentation (Baoquan He, Ingo Molnar) - ... plus misc other fixes and enhancements" * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry() x86/mm: Kill stray kernel fault handling comment x86/mm: Do not warn about PCI BIOS W+X mappings resource: Clean it up a bit resource: Fix find_next_iomem_res() iteration issue resource: Include resource end in walk_*() interfaces x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error x86/mm: Remove spurious fault pkey check x86/mm/vsyscall: Consider vsyscall page part of user address space x86/mm: Add vsyscall address helper x86/mm: Fix exception table comments x86/mm: Add clarifying comments for user addr space x86/mm: Break out user address space handling x86/mm: Break out kernel address space handling x86/mm: Clarify hardware vs. software "error_code" x86/mm/tlb: Make lazy TLB mode lazier x86/mm/tlb: Add freed_tables element to flush_tlb_info x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range smp,cpumask: introduce on_each_cpu_cond_mask smp: use __cpumask_set_cpu in on_each_cpu_cond ...
2018-10-09proc/vmcore: Fix i386 build error of missing copy_oldmem_page_encrypted()Borislav Petkov
Lianbo reported a build error with a particular 32-bit config, see Link below for details. Provide a weak copy_oldmem_page_encrypted() function which architectures can override, in the same manner other functionality in that file is supplied. Reported-by: Lianbo Jiang <lijiang@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> CC: x86@kernel.org Link: http://lkml.kernel.org/r/710b9d95-2f70-eadf-c4a1-c3dc80ee4ebb@redhat.com
2018-10-06kdump, proc/vmcore: Enable kdumping encrypted memory with SME enabledLianbo Jiang
In the kdump kernel, the memory of the first kernel needs to be dumped into the vmcore file. If SME is enabled in the first kernel, the old memory has to be remapped with the memory encryption mask in order to access it properly. Split copy_oldmem_page() functionality to handle encrypted memory properly. [ bp: Heavily massage everything. ] Signed-off-by: Lianbo Jiang <lijiang@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: kexec@lists.infradead.org Cc: tglx@linutronix.de Cc: mingo@redhat.com Cc: hpa@zytor.com Cc: akpm@linux-foundation.org Cc: dan.j.williams@intel.com Cc: bhelgaas@google.com Cc: baiyaowei@cmss.chinamobile.com Cc: tiwai@suse.de Cc: brijesh.singh@amd.com Cc: dyoung@redhat.com Cc: bhe@redhat.com Cc: jroedel@suse.de Link: https://lkml.kernel.org/r/be7b47f9-6be6-e0d1-2c2a-9125bc74b818@redhat.com
2018-10-05proc: restrict kernel stack dumps to rootJann Horn
Currently, you can use /proc/self/task/*/stack to cause a stack walk on a task you control while it is running on another CPU. That means that the stack can change under the stack walker. The stack walker does have guards against going completely off the rails and into random kernel memory, but it can interpret random data from your kernel stack as instruction pointers and stack pointers. This can cause exposure of kernel stack contents to userspace. Restrict the ability to inspect kernel stacks of arbitrary tasks to root in order to prevent a local attacker from exploiting racy stack unwinding to leak kernel task stack contents. See the added comment for a longer rationale. There don't seem to be any users of this userspace API that can't gracefully bail out if reading from the file fails. Therefore, I believe that this change is unlikely to break things. In the case that this patch does end up needing a revert, the next-best solution might be to fake a single-entry stack based on wchan. Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com Fixes: 2ec220e27f50 ("proc: add /proc/*/stack") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ken Chen <kenchen@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-29xarray: Replace exceptional entriesMatthew Wilcox
Introduce xarray value entries and tagged pointers to replace radix tree exceptional entries. This is a slight change in encoding to allow the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a value entry). It is also a change in emphasis; exceptional entries are intimidating and different. As the comment explains, you can choose to store values or pointers in the xarray and they are both first-class citizens. Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com>
2018-09-20fs/proc/kcore.c: fix invalid memory access in multi-page read optimizationDominique Martinet
The 'm' kcore_list item could point to kclist_head, and it is incorrect to look at m->addr / m->size in this case. There is no choice but to run through the list of entries for every address if we did not find any entry in the previous iteration Reset 'm' to NULL in that case at Omar Sandoval's suggestion. [akpm@linux-foundation.org: add comment] Link: http://lkml.kernel.org/r/1536100702-28706-1-git-send-email-asmadeus@codewreck.org Fixes: bf991c2231117 ("proc/kcore: optimize multiple page reads") Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Omar Sandoval <osandov@osandov.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: James Morse <james.morse@arm.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-04fs/proc: Show STACKLEAK metrics in the /proc file systemAlexander Popov
Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about tasks via the /proc file system. In particular, /proc/<pid>/stack_depth shows the maximum kernel stack consumption for the current and previous syscalls. Although this information is not precise, it can be useful for estimating the STACKLEAK performance impact for your workloads. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Alexander Popov <alex.popov@linux.com> Tested-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2018-08-26Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Thomas Gleixner: "Kernel: - Improve kallsyms coverage - Add x86 entry trampolines to kcore - Fix ARM SPE handling - Correct PPC event post processing Tools: - Make the build system more robust - Small fixes and enhancements all over the place - Update kernel ABI header copies - Preparatory work for converting libtraceevnt to a shared library - License cleanups" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy' tools arch x86: Update tools's copy of cpufeatures.h perf python: Fix pyrf_evlist__read_on_cpu() interface perf mmap: Store real cpu number in 'struct perf_mmap' perf tools: Remove ext from struct kmod_path perf tools: Add gzip_is_compressed function perf tools: Add lzma_is_compressed function perf tools: Add is_compressed callback to compressions array perf tools: Move the temp file processing into decompress_kmodule perf tools: Use compression id in decompress_kmodule() perf tools: Store compression id into struct dso perf tools: Add compression id into 'struct kmod_path' perf tools: Make is_supported_compression() static perf tools: Make decompress_to_file() function static perf tools: Get rid of dso__needs_decompress() call in __open_dso() perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble() perf tools: Get rid of dso__needs_decompress() call in read_object_code() tools lib traceevent: Change to SPDX License format perf llvm: Allow passing options to llc in addition to clang perf parser: Improve error message for PMU address filters ...
2018-08-23fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu buildsArnd Bergmann
Without CONFIG_MMU, we get a build warning: fs/proc/vmcore.c:228:12: error: 'vmcoredd_mmap_dumps' defined but not used [-Werror=unused-function] static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst, The function is only referenced from an #ifdef'ed caller, so this uses the same #ifdef around it. Link: http://lkml.kernel.org/r/20180525213526.2117790-1-arnd@arndb.de Fixes: 7efe48df8a3d ("vmcore: append device dumps to vmcore as elf notes") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Ganesh Goudar <ganeshgr@chelsio.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22proc/kcore: add vmcoreinfo note to /proc/kcoreOmar Sandoval
The vmcoreinfo information is useful for runtime debugging tools, not just for crash dumps. A lot of this information can be determined by other means, but this is much more convenient, and it only adds a page at most to the file. Link: http://lkml.kernel.org/r/fddbcd08eed76344863303878b12de1c1e2a04b6.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22proc/kcore: optimize multiple page readsOmar Sandoval
The current code does a full search of the segment list every time for every page. This is wasteful, since it's almost certain that the next page will be in the same segment. Instead, check if the previous segment covers the current page before doing the list search. Link: http://lkml.kernel.org/r/fd346c11090cf93d867e01b8d73a6567c5ac6361.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22proc/kcore: clean up ELF header generationOmar Sandoval
Currently, the ELF file header, program headers, and note segment are allocated all at once, in some icky code dating back to 2.3. Programs tend to read the file header, then the program headers, then the note segment, all separately, so this is a waste of effort. It's cleaner and more efficient to handle the three separately. Link: http://lkml.kernel.org/r/19c92cbad0e11f6103ff3274b2e7a7e51a1eb74b.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22proc/kcore: hold lock during readOmar Sandoval
Now that we're using an rwsem, we can hold it during the entirety of read_kcore() and have a common return path. This is preparation for the next change. [akpm@linux-foundation.org: fix locking bug reported by Tetsuo Handa] Link: http://lkml.kernel.org/r/d7cfbc1e8a76616f3b699eaff9df0a2730380534.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: James Morse <james.morse@arm.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22proc/kcore: fix memory hotplug vs multiple opens raceOmar Sandoval
There's a theoretical race condition that will cause /proc/kcore to miss a memory hotplug event: CPU0 CPU1 // hotplug event 1 kcore_need_update = 1 open_kcore() open_kcore() kcore_update_ram() kcore_update_ram() // Walk RAM // Walk RAM __kcore_update_ram() __kcore_update_ram() kcore_need_update = 0 // hotplug event 2 kcore_need_update = 1 kcore_need_update = 0 Note that CPU1 set up the RAM kcore entries with the state after hotplug event 1 but cleared the flag for hotplug event 2. The RAM entries will therefore be stale until there is another hotplug event. This is an extremely unlikely sequence of events, but the fix makes the synchronization saner, anyways: we serialize the entire update sequence, which means that whoever clears the flag will always succeed in replacing the kcore list. Link: http://lkml.kernel.org/r/6106c509998779730c12400c1b996425df7d7089.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22proc/kcore: replace kclist_lock rwlock with rwsemOmar Sandoval
Now we only need kclist_lock from user context and at fs init time, and the following changes need to sleep while holding the kclist_lock. Link: http://lkml.kernel.org/r/521ba449ebe921d905177410fee9222d07882f0d.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22proc/kcore: don't grab lock for memory hotplug notifierOmar Sandoval
The memory hotplug notifier kcore_callback() only needs kclist_lock to prevent races with __kcore_update_ram(), but we can easily eliminate that race by using an atomic xchg() in __kcore_update_ram(). This is preparation for converting kclist_lock to an rwsem. Link: http://lkml.kernel.org/r/0a4bc89f4dbde8b5b2ea309f7b4fb6a85fe29df2.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22proc/kcore: don't grab lock for kclist_add()Omar Sandoval
Patch series "/proc/kcore improvements", v4. This series makes a few improvements to /proc/kcore. It fixes a couple of small issues in v3 but is otherwise the same. Patches 1, 2, and 3 are prep patches. Patch 4 is a fix/cleanup. Patch 5 is another prep patch. Patches 6 and 7 are optimizations to ->read(). Patch 8 makes it possible to enable CRASH_CORE on any architecture, which is needed for patch 9. Patch 9 adds vmcoreinfo to /proc/kcore. This patch (of 9): kclist_add() is only called at init time, so there's no point in grabbing any locks. We're also going to replace the rwlock with a rwsem, which we don't want to try grabbing during early boot. While we're here, mark kclist_add() with __init so that we'll get a warning if it's called from non-init code. Link: http://lkml.kernel.org/r/98208db1faf167aa8b08eebfa968d95c70527739.1531953780.git.osandov@fb.com Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com> Tested-by: Bhupesh Sharma <bhsharma@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>