summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2021-06-27Revert "signal: Allow tasks to cache one sigqueue struct"Linus Torvalds
This reverts commits 4bad58ebc8bc4f20d89cff95417c9b4674769709 (and 399f8dd9a866e107639eabd3c1979cd526ca3a98, which tried to fix it). I do not believe these are correct, and I'm about to release 5.13, so am reverting them out of an abundance of caution. The locking is odd, and appears broken. On the allocation side (in __sigqueue_alloc()), the locking is somewhat straightforward: it depends on sighand->siglock. Since one caller doesn't hold that lock, it further then tests 'sigqueue_flags' to avoid the case with no locks held. On the freeing side (in sigqueue_cache_or_free()), there is no locking at all, and the logic instead depends on 'current' being a single thread, and not able to race with itself. To make things more exciting, there's also the data race between freeing a signal and allocating one, which is handled by using WRITE_ONCE() and READ_ONCE(), and being mutually exclusive wrt the initial state (ie freeing will only free if the old state was NULL, while allocating will obviously only use the value if it was non-NULL, so only one or the other will actually act on the value). However, while the free->alloc paths do seem mutually exclusive thanks to just the data value dependency, it's not clear what the memory ordering constraints are on it. Could writes from the previous allocation possibly be delayed and seen by the new allocation later, causing logical inconsistencies? So it's all very exciting and unusual. And in particular, it seems that the freeing side is incorrect in depending on "current" being single-threaded. Yes, 'current' is a single thread, but in the presense of asynchronous events even a single thread can have data races. And such asynchronous events can and do happen, with interrupts causing signals to be flushed and thus free'd (for example - sending a SIGCONT/SIGSTOP can happen from interrupt context, and can flush previously queued process control signals). So regardless of all the other questions about the memory ordering and locking for this new cached allocation, the sigqueue_cache_or_free() assumptions seem to be fundamentally incorrect. It may be that people will show me the errors of my ways, and tell me why this is all safe after all. We can reinstate it if so. But my current belief is that the WRITE_ONCE() that sets the cached entry needs to be a smp_store_release(), and the READ_ONCE() that finds a cached entry needs to be a smp_load_acquire() to handle memory ordering correctly. And the sequence in sigqueue_cache_or_free() would need to either use a lock or at least be interrupt-safe some way (perhaps by using something like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the percpu operations it needs to be interrupt-safe). Fixes: 399f8dd9a866 ("signal: Prevent sigqueue caching after task got released") Fixes: 4bad58ebc8bc ("signal: Allow tasks to cache one sigqueue struct") Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-25Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "24 patches, based on 4a09d388f2ab382f217a764e6a152b3f614246f6. Subsystems affected by this patch series: mm (thp, vmalloc, hugetlb, memory-failure, and pagealloc), nilfs2, kthread, MAINTAINERS, and mailmap" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (24 commits) mailmap: add Marek's other e-mail address and identity without diacritics MAINTAINERS: fix Marek's identity again mm/page_alloc: do bulk array bounds check after checking populated elements mm/page_alloc: __alloc_pages_bulk(): do bounds check before accessing array mm/hwpoison: do not lock page again when me_huge_page() successfully recovers mm,hwpoison: return -EHWPOISON to denote that the page has already been poisoned mm/memory-failure: use a mutex to avoid memory_failure() races mm, futex: fix shared futex pgoff on shmem huge page kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() kthread_worker: split code for canceling the delayed work timer mm/vmalloc: unbreak kasan vmalloc support KVM: s390: prepare for hugepage vmalloc mm/vmalloc: add vmalloc_no_huge nilfs2: fix memory leak in nilfs_sysfs_delete_device_group mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes mm: page_vma_mapped_walk(): get vma_address_end() earlier mm: page_vma_mapped_walk(): use goto instead of while (1) mm: page_vma_mapped_walk(): add a level of indentation mm: page_vma_mapped_walk(): crossing page table boundary ...
2021-06-25Merge tag 'ceph-for-5.13-rc8' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "Two regression fixes from the merge window: one in the auth code affecting old clusters and one in the filesystem for proper propagation of MDS request errors. Also included a locking fix for async creates, marked for stable" * tag 'ceph-for-5.13-rc8' of https://github.com/ceph/ceph-client: libceph: set global_id as soon as we get an auth ticket libceph: don't pass result into ac->ops->handle_reply() ceph: fix error handling in ceph_atomic_open and ceph_lookup ceph: must hold snap_rwsem when filling inode for async create
2021-06-24mm, futex: fix shared futex pgoff on shmem huge pageHugh Dickins
If more than one futex is placed on a shmem huge page, it can happen that waking the second wakes the first instead, and leaves the second waiting: the key's shared.pgoff is wrong. When 3.11 commit 13d60f4b6ab5 ("futex: Take hugepages into account when generating futex_key"), the only shared huge pages came from hugetlbfs, and the code added to deal with its exceptional page->index was put into hugetlb source. Then that was missed when 4.8 added shmem huge pages. page_to_pgoff() is what others use for this nowadays: except that, as currently written, it gives the right answer on hugetlbfs head, but nonsense on hugetlbfs tails. Fix that by calling hugetlbfs-specific hugetlb_basepage_index() on PageHuge tails as well as on head. Yes, it's unconventional to declare hugetlb_basepage_index() there in pagemap.h, rather than in hugetlb.h; but I do not expect anything but page_to_pgoff() ever to need it. [akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope] Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com Fixes: 800d8c63b2e9 ("shmem: add huge pages support") Reported-by: Neel Natu <neelnatu@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Zhang Yi <wetpzy@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <dvhart@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24mm/vmalloc: add vmalloc_no_hugeClaudio Imbrenda
Patch series "mm: add vmalloc_no_huge and use it", v4. Add vmalloc_no_huge() and export it, so modules can allocate memory with small pages. Use the newly added vmalloc_no_huge() in KVM on s390 to get around a hardware limitation. This patch (of 2): Commit 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") added support for hugepage vmalloc mappings, it also added the flag VM_NO_HUGE_VMAP for __vmalloc_node_range to request the allocation to be performed with 0-order non-huge pages. This flag is not accessible when calling vmalloc, the only option is to call directly __vmalloc_node_range, which is not exported. This means that a module can't vmalloc memory with small pages. Case in point: KVM on s390x needs to vmalloc a large area, and it needs to be mapped with non-huge pages, because of a hardware limitation. This patch adds the function vmalloc_no_huge, which works like vmalloc, but it is guaranteed to always back the mapping using small pages. This new function is exported, therefore it is usable by modules. [akpm@linux-foundation.org: whitespace fixes, per Christoph] Link: https://lkml.kernel.org/r/20210614132357.10202-1-imbrenda@linux.ibm.com Link: https://lkml.kernel.org/r/20210614132357.10202-2-imbrenda@linux.ibm.com Fixes: 121e6f3258fe3 ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-24libceph: set global_id as soon as we get an auth ticketIlya Dryomov
Commit 61ca49a9105f ("libceph: don't set global_id until we get an auth ticket") delayed the setting of global_id too much. It is set only after all tickets are received, but in pre-nautilus clusters an auth ticket and the service tickets are obtained in separate steps (for a total of three MAuth replies). When the service tickets are requested, global_id is used to build an authorizer; if global_id is still 0 we never get them and fail to establish the session. Moving the setting of global_id into protocol implementations. This way global_id can be set exactly when an auth ticket is received, not sooner nor later. Fixes: 61ca49a9105f ("libceph: don't set global_id until we get an auth ticket") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2021-06-24libceph: don't pass result into ac->ops->handle_reply()Ilya Dryomov
There is no result to pass in msgr2 case because authentication failures are reported through auth_bad_method frame and in MAuth case an error is returned immediately. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2021-06-22locking/lockdep: Improve noinstr vs errorsPeter Zijlstra
Better handle the failure paths. vmlinux.o: warning: objtool: debug_locks_off()+0x23: call to console_verbose() leaves .noinstr.text section vmlinux.o: warning: objtool: debug_locks_off()+0x19: call to __kasan_check_write() leaves .noinstr.text section debug_locks_off+0x19/0x40: instrument_atomic_write at include/linux/instrumented.h:86 (inlined by) __debug_locks_off at include/linux/debug_locks.h:17 (inlined by) debug_locks_off at lib/debug_locks.c:41 Fixes: 6eebad1ad303 ("lockdep: __always_inline more for noinstr") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210621120120.784404944@infradead.org
2021-06-18Merge tag 'net-5.13-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.13-rc7, including fixes from wireless, bpf, bluetooth, netfilter and can. Current release - regressions: - mlxsw: spectrum_qdisc: Pass handle, not band number to find_class() to fix modifying offloaded qdiscs - lantiq: net: fix duplicated skb in rx descriptor ring - rtnetlink: fix regression in bridge VLAN configuration, empty info is not an error, bot-generated "fix" was not needed - libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem creation Current release - new code bugs: - ethtool: fix NULL pointer dereference during module EEPROM dump via the new netlink API - mlx5e: don't update netdev RQs with PTP-RQ, the special purpose queue should not be visible to the stack - mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs - mlx5e: verify dev is present in get devlink port ndo, avoid a panic Previous releases - regressions: - neighbour: allow NUD_NOARP entries to be force GCed - further fixes for fallout from reorg of WiFi locking (staging: rtl8723bs, mac80211, cfg80211) - skbuff: fix incorrect msg_zerocopy copy notifications - mac80211: fix NULL ptr deref for injected rate info - Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs Previous releases - always broken: - bpf: more speculative execution fixes - netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local - udp: fix race between close() and udp_abort() resulting in a panic - fix out of bounds when parsing TCP options before packets are validated (in netfilter: synproxy, tc: sch_cake and mptcp) - mptcp: improve operation under memory pressure, add missing wake-ups - mptcp: fix double-lock/soft lookup in subflow_error_report() - bridge: fix races (null pointer deref and UAF) in vlan tunnel egress - ena: fix DMA mapping function issues in XDP - rds: fix memory leak in rds_recvmsg Misc: - vrf: allow larger MTUs - icmp: don't send out ICMP messages with a source address of 0.0.0.0 - cdc_ncm: switch to eth%d interface naming" * tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (139 commits) net: ethernet: fix potential use-after-free in ec_bhf_remove selftests/net: Add icmp.sh for testing ICMP dummy address responses icmp: don't send out ICMP messages with a source address of 0.0.0.0 net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY net: ll_temac: Fix TX BD buffer overwrite net: ll_temac: Add memory-barriers for TX BD access net: ll_temac: Make sure to free skb when it is completely used MAINTAINERS: add Guvenc as SMC maintainer bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path bnxt_en: Fix TQM fastpath ring backing store computation bnxt_en: Rediscover PHY capabilities after firmware reset cxgb4: fix wrong shift. mac80211: handle various extensible elements correctly mac80211: reset profile_periodicity/ema_ap cfg80211: avoid double free of PMSR request cfg80211: make certificate generation more robust mac80211: minstrel_ht: fix sample time check net: qed: Fix memcpy() overflow of qed_dcbx_params() net: cdc_eem: fix tx fixup skb leak net: hamradio: fix memory leak in mkiss_close ...
2021-06-18Merge tag 'pm-5.13-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Remove recently added frequency invariance support from the CPPC cpufreq driver, because it has turned out to be problematic and it cannot be fixed properly on time for 5.13 (Viresh Kumar)" * tag 'pm-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: Revert "cpufreq: CPPC: Add support for frequency invariance"
2021-06-16net/mlx5e: Don't create devices during unload flowDmytro Linkin
Running devlink reload command for port in switchdev mode cause resources to corrupt: driver can't release allocated EQ and reclaim memory pages, because "rdma" auxiliary device had add CQs which blocks EQ from deletion. Erroneous sequence happens during reload-down phase, and is following: 1. detach device - suspends auxiliary devices which support it, destroys others. During this step "eth-rep" and "rdma-rep" are destroyed, "eth" - suspended. 2. disable SRIOV - moves device to legacy mode; as part of disablement - rescans drivers. This step adds "rdma" auxiliary device. 3. destroy EQ table - <failure>. Driver shouldn't create any device during unload flows. To handle that implement MLX5_PRIV_FLAGS_DETACH flag, set it on device detach and unset on device attach. If flag is set do no-op on drivers rescan. Fixes: a925b5e309c9 ("net/mlx5: Register mlx5 devices to auxiliary virtual bus") Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-16mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()Hugh Dickins
There is a race between THP unmapping and truncation, when truncate sees pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it, but before its page_remove_rmap() gets to decrement compound_mapcount: generating false "BUG: Bad page cache" reports that the page is still mapped when deleted. This commit fixes that, but not in the way I hoped. The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK) instead of unmap_mapping_range() in truncate_cleanup_page(): it has often been an annoyance that we usually call unmap_mapping_range() with no pages locked, but there apply it to a single locked page. try_to_unmap() looks more suitable for a single locked page. However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page): it is used to insert THP migration entries, but not used to unmap THPs. Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB needs are different, I'm too ignorant of the DAX cases, and couldn't decide how far to go for anon+swap. Set that aside. The second attempt took a different tack: make no change in truncate.c, but modify zap_huge_pmd() to insert an invalidated huge pmd instead of clearing it initially, then pmd_clear() between page_remove_rmap() and unlocking at the end. Nice. But powerpc blows that approach out of the water, with its serialize_against_pte_lookup(), and interesting pgtable usage. It would need serious help to get working on powerpc (with a minor optimization issue on s390 too). Set that aside. Just add an "if (page_mapped(page)) synchronize_rcu();" or other such delay, after unmapping in truncate_cleanup_page()? Perhaps, but though that's likely to reduce or eliminate the number of incidents, it would give less assurance of whether we had identified the problem correctly. This successful iteration introduces "unmap_mapping_page(page)" instead of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route, with an addition to details. Then zap_pmd_range() watches for this case, and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk() now does in the PVMW_SYNC case. Not pretty, but safe. Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to assert its interface; but currently that's only used to make sure that page->mapping is stable, and zap_pmd_range() doesn't care if the page is locked or not. Along these lines, in invalidate_inode_pages2_range() move the initial unmap_mapping_range() out from under page lock, before then calling unmap_mapping_page() under page lock if still mapped. Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com Fixes: fc127da085c2 ("truncate: handle file thp") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16mm/thp: try_to_unmap() use TTU_SYNC for safe splittingHugh Dickins
Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE (!unmap_success): with dump_page() showing mapcount:1, but then its raw struct page output showing _mapcount ffffffff i.e. mapcount 0. And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed, it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)), and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG(): all indicative of some mapcount difficulty in development here perhaps. But the !CONFIG_DEBUG_VM path handles the failures correctly and silently. I believe the problem is that once a racing unmap has cleared pte or pmd, try_to_unmap_one() may skip taking the page table lock, and emerge from try_to_unmap() before the racing task has reached decrementing mapcount. Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding TTU_SYNC to the options, and passing that from unmap_page(). When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same for both: the slight overhead added should rarely matter, except perhaps if splitting sparsely-populated multiply-mapped shmem. Once confident that bugs are fixed, TTU_SYNC here can be removed, and the race tolerated. Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16mm/thp: make is_huge_zero_pmd() safe and quickerHugh Dickins
Most callers of is_huge_zero_pmd() supply a pmd already verified present; but a few (notably zap_huge_pmd()) do not - it might be a pmd migration entry, in which the pfn is encoded differently from a present pmd: which might pass the is_huge_zero_pmd() test (though not on x86, since L1TF forced us to protect against that); or perhaps even crash in pmd_page() applied to a swap-like entry. Make it safe by adding pmd_present() check into is_huge_zero_pmd() itself; and make it quicker by saving huge_zero_pfn, so that is_huge_zero_pmd() will not need to do that pmd_page() lookup each time. __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked, but is unnecessary now that is_huge_zero_pmd() checks present. Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Jue Wang <juew@google.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Wang Yugui <wangyugui@e16-tech.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16mm/hugetlb: expand restore_reserve_on_error functionalityMike Kravetz
The routine restore_reserve_on_error is called to restore reservation information when an error occurs after page allocation. The routine alloc_huge_page modifies the mapping reserve map and potentially the reserve count during allocation. If code calling alloc_huge_page encounters an error after allocation and needs to free the page, the reservation information needs to be adjusted. Currently, restore_reserve_on_error only takes action on pages for which the reserve count was adjusted(HPageRestoreReserve flag). There is nothing wrong with these adjustments. However, alloc_huge_page ALWAYS modifies the reserve map during allocation even if the reserve count is not adjusted. This can cause issues as observed during development of this patch [1]. One specific series of operations causing an issue is: - Create a shared hugetlb mapping Reservations for all pages created by default - Fault in a page in the mapping Reservation exists so reservation count is decremented - Punch a hole in the file/mapping at index previously faulted Reservation and any associated pages will be removed - Allocate a page to fill the hole No reservation entry, so reserve count unmodified Reservation entry added to map by alloc_huge_page - Error after allocation and before instantiating the page Reservation entry remains in map - Allocate a page to fill the hole Reservation entry exists, so decrement reservation count This will cause a reservation count underflow as the reservation count was decremented twice for the same index. A user would observe a very large number for HugePages_Rsvd in /proc/meminfo. This would also likely cause subsequent allocations of hugetlb pages to fail as it would 'appear' that all pages are reserved. This sequence of operations is unlikely to happen, however they were easily reproduced and observed using hacked up code as described in [1]. Address the issue by having the routine restore_reserve_on_error take action on pages where HPageRestoreReserve is not set. In this case, we need to remove any reserve map entry created by alloc_huge_page. A new helper routine vma_del_reservation assists with this operation. There are three callers of alloc_huge_page which do not currently call restore_reserve_on error before freeing a page on error paths. Add those missing calls. [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/ Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com Fixes: 96b96a96ddee ("mm/hugetlb: fix huge page reservation leak in private mapping error paths" Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16mm/swap: fix pte_same_as_swp() not removing uffd-wp bit when comparePeter Xu
I found it by pure code review, that pte_same_as_swp() of unuse_vma() didn't take uffd-wp bit into account when comparing ptes. pte_same_as_swp() returning false negative could cause failure to swapoff swap ptes that was wr-protected by userfaultfd. Link: https://lkml.kernel.org/r/20210603180546.9083-1-peterx@redhat.com Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration") Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [5.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-16mm,hwpoison: fix race with hugetlb page allocationNaoya Horiguchi
When hugetlb page fault (under overcommitting situation) and memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race: CPU0: CPU1: gather_surplus_pages() page = alloc_surplus_huge_page() memory_failure_hugetlb() get_hwpoison_page(page) __get_hwpoison_page(page) get_page_unless_zero(page) zero = put_page_testzero(page) VM_BUG_ON_PAGE(!zero, page) enqueue_huge_page(h, page) put_page(page) __get_hwpoison_page() only checks the page refcount before taking an additional one for memory error handling, which is not enough because there's a time window where compound pages have non-zero refcount during hugetlb page initialization. So make __get_hwpoison_page() check page status a bit more for hugetlb pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags under hugetlb_lock makes sure that the hugetlb page is not transitive. It's notable that another new function, HWPoisonHandlable(), is helpful to prevent a race against other transitive page states (like a generic compound page just before PageHuge becomes true). Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-15ptp: improve max_adj check against unreasonable valuesJakub Kicinski
Scaled PPM conversion to PPB may (on 64bit systems) result in a value larger than s32 can hold (freq/scaled_ppm is a long). This means the kernel will not correctly reject unreasonably high ->freq values (e.g. > 4294967295ppb, 281474976645 scaled PPM). The conversion is equivalent to a division by ~66 (65.536), so the value of ppb is always smaller than ppm, but not small enough to assume narrowing the type from long -> s32 is okay. Note that reasonable user space (e.g. ptp4l) will not use such high values, anyway, 4289046510ppb ~= 4.3x, so the fix is somewhat pedantic. Fixes: d39a743511cd ("ptp: validate the requested frequency adjustment.") Fixes: d94ba80ebbea ("ptp: Added a brand new class driver for ptp clocks.") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14Revert "cpufreq: CPPC: Add support for frequency invariance"Viresh Kumar
This reverts commit 4c38f2df71c8e33c0b64865992d693f5022eeaad. There are few races in the frequency invariance support for CPPC driver, namely the driver doesn't stop the kthread_work and irq_work on policy exit during suspend/resume or CPU hotplug. A proper fix won't be possible for the 5.13-rc, as it requires a lot of changes. Lets revert the patch instead for now. Fixes: 4c38f2df71c8 ("cpufreq: CPPC: Add support for frequency invariance") Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-06-12mm: relocate 'write_protect_seq' in struct mm_structFeng Tang
0day robot reported a 9.2% regression for will-it-scale mmap1 test case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast from racing with COW during fork"). Further debug shows the regression is due to that commit changes the offset of hot fields 'mmap_lock' inside structure 'mm_struct', thus some cache alignment changes. From the perf data, the contention for 'mmap_lock' is very severe and takes around 95% cpu cycles, and it is a rw_semaphore struct rw_semaphore { atomic_long_t count; /* 8 bytes */ atomic_long_t owner; /* 8 bytes */ struct optimistic_spin_queue osq; /* spinner MCS lock */ ... Before commit 57efa1fe5957 adds the 'write_protect_seq', it happens to have a very optimal cache alignment layout, as Linus explained: "and before the addition of the 'write_protect_seq' field, the mmap_sem was at offset 120 in 'struct mm_struct'. Which meant that count and owner were in two different cachelines, and then when you have contention and spend time in rwsem_down_write_slowpath(), this is probably *exactly* the kind of layout you want. Because first the rwsem_write_trylock() will do a cmpxchg on the first cacheline (for the optimistic fast-path), and then in the case of contention, rwsem_down_write_slowpath() will just access the second cacheline. Which is probably just optimal for a load that spends a lot of time contended - new waiters touch that first cacheline, and then they queue themselves up on the second cacheline." After the commit, the rw_semaphore is at offset 128, which means the 'count' and 'owner' fields are now in the same cacheline, and causes more cache bouncing. Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which will affect its offset: CONFIG_MMU CONFIG_MEMBARRIER CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES The layout above is on 64 bits system with 0day's default kernel config (similar to RHEL-8.3's config), in which all these 3 options are 'y'. And the layout can vary with different kernel configs. Relayouting a structure is usually a double-edged sword, as sometimes it can helps one case, but hurt other cases. For this case, one solution is, as the newly added 'write_protect_seq' is a 4 bytes long seqcount_t (when CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4 bytes hole in 'mm_struct' will not change other fields' alignment, while restoring the regression. Link: https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ [1] Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-12net: make get_net_ns return error if NET_NS is disabledChangbin Du
There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled. The reason is that nsfs tries to access ns->ops but the proc_ns_operations is not implemented in this case. [7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010 [7.670268] pgd = 32b54000 [7.670544] [00000010] *pgd=00000000 [7.671861] Internal error: Oops: 5 [#1] SMP ARM [7.672315] Modules linked in: [7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2da49 #16 [7.673309] Hardware name: Generic DT based system [7.673642] PC is at nsfs_evict+0x24/0x30 [7.674486] LR is at clear_inode+0x20/0x9c The same to tun SIOCGSKNS command. To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is disabled. Meanwhile move it to right place net/core/net_namespace.c. Signed-off-by: Changbin Du <changbin.du@gmail.com> Fixes: c62cce2caee5 ("net: add an ioctl to get a socket network namespace") Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Christian Brauner <christian.brauner@ubuntu.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12Merge tag 'usb-5.13-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are a number of tiny USB fixes for 5.13-rc6. There are more than I would normally like, but there's been a bunch of people banging on the gadget and dwc3 and typec code recently for I think an Android release, which has resulted in a number of small fixes. It's nice to see companies send fixes upstream for this type of work, a notable change from years ago. Anyway, fixes in here are: - usb-serial device id updates - usb-serial cp210x driver fixes for broken firmware versions - typec fixes for crazy charging devices and other reported problems - dwc3 fixes for reported problems found - gadget fixes for reported problems - tiny xhci fixes - other small fixes for reported issues. - revert of a problem fix found by linux-next testing All of these have passed 0-day and linux-next testing with no reported problems (the revert for the found linux-next build problem included)" * tag 'usb-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (44 commits) Revert "usb: gadget: fsl: Re-enable driver for ARM SoCs" usb: typec: mux: Fix copy-paste mistake in typec_mux_match usb: typec: ucsi: Clear PPM capability data in ucsi_init() error path usb: gadget: fsl: Re-enable driver for ARM SoCs usb: typec: wcove: Use LE to CPU conversion when accessing msg->header USB: serial: cp210x: fix CP2102N-A01 modem control USB: serial: cp210x: fix alternate function for CP2102N QFN20 usb: misc: brcmstb-usb-pinmap: check return value after calling platform_get_resource() usb: dwc3: ep0: fix NULL pointer exception usb: gadget: eem: fix wrong eem header operation usb: typec: intel_pmc_mux: Put ACPI device using acpi_dev_put() usb: typec: intel_pmc_mux: Add missed error check for devm_ioremap_resource() usb: typec: intel_pmc_mux: Put fwnode in error case during ->probe() usb: typec: tcpm: Do not finish VDM AMS for retrying Responses usb: fix various gadget panics on 10gbps cabling usb: fix various gadgets null ptr deref on 10gbps cabling. usb: pci-quirks: disable D3cold on xhci suspend for s2idle on AMD Renoir usb: f_ncm: only first packet of aggregate needs to start timer USB: f_ncm: ncm_bitrate (speed) is unsigned MAINTAINERS: usb: add entry for isp1760 ...
2021-06-12Merge tag 'char-misc-5.13-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are some small misc driver fixes for 5.13-rc6 that fix some reported problems: - Tiny phy driver fixes for reported issues - rtsx regression for when the device suspended - mhi driver fix for a use-after-free All of these have been in linux-next for a few days with no reported issues" * tag 'char-misc-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: misc: rtsx: separate aspm mode into MODE_REG and MODE_CFG bus: mhi: pci-generic: Fix hibernation bus: mhi: pci_generic: Fix possible use-after-free in mhi_pci_remove() bus: mhi: pci_generic: T99W175: update channel name from AT to DUN phy: Sparx5 Eth SerDes: check return value after calling platform_get_resource() phy: ralink: phy-mt7621-pci: drop 'of_match_ptr' to fix -Wunused-const-variable phy: ti: Fix an error code in wiz_probe() phy: phy-mtk-tphy: Fix some resource leaks in mtk_phy_init() phy: cadence: Sierra: Fix error return code in cdns_sierra_phy_probe() phy: usb: Fix misuse of IS_ENABLED
2021-06-12Merge tag 'sched-urgent-2021-06-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes: - Fix performance regression caused by lack of intended batching of RCU callbacks by over-eager NOHZ-full code. - Fix cgroups related corruption of load_avg and load_sum metrics. - Three fixes to fix blocked load, util_sum/runnable_sum and util_est tracking bugs" * tag 'sched-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix util_est UTIL_AVG_UNCHANGED handling sched/pelt: Ensure that *_sum is always synced with *_avg tick/nohz: Only check for RCU deferred wakeup on user/guest entry when needed sched/fair: Make sure to update tg contrib for blocked load sched/fair: Keep load_avg and load_sum synced
2021-06-10Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma fixes from Jason Gunthorpe: "A mixture of small bug fixes and a small security issue: - WARN_ON when IPoIB is automatically moved between namespaces - Long standing bug where mlx5 would use the wrong page for the doorbell recovery memory if fork is used - Security fix for mlx4 that disables the timestamp feature - Several crashers for mlx5 - Plug a recent mlx5 memory leak for the sig_mr" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: IB/mlx5: Fix initializing CQ fragments buffer RDMA/mlx5: Delete right entry from MR signature database RDMA: Verify port when creating flow rule RDMA/mlx5: Block FDB rules when not in switchdev mode RDMA/mlx4: Do not map the core_clock page to user space unless enabled RDMA/mlx5: Use different doorbell memory for different processes RDMA/ipoib: Fix warning caused by destroying non-initial netns
2021-06-09net/mlx5e: Fix page reclaim for dead peer hairpinDima Chumak
When adding a hairpin flow, a firmware-side send queue is created for the peer net device, which claims some host memory pages for its internal ring buffer. If the peer net device is removed/unbound before the hairpin flow is deleted, then the send queue is not destroyed which leads to a stack trace on pci device remove: [ 748.005230] mlx5_core 0000:08:00.2: wait_func:1094:(pid 12985): MANAGE_PAGES(0x108) timeout. Will cause a leak of a command resource [ 748.005231] mlx5_core 0000:08:00.2: reclaim_pages:514:(pid 12985): failed reclaiming pages: err -110 [ 748.001835] mlx5_core 0000:08:00.2: mlx5_reclaim_root_pages:653:(pid 12985): failed reclaiming pages (-110) for func id 0x0 [ 748.002171] ------------[ cut here ]------------ [ 748.001177] FW pages counter is 4 after reclaiming all pages [ 748.001186] WARNING: CPU: 1 PID: 12985 at drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:685 mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core] [ +0.002771] Modules linked in: cls_flower mlx5_ib mlx5_core ptp pps_core act_mirred sch_ingress openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad ib_ipoib iw_cm ib_cm ib_uverbs ib_core overlay fuse [last unloaded: pps_core] [ 748.007225] CPU: 1 PID: 12985 Comm: tee Not tainted 5.12.0+ #1 [ 748.001376] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 748.002315] RIP: 0010:mlx5_reclaim_startup_pages+0x34b/0x460 [mlx5_core] [ 748.001679] Code: 28 00 00 00 0f 85 22 01 00 00 48 81 c4 b0 00 00 00 31 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c7 40 cc 19 a1 e8 9f 71 0e e2 <0f> 0b e9 30 ff ff ff 48 c7 c7 a0 cc 19 a1 e8 8c 71 0e e2 0f 0b e9 [ 748.003781] RSP: 0018:ffff88815220faf8 EFLAGS: 00010286 [ 748.001149] RAX: 0000000000000000 RBX: ffff8881b4900280 RCX: 0000000000000000 [ 748.001445] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ffffed102a441f51 [ 748.001614] RBP: 00000000000032b9 R08: 0000000000000001 R09: ffffed1054a15ee8 [ 748.001446] R10: ffff8882a50af73b R11: ffffed1054a15ee7 R12: fffffbfff07c1e30 [ 748.001447] R13: dffffc0000000000 R14: ffff8881b492cba8 R15: 0000000000000000 [ 748.001429] FS: 00007f58bd08b580(0000) GS:ffff8882a5080000(0000) knlGS:0000000000000000 [ 748.001695] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 748.001309] CR2: 000055a026351740 CR3: 00000001d3b48006 CR4: 0000000000370ea0 [ 748.001506] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 748.001483] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 748.001654] Call Trace: [ 748.000576] ? mlx5_satisfy_startup_pages+0x290/0x290 [mlx5_core] [ 748.001416] ? mlx5_cmd_teardown_hca+0xa2/0xd0 [mlx5_core] [ 748.001354] ? mlx5_cmd_init_hca+0x280/0x280 [mlx5_core] [ 748.001203] mlx5_function_teardown+0x30/0x60 [mlx5_core] [ 748.001275] mlx5_uninit_one+0xa7/0xc0 [mlx5_core] [ 748.001200] remove_one+0x5f/0xc0 [mlx5_core] [ 748.001075] pci_device_remove+0x9f/0x1d0 [ 748.000833] device_release_driver_internal+0x1e0/0x490 [ 748.001207] unbind_store+0x19f/0x200 [ 748.000942] ? sysfs_file_ops+0x170/0x170 [ 748.001000] kernfs_fop_write_iter+0x2bc/0x450 [ 748.000970] new_sync_write+0x373/0x610 [ 748.001124] ? new_sync_read+0x600/0x600 [ 748.001057] ? lock_acquire+0x4d6/0x700 [ 748.000908] ? lockdep_hardirqs_on_prepare+0x400/0x400 [ 748.001126] ? fd_install+0x1c9/0x4d0 [ 748.000951] vfs_write+0x4d0/0x800 [ 748.000804] ksys_write+0xf9/0x1d0 [ 748.000868] ? __x64_sys_read+0xb0/0xb0 [ 748.000811] ? filp_open+0x50/0x50 [ 748.000919] ? syscall_enter_from_user_mode+0x1d/0x50 [ 748.001223] do_syscall_64+0x3f/0x80 [ 748.000892] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 748.001026] RIP: 0033:0x7f58bcfb22f7 [ 748.000944] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 748.003925] RSP: 002b:00007fffd7f2aaa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 748.001732] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007f58bcfb22f7 [ 748.001426] RDX: 000000000000000d RSI: 00007fffd7f2abc0 RDI: 0000000000000003 [ 748.001746] RBP: 00007fffd7f2abc0 R08: 0000000000000000 R09: 0000000000000001 [ 748.001631] R10: 00000000000001b6 R11: 0000000000000246 R12: 000000000000000d [ 748.001537] R13: 00005597ac2c24a0 R14: 000000000000000d R15: 00007f58bd084700 [ 748.001564] irq event stamp: 0 [ 748.000787] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [ 748.001399] hardirqs last disabled at (0): [<ffffffff813132cf>] copy_process+0x146f/0x5eb0 [ 748.001854] softirqs last enabled at (0): [<ffffffff8131330e>] copy_process+0x14ae/0x5eb0 [ 748.013431] softirqs last disabled at (0): [<0000000000000000>] 0x0 [ 748.001492] ---[ end trace a6fabd773d1c51ae ]--- Fix by destroying the send queue of a hairpin peer net device that is being removed/unbound, which returns the allocated ring buffer pages to the host. Fixes: 4d8fcf216c90 ("net/mlx5e: Avoid unbounded peer devices when unpairing TC hairpin rules") Signed-off-by: Dima Chumak <dchumak@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09Merge tag 'compiler-attributes-for-linus-v5.13-rc6' of ↵Linus Torvalds
git://github.com/ojeda/linux Pull compiler attribute update from Miguel Ojeda: "A trivial update to the compiler attributes: Add 'continue' keyword to documentation in comment (from Wei Ming Chen)" * tag 'compiler-attributes-for-linus-v5.13-rc6' of git://github.com/ojeda/linux: Compiler Attributes: Add continue in comment
2021-06-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "Bugfixes, including a TLB flush fix that affects processors without nested page tables" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: kvm: fix previous commit for 32-bit builds kvm: avoid speculation-based attacks from out-of-range memslot accesses KVM: x86: Unload MMU on guest TLB flush if TDP disabled to force MMU sync KVM: x86: Ensure liveliness of nested VM-Enter fail tracepoint message selftests: kvm: Add support for customized slot0 memory size KVM: selftests: introduce P47V64 for s390x KVM: x86: Ensure PV TLB flush tracepoint reflects KVM behavior KVM: X86: MMU: Use the correct inherited permissions to get shadow page KVM: LAPIC: Write 0 to TMICT should also cancel vmx-preemption timer KVM: SVM: Fix SEV SEND_START session length & SEND_UPDATE_DATA query length after commit 238eca821cee
2021-06-09misc: rtsx: separate aspm mode into MODE_REG and MODE_CFGRicky Wu
aspm (Active State Power Management) rtsx_comm_set_aspm: this function is for driver to make sure not enter power saving when processing of init and card_detcct ASPM_MODE_CFG: 8411 5209 5227 5229 5249 5250 Change back to use original way to control aspm ASPM_MODE_REG: 5227A 524A 5250A 5260 5261 5228 Keep the new way to control aspm Fixes: 121e9c6b5c4c ("misc: rtsx: modify and fix init_hw function") Reported-by: Chris Chiu <chris.chiu@canonical.com> Tested-by: Gordon Lack <gordon.lack@dsl.pipex.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Ricky Wu <ricky_wu@realtek.com> Link: https://lore.kernel.org/r/20210607101634.4948-1-ricky_wu@realtek.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-09kvm: fix previous commit for 32-bit buildsPaolo Bonzini
array_index_nospec does not work for uint64_t on 32-bit builds. However, the size of a memory slot must be less than 20 bits wide on those system, since the memory slot must fit in the user address space. So just store it in an unsigned long. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-08kvm: avoid speculation-based attacks from out-of-range memslot accessesPaolo Bonzini
KVM's mechanism for accessing guest memory translates a guest physical address (gpa) to a host virtual address using the right-shifted gpa (also known as gfn) and a struct kvm_memory_slot. The translation is performed in __gfn_to_hva_memslot using the following formula: hva = slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE It is expected that gfn falls within the boundaries of the guest's physical memory. However, a guest can access invalid physical addresses in such a way that the gfn is invalid. __gfn_to_hva_memslot is called from kvm_vcpu_gfn_to_hva_prot, which first retrieves a memslot through __gfn_to_memslot. While __gfn_to_memslot does check that the gfn falls within the boundaries of the guest's physical memory or not, a CPU can speculate the result of the check and continue execution speculatively using an illegal gfn. The speculation can result in calculating an out-of-bounds hva. If the resulting host virtual address is used to load another guest physical address, this is effectively a Spectre gadget consisting of two consecutive reads, the second of which is data dependent on the first. Right now it's not clear if there are any cases in which this is exploitable. One interesting case was reported by the original author of this patch, and involves visiting guest page tables on x86. Right now these are not vulnerable because the hva read goes through get_user(), which contains an LFENCE speculation barrier. However, there are patches in progress for x86 uaccess.h to mask kernel addresses instead of using LFENCE; once these land, a guest could use speculation to read from the VMM's ring 3 address space. Other architectures such as ARM already use the address masking method, and would be susceptible to this same kind of data-dependent access gadgets. Therefore, this patch proactively protects from these attacks by masking out-of-bounds gfns in __gfn_to_hva_memslot, which blocks speculation of invalid hvas. Sean Christopherson noted that this patch does not cover kvm_read_guest_offset_cached. This however is limited to a few bytes past the end of the cache, and therefore it is unlikely to be useful in the context of building a chain of data dependent accesses. Reported-by: Artemiy Margaritov <artemiy.margaritov@gmail.com> Co-developed-by: Artemiy Margaritov <artemiy.margaritov@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-08Merge tag 'regulator-fix-v5.13-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator Pull regulator fixes from Mark Brown: "A collection of fixes for the regulator API that have come up since the merge window, including a big batch of fixes from Axel Lin's usual careful and detailed review. The one stand out fix here is Dmitry Baryshkov's fix for an issue where we fail to power on the parents of always on regulators during system startup if they weren't already powered on" * tag 'regulator-fix-v5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (21 commits) regulator: rt4801: Fix NULL pointer dereference if priv->enable_gpios is NULL regulator: hi6421v600: Fix .vsel_mask setting regulator: bd718x7: Fix the BUCK7 voltage setting on BD71837 regulator: atc260x: Fix n_voltages and min_sel for pickable linear ranges regulator: rtmv20: Fix to make regcache value first reading back from HW regulator: mt6315: Fix function prototype for mt6315_map_mode regulator: rtmv20: Add Richtek to Kconfig text regulator: rtmv20: Fix .set_current_limit/.get_current_limit callbacks regulator: hisilicon: use the correct HiSilicon copyright regulator: bd71828: Fix .n_voltages settings regulator: bd70528: Fix off-by-one for buck123 .n_voltages setting regulator: max77620: Silence deferred probe error regulator: max77620: Use device_set_of_node_from_dev() regulator: scmi: Fix off-by-one for linear regulators .n_voltages setting regulator: core: resolve supply for boot-on/always-on regulators regulator: fixed: Ensure enable_counter is correct if reg_domain_disable fails regulator: Check ramp_delay_table for regulator_set_ramp_delay_regmap regulator: fan53880: Fix missing n_voltages setting regulator: da9121: Return REGULATOR_MODE_INVALID for invalid mode regulator: fan53555: fix TCS4525 voltage calulation ...
2021-06-06Merge tag 'arm-soc-fixes-v5.13-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC fixes from Olof Johansson: "A set of fixes that have been coming in over the last few weeks, the usual mix of fixes: - DT fixups for TI K3 - SATA drive detection fix for TI DRA7 - Power management fixes and a few build warning removals for OMAP - OP-TEE fix to use standard API for UUID exporting - DT fixes for a handful of i.MX boards And a few other smaller items" * tag 'arm-soc-fixes-v5.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (29 commits) arm64: meson: select COMMON_CLK soc: amlogic: meson-clk-measure: remove redundant dev_err call in meson_msr_probe() ARM: OMAP1: ams-delta: remove unused function ams_delta_camera_power bus: ti-sysc: Fix flakey idling of uarts and stop using swsup_sidle_act ARM: dts: imx: emcon-avari: Fix nxp,pca8574 #gpio-cells ARM: dts: imx7d-pico: Fix the 'tuning-step' property ARM: dts: imx7d-meerkat96: Fix the 'tuning-step' property arm64: dts: freescale: sl28: var1: fix RGMII clock and voltage arm64: dts: freescale: sl28: var4: fix RGMII clock and voltage ARM: imx: pm-imx27: Include "common.h" arm64: dts: zii-ultra: fix 12V_MAIN voltage arm64: dts: zii-ultra: remove second GEN_3V3 regulator instance arm64: dts: ls1028a: fix memory node bus: ti-sysc: Fix am335x resume hang for usb otg module ARM: OMAP2+: Fix build warning when mmc_omap is not built ARM: OMAP1: isp1301-omap: Add missing gpiod_add_lookup_table function ARM: OMAP1: Fix use of possibly uninitialized irq variable optee: use export_uuid() to copy client UUID arm64: dts: ti: k3*: Introduce reg definition for interrupt routers arm64: dts: ti: k3-am65|j721e|am64: Map the dma / navigator subsystem via explicit ranges ...
2021-06-05Merge tag 'omap-for-v5.13/fixes-pm' of ↵Olof Johansson
git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/fixes PM and build warning fixes for omaps While chasing system suspend related regressions, I noticed few other issues related to PM would be good to have fixed: - UART idling does not always work for hardware autoidle features - am335x resume works only the first time unless musb module is loaded Then there are three patches for omap1 related warnings caused by the gpio changes, and one build warning fix for legacy mmc platform code when mmc is built as a loadable module. These can all be merged whenever suitable naturally. I've sent the more urgent SATA regression fix separately although it appears in this pull request too because of the branches merged. * tag 'omap-for-v5.13/fixes-pm' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap: ARM: OMAP1: ams-delta: remove unused function ams_delta_camera_power bus: ti-sysc: Fix flakey idling of uarts and stop using swsup_sidle_act bus: ti-sysc: Fix am335x resume hang for usb otg module ARM: OMAP2+: Fix build warning when mmc_omap is not built ARM: OMAP1: isp1301-omap: Add missing gpiod_add_lookup_table function ARM: OMAP1: Fix use of possibly uninitialized irq variable Link: https://lore.kernel.org/r/pull-1622614772-543196@atomide.com Signed-off-by: Olof Johansson <olof@lixom.net>
2021-06-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "13 patches. Subsystems affected by this patch series: mips, mm (kfence, debug, pagealloc, memory-hotplug, hugetlb, kasan, and hugetlb), init, proc, lib, ocfs2, and mailmap" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mailmap: use private address for Michel Lespinasse ocfs2: fix data corruption by fallocate lib: crc64: fix kernel-doc warning mm, hugetlb: fix simple resv_huge_pages underflow on UFFDIO_COPY mm/kasan/init.c: fix doc warning proc: add .gitignore for proc-subset-pid selftest hugetlb: pass head page to remove_hugetlb_page() drivers/base/memory: fix trying offlining memory blocks with memory holes on aarch64 mm/page_alloc: fix counting of free pages after take off from buddy mm/debug_vm_pgtable: fix alignment for pmd/pud_advanced_tests() pid: take a reference when initializing `cad_pid` kfence: use TASK_IDLE when awaiting allocation Revert "MIPS: make userspace mapping young by default"
2021-06-05Revert "MIPS: make userspace mapping young by default"Thomas Bogendoerfer
This reverts commit f685a533a7fab35c5d069dcd663f59c8e4171a75. The MIPS cache flush logic needs to know whether the mapping was already established to decide how to flush caches. This is done by checking the valid bit in the PTE. The commit above breaks this logic by setting the valid in the PTE in new mappings, which causes kernel crashes. Link: https://lkml.kernel.org/r/20210526094335.92948-1-tsbogend@alpha.franken.de Fixes: f685a533a7f ("MIPS: make userspace mapping young by default") Reported-by: Zhou Yanjie <zhouyanjie@wanyeetech.com> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Huang Pei <huangpei@loongson.cn> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-04Merge tag 'net-5.13-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes, including fixes from bpf, wireless, netfilter and wireguard trees. The bpf vs lockdown+audit fix is the most notable. Things haven't slowed down just yet, both in terms of regressions in current release and largish fixes for older code, but we usually see a slowdown only after -rc5. Current release - regressions: - virtio-net: fix page faults and crashes when XDP is enabled - mlx5e: fix HW timestamping with CQE compression, and make sure they are only allowed to coexist with capable devices - stmmac: - fix kernel panic due to NULL pointer dereference of mdio_bus_data - fix double clk unprepare when no PHY device is connected Current release - new code bugs: - mt76: a few fixes for the recent MT7921 devices and runtime power management Previous releases - regressions: - ice: - track AF_XDP ZC enabled queues in bitmap to fix copy mode Tx - fix allowing VF to request more/less queues via virtchnl - correct supported and advertised autoneg by using PHY capabilities - allow all LLDP packets from PF to Tx - kbuild: quote OBJCOPY var to avoid a pahole call break the build Previous releases - always broken: - bpf, lockdown, audit: fix buggy SELinux lockdown permission checks - mt76: address the recent FragAttack vulnerabilities not covered by generic fixes - ipv6: fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptions - Bluetooth: - fix the erroneous flush_work() order, to avoid double free - use correct lock to prevent UAF of hdev object - nfc: fix NULL ptr dereference in llcp_sock_getname() after failed connect - ieee802154: multiple fixes to error checking and return values - igb: fix XDP with PTP enabled - intel: add correct exception tracing for XDP - tls: fix use-after-free when TLS offload device goes down and back up - ipvs: ignore IP_VS_SVC_F_HASHED flag when adding service - netfilter: nft_ct: skip expectations for confirmed conntrack - mptcp: fix falling back to TCP in presence of out of order packets early in connection lifetime - wireguard: switch from O(n) to a O(1) algorithm for maintaining peers, fixing stalls and a large memory leak in the process Misc: - devlink: correct VIRTUAL port to not have phys_port attributes - Bluetooth: fix VIRTIO_ID_BT assigned number - net: return the correct errno code ENOBUF -> ENOMEM - wireguard: - peer: allocate in kmem_cache saving 25% on peer memory - do not use -O3" * tag 'net-5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits) cxgb4: avoid link re-train during TC-MQPRIO configuration sch_htb: fix refcount leak in htb_parent_to_leaf_offload wireguard: allowedips: free empty intermediate nodes when removing single node wireguard: allowedips: allocate nodes in kmem_cache wireguard: allowedips: remove nodes in O(1) wireguard: allowedips: initialize list head in selftest wireguard: peer: allocate in kmem_cache wireguard: use synchronize_net rather than synchronize_rcu wireguard: do not use -O3 wireguard: selftests: make sure rp_filter is disabled on vethc wireguard: selftests: remove old conntrack kconfig value virtchnl: Add missing padding to virtchnl_proto_hdrs ice: Allow all LLDP packets from PF to Tx ice: report supported and advertised autoneg using PHY capabilities ice: handle the VF VSI rebuild failure ice: Fix VFR issues for AVF drivers that expect ATQLEN cleared ice: Fix allowing VF to request more/less queues via virtchnl virtio-net: fix for skb_over_panic inside big mode ipv6: Fix KASAN: slab-out-of-bounds Read in fib6_nh_flush_exceptions fib: Return the correct errno code ...
2021-06-04Merge tag 'pci-v5.13-fixes-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fixes from Bjorn Helgaas: - Fix MSIs for platforms with "msi-map" device-tree property, which we broke in v5.13-rc1 (Jean-Philippe Brucker) - Add Krzysztof Wilczyński as PCI reviewer (Lorenzo Pieralisi) * tag 'pci-v5.13-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI/MSI: Fix MSIs for generic hosts that use device-tree's "msi-map" MAINTAINERS: Add Krzysztof as PCI host/endpoint controllers reviewer
2021-06-04Merge tag 'drm-fixes-2021-06-04-1' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm fixes from Dave Airlie: "Two big regression reverts in here, one for fbdev and one i915. Otherwise it's mostly amdgpu display fixes, and tegra fixes. fb: - revert broken fb_defio patch amdgpu: - Display fixes - FRU EEPROM error handling fix - RAS fix - PSP fix - Releasing pinned BO fix i915: - Revert conversion to io_mapping_map_user() which lead to BUG_ON() - Fix check for error valued returns in a selftest tegra: - SOR power domain race condition fix - build warning fix - runtime pm ref leak fix - modifier fix" * tag 'drm-fixes-2021-06-04-1' of git://anongit.freedesktop.org/drm/drm: amd/display: convert DRM_DEBUG_ATOMIC to drm_dbg_atomic drm/amdgpu: make sure we unpin the UVD BO drm/amd/amdgpu:save psp ring wptr to avoid attack drm/amd/display: Fix potential memory leak in DMUB hw_init drm/amdgpu: Don't query CE and UE errors drm/amd/display: Fix overlay validation by considering cursors drm/amdgpu: refine amdgpu_fru_get_product_info drm/amdgpu: add judgement for dc support drm/amd/display: Fix GPU scaling regression by FS video support drm/amd/display: Allow bandwidth validation for 0 streams. Revert "i915: use io_mapping_map_user" drm/i915/selftests: Fix return value check in live_breadcrumbs_smoketest() Revert "fb_defio: Remove custom address_space_operations" drm/tegra: Correct DRM_FORMAT_MOD_NVIDIA_SECTOR_LAYOUT drm/tegra: sor: Fix AUX device reference leak drm/tegra: Get ref for DP AUX channel, not its ddc adapter drm/tegra: Fix shift overflow in tegra_shared_plane_atomic_update drm/tegra: sor: Fully initialize SOR before registration gpu: host1x: Split up client initalization and registration drm/tegra: sor: Do not leak runtime PM reference
2021-06-04virtchnl: Add missing padding to virtchnl_proto_hdrsGeert Uytterhoeven
On m68k (Coldfire M547x): CC drivers/net/ethernet/intel/i40e/i40e_main.o In file included from drivers/net/ethernet/intel/i40e/i40e_prototype.h:9, from drivers/net/ethernet/intel/i40e/i40e.h:41, from drivers/net/ethernet/intel/i40e/i40e_main.c:12: include/linux/avf/virtchnl.h:153:36: warning: division by zero [-Wdiv-by-zero] 153 | { virtchnl_static_assert_##X = (n)/((sizeof(struct X) == (n)) ? 1 : 0) } | ^ include/linux/avf/virtchnl.h:844:1: note: in expansion of macro ‘VIRTCHNL_CHECK_STRUCT_LEN’ 844 | VIRTCHNL_CHECK_STRUCT_LEN(2312, virtchnl_proto_hdrs); | ^~~~~~~~~~~~~~~~~~~~~~~~~ include/linux/avf/virtchnl.h:844:33: error: enumerator value for ‘virtchnl_static_assert_virtchnl_proto_hdrs’ is not an integer constant 844 | VIRTCHNL_CHECK_STRUCT_LEN(2312, virtchnl_proto_hdrs); | ^~~~~~~~~~~~~~~~~~~ On m68k, integers are aligned on addresses that are multiples of two, not four, bytes. Hence the size of a structure containing integers may not be divisible by 4. Fix this by adding explicit padding. Fixes: 1f7ea1cd6a374842 ("ice: Enable FDIR Configure for AVF") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-06-04usb: typec: tcpm: Properly handle Alert and Status MessagesKyle Tso
When receiving Alert Message, if it is not unexpected but is unsupported for some reason, the port should return Not_Supported Message response. Also, according to PD3.0 Spec 6.5.2.1.4 Event Flags Field, the OTP/OVP/OCP flags in the Event Flags field in Status Message no longer require Get_PPS_Status Message to clear them. Thus remove it when receiving Status Message with those flags being set. In addition, add the missing AMS operations for Status Message. Fixes: 64f7c494a3c0 ("typec: tcpm: Add support for sink PPS related messages") Fixes: 0908c5aca31e ("usb: typec: tcpm: AMS and Collision Avoidance") Signed-off-by: Kyle Tso <kyletso@google.com> Link: https://lore.kernel.org/r/20210531164928.2368606-1-kyletso@google.com Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-04Merge tag 'drm/tegra/for-5.13-rc5' of ↵Dave Airlie
ssh://git.freedesktop.org/git/tegra/linux into drm-fixes drm/tegra: Fixes for v5.13-rc5 The most important change here fixes a race condition that causes either HDA or (more frequently) display to malfunction because they race for enabling the SOR power domain at probe time. Other than that, there's a couple of build warnings for issues introduced in v5.13 as well as some minor fixes, such as reference leak plugs. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Thierry Reding <thierry.reding@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210603144624.788861-1-thierry.reding@gmail.com
2021-06-03RDMA/mlx4: Do not map the core_clock page to user space unless enabledShay Drory
Currently when mlx4 maps the hca_core_clock page to the user space there are read-modifiable registers, one of which is semaphore, on this page as well as the clock counter. If user reads the wrong offset, it can modify the semaphore and hang the device. Do not map the hca_core_clock page to the user space unless the device has been put in a backwards compatibility mode to support this feature. After this patch, mlx4 core_clock won't be mapped to user space on the majority of existing devices and the uverbs device time feature in ibv_query_rt_values_ex() will be disabled. Fixes: 52033cfb5aab ("IB/mlx4: Add mmap call to map the hardware clock") Link: https://lore.kernel.org/r/9632304e0d6790af84b3b706d8c18732bc0d5e27.1622726305.git.leonro@nvidia.com Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-03sched/fair: Fix util_est UTIL_AVG_UNCHANGED handlingDietmar Eggemann
The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent unnecessary util_est updates uses the LSB of util_est.enqueued. It is exposed via _task_util_est() (and task_util_est()). Commit 92a801e5d5b7 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages") mentions that the LSB is lost for util_est resolution but find_energy_efficient_cpu() checks if task_util_est() returns 0 to return prev_cpu early. _task_util_est() returns the max value of util_est.ewma and util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED. So task_util_est() returning the max of task_util() and _task_util_est() will never return 0 under the default SCHED_FEAT(UTIL_EST, true). To fix this use the MSB of util_est.enqueued instead and keep the flag util_est internal, i.e. don't export it via _task_util_est(). The maximal possible util_avg value for a task is 1024 so the MSB of 'unsigned int util_est.enqueued' isn't used to store a util value. As a caveat the code behind the util_est_se trace point has to filter UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should be easy to do. This also fixes an issue report by Xuewen Yan that util_est_update() only used UTIL_AVG_UNCHANGED for the subtrahend of the equation: last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED) Fixes: b89997aa88f0b sched/pelt: Fix task util_est update filtering Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-by: Vincent Donnefort <vincent.donnefort@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.com
2021-06-02usb: pd: Set PD_T_SINK_WAIT_CAP to 310msKyle Tso
Current timer PD_T_SINK_WAIT_CAP is set to 240ms which will violate the SinkWaitCapTimer (tTypeCSinkWaitCap 310 - 620 ms) defined in the PD Spec if the port is faster enough when running the state machine. Set it to the lower bound 310ms to ensure the timeout is in Spec. Fixes: f0690a25a140 ("staging: typec: USB Type-C Port Manager (tcpm)") Cc: stable <stable@vger.kernel.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Kyle Tso <kyletso@google.com> Link: https://lore.kernel.org/r/20210528081613.730661-1-kyletso@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid Pull HID fixes from Jiri Kosina: - memory leak fix in usbhid from Anirudh Rayabharam - additions for a few new recognized generic key IDs from Dmitry Torokhov - Asus T101HA and Dell K15A quirks from Hans de Goede - memory leak fix in amd_sfh from Basavaraj Natikar - Win8 compatibility and Stylus fixes in multitouch driver from Ahelenia Ziemiańska - NULL pointer dereference fix in hid-magicmouse from Johan Hovold - assorted other small fixes and device ID additions * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: (33 commits) HID: asus: Cleanup Asus T101HA keyboard-dock handling HID: magicmouse: fix NULL-deref on disconnect HID: intel-ish-hid: ipc: Add Alder Lake device IDs HID: i2c-hid: fix format string mismatch HID: amd_sfh: Fix memory leak in amd_sfh_work HID: amd_sfh: Use devm_kzalloc() instead of kzalloc() HID: ft260: improve error handling of ft260_hid_feature_report_get() HID: magicmouse: fix crash when disconnecting Magic Trackpad 2 HID: gt683r: add missing MODULE_DEVICE_TABLE HID: pidff: fix error return code in hid_pidff_init() HID: logitech-hidpp: initialize level variable HID: multitouch: Disable event reporting on suspend on the Asus T101HA touchpad HID: core: Remove extraneous empty line before EXPORT_SYMBOL_GPL(hid_check_keys_pressed) HID: hid-sensor-custom: Process failure of sensor_hub_set_feature() HID: i2c-hid: Skip ELAN power-on command after reset HID: usbhid: fix info leak in hid_submit_ctrl HID: Add BUS_VIRTUAL to hid_connect logging HID: multitouch: set Stylus suffix for Stylus-application devices, too HID: multitouch: require Finger field to mark Win8 reports as MT HID: remove the unnecessary redefinition of a macro ...
2021-06-01net/mlx5: DR, Create multi-destination flow table with level less than 64Yevgeny Kliteynik
Flow table that contains flow pointing to multiple flow tables or multiple TIRs must have a level lower than 64. In our case it applies to muli- destination flow table. Fix the level of the created table to comply with HW Spec definitions, and still make sure that its level lower than SW-owned tables, so that it would be possible to point from the multi-destination FW table to SW tables. Fixes: 34583beea4b7 ("net/mlx5: DR, Create multi-destination table for SW-steering use") Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-01Revert "fb_defio: Remove custom address_space_operations"Matthew Wilcox
Commit ccf953d8f3d6 makes framebuffers which use deferred I/O stop displaying updates after the first one. This is because the pages handled by fb_defio no longer have a page_mapping(). That prevents page_mkclean() from marking the PTEs as clean, and so writes are only noticed the first time. Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/YLZEhv0cpZp8uVE3@casper.infradead.org
2021-05-31Merge tag 'fsnotify_for_v5.13-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify fixes from Jan Kara: "A fix for permission checking with fanotify unpriviledged groups. Also there's a small update in MAINTAINERS file for fanotify" * tag 'fsnotify_for_v5.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: fix permission model of unprivileged group MAINTAINERS: Add Matthew Bobrowski as a reviewer
2021-05-31tick/nohz: Only check for RCU deferred wakeup on user/guest entry when neededFrederic Weisbecker
Checking for and processing RCU-nocb deferred wakeup upon user/guest entry is only relevant when nohz_full runs on the local CPU, otherwise the periodic tick should take care of it. Make sure we don't needlessly pollute these fast-paths as a -3% performance regression on a will-it-scale.per_process_ops has been reported so far. Fixes: 47b8ff194c1f (entry: Explicitly flush pending rcuog wakeup before last rescheduling point) Fixes: 4ae7dc97f726 (entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point) Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210527113441.465489-1-frederic@kernel.org