summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-04-27xfrm: fix GRO for !CONFIG_NETFILTERSabrina Dubroca
In xfrm_input() when called from GRO, async == 0, and we end up skipping the processing in xfrm4_transport_finish(). GRO path will always skip the NF_HOOK, so we don't need the special-case for !NETFILTER during GRO processing. Fixes: 7785bba299a8 ("esp: Add a software GRO codepath") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-27sched/cputime: Fix ksoftirqd cputime accounting regressionFrederic Weisbecker
irq_time_read() returns the irqtime minus the ksoftirqd time. This is necessary because irq_time_read() is used to substract the IRQ time from the sum_exec_runtime of a task. If we were to include the softirq time of ksoftirqd, this task would substract its own CPU time everytime it updates ksoftirqd->sum_exec_runtime which would therefore never progress. But this behaviour got broken by: a499a5a14db ("sched/cputime: Increment kcpustat directly on irqtime account") ... which now includes ksoftirqd softirq time in the time returned by irq_time_read(). This has resulted in wrong ksoftirqd cputime reported to userspace through /proc/stat and thus "top" not showing ksoftirqd when it should after intense networking load. ksoftirqd->stime happens to be correct but it gets scaled down by sum_exec_runtime through task_cputime_adjusted(). To fix this, just account the strict IRQ time in a separate counter and use it to report the IRQ time. Reported-and-tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1493129448-5356-1-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-27nvme-scsi: remove nvme_trans_security_protocolChristoph Hellwig
This function just returns the same error code and sense data as the default statement in the switch in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com>
2017-04-26uapi: change the type of struct statx_timestamp.tv_nsec to unsignedDmitry V. Levin
The comment asserting that the value of struct statx_timestamp.tv_nsec must be negative when statx_timestamp.tv_sec is negative, is wrong, as could be seen from the following example: #define _FILE_OFFSET_BITS 64 #include <assert.h> #include <fcntl.h> #include <stdio.h> #include <sys/stat.h> #include <unistd.h> #include <asm/unistd.h> #include <linux/stat.h> int main(void) { static const struct timespec ts[2] = { { .tv_nsec = UTIME_OMIT }, { .tv_sec = -2, .tv_nsec = 42 } }; assert(utimensat(AT_FDCWD, ".", ts, 0) == 0); struct stat st; assert(stat(".", &st) == 0); printf("st_mtim.tv_sec = %lld, st_mtim.tv_nsec = %lu\n", (long long) st.st_mtim.tv_sec, (unsigned long) st.st_mtim.tv_nsec); struct statx stx; assert(syscall(__NR_statx, AT_FDCWD, ".", 0, 0, &stx) == 0); printf("stx_mtime.tv_sec = %lld, stx_mtime.tv_nsec = %lu\n", (long long) stx.stx_mtime.tv_sec, (unsigned long) stx.stx_mtime.tv_nsec); return 0; } It expectedly prints: st_mtim.tv_sec = -2, st_mtim.tv_nsec = 42 stx_mtime.tv_sec = -2, stx_mtime.tv_nsec = 42 The more generic comment asserting that the value of struct statx_timestamp.tv_nsec might be negative is confusing to say the least. It contradicts both the struct stat.st_[acm]time_nsec tradition and struct timespec.tv_nsec requirements in utimensat syscall. If statx syscall ever returns a stx_[acm]time containing a negative tv_nsec that cannot be passed unmodified to utimensat syscall, it will cause an immense confusion. Fix this source of confusion by changing the type of struct statx_timestamp.tv_nsec from __s32 to __u32. Fixes: a528d35e8bfc ("statx: Add a system call to make enhanced file info available") Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-api@vger.kernel.org cc: mtk.manpages@gmail.com Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds
Pull sparc fixes from David Miller: "I didn't want the release to go out without the statx system call properly hooked up" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc: Update syscall tables. sparc64: Fill in rest of HAVE_REGS_AND_STACK_ACCESS_API
2017-04-26statx: Kill fd-with-NULL-path support in favour of AT_EMPTY_PATHDavid Howells
With the new statx() syscall, the following both allow the attributes of the file attached to a file descriptor to be retrieved: statx(dfd, NULL, 0, ...); and: statx(dfd, "", AT_EMPTY_PATH, ...); Change the code to reject the first option, though this means copying the path and engaging pathwalk for the fstat() equivalent. dfd can be a non-directory provided path is "". [ The timing of this isn't wonderful, but applying this now before we have statx() in any released kernel, before anybody starts using the NULL special case. - Linus ] Fixes: a528d35e8bfc ("statx: Add a system call to make enhanced file info available") Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Eric Sandeen <sandeen@sandeen.net> cc: fstests@vger.kernel.org cc: linux-api@vger.kernel.org cc: linux-man@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-26scsi: Implement blk_mq_ops.show_rq()Bart Van Assche
Show the SCSI CDB for pending SCSI commands in /sys/kernel/debug/block/*/mq/*/dispatch and */rq_list. An example of how SCSI commands are displayed by this code: ffff8801703245c0 {.op=READ, .cmd_flags=META PRIO, .rq_flags=DONTPREP IO_STAT STATS, .tag=14, .internal_tag=-1, .cmd=Read(10) 28 00 2a 81 1b 30 00 00 08 00} Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Hannes Reinecke <hare@suse.com> Cc: <linux-scsi@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26blk-mq: Add blk_mq_ops.show_rq()Bart Van Assche
This new callback function will be used in the next patch to show more information about SCSI requests. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26blk-mq: Show operation, cmd_flags and rq_flags namesBart Van Assche
Show the operation name, .cmd_flags and .rq_flags as names instead of numbers. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26blk-mq: Make blk_flags_show() callers append a newline characterBart Van Assche
This patch does not change any functionality but makes it possible to produce a single line of output with multiple flag-to-name translations. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26blk-mq: Move the "state" debugfs attribute one level downBart Van Assche
Move the "state" attribute from the top level to the "mq" directory as requested by Omar. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26blk-mq: Unregister debugfs attributes earlierBart Van Assche
We currently call blk_mq_free_queue() from blk_cleanup_queue() before we unregister the debugfs attributes for that queue in blk_release_queue(). This leaves a window open during which accessing most of the mq debugfs attributes would cause a use-after-free. Additionally, the "state" attribute allows running the queue, which we should not do after the queue has entered the "dead" state. Fix both cases by unregistering the debugfs attributes before freeing queue resources starts. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26blk-mq: Only unregister hctxs for which registration succeededBart Van Assche
Hctx unregistration involves calling kobject_del(). kobject_del() must not be called if kobject_add() has not been called. Hence in the error path only unregister hctxs for which registration succeeded. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26blk-mq-debugfs: Rename functions for registering and unregistering the mq ↵Bart Van Assche
directory Since the blk_mq_debugfs_*register_hctxs() functions register and unregister all attributes under the "mq" directory, rename these into blk_mq_debugfs_*register_mq(). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26blk-mq: Let blk_mq_debugfs_register() look up the queue nameBart Van Assche
A later patch will move the call of blk_mq_debugfs_register() to a function to which the queue name is not passed as an argument. To avoid having to add a 'name' argument to multiple callers, let blk_mq_debugfs_register() look up the queue name. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26blk-mq: Register <dev>/queue/mq after having registered <dev>/queueBart Van Assche
A later patch in this series will modify blk_mq_debugfs_register() such that it uses q->kobj.parent to determine the name of a request queue. Hence make sure that that pointer is initialized before blk_mq_debugfs_register() is called. To avoid lock inversion, protect sysfs / debugfs registration with the queue sysfs_lock instead of the global mutex all_q_mutex. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) MLX5 bug fixes from Saeed Mahameed et al: - released wrong resources when firmware timeout happens - fix wrong check for encapsulation size limits - UAR memory leak - ETHTOOL_GRXCLSRLALL failed to fill in info->data 2) Don't cache l3mdev on mis-matches local route, causes net devices to leak refs. From Robert Shearman. 3) Handle fragmented SKBs properly in macsec driver, the problem is that we were mis-sizing the sgvec table. From Jason A. Donenfeld. 4) We cannot have checksum offload enabled for inner UDP tunneled packet during IPSEC, from Ansis Atteka. 5) Fix double SKB free in ravb driver, from Dan Carpenter. 6) Fix CPU port handling in b53 DSA driver, from Florian Dainelli. 7) Don't use on-stack buffers for usb_control_msg() in CAN usb driver, from Maksim Salau. 8) Fix device leak in macvlan driver, from Herbert Xu. We have to purge the broadcast queue properly on port destroy. 9) Fix tx ring entry limit on EF10 devices in sfc driver. From Bert Kenward. 10) Fix memory leaks in team driver, from Pan Bian. 11) Don't setup ipv6_stub before it can be actually used, from Paolo Abeni. 12) Fix tipc socket flow control accounting, from Parthasarathy Bhuvaragan. 13) Fix crash on module unload in hso driver, from Andreas Kemnade. 14) Fix purging of bridge multicast entries, the problem is that if we don't defer it to ndo_uninit it's possible for new entries to get added after we purge. Fix from Xin Long. 15) Don't return garbage for PACKET_HDRLEN getsockopt, from Alexander Potapenko. 16) Fix autoneg stall properly in PHY layer, and revert micrel driver change that was papering over it. From Alexander Kochetkov. 17) Don't dereference an ipv4 route as an ipv6 one in the ip6_tunnnel code, from Cong Wang. 18) Clear out the congestion control private of the TCP socket in all of the right places, from Wei Wang. 19) rawv6_ioctl measures SKB length incorrectly, fix from Jamie Bainbridge. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits) ipv6: check raw payload size correctly in ioctl tcp: memset ca_priv data to 0 properly ipv6: check skb->protocol before lookup for nexthop net: core: Prevent from dereferencing null pointer when releasing SKB macsec: dynamically allocate space for sglist Revert "phy: micrel: Disable auto negotiation on startup" net: phy: fix auto-negotiation stall due to unavailable interrupt net/packet: check length in getsockopt() called with PACKET_HDRLEN net: ipv6: regenerate host route if moved to gc list bridge: move bridge multicast cleanup to ndo_uninit ipv6: fix source routing qed: Fix error in the dcbx app meta data initialization. netvsc: fix calculation of available send sections net: hso: fix module unloading tipc: fix socket flow control accounting error at tipc_recv_stream tipc: fix socket flow control accounting error at tipc_send_stream ipv6: move stub initialization after ipv6 setup completion team: fix memory leaks sfc: tx ring can only have 2048 entries for all EF10 NICs macvlan: Fix device ref leak when purging bc_queue ...
2017-04-26ipv6: check raw payload size correctly in ioctlJamie Bainbridge
In situations where an skb is paged, the transport header pointer and tail pointer can be the same because the skb contents are in frags. This results in ioctl(SIOCINQ/FIONREAD) incorrectly returning a length of 0 when the length to receive is actually greater than zero. skb->len is already correctly set in ip6_input_finish() with pskb_pull(), so use skb->len as it always returns the correct result for both linear and paged data. Signed-off-by: Jamie Bainbridge <jbainbri@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26tcp: memset ca_priv data to 0 properlyWei Wang
Always zero out ca_priv data in tcp_assign_congestion_control() so that ca_priv data is cleared out during socket creation. Also always zero out ca_priv data in tcp_reinit_congestion_control() so that when cc algorithm is changed, ca_priv data is cleared out as well. We should still zero out ca_priv data even in TCP_CLOSE state because user could call connect() on AF_UNSPEC to disconnect the socket and leave it in TCP_CLOSE state and later call setsockopt() to switch cc algorithm on this socket. Fixes: 2b0a8c9ee ("tcp: add CDG congestion control") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26ipv6: check skb->protocol before lookup for nexthopWANG Cong
Andrey reported a out-of-bound access in ip6_tnl_xmit(), this is because we use an ipv4 dst in ip6_tnl_xmit() and cast an IPv4 neigh key as an IPv6 address: neigh = dst_neigh_lookup(skb_dst(skb), &ipv6_hdr(skb)->daddr); if (!neigh) goto tx_err_link_failure; addr6 = (struct in6_addr *)&neigh->primary_key; // <=== HERE addr_type = ipv6_addr_type(addr6); if (addr_type == IPV6_ADDR_ANY) addr6 = &ipv6_hdr(skb)->daddr; memcpy(&fl6->daddr, addr6, sizeof(fl6->daddr)); Also the network header of the skb at this point should be still IPv4 for 4in6 tunnels, we shold not just use it as IPv6 header. This patch fixes it by checking if skb->protocol is ETH_P_IPV6: if it is, we are safe to do the nexthop lookup using skb_dst() and ipv6_hdr(skb)->daddr; if not (aka IPv4), we have no clue about which dest address we can pick here, we have to rely on callers to fill it from tunnel config, so just fall to ip6_route_output() to make the decision. Fixes: ea3dc9601bda ("ip6_tunnel: Add support for wildcard tunnel endpoints.") Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26net: core: Prevent from dereferencing null pointer when releasing SKBMyungho Jung
Added NULL check to make __dev_kfree_skb_irq consistent with kfree family of functions. Link: https://bugzilla.kernel.org/show_bug.cgi?id=195289 Signed-off-by: Myungho Jung <mhjungk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26macsec: dynamically allocate space for sglistJason A. Donenfeld
We call skb_cow_data, which is good anyway to ensure we can actually modify the skb as such (another error from prior). Now that we have the number of fragments required, we can safely allocate exactly that amount of memory. Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26Revert "phy: micrel: Disable auto negotiation on startup"David S. Miller
This reverts commit 99f81afc139c6edd14d77a91ee91685a414a1c66. It was papering over the real problem, which is fixed by commit f555f34fdc58 ("net: phy: fix auto-negotiation stall due to unavailable interrupt") Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26net: phy: fix auto-negotiation stall due to unavailable interruptAlexander Kochetkov
The Ethernet link on an interrupt driven PHY was not coming up if the Ethernet cable was plugged before the Ethernet interface was brought up. The patch trigger PHY state machine to update link state if PHY was requested to do auto-negotiation and auto-negotiation complete flag already set. During power-up cycle the PHY do auto-negotiation, generate interrupt and set auto-negotiation complete flag. Interrupt is handled by PHY state machine but doesn't update link state because PHY is in PHY_READY state. After some time MAC bring up, start and request PHY to do auto-negotiation. If there are no new settings to advertise genphy_config_aneg() doesn't start PHY auto-negotiation. PHY continue to stay in auto-negotiation complete state and doesn't fire interrupt. At the same time PHY state machine expect that PHY started auto-negotiation and is waiting for interrupt from PHY and it won't get it. Fixes: 321beec5047a ("net: phy: Use interrupts when available in NOLINK state") Signed-off-by: Alexander Kochetkov <al.kochet@gmail.com> Cc: stable <stable@vger.kernel.org> # v4.9+ Tested-by: Roger Quadros <rogerq@ti.com> Tested-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-26Merge tag 'sound-4.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "Since we got a bonus week, let me try to screw a few pending fixes. A slightly large fix is the locking fix in ASoC STI driver, but it's pretty board-specific, and the risk is fairly low. All the rest are small / trivial fixes, mostly marked as stable, for ALSA sequencer core, ASoC topology, ASoC Intel bytcr and Firewire drivers" * tag 'sound-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ASoC: intel: Fix PM and non-atomic crash in bytcr drivers ALSA: firewire-lib: fix inappropriate assignment between signed/unsigned type ALSA: seq: Don't break snd_use_lock_sync() loop by timeout ASoC: topology: Fix to store enum text values ASoC: STI: Fix null ptr deference in IRQ handler ALSA: oxfw: fix regression to handle Stanton SCS.1m/1d
2017-04-26HAVE_ARCH_HARDENED_USERCOPY is unconditional nowAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional nowAl Viro
all architectures converted Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26Merge branches 'uaccess.alpha', 'uaccess.arc', 'uaccess.arm', ↵Al Viro
'uaccess.arm64', 'uaccess.avr32', 'uaccess.bfin', 'uaccess.c6x', 'uaccess.cris', 'uaccess.frv', 'uaccess.h8300', 'uaccess.hexagon', 'uaccess.ia64', 'uaccess.m32r', 'uaccess.m68k', 'uaccess.metag', 'uaccess.microblaze', 'uaccess.mips', 'uaccess.mn10300', 'uaccess.nios2', 'uaccess.openrisc', 'uaccess.parisc', 'uaccess.powerpc', 'uaccess.s390', 'uaccess.score', 'uaccess.sh', 'uaccess.sparc', 'uaccess.tile', 'uaccess.um', 'uaccess.unicore32', 'uaccess.x86' and 'uaccess.xtensa' into work.uaccess
2017-04-26m32r: switch to RAW_COPY_USERAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-26ide-pm: always pass 0 error to ide_complete_rq in ide_do_devsetChristoph Hellwig
The caller only looks at the scsi_request result field anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26ide-pm: always pass 0 error to __blk_end_request_allChristoph Hellwig
ide_pm_execute_rq exectures a PM request synchronously, and in the failure case where it calls __blk_end_request_all it never checks the error field passed to the end_io callback, so don't bother setting it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26scsi_transport_sas: always pass 0 error to blk_end_request_allChristoph Hellwig
The SAS transport queues are only used by bsg, and bsg always looks at the scsi_request results and never add the error passed in the end_io callback. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-26xfrm: do the garbage collection after flushing policyXin Long
Now xfrm garbage collection can be triggered by 'ip xfrm policy del'. These is no reason not to do it after flushing policies, especially considering that 'garbage collection deferred' is only triggered when it reaches gc_thresh. It's no good that the policy is gone but the xdst still hold there. The worse thing is that xdst->route/orig_dst is also hold and can not be released even if the orig_dst is already expired. This patch is to do the garbage collection if there is any policy removed in xfrm_policy_flush. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2017-04-26x86/mm: Fix flush_tlb_page() on XenAndy Lutomirski
flush_tlb_page() passes a bogus range to flush_tlb_others() and expects the latter to fix it up. native_flush_tlb_others() has the fixup but Xen's version doesn't. Move the fixup to flush_tlb_others(). AFAICS the only real effect is that, without this fix, Xen would flush everything instead of just the one page on remote vCPUs in when flush_tlb_page() was called. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: e7b52ffd45a6 ("x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range") Link: http://lkml.kernel.org/r/10ed0e4dfea64daef10b87fb85df1746999b4dba.1492844372.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26x86/mm: Make flush_tlb_mm_range() more predictableAndy Lutomirski
I'm about to rewrite the function almost completely, but first I want to get a functional change out of the way. Currently, if flush_tlb_mm_range() does not flush the local TLB at all, it will never do individual page flushes on remote CPUs. This seems to be an accident, and preserving it will be awkward. Let's change it first so that any regressions in the rewrite will be easier to bisect and so that the rewrite can attempt to change no visible behavior at all. The fix is simple: we can simply avoid short-circuiting the calculation of base_pages_to_flush. As a side effect, this also eliminates a potential corner case: if tlb_single_page_flush_ceiling == TLB_FLUSH_ALL, flush_tlb_mm_range() could have ended up flushing the entire address space one page at a time. Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Dave Hansen <dave.hansen@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/4b29b771d9975aad7154c314534fec235618175a.1492844372.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26x86/mm: Remove flush_tlb() and flush_tlb_current_task()Andy Lutomirski
I was trying to figure out what how flush_tlb_current_task() would possibly work correctly if current->mm != current->active_mm, but I realized I could spare myself the effort: it has no callers except the unused flush_tlb() macro. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/e52d64c11690f85e9f1d69d7b48cc2269cd2e94b.1492844372.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26x86/vm86/32: Switch to flush_tlb_mm_range() in mark_screen_rdonly()Andy Lutomirski
mark_screen_rdonly() is the last remaining caller of flush_tlb(). flush_tlb_mm_range() is potentially faster and isn't obsolete. Compile-tested only because I don't know whether software that uses this mechanism even exists. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nadav Amit <namit@vmware.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/791a644076fc3577ba7f7b7cafd643cc089baa7d.1492844372.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26x86/mm/64: Fix crash in remove_pagetable()Kirill A. Shutemov
remove_pagetable() does page walk using p*d_page_vaddr() plus cast. It's not canonical approach -- we usually use p*d_offset() for that. It works fine as long as all page table levels are present. We broke the invariant by introducing folded p4d page table level. As result, remove_pagetable() interprets PMD as PUD and it leads to crash: BUG: unable to handle kernel paging request at ffff880300000000 IP: memchr_inv+0x60/0x110 PGD 317d067 P4D 317d067 PUD 3180067 PMD 33f102067 PTE 8000000300000060 Let's fix this by using p*d_offset() instead of p*d_page_vaddr() for page walk. Reported-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Fixes: f2a6a7050109 ("x86: Convert the rest of the code to support p4d_t") Link: http://lkml.kernel.org/r/20170425092557.21852-1-kirill.shutemov@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26x86/unwind: Dump all stacks in unwind_dump()Josh Poimboeuf
Currently unwind_dump() dumps only the most recently accessed stack. But it has a few issues. In some cases, 'first_sp' can get out of sync with 'stack_info', causing unwind_dump() to start from the wrong address, flood the printk buffer, and eventually read a bad address. In other cases, dumping only the most recently accessed stack doesn't give enough data to diagnose the error. Fix both issues by dumping *all* stacks involved in the trace, not just the last one. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 8b5e99f02264 ("x86/unwind: Dump stack data on warnings") Link: http://lkml.kernel.org/r/016d6a9810d7d1bfc87ef8c0e6ee041c6744c909.1493171120.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-26x86/unwind: Silence more entry-code related warningsJosh Poimboeuf
Borislav Petkov reported the following unwinder warning: WARNING: kernel stack regs at ffffc9000024fea8 in udevadm:92 has bad 'bp' value 00007fffc4614d30 unwind stack type:0 next_sp: (null) mask:0x6 graph_idx:0 ffffc9000024fea8: 000055a6100e9b38 (0x55a6100e9b38) ffffc9000024feb0: 000055a6100e9b35 (0x55a6100e9b35) ffffc9000024feb8: 000055a6100e9f68 (0x55a6100e9f68) ffffc9000024fec0: 000055a6100e9f50 (0x55a6100e9f50) ffffc9000024fec8: 00007fffc4614d30 (0x7fffc4614d30) ffffc9000024fed0: 000055a6100eaf50 (0x55a6100eaf50) ffffc9000024fed8: 0000000000000000 ... ffffc9000024fee0: 0000000000000100 (0x100) ffffc9000024fee8: ffff8801187df488 (0xffff8801187df488) ffffc9000024fef0: 00007ffffffff000 (0x7ffffffff000) ffffc9000024fef8: 0000000000000000 ... ffffc9000024ff10: ffffc9000024fe98 (0xffffc9000024fe98) ffffc9000024ff18: 00007fffc4614d00 (0x7fffc4614d00) ffffc9000024ff20: ffffffffffffff10 (0xffffffffffffff10) ffffc9000024ff28: ffffffff811c6c1f (SyS_newlstat+0xf/0x10) ffffc9000024ff30: 0000000000000010 (0x10) ffffc9000024ff38: 0000000000000296 (0x296) ffffc9000024ff40: ffffc9000024ff50 (0xffffc9000024ff50) ffffc9000024ff48: 0000000000000018 (0x18) ffffc9000024ff50: ffffffff816b2e6a (entry_SYSCALL_64_fastpath+0x18/0xa8) ... It unwinded from an interrupt which came in right after entry code called into a C syscall handler, before it had a chance to set up the frame pointer, so regs->bp still had its user space value. Add a check to silence warnings in such a case, where an interrupt has occurred and regs->sp is almost at the end of the stack. Reported-by: Borislav Petkov <bp@suse.de> Tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: c32c47c68a0a ("x86/unwind: Warn on bad frame pointer") Link: http://lkml.kernel.org/r/c695f0d0d4c2cfe6542b90e2d0520e11eb901eb5.1493171120.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-25Merge tag 'arc-4.11-final' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC fix from Vineet Gupta: "Last minute fixes for ARC: - build error in Mellanox nps platform - addressing lack of saving FPU regs in releavnt configs" * tag 'arc-4.11-final' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARCv2: entry: save Accumulator register pair (r58:59) if present ARC: [plat-eznps] Fix build error
2017-04-25nfsd: stricter decoding of write-like NFSv2/v3 opsJ. Bruce Fields
The NFSv2/v3 code does not systematically check whether we decode past the end of the buffer. This generally appears to be harmless, but there are a few places where we do arithmetic on the pointers involved and don't account for the possibility that a length could be negative. Add checks to catch these. Reported-by: Tuomas Haanpää <thaan@synopsys.com> Reported-by: Ari Kauppi <ari@synopsys.com> Reviewed-by: NeilBrown <neilb@suse.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25nfsd4: minor NFSv2/v3 write decoding cleanupJ. Bruce Fields
Use a couple shortcuts that will simplify a following bugfix. Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25nfsd: check for oversized NFSv2/v3 argumentsJ. Bruce Fields
A client can append random data to the end of an NFSv2 or NFSv3 RPC call without our complaining; we'll just stop parsing at the end of the expected data and ignore the rest. Encoded arguments and replies are stored together in an array of pages, and if a call is too large it could leave inadequate space for the reply. This is normally OK because NFS RPC's typically have either short arguments and long replies (like READ) or long arguments and short replies (like WRITE). But a client that sends an incorrectly long reply can violate those assumptions. This was observed to cause crashes. Also, several operations increment rq_next_page in the decode routine before checking the argument size, which can leave rq_next_page pointing well past the end of the page array, causing trouble later in svc_free_pages. So, following a suggestion from Neil Brown, add a central check to enforce our expectation that no NFSv2/v3 call has both a large call and a large reply. As followup we may also want to rewrite the encoding routines to check more carefully that they aren't running off the end of the page array. We may also consider rejecting calls that have any extra garbage appended. That would be safer, and within our rights by spec, but given the age of our server and the NFS protocol, and the fact that we've never enforced this before, we may need to balance that against the possibility of breaking some oddball client. Reported-by: Tuomas Haanpää <thaan@synopsys.com> Reported-by: Ari Kauppi <ari@synopsys.com> Cc: stable@vger.kernel.org Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-04-25ceph: fix recursion between ceph_set_acl() and __ceph_setattr()Yan, Zheng
ceph_set_acl() calls __ceph_setattr() if the setacl operation needs to modify inode's i_mode. __ceph_setattr() updates inode's i_mode, then calls posix_acl_chmod(). The problem is that __ceph_setattr() calls posix_acl_chmod() before sending the setattr request. The get_acl() call in posix_acl_chmod() can trigger a getxattr request. The reply of the getxattr request can restore inode's i_mode to its old value. The set_acl() call in posix_acl_chmod() sees old value of inode's i_mode, so it calls __ceph_setattr() again. Cc: stable@vger.kernel.org # needs backporting for < 4.9 Link: http://tracker.ceph.com/issues/19688 Reported-by: Jerry Lee <leisurelysw24@gmail.com> Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Tested-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-04-25net/packet: check length in getsockopt() called with PACKET_HDRLENAlexander Potapenko
In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4 |val| remains uninitialized and the syscall may behave differently depending on its value, and even copy garbage to userspace on certain architectures. To fix this we now return -EINVAL if optlen is too small. This bug has been detected with KMSAN. Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25net: ipv6: regenerate host route if moved to gc listDavid Ahern
Taking down the loopback device wreaks havoc on IPv6 routing. By extension, taking down a VRF device wreaks havoc on its table. Dmitry and Andrey both reported heap out-of-bounds reports in the IPv6 FIB code while running syzkaller fuzzer. The root cause is a dead dst that is on the garbage list gets reinserted into the IPv6 FIB. While on the gc (or perhaps when it gets added to the gc list) the dst->next is set to an IPv4 dst. A subsequent walk of the ipv6 tables causes the out-of-bounds access. Andrey's reproducer was the key to getting to the bottom of this. With IPv6, host routes for an address have the dst->dev set to the loopback device. When the 'lo' device is taken down, rt6_ifdown initiates a walk of the fib evicting routes with the 'lo' device which means all host routes are removed. That process moves the dst which is attached to an inet6_ifaddr to the gc list and marks it as dead. The recent change to keep global IPv6 addresses added a new function, fixup_permanent_addr, that is called on admin up. That function restarts dad for an inet6_ifaddr and when it completes the host route attached to it is inserted into the fib. Since the route was marked dead and moved to the gc list, re-inserting the route causes the reported out-of-bounds accesses. If the device with the address is taken down or the address is removed, the WARN_ON in fib6_del is triggered. All of those faults are fixed by regenerating the host route if the existing one has been moved to the gc list, something that can be determined by checking if the rt6i_ref counter is 0. Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional") Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25bridge: move bridge multicast cleanup to ndo_uninitXin Long
During removing a bridge device, if the bridge is still up, a new mdb entry still can be added in br_multicast_add_group() after all mdb entries are removed in br_multicast_dev_del(). Like the path: mld_ifc_timer_expire -> mld_sendpack -> ... br_multicast_rcv -> br_multicast_add_group The new mp's timer will be set up. If the timer expires after the bridge is freed, it may cause use-after-free panic in br_multicast_group_expired. BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 IP: [<ffffffffa07ed2c8>] br_multicast_group_expired+0x28/0xb0 [bridge] Call Trace: <IRQ> [<ffffffff81094536>] call_timer_fn+0x36/0x110 [<ffffffffa07ed2a0>] ? br_mdb_free+0x30/0x30 [bridge] [<ffffffff81096967>] run_timer_softirq+0x237/0x340 [<ffffffff8108dcbf>] __do_softirq+0xef/0x280 [<ffffffff8169889c>] call_softirq+0x1c/0x30 [<ffffffff8102c275>] do_softirq+0x65/0xa0 [<ffffffff8108e055>] irq_exit+0x115/0x120 [<ffffffff81699515>] smp_apic_timer_interrupt+0x45/0x60 [<ffffffff81697a5d>] apic_timer_interrupt+0x6d/0x80 Nikolay also found it would cause a memory leak - the mdb hash is reallocated and not freed due to the mdb rehash. unreferenced object 0xffff8800540ba800 (size 2048): backtrace: [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0 [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0 [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge] [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge] [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge] [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge] [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge] [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0 [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10 [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20 [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6] [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6] [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6] [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6] [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6] [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6] This could happen when ip link remove a bridge or destroy a netns with a bridge device inside. With Nikolay's suggestion, this patch is to clean up bridge multicast in ndo_uninit after bridge dev is shutdown, instead of br_dev_delete, so that netif_running check in br_multicast_add_group can avoid this issue. v1->v2: - fix this issue by moving br_multicast_dev_del to ndo_uninit, instead of calling dev_close in br_dev_delete. (NOTE: Depends upon b6fe0440c637 ("bridge: implement missing ndo_uninit()")) Fixes: e10177abf842 ("bridge: multicast: fix handling of temp and perm entries") Reported-by: Jianwen Ji <jiji@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-25nvme-lightnvm: add missing endianess conversion in nvme_nvm_end_ioChristoph Hellwig
Found by sparse. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bjørling <matias@cnexlabs.com>
2017-04-25nvme-scsi: Consider LBA format in IO splitting calculationJon Derrick
The current command submission code uses a sector-based value when considering the maximum number of blocks per command. With a 4k-formatted namespace and a command exceeding max hardware limits, this calculation doesn't split IOs which should be split and fails in the nvme layer. This patch fixes that calculation and enables IO splitting in these circumstances. Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Jens Axboe <axboe@fb.com> Signed-off-by: Christoph Hellwig <hch@lst.de>