summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2018-12-17btrfs: Adjust loop in free_extent_bufferNikolay Borisov
The loop construct in free_extent_buffer was added in 242e18c7c1a8 ("Btrfs: reduce lock contention on extent buffer locks") as means of reducing the times the eb lock is taken, the non-last ref count is decremented and lock is released. As the special handling of UNMAPPED extent buffers was removed now there is only one decrement op which is happening for EXTENT_BUFFER_UNMAPPED case. This commit modifies the loop condition so that in case of UNMAPPED buffers the eb's lock is taken only if we are 100% sure the eb is going to be freed by the current executor of the code. Additionally, remove superfluous ref count ops in btrfs test. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove special handling of EXTENT_BUFFER_UNMAPPED while freeingNikolay Borisov
Now that the whole of btrfs code has been audited for eb reference count management it's time to remove the hunk in free_extent_buffer that essentially considered the condition "eb->ref == 2 && EXTENT_BUFFER_DUMMY" to equal "eb->ref = 1". Also remove the last location which takes an extra reference count in alloc_test_extent_buffer. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove unnecessary tree locking code in qgroup_rescan_leafNikolay Borisov
In qgroup_rescan_leaf a copy is made of the target leaf by calling btrfs_clone_extent_buffer. The latter allocates a new buffer and attaches a new set of pages and copies the content of the source buffer. The new scratch buffer is only used to iterate it's items, it's not published anywhere and cannot be accessed by a third party. Hence, it's not necessary to perform any locking on it whatsoever. Furthermore, remove the extra extent_buffer_get call since the new buffer is always allocated with a reference count of 1 which is sufficient here. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extra reference count bumps in btrfs_compare_treesNikolay Borisov
When the 2 comparison trees roots are initialised they are private to the function and already have reference counts of 1 each. There is no need to further increment the reference count since the cloned buffers are already accessed via struct btrfs_path. Eventually the 2 paths used for comparison are going to be released, effectively disposing of the cloned buffers. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extraneous extent_buffer_get from tree_mod_log_rewindNikolay Borisov
When a rewound buffer is created it already has a ref count of 1 and the dummy flag set. Then another ref is taken bumping the count to 2. Finally when this buffer is released from btrfs_release_path the extra reference is decremented by the special handling code in free_extent_buffer. However, this special code is in fact redundant sinca ref count of 1 is still correct since the buffer is only accessed via btrfs_path struct. This paves the way forward of removing the special handling in free_extent_buffer. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove redundant extent_buffer_get in get_old_rootNikolay Borisov
get_old_root used used only by btrfs_search_old_slot to initialise the path structure. The old root is always a cloned buffer (either via alloc dummy or via btrfs_clone_extent_buffer) and its reference count is 2: 1 from allocation, 1 from extent_buffer_get call in get_old_root. This latter explicit ref count acquire operation is in fact unnecessary since the semantic is such that the newly allocated buffer is handed over to the btrfs_path for lifetime management. Considering this just remove the extra extent_buffer_get in get_old_root. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove needless tree locking in iterate_inode_extrefsNikolay Borisov
In iterate_inode_exrefs the eb is cloned via btrfs_clone_extent_buffer which creates a private extent buffer with the dummy flag set and ref count of 1. Then this buffer is locked for reading and its ref count is incremented by 1. Finally it's fed to the passed iterate_irefs_t function. The actual iterate call back is inode_to_path (coming from paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final function the passed eb is only read by first assigning it to the local eb variable. This variable is only modified in the case another eb was referenced from the passed path that is eb != eb_in check triggers. Considering this there is no point in locking the cloned eb in iterate_inode_refs since it's never being modified and is not published anywhere. Furthermore the cloned eb is completely fine having its ref count be 1. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove needless tree locking in iterate_inode_refsNikolay Borisov
In iterate_inode_refs the eb is cloned via btrfs_clone_extent_buffer which creates a private extent buffer with the dummy flag set and ref count of 1. Then this buffer is locked for reading and its ref count is incremented by 1. Finally it's fed to the passed iterate_irefs_t function. The actual iterate call back is inode_to_path (coming from paths_from_inode) which feeds the eb to btrfs_ref_to_path. In this final function the passed eb is only read by first assigning it to the local eb variable. This variable is only modified in the case another eb was referenced from the passed path that is eb != eb_in check triggers. Considering this there is no point in locking the cloned eb in iterate_inode_refs since it's never being modified and is not published anywhere. Furthermore the cloned eb is completely fine having its ref count be 1. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: tests: Use BTRFS_MAX_EXTENT_SIZE to replace the intermediate numberQu Wenruo
In extent-io self test, we need 2 ordered extents at its maximum size to do the test. Instead of using the intermediate numbers, use BTRFS_MAX_EXTENT_SIZE for @max_bytes, and twice @max_bytes for @total_dirty. This should explain why we need all these magic numbers and prevent people to modify them by accident. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: support swap filesOmar Sandoval
Btrfs has not allowed swap files since commit 35054394c4b3 ("Btrfs: stop providing a bmap operation to avoid swapfile corruptions"). However, now that the proper restrictions are in place, Btrfs can support swap files through the swap file a_ops, similar to iomap in commit 67482129cdab ("iomap: add a swapfile activation function"). For Btrfs, activation needs to make sure that the file can be used as a swap file, which currently means that it must be fully allocated as NOCOW with no compression on one device. It must also do the proper tracking so that ioctls will not interfere with the swap file. Deactivation clears this tracking. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: rename and export get_chunk_mapOmar Sandoval
The Btrfs swap code is going to need it, so give it a btrfs_ prefix and make it non-static. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: prevent ioctls from interfering with a swap fileOmar Sandoval
A later patch will implement swap file support for Btrfs, but before we do that, we need to make sure that the various Btrfs ioctls cannot change a swap file. When a swap file is active, we must make sure that the extents of the file are not moved and that they don't become shared. That means that the following are not safe: - chattr +c (enable compression) - reflink - dedupe - snapshot - defrag Don't allow those to happen on an active swap file. Additionally, balance, resize, device remove, and device replace are also unsafe if they affect an active swapfile. Add a red-black tree of block groups and devices which contain an active swapfile. Relocation checks each block group against this tree and skips it or errors out for balance or resize, respectively. Device remove and device replace check the tree for the device they will operate on. Note that we don't have to worry about chattr -C (disable nocow), which we ignore for non-empty files, because an active swapfile must be non-empty and can't be truncated. We also don't have to worry about autodefrag because it's only done on COW files. Truncate and fallocate are already taken care of by the generic code. Device add doesn't do relocation so it's not an issue, either. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extent_io_ops::split_extent_hook callbackNikolay Borisov
This is the counterpart to merge_extent_hook, similarly, it's used only for data/freespace inodes so let's remove it, rename it and call it directly where necessary. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extent_io_ops::merge_extent_hook callbackNikolay Borisov
This callback is used only for data and free space inodes. Such inodes are guaranteed to have their extent_io_tree::private_data set to the inode struct. Exploit this fact to directly call the function. Also give it a more descriptive name. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extent_io_ops::clear_bit_hook callbackNikolay Borisov
This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent), similar to what was done before remove clear_bit_hook and rename the function. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extent_io_ops::set_bit_hook extent_io callbackNikolay Borisov
This callback is used to properly account delalloc extents for data inodes (ordinary file inodes and freespace v1 inodes). Those can be easily identified since they have their extent_io trees ->private_data member point to the inode. Let's exploit this fact to remove the needless indirection through extent_io_hooks and directly call the function. Also give the function a name which reflects its purpose - btrfs_set_delalloc_extent. This patch also modified test_find_delalloc so that the extent_io_tree used for testing doesn't have its ->private_data set which would have caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root member not being initialised. The old version of the code also didn't call set_bit_hook since the extent_io ops weren't set for the inode. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extent_io_ops::check_extent_io_range callbackNikolay Borisov
This callback was only used in debug builds by btrfs_leak_debug_check. A better approach is to move its implementation in btrfs_leak_debug_check and ensure the latter is only executed for extent tree which have ->private_data set i.e. relate to a data node and not the btree one. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extent_io_ops::writepage_end_io_hookNikolay Borisov
This callback is ony ever called for data page writeout so there is no need to actually abstract it via extent_io_ops. Lets just export it, remove the definition of the callback and call it directly in the functions that invoke the callback. Also rename the function to btrfs_writepage_endio_finish_ordered since what it really does is account finished io in the ordered extent data structures. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extent_io_ops::writepage_start_hookNikolay Borisov
This hook is called only from __extent_writepage_io which is already called only from the data page writeout path. So there is no need to make an indirect call via extent_io_ops. This patch just removes the callback definition, exports the callback function and calls it directly at the only call site. Also give the function a more descriptive name. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Remove extent_io_ops::fill_delallocNikolay Borisov
This callback is called only from writepage_delalloc which in turn is guaranteed to be called from the data page writeout path. In the end there is no reason to have the call to this function to be indrected via the extent_io_ops structure. This patch removes the callback definition, exports the function and calls it directly. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename to btrfs_run_delalloc_range ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Add function to distinguish between data and btree inodeNikolay Borisov
This will be used in future patches that remove the optional extent_io_ops callbacks. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: volumes: Make sure no dev extent is beyond device boundaryQu Wenruo
Add extra dev extent end check against device boundary. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: volumes: Make sure there is no overlap of dev extents at mount timeQu Wenruo
Enhance btrfs_verify_dev_extents() to remember previous checked dev extents, so it can verify no dev extents can overlap. Analysis from Hans: "Imagine allocating a DATA|DUP chunk. In the chunk allocator, we first set... max_stripe_size = SZ_1G; max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE ... which is 10GiB. Then... /* we don't want a chunk larger than 10% of writeable space */ max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1), max_chunk_size); Imagine we only have one 7880MiB block device in this filesystem. Now max_chunk_size is down to 788MiB. The next step in the code is to search for max_stripe_size * dev_stripes amount of free space on the device, which is in our example 1GiB * 2 = 2GiB. Imagine the device has exactly 1578MiB free in one contiguous piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail Next we recalculate the stripe_size (which is actually the device extent length), based on the actual maximum amount of available raw disk space: stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes); stripe_size is now 789MiB Next we do... data_stripes = num_stripes / ncopies ...where data_stripes ends up as 1, because num_stripes is 2 (the amount of device extents we're going to have), and DUP has ncopies 2. Next there's a check... if (stripe_size * data_stripes > max_chunk_size) ...which matches because 789MiB * 1 > 788MiB. We go into the if code, and next is... stripe_size = div_u64(max_chunk_size, data_stripes); ...which resets stripe_size to max_chunk_size: 788MiB Next is a fun one... /* bump the answer up to a 16MB boundary */ stripe_size = round_up(stripe_size, SZ_16M); ...which changes stripe_size from 788MiB to 800MiB. We're not done changing stripe_size yet... /* But don't go higher than the limits we found while searching * for free extents */ stripe_size = min(devices_info[ndevs - 1].max_avail, stripe_size); This is bad. max_avail is twice the stripe_size (we need to fit 2 device extents on the same device for DUP). The result here is that 800MiB < 1578MiB, so it's unchanged. However, the resulting DUP chunk will need 1600MiB disk space, which isn't there, and the second dev_extent might extend into the next thing (next dev_extent? end of device?) for 22MiB. The last shown line of code relies on a situation where there's twice the value of stripe_size present as value for the variable stripe_size when it's DUP. This was actually the case before commit 92e222df7b "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote: "[...] in the meantime there's a check to see if the stripe_size does not exceed max_chunk_size. Since during this check stripe_size is twice the amount as intended, the check will reduce the stripe_size to max_chunk_size if the actual correct to be used stripe_size is more than half the amount of max_chunk_size." In the previous version of the code, the 16MiB alignment (why is this done, by the way?) would result in a 50% chance that it would actually do an 8MiB alignment for the individual dev_extents, since it was operating on double the size. Does this matter? Does it matter that stripe_size can be set to anything which is not 16MiB aligned because of the amount of remaining available disk space which is just taken? What is the main purpose of this round_up? The most straightforward thing to do seems something like... stripe_size = min( div_u64(devices_info[ndevs - 1].max_avail, dev_stripes), stripe_size ) ..just putting half of the max_avail into stripe_size." Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/ Reported-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com> Signed-off-by: Qu Wenruo <wqu@suse.com> [ add analysis from report ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Refactor find_free_extent loops update into find_free_extent_update_loopQu Wenruo
We have a complex loop design for find_free_extent(), that has different behavior for each loop, some even includes new chunk allocation. Instead of putting such a long code into find_free_extent() and makes it harder to read, just extract them into find_free_extent_update_loop(). With all the cleanups, the main find_free_extent() should be pretty barebone: find_free_extent() |- Iterate through all block groups | |- Get a valid block group | |- Try to do clustered allocation in that block group | |- Try to do unclustered allocation in that block group | |- Check if the result is valid | | |- If valid, then exit | |- Jump to next block group | |- Push harder to find free extents |- If not found, re-iterate all block groups Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> [ copy callchain from changelog to function comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Refactor unclustered extent allocation into ↵Qu Wenruo
find_free_extent_unclustered() This patch will extract unclsutered extent allocation code into find_free_extent_unclustered(). And this helper function will use return value to indicate what to do next. This should make find_free_extent() a little easier to read. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> [Update merge conflict with fb5c39d7a887 ("btrfs: don't use ctl->free_space for max_extent_size")] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Refactor clustered extent allocation into find_free_extent_clusteredQu Wenruo
We have two main methods to find free extents inside a block group: 1) clustered allocation 2) unclustered allocation This patch will extract the clustered allocation into find_free_extent_clustered() to make it a little easier to read. Instead of jumping between different labels in find_free_extent(), the helper function will use return value to indicate different behavior. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Introduce find_free_extent_ctl structure for later reworkQu Wenruo
Instead of tons of different local variables in find_free_extent(), extract them into find_free_extent_ctl structure, and add better explanation for them. Some modification may looks redundant, but will later greatly simplify function parameter list during find_free_extent() refactor. Also add two comments to co-operate with fb5c39d7a887 ("btrfs: don't use ctl->free_space for max_extent_size"), to make ffe_ctl->max_extent_size update more reader-friendly. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: extent-tree: Detect bytes_pinned underflow earlierLu Fengqi
Introduce a new wrapper update_bytes_pinned to replace open coded bytes_pinned modifiers. Now the underflows of space_info::bytes_pinned get detected and reported. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: extent-tree: Detect bytes_may_use underflow earlierQu Wenruo
Although we have space_info::bytes_may_use underflow detection in btrfs_free_reserved_data_space_noquota(), we have more callers who are subtracting number from space_info::bytes_may_use. So instead of doing underflow detection for every caller, introduce a new wrapper update_bytes_may_use() to replace open coded bytes_may_use modifiers. This also introduce a macro to declare more wrappers, but currently space_info::bytes_may_use is the mostly interesting one. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: remove no longer used stuff for tracking pending ordered extentsFilipe Manana
Tracking pending ordered extents per transaction was introduced in commit 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current transaction V3") and later updated in commit 161c3549b45a ("Btrfs: change how we wait for pending ordered extents"). However now that on fsync we always wait for ordered extents to complete before logging, done in commit 5636cf7d6dc8 ("btrfs: remove the logged extents infrastructure"), we no longer need the stuff to track for pending ordered extents, which was not completely removed in the mentioned commit. So remove the remaining of the pending ordered extents infrastructure. Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17Btrfs: remove no longer used logged range variables when logging extentsFilipe Manana
The logged_start and logged_end variables, at btrfs_log_changed_extents, were added in commit 8c6c592831a0 ("btrfs: log csums for all modified extents"). However since the recent simplification for fsync, which makes us wait for all ordered extents to complete before logging extents, we no longer need those variables. Commit a2120a473a80 ("btrfs: clean up the left over logged_list usage") forgot to remove them. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "11 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: scripts/spdxcheck.py: always open files in binary mode checkstack.pl: fix for aarch64 userfaultfd: check VM_MAYWRITE was set after verifying the uffd is registered fs/iomap.c: get/put the page in iomap_page_create/release() hugetlbfs: call VM_BUG_ON_PAGE earlier in free_huge_page() memblock: annotate memblock_is_reserved() with __init_memblock psi: fix reference to kernel commandline enable arch/sh/include/asm/io.h: provide prototypes for PCI I/O mapping in asm/io.h mm/sparse: add common helper to mark all memblocks present mm: introduce common STRUCT_PAGE_MAX_SHIFT define alpha: fix hang caused by the bootmem removal
2018-12-14userfaultfd: check VM_MAYWRITE was set after verifying the uffd is registeredAndrea Arcangeli
Calling UFFDIO_UNREGISTER on virtual ranges not yet registered in uffd could trigger an harmless false positive WARN_ON. Check the vma is already registered before checking VM_MAYWRITE to shut off the false positive warning. Link: http://lkml.kernel.org/r/20181206212028.18726-2-aarcange@redhat.com Cc: <stable@vger.kernel.org> Fixes: 29ec90660d68 ("userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: syzbot+06c7092e7d71218a2c16@syzkaller.appspotmail.com Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-14fs/iomap.c: get/put the page in iomap_page_create/release()Piotr Jaroszynski
migrate_page_move_mapping() expects pages with private data set to have a page_count elevated by 1. This is what used to happen for xfs through the buffer_heads code before the switch to iomap in commit 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads"). Not having the count elevated causes move_pages() to fail on memory mapped files coming from xfs. Make iomap compatible with the migrate_page_move_mapping() assumption by elevating the page count as part of iomap_page_create() and lowering it in iomap_page_release(). It causes the move_pages() syscall to misbehave on memory mapped files from xfs. It does not not move any pages, which I suppose is "just" a perf issue, but it also ends up returning a positive number which is out of spec for the syscall. Talking to Michal Hocko, it sounds like returning positive numbers might be a necessary update to move_pages() anyway though (https://lkml.kernel.org/r/20181116114955.GJ14706@dhcp22.suse.cz). I only hit this in tests that verify that move_pages() actually moved the pages. The test also got confused by the positive return from move_pages() (it got treated as a success as positive numbers were not expected and not handled) making it a bit harder to track down what's going on. Link: http://lkml.kernel.org/r/20181115184140.1388751-1-pjaroszynski@nvidia.com Fixes: 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads") Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Brian Foster <bfoster@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-14Merge tag 'for-linus-20181214' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Three small fixes for this week. contains: - spectre indexing fix for aio (Jeff) - fix for the previous zeroing bio fix, we don't need it for user mapped pages, and in fact it breaks some applications if we do (Keith) - allocation failure fix for null_blk with zoned (Shin'ichiro)" * tag 'for-linus-20181214' of git://git.kernel.dk/linux-block: block: Fix null_blk_zoned creation failure with small number of zones aio: fix spectre gadget in lookup_ioctx block/bio: Do not zero user pages
2018-12-14Merge tag 'ceph-for-4.20-rc7' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fix from Ilya Dryomov: "Luis discovered a problem with the new copyfrom offload on the server side. Disable it for now" * tag 'ceph-for-4.20-rc7' of https://github.com/ceph/ceph-client: ceph: make 'nocopyfrom' a default mount option
2018-12-12Merge tag 'ovl-fixes-4.20-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs fixes from Miklos Szeredi: "Needed to revert a patch, because it possibly introduces a security hole. Since the patch is basically a conceptual cleanup, not a bug fix, it's safe to revert. I'm not giving up on this, and discussions seemed to have reached an agreement over how to move forward, but that can wait 'till the next release. The other two patches are fixes for bugs introduced in recent releases" * tag 'ovl-fixes-4.20-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: Revert "ovl: relax permission checking on underlying layers" ovl: fix decode of dir file handle with multi lower layers ovl: fix missing override creds in link of a metacopy upper
2018-12-12Merge tag 'fuse-fixes-4.20-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: "There's one patch fixing a minor but long lived bug, the others are fixing regressions introduced in this cycle" * tag 'fuse-fixes-4.20-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: continue to send FUSE_RELEASEDIR when FUSE_OPEN returns ENOSYS fuse: Fix memory leak in fuse_dev_free() fuse: fix revalidation of attributes for permission check fuse: fix fsync on directory fuse: Add bad inode check in fuse_destroy_inode()
2018-12-11fuse: continue to send FUSE_RELEASEDIR when FUSE_OPEN returns ENOSYSChad Austin
When FUSE_OPEN returns ENOSYS, the no_open bit is set on the connection. Because the FUSE_RELEASE and FUSE_RELEASEDIR paths share code, this incorrectly caused the FUSE_RELEASEDIR request to be dropped and never sent to userspace. Pass an isdir bool to distinguish between FUSE_RELEASE and FUSE_RELEASEDIR inside of fuse_file_put. Fixes: 7678ac50615d ("fuse: support clients that don't implement 'open'") Cc: <stable@vger.kernel.org> # v3.14 Signed-off-by: Chad Austin <chadaustin@fb.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-12-11aio: fix spectre gadget in lookup_ioctxJeff Moyer
Matthew pointed out that the ioctx_table is susceptible to spectre v1, because the index can be controlled by an attacker. The below patch should mitigate the attack for all of the aio system calls. Cc: stable@vger.kernel.org Reported-by: Matthew Wilcox <willy@infradead.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11ceph: make 'nocopyfrom' a default mount optionLuis Henriques
Since we found a problem with the 'copy-from' operation after objects have been truncated, offloading object copies to OSDs should be discouraged until the issue is fixed. Thus, this patch adds the 'nocopyfrom' mount option to the default mount options which effectily means that remote copies won't be done in copy_file_range unless they are explicitly enabled at mount time. [ Adjust ceph_show_options() accordingly. ] Link: https://tracker.ceph.com/issues/37378 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-12-10fuse: Fix memory leak in fuse_dev_free()Takeshi Misawa
When ntfs is unmounted, the following leak is reported by kmemleak. kmemleak report: unreferenced object 0xffff880052bf4400 (size 4096): comm "mount.ntfs", pid 16530, jiffies 4294861127 (age 3215.836s) hex dump (first 32 bytes): 00 44 bf 52 00 88 ff ff 00 44 bf 52 00 88 ff ff .D.R.....D.R.... 10 44 bf 52 00 88 ff ff 10 44 bf 52 00 88 ff ff .D.R.....D.R.... backtrace: [<00000000bf4a2f8d>] fuse_fill_super+0xb22/0x1da0 [fuse] [<000000004dde0f0c>] mount_bdev+0x263/0x320 [<0000000025aebc66>] mount_fs+0x82/0x2bf [<0000000042c5a6be>] vfs_kern_mount.part.33+0xbf/0x480 [<00000000ed10cd5b>] do_mount+0x3de/0x2ad0 [<00000000d59ff068>] ksys_mount+0xba/0xd0 [<000000001bda1bcc>] __x64_sys_mount+0xba/0x150 [<00000000ebe26304>] do_syscall_64+0x151/0x490 [<00000000d25f2b42>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [<000000002e0abd2c>] 0xffffffffffffffff fuse_dev_alloc() allocate fud->pq.processing. But this hash table is not freed. Fix this by freeing fud->pq.processing. Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: be2ff42c5d6e ("fuse: Use hash table to link processing request")
2018-12-09Merge tag '4.20-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Three small fixes: a fix for smb3 direct i/o, a fix for CIFS DFS for stable and a minor cifs Kconfig fix" * tag '4.20-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Avoid returning EBUSY to upper layer VFS cifs: Fix separator when building path from dentry cifs: In Kconfig CONFIG_CIFS_POSIX needs depends on legacy (insecure cifs)
2018-12-09Merge tag 'dax-fixes-4.20-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax fixes from Dan Williams: "The last of the known regression fixes and fallout from the Xarray conversion of the filesystem-dax implementation. On the path to debugging why the dax memory-failure injection test started failing after the Xarray conversion a couple more fixes for the dax_lock_mapping_entry(), now called dax_lock_page(), surfaced. Those plus the bug that started the hunt are now addressed. These patches have appeared in a -next release with no issues reported. Note the touches to mm/memory-failure.c are just the conversion to the new function signature for dax_lock_page(). Summary: - Fix the Xarray conversion of fsdax to properly handle dax_lock_mapping_entry() in the presense of pmd entries - Fix inode destruction racing a new lock request" * tag 'dax-fixes-4.20-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Fix unlock mismatch with updated API dax: Don't access a freed inode dax: Check page->mapping isn't NULL
2018-12-08Merge tag 'xfs-4.20-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Darrick Wong: "Here are hopefully the last set of fixes for 4.20. There's a fix for a longstanding statfs reporting problem with project quotas, a correction for page cache invalidation behaviors when fallocating near EOF, and a fix for a broken metadata verifier return code. Finally, the most important fix is to the pipe splicing code (aka the generic copy_file_range fallback) to avoid pointless short directio reads by only asking the filesystem for as much data as there are available pages in the pipe buffer. Our previous fix (simulated short directio reads because the number of pages didn't match the length of the read requested) caused subtle problems on overlayfs, so that part is reverted. Anyhow, this series passes fstests -g all on xfs and overlay+xfs, and has passed 17 billion fsx operations problem-free since I started testing Summary: - Fix broken project quota inode counts - Fix incorrect PAGE_MASK/PAGE_SIZE usage - Fix incorrect return value in btree verifier - Fix WARN_ON remap flags false positive - Fix splice read overflows" * tag 'xfs-4.20-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: partially revert 4721a601099 (simulated directio short read on EFAULT) splice: don't read more than available pipe space vfs: allow some remap flags to be passed to vfs_clone_file_range xfs: fix inverted return from xfs_btree_sblock_verify_crc xfs: fix PAGE_MASK usage in xfs_free_file_space fs/xfs: fix f_ffree value for statfs when project quota is set
2018-12-07CIFS: Avoid returning EBUSY to upper layer VFSLong Li
EBUSY is not handled by VFS, and will be passed to user-mode. This is not correct as we need to wait for more credits. This patch also fixes a bug where rsize or wsize is used uninitialized when the call to server->ops->wait_mtu_credits() fails. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Long Li <longli@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-12-06Merge tag 'nfs-for-4.20-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "This is mainly fallout from the updates to the SUNRPC code that is being triggered from less common combinations of NFS mount options. Highlights include: Stable fixes: - Fix a page leak when using RPCSEC_GSS/krb5p to encrypt data. Bugfixes: - Fix a regression that causes the RPC receive code to hang - Fix call_connect_status() so that it handles tasks that got transmitted while queued waiting for the socket lock. - Fix a memory leak in call_encode() - Fix several other connect races. - Fix receive code error handling. - Use the discard iterator rather than MSG_TRUNC for compatibility with AF_UNIX/AF_LOCAL sockets. - nfs: don't dirty kernel pages read by direct-io - pnfs/Flexfiles fix to enforce per-mirror stateid only for NFSv4 data servers" * tag 'nfs-for-4.20-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: SUNRPC: Don't force a redundant disconnection in xs_read_stream() SUNRPC: Fix up socket polling SUNRPC: Use the discard iterator rather than MSG_TRUNC SUNRPC: Treat EFAULT as a truncated message in xs_read_stream_request() SUNRPC: Fix up handling of the XDRBUF_SPARSE_PAGES flag SUNRPC: Fix RPC receive hangs SUNRPC: Fix a potential race in xprt_connect() SUNRPC: Fix a memory leak in call_encode() SUNRPC: Fix leak of krb5p encode pages SUNRPC: call_connect_status() must handle tasks that got transmitted nfs: don't dirty kernel pages read by direct-io flexfiles: enforce per-mirror stateid only for v4 DSes
2018-12-06cifs: Fix separator when building path from dentryPaulo Alcantara
Make sure to use the CIFS_DIR_SEP(cifs_sb) as path separator for prefixpath too. Fixes a bug with smb1 UNIX extensions. Fixes: a6b5058fafdf ("fs/cifs: make share unaccessible at root level mountable") Signed-off-by: Paulo Alcantara <palcantara@suse.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>
2018-12-06cifs: In Kconfig CONFIG_CIFS_POSIX needs depends on legacy (insecure cifs)Steve French
Missing a dependency. Shouldn't show cifs posix extensions in Kconfig if CONFIG_CIFS_ALLOW_INSECURE_DIALECTS (ie SMB1 protocol) is disabled. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-12-05Merge tag 'for-4.20-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A patch in 4.19 introduced a sanity check that was too strict and a filesystem cannot be mounted. This happens for filesystems with more than 10 devices and has been reported by a few users so we need the fix to propagate to stable" * tag 'for-4.20-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable