summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-01-25fs: __fget_light() can use __fget() in slow pathOleg Nesterov
The slow path in __fget_light() can use __fget() to avoid the code duplication. Saves 232 bytes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25fs: factor out common code in fget_light() and fget_raw_light()Oleg Nesterov
Apart from FMODE_PATH check fget_light() and fget_raw_light() are identical, shift the code into the new helper, __fget_light(fd, mask). Saves 208 bytes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25fs: factor out common code in fget() and fget_raw()Oleg Nesterov
Apart from FMODE_PATH check fget() and fget_raw() are identical, shift the code into the new simple helper, __fget(fd, mask). Saves 160 bytes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25change close_files() to use rcu_dereference_raw(files->fdt)Oleg Nesterov
put_files_struct() and close_files() do rcu_read_lock() to make rcu_dereference_check_fdtable() happy. This looks a bit ugly, files_fdtable() just reads the pointer, we can simply use rcu_dereference_raw() to avoid the warning. The patch also changes close_files() to return fdt, this avoids another rcu_read_lock()/files_fdtable() in put_files_struct(). I think close_files() needs more cleanups: - we do not need xchg() exactly because we are the last user of this files_struct - "if (file)" should be turned into WARN_ON(!file) Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25introduce __fcheck_files() to fix rcu_dereference_check_fdtable(), kill ↵Oleg Nesterov
rcu_my_thread_group_empty() rcu_dereference_check_fdtable() looks very wrong, 1. rcu_my_thread_group_empty() was added by 844b9a8707f1 "vfs: fix RCU-lockdep false positive due to /proc" but it doesn't really fix the problem. A CLONE_THREAD (without CLONE_FILES) task can hit the same race with get_files_struct(). And otoh rcu_my_thread_group_empty() can suppress the correct warning if the caller is the CLONE_FILES (without CLONE_THREAD) task. 2. files->count == 1 check is not really right too. Even if this files_struct is not shared it is not safe to access it lockless unless the caller is the owner. Otoh, this check is sub-optimal. files->count == 0 always means it is safe to use it lockless even if files != current->files, but put_files_struct() has to take rcu_read_lock(). See the next patch. This patch removes the buggy checks and turns fcheck_files() into __fcheck_files() which uses rcu_dereference_raw(), the "unshared" callers, fget_light() and fget_raw_light(), can use it to avoid the warning from RCU-lockdep. fcheck_files() is trivially reimplemented as rcu_lockdep_assert() plus __fcheck_files(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25kill reiserfs_bdevname()Al Viro
it's never called with NULL argument... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25afs: get rid of junk in fs/afs/proc.cAl Viro
kill pointless method instances and don't bother with ->owner - it's ignored for procfs files anyway, make use of remove_proc_subtree() for removal, get rid of cell->proc_dir. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25nls: have register_nls() set ->ownerAl Viro
pass owner explicitly to __register_nls(), make register_nls() a macro passing THIS_MODULE as the owner argument to __register_nls(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25eventfd_ctx_fdget(): use fdget() instead of fget()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25btrfs: sanitize BTRFS_IOC_FILE_EXTENT_SAMEAl Viro
* don't assume that ->dest_count won't change between copy_from_user() and memdup_user() * use fdget instead of fget * don't bother comparing superblocks when we'd already compared vfsmounts * get rid of excessive goto * use file_inode() instead of open-coding the sucker Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25qnx4: clean qnx4_fill_super() upAl Viro
* pass on-disk superblock to qnx4_chkroot() explicitly * don't leave stale (and unused) pointers in qnx4_super_block * free stuff in ->kill_sb(); ->put_super() becomes empty and dies * simplify failure exits Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25efs: get rid of ->put_super()Al Viro
simplifies failure exits in ->mount()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25cramfs: take headers to fs/cramfsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25cramfs: get rid of ->put_super()Al Viro
failure exits are simpler that way Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25affs: use ->kill_sb() to simplify ->put_super() and failure exits of ->mount()Al Viro
... and return saner errors from ->mount(), while we are at it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25xfs: switch to kfree_put_link()Al Viro
don't bother open-coding it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25ecryptfs: fix failure handling in ->readlink()Al Viro
If ecryptfs_readlink_lower() fails, buf remains an uninitialized pointer and passing it nd_set_link() won't do anything good. Fixed by switching ecryptfs_readlink_lower() to saner API - make it return buf or ERR_PTR(...) and update callers. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-24xfs: allow logical-sector sized O_DIRECTEric Sandeen
Some time ago, mkfs.xfs started picking the storage physical sector size as the default filesystem "sector size" in order to avoid RMW costs incurred by doing IOs at logical sector size alignments. However, this means that for a filesystem made with i.e. a 4k sector size on an "advanced format" 4k/512 disk, 512-byte direct IOs are no longer allowed. This means that XFS has essentially turned this AF drive into a hard 4K device, from the filesystem on up. XFS's mkfs-specified "sector size" is really just controlling the minimum size & alignment of filesystem metadata. There is no real need to tightly couple XFS's minimal metadata size to the minimum allowed direct IO size; XFS can continue doing metadata in optimal sizes, but still allow smaller DIOs for apps which issue them, for whatever reason. This patch adds a new field to the xfs_buftarg, so that we now track 2 sizes: 1) The metadata sector size, which is the minimum unit and alignment of IO which will be performed by metadata operations. 2) The device logical sector size The first is used internally by the file system for metadata alignment and IOs. The second is used for the minimum allowed direct IO alignment. This has passed xfstests on filesystems made with 4k sectors, including when run under the patch I sent to ignore XFS_IOC_DIOINFO, and issue 512 DIOs anyway. I also directly tested end of block behavior on preallocated, sparse, and existing files when we do a 512 IO into a 4k file on a 4k-sector filesystem, to be sure there were no unexpected behaviors. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2014-01-24xfs: rename xfs_buftarg structure membersEric Sandeen
In preparation for adding new members to the structure, give these old ones more descriptive names: bt_ssize -> bt_meta_sectorsize bt_smask -> bt_meta_sectormask Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2014-01-24xfs: clean up xfs_buftargEric Sandeen
Clean up the xfs_buftarg structure a bit: - remove bt_bsize which is never used - replace bt_sshift with bt_ssize; we only ever shift it back Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2014-01-23romfs: fix returm err while getting inode in fill_superRui Xiang
Getting an inode by romfs_iget may lead to an err in fill_super, and the err value should be return. And it should return -ENOMEM instead while d_make_root fails, fix it too. Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23userns: relax the posix_acl_valid() checksAndreas Gruenbacher
So far, POSIX ACLs are using a canonical representation that keeps all ACL entries in a strict order; the ACL_USER and ACL_GROUP entries for specific users and groups are ordered by user and group identifier, respectively. The user-space code provides ACL entries in this order; the kernel verifies that the ACL entry order is correct in posix_acl_valid(). User namespaces allow to arbitrary map user and group identifiers which can cause the ACL_USER and ACL_GROUP entry order to differ between user space and the kernel; posix_acl_valid() would then fail. Work around this by allowing ACL_USER and ACL_GROUP entries to be in any order in the kernel. The effect is only minor: file permission checks will pick the first matching ACL_USER entry, and check all matching ACL_GROUP entries. (The libacl user-space library and getfacl / setfacl tools will not create ACLs with duplicate user or group idenfifiers; they will handle ACLs with entries in an arbitrary order correctly.) Signed-off-by: Andreas Gruenbacher <agruen@linbit.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Theodore Tso <tytso@mit.edu> Cc: Christoph Hellwig <hch@infradead.org> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs-ext3-use-rbtree-postorder-iteration-helper-instead-of-opencoding-fixAndrew Morton
use do{}while - more efficient and it squishes a coccinelle warning Reported-by: Fengguang Wu <fengguang.wu@intel.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/ext3: use rbtree postorder iteration helper instead of opencodingCody P Schafer
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead of opencoding an alternate postorder iteration that modifies the tree Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Michel Lespinasse <walken@google.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/jffs2: use rbtree postorder iteration helper instead of opencodingCody P Schafer
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead of opencoding an alternate postorder iteration that modifies the tree Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Michel Lespinasse <walken@google.com> Cc: Jan Kara <jack@suse.cz> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/ext4: use rbtree postorder iteration helper instead of opencodingCody P Schafer
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead of opencoding an alternate postorder iteration that modifies the tree Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Michel Lespinasse <walken@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/ubifs: use rbtree postorder iteration helper instead of opencodingCody P Schafer
Use rbtree_postorder_for_each_entry_safe() to destroy the rbtree instead of opencoding an alternate postorder iteration that modifies the tree Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Michel Lespinasse <walken@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/exec.c: call arch_pick_mmap_layout() only onceRichard Weinberger
Currently both setup_new_exec() and flush_old_exec() issue a call to arch_pick_mmap_layout(). As setup_new_exec() and flush_old_exec() are always called pairwise arch_pick_mmap_layout() is called twice. This patch removes one call from setup_new_exec() to have it only called once. Signed-off-by: Richard Weinberger <richard@nod.at> Tested-by: Pat Erley <pat-lkml@erley.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec: avoid propagating PF_NO_SETAFFINITY into userspace childZhang Yi
Userspace process doesn't want the PF_NO_SETAFFINITY, but its parent may be a kernel worker thread which has PF_NO_SETAFFINITY set, and this worker thread can do kernel_thread() to create the child. Clearing this flag in usersapce child to enable its migrating capability. Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/proc/array.c: change do_task_stat() to use while_each_thread()Oleg Nesterov
Change the remaining next_thread (ab)users to use while_each_thread(). The last user which should be changed is next_tid(), but we can't do this now. __exit_signal() and complete_signal() are fine, they actually need next_thread() logic. This patch (of 3): do_task_stat() can use while_each_thread(), no changes in the compiled code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Reviewed-by: Sameer Nanda <snanda@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec: kill task_struct->did_execOleg Nesterov
We can kill either task->did_exec or PF_FORKNOEXEC, they are mutually exclusive. The patch kills ->did_exec because it has a single user. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec: move the final allow_write_access/fput into free_bprm()Oleg Nesterov
Both success/failure paths cleanup bprm->file, we can move this code into free_bprm() to simlify and cleanup this logic. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec:check_unsafe_exec: kill the dead -EAGAIN and clear_in_exec logicOleg Nesterov
fs_struct->in_exec == T means that this ->fs is used by a single process (thread group), and one of the treads does do_execve(). To avoid the mt-exec races this code has the following complications: 1. check_unsafe_exec() returns -EBUSY if ->in_exec was already set by another thread. 2. do_execve_common() records "clear_in_exec" to ensure that the error path can only clear ->in_exec if it was set by current. However, after 9b1bf12d5d51 "signals: move cred_guard_mutex from task_struct to signal_struct" we do not need these complications: 1. We can't race with our sub-thread, this is called under per-process ->cred_guard_mutex. And we can't race with another CLONE_FS task, we already checked that this fs is not shared. We can remove the dead -EAGAIN logic. 2. "out_unmark:" in do_execve_common() is either called under ->cred_guard_mutex, or after de_thread() which kills other threads, so we can't race with sub-thread which could set ->in_exec. And if ->fs is shared with another process ->in_exec should be false anyway. We can clear in_exec unconditionally. This also means that check_unsafe_exec() can be void. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23exec:check_unsafe_exec: use while_each_thread() rather than next_thread()Oleg Nesterov
next_thread() should be avoided, change check_unsafe_exec() to use while_each_thread(). Nobody except signal->curr_target actually needs next_thread-like code, and we need to change (fix) this interface. This particular code is fine, p == current. But in general the code like this can loop forever if p exits and next_thread(t) can't reach the unhashed thread. This also saves 32 bytes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/proc: don't use module_init for non-modular core codePaul Gortmaker
PROC_FS is a bool, so this code is either present or absent. It will never be modular, so using module_init as an alias for __initcall is rather misleading. Fix this up now, so that we can relocate module_init from init.h into module.h in the future. If we don't do this, we'd have to add module.h to obviously non-modular code, and that would be ugly at best. Note that direct use of __initcall is discouraged, vs. one of the priority categorized subgroups. As __initcall gets mapped onto device_initcall, our use of fs_initcall (which makes sense for fs code) will thus change these registrations from level 6-device to level 5-fs (i.e. slightly earlier). However no observable impact of that small difference has been observed during testing, or is expected. Also note that this change uncovers a missing semicolon bug in the registration of vmcore_init as an initcall. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/proc_namespace.c: simplify testing nsp and nsp->mnt_nsAxel Lin
Trivial cleanup to eliminate a goto. Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/proc/proc_devtree.c: remove empty /proc/device-tree when no openfirmware ↵Dave Jones
exists. Distribution kernels might want to build in support for /proc/device-tree for kernels that might end up running on hardware that doesn't support openfirmware. This results in an empty /proc/device-tree existing. Remove it if the OFW root node doesn't exist. This situation actually confuses grub2, resulting in install failures. grub2 sees the /proc/device-tree and picks the wrong install target cf. http://bzr.savannah.gnu.org/lh/grub/trunk/grub/annotate/4300/util/grub-install.in#L311 grub should be more robust, but still, leaving an empty proc dir seems pointless. Addresses https://bugzilla.redhat.com/show_bug.cgi?id=818378. Signed-off-by: Dave Jones <davej@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Paul Mackerras <paulus@samba.org> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23proc: set attributes of pde using accessor functionsRui Xiang
Use existing accessors proc_set_user() and proc_set_size() to set attributes. Just a cleanup. Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23proc: fix ->f_pos overflows in first_tid()Oleg Nesterov
1. proc_task_readdir()->first_tid() path truncates f_pos to int, this is wrong even on 64bit. We could check that f_pos < PID_MAX or even INT_MAX in proc_task_readdir(), but this patch simply checks the potential overflow in first_tid(), this check is nop on 64bit. We do not care if it was negative and the new unsigned value is huge, all we need to ensure is that we never wrongly return !NULL. 2. Remove the 2nd "nr != 0" check before get_nr_threads(), nr_threads == 0 is not distinguishable from !pid_task() above. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sameer Nanda <snanda@chromium.org> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23proc: don't (ab)use ->group_leader in proc_task_readdir() pathsOleg Nesterov
proc_task_readdir() does not really need "leader", first_tid() has to revalidate it anyway. Just pass proc_pid(inode) to first_tid() instead, it can do pid_task(PIDTYPE_PID) itself and read ->group_leader only if necessary. The patch also extracts the "inode is dead" code from pid_delete_dentry(dentry) into the new trivial helper, proc_inode_is_dead(inode), proc_task_readdir() uses it to return -ENOENT if this dir was removed. This is a bit racy, but the race is very inlikely and the getdents() after openndir() can see the empty "." + ".." dir only once. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sameer Nanda <snanda@chromium.org> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23proc: change first_tid() to use while_each_thread() rather than next_thread()Oleg Nesterov
Rerwrite the main loop to use while_each_thread() instead of next_thread(). We are going to fix or replace while_each_thread(), next_thread() should be avoided whenever possible. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sameer Nanda <snanda@chromium.org> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23proc: fix the potential use-after-free in first_tid()Oleg Nesterov
proc_task_readdir() verifies that the result of get_proc_task() is pid_alive() and thus its ->group_leader is fine too. However this is not necessarily true after rcu_read_unlock(), we need to recheck this again after first_tid() does rcu_read_lock(). Otherwise leader->thread_group.next (used by next_thread()) can be invalid if the rcu grace period expires in between. The race is subtle and unlikely, but still it is possible afaics. To simplify lets ignore the "likely" case when tid != 0, f_version can be cleared by proc_task_operations->llseek(). Suppose we have a main thread M and its subthread T. Suppose that f_pos == 3, iow first_tid() should return T. Now suppose that the following happens between rcu_read_unlock() and rcu_read_lock(): 1. T execs and becomes the new leader. This removes M from ->thread_group but next_thread(M) is still T. 2. T creates another thread X which does exec as well, T goes away. 3. X creates another subthread, this increments nr_threads. 4. first_tid() does next_thread(M) and returns the already dead T. Note also that we need 2. and 3. only because of get_nr_threads() check, and this check was supposed to be optimization only. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Sameer Nanda <snanda@chromium.org> Cc: Sergey Dyasly <dserrg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23proc: cleanup/simplify get_task_state/task_state_arrayOleg Nesterov
get_task_state() and task_state_array[] look confusing and suboptimal, it is not clear what it can actually report to user-space and task_state_array[] blows .data for no reason. 1. state = (tsk->state & TASK_REPORT) | tsk->exit_state is not clear. TASK_REPORT is self-documenting but it is not clear what ->exit_state can add. Move the potential exit_state's (EXIT_ZOMBIE and EXIT_DEAD) into TASK_REPORT and use it to calculate the final result. 2. With the change above it is obvious that task_state_array[] has the unused entries just to make BUILD_BUG_ON() happy. Change this BUILD_BUG_ON() to use TASK_REPORT rather than TASK_STATE_MAX and shrink task_state_array[]. 3. Turn the "while (state)" loop into fls(state). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23coredump: make __get_dumpable/get_dumpable inline, kill fs/coredump.hOleg Nesterov
1. Remove fs/coredump.h. It is not clear why do we need it, it only declares __get_dumpable(), signal.c includes it for no reason. 2. Now that get_dumpable() and __get_dumpable() are really trivial make them inline in linux/sched.h. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alex Kelly <alex.page.kelly@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23coredump: kill MMF_DUMPABLE and MMF_DUMP_SECURELYOleg Nesterov
Nobody actually needs MMF_DUMPABLE/MMF_DUMP_SECURELY, they are only used to enforce the encoding of SUID_DUMP_* enum in mm->flags & MMF_DUMPABLE_MASK. Now that set_dumpable() updates both bits atomically we can kill them and simply store the value "as is" in 2 lower bits. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alex Kelly <alex.page.kelly@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23coredump: set_dumpable: fix the theoretical race with itselfOleg Nesterov
set_dumpable() updates MMF_DUMPABLE_MASK in a non-trivial way to ensure that get_dumpable() can't observe the intermediate state, but this all can't help if multiple threads call set_dumpable() at the same time. And in theory commit_creds()->set_dumpable(SUID_DUMP_ROOT) racing with sys_prctl()->set_dumpable(SUID_DUMP_DISABLE) can result in SUID_DUMP_USER. Change this code to update both bits atomically via cmpxchg(). Note: this assumes that it is safe to mix bitops and cmpxchg. IOW, if, say, an architecture implements cmpxchg() using the locking (like arch/parisc/lib/bitops.c does), then it should use the same locks for set_bit/etc. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alex Kelly <alex.page.kelly@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Vasily Kulikov <segoon@openwall.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23hfsplus: remove hfsplus_file_lookup()Sougata Santra
HFS+ resource fork lookup breaks opendir() library function. Since opendir first calls open() with O_DIRECTORY flag set. O_DIRECTORY means "refuse to open if not a directory". The open system call in the kernel does a check for inode->i_op->lookup and returns -ENOTDIR. So if hfsplus_file_lookup is set it allows opendir() for plain files. Also resource fork lookup in HFS+ does not work. Since it is never invoked after VFS permission checking. It will always return with -EACCES. When we call opendir() on a file, it does not return NULL. opendir() library call is based on open with O_DIRECTORY flag passed and then layered on top of getdents() system call. O_DIRECTORY means "refuse to open if not a directory". The open() system call in the kernel does a check for: do_sys_open() -->..--> can_lookup() i.e it only checks inode->i_op->lookup and returns ENOTDIR if this function pointer is not set. In OSX, we can open "file/rsrc" to get the resource fork of "file". This behavior is emulated inside hfsplus on Linux, which means that to some degree every file acts like a directory. That is the reason lookup() inode operations is supported for files, and it is possible to do a lookup on this specific name. As a result of this open succeeds without returning ENOTDIR for HFS+ Please see the LKML discussion thread on this issue: http://marc.info/?l=linux-fsdevel&m=122823343730412&w=2 I tried to test file/rsrc lookup in HFS+ driver and the feature does not work. From OSX: $ touch test $ echo "1234" > test/..namedfork/rsrc $ ls -l test..namedfork/rsrc --rw-r--r-- 1 tuxera staff 5 10 dec 12:59 test/..namedfork/rsrc [sougata@ultrabook tmp]$ id uid=1000(sougata) gid=1000(sougata) groups=1000(sougata),5(tty),18(dialout),1001(vboxusers) [sougata@ultrabook tmp]$ mount /dev/sdb1 on /mnt/tmp type hfsplus (rw,relatime,umask=0,uid=1000,gid=1000,nls=utf8) [sougata@ultrabook tmp]$ ls -l test/rsrc ls: cannot access test/rsrc: Permission denied According to this LKML thread it is expected behavior. http://marc.info/?t=121139033800008&r=1&w=4 I guess now that permission checking happens in vfs generic_permission() ? So it turns out that even though the lookup() inode_operation exists for HFS+ files. It cannot really get invoked ?. So if we can disable this feature to make opendir() work for HFS+. Signed-off-by: Sougata Santra <sougata@tuxera.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Anton Altaparmakov <aia21@cam.ac.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23nilfs2: add comments for ioctlsVyacheslav Dubeyko
Add comments for ioctls in fs/nilfs2/ioctl.c file and describe NILFS2 specific ioctls in Documentation/filesystems/nilfs2.txt. Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Reviewed-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Wenliang Fan <fanwlexca@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/nilfs2: fix integer overflow in nilfs_ioctl_wrap_copy()Wenliang Fan
The local variable 'pos' in nilfs_ioctl_wrap_copy function can overflow if a large number was passed to argv->v_index from userspace and the sum of argv->v_index and argv->v_nmembs exceeds the maximum value of __u64 type integer (= ~(__u64)0 = 18446744073709551615). Here, argv->v_index is a 64-bit width argument to specify the start position of target data items (such as segment number, checkpoint number, or virtual block address of nilfs), and argv->v_nmembs gives the total number of the items that userland programs (such as lssu, lscp, or cleanerd) want to get information about, which also gives the maximum element count of argv->v_base[] array. nilfs_ioctl_wrap_copy() calls dofunc() repeatedly and increments the position variable 'pos' at the end of each iteration if dofunc() itself didn't update 'pos': if (pos == ppos) pos += n; This patch prevents the overflow here by rejecting pairs of a start position (argv->v_index) and a total count (argv->v_nmembs) which leads to the overflow. [konishi.ryusuke@lab.ntt.co.jp: fix signedness issue] Signed-off-by: Wenliang Fan <fanwlexca@gmail.com> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23fs/pipe.c: skip file_update_time on frozen fsDmitry Monakhov
Pipe has no data associated with fs so it is not good idea to block pipe_write() if FS is frozen, but we can not update file's time on such filesystem. Let's use same idea as we use in touch_time(). Addresses https://bugzilla.kernel.org/show_bug.cgi?id=65701 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>