Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull more random number generator updates from Jason Donenfeld:
"This time with some large scale treewide cleanups.
The intent of this pull is to clean up the way callers fetch random
integers. The current rules for doing this right are:
- If you want a secure or an insecure random u64, use get_random_u64()
- If you want a secure or an insecure random u32, use get_random_u32()
The old function prandom_u32() has been deprecated for a while
now and is just a wrapper around get_random_u32(). Same for
get_random_int().
- If you want a secure or an insecure random u16, use get_random_u16()
- If you want a secure or an insecure random u8, use get_random_u8()
- If you want secure or insecure random bytes, use get_random_bytes().
The old function prandom_bytes() has been deprecated for a while
now and has long been a wrapper around get_random_bytes()
- If you want a non-uniform random u32, u16, or u8 bounded by a
certain open interval maximum, use prandom_u32_max()
I say "non-uniform", because it doesn't do any rejection sampling
or divisions. Hence, it stays within the prandom_*() namespace, not
the get_random_*() namespace.
I'm currently investigating a "uniform" function for 6.2. We'll see
what comes of that.
By applying these rules uniformly, we get several benefits:
- By using prandom_u32_max() with an upper-bound that the compiler
can prove at compile-time is ≤65536 or ≤256, internally
get_random_u16() or get_random_u8() is used, which wastes fewer
batched random bytes, and hence has higher throughput.
- By using prandom_u32_max() instead of %, when the upper-bound is
not a constant, division is still avoided, because
prandom_u32_max() uses a faster multiplication-based trick instead.
- By using get_random_u16() or get_random_u8() in cases where the
return value is intended to indeed be a u16 or a u8, we waste fewer
batched random bytes, and hence have higher throughput.
This series was originally done by hand while I was on an airplane
without Internet. Later, Kees and I worked on retroactively figuring
out what could be done with Coccinelle and what had to be done
manually, and then we split things up based on that.
So while this touches a lot of files, the actual amount of code that's
hand fiddled is comfortably small"
* tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
prandom: remove unused functions
treewide: use get_random_bytes() when possible
treewide: use get_random_u32() when possible
treewide: use get_random_{u8,u16}() when possible, part 2
treewide: use get_random_{u8,u16}() when possible, part 1
treewide: use prandom_u32_max() when possible, part 2
treewide: use prandom_u32_max() when possible, part 1
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS updates from Richard Weinberger:
"UBI:
- Use bitmap API to allocate bitmaps
- New attach mode, disable_fm, to attach without fastmap
- Fixes for various typos in comments
UBIFS:
- Fix for a deadlock when setting xattrs for encrypted file
- Fix for an assertion failures when truncating encrypted files
- Fixes for various typos in comments"
* tag 'for-linus-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: fastmap: Add fastmap control support for 'UBI_IOCATT' ioctl
ubi: fastmap: Use the bitmap API to allocate bitmaps
ubifs: Fix AA deadlock when setting xattr for encrypted file
ubifs: Fix UBIFS ro fail due to truncate in the encrypted directory
mtd: ubi: drop unexpected word 'a' in comments
ubi: block: Fix typos in comments
ubi: fastmap: Fix typo in comments
ubi: Fix repeated words in comments
ubi: ubi-media.h: Fix comment typo
ubi: block: Remove in vain semicolon
ubifs: Fix ubifs_check_dir_empty() kernel-doc comment
|
|
The prandom_bytes() function has been a deprecated inline wrapper around
get_random_bytes() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. This was done as a basic find and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> # powerpc
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:
@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)
@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@
- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);
// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@
value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
print("Skipping 0x%x for cleanup elsewhere" % (value))
cocci.include_match(False)
elif value & (value + 1) != 0:
print("Skipping 0x%x because it's not a power of two minus one" % (value))
cocci.include_match(False)
elif literal.startswith('0x'):
coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@
- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)
@collapse_ret@
type T;
identifier VAR;
expression E;
@@
{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}
@drop_var@
type T;
identifier VAR;
@@
{
- T VAR;
... when != VAR
}
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This is in preparation for adding tmpfile support to fuse, which requires
that the tmpfile creation and opening are done as a single operation.
Replace the 'struct dentry *' argument of i_op->tmpfile with
'struct file *'.
Call finish_open_simple() as the last thing in ->tmpfile() instances (may
be omitted in the error case).
Change d_tmpfile() argument to 'struct file *' as well to make callers more
readable.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Following process:
vfs_setxattr(host)
ubifs_xattr_set
down_write(host_ui->xattr_sem) <- lock first time
create_xattr
ubifs_new_inode(host)
fscrypt_prepare_new_inode(host)
fscrypt_policy_to_inherit(host)
if (IS_ENCRYPTED(inode))
fscrypt_require_key(host)
fscrypt_get_encryption_info(host)
ubifs_xattr_get(host)
down_read(host_ui->xattr_sem) <- AA deadlock
, which may trigger an AA deadlock problem:
[ 102.620871] INFO: task setfattr:1599 blocked for more than 10 seconds.
[ 102.625298] Not tainted 5.19.0-rc7-00001-gb666b6823ce0-dirty #711
[ 102.628732] task:setfattr state:D stack: 0 pid: 1599
[ 102.628749] Call Trace:
[ 102.628753] <TASK>
[ 102.628776] __schedule+0x482/0x1060
[ 102.629964] schedule+0x92/0x1a0
[ 102.629976] rwsem_down_read_slowpath+0x287/0x8c0
[ 102.629996] down_read+0x84/0x170
[ 102.630585] ubifs_xattr_get+0xd1/0x370 [ubifs]
[ 102.630730] ubifs_crypt_get_context+0x1f/0x30 [ubifs]
[ 102.630791] fscrypt_get_encryption_info+0x7d/0x1c0
[ 102.630810] fscrypt_policy_to_inherit+0x56/0xc0
[ 102.630817] fscrypt_prepare_new_inode+0x35/0x160
[ 102.630830] ubifs_new_inode+0xcc/0x4b0 [ubifs]
[ 102.630873] ubifs_xattr_set+0x591/0x9f0 [ubifs]
[ 102.630961] xattr_set+0x8c/0x3e0 [ubifs]
[ 102.631003] __vfs_setxattr+0x71/0xc0
[ 102.631026] vfs_setxattr+0x105/0x270
[ 102.631034] do_setxattr+0x6d/0x110
[ 102.631041] setxattr+0xa0/0xd0
[ 102.631087] __x64_sys_setxattr+0x2f/0x40
Fetch a reproducer in [Link].
Just like ext4 does, which skips encrypting for inode with
EXT4_EA_INODE_FL flag. Stop encypting xattr inode for ubifs.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216260
Fixes: f4e3634a3b64222 ("ubifs: Fix races between xattr_{set|get} ...")
Fixes: d475a507457b5ca ("ubifs: Add skeleton for fscrypto")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
The ubifs_compress() function does not compress the data When the
data length is short than 128 bytes or the compressed data length
is not ideal.It cause that the compressed length of the truncated
data in the truncate_data_node() function may be greater than the
length of the raw data read from the flash.
The above two lengths are transferred to the ubifs_encrypt()
function as parameters. This may lead to assertion fails and then
the file system becomes read-only.
This patch use the actual length of the data in the memory as the
input parameter for assert comparison, which avoids the problem.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216213
Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Fix function name in fs/ubifs/dir.c kernel-doc comment
to remove warning found by running scripts/kernel-doc,
which is caused by using 'make W=1'.
fs/ubifs/dir.c:883: warning: expecting prototype for check_dir_empty().
Prototype was for ubifs_check_dir_empty() instead
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
|
|
filemap_migrate_folio() is a little more general than ubifs really needs,
but it's better to share the code.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
Currently shrinkers are anonymous objects. For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.
This commit adds names to shrinkers. register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.
In some cases it's not possible to determine a good name at the time when
a shrinker is allocated. For such cases shrinker_debugfs_rename() is
provided.
The expected format is:
<subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.
After this change the shrinker debugfs directory looks like:
$ cd /sys/kernel/debug/shrinker/
$ ls
dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
sb-dax-11 sb-proc-45 sb-tmpfs-35
sb-debugfs-7 sb-proc-46 sb-tmpfs-40
[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull JFFS2, UBI and UBIFS updates from Richard Weinberger:
"JFFS2:
- Fixes for a memory leak
UBI:
- Fixes for fastmap (UAF, high CPU usage)
UBIFS:
- Minor cleanups"
* tag 'for-linus-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: ubi_create_volume: Fix use-after-free when volume creation failed
ubi: fastmap: Check wl_pool for free peb before wear leveling
ubi: fastmap: Fix high cpu usage of ubi_bgt by making sure wl_pool not empty
ubifs: Use NULL instead of using plain integer as pointer
ubifs: Simplify the return expression of run_gc()
jffs2: fix memory leak in jffs2_do_fill_super
jffs2: Use kzalloc instead of kmalloc/memset
|
|
This fixes the following sparse warnings:
fs/ubifs/xattr.c:680:58: warning: Using plain integer as NULL pointer
Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Simplify the return expression.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Reviewed-by: Tudor Ambarus <tudor.ambarus@microchip.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Pull page cache updates from Matthew Wilcox:
- Appoint myself page cache maintainer
- Fix how scsicam uses the page cache
- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
- Remove the AOP flags entirely
- Remove pagecache_write_begin() and pagecache_write_end()
- Documentation updates
- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio
- Change filler_t to require a struct file pointer be the first
argument like ->read_folio
* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
nilfs2: Fix some kernel-doc comments
Appoint myself page cache maintainer
fs: Remove aops->freepage
secretmem: Convert to free_folio
nfs: Convert to free_folio
orangefs: Convert to free_folio
fs: Add free_folio address space operation
fs: Convert drop_buffers() to use a folio
fs: Change try_to_free_buffers() to take a folio
jbd2: Convert release_buffer_page() to use a folio
jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
reiserfs: Convert release_buffer_page() to use a folio
fs: Remove last vestiges of releasepage
ubifs: Convert to release_folio
reiserfs: Convert to release_folio
orangefs: Convert to release_folio
ocfs2: Convert to release_folio
nilfs2: Remove comment about releasepage
nfs: Convert to release_folio
jfs: Convert to release_folio
...
|
|
Use folios throughout the release_folio path.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
|
|
This is a "weak" conversion which converts straight back to using pages.
A full conversion should be performed at some point, hopefully by
someone familiar with the filesystem.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
There are no more aop flags left, so remove the parameter.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
There are no more aop flags left, so remove the parameter.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
FS_CRYPTO_BLOCK_SIZE is neither the filesystem block size nor the
granularity of encryption. Rather, it defines two logically separate
constraints that both arise from the block size of the AES cipher:
- The alignment required for the lengths of file contents blocks
- The minimum input/output length for the filenames encryption modes
Since there are way too many things called the "block size", and the
connection with the AES block size is not easily understood, split
FS_CRYPTO_BLOCK_SIZE into two constants FSCRYPT_CONTENTS_ALIGNMENT and
FSCRYPT_FNAME_MIN_MSG_LEN that more clearly describe what they are.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220405010914.18519-1-ebiggers@kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull JFFS2, UBI and UBIFS updates from Richard Weinberger:
"JFFS2:
- Fixes for various memory issues
UBI:
- Fix for a race condition in cdev ioctl handler
UBIFS:
- Fixes for O_TMPFILE and whiteout handling
- Fixes for various memory issues"
* tag 'for-linus-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: rename_whiteout: correct old_dir size computing
jffs2: fix memory leak in jffs2_scan_medium
jffs2: fix memory leak in jffs2_do_mount_fs
jffs2: fix use-after-free in jffs2_clear_xattr_subsystem
fs/jffs2: fix comments mentioning i_mutex
ubi: fastmap: Return error code if memory allocation fails in add_aeb()
ubifs: Fix to add refcount once page is set private
ubifs: Fix read out-of-bounds in ubifs_wbuf_write_nolock()
ubifs: setflags: Make dirtied_ino_d 8 bytes aligned
ubifs: Rectify space amount budget for mkdir/tmpfile operations
ubifs: Fix 'ui->dirty' race between do_tmpfile() and writeback work
ubifs: Rename whiteout atomically
ubifs: Add missing iput if do_tmpfile() failed in rename whiteout
ubifs: Fix wrong number of inodes locked by ui_mutex in ubifs_inode comment
ubifs: Fix deadlock in concurrent rename whiteout and inode writeback
ubifs: rename_whiteout: Fix double free for whiteout_ui->data
ubi: Fix race condition between ctrl_cdev_ioctl and ubi_cdev_ioctl
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core changes for 5.18-rc1.
Not much here, primarily it was a bunch of cleanups and small updates:
- kobj_type cleanups for default_groups
- documentation updates
- firmware loader minor changes
- component common helper added and take advantage of it in many
drivers (the largest part of this pull request).
All of these have been in linux-next for a while with no reported
problems"
* tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits)
Documentation: update stable review cycle documentation
drivers/base/dd.c : Remove the initial value of the global variable
Documentation: update stable tree link
Documentation: add link to stable release candidate tree
devres: fix typos in comments
Documentation: add note block surrounding security patch note
samples/kobject: Use sysfs_emit instead of sprintf
base: soc: Make soc_device_match() simpler and easier to read
driver core: dd: fix return value of __setup handler
driver core: Refactor sysfs and drv/bus remove hooks
driver core: Refactor multiple copies of device cleanup
scripts: get_abi.pl: Fix typo in help message
kernfs: fix typos in comments
kernfs: remove unneeded #if 0 guard
ALSA: hda/realtek: Make use of the helper component_compare_dev_name
video: omapfb: dss: Make use of the helper component_compare_dev
power: supply: ab8500: Make use of the helper component_compare_dev
ASoC: codecs: wcd938x: Make use of the helper component_compare/release_of
iommu/mediatek: Make use of the helper component_compare/release_of
drm: of: Make use of the helper component_release_of
...
|
|
Pull filesystem folio updates from Matthew Wilcox:
"Primarily this series converts some of the address_space operations to
take a folio instead of a page.
Notably:
- a_ops->is_partially_uptodate() takes a folio instead of a page and
changes the type of the 'from' and 'count' arguments to make it
obvious they're bytes.
- a_ops->invalidatepage() becomes ->invalidate_folio() and has a
similar type change.
- a_ops->launder_page() becomes ->launder_folio()
- a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
address_space as an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request"
* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
fs: Remove aops ->set_page_dirty
fb_defio: Use noop_dirty_folio()
fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
fs: Convert __set_page_dirty_buffers to block_dirty_folio
nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
mm: Convert swap_set_page_dirty() to swap_dirty_folio()
ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
btrfs: Convert extent_range_redirty_for_io() to use folios
fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
btrfs: Convert from set_page_dirty to dirty_folio
fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
fs: Add aops->dirty_folio
fs: Remove aops->launder_page
orangefs: Convert launder_page to launder_folio
nfs: Convert from launder_page to launder_folio
fuse: Convert from launder_page to launder_folio
...
|
|
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().
Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When renaming the whiteout file, the old whiteout file is not deleted.
Therefore, we add the old dentry size to the old dir like XFS.
Otherwise, an error may be reported due to `fscki->calc_sz != fscki->size`
in check_indes.
Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT")
Reported-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Removes a call to __set_page_dirty_nobuffers().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
|
|
This is a straightfoward conversion.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
|
|
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the ubifs sysfs code to use default_groups field which has
been the preferred way since aa30f47cf666 ("kobject: Add support for
default attribute groups to kobj_type") so that we can soon get rid of
the obsolete default_attrs field.
Cc: Stefan Schaeckeler <schaecsn@gmx.net>
Cc: linux-mtd@lists.infradead.org
Acked-by: Richard Weinberger <richard@nod.at>
Link: https://lore.kernel.org/r/20220114104820.1340879-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
MM defined the rule [1] very clearly that once page was set with PG_private
flag, we should increment the refcount in that page, also main flows like
pageout(), migrate_page() will assume there is one additional page
reference count if page_has_private() returns true. Otherwise, we may
get a BUG in page migration:
page:0000000080d05b9d refcount:-1 mapcount:0 mapping:000000005f4d82a8
index:0xe2 pfn:0x14c12
aops:ubifs_file_address_operations [ubifs] ino:8f1 dentry name:"f30e"
flags: 0x1fffff80002405(locked|uptodate|owner_priv_1|private|node=0|
zone=1|lastcpupid=0x1fffff)
page dumped because: VM_BUG_ON_PAGE(page_count(page) != 0)
------------[ cut here ]------------
kernel BUG at include/linux/page_ref.h:184!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 38 Comm: kcompactd0 Not tainted 5.15.0-rc5
RIP: 0010:migrate_page_move_mapping+0xac3/0xe70
Call Trace:
ubifs_migrate_page+0x22/0xc0 [ubifs]
move_to_new_page+0xb4/0x600
migrate_pages+0x1523/0x1cc0
compact_zone+0x8c5/0x14b0
kcompactd+0x2bc/0x560
kthread+0x18c/0x1e0
ret_from_fork+0x1f/0x30
Before the time, we should make clean a concept, what does refcount means
in page gotten from grab_cache_page_write_begin(). There are 2 situations:
Situation 1: refcount is 3, page is created by __page_cache_alloc.
TYPE_A - the write process is using this page
TYPE_B - page is assigned to one certain mapping by calling
__add_to_page_cache_locked()
TYPE_C - page is added into pagevec list corresponding current cpu by
calling lru_cache_add()
Situation 2: refcount is 2, page is gotten from the mapping's tree
TYPE_B - page has been assigned to one certain mapping
TYPE_A - the write process is using this page (by calling
page_cache_get_speculative())
Filesystem releases one refcount by calling put_page() in xxx_write_end(),
the released refcount corresponds to TYPE_A (write task is using it). If
there are any processes using a page, page migration process will skip the
page by judging whether expected_page_refs() equals to page refcount.
The BUG is caused by following process:
PA(cpu 0) kcompactd(cpu 1)
compact_zone
ubifs_write_begin
page_a = grab_cache_page_write_begin
add_to_page_cache_lru
lru_cache_add
pagevec_add // put page into cpu 0's pagevec
(refcnf = 3, for page creation process)
ubifs_write_end
SetPagePrivate(page_a) // doesn't increase page count !
unlock_page(page_a)
put_page(page_a) // refcnt = 2
[...]
PB(cpu 0)
filemap_read
filemap_get_pages
add_to_page_cache_lru
lru_cache_add
__pagevec_lru_add // traverse all pages in cpu 0's pagevec
__pagevec_lru_add_fn
SetPageLRU(page_a)
isolate_migratepages
isolate_migratepages_block
get_page_unless_zero(page_a)
// refcnt = 3
list_add(page_a, from_list)
migrate_pages(from_list)
__unmap_and_move
move_to_new_page
ubifs_migrate_page(page_a)
migrate_page_move_mapping
expected_page_refs get 3
(migration[1] + mapping[1] + private[1])
release_pages
put_page_testzero(page_a) // refcnt = 3
page_ref_freeze // refcnt = 0
page_ref_dec_and_test(0 - 1 = -1)
page_ref_unfreeze
VM_BUG_ON_PAGE(-1 != 0, page)
UBIFS doesn't increase the page refcount after setting private flag, which
leads to page migration task believes the page is not used by any other
processes, so the page is migrated. This causes concurrent accessing on
page refcount between put_page() called by other process(eg. read process
calls lru_cache_add) and page_ref_unfreeze() called by migration task.
Actually zhangjun has tried to fix this problem [2] by recalculating page
refcnt in ubifs_migrate_page(). It's better to follow MM rules [1], because
just like Kirill suggested in [2], we need to check all users of
page_has_private() helper. Like f2fs does in [3], fix it by adding/deleting
refcount when setting/clearing private for a page. BTW, according to [4],
we set 'page->private' as 1 because ubifs just simply SetPagePrivate().
And, [5] provided a common helper to set/clear page private, ubifs can
use this helper following the example of iomap, afs, btrfs, etc.
Jump [6] to find a reproducer.
[1] https://lore.kernel.org/lkml/2b19b3c4-2bc4-15fa-15cc-27a13e5c7af1@aol.com
[2] https://www.spinics.net/lists/linux-mtd/msg04018.html
[3] http://lkml.iu.edu/hypermail/linux/kernel/1903.0/03313.html
[4] https://lore.kernel.org/linux-f2fs-devel/20210422154705.GO3596236@casper.infradead.org
[5] https://lore.kernel.org/all/20200517214718.468-1-guoqing.jiang@cloud.ionos.com
[6] https://bugzilla.kernel.org/show_bug.cgi?id=214961
Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Function ubifs_wbuf_write_nolock() may access buf out of bounds in
following process:
ubifs_wbuf_write_nolock():
aligned_len = ALIGN(len, 8); // Assume len = 4089, aligned_len = 4096
if (aligned_len <= wbuf->avail) ... // Not satisfy
if (wbuf->used) {
ubifs_leb_write() // Fill some data in avail wbuf
len -= wbuf->avail; // len is still not 8-bytes aligned
aligned_len -= wbuf->avail;
}
n = aligned_len >> c->max_write_shift;
if (n) {
n <<= c->max_write_shift;
err = ubifs_leb_write(c, wbuf->lnum, buf + written,
wbuf->offs, n);
// n > len, read out of bounds less than 8(n-len) bytes
}
, which can be catched by KASAN:
=========================================================
BUG: KASAN: slab-out-of-bounds in ecc_sw_hamming_calculate+0x1dc/0x7d0
Read of size 4 at addr ffff888105594ff8 by task kworker/u8:4/128
Workqueue: writeback wb_workfn (flush-ubifs_0_0)
Call Trace:
kasan_report.cold+0x81/0x165
nand_write_page_swecc+0xa9/0x160
ubifs_leb_write+0xf2/0x1b0 [ubifs]
ubifs_wbuf_write_nolock+0x421/0x12c0 [ubifs]
write_head+0xdc/0x1c0 [ubifs]
ubifs_jnl_write_inode+0x627/0x960 [ubifs]
wb_workfn+0x8af/0xb80
Function ubifs_wbuf_write_nolock() accepts that parameter 'len' is not 8
bytes aligned, the 'len' represents the true length of buf (which is
allocated in 'ubifs_jnl_xxx', eg. ubifs_jnl_write_inode), so
ubifs_wbuf_write_nolock() must handle the length read from 'buf' carefully
to write leb safely.
Fetch a reproducer in [Link].
Fixes: 1e51764a3c2ac0 ("UBIFS: add new flash file system")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214785
Reported-by: Chengsong Ke <kechengsong@huawei.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Make 'ui->data_len' aligned with 8 bytes before it is assigned to
dirtied_ino_d. Since 8871d84c8f8b0c6b("ubifs: convert to fileattr")
applied, 'setflags()' only affects regular files and directories, only
xattr inode, symlink inode and special inode(pipe/char_dev/block_dev)
have none- zero 'ui->data_len' field, so assertion
'!(req->dirtied_ino_d & 7)' cannot fail in ubifs_budget_space().
To avoid assertion fails in future evolution(eg. setflags can operate
special inodes), it's better to make dirtied_ino_d 8 bytes aligned,
after all aligned size is still zero for regular files.
Fixes: 1e51764a3c2ac05a ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
UBIFS should make sure the flash has enough space to store dirty (Data
that is newer than disk) data (in memory), space budget is exactly
designed to do that. If space budget calculates less data than we need,
'make_reservation()' will do more work(return -ENOSPC if no free space
lelf, sometimes we can see "cannot reserve xxx bytes in jhead xxx, error
-28" in ubifs error messages) with ubifs inodes locked, which may effect
other syscalls.
A simple way to decide how much space do we need when make a budget:
See how much space is needed by 'make_reservation()' in ubifs_jnl_xxx()
function according to corresponding operation.
It's better to report ENOSPC in ubifs_budget_space(), as early as we can.
Fixes: 474b93704f32163 ("ubifs: Implement O_TMPFILE")
Fixes: 1e51764a3c2ac05 ("UBIFS: add new flash file system")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
'ui->dirty' is not protected by 'ui_mutex' in function do_tmpfile() which
may race with ubifs_write_inode[wb_workfn] to access/update 'ui->dirty',
finally dirty space is released twice.
open(O_TMPFILE) wb_workfn
do_tmpfile
ubifs_budget_space(ino_req = { .dirtied_ino = 1})
d_tmpfile // mark inode(tmpfile) dirty
ubifs_jnl_update // without holding tmpfile's ui_mutex
mark_inode_clean(ui)
if (ui->dirty)
ubifs_release_dirty_inode_budget(ui) // release first time
ubifs_write_inode
mutex_lock(&ui->ui_mutex)
ubifs_release_dirty_inode_budget(ui)
// release second time
mutex_unlock(&ui->ui_mutex)
ui->dirty = 0
Run generic/476 can reproduce following message easily
(See reproducer in [Link]):
UBIFS error (ubi0:0 pid 2578): ubifs_assert_failed [ubifs]: UBIFS assert
failed: c->bi.dd_growth >= 0, in fs/ubifs/budget.c:554
UBIFS warning (ubi0:0 pid 2578): ubifs_ro_mode [ubifs]: switched to
read-only mode, error -22
Workqueue: writeback wb_workfn (flush-ubifs_0_0)
Call Trace:
ubifs_ro_mode+0x54/0x60 [ubifs]
ubifs_assert_failed+0x4b/0x80 [ubifs]
ubifs_release_budget+0x468/0x5a0 [ubifs]
ubifs_release_dirty_inode_budget+0x53/0x80 [ubifs]
ubifs_write_inode+0x121/0x1f0 [ubifs]
...
wb_workfn+0x283/0x7b0
Fix it by holding tmpfile ubifs inode lock during ubifs_jnl_update().
Similar problem exists in whiteout renaming, but previous fix("ubifs:
Rename whiteout atomically") has solved the problem.
Fixes: 474b93704f32163 ("ubifs: Implement O_TMPFILE")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214765
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Currently, rename whiteout has 3 steps:
1. create tmpfile(which associates old dentry to tmpfile inode) for
whiteout, and store tmpfile to disk
2. link whiteout, associate whiteout inode to old dentry agagin and
store old dentry, old inode, new dentry on disk
3. writeback dirty whiteout inode to disk
Suddenly power-cut or error occurring(eg. ENOSPC returned by budget,
memory allocation failure) during above steps may cause kinds of problems:
Problem 1: ENOSPC returned by whiteout space budget (before step 2),
old dentry will disappear after rename syscall, whiteout file
cannot be found either.
ls dir // we get file, whiteout
rename(dir/file, dir/whiteout, REANME_WHITEOUT)
ENOSPC = ubifs_budget_space(&wht_req) // return
ls dir // empty (no file, no whiteout)
Problem 2: Power-cut happens before step 3, whiteout inode with 'nlink=1'
is not stored on disk, whiteout dentry(old dentry) is written
on disk, whiteout file is lost on next mount (We get "dead
directory entry" after executing 'ls -l' on whiteout file).
Now, we use following 3 steps to finish rename whiteout:
1. create an in-mem inode with 'nlink = 1' as whiteout
2. ubifs_jnl_rename (Write on disk to finish associating old dentry to
whiteout inode, associating new dentry with old inode)
3. iput(whiteout)
Rely writing in-mem inode on disk by ubifs_jnl_rename() to finish rename
whiteout, which avoids middle disk state caused by suddenly power-cut
and error occurring.
Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
whiteout inode should be put when do_tmpfile() failed if inode has been
initialized. Otherwise we will get following warning during umount:
UBIFS error (ubi0:0 pid 1494): ubifs_assert_failed [ubifs]: UBIFS
assert failed: c->bi.dd_growth == 0, in fs/ubifs/super.c:1930
VFS: Busy inodes after unmount of ubifs. Self-destruct in 5 seconds.
Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Since 9ec64962afb1702f75b("ubifs: Implement RENAME_EXCHANGE") and
9e0a1fff8db56eaaebb("ubifs: Implement RENAME_WHITEOUT") are applied,
ubifs_rename locks and changes 4 ubifs inodes, correct the comment
for ui_mutex in ubifs_inode.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Following hung tasks:
[ 77.028764] task:kworker/u8:4 state:D stack: 0 pid: 132
[ 77.028820] Call Trace:
[ 77.029027] schedule+0x8c/0x1b0
[ 77.029067] mutex_lock+0x50/0x60
[ 77.029074] ubifs_write_inode+0x68/0x1f0 [ubifs]
[ 77.029117] __writeback_single_inode+0x43c/0x570
[ 77.029128] writeback_sb_inodes+0x259/0x740
[ 77.029148] wb_writeback+0x107/0x4d0
[ 77.029163] wb_workfn+0x162/0x7b0
[ 92.390442] task:aa state:D stack: 0 pid: 1506
[ 92.390448] Call Trace:
[ 92.390458] schedule+0x8c/0x1b0
[ 92.390461] wb_wait_for_completion+0x82/0xd0
[ 92.390469] __writeback_inodes_sb_nr+0xb2/0x110
[ 92.390472] writeback_inodes_sb_nr+0x14/0x20
[ 92.390476] ubifs_budget_space+0x705/0xdd0 [ubifs]
[ 92.390503] do_rename.cold+0x7f/0x187 [ubifs]
[ 92.390549] ubifs_rename+0x8b/0x180 [ubifs]
[ 92.390571] vfs_rename+0xdb2/0x1170
[ 92.390580] do_renameat2+0x554/0x770
, are caused by concurrent rename whiteout and inode writeback processes:
rename_whiteout(Thread 1) wb_workfn(Thread2)
ubifs_rename
do_rename
lock_4_inodes (Hold ui_mutex)
ubifs_budget_space
make_free_space
shrink_liability
__writeback_inodes_sb_nr
bdi_split_work_to_wbs (Queue new wb work)
wb_do_writeback(wb work)
__writeback_single_inode
ubifs_write_inode
LOCK(ui_mutex)
↑
wb_wait_for_completion (Wait wb work) <-- deadlock!
Reproducer (Detail program in [Link]):
1. SYS_renameat2("/mp/dir/file", "/mp/dir/whiteout", RENAME_WHITEOUT)
2. Consume out of space before kernel(mdelay) doing budget for whiteout
Fix it by doing whiteout space budget before locking ubifs inodes.
BTW, it also fixes wrong goto tag 'out_release' in whiteout budget
error handling path(It should at least recover dir i_size and unlock
4 ubifs inodes).
Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214733
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
'whiteout_ui->data' will be freed twice if space budget fail for
rename whiteout operation as following process:
rename_whiteout
dev = kmalloc
whiteout_ui->data = dev
kfree(whiteout_ui->data) // Free first time
iput(whiteout)
ubifs_free_inode
kfree(ui->data) // Double free!
KASAN reports:
==================================================================
BUG: KASAN: double-free or invalid-free in ubifs_free_inode+0x4f/0x70
Call Trace:
kfree+0x117/0x490
ubifs_free_inode+0x4f/0x70 [ubifs]
i_callback+0x30/0x60
rcu_do_batch+0x366/0xac0
__do_softirq+0x133/0x57f
Allocated by task 1506:
kmem_cache_alloc_trace+0x3c2/0x7a0
do_rename+0x9b7/0x1150 [ubifs]
ubifs_rename+0x106/0x1f0 [ubifs]
do_syscall_64+0x35/0x80
Freed by task 1506:
kfree+0x117/0x490
do_rename.cold+0x53/0x8a [ubifs]
ubifs_rename+0x106/0x1f0 [ubifs]
do_syscall_64+0x35/0x80
The buggy address belongs to the object at ffff88810238bed8 which
belongs to the cache kmalloc-8 of size 8
==================================================================
Let ubifs_free_inode() free 'whiteout_ui->data'. BTW, delete unused
assignment 'whiteout_ui->data_len = 0', process 'ubifs_evict_inode()
-> ubifs_jnl_delete_inode() -> ubifs_jnl_write_inode()' doesn't need it
(because 'inc_nlink(whiteout)' won't be excuted by 'goto out_release',
and the nlink of whiteout inode is 0).
Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
If ubifs_garbage_collect_leb() returns -EAGAIN and ubifs_return_leb
returns error, a LEB will always has a "taken" flag. In this case,
set the ubifs to read-only to prevent a worse situation.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
If ubifs_garbage_collect_leb() returns -EAGAIN and enters the "out"
branch, ubifs_return_leb will execute twice on the same lnum. This
can cause data loss in concurrency situations.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Hulk Robot reported a KASAN report about slab-out-of-bounds:
==================================================================
BUG: KASAN: slab-out-of-bounds in ubifs_change_lp+0x3a9/0x1390 [ubifs]
Read of size 8 at addr ffff888101c961f8 by task fsstress/1068
[...]
Call Trace:
check_memory_region+0x1c1/0x1e0
ubifs_change_lp+0x3a9/0x1390 [ubifs]
ubifs_change_one_lp+0x170/0x220 [ubifs]
ubifs_garbage_collect+0x7f9/0xda0 [ubifs]
ubifs_budget_space+0xfe4/0x1bd0 [ubifs]
ubifs_write_begin+0x528/0x10c0 [ubifs]
Allocated by task 1068:
kmemdup+0x25/0x50
ubifs_lpt_lookup_dirty+0x372/0xb00 [ubifs]
ubifs_update_one_lp+0x46/0x260 [ubifs]
ubifs_tnc_end_commit+0x98b/0x1720 [ubifs]
do_commit+0x6cb/0x1950 [ubifs]
ubifs_run_commit+0x15a/0x2b0 [ubifs]
ubifs_budget_space+0x1061/0x1bd0 [ubifs]
ubifs_write_begin+0x528/0x10c0 [ubifs]
[...]
==================================================================
In ubifs_garbage_collect(), if ubifs_find_dirty_leb returns an error,
lp is an uninitialized variable. But lp.num might be used in the out
branch, which is a random value. If the value is -1 or another value
that can pass the check, soob may occur in the ubifs_change_lp() in
the following procedure.
To solve this problem, we initialize lp.lnum to -1, and then initialize
it correctly in ubifs_find_dirty_leb, which is not equal to -1, and
ubifs_return_leb is executed only when lp.lnum != -1.
if find a retained or indexing LEB and continue to next loop, but break
before find another LEB, the "taken" flag of this LEB will be cleaned
in ubi_return_lebi(). This bug has also been fixed in this patch.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
The snprintf() function returns the number of bytes (not including the
NUL terminator) which would have been printed if there were enough
space. So it can be greater than UBIFS_DFS_DIR_LEN. And actually if
it equals UBIFS_DFS_DIR_LEN then that's okay so this check is too
strict.
Fixes: 9a620291fc01 ("ubifs: Export filesystem error counters")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Not all ubifs filesystem errors are propagated to userspace.
Export bad magic, bad node and crc errors via sysfs. This allows userspace
to notice filesystem errors:
/sys/fs/ubifs/ubiX_Y/errors_magic
/sys/fs/ubifs/ubiX_Y/errors_node
/sys/fs/ubifs/ubiX_Y/errors_crc
The counters are reset to 0 with a remount.
Signed-off-by: Stefan Schaeckeler <sschaeck@cisco.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
it seems freeing the write buffers in the error path of the
ubifs_remount_rw() is wrong. It leads later to a kernel oops like this:
[10016.431274] UBIFS (ubi0:0): start fixing up free space
[10090.810042] UBIFS (ubi0:0): free space fixup complete
[10090.814623] UBIFS error (ubi0:0 pid 512): ubifs_remount_fs: cannot
spawn "ubifs_bgt0_0", error -4
[10101.915108] UBIFS (ubi0:0): background thread "ubifs_bgt0_0" started,
PID 517
[10105.275498] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000030
[10105.284352] Mem abort info:
[10105.287160] ESR = 0x96000006
[10105.290252] EC = 0x25: DABT (current EL), IL = 32 bits
[10105.295592] SET = 0, FnV = 0
[10105.298652] EA = 0, S1PTW = 0
[10105.301848] Data abort info:
[10105.304723] ISV = 0, ISS = 0x00000006
[10105.308573] CM = 0, WnR = 0
[10105.311564] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000f03d1000
[10105.318034] [0000000000000030] pgd=00000000f6cee003,
pud=00000000f4884003, pmd=0000000000000000
[10105.326783] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[10105.332355] Modules linked in: ath10k_pci ath10k_core ath mac80211
libarc4 cfg80211 nvme nvme_core cryptodev(O)
[10105.342468] CPU: 3 PID: 518 Comm: touch Tainted: G O
5.4.3 #1
[10105.349517] Hardware name: HYPEX CPU (DT)
[10105.353525] pstate: 40000005 (nZcv daif -PAN -UAO)
[10105.358324] pc : atomic64_try_cmpxchg_acquire.constprop.22+0x8/0x34
[10105.364596] lr : mutex_lock+0x1c/0x34
[10105.368253] sp : ffff000075633aa0
[10105.371563] x29: ffff000075633aa0 x28: 0000000000000001
[10105.376874] x27: ffff000076fa80c8 x26: 0000000000000004
[10105.382185] x25: 0000000000000030 x24: 0000000000000000
[10105.387495] x23: 0000000000000000 x22: 0000000000000038
[10105.392807] x21: 000000000000000c x20: ffff000076fa80c8
[10105.398119] x19: ffff000076fa8000 x18: 0000000000000000
[10105.403429] x17: 0000000000000000 x16: 0000000000000000
[10105.408741] x15: 0000000000000000 x14: fefefefefefefeff
[10105.414052] x13: 0000000000000000 x12: 0000000000000fe0
[10105.419364] x11: 0000000000000fe0 x10: ffff000076709020
[10105.424675] x9 : 0000000000000000 x8 : 00000000000000a0
[10105.429986] x7 : ffff000076fa80f4 x6 : 0000000000000030
[10105.435297] x5 : 0000000000000000 x4 : 0000000000000000
[10105.440609] x3 : 0000000000000000 x2 : ffff00006f276040
[10105.445920] x1 : ffff000075633ab8 x0 : 0000000000000030
[10105.451232] Call trace:
[10105.453676] atomic64_try_cmpxchg_acquire.constprop.22+0x8/0x34
[10105.459600] ubifs_garbage_collect+0xb4/0x334
[10105.463956] ubifs_budget_space+0x398/0x458
[10105.468139] ubifs_create+0x50/0x180
[10105.471712] path_openat+0x6a0/0x9b0
[10105.475284] do_filp_open+0x34/0x7c
[10105.478771] do_sys_open+0x78/0xe4
[10105.482170] __arm64_sys_openat+0x1c/0x24
[10105.486180] el0_svc_handler+0x84/0xc8
[10105.489928] el0_svc+0x8/0xc
[10105.492808] Code: 52800013 17fffffb d2800003 f9800011 (c85ffc05)
[10105.498903] ---[ end trace 46b721d93267a586 ]---
To reproduce the problem:
1. Filesystem initially mounted read-only, free space fixup flag set.
2. mount -o remount,rw <mountpoint>
3. it takes some time (free space fixup running)
... try to terminate running mount by CTRL-C
... does not respond, only after free space fixup is complete
... then "ubifs_remount_fs: cannot spawn "ubifs_bgt0_0", error -4"
4. mount -o remount,rw <mountpoint>
... now finished instantly (fixup already done).
5. Create file or just unmount the filesystem and we get the oops.
Cc: <stable@vger.kernel.org>
Fixes: b50b9f408502 ("UBIFS: do not free write-buffers when in R/O mode")
Signed-off-by: Petr Cvachoucek <cvachoucek@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Repalce kthread_create/wake_up_process() with kthread_run()
to simplify the code.
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
Found with `codespell -i 3 -w fs/ubifs/**` and proof reading that parts.
Signed-off-by: Alexander Dahl <ada@thorsis.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|
The max_namelen field is unnecessary, as it is set to 255 (NAME_MAX) on
all filesystems that support fscrypt (or plan to support fscrypt). For
simplicity, just use NAME_MAX directly instead.
Link: https://lore.kernel.org/r/20210909184513.139281-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
The stat() family of syscalls report the wrong size for encrypted
symlinks, which has caused breakage in several userspace programs.
Fix this by calling fscrypt_symlink_getattr() after ubifs_getattr() for
encrypted symlinks. This function computes the correct size by reading
and decrypting the symlink target (if it's not already cached).
For more details, see the commit which added fscrypt_symlink_getattr().
Fixes: ca7f85be8d6c ("ubifs: Add support for encrypted symlinks")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210702065350.209646-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
xfstests-generic/476 reports a warning message as below:
WARNING: CPU: 2 PID: 30347 at fs/inode.c:361 inc_nlink+0x52/0x70
Call Trace:
do_rename+0x502/0xd40 [ubifs]
ubifs_rename+0x8b/0x180 [ubifs]
vfs_rename+0x476/0x1080
do_renameat2+0x67c/0x7b0
__x64_sys_renameat2+0x6e/0x90
do_syscall_64+0x66/0xe0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Following race case can cause this:
rename_whiteout(Thread 1) wb_workfn(Thread 2)
ubifs_rename
do_rename
__writeback_single_inode
spin_lock(&inode->i_lock)
whiteout->i_state |= I_LINKABLE
inode->i_state &= ~dirty;
---- How race happens on i_state:
(tmp = whiteout->i_state | I_LINKABLE)
(tmp = inode->i_state & ~dirty)
(whiteout->i_state = tmp)
(inode->i_state = tmp)
----
spin_unlock(&inode->i_lock)
inc_nlink(whiteout)
WARN_ON(!(inode->i_state & I_LINKABLE)) !!!
Fix to add i_lock to avoid i_state update race condition.
Fixes: 9e0a1fff8db56ea ("ubifs: Implement RENAME_WHITEOUT")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|