summaryrefslogtreecommitdiff
path: root/drivers
AgeCommit message (Collapse)Author
2023-06-27dm integrity: scale down the recalculate buffer if memory allocation failsMikulas Patocka
If memory allocation fails, try to reduce the size of the recalculate buffer and continue with that smaller buffer. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-27dm integrity: only allocate recalculate buffer when neededMikulas Patocka
dm-integrity preallocated 8MiB buffer for recalculating in the constructor and freed it in the destructor. This wastes memory when the user has many dm-integrity devices. Fix dm-integrity so that the buffer is only allocated when recalculation is in progress; allocate the buffer at the beginning of integrity_recalc() and free it at the end. Note that integrity_recalc() doesn't hold any locks when allocating the buffer, so it shouldn't cause low-memory deadlock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-27dm integrity: reduce vmalloc space footprint on 32-bit architecturesMikulas Patocka
It was reported that dm-integrity runs out of vmalloc space on 32-bit architectures. On x86, there is only 128MiB vmalloc space and dm-integrity consumes it quickly because it has a 64MiB journal and 8MiB recalculate buffer. Fix this by reducing the size of the journal to 4MiB and the size of the recalculate buffer to 1MiB, so that multiple dm-integrity devices can be created and activated on 32-bit architectures. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-23dm ioctl: Refuse to create device named "." or ".."Demi Marie Obenour
Using either of these is going to greatly confuse userspace, as they are not valid symlink names and so creating the usual /dev/mapper/NAME symlink will not be possible. As creating a device with either of these names is almost certainly a userspace bug, just error out. Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-23dm ioctl: Refuse to create device named "control"Demi Marie Obenour
Typical userspace setups create a symlink under /dev/mapper with the name of the device, but /dev/mapper/control is reserved for DM's control device. Therefore, trying to create such a device is almost certain to be a userspace bug. Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-23dm ioctl: Avoid double-fetch of versionDemi Marie Obenour
The version is fetched once in check_version(), which then does some validation and then overwrites the version in userspace with the API version supported by the kernel. copy_params() then fetches the version from userspace *again*, and this time no validation is done. The result is that the kernel's version number is completely controllable by userspace, provided that userspace can win a race condition. Fix this flaw by not copying the version back to the kernel the second time. This is not exploitable as the version is not further used in the kernel. However, it could become a problem if future patches start relying on the version field. Cc: stable@vger.kernel.org Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-23dm ioctl: structs and parameter strings must not overlapDemi Marie Obenour
The NUL terminator for each target parameter string must precede the following 'struct dm_target_spec'. Otherwise, dm_split_args() might corrupt this struct. Furthermore, the first 'struct dm_target_spec' must come after the 'struct dm_ioctl', as if it overlaps too much dm_split_args() could corrupt the 'struct dm_ioctl'. Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-23dm ioctl: Avoid pointer arithmetic overflowDemi Marie Obenour
Especially on 32-bit systems, it is possible for the pointer arithmetic to overflow and cause a userspace pointer to be dereferenced in the kernel. Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-23dm ioctl: Check dm_target_spec is sufficiently alignedDemi Marie Obenour
Otherwise subsequent code, if given malformed input, could dereference a misaligned 'struct dm_target_spec *'. Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> # use %zu Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-22dm integrity: Use %*ph for printing hexdump of a small bufferAndy Shevchenko
The kernel already has a helper to print a hexdump of a small buffer via pointer extension. Use that instead of open coded variant. In long term it helps to kill pr_cont() or at least narrow down its use. Note, the format is slightly changed, i.e. the trailing space is always printed. Also the IV dump is limited by 64 bytes which seems fine. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm thin: disable discards for thin-pool if no_discard_passdownMike Snitzer
Also rename disable_passdown_if_not_supported to disable_discard_passdown_if_not_supported. And fold passdown_enabled() into only caller. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm: remove stale/redundant dm_internal_{suspend,resume} prototypes in dm.hMike Snitzer
dm_internal_suspend() no longer exists. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm: skip dm-stats work in alloc_io() unless neededMike Snitzer
Don't dm_stats_record_start() if dm_stats_used() is false. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm: avoid needless dm_io access if all IO accounting is disabledMike Snitzer
Update dm_io_acct() to eliminate most dm_io struct accesses if both block core's IO stats and dm-stats are disabled. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm: support turning off block-core's io stats accountingLi Nan
Commit bc58ba9468d9 ("block: add sysfs file for controlling io stats accounting") allowed users to turn off disk stat accounting completely by checking if queue flag QUEUE_FLAG_IO_STAT is set. In dm, this flag is neither set nor checked: so block-core's io stats are continuously counted and cannot be turned off. Add support for turning off block-core's io stats accounting for dm. Set QUEUE_FLAG_IO_STAT for dm's request_queue. If QUEUE_FLAG_IO_STAT is set when an io starts, record the need for block core's io stats by setting the DM_IO_BLK_STAT dm_io flag to avoid io stats being disabled in the middle of the io. DM statistics (dm-stats) is independent of block-core's io stats and remains unchanged. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm zone: Use the bitmap API to allocate bitmapsChristophe JAILLET
Use bitmap_zalloc()/bitmap_free() instead of hand-writing them. It is less verbose and it improves the semantic. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm thin metadata: Fix ABBA deadlock by resetting dm_bufio_clientLi Lingfeng
As described in commit 8111964f1b85 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata"), ABBA deadlocks will be triggered because shrinker_rwsem currently needs to held by dm_pool_abort_metadata() as a side-effect of thin-pool metadata operation failure. The following three problem scenarios have been noticed: 1) Described by commit 8111964f1b85 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata") 2) shrinker_rwsem and throttle->lock P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab super_cache_scan prune_icache_sb dispose_list evict ext4_evict_inode ext4_clear_inode ext4_discard_preallocations ext4_mb_load_buddy_gfp ext4_mb_init_cache ext4_wait_block_bitmap __ext4_error ext4_handle_error ext4_commit_super ... dm_submit_bio do_worker throttle_work_update down_write(&t->lock) -- LOCK B process_deferred_bios commit metadata_operation_failed dm_pool_abort_metadata dm_block_manager_create dm_bufio_client_create register_shrinker down_write(&shrinker_rwsem) -- LOCK A thin_map thin_bio_map thin_defer_bio_with_throttle throttle_lock down_read(&t->lock) - LOCK B 3) shrinker_rwsem and wait_on_buffer P1(drop cache) P2(kworker) drop_caches_sysctl_handler drop_slab shrink_slab down_read(&shrinker_rwsem) - LOCK A do_shrink_slab ... ext4_wait_block_bitmap __ext4_error ext4_handle_error jbd2_journal_abort jbd2_journal_update_sb_errno jbd2_write_superblock submit_bh // LOCK B // RELEASE B do_worker throttle_work_update down_write(&t->lock) - LOCK B process_deferred_bios process_bio commit metadata_operation_failed dm_pool_abort_metadata dm_block_manager_create dm_bufio_client_create register_shrinker register_shrinker_prepared down_write(&shrinker_rwsem) - LOCK A bio_endio wait_on_buffer __wait_on_buffer Fix these by resetting dm_bufio_client without holding shrinker_rwsem. Fixes: 8111964f1b85 ("dm thin: Fix ABBA deadlock between shrink_slab and dm_pool_abort_metadata") Cc: stable@vger.kernel.org Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm crypt: fix crypt_ctr_cipher_new return value on invalid AEAD cipherMikulas Patocka
If the user specifies invalid AEAD cipher, dm-crypt should return the error returned from crypt_ctr_auth_spec, not -ENOMEM. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm thin: update .io_hints methods to not require handling discards lastMike Snitzer
Removes assumptions about what might follow the discard setup code (previously the code would return early if discards not enabled). Makes it possible to add more capabilites to the end of each .io_hints method (which is the natural thing to do when adding new features). Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm thin: remove return code variable in pool_mapMike Snitzer
Always returns DM_MAPIO_REMAPPED so no need for variable. Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm flakey: introduce random_read_corrupt and random_write_corrupt optionsMikulas Patocka
The random_read_corrupt and random_write_corrupt options corrupt a random byte in a bio with the provided probability. The corruption only happens in the "down" interval. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm flakey: clone pages on write bio before corrupting themMikulas Patocka
dm-flakey has an option to corrupt write bios. It corrupts the memory that is being written. This can cause system crashes or security bugs - for example, if the user writes a shared library code with O_DIRECT flag to a dm-flakey device, it corrupts the memory for all users that have the shared library mapped. Fix this bug by cloning the bio and corrupting the clone rather than the original. Also drop the test for ZERO_PAGE(0) - it can't happen because we write the cloned pages. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16dm crypt: allocate compound pages if possibleMikulas Patocka
It was reported that allocating pages for the write buffer in dm-crypt causes measurable overhead [1]. Change dm-crypt to allocate compound pages if they are available. If not, fall back to the mempool. [1] https://listman.redhat.com/archives/dm-devel/2023-February/053284.html Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2023-06-16Merge tag 'nvme-6.5-2023-06-16' of git://git.infradead.org/nvme into ↵Jens Axboe
for-6.5/block Pull NVMe updates from Keith: "nvme updates for Linux 6.5 - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner)" * tag 'nvme-6.5-2023-06-16' of git://git.infradead.org/nvme: (27 commits) nvme: forward port sysfs delete fix nvme: skip optional id ctrl csi if it failed nvme-core: use nvme_ns_head_multipath instead of ns->head->disk nvmet-fcloop: Do not wait on completion when unregister fails nvme-fabrics: open code __nvmf_host_find() nvme-fabrics: error out to unlock the mutex nvme: Increase block size variable size to 32-bit nvme-fcloop: no need to return from void function nvmet-auth: remove unnecessary break after goto nvmet-auth: remove some dead code nvme-core: remove redundant check from nvme_init_ns_head nvme: move sysfs code to a dedicated sysfs.c file nvme-fabrics: prevent overriding of existing host nvme-fabrics: check hostid using uuid_equal nvme-fabrics: unify common code in admin and io queue connect nvmet: reorder fields in 'struct nvmefc_fcp_req' nvmet: reorder fields in 'struct nvme_dhchap_queue_context' nvmet: reorder fields in 'struct nvmf_ctrl_options' nvme: reorder fields in 'struct nvme_ctrl' nvmet: reorder fields in 'struct nvmet_sq' ...
2023-06-16nvme: forward port sysfs delete fixKeith Busch
We had a late fix that modified nvme_sysfs_delete() after the staging branch for the next merge window relocated the function to a new file. Port commit 2eb94dd56a4a4 ("nvme: do not let the user delete a ctrl before a complete") to the latest to avoid a potentially confusing merge conflict. Cc: Maurizio Lombardi <mlombard@redhat.com> Cc: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-06-15bcache: fixup btree_cache_wait list damageMingzhe Zou
We get a kernel crash about "list_add corruption. next->prev should be prev (ffff9c801bc01210), but was ffff9c77b688237c. (next=ffffae586d8afe68)." crash> struct list_head 0xffff9c801bc01210 struct list_head { next = 0xffffae586d8afe68, prev = 0xffffae586d8afe68 } crash> struct list_head 0xffff9c77b688237c struct list_head { next = 0x0, prev = 0x0 } crash> struct list_head 0xffffae586d8afe68 struct list_head struct: invalid kernel virtual address: ffffae586d8afe68 type: "gdb_readmem_callback" Cannot access memory at address 0xffffae586d8afe68 [230469.019492] Call Trace: [230469.032041] prepare_to_wait+0x8a/0xb0 [230469.044363] ? bch_btree_keys_free+0x6c/0xc0 [escache] [230469.056533] mca_cannibalize_lock+0x72/0x90 [escache] [230469.068788] mca_alloc+0x2ae/0x450 [escache] [230469.080790] bch_btree_node_get+0x136/0x2d0 [escache] [230469.092681] bch_btree_check_thread+0x1e1/0x260 [escache] [230469.104382] ? finish_wait+0x80/0x80 [230469.115884] ? bch_btree_check_recurse+0x1a0/0x1a0 [escache] [230469.127259] kthread+0x112/0x130 [230469.138448] ? kthread_flush_work_fn+0x10/0x10 [230469.149477] ret_from_fork+0x35/0x40 bch_btree_check_thread() and bch_dirty_init_thread() may call mca_cannibalize() to cannibalize other cached btree nodes. Only one thread can do it at a time, so the op of other threads will be added to the btree_cache_wait list. We must call finish_wait() to remove op from btree_cache_wait before free it's memory address. Otherwise, the list will be damaged. Also should call bch_cannibalize_unlock() to release the btree_cache_alloc_lock and wake_up other waiters. Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded") Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded") Cc: stable@vger.kernel.org Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-7-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15bcache: Fix __bch_btree_node_alloc to make the failure behavior consistentZheng Wang
In some specific situations, the return value of __bch_btree_node_alloc may be NULL. This may lead to a potential NULL pointer dereference in caller function like a calling chain : btree_split->bch_btree_node_alloc->__bch_btree_node_alloc. Fix it by initializing the return value in __bch_btree_node_alloc. Fixes: cafe56359144 ("bcache: A block layer cache") Cc: stable@vger.kernel.org Signed-off-by: Zheng Wang <zyytlz.wz@163.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-6-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15bcache: Remove unnecessary NULL point check in node allocationsZheng Wang
Due to the previous fix of __bch_btree_node_alloc, the return value will never be a NULL pointer. So IS_ERR is enough to handle the failure situation. Fix it by replacing IS_ERR_OR_NULL check by an IS_ERR check. Fixes: cafe56359144 ("bcache: A block layer cache") Cc: stable@vger.kernel.org Signed-off-by: Zheng Wang <zyytlz.wz@163.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-5-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15bcache: Remove dead references to cache_readaheadsAndrea Tomassetti
The cache_readaheads stat counter is not used anymore and should be removed. Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15bcache: make kobj_type structures constantThomas Weißschuh
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-15bcache: Convert to use sysfs_emit()/sysfs_emit_at() APIsye xingchen
Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20230615121223.22502-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14scsi: sg: fix blktrace debugfs entries leakageYu Kuai
sg_ioctl() support to enable blktrace, which will create debugfs entries "/sys/kernel/debug/block/sgx/", however, there is no guarantee that user will remove these entries through ioctl, and deleting sg device doesn't cleanup these blktrace entries. This problem can be fixed by cleanup blktrace while releasing request_queue, however, it's not a good idea to do this special handling in common layer just for sg device. Fix this problem by shutdown bltkrace in sg_device_destroy(), where the device is deleted and all the users close the device, also grab a scsi_device reference from sg_add_device() to prevent scsi_device to be freed before sg_device_destroy(); Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20230610022003.2557284-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-14brd: use cond_resched instead of cond_resched_rcuPankaj Raghav
The body of the loop is run without RCU lock held. Use the regular cond_resched() instead of cond_resched_rcu(). Fixes: 786bb0245881 ("brd: use XArray instead of radix-tree to index backing pages") Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20230614133538.1279369-1-p.raghav@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-13md/raid1-10: limit the number of plugged bioYu Kuai
bio can be added to plug infinitely, and following writeback test can trigger huge amount of plugged bio: Test script: modprobe brd rd_nr=4 rd_size=10485760 mdadm -CR /dev/md0 -l10 -n4 /dev/ram[0123] --assume-clean --bitmap=internal echo 0 > /proc/sys/vm/dirty_background_ratio fio -filename=/dev/md0 -ioengine=libaio -rw=write -bs=4k -numjobs=1 -iodepth=128 -name=test Test result: Monitor /sys/block/md0/inflight will found that inflight keep increasing until fio finish writing, after running for about 2 minutes: [root@fedora ~]# cat /sys/block/md0/inflight 0 4474191 Fix the problem by limiting the number of plugged bio based on the number of copies for original bio. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-8-yukuai1@huaweicloud.com
2023-06-13md/raid1-10: don't handle pluged bio by daemon threadYu Kuai
current->bio_list will be set under submit_bio() context, in this case bitmap io will be added to the list and wait for current io submission to finish, while current io submission must wait for bitmap io to be done. commit 874807a83139 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") fix the deadlock by handling plugged bio by daemon thread. On the one hand, the deadlock won't exist after commit a214b949d8e3 ("blk-mq: only flush requests from the plug in blk_mq_submit_bio"). On the other hand, current solution makes it impossible to flush plugged bio in raid1/10_make_request(), because this will cause that all the writes will goto daemon thread. In order to limit the number of plugged bio, commit 874807a83139 ("md/raid1{,0}: fix deadlock in bitmap_unplug.") is reverted, and the deadlock is fixed by handling bitmap io asynchronously. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-7-yukuai1@huaweicloud.com
2023-06-13md/md-bitmap: add a new helper to unplug bitmap asynchrouslyYu Kuai
If bitmap is enabled, bitmap must update before submitting write io, this is why unplug callback must move these io to 'conf->pending_io_list' if 'current->bio_list' is not empty, which will suffer performance degradation. A new helper md_bitmap_unplug_async() is introduced to submit bitmap io in a kworker, so that submit bitmap io in raid10_unplug() doesn't require that 'current->bio_list' is empty. This patch prepare to limit the number of plugged bio. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-6-yukuai1@huaweicloud.com
2023-06-13md/raid1-10: submit write io directly if bitmap is not enabledYu Kuai
Commit 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for raid10") add bitmap support, and it changed that write io is submitted through daemon thread because bitmap need to be updated before write io. And later, plug is used to fix performance regression because all the write io will go to demon thread, which means io can't be issued concurrently. However, if bitmap is not enabled, the write io should not go to daemon thread in the first place, and plug is not needed as well. Fixes: 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for raid10") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-5-yukuai1@huaweicloud.com
2023-06-13md/raid1-10: factor out a helper to submit normal writeYu Kuai
There are multiple places to do the same thing, factor out a helper to prevent redundant code, and the helper will be used in following patch as well. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-4-yukuai1@huaweicloud.com
2023-06-13md/raid1-10: factor out a helper to add bio to plugYu Kuai
The code in raid1 and raid10 is identical, prepare to limit the number of plugged bios. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-3-yukuai1@huaweicloud.com
2023-06-13md/raid10: prevent soft lockup while flush writesYu Kuai
Currently, there is no limit for raid1/raid10 plugged bio. While flushing writes, raid1 has cond_resched() while raid10 doesn't, and too many writes can cause soft lockup. Follow up soft lockup can be triggered easily with writeback test for raid10 with ramdisks: watchdog: BUG: soft lockup - CPU#10 stuck for 27s! [md0_raid10:1293] Call Trace: <TASK> call_rcu+0x16/0x20 put_object+0x41/0x80 __delete_object+0x50/0x90 delete_object_full+0x2b/0x40 kmemleak_free+0x46/0xa0 slab_free_freelist_hook.constprop.0+0xed/0x1a0 kmem_cache_free+0xfd/0x300 mempool_free_slab+0x1f/0x30 mempool_free+0x3a/0x100 bio_free+0x59/0x80 bio_put+0xcf/0x2c0 free_r10bio+0xbf/0xf0 raid_end_bio_io+0x78/0xb0 one_write_done+0x8a/0xa0 raid10_end_write_request+0x1b4/0x430 bio_endio+0x175/0x320 brd_submit_bio+0x3b9/0x9b7 [brd] __submit_bio+0x69/0xe0 submit_bio_noacct_nocheck+0x1e6/0x5a0 submit_bio_noacct+0x38c/0x7e0 flush_pending_writes+0xf0/0x240 raid10d+0xac/0x1ed0 Fix the problem by adding cond_resched() to raid10 like what raid1 did. Note that unlimited plugged bio still need to be optimized, for example, in the case of lots of dirty pages writeback, this will take lots of memory and io will spend a long time in plug, hence io latency is bad. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529131106.2123367-2-yukuai1@huaweicloud.com
2023-06-13md/raid10: fix io loss while replacement replace rdevLi Nan
When removing a disk with replacement, the replacement will be used to replace rdev. During this process, there is a brief window in which both rdev and replacement are read as NULL in raid10_write_request(). This will result in io not being submitted but it should be. //remove //write raid10_remove_disk raid10_write_request mirror->rdev = NULL read rdev -> NULL mirror->rdev = mirror->replacement mirror->replacement = NULL read replacement -> NULL Fix it by reading replacement first and rdev later, meanwhile, use smp_mb() to prevent memory reordering. Fixes: 475b0321a4df ("md/raid10: writes should get directed to replacement as well as original.") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230602091839.743798-3-linan666@huaweicloud.com
2023-06-13md/raid10: Do not add spare disk when recovery failsLi Nan
In raid10_sync_request(), if data cannot be read from any disk for recovery, it will go to 'giveup' and let 'chunks_skipped' + 1. After multiple 'giveup', when 'chunks_skipped >= geo.raid_disks', it will return 'max_sector', indicating that the recovery has been completed. However, the recovery is just aborted and the data remains inconsistent. Fix it by setting mirror->recovery_disabled, which will prevent the spare disk from being added to this mirror. The same issue also exists during resync, it will be fixed afterwards. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230602091839.743798-2-linan666@huaweicloud.com
2023-06-13md/raid10: clean up md_add_new_disk()Li Nan
Commit 1a855a060665 ("md: fix bug with re-adding of partially recovered device.") only add device which is set to In_sync. But it let devices without metadata cannot be added when they should be. Commit bf572541ab44 ("md: fix regression with re-adding devices to arrays with no metadata") fix the above issue, it set device without metadata to In_sync when add new disk. However, after commit f466722ca614 ("md: Change handling of save_raid_disk and metadata update during recovery.") deletes changes of the first patch, setting In_sync for devcie without metadata is meanless because the flag will be cleared soon and will not be used during this period. Clean it up. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230527101851.3266500-2-linan666@huaweicloud.com
2023-06-13md/raid10: prioritize adding disk to 'removed' mirrorLi Nan
When add a new disk to raid10, it will traverse conf->mirror from start and find one of the following mirror to add: 1. mirror->rdev is set to WantReplacement and it have no replacement, set new disk to mirror->replacement. 2. no mirror->rdev, set new disk to mirror->rdev. There is a array as below (sda is set to WantReplacement): Number Major Minor RaidDevice State 0 8 0 0 active sync set-A /dev/sda - 0 0 1 removed 2 8 32 2 active sync set-A /dev/sdc 3 8 48 3 active sync set-B /dev/sdd Use 'mdadm --add' to add a new disk to this array, the new disk will become sda's replacement instead of add to removed position, which is confusing for users. Meanwhile, after new disk recovery success, sda will be set to Faulty. Prioritize adding disk to 'removed' mirror is a better choice. In the above scenario, the behavior is the same as before, except sda will not be deleted. Before other disks are added, continued use sda is more reliable. Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230527092007.3008856-1-linan666@huaweicloud.com
2023-06-13md/raid10: improve code of mrdev in raid10_sync_requestLi Nan
'need_recover' and 'mrdev' are equivalent in raid10_sync_request(), and inc mrdev->nr_pending is unreasonable if don't need recovery. Replace 'need_recover' with 'mrdev', and only inc nr_pending when needed. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230527072218.2365857-3-linan666@huaweicloud.com
2023-06-13md/raid10: fix null-ptr-deref of mreplace in raid10_sync_requestLi Nan
There are two check of 'mreplace' in raid10_sync_request(). In the first check, 'need_replace' will be set and 'mreplace' will be used later if no-Faulty 'mreplace' exists, In the second check, 'mreplace' will be set to NULL if it is Faulty, but 'need_replace' will not be changed accordingly. null-ptr-deref occurs if Faulty is set between two check. Fix it by merging two checks into one. And replace 'need_replace' with 'mreplace' because their values are always the same. Fixes: ee37d7314a32 ("md/raid10: Fix raid10 replace hang when new added disk faulty") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230527072218.2365857-2-linan666@huaweicloud.com
2023-06-13md/raid5: don't start reshape when recovery or replace is in progressYu Kuai
When recovery is interrupted (reboot, etc.) check for MD_RECOVERY_RUNNING is not enough to tell recovery is in progress. Also check recovery_cp before starting reshape. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230529133410.2125914-1-yukuai1@huaweicloud.com
2023-06-13md: protect md_thread with rcuYu Kuai
Currently, there are many places that md_thread can be accessed without protection, following are known scenarios that can cause null-ptr-dereference or uaf: 1) sync_thread that is allocated and started from md_start_sync() 2) mddev->thread can be accessed directly from timeout_store() and md_bitmap_daemon_work() 3) md_unregister_thread() from action_store(). Currently, a global spinlock 'pers_lock' is borrowed to protect 'mddev->thread' in some places, this problem can be fixed likewise, however, use a global lock for all the cases is not good. Fix this problem by protecting all md_thread with rcu. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-6-yukuai1@huaweicloud.com
2023-06-13md/bitmap: factor out a helper to set timeoutYu Kuai
Register/unregister 'mddev->thread' are both under 'reconfig_mutex', however, some context didn't hold the mutex to access mddev->thread, which can cause null-ptr-deference: 1) md_bitmap_daemon_work() can be called from md_check_recovery() where 'reconfig_mutex' is not held, deference 'mddev->thread' might cause null-ptr-deference, because md_unregister_thread() reset the pointer before stopping the thread. 2) timeout_store() access 'mddev->thread' multiple times, null-ptr-deference can be triggered if 'mddev->thread' is reset in the middle. This patch factor out a helper to set timeout, the new helper always check if 'mddev->thread' is null first, so that problem 1 can be fixed. Now that this helper only access 'mddev->thread' once, but it's possible that 'mddev->thread' can be freed while this helper is still in progress, hence the problem is not fixed yet. Follow up patches will fix this by protecting md_thread with rcu. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-5-yukuai1@huaweicloud.com
2023-06-13md/bitmap: always wake up md_thread in timeout_storeYu Kuai
md_wakeup_thread() can handle the case that pass in md_thread is NULL, the only difference is that md_wakeup_thread() will be called when current timeout is 'MAX_SCHEDULE_TIMEOUT', this should not matter because timeout_store() is not hot path, and the daemon process is woke up more than demand from other context already. Prepare to factor out a helper to set timeout. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230523021017.3048783-4-yukuai1@huaweicloud.com