summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2024-09-10blk-throttle: remove last_low_overflow_timeYu Kuai
last_low_overflow_time is not used anymore after commit bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW"). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240903135149.271857-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10block: Prevent deadlocks when switching elevatorsDamien Le Moal
Commit af2814149883 ("block: freeze the queue in queue_attr_store") changed queue_attr_store() to always freeze a sysfs attribute queue before calling the attribute store() method, to ensure that no IOs are in-flight when an attribute value is being updated. However, this change created a potential deadlock situation for the scheduler queue attribute as changing the queue elevator with elv_iosched_store() can result in a call to request_module() if the user requested module is not already registered. If the file of the requested module is stored on the block device of the frozen queue, a deadlock will happen as the read operations triggered by request_module() will wait for the queue freeze to end. Solve this issue by introducing the load_module method in struct queue_sysfs_entry, and to calling this method function in queue_attr_store() before freezing the attribute queue. The macro definition QUEUE_RW_LOAD_MODULE_ENTRY() is added to define a queue sysfs attribute that needs loading a module. The definition of the scheduler atrribute is changed to using QUEUE_RW_LOAD_MODULE_ENTRY(), with the function elv_iosched_load_module() defined as the load_module method. elv_iosched_store() can then be simplified to remove the call to request_module(). Reported-by: Richard W.M. Jones <rjones@redhat.com> Reported-by: Jiri Jaburek <jjaburek@redhat.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219166 Fixes: af2814149883 ("block: freeze the queue in queue_attr_store") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Richard W.M. Jones <rjones@redhat.com> Link: https://lore.kernel.org/r/20240908000704.414538-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-07blk-mq: add missing unplug trace eventKeith Busch
The single-queue optimized list flush doesn't have an unplug trace event to pair with the plug event. Add one. In the unlikely event an error occurs and falls back to the less optimized plug flush path, it's possible a 2nd unplug trace event will be logged, but it will show the remainig count that weren't previously handled. Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20240906194540.3719642-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-04block: fix integer overflow in BLKSECDISCARDAlexey Dobriyan
I independently rediscovered commit 22d24a544b0d49bbcbd61c8c0eaf77d3c9297155 block: fix overflow in blk_ioctl_discard() but for secure erase. Same problem: uint64_t r[2] = {512, 18446744073709551104ULL}; ioctl(fd, BLKSECDISCARD, r); will enter near infinite loop inside blkdev_issue_secure_erase(): a.out: attempt to access beyond end of device loop0: rw=5, sector=3399043073, nr_sectors = 1024 limit=2048 bio_check_eod: 3286214 callbacks suppressed Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Link: https://lore.kernel.org/r/9e64057f-650a-46d1-b9f7-34af391536ef@p183 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-04block: fix comment to use set_current_stateAlvaro Parker
The explanatory comment used `set_task_state` instead of `set_current_state` which is the function actually used in the code. Signed-off-by: Alvaro Parker <alparkerdf@gmail.com> Link: https://lore.kernel.org/r/20240903172214.520086-1-alparkerdf@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-04bio-integrity: don't restrict the size of integrity metadataMikulas Patocka
bio_integrity_add_page restricts the size of the integrity metadata to queue_max_hw_sectors(q). This restriction is not needed because oversized bios are split automatically. This restriction causes problems with dm-integrity 'inline' mode - if we send a large bio to dm-integrity and the bio's metadata are larger than queue_max_hw_sectors(q), bio_integrity_add_page fails and the bio is ended with BLK_STS_RESOURCE error. An example that triggers it: dd: error writing '/dev/mapper/in2': Cannot allocate memory 1+0 records in 0+0 records out 0 bytes copied, 0.00169291 s, 0.0 kB/s Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: fb0987682c62 ("dm-integrity: introduce the Inline mode") Fixes: 0ece1d649b6d ("bio-integrity: create multi-page bvecs in bio_integrity_add_page()") Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/e41b3b8e-16c2-70cb-97cb-881234bb200d@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: use bfq_reassign_last_bfqq() in bfq_bfqq_move()Yu Kuai
Instead of open coding it, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: don't break merge chain in bfq_split_bfqq()Yu Kuai
Consider the following scenario: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ | | | \-------------\ \-------------\ \--------------\| V V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 1 2 4 If Process 1 issue a new IO and bfqq2 is found, and then bfq_init_rq() decide to spilt bfqq2 by bfq_split_bfqq(). Howerver, procress reference of bfqq2 is 1 and bfq_split_bfqq() just clear the coop flag, which will break the merge chain. Expected result: caller will allocate a new bfqq for BIC1 Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) | | | \-------------\ \--------------\| V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 0 1 3 Since the condition is only used for the last bfqq4 when the previous bfqq2 and bfqq3 are already splited. Fix the problem by checking if bfqq is the last one in the merge chain as well. Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()Yu Kuai
Consider the following merge chain: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ | | | \--------------\ \-------------\ \-------------\| V V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 IO from Process 1 will get bfqf2 from BIC1 first, then bfq_setup_cooperator() will found bfqq2 already merged to bfqq3 and then handle this IO from bfqq3. However, the merge chain can be much deeper and bfqq3 can be merged to other bfqq as well. Fix this problem by iterating to the last bfqq in bfq_setup_cooperator(). Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: fix possible UAF for bfqq->bic with merge chainYu Kuai
1) initial state, three tasks: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) | Λ | Λ | Λ | | | | | | V | V | V | bfqq1 bfqq2 bfqq3 process ref: 1 1 1 2) bfqq1 merged to bfqq2: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) | | | Λ \--------------\| | | V V | bfqq1--------->bfqq2 bfqq3 process ref: 0 2 1 3) bfqq2 merged to bfqq3: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) here -> Λ | | \--------------\ \-------------\| V V bfqq1--------->bfqq2---------->bfqq3 process ref: 0 1 3 In this case, IO from Process 1 will get bfqq2 from BIC1 first, and then get bfqq3 through merge chain, and finially handle IO by bfqq3. Howerver, current code will think bfqq2 is owned by BIC1, like initial state, and set bfqq2->bic to BIC1. bfq_insert_request -> by Process 1 bfqq = bfq_init_rq(rq) bfqq = bfq_get_bfqq_handle_split bfqq = bic_to_bfqq -> get bfqq2 from BIC1 bfqq->ref++ rq->elv.priv[0] = bic rq->elv.priv[1] = bfqq if (bfqq_process_refs(bfqq) == 1) bfqq->bic = bic -> record BIC1 to bfqq2 __bfq_insert_request new_bfqq = bfq_setup_cooperator -> get bfqq3 from bfqq2->new_bfqq bfqq_request_freed(bfqq) new_bfqq->ref++ rq->elv.priv[1] = new_bfqq -> handle IO by bfqq3 Fix the problem by checking bfqq is from merge chain fist. And this might fix a following problem reported by our syzkaller(unreproducible): ================================================================== BUG: KASAN: slab-use-after-free in bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline] BUG: KASAN: slab-use-after-free in bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline] BUG: KASAN: slab-use-after-free in bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889 Write of size 1 at addr ffff888123839eb8 by task kworker/0:1H/18595 CPU: 0 PID: 18595 Comm: kworker/0:1H Tainted: G L 6.6.0-07439-gba2303cacfda #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Workqueue: kblockd blk_mq_requeue_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:364 [inline] print_report+0x10d/0x610 mm/kasan/report.c:475 kasan_report+0x8e/0xc0 mm/kasan/report.c:588 bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline] bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline] bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889 bfq_get_bfqq_handle_split+0x169/0x5d0 block/bfq-iosched.c:6757 bfq_init_rq block/bfq-iosched.c:6876 [inline] bfq_insert_request block/bfq-iosched.c:6254 [inline] bfq_insert_requests+0x1112/0x5cf0 block/bfq-iosched.c:6304 blk_mq_insert_request+0x290/0x8d0 block/blk-mq.c:2593 blk_mq_requeue_work+0x6bc/0xa70 block/blk-mq.c:1502 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 </TASK> Allocated by task 20776: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 __kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328 kasan_slab_alloc include/linux/kasan.h:188 [inline] slab_post_alloc_hook mm/slab.h:763 [inline] slab_alloc_node mm/slub.c:3458 [inline] kmem_cache_alloc_node+0x1a4/0x6f0 mm/slub.c:3503 ioc_create_icq block/blk-ioc.c:370 [inline] ioc_find_get_icq+0x180/0xaa0 block/blk-ioc.c:436 bfq_prepare_request+0x39/0xf0 block/bfq-iosched.c:6812 blk_mq_rq_ctx_init.isra.7+0x6ac/0xa00 block/blk-mq.c:403 __blk_mq_alloc_requests+0xcc0/0x1070 block/blk-mq.c:517 blk_mq_get_new_requests block/blk-mq.c:2940 [inline] blk_mq_submit_bio+0x624/0x27c0 block/blk-mq.c:3042 __submit_bio+0x331/0x6f0 block/blk-core.c:624 __submit_bio_noacct_mq block/blk-core.c:703 [inline] submit_bio_noacct_nocheck+0x816/0xb40 block/blk-core.c:732 submit_bio_noacct+0x7a6/0x1b50 block/blk-core.c:826 xlog_write_iclog+0x7d5/0xa00 fs/xfs/xfs_log.c:1958 xlog_state_release_iclog+0x3b8/0x720 fs/xfs/xfs_log.c:619 xlog_cil_push_work+0x19c5/0x2270 fs/xfs/xfs_log_cil.c:1330 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 Freed by task 946: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 ____kasan_slab_free mm/kasan/common.c:236 [inline] __kasan_slab_free+0x12c/0x1c0 mm/kasan/common.c:244 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1815 [inline] slab_free_freelist_hook mm/slub.c:1841 [inline] slab_free mm/slub.c:3786 [inline] kmem_cache_free+0x118/0x6f0 mm/slub.c:3808 rcu_do_batch+0x35c/0xe30 kernel/rcu/tree.c:2189 rcu_core+0x819/0xd90 kernel/rcu/tree.c:2462 __do_softirq+0x1b0/0x7a2 kernel/softirq.c:553 Last potentially related work creation: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2712 [inline] call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 Second to last potentially related work creation: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2712 [inline] call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 The buggy address belongs to the object at ffff888123839d68 which belongs to the cache bfq_io_cq of size 1360 The buggy address is located 336 bytes inside of freed 1360-byte region [ffff888123839d68, ffff88812383a2b8) The buggy address belongs to the physical page: page:ffffea00048e0e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812383f588 pfn:0x123838 head:ffffea00048e0e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x17ffffc0000a40(workingset|slab|head|node=0|zone=2|lastcpupid=0x1fffff) page_type: 0xffffffff() raw: 0017ffffc0000a40 ffff88810588c200 ffffea00048ffa10 ffff888105889488 raw: ffff88812383f588 0000000000150006 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888123839d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888123839e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888123839e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888123839f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888123839f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03iomap: add a private argument for iomap_file_buffered_writeJosef Bacik
In order to switch fuse over to using iomap for buffered writes we need to be able to have the struct file for the original write, in case we have to read in the page to make it uptodate. Handle this by using the existing private field in the iomap_iter, and add the argument to iomap_file_buffered_write. This will allow us to pass the file in through the iomap buffered write path, and is flexible for any other file systems needs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/7f55c7c32275004ba00cddf862d970e6e633f750.1724755651.git.josef@toxicpanda.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-29block: don't use bio_split_rw on misc operationsChristoph Hellwig
bio_split_rw is designed to split read and write bios with a payload. Currently it is called by __bio_split_to_limits for all operations not explicitly list, which works because bio_may_need_split explicitly checks for bi_vcnt == 1 and thus skips the bypass if there is no payload and bio_for_each_bvec loop will never execute it's body if bi_size is 0. But all this is hard to understand, fragile and wasted pointless cycles. Switch __bio_split_to_limits to only call bio_split_rw for READ and WRITE command and don't attempt any kind split for operation that do not require splitting. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Link: https://lore.kernel.org/r/20240826173820.1690925-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29block: properly handle REQ_OP_ZONE_APPEND in __bio_split_to_limitsChristoph Hellwig
Currently REQ_OP_ZONE_APPEND is handled by the bio_split_rw case in __bio_split_to_limits. This is harmful because REQ_OP_ZONE_APPEND bios do not adhere to the soft max_limits value but instead use their own capped version of max_hw_sectors, leading to incorrect splits that later blow up in bio_split. We still need the bio_split_rw logic to count nr_segs for blk-mq code, so add a new wrapper that passes in the right limit, and turns any bio that would need a split into an error as an additional debugging aid. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Link: https://lore.kernel.org/r/20240826173820.1690925-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-29block: rework bio splittingChristoph Hellwig
The current setup with bio_may_exceed_limit and __bio_split_to_limits is a bit of a mess. Change it so that __bio_split_to_limits does all the work and is just a variant of bio_split_to_limits that returns nr_segs. This is done by inlining it and instead have the various bio_split_* helpers directly submit the potentially split bios. To support btrfs, the rw version has a lower level helper split out that just returns the offset to split. This turns out to nicely clean up the btrfs flow as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Link: https://lore.kernel.org/r/20240826173820.1690925-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-28block: remove checks for FALLOC_FL_NO_HIDE_STALEChristoph Hellwig
While the FALLOC_FL_NO_HIDE_STALE value has been registered, it has always been rejected by vfs_fallocate before making it into blkdev_fallocate because it isn't in the supported mask. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-2-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28block: fix detection of unsupported WRITE SAME in blkdev_issue_write_zeroesDarrick J. Wong
On error, blkdev_issue_write_zeroes used to recheck the block device's WRITE SAME queue limits after submitting WRITE SAME bios. As stated in the comment, the purpose of this was to collapse all IO errors to EOPNOTSUPP if the effect of issuing bios was that WRITE SAME got turned off in the queue limits. Therefore, it does not make sense to reuse the zeroes limit that was read earlier in the function because we only care about the queue limit *now*, not what it was at the start of the function. Found by running generic/351 from fstests. Fixes: 64b582ca88ca1 ("block: Read max write zeroes once for __blkdev_issue_write_zeroes()") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240827175340.GB1977952@frogsfrogsfrogs Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-26blk_iocost: fix more out of bound shiftsKonstantin Ovsepian
Recently running UBSAN caught few out of bound shifts in the ioc_forgive_debts() function: UBSAN: shift-out-of-bounds in block/blk-iocost.c:2142:38 shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long long') ... UBSAN: shift-out-of-bounds in block/blk-iocost.c:2144:30 shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long long') ... Call Trace: <IRQ> dump_stack_lvl+0xca/0x130 __ubsan_handle_shift_out_of_bounds+0x22c/0x280 ? __lock_acquire+0x6441/0x7c10 ioc_timer_fn+0x6cec/0x7750 ? blk_iocost_init+0x720/0x720 ? call_timer_fn+0x5d/0x470 call_timer_fn+0xfa/0x470 ? blk_iocost_init+0x720/0x720 __run_timer_base+0x519/0x700 ... Actual impact of this issue was not identified but I propose to fix the undefined behaviour. The proposed fix to prevent those out of bound shifts consist of precalculating exponent before using it the shift operations by taking min value from the actual exponent and maximum possible number of bits. Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Konstantin Ovsepian <ovs@ovs.to> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240822154137.2627818-1-ovs@ovs.to Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-20block,lsm: add LSM blob and new LSM hooks for block devicesDeven Bowers
This patch introduces a new LSM blob to the block_device structure, enabling the security subsystem to store security-sensitive data related to block devices. Currently, for a device mapper's mapped device containing a dm-verity target, critical security information such as the roothash and its signing state are not readily accessible. Specifically, while the dm-verity volume creation process passes the dm-verity roothash and its signature from userspace to the kernel, the roothash is stored privately within the dm-verity target, and its signature is discarded post-verification. This makes it extremely hard for the security subsystem to utilize these data. With the addition of the LSM blob to the block_device structure, the security subsystem can now retain and manage important security metadata such as the roothash and the signing state of a dm-verity by storing them inside the blob. Access decisions can then be based on these stored data. The implementation follows the same approach used for security blobs in other structures like struct file, struct inode, and struct superblock. The initialization of the security blob occurs after the creation of the struct block_device, performed by the security subsystem. Similarly, the security blob is freed by the security subsystem before the struct block_device is deallocated or freed. This patch also introduces a new hook security_bdev_setintegrity() to save block device's integrity data to the new LSM blob. For example, for dm-verity, it can use this hook to expose its roothash and signing state to LSMs, then LSMs can save these data into the LSM blob. Please note that the new hook should be invoked every time the security information is updated to keep these data current. For example, in dm-verity, if the mapping table is reloaded and configured to use a different dm-verity target with a new roothash and signing information, the previously stored data in the LSM blob will become obsolete. It is crucial to re-invoke the hook to refresh these data and ensure they are up to date. This necessity arises from the design of device-mapper, where a device-mapper device is first created, and then targets are subsequently loaded into it. These targets can be modified multiple times during the device's lifetime. Therefore, while the LSM blob is allocated during the creation of the block device, its actual contents are not initialized at this stage and can change substantially over time. This includes alterations from data that the LSM 'trusts' to those it does not, making it essential to handle these changes correctly. Failure to address this dynamic aspect could potentially allow for bypassing LSM checks. Signed-off-by: Deven Bowers <deven.desai@linux.microsoft.com> Signed-off-by: Fan Wu <wufan@linux.microsoft.com> [PM: merge fuzz, subject line tweaks] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-08-20softirq: Remove unused 'action' parameter from action callbackCaleb Sander Mateos
When soft interrupt actions are called, they are passed a pointer to the struct softirq action which contains the action's function pointer. This pointer isn't useful, as the action callback already knows what function it is. And since each callback handles a specific soft interrupt, the callback also knows which soft interrupt number is running. No soft interrupt action callback actually uses this parameter, so remove it from the function pointer signature. This clarifies that soft interrupt actions are global routines and makes it slightly cheaper to call them. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/all/20240815171549.3260003-1-csander@purestorage.com
2024-08-19block: Read max write zeroes once for __blkdev_issue_write_zeroes()John Garry
As reported in [0], we may get a hang when formatting a XFS FS on a RAID0 drive. Commit 73a768d5f955 ("block: factor out a blk_write_zeroes_limit helper") changed __blkdev_issue_write_zeroes() to read the max write zeroes value in the loop. This is not safe as max write zeroes may change in value. Specifically for the case of [0], the value goes to 0, and we get an infinite loop. Lift the limit reading out of the loop. [0] https://lore.kernel.org/linux-xfs/4d31268f-310b-4220-88a2-e191c3932a82@oracle.com/T/#t Fixes: 73a768d5f955 ("block: factor out a blk_write_zeroes_limit helper") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240815163228.216051-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-16blk-cgroup: Remove unused declaration blkg_path()Yue Haibing
Commit bb7e5a193d8b ("block, bfq: remove blkg_path()") removed the implementation but leave declaration. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Link: https://lore.kernel.org/r/20240816095821.877842-1-yuehaibing@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-15block: Fix lockdep warning in blk_mq_mark_tag_waitLi Lingfeng
Lockdep reported a warning in Linux version 6.6: [ 414.344659] ================================ [ 414.345155] WARNING: inconsistent lock state [ 414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted [ 414.346221] -------------------------------- [ 414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [ 414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes: [ 414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0 [ 414.351204] {IN-SOFTIRQ-W} state was registered at: [ 414.351751] lock_acquire+0x18d/0x460 [ 414.352218] _raw_spin_lock_irqsave+0x39/0x60 [ 414.352769] __wake_up_common_lock+0x22/0x60 [ 414.353289] sbitmap_queue_wake_up+0x375/0x4f0 [ 414.353829] sbitmap_queue_clear+0xdd/0x270 [ 414.354338] blk_mq_put_tag+0xdf/0x170 [ 414.354807] __blk_mq_free_request+0x381/0x4d0 [ 414.355335] blk_mq_free_request+0x28b/0x3e0 [ 414.355847] __blk_mq_end_request+0x242/0xc30 [ 414.356367] scsi_end_request+0x2c1/0x830 [ 414.345155] WARNING: inconsistent lock state [ 414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted [ 414.346221] -------------------------------- [ 414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. [ 414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes: [ 414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0 [ 414.351204] {IN-SOFTIRQ-W} state was registered at: [ 414.351751] lock_acquire+0x18d/0x460 [ 414.352218] _raw_spin_lock_irqsave+0x39/0x60 [ 414.352769] __wake_up_common_lock+0x22/0x60 [ 414.353289] sbitmap_queue_wake_up+0x375/0x4f0 [ 414.353829] sbitmap_queue_clear+0xdd/0x270 [ 414.354338] blk_mq_put_tag+0xdf/0x170 [ 414.354807] __blk_mq_free_request+0x381/0x4d0 [ 414.355335] blk_mq_free_request+0x28b/0x3e0 [ 414.355847] __blk_mq_end_request+0x242/0xc30 [ 414.356367] scsi_end_request+0x2c1/0x830 [ 414.356863] scsi_io_completion+0x177/0x1610 [ 414.357379] scsi_complete+0x12f/0x260 [ 414.357856] blk_complete_reqs+0xba/0xf0 [ 414.358338] __do_softirq+0x1b0/0x7a2 [ 414.358796] irq_exit_rcu+0x14b/0x1a0 [ 414.359262] sysvec_call_function_single+0xaf/0xc0 [ 414.359828] asm_sysvec_call_function_single+0x1a/0x20 [ 414.360426] default_idle+0x1e/0x30 [ 414.360873] default_idle_call+0x9b/0x1f0 [ 414.361390] do_idle+0x2d2/0x3e0 [ 414.361819] cpu_startup_entry+0x55/0x60 [ 414.362314] start_secondary+0x235/0x2b0 [ 414.362809] secondary_startup_64_no_verify+0x18f/0x19b [ 414.363413] irq event stamp: 428794 [ 414.363825] hardirqs last enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200 [ 414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50 [ 414.365629] softirqs last enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2 [ 414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0 [ 414.367425] other info that might help us debug this: [ 414.368194] Possible unsafe locking scenario: [ 414.368900] CPU0 [ 414.369225] ---- [ 414.369548] lock(&sbq->ws[i].wait); [ 414.370000] <Interrupt> [ 414.370342] lock(&sbq->ws[i].wait); [ 414.370802] *** DEADLOCK *** [ 414.371569] 5 locks held by kworker/u10:3/1152: [ 414.372088] #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0 [ 414.373180] #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0 [ 414.374384] #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00 [ 414.375342] #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0 [ 414.376377] #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0 [ 414.378607] stack backtrace: [ 414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6 [ 414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 414.381177] Workqueue: writeback wb_workfn (flush-253:0) [ 414.381805] Call Trace: [ 414.382136] <TASK> [ 414.382429] dump_stack_lvl+0x91/0xf0 [ 414.382884] mark_lock_irq+0xb3b/0x1260 [ 414.383367] ? __pfx_mark_lock_irq+0x10/0x10 [ 414.383889] ? stack_trace_save+0x8e/0xc0 [ 414.384373] ? __pfx_stack_trace_save+0x10/0x10 [ 414.384903] ? graph_lock+0xcf/0x410 [ 414.385350] ? save_trace+0x3d/0xc70 [ 414.385808] mark_lock.part.20+0x56d/0xa90 [ 414.386317] mark_held_locks+0xb0/0x110 [ 414.386791] ? __pfx_do_raw_spin_lock+0x10/0x10 [ 414.387320] lockdep_hardirqs_on_prepare+0x297/0x3f0 [ 414.387901] ? _raw_spin_unlock_irq+0x28/0x50 [ 414.388422] trace_hardirqs_on+0x58/0x100 [ 414.388917] _raw_spin_unlock_irq+0x28/0x50 [ 414.389422] __blk_mq_tag_busy+0x1d6/0x2a0 [ 414.389920] __blk_mq_get_driver_tag+0x761/0x9f0 [ 414.390899] blk_mq_dispatch_rq_list+0x1780/0x1ee0 [ 414.391473] ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10 [ 414.392070] ? sbitmap_get+0x2b8/0x450 [ 414.392533] ? __blk_mq_get_driver_tag+0x210/0x9f0 [ 414.393095] __blk_mq_sched_dispatch_requests+0xd99/0x1690 [ 414.393730] ? elv_attempt_insert_merge+0x1b1/0x420 [ 414.394302] ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10 [ 414.394970] ? lock_acquire+0x18d/0x460 [ 414.395456] ? blk_mq_run_hw_queue+0x637/0xa00 [ 414.395986] ? __pfx_lock_acquire+0x10/0x10 [ 414.396499] blk_mq_sched_dispatch_requests+0x109/0x190 [ 414.397100] blk_mq_run_hw_queue+0x66e/0xa00 [ 414.397616] blk_mq_flush_plug_list.part.17+0x614/0x2030 [ 414.398244] ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10 [ 414.398897] ? writeback_sb_inodes+0x241/0xcc0 [ 414.399429] blk_mq_flush_plug_list+0x65/0x80 [ 414.399957] __blk_flush_plug+0x2f1/0x530 [ 414.400458] ? __pfx___blk_flush_plug+0x10/0x10 [ 414.400999] blk_finish_plug+0x59/0xa0 [ 414.401467] wb_writeback+0x7cc/0x920 [ 414.401935] ? __pfx_wb_writeback+0x10/0x10 [ 414.402442] ? mark_held_locks+0xb0/0x110 [ 414.402931] ? __pfx_do_raw_spin_lock+0x10/0x10 [ 414.403462] ? lockdep_hardirqs_on_prepare+0x297/0x3f0 [ 414.404062] wb_workfn+0x2b3/0xcf0 [ 414.404500] ? __pfx_wb_workfn+0x10/0x10 [ 414.404989] process_scheduled_works+0x432/0x13f0 [ 414.405546] ? __pfx_process_scheduled_works+0x10/0x10 [ 414.406139] ? do_raw_spin_lock+0x101/0x2a0 [ 414.406641] ? assign_work+0x19b/0x240 [ 414.407106] ? lock_is_held_type+0x9d/0x110 [ 414.407604] worker_thread+0x6f2/0x1160 [ 414.408075] ? __kthread_parkme+0x62/0x210 [ 414.408572] ? lockdep_hardirqs_on_prepare+0x297/0x3f0 [ 414.409168] ? __kthread_parkme+0x13c/0x210 [ 414.409678] ? __pfx_worker_thread+0x10/0x10 [ 414.410191] kthread+0x33c/0x440 [ 414.410602] ? __pfx_kthread+0x10/0x10 [ 414.411068] ret_from_fork+0x4d/0x80 [ 414.411526] ? __pfx_kthread+0x10/0x10 [ 414.411993] ret_from_fork_asm+0x1b/0x30 [ 414.412489] </TASK> When interrupt is turned on while a lock holding by spin_lock_irq it throws a warning because of potential deadlock. blk_mq_prep_dispatch_rq blk_mq_get_driver_tag __blk_mq_get_driver_tag __blk_mq_alloc_driver_tag blk_mq_tag_busy -> tag is already busy // failed to get driver tag blk_mq_mark_tag_wait spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait) __add_wait_queue(wq, wait) -> wait queue active blk_mq_get_driver_tag __blk_mq_tag_busy -> 1) tag must be idle, which means there can't be inflight IO spin_lock_irq(&tags->lock) -> lock B (hctx->tags) spin_unlock_irq(&tags->lock) -> unlock B, turn on interrupt accidentally -> 2) context must be preempt by IO interrupt to trigger deadlock. As shown above, the deadlock is not possible in theory, but the warning still need to be fixed. Fix it by using spin_lock_irqsave to get lockB instead of spin_lock_irq. Fixes: 4f1731df60f9 ("blk-mq: fix potential io hang by wrong 'wake_batch'") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240815024736.2040971-1-lilingfeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-13block: constify ext_pi_ref_escape()Alexey Dobriyan
This function doesn't mutate data. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/d24611b3-dddf-473a-903d-39290db03b11@p183 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-13block: delete module stuff from t10-piAlexey Dobriyan
It is not possible to build t10-pi.ko anymore. This file doesn't even export functions. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/216ccc79-5b80-47b2-b507-990951aa810c@p183 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-08-12scsi: block: Don't check REQ_ATOMIC for readsJohn Garry
We check in submit_bio_noacct() if flag REQ_ATOMIC is set for both read and write operations, and then validate the atomic operation if set. Flag REQ_ATOMIC can only be set for writes, so don't bother checking for reads. Fixes: 9da3d1e912f3 ("block: Add core atomic write support") Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240805113315.1048591-3-john.g.garry@oracle.com Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2024-08-07fs: Convert aops->write_begin to take a folioMatthew Wilcox (Oracle)
Convert all callers from working on a page to working on one page of a folio (support for working on an entire folio can come later). Removes a lot of folio->page->folio conversions. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07fs: Convert aops->write_end to take a folioMatthew Wilcox (Oracle)
Most callers have a folio, and most implementations operate on a folio, so remove the conversion from folio->page->folio to fit through this interface. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07buffer: Convert block_write_end() to take a folioMatthew Wilcox (Oracle)
All callers now have a folio, so pass it in instead of converting from a folio to a page and back to a folio again. Saves a call to compound_head(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07block: Use a folio in blkdev_write_end()Matthew Wilcox (Oracle)
Replaces two hidden calls to compound_head() with one explicit one. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-28blk-ioprio: remove per-disk structureYu Kuai
ioprio works on the blk-cgroup level, all disks in the same cgroup are the same, and the struct ioprio_blkg doesn't have anything in it. Hence register the policy is enough, because cpd_alloc/free_fn will be handled for each blk-cgroup, and there is no need to activate the policy for disk. Hence remove blk_ioprio_init/exit and ioprio_alloc/free_pd. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240719071506.158075-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-28blk-ioprio: remove ioprio_blkcg_from_bio()Yu Kuai
Currently, if config is enabled, then ioprio is always enabled by default from blkcg_init_disk(), hence there is no point to check if the policy is enabled from blkg in ioprio_blkcg_from_bio(). Hence remove it and get blkcg directly from bio. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240719071506.158075-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-28blk-cgroup: check for pd_(alloc|free)_fn in blkcg_activate_policy()Yu Kuai
Currently all policies implement pd_(alloc|free)_fn, however, this is not necessary for ioprio that only works for blkcg, not blkg. There are no functional changes, prepare to cleanup activating ioprio policy. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240719071506.158075-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-27blk-throttle: remove more latency dead-codeDr. David Alan Gilbert
The struct 'latency_bucket' and the #define 'request_bucket_index' are unused since commit bf20ab538c81 ("blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW") and the 'LATENCY_BUCKET_SIZE' #define was only used by the 'request_bucket_index' define. Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://lore.kernel.org/r/20240727155824.1000042-1-linux@treblig.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-24block: fix deadlock between sd_remove & sd_releaseYang Yang
Our test report the following hung task: [ 2538.459400] INFO: task "kworker/0:0":7 blocked for more than 188 seconds. [ 2538.459427] Call trace: [ 2538.459430] __switch_to+0x174/0x338 [ 2538.459436] __schedule+0x628/0x9c4 [ 2538.459442] schedule+0x7c/0xe8 [ 2538.459447] schedule_preempt_disabled+0x24/0x40 [ 2538.459453] __mutex_lock+0x3ec/0xf04 [ 2538.459456] __mutex_lock_slowpath+0x14/0x24 [ 2538.459459] mutex_lock+0x30/0xd8 [ 2538.459462] del_gendisk+0xdc/0x350 [ 2538.459466] sd_remove+0x30/0x60 [ 2538.459470] device_release_driver_internal+0x1c4/0x2c4 [ 2538.459474] device_release_driver+0x18/0x28 [ 2538.459478] bus_remove_device+0x15c/0x174 [ 2538.459483] device_del+0x1d0/0x358 [ 2538.459488] __scsi_remove_device+0xa8/0x198 [ 2538.459493] scsi_forget_host+0x50/0x70 [ 2538.459497] scsi_remove_host+0x80/0x180 [ 2538.459502] usb_stor_disconnect+0x68/0xf4 [ 2538.459506] usb_unbind_interface+0xd4/0x280 [ 2538.459510] device_release_driver_internal+0x1c4/0x2c4 [ 2538.459514] device_release_driver+0x18/0x28 [ 2538.459518] bus_remove_device+0x15c/0x174 [ 2538.459523] device_del+0x1d0/0x358 [ 2538.459528] usb_disable_device+0x84/0x194 [ 2538.459532] usb_disconnect+0xec/0x300 [ 2538.459537] hub_event+0xb80/0x1870 [ 2538.459541] process_scheduled_works+0x248/0x4dc [ 2538.459545] worker_thread+0x244/0x334 [ 2538.459549] kthread+0x114/0x1bc [ 2538.461001] INFO: task "fsck.":15415 blocked for more than 188 seconds. [ 2538.461014] Call trace: [ 2538.461016] __switch_to+0x174/0x338 [ 2538.461021] __schedule+0x628/0x9c4 [ 2538.461025] schedule+0x7c/0xe8 [ 2538.461030] blk_queue_enter+0xc4/0x160 [ 2538.461034] blk_mq_alloc_request+0x120/0x1d4 [ 2538.461037] scsi_execute_cmd+0x7c/0x23c [ 2538.461040] ioctl_internal_command+0x5c/0x164 [ 2538.461046] scsi_set_medium_removal+0x5c/0xb0 [ 2538.461051] sd_release+0x50/0x94 [ 2538.461054] blkdev_put+0x190/0x28c [ 2538.461058] blkdev_release+0x28/0x40 [ 2538.461063] __fput+0xf8/0x2a8 [ 2538.461066] __fput_sync+0x28/0x5c [ 2538.461070] __arm64_sys_close+0x84/0xe8 [ 2538.461073] invoke_syscall+0x58/0x114 [ 2538.461078] el0_svc_common+0xac/0xe0 [ 2538.461082] do_el0_svc+0x1c/0x28 [ 2538.461087] el0_svc+0x38/0x68 [ 2538.461090] el0t_64_sync_handler+0x68/0xbc [ 2538.461093] el0t_64_sync+0x1a8/0x1ac T1: T2: sd_remove del_gendisk __blk_mark_disk_dead blk_freeze_queue_start ++q->mq_freeze_depth bdev_release mutex_lock(&disk->open_mutex) sd_release scsi_execute_cmd blk_queue_enter wait_event(!q->mq_freeze_depth) mutex_lock(&disk->open_mutex) SCSI does not set GD_OWNS_QUEUE, so QUEUE_FLAG_DYING is not set in this scenario. This is a classic ABBA deadlock. To fix the deadlock, make sure we don't try to acquire disk->open_mutex after freezing the queue. Cc: stable@vger.kernel.org Fixes: eec1be4c30df ("block: delete partitions later in del_gendisk") Signed-off-by: Yang Yang <yang.yang@vivo.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: and Cc: stable tags are missing. Otherwise this patch looks fine Link: https://lore.kernel.org/r/20240724070412.22521-1-yang.yang@vivo.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-22Merge tag 'for-6.11/block-20240722' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: - MD fixes via Song: - md-cluster fixes (Heming Zhao) - raid1 fix (Mateusz Jończyk) - s390/dasd module description (Jeff) - Series cleaning up and hardening the blk-mq debugfs flag handling (John, Christoph) - blk-cgroup cleanup (Xiu) - Error polled IO attempts if backend doesn't support it (hexue) - Fix for an sbitmap hang (Yang) * tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux: (23 commits) blk-cgroup: move congestion_count to struct blkcg sbitmap: fix io hung due to race on sbitmap_word::cleared block: avoid polling configuration errors block: Catch possible entries missing from rqf_name[] block: Simplify definition of RQF_NAME() block: Use enum to define RQF_x bit indexes block: Catch possible entries missing from cmd_flag_name[] block: Catch possible entries missing from alloc_policy_name[] block: Catch possible entries missing from hctx_flag_name[] block: Catch possible entries missing from hctx_state_name[] block: Catch possible entries missing from blk_queue_flag_name[] block: Make QUEUE_FLAG_x as an enum block: Relocate BLK_MQ_MAX_DEPTH block: Relocate BLK_MQ_CPU_WORK_BATCH block: remove QUEUE_FLAG_STOPPED block: Add missing entry to hctx_flag_name[] block: Add zone write plugging entry to rqf_name[] block: Add missing entries from cmd_flag_name[] s390/dasd: fix error checks in dasd_copy_pair_store() s390/dasd: add missing MODULE_DESCRIPTION() macros ...
2024-07-22Merge tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linuxLinus Torvalds
Pull block integrity mapping updates from Jens Axboe: "A set of cleanups and fixes for the block integrity support. Sent separately from the main block changes from last week, as they depended on later fixes in the 6.10-rc cycle" * tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux: block: don't free the integrity payload in bio_integrity_unmap_free_user block: don't free submitter owned integrity payload on I/O completion block: call bio_integrity_unmap_free_user from blk_rq_unmap_user block: don't call bio_uninit from bio_endio block: also return bio_integrity_payload * from stubs block: split integrity support out of bio.h
2024-07-19Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, lpfc, qla2xxx, mpi3mr) plus some misc small fixes. The only core changes are to both bsg and scsi to pass in the device instead of setting it afterwards as q->queuedata, so no functional change" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (69 commits) scsi: aha152x: Use DECLARE_COMPLETION_ONSTACK for non-constant completion scsi: qla2xxx: Convert comma to semicolon scsi: qla2xxx: Update version to 10.02.09.300-k scsi: qla2xxx: Use QP lock to search for bsg scsi: qla2xxx: Reduce fabric scan duplicate code scsi: qla2xxx: Fix optrom version displayed in FDMI scsi: qla2xxx: During vport delete send async logout explicitly scsi: qla2xxx: Complete command early within lock scsi: qla2xxx: Fix flash read failure scsi: qla2xxx: Return ENOBUFS if sg_cnt is more than one for ELS cmds scsi: qla2xxx: Fix for possible memory corruption scsi: qla2xxx: validate nvme_local_port correctly scsi: qla2xxx: Unable to act on RSCN for port online scsi: ufs: exynos: Add support for Flash Memory Protector (FMP) scsi: ufs: core: Add UFSHCD_QUIRK_KEYS_IN_PRDT scsi: ufs: core: Add fill_crypto_prdt variant op scsi: ufs: core: Add UFSHCD_QUIRK_BROKEN_CRYPTO_ENABLE scsi: ufs: core: fold ufshcd_clear_keyslot() into its caller scsi: ufs: core: Add UFSHCD_QUIRK_CUSTOM_CRYPTO_PROFILE scsi: ufs: mcq: Make .get_hba_mac() optional ...
2024-07-19blk-cgroup: move congestion_count to struct blkcgXiu Jianfeng
The congestion_count was introduced into the struct cgroup by commit d09d8df3a294 ("blkcg: add generic throttling mechanism"), but since it is closely related to the blkio subsys, it is not appropriate to put it in the struct cgroup, so let's move it to struct blkcg. There should be no functional changes because blkcg is per cgroup. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240716133058.3491350-1-xiujianfeng@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: avoid polling configuration errorshexue
This patch adds a poll queue check, aiming to help users use polled IO accurately. If users do polled IO but the device doesn't have poll queues, they will get suboptimal performance data and waste CPU resources. Add a poll queue check batching this. If users don't have the device properly configured, or if it simply doesn't support polled IO, it will error the IO with -EOPNOTSUPP. This is similar to what we used to do for sync polled IO, which is no longer supported. Signed-off-by: hexue <xue01.he@samsung.com> Link: https://lore.kernel.org/r/20240718070817.1031494-1-xue01.he@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: Catch possible entries missing from rqf_name[]John Garry
Add a BUILD_BUG_ON() call to ensure that we are not missing entries in rqf_name[]. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240719112912.3830443-16-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: Simplify definition of RQF_NAME()John Garry
Now that we have a bit index for RQF_x in __RQF_x, use __RQF_x to simplify the definition of RQF_NAME() by not using ilog2((__force u32()). Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240719112912.3830443-15-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: Catch possible entries missing from cmd_flag_name[]John Garry
Add a BUILD_BUG_ON() call to ensure that we are not missing entries in cmd_flag_name[]. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240719112912.3830443-13-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: Catch possible entries missing from alloc_policy_name[]John Garry
Make BLK_TAG_ALLOC_x an enum and add a "max" entry. Add a BUILD_BUG_ON() call to ensure that we are not missing entries in hctx_flag_name[]. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240719112912.3830443-12-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: Catch possible entries missing from hctx_flag_name[]John Garry
Refresh values in BLK_MQ_F_x enum, and then re-arrange members in hctx_flag_name[] to match that enum. Renumber BLK_MQ_F_ALLOC_POLICY_START_BIT to match the value refresh. Add a BUILD_BUG_ON() call to ensure that we are not missing entries in hctx_flag_name[]. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240719112912.3830443-11-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: Catch possible entries missing from hctx_state_name[]John Garry
Add a build-time assert that we are not missing entries from hctx_state_name[]. For this, create a separate enum for state flags and add a "max" entry for BLK_MQ_S_x flags. The numbering for those enum values is as default, so don't explicitly number. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240719112912.3830443-10-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: Catch possible entries missing from blk_queue_flag_name[]John Garry
Assert that we are not missing flag entries in blk_queue_flag_name[]. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240719112912.3830443-9-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: Relocate BLK_MQ_CPU_WORK_BATCHJohn Garry
BLK_MQ_CPU_WORK_BATCH is defined in include/linux/blk-mq.h, but only used in blk-mq.c, so relocate to block/blk-mq.h Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240719112912.3830443-6-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: remove QUEUE_FLAG_STOPPEDChristoph Hellwig
QUEUE_FLAG_STOPPED is entirely unused. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240719112912.3830443-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: Add missing entry to hctx_flag_name[]John Garry
Add missing entry for NO_SCHED_BY_DEFAULT and reorder to match the enum. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240719112912.3830443-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-07-19block: Add zone write plugging entry to rqf_name[]John Garry
Add missing entry. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20240719112912.3830443-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>