<feed xmlns='http://www.w3.org/2005/Atom'>
<title>pm24.git/block, branch v4.8-rc7</title>
<subtitle>Unnamed repository; edit this file 'description' to name the repository.
</subtitle>
<id>https://git.kobert.dev/pm24.git/atom?h=v4.8-rc7</id>
<link rel='self' href='https://git.kobert.dev/pm24.git/atom?h=v4.8-rc7'/>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/'/>
<updated>2016-08-24T21:38:01Z</updated>
<entry>
<title>blk-mq: improve warning for running a queue on the wrong CPU</title>
<updated>2016-08-24T21:38:01Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@fb.com</email>
</author>
<published>2016-08-24T21:38:01Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=0e87e58bf60edb6bb28e493c7a143f41b091a5e5'/>
<id>urn:sha1:0e87e58bf60edb6bb28e493c7a143f41b091a5e5</id>
<content type='text'>
__blk_mq_run_hw_queue() currently warns if we are running the queue on a
CPU that isn't set in its mask. However, this can happen if a CPU is
being offlined, and the workqueue handling will place the work on CPU0
instead. Improve the warning so that it only triggers if the batch cpu
in the hardware queue is currently online.  If it triggers for that
case, then it's indicative of a flow problem in blk-mq, so we want to
retain it for that case.

Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>blk-mq: don't overwrite rq-&gt;mq_ctx</title>
<updated>2016-08-24T21:34:35Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@fb.com</email>
</author>
<published>2016-08-24T21:34:35Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=e57690fe009b2ab0cee8a57f53be634540e49c9d'/>
<id>urn:sha1:e57690fe009b2ab0cee8a57f53be634540e49c9d</id>
<content type='text'>
We do this in a few places, if the CPU is offline. This isn't allowed,
though, since on multi queue hardware, we can't just move a request
from one software queue to another, if they map to different hardware
queues. The request and tag isn't valid on another hardware queue.

This can happen if plugging races with CPU offlining. But it does
no harm, since it can only happen in the window where we are
currently busy freezing the queue and flushing IO, in preparation
for redoing the software &lt;-&gt; hardware queue mappings.

Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>block: make sure a big bio is split into at most 256 bvecs</title>
<updated>2016-08-24T14:17:24Z</updated>
<author>
<name>Ming Lei</name>
<email>ming.lei@canonical.com</email>
</author>
<published>2016-08-23T13:49:45Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=4d70dca4eadf2f95abe389116ac02b8439c2d16c'/>
<id>urn:sha1:4d70dca4eadf2f95abe389116ac02b8439c2d16c</id>
<content type='text'>
After arbitrary bio size was introduced, the incoming bio may
be very big. We have to split the bio into small bios so that
each holds at most BIO_MAX_PAGES bvecs for safety reason, such
as bio_clone().

This patch fixes the following kernel crash:

&gt; [  172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
&gt; [  172.660229] IP: [&lt;ffffffff811e53b4&gt;] bio_trim+0xf/0x2a
&gt; [  172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
&gt; [  172.660399] Oops: 0000 [#1] SMP
&gt; [...]
&gt; [  172.664780] Call Trace:
&gt; [  172.664813]  [&lt;ffffffffa007f3be&gt;] ? raid1_make_request+0x2e8/0xad7 [raid1]
&gt; [  172.664846]  [&lt;ffffffff811f07da&gt;] ? blk_queue_split+0x377/0x3d4
&gt; [  172.664880]  [&lt;ffffffffa005fb5f&gt;] ? md_make_request+0xf6/0x1e9 [md_mod]
&gt; [  172.664912]  [&lt;ffffffff811eb860&gt;] ? generic_make_request+0xb5/0x155
&gt; [  172.664947]  [&lt;ffffffffa0445c89&gt;] ? prio_io+0x85/0x95 [bcache]
&gt; [  172.664981]  [&lt;ffffffffa0448252&gt;] ? register_cache_set+0x355/0x8d0 [bcache]
&gt; [  172.665016]  [&lt;ffffffffa04497d3&gt;] ? register_bcache+0x1006/0x1174 [bcache]

The issue can be reproduced by the following steps:
	- create one raid1 over two virtio-blk
	- build bcache device over the above raid1 and another cache device
	and bucket size is set as 2Mbytes
	- set cache mode as writeback
	- run random write over ext4 on the bcache device

Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
Reported-by: Sebastian Roesner &lt;sroesner-kernelorg@roesner-online.de&gt;
Reported-by: Eric Wheeler &lt;bcache@lists.ewheeler.net&gt;
Cc: stable@vger.kernel.org (4.3+)
Cc: Shaohua Li &lt;shli@fb.com&gt;
Acked-by: Kent Overstreet &lt;kent.overstreet@gmail.com&gt;
Signed-off-by: Ming Lei &lt;ming.lei@canonical.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>block: Fix race triggered by blk_set_queue_dying()</title>
<updated>2016-08-17T01:36:14Z</updated>
<author>
<name>Bart Van Assche</name>
<email>bart.vanassche@sandisk.com</email>
</author>
<published>2016-08-16T23:48:36Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=1b856086813be9371929b6cc62045f9fd470f5a0'/>
<id>urn:sha1:1b856086813be9371929b6cc62045f9fd470f5a0</id>
<content type='text'>
blk_set_queue_dying() can be called while another thread is
submitting I/O or changing queue flags, e.g. through dm_stop_queue().
Hence protect the QUEUE_FLAG_DYING flag change with locking.

Signed-off-by: Bart Van Assche &lt;bart.vanassche@sandisk.com&gt;
Cc: Christoph Hellwig &lt;hch@lst.de&gt;
Cc: Mike Snitzer &lt;snitzer@redhat.com&gt;
Cc: stable &lt;stable@vger.kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>block: Fix secure erase</title>
<updated>2016-08-16T15:16:51Z</updated>
<author>
<name>Adrian Hunter</name>
<email>adrian.hunter@intel.com</email>
</author>
<published>2016-08-16T07:59:35Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=7afafc8a44bf0ab841b17d450b02aedb3a138985'/>
<id>urn:sha1:7afafc8a44bf0ab841b17d450b02aedb3a138985</id>
<content type='text'>
Commit 288dab8a35a0 ("block: add a separate operation type for secure
erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering
all the places REQ_OP_DISCARD was being used to mean either. Fix those.

Signed-off-by: Adrian Hunter &lt;adrian.hunter@intel.com&gt;
Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase")
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>block: rename bio bi_rw to bi_opf</title>
<updated>2016-08-07T20:41:02Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@fb.com</email>
</author>
<published>2016-08-05T21:35:16Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=1eff9d322a444245c67515edb52bc0eb68374aa8'/>
<id>urn:sha1:1eff9d322a444245c67515edb52bc0eb68374aa8</id>
<content type='text'>
Since commit 63a4cc24867d, bio-&gt;bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>blk-mq: fix deadlock in blk_mq_register_disk() error path</title>
<updated>2016-08-04T20:19:16Z</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@fb.com</email>
</author>
<published>2016-08-02T14:45:44Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=c0f3fd2b387448d67ae83c4ce1cc69da375b9186'/>
<id>urn:sha1:c0f3fd2b387448d67ae83c4ce1cc69da375b9186</id>
<content type='text'>
If we fail registering any of the hardware queues, we call
into blk_mq_unregister_disk() with the hotplug mutex already
held. Since blk_mq_unregister_disk() attempts to acquire the
same mutex, we end up in a less than happy place.

Reported-by: Jinpu Wang &lt;jinpu.wang@profitbricks.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>block: fix bdi vs gendisk lifetime mismatch</title>
<updated>2016-08-04T20:19:16Z</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2016-07-31T18:15:13Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=df08c32ce3be5be138c1dbfcba203314a3a7cd6f'/>
<id>urn:sha1:df08c32ce3be5be138c1dbfcba203314a3a7cd6f</id>
<content type='text'>
The name for a bdi of a gendisk is derived from the gendisk's devt.
However, since the gendisk is destroyed before the bdi it leaves a
window where a new gendisk could dynamically reuse the same devt while a
bdi with the same name is still live.  Arrange for the bdi to hold a
reference against its "owner" disk device while it is registered.
Otherwise we can hit sysfs duplicate name collisions like the following:

 WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
 sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

 Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
  0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
  ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
  0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
 Call Trace:
  [&lt;ffffffff8134caec&gt;] dump_stack+0x63/0x87
  [&lt;ffffffff8108c351&gt;] __warn+0xd1/0xf0
  [&lt;ffffffff8108c3cf&gt;] warn_slowpath_fmt+0x5f/0x80
  [&lt;ffffffff812a0d34&gt;] sysfs_warn_dup+0x64/0x80
  [&lt;ffffffff812a0e1e&gt;] sysfs_create_dir_ns+0x7e/0x90
  [&lt;ffffffff8134faaa&gt;] kobject_add_internal+0xaa/0x320
  [&lt;ffffffff81358d4e&gt;] ? vsnprintf+0x34e/0x4d0
  [&lt;ffffffff8134ff55&gt;] kobject_add+0x75/0xd0
  [&lt;ffffffff816e66b2&gt;] ? mutex_lock+0x12/0x2f
  [&lt;ffffffff8148b0a5&gt;] device_add+0x125/0x610
  [&lt;ffffffff8148b788&gt;] device_create_groups_vargs+0xd8/0x100
  [&lt;ffffffff8148b7cc&gt;] device_create_vargs+0x1c/0x20
  [&lt;ffffffff811b775c&gt;] bdi_register+0x8c/0x180
  [&lt;ffffffff811b7877&gt;] bdi_register_dev+0x27/0x30
  [&lt;ffffffff813317f5&gt;] add_disk+0x175/0x4a0

Cc: &lt;stable@vger.kernel.org&gt;
Reported-by: Yi Zhang &lt;yizhan@redhat.com&gt;
Tested-by: Yi Zhang &lt;yizhan@redhat.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;

Fixed up missing 0 return in bdi_register_owner().

Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>blk-mq: Allow timeouts to run while queue is freezing</title>
<updated>2016-08-04T20:19:16Z</updated>
<author>
<name>Gabriel Krisman Bertazi</name>
<email>krisman@linux.vnet.ibm.com</email>
</author>
<published>2016-08-01T14:23:39Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=71f79fb3179e69b0c1448a2101a866d871c66e7f'/>
<id>urn:sha1:71f79fb3179e69b0c1448a2101a866d871c66e7f</id>
<content type='text'>
In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread.  This puts
the request_queue in a hung state, forever waiting for the freeze
completion.

This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.

[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4

The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion.  This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive.  This way the timeout work can do it's job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.

Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout.  In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.

I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process.  I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled in a per-driver basis.  I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.

Signed-off-by: Gabriel Krisman Bertazi &lt;krisman@linux.vnet.ibm.com&gt;
Cc: Brian King &lt;brking@linux.vnet.ibm.com&gt;
Cc: Keith Busch &lt;keith.busch@intel.com&gt;
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>block: fix use-after-free in seq file</title>
<updated>2016-08-04T20:19:16Z</updated>
<author>
<name>Vegard Nossum</name>
<email>vegard.nossum@oracle.com</email>
</author>
<published>2016-07-29T08:40:31Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=77da160530dd1dc94f6ae15a981f24e5f0021e84'/>
<id>urn:sha1:77da160530dd1dc94f6ae15a981f24e5f0021e84</id>
<content type='text'>
I got a KASAN report of use-after-free:

    ==================================================================
    BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508
    Read of size 8 by task trinity-c1/315
    =============================================================================
    BUG kmalloc-32 (Not tainted): kasan: bad access detected
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315
            ___slab_alloc+0x4f1/0x520
            __slab_alloc.isra.58+0x56/0x80
            kmem_cache_alloc_trace+0x260/0x2a0
            disk_seqf_start+0x66/0x110
            traverse+0x176/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a
    INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315
            __slab_free+0x17a/0x2c0
            kfree+0x20a/0x220
            disk_seqf_stop+0x42/0x50
            traverse+0x3b5/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a

    CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G    B           4.7.0+ #62
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
     ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480
     ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480
     ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970
    Call Trace:
     [&lt;ffffffff81d6ce81&gt;] dump_stack+0x65/0x84
     [&lt;ffffffff8146c7bd&gt;] print_trailer+0x10d/0x1a0
     [&lt;ffffffff814704ff&gt;] object_err+0x2f/0x40
     [&lt;ffffffff814754d1&gt;] kasan_report_error+0x221/0x520
     [&lt;ffffffff8147590e&gt;] __asan_report_load8_noabort+0x3e/0x40
     [&lt;ffffffff83888161&gt;] klist_iter_exit+0x61/0x70
     [&lt;ffffffff82404389&gt;] class_dev_iter_exit+0x9/0x10
     [&lt;ffffffff81d2e8ea&gt;] disk_seqf_stop+0x3a/0x50
     [&lt;ffffffff8151f812&gt;] seq_read+0x4b2/0x11a0
     [&lt;ffffffff815f8fdc&gt;] proc_reg_read+0xbc/0x180
     [&lt;ffffffff814b24e4&gt;] do_loop_readv_writev+0x134/0x210
     [&lt;ffffffff814b4c45&gt;] do_readv_writev+0x565/0x660
     [&lt;ffffffff814b8a17&gt;] vfs_readv+0x67/0xa0
     [&lt;ffffffff814b8de6&gt;] do_preadv+0x126/0x170
     [&lt;ffffffff814b92ec&gt;] SyS_preadv+0xc/0x10

This problem can occur in the following situation:

open()
 - pread()
    - .seq_start()
       - iter = kmalloc() // succeeds
       - seqf-&gt;private = iter
    - .seq_stop()
       - kfree(seqf-&gt;private)
 - pread()
    - .seq_start()
       - iter = kmalloc() // fails
    - .seq_stop()
       - class_dev_iter_exit(seqf-&gt;private) // boom! old pointer

As the comment in disk_seqf_stop() says, stop is called even if start
failed, so we need to reinitialise the private pointer to NULL when seq
iteration stops.

An alternative would be to set the private pointer to NULL when the
kmalloc() in disk_seqf_start() fails.

Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
</feed>
