Age | Commit message (Collapse) | Author |
|
Most of the logging infrastructure in XFS is unneccessary and
designed around the infrastructure supplied by Irix rather than
Linux. To rationalise the logging interfaces, start by introducing
simple printk wrappers similar to the dev_printk() infrastructure.
Later patches will convert code to use this new interface.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Commit 493f3358cb289ccf716c5a14fa5bb52ab75943e5 added this call to
xfs_fs_geometry() in order to avoid passing kernel stack data back
to user space:
+ memset(geo, 0, sizeof(*geo));
Unfortunately, one of the callers of that function passes the
address of a smaller data type, cast to fit the type that
xfs_fs_geometry() requires. As a result, this can happen:
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted
in: f87aca93
Pid: 262, comm: xfs_fsr Not tainted 2.6.38-rc6-493f3358cb2+ #1
Call Trace:
[<c12991ac>] ? panic+0x50/0x150
[<c102ed71>] ? __stack_chk_fail+0x10/0x18
[<f87aca93>] ? xfs_ioc_fsgeometry_v1+0x56/0x5d [xfs]
Fix this by fixing that one caller to pass the right type and then
copy out the subset it is interested in.
Note: This patch is an alternative to one originally proposed by
Eric Sandeen.
Reported-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Tested-by: Jeffrey Hundstad <jeffrey.hundstad@mnsu.edu>
|
|
According to the report from Jiro SEKIBA titled "regression in
2.6.37?" (Message-Id: <8739n8vs1f.wl%jir@sekiba.com>), on 2.6.37 and
later kernels, lscp command no longer displays "i" flag on checkpoints
that snapshot operations or garbage collection created.
This is a regression of nilfs2 checkpointing function, and it's
critical since it broke behavior of a part of nilfs2 applications.
For instance, snapshot manager of TimeBrowse gets to create
meaningless snapshots continuously; snapshot creation triggers another
checkpoint, but applications cannot distinguish whether the new
checkpoint contains meaningful changes or not without the i-flag.
This patch fixes the regression and brings that application behavior
back to normal.
Reported-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Jiro SEKIBA <jir@unicus.jp>
Cc: stable <stable@kernel.org> [2.6.37]
|
|
According to Russell King, adfs was written to not require the big
kernel lock, and all inode updates are done under adfs_dir_lock.
All other metadata in adfs is read-only and does not require locking.
The use of the BKL is the result of various pushdowns from the VFS
operations.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Stuart Swales <stuart.swales.croftnuisk@gmail.com>
|
|
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
Fix new kernel-doc warning in fs/block_dev.c:
Warning(fs/block_dev.c:937): No description found for parameter 'kill_dirty'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: fix truncate after open
fuse: fix hang of single threaded fuseblk filesystem
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: Check heartbeat mode for kernel stacks only
Ocfs2/refcounttree: Fix a bug for refcounttree to writeback clusters in a right number.
ocfs2: Fix estimate of necessary credits for mkdir
|
|
The Patch below removes one to many "n's" in a word..
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: linux-ext4@vger.kernel.org
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Orphan cleanup is currently executed even if the file system has some
number of unknown ROCOMPAT features, which deletes inodes and frees
blocks, which could be very bad for some RO_COMPAT features.
This patch skips the orphan cleanup if it contains readonly compatible
features not known by this ext3 implementation, which would prevent
the fs from being mounted (or remounted) readwrite.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
vfs_rename_other() does not lock renamed inode with i_mutex. Thus changing
i_nlink in a non-atomic manner (which happens in ext2_rename()) can corrupt
it as reported and analyzed by Josh.
In fact, there is no good reason to mess with i_nlink of the moved file.
We did it presumably to simulate linking into the new directory and unlinking
from an old one. But the practical effect of this is disputable because fsck
can possibly treat file as being properly linked into both directories without
writing any error which is confusing. So we just stop increment-decrement
games with i_nlink which also fixes the corruption.
CC: stable@kernel.org
CC: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Squashfs_get_sb() to squashfs_mount() conversion (commit 152a0836)
results in line over 80 characters.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
|
|
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
|
|
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
|
|
Pass the dictionary size used to compress datablocks. Using a
dictionary size less than the block size saves memory overhead, in many
cases without adversely affecting compression ratio.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
|
|
Extend decompressor framework to handle compression options stored in
the filesystem. These options can be used by the relevant decompressor
at initialisation time to over-ride defaults.
The presence of compression options in the filesystem is indicated by
the COMP_OPT filesystem flag. If present the data is read from the
filesystem and passed to the decompressor init function. The decompressor
init function signature has been extended to take this data.
Also update the init function signature in the glib, lzo and xz
decompressor wrappers.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
|
|
If no extent conversion is required, wake up any processes waiting for
the page's writeback to be complete and free the ext4_io_end structure
directly in ext4_end_bio() instead of dropping it on the linked list
(which requires taking a spinlock to queue and dequeue the io_end
structure), and waiting for the workqueue to do this work.
This removes an extra scheduling delay before process waiting for an
fsync() to complete gets woken up, and it also reduces the CPU
overhead for a random write workload.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Orphan cleanup is currently executed even if the file system has some
number of unknown ROCOMPAT features, which deletes inodes and frees
blocks, which could be very bad for some RO_COMPAT features,
especially the SNAPSHOT feature.
This patch skips the orphan cleanup if it contains readonly compatible
features not known by this ext4 implementation, which would prevent
the fs from being mounted (or remounted) readwrite.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
nblocks is passed into ext4_truncate_restart_trans() from
ext4_ext_truncate_extend_restart() with a value different from the default
blocks_for_truncate(), but is being ignored.
The two other calls to ext4_truncate_restart_trans() already pass the
default value, which is then being recalculated inside the function.
Fix the problem by using the passed argument.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
|
|
This assures that the root inode is not leaked, and that sb->s_root is
NULL, which will prevent generic_shutdown_super() from doing extra
work, including call sync_filesystem, which ultimately results in
ext4_sync_fs() getting called with an uninitialized struct super,
which is the cause of the crash noted in Kernel Bugzilla #26752.
https://bugzilla.kernel.org/show_bug.cgi?id=26752
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Fix the FIEMAP ioctl so that it returns all of the page ranges which
are still subject to delayed allocation. We were missing some cases
if the file was sparse.
Reported by Chris Mason <chris.mason@oracle.com>:
>We've had reports on btrfs that cp is giving us files full of zeros
>instead of actually copying them. It was tracked down to a bug with
>the btrfs fiemap implementation where it was returning holes for
>delalloc ranges.
>
>Newer versions of cp are trusting fiemap to tell it where the holes
>are, which does seem like a pretty neat trick.
>
>I decided to give xfs and ext4 a shot with a few tests cases too, xfs
>passed with all the ones btrfs was getting wrong, and ext4 got the basic
>delalloc case right.
>$ mkfs.ext4 /dev/xxx
>$ mount /dev/xxx /mnt
>$ dd if=/dev/zero of=/mnt/foo bs=1M count=1
>$ fiemap-test foo
>ext: 0 logical: [ 0.. 255] phys: 0.. 255
>flags: 0x007 tot: 256
>
>Horray! But once we throw a hole in, things go bad:
>$ mkfs.ext4 /dev/xxx
>$ mount /dev/xxx /mnt
>$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
>$ fiemap-test foo
>< no output >
>
>We've got a delalloc extent after the hole and ext4 fiemap didn't find
>it. If I run sync to kick the delalloc out:
>$sync
>$ fiemap-test foo
>ext: 0 logical: [ 256.. 511] phys: 34048.. 34303
>flags: 0x001 tot: 256
>
>fiemap-test is sitting in my /usr/local/bin, and I have no idea how it
>got there. It's full of pretty comments so I know it isn't mine, but
>you can grab it here:
>
>http://oss.oracle.com/~mason/fiemap-test.c
>
>xfsqa has a fiemap program too.
After Fix, test results are as follows:
ext: 0 logical: [ 256.. 511] phys: 0.. 255
flags: 0x007 tot: 256
ext: 0 logical: [ 256.. 511] phys: 33280.. 33535
flags: 0x001 tot: 256
$ mkfs.ext4 /dev/xxx
$ mount /dev/xxx /mnt
$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
$ sync
$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=3
$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=5
$ fiemap-test foo
ext: 0 logical: [ 256.. 511] phys: 33280.. 33535
flags: 0x000 tot: 256
ext: 1 logical: [ 768.. 1023] phys: 0.. 255
flags: 0x006 tot: 256
ext: 2 logical: [ 1280.. 1535] phys: 0.. 255
flags: 0x007 tot: 256
Tested-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
If CONFIG_EXT4_DEBUG is enabled, then if a block allocation fails due
to disk being full, a verbose debugging message is printed, even if
the malloc-debug switch has not been enabled. Suppress the debugging
message so that nothing is printed unless malloc-debug has been turned
on.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
In ext4_bio_write_page(), if the memory allocation for the struct
ext4_io_page fails, it returns with the page's PageWriteback flag set.
This will end up causing the page not to skip writeback in
WB_SYNC_NONE mode, and in WB_SYNC_ALL mode (i.e., on a sync, fsync, or
umount) the writeback daemon will get stuck forever on the
wait_on_page_writeback() function in write_cache_pages_da().
Or, if journalling is enabled and the file gets deleted, it the
journal thread can get stuck in journal_finish_inode_data_buffers()
call to filemap_fdatawait().
Another place where things can get hung up is in
truncate_inode_pages(), called out of ext4_evict_inode().
Fix this by not setting PageWriteback until after we have successfully
allocated the struct ext4_io_page.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Move the initialization of all of the fields of the mpd structure to
write_cache_pages_da(). This simplifies the code considerably.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
If we have accumulated a contiguous region of memory to be written
out, and the next page can added to this region, don't bother locking
(and then unlocking the page) before writing out the memory. In the
unlikely event that the next page was being written back by some other
CPU, we can also skip waiting that page to finish writeback.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Because the ext4 page writeback codepath had been prematurely calling
clear_page_dirty_for_io(), if it turned out that a particular page
couldn't be written out during a particular pass of
write_cache_pages_da(), the page would have to get redirtied by
calling redirty_pages_for_writeback(). Not only was this wasted work,
but redirty_page_for_writeback() would increment wbc->pages_skipped to
signal to writeback_sb_inodes() that buffers were locked, and that it
should skip this inode until later.
Since this signal was incorrect in ext4's case --- which was caused by
ext4's historically incorrect use of write_cache_pages() ---
ext4_da_writepages() saved and restored wbc->skipped_pages to avoid
confusing writeback_sb_inodes().
Now that we've fixed ext4 to call clear_page_dirty_for_io() right
before initiating the page I/O, we can nuke the page_skipped
save/restore hackery, and breathe a sigh of relief.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Move when we call clear_page_dirty_for_io() to just before we actually
write the page. This simplifies the code somewhat, and avoids marking
pages as clean and then needing to remark them as dirty later.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Eliminate duplicate code, unneeded variables, etc., to make it easier
to understand the code. No behavioral changes were made in this patch.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Fold the __mpage_da_writepage() function into write_cache_pages_da().
This will give us opportunities to clean up and simplify the resulting
code.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Now that we've fixed the file corruption bug in commit d50bdd5aa55,
it's time to enable mblk_io_submit by default.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
If ext4_da_block_invalidatepages() is called because of a
failure from ext4_map_blocks() in mpage_da_map_and_submit(),
it's supposed to clean up -- including unlock -- all the
pages in the mpd structure. But these values may not match
up, even on a system in which block size == page size:
mpd->b_blocknr != mpd->first_page
mpd->b_size != (mpd->next_page - mpd->first_page)
ext4_da_block_invalidatepages() has been using b_blocknr and
b_size; this patch changes it to use first_page and
next_page.
Tested: I injected a small number (5%) of failures in
ext4_map_blocks() in the case that the flags contain
EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this
kernel. Without this patch, I got hung tasks every time.
With this patch, I see no hangs in many runs of fsstress.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
In mpage_da_map_and_submit(), if we have a delayed block
allocation failure from ext4_map_blocks(), we need to mark
the IO as complete, by setting
mpd->io_done = 1;
Otherwise, we could end up submitting the pages in an outer
loop; since they are unlocked on mapping failure in
ext4_da_block_invalidatepages(), this will cause a bug check
in mpage_da_submit_io().
I tested this by injected failures into ext4_map_blocks().
Without this patch, a simple fsstress run will bug check;
with the patch, it works fine.
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
A race can occur when io_submit() races with io_destroy():
CPU1 CPU2
io_submit()
do_io_submit()
...
ctx = lookup_ioctx(ctx_id);
io_destroy()
Now do_io_submit() holds the last reference to ctx.
...
queue new AIO
put_ioctx(ctx) - frees ctx with active AIOs
We solve this issue by checking whether ctx is being destroyed in AIO
submission path after adding new AIO to ctx. Then we are guaranteed that
either io_destroy() waits for new AIO or we see that ctx is being
destroyed and bail out.
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
aio-dio-invalidate-failure GPFs in aio_put_req from io_submit.
lookup_ioctx doesn't implement the rcu lookup pattern properly.
rcu_read_lock does not prevent refcount going to zero, so we might take
a refcount on a zero count ioctx.
Fix the bug by atomically testing for zero refcount before incrementing.
[jack@suse.cz: added comment into the code]
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The kernel automatically evaluates partition tables of storage devices.
The code for evaluating LDM partitions (in fs/partitions/ldm.c) contains
a bug that causes a kernel oops on certain corrupted LDM partitions. A
kernel subsystem seems to crash, because, after the oops, the kernel no
longer recognizes newly connected storage devices.
The patch changes ldm_parse_vmdb() to Validate the value of vblk_size.
Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: Eugene Teo <eugeneteo@kernel.sg>
Acked-by: Richard Russon <ldm@flatcap.org>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In several places, an epoll fd can call another file's ->f_op->poll()
method with ep->mtx held. This is in general unsafe, because that other
file could itself be an epoll fd that contains the original epoll fd.
The code defends against this possibility in its own ->poll() method using
ep_call_nested, but there are several other unsafe calls to ->poll
elsewhere that can be made to deadlock. For example, the following simple
program causes the call in ep_insert recursively call the original fd's
->poll, leading to deadlock:
#include <unistd.h>
#include <sys/epoll.h>
int main(void) {
int e1, e2, p[2];
struct epoll_event evt = {
.events = EPOLLIN
};
e1 = epoll_create(1);
e2 = epoll_create(2);
pipe(p);
epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt);
epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt);
write(p[1], p, sizeof p);
epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt);
return 0;
}
On insertion, check whether the inserted file is itself a struct epoll,
and if so, do a recursive walk to detect whether inserting this file would
create a loop of epoll structures, which could lead to deadlock.
[nelhage@ksplice.com: Use epmutex to serialize concurrent inserts]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Reported-by: Nelson Elhage <nelhage@ksplice.com>
Tested-by: Nelson Elhage <nelhage@ksplice.com>
Cc: <stable@kernel.org> [2.6.34+, possibly earlier]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix fiemap bugs with delalloc
Btrfs: set FMODE_EXCL in btrfs_device->mode
Btrfs: make btrfs_rm_device() fail gracefully
Btrfs: Avoid accessing unmapped kernel address
Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
Btrfs: allow balance to explicitly allocate chunks as it relocates
Btrfs: put ENOSPC debugging under a mount option
|
|
* 'for-linus' of git://neil.brown.name/md:
md: Fix - again - partition detection when array becomes active
Fix over-zealous flush_disk when changing device size.
md: avoid spinlock problem in blk_throtl_exit
md: correctly handle probe of an 'mdp' device.
md: don't set_capacity before array is active.
md: Fix raid1->raid0 takeover
|
|
I'm seeing the following oops when testing afs:
Unable to handle kernel paging request for data at address 0x00000008
...
NIP [c0000000003393b0] .afs_unlink_writeback+0x38/0xc0
LR [c00000000033987c] .afs_put_writeback+0x98/0xec
Call Trace:
[c00000000345f600] [c00000000033987c] .afs_put_writeback+0x98/0xec
[c00000000345f690] [c00000000033ae80] .afs_write_begin+0x6a4/0x75c
[c00000000345f790] [c00000000012b77c] .generic_file_buffered_write+0x148/0x320
[c00000000345f8d0] [c00000000012e1b8] .__generic_file_aio_write+0x37c/0x3e4
[c00000000345f9d0] [c00000000012e2a8] .generic_file_aio_write+0x88/0xfc
[c00000000345fa90] [c0000000003390a8] .afs_file_write+0x10c/0x178
[c00000000345fb40] [c000000000188788] .do_sync_write+0xc4/0x128
[c00000000345fcc0] [c000000000189658] .vfs_write+0xe8/0x1d8
[c00000000345fd70] [c000000000189884] .SyS_write+0x68/0xb0
[c00000000345fe30] [c000000000008564] syscall_exit+0x0/0x40
afs_write_begin hits an error and calls afs_unlink_writeback. In there
we do list_del_init on an uninitialised list.
The patch below initialises ->link when creating the afs_writeback struct.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit e1181ee6 "vfs: pass struct file to do_truncate on O_TRUNC
opens" broke the behavior of open(O_TRUNC|O_RDONLY) in fuse. Fuse
assumed that when called from open, a truncate() will be done, not an
ftruncate().
Fix by restoring the old behavior, based on the ATTR_OPEN flag.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
|
Single threaded NTFS-3G could get stuck if a delayed RELEASE reply
triggered a DESTROY request via path_put().
Fix this by
a) making RELEASE requests synchronous, whenever possible, on fuseblk
filesystems
b) if not possible (triggered by an asynchronous read/write) then do
the path_put() in a separate thread with schedule_work().
Reported-by: Oliver Neukum <oneukum@suse.de>
Cc: stable@kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
|
|
In ext4_mb_check_group_pa(), the current preallocation space is
replaced with a new preallocation space when the two have the same
distance from the goal block.
This doesn't actually gain us anything, so change things so that the
function only switches to the new preallocation group if its distance
from the goal block is strictly smaller than the current preallocaiton
group's distance from the goal block.
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
|
|
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
|
|
This patch adds comments to ext4_mb_mark_free_simple to make it more
understandable.
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
|
|
In __mb_check_buddy(), look at the code below:
591 fstart = -1;
592 buddy = mb_find_buddy(e4b, 0, &max);
593 for (i = 0; i < max; i++) {
594 if (!mb_test_bit(i, buddy)) {
595 MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
596 if (fstart == -1) {
597 fragments++;
598 fstart = i;
599 }
600 continue;
601 }
602 fstart = -1;
603 /* check used bits only */
604 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
605 buddy2 = mb_find_buddy(e4b, j, &max2);
606 k = i >> j;
607 MB_CHECK_ASSERT(k < max2);
608 MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
609 }
610 }
611 MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
612 MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
613
614 grp = ext4_get_group_info(sb, e4b->bd_group);
615 buddy = mb_find_buddy(e4b, 0, &max);
On line 592, buddy is fetched by mb_find_buddy() with order 0, between
line 593 to line 615, buddy is not changed, therefore there is
no need to fetch buddy again from mb_find_buddy() with order 0 again.
We can safely remove the second mb_find_buddy() on line 615.
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
|
|
Current code calculate max no matter whether order is zero, it's
unnecessary. This cleanup patch sets max to "1 << (e4b->bd_blkbits
+ 3)" only when order == 0.
Signed-off-by: Coly Li <bosong.ly@taobao.com>
Cc: Alex Tomas <alex@clusterfs.com>
Cc: Theodore Tso <tytso@google.com>
|
|
The new implementation of bd_link_disk_holder() added by 49731baa41d
(block: restore multiple bd_link_disk_holder() support) didn't get an
extra reference for the holder_dir kobject of the slave bdev; however,
bdev kills holder_dir on removal, not release, so if the slave bdev is
removed while there are holder links, the holder_dir will be destroyed
while there still are holder links, which leads to oops later when
bd_unlink_disk_order() tries to remove those links.
Make bd_link_disk_holder() grab an extra reference for the slave's
holder_dir and put it in bd_unlink_disk_holder().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
Tested-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch is a performance improvement to GFS2's dealloc code.
Rather than update the quota file and statfs file for every
single block that's stripped off in unlink function do_strip,
this patch keeps track and updates them once for every layer
that's stripped. This is done entirely inside the existing
transaction, so there should be no risk of corruption.
The other functions that deallocate blocks will be unaffected
because they are using wrapper functions that do the same
thing that they do today.
I tested this code on my roth cluster by creating 200
files in a directory, each of which is 100MB, then on
four nodes, I simultaneously deleted the files, thus competing
for GFS2 resources (but different files). The commands
I used were:
[root@roth-01]# time for i in `seq 1 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
[root@roth-02]# time for i in `seq 2 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
[root@roth-03]# time for i in `seq 3 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
[root@roth-05]# time for i in `seq 4 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
The performance increase was significant:
roth-01 roth-02 roth-03 roth-05
--------- --------- --------- ---------
old: real 0m34.027 0m25.021s 0m23.906s 0m35.646s
new: real 0m22.379s 0m24.362s 0m24.133s 0m18.562s
Total time spent deleting:
old: 118.6s
new: 89.4
For this particular case, this showed a 25% performance increase for
GFS2 unlinks.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
|
|
When we trim some free blocks in a group of ext3, we should
calculate the free blocks properly and check whether there are
enough freed blocks left for us to trim. Current solution will
only calculate free spaces if they are large for a trim which
is wrong.
Let us see a small example:
a group has 1.5M free which are 300k, 300k, 300k, 300k, 300k.
And minblocks is 1M. With current solution, we have to iterate
the whole group since these 300k will never be subtracted from
1.5M. But actually we should exit after we find the first 2
free spaces since the left 3 chunks only sum up to 900K if we
subtract the first 600K although they can't be trimed.
Cc: Jan Kara <jack@suse.cz>
Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
As we have make the consense in the e-mail[1], the trim start should
be added with first_data_block. So this patch fulfill it and remove
the check for start < first_data_block.
[1] http://www.spinics.net/lists/linux-ext4/msg22737.html
Cc: Jan Kara <jack@suse.cz>
Cc: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
|