From f466722ca614edcd14f3337373f33132117c7612 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Mon, 9 Dec 2013 12:04:56 +1100
Subject: md: Change handling of save_raid_disk and metadata update during recovery.

Since commit d70ed2e4fafdbef0800e739
  MD: Allow restarting an interrupted incremental recovery.
we don't write out the metadata to devices while they are recovering.
This had a good reason, but has unfortunate consequences.  This patch
changes things to make them work better.

At issue is what happens if the array is shut down while a recovery is
happening, particularly a bitmap-guided recovery.

Ideally the recovery should pick up where it left off.  However the
metadata cannot represent the state "A recovery is in process which is
guided by the bitmap".

Before the above mentioned commit, we wrote metadata to the device
which said "this is being recovered and it is up to <here>".  So after
a restart, a full recovery (not bitmap-guided) would happen from
wherever it was up to.

After the commit the metadata wasn't updated, so it still said "This
device is fully in sync with <this> event count".  That leads to a
bitmap-based recovery following the whole bitmap, which should be a lot
less work than a full recovery from some starting point.  So this was
an improvement.

However, updating some metadata but not all leads to other problems.
In particular, the metadata written to the fully-up-to-date devices
records that the array has all devices present (even though some are
recovering).  So on restart, mdadm wants to find all devices and
expects them to have current event counts.  Obviously some don't (they
have old event counts), so (when assembling with --incremental) it
waits indefinitely for the rest of the expected devices.

It really is wrong not to update all the metadata together; doing that
is bound to cause confusion.
Instead, we should make it possible to record the truth in the
metadata, i.e. we need to be able to record that a device is being
recovered based on the bitmap.
We already have a feature flag to say that recovery is happening.  We
now add another one to say that it is a bitmap-based recovery.

With this we can remove the code that disables the write-out of
metadata on some devices.

So this patch:
 - moves the setting of 'saved_raid_disk' from add_new_disk to the
   validate_super methods.  This makes sure it is always set properly,
   both when adding a new device to an array and when assembling an
   array from a collection of devices.
 - adds a metadata flag MD_FEATURE_RECOVERY_BITMAP which is only used
   if MD_FEATURE_RECOVERY_OFFSET is set, and records that a
   bitmap-based recovery is allowed.  This is only present in v1.x
   metadata; v0.90 doesn't support devices which are in the middle of
   recovery at all.
 - only skips writing metadata to Faulty devices.
 - also allows rdev state to be set to "-insync" via sysfs.  This can
   be used for external-metadata arrays.  When the 'role' is set the
   device is assumed to be in-sync.  If, after setting the role, we
   set the state to "-insync", the role is moved to saved_raid_disk,
   which effectively says the device is partly in-sync with that slot
   and needs a bitmap-based recovery.
Cc: Andrei Warkentin
Signed-off-by: NeilBrown
---
 drivers/md/md.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

(limited to 'drivers/md')

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2a456a5d59a8..539f08885e7f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1183,6 +1183,7 @@ static int super_90_validate(struct mddev *mddev, struct md_rdev *rdev)
 		    desc->raid_disk < mddev->raid_disks */) {
 			set_bit(In_sync, &rdev->flags);
 			rdev->raid_disk = desc->raid_disk;
+			rdev->saved_raid_disk = desc->raid_disk;
 		} else if (desc->state & (1<flags);
 		break;
 	default:
+		rdev->saved_raid_disk = role;
 		if ((le32_to_cpu(sb->feature_map) &
-		     MD_FEATURE_RECOVERY_OFFSET))
+		     MD_FEATURE_RECOVERY_OFFSET)) {
 			rdev->recovery_offset = le64_to_cpu(sb->recovery_offset);
-		else
+			if (!(le32_to_cpu(sb->feature_map) &
+			      MD_FEATURE_RECOVERY_BITMAP))
+				rdev->saved_raid_disk = -1;
+		} else
 			set_bit(In_sync, &rdev->flags);
 		rdev->raid_disk = role;
 		break;
@@ -1746,6 +1751,9 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 			cpu_to_le32(MD_FEATURE_RECOVERY_OFFSET);
 		sb->recovery_offset = cpu_to_le64(rdev->recovery_offset);
+		if (rdev->saved_raid_disk >= 0 && mddev->bitmap)
+			sb->feature_map |=
+				cpu_to_le32(MD_FEATURE_RECOVERY_BITMAP);
 	}
 	if (test_bit(Replacement, &rdev->flags))
 		sb->feature_map |=
@@ -2487,8 +2495,7 @@ repeat:
 		if (rdev->sb_loaded != 1)
 			continue; /* no noise on spare devices */
-		if (!test_bit(Faulty, &rdev->flags) &&
-		    rdev->saved_raid_disk == -1) {
+		if (!test_bit(Faulty, &rdev->flags)) {
 			md_super_write(mddev,rdev,
 				       rdev->sb_start, rdev->sb_size,
 				       rdev->sb_page);
@@ -2504,11 +2511,9 @@ repeat:
 				rdev->badblocks.size = 0;
 			}
-		} else if (test_bit(Faulty, &rdev->flags))
+		} else
 			pr_debug("md: %s (skipping faulty)\n",
 				 bdevname(rdev->bdev, b));
-		else
-			pr_debug("(skipping incremental s/r ");
 		if (mddev->level == LEVEL_MULTIPATH)
 			/* only need to write one superblock... */
@@ -2624,6 +2629,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 *  blocked - sets the Blocked flags
 *  -blocked - clears the Blocked and possibly simulates an error
 *  insync - sets Insync providing device isn't active
+ *  -insync - clear Insync for a device with a slot assigned,
+ *            so that it gets rebuilt based on bitmap
 *  write_error - sets WriteErrorSeen
 *  -write_error - clears WriteErrorSeen
 */
@@ -2672,6 +2679,11 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 	} else if (cmd_match(buf, "insync") && rdev->raid_disk == -1) {
 		set_bit(In_sync, &rdev->flags);
 		err = 0;
+	} else if (cmd_match(buf, "-insync") && rdev->raid_disk >= 0) {
+		clear_bit(In_sync, &rdev->flags);
+		rdev->saved_raid_disk = rdev->raid_disk;
+		rdev->raid_disk = -1;
+		err = 0;
 	} else if (cmd_match(buf, "write_error")) {
 		set_bit(WriteErrorSeen, &rdev->flags);
 		err = 0;
@@ -5780,6 +5792,7 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
 			clear_bit(Bitmap_sync, &rdev->flags);
 		} else
 			rdev->raid_disk = -1;
+		rdev->saved_raid_disk = rdev->raid_disk;
 	} else
 		super_types[mddev->major_version].
 			validate_super(mddev, rdev);
@@ -5792,11 +5805,6 @@ static int add_new_disk(struct mddev * mddev, mdu_disk_info_t *info)
 			return -EINVAL;
 		}
-		if (test_bit(In_sync, &rdev->flags))
-			rdev->saved_raid_disk = rdev->raid_disk;
-		else
-			rdev->saved_raid_disk = -1;
-
 		clear_bit(In_sync, &rdev->flags); /* just to be sure */
 		if (info->state & (1<flags);
@@ -7948,14 +7956,10 @@ void md_reap_sync_thread(struct mddev *mddev)
 		mddev->pers->finish_reshape(mddev);
 	/* If array is no-longer degraded, then any saved_raid_disk
-	 * information must be scrapped.  Also if any device is now
-	 * In_sync we must scrape the saved_raid_disk for that device
-	 * do the superblock for an incrementally recovered device
-	 * written out.
+	 * information must be scrapped.
 	 */
-	rdev_for_each(rdev, mddev)
-		if (!mddev->degraded ||
-		    test_bit(In_sync, &rdev->flags))
+	if (!mddev->degraded)
+		rdev_for_each(rdev, mddev)
 			rdev->saved_raid_disk = -1;
 	md_update_sb(mddev, 1);
--
cgit v1.2.3-70-g09d2

From 7eb418851f3278de67126ea0c427641ab4792c57 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Tue, 14 Jan 2014 15:55:14 +1100
Subject: md: allow a partially recovered device to be hot-added to an array.

When adding a new device into an array it is normally important to
clear any stale data from ->recovery_offset, else the new device may
not be recovered properly.

However when re-adding a device which is known to be nearly in-sync,
this is not needed and can be detrimental.  The (bitmap-based) resync
will still happen, and further recovery is only needed from wherever
it was already up to.

So if saved_raid_disk is set, signifying a re-add, don't clear
->recovery_offset.

Signed-off-by: NeilBrown
---
 drivers/md/md.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'drivers/md')

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 539f08885e7f..757e388308a8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7736,7 +7736,8 @@ static int remove_and_add_spares(struct mddev *mddev,
 		     !test_bit(Bitmap_sync, &rdev->flags)))
 			continue;
-		rdev->recovery_offset = 0;
+		if (rdev->saved_raid_disk < 0)
+			rdev->recovery_offset = 0;
 		if (mddev->pers->
 		    hot_add_disk(mddev, rdev) == 0) {
 			if (sysfs_link_rdev(mddev, rdev))
--
cgit v1.2.3-70-g09d2

From 0b59bb6422e43ad0534073e2cbc4d0f52720da88 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Tue, 14 Jan 2014 16:30:10 +1100
Subject: md/raid10: avoid fullsync when not necessary.

This is the raid10 equivalent of commit
4f0a5e012cf41321d611e7cad63e1017d143d138
  MD RAID1: Further conditionalize 'fullsync'

If a device in a newly assembled array is not fully recovered we
currently do a full resync, but we don't need to.

Signed-off-by: NeilBrown
---
 drivers/md/raid10.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

(limited to 'drivers/md')

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 06eeb99ea6fc..8d39d63281b9 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3747,7 +3747,8 @@ static int run(struct mddev *mddev)
 		    !test_bit(In_sync, &disk->rdev->flags)) {
 			disk->head_position = 0;
 			mddev->degraded++;
-			if (disk->rdev)
+			if (disk->rdev &&
+			    disk->rdev->saved_raid_disk < 0)
 				conf->fullsync = 1;
 		}
 		disk->recovery_disabled = mddev->recovery_disabled - 1;
--
cgit v1.2.3-70-g09d2
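The decision this patch encodes in the v1.x superblock can be summarised as: if MD_FEATURE_RECOVERY_OFFSET is set the device is only partly recovered, and only if MD_FEATURE_RECOVERY_BITMAP is also set may the remembered slot (saved_raid_disk) be kept so a bitmap-guided recovery can resume.  What follows is a minimal userspace sketch of that logic, not kernel code; the constant values here are illustrative only, the authoritative definitions live in include/uapi/linux/raid/md_p.h.

/* Userspace sketch of the role/feature_map decision made when validating
 * a v1.x superblock for a device assigned to a normal data slot. */
#include <stdio.h>
#include <stdint.h>

#define MD_FEATURE_RECOVERY_OFFSET  2    /* recovery_offset field is valid (illustrative value) */
#define MD_FEATURE_RECOVERY_BITMAP  128  /* recovery is guided by the bitmap (illustrative value) */

struct rdev_state {
	int role;              /* slot recorded in the superblock */
	int saved_raid_disk;   /* slot a bitmap-based recovery may resume into */
	int in_sync;
	uint64_t recovery_offset;
};

static void validate_role(struct rdev_state *rdev, uint32_t feature_map,
			  uint64_t recovery_offset, int role)
{
	rdev->role = role;
	rdev->saved_raid_disk = role;
	if (feature_map & MD_FEATURE_RECOVERY_OFFSET) {
		/* Device is only partly recovered. */
		rdev->recovery_offset = recovery_offset;
		if (!(feature_map & MD_FEATURE_RECOVERY_BITMAP))
			/* No bitmap guidance recorded: a full recovery is needed. */
			rdev->saved_raid_disk = -1;
	} else {
		rdev->in_sync = 1;
	}
}

int main(void)
{
	struct rdev_state r = { .in_sync = 0 };

	validate_role(&r, MD_FEATURE_RECOVERY_OFFSET | MD_FEATURE_RECOVERY_BITMAP,
		      1024, 3);
	printf("slot %d, saved_raid_disk %d -> bitmap-based recovery can resume\n",
	       r.role, r.saved_raid_disk);
	return 0;
}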
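Both of the preceding patches rely on the same rule: a device with a remembered slot (saved_raid_disk >= 0) is a re-add and should keep its recovery state, while a genuinely new device must start recovery from the beginning.  The snippet below is a hedged userspace sketch of that rule with hypothetical names; it is not the kernel's actual helper.

/* Sketch: decide whether a hot-added device keeps its recovery state. */
#include <stdio.h>

struct dev {
	int saved_raid_disk;                  /* >= 0 means "re-add into this slot" */
	unsigned long long recovery_offset;
};

static void prepare_for_hot_add(struct dev *d)
{
	if (d->saved_raid_disk < 0)
		d->recovery_offset = 0;   /* new device: recover everything */
	/* else: re-add, keep the old offset and let the bitmap guide resync */
}

int main(void)
{
	struct dev readd = { .saved_raid_disk = 2,  .recovery_offset = 123456 };
	struct dev fresh = { .saved_raid_disk = -1, .recovery_offset = 999 };

	prepare_for_hot_add(&readd);
	prepare_for_hot_add(&fresh);
	printf("re-add keeps offset %llu, new device restarts at %llu\n",
	       readd.recovery_offset, fresh.recovery_offset);
	return 0;
}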
From 830778a180f268ac106f072b8aad793a79088c87 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Tue, 14 Jan 2014 15:17:03 +1100
Subject: md: ensure metadata is written after raid level change.

level_store() currently does not make sure the metadata is updated to
reflect the new raid level.  It simply sets MD_CHANGE_DEVS.

Any level with a ->thread will quickly notice this and update the
metadata.  However RAID0 and Linear do not have a thread, so no
metadata update happens until the array is stopped.  At that point the
metadata is written.

This is later than we would like.  While the delay doesn't risk any
data, it can cause confusion.  So if there is no md thread, immediately
update the metadata after a level change.

Reported-by: Richard Michael
Signed-off-by: NeilBrown
---
 drivers/md/md.c | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'drivers/md')

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 757e388308a8..a20b7806de7a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3611,6 +3611,8 @@ level_store(struct mddev *mddev, const char *buf, size_t len)
 	pers->run(mddev);
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 	mddev_resume(mddev);
+	if (!mddev->thread)
+		md_update_sb(mddev, 1);
 	sysfs_notify(&mddev->kobj, NULL, "level");
 	md_new_event(mddev);
 	return rv;
--
cgit v1.2.3-70-g09d2

From cb335f88eb35af712d1f4171642d0487f7bb2e7e Mon Sep 17 00:00:00 2001
From: Nicolas Schichan
Date: Wed, 15 Jan 2014 16:58:52 +0100
Subject: md: check command validity early in md_ioctl().

Verify that the cmd parameter passed to md_ioctl() is valid before
doing anything.

This fixes mddev->hold_active being set to 0 when an invalid ioctl
command is passed to md_ioctl() before the array has been configured.

Clearing mddev->hold_active in that case can lead to a livelock
situation when an invalid ioctl number is given to md_ioctl() by a
process while the mddev is currently being opened by another process:

Process 1                               Process 2
---------                               ---------
md_alloc()
  mddev_find()
  -> returns a new mddev with
     hold_active == UNTIL_IOCTL
  add_disk()
  -> sends KOBJ_ADD uevent
                                        (sees KOBJ_ADD uevent for device)
                                        md_open()
                                        md_ioctl(INVALID_IOCTL)
                                        -> returns ENODEV and clears
                                           mddev->hold_active
                                        md_release()
                                          md_put()
                                          -> deletes the mddev as
                                             hold_active is 0
md_open()
  mddev_find()
  -> returns a newly allocated mddev
     with mddev->gendisk == NULL
-> returns with ERESTARTSYS
   (kernel restarts the open syscall)

Signed-off-by: Nicolas Schichan
Signed-off-by: NeilBrown
---
 drivers/md/md.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

(limited to 'drivers/md')

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a20b7806de7a..b890d3fb0e02 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -6356,6 +6356,32 @@ static int md_getgeo(struct block_device *bdev, struct hd_geometry *geo)
 	return 0;
 }
+static inline bool md_ioctl_valid(unsigned int cmd)
+{
+	switch (cmd) {
+	case ADD_NEW_DISK:
+	case BLKROSET:
+	case GET_ARRAY_INFO:
+	case GET_BITMAP_FILE:
+	case GET_DISK_INFO:
+	case HOT_ADD_DISK:
+	case HOT_REMOVE_DISK:
+	case PRINT_RAID_DEBUG:
+	case RAID_AUTORUN:
+	case RAID_VERSION:
+	case RESTART_ARRAY_RW:
+	case RUN_ARRAY:
+	case SET_ARRAY_INFO:
+	case SET_BITMAP_FILE:
+	case SET_DISK_FAULTY:
+	case STOP_ARRAY:
+	case STOP_ARRAY_RO:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static int md_ioctl(struct block_device *bdev, fmode_t mode,
 		    unsigned int cmd, unsigned long arg)
 {
@@ -6364,6 +6390,9 @@ static int md_ioctl(struct block_device *bdev, fmode_t mode,
 	struct mddev *mddev = NULL;
 	int ro;
+	if (!md_ioctl_valid(cmd))
+		return -ENOTTY;
+
 	switch (cmd) {
 	case RAID_VERSION:
 	case GET_ARRAY_INFO:
--
cgit v1.2.3-70-g09d2
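The fix above follows a common pattern: validate the request before performing any side effects, so that an unsupported command cannot perturb state such as mddev->hold_active.  Below is a small userspace sketch of the pattern; the command names are stand-ins, not the real md ioctl numbers.

/* Sketch of "validate the command before touching any state". */
#include <stdio.h>
#include <errno.h>
#include <stdbool.h>

enum { CMD_RUN = 1, CMD_STOP, CMD_ADD_DISK };  /* stand-ins for the md ioctls */

static bool cmd_valid(unsigned int cmd)
{
	switch (cmd) {
	case CMD_RUN:
	case CMD_STOP:
	case CMD_ADD_DISK:
		return true;
	default:
		return false;
	}
}

static int handle_ioctl(unsigned int cmd)
{
	/* Reject junk before any side effects (e.g. clearing hold_active). */
	if (!cmd_valid(cmd))
		return -ENOTTY;
	/* ... real command handling would follow ... */
	return 0;
}

int main(void)
{
	printf("valid cmd -> %d, invalid cmd -> %d\n",
	       handle_ioctl(CMD_RUN), handle_ioctl(0xdead));
	return 0;
}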
From 9f97e4b128d2ea90a5f5063ea0ee3b0911f4c669 Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Thu, 16 Jan 2014 09:35:38 +1100
Subject: md/raid5: fix long-standing problem with bitmap handling on write failure.

Before a write starts we set a bit in the write-intent bitmap.
When the write completes we clear that bit if the write was successful
to all devices.  However if the write wasn't fully successful we should
not clear the bit.  If the faulty drive is subsequently re-added, the
fact that the bit is still set ensures that we will re-write the data
that is missing.

This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
bitmap bit when this flag is not set.
Currently we correctly set the flag if a write starts when some devices
are failed or missing.  But we do *not* set the flag if some device
failed during the write attempt.
This is wrong and can result in clearing the bit inappropriately.

So: set the flag when a write fails.

This bug has been present since bitmaps were introduced, so the fix is
suitable for any -stable kernel.

Reported-by: Ethan Wilson
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown
---
 drivers/md/raid5.c | 1 +
 1 file changed, 1 insertion(+)

(limited to 'drivers/md')

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cbb15716a5db..3088d3af5a89 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2111,6 +2111,7 @@ static void raid5_end_write_request(struct bio *bi, int error)
 			set_bit(R5_MadeGoodRepl, &sh->dev[i].flags);
 	} else {
 		if (!uptodate) {
+			set_bit(STRIPE_DEGRADED, &sh->state);
 			set_bit(WriteErrorSeen, &rdev->flags);
 			set_bit(R5_WriteError, &sh->dev[i].flags);
 			if (!test_and_set_bit(WantReplacement, &rdev->flags))
--
cgit v1.2.3-70-g09d2

From 7da9d450ab2843bf1db378c156acc6304dbc1c2b Mon Sep 17 00:00:00 2001
From: NeilBrown
Date: Wed, 22 Jan 2014 11:45:03 +1100
Subject: md/raid5: close recently introduced race in stripe_head management.

As release_stripe and __release_stripe decrement ->count and then
manipulate ->lru both under ->device_lock, it is important that
get_active_stripe() increments ->count and clears ->lru also under
->device_lock.

However we currently list_del_init ->lru under the lock, but increment
the ->count outside the lock.  This can lead to races and list
corruption.

So move the atomic_inc(&sh->count) up inside the ->device_lock
protected region.

Note that we still increment ->count without device_lock in the case
where get_free_stripe() was called, and in fact don't take
->device_lock at all in that path.  This is safe because if the
stripe_head can be found by get_free_stripe, then the hash lock assures
us that no-one else could possibly be calling release_stripe() at the
same time.

Fixes: 566c09c53455d7c4f1130928ef8071da1a24ea65
Cc: stable@vger.kernel.org (3.13)
Reported-and-tested-by: Ian Kumlien
Signed-off-by: NeilBrown
---
 drivers/md/raid5.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'drivers/md')

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3088d3af5a89..03f82ab87d9e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -675,8 +675,10 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
 				    || !conf->inactive_blocked),
 				    *(conf->hash_locks + hash));
 			conf->inactive_blocked = 0;
-		} else
+		} else {
 			init_stripe(sh, sector, previous);
+			atomic_inc(&sh->count);
+		}
 	} else {
 		spin_lock(&conf->device_lock);
 		if (atomic_read(&sh->count)) {
@@ -695,13 +697,11 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
 				sh->group = NULL;
 			}
 		}
+		atomic_inc(&sh->count);
 		spin_unlock(&conf->device_lock);
 	}
 	} while (sh == NULL);
-	if (sh)
-		atomic_inc(&sh->count);
-
 	spin_unlock_irq(conf->hash_locks + hash);
 	return sh;
 }
--
cgit v1.2.3-70-g09d2
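The race closed by the last patch is the classic one between taking a reference and dropping the last reference: both must happen under the same lock that guards the object's presence on the inactive list, otherwise a releaser can observe count == 0 and recycle the object while it is still being handed out.  The following is a minimal userspace sketch of that rule using a pthread mutex; the names are illustrative only and this is not the raid5 code.

/* Sketch: take the reference inside the same critical section that
 * detaches the object from the free/inactive list. */
#include <pthread.h>

struct stripe {
	int count;
	int on_lru;
};

static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;

static void get_stripe(struct stripe *sh)
{
	pthread_mutex_lock(&device_lock);
	if (sh->count == 0 && sh->on_lru)
		sh->on_lru = 0;   /* detach from the inactive list... */
	sh->count++;              /* ...and take the reference under the same lock */
	pthread_mutex_unlock(&device_lock);
}

static int put_stripe(struct stripe *sh)
{
	int last;

	pthread_mutex_lock(&device_lock);
	last = (--sh->count == 0);
	if (last)
		sh->on_lru = 1;   /* only a zero count may park it on the list */
	pthread_mutex_unlock(&device_lock);
	return last;
}

int main(void)
{
	struct stripe sh = { .count = 0, .on_lru = 1 };

	get_stripe(&sh);
	return put_stripe(&sh) ? 0 : 1;
}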