Re: [Hampshire] disk types and layout on a new box

Author: Andy Smith
Date:  
To: hampshire
Subject: Re: [Hampshire] disk types and layout on a new box
Hi Adam,

On Fri, Sep 27, 2013 at 08:58:10AM +0100, Dr A. J. Trickett wrote:
> I've pretty much decided to get a flash drive as the root file system, my
> preferred "bidder" are currently building with Intel 335 drives. I'm not sure
> exactly what combination and mix to go for.
>
> I don't think the 180 GB drive is large enough on it's own, so I could get a
> pair of them and then LVM them together and put a single ext4 over the two.


I know you say later in the thread that you back up all your
important stuff, but for me, the lost productivity involved in a
disk failure is worth a lot more than the cost of the disk itself.

The problem with using LVM to concatenate two drives together for
more space is that you've doubled the chance of a failure. SSDs
aren't particularly more reliable than conventional HDDs, and the
HDD is usually the first thing to break.

    http://www.zdnet.com/ssd-infant-mortality-ii-7000003945/
    http://www.tomshardware.com/reviews/ssd-reliability-failure-rate,2923-9.html


It's all a few years old, and I've plenty of anecdotes of people
who've "never seen an SSD failure", but personally I am not yet
prepared to believe they are any more reliable. Nor any less.
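
To put rough numbers on the "doubled the chance of a failure" point,
here is a small sketch. It assumes the drives fail independently and
uses a made-up 3% per-drive failure chance. That figure is purely
illustrative and is not taken from the articles above.

```python
# Rough failure arithmetic for two drives, assuming independent failures.
# p is a hypothetical per-drive chance of failing within some period.

def concat_failure(p: float, drives: int = 2) -> float:
    """Chance of losing a concatenated volume: any one drive failing is enough."""
    return 1 - (1 - p) ** drives

def mirror_failure(p: float, drives: int = 2) -> float:
    """Chance of losing a mirror: every drive must fail (ignores rebuild time)."""
    return p ** drives

p = 0.03  # assumed 3% per drive, purely illustrative
print(f"single drive : {p:.4f}")
print(f"concatenated : {concat_failure(p):.4f}")
print(f"mirrored     : {mirror_failure(p):.4f}")
```

For small p the concatenated figure is very nearly 2p, which is where
"doubled" comes from.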

When a physical volume in LVM disappears, the physical extents that
were on it are obviously no longer available. Depending on the LVM
allocation policy in use, some logical volumes would then probably
be missing parts or even their entirety.

The system won't even let you activate a volume group that has
physical volumes missing, although you can override this if you tell
it that you really know what you are doing. That would allow you to
still use the logical volumes that didn't have bits missing.
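
As a toy illustration of which logical volumes survive, the sketch
below classifies each LV by where its extents live. The LV names,
device names and extent counts are all invented. On a real system
you would read the layout from something like "lvs -o +devices"
rather than hard-coding it.

```python
# Toy model: which logical volumes survive when one physical volume vanishes.
# The layout is invented for illustration only.

# Each LV maps to the number of extents it holds on each PV.
layout = {
    "root":   {"/dev/sda": 4000},
    "home":   {"/dev/sda": 1000, "/dev/sdb": 3000},
    "photos": {"/dev/sdb": 8000},
}

def status_after_loss(layout, lost_pv):
    """Classify each LV as intact, partial or lost once lost_pv is gone."""
    result = {}
    for lv, extents in layout.items():
        missing = extents.get(lost_pv, 0)
        total = sum(extents.values())
        if missing == 0:
            result[lv] = "intact"
        elif missing == total:
            result[lv] = "lost"
        else:
            result[lv] = "partial"
    return result

for lv, state in status_after_loss(layout, "/dev/sdb").items():
    print(f"{lv:8s} {state}")
```

Only the "intact" ones are usable after overriding the activation
check.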

You're in a bit of an awkward situation here because you want:

- Tons of storage
- Performance
- Reliability

and you haven't got the cash for all three. There isn't going to be
a single correct answer, and there are a variety of trade-offs you
can make depending on what your priorities are.

I will try to think up what is bound to be an incomplete list.

My own preferred answer though would be something like this:

- Two SSDs in my desktop in a RAID-1, mass storage in a separate
device with some sort of RAID configuration, and a decent backup
regime.

I don't consider that over the top in the home. HP Microservers
are back on cash back offer and make great fairly low energy
consumption networked storage devices. There's a bunch of other
cheap dedicated NAS devices that are suitable for home and small
office use as well.

The advantage of having the mass storage in a separate device is
that it makes it a lot easier to manage. If disks die you can
replace them without downtime. If it's not performing well enough
you can add disks. Even SSDs when that starts to make sense.

You'll probably upgrade your desktop machines a lot more often
than the file server, because the file server doesn't need much
grunt. No need to keep redesigning how the storage will work with
each desktop upgrade.

But it's not for everyone, it's inevitably more complicated and
expensive.

So let's say the file server is a no-go. It's got to all be in the
desktop.

- Two SSDs and two HDDs in two separate RAID-1s

Each SSD should be big enough for your OS and whatever other
performance storage you feel you need. Potentially that could be
quite small - your OS should easily fit in about 2G without you
trying hard. I fit Debian wheezy on a 512M CF card in one of my
devices without doing anything special, but admittedly it has only
vi for an editor and I even removed the less command. ;-)

Mirror the HDDs as well for redundancy (if using Linux software
RAID consider RAID-10 for the HDDs - it works with only two
devices and performs better than RAID-1. Stick to RAID-1 for the
SSDs though because that RAID level supports TRIM/discard).

If you can afford two small SSDs then I think you could try
stretching to two SSDs plus two bigger HDDs, because HDDs are
really cheap.

Can't afford two SSDs?

- One SSD, two HDDs

There's really no excuse to not have two HDDs. Again put your OS
and performance stuff on the SSD, put the "bulk" stuff on the HDD
mirror.

With only one SSD it's even more important than normal to have good
backups.

If you have decided to have a desktop with some SSD storage and some
HDD storage in it, regardless of whether they're mirrored or not
it's now a case of working out where to put the data and how to make
the smaller SSD storage speed up the larger HDD storage.

Linux has a bunch of interesting options for caching slow storage
with faster storage.

- ZFS on Linux

ZFS is going places in the Linux world. Ubuntu and Debian have
packages for it now (though don't expect to be able to call up
Mark Shuttleworth at 3am and ask him to assign some minions to fix
it or anything, like). But at least no more downloading kernel
source from a strange web site and having to build it yourself.

ZFS supports the concept of tiered storage. You build your storage
pool with a tier of fast stuff like SSDs, and a tier of slow stuff
like HDDs, and it does the right thing.

I am not a ZFS expert but tiered storage works like this:

    ZFS has the concept of "cache" devices and "log" devices. It
    calls the cache L2ARC, and a log device holds the ZIL (ZFS
    Intent Log).


    If you tell it that a device is an L2ARC device then it will
    copy hot storage extents into that device and consult it first
    in future when wanting to access them. So this is a fast read
    cache. It only speeds up read operations.


    Since it's only used for read operations you don't need to
    mirror it. In fact you cannot mirror it; if you add more than
    one then ZFS stripes them. If one dies then it reads the data
    from disk again and puts it on the other L2ARC devices. If all
    L2ARC devices are dead then it gets the data from disk every
    time, which works but is slower.


    If you tell it that a device is a ZIL device then synchronous
    writes go to the ZIL first, and are later asynchronously written
    to the slower tier. Point being that the ZIL device is going to
    be really quick, so it can tell the OS that this write has
    definitely hit persistent storage and you can be on your way
    now.


    You really should mirror ZIL devices. Otherwise if one goes pop
    at the same time as a crash or power cut then you lose all the
    data that was written to it that didn't yet make it onto the
    slower media. This might only be a few megabytes but that's
    enough to ruin your file system and your day. So, you can mirror
    those.


If you have multiple SSDs then you can partition each one into two
bits, the larger partition being for L2ARC and the smaller one
being for ZIL. You then end up with two L2ARC devices and one
mirrored ZIL device.
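
A minimal sketch of what that layout might look like as zpool
commands. It only prints them and runs nothing. The pool name "tank"
and every device name are assumptions for illustration; check them
against your own system and the zpool man page before using any of
it.

```python
# Print (do not run) zpool commands for a two-HDD, two-SSD layout.
# Pool name and device names are assumptions for illustration only.

POOL = "tank"
HDDS = ["/dev/sdc", "/dev/sdd"]          # slow tier, mirrored
SSD_CACHE = ["/dev/sda1", "/dev/sdb1"]   # larger SSD partitions, for L2ARC
SSD_LOG = ["/dev/sda2", "/dev/sdb2"]     # smaller SSD partitions, for the ZIL

def zfs_plan(pool, hdds, cache, log):
    """Return the commands as argument lists, in the order they would be run."""
    return [
        ["zpool", "create", pool, "mirror", *hdds],
        ["zpool", "add", pool, "cache", *cache],        # cache devices are striped
        ["zpool", "add", pool, "log", "mirror", *log],  # log device is mirrored
    ]

for cmd in zfs_plan(POOL, HDDS, SSD_CACHE, SSD_LOG):
    print(" ".join(cmd))
```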

If you'd be willing to investigate this you would actually end up
with just a single logical device and it would be really simple.

ZFS itself has many years of testing, but it's a comparative
newcomer on Linux and integration with your favourite distribution
may not be completely there. That means, for example, that you
might encounter difficulties installing, doing upgrades and other
areas where the operating system config needs to know about
partition layout. Probably nothing that can't be worked around by
some time on the command line, but maybe you are not comfortable
with that.

Also there's unlikely to be much third party support for times of
trouble, just community support (from the ZFS on Linux community).

- bcache http://bcache.evilpiepirate.org/
flashcache https://github.com/facebook/flashcache/
enhanceio https://github.com/stec-inc/EnhanceIO
dm-cache https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/device-mapper/cache.txt

These are Linux kernel answers to L2ARC and ZIL. They cache one
set of devices with another set of devices. They are all pretty
new. bcache is a Google project, Flashcache is a Facebook project,
EnhanceIO is a fork of Flashcache that is trying to get more
community support, dm-cache is more closely tied to the existing
device-mapper and LVM projects.

They have different levels of readiness. I think bcache got
included in the upstream 3.10 kernel. dm-cache maybe since 3.9.
The others maybe not yet.

In any case I personally would regard these as highly experimental
and wouldn't use them unless I could stand to see them break
unexpectedly and lose all data at any time. Since it's a layer in
front of *all your storage* the stakes are kind of high.

You can take a slightly safer (and slower) path with all of them
by using them as a read cache only (like L2ARC), but software bugs
could still result in corruption being written back.

- mdadm write-mostly http://linux.die.net/man/4/md search for
"write-mostly"

"write-mostly" is a feature of Linux software RAID-1 where you
tell it that some devices are unsuitable for reads. Writes still
go to both.

Let's say that you could not afford two SSDs. You have just one
SSD and a pair of HDDs. Example:

/dev/sda - 180G SSD
/dev/sdb - 2T HDD
/dev/sdc - 2T HDD

Partition sdb and sdc into two bits, one of 180G and the other the
rest. Make a RAID-10 of sdb1 and sdc1, call it md0. Make a RAID-1
on top of *that* of sda and md0, call it md1. Tell it that its
component md0 is write-mostly. Make a RAID-10 of sdb2 and sdc2,
call it md2. You end up with this:

  /dev/md1 -    180G RAID-1
  /dev/md2 - ~1,820G RAID-10


In theory you have all the speedy read advantages of an SSD, but
you've mirrored it onto an HDD so you can continue working even if
your SSD goes pop. It is of course a lot more complicated.
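
The steps above could be sketched roughly as follows. Nothing is
executed, the commands are only printed. Device names follow the
example, sizes are the round figures from it (metadata overhead is
ignored), and with mdadm the --write-mostly flag applies to the
devices listed after it. Treat it as a starting point and not a
tested recipe.

```python
# Print (do not run) mdadm commands for the one-SSD, two-HDD example.
# Sizes are in GB, using the round figures from the example.

SSD_GB = 180
HDD_GB = 2000

def md_plan():
    """Return the mdadm commands as argument lists, in creation order."""
    return [
        # md0: the two 180G HDD partitions
        ["mdadm", "--create", "/dev/md0", "--level=10", "--raid-devices=2",
         "/dev/sdb1", "/dev/sdc1"],
        # md1: SSD plus md0; devices listed after --write-mostly get the flag
        ["mdadm", "--create", "/dev/md1", "--level=1", "--raid-devices=2",
         "/dev/sda", "--write-mostly", "/dev/md0"],
        # md2: the rest of both HDDs
        ["mdadm", "--create", "/dev/md2", "--level=10", "--raid-devices=2",
         "/dev/sdb2", "/dev/sdc2"],
    ]

def usable_gb(ssd_gb, hdd_gb):
    """Usable size of md1 and md2; a two-device mirror holds one device's worth."""
    return {"md1": ssd_gb, "md2": hdd_gb - ssd_gb}

for cmd in md_plan():
    print(" ".join(cmd))
print(usable_gb(SSD_GB, HDD_GB))
```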

People report some success with this strategy:

http://marc.info/?l=linux-raid&m=126496930530289&w=2

Of course it's only going to be caching reads. You may want to
reserve some SSD space outside of the md array to use for
guaranteed fast write space. Accepting that it won't be redundant
in the face of failure.

- LVM

LVM's great when you're not entirely sure what your needs will be
or if they change often.

I earlier cautioned against using LVM to concatenate a bunch of
drives without redundancy, but you can use it in other ways.

Once you've ended up with a block device that is fast and a block
device that is slow, you can use them both as LVM physical volumes
(PVs) and put them both into the same volume group.

You then create a logical volume for each kind of data you store,
e.g. one for VM images, one for photos, and so on.

You can specify the allocation policy of each LV in order to tell
it where to place the physical extents - you make sure that the
extents you need to perform well are placed on the fast PV. Best of
all, if you get it wrong or change your mind or your needs change,
you can move the extents on a live system without unmounting
anything.
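
A rough sketch of that workflow, again only printing commands and
not running them. It borrows the md1 (fast) and md2 (slow) device
names from the write-mostly example. The volume group name "vg0",
the LV names and the sizes are made up for illustration.

```python
# Print (do not run) LVM commands that pin LVs to a fast or slow PV.
# VG name, LV names and sizes are hypothetical.

FAST_PV = "/dev/md1"
SLOW_PV = "/dev/md2"
VG = "vg0"

def lvm_plan():
    """Return the LVM commands as argument lists, in the order to run them."""
    return [
        ["pvcreate", FAST_PV, SLOW_PV],
        ["vgcreate", VG, FAST_PV, SLOW_PV],
        # Naming a PV at the end restricts where the extents are allocated
        ["lvcreate", "-L", "100G", "-n", "vmimages", VG, FAST_PV],
        ["lvcreate", "-L", "500G", "-n", "photos", VG, SLOW_PV],
        # Changed your mind: move one LV's extents to the slow PV, live
        ["pvmove", "-n", "vmimages", FAST_PV, SLOW_PV],
    ]

for cmd in lvm_plan():
    print(" ".join(cmd))
```

The pvmove at the end is the "change your mind" step: it relocates
just the named LV's extents from one PV to the other.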

If you only had the one SSD, you could combine this with the
"write-mostly" above. In the above example md1 was the fast
device and md2 the slow one.

So in summary, given the constraint of having to fit all the storage
into the desktop, this is how I personally would approach this:

I'd buy two SSDs and two HDDs. If I couldn't afford that, I'd buy
one SSD and two HDDs.

If I felt brave enough for ZFS then I'd do that, because it makes
tiered storage fairly simple.

Otherwise I'd go the route of LVM in order to balance the
differing needs of the data across the different types of storage,
whilst still retaining redundancy. I'd use md "write-mostly"
underneath the LVM if I only had one SSD.

Every Linux distribution should have an installer that supports
software RAID (might have to use "alternate" ISO on Ubuntu), so
just do that and set the write-mostly afterwards, once it's
booted, if necessary.

No matter what scheme I ended up with I'd probably make some
effort to keep /boot outside of all the complicated stuff,
because:

  - that really simplifies booting
  - /boot is quite small anyway (512M should be ample)
  - Aside from kernel upgrades, /boot isn't read or written after
    boot, so it doesn't matter if it's on slow media. A RAID-1 of
    HDDs is fine for it.


Cheers,
Andy

--
http://bitfolk.com/ -- No-nonsense VPS hosting
