
Building up & testing a new 9TB SATA RAID10 NFSv4 NAS, part III

'Mistakes Were Made'

Six months after starting this project, I received an email from Ryan Ellis asking me if I had any tips regarding this NAS build.

Just read your article on the NAS. Amazing job. I would like to replicate what you did. Any tips?

Well, I most definitely do have a list of things I'd have done differently! Nothing too dramatic, but there are some decisions that were overkill and some where I went a little too low end.

  • Using Intel and Asus motherboards: The Intel DP55KG with the P55 Express chipset does not get along with Ubuntu 10.04 LTS, or apparently any Linux distro for that matter. Specifically, the NAS box with the Intel motherboard is unable to do a soft reboot, which means every reboot requires my physical presence in the data center. This has been a known problem for a while, but it did not turn up during my mobo research. Many folks have tried various kernel options to change the rebooting behavior, with mixed success. I've not been able to resolve the issue. When building up the NAS box I told myself that the Linux community would eventually resolve the issue. Maybe it has, but now that we are in production I can't really experiment with the server.

    Lesson Learned: If the mobo is not working perfectly for you, then find another. It's too painful to revisit once in production.

  • Not using "server grade" motherboards: Linux is unable to monitor things on the Asus and Intel motherboards, like fan speed and temperature, that I'd like to be graphing in Cacti. This is apparently possible with the "server grade" budget motherboards from the likes of SuperMicro.

    Self-built 9TB NFSv4 Network Attached Storage (NAS) with DRBD block level replication

    Lesson Learned: It only saved us SFr.400-800 to use these performance desktop motherboards, but our ability to proactively monitor fans is lost. I wish I'd gone for a SuperMicro motherboard.

  • The network load is much lower than I had realized, so the Intel Quad-port NIC is overkill--not even 100 Mbps at peak usage! This is apparently due to the client-side file cache on our client server machines. This was difficult to predict on our old system because we were running with direct attached storage. In hindsight I wish I'd done more research. The two Intel PRO/1000 PT Quad Port Server Adapters could have been single-port NICs, saving us SFr.800 total.

    Production bandwidth usage of NAS

    Lesson Learned: Try to accurately measure and predict how much network traffic you'll see. Did I really need four-port NIC bonding? Not even close.

  • I didn't pay enough attention to adapter-to-drive cabling. The LSI 3ware 9650SE-ML16 card came with 1-to-4, Multilane-to-SATA breakout cables, but the SuperMicro SuperChassis 836A-R1200B came with a backplane with four Multilane ports. That meant sourcing four CBL-SFF8087-05M Multilane-to-Multilane cables, an extra cost. And when I did get them, two were ~10cm shorter than I would have preferred--the cables are currently a bit tight and cannot be moved without loosening the connection. We probably spent another SFr100 on extra cabling.

    Lesson Learned: At least think about device-to-device cabling beforehand, and don't leave it until the build.

  • RAID 1+0 may have been overkill; RAID 6 performance would probably have sufficed. Our production metrics seem to indicate that we run at no more than 33-40%, conservatively, of capacity at peak, and the vast bulk of our NAS activity is reads. RAID 6 probably would have been a safe choice, and doing so would have reduced the number of hard drives by 6 total (3 on each server), which would also have allowed us to use a smaller chassis. Total savings would have been SFr 1700-2000, a non-trivial amount.

    My wife, Robyn, helping me build up one of the NAS servers
    My wife Robyn plugging in the 2TB WD Hard Drives into the 9TB NAS

    That said, we would be reducing our margin for error and our room for future growth (there are currently two empty drive bays on each server), and leaving no allowance for changes in application behavior that would result in more writes. (RAID 6 is great for read-heavy applications like ours, but has much weaker write performance characteristics.)

  • I did not appreciate how little I understood DRBD, or block-level replication for that matter. This resulted in taking poorly understood actions on production data. In hindsight, it would have been wise to set up a test environment on the side (Amazon EC2, some old kit, whatever) and experiment there. If I had made a mistake, we would have had to implement our disaster recovery procedures, which are time consuming and would have resulted in non-trivial downtime.

    Lesson Learned: If it works like magic, then you don't have a clue how it works. For something as fundamental as DRBD is to a redundant NAS system, one should make decisions ad novum, 'with intent'.

  • Setting up the monitoring was significantly more work than I had predicted. While our Cacti + SNMP setup is very powerful, it is not easy to get going for anything but very common metrics. Specifically, configuring important alerts for things like drive failures, or graphs of NFSv4 metrics has been a considerable amount of work. In fact, I've had to come up with my own NFSv4 Cacti template which, to my surprise, did not exist.

    Cacti Monitoring: Tree View

  • These boxes are heavy. Like in the 30kg region. Installing them into the rack alone, even with the assistance of a foot-actuated hydraulic lift, was difficult and borderline dangerous. Managing to get the rails aligned correctly was very challenging.

    60kg of Network Attached Storage
    Self-built 9TB NFSv4 Network Attached Storage (NAS) with DRBD block level replication at the server room


    Lesson Learned: Don't install anything other than a switch alone.

  • WD Green versus WD RE4 drives: We could probably have used cheaper WD Green drives instead of the RE4 series "Enterprise Hard Drives". Ryan Shrout and Patrick Norton talk about the apparent fallacy that WD Green drives are not suitable for a NAS in Episode #95 of This Week in Computer Hardware. The cost savings are huge. Currently at Digitec.ch, where we bought our drives, a WD Caviar RE4 2TB runs for SFr255 and a WD Caviar Green is SFr109--a SFr146 savings. With the 22 data mount hard drives in our build, that works out to SFr3,212! And we could have saved an additional SFr~168 on the operating system drives too.
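
On the network sizing lesson in the list above: one way to get a rough number before buying NICs is to sample an interface byte counter twice and convert the difference to megabits per second. This is only a sketch; the helper name mbps and the interface name eth0 are assumptions, not part of the original setup.

```sh
# Convert two samples of an interface byte counter into megabits per second.
# $1: earlier byte count, $2: later byte count, $3: seconds between samples
mbps() {
    awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.1f\n", (b - a) * 8 / s / 1000000 }'
}

# Hypothetical usage against a live Linux counter (eth0 is an assumed name):
# rx1=$(cat /sys/class/net/eth0/statistics/rx_bytes); sleep 10
# rx2=$(cat /sys/class/net/eth0/statistics/rx_bytes); mbps "$rx1" "$rx2" 10
```

Sampling at peak hours over several days gives a better picture than a single reading, since a client-side cache can hide most of the read traffic.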

All of that said, we are in production and everything works. More dramatically, this project after a mere six months has already resulted in a positive return on investment, when accounting for hardware costs alone. Factor in the time I spent on this project, 60-80 hours, and we will be in the black some time in Q1 2011. Not bad. (This self-built approach was taken in favor of outsourcing our storage to our hosting company's shared NetApp NAS at a TB/month rate.) It also has been a wildly educational experience and forced me to understand my application even more than before.

The series:
Building up & testing a new 9TB SATA RAID10 NFSv4 NAS, part I
Building up & testing a new 9TB SATA RAID10 NFSv4 NAS, part II
Building up & testing a new 9TB SATA RAID10 NFSv4 NAS, part III

Fresh install of mediastreamvalidator fails self test

My frustration with Apple and anything iPhone grows

This morning we discovered that the MPEG-TS streams our application was producing did not meet Apple's latest, seemingly ever-changing, requirements for playback on the iPhone and iPad. Never mind that our streams play back just fine; Apple now requires that they get a stamp of approval from the Media Stream Validator Tool (mediastreamvalidator) for new iPhone apps. To get mediastreamvalidator, I downloaded and installed it. Apparently it is still in beta.

mediastreamvalidator, when run without arguments, lets us know what the command-line arguments and options are.

manoa:~ stu$ mediastreamvalidator
Media Stream Validator
http://www.apple.com

Basic commands:
  mediastreamvalidator help       list all commands or get more help on a command
  mediastreamvalidator parse      parse a playlist
  mediastreamvalidator selftest   run the mediastreamvalidator test suite
  mediastreamvalidator validate   validate a playlist

manoa:~ stu$ mediastreamvalidator --version
Media Stream Validator: Beta Version 1.0(101102)
  Python standard library: /System/Library/Frameworks/Python.framework/Versions/
        2.6/lib/python2.6
  msvlib: /Library/Python/2.6/site-packages/msvlib

Copyright 2009-2010 Apple Inc.
http://www.apple.com

manoa:~ stu$ 

Interesting, it has a selftest. Let's try that out!

manoa:~ stu$ mediastreamvalidator selftest
test_parse_path (test_utilities.URLUtilitiesTests) ... ok
                        {snip}
test_correct_sliding_sequence 
  (test_server_validator.ServerSlidingWindowCorrectBehaviorTests) ... FAIL
                        {snip}
(test_server_validator.ServerVariantPlaylistTests) ... ok

======================================================================
FAIL: test_correct_sliding_sequence (test_server_validator.ServerSlidingWindowCorrectBehaviorTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Library/Python/2.6/site-packages/msvlib/tests/test_server_validator.py", 
    line 1054, in test_correct_sliding_sequence
  File "/Library/Python/2.6/site-packages/msvlib/tests/test_server_validator.py", 
    line 76, in check_valid
  File "/Library/Python/2.6/site-packages/msvlib/tests/test_server_validator.py", 
    line 40, in _check_valid_object
AssertionError: Unexpected fatal parsing problems:

ERROR: First line of playlist must be an M3U tag.
1:    
      ^


----------------------------------------------------------------------
Ran 189 tests in 199.610s

FAILED (failures=1)
manoa:~ stu$ 

A failure. Great. Now what do we do? Is the validator valid? Is the validator's analysis of our media stream valid?
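
The assertion that fails concerns the rule that a playlist must begin with the M3U tag, which in HTTP Live Streaming playlists is the line #EXTM3U. That one rule is easy to spot-check on a playlist without the validator. A minimal sketch follows; the function name is hypothetical, and this checks only the first line, nothing else the validator looks at.

```sh
# Succeed if the first line of the playlist on stdin is exactly #EXTM3U.
# A trailing carriage return (CRLF line endings) is tolerated.
starts_with_extm3u() {
    first=$(head -n 1 | tr -d '\r')
    [ "$first" = "#EXTM3U" ]
}

# Hypothetical usage:
# starts_with_extm3u < playlist.m3u8 && echo "first line OK"
```

A playlist with a blank first line, like the one in the traceback above, would fail this check.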

Producing media for the iPhone is getting harder and more complicated. One would expect it to become easier, faster, better, safer. Our streams are being held hostage by beta software from Apple that fails its own self test.

(OK, self pity drama queen rant over. Back to work.)

Move the SSD from my 2008 Core 2 Duo MBP to 2010 Core i5 MBP

Time lapse photography experiment

A few weeks ago, my trusty 2.5-year-old 2008 MacBook Pro died. Since the machine is at the center of my working life at xtendx, it needed immediate replacement. We quickly ordered a new MacBook Pro. It sports a 2.4 GHz Core i5 CPU and 4GB of RAM for only SFr 2200 from Digitec.

The only weakness was the stock 5,400 RPM HDD. Fortunately for me, the ten-month-old Intel X25-M SSD inside the 2008 MBP was still good. I swapped it out inside 60 seconds! (Ha ha.)

I have to say, all that bitching and moaning on the Intertubes about the new unibody MBP is a bunch of hogwash. Opening and working inside the case of the unibody MBP is soooo much easier than on the 2008 MBP.

  • Fewer screws to remove and replace (and lose)
  • Less time opening and closing the case
  • No need to unplug (and potentially damage) the keyboard
  • No little panels to replace, like the old MBP's memory cover

I am very pleased with the result. As a side bonus, there was no need to reinstall all my software. The only complaint was from the Audible items in iTunes, which I will have to reauthorize.

Cacti Graph Template and Script for NFS v4 Client

First draft: 14 of 35 data sources graphed

Over the summer we at xtendx moved to a new data center at Green.ch, and I set up a new two-node 9 TB DRBD + ext4 + NFS4 NAS. I've also set up a Cacti monitoring system to graph key metrics and alert us to problems. One metric that the Cacti community has not gotten around to addressing is the relatively new and shiny NFSv4 clients and servers.

To address that omission, I started on my first Cacti graph template project. What the Cacti community does have is an NFS v3 client graphing template, so I used this as a base. Here is my preliminary result:

The data acquisition is performed on the target machine with a simple Bash script (/usr/local/bin/cacti-nfs4.sh) that is executed by snmpd. The script:

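# Source of the NFS client RPC counters
NFS=/proc/net/rpc/nfs

# NFSv4 client operations, in the order they appear on the proc4 line.
# The names only drive the loop count (and the commented-out debug echo).
proc="read write commit open open_confirm open_named_att_dir \
  open_downgrade close set_attr fsinfo renew set_clientid confirm lock \
  lock_test unlock access get_attr lookup lookup_root remove rename \
  link symlink create pathconf statfs readlink readdir server_caps \
  delegreturn getacl setacl"

# Field 1 is "proc4", field 2 the number of counters, field 3 the null op,
# so the first counter of interest (read) is field 4.
i=4

for a in $proc; do
#       echo -n "$a.value "
        grep proc4 "$NFS" \
                | cut -f "$i" -d ' ' \
                | awk '{print $1}'
        i=$(expr "$i" + 1)
done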
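
A possible simplification, not what is deployed above: the loop runs grep once per counter, and a single awk pass over the proc4 line should print the same values. The sketch below assumes the proc4 line layout "proc4 <count> <null> <read> ..." and uses a hypothetical function name; it reads from stdin so it can be tried against sample text first.

```sh
# Print NFSv4 client counters, one per line, in a single awk pass.
# $1 is how many counters to print (default 33, the length of the
# operation list in the script above).
nfs4_counters() {
    awk -v n="${1:-33}" '
        $1 == "proc4" {
            # skip "proc4", the counter count, and the null op
            for (i = 4; i < 4 + n && i <= NF; i++) print $i
        }'
}

# Hypothetical usage on the monitored host:
# nfs4_counters < /proc/net/rpc/nfs
```

Because awk splits on any run of whitespace, this does not depend on the fields being separated by exactly one space, which the cut-based version does.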

To configure snmpd to call the script and report the result, I've added the below to /etc/snmp/snmpd.conf and then restarted the snmpd daemon. (Now is probably a good time to mention that this is all on Debian Lenny, so if you are on a different platform YMMV.)

extend .1.3.6.1.4.1.2021.66 nfs_client /bin/sh /usr/local/bin/cacti-nfs4.sh

Once this is done, the configuration can be confirmed with snmpwalk, as shown below. The output should not only be there, but also match the output of cacti-nfs4.sh.

ballito:/home/stu# !snmpwalk
snmpwalk -v2c -c public  ballito.be .1.3.6.1.4.1.2021.66.4
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.1 = STRING: "1209434"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.2 = STRING: "46246622"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.3 = STRING: "46218746"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.4 = STRING: "52265481"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.5 = STRING: "69"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.6 = STRING: "0"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.7 = STRING: "2"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.8 = STRING: "52197430"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.9 = STRING: "46346973"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.10 = STRING: "2"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.11 = STRING: "0"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.12 = STRING: "1"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.13 = STRING: "1"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.14 = STRING: "0"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.15 = STRING: "0"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.16 = STRING: "0"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.17 = STRING: "3260450"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.18 = STRING: "42080094"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.19 = STRING: "47829475"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.20 = STRING: "1"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.21 = STRING: "2357"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.22 = STRING: "89"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.23 = STRING: "0"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.24 = STRING: "0"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.25 = STRING: "161"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.26 = STRING: "1"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.27 = STRING: "27924"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.28 = STRING: "0"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.29 = STRING: "46133"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.30 = STRING: "3"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.31 = STRING: "140977"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.32 = STRING: "0"
UCD-SNMP-MIB::ucdavis.66.4.1.2.10.110.102.115.95.99.108.105.101.110.116.33 = STRING: "0"
ballito:/home/stu# 
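
The long run of numbers in each OID above is not random: it is the extend name encoded as a length followed by ASCII codes (10 characters, then 110 = n, 102 = f, 115 = s, and so on), and the final number is the output line. A small sketch that decodes such an index, with a hypothetical function name:

```sh
# Decode a length-prefixed ASCII index as found in the OIDs above.
# $1: dotted numbers where the first is the string length; anything after
# the encoded string (such as the output line number) is ignored.
decode_oid_index() {
    echo "$1" | awk -F. '{
        s = ""
        for (i = 2; i <= $1 + 1; i++) s = s sprintf("%c", $i + 0)
        print s
    }'
}

decode_oid_index 10.110.102.115.95.99.108.105.101.110.116.7
```

This prints nfs_client, the name given on the extend line in snmpd.conf, which is a quick way to confirm which extend entry a walked value belongs to.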

With the data being properly collected and reported by SNMP, it's now time to configure Cacti to record and graph it all. nfsstat (whose data source, /proc/net/rpc/nfs, is at the core of the Bash script) reports 35 different metrics for NFS v4. For my tastes, that is way too many metrics for a single graph. I've defined an abridged Cacti graph template with 14 values for now, and intend to squeeze that down to 10 in the upcoming weeks. That said, the template has all but one data source template included. In the end, I think having two graphs may make sense: an everything graph, and a "values of interest" graph.

Here is the graph template itself: cacti_graph_template_ucdnet_-_nfs4_client.xml

Please feel free to let me know if you have any issues with it, have made improvements worth posting, or found it useful and want to buy me a beer!

Building up & testing a new 9TB SATA RAID10 NFSv4 NAS, part II

Initial DRBD Sync - 9TB in 4.5 days

After much blood, sweat and tears getting xtendx up and running in our new data center with Green.ch this summer, I finally had time to complete a major "high availability" aspect of our new file server platform. Our basic architecture is fairly simple: a couple of application servers backed by the network attached storage.

The network attached storage (NAS) is composed of a pair of nearly identical self-built file servers (see Building up & testing a new 9TB SATA RAID10 NFSv4 NAS, part I). The next, more complex step was to configure a Class C (Primary + Secondary) DRBD cluster. In an effort to mitigate some risk and spread the workload over the course of the summer, I staged the entire installation:

  1. May: Build NAS 0 (thanks to my wife, Robyn, too!)
  2. Early June: Install NAS 0 into data center, copy over production data
  3. Mid-June: Put NAS 0 into production as a simple ext4 + NFSv4 file server
  4. July: Build NAS 1 (Again, a thanks to my wife)
  5. July + August: Install NAS 1 into data center, configure as a HA NAS: drbd + ext4 + NFSv4.
  6. September: Copy data to NAS 1, take NAS 1 into production
  7. Late-September: configure NAS 0 as secondary node in a HA NAS: drbd + ext4 + NFSv4.
  8. Early-October: Initialize DRBD synchronization.

Kicking off the block-level disk synchronization was a big deal. These servers are a running production system and it is paramount that existing service delivery is not impacted. At first I had left the DRBD sync rate unaltered, which I believe effectively means 'fast as you can'. This quickly resulted in poor read times for production applications, so I kicked it down dramatically while the sync was in progress with the drbdsetup command:

sudo drbdsetup /dev/drbd1 syncer -r 24M

After fooling around with various rates, I settled on 24M. It's much lower than the system could theoretically synchronize at, but that is not the goal. This production value is also now configured in /etc/drbd.d/global_common.conf:

common {
    #snip
    syncer {
        # rate after al-extents use-rle cpu-mask verify-alg csums-alg
        rate 24M; 
    }
}

Interestingly, the synchronization rate does not match disk IO as measured by Cacti via SNMP; in fact it is roughly half. (I have no idea why.) As you can see below, it took a solid 4.5 days for the synchronization of a single 9TB device to complete:

Primary DRBD Node Read and Write IO (MBps)
Bytes-Per-Second during initial DRBD synchronization by primary to secondary

Secondary DRBD Node Read and Write IO (MBps)
Bytes-Per-Second during initial DRBD synchronization by secondary from primary
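
As a rough cross-check of the 4.5 days: syncing 9 TB at a steady 24 MB/s works out to a little over four days before any overhead. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1,000,000 MB) and a hypothetical helper name; DRBD's own interpretation of the M suffix may differ slightly.

```sh
# Estimate the days needed to sync a device at a fixed rate.
# $1: device size in TB (decimal), $2: sync rate in MB per second
sync_days() {
    awk -v tb="$1" -v mb="$2" \
        'BEGIN { printf "%.1f\n", (tb * 1000000) / mb / 86400 }'
}

sync_days 9 24
```

The same helper makes it easy to see the trade-off when picking a rate: doubling the rate roughly halves the sync window, at the cost of more contention with production reads.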

A bonus to this work on a production system is that the DRBD sync served as a real-world production load test. I now know with a respectable degree of certainty what our NAS is capable of, and when it is approaching stressed.

The series:
Building up & testing a new 9TB SATA RAID10 NFSv4 NAS, part I
Building up & testing a new 9TB SATA RAID10 NFSv4 NAS, part II
Building up & testing a new 9TB SATA RAID10 NFSv4 NAS, part III