
Enterprise Storage – Is it overkill for Hadoop?

October 23, 2012

As with any buzzword – Information Lifecycle Management (ILM), virtualization, cloud computing and now Hadoop – enterprise storage vendors take great pains to explain why the new trend is something they anticipated all along and for which they have pre-designed capabilities in their storage arrays.

The mantra from the big storage vendors in the case of Hadoop is: “Hadoop may prescribe the use of Direct-Attached-Storage (DAS) on each Hadoop node, but we know what is best for you – our $3000 per TB enterprise class storage in a NAS or SAN configuration!”.

Let us step back and look at the arguments you’re likely to hear from your enterprise storage vendor:

  1. Why use just-a-bunch-of-disks (JBOD) when you can use our “enterprise class” storage?  Well, you’ve already bought into open source Hadoop over proprietary data warehouses, so investing in commodity x86 servers and JBODs is not that big a leap of faith.
  2. Why use DAS when you can use Network Attached Storage (NAS) or SAN?  For starters, Hadoop is designed as a “shared nothing” architecture where DAS suffices.  Shared network storage like NAS or SAN could very well be overkill from a cost point of view if all you plan to store in the Hadoop cluster is petabytes of log and web 2.0 data.
  3. You need enterprise class storage because you need redundancy!  Considering that the Hadoop Distributed File System (HDFS) is designed to distribute your data across many nodes in the cluster specifically to safeguard against the failure of any single node, that argument doesn’t wash either.
  4. You need fault tolerance and we give you that with enterprise class drives!  Mean Time Between Failure (MTBF) on enterprise class Fibre Channel or SATA drives is definitely higher than that on low-end SATA drives, but considering that all spinning drives eventually fail, it’s only a question of how frequently and at what cost you replace failed drives.
  5. Hadoop is not “enterprise class” and has single points of failure!  In the Hadoop architecture the NameNode stores HDFS metadata (such as the location of blocks) for the entire cluster.  If this concerns you, you have the option to shard (partition) the metadata across multiple NameNodes, so that a NameNode failure affects only the portion of the namespace that particular NameNode manages.
  6. Only our enterprise class storage provides you with data protection for Hadoop!  Considering that Hadoop replicates every block across multiple DataNodes, this argument doesn’t hold either.
  7. You get snapshots only with our enterprise class storage!  True, but considering that you, the CIO, plan to implement Hadoop to store big data (log files, machine-generated data, pictures, web 2.0 data), do you really need to give it the enterprise-grade data protection you’d give your payroll or customer data?
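To make the redundancy point in items 3 and 6 concrete: HDFS gets its fault tolerance from software-level block replication, not from the storage array. The replication factor is an ordinary cluster setting – a minimal sketch of the relevant hdfs-site.xml fragment (the value 3 is Hadoop’s default, not something you normally need to change):

```xml
<!-- hdfs-site.xml: HDFS keeps this many copies of each block,
     placed on different DataNodes (and across racks where possible),
     so no RAID or array-level replication is required underneath. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

This is why JBOD on DAS is sufficient: the file system itself, not the disk subsystem, is responsible for surviving drive and node failures.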

So getting back to basics, how much should you, the CIO, budget for computing and storage resources for your Hadoop project?  If you expect to process 1 TB of data every month, you’d be OK budgeting about 4 TB of raw disk capacity for it (3 TB for the three copies Hadoop makes by default, plus roughly 1 TB extra for overhead).  Cloudera recommends 4-12 drives per storage node in a non-RAID configuration.  Plan to spend around $4000 or more per Hadoop node for server and storage combined.  Isn’t that a far cry from the $3000 per TB quote you received from the enterprise storage vendor?  However, if rolling your own doesn’t strike your fancy from a cost/risk perspective, there are always pre-built Hadoop appliances like the Oracle Big Data Appliance in the $500,000 range, or comparable offerings from vendors like EMC/Isilon or NetApp.
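The sizing arithmetic above can be sketched as a back-of-the-envelope calculation. This is only a rough heuristic using the figures from this post (3x replication, overhead of about one third of the replicated size); the function name and the overhead ratio are illustrative assumptions, not an official sizing formula:

```python
def raw_storage_needed_tb(data_tb, replication=3, overhead_factor=1.0 / 3):
    """Rough raw disk capacity (TB) needed to hold `data_tb` of user data in HDFS.

    replication:     HDFS keeps this many copies of every block (default 3).
    overhead_factor: extra scratch space (MapReduce spills, temp files) as a
                     fraction of the replicated size -- an assumed ratio here.
    """
    replicated = data_tb * replication        # 1 TB of data -> 3 TB on disk
    overhead = replicated * overhead_factor   # plus scratch/temporary space
    return replicated + overhead

# 1 TB of new data per month, as in the example above: roughly 4 TB raw.
print(raw_storage_needed_tb(1))
```

Multiply the result by however many months of data you intend to retain, then divide by the usable capacity per node to estimate the cluster size.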

Having saved money by deploying your Hadoop cluster with off-the-shelf servers and direct-attached storage, what do you do with your remaining budget $$?  Invest in statisticians and quants (math wizards and computer programmers), and get your staff trained on tools like R (the open source programming language for statistical computing).  After all, it’s not the data or where you store it that really matters; what matters is the meaningful insights you derive from that data, with the goal of improving your company’s bottom line!