Virtualizing Apache Hadoop
One morning, as you brood over shrinking IT budgets and the upcoming fiscal cliff, your IT department submits a request for you to approve the purchase of hardware and software to build a Hadoop cluster so they can store and analyze log data for IT forensics. While you are contemplating this budget item, your marketing department gets wind of the planned rollout and demands budget for its very own Hadoop cluster to do sentiment analysis on customer data.
You decide to educate yourself on the options out there. In so doing, you ask yourself:
Is there a way to run multiple Hadoop clusters on the same physical hardware, given that the servers in a dedicated Hadoop cluster will be far from fully utilized?
Is there a way to run Hadoop and non-Hadoop workloads on the same physical hardware?
If you pose these questions to your VMware rep, the answer you might receive is: we recommend that you run Hadoop in virtual machines on vSphere using VMware's Serengeti project.
What this means for your IT department is that they will run ESXi (part of vSphere) as the hypervisor on the physical servers, create VMs, and run Hadoop components such as the JobTracker, TaskTracker, and NameNode inside those VMs. In addition, VMware adds an extra component called the VMware Hadoop Run Time Manager to each of your VMs. Think of this component as a broker that talks to vCenter (the management component within the vSphere suite) on one hand and to Serengeti's user interface on the other.
Once you do this, you have a way to scale the compute nodes (such as the Hadoop TaskTrackers) independently of the storage nodes (the Hadoop DataNodes), even while they all run on the same physical servers.
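To make that compute/storage split concrete, a Serengeti cluster spec describes compute and data VMs as separate node groups. The fragment below is only a sketch of the idea; the field names, role names, and sizes are illustrative and vary by Serengeti release, so treat it as a picture of the concept rather than a copy-paste spec:

```json
{
  "nodeGroups": [
    {
      "name": "compute",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 6,
      "cpuNum": 2,
      "memCapacityMB": 4096,
      "storage": { "type": "LOCAL", "sizeGB": 20 }
    },
    {
      "name": "data",
      "roles": ["hadoop_datanode"],
      "instanceNum": 3,
      "memCapacityMB": 2048,
      "storage": { "type": "LOCAL", "sizeGB": 200 }
    }
  ]
}
```

Because the TaskTracker VMs carry no HDFS data, the compute group can be grown, shrunk, or even powered off overnight without touching the group that holds the data.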
Now you ask the questions:
If we are using VMware ESXi from vSphere, can we use other VMware tools like Distributed Resource Scheduler (DRS) to isolate resource pools by business unit or department?
Can VMs be moved between physical servers using vMotion?
The answer is yes, provided you are willing to deploy shared storage across a Storage Area Network (SAN). If you go the Fibre Channel route, be prepared to pay for one or more relatively expensive Fibre Channel Host Bus Adapters (HBAs) per server, plus Fibre Channel switches and Fibre Channel-enabled storage arrays. If you go the iSCSI route, you can rely on the built-in GbE or 10 GbE NICs in your servers, but you will still need high-end GbE switches and iSCSI storage arrays as your shared storage.
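As for the resource-isolation question, per-department pools can be carved out of a vSphere cluster with a few lines of PowerCLI. This is a minimal sketch, assuming a vCenter reachable at vcenter.example.com and an existing cluster named "Prod"; the hostnames, pool names, and share levels are all placeholders, not taken from any real deployment:

```powershell
# Connect to vCenter first (hostname is a placeholder)
Connect-VIServer -Server vcenter.example.com

# Carve the cluster into per-department DRS resource pools
$cluster = Get-Cluster -Name "Prod"
New-ResourcePool -Location $cluster -Name "IT-Forensics" `
    -CpuSharesLevel High -MemSharesLevel High
New-ResourcePool -Location $cluster -Name "Marketing-Sentiment" `
    -CpuSharesLevel Normal -MemSharesLevel Normal
```

DRS then apportions CPU and memory between the pools according to their share levels, so the forensics and sentiment-analysis clusters can contend for the same hosts without starving each other.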
You now face a conundrum: VMware has greatly improved your server utilization, but the Faustian bargain you've made is to buy into SANs and SAN-attached enterprise storage. This is not only expensive to procure but also expensive to maintain. If you are not comfortable with the idea of deploying a high-end SAN but still want the benefits of DRS and vMotion, why not consider a SAN-free option from startups like Nutanix, SimpliVity, or Tintri?
What Nutanix does is collapse the compute layer and the storage layer into 2U blocks, each of which has four servers (also called nodes), Fusion-io PCIe SSDs, SATA SSDs, and SATA HDDs. Think of these as Lego building blocks that combine compute, storage, and storage tiering into a single 2U box called a block. These blocks allow you to scale out with the benefits of VMware and a SAN, without actually buying or deploying a SAN or dedicated networked storage. Unlike in traditional clustered storage, no memory or disk is shared between nodes; this shared-nothing cluster architecture is more in line with the Hadoop philosophy. In essence, you have eliminated the cost and complexity of rolling out servers, SAN fabrics, and networked storage, so your IT folks can focus their attention on Hadoop and VMware. Once Hadoop is up and running on the Nutanix blocks, your IT department could set things up so that the same blocks serve up VDI during the day and Hadoop at night, essentially guaranteeing maximum utilization of your expensive assets. Note that support for Serengeti is upcoming but may not yet be available from Nutanix.
Is Nutanix the only option in town? No. SimpliVity is another startup that claims comparable technology. Another option is to deploy your choice of x86 servers back-ended by Tintri appliances (flash and disk storage combined into a single appliance). If IT budget is not an issue, you could consider higher-end options like EMC Vblock or NetApp FlexPod. In conclusion, there is no one-size-fits-all: you can build your own virtualized Hadoop solution using many different options from various vendors, and each approach will have its own set of pros and cons. What is refreshing, however, is that we no longer live in a world where vendors may say, "Any customer can have a car painted any colour that he wants so long as it is black." If anything, the pendulum has swung in the other direction, into a world of almost too many choices.