Archive
Amazon Elastic MapReduce and private storage
A topic that caught my eye recently was the announcement that NetApp customers using AWS can now buy NetApp storage and place it in an Amazon Direct Connect colocation site, so that services like Amazon Elastic MapReduce (EMR) can access that NetApp storage more efficiently. To my mind this seemed like an apples-to-oranges pairing.
I can see the first benefit: if you are a NetApp shop, you can have your in-house NetApp storage back up all your volume Snapshots into a remote array's SnapVault and mirror all your local volumes into a remote SnapMirror target located in Amazon's colocation facility. This gives you a large vault for your backups in addition to providing a DR site in the public cloud. Another potential use case cited is to use your co-located enterprise storage array as storage for Elastic MapReduce. I must confess that the rationale behind this use case eludes me.
Let us step back and look at how customers typically use EMR. Airbnb, the online travel site, had a need to analyze reams of data to address questions like:
- If a user stayed in a property listed on Airbnb in a certain month, did that user's friends on Facebook also patronize the same property?
- If listings in a certain area see low patronage, is it because users don't want to visit them, or because users never found those specific listings?
Airbnb found that MySQL didn't cut it for this type of analysis and eventually moved to Amazon EMR. Like Airbnb, you, the CIO of an enterprise, might have done extensive research and bought into the value of using Hadoop and MapReduce for your big data crunching. However, you decided against the CapEx of buying your own servers and local storage, with the associated power and cooling costs, and chose to run Hadoop in the public cloud, specifically on AWS. Eventually you got tired of having to manage your own virtual Hadoop cluster in AWS and decided to move to a managed service like Elastic MapReduce. Amazon's three-step process for EMR might have helped clinch the deal (a rough code sketch follows the list):
- Upload your application data into an S3 bucket
- Create a job flow using EMR
- Get results from the S3 bucket
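For the programmatically inclined, those three steps boil down to a single API call. Below is a minimal sketch using boto3, the AWS SDK for Python; the bucket names, script paths, instance types and EMR release label are hypothetical placeholders.

```python
# A minimal sketch of the three-step EMR flow using boto3.
# Bucket names, key paths, instance types and release label are hypothetical.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="log-analysis",                          # step 2: create the job flow (cluster)
    ReleaseLabel="emr-6.15.0",                    # hypothetical EMR release
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,     # tear the cluster down when the step finishes
    },
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/scripts/mapper.py,s3://my-bucket/scripts/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", "s3://my-bucket/input/",    # step 1: data already uploaded to S3
                "-output", "s3://my-bucket/output/",  # step 3: results land back in S3
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
```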
In parallel, you might have moved from a traditional Oracle RDBMS to NoSQL and eventually to Amazon's hosted NoSQL database service, DynamoDB. Perhaps the underlying SSD-based storage and the $1 per GB cost were the siren song from Amazon that drew you to this decision. Now that you use both EMR and DynamoDB, you have the option to archive DynamoDB tables to Amazon S3 as CSV files. If you change your mind you can re-import this data into DynamoDB from S3, and you can do useful things like link live tables in DynamoDB with archived tables in S3.
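The integrated export and import runs through EMR and Hive, but the archive-to-CSV idea itself is simple. Here is a minimal standalone sketch of it using boto3; the table name, bucket, key and attribute names are hypothetical placeholders, and a real export of a large table would go through EMR rather than a single Scan loop.

```python
# Minimal sketch: dump a DynamoDB table to a CSV object in S3 using boto3.
# (EMR/Hive does this at scale; this is just the idea in miniature.)
# Table name, bucket and attribute names are hypothetical.
import csv
import io
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

table = dynamodb.Table("orders")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["order_id", "customer_id", "total"])
writer.writeheader()

# Paginate through the whole table with Scan.
kwargs = {}
while True:
    page = table.scan(**kwargs)
    for item in page["Items"]:
        writer.writerow({k: item.get(k, "") for k in writer.fieldnames})
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

s3.put_object(Bucket="my-archive-bucket",
              Key="dynamodb/orders/archive.csv",
              Body=buffer.getvalue())
```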
One of the reasons customers use EMR in the first place is that they like the idea of consuming as much processing and storage capacity as a job needs and relinquishing those resources when the job is done. In view of this, you might ask, as I did: is there value in having Amazon Elastic MapReduce use co-located private storage?
One obvious advantage of placing your private storage in an Amazon Direct Connect site is that you get a dedicated 1 GbE or 10 GbE circuit from the Direct Connect site to the AWS cloud instead of going over the internet. This translates into more predictable data transfers and lower costs, since you will pay Amazon a lower rate for traffic over the Direct Connect circuit than you would pay to move the same data between your enterprise data center and AWS over the internet.
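As a back-of-the-envelope illustration of why the economics can work out, here is a quick comparison. The per-GB rates and port fee below are purely hypothetical placeholders, not actual AWS pricing; plug in the current published rates before drawing any conclusions.

```python
# Back-of-the-envelope transfer cost comparison.
# All rates below are hypothetical placeholders, not current AWS pricing.
monthly_transfer_gb = 10_000          # 10 TB moved out of AWS per month

internet_rate_per_gb = 0.09           # hypothetical internet data-transfer-out rate ($/GB)
direct_connect_rate_per_gb = 0.02     # hypothetical Direct Connect data-transfer-out rate ($/GB)
direct_connect_port_fee = 0.30 * 730  # hypothetical hourly port fee x ~730 hours/month

internet_cost = monthly_transfer_gb * internet_rate_per_gb
direct_connect_cost = monthly_transfer_gb * direct_connect_rate_per_gb + direct_connect_port_fee

print(f"Internet:       ${internet_cost:,.2f}/month")
print(f"Direct Connect: ${direct_connect_cost:,.2f}/month")
```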
Considering that your enterprise private storage in the colocation site is far from "cheap and deep" archive storage, the value of having EMR use this expensive enterprise storage as its persistent store eludes me, all the more so since there are ways to use EBS volumes on commodity storage as persistent storage for EMR. I welcome clarifications and counterpoints from EMR users more experienced with this topic than myself. Thanks for reading this far!
OpenFlow, SDN and Big data
OpenFlow promised a way out of the tyranny of a world dominated by proprietary networking vendors like Cisco, Juniper, Alcatel and others. It offered the promise of traffic engineering on low-cost hardware and a way to design your network deterministically rather than over-provision everything for the worst-case scenario. In jargon-speak, OpenFlow allows "separation of the control plane from the data plane."
What this means to you and me is that we can now take functions like access control lists (ACLs) that used to be handled by routers and switches and move that intelligence into Java software running in a virtual machine (sketched in code below). To do all these wonderful things, all you need is:
- OpenFlow controller
- OpenFlow enabled switch
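To make that ACL example concrete, here is roughly what pushing a "drop" rule into an OpenFlow switch looks like when the controller exposes a REST API. The sketch assumes a Floodlight controller (an open source controller discussed below) listening on localhost:8080 with its Static Flow Pusher module enabled; the endpoint path and field names differ between controller versions, so treat this as illustrative rather than definitive.

```python
# Sketch: push an ACL-style "drop" rule into an OpenFlow switch via a controller's REST API.
# Assumes a Floodlight controller at localhost:8080 with the Static Flow Pusher module enabled;
# endpoint path and field names vary between controller versions.
import json
import urllib.request

drop_rule = {
    "switch": "00:00:00:00:00:00:00:01",   # DPID of the OpenFlow switch
    "name": "block-host-10-0-0-5",
    "priority": "32768",
    "eth_type": "0x0800",                  # match IPv4 traffic
    "ipv4_src": "10.0.0.5/32",             # ...from this source address
    "active": "true",
    "actions": ""                          # empty action list = drop
}

req = urllib.request.Request(
    "http://localhost:8080/wm/staticflowpusher/json",
    data=json.dumps(drop_rule).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())
```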
Vendors who make OpenFlow compliant switches include Brocade, HP, Juniper, Dell, Extreme Networks, Arista and IBM. Carriers who deploy OpenFlow in their networks today include Verizon.
Software Defined Networking (SDN) is the latest buzzword, and it sends shivers down the spines of the big router and switch vendors who made billions of dollars selling proprietary, firmware-based routers and switches. The stifling world they thrust upon us, with access routers, aggregation routers, layer 3 routers and WAN routers, millions of lines of code for each, and roadmaps that stretch ten years out, may all become a thing of the past thanks to the unrelenting march of technologies like SDN.
How do you visualize SDN? Think of this new architecture as being modular and neatly layered like an IKEA shelf, with the data-plane layer at the bottom, control plane above it and the application plane on top:
- The data-plane layer switches can be of two types: OpenFlow Hypervisor switches and OpenFlow physical switches.
- The control plane could feature commercially available controllers like Big Switch Networks' "Big Network Controller".
- The application plane is where you would have SDN applications, cloud orchestration and other business applications.
Players in the emerging SDN space include startups like Nicira, Big Switch Networks and Midokura:
- Nicira, which released its virtual switch, Open vSwitch, as open source
- Big Switch Networks, which released its OpenFlow controller, Floodlight, as open source under the Apache 2 license
- Midokura, whose MidoNet product virtualizes the network stack
So what did the legacy networking and virtualization vendors do? As the adage goes: if you can't build it in-house, then at least acquire it so you regain some nominal measure of control. This scenario played out recently when:
- VMware acquired Nicira for $1.2B
- Brocade acquired Vyatta for an undisclosed amount
- Cisco acquired Cloupia for $125M and Meraki for $1.2B
That’s enough about the vendors in the space. If I am a user how do I benefit from SDN?
Benefits of SDN to enterprise customers:
- The ability to squeeze more VMs into your existing rack servers drives down your power and cooling costs. This in turn means more CapEx and OpEx savings (the latter due to fewer trouble tickets), especially in an enterprise cloud environment.
Fidelity and Goldman Sachs are rumored to be customers of Big Switch Networks.
Benefits of SDN to cloud service providers:
- A way to virtualize your network
- If you are benefiting from virtualizing servers using VMware or Citrix technology, why stop there? Why not go down a level, virtualize the underlying network of proprietary switches and routers, and treat it as an undifferentiated pool of resources?
- A programmatic way to control your infrastructure:
- If you are a telecom service provider, you now have a way to move away from the proprietary architectures of Cisco, Juniper and Alcatel, which require one of the vendor's programmers to program your routers. You can now have your own telecom engineers do the programming, blissfully oblivious to the make and model of the underlying router.
- Better orchestration of cloud services.
- Rapid provisioning including scale-up and scale-down.
In the service provider market these can quickly become key business differentiators. Cloud providers like Amazon, Google, Facebook, MSN, Yahoo and Rackspace take advantage of the benefits of SDN today. Traditional service providers like NTT are also adopters of SDN.
If you are still with me you might wonder: I get all this OpenFlow and SDN business, but what, pray tell, does it have to do with Hadoop and big data? For starters, Infoblox has demonstrated that Hadoop performance can be accelerated using OpenFlow-enabled Ethernet switches instead of pricier InfiniBand or other interconnects. As Stuart Bailey, CTO of Infoblox, says: "Imagine a single programmer routinely having 10,000 cores and their associated networks to write Big Data applications! SDN enables new industrial applications like Big Data analysis in the same way the PC brought spreadsheets and word processors into the enterprise."
If you wish to explore this area further, look into papers like these from IBM on how you can program your network at run time for big data applications. To learn more about SDN itself, look into papers like this one from Dell. If you happen to be an HP shop, you'll want to catch up on HP's solutions for SDN. So you see, in the end, big data benefits tangibly from this move toward SDN regardless of who makes your underlying network hardware.
Virtualizing Apache Hadoop
One morning, as you brood over shrinking IT budgets and the upcoming fiscal cliff, your IT department submits a request for you to approve the purchase of hardware and software to build a Hadoop cluster so they can store and analyze log data for IT forensics. While you are contemplating approving this budget item, your marketing department gets wind of the planned rollout and demands budget for its very own Hadoop cluster to do sentiment analysis on customer data.
You decide to educate yourself on the options out there. In so doing, you ask yourself:
Should I buy another dozen servers for a second Hadoop cluster and risk higher power and cooling expenses, not to mention the use of more scarce floor tile space in the data center?
Is there a way to run multiple Hadoop clusters on the same physical hardware, considering that the servers in a Hadoop cluster will typically be far from fully utilized?
Is there a way to run Hadoop and non-Hadoop workloads on the same physical hardware?
If you pose this question to your VMware rep, the answer you might receive is: we recommend that you run Hadoop in virtual machines on vSphere using VMware's Serengeti project.
What this means for your IT department is that they will run ESXi (the hypervisor in vSphere) on physical servers, create VMs, and run Hadoop components like the JobTracker, TaskTracker and NameNode in those VMs. In addition, VMware adds an extra component, called the VMware Hadoop Run Time Manager, to each of your VMs. Think of this component as the broker that talks to vCenter (the management component within the vSphere suite) on one hand and to Serengeti's user interface on the other.
Once you do this, you have a way to scale the compute nodes (which run the Hadoop TaskTrackers) independently of the storage nodes (which run the Hadoop DataNodes), even while they all run on the same physical servers.
Now you ask the questions:
If we are using VMware ESXi from vSphere, can we use other VMware tools like Distributed Resource Scheduler (DRS) to isolate resource pools by business unit or department?
Can VMs be moved between physical servers using vMotion?
The answer is yes, provided you are willing to deploy shared storage across a Storage Area Network (SAN). If you go the Fibre Channel route, be prepared to pay for one or more relatively expensive Fibre Channel Host Bus Adapters (HBAs) per server, Fibre Channel switches and Fibre Channel storage arrays. If you go the iSCSI route, you can rely on the built-in GbE or 10 GbE NICs in your servers, but you will still need high-end GbE switches and iSCSI storage arrays as your shared storage.
You now face a conundrum: VMware has greatly improved your server utilization, but the Faustian bargain you've made is buying into SANs and SAN-attached enterprise storage, which is not only expensive to procure but also expensive to maintain. If you are not comfortable with deploying a high-end SAN but still want the benefits of DRS and vMotion, why not consider a SAN-free option from startups like Nutanix, SimpliVity or Tintri?
What Nutanix does is collapse the compute and storage layers into 2U blocks, each of which houses four servers (also called nodes), Fusion-io PCIe SSDs, SATA SSDs and SATA HDDs. Think of these as Lego building blocks that combine compute, storage and storage tiering in a single 2U box called a block. These blocks let you scale while keeping the benefits VMware normally derives from a SAN, without actually buying or deploying a SAN or dedicated networked storage. Unlike traditional clustered storage, no memory or disk is shared between nodes, hence a shared-nothing cluster architecture more in line with the Hadoop philosophy. In essence you have eliminated the cost and complexity of rolling out servers, SAN fabrics and networked storage, so your IT folks can focus their attention on Hadoop and VMware. Once Hadoop is up and running on the Nutanix blocks, your IT department could arrange for the same blocks to serve up VDI during the day and Hadoop at night, essentially guaranteeing maximum utilization of your expensive assets. Note that support for Serengeti is upcoming but may not be available yet from Nutanix.
Is Nutanix the only option in town? No. SimpliVity is another startup which claims technology comparable to that of Nutanix. Another option is to deploy your choice of x86 servers backed by Tintri appliances (flash and disk combined in a single appliance). If IT budget is not an issue, you could consider higher-end options like EMC's Vblock or NetApp's FlexPod. In conclusion, there is no one-size-fits-all answer: you can build your own virtualized Hadoop solution using many different options from various vendors, and each approach will have its own set of pros and cons. What is refreshing is that we no longer live in a world where a vendor can say, "Any customer can have a car painted any colour that he wants so long as it is black." If anything, the pendulum has swung the other way, into a world of too many choices.
Fortune 500 companies using Hadoop
When we think of Wall Street we typically think of companies like Morgan Stanley, JPMC, UBS and others. Ever wondered how they process big data and what tools they use to do so?
Morgan Stanley: When faced with the problem of identifying the triggers that initiated market events, they looked at vast quantities of web and database logs and decided to put the log data into Hadoop and run time-based correlations on it. The resulting system provides a way to go back in time and identify which application or user initiated the transaction that culminated in the market event of interest.
JPMC: As a financial services company with over 150 PB of online data, 30,000 databases and 3.5 billion user-account logins, JPMC was faced with the tasks of reducing fraud, managing IT risk and mining data for customer insights. To do this they turned to Hadoop, which gave them a single platform to store all the data, making it easier to query for insights.
If we consider the world of Web 2.0 companies, auction house eBay and travel site Orbitz may come to mind.
eBay: As a massive online marketplace, eBay has over 97 million active buyers and sellers and over 200 million items for sale across more than 50,000 categories, which translates into 10 TB or more of incoming data per day. When tasked with solving real-time problems by crunching predictive models, they turned to Hadoop and built a 500-node Hadoop cluster using Sun servers running Linux. As time went by, a need arose for a better real-time search engine for the auction site; it is now being built using Hadoop and HBase (for more details, search for project "Cassini").
Orbitz: Orbitz, the online travel vendor, needed to determine metrics like "how long does it take for a user to download a page?" When Orbitz developers needed to understand why production systems had issues, they had to mine huge volumes of production log data. The solution they implemented uses a combination of Hadoop and Hive to process weblogs, which are then further processed using scripts written in R (an open source statistical package that supports visualization) to derive useful metrics related to hotel bookings and user ratings. Hadoop ended up complementing their existing data warehouse systems instead of replacing them.
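To make the weblog crunching concrete, here is a minimal sketch of a Hadoop Streaming job in Python that computes the average page download time per URL. The log layout (URL in the seventh tab-separated field, response time in milliseconds in the eighth) is an assumption; adjust the field indices to match your own logs.

```python
# mapper.py -- emit (url, response_time_ms) pairs from weblog lines.
# Assumes the URL is the 7th and the response time (ms) the 8th tab-separated field.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 8:
        url, millis = fields[6], fields[7]
        if millis.isdigit():
            print(f"{url}\t{millis}")
```

```python
# reducer.py -- average the response times per URL.
# Hadoop Streaming delivers the mapper output sorted by key (the URL).
import sys

current_url, total, count = None, 0, 0
for line in sys.stdin:
    url, millis = line.rstrip("\n").split("\t")
    if current_url is not None and url != current_url:
        print(f"{current_url}\t{total / count:.1f}")
        total, count = 0, 0
    current_url = url
    total += int(millis)
    count += 1
if count:
    print(f"{current_url}\t{total / count:.1f}")
```

The two scripts would be wired together with hadoop-streaming as a job on the cluster, and the averaged results could then be handed off to R for visualization, much as Orbitz describes.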
In the brick-and-mortar world of retailers there are companies like Walmart, Kmart, Target and Sears. Consider how Sears uses Hadoop for big data:
Sears: When faced with the need to evaluate the results of various marketing campaigns, among other needs, Sears appears to have entered the Hadoop world in a big way, with a 300-node Hadoop cluster storing and processing 2 PB of data. More recently they have begun using Hadoop to set pricing based on variables like the availability of a product in a store, what a competitor charges for a similar product and the economic conditions in that area. In addition, their big data system allows them to send customized coupons to consumers by location; for instance, if you are in New York and hit by Hurricane Sandy, it is useful to receive Sears coupons for generators, bleach and other survival supplies. In an industry dominated by e-tailers like Amazon, the ability to set and change prices dynamically and to court loyalty-program members with customized offers are some of the many ways that brick-and-mortar firms like Sears are trying to stay relevant.
In conclusion, the takeaways would be that Hadoop complements your existing data warehouse and provides an open source, scalable way to store petabytes of log data, but you still need to build a solution tailored to your specific needs, whether that is sentiment analysis, customer satisfaction metrics or judging the effectiveness of your marketing campaigns. Hadoop is a tool, but like any good tool it is not a panacea, nor can it be used in isolation.
Hadoop as a cloud service
When tasked with crafting a company-wide strategy on whether to process big data in-house or outsource it to the public cloud, you might rightly ask, "Whose cloud are we talking about?"
It struck me that every major vendor whether it be Amazon, Microsoft, Google, IBM or Rackspace appears to have (or plans to have) a Hadoop/big-data offering in the cloud.
Take Amazon for instance. Amazon offers Elastic MapReduce as a cloud service that runs Apache (or MapR) Hadoop on the EC2 compute platform with S3 storage. This is interesting to companies who might already be storing corporate data in S3: they can now pull this data into the Hadoop Distributed File System (HDFS) running on Elastic Compute Cloud (EC2) instances to do their processing with MapReduce. However, if you don't use Amazon Simple Storage Service (S3) today, you might wonder what other alternatives exist.
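As a sketch of what "pulling S3 data into HDFS" looks like in practice, the snippet below submits an s3-dist-cp step to an already running EMR cluster using boto3; the cluster ID and bucket paths are hypothetical placeholders.

```python
# Sketch: copy data from S3 into the cluster's HDFS by submitting an s3-dist-cp step.
# The cluster ID and bucket paths are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # an already running EMR cluster
    Steps=[{
        "Name": "copy-s3-to-hdfs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src", "s3://my-bucket/corporate-data/",
                     "--dest", "hdfs:///data/corporate/"],
        },
    }],
)
```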
Microsoft appears ready to roll out a service called HDInsight, which runs Hortonworks Hadoop on the Azure platform with storage like Azure blob storage underneath. While Microsoft has been successful in landing big-name wins like the US Environmental Protection Agency (EPA), Toyota and California's Santa Clara County for its Office 365 service, it is not clear how many customers will want to run Hadoop in Azure. However, the Hive ODBC driver from Microsoft and Hortonworks opens up a world of possibilities for Excel and PowerPivot users, who can use these familiar tools to query data within Hadoop running in Azure.
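The same ODBC path that feeds Excel and PowerPivot can also be scripted. Below is a minimal sketch that queries Hive through the Hive ODBC driver from Python using pyodbc, assuming a system DSN named AzureHive has already been configured for the driver; the DSN, credentials and table name are hypothetical placeholders.

```python
# Sketch: query Hive in HDInsight over the Hive ODBC driver from Python.
# Assumes a system DSN named "AzureHive" is configured for the driver;
# the DSN, credentials and table name are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=AzureHive;UID=admin;PWD=secret", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page LIMIT 10")
for page, hits in cursor.fetchall():
    print(page, hits)
```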
Smaller providers like Rackspace talk of rolling out a Hadoop service on OpenStack using their existing EMC and NetApp storage. Rackspace would use its support offerings as a way to differentiate from other cloud providers.
I was surprised to notice that IBM has "up-leveled" the conversation to typical use cases. Rather than focus on how IBM BigInsights runs on IBM SmartCloud on x86 servers running Linux, with IBM SONAS and Storwize V7000 technology powering all of this, they have chosen to hide most of this tech trivia and instead focus on use cases. One example they provide is a telecom service provider using IBM BigInsights to identify fraud and prevent customer churn. The fact that this involves using IBM's templates for InfoSphere Streams, an add-on to the BigInsights service, isn't of interest to customers. What a CIO at a telecom provider wants to know is, "How does this solution in the cloud help my company's bottom line by reducing customer churn?" I think the year ahead promises to be an interesting one for enterprise customers as cloud providers refine their Hadoop-as-a-cloud-service offerings and make it easier for customers to derive meaningful insights from this big data in the cloud.