Amazon Elastic MapReduce and private storage

November 29, 2012

A topic that caught my eye recently was the announcement that NetApp customers using AWS can now buy NetApp storage and place it in Amazon Direct Connect co-location sites, so that services like Amazon Elastic MapReduce (EMR) can access that storage more efficiently. To my mind this seemed like an apples-to-oranges comparison.

I can see the first benefit: if you are a NetApp shop, your in-house NetApp storage can back up all its volume Snapshots to a SnapVault on a remote array and mirror all its local volumes to a SnapMirror target, both located in Amazon's co-location facility. This gives you a large vault for your backups in addition to providing a DR site next to the public cloud. Another potential use case cited is to use your co-located enterprise storage array as storage for Elastic MapReduce. I must confess that the rationale behind this use case eludes me.

Let us step back and look at how customers typically use EMR. Airbnb, the online travel site, needed to analyze reams of data to answer questions like:

  • If a user stayed in a property listed on Airbnb in a certain month, did friends of that user on Facebook also patronize the same property?
  • If listings in a certain area see low patronage, is it because users don't want to visit them or because users never found those specific listings?

Airbnb found that MySQL didn't cut it for this type of analysis and eventually moved to Amazon EMR. Like Airbnb, you, the CIO of an enterprise, might have done extensive research and bought into the value of using Hadoop and MapReduce for your big data crunching. However, you decided against the CapEx of buying your own servers and local storage, along with the associated power and cooling costs, and chose to run Hadoop in the public cloud, specifically on AWS. Eventually you got tired of managing the virtual Hadoop cluster in AWS yourself and moved to a managed service like Elastic MapReduce. Amazon's three-step process for EMR might have helped clinch the deal:

  • Upload your application data into an S3 bucket
  • Create a job flow using EMR
  • Get results from the S3 bucket
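The three steps above can be sketched in Python. This is a minimal illustration, not Amazon's own sample: it assumes clients from the boto3 SDK (which is newer than this post), a Hadoop streaming job whose `mapper.py` has already been uploaded to `code/mapper.py`, and hypothetical bucket keys, instance types, release label and role names. The clients are passed in as arguments so the request can be inspected without AWS credentials.

```python
"""Sketch of the three EMR steps: upload input, run a job flow, fetch results."""


def build_job_flow_request(bucket, log_prefix="logs/"):
    """Return keyword arguments for the EMR RunJobFlow API call.

    Every value below (release label, instance types, roles, key layout)
    is an illustrative assumption, not a recommendation.
    """
    base = f"s3://{bucket}"
    return {
        "Name": "example-job-flow",
        "ReleaseLabel": "emr-6.15.0",
        "LogUri": f"{base}/{log_prefix}",
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # Release the cluster as soon as the steps finish.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "streaming-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "hadoop-streaming",
                    "-files", f"{base}/code/mapper.py",
                    "-mapper", "mapper.py",
                    "-reducer", "aggregate",
                    "-input", f"{base}/input/",
                    "-output", f"{base}/output/",
                ],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def run_pipeline(s3, emr, bucket, local_input, local_result):
    """Run the three steps with the supplied boto3-style clients."""
    # 1. Upload application data into an S3 bucket.
    s3.upload_file(local_input, bucket, "input/data.txt")
    # 2. Create a job flow and wait for the cluster to terminate.
    cluster_id = emr.run_job_flow(**build_job_flow_request(bucket))["JobFlowId"]
    emr.get_waiter("cluster_terminated").wait(ClusterId=cluster_id)
    # 3. Get results from the S3 bucket (output file name is an assumption).
    s3.download_file(bucket, "output/part-00000", local_result)
    return cluster_id
```

With boto3 installed and credentials configured, a call would look like `run_pipeline(boto3.client("s3"), boto3.client("emr"), "my-bucket", "data.txt", "result.txt")`. Note that the cluster is created for the job and released afterwards, which is the usage pattern the rest of this post relies on.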

In parallel, you might have moved from a traditional Oracle RDBMS to NoSQL and eventually to DynamoDB, Amazon's hosted NoSQL database service. Perhaps the underlying SSD-based storage and the $1 per GB cost was the siren song from Amazon that drew you to this decision. Now that you use both EMR and DynamoDB, you have the option to archive DynamoDB tables to Amazon S3 as CSV files. If you change your mind, you can re-import this data into DynamoDB from S3, and you can do useful things like link live tables in DynamoDB with archived tables in S3.
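To make the CSV archive format concrete, here is a small sketch of flattening table items into CSV text before they are written to S3. It only illustrates the row format and does not describe the export mechanism Amazon provides. The function name, the column list and the sample items are hypothetical, and items are assumed to have already been read into plain Python dictionaries.

```python
import csv
import io


def items_to_csv(items, columns):
    """Serialize DynamoDB-style items (plain dicts) to CSV text.

    Attributes missing from an item become empty cells, since items in
    one DynamoDB table need not share the same attributes. Attributes
    not listed in `columns` are dropped.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        restval="",
        extrasaction="ignore",
        lineterminator="\n",
    )
    writer.writeheader()
    for item in items:
        writer.writerow(item)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [{"id": "1", "city": "Paris"}, {"id": "2"}]
    print(items_to_csv(sample, ["id", "city"]))
```

The resulting text could then be uploaded as an S3 object, and the same column list would be needed again to re-import the archive.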

One reason customers use EMR in the first place is that they like the idea of using as many processing and storage resources as needed and relinquishing them when the job is done. In view of this you might ask, as I did: is there value in having Amazon Elastic MapReduce use co-located private storage?

One obvious advantage of placing your private storage in an Amazon Direct Connect site is that you get a dedicated circuit (1 GbE or 10 GbE) from the Direct Connect site to the AWS cloud instead of going over the internet. This equates to more predictable data transfers and lower costs, since you will pay Amazon a lower rate for the Direct Connect circuit than you would pay if you connected from your enterprise data center to AWS over the internet.
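For a rough sense of what the two circuit sizes mean, the back-of-the-envelope calculation below estimates the best-case time to move a data set across the link. It assumes the full line rate is achieved with no protocol overhead or contention, and the 10 TB data set size is a made-up example, so real transfers will take longer.

```python
def transfer_hours(data_bytes, link_gbps):
    """Ideal transfer time in hours at full line rate (no overhead)."""
    bits = data_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 3600


TEN_TB = 10 * 10**12  # hypothetical 10 TB data set, decimal units

for gbps in (1, 10):
    print(f"{gbps} Gbps: {transfer_hours(TEN_TB, gbps):.1f} h")
# 1 Gbps: 22.2 h
# 10 Gbps: 2.2 h
```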

Considering that your enterprise private storage in the co-location site is far from "cheap and deep" archive storage, the value of having EMR use your expensive enterprise storage as persistent storage eludes me, all the more so since there are ways to use EBS volumes on commodity storage as persistent storage for EMR. I welcome clarifications and counterpoints from EMR users more experienced with this topic than I am. Thanks for reading this far!

  1. November 30, 2012 at 11:24 pm

    I remember reading about NetApp's appliance, which aims to reduce network traffic through some sort of RAID configuration, but EMR and NetApp beats me too.
