
Big Data Warehouse in the cloud

November 28, 2012

To quote Bob Dylan, “The Times They Are a-Changin’”…

Once there was a predictable world of big data warehouse vendors (IBM, Oracle, Teradata) and business intelligence (BI) vendors (Oracle, Microsoft, IBM, SAP). This world involved million-dollar data warehouse systems, and customers had few alternatives.

The first shocks to the big data warehouse vendors came from the emergence of Hadoop and MapReduce. The old-style data warehouse vendors countered by saying “Hadoop is not big data.” They grudgingly accepted Hadoop as suitable for unstructured data, for extract-transform-load (where ETL processes run in parallel across a Hadoop cluster rather than pulling data from a SAN into dedicated ETL servers) and for staging, but insisted that their big data warehouses for structured data weren’t going anywhere. Hadoop, they argued, may help you create predictive models, but you still needed to load those models into the data warehouse to gain meaningful insights. For instance, if you are an IBM shop you may recall that, upon noting the inevitable rise of Hadoop at enterprise accounts, IBM introduced BigInsights, its own packaging of Hadoop. To ensure that customers stayed with Cognos, IBM introduced Hive connectors that could reach the big data stored in BigInsights.
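The contrast between the two ETL styles can be sketched in a few lines. This is a toy illustration of the map-style parallelism behind Hadoop ETL: each worker cleans its own slice of the records instead of funneling everything through a single dedicated ETL server. On a real cluster the workers run on the nodes that hold the data; here threads merely stand in for them, and the record format and transform are hypothetical.

```python
# Toy sketch: parallel ETL as a "map" over record slices.
# In Hadoop the map tasks run next to the data on cluster nodes;
# threads stand in for those workers here.
from concurrent.futures import ThreadPoolExecutor

def transform(record):
    """One ETL step: normalize a raw 'user, amount' line into a tuple."""
    user, raw_amount = record.split(",")
    return user.strip().lower(), float(raw_amount)

raw_records = ["Alice, 10.5", "BOB, 3.25", "alice, 1.0"]

# Each worker transforms its own records independently -- no single
# ETL server sits in the middle of the data flow.
with ThreadPoolExecutor(max_workers=4) as pool:
    cleaned = list(pool.map(transform, raw_records))

print(cleaned)  # [('alice', 10.5), ('bob', 3.25), ('alice', 1.0)]
```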

Amazon disrupted the big data warehouse world with the announcement of RedShift, a data warehouse service in the cloud. Amazon is taking the established data warehouse vendors head-on but leaving the world of BI vendors intact by partnering (for now) with the likes of MicroStrategy, Jaspersoft and IBM Cognos. Is Amazon alone in doing this? No. The startup TreasureData also offers a data warehouse in the cloud, and it too partners with existing vendors like Heroku, JasperSoft and Engine Yard. However, the size and scope of Amazon’s entry into this market make it a game changer.

If you are an enterprise CIO, what does all this mean for you? You now have the option to move away from paying $19,000 to $25,000 per terabyte for legacy data warehouse products in your data center and instead pay Amazon roughly $1,000 per terabyte over a 3-year period. What makes Amazon so sure it has mastered the dark art of building data warehouses? For one, RedShift is powered by ParAccel technology under the covers. Amazon’s value proposition to you: use familiar SQL tools to query your big data.
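The arithmetic behind that switch is simple enough to sketch. The per-terabyte figures below are the ones quoted above; the 100 TB warehouse size is a hypothetical example chosen for illustration.

```python
# Back-of-the-envelope cost comparison using the figures quoted above.
# The 100 TB warehouse size is a hypothetical example.
LEGACY_LOW, LEGACY_HIGH = 19_000, 25_000  # $ per TB, legacy on-premise
REDSHIFT_3YR = 1_000                      # $ per TB over 3 years, RedShift

tb = 100  # hypothetical warehouse size

legacy_cost = (LEGACY_LOW * tb, LEGACY_HIGH * tb)
redshift_cost = REDSHIFT_3YR * tb

print(f"Legacy, 100 TB: ${legacy_cost[0]:,} to ${legacy_cost[1]:,}")
print(f"RedShift, 100 TB over 3 years: ${redshift_cost:,}")
```

At this scale the quoted prices put the legacy bill around two million dollars against roughly a hundred thousand on RedShift, which is the gap driving the disruption described above.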

The old world of legacy BI vendors woke up one morning to find startups like Pentaho, JasperSoft and Tableau grazing in their customer patch. Startups like Platfora went a step further by introducing a BI engine for Hadoop. Platfora boldly advocates that traditional data warehouses and BI tools are things of the past. Their approach involves querying the data in Hadoop (whether Hortonworks or Cloudera) and pulling it in-memory so it can be accessed faster. Is Platfora alone in advocating in-memory technology? No: the software giant SAP has been promoting HANA, and Oracle has had its TimesTen product for a while. SAP HANA is an in-memory big data engine that SAP promotes through OEM relationships with various server vendors, including Cisco’s UCS line, and that uses solid-state memory to achieve low latencies. SAP also offers HANA as a cloud offering in the Amazon cloud, calling it SAP HANA One. Unlike the appliance-based HANA, this version in the Amazon cloud is aimed at small enterprises and limits you to 60 GB of RAM per instance.
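The appeal of pulling data in-memory can be shown with a toy example. Once the working set sits in RAM, an aggregate query is a single pass over an in-memory structure with no disk seeks, which is the core of the latency advantage these engines claim. The sales data and column names below are made up for illustration.

```python
# Toy illustration of an in-memory aggregate query: once rows are in
# RAM, a group-by is one pass over a dict, with no disk I/O per row.
# The data and column names are hypothetical.
from collections import defaultdict

rows = [
    {"region": "east", "sales": 120.0},
    {"region": "west", "sales": 80.0},
    {"region": "east", "sales": 40.0},
]

# SELECT region, SUM(sales) GROUP BY region -- as one in-memory pass.
totals = defaultdict(float)
for row in rows:
    totals[row["region"]] += row["sales"]

print(dict(totals))  # {'east': 160.0, 'west': 80.0}
```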

What if you like the idea of in-memory databases like HANA but cringe at the licensing costs? One option is to look into open source alternatives like Druid, created by the vendor MetaMarkets. Druid claims to provide real-time analytics using an in-memory database and, best of all, it is open source. What if you like Druid but don’t want the bother of writing your own analytics? MetaMarkets offers a hosted analytics service with Druid under the covers.

The takeaway here is that you don’t have to be resigned to the old world of pricey data warehouses and legacy business intelligence tools.  You have an array of more attractive options without the associated vendor lock-in.  You may have Hadoop and MapReduce to thank for opening up this brave new world.

  1. K
     November 29, 2012 at 8:12 am

     Nice post Ravi, maybe a typo but RedShift is not $1000/gb/hr.

     • November 29, 2012 at 10:50 am

       Thanks for reading and for the catch. I corrected the post to reflect the $1000 per TB that RedShift actually charges with a 3-year commitment.

