
Are ETL vendors still relevant now that Hadoop is here?

April 11, 2013

ETL is another three-letter acronym; it stands for Extract, Transform and Load.  ETL refers to extracting data from diverse sources (such as relational databases and flat files), transforming it (sorting, joining, synchronizing, cleaning) and then loading it into a data warehouse.  This workflow assumes you know beforehand the kinds of repetitive reports you will run on the data in the warehouse.
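To make the three steps concrete, here is a minimal sketch in Python using only the standard library.  Everything in it is an assumption made for illustration: the sample CSV, the customers table, the sales_by_customer warehouse table and the function names are hypothetical, and two in-memory SQLite databases stand in for a real source system and a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source: order lines with stray whitespace and one bad value.
ORDERS_CSV = """order_id,customer_id,amount
1, 10, 25.50
2, 11, 40.00
3, 10, not-a-number
4, 12, 12.25
"""


def extract_orders(text):
    """Extract: read raw rows from a flat file (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_customers(conn):
    """Extract: read id -> name pairs from a relational source."""
    rows = conn.execute("SELECT id, name FROM customers").fetchall()
    return {customer_id: name for customer_id, name in rows}


def transform(orders, customers):
    """Transform: clean values, join orders to customers, sort by total."""
    totals = {}
    for row in orders:
        try:
            customer_id = int(row["customer_id"].strip())
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # cleaning: drop rows that fail to parse
        name = customers.get(customer_id)
        if name is None:
            continue  # joining: skip orders with no matching customer
        totals[name] = totals.get(name, 0.0) + amount
    # sorting: largest total first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_customer (name TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO sales_by_customer VALUES (?, ?)", rows)
    conn.commit()


def run_etl():
    """Wire the three steps together against in-memory stand-ins."""
    source = sqlite3.connect(":memory:")  # stand-in for an operational DBMS
    source.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    source.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(10, "Acme"), (11, "Globex")]
    )
    warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse
    load(warehouse, transform(extract_orders(ORDERS_CSV), extract_customers(source)))
    return warehouse.execute(
        "SELECT name, total FROM sales_by_customer ORDER BY total DESC"
    ).fetchall()
```

Commercial and open-source ETL tools wrap the same pattern in connectors, scheduling and error handling; the sketch only shows the shape of the workflow.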

Companies like Informatica, IBM, Oracle, SAS and Business Objects have built profitable businesses providing enterprise-class ETL for huge volumes of data in heterogeneous environments.  Now that open-source Hadoop on commodity x86 server hardware can import data from ERP systems, DBMSs and XML files and, after some transformations, export it into a regular data warehouse, one is tempted to ask: is ETL still relevant now that Hadoop performs much of the ETL function?

To answer this question, step back and ask yourself:

Did Hadoop replace my traditional data warehouse?

If your goal is to determine sales trends for your company, you are more likely to run queries on corporate data in the data warehouse.  If you are the marketing department of a company tasked with determining the ROI of a recent email campaign, you are more likely to query data residing in a data warehouse like Netezza than to direct your queries at the Hadoop cluster.  So just as Hadoop didn’t replace your data warehouse, it’s not likely to replace your ETL tools either.

Data warehouses and ETL continue to exist; what changes is where and how you actually implement them.  For instance, Amazon is offering “Redshift”, a data warehouse in the public cloud, at roughly $1,000 per TB.  This takes away the need to spend $19,000 to $25,000 per TB to store the same data in a traditional data warehouse on big iron within your private data center.  If you decide to go with Redshift, you still need an ETL tool to get all of your on-premise data into Redshift in the public cloud.
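The size of that gap is easy to work out.  The sketch below uses only the per-TB figures quoted above; the 10 TB data volume and the function name are hypothetical, chosen purely to illustrate the arithmetic.

```python
# Per-TB figures as quoted in this post (US dollars).
CLOUD_PER_TB = 1_000
ON_PREMISE_PER_TB_LOW = 19_000
ON_PREMISE_PER_TB_HIGH = 25_000


def compare_storage_cost(terabytes):
    """Return (cloud, on_premise_low, on_premise_high) cost for a data volume."""
    return (
        terabytes * CLOUD_PER_TB,
        terabytes * ON_PREMISE_PER_TB_LOW,
        terabytes * ON_PREMISE_PER_TB_HIGH,
    )


# Hypothetical 10 TB warehouse, used only as an example size.
cloud, low, high = compare_storage_cost(10)
print(f"cloud: ${cloud:,}  on-premise: ${low:,} to ${high:,}")
print(f"on-premise is {low // cloud}x to {high // cloud}x the cloud figure")
```

At the quoted prices the traditional warehouse costs 19 to 25 times as much per TB, whatever the data volume; none of that figure covers the ETL work of moving the data.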

As an enterprise, you may want to merge your customer data currently residing on with your in-house financial systems.  There again, you’d need ETL tools, possibly from vendors like SnapLogic.

The demarcation lines between what data is stored in-house versus in the cloud are getting blurrier every day.  In this changing world the need for ETL continues to exist; what may change is who ends up providing the ETL functionality.  A few years ago it was only Informatica or IBM, with high upfront costs in training, consulting and acquisition.  Today it might be open-source tools like Pentaho Kettle or Talend, or commercial tools from SnapLogic, Syncsort and other vendors with more innovative approaches and lower acquisition costs.

Categories: Big Data and Hadoop