Data Lakes & Hadoop analytics
The term “data lake” was coined in the Hadoop context, where it referred to the ability to consolidate structured and semi-structured data from different silos (say CRM, ERP and supply chain systems) into Hadoop in its native format. You didn’t need to worry about schema, structure or other data requirements until the data needed to be queried or processed. For a typical eCommerce site, the data in the lake might be transactional data, data from marketing campaigns and clues from the online behavior of consumers. The goal of the site might be to analyze all this data and send out targeted coupons and other promotions to influence prospective buyers.
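To make the “worry about schema later” idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code: a temporary local directory stands in for the lake, and the file names, field names and the `coupon_candidates` rule are all made up for illustration. The point is only that the extracts land untouched, and each reader imposes its own structure at query time.

```python
import csv
import json
import tempfile
from pathlib import Path


def land(lake, name, raw_text):
    """Copy a source extract into the lake as-is; no schema is checked here."""
    (lake / name).write_text(raw_text, encoding="utf-8")


def read_orders(lake):
    """Apply a schema only now, at read time, to the (hypothetical) ERP CSV extract."""
    with open(lake / "erp_orders.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {"customer": row["customer_id"], "amount": float(row["total"])}


def read_clicks(lake):
    """Apply a different schema to the (hypothetical) JSON-lines clickstream extract."""
    with open(lake / "web_clicks.jsonl", encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            # Missing fields are tolerated at read time instead of rejected at load time.
            yield {"customer": event["user"], "page": event.get("page", "")}


def coupon_candidates(lake, min_spend):
    """Customers who viewed a product page and spent at least min_spend in total."""
    spend = {}
    for order in read_orders(lake):
        spend[order["customer"]] = spend.get(order["customer"], 0.0) + order["amount"]
    browsers = {c["customer"] for c in read_clicks(lake) if c["page"].startswith("/product")}
    return sorted(c for c in browsers if spend.get(c, 0.0) >= min_spend)


def demo():
    """Land two sample extracts in a throwaway lake and run one query over them."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        land(lake, "erp_orders.csv", "customer_id,total\nc1,120.50\nc2,15.00\nc1,30.00\n")
        land(
            lake,
            "web_clicks.jsonl",
            '{"user": "c1", "page": "/product/42"}\n'
            '{"user": "c2", "page": "/product/7"}\n'
            '{"user": "c3"}\n',
        )
        return coupon_candidates(lake, min_spend=100.0)
```

With the sample data above, `demo()` returns `['c1']`: customer c1 browsed a product page and spent 150.50 in total, c2 browsed but spent too little, and c3 has a click record with no page at all, which the reader simply tolerates.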
Later, software companies like Pivotal appropriated the term. EMC, the storage vendor behind Pivotal, came up with its own marketing spin on a data lake, which involved EMC ViPR with Isilon storage on the back-end. Not to be outdone, HDS acquired Pentaho and could make the claim that Pentaho actually coined the term “data lake”. Microsoft marketing uses the term “Azure data lake” to refer to a central repository of data where data scientists can use their favorite tools to derive insights from the data. The analyst firm Gartner cautioned that the lack of oversight (“governance” if you want to use big words) of what goes into the data lake could result in a “data swamp”. Hortonworks (the company selling services around Apache Hadoop) counters with the argument that technologies like the Apache Knox gateway (a security gateway available for Hadoop) make it possible to democratize access to corporate data in the data lake while maintaining compliance with corporate security policies.
Who actually uses a data lake today, besides Google and Facebook? I’d be curious to know. In the interim, deriving insights via Hadoop analytics on data wherever it resides (whether on a NetApp FAS system or on some other networked storage) may be the right first step. I’d welcome input from readers who use data lakes today to solve business-related problems.