
Introduction to big data, Hadoop, SIEM

October 18, 2012

What is “big data”?

Like the story of the blind men and the elephant, everyone has a different perception of what constitutes “big data”. Strangely enough, they could all be right!

  • An IT manager at a corporation in the oil & gas space would say that 4-terabyte 3-dimensional seismic files constitute big data.
  • A utility company might consider data from millions of smart meters to be big data.
  • An airplane engine manufacturer might say that the terabytes of data generated in one transatlantic flight by sensors in the airplane engine constitute big data.
  • A Fortune 500 company’s IT department might consider the log files generated by web servers, firewalls, routers, and badge-access systems to be big data.
  • A casino in Las Vegas or Macau would consider HD surveillance video files to be big data.
  • A hospital might consider data generated by patient sensors to be big data.
  • A company interested in determining customer sentiment via sentiment analysis might view petabytes of Twitter data as big data.

Regardless of the industry, the big data in question has monetary value to the company. It either represents a valuable asset of the company or a way to prevent loss of other valuable assets. “Big data” doesn’t have to equate to “big files”: big data could comprise small files generated so rapidly that they would saturate traditional data stores like relational databases. Hence a new way is needed to collect, analyze, and act upon this big data.

What is Hadoop and where does Hadoop fit with “big data”?

Hadoop is an open-source framework that allows parallel processing of large amounts of data stored on clusters of commodity servers with locally attached disk. In this sense Hadoop is anathema to the big storage vendors, who built billion-dollar businesses selling the idea that storage has to be centralized and consolidated, and has to be NAS- or SAN-attached so that it may be backed up and replicated. Hadoop runs counter to that argument by relying on direct-attached storage available on commodity x86 servers.
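The parallel-processing model at the heart of Hadoop is MapReduce. The following is a minimal sketch in plain Python that simulates the map, shuffle, and reduce phases on a single machine for illustration; real Hadoop distributes these phases across the cluster and reads input from blocks stored on each node’s locally attached disks:

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in an input split."""
    for line in split:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Two "splits" stand in for file blocks that separate mappers
# on separate servers would process in parallel.
splits = [["big data on commodity servers"],
          ["big data big clusters"]]

mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(mapped))
print(counts["big"], counts["data"])  # 3 2
```

Because each mapper works only on the data stored on its own server, adding servers adds both storage and processing capacity, which is the scale-out property discussed below.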

My database guy tells me that the RDBMS is sufficient, we don’t need Hadoop!

As long as the data is in the gigabytes range and is structured, an RDBMS performs exceptionally well. However, if your organization needs to conform to the Federal Rules of Civil Procedure (FRCP), you may want to archive email and email attachments going back several years. Now you enter the realm of unstructured or semi-structured data, for which you might find an RDBMS unsuitable. This is where Hadoop comes in. Being open source, it has a lower acquisition cost (I won’t say no acquisition cost, as you still need to budget for servers, direct-attached storage, and in-house Hadoop expertise), and it gives you the ability to scale out, as opposed to just scaling up, which is all an RDBMS can do.

My IT forensics guy tells me that the existing Security Information & Event Management (SIEM) system is enough and we don’t need Hadoop!

The short answer is that you need both. If your goal is to comply with regulations like PCI or the Sarbanes-Oxley Act (SOX), and your requirement is to collect, search (in a structured manner), and analyze logs from the servers, firewalls, and routers in your network, then a SIEM is the right tool to use. But since a SIEM uses a relational database on the back end, you don’t want to use it to store petabytes of data.

However, if your goal is to store a year’s worth or more of email and email attachments, badge-access logs, and surveillance videos, then you can’t realistically use a SIEM. This is where Hadoop shines. Though Hadoop works well with all types of files (structured and unstructured), it is really designed to handle large unstructured files.