
Introduction to big data, Hadoop, SIEM

What is “big data”?

Like the story of the blind men and the elephant, everyone has a different perception of what constitutes “big data”. Strangely enough, they could all be right!

  • An IT manager at a corporation in the oil & gas space would say that 4-terabyte three-dimensional seismic files constitute big data.
  • A utility company might consider data from millions of smart meters to be big data.
  • An airplane engine manufacturer might say that the terabytes of data generated in one transatlantic flight by sensors in the airplane engine constitutes big data.
  • A fortune 500 company IT department might consider log files generated by web servers, firewalls, routers, badge-access systems to be big data.
  • A casino in Las Vegas or Macau would consider HD surveillance video files to be big data.
  • A hospital might consider data generated by patient sensors to be big data.
  • A company interested in gauging customer sentiment might view petabytes of Twitter data as big data.

Regardless of the industry, the big data in question has monetary value to the company: it either represents a valuable asset or a way to prevent the loss of other valuable assets. “Big data” doesn’t have to equate to “big files”. Big data can also comprise small files generated so rapidly that they would saturate traditional data stores such as relational databases. Hence a new way is needed to collect, analyze and act upon this data.

What is Hadoop and where does Hadoop fit with “big data”?

Hadoop is an open-source framework that allows parallel processing of large amounts of data stored on clusters of commodity servers with locally attached disk. In this sense Hadoop is anathema to the big storage vendors who built billion-dollar businesses selling the idea that storage has to be centralized and consolidated, and NAS- or SAN-attached so that it may be backed up and replicated. Hadoop runs counter to that argument by relying on direct-attached storage available on commodity x86 servers.
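The parallel-processing model behind Hadoop is MapReduce: a map phase runs on each node against its local data, a shuffle groups the intermediate results by key, and a reduce phase aggregates them. Here is a minimal sketch of that model in plain Python, simulating all three phases in one process on the canonical word-count example (a real job would run via Hadoop Streaming or the Java API, with the map tasks spread across the cluster):

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return (word, sum(counts))

def run_job(lines):
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    shuffled = defaultdict(list)
    for line in lines:  # on a cluster, each node maps its local file blocks
        for word, count in map_phase(line):
            shuffled[word].append(count)
    return dict(reduce_phase(w, c) for w, c in shuffled.items())

logs = ["error disk full", "error network timeout", "info disk ok"]
print(run_job(logs))  # 'error' and 'disk' each counted twice
```

The point of the model is that map tasks need no coordination, so adding commodity nodes adds throughput almost linearly.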

My database guy tells me that the RDBMS is sufficient, we don’t need Hadoop!

As long as the data is in the gigabyte range and structured, an RDBMS performs exceptionally well. However, if your organization needs to conform to the Federal Rules of Civil Procedure (FRCP), you may want to archive email and email attachments going back several years. Now you enter the realm of unstructured or semi-structured data, for which you might find an RDBMS unsuitable. This is where Hadoop comes in. Being open source, it carries a lower acquisition cost (not zero cost, as you still need to budget for servers, direct-attached storage and in-house Hadoop expertise), and it gives you the ability to scale out, as opposed to just scaling up, which is what an RDBMS does.
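The scale-out point can be made concrete with a toy partitioner: instead of buying a bigger server, records are hash-partitioned across however many commodity nodes you have, and adding capacity means adding nodes. This is only a sketch of the idea; HDFS actually distributes fixed-size blocks with replication rather than hashing individual records:

```python
def partition(records, num_nodes):
    """Toy scale-out: assign each (key, value) record to a node by key hash."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in records:
        nodes[hash(key) % num_nodes].append((key, value))
    return nodes

# 1000 archived emails; payloads are placeholders for this sketch
records = [("email-%d" % i, b"payload") for i in range(1000)]

one_node = partition(records, 1)    # "scale-up": one big node holds everything
five_nodes = partition(records, 5)  # "scale-out": add nodes, data spreads out

assert sum(len(n) for n in five_nodes) == len(records)  # nothing is lost
```

Scaling up eventually hits the limits of a single chassis; scaling out only hits the limits of your data center floor.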

My IT forensics guy tells me that the existing Security Information & Event Management (SIEM) system is enough and we don’t need Hadoop!

The short answer is that you need both. If your goal is to comply with regulations like PCI and the Sarbanes-Oxley Act (SOX), and your requirement is to collect, search (in a structured manner) and analyze logs from the servers, firewalls and routers in your network, then a SIEM is the right tool to use. But since a SIEM typically uses a back-end relational database, you don’t want to use it to store petabytes of data.

However, if your goal is to store a year or more of email and email attachments, badge-access logs, or surveillance videos, then you can’t realistically use a SIEM. This is where Hadoop shines. Though Hadoop works well with all types of files (structured and unstructured), it is really designed to handle large unstructured files.
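As a concrete contrast with the structured queries a SIEM runs, here is a sketch of the kind of free-form scan Hadoop handles well: sweeping a year of semi-structured badge-access logs for denied entries and counting them per badge ID. The log line format here is hypothetical, and in practice this logic would run as a MapReduce or Hive job over files in HDFS rather than in a single Python process:

```python
from collections import Counter

def denied_per_badge(lines):
    """Count DENIED badge-access events per badge ID from raw log lines."""
    counts = Counter()
    for line in lines:
        # hypothetical line format: "DATE TIME DOOR-ID BADGE-ID RESULT"
        parts = line.split()
        if len(parts) == 5 and parts[4] == "DENIED":
            counts[parts[3]] += 1
    return counts

logs = [
    "2012-10-29 10:35 DOOR-12 BADGE-0042 DENIED",
    "2012-10-29 10:36 DOOR-12 BADGE-0042 DENIED",
    "2012-10-29 10:37 DOOR-01 BADGE-0007 GRANTED",
]
print(denied_per_badge(logs))  # Counter({'BADGE-0042': 2})
```

Because the map side is just "parse a line, maybe emit a pair", the same job runs unchanged whether the input is one file or a year of logs spread across a cluster.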

  1. Matt Collins
    October 29, 2012 at 10:35 am

    Yes, big data is a relative term. But beyond its value, I’d say it is not the regular data the organization is used to working with, but the additional data it will have to cope with.
    Also, it may be time to find another term like Hadoop for big data, or it will never be mainstream; Hadoop developments are way too complex, costly and lengthy for the vast majority of organizations.
    As far as RDBMSs and SIEMs are concerned, since the latter is based on the former, they suffer from the same inherent inability to deal with large volumes of unstructured and semi-structured data.
    As for turning Hadoop into a pseudo-SIEM, why not, but only for the same Fortune 500 crowd. There are commercial, ready-to-use flat-file tools out there like Secnology. And because we are talking big data, don’t forget to hire a data expert, as there is no “magic software” yet.

