Archive for October, 2012

Advanced Persistent Threats – What is an enterprise CSO to do?

October 26, 2012 1 comment

What exactly is an Advanced Persistent Threat (APT)?

Is it just a new buzzword?  Is APT a new type of malware or a new type of botnet?  Is it merely politically rather than financially motivated hacking?  These are some of the questions asked about APTs.  The consensus among industry experts is that an APT may be operationally “advanced” but not necessarily advanced in the technical sense of the word.  The attackers are typically a group who rely on persistence: after an attack there may be no activity for months or years, lulling the victim into a false sense of security, while the infection persists in the victim’s environment unknown to them.  The goal of an APT is to traverse the victim’s network, target specific confidential data on that network, and exfiltrate that data through an external command-and-control server using the least sophisticated attack tools possible.

Methods used in APT attacks

Spear phishing is a method favored by attackers.  Because it relies on impersonation, spear phishing bypasses existing email filters: it tricks employees into clicking on links or opening attachments in email that appears to originate from within the employee’s own domain.  The tool chest of APT attackers also includes a collection of single-use zero-day exploits (exploits for vulnerabilities the affected software vendor has not yet patched).  Attackers also invest in command-and-control protocols that can stay undetected.

Recommendations for a Chief Security Officer (CSO) on dealing with APT:

  • Patch every host in your network, from host computers to network-attached printers and copiers, on the day the patch is released.  Don’t procrastinate.
  • Use whitelisting technology (in the form of whitelist agents), especially on non-Windows devices like network printers.  Whitelisting may also take the form of a corporate policy limiting employees who wish to download files from their computers to “approved” encrypted USB drives.
  • Log Domain Name Service (DNS) queries and Dynamic Host Configuration Protocol (DHCP) queries in your network.
  • Store a few months’ to a year’s worth of NetFlow/IPFIX traffic records, possibly in a scalable Hadoop cluster built on commodity x86 servers.  Use this data to learn how many hosts in your network have connected in the past to a blacklisted IP address outside your network.
  • Use full packet capture for post-mortem investigations to augment your in-house forensics capabilities.
  • Try to block attacks using an Intrusion Prevention System (IPS) rather than just detecting them with a network-based Intrusion Detection System (IDS).  Look into an IPS that can detect executable code in .chm (Microsoft Compiled HTML Help) files.
  • Make sure that Host-based Intrusion Detection (HIDS) is not disabled on your Active Directory servers, so that malicious attempts to brute-force a password may be detected.
  • Create an APT task force for detailed malware analysis and have them use sites like contagiodump to keep informed about new exploits and vulnerabilities.
  • Don’t assume that a commercial Security Information & Event Management (SIEM) system will have a “stop APT now” button.  A commercial SIEM may collect, normalize, and baseline data, but it still requires human involvement to find the root cause.
  • Map the social footprint (Twitter, Facebook, LinkedIn, Pinterest) of your executives, and set up honeypots to trap attackers who prey on your executives’ Facebook accounts.
  • Create awareness programs to educate your end-users about the dangers of spear phishing.  Use them as your first line of defense even if such education doesn’t deliver a 100% return on investment.
  • Reach out to your peer CSOs and to agencies like the FBI for a confidential internal briefing and exchange of lessons learned.
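As a rough illustration of the NetFlow/IPFIX recommendation above, here is a minimal Python sketch that scans stored flow records for internal hosts that have contacted a blacklisted IP.  The CSV column names (src_ip, dst_ip, bytes) are assumptions for illustration; adapt them to whatever your flow collector actually exports.

```python
import csv

# Hypothetical flow-record format: a CSV with src_ip, dst_ip, bytes columns.
def hosts_contacting_blacklist(flow_log_path, blacklist):
    """Return the set of internal hosts that connected to any blacklisted IP."""
    offenders = set()
    with open(flow_log_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["dst_ip"] in blacklist:
                offenders.add(row["src_ip"])
    return offenders
```

Run against months of retained flow data, a query like this answers the question “which of my hosts ever talked to a known-bad address?”, which is exactly what long NetFlow retention buys you.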

In conclusion, even implementing all of the above steps may not give you 100% protection, but it will definitely get you further along.  APTs are a fact of life today, and CSOs and CIOs have no option but to figure out a plan to deal with them.

Real-time threat intelligence using Hadoop

October 25, 2012 1 comment

Now that you are familiar with Hadoop and big data, you might ask, “Who uses Hadoop for real-time cyber-security?”

One example is McAfee Global Threat Intelligence (GTI), a product from McAfee (part of Intel) that collects data from millions of sensors worldwide and correlates it to provide real-time reputation scoring and threat intelligence.  If you are a McAfee customer and need reputation scores for known “bad actors” on the internet, you can deploy a GTI proxy appliance at your location and have every McAfee endpoint node there use the proxy appliance to query the GTI application in the cloud.  The GTI application runs over a Hadoop cluster.  Such access to real-time threat intelligence helps McAfee endpoint products deliver more effective cyber-security.

Another example is IpTrust (from Endgame Systems), a cloud-based service whose reputation scoring system collects data, runs it through MapReduce, and then hands it over to Cassandra (a NoSQL distributed database management system) running over the Hadoop Distributed File System (HDFS).  Apparently they have a good business model, as their customers include HP and IBM.  Why use Hadoop?  Simply because if your goal is to mine millions or billions of log files looking for botnet activity, what better and more scalable platform could there be than open-source Hadoop?
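To make the MapReduce step concrete, here is a toy, single-process Python sketch of the map/reduce pattern a reputation pipeline like this might use.  The log format and the “beacon” event label are hypothetical, and a real deployment would run the equivalent jobs in parallel across a Hadoop cluster rather than in one process.

```python
from collections import defaultdict

def map_phase(log_lines):
    """Map: emit (ip, 1) for every log line flagged as botnet beaconing.

    Each hypothetical line looks like "1.2.3.4 beacon" or "5.6.7.8 http".
    """
    for line in log_lines:
        ip, event = line.split()
        if event == "beacon":
            yield ip, 1

def reduce_phase(pairs):
    """Reduce: sum beacon counts per IP into a crude reputation score."""
    scores = defaultdict(int)
    for ip, count in pairs:
        scores[ip] += count
    return dict(scores)
```

The output of the reduce phase (beacon counts per IP) is the kind of aggregate a service would then store in a database like Cassandra for fast reputation lookups.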

Yet another example is Sourcefire Immunet, which uses Hadoop to collect data from the two million endpoints Sourcefire monitors and to provide real-time protection against malware and zero-day attacks.

In conclusion, if you are a security vendor deploying a cloud-based reputation scoring service and you need to process and store far more data than traditional databases can handle, you should consider Hadoop as the foundation for your solution.

Enterprise Storage – Is it overkill for Hadoop?

October 23, 2012 2 comments

Like every buzzword before it (Information Lifecycle Management, virtualization, cloud computing), Hadoop has enterprise storage vendors taking great pains to explain why this new trend is something they anticipated all along and for which they have pre-designed capabilities in their storage arrays.

The mantra from the big storage vendors in the case of Hadoop is: “Hadoop may prescribe the use of Direct-Attached Storage (DAS) on each Hadoop node, but we know what is best for you: our $3,000-per-TB enterprise-class storage in a NAS or SAN configuration!”

Let us step back and look at the arguments you’re likely to hear from your enterprise storage vendor:

  1. Why use just-a-bunch-of-disks (JBOD) when you can use our “enterprise class” storage?  Well, you’ve already bought into open-source Hadoop over proprietary data warehouses, so investing in commodity x86 servers and JBODs is not that big a leap of faith.
  2. Why use DAS when you can use Network Attached Storage (NAS) or a SAN?  For starters, Hadoop is designed as a “shared nothing” architecture in which DAS suffices.  Shared network storage like NAS or SAN could very well be overkill from a cost point of view if all you plan to store is petabytes of log and Web 2.0 data in the Hadoop cluster.
  3. You need enterprise-class storage because you need redundancy!  Considering that the Hadoop Distributed File System (HDFS) is designed to distribute your data across many nodes in the cluster specifically to safeguard against the failure of a single node, that argument doesn’t wash.
  4. You need fault tolerance and we give you that with enterprise-class drives!  Mean Time Between Failures (MTBF) on enterprise-class Fibre Channel or SAS drives is definitely higher than on low-end SATA drives, but considering that all spinning drives eventually fail, it’s only a question of how frequently and at what cost you replace failed drives.
  5. Hadoop is not “enterprise class” and has single points of failure!  In the Hadoop architecture the NameNode stores HDFS metadata (like the locations of blocks) for the entire cluster.  If this concerns you, you have the option to shard (the technical term for partition) the metadata across multiple NameNodes.  By doing so you reduce the impact of a NameNode failure to just the data managed by the particular NameNode that failed.
  6. Only our enterprise-class storage provides you with data protection for Hadoop!  Considering that HDFS replicates each block of data across multiple DataNodes, this argument doesn’t hold true either.
  7. You get snapshots only with our enterprise-class storage!  True, but considering that you, the CIO, plan to implement Hadoop to store big data (log files, machine-generated data, pictures, Web 2.0 data), do you really need to give it the enterprise-grade data protection that you’d give your payroll or customer data?

So, getting back to basics, how much should you, the CIO, budget for computing and storage resources for your Hadoop project?  If you expect to ingest 1 TB of data every month, you’d be OK with 4 TB of raw disk capacity per month (3 TB for the three copies that Hadoop invariably makes and 1 TB extra for overhead).  Cloudera recommends 4-12 drives per storage node in a non-RAID configuration.  Plan to spend around $4,000 or more per Hadoop node for server and storage combined.  Isn’t that a far cry from the $3,000-per-TB quote you received from the enterprise storage vendor?  However, if rolling your own doesn’t strike your fancy from a cost/risk perspective, there are always pre-built Hadoop appliances like the Oracle Big Data Appliance, in the $500,000 range, or comparable offerings from vendors like EMC/Isilon and NetApp.
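The sizing arithmetic above fits in a few lines of Python.  This is only back-of-the-envelope math using the figures in the text (3x replication plus roughly 1x overhead per TB of raw data), not a substitute for a proper capacity plan:

```python
def hadoop_capacity_needed(raw_tb_per_month, months, replication=3, overhead=1.0):
    """Raw disk (TB) needed to retain `months` of data.

    Each raw TB costs `replication` copies in HDFS plus `overhead` TB
    of scratch space for intermediate MapReduce output and the like.
    """
    return raw_tb_per_month * months * (replication + overhead)
```

With the article’s numbers, 1 TB of new data per month works out to 4 TB of disk per month, or 48 TB to retain a full year.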

Having saved money by deploying your Hadoop cluster with off-the-shelf servers and direct-attached storage, what do you do with your remaining budget?  Invest in statisticians and quants (math wizards and computer programmers), and get your staff trained on tools like R (an open-source programming language for statistical computing).  After all, it’s not the data or where you store it that really matters; what matters is the meaningful insights you derive from that data, with the goal of improving your company’s bottom line!

Full text search on big data

October 22, 2012 Leave a comment

Now that you have bought into how analytics on “big data” will improve your company’s bottom line, you are tasked with identifying your company’s search strategy for big data.
Full text search is offered by various established vendors:

  • Microsoft with its acquisition of FAST Search and Powerset
  • HP with its acquisition of Autonomy, whose products included IDOL eDiscovery
  • IBM with its acquisition of Vivisimo
  • EMC with its acquisition of Kazeon
  • Oracle with its acquisition of Endeca
  • Dassault with its acquisition of Exalead

The flip side of each of the above is that you are now forced into buying a big-vendor solution, much like a consumer forced to buy a complex CD/MP3/radio/clock combination when all you wanted was the clock with big blue lettering so it is visible in the dark!

What alternatives do you have to the big vendors?  Lucene is a full-text search engine library written in Java.  A more precise definition I’ve seen on the net is “Lucene is an information retrieval library with support for tokenizers, analyzers, and query parsers.”  Pray, what does this mean to you and me?  Just that Lucene is a low-level toolkit, a library without support for key things like distributed indexes.  Not of much use to a CIO with a limited engineering staff.  If you were a company like Salesforce, LinkedIn, or Twitter, with legions of developers, Lucene would be perfectly adequate.

What, then, should the typical enterprise that is not one of the booming dot-coms do?  Look into technologies like Apache Solr, an HTTP search server built on top of Lucene.  Specifically, Solr gives you REST-like HTTP/XML and JSON APIs, which means you don’t have to have developers writing Java code.
What Solr gives you is the following:

  • Full-text search
  • Highlighting of hits
  • Faceted search
  • Dynamic clustering
  • Database integration
  • Rich document handling of formats like Word and PDF
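Because Solr exposes its functionality over HTTP, querying it takes no Java at all.  The sketch below, using only the Python standard library, shows how a client might build and issue a JSON query; the host, port (8983 is Solr’s usual default), core name (“logs”), and row count are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_solr_url(query, base="http://localhost:8983/solr/logs/select", rows=10):
    """Construct the REST-like query URL Solr expects."""
    return f"{base}?{urlencode({'q': query, 'wt': 'json', 'rows': rows})}"

def solr_search(query):
    """Execute the query against a running Solr core and return matching docs."""
    with urlopen(build_solr_url(query)) as resp:
        return json.load(resp)["response"]["docs"]
```

The point is the shape of the interaction: a plain HTTP GET returning JSON, which any scripting language can consume without a line of Java.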

Are there companies using Solr over Lucene today?  Groupon is said to use Solr/Lucene to index email sent to Groupon customers and to inform consumers about deals available in their geographic vicinity.  Other companies said to use Solr/Lucene include Netflix, AT&T, Sears, Ford, and Verizon.

By this time you’re asking yourself, “If I invest in building search with Solr over Lucene, how do I get support if it is all open source?”  Vendors like Lucidworks and OpenLogic offer consulting and support for deploying search applications using Solr over Lucene.  The other question you might have is, “Why invest in x86 servers, rack space, cooling, and floor tiles; what are the pros and cons if I move everything into the Amazon cloud?”  An interesting cost comparison on this topic, done by OpenLogic, may be found here.  Having read all of the above, what if you want to explore open-source alternatives to Solr?  Look into Elasticsearch and Blur.  In short, there are many ways to find the proverbial needle in the “big data” haystack.

Sentiment analysis using big data

October 19, 2012 1 comment

If the CEO of your company tasks you with creating a way to mine big data for sentiment analysis, so you may help identify the reason for flagging sales last quarter, then read on!  There appear to be many choices for doing analytics on big data, depending on what you have already invested in to date:

  • If you are an Oracle shop, your friendly Oracle rep might recommend using Oracle big iron to create a complete system involving Hadoop, an RDBMS, and business intelligence (BI): their Big Data Appliance (essentially big iron running Cloudera Hadoop), Big Data Connectors to Exadata (a database system designed for mixed workloads), and Oracle Exalytics for the business intelligence.
  • If you are an SAP CRM customer using SAP BusinessObjects and HANA, the recommendation you receive might be to use SAP Rapid-Deployment Solution (RDS) along with technology from SAP partner NetBase.
  • If you happen to be an HP shop then the recommendation you receive might involve running the Autonomy (now HP) IDOL software engine on each node in your Hadoop cluster and doing concept searching using IDOL.
  • If you are a Teradata shop the recommendation you receive might involve a solution involving Hadoop integrated with a relational database, in other words a combination of Teradata Aster Discovery Platform and Teradata Integrated Data warehouse along with Hortonworks Hadoop.

However, if you draw the line at paying the big vendors for analytics and want to roll your own (assuming you have Hadoop expertise in-house), there are companies like AltoScale who advocate building your own system, as described here.  The example AltoScale provides involves Twitter sentiment analysis using Hadoop.
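For a flavor of what the scoring step in such a pipeline might look like, here is a deliberately naive Python sketch that classifies a tweet by counting positive versus negative words.  The word lists are placeholders, and real sentiment systems use trained models rather than keyword lookups; in a Hadoop pipeline this function would be the body of the map step, run over millions of tweets in parallel.

```python
# Placeholder lexicons; a production system would use a trained classifier.
POSITIVE = {"love", "great", "clean", "fresh"}
NEGATIVE = {"hate", "awful", "stain", "smell"}

def tweet_sentiment(text):
    """Return +1 (positive), -1 (negative), or 0 (neutral) for one tweet."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return (score > 0) - (score < 0)
```

Aggregating these per-tweet labels over time and by product keyword is what turns a raw Twitter firehose into a sentiment trend a marketing team can act on.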

While Twitter sentiment analysis is useful for pollsters at election time, if you happen to be the CIO of Procter & Gamble, keen to know what your customer base is saying about your latest flavor of Tide detergent, depending on Twitter feeds might not give you all the insights you need.  After all, the twenty-something who tweets every non-event to the stratosphere is not likely to be a fan of doing his or her own laundry!  What makes me so opinionated?  Well, I rely on the insights gained from comic strips and am a big fan of Zits, so take all my advice with a grain of salt.

Introduction to big data, Hadoop, SIEM

October 18, 2012 6 comments

What is “big data”?

Like the story of the blind men and the elephant, everyone has a different perception of what constitutes “big data”.  Strangely enough, they could all be right!

  • An IT manager at a corporation in the oil & gas space would say that 4-terabyte 3-dimensional seismic files constitute big data.
  • A utility company might consider data from millions of smart meters to be big data.
  • An airplane engine manufacturer might say that the terabytes of data generated in one transatlantic flight by sensors in the airplane engine constitutes big data.
  • A Fortune 500 company’s IT department might consider the log files generated by web servers, firewalls, routers, and badge-access systems to be big data.
  • A casino in Las Vegas or Macau would consider HD surveillance video files to be big data.
  • A hospital might consider data generated by patient sensors to be big data.
  • A company interested in gauging customer sentiment might view petabytes of Twitter data as big data.

Regardless of the industry, the big data in question has monetary value to the company: it either represents a valuable asset of the company or a way to prevent the loss of other valuable assets.  “Big data” doesn’t have to equate to “big files”.  Big data could comprise small files generated so rapidly that they would saturate traditional data stores like relational databases.  Hence a new way is needed to collect, analyze, and act upon this big data.

What is Hadoop and where does Hadoop fit with “big data”?

Hadoop is a software framework that allows parallel processing of large amounts of data stored on clusters of commodity servers with locally attached disks.  In this sense Hadoop is anathema to the big storage vendors who built billion-dollar businesses selling the idea that storage has to be centralized and consolidated, and has to be NAS- or SAN-attached so that it may be backed up and replicated.  Hadoop runs counter to that argument by relying on the direct-attached storage available in commodity x86 servers.

My database guy tells me that the RDBMS is sufficient, we don’t need Hadoop!

As long as the data is in the gigabyte range and is structured, an RDBMS performs exceptionally well.  However, if your organization needs to conform to the Federal Rules of Civil Procedure (FRCP), you may want to archive email and email attachments going back several years.  Now you enter the realm of unstructured or semi-structured data, for storing which you might find an RDBMS unsuitable.  This is where Hadoop comes in.  Being open source, it has a lower acquisition cost (I won’t say no acquisition cost, as you still need to budget for servers, direct-attached storage, and in-house Hadoop expertise), and it gives you the ability to scale out, as opposed to just scaling up, which is what an RDBMS does.

My IT forensics guy tells me that the existing Security Information & Event Management (SIEM) system is enough and we don’t need Hadoop!

The short answer is that you need both.  If your goal is to comply with regulations like PCI and the Sarbanes-Oxley Act (SOX), and your requirement is to collect, search (in a structured manner), and analyze logs from the servers, firewalls, and routers in your network, then a SIEM is the right tool to use.  But since a SIEM uses a back-end relational database, you don’t want to use it to store petabytes of data.

However, if your goal is to store a year or more of email and email attachments, badge-access logs, and surveillance videos, then you can’t realistically use a SIEM.  This is where Hadoop shines.  Though Hadoop works well with all types of files (structured and unstructured), it is really designed to handle large unstructured files.

Rationale behind this blog

October 17, 2012 Leave a comment

[Disclaimer: This is my personal blog.  The views expressed here are mine alone and in no way reflect those of my employers].

A technology product vendor who claims to market “solutions” or to be your “trusted advisor” is like the pest-control company employee who does a free termite inspection on your home only to declare that you need to buy a yearly contract to get rid of termites ASAP.  You want to believe the technology vendor because he claims to be a professional, but a nagging voice in your head asks, “Why would this person give me unbiased advice if at the end of it he expects to receive a purchase order against my budget?”

A technology vendor harping on the virtues of a point product is akin to a car salesperson expounding on the virtues of DOHC versus SOHC engines when all you, the customer, wanted was “a red car that goes from 0 to 60 in 5 seconds!”  I’ve often wondered where the empathy has gone: why not put yourself in the buyer’s shoes and explain how you can solve a problem, rather than try to make your quota selling a point product?

This blog is intended to introduce topics like big data, networked storage, cloud computing, and cyber security to CIOs, with a focus on answering “So what?”, “Why should I care?”, and “What should I do next?”

I hope this blog will prove instructive and entertaining at the same time.