Archive for April, 2013

Microservers for power savings

April 28, 2013

If you have a hammer, everything looks like a nail.  Conversely, if all you have ever known are power-hungry servers, you have probably used them to address every workload.

Microservers came into the mainstream when cloud providers realized that they don’t need beefy, power-hungry servers if they are not running heavy computing workloads.  If a provider runs less demanding compute jobs, like serving contact information up to a user on a website, why deal with the floor-tile space, power consumption and HVAC costs associated with high-end servers?  Why not benefit from the power savings of Intel Atom or ARM-based servers?

For instance, HP runs a portion of its website on its own microservers built around the Intel “Centerton” Atom S1200 processor, with appreciable power savings.  HP claims that microservers consume “89 per cent less energy and 60 per cent less space”.  That figure assumes 1,600 Moonshot Calxeda EnergyCore microservers (based on an ARM SoC) crammed into half a server rack doing the job typically done by 10 racks of 1U servers.

HP markets the microservers as suitable for hyperscale workloads.  A hyperscale workload is a lightweight workload that has to be executed in large numbers, so it requires scaling from a few servers to several thousand servers.  To appreciate the low power consumption of ARM-based servers, check out this article on “bicycle powered ARM servers”.

One concern that makes cloud providers hesitate over ARM-based microservers is whether legacy applications written for x86 servers can be ported to ARM.  To address this, a UK-based company called Ellexus sells a product called Breeze which makes it easy to migrate applications from x86-based servers to ARM-based servers.  What if you are not sure that Breeze can do the job for you?  Ellexus partners with a cloud provider, Boston, which offers an ARM-as-a-cloud service, so you can actually try migrating your apps using Breeze without the initial CAPEX of buying ARM-based microservers for your in-house datacenter.

Dell offers the Intel Xeon E3-based PowerEdge C5220 microserver, and IBM is talking about Hadoop on the P5020.

To get further savings and cut out the 25 to 30 per cent gross margins made by HP and Dell, consider going directly to their Original Design Manufacturers (ODMs) like Quanta.  Taiwan-based Quanta, the server arm of $37bn Quanta Computer, makes “white box” servers for big names like Google, Facebook, Amazon, Rackspace, Yahoo and Baidu.  The Quanta S910 series is an example of a microserver using Intel Xeon processors.  Quanta also addresses the Facebook-initiated Open Compute Project, which is of interest to cloud operators like Rackspace.  The Open Compute Project, and specifically Facebook’s “Group Hug” motherboard spec, aims to support different vendors’ CPUs on the same system.  This news will send shivers through the boardrooms of Intel and AMD, but it is good for you, the consumer, as you are not beholden to one CPU vendor.

At a time when major server vendors like IBM are trying to exit the x86 server business, it may make sense to cut out the big-name server vendors and deal directly with their suppliers: if not for the savings, then at least to ensure a long-term, consistent roadmap of ever more computing capacity with lower power and space consumption.

Using NetFlow/IPFIX to detect cyber threats

April 26, 2013

Most commercial switches, routers and firewalls in your datacenter support NetFlow (network flow logs) and, in some cases, IPFIX (the next-generation replacement for NetFlow version 9).  Why not use NetFlow to alert you to cyber threats like worms, botnets and Advanced Persistent Threats (APTs)?

The idea of using NetFlow is not a new one: Argonne National Laboratory has been using NetFlow to detect zero-day attacks since 2006.

An Advanced Persistent Threat (APT) starts by mining employee data from Facebook, LinkedIn and other social media sites, and focuses on stealing a corporation’s intellectual property, using innocuous applications like Skype to move the content around.  APTs fly under the radar of signature-based perimeter security appliances like firewalls and Intrusion Detection Systems (IDS).  However, you can use NetFlow/IPFIX to identify APTs by comparing flows in the NetFlow/IPFIX collector against a host reputation database offered by cloud services like McAfee GTI.  The actual ingest of the host reputation database and comparison with flows would involve a tool like Plixer Scrutinizer™.  By blocking traffic going to the known compromised hosts (which host the APT command-and-control malware) you neutralize the goal of the adversary who sent the APT into your network.

Why bother with IPFIX (next-gen NetFlow) when there are older versions of NetFlow?

You can export URL information via IPFIX (using vendor extensions that IPFIX supports).  This lets you determine what URL a user clicked on before succumbing to malware, and how many other people clicked on the same bad URL.  Products which export URL information via IPFIX include Ntop nProbe, Dell SonicWALL and Citrix AppFlow.

For Voice-over-IP (VoIP) traffic you can export details like caller-id, codec, jitter and packet loss.

Why use dedicated NetFlow/IPFIX sensors when routers/switches/firewalls may suffice? 

Even router vendors like Cisco recognize that customers who buy high-end routers may not want to spend expensive CPU cycles on NetFlow/IPFIX generation, nor rely on sampled NetFlow, which is unusable for cyber-security applications.  The need is for offload appliances that produce packet-accurate, non-sampled NetFlow/IPFIX.  Cisco’s own NetFlow Generation Appliance (NGA) is an option; the older NGA 3140 tops out at 120,000 flows per second (fps).  Higher-end offload appliances from some vendors can sustain 250,000 to 500,000 fps to keep up with busy 10 Gb network pipes.

So we have a way to generate NetFlow/IPFIX, but what about the analytics needed to actually detect cyber-attacks?  While you may have a traditional SIEM (HP ArcSight ESM, McAfee ESM, IBM QRadar) or a tool like Splunk, it is unrealistic to send NetFlow/IPFIX data into these systems at very high rates.  A better way is to trim down the traffic and analyze it on the wire before sending it to the SIEM.
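One common way to trim flow traffic before it reaches the SIEM is to consolidate bidirectional flow records into per-conversation summaries.  A minimal sketch of that idea follows; the record field names (`src`, `sport`, `bytes`, etc.) are assumptions for illustration, not the schema of any particular product:

```python
from collections import defaultdict

def consolidate(flows):
    """Aggregate individual flow records into per-conversation summaries.

    A conversation is keyed on the unordered endpoint pair plus protocol,
    so both directions of a session collapse into one record.
    """
    conversations = defaultdict(lambda: {"flows": 0, "bytes": 0, "packets": 0})
    for f in flows:
        # Sort the endpoints so A->B and B->A map to the same key.
        endpoints = tuple(sorted([(f["src"], f["sport"]), (f["dst"], f["dport"])]))
        key = (endpoints, f["proto"])
        c = conversations[key]
        c["flows"] += 1
        c["bytes"] += f["bytes"]
        c["packets"] += f["packets"]
    return dict(conversations)

# Two flow records: an HTTP request and its reply.
flows = [
    {"src": "10.0.0.5", "sport": 52001, "dst": "93.184.216.34", "dport": 80,
     "proto": "tcp", "bytes": 1200, "packets": 10},
    {"src": "93.184.216.34", "sport": 80, "dst": "10.0.0.5", "dport": 52001,
     "proto": "tcp", "bytes": 48000, "packets": 40},
]
convs = consolidate(flows)
print(len(convs))  # both directions collapse into one conversation
```

At scale this halves (or better) the record volume forwarded downstream, which is the point of putting a mediator in front of the SIEM.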

NetFlow Logic, a Bay Area startup, has a high-volume NetFlow processing product, “NetFlow Integrator”, which can ingest NetFlow/IPFIX records and process the stream in flight using an in-memory database.  The product scales its throughput with the number of underlying server cores; for instance, a 16-core server would allow it to scale throughput past 500,000 fps.

The product is not a NetFlow collector but is categorized as a NetFlow/IPFIX Mediator (see RFC 5982).  NetFlow Integrator reduces NetFlow data by consolidating information into “conversations” rather than the individual flows within a conversation.  Flow records are processed by one or more rules (canned, or custom ones created using a GUI/SDK), each of which applies its own logic to each flow record.  These rules can aid in the following types of detection:

Botnet detection:

A user would load a list of known Command & Control servers (possibly obtained from sites like Emerging Threats, or from your own private source) into the rule.  Every incoming NetFlow record is examined to determine whether the source or destination IP address matches this list.  If there is a match, the matched information is forwarded to the SIEM, which in turn alerts a security analyst if any botnet slaves are detected on the network.
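The matching step described above boils down to a set lookup per flow record.  A minimal sketch, with hypothetical addresses and a made-up record layout standing in for real NetFlow fields:

```python
def match_botnet_flows(flow_records, cnc_addresses):
    """Return flow records whose source or destination IP appears on a
    known command-and-control blacklist, for forwarding to the SIEM."""
    cnc = set(cnc_addresses)  # set membership keeps the per-record check O(1)
    return [f for f in flow_records if f["src"] in cnc or f["dst"] in cnc]

# Hypothetical C&C blacklist and flow records for illustration.
blacklist = ["198.51.100.7", "203.0.113.99"]
records = [
    {"src": "10.0.0.21", "dst": "198.51.100.7"},   # internal host -> C&C
    {"src": "10.0.0.34", "dst": "93.184.216.34"},  # benign traffic
]
hits = match_botnet_flows(records, blacklist)
print(hits)  # only the flow touching the blacklisted address
```

Only the matched records would be forwarded, so the SIEM sees a trickle of high-signal events rather than the raw flow firehose.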

APT detection:

NetFlow Integrator has rules to identify scanners doing “port sweeps” of your network.  It can also look for data exfiltration by examining an infected host that starts proliferating on the internal network: a custom algorithm detects when a client suddenly starts behaving like a server, which is something signature-based firewalls cannot do.
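The vendor’s algorithm is not public, but the “client suddenly behaves like a server” idea can be sketched with a simple heuristic: flag any host that starts accepting connections from several distinct peers despite not being on your known-server baseline.  Everything below (thresholds, field names) is an assumption for illustration:

```python
from collections import defaultdict

def servers_emerging(flows, baseline_servers, min_inbound_peers=3):
    """Flag hosts that accept connections from several distinct peers
    even though they are not in the known-server baseline.

    Flows are assumed to carry the TCP initiator as `src`, so the `dst`
    of a new connection is the side acting as a server.
    """
    inbound_peers = defaultdict(set)
    for f in flows:
        inbound_peers[f["dst"]].add(f["src"])
    return [host for host, peers in inbound_peers.items()
            if host not in baseline_servers and len(peers) >= min_inbound_peers]

# Four distinct internal hosts suddenly connecting to 10.0.0.50,
# which is not a known server -- a possible sign of lateral movement.
flows = [{"src": "10.0.0.%d" % i, "dst": "10.0.0.50"} for i in range(1, 5)]
print(servers_emerging(flows, baseline_servers={"10.0.0.1"}))
```

A production rule would add time windows and whitelisting, but the core behavioral signal (inbound fan-in to a non-server host) is the same.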

In conclusion: to detect malware, botnets and APTs, use the NetFlow/IPFIX support that your routers, switches and firewalls already have today.  Keep your existing SIEM in place, but introduce an IPFIX offload appliance, especially if you have large 10 Gb network pipes and don’t want to burden your routers.  Use a tool like NetFlow Logic’s to analyze NetFlow/IPFIX records on the wire, and use your SIEM for the alerting and remediation.

Detecting zero day attacks using big data

April 16, 2013

Snort and Intrusion Detection: 

Snort is a widely used open-source network Intrusion Detection System (NIDS) capable of both real-time traffic analysis and packet logging.  The reason for its popularity is that it is open source and effective at detecting everything from port scans and buffer overflows to OS fingerprinting attempts.  Known attacks follow certain activity patterns, and these are captured in “signatures” available from the open-source community as well as from Sourcefire.

In the Snort architecture, the packet sniffer, as the name suggests, eavesdrops on network traffic; the preprocessor checks packets against plug-ins to determine if they exhibit certain behavior; and the detection engine runs incoming packets through a set of rules.  If a rule matches what is in the packet, an alert is generated.  The alert may go to a log file or to a MySQL or PostgreSQL database.
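To make the rule-matching step concrete, here is an illustrative rule in Snort 2.x syntax.  The `sid` is in the local (1000000+) range and the content string is a made-up indicator, not a real signature:

```
alert tcp $EXTERNAL_NET any -> $HOME_NET 80 \
    (msg:"LOCAL example - suspicious User-Agent"; \
     flow:to_server,established; \
     content:"User-Agent|3A| BadBot"; \
     sid:1000001; rev:1;)
```

The header names the action, protocol, endpoints and direction; the options in parentheses hold the match criteria (here a literal byte pattern, with `|3A|` encoding the colon in hex) and the alert metadata.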

Using Snort on big data stored in Hadoop

What if you received new Snort signatures today that you didn’t have three months ago, and want to use them to detect zero-day attacks (previously unknown exploits) in historical packet capture data?  This historical packet capture data may sit in archive storage within your corporate data center or on cloud storage like Amazon S3.

One solution is to analyze full packet captures using Apache Pig (a tool that abstracts the user from the complexities of MapReduce).  Even if you aren’t comfortable using MapReduce, as long as you have a few days of packet capture data on your laptop and know how to write queries against it, an open-source tool called PacketPig lets you transition those same queries to a Hadoop cluster containing weeks or months of packet capture data.

PacketPig (an open-source project hosted on GitHub) offers many loaders (Java programs which provide access to specific information in a packet capture), one of which is SnortLoader(), which lets you analyze months of packet capture data dispersed across Hadoop nodes.  The way you detect a zero-day attack using PacketPig is to use SnortLoader() to scan archived packet capture data with old Snort signatures (from an old snort.conf file), then scan it a second time with the latest signatures.  After filtering out the signatures that appear in both scans, what you have left are zero-day attacks.  More details of how this is done may be found here.  This is yet another example of how Hadoop running on commodity servers with direct-attached storage can help provide a cyber-security solution for zero-day attacks.
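The filtering step at the end of the two scans is just a set difference over the alerts each scan produced.  A minimal sketch, with alerts reduced to a hypothetical (signature id, capture file) pair rather than the full per-packet output SnortLoader() would emit:

```python
def zero_day_candidates(old_scan_alerts, new_scan_alerts):
    """Alerts raised only by the newer signature set point at attacks that
    were invisible (zero-day) when the traffic was originally captured."""
    return set(new_scan_alerts) - set(old_scan_alerts)

# Hypothetical alert sets from the two passes over the same captures.
old_alerts = {(2010935, "jan.pcap")}                         # old snort.conf
new_alerts = {(2010935, "jan.pcap"), (2016683, "jan.pcap")}  # latest rules
print(zero_day_candidates(old_alerts, new_alerts))
```

In the PacketPig workflow the same difference would be expressed as a Pig join/filter across the two scan outputs, but the logic is identical.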

Security for Big data in Hadoop

April 15, 2013

Why secure the Hadoop cluster?

Hadoop has two key components: the Hadoop Distributed File System (HDFS) and MapReduce.  HDFS consists of geo-dispersed DataNodes accessed via a NameNode.  Hadoop was designed without much security in mind, hence a malicious user can bypass the NameNode and access a DataNode directly; if he or she knows the block location of data, that data can be retrieved or modified.  In addition, data being sent from a DataNode to a client can easily be sniffed using generic packet-sniffing technology.

Authentication solution

Securing the Hadoop cluster requires understanding that authentication (is the user who claims to be Bob really Bob?) differs from authorization (is user Bob authorized to submit HDFS or MapReduce jobs?).

To address the question of authentication, Hadoop recommends the use of Kerberos, which Hadoop supports out of the box.  Unlike network firewalls, which assume that all dangers reside outside the corporate network, Kerberos operates on the assumption that network connections are unreliable and a possible weak link.  How does Kerberos itself work?  In addition to the client and the server, there is a Kerberos Key Distribution Center (KDC) which has two components: an Authentication Server and a Ticket Granting Server.

  • A client who wishes to access a server first authenticates itself with the Authentication Server (AS).  The AS returns an authentication ticket called a Ticket-Granting Ticket (TGT).
  • The client then uses the TGT to request a service ticket from the Ticket Granting Server (TGS) in the KDC.
  • Lastly, the client uses the service ticket to authenticate itself with the server, which provides the desired service to the client.  In a Hadoop cluster the server could be the NameNode or the JobTracker.

Benefits of using Kerberos for authentication:

  • Kerberos helps you keep rogue nodes and rogue applications out of your Hadoop cluster.

It is also recommended that Hadoop daemons use Kerberos to authenticate among themselves.
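On the Hadoop side, switching from the default “simple” authentication to Kerberos starts in core-site.xml.  A minimal fragment is shown below; a real deployment additionally needs per-daemon principal and keytab settings (omitted here), and assumes a working KDC:

```xml
<!-- core-site.xml: switch Hadoop from "simple" to Kerberos authentication -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<!-- Also enforce service-level authorization checks -->
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```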

Encryption solution

Encryption of data-in-motion can be achieved using SSL/TLS to encrypt data moving between nodes and applications.  Encryption of data-at-rest at the file layer is recommended to protect against malicious users who obtain unauthorized access to DataNodes and inspect files.  While the benefits of encryption are clear, it is very compute-intensive.  Intel has jumped in to address this with its own flavor of Apache Hadoop, optimized to do AES encryption provided your Hadoop nodes use Intel Xeon processors.  Intel claims that its Advanced Encryption Standard New Instructions (AES-NI), built into Xeon processors, can accelerate encryption performance in a Hadoop cluster by 5.3x and decryption performance by 19.8x.

Cynics would argue that Intel’s new-found interest in Hadoop has more to do with keeping low-cost, ARM-based microservers at bay than with improving Hadoop security.  Whatever the rationale, speeding up encryption and decryption can only increase the use of these data-protection techniques by the Hadoop user base.

Security in the future (a gateway to Hadoop)

Hortonworks (along with NASA, Microsoft and Cloudera) is promoting Knox Gateway, a perimeter security solution providing a single point of authentication for the Hadoop cluster.  Clients who want to access the cluster must first traverse the gateway, which itself resides in a DMZ.  Time will tell how this new technology is embraced by the Hadoop user community.

Are ETL vendors still relevant now that Hadoop is here?

April 11, 2013

ETL is another three-letter acronym, standing for Extract, Transform and Load.  ETL refers to extracting data from diverse sources (like relational databases and flat files), transforming (sorting, joining, synchronizing, cleaning) that data, and then loading it into a data warehouse.  This workflow assumes you know beforehand the type of repetitive reports you will run on the data in the data warehouse.
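The three stages can be sketched end-to-end in a few lines.  This toy pipeline uses made-up order rows and SQLite standing in for the warehouse; real ETL tools add connectors, scheduling and error handling around the same shape:

```python
import sqlite3

# Extract: rows as they might arrive from a flat file or source database.
raw_orders = [
    ("  Acme Corp ", "2013-04-02", "1200.50"),
    ("acme corp",   "2013-04-03", "980.00"),
]

# Transform: trim whitespace, normalize name casing, cast amounts to numbers,
# so the two spellings of the customer resolve to one entity.
clean = [(name.strip().title(), day, float(amount))
         for name, day, amount in raw_orders]

# Load: insert into a warehouse table (SQLite stands in for the warehouse).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (customer TEXT, day TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)

total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 2180.5
```

The repetitive reports mentioned above then run as plain SQL against the loaded table.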

Companies like Informatica, IBM, Oracle, SAS and Business Objects have built profitable businesses providing enterprise-class ETL for huge volumes of data in heterogeneous environments.  Now that open-source Hadoop on commodity x86 server hardware can import data from ERP systems, DBMSes and XML files, and export it after some transformations into a regular data warehouse, one is tempted to ask: is ETL still relevant now that Hadoop does a lot of the ETL function?

To answer this question, step back and ask yourself:

Did Hadoop replace my traditional data warehouse?

If your goal is to determine sales trends for your company, you are more likely to run queries on corporate data in the data warehouse; if you are the marketing department tasked with determining the RoI of a recent email campaign, you are more likely to query data residing in a data warehouse like Netezza than to direct your queries at the Hadoop cluster.  So just as Hadoop didn’t replace your data warehouse, it’s not likely to replace your ETL tools either.

The data warehouse and ETL continue to exist; what changes is where and how you implement them.  For instance, Amazon is offering “Redshift”, a data warehouse in the public cloud at roughly $1,000 per TB, which takes away the need to spend $19,000 to $25,000 per TB to store the same data in a traditional data warehouse on big iron within your private data center.  If you decide to go with Redshift, you still need an ETL tool to get all of your on-premise data into Redshift in the public cloud.

As an enterprise you may want to merge customer data currently residing with a SaaS provider into your in-house financial systems.  There again you’d need ETL tools, possibly from vendors like SnapLogic.

The demarcation lines between what data is stored in-house and what is stored in the cloud get blurrier every day.  In this changing world the need for ETL persists; what may change is who provides the ETL functionality.  A few years ago it was only Informatica or IBM, with high upfront costs in training, consulting and acquisition.  Today it might be open-source tools like Pentaho Kettle or Talend, or commercial tools from SnapLogic, Syncsort and other vendors with more innovative approaches and lower acquisition costs.

Categories: Big Data and Hadoop