Network overlay SDN solution used by PEER1

December 27, 2013

Embrane powering PEER1 Hosting

SDN enabled cloud hosting

As I enjoy my employer-mandated shutdown and gaze sleepily at the phone on my desk, my mind begins to wander… Today we use services from companies like Vonage and make international calls over high-speed IP networks. Vonage in turn operates a global network using the services of cloud hosting companies like PEER1 Hosting. PEER1 claims to have 20,000 servers worldwide, 13,000 miles of fiber optic cable, 21 Points of Presence (POPs) on 2 continents and 16 data centers in 13 cities around the world. Like any large hosting provider, it uses networking gear from vendors like Juniper and Cisco. Its primary data center in the UK is said to deploy Juniper gear: EX-series Ethernet switches with Virtual Chassis technology, MX-series 3D edge routers and SRX-series services gateways. However, when PEER1 wanted to offer customers an automated way to spin up firewalls, site-to-site VPNs and load balancers in a matter of minutes, it ended up using technology from Embrane, a 40-person start-up in the crowded field of SDN start-ups.

Why did they not use SDN solutions from Juniper, their vendor of choice for routers and switches? Juniper initially partnered with VMware to support NSX for SDN overlays, and Juniper was a platinum member of the OpenDaylight project. Then Juniper did an about-turn and acquired Contrail, whose technology competes with VMware's. Juniper claimed that despite not supporting OpenFlow, the Contrail SDN controller would offer greater visibility into layer 2 switches due to the bridging of BGP and MPLS. Unlike the Cisco SDN solution, which prefers Cisco hardware, Juniper claimed that Contrail would work with non-Juniper gear such as Cisco's. To generate interest in Contrail from the open source community, Juniper released it as OpenContrail, though most of the contributors to OpenContrail on GitHub are from Juniper.

It is interesting to note that customers like PEER1 may rely on Juniper or Cisco for their hardware, but when it comes to finding an automated way to deploy networking services as an overlay they go with start-ups like Embrane. Embrane has an interesting concept: its technology allows a hosting provider to overlay firewalls and load balancers using vLinks, which are point-to-point layer 3 overlays capable of running over any vendor's hardware (not just Cisco or Juniper). Many such vLinks form a single administrative domain called a vTopology. Embrane allows you to make a single API call to bring up or tear down an entire vTopology. An example of how this makes life easier for a hosting provider: when an application is decommissioned, all associated firewall rules are decommissioned with it, unlike other methods where firewall rules end up living past their time.
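To make the "rules die with the application" idea concrete, here is a minimal Python sketch. It is not Embrane's API: the class names `VLink` and `VTopology`, their fields and the `bring_up`/`tear_down` methods are all hypothetical, chosen only to show why grouping links and firewall rules in one administrative domain avoids orphaned rules.

```python
from dataclasses import dataclass, field


@dataclass
class VLink:
    """Hypothetical point-to-point layer 3 overlay between two endpoints."""
    src: str
    dst: str


@dataclass
class VTopology:
    """Hypothetical administrative domain that owns vLinks and firewall rules."""
    name: str
    vlinks: list = field(default_factory=list)
    firewall_rules: list = field(default_factory=list)
    active: bool = False

    def bring_up(self):
        # One call activates every link and rule in the domain.
        self.active = True

    def tear_down(self):
        # One call removes the links and the rules together, so no
        # firewall rule outlives the application it was created for.
        self.vlinks.clear()
        self.firewall_rules.clear()
        self.active = False


topo = VTopology("webshop")
topo.vlinks.append(VLink("load-balancer", "web-tier"))
topo.firewall_rules.append("allow tcp/443 to web-tier")
topo.bring_up()
topo.tear_down()
print(topo.active, topo.vlinks, topo.firewall_rules)  # prints: False [] []
```

The point of the design is ownership: because the rules live inside the topology object, there is no separate firewall inventory to clean up by hand.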

I will watch with interest to see if companies like Embrane and PlumGrid end up turning the SDN world on its head and pumping some adrenalin into the dull world created by hardware vendor monopolies and their interpretation of SDN.

Categories: NFV, SDN

Software Defined Networking – What’s in it for the customer?

December 23, 2013

What’s in it for me the customer?

While the terms Software Defined Networking (SDN) and Network Functions Virtualization (NFV) are often used in the same breath, they are distinct but complementary technologies that offer greater promise when used together. NFV has to do with virtualizing network functions such as routing, switching and firewalling so that they run as software on standard servers, while SDN separates the control plane and data plane of your network, leading to a programmable network which in turn facilitates easier deployment of business applications over that network. Rather than get bogged down in what the router/switch vendors are saying about SDN or NFV, let us step back and hear the perspective of the customer. What is motivating a telecom provider or an enterprise data center to consider SDN or NFV?

AT&T views NFV as a tool to reduce cycle time for rolling out new services and removing old services.  AT&T seeks a common API (not necessarily an open API) between the SDN controller and the physical network devices.  It recognizes that you can put some network software on Commercial off-the-shelf (COTS) devices but may end up retaining custom network hardware with proprietary ASICs to get the acceleration and higher throughput that COTS devices may not deliver.

Deutsche Telekom makes a fine distinction: it sees in SDN a way to program “network services” (not network products) from a centralized location for a multi-vendor network located partly in-house and partly in the cloud.

NTT Communications states 3 goals: reduce CapEx and OpEx, differentiate based on services, and achieve faster time-to-market. NTT was an early adopter of Nicira NVP, Nicira SDN controllers and VMware vCloud Director. It virtualizes network connectivity using NEC ProgrammableFlow solutions (NEC SDN controllers for an OpenFlow network). NTT Communications also collaborates with NTT Labs on Ryu, an open source OpenFlow controller.

If you ask hyperscale data center operators like Facebook, Google and Rackspace, they have slightly different goals.

Facebook has a goal of gaining greater control of their networking hardware and taking away “secret ASIC commands” issued by router/switch vendors to equipment in the Facebook data center.

Google has as its goal a single view of the network, better utilization of network links and hit-less software upgrades.

A trading firm, if asked for its expectations from SDN, might say that it wants to offer its customers big pipes with very low latency during trading hours, and that after hours it would like to pay its carrier less for the same big pipe at a higher latency, to carry less latency-sensitive applications like online backup.

The common thread here is services, the ability to roll-out services more effectively and create differentiation in a crowded price-sensitive market.   In the follow-on articles we’ll look at what each of the major vendors has to offer and pros/cons of each SDN solution.

Categories: NFV, SDN

Software Defined Networking – Promise versus Reality

November 9, 2013

The promise of Software Defined Networking (SDN) was to abstract the human network manager from vendor specific networking equipment. SDN promised network managers at cloud providers and webscale hosting companies a utopian world where they could define network capacity, membership, usage policies and have that magically pushed down to underlying routers/switches regardless of which vendor made the router/switch. It offered service providers a way to get new functionality, reconfigure and update existing routers/switches using software rather than having to allocate shrinking CAPEX budgets for newer router/switch hardware. How it proposed to achieve all this was by separating the control plane from the network data plane. The SDN controller was to have a big view of the needs of the applications and translate that need into appropriate network bandwidth.

Just as combustion engine based auto-makers panicked at the dawn of electric cars, proprietary router/switch vendors saw their ~60% gross margins at risk from SDN and hurriedly came up with their own interpretations of SDN. Cisco’s approach, based on technology from the “spin-in” of Insieme Networks, is that if you as a customer want the benefits of SDN, Cisco will sell you a new application-aware switch (the Nexus 9000) running an optimized new OS (available in 2014) on merchant silicon (Broadcom Trident II), and you’ll have OpenFlow, OpenDaylight controllers and a control plane that is decoupled from the data plane. It assumes that customers will live with the lack of backward compatibility with older Cisco hardware like the Nexus 7000. There is a silver lining to this argument: should you choose to forgo the siren song of open hardware and merchant silicon and return to the Cisco fold, you will be rewarded with an APIC policy controller (in 2014) which will manage compute, network, storage, applications and security as a single entity. APIC will give you visibility into application interaction and service level metrics. Cisco also claims that its Application Centric Infrastructure (ACI) will lower TCO by eliminating the per-VM tax imposed by competitor VMware’s network virtualization platform NSX and will reduce dependence on the VMware hypervisor. VMware, with Nicira under its belt, will of course disagree and have its own counter-spin.

Juniper’s approach was to acquire Contrail and offer Contrail (commercial version) and OpenContrail (open source version) instead of OpenDaylight. This is Linux-based network overlay software designed to run on commodity x86 servers, aiming to bridge physical networks and virtual computing environments. Contrail can use OpenStack and CloudStack as the orchestration layer but won’t support OpenFlow.

Startup Big Switch Networks (the anti-overlay-software startup) has continued to use OpenFlow to program switches, supposedly 1000 switches per controller. Once considered the potential control plane partner of the major router/switch vendors, it has been relegated to a secondary role, quite possibly because Cisco and Juniper have no intention of giving up their cozy gross margins to an upstart. Another startup, Plexxi (the anti-access-switch startup), relies on its own SDN controller and switches connected together by wavelength division multiplexing (WDM). Its approach is the opposite of that taken by overlay software like Contrail, since it is talking about assigning a physical fiber to a flow.

Where do SSDs play in all this?

Startup SolidFire makes iSCSI block storage in the form of 1U arrays crammed with SSDs and interconnected by 10GbE.  Service providers seem to like the SolidFire approach as it offers them a way to set resource allocation per user (read IOPS per storage volume) for the shared storage.  Plexxi is an SDN startup with its own line of switches communicating via Wave Division Multiplexing and its own SDN controller with software connectors.  Plexxi and SolidFire have jointly released an interesting solution involving a cluster of all flash storage arrays from SolidFire and a Plexxi SDN controller managing Plexxi switches.

It appears that the Plexxi connector queries the SolidFire Element OS (cluster manager), learns about the cluster, converts this learned information into relationships (“affinities” in Plexxi-speak) and hands them down to a Plexxi SDN controller. The controller in turn manages Plexxi switches sitting atop server racks. What all this buys a service provider is a way to carry array-level quality-of-service (QoS) from SolidFire over into network-level QoS across the Plexxi switches.

While the big switch vendors are duking it out with technology from Insieme versus Contrail, relying on expensive spin-ins versus acquisitions, their service provider customers like Colt, ViaWest (a customer of Cisco UCS servers), Databarracks and others who use SolidFire arrays are looking with interest at solutions like the Plexxi-SolidFire solution mentioned above, which promises tangible ROI from deploying SDN. Vendors selling high-margin switches would do well to notice that the barbarians are at the gates and the citizenry of service providers is quietly preparing to embrace them.

Categories: SDN

OpenStack and solid state drives

October 20, 2013

If you are a service provider or enterprise considering deploying private clouds using OpenStack (an open source alternative to VMware vCloud) then you are in the company of other OpenStack adopters like PayPal and eBay.  This article considers the value of SSDs to cloud deployments using OpenStack (not Citrix CloudStack or Eucalyptus).

Block storage & OpenStack: If your public or private cloud is supporting a virtualized environment where you want up to a Terabyte of disk storage to be accessible from within a virtual machine (VM) such that it can be partitioned/formatted/mounted and stays persistent till the user deletes it, then your option for block storage is any storage for which OpenStack Cinder (an OpenStack project for managing storage volumes) supports a block storage driver. Open source block storage options include:

Proprietary alternatives for OpenStack block storage include products from IBM, NetApp, Nexenta and SolidFire.

Object storage & OpenStack: On the other hand if your goal is to access multi terabytes of storage and you are willing to access it over a REST API and you want the storage to stay persistent till the user deletes it, then your open source options for object storage include:

  • Swift – A good choice if you plan to distribute your storage cluster across many data centers.  Here objects and files are stored on disk drives spread across numerous servers in the data center.  It is the OpenStack software that ensures data integrity & replication of this dispersed data
  • Ceph  – A good choice if you plan to have a single solution to support both block and object level access and want support for thin-provisioning
  • Gluster – A good choice if you want a single solution to support both block and file level access

Solid state drives (SSD) or spinning disk?

An OpenStack Swift cluster that has high write requirements would benefit from using SSDs to store metadata.  Zmanda (a provider of open source backup software) has run benchmarks to prove that SSD based Swift containers outperform HDD based Swift containers especially when the predominant operations are PUT and DELETE.  If you are a service provider looking to deploy a cloud based backup/recovery service based on OpenStack Swift and each of your customers is to have a unique container assigned to them, then you stand to benefit from using SSDs over spinning disks.

Turnkey options?

As a service provider if you are looking for an OpenStack cloud-in-a-box to compete with Amazon S3 consider vendors like MorphLabs.   They offer turn-key solutions on Dell servers with storage nodes running NexentaStor (commercial implementation of OpenSolaris and ZFS), KVM hypervisor, VMs running Windows or Linux as the guest OS all on a combination of SSDs and HDDs.  The use of SSDs allows MorphLabs to claim lower power consumption and price per CPU as compared to “disk heavy” (their term not mine) vBlock (from Cisco & EMC) and FlexPod (from NetApp) systems.

In conclusion if you are planning to deploy clouds based on OpenStack, SSDs offer you some great alternatives to spinning rust (oops disk).

Categories: Big Data and Hadoop

Price per GB for SSD – why it’s not always the best yardstick

October 12, 2013

Price per GB, performance and endurance are the yardsticks used to decide which solid state drive (SSD) to buy for use in a corporate data center or to use in a server or flash based storage array.

Are endurance numbers really comparable? Especially when you consider that one vendor might use consumer-grade cMLC NAND while another might use enterprise-grade eMLC NAND, with vastly different program/erase (P/E) cycle ratings? What write amplification factor (WAF), the ratio of the data the SSD controller writes to NAND to the data the host writes, was used for the calculation? One vendor might quote endurance in TB or PB written while another might use Drive Writes Per Day (DWPD).
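To put two datasheets on the same footing you can convert between the two endurance units yourself. The sketch below uses the usual relationship (total bytes written = DWPD × capacity × days in the warranty period); the function names and the two drives are hypothetical, and it assumes decimal units (1 TB = 1000 GB).

```python
def dwpd_to_tbw(dwpd, capacity_gb, warranty_years):
    """Convert Drive Writes Per Day into total terabytes written."""
    return dwpd * capacity_gb * 365 * warranty_years / 1000


def tbw_to_dwpd(tbw, capacity_gb, warranty_years):
    """Convert total terabytes written into Drive Writes Per Day."""
    return tbw * 1000 / (capacity_gb * 365 * warranty_years)


# Hypothetical drives: A is rated in DWPD, B is rated in TB written.
drive_a_tbw = dwpd_to_tbw(dwpd=10, capacity_gb=400, warranty_years=5)
drive_b_dwpd = tbw_to_dwpd(tbw=3650, capacity_gb=800, warranty_years=5)

print(drive_a_tbw)   # prints: 7300.0
print(drive_b_dwpd)  # prints: 2.5
```

Note that the conversion says nothing about the WAF or workload the vendor assumed when arriving at its rating, which is exactly why the raw numbers are hard to compare.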

One vendor might state fresh-out-of-the-box (FOB) performance numbers in IOPS on its datasheet while another might display steady-state numbers. One might use a synthetic benchmark tool like Iometer, which focuses on queue depth (number of outstanding I/Os), block size and transfer rates, instead of an application-based benchmark like SYSmark, which ignores all these criteria and focuses on testing how a real-world application might drive the SSD. Even with a tool like Iometer, the results will vary depending on whether the 2006, 2008 or 2010 version is used. To add further complexity, the performance numbers will vary widely depending on whether they were measured with a queue depth of 3, 32, 64, 128 or 256. To compound it, one vendor might be looking at compressible data (Word docs, spreadsheets) while another might be quoting numbers for incompressible data (.zip or .jpeg files); some might be using SandForce (now LSI) controllers, which compress data before writing it to NAND, while others might not. So what is an SSD buyer to do? Get a drive from a vendor you trust, run your own benchmarks, whether synthetic or application-based, and draw your own conclusions.

Now why do I find $ per GB as a yardstick amusing? Consider this analogy: could we convince a Japanese consumer that the cantaloupe we buy from a local store for $2.99 here in California is equivalent to a musk melon purchased in Japan for 16,000 yen? From a price-per-melon point of view the difference is difficult for us to fathom, but to the buyer of the 16,000 yen melon it is apparently a premium worth paying.

Categories: DRAM, SSD

DRAM drives for ZFS based systems and apps like High Frequency Trading

September 21, 2013

Assume that you are a storage systems integrator with two areas of focus:

  • High frequency Trading
  • High end NAS filers for enterprise use

Consider your first area of focus – High frequency trading (HFT).  As we discussed before it is extremely fast trading by traders using supercharged computers, complex programs & algorithms and servers tactically co-located next to servers at a trading exchange.  HFT traders capitalize on the split second advantages offered by their unique systems.  If you are building computer infrastructure for high frequency trading then when it comes to memory devices that can handle transactional logging you have to choose between HDD, NAND flash based SSD and DRAM based SSDs.

For your second area of focus, you may be building high end NAS filers based on the Zettabyte File System and you will need an accelerator device for the ZFS Intent Log (ZIL).  This accelerator device has to be optimized for synchronous writes and must exhibit two key characteristics:

  • Extremely low latency
  • Very high sustained write IOPS

In addition, the ZIL accelerator has to be accessible from both nodes of a 2-node cluster (for high availability). This precludes a single-port SATA NAND flash SSD and requires that you consider dual-ported SAS SSDs. While a traditional NAND flash based SAS SSD will give you an advantage over the fastest HDD, you are still dealing with SSDs which wear out over time. You are forced to decide between an eMLC NAND based SSD with 30,000 Program/Erase (P/E) cycles or a relatively expensive SLC NAND based SSD with 200,000 P/E cycles.

Ideally you want a Non Volatile Memory (NVM) device with infinite endurance, ultra-low latency and extremely high sustained IOPS performance which stays consistent regardless of the IO distribution (random, sequential or mixed). Such a device would give you the best of both worlds: the SSD form factor that you are familiar with and the infinite endurance of DRAM. However, DRAM based SSDs on the market today are more likely to exhibit latencies of ~23 microseconds, ~65,000 sustained 4K read IOPS and 50,000 sustained 4K write IOPS. I would contend that you’d be better served by emerging products that exhibit ultra-low latencies of less than 5 microseconds and 125,000 sustained 4K read IOPS. When I think of how far these emerging DRAM drives have come in terms of performance, I’m reminded of the Oldsmobile jingle “This is not your father’s Oldsmobile”.

However, if I leave you with the conclusion that these blazing fast DRAM drives are relevant only for HFT and ZFS ZIL then I would be doing you a disservice. I believe that these emerging DRAM drives would be ideal devices to store log files for write-intensive mail systems, logs for Microsoft Exchange and metadata for file systems which allow you to separate metadata from file data.

So what do you think?  Is a DRAM drive in your future?  All constructive feedback from integrators and storage practitioners is welcome.

Categories: Big Data and Hadoop

High frequency trading and SAS SSD

June 3, 2013
High Frequency Trading

Considering that more than half of all the stock market trades in the USA come from high frequency trading let us consider how the emergence of Serial Attached SCSI (SAS) SSDs helps this particular segment.

High Frequency Trading (HFT) involves looking for obscure signals in the market (including spikes in interest rates) using very high end servers, making trading decisions and conveying orders to the exchanges in microseconds (millionth of a second).  Sophisticated HFT algorithms on servers placed close to the trading exchanges, combined with technologies like wireless microwave instead of fiber optics, help ensure that trading decisions occur in micro-seconds. Such HFT applications typically use high end servers with multi-core processors along with hardware and firmware optimized for the lowest possible latencies.  This is where solid state drives come in.  With no moving parts, no noise, very low power consumption, 500x improvement in IOPS (100,000 IOPS  for SSD versus 200 IOPS from a 15000 rpm SAS HDD), what’s not to like about SSDs?  In terms of workloads, SSDs outperform HDDs when it comes to random small block (8KB or 4KB block size) workloads as found in apps like OLTP.

In enterprise SSDs you have a choice: Serial ATA (SATA) or Serial Attached SCSI (SAS).  Servers used in HFT would benefit from SAS over SATA SSDs.  Why is that you ask?

  • Multi-path: SAS SSDs offer dual-porting (for high availability – if one path to the data on the SSD goes down there is another path to access the same data)
  • Longer cable lengths (25 feet versus 3 feet for SATA) due to the use of higher signaling voltages by SAS.
  • Greater transfer rates or throughput: Between 6 Gb/s and 12 Gb/s for SAS.
  • Support for wide ports:  Multiple paths between the server and the SSD device.
  • Data integrity end-to-end:  Achieved using cyclic redundancy checks (CRC) from the time data leaves the server travels to the SSD and returns back to the server.
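The end-to-end integrity idea in the last bullet can be illustrated in a few lines of Python. This is only a sketch of the principle: it uses `zlib.crc32` from the standard library as a stand-in checksum, not the actual CRC or framing that SAS defines, and the `send`/`verify` helpers are hypothetical names.

```python
import zlib


def send(payload: bytes):
    """Attach a CRC-32 checksum to the payload before it leaves the sender."""
    return payload, zlib.crc32(payload)


def verify(payload: bytes, checksum: int) -> bool:
    """Recompute the checksum at the receiver and compare with what was sent."""
    return zlib.crc32(payload) == checksum


data, crc = send(b"BUY 100 ACME @ 12.34")
intact = verify(data, crc)
corrupted = verify(b"BUY 900 ACME @ 12.34", crc)  # one byte changed in flight
print(intact, corrupted)  # prints: True False
```

The same check repeated at each hop is what lets a corrupted block be detected before it is acted upon.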

To understand the need for SAS SSDs we must first understand what an SSD is all about. An SSD comprises a device controller (made by Marvell, SandForce, Indilinx, PMC-Sierra etc.) behind which is NAND flash memory (made by Micron, Samsung, Toshiba etc.), managed by a management system within the enclosure of the SSD.

NAND flash memory refers to non-volatile memory (contents are retained even when power to the circuit is shut off) whose cells are wired in series in an arrangement resembling a NAND logic gate. The SSD’s interface to the server could be SATA, SAS (6 Gb/s or 12 Gb/s) or PCIe.

SSDs write information in a sequential manner into NAND flash in contiguous blocks (each block has multiple pages, each of which is ~8 KB in size). Unlike mechanical HDDs, which can merrily overwrite information, SSDs have a quirk: to reclaim a page the SSD has to erase the entire block where the page resides. (A human analogy might be where you and your significant other are enjoying a candle-lit dinner in a restaurant when the maître d’hôtel walks over to request that you move to another location in the restaurant, as a celebrity party needs to be seated in a contiguous set of tables within the same section of the restaurant.) In SSD terminology the flash controller (the maître d’hôtel in our human example) would be doing “garbage collection” by relocating the still-valid pages of a block to a new block before erasing the first block.
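A toy simulation makes the cost of garbage collection visible. The Python sketch below is a deliberately tiny model, not how any real flash controller works: one four-page block, a hypothetical `TinyFlash` class, and a "spare block" that is implied rather than modeled. It counts host writes versus NAND writes so you can see where write amplification comes from.

```python
PAGES_PER_BLOCK = 4  # toy value; real blocks hold far more pages


class TinyFlash:
    """Toy model of one flash block whose pages cannot be overwritten in place."""

    def __init__(self):
        self.block = []        # (logical_page, is_valid) entries, in write order
        self.host_writes = 0   # writes requested by the host
        self.nand_writes = 0   # page programs actually performed on NAND
        self.erases = 0

    def write(self, logical_page):
        self.host_writes += 1
        # No overwrite in place: the old copy is only marked invalid.
        self.block = [(lp, valid and lp != logical_page) for lp, valid in self.block]
        if len(self.block) == PAGES_PER_BLOCK:
            self._garbage_collect()
        self.block.append((logical_page, True))
        self.nand_writes += 1

    def _garbage_collect(self):
        # Copy still-valid pages to a fresh block, then erase the old one.
        survivors = [entry for entry in self.block if entry[1]]
        if len(survivors) == PAGES_PER_BLOCK:
            raise RuntimeError("block is full of valid pages; nothing to reclaim")
        self.nand_writes += len(survivors)
        self.erases += 1
        self.block = survivors


flash = TinyFlash()
for page in ["A", "B", "A", "B", "A", "B"]:
    flash.write(page)

waf = flash.nand_writes / flash.host_writes
print(flash.host_writes, flash.nand_writes, flash.erases, round(waf, 2))
# prints: 6 7 1 1.17
```

Six host writes cost seven NAND page programs and one block erase here; the extra program is the valid page that had to be moved out of the way, just like the diners in the restaurant.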

Enterprise class SSDs are usually described in terms of:

  • Endurance or Program/Erase cycles (the more you write to a NAND flash cell the weaker it becomes eventually getting marked as a bad cell).  Wear leveling usually done by the flash controller refers to a way to prolong the life of the NAND flash blocks by distributing writes across blocks.
  • Write amplification factor (refers to the amount of data the flash controller in the SAS SSD actually writes to NAND in relation to the amount of data the SAS controller in the server asked it to write)
  • Drive writes per day – DW/D (10 to 25 DW/D) over a period of many years (3 to 5 years).
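These three metrics are linked, and a back-of-the-envelope estimate shows how. The sketch below is a common first-order approximation, not a vendor formula: it assumes perfect wear leveling and a constant WAF, and the function name and all input figures are hypothetical.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, host_gb_per_day, waf):
    """Rough drive lifetime, assuming perfect wear leveling and constant WAF."""
    total_nand_writes_gb = capacity_gb * pe_cycles   # what the NAND can absorb
    nand_gb_per_day = host_gb_per_day * waf          # what the NAND actually sees
    return total_nand_writes_gb / nand_gb_per_day / 365


# Hypothetical 400 GB drive rated for 30,000 P/E cycles, written at 10 DW/D
# (4000 GB of host writes per day) with a write amplification factor of 2.
years = estimated_lifetime_years(
    capacity_gb=400, pe_cycles=30_000, host_gb_per_day=4000, waf=2.0
)
print(round(years, 2))  # prints: 4.11
```

Halving the WAF doubles the estimate, which is why the assumed WAF matters so much when comparing endurance claims.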

When comparing SAS SSDs you’d look at features like:

  • Type of NAND (Usually MLC, eMLC or cMLC) used.
  • Performance at “steady state” not just out-of-the-box.
  • Mean time between failure (MTBF) and mean time to data loss.
  • Sustained read/write throughput in MB/s.
  • Read/write IOPS with 4 KB random operations
  • IOPS with a mixed workload of read and writes.
  • Encryption (128 bit or 256 bit AES compliant)
  • Unrecoverable bit error rates
  • Protection for any in-flight data in the event of a sudden power loss to the SSD
  • Extent to which SMART (Self-Monitoring, Analysis and Reporting Technology) attributes are supported

SAS SSDs are a good choice wherever you have enterprise class servers with workloads that have write-once read-many characteristics. In conclusion, if you have high end applications like HFT running on enterprise class servers you need enterprise class SAS SSDs. This is not to imply that only the HFT segment benefits from SAS SSDs; the same benefits can also be experienced across other verticals like Web 2.0 and cloud computing.

Categories: Big Data and Hadoop, SSD

Microservers for power savings

April 28, 2013

If you have a hammer everything looks like a nail. Conversely if all you have known are power hungry servers then you probably used them to address every workload.

Microservers came into the mainstream when cloud providers realized that they don’t need beefy power-hungry servers if they are not running heavy computing workloads.  If a provider  is to run less demanding compute jobs like serving contact information up to a user on your website, why deal with the floor tile space, power consumption, HVAC costs associated with high end servers?  Why not benefit from the power savings associated with Intel Atom or ARM based servers?

For instance, HP runs a portion of its website on its own microservers using Intel “Centerton” Atom S1200 processors, with appreciable power savings. HP claims that microservers consume “89 per cent less energy and 60 per cent less space”. This assumes 1600 Moonshot Calxeda EnergyCore microservers (based on an ARM SoC) crammed into ½ a server rack to do the job typically done by 10 racks of 1U servers.

HP markets the microservers as suitable for hyperscale workloads.  A hyperscale workload is a lightweight workload that has to be done in large numbers so requires scaling from a few servers to several  thousand servers.   To appreciate the low power consumption of ARM based servers check out this article on “bicycle powered ARM servers”.

One concern that makes cloud providers hesitate to use ARM based microservers is whether legacy apps written for x86 based servers can be ported over to ARM based microservers. To address this, a UK-based company called Ellexus sells a product called Breeze which makes it easy to migrate applications from x86 based servers to ARM based servers. What if you are not sure that Breeze can do the job for you? Ellexus partners with a cloud provider, Boston, which offers an ARM-as-a-cloud service so you can actually try migrating your apps using Breeze without the initial CAPEX of buying ARM based microservers for your in-house datacenter.

Dell offers the Intel Xeon E3-based Dell PowerEdge C5220 microservers.  IBM is talking about Hadoop on the P5020.

To get further savings and cut out the 25% to 30% gross margins made by HP and Dell, consider going directly to their Original Design Manufacturers (ODMs) like Quanta. Taiwan-based Quanta, the server arm of $37bn Quanta Computer, makes “white box” servers for big names like Google, Facebook, Amazon, Rackspace, Yahoo and Baidu. The Quanta S910 series is an example of a microserver using Intel Xeon processors. Quanta addresses the Facebook Open Compute Project, which is of interest to cloud operators like Rackspace. The Open Compute Project, and specifically Facebook’s “Group Hug” motherboard spec, aims to support different vendors’ CPUs on the same system. This news will send shivers through the boardrooms of Intel and AMD but is good for you the consumer, as you are not beholden to one CPU vendor.

At a time when major server vendors like IBM are trying to exit the x86 server business it may make sense to cut out the big name server vendors and deal directly with their suppliers if not for savings, at least to ensure a long term consistent roadmap with ever more computing capacity and lower power and space consumption.

Using NetFlow/IPFIX to detect cyber threats

April 26, 2013

Most commercial switches, routers and firewalls in your datacenter support NetFlow (network logs) and in some cases IPFIX (the next-gen replacement for NetFlow version 9). Why not use NetFlow to alert you to cyber threats like worms, botnets and Advanced Persistent Threats (APTs)?

The idea of using NetFlow is not a new one.  Argonne National Labs have been using NetFlow to detect zero-day attacks since 2006.

An Advanced Persistent Threat (APT) starts by mining employee data from Facebook, LinkedIn and other social media sites and focuses on stealing a corporation’s intellectual property, using innocuous applications like Skype to move the content around. APTs fly under the radar of signature-based perimeter security appliances like firewalls and Intrusion Detection Systems (IDS). However, you can use NetFlow/IPFIX to identify APTs by comparing flows in the NetFlow/IPFIX collector with a host reputation database offered by cloud services like McAfee GTI. The actual ingest of the host reputation database and comparison with flows would involve a tool like Plixer Scrutinizer™. By blocking traffic going to the known compromised hosts (which host the APT command and control malware) you are neutralizing the goal of the adversary who sent the APT into your network.

Why bother with IPFIX (next-gen NetFlow) when there are older versions of NetFlow?

You can export URL information via IPFIX (using vendor extensions supported by IPFIX).  This allows you to determine what URL a user clicked on before succumbing to malware.  How many other people clicked on the same bad URL?  Products which export URL information via IPFIX include Ntop nProbe, Dell SonicWALL, Citrix AppFlow.

For Voice-over-IP (VoIP) traffic you can export details like caller-id, codec, jitter and packet loss.

Why use dedicated NetFlow/IPFIX sensors when routers/switches/firewalls may suffice? 

Even router vendors like Cisco recognize that customers who buy high end routers may not want to expend expensive CPU cycles on NetFlow/IPFIX generation, nor rely on sampled NetFlow, which is unusable for cyber-security applications. The need is for offload appliances that produce packet-accurate, unsampled NetFlow/IPFIX. Cisco’s own NetFlow Generation Appliance (NGA) is an option. The older NGA 3140 tops out at 120,000 frames per second (fps). Higher end offload appliances from some vendors can sustain 250,000 to 500,000 fps to keep up with busy 10 Gb network pipes.

So we have a way to generate NetFlow/IPFIX but what about the analytics needed to actually detect cyber-attacks?  While you may have a traditional SIEM (HP ArcSight ESM, McAfee ESM, IBM QRadar) or a tool like Splunk it is unrealistic to send NetFlow/IPFIX data at very high rates into these systems.  A better way would be to trim down the traffic and analyze it on the wire before sending it to the SIEM.

NetFlow Logic, a Bay Area startup, has a high-volume NetFlow processing product, “NetFlow Integrator”, which can ingest NetFlow/IPFIX records and process the stream in flight using an in-memory database.  The product scales its throughput based on the number of underlying server cores.  For instance, a 16-core server would allow it to scale throughput to over 500,000 fps.

The product is not a NetFlow collector but is categorized as a NetFlow/IPFIX Mediator (see RFC 5982).  NetFlow Integrator reduces NetFlow data by consolidating information into “conversations” rather than reporting each flow within a conversation.  Flow records are processed by one or more rules (canned or custom, created using a GUI/SDK), each of which applies its own logic to every flow record.  These rules can aid in the following types of detection:

Detection of Botnet:

[Figure: Botnet detection using NetFlow Integrator]

A user would load a list of known Command & Control servers (possibly obtained from sites like Emerging Threats or from your own private source) into the rule.  Every incoming NetFlow record is examined to determine if the source or destination IP address matches this list.  If there is a match, the matched information is forwarded to the SIEM.  The SIEM in turn will alert a security analyst if any botnet slaves are detected on the network.
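The matching step described here can be sketched in a few lines of Python.  This is only an illustration of the idea, not NetFlow Integrator's rule SDK: the `FlowRecord` fields, the function names and the blocklist format (one IP address or CIDR block per line, `#` for comments) are all assumptions made for the example.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical, trimmed-down flow record (not a real NetFlow template)."""
    src_ip: str
    dst_ip: str
    dst_port: int
    byte_count: int


def load_blocklist(lines: Iterable[str]) -> List[Network]:
    """Parse one address or CIDR block per line, skipping blanks and # comments."""
    networks = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        # A bare address such as 203.0.113.7 becomes a /32 network.
        networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def matches_blocklist(flow: FlowRecord, networks: List[Network]) -> bool:
    """True if either endpoint of the flow falls inside a listed network."""
    for text in (flow.src_ip, flow.dst_ip):
        address = ipaddress.ip_address(text)
        if any(address in network for network in networks):
            return True
    return False


def flag_suspect_flows(flows: Iterable[FlowRecord],
                       networks: List[Network]) -> Iterator[FlowRecord]:
    """Yield only the flows worth forwarding to the SIEM."""
    return (flow for flow in flows if matches_blocklist(flow, networks))


if __name__ == "__main__":
    blocklist = load_blocklist(["# example C&C list", "203.0.113.7", "198.51.100.0/24"])
    sample = [
        FlowRecord("10.0.0.5", "203.0.113.7", 443, 1200),
        FlowRecord("10.0.0.6", "192.0.2.10", 80, 300),
    ]
    for hit in flag_suspect_flows(sample, blocklist):
        print("forward to SIEM:", hit)
```

A linear scan over the list is fine for a sketch; at hundreds of thousands of records per second a real implementation would want a hash set or a prefix tree for the lookup.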

APT detection:

NetFlow Integrator has rules to identify scanners that are doing “port sweeps” of your network.  It can also look for data exfiltration by watching for an infected host that starts spreading across the internal network.  A custom algorithm detects when a client suddenly starts behaving like a server.  This is something that can’t be done by signature-based firewalls.
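As a rough illustration of what a port-sweep rule might look at, the Python sketch below counts how many distinct hosts each source contacts on the same destination port and flags sources above a threshold.  This is a simple heuristic assumed for the example, not NetFlow Integrator's actual algorithm; the tuple layout, the function name and the threshold of 20 hosts are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical minimal flow tuple: (src_ip, dst_ip, dst_port)
Flow = Tuple[str, str, int]


def find_port_sweepers(flows: Iterable[Flow],
                       min_hosts: int = 20) -> Dict[Tuple[str, int], int]:
    """Return {(src_ip, dst_port): distinct_host_count} for likely sweeps.

    A source that touches the same port on many different hosts within the
    analysed window looks like a sweep; `min_hosts` is an arbitrary cut-off.
    """
    targets: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for src_ip, dst_ip, dst_port in flows:
        targets[(src_ip, dst_port)].add(dst_ip)
    return {
        key: len(hosts)
        for key, hosts in targets.items()
        if len(hosts) >= min_hosts
    }


if __name__ == "__main__":
    # One host probing SSH on 30 neighbours, plus ordinary repeat traffic.
    window = [("10.0.0.99", f"10.0.1.{i}", 22) for i in range(1, 31)]
    window += [("10.0.0.5", "10.0.1.1", 443)] * 50
    for (src, port), count in find_port_sweepers(window).items():
        print(f"{src} contacted {count} hosts on port {port}")
```

Note that the busy client talking to a single server 50 times is not flagged, because the rule counts distinct destinations rather than raw flow volume.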

In conclusion, to detect malware/botnets/APTs, use NetFlow/IPFIX, which is something your routers, switches and firewalls support today.  Keep your existing SIEM in place but introduce an IPFIX offload appliance, especially if you have large 10 Gb network pipes and don’t want to burden your routers.  Use a tool like that from NetFlow Logic to analyze NetFlow/IPFIX records on the wire, and use your SIEM for the alerting and remediation.

Detecting zero day attacks using big data

April 16, 2013

Snort and Intrusion Detection: 

Snort is a widely used open-source network Intrusion Detection System (NIDS) capable of both real-time traffic analysis and packet logging.  The reason for its popularity is that it is open source and effective at detecting everything from port scans and buffer overflows to OS fingerprinting attempts.  Known attacks follow a certain activity pattern, and these patterns are captured in “signatures” available from the open source community as well as from Sourcefire.

[Figure: Snort architecture]

In the Snort architecture, the packet sniffer, as the name suggests, eavesdrops on network traffic; the preprocessor checks packets against plug-ins to determine if the packets exhibit a certain behavior; and the detection engine takes incoming packets and runs them through a set of rules.  If the rules match what is in the packet then an alert is generated.  The alert may go to a log file or to a MySQL or PostgreSQL database.

Using Snort on big data stored in Hadoop

What if today you received new Snort signatures you didn’t have 3 months ago but want to use the new signatures to detect zero-day attacks (unknown exploits) in historical packet capture data?  This historical packet capture data may be in archive storage within your corporate data center or located on cloud storage like Amazon S3.

One solution is to analyze full packet captures using Apache Pig (a tool that abstracts a user from the complexities of MapReduce).  If you aren’t comfortable using MapReduce but have a few days of packet capture data on your laptop and know how to write some queries to query this local capture data, you can then transition your queries to a Hadoop cluster containing weeks or months of packet capture data using an open-source tool called PacketPig.

PacketPig (an open source project located on GitHub) offers many loaders (Java programs which provide access to specific info in a packet capture), one of which is SnortLoader(), which allows you to analyze months of packet capture data dispersed across Hadoop nodes.  The way you detect a zero day attack using PacketPig is by using SnortLoader() to scan archived packet capture data using old Snort signatures (from an old snort.conf file), then scanning it a second time using the latest Snort signatures.  After filtering out signatures that appear in both scans, what you have left are zero day attacks.  More details of how this is done may be found here.  This is yet another example of how Hadoop running on commodity servers with direct attached storage can help provide a cyber-security solution for zero day attacks.
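The filtering step at the end is essentially a set difference.  The Python sketch below shows that logic on its own, outside Pig and Hadoop; it is not PacketPig code.  The `Alert` fields and the idea of keying on the Snort signature ID (sid) are assumptions made for the illustration, and the sample data is made up.

```python
from typing import Iterable, List, NamedTuple


class Alert(NamedTuple):
    """Hypothetical, simplified alert produced by one scan of the archive."""
    sid: int          # Snort signature ID that fired
    timestamp: str
    src_ip: str
    dst_ip: str


def newly_detected(old_scan: Iterable[Alert],
                   new_scan: Iterable[Alert]) -> List[Alert]:
    """Keep alerts from the new-signature scan whose sid never fired in the old scan.

    These are attacks that were present in the archived traffic but that
    the old signature set could not see.
    """
    old_sids = {alert.sid for alert in old_scan}
    return [alert for alert in new_scan if alert.sid not in old_sids]


if __name__ == "__main__":
    # Made-up sample data: the same capture scanned with two rule sets.
    old = [Alert(1001, "2013-01-10T08:00:00", "192.0.2.4", "10.0.0.8")]
    new = [
        Alert(1001, "2013-01-10T08:00:00", "192.0.2.4", "10.0.0.8"),
        Alert(2002, "2013-01-12T23:15:00", "198.51.100.9", "10.0.0.8"),
    ]
    for alert in newly_detected(old, new):
        print("missed by the old signatures:", alert)
```

On a Hadoop cluster the same comparison would be expressed as a join or filter over the two scan outputs rather than as in-memory sets.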