
Security for Big Data in Hadoop

April 15, 2013

Why secure the Hadoop cluster?

Hadoop has two key components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS consists of geo-dispersed DataNodes accessed via a NameNode. Hadoop was designed without much security in mind, so a malicious user can bypass the NameNode and access a DataNode directly; if he or she knows the block location of the data, that data can be retrieved or modified. In addition, data being sent from a DataNode to a client can easily be sniffed using generic packet-sniffing tools.

Authentication solution

Securing the Hadoop cluster requires understanding that authentication (is the user who claims to be Bob really Bob?) differs from authorization (is user Bob authorized to submit HDFS or MapReduce jobs?).
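The distinction can be made concrete with a toy sketch. The user names, passwords, and actions below are entirely hypothetical; the point is only that the two checks are separate questions:

```python
# Toy illustration of the authentication / authorization split.
# All names and credentials are made up for this sketch.
USERS = {"bob": "s3cret"}                      # who can prove their identity
ACLS = {"bob": {"submit_mapreduce"}}           # what each identity may do

def authenticate(user: str, password: str) -> bool:
    """Is the user who claims to be Bob really Bob?"""
    return USERS.get(user) == password

def authorize(user: str, action: str) -> bool:
    """Is this (already authenticated) user allowed to perform the action?"""
    return action in ACLS.get(user, set())

assert authenticate("bob", "s3cret")           # Bob proves who he is
assert authorize("bob", "submit_mapreduce")    # and is allowed this action
assert not authorize("bob", "format_hdfs")     # but not this one
```

Passing the first check says nothing about the second, which is why Hadoop needs both mechanisms.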

To address the question of authentication, Hadoop recommends the use of Kerberos, which is bundled with Hadoop. Unlike network firewalls, which assume that all dangers reside outside the corporate network, Kerberos operates on the assumption that network connections are unreliable and a possible weak link. How does Kerberos itself work? In addition to the client and the server, there is a Kerberos Key Distribution Center (KDC), which has two components: an Authentication Server and a Ticket Granting Server.

  • A client that wishes to access a server first authenticates itself with the Authentication Server (AS). The AS returns an authentication ticket called a Ticket-Granting Ticket (TGT).
  • The client then uses the TGT to request a service ticket from the Ticket Granting Server (TGS) in the KDC.
  • Lastly, the client uses the service ticket to authenticate itself with the server, which provides the desired service. In a Hadoop cluster that server could be the NameNode or the JobTracker.
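The three-step exchange above can be sketched as a toy simulation. Everything here is hypothetical: real Kerberos encrypts tickets with symmetric keys (e.g. AES) and includes timestamps and session keys, which this sketch approximates with simple HMAC tags.

```python
# Toy sketch of the Kerberos AS -> TGS -> service exchange. Not real Kerberos:
# tickets are plain HMAC tags and there are no timestamps or session keys.
import hmac, hashlib, os

def tag(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

class KDC:
    """Holds both the Authentication Server (AS) and Ticket Granting Server (TGS)."""
    def __init__(self):
        self.tgs_key = os.urandom(32)   # secret shared by AS and TGS
        self.user_keys = {}             # per-user secrets (derived from passwords)
        self.service_keys = {}          # per-service secrets

    # Step 1: the AS verifies the client and issues a Ticket-Granting Ticket (TGT)
    def authenticate(self, user: str, proof: bytes) -> bytes:
        expected = tag(self.user_keys[user], b"login")
        if not hmac.compare_digest(proof, expected):
            raise PermissionError("bad credentials")
        return tag(self.tgs_key, user.encode())

    # Step 2: the TGS exchanges a valid TGT for a service ticket
    def grant(self, user: str, tgt: bytes, service: str) -> bytes:
        if not hmac.compare_digest(tgt, tag(self.tgs_key, user.encode())):
            raise PermissionError("invalid TGT")
        return tag(self.service_keys[service], (user + service).encode())

class Service:
    """Stands in for a NameNode or JobTracker that trusts the KDC."""
    def __init__(self, name: str, key: bytes):
        self.name, self.key = name, key

    # Step 3: the service verifies the ticket and serves the client
    def serve(self, user: str, ticket: bytes) -> str:
        if not hmac.compare_digest(ticket, tag(self.key, (user + self.name).encode())):
            raise PermissionError("invalid service ticket")
        return f"hello {user}"

kdc = KDC()
kdc.user_keys["bob"] = os.urandom(32)
kdc.service_keys["namenode"] = os.urandom(32)
namenode = Service("namenode", kdc.service_keys["namenode"])

tgt = kdc.authenticate("bob", tag(kdc.user_keys["bob"], b"login"))
ticket = kdc.grant("bob", tgt, "namenode")
print(namenode.serve("bob", ticket))   # the NameNode accepts the ticket
```

Note that the service never sees Bob's password; it only needs to trust the KDC, which is what lets a single KDC vouch for every node in the cluster.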

Benefits of using Kerberos for authentication:

  • Kerberos helps you keep rogue nodes and rogue applications out of your Hadoop cluster.

It is also recommended that the Hadoop daemons themselves use Kerberos to authenticate to one another.
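In practice this is switched on in Hadoop's core-site.xml. A minimal sketch (values shown are the standard ones; consult your distribution's documentation for the full set of per-daemon principal and keytab settings):

```xml
<!-- core-site.xml: enable Kerberos authentication cluster-wide -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>  <!-- the default is "simple", i.e. no authentication -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>      <!-- also enforce service-level authorization -->
</property>
```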

Encryption solution

Encryption of data-in-motion can be achieved using SSL/TLS to encrypt data moving between nodes and applications. Encryption of data-at-rest at the file layer is recommended to protect against malicious users who obtain unauthorized access to DataNodes and inspect files. While the benefits of encryption are clear, encryption is very compute-intensive. Intel has jumped in to address this with its own flavor of Apache Hadoop, optimized to do AES encryption provided your Hadoop nodes use Intel Xeon processors. Intel claims that Intel Advanced Encryption Standard New Instructions (AES-NI), built into Intel Xeon processors, can accelerate encryption performance in a Hadoop cluster by 5.3x and decryption performance by 19.8x.
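To make the data-in-motion side concrete, here is a small sketch using Python's standard ssl module, purely as a stand-in for whatever TLS stack a cluster actually runs. The host name and port in the comment are made up:

```python
import ssl

# Hypothetical client-side TLS setup: validate certificates and refuse
# legacy protocol versions before any data leaves the machine.
ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # no SSLv3 / early TLS
ctx.check_hostname = True                      # reject mismatched certificates
ctx.verify_mode = ssl.CERT_REQUIRED

# A real client would then wrap its socket, e.g.:
#   with socket.create_connection(("datanode.example", 50010)) as raw:
#       with ctx.wrap_socket(raw, server_hostname="datanode.example") as tls:
#           tls.sendall(b"...")
```

With a context like this in place, the generic packet sniffing mentioned earlier yields only ciphertext.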

Cynics would argue that Intel's new-found interest in Hadoop has more to do with keeping low-cost ARM-based microservers at bay than with improving Hadoop security. Whatever the rationale, speeding up encryption and decryption can only increase the use of these data protection techniques by the Hadoop user base.

Security in the future (a gateway to Hadoop)

Hortonworks (along with NASA, Microsoft and Cloudera) is promoting Knox Gateway, a perimeter-security solution that provides a single point of authentication for the Hadoop cluster. Clients that want to access the cluster would first have to traverse the gateway, which itself would reside in a DMZ. Time will tell how the Hadoop user community embraces this new technology.