End-to-End Security and Auditing in a Big Data as a Service Deployment


Nanda Vijaydev - BlueData
Abhiraj Butala - BlueData

Big-Data-as-a-Service (BDaaS)

“A mechanism for the delivery of statistical analysis tools and information that helps organizations understand and use insights gained from large information sets in order to gain a competitive advantage.”

On-Demand, Self-Service, Elastic Big Data Infrastructure, Applications, Analytics

Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification

Multi-Tenant Big-Data-as-a-Service

[Diagram: tenants MARKETING (360 Customer View), R&D (Log Analysis), and MANUFACTURING (Predictive Maintenance) run clusters labeled Prod 2.2, Dev/Test 2.4, POC 2.3, Prod 2.3, and Dev/Test 2.4 on top of a Data/Storage layer made up of a Data Lake and Staging.]

• Multiple compute services (Hadoop, BI, Spark)
• There is a shared Data Lake (shared HDFS)

Why BDaaS? – Compute Side Of The Story

• The set of applications that interact with Hadoop keeps growing

• Various versions of the same app/distro run in parallel

• Enterprises need to scale compute up and down based on usage

• The model is similar to AWS, with S3 as storage and applications on EC2

Why BDaaS? – Data Side Of The Story

• Production cluster access takes time and is generally restricted
• Staging clusters may not have all the data
• Data exists on other storage systems such as NFS; Isilon is common
• Users also want to upload arbitrary files for analysis

Hadoop – A Collection Of Services

Hadoop is a collection of storage and compute services such as HDFS, HBase, Hive, YARN, Solr, and Kafka

Security In Hadoop

• Authenticate the user into the Hadoop ecosystem
– Each service has its own integration with LDAP/AD for authentication
• Authorize users and limit their actions to selected services. Authorization is granted separately for each service. Examples:
– Folder “/user/customer” in HDFS grants ‘r-x’ to user ‘alice’ and ‘-wx’ to user ‘bob’
– Column-level access on a Hive table: “Customer.Name” and “Customer.PhoneNumber” are accessible only to some users and groups
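The HDFS folder example can be sketched as a tiny permission check. This is an illustrative model of ‘rwx’ permission strings only, not an HDFS API; the names `allows`, `can`, and `customer_acl` are hypothetical, and the rule that unlisted users get no access is an assumption of the sketch.

```python
# Illustrative model of POSIX-style 'rwx' permission strings; not an HDFS API.
ACTIONS = "rwx"  # read, write, execute (list/traverse for a directory)


def allows(perm: str, action: str) -> bool:
    """Return True if a three-character permission string grants the action."""
    if len(perm) != 3 or len(action) != 1 or action not in ACTIONS:
        raise ValueError(f"bad permission {perm!r} or action {action!r}")
    return perm[ACTIONS.index(action)] == action


# Per-user permissions on /user/customer, taken from the slide's example.
customer_acl = {"alice": "r-x", "bob": "-wx"}


def can(user: str, action: str) -> bool:
    """Assumption of this sketch: users without an entry get no access."""
    return allows(customer_acl.get(user, "---"), action)


print(can("alice", "r"), can("alice", "w"))  # alice may read but not write
print(can("bob", "w"), can("bob", "r"))      # bob may write but not read
```

With these entries alice can list and read the folder but not modify it, while bob can add files without being able to read what is already there.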

Ranger – A Pluggable Security Framework

• Ranger works with a common user DB (LDAP/AD) for authentication
• Provides a plug-in for individual Hadoop services to enable authorization
• Allows users to define policies in a central location, using the web UI or APIs
• Users can define their own plug-in for a custom service and manage it centrally via Ranger Admin

Defining HDFS Ranger Policies

HDFS Policy List

Marketing Policy Drill Down
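Policies like the ones listed above can also be defined through Ranger's API rather than the web UI. The sketch below only builds a policy document and does not contact a server. The field names follow the policy JSON of Ranger's public REST API as commonly documented, and the endpoint is assumed to be `/service/public/v2/api/policy`; both should be verified against the Ranger version in use. The service name `datalake_hdfs` and policy name `marketing-customer` are hypothetical.

```python
import json


def hdfs_policy(service, name, path, grants):
    """Build a Ranger-style HDFS policy document.

    grants maps a user name to a list of access types
    ("read", "write", "execute").
    """
    return {
        "service": service,
        "name": name,
        "isEnabled": True,
        "resources": {"path": {"values": [path], "isRecursive": True}},
        "policyItems": [
            {
                "users": [user],
                "accesses": [{"type": t, "isAllowed": True} for t in types],
            }
            for user, types in grants.items()
        ],
    }


# Hypothetical service and policy names; users and path come from the slides.
policy = hdfs_policy(
    "datalake_hdfs",
    "marketing-customer",
    "/user/customer",
    {"alice": ["read", "execute"], "bob": ["write", "execute"]},
)
payload = json.dumps(policy, indent=2)
print(payload)
```

Keeping the builder separate from any HTTP call makes it easy to review or version-control the policy before it is submitted to Ranger Admin.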

Security Considerations in BDaaS

[Diagram: the multi-tenant BDaaS layout (tenants, clusters, Data Lake, Staging) annotated with three security concerns.]

1. User identity within the Data Lake
2. User identity in the application layer
3. User identity propagation to the data layer: prevent data duplication and maintain user integrity across layers

1. Securing The Data Lake

[Diagram: the multi-tenant BDaaS layout with LDAP and a KDC added at the Data/Storage layer. Labels: 1. Authentication & Authorization – Data Lake; 2. User Identity – Application Level; 3. User Identity propagation to Data Layer.]

2. Securing The App Layer

[Diagram: the multi-tenant BDaaS layout with LDAP, KDCs, and users Alice, Bob, and Tom, carrying the same three numbered labels.]

• App containers are integrated with LDAP

3. Identity Propagation to Data Layer

[Diagram: the multi-tenant BDaaS layout with LDAP, KDCs, and users Alice, Bob, and Tom, carrying the same three numbered labels.]

User Identity Propagation

Two ways:
– Users connect directly to HDFS
  • Simple authentication
  • Kerberos authentication
– Users connect to HDFS via a super-user (impersonation)

HDFS Direct Connections

[Diagram: users Alice, Bob, and Tom, together with LDAP and KDC services, connect from the tenant clusters directly to the HDFS Data Lake.]

HDFS Direct Connections (continued)

– hdfs-audit.log
– Ranger policies are enforced for alice and bob, as they are the effective users
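To check which user HDFS recorded for an operation, an hdfs-audit.log entry can be split into its fields. The sample line below is illustrative, not taken from the slides; it assumes the usual tab-separated `key=value` layout after the `FSNamesystem.audit:` prefix, which may differ between Hadoop versions. The file path and IP address are made up.

```python
# Illustrative audit entry (not from the slides); fields are tab-separated.
SAMPLE = (
    "2016-06-28 10:15:02,123 INFO FSNamesystem.audit: "
    "allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.12\t"
    "cmd=open\tsrc=/user/customer/orders.csv\tdst=null\tperm=null"
)


def parse_audit_line(line: str) -> dict:
    """Split an HDFS audit entry into a dict of its key=value fields."""
    _, _, body = line.partition("FSNamesystem.audit: ")
    fields = {}
    for item in body.strip().split("\t"):
        key, sep, value = item.partition("=")
        if sep:  # skip anything that is not key=value
            fields[key] = value
    return fields


entry = parse_audit_line(SAMPLE)
user = entry["ugi"].split()[0]  # the user the operation is attributed to
print(user, entry["cmd"], entry["src"], entry["allowed"])
```

With a direct connection the `ugi` field names the end user, which is why Ranger can evaluate its policies against alice or bob.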

HDFS Direct Connections (continued)

• Single Hadoop setup
– Ideal
• Multi-tenant, multi-application setup
– Kerberized HDFS needs Kerberized compute and services
– May not want to Kerberize Dev/QA setups
– Hadoop versions should be compatible across the board
– Data duplication

HDFS Super-user Connections

• Super-users perform actions on behalf of other users (impersonation/proxying)
• Adding a new super-user is easy
– core-site.xml
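The core-site.xml change amounts to two proxy-user properties per super-user, `hadoop.proxyuser.<name>.hosts` and `hadoop.proxyuser.<name>.groups`. The sketch below generates those `<property>` elements; the super-user name `datatap`, the host name, and the group names are hypothetical placeholders, and the resulting fragment would still have to be merged into the cluster's actual core-site.xml.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(superuser, hosts, groups):
    """Return <property> elements allowing `superuser` to impersonate
    members of `groups` when connecting from `hosts`."""
    settings = {
        f"hadoop.proxyuser.{superuser}.hosts": ",".join(hosts),
        f"hadoop.proxyuser.{superuser}.groups": ",".join(groups),
    }
    props = []
    for name, value in settings.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        props.append(prop)
    return props


# Hypothetical super-user, host, and groups, for illustration only.
for prop in proxyuser_properties(
    "datatap", ["gateway1.example.com"], ["marketing", "rnd"]
):
    print(ET.tostring(prop, encoding="unicode"))
```

Restricting both hosts and groups keeps the super-user from impersonating arbitrary users from arbitrary machines.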

HDFS Super-user Connections (continued)

[Diagram: users Alice, Bob, and Tom, together with LDAP and KDC services, reach the HDFS Data Lake from the tenant clusters via the DataTap Caching Service, which connects as a super-user.]

HDFS Super-user Connections (continued)

– hdfs-audit.log
– Ranger authorization policies are still enforced, as alice and bob are the effective users
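With impersonation, the user field of an audit entry carries two identities: the effective user and the super-user acting for them. The sketch below separates the two. It assumes Hadoop renders a proxied identity as `<effective> (auth:PROXY) via <real> (auth:...)`, which should be checked against real log output; the super-user name `datatap` is hypothetical.

```python
import re

# Assumed rendering: "<effective> (auth:X)" optionally followed by
# " via <real> (auth:Y)" when a super-user is impersonating.
UGI = re.compile(
    r"^(?P<effective>\S+) \(auth:\w+\)"
    r"(?: via (?P<real>\S+) \(auth:\w+\))?$"
)


def split_ugi(ugi: str):
    """Return (effective_user, real_user); real_user is None when the
    connection was made directly rather than through a super-user."""
    match = UGI.match(ugi.strip())
    if match is None:
        raise ValueError(f"unrecognized ugi field: {ugi!r}")
    return match.group("effective"), match.group("real")


print(split_ugi("alice (auth:PROXY) via datatap (auth:KERBEROS)"))
print(split_ugi("bob (auth:SIMPLE)"))
```

Authorization is evaluated against the first element of the pair, which is why the policies for alice and bob still apply even though the super-user opened the connection.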

HDFS Super-user Connections (continued)

Multi-tenant, multi-application setup
– Works for applications which don’t support Kerberos (yet)
– Dev/Test setups need not be Kerberized
– DataTap service can abstract version incompatibilities
– Can help avoid data duplication
– Need tight LDAP/AD integration though!

Ranger in Action

Hue Example

HDFS Permissions on Data Lake

• Set HDFS file access for ‘/user/secret’ to strict mode

• Set umask to ‘077’
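The effect of a ‘077’ umask can be worked out with a bit mask: the umask bits are cleared from the requested mode. The sketch assumes the usual defaults of 777 for new directories and 666 for new files; the helper names are hypothetical.

```python
def apply_umask(requested: int, umask: int) -> int:
    """Clear every permission bit that is set in the umask."""
    return requested & ~umask


def to_rwx(mode: int) -> str:
    """Render the low nine permission bits as an 'rwxrwxrwx' string."""
    bits = "rwxrwxrwx"
    return "".join(
        flag if mode & (1 << (8 - i)) else "-" for i, flag in enumerate(bits)
    )


UMASK = 0o077
dir_mode = apply_umask(0o777, UMASK)   # assumed default for new directories
file_mode = apply_umask(0o666, UMASK)  # assumed default for new files

print(oct(dir_mode), to_rwx(dir_mode))    # 0o700 rwx------
print(oct(file_mode), to_rwx(file_mode))  # 0o600 rw-------
```

Group and other users therefore get no access from file permissions alone, so any access they need has to come from an explicit policy.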

HDFS Ranger Policies

DataTap Caching Service

Create Table via Hue

Query table via Hue - Success

Query table via Hue - Failure

Ranger Audit Logs

Key Takeaways

• BDaaS is more than Hadoop-as-a-Service
– Includes BI / ETL / Analytics + Data Science tools
• Security is an important consideration in BDaaS
• Data duplication is not an option
• Global user authentication using a centralized DB like LDAP/AD is a must
• Apache Ranger helps in enforcing global policies, provided user identities are propagated correctly

Q & A

www.bluedata.com

Nanda Vijaydev (@nandavijaydev)

Abhiraj Butala (@abhirajbutala)
