Security in streaming applications
A case study with Apache Apex
Pramod Immaneni, PMC Apex & Chief Architect, DataTorrent
pramod@datatorrent.com
Apache Apex
• Stream processing platform
  • In-memory, distributed
• Simple programming model
  • Write your own custom logic, pipelining
• Scalable
  • High throughput, low latency
  • Dynamic scaling responding to SLA
• Fault tolerant
  • Node outages, Hadoop outages
  • Stateful recovery, incremental recovery
  • End-to-end exactly-once
• Productivity library
  • Commonly needed connectors, business logic
  • Production tested
• Operability – DataTorrent RTS
  • Deployment and monitoring console
  • Deep introspection and debugging
Apache Apex and DataTorrent Product Stack
Designed to help you at every stage of your data-in-motion pipeline

[Product stack diagram, top to bottom:
• Solutions for Business Problems – ingestion & data prep, ETL pipelines
• Ease of Use Tools – real-time data visualization, management & monitoring, GUI application assembly, application templates (e.g. FileSync)
• Dev Framework – high-level API (transformation, ML & scoring, SQL, analytics), batch support
• Core – Apache Apex Core, Apex-Malhar operator library with connectors such as Kafka, HDFS, JDBC
• Big Data Infrastructure – Hadoop 2.x (YARN + HDFS), on-prem & cloud]
Application Development Model

• A Stream is a sequence of data tuples
• A typical Operator takes one or more input streams, performs computations and emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
Directed Acyclic Graph (DAG)

[DAG diagram: operators connected by streams of tuples – an output stream from one operator is filtered and enriched by downstream operators, producing filtered and enriched streams]
Native Hadoop Integration
• YARN is the resource manager
• HDFS for storing persistent state
Components

• Secure Hadoop
  • Kerberos security
  • Delegation tokens
  • All interactions between distributed components are authenticated
  • Kerberos enabled for Hadoop web services and management pages
• Running Apex on Secure Hadoop
  • Apex CLI
  • Apex applications
• Running DT Console and Gateway on Secure Hadoop
Kerberos and Delegation Tokens

• Kerberos
  • Authentication in multi-user, multi-node computing environments
    • Between users and services
    • Between services across nodes
  • Mutual authentication
  • Use of a central trusted service
  • Symmetric keys
  • Created at MIT
• Delegation tokens
  • Used when a party does not have Kerberos credentials, e.g. non-fixed clients like application containers
  • A byte sequence created from fields such as user information, a timestamp and keys
  • Tokens have an expiry period
  • Clients provide tokens and the services verify them
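The delegation-token idea above can be sketched in a few lines: a token is a signed byte sequence built from user information and an expiry, verified by the service that holds the symmetric key. This is a minimal illustration, not Hadoop's actual token format; all names here are made up.

```python
# Minimal delegation-token sketch (illustrative, not Hadoop's real classes):
# the service signs an identifier with a key only it holds, and later
# verifies the signature and expiry of tokens presented by clients.
import hashlib
import hmac
import time

SECRET_KEY = b"service-master-key"  # held only by the issuing service

def issue_token(user, lifetime_secs):
    """Return (identifier, signature) built from user info and an expiry."""
    expiry = int(time.time()) + lifetime_secs
    identifier = f"{user}|{expiry}".encode()
    signature = hmac.new(SECRET_KEY, identifier, hashlib.sha256).digest()
    return identifier, signature

def verify_token(identifier, signature):
    """Check the signature, then check that the token has not expired."""
    expected = hmac.new(SECRET_KEY, identifier, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return False
    _, expiry = identifier.decode().rsplit("|", 1)
    return int(expiry) > time.time()

ident, sig = issue_token("alice", lifetime_secs=3600)
assert verify_token(ident, sig)          # genuine token accepted
assert not verify_token(ident, b"forged")  # tampered token rejected
```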
Apex CLI

• Uses Kerberos credentials to authenticate with Hadoop
• Sets up delegation tokens for STRAM during launch
  • RM and NN delegation tokens
  • Supports HA configuration
• Sets up credentials for token refresh (discussed later)
• Impersonation
  • Can proxy as a specified user different from the Kerberos credentials
  • Requires extra Hadoop configuration
CLI configuration
• Short-lived applications: login to Kerberos using kinit

  kinit -k -t path-to-keytab-file kerberos-principal

• Long-living applications: configuration in dt-site.xml

  <property>
    <name>dt.authentication.principal</name>
    <value>kerberos-principal</value>
  </property>
  <property>
    <name>dt.authentication.keytab</name>
    <value>keytab-file</value>
  </property>
  <property>
    <name>dt.authentication.store.keytab</name>
    <value>hdfs-path-to-keytab-file</value>
  </property>
Apex application architecture
STRAM

• Uses delegation tokens when communicating with Hadoop
  • Refreshes them before they expire
• Hadoop delegation tokens for Streaming Containers
  • Seeds them during launch
• STRAM delegation tokens
  • Used for RPC communication between Streaming Containers and STRAM
  • Created and seeded by STRAM when containers are launched
• Buffer server tokens
  • Used for authentication between the buffer server and its clients
  • Peer-to-peer authentication between containers
  • Created and seeded by STRAM during container deployment
  • Persisted so tokens survive container failure and restart
Streaming Container

• Delegation tokens from STRAM
  • Uses Hadoop delegation tokens for Hadoop services
  • Uses the STRAM delegation token to communicate with STRAM
• Buffer server token
  • Receives it in the initial deployment context – StreamingContainerContext
  • Starts the buffer server with this token
• Buffer server client tokens
  • Receives client tokens for input and output ports in the operator deployment context – InputDeployInfo and OutputDeployInfo
  • Seeds the buffer server clients with these tokens
  • Used in communication with the buffer server
Buffer Server

• Each buffer server has its own token
  • Seeded during start
• Clients have to provide the token to authenticate and receive services
  • BufferServerPublisher – used by operators to send data to the buffer server
  • BufferServerSubscriber – receives data from the buffer server for operators
  • BufferServerController – used by STRAM to communicate with the buffer server for maintenance tasks
• Work in progress
  • Provide more options such as multiple tokens and token refresh
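A minimal sketch of the token check described above: STRAM seeds the buffer server with a token at start, hands the same token to the container's publisher/subscriber clients, and the server rejects clients that present anything else. The BufferServer class here is illustrative, not the real Apex implementation.

```python
# Illustrative buffer-server token check, assuming a shared token that
# STRAM creates at deployment and seeds into both server and clients.
import secrets

class BufferServer:
    def __init__(self, token):
        self.token = token  # seeded by STRAM when the server starts

    def accept(self, client_token):
        """Serve a client only if its token matches the seeded one."""
        return secrets.compare_digest(self.token, client_token)

stram_token = secrets.token_bytes(16)   # created by STRAM at deployment
server = BufferServer(stram_token)
assert server.accept(stram_token)                 # seeded client accepted
assert not server.accept(secrets.token_bytes(16)) # stranger rejected
```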
STRAM web services

• STRAM web service interface
  • To query details such as health, status and statistics of an application
  • To effect changes in the application
• Challenges
  • Cannot use Kerberos, for the same reasons as STRAM RPC
  • Clients will not have delegation tokens to start with
• Hybrid approach
  • Clients authenticate with the Resource Manager proxy using Kerberos
  • The proxy forwards the request to the STRAM web service filter without any credentials
  • The filter only accepts credential-less requests from the proxy and sends an authentication token back to the client
  • Clients use the authentication token for all future communication with STRAM
  • The RTS Gateway and map-reduce use this approach
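The hybrid flow above can be modelled as a toy filter: the first, credential-less request arriving through the RM proxy (which already authenticated the user with Kerberos) is answered with an authentication token, and every later request must present that token. All names here are illustrative, not the actual Apex filter API.

```python
# Toy model of the hybrid authentication flow: trust the proxy for the
# first contact, then require the issued token on all later requests.
import secrets
from typing import Optional

class StramAuthFilter:
    def __init__(self):
        self.issued = set()  # tokens this filter has handed out

    def handle(self, via_proxy: bool, token: Optional[str]) -> str:
        if token in self.issued:
            return "ok"                  # authenticated follow-up request
        if via_proxy and token is None:
            new_token = secrets.token_hex(16)
            self.issued.add(new_token)
            return new_token             # token handed back to the client
        return "denied"                  # direct request without a token

f = StramAuthFilter()
token = f.handle(via_proxy=True, token=None)    # first contact via RM proxy
assert f.handle(via_proxy=False, token=token) == "ok"
assert f.handle(via_proxy=False, token=None) == "denied"
```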
Web services authentication

Configuration
• Web service authentication can be enabled or disabled per application
• Hadoop also does not make it mandatory in secure mode

  <property>
    <name>dt.application.name.attr.STRAM_HTTP_AUTHENTICATION</name>
    <value>security-option</value>
  </property>

• Security options
  • ENABLE – enable authentication
  • FOLLOW_HADOOP_AUTH – enable authentication if secure mode is enabled in Hadoop (the default)
  • FOLLOW_HADOOP_HTTP_AUTH – enable authentication only if HTTP authentication is enabled in Hadoop, not just secure mode
  • DISABLE – disable authentication
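The options above amount to a small decision table. A sketch, assuming boolean flags for Hadoop's secure mode and HTTP authentication; the option names come from the slide, the function itself is illustrative:

```python
# Decide whether STRAM web-service authentication is on, given the
# configured security option and the state of the Hadoop cluster.
def stram_http_auth_enabled(option: str, hadoop_secure: bool,
                            hadoop_http_auth: bool) -> bool:
    if option == "ENABLE":
        return True
    if option == "DISABLE":
        return False
    if option == "FOLLOW_HADOOP_AUTH":       # the default
        return hadoop_secure
    if option == "FOLLOW_HADOOP_HTTP_AUTH":
        return hadoop_http_auth
    raise ValueError(f"unknown security option: {option}")

# Secure cluster without HTTP auth: FOLLOW_HADOOP_AUTH enables it,
# FOLLOW_HADOOP_HTTP_AUTH does not.
assert stram_http_auth_enabled("FOLLOW_HADOOP_AUTH", True, False)
assert not stram_http_auth_enabled("FOLLOW_HADOOP_HTTP_AUTH", True, False)
```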
Delegation Token refresh

• Delegation tokens in Hadoop expire after 7 days
  • Applications relying on expired tokens will be killed
• Options
  • Configure Hadoop to increase the expiry time – not practical
  • Have the application get new tokens before the current ones expire
• Auto-refresh
  • STRAM and the Streaming Containers request new delegation tokens from Hadoop services before the current ones expire
  • Kerberos credentials are needed to request new tokens
  • Kerberos credentials and other configuration are provided by the Apex CLI when the application is launched
Configuration
<property>
  <name>dt.authentication.store.keytab</name>
  <value>hdfs-path-to-keytab-file</value>
</property>
<property>
  <name>dt.resourcemanager.delegation.token.max-lifetime</name>
  <value>604800000</value>
</property>
<property>
  <name>dt.namenode.delegation.token.max-lifetime</name>
  <value>604800000</value>
</property>
<property>
  <name>dt.authentication.token.refresh.factor</name>
  <value>0.7</value>
</property>
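The refresh factor works as simple arithmetic: tokens are renewed after max-lifetime × factor, i.e. well before expiry. A quick check with the values from the configuration above:

```python
# With a 7-day max-lifetime (604800000 ms) and a refresh factor of 0.7,
# new tokens are requested roughly 4.9 days after issue.
max_lifetime_ms = 604_800_000   # dt.*.delegation.token.max-lifetime
refresh_factor = 0.7            # dt.authentication.token.refresh.factor

refresh_after_ms = max_lifetime_ms * refresh_factor
refresh_after_days = refresh_after_ms / (24 * 60 * 60 * 1000)
print(round(refresh_after_days, 2))  # 4.9 – comfortably inside 7 days
```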
DTGateway

• Backend service for the UI Console
• On secure Hadoop
  • Interacts with Kerberos-enabled Hadoop
  • Works with secure applications
• Supports user authentication into the UI Console
  • LDAP, AD, Kerberos, etc.
DTGateway security architecture

Configuration
• Uses Kerberos credentials to authenticate with Hadoop
• Kerberos credentials can be configured during installation

  <property>
    <name>dt.gateway.authentication.principal</name>
    <value>[kerberos-principal]</value>
  </property>
  <property>
    <name>dt.gateway.authentication.keytab</name>
    <value>[keytab-file]</value>
  </property>

• Uses Kerberos over HTTP (SPNEGO) when interacting with Hadoop web services that have Kerberos enabled
Application launch & Impersonation
• Applications launched via the CLI use dtGateway's own Kerberos credentials
• The user the applications will run as, on the Hadoop side, can be configured
  • Specified as a configuration setting
• The possible values for the user strategy are
  • AUTH_USER – app runs as the authenticated user (the default if not configured)
  • GATEWAY_USER – app runs as the same user the dtGateway process runs under
  • SPECIFIED_USER – app runs as the specified user

  <property>
    <name>dt.gateway.hadoop.user.strategy</name>
    <value>[user-strategy]</value>
  </property>
  <property>
    <name>dt.gateway.hadoop.user.name</name>
    <value>specific-username</value>
  </property>
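The strategy selection above can be sketched as a simple lookup. Only the strategy names come from the configuration; the function and its arguments are illustrative:

```python
# Pick the Hadoop-side user an application launched through dtGateway
# will run as, based on the configured user strategy.
def effective_user(strategy: str, auth_user: str, gateway_user: str,
                   specified_user: str = "") -> str:
    if strategy == "AUTH_USER":        # default if not configured
        return auth_user
    if strategy == "GATEWAY_USER":
        return gateway_user
    if strategy == "SPECIFIED_USER":   # from dt.gateway.hadoop.user.name
        return specified_user
    raise ValueError(f"unknown user strategy: {strategy}")

assert effective_user("AUTH_USER", "alice", "dtadmin") == "alice"
assert effective_user("SPECIFIED_USER", "alice", "dtadmin",
                      "etluser") == "etluser"
```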
Hadoop configuration
• Additional configuration is needed on the Hadoop side
• The username below should be the dtGateway Kerberos username
• Allows dtGateway to connect with the Kerberos username and launch apps as other users

  <property>
    <name>hadoop.proxyuser.[username].groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.[username].hosts</name>
    <value>*</value>
  </property>
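A toy version of the proxy-user check this configuration enables: Hadoop lets the gateway's Kerberos user impersonate a target user only if the target's groups and the calling host match the configured patterns, with "*" matching everything. This simplifies Hadoop's actual ACL logic; the function and sample values are illustrative.

```python
# Simplified hadoop.proxyuser.* check: a proxy user may impersonate a
# target only when both its groups pattern and hosts pattern match.
def may_impersonate(proxy_conf: dict, proxy_user: str,
                    target_groups: set, host: str) -> bool:
    groups = proxy_conf.get(f"hadoop.proxyuser.{proxy_user}.groups")
    hosts = proxy_conf.get(f"hadoop.proxyuser.{proxy_user}.hosts")
    if groups is None or hosts is None:
        return False  # no proxy-user configuration for this user
    group_ok = groups == "*" or bool(set(groups.split(",")) & target_groups)
    host_ok = hosts == "*" or host in hosts.split(",")
    return group_ok and host_ok

# Wildcard configuration as in the slide: dtgateway may impersonate anyone.
conf = {"hadoop.proxyuser.dtgateway.groups": "*",
        "hadoop.proxyuser.dtgateway.hosts": "*"}
assert may_impersonate(conf, "dtgateway", {"users"}, "node1")
assert not may_impersonate(conf, "someoneelse", {"users"}, "node1")
```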
Authentication with STRAM
• Security is enabled for STRAM web services
• dtGateway first obtains a security cookie by connecting to STRAM via the RM proxy
  • Makes a web service request to the STRAM web service path /ws via the RM proxy
  • Receives a cookie called dt-client
• Subsequently uses the cookie for all direct communication with STRAM
Resources

• https://apex.apache.org/docs/apex/security/
• https://en.wikipedia.org/wiki/Kerberos_(protocol)
• http://www.kerberos.org/events/2010conf/2010slides/2010kerberos_owen_omalley.pdf
• https://www.cloudera.com/documentation/enterprise/5-7-x/topics/cm_sg_principal_keytab.html
Q&A