© Hortonworks Inc. 2011–2017. All Rights Reserved

Apache Spark and Object Stores
[email protected] / @steveloughran
February 2017

Steve Loughran, Hadoop committer, PMC member, …
Chris Nauroth, Apache Hadoop committer & PMC, ASF member
Rajesh Balamohan, Tez committer, PMC member
Use case: Elastic ETL, with inbound data landing in HDFS and external ORC/Parquet datasets
Use case: Notebooks and libraries working with external datasets
Streaming
A Filesystem: Directories, Files → Data

  /work/pending/part-00
  /work/pending/part-01
  /work/complete/part-01

  (each file's blocks 00, 01 replicated across datanodes)

  rename("/work/pending/part-01", "/work/complete")
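On a real filesystem that commit is a single metadata operation. A minimal sketch of the same commit through Hadoop's FileSystem API, assuming hadoop-common on the classpath and using the paths from the diagram:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// On HDFS, rename() is an atomic O(1) metadata operation,
// which is why it is used as the commit step.
val fs = FileSystem.get(new Configuration())

val pending = new Path("/work/pending/part-01")
val complete = new Path("/work/complete")

fs.mkdirs(complete)
// rename() signals most failures by returning false, not by throwing
if (!fs.rename(pending, complete)) {
  throw new java.io.IOException(s"could not commit $pending to $complete")
}
```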
Object Store: hash(name) → blob

  hash("/work/pending/part-01") → ["s02", "s03", "s04"]
  copy("/work/pending/part-01", "/work/complete/part-01")
  delete("/work/pending/part-01")
  hash("/work/pending/part-00") → ["s01", "s02", "s04"]

  (blobs are spread across servers s01–s04 by hash; there is no rename, only copy + delete)
REST APIs

  HEAD /work/complete/part-01
  PUT /work/complete/part-01
    x-amz-copy-source: /work/pending/part-01
  DELETE /work/pending/part-01
  PUT /work/pending/part-01
    ... DATA ...
  GET /work/pending/part-01
    Content-Length: 1-8192
  GET /?prefix=/work&delimiter=/
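Those REST calls are what the AWS SDK on the S3A classpath issues under the covers. A hedged sketch using the 1.x SDK directly; the bucket name "mybucket" is an assumption, not from the slides:

```scala
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ListObjectsRequest

// Credentials come from the SDK's default provider chain.
val s3 = new AmazonS3Client()

// HEAD /work/complete/part-01
val meta = s3.getObjectMetadata("mybucket", "work/complete/part-01")

// GET /?prefix=work/&delimiter=/
val listing = s3.listObjects(new ListObjectsRequest()
  .withBucketName("mybucket")
  .withPrefix("work/")
  .withDelimiter("/"))
```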
Often: Eventually Consistent

  DELETE /work/pending/part-00 → 200
  GET /work/pending/part-00 → 200
  GET /work/pending/part-00 → 200

  (the deleted object can remain visible to GET for some time)
org.apache.hadoop.fs.FileSystem

  hdfs  s3a  wasb  adl  swift  gs
History of Object Storage Support, 2006–2016

  s3://     "inode on S3"
  s3n://    "Native" S3
  s3://     Amazon EMR S3
  swift://  OpenStack
  s3a://    replaces s3n
  wasb://   Azure WASB
  oss://    Aliyun
  gs://     Google Cloud
  adl://    Azure Data Lake

  Phase I: Stabilize S3A
  Phase II: speed & scale
  Phase III: scale & consistency
Cloud Storage Connectors

Azure WASB
  ● Strongly consistent
  ● Good performance
  ● Well-tested on applications (incl. HBase)

Azure ADL
  ● Strongly consistent
  ● Tuned for big data analytics workloads

Amazon Web Services S3A
  ● Eventually consistent; consistency work in progress by Hortonworks
  ● Performance improvements in progress
  ● Active development in Apache

Amazon Web Services EMRFS
  ● Proprietary connector used in EMR
  ● Optional strong consistency for a cost

Google Cloud Platform GCS
  ● Multiple configurable consistency policies
  ● Currently Google open source
  ● Good performance
  ● Could improve test coverage
Four Challenges
1. Classpath
2. Credentials
3. Code
4. Commitment
Use S3A to work with S3 (EMR: use Amazon's s3://)
Classpath: fix "No FileSystem for scheme: s3a"

  hadoop-aws-2.7.x.jar
  aws-java-sdk-1.7.4.jar
  joda-time-2.9.3.jar
  (jackson-*-2.6.5.jar)

See SPARK-7481
Get Spark with Hadoop 2.7+ JARs
Credentials

core-site.xml or spark-defaults.conf:

  spark.hadoop.fs.s3a.access.key MY_ACCESS_KEY
  spark.hadoop.fs.s3a.secret.key MY_SECRET_KEY

spark-submit propagates environment variables:

  export AWS_ACCESS_KEY=MY_ACCESS_KEY
  export AWS_SECRET_KEY=MY_SECRET_KEY

NEVER: share them, check them in to SCM, or paste them in bug reports…
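The same keys can also be set programmatically on the SparkConf before the context is created. A minimal sketch, assuming the AWS_ACCESS_KEY and AWS_SECRET_KEY environment variables above are set:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Copy the credentials from the environment into the Hadoop configuration
// that S3A will see. Do this before creating the SparkContext.
val sparkConf = new SparkConf().setAppName("s3a-demo")
sparkConf.set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY"))
sparkConf.set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_KEY"))
val sc = new SparkContext(sparkConf)
```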
Authentication Failure: 403

  com.amazonaws.services.s3.model.AmazonS3Exception:
  The request signature we calculated does not match the signature you provided.
  Check your key and signing method.

1. Check joda-time.jar & JVM version
2. Credentials wrong
3. Credentials not propagating
4. Local system clock (more likely on VMs)
Code: Basic IO

  // Read in public dataset
  val lines = sc.textFile("s3a://landsat-pds/scene_list.gz")
  val lineCount = lines.count()

  // generate and write data
  val numbers = sc.parallelize(1 to 10000)
  numbers.saveAsTextFile("s3a://hwdev-stevel-demo/counts")

All you need is the URL
Code: just use the URL of the object store

  val csvData = spark.read.options(Map(
      "header" -> "true",
      "inferSchema" -> "true",
      "mode" -> "FAILFAST"))
    .csv("s3a://landsat-pds/scene_list.gz")

... read time O(distance)
DataFrames

  val landsat = "s3a://stevel-demo/landsat"
  csvData.write.parquet(landsat)

  val landsatOrc = "s3a://stevel-demo/landsatOrc"
  csvData.write.orc(landsatOrc)

  val df = spark.read.parquet(landsat)
  val orcDf = spark.read.orc(landsatOrc)

… list inconsistency … commit time O(data)
Finding dirty data with Spark SQL

  val sqlDF = spark.sql("SELECT id, acquisitionDate, cloudCover"
    + s" FROM parquet.`${landsat}`")
  val negativeClouds = sqlDF.filter("cloudCover < 0")
  negativeClouds.show()

* filter columns and data early
* whether/when to cache()?
* copy popular data to HDFS
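The last tip, copying popular data from the object store to HDFS, can be sketched as a one-off Spark job. A minimal illustration; the hdfs:// destination path is a hypothetical example:

```scala
// Read the Parquet dataset once from S3, write it to HDFS,
// then point subsequent queries at the local copy.
val landsat = "s3a://stevel-demo/landsat"    // source in the object store
val local = "hdfs:///data/cache/landsat"     // hypothetical HDFS destination

spark.read.parquet(landsat)
  .filter("cloudCover >= 0")                 // drop dirty rows while copying
  .write.mode("overwrite").parquet(local)

val cached = spark.read.parquet(local)       // fast, consistent reads from HDFS
```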
spark-defaults.conf

  spark.sql.parquet.filterPushdown true
  spark.sql.parquet.mergeSchema false
  spark.hadoop.parquet.enable.summary-metadata false

  spark.sql.orc.filterPushdown true
  spark.sql.orc.splits.include.file.footer true
  spark.sql.orc.cache.stripe.details.size 10000

  spark.sql.hive.metastorePartitionPruning true
Recent S3A Performance (Hadoop 2.8, HDP 2.5, CDH 6 (?))

  // forward seek by skipping stream
  spark.hadoop.fs.s3a.readahead.range 256K

  // faster backward seek for ORC and Parquet input
  spark.hadoop.fs.s3a.experimental.input.fadvise random

  // PUT blocks in separate threads
  spark.hadoop.fs.s3a.fast.output.enabled true
The Commitment Problem

⬢ rename() used for atomic commitment transaction
⬢ time to copy() + delete() proportional to data * files
⬢ S3: 6+ MB/s
⬢ Azure: a lot faster, usually

  spark.speculation false
  spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
  spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
The "Direct Committer"?
Notebooks? Classpath & Credentials
Azure Storage: wasb://

A full substitute for HDFS
Classpath: fix "No FileSystem for scheme: wasb"

wasb://: consistent, with very fast rename (hence: commits)

  hadoop-azure-2.7.x.jar
  azure-storage-2.2.0.jar
  + (jackson-core; http-components, hadoop-common)
Credentials: core-site.xml / spark-defaults.conf

  <property>
    <name>fs.azure.account.key.example.blob.core.windows.net</name>
    <value>0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c</value>
  </property>

  spark.hadoop.fs.azure.account.key.example.blob.core.windows.net 0c0d44ac83ad7f94b0997b36e6e9a25b49a1394c

  wasb://[email protected]
Example: Azure Storage and Streaming

  val streaming = new StreamingContext(sparkConf, Seconds(10))
  val azure = "wasb://[email protected]/in"
  val lines = streaming.textFileStream(azure)
  val matches = lines.map(line => {
    println(line)
    line
  })
  matches.print()
  streaming.start()

* PUT into the streaming directory
* keep the dir clean
* size window for slow scans
* checkpoints are slow: reduce frequency
s3guard: fast, consistent S3 metadata
HADOOP-13445. Hortonworks + Cloudera + Western Digital
DynamoDB as consistent metadata store

  PUT part-00 → 200
  HEAD part-00 → 200
  DELETE part-00 → 200
  HEAD part-00 → 404

  (with the metadata store, a HEAD after DELETE consistently returns 404)
Summary

⬢ Object Stores look just like any other Filesystem URL
⬢ … but do need classpath and configuration
⬢ Issues: performance, commitment
⬢ Tune to reduce I/O
⬢ Keep those credentials secret!

Finally: keep an eye out for s3guard!
Questions?
Backup Slides
Not Covered

⬢ Partitioning/directory layout
⬢ Infrastructure throttling
⬢ Optimal path names
⬢ Error handling
⬢ Metrics
Dependencies in Hadoop 2.8

  hadoop-aws-2.8.x.jar
  aws-java-sdk-core-1.10.6.jar
  aws-java-sdk-kms-1.10.6.jar
  aws-java-sdk-s3-1.10.6.jar
  joda-time-2.9.3.jar
  (jackson-*-2.6.5.jar)

  hadoop-azure-2.8.x.jar
  azure-storage-4.2.0.jar
S3 Server-Side Encryption

⬢ Encryption of data at rest in S3
⬢ Supports the SSE-S3 option: each object encrypted by a unique key using the AES-256 cipher
⬢ Now covered in S3A automated test suites
⬢ Support for additional options under development (SSE-KMS and SSE-C)
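Turning the SSE-S3 option on is a single S3A property; a minimal core-site.xml sketch, with the property name as documented for Hadoop's S3A connector:

```xml
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>AES256</value>
</property>
```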
Advanced authentication

  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>
      org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
      org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
      com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
      com.amazonaws.auth.InstanceProfileCredentialsProvider,
      org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
    </value>
  </property>

+ encrypted credentials in JCEKS files on HDFS