Spark at eBay - Troubleshooting the everyday issues Aug. 6, 2014 Seattle Spark Meetup Don Watters - Sr. Manager of Architecture, eBay Inc. Suzanne Monthofer - Solutions Architect, eBay Inc.

Source: files.meetup.com/12063092/SparkMeetupAugust2014Public.pdf


Page 1

Spark at eBay -

Troubleshooting the

everyday issues

Aug. 6, 2014

Seattle Spark Meetup

Don Watters - Sr. Manager of Architecture, eBay Inc.

Suzanne Monthofer - Solutions Architect, eBay Inc.

Page 2

Agenda

– eBay Overview

– Spark Motivation

– Use Cases At eBay

– Troubleshooting the everyday issues

2

Page 3

eBay Overview

3

> 50 thousand categories of products

> 200 million items listed for sale on the site

Average retailer has thousands of products

Page 4

4

PLATFORM

Page 5

Data @ eBay

• >50 TB/day new data

• >100 PB/day processed, turning over a TB every second

• >100 trillion pairs of information

• Millions of queries/day

• >6000 business users & analysts

• >50k chains of logic

• >100k data elements

• Near-real-time

• Active/Active

• Always online

• 24x7x365

• 99.98+% availability

5

Page 6

Spark Motivation

– Great Promise!

– Fits our pattern well

– Iterative approach possible, like SQL

6

Page 7

7

Page 8

Agenda

– Use Cases At eBay

8

Page 9

9

eBay Transformer = More Data

Page 10

Agenda

– Troubleshooting the everyday issues

10

Page 11

Tools and Skill sets

• JIRA issue tracking – internal and Apache

• GitHub repository – source version control, documentation (.md)

• Compilation/dependencies – Maven, jar dependencies

• Java – versioning, debugging stack traces, environments, multiple JDK/JREs, compatibility errors

• POSIX OS – environment variables, directory structures, permissions, shell scripting

• HDFS, Hadoop queues, formats, compression

• Yarn/Mesos – environments, debugging, logs, killing

• JIRA internal wikis – global internal collaboration

• User groups, internal DLs, platform support teams, informal emails

• Ability to decipher Java stack traces

• Stack Overflow, Googling, indirect clues

• Scrappiness: when dwarfed by a challenge, compensating for seeming inadequacies through will, persistence and heart

11

Page 12

Most Common Question: Yarn ShellException

(GiraphApplicationMaster.java:onContainersCompleted(574)) - Got container status for containerID=container_1392317581183_0245_01_000003, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:252)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

Means an error occurred in the Yarn container – need to search for the Java stack trace deeper in the Yarn logs…
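The container ID in the diagnostics line already carries what is needed to pull the aggregated logs. A minimal sketch, assuming container IDs follow the container_<clusterTimestamp>_<appSequence>_<attempt>_<container> layout seen above (later Hadoop versions may add an epoch segment such as e17, which the pattern tolerates). The function name is hypothetical.

```python
import re

# Layout assumed from the slide:
#   container_<clusterTimestamp>_<appSequence>_<attempt>_<containerSequence>
# An optional epoch segment (e.g. "e17_") is skipped if present.
CONTAINER_RE = re.compile(r"container_(?:e\d+_)?(\d+)_(\d+)_(\d+)_(\d+)")


def application_id_from(diagnostics: str) -> str:
    """Return the application ID for the first container ID found in the text."""
    match = CONTAINER_RE.search(diagnostics)
    if match is None:
        raise ValueError("no container ID found")
    cluster_timestamp, app_sequence = match.group(1), match.group(2)
    return f"application_{cluster_timestamp}_{app_sequence}"


line = ("Got container status for "
        "containerID=container_1392317581183_0245_01_000003, "
        "state=COMPLETE, exitStatus=1")

# Prints the command to run next against the aggregated logs.
print("yarn logs -applicationId", application_id_from(line))
```

The printed application ID is the argument the yarn logs command expects, so the ExitCodeException can be followed straight to the container's own stack trace.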

Page 13

Killing Yarn Jobs and Viewing Yarn Logs

Logs and status in many places:

• Hadoop console (transient – disappear after job done)

• Aggregated Yarn logs – not available until job finishes or is killed

• Execution shell – only very high-level status

Killing: Ctrl-C, then

/apache/hadoop/bin/yarn application -kill application_1392973982912_7321

Viewing logs:

/apache/hadoop/bin/yarn logs -applicationId application_1392973982912_7321

Sifting to find text "Exception", "Memory", etc.:

| grep Exception -5

| grep Memory -5

• Would like easier debugging and exiting on errors

• May look at log4j appenders

13
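The two grep passes can be folded into one. A rough sketch, assuming the aggregated log has been read into a list of lines; the function name and sample data are made up for illustration, and unlike grep it does not print separators between non-adjacent matches.

```python
from typing import Iterable, List, Sequence


def sift(lines: Iterable[str],
         keywords: Sequence[str] = ("Exception", "Memory"),
         context: int = 5) -> List[str]:
    """Return lines containing any keyword, plus `context` lines on each side."""
    buffered = list(lines)
    keep = set()
    for index, text in enumerate(buffered):
        if any(word in text for word in keywords):
            first = max(0, index - context)
            last = min(len(buffered), index + context + 1)
            keep.update(range(first, last))
    # Emit in original order, each line at most once.
    return [buffered[i] for i in sorted(keep)]


# Made-up log content standing in for real `yarn logs` output.
sample = [f"log line {n}" for n in range(20)]
sample[10] = "java.lang.OutOfMemoryError: Java heap space"

for line in sift(sample, context=2):
    print(line)
```

Overlapping context windows are merged, so a burst of related exceptions shows up once rather than repeated per match.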

Page 14

Biggest Challenge: Resource Allocation/Capacity Scheduling

• Users must request needed resources

• Long-running jobs hang without releasing resources and must be killed manually

• Created a dedicated Spark queue – still not equitable

• Capacity allocation prioritization is complex

• Spark shell hangs on to memory

• Many users deciding to wait for better stability and better guarantee of resource availability and job completion

• Yarn vs. Mesos debate?

14

Page 15

Tuning Spark – Hanging Jobs and Out-of-Memory Errors

– spark.default.parallelism – number of requested Yarn containers

– spark.executor.memory – ~75-90% of requested Yarn container memory size

– spark.storage.memoryFraction – lower from default 0.6 to ~0.2 (if you are not pinning a significant amount of data)

– Remove outliers from dataset (dual-pass with larger entities)

– Use primitive data types – avoid Strings

– Use Kryo serialization

– app UI at localhost:4040 (disabled on our cluster)

– Need to understand inner workings of Spark

– Community working to reduce the amount of configuration needed

Alex Rubinsteyn blog post: “Spark should be better than MapReduce (if only it worked)”

http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html?m=1

Patrick Wendell’s talk on performance at Spark Summit 2013:

https://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/

Tuning Guide: https://spark.apache.org/docs/latest/tuning.html

15
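The rules of thumb on this slide can be turned into starting values. A minimal sketch only: the function name, the 0.8 default share and the example cluster numbers are assumptions, and spark.storage.memoryFraction applies to the Spark 1.x releases of this talk's era (later releases replaced it with unified memory management).

```python
def suggested_spark_conf(num_containers: int, container_mb: int,
                         memory_share: float = 0.8,
                         caching: bool = False) -> dict:
    """Build starting-point settings from the slide's rules of thumb."""
    if not 0.75 <= memory_share <= 0.90:
        raise ValueError("slide suggests roughly 75-90% of container memory")
    conf = {
        # One task slot per requested Yarn container as a first guess.
        "spark.default.parallelism": str(num_containers),
        # Leave headroom inside the container for JVM overhead.
        "spark.executor.memory": f"{int(container_mb * memory_share)}m",
        # Kryo instead of default Java serialization.
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    }
    if not caching:
        # Not pinning much data: shrink the cache share from 0.6 to 0.2.
        conf["spark.storage.memoryFraction"] = "0.2"
    return conf


# Hypothetical request: 50 containers of 4096 MB each.
for key, value in suggested_spark_conf(50, 4096).items():
    print(f"{key}={value}")
```

These are starting points to iterate from, not guarantees; the tuning guide linked above remains the reference for what each property does.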

Page 16

Yarn Improvements Needed for Spark

16

Great talk by Sandy Ryza from Cloudera at Spark Summit 2014

https://www.youtube.com/watch?v=N6pJhxCPe-Y

Page 17

Rapid Pace of Change

17

Page 18

18

SPARK-1203

spark-shell on yarn-client race in properly getting hdfs delegation tokens - error on saveAsTextFile

Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token can be issued only with kerberos or web authentication
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:6211)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:461)
    ...
    at org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:920)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1336)
    at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:527)
    at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:505)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:202)

Burnt by bugs in snapshots during incubating phase …

Check Spark JIRA issues https://issues.apache.org/jira/browse/SPARK/

Page 19

Apache Shark – Hive on Spark

…NOW OBSOLETE…

• Google protobuf error (notorious) – had to replace bundled jar

Caused by: java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)

• Had to replace hadoop core/security jars with eBay jars

• JDBC driver: mysql-connector-java-5.0.8-bin.jar

• Got it working on single node – able to access/query existing hive tables

• Couldn’t use for extremely large tables/joins yet (need multi-node)

• Requires JDK 1.7 – couldn’t run on multiple nodes in cluster (still 1.6)

• ./bin/shark-withinfo -skipRddReload to avoid a bad table error

• Performance 2-5x better than Hive for 8M row table count query

…Start Looking at Spark SQL!

19
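Before replacing a bundled jar it helps to know which jars actually ship the conflicting class. A sketch under assumptions: the slide only says the bundled protobuf jar had to be replaced, and reading the VerifyError as two incompatible protobuf copies on the classpath is an interpretation. The function name and the throwaway jars in the demo are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def jars_containing(class_name: str, jar_dirs: List[str]) -> List[str]:
    """List jar files under jar_dirs that bundle the fully qualified class."""
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar_dir in jar_dirs:
        for jar in sorted(Path(jar_dir).rglob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as archive:
                    if entry in archive.namelist():
                        hits.append(str(jar))
            except zipfile.BadZipFile:
                continue  # not a readable jar; skip it
    return hits


# Demo on throwaway jars; on a real install, pass the lib directories instead.
with tempfile.TemporaryDirectory() as tmp:
    with zipfile.ZipFile(Path(tmp) / "bundled-protobuf.jar", "w") as jar:
        jar.writestr("com/google/protobuf/UnknownFieldSet.class", b"")
    with zipfile.ZipFile(Path(tmp) / "unrelated.jar", "w") as jar:
        jar.writestr("example/Other.class", b"")
    found = jars_containing("com.google.protobuf.UnknownFieldSet", [tmp])
    demo_hits = [Path(path).name for path in found]
    print(demo_hits)
```

More than one hit across the Shark, Hive and Hadoop lib directories would be a strong hint about which jar to swap.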

Page 20

Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1072)
    at shark.memstore2.TableRecovery$.reloadRdds(TableRecovery.scala:49)
    at shark.SharkCliDriver.<init>(SharkCliDriver.scala:283)
    at shark.SharkCliDriver$.main(SharkCliDriver.scala:162)
    at shark.SharkCliDriver.main(SharkCliDriver.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:51)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2288)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2299)
    at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1070)
    ... 4 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1137)
    ... 9 more
Caused by: java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)

20

Page 21

Shark Jar Incompatibilities

21

Caused by: KrbException: Server not found in Kerberos database (7)
    at sun.security.krb5.KrbTgsRep.<init>(Unknown Source)
    at sun.security.krb5.KrbTgsReq.getReply(Unknown Source)
    at sun.security.krb5.KrbTgsReq.sendAndGetCreds(Unknown Source)
    at sun.security.krb5.internal.CredentialsUtil.serviceCreds(Unknown Source)
14/05/07 17:49:58 ERROR security.UserGroupInformation: PriviledgedActionException as:[email protected] cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))]
14/05/07 17:49:58 INFO security.UserGroupInformation: Initiating logout for [email protected]
14/05/07 17:49:58 INFO security.UserGroupInformation: Initiating re-login [email protected]
14/05/07 17:50:02 ERROR security.UserGroupInformation: PriviledgedActionException as:[email protected] cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7))]
14/05/07 17:50:02 WARN security.UserGroupInformation: Not attempting to re-login since the last re-login was attempted less than 600 seconds before.

Page 22

Shark vs. Hive, Spark SQL vs Shark

Big Data Benchmarks

22

https://amplab.cs.berkeley.edu/benchmark/

http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html

Page 23

Compilation: Maven, sbt, ivy, ant

• Maven/sbt/ivy/munge can be complex, finicky

[info] Resolving com.ebay.incdata.metis#metis-matching-engine;1.0-SNAPSHOT ...

[warn] module not found: com.ebay.incdata.metis#metis-matching-engine;1.0-SNAPSHOT

[warn] ==== local: tried

[warn] /Users/smonthofer/.ivy2/local/com.ebay.incdata.metis/metis-matching-engine/1.0-SNAPSHOT/ivys/ivy.xml

[warn] ==== public: tried

[warn] http://repo1.maven.org/maven2/com/ebay/incdata/metis/metis-matching-engine/1.0-SNAPSHOT/metis-matching-engine-1.0-SNAPSHOT.pom

[warn] ==== Local Maven Repository: tried

[warn] file:///var/root/.m2/repository/com/ebay/incdata/metis/metis-matching-engine/1.0-SNAPSHOT/metis-matching-engine-1.0-SNAPSHOT.pom

URI has an authority component

at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:213)

at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:122)

at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:121)

[warn] ::::::::::::::::::::::::::::::::::::::::::::::

[warn] :: UNRESOLVED DEPENDENCIES ::

[warn] ::::::::::::::::::::::::::::::::::::::::::::::

java.net.MalformedURLException: no protocol: /Users/smonthofer/.m2/repository

• build.sbt: resolvers += "Local Maven Repository" at "file:///Users/smonthofer/.m2/repository"

• Needed 3 slashes (platform independence feature)!!! Grrrr…

23
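Why three slashes: a file URI is file:// followed by an authority (empty for the local machine) and then the absolute path, which itself starts with a slash. A small sketch using Python's generic URL parser to show the difference; it illustrates URI syntax only, not how sbt or Ivy resolve repositories, and the helper name is hypothetical.

```python
from urllib.parse import quote, urlsplit


def local_repo_uri(absolute_path: str) -> str:
    """Build a file URI with an empty authority: 'file://' + '' + '/abs/path'."""
    if not absolute_path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    return "file://" + quote(absolute_path)


good = local_repo_uri("/Users/smonthofer/.m2/repository")
bad = "file://Users/smonthofer/.m2/repository"  # only two slashes

print(good)
# Three slashes: the authority is empty and the whole path survives.
print(repr(urlsplit(good).netloc), urlsplit(good).path)
# Two slashes: the first path segment is swallowed as the authority.
print(repr(urlsplit(bad).netloc), urlsplit(bad).path)
```

With two slashes the first directory name is read as a host, which matches the "URI has an authority component" complaint; with no scheme at all the result is the "no protocol" error.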

Page 24

Learned New Term:

Yak Shaving

24

From Urban Dictionary:

Any seemingly pointless activity which is actually necessary to solve a problem which solves a problem which, several levels of recursion later, solves the real problem you're working on.

origin: MIT AI Lab, after 2000: orig. probably from a Ren & Stimpy episode.

Building scalable systems is not all sexy roflscale fun. It’s a lot of plumbing and yak shaving. A lot of hacking together tools that really ought to exist already, but all the open source solutions out there are too bad (and yours ends up bad too, but at least it solves your particular problem).

- Martin Kleppmann, LinkedIn, Founder of Rapportive

Page 25

25

Simple documentation saves time later for yourself and for others

Cut/paste/collect things that work, errors, common commands and put on a wiki page (even email drafts are a fast holding place).

Source control/backups for working versions – be able to start from scratch

Maven, sbt, dependencies – complex, corruptible, bizarre tricks, multiple open source projects – magic (also scary)

Get ahead of the curve on new technology because new challenges will always come up

From xkcd

Page 26

Quoted by Spark User group user:

If you want to succeed as badly as you want the air, then you will get it… there is no other secret to success

- Socrates (lesson to his students)

Page 27

Spark at eBay -

Troubleshooting the

everyday issues

Aug. 6, 2014

Seattle Spark Meetup

Don Watters – [email protected]

Suzanne Monthofer – [email protected]