
Spark infrastructure


Page 1: Spark infrastructure

What is Spark

Apache Spark is an open source framework for fast, in-memory data processing. It currently supports Scala, Java and Python. Besides the core libraries, there is support for streaming, machine learning, DataFrames, integration with R, and a version of SQL.
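To make those pieces concrete, here is a minimal sketch in Scala (Spark 1.x-era API, matching this deck) that touches the core RDD API and the DataFrame/SQL layer. The app name and file paths are placeholders, not from the original slides.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object QuickLook {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("QuickLook"))

    // Core API: an in-memory word count over an RDD.
    val counts = sc.textFile("hdfs:///data/notes.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)

    // DataFrames and SQL driven from the same context.
    val sqlContext = new SQLContext(sc)
    val events = sqlContext.read.json("hdfs:///data/events.json")
    events.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").show()

    sc.stop()
  }
}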

Eric Marshall

Page 2: Spark infrastructure

Spark compatibility and ecosystem

• Spark runs in a clustered environment of arbitrary size and is designed to sit on top of distributed storage such as HDFS, Cassandra, or S3 (see the sketch below).
• Spark integrates with schedulers including YARN and Mesos.
• Spark scales well and has been deployed on a cluster of 8,000 nodes at the time of this writing.
• Spark can read from nearly all sources and has performant connectors to NoSQL and SQL datastores and to tools like Tableau.
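As an illustration (not from the original deck), the same RDD call can read from different storage backends just by changing the URI scheme. This assumes an existing SparkContext sc; the host, bucket, and paths are placeholders.

// Read the same way from HDFS and from S3; credentials come from the Hadoop configuration.
val fromHdfs = sc.textFile("hdfs://namenode:8020/logs/2015/*.gz")
val fromS3   = sc.textFile("s3n://example-bucket/logs/part-*")
println(fromHdfs.count() + fromS3.count())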

Page 3: Spark infrastructure

Spark and Hadoop

Spark can read from nearly all sources and has performant connectors to the Hadoop ecosystem, to other NoSQL and SQL datastores, and to tools like Tableau. Spark can connect to streams or work in batches.
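For the streaming case, a hypothetical Spark Streaming sketch (DStream API) that consumes a socket stream in 10-second micro-batches looks like this; the host and port are placeholders, and a batch job would simply use sc.textFile instead.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batch every 10 seconds from a text socket source.
val ssc = new StreamingContext(new SparkConf().setAppName("StreamSketch"), Seconds(10))
val lines = ssc.socketTextStream("stream-host", 9999)
lines.count().print()   // print the record count of each micro-batch
ssc.start()
ssc.awaitTermination()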

Spark can also run in a standalone clustered mode with HDFS or any form of shared file system (such as NFS mounted on each node at the same path).

Spark can run highly available: it is resilient to Worker failures and will move work to other Workers. Spark supports standby Masters or can rely on the cluster’s scheduling software.
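For standby Masters in standalone mode, a ZooKeeper-based recovery configuration looks roughly like the sketch below; the ZooKeeper hosts are placeholders.

# In conf/spark-env.sh on each Master (sketch):
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"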

Alternatively, Spark can run within Hadoop as a YARN job, reading from and writing to HDFS and connecting to other data sources.
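As a hypothetical example, submitting an application to YARN in cluster mode might look like this; the jar name, class, and resource sizes are placeholders.

spark-submit \
  --master yarn-cluster \
  --class com.example.QuickLook \
  --num-executors 10 \
  --executor-memory 4g \
  quicklook-assembly.jar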

Page 4: Spark infrastructure

Spark Tasks

Spark is agnostic regarding the underlying cluster manager. Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager or Mesos/YARN), which allocate resources across applications.
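In code, the choice of cluster manager comes down to the master URL handed to the SparkContext. A minimal sketch, with placeholder host names:

import org.apache.spark.{SparkConf, SparkContext}

// spark:// selects Spark's standalone manager; mesos://mesos-host:5050 selects Mesos;
// for YARN the master is typically set when the job is submitted.
val conf = new SparkConf()
  .setAppName("ClusterDemo")
  .setMaster("spark://master-host:7077")
val sc = new SparkContext(conf)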

Each application has its own executor processes, which run tasks in multiple threads, provide isolation between Spark contexts, and serve as a unit of work on the scheduling side.

If configured to do so, Spark uses resources dynamically, scaling executors up and down as the work demands (currently supported only via YARN).
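A sketch of enabling dynamic allocation on YARN; the executor bounds are placeholders, and the external shuffle service must be running on the NodeManagers.

val conf = new SparkConf()
  .setAppName("ElasticJob")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // keeps shuffle data available when executors are removed
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")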