[Spark meetup] Spark Streaming Overview

  • Published on
    14-Jul-2015

  • SPARK STREAMING OVERVIEW

  • Spark SQL

    Spark Streaming

    MLlib (machine learning)

    GraphX (graph)

  • Kafka decouples producers of information from its consumers: producers are never blocked, and they do not need to know who the final consumers are.

    Each consumer keeps control of its own read offset.

    On-demand topic creation.
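
The decoupling described above can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions (a single partition, no persistence, no network); `TopicLog` and `Consumer` are hypothetical names and not part of any Kafka client API.

```python
class TopicLog:
    """Append-only log. It keeps no record of who reads from it."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Producers only append; they never wait for a consumer.
        self._records.append(record)

    def read_from(self, offset, max_records):
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own read position in the log."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        records = self._log.read_from(self.offset, max_records)
        self.offset += len(records)
        return records

    def seek(self, offset):
        # Because the offset belongs to the consumer, it can rewind and re-read.
        self.offset = offset


log = TopicLog()
for message in ["m0", "m1", "m2"]:
    log.append(message)

fast, slow = Consumer(log), Consumer(log)
fast_batch = fast.poll()        # reads everything available
slow_first = slow.poll(1)       # reads a single record
slow.seek(0)                    # rewinds without affecting `fast`
slow_replay = slow.poll(2)
```

Since the offset lives in the consumer, a slow or replaying consumer does not affect the producer or any other consumer.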

  • ETL and ELT, wide catalog of sources and sinks

    Flexible design of topologies and agent deployment strategies.

    Data transformation, thanks to interceptors.

  • readClob, readCSV, readLine, readMultiLine, readAvro, readJson

    addCurrentTime, addLocalHost, geoIP, findReplace, Split

    generateUUID, decompress, If, extractJsonPaths, detectMimeType

    xquery, extractURIComponents, xslt, Grok (regular expressions)

    exec

    spooling

    logger

  • CASSANDRA

    Kafka

    STRATIO DEEP

    STRATIO DEEP

  • Shark (SQL)

    Spark Streaming

    MLlib (machine learning)

    GraphX (graph)

  • RDD, what is that?

  • Spark Streaming: Overall view

  • Discretized Stream or DStream.
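
The idea behind a discretized stream can be sketched in plain Python: a continuous stream of events is cut into consecutive micro-batches, one per batch interval, and every transformation is applied batch by batch. This is a conceptual model only; `discretize` is a hypothetical helper, and the timestamps and the one-second interval are made up for illustration. It does not use the Spark API.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive micro-batches.

    Batch k holds the events with
    k * batch_interval <= timestamp < (k + 1) * batch_interval.
    Intervals with no events still produce an (empty) batch.
    """
    if not events:
        return []
    buckets = defaultdict(list)
    for timestamp, value in events:
        buckets[int(timestamp // batch_interval)].append(value)
    last = max(buckets)
    return [buckets.get(k, []) for k in range(last + 1)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = discretize(events, batch_interval=1.0)

# A stream-level map is just the same function applied to every batch.
upper = [[value.upper() for value in batch] for batch in batches]
```

In Spark Streaming each of these batches corresponds to one RDD, which is why the RDD operations carry over to DStreams.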

  • Overall view

  • Input DStreams and Receivers.

    Basic sources (distributed with Spark Streaming).

    Advanced sources (available as separate dependencies).

  • Basic sources

    File Stream.

    Sockets.

    Actors (Akka).

    Queue of RDDs (for testing).

  • Advanced sources

  • Do It Yourself

    Code onStart()

    Code onStop()

    Code receive()

    Custom Receiver ready!
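
The three steps above can be modelled in plain Python to show how the pieces fit together. This is a sketch, not Spark code: Spark's own custom receivers are written against its Scala/Java `Receiver` class, while the `Receiver` base class, `ListReceiver`, and the snake_case method names below are hypothetical stand-ins. The point it illustrates is that `on_start()` must return quickly, so the actual reading happens in `receive()` on a separate thread.

```python
import threading


class Receiver:
    """Hypothetical stand-in for a receiver base class (not Spark's)."""

    def __init__(self):
        self._stopped = threading.Event()
        self.stored = []

    def store(self, record):
        # In a real system this would hand the record over to the engine.
        self.stored.append(record)

    def is_stopped(self):
        return self._stopped.is_set()

    def start(self):
        self._stopped.clear()
        self.on_start()

    def stop(self):
        self._stopped.set()
        self.on_stop()


class ListReceiver(Receiver):
    """Receives records from a finite in-memory source."""

    def __init__(self, source):
        super().__init__()
        self._source = source
        self._thread = None

    def on_start(self):
        # Must not block: launch receive() on its own thread and return.
        self._thread = threading.Thread(target=self.receive, daemon=True)
        self._thread.start()

    def on_stop(self):
        # Wait for the receiving thread to notice the stop flag and exit.
        if self._thread is not None:
            self._thread.join()

    def receive(self):
        for record in self._source:
            if self.is_stopped():
                break
            self.store(record)

    def wait_until_drained(self):
        # Helper for this demo only: the source is finite, so we can wait.
        self._thread.join()


receiver = ListReceiver(["a", "b", "c"])
receiver.start()
receiver.wait_until_drained()
receiver.stop()
```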

  • map(func), flatMap(func), filter(func), count()

    repartition(numPartitions)

    union(otherStream)

    reduce(func), countByValue(), reduceByKey(func, [numTasks])

    join(otherStream, [numTasks]), cogroup(otherStream, [numTasks])

    transform(func)

    updateStateByKey(func)

    window(windowLength, slideInterval)

    countByWindow(windowLength, slideInterval)

    reduceByWindow(func, windowLength, slideInterval)

    reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])

    countByValueAndWindow(windowLength, slideInterval, [numTasks])

    print()

    foreachRDD(func)

    saveAsObjectFiles(prefix, [suffix])

    saveAsTextFiles(prefix, [suffix])

    saveAsHadoopFiles(prefix, [suffix])
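
The windowed operations in the list above all follow the same pattern, which can be sketched in plain Python. This is a conceptual model rather than the Spark API: `windows` and `count_by_value_and_window` are hypothetical helpers, and the window length and slide interval are expressed as a number of batches here, whereas Spark expresses them as durations that are multiples of the batch interval.

```python
from collections import Counter


def windows(batches, window_length, slide_interval):
    """Yield one window every `slide_interval` batches.

    Each window is the concatenation of the most recent
    `window_length` batches (fewer at the start of the stream).
    """
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        yield [item for batch in batches[start:end] for item in batch]


def count_by_value_and_window(batches, window_length, slide_interval):
    """Count the occurrences of each value inside every window."""
    return [dict(Counter(window))
            for window in windows(batches, window_length, slide_interval)]


batches = [["a"], ["a", "b"], ["b"], ["c"]]
result = count_by_value_and_window(batches, window_length=3, slide_interval=2)
```

With a window of three batches sliding every two batches, consecutive windows overlap by one batch, so the second batch is counted in both results.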

  • Stateful transformations (updateStateByKey, reduceByKeyAndWindow).

    As a fault-tolerance mechanism when the driver crashes.

    HDFS (or another fault-tolerant file system) is mandatory if you are going to use operations that require checkpointing.
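
To see why stateful transformations depend on checkpointing, it helps to model what updateStateByKey does. The sketch below is plain Python, not the Spark API; `update_state_by_key` and `running_count` are hypothetical names. The state after each batch depends on every batch before it, so without periodically saving the state, recovery would require replaying the whole stream.

```python
def update_state_by_key(batches, update_func):
    """Carry per-key state across batches of (key, value) pairs.

    For every key seen so far, `update_func(new_values, previous_state)`
    is called once per batch. Returning None drops the key.
    Returns a snapshot of the state after each batch.
    """
    state = {}
    snapshots = []
    for batch in batches:
        new_values = {}
        for key, value in batch:
            new_values.setdefault(key, []).append(value)
        for key in set(state) | set(new_values):
            updated = update_func(new_values.get(key, []), state.get(key))
            if updated is None:
                state.pop(key, None)
            else:
                state[key] = updated
        snapshots.append(dict(state))
    return snapshots


def running_count(new_values, previous):
    # `previous` is None the first time a key is seen.
    return (previous or 0) + sum(new_values)


batches = [
    [("a", 1), ("b", 1)],
    [("a", 1)],
    [("b", 1), ("b", 1)],
]
snapshots = update_state_by_key(batches, running_count)
```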

  • Configuration parameters

    spark.streaming.receiver.maxRate

    spark.streaming.concurrentJobs

    spark.streaming.receiver.writeAheadLog.enable

    spark.streaming.unpersist
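
One way to pass these parameters is as `--conf key=value` arguments to `spark-submit`. The sketch below only builds that argument list in plain Python; the values are illustrative assumptions rather than recommendations, and `to_submit_args` is a hypothetical helper.

```python
# Illustrative values only; tune them for the actual workload.
settings = {
    # Maximum number of records per second each receiver will accept.
    "spark.streaming.receiver.maxRate": "1000",
    # Number of streaming jobs allowed to run at the same time.
    "spark.streaming.concurrentJobs": "1",
    # Persist received data to a write-ahead log before processing.
    "spark.streaming.receiver.writeAheadLog.enable": "true",
    # Let Spark Streaming unpersist the RDDs it generated automatically.
    "spark.streaming.unpersist": "true",
}


def to_submit_args(conf):
    """Render a settings dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(settings)
```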

  • Each node has mutable state; for each record, it has to update that state and send out new records.
