Processing Big Data in Real-Time - Yanai Franchi, Tikal

1

Processing “BIG-DATA” In Real Time

Yanai Franchi, Tikal

2

Two years ago...

3

4

Vacation to Barcelona

5

After a Long Travel Day

6

Going to a Salsa Club

7

Best Salsa Club NOW

● Good Music

● Crowded – Now!

8

Same Problem in “gogobot”

9

10

Gogobot Checkin Heat-Map Service

Let's Develop “Gogobot Checkins Heat-Map”

11

Key Notes

● Collector Service - collects checkins as text addresses

– We need to use a GeoLocation Service

● Upon each elapsed interval, the list of the latest locations is displayed as a Heat-Map in the GUI.

● Web-scale service - tens of thousands of checkins per second, all over the world (imaginary, but let's do it for the exercise).

● Accuracy – Sample data, NOT critical data.

– Proportionately representative

– Data volume is large enough to compensate for data loss.

12

Heat-Map Context

[Diagram: Gogobot micro services inside the Gogobot system send text addresses to the Checkins Heat-Map Service; the service calls the Geo Location Service via Get-GeoCode(Address) and serves the last interval's locations back as a Heat-Map.]

13

Plan A

[Diagram: Simulate checkins with a file (Check-in #1, Check-in #2, Check-in #3, ...). Processing checkins: read each text address, GET its geo location from the Geo Location Service, and persist the checkin intervals to a database.]

14

Tons of Addresses Arriving Every Second

15

Architect - First Reaction...

16

Second Reaction...

17

Developer First Reaction

18

Second Reaction

19

Problems?

● Tedious: time goes into configuring where to send messages, deploying workers, and deploying intermediate queues.

● Brittle: There's little fault-tolerance.

● Painful to scale: repartitioning the running workers is complicated.

20

What We Want?

● Horizontal scalability
● Fault-tolerance
● No intermediate message brokers!
● Higher level abstraction than message passing
● “Just works”
● Guaranteed data processing (not needed in this case)

21

Apache Storm

✔Horizontal scalability

✔Fault-tolerance

✔No intermediate message brokers!

✔Higher level abstraction than message passing

✔“Just works”

✔Guaranteed data processing

22

Anatomy of Storm

23

What is Storm?

● CEP - an open-source, distributed realtime computation system.

– Makes it easy to reliably process unbounded streams of tuples

– Does for realtime processing what Hadoop did for batch processing

● Fast - 1M tuples/sec per node.

– Scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate

24

Streams

[Diagram: Tuple, Tuple, Tuple, ... flowing as a stream]

Unbounded sequence of tuples

25

Spouts

[Diagram: a spout emitting tuples]

Sources of streams

26

Bolts

[Diagram: a bolt consuming input tuples and emitting new ones]

Processes input streams and produces new streams

27

Storm Topology

Network of spouts and bolts

[Diagram: tuples flowing through a network of spouts and bolts]

28

Guarantee for Processing

● Storm guarantees the full processing of a tuple by tracking its state.

● In case of failure, Storm can re-process it.

● Source tuples with fully “acked” trees are removed from the system.
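To make the guarantee concrete, here is a minimal sketch (mine, not the deck's) of a bolt that anchors its output tuples and acks or fails its input; package names assume org.apache.storm (older releases used backtype.storm). The BaseBasicBolt used later in this deck does the anchoring and acking automatically.

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class TrackedBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            // Anchoring: the emitted tuple joins the input's tuple tree
            collector.emit(input, new Values(input.getString(0).toUpperCase()));
            collector.ack(input);   // tell Storm this tuple is fully processed
        } catch (Exception e) {
            collector.fail(input);  // tell Storm to replay the source tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("str"));
    }
}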

29

Tasks (Bolt/Spout Instance)

Spouts and bolts execute as many tasks across the cluster

30

Stream Grouping

When a tuple is emitted, which task (instance) does it go to?

31

Stream Grouping

● Shuffle grouping: pick a random task

● Fields grouping: consistent hashing on a subset of tuple fields

● All grouping: send to all tasks

● Global grouping: pick the task with the lowest id
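As a preview of how these groupings are declared against the TopologyBuilder API (the full Heat-Map wiring appears on the topology slides later), a minimal sketch using the deck's component names:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("checkins", new CheckinsSpout());

// Shuffle grouping: each checkin goes to a random geocode-lookup task
builder.setBolt("geocode-lookup", new GeocodeLookupBolt())
       .shuffleGrouping("checkins");

// Fields grouping: tuples with the same "city" always hash to the same task
builder.setBolt("heatmap-builder", new HeatMapBuilderBolt())
       .fieldsGrouping("geocode-lookup", new Fields("city"));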

32

Tasks, Executors, Workers

[Diagram: a JVM worker process contains executor threads; each executor thread runs one or more spout/bolt tasks.]

33

[Diagram: two nodes, each with a Supervisor and a worker process; inside each worker, one executor runs Spout A, one runs two tasks of Bolt B, and one runs Bolt C.]

34

Storm Architecture

[Diagram: Nimbus, the master node (similar to the Hadoop JobTracker), uploads/rebalances the Heat-Map topology to the Supervisor nodes, coordinating through ZooKeeper. Nimbus is NOT critical for a running topology.]

35

Storm Architecture

[Same diagram: the ZooKeeper nodes - a few of them - are used for cluster coordination.]

36

Storm Architecture

[Same diagram: the Supervisor nodes run the worker processes.]

37

Assembling Heatmap Topology

38

HeatMap Input/Output Tuples

● Input tuples: timestamp and text address:

– (9:00:07 PM, “287 Hudson St New York NY 10013”)

● Output tuple: time interval and the list of points for it:

– (9:00:00 PM to 9:00:15 PM, List((40.719,-73.987), (40.726,-74.001), (40.719,-73.987)))

39

Heat Map Storm Topology

[Diagram: the Checkins Spout emits (9:01 PM @ 287 Hudson St) to the GeocodeLookup Bolt, which emits (9:01 PM, (40.736, -74.354)) to the HeatmapBuilder Bolt; upon each elapsed interval, it emits (9:00 PM - 9:15 PM, List((40.73, -74.34), (51.36, -83.33), (69.73, -34.24))) to the Persistor Bolt.]

40

Checkins Spout

public class CheckinsSpout extends BaseRichSpout {

    private List<String> sampleLocations;
    private int nextEmitIndex;
    private SpoutOutputCollector outputCollector;

    @Override
    public void open(Map map, TopologyContext topologyContext,
                     SpoutOutputCollector spoutOutputCollector) {
        this.outputCollector = spoutOutputCollector;
        this.nextEmitIndex = 0;
        try {
            sampleLocations = IOUtils.readLines(
                ClassLoader.getSystemResourceAsStream("sample-locations.txt"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Called iteratively by Storm on a single thread:
    // we can hold state, no need for thread safety
    @Override
    public void nextTuple() {
        String address = sampleLocations.get(nextEmitIndex);
        // "timestamp@address", parsed downstream by GeocodeLookupBolt
        String checkin = new Date().getTime() + "@" + address;
        outputCollector.emit(new Values(checkin));
        nextEmitIndex = (nextEmitIndex + 1) % sampleLocations.size();
    }

    // Declare output fields
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("str"));
    }
}

41

Geocode Lookup Bolt

public class GeocodeLookupBolt extends BaseBasicBolt {

    private LocatorService locatorService;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        locatorService = new GoogleLocatorService();
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        String str = tuple.getStringByField("str");
        String[] parts = str.split("@");
        Long time = Long.valueOf(parts[0]);
        String address = parts[1];

        // Get geocode, create DTO
        LocationDTO locationDTO = locatorService.getLocation(address);
        String city = locationDTO.getCity();
        outputCollector.emit(new Values(city, time, locationDTO));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer fieldsDeclarer) {
        fieldsDeclarer.declare(new Fields("city", "time", "location"));
    }
}

42

Tick Tuple – Repeating Mantra

43

Two Streams to Heat-Map Builder

[Diagram: checkins (Checkin 1, 4, 5, 6) and tick tuples arrive as two streams at the HeatMap-Builder Bolt. On a tick tuple, we flush our Heat-Map.]

44

Tick Tuple in Action

public class HeatMapBuilderBolt extends BaseBasicBolt {

    // Hold the latest intervals
    private Map<String, List<LocationDTO>> heatmaps = new HashMap<>();

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Config conf = new Config();
        // Tick interval
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);
        return conf;
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        if (isTickTuple(tuple)) {
            // Emit accumulated intervals
        } else {
            // Add check-in info to the current interval in the Map
        }
    }

    private boolean isTickTuple(Tuple tuple) {
        return tuple.getSourceComponent().equals(Constants.SYSTEM_COMPONENT_ID)
            && tuple.getSourceStreamId().equals(Constants.SYSTEM_TICK_STREAM_ID);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("time-interval", "city", "locationsList"));
    }
}
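The two branches of execute() are elided on the slide; a hypothetical filling-in (my sketch, not the deck's code), keyed by city and consistent with the declared input and output fields:

// Hypothetical sketch of the elided branches:
private void addCheckin(Tuple tuple) {
    String city = tuple.getStringByField("city");
    LocationDTO location = (LocationDTO) tuple.getValueByField("location");
    heatmaps.computeIfAbsent(city, k -> new ArrayList<>()).add(location);
}

private void emitHeatmap(BasicOutputCollector outputCollector) {
    Long intervalEnd = System.currentTimeMillis();
    for (Map.Entry<String, List<LocationDTO>> entry : heatmaps.entrySet()) {
        outputCollector.emit(new Values(intervalEnd, entry.getKey(), entry.getValue()));
    }
    heatmaps.clear();  // start accumulating the next interval
}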

45

Persister Bolt

public class PersistorBolt extends BaseBasicBolt {

    private Jedis jedis;
    private ObjectMapper objectMapper;

    @Override
    public void execute(Tuple tuple, BasicOutputCollector outputCollector) {
        Long timeInterval = tuple.getLongByField("time-interval");
        String city = tuple.getStringByField("city");
        String locationsList;
        try {
            locationsList = objectMapper.writeValueAsString(
                tuple.getValueByField("locationsList"));
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }

        String dbKey = "checkins-" + timeInterval + "@" + city;

        // Persist in Redis for 24h
        jedis.setex(dbKey, 3600 * 24, locationsList);

        // Publish on a Redis channel for debugging
        jedis.publish("location-key", dbKey);
    }
}
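For the debugging channel, a hypothetical subscriber (my sketch, not in the deck) using the Jedis pub/sub API; it assumes a Jedis version where JedisPubSub's other callbacks default to no-ops:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

public class DebugSubscriber {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost");
        // Blocks, printing each interval key as the PersistorBolt publishes it
        jedis.subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String message) {
                System.out.println(channel + " -> " + message);
            }
        }, "location-key");
    }
}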

46

Transforming the Tuples

[Diagram: a sample checkins file (Check-in #1, Check-in #2, Check-in #3, ...) is read as text addresses by the Checkins Spout; shuffle grouping passes them to the GeocodeLookup Bolt, which gets each geo location from the Geo Location Service; field grouping on (city) passes them, grouped by city, to the HeatmapBuilder Bolt; shuffle grouping passes the intervals on to the Database Persistor Bolt.]

47

Heat Map Topology

public class LocalTopologyRunner {

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = buildTopology();
        StormSubmitter.submitTopology(
            "local-heatmap", new Config(), builder.createTopology());
    }

    private static TopologyBuilder buildTopology() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("checkins", new CheckinsSpout());

        builder.setBolt("geocode-lookup", new GeocodeLookupBolt())
               .shuffleGrouping("checkins");

        builder.setBolt("heatmap-builder", new HeatMapBuilderBolt())
               .fieldsGrouping("geocode-lookup", new Fields("city"));

        builder.setBolt("persistor", new PersistorBolt())
               .shuffleGrouping("heatmap-builder");

        return builder;
    }
}
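An aside: despite the class name, StormSubmitter deploys to a real cluster. For a truly in-process run one would presumably use LocalCluster instead, roughly:

// Assuming org.apache.storm.LocalCluster (backtype.storm in older releases)
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("local-heatmap", new Config(),
        buildTopology().createTopology());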

48

It's NOT Scaled

49

50

Scaling the Topology

public class LocalTopologyRunner {

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = buildTopology();
        Config conf = new Config();
        conf.setNumWorkers(2);   // Set no. of workers
        StormSubmitter.submitTopology(
            "local-heatmap", conf, builder.createTopology());
    }

    private static TopologyBuilder buildTopology() {
        TopologyBuilder builder = new TopologyBuilder();
        // Parallelism hints on each component
        builder.setSpout("checkins", new CheckinsSpout(), 4);

        builder.setBolt("geocode-lookup", new GeocodeLookupBolt(), 8)
               .shuffleGrouping("checkins")
               .setNumTasks(64);   // Increase tasks for future growth

        builder.setBolt("heatmap-builder", new HeatMapBuilderBolt(), 4)
               .fieldsGrouping("geocode-lookup", new Fields("city"));

        builder.setBolt("persistor", new PersistorBolt(), 2)
               .shuffleGrouping("heatmap-builder")
               .setNumTasks(4);

        return builder;
    }
}

51

Recap - Plan A

[Diagram: a sample checkins file (Check-in #1, Check-in #2, Check-in #3, ...) is read as text addresses by the Storm Heat-Map Topology, which GETs geo locations from the Geo Location Service and persists the checkin intervals to the database.]

52

We have something working

53

Add Kafka Messaging

54

Plan B - Kafka Spout & Bolt to HeatMap

[Diagram: checkins are published to a Checkin Kafka Topic and read by the Kafka Checkins Spout; the GeocodeLookup Bolt calls the Geo Location Service for each text address; the HeatmapBuilder Bolt feeds the Persistor Bolt, which writes to the database; finally, the Kafka Locations Bolt publishes to a Locations Topic.]

55

56

They are all good, but not for all use-cases

57

Kafka
A little introduction

58

59

60

61

Pub-Sub Messaging System

62

63

64

65

66

Stateless Broker & Doesn't Fear the File System

67

68

69

70

Topics

● Logical collections of partitions (the physical files)

● A broker contains some of the partitions for a topic
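A minimal producer sketch against the modern Kafka Java client (the deck's era shipped the older kafka.javaapi producer; the broker address and topic name here are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CheckinProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Records with the same key always land in the same partition
        producer.send(new ProducerRecord<>("checkins-topic",
                "new-york", "1375724400000@287 Hudson St New York NY 10013"));
        producer.close();
    }
}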

71

A Partition is Consumed by Exactly One Group's Consumer
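The consumer side, sketched with the same modern client: every consumer sharing a group.id gets an exclusive slice of the topic's partitions, which is what this rule means in practice.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CheckinConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing this id split the partitions between them
        props.put("group.id", "heatmap-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("checkins-topic"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.partition() + ": " + record.value());
                }
            }
        }
    }
}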

72

Distributed & Fault-Tolerant

73

[Diagram, animated across slides 73-85: Brokers 1-3 coordinated by ZooKeeper, with Producers 1-2 publishing and Consumers 1-2 consuming. A fourth broker joins and the partitions rebalance across it; it later leaves and the cluster recovers; then Consumer 2 drops out and its partitions are reassigned to Consumer 1. Producers and consumers keep working throughout.]

86

Performance Benchmark

3 brokers, 3 producers, 3 consumers, on cheap machines

• “Up to 2 million writes/sec on 3 cheap machines”

• Using 3 producers on 3 different machines, 3x async replication

• Only 1 producer per machine, because the NIC is already saturated

• End-to-end latency is about 10ms for the 99.9th percentile

• Sustained throughput as stored data grows

87

88

Add Kafka to our Topology

public class LocalTopologyRunner {
    ...
    private static TopologyBuilder buildTopology() {
        ...
        // Kafka spout
        builder.setSpout("checkins", new KafkaSpout(kafkaConfig), 4);
        ...
        // Kafka bolt
        builder.setBolt("kafkaProducer", new KafkaOutputBolt(
                "localhost:9092",
                "kafka.serializer.StringEncoder",
                "locations-topic"))
               .shuffleGrouping("persistor");

        return builder;
    }
}
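The kafkaConfig referenced on the slide is not shown; with the storm-kafka module it would presumably be built roughly like this (package names vary by release - storm.kafka in older ones, org.apache.storm.kafka later):

BrokerHosts zkHosts = new ZkHosts("localhost:2181");
SpoutConfig kafkaConfig = new SpoutConfig(
        zkHosts,            // ZooKeeper, used to discover the Kafka brokers
        "checkins-topic",   // topic to read (illustrative name)
        "/kafka-spout",     // ZK root where the spout stores its offsets
        "checkins-reader"); // id for this spout's offset records
kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());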

89

[Diagram: a Checkin HTTP Reactor publishes checkins (text addresses) to the Checkin Kafka Topic; the Storm Heat-Map Topology consumes the checkins, GETs each geo location from the Geo Location Service, persists the checkin intervals to the database, and publishes each interval key to the Locations Kafka Topic.]

90

Demo

91

Summary

When you go out to a Salsa club...

● Good Music

● Crowded

92

More Conclusions...

● BigData - also refers to the Velocity of data (not only the Volume of data)

● Storm - great for real-time BigData processing; complementary to Hadoop batch jobs.

● Kafka - great messaging for logs/events data; serves as a good “source” for a Storm spout.

93

Thanks