
Page 1

SPUD: A Distributed High Performance Publish-Subscribe Cluster

Uriel Peled and Tal Kol

Guided by Edward Bortnikov

Software Systems Laboratory, Faculty of Electrical Engineering, Technion

Page 2

Project Goal

Design and implement a general-purpose Publish-Subscribe server

Push traditional implementations to global-scale performance demands:

1 million concurrent clients
Millions of concurrent topics
High transaction rate

Demonstrate the server's abilities with a fun client application

Page 3

What is Pub/Sub?

(Diagram: one client subscribes to topic://traffic-jams/ayalon; another client publishes "accident in hashalom" to the topic, and the message is delivered to the subscriber.)
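The pattern itself is easy to state. A minimal in-process sketch in C++ (a toy broker; the class and method names are illustrative, not SPUD's actual API):

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Toy broker: subscribers register a callback per topic, publishers push a
// message to a topic, and every subscriber of that topic gets notified.
class Broker {
    std::map<std::string, std::vector<std::function<void(const std::string&)>>> subs_;
public:
    void Subscribe(const std::string& topic, std::function<void(const std::string&)> cb) {
        subs_[topic].push_back(std::move(cb));
    }
    void Publish(const std::string& topic, const std::string& msg) {
        for (auto& cb : subs_[topic]) cb(msg);   // notify every subscriber of the topic
    }
};

int main() {
    Broker broker;
    broker.Subscribe("topic://traffic-jams/ayalon",
                     [](const std::string& m) { std::cout << "notified: " << m << "\n"; });
    broker.Publish("topic://traffic-jams/ayalon", "accident in hashalom");
}

The hard part, as the following slides show, is doing the same thing for a million clients spread across the Internet rather than inside one process.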

Page 4

What Can We Do With It? Collaborative Web Browsing


Page 5

What Can We Do With It? Instant Messaging

(Diagram: one user sends “Hi buddy!” and the other user receives it.)

Page 6

Seems Easy To Implement, But…

“I’m behind a NAT, I can’t connect!”
Not all client setups are server friendly

“Server is too busy, try again later?!”
1 million concurrent clients is simply too much

“The server is so slow!!!”
Service time grows exponentially with load

“A server crashed, everything is lost!”

Single points of failure will eventually fail

Page 7

Naïve Implementation (example 1)

Simple UDP for client-server communication

No need for sessions, since we just send messages
Very low cost per client
Sounds perfect?

(Diagram: the clients sit behind a NAT.)

Page 8

NAT Traversal

UDP hole punching
NAT will accept a UDP reply for a short window
Our measurements: 15-30 seconds
Keep a UDP ping from each client every 15 s (see the sketch below)

Days-long TCP sessions
NAT remembers current sessions for replies
If the WWW works, we should work
Dramatically increases the cost per client
Our research: all IMs do exactly this
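A minimal sketch of the client-side keep-alive that makes UDP hole punching work, assuming plain POSIX sockets (the server address, port and ping payload are illustrative; this is not the SPUD client code):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    sockaddr_in server{};
    server.sin_family = AF_INET;
    server.sin_port = htons(5000);                            // illustrative server port
    inet_pton(AF_INET, "198.51.100.10", &server.sin_addr);    // illustrative server address

    const char ping[] = "PING";
    for (;;) {
        // Every outgoing datagram refreshes the NAT's UDP mapping; the slides
        // measured the mapping lifetime at 15-30 seconds, so ping every 15 s
        // and server-to-client replies keep getting through.
        sendto(sock, ping, sizeof(ping), 0,
               reinterpret_cast<const sockaddr*>(&server), sizeof(server));
        sleep(15);
    }
}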

Page 9

Naïve Implementation (example 2)

Blocking I/O with one thread per client

Basic model for most servers (the Java default)
Traditional UNIX: fork for every client
Sounds perfect?

(Diagram: groups of 500 clients each connecting to the server.)

Page 10

Network I/O Internals

Blocking I/O: one thread per client
2 MB of stack per thread: 1 GB of virtual address space is enough for only 512 threads (!)

Non-blocking I/O: select
Linear fd searches are very slow

Asynchronous I/O: completion ports (sketched below)
Thread pool to handle request completions
Our measurements: 30,000 concurrent clients!
What is the bottleneck?
Number of locked pages (zero-byte receives)
TCP/IP kernel driver non-paged pool allocations
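A minimal sketch of the completion-port model described above, using the Win32/Winsock calls involved (the structure and names are illustrative, not SPUD's actual IOStack code):

#include <winsock2.h>
#include <windows.h>
// link with ws2_32.lib

HANDLE g_iocp;

// Post a zero-byte overlapped receive: completion fires when data arrives,
// but no user buffer is locked in non-paged pool while the socket is idle.
void PostZeroByteRecv(SOCKET s, OVERLAPPED* ov) {
    WSABUF buf{0, nullptr};
    DWORD flags = 0;
    WSARecv(s, &buf, 1, nullptr, &flags, ov, nullptr);
}

DWORD WINAPI WorkerThread(LPVOID) {
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;            // each socket is attached with itself as the key
        OVERLAPPED* ov = nullptr;
        if (!GetQueuedCompletionStatus(g_iocp, &bytes, &key, &ov, INFINITE))
            continue;                 // failed I/O or peer gone: clean the connection up here
        SOCKET s = static_cast<SOCKET>(key);
        // Data is available now: do the real (non-zero-byte) receive on s,
        // process the message, then re-arm the zero-byte receive.
        PostZeroByteRecv(s, ov);
    }
}

int main() {
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);
    g_iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 0);

    SYSTEM_INFO si;
    GetSystemInfo(&si);
    for (DWORD i = 0; i < si.dwNumberOfProcessors; ++i)    // processor-sized thread pool
        CreateThread(nullptr, 0, WorkerThread, nullptr, 0, nullptr);

    // Each accepted client socket is then attached to the port and armed:
    //   CreateIoCompletionPort((HANDLE)clientSocket, g_iocp, (ULONG_PTR)clientSocket, 0);
    //   PostZeroByteRecv(clientSocket, perConnectionOverlapped);
    Sleep(INFINITE);
}

The zero-byte receive is what the "locked pages" bullet above refers to: no receive buffer is pinned while a connection sits idle.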

Page 11

Scalability

Scale up: buy a bigger box
Scale out: buy more boxes

Which one to do? Both!
Push each box to its hardware maximum (thousands of servers are impractical)
Add the relevant boxes as load increases, the Google way (cheap PC server farms)

Page 12

Identify Our Load Factors

Concurrent TCP clients
Scale up: async I/O, zero-byte receives, a larger non-paged pool (NPP)
Scale out: dedicate boxes to handling clients => Connection Server (CS)

High transaction throughput (topic load)
Scale up: software optimizations
Scale out: dedicate boxes to handling topics => Topic Server (TS)

Design the cluster accordingly

Page 13

Network Architecture

(Diagram: the cluster is split into rooms, Room 1 to Room 3; each room holds several connection servers (C1-C3) and topic servers (T1-T3), and two client load balancers, CLB1 and CLB2, front the whole cluster.)

Page 14

Client Load Balancing

(Diagram: a client sends "request CS" to the CLB; the CLB load-balances on user location and per-CS client load across CS1-CS3 and answers "given CS2"; the client then sends login, subscribe and publish to CS2, which works against the topic servers TS1 and TS2.)

Page 15

Topic Load Balancing: Static

(Diagram: in Room 0, the CS receives subscribe:traffic, hashes the topic name (923481 in the example), computes 923481 % 4 = 1, and forwards the subscribe to TS1 out of TS0-TS3.)
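A minimal sketch of the static mapping, in the spirit of CSSpecific::FindTs from the class diagrams later in the deck (the hash function itself is an assumption; the slide only shows hash % number-of-TS selecting the server):

#include <cstdint>
#include <iostream>
#include <string>

// Illustrative string hash (FNV-1a); any stable hash works as long as every
// CS computes the same value for the same topic name.
uint32_t CalcTopicHash(const std::string& topic) {
    uint32_t h = 2166136261u;
    for (unsigned char c : topic) { h ^= c; h *= 16777619u; }
    return h;
}

// Static mapping: the topic's hash modulo the number of topic servers picks
// the owning TS, e.g. 923481 % 4 = 1 -> TS1 on the slide.
int GetTsFromTopic(const std::string& topic, int numTopicServers) {
    return static_cast<int>(CalcTopicHash(topic) % numTopicServers);
}

int main() {
    std::cout << "topic 'traffic' is owned by TS" << GetTsFromTopic("traffic", 4) << "\n";
}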

Page 16

Topic Load Balancing: Dynamic

(Diagram: the CS in Room 0 sends the subscribe to its TS1; the request passes through the TS1 of each room, accumulating per-room load figures (R0: 345K, R1: 278K, R2: 301K); the least loaded TS, in Room 1 with 278K, ends up handling the subscribe.)
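A minimal sketch of the dynamic choice, in the spirit of TSSpecific::TsLoadBalancer::ChooseTs() from the class diagrams later in the deck (the data structure is illustrative; the load figures are the slide's example):

#include <iostream>
#include <string>
#include <vector>

struct TsLoad {
    std::string room;
    long load;          // e.g. subscriptions currently handled by that room's TS
};

// Pick the least loaded TS among the load reports gathered across the rooms.
size_t ChooseTs(const std::vector<TsLoad>& reports) {
    size_t best = 0;
    for (size_t i = 1; i < reports.size(); ++i)
        if (reports[i].load < reports[best].load) best = i;
    return best;
}

int main() {
    // Load figures from the slide: Room 1's TS is the least loaded, so it
    // ends up handling the subscribe.
    std::vector<TsLoad> reports = {{"Room 0", 345000}, {"Room 1", 278000}, {"Room 2", 301000}};
    std::cout << "subscribe handled by the TS in " << reports[ChooseTs(reports)].room << "\n";
}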

Page 17

Performance Pitfalls

Data Copies
Single instance with reference counting (REF_BLOCK); see the sketch after this list
Multi-buffer messages (MESSAGE: header, body, tail)

Context Switches
Flexible module execution foundation (MODULE)
Thread pools sized to the number of processors

Memory Allocation
MM: custom memory pools (POOL, POOL_BLOCK)
Fine-grained locking, pre-allocation, batching, single-size blocks

Lock Contention
EVENT, MUTEX, RW_MUTEX, interlocked API
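A minimal sketch of the single-instance idea behind REF_BLOCK, loosely modeled on Memory::RefBlock from the class diagram on the next slide (fields and pooling are simplified; a real implementation draws the blocks from the custom pools listed above):

#include <atomic>
#include <cstddef>
#include <cstdlib>

// A reference-counted message body: allocated once, every pending send to a
// subscriber holds a reference, and the memory is released only when the
// last reference is dropped.
struct RefBlock {
    std::atomic<int> refcount{1};
    std::size_t size = 0;
    char* body = nullptr;
};

RefBlock* AllocRefBlock(std::size_t size) {
    RefBlock* b = new RefBlock;
    b->size = size;
    b->body = static_cast<char*>(std::malloc(size));   // a real pool pre-allocates these
    return b;
}

void RefcountInc(RefBlock* b) { b->refcount.fetch_add(1, std::memory_order_relaxed); }

void RefcountDec(RefBlock* b) {
    // Interlocked decrement; free only when the last pending send has completed.
    if (b->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        std::free(b->body);
        delete b;
    }
}

int main() {
    RefBlock* msg = AllocRefBlock(512);
    RefcountInc(msg);    // a second subscriber's send queues the same body
    RefcountDec(msg);    // first send completed
    RefcountDec(msg);    // last send completed: the block is freed exactly once
}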

Page 18

Class Diagram (Application)

General::Module: +Init(), +Cleanup(), +ReloadConfig(), +Start(), +ThreadMain() | -Parent : Module, -CompletionPort : int, -NumThreads : int
General::Application: +ShowHelp(), +HandleCommand() | -ServerType : int
General::Config: +ReloadFile(), +GetStringParam(), +GetDwordParam(), +GetBooleanParam(), +GetIpParam() | -Filename : char*, -Values : struct
General::Log: +Log(), +DebugLog(), +Assert() | -Debug : bool
General::Stats: +UpdateValue(), +GetStats(), +WriteStatsToFile(), +RequestStatsFromAllServers(), +PrintStatsString() | -Values : struct
Pool: +AllocBlock(), +AddFreeBlocks() | -SizeLists : struct
Memory::PoolBlock: +operator new(), +operator delete() | -BlockSize : int
Memory::RefBlock: +RefcountInc(), +RefcountDec(), +Free() | -Refcount : int
Memory::HeaderBlock: -Header : struct
Memory::BodyBlock: -Size : int, -Body : char[]

Server applications: CLBSpecific::ClbServer, CSSpecific::CsServer, TSSpecific::TsServer

Page 19

Class Diagram (TS, CS)

Shared by both diagrams (together with General::Module and General::Application from the Application diagram):
IOStack::IO: +StartServer(), +ConnectSocket(), +DoSendOperation(), +DoReceiveOperation() | -ServerSocket : int
IOStack::TcpIO: +StartServer(), +ConnectSocket(), +DoSendOperation(), +DoReceiveOperation()
IOStack::UdpIO: +StartServer(), +DoSendOperation(), +DoReceiveOperation()
IOStack::ProtocolHandler: +RegisterMessageHandler(), +SendMessage(), +HandleSendCompleted(), +HandleServerStarted(), +HandleNewConnection(), +HandleHeaderReceived(), +HandleBodyReceived(), +HandleDatagramReceived(), +HandleIOFailure() | -MessageHandlers : MessageHandler[]
IOStack::MessageHandler: +SendMessage(), +HandleReceivedMessage(), +HandleNewPeer(), +HandleIOFailure(), +CreateMessage(), +CreateMessageTail() | -SenderId : long
MsgHandlers::PingHandler: +HandleReceivedMessage(), +CreateMessage(), +HandleNewPeer(), +HandleIOFailure()
MsgHandlers::UserAckHandler: +HandleReceivedMessage(), +CreateMessage()
MsgHandlers::TsNotifyHandler: +HandleReceivedMessage(), +CreateMessage()
CSSpecific::FindTs: +CalcTopicHash(), +GetTsFromHash(), +GetTsFromTopic(), +GetNextTsInRing()
General::ServerDb: +GetNumRooms(), +GetNumServersInRoom(), +GetServerByIndex() | -Servers : struct, -PingInterval : int
General::PeerDb: +GetIdFromPeer(), +GetAddressFromPeer(), +Add(), +Remove(), +GetPeerFromId(), +GetPeerFromAddress()
Types::Server: -RoomIndex : int, -IndexInRoom : int
Types::Peer: -Socket : int, -Address : struct, -Id : int, -State : int

TS-specific:
TSSpecific::TsServer
MsgHandlers::StatsHandler: +HandleReceivedMessage(), +CreateMessage()
MsgHandlers::CacheHandler: +HandleReceivedMessage(), +CreateMessage()
MsgHandlers::TsRequestHandler: +HandleReceivedMessage(), +CreateMessage()
MsgHandlers::ReplicationHandler: +HandleReceivedMessage(), +CreateMessage()
TSSpecific::TsLoadBalancer: +ChooseTs()
TSSpecific::TopicCache: +SearchCache(), +UpdateCache(), +RemoveCache(), +Print() | -Hashtable : struct
TSSpecific::TopicDatabase: +IsTopicSelfOwned(), +AddTopic(), +RemoveTopic(), +SubscribeUserToTopic(), +UnsubscribeUserToTopic(), +AddTopicReplica(), +GetTopicSubscriberList(), +GetLoad(), +Print() | -SqlConnection : struct
TSSpecific::UserDatabase: +UpdateUser(), +GetUser(), +RemoveUser(), +GetTotalUsers(), +Print() | -SqlConnection : struct

CS-specific:
CSSpecific::CsServer
MsgHandlers::LoginHandler: +HandleReceivedMessage(), +CreateMessage(), +HandleNewPeer(), +HandleIOFailure()
MsgHandlers::LoadHandler: +HandleReceivedMessage(), +CreateMessage(), +HandleNewPeer(), +HandleIOFailure()
MsgHandlers::CsRequestHandler: +HandleReceivedMessage(), +CreateMessage()
MsgHandlers::CsNotifyHandler: +HandleReceivedMessage(), +CreateMessage()
CSSpecific::ClientDb: +Add(), +Remove(), +AllocClient(), +FreeClient(), +GetLoad() | -ClientPingInterval : int, -Clients : Client[]
Types::Client: -GeoParam : int

Page 20

Stress Testing

(Chart "client load test 2": publish-notify turnaround time in ms, on an axis from 0 to 1800, versus client load from 1K to 31K clients.)

Measure publish-notify turnaround time

1 ms resolution using the multimedia (MM) timer, averaged over 30 measurements

Increasing client and/or topic load; several room topologies examined. Results:

• Exponential-like climb
• TS increase: better times
• CS increase: better max clients, time not improved