Scalability: Broad Strokes - Best Practices
Definition
● Concurrency a.k.a. the number of simultaneous requests, and latency
● Throughput a.k.a. the total number of items processed
● Extensibility - designing the application for the ability to add new features etc.
● We'll mostly be talking about the first two.
Concurrency & Performance
● Scalability is measured as the number of requests/users an application can support without degrading performance.
● Performance is mostly a measure of individual request processing time.
Handling Scale
● Throttling
● Cache
● Stateful vs. stateless
● Asynchronous vs. synchronous
● Service-oriented design
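Throttling from the list above is often implemented as a token bucket. A minimal sketch in Java; the `TokenBucket` class name and its capacity/rate parameters are invented for illustration, not taken from the slides:

```java
// A minimal token-bucket throttle sketch. Tokens refill continuously at a
// fixed rate; each request consumes one token or is rejected.
class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    // Returns true if the request may proceed, false if it should be throttled.
    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

The bucket absorbs short bursts up to its capacity while capping the sustained rate, which is usually what a front-tier throttle wants.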
Where (Multi-tiered)
● At the client (browser)
○ HTTP headers
○ Asynchronous calls
○ Local DB
● At the server (web tier/application tier)
○ Cache -- distributed
○ Stateless
○ Asynchronous
● DB
○ CAP theorem
Client
● HTTP headers
○ Caching headers (Pragma, Cache-Control) not only drive caching in browsers but also help intelligent proxies.
○ YSlow/Google PageSpeed guidelines are always useful.
○ ETags and long expiry times are very good practices.
○ Sprites and image maps
● Ajax is good for scalability but may sometimes cause performance issues.
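The ETag practice above can be sketched in Java. The `ETagDemo` class, the MD5-based tag and the status helper are invented for illustration; the `If-None-Match`/304 semantics are standard HTTP:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of ETag-based conditional responses: the server hashes the body,
// and a matching If-None-Match lets it answer 304 with no body at all.
class ETagDemo {
    // A strong ETag derived from the response body (MD5 here for brevity).
    static String etagFor(String body) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("\"");
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.append('"').toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // 304 Not Modified tells the browser to reuse its cached copy.
    static int statusFor(String ifNoneMatch, String currentEtag) {
        return currentEtag.equals(ifNoneMatch) ? 304 : 200;
    }
}
```

A 304 saves the full response body on the wire, which is where the scalability gain comes from.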
Client-Server Network
● Always compress responses.
● Even on JSON the bandwidth gains are great.
● In server-to-server calls, consider binary or otherwise more efficient protocols.
● Even on the web, network-layer protocols like SPDY are interesting.
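The bandwidth gain from compressing JSON can be demonstrated with the JDK's `GZIPOutputStream`. The sample payload below is invented, and exact ratios will vary with the data:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Gzip a repetitive JSON payload and compare sizes; JSON's repeated field
// names make it compress very well.
class GzipDemo {
    static byte[] gzip(byte[] raw) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(raw);
            }
            return out.toByteArray();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < 1000; i++) {
            json.append("{\"id\":").append(i).append(",\"status\":\"ok\"},");
        }
        json.append("]");
        byte[] raw = json.toString().getBytes(StandardCharsets.UTF_8);
        System.out.println(raw.length + " -> " + GzipDemo.gzip(raw).length + " bytes");
    }
}
```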
Server -- Numbers everyone should know
● http://static.googleusercontent.com/media/research.google.com/en//people/jeff/stanford-295-talk.pdf
● Writes are heavy.
● Disk seeks are heavier than a network round trip combined with a memory read.
● Global shared data is expensive if locking is involved.
● Reads do not need to be transactional, just consistent.
● Eventual consistency is useful.
Server - Cache (Low latency)
● What to cache
○ Complete HTML responses
○ Output from the database
● Cache strategy is determined by the access pattern:
○ Is it a broadcast?
○ A multicast?
○ A unicast?
● Cache works best for broadcasts.
● Distributed caching with consistent hashing works very well.
● The pitfall is cache purging.
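The consistent hashing mentioned above for distributed caching can be sketched with a `TreeMap`-based ring. The class name, the node names in the usage, and the virtual-node count are illustrative; production rings use many virtual nodes and a stronger hash:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: each node owns many virtual points on the
// ring, and a key maps to the first virtual point clockwise from its hash.
// Adding/removing a node only remaps the keys that node owned.
class ConsistentHashRing {
    private static final int VNODES = 100;
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) {
        for (int i = 0; i < VNODES; i++)
            ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < VNODES; i++)
            ring.remove(hash(node + "#" + i));
    }

    // Walk clockwise to the first virtual node at or after the key's hash;
    // wrap around to the start of the ring if we fall off the end.
    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        // Cheap bit-mixing of String.hashCode; use MD5/murmur in practice.
        int h = s.hashCode();
        h ^= (h >>> 16);
        return h * 0x85ebca6b;
    }
}
```

This is why consistent hashing limits the cache-purge blast radius: losing one node invalidates only its own slice of keys, not the whole cache.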
Server (Concurrency)
● Sequential processing leaves the CPU and other resources idle.
● Write parallelism is very important.
● But shared globals are heavy, hence a trade-off.
● In the case of Java, an understanding of the JMM (Java Memory Model) is necessary.
● Amdahl's Law helps in determining the maximum gain that can be achieved with a parallel implementation.
● When parallelizing, even a small fraction of sequential work can cause a loss of throughput.
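Amdahl's Law states that with a parallel fraction p and n workers, speedup = 1 / ((1 - p) + p / n). A tiny sketch, with illustrative sample fractions, showing how even 5% sequential work caps the speedup at 20x no matter how many workers are added:

```java
// Amdahl's Law: the sequential fraction (1 - p) bounds the achievable
// speedup at 1 / (1 - p), regardless of worker count n.
class Amdahl {
    static double speedup(double parallelFraction, int workers) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }
}
```

For example, `speedup(0.95, 1_000_000)` is just under 20, while a fully parallel workload (`speedup(1.0, 8)`) scales linearly to 8.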
Server (State ? full : less)
● Given that shared access is expensive, keeping state on the server is heavy.
● Sessions, if available in shared memory, are great.
● No sessions and share-nothing works best.
● Even a cache is better.
● Generally, stateless code is modular, easier to unit test and easier to profile.
● Keep data on the function stack rather than the heap.
● Stateless helps in scale-out.
Server - Synchronous/Asynchronous
● Waiting on I/O, network connections and DB queries is bad.
● How about a "query of death" on a write?
● Writes, unless very small, should be kept asynchronous.
● This helps with parallelization.
● Reliable queues can improve latency.
● Idempotent code helps in avoiding many pitfalls.
● Generally, asynchrony is achieved via
○ Queue/topic-based infrastructure
■ Good for event processing and propagation of events
○ Incremental batches
● Async I/O servers: Node.js/nginx/Apache event MPM
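The asynchronous, idempotent write path above can be sketched with a plain `ExecutorService`. The `WriteQueue` class and its idempotency-key dedupe are invented for illustration; a real system would use a durable queue instead of an in-process pool:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Asynchronous write path: the caller enqueues and returns immediately;
// workers apply the write, and a duplicate idempotency key (e.g. from a
// retried request) is applied only once.
class WriteQueue {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final Set<String> applied = ConcurrentHashMap.newKeySet();
    final AtomicInteger applyCount = new AtomicInteger();

    // Non-blocking for the caller; the write runs on a worker thread.
    void submit(String idempotencyKey, Runnable write) {
        workers.submit(() -> {
            if (applied.add(idempotencyKey)) { // first occurrence only
                write.run();
                applyCount.incrementAndGet();
            }
        });
    }

    void drain() {
        workers.shutdown();
        try {
            workers.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the write is idempotent, the queue can safely redeliver on failure, which is what makes "reliable queues" and retries compose without double-applying effects.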
Debugging for Scale
● Profile
○ In Java
■ GC logs
■ JVisualVM
■ Thread and memory dumps
○ GNU/Linux
■ hprof
■ strace
■ gdb
■ system utilities
Scale - Horizontal vs. Vertical
● For a stateless, asynchronous, idempotent and multithreaded application, horizontal scaling works very well.
● Easier to understand with storage, a.k.a. databases.
Database
● Which type of DBMS?
○ RDBMS
○ Keyspace-based multi-column family
○ Document-based
○ Graph
○ Any other NoSQL?
○ Solr and Elasticsearch
Database scale-out limitations
● CAP theorem
○ Consistency
○ Availability
○ Partition tolerance
○ Not all three available simultaneously
● Eventual consistency is the preferred choice.
RDBMS
● Always query against an index.
● For an RDBMS, a query of death is a death knock.
● Generally, writing once to a master and reading from multiple slaves works better.
● To normalize or not?
○ Normalize for extensibility.
○ Use Solr/NoSQL for read scale.
● One complex multi-table join or multiple simple queries? (performance/scale)
NoSQL
● Several options, ranging from document databases to multi-column families.
● We mostly use
○ Mongo
○ Cassandra
○ Neo4j (in some cases)
○ Titan
● These provide very high throughput with manageable clustering/sharding.
Mongo (iBeat)
● Increasing data volumes threaten scalability and availability.
● Though search is available, it's not very efficient.
● The limit of a single document is 16 MB.
● Repairing the DB and reindexing do impact performance.
Mongo (iBeat ..)
● Mongo sharding as a solution.
● Data volume per replica set decreased.
● For the document size limit, GridFS was used.
● With a smaller document volume, the overhead of indexes etc. reduced.
● But sharding itself, given the large amount of data, was carried out over a long period of time.
Big Data
● Normally associated with data so large and complex that traditional data management/visualization tools fail to capture, curate or process it.
● The current definition covers 3 aspects, a.k.a. the 3 Vs:
○ Volume
○ Velocity
○ Variety
● General usage is in
○ Genetic algorithms
○ Machine learning
○ Natural language processing
○ Time series analysis (a.k.a. attribution analysis)
○ Visualizations
○ and many more ...
Big Data
● Our usage is
○ Analytics
○ User preference, personalization, profiling
○ Recommendation
○ Decision support systems
● The standard, well-known open-source ecosystems
○ Hadoop
○ Event processors/stream engines, e.g. Storm, Spark, S4
Big Data (Hadoop ..)
● Hadoop - originally a component of Nutch, now the biggest driver of big data technologies.
● MapReduce - a mechanism/framework to run massively parallel systems, published originally by Google.
● MapReduce - the trick is distributed sorting.
● New languages for statistical computation, e.g. R.
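The map/shuffle/reduce shape can be illustrated with the classic word count, condensed into in-memory Java streams (not the actual Hadoop APIs; the `WordCount` class is invented for illustration). The grouping step stands in for the shuffle, which is where the "distributed sorting" happens in a real cluster:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// In-memory MapReduce word count: map emits one token per word, the
// shuffle groups tokens by key, and reduce sums each group's count.
class WordCount {
    static Map<String, Long> count(String... documents) {
        return Arrays.stream(documents)
                .flatMap(doc -> Arrays.stream(doc.toLowerCase().split("\\W+"))) // map phase
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // shuffle + reduce
    }
}
```

On a cluster, each of these stages runs on many machines, with the framework sorting intermediate (word, 1) pairs by key so each reducer sees one word's records together.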
Hadoop stack components
Image borrowed from http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop-2013-part-two-projects/
Big Data - Real-time analysis
● While MapReduce is a great throughput solution, it doesn't help with real-time or near-real-time processing.
● Ecosystems are evolving, coupled with either MapReduce or HDFS.
● Storm/Spark Streaming for augmenting MapReduce-based computations.
Most important
● Ability to determine the impact of changes
● Seamless deployments
Questions?