Scalability: Broad Strokes - Best Practices
Definition
● Concurrency a.k.a. the number of simultaneous requests, and latency
● Throughput a.k.a. the total number of items processed
● Extensibility - designing the application for the ability to add new features etc.
● We'll mostly be talking about the first two.
Concurrency & Performance
● Scalability is measured as the number of requests/users an application can support without degrading performance.
● Performance is mostly a measure of individual request processing time.
Handling Scale
● Throttling
● Cache
● Stateful vs. stateless
● Asynchronous vs. synchronous
● Service-oriented design
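Throttling from the list above is often implemented as a token bucket. A minimal sketch in Java; the `TokenBucket` class name and its capacity/rate parameters are invented for illustration, not taken from the slides:

```java
// A minimal token-bucket throttle sketch. Tokens refill continuously at a
// fixed rate; each request consumes one token or is rejected.
class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    // Returns true if the request may proceed, false if it should be throttled.
    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

The bucket absorbs short bursts up to its capacity while capping the sustained rate, which is usually what a front-tier throttle wants.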
Where (Multi-tiered)
● At the client (browser)
○ HTTP headers
○ Asynchronous calls
○ Local DB
● At the server (web tier/application tier)
○ Cache -- distributed
○ Stateless
○ Asynchronous
● DB
○ CAP theorem
Client
● HTTP headers
○ Caching headers (Pragma, Cache-Control) not only drive caching in browsers but also help intelligent proxies.
○ YSlow/Google PageSpeed guidelines are always useful.
○ ETags and long expiry times are very good practices.
○ Sprites and image maps
● Ajax is good for scalability but may sometimes cause performance issues.
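The ETag practice above can be sketched in Java. The `ETagDemo` class, the MD5-based tag and the status helper are invented for illustration; the `If-None-Match`/304 semantics are standard HTTP:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of ETag-based conditional responses: the server hashes the body,
// and a matching If-None-Match lets it answer 304 with no body at all.
class ETagDemo {
    // A strong ETag derived from the response body (MD5 here for brevity).
    static String etagFor(String body) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(body.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder("\"");
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.append('"').toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // 304 Not Modified tells the browser to reuse its cached copy.
    static int statusFor(String ifNoneMatch, String currentEtag) {
        return currentEtag.equals(ifNoneMatch) ? 304 : 200;
    }
}
```

A 304 saves the full response body on the wire, which is where the scalability gain comes from.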
Client-Server Network
● Always compress responses.
● Even on JSON the bandwidth gains are great.
● In server-to-server calls, consider binary or otherwise more efficient protocols.
● Even on the web, network-layer protocols like SPDY are interesting.
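The bandwidth gain from compressing JSON can be demonstrated with the JDK's `GZIPOutputStream`. The sample payload below is invented, and exact ratios will vary with the data:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Gzip a repetitive JSON payload and compare sizes; JSON's repeated field
// names make it compress very well.
class GzipDemo {
    static byte[] gzip(byte[] raw) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(raw);
            }
            return out.toByteArray();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < 1000; i++) {
            json.append("{\"id\":").append(i).append(",\"status\":\"ok\"},");
        }
        json.append("]");
        byte[] raw = json.toString().getBytes(StandardCharsets.UTF_8);
        System.out.println(raw.length + " -> " + GzipDemo.gzip(raw).length + " bytes");
    }
}
```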
Server -- Numbers everyone should know
● http://static.googleusercontent.com/media/research.google.com/en//people/jeff/stanford-295-talk.pdf
● Writes are heavy.
● Disk seeks are heavier than a network round trip combined with a memory read.
● Global shared data is expensive if locking is involved.
● Reads do not need to be transactional, just consistent.
● Eventual consistency is useful.
Server - Cache (Low latency)
● What to cache
○ Complete HTML responses
○ Output from the database
● Cache strategy is determined by the access pattern:
○ Is it a broadcast?
○ A multicast?
○ A unicast?
● Cache works best for broadcasts.
● Distributed caching with consistent hashing works very well.
● The pitfall is cache purging.
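The consistent hashing mentioned above for distributed caching can be sketched with a `TreeMap`-based ring. The class name, the node names in the usage, and the virtual-node count are illustrative; production rings use many virtual nodes and a stronger hash:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring: each node owns many virtual points on the
// ring, and a key maps to the first virtual point clockwise from its hash.
// Adding/removing a node only remaps the keys that node owned.
class ConsistentHashRing {
    private static final int VNODES = 100;
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node) {
        for (int i = 0; i < VNODES; i++)
            ring.put(hash(node + "#" + i), node);
    }

    void removeNode(String node) {
        for (int i = 0; i < VNODES; i++)
            ring.remove(hash(node + "#" + i));
    }

    // Walk clockwise to the first virtual node at or after the key's hash;
    // wrap around to the start of the ring if we fall off the end.
    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        // Cheap bit-mixing of String.hashCode; use MD5/murmur in practice.
        int h = s.hashCode();
        h ^= (h >>> 16);
        return h * 0x85ebca6b;
    }
}
```

This is why consistent hashing limits the cache-purge blast radius: losing one node invalidates only its own slice of keys, not the whole cache.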
Server (Concurrency)
● Sequential processing leaves the CPU and other resources idle.
● Write parallelism is very important.
● But shared globals are heavy, hence a trade-off.
● In the case of Java, an understanding of the JMM (Java Memory Model) is necessary.
● Amdahl's Law helps in determining the maximum gain that can be achieved with a parallel implementation.
● When parallelizing, even a small fraction of sequential work can cause a loss of throughput.
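Amdahl's Law states that with a parallel fraction p and n workers, speedup = 1 / ((1 - p) + p / n). A tiny sketch, with illustrative sample fractions, showing how even 5% sequential work caps the speedup at 20x no matter how many workers are added:

```java
// Amdahl's Law: the sequential fraction (1 - p) bounds the achievable
// speedup at 1 / (1 - p), regardless of worker count n.
class Amdahl {
    static double speedup(double parallelFraction, int workers) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }
}
```

For example, `speedup(0.95, 1_000_000)` is just under 20, while a fully parallel workload (`speedup(1.0, 8)`) scales linearly to 8.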
Server (State ? full : less)
● Given that shared access is expensive, keeping state on the server is heavy.
● Sessions, if available in shared memory, are great.
● No sessions and share-nothing works best.
● Even a cache is better.
● Generally, stateless code is modular, easier to unit test and easier to profile.
● Keep data on the function stack rather than the heap.
● Stateless helps in scale-out.
Server - Synchronous/Asynchronous
● Waiting on I/O, network connections and DB queries is bad.
● How about a "query of death" on a write?
● Writes, unless very small, should be kept asynchronous.
● This helps with parallelization.
● Reliable queues can improve latency.
● Idempotent code helps in avoiding many pitfalls.
● Generally, asynchrony is achieved via
○ Queue/topic-based infrastructure
■ Good for event processing and propagation of events
○ Incremental batches
● Async I/O servers: Node.js/nginx/Apache event MPM
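The asynchronous, idempotent write path above can be sketched with a plain `ExecutorService`. The `WriteQueue` class and its idempotency-key dedupe are invented for illustration; a real system would use a durable queue instead of an in-process pool:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Asynchronous write path: the caller enqueues and returns immediately;
// workers apply the write, and a duplicate idempotency key (e.g. from a
// retried request) is applied only once.
class WriteQueue {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);
    private final Set<String> applied = ConcurrentHashMap.newKeySet();
    final AtomicInteger applyCount = new AtomicInteger();

    // Non-blocking for the caller; the write runs on a worker thread.
    void submit(String idempotencyKey, Runnable write) {
        workers.submit(() -> {
            if (applied.add(idempotencyKey)) { // first occurrence only
                write.run();
                applyCount.incrementAndGet();
            }
        });
    }

    void drain() {
        workers.shutdown();
        try {
            workers.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the write is idempotent, the queue can safely redeliver on failure, which is what makes "reliable queues" and retries compose without double-applying effects.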
Debugging for Scale
● Profile
○ In Java
■ GC logs
■ JVisualVM
■ Thread and memory dumps
○ GNU/Linux
■ hprof
■ strace
■ gdb
■ system utilities
Scale - Horizontal vs. Vertical
● For a stateless, asynchronous, idempotent and multithreaded application, horizontal scaling works very well.
● Easier to understand with storage, a.k.a. databases.
Database
● Which type of DBMS?
○ RDBMS
○ Keyspace-based multi-column family
○ Document-based
○ Graph
○ Any other NoSQL?
○ Solr and Elasticsearch
Database scale-out limitations
● CAP theorem
○ Consistency
○ Availability
○ Partition tolerance
○ Not all three available simultaneously
● Eventual consistency is the preferred choice.
RDBMS
● Always query against an index.
● For an RDBMS, a query of death is a death knock.
● Generally, writing once to a master and reading from multiple slaves works better.
● To normalize or not?
○ Normalize for extensibility.
○ Use Solr/NoSQL for read scale.
● One complex multi-table join or multiple simple queries? (performance/scale)
NoSQL
● Several options, ranging from document databases to multi-column families.
● We mostly use
○ Mongo
○ Cassandra
○ Neo4j (in some cases)
○ Titan
● These provide very high throughput with manageable clustering/sharding.
Mongo (iBeat)
● Increasing data volumes threaten scalability and availability.
● Though search is available, it's not very efficient.
● The limit of a single document is 16 MB.
● Repairing the DB and reindexing do impact performance.
Mongo (iBeat ..)
● Mongo sharding as a solution.
● Data volume per replica set decreased.
● For the document size limit, GridFS was used.
● With a smaller document volume, the overhead of indexes etc. reduced.
● But sharding itself, given the large amount of data, was carried out over a long period of time.
Big Data
● Normally associated with data so large and complex that traditional data management/visualization tools fail to capture, curate or process it.
● The current definition covers 3 aspects, a.k.a. the 3 Vs:
○ Volume
○ Velocity
○ Variety
● General usage is in
○ Genetic algorithms
○ Machine learning
○ Natural language processing
○ Time series analysis (a.k.a. attribution analysis)
○ Visualizations
○ and many more ...
Big Data
● Our usage is
○ Analytics
○ User preference, personalization, profiling
○ Recommendation
○ Decision support systems
● The standard, well-known open-source ecosystems
○ Hadoop
○ Event processors/stream engines, e.g. Storm, Spark, S4
Big Data (Hadoop ..)
● Hadoop - originally a component of Nutch, now the biggest driver of big data technologies.
● MapReduce - a mechanism/framework to run massively parallel systems, published originally by Google.
● MapReduce - the trick is distributed sorting.
● New languages for statistical computation, e.g. R.
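The map/shuffle/reduce shape can be illustrated with the classic word count, condensed into in-memory Java streams (not the actual Hadoop APIs; the `WordCount` class is invented for illustration). The grouping step stands in for the shuffle, which is where the "distributed sorting" happens in a real cluster:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// In-memory MapReduce word count: map emits one token per word, the
// shuffle groups tokens by key, and reduce sums each group's count.
class WordCount {
    static Map<String, Long> count(String... documents) {
        return Arrays.stream(documents)
                .flatMap(doc -> Arrays.stream(doc.toLowerCase().split("\\W+"))) // map phase
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // shuffle + reduce
    }
}
```

On a cluster, each of these stages runs on many machines, with the framework sorting intermediate (word, 1) pairs by key so each reducer sees one word's records together.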
Hadoop stack components
Image borrowed from http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop-2013-part-two-projects/
Big Data - Real-time analysis
● While MapReduce is a great throughput solution, it doesn't help with real-time or near-real-time processing.
● Ecosystems are evolving, coupled with either MapReduce or HDFS.
● Storm/Spark Streaming for augmenting MapReduce-based computations.
Most important
● Ability to determine the impact of changes
● Seamless deployments
Questions?